Perl6::Bible::A05(3)  User Contributed Perl Documentation Perl6::Bible::A05(3)

NAME
       Apocalypse_05 - Pattern Matching

AUTHOR
       Larry Wall <larry@wall.org>

VERSION
	 Maintainer: Larry Wall <larry@wall.org>
	 Date: 4 Jun 2002
	 Last Modified: 9 Dec 2004
	 Number: 5
	 Version: 5

       This is the Apocalypse on Pattern Matching, generally having to do with
       what we call "regular expressions", which are only marginally related
       to real regular expressions. Nevertheless, the term has grown with the
       capabilities of our pattern matching engines, so I'm not going to try
       to fight linguistic necessity here. I will, however, generally call
       them "regexes" (or "regexen", when I'm in an Anglo-Saxon mood).

       Here are the RFCs covered in this Apocalypse. PSA stands for "problem,
       solution, acceptance", my private rating of how this RFC will fit into
       Perl 6. Doubtless I have misclassified your RFC, though the other
       ratings are pretty accurate. ":-)"

	   RFC	 PSA   Title
	   ---	 ---   -----
	   072	 aaa   Variable-length lookbehind.
	   093	 abb   Regex: Support for incremental pattern matching
	   110	 bbb   counting matches
	   112	 acc   Assignment within a regex
	   135	 acr   Require explicit m on matches, even with ?? and // as delimiters.
	   144	 aaa   Behavior of empty regex should be simple
	   145	 acr   Brace-matching for Perl Regular Expressions
	   150	 acc   Extend regex syntax to provide for return of a hash of matched subpatterns
	   156	 aaa   Replace first match function (C<?...?>) with a flag to the match command.
	   164	 ccr   Replace =~, !~, m//, s///, and tr// with match(), subst(), and trade()
	   165	 acc   Allow Variables in tr///
	   166	 abc   Alternative lists and quoting of things
	   191	 bbc   smart container slicing
	   197	 cdr   Numeric Value Ranges In Regular Expressions
	   198	 adr   Boolean Regexes
	   261	 dbr   Pattern matching on perl values
	   274	 acc   Generalised Additions to Regexs
	   276	 aaa   Localising Paren Counts in qr()s.
	   308	 dar   Ban Perl hooks into regexes
	   316	 bcr   Regex modifier for support of chunk processing and prefix matching
	   317	 aaa   Access to optimisation information for regular expressions
	   331	 acc   Consolidate the $1 and \1 notations
	   332	 abc   Regex: Make /$/ equivalent to /\z/ under the '/s' modifier
	   348	 bcc   Regex assertions in plain Perl code
	   360	 acb   Allow multiply matched groups in regexes to return a listref of all matches
	   361	 abb   Simplifying split()

       Interestingly, there were no withdrawn RFCs for pattern matching. That
       means either that there were no cork-brained ideas proposed, or that
       regex culture is so cork-brained already that the cork-brained ideas
       blend right in. I know where my money is... ":-)"

       In fact, regular expression culture is a mess, and I share some of the
       blame for making it that way. Since my mother always told me to clean
       up my own messes, I suppose I'll have to do just that.

       For prior Apocalypses, I've used the RFCs as a springboard for
       discussion of my thinking, but this one is special, because none of the
       RFCs were courageous enough (or foolhardy enough) to look at the big
       picture and propose radical change where it's needed. But Perl has
       often been tagged as a language in which it's easy to write programs
       that are difficult to read, and it's no secret that regular expression
       syntax has been the chief culprit. Funny that other languages have
       been borrowing Perl's regular expressions as fast as they can...

       That's primarily because we took several large steps in Perl 5 to
       enhance regex capabilities. We took one large step forwards with the
       "/x" option, which allowed whitespace between regex tokens. But we also
       took several large steps sideways with the "(?...)" extension syntax. I
       call them steps sideways, but they were simultaneously steps forward in
       terms of functionality and steps backwards in terms of readability. At
       the time, I rationalized it all in the name of backward compatibility,
       and perhaps that approach was correct for that time and place. It's not
       correct now, since the Perl 6 approach is to break everything that
       needs breaking all at once.

       And unfortunately, there's a lot of regex culture that needs breaking.

       Regex culture has gone wrong in a variety of ways, but it's not my
       intent to assign blame--there's plenty of blame to go around, and
       plenty of things that have gone wrong that are nobody's fault in
       particular. For example, it's nobody's fault that you can't
       realistically complement a character set anymore. It's just an accident
       of the way Unicode defines combining characters. The whole notion of
       character classes is mutating, and that will have some bearing on the
       future of regular expression syntax.

       Given all this, I need to warn you that this Apocalypse is going to be
       somewhat radical. We'll be proposing changes to certain "sacred"
       features of regex culture, and this is guaranteed to result in future
       shock for some of our more conservative citizens. Do not be alarmed. We
       will provide ways for you to continue programming in old-fashioned
       regular expressions if you desire. But I hope that once you've thought
       about it a little and worked through some examples, you'll like most of
       the changes we're proposing here.

       So although the RFCs did contribute greatly to my thinking for this
       Apocalypse, I'm going to present my own vision first for where regex
       culture should go, and then analyze the RFCs with respect to that
       vision.

       First, let me enumerate some of the things that are wrong with current
       regex culture.

       ·   Too much history

       ·   Too compact and "cute"

       ·   Poor Huffman coding

       ·   Too much reliance on too few metacharacters

       ·   Different things look too similar

       ·   Poor end-weight design

       ·   Too much reliance on modifiers

       ·   Too many special rules and boobytraps

       ·   Backreferences not useful enough

       ·   Too hard to match a literal string

       ·   Two-level interpretation is problematic

       ·   Too little abstraction

       ·   Little support for named captures

       ·   Difficult to use nested patterns

       ·   Little support for grammars

       ·   Inability to define variants

       ·   Poor integration with "real" language

       ·   Missing backtracking controls

       ·   Difficult to define assertions

       I'm sure there are other problems, but that'll do for starters. Let's
       look at each of these in more detail.

   Too much history
       Most of the other problems stem from trying to deal with a rich
       history. Now there's nothing wrong with history per se, but those of us
       who are doomed to repeat it find that many parts of history are
       suboptimal and contradictory. Perl has always tried to err on the side
       of incorporating as much history as possible, and sometimes Perl has
       succeeded in that endeavor.

       Cultural continuity has much to be said for it, but what can you do
       when the culture you're trying to be continuous with is itself
       discontinuous? As it says in Ecclesiastes, there's a time to build up,
       and a time to tear down. The first five versions of Perl mostly built
       up without tearing down, so now we're trying to redress that omission.

   Too compact and "cute"
       Regular expressions were invented by computational linguists who love
       to write examples like "/aa*b*(cd)*ee/". While these are conducive to
       reasoning about pattern matching in the abstract, they aren't so good
       for pattern matching in the concrete. In real life, most atoms are
       longer than ""a"" or ""b"". In real life, tokens are more recognizable
       if they are separated by whitespace. In the abstract, "/a+/" is
       reducible to "/aa*/". In real life, nobody wants to repeat a 15
       character token merely to satisfy somebody's idea of theoretical
       purity. So we have shortcuts like the "+" quantifier to say "one or
       more".

       Now, you may rightly point out that "+" is something we already have,
       and we already introduced "/x" to allow whitespace, so why is this
       bullet point here? Well, there's a lot of inertia in culture, and the
       problem with "/x" is that it's not the default, so people don't think
       to turn it on when it would probably do a lot of good. The culture is
       biased in the wrong direction. Whitespace around tokens should be the
       norm, not the exception. It should be acceptable to use whitespace to
       separate tokens that could be confused. It should not be considered
       acceptable to define new constructs that contain a plethora of
       punctuation, but we've become accustomed to constructs like "(?<=...)"
       and "(??{...})" and "[\r\n\ck\p{Zl}\p{Zp}]", so we don't complain.
       We're frogs who are getting boiled in a pot full of single-character
       morphemes, and we don't notice.
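
       Compare, for instance, a date matcher written both ways in Perl 5
       (the pattern itself is only an illustration):

           /(\d{4})-(\d\d)-(\d\d)/           # squint and parse
           / (\d{4}) - (\d\d) - (\d\d) /x    # same match, room to breathe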

   Poor Huffman coding
       Huffman invented a method of data compaction in which common characters
       are represented by a small number of bits, and rarer characters are
       represented by more bits. The principle is more general, however, and
       language designers would do well to pay attention to the "other" Perl
       slogan: Easy things should be easy, and hard things should be possible.
       However, we haven't always taken our own advice. Consider those two
       regex constructs we just saw:

	   (?<=...)
	   (??{...})

       Which one do you think is likely to be the most common in everyday use?
       Guess which one is longer...

       There are many examples of poor Huffman coding in current regexes.
       Consider these:

	   (...)
	   (?:...)

       Is it really the case that grouping is rarer than capturing? And by two
       gobbledygooky characters' worth? Likewise there are many constructs
       that are the same length that shouldn't be:

	   (?:...)
	   (?#...)

       Grouping is much more important than the ability to embed a comment.
       Yet they're the same length currently.

   Too much reliance on too few metacharacters
       A lot of our Huffman troubles came about because we were trying to
       shoehorn new capabilities into an old syntax without breaking anything.
       The "(?...)" construct succeeded at that goal, but it was new wine in
       old wineskins, as they say. More successful was the "*?" minimal
       matching hack, but it's still symptomatic of the problem that we only
       had three characters to choose from that would have worked at that
       point in the grammar. We've pretty nearly exhausted the available
       backslash sequences.

       The waterbed theory of linguistic complexity says that if you push down
       one place, it goes up somewhere else. If you arbitrarily limit yourself
       to too few metacharacters, the complexity comes out somewhere else. So
       it seems obvious to me that the way out of this mess is to grab a few
       more metacharacters. And the metacharacters I want to grab are...well,
       we'll see in a moment.

   Different things look too similar
       Consider these constructs:

	   (??{...})
	   (?{...})
	   (?#...)
	   (?:...)
	   (?i:...)
	   (?=...)
	   (?!...)
	   (?<=...)
	   (?<!...)
	   (?>...)
	   (?(...)...|...)

       These all look quite similar, but some of them do radically different
       things. In particular, the "(?<...)" does not mean the opposite of the
       "(?>...)". The underlying visual problem is the overuse of parentheses,
       as in Lisp. Programs are more readable if different things look
       different.

   Poor end-weight design
       In linguistics, the notion of end-weight is the idea that people tend
       to prefer sentences where the short things come first and the long
       things come last. That minimizes the amount of stuff you have to
       remember while you're reading or listening. Perl violates this with
       regex modifiers. It's okay when you say something short like this:

	   s/foo/bar/g

       But when you say something like we find in RFC 360:

	   while ($text =~ /name:\s*(.*?)\n\s*
			   children:\s*(?:(?@\S+)[, ]*)*\n\s*
			   favorite\ colors:\s*(?:(?@\S+)[, ]*)*\n/sigx) {...}

       it's not until you read the "/sigx" at the end that you know how to
       read the regex. This actually causes problems for the Perl 5 parser,
       which has to defer parsing the regular expression till it sees the
       "/x", because that changes how whitespace and comments work.

   Too much reliance on modifiers
       The "/s" modifier in the previous example changes the meaning of the
       "." metacharacter. We could, in fact, do away with the "/s" modifier
       entirely if we only had two different representations for "any
       character", one of which matched a newline, and one which didn't. A
       similar argument applies to the "/m" modifier. The whole notion of
       something outside the regex changing the meaning of the regex is just a
       bit bogus, not because we're afraid of context sensitivity, but because
       we need to have better control within the regex of what we mean, and in
       this case the context supplied outside the regex is not precise enough.
       (Perl 5 has a way to control the inner contexts, but it uses the self-
       obfuscating "(?...)" notation.)

       Modifiers that control how the regex is used as a whole do make some
       sense outside the regex. But they still have the end-weight problem.

   Too many special rules and boobytraps
       Without knowing the context, you cannot know what the pattern "//" will
       do. It might match a null string, or it might match the previously
       successful match.

       The "local" operator behaves differently inside regular expressions
       than it does outside.

       It's too easy to write a null pattern accidentally. For instance, the
       following will never match anything but the null string:

	   /
	   | foo
	   | bar
	   | baz
	   /x

       Even when it's intentional, it may not look intentional:

	   (a|b|c|)

       That's hard to read because it's difficult to make the absence of
       something visible.

       It's too easy to confuse the multiple meanings of dot. Or the multiple
       meanings of "^", and "$". And the opposite of "\A" is frequently not
       "\Z", but "\z". Tell me again, when do I say "\1", and when do I say
       "$1"? Why are they different?
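
       (The Perl 5 answer, for the record: backslash inside the pattern,
       dollar sign in the replacement.)

           s/(\w+) \1/$1/      # squash doubled words: \1 in the pattern,
                               # $1 in the replacement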

   Backreferences not useful enough
       Speaking of "\1", backreferences have a number of shortcomings. The
       first is actually getting ahold of the right backreference. Since
       captures are numbered from the beginning, you have to count, and you
       can easily count wrong. For many purposes it would be better if you
       could ask for the last capture, or the one before that. Or perhaps if
       there were a way to restart the numbering part way through...
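
       A Perl 5 pop quiz to illustrate:

           /(?:(a)|(b))(c)/    # quick: does the "c" end up in $2 or $3?

       (It's $3--but you had to stop and count.)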

       Another major problem with backreferences is that you can't easily
       modify one to search for a variant. Suppose you match an opening
       parenthesis, bracket, or curly. You'd like to search for everything up
       to the corresponding closing parenthesis, bracket, or curly, but
       there's no way to transmogrify the opening version to the closing
       version, because the backref search is hardwired independently of
       ordinary variable matching. And that's because Perl doesn't instantiate
       $1 soon enough. And that's because Perl relies on variable
       interpolation to get subexpressions into regexes. Which leads us to...

   Too hard to match a literal string
       Since regexes undergo an interpolation pass before they're compiled,
       anything you interpolate is forced to be treated as a regular
       expression. Often that's not what you want, so we have the klunky
       "\Q$string\E" mechanism to hide regex metacharacters. And that's
       because...

   Two-level interpretation is problematic
       The problem with "\Q$string\E" arises because of the fundamental
       mistake of using interpolation to build regexes instead of letting the
       regex control how it treats the variables it references. Regexes aren't
       strings, they're programs. Or, rather, they're strings only in the
       sense that any piece of program is a string. Just as you have to work
       to eval a string as a program, you should have to work to eval a string
       as a regular expression. Most people tend to expect a variable in a
       regular expression to match its contents literally. Perl violates that
       expectation. And because it violates that expectation, we can't make $1
       synonymous with "\1". And interpolated parentheses throw off the
       capture count, so you can't easily use interpolation to call subrules,
       so we invented "(??{$var})" to get around that. But then you can't
       actually get at the parentheses captured by the subrule. The
       ramifications go on and on.
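
       A Perl 5 illustration of both problems:

           $str = "(no parens)";
           /$str/              # the parens capture instead of matching
           /\Q$str\E/          # matches literally, klunkily

           $sub = '(\w+)';     # one hidden capture group, so...
           /$sub\s+(\d+)/      # ...this (\d+) is $2, not $1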

   Too little abstraction
       Historically, regular expressions were considered a very low-level
       language, a kind of glorified assembly language for the regex engine.
       When you're only dealing with ASCII, there is little need for
       abstraction, since the shortest way to say "[a-z]" is just that. With
       the advent of the eighth bit, we started getting into a little bit of
       trouble, and POSIX started thinking about names like "[:alpha:]" to
       deal with locale difficulties. But as with the problem of conciseness,
       the culture was still biased away from naming abstractly anything that
       could be expressed concretely.

       However, it's almost impossible to write a parser without naming
       things, because you have to be able to name the separate grammar rules
       so that the various rules can refer to each other.

       It's difficult to deal with any subset of Unicode without naming it.
       These days, if you see "[a-z]" in a program, it's probably an outright
       bug. It's much better to use a named character property so that your
       program will work right in areas that don't just use ASCII.

       Even where we do allow names, it tends to be awkward because of the
       cultural bias against it. To call a subrule by name in Perl 5 you have
       to say this:

	   (??{$rule})

       That has 4 or 5 more characters than it ought to. Dearth of abstraction
       produces bad Huffman coding.

   Little support for named captures
       Make that "no support" in Perl, unless you include assignment to a
       list. This is just a part of the bias against naming things. Instead we
       are forced to number our capturing parens and count. That works okay
       for the top-level regular expression, when we can do list assignment or
       assign $1 to $foo. But it breaks down as soon as you start trying to
       use nested regexes. It also breaks down when the capturing parentheses
       match more than once. Perl handles this currently by returning only the
       last match. This is slightly better than useless, but not by much.
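
       For instance, in Perl 5:

           my ($key, $val) = /(\w+)=(\w+)/;    # positional naming works at
                                               # the top level...
           "a b c" =~ /(?:(\w+)\s*)+/;         # ...but repeated captures
                                               # leave $1 holding only "c"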

   Difficult to use nested patterns
       For many of the reasons we've mentioned, it's difficult to make regexes
       refer to each other, and even if you do, it's almost impossible to get
       the nested information back out of them. And there are entire classes
       of parsing problems that are not solvable without recursive
       definitions.

   Little support for grammars
       Even if it were easier for regexes to refer to other regexes, we'd
       still have the problem that those other regexes aren't organized in any
       meaningful way. They might be off in variables that come and go at the
       whim of the surrounding context.

       When we have an organized system of parsing rules, we call it a
       grammar. One advantage of having a grammar is that you can optimize
       based on the assumption that the rules maintain their relationship to
       each other. For instance, if you think of grammar rules as a funny kind
       of subroutine, you can write an optimizer to inline some of the
       subrules--but only if you know the subrule is fixed in the grammar.

       Without support for grammar classes, there's no decent way to think of
       deriving one grammar from another. And if you can't derive one grammar
       from another, you can't easily evolve your language to handle new kinds
       of problems.

   Inability to define variants
       If we want to have variant grammars for Perl dialects, then what about
       regex dialects? Can regexes be extended either at compile time or at
       run time? Perl 5 has some rudimentary overloading magic for rewriting
       regex strings, but that's got the same problems as source filters for
       Perl code; namely that you just get the raw regex source text and have
       to parse it yourself. Once again the fundamental assumption is that a
       regex is a funny kind of string, existing only at the behest of the
       surrounding program.

       Do we think of regexes as a real, living language?

   Poor integration with "real" language
       Let's face it, in the culture of computing, regex languages are mostly
       considered second-class citizens, or worse. "Real" languages like C and
       C++ will exploit regexes, but only through a strict policy of
       apartheid. Regular expressions are our servants or slaves; we tell them
       what to do, they go and do it, and then they come back to say whether
       they succeeded or not.

       At the other extreme, we have languages like Prolog or Snobol where the
       pattern matching is built into the very control structure of the
       language. These languages don't succeed in the long run because
       thinking about that kind of control structure is rather difficult in
       actual fact, and one gets tired of doing it constantly. The path to
       freedom is not to make everyone a slave.

       However, I would like to think that there is some happy medium between
       those two extremes. Coming from a C background, Perl has historically
       treated regexes as servants. True, Perl has treated them as trusted
       servants, letting them move about in Perl society better than any other
       C-like language to date. Nevertheless, if we emancipate regexes to
       serve as co-equal control structures, and if we can rid ourselves of
       the regexist attitudes that many of us secretly harbor, we'll have a
       much more productive society than we currently do. We need to empower
       regexes with a sense of control (structure). It needs to be just as
       easy for a regex to call Perl code as it is for Perl code to call a
       regex.

   Missing backtracking controls
       Perl 5 started to give regexes more control of their own destiny with
       the "grab" construct, "(?>...)", which tells the regex engine that when
       it fails to match the rest of the pattern, it should not backtrack into
       the innards of the grab, but skip back to before it. That's a useful
       notion, but there are problems. First, the notation sucks, but you knew
       that already. Second, it doesn't go far enough. There's no way to
       backtrack out of just the current grouping. There's no way to backtrack
       out of just the current rule. Both of these are crucial for giving
       first-class status to the control flow of regexes.
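
       For instance, in Perl 5:

           "aaab" =~ /(?>a+)ab/;    # never matches: the grab eats all the
                                    # a's and won't give one back
           "aaab" =~ /a+ab/;        # matches, by backtracking one "a"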

   Difficult to define assertions
       Notionally, a regex is an organization of assertions that either
       succeed or fail. Some assertions are easily expressed in traditional
       regex language, while others are more easily expressed in a procedural
       language like Perl.

       The natural (but wrong) solution is to try to reinvent Perl expressions
       within regex language. So, for instance, I'm rejecting those RFCs that
       propose special assertion syntax for numerics or booleans. The better
       solution is to make it easier to embed Perl assertions within regexes.

Brave New World
       I've just made a ton of negative assertions about the current state of
       regex culture. Now I'd like you to perform a cool mental trick and turn
       all those negative assertions into positive assertions about what I'm
       going to say, because I'm not intending to give the rationale again,
       but just present the design as it stands. Damian will discuss an
       extended example in his Exegesis 5, which will show the big picture of
       how these various features work together to produce a much more
       readable whole.

       So anyway, here's what's new.

   Metacharacter Reform
       Some things stay the same: "(...)" captures text just as it did before,
       and the quantifiers "*", "+", and "?" are also unchanged.  The vertical
       bar "|" still separates alternatives. The backslash "\" still protects
       the following character from its ordinary interpretation. The "?"
       suffix character still does minimal matching.  (Note that these are by
       far the most commonly used metacharacters, so many ordinary regexes
       will look nearly identical in Perl 5 and Perl 6.)

       Since "/x" extended syntax is now the default, "#" is now always a
       metacharacter indicating a comment, and whitespace is now always
       "meta". Whitespace is now the standard way to separate regex tokens
       that would otherwise be confused as a single token.

       Even in character classes, whitespace is not taken literally any more.
       Backwhack the space if you mean it literally, or use "<sp>", or "\040",
       or "\x20", or "\c[SPACE]". But speaking of character classes...

       Perhaps the most radical change is that I've taken "[...]" away from
       character classes and made it the non-capturing grouping operator,
       because grouping is more fundamental than character classes, and
       explicit character classes are becoming less common than named
       character classes. (You can still do character classes, just not with
       bare square brackets.)

       I've also stolen "{...}" from generalized quantifiers and made them
       into closure delimiters. (Use "<n,m>" for the generalized quantifier
       now.)

       [Update: The generalized quantifier is now "**{n..m}".]

       I've stolen three new metacharacters. The new extensible metasyntax for
       assertions uses angle brackets, "<...>". And the colon ":" is now used
       for declaration and backtracking control. (Recall Larry's 2nd Law of
       Language Redesign: Larry gets the colon.) The colon always introduces a
       token that controls the meaning of what is around it. The nature of the
       token depends on what follows the colon. Both the colon syntax and
       angle syntax are extensible. (Backslash syntax is also extensible.)

       This may sound like we're complexifying things, but we're really
       simplifying. We now have the following regex invariants:

	   (...)       # always delimits a capturing group
	   [...]       # always delimits a non-capturing group
	   {...}       # always delimits a closure
	   <...>       # always delimits an assertion
	   :...	       # always introduces a metasyntactic token

       (Note that we're using "assertion" here in the broad sense of anything
       that either matches or fails, whether or not it has a width.)

       [Update: We now also have "«...»" for non-capturing assertions.]

       The nature of the angle assertion is controlled by the first character
       inside it. If the first character is alphabetic, it's a grammatical
       assertion, and the entire first word controls the meaning. The word is
       first looked up in the current grammar, if any. If not found there, it
       is checked to see if it is one of the built-in grammar rules such as
       those defined by the Unicode property classes. If the first character
       is not alphabetic, there will be special rules in the current grammar
       or in the Perl grammar for looking up the parse rule. For instance, by
       default, any assertion that begins with "!" is simply negated.
       Assertions that start with a digit are assumed to be a range assertion
       ("<n,m>") regarding the previous atom. (Taking the last two together,
       you can say "<!n,m>" to exclude a range.) Assertions that start with
       "$", "@", "%", or "&" are assumed to interpolate an indirect regex rule
       stored in a variable or returned by a subroutine.  An assertion that
       starts with a parenthesis is a closure being used as an assertion. An
       assertion that starts with a square bracket or another angle bracket is
       a character class. An assertion that starts with a quote asserts the
       match of a literal string. And so on.

       [Update: If the inside character is an angle bracket, it's the ASCII
       workaround "<<...>>" for "«...»".]

       Some metacharacters are still used but have a slightly different
       meaning, in part to get rid of the "/s" and "/m" modifiers, and in part
       because most strings in Perl 6 will come from the filehandle pre-
       chomped. So anchors "^" and "$" now always mean the real beginning and
       ending of the string. Use "^^" and "$$" to match the beginnings and
       endings of lines within a string. (They're doubled because they're
       "fancier", because they can match in multiple places, and because
       they'll be rarer, so Huffman says they should be longer.)  The "^^" and
       "$$" also match where "^" and "$" would.

       [Update: If the string ends with a newline, the null string following
       it is not considered a line.  So neither "^^" nor "$$" match after a
       final newline.  In other words, "$$" matches at "$" only if there is
       "extra" text there not terminated by a newline.]

       The dot "." now always matches any character including newline. (Use
       "\N" to match a non-newline. Or better, use an autochomping filehandle,
       if you're processing line-by-line.)

       In a sense, the sigils "$", "@", "%", and "&" are different
       metacharacters because they don't interpolate, but are now subject to
       the interpretation of the regex engine. This allows us to change the
       default behavior of ordinary "interpolation" to a literal match, and
       also lets us put in lvalue-ish constructs like:

	   / $name := (\S+) /
	   / @kids := [(\S+) \s+]* /
	   / %pets := [(\S+) \: (\S+) \s+]* /

       (Notice also the delicate interplay of quantified non-capturing
       brackets with capturing parens, particularly for gathering multiple
       values or even multiple key/value pairs.)
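
       Here's a sketch of one such binding in use (the data is made up, and
       note that the variable must already be declared):

           my $name;
           "name: Zaphod" =~ m/ name\: \s* $name := (\N*) /;
           # $name now holds "Zaphod"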

       Here are some of the metacharacter differences in table form:

	   Old			 New
	   ---			 ---
	   /pat pat #text	 /pat pat #text
	       pat/x		     pat/	       # Look Ma, no /x!
	   /patpat(?#text)/	  /pat pat <('text')>/ # can always use whitespace

	   /pat pat/		 / pat\ pat /	       # match whitespace literally
				 / pat \s* pat /       # or generically
				 / pat \h* pat /       # or horizontally
				 / pat <' '> pat /     # or as a literal string
				 / pat <sp> pat /      # or by explicit rule
				 /:w pat pat/	       # or by implicit rule

	   /^pat$/		 /^pat\n?$/	       # ^ and $ mean string
	   /^pat$/m		 /^^pat$$/	       # no more /m
	   /\A...(^pat$)*...\z/m /^...(^^pat$$)*...$/  # no more \A or \z
	   /.*\n/		 /\N*\n/	       # \N is negated \n
				 /.*?\n/	       # this still works
	   /.*/s		 /.*/		       # . always matches "any"

	   \Q$string\E		 $string	       # interpret literally

	   (?{ code })		 { code }	       # call code, ignore return
				 { code or fail }      # use code as an assertion

	   (??{$rule})		 <$var>		       # call $var as regex
				 <name>		       # call rule from current grammar

       [Update: Follows ordinary scoping rules, so lexical rules take
       precedence over grammar rules.]

				 <Other::rule>	       # call rule from some Other grammar
				 <*rule>	       # bypass local rule to call built-in
				 <@array>	       # call array of alternate rules
				 <%hash>	       # parse keyword as key to rule
				 <@array[1]>	       # call a rule from an array
				 <%hash{"x"}>	       # call a rule from a hash
				 <&sub(1,2,3)>	       # call a rule returned by a sub
				 <{ code }>	       # call return value as anonymous rule
				 <( code )>	       # call code as boolean assertion

				 <name(expr)>	       # call rule, passing Perl args
				 { .name(expr) }       # same thing.

				 <$var(expr)>	       # call rule indirectly by name
				 { .$var(expr) }       # same thing.

				 <name pat>	       # call rule, passing regex arg
				 { .name(/pat/) }      # same thing.

				 # maybe...
				 <name: text>	       # call rule, passing string
				 { .name(q<text>) }    # same thing.

	   [\040\t\p{Zs}]	 \h		       # horizontal whitespace
	   [\r\n\ck\p{Zl}\p{Zp}] \v		       # vertical whitespace
	   [a-z]		 <[a-z]>	       # equivalently non-international
				 <alpha>	       # more international
	   [[:alpha:][:digit:]]	 <<alpha><digit>>      # POSIX classes are built-in rules

       [Update: That must now be written "<+<alpha>+<digit>>", or it will be
       mistaken for "«alpha><digit»", which doesn't work too well.]

       [Update: Actually, that's now written "<+alpha+digit>", avoiding the
       mistaken impression entirely.]

	   {n,m}		 <n,m>		       # assert repeat count

       [Update: Is now " **{n..m} ".]

	   {$n,$m}		 <$n,$m>	       # indirect repeat counts

       [Update: Is now " **{$n..$m} ".]

	   (?>.*)		 [.*]:		       # don't backtrack through [.*]
				 .*:		       # brackets not necessary on atom
				 (.*):		       # same, but capture
				 <xyz>:		       # don't backtrack into subrule

				 :		       # skip previous atom when backtracking
				 ::		       # fail all |'s when backtracking
				 :::		       # fail current rule when backtracking

				 :=		       # bind a name to following atom
	   my ($x) = /(.*)/	 my $x; / $x:=(.*) /   # may now bind it inside regex

	   (?i)			 :i		       # ignore case in the following
				 :ignorecase	       # same thing, self-documenting form
	   (?i:...)		 [:i ...]	       # can limit scope without capture
				 (:i ...)	       # can limit scope with capture

       Declarations like ":i" are lexically scoped and do not pass to any
       subrules. Each rule maintains its own sensitivity. There is no built-in
       operator to turn case ignorance back off--just call a different rule
       and it's automatically case sensitive again. (If you want a
       parameterized subrule, that can be arranged. It's just a method, after
       all. Proof of this assertion is left to future generations of hackers.)

   Backslash Reform
       There are some changes to backslash sequences. Character properties
       "\p" and "\P" are no longer needed--predefined character classes are
       just considered intrinsic grammar rules. (You can negate any "<...>"
       assertion by using "<!...>" instead.) As mentioned in a previous
       Apocalypse, the "\L", "\U", and "\Q" sequences no longer use "\E" to
       terminate--they now require bracketing characters of some sort. And
       "\Q" will rarely be needed due to regex policy changes. In fact, they
       may all go away since it's easy to say things like:

	   $(lc $foo)

       [Update: And in fact, they did all go away.]

       [Update: "$()" is gone, so you have to say "<{lc $foo}>" or some such.]

       For any bracketing construct, square brackets are preferred, but others
       are allowed:

	   \x[...]     # preferred, indicates simple bracketing
	   \x(...)     # okay, but doesn't capture.
	   \x{...}     # okay, but isn't a closure.
	   \x<...>     # okay, but isn't an assertion

       The "\c" sequence is now a bracketing construct, having been extended
       from representing control characters to any named character.

       Backreferences such as "\1" are gone in favor of the corresponding
       variable $1. "\A", "\Z", and "\z" are gone with the disappearance of
       "/s" and "/m". The position assertion "\G" is gone in favor of a ":c"
       modifier that forces continuation from where the last match left off.
       That's because "\G" was almost never used except at the front of a
       regex. In the unlikely event that you want to assert that you're at the
       old final position elsewhere in your regex, you can always test the
       current position (via the ".pos" method) with an assertion:

	   $oldpos = pos $string;
	   $string =~ m/... <( .pos == $oldpos )> .../;

       You may be thinking of ".pos" as the final position of the previous
       match, but that's not what it is. It's the current position of the
       current match. It's just that, between matches, the current position of
       the current match happens to be the same as the final position of the
       current match, which happens to be the last match, which happens to be
       done. But as soon as you start another match, the last match is no
       longer the current match.

       Note that the ":c" continuation is needed only on constructs that
       ordinarily force the search to start from the beginning. Subrules
       automatically continue at the current location, since their initial
       position is controlled by some other rule.

       [Update: There's also a ":p" that is like ":c" but anchors the next
       chunk to the starting position, so it's good for parsers.]

       There are two new backslash sequences, "\h" and "\v", which match
       horizontal and vertical whitespace respectively, including Unicode
       spacing characters and control codes. Note that "\r" is considered
       vertical even though it theoretically moves the carriage sideways.
       Finally, "\n" matches a logical newline, which is not necessarily a
       linefeed character on all architectures. After all, that's why it's an
       "n", not an "l". Your program should not break just because you
       happened to run it on a file from a partition mounted from a Windows
       machine. (Within an interpolated string, "\n" still produces whatever
       is the normal newline for the current architecture.)

	   Old		       New
	   ---		       ---
	   \x0a		       \x0a		       # same
	   \x{263a}	       \x263a		       # brackets required only if ambiguous
	   \x{263a}abc	       \x[263a]abc	       # brackets required only if ambiguous
	   \0123	       \0123		       # same (no ambiguity with $123 now)
	   \0123	       \0[123]		       # can use brackets here too

	   \p{prop}	       <prop>		       # properties are just grammar rules
	   \P{prop}	       <!prop>

       [Update: That's "<-prop>" now, since "<!prop>" would be a zero-width
       assertion.]

	   [\040\t\p{Zs}]      \h		       # horizontal whitespace
	   \s                  \h		       # not exact, but often more correct
	   [\r\n\ck\p{Zl}\p{Zp}] \v		       # vertical whitespace

	   \Qstring\E	       \q[string]
			       <'string with spaces'>  # match literal string
			       <' '>		       # match literal space

	   \E		       gone		       # use \Q[...] instead

       [Update: "\Q[...]" is also gone.	 Use "<{quotemeta ...}>".]

	   \A		       ^		       # ^ now invariant
	   \a		       \c[BEL]		       # alarm (bell)

	   \Z		       \n?$		       # clearer
	   \z		       $		       # $ now invariant
	   \G		       <( .pos == $oldpos )>   # match at particular position
						       # typically just use m:c/pat/

       [Update: Or "m:p/pat/".]

	   \N{CENT SIGN}       \c[CENT SIGN]	       # named character
	   \c[		       \e		       # escape
	   \cX		       \c[^X]		       # control char
	   \n		       \c[LF]		       # specifically a linefeed
	   \x0a\x0d	       \x[0a;0d]	       # CRLF
	   \x0a\x0d	       \c[CR;LF]	       # CRLF (conjectural)
	   \C		       [:u0 .]		       # forced byte in utf8 (dangerous)

       [Update: The Unicode levels are now specified by ":bytes", ":codes",
       ":graphs", and ":langs", not by numeric level.]

	   [^\N{CENT SIGN}]    \C[CENT SIGN]	       # match any char but CENT SIGN

	   \Q$var\E	       $var		       # always assumed literal,
	   \1		       $1		       # so $1 is literal backref
	   /$1/		       my $old1 = $1; /$old1/  # must use temporary here

	   \r?\n	       \n		       # \n asserts logical newline

	   [^\n]	       \N		       # not a logical newline
			       \C[LF]		       # not a linefeed

	   [^\t]	       \T		       # not a tab (these are conjectural)
	   [^\r]	       \R		       # not a return
	   [^\f]	       \F		       # not a form feed
	   [^\e]	       \E		       # not an escape
	   [^\x1B]	       \X1B		       # not specified hex char
	   [^\x{263a}]	       \X[263a]		       # not a Unicode SMILEY

	   \X		       <.>		       # a grapheme (combining char seq)
			       [:u2 .]		       # At level 2+, dot means grapheme

       [Update: That's ":graphs" now.]

       Under level 2 Unicode support, a character is assumed to mean a
       grapheme, that is, a sequence consisting of a base character followed
       by 0 or more combining characters. That not only affects the meaning of
       the "." character, but also any negated character, since a negated
       character is really a negative lookahead assertion followed by the
       traversal of a single character. For instance, "\N" really means:

	   [<!before \n> . ]

       So it doesn't really matter how many characters "\n" actually matches.
       "\N" always matches a single character--whatever that is...

   Modifier Reform
       You can't use colon for a regex delimiter any more. That's because
       regex modifiers may now be placed in front of a regex construct:

	   s:w:i:e /foo/bar/	       # :words :ignorecase :each

       That can also be written:

	   s/:w:i:e foo/bar/	       # :words :ignorecase :each

       Single character modifiers may be bundled like this:

	   s:wie /foo/bar/	       # :words :ignorecase :each

       [Update: Bundling is no longer allowed. Whitespace is allowed.]

       ...but only if the sequence as a whole is not already defined as a long
       modifier, since ambiguity will be resolved in favor of the long
       modifier. Long modifiers may not be bundled with any other modifier.
       So this is legal:

	   s:once:wie /foo/bar/

       [Update: Not any more.]

       but not these (unless you've defined them):

	   s:wieonce /foo/bar/
	   s:oncewie /foo/bar/

       Not only is colon disallowed as a regex delimiter, but you may no
       longer use parentheses as the delimiters either. This will allow us to
       parameterize modifiers:

	   s:myoption($x) /foo/bar/

       This rule also allows us to differentiate "s///" from an "s()"
       function, "tr///" from "tr()", etc. If you want matching brackets for
       the delimiters I'd suggest that you use square brackets, since they now
       mean grouping without capturing.

       [Update: Parens may be used for delimiters as long as they are not
       adjacent to the final option.  They may be separated with either a
       colon or whitespace.]

       Several modifiers, "/x", "/s", and "/m", are no longer needed and have
       been retired. It's unclear whether "/o" is necessary any more. We will
       assume it's gone unless it's shown that caching can't handle the
       problem. Note that the regex now has more control over when to cache
       subrules because it is no longer subject to the vagaries of standard
       interpolation.

       The old "/c" modifier is gone because regexes never reset the position
       on failure any more. To do that, set "$string.pos = 0" explicitly. But
       note also that assigning to a string automatically resets its position
       to 0, so any string in your typical loop is going to start with its
       current search position already set to 0. Modifying a string in place
       causes the position to move to the end of the replacement section by
       default, if the position was within the span replaced. (This is
       consistent with "s///" semantics.)
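
       A sketch of the intended position behavior (using the ".pos" method
       described earlier):

           my $str = "foofoo";
           $str =~ m:c/foo/;   # matches, leaves $str.pos at 3
           $str =~ m:c/foo/;   # matches, leaves $str.pos at 6
           $str =~ m:c/foo/;   # fails, and the position stays at 6
           $str = "barbar";    # assignment resets the position to 0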

       The "/e" modifier is also gone, since it did reverse parsing magic, and
       ":e" will be short for ":each"--see below. It's still easy to
       substitute the value of an expression though:

	   s/pat/$( code )/;

       or even

	   s(/pat/, { code });

       [Update: ":each" is gone, and we're back to ":global" and ":g" to stay
       with common culture.  Since "$()" is also gone, just interpolate code
       using a closure: "s/pat/{ code }/".]

       There's a new modifier, ":once", that causes a match to succeed only
       once (like the old "?...?" construct). To reset it, use the ".reset"
       method on the regex object. (If you haven't named the regex object, too
       bad...)

       Another new modifier is ":w", which causes an implicit match of
       whitespace wherever there's literal whitespace in a pattern. In other
       words, it replaces every sequence of actual whitespace in the pattern
       with a "\s+" (between two identifiers) or a "\s*" (between anything
       else). So

	   m:w/ foo bar \: ( baz )*/

       really means (expressed in Perl 5 form):

	   m:p5/\s*foo\s+bar\s*:(\s*baz\s*)*/

       [Update: The optional whitespace is actually matched by the "<?ws>"
       rule, which you can redefine to change the semantics of ":words".]

       You can still control the handling of whitespace under ":w", since we
       extend the rule to say that any explicit whitespace-matching token
       can't match whitespace implicitly on either side. So:

	   m:w/ foo\ bar \h* \: (baz)*/

       really means (expressed in Perl 5 form):

	   m:p5/\s*foo bar[\040\t\p{Zs}]*:\s*(baz)*/

       The first space in

	   /[:w foo bar]/

       matches "\s*" before ""foo"". That's usually what you want, but if it's
       not what you want, you have a little problem. Unfortunately you can't
       just say:

	   /[:wfoo bar]/

       That won't work because it'll look for the ":wfoo" modifier. However,
       there are several ways to get the effect you want:

	   /[:w()foo bar]/
	   /[:w[]foo bar]/
	   /[:w\bfoo bar]/
	   /[:w::foo bar]/

       That last one is just our friend the "::" operator in disguise. If you
       backtrack into it, you're leaving the brackets anyway, so it's
       essentially a no-op.

       The new ":c"/":cont" modifier forces the regex to continue at the
       current "pos" of the string. It may only be used outside the regex.
       (Well, it could be used inside but it'd be redundant.) The modifier
       also forces the regex to match only the next available thing. That's
       not quite the same as the "^" anchor, though, because it not only
       disables the implicit scanning done by "m//" and "s///", but it also
       works on more than the first iteration. It forces all matches to be
       contiguous, in other words. So ":c" is short for both "continue" and
       "contiguous". If you say

	   $_ = "foofoofoo foofoofoo";
	   s:each:cont/foo/FOO/;

       you get:

	   FOOFOOFOO foofoofoo

       This may seem odd, but it's precisely the semantics of any embedded
       regex:

	   $_ = "foofoofoo foofoofoo";
	   $rx = rx/foo/;
	   m/<$rx>*/;	       # matches "foofoofoo"

       [Update: Now we use ":p"/":pos" to get contiguous semantics.  The
       ":c"/":continue" modifier starts scanning again each time, so when used
       with ":global" it really only influences where the first match is
       found.]

       A modifier that starts with a number causes the pattern to match that
       many times. It may only be used outside the regex. It may not be
       bundled, because ordinals are distinguished from cardinals. That is,
       how it treats those multiple matches depends on the next character. If
       you say

	   s:3x /foo/bar/

       then it changes the first 3 instances. But if you say

	   s:3rd /foo/bar/

       it changes only the 3rd instance. You can say

	   s:1st /foo/bar/

       but that's just the default, and should not be construed as equivalent
       to ":once", which matches only once, ever. (Unless you ".reset" it, of
       course.)

       You can combine cardinals and ordinals:

	   s:3x:3rd /foo/bar/

       That changes the 3rd, 6th, and 9th occurrences. To change every other
       quote character, say

	   s:each:2nd /"/\&rquot;/;

       ":each" is synonymous with ":3x" (for large values of 3). Note that
       ":each" does not, in fact, generate every possible match, because it
       disallows overlaps. To get every possible match, use the ":any"
       modifier. Saying:

	   $_ = "abracadabra";
	   @all = m:any /a[^a]+a/;

       produces:

	   abra aca ada abra

       It can even match multiple times at the same spot as long as the rest
       of the regex progresses somehow. Saying:

	   @all = m:any /a.*?a/;

       produces:

	   abra abraca abracada abracadabra aca acada acadabra ada adabra abra

       If you say

	   $sentence.m:any /^ <english> $/

       you'll get every possible parsing of the sentence according to the
       rules of "english" (not to be confused with the rules of English, which
       are already confusing enough, except when they aren't).

       [Update: ":any" has been replaced with ":overlap" and ":exhaustive" to
       differentiate the cases of wanting one solution at each position from
       wanting all solutions at every position.]

       To indicate varying levels of Unicode support we have these modifiers,
       which may be used either inside or outside a regex:

	   :u0	       # use bytes	 (. is byte)
	   :u1	       # level 1 support (. is codepoint)
	   :u2	       # level 2 support (. is grapheme)
	   :u3	       # level 3 support (. is language dependent)

       These modifiers say nothing about the state of the data, but in general
       internal Perl data will already be in Normalization Form C, so even
       under ":u1", the precomposed characters will usually do the right
       thing. Note that these modifiers are for overriding the default support
       level, which was probably set by pragma at the top of the file.

       [Update: Replaced with ":bytes", ":codes", ":graphs", and ":langs".]

       Finally, there's the ":p5" modifier, which causes the rest of the regex
       (or group) to be parsed as a Perl 5 regular expression, including any
       interpolated strings. (But it still doesn't enable Perl 5's trailing
       modifiers.)

       [Update: Now spelled out ":perl5".]

	   Old		       New
	   ---		       ---
	   ?pat?	       m:once/pat/	       # match once only
	   /pat/i	       m:i/pat/		       # ignorecase
			       /:i pat/		       # ignorecase
	   /pat/x	       /pat/		       # always extended
	   /pat\s*pat/	       /:w pat pat/	       # match word sequence
	   /(?i)$p5pat/	       m:p5/(?i)$p5pat/	       # use Perl 5 syntax
	   $n = () = /.../g    $n = +/.../;	       # count occurrences
	   for $i (1..3){s///} s:3x///;		       # do 3 times
	   /^pat$/m	       /^^pat$$/	       # no more /m
	   /./s		       /./		       # no more /s
	   /./		       /\N/		       # . now works like /s

   Keyword and Context Reform
       Deferred regex rules are now defined with "rx//" rather than "qr//",
       because a regular expression is no longer a kind of quoted string.

       Actually, just as you can define closures without an explicit "sub",
       any "//" or "rx//" declares a deferred regex if it's not in a context
       that executes it immediately. A regex is executed automatically if it's
       in a boolean, numeric, or string context. But assignment to an untyped
       variable is not such a context, nor is passing the regex as an untyped
       parameter to a function. (Of course, an explicitly declared RULE
       parameter doesn't provide an evaluating context either.)

       So these are equivalent:

	   my $foo = /.../;	       # create regex object
	   my $foo = rx[...];	       # create regex object
	   my $foo = rule {...};       # create regex object

       Likewise, these are equivalent:

	   @x = split /.../;
	   @x = split rx[...];
	   @x = split rule {...};

       [Update: Note that "split" supplies an implicit ":c" here.]

       The "rule" syntax is just a way of declaring a deferred regex as if it
       were a subroutine or method. More on that later.

       To force an immediate evaluation of a deferred regex where it wouldn't
       ordinarily be, you can use the appropriate unary operator:

	   my $foo = ?/.../;   # boolean context, return whether matched,
	   my $foo = +/.../;   # numeric context, return count of matches
	   my $foo = _/.../;   # string context, return captured/matched string

       [Update: Unary "_" is now unary "~".]

       The standard match and substitution forms also force immediate
       evaluation regardless of context:

	   $result = m/.../;	       # do match on topic string
	   $result = s/.../.../;       # do substitution on topic string

       These forms also force the regex to start matching at the beginning of
       the string in question and scan forward through the string for the
       match, as if there were an implicit ".*?" in front of every iteration.
       (Both of these behaviors are suppressed if you use the ":c"/":cont"
       modifier). In contrast, the meaning of the deferred forms is dependent
       on context. In particular, a deferred regex naturally assumes ":c" when
       used as a subrule. That is, it continues where the last match left off,
       and the next thing has to match right there at the head of the string.

       [Update: A regex actually assumes ":p" semantics when used as a
       subrule.	 And the long form of ":c" is ":continue".]

       In any other context, including list context, a deferred regex is not
       immediately evaluated, but produces a reference to the regex object:

	   my $rx = /.../;     # not evaluated
	   my @foo = $rx;      # ERROR: type mismatch.
	   my @foo = ($rx);    # One element, a regex object.
	   my @foo = (/.../);  # Same thing.
	   my @foo := $rx;     # Set autogrow rule for @foo.

       To evaluate repeatedly in list context, treat the regex object as you
       would any other iterator:

	   my @foo = <$rx>;

       You can also use the more explicit form:

	   my @foo = m/<$rx>/;

       Those aren't identical, since the former assumes ":c" and starts up at
       the current position of the unmentioned topic, while the latter
       explicitly resets the position to the beginning before scanning. Also,
       since the deferred regex assumes a ":c" modifier, "<$rx>" won't scan
       through the string like "m//". It can return multiple values to the
       list, but they have to be contiguous. You can get the scanning effect
       of "m//" by prepending the pattern with ".*?".

       [Update: Where the preceding paragraphs says ":c" read ":p".]

       But it's vitally important to understand this fundamental change, that
       "//" is no longer a short form of "m//", but rather a short form of
       "rx//". If you want to add modifiers to a "//", you have to turn it
       into an "rx//", not an "m//". It's now wrong to call "split" like this:

	   split m/.../

       (That is, it's wrong unless you actually want the return value of the
       pattern match to be used as the literal split delimiter.)

       The old "?...?" syntax is gone. Indeed, it has to go for us to get the
       unary "?" operator.

	   Old		       New
	   ---		       ---
	   ?pat?	       m:once/pat/
	   qr//		       rx//
			       rule { }

   Null String Reform
       The null pattern is now illegal. To match whatever you used to match
       with a null pattern, use one of these:

	   Old		       New
	   ---		       ---
	   //		       /<prior>/       # match what prior match did
	   //		       /<null>/	       # match the null string between chars
	   (a|b|)	       (a|b|<null>)    # match a null alternative

       Note that, as an assertion, "<null>" always succeeds. You never want to
       say:

	   / <null> | single | double | triple | home run /

       because you'll never get to first base.

   Extension Syntax Reform
       There are no longer any "(?...)" sequences, because parens now always
       capture. Some of the replacement sequences take their intrinsic scoping
       from "<...>", while others are associated with other bracketing
       characters, or with any arbitrary atom that could be a bracketed
       construct. Looking at the metasyntax problem from the perspective of a
       Perl5-to-Perl6 translator, here's what the various Perl 5 extension
       constructs translate to:

	   Old		       New
	   ---		       ---
	   (??{$rule})	       <$rule>	       # call regex in variable
	   (?{ code })	       { code }	       # call Perl code, ignore result
	   (?#...)	       <('...')>       # in-line comment, rarely needed
	   (?:...)	       [...]	       # non-capturing brackets
	   (?=...)	       <before ...>    # positive lookahead
	   (?!...)	       <!before ...>   # negative lookahead
	   (?<=...)	       <after ...>     # positive lookbehind
	   (?<!...)	       <!after ...>    # negative lookbehind
	   (?>...)	       [...]:	       # grab (any atom)

	   (?(cond)yes|no)     [ cond :: yes | no ]
	   (?(1)yes|no)	       [ <(defined $1)> :: yes | no ]

       The "<$rule>" construct does a "delayed" call of another regular
       expression stored in the $rule variable. If it is a regex object, it's
       just called as if it were a subroutine, so there's no performance
       problem. If it's a string, it is compiled as a regex and executed. The
       compiled form is cached as a property of the string, so it doesn't have
       to be recompiled unless the string changes. (This implies that we can
       have properties that invalidate themselves when their base object is
       modified.) In either case, the evaluated regex is treated as a subrule,
       and any captures it does are invisible to the outer regex unless the
       outer regex takes steps to retrieve them. In any event, subrule parens
       never change the paren count of the outer rule.
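
       For instance (an illustrative sketch), either of these works as a
       subrule:

	   my $ident = rx/ <alpha> \w* /;      # regex object, called like a sub
	   my $num   = '\d+';		       # string, compiled once and cached
	   / <$ident> \: <$num> /	       # both behave as subrules here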

       [Update: Use "X$ruleX" to turn any capturing rule into a non-capturing
       rule.  Or say "(X$ruleX)" if you only want to capture the string.]

       The "{code}" form doesn't return anything meaningful--it is used for
       its side effects. Any such closure may behave as an assertion. It
       merely has to throw an exception in order to fail. To throw such an
       exception (on purpose) one may use "fail":

	   $_ = "666";
	   / (\d+) { $1 < 582 or fail }/

       As with any assertion, the failing closure starts backtracking at the
       location of the closure. In this case, it backtracks into the "\d+" and
       ends up matching "66" rather than "666". If you didn't want that, use
       "\d+:" instead.

       It's more succinct, however, to use the code assertion syntax. Just put
       angles around a parenthesized Perl expression:

	   / (\d+) <( $1 < 582 )> /

       I find the parens to be vaguely reminiscent of the parentheses you have
       to put around conditionals in C (but not Perl (anymore)). Also, the
       parentheses are meant to remind you that you only want to put an
       expression there, not a full statement.

       Don't use a bare closure to try to interpolate a calculated regex,
       since the result will be ignored. Instead, use the "<{expr}>" form to
       do that. As with "<&rule()>", the result will be interpreted as a
       subrule, not as if it were interpolated.
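
       For example (where "pattern_for" stands in for some routine of your
       own):

	   / <{ pattern_for($lang) }> /        # result is treated as a subrule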

       Since a string is usually true, you can just assert it to get the
       effect of an inline comment: "<("this is a comment")>". But I've never
       used one except to show it as an example. Line ending comments are
       usually much clearer. (Just bear in mind you can't put the final regex
       delimiter on the same line, because it won't be seen in the comment.)
       You could also use the "{'...'}" construct for comments, but then you
       risk warnings about "useless use of a string in void context".

       The "[...]" is the new non-capturing bracket notation. It seems to work
       very well for this purpose--I tried the other brackets and they tend to
       "disappear" faster than square brackets. So we reserve "(...)"  and
       "<...>" for constructs where the visual distance between opening and
       closing is typically shorter than for square brackets or curlies. The
       square brackets also work nicely when lined up vertically with vertical
       bars. Here's a declaration of a named rule from the class Perl6Grammar.
       It parses Perl 6 statements. (Think of it as a funny looking method
       declaration.)

	   rule state  { <label>
			   [ <control>		{.control}
			   | <sideff> <eostate> {.sideff}
			   | <@other_statements>
			   ]
		       };

       Huffman coding says that rarer forms should be longer, and that's the
       case with the lookahead and lookbehind assertions, "<before ...>" and
       "<after ...>". (The negations are formed via the general "<!...>"
       rule.) Note that these prepositions are interpreted as assertions, not
       operations. For example, "<before X>" is to be read "Assert that we are
       before X" rather than "Look before where we are for X".

       The new ":" operator replaces the "(?>...)" construct. It modifies
       whatever comes before it, much like "*" does, so it's naturally scoped
       if the preceding atom (or quantified atom) is a bracketed construct.
       Parsers can use this every time they commit to the parsing of a token
       or phrase to tell the regex engine that there's no point in
       backtracking through the atom in question, so backtracking will skip
       backwards over the atom and continue with some earlier branch point.
       The following takes a long time to fail if it has to look at every
       sequence of ""a"" to see if there is a ""b"" after it:

	   "aaaaaaaaaaaaaaaaaaaaaaaaaaaaac" =~ /^ a* b /

       But we already know that the only possible match is the longest one. So
       if you put in the colon, it fails in one pass because the "*" grabs
       everything and gives nothing back on backtracking.

	   "aaaaaaaaaaaaaaaaaaaaaaaaaaaaac" =~ /^ a*: b /

       You can use colon on a longer sequence too. The following might match a
       list of expressions separated by comma:

	   / <expr> [ , <expr> ]*: /

       It is an error to use ":" on any atom that does no backtracking. This
       will help to catch errors where you've forgotten to backslash a literal
       colon in things like:

	   /^From: (.*)/

       Perl 6 has no need for a special conditional construct like Perl 5's
       "(?(cond)yes|no)". That's because with a slight tweak, ordinary
       alternation can do the same thing. That tweak is our next backtracking
       modifier, the "::" operator. If you backtrack across it, it fails all
       the way out of the current list of alternatives. Consider an ordinary
       list of alternatives:

	   [ <A> <X> | <B> <Y> | <C> <Z> ]

       The way the rules of backtracking work, if either "<A>" or "<X>" fail,
       it backtracks to the next alternative. Likewise for "<B>" and "<Y>". In
       the case of "<C>" or "<E>", there is no next alternative, so it
       naturally fails out of the entire construct. That's not how a
       conditional is supposed to work, because in the conditional, only the
       condition determines which case is executed. Once you've committed to a
       particular case, it has to stand or fall as if the conditional hadn't
       been there. So all we need for our purposes is something
       that separates the assertions that matter from those that don't. That's
       what "::" does, and it reads rather well as a "then", or as a
       "corresponds to". If you write

	   [ <A> :: <X>
	   | <B> :: <Y>
	   | <C> :: <Z>
	   ]

       then the failure of "<A>", "<B>", or "<C>" proceeds to the next case
       (if any), while any failure in "<X>", "<Y>", or "<Z>" is guaranteed to
       backtrack out of the front of the alternative list and revise a former
       choice (just as the success of "<X>", "<Y>", or "<Z>" is guaranteed to
       "forward track" out of the end of the alternative list and try to match
       more). It's a natural mapping to existing regex semantics. Here's a
       more realistic example from the Perl 6 grammar. It parses statement
       modifiers. (The "<ws>" rule parses optional whitespace.)

	   rule modifier { if	  <ws> :: <expr> { .new_cond(0,$expr) }
			 | unless <ws> :: <expr> { .new_cond(1,$expr) }
			 | while  <ws> :: <expr> { .new_loop(0,$expr) }
			 | until  <ws> :: <expr> { .new_loop(1,$expr) }
			 | for	  <ws> :: <expr> { .new_for($expr)   }
			 | <@other_modifiers>  # user defined
			 | <null>	       # no modifier
			 }

       In each case, once we recognize a keyword (and its following
       whitespace), we need to look for an expression, and then call a closure
       that builds the syntax tree. If either of those fails, the entire
       modifier rule fails. We only get to the last two alternatives on
       failure of assertions before the "::".

       Note that the "::" only says that we can't backtrack from the "then"
       into the "if". It says nothing about backtracking into the alternative
       list as a whole. The alternatives are still choice points, so the regex
       engine is allowed to backtrack into the alternative list and try
       another alternative. (To disable that, simply put a ":" after the
       closing bracket of the alternative list.)
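
       For instance:

	   [ <A> :: <X> | <B> :: <Y> ]:        # commit to whichever alternative
					       # matched; the list is no longer
					       # a choice point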

       There is nothing in Perl 5 corresponding to the ":::" operator, but it
       works just like "::", only more so. If you backtrack across it, it
       fails all the way out of the current rule definition (though not out of
       any rule invoking this definition). That is, it fails all the way out
       of the innermost lexically enclosing "/.../", "m/.../", "s/...//",
       "rx/.../", or "rule {...}", skipping out through any enclosing nestings
       of "<...>", "[...]", or "(...)". (A pattern nested within a closure is
       classified as its own rule, however, so it never gets the chance to
       pass out of a "{...}" closure.)

       Since the alternatives in our last example are at the top level of the
       regex, we could have used the ":::" operator to get the same effect as
       "::", because terminating the rule and terminating the alternation
       amount to the same thing in that case. You can think of all of these as
       variants on Prolog's "cut" operator.

       If you backtrack over the "::::" operator, it will delete your program
       from the disk. ";-)"

       Actually, the real name of the real "::::" operator is "<commit>". It
       fails the entire match if you backtrack over it, not just the current
       rule. That is, it fails all the way out of the outermost dynamically
       enclosing "/.../", "m/.../", "s/...//", "rx/.../", or "rule {...}" that
       is executing on the current string.

       There is one "cut" operator that is beyond "<commit>"; it is
       appropriately named "<cut>", for two reasons. First of all, it's a real
       cut operator in that, if you backtrack over it, the current match fails
       completely, just like "<commit>". But that's just a side effect of the
       other reason, which is that "<cut>" cuts off the front of the string
       that you're currently matching on, turning the current position into
       the new beginning of the string. When you're matching on a potentially
       infinite string, it's important that you have a way of discarding that
       part of the match that you've already committed to. In Perl 5, the only
       way to do that was with a coordinated system of "s/^pat//" operations.
       With the "<cut>" assertion, however, you can just match normally, and
       cut at one spot in your top-level rule when you reach an "accept"
       state.
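
       So a stream parser might commit like this (a sketch, with "<record>"
       standing in for whatever you're matching):

	   / [ <record> <cut> ]* /     # discard each record from the string
				       # as soon as it has matched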

       In the realm of idle speculation, we could go as far as to define a
       variant of "<cut>" that would render "s///" slightly redundant:

	   s/foo/bar/;
	   m/foo <replace("bar")> /

       Note that we don't need any special forms for controlling the scope of
       a "fail" in a closure. Just prefix the closure with the appropriate
       backtracking operator:

	   / pattern ::: { code() or fail } /  # fails entire rule

   Character Class Reform
       As we mentioned earlier, character classes are becoming more like
       standard grammar rules, because the definition of "character" is
       getting fuzzier. This is part of the motivation for demoting enumerated
       character classes and stealing the square brackets for another purpose.
       Actually, for old times' sake you still use square brackets on
       enumerated character classes, but you have to put an extra set of
       angles around it. But this actually tends to save keystrokes when you
       want to use any named character classes or Unicode properties,
       particularly when you want to combine them:

	   Old		       New
	   ---		       ---
	   [a-z]	       <[a-z]>
	   [[:alpha:]]	       <alpha>
	   [^[:alpha:]]	       <-alpha>
	   [[:alpha:][:digit:]] <<alpha><digit>>

       The outer "<...>" also naturally serves as a container for any extra
       syntax we decide to come up with for character set manipulation:

	   <[_]+<alpha>+<digit>-<Swedish>>

       [Update: If the first thing in angle brackets is another angle bracket,
       you have to put a "+" between to avoid confusion with "Texas" quotes
       emulating French quotes.]

   State
       [This section gets pretty abstruse. It's okay if your eyes glaze over.]

       Every regex match maintains a state object, and any closure within the
       regex is actually an anonymous method of that object, which means in
       turn that the closure's topic is the current state object. Since a
       unary dot introduces a method call on the current topic, it follows
       that you can call any method in the state object that way:

	   /(.*) { print .pos }/       # print current position

       The state object may in fact be an instance of a grammar class. A
       grammar object has additional methods that know how to build a parse
       tree. Its rules also know how to refer to each other or to rules of
       related grammars.

       Note that $_ within the closure refers to this state object, not the
       original search string. If you search on the state object, however, it
       pretends that you wanted to continue the search on the original string.
       If the internal search succeeds, the position of the external state is
       updated as well, just as if the internal search had been a rule invoked
       directly from the outer regex.
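
       For example (a sketch):

	   / (\d+) { m/ \s* cm / or fail } /   # the inner match continues on the
					       # original string; on success the
					       # outer position advances too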

       Because the state object is aware of how the tree is being built, when
       backtracking occurs the object can destroy parts of the parse tree that
       were conjectured in error. Because the grammar's action methods have
       control of the regex state, they can access named fields in the regex
       without having to explicitly pass them to the method call.

       For instance, in our earlier example we passed $expr explicitly to
       build the parse tree, but the method can actually figure that out
       itself. So we could have just written:

	   rule modifier { if	  <ws> :: <expr> { .new_cond(0) }
			 | unless <ws> :: <expr> { .new_cond(1) }
			 | while  <ws> :: <expr> { .new_loop(0) }
			 | until  <ws> :: <expr> { .new_loop(1) }
			 | for	  <ws> :: <expr> { .new_for }
			 | <@other_modifiers>  # user defined
			 | <null>	       # no modifier
			 }

       See Variable Scoping below for where @other_modifiers gets looked up.

       Within a closure, $_ represents the current state of the current regex,
       and by extension, the current state of all the regexes participating in
       the current match. (The type of the state object is the current grammar
       class, which may be an anonymous type if the current grammar has no
       name. If the regex is not a member of a grammar, it's of type RULE.)
       Part of the state of the current regex is the current node of the parse
       tree that is being built. When the current regex succeeds, the state
       object becomes a result object, and is returned to the calling regex.
       The calling regex can refer to the returned object as a "hypothetical"
       variable, the name of which is either implicitly generated from the
       name of the rule, or explicitly bound using ":=". Through that variable
       you can get at anything captured by the subrule. (That is what $expr
       was doing earlier.)

       [Update: Only rules of the form "<ident>" are captured by default.  You
       can use ":=" to force capture of anything else, or the ":keepall"
       adverb to capture everything else.]

       When the entire match succeeds, the top-level node is returned as a
       result object that has various values in various contexts, whether
       boolean, numeric, or string context. The name of the result object is
       $0. The result object contains all the other information, such as $1,
       $2, etc. Unlike $& in Perl 5, $0 is lexically scoped to the enclosing
       block. By extension, $1, etc. are also lexically scoped.

       [Update: The $0 variable's functionality is now split up into a $/
       variable that represents the match state object, and a $0 variable that
       represents the capture of the entire pattern.  In other words, $0 is
       exactly like a $1 capture, only with assumed parens around the entire
       pattern.]

       As a kind of iterator, a regex stored in a variable doesn't expand in
       list context unless you put angles around it or use it with "m//":

	   $rx = /(xxx)/;
	   print 1,2,<$rx($_)>;
	   print 1,2,</(xxx)/>;
	   my &rx := /(xxx)/;
	   print 1,2,<rx($_)>;

       $0, $1, etc. are not set in iterated cases like this. Each list item is
       a result object, though, and you can still get at the internal values
       that way.

   Hypothetical Variables, er, Values
       Values that are determined within a regular expression should usually
       be viewed as speculative, subject to cancellation if backtracking
       occurs. This applies not only to the values captured by "(...)" within
       the regex, but also to values determined within closures embedded in
       the regex. The scope of these values is rather strange, compared to
       ordinary variables. They are dynamically scoped, but not like "temp"
       variables. A temporary variable is restored at the end of the current
       block. A hypothetical variable keeps its value after the current block
       exits, and in fact keeps that value clear to the end of its natural
       lifetime if the regex succeeds (where the natural lifetime depends on
       where it's declared). But if failure causes backtracking over where the
       variable was set, then it is restored to its previous state. Perl 5
       actually coerced the "local" operator into supporting this behavior,
       but that was a mistake. In Perl 6 "temp" will keep consistent
       semantics, and restore values on exit from the current block. A new
       word, "let", will indicate the desire to set a variable to a
       hypothetical value. (I was tempted to use "suppose", but "let" is
       shorter, and tends to mean the same thing, at least to mathematicians.)

	   my $x;
	   / (\S*) { let $x = .pos } \s* foo /

       After this pattern, $x will be set to the ending position of $1--but
       only if the pattern succeeds. If it fails, $x is restored to "undef"
       when the closure is backtracked. It's possible to do things in a
       closure that the regex engine doesn't know how to backtrack, of course,
       but a hypothetical value doesn't fall into that category. For things
       that do fall into that category, perhaps we need to define a "BACK"
       block that is like "UNDO", but scoped to backtracking.

       Sometimes we'll talk about declaring a hypothetical variable, but as
       with "temp" variables, we're not actually declaring the variable
       itself, but the dynamic scope of its new value. In Perl 6, you can in
       fact say:

	   my $x = 0;
	   ...
	   {
	       temp $x = 1;    # temporizes the lexical variable
	       ...
	   }
	   # $x restored to 0

       (This is primarily useful for dynamically scoping a file-scoped
       lexical, which is slightly safer than temporizing a package variable
       since nobody can see it outside the file.)

       You may declare a hypothetical variable only when the topic is a regex
       state. This is not as much of a hardship as it might seem. Suppose your
       closure calls out to some other routine, and passes the regex state as
       an argument, $rx_state. It suffices to say:

	   given $rx_state { let $x = .pos }

       [Update: This restriction is no longer in effect.  Any variable may be
       treated hypothetically, not just variables stored in $/.	 The capture
       variables in $/ are now syntactically distinguished from ordinary
       variables.  Hypotheticality is orthogonal to that, except insofar as $/
       is somewhat hypothetical as a whole.]

       As it happens, $1 and friends are all simply hypothetical variables.
       When we say "hypothetical variable", we aren't speaking of where the
       variable is stored, but rather how its contents are treated
       dynamically. If a regex sets a hypothetical variable that was declared
       with either "my" or "our" beforehand, then the regex modifies that
       lexical or package variable, and "let" is purely a run-time operation.

       On the other hand, if the variable is not pre-declared, it's actually
       stored in the regex state object. In this case, the "let" also serves
       to declare the variable as lexically scoped to the rest of the regex,
       in addition to its run-time action. Such a variable is not directly
       visible outside the regex, but you can get at it through the $0 object
       (always presuming the match succeeded). For a regex variable named
       $maybe, its external name is "$0._var_{'maybe'}". The $0 object can
       behave as a hash, so $0{maybe} is the short way to say that.
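
       Putting that together:

	   if / (\w+) { let $maybe = $1 } / {
	       print $0{maybe};	       # $maybe itself is not in scope out here
	   }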

       [Update: Nowadays the short name of that variable is "$<maybe>".	 And
       the match variable is named $/ instead of $0.]

       All other variable names are stored with their sigil, so the external
       name for @maybe is $0{'@maybe'}, and for %maybe is $0{'%maybe'}.

       [Update: None of these are stored with their sigil, which means you
       can't store two different types under the same name unless you
       explicitly bind to a name like "$<@maybe>".]

       $1 is a special case--it's visible outside the regex, not because it's
       predeclared, but because Perl already knows that the numbered variable
       $1 is really stored as a subarray of the $0 object: $0[1]. The numbered
       variables are available only through the array, not the hash.

       [Update: the number variables are also available through the hash, so
       $1 is the same as either "$/<1>" or $/[0].]

       Since $0 represents the state of the currently executing regex, you
       can't use it within a rule to get at the result of a completed subrule.
       When you successfully call a subrule named "<somerule>", the regex
       state is automatically placed in a hypothetical variable named
       $somerule. (Rules accessed indirectly must be captured explicitly, or
       they won't have a name by which you can get to them. More on that in
       the next section.)

       [Update: Any subrule's capture may be suppressed by starting its name
       with a question mark, so "<foo>" captures but "<?foo>" does not.]

       As the current recursive regex executes, it automatically builds a tree
       of hashes corresponding to all captured hypothetical variables. So from
       outside the regex, you could get at the $1 of the subrule "<somerule>"
       by saying "$0{somerule}[1]".

       [Update: These days that's "$/{'somerule'}[0]" or "$<somerule><1>".]

   Named Captures
       Suppose you want to use a hypothetical variable to bind a name to a
       capture:

	   / (\S+) { let $x := $1 } /

       A shorthand for that is:

	   / $x:=(\S+) /

       The parens are numbered independently of any name, so $x is an alias
       for $1.

       You may also use arrays to capture appropriately quantified patterns:

	   / @x := (\S+ \s*)* /		       # including space
	   / @x := [ (\S+) \s* ]* /	       # excluding space
	   / @x := [ (\S+) (\s*) ]* /	       # each element is [word, space]

       Note that in general, naming square brackets doesn't cause the square
       brackets to capture, but rather provides a destination for the parens
       within the square brackets. Only parens and rules can capture. It's
       illegal to name square brackets that don't capture something inside.

       You can also capture to a hash:

	   / %x := [ (\S+)\: \s* (.*) ]* /     # key/value pairs

       After that match, $1 returns a list of keys, and $2 returns a list of
       values. You can capture just the keys:

	   / %x := [ (\S+) \s* ]* /	       # just enter keys, values are undef

       You can capture a closure's return value too:

	   / $x := { "I'm in scalar context" } /
	   / @x := { "I", "am", "in", "list", "context" } /
	   / %x := { "I" => "am in hash context" } /

       Note that these do not use parens. If you say:

	   / $x := ({ code }) /

       it would capture whatever text was traversed by the closure, but ignore
       the closure's actual return value.

       You can reorder paren groups by naming them with numeric variables:

	   / $2:=(.*?), \h* $1:=(.*) /

       If you use a numeric variable, the numeric variables will start
       renumbering from that point, so subsequent captures can be of a known
       number (which clobbers any previous association with that number). So
       for instance you can reset the numbers for each alternative:

	   / $1 := (.*?) (\:)  (.*) { process $1, $2, $3 }
	   | $1 := (.*?) (=\>) (.*) { process $1, $2, $3 }
	   | $1 := (.*?) (-\>) (.*) { process $1, $2, $3 }
	   /

       It's also possible to refer to captures relative to the current
       location. "$-1" refers to the immediately preceding capture (what used
       to be known as $+). "$-2" refers to the one before that. If you use
       anything above "$-3" we'll come and take you away to the insane asylum.

       Subrules called via "<rule>" also capture their result in hypothetical
       variables. It's possible to name the results of any "<...>", but
       grammar rules already have a name by default, so you don't have to give
       them names unless you call the same rule more than once. So, presuming
       you have grammar rules defining "key" and "value", you can say:

	   / <key> \: <value> { let %hash{$key} = $value } /

       [Update: That is now written:

	   / <key> \: <value> { let %hash{$<key>} = $<value> } /

       ]

       Of course, in a typical grammar the typical rule may not return a
       string, but a reference to an anonymous object representing a node of
       the parse tree. But that depends on what the subrule decides to
       capture. If the only thing captured in the subrule is a single string,
       that's what you get. (If nothing is captured, you get the entire
       match.)

       Any capture that will capture multiple items will, if asked to put it
       into a scalar variable, produce an anonymous list automatically. This
       should rarely be a surprise, since it's obvious by inspection whether
       you've quantified the capture or not. So if you say any of:

	   / $x := <word>*/
	   / $x := <word>+/
	   / $x := <word><1,3>/

       [Update: That last should use "**{1..3}" instead.]

       then you've "pluralized" the naming, and you can expect to get some
       number of values in $x as an anonymous list. However, the "?"
       quantifier specifically doesn't pluralize. If you say:

	   / $x := <word>?/

       then $x will either be the result of the subrule or "undef".

       You can name the results of a zero-width assertion, but you'd typically
       only get the null string out of it. This can still be useful, since it
       contrasts with the undefined value you'd have if the assertion fails.
       (It is possible with an explicit capture to return a non-zero-width
       string from a zero-width assertion, however.)
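
       For example (a sketch):

	   / foo [ $eol := <before \n> ]? /    # $eol is "" if a newline follows,
					       # undef if it doesn't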

   Variable Scoping
       When you refer to a variable @foo as an rvalue in a regex, it searches
       for an existing variable in the following places:

       1. We first look to see if the variable is already declared lexically
       with either "my @foo" or "our @foo". If so declared, we're done.
       2. Next we look for @foo in the current regex's name table. The name of
       the variable is really $0{'@foo'}.
	   [Update: These $/ variables now have names like "@<foo>", so skip
	   this step.]

       3. If the regex belongs to a grammar, we next look for @foo in the
       grammar object. If there, its real name is "@.foo", or some such. (It
       might be objected that the grammar object is not yet constructed when
       the regex is compiled. After all, the regex is probably being passed to
       the grammar object's constructor. But I think if such a variable is
       declared as an object attribute we know that there will be such a
       variable/accessor later when we have finished constructing, and that
       seems like enough info to know how to compile the regex.)
	   [Update: I think we can skip this one too--just import the array if
	   you need it, or call it via a real rule accessor.]

       4. Next we look for @foo as a declared core global variable @*foo.
       5. Finally, if "strict vars" is not in effect, we assume that @foo is
       stored in the current package. Otherwise it's a stricture error.
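
       For instance, step 1 covers the common case:

	   my @keywords = ("if", "unless", "while");
	   / @keywords \b /    # found as a lexical, matched as literal alternatives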

   Variable Interpretation
       As we mentioned earlier, bare scalars match their contents literally.
       (Use "<$var>" instead to match a regex defined in $var.)	 Subscripted
       arrays and hashes behave just like a scalar as long as the subscripts
       aren't slices.

       If you use a bare array (unsubscripted), it will match if any element
       of the array matches literally at that point. (A slice of an array or
       hash also behaves this way.) If you say

	   @array = ("^", "$", ".");
	   / @array /

       it's as if you said

	   / \^ | \$ | \. /

       But if you slice it like this:

	   / @array[0..1] /

       it won't match the dot.

       If you want the array to be considered as a set of regex alternatives,
       enclose in angles:

	   @array = ("^foo$", "^bar$", "^baz$");
	   / <@array> /

       Bare hashes in a regex provide a sophisticated match-via-lookup
       mechanism. Bare hashes are matched as follows:

       1. Match a key at the current point in the string.
	   1a. If the hash has its "keymatch" property set to some regex, use
	   that regex to match the key.
	   1b. Otherwise, use "/\w+:/" to match the key.
       2. If a key isn't found at the current position in the string, the
       match fails.
       3. Otherwise, get the value in the hash corresponding to the matched
       key.
       4. If there is no entry for that key, the match fails.
       5. If the hash doesn't have a "valuematch" property, the match succeeds
       immediately.
       6. Otherwise use the hash's "valuematch" property (typically itself a
       regex) to extract the value at the current point in the string.
       7. If no value can be extracted, matching of the hash fails.
       8. If the extracted value string is "eq" to the key's actual value,
       matching of the original hash immediately succeeds.
       9. Otherwise, matching of the original hash fails.

       So matching a bare hash is equivalent to:

	   rule {
	       $key := <{ %hash.prop{keymatch} // /\w+:/ }>    # find key
	       <( exists %hash{$key} )>			       # if exists
	       [ <( not defined %hash.prop{valuematch} )> ::   # done?
		   <null>				       # succeed
	       |					       # else
		   $val := <%hash.prop{valuematch}>	       # find value
		       <( $val eq %hash{$key} )>	       # assert eq
	       ]
	   }

       A typical "valuematch" might look like:

	   rule {
	       \s* =\> \s*	       # match =>
	       $q:=(<["']>)	       # match initial quote
	       $0:=( [ \\. | . ]*? )   # return matched value
	       $q		       # match trailing quote
	   }

       In essence, the presence or absence of the "valuematch" property
       controls whether the hash tries to match only keys, or both keys and
       values.

       [Update: The above is all completely bogus.  A hash's set of keys are
       simply compiled into a longest-first token match table.	The
       corresponding value contains the closure or rule to fire off when you
       match, or a boolean to make the key match merely succeed.  See S5.]

       A hash may be used inside angles as well. In that case, it finds the
       key by the same method (steps 1 and 2 above), but always treats the
       corresponding hash value as a regex (regardless of any properties the
       hash might have). The parse then continues according to the rule found
       in the hash. For example, we could parse a set of control structures
       with:

	   rule { <%controls> }

       The %controls hash can have keys like "if" and "while" in it.  The
       corresponding entry says how to parse the rest of an "if" or a "while"
       statement. For example:

	       %controls = (
		   if	  => / <condition>	<closure> /,
		   unless => / <condition>	<closure> /,
		   while  => / <condition>	<closure> /,
		   until  => / <condition>	<closure> /,
		   for	  => / <list_expr>	<closure> /,
		   loop	  => / <loop_controls>? <closure> /,
	       );

       So saying:

	   <%controls>

       is really much as if we'd said:

	   [ if	    \b <%controls{if}>
	   | unless \b <%controls{unless}>
	   | while  \b <%controls{while}>
	   | until  \b <%controls{until}>
	   | for    \b <%controls{for}>
	   | loop   \b <%controls{loop}>
	   ]

       Only it actually works more like

	   / $k:=<{ %controls.prop{keymatch} // /\w+:/ }> <%controls{$k}> /

       [Update: This is also bogus.  The keys are treated as a set of rules
       with longest-first semantics for any leading literal components.	 See
       S5.]

       Note that in Perl 6 it's perfectly valid to use "//" inside an
       expression embedded in a regex delimited by slashes. That's because a
       regex is no longer considered a string, so we don't have to find the
       end of it before we parse it. Since we can parse it in one pass, the
       expression parser can handle the "//" when it gets to it without
       worrying about the outer slash, and the final slash is recognized as
       the terminator by the regex parser without having to worry about
       anything the expression parser saw.

       A bare subroutine call may be used in a regex, provided it starts with
       "&" and uses parentheses around the arguments. The return value of the
       subroutine is matched literally. The subroutine may have side effects,
       and may throw an exception to fail.
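
       For example (with "&delim" standing in for some routine of your own):

	   / &delim($style) (.*?) &delim($style) /    # each return value is
						      # matched literally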

   Defining Your Own Rules
       Suppose your name is Hugo and you don't like to use "!" to negate an
       assertion. You can define your own assertion like this:

	   my rule not (str $rx) { <!<{"<$rx>"}>> }	       # define Hugo not
	   / <not [a-z]> /     # same as <![a-z]>

       That rule would be lexically scoped because of the "my". If you think
       it looks like a "sub" declaration, you're right. In fact, it's possible
       you could even declare it anonymously like a closure:

	   my $not = rule (str $rx) { <!<{"<$rx>"}>> };
	   / <$not tonight dear> /

       But maybe you don't want it lexically scoped because you're writing a
       grammar for general use:

	   grammar HugoGrammar {
		rule not ($rx) { <!$rx> }
		rule identifier { <not before \d> \w+ }
		rule \j { \c[LF] }
		rule parse { ^ <identifier> \j $ }
	   }
	   HugoGrammar.parse($line);

       In this case a rule is simply a method in a grammar class, and a
       grammar class is any class derived implicitly or explicitly from the
       universal RULE grammar class. The built-in regex assertions like
       "<before \w>" are really just calls to methods in the RULE class. The
       namespace of a grammar is simply the method namespace of the current
       class, which is the class's methods plus all inherited methods.

       [Update: Nowadays it's the "Rule" class.]

       In addition to normal subrules, we allow some funny looking method
       names like:

	   rule :a { ... }
	   rule \a { ... }

       [Update: These are now handled as macro-like syntactic categories.  See
       S5.]

       Modules that mutate Perl's grammar on the fly can do so by deriving an
       anonymous grammar class from the default Perl6Grammar, and installing
       extra rules on the fly. The current regex state then continues parsing
       the rest of the lexical scope using some rule from the new rule set.
       Subsequent grammatical mutations will be derived from the current
       anonymous grammar unless you switch explicitly to an entirely different
       grammar.

       Since we're writing grammar rules as if they were methods, we have
       access to the full syntax of method declaration, including formal
       parameter lists and compile-time properties. So we can easily annotate
       rules with pragmatic information such as operator precedence levels
       when you don't want to write a strictly recursive-descent parser, for
       instance. (And we don't want to, for Perl.)

Accepted RFCs
   RFC 072: Variable-length lookbehind.
       This seems good to me. It's just a SMOP to reverse the ordering of
       nodes in the syntax tree, and I think we can pretty well determine when
       it's impossible to reverse the tree. The operation of a reversed syntax
       tree will not be totally transparent, however, so it will be necessary
       to document that quantifiers will actually be working right-to-left
       rather than left-to-right. (It's probably also a good idea to document
       that many syntactic constructs can't actually be reliably recognized in
       reverse. An attempt to do so probably means you needed to do a
       lookahead earlier, rather than a lookbehind later.)

       The syntax of lookbehind uses the new assertion syntax:

	   <after ...>	       # positive lookbehind
	   <!after ...>	       # negative lookbehind

       Yes, the "pos()" function could return multiple values in list context,
       but I think it's more reasonable for the individual captured elements
       to know where their positions are. The "pos" function is really just a
       special case of a more general data structure contained in the regex
       result object from the last successful match. In which case, maybe it
       really needs to have a better name than "pos". Maybe $0 or something.
       Then we get "$0.beg" and "$0.end", "$1.beg", and "$1.end", etc. Since
       @$0 returns a list of captures, you can do "@$0^.beg" and "@$0^.end" if
       you want a list of beginnings and endings. Did I mention that the
       magical "@+" and "@-" arrays are gonna be real dead? Never could
       remember which one was which anyway...

       [Update: $0 still represents the entire matched string, but the match
       object is now $/, and a list of all beginnings is returned by the
       hyperoperator "$/X.beg".	 But note that string positions are not
       necessarily integers in Perl 6.	They are tagged with the units of the
       string, so that you can't inadvertently mix byte, codepoint, or
       grapheme offsets.]

   RFC 093: Regex: Support for incremental pattern matching
       I don't think this proposal is powerful enough. "Infinite" strings are
       a more powerful concept. But I don't think infinite strings are
       powerful enough either!

       We're certainly going to have "infinite" arrays for which missing
       elements are defined by a generator (where the action could be as
       simple as reading more data from some other source). We could do the
       same thing for strings directly, or we could define strings that are
       implemented underneath via arrays (of strings or of stringifiable
       objects), and achieve infinitude that way. This latter approach has the
       benefit that the array element boundaries could be meaningful as zero-
       width boundaries between, say, tokens in a token stream. We're thinking
       that "<,>" could match such a boundary.

       But beyond that, such arrays-as-strings could allow us to associate
       hidden metadata with the tokens, if the abstract string is constructed
       from a list of objects, or a list of strings with properties. This is
       typically how a parser would receive data from a lexical analyzer. It's
       the parser's job to transform the linear stream of objects into a parse
       tree of objects.

       Matching against such boundaries or metadata would not be possible
       unless either the regex engine is aware that it is matching against an
       array, or the string emulation provides visibility through the abstract
       string into the underlying array. The latter may be preferable, since
       (by the rules of the "=~" matrix discussed in Apocalypse 4) "@array =~
       /regex/" is currently interpreted as matching against each element of
       the array individually rather than sequentially, and there are other
       uses for a string that's really an array. In fact, "@array =~ /regex/"
       could conceivably be matching against a set of infinite strings in
       parallel, though that seems a bit scary.

       [Update: The "=~" operator is renamed "~~", and it doesn't
       automatically "any-fy" an array anymore, so we could pretty easily make
       it work over an entire array as if it were a string.]

       Even if we don't care about the boundaries between array elements, this
       approach gives us the ability to read a file in chunks and not worry
       that the pattern won't match because it happens to span a boundary.

       It might be objected that matching against a subroutine rather than an
       infinite string or array has the benefit of not promising to keep
       around the entire string or array in memory. But this is not really a
       feature, since in general a regex can potentially backtrack all the way
       to the beginning of the string. And there's nothing to say that the
       front of the infinite string or array has to stay around anyway.
       Whether to throw away the head of a string or array should really
       depend on the programmer, and I don't think there's a more intuitive
       way to manage that than to simply let the programmer whack off the
       front of the string or array using operators like "substr" or "splice",
       or the new "<cut>" assertion.

       Indeed, the very existence of the string/array precludes the caching
       problem that RFC 316 complains about.

       The question remains how to declare such a string/array. If we decided
       to do a magical name identification, we could conceivably declare

	   my $@array;

       and then both $array and @array refer to the same object, but treated
       as a string when you say $array and as an array when you say "@array".
       One is tempted to set up the input routine by saying

	   my $@array is from { <$input> };

       Additional lines (or chunks) would then come from the "<$input>"
       iterator.

       But really, the infinite nature of the array is a feature of the
       underlying object, not the variable. After all, we want to be able to
       say

	   @array := 1..Inf;

       even with an ordinary array.

       So we could even make this work:

	   my $@array := <$input>;

       But I think we need to break the aliasing down, which will give us more
       flexibility at the expense of more verbiage:

	   my @array := <$input>;	       # @array now bound to iterator
	   my $array is ArrayString(@array);   # an ordinary tie

       That would let us do cool and/or sick things like this:

	   my @lines := <$article>;
	   my $_ is ArrayString(@lines);
	   s/^ .*? \n<2,> //;  # delete header from $_ AND @lines!
	   for @lines { ... }  # process remaining lines

       The "for" loop potentially runs forever, since @lines is implicitly
       extended from an iterator. The array is automatically extended on the
       end, but it's not automatically shifted on the front. So if you really
       did want the loop to run forever without exhausting memory, you'd need
       to say something like:

	   substr($_, 0, $_.pos, "");

       The same effect can be effected within a regex by asserting "<cut>",
       which makes the current position the new string beginning. (If you
       backtrack over "<cut>", the entire match will fail.)

       [Update: Now we can just say "@lines ~~ s/^.*? \n**{2...} //".]

   RFC 110: counting matches
       I think we can avoid using any options if we make a pattern count
       matches when used in a numeric context. If in doubt, make it explicit:

	   $count = +/foo/;

       If it turns out we do need an option, it'll probably be ":n".

   RFC 112: Assignment within a regex
       This RFC is basically covered by the "$foo:=(...)" notation, plus
       variations. The RFC claims that such assignments are not done till the
       end, except that they are done ahead of closures. I'd rather state it
       the other way around: it always appears that the current hypothetical
       binding is assigned if you check, but as long as the optimizer can
       determine that you aren't looking, it doesn't have to keep up
       appearances. Contrariwise, if $foo is just a fancy way of saying $1,
       there may in fact be no more overhead in maintaining $foo than $1.
       Either is really just pointing into a table of offsets into the string.
       That's assuming we get the scoping right on hypothetical variables.

       Some excerpts from the RFC:

       The camel and the docs include this example:
		  if (/Time: (..):(..):(..)/) {
		       $hours = $1;
		       $minutes = $2;
		       $seconds = $3;
		   }

	   This then becomes:

		 /Time: (?$hours=..):(?$minutes=..):(?$seconds=..)/

       Now that looks like this:

	     /Time\: $hours:=(..) \: $minutes:=(..) \: $seconds:=(..)/

       It may be appropriate for any assignments made before a code callout to
       be localized so they can be unrolled should the expression finally fail.

       Rather than localized (or temporized), they are hypothesized.

       The first versions of this RFC did not allow for backrefs. I now think
       this was a shortcoming. It can be done with "(??{quotemeta $foo})", but
       I find this clumsy; a better way of using a named back ref might be
       "(?\$foo)".

       Backrefs are now unified with hypothetical variables, so the issue
       doesn't arise. Just use $foo.

       [Update: Now "$<foo>", short for $/{'foo'}.]

       Using this method for capturing wanted content, it might be desirable
       to stop ordinary brackets capturing, and needing to use "(?:...)". I
       therefore suggest, as an enhancement to regexes, that with /b (bracket?)
       ordinary brackets just group, without capture - in effect they all
       behave as "(?:...)".

       There's no need for a "/b" now that we have "[...]" for non-capturing
       brackets.

   RFC 144: Behavior of empty regex should be simple
       I agree, the behavior should be simple. However, rather than always
       matching, I propose to make it an error to actually have a null
       pattern, or a null choice in a list of alternatives. Use an explicit
       "<null>" if that's what you mean. (It's not a problem if $foo is null
       in "/$foo/", since variables are now managed by the regex engine and
       not by interpolation.)

   RFC 150: Extend regex syntax to provide for return of a hash of matched
       subpatterns
       The "$foo:=(...)" notation essentially covers that case. One can say:

	   / %hash{foo}:=(...) %hash{bar}:=(...) /

       Fancier things can be done with closures.
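
       For example, building up the hash from a closure instead:

	   / (\w+) \: \s* (\S+) { let %hash{$1} = $2 } /   # one key/value pair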

   RFC 156: Replace first match function ("?...?") with a flag to the match
       command.
       Having a ":f" modifier seems like a reasonable way to do it:

	   m:f/.../

       Though it's vaguely possible we should be having a set of verbs that
       parse like "split":

	   split /.../
	   count /.../
	   first /.../

       It's not clear whether those are actually methods, and if so, on which
       object, the string or the regex. In any event, I don't think we have to
       nail that down quite yet. I'm accepting the basic premise of this RFC
       that the "?...?" construct is going away, one way or another.

       At the moment, it looks like this option is spelled ":once".

   RFC 165: Allow Variables in tr///
       If interpolation of patterns by default is wrong, I think extending the
       "tr///" interface via scalar interpolation is doubly wrong. Run-time
       generated transliterations should be based on mappings that aren't so
       position dependent. That is, rather than specifying it as two long
       lists:

	   abc12xyz => ABC34XYZ

       we specify something more like this:

	   abc => ABC
	   12  => 34
	   xyz => XYZ

       That looks more like a list of pairs of scalars than a pair of scalars.
       In fact, internally, it's done like a funny parallel substitution:

	   s:e(/a/A/,
	       /b/B/,
	       /c/C/,
	       /1/3/,
	       /2/4/,
	       /x/X/,
	       /y/Y/,
	       /z/Z/)

       In any event, it's more like "tr/@foo/@bar/" than "tr/$foo/$bar/".  But
       then, why stick with the fake string notation? Why not just say
       "tr(@foo,@bar)" if that's what we mean? Then we're not limited to
       character substitutions:

	   $string.tr [ " "	, "<"	, ">"	, "\n"	 ],
		      [ " ", "<", ">", "<br>" ];

       Or how about "tr(%trans)"?

	   %upper = {
	       "a-z" => "A-Z",
	   }
	   $string.tr %upper;

       or just pair lists of some sort:

	   $string.tr("a-c" => "x-z",
		      "1-2" => "3-4",
		      "A-C" => "X-Z",
		     );
	   @trans = [
	       "a-z" => "A-Z",
	       @tr_danish,
	   ];
	   $string.tr(@trans)

   RFC 166: Alternative lists and quoting of things
       Alternative lists of literals are included simply by mentioning the
       array:

	   /@names/

       Alternative lists of subrules are included with:

	   /<@names>/

       There's no longer any need for quoting constructs because variables
       match as literals by default. You have to use angle brackets to get
       interpretation of a string as a subrule. (But it's still preferable to
       precompile your regexen.)

   RFC 191: smart container slicing
       As proposed, this might prevent us from using a regex object as a key
       to a hash. However, with some tweaking, it'll fit in with how slicing
       is done in Perl 6.

       Perl 6 will DWIM subscripts based on their appearance. Obviously,

	   %hash{"foo"}

       has a single subscript. And just as obviously,

	   %hash{"a" .. "z"}

       has 26 subscripts or so. In the absence of any scalar guidance, a
       subscript will be interpreted in list context. So

	   %hash{ @array }

       will automatically slice on the list of keys in the array. Any function
       will be called in a list context by default, giving it the opportunity
       to return multiple values. Perl 6 subscripts are naturally biased
       toward slicing. To unbias it, here are some of the specifically
       recognized subscripts:

	   %hash{"foo"}	       # scalar literal
	   %hash{bar}	       # scalar literal

       [Update: Now "%hash<bar>" instead.]

	   %hash{1}	       # scalar literal
	   %hash{$x}	       # scalar variable
	   %hash{\$x}	       # scalar reference
	   %hash{["a", "b"]}   # array reference
	   %hash{{"a" => "b"}} # hash reference
	   %hash{ "a" => "b" } # pair reference
	   %hash{ /pat/ }      # rule reference
	   %hash{ _ expr }     # force expr to return a single string

       [Update: Now "~" instead of "_".	 Also, active slicing with closures is
       done with a ".slice" method on either arrays or hashes.	Ordinary
       subscripts always assume integer slices for arrays and string slices
       for hashes (by default).	 The optimizer is allowed to assume that in
       @foo[@bar], the @bar array returns a list of integers.]

	   %hash{ + expr }     # force expr to return a single number

       Boolean expressions and closures look like singular values but cause a
       match against all possible values of the subscript.

	   %hash{ ?1 }	       # select all subscripts
	   %hash{ ?/pat/ }     # select subscripts for which pat matches
	   %hash{ $_ =~ /pat/ }# select subscripts for which pat matches
	   %hash{ $_ ge "a" }  # select lowercase keys (assuming ASCII)
	   %hash{ .ge "a" }    # same thing, maybe
	   %hash{ { expr } }   # select subscripts for which closure returns true

       Multiple slice subscripts are separated by semicolons, so that you can
       use commas within each slice subscript for list building. This is more
       important for multi-dimensional arrays:

	   my @array is dim(9,9,9) = cubic();
	   @3d_slice = @array[ @x; @y; @z ];
	   @3d_slice = @array[ 0,1,3,8 ; 0,1,3,8; ?1 ];
	   @3d_slice = @array[ 0..9 ; 0..9:-1; ?test($_) ];
	   @3d_slice = @array[ !($_ % 2) ; 0..9:3; ?test($_) ];

       [Update: "dim" is now spelled "shape".  See S9.]

   RFC 274: Generalized Additions to Regexs
       This proposal has significant early/late binding issues. A definition
       that forces run-time overhead is not as useful as it might be. On the
       other hand, a pure compile-time mechanism is not as general as it might
       be--but a compile-time mechanism can always compile in a run-time
       mechanism if it chooses to defer evaluation.

       So it seems like this is a good place for syntactic warpage of some
       sort or other. That would make it possible to do both compile-time and
       run-time bindings. We'll be using the "<...>" notation for our
       extensible syntax, and the grammar rules for parsing that particular
       part of Perl syntax will be just as easy to tweak as any other Perl
       grammar rule.

       That being said, the very fact that we can associate a grammar with the
       regex means that it's easy to define any instance of "<word>" to mean
       whatever you want it to. (In a sense, these subrules are the very
       callbacks that the RFC proposes.) These subrules can be bound either at
       Perl compile time or at Perl run time. They can be defined to take a
       string, regex, or Perl expression as an argument. The latter two cases
       are efficient because they come in as a regex or code reference
       respectively.

       Following on, if (?{...}) etc. code is evaluated in a forward match, it
       would be a good idea to likewise support some code block that is
       ignored on a forward match but is executed when the code is unwound due
       to backtracking.

       Yes, though hypothetical values take some of the pressure off for this.
       But if a closure contained a BACK block, it could be automatically
       fired off on backtracking. As with LAST et al., I suppose there's a
       corresponding "back" property on variables. In a sense, saying

	   let $var = $newval

       is much like saying

	   our $var is back { .set($oldval) } = $newval

       except that $var may well be stored in the regex state object rather
       than in a global symbol table.

   RFC 276: Localising Paren Counts in qr()s.
       I agree totally. As for the problem of pulling captures out of a
       subrule, it's up to the subrule to determine what it "returns". We
       could make some intelligent defaults, though different kinds of rules
       might want different defaults. One approach might be to say that if
       there is a single capture, that is returned as the result. If there is
       no capture, it's as if the entire subpattern were captured. If there
       are multiple captures, they are returned as an anonymous list. So $1
       from such a subrule might come through like this:

	   / $sub:=<subrule> { print $sub[1] } /

       or just:

	   / <subrule> { print $subrule[1] } /

       But named captures and named rules intrude on this idyllic picture.
       You'd also like a default anonymous hash value returned that is keyed
       by all the named captures or rules. The question is whether that forces
       numbered captures to come through the hash interface. Or maybe that's
       just always the case, so to get at $1 of a subrule, you'd say:

	   / $sub:=<subrule> { print $sub{'1'} } /

       But there are reasons for wanting to treat the result object as an
       array, so that

	   / $sub:=<subrule> { process(@$sub) } /

       processes all the numbered captures from the subrule. So I think the
       return object behaves either like a hash or an array as appropriate.
       (Note that such an array might be declared to have an origin at 1
       rather than 0.)

       [Update: We're steering clear of 1-based arrays for now.]
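       [Illustration: this is roughly what today's Match objects do; one
       result object supports both associative access (named captures) and
       positional access (numbered captures). A minimal sketch:]

	   my $m = '2024-06' ~~ / $<year>=[\d ** 4] '-' (\d ** 2) /;
	   say $m<year>;   # ｢2024｣ -- named capture via the hash interface
	   say $m[0];      # ｢06｣   -- numbered capture via the array interface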

   RFC 317: Access to optimisation information for regular expressions
       Seems like a no-brainer. All such information is likely to be available
       to Perl anyway, given that we'd like to do the parser, optimizers, and
       code generators in Perl if at all possible.

   RFC 331: Consolidate the $1 and "\1" notations
       I like the title of this RFC. It fits in with the new "my" policy of
       immediate introduction. However, there are certain difficulties with
       the proposed implementation. The statement-by-statement setting of the
       "@/" array looks pretty ugly to me. I'd rather have a consistent view
       of hypothetical variables that can live on outside the regex in
       question without regard to statement boundaries. In the rare event that
       someone needs to refer to $1 (or anything else) from a prior regex, a
       temporary variable should be used.

   RFC 332: Regex: Make /$/ equivalent to /\z/ under the '/s' modifier
       Another RFC that is accepted in principle, but that doesn't go far
       enough. The "/s" modifier is going away, along with "/m". A "$" will
       always mean end-of-string, and $$ will match at the end of any line.
       (The current process id is now $*PID, by the way, so there's no
       conflict there. But how often do you want to write a pattern to match
       the current process id anyway?)
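       [Illustration: this is how it came out in today's Raku, where "$"
       matches only at end of string and "$$" at the end of every line:]

	   my $text = "one\ntwo";
	   say $text.match(/ \w+ $$ /, :g);   # (｢one｣ ｢two｣)
	   say $text.match(/ \w+ $  /);       # ｢two｣ -- end of string only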

   RFC 348: Regex assertions in plain Perl code
       This RFC makes some good points, though the code assertion syntax will
       be:

	   <( code )>

       The RFC also makes a case for getting rid of the special behavior of
       "local" in Perl 5, which treated "local" differently within a regex.
       However, something very like the "local" behavior will still be needed
       for making hypotheses, though the RFC is correct that it's not needed
       in the typical code assertion. In Perl 6, localization is done with
       "temp", and it will not do the hypothetical variable hack that Perl 5
       did. Instead there will be an explicit lvalue modifier, "let", which
       specifically requests a variable's value to be scoped to the success of
       the current point in the regex. These hypothetical variables actually
       have much broader use than this RFC suggests.

       Perl 5's hardwired use of $^R just translates to an appropriately named
       hypothetical variable in Perl 6.
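       [Illustration: in today's Raku the code assertion is spelled
       "<?{...}>", and a bare "{...}" block simply runs as the engine
       passes it. A minimal sketch of the latter:]

	   my $count = 0;
	   'aaa' ~~ m:g/ a { $count++ } /;   # the block runs once per match
	   say $count;                       # 3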

   RFC 360: Allow multiply matched groups in regexes to return a listref of
       all matches
       I think that parens that can potentially match multiple times will
       automatically produce a list rather than matching the final one. It's
       not as if we can't tell whether something's inside a quantifier...

       Here's the RFC's proposed solution:

	   while ($text =~ /name:\s*(.*?)\n\s*
			   children:\s*(?:(?@\S+)[, ]*)*\n\s*
			   favorite\ colors:\s*(?:(?@\S+)[, ]*)*\n/sigx) {
	       # now we have:
	       #  $1 = "John Abajace";
	       #  $2 = ["Tom", "Dick", "Harry"]
	       #  $3 = ["red", "green", "blue"]
	   }

       Apart from the change in behavior of "(...)" within a quantifier, I
       have the urge to rewrite this example for several reasons:

       ·   The "/x" and "/s" flags no longer exist.

       ·   The "/i" and "/g" flags must be pulled out to the front for
	   visibility. (And the "/g" flag is renamed ":e".)

       ·   There's now a "\h" for horizontal whitespace, and "\H" for the
	   negation of that. (Not that the RFC is incorrect to use "\s".)

       ·   The negation of "\n" is now "\N".

       ·   The ":" character is now a metacharacter, and so must be
	   backslashed.

       ·   Character classes are now represented with "<[...]>".

       ·   Grouping is now represented with "[...]".

       With these changes, and making better use of whitespace, the sample
       regex ends up looking like this:

	   for ($text =~ m:ie[
				   name		    \: \h*   (\N*?)	       \n
			       \h* children	    \: \h* [ (\S+) <[,\h]>* ]* \n
			       \h* favorite\ colors \: \h* [ (\S+) <[,\h]>* ]* \n
			     ]
		 )
	   {
		    # now we have:
		    #  $1 = "John Abajace";
		    #  $2 = ["Tom", "Dick", "Harry"]
		    #  $3 = ["red", "green", "blue"]
	   }

       I think in the long run people will find this more readable once
       they're used to it. Certainly tabularizing the parallelisms will make
       any typing errors stand out.

       [Update: The ":ie" is now written ":i:g".]

   RFC 361: Simplifying "split()"
       The RFC makes five suggestions. I'll consider them one by one.

       The first argument to split is currently interpreted as a regexp,
       regardless of whether or not it actually is one. (Yes, "split '.',
       $foo" doesn't split on dot -- it's currently the same an "split /./,
       $foo".) I suggest that split be changed to treat only regexps as
       regexps, and everything else as literals.

       Fine, I think. If the first argument to "split" is untyped, it should
       parse correctly, either evaluating a quoted string immediately or
       deferring interpretation of a regex. One could even do something like
       split on the first delimiter matched by another pattern:

	   split _/(,|;)/;

       That would split on either all commas or all semicolons, depending on
       which it found first in the string. The _ forces the regex to return a
       string, which is whatever was captured by the parens in this case.

       [Update: "_" is now "~".]

       Empty trailing fields are currently suppressed (although a -1 as the
       third argument disables this). I suggest that empty trailing fields be
       retained by default.

       Probably okay, though we need a way to translate old code. It was
       originally done this way because split on whitespace would typically
       return an extra field after the newline. But most newlines will be
       prechomped in Perl 6.
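       [Illustration: today's Raku keeps empty trailing fields by default
       and offers ":skip-empty" to discard them:]

	   say 'a,b,,'.split(',').elems;                # 4 -- empties kept
	   say 'a,b,,'.split(',', :skip-empty).elems;   # 2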

       When not in list context, split currently splits into @_. I suggest
       that this side-effect be removed.

       Fine. It's easy enough to translate to an explicit assignment.

       "split ?pat?" in any context currently splits into @_. I suggest that
       this side-effect be removed.

       Fine. I don't think anyone uses that.

       "split ' '" (but not "split / /") currently splits on whitespace, but
       also removes leading empty fields. I suggest that this irregularity be
       removed.

       The question is, what to replace it with, since it's a very handy
       construct. We could use a different conventional pattern:

	   @array = split /<ws>/, $string;

       Or we could say that it's now a split on whitespace only if the split
       argument is unspecified. That wouldn't work very well with the old
       syntax, where we often have to supply the second argument. But given
       that the "=~" operator now serves as a topicalizer for any term, we
       could translate:

	   @array = split ' ', $string;

       to this:

	   @array = $string =~ split;

       Oddly, this probably also works:

	   $string =~ (@array = split);

       or maybe even this:

	   @array = split given $string;

       But I think I like the OO notation better here anyway:

	   @array = $string.split;

       In fact, split may not be a function at all. The default split might
       just be a string method and use unary dot:

	   @array = .split;

       We still have the third argument to deal with, but that's likely to be
       specified like this:

	   @array = $string.split(limit => 3);

       We could conceivably make a different method for word splitting, much
       like REXX does:

	   @array = .words;

       Then a limit could be the first argument:

	   @array = .words(3);

       But there almost doesn't need to be such a method, since

	   @array = m/ [ (\S*) \s* ]* /;

       will do the right thing. Admittedly, a ".words" method would be much
       more readable...

       Fortunately, "split" is a function, so I can put off that decision till
       Apocalypse 29. ":-)"

       [Update: At the moment I think there's a ".words" multimethod on
       strings.]
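       [Illustration: in today's Raku ".words" splits on whitespace,
       ignores leading and trailing whitespace, and takes an optional
       limit as its first argument:]

	   say ' The quick  brown fox '.words;      # (The quick brown fox)
	   say ' The quick  brown fox '.words(2);   # (The quick)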

Rejected RFCs
   RFC 135: Require explicit m on matches, even with ?? and // as delimiters.
       Squish that gnat... ":-)"

       A decent Perl parser is still going to have to keep track of whether a
       term or an operator is expected. And while we're simplifying the
       grammar in many ways, it's also the case that we're letting users
       install their own grammar rules to perform syntactic warpage. Besides,
       people like to write patterns with "/.../". So rather than
       impoverishing Perl's syntax artificially, let's make the standard
       parser more accessible by writing it all in Perl 6 regexes.

   RFC 145: Brace-matching for Perl Regular Expressions
       Good problem, not-so-good solution from a complexity point of view. I'd
       like to leverage existing character class and backref notations maybe.
       If there were simply some way to tell a backref to invert any match
       characters, that might do it. Or maybe reverse them when you remember
       them, and leave the backref ignorant? (The downside is that nested
       brackets would probably need recursive patterns.)

       Recursion might be advisable anyway--you can't really pick up the
       arguments to a function, for instance, without also handling things
       like quoted strings, which may have different bracketing rules than
       outside of strings. Certainly matching "\"" would be dependent on
       whether you're inside or outside of a string. Given that recursion is
       often necessary, I'm not sure making this construct recurse itself is
       all that useful.

       Along the lines of how "tr///" works (or ought to work), I think it'd
       be more generally useful to have character remapping facility within a
       backref generator:

	   (
	    <[ \( \[ \{ \< ] =>
	     [ \) \] \} \> ]> )

       That might match a left bracket of some sort but return the
       corresponding right bracket as $1. But maybe we should just use an
       "existing" mechanism to translate strings:

	   my %closing = {
	       '[' => ']',
	       '(' => ')',
	       '{' => '}',
	       '<' => '>',
	   };
	   rule balanced {
	       <![\[\(\{\<\]\)\}\>]>*  # any non-brackets
	       [		       # followed by either
		   $		       #   end of string
	       |		       # or
		   $b := <[[({<]>      #   an opening bracket
		   <self>	       #   containing a balanced expr
		   %closing{$b}	       #   followed by corresponding close bracket
		   <self>	       #   followed by a balanced expr
	       ]
	    }
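       [Illustration: today's Raku can express the recursive half of this
       with the "~" goal-matcher and "<~~>" self-recursion; a sketch for
       round brackets only:]

	   my regex balanced { '(' ~ ')' [ <-[()]> | <~~> ]* }
	   say '(a(b)c)' ~~ /^ <balanced> $/;   # matches the whole string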

   RFC 164: Replace =~, !~, m//, s///, and tr// with match(), subst(), and
       trade()
       All operators will have a way to name them, which means it's possible
       to alias them to any other name. Rearranging the formal order of
       parameters would be a little harder, however. We need inlining to do
       that efficiently. Still, now that "//" doesn't evaluate in a typeless
       context, it's relatively straightforward to define a subroutine or
       method that does

	   subst $string, /foo/, {"bar"}

       in whatever order you like.

       [Update: We're defining standard ".match", ".subst", and ".trans"
       methods for strings as alternatives to the quoted syntaxes.]
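       [Illustration: all three methods exist in today's Raku:]

	   say 'foo bar'.match(/b.r/);          # ｢bar｣
	   say 'foo bar'.subst(/foo/, 'baz');   # baz bar
	   say 'hello'.trans('el' => 'ip');     # hippo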

   RFC 197: Numeric Value Ranges In Regular Expressions
       If we go down this road, eventually we reinvent all of Perl syntax in
       regular expressions. Not that I'm against TMTOWTDI, but I'd rather have
       a better way to run Perl code from within a regex and have it "succeed"
       or "fail", and maybe better ways to test ranges from Perl code.
       Anything beyond that could be done with syntactic warpage.

       In any event, overloading "()" and "[]" for this would be mentally
       treacherous, not to mention completely opaque to non-mathematicians.
       We'll stick with the standard boolean assertion:

	   / (\d+) <( $1 =~ 1..10 )> /

       Interestingly, that can also be written:

	   / <( _/\d+/ =~ 1..10 )> /

       [Update: That'd be "<( ~/\d+/ ~~ 1..10 )>" these days.]
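       [Illustration: in today's Raku the assertion is spelled "<?{...}>"
       and smartmatching is "~~"; anchoring the capture makes the failure
       visible rather than letting the engine backtrack to a shorter match:]

	   say '7'  ~~ / ^ (\d+) $ <?{ +$0 ~~ 1..10 }> /;   # matches
	   say '42' ~~ / ^ (\d+) $ <?{ +$0 ~~ 1..10 }> /;   # Nil -- out of range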

   RFC 198: Boolean Regexes
       Again, I'm not much in favor of inventing new regex syntax that
       duplicates ordinary Perl syntax. I think we need richer ways of
       interconnecting related regexes via ordinary Perl syntax. Certainly it
       helps to have an easy way to specify a Perl assertion:

	   / (\w+) <( %count{$1} > 3 )> /

       But there's something to be said for forcing submatch assertions to be
       defined externally to the current regex, much like we discourage inline
       code where subroutine calls are in order.

       So anyway, I think most of the nested, onion-ring-style submatches
       should be handled simply by searching the captured strings within a
       closure.
       Booleans can be put into closures as well, but the new "::" operator
       makes it pretty easy to AND and OR assertions together in a more
       regexly fashion without reinventing the wheel.

       As proposed, there will be a "fail" token, but it's spelled "<fail>",
       not "\F". And the "true" token is spelled "<null>". ":-)"

   RFC 261: Pattern matching on perl values
       This reminds me a bit of unification in Prolog. It's not explained very
       well here, and I'm wondering if it will be too hard to explain in
       general. I think this is probably too powerful a concept for the
       typical Perl programmer, who is lucky to understand simple lvalues that
       always do what they're told.

       This sort of matching can probably be provided as syntactic warpage,
       though I'm not sure if that prevents useful optimizations. Anyway, this
       sort of thing is unlikely to make it into the Perl 6 core unless it
       generalizes usefully to function argument lists, and it may be too
       powerful for there too. For that purpose it would resemble a form of
       overloading, but with the "types" specified by keys. I suspect real
       types are more useful.

   RFC 308: Ban Perl hooks into regexes
       We must be able to call back into Perl code if we want to write parsers
       conveniently in Perl. Think of how yacc works. Certainly the way that
       Perl 5 does it is ugly, I'll admit. We can beautify that.

       But the whole point of Perl is to have all the most useful "Krakken
       tentacles". And I don't really care if it makes it hard to put the Perl
       regex engine into some other language. ":-)"

   RFC 316: Regex modifier for support of chunk processing and prefix matching
       Infinite strings (via infinite arrays) seem like a more useful concept.
       It would be easy for the extension subroutine to fail and produce the
       results desired in this RFC, but without the necessity of the extra
       syntax specified by the RFC. A match naturally fails when it gets to
       the end of its string without finishing the pattern. Incremental
       matching can also easily be done via infinite strings, and the user
       interface can be as simple as we like, as long as the extension rule is
       somehow associated with the string in question.

       I think "pos()" is rather too low-level a concept for general use.
       Certainly it needs to be there, but I think we need some way of
       implying that one regex is a continuation of a previous one, but within
       some higher-level syntactic construct, so that it's easy to write
       parsers without invoking "pos()" or "\g" or "/c" all over the place.

       [Update: That turns out to be the ":p" modifier.]
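       [Illustration: ":pos" (short form ":p") anchors a match at a given
       offset in today's Raku:]

	   my $s = 'ab12cd';
	   say $s ~~ m:pos(2)/ \d+ /;   # ｢12｣ -- must start at index 2
	   say $s ~~ m:pos(1)/ \d+ /;   # fails -- 'b' is at index 1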

"<cut>"
       Well, I could say a lot more, but that's it for this time. I hope
       you're excited by all this, in a positive sort of way. But if your jaw
       lost all of its bounce when it hit the table, I expect Damian's
       upcoming Exegesis 5 will do a better job of showing how this all fits
       together into a pretty picture.

perl v5.14.0			  2006-02-28		  Perl6::Bible::A05(3)