Using regex and terminal attributes
The #[teleparse(regex(...))] and #[teleparse(terminal(...))] attributes
define the pattern the lexer matches for a token type and the terminals
of that token type that appear in the syntax tree.
The simplest example defines a single terminal struct,
along with a regex that tells the lexer to produce the token type when the
remaining source code matches it.

    use teleparse::prelude::*;

    #[derive_lexicon]
    #[teleparse(terminal_parse)]
    pub enum MyToken {
        #[teleparse(regex(r"\w+"), terminal(Ident))]
        Ident,
    }

    fn main() {
        assert_eq!(Ident::parse("hell0"), Ok(Some(Ident::from_span(0..5))));
    }

You can also add terminals that must match a specific literal value.

    use teleparse::prelude::*;

    #[derive_lexicon]
    #[teleparse(terminal_parse)]
    pub enum MyToken {
        #[teleparse(regex(r"\w+"), terminal(Ident, KwClass = "class"))]
        Word,
    }

    fn main() {
        let source = "class";
        // can be parsed as Ident and KwClass
        assert_eq!(Ident::parse(source), Ok(Some(Ident::from_span(0..5))));
        assert_eq!(KwClass::parse(source), Ok(Some(KwClass::from_span(0..5))));
        // other words can only be parsed as Ident
        let source = "javascript";
        assert_eq!(Ident::parse(source), Ok(Some(Ident::from_span(0..10))));
        assert_eq!(KwClass::parse(source), Ok(None));
    }

Note that there’s no “conflict” here! The regex is for the lexer,
and the literals are for the parser. When seeing “class” in the source,
the lexer will produce a Word token with the content "class".
It is up to the parsing context whether an Ident or a KwClass is expected.
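
The split between lexer and parser can be illustrated with a hand-written sketch that does not use teleparse at all. Everything in it is a hypothetical stand-in rather than the code derive_lexicon generates: the names `lex_word`, `parse_ident` and `parse_kw_class` are made up for illustration, and `\w` is approximated as ASCII letters, digits and underscore.

```rust
use std::ops::Range;

/// Hypothetical stand-in for the lexer rule `\w+`, approximated as
/// ASCII letters, digits and underscore. Returns the span of the word
/// at the start of `source`, if there is one.
fn lex_word(source: &str) -> Option<Range<usize>> {
    let len = source
        .bytes()
        .take_while(|b| b.is_ascii_alphanumeric() || *b == b'_')
        .count();
    if len == 0 {
        None
    } else {
        Some(0..len)
    }
}

/// Parsing context that expects an identifier: any Word token is accepted.
fn parse_ident(source: &str) -> Option<Range<usize>> {
    lex_word(source)
}

/// Parsing context that expects the keyword: the same Word token is
/// accepted only if its content is exactly "class".
fn parse_kw_class(source: &str) -> Option<Range<usize>> {
    lex_word(source).filter(|span| &source[span.clone()] == "class")
}

fn main() {
    // The lexer produces the same token for "class" in both cases;
    // only the expectation of the parser differs.
    assert_eq!(parse_ident("class"), Some(0..5));
    assert_eq!(parse_kw_class("class"), Some(0..5));
    assert_eq!(parse_ident("javascript"), Some(0..10));
    assert_eq!(parse_kw_class("javascript"), None);
}
```

The lexing step is identical in both functions, which is why the literal "class" does not conflict with the regex.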
When such literals are specified for the terminals
along with the regex for the variant, derive_lexicon
performs compile-time checks to make sure the literals
make sense.
For each literal, both of the following must hold:
- the regex has a match in the literal that starts at the beginning (position 0)
- that match is not a proper prefix of the literal
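
The two conditions can be sketched as a small checking function. This is only an illustration under a simplifying assumption: the "regex" is treated as a plain literal pattern, which is enough for the board and key cases discussed in this section. The names `LiteralCheck`, `match_len_at_start` and `check_literal` are hypothetical and not part of teleparse.

```rust
/// Outcome of checking one terminal literal against a lexer pattern.
#[derive(Debug, PartialEq)]
enum LiteralCheck {
    Ok,
    /// The pattern has no match starting at position 0 of the literal.
    NoMatchAtStart,
    /// The pattern matches only a proper prefix of the literal.
    MatchesProperPrefix,
}

/// Simplifying assumption: the pattern is a plain string, so a match at
/// position 0 exists only if the literal starts with the pattern.
fn match_len_at_start(pattern: &str, literal: &str) -> Option<usize> {
    literal.starts_with(pattern).then(|| pattern.len())
}

/// Applies the two conditions in order.
fn check_literal(pattern: &str, literal: &str) -> LiteralCheck {
    match match_len_at_start(pattern, literal) {
        None => LiteralCheck::NoMatchAtStart,
        Some(len) if len < literal.len() => LiteralCheck::MatchesProperPrefix,
        Some(_) => LiteralCheck::Ok,
    }
}

fn main() {
    assert_eq!(check_literal("board", "keyboard"), LiteralCheck::NoMatchAtStart);
    assert_eq!(check_literal("key", "keyboard"), LiteralCheck::MatchesProperPrefix);
    assert_eq!(check_literal("class", "class"), LiteralCheck::Ok);
}
```

A real check would ask the regex engine for the length of the match at position 0 instead of using `starts_with`, but the decision made on that length would have the same shape.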
For the first condition, suppose the regex is board and the literal is keyboard.
Since board does not match at the start of keyboard, the lexer will never be
able to emit keyboard for this token type: it only produces a token when the
rest of the input starts with board.

    use teleparse::prelude::*;

    #[derive_lexicon]
    pub enum MyToken {
        #[teleparse(regex("board"), terminal(Board, Keyboard = "keyboard"))]
        DoesNotMatchTerminal,
    }

    fn main() {}

Compiling this produces:

    error: This regex does not match the beginning of `keyboard`. This is likely a mistake, because the terminal will never be matched
     --> tests/ui/lex_regex_not_match_start.rs:5:23
      |
    5 |     #[teleparse(regex("board"), terminal(Board, Keyboard = "keyboard"))]
      |                       ^^^^^^^

For the second condition, suppose the regex is key and the literal is keyboard.
The lexer will again never be able to emit keyboard:
- If it were to emit keyboard of this token type, the rest of the input must start with keyboard
- However, if so, the lexer would emit key instead

    use teleparse::prelude::*;
    #[derive_lexicon]
    pub enum MyToken {
        #[teleparse(regex("key"), terminal(Key, Keyboard = "keyboard"))]
        DoesNotMatchTerminal,
    }

    fn main() {}

Compiling this produces:

    error: This regex matches a proper prefix of `keyboard`. This is likely a mistake, because the terminal will never be matched (the prefix will instead)
     --> tests/ui/lex_regex_not_match_is_prefix.rs:4:23
      |
    4 |     #[teleparse(regex("key"), terminal(Key, Keyboard = "keyboard"))]
      |                       ^^^^^