How to make an elegant tokenizer in C
Definitions
Understanding the concepts of lexer, parser, and tokenizer can be challenging when first encountered. During my work on a shell parser project at 42, I encountered many different definitions of these terms that initially seemed overwhelming.
However, these concepts are simpler than they first appear. Let me break them down clearly.
Token
A “token” is a simplified representation of an element in the lexical context.
Example with the shell language:
ls && pwd
Here, we have 3 tokens of 2 types:
ls is TOK_WORD
&& is TOK_AND
pwd is also TOK_WORD
The TOK_* values are members of an enum (shown in the Implementation section below).
This abstraction helps us avoid repeatedly analyzing the same word during processing.
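To make the abstraction concrete, a token can be stored as a small record pairing the type with the matched text. This is a minimal sketch; the s_token name and its fields are illustrative, not taken from the original code (t_toktype is the enum shown in the Implementation section below):

#include <stddef.h>

typedef struct s_token
{
    t_toktype   type; /* e.g. TOK_WORD for "ls", TOK_AND for "&&" */
    const char  *str; /* where the lexeme starts in the input line */
    size_t      len;  /* how many characters it spans */
}   t_token;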
Lexer
The lexer defines the rules for tokenization. Its role is to check the compatibility of the current context with the next one. At this step, we work at the character level, not at the token level (which is the parser’s role).
For example, we might define these rules:
A WORD token would match: [a-zA-Z0-9]
An OPERATOR token would match: [&|><!;]
(These are simplified examples for illustration.)
In simpler terms:
- A WORD can contain any lowercase letter (a-z), uppercase letter (A-Z), or digit (0-9)
- An OPERATOR can be any of these characters: &, |, >, <, !, ;
These rules allow us to differentiate one token from another.
For instance, ls is a word, while >> is an operator (more specifically, a redirection).
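To make these two rules concrete before moving to the table-driven version, here is the naive check one might write first (a sketch; the helper names are hypothetical):

#include <string.h>

/* Naive encoding of the two simplified rules above. */
static int is_word_char(char c)
{
    return ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
        || (c >= '0' && c <= '9'));
}

static int is_operator_char(char c)
{
    /* the '\0' guard prevents strchr from matching the terminator */
    return (c != '\0' && strchr("&|><!;", c) != NULL);
}

The CHR_CLASS approach shown below replaces these per-character comparisons with a single table lookup.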
Tokenizer
The tokenizer is the code that applies the lexer’s rules to the input string, producing the full stream of tokens.
Parser
The parser checks the compatibility between different contexts (at the token level).
For more details on parser implementation, see “Recursive Descent Parser” in the references section.
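Although the parser itself is out of scope here, the same table idea extends one level up: a token-compatibility table could mark which token may follow which. This is purely illustrative; the g_token_tok_rules name and its contents are assumptions, not part of the original code:

/* Hypothetical token-level counterpart of the character rules
** shown later: 1 means "this token may follow that one". */
static int g_token_tok_rules[TOK_MAX][TOK_MAX] =
{
    [TOK_WORD] = {
        [TOK_WORD] = 1,   /* "ls -l": a word may follow a word */
        [TOK_AND_IF] = 1, /* "ls &&": && may follow a word     */
    },
    [TOK_AND_IF] = {
        [TOK_WORD] = 1,   /* "&& pwd": a word may follow &&    */
    },
    ...
};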
Implementation
Let’s examine an example using shell grammar:
typedef enum e_toktype {
    TOK_ERROR,
    ...
    TOK_WORD,
    TOK_AND_IF,
    TOK_OR_IF,
    ...
    TOK_MAX
} t_toktype;
Here we have our token types.
How do we recognize words efficiently? Instead of using string splitting and switch cases, we use a more elegant approach: CHR_CLASS.
Lexer + CHR_CLASS
We define enums to replace basic character recognition (like s[i] == '='):
typedef enum e_chr_class {
    CHR_ERROR,
    CHR_SP,
    CHR_DASH,
    CHR_BANG,
    CHR_AND,
    CHR_SEMI,
    CHR_WORD,
    ...
    CHR_MAX
} t_chr_class;
Now let’s see how we achieve complete abstraction using these enums:
/* 256 entries so every unsigned char value (0-255) has a slot;
** the 'A'...'Z' range syntax is a GNU C designated-initializer extension. */
static t_chr_class g_get_chr_class[256] =
{
    [' '] = CHR_SP,
    ['\t'] = CHR_SP,
    [';'] = CHR_SEMI,
    ['$'] = CHR_DOL,
    ['#'] = CHR_WORD,
    ['A'...'Z'] = CHR_WORD,
    ['a'...'z'] = CHR_WORD,
    ['0'...'9'] = CHR_DIGIT,
    ...
};
Here, we define enums that allow us to abstract the character set. We can easily decide to treat a space as part of a word, or group characters by their meaning.
For example, with ls:
The word has 2 characters: l and s, both abstracted as CHR_WORD.
This eliminates verbose switch statements, resulting in clean and readable code.
This is particularly useful compared to the traditional approach:
switch (s[i]) {
    case 'A':
        return (CHR_WORD);
    case 'B':
        return (CHR_WORD);
    ...
    default:
        return (CHR_ERROR);
}
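With the table, the entire switch collapses into one indexed read. A small accessor (the function name is mine, not from the original code) also guards against negative char values on platforms where char is signed:

/* Classify one character via the lookup table; the cast keeps the
** index in [0, 255] even when char is a signed type. */
static t_chr_class get_chr_class(char c)
{
    return (g_get_chr_class[(unsigned char)c]);
}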
Now let’s implement the lexer rules that will use these character classes.
Lexer
The lexer defines rules for the current context.
At this point, “context” refers to the token type. We need to determine the current token before processing it.
We use an array for this purpose:
static t_toktype g_get_tok_type[CHR_MAX] = {
    [CHR_SP] = TOK_SP,
    [CHR_WORD] = TOK_WORD,
    [CHR_ESCAPE] = TOK_WORD,
    [CHR_DASH] = TOK_WORD,
    [CHR_DIGIT] = TOK_WORD,
    [CHR_BANG] = TOK_BANG,
    ...
};
This array allows us to determine the token type from the first character:
t_toktype token_type = g_get_tok_type[g_get_chr_class[(unsigned char)string[i]]];
Once we have the token type, we need to process the token as long as we remain in a valid context.
Here are the rules that define valid character sequences for each token type:
static int g_token_chr_rules[TOK_MAX][CHR_MAX] =
{
    [TOK_SP] = {
        [CHR_SP] = 0
    },
    [TOK_WORD] = {
        [CHR_WORD] = 1,
        [CHR_DIGIT] = 1,
        [CHR_ESCAPE] = 1,
        [CHR_SQUOTE] = 1,
        [CHR_DQUOTE] = 1,
        [CHR_BQUOTE] = 1,
        [CHR_LPAREN] = 1,
        [CHR_RPAREN] = 0,
        [CHR_LBRACE] = 1,
        [CHR_RBRACE] = 0,
        [CHR_DOL] = 1,
        [CHR_DASH] = 1
    },
    ...
};
With these rules defined, we can now iterate through the string while the context remains valid, then save the token.
Tokenizer in Action
Once we have the token type (e.g., TOK_WORD), we iterate through the string until the context is no longer valid:
unsigned int i = 1;
t_toktype token_type = g_get_tok_type[g_get_chr_class[(unsigned char)string[0]]];
while (g_token_chr_rules[token_type][g_get_chr_class[(unsigned char)string[i]]])
{
    ...
    i++;
}
// Here you can save the token with your favorite data structure.
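Putting the pieces together, a complete pass over the input might look like the sketch below. The save_token callback and its signature are assumptions for illustration; plug in whatever list or vector you prefer:

#include <stddef.h>

/* Hypothetical sink for finished tokens: (type, start, length). */
typedef void (*t_save_token)(t_toktype type, const char *start, size_t len);

/* Cut the whole input string into tokens using the two tables.
** Returns 0 on success, -1 on an unrecognized leading character. */
static int tokenize(const char *s, t_save_token save_token)
{
    size_t      i;
    size_t      start;
    t_toktype   type;

    i = 0;
    while (s[i])
    {
        start = i;
        type = g_get_tok_type[g_get_chr_class[(unsigned char)s[i]]];
        if (type == TOK_ERROR)
            return (-1);
        i++;
        while (s[i] && g_token_chr_rules[type][g_get_chr_class[(unsigned char)s[i]]])
            i++;
        if (type != TOK_SP)
            save_token(type, s + start, i - start);
    }
    return (0);
}

On ls && pwd, this loop would emit the three tokens from the first example, while the spaces are consumed as TOK_SP and never saved (assuming the elided rules let CHR_AND extend an && token).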
Conclusion
You now have a working lexer and tokenizer implementation.
For parser implementation, you can explore it on your own or check out my Recursive Descent Parser for a complete example.
References
- Finite State Machine
- Lexical Analysis
- Recursive-Descent-Parser