I've been doing a lot of work on a project recently that reads in configuration/modding data via command-line flags and tables stored in a text-based file format. The format is terrible, full of idiosyncrasies and inconsistencies depending on the type of table being read in.

The original parser for these was C-based and used a line-by-line system that stored a pointer into a file and advanced that pointer as tokens and rules were consumed. I really didn't like it at all, and for reasons mentioned below it was a real pain to refactor its dependencies. Clearly I was going to have to use an alternative.

Original parser implementation sample:

if (optional_string("#Team Colors")) { //advances the parse pointer if string is found

    while (required_string_either("#End", "$Team Name:")) { //lookahead
        required_string("$Team Name:"); // required to move the parse pointer forward
        team_color temp_color;
        char temp2[NAME_LENGTH];
        stuff_string(temp2, F_NAME, NAME_LENGTH); //read in a string, advance parse pointer by string length
        temp = temp2;

        if (!stricmp(temp2, "none")) { //business logic intermixed with parser logic
            Warning(LOCATION, "Team color in '%s' defined with a name of '%s'; this won't be usable due to 'None' being used for a lack of a team color by the engine.\n", filename, temp2);
        }

        if (required_string("$Team Stripe Color:")) {
            int rgb[3];
            stuff_int_list(rgb, 3, RAW_INTEGER_TYPE);
            for (i = 0; i < 3; i++) {
                CLAMP(rgb[i], 0, 255);
            }

            temp_color.stripe.r = rgb[0] / 255.0f;
            temp_color.stripe.g = rgb[1] / 255.0f;
            temp_color.stripe.b = rgb[2] / 255.0f;
        }

        if (required_string("$Team Base Color:")) {
            int rgb[3];
            stuff_int_list(rgb, 3, RAW_INTEGER_TYPE);
            for (i = 0; i < 3; i++) {
                CLAMP(rgb[i], 0, 255);
            }

            temp_color.base.r = rgb[0] / 255.0f;
            temp_color.base.g = rgb[1] / 255.0f;
            temp_color.base.b = rgb[2] / 255.0f;
        }

        if (Team_Colors.find(temp) == Team_Colors.end()) {	// Only push to the vector if the team isn't already defined.
            Team_Names.push_back(temp);
        }
        Team_Colors[temp] = temp_color;
    }
    required_string("#End");
}

There are a couple of problems with this approach:

  • The file format is never explicitly consolidated anywhere
  • Because the API is c-ish, there's global state being manipulated by optional_string, stuff_int_list, etc
  • The parser isn't re-entrant as a result (the api has to provide 'pause parsing' functions that basically push the parser state onto a stack)
  • Functions like required_string_either behave as lookahead, but because they don't consume tokens you have to repeat the rule with required_string afterwards to actually consume said tokens
  • Parser code directly interacts with other modules state (eg Team_Colors vector in above example) - this means the parser code needs access to headers for all the other modules so that it can write state as it finds it in the file

That last issue was the most pressing one for me, as it created circular dependencies between the other modules that used the parser (ie they called it) and the parser in turn knew about internal module details of the module (to be able to store the results).

Rather than hand-rolling a parser like this, you might be tempted to look at a framework like ANTLR for generating parsers. This also has some annoying limitations:

  • You have to link to the ANTLR static libraries
  • The ANTLR grammar gets converted to code in a pre-build step
  • ANTLR grammar is expressed in its own syntax, not C++

An alternative, is a PEG-based parser using something like cpp-peglib. This has the following advantages:

  • The library is header-only
  • No separate code generation step
  • You can either build your grammar using C++ OR using a text-based grammar that can be embedded in your source code as a literal C string

The native C++ syntax uses what are called Parser Combinators. These are individual functions that return objects representing operations, that can be composed together.

Here's an example of a rule built with parser combinators, and the equivalent rule expressed in native PEG syntax:

CommandLineGrammar["ParamName"] <= 
    peg::seq(peg::chr('-'), peg::tok(peg::oom(peg::ncls("-\r\n\t "))));

ParamName <- '-' < [^-\r\n\t ]+ >

This rule is for a symbol called ParamName, and it describes a sequence of the character '-' followed by the actual token, which is one or more characters not in [-\r\n\t ], that is, one or more non-space, non- '-' characters.

The biggest advantage of the parser combinators is that they can be composed into higher-level structures with a more fluent API, such that you can create EDSLs (embedded domain-specific languages) that are native C++ and read more like natural speech.

The equivalent of the above rule might be something like:

CommandLineGrammar.Define("ParamName",
    Character('-')
    & Token(
        OneOrMore(
            CharacterNotInClass("-\r\n\t ")
        )
    )
);

You can also write functions that return a set of rules that comprise a higher-level rule:

auto TeamColorInfo = ColorsFile.Define("TeamColor",
	ColorsFile.RequiredVariable("$Team Name")
	& ColorsFile.RequiredVariable("$Team Stripe Color")
	& ColorsFile.RequiredVariable("$Team Base Color")
);

These sequences of operators can then be referenced elsewhere in the grammar. For instance, this grammar fragment specifies an optional section called Team Colors that contains a list of TeamColorInfo objects:

ColorsFile.Optional(
    ColorsFile.Section("Team Colors",
        ColorsFile.ListOf(TeamColorInfo, false)
    )
)

This means that the code very explicitly lays out the expected structure of the input file, all in one location, not requiring the intermingling of business logic with the parsing behaviour - it basically ticks the boxes for being 'self-documenting'.

The major significant limitation of a PEG parser is that error recovery is limited, and usually requires the insertion of recovery rules into the grammar in order to allow for unused input to be consumed so the parser can resume operation. However, if you know what the error recovery rules should be for higher-level constructs these can be incorporated into the functions that construct them, such as Section and ListOf in the above examples.

So we have a parser but how do we actually get information regarding the parsed file?

peglib exposes the following function on the Definition/Rule class:

Result parse_and_get_value(const char *s, T &val, const char *path = nullptr)

If you pass in an empty shared_ptr to an ASTBase-derived class as val, you'll get a populated AST out as a result. You can walk the AST however you want - each node in the tree tells you what type of symbol it contains, as well as the string representing the token (if the symbol is a token), and child nodes for any contained symbols.

To make things easier to switch on symbol names, peglib provides the following user-defined literal operator that performs a compile-time hash on a string:

namespace udl {
    inline constexpr unsigned int operator"" _(const char *s, size_t) {
    return str2tag(s);
    }
}

This allows you to effectively pass nodes to factory functions, or perform any other despatch that you'd like:

switch (ast->tag) {
    case "block"_:
        block(ast, scope);
    break;
    case "assignment"_:
        assignment(ast, scope);
    break;
    default:
        for (auto node : ast->nodes) {
            build_on_ast(node, scope);
        }
    break;
}

There's some additional work going on in my config file EDSL to ensure that the syntax is as natural as possible, but I'll describe that in more detail in its own post later on.

Hopefully this encourages you to check out peglib! The author provides excellent support and I've found it great to use.