Evaluating Parser Options

Rusty ECMAScript Scanner (RESS)
- https://github.com/FreeMasen/RESS
Rusty ECMAScript Syntax Analyzer (RESSA)
- https://github.com/FreeMasen/RESSA

@FreeMasen
- twitter
- irc
- discord
https://wiredforge.com
- infrequent blogging
- random JS things
https://robertmasen.com
- My resume online

In early 2018, I found myself looking for a crate that would parse javascript into an AST. While there are a few options out there for parsing javascript, they are all tied to some greater development tool. What I was looking for was syn but for javascript. Since this didn't exist, I thought it would be a good idea to build it. After looking over the options available, I wanted to share my experience.

nom
- Macro based
combine
- impl Trait based
pest
- External Grammar Files

The first choice to make when building a parser is which of the parser crates to use, if one at all. In my search for a parsing crate I found 3 options that I wanted to investigate.

The three crates were nom, pest, and combine. Three popular options that take different approaches. nom's approach is to use macros to define parsing pipelines called combinators. combine also leverages combinators but instead of using macros, it uses regular rust functions. pest is a PEG parser generator, meaning it utilizes an external grammar file to generate a parser for you.

As we investigate these options, we should probably have a format that we want to parse to keep each example consistent.

One format that I think is simple enough to be digestible while being complicated enough to be interesting is the ISO 8601 Duration format. ISO 8601 is the date/time specification that aids in making dates work across cultures, sometimes referred to as I18n. You may be familiar time libraries that advertize their 8601 support, chrono for example.

As an aside, I18n is an abbreviation for Internationalization, it is used because there are 18 letters between the I and the last n

The duration format is dynamic, yet simple. Each duration will start with the character P, a list of number letter pairs. The number can be an integer or decimal value, the letter is a single capital letter. The units are broken into two sections separated by a T. The first section includes Year, Month, Week, and Day and the second section include Hour, Minute, and Second.

A standard way to encode a length of time into a string
I18n?

Basic Description

Start with a capital letter P
up to 7 Decimal number + letter pairs
Date half & Time half (separator: T)
letters: Y, M, W, D, H, M, S
Each is optional, minimum of 1 is required
If at least one unit that comes after Days is present they must be preceded by T

Here are some examples of different lengths of times as ISO 8601 Durations.

One Day: P1D, PT24H, P0.14285W
One Hour: PT1H, P0.041666D
One of Everything: P1Y1M1W1DT1H1M1.1S

With any data format, there are a few things here that could be problematic in a real world scenario. First, the size of each number is unspecified meaning deserializing a duration with two large a value could lead to overflow. Second, the larger of the units don't always have a consistent or clear meaning, a good example of this would a Month, using the gregorian calendar a month could be anywhere from 28 to 31 days. I bring this up because theses are important things to think about when approaching deserialization or parsing process. So long as both the serializer and deserializer agree on the meaning of any given symbol, everything should work out just fine.

At this point I want to take a moment and cover the concept of language grammars. In my experience most parsing resources will assume that you have at least a little knowledge about the concept making these resources difficult to use if you don't. If you already know a bit about language grammars, feel free to skip this page.

Essentially grammars are a way to write a definition for a language with the goal of making it easier to talk about that language. I am using the term language here pretty broadly to mean any agreed upon data format. That means that the 8601 Duration format could be thought of as a language though most people would not consider it one.

There are many options to choose from when trying to document a grammar, just like there are many language options to choose from when building software. I am going to use the Backus-Naur Form (BNF). We will going to walk through a BNF grammar describing the ISO 8601 Duration format piece by piece but if you wanted to look at the full thing you can scroll to the bottom of this page.

For those unfamiliar with grammar forms, they consist of a series of rules, the left side of a rule is a name, the right side is a description of what that rule means, in BNF the sides are separated by ::=.

I find it easiest to read a grammar from the bottom up. For our duration format we would start with the rule digit

<digit>       ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

On the right side of this rule we see each number from 0 to 9 separated by a |, which here means or, so a digit is a single digit number from 0 through 9. Next we have the integer rule.

<integer>     ::= <digit> | <integer><digit>

An integer is either a single digit or an integer followed by a digit. This is where things can get a little confusing as this style of notation might feel backwards, at least it does to me. Lets use the example 999, if you think about it starting with the right most 9, you would assign that position <integer><digit>, the <integer> here would point to the middle 9, this would also be assigned <integer><digit> and again the <integer> would represent the left most 9, this would finally be assigned <digit>. Here is a little flow chart to hopefully help visualize what I am trying to say.

<integer> = 9
    ┗━━━━━━━━━━━┓
<integer> = <integer>9
    ┗━━━━━━━━━━━┓
<integer> = <integer>9

When the right side of a rule looks like this it is referred to as a "left recursive" rule. In this case we could also write it as <digit><integer> making it "right recursive". The more important take away is that if you see a rule in its own definition then it could go on forever in one direction or the other.

Moving up the grammar, next we have the remainder rule.

<remainder>   ::= .<integer>

This rule is defined by a . followed by an integer. At this point it should become clear that as we move up the grammar, each rule will combine the previous rules, possibly with some additions which is why I like to start from the bottom. Next up we have number.

<number>      ::= <integer> | <remainder> | <integer><remainder>

A number is either an integer or a remainder or an integer followed by a remainder. This is a bit verbose but essentially we need a way to say that both are optional but at least one must exist. So 0 would work, also .877 would work and finally 0.877 would work.

The next few rules finally start to get into the specifics of the format. There are two rules for each of our number + letter pairs. Since we are using BNF the only operator we get is the | for or, some other grammar notations use other operators to make things a little more concise. If you have ever written a regular expression, you would be familiar with the + or ? operators for declaring recursion or optional values. In BNF we are required to create a new rule for the optional case if we want to have both optional and non-optional. Take the seconds and seconds-opt rules as an example.

<seconds-opt> ::= <seconds> | ""
<seconds>     ::= <number>S

The bottom one fits with what we went over on the previous page, a number followed by the capital letter S. The top one is just a way to make the previous rule optional. There is an entry like the above for each of our duration parts. After those we get to the time rule, this rule will hopefully make it clear why we needed those optional rules.

<time>        ::= T<hours-opt><minths-opt><seconds> |
                  T<hours-opt><minths><seconds-opt> |
                  T<hours><minths-opt><seconds-opt>

minth is used here as both Month and Minute

This rule has 3 options, all three start with the letter T and each is followed by all 3 of the time rules, in each case one of the time rules is not optional. This is saying that each of these parts can be optional but not all of them can be absent. So T1H or T2M1S are okay but T is not a valid time. Skipping ahead to the date rule we see a similar pattern.

<date>        ::= <years-opt><minths-opt><weeks-opt><days> |
                  <years-opt><minths-opt><weeks><days-opt> |
                  <years-opt><minths><weeks-opt><days-opt> |
                  <years><minths-opt><weeks-opt><days-opt>

So a date can be 1D or 1M1W but couldn't just be empty. You might have noticed that we again have an optional version of time, just like we did for the duration unit rules. When we look at the duration rule, the top rule in the grammar, it should become clears as to why.

<duration>    ::= P<date-length><time-length-opt> |
                  P<time-length>

As we covered in the previous page, durations can have from 1 to 7 number letter pairs. So this rule is saying that we can have a duration only with a date part (P1M) or a date part and a time part time part (P1DT1H) or just a time part(PT1S) but it cannot be empty (P).

If you haven't already below is the full BNF grammar I wrote for this the ISO 8601 Duration format.

<duration>    ::= P<date-length><time-length-opt> |
                  P<time-length>
<date>        ::= <years-opt><minths-opt><weeks-opt><days> |
                  <years-opt><minths-opt><weeks><days-opt> |
                  <years-opt><minths><weeks-opt><days-opt> |
                  <years><minths-opt><weeks-opt><days-opt>
<time-opt>    ::= <time-length> | ""
<time>        ::= T<hours-opt><minths-opt><seconds> |
                  T<hours-opt><minths><seconds-opt> |
                  T<hours><minths-opt><seconds-opt>
<years-opt>   ::= <years> | ""
<years>       ::= <number>Y
<minths-opt>  ::= <months-or-minutes> | ""
<minths>      ::= <number>M
<weeks-opt>   ::= <weeks> | ""
<weeks>       ::= <number>W
<days-opt>    ::= <days> | ""
<days>        ::= <number>D
<hours-opt>   ::= <hours> | ""
<hours>       ::= <number>H
<seconds-opt> ::= <seconds> | ""
<seconds>     ::= <number>S
<number>      ::= <integer> | <remainder> | <integer><remainder>
<remainder>   ::= .<integer>
<integer>     ::= <digit> | <integer><digit>
<digit>       ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

Before we dig into each implementation, I want to cover a few things that are common to most of the examples. The first thing they share is the following enum.


# #![allow(unused_variables)]
#fn main() {
enum DurationPart {
    Years(f32),
    Months(f32),
    Weeks(f32),
    Days(f32),
    Hours(f32),
    Minutes(f32),
    Seconds(f32),
}
#}

This is going to be how we represent each of the parts of a duration as we are parsing the input. Once parsing is done we will combine them into a full representation of a duration.


# #![allow(unused_variables)]
#fn main() {
pub struct Duration {
    years: Option<f32>,
    months: Option<f32>,
    weeks: Option<f32>,
    days: Option<f32>,
    hours: Option<f32>,
    minutes: Option<f32>,
    seconds: Option<f32>,
}
#}

The next thing that pops up across multiple crates is the need to parse a string into a float.

quick note:, the values are not actual floating point numbers as the spec does not allow for scientific notation (1.2e-3), they would be more akin to the decimal data type provided in some languages.

For this we are going to lean on the implementation in the standard library so it will typically just look like this.


# #![allow(unused_variables)]
#fn main() {
let value_str = "1.222";
let value: f32 = value_str.parse();
#}

Here the parse method on &str returns a result, so we would need to deal with that as well.

The last thing to cover here is that each parser will need to deal with the fact that M can mean either month or minute. While there isn't a shared code solution for this, it does pop up a few times.

By far the most popular option, nom heavily utilizes macros to build parsing workflows. This allows it to be performant yet flexible but significantly increases the learning curve for users.

Most popular
Performant
Flexible
High learning curve

nom has really set the bar high for the performance of any parsing library. Once you have a working parser, you can feel pretty confident that it will be just about as fast as you can make it. It's popularity makes it the most battle tested option.

Performance
Heavily used

Working with nom lead me to a headache's worth of frustration. While it is well documented, it can be difficult to know where to look when getting started. When putting together the code samples for this, I spent the better part of a day and a half trying to implement the nom version, at that point I put that version aside to come back to it later. The inclusion of new operators, the requirement of using nested macros and the stream focus of the project were all big ergonomic barriers for me. At one point I had a working version of that parser that kept failing because the input I was providing wasn't wrapped up as a CompleteStr, a type defined by nom that indicates all input has been received. I needed to reach out to someone in the nom gitter channel to know that was my problem. The feature of being stream focused I think is a plus overall for nom as more experienced users would leverage that feature but it does make the process of learning quite confusing.

Learning curve
- Custom operators
- Nested macro calls
- Stream focused

A good place to start when learning to use nom is the named! macro, this macro will create a function for you with a specific name and behavior. named! takes two arguments, the first is the function's name, this needs to include 2 generic arguments, the second is the function's body. This can be a little difficult to get used to but it isn't too much of a pain. The generic arguments provided with the name define the data type of the argument and return type respectively.

This is already starting to feel difficult to explain, I will re-iterate one last time that the learning curve for nom is pretty steep.

Just like with our grammar, let's start at the bottom and move up. The first named! entry we have in that direction is called float, which takes CompleteStr and returns f32.


# #![allow(unused_variables)]
#fn main() {
named!(float<CompleteStr, f32>,
    map_res!(take_while1!(digit), parse_float)
);
#}

The second argument is two nested macro calls. To help understand this conceptually here is a very loose interpretation of what this might look like as a function.


# #![allow(unused_variables)]
#fn main() {
fn float(i: CompleteStr) -> Result<f32, Error> {
    map_res!(take_while1!(i, digit), parse_float)
}
#}

While the above is an extreme simplification, I hope it illustrates what is going on when we use the named! macro.

To dig into the function body a little, we are using macros provided by nom, the first is map_res! which takes two arguments. The first argument is going to be a parser, this will capture some portion of the input, the second argument is a function that will take the result of the first argument's parsing and return a Result. The basic idea here is that we are going to map the output of a parser with the function, since the function might fail map_res! will convert the function's error case into a nom error. The first argument is take_while1!, another nom macro, this takes a function that takes 1 argument, this argument needs to be the smallest part of the input which for this would be a char the function then needs to return a boolean, indicating if that char matches what we are expecting. To flesh that our a little we should probably cover the two helper functions we are using digit and parse_float.


# #![allow(unused_variables)]
#fn main() {
fn digit(c: char) -> bool {
    c.is_digit(10) || c == '.'
}

fn parse_float(s: CompleteStr) -> Result<f32, ::std::num::ParseFloatError> {
    s.parse()
}
#}

If we had the input 111.111H, we would first use take_while1 to pass each character to digit.

1	1	1	.	1	1	1	H
✓	✓	✓	✓	✓	✓	✓	✗

that would leave us with 111.111, this would get passed along to parse_float, which would just call the parse method on &str. Since the parse method returns a result, this would work for the requirement of map_res!. Now that we have a way to parse the number part of our number + letter pairs, we can move on to the next use of named!.


# #![allow(unused_variables)]
#fn main() {
named!(duration_part<CompleteStr, (f32, CompleteStr)>,
    do_parse!(
        value: float >>
        flag: take!(1) >>
        (value, flag)
    )
);
#}

Here we have defined duration_part this will take us from a CompleteStr to a tuple with the type (f32, CompleteStr), it does this by using the do_parse! macro. do_parse! is our primary way to declare a parsing pipeline, we provide it with a list of parsers and the remaining input should be applied to them in order. Here we see the use of a couple of custom syntax items. The first is how we store a parser's result in a variable, we do that by putting the variable name and a colon before the parser (value: float). The second bit of custom syntax is the >> operator, this essentially means, "and then". To break down what we are doing here, first we are using the float parser defined above, once done with that pass any remaining input on to the parser take!, which doesn't look at the input, just takes the number of characters requested, in our case 1.

At this point, you may be asking yourself "why does take! need to be called but float doesn't?", that is a great question. One of the things that is hardest to get used to is that all of the nom macros return functions, not values. To break this down a little take! doesn't return the character, it returns a function that returns the character, it might look something like this.


# #![allow(unused_variables)]
#fn main() {
fn take(i: CompleteStr) -> Result<CompleteStr, Error> {
    if i.len() >= 1 {
        Ok(&i[0..1].into())
    } else {
        Err(...)
    }
}
#}

That means the result of take! can be called, just like the float. To put it another way, we are not parsing the input, instead we are building a parser out of other parsers that will parse the input.

To finish the explanation of this do_parse!, the last step in any use of do_parse! we need to have a tuple, this will be the return value, for us that is the tuple of (value, flag) which is of type (f32, CompleteStr).

At this point we should be able to parse a duration's smallest part. Now we need to figure out how to get that into our DurationPart enum. Then next item moving up the file is the helper function combine_duration_part.


# #![allow(unused_variables)]
#fn main() {
fn combine_duration_part(value: f32, flag: CompleteStr, is_time: bool) -> DurationPart {
    match *flag {
        "Y" => DurationPart::Years(value),
        "M" => if is_time {
            DurationPart::Minutes(value)
        } else {
            DurationPart::Months(value)
        },
        "W" => DurationPart::Weeks(value),
        "D" => DurationPart::Days(value),
        "H" => DurationPart::Hours(value),
        "S" => DurationPart::Seconds(value),
        _ => unreachable!()
    }
}
#}

This is going to take in the two parts the f32 value and the flag as well as a third argument to indicate if an M means minutes or months.

We are going to use this as we are building the two halves. The function itself is relatively simple, just a match statement on the flag, matching the appropriate letter to the duration part.

As we move up the file, we have two more named! macros date_part and time_part.


# #![allow(unused_variables)]
#fn main() {
named!(date_part<CompleteStr, DurationPart>,
    do_parse!(
        part: duration_part >>
        (combine_duration_part(part.0, part.1, false))
    )
);


named!(time_part<CompleteStr, DurationPart>,
    do_parse!(
        part: duration_part >>
        (combine_duration_part(part.0, part.1, true))
    )
);
#}

These are almost identical, essentially they use the duration_part to get first two arguments of combine_duration_part, it then provides the appropriate value for the third.

As we continue up we have a named! item to parse the time_parts,


# #![allow(unused_variables)]
#fn main() {
named!(time_parts<CompleteStr, Vec<DurationPart>>,
    do_parse!(
        tag!("T") >>
        time_parts: many_m_n!(1, 3, time_part) >>
        (time_parts)
    )
);
#}

Here we have another call to do_parse! the first item here is is a call to tag!, another nom macro. This one simply recognizes whatever string you provide to it. In our case we are matching against the T that would begin the time section. It then moves on to another provided macro many_m_n!, this one takes three arguments, the first is the minimum number of times you expect something, the second is the maximum number of times you expect something, and the third is what you expect. In our case we expect a time_part at least 1 time, up to 3 times. We are going to assign the result of that to the variable time_parts, this value will be a Vec<DurationPart>. Since the last item in do_parse! needs to be a tuple, we wrap time_parts in parentheses.

Up next is another helper function, combine_duration_parts


# #![allow(unused_variables)]
#fn main() {
fn combine_duration_parts(date_parts: Option<Vec<DurationPart>>, time_parts: Option<Vec<DurationPart>>) -> Duration {
    let mut duration = Duration::new();
    for part in date_parts.unwrap_or(Vec::new()).iter().chain(time_parts.unwrap_or(Vec::new()).iter()) {
        match part {
            DurationPart::Years(value) => duration.set_years(*value),
            DurationPart::Months(value) => duration.set_months(*value),
            DurationPart::Weeks(value) => duration.set_weeks(*value),
            DurationPart::Days(value) => duration.set_days(*value),
            DurationPart::Hours(value) => duration.set_hours(*value),
            DurationPart::Minutes(value) => duration.set_minutes(*value),
            DurationPart::Seconds(value) => duration.set_seconds(*value),
        }
    }
    duration
}
#}

This takes in two Option<Vec<DurationPart>>, the first will represent the date part and the second will represent the time part of our duration. Again this helper is mostly straight forward, it loops over both, sets of parts (if they exist) and adds them all up into a Duration.

Our last named! is the final step for combining all of our previous work.


# #![allow(unused_variables)]
#fn main() {
named!(duration<CompleteStr, Duration>,
    do_parse!(
        tag!("P") >>
        date_parts: opt!(many_m_n!(1, 4, date_part)) >>
        time_parts: opt!(time_parts) >>
        (combine_duration_parts(date_parts, time_parts))
    )
);
#}

Again we start with a do_parse!, this one begins with a tag! for the first letter P, we pass the rest on to another nom macor opt!, this converts a standard parser into a one that returns an Option, we pass the first one another many_m_n! that will generate a Vec<DurationPart> with a length from one to four, assigning the result to a variable date_parts. It then passes the remainder of the text to another opt!, this one around time_parts and assigns the result to a variable of the same name. Now we have the two Option<Vec<DurationPart>>s that we need to pass them along to the combine_duration_parts helper function.

With all that we have finally built a nom parser that would take in our duration string and return a Duration. You will notice in the parse function that we actually call the last named! item manually providing the argument. nom parsers always return a Result, the Ok case for this result is going to be a tuple, the first index will be the remainder of the input, the second will be a Duration.

You can find the whole parser file below.


# #![allow(unused_variables)]
#fn main() {
extern crate duration;
#[macro_use]
extern crate nom;
use nom::types::CompleteStr;
use duration::Duration;
pub fn parse(s: &str) -> Result<Duration, String> {
    let pair = duration(s.into()).map_err(|e| format!("{}", e))?;
    Ok(pair.1)
}

named!(duration<CompleteStr, Duration>,
    do_parse!(
        tag!("P") >>
        date_parts: opt!(many_m_n!(1, 4, date_part)) >>
        time_parts: opt!(time_parts) >>
        (combine_duration_parts(date_parts, time_parts))
    )
);

fn combine_duration_parts(date_parts: Option<Vec<DurationPart>>, time_parts: Option<Vec<DurationPart>>) -> Duration {
    let mut duration = Duration::new();
    for part in date_parts.unwrap_or(Vec::new()).iter().chain(time_parts.unwrap_or(Vec::new()).iter()) {
        match part {
            DurationPart::Years(value) => duration.set_years(*value),
            DurationPart::Months(value) => duration.set_months(*value),
            DurationPart::Weeks(value) => duration.set_weeks(*value),
            DurationPart::Days(value) => duration.set_days(*value),
            DurationPart::Hours(value) => duration.set_hours(*value),
            DurationPart::Minutes(value) => duration.set_minutes(*value),
            DurationPart::Seconds(value) => duration.set_seconds(*value),
        }
    }
    duration
}

named!(time_parts<CompleteStr, Vec<DurationPart>>,
    do_parse!(
        tag!("T") >>
        time_parts: many_m_n!(1, 3, time_part) >>
        (time_parts)
    )
);

named!(date_part<CompleteStr, DurationPart>,
    do_parse!(
        part: duration_part >>
        (combine_duration_part(part.0, part.1, false))
    )
);


named!(time_part<CompleteStr, DurationPart>,
    do_parse!(
        part: duration_part >>
        (combine_duration_part(part.0, part.1, true))
    )
);

fn combine_duration_part(value: f32, flag: CompleteStr, is_time: bool) -> DurationPart {
    match *flag {
        "Y" => DurationPart::Years(value),
        "M" => if is_time {
            DurationPart::Minutes(value)
        } else {
            DurationPart::Months(value)
        },
        "W" => DurationPart::Weeks(value),
        "D" => DurationPart::Days(value),
        "H" => DurationPart::Hours(value),
        "S" => DurationPart::Seconds(value),
        _ => unreachable!()
    }
}

named!(duration_part<CompleteStr, (f32, CompleteStr)>,
    do_parse!(
        value: float >>
        flag: take!(1) >>
        (value, flag)
    )
);


named!(float<CompleteStr, f32>,
    map_res!(take_while1!(digit), parse_float)
);

fn digit(c: char) -> bool {
    c.is_digit(10) || c == '.'
}

fn parse_float(s: CompleteStr) -> Result<f32, ::std::num::ParseFloatError> {
    s.parse()
}
#}

combine takes a similar approach to nom in that it asks its users to combine a series of parsers into something that itself is a parser though its approach to creating a parser does not involve macros. This departure makes the learning curve much shallower but if you want to get nom-like speeds it is somewhat difficult to learn how to do that.

Trait based combinators
Quick to learn the basics
Slow w/o advanced knowledge

After attempting to learn nom, combine feels like a breeze. The type annotations are a little verbose but there is no macro syntax to learn. Their documentation and examples make getting started very easy. While the naive approach to building a parser in combine will almost certainly not be blazing fast, can achieve some impressive performance.

No new syntax to learn
Good documentation
Can be as fast as nom

The biggest positive also leads to the biggest negative, the naive implementation will never be as performant as a nom parser. This comes down to weighing performance vs productivity, you can be more productive more quickly with combine but that parser might not be the fastest. The other big negative is that when you start digging into the more advanced options, it can be difficult to correctly annotate the types for a parser. The compiler is good at inferring the types but if you wanted to break something into its own function, you need those type annotations.

Slow, w/o tuning
Types can be hard

To start, I want to point out how combine is able to create combinators w/o using macros. It heavily leverages the impl Trait feature released this year. looking the signature of the float parser:


# #![allow(unused_variables)]
#fn main() {
fn float<I>() -> impl Parser<Input = I, Output = f32>
where
    I: Stream<Item = char>,
    I::Error: ParseError<I::Item, I::Range, I::Position>,
{
    (
        many1(digit()),
        optional((
            char('.'),
            many1::<String, _>(digit())
        ))
    ).map(|(int, rem)| {
        let f = if let Some(rem) = rem {
            format!("{}.{}", int, rem.1)
        } else {
            int
        };
        f.parse().unwrap()
    })
}
#}

We are saying that this function will return a parser who's input is of type I and output is an f32. Additionally we constrain I saying it needs to implement Stream, where the Item type is a char, and the Error type of I needs to implement ParseError. ParseError is also generic so we pass along the properties of I, Item, Range, and Position. This is pretty verbose, but I found that building a simple parser I didn't need to worry a ton about what it meant, instead just using it as a blueprint for all of my parser functions. That is to say, I just copied and pasted this whenever I wanted to add another parser to a source file.

One thing to keep in mind is that these functions are not parsing the input but returning a parser that will be able to parse the input. We define these by combining them together and mapping over the result if successful. To indicate that parsers are chained together in a sequence we wrap them in either an array if they are all the exact same type otherwise we would use a tuple. Looking at the body of float we are using a tuple, with two parsers inside it. The first parser is many1 this takes a parser as an argument and applies it a minimum of 1 times but will collect the results until in argument parser fails, we passed in the result of calling digit which is a parser that will get us a single digit number. Next we have optional this takes in a parser and wraps the result in an Option, we pass it a tuple of 2 parsers the char parser, this is a single character and another many1 with digit as its argument. Notice in this second call to many1 we included a type annotation, many1 operates similar to the collect method on an iterator, the result could be a number of different collections we are just telling it it should be a String.

We now call map on our tuple, this essentially is saying if you find the pattern we defined, call this closure to generate the Output. In this closure we are going to check and see if the remainder exists and coalesce that into a string with the integer portion if not just use the integer portion. We then call str::parse on that coalesced value.

Moving up the file we next see value_pair the signature is similar to float though the Output type is DurationPart and it takes in an argument time to determine if M is a minute or a month.


# #![allow(unused_variables)]
#fn main() {
fn value_pair<I>(time: bool) -> impl Parser<Input = I, Output = DurationPart>
where
    I: Stream<Item = char>,
    I::Error: ParseError<I::Item, I::Range, I::Position>,
{
    (
        float(),
        choice([
            char('Y'),
            char('M'),
            char('W'),
            char('D'),
            char('H'),
            char('S'),
        ])).map(move |(v, c): (f32, char)| {
            match c {
                'Y' => DurationPart::Years(v),
                'M' => if time {
                    DurationPart::Minutes(v)
                } else {
                    DurationPart::Months(v)
                },
                'W' => DurationPart::Weeks(v),
                'D' => DurationPart::Days(v),
                'H' => DurationPart::Hours(v),
                'S' => DurationPart::Seconds(v),
                _ => unreachable!()
            }
        })
}
#}

Again here we have a tuple, the first parser is the float parser we just defined, next is a choice parser, this takes in an array or tuple of parsers and tries each one starting at the top, stopping at the first successful. We have passed choice an array of char parsers with our unit characters. We map our tuple, this time our closure has the move keyword to take ownership of the time argument. We have annotated the argument to our closure to help the compiler figure out what we are trying to do here. map on a parser always takes 1 argument, this will be a tuple of the results, for us that is f32 and char. We simply match on the char and generate the correct DurationPart variant as per the letter and the time flag.

Next we have the time_part parser.


# #![allow(unused_variables)]
#fn main() {
fn time_part<I>() -> impl Parser<Input = I, Output = Vec<DurationPart>>
where
    I: Stream<Item = char>,
    I::Error: ParseError<I::Item, I::Range, I::Position>,
{
    (
        char('T'),
        many1(value_pair(true))
    ).map(|(_, p)| p)
}
#}

Here we have another tuple, the first parser is a call to char with our T indicator that this is time based values, the second is a many1 call to the value_pair parser we just defined, passing true for the time flag. The map here is simply discarding the T character, that make the Output type annotation work for p which means it also work for many1.

After time_part we have date_part.


# #![allow(unused_variables)]
#fn main() {
fn date_part<I>() -> impl Parser<Input = I, Output = Vec<DurationPart>>
where
    I: Stream<Item = char>,
    I::Error: ParseError<I::Item, I::Range, I::Position>,
{
    (
        many1(value_pair(false))
    )
}
#}

This one is just the many1 call to value_pair passing false for the time flag. Notice that we don't need a map here since no additional changes need to be made to match the Output.

The last of the parsers we are going to define is duration


# #![allow(unused_variables)]
#fn main() {
fn duration<I>() -> impl Parser<Input = I, Output = Duration>
where
    I: Stream<Item = char>,
    I::Error: ParseError<I::Item, I::Range, I::Position>,
{
    (
        char('P'),
        optional(date_part()),
        optional(time_part()),
    ).map(|(_, d, t)| {
        let mut ret = Duration::new();
        for part in d.unwrap_or(Vec::new()).iter().chain(t.unwrap_or(Vec::new()).iter()) {
            match part {
                DurationPart::Years(v) => ret.set_years(*v),
                DurationPart::Months(v) => ret.set_months(*v),
                DurationPart::Weeks(v) => ret.set_weeks(*v),
                DurationPart::Days(v) => ret.set_days(*v),
                DurationPart::Hours(v) => ret.set_hours(*v),
                DurationPart::Minutes(v) => ret.set_minutes(*v),
                DurationPart::Seconds(v) => ret.set_seconds(*v),
            }
        }
        ret
    })
}
#}

This is another tuple indicating the sequence of a char of P followed by an optional date_part followed by an optional time_part. The argument to the map closure would be of type (char, Option<Vec<DurationPart>>, Option<Vec<DurationPart>>). In the body we loop over those two Vecs if they exist, adding them to a Duration returning the result.

Finally there is the parse function.


# #![allow(unused_variables)]
#fn main() {
pub fn parse(s: &str) -> Result<Duration, String> {
    let d = duration().parse(s).map_err(|e| format!("{}", e))?;
    Ok(d.0)
}
#}

Which creates a Duration parser by calling duration and then passes a &str to the parse method, this is one of the methods defined on the Parser trait. Parse returns a Result and in the Ok position we have a tuple, the first item has the same type as Output and the second is the remaining input to parse. For us the Output is going to be a Duration so we return Ok(d.0)

Here is the full source file for the combine parser.


# #![allow(unused_variables)]
#fn main() {
extern crate duration;
use duration::{Duration, DurationPart};
extern crate combine;
use combine::{
    choice,
    char::{char, digit},
    many1,
    optional,
    Parser,
    ParseError,
    Stream,
};
pub fn parse(s: &str) -> Result<Duration, String> {
    let d = duration().parse(s).map_err(|e| format!("{}", e))?;
    Ok(d.0)
}

fn duration<I>() -> impl Parser<Input = I, Output = Duration>
where
    I: Stream<Item = char>,
    I::Error: ParseError<I::Item, I::Range, I::Position>,
{
    (
        char('P'),
        optional(date_part()),
        optional(time_part()),
    ).map(|(_, d, t)| {
        let mut ret = Duration::new();
        for part in d.unwrap_or(Vec::new()).iter().chain(t.unwrap_or(Vec::new()).iter()) {
            match part {
                DurationPart::Years(v) => ret.set_years(*v),
                DurationPart::Months(v) => ret.set_months(*v),
                DurationPart::Weeks(v) => ret.set_weeks(*v),
                DurationPart::Days(v) => ret.set_days(*v),
                DurationPart::Hours(v) => ret.set_hours(*v),
                DurationPart::Minutes(v) => ret.set_minutes(*v),
                DurationPart::Seconds(v) => ret.set_seconds(*v),
            }
        }
        ret
    })
}

fn date_part<I>() -> impl Parser<Input = I, Output = Vec<DurationPart>>
where
    I: Stream<Item = char>,
    I::Error: ParseError<I::Item, I::Range, I::Position>,
{
    (
        many1(value_pair(false))
    )
}

fn time_part<I>() -> impl Parser<Input = I, Output = Vec<DurationPart>>
where
    I: Stream<Item = char>,
    I::Error: ParseError<I::Item, I::Range, I::Position>,
{
    (
        char('T'),
        many1(value_pair(true))
    ).map(|(_, p)| p)
}

fn value_pair<I>(time: bool) -> impl Parser<Input = I, Output = DurationPart>
where
    I: Stream<Item = char>,
    I::Error: ParseError<I::Item, I::Range, I::Position>,
{
    (
        float(),
        choice([
            char('Y'),
            char('M'),
            char('W'),
            char('D'),
            char('H'),
            char('S'),
        ])).map(move |(v, c): (f32, char)| {
            match c {
                'Y' => DurationPart::Years(v),
                'M' => if time {
                    DurationPart::Minutes(v)
                } else {
                    DurationPart::Months(v)
                },
                'W' => DurationPart::Weeks(v),
                'D' => DurationPart::Days(v),
                'H' => DurationPart::Hours(v),
                'S' => DurationPart::Seconds(v),
                _ => unreachable!()
            }
        })
}

fn float<I>() -> impl Parser<Input = I, Output = f32>
where
    I: Stream<Item = char>,
    I::Error: ParseError<I::Item, I::Range, I::Position>,
{
    (
        many1(digit()),
        optional((
            char('.'),
            many1::<String, _>(digit())
        ))
    ).map(|(int, rem)| {
        let f = if let Some(rem) = rem {
            format!("{}.{}", int, rem.1)
        } else {
            int
        };
        f.parse().unwrap()
    })
}
#}

pest is a completely different approach from the other two options discussed here. Instead of building your own parser you write out your grammar, in their format, and pest does all the building for you. Since it is designed to handle anything you throw at it you ultimately are sacrificing speed for convenience which can be a very valid sacrifice.

Provide Grammar
Generate Parser

The biggest advantage that pest has over the other options is the barrier to entry. The learning curve is pinned almost entirely learning to the grammar format. If you don't already have this skill it is a pretty valuable one to have so the cost of that is nominal. Another perk of this system is that you might be able to find a similar grammar format published online somewhere. A good example of this is the es5 grammar I found when I was getting started with my parsers, you can see it here. It won't fit exactly but often a simple find/replace can do almost all of the work for you. Lastly, you are off loading a large amount of the solution development to a tool, this makes it much faster to get up and running.

Low barrier to entry
Grammar might already exist?
Lots of work offloaded

The biggest negative is that you give up a lot of control, you have little to no ability to adjust the parser for performance reason. Another big issue in the same vein is you are stuck with what is offered by the library maintainer, when I was testing out pest to use in my js parser, they did not yet have unicode support meaning I needed to include in my grammar any unicode categories manually which bloated my binary size and compilation times. More recently they have baked that into the library but if you wanted to express something complicated that is not yet implemented, you are stuck. The next issue is that it outputs the larges binary size, this isn't a metric that is always important but it can be. The last issue is speed, it is not the fastest option but it also isn't the slowest. I bring this up last because I think the speed is easily offset by the east of use depending on your intentions.

Limited control
Largest Binary Size
Speed, to some extent

Before we start digging into the rust code, we should first cover the grammar file.

It looks a lot like our BNF grammar, the biggest difference is that we have the opportunity to use some more flexible notation. For example the instead of having one rule for optional values and another for non-optional, we can use the ? to say that any existing rule is optional. When noting that values repeat, we can use + to indicate 1 or more and the * to indicate 0 or more. These might be familiar to you if you have used regular expressions.

Some other things to keep in mind when using the pest grammar syntax, the right hand side of a rule needs to be wrapped in curly braces and each segment should be separated with ~. There are some more advanced things you can do with this style but we don't need them here.

Starting from the bottom again, first we define our Decimal rule, this is really just an alias for the ASCII_DIGIT rule provided by pest.

Decimal = { ASCII_DIGIT }

Next we have Integer which is 1 or more decimals.

Integer = { Decimal+ }

Then Remainder, a period followed by an integer, notice that strings need to be wrapped in double quotes.

Remainder = { "." ~ Integer }

Now we can define our Number rule as either an Integer with an optional Remainder or an optional Integer followed by a Remainder.

Number = { (Integer ~ Remainder?) |
           (Integer? ~ Remainder)
}

Above that is all of our unit/value pairs.

Year = { Number ~ "Y" }
Week = { Number ~ "W" }
Day = { Number ~ "D" }
Hour = { Number ~ "H" }
MinuteOrMonth = { Number ~ "M" }
Second = { Number ~ "S" }

Followed by the time_section and date_section rules.

DateSection = {
    (Year? ~ MinuteOrMonth? ~ Week? ~ Day) |
    (Year? ~ MinuteOrMonth? ~ Week ~ Day?) |
    (Year? ~ MinuteOrMonth ~ Week? ~ Day?) |
    (Year ~ MinuteOrMonth? ~ Week? ~ Day?)
}
TimeSection = { "T" ~ (
    (Hour? ~ MinuteOrMonth? ~ Second) |
    (Hour? ~ MinuteOrMonth ~ Second?) |
    (Hour ~ MinuteOrMonth? ~ Second?)
    )
}

All the way at the top we have the Duration rule.

Duration = {
    "P" ~ ((DateSection ~ TimeSection?) | (DateSection? ~ TimeSection))
}

Now for the rust part, to start we are going to use a derive provided by pest for their trait Parser. The derive allows for an attribute grammar which should be assigned the relative plath to the grammar file. We apply these to a unit struct, I called mine DurationParser.


# #![allow(unused_variables)]
#fn main() {
#[derive(Parser)]
#[grammar = "duration.pest"]
pub struct DurationParser;
#}

This will create an enum called Rule that will have one variant for each of the rules in our grammar file. Here it would look something like this.


# #![allow(unused_variables)]
#fn main() {
enum Rule {
    Duration,
    DateSection,
    TimeSection,
    Year,
    Week,
    Day,
    MinuteOrMonth,
    Second,
    Number,
    Remainder,
    Integer,
    Decimal,
}
#}

Inside of the parse function, the first thing we do is call DurationParser::parse, providing the rule we are looking to parse, in this case Rule::Duration and the &str.


# #![allow(unused_variables)]
#fn main() {
pub fn parse(s: &str) -> Result<Duration, String> {
    let duration = DurationParser::parse(Rule::Duration, s)
                .map_err(|e| format!("{}", e))?
                .next()
                .unwrap();
    let ret = assemble_parts(duration)?;
    Ok(ret)
}
#}

This is going to return a Result with a Pairs in the success position. Pairs is an iterator over Pair. For our case, we just need to first Pair so we can call next to get that. Once we have that we can pass it off to assemble_parts, which will take the Pair and pull out the inner rules. You can think about that in the same way our grammar is layed out, the Duration rule had DateSection and TimeSection in its definition, so the inner pairs would be one of these two variants of the Rule enum.


# #![allow(unused_variables)]
#fn main() {
fn assemble_parts(pair: Pair<Rule>) -> Result<Duration, String> {
    let mut ret = Duration::new();
    for part in pair.into_inner() {
        match part.as_rule() {
            Rule::DateSection => {
                assemble_part(&mut ret, part, false)?;
            },
            Rule::TimeSection => {
                assemble_part(&mut ret, part, true)?;
            },
            _ => unreachable!()
        }
    }
    Ok(ret)
#}

Once we have the inner values we are going to loop over them and pass it off to assemble_part.


# #![allow(unused_variables)]
#fn main() {
fn assemble_part(d: &mut Duration, pair: pest::iterators::Pair<Rule>, time: bool) -> Result<(), String> {
    for ref part in pair.into_inner() {
        update_duration(d, part, time)?;
    }
    Ok(())
}
#}

This is again going to pull out the inner Pair which should be one of the unit value rules. Once it has pulled that out it passes that pair off to update_duration.


# #![allow(unused_variables)]
#fn main() {
fn update_duration(d: &mut Duration, pair: &Pair<Rule>, time: bool) -> Result<(), String> {
    let f = get_float(pair)?;
    match pair.as_rule() {
        Rule::Year => {
            d.set_years(f);
        },
        Rule::MinuteOrMonth => {
            if time {
                //minute
                d.set_minutes(f);
            } else {
                //month
                d.set_months(f);
            }
        },
        Rule::Week => {
            d.set_weeks(f);
        },
        Rule::Day => {
            d.set_days(f);
        }
        Rule::Hour => {
            d.set_hours(f);
        }
        Rule::Second => {
            d.set_seconds(f);
        },
        _ => unreachable!()
    }
    Ok(())
}

fn get_float(pair: &Pair<Rule>) -> Result<f32, String> {
    let s = pair.as_str();
    let s = &s[..s.len() - 1];
    s.parse().map_err(|e| format!("error parsing float: {:?} {}", s, e))
}
#}

Here we are going to first get the float value from the pair, we do this by calling as_str on the Pair which gives the full slice of the original, we know the last character is the unit so we call parse on the sub string not including that. Now that we have the value, we can just match on the Pair::as_rule which will be one of our unit variants. At each stage we have passed down a mutable reference to the Duration we are assembling, making it easier to update it as needed. That is it, we off loaded quite a bit of the logic to the parser generator.

Here are the full grammar and rust files.


Duration = {
    "P" ~ ((DateSection ~ TimeSection?) | (DateSection? ~ TimeSection))
}
DateSection = {
    (Year? ~ MinuteOrMonth? ~ Week? ~ Day) |
    (Year? ~ MinuteOrMonth? ~ Week ~ Day?) |
    (Year? ~ MinuteOrMonth ~ Week? ~ Day?) |
    (Year ~ MinuteOrMonth? ~ Week? ~ Day?)
}
TimeSection = { "T" ~ (
    (Hour? ~ MinuteOrMonth? ~ Second) |
    (Hour? ~ MinuteOrMonth ~ Second?) |
    (Hour ~ MinuteOrMonth? ~ Second?)
    )
}
Year = { Number ~ "Y" }
Week = { Number ~ "W" }
Day = { Number ~ "D" }
Hour = { Number ~ "H" }
MinuteOrMonth = { Number ~ "M" }
Second = { Number ~ "S" }
Number = { (Integer ~ Remainder?) |
           (Integer? ~ Remainder)
}
Remainder = { "." ~ Integer }
Integer = { Decimal+ }
Decimal = { ASCII_DIGIT }


# #![allow(unused_variables)]
#fn main() {
extern crate duration;
extern crate pest;
#[macro_use]
extern crate pest_derive;

use duration::Duration;
use pest::{
    Parser,
    iterators::Pair,
};

#[derive(Parser)]
#[grammar = "duration.pest"]
pub struct DurationParser;

pub fn parse(s: &str) -> Result<Duration, String> {
    let duration = DurationParser::parse(Rule::Duration, s)
                .map_err(|e| format!("{}", e))?
                .next()
                .unwrap();
    let ret = assemble_parts(duration)?;
    Ok(ret)
}

fn assemble_parts(pair: Pair<Rule>) -> Result<Duration, String> {
    let mut ret = Duration::new();
    for part in pair.into_inner() {
        match part.as_rule() {
            Rule::DateSection => {
                assemble_part(&mut ret, part, false)?;
            },
            Rule::TimeSection => {
                assemble_part(&mut ret, part, true)?;
            },
            _ => unreachable!()
        }
    }
    Ok(ret)
}

fn assemble_part(d: &mut Duration, pair: pest::iterators::Pair<Rule>, time: bool) -> Result<(), String> {
    for ref part in pair.into_inner() {
        update_duration(d, part, time)?;
    }
    Ok(())
}

fn update_duration(d: &mut Duration, pair: &Pair<Rule>, time: bool) -> Result<(), String> {
    let f = get_float(pair)?;
    match pair.as_rule() {
        Rule::Year => {
            d.set_years(f);
        },
        Rule::MinuteOrMonth => {
            if time {
                //minute
                d.set_minutes(f);
            } else {
                //month
                d.set_months(f);
            }
        },
        Rule::Week => {
            d.set_weeks(f);
        },
        Rule::Day => {
            d.set_days(f);
        }
        Rule::Hour => {
            d.set_hours(f);
        }
        Rule::Second => {
            d.set_seconds(f);
        },
        _ => unreachable!()
    }
    Ok(())
}

fn get_float(pair: &Pair<Rule>) -> Result<f32, String> {
    let s = pair.as_str();
    let s = &s[..s.len() - 1];
    s.parse().map_err(|e| format!("error parsing float: {:?} {}", s, e))
}

#[cfg(test)]
mod test {
    use super::*;
    #[test]
    fn one_of_each() {
        parse("P1Y1M1W1DT1H1M1.1S").unwrap();
    }
}
#}

The final option is to just do it yourself. For something as simple as an ISO 8601 Duration, this might be the best option. Once things start to get more complicated, you are essentially going to need to re-invent the options we have covered.

Simple format, Simple solution
Complex format, Complex solution

The biggest benefit here is that you have full control of every aspect of the parsing process. This means that you can tune in your parser to be as fast as possible and you don't need to learn someone else's framework to do so.

Full control
Nothing new to learn
Can be fast

The biggest negative is that you are going to, at least to some extent, re-invent the wheel. The maintainers of the three crates I have covered are focusing their energy on building a quality base for you to build on, if you built that base yourself would it be as solid? The other big issue with the DIY approach is that you are going to introduce a significant amount of complexity to your solution and complexity is often a synonym for risk.

Re-invent (at least part of) the wheel
Complex formats = Complex solutions
- Complexity = Risk

With this one, we can start from the top. Our entry point, as always, is the parse function. The first step here is to check if the first character is a P, if it isn't we can stop this is not a Duration. Next we want to split the &str into parts, the first part would be everything after P and before T, the second would be everything from T to the end of the string.

Once we split it up, we can expect there to be, at most, 2 parts so a manual calls to next on the Split iterator should be enough. We can feel confident that the first one is going to be the date part because calling split on T3H would be "" followed by "3D", so first we test that next is Some then we double check it isn't "", if both are true we can pass the first half to parse_parts with the false as the second argument.

parse_parts takes in one half of the duration and a flag to indicate if M should be a month or a minute. It first finds all of the char_indices that have one of our unit characters. We are going to need to keep track of our position in this &str and this is done with the start_idx variable. We can now loop over the indices getting a slice of the input string from the start_index to the idx of the unit character which we want to parse as an f32. Next we want to match on the unit character which should be at the idx, using the time flag to determine if M means minute or month, we create a duration part. We are going to collect all of these parts in a Vec<DurationPart> to eventually return it so we push the duration into that Vec and finally update thestart_idx to be the idx + 1. This should get us through all of the parsing, next we need to collect these DurationParts into a Duration. We do that by passing a mutable reference to a Duration along with the each of the DurationParts off to the update_duration function. This just matches on the DurationPart and updates the provided Duration accordingly. We do this for both of our expected iterator items and we are done. There is a check in here to make sure that there is at least 1 unit/value pair.


# #![allow(unused_variables)]
#fn main() {
extern crate duration;
use duration::{Duration, DurationPart};


pub fn parse(s: &str) -> Result<Duration, String> {
    if &s[0..1] != "P" {
        return Err(format!("All durations must start with a P: {:?}", s));
    }
    let s = &s[1..];
    let mut parts = s.split('T');
    let mut found_one = false;
    let mut ret = Duration::new();
    if let Some(date_part) = parts.next() {
        if date_part != "" {
            found_one = true;
            for part in parse_parts(date_part, false)? {
                update_duration(&mut ret, &part);
            }
        }
    }
    if let Some(time_part) = parts.next() {
        if time_part != "" {
            found_one = true;
            for part in parse_parts(time_part, true)? {
                update_duration(&mut ret, &part);
            }
        }
    }
    if !found_one {
        return Err(format!("duration contains no information: {:?}", s));
    }

    Ok(ret)
}

fn parse_parts(s: &str, is_time: bool) -> Result<Vec<DurationPart>, String> {
    let idxs = s.char_indices().filter_map(|(i, c)| {
        if c == 'Y'
        || c == 'M'
        || c == 'W'
        || c == 'D'
        || c == 'H'
        || c == 'M'
        || c == 'S' {
            Some(i)
        } else {
            None
        }
    });
    let mut ret = Vec::with_capacity(4);
    let mut start_idx = 0;
    for idx in idxs {
        let float: f32 = s[start_idx..idx].parse().map_err(|e| format!("{}", e))?;
        let tag = &s[idx..idx+1];
        let part = match tag {
            "Y" => DurationPart::Years(float),
            "M" => if is_time {
                DurationPart::Minutes(float)
            } else {
                DurationPart::Months(float)
            },
            "W" => DurationPart::Weeks(float),
            "D" => DurationPart::Days(float),
            "H" => DurationPart::Hours(float),
            "S" => DurationPart::Seconds(float),
            _ => return Err(format!("Invalid unit tag pair at {} in {:?}", idx, s)),
        };
        ret.push(part);
        start_idx = idx + 1;
    }
    Ok(ret)
}

fn update_duration(d: &mut Duration, part: &DurationPart) {
    match part {
        DurationPart::Years(v) => d.set_years(*v),
        DurationPart::Months(v) => d.set_months(*v),
        DurationPart::Weeks(v) => d.set_weeks(*v),
        DurationPart::Days(v) => d.set_days(*v),
        DurationPart::Hours(v) => d.set_hours(*v),
        DurationPart::Minutes(v) => d.set_minutes(*v),
        DurationPart::Seconds(v) => d.set_seconds(*v),
    }
}

#[cfg(test)]
mod test {
    use super::*;
    #[test]
    fn all() {
        let d = "P1Y1M1W1DT1H1M1.1S";
        let p = parse(d).unwrap();
        assert_eq!(d, &format!("{}", p));
    }
    #[test]
    fn time_only() {
        let d = "PT1H1M1.1S";
        let p = parse(d).unwrap();
        assert_eq!(d, &format!("{}", p));
    }
    #[test]
    fn date_only() {
        let d = "P1Y1M1W1D";
        let p = parse(d).unwrap();
        assert_eq!(d, &format!("{}", p));
    }
}
#}

Below you will find a table of benchmark information for each of the 4 implementations. There are two benchmarks for each implementation, the first is parsing just 1 duration, the second is parsing 1000 durations, these are all created using lazy_static so each parser is provided the same input. For the build time and build size benchmarks, I used the following binary application with feature flags to conditionally use each of the implementations.

extern crate duration;
extern crate random_durations;
#[cfg(all(feature = "nom", not(feature = "bench")))]
extern crate nom_duration_parser as parser;
#[cfg(all(feature = "combine", not(feature = "bench")))]
extern crate combine_duration_parser as parser;
#[cfg(all(feature = "pest", not(feature = "bench")))]
extern crate pest_duration_parser as parser;
#[cfg(all(feature = "hand", not(feature = "bench")))]
extern crate hand_rolled_duration_parser as parser;

#[cfg(any(feature = "bench", not(any(feature = "nom", feature = "combine", feature = "pest", feature = "hand"))))]
fn main() {
    println!("{}", random_durations::gen_random_durs_text(get_count()).join("\n"));
}

#[cfg(
    all(
        any(feature = "nom", feature = "combine", feature = "pest", feature = "hand"),
        not(feature = "bench")
        )
)]
fn main() {
    for d in random_durations::gen_random_durs(get_count()) {
        let s = format!("{}", d);
        let p = parser::parse(&s).unwrap();
        assert_eq!(d, p);
        println!("duration:{}\nparsed to\n{}", d, p.human_readable());
    }
}

fn get_count() -> usize {
    for arg in ::std::env::args() {
        match arg.parse() {
            Ok(u) => return u,
            Err(_) => ()
        }
    }
    return 1000
}

This is the time it took to run cargo build on each implementation. With the increasing improvement of incremental compilation, it isn't the most important metric but some people might care about it.

This is the size of the program when built using cargo build --release.

This is the time cargo +nightly bench reported for this parser to parse 1 duration.

This is the time cargo +nightly bench reported for this parser to parse 1000 durations.


# #![allow(unused_variables)]
#![feature(test)]

#fn main() {
extern crate test;
extern crate duration;
extern crate random_durations;
#[cfg(feature = "nom")]
extern crate nom_duration_parser;
#[cfg(feature = "pest")]
extern crate pest_duration_parser;
#[cfg(feature = "combine")]
extern crate combine_duration_parser;
#[cfg(feature = "hand")]
extern crate hand_rolled_duration_parser;
#[macro_use]
extern crate lazy_static;

use test::{Bencher, black_box};
use duration::Duration;

lazy_static! {
    static ref DURATION: String = format!("{}", random_durations::gen_random_dur());
}

lazy_static! {
    static ref DURATIONS: Vec<String> = random_durations::gen_random_durs_text(1000);
}

#[cfg(feature = "nom")]
#[bench]
fn nom(b: &mut Bencher) {
    single(b, &nom_duration_parser::parse);
}

#[cfg(feature = "nom")]
#[bench]
fn nom_1000(b: &mut Bencher) {
    thousand(b, &nom_duration_parser::parse);
}

#[cfg(feature = "pest")]
#[bench]
fn pest(b: &mut Bencher) {
    single(b, &pest_duration_parser::parse);
}
#[cfg(feature = "pest")]
#[bench]
fn pest_1000(b: &mut Bencher) {
    thousand(b, &pest_duration_parser::parse);
}

#[cfg(feature = "combine")]
#[bench]
fn combine(b: &mut Bencher) {
    single(b, &combine_duration_parser::parse);
}
#[cfg(feature = "combine")]
#[bench]
fn combine_1000(b: &mut Bencher) {
    thousand(b, &combine_duration_parser::parse);
}

#[cfg(feature = "hand")]
#[bench]
fn hand_rolled(b: &mut Bencher) {
    single(b, &hand_rolled_duration_parser::parse);
}
#[cfg(feature = "hand")]
#[bench]
fn hand_rolled_1000(b: &mut Bencher) {
    thousand(b, &hand_rolled_duration_parser::parse);
}
fn single(b: &mut Bencher, f: &impl Fn(&str) -> Result<Duration, String>) {
    b.iter(|| {
        black_box(f(&DURATION).unwrap());
    });
}

fn thousand(b: &mut Bencher, f: &impl Fn(&str) -> Result<Duration, String>) {
    b.iter(|| {
        for s in DURATIONS.iter() {
            black_box(f(s).unwrap());
        }
    })
}
#}

crate	parse 1 (+/-)	parse 1000 (+/-)	build time	release size
nom	363.00ns (10.00ns)	844.53ms (64.04ms)	6.98s	727.08 kb
combine	1.19ms (51.00ns)	3.92s (276.65ms)	21.40s	739.68 kb
pest	5.13ms (126.00ns)	4.05s (93.54ms)	19.22s	767.86 kb
hand_rolled	205.00ns (1.00ns)	541.68ms (21.55ms)	7.22s	718.87 kb

I wanted to revisit combine, while the original works, I was able to create an implementation that is actually faster than nom's. Doing so required quite a bit of additional work in understanding how the "zero-copy" tools in combine work. In truth I still haven't been able to figure out how the type annotations need to be set for this implementation to work, instead each parser is assigned to a variable inside of a function body. With the following implementation the combine parser, the whole set of benchmarks are:

Learning Curve gets steeper
Performance gets better

crate	parse 1 (+/-)	parse 1000 (+/-)	build time	bin size
combine	893.00ns (37.00ns)	1.17s (73.91ms)	25.41s	723.99 kb
nom	1.33ms (69.00ns)	833.53ms (37.55ms)	10.66s	727.08 kb
pest	3.66ms (342.00ns)	3.89s (279.92ms)	31.31s	767.86 kb
hand_rolled	694.00ns (93.00ns)	551.86ms (77.66ms)	10.81s	718.87 kb


# #![allow(unused_variables)]
#fn main() {
extern crate duration;
use duration::{Duration, DurationPart};
extern crate combine;
use combine::{
    optional,
    Parser,
    range::recognize,
    parser::{
        item::item,
        byte::digit,
        repeat::{
            skip_many,
            skip_many1
        },
        repeat::skip_count,
    },
    error::UnexpectedParse,
};
pub fn parse(s: &str) -> Result<Duration, String> {
    let value = || {
        recognize((
            skip_many1(digit()),
            optional((
                item(b'.'),
                skip_many(digit())
            )),
        ))
        .and_then(|bs: &[u8]| {
            let s = ::std::str::from_utf8(bs).map_err(|_| UnexpectedParse::Unexpected)?;
            s.parse::<f32>().map_err(|_| UnexpectedParse::Unexpected)
        })
    };
    let pair = |time: bool| {
        (
            value(),
            combine::parser::item::any()
        ).and_then(move |(v, c): (f32, u8)| {
            let part = match c {
                b'Y' => DurationPart::Years(v),
                b'M' => if time {
                    DurationPart::Minutes(v)
                } else {
                    DurationPart::Months(v)
                },
                b'W' => DurationPart::Weeks(v),
                b'D' => DurationPart::Days(v),
                b'H' => DurationPart::Hours(v),
                b'S' => DurationPart::Seconds(v),
                _ => return Err(UnexpectedParse::Unexpected)
            };
            Ok(part)
        })
    };
    let date_part = combine::count(4, pair(false)).map(|p: Vec<DurationPart>| p);
    let time_part = skip_count(1, item(b'T')).and(combine::count(3, pair(true))).map(|(_, p): (_, Vec<DurationPart>)| p);
    let mut duration = skip_count(1, item(b'P')).and(date_part).and(time_part).map(|((_, d), t): ((_, std::vec::Vec<DurationPart>), std::vec::Vec<DurationPart>)| (d, t));
    let ((date_parts, time_parts), rem): ((Vec<DurationPart>, Vec<DurationPart>), &[u8]) = duration.parse(s.as_bytes()).map_err(|e| format!("{}", e))?;
    if rem.len() > 0 {
        return Err(format!("did not parse full string provided {}", String::from_utf8_lossy(rem)));
    }
    let mut ret = Duration::new();
    for part in date_parts.iter().chain(time_parts.iter()) {
        match part {
            DurationPart::Years(v) => ret.set_years(*v),
            DurationPart::Months(v) => ret.set_months(*v),
            DurationPart::Weeks(v) => ret.set_weeks(*v),
            DurationPart::Days(v) => ret.set_days(*v),
            DurationPart::Hours(v) => ret.set_hours(*v),
            DurationPart::Minutes(v) => ret.set_minutes(*v),
            DurationPart::Seconds(v) => ret.set_seconds(*v),
        }
    }
    Ok(ret)
}
#}


# #![allow(unused_variables)]
#fn main() {
extern crate duration;
use duration::{Duration, DurationPart};
extern crate combine;
use combine::{
    optional,
    Parser,
    range::recognize,
    parser::{
        item::item,
        byte::digit,
        repeat::{
            skip_many,
            skip_many1
        },
        repeat::skip_count,
    },
    error::UnexpectedParse,
};
pub fn parse(s: &str) -> Result<Duration, String> {
    let value = || {
        recognize((
            skip_many1(digit()),
            optional((
                item(b'.'),
                skip_many(digit())
            )),
        ))
        .and_then(|bs: &[u8]| {
            let s = ::std::str::from_utf8(bs).map_err(|_| UnexpectedParse::Unexpected)?;
            s.parse::<f32>().map_err(|_| UnexpectedParse::Unexpected)
        })
    };
    let pair = |time: bool| {
        (
            value(),
            combine::parser::item::any()
        ).and_then(move |(v, c): (f32, u8)| {
            let part = match c {
                b'Y' => DurationPart::Years(v),
                b'M' => if time {
                    DurationPart::Minutes(v)
                } else {
                    DurationPart::Months(v)
                },
                b'W' => DurationPart::Weeks(v),
                b'D' => DurationPart::Days(v),
                b'H' => DurationPart::Hours(v),
                b'S' => DurationPart::Seconds(v),
                _ => return Err(UnexpectedParse::Unexpected)
            };
            Ok(part)
        })
    };
    let date_part = combine::count(4, pair(false)).map(|p: Vec<DurationPart>| p);
    let time_part = skip_count(1, item(b'T')).and(combine::count(3, pair(true))).map(|(_, p): (_, Vec<DurationPart>)| p);
    let mut duration = skip_count(1, item(b'P')).and(date_part).and(time_part).map(|((_, d), t): ((_, std::vec::Vec<DurationPart>), std::vec::Vec<DurationPart>)| (d, t));
    let ((date_parts, time_parts), rem): ((Vec<DurationPart>, Vec<DurationPart>), &[u8]) = duration.parse(s.as_bytes()).map_err(|e| format!("{}", e))?;
    if rem.len() > 0 {
        return Err(format!("did not parse full string provided {}", String::from_utf8_lossy(rem)));
    }
    let mut ret = Duration::new();
    for part in date_parts.iter().chain(time_parts.iter()) {
        match part {
            DurationPart::Years(v) => ret.set_years(*v),
            DurationPart::Months(v) => ret.set_months(*v),
            DurationPart::Weeks(v) => ret.set_weeks(*v),
            DurationPart::Days(v) => ret.set_days(*v),
            DurationPart::Hours(v) => ret.set_hours(*v),
            DurationPart::Minutes(v) => ret.set_minutes(*v),
            DurationPart::Seconds(v) => ret.set_seconds(*v),
        }
    }
    Ok(ret)
}
#}

Ultimately the decision of which parser library to choose comes down to the exact problem and constraints that you are working with. If you need the performance and can afford the time to learn , nom is the clear winner but if you don't need the performance it is worth it to check out the other options.

I know that I haven't covered all of the parser crates that exist, feel free to open an issue on this book's repo and I might be able to include it.

If you are interested in seeing my progress on the JS parser you can checkout their repos.

RESS - The lexer
RESSA - The parser

Additional places to find me online

Twitter: @FreeMasen
GitHub: FreeMasen
Blog: WiredForge
Resume Site: RobertMasen.com
Cute pictures of my dog: juno.mashton

Evaluating Parser Options

Choosing a parser in Rust

Exploring 3 popular parser crates

Robert Masen

December 2018

Introduction

Why me?

Where to find me?

What We'll Cover

Our Format

ISO 8601 Durations

Basic Description

Examples

BNF Grammar

Common Code Items

nom

Overview

Positives

Negatives

Nom

Now for a demo!

Combine

Overview

Positives

Negatives

Combine

Demo Time!

Pest

Overview

Positives

Negatives

Pest

D-E-M-O

demo!

demo!

DIY

Overview

Positives

Negatives

DIY

Are we not men?

We are demo!

Performance

Build time

Bin size

Parse 1

Parse 1000

Benches

Combine (revisited)

How's about a demo?

Final Thoughts