An Elm implementation of the exercism.io challenge "Word Count" using Elm-Parser

by tkshill


October 24th 2020

Kirk Shillingford

Tl;dr
This is a short write-up on how I used the elm-parser library to solve a coding challenge on exercism.io. It took me a lot more time than I anticipated because there aren't many resources on parsing in Elm at the moment, so I'm hoping this article can add to what's available for others. To skip directly to the implementation, click here to see the breakdown of how I did it. To see the full code and tests, click here.


Intro

So this weekend I tried the Word-Count problem on exercism as part of a weekly challenge I'm doing with some friends. We've just started, but it's been fun so far, both as a chance to sharpen my Elm skills and as a way to introduce others to the language. I highly recommend it.

The details of the challenge are below, but essentially, it requires extracting some information out of a string of text.

Why use parsers where regular expressions would do?

My initial and immediate thought was to use regular expressions, since that's how I've always done pattern matching on strings before. Imagine my surprise, though, when I looked up the elm-regex library and saw the following line as the first thing in the library's README:

"Generally speaking, it will be easier and nicer to use a parsing library like elm/parser instead of this."

What kind of library tells you to go use another library?

Well, the Elm community is very intent on two things: trying to get things right, but also being practical. And sometimes that means you have to provide a solution while acknowledging that it's not the best one. The author goes on to explain why he thinks elm-regex shouldn't be your first choice for most things, and it's certainly an interesting argument. It's worth a read in the original doc, but a crude summary would be:

  • In general, regex is difficult to understand, maintain, and test
  • The relative ease of use of regex means it is frequently used for applications it was never designed to handle.
  • He'd rather the elm ecosystem evolve to have specific parser libraries for handling common validation like passwords, forms, etc.

I think he raises a valid point. Regex is simple, fast, and powerful, but there are disadvantages to using a string of text to validate other strings of text.

Can you guess what this regex does?

/^[a-z0-9_-]{6,18}$/

It's a common pattern for checking for valid passwords. The validation is described as:

We begin by telling the regex engine to match the beginning of the string (^), followed by any lowercase letter (a-z), number (0-9), underscore, or hyphen. Next, {6,18} makes sure there are at least 6 of those characters, but no more than 18. Finally, we want the end of the string ($).

And that's... fine? But is that obvious from the expression? Maybe, if you're familiar. But if you're not familiar with regex, can you tell whether you've typed the string correctly? And even if you are familiar, if you make a mistake, would you be able to spot it?

A regex can't tell you whether it's right or wrong. So we're left with unit tests as the only real safeguard against error, and that can be unwieldy.
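For contrast, here's a rough sketch of my own (not code from elm-regex's docs or from this exercise) of that same 6-to-18-character password rule written with elm/parser. It's longer, but every rule is a named, individually testable Elm value, and the compiler checks it for us:

import Parser exposing ((|.), Parser, andThen, chompIf, chompWhile, end, getChompedString, problem, succeed)


-- The regex character class [a-z0-9_-], as a named predicate
isAllowed : Char -> Bool
isAllowed c =
    Char.isLower c || Char.isDigit c || c == '_' || c == '-'


-- The same rule as /^[a-z0-9_-]{6,18}$/
password : Parser String
password =
    (getChompedString <|
        succeed ()
            |. chompIf isAllowed
            |. chompWhile isAllowed
    )
        |. end
        |> andThen
            (\s ->
                if String.length s >= 6 && String.length s <= 18 then
                    succeed s

                else
                    problem "expected 6 to 18 allowed characters"
            )

Don't worry if none of that reads clearly yet; the chompIf/chompWhile/getChompedString machinery is exactly what the rest of this article walks through.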

So considering all this I thought, why not give elm-parser a try. Surely it can't be too bad.*

*It was actually quite bad 😬, but that's not really the library's fault. The elm-parser library is pretty reasonable once you know what you're doing, and I did not. It took me several hours to get a working solution, and that involved innumerable backtracks, edits, and readings, re-readings, and re-re-readings of this article by Alex Korban (bless you, Alex) and the docs, because I was attempting to learn a library, learn the concept of parsing, and solve a specific problem all at once. Maybe not the best idea, but here we are.

Methodology

The rest of this article will be an attempt to explain my methodology had I known what I was doing from the start, because if I tried to take you through my actual path, this would become 50 pages long.

Instructions for Word-Count challenge on exercism.io

Let's examine the actual program requirements.

Given a phrase, count the occurrences of each word in that phrase.

For the purposes of this exercise you can expect that a word will always be one of:

  • A number composed of one or more ASCII digits (ie "0" or "1234") OR
  • A simple word composed of one or more ASCII letters (ie "a" or "they") OR
  • A contraction of two simple words joined by a single apostrophe (ie "it's" or "they're")

When counting words you can assume the following rules:

  • The count is case insensitive (ie "You", "you", and "YOU" are 3 uses of the same word)
  • The count is unordered; the tests will ignore how words and counts are ordered
  • Other than the apostrophe in a contraction all forms of punctuation are ignored
  • The words can be separated by any form of whitespace (ie "\t", "\n", " ")

At the end, we expose a single function wordCount with the type signature:

wordCount : String -> Dict String Int

which means the function accepts some string of text and returns a dictionary with String keys and Int values, representing our words and the frequency of their appearances respectively. How we implement this is left entirely up to us.
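To make the goal concrete, here's how I understand one of the exercise's sample cases (the dictionary shown is my own rendering of the expected result). Note that the quoting apostrophes around 'large' are ignored, while the contraction apostrophe in "can't" is kept:

wordCount "Joe can't tell between 'large' and large."
--> Dict.fromList
--      [ ( "and", 1 ), ( "between", 1 ), ( "can't", 1 )
--      , ( "joe", 1 ), ( "large", 2 ), ( "tell", 1 )
--      ]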

So what's our strategy?

Let's examine the problem. Our program needs to create a function that accepts a string of characters, and returns
a Dictionary with the words from the string as keys, and the frequency of their appearances as values.

We can break down our program into two distinct steps:

  1. Parse the string into a list of valid words
  2. Transform the list of words into our dictionary

Brief aside: What even is a parser?

A parser can be defined as "a computer program that breaks down text into recognized strings of characters for further analysis". Parsing can be seen as an alternative control flow to validating. Instead of checking whether data is in a state that can be used by your program/process, parsing attempts to transform the incoming data into that valid state; if the data cannot be transformed, the data was invalid and the parsing fails.

This is a powerful technique because it ensures that after the parsing step you either have:

  • Data that is exactly the shape you need for your business logic (and you don't need to validate it within your program logic again)
  • A representation of the incompatible input that your program can deal with before it can cause errors in your internal logic.

Parsers can be broken down into three distinct steps/parts:

A tokenizer breaks a stream of text into tokens, usually by looking for whitespace (tabs, spaces, new lines).

A lexer is basically a tokenizer, but it usually attaches extra context to the tokens -- this token is a number, that token is a string literal, this other token is an equality operator.

A parser takes the stream of tokens from the lexer and turns it into an abstract syntax tree representing the (usually) program represented by the original text.

For this example, our main focus is the tokenizing and lexing, as we're not actually constructing an abstract syntax tree; we're just directly consuming the list of valid words returned by our lexer.
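As a point of reference, Elm's core library already ships a trivial whitespace tokenizer, String.words. Our job is essentially a smarter version of this, one that also strips punctuation and classifies each token:

String.words "she sells seashells"
--> [ "she", "sells", "seashells" ]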

So like, how do we do that?

Implementation

Defining our tokens

type alias Contraction =
    Maybe String


type ValidWord
    = Numeric String
    | Alphabetic String Contraction

First, let's use the type system to define the tokens we want to return from our parser. According to the instructions, we can expect two types of valid words: one is a sequence of all numeric characters, and one is a sequence of all alphabetic characters which may or may not contain a single apostrophe. Our type ValidWord will represent these two options. Note how the Alphabetic branch takes two arguments; the first is a String representing a sequence of characters. The second, Contraction, is a Maybe String. Words with apostrophes will therefore have a contraction of Just somestring, while words without will have a contraction of Nothing.
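To make that concrete, here's how a few inputs would map onto these tokens (my own illustration; note that the contraction string keeps its apostrophe, because our parser will chomp the apostrophe as part of the contraction):

Numeric "1234"                  -- from "1234"
Alphabetic "they" Nothing       -- from "they"
Alphabetic "they" (Just "'re")  -- from "they're"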

Creating a parser for Numeric valid words

isIgnorable : Char -> Bool
isIgnorable =
    not << Char.isAlphaNum


numericWord : Parser ValidWord
numericWord =
    succeed Numeric
        |. chompWhile isIgnorable
        |= (getChompedString <|
                succeed identity
                    |. chompIf Char.isDigit
                    |. chompWhile Char.isDigit
           )
        |. chompWhile isIgnorable

Okay, I know this is a lot. I'm going to try my best to explain what's going on here.

  • isIgnorable is just a helper function that returns True if a character isn't alphabetic or numeric. We're going to use it when we need to check for spaces and random symbols between words
  • numericWord is our ValidWord Parser. Specifically it tries to create the Numeric branch of the ValidWord type. It does not handle the Alphabetic case (we'll have another parser for that).
    • succeed is the first function we see from the parser library; for now we can just treat it as a function that helps us start off a definition of a sequence of parsing steps
    • |. is a helper operator from the parsing library that lets you parse part of the stream without passing back the contents. It's great for when you want to ensure something exists, but you don't want to pass it to your token constructors
    • chompWhile does exactly what it sounds like: it consumes the stream of text as long as the condition passed into it is satisfied. Combined with |. we're essentially saying "there could be some non-alphanumeric characters at the beginning; we don't care about them"
    • |= is the opposite of |. in that it does pass back the result of the parser that comes after it
    • getChompedString - typically chompers don't return anything; they just consume and check. getChompedString returns whatever you've chomped as a String. (Note that it took me forever to figure out that this was the best way to get back weirdly defined strings like this. It's not the only way - see Parser.variable - but it is the best way for this particular situation.) We pass getChompedString another sub-parser that consumes exactly what we need to build our Numeric token
      • identity is a common helper function in functional programming that just returns whatever is passed into it
      • chompIf chomps only if the next character matches the given predicate; in this case we want to chomp only if the character is a digit
      • We follow it up with a chompWhile to say "after identifying at least one digit, you can have a sequence of as many more digits as you like"
    • Finally, we add one more check for non-alphanumeric characters.
  • Altogether we can read this Parser as:

"Hey, check if there are some nonalphanumeric characters at the start of this thing. I don't care how many. Then, you should see at least one digit. Grab that sequence of digits. Then you should see some more nonalphanumeric characters. The End" ❤️

That was a lot! But essentially we now have a Parser that can handle numeric valid words.
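If you load the code into the Elm REPL, a quick sanity check (my own, not from the article's tests) looks like this; the leading spaces and the trailing "! " are chomped and ignored, and only the digits come back in the token:

Parser.run numericWord "  42! "
--> Ok (Numeric "42")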

Creating a Parser for Alphabetic Valid Words

apostrophe : Char
apostrophe =
    Char.fromCode 39

contraction : Parser (Maybe String)
contraction =
    (getChompedString <|
        succeed identity
            |. chompIf ((==) apostrophe)
            |. chompIf Char.isAlpha
            |. chompWhile Char.isAlpha
    )
        |> Parser.map Just


alphabeticWord : Parser ValidWord
alphabeticWord =
    succeed Alphabetic
        |. chompWhile isIgnorable
        |= (getChompedString <|
                succeed identity
                    |. chompIf Char.isAlphaNum
                    |. chompWhile Char.isAlpha
           )
        |= oneOf [ backtrackable contraction, succeed Nothing ]
        |. chompWhile isIgnorable

"Oh no, it's even more complicated than the numeric one." If you're thinking that, you are right. But the basic technique is the same. The main difference is that Alphabetic ValidWords accept two pieces of data: a first String for the alphabetic text before an apostrophe, and a second Maybe String that represents the second piece of a contraction. Let's go over the things here that are different from the Numeric version.

  • apostrophe - First, we can see we've defined a constant called apostrophe, which represents the apostrophe character. I did this because Elm delimits characters with apostrophes/single quotes. So "n" is the String n but 'n' is the Char n. That's typically fine, but in the case of the apostrophe itself, its Char literal would have to be ''', which the compiler didn't seem to like. 😬 So we used its Unicode code point (39) to create the apostrophe character. There may be other ways to do this, but this seemed clear and reasonable at the time.
  • contraction is the Parser for the contraction part of our valid word. Note that the first thing it checks is if the first character is an apostrophe.
    • Then we check if the apostrophe is followed by an alphabetic character. This means that trailing apostrophes are rejected.
    • Parser.map is a helper function with the type signature Parser.map : (a -> b) -> Parser a -> Parser b. It allows us to transform the value inside a parser and return a new Parser of the transformed value. In this case, we're wrapping the String from getChompedString in a Maybe value.
    • Altogether we can interpret contraction as:
      "Hey, check if there's an apostrophe, followed by at least one alphabetic character. Wrap this in a Maybe."
  • alphabeticWord is the main function for producing a parser for the Alphabetic branch of the ValidWord type. Its structure is similar to numericWord: check for some ignorable characters, then look for a sequence of alphabetic characters
    • oneOf is a function we haven't seen before. oneOf applies a list of parsers to a sequence, trying them one by one until one is accepted.
      • backtrackable is a helper function in the parser library that says that if the wrapped parser fails, jump back to where it began before trying the next parser
      • Altogether, oneOf tries our contraction parser, and if it fails (because no contraction exists), it falls back to succeed Nothing, the other case of our Maybe
    • Note that alphabeticWord has two |= operators, each one supplying an input to the Alphabetic constructor. We can read the entire parser as:
      "Hey, check if there are some non-alphanumeric characters. Then, grab a sequence of at least one alphabetic character. Then, if you see an apostrophe followed by more alphabetic characters, return Just that; otherwise, return Nothing."
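A couple of REPL checks of my own (not from the article's tests) show both branches of the Maybe in action; in the second example, backtrackable is what lets the parser give up on the "..." as a contraction and fall back to Nothing:

Parser.run alphabeticWord "they're"
--> Ok (Alphabetic "they" (Just "'re"))

Parser.run alphabeticWord "cat..."
--> Ok (Alphabetic "cat" Nothing)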

To be honest, I feel like it's hard to explain these concepts if you're not starting from first principles, so I can only hope that the examples plus the write up have enough context to help people along.

Converting a List to a Dictionary

Before we tie the knot on our Parsers, let's jump over for a second to talk about converting a List of ValidWords to a Dictionary of String keys and Int values.

validWordToString : ValidWord -> String
validWordToString vw =
    case vw of
        Numeric word ->
            word

        Alphabetic word contraction_ ->
            contraction_
                |> Maybe.withDefault ""
                |> String.append word

dictUpdate : String -> Dict String Int -> Dict String Int
dictUpdate key dict =
    case Dict.get key dict of
        Just count ->
            Dict.insert key (count + 1) dict

        Nothing ->
            Dict.insert key 1 dict

To do this we have to define two helper functions, validWordToString and dictUpdate.

validWordToString does exactly what it says, and can convert any instance of ValidWord into a simple String. We use case statements to determine which type of ValidWord we're dealing with. Note that while the Numeric case is fairly straightforward pattern decomposition, the Alphabetic case has that pesky contraction, which is a Maybe String. We use Maybe.withDefault here to provide a default value in case the contraction is Nothing (Our default is the empty string "") and we use String.append to attach it to the first part of the string.
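For example (my own spot checks, not from the article's tests), both branches collapse back into plain strings like this:

validWordToString (Numeric "1234")
--> "1234"

validWordToString (Alphabetic "they" (Just "'re"))
--> "they're"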

dictUpdate is a function that takes a key and a Dictionary and does the following:

  • if the key already exists in the dictionary, replace the value at that key with itself plus one
  • if the key does not exist, insert the key into the dictionary with a value of 1.
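As an aside, elm/core's Dict.update can express the same get-then-insert logic in a single call. This is an equivalent alternative of mine, not the article's code:

dictUpdate : String -> Dict String Int -> Dict String Int
dictUpdate key =
    -- Missing key counts as 0, then we add 1 either way
    Dict.update key (\maybeCount -> Just (Maybe.withDefault 0 maybeCount + 1))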

These two functions will be combined with the last few functions we have to put this all together.

Putting it all together with loops

validWordHelper : List ValidWord -> Parser (Step (List ValidWord) (Dict String Int))
validWordHelper revValidWords =
    let
        continueLooping : ValidWord -> Step (List ValidWord) (Dict String Int)
        continueLooping vw =
            Loop (vw :: revValidWords)

        stopLooping : a -> Step (List ValidWord) (Dict String Int)
        stopLooping _ =
            revValidWords
                |> List.map (validWordToString >> String.toLower)
                |> List.foldl dictUpdate Dict.empty
                |> Done
    in
    oneOf
        [ succeed continueLooping
            |= oneOf [ backtrackable alphabeticWord, numericWord ]
        , succeed () |> Parser.map stopLooping
        ]


validWord : Parser (Dict String Int)
validWord =
    loop [] validWordHelper

We're almost at the finish line of our parser. We've created two parsers that can handle either case of our valid tokens, numericWord and alphabeticWord, but each one only produces a single token. We still need to define a way to turn a string into a sequence of tokens.

Enter the Parser.loop function. This is by far the gnarliest part of this application.

loop has the type signature

loop : state -> (state -> Parser (Step state a)) -> Parser a

If you find this signature a little intimidating, don't worry about it. I literally only understood it while writing this document. 😅 It also helps if you're used to folding functions for sequences.

What can help is if you notice that the loop function references another Parser library type we haven't seen yet, Step.

Step is defined by:

type Step state a
    = Loop state
    | Done a

Essentially, loop lets you traverse a sequence, applying a parser (or parsers) and updating some "state" every time a parser succeeds. The Loop branch of the Step type defines what to do next (keep going with the updated state), and the Done branch defines what to return once there are no more tokens to be captured.

In our case, the "state" we're updating is a List ValidWord, and our "a", the shape of the data once the loop terminates, is the Dict String Int we want as our final product.

As we chomp through our string, we bring along this constantly updating state, and finally, when we stopLooping, we use the validWordToString and dictUpdate functions we defined earlier: we transform our List ValidWord state into a List String, then "fold" that list into a dictionary, using dictUpdate to add each string in our list to the dictionary correctly.
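The folding step on its own behaves like this (a made-up example of mine; note that because dictUpdate only increments counts, the fact that revValidWords is in reverse order doesn't matter):

[ "olly", "olly", "in", "come", "free" ]
    |> List.foldl dictUpdate Dict.empty
--> Dict.fromList [ ( "come", 1 ), ( "free", 1 ), ( "in", 1 ), ( "olly", 2 ) ]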

If this is still confusing to you, that's okay. It's definitely not the most intuitive thing if you haven't worked with functional parsers before. I wrote this article so people could explore this code at their leisure, while using the docs and other resources for reference.

And finally...

Last but not least, we define the actual function that exposes our API: wordCount.

wordCount : String -> Dict String Int
wordCount sentence =
    case Parser.run validWord sentence of
        Ok parsed ->
            parsed

        Err _ ->
            Dict.empty

Up until now, we've just been constructing Parsers. Parser.run is the function that actually applies a Parser a to a String and returns a value. To be more specific, Parser.run has the signature:

run : Parser a -> String -> Result (List DeadEnd) a

Here we see another well-known functional programming type: the Result type!

Result is defined as:

type Result error value
    = Ok value
    | Err error

Like Maybe, Result is used to encapsulate data from operations that may succeed or fail. Unlike Maybe which simply returns Nothing, Result has an Err a case where that a can be any data type we want, which you can use to hold extra context for your error.

In the case of the run function, its associated error type is a List DeadEnd, which is basically a list of all the ways your parser failed.

All this to say, when we run our parser in wordCount, we have to handle both cases of Result: the case where it succeeds (in which case we return the successfully parsed Dictionary), and the case where it fails (we choose to ignore the error and just return an empty Dictionary).
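Tying it all together, a final REPL check of my own (not from the article's tests) shows the whole pipeline; validWord loops over the tokens, lowercases them, and folds them into counts:

Parser.run validWord "one fish two fish"
--> Ok (Dict.fromList [ ( "fish", 2 ), ( "one", 1 ), ( "two", 1 ) ])

wordCount "one fish, two fish."
--> Dict.fromList [ ( "fish", 2 ), ( "one", 1 ), ( "two", 1 ) ]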

Conclusion

Recap

  • We needed to create a function that accepted a string and converted it into a dictionary of key words and the number of times they occur
  • Valid words could be a sequence of numbers, a sequence of letters, or letters with an apostrophe embedded in them, so we created a "token" type, ValidWord, to represent the ways of producing a valid word
  • We defined a parser for each case of ValidWord, designed to chomp it from a string of text
  • We defined a loop that would keep trying to chomp a ValidWord from our string of text
  • We defined some helper functions to convert between our major types: ValidWord, String, List, and Dict

Final Thoughts

I spent way too much time on this project but I really enjoyed it. Parsing is a foundational technique of software development and the elm-parser library is actually extremely robust. I could spend a lot of time tweaking this solution for better performance, or more strict enforcement of the type rules, but I think I can leave it in this space for now. I won't presume to think this is the perfect article. But perfect is the enemy of the good.

Additional Resources

  • Here's a gist of the full solution, and the corresponding tests
  • Here's a link to an online Elm editor where I've made a very very crude interface for trying out some of the test samples.

If you can think of anything I could do to improve this article, please reach out to me at tkshillinz@icloud.com. I'm also in the Elm and F# Slack groups as @kirk Shillingford.

Thank you for your time. 🙏🏾
