Parsing the IRC message format as a client
The IRC protocol is infested with annoyances. One of these is parsing the messages sent to the client by the server. This is a problem I needed to solve when developing IRC Tools. Without a proper parsing mechanism, the entire set of libraries would be completely useless. How did I do it? Read on!
The Message Format
First, it was necessary to understand the format of IRC messages. The primary annoyance of parsing IRC messages is due to the variation in the message format. The format is documented in IRC Client Protocol, RFC 2812. Creating an IRC client requires intimate familiarity with that RFC, so I recommend keeping it handy.
The message format is:
:<prefix> <command> <params> :<trailing>
And here is an example:
:CalebDelnay!calebd@localhost PRIVMSG #mychannel :Hello everyone!
The only required part of the message is the command name. Everything else is optional and can be mixed and matched. Some other examples:
:CalebDelnay!calebd@localhost QUIT :Bye bye!
:CalebDelnay!calebd@localhost JOIN #mychannel
:CalebDelnay!calebd@localhost MODE #mychannel -l
PING :irc.localhost.localdomain
I'll break down each of the four parts. <prefix> represents the origin of the message, if applicable. In the above examples, the prefix is CalebDelnay!calebd@localhost, which is my example user. It indicates that those messages originated from CalebDelnay (vs the server itself or another user). If there is no prefix, then the source of the message is the server for the current connection, as in the PING example.
The <command> part is, surprisingly, the command. In the above examples, PRIVMSG, QUIT, JOIN, MODE, and PING are the commands. Based on the type of command, the client is expected to react appropriately. In the case of the PRIVMSG example, the UI of the client might display something like: <CalebDelnay> Hello everyone!. With QUIT and JOIN, the UI of the client would indicate that a user has quit the server or joined the channel respectively and also modify its internal state to reflect those changes.
While many messages contain specific textual commands like above, some messages have numeric replies instead. A numeric reply is sent to the client by the server in response to a command sent to the server by the client. For example, if a client sends the server NICK Caleb and that nickname is already in use, the server will send a numeric reply back to the client like so: :irc.localhost.localdomain 433 Caleb :Nickname is already in use. The 433 bit is the command, and the RFC for the IRC protocol states that the numeric reply 433 should be used if a nickname is already in use.
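One way a client might tell a numeric reply from a textual command is to check whether the command consists of exactly three digits, which is how RFC 2812 defines numerics. The snippet below is only a sketch of that idea; the NumericReplyDemo class and the IsNumericReply helper are hypothetical names, not part of IRC Tools.

```csharp
using System;
using System.Linq;

public static class NumericReplyDemo
{
    // Hypothetical helper: true when the command is a three-digit numeric reply.
    public static bool IsNumericReply(string command)
    {
        return command.Length == 3 && command.All(c => c >= '0' && c <= '9');
    }

    public static void Main()
    {
        Console.WriteLine(IsNumericReply("433"));     // True
        Console.WriteLine(IsNumericReply("PRIVMSG")); // False
    }
}
```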
The <params> bit is a set of space separated parameters. Not all messages have parameters, but many do. In the previous examples, the channel name (#mychannel) of the PRIVMSG, JOIN, and MODE messages is a parameter. The MODE message in particular has two parameters, the channel and the mode that was changed.
Finally, <trailing> is a special type of parameter. Because parameters are space separated, it isn't possible to include a parameter with a space in the normal set of parameters. For that reason, the very last parameter is indicated with a leading colon, telling the client that everything after the colon should be interpreted together. This allows a message to carry one fully textual piece (e.g. a sentence).
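To illustrate why the trailing part needs special handling, the sketch below splits an example message on spaces alone. The NaiveSplitDemo class is purely illustrative; the point is that the sentence after the colon is broken into separate pieces.

```csharp
using System;

public static class NaiveSplitDemo
{
    // Splits on every space, ignoring the special meaning of the colon.
    public static string[] NaiveSplit(string message)
    {
        return message.Split(' ');
    }

    public static void Main()
    {
        string[] pieces = NaiveSplit("PRIVMSG #mychannel :Hello everyone!");
        foreach (string piece in pieces)
            Console.WriteLine(piece);
        // Prints four pieces: PRIVMSG, #mychannel, :Hello and everyone!
    }
}
```

The text "Hello everyone!" ends up as two pieces, with the colon still attached to the first, which is exactly what the trailing parameter is meant to prevent.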
Parsing Technique #1: Applied Logic
The goal when parsing an IRC message is to extract the four pieces discussed previously: the prefix, command, parameters, and trailing parameter. The difficulty comes from handling the variety created by the optional portions of an IRC message.
Using the message structure it becomes fairly easy to overcome that difficulty. Because only the command part of a message is required, the parsing code will require a few if conditionals to determine what parts of the message are present. Note that the code snippets below include variables that are declared outside of the snippet. Also note that all of the code examples are written in C# but can easily be rewritten for other languages.
The Prefix
The presence of the prefix is indicated by the message beginning with a colon character, so it is possible to use String.StartsWith to test whether the message string does indeed begin with a colon. If it does, the prefix must be extracted from the message. The prefix cannot contain spaces and thus can be extracted by grabbing the substring of the message starting at the 2nd character (to skip the colon) and continuing until the first space. Below is the code to do that. The end of the prefix is stored in the prefixEnd variable because the value will need to be referenced in a later snippet of code.
if (message.StartsWith(":"))
{
    prefixEnd = message.IndexOf(" ");
    prefix = message.Substring(1, prefixEnd - 1);
}
It's tempting to continue parsing the message by extracting the command, the parameters, and the trailing part in sequence. However, it is possible to reduce the amount of code needed by extracting the trailing part first. It'll make sense in a moment, trust me.
The Trail
The defining characteristic of the trailing part is that it also begins with a colon but is preceded by a space. Combining that fact with the fact that the trailing part continues until the end of the message, extracting it becomes a straightforward exercise. Simply grab the substring of the message that begins at the first occurrence of " :" (a space and colon).
Below is some example code. The conditional checks that the message does in fact contain a trailing part by checking the presence of " :" (note there is a space before the colon) and, if it does, proceeds to extract it. Note that the "+ 2" is to exclude the space and colon from being included within the trailing part. The trailingStart variable is used to store the starting position of the trailing part. If there is no trailing part, the start index is indicated as being at the end of the message using message.Length. This becomes important in the command and parameter extraction code.
trailingStart = message.IndexOf(" :");
if (trailingStart >= 0)
    trailing = message.Substring(trailingStart + 2);
else
    trailingStart = message.Length;
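Two details of this approach are worth seeing in action: the colon that leads the prefix is not preceded by a space, so it is never mistaken for the trailing part, and because only the first " :" is used, a later " :" inside the text stays part of the trailing part. The following sketch shows both; the TrailingDemo class and ExtractTrailing helper are hypothetical names used only for illustration.

```csharp
using System;

public static class TrailingDemo
{
    // Returns the trailing part, or null when the message has none.
    public static string ExtractTrailing(string message)
    {
        int trailingStart = message.IndexOf(" :");
        return trailingStart >= 0 ? message.Substring(trailingStart + 2) : null;
    }

    public static void Main()
    {
        // The second " :" belongs to the text and is left untouched.
        Console.WriteLine(ExtractTrailing(
            ":CalebDelnay!calebd@localhost PRIVMSG #mychannel :Time is 10 :30"));
        // Prints: Time is 10 :30

        // No trailing part at all.
        Console.WriteLine(ExtractTrailing("JOIN #mychannel") == null);
        // Prints: True
    }
}
```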
The Command and Parameters
Extracting the command and the parameters is next. By extracting the prefix and trailing parts first and determining when the prefix stops and the trailing part begins, it makes the job of pulling out the command and parameters a lot easier. Simply passing the prefix end and the difference between the trailing start and prefix end into String.Substring on the message will return the command and parameters. The second parameter is a difference between the two because String.Substring expects the length of the substring as the second parameter, not the ending index.
var commandAndParameters = message.Substring(
    prefixEnd + 1,
    trailingStart - prefixEnd - 1).Split(' ');
The +1 and -1 are to compensate for spaces that are potentially present before or after the command and parameters. What about messages that don't include a prefix or trailing part? If prefixEnd is initialized to -1 (the same value returned by String.IndexOf if nothing is found, for reference) the +1 that normally compensates for a space results in 0 being passed into String.Substring, which translates to "the beginning of the string" in English.
By initializing trailingStart to the length of the message and ensuring it either stays that way or holds the true index of the start of the trailing part (if present), the math always yields the correct length for the command and parameters. When prefixEnd is -1 (meaning no prefix is present), subtracting it adds 1 to trailingStart, and that extra 1 is immediately cancelled by the final - 1, so the length is simply trailingStart.
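The index arithmetic is easier to follow with concrete numbers. The sketch below applies the same calculation to three of the example messages, with the index values worked out in the comments. The IndexMathDemo class and its helper are illustrative names only.

```csharp
using System;

public static class IndexMathDemo
{
    // Returns the command and parameters as one string, using the
    // prefix end and trailing start exactly as described above.
    public static string CommandAndParameters(string message)
    {
        int prefixEnd = -1;
        if (message.StartsWith(":"))
            prefixEnd = message.IndexOf(" ");

        int trailingStart = message.IndexOf(" :");
        if (trailingStart < 0)
            trailingStart = message.Length;

        return message.Substring(prefixEnd + 1, trailingStart - prefixEnd - 1);
    }

    public static void Main()
    {
        // prefixEnd = 29, trailingStart = 48, so Substring(30, 18).
        Console.WriteLine(CommandAndParameters(
            ":CalebDelnay!calebd@localhost PRIVMSG #mychannel :Hello everyone!"));
        // Prints: PRIVMSG #mychannel

        // prefixEnd = -1, trailingStart = 4, so Substring(0, 4).
        Console.WriteLine(CommandAndParameters("PING :irc.localhost.localdomain"));
        // Prints: PING

        // prefixEnd = -1, trailingStart = 15 (the length), so Substring(0, 15).
        Console.WriteLine(CommandAndParameters("JOIN #mychannel"));
        // Prints: JOIN #mychannel
    }
}
```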
The result is that commandAndParameters is a string array of the command and parameters. Extracting the command is simple, as it will always be the first element in the array.
command = commandAndParameters.First();
Because the command is always required, it isn't necessary to wrap this code in any conditionals. Regardless of what else the message contains (meaning if it has a prefix or not, or a trailing part or not, or even parameters), the command will always be the first element in the array. All of the following elements will be the parameters, although their presence is not guaranteed. In the previous example I use the LINQ extension method First, but the more traditional (and faster) index syntax could be used: command = commandAndParameters[0]; just as well.
Grabbing the final part, the parameters, is easy. They're already stored alongside the command within the commandAndParameters variable. Parameters are optional, so it is necessary to first check that they exist. This can be done by testing that commandAndParameters.Length is greater than 1. Remember that commandAndParameters will always have at least one element, the command, so if it has more than one element the rest must be parameters. Pulling them means grabbing the array sans the first element, like so:
if (commandAndParameters.Length > 1)
    parameters = commandAndParameters.Skip(1).ToArray();
Putting it together
Taking all of the previous code examples and explanation, the result is the function below. It has one normal argument (the message string) and three out arguments (the prefix, command, and parameters combined with the trailing part). Why three out arguments instead of four? Well, as I mentioned earlier, the trailing part is actually a special kind of parameter. From the client's perspective, it shouldn't give any special treatment to the trailing part and should interpret it as the last parameter of the message.
At the end of the function is an if statement to check the validity of the trailing part. If it has contents, the parameters, excluding the trailing part, and the trailing part are combined into the final set of parameters.
using System;
using System.Linq;

static void ParseIrcMessage(string message, out string prefix, out string command, out string[] parameters)
{
    int prefixEnd = -1, trailingStart = message.Length;
    string trailing = null;
    prefix = command = String.Empty;
    parameters = new string[] { };

    // Grab the prefix if it is present. If a message begins
    // with a colon, the characters following the colon until
    // the first space are the prefix.
    if (message.StartsWith(":"))
    {
        prefixEnd = message.IndexOf(" ");
        prefix = message.Substring(1, prefixEnd - 1);
    }

    // Grab the trailing if it is present. If a message contains
    // a colon immediately following a space, all characters after
    // the colon are the trailing part.
    trailingStart = message.IndexOf(" :");
    if (trailingStart >= 0)
        trailing = message.Substring(trailingStart + 2);
    else
        trailingStart = message.Length;

    // Use the prefix end position and trailing part start
    // position to extract the command and parameters.
    var commandAndParameters = message.Substring(prefixEnd + 1, trailingStart - prefixEnd - 1).Split(' ');

    // The command will always be the first element of the array.
    command = commandAndParameters.First();

    // The rest of the elements are the parameters, if they exist.
    // Skip the first element because that is the command.
    if (commandAndParameters.Length > 1)
        parameters = commandAndParameters.Skip(1).ToArray();

    // If the trailing part is valid add the trailing part to the
    // end of the parameters.
    if (!String.IsNullOrEmpty(trailing))
        parameters = parameters.Concat(new string[] { trailing }).ToArray();
}
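To see what the function produces, a small harness along the following lines could be used. It repeats the parsing logic in condensed form so that it compiles on its own; the ParserUsageDemo class name and the output format are illustrative assumptions only.

```csharp
using System;
using System.Linq;

public static class ParserUsageDemo
{
    // Same logic as the function above, condensed for a self-contained example.
    public static void ParseIrcMessage(string message, out string prefix,
        out string command, out string[] parameters)
    {
        int prefixEnd = -1;
        string trailing = null;
        prefix = String.Empty;

        if (message.StartsWith(":"))
        {
            prefixEnd = message.IndexOf(" ");
            prefix = message.Substring(1, prefixEnd - 1);
        }

        int trailingStart = message.IndexOf(" :");
        if (trailingStart >= 0)
            trailing = message.Substring(trailingStart + 2);
        else
            trailingStart = message.Length;

        string[] commandAndParameters = message
            .Substring(prefixEnd + 1, trailingStart - prefixEnd - 1).Split(' ');
        command = commandAndParameters[0];
        parameters = commandAndParameters.Skip(1).ToArray();

        if (!String.IsNullOrEmpty(trailing))
            parameters = parameters.Concat(new[] { trailing }).ToArray();
    }

    public static void Main()
    {
        string[] samples =
        {
            ":CalebDelnay!calebd@localhost PRIVMSG #mychannel :Hello everyone!",
            ":CalebDelnay!calebd@localhost MODE #mychannel -l",
            "PING :irc.localhost.localdomain"
        };

        foreach (string sample in samples)
        {
            string prefix, command;
            string[] parameters;
            ParseIrcMessage(sample, out prefix, out command, out parameters);
            Console.WriteLine("prefix=[{0}] command=[{1}] parameters=[{2}]",
                prefix, command, String.Join("|", parameters));
        }
        // First line printed:
        // prefix=[CalebDelnay!calebd@localhost] command=[PRIVMSG] parameters=[#mychannel|Hello everyone!]
    }
}
```

Note how the trailing part of the PRIVMSG and PING messages simply shows up as the last element of the parameters array.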
Parsing Technique #2: Regular Expressions
Regular Expressions can be a blessing or a curse. On one side, a well crafted regular expression can reduce 100 lines of code down to one line. On the other side, deciphering that one line is akin to reading an ancient forgotten language.
As a challenge, I took it upon myself to craft a single regular expression that is capable of parsing any variation of an IRC message. The result is this .NET regular expression:
^(:(?<prefix>\S+) )?(?<command>\S+)( (?!:)(?<params>.+?))?( :(?<trail>.+))?$
If that looks like hieroglyphics, I recommend Googling for more information on regular expressions and then reading the MSDN page about regular expressions in .NET.
Breaking it down
As before, the problem when parsing an IRC message is the variation created by the optional parts of the message. Handling the variation increases complexity in the code, be it pure C# (or any other language) or a regular expression. The regular expression I created is actually fairly tame as far as regular expressions go. So how does it work?
Here is the expression roughly broken apart into its major components. Note that the blank lines represent spaces, as seen in the condensed expression above.
^
(
    :
    (?<prefix>\S+)

)?
(?<command>\S+)
(

    (?!:)
    (?<params>.+?)
)?
(

    :
    (?<trail>.+)
)?
$
The first part to notice is that IRC messages are wholly contained, necessitating the use of the caret and dollar sign characters to match the beginning and end of the input respectively. The expression itself can be seen as having four different components corresponding to the parts of an IRC message: the prefix, command, parameters, and trailing parts.
Each of the parts, excluding the command, follows a specific trend. The part and its related text are grouped together and marked as optional, which is the question mark after each group. Within each group is the text that helps identify the part and the part itself.
The Prefix
The prefix, for example, always starts with a colon, so the regular expression group also starts with the colon. Following the colon is a named capture group, indicated with the ?<prefix> inside the parentheses. The named capture group, prefix in this case, will match one or more non-whitespace characters, which is the \S+ part of the expression.
After the named capture group is a space because the IRC message format always includes a space separating the prefix and command parts. That space will only be present if the prefix is present, so the space is included into the optional group with the colon and prefix capture group to ensure that a match isn't possible if the input begins with a space.
The Command
The command part is the only required part of an IRC message. Thus, in the regular expression it is the only part that must always match something. A command cannot contain white space characters, so a match involves only capturing one or more characters up to the first white space character.
The Parameters
Correctly matching the parameters proved to be quite tricky. The parameters are the block of text starting after the command and continuing until the end of the message or until the first occurrence of a space followed by a colon. The parameters group should therefore match only if the space after the command is not immediately followed by a colon. This is achieved using a negative look-ahead, (?!:), placed right after the space. A negative look-ahead consumes no characters; it causes matching to fail at that position if the pattern inside it (here, a colon) would match.
If there is no colon, matching continues with the named capture group params. This capture group tries to grab one or more of any character but is non-greedy, as indicated by the question mark after the quantifier. A non-greedy quantifier matches as few characters as possible, extending one character at a time only until the rest of the expression can match. Thus, the params capture group stops at the end of the input or at the first space followed by a colon, because that is the lead-in to the trailing part.
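The effect of the non-greedy quantifier is easiest to see by comparing it against a greedy variant of the same expression. The sketch below does that; the Greedy pattern is a deliberately broken variation made up for this comparison, and the LazyQuantifierDemo class is an illustrative name.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LazyQuantifierDemo
{
    // The expression from this article, with a non-greedy params group.
    public const string Lazy =
        @"^(:(?<prefix>\S+) )?(?<command>\S+)( (?!:)(?<params>.+?))?( :(?<trail>.+))?$";

    // Hypothetical variant with a greedy params group, for comparison only.
    public const string Greedy =
        @"^(:(?<prefix>\S+) )?(?<command>\S+)( (?!:)(?<params>.+))?( :(?<trail>.+))?$";

    // Formats the params and trail groups captured by the given pattern.
    public static string Describe(string pattern, string message)
    {
        Match match = Regex.Match(message, pattern);
        return "params=[" + match.Groups["params"].Value + "] trail=["
            + match.Groups["trail"].Value + "]";
    }

    public static void Main()
    {
        string message = "PRIVMSG #mychannel :Hello everyone!";

        Console.WriteLine(Describe(Lazy, message));
        // Prints: params=[#mychannel] trail=[Hello everyone!]

        Console.WriteLine(Describe(Greedy, message));
        // Prints: params=[#mychannel :Hello everyone!] trail=[]
    }
}
```

With the greedy quantifier the params group swallows the rest of the message, leaving nothing for the trail group to capture.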
The Trail
A space and a colon indicate the start of the trail. So the optional capture group must first match a space and a colon and then match all other characters as part of the named capture group trail. Because the trail is the last part of the message and can contain any character, there are no special tricks needed other than leading with the space and colon to indicate the start of the match.
The Beauty
The beauty of this regular expression is that it can handle any properly formatted IRC message while capturing all of the parts of that message without grabbing anything additional, such as the colon leading the prefix and trailing parts, or the spaces between each of the parts.
Using the regular expression to create a function that can parse IRC messages results in something like this:
using System;
using System.Linq;
using System.Text.RegularExpressions;

static void ParseIrcMessageWithRegex(string message, out string prefix, out string command, out string[] parameters)
{
    string trailing = null;
    prefix = command = String.Empty;
    parameters = new string[] { };

    Regex parsingRegex = new Regex(@"^(:(?<prefix>\S+) )?(?<command>\S+)( (?!:)(?<params>.+?))?( :(?<trail>.+))?$", RegexOptions.Compiled | RegexOptions.ExplicitCapture);
    Match messageMatch = parsingRegex.Match(message);

    if (messageMatch.Success)
    {
        prefix = messageMatch.Groups["prefix"].Value;
        command = messageMatch.Groups["command"].Value;

        // Only split when the params group actually matched. Splitting
        // an empty string would yield a single empty parameter.
        if (messageMatch.Groups["params"].Success)
            parameters = messageMatch.Groups["params"].Value.Split(' ');

        trailing = messageMatch.Groups["trail"].Value;
        if (!String.IsNullOrEmpty(trailing))
            parameters = parameters.Concat(new string[] { trailing }).ToArray();
    }
}
There are a few things to note in the code above.
First, the regular expression is created using the RegexOptions.ExplicitCapture option. That option tells the regular expression engine to capture only named groups. The parsing regular expression uses several unnamed groups purely to mark optional portions of the message, and capturing their contents would be a waste of time.
Second, in that particular code example a new Regex object is created every time the function executes. In reality this function would be part of a class and the Regex object would be held in a static field. It would then be created and compiled once, when the class is initialized, and as a result each call to the function would have less overhead.
Third, the parameters are captured as a single string. The parameters aren't very useful jammed together in a string, so a simple call to Split using the space character as the separator breaks the parameters up into the required string array. Also, as before, the trailing part is just another parameter. If it is valid it is added as the last parameter to the parameters array.
Fourth, read the next section of this article.
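The second note, about creating the Regex object only once, could look roughly like the sketch below. The IrcRegexParser class name and the TryParse signature (which returns false when the input does not match) are assumptions made for this example rather than the way IRC Tools is actually organized.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class IrcRegexParser
{
    // Created and compiled once, then shared by every call to TryParse.
    private static readonly Regex ParsingRegex = new Regex(
        @"^(:(?<prefix>\S+) )?(?<command>\S+)( (?!:)(?<params>.+?))?( :(?<trail>.+))?$",
        RegexOptions.Compiled | RegexOptions.ExplicitCapture);

    public static bool TryParse(string message, out string prefix,
        out string command, out string[] parameters)
    {
        prefix = command = String.Empty;
        parameters = new string[0];

        Match match = ParsingRegex.Match(message);
        if (!match.Success)
            return false;

        prefix = match.Groups["prefix"].Value;
        command = match.Groups["command"].Value;

        // Optional groups that did not take part in the match are skipped.
        if (match.Groups["params"].Success)
            parameters = match.Groups["params"].Value.Split(' ');
        if (match.Groups["trail"].Success)
            parameters = parameters.Concat(new[] { match.Groups["trail"].Value }).ToArray();

        return true;
    }

    public static void Main()
    {
        string prefix, command;
        string[] parameters;

        if (TryParse("PING :irc.localhost.localdomain", out prefix, out command, out parameters))
            Console.WriteLine("{0} {1}", command, String.Join("|", parameters));
        // Prints: PING irc.localhost.localdomain
    }
}
```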
Regular expressions are the wrong solution
Simply put, regular expressions are the wrong solution to the particular problem of parsing an IRC message into its parts. Why is that? There are a couple of reasons:
Regular expressions are tempting to use due to their compact nature. Condensing code from multiple lines into a single line often eases maintenance of that code, as there is less to maintain. However, regular expressions are a double-edged sword, and while they are less code, they're often complex and require specialized knowledge to understand.
The regular expression used to solve this problem is not easily understandable at first glance. Compare the understandability of the regular expression to the code written in the first part of this article, which almost reads like plain English due to its simplicity, and it's clear that the gain simply isn't there. The regular expression is harder to understand, harder to maintain, and is condensing only a small amount of code into one line.
Regular expressions are often slower than a comparable solution written only in code. This is because the regular expression engine introduces some overhead. In many situations the slowdown does not matter when considering the large amount of code reduction attained by using the regular expression. However, my completely unscientific benchmarks comparing the two parsing techniques show that regular expressions are approximately three times slower than the code only technique (and that's with the same Regex object with RegexOptions.Compiled set being reused). Though IRC isn't often high volume in terms of data needing to be parsed, the slowness combined with the problems pointed out in my first reason make the regular expression solution unattractive.
Why bother crafting the regular expression in the first place? Well, it was an interesting challenge and exercise! What more reason is needed?
Summary
Although both parsing techniques are interesting, the choice as to which one to use is obvious. The regular expression technique is just too slow, too difficult to understand at a glance, and not very maintainable in comparison. For those reasons, IRC Tools uses code similar to the complete example I gave for the pure code parsing technique.
One thing I'd like to stress is that none of the code samples I've provided have any kind of validation of the incoming message. Validating the message is outside of this article's scope, but at some point in the pipeline it would be wise to ensure the message is actually an IRC message. I'll likely write an article concerning validation, but for now I recommend taking a look at RFC 2812, as it explicitly defines what constitutes a valid message.
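As a rough idea of what a first line of defense might look like, the sketch below rejects input that the parsing code in this article clearly could not handle. It is not a full RFC 2812 validator. It assumes the terminating CR-LF has already been stripped (leaving the 510 characters the RFC allows for the rest of the message), and the MessageSanityCheck class and LooksParseable method are hypothetical names.

```csharp
using System;

public static class MessageSanityCheck
{
    // Hypothetical pre-check; assumes the trailing CR-LF was already removed.
    public static bool LooksParseable(string message)
    {
        // RFC 2812 allows 512 characters including CR-LF, so 510 without it.
        if (String.IsNullOrEmpty(message) || message.Length > 510)
            return false;

        // Line breaks and NUL must not appear inside a message.
        if (message.IndexOfAny(new[] { '\r', '\n', '\0' }) >= 0)
            return false;

        // A message must start with a prefix or a command, not a space.
        if (message[0] == ' ')
            return false;

        if (message[0] == ':')
        {
            // The prefix must be non-empty and followed by a space and more text.
            int prefixEnd = message.IndexOf(' ');
            if (prefixEnd < 2 || prefixEnd == message.Length - 1)
                return false;
        }

        return true;
    }

    public static void Main()
    {
        Console.WriteLine(LooksParseable("PING :irc.localhost.localdomain")); // True
        Console.WriteLine(LooksParseable(":CalebDelnay!calebd@localhost"));   // False
    }
}
```

The second example is rejected because a message consisting only of a prefix has no space after it, which would otherwise cause the Substring call in the prefix extraction code to throw.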
Questions can be left as comments on the article below.
