doc: added in-depth info on the new message parser system

author: Rainer Gerhards <rgerhards@adiscon.com> 2009-11-06 11:55:12 +0100
committer: Rainer Gerhards <rgerhards@adiscon.com> 2009-11-06 11:55:12 +0100
commit: 2759e1dd846cb4e3bef78a849005afc8d0fa7438 (patch)
tree: e4b254131a16441a1a82eac85a900f2291050116 /doc/messageparser.html
parent: 83140cb480b7c806ae11a3a206a1b45936005191 (diff)
1 files changed, 222 insertions, 0 deletions
diff --git a/doc/messageparser.html b/doc/messageparser.html
new file mode 100644
index 00000000..370db59f
--- /dev/null
+++ b/doc/messageparser.html
@@ -0,0 +1,222 @@
+<html>
+<head>
+<title>Message parsers in rsyslog</title>
+</head>
+<body>
+<a href="manual.html">rsyslog documentation</a>
+
+<h1>Message parsers in rsyslog</h1>
+<p><small><i>Written by <a href="http://www.gerhards.net/rainer">Rainer Gerhards</a>
+(2009-11-06)</i></small></p>
+<h2>Intro</h2>
+<p>Message parsers are a feature of rsyslog 5.3.4 and above. In this article, I describe what
+message parsers are, what they can do and how they relate to the relevant standards. I will
+also describe what you can not do with them. Finally, I give some advice on implementing your
+own custom parser.
+
+<h2>What are message parsers?</h2>
+<p>Well, the quick answer is that message parsers are the component of rsyslog that
+parses the syslog message after it has been received. Prior to rsyslog 5.3.4, message parsers
+were built into the rsyslog core itself and could not be modified (other than by modifying
+the rsyslog code).
+<p>In 5.3.4, we changed that: message parsers are now loadable modules (just
+like input and output modules). That means that new message parsers can be added without
+modifying the rsyslog core, even without contributing something back to the 
+project.
+<p>But that doesn't answer what a message parser really is. What does it mean to &quot;parse a
+message&quot; and, maybe more importantly, what is a message? To answer these questions correctly,
+we need to dig down into the relevant standards.
+<a href="http://tools.ietf.org/html/rfc5424">RFC5424</a> specifies a layered architecture
+for the syslog protocol:
+<p align="center"><img src="rfc5424layers.png" alt="RFC5424 syslog protocol layers">
+<p>What matters to us is the distinction between the syslog transport and the upper layers.
+The transport layer specifies how a stream of messages is assembled at the sender side and
+how this stream of messages is disassembled into the individual messages at the receiver
+side. In networking terminology, this is called &quot;framing&quot;. The core idea is that
+each message is put into a so-called "frame", which then is transmitted over the communications
+link.
+<p>The framing used depends on the protocol. For example, in UDP the "frame"-equivalent is
+a packet that is being sent (this also means that no two messages can travel within a single
+UDP packet). In "plain tcp syslog", the industry standard, LF is used as a frame delimiter
+(which also means that no multi-line message can properly be transmitted, a "design" flaw
+in plain tcp syslog). In <a href="http://tools.ietf.org/html/rfc5425">RFC5425</a> there is
+a header in front of each frame that contains the size of the message. With this framing,
+any message content can properly be transferred.
+<p>And now comes the important part: <b>message parsers do NOT operate at the transport
+layer</b>, they operate, as their name implies, on messages. So we can not use message
+parsers to change the underlying framing. For example, if a sender splits (for whatever 
+reason) a single message into two and encapsulates these into two frames, there is no way
+a message parser could undo that.
+<p>A typical example may be a multi-line message: let's assume some originator has generated
+a message of the format "A\nB" (where \n means LF). If that message is being transmitted
+via plain tcp syslog, the frame delimiter is LF. So the sender will delimit the frame with
+LF, but otherwise send the message unmodified onto the wire (because that is how things are
+-unfortunately- done in plain tcp syslog...). So the wire will see "A\nB\n". When this
+arrives at the receiver, the transport layer will undo the framing. When it sees the LF
+after A, it thinks it finds a valid frame delimiter (in fact, this is the correct view!). So
+the receiver will extract one complete message A and one complete message B, not knowing
+that they once were both part of a large multi-line message. These two messages are then
+passed to the upper layers, where the message parsers receive them and extract information.
+However, the message parsers never know (or even have a chance to see) that A and B
+belonged together. Furthermore, in rsyslog there is no guarantee that A will be parsed
+before B - concurrent operations may cause the reverse order (and do so very validly).
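+<p>The effect of the two framing methods can be illustrated with a small sketch. It is written
+in Python purely for illustration and is not rsyslog code: all function names are made up for
+this example, and the octet-counted variant is reduced to the bare "length, space, message"
+idea of RFC5425 (no TLS, no error handling).

```python
def deframe_lf(stream):
    """Plain tcp syslog style: every LF ends a frame."""
    frames = stream.split(b"\n")
    # drop the empty remainder that follows the final delimiter
    if frames and frames[-1] == b"":
        frames.pop()
    return frames


def frame_octet_counted(msg):
    """RFC5425 style: decimal message length, one space, then the message."""
    return str(len(msg)).encode("ascii") + b" " + msg


def deframe_octet_counted(stream):
    """Undo octet-counted framing by reading each length header in turn."""
    frames = []
    pos = 0
    while pos < len(stream):
        sp = stream.index(b" ", pos)      # end of the length header
        length = int(stream[pos:sp])
        start = sp + 1
        frames.append(stream[start:start + length])
        pos = start + length
    return frames


message = b"A\nB"                          # the multi-line message from the text

lf_wire = message + b"\n"                  # sender appends LF as delimiter
lf_frames = deframe_lf(lf_wire)            # receiver sees two frames

oc_wire = frame_octet_counted(message)     # sender prepends the length
oc_frames = deframe_octet_counted(oc_wire) # receiver sees one frame
```

+<p>With LF framing the receiver ends up with two frames (A and B). With the length header, the
+message comes back as the single frame "A\nB". In both cases this happens before any message
+parser is called.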
+<p>The important lesson is: <b>message parsers can not be used to fix a broken framing</b>.
+You need a full protocol implementation to do that, which is the domain of input and
+output modules.
+<p>I have now told you what you can not do with message parsers. But what are they good for?
+Thankfully, broken framing is not the primary problem of the syslog world. A wealth of different
+formats is. Unfortunately, many real-world implementations violate the relevant standards
+in one way or another. That often makes it very hard to extract meaningful information from
+a message or to process messages from different sources by the same rules. In my article
+<a href="syslog_parsing.html">syslog parsing in rsyslog</a> I have elaborated on all
+the real-world evil that you can usually see. So I won't repeat that here. But in short, the
+real problem is not the framing, but how to turn malformed messages into well-formed ones.
+<p><b>This is what message parsers permit you to do: take a (well-known) malformed message, parse
+it according to its semantics and generate perfectly valid internal message representations
+from it.</b> So as long as messages are consistently in the same wrong format (and they usually
+are!), a message parser can look at that format, parse it, and make the message processable just
+as if it had been well-formed in the first place. Plus, one can abuse the interface to do some other
+"interesting" tricks, but that would take us too far.
+<p>While this functionality may not sound exciting, it actually solves a very big issue (that you
+only really understand if you have managed a system with various different syslog sources).
+Note that we were often able to process malformed messages in the past with the help of the
+property replacer and regular expressions. While this is nice, it has a performance hit. A
+message parser is C code, compiled to native machine code, and thus typically much faster than
+any regular expression based method (depending, of course, on the quality of the implementation...).
+
+<h2>How are message parsers used?</h2>
+<p>In a simplified view, rsyslog
+<ol>
+<li>first receives messages (via the input module),
+<li><i>then parses them (at the message level!)</i> and
+<li>then processes them (operating on the internal message representation).
+</ol>
+Message parsers are utilized in the second step (written in italics).
+Thus, they take the raw message (NOT frame!) received from the remote system and create
+the internal structure out of it that the other parts of rsyslog need in order to perform
+their processing. Parsing is vital, because an unparsed message can not be processed in the
+third stage, the actual application-level processing (like forwarding or writing to files).
+<h3>Parser Chains and how they Operate</h3>
+Rsyslog chains parsers together to provide flexibility.
+A <b>parser chain</b>
+contains all parsers that can potentially be used to parse a message.
+It is assumed that there is some
+way a parser can detect if the message it is being presented is supported by it. If so, the parser
+will tell the rsyslog engine and parse the message. The rsyslog engine now calls each parser
+inside the chain (in sequence!) until the first parser is able to parse the message. After one
+parser has been found, the message is considered parsed and no other parsers are called on that
+message.
+<p>Side-note: this method implies there are some "not-so-dirty" tricks available to modify
+the message by a parser module that declares itself as "unable to parse" but still does
+some message modification. This was not a primary design goal, but may be utilized, and the 
+interface probably extended, to support generic filter modules. These would need to go
+to the root of the parser chain. As mentioned, the current system already supports this.
+<p>The position inside the parser chain can be thought of as a priority: parsers sitting
+earlier in the chain take precedence over those sitting later in it. So more specific
+parsers should go earlier in the chain. A good example of how this works is the default parser
+set provided by rsyslog: rsyslog.rfc5424 and rsyslog.rfc3164, each of which parses according to the
+RFC it is named after. RFC5424 was designed to be distinguishable from RFC3164 messages by the
+sequence "1 " immediately after the so-called PRI-part (don't worry about these words, it is
+sufficient if you understand there is a well-defined sequence used to identify RFC5424
+messages). In contrast, RFC3164 actually permits everything as a valid message. Thus the
+RFC3164 parser will always parse a message, sometimes with quite unexpected outcome (there is
+a lot of guesswork involved in that parser, which unfortunately is unavoidable due to
+existing technology limits). So the default parser chain is to try the RFC5424 parser first
+and after it the RFC3164 parser. If we have a 5424-formatted message, that parser will
+identify and parse it and the rsyslog engine will not call any further parser. But if we receive a
+legacy syslog message, the RFC5424 parser will detect that it can not parse it and return this status
+to the engine, which then calls the next parser inside the chain. That usually happens to be
+the RFC3164 parser, which will always process the message. But there could also be any other
+parser inside the chain, and then each one would be called until one that is able to parse
+can be found.
+<p>If we reversed the parser order, RFC5424 messages would be parsed incorrectly. Why? Because the
+RFC3164 parser will always parse every message, so if it were asked first, it would parse
+(and misinterpret) the 5424-formatted message and report success, and the rsyslog engine would
+never call the 5424 parser. So the order of the parsers in the chain is very important.
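+<p>The chain logic can be sketched as follows. This is a simplified Python model, not the
+actual rsyslog implementation: the function names are hypothetical, the RFC5424 check only
+looks for the "1 " after the PRI-part, and the RFC3164 stand-in simply accepts everything
+without doing any real parsing.

```python
import re

# PRI-part ("<" digits ">") immediately followed by the sequence "1 "
RFC5424_START = re.compile(r"^<\d{1,3}>1 ")


def parse_rfc5424(raw):
    """Return a parsed message, or None to signal 'unable to parse'."""
    if not RFC5424_START.match(raw):
        return None
    return {"parser": "rsyslog.rfc5424", "raw": raw}


def parse_rfc3164(raw):
    """Legacy syslog permits everything, so this stand-in never declines."""
    return {"parser": "rsyslog.rfc3164", "raw": raw}


def run_chain(chain, raw):
    """Call each parser in sequence until the first one accepts the message."""
    for parser in chain:
        result = parser(raw)
        if result is not None:
            return result          # later parsers are never called
    return None                    # nobody could parse the message


default_chain = [parse_rfc5424, parse_rfc3164]
reversed_chain = [parse_rfc3164, parse_rfc5424]

new_style = "<34>1 2009-11-06T11:55:12+01:00 host app - - - hello"
legacy = "<34>Nov  6 11:55:12 host app: hello"

by_default_new = run_chain(default_chain, new_style)["parser"]
by_default_old = run_chain(default_chain, legacy)["parser"]
by_reversed_new = run_chain(reversed_chain, new_style)["parser"]
```

+<p>In this model the default order hands the new-style message to the RFC5424 parser and the
+legacy message to the RFC3164 parser, while the reversed order lets the catch-all parser claim
+the new-style message as well.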
+<p>What happens if no parser in the chain could parse a message? Well, then we could not 
+obtain the in-memory representation that is needed to further process the message. In that
+case, rsyslog has no other choice than to discard the message. If it does so, it will emit
+a warning message, but only for the first 1,000 incidents. This limit is a safety measure
+against message-loops, which otherwise could quickly result from a parser chain
+misconfiguration. <b>If you can not tolerate the loss of unparsable messages, you must ensure
+that each message can be parsed.</b> You can easily achieve this by always using the
+"rsyslog.rfc3164" parser as the <i>last</i> parser inside parser chains. That may result
+in invalid parsing, but you will have a chance to see the invalid message (in debug mode,
+a warning message will be written to the debug log each time a message is dropped due to
+inability to parse it).
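+<p>The discard behaviour can be modelled in a few lines. Again, this is only an illustrative
+Python sketch under assumptions: the class, its attributes and the "strict" parser are
+invented for this example; only the limit of 1,000 warnings and the idea of a catch-all
+parser at the end of the chain are taken from the text.

```python
MAX_WARNINGS = 1000  # limit taken from the text above


def strict_parser(raw):
    """Hypothetical parser that only accepts messages starting with '<'."""
    if raw.startswith("<"):
        return {"parser": "strict", "raw": raw}
    return None


def catch_all_parser(raw):
    """Stand-in for rsyslog.rfc3164: accepts every message."""
    return {"parser": "rsyslog.rfc3164", "raw": raw}


class Engine:
    def __init__(self, chain):
        self.chain = chain
        self.discarded = 0
        self.warnings = []

    def submit(self, raw):
        for parser in self.chain:
            result = parser(raw)
            if result is not None:
                return result
        # no parser accepted the message, so it has to be dropped
        self.discarded += 1
        if self.discarded <= MAX_WARNINGS:
            self.warnings.append("discarded unparsable message: %r" % raw)
        return None


lossy = Engine([strict_parser])                    # no catch-all at the end
safe = Engine([strict_parser, catch_all_parser])   # catch-all goes last

for i in range(1005):
    lossy.submit("garbage %d" % i)
    safe.submit("garbage %d" % i)
```

+<p>The chain without a catch-all loses all 1,005 messages but warns only about the first
+1,000; the chain that ends with the catch-all loses none, at the price of possibly
+misparsing them.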
+<h3>Where are parser chains used?</h3>
+<p>We now know what parser chains are and how they operate. The question is now how many
+parser chains can be active and how it is decided which parser chain is used on which message.
+This is controlled via <a href="multi_ruleset.html">rsyslog's rulesets</a>. In short, multiple
+rulesets can be defined and there always exists at least one ruleset (for specifics, follow
+the <a href="multi_ruleset.html">link</a>). A parser chain is bound to a specific ruleset. 
+This is done by virtue of defining parsers via the
+<a href="rsconf1_rulesetparser.html">$RulesetParser</a> configuration directive (for specifics,
+see there). If no such directive is specified, the default parser chain is used. As of this
+writing, the default parser chain always consists of "rsyslog.rfc5424", "rsyslog.rfc3164", in
+that order. As soon as a parser is configured, the default list is cleared and the new parser
+is added to the end of the (initially empty) ruleset's parser chain.
+<p>The important point to know is that parser chains are defined on a per-ruleset basis.
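+<p>The rule "default chain until the first parser is configured" can be made concrete with a
+small model. This is a Python sketch under assumptions, not rsyslog code: the class and method
+names are invented, and "custom.device" is a hypothetical parser name used only for
+illustration.

```python
DEFAULT_CHAIN = ["rsyslog.rfc5424", "rsyslog.rfc3164"]


class Ruleset:
    def __init__(self, name):
        self.name = name
        self._configured = []     # parsers added explicitly, in order

    def add_parser(self, parser_name):
        """Model of one $RulesetParser directive for this ruleset."""
        self._configured.append(parser_name)

    @property
    def parser_chain(self):
        # The default chain applies only while nothing has been configured.
        if self._configured:
            return list(self._configured)
        return list(DEFAULT_CHAIN)


untouched = Ruleset("default")

custom = Ruleset("devices")
custom.add_parser("custom.device")        # default list is cleared here
chain_after_first = custom.parser_chain
custom.add_parser("rsyslog.rfc3164")      # appended to the end
chain_after_second = custom.parser_chain
```

+<p>Note that in this model the first explicitly configured parser replaces the default list
+completely, so a catch-all parser has to be added again by hand if it is wanted.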
+<h3>Can I use different parser chains for different devices?</h3>
+<p>The correct answer is: generally yes, but it depends. First of all, remember that input
+modules (and specific listeners) may be bound to specific rulesets. As parser chains "reside"
+in rulesets, binding to a ruleset also binds to the parser chain that is bound to that ruleset.
+As a number one prerequisite, the input module must support binding to different rulesets. Not
+all do, but their number is growing. For example, the important 
+<a href="imudp.html">imudp</a> and <a href="imtcp.html">imtcp</a> input modules support
+that functionality. Those that do not (for example <a href="im3195">im3195</a>) can only
+utilize the default ruleset and thus the parser chain defined in that ruleset.
+<p>If you do not know if the input module in question supports ruleset binding, check
+its documentation page. Those that support it have the required directives.
+<p>Note that it is currently under evaluation if rsyslog will support binding parser chains
+to specific inputs directly, without depending on the ruleset. There are some concerns that
+this may not be necessary but adds considerable complexity to the configuration. So this may
+or may not be possible in the future. In any case, if we decide to add it, input modules 
+need to support it, so this functionality would require some time to implement.
+<p>The cookbook recipe for using different parsers for different devices is given
+as an actual in-depth example in the <a href="rsconf1_rulesetparser.html">$RulesetParser</a>
+configuration directive doc page. In short, it is accomplished by defining specific rulesets
+for the required parser chains, defining different listener ports for each of the devices
+with a different format and binding these listeners to the correct ruleset (and thus parser
+chains). Using that approach, a variety of different message formats can be supported
+via a single rsyslog instance.
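+<p>The relationship between listeners, rulesets and parser chains can be pictured as two
+lookups. The following Python sketch is only a model of that idea: the ruleset names, the port
+numbers and the "custom.device" parser are assumptions made up for this example, and the real
+configuration is done with the directives described on the $RulesetParser page.

```python
# ruleset name -> parser chain (hypothetical names, for illustration only)
rulesets = {
    "default": ["rsyslog.rfc5424", "rsyslog.rfc3164"],
    "devices": ["custom.device", "rsyslog.rfc3164"],
}

# listener port -> ruleset the listener is bound to (arbitrary port numbers)
listeners = {
    514: "default",
    10514: "devices",
}


def chain_for_port(port):
    """A message inherits the parser chain of the ruleset its listener is bound to."""
    return rulesets[listeners[port]]


standard_chain = chain_for_port(514)
device_chain = chain_for_port(10514)
```

+<p>A device with a known malformed format would then be pointed at the second port, while
+well-behaved senders keep using the first one.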
+
+<h2>Which message parsers are available?</h2>
+<p>As of this writing, there exist only two message parsers, one for RFC5424 format and one for
+legacy syslog (loosely described in
+<a href="http://tools.ietf.org/html/rfc3164">RFC3164</a>). These parsers are built-in and
+need not be explicitly loaded. However, message parsers can be added with relative ease
+by anyone who knows how to code in C. Then, they can be loaded via $ModLoad just like any
+other loadable module. It is expected that additional message parsers will be contributed
+to the rsyslog project over time, so that at some point there hopefully is a rich choice of them
+(I intend to add a browsable repository as soon as new parsers pop up).
+<h3>How to write a message parser?</h3>
+<p>As a prerequisite, you need to know the exact format that the device is sending. Then, you need
+moderate C coding skills, and a little knowledge of rsyslog internals. I guess the rsyslog specific part
+should not be that hard, as almost all information can be gained from the existing parsers. They
+are rather simple in structure and can be found under the "./tools" directory. They are named
+pmrfc3164.c and pmrfc5424.c. You need to follow the usual loadable module guidelines.
+It is my expectation that writing a parser should typically not take longer than a single
+day, with maybe a day more to get acquainted with rsyslog. Of course, I am not sure if this estimate
+is actually right.
+<p>If you can not program or have no time to do it, Adiscon can also write a message parser
+for you as
+part of the <a href="http://www.rsyslog/professional-services">rsyslog professional services
+offering</a>.
+<h2>Conclusion</h2>
+<p>Malformed syslog messages are a pain and unfortunately often seen in practice. Message parsers
+provide a fast and efficient solution for this problem. Different parsers can be defined for
+different devices, and they all convert message information into rsyslog's well-defined
+internal format. Message parsers were first introduced in rsyslog 5.3.4 and also offer
+some interesting ideas that may be explored in the future - up to full message normalization
+capabilities. It is strongly recommended that anyone with a heterogeneous environment take
+a look at message parser capabilities.
+
+<p>[<a href="rsyslog_conf.html">rsyslog.conf overview</a>] [<a href="manual.html">manual 
+index</a>] [<a href="http://www.rsyslog.com/">rsyslog site</a>]</p>
+<p><font size="2">This documentation is part of the
+<a href="http://www.rsyslog.com/">rsyslog</a> project.<br>
+Copyright &copy; 2009 by <a href="http://www.gerhards.net/rainer">Rainer Gerhards</a> and
+<a href="http://www.adiscon.com/">Adiscon</a>. Released under the GNU GPL version 3 or higher.</font></p>
+</body>
+</html>