summaryrefslogtreecommitdiffstats
path: root/doc
diff options
context:
space:
mode:
authorRainer Gerhards <rgerhards@adiscon.com>2009-11-06 11:55:12 +0100
committerRainer Gerhards <rgerhards@adiscon.com>2009-11-06 11:55:12 +0100
commit2759e1dd846cb4e3bef78a849005afc8d0fa7438 (patch)
treee4b254131a16441a1a82eac85a900f2291050116 /doc
parent83140cb480b7c806ae11a3a206a1b45936005191 (diff)
downloadrsyslog-2759e1dd846cb4e3bef78a849005afc8d0fa7438.tar.gz
rsyslog-2759e1dd846cb4e3bef78a849005afc8d0fa7438.tar.xz
rsyslog-2759e1dd846cb4e3bef78a849005afc8d0fa7438.zip
doc: added in-depth info on the new message parser system
Diffstat (limited to 'doc')
-rw-r--r--doc/Makefile.am3
-rw-r--r--doc/manual.html3
-rw-r--r--doc/messageparser.html222
-rw-r--r--doc/rfc5424layers.pngbin0 -> 10605 bytes
-rw-r--r--doc/rsconf1_rulesetparser.html8
-rw-r--r--doc/src/rfc5424layers.diabin0 -> 1205 bytes
-rw-r--r--doc/status.html19
7 files changed, 243 insertions, 12 deletions
diff --git a/doc/Makefile.am b/doc/Makefile.am
index 285ba600..4832e757 100644
--- a/doc/Makefile.am
+++ b/doc/Makefile.am
@@ -92,6 +92,7 @@ html_files = \
rsconf1_resetconfigvariables.html \
rsconf1_rulesetcreatemainqueue.html \
rsconf1_umask.html \
+ rulesetparser.html \
v3compatibility.html \
v4compatibility.html \
v5compatibility.html \
@@ -131,6 +132,8 @@ grfx_files = \
dataflow.png \
queue_analogy_tv.png \
gssapi.png \
+ rfc5424layers.png \
+ src/rfc5424layers.dia \
rsyslog-vers.png
EXTRA_DIST = $(html_files) $(grfx_files)
diff --git a/doc/manual.html b/doc/manual.html
index a39a96a5..57f3d4a6 100644
--- a/doc/manual.html
+++ b/doc/manual.html
@@ -43,9 +43,10 @@ if you do not read the doc, but doing so will definitely improve your experience
<li> <a href="property_replacer.html">property replacer, an important core component</a></li>
<li>a commented <a href="sample.conf.html">sample rsyslog.conf</a> </li>
<li><a href="bugs.html">rsyslog bug list</a></li>
-<li><a href="rsyslog_packages.html"> rsyslog packages</a></li>
+<li><a href="messageparser.html">understanding rsyslog message parsers</a></li>
<li><a href="generic_design.html">backgrounder on generic syslog application design</a></li>
<li><a href="modules.html">description of rsyslog modules</a></li>
+<li><a href="rsyslog_packages.html">rsyslog packages</a></li>
</ul>
<p><b>We have some in-depth papers on</b></p>
<ul>
diff --git a/doc/messageparser.html b/doc/messageparser.html
new file mode 100644
index 00000000..370db59f
--- /dev/null
+++ b/doc/messageparser.html
@@ -0,0 +1,222 @@
+<html>
+<head>
+<title>Message parsers in rsyslog</title>
+</head>
+<body>
+<a href="manual.html">rsyslog documentation</a>
+
+<h1>Message parsers in rsyslog</h1>
+<p><small><i>Written by <a href="http://www.gerhards.net/rainer">Rainer Gerhards</a>
+(2009-11-06)</i></small></p>
+<h2>Intro</h2>
+<p>Message parsers are a feature of rsyslog 5.3.4 and above. In this article, I describe what
+message parsers are, what they can do and how they relate to the relevant standards. I will
+also describe what you can not do with time. Finally, I give some advice on implementing your
+own custom parser.
+
+<h2>What are message parsers?</h2>
+<p>Well, the quick answer is that message parsers are the component of rsyslog that
+parses the syslog message after it is being received. Prior to rsyslog 5.3.4, message parsers
+where built in into the rsyslog core itself and could not be modified (other than by modifying
+the rsyslog code).
+<p>In 5.3.4, we changed that: message parsers are now loadable modules (just
+like input and output modules). That means that new message parsers can be added without
+modifying the rsyslog core, even without contributing something back to the
+project.
+<p>But that doesn't answer what a message parser really is. What does ist mean to &quot;parse a
+message&quot; and, maybe more importantly, what is a message? To answer these questions correctly,
+we need to dig down into the relevant standards.
+<a href="http://tools.ietf.org/html/rfc5424">RFC5424</a> specifies a layered architecture
+for the syslog protocol:
+<p align="center"><img src="rfc5424layers.png" alt="RFC5424 syslog protocol layers">
+<p>For us important is the distinction between the syslog transport and the upper layers.
+The transport layer specifies how a stream of messages is assembled at the sender side and
+how this stream of messages is disassembled into the individual messages at the receiver
+side. In networking terminology, this is called &quot;framing&quot;. The core idea is that
+each message is put into a so-called "frame", which then is transmitted over the communications
+link.
+<p>The framing used is depending on the protocol. For example, in UDP the "frame"-equivalent is
+a packet that is being sent (this also means that no two messages can travel within a single
+UDP packet). In "plain tcp syslog", the industry standard, LF is used as a frame delimiter
+(which also means that no multi-line message can properly be transmitted, a "design" flaw
+in plain tcp syslog). In <a href="http://tools.ietf.org/html/rfc5425">RFC5425</a> there is
+a header in front of each frame that contains the size of the message. With this framing,
+any message content can properly be transferred.
+<p>And now comes the important part: <b>message parsers do NOT operate at the transport
+layer</b>, they operate, as their name implies, on messages. So we can not use message
+parsers to change the underlying framing. For example, if a sender splits (for whatever
+reason) a single message into two and encapsulates these into two frames, there is no way
+a message parser could undo that.
+<p>A typical example may be a multi-line message: let's assume some originator has generated
+a message for the format "A\nB" (where \n means LF). If that message is being transmitted
+via plain tcp syslog, the frame delimiter is LF. So the sender will delimite the frame with
+LF, but otherwise send the message unmodified onto the wire (because that is how things are
+-unfortunately- done in plain tcp syslog...). So wire will see "A\nB\n". When this
+arrives at the receiver, the transport layer will undo the framing. When it sees the LF
+after A, it thinks it finds a valid frame delimiter (in fact, this is the correct view!). So
+the receive will extract one complete message A and one complete message B, not knowing
+that they once were both part of a large multi-line message. These two messages are then
+passed to the upper layers, where the message parsers receive them and extract information.
+However, the message parsers never know (or even have a chance to see) that A and B
+belonged together. Even further, in rsyslog there is no guarnatee that A will be parsed
+before B - concurrent operations may cause the reverse order (and do so very validly).
+<p>The important lesson is: <b>message parsers can not be used to fix a broken framing</b>.
+You need a full protocol implementation to do that, what is the domain of input and
+output modules.
+<p>I have now told you what you can not do with message parsers. But what they are good for?
+Thankfully, broken framing is not the primary problem of the syslog world. A wealth of different
+formats is. Unfortunately, many real-world implementations violate the relevant standards
+in one way or another. That makes it often very hard to extract meaningful information from
+a message or to process messages from different sources by the same rules. In my article
+<a href="syslog_parsing.html">syslog parsing in rsyslog</a> I have elaborated on all
+the real-world evil that you can usually see. So I won't repeat that here. But in short, the
+real problem is not the framing, but how to make malformed messages well-looking.
+<p><b>This is what message parsers permit you to do: take a (well-known) malformed message, parse
+it according to its semantics and generate perfectly valid internal message representations
+from it.</b> So as long as messages are consistenly in the same wrong format (and they usually
+are!), a message parser can look at that format, parse it, and make the message processable just
+like it were wellformed in the first place. Plus, one can abuse the interface to do some other
+"intersting" tricks, but that would take us to far.
+<p>While this functionality may not sound exciting, it actually solves a very big issue (that you
+only really understand if you have managed a system with various different syslog sources).
+Note that we were often able to process malformed messages in the past with the help of the
+property replacer and regular expressions. While this is nice, it has a performance hit. A
+message parser is a C code, compiled to native language, and thus typically much faster than
+any regular expression based method (depending, of course, on the quality of the implementation...).
+
+<h2>How are message parsers used?</h2>
+<p>In a simlified view, rsyslog
+<ol>
+<li>first receives messages (via the input module),
+<li><i>then parses them (at the message level!)</i> and
+<li>then processes them (operating on the internal message representation).
+</ol>
+Message parsers are utilized in the second step (written in italics).
+Thus, they take the raw message (NOT frame!) received from the remote system and create
+the internal structure out of it that the other parts of rsyslog need in order to perform
+their processing. Parsing is vital, because an unparsed message can not be processed in the
+third stage, the actual application-level processing (like forwarding or writing to files).
+<h3>Parser Chains and how they Operate</h3>
+Rsyslog chains parsers together to provide flexibility.
+A <b>parser chain</b>
+contains all parsers that can potentially be used to parse a message.
+It is assumed that there is some
+way a parser can detect if the message it is being presented is supported by it. If so, the parser
+will tell the rsyslog engine and parse the message. The rsyslog engine now calls each parser
+inside the chain (in sequence!) until the first parser is able to parse the message. After one
+parser has been found, the message is considered parsed and no others parsers are called on that
+message.
+<p>Side-note: this method implies there are some "not-so-dirty" tricks available to modify
+the message by a parser module that declares itself as "unable to parse" but still does
+some message modification. This was not a primary design goal, but may be utilized, and the
+interface probably extended, to support generic filter modules. These would need to go
+to the root of the parser chain. As mentioned, the current system already supports this.
+<p>The position inside the parser chain can be thought of as a priority: parser sitting
+earlier in the chain take precedence over those sitting later in it. So more specific
+parser should go ealier in the chain. A good example of how this works is the default parser
+set provided by rsyslog: rsyslog.rfc5424 and rsyslog.rfc3164, each one parses according to the
+rfc that has named it. RFC5424 was designed to be distinguishable from RFC3164 message by the
+sequence "1 " immediately after the so-called PRI-part (don't worry about these words, it is
+sufficient if you understand there is a well-defined sequence used to indentify RFC5424
+messages). In contrary, RFC3164 actually permits everything as a valid message. Thus the
+RFC3164 parser will always parse a message, sometimes with quite unexpected outcome (there is
+a lot of guesswork involved in that parser, which unfortunately is unavoidable due to
+existing techology limits). So the default parser chain is to try the RFC5424 parser first
+and after it the RFC3164 parser. If we have a 5424-formatted message, that parser will
+identify and parse it and the rsyslog engine will stop processing. But if we receive a
+legacy syslog message, the RFC5424 will detect that it can not parse it, return this status
+to the engine which then calls the next parser inside the chain. That usually happens to be
+the RFC3164 parser, which will always process the message. But there could also be any other
+parser inside the chain, and then each one would be called unless one that is able to parse
+can be found.
+<p>If we reversed the parser order, RFC5424 messages would incorrectly parsed. Why? Because the
+RFC3164 parser will always parse every message, so if it were asked first, it would parse
+(and misinterpret) the 5424-formatted message, return it did so and the rsyslog engine would
+never call the 5424 parser. So oder of sequence is very important.
+<p>What happens if no parser in the chain could parse a message? Well, then we could not
+obtain the in-memory representation that is needed to further process the message. In that
+case, rsyslog has no other choice than to discard the message. If it does so, it will emit
+a warning message, but only in the first 1,000 incidents. This limit is a safety measure
+against message-loops, which otherwise could quickly result from a parser chain
+misconfiguration. <b>If you do not tolerate loss of unparsable messages, you must ensure
+that each message can be parsed.</b> You can easily achive this by always using the
+"rsyslog-rfc3164" parser as the <i>last</i> parser inside parser chains. That may result
+in invalid parsing, but you will have a chance to see the invalid message (in debug mode,
+a warning message will be written to the debug log each time a message is dropped due to
+inability to parse it).
+<h3>Where are parser chains used?</h3>
+<p>We now know what parser chains are and how they operate. The question is now how many
+parser chains can be active and how it is decicded which parser chain is used on which message.
+This is controlled via <a href="multi_ruleset.html">rsyslog's rulesets</a>. In short, multiple
+rulesets can be defined and there always exist at least one ruleset (for specifcs, follow
+the <a href="multi_ruleset.html">link</a>). A parser chain is bound to a specific ruleset.
+This is done by virtue of defining parsers via the
+<a href="rsconf1_rulesetparser.html">$RulesetParser</a> configuration directive (for specifics,
+see there). If no such directive is specified, the default parser chain is used. As of this
+writing, the default parser chain always consists of "rsyslog.rfc5424", "rsyslog.rfc3164", in
+that order. As soon as a parser is configured, the default list is cleared and the new parser
+is added to the end of the (initially empty) ruleset's parser chain.
+<p>The important point to know is that parser chains are defined on a per-ruleset basis.
+<h3>Can I use different parser chains for different devices?</h3>
+<p>The correct answer is: generally yes, but it depends. First of all, remember that input
+modules (and specific listeners) may be bound to specific rulesets. As parser chains "reside"
+in rulesets, binding to a ruleset also binds to the parser chain that is bound to that ruleset.
+As a number one prequisite, the input module must support binding to different rulesets. Not
+all do, but their number is growing. For example, the important
+<a href="imudp.html">imudp</a> and <a href="imtcp.html">imtcp</a> input modules support
+that functionality. Those that do not (for example <a href="im3195">im3195</a>) can only
+utilize the default ruleset and thus the parser chain defined in that ruleset.
+<p>If you do not know if the input module in question supports ruleset binding, check
+its documentation page. Those that support it have the requiered directives.
+<p>Note that it is currently under evaluation if rsyslog will support binding parser chains
+to specific inputs directly, without depending on the ruleset. There are some concerns that
+this may not be necessary but adds considerable complexity to the configuration. So this may
+or may not be possible in the future. In any case, if we decide to add it, input modules
+need to support it, so this functionality would require some time to implement.
+<p>The coockbook recipe for using different parsers for different devices is given
+as an actual in-depth example in the <a href="rscon1_rulesetsparser.html">$RulesetParser</a>
+configuration directive doc page. In short, it is acomplished by defining specific rulesets
+for the required parser chains, definining different listener ports for each of the devices
+with different format and binding these listeners to the correct ruleset (and thus parser
+chains). Using that approach, a variety of different message formats can be supported
+via a single rsyslog instance.
+
+<h2>Which message parsers are available</h2>
+<p>As of this writing, there exist only two message parsers, one for RFC5424 format and one for
+legacy syslog (loosely described in
+<a href="http://tools.ietf.org/html/rfc3164">RFC3164</a>). These parsers are built-in and
+must not be explicitely loaded. However, message parsers can be added with relative ease
+by anyone knowing to code in C. Then, they can be loaded via $ModLoad just like any
+other loadable module. It is expected that the rsyslog project will be contributed additional
+message parsers over time, so that at some point there hopefully is a rich choice of them
+(I intend to add a browsable repository as soon as new parsers pop up).
+<h3>How to write a message parser?</h3>
+<p>As a prequisite, you need to know the exact format that the device is sending. Then, you need
+moderate C coding skills, and a little bit of rsyslog internals. I guess the rsyslog specific part
+should not be that hard, as almost all information can be gained from the existing parsers. They
+are rather simple in structure and can be found under the "./tools" directory. They are named
+pmrfc3164.c and pmrfc5424.c. You need to follow the usual loadable module guidelines.
+It is my expectation that writing a parser should typically not take longer than a single
+day, with maybe a day more to get aquainted with rsyslog. Of course, I am not sure if the number
+is actually right.
+<p>If you can not program or have no time to do it, Adiscon can also write a message parser
+for you as
+part of the <a href="http://www.rsyslog/professional-services">rsyslog professional services
+offering</a>.
+<h2>Conclusion</h2>
+<p>Malformed syslog messages are a pain and unfortunately often seen in practice. Message parsers
+provide a fast and efficient solution for this problem. Different parsers can be defined for
+different devices, and they all convert message information into rsyslog's well-defined
+internal format. Message parsers were first introduced in rsyslog 5.3.4 and also offer
+some interesting ideas that may be explored in the future - up to full message normalization
+capabilities. It is strongly recommended that anyone with a heterogenous environment take
+a look at message parser capabilities.
+
+<p>[<a href="rsyslog_conf.html">rsyslog.conf overview</a>] [<a href="manual.html">manual
+index</a>] [<a href="http://www.rsyslog.com/">rsyslog site</a>]</p>
+<p><font size="2">This documentation is part of the
+<a href="http://www.rsyslog.com/">rsyslog</a> project.<br>
+Copyright &copy; 2009 by <a href="http://www.gerhards.net/rainer">Rainer Gerhards</a> and
+<a href="http://www.adiscon.com/">Adiscon</a>. Released under the GNU GPL version 3 or higher.</font></p>
+</body>
+</html>
diff --git a/doc/rfc5424layers.png b/doc/rfc5424layers.png
new file mode 100644
index 00000000..70192cc0
--- /dev/null
+++ b/doc/rfc5424layers.png
Binary files differ
diff --git a/doc/rsconf1_rulesetparser.html b/doc/rsconf1_rulesetparser.html
index 03306ca9..84350431 100644
--- a/doc/rsconf1_rulesetparser.html
+++ b/doc/rsconf1_rulesetparser.html
@@ -12,13 +12,15 @@
<p><b>Default:</b> rsyslog.rfc5424;rsyslog.rfc5425</p>
<p><b>Description:</b></p>
<p>
-This directive permits to specify which message parsers should be used for the ruleset
+This directive permits to specify which
+<a href="messageparser.html">message parsers</a> should be used for the ruleset
in question. It no ruleset is explicitely specified, the default ruleset is used. Message
parsers are contained in (loadable) parser modules with the most common cases
(RFC3164 and RFC5424) being build-in into rsyslogd.
<p>When this directive is specified the first time for a ruleset, it will not only add the
-parser to the ruleset, it will also wipe out the default parsers. So if you need to have
-them in addition to the custom parser, you need to specify them as well.
+parser to the ruleset's parser chain, it will also wipe out the default parser chain.
+So if you need to have
+them in addition to the custom parser, you need to specify those as well.
<p>Order of directives is important. Parsers are tried one after another, in the order
they are specified inside the config. As soon as a parser is able to parse the message,
it will do so and no other parsers will be executed. If no matching parser can be found,
diff --git a/doc/src/rfc5424layers.dia b/doc/src/rfc5424layers.dia
new file mode 100644
index 00000000..300b7796
--- /dev/null
+++ b/doc/src/rfc5424layers.dia
Binary files differ
diff --git a/doc/status.html b/doc/status.html
index 0a2b9a0d..ff056489 100644
--- a/doc/status.html
+++ b/doc/status.html
@@ -2,12 +2,12 @@
<html><head><title>rsyslog status page</title></head>
<body>
<h2>rsyslog status page</h2>
-<p>This page reflects the status as of 2009-11-02.</p>
+<p>This page reflects the status as of 2009-11-05.</p>
<h2>Current Releases</h2>
-<p><b>v5 development:</b> 5.3.3 [2009-10-27] -
-<a href="http://www.rsyslog.com/Article419.phtml">change log</a> -
-<a href="http://www.rsyslog.com/Downloads-req-viewdownloaddetails-lid-183.phtml">download</a>
+<p><b>v5 development:</b> 5.3.4 [2009-11-04] -
+<a href="http://www.rsyslog.com/Article423.phtml">change log</a> -
+<a href="http://www.rsyslog.com/Downloads-req-viewdownloaddetails-lid-185.phtml">download</a>
<br>
<!-- not at the moment!
<b>v4 development:</b> 4.5.1 [2009-07-15] -
@@ -15,15 +15,18 @@
<a href="http://www.rsyslog.com/Downloads-req-viewdownloaddetails-lid-167.phtml">download</a></p>
-->
+<!-- not at the moment!
<br><b>v5-beta:</b> 5.1.6 [2009-10-15] -
<a href="http://www.rsyslog.com/Article413.phtml">change log</a> -
<a href="http://www.rsyslog.com/Downloads-req-viewdownloaddetails-lid-180.phtml">download</a>
+-->
-<br><b>v4-beta:</b> 4.5.5 [2009-10-21] -
-<a href="http://www.rsyslog.com/Article416.phtml">change log</a> -
-<a href="http://www.rsyslog.com/Downloads-req-viewdownloaddetails-lid-182.phtml">download</a></p>
+<br><b>v4-beta:</b> 4.5.6 [2009-11-05] -
+<a href="http://www.rsyslog.com/Article425.phtml">change log</a> -
+<a href="http://www.rsyslog.com/Downloads-req-viewdownloaddetails-lid-186.phtml">download</a></p>
-<p><b>v5 stable:</b> 5.2.0 [2009-11-02] -
+<p><b>v5 stable:</b> 5.2.0 [2009-11-02] (recommended to use
+<a href="http://www.rsyslog.com/Downloads-req-viewdownloaddetails-lid-183.phtml">5.3.3</a> instead) -
<a href="http://www.rsyslog.com/Article421.phtml">change log</a> -
<a href="http://www.rsyslog.com/Downloads-req-viewdownloaddetails-lid-184.phtml">download</a>