1 files changed, 196 insertions, 156 deletions
diff --git a/doc/syslog-protocol.html b/doc/syslog-protocol.html
index e5789ab8..5305d812 100644
--- a/doc/syslog-protocol.html
+++ b/doc/syslog-protocol.html
@@ -1,156 +1,196 @@
-<html>
-<head>
-<title>syslog-protocol support in rsyslog</title>
-</head>
-<body>
-<h1>syslog-protocol support in rsyslog</h1>
-<p><b><a href="http://www.rsyslog.com/">Rsyslog</a>&nbsp; provides a trial 
-implementation of the proposed
-<a href="http://www.monitorware.com/Common/en/glossary/syslog-protocol.php">
-syslog-protocol</a> standard.</b> The intention of this implementation is to 
-find out what inside syslog-protocol is causing problems during implementation. 
-As syslog-protocol is a standard under development, its support in rsyslog is 
-highly volatile. It may change from release to release. So while it provides 
-some advantages in the real world, users are cautioned against using it right 
-now. If you do, be prepared that you will probably need to update all of your 
-rsyslogds with each new release. If you try it anyhow, please provide feedback 
-as that would be most benefitial for us.</p>
-<h2>Currently supported message format</h2>
-<p>Due to recent discussion on syslog-protocol, we do not follow any specific 
-revision of the draft but rather the candidate ideas. The format supported 
-currently is:</p>
-<p><b><code>&lt;PRI&gt;VERSION SP TIMESTAMP SP HOSTNAME SP APP-NAME SP PROCID SP MSGID SP [SD-ID]s 
-SP MSG</code></b></p>
-<p>Field syntax and semantics are as defined in IETF I-D syslog-protocol-15.</p>
-<h2>Capabilities Implemented</h2>
-<ul>
-	<li>receiving message in the supported format (see above)</li>
-	<li>sending messages in the supported format</li>
-	<li>relaying messages</li>
-	<li>receiving messages in either legacy or -protocol format and transforming 
-	them into the other one</li>
-	<li>virtual availability of TAG, PROCID, APP-NAME, MSGID, SD-ID no matter if 
-	the message was received via legacy format, API or syslog-protocol format (non-present 
-	fields are being emulated with great success)</li>
-	<li>maximum message size is set via preprocessor #define</li>
-</ul>
-<h2>Findings</h2>
-<p>This lists what has been found during implementation:</p>
-<ul>
-	<li>The same receiver must be able to support both legacy and 
-	syslog-protocol syslog messages. Anything else would be a big inconvenience 
-	to users and would make deployment much harder. The detection must be done 
-	automatically (see below on how easy that is).</li>
-	<li><b>NUL characters inside MSG</b> cause the message to be truncated at 
-	that point. This is probably a major point for many C-based implementations. 
-	No measures have yet been taken against this. Modifying the code to &quot;cleanly&quot; 
-	support NUL characters is non-trivial, even though rsyslogd already has some 
-	byte-counted string library (but this is new and not yet available 
-	everywhere).</li>
-	<li><b>character encoding in MSG</b>: is is problematic to do the right 
-	UTF-8 encoding. The reason is that we pick up the MSG from the local domain 
-	socket (which got it from the syslog(3) API). The text obtained does not 
-	include any encoding information, but it does include non US-ASCII 
-	characters. It may also include any other encoding. Other than by guessing 
-	based on the provided text, I have no way to find out what it is. In order 
-	to make the syslogd do anything useful, I have now simply taken the message 
-	as is and stuffed it into the MSG part. Please note that I think this will 
-	be a route that other implementors would take, too.</li>
-	<li>A minimal parser is easy to implement. It took me roughly 2 hours to add 
-	it to rsyslogd. This includes the time for restructering the code to be able 
-	to parse both legacy syslog as well as syslog-protocol. The parser has some 
-	restrictions, though<ul>
-	<li>STRUCTURED-DATA field is extracted, but not validated. Structured data 
-	&quot;[test ]]&quot; is not caught as an error. Nor are any other errors caught. For 
-	my needs with this syslogd, that level of structued data processing is 
-	probably sufficient. I do not want to parse/validate it in all cases. This 
-	is also a performance issue. I think other implementors could have the same 
-	view. As such, we should not make validation a requirement.</li>
-	<li>MSG is not further processed (e.g. Unicode not being validated)</li>
-	<li>the other header fields are also extracted, but no validation is 
-	performed right now. At least some validation should be easy to add (not 
-	done this because it is a proof-of-concept and scheduled to change).</li>
-</ul>
-	</li>
-	<li>Universal access to all syslog fields (missing ones being emulated) was 
-	also quite easy. It took me around another 2 hours to integrate emulation of 
-	non-present fields into the code base.</li>
-	<li>The version at the start of the message makes it easy to detect if we 
-	have legacy syslog or syslog-protocol. Do NOT move it to somewhere inside 
-	the middle of the message, that would complicate things. It might not be 
-	totally fail-safe to just rely on &quot;1 &quot; as the &quot;cookie&quot; for a syslog-protocol. 
-	Eventually, it would be good to add some more uniqueness, e.g. &quot;@#1 &quot;.</li>
-	<li>I have no (easy) way to detect truncation if that happens on the UDP 
-	stack. All I see is that I receive e.g. a 4K message. If the message was e.g. 
-	6K, I received two chunks. The first chunk (4K) is correctly detected as a 
-	syslog-protocol message, the second (2K) as legacy syslog. I do not see what 
-	we could do against this. This questions the usefulness of the TRUNCATE bit. 
-	Eventually, I could look at the UDP headers and see that it is a fragment. I 
-	have looked at a network sniffer log of the conversation. This looks like 
-	two totally-independant messages were sent by the sender stack.</li>
-	<li>The maximum message size is currently being configured via a 
-	preprocessor #define. It can easily be set to 2K or 4K, but more than 4K is 
-	not possible because of UDP stack limitations. Eventually, this can be 
-	worked around, but I have not done this yet.</li>
-	<li>rsyslogd can accept syslog-protocol formatted messages but is able to 
-	relay them in legacy format. I find this a must in real-life deployments. 
-	For this, I needed to do some field mapping so that APP-NAME/PROCID are 
-	mapped into a TAG.</li>
-	<li>rsyslogd can also accept legacy syslog message and relay them in 
-	syslog-protocol format. For this, I needed to apply some sub-parsing of the 
-	TAG, which on most occasions provides correct results. There might be some 
-	misinterpretations but I consider these to be mostly non-intrusive. </li>
-	<li>Messages received from the syslog API (the normal case under *nix) also 
-	do not have APP-NAME and PROCID and I must parse them out of TAG as 
-	described directly above. As such, this algorithm is absolutely vital to 
-	make things work on *nix.</li>
-	<li>I have an issue with messages received via the syslog(3) API (or, to be 
-	more precise, via the local domain socket this API writes to): These 
-	messages contain a timestamp, but that timestamp does neither have the year 
-	nor the high-resolution time. The year is no real issue, I just take the 
-	year of the reception of that message. There is a very small window of 
-	exposure for messages read from the log immediately after midnight Jan 1st. 
-	The message in the domain socket might have been written immediately before 
-	midnight in the old year. I think this is acceptable. However, I can not 
-	assign a high-precision timestamp, at least it is somewhat off if I take the 
-	timestamp from message reception on the local socket. An alternative might 
-	be to �gnore the timestamp present and instead use that one when the message 
-	is pulled from the local socket (I am talking about IPC, not the network - 
-	just a reminder...). This is doable, but eventually not advisable. It looks 
-	like this needs to be resolved via a configuration option.</li>
-	<li>rsyslogd already advertised its origin information on application 
-	startup (in a syslog-protocol-14 compatible format). It is fairly easy to 
-	include that with any message if desired (not currently done).</li>
-	<li>A big problem I noticed are malformed messages. In -syslog-protocol, we 
-	recommend/require to discard malformed messages. However, in practice users 
-	would like to see everything that the syslogd receives, even if it is in 
-	error. For the first version, I have not included any error handling at all. 
-	However, I think I would deliberately ignore any &quot;discard&quot; requirement. My 
-	current point of view is that in my code I would eventually flag a message 
-	as being invalid and allow the user to filter on this invalidness. So these 
-	invalid messages could be redirected into special bins.</li>
-	<li>The error logging recommendations (those I insisted on;)) are not really 
-	practical. My application has its own error logging philosophy and I will 
-	not change this to follow a draft.</li>
-</ul>
-<p>&nbsp;</p>
-<h2>Conlusions/Suggestions</h2>
-<p>These are my personal conclusions and suggestions. Obviously, they must be 
-discussed ;)</p>
-<ul>
-	<li>NUL should be disallowd in MSG</li>
-	<li>As it is not possible to definitely know the character encoding of the 
-	application-provided message, MSG should <b>not</b> be specified to use UTF-8 
-	exclusively. Instead, it is suggested that any encoding may be used but 
-	UTF-8 is preferred. To detect UTF-8, the MSG should start with the UTF-8 
-	byte order mask of &quot;EF BB BF&quot; if it is UTF-8 encoded (see section 155.9 of
-	<a href="http://www.unicode.org/versions/Unicode4.0.0/ch15.pdf">
-	http://www.unicode.org/versions/Unicode4.0.0/ch15.pdf</a>) </li>
-	<li>Requirements to drop messages should be reconsidered. I guess I would 
-	not be the only implementor ignoring them.</li>
-	<li>Logging requirements should be reconsidered and probably be removed.</li>
-</ul>
-<p>&nbsp;</p>
-</body>
-</html>
-
+<html>
+<head>
+<title>syslog-protocol support in rsyslog</title>
+</head>
+<body>
+<h1>syslog-protocol support in rsyslog</h1>
+<p><b><a href="http://www.rsyslog.com/">Rsyslog</a>&nbsp; provides a trial 
+implementation of the proposed
+<a href="http://www.monitorware.com/Common/en/glossary/syslog-protocol.php">
+syslog-protocol</a> standard.</b> The intention of this implementation is to 
+find out what inside syslog-protocol is causing problems during implementation. 
+As syslog-protocol is a standard under development, its support in rsyslog is 
+highly volatile. It may change from release to release. So while it provides 
+some advantages in the real world, users are cautioned against using it right 
+now. If you do, be prepared that you will probably need to update all of your 
+rsyslogds with each new release. If you try it anyhow, please provide feedback 
+as that would be most benefitial for us.</p>
+<h2>Currently supported message format</h2>
+<p>Due to recent discussion on syslog-protocol, we do not follow any specific 
+revision of the draft but rather the candidate ideas. The format supported 
+currently is:</p>
+<p><b><code>&lt;PRI&gt;VERSION SP TIMESTAMP SP HOSTNAME SP APP-NAME SP PROCID SP MSGID SP [SD-ID]s 
+SP MSG</code></b></p>
+<p>Field syntax and semantics are as defined in IETF I-D syslog-protocol-15.</p>
+<h2>Capabilities Implemented</h2>
+<ul>
+	<li>receiving message in the supported format (see above)</li>
+	<li>sending messages in the supported format</li>
+	<li>relaying messages</li>
+	<li>receiving messages in either legacy or -protocol format and transforming 
+	them into the other one</li>
+	<li>virtual availability of TAG, PROCID, APP-NAME, MSGID, SD-ID no matter if 
+	the message was received via legacy format, API or syslog-protocol format (non-present 
+	fields are being emulated with great success)</li>
+	<li>maximum message size is set via preprocessor #define</li>
+	<li>syslog-protocol messages can be transmitted both over UDP and plain TCP 
+	with some restrictions on compliance in the case of TCP</li>
+</ul>
+<h2>Findings</h2>
+<p>This lists what has been found during implementation:</p>
+<ul>
+	<li>The same receiver must be able to support both legacy and 
+	syslog-protocol syslog messages. Anything else would be a big inconvenience 
+	to users and would make deployment much harder. The detection must be done 
+	automatically (see below on how easy that is).</li>
+	<li><b>NUL characters inside MSG</b> cause the message to be truncated at 
+	that point. This is probably a major point for many C-based implementations. 
+	No measures have yet been taken against this. Modifying the code to &quot;cleanly&quot; 
+	support NUL characters is non-trivial, even though rsyslogd already has some 
+	byte-counted string library (but this is new and not yet available 
+	everywhere).</li>
+	<li><b>character encoding in MSG</b>: is is problematic to do the right 
+	UTF-8 encoding. The reason is that we pick up the MSG from the local domain 
+	socket (which got it from the syslog(3) API). The text obtained does not 
+	include any encoding information, but it does include non US-ASCII 
+	characters. It may also include any other encoding. Other than by guessing 
+	based on the provided text, I have no way to find out what it is. In order 
+	to make the syslogd do anything useful, I have now simply taken the message 
+	as is and stuffed it into the MSG part. Please note that I think this will 
+	be a route that other implementors would take, too.</li>
+	<li>A minimal parser is easy to implement. It took me roughly 2 hours to add 
+	it to rsyslogd. This includes the time for restructering the code to be able 
+	to parse both legacy syslog as well as syslog-protocol. The parser has some 
+	restrictions, though<ul>
+	<li>STRUCTURED-DATA field is extracted, but not validated. Structured data 
+	&quot;[test ]]&quot; is not caught as an error. Nor are any other errors caught. For 
+	my needs with this syslogd, that level of structued data processing is 
+	probably sufficient. I do not want to parse/validate it in all cases. This 
+	is also a performance issue. I think other implementors could have the same 
+	view. As such, we should not make validation a requirement.</li>
+	<li>MSG is not further processed (e.g. Unicode not being validated)</li>
+	<li>the other header fields are also extracted, but no validation is 
+	performed right now. At least some validation should be easy to add (not 
+	done this because it is a proof-of-concept and scheduled to change).</li>
+</ul>
+	</li>
+	<li>Universal access to all syslog fields (missing ones being emulated) was 
+	also quite easy. It took me around another 2 hours to integrate emulation of 
+	non-present fields into the code base.</li>
+	<li>The version at the start of the message makes it easy to detect if we 
+	have legacy syslog or syslog-protocol. Do NOT move it to somewhere inside 
+	the middle of the message, that would complicate things. It might not be 
+	totally fail-safe to just rely on &quot;1 &quot; as the &quot;cookie&quot; for a syslog-protocol. 
+	Eventually, it would be good to add some more uniqueness, e.g. &quot;@#1 &quot;.</li>
+	<li>I have no (easy) way to detect truncation if that happens on the UDP 
+	stack. All I see is that I receive e.g. a 4K message. If the message was e.g. 
+	6K, I received two chunks. The first chunk (4K) is correctly detected as a 
+	syslog-protocol message, the second (2K) as legacy syslog. I do not see what 
+	we could do against this. This questions the usefulness of the TRUNCATE bit. 
+	Eventually, I could look at the UDP headers and see that it is a fragment. I 
+	have looked at a network sniffer log of the conversation. This looks like 
+	two totally-independant messages were sent by the sender stack.</li>
+	<li>The maximum message size is currently being configured via a 
+	preprocessor #define. It can easily be set to 2K or 4K, but more than 4K is 
+	not possible because of UDP stack limitations. Eventually, this can be 
+	worked around, but I have not done this yet.</li>
+	<li>rsyslogd can accept syslog-protocol formatted messages but is able to 
+	relay them in legacy format. I find this a must in real-life deployments. 
+	For this, I needed to do some field mapping so that APP-NAME/PROCID are 
+	mapped into a TAG.</li>
+	<li>rsyslogd can also accept legacy syslog message and relay them in 
+	syslog-protocol format. For this, I needed to apply some sub-parsing of the 
+	TAG, which on most occasions provides correct results. There might be some 
+	misinterpretations but I consider these to be mostly non-intrusive. </li>
+	<li>Messages received from the syslog API (the normal case under *nix) also 
+	do not have APP-NAME and PROCID and I must parse them out of TAG as 
+	described directly above. As such, this algorithm is absolutely vital to 
+	make things work on *nix.</li>
+	<li>I have an issue with messages received via the syslog(3) API (or, to be 
+	more precise, via the local domain socket this API writes to): These 
+	messages contain a timestamp, but that timestamp does neither have the year 
+	nor the high-resolution time. The year is no real issue, I just take the 
+	year of the reception of that message. There is a very small window of 
+	exposure for messages read from the log immediately after midnight Jan 1st. 
+	The message in the domain socket might have been written immediately before 
+	midnight in the old year. I think this is acceptable. However, I can not 
+	assign a high-precision timestamp, at least it is somewhat off if I take the 
+	timestamp from message reception on the local socket. An alternative might 
+	be to �gnore the timestamp present and instead use that one when the message 
+	is pulled from the local socket (I am talking about IPC, not the network - 
+	just a reminder...). This is doable, but eventually not advisable. It looks 
+	like this needs to be resolved via a configuration option.</li>
+	<li>rsyslogd already advertised its origin information on application 
+	startup (in a syslog-protocol-14 compatible format). It is fairly easy to 
+	include that with any message if desired (not currently done).</li>
+	<li>A big problem I noticed are malformed messages. In -syslog-protocol, we 
+	recommend/require to discard malformed messages. However, in practice users 
+	would like to see everything that the syslogd receives, even if it is in 
+	error. For the first version, I have not included any error handling at all. 
+	However, I think I would deliberately ignore any &quot;discard&quot; requirement. My 
+	current point of view is that in my code I would eventually flag a message 
+	as being invalid and allow the user to filter on this invalidness. So these 
+	invalid messages could be redirected into special bins.</li>
+	<li>The error logging recommendations (those I insisted on;)) are not really 
+	practical. My application has its own error logging philosophy and I will 
+	not change this to follow a draft.</li>
+	<li>Relevance of support for leap seconds and senders without knowledge of 
+	time is questionable. I have not made any specific provisions in the code 
+	nor would I know how to handle that differently. I could, however, pull the 
+	local reception timestamp in this case, so it might be useful to have this 
+	feature. I do not think any more about this for the initial proof-of-concept. 
+	Note it as a potential problem area, especially when logging to databases.</li>
+	<li>The HOSTNAME field for internally generated messages currently contains 
+	the hostname part only, not the FQDN. This can be changed inside the code 
+	base, but it requires some thinking so that thinks are kept compatible with 
+	legacy syslog. I have not done this for the proof-of-concept, but I think it 
+	is not really bad. Maybe an hour or half a day of thinking.</li>
+	<li>It is possible that I did not receive a TAG with legacy syslog or via 
+	the syslog API. In this case, I can not generate the APP-NAME. For 
+	consistency, I have used &quot;-&quot; in such cases (just like in PROCID, MSGID and 
+	STRUCTURED-DATA).</li>
+	<li>As an architectural side-effect, syslog-protocol formatted messages can 
+	also be transmitted over non-standard syslog/raw tcp. This implementation 
+	uses the industry-standard LF termination of tcp syslog records. As such, 
+	syslog-protocol messages containing a LF will be broken invalidly. There is 
+	nothing that can be done against this without specifying a TCP transport. 
+	This issue might be more important than one thinks on first thought. The 
+	reason is the wide deployment of syslog/tcp via industry standard.</li>
+</ul>
+<p><b>Some notes on syslog-transport-udp-06</b></p>
+<ul>
+	<li>I did not make any low-level modifications to the UDP code and think I 
+	am still basically covered with this I-D.</li>
+	<li>I deliberately violate section 3.3 insofar as that I do not necessarily 
+	accept messages destined to port 514. This feature is user-required and a 
+	must. The same applies to the destination port. I am not sure if the &quot;MUST&quot; 
+	in section 3.3 was meant that this MUST be an option, but not necessarily be 
+	active. The wording should be clarified.</li>
+	<li>section 3.6: I do not check checksums. See the issue with discarding 
+	messages above. The same solution will probably be applied in my code.</li>
+</ul>
+<p>&nbsp;</p>
+<h2>Conlusions/Suggestions</h2>
+<p>These are my personal conclusions and suggestions. Obviously, they must be 
+discussed ;)</p>
+<ul>
+	<li>NUL should be disallowd in MSG</li>
+	<li>As it is not possible to definitely know the character encoding of the 
+	application-provided message, MSG should <b>not</b> be specified to use UTF-8 
+	exclusively. Instead, it is suggested that any encoding may be used but 
+	UTF-8 is preferred. To detect UTF-8, the MSG should start with the UTF-8 
+	byte order mask of &quot;EF BB BF&quot; if it is UTF-8 encoded (see section 155.9 of
+	<a href="http://www.unicode.org/versions/Unicode4.0.0/ch15.pdf">
+	http://www.unicode.org/versions/Unicode4.0.0/ch15.pdf</a>) </li>
+	<li>Requirements to drop messages should be reconsidered. I guess I would 
+	not be the only implementor ignoring them.</li>
+	<li>Logging requirements should be reconsidered and probably be removed.</li>
+	<li>It would be advisable to specify &quot;-&quot; for APP-NAME is the name is not 
+	known to the sender.</li>
+	<li>The implications of the current syslog/tcp industry standard on 
+	syslog-protocol should be further evaluated and be fully understood</li>
+</ul>
+<p>&nbsp;</p>
+</body>
+</html>
+