doc/queues_analogy.html


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html><head>
<title>turning lanes and rsyslog queues - an analogy</title></head>
<body>
<a href="rsyslog_conf_global.html">back</a>

<h1>Turning Lanes and Rsyslog Queues - an Analogy</h1>
<p>If there is a single object absolutely vital to understanding the way 
rsyslog works, this object is queues. Queues offer a variety of services,
including support for multithreading. While there is elaborate in-depth
documentation on the ins and outs of <a href="queues.html">rsyslog queues</a>,
some of the concepts are hard to grasp even for experienced people. I think this
is because rsyslog uses a very high layer of abstraction which includes things 
that look quite unnatural, like queues that do <b>not</b> actually queue...
<p>With this document, I take a different approach: I will not describe every specific
detail of queue operation but hope to be able to provide the core idea of how
queues are used in rsyslog by using an analogy. I will compare the rsyslog data flow
with real-life traffic flowing at an intersection.
<p>But first let's set the stage for the rsyslog part. The graphic below describes
the data flow inside rsyslog:
<p align="center"><img src="dataflow.png" alt="rsyslog data flow">
<p>Note that there is a <a href="http://www.rsyslog.com/Article350.phtml">video tutorial</a>
available on the data flow. It is not perfect, but may aid in understanding this picture.
<p>For our needs, the important fact to know is that messages enter rsyslog on &quot;the
left side&quot; (for example, via UDP), are being preprocessed, put into the
so-called main queue, taken off that queue, be filtered and be placed into one or
several action queues (depending on filter results). They leave rsyslog on &quot;the
right side&quot; where output modules (like the file or database writer) consume them.
<p>So there are always <b>two</b> stages where a message (conceptually) is queued - first
in the main queue and later on in <i>n</i> action specific queues (with <i>n</i> being the number of
actions that the message in question needs to be processed by, what is being decided
by the &quot;Filter Engine&quot;). As such, a message will be in at least two queues
during its lifetime (with the exception of messages being discarded by the queue itself,
but for the purpose of this document, we will ignore that possibility).
<p>Also, it is vitally
important to understand that <b>each</b> action has a queue sitting in front of it.
If you have dug into the details of rsyslog configuration, you have probably seen that
a queue mode can be set for each action. And the default queue mode is the so-called
&quot;direct mode&quot;, in which &quot;the queue does not actually enqueue data&quot;.
That sounds silly, but is not. It is an important abstraction that helps keep the code clean.
<p>To understand this, we first need to look at who is the active component. In our data flow,
the active part always sits to the left of the object. For example, the &quot;Preprocessor&quot;
is being called by the inputs and calls itself into the main message queue. That is, the queue
receiver is called, it is passive. One might think that the &quot;Parser &amp; Filter Engine&quot;
is an active component that actively pulls messages from the queue. This is wrong! Actually,
it is the queue that has a pool of worker threads, and these workers pull data from the queue 
and then call the passively waiting Parser and Filter Engine with those messages. So the
main message queue is the active part, the Parser and Filter Engine is passive.
<p>Let's now try an analogy analogy for this part: Think about a TV show. The show is produced
in some TV studio, from there sent (actively) to a radio tower. The radio tower passively
receives from the studio and then actively sends out a signal, which is passively received
by your TV set. In our simplified view, we have the following picture:
<p align="center"><img src="queue_analogy_tv.png" alt="rsyslog queues and TV analogy">
<p>The lower part of the picture lists the equivalent rsyslog entities, in an abstracted way.
Every queue has a producer (in the above sample the input) and a consumer (in the above sample the Parser
and Filter Engine). Their active and passive functions are equivalent to the TV entities
that are listed on top of the rsyslog entity. For example, a rsyslog consumer can never 
actively initiate reception of a message in the same way a TV set cannot actively
&quot;initiate&quot; a TV show - both can only &quot;handle&quot; (display or process)
what is sent to them.
<p>Now let's look at the action queues: here, the active part, the producer, is the
Parser and Filter Engine. The passive part is the Action Processor. The later does any
processing that is necessary to call the output plugin, in particular it processes the template
to create the plugin calling parameters (either a string or vector of arguments). From the
action queue's point of view, Action Processor and Output form a single entity. Again, the
TV set analogy holds. The Output <b>does not</b> actively ask the queue for data, but
rather passively waits until the queue itself pushes some data to it.

<p>Armed with this knowledge, we can now look at the way action queue modes work. My analogy here
is a junction, as shown below (note that the colors in the pictures below are <b>not</b> related to
the colors in the pictures above!):
<p align="center"><img src="direct_queue0.png">
<p>This is a very simple real-life traffic case: one road joins another. We look at
traffic on the straight road, here shown by blue and green arrows. Traffic in the 
opposing direction is shown in blue. Traffic flows without
any delays as long as nobody takes turns. To be more precise, if the opposing traffic takes 
a (right) turn, traffic still continues to flow without delay. However, if a car in the red traffic
flow intends to do a (left, then) turn, the situation changes:
<p align="center"><img src="direct_queue1.png">
<p>The turning car is represented by the green arrow. It cannot turn unless there is a gap 
in the &quot;blue traffic stream&quot;. And as this car blocks the roadway, the remaining
traffic (now shown in red, which should indicate the block condition),
must wait until the &quot;green&quot; car has made its turn. So
a queue will build up on that lane, waiting for the turn to be completed.
Note that in the examples below I do not care that much about the properties of the
opposing traffic. That is, because its structure is not really important for what I intend to
show. Think about the blue arrow as being a traffic stream that most of the time blocks
left-turners, but from time to time has a gap that is sufficiently large for a left-turn
to complete.
<p>Our road network designers know that this may be unfortunate, and for more important roads
and junctions, they came up with the concept of turning lanes:
<p align="center"><img src="direct_queue2.png">
<p>Now, the car taking the turn can wait in a special area, the turning lane. As such, 
the &quot;straight&quot; traffic is no longer blocked and can flow in parallel to the
turning lane (indicated by a now-green-again arrow).

<p>However, the turning lane offers only finite space. So if too many cars intend to
take a left turn, and there is no gap in the &quot;blue&quot; traffic, we end up with
this well-known situation:
<p align="center"><img src="direct_queue3.png">
<p>The turning lane is now filled up, resulting in a tailback of cars intending to 
left turn on the main driving lane. The end result is that &quot;straight&quot; traffic
is again being blocked, just as in our initial problem case without the turning lane.
In essence, the turning lane has provided some relief, but only for a limited amount of
cars. Street system designers now try to weight cost vs. benefit and create (costly) 
turning lanes that are sufficiently large to prevent traffic jams in most, but not all
cases.
<p><b>Now let's dig a bit into the mathematical properties of turning lanes.</b> We assume that
cars all have the same length. So, units of cars, the length is alsways one (which is nice,
as we don't need to care about that factor any longer ;)). A turning lane has finite capacity of
<i>n</i> cars. As long as the number of cars wanting to take a turn is less than or eqal
to <i>n</i>, &quot;straigth traffic&quot; is not blocked (or the other way round, traffic
is blocked if at least <i>n + 1</i> cars want to take a turn!). We can now find an optimal
value for <i>n</i>: it is a function of the probability that a car wants to turn
and the cost of the turning lane
(as well as the probability there is a gap in the &quot;blue&quot; traffic, but we ignore this
in our simple sample).
If we start from some finite upper bound of <i>n</i>, we can decrease
<i>n</i> to a point where it reaches zero. But let's first look at <i>n = 1</i>, in which case exactly
one car can wait on the turning lane. More than one car, and the rest of the traffic is blocked.
Our everyday logic indicates that this is actually the lowest boundary for <i>n</i>.
<p>In an abstract view, however, <i>n</i> can be zero and that works nicely. There still can be
<i>n</i> cars at any given time on the turning lane, it just happens that this means there can
be no car at all on it. And, as usual, if we have at least <i>n + 1</i> cars wanting to turn,
the main traffic flow is blocked. True, but <i>n + 1 = 0 + 1 = 1</i> so as soon as there is any
car wanting to take a turn, the main traffic flow is blocked (remember, in all cases, I assume
no sufficiently large gaps in the opposing traffic).
<p>This is the situation our everyday perception calls &quot;road without turning lane&quot;.
In my math model, it is a &quot;road with turning lane of size 0&quot;. The subtle difference
is important: my math model guarantees that, in an abstract sense, there always is a turning
lane, it may just be too short. But it exists, even though we don't see it. And now I can
claim that even in my small home village, all roads have turning lanes, which is rather
impressive, isn't it? ;)
<p><b>And now we finally have arrived at rsyslog's queues!</b> Rsyslog action queues exists for 
all actions just like all roads in my village have turning lanes! And as in this real-life sample,
it may be hard to see the action queues for that reason. In rsyslog, the &quot;direct&quot; queue
mode is the equivalent to the 0-sized turning lane. And actions queues are the equivalent to turning
lanes in general, with our real-life <i>n</i> being the maximum queue size.
The main traffic line (which sometimes is blocked) is the equivalent to the 
main message queue. And the periods without gaps in the opposing traffic are equivalent to 
execution time of an action. In a rough sketch, the rsyslog main and action queues look like in the
following picture. 
<p align="center"><img src="direct_queue_rsyslog.png">
<p>We need to read this picture from right to left (otherwise I would need to redo all
the graphics ;)). In action 3, you see a 0-sized turning lane, aka an action queue in &quot;direct&quot;
mode. All other queues are run in non-direct modes, but with different sizes greater than 0.
<p>Let us first use our car analogy:
Assume we are in a car on the main lane that wants to take turn into the &quot;action 4&quot;
road. We pass action 1, where a number of cars wait in the turning lane and we pass
action 2, which has a slightly smaller, but still not filled up turning lane. So we pass that
without delay, too. Then we come to &quot;action 3&quot;, which has no turning lane. Unfortunately,
the car in front of us wants to turn left into that road, so it blocks the main lane. So, this time
we need to wait. An observer standing on the sidewalk may see that while we need to wait, there are
still some cars in the &quot;action 4&quot; turning lane. As such, even though no new cars can
arrive on the main lane, cars still turn into the &quot;action 4&quot; lane. In other words,
an observer standing in &quot;action 4&quot; road is unable to see that traffic on the main lane
is blocked.
<p>Now on to rsyslog: Other than in the real-world traffic example, messages in rsyslog
can - at more or less the
same time - &quot;take turns&quot; into several roads at once. This is done by duplicating the message
if the road has a non-zero-sized &quot;turning lane&quot; - or in rsyslog terms a queue that is
running in any non-direct mode. If so, a deep copy of the message object is made, that placed into
the action queue and then the initial message proceeds on the &quot;main lane&quot;. The action
queue then pushes the duplicates through action processing. This is also the reason why a
discard action inside a non-direct queue does not seem to have an effect. Actually, it discards the 
copy that was just created, but the original message object continues to flow.
<p>
In action 1, we have some entries in the action queue, as we have in action 2 (where the queue is
slightly shorter). As we have seen, new messages pass action one and two almost instantaneously.
However, when a messages reaches action 3, its flow is blocked. Now, message processing must wait
for the action to complete. Processing flow in a direct mode queue is something like a U-turn:

<p align="center"><img src="direct_queue_directq.png" alt="message processing in an rsyslog action queue in direct mode">
<p>The message starts to execute the action and once this is done, processing flow continues.
In a real-life analogy, this may be the route of a delivery man who needs to drop a parcel
in a side street before he continues driving on the main route. As a side-note, think of what happens 
with the rest of the delivery route, at least for today, if the delivery truck has a serious accident
in the side street. The rest of the parcels won't be delivered today, will they? This is exactly how the
discard action works. It drops the message object inside the action and thus the message will no
longer be available for further delivery - but as I said, only if the discard is done in a
direct mode queue (I am stressing this example because it often causes a lot of confusion).
<p>Back to the overall scenario. We have seen that messages need to wait for action 3 to
complete. Does this necessarily mean that at the same time no messages can be processed
in action 4? Well, it depends. As in the real-life scenario, action 4 will continue to 
receive traffic as long as its action queue (&quot;turn lane&quot;) is not drained. In 
our drawing, it is not. So action 4 will be executed while messages still wait for action 3
to be completed.
<p>Now look at the overall picture from a slightly different angle:
<p align="center"><img src="direct_queue_rsyslog2.png" alt="message processing in an rsyslog action queue in direct mode">
<p>The number of all connected green and red arrows is four - one each for action 1, 2 and 3
(this one is dotted as action 4 was a special case) and one for the &quot;main lane&quot; as
well as action 3 (this one contains the sole red arrow). <b>This number is the lower bound for
the number of threads in rsyslog's output system (&quot;right-hand part&quot; of the main message
queue)!</b> Each of the connected arrows is a continuous thread and each &quot;turn lane&quot; is
a place where processing is forked onto a new thread. Also, note that in action 3 the processing
is carried out on the main thread, but not in the non-direct queue modes.
<p>I have said this is &quot;the lower bound for the number of threads...&quot;. This is with
good reason: the main queue may have more than one worker thread (individual action queues
currently do not support this, but could do in the future - there are good reasons for that, too
but exploring why would finally take us away from what we intend to see). Note that you 
configure an upper bound for the number of main message queue worker threads. The actual number
varies depending on a lot of operational variables, most importantly the number of messages
inside the queue. The number <i>t_m</i> of actually running threads is within the integer-interval
[0,confLimit] (with confLimit being the operator configured limit, which defaults to 5).
Output plugins may have more than one thread created by themselves. It is quite unusual for an
output plugin to create such threads and so I assume we do not have any of these.
Then, the overall number of threads in rsyslog's filtering and output system is
<i>t_total = t_m + number of actions in non-direct modes</i>. Add the number of
inputs configured to that and you have the total number of threads running in rsyslog at
a given time (assuming again that inputs utilize only one thread per plugin, a not-so-safe
assumption).
<p>A quick side-note: I gave the lower bound for <i>t_m</i> as zero, which is somewhat in contrast
to what I wrote at the begin of the last paragraph. Zero is actually correct, because rsyslog
stops all worker threads when there is no work to do. This is also true for the action queues.
So the ultimate lower bound for a rsyslog output system without any work to carry out actually is zero.
But this bound will never be reached when there is continuous flow of activity. And, if you are
curios: if the number of workers is zero, the worker wakeup process is actually handled within the
threading context of the &quot;left-hand-side&quot; (or producer) of the queue. After being
started, the worker begins to play the active queue component again. All of this, of course,
can be overridden with configuration directives.
<p>When looking at the threading model, one can simply add n lanes to the main lane but otherwise
retain the traffic analogy. This is a very good description of the actual process (think what this
means to the &quot;turning lanes&quot;; hint: there still is only one per action!).
<p><b>Let's try to do a warp-up:</b> I have hopefully been able to show that in rsyslog, an action
queue &quot;sits in front of&quot; each output plugin. Messages are received and flow, from input
to output, over various stages and two level of queues to the outputs. Actions queues are always
present, but may not easily be visible when in direct mode (where no actual queuing takes place).
The "road junction with turning lane" analogy well describes the way - and intent - of the various
queue levels in rsyslog.
<p>On the output side, the queue is the active component, <b>not</b> the consumer. As such, the consumer
cannot ask the queue for anything (like n number of messages) but rather is activated by the queue
itself. As such, a queue somewhat resembles a &quot;living thing&quot; whereas the outputs are
just tools that this &quot;living thing&quot; uses.
<p><b>Note that I left out a couple of subtleties</b>, especially when it comes
to error handling and terminating
a queue (you hopefully have now at least a rough idea why I say &quot;terminating <b>a queue</b>&quot;
and not &quot;terminating an action&quot; - <i>who is the &quot;living thing&quot;?</i>). An action returns
a status to the queue, but it is the queue that ultimately decides which messages can finally be
considered processed and which not. Please note that the queue may even cancel an output right in
the middle of its action. This happens, if configured, if an output needs more than a configured
maximum processing time and is a guard condition to prevent slow outputs from deferring a rsyslog
restart for too long. Especially in this case re-queuing and cleanup is not trivial. Also, note that
I did not discuss disk-assisted queue modes. The basic rules apply, but there are some additional
constraints, especially in regard to the threading model. Transitioning between actual
disk-assisted mode and pure-in-memory-mode (which is done automatically when needed) is also far from
trivial and a real joy for an implementer to work on ;).
<p>If you have not done so before, it may be worth reading the
<a href="queues.html">rsyslog queue user's guide,</a> which most importantly lists all
the knobs you can turn to tweak queue operation.
<p>[<a href="manual.html">manual index</a>]
[<a href="rsyslog_conf.html">rsyslog.conf</a>]
[<a href="http://www.rsyslog.com/">rsyslog site</a>]</p>
<p><font size="2">This documentation is part of the
<a href="http://www.rsyslog.com/">rsyslog</a> project.<br>
Copyright &copy; 2009 by <a href="http://www.gerhards.net/rainer">Rainer Gerhards</a> and
<a href="http://www.adiscon.com/">Adiscon</a>. Released under the GNU GPL
version 3 or higher.</font></p>
</body>
</html>
/*
 * INET		An implementation of the TCP/IP protocol suite for the LINUX
 *		operating system.  INET is implemented using the  BSD Socket
 *		interface as the means of communication with the user level.
 *
 *		Implementation of the Transmission Control Protocol(TCP).
 *
 * Version:	$Id: tcp_input.c,v 1.243 2002/02/01 22:01:04 davem Exp $
 *
 * Authors:	Ross Biro
 *		Fred N. van Kempen, <waltje@uWalt.NL.Mugnet.ORG>
 *		Mark Evans, <evansmp@uhura.aston.ac.uk>
 *		Corey Minyard <wf-rch!minyard@relay.EU.net>
 *		Florian La Roche, <flla@stud.uni-sb.de>
 *		Charles Hedrick, <hedrick@klinzhai.rutgers.edu>
 *		Linus Torvalds, <torvalds@cs.helsinki.fi>
 *		Alan Cox, <gw4pts@gw4pts.ampr.org>
 *		Matthew Dillon, <dillon@apollo.west.oic.com>
 *		Arnt Gulbrandsen, <agulbra@nvg.unit.no>
 *		Jorge Cwik, <jorge@laser.satlink.net>
 */

/*
 * Changes:
 *		Pedro Roque	:	Fast Retransmit/Recovery.
 *					Two receive queues.
 *					Retransmit queue handled by TCP.
 *					Better retransmit timer handling.
 *					New congestion avoidance.
 *					Header prediction.
 *					Variable renaming.
 *
 *		Eric		:	Fast Retransmit.
 *		Randy Scott	:	MSS option defines.
 *		Eric Schenk	:	Fixes to slow start algorithm.
 *		Eric Schenk	:	Yet another double ACK bug.
 *		Eric Schenk	:	Delayed ACK bug fixes.
 *		Eric Schenk	:	Floyd style fast retrans war avoidance.
 *		David S. Miller	:	Don't allow zero congestion window.
 *		Eric Schenk	:	Fix retransmitter so that it sends
 *					next packet on ack of previous packet.
 *		Andi Kleen	:	Moved open_request checking here
 *					and process RSTs for open_requests.
 *		Andi Kleen	:	Better prune_queue, and other fixes.
 *		Andrey Savochkin:	Fix RTT measurements in the presence of
 *					timestamps.
 *		Andrey Savochkin:	Check sequence numbers correctly when
 *					removing SACKs due to in sequence incoming
 *					data segments.
 *		Andi Kleen:		Make sure we never ack data there is not
 *					enough room for. Also make this condition
 *					a fatal error if it might still happen.
 *		Andi Kleen:		Add tcp_measure_rcv_mss to make 
 *					connections with MSS<min(MTU,ann. MSS)
 *					work without delayed acks. 
 *		Andi Kleen:		Process packets with PSH set in the
 *					fast path.
 *		J Hadi Salim:		ECN support
 *	 	Andrei Gurtov,
 *		Pasi Sarolahti,
 *		Panu Kuhlberg:		Experimental audit of TCP (re)transmission
 *					engine. Lots of bugs are found.
 *		Pasi Sarolahti:		F-RTO for dealing with spurious RTOs
 */

#include <linux/config.h>
#include <linux/mm.h>
#include <linux/module.h>
#include <linux/sysctl.h>
#include <net/tcp.h>
#include <net/inet_common.h>
#include <linux/ipsec.h>
#include <asm/unaligned.h>

int sysctl_tcp_timestamps = 1;
int sysctl_tcp_window_scaling = 1;
int sysctl_tcp_sack = 1;
int sysctl_tcp_fack = 1;
int sysctl_tcp_reordering = TCP_FASTRETRANS_THRESH;
int sysctl_tcp_ecn;
int sysctl_tcp_dsack = 1;
int sysctl_tcp_app_win = 31;
int sysctl_tcp_adv_win_scale = 2;

int sysctl_tcp_stdurg;
int sysctl_tcp_rfc1337;
int sysctl_tcp_max_orphans = NR_FILE;
int sysctl_tcp_frto;
int sysctl_tcp_nometrics_save;

int sysctl_tcp_moderate_rcvbuf = 1;
int sysctl_tcp_abc = 1;

#define FLAG_DATA		0x01 /* Incoming frame contained data.		*/
#define FLAG_WIN_UPDATE		0x02 /* Incoming ACK was a window update.	*/
#define FLAG_DATA_ACKED		0x04 /* This ACK acknowledged new data.		*/
#define FLAG_RETRANS_DATA_ACKED	0x08 /* "" "" some of which was retransmitted.	*/
#define FLAG_SYN_ACKED		0x10 /* This ACK acknowledged SYN.		*/
#define FLAG_DATA_SACKED	0x20 /* New SACK.				*/
#define FLAG_ECE		0x40 /* ECE in this ACK				*/
#define FLAG_DATA_LOST		0x80 /* SACK detected data lossage.		*/
#define FLAG_SLOWPATH		0x100 /* Do not skip RFC checks for window update.*/

#define FLAG_ACKED		(FLAG_DATA_ACKED|FLAG_SYN_ACKED)
#define FLAG_NOT_DUP		(FLAG_DATA|FLAG_WIN_UPDATE|FLAG_ACKED)
#define FLAG_CA_ALERT		(FLAG_DATA_SACKED|FLAG_ECE)
#define FLAG_FORWARD_PROGRESS	(FLAG_ACKED|FLAG_DATA_SACKED)

#define IsReno(tp) ((tp)->rx_opt.sack_ok == 0)
#define IsFack(tp) ((tp)->rx_opt.sack_ok & 2)
#define IsDSack(tp) ((tp)->rx_opt.sack_ok & 4)

#define TCP_REMNANT (TCP_FLAG_FIN|TCP_FLAG_URG|TCP_FLAG_SYN|TCP_FLAG_PSH)

/* Adapt the MSS value used to make delayed ack decision to the 
 * real world.
 */ 
static void tcp_measure_rcv_mss(struct sock *sk,
				const struct sk_buff *skb)
{
	struct inet_connection_sock *icsk = inet_csk(sk);
	const unsigned int lss = icsk->icsk_ack.last_seg_size; 
	unsigned int len;

	icsk->icsk_ack.last_seg_size = 0; 

	/* skb->len may jitter because of SACKs, even if peer
	 * sends good full-sized frames.
	 */
	len = skb->len;
	if (len >= icsk->icsk_ack.rcv_mss) {
		icsk->icsk_ack.rcv_mss = len;
	} else {
		/* Otherwise, we make more careful check taking into account,
		 * that SACKs block is variable.
		 *
		 * "len" is invariant segment length, including TCP header.
		 */
		len += skb->data - skb->h.raw;
		if (len >= TCP_MIN_RCVMSS + sizeof(struct tcphdr) ||
		    /* If PSH is not set, packet should be
		     * full sized, provided peer TCP is not badly broken.
		     * This observation (if it is correct 8)) allows
		     * to handle super-low mtu links fairly.
		     */
		    (len >= TCP_MIN_MSS + sizeof(struct tcphdr) &&
		     !(tcp_flag_word(skb->h.th)&TCP_REMNANT))) {
			/* Subtract also invariant (if peer is RFC compliant),
			 * tcp header plus fixed timestamp option length.
			 * Resulting "len" is MSS free of SACK jitter.
			 */
			len -= tcp_sk(sk)->tcp_header_len;
			icsk->icsk_ack.last_seg_size = len;
			if (len == lss) {
				icsk->icsk_ack.rcv_mss = len;
				return;
			}
		}
		icsk->icsk_ack.pending |= ICSK_ACK_PUSHED;
	}
}

static void tcp_incr_quickack(struct sock *sk)
{
	struct inet_connection_sock *icsk = inet_csk(sk);
	unsigned quickacks = tcp_sk(sk)->rcv_wnd / (2 * icsk->icsk_ack.rcv_mss);

	if (quickacks==0)
		quickacks=2;
	if (quickacks > icsk->icsk_ack.quick)
		icsk->icsk_ack.quick = min(quickacks, TCP_MAX_QUICKACKS);
}

void tcp_enter_quickack_mode(struct sock *sk)
{
	struct inet_connection_sock *icsk = inet_csk(sk);
	tcp_incr_quickack(sk);
	icsk->icsk_ack.pingpong = 0;
	icsk->icsk_ack.ato = TCP_ATO_MIN;
}

/* Send ACKs quickly, if "quick" count is not exhausted
 * and the session is not interactive.
 */

static inline int tcp_in_quickack_mode(const struct sock *sk)
{
	const struct inet_connection_sock *icsk = inet_csk(sk);
	return icsk->icsk_ack.quick && !icsk->icsk_ack.pingpong;
}

/* Buffer size and advertised window tuning.
 *
 * 1. Tuning sk->sk_sndbuf, when connection enters established state.
 */

static void tcp_fixup_sndbuf(struct sock *sk)
{
	int sndmem = tcp_sk(sk)->rx_opt.mss_clamp + MAX_TCP_HEADER + 16 +
		     sizeof(struct sk_buff);

	if (sk->sk_sndbuf < 3 * sndmem)
		sk->sk_sndbuf = min(3 * sndmem, sysctl_tcp_wmem[2]);
}

/* 2. Tuning advertised window (window_clamp, rcv_ssthresh)
 *
 * All tcp_full_space() is split to two parts: "network" buffer, allocated
 * forward and advertised in receiver window (tp->rcv_wnd) and
 * "application buffer", required to isolate scheduling/application
 * latencies from network.
 * window_clamp is maximal advertised window. It can be less than
 * tcp_full_space(), in this case tcp_full_space() - window_clamp
 * is reserved for "application" buffer. The less window_clamp is
 * the smoother our behaviour from viewpoint of network, but the lower
 * throughput and the higher sensitivity of the connection to losses. 8)
 *
 * rcv_ssthresh is more strict window_clamp used at "slow start"
 * phase to predict further behaviour of this connection.
 * It is used for two goals:
 * - to enforce header prediction at sender, even when application
 *   requires some significant "application buffer". It is check #1.
 * - to prevent pruning of receive queue because of misprediction
 *   of receiver window. Check #2.
 *
 * The scheme does not work when sender sends good segments opening
 * window and then starts to feed us spaghetti. But it should work
 * in common situations. Otherwise, we have to rely on queue collapsing.
 */

/* Slow part of check#2. */
static int __tcp_grow_window(const struct sock *sk, struct tcp_sock *tp,
			     const struct sk_buff *skb)
{
	/* Optimize this! */
	int truesize = tcp_win_from_space(skb->truesize)/2;
	int window = tcp_win_from_space(sysctl_tcp_rmem[2])/2;

	while (tp->rcv_ssthresh <= window) {
		if (truesize <= skb->len)
			return 2 * inet_csk(sk)->icsk_ack.rcv_mss;

		truesize >>= 1;
		window >>= 1;
	}
	return 0;
}

static void tcp_grow_window(struct sock *sk, struct tcp_sock *tp,
			    struct sk_buff *skb)
{
	/* Check #1 */
	if (tp->rcv_ssthresh < tp->window_clamp &&
	    (int)tp->rcv_ssthresh < tcp_space(sk) &&
	    !tcp_memory_pressure) {
		int incr;

		/* Check #2. Increase window, if skb with such overhead
		 * will fit to rcvbuf in future.
		 */
		if (tcp_win_from_space(skb->truesize) <= skb->len)
			incr = 2*tp->advmss;
		else
			incr = __tcp_grow_window(sk, tp, skb);

		if (incr) {
			tp->rcv_ssthresh = min(tp->rcv_ssthresh + incr, tp->window_clamp);
			inet_csk(sk)->icsk_ack.quick |= 1;
		}
	}
}

/* 3. Tuning rcvbuf, when connection enters established state. */

static void tcp_fixup_rcvbuf(struct sock *sk)
{
	struct tcp_sock *tp = tcp_sk(sk);
	int rcvmem = tp->advmss + MAX_TCP_HEADER + 16 + sizeof(struct sk_buff);

	/* Try to select rcvbuf so that 4 mss-sized segments
	 * will fit to window and corresponding skbs will fit to our rcvbuf.
	 * (was 3; 4 is minimum to allow fast retransmit to work.)
	 */
	while (tcp_win_from_space(rcvmem) < tp->advmss)
		rcvmem += 128;
	if (sk->sk_rcvbuf < 4 * rcvmem)
		sk->sk_rcvbuf = min(4 * rcvmem, sysctl_tcp_rmem[2]);
}

/* 4. Try to fixup all. It is made immediately after connection enters
 *    established state.
 */
static void tcp_init_buffer_space(struct sock *sk)
{
	struct tcp_sock *tp = tcp_sk(sk);
	int maxwin;

	if (!(sk->sk_userlocks & SOCK_RCVBUF_LOCK))
		tcp_fixup_rcvbuf(sk);
	if (!(sk->sk_userlocks & SOCK_SNDBUF_LOCK))
		tcp_fixup_sndbuf(sk);

	tp->rcvq_space.space = tp->rcv_wnd;

	maxwin = tcp_full_space(sk);

	if (tp->window_clamp >= maxwin) {
		tp->window_clamp = maxwin;

		if (sysctl_tcp_app_win && maxwin > 4 * tp->advmss)
			tp->window_clamp = max(maxwin -
					       (maxwin >> sysctl_tcp_app_win),
					       4 * tp->advmss);
	}

	/* Force reservation of one segment. */
	if (sysctl_tcp_app_win &&
	    tp->window_clamp > 2 * tp->advmss &&
	    tp->window_clamp + tp->advmss > maxwin)
		tp->window_clamp = max(2 * tp->advmss, maxwin - tp->advmss);

	tp->rcv_ssthresh = min(tp->rcv_ssthresh, tp->window_clamp);
	tp->snd_cwnd_stamp = tcp_time_stamp;
}

/* 5. Recalculate window clamp after socket hit its memory bounds. */
static void tcp_clamp_window(struct sock *sk, struct tcp_sock *tp)
{
	struct inet_connection_sock *icsk = inet_csk(sk);

	icsk->icsk_ack.quick = 0;

	if (sk->sk_rcvbuf < sysctl_tcp_rmem[2] &&
	    !(sk->sk_userlocks & SOCK_RCVBUF_LOCK) &&
	    !tcp_memory_pressure &&
	    atomic_read(&tcp_memory_allocated) < sysctl_tcp_mem[0]) {
		sk->sk_rcvbuf = min(atomic_read(&sk->sk_rmem_alloc),
				    sysctl_tcp_rmem[2]);
	}
	if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf)
		tp->rcv_ssthresh = min(tp->window_clamp, 2U*tp->advmss);
}


/* Initialize RCV_MSS value.
 * RCV_MSS is an our guess about MSS used by the peer.
 * We haven't any direct information about the MSS.
 * It's better to underestimate the RCV_MSS rather than overestimate.
 * Overestimations make us ACKing less frequently than needed.
 * Underestimations are more easy to detect and fix by tcp_measure_rcv_mss().
 */
void tcp_initialize_rcv_mss(struct sock *sk)
{
	struct tcp_sock *tp = tcp_sk(sk);
	unsigned int hint = min_t(unsigned int, tp->advmss, tp->mss_cache);

	hint = min(hint, tp->rcv_wnd/2);
	hint = min(hint, TCP_MIN_RCVMSS);
	hint = max(hint, TCP_MIN_MSS);

	inet_csk(sk)->icsk_ack.rcv_mss = hint;
}

/* Receiver "autotuning" code.
 *
 * The algorithm for RTT estimation w/o timestamps is based on
 * Dynamic Right-Sizing (DRS) by Wu Feng and Mike Fisk of LANL.
 * <http://www.lanl.gov/radiant/website/pubs/drs/lacsi2001.ps>
 *
 * More detail on this code can be found at
 * <http://www.psc.edu/~jheffner/senior_thesis.ps>,
 * though this reference is out of date.  A new paper
 * is pending.
 */
static void tcp_rcv_rtt_update(struct tcp_sock *tp, u32 sample, int win_dep)
{
	u32 new_sample = tp->rcv_rtt_est.rtt;
	long m = sample;

	if (m == 0)
		m = 1;

	if (new_sample != 0) {
		/* If we sample in larger samples in the non-timestamp
		 * case, we could grossly overestimate the RTT especially
		 * with chatty applications or bulk transfer apps which
		 * are stalled on filesystem I/O.
		 *
		 * Also, since we are only going for a minimum in the
		 * non-timestamp case, we do not smooth things out
		 * else with timestamps disabled convergence takes too
		 * long.
		 */
		if (!win_dep) {
			m -= (new_sample >> 3);
			new_sample += m;
		} else if (m < new_sample)
			new_sample = m << 3;
	} else {
		/* No previous measure. */
		new_sample = m << 3;
	}

	if (tp->rcv_rtt_est.rtt != new_sample)
		tp->rcv_rtt_est.rtt = new_sample;
}

static inline void tcp_rcv_rtt_measure(struct tcp_sock *tp)
{
	if (tp->rcv_rtt_est.time == 0)
		goto new_measure;
	if (before(tp->rcv_nxt, tp->rcv_rtt_est.seq))
		return;
	tcp_rcv_rtt_update(tp,
			   jiffies - tp->rcv_rtt_est.time,
			   1);

new_measure:
	tp->rcv_rtt_est.seq = tp->rcv_nxt + tp->rcv_wnd;
	tp->rcv_rtt_est.time = tcp_time_stamp;
}

static inline void tcp_rcv_rtt_measure_ts(struct sock *sk, const struct sk_buff *skb)
{
	struct tcp_sock *tp = tcp_sk(sk);
	if (tp->rx_opt.rcv_tsecr &&
	    (TCP_SKB_CB(skb)->end_seq -
	     TCP_SKB_CB(skb)->seq >= inet_csk(sk)->icsk_ack.rcv_mss))
		tcp_rcv_rtt_update(tp, tcp_time_stamp - tp->rx_opt.rcv_tsecr, 0);
}

/*
 * This function should be called every time data is copied to user space.
 * It calculates the appropriate TCP receive buffer space.
 */
void tcp_rcv_space_adjust(struct sock *sk)
{
	struct tcp_sock *tp = tcp_sk(sk);
	int time;
	int space;
	
	if (tp->rcvq_space.time == 0)
		goto new_measure;
	
	time = tcp_time_stamp - tp->rcvq_space.time;
	if (time < (tp->rcv_rtt_est.rtt >> 3) ||
	    tp->rcv_rtt_est.rtt == 0)
		return;
	
	space = 2 * (tp->copied_seq - tp->rcvq_space.seq);

	space = max(tp->rcvq_space.space, space);

	if (tp->rcvq_space.space != space) {
		int rcvmem;

		tp->rcvq_space.space = space;

		if (sysctl_tcp_moderate_rcvbuf) {
			int new_clamp = space;

			/* Receive space grows, normalize in order to
			 * take into account packet headers and sk_buff
			 * structure overhead.
			 */
			space /= tp->advmss;
			if (!space)
				space = 1;
			rcvmem = (tp->advmss + MAX_TCP_HEADER +
				  16 + sizeof(struct sk_buff));
			while (tcp_win_from_space(rcvmem) < tp->advmss)
				rcvmem += 128;
			space *= rcvmem;
			space = min(space, sysctl_tcp_rmem[2]);
			if (space > sk->sk_rcvbuf) {
				sk->sk_rcvbuf = space;

				/* Make the window clamp follow along.  */
				tp->window_clamp = new_clamp;
			}
		}
	}
	
new_measure:
	tp->rcvq_space.seq = tp->copied_seq;
	tp->rcvq_space.time = tcp_time_stamp;
}

/* There is something which you must keep in mind when you analyze the
 * behavior of the tp->ato delayed ack timeout interval.  When a
 * connection starts up, we want to ack as quickly as possible.  The
 * problem is that "good" TCP's do slow start at the beginning of data
 * transmission.  The means that until we send the first few ACK's the
 * sender will sit on his end and only queue most of his data, because
 * he can only send snd_cwnd unacked packets at any given time.  For
 * each ACK we send, he increments snd_cwnd and transmits more of his
 * queue.  -DaveM
 */
static void tcp_event_data_recv(struct sock *sk, struct tcp_sock *tp, struct sk_buff *skb)
{
	struct inet_connection_sock *icsk = inet_csk(sk);
	u32 now;

	inet_csk_schedule_ack(sk);

	tcp_measure_rcv_mss(sk, skb);

	tcp_rcv_rtt_measure(tp);
	
	now = tcp_time_stamp;

	if (!icsk->icsk_ack.ato) {
		/* The _first_ data packet received, initialize
		 * delayed ACK engine.
		 */
		tcp_incr_quickack(sk);
		icsk->icsk_ack.ato = TCP_ATO_MIN;
	} else {
		int m = now - icsk->icsk_ack.lrcvtime;

		if (m <= TCP_ATO_MIN/2) {
			/* The fastest case is the first. */
			icsk->icsk_ack.ato = (icsk->icsk_ack.ato >> 1) + TCP_ATO_MIN / 2;
		} else if (m < icsk->icsk_ack.ato) {
			icsk->icsk_ack.ato = (icsk->icsk_ack.ato >> 1) + m;
			if (icsk->icsk_ack.ato > icsk->icsk_rto)
				icsk->icsk_ack.ato = icsk->icsk_rto;
		} else if (m > icsk->icsk_rto) {
			/* Too long gap. Apparently sender failed to
			 * restart window, so that we send ACKs quickly.
			 */
			tcp_incr_quickack(sk);
			sk_stream_mem_reclaim(sk);
		}
	}
	icsk->icsk_ack.lrcvtime = now;

	TCP_ECN_check_ce(tp, skb);

	if (skb->len >= 128)
		tcp_grow_window(sk, tp, skb);
}

/* Called to compute a smoothed rtt estimate. The data fed to this
 * routine either comes from timestamps, or from segments that were
 * known _not_ to have been retransmitted [see Karn/Partridge
 * Proceedings SIGCOMM 87]. The algorithm is from the SIGCOMM 88
 * piece by Van Jacobson.
 * NOTE: the next three routines used to be one big routine.
 * To save cycles in the RFC 1323 implementation it was better to break
 * it up into three procedures. -- erics
 */
static void tcp_rtt_estimator(struct sock *sk, const __u32 mrtt)
{
	struct tcp_sock *tp = tcp_sk(sk);
	long m = mrtt; /* RTT */

	/*	The following amusing code comes from Jacobson's
	 *	article in SIGCOMM '88.  Note that rtt and mdev
	 *	are scaled versions of rtt and mean deviation.
	 *	This is designed to be as fast as possible 
	 *	m stands for "measurement".
	 *
	 *	On a 1990 paper the rto value is changed to:
	 *	RTO = rtt + 4 * mdev
	 *
	 * Funny. This algorithm seems to be very broken.
	 * These formulae increase RTO, when it should be decreased, increase
	 * too slowly, when it should be increased quickly, decrease too quickly
	 * etc. I guess in BSD RTO takes ONE value, so that it is absolutely
	 * does not matter how to _calculate_ it. Seems, it was trap
	 * that VJ failed to avoid. 8)
	 */
	if(m == 0)
		m = 1;
	if (tp->srtt != 0) {
		m -= (tp->srtt >> 3);	/* m is now error in rtt est */
		tp->srtt += m;		/* rtt = 7/8 rtt + 1/8 new */
		if (m < 0) {
			m = -m;		/* m is now abs(error) */
			m -= (tp->mdev >> 2);   /* similar update on mdev */
			/* This is similar to one of Eifel findings.
			 * Eifel blocks mdev updates when rtt decreases.
			 * This solution is a bit different: we use finer gain
			 * for mdev in this case (alpha*beta).
			 * Like Eifel it also prevents growth of rto,
			 * but also it limits too fast rto decreases,
			 * happening in pure Eifel.
			 */
			if (m > 0)
				m >>= 3;
		} else {
			m -= (tp->mdev >> 2);   /* similar update on mdev */
		}
		tp->mdev += m;	    	/* mdev = 3/4 mdev + 1/4 new */
		if (tp->mdev > tp->mdev_max) {
			tp->mdev_max = tp->mdev;
			if (tp->mdev_max > tp->rttvar)
				tp->rttvar = tp->mdev_max;
		}
		if (after(tp->snd_una, tp->rtt_seq)) {
			if (tp->mdev_max < tp->rttvar)
				tp->rttvar -= (tp->rttvar-tp->mdev_max)>>2;
			tp->rtt_seq = tp->snd_nxt;
			tp->mdev_max = TCP_RTO_MIN;
		}
	} else {
		/* no previous measure. */
		tp->srtt = m<<3;	/* take the measured time to be rtt */
		tp->mdev = m<<1;	/* make sure rto = 3*rtt */
		tp->mdev_max = tp->rttvar = max(tp->mdev, TCP_RTO_MIN);
		tp->rtt_seq = tp->snd_nxt;
	}
}

/* Calculate rto without backoff.  This is the second half of Van Jacobson's
 * routine referred to above.
 */
static inline void tcp_set_rto(struct sock *sk)
{
	const struct tcp_sock *tp = tcp_sk(sk);
	/* Old crap is replaced with new one. 8)
	 *
	 * More seriously:
	 * 1. If rtt variance happened to be less 50msec, it is hallucination.
	 *    It cannot be less due to utterly erratic ACK generation made
	 *    at least by solaris and freebsd. "Erratic ACKs" has _nothing_
	 *    to do with delayed acks, because at cwnd>2 true delack timeout
	 *    is invisible. Actually, Linux-2.4 also generates erratic
	 *    ACKs in some circumstances.
	 */
	inet_csk(sk)->icsk_rto = (tp->srtt >> 3) + tp->rttvar;

	/* 2. Fixups made earlier cannot be right.
	 *    If we do not estimate RTO correctly without them,
	 *    all the algo is pure shit and should be replaced
	 *    with correct one. It is exactly, which we pretend to do.
	 */
}

/* NOTE: clamping at TCP_RTO_MIN is not required, current algo
 * guarantees that rto is higher.
 */
static inline void tcp_bound_rto(struct sock *sk)
{
	if (inet_csk(sk)->icsk_rto > TCP_RTO_MAX)
		inet_csk(sk)->icsk_rto = TCP_RTO_MAX;
}

/* Save metrics learned by this TCP session.
   This function is called only, when TCP finishes successfully
   i.e. when it enters TIME-WAIT or goes from LAST-ACK to CLOSE.
 */
void tcp_update_metrics(struct sock *sk)
{
	struct tcp_sock *tp = tcp_sk(sk);
	struct dst_entry *dst = __sk_dst_get(sk);

	if (sysctl_tcp_nometrics_save)
		return;

	dst_confirm(dst);

	if (dst && (dst->flags&DST_HOST)) {
		const struct inet_connection_sock *icsk = inet_csk(sk);
		int m;

		if (icsk->icsk_backoff || !tp->srtt) {
			/* This session failed to estimate rtt. Why?
			 * Probably, no packets returned in time.
			 * Reset our results.
			 */
			if (!(dst_metric_locked(dst, RTAX_RTT)))
				dst->metrics[RTAX_RTT-1] = 0;
			return;
		}

		m = dst_metric(dst, RTAX_RTT) - tp->srtt;

		/* If newly calculated rtt larger than stored one,
		 * store new one. Otherwise, use EWMA. Remember,
		 * rtt overestimation is always better than underestimation.
		 */
		if (!(dst_metric_locked(dst, RTAX_RTT))) {
			if (m <= 0)
				dst->metrics[RTAX_RTT-1] = tp->srtt;
			else
				dst->metrics[RTAX_RTT-1] -= (m>>3);
		}

		if (!(dst_metric_locked(dst, RTAX_RTTVAR))) {
			if (m < 0)
				m = -m;

			/* Scale deviation to rttvar fixed point */
			m >>= 1;
			if (m < tp->mdev)
				m = tp->mdev;

			if (m >= dst_metric(dst, RTAX_RTTVAR))
				dst->metrics[RTAX_RTTVAR-1] = m;
			else
				dst->metrics[RTAX_RTTVAR-1] -=
					(dst->metrics[RTAX_RTTVAR-1] - m)>>2;
		}

		if (tp->snd_ssthresh >= 0xFFFF) {
			/* Slow start still did not finish. */
			if (dst_metric(dst, RTAX_SSTHRESH) &&
			    !dst_metric_locked(dst, RTAX_SSTHRESH) &&
			    (tp->snd_cwnd >> 1) > dst_metric(dst, RTAX_SSTHRESH))
				dst->metrics[RTAX_SSTHRESH-1] = tp->snd_cwnd >> 1;
			if (!dst_metric_locked(dst, RTAX_CWND) &&
			    tp->snd_cwnd > dst_metric(dst, RTAX_CWND))
				dst->metrics[RTAX_CWND-1] = tp->snd_cwnd;
		} else if (tp->snd_cwnd > tp->snd_ssthresh &&
			   icsk->icsk_ca_state == TCP_CA_Open) {
			/* Cong. avoidance phase, cwnd is reliable. */
			if (!dst_metric_locked(dst, RTAX_SSTHRESH))
				dst->metrics[RTAX_SSTHRESH-1] =
					max(tp->snd_cwnd >> 1, tp->snd_ssthresh);
			if (!dst_metric_locked(dst, RTAX_CWND))
				dst->metrics[RTAX_CWND-1] = (dst->metrics[RTAX_CWND-1] + tp->snd_cwnd) >> 1;
		} else {
			/* Else slow start did not finish, cwnd is non-sense,
			   ssthresh may be also invalid.
			 */
			if (!dst_metric_locked(dst, RTAX_CWND))
				dst->metrics[RTAX_CWND-1] = (dst->metrics[RTAX_CWND-1] + tp->snd_ssthresh) >> 1;
			if (dst->metrics[RTAX_SSTHRESH-1] &&
			    !dst_metric_locked(dst, RTAX_SSTHRESH) &&
			    tp->snd_ssthresh > dst->metrics[RTAX_SSTHRESH-1])
				dst->metrics[RTAX_SSTHRESH-1] = tp->snd_ssthresh;
		}

		if (!dst_metric_locked(dst, RTAX_REORDERING)) {
			if (dst->metrics[RTAX_REORDERING-1] < tp->reordering &&
			    tp->reordering != sysctl_tcp_reordering)
				dst->metrics[RTAX_REORDERING-1] = tp->reordering;
		}
	}
}

/* Numbers are taken from RFC2414.  */
__u32 tcp_init_cwnd(struct tcp_sock *tp, struct dst_entry *dst)
{
	__u32 cwnd = (dst ? dst_metric(dst, RTAX_INITCWND) : 0);

	if (!cwnd) {
		if (tp->mss_cache > 1460)
			cwnd = 2;
		else
			cwnd = (tp->mss_cache > 1095) ? 3 : 4;
	}
	return min_t(__u32, cwnd, tp->snd_cwnd_clamp);
}

/* Set slow start threshold and cwnd not falling to slow start */
void tcp_enter_cwr(struct sock *sk)
{
	struct tcp_sock *tp = tcp_sk(sk);

	tp->prior_ssthresh = 0;
	tp->bytes_acked = 0;
	if (inet_csk(sk)->icsk_ca_state < TCP_CA_CWR) {
		tp->undo_marker = 0;
		tp->snd_ssthresh = inet_csk(sk)->icsk_ca_ops->ssthresh(sk);
		tp->snd_cwnd = min(tp->snd_cwnd,
				   tcp_packets_in_flight(tp) + 1U);
		tp->snd_cwnd_cnt = 0;
		tp->high_seq = tp->snd_nxt;
		tp->snd_cwnd_stamp = tcp_time_stamp;
		TCP_ECN_queue_cwr(tp);

		tcp_set_ca_state(sk, TCP_CA_CWR);
	}
}

/* Initialize metrics on socket. */

static void tcp_init_metrics(struct sock *sk)
{
	struct tcp_sock *tp = tcp_sk(sk);
	struct dst_entry *dst = __sk_dst_get(sk);

	if (dst == NULL)
		goto reset;

	dst_confirm(dst);

	if (dst_metric_locked(dst, RTAX_CWND))
		tp->snd_cwnd_clamp = dst_metric(dst, RTAX_CWND);
	if (dst_metric(dst, RTAX_SSTHRESH)) {
		tp->snd_ssthresh = dst_metric(dst, RTAX_SSTHRESH);
		if (tp->snd_ssthresh > tp->snd_cwnd_clamp)
			tp->snd_ssthresh = tp->snd_cwnd_clamp;
	}
	if (dst_metric(dst, RTAX_REORDERING) &&
	    tp->reordering != dst_metric(dst, RTAX_REORDERING)) {
		tp->rx_opt.sack_ok &= ~2;
		tp->reordering = dst_metric(dst, RTAX_REORDERING);
	}

	if (dst_metric(dst, RTAX_RTT) == 0)
		goto reset;

	if (!tp->srtt && dst_metric(dst, RTAX_RTT) < (TCP_TIMEOUT_INIT << 3))
		goto reset;

	/* Initial rtt is determined from SYN,SYN-ACK.
	 * The segment is small and rtt may appear much
	 * less than real one. Use per-dst memory
	 * to make it more realistic.
	 *
	 * A bit of theory. RTT is time passed after "normal" sized packet
	 * is sent until it is ACKed. In normal circumstances sending small
	 * packets force peer to delay ACKs and calculation is correct too.
	 * The algorithm is adaptive and, provided we follow specs, it
	 * NEVER underestimate RTT. BUT! If peer tries to make some clever
	 * tricks sort of "quick acks" for time long enough to decrease RTT
	 * to low value, and then abruptly stops to do it and starts to delay
	 * ACKs, wait for troubles.
	 */
	if (dst_metric(dst, RTAX_RTT) > tp->srtt) {
		tp->srtt = dst_metric(dst, RTAX_RTT);
		tp->rtt_seq = tp->snd_nxt;
	}
	if (dst_metric(dst, RTAX_RTTVAR) > tp->mdev) {
		tp->mdev = dst_metric(dst, RTAX_RTTVAR);
		tp->mdev_max = tp->rttvar = max(tp->mdev, TCP_RTO_MIN);
	}
	tcp_set_rto(sk);
	tcp_bound_rto(sk);
	if (inet_csk(sk)->icsk_rto < TCP_TIMEOUT_INIT && !tp->rx_opt.saw_tstamp)
		goto reset;
	tp->snd_cwnd = tcp_init_cwnd(tp, dst);
	tp->snd_cwnd_stamp = tcp_time_stamp;
	return;

reset:
	/* Play conservative. If timestamps are not
	 * supported, TCP will fail to recalculate correct
	 * rtt, if initial rto is too small. FORGET ALL AND RESET!
	 */
	if (!tp->rx_opt.saw_tstamp && tp->srtt) {
		tp->srtt = 0;
		tp->mdev = tp->mdev_max = tp->rttvar = TCP_TIMEOUT_INIT;
		inet_csk(sk)->icsk_rto = TCP_TIMEOUT_INIT;
	}
}

static void tcp_update_reordering(struct sock *sk, const int metric,
				  const int ts)
{
	struct tcp_sock *tp = tcp_sk(sk);
	if (metric > tp->reordering) {
		tp->reordering = min(TCP_MAX_REORDERING, metric);

		/* This exciting event is worth to be remembered. 8) */
		if (ts)
			NET_INC_STATS_BH(LINUX_MIB_TCPTSREORDER);
		else if (IsReno(tp))
			NET_INC_STATS_BH(LINUX_MIB_TCPRENOREORDER);
		else if (IsFack(tp))
			NET_INC_STATS_BH(LINUX_MIB_TCPFACKREORDER);
		else
			NET_INC_STATS_BH(LINUX_MIB_TCPSACKREORDER);
#if FASTRETRANS_DEBUG > 1
		printk(KERN_DEBUG "Disorder%d %d %u f%u s%u rr%d\n",
		       tp->rx_opt.sack_ok, inet_csk(sk)->icsk_ca_state,
		       tp->reordering,
		       tp->fackets_out,
		       tp->sacked_out,
		       tp->undo_marker ? tp->undo_retrans : 0);
#endif
		/* Disable FACK yet. */
		tp->rx_opt.sack_ok &= ~2;
	}
}

/* This procedure tags the retransmission queue when SACKs arrive.
 *
 * We have three tag bits: SACKED(S), RETRANS(R) and LOST(L).
 * Packets in queue with these bits set are counted in variables
 * sacked_out, retrans_out and lost_out, correspondingly.
 *
 * Valid combinations are:
 * Tag  InFlight	Description
 * 0	1		- orig segment is in flight.
 * S	0		- nothing flies, orig reached receiver.
 * L	0		- nothing flies, orig lost by net.
 * R	2		- both orig and retransmit are in flight.
 * L|R	1		- orig is lost, retransmit is in flight.
 * S|R  1		- orig reached receiver, retrans is still in flight.
 * (L|S|R is logically valid, it could occur when L|R is sacked,
 *  but it is equivalent to plain S and code short-curcuits it to S.
 *  L|S is logically invalid, it would mean -1 packet in flight 8))
 *
 * These 6 states form finite state machine, controlled by the following events:
 * 1. New ACK (+SACK) arrives. (tcp_sacktag_write_queue())
 * 2. Retransmission. (tcp_retransmit_skb(), tcp_xmit_retransmit_queue())
 * 3. Loss detection event of one of three flavors:
 *	A. Scoreboard estimator decided the packet is lost.
 *	   A'. Reno "three dupacks" marks head of queue lost.
 *	   A''. Its FACK modfication, head until snd.fack is lost.
 *	B. SACK arrives sacking data transmitted after never retransmitted
 *	   hole was sent out.
 *	C. SACK arrives sacking SND.NXT at the moment, when the
 *	   segment was retransmitted.
 * 4. D-SACK added new rule: D-SACK changes any tag to S.
 *
 * It is pleasant to note, that state diagram turns out to be commutative,
 * so that we are allowed not to be bothered by order of our actions,
 * when multiple events arrive simultaneously. (see the function below).
 *
 * Reordering detection.
 * --------------------
 * Reordering metric is maximal distance, which a packet can be displaced
 * in packet stream. With SACKs we can estimate it:
 *
 * 1. SACK fills old hole and the corresponding segment was not
 *    ever retransmitted -> reordering. Alas, we cannot use it
 *    when segment was retransmitted.
 * 2. The last flaw is solved with D-SACK. D-SACK arrives
 *    for retransmitted and already SACKed segment -> reordering..
 * Both of these heuristics are not used in Loss state, when we cannot
 * account for retransmits accurately.
 */
static int
tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_una)
{
	const struct inet_connection_sock *icsk = inet_csk(sk);
	struct tcp_sock *tp = tcp_sk(sk);
	unsigned char *ptr = ack_skb->h.raw + TCP_SKB_CB(ack_skb)->sacked;
	struct tcp_sack_block *sp = (struct tcp_sack_block *)(ptr+2);
	int num_sacks = (ptr[1] - TCPOLEN_SACK_BASE)>>3;
	int reord = tp->packets_out;
	int prior_fackets;
	u32 lost_retrans = 0;
	int flag = 0;
	int dup_sack = 0;
	int i;

	if (!tp->sacked_out)
		tp->fackets_out = 0;
	prior_fackets = tp->fackets_out;

	/* SACK fastpath:
	 * if the only SACK change is the increase of the end_seq of
	 * the first block then only apply that SACK block
	 * and use retrans queue hinting otherwise slowpath */
	flag = 1;
	for (i = 0; i< num_sacks; i++) {
		__u32 start_seq = ntohl(sp[i].start_seq);
		__u32 end_seq =	 ntohl(sp[i].end_seq);

		if (i == 0){
			if (tp->recv_sack_cache[i].start_seq != start_seq)
				flag = 0;
		} else {
			if ((tp->recv_sack_cache[i].start_seq != start_seq) ||
			    (tp->recv_sack_cache[i].end_seq != end_seq))
				flag = 0;
		}
		tp->recv_sack_cache[i].start_seq = start_seq;
		tp->recv_sack_cache[i].end_seq = end_seq;

		/* Check for D-SACK. */
		if (i == 0) {
			u32 ack = TCP_SKB_CB(ack_skb)->ack_seq;

			if (before(start_seq, ack)) {
				dup_sack = 1;
				tp->rx_opt.sack_ok |= 4;
				NET_INC_STATS_BH(LINUX_MIB_TCPDSACKRECV);
			} else if (num_sacks > 1 &&
				   !after(end_seq, ntohl(sp[1].end_seq)) &&
				   !before(start_seq, ntohl(sp[1].start_seq))) {
				dup_sack = 1;
				tp->rx_opt.sack_ok |= 4;
				NET_INC_STATS_BH(LINUX_MIB_TCPDSACKOFORECV);
			}

			/* D-SACK for already forgotten data...
			 * Do dumb counting. */
			if (dup_sack &&
			    !after(end_seq, prior_snd_una) &&
			    after(end_seq, tp->undo_marker))
				tp->undo_retrans--;

			/* Eliminate too old ACKs, but take into
			 * account more or less fresh ones, they can
			 * contain valid SACK info.
			 */
			if (before(ack, prior_snd_una - tp->max_window))
				return 0;
		}
	}

	if (flag)
		num_sacks = 1;
	else {
		int j;
		tp->fastpath_skb_hint = NULL;

		/* order SACK blocks to allow in order walk of the retrans queue */
		for (i = num_sacks-1; i > 0; i--) {
			for (j = 0; j < i; j++){
				if (after(ntohl(sp[j].start_seq),
					  ntohl(sp[j+1].start_seq))){
					sp[j].start_seq = htonl(tp->recv_sack_cache[j+1].start_seq);
					sp[j].end_seq = htonl(tp->recv_sack_cache[j+1].end_seq);
					sp[j+1].start_seq = htonl(tp->recv_sack_cache[j].start_seq);
					sp[j+1].end_seq = htonl(tp->recv_sack_cache[j].end_seq);
				}

			}
		}
	}

	/* clear flag as used for different purpose in following code */
	flag = 0;

	for (i=0; i<num_sacks; i++, sp++) {
		struct sk_buff *skb;
		__u32 start_seq = ntohl(sp->start_seq);
		__u32 end_seq = ntohl(sp->end_seq);
		int fack_count;

		/* Use SACK fastpath hint if valid */
		if (tp->fastpath_skb_hint) {
			skb = tp->fastpath_skb_hint;
			fack_count = tp->fastpath_cnt_hint;
		} else {
			skb = sk->sk_write_queue.next;
			fack_count = 0;
		}

		/* Event "B" in the comment above. */
		if (after(end_seq, tp->high_seq))
			flag |= FLAG_DATA_LOST;

		sk_stream_for_retrans_queue_from(skb, sk) {
			int in_sack, pcount;
			u8 sacked;

			tp->fastpath_skb_hint = skb;
			tp->fastpath_cnt_hint = fack_count;

			/* The retransmission queue is always in order, so
			 * we can short-circuit the walk early.
			 */
			if (!before(TCP_SKB_CB(skb)->seq, end_seq))
				break;

			in_sack = !after(start_seq, TCP_SKB_CB(skb)->seq) &&
				!before(end_seq, TCP_SKB_CB(skb)->end_seq);

			pcount = tcp_skb_pcount(skb);

			if (pcount > 1 && !in_sack &&
			    after(TCP_SKB_CB(skb)->end_seq, start_seq)) {
				unsigned int pkt_len;

				in_sack = !after(start_seq,
						 TCP_SKB_CB(skb)->seq);

				if (!in_sack)
					pkt_len = (start_seq -
						   TCP_SKB_CB(skb)->seq);
				else
					pkt_len = (end_seq -
						   TCP_SKB_CB(skb)->seq);
				if (tcp_fragment(sk, skb, pkt_len, skb_shinfo(skb)->tso_size))
					break;
				pcount = tcp_skb_pcount(skb);
			}

			fack_count += pcount;

			sacked = TCP_SKB_CB(skb)->sacked;

			/* Account D-SACK for retransmitted packet. */
			if ((dup_sack && in_sack) &&
			    (sacked & TCPCB_RETRANS) &&
			    after(TCP_SKB_CB(skb)->end_seq, tp->undo_marker))
				tp->undo_retrans--;

			/* The frame is ACKed. */
			if (!after(TCP_SKB_CB(skb)->end_seq, tp->snd_una)) {
				if (sacked&TCPCB_RETRANS) {
					if ((dup_sack && in_sack) &&
					    (sacked&TCPCB_SACKED_ACKED))
						reord = min(fack_count, reord);
				} else {
					/* If it was in a hole, we detected reordering. */
					if (fack_count < prior_fackets &&
					    !(sacked&TCPCB_SACKED_ACKED))
						reord = min(fack_count, reord);
				}

				/* Nothing to do; acked frame is about to be dropped. */
				continue;
			}

			if ((sacked&TCPCB_SACKED_RETRANS) &&
			    after(end_seq, TCP_SKB_CB(skb)->ack_seq) &&
			    (!lost_retrans || after(end_seq, lost_retrans)))
				lost_retrans = end_seq;

			if (!in_sack)
				continue;

			if (!(sacked&TCPCB_SACKED_ACKED)) {
				if (sacked & TCPCB_SACKED_RETRANS) {
					/* If the segment is not tagged as lost,
					 * we do not clear RETRANS, believing
					 * that retransmission is still in flight.
					 */
					if (sacked & TCPCB_LOST) {