TCP/IP ARCHITECTURE DESIGN AND IMPLEMENTATION IN LINUX
1 INTRODUCTION 1
Client-server model: the server listens on a port and waits for client requests.
1.1 Overview of TCP/IP Stack 2
1.1.1 Moving Down the Stack 3
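Sending starts in user space; the data is encapsulated layer by layer as it moves down the protocol stack and is then transmitted.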
1.1.2 Moving Up the Stack 5
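The receiver gets the data, unpacks it layer by layer as it moves up the protocol stack, and finally delivers it to the application layer. (Open questions: is IOCP used at the user level? Does communication within the same host go through the protocol stack?)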
1.2 Source Code Organization for Linux 2.4.20 5
1.2.1 Source Code Organization for Networking Code 7
1.3 TCP/IP Stack and Kernel Control Paths 7
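Outgoing data can travel from the application layer down to any layer of the stack, where a corresponding control path continues the processing. The receive side, taken as a whole, is processed in three steps.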
1.4 Linux Kernel Until Version 2.4 Is Non-preemptible 11
Interrupts can occur inside the kernel, and user space can also initiate them. The CPU registers involved are worth reviewing.
X86_Assembly
1.4.1 System Call on Linux 14
1.4.2 Adding New System Call 16
Worth understanding when writing system-level code.
1.5 Linux Process and Thread 17
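User mode and kernel mode use different stacks and different contexts, and the task_struct contents differ.
Each process is associated with one system-level thread (without a user-mode part?), which is also an object of type task_struct.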
1.5.1 fork() 17
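fork() creates a new process. The user stacks of the child and the parent are initially shared (each gets its own private copy once it writes to the stack, i.e. copy-on-write).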
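A minimal user-space sketch of the note above: the child overwrites a stack variable after fork(), and the parent still sees its own value. The function name and parameters are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Forks a child that overwrites a stack variable with child_write.
 * Returns the value the parent sees afterwards, or -1 on error. */
int parent_value_after_fork(int initial, int child_write)
{
    assert(initial != child_write);   /* the demo needs two distinct values */

    int on_stack = initial;           /* lives on the user stack */
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        on_stack = child_write;       /* the write lands in the child's private copy */
        _exit(on_stack == child_write ? 0 : 1);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;

    return on_stack;                  /* unchanged in the parent */
}
```

Calling parent_value_after_fork(1, 2) is expected to return 1, since the child's write never reaches the parent's copy of the page.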
1.5.2 Thread 18
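Threads are either user-level or kernel-level. User-level threads are scheduled in user space, and the user process still maps to a single kernel thread.
Kernel-level threads are more like lightweight processes and are the unit of scheduling. Scheduling one thread does not affect the other threads of the same process.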
1.5.3 Kernel Threads 19
Kernel threads are created by calling kernel_thread(). (Open question: which process are they associated with?)
1.6 Kernel Synchronization Mechanism 22
These mechanisms synchronize access to kernel global data structures across different kernel
control paths and also across CPUs.
1.6.1 Semaphore 22
1.6.2 Atomic Operations 23
1.6.3 Spin Lock 23
1.7 Application Interfaces for TCP/IP Programming 24
The client sends out a request to the server for the service,
which in turn offers its services once they are connected to each other.
1.7.1 Server Application 25
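A service is associated with a port number. An HTTP server (which is really just a service) is associated with port 80.
A port is a unique number that identifies a connection or a specific service on a given host.
When we register a service with the system, the system's service application provides a unique port number to associate with our service.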
The socket is a framework to communicate with the network protocol within the kernel.
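The server-side steps described above (create a socket, associate it with a port, start listening) can be sketched as follows. This is an illustrative sketch, not code from the book: the helper names listen_on_loopback() and bound_port() are hypothetical, and the loopback address is an assumption made to keep the example local.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Creates a TCP socket bound to 127.0.0.1:port and puts it in the listening
 * state. Port 0 asks the kernel to pick a free port.
 * Returns the listening descriptor, or -1 on error. */
int listen_on_loopback(uint16_t port, int backlog)
{
    assert(backlog > 0);

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);                    /* the service's port */
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

    if (bind(fd, (struct sockaddr *)&addr, sizeof addr) < 0 ||
        listen(fd, backlog) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Returns the local port (host byte order) of a bound socket, or 0 on error. */
uint16_t bound_port(int fd)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof addr;
    if (getsockname(fd, (struct sockaddr *)&addr, &len) < 0)
        return 0;
    return ntohs(addr.sin_port);
}
```

A caller would then loop on accept() to pick up connections that have completed the handshake.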
1.7.2 Client Application 27
The IP address identifies the host; the port identifies the service.
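A minimal client-side sketch of the same idea: the address selects the host and the port selects the service. The helper name connect_to() is hypothetical, and the example handles IPv4 dotted-quad addresses only.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Connects a TCP socket to ip:port, where ip is an IPv4 dotted-quad string.
 * Returns the connected descriptor, or -1 on error. */
int connect_to(const char *ip, uint16_t port)
{
    assert(ip != NULL);

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);                  /* selects the service */
    if (inet_pton(AF_INET, ip, &addr.sin_addr) != 1)
        return -1;                                /* not a valid IPv4 address */

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    /* connect() triggers the three-way handshake in the kernel. */
    if (connect(fd, (struct sockaddr *)&addr, sizeof addr) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

For example, connect_to("127.0.0.1", 80) would try to reach an HTTP service on the local host.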
1.7.3 Socket Options 29
setsockopt(), getsockopt().
1.7.4 Option Values 29
SO_DEBUG, SO_BROADCAST, SO_REUSEADDR, SO_KEEPALIVE, SO_LINGER, SO_OOBINLINE, SO_RCVBUF,
SO_DONTROUTE, SO_RCVTIMEO, SO_SNDTIMEO.
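A short sketch of how two of the options listed above might be set and read back with setsockopt() and getsockopt(). The helper names are hypothetical, and only SO_REUSEADDR and SO_RCVTIMEO are shown.

```c
#include <assert.h>
#include <sys/socket.h>
#include <sys/time.h>

/* Enables SO_REUSEADDR on fd and reads the option back.
 * Returns 1 if the option is on, 0 if it is off, -1 on error. */
int enable_reuseaddr(int fd)
{
    int on = 1;
    if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof on) < 0)
        return -1;

    int value = 0;
    socklen_t len = sizeof value;
    if (getsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &value, &len) < 0)
        return -1;
    return value != 0;
}

/* Sets a receive timeout (SO_RCVTIMEO) in whole seconds.
 * Returns 0 on success, -1 on error. */
int set_recv_timeout(int fd, long seconds)
{
    assert(seconds >= 0);

    struct timeval tv;
    tv.tv_sec = seconds;
    tv.tv_usec = 0;
    return setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof tv);
}
```

SO_REUSEADDR is typically set on a listening socket before bind(), so that a restarted server can rebind its port promptly.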
1.8 Shutdown 35
int shutdown(int socket, int how);
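A minimal sketch of a partial close using shutdown() with SHUT_WR. As an assumption for self-containment it uses a connected AF_UNIX socketpair() rather than a TCP connection; at the API level the effect is the same: the peer sees end-of-file while the reverse direction stays usable. The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Sends msg, shuts down the write side, and checks that the peer sees
 * end-of-file but can still send a reply back.
 * Returns 0 if everything behaved as described, -1 otherwise. */
int half_close_demo(const char *msg, size_t len)
{
    char buf[64];
    int sv[2];

    assert(msg != NULL && len > 0 && len <= sizeof buf);

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;

    int ok = -1;
    if (write(sv[0], msg, len) == (ssize_t)len &&
        shutdown(sv[0], SHUT_WR) == 0 &&                  /* no more sends from sv[0] */
        read(sv[1], buf, sizeof buf) == (ssize_t)len &&   /* peer gets the data       */
        read(sv[1], buf, sizeof buf) == 0 &&              /* then end-of-file         */
        write(sv[1], "ok", 2) == 2 &&                     /* reverse path still open  */
        read(sv[0], buf, sizeof buf) == 2)
        ok = 0;

    close(sv[0]);
    close(sv[1]);
    return ok;
}
```

The other values of how are SHUT_RD (stop receiving) and SHUT_RDWR (both directions).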
1.8.1 Kernel Shutdown Implementation 36
SEND_SHUTDOWN, RCV_SHUTDOWN
1.8.2 Send Shutdown 36
1.8.3 Receive Shutdown 36
1.9 I/O 38
1.9.1 read() 38
1.9.2 write() 38
1.9.3 recv() 38
1.9.4 send() 39
1.9.5 select() 39
select() adds I/O multiplexing and demultiplexing.
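A small sketch of I/O multiplexing with select(): one call waits on two descriptors and reports which one became readable. The helper name wait_readable() and the two-descriptor limit are assumptions made to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Waits up to timeout_ms for either descriptor to become readable.
 * Returns the readable descriptor (fd_a wins if both are ready),
 * or -1 on timeout or error. */
int wait_readable(int fd_a, int fd_b, int timeout_ms)
{
    assert(fd_a >= 0 && fd_a < FD_SETSIZE);
    assert(fd_b >= 0 && fd_b < FD_SETSIZE);
    assert(timeout_ms >= 0);

    fd_set readfds;
    FD_ZERO(&readfds);
    FD_SET(fd_a, &readfds);
    FD_SET(fd_b, &readfds);

    struct timeval tv;
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    /* The first argument is the highest descriptor number plus one. */
    int maxfd = fd_a > fd_b ? fd_a : fd_b;
    int n = select(maxfd + 1, &readfds, NULL, NULL, &tv);
    if (n <= 0)
        return -1;

    if (FD_ISSET(fd_a, &readfds))
        return fd_a;
    if (FD_ISSET(fd_b, &readfds))
        return fd_b;
    return -1;
}
```

The descriptors can be sockets or pipes; a server would typically pass its listening socket together with its connected sockets.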
1.10 TCP State 39
TCP three-way handshake. On success the connection enters the established state.
1. Connection initiation (active and passive)
2. Established connection
3. Connection closure (active and passive)
Four-way connection closure process. On completion the connection enters the closed state.
FIN_WAIT2 state is skipped when the ACK is piggybacked on the FIN segment.
1.10.1 Partial Close 45
Timeline diagram for a client that issues shutdown on the write side.
1.10.2 tcpdump Output for Partial Close 47
1.11 Summary 48
TCP data may be queued at different levels, such as the socket's send queue,
the device queue (TOS), and the CPU output queue.
2 PROTOCOL FUNDAMENTALS 49
TCP manages connection and data integrity, whereas IP is responsible for delivery of data to the correct destination.
2.1 TCP 50
2.1.1 TCP Header 50
Port Numbers, Sequence Number, Acknowledgment Number, Header Length,
Unused Field, TCP Flags, Window Size, Checksum, Urgent Pointer
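The header fields listed above can be illustrated with a small decoder for the fixed 20-byte part of a TCP header. This is a sketch based on the classic RFC 793 layout; the struct and function names are hypothetical and are not the kernel's struct tcphdr.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fixed part of a TCP header, decoded into host byte order. */
struct tcp_fields {
    uint16_t src_port, dst_port;
    uint32_t seq, ack;
    uint8_t  header_len;   /* in bytes: data offset field times 4 */
    uint8_t  flags;        /* low 6 bits: URG ACK PSH RST SYN FIN */
    uint16_t window, checksum, urgent_ptr;
};

/* Read big-endian (network order) 16-bit and 32-bit values. */
static uint16_t be16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

static uint32_t be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Returns 0 on success, -1 if the buffer is too short or the
 * header length field is invalid. */
int parse_tcp_header(const uint8_t *buf, size_t len, struct tcp_fields *out)
{
    assert(buf != NULL && out != NULL);
    if (len < 20)
        return -1;

    out->src_port   = be16(buf);
    out->dst_port   = be16(buf + 2);
    out->seq        = be32(buf + 4);
    out->ack        = be32(buf + 8);
    out->header_len = (uint8_t)((buf[12] >> 4) * 4);
    out->flags      = buf[13] & 0x3F;
    out->window     = be16(buf + 14);
    out->checksum   = be16(buf + 16);
    out->urgent_ptr = be16(buf + 18);

    if (out->header_len < 20 || out->header_len > len)
        return -1;
    return 0;
}
```

Any bytes between offset 20 and header_len hold the TCP options covered in the next section.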
2.2 TCP Options (RFC 1323) 54
2.2.1 mss Option 55
2.2.2 Window-Scaling Option 55
2.2.3 Timestamp Option 56
2.2.4 Selective Acknowledgment Option 57
2.3 TCP Data Flow 58
2.3.1 ACKing of Data Segments 58
2.4 Delayed Acknowledgment 67
2.5 Nagle’s Algorithm (RFC 896) 69
2.6 TCP Sliding Window Protocol 72
2.7 Maximizing TCP Throughput 79
2.8 TCP Timers 82
2.8.1 Retransmission Timer 82
2.8.2 Persistent Timer 83
2.8.3 Keepalive Timer 84
2.8.4 TIME_WAIT Timer 85
2.9 TCP Congestion Control 85
2.10 TCP Performance and Reliability 86
2.10.1 RTTD 86
2.10.2 SACK/DSACK 86
2.10.3 Window Scaling 87
2.11 IP (Internet Protocol) 87
2.11.1 IP Header 88
2.12 Routing 90
2.13 netstat 90
2.14 traceroute 92
2.14.1 traceroute Mechanism 93
2.15 ICMP 93
2.16 ping 95
2.17 ARP/RARP 97
2.18 Summary 99
3 KERNEL IMPLEMENTATION OF SOCKETS 101
3.1 Socket Layer 102
3.2 VFS and Socket 103
3.3 Protocol Socket Registration 105
3.4 struct inet_protosw 107
3.5 Socket Organization in the Kernel 107
3.6 Socket 108
3.7 inet_create 110
3.7.1 Sock 112
3.8 Flow Diagram for Socket Call 118
3.9 Summary 118
4 KERNEL IMPLEMENTATION OF TCP CONNECTION SETUP 121
4.1 Connection Setup 122
4.1.1 Server Side Setup 122
4.1.2 Server Side Operations 124
4.2 Bind 124
4.2.1 Data Structures Related to Socket BIND 125
4.2.2 Hash Buckets for tcp Bind 125
4.2.3 tcp_ehash 125
4.2.4 tcp_listening_hash 125
4.2.5 tcp_bhash 125
4.2.6 tcp_hashinfo 126
4.2.7 tcp_bind_hashbucket 129
4.2.8 tcp_bind_bucket 129
4.2.9 bind() 130
4.2.10 sys_bind() 130
4.2.11 sockfd_lookup() 130
4.2.12 fget() 131
4.2.13 inet_bind() 131
4.2.14 tcp_v4_get_port() 133
4.2.15 tcp_bind_conflict() 135
4.3 Listen 137
4.3.1 sys_listen() 138
4.3.2 inet_listen() 139
4.3.3 tcp_listen_start() 139
4.3.4 Listen Flow 142
4.3.5 struct open_request 142
4.3.6 Accept Queue Is Full 147
4.3.7 Established Sockets Linked in tcp_ehash Hash Table 150
4.3.8 State of the Connection Request when the Three-Way Handshake Is Still Pending 150
4.3.9 State of the Connection Request when the Three-Way Handshake Is Completed 151
4.4 Connection Request Handling by Kernel 151
4.4.1 SYN Queue Processing 155
4.4.2 Accept Queue Processing 155
4.4.3 Flow Control for Handling a New Connection Request 156
4.5 Accept 156
4.5.1 inet_accept() 159
4.5.2 Linking of Inode and Socket Data Structures when the Three-Way Handshake Has Completed and Is Accepted by Application 161
4.5.3 Linking of VFS and Socket Data Structures in the Kernel when a New Connection Is Established 162
4.5.4 File Table Entry of a New Accepted Connected Socket 162
4.5.5 Flow Control for Accepting New Established Connections 162
4.6 Client Side Setup 163
4.6.1 Client Side Operations 164
4.6.2 Connect 164
4.6.3 tcp_v4_connect() 167
4.6.4 ip_route_connect() 167
4.6.5 Flow Control for Generating a Connection Request 167
4.6.6 tcp_v4_hash_connect() 170
4.6.7 __tcp_v4_check_established() 171
4.6.8 tcp_connect() 174
4.6.9 tcp_transmit_skb() 176
4.7 Summary 178
5 sk_buff AND PROTOCOL HEADERS 181
5.1 struct sk_buff 182
5.2 struct skb_shared_info 186
5.3 sk_buff and DMA—SKB_FRAG_STRUCT 187
5.3.1 DMA and Fragmented sk_buff Containing Paged Data 188
5.3.2 sk_buff and IP Fragmentation 188
5.3.3 sk_buff and Fragmentation 190
5.4 Routines Operating on sk_buff 190
5.4.1 alloc_skb() 190
5.4.2 skb_reserve() 191
5.4.3 skb_put() 192
5.4.4 skb_push() 194
5.4.5 skb_pull() 195
5.5 sk_buff Builds Protocol Headers as It Traverses Down the Protocol Layers 196
5.5.1 Tcp Header Is Added to sk_buff 196
5.5.2 Ip Header Is Added to sk_buff 197
5.5.3 Link Layer Header Is Added to sk_buff 198
5.6 sk_buff Extracts Protocol Headers as It Traverses Up the Protocol Layers When a Packet Arrives 199
5.6.1 sk_buff Is Made to Point to a Datalink Layer Header Which Will Be Processed by a Datalink Driver 199
5.6.2 sk_buff Is Made to Point to an ip Layer Header Which Will Be Processed by an IP Layer 200
5.6.3 sk_buff Is Made to Point to a tcp Layer Header Which Will Be Processed by a tcp Layer 200
5.7 Summary 202
6 MOVEMENT OF sk_buff ACROSS PROTOCOL LAYERS 205
6.1 Packet Traversing Down the TCP/IP Stack 206
6.1.1 Path of Packet Traversal from Socket Layer to Device for Transmission 207
6.1.2 Kernel Path for TCP Packet Traversing Down the Stack 208
6.2 Routed Packet Ready for Transmission 214
6.3 Kernel Flow for a Packet Moving Down the Stack 214
6.4 Packet Traversing Up the TCP/IP Stack 214
6.4.1 Path of Packet Traversal from Device (Reception) to Socket Layer 219
6.4.2 Kernel Path for TCP Packet Traversing Up the Stack 219
6.5 Kernel Flow for a Packet Moving Up the Stack 225
6.6 Summary 225
7 TCP SEND 231
7.1 TCP Segmentation Unit for Sending Data 232
7.1.1 Functioning of Segmentation Unit without Scatter–Gather Support 232
7.1.2 Segmentation without Scatter–Gather Support 234
7.1.3 1 mss of Data Written over the Socket 235
7.2 Segmentation with Scatter–Gather Technique 235
7.2.1 Segmentation with Scatter–Gather Support 239
7.2.2 Application Writes Y Bytes over the Socket 239
7.2.3 can_coalesce() 239
7.2.4 tcp_copy_to_page() 240
7.2.5 tcp_mark_push() 241
7.2.6 forced_push() 241
7.2.7 tcp_push() 242
7.2.8 __tcp_push_pending_frames() 243
7.2.9 tcp_snd_test() 243
7.2.10 tcp_nagle_check() 244
7.2.11 tcp_minshall_check() 245
7.2.12 tcp_write_xmit() 245
7.2.13 update_send_head() 247
7.2.14 tcp_push_one() 247
7.2.15 skb_entail() 248
7.3 Sending OOB Data 249
7.4 Flow for TCP Segmentation Unit and Send Process 250
7.5 Functional Level Flow for Segmentation and Send Mechanism 250
7.6 Summary 251
8 TCP RECEIVE 255
8.1 Queuing Mechanism 256
8.1.1 Processing in tcp_rcv_established() 256
8.1.2 tcp_prequeue() 258
8.1.3 Processing of Queues 259
8.1.4 tcp_data_wait() 263
8.1.5 tcp_prequeue_process() 264
8.1.6 lock_sock() 265
8.1.7 lock_sock() 265
8.1.8 release_sock() 266
8.1.9 release_sock() 266
8.2 Processing of TCP Data from the Receive Queue 267
8.2.1 cleanup_rbuf() 268
8.2.2 skb_copy_datagram_iovec() 271
8.2.3 Reading Data from Receive Buffer without Paged Data Area 273
8.2.4 X Bytes Requested from the Application 273
8.2.5 1 mss = n Bytes Requested from the Application 275
8.2.6 n − X Bytes Requested from the Application 275
8.2.7 Consumption of Data from a Paged Buffer 275
8.2.8 n Bytes Requested by the Application 276
8.2.9 One Page of Data Requested by the Application 276
8.3 TCP Urgent Byte Processing 276
8.3.1 Urgent Byte Read as OOB Data 277
8.3.2 tcp_recv_urg() 278
8.3.3 Urgent Mode Processing and Reading an Urgent Byte as Inline Data 280
8.4 DATA Flow Diagram for Receiving Data over the TCP Socket 284
8.5 Summary 290
9 TCP MEMORY MANAGEMENT 291
9.1 Transmit Side TCP Memory Management 291
9.1.1 select_size() 294
9.1.2 tcp_alloc_pskb() 295
9.1.3 alloc_skb() 296
9.1.4 tcp_alloc_page() 297
9.1.5 skb_charge() 298
9.1.6 tcp_mem_schedule() 298
9.1.7 tcp_free_skb() 300
9.1.8 sock_wfree() 300
9.1.9 tcp_write_space() 301
9.1.10 tcp_mem_reclaim() 302
9.1.11 __tcp_mem_reclaim() 302
9.1.12 wait_for_tcp_memory() 303
9.2 Receive Side TCP Memory Management 305
9.2.1 tcp_prune_queue() 308
9.2.2 tcp_clamp_window() 309
9.2.3 tcp_collapse_ofo_queue() 311
9.2.4 tcp_collapse() 312
9.2.5 __skb_queue_purge() 317
9.3 Freeing of Memory Allocated to a Receive Buffer 319
9.4 System-Wide Control Parameters Are Worth Noticing When It Comes to TCP Memory Management 319
9.5 Summary 321
10 TCP TIMERS 323
10.1 Timers in Linux 324
10.1.1 mod_timer() 324
10.1.2 detach_timer() 325
10.1.3 del_timer() 325
10.1.4 When Are Timer Routines Executed? 326
10.2 TCP Retransmit Timer 326
10.2.1 When Do We Set Retransmit Timer? 327
10.2.2 When Do We Reset or Cancel Retransmit Timers? 327
10.2.3 tcp_enter_loss() 330
10.2.4 tcp_retransmit_skb() 333
10.2.5 tcp_retrans_try_collapse() 334
10.2.6 skb_cloned() 336
10.3 Zero Window Probe Timer 336
10.3.1 When Is the First Time Probe Timer Installed? 337
10.3.2 When Is the Probe Timer Canceled for the Connection? 337
10.3.3 tcp_ack_probe() 338
10.3.4 How Does the Window Probe Timer Work? 338
10.3.5 tcp_probe_timer() 339
10.3.6 tcp_send_probe0() 339
10.3.7 tcp_write_wakeup() 339
10.4 Delay ACK Timer 342
10.4.1 When Is the ACK Scheduled? 344
10.4.2 How and When Is the ACK Segment Sent? 344
10.4.3 Quick ACK Mode 345
10.4.4 __tcp_ack_snd_check() 345
10.4.5 tcp_ack_snd_check() 346
10.4.6 tcp_send_delayed_ack() 347
10.4.7 tcp_delack_timer() 348
10.4.8 tcp_reset_xmit_timer() 349
10.4.9 tcp_write_timer() 351
10.4.10 tcp_clear_xmit_timer() 352
10.5 Keepalive Timer 353
10.5.1 When Is the Keepalive Timer Activated? 353
10.5.2 How Is the Timer Reset? 354
10.5.3 tcp_keepalive_timer() 354
10.6 SYN-ACK Timer 356
10.6.1 When Is the SYN-ACK Timer Activated? 356
10.6.2 When Is the SYN-ACK Timer Stopped? 357
10.6.3 tcp_synack_timer() 357
10.7 TIME_WAIT Timer 361
10.7.1 When Do We Trigger TIME_WAIT Timer? 361
10.7.2 tcp_time_wait() 362
10.7.3 tcp_tw_schedule() 362
10.7.4 Non-recycle Mode 363
10.7.5 Recycle Mode 365
10.7.6 tcp_twkill() 367
10.7.7 tcp_twcal_tick() 370
10.7.8 __tcp_tw_hashdance() 374
10.8 Summary 375
11 TCP CORE PROCESSING 377
11.1 TCP Incoming Segment Processing 378
11.1.1 Prediction Flags 378
11.1.2 Building Prediction Flags 379
11.1.3 Condition to Enable the Fast Path 380
11.1.4 When to Enable the Slow Path 382
11.1.5 When to Enable the Fast Path 382
11.1.6 Points to Remember about Prediction Flags 383
11.2 Fast Path Processing 384
11.3 Slow Path Processing 386
11.3.1 tcp_sequence() 387
11.3.2 tcp_replace_ts_recent() 387
11.3.3 tcp_event_data_recv() 390
11.3.4 tcp_incr_quickack() 391
11.3.5 tcp_grow_window() 392
11.3.6 tcp_grow_window() 393
11.3.7 How Do We Calculate Window to Be Advertised? 394
11.3.8 tcp_receive_window() 395
11.3.9 tcp_select_window() 395
11.3.10 tcp_space() 397
11.3.11 tcp_data_snd_check() 397
11.3.12 __tcp_data_snd_check() 398
11.3.13 tcp_paws_discard() 398
11.4 Processing of Incoming ACK 400
11.4.1 tcp_packets_in_flight() 403
11.4.2 tcp_ack_is_dubious() 404
11.4.3 tcp_cong_avoid() 405
11.4.4 tcp_ack_update_window() 406
11.4.5 tcp_may_update_window() 407
11.4.6 tcp_clean_rtx_queue() 408
11.5 Processing of SACK blocks 410
11.5.1 tcp_sacktag_write_queue() 410
11.6 Reordering Length 417
11.7 Processing TCP Urgent Pointer 421
11.7.1 tcp_check_urg() 422
11.8 Processing Data Segments in Slow Path 424
11.8.1 tcp_sack_new_ofo_skb() 433
11.8.2 tcp_sack_maybe_coalesce() 434
11.8.3 tcp_sack_extend() 435
11.8.4 tcp_ofo_queue() 436
11.8.5 tcp_sack_remove() 441
11.9 Overview of Core TCP Processing 442
11.10 Summary 442
12 TCP STATE PROCESSING 445
12.1 Overview of State Processing 446
12.2 TCP States 448
12.2.1 TCP_CA_CWR 449
12.2.2 Undoing from TCP_CA_CWR 449
12.3 Processing of Duplicate/Partial ACKs in Recovery State 449
12.3.1 tcp_remove_reno_sacks() 450
12.3.2 tcp_try_undo_partial() 451
12.4 Processing of Duplicate/Partial ACKs in Loss State 452
12.4.1 tcp_try_undo_loss() 453
12.4.2 tcp_check_sack_reneging() 455
12.5 Default Processing of TCP States 456
12.5.1 tcp_time_to_recover() 459
12.5.2 tcp_head_timedout() 460
12.5.3 tcp_try_to_open() 461
12.5.4 tcp_update_scoreboard() 462
12.5.5 tcp_xmit_retransmit_queue() 464
12.5.6 tcp_packet_delayed() 466
12.6 Processing of TCP Non-open States when ACKed Beyond tp->high_seq 467
12.6.1 TCP_CA_Loss 467
12.6.2 TCP_CA_CWR 468
12.6.3 TCP_CA_Disorder 470
12.6.4 tcp_try_undo_dsack() 471
12.6.5 TCP_CA_Recovery 471
12.6.6 tcp_add_reno_sack() 472
12.6.7 tcp_check_reno_reordering() 473
12.6.8 tcp_may_undo() 473
12.6.9 tcp_packet_delayed() 474
12.6.10 tcp_undo_cwr() 475
12.6.11 tcp_mark_head_lost() 475
12.6.12 tcp_sync_left_out() 477
12.7 Summary 477
13 NETLINK SOCKETS 479
13.1 Introduction to Netlink Sockets 479
13.2 Netlink Socket Registration and Initialization at Boot Time 480
13.3 How Is the Kernel Netlink Socket Created? 481
13.4 How Is the User Netlink Socket Created? 482
13.5 Netlink Data Structures 485
13.5.1 nl_table 485
13.5.2 rtnetlink_link 486
13.6 Other Important Data Structures 488
13.6.1 struct nlmsghdr 488
13.6.2 struct msghdr 489
13.7 Netlink Packet Format 490
13.8 Netlink Socket Example—tc Command for Adding a qdisc 490
13.8.1 tc Command Flow in User Space for Adding a qdisc 490
13.8.2 tc Command in Kernel Space 491
13.8.2.1 sys_sendmsg() 491
13.8.2.2 sock_sendmsg() 492
13.8.2.3 netlink_sendmsg() 492
13.8.2.4 netlink_unicast() 493
13.8.2.5 netlink_data_ready() 494
13.8.2.6 rtnetlink_rcv() 494
13.8.2.7 rtnetlink_rcv_skb() 494
13.8.2.8 rtnetlink_rcv_msg() 495
13.9 Flow Diagram for tc Command in Kernel Space 496
13.10 Summary 496
14 IP ROUTING 499
14.1 Routing 501
14.2 Policy-Based Routing 503
14.3 Multipathing 505
14.4 Record Route Options (RFC 791) and Processing by Linux Stack 509
14.4.1 Record Routing 510
14.5 Source Routing 510
14.5.1 Strict Record Routing 510
14.5.2 Loose Record Routing 511
14.5.3 SRR Processing Implementation 511
14.6 Linux Kernel Implementation of Routing Table and Caches 517
14.7 Routing Cache Implementation Overview 517
14.7.1 Routing Cache Data Structures 519
14.8 Managing Routing Cache 523
14.8.1 Routing Cache for Local Connections 525
14.8.2 sk_dst_check() 526
14.8.3 Link Failure and Reporting to Routing Subsystem 527
14.8.4 dst_link_failure() 527
14.8.5 ipv4_link_failure() 527
14.8.6 dst_set_expires() 528
14.8.7 Routing Cache for the Incoming Packets 529
14.8.8 Routing Cache Timer 530
14.8.9 rt_periodic_timer 530
14.8.10 rt_may_expire() 533
14.8.11 dst_free() 534
14.8.12 dst_free() 535
14.8.13 dst_destroy() 535
14.8.14 dst_run_gc() 536
14.8.15 Interface down and rt_flush_timer 537
14.8.16 rt_cache_flush() 538
14.9 Implementation Overview of Forwarding Information Base (FIB) 540
14.9.1 struct fib_table 540
14.9.2 struct fn_hash 543
14.9.3 struct fn_zone 543
14.9.4 struct fib_node 544
14.9.5 struct fib_info 546
14.9.6 struct fib_nh 547
14.9.7 struct fib_rule 548
14.10 Adding New Entry in Routing Table Using ip Command (RT Netlink Interface) 549
14.10.1 What Happens When the ip Command Is Run with a Route Option for Adding an Entry in Routing Table? 550
14.10.2 inet_rtm_newroute() 550
14.10.3 struct rtmsg 551
14.10.4 struct kern_rta 552
14.10.5 fn_hash_insert() 553
14.10.6 fn_new_zone() 554
14.10.7 fib_create_info() 557
14.10.8 fn_hash_insert() 558
14.11 What Happens When the ip Command Is Run with a Rule Option for Adding an Entry in the Routing Table? 558
14.11.1 inet_rtm_newrule() 559
14.11.2 FIB Initialization 561
14.12 FIB Traversal Flow Diagram 563
14.12.1 ip_route_output() 563
14.12.2 ip_route_output_key() 564
14.12.3 ip_route_output_slow() 566
14.12.4 ip_dev_find() 576
14.12.5 in_dev_get() 577
14.12.6 inet_select_addr() 578
14.12.7 ROUTESCOPES 580
14.12.8 fib_lookup() 581
14.13 Summary 589
15 IP QUALITY OF SERVICE IN LINUX (IP QoS) 591
15.1 Introduction 591
15.2 Basic Components of Linux Traffic Control 592
15.3 Linux Implementation of pfifo_fast qdisc 593
15.4 Queueing Discipline Data Structure 596
15.4.1 struct Qdisc 596
15.4.2 struct Qdisc_ops 597
15.4.3 struct Qdisc_class_ops 598
15.4.4 struct cbq_class 599
15.5 tc User Program and Kernel Implementation Details 601
15.5.1 tc_modify_qdisc() 601
15.5.2 qdisc_create() 602
15.5.3 cbq_init() 604
15.5.4 qdisc_graft() 604
15.5.5 dev_graft_qdisc() 605
15.6 The tc Commands for Creating Class Hierarchy for CBQ 605
15.6.1 tc_ctl_tclass() 607
15.6.2 cbq_change_class() 607
15.7 Filters 610
15.7.1 tc_ctl_tfilter() 611
15.8 u32 Filter Implementation 614
15.8.1 u32_change() 615
15.9 Route Filter Implementation 616
15.9.1 route4_change() 618
15.10 Enqueue 619
15.10.1 cbq_enqueue() 620
15.10.2 cbq_classify() 621
15.10.3 Overview of cbq_enqueue() 621
15.11 Overview of Linux Implementation of CBQ 622
15.12 cbq_dequeue() 622
15.12.1 From net/core/dev.c 626
15.12.2 qdisc_run() 626
15.12.3 qdisc_restart() 626
15.12.4 cbq_dequeue() 627
15.12.5 cbq_dequeue_1() 629
15.12.6 cbq_dequeue_prio() 630
15.13 Summary 633
16 IP FILTER AND FIREWALL 635
16.1 Netfilter Hook Framework 636
16.2 Netfilter Hooks on IP Stack 638
16.2.1 Hooks for Outgoing Packets 638
16.2.2 Hooks for Incoming Packets 639
16.3 Overview of Netfilter Hooks on Linux TCP-IP Stack 640
16.4 Registration of Netfilter Hooks 640
16.5 Processing of Netfilter Hooks 642
16.5.1 nf_hook_slow() 642
16.5.2 nf_iterate() 643
16.5.3 struct nf_hook_ops 644
16.6 Compatibility Framework 644
16.6.1 fw_in() 645
16.7 Ip Chains 647
16.7.1 Filtering with Ipchains 648
16.7.2 Ipchain Chain of Rules 649
16.7.3 struct ip_chain 649
16.7.4 struct ip_fwkernel 650
16.7.5 struct ip_reent 651
16.7.6 struct ip_fw 651
16.7.7 Organization of Tables in Ipchains 652
16.8 How Is the Packet Filtered with Ipchains 653
16.8.1 ip_fw_check() 653
16.8.2 ip_rule_match() 655
16.9 Iptables 655
16.9.1 Registration of Iptables Hooks 657
16.10 Iptables Filter Rules and Target Organization 657
16.10.1 struct ipt_table 658
16.10.2 struct ipt_table_info 658
16.10.3 struct ipt_entry 661
16.10.4 struct ipt_entry_match 662
16.10.5 struct ipt_tcp 663
16.10.6 struct ipt_entry_target 664
16.10.7 struct ipt_standard_target 664
16.11 Organization of Filter Rules and Target for Iptables 664
16.12 Filtering Packets with Iptables 664
16.12.1 ipt_do_table() 664
16.12.2 IPT_MATCH_ITERATE 668
16.13 Summary 668
17 NET SOFTIRQ 671
17.1 Why Net SoftIRQs, and How Do We Raise Them? 672
17.1.1 Transmission 672
17.1.2 Reception 672
17.2 How Are SoftIRQs Processed, and When? 675
17.3 Registration of SoftIRQs 678
17.4 Packet Reception and Delayed Processing by Rx SoftIRQ 679
17.5 Processing of Net Rx SoftIRQ 682
17.6 Packet Transmission and SoftIRQ 686
17.7 Summary 696
18 TRANSMISSION AND RECEPTION OF PACKETS 697
18.1 DMA Ring Buffers for Transmission and Reception of Packets 698
18.2 Packet Reception Process 698
18.2.1 Flow of Packet Reception with DMA 698
18.2.2 Reception Ring Buffer 698
18.3 Packet Transmission Process 700
18.3.1 Flow of Packet Transmission with DMA 702
18.3.2 Transmission Ring Buffer 702
18.3.3 Transmission Ring Buffer 703
18.4 Implementation of Reception and Transmission of Packets 704
18.4.1 struct etrax_eth_descr 705
18.4.2 struct etrax_dma_descr 706
18.4.3 Initialization of Device 707
18.4.5 Initialization of DMA Transmit Ring Buffers 707
18.4.6 Initialization of DMA Receive Ring Buffers 709
18.5 Rx Interrupt for Reception of Packets 709
18.5.1 Rx DMA Buffer Initialized 711
18.5.2 e100_rx() 711
18.5.3 Rx Descriptors After Reception of Three Packets in DMA Buffer Before Rx Interrupt Being Raised 713
18.5.4 Rx Descriptors After First Packet Is Pulled Out of DMA Buffer and Given to OS in Rx Interrupt Handler 713
18.6 Transmission of Packets 713
18.6.1 e100_send_packet() 713
18.6.2 Tx DMA Ring Buffer Descriptor After Initialization 717
18.6.3 e100_hardware_send_packet() 717
18.6.4 There Are Two Packets in Device’s DMA Tx Ring Buffer to Be Transmitted 717
18.6.5 e100tx_interrupt() 720
18.6.6 First Packet from the DMA Queue Is Transmitted and Second One Is yet to Be Transmitted; After Interrupt Is Generated, Transmitted Buffer Is Freed 721
18.7 Summary 721
19 lkcd AND DEBUGGING TCP/IP STACK 723
19.1 lkcd Source and Patches 724
19.2 Touching the Socket 724
19.3 Looking into the Receive Socket Buffer 726
19.3.1 Route Information in sk_buff 727
19.4 Peep into Send Socket Buffer 727
19.5 TCP Segmentation Unit 729
19.6 Send Congestion Window and ssthresh 730
19.7 Retransmissions and Route 733
19.8 Peeping into Connection Queues and SYN Queues 733
19.9 Routing and IP Qos lcrash Steps 735
19.9.1 lcrash Steps for Default Queueing Discipline in Linux (pfifo_fast) 735
19.10 CBQ (Class-Based) Queueing Discipline lcrash Steps 739
19.11 U32 Filters 739
19.12 Route Filters 743
19.13 FIB Table lcrash Output for Setting Up the Realm Using ip Command 745
19.14 lcrash Output for Setting Up Route Filter Using tc Command 749
19.15 Netlink Data Structure 755
19.15.1 nl_table 755
19.15.2 rtnetlink_link 755
19.16 Summary 757
20 NEXT EDITION 759
Bibliography 763
Index 765