To understand the constraints of PACKET_MMAP, we have to look at the structure
used to hold the pointers to each block.

Currently, this structure is a vector called pg_vec, dynamically allocated with
kmalloc. Its size limits the number of blocks that can be allocated.

    +---+---+---+---+
    | x | x | x | x |
    +---+---+---+---+
      |   |   |   |
      |   |   |   v
      |   |   v  block #4
      |   v  block #3
      v  block #2
     block #1

kmalloc allocates any number of bytes of physically contiguous memory from
a pool of predetermined sizes. This pool is maintained by the slab allocator,
which is ultimately responsible for doing the allocation and hence imposes
the maximum amount of memory that kmalloc can allocate.

In a 2.4/2.6 kernel on the i386 architecture, the limit is 131072 bytes. The
predetermined sizes that kmalloc uses can be checked in the "size-<bytes>"
entries of /proc/slabinfo.

On a 32-bit architecture, pointers are 4 bytes long, so the total number of
pointers to blocks is

    131072/4 = 32768 blocks

PACKET_MMAP buffer size calculator
------------------------------------

Definitions:

<size-max>    : the maximum size that can be allocated with kmalloc (see /proc/slabinfo)
<pointer size>: depends on the architecture -- sizeof(void *)
<page size>   : depends on the architecture -- PAGE_SIZE or getpagesize (2)
<max-order>   : the value defined with MAX_ORDER
<frame size>  : an upper bound on the frame's capture size (more on this later)

From these definitions we derive

    <block number> = <size-max> / <pointer size>
    <block size>   = <page size> << <max-order>

so the maximum buffer size is

    <block number> * <block size>

and the number of frames is

    <block number> * <block size> / <frame size>

Suppose the following parameters, which apply to a 2.6 kernel on an
i386 architecture:

    <size-max>     = 131072 bytes
    <pointer size> = 4 bytes
    <page size>    = 4096 bytes
    <max-order>    = 11

and a value for <frame size> of 2048 bytes.
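The calculator definitions above can be written down as a few C helpers.
This is only a sketch: struct ring_limits and the function names are
hypothetical, chosen for illustration, and are not part of any kernel or
libc interface. The values have to be supplied by the caller for the
target architecture.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical parameter bundle; the field names are illustrative only. */
struct ring_limits {
    uint64_t size_max;     /* <size-max>: largest kmalloc allocation, bytes */
    uint64_t pointer_size; /* <pointer size>: sizeof(void *) on the target  */
    uint64_t page_size;    /* <page size>: PAGE_SIZE on the target          */
    unsigned max_order;    /* <max-order>: MAX_ORDER on the target          */
};

/* <block number> = <size-max> / <pointer size> */
static uint64_t block_number(const struct ring_limits *p)
{
    assert(p->pointer_size != 0);
    return p->size_max / p->pointer_size;
}

/* <block size> = <page size> << <max-order> */
static uint64_t block_size(const struct ring_limits *p)
{
    assert(p->max_order < 32); /* keep the shift well inside 64 bits */
    return p->page_size << p->max_order;
}

/* max buffer size = <block number> * <block size> */
static uint64_t max_buffer_size(const struct ring_limits *p)
{
    return block_number(p) * block_size(p);
}

/* number of frames = <block number> * <block size> / <frame size> */
static uint64_t max_frames(const struct ring_limits *p, uint64_t frame_size)
{
    assert(frame_size != 0);
    return max_buffer_size(p) / frame_size;
}
```

Filling the struct with the i386 values listed above and calling
max_frames() with a frame size of 2048 reproduces the worked figures in
this section. 64-bit arithmetic is used on purpose, since the theoretical
buffer size does not fit in 32 bits.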
These parameters will yield

    <block number> = 131072/4 = 32768 blocks
    <block size>   = 4096 << 11 = 8 MiB

and hence the buffer will have a size of 262144 MiB. So it can hold
262144 MiB / 2048 bytes = 134217728 frames.

Actually, this buffer size is not possible on an i386 architecture.
Remember that the memory is allocated in kernel space; in the case of
an i386 kernel, that memory is limited to 1 GiB.

No memory allocation is freed until the socket is closed. The memory
allocations are done with GFP_KERNEL priority, which basically means that
the allocation can wait and swap out other processes' memory in order to
obtain the necessary memory, so the limits can normally be reached.

Other constraints
-------------------

If you check the source code you will see that what I draw here as a frame
is not only the link level frame. At the beginning of each frame there is a
header called struct tpacket_hdr, used in PACKET_MMAP to hold the link level
frame's meta information, such as the timestamp. So what is drawn here as a
frame is really the following (from include/linux/if_packet.h):

/*
   Frame structure:

   - Start. Frame must be aligned to TPACKET_ALIGNMENT=16
   - struct tpacket_hdr
   - pad to TPACKET_ALIGNMENT=16
   - struct sockaddr_ll
   - Gap, chosen so that packet data (Start+tp_net) aligns to
     TPACKET_ALIGNMENT=16
   - Start+tp_mac: [ Optional MAC header ]
   - Start+tp_net: Packet data, aligned to TPACKET_ALIGNMENT=16.
   - Pad to align to TPACKET_ALIGNMENT=16
 */

The following conditions are checked in packet_set_ring:

    tp_block_size must be a multiple of PAGE_SIZE (1)
    tp_frame_size must be greater than TPACKET_HDRLEN (obvious)
    tp_frame_size must be a multiple of TPACKET_ALIGNMENT
    tp_frame_nr   must be exactly frames_per_block*tp_block_nr

Note that tp_block_size should be chosen to be a power of two or there will
be a waste of memory.

--------------------------------------------------------------------------------
+ Mapping and use of the circular buffer (ring)
--------------------------------------------------------------------------------

The mapping of the buffer in the user process is done with the conventional
mmap function. Even though the circular buffer is composed of several
physically discontiguous blocks of memory, they are contiguous in user space,
hence just one call to mmap is needed:

    mmap(0, size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);

If tp_frame_size is a divisor of tp_block_size, frames will be
contiguously spaced by tp_frame_size bytes. If not, every
tp_block_size/tp_frame_size frames there will be a gap between
the frames. This is because a frame cannot span two blocks.

At the beginning of each frame there is a status field (see
struct tpacket_hdr). If this field is 0, the frame is ready
to be used by the kernel. If not, there is a frame the user can read
and the following flags apply:

    from include/linux/if_packet.h

    #define TP_STATUS_COPY          2
    #define TP_STATUS_LOSING        4
    #define TP_STATUS_CSUMNOTREADY  8

TP_STATUS_COPY        : This flag indicates that the frame (and associated
                        meta information) has been truncated because it's
                        larger than tp_frame_size. This packet can be
                        read entirely with recvfrom().

                        In order to make this work it must be
                        enabled previously with setsockopt() and
                        the PACKET_COPY_THRESH option.

                        The number of frames that can be buffered to
                        be read with recvfrom is limited like a normal socket.
                        See the SO_RCVBUF option in the socket (7) man page.

TP_STATUS_LOSING      : indicates there were packet drops since the last time
                        statistics were checked with getsockopt() and
                        the PACKET_STATISTICS option.

TP_STATUS_CSUMNOTREADY: currently used for outgoing IP packets whose
                        checksum will be computed in hardware. So while
                        reading the packet we should not try to check the
                        checksum.

For convenience there are also the following defines:

    #define TP_STATUS_KERNEL        0
    #define TP_STATUS_USER          1

The kernel initializes all frames to TP_STATUS_KERNEL. When the kernel
receives a packet it puts it in the buffer and updates the status with
at least the TP_STATUS_USER flag. Then the user can read the packet;
once the packet is read the user must zero the status field, so the kernel
can use that frame buffer again.

The user can use poll (any other variant should apply too) to check if new
packets are in the ring:

    struct pollfd pfd;

    pfd.fd = fd;
    pfd.revents = 0;
    pfd.events = POLLIN|POLLRDNORM|POLLERR;

    if (status == TP_STATUS_KERNEL)
        retval = poll(&pfd, 1, timeout);

Checking the status value first and then polling for frames does not
incur a race condition.

--------------------------------------------------------------------------------
+ THANKS
--------------------------------------------------------------------------------

    Jesse Brandeburg, for fixing my grammatical/spelling errors