Network Drivers

The role of a network interface within the system is similar to that of a mounted block device. A block device registers its features in the blk_dev array and other kernel structures, and it then "transmits" and "receives" blocks on request, by means of its request function. Similarly, a network interface must register itself in specific data structures in order to be invoked when packets are exchanged with the outside world.

There are a few important differences between mounted disks and packet-delivery interfaces. To begin with, a disk exists as a special file in the /dev directory, whereas a network interface has no such entry point. The normal file operations (read, write, and so on) do not make sense when applied to network interfaces, so it is not possible to apply the Unix "everything is a file" approach to them. Thus, network interfaces exist in their own namespace and export a different set of operations.

Although you may object that applications use the read and write system calls when using sockets, those calls act on a software object that is distinct from the interface. Several hundred sockets can be multiplexed on the same physical interface.

But the most important difference between the two is that block drivers operate only in response to requests from the kernel, whereas network drivers receive packets asynchronously from the outside. Thus, while a block driver is asked to send a buffer toward the kernel, the network device asks to push incoming packets toward the kernel. The kernel interface for network drivers is designed for this different mode of operation.

Network drivers also have to be prepared to support a number of administrative tasks, such as setting addresses, modifying transmission parameters, and maintaining traffic and error statistics. The API for network drivers reflects this need, and thus looks somewhat different from the interfaces we have seen so far.

The network subsystem of the Linux kernel is designed to be completely protocol independent. This applies to both networking protocols (IP versus IPX or other protocols) and hardware protocols (Ethernet versus token ring, etc.). Interaction between a network driver and the kernel proper deals with one network packet at a time; this allows protocol issues to be hidden neatly from the driver and the physical transmission to be hidden from the protocol.

When a driver module is loaded into a running kernel, it requests resources and offers facilities; there's nothing new in that. And there's also nothing new in the way resources are requested. The driver should probe for its device and its hardware location (I/O ports and IRQ line). The way a network driver is registered by its module initialization function is different from char and block drivers. Since there is no equivalent of major and minor numbers for network interfaces, a network driver does not request such a number. Instead, the driver inserts a data structure for each newly detected interface into a global list of network devices.

Each interface is described by a struct net_device item.

Details of structure net_device

The first struct net_device field we will look at is name, which holds the interface name (the string identifying the interface). The driver can hardwire a name for the interface or it can allow dynamic assignment, which works like this: if the name contains a %d format string, the first available name found by replacing that string with a small integer is used.
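The dynamic assignment rule can be illustrated with a minimal userspace sketch. This is not the kernel's implementation: the function pick_ifname, the helper name_in_use, the table of taken names, and the MAX_UNITS bound are all hypothetical and exist only to show how "the first available name" is found by substituting small integers, starting at zero, for %d.

```c
#include <stdio.h>
#include <string.h>

#define IFNAMSIZ  16   /* size of the name field, as in the kernel headers */
#define MAX_UNITS 100  /* hypothetical upper bound on the unit number */

/* Returns 1 if `name` already appears in the table of taken names. */
static int name_in_use(const char *name, const char **taken, int ntaken)
{
    for (int i = 0; i < ntaken; i++)
        if (strcmp(taken[i], name) == 0)
            return 1;
    return 0;
}

/*
 * Expand a template such as "eth%d" into the first free name.
 * Returns the unit number substituted (0 for a hardwired name),
 * or -1 if no name is available.
 */
int pick_ifname(const char *tmpl, const char **taken, int ntaken,
                char out[IFNAMSIZ])
{
    const char *pct = strstr(tmpl, "%d");

    if (pct == NULL) {                      /* hardwired name: use as is */
        if (name_in_use(tmpl, taken, ntaken))
            return -1;
        snprintf(out, IFNAMSIZ, "%s", tmpl);
        return 0;
    }

    int prefix_len = (int)(pct - tmpl);     /* text before the %d */
    for (int unit = 0; unit < MAX_UNITS; unit++) {
        /* prefix + unit number + whatever follows the %d */
        snprintf(out, IFNAMSIZ, "%.*s%d%s", prefix_len, tmpl, unit, pct + 2);
        if (!name_in_use(out, taken, ntaken))
            return unit;
    }
    return -1;
}
```

With eth0, eth1, and lo already taken, a template of "eth%d" yields eth2 in this sketch, while a template of "sn%d" yields sn0.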

The net_device structure is at the very core of the network driver layer and deserves a complete description. At a first reading, however, you can skip this section, because you don't need a thorough understanding of the structure to get started. This list describes all the fields, but more to provide a reference than to be memorized. The rest of this chapter briefly describes each field as soon as it is used in the sample code, so you don't need to keep referring back to this section.

struct net_device can be conceptually divided into two parts: visible and invisible. The visible part of the structure is made up of the fields that can be explicitly assigned in static net_device structures. All structures in drivers/net/Space.c are initialized in this way, without using the tagged syntax for structure initialization. The remaining fields are used internally by the network code and usually are not initialized at compilation time, not even by tagged initialization. Some of the fields are accessed by drivers (for example, the ones that are assigned at initialization time), while some shouldn't be touched.

The Visible Head

The first part of struct net_device is composed of the following fields, in this order:

char name[IFNAMSIZ]; 

The name of the device. If the name contains a %d format string, the first available device name with the given base is used; assigned numbers start at zero.

unsigned long rmem_end; 
unsigned long rmem_start; 
unsigned long mem_end; 
unsigned long mem_start; 

Device memory information. These fields hold the beginning and ending addresses of the shared memory used by the device. If the device has different receive and transmit memories, the mem fields are used for transmit memory and the rmem fields for receive memory. mem_start and mem_end can be specified on the kernel command line at system boot, and their values are retrieved by ifconfig. The rmem fields are never referenced outside of the driver itself. By convention, the end fields are set so that end - start is the amount of available on-board memory.

unsigned long base_addr; 

The I/O base address of the network interface. This field, like the previous ones, is assigned during device probe. The ifconfig command can be used to display or modify the current value. The base_addr can be explicitly assigned on the kernel command line at system boot or at load time. The field is not used by the kernel, like the memory fields shown previously.

unsigned char irq; 

The assigned interrupt number. The value of dev->irq is printed by ifconfig when interfaces are listed. This value can usually be set at boot or load time and modified later using ifconfig.

unsigned char if_port; 

Which port is in use on multiport devices. This field is used, for example, with devices that support both coaxial (IF_PORT_10BASE2) and twisted-pair (IF_PORT_10BASET) Ethernet connections. The full set of known port types is defined in <linux/netdevice.h>.

unsigned char dma; 

The DMA channel allocated by the device. The field makes sense only with some peripheral buses, like ISA. It is not used outside of the device driver itself, except for informational purposes (in ifconfig).

unsigned long state; 

Device state. The field includes several flags. Drivers do not normally manipulate these flags directly; instead, a set of utility functions has been provided. These functions will be discussed shortly when we get into driver operations.

struct net_device *next; 

Pointer to the next device in the global linked list. This field shouldn't be touched by the driver.

int (*init)(struct net_device *dev); 

The device methods

As happens with the char and block drivers, each network device declares the functions that act on it. Operations that can be performed on network interfaces are listed in this section. Some of the operations can be left NULL, and some are usually untouched because ether_setup assigns suitable methods to them.

Device methods for a network interface can be divided into two groups: fundamental and optional. Fundamental methods include those that are needed to be able to use the interface; optional methods implement more advanced functionalities that are not strictly required. The following are the fundamental methods:

int (*open)(struct net_device *dev); 

Opens the interface. The interface is opened whenever ifconfig activates it. The open method should register any system resource it needs (I/O ports, IRQ, DMA, etc.), turn on the hardware, and increment the module usage count.

int (*stop)(struct net_device *dev); 

Stops the interface. The interface is stopped when it is brought down; operations performed at open time should be reversed.

int (*hard_start_xmit) (struct sk_buff *skb, struct net_device *dev);

This method initiates the transmission of a packet. The full packet (protocol headers and all) is contained in a socket buffer (sk_buff) structure.
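The way the three fundamental methods fit together can be shown with a userspace model. Everything here is an assumption made for illustration: struct demo_dev and struct demo_skb are drastically reduced stand-ins for struct net_device and struct sk_buff, and the demo_ functions and counters are hypothetical. Only the three method signatures follow the prototypes listed above.

```c
#include <stddef.h>

/* Reduced stand-in for struct sk_buff: just the packet and its length. */
struct demo_skb {
    const unsigned char *data;
    unsigned int len;
};

/* Reduced stand-in for struct net_device: state, counters, methods. */
struct demo_dev {
    int up;                       /* nonzero while the interface is open */
    unsigned long tx_packets;     /* illustrative traffic statistics */
    unsigned long tx_bytes;
    int (*open)(struct demo_dev *dev);
    int (*stop)(struct demo_dev *dev);
    int (*hard_start_xmit)(struct demo_skb *skb, struct demo_dev *dev);
};

/* open: a real driver would claim I/O ports, IRQ, DMA and start the hardware. */
static int demo_open(struct demo_dev *dev)
{
    dev->up = 1;
    return 0;
}

/* stop: reverse whatever open did. */
static int demo_stop(struct demo_dev *dev)
{
    dev->up = 0;
    return 0;
}

/* hard_start_xmit: "transmit" by accounting for the packet. */
static int demo_xmit(struct demo_skb *skb, struct demo_dev *dev)
{
    if (!dev->up)
        return -1;                /* interface is down: refuse the packet */
    dev->tx_packets++;
    dev->tx_bytes += skb->len;
    return 0;
}

/* Fill in the method pointers, as a driver does at initialization. */
void demo_setup(struct demo_dev *dev)
{
    dev->up = 0;
    dev->tx_packets = 0;
    dev->tx_bytes = 0;
    dev->open = demo_open;
    dev->stop = demo_stop;
    dev->hard_start_xmit = demo_xmit;
}
```

The caller never names demo_open or demo_xmit directly; it goes through the pointers in the device structure, which is what lets the kernel drive any interface through the same set of operations.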

Socket Buffer

Whenever the kernel needs to transmit a data packet, it calls the hard_start_xmit method to put the data on an outgoing queue. Each packet handled by the kernel is contained in a socket buffer structure (struct sk_buff), whose definition is found in <linux/skbuff.h>. The structure gets its name from the Unix abstraction used to represent a network connection, the socket. Even if the interface has nothing to do with sockets, each network packet belongs to a socket in the higher network layers, and the input/output buffers of any socket are lists of struct sk_buff structures. The same sk_buff structure is used to host network data throughout all the Linux network subsystems, but a socket buffer is just a packet as far as the interface is concerned.

The socket buffer is a complex structure, and the kernel offers a number of functions to act on it. The functions are described later in "The Socket Buffers"; for now a few basic facts about sk_buff are enough for us to write a working driver.

The socket buffer passed to hard_start_xmit contains the physical packet as it should appear on the media, complete with the transmission-level headers. The interface doesn't need to modify the data being transmitted. skb->data points to the packet being transmitted, and skb->len is its length, in octets.

The Important Fields

The fields introduced here are the ones a driver might need to access. They are listed in no particular order.

struct net_device *rx_dev; 
struct net_device *dev; 

The devices receiving and sending this buffer, respectively.

union { /* ... */ } h; 
union { /* ... */ } nh; 
union { /* ... */ } mac; 

Pointers to the various levels of headers contained within the packet. Each field of the unions is a pointer to a different type of data structure. h hosts pointers to transport layer headers (for example, struct tcphdr *th); nh includes network layer headers (such as struct iphdr *iph); and mac collects pointers to link layer headers (such as struct ethhdr *ethernet).

If your driver needs to look at the source and destination ports of a TCP packet, it can find them in skb->h.th. See the header file for the full set of header types that can be accessed in this way. Note that network drivers are responsible for setting the mac pointer for incoming packets. This task is normally handled by eth_type_trans, but non-Ethernet drivers will have to set skb->mac.raw directly, as shown later in "Non-Ethernet Headers".

unsigned char *head; 
unsigned char *data; 
unsigned char *tail; 
unsigned char *end; 

Pointers used to address the data in the packet. head points to the beginning of the allocated space, data is the beginning of the valid octets (and is usually slightly greater than head), tail is the end of the valid octets, and end points to the maximum address tail can reach. Another way to look at it is that the available buffer space is skb->end - skb->head, and the currently used data space is skb->tail - skb->data.

unsigned long len;

The length of the data itself (skb->tail - skb->data).

unsigned char ip_summed; 

The checksum policy for this packet. The field is set by the driver on incoming packets, as was described in "Packet Reception".

unsigned char pkt_type; 

Packet classification used in delivering it. The driver is responsible for setting it to PACKET_HOST (this packet is for me), PACKET_BROADCAST, PACKET_MULTICAST, or PACKET_OTHERHOST (no, this packet is not for me). Ethernet drivers don't modify pkt_type explicitly because eth_type_trans does it for them.

The remaining fields in the structure are not particularly interesting. They are used to maintain lists of buffers, to account for memory belonging to the socket that owns the buffer, and so on.
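To make the pkt_type values more concrete, here is a hedged userspace sketch of the kind of decision an Ethernet driver relies on eth_type_trans to make. It is not the kernel's code: the enum model_pkt_type and the function classify_dest are hypothetical names, and the sketch looks only at the destination hardware address, ignoring details such as promiscuous mode.

```c
#include <string.h>

#define ETH_ALEN 6  /* octets in an Ethernet hardware address */

/* Local stand-ins for the kernel's PACKET_* classification values. */
enum model_pkt_type {
    MODEL_PACKET_HOST,       /* this packet is for me */
    MODEL_PACKET_BROADCAST,  /* sent to every station */
    MODEL_PACKET_MULTICAST,  /* sent to a group of stations */
    MODEL_PACKET_OTHERHOST   /* no, this packet is not for me */
};

/* Classify a frame by comparing its destination with the device address. */
enum model_pkt_type classify_dest(const unsigned char dest[ETH_ALEN],
                                  const unsigned char dev_addr[ETH_ALEN])
{
    static const unsigned char bcast[ETH_ALEN] =
        { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff };

    if (memcmp(dest, bcast, ETH_ALEN) == 0)
        return MODEL_PACKET_BROADCAST;
    if (dest[0] & 0x01)          /* group bit set in the first octet */
        return MODEL_PACKET_MULTICAST;
    if (memcmp(dest, dev_addr, ETH_ALEN) == 0)
        return MODEL_PACKET_HOST;
    return MODEL_PACKET_OTHERHOST;
}
```

The broadcast test comes first because the all-ones address also has the group bit set and would otherwise be reported as multicast.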

Functions Acting on Socket Buffers

Network devices that use an sk_buff act on the structure by means of the official interface functions. Many functions operate on socket buffers; here are the most interesting ones:

struct sk_buff *alloc_skb(unsigned int len, int priority); 
struct sk_buff *dev_alloc_skb(unsigned int len); 

Allocate a buffer. The alloc_skb function allocates a buffer and initializes both skb->data and skb->tail to skb->head. The dev_alloc_skb function is a shortcut that calls alloc_skb with GFP_ATOMIC priority and reserves some space between skb->head and skb->data. This data space is used for optimizations within the network layer and should not be touched by the driver.

void kfree_skb(struct sk_buff *skb); 
void dev_kfree_skb(struct sk_buff *skb); 

Free a buffer. The kfree_skb call is used internally by the kernel. A driver should use dev_kfree_skb instead, which is intended to be safe to call from driver context.

unsigned char *skb_put(struct sk_buff *skb, int len); 
unsigned char *__skb_put(struct sk_buff *skb, int len); 

These inline functions update the tail and len fields of the sk_buff structure; they are used to add data to the end of the buffer. Each function's return value is the previous value of skb->tail (in other words, it points to the data space just created). Drivers can use the return value to copy data by invoking ins(ioaddr, skb_put(...)) or memcpy(skb_put(...), data, len). The difference between the two functions is that skb_put checks to be sure that the data will fit in the buffer, whereas __skb_put omits the check.

unsigned char *skb_push(struct sk_buff *skb, int len); 
unsigned char *__skb_push(struct sk_buff *skb, int len); 

These functions decrement skb->data and increment skb->len. They are similar to skb_put, except that data is added to the beginning of the packet instead of the end. The return value points to the data space just created. The functions are used to add a hardware header before transmitting a packet. Once again, __skb_push differs in that it does not check for adequate available space.

int skb_tailroom(struct sk_buff *skb); 

This function returns the amount of space available for putting data in the buffer. If a driver puts more data into the buffer than it can hold, the system panics. Although you might object that a printk would be sufficient to tag the error, memory corruption is so harmful to the system that the developers decided to take definitive action. In practice, you shouldn't need to check the available space if the buffer has been correctly allocated. Since drivers usually get the packet size before allocating a buffer, only a severely broken driver will put too much data in the buffer, and a panic might be seen as due punishment.

int skb_headroom(struct sk_buff *skb); 

Returns the amount of space available in front of data, that is, how many octets one can "push" to the buffer.

void skb_reserve(struct sk_buff *skb, int len); 

This function increments both data and tail. The function can be used to reserve headroom before filling the buffer. Most Ethernet interfaces reserve 2 bytes in front of the packet; thus, the IP header is aligned on a 16-byte boundary, after a 14-byte Ethernet header. snull does this as well, although the instruction was not shown in "Packet Reception" to avoid introducing extra concepts at that point.

unsigned char *skb_pull(struct sk_buff *skb, int len); 

Removes data from the head of the packet. The driver won't need to use this function, but it is included here for completeness. It decrements skb->len and increments skb->data; this is how the hardware header (Ethernet or equivalent) is stripped from the beginning of incoming packets.

The kernel defines several other functions that act on socket buffers, but they are meant to be used in higher layers of networking code, and the driver won't need them.
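The pointer arithmetic behind these functions is easier to follow in a small userspace model. The sketch below is an assumption-laden imitation, not the kernel code: struct model_skb keeps only the head, data, tail, end, and len fields, every model_ function is a hypothetical counterpart of the kernel function with the similar name, malloc replaces the kernel allocator, and abort() stands in for the panic that follows an overrun.

```c
#include <stdlib.h>

/* Reduced model of struct sk_buff: only the four pointers and the length. */
struct model_skb {
    unsigned char *head, *data, *tail, *end;
    unsigned long len;
};

/* Allocate a buffer with data and tail both starting at head. */
struct model_skb *model_alloc_skb(unsigned int size)
{
    struct model_skb *skb = malloc(sizeof(*skb));
    if (skb == NULL)
        return NULL;
    skb->head = malloc(size ? size : 1);
    if (skb->head == NULL) {
        free(skb);
        return NULL;
    }
    skb->data = skb->tail = skb->head;
    skb->end = skb->head + size;
    skb->len = 0;
    return skb;
}

void model_kfree_skb(struct model_skb *skb)
{
    if (skb != NULL) {
        free(skb->head);
        free(skb);
    }
}

int model_skb_headroom(const struct model_skb *skb)
{
    return (int)(skb->data - skb->head);
}

int model_skb_tailroom(const struct model_skb *skb)
{
    return (int)(skb->end - skb->tail);
}

/* Reserve headroom: move data and tail forward together. */
void model_skb_reserve(struct model_skb *skb, int len)
{
    skb->data += len;
    skb->tail += len;
}

/* Append len octets; returns the previous tail (start of the new space). */
unsigned char *model_skb_put(struct model_skb *skb, int len)
{
    unsigned char *old_tail = skb->tail;
    if (len > model_skb_tailroom(skb))
        abort();                 /* stands in for the kernel panic */
    skb->tail += len;
    skb->len += len;
    return old_tail;
}

/* Prepend len octets; returns the new data pointer. */
unsigned char *model_skb_push(struct model_skb *skb, int len)
{
    if (len > model_skb_headroom(skb))
        abort();
    skb->data -= len;
    skb->len += len;
    return skb->data;
}

/* Strip len octets from the front; returns the new data pointer. */
unsigned char *model_skb_pull(struct model_skb *skb, int len)
{
    if ((unsigned long)len > skb->len)
        return NULL;
    skb->data += len;
    skb->len -= len;
    return skb->data;
}
```

In this model, reserving 2 octets and then putting a 14-octet Ethernet header places the start of the IP header 16 octets past head, which is the alignment trick described for skb_reserve. At every step len equals tail - data, and headroom plus len plus tailroom equals end - head.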

Installing an Interrupt Handler

If you want to actually "see" interrupts being generated, writing to the hardware device isn't enough; a software handler must be configured in the system. If the Linux kernel hasn't been told to expect your interrupt, it will simply acknowledge and ignore it.

Interrupt lines are a precious and often limited resource, particularly when there are only 15 or 16 of them. The kernel keeps a registry of interrupt lines, similar to the registry of I/O ports. A module is expected to request an interrupt channel (or IRQ, for interrupt request) before using it, and to release it when it's done. In many situations, modules are also expected to be able to share interrupt lines with other drivers, as we will see. The following functions, declared in <linux/sched.h>, implement the interface:

int request_irq(unsigned int irq,
                void (*handler)(int, void *, struct pt_regs *),
                unsigned long flags,
                const char *dev_name,
                void *dev_id);

void free_irq(unsigned int irq, void *dev_id);

The value returned from request_irq to the requesting function is either 0 to indicate success or a negative error code, as usual. It's not uncommon for the function to return -EBUSY to signal that another driver is already using the requested interrupt line. The arguments to the functions are as follows:

unsigned int irq 

This is the interrupt number being requested.

void (*handler)(int, void *, struct pt_regs *) 

The pointer to the handling function being installed. We'll discuss the arguments to this function later in this chapter.

unsigned long flags 

As you might expect, a bit mask of options (described later) related to interrupt management.

const char *dev_name 

The string passed to request_irq is used in /proc/interrupts to show the owner of the interrupt.

void *dev_id 

This pointer is used for shared interrupt lines. It is a unique identifier that is used when the interrupt line is freed and that may also be used by the driver to point to its own private data area (to identify which device is interrupting). When no sharing is in force, dev_id can be set to NULL, but it is a good idea anyway to use this item to point to the device structure. We'll see a practical use for dev_id in "Implementing a Handler", later in this chapter.

The bits that can be set in flags are as follows:

SA_INTERRUPT
SA_SHIRQ
SA_SAMPLE_RANDOM

Of these, the SA_SAMPLE_RANDOM bit indicates that the generated interrupts can contribute to the entropy pool used by /dev/random and /dev/urandom. These devices return truly random numbers when read and are designed to help application software choose secure keys for encryption. Such random numbers are extracted from an entropy pool that is contributed by various random events. If your device generates interrupts at truly random times, you should set this flag. If, on the other hand, your interrupts will be predictable (for example, vertical blanking of a frame grabber), the flag is not worth setting; it wouldn't contribute to system entropy anyway. Devices that could be influenced by attackers should not set this flag; for example, network drivers can be subjected to predictable packet timing from outside and should not contribute to the entropy pool. See the comments in drivers/char/random.c for more information.
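The registry behavior described for request_irq and free_irq can be mimicked in userspace. The sketch below is only a model under stated assumptions: all model_ names are hypothetical, the flags argument and the struct pt_regs parameter are left out, sharing is not modeled at all, and MODEL_NR_IRQS is an arbitrary table size. It shows the -EBUSY result for an occupied line, the use of dev_id as a pointer to private data, and the fact that an interrupt nobody has requested is simply ignored.

```c
#include <errno.h>
#include <stddef.h>

#define MODEL_NR_IRQS 16  /* arbitrary number of lines in this model */

/* One registry entry per interrupt line. */
struct model_irq_slot {
    void (*handler)(int irq, void *dev_id);
    const char *dev_name;   /* owner name, as /proc/interrupts would show */
    void *dev_id;           /* private data handed back to the handler */
};

static struct model_irq_slot model_irqs[MODEL_NR_IRQS];

/* Claim a line: 0 on success, negative error code otherwise. */
int model_request_irq(unsigned int irq, void (*handler)(int, void *),
                      const char *dev_name, void *dev_id)
{
    if (irq >= MODEL_NR_IRQS || handler == NULL)
        return -EINVAL;
    if (model_irqs[irq].handler != NULL)
        return -EBUSY;          /* another driver already owns the line */
    model_irqs[irq].handler = handler;
    model_irqs[irq].dev_name = dev_name;
    model_irqs[irq].dev_id = dev_id;
    return 0;
}

/* Release a line; dev_id must match the one given at request time. */
void model_free_irq(unsigned int irq, void *dev_id)
{
    if (irq >= MODEL_NR_IRQS)
        return;
    if (model_irqs[irq].handler != NULL && model_irqs[irq].dev_id == dev_id) {
        model_irqs[irq].handler = NULL;
        model_irqs[irq].dev_name = NULL;
        model_irqs[irq].dev_id = NULL;
    }
}

/* Name of the current owner of a line, or NULL if the line is free. */
const char *model_irq_owner(unsigned int irq)
{
    return irq < MODEL_NR_IRQS ? model_irqs[irq].dev_name : NULL;
}

/* Simulate an interrupt: returns 1 if a handler ran, 0 if it was ignored. */
int model_raise_irq(unsigned int irq)
{
    if (irq >= MODEL_NR_IRQS || model_irqs[irq].handler == NULL)
        return 0;
    model_irqs[irq].handler((int)irq, model_irqs[irq].dev_id);
    return 1;
}

/* Sample handler: treats dev_id as a pointer to an int counter. */
void model_count_handler(int irq, void *dev_id)
{
    (void)irq;
    (*(int *)dev_id)++;
}
```

A second request for a line that is already owned fails with -EBUSY until the first owner calls model_free_irq with the same dev_id it registered.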

The interrupt handler can be installed either at driver initialization or when the device is first opened. Although installing the interrupt handler from within the module's initialization function might sound like a good idea, it actually isn't. Because the number of interrupt lines is limited, you don't want to waste them. You can easily end up with more devices in your computer than there are interrupts. If a module requests an IRQ at initialization, it prevents any other driver from using the interrupt, even if the device holding it is never used. Requesting the interrupt at device open, on the other hand, allows some sharing of resources.

It is possible, for example, to run a frame grabber on the same interrupt as a modem, as long as you don't use the two devices at the same time. It is quite common for users to load the module for a special device at system boot, even if the device is rarely used. A data acquisition gadget might use the same interrupt as the second serial port. While it's not too hard to avoid connecting to your Internet service provider (ISP) during data acquisition, being forced to unload a module in order to use the modem is really unpleasant.

The correct place to call request_irq is when the device is first opened, before the hardware is instructed to generate interrupts. The place to call free_irq is the last time the device is closed, after the hardware is told not to interrupt the processor any more. The disadvantage of this technique is that you need to keep a per-device open count. Using the module count isn't enough if you control two or more devices from the same module.
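The per-device open count can be sketched as follows. This is a userspace illustration under assumptions, not driver code: struct model_port, model_port_open, and model_port_release are hypothetical names, and the two stub_ functions merely record whether the line is held where a real driver would call request_irq and free_irq.

```c
/* Hypothetical per-device state; a module may manage several of these. */
struct model_port {
    unsigned int irq;   /* interrupt line used by this device */
    int open_count;     /* how many times the device is currently open */
    int irq_held;       /* 1 while the (simulated) IRQ is requested */
};

/* Stub standing in for request_irq; always succeeds in this model. */
static int stub_request_irq(struct model_port *p)
{
    p->irq_held = 1;
    return 0;
}

/* Stub standing in for free_irq. */
static void stub_free_irq(struct model_port *p)
{
    p->irq_held = 0;
}

/* Open: request the interrupt only on the first open. */
int model_port_open(struct model_port *p)
{
    if (p->open_count == 0) {
        int err = stub_request_irq(p);
        if (err)
            return err;
        /* a real driver would now tell the hardware to generate interrupts */
    }
    p->open_count++;
    return 0;
}

/* Release: free the interrupt only on the last close. */
void model_port_release(struct model_port *p)
{
    if (p->open_count == 0)
        return;                 /* unbalanced release: nothing to undo */
    if (--p->open_count == 0) {
        /* a real driver would first tell the hardware to stop interrupting */
        stub_free_irq(p);
    }
}
```

Because the count lives in the per-device structure rather than in the module, two devices driven by the same module hold and release their interrupt lines independently.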

This discussion notwithstanding, short requests its interrupt line at load time. This was done so that you can run the test programs without having to run an extra process to keep the device open. short, therefore, requests the interrupt from within its initialization function (short_init) instead of doing it in short_open, as a real device driver would.