Alessandro Rubini, _Linux Device Drivers_

XXX = could be interesting

* Registering symbols
  - p23. You need to avoid namespace pollution. Either:
    + All global symbols are assumed globally visible (default)
    + You can call register_symtab(table) to register only some symbols
    + You can call register_symtab(NULL) to make all symbols not visible
  - p384. Better macros for this stuff in newer kernels.

* Registering services
  - p26. If you have errors in initialization (e.g., no more memory), you
    need to back out of registering your services. See sample code.
  - p28. cleanup_module must unregister all services.
  - p29. Same idea with ports: need to register usage.
  - p236. Block drivers use generic block_read, _write, _fsync functions.
    Also need to add extra info to blk_dev global data structure. Also
    need to set blk_size, blksize_size, hardsect_size, and read_ahead --
    yuck!
  - p295. Need to request DMA usage, too. (And IRQ...)
  - p306. Network driver module detects interfaces and then adds them via
    register_netdev. When you register, it calls the init field.
  - p307. Network driver can specify names as null/blank, in which case
    they are automatically assigned ethn.
  - p308. Init should (1) use dev->base_addr as I/O address if given,
    (2) probe if dev->base_addr is 0 (usually not done -- bad for ISA),
    or (3) ignore invalid dev->base_addr and don't probe.
  - p309. Standard to add to private network data a struct
    enet_statistics to hold statistics to show for ifconfig. (Note, p400:
    struct net_device_stats takes its place.)
  - p331. Network devices expected to take io= and irq= arguments.
  - p385. Module params defined with MODULE_PARM macro. Does some type
    checking! Type description string in linux/module.h.
  - p386. Can have module unloaded when function predicate says, rather
    than by usage count.

* Implementation requirements
  - p237. blksize_size must be a power of 2.
  - p243. Block request function begins with call to INIT_REQUEST. Ends
    with call to end_request with either 1 (success) or 0 (failure).
    Then loops forever (INIT_REQUEST may return). From
    linux/include/blk.h: ``All functions called within end_request()
    _must_be_ atomic.''
  - p247. Protocol for doing clustered I/O.
  - p248. When device is mounted, open is called; everything but f_mode
    is random and should be ignored. Umount = release (close); filp will
    be null.
  - p250. ioctl block device commands. Pattern for a lot of these is the
    same, e.g., code for BLKROSET and GET is device-independent.
  - p254. Should call check_disk_change in open.
  - p318 (and others). Some fields of data structures are not used by
    driver, but by kernel.
  - p327. dev_alloc_skb calls alloc_skb but allocates 16 bytes extra at
    the front, ``for hardware headers'' (?).
  - p338-339. ``[E]ven interfaces that can't deal with multicast packets
    need to implement the set_multicast_list method to be notified about
    changes in the dev->flags.'' Needs to deal with IFF_PROMISC so that
    tcpdump can work.
  - p400. current pointer no longer global variable. (But still exists.)
    See asm/current.h.

* Usage counts
  - p27. Need to maintain a reference count of number of module users,
    using some macros: MOD_INC_USE_COUNT, MOD_DEC_USE_COUNT, MOD_IN_USE
  - p54. Incremented by open (usually)
  - p55. Decremented by release (usually)

*XXX Race conditions
  - p32. For reads, need to mark stuff as volatile. For writes, need to
    mask off interrupts with the sequence

        unsigned long flags;
        save_flags(flags);
        cli();
        /* critical code */
        restore_flags(flags);

    Maybe can catch writes to volatile stuff and ensure it's within such
    a region? Problems: Many basic drivers don't mark stuff as volatile;
    but the sound and video drivers seem to. Look for occurrences of
    ``save_flags'' to see where stuff really is volatile.
  - p354. Sometimes use just cli/sti instead of save/restore_flags
    because interrupts known to be enabled.
  - p62. Calls to copy to/from user from/to kernel space may sleep
    because they can trigger a page fault. Thus can be run
    simultaneously with any other driver funcs.
  - p204.
    Three ways to handle locking. (1) Use a circular buffer (huh?),
    (2) disable interrupts, (3) locks.
  - p204. Can mark variables as volatile; OR (??) can disable interrupts
    around them.
  - p205. Need to disable interrupts when interrupt handler may modify
    data and other process touches data.
  - p206. Can lock by using atomic bit operations. (Standard semaphores.)
  - p207. Can access atomic_t () by certain atomic operations. (Kernel
    source wraps the int in a struct to give compiler warnings.)
  - p209. If test condition and then sleep must happen atomically, need
    to disable interrupts. Why the sti() on return from
    interruptible_sleep_on()? Elaborate version of atomic cond/sleep
    that Linus likes.
  - p324. Set dev->interrupt when interrupt handler invoked. Why don't
    you need to use a semaphore here?
  - p327. Drivers use dev_kfree_skb to deallocate an skb, because it
    handles locking correctly.
  - p350. Need to cli/sti around PCI config space writes. (How does this
    code work, anyhow? Seems wrong.)
  - http://www.samba.org/netfilter/unreliable-guides/kernel-locking/sleeping-things.html:

      You can never call the following routines while holding a
      spinlock, as they may sleep. This also means you need to be in
      user context.

        Accesses to userspace:
          copy_from_user()
          copy_to_user()
          get_user()
          put_user()
        kmalloc(GFP_KERNEL)
        down_interruptible() and down()

      There is a down_trylock() which can be used inside interrupt
      context, as it will not sleep. up() will also never sleep.

      printk() can be called in any context, interestingly enough.

  - Also see
    http://www.samba.org/netfilter/unreliable-guides/kernel-locking/racing-timers.html
    for a trick with locking timers.
  - Also see Documentation/spinlocks.txt. Explains other versions of
    locking. Summary: Three kinds of locking, cli+save_flags/sti,
    spin_lock/unlock, and rw

*XXX Int sizes
  - p44. Some things require specific sized arguments; e.g., major number
    is 8 bits, as is minor number. (Seems to be false in 2.2.14)
  - p216.
    Pointers in kernel are unsigned longs, not pointer types! (Probably
    intptr_t now.)
  - p217. Claims if you use interface types then you're safe. *Not
    true*! Compiler doesn't seem to warn you of anything because of
    implicit casts.
  - p314. Some of the bits in struct device.flags are actually read-only!
  - p344. Byte order on PCI boards little-endian (fixed).
  - p350. Access configuration space through funcalls. Note that 64-bytes
    maximum, even though int argument for addr.
  - p397. Kernel has some macros for manipulating endianness.

*XXX Module numbers
  - p47. Need to request a module number in init and free it in cleanup
    (especially important).
  - p48. kdev_t is a type that can only be manipulated in certain ways.
    See linux/kdev_t.h:

      However, for the time being we let kdev_t be almost the same
      as dev_t:

      typedef struct { unsigned short major, minor; } kdev_t;

      Admissible operations on an object of type kdev_t:
      - passing it along
      - comparing it for equality with another such object
      - storing it in ROOT_DEV, inode->i_dev, inode->i_rdev, sb->s_dev,
        bh->b_dev, req->rq_dev, de->dc_dev, tty->device
      - using its bit pattern as argument in a hash function
      - finding its major and minor
      - complaining about it

  - p54. Open identifies the minor number and updates f_op (if operations
    change depending on minor number)
  - p63-64. Read returns count of transferred bytes. Many cases.
  - p65. Similarly with write.

* Return values/arguments of driver functions
  - pp50-51. Positive for good answer, negative for error.
  - p63. count argument to read/write changed from int to unsigned long.
    Return value changed from int to long. Introduces dependency on
    kernel version, heavy use of macros.
  - p95. ioctl takes one optional argument. Meaning/type of argument
    depend on which command is issued.
  - p100. Default ioctl return should either be -EINVAL (sane) or -ENOTTY
    (backwards compatible).
  - p122. If lseeking isn't possible, should add lseek method that
    returns -ESPIPE.
    Default is to seek by setting f_pos.
  - p278. vm_ops fields for mmap defined to be NULL before f_op->mmap
    called (but should check, anyhow, probably).
  - p278-279. For mmap, defined open/close add functionality while
    defined swap* ops replace functions (!).
  - p333. snull disallows configuring running device. Why does kernel let
    those ops through? It seems to do a lot of other interception.
  - p334. dev->do_ioctl called when ioctl called with SIOCDEVPRIVATE
    through SIOCDEVPRIVATE+15 as ctl argument.

*XXX Memory access
  - p53. Can put anything in struct file.private_data, but if you malloc
    something you'd better free it when the release (i.e., close)
    function/method is called.
  - p100-101. If you access a user-space pointer, you need to make sure
    the memory is mapped, or you need to handle the exception
    (documented for newer kernel in later chapter). Use verify_area.
  - p102. Use macros put_user and get_user to copy data more efficiently
    to/from user space.
  - p106. Everything kmalloced needs to be kfreed, otherwise it sticks
    around forever.
  - p155. ``If you try to free a different number of pages than you
    allocated, the memory map will probably become corrupted.''
  - p172. Use macros to do I/O mem accesses (readb, writeb, etc); usually
    expand to dereferences.
  - p321. Looks like code copies too much if ETH_ZLEN too small?
  - p327. ``If a driver puts more data into the [skb] buffer than it can
    hold, the system panics.''
  - p391. Copy kernel-user memory with access_ok (to check), get_user
    (calls access_ok), __get_user (doesn't call access_ok), get_user_ret
    (returns retval if fails), put macros, copy_from_user (calls
    access_ok, returns # bytes not transferred), __copy_from_user (no
    call to access_ok), copy_to_user. See asm/uaccess.h.
  - p393. Impl of read doesn't call access_ok because dispatch already
    checked.
  - p399. New versions of Linux have exception maps in ELF object files
    to deal with exceptions caused by copying to/from user area.

* Blocking I/O
  - p105-107.
    Use a wait_queue; call interruptible_sleep_on to go to sleep, call
    wake_up_interruptible to wake up other processes (also wake_up and
    sleep_on possible). A bit confusing, since it's such a bogus example.
    wait_queues are stored in device data structure, one per dev in
    general. Note: blocking open possible; see fs/pipe.c. Nonblock flag
    is set by fcntl.

* Non-blocking I/O
  - p114-115. Select or poll system calls can be used to wait for a
    device to be ready. Use select_wait function in select to say things
    aren't ready yet. See pp389-390 for description of poll.
  - p263. For block device, end_request called by interrupt handler to
    indicate data transfer complete.
  - p264. Kernel doesn't call driver with new request if one is pending.
    Has to reinvoke function when request done (check whether CURRENT is
    null to tell), since broke out of loop. (Why is there a loop in the
    initial code?)

* Any I/O
  - p116. Rules for read/write/flush. Highlights: Read should always
    return if there's data, even if less than req. Write should never
    block, even if O_NONBLOCK clear; many apps call select to check
    whether write will block, and assume it won't if select says so.
    (This last thing about writing is a little confusing. What exactly
    does it mean?)

* Asynchronous operations
  - p119-121. Can make a device asynchronous -- alerts user when data
    comes in (sends signal). Use generic kernel data structures to
    implement this; provide fasync method. Also close needs to remove
    closed file from list of asynchronous notifiers. ``Asynchronous'' is
    a bad name, since non-blocking I/O is also asynchronous. A better
    name would be something like ``signaling I/O.''
  - p140. When things are executing asynchronously (e.g., tasks from a
    task queue), can't do certain things, e.g., there's no ``current''
    (pointer to current process). kmalloc(GFP_KERNEL) can't be called
    (if intr_count > 0, i.e., at ``interrupt time''; can call with
    atomic priority). Can't call schedule().

* Access control policies
  - p123-128.
    How do you deal with multiple opens? Choices: Only allow one open at
    once; allow only the user to reopen multiple times (keep ref count);
    clone the device (What is this used for? Something to do with
    mouse.)

* Task Queues
  - p139. Before queueing a task, next and sync fields must be 0.
  - p139. Three ways to queue task: queue_task, queue_task_irq,
    queue_task_irq_off. All do the same thing, but have different
    preconditions. Latter ones are faster but have more preconditions
    (e.g., interrupts off). However, faster versions not available in
    new kernels.
  - p139. ``Once the task has been queued, structure...`owned' by the
    kernel and shouldn't be modified.'' Interesting lifetime.
  - p139. queue_task_irq can be called from non-reentrant function.
    queue_task_irq_off can be called when interrupts disabled. Could
    check these are being called correctly. (Oops, removed in later
    versions, because speed gain not worth it.)
  - p145. Queued task can reschedule itself because head of task queue
    consumed before queued task run. (But it's a queue -- why should
    this matter?!)

*XXX Interrupt time
  - p140. Code being executed at interrupt time can't access current
    pointer, can't invoke scheduler, can't call kmalloc(GFP_KERNEL). In
    interrupt time when intr_count != 0.
  - p189. Interrupt code restricted in what it can do; same restrictions
    as task queues. Can't xfer to/from user space. Atomic? (Unclear in
    new kernels.)
  - p325. Network drivers should declare interrupt handler slow -- queues
    bottom half.
  - p396. intr_count doesn't exist anymore. Use in_interrupt function.
    Finer grained locking.
  - p396. No longer a distinction between fast and slow interrupts.

* Being reentrant
  - p153. ``[E]very kernel function that calls kmalloc(GFP_KERNEL) should
    be reentrant.''

* IRQs
  - p183. Should request the IRQ at device open and release it at device
    close.
  - p200. Shared handlers must pass unique dev_id parameter, otherwise
    kernel may oops. (Can this really be true?!)
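
On the p200 question about shared handlers: a user-space toy model (not
kernel code) of why each handler on a shared line wants its own dev_id.
Assumption on my part: free_irq(irq, dev_id) picks the handler to remove
by comparing dev_id, so a NULL or duplicated cookie could unhook the wrong
driver's handler and leave the line pointing into an unloaded module,
which would plausibly explain the oops. Every toy_* name below is
hypothetical and exists only for this sketch.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NR_IRQS    16   /* size of the toy IRQ table (assumption) */
#define TOY_MAX_SHARED  4   /* handlers allowed per line in this model */

typedef void (*toy_handler_t)(int irq, void *dev_id);

struct toy_action {
    toy_handler_t handler;  /* NULL means the slot is free */
    void *dev_id;           /* cookie identifying one registration */
};

/* Example per-device state; its address doubles as a unique dev_id. */
struct toy_dev {
    int hits;
};

static struct toy_action toy_irq[TOY_NR_IRQS][TOY_MAX_SHARED];

/* Register a handler on a shared line.  Returns 0 on success, or -1 if
 * the cookie is NULL, already present on this line, or the line is full. */
int toy_request_irq(int irq, toy_handler_t handler, void *dev_id)
{
    int i, free_slot = -1;

    assert(irq >= 0 && irq < TOY_NR_IRQS);
    if (handler == NULL || dev_id == NULL)
        return -1;
    for (i = 0; i < TOY_MAX_SHARED; i++) {
        if (toy_irq[irq][i].handler == NULL) {
            if (free_slot < 0)
                free_slot = i;
        } else if (toy_irq[irq][i].dev_id == dev_id) {
            return -1;      /* duplicate cookie: removal would be ambiguous */
        }
    }
    if (free_slot < 0)
        return -1;
    toy_irq[irq][free_slot].handler = handler;
    toy_irq[irq][free_slot].dev_id = dev_id;
    return 0;
}

/* Remove the one registration whose cookie matches dev_id.
 * Returns 0 if something was removed, -1 if nothing matched. */
int toy_free_irq(int irq, void *dev_id)
{
    int i;

    assert(irq >= 0 && irq < TOY_NR_IRQS);
    for (i = 0; i < TOY_MAX_SHARED; i++) {
        if (toy_irq[irq][i].handler != NULL &&
            toy_irq[irq][i].dev_id == dev_id) {
            toy_irq[irq][i].handler = NULL;
            toy_irq[irq][i].dev_id = NULL;
            return 0;
        }
    }
    return -1;
}

/* Deliver one interrupt: every handler on the line runs with its own
 * cookie.  Returns the number of handlers that were called. */
int toy_raise_irq(int irq)
{
    int i, called = 0;

    assert(irq >= 0 && irq < TOY_NR_IRQS);
    for (i = 0; i < TOY_MAX_SHARED; i++) {
        if (toy_irq[irq][i].handler != NULL) {
            toy_irq[irq][i].handler(irq, toy_irq[irq][i].dev_id);
            called++;
        }
    }
    return called;
}

/* Sample handler: counts interrupts in the device the cookie points to. */
void toy_count_hit(int irq, void *dev_id)
{
    (void)irq;
    ((struct toy_dev *)dev_id)->hits++;
}
```

Registering two struct toy_dev objects on the same line with
toy_count_hit and then freeing one by address leaves the other still
receiving interrupts. The model refuses a NULL or repeated cookie up
front, which is a design choice of this sketch and not a claim about what
the kernel does.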