UNIX Filesystems

Tue 21 October 2014 by Justin Kinney

In UNIX, everything is a file. This is a bit of an overgeneralization, but it nearly always holds true. This post is the first in a series digging into UNIX (specifically Linux) internals, and I will use it to collect notes and consolidate my understanding of filesystems. My goal here is to discuss the general methods used when working with filesystems from a development standpoint, and then to look into filesystem internals.

Working with file-like objects

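In order to work with a file, a process first asks the kernel to open it and then refers to it through the file descriptor the kernel hands back.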

File Descriptors

A file descriptor is a non-negative integer that is returned by the kernel when open() is called by a process. There are a few special file descriptors (0 - stdin, 1 - stdout, and 2 - stderr) which are, by convention, used by shells.

For the most part, the following 5 functions are used when working with file-like objects in UNIX: open(), close(), read(), write(), and lseek().

open()

When a process calls open(), the kernel returns either a new file descriptor or -1 on error. There are many options to the open() system call, summarized below:

[table]
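The call can be tried from Python, whose os module is a thin wrapper over these system calls. This is only a sketch: the helper names and the choice of flags are mine, and where C code would see open() return -1, Python raises OSError carrying the same errno.

```python
import errno
import os
import tempfile


def open_for_append(path):
    """Open path write-only, creating it if needed; return the descriptor."""
    # Flags are OR-ed together. 0o644 is the permission bit mask, used only
    # if O_CREAT actually creates the file (and filtered by the umask).
    flags = os.O_WRONLY | os.O_CREAT | os.O_APPEND
    return os.open(path, flags, 0o644)


def errno_for_missing(path):
    """Return the errno produced by opening a path read-only, or 0 on success."""
    try:
        fd = os.open(path, os.O_RDONLY)
    except OSError as exc:
        # The C call would have returned -1 with errno set to this value.
        return exc.errno
    os.close(fd)
    return 0


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        fd = open_for_append(os.path.join(tmp, "example.txt"))
        print("descriptor:", fd)  # a small non-negative integer
        os.close(fd)
        missing = errno_for_missing(os.path.join(tmp, "missing.txt"))
        print("missing file gives ENOENT:", missing == errno.ENOENT)
```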

In Linux, files can be opened in many ways, including the following:

  • Canonical mode - read() and write() are used as normal. read() blocks until the data has been copied into the user mode address space of the process; write() returns as soon as the user data has been copied into the page cache.
  • Synchronous mode - similar to canonical mode, but write() blocks until the data is written to disk (requested with the O_SYNC flag).
  • Memory mapped mode - once the file is opened and mmap() is called, the file is accessible as an array of bytes in memory. The process can access and mutate the bytes directly without calling read() or write().
  • Direct I/O mode - read() and write() calls bypass the page cache (requested with the O_DIRECT flag).
  • Asynchronous mode - requests for data transfers never block, and carry on "in the background".
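Two of these modes are easy to try from user space. The sketch below is illustrative only: the function names are made up, the first opens a file with O_SYNC, and the second edits a file through a shared memory mapping instead of calling write().

```python
import mmap
import os
import tempfile


def write_synchronously(path, payload):
    """Synchronous mode: write() returns only after the data reaches the disk."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_SYNC
    fd = os.open(path, flags, 0o644)
    try:
        return os.write(fd, payload)
    finally:
        os.close(fd)


def uppercase_first_byte(path):
    """Memory mapped mode: change the file by assigning to mapped bytes."""
    fd = os.open(path, os.O_RDWR)
    try:
        # Length 0 maps the whole file, so the file must not be empty.
        with mmap.mmap(fd, 0) as view:
            view[0:1] = view[0:1].upper()  # slice assignment, no write() call
            view.flush()                   # ask for the dirty page to be written back
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "mapped.txt")
        write_synchronously(path, b"hello")
        uppercase_first_byte(path)
        with open(path, "rb") as handle:
            print(handle.read())
```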

When open() is called, the kernel first finds an unused file descriptor in the fd array of the process descriptor. The file is then opened by calling the filp_open() function, passing parameters including the pathname, the access mode flags, and the permission bit mask. Several security checks and accounting steps happen at this point: a dentry object is created if one does not already exist, a new disk inode is created if the file does not exist and O_CREAT was requested, and dentry_open() is called to create and initialize a new file object and insert it into the list of opened files for the filesystem superblock. Finally, the file descriptor slot in the fd array is updated to hold the address of the file object returned by dentry_open().

close()

When a process calls close() on a file descriptor, any record locks the process holds on that file are released. The kernel also automatically closes any files left open when a process exits.

When a file is closed, the corresponding fd entry in the process files list is set to NULL. In addition, a call to filp_close() is made, which invokes any flush operations associated with the file, releases locks, and finally releases the file object.

read()

When used on a file, read() returns up to the requested number of bytes, starting at the current file offset. If fewer bytes remain before the end of the file than were requested, read() returns only the bytes that remain, and the next call to read() returns 0.
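The end-of-file behaviour is easy to observe. In this small sketch (read_sizes is a hypothetical helper), Python's os.read() returns an empty bytes object where the C call would return 0.

```python
import os
import tempfile


def read_sizes(path, chunk):
    """Return the length of each read() result, up to and including EOF."""
    sizes = []
    fd = os.open(path, os.O_RDONLY)
    try:
        while True:
            data = os.read(fd, chunk)
            sizes.append(len(data))
            if not data:  # empty result: the C call returned 0 (end of file)
                break
    finally:
        os.close(fd)
    return sizes


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "ten_bytes.bin")
        with open(path, "wb") as handle:
            handle.write(b"0123456789")
        print(read_sizes(path, 4))  # [4, 4, 2, 0]
```

The third read asks for 4 bytes but only 2 remain, and the fourth reports end of file.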

This call locates the physical blocks that contain the data being accessed and activates the block device driver to start the data transfer. Reading a file is page-based, and the kernel always transfers whole pages of data at once. If a few bytes are requested with read(), the kernel allocates a new page frame, fills it with the appropriate portion of data, adds the page to the page cache, and then copies the requested bytes into the process address space.

write()

When used on a file, write() returns the number of bytes written, which is usually the number requested. write() begins writing at the file's current offset, and the offset is advanced by the number of bytes written.
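Because the return value is only usually equal to the request, careful code loops until everything has been accepted. A minimal sketch, with write_all as a made-up helper name:

```python
import os
import tempfile


def write_all(fd, payload):
    """Call write() repeatedly until every byte of payload has been accepted."""
    view = memoryview(payload)
    total = 0
    while total < len(view):
        # write() may accept fewer bytes than offered; resume after them.
        total += os.write(fd, view[total:])
    return total


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "out.txt")
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        try:
            print("written:", write_all(fd, b"hello, world\n"))
            # The offset has advanced by exactly the number of bytes written.
            print("offset:", os.lseek(fd, 0, os.SEEK_CUR))
        finally:
            os.close(fd)
```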

Write operations require the kernel to copy the bytes to be written from the user mode process address space into kernel data structures, and then to disk. In most cases, a write involves identifying the inode object that corresponds to the file, acquiring a lock on it via the semaphore in the inode object, and writing the data into the page cache before releasing the lock. The pages written are marked dirty and are eventually written to disk by the pdflush kernel threads (in the 2.6-era kernels described here; later kernels use per-device flusher threads for the same job).

lseek()

For every open file object, there is a current file offset, which tracks how many bytes we are "into" the file. This offset is advanced each time read() or write() is called, by the number of bytes transferred. lseek() explicitly sets the current file offset held in the kernel. It does not read or write any data; it only returns the new offset.

In Linux, this offset is stored in the f_pos field of the file object that the process's file descriptor points to.
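A short sketch of how the offset moves (offset_tour is a hypothetical helper). Seeking by 0 from SEEK_CUR is the usual trick for asking where the offset currently is.

```python
import os
import tempfile


def offset_tour(path):
    """Return (initial offset, offset after a 3-byte read, file size, bytes at 2..3)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        start = os.lseek(fd, 0, os.SEEK_CUR)       # where are we now?
        os.read(fd, 3)                             # read() advances the offset
        after_read = os.lseek(fd, 0, os.SEEK_CUR)
        size = os.lseek(fd, 0, os.SEEK_END)        # jump to the end, no data read
        os.lseek(fd, 2, os.SEEK_SET)               # absolute position 2
        data = os.read(fd, 2)
    finally:
        os.close(fd)
    return start, after_read, size, data


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "letters.txt")
        with open(path, "wb") as handle:
            handle.write(b"abcdefgh")
        print(offset_tour(path))  # (0, 3, 8, b'cd')
```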

Holes in files

Here is where files can get interesting. A file's offset can be greater than the file's current size. If you lseek() past the end of a file and write data, the file is extended, and the bytes in the gap that were never written read() back as 0. Depending on the filesystem implementation, these holes may not have blocks allocated to store the "empty" data.
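A hole can be created in a few lines. In this sketch (make_sparse is a made-up name) the allocated size is derived from st_blocks, which counts 512-byte units; whether it comes out smaller than the logical size depends on the filesystem, so treat that number as something to observe rather than rely on.

```python
import os
import tempfile


def make_sparse(path, hole_size):
    """Seek hole_size bytes past the start, write 3 bytes, and inspect the file."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.lseek(fd, hole_size, os.SEEK_SET)  # offset is now beyond the file's size
        os.write(fd, b"end")                  # the file is extended to cover the gap
        os.lseek(fd, 0, os.SEEK_SET)
        head = os.read(fd, 16)                # bytes from inside the hole
        info = os.fstat(fd)
    finally:
        os.close(fd)
    return head, info.st_size, info.st_blocks * 512


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        head, size, allocated = make_sparse(os.path.join(tmp, "sparse.bin"), 1 << 20)
        print("hole reads as zeros:", head == b"\x00" * 16)
        print("logical size:", size, "allocated bytes:", allocated)
```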

A few words on I/O efficiency

When reading and writing data, a buffer size must be selected, and the choice has a significant impact on the performance of an application. More details can be found in APUE[1], but suffice it to say that setting the buffer size to the filesystem block size is usually a good choice.
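One way to pick that size at run time is to ask the kernel for the preferred I/O block size of the file via fstat(). The copy routine below is a sketch under that assumption; copy_file is a hypothetical helper and the 4096 fallback is an arbitrary choice of mine.

```python
import os
import tempfile


def copy_file(src, dst):
    """Copy src to dst using st_blksize as the buffer size; return that size."""
    src_fd = os.open(src, os.O_RDONLY)
    try:
        # st_blksize is the preferred block size for I/O on this file.
        bufsize = os.fstat(src_fd).st_blksize or 4096
        dst_fd = os.open(dst, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        try:
            while True:
                chunk = os.read(src_fd, bufsize)
                if not chunk:  # end of file
                    break
                while chunk:   # write() may accept only part of the chunk
                    chunk = chunk[os.write(dst_fd, chunk):]
        finally:
            os.close(dst_fd)
    finally:
        os.close(src_fd)
    return bufsize


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "src.bin")
        with open(src, "wb") as handle:
            handle.write(os.urandom(10000))
        print("buffer size used:", copy_file(src, os.path.join(tmp, "dst.bin")))
```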

Also, most filesystems have some sort of read-ahead optimization when sequential blocks of data have been accessed. In this case, additional bytes will be read into the system cache, making the next call to read() faster.

Linux Internals

This section will focus on the internals of the Linux kernel and the data structures used for file I/O.

There are several very important kernel data structures used when handling files. For this section, I will focus on those pertinent to the VFS:

  • superblock
  • inode
  • file
  • dentry
  • dentry cache

The superblock object

The inode object

Ah, the inode object. The classic litmus test of Linux internals knowledge used by phone screeners around the world. This object stores general information about a specific file. Each inode object has an inode number that is unique within its filesystem. A filename is just an arbitrary label that can be changed, but the inode itself is unique to the file and remains the same as long as the file exists.

Some of the basic fields in an inode object include:

  • i_ino - inode number
  • i_count - usage counter
  • i_mode - file type and permissions
  • i_nlink - number of hard links
  • i_uid - owner id
  • i_gid - group id
  • i_size - file size in bytes
  • i_atime - last access time (time of last read, if filesystem is mounted with atime enabled)
  • i_mtime - last modified time (time of last write)
  • i_ctime - time of last inode change
  • i_blksize - block size in bytes
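Most of these fields are visible from user space through stat(). The sketch below maps the stat() results onto the kernel field names purely for illustration (inode_summary and the dictionary keys are my own labels) and uses a hard link to show two filenames sharing one inode.

```python
import os
import stat
import tempfile


def inode_summary(path):
    """Return a few stat() fields, labelled with the matching inode field names."""
    st = os.stat(path)
    return {
        "i_ino": st.st_ino,
        "i_mode": stat.filemode(st.st_mode),  # e.g. "-rw-r--r--"
        "i_nlink": st.st_nlink,
        "i_uid": st.st_uid,
        "i_gid": st.st_gid,
        "i_size": st.st_size,
        "i_mtime": st.st_mtime,
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        first = os.path.join(tmp, "first_name")
        second = os.path.join(tmp, "second_name")
        with open(first, "wb") as handle:
            handle.write(b"12345")
        os.link(first, second)  # a second filename for the same inode
        a, b = inode_summary(first), inode_summary(second)
        print("same inode:", a["i_ino"] == b["i_ino"])
        print("link count:", a["i_nlink"])
```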

It is important to know that there are two inode representations: the inode on disk, and the inode object in memory. The inode object in memory duplicates some of the data in the inode on disk.

Each inode object in memory always appears in one of the following lists:

  • the list of valid unused inodes - inodes available for use
  • the list of in-use inodes - those inodes that mirror a disk inode and are in use by a process
  • the list of dirty inodes - inodes with data not yet written to disk

All inodes are also included in a doubly-linked circular list found in the superblock object, as well as a hash table to speed searches for an inode object.

The file object

A file object represents an open file used by a process. The object is created when a file is open()ed and has some interesting fields:

  • f_dentry - the dentry object associated with the file
  • f_count - reference counter
  • f_mode - process access mode
  • f_pos - current file offset
  • f_uid/f_gid - user id/group id

One of the most important fields is f_pos, which tracks the current position where calls to read() and write() will take place. The reason this offset is kept in the file object instead of the inode is that multiple processes may be accessing a file at once, and they each need distinct tracking of position in the file.
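The effect can be seen even within a single process. In this sketch (offsets_after_read is a hypothetical helper), a second open() creates a second file object with its own offset, while dup() yields a second descriptor for the same file object, so the offset is shared.

```python
import os
import tempfile


def offsets_after_read(path):
    """Read 4 bytes via one descriptor; return the offsets seen by three descriptors."""
    first = os.open(path, os.O_RDONLY)
    second = os.open(path, os.O_RDONLY)  # separate open(): separate file object
    alias = os.dup(first)                # dup(): same file object as `first`
    try:
        os.read(first, 4)
        return (
            os.lseek(first, 0, os.SEEK_CUR),
            os.lseek(second, 0, os.SEEK_CUR),
            os.lseek(alias, 0, os.SEEK_CUR),
        )
    finally:
        for fd in (first, second, alias):
            os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "shared.txt")
        with open(path, "wb") as handle:
            handle.write(b"abcdefgh")
        print(offsets_after_read(path))  # (4, 0, 4)
```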

The dentry object

more to come soon!