Exploring the ext3 Filesystem

By: Bill von Hagen
Friday, April 5, 2002 09:53:55 AM EST
URL: http://www.linuxplanet.com/linuxplanet/reports/4136/1/

Introduction to the ext3 Filesystem

As a sophisticated, powerful, and free operating system, Linux provides fertile territory for developing advanced system and user-level software. Some of the most exciting developments in recent Linux kernels are new, high-performance techniques for managing how the data on Linux systems is stored, allocated, and updated on disk. One of the most interesting of these new mechanisms is the ext3 filesystem, which has been integrated into the Linux kernel since version 2.4.16 and is already available as a default filesystem type on Linux distributions from Red Hat and SuSE.

The ext3 filesystem is a journaling filesystem that is 100% compatible with all of the utilities developed for creating, managing, and fine-tuning the ext2 filesystem, which has been the default filesystem on Linux systems for the last few years. Before delving into the differences between the ext2 and ext3 filesystems, a quick refresher on storage and filesystem terminology is in order.

Some Background on Linux Filesystems

Since the date in the distant past when computer systems first began to read and write from magnetic storage media, guaranteeing the consistency of the files and (later) directories on that media has been a thorn in the paw of system administrators and designers everywhere. At the system level, all of the data on a computer exists as data blocks on some storage device, organized using special on-disk data structures into partitions (logical subsets of storage media) which are themselves organized into files, directories, and unallocated (free) space.

Filesystems are created on disk partitions to enable applications to store and organize data in the form of files and directories. Linux, like Unix systems, uses a hierarchical filesystem composed of files and directories, which can contain either files or other directories. The files and directories in a Linux filesystem are made available to users by mounting them using the Linux "mount" command, which is generally done as part of a Linux system's startup process. The list of filesystems that are available for use is stored in the file /etc/fstab (which stands for filesystem table), while the list of filesystems that are currently mounted is maintained by Linux in the file /etc/mtab (which stands for mount table).
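As a small illustration of the table format, the following shell sketch lists the device, mount point, and filesystem type for each entry in an fstab-style file. The function name list_filesystems and the sample entries are made up for this example; on a real system you would point the function at /etc/fstab or /etc/mtab instead of the temporary file.

```shell
# Print "device mountpoint type" for each real entry in an fstab-style file.
# Blank lines and comment lines (first field starting with #) are skipped.
list_filesystems() {   # usage: list_filesystems FILE
    awk 'NF >= 3 && $1 !~ /^#/ { print $1, $2, $3 }' "$1"
}

# Sample table in a temporary file (hypothetical entries).
sample=$(mktemp)
cat > "$sample" <<'EOF'
# device       mount point     type       options          dump pass
/dev/hda1      /               ext2       defaults         1 1
/dev/hda5      /opt            ext2       defaults         1 2
EOF

list_filesystems "$sample"
```

Running the same function against /etc/mtab shows what is mounted right now rather than what is configured to be mounted.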

As each filesystem is mounted during the boot process, a bit in the filesystem header (popularly known as the "clean bit") is cleared, which indicates that the filesystem is in use and that the data structures used to manage allocation and file and directory organization in that filesystem may actively be changing.

Filesystems are referred to as being consistent when all data blocks in the filesystem are either used or free, each allocated data block is claimed by one and only one file or directory, and all files and directories can be accessed by traversing a series of other directories in the filesystem. When a Linux system is intentionally shut down using operator commands, all filesystems are unmounted. Unmounting a filesystem during a standard shutdown sets the clean bit in the filesystem header, indicating that the filesystem was cleanly unmounted and can therefore be assumed to be consistent.

Years of filesystem debugging and redesign and the use of some extremely clever algorithms for writing data to disk have largely eliminated filesystem corruption caused by applications or the Linux kernel itself, but eliminating data corruption and loss due to power outages and other system mishaps is still the system programmer's equivalent of the search for the holy grail. When a Linux system crashes or is simply turned off without going through the standard shutdown procedure, the clean bit in filesystem headers is not set. The next time the system boots, the mount process detects filesystems that are not marked as being clean, and verifies their consistency using the Linux/Unix 'fsck' (filesystem check) utility.
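On ext2 and ext3 filesystems, the state recorded by the clean bit can be inspected with the dumpe2fs utility from e2fsprogs, whose header listing includes a "Filesystem state:" line. The sketch below only shows how that line might be picked out of the listing; the function name fs_state is invented here, and the sample text is a hand-written stand-in for real dumpe2fs output.

```shell
# Read a dumpe2fs-style header listing on standard input and print the
# value of its "Filesystem state:" line (for example "clean" or "not clean").
fs_state() {
    sed -n 's/^Filesystem state:[[:space:]]*//p'
}

# Hand-written stand-in for a header listing (not captured from a real disk).
sample=$(mktemp)
cat > "$sample" <<'EOF'
Filesystem OS type:       Linux
Filesystem state:         not clean
Mount count:              12
EOF

fs_state < "$sample"

# On a real system, as root, the equivalent would be something like:
#   /sbin/dumpe2fs -h /dev/hda5 2>/dev/null | fs_state
```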

What is Journaling?

Running fsck on a number of large filesystems can take quite a bit of time, which is not a good thing given today's high-availability assumptions. The reason that inconsistencies may exist in a filesystem that is not cleanly unmounted is that writes to the disk may have been in progress when the system went down. Applications may have been updating the data contained in files and the system may have been updating filesystem metadata, which is "data about filesystem data" - in other words, the information about which blocks are allocated to which files, which files live in which directories, and so on. Inconsistencies in file data are bad enough, but inconsistencies in filesystem metadata are the stuff of which lost files and other operational nightmares are made.

In order to minimize filesystem inconsistencies and minimize system restart time, journaling filesystems keep track of the changes that they will be making to the filesystem before they actually make them to the filesystem. These records are stored in a separate part of the filesystem, typically known as the "journal" or "log". Once these journal records (also commonly known as "log" records) are safely written, a journaling filesystem applies those changes to the filesystem and then purges those entries from the log. Journal records are organized into sets of related filesystem changes, much like changes made to a database are organized into transactions.

Journaling filesystems maximize filesystem consistency because log records are written before filesystem changes are made, and because the filesystem saves these records until they have been safely and completely applied to the filesystem. When rebooting a computer that uses journaling filesystems, the mount program can guarantee the consistency of the filesystem by simply checking the log for pending changes that are not marked as being done and applying them to the filesystem. In most cases, the system doesn't have to check filesystem consistency, meaning that computers using journaling filesystems are available almost instantly after rebooting them. The chances of losing data due to filesystem consistency problems are similarly reduced.
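The log-then-apply idea can be sketched with a toy model in shell. This is not how ext3 stores its journal on disk: the "disk" is a directory of small files, the journal is a text file, and the record format, the COMMIT marker, and all function names are assumptions made for this illustration. The sketch also assumes that records from a transaction that never reached its COMMIT marker are discarded during replay.

```shell
# Toy model: the "disk" is a directory of block files, the journal a text file.
workdir=$(mktemp -d)
disk="$workdir/disk"
journal="$workdir/journal"
mkdir "$disk"
: > "$journal"

# Record an intended change in the journal before touching the disk.
journal_write() {   # usage: journal_write BLOCK VALUE
    printf '%s %s\n' "$1" "$2" >> "$journal"
}

# Mark the records written so far as one complete transaction.
journal_commit() {
    echo "COMMIT" >> "$journal"
}

# Apply every committed transaction to the disk, then purge the journal.
# Records that follow the last COMMIT are incomplete and are thrown away.
journal_replay() {
    pending="$workdir/pending"
    : > "$pending"
    while read -r block value; do
        if [ "$block" = "COMMIT" ]; then
            while read -r b v; do
                printf '%s\n' "$v" > "$disk/$b"
            done < "$pending"
            : > "$pending"
        else
            printf '%s %s\n' "$block" "$value" >> "$pending"
        fi
    done < "$journal"
    rm -f "$pending"
    : > "$journal"
}

# One complete transaction, then a "crash" in the middle of a second one.
journal_write inode7 "size=4096"
journal_write bitmap "block12=used"
journal_commit
journal_write inode9 "size=512"      # crash: no COMMIT follows

# What recovery at the next mount would do in this model.
journal_replay
ls "$disk"
```

After the replay, only the two blocks from the committed transaction exist on the toy disk, and the journal is empty. No scan of the whole disk was needed to get there, which is the reason recovery is so much faster than a full fsck.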

There are a number of journaling filesystems available for Linux. The best known of these are XFS, a journaling filesystem originally developed by Silicon Graphics but now released as open source; ReiserFS, a journaling filesystem developed especially for Linux; JFS, a journaling filesystem originally developed by IBM but now released as open source; and the ext3 filesystem, developed by Dr. Stephen Tweedie at Red Hat and a host of other contributors.

The Linux ext3 Filesystem

The ext3 filesystem is a journaling version of the Linux ext2 filesystem. The ext3 filesystem has one significant advantage that no other journaling filesystem has - it is totally compatible with the ext2 filesystem. It can therefore make use of all of the existing applications that have already been developed to manipulate and fine-tune the ext2 filesystem. The ext3 filesystem is supported in Linux kernel versions 2.4.16 and newer, but must be activated using the Filesystems Configuration dialog when building the kernel. Linux distributions such as Red Hat 7.2 and SuSE 7.3 already include built-in support for the ext3 filesystem. You can only use the ext3 filesystem if ext3 support is compiled into your kernel and you have the latest versions of the mount and e2fsprogs Linux utilities.
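One way to see whether a running kernel currently offers ext3 is to look for it in /proc/filesystems, which lists the filesystem types the kernel knows about. The sketch below wraps that check in a function; the name kernel_supports is invented for this example, and the sample list is a hand-written stand-in so that the example runs anywhere. Note that if ext3 was built as a module, it may not appear in the list until the module has been loaded.

```shell
# Succeed if the given filesystem type appears in a /proc/filesystems-style
# list. The type name is the last field on each line; pseudo-filesystems
# carry a leading "nodev" field.
kernel_supports() {   # usage: kernel_supports FSTYPE [LISTFILE]
    awk -v fs="$1" '$NF == fs { found = 1 } END { exit !found }' \
        "${2:-/proc/filesystems}"
}

# Hand-written stand-in for /proc/filesystems.
sample=$(mktemp)
printf 'nodev\tproc\n\text2\n\text3\n' > "$sample"

if kernel_supports ext3 "$sample"; then
    echo "ext3 is available"
else
    echo "ext3 is not available"
fi
```

Leaving off the second argument makes the function read the real /proc/filesystems.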

In most cases, converting filesystems from one format to another involves backing up all of the data that they contain, reformatting the partition or logical volume that contains the filesystem, and then restoring all of the previous data to that filesystem. Due to the compatibility between the ext2 and ext3 filesystems, this sort of conversion process is totally unnecessary when converting an ext2 filesystem to ext3, which can be done (as root) with a single command:

 # /sbin/tune2fs -j <partition-name>

As an example, converting the ext2 filesystem located on the partition /dev/hda5 to an ext3 filesystem would be done with the following command:

 # /sbin/tune2fs -j /dev/hda5

The tune2fs command's -j option creates the ext3 journal on an existing ext2 filesystem. After converting an ext2 filesystem to ext3, you must also update the entries in the /etc/fstab file for that filesystem to specify that it is an ext3 filesystem. You can also use the "auto" filesystem type option, but I prefer to explicitly identify the type of filesystem that I'm using. The following examples from an /etc/fstab file show before and after versions of the entry for a filesystem on /dev/hda5:

Before:

/dev/hda5      /opt            ext2       defaults         1 2

After:

/dev/hda5      /opt            ext3       defaults         1 0

The last field of a Linux /etc/fstab entry specifies the stage in the boot process during which filesystem consistency should be verified by the "fsck" program. When using the ext3 filesystem, you can set this field to "0", as shown in the previous example. This means that the fsck program will never check the consistency of the filesystem, since the consistency of the filesystem is guaranteed by playing back the journal.
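The /etc/fstab change can also be scripted. The following sketch rewrites the entry for one device, changing the type field from ext2 to ext3 and the last field to 0, and prints the result to standard output rather than editing the file in place, so that it can be reviewed first. The function name convert_entry and the sample file are assumptions for this example. As a side effect of how awk rebuilds a modified line, the changed entry is written with single spaces between fields, which mount accepts just as well.

```shell
# Print an fstab-style file with the entry for DEVICE switched from
# ext2 to ext3 and its last (fsck pass) field set to 0.
# All other lines are passed through unchanged.
convert_entry() {   # usage: convert_entry DEVICE FSTAB_FILE
    awk -v dev="$1" '
        $1 == dev && $3 == "ext2" { $3 = "ext3"; $6 = "0" }
        { print }
    ' "$2"
}

# Sample table in a temporary file (hypothetical entries).
sample=$(mktemp)
cat > "$sample" <<'EOF'
/dev/hda1      /               ext2       defaults         1 1
/dev/hda5      /opt            ext2       defaults         1 2
EOF

convert_entry /dev/hda5 "$sample"
```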

Converting the root filesystem of a Linux system to ext3 requires some special handling, and is best done in single user mode after creating an initial RAM disk that supports the ext3 filesystem.

Multiple Journaling Modes in the ext3 Filesystem

Aside from its compatibility with ext2 filesystem utilities and the ease with which you can convert ext2 filesystems to ext3, the ext3 filesystem also offers several different types of journaling. A classic issue in journaling filesystems is whether they only log changes to filesystem metadata or log changes to all filesystem data, including changes to files themselves. The ext3 filesystem supports three different journaling modes, which you can activate in the /etc/fstab entry for an ext3 filesystem. These journaling modes are the following:

journal - logs all filesystem data and metadata changes. The slowest of the three ext3 journaling modes, this journaling mode minimizes the chance of losing the changes you have made to any file in an ext3 filesystem.

ordered - only logs changes to filesystem metadata, but flushes file data updates to disk before making changes to associated filesystem metadata. This is the default ext3 journaling mode.

writeback - only logs changes to filesystem metadata but relies on the standard filesystem write process to write file data changes to disk. This is the fastest ext3 journaling mode.

The differences between these journaling modes are both subtle and profound. Using the "journal" mode requires that an ext3 filesystem write every change to a filesystem twice - once to the journal, and then again to the filesystem itself. This can reduce the overall performance of your filesystem, but is the mode most beloved by users, because it minimizes the chances of losing changes to your files since both metadata and data updates are recorded in the ext3 journal and can be replayed when a system reboots.

Using the "ordered" mode, only filesystem metadata changes are logged, which reduces redundancy between writing to the filesystem and to the journal and is therefore faster. Though the changes to file data are not logged, they must be done before associated filesystem metadata changes are made by the ext3 journaling daemon, which can slightly reduce the performance of your system. However, using this journaling mode guarantees that files in the filesystem will never be out of sync with any related changes to filesystem metadata.

Using the "writeback" mode is faster than the other two ext3 journaling modes because it only logs changes to filesystem metadata and does not wait for associated changes to file data to be written before updating things like file size and directory information. Because updates to file data are done asynchronously to journaled changes to filesystem metadata, files in the filesystem may exhibit metadata inconsistencies such as owning data blocks to which updated data was not yet written when the system went down. This isn't fatal, but can be disappointing to users.
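The practical difference between the three modes shows up when a crash interrupts an update. The toy model below is a deliberate simplification and not a description of the ext3 implementation: it reduces one file update to a handful of named steps per mode and reports what recovery would find if the system went down after a given number of steps. The step names, their order, and the function names are all assumptions made for this sketch; for "writeback" the data write is placed last to show the worst case, although in practice it may happen at any point.

```shell
# Order of steps for one file update in each mode (simplified model).
steps_for() {   # usage: steps_for MODE
    case "$1" in
        journal)   echo "log-data log-meta commit write-data write-meta" ;;
        ordered)   echo "write-data log-meta commit write-meta" ;;
        writeback) echo "log-meta commit write-meta write-data" ;;
        *)         return 1 ;;
    esac
}

# Report the state seen after recovery if the system crashes once
# STEPS_COMPLETED steps have finished.
after_crash() {   # usage: after_crash MODE STEPS_COMPLETED
    done_steps=" "
    n=0
    for step in $(steps_for "$1"); do
        [ "$n" -ge "$2" ] && break
        done_steps="$done_steps$step "
        n=$((n + 1))
    done
    meta=old
    data=old
    # A committed transaction is replayed, so its metadata is always recovered.
    case "$done_steps" in *" commit "*) meta=new ;; esac
    # Data survives if it reached the disk directly ...
    case "$done_steps" in *" write-data "*) data=new ;; esac
    # ... or if it was logged as part of a committed transaction.
    case "$done_steps" in *" log-data "*" commit "*) data=new ;; esac
    echo "metadata=$meta data=$data"
}

# Crash immediately after the transaction commits in each mode.
after_crash journal 3
after_crash ordered 3
after_crash writeback 2
```

In this model only the "writeback" case ends up with new metadata describing data that was never written, which is the kind of mismatch described above.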

Specifying the journaling mode used by an ext3 filesystem is done in the /etc/fstab entry for that filesystem. The "ordered" journaling mode is the default journaling mode used by ext3 filesystems, but you can specify a different journaling mode by updating the filesystem options portion of an /etc/fstab entry. For example, an /etc/fstab entry that specifies the "writeback" journaling mode would look like the following:


/dev/hda5      /opt            ext3       data=writeback        1 0
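To check which journaling mode an entry asks for, the options field can be split on commas and searched for a data= setting. The sketch below does this for an fstab-style file and falls back to "ordered" when no data= option is present. The function name journal_mode and the second sample entry (/dev/hda6) are hypothetical.

```shell
# Print the journaling mode requested for DEVICE in an fstab-style file.
journal_mode() {   # usage: journal_mode DEVICE FSTAB_FILE
    awk -v dev="$1" '
        $1 == dev {
            mode = "ordered"          # ext3 default when no data= option is given
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] ~ /^data=/)
                    mode = substr(opts[i], 6)
            print mode
        }
    ' "$2"
}

# Sample table in a temporary file (the /dev/hda6 entry is hypothetical).
sample=$(mktemp)
cat > "$sample" <<'EOF'
/dev/hda5      /opt            ext3       data=writeback        1 0
/dev/hda6      /home           ext3       defaults              1 0
EOF

journal_mode /dev/hda5 "$sample"
journal_mode /dev/hda6 "$sample"
```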

Conclusion

Journaling filesystems provide significant advantages across the whole spectrum of Linux users, minimizing delays when rebooting a Linux system and almost eliminating the chance of filesystem inconsistencies. The ext3 filesystem is a high-performance journaling filesystem whose compatibility with the ext2 filesystem and associated utilities makes it easy to upgrade your system to use the ext3 filesystem. This compatibility also extends the usability of all of the utilities that have already been developed for working with the ext2 filesystem. The ext3 filesystem is a true win-win filesystem solution for improving the availability and consistency of Linux systems everywhere.

A number of other details about the ext3 filesystem may be of interest to system administrators, but are outside the scope of this article. For more information about the ext3 filesystem and other advanced Linux filesystems, see my book on "Linux Filesystems" (SAMS, ISBN 0672322722). For more information on the ext3 filesystem itself, see Red Hat's ext3 white paper at the URL http://www.redhat.com/support/wpapers/redhat/ext3/index.html#toc or the ext3 FAQ at the URL http://people.spoiled.org/jha/ext3-faq.html.

Copyright © 1999 internet.com Corp. All Rights Reserved.