Poor performance due to excessively large directories

Understanding the problem

Directories containing a large number of files become extremely inefficient to work with. Directory entries have to be read linearly (via the readdir() call), and many commands read the whole directory before operating on it. In one test, running du -ak on a directory with 500,000 files took 46 seconds to complete, and a wildcard match for a single file in the same directory took 7 seconds. Operations that find and operate on multiple files suffer especially badly; running "rm -rf" on this directory took over two hours.
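If you want to see how a directory of your own behaves, timing a couple of representative commands is a quick check. A minimal sketch; bigdir and the filename pattern are placeholders:

time du -ak bigdir > /dev/null     # full traversal of the directory
time ls bigdir/somefile-0001*      # a single wildcard match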

This is not just a performance problem for your own jobs; it can also cause sluggish performance for other users, because of the large amount of disk activity involved in accessing these directories.

How big is too big?

Smaller is always better, but I recommend keeping each directory to fewer than 10,000 files. Beyond 100,000 files, performance degrades extremely rapidly.

Tip: The size of a directory in an ls listing on our system usually tells you the number of entries inside it. (This is not 100% true, but it's a good approximation, and much quicker than running ls -f | wc -l on a large directory.) For example, the directory below has 184 entries:

brodbd@dryas:~$ ls -ld test
drwx------ 2 brodbd brodbd 184 Apr 14 15:07 test

An empty directory will have a size of 2, because every directory has at least two entries ("." and "..").
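You can see this by creating a fresh directory and listing it; an illustrative transcript, assuming the same filesystem behavior as the example above:

brodbd@dryas:~$ mkdir empty
brodbd@dryas:~$ ls -ld empty
drwx------ 2 brodbd brodbd 2 Apr 14 15:08 empty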

Avoiding the problem

The solution is to split the files into subdirectories, so that no single directory grows too large and searches stay efficient. It doesn't much matter how you do this, as long as the resulting structure is easy for your software to create and search.

Example: On sites with a very large number of users, searching /home for a user's home directory can become a performance bottleneck. A common solution is to create directories /home/a/ through /home/z/, then underneath each one create another set of a/ through z/. A user named "brodbd" then gets the home directory /home/b/r/brodbd. This creates 676 subdirectories (effectively, hash buckets) in a two-level structure.
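The lookup side of such a scheme is trivial to script. A minimal bash sketch, using the example username from above:

user=brodbd
homedir="/home/${user:0:1}/${user:1:1}/$user"   # first two letters pick the buckets
echo "$homedir"                                 # prints /home/b/r/brodbd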

Be particularly careful when designing large parallel jobs; it's very easy to end up with a lot of files in one directory. Consider splitting each run into its own subdirectory.
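For example, a job script can create a fresh subdirectory per run before writing any output. A minimal sketch; results/ and run_id are hypothetical names:

run_id=$(date +%Y%m%d-%H%M%S)-$$     # timestamp plus PID keeps runs distinct
mkdir -p "results/$run_id"
# ... write this run's output files under results/$run_id/ ...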

Cleaning up

If you've accidentally created a large, flat directory, how can you deal with it? Even routine operations on a directory this size can take an unreasonable amount of time, so keep the following in mind:

  • Running ls is going to take a while, because it has to read the whole directory, then sort it. Using the -f flag to avoid sorting can help.
  • Do not try to access large directories via Windows file sharing. Windows Explorer tries to read and analyze each file before displaying it. Even on local filesystems this starts to break down after a few tens of thousands of files.
  • Be careful with wildcards. A command like rm * first expands the *, producing one long list of filenames that may exceed the system's maximum command-line length. You may need to use the find command with its -exec option to process the files (the sketch after this list shows one way), or use wildcards that match only small subsets of the files at a time.
  • Because of the way directories are handled internally by the filesystem, once a directory becomes large, its on-disk representation stays large. If you're sorting files from a large directory into subdirectories, it's best to create a new tree rather than putting them under the formerly large directory; the old directory will never perform well, even once it's nearly empty. The only way to shrink it is to rmdir it and create a new one. (This also recovers the space the directory file itself consumes from your quota.) The migration loop in the sketch after this list shows one way to do this.
  • If you have a very large directory and just want it gone, contact me at linghelp@u with the path; I can delete it directly from the file server, which is usually somewhat faster.
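The wildcard and re-sorting points above both lend themselves to find. A minimal sketch, assuming a flat directory named bigdir whose filenames are at least two characters long; bigdir, newtree, and the 'run42-*' pattern are all placeholders:

# Delete matching files without building one huge argument list:
find bigdir -maxdepth 1 -type f -name 'run42-*' -exec rm {} +

# Migrate the remaining files into a fresh two-level tree, bucketed by
# the first two characters of each filename (bash syntax):
mkdir -p newtree
find bigdir -maxdepth 1 -type f -print0 |
while IFS= read -r -d '' f; do
    name=${f##*/}                          # strip the leading path
    dest=newtree/${name:0:1}/${name:1:1}   # e.g. run42-007 -> newtree/r/u/
    mkdir -p "$dest"
    mv "$f" "$dest/"
done

# Once bigdir is empty, remove it to shrink the directory file itself:
rmdir bigdir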