Poor performance due to excessively large directories
Understanding the problem
Directories with a large number of files become extremely inefficient to work with. This is because directories have to be read linearly (one entry at a time), and many commands read the whole directory before operating on it. In one test, listing a directory with 500,000 files took 46 seconds to complete; doing a wildcard match on a single file in the same directory took seven seconds. Operations that find and operate on multiple files suffer especially badly -- running "rm -rf" on this directory took over two hours.
This is not just a performance problem for your own jobs; it also can cause sluggish performance for other users due to the large amount of disk activity involved in accessing these directories.
How big is too big?
Smaller is always better, but I recommend keeping each directory to fewer than 10,000 files. By 100,000 files, performance declines extremely rapidly.
Tip: The size of a directory in an ls listing on our system usually tells you the number of entries inside it. (This is not 100% true, but it's a good approximation, and much quicker than running
ls -f | wc -l
on a large directory.) For example, the directory below has 184 entries:
brodbd@dryas:~$ ls -ld test
drwx------ 2 brodbd brodbd 184 Apr 14 15:07 test
An empty directory will have a size of 2, because every directory has at least two entries: "." and "..".
Avoiding the problem
The solution is to split the files up into subdirectories, allowing for a much more efficient search. It doesn't really matter how you do this, as long as the resulting structure is easy for your software to create and search.
Example: On sites that have a very large number of users, searching /home for a user's home directory can become a performance bottleneck. A common solution is to create directories /home/a/ through /home/z/, then underneath each one create another set of a/ through z/. For a user named "brodbd" their home directory then goes in /home/b/r/brodbd. This creates 676 subdirectories (effectively, hash buckets) in a two-level structure.
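A rough sketch of that bucketing scheme follows; the /tmp path and variable names are illustrative assumptions for the demo, not part of the real /home layout:

```shell
# Sketch: place a user's directory under a two-level bucket derived
# from the first two letters of the username. Uses bash substring
# expansion; the base path is a stand-in, not a real /home.
base=/tmp/homedemo
user=brodbd
b1=${user:0:1}   # first letter  -> "b"
b2=${user:1:1}   # second letter -> "r"
mkdir -p "$base/$b1/$b2/$user"
echo "$base/$b1/$b2/$user"   # -> /tmp/homedemo/b/r/brodbd
```

Each of the 676 buckets then holds only a small slice of the users, so any one directory stays small.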
Be particularly careful when designing large parallel jobs; it's very easy to end up with a lot of files in one directory. Consider splitting each run into its own subdirectory.
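For example, a job script might create a per-run output directory before writing anything. The RUN_ID variable and /tmp path below are hypothetical; a real scheduler would supply its own task-ID variable:

```shell
# Sketch: give each run of a parallel job its own output directory,
# so no single directory accumulates files from every run.
RUN_ID=${RUN_ID:-run001}          # hypothetical; use your scheduler's task ID
outdir="/tmp/results/$RUN_ID"     # /tmp base path is an assumption
mkdir -p "$outdir"
: > "$outdir/output.txt"          # each run writes only inside its own dir
```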
If you've accidentally created a large, flat directory, how can you deal with it? Even simple operations on a directory of this size can be difficult to complete in a reasonable amount of time.
- ls is going to take a while, because it has to read the whole directory and then sort it. Using the -f flag to skip sorting can help.
- Do not try to access large directories via Windows file sharing. Windows Explorer tries to read and analyze each file before displaying it. Even on local filesystems this starts to break down after a few tens of thousands of files.
- Be careful with wildcards. A command like rm * first expands the *, creating one long list of filenames that may exceed the maximum command-line length. You may need to use the find command with its -exec option to process files, or use wildcards that only match small subsets of the files at a time.
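A minimal sketch of the find-based approach; the /tmp/bigdir path and the .tmp files are made up for the demo:

```shell
# Create a small stand-in for a huge directory (path is hypothetical).
mkdir -p /tmp/bigdir && touch /tmp/bigdir/a.tmp /tmp/bigdir/b.tmp

# -exec ... {} + batches many filenames into each rm invocation,
# staying under the kernel's argument-length limit, and never expands
# a "*" wildcard on the shell command line.
find /tmp/bigdir -maxdepth 1 -type f -name '*.tmp' -exec rm -- {} +
```

Because find streams directory entries to rm itself, it works no matter how many files the directory holds.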
- Because of the way directories are handled internally by the filesystem, once a directory becomes large, its representation on disk stays large. If you're sorting files from a large directory into subdirectories, it's best to create a new tree instead of putting them under the formerly large directory; the old directory will never perform well, even if it's nearly empty. The only way to shrink it is to rmdir it and create a new one. (This also recovers the space the directory file itself consumes from your quota.)
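A sketch of rebuilding into a fresh tree; the paths are hypothetical, and real filenames containing spaces or newlines would need more careful handling:

```shell
# Stand-in for a formerly huge flat directory (paths are assumptions).
old=/tmp/flatdir
new=/tmp/newtree
mkdir -p "$old" && touch "$old/alpha.txt" "$old/beta.txt"

# Move each file into a one-letter bucket under a brand-new tree.
find "$old" -maxdepth 1 -type f | while read -r f; do
  name=$(basename "$f")
  bucket=${name:0:1}            # first letter of the name as the bucket
  mkdir -p "$new/$bucket"
  mv "$f" "$new/$bucket/"
done

# Remove the old directory entirely so its oversized on-disk
# representation (and the quota it consumes) is actually freed.
rmdir "$old"
```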
- If you have a very large directory and just want it gone, contact me at linghelp@u with the path; I can delete it directly from the file server, which is usually somewhat faster.