Numeric File Names

Usually when we scan a PDF file using some hardware (a mobile phone, a dedicated PDF scanner), the file name will read something like 2020_11_28_13_43_00.pdf. Many other semi-automated systems produce similar date-and-time-based file names.

Sometimes the file name may also contain the name of the application used, or other information such as the applicable DPI (dots per inch) or the scanned paper size.

When collecting PDF files from different sources, file naming conventions may differ significantly, and it can be useful to standardize on a numeric (or partly numeric) file name.

This also applies to other domains and sets of files: your recipes or photo collection, data samples generated by automated monitoring systems, log files ready for archiving, a set of SQL files for the database engineer, and generally any data collected from different sources with different naming schemes.

Bulk Rename Files to Numeric File Names

In Linux, it is easy to quickly rename a whole set of files with completely different names to a numerical sequence. "Easy" here means "easy to execute": the problem of bulk renaming files to numeric names is complex to code in itself; the one-liner script below took 3-4 hours to research, create and test. Many other commands I tried all had limitations which I wanted to avoid.

Please note that no warranties are given or provided, and this code is provided 'as is'. Please do your own research before running it. That said, I did test it successfully against files with various special characters, and also against more than 50k files without any file being lost. I also checked a file named 'a'$'\n''a.pdf', which contains a newline.
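Here is the one-liner, written out over several lines for readability. Treat it as a sketch rather than the only possible formulation: the -maxdepth 1 and -type f tests simply confine the rename to regular files in the current directory, and the exact quoting and option choices can be adapted to taste.

if [ ! -e _e ] && [ ! -e _c ]; then
    echo 'pdf' > _e    # the extension to work on; no leading dot
    echo 1 > _c        # the counter; the first renamed file becomes 1.pdf
    find . -maxdepth 1 -type f -name "*.$(cat _e)" -print0 |
        xargs -0 -n1 bash -c 'mv -n "$0" "$(cat _c).$(cat _e)"; echo $[ $(cat _c) + 1 ] > _c'
    rm _e _c
fi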

Let’s first look at how this works, and then analyze the command. We have created a directory with eight files, all named quite differently, except that their extensions match: all are .pdf. We next run the command above:
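For example, with eight arbitrarily (and hypothetically) named scans, the before-and-after could look like this:

$ ls
2020_11_28_13_43_00.pdf  CamScanner_11-28.pdf  'Scan 2020-11-28 (2).pdf'  'draft (final).pdf'
IMG_0007.pdf             Untitled.pdf          invoice_nov.pdf            scan_001.pdf
$ # ... run the one-liner shown above ...
$ ls
1.pdf  2.pdf  3.pdf  4.pdf  5.pdf  6.pdf  7.pdf  8.pdf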

The outcome was that the eight files were renamed to 1.pdf, 2.pdf, 3.pdf and so on, even though their original names differed widely.

The command assumes you do not have any files named 1.pdf to x.pdf yet. If you do, you can move those files into a separate directory, change the echo 1 to a higher number so that the renaming of the remaining files starts at a given offset, and then merge the two directories together again.
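For instance, if 1.pdf through 8.pdf are already taken and have been moved out of the way, starting the counter at 9 makes the newly renamed files begin at 9.pdf (the offset is of course up to you):

echo 9 > _c    # instead of echo 1 > _c: the first renamed file becomes 9.pdf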

Please always take care not to overwrite any files; it is a good idea to take a quick backup before changing anything.

Let’s look at the command in detail. It can help to add the -t option to xargs, which lets us see what is going on behind the scenes:
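In the sketch above, that means inserting -t right after the -0 option:

find . -maxdepth 1 -type f -name "*.$(cat _e)" -print0 |
    xargs -0 -t -n1 bash -c 'mv -n "$0" "$(cat _c).$(cat _e)"; echo $[ $(cat _c) + 1 ] > _c'

With -t, xargs prints each constructed bash -c command line to standard error just before executing it, so you can watch every individual rename as it happens.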

To start, the command uses two small temporary files (named _e and _c) for storage. At the start of the one-liner it does a safety check, using an if statement, to ensure that neither _e nor _c is already present. If a file with either name exists, the script will not proceed.
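In isolation, the check amounts to this (the -e test is one way to write it; -r or -f would work equally well here):

if [ ! -e _e ] && [ ! -e _c ]; then
    echo 'Safe to proceed: neither _e nor _c exists yet.'
fi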

On the topic of using small temporary files versus variables: while using variables would have been ideal (it saves some disk I/O), there were two issues I ran into.

The first is that if you export a variable at the start of the one-liner and then use that same variable later, another script using the same variable (including this script run more than once simultaneously on the same machine) may affect it, or be affected by it. Such interference is best avoided when it comes to renaming many files!

The second was that xargs in combination with bash -c seems to have a limitation in variable handling inside the bash -c command line. Even extensive research online did not provide a workable solution for this. Thus, I ended up using a small file, _c, to keep track of progress.

_e is the extension we will be searching for and using, and _c is a counter which is automatically increased on each rename. The echo $[ $(cat _c) + 1 ] > _c code takes care of this by reading the file with cat, adding one, and writing the result back.
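You can watch the counter mechanism in isolation (note that $[ ... ] is an older spelling of the $(( ... )) arithmetic expansion; both work in Bash):

echo 1 > _c
echo $[ $(cat _c) + 1 ] > _c
cat _c    # the file now contains 2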

The command also uses the best possible method of handling special characters in file names: null termination instead of the standard newline termination, i.e. separating file names with the \0 character rather than \n. This is ensured by the -print0 option to find, and by the -0 option to xargs.
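A small experiment shows why this matters; the touch command below creates a throwaway file whose name contains a newline:

touch 'a'$'\n''a.pdf'
# Null-terminated: the awkward name survives as a single argument and ls finds the file
find . -maxdepth 1 -name '*.pdf' -print0 | xargs -0 ls -l
# Newline-terminated: the name is split in two and ls fails on both halves
find . -maxdepth 1 -name '*.pdf' -print  | xargs ls -l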

The find command will search for any files with the extension specified in the _e file (created by the echo 'pdf' > _e command). You can change this to any other extension you want, but please do not prefix it with a dot. The dot is already included in the -name "*.$(cat _e)" specifier passed to find.
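For example, to work on JPEG files instead:

echo 'jpg' > _e    # note: 'jpg', not '.jpg'
find . -maxdepth 1 -type f -name "*.$(cat _e)" -print0 | xargs -0 -n1 echo    # quick check of what would be matched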

Once find has located all the files and sent them, null-terminated, to xargs, xargs renames the files one by one using the counter file (_c) and the same extension file (_e). To obtain the contents of the two files, a simple cat command is used, executed from within a subshell.
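Per file, xargs therefore ends up running something along these lines (the file name here is hypothetical; in the sketch above it arrives as $0 of the bash -c shell):

bash -c 'mv -n "$0" "$(cat _c).$(cat _e)"; echo $[ $(cat _c) + 1 ] > _c' './Some scanned document.pdf'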

The mv (move) command uses -n to avoid overwriting any file already present. Finally, we clean up by removing the two temporary files.
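A quick way to convince yourself of the -n behaviour:

touch 1.pdf 2.pdf
mv -n 2.pdf 1.pdf    # nothing happens: 1.pdf already exists, so mv -n refuses to overwrite it
ls                   # 1.pdf and 2.pdf are both still there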

While the cost of using two state files and forking subshells may be limited, it does add some overhead to the script, especially when dealing with a large number of files.

There are all sorts of other solutions to this same problem online, and many have tried and failed to create a fully working one. A lot of solutions forget various edge cases, like using ls without specifying --color=never, which may lead to color escape codes being parsed along with the file names when directory listing color coding is in use.

Yet other solutions missed handling files with spaces, newlines and special characters like " correctly. For this, the combination find … -print0 … | xargs -0 … is usually the indicated and ideal approach (and both the find and xargs manuals allude to this quite strongly).

While I do not consider my implementation the perfect or final solution, it seems a significant improvement over many of the other solutions out there: it uses find and null-terminated strings, ensuring maximum file name and parsing compatibility, and it has a few other niceties, like being able to specify a starting offset and being fully Bash-native.

Enjoy!