In this article we will show you how to download files from the internet with wget, a tool that ships with Ubuntu. We will also show you how to schedule the download with cron.

Download Using Wget

Wget is a free software package for retrieving files using HTTP, HTTPS and FTP, the most widely-used Internet protocols. It is a non-interactive command line tool, so it may easily be called from scripts, cron jobs, terminals without X-Windows support, etc.

Open your terminal and let’s explore how we can use wget to download stuff from the net. The basic syntax of downloading with wget is the following:

wget [option]… [URL]…

This command downloads the wget manual into the current directory:

wget http://www.gnu.org/software/wget/manual/wget.pdf
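Because wget runs without prompting and exits with status 0 on success, a script can act on the result. Here is a minimal sketch of that idea. The helper name fetch_or_report is made up for this example and is not part of wget.

```shell
# fetch_or_report is a hypothetical helper name, not part of wget.
fetch_or_report() {
    url=$1
    # --quiet suppresses wget's progress output; the exit status is 0 on success.
    if wget --quiet "$url"; then
        echo "downloaded: $url"
    else
        echo "failed: $url" >&2
        return 1
    fi
}

# Example call (left commented out so nothing is downloaded until you choose to):
# fetch_or_report http://www.gnu.org/software/wget/manual/wget.pdf
```

A cron job or a larger script could call the function and decide what to do next based on its return value.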

Linux Cron

Ubuntu comes with a cron daemon used for scheduling tasks to be executed at a certain time. Crontab allows you to specify actions and times that they should be executed. This is how you would normally schedule a task using the command line tool.

Open a terminal window and enter crontab -e.

The fields of a crontab entry are separated by spaces, and only the final field (the command) may itself contain spaces. A cron entry consists of minute (0-59), hour (0-23, 0 = midnight), day of month (1-31), month (1-12), weekday (0-6, 0 = Sunday), and the command. The third entry in the above crontab downloads wget.pdf at 2 AM. The first field (0) and the second field (2) mean 2:00. The third to fifth fields (*) mean every day of the month, every month, and every day of the week. The last field is the wget command that downloads wget.pdf from the specified URL.
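To make the field order concrete, here is a small sketch that splits one crontab line into its five schedule fields and the command. The function name explain_cron_entry is invented for this illustration and is not part of cron, and the sample line is reconstructed from the description above. The function does no validation.

```shell
# explain_cron_entry is a hypothetical helper for illustration; it is not part of cron.
explain_cron_entry() {
    # read splits the line on whitespace; the last variable (cmd) receives
    # everything that is left, so the command keeps its own spaces.
    printf '%s\n' "$1" | {
        read -r minute hour day month weekday cmd
        printf 'minute=%s hour=%s day=%s month=%s weekday=%s\n' \
            "$minute" "$hour" "$day" "$month" "$weekday"
        printf 'command=%s\n' "$cmd"
    }
}

explain_cron_entry '0 2 * * * wget http://www.gnu.org/software/wget/manual/wget.pdf'
# prints:
# minute=0 hour=2 day=* month=* weekday=*
# command=wget http://www.gnu.org/software/wget/manual/wget.pdf
```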

That covers the basics of wget and cron. Let’s take a look at a real-life example of how to schedule a download.

Scheduling Download

We are going to download Firefox 3.6 at 2 AM. Since our ISP only gives us a limited amount of data, we need to stop the download at 8 AM. This is what the setup looks like.

Ignore the first two entries in the above crontab. The third and fourth commands are the only two that you need. The third command sets up a task that downloads Firefox at 2 AM:

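[code] 0 2 * * * wget -c 'http://download.mozilla.org/?product=firefox-3.6.6&os=win&lang=en-GB' [/code]

The -c option tells wget to resume an existing download if it has not been completed. The URL is wrapped in single quotes because cron passes the command to a shell, which would otherwise treat each & as the end of the command.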

The fourth command stops wget at 8 AM. ‘killall’ is a Unix command that kills processes by name.

[code] 0 8 * * * killall wget [/code]

This entry tells Ubuntu to stop wget at 8 AM. Note that it stops every wget process running under your account, not only this download.
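The start and stop entries can also be generated, which helps when the hours or the URL change. The sketch below only prints the two crontab lines and does not install them. The function name download_window_entries is hypothetical.

```shell
# download_window_entries is a hypothetical helper, not part of cron or wget.
# It prints two crontab lines: start wget at start_hour, stop it at stop_hour.
download_window_entries() {
    start_hour=$1
    stop_hour=$2
    url=$3
    # The URL is single-quoted in the output so the shell that cron starts
    # does not treat & as "run in the background".
    printf "0 %s * * * wget -c '%s'\n" "$start_hour" "$url"
    printf '0 %s * * * killall wget\n' "$stop_hour"
}

download_window_entries 2 8 'http://download.mozilla.org/?product=firefox-3.6.6&os=win&lang=en-GB'
# prints:
# 0 2 * * * wget -c 'http://download.mozilla.org/?product=firefox-3.6.6&os=win&lang=en-GB'
# 0 8 * * * killall wget
```

Paste the printed lines into the editor opened by crontab -e. If a URL contains a percent sign, write it as \% in the crontab, because cron gives an unescaped % a special meaning.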

Other useful wget commands

  1. Specifying the directory to download a file

[code] wget --output-document="/home/zainul/Downloads/wget manual.pdf" http://www.gnu.org/software/wget/manual/wget.pdf [/code]

The --output-document option lets you specify the directory and the name of the file that you download. The path is quoted because the file name contains a space.
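If the target directory might not exist yet, a small wrapper can create it before calling wget. This is a sketch only. The name save_as is made up, while --output-document is the real wget option used above.

```shell
# save_as is a hypothetical wrapper name; --output-document is a real wget option.
save_as() {
    target=$1
    url=$2
    # Create the destination directory in case it does not exist yet.
    mkdir -p "$(dirname "$target")" || return 1
    # Quoting "$target" keeps a file name with spaces in one piece.
    wget --output-document="$target" "$url"
}

# Example call (left commented out so nothing is downloaded until you choose to):
# save_as "$HOME/Downloads/wget manual.pdf" http://www.gnu.org/software/wget/manual/wget.pdf
```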

  2. Downloading a website

wget can also download a website.

[code] wget -m http://www.google.com/profiles/zainul.franciscus [/code]

The above command will download my entire Google profile web page. The option ‘-m’ tells wget to download a ‘mirror’ image of the specified URL.

Another important option tells wget how many levels of links it should follow when it downloads a website.

[code] wget -r -l1 http://www.google.com/profiles/zainul.franciscus [/code]

The above wget command uses two options. The first option ‘-r’ tells wget to download the specified website recursively. The second option ‘-l1’ tells wget to follow only the first level of links from that website. Larger values such as ‘-l2’ and ‘-l3’ follow links deeper; if ‘-l’ is omitted, wget stops at a depth of 5.
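Since the depth is just a number passed to -l, a script can take it as a parameter and reject anything that is not a whole number before wget starts. Here is a minimal sketch of that. The name fetch_levels is hypothetical.

```shell
# fetch_levels is a hypothetical helper name, not part of wget.
fetch_levels() {
    depth=$1
    url=$2
    # Accept only digits, so a typo does not turn into an unexpected wget argument.
    case $depth in
        ''|*[!0-9]*)
            echo "depth must be a whole number" >&2
            return 1
            ;;
    esac
    # -r enables recursive download; -l sets how many levels of links to follow.
    wget -r -l "$depth" "$url"
}

# Example call (left commented out so nothing is downloaded until you choose to):
# fetch_levels 1 http://www.google.com/profiles/zainul.franciscus
```

Keep in mind that wget treats a depth of 0 as unlimited, so pass at least 1 for a bounded download.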

  3. Ignoring robots.txt

Webmasters maintain a text file called robots.txt. It lists the URLs that a web crawler such as wget should not crawl. We can tell wget to ignore robots.txt with the ‘-erobots=off’ option. The following command tells wget to download the first page of my Google profile and ignore robots.txt.

[code] wget -erobots=off http://www.google.com/profiles/zainul.franciscus [/code]

Another useful option is -U, which sets the User-Agent string so that wget identifies itself as a browser. Take note that disguising one application as another may violate the terms of service of a web service provider.

[code] wget -erobots=off -U Mozilla http://www.google.com/profiles/zainul.franciscus [/code]

Conclusion

Wget is a very old-school yet hackable GNU software package that we can use to download files. Wget is a non-interactive command line tool, which means we can let it run on our computer in the background without having to start any other application. Check out the wget man page

[code] $ man wget [/code]

to understand other options that we can use with wget.
