Last night I needed to build a huge wordlist of my own, because I had to crack a salted password.
I started from the assumption that the password was likely to be an Italian, a German or an English word.
So I started searching the web for a huge amount of words to collect, and found an interesting website full of (old) books in (zipped) txt format. Nice.
After a quick look I noticed that it was going to be boring to download each book manually, so I started reading the wget manual, and realized what a great application it is =)
(Note: I'm not going to explain what every command parameter means; if you don't understand one, RTFM ;) )
$ wget -r -nd -N -erobots=off -A.zip \
    -I cyberbooks \
    "http://www.cyberbooks.it/cyberbooks/autori/ind_a.htm"
That downloaded all the zip files linked from that page (and from all its subpages) into the current directory. Then I unzipped them:
$ for i in *.zip; do unzip -o "$i"; done
And finally I deleted the temporary files and kept only the txt files:
$ find . -maxdepth 1 -type f \! -name '*txt' -exec rm -rf {} \;
Now, being sure that no files other than txt were left in the directory, I put them all together, removed some unneeded characters, put each word on a new line, sorted them alphabetically, removed unprintable characters and kept only unique entries:
$ cat *.txt | sed -e 's/://g' -e 's/\.//g' \
    -e 's/!//g' -e 's/?//g' -e 's/-//g' \
    -e 's/;//g' -e 's/,//g' -e 's/*//g' \
    -e 's/(//g' -e 's/)//g' | \
    awk '{printf "%s",$0}!//{print}' | \
    sed -e 's/ /\n/g' | sort -u | \
    strings > test.txt
This method resulted in a wordlist, but there was a lot of garbage around the words, so I decided to take the opposite approach: instead of excluding unwanted characters/strings, keep only the characters I needed and discard everything else. For this operation I used the 'tr' command.
$ cat *.txt | tr ' ' '\n' | \
    tr -d -c '[A-Za-z][\300-\374]\012\047' \
    | sort -u > part01.txt
$ cat part01.txt | wc -l
564486
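To see what the whitelist approach does to a line of text, here is a minimal sketch on a tiny sample. The function name extract_words is made up for this example, and so is the extra grep that drops empty lines. I also left out the square brackets from the tr set, since tr treats them as literal characters to keep. It assumes the input is plain ASCII or ISO-8859-1 (Latin-1); the \300-\374 byte range would not match accented letters in UTF-8 text.

```shell
# Hypothetical helper: split text on spaces and keep only whitelisted bytes.
# LC_ALL=C makes the tools work on raw bytes, so \300-\374 covers the
# accented letters of Latin-1 input (assumption: the input is not UTF-8).
extract_words() {
    LC_ALL=C tr ' ' '\n' |
        # delete everything except letters, newline (\012) and apostrophe (\047)
        LC_ALL=C tr -d -c 'A-Za-z\300-\374\012\047' |
        # tokens made only of digits or punctuation become empty lines; drop them
        LC_ALL=C grep -v '^$' |
        LC_ALL=C sort -u
}

# Small sample: punctuation and digits disappear, the apostrophe survives.
printf "Hello, world! (hello) l'acqua... 42\n" | extract_words
```

On this sample the result is Hello, hello, l'acqua and world, one per line. Upper and lower case are kept apart, just like in the real command above.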
Nice. Half a million words in my list. But I needed more: John the Ripper checked them in a couple of milliseconds, and no password was cracked :-(
So let's fetch more data from the internet.. I took a famous Italian news portal, repubblica.it, and downloaded the whole portal.. some gigs of html files ^^
$ wget -r -nd -N -erobots=off \
    -R.jpg,.gif,.png,.js,.css \
    "http://www.repubblica.it/index.html"
I made a little script to handle the processing of each file, because some files had strange names and cat * gave me errors.
$ cd repubblica/
$ for i in *; do \
    echo "processing $i" && \
    tr -c '[A-Za-z][\300-\374]\012\047' ' ' < "$i" | \
    tr ' ' '\n' >> ../tmp02.txt; \
  done
[...]
$ cat ../tmp02.txt | sort -u > ../part02.txt
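If the strange file names keep causing trouble, one alternative is to let find hand the files over instead of looping in the shell. This is only a sketch: the function name collect_words and its two arguments are invented for the example, and the tr set is written without the square brackets because tr would keep them as literal characters.

```shell
# Hypothetical helper: append the words of every regular file in a
# directory to an output file, whatever the file names look like.
# Usage (assumed names): collect_words repubblica tmp02.txt
collect_words() {
    src_dir=$1
    out_file=$2   # must live outside src_dir, or it would be read as input
    # find passes each name to sh as a single argument, so spaces, leading
    # dashes or glob characters in file names cause no trouble.
    find "$src_dir" -maxdepth 1 -type f -exec sh -c '
        for f do
            # turn every non-whitelisted byte into a space, then split on spaces
            LC_ALL=C tr -c "A-Za-z\300-\374\012\047" " " < "$f" | tr " " "\n"
        done' sh {} + >> "$out_file"
}
```

The output still contains duplicates and empty lines, so it needs the same sort -u pass as before.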
Seeing that this brought only a small improvement, I decided to put a list of portals in a file, and let wget download them all.
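A rough sketch of how that can look, using wget's -i option to read the start URLs from a file. The function name fetch_portals, the list file name and the WGET override are assumptions made for this example, not the exact command I ran.

```shell
# Hypothetical helper: crawl every start URL listed in a file, one per line.
# WGET can be overridden (e.g. WGET=echo) to preview the command line
# without downloading anything.
fetch_portals() {
    list_file=$1
    # same options as the single-portal run, plus -i to read URLs from a file
    ${WGET:-wget} -r -nd -N -erobots=off \
        -R.jpg,.gif,.png,.js,.css \
        -i "$list_file"
}

# Usage (portals.txt is an assumed file name):
#   fetch_portals portals.txt
```

Keep in mind that -nd drops everything into the current directory, so with several portals the files end up mixed together. For building a wordlist that does not matter.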
Then I repeated the same operation for each portal, and got a nice wordlist:
$ cat part0* | sort -u > final_wordlist.txt
$ wc -l final_wordlist.txt
827624 final_wordlist.txt
The possibilities are now endless: just download the whole internet and make a big wordlist.. then send it to me =)
Happy cracking ;)
U238 alias 0xF0rD.