
Thursday, April 16, 2009

HowTo: A simple way to create a wordlist/dictionary

Hi all,

Last evening I needed to build a huge wordlist of my own, because I had to crack a salted password.
I started with the assumption that the password would most likely be an Italian, German or English word.
So I started searching the web for a large collection of words, and found an interesting website full of (old) books in (zipped) txt format. Nice.
After a quick look I realized it was going to be tedious to download each book manually, so I started reading the wget manual and discovered what a great application it is =)
(Note: I'm not going to explain what every command parameter means; if you don't understand, RTFM ;) )

$ wget -r -nd -N -erobots=off -A.zip \
 -I cyberbooks \
 "http://www.cyberbooks.it/cyberbooks/autori/ind_a.htm"


That downloaded all the zip files linked from the page (and its subpages) into the current directory. Then I unzipped them:

$ for i in *.zip; do unzip -o "$i"; done


And finally I deleted the temporary files, keeping only the txt files:

$ find . -maxdepth 1 -type f \! -name '*txt' \
 -exec rm -rf {} \;


Now, being sure that no files other than txt were in the directory, I put them all together, removed some unneeded characters, put each word on a new line, sorted them alphabetically, removed unprintable characters and kept only unique entries:

$ cat *.txt | sed -e 's/://g' -e 's/\.//g'\
 -e 's/!//g' -e 's/?//g' -e 's/-//g' \
 -e 's/;//g' -e 's/,//g' -e 's/*//g' \
 -e 's/(//g' -e 's/)//g' | \
 awk '{printf "%s",$0}!//{print}' | \
 sed -e 's/ /\n/g' | sort -u | \
 strings > test.txt


This method produced a wordlist, but there was a lot of garbage mixed in with the words, so I decided to take the opposite approach: instead of excluding the unwanted characters/strings, keep only the characters I needed and discard everything else. For this I used the 'tr' command.

$ cat *.txt | tr ' ' '\n' | \
 tr -d -c 'A-Za-z\300-\374\012\047' \
 | sort -u > part01.txt
$ cat part01.txt | wc -l
564486


Nice. Half a million words in my list. But I needed more: John the Ripper went through them in a couple of milliseconds, and no password was cracked :-(
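
For completeness, feeding the list to John the Ripper looks roughly like this; hashes.txt is just a placeholder name for the file holding the salted hash, and depending on the hash type you may need an explicit --format:

$ john --wordlist=part01.txt hashes.txt
$ john --show hashes.txt

Adding --rules makes john mangle each candidate word, which can help when the password is a simple variation of a dictionary word.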

So let's fetch more data from the internet.. I picked a famous Italian news portal, repubblica.it, and downloaded the whole site.. some gigs of HTML files ^^

$ wget -r -nd -N -erobots=off \
 -R.jpg,.gif,.png,.js,.css \
 "http://www.repubblica.it/index.html"


I made a little script to process the files one by one, because some of them had strange names and 'cat *' gave me errors.

$ cd repubblica/
$ for i in *; do \
 echo "processing $i" && cat "$i" | \
 tr -c 'A-Za-z\300-\374\012\047' ' ' | \
 tr ' ' '\n' >> ../tmp02.txt; \
 done
[...]
$ cat ../tmp02.txt | sort -u > ../part02.txt


Noticing that this gave only a small improvement, I decided to put a list of portals in a file and let wget download them all (a sketch of that step is below).
Then I repeated the same processing for each portal and merged everything into a nice wordlist:
$ cat part0* | sort -u > final_wordlist.txt
$ wc -l final_wordlist.txt
827624 final_wordlist.txt
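
The portal-list download itself isn't shown above, but it can be done with wget's -i option; something along these lines, where portals.txt and the URLs inside it are only placeholders:

$ cat portals.txt
http://www.corriere.it/
http://www.lastampa.it/
$ wget -r -nd -N -erobots=off \
 -R.jpg,.gif,.png,.js,.css \
 -i portals.txt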


The possibilities are now endless: just download the whole internet and make a big wordlist.. then send it to me =)

happy cracking ;)

U238 alias 0xF0rD.