The world needs a phishing corpus. How hard can it be to gather a few hundred phishing emails and assemble them into an untouched mailbox for study? Oddly, it's easy to do, yet no one seems to have done it.

I collect my spam, and always have; it's a compulsion by now. Between November 2004 and June 2005 I collected over 32,000 spam messages, yet only about 415 of them were phishing emails. This corpus is hand-selected from those messages and contains nothing but good old-fashioned phishing emails.

To the best of my knowledge, this is the first publicly available phishing corpus of its kind.

===== The files =====

414 messages from November 27, 2004, to June 13, 2005, covering a variety of common phishing schemes. 3119972 bytes. Format: UNIX mbox (plain text; processable with procmail, Python, Perl, etc.; imports into Eudora, Mail.app, and others).

434 messages from June 14, 2005, to November 14, 2005, covering all of the common phishing schemes. 4118879 bytes. Format: UNIX mbox.

1423 messages from November 15, 2005, to August 7, 2006, covering many of the newer phishing schemes. 10420294 bytes. Format: UNIX mbox.

2279 messages from August 7, 2006, to August 7, 2007 (a year!), covering a wide range of phish and targets. 20067215 bytes. Format: UNIX mbox.
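
Since every file is a plain UNIX mbox, Python's standard `mailbox` module should be enough to read them. The following is only a minimal sketch: the function name `summarize_mbox` is made up for illustration, and the path you pass in is whatever you saved a corpus file as. It counts the messages and tallies the domains claimed in the `From:` headers, which in phishing mail are usually forged.

```python
import mailbox
from collections import Counter
from email.utils import parseaddr


def summarize_mbox(path):
    """Return (message_count, Counter of From-header domains) for an mbox file.

    Hypothetical helper for illustration; it is not shipped with the corpus.
    """
    # create=False makes a missing file an error instead of creating an empty mbox
    box = mailbox.mbox(path, create=False)
    domains = Counter()
    count = 0
    try:
        for msg in box:
            count += 1
            # parseaddr returns (display_name, address); either may be empty
            _, addr = parseaddr(str(msg.get("From", "")))
            if "@" in addr:
                domain = addr.rpartition("@")[2].lower()
            else:
                domain = "(none)"
            domains[domain] += 1
    finally:
        box.close()
    return count, domains
```

Calling `summarize_mbox` on one of the files above should return a count matching the message total listed for that file, along with a `Counter` whose `most_common(10)` gives a rough picture of the most frequently impersonated sender domains.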

