There is only one rule of Internet safety that all people know. Never send personal information to suspicious websites. If this is the freshman rule, then there is a sophomore rule as well. Even websites with good intentions do not always succeed at keeping your data safe. For the longest time, I expected this to leave me with absolutely no recourse if the companies in question decided not to publicize their data breaches. However, there is a wonderful service at HaveIBeenPwned.com which can put some agency back in the hands of us users.
The maintainer, a security expert named Troy Hunt, keeps it up to date with information about hundreds of data breaches on major social networks, blogs and online stores. More ambitiously, you can also search for your email address (or your friend's email address) in order to directly see whether one of the accounts associated with it has been compromised. This is because Troy often obtains the precise databases that are sold by black hat hackers. But the benefits do not stop there. The best part of the site is a function to search for leaked passwords themselves. Whenever Have I Been Pwned knows about one of the passwords you type in, there can be no doubt that it's time to retire that password for good.
This is a function that can be used in three ways so I want to explain all three.
The third method, which is described on the official site, is based on hash algorithms. These allow you to take an arbitrary file and generate a fixed-length string of hexadecimal digits associated with it. The hexadecimal string is a fingerprint in the sense that tiny changes to the input will almost certainly produce drastic changes in the output. The simplest such program, called md5sum, is included on practically every Linux system, and most other Unix-like systems ship an equivalent. I use it all the time to clean up duplicates after experimenting with different changes to a file I'm working on. Programs like this are also used during quantum key distribution to account for the parts of this procedure that are inherently uncertain. But it's much more common to use hash-based integrity checks to verify that a file transfer has gone smoothly without running into a disk error, network error or malicious man-in-the-middle attack.
To guard against the last possibility, 128-bit MD5 hashes are too simple to be a good choice. This is because people have figured out how to reliably generate hash collisions, which are fake files having the same hash as the real file. The technique is a little more complicated than changing the fake file 2^128 times by brute force until you finally land on the right hash. But it is still the case that longer hashes tend to be more secure. HIBP uses SHA-1 (the algorithm behind sha1sum), which produces hashes that are 160 bits long. Collisions have been found for this one by now as well, but it is perfectly fine for applications that only send small chunks.
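To make the fingerprint idea concrete, here is a minimal Python sketch built on the standard hashlib module. The helper names (fingerprint, differing_digits) and the two sample messages are made up for illustration. It hashes two inputs that differ in a single character and reports how many hex digits of the result change, along with the digest length of each algorithm.

```python
import hashlib


def fingerprint(data: bytes, algorithm: str = "sha1") -> str:
    """Return the hex digest of data under the named hashlib algorithm."""
    return hashlib.new(algorithm, data).hexdigest()


def differing_digits(a: str, b: str) -> int:
    """Count the positions at which two equal-length hex strings differ."""
    return sum(x != y for x, y in zip(a, b))


if __name__ == "__main__":
    # Two made-up inputs that differ in a single character.
    original = b"meet me at the docks at nine"
    edited = b"meet me at the docks at nice"

    for algorithm in ("md5", "sha1"):
        first = fingerprint(original, algorithm)
        second = fingerprint(edited, algorithm)
        bits = hashlib.new(algorithm).digest_size * 8
        print(f"{algorithm}: {bits}-bit digest")
        print(" ", first)
        print(" ", second)
        print("  hex digits that differ:", differing_digits(first, second))

    # Size of the space a naive brute-force search over MD5 outputs would face.
    print("possible MD5 outputs:", 2 ** 128)
```

Running it should show that nearly every hex digit changes even though only one input character did, which is the property that makes these strings useful for spotting duplicates and corrupted transfers.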
Consider the password "jollyroger" which we can hash with a simple piped command.
$ echo -n "jollyroger" | sha1sum
1cf90f2f251a69e0f190bdccb9f7a2d84ad2b620 -
HIBP can perform a search based on the first 5 hexadecimal digits. The difficulty of working backwards from a hash, combined with the enormous number of possible passwords that share any given prefix, is what should make us feel comfortable announcing that our password is something that has "1CF90" as the first 20 bits of its hash. It is certainly much safer than announcing that the first two and a half letters are "j", "o" and then something in the first half of the alphabet. The search is performed by going to https://api.pwnedpasswords.com/range/1CF90 to get a page consisting of all pwned password hashes which start off that way. Since "jollyroger" is a common password, we can see that the rest of the hash, namely "F2F251A69E0F190BDCCB9F7A2D84AD2B620", indeed shows up in that list. At the time of writing, it has been seen 2756 times. As an exercise, we can try a different search to see that "areyousmarterthanafifthgrader" has not been seen at all. It would therefore be a good password if I hadn't discussed it on the blog just now!
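The same lookup can be scripted. What follows is a rough Python sketch rather than an official client. The function names are hypothetical and so is the User-Agent string. I am also assuming that the range endpoint answers with plain text containing one "SUFFIX:COUNT" pair per line, so check the API documentation before relying on it. Only the 5-digit prefix ever leaves your machine.

```python
import hashlib
import urllib.request

RANGE_URL = "https://api.pwnedpasswords.com/range/"  # endpoint quoted in the post


def split_hash(password: str) -> tuple[str, str]:
    """Return the 5-digit prefix and 35-digit remainder of the SHA-1 hash."""
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    return digest[:5], digest[5:]


def count_in_range(suffix: str, body: str) -> int:
    """Find suffix in a response body, assumed to hold 'SUFFIX:COUNT' lines."""
    for line in body.splitlines():
        candidate, _, count = line.strip().partition(":")
        if candidate.upper() == suffix.upper():
            return int(count)
    return 0  # suffix absent: this password is not in the returned list


def pwned_count(password: str) -> int:
    """Query the range endpoint, sending only the hash prefix over the network."""
    prefix, suffix = split_hash(password)
    request = urllib.request.Request(
        RANGE_URL + prefix,
        headers={"User-Agent": "hibp-range-sketch"},  # hypothetical client name
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        body = response.read().decode("utf-8-sig")
    return count_in_range(suffix, body)


if __name__ == "__main__":
    for candidate in ("jollyroger", "areyousmarterthanafifthgrader"):
        print(candidate, pwned_count(candidate))
```

Keeping the parsing in count_in_range separate from the network call means the matching logic can be checked offline against a saved response.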
Because this is all based on hashes, it doesn't really tell you that the users of websites affected by data breaches had the same passwords as you. While extremely unlikely, it is possible that all 2756 of those people were using different passwords that just happened to have the same SHA-1 value as "jollyroger". What it does do is tell you the passwords that have never been exposed so that you can err on the side of caution and change all the others. More to the point, your true password does not need to have leaked to still put you in a compromised position. Websites that know what they're doing don't store your password at all after you sign up. They limit themselves to only ever storing the hash value so that intruders gaining access to their database still have to go to the trouble of finding inputs that produce those hashes. Of course this is sometimes possible. Data breaches often expose hashes for millions of accounts and somebody trying to sniff out credit card information just needs to be able to log in to one of them.
Another trick that improves security is adding salt to the end of the password before the hash is computed. This is a fixed string of digits, randomly generated when the user signs up. To allow people to log in again, it has to be stored along with the username. And this means that someone in possession of leaked password hashes usually has access to the associated salt as well. So doesn't this make it useless? Isn't figuring out that "jollyroger84B03d034b409d4e" hashes to "b08d5aed1ab68d6e5c5d4e2848e15f59acf427f2" just as easy as figuring out that "jollyroger" hashes to "1cf90f2f251a69e0f190bdccb9f7a2d84ad2b620"? In fact it is much harder. In the unsalted case, a dictionary containing hashes for millions of short combinations of English words that you use on the first password can be immediately reused for the second. The processing time required to create the dictionary (or more effectively a rainbow table) is therefore a one-time cost. This changes completely when there is a different piece of salt adorning each password that one is trying to crack.
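Here is a toy Python sketch of that difference. Everything in it is an assumption made for illustration: the function names, the three-word list standing in for a real cracking dictionary, and the 16-digit salt. A single round of SHA-1 is used only to mirror the examples above, not because it is what a real site should use.

```python
import hashlib
import secrets
from typing import Optional


def sha1_hex(text: str) -> str:
    """Hex SHA-1 digest of a string."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def new_record(password: str) -> tuple[str, str]:
    """Return (salt, hash) as a site might store them at sign-up."""
    salt = secrets.token_hex(8)  # 16 random hex digits, fresh for each user
    return salt, sha1_hex(password + salt)  # salt appended, as in the post


def check_login(password: str, salt: str, stored_hash: str) -> bool:
    """Recompute the salted hash and compare it with the stored one."""
    return sha1_hex(password + salt) == stored_hash


# The attacker's precomputed dictionary: built once, reusable on any unsalted hash.
WORDS = ["password", "letmein", "jollyroger"]
DICTIONARY = {sha1_hex(word): word for word in WORDS}


def crack_unsalted(leaked_hash: str) -> Optional[str]:
    """A single lookup, with no further hashing needed."""
    return DICTIONARY.get(leaked_hash)


def crack_salted(salt: str, leaked_hash: str) -> Optional[str]:
    """The whole word list has to be hashed again for every distinct salt."""
    for word in WORDS:
        if sha1_hex(word + salt) == leaked_hash:
            return word
    return None


if __name__ == "__main__":
    salt, stored = new_record("jollyroger")
    print("unsalted lookup:", crack_unsalted(sha1_hex("jollyroger")))
    print("same table on the salted hash:", crack_unsalted(stored))
    print("re-hashing with the leaked salt:", crack_salted(salt, stored))
```

With three words the extra work is invisible, but with millions of dictionary entries and millions of leaked accounts, repeating the loop in crack_salted for every salt is exactly the cost that the precomputed table was meant to avoid.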
After I was sold on how useful this technology is, I started looking up all of my passwords on HIBP. About half of them had never been pwned. The other half mostly consisted of embarrassingly short words that are probably used by hundreds of similarly stupid people in my generation. But a few passwords that I thought would be secure showed up in the pwned archive as well. This could be because I typed them into too many public computers or because a company which upgraded its hard drives somewhere forgot to properly wipe the old ones. When I tried to change the affected passwords, I was only able to do so about one third of the time. In another third of the cases, the original site had gone out of business. The rest of the time, I found myself unable to log in despite having never changed the password. This must mean the company deleted my account due to inactivity. Either that or it was already hacked.
Considering how many random sites I've joined over the years, making these changes took longer than expected. Obviously, I would still strongly recommend that everyone put in the hours. The twelve days of Christmas are over but the accounts I just hardened (many of which I had forgotten all about) are on: 9 online stores, 6 physics websites, 6 software websites, 5 random forums, 4 random wikis, 4 social networks, 3 review aggregators and 1 site for a charity!