Sunday, June 25, 2017

32hex is not MD5? What are Youku talking about?

 



In April 2017, various online sources alleged that Youku, a Chinese video hosting service, had been hacked and that roughly 100 million user accounts were compromised. These sources stated that Youku usernames, along with passwords hashed with the MD5 and SHA1 algorithms, were leaked. We decided to take a closer look in early June and present our findings in this post.

Of the 99,075,692 lines of data in the leak provided to us, we were able to extract 99,028,838 usable hash strings. The extracted hashes varied in length from 30 to 32 ASCII-hex characters, suggesting to us that they were MD5-like rather than SHA1. After de-duplication we were left with 57,205,528 hashes, suggesting there was password re-use in this data.

A common practice, especially on Chinese websites, is for developers to employ a form of security through obscurity in their password storage schemes. We suspect this is most likely done to deter the hashes from being loaded into off-the-shelf password crackers. Another explanation would be that mistakes were made in processing the data.

As we started to work on this data set, it quickly became apparent that there were more than just MD5 hashes in this file.  We were able to identify both iterated MD5 hashes, as well as more complex sub-string iterated hashes.  Each of these also appeared as a chopped (last digits removed) value as well.  The majority of hashes were MD5($pass), but we found a sizeable number of MD5(MD5($pass)) and MD5(MD5(MD5($pass))).  The substring hashes were of the form MD5(substr(MD5($pass),8,16)).
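The variants listed above can be sketched in a few lines of Python. This is only an illustration, not the tooling we used: it assumes that the iterated forms hash the lowercase hex string of the inner digest, that substr(x,8,16) means offset 8 and length 16 (PHP-style), and that a chopped hash is simply a prefix of the full digest. The function names are made up for this example.

```python
import hashlib


def md5_hex(data: str) -> str:
    """Return the lowercase hex MD5 digest of a UTF-8 encoded string."""
    return hashlib.md5(data.encode("utf-8")).hexdigest()


def candidate_hashes(password: str) -> dict:
    """Compute the four MD5 variants described above for one password."""
    h1 = md5_hex(password)
    h2 = md5_hex(h1)  # assumption: the inner digest is hashed as lowercase hex
    h3 = md5_hex(h2)
    # Assumption: substr(x, 8, 16) is offset 8, length 16,
    # i.e. hex characters 8..23 of the inner digest.
    sub = md5_hex(h1[8:24])
    return {
        "MD5($pass)": h1,
        "MD5(MD5($pass))": h2,
        "MD5(MD5(MD5($pass)))": h3,
        "MD5(substr(MD5($pass),8,16))": sub,
    }


def identify(leaked: str, password: str, min_len: int = 30):
    """Return the variant whose digest starts with the (possibly chopped) leaked hash."""
    if not min_len <= len(leaked) <= 32:
        return None
    for name, digest in candidate_hashes(password).items():
        # A chopped hash has its last digits removed, so compare as a prefix.
        if digest.startswith(leaked):
            return name
    return None
```

Calling identify() with a 30 or 31 character hash and a known plain then reports which scheme produced it, which is roughly what marking the hashes with a dictionary of cracked passwords amounts to.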

The number of different MD5 variations used to hash the passwords could be attributed to a number of factors; we cannot know for certain and can only make assumptions. The simplest explanation is that the developers changed the hashing method across successive updates to their website. Another is that they merged with another service and imported its user accounts along with their hashes. Alternatively, different account types, such as operators and users, may have used different hashing schemes.

Dealing with the chopped hashes was not a problem for our tools. MDXfind natively supports partial matching of hashes, but we did modify hashcat to support these as well. See below for an example patch based on hashcat 3.6.0. A “clean” version including MD5sub8-24MD5 may be released at a later point. This required both small changes in the input parser, as well as the kernel code. We then ran the cracked passwords as a dictionary with MDXfind to mark the hashes correctly.

 

Of the 99 million hashes we parsed, we were able to recover 94.836 million, a success rate of roughly 95.7%. Interestingly, we noticed about 1.5 million MD5-like hashes in uppercase ASCII-hex form, as opposed to lowercase like the rest. We were not able to recover any of these hashes, and it is possible they are either salted or use a more exotic algorithm.

We found 48 million unique passwords, which solved the 94.8 million hashes.  The top-25 passwords for this list are typical for this type of website. It is interesting to note that the fourth most common password, ‘xuanchuan’, is the romanized form of 宣傳, which translates to “propaganda” in English.


Perhaps the most interesting thing about this leak was the number of “created” or “generated” accounts we found.  Many, perhaps even the majority, of the accounts use what we consider to be generated email addresses and certainly machine-generated passwords.  While the exact number is difficult to calculate with certainty, we suspect tens of millions of these accounts are generated.

For example, there are 222 accounts we believe were created on October 10, 2011, at 14:25:03, all with 11 character random usernames @qq.com.  Why do we believe this?  Because they share exactly the same password: “2011-10-10 14:25:03”.   These accounts are part of a larger group of 606,733 accounts all created that day, presumably between 14:25 and 15:33.  There were an additional 22,741 accounts similar to these created, we believe, on October 14, 2011 - again with a similar style of @qq.com accounts (but using 9 character user names).  We do not believe that any of these qq.com accounts exist.

Another example is the uppercase ASCII-hex hashes. 1,563,853 of these (all but 1,538) have email addresses like this: 037d6909-04a9-4b45-a309-157ef846c573@qzone.com. Having a UUID as the email address is strange enough, but we also looked into qzone.com. The records of DNSTrails show that an MX record for this domain only existed between October 2008 and August 2009. The Wayback Machine at archive.org also has no captures from that period. These facts lead us to believe that these are generated accounts.

One thing to take from this is that security through obscurity doesn't really help. It is also interesting to see how many different plays on MD5 are used in this leak. It is always a good idea not to assume a single hash algorithm is being used, even if the hashes come from a single data set. Hopefully we have provided an interesting read. We would love to find out why there are 1.5M hashes which seem slightly different to the rest, so if you know something, contact us.



Friday, July 8, 2016

Bitcrack / Hashkiller contest write-up 2016




 

Members
Amd

gearjunkie

mastercracker

usasoft

blazer

hops

Milzo

User

cvsi

Jimbas

s3in!c

Waffle

espira

jugganuts420

tony

winxp5421

Software
HashcatV3 & HashcatV2 (https://hashcat.net/), MDXfind (https://hashes.org/mdxfind.php), hashtopussy (fork of the hashtopus project), TeamLogic (hash management platform), Unified List Manager (http://unifiedlm.com/)

Hardware


        GPU (GH)      CPU (cores)   CPU (cores) [bcrypt only]   Total (cores)
Base    150 (SHA1)    100           130                         230
Peak    190 (SHA1)    250           300                         550

A constant combined compute power of 150 GH (measured on SHA1 bruteforce) was used throughout the contest. This figure peaked at about 190 GH, which is the rough equivalent of 35 GTX 980Ti cards. Around 130 CPU cores were reserved solely for GPU-unfriendly algorithms; this burst to a maximum of 300 cores for a short period. An additional 100 CPU cores were used for all other algorithms, peaking at 250 cores.
Strategy
  • Free-for-all approach
  • Have fun
  • Utilize resources efficiently
  • Surprise the other teams

Before the contest
We redeveloped our hash management system and ensured it was fully functional prior to the contest. In addition, we had the pleasure of beta testing a personal project of one of our members: an improved distributed hashcat system dubbed Hashtopussy (a fork of the hashtopus project), with numerous improvements including a revamped interface, multi-user and user-rights-management support, optimized hash handling and, of course, support for Hashcat3. Keep an eye out for this project, as it will be released soon.

Hashtopussy instances were deployed, allowing the team to remotely manage and voluntarily donate compute cycles, deploy tasks across clusters of compute nodes and streamline the cracking process. As hashcat is now open source (big thanks to the hashcat developers), we were able to easily apply minor changes to ensure it played nicely in a distributed environment.

During the contest
We started off by probing all algorithms, looking for any signs of patterns, and tackled the bcrypts immediately by running extremely simple checks against common passwords.  We recovered about 20 bcrypts within the first hour on our CPU cluster and were able to feed it enough test candidates to yield hits consistently.

MDXfind was used to quickly test algorithms which hashcat couldn’t initially handle, namely DCC, with Waffle quickly adding WBB support. Once we knew these hashes were valid, support for both of these algorithms was swiftly added to hashcat.

As there is already a write-up regarding the patterns for the generated hashes, we won’t go into them, other than to say that we spotted some, missed others and discovered some too late in the contest. Eleven hours into the contest we had hits for every algorithm except phpbb3_gen, which we didn’t waste too much time pursuing. This was a pretty good starting point and kept us busy for the remainder of the time.

To make it up to those who complained that our large submission towards the end of the contest skewed any pretty graphs, we have decided to provide analytics gathered by our hash management system. The graphs should reflect the actual crack progression for each individual hashlist throughout the contest, and should provide some insight into how we tackled each hashlist.

Graphs for real hashlists


Graphs for generated hashlists

Interesting observations
As a portion of the hashes came from real environments, there is always a chance that hashes are mislabeled. We identified some DoubleMD5 hashes labelled as MD5; these were tackled by cracking the initial MD5 list as DoubleMD5, then performing a single MD5 on the password prior to submission. We also identified vBulletin <3.8.5 hashes which were mislabeled as MD5:pass, with the salt being given as the plain for this MD5. There was no possible way to submit these since they were technically already solved.
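The DoubleMD5 workaround can be illustrated with a short Python sketch. It is a simplified stand-in for what was actually run through the cracking tools, and the function name and wordlist are hypothetical. The idea is that when MD5(MD5($pass)) is labelled as plain MD5, the plain accepted for the list is the hex digest of the inner MD5, not the password itself.

```python
import hashlib


def md5_hex(data: str) -> str:
    """Return the lowercase hex MD5 digest of a UTF-8 encoded string."""
    return hashlib.md5(data.encode("utf-8")).hexdigest()


def crack_as_double_md5(target_hashes, wordlist):
    """Map each cracked hash to the plain that should be submitted for it.

    The targets are treated as MD5(MD5($pass)). Because the list is
    labelled MD5, the submitted plain is the inner digest MD5($pass),
    which hashes to the target with a single MD5.
    """
    targets = set(target_hashes)
    found = {}
    for word in wordlist:
        inner = md5_hex(word)   # a single MD5 on the real password
        outer = md5_hex(inner)  # DoubleMD5, as stored in the list
        if outer in targets:
            found[outer] = inner
    return found
```

Each value in the returned mapping verifies against its key with one MD5 pass, which is what a list labelled MD5 expects.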

Since there were real-world hashes, some were inevitably corrupted during extraction or transport. A feature of hashcat is that it does not match every bit of the hash, which allows it to essentially detect a mistyped hash. We encountered a small portion of these, which we assumed were most likely corrupted. As there wasn’t a large number of them, we simply ignored them.
While GPUs are extremely powerful in parallel hash cracking, it was surprising to see that the top scorer in our team predominantly used CPUs.

Final remarks

A huge thanks to Bitcrack and Hashkiller for organizing an almost flawless contest; we had plenty of fun and very little sleep. We can only imagine the amount of time and effort put into arranging this contest to ensure it ran so smoothly. Congratulations to Team Hashcat on their second place; we’re glad we were finally able to beat our rivals. Congratulations to the FCHC, I’m in your Wifi, LeakedSource and all other teams who participated.

Thursday, July 7, 2016

Myspace hashes, length 10 and beyond


                     Count          Recovered      Percent
Total                360,213,049
Usable data          359,005,905    355,886,686    99.13%
Unique               116,822,086    113,830,176    97%
Salted hashes        68,494,253
Salted pairs         66,099,059     47,120,453     71.29%
non-user pass        14,412,299     5,831          0.04%
meaningful passes    51,686,760     47,114,622     91.15%

When we obtained the Myspace data, we didn’t think too much of it for several reasons. In addition to being a fairly old data-set, the passwords were also truncated to length ten and converted to lowercase prior to being hashed with the SHA-1 algorithm. This means that some of the passwords recovered would be ambiguous and incomplete. This is no longer the case for roughly 68M of the hashes.

The total data-set of 360,213,049 lines contained 359,005,905 usable hashes. This data was de-duplicated to 116,822,086 SHA-1 hashes. Roughly 97% of these hashes, totaling 113M, were recovered by our group. As the passwords were all pre-processed before hashing, the plain-texts we recovered did not exceed length ten and were all lower-cased.

Since the plain-text passwords aren’t in their original form, they are less interesting, as we cannot gather much useful information from them. Being truncated, however, they do give us a glimpse of some longer passwords we may not previously have been able to recover.

Interestingly, user ‘frekvent’ over at the hashes.org forum made an amazing discovery. It appears that for some users there exists an additional salted SHA-1 hash that contains the password in its original form, without being truncated or lower-cased. This hash is generated by salting the password with the userid prior to hashing it with SHA-1.

Rather than directly recover the salted SHA-1 hashes, we can take a shortcut. For every user who has this secondary salted SHA-1 hash, we can case-correct the plain-text we previously recovered by testing it against the salted hash. It also means we can derive the actual password for these users as it was prior to truncation to length ten.

A generated example

UserID: 65535
Password: Cynosureprime082!

First hash:
(password is truncated to length 10 and lower cased)
cynosurepr->SHA1->6fba0c905ded07590fdbc4b0fa6eb17e565dd814

Second hash:
(userid is applied as a salt to the unmodified password)
Cynosureprime082!->SHA1($salt.$pass)->20c25cbb791bc0b7fcce739f42b682376057eb9e:65535

Stored as:
65535:email:0x6FBA0C905DED07590FDBC4B0FA6EB17E565DD814:0x20C25CBB791BC0B7FCCE739F42B682376057EB9E

Step 1: Recover 6fba0c905ded07590fdbc4b0fa6eb17e565dd814 as cynosurepr
Step 2: Perform case toggling and length extension cynosureprA, cYnosureprBB, cyNosureprZZ etc etc and test against 20c25cbb791bc0b7fcce739f42b682376057eb9e:65535
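The way the two stored hashes in the generated example relate to one password can be sketched as follows. This is a minimal illustration that assumes UTF-8 encoding and the userid written as a decimal string; the function name is invented for the example and is not taken from any Myspace code.

```python
import hashlib


def myspace_style_hashes(user_id: int, password: str):
    """Return (first_hash, second_hash) for one account, as described above."""
    # First hash: password truncated to length 10 and lower-cased, then SHA-1.
    truncated = password[:10].lower()
    first = hashlib.sha1(truncated.encode("utf-8")).hexdigest()
    # Second hash: SHA1($salt.$pass) with the userid as salt and the
    # unmodified password (assumption: userid as a decimal string).
    salted_input = str(user_id) + password
    second = hashlib.sha1(salted_input.encode("utf-8")).hexdigest()
    return first, second
```

Both values are 40 hex characters; the dump stores them uppercase with a 0x prefix.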

Out of the entire data-set, about 68M users contain the secondary salted SHA-1 password hash.  Of these 68M users, we were able to pair 66M up with the recovered password. This 66M list was then divided into two groups, ‘non-user pass’ which are users containing system generated passwords (14M) and ‘meaningful passes’, those which belong to users (51.6M). We were only able to pair 66M of the total 68M hashes as we have not fully recovered all the SHA1 hashes, but only 97% of them.

Using our tools we performed either a case toggle and/or length extension attack for each of the salted hash pairs. We have successfully verified over 45M plain-texts against their salted SHA-1 counterpart. The case toggle refers to toggling all passes length ten or less against the salted SHA-1. The length extension attack involves cycling through all possible characters and appending them to the plain-text derived from the recovered normal SHA1 and checking this against the salted SHA-1 hash.
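A rough Python sketch of the combined case toggle and length extension step is shown below. It is not the tool we used, and the function names, default character set and extension limit are assumptions made for illustration. Note that "length extension" here means appending candidate characters to the recovered ten-character plain, not the cryptographic length extension attack on hash constructions.

```python
import hashlib
import itertools
import string


def salted_sha1(user_id: str, password: str) -> str:
    """SHA1($salt.$pass) with the userid as the salt."""
    return hashlib.sha1((user_id + password).encode("utf-8")).hexdigest()


def case_variants(base: str):
    """Yield every upper/lower-case combination of `base`."""
    choices = [(c, c.upper()) if c.upper() != c else (c,) for c in base]
    for combo in itertools.product(*choices):
        yield "".join(combo)


def recover_original(base, user_id, salted_hash, max_extra=1,
                     charset=string.ascii_letters + string.digits + string.punctuation):
    """Recover the original password from its truncated, lower-cased form.

    `base` is the plain recovered from the unsalted SHA-1. Plains shorter
    than ten characters were never truncated, so only the case is toggled.
    """
    limit = max_extra if len(base) == 10 else 0
    for variant in case_variants(base):
        for extra in range(limit + 1):
            for tail in itertools.product(charset, repeat=extra):
                candidate = variant + "".join(tail)
                if salted_sha1(user_id, candidate) == salted_hash:
                    return candidate
    return None
```

The work grows quickly with the number of appended characters (the character set size raised to that number, multiplied by up to 1,024 case variants for a ten-letter plain), so a real run would want rules, dictionaries or a much faster implementation.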

Having both variations of the password hashes has made cracking the longer passwords quite easy since we can first recover the length 10 representation and use this in length extension attacks to obtain the full length password. It would appear that the Myspace data may have some usefulness after all.

Note: The salted hashes can be paired up with their corresponding plaintext data and arranged such that they can be recovered using off-the-shelf software. However, this won't work for case correction; you will also need to reparse the final output.