Posts: 2
Threads: 0
Joined: Jul 2023
I know a method to scrape data like that without hitting a rate limit. I'm sure it works by walking through user IDs from 1 to 999999 and up. I have tried it, but I think the data is useless because there is no email/phone number.
Posts: 447
Threads: 12
Joined: Jan 2024
(03-21-2025, 11:03 PM)xvz Wrote: I know a method to scrape data like that without hitting a rate limit. I'm sure it works by walking through user IDs from 1 to 999999 and up. I have tried it, but I think the data is useless because there is no email/phone number.
There are already many millions of emails associated with Twitter screennames. I've got a copy of this new data with emails attached to something like 1/4-1/3 of the active accounts.
If rate limiting weren't an issue, it would be easy to just check all user IDs... but in early 2016, they switched to 64-bit user IDs, which would be nearly impossible to enumerate (they are based on the timestamp to the millisecond, machine ID, and sequence number).
Posts: 31
Threads: 0
Joined: Mar 2025
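[Translated from Chinese] I just tried BitTorrent + Transmission + Tixati, and none of them worked for me, so I guess I'll have to wait for someone to re-upload this.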
This forum account is currently banned. Ban Length: (Permanent)
Ban Reason: this is a English only forum
Posts: 447
Threads: 12
Joined: Jan 2024
03-24-2025, 11:40 AM
(This post was last modified: 03-24-2025, 12:23 PM by ThinkingOne.)
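(03-24-2025, 08:13 AM)bidatou Wrote: [Translated from Chinese] I just tried BitTorrent + Transmission + Tixati, and none of them worked for me, so I guess I'll have to wait for someone to re-upload this.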
There are not many seeds for this torrent.
If you leave the client open and running, when a seed connects, it will automatically start to download.
Something else to add... for fun, I created a tool to search through all the 2.8 billion screennames. So instead of searching for, say, "elonmusk" you can search for all Twitter/X screennames that contain "elonmusk" (thousands of them!).
It does use 200GB of data/index files in addition to the 400GB of CSV files, but was a fun programming exercise.
Posts: 447
Threads: 12
Joined: Jan 2024
03-30-2025, 12:39 PM
(This post was last modified: 03-30-2025, 12:49 PM by ThinkingOne.)
So, on Friday I posted a merged file, as a last resort to make sure that X was aware of this massive breach.
I'm not a hacker, and I'm trying to be very careful not to violate any laws (my opsec is good enough for what I do, but no match for anyone determined to find me, and I would not hide anyway). I had previously said that I intentionally uploaded to only one legit filehost, so that if they took it down, people wouldn't be able to access the file without my intervention.
Last night, the filehost reported that the file was gone, with possible reasons being that it was inactive for too long (not the case) or that I deleted it (not the case).
If I knew X asked for it to be removed, I would know they were aware of the 2.8B breach, and would immediately stop. Word has started to get out... but there is also a lot of disbelief because people assume that 2.8B is much higher than it possibly could be (hint: Google "monthly active users"). Grok (xAI) has been telling some people it is probably real, but others that it is not (a few hours ago it said "No, X doesn’t have 2.8B users. ...").
So while the file was taken down, I have no idea whether it was because of a problem with the file. So I'm debating whether to re-upload somewhere else, as it may put me one small step closer to crossing a line I do not want to cross.
I did hear that someone got a well-known security researcher to investigate. If a well-known security researcher verifies this, I'm sure that X will find out. So I'll probably wait a bit before deciding how to proceed.
The funny thing is that the file I posted is really meant to be a "guidepost" to help people find out about the 2.8B breach, and isn't really that interesting compared to the 2.8B breach. But people also likely can't find it, since it doesn't have "billion" in the title, and starts with "[2022]", likely leading people to think they are looking at an old breach.
Posts: 447
Threads: 12
Joined: Jan 2024
03-30-2025, 09:12 PM
(This post was last modified: 03-30-2025, 10:11 PM by ThinkingOne.)
So, I'm still seeing zero signs that X is aware of this breach, and not seeing much traction on this... despite the fact that 2.8 billion user accounts were taken, with hints that it is possible emails/phones could have been taken as well (but not leaked). There still seems to be no well-known security researcher that has said "Yeah, there really are 2.8 billion unique Twitter accounts in here, and it was leaked in 2025" (something trivial to do!).
I think the benefit of re-uploading the file to help ensure that X and the general public are aware outweighs any very minimal damage that could possibly be done. Damn it X, check your spam folder and tell me you know! I don't want to be doing this.
I'm going to see if I can find a good filehost for a 9GB file (pixeldrain would be good, except it will cut the speed after 6GB, and gofile.io has deleted my file without explanation while keeping the 2.8B breach files). If anyone reading this knows of a good filehost before I re-upload, let me know.
I'm re-uploading to biteblob. It's not exactly what I was hoping for, but there weren't many options: gofile.io gave no details on why the file was deleted, and most file hosting places limit uploads to 5GB or less. Someone here listed biteblob as the best option. I won't be able to delete it, and don't know where they are located. But X needs to know about this, and the public as well. What if passwords were taken too?
Posts: 2
Threads: 0
Joined: Jul 2024
What are the chances that there is a full data leak including emails and possibly passwords? And that it will get leaked in the future?
Posts: 447
Threads: 12
Joined: Jan 2024
(03-30-2025, 10:25 PM)kalattimore2 Wrote: What are the chances that there is a full data leak including emails and possibly passwords? And that it will get leaked in the future?
Great question!
My gut instinct is that there is not a full leak. To me, it's kind of like getting back to your house and finding broken glass and an open door and no TV... even though your safe is there and closed and doesn't appear to have been touched, you're not going to ignore it... you're going to check to see if everything is in there.
All I can say for sure is what I know. This breach is real. Nobody seems to know how they could have enumerated screennames (you can't brute force 15+ character screennames, or 64-bit user IDs). This seems to have been an inside job. The CSV files were most likely created after the data was exfiltrated (meaning that they are not the original files that were taken). I'm not sure what percentage of the people at Twitter who had the capability of enumerating screennames would have also had access to email/phone/password though.
To me, that's enough to sound the alarms that this needs to be investigated to see if more was taken. If I had to put a number on it, without further information, I'd say as a ballpark maybe a 10% chance that some combination of email/phone/passwords was taken in addition to what was leaked.
Posts: 2
Threads: 0
Joined: Jul 2024
(03-30-2025, 10:53 PM)ThinkingOne Wrote: you can't brute force 15+ character screennames, or 64-bit user IDs
I looked at a few IDs:
1591822015824011264
1591822015824297985
Here you would only need to know the last 6 digits to get a new one. This makes me wonder if someone could just get a few IDs leaked to them and then brute-force the rest.
Posts: 447
Threads: 12
Joined: Jan 2024
(03-30-2025, 11:23 PM)kalattimore2 Wrote: I looked at a few IDs:
1591822015824011264
1591822015824297985
Here you would only need to know the last 6 digits to get a new one. This makes me wonder if someone could just get a few IDs leaked to them and then brute-force the rest.
Ah, someone who actually pays attention, I like it!
The Twitter 64-bit user IDs are indeed easier to guess than random 64-bit user IDs. They are a 42-bit timestamp, followed by a machine ID (10 bits), and then a sequence ID (12 bits). IIRC there were somewhere around 30 different machine IDs when I tested the 2.8B dataset, and sequence IDs usually start at 0 and go up. But there were some anomalies as well that were in the 2.8B dataset (I think the sequence was close to the max on those).
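For anyone who wants to check the layout themselves, here is a minimal sketch that splits an ID into the three fields described above. It assumes user IDs follow the layout of Twitter's open-sourced Snowflake generator: the top bits are milliseconds since the Snowflake epoch (1288834974657), then 10 machine bits, then 12 sequence bits. The names `decode_snowflake` and `SnowflakeParts` are just made up for this example, and the two IDs are the ones quoted above.

```python
from datetime import datetime, timezone
from typing import NamedTuple

# Assumption: same custom epoch as the open-source Snowflake generator (ms).
TWITTER_EPOCH_MS = 1288834974657
MACHINE_BITS = 10
SEQUENCE_BITS = 12


class SnowflakeParts(NamedTuple):
    timestamp_ms: int  # Unix time in milliseconds
    machine_id: int
    sequence: int


def decode_snowflake(snowflake_id: int) -> SnowflakeParts:
    """Split a 64-bit ID into timestamp, machine ID, and sequence number."""
    sequence = snowflake_id & ((1 << SEQUENCE_BITS) - 1)
    machine_id = (snowflake_id >> SEQUENCE_BITS) & ((1 << MACHINE_BITS) - 1)
    timestamp_ms = (snowflake_id >> (MACHINE_BITS + SEQUENCE_BITS)) + TWITTER_EPOCH_MS
    return SnowflakeParts(timestamp_ms, machine_id, sequence)


# The two IDs quoted earlier in the thread.
for uid in (1591822015824011264, 1591822015824297985):
    parts = decode_snowflake(uid)
    created = datetime.fromtimestamp(parts.timestamp_ms / 1000, tz=timezone.utc)
    print(uid, created.isoformat(timespec="milliseconds"),
          "machine:", parts.machine_id, "sequence:", parts.sequence)
```

Under that assumption, the two quoted IDs decode to the same millisecond and differ only in the machine ID and sequence fields, which is why only their trailing digits differ.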
If you knew the time to the ms of when someone created an account, and knew the possible machine IDs, you would likely be able to guess the ID with perhaps roughly 20-100 guesses. But with 86,400,000 ms in a day, even knowing the day the account was created wouldn't help.
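To put rough numbers on that, here is the back-of-the-envelope arithmetic as a tiny sketch. The 20-100 guesses per millisecond range is just the ballpark from this post, not a measured figure, and `candidates_per_day` is a made-up helper name.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000  # 86,400,000 ms in a day


def candidates_per_day(guesses_per_ms: int) -> int:
    """Candidate IDs to try if only the creation day is known."""
    return MS_PER_DAY * guesses_per_ms


# Ballpark range from the post: roughly 20-100 guesses per millisecond.
for guesses in (20, 100):
    print(f"{guesses} guesses/ms -> {candidates_per_day(guesses):,} candidates per day")
```

So even with the creation day known, that is on the order of billions of candidate IDs for a single account.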