This is a multi-part interview with the individual known as “Nam3L3ss” who leaked more than 100 databases on a popular hacking forum and will soon be leaking many more. Read the Preface. In Part 1, he answered some questions about his background and what motivated him to do what he does. In this part, we talk about his methods. In Part 3, we discuss some ethical concerns and the future.
Finding Exposed Data
Dissent Doe (DD): Forum participants have tried to find out a bit more about what you do and how you do it. Before we get into specifics, can you say whether you had any special training or credentials that prepared you for finding and cleaning so much data?
Nam3L3ss (N): As both an Investigator and Information Intelligence Specialist, my unique perspective on data allows me to navigate the intricate web of information in ways that many find challenging. While the average person may dismiss data with missing components as worthless, I recognize that such gaps can be rich with potential insights. A redacted field, for instance, is not merely an absence of information; it can serve as a crucial indicator, signaling which connections warrant deeper investigation. The art of cross-referencing becomes a powerful tool, as those seemingly insignificant omissions often point toward noteworthy conclusions about individuals or entities involved.
Reverse engineering data is a skill that many overlook, yet it can be deceptively straightforward. My experience in reverse engineering data has often uncovered revelations that are veiled from standard access, leading to intelligence that some might categorize as “Top Secret.”
Many government entities operate under the illusion of impenetrable secrecy, yet I have found that their data is more accessible than they realize. By meticulously piecing together fragmented information, I can uncover narratives that offer a broader understanding of complex issues. Ultimately, my role hinges on transforming perceived limitations into opportunities for insight, shedding light on what lies beneath the surface of concealed data.
DD: Elsewhere, you claimed you have 250 TB of archived databases. How much of what you have downloaded was found using services like Shodan, Censys, or GrayHat Warfare? Are most of your discoveries MongoDB, S3 buckets, Azure blobs, open directories, rsync backups, or Elasticsearch…?
N: The reference to 250 terabytes (TB) as the amount currently being cleaned and curated may seem impressive on the surface, but it represents only a fractional slice of a much larger universe of data in my possession. In fact, this volume constitutes less than 0.01% of my total data.
Interestingly, while some might assume that the data I gather from ransomware sites plays a significant role in my collection, it actually represents a mere fraction of my resources, amounting to only about one petabyte. In stark contrast, my primary reservoir consists of over eleven petabytes of data harvested from exposed web services like AWS, MongoDB, and similar platforms. In total, my data warehouse boasts well over thirty petabytes of information, with approximately 90% focused on individuals and corporations in the United States, Canada and UK. This compilation underscores the vast landscape of data available.
I have reviewed the sources and platforms you mentioned, but unfortunately, they fall short of my requirements for a comprehensive analysis. I need a much more robust and in-depth approach than what they typically offer. Take GrayHat, for instance. It only indexes the first one million files from an exposed source, which often results in incomplete data sets. A prime example of this limitation is the recent Vertafore leak, where GrayHat managed to index only two out of what I believe were five backup data files, leaving significant gaps in the information available.
In contrast, my methodology involves indexing every single file I encounter within exposed data buckets. This meticulous process allows me to access buckets that contain well over a million files, with some boasting in excess of a billion files. It’s true that many of these large buckets (holding, for example, one billion to four billion files) may not be pertinent to my needs, as they often consist solely of picture galleries or other non-essential content. However, without indexing the entire exposed bucket, there’s no way to ascertain its relevance or utility.
To streamline my efforts, I employ a sort of blacklist for buckets. Once I determine that a bucket is predominantly filled with irrelevant content, such as mere picture galleries or open-source release sites, I remove it from my tools to ensure I receive timely updates on only the most pertinent file information in the future. This targeted approach maximizes efficiency and allows me to focus on genuinely valuable data.
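To make the indexing step concrete, here is a minimal Python sketch of exhaustively listing a publicly readable S3-style bucket while skipping blacklisted names. It illustrates the general technique only, not Nam3L3ss’s actual tooling; the bucket name and blacklist entries are hypothetical, and it assumes the bucket allows anonymous listing.

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Hypothetical bucket name and blacklist; the real lists are not public.
BUCKET = "example-exposed-bucket"
BLACKLIST = {"example-picture-gallery", "example-oss-mirror"}

def index_bucket(bucket):
    """Yield (key, size) for every object in a publicly listable bucket."""
    if bucket in BLACKLIST:
        return  # skip buckets already judged to hold only irrelevant content
    # Unsigned requests send no credentials, so only anonymous access works.
    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):  # pages past the 1,000-key limit
        for obj in page.get("Contents", []):
            yield obj["Key"], obj["Size"]

if __name__ == "__main__":
    for key, size in index_bucket(BUCKET):
        print(f"{size:>14}  {key}")
```

The pagination loop is the important part: it keeps requesting continuation pages until the full listing is exhausted, which is what separates a complete index from one capped at the first million files.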
DD: To follow up on the above: what types of leaks account for the largest share of your acquired datasets: buckets, blobs, MongoDBs, open directories, Elasticsearch, rsync…?
N: I would say, AWS Buckets and Azure Blobs are the most egregious.
DD: How do you scan for potential buckets? Are you using keyword searches or strings or…?
N: Identifying potentially exposed buckets involves utilizing word lists and employing methods akin to those used by password hackers. While there are additional techniques that can expedite this process, I prefer not to disclose them.
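For readers unfamiliar with the wordlist approach, candidate bucket names are typically generated from permutations of organization names and common suffixes and then tested to see whether they resolve. The Python sketch below is a hypothetical illustration of that general idea; the words, suffixes, and any faster techniques Nam3L3ss uses are not disclosed.

```python
import itertools
import requests

# Hypothetical wordlist and suffixes; real lists would be far larger.
WORDS = ["acme", "acme-corp", "acmecorp"]
SUFFIXES = ["", "-backup", "-backups", "-data", "-prod", "-dev"]

def check_bucket(name):
    """Return a status string if an S3 bucket with this name exists, else None."""
    try:
        r = requests.head(f"https://{name}.s3.amazonaws.com/", timeout=5)
    except requests.RequestException:
        return None
    if r.status_code == 404:
        return None                        # NoSuchBucket
    if r.status_code == 200:
        return "exists, anonymously listable"
    return "exists, access denied"         # typically a 403

for word, suffix in itertools.product(WORDS, SUFFIXES):
    name = word + suffix
    status = check_bucket(name)
    if status:
        print(f"{name}: {status}")
```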
DD: Do you code all your own scripts for your searches? If so, is there anything you are willing to share about your scripts or tools? And if you’re not willing to share, why are you willing to share all the databases but not the scripts to find them or download them?
N: Sharing these tools could significantly empower malicious actors, such as real criminal hackers and threat agents, who would gain unprecedented access to a vast array of sensitive data. The potential for such tools to facilitate the rapid identification of vulnerable targets, like exposed cloud storage buckets or insecure services, underscores the responsibility to withhold the tools.
These tools often yield staggering results, enabling the discovery of more than 10,000 new databases or personally identifiable information (PII) records in a single session. While the breadth and depth of this data can vary, the sheer volume highlights a pressing issue in data security. Protecting these tools not only safeguards the integrity of the data discovered but also serves to shield organizations from the repercussions of malicious exploitation.
The act of responsibly leaking cleaned data serves a higher purpose. It not only raises awareness about the entities that have inadvertently exposed sensitive information but also strengthens the arms of legal practitioners seeking accountability. While the sources of these breaches may not always be immediately traceable through conventional identifiers such as URLs or IP addresses, there are often subtleties within the data structure itself that provide vital clues to the originating company. This intricate interplay of data scrutiny and ethical responsibility shapes the landscape of cybersecurity and accountability, and provides a powerful demonstration of the need for a more secure digital ecosystem.
DD: In a pre-interview chat, you mentioned that you are able to scan every FTP site on the internet in less than an hour and get back a report on every anon ftp on port 21. How do you then search or scan the results to find data of interest to you?
N: Using zmap over a VPN or high-speed proxy allows me to scan the entire internet for a single port in less than an hour.
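zmap only reports which hosts answered on the scanned port, so a second pass is needed to see which of them actually accept anonymous logins. A minimal Python sketch of that follow-up, assuming zmap’s results have already been saved one IP per line (for example with `zmap -p 21 -o ftp_hosts.txt`), might look like this; the filename and concurrency settings are illustrative only.

```python
from concurrent.futures import ThreadPoolExecutor
from ftplib import FTP, error_perm

def allows_anonymous(ip):
    """Return True if the host accepts an anonymous FTP login on port 21."""
    try:
        with FTP(ip, timeout=10) as ftp:
            ftp.login()  # ftplib defaults to the anonymous user
            return True
    except (error_perm, OSError):
        return False

with open("ftp_hosts.txt") as f:           # one IP per line, from the zmap run
    hosts = [line.strip() for line in f if line.strip()]

with ThreadPoolExecutor(max_workers=50) as pool:
    for ip, ok in zip(hosts, pool.map(allows_anonymous, hosts)):
        if ok:
            print(ip)
```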
Indexing files in open directories, FTP servers, or exposed cloud buckets has become a remarkably straightforward task. By first gathering all filenames and their corresponding sizes, I can leverage additional tools designed for efficient data retrieval. Specifically, I focus on retrieving the first 1MB of every file based on its extension type or MIME type. This initial data chunk serves as a gatekeeper, allowing me to use another tool that automatically searches for keywords within these files. The preliminary keyword analysis helps me determine which files hold potential value, enabling me to preview the 1MB snippets before committing to a full download, which can often range from several gigabytes to hundreds of gigabytes. Certain file types and MIME types are either skipped altogether or auto-downloaded based on prior knowledge of their contents, streamlining the process even further.
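The 1MB preview step maps naturally onto HTTP Range requests, which ask a server for only the first portion of a file. The Python sketch below shows the general idea under assumed inputs: a hypothetical file of indexed URLs, a hypothetical keyword list, and a hypothetical skip list of extensions. His actual selection and auto-download rules are not public.

```python
import requests

KEYWORDS = [b"ssn", b"password", b"date_of_birth"]     # hypothetical keywords
SKIP_EXTENSIONS = (".jpg", ".png", ".mp4", ".iso")     # skipped outright

def preview(url, nbytes=1_048_576):
    """Fetch only the first `nbytes` of a file using an HTTP Range request."""
    r = requests.get(url, headers={"Range": f"bytes=0-{nbytes - 1}"}, timeout=30)
    r.raise_for_status()
    return r.content

def looks_interesting(url):
    if url.lower().endswith(SKIP_EXTENSIONS):
        return False
    snippet = preview(url).lower()
    return any(keyword in snippet for keyword in KEYWORDS)

with open("indexed_urls.txt") as f:        # produced by the indexing step
    for url in (line.strip() for line in f):
        if url and looks_interesting(url):
            print("candidate for full download:", url)
```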
Moreover, the development of a command line tool that autonomously creates databases or tables based on formatted file headers presents a pragmatic solution for data management. By automating the database creation and population process, I can swiftly identify which files require further scrutiny. An automated report detailing any data load errors provides critical insight into problematic files, highlighting areas that may need additional attention. This method not only drastically reduces the time spent on manual inspections but also enhances the speed at which the data becomes actionable. Ultimately, the ability to isolate and inspect problematic data allows for informed decisions on whether to clean or modify it, ensuring that the data landscape remains organized and valuable for future analysis.
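As a rough illustration of that kind of tool, the sketch below creates a SQLite table from a CSV file’s header row, loads the rows, and reports any lines that fail to insert. The database engine, the table-naming scheme, and the error-report format are assumptions; the original tool is not described in that level of detail.

```python
import csv
import sqlite3

def load_csv(path, db="staging.db"):
    """Create a table named after the file, using the CSV header as columns,
    then load every row and report lines that fail to insert."""
    table = path.rsplit("/", 1)[-1].rsplit(".", 1)[0].replace("-", "_")
    conn = sqlite3.connect(db)
    errors = []
    with open(path, newline="", encoding="utf-8", errors="replace") as f:
        reader = csv.reader(f)
        header = next(reader)
        columns = ", ".join(f'"{h.strip()}" TEXT' for h in header)
        conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({columns})')
        placeholders = ", ".join("?" for _ in header)
        for lineno, row in enumerate(reader, start=2):
            try:
                conn.execute(f'INSERT INTO "{table}" VALUES ({placeholders})', row)
            except sqlite3.Error as exc:
                errors.append((lineno, str(exc)))  # e.g. wrong column count
    conn.commit()
    conn.close()
    # The error report flags files that need manual cleaning before a reload.
    for lineno, message in errors:
        print(f"{path}:{lineno}: {message}")

load_csv("example_export.csv")  # hypothetical file name
```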
DD: You’ve mentioned elsewhere that most of your work is automated. You’ve also mentioned searching datasets to remove any data that might harm victims and witnesses. Are those latter searches also automated? And have you ever discovered that you had leaked material you wish you hadn’t leaked?
N: While my data collection and loading processes into databases are primarily automated, I believe that ensuring the integrity and accuracy of the information released warrants a more hands-on approach. Automation undeniably streamlines the flow of data and enhances efficiency, but it can also introduce errors or overlook nuances that a human touch can easily catch. For this reason, I take the time to manually verify the data before it is published or shared.
My verification process not only allows me to scrutinize the dataset for sensitive information that should be withheld, but it also provides an opportunity to ensure that the data aligns with the intended purpose and audience.
Source of Data
DD: The issue of whether the companies themselves are directly leaking or their vendors are leaking their data is an important one, especially when we think of accountability and monitoring for data security. Can you tell me whether most leaks you find are by the primary entity or if they are more likely to be by vendors?
N: A significant proportion of the leaks associated with services such as AWS, Azure, Digital Ocean, and similar platforms primarily comes to light through third-party vendors, data conversion firms, and data aggregation companies.
DD: Do you ever collaborate with ransomware groups or cybercriminal entities to share resources, knowledge, or infrastructure?
N: I considered this recently, but after hearing other viewpoints, and the fact that it could be construed as being a co-conspirator, I have dropped the idea completely. I have never interacted with ransomware groups at all, and do not intend to.
DD: Looking at your forum posts, you get a lot of data from ransomware group leaks, even though it may be only a very small fraction of the amount of data you have acquired. As a journalist, I know how hard it was to download Clop’s leaks to try to investigate the leaks and their scope. You are making it easier for people to download and explore the stolen data, and there is nothing that stops anyone from just downloading all your cleaned-up data leaks and misusing them or selling them to criminals or data brokers. Do you have any concerns about that?
N: Although it is true that the data I release can be downloaded and misused by anyone, it is important to recognize that such information is already circulating on a far larger scale than the general public may realize.
The alarming rise of ransomware attacks has given birth to an equally intriguing phenomenon: the proliferation of release sites operated by cybercriminals. These platforms often serve as a showcase for the stolen data of their victims, providing a stark glimpse into the scale and consequences of these malicious acts. By examining various screenshots from select ransomware group release sites, one can gain insight into the level of engagement these criminals receive from their audience. While most leak sites do not track the actual downloads of the compromised data, the number of views reveals a troubling fascination and a potential market for the stolen information. Here’s a small sample:
- Ransomhouse: https://s3.amazonaws.com/i.snag.gy/D0H21W.jpg
- Monti: https://s3.amazonaws.com/i.snag.gy/W9Kc6V.jpg
- Medusa: https://s3.amazonaws.com/i.snag.gy/3Hl6RM.jpg
- RansomHub: https://s3.amazonaws.com/i.snag.gy/GmnZPR.jpg
- LeakNet: https://s3.amazonaws.com/i.snag.gy/TdzGXC.jpg
- LockBit3.0: https://s3.amazonaws.com/i.snag.gy/XdopBa.jpg
- Cactus: https://s3.amazonaws.com/i.snag.gy/DpStsY.jpg
- Cicada3301: https://s3.amazonaws.com/i.snag.gy/AW1zBL.jpg
- Cloak: https://s3.amazonaws.com/i.snag.gy/iJPbH1.jpg
And here is a site that tracks ransomware group victims and provides a direct JSON file anyone can download for the groups and victims they track:
DD: Okay, but to follow up on an earlier question and for the databases that are not from ransomware sites: how much of what you have comes from the companies’ own buckets or storage accounts, how much comes from vendors’ servers, how much comes from data brokers’ servers, and how much comes from government servers?
N: For the most part, many might categorize me as a data broker, as my sources mirror those of my contemporaries in the industry. However, the distinction lies in my commitment to not selling access to this data; everything I collect and manage resides on an air-gapped network, ensuring its integrity and security. My data acquisition spans a vast array of public records databases, encompassing essential documents such as deeds, mortgages, traffic citations, and vital life events including marriages, divorces, births, and deaths. I also tap into voter records and various Open Data portals, utilizing bulk government data downloads that provide a wealth of information.
DD: Interesting, but you still haven’t answered my question. Which of the types of sources accounts for the single largest percentage of the databases you acquire: primary entities, their vendors, data brokers, or government servers?
N: Less than one third of my data comes from breaches and leaks. The rest comes directly from government agencies or major brokers like the credit bureaus.
DD: So far, you have only leaked data from breaches and leaks, unless I missed something. Will you be also leaking data from government agencies that has never been leaked before, or is the government data also previously leaked but just cleaned up by you?
N: For the data I release, I always cite 100% of the sources: where and when it was breached or leaked. Since I do not get involved in actual hacking or breaching of systems, it is either something I obtained from an exposed bucket, or it came from a hacker or ransomware group and was made public.
I make no distinction between companies and governments. If it was breached and released, or if it was a leak I discovered and downloaded, I would not give special treatment to government leaks. Leaks and breaches, regardless of entity type, I will clean and release if I think they are worth my time. Obviously I will vet government data more closely than, say, a normal company’s, but governments cannot expect special treatment. They have a duty and a responsibility, just like companies do, if not more so, to protect the data.
DD: Is there anything else about your methods that you haven’t mentioned so far that you are willing to share?
N: Not that I can think of right now.
Continue to:
Part 3 (Ethics and Goals) or
Return to Preface or Part 1 (Background)