It may look good, but that data breach report is not necessarily accurate

Two analyses of data breaches in 2014 have been released within the past month. One is Gemalto’s annual Breach Level Index report (pdf), which is based on 1,541 breach reports resulting in 1,023,108,267 breached records. The other is Risk Based Security’s Data Breach Quick View (pdf), which is based on 3,014 incidents exposing 1,068,191,345 records.

How can an analysis that includes almost twice as many incidents report almost the same number of records as an analysis based on so many fewer incidents, you ask?
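The arithmetic behind that question can be sketched in a few lines. This is only a rough back-of-the-envelope comparison using the four totals quoted above. The dictionary keys and the `mean_records_per_incident` helper are my own illustrative names, not anything taken from either report.

```python
# Totals as quoted above from the two 2014 reports.
# The labels and the helper below are illustrative, not from either report.
reports = {
    "Gemalto BLI": {"incidents": 1_541, "records": 1_023_108_267},
    "RBS Quick View": {"incidents": 3_014, "records": 1_068_191_345},
}


def mean_records_per_incident(incidents: int, records: int) -> float:
    """Average number of records attributed to each counted incident."""
    return records / incidents


averages = {
    name: mean_records_per_incident(r["incidents"], r["records"])
    for name, r in reports.items()
}

for name, avg in averages.items():
    print(f"{name}: about {avg:,.0f} records per incident")

# How much larger is the RBS data set on each dimension?
incident_ratio = reports["RBS Quick View"]["incidents"] / reports["Gemalto BLI"]["incidents"]
record_ratio = reports["RBS Quick View"]["records"] / reports["Gemalto BLI"]["records"]

print(f"RBS counted {incident_ratio:.2f}x the incidents")
print(f"RBS counted {record_ratio:.2f}x the records")
```

A mean like this is easily skewed by a handful of very large entries, so a gap between the two averages says nothing by itself about which report is right. It only shows where to start looking: at the biggest incidents each report counted.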

Spoiler alert: Gemalto’s Breach Level Index report was a badly flawed report, in my opinion (this might be a good time to remind everyone that I’m a volunteer curator for DataLossDB.org, which fuels RBS’s report).

If one looks at what Gemalto considered “attacks” (a term that they don’t seem to define in the report) and how they got their numbers, their methodology leaves much to be desired. For example, Gemalto reports that the top breach targets among “educational institutions” were:

Benesse Holdings, with an identity theft attack involving 48.6 million records; Netherlands Primary School, with an identity theft attack involving 1 million records; and Maricopa County Community College District, with an identify theft attack that exposed 309,079 records.

Two parts of that appear incorrect (or three, if you think Benesse is a service provider and should have been listed under the Business sector and not Educational Institutions sector):

What Netherlands Primary School “attack?” There was no attack as far as I could determine from available media. If it’s the incident I’m thinking of, what happened appears to have been a privacy breach and not an attack. Is Gemalto talking about the Snappet incident or something else? And what “Netherlands Primary School?” I cannot find any school by that name.
Gemalto erroneously reported 309,079 records for the Maricopa County Community College District (MCCCD) breach. I suspect Gemalto confused the MCCCD breach with the University of Maryland breach, where there were 309,079 records reported. The MCCCD breach was in 2013 and affected more than 2.4 million.

Then there were these claims about the healthcare sector in Gemalto’s report:

Among the top breaches in healthcare were the Korean Medical Association, with 17,000,000 records exposed in an identity theft attack; Community Health Systems, with 4,500,000 records in identity theft; and the State of Texas Department of Health & Human Services, with 2,000,000 records in identity theft.

Not quite:

The Korean Medical Association did not have 17,000,000 records exposed in an identity theft attack. According to multiple news sources, 17 million people had their details stolen and leaked from 225 websites, three of which were the Korean Medical Association, Korean Dental Association, and the Association of Korean Medicine. There is no report that 17,000,000 records were hacked from the KMA.
As to the Texas Health & Human Services Commission, well, I probably wouldn’t include that, as it’s a contractual dispute. And as Xerox had a business associate contract in place, they would not be misusing the information for identity theft, as Gemalto claims.

I won’t bother commenting on all their other sections and claims, as I think you get the point by now. When you find yourself questioning the accuracy of verifiable claims, it starts to raise doubts about the accuracy of the whole report.

Gemalto did not respond to an email inquiry sent to them on February 12 asking about their methodology for the report and how they got some specific numbers. That inquiry also asked:

And what is Gemalto’s policy on including vulnerabilities if there’s never any report that they were actually exploited? Does Gemalto think it’s correct to label that a breach? Yes, there may have been hundreds of millions of records at risk or potentially compromised by Alibaba, but does Gemalto include all vulnerability reports as breaches even if there’s no confirmation of actual breach/access/acquisition of data?

As I said, Gemalto never replied. Maybe now that I’ve gone public with my criticism of their report, they’ll respond to my questions. If they do, I will post their response.

But to all you mainstream journalists who just mindlessly reported Gemalto’s claims and numbers: shame on you. And if all you ever cite is Privacy Rights Clearinghouse’s figures (67,596,246 records from 293 breaches fitting their narrow criteria) or Identity Theft Resource Center’s (ITRC’s) figures (85,611,528 records from 783 breaches fitting their narrow criteria), well, you’re missing the bulk of breaches, aren’t you?

If you’re a journalist reporting on data breaches, learn to use DataLossDB.org/RBS and Verizon DBIR reports. No other sources compile as much breach information. Yes, there will be discrepancies between their reports based on methodology and data sources, but you’ll have a much more accurate picture of global breaches than by using others’ reports.

So what did Risk Based Security (RBS) find for 2014? Well, it turns out some of their statistics are almost identical to Gemalto’s report, but they got there by different – and seemingly more accurate – means. RBS reports:

3,014 incidents exposing 1.1 billion records.
Four Hacking incidents alone exposed a combined 647 million records.
A single act of Fraud exposed 104 million records.
The Business sector accounted for 52.9% of reported incidents, followed by Government (15.5%), Unknown (13.2%), Medical (9.6%), and Education (8.8%).
The Business sector accounted for 55.1% of the number of records exposed, followed by Unknown (25.9%), and Government (17.9%).
67.7% of reported incidents were the result of Hacking, which accounted for 83.3% of the exposed records.
Fraud accounted for 14.3% of the exposed records, but represented just 4.3% of the reported incidents.
Breaches involving U.S. entities accounted for 44.5% of the incidents and 47.9% of the exposed records.
35.8% of the incidents exposed between one and 100 records.
Thirty-one incidents in 2014 each exposed more than one million records.
Five incidents in 2014 secured a place on the DataLossDB.org Top 10 All Time Breach List.
Two states (New York and California) account for 71.7% of exposed US records.

Now mull those figures over. For specific comparisons to 2013, and for definitions of terms and their methodology, see the report.

2 thoughts on “It may look good, but that data breach report is not necessarily accurate”

Ed says:

February 24, 2015 at 2:16 am

The 1 billion number in the Gemalto report is largely based on the 300 million records “breached/compromised” in the AliExpress incident. AliExpress only has 7.7 million customers and their details might have been exposed, not compromised. The study is highly inaccurate. I get it that it’s based on public reports, but you can’t (shouldn’t) just take numbers from sensationalized reports and use them in a study.

I told them about the flaws in methodology and they said “Thanks!” O_o

Dissent says:

February 24, 2015 at 7:46 am

I don’t think their report even accurately reflects public reports, as I tried to demonstrate in my post.

And yes, that one Alibaba/AliExpress report boosted their total count significantly.

At least you got a reply from them.

The pity is that I am just a humble blogger and I fear most of the journos who mindlessly just repeated their claims will not see my critique or wake up and use more reliable sources for their reporting.

