Posted 26 May 2020, 13:45

Modified 24 December 2022, 22:37

The sly nines: Analyst shows how the pandemic data is faked

According to all mathematical laws, official statistics provided by the Russian authorities do not correspond to reality.

Everyone has noticed that the official COVID statistics in our country look strange, to say the least, and from time to time evidence of doctored data has surfaced on social networks and in the media. But Boris Ovchinnikov, director of research at Data Insight, who closely monitors Russian pandemic statistics, went even further. He noticed an extremely interesting pattern: four times over the past 25 days, the official number of newly detected coronavirus carriers in Russia has ended in 99: 7099 on April 30, 10699 on May 8, 10899 on May 12, and 8599 on May 24. According to his rigorous calculations, the likelihood that these data reflect the natural state of affairs is negligible:

“This is 16 times higher than the mathematical expectation (which is simple to work out here: an ending of 99 should come up on average once in 100 days).

How likely is such a coincidence to be accidental? In short, it cannot be ruled out (this is still not Volodin's 62.2% in the Saratov region), but it is extremely unlikely (the detailed version involves a lot of figures and reasoning). A much more plausible explanation, especially if we recall the known oddities in official statistics at the regional level, is that the falsification of coronavirus statistics in the regions is accompanied by falsification at the federal level: instead of simply summing the numbers sent from the regions (real from some regions, drawn in others), a number is invented to be presented to citizens as the “total” for Russia, and the regional statistics are then adjusted to fit this drawn federal number.

A little theoretical background. According to the official version, the number of newly identified patients across Russia announced by the Operational Headquarters every morning is the sum of 85 mutually independent values: the number of patients identified in each region. When 85 independent quantities are added (most of which are in the tens or even many tens), the theoretical probability of getting a number ending in 99 is the same as for 00, 01, or any other pair of digits.
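The uniformity claim can be checked on a toy model. The sketch below is only an illustration: it assumes each of the 85 regional counts is uniformly distributed over a made-up range (the `hypothetical_ranges` list is not real regional data) and computes the exact distribution of the last two digits of the sum by repeated convolution modulo 100.

```python
def last_two_digit_distribution(ranges):
    """Exact distribution of (sum of independent counts) mod 100.

    Each count is assumed uniform on the inclusive integer range (lo, hi).
    """
    dist = [0.0] * 100
    dist[0] = 1.0  # before adding any region, the sum is 0
    for lo, hi in ranges:
        width = hi - lo + 1
        new_dist = [0.0] * 100
        for ending, prob in enumerate(dist):
            if prob == 0.0:
                continue
            share = prob / width
            for value in range(lo, hi + 1):
                new_dist[(ending + value) % 100] += share
        dist = new_dist
    return dist


# Hypothetical ranges: region i reports between 0 and 10 + i new cases.
hypothetical_ranges = [(0, 10 + i) for i in range(85)]
dist = last_two_digit_distribution(hypothetical_ranges)

print(f"P(ends in 99) = {dist[99]:.6f}")
print(f"largest deviation from 1/100 = {max(abs(p - 0.01) for p in dist):.2e}")
```

Under these assumed ranges, every two-digit ending has a probability that is practically indistinguishable from 1/100, which is the premise the calculations rest on.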

Now to the specific calculations. The probability of a chance coincidence is easily calculated with the binomial distribution. It comes to 0.011%, or 1 case in 9350 attempts. But this is the probability of one specific number (in this case, 99) coming up 4 times out of 25. The probability that in 25 attempts any number from 0 to 99 comes up more than 3 times is already (approximately) 1.1%. The probability of some “beautiful” number coming up 4 times (and 99 is clearly a “beautiful”, out-of-the-ordinary number) is lower, but the exact estimate depends on which numbers we agree to consider “beautiful”. At the same time, it must be acknowledged that these estimates are calculated for a period that I chose arbitrarily (April 30 to May 24), and chose, moreover, so as to minimize the estimated probability of a chance coincidence.
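The 25-day figures can be reproduced with a few lines of code. This is a minimal sketch, assuming that "4 times out of 25" means "at least 4 times", that each ending has probability 1/100 on each day, and that the "any number" estimate is the simple union bound (100 times the single-number probability); the function name `binomial_tail` is ours.

```python
from math import comb


def binomial_tail(n, p, k_min):
    """P(X >= k_min) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


# One specific ending (99) at least 4 times in 25 days.
p_99 = binomial_tail(25, 0.01, 4)

# Any of the 100 endings at least 4 times: union bound, slightly overstated.
p_any = min(1.0, 100 * p_99)

print(f"specific ending: {p_99:.4%}, about 1 in {round(1 / p_99)}")
print(f"any ending (upper bound): {p_any:.2%}")
```

The same function applies to other windows and frequencies by changing `n`, `p` and `k_min`.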

It would be more honest to do the calculation for an independently selected period. The most logical boundary for such a period is April 20: that day, for the first time in a long while (since April 4), the number of new cases was lower than in the previous 2 days, and it was in fact on April 20 that the first “plateau” began, when until the end of the month, through April 29 inclusive, the official figures showed linear rather than exponential growth. Moreover, the general impression from regional statistics also points to the boundary between the second and third ten-day periods of April as a turning point, after which the reliability and adequacy of the statistics begin to decline rapidly.

So, if we take the period from April 20 to today, May 25 (35 days), the probability of “99” coming up 4 times out of 35 is 0.041%. The probability of any two-digit number coming up 4 times out of 35 is approximately 100 times higher, 4.0%. Does that lift the suspicion?

No, not yet.

Besides "99", only 4 two-digit endings have come up more than once since April 20. And one of these four is "98", the neighbor of "99". That is, in 6 cases out of 35 (in fact even out of 28, if we count not from April 20 but from the first occurrence of “98” in the daily reports), the number of patients detected per day ended in one of the two largest two-digit numbers, 98 or 99. For comparison, among the other possible pairs of neighboring numbers, only one, 33 and 34, occurred three times (half as often), and the rest occurred between 0 and 2 times.

6 occurrences of “98” or “99” out of 35, at a theoretical frequency of 2%, have a probability of 0.0063% (or 1 case in 15850). The probability that any pair of adjacent numbers comes up 6 times out of 35 is 0.6%. So, quite legitimately (ok, in my subjective opinion), we have identified in the official data an event whose probability of occurring by chance (i.e., from simply summing the regional numbers without editing the result) is a fraction of a percent (0.6%). And this estimate does not even take into account the fact that 98 and 99 are not the most ordinary numbers.
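The 0.6% figure for "any pair of adjacent numbers" can be sanity-checked by simulation instead of by formula. The sketch below is an illustration under stated assumptions: the daily endings are independent and uniform on 00 to 99, a "pair" means two consecutive endings (00-01 up to 98-99), and the trial count and seed are arbitrary choices of ours.

```python
import random


def estimate_pair_probability(days=35, hits=6, trials=50_000, seed=2020):
    """Monte Carlo estimate of P(some adjacent pair of endings occurs >= hits times)."""
    rng = random.Random(seed)
    successes = 0
    for _ in range(trials):
        counts = [0] * 100
        for _ in range(days):
            counts[rng.randrange(100)] += 1
        # Check every pair of neighbouring endings: (00, 01), ..., (98, 99).
        if any(counts[i] + counts[i + 1] >= hits for i in range(99)):
            successes += 1
    return successes / trials


estimate = estimate_pair_probability()
print(f"estimated probability: {estimate:.2%}")
```

With this many trials the estimate is noisy, but it should land in the same fraction-of-a-percent range as the figure quoted in the text.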

To this it is worth adding that over the past 35 days the number of newly diagnosed patients has never ended in five (the probability of such bad luck by chance is 2.5%) and only once ended in zero, and in two zeros at that (9200 on May 16). The probability that a multiple of 5 comes up at most once in 35 days is only 0.4% (if we analyze only the cases “away” from the peak at 98 and 99, that is, with the last two digits in the range from 10 to 90, we get 0 multiples of 5 in 25 “attempts”, and the probability of such an anomaly arising by chance is again about 0.4%).

But fours or nines came up 13 times (+6 relative to expectation); the probability of such a frequent deviation by one below a multiple of 5 is 1.4%. If we take the last digit (of the number of patients detected per day) not in decimal but in base 5, then the probability of a spread like the one in the data for the last 35 days (some digit occurs 1 time or less, and some other digit occurs 13 times or more) is less than 0.45%.
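The last-digit figures in the two paragraphs above follow from the same binomial model. A rough sketch, assuming independent and uniformly distributed last digits, so that "ends in 5" has probability 0.1 per day while "multiple of 5" and "ends in 4 or 9" each have probability 0.2 (the helper names are ours):

```python
from math import comb


def pmf(n, p, k):
    """P(X == k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def at_most(n, p, k):
    """P(X <= k)."""
    return sum(pmf(n, p, i) for i in range(k + 1))


def at_least(n, p, k):
    """P(X >= k)."""
    return sum(pmf(n, p, i) for i in range(k, n + 1))


no_fives = at_most(35, 0.1, 0)            # never ends in 5 in 35 days
rare_multiples = at_most(35, 0.2, 1)      # at most one multiple of 5 in 35 days
none_in_25 = at_most(25, 0.2, 0)          # no multiple of 5 in 25 "attempts"
many_fours_nines = at_least(35, 0.2, 13)  # 4 or 9 at least 13 times in 35 days

for label, value in [
    ("no fives", no_fives),
    ("at most one multiple of 5", rare_multiples),
    ("no multiples of 5 in 25", none_in_25),
    ("13+ fours or nines", many_fours_nines),
]:
    print(f"{label}: {value:.2%}")
```

These come out at roughly 2.5%, 0.4%, 0.4% and 1.4%, in line with the figures quoted in the text.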

The probability that in one set of 35 two-digit numbers (which logically should be distributed approximately uniformly) we simultaneously get by chance a pair of neighboring numbers with 6 hits and, in the rest of the distribution, a zero frequency of multiples of 5 is 0.6% × 0.4% = 0.0024%, or 1 case in 42 thousand. If we assume that it is no accident that the frequency peak falls on the two largest two-digit numbers (98 and 99), the probability estimate drops by another factor of 100. Admittedly, this is a calculation for one specific combination of oddities, and more than one such combination, each presumably pointing to drawn numbers, could be thought up.

More than one, but not thousands.

In total, we have two versions:

a) either it is just a coincidence that over the last 5 weeks (i.e., during a period when the dynamics of the epidemic improved significantly according to official statistics) the sum of detected patients across 85 regions very often gives a number ending in 99 or 98 and almost never gives a multiple of 10 or 5. The probability of each of these anomalies individually is on the order of one percent or even tenths of a percent. The probability of their occurring simultaneously is lower still;

b) or the number of patients detected per day that is then announced to citizens is not calculated by summing data from the regions, but is set from above, in the form of an instruction like “show an increase of about 8600”. Before publication this figure is then “de-rounded”, often simply by subtracting one or two, which, by the way, requires further adjusting the regional numbers or even drawing them from scratch (it is possible that the roundness of the federal numbers is masked only at the final stage, when the round federal sum has already been distributed across the regions, and then the abundance of 99 and 98 is understandable: changing two zeros at the end of the federal number to 99 or 98 is much easier than changing them to, say, 73 or any other number, since the approved regional figures need less adjustment). The round number 9200 on May 16 also fits this logic: on one day they could have forgotten about the need to mask round numbers, or they could have clumsily tried to imitate natural-looking numbers, on the grounds that round numbers should come up sometimes.

When choosing between these two versions, it should be borne in mind that falsification of the statistics on identified patients at the regional level could already be considered proven (see, for example, 12 days in a row with 96-99 identified cases in the Krasnodar Territory, 8 regions reporting 97-98 identified cases on the same day, etc.). The open question was how likely it is that these falsifications happen only at the regional level. Initially, by the way, I believed that this probability was more than 50%, but even if we had put it a priori at, say, 90% (leaving only 10% for drawing at the federal level), then after four of a kind in 99s old Bayes would have winked at us anyway: “this is almost certainly not an accident but a sign of falsification, and the falsification is happening at the federal level too.” Of the two versions above, the second looks much more plausible: the abundance of 98 and 99 in the totals is caused by an attempt to mask the suspicious "roundness" of the numbers handed down from above.
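The appeal to Bayes can be made concrete with a small calculation. This is only an illustrative sketch: the 10% prior and the 0.6% chance of the anomaly under honest summation come from the text, while the probability of seeing such an anomaly if the federal number is drawn (`p_data_if_drawn`) is a purely hypothetical value chosen for the example.

```python
def posterior(prior, likelihood_if_true, likelihood_if_false):
    """Bayes' rule for a hypothesis against its complement."""
    numerator = prior * likelihood_if_true
    return numerator / (numerator + (1 - prior) * likelihood_if_false)


prior_federal_drawing = 0.10  # from the text: only 10% a priori
p_data_if_honest_sum = 0.006  # from the text: 0.6% chance of the 98/99 anomaly
p_data_if_drawn = 0.5         # HYPOTHETICAL: not given in the text

posterior_federal_drawing = posterior(
    prior_federal_drawing, p_data_if_drawn, p_data_if_honest_sum
)
print(f"posterior probability of federal-level drawing: {posterior_federal_drawing:.1%}")
```

Under these assumptions the 10% prior rises to roughly 90%, and replacing 0.6% with the combined 0.0024% estimate pushes it above 99%. The result depends heavily on the assumed `p_data_if_drawn`, so it should be read as an illustration of the reasoning rather than a measurement.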

What are the conclusions from all this?

  • It seems that the falsification of statistics on the number of identified patients is not a set of separate local initiatives but a single multi-level system, in which the invented nationwide figure is primary and the regional figures are then adjusted to it
  • The official figures on the number of cases can be thrown in the bin: there is no reason to believe that they adequately reflect the dynamics of the epidemic. Maybe they do, maybe they don't; we cannot know. Assessing the quality of drawn numbers is impossible and absurd
  • (upd) The discovery of falsification at the federal level drastically reduces, for any region, the likelihood that its figures are even relatively honest. The figures for, say, Moscow and Novosibirsk could be considered without reference to the drawing in Krasnodar, but can they be considered without reference to the drawing of federal numbers? IMHO no
  • Even if seemingly adequate figures start coming in from Monday, this will change nothing: we (society) will have no reason to believe that they have started counting honestly rather than simply figured out how to mask their drawing better
  • Of course, resignations and an independent audit are needed, but I honestly have little idea of the scale and sequence of actions needed to restore confidence in official statistics.


I will add a few lyrical digressions:

  1. A month and a half ago, I was among those who believed that too many people at all levels were involved in compiling coronavirus statistics for the numbers to be plucked out of thin air; 3 weeks ago I was sure that numbers were being drawn only in certain regions; and even when on May 17 as many as 8 regions reported similar numbers, 97 or 98, I attributed this to like-mindedness among those drawing the numbers and not to centralized falsification. Why am I writing this? To make clear that I was not initially looking to catch obvious artifacts of number drawing in the federal statistics: their appearance ran contrary to my expectations, and analyzing and verifying them substantially changed my idea of how the drawing of coronavirus figures may be organized;
  2. In analyzing the reliability and naturalness of the numbers, I initially paid all or almost all of my attention precisely to the daily increase in the number of identified patients. That is because the other numbers are either too small (fortunately, but this makes statistical tests difficult or impossible), or are known to be counted any old how (recoveries), or are a mechanical sum of the daily increases. And it was precisely in the parameter that I had a priori considered key that the anomaly was discovered
  3. On May 8, I noticed that a number ending in 99 had appeared for the second time in a short period. Clearly this could have been a coincidence, but even then I was surprised that the coincidence involved precisely 99. When 99 came up again, I was even going to write about it, but a formal probability calculation gave completely unconvincing results. I kept observing. When 4 days later they issued a number with two zeros at the end, I relaxed altogether: I decided that the authors of the numbers had realized that 99 was a bit much and had issued a round number for contrast. And then yesterday, unexpectedly, 99 came up again. At that point I had to write [and yesterday I had dreamed of digging into search statistics; alas, it didn’t work out..."