Tag Archives: statistics

New statistical approach aims to predict when low-probability, high-impact events happen

A team of researchers from the U.S. and Hong Kong is working to develop new methods of statistical analysis that may let us predict the risks of very rare but dramatic events such as pandemics, earthquakes, or meteorite strikes happening in the future.

Image via Pixabay.

Our lives over these last two years have been profoundly marked by the pandemic — and although researchers warned us about the risk of a pandemic, society was very much surprised. But what if we could statistically predict the risk of such an event happening in advance?

An international team of researchers is working towards that exact goal by developing a whole new way to perform statistical analyses. Typically, events of such rarity are very hard to study through the prism of statistical methods, as they simply happen too rarely to yield reliable conclusions.

The method is in its early stages and, as such, hasn’t proven itself. But the team is confident that their work can help policymakers better prepare for world-spanning, dramatic events in the future.

Black swans

“Though they are by definition rare, such events do occur and they matter; we hope this is a useful set of tools to understand and calculate these risks better,” said mathematical biologist Joel Cohen, a professor at the Rockefeller University and at the Earth Institute of Columbia University, and a co-author of the study describing the findings.

The team hopes that their work will give statisticians an effective tool for analyzing datasets that contain very few data points, as is the case for very dramatic (positive or negative) events. This, they argue, would give government officials and other decision-makers a way to make informed decisions when planning for such events in the future.

Statistics is by now a tried-and-true field of mathematics. It's one of our best tools for making sense of the world around us and, generally, it serves us well. However, the quality of the conclusions statistics can draw from a dataset depends directly on how rich that dataset is and on the quality of the information it contains. As such, statistics has a very hard time dealing with events that are exceedingly rare.

That hasn't stopped statisticians from trying to apply their methods to rare-but-extreme events over the last century or so. It's still a relatively new field of research in the grand scheme of things, so we're still learning what works here and what doesn't. Just as a worker needs the appropriate tool for the job at hand, statisticians need to apply the right calculation method to their dataset; which method they employ has a direct impact on the conclusions they draw, and on how reliably those reflect reality.

Two important parameters when processing a dataset are the average value and the variance. You're already familiar with what an average value is. The variance, meanwhile, shows how far apart the values that make up that average are spread. For example, both 0 and 100, as well as 49 and 51, average out to 50; the first set, however, has a much larger variance than the second.
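To make this concrete, here's a quick Python sketch (purely illustrative, not from the study):

```python
def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    # Population variance: the average squared distance from the mean
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

a = [0, 100]
b = [49, 51]
# Both sets average out to 50, but the first is far more spread out:
# variance(a) is 2500, while variance(b) is just 1
```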

Black swan theory describes events that come as a surprise but have a major effect and then are inappropriately rationalized after the fact with the benefit of hindsight. The new research doesn’t only focus on black swans, but on all unlikely events that would have a major impact.

For typical datasets, the average value and the variance are both finite numbers. In the case of the events that are the subject of this study, however, the sheer rarity with which they take place can push these numbers towards values bordering on infinity. World wars, for example, have been extremely rare events in human history, but each one has also had an incredibly large effect, shaping the world into what it is today.

“There’s a category where large events happen very rarely, but often enough to drive the average and/or the variance towards infinity,” said Cohen.

Such datasets require new tools to be handled properly, the team argues. If we can make heads or tails of them, however, we could be much better prepared for such events and see a greater return on investments in preparedness. Governments and other governing bodies would obviously stand to benefit from having such information on hand.

Being able to accurately predict the risk of dramatic events would also benefit us as individuals and provide tangible benefits to society. From letting us better plan our lives (who here wouldn't have liked to know in advance that the pandemic was coming?), to preparing for threatening events, to giving us arguments for lower insurance premiums, such information would definitely be useful to have. If nothing bad is likely to happen during your lifetime, you could argue, wouldn't it make sense for your life insurance premiums to be lower? The insurance industry in the US alone is worth over $1 trillion, and making the system more efficient could amount to major savings.

But does it work?

The authors started from mathematical models used to calculate risk and examined whether they could be adapted to analyze low-probability, very high-impact events with infinite mean and variance. The standard approach involves semi-variances: splitting the dataset into 'below-average' and 'above-average' halves, then examining the risk in each. This alone, however, didn't provide reliable results.
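As a rough illustration of the semi-variance idea (a toy sketch, not the authors' actual code), you split the squared deviations at the mean and measure each half separately:

```python
def semivariances(xs):
    """Split the squared deviations from the mean into a below-average
    and an above-average part (each normalized by the full sample size)."""
    m = sum(xs) / len(xs)
    lower = sum((x - m) ** 2 for x in xs if x < m) / len(xs)
    upper = sum((x - m) ** 2 for x in xs if x > m) / len(xs)
    return lower, upper

# A skewed toy dataset: many small observations plus one extreme event
data = [1, 2, 2, 3, 2, 1, 3, 2, 100]
lower, upper = semivariances(data)
# The single extreme event dominates the above-average semi-variance,
# and the two parts sum to the ordinary variance
```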

What does work, the authors explain, is to compare the logarithm (log) of the average to the log of the semi-variance in each half of the dataset. Logarithms are the inverse of exponentials, just as division is the inverse of multiplication. They're a very powerful tool when dealing with very large numbers, as they simplify the picture without cutting out any meaningful data, which makes them ideal for studying the kind of numbers produced by rare events.

“Without the logs, you get less useful information,” Cohen said. “But with the logs, the limiting behavior for large samples of data gives you information about the shape of the underlying distribution, which is very useful.”
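Since the paper builds on Taylor's law of fluctuation scaling, here's a minimal sketch of the log-log idea using simulated heavy-tailed (Pareto) data; the distribution and parameters are illustrative assumptions, not the paper's:

```python
import math
import random

random.seed(42)

def upper_semivariance(xs):
    # Spread of the above-average half of the sample
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs if x > m) / len(xs)

# Draw 200 samples from a heavy-tailed Pareto distribution (shape 1.5:
# finite mean, infinite variance), recording log(mean) vs. log(semi-variance)
points = []
for _ in range(200):
    xs = [random.paretovariate(1.5) for _ in range(1000)]
    m = sum(xs) / len(xs)
    points.append((math.log(m), math.log(upper_semivariance(xs))))

# Least-squares slope of log(semi-variance) against log(mean); a roughly
# linear relationship is the signature of Taylor's law of fluctuation scaling
n = len(points)
mx = sum(x for x, _ in points) / n
my = sum(y for _, y in points) / n
slope = sum((x - mx) * (y - my) for x, y in points) / sum(
    (x - mx) ** 2 for x, _ in points)
```

The slope of that log-log fit is the kind of "limiting behavior" Cohen describes: it carries information about the shape of the underlying distribution even when the raw variance is unusable.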

While this study isn't the be-all and end-all of the topic, it does provide a strong foundation for other researchers to build upon. Although still in their infancy, the findings hold promise: right now, they're the closest we've gotten to a formula that can predict when something big is going to happen.

“We think there are practical applications for financial mathematics, for agricultural economics, and potentially even epidemics, but since it’s so new, we’re not even sure what the most useful areas might be,” Cohen said. “We just opened up this world. It’s just at the beginning.”

The paper "Taylor's law of fluctuation scaling for semivariances and higher moments of heavy-tailed data" has been published in the journal Proceedings of the National Academy of Sciences.

The statistic of the year: 90.5% of plastic has never been recycled

Mankind has produced an estimated 6,300 million metric tonnes of plastic, and very little of that has been recycled: 90.5% of it was either burned or has accumulated in dumps or in the oceans. That is the year's winning statistic, according to the UK's Royal Statistical Society (RSS).

‘It’s very concerning that such a large proportion of plastic waste has never been recycled’, says RSS President, Sir David Spiegelhalter, who chaired the Stats of the Year judging panel. ‘This statistic helps to show the scale of the challenge we all face. It has rightly been named the RSS’s ‘International Statistic of the Year’ for 2018.’

Image in public domain.

Every year, the RSS publishes what it considers to be the most “zeitgeist” statistics of the year — the most relevant and important figures. They feature vital aspects of the time (such as plastic recycling), as well as more offbeat topics, like Kylie Jenner (who slashed $1.3 billion from Snapchat’s value with a single Tweet). Dr. Jen Rogers, RSS vice-president and a member of the judging panel, said the number showed “the power of celebrity”, and it could be the “world’s most costly Tweet”. Jenner tried to mend things later by saying that Snapchat is still her “first love”, but the damage had already been done.

Another quirky statistic has to do with Jaffa Cakes. According to Rogers, there was a 16.7% reduction in the number of Jaffa Cakes in McVitie's Christmas tube, illustrating the concept of 'shrinkflation' (as reported in The Sun and Metro).

The top UK statistic was 28.7%: the peak percentage of all electricity produced in the UK that came from solar power on 30 June this year. For a brief period, solar energy was the country's main electricity source, for the first time since the Industrial Revolution. This year also marked a period of 55 hours during which the UK ran without coal.

“It’s a reflection of what are the important things facing us as a population. We are becoming more and more aware of these issues surrounding us like climate change, the relationship we have with the environment, the things we can do to help the environment,” Rogers commented.

Here are a few of the other statistics mentioned by the RSS:

  • 9.5%: the percentage point reduction in worldwide ‘absolute poverty’ over the last ten years. Extreme poverty has halved since 2008;
  • 64,946: the number of measles cases in Europe from November 2017 to October 2018. Measles has made an unlikely comeback, in part due to the decrease in vaccinations;
  • 6.4%: the percentage of female executive directors within FTSE 250 companies (in the UK).
  • 85.9%: the proportion of British trains that ran on time.

What does 5-sigma mean in science?


Credit: Pixabay.

When doing science, you can never afford certainties. A skeptical outlook will always do you good, but if that's the case, how can scientists tell whether their results are significant in the first place? Well, instead of relying on gut feeling, any researcher worth their salt will let the data speak for itself. Namely, a result is meaningful if it's statistically significant. But in order for a statistical result to be significant for everyone involved, you also need a standard against which to measure it.

When referring to statistical significance, the unit of measurement of choice is the standard deviation. Typically denoted by the lowercase Greek letter sigma (σ), it describes how much variability there is in a set of data around its mean, or average, and can be thought of as how "wide" the distribution of points or values is. A sample with a high standard deviation is more spread out, meaning it has more variability; a sample with a low standard deviation clusters more tightly around the mean.

Roll the dice

To understand how scientists use the standard deviation in their work, it helps to consider a familiar statistical example: the coin toss. The coin only has two sides, heads or tails, so the probability of getting one side or the other on a toss is 50 percent. If you flip a coin 100 times, though, chances are you won't get exactly 50 heads and 50 tails. Rather, you'll likely get something like 49 vs 51. If you repeat this 100-coin-toss test another 100 times, you'll get even more interesting results. Sometimes you'll get something like 45 vs 55, and in a couple of extreme cases 20 vs 80.

If you plot all of these coin-toss tests on a graph, you should typically see a bell-shaped curve with the highest point of the curve in the middle, tapering off on both sides. This is what you’d call a normal distribution, while the deviation is how far a given point is from the average.
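The 100-coin-toss experiment takes seconds to simulate; a quick sketch:

```python
import random

random.seed(0)

# Repeat the 100-toss test 10,000 times, recording the number of heads each time
trials = [sum(random.random() < 0.5 for _ in range(100)) for _ in range(10000)]

mean = sum(trials) / len(trials)
sd = (sum((t - mean) ** 2 for t in trials) / len(trials)) ** 0.5
# Heads counts cluster around 50, with a standard deviation of about 5,
# tracing out the bell-shaped normal distribution described above
```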

One standard deviation, or one-sigma, plotted either above or below the average value, includes 68 percent of all data points. Two-sigma includes 95 percent, and three-sigma includes 99.7 percent. Higher sigma values mean that a discovery is less and less likely to be a fluke or 'random chance'.
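These coverage figures follow directly from the normal distribution and can be reproduced with the standard error function:

```python
import math

def coverage(k):
    # Probability that a normally distributed value lands within
    # k standard deviations of the mean
    return math.erf(k / math.sqrt(2))
```

Here coverage(1), coverage(2), and coverage(3) come out at roughly 0.683, 0.954, and 0.997, the familiar 68-95-99.7 rule.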

Here's another way to look at it. The mean human IQ is 100. About 68 percent of the population falls within one standard deviation of the mean (one-sigma), and another 27.2 percent falls between one and two standard deviations from it, being either fairly bright or rather intellectually challenged depending on which side of the bell curve they are on. About 2.1 percent of the population sits between two and three standard deviations above the mean; these are brilliant people. And around 0.1% of the population lies more than three standard deviations above the mean: the geniuses.

Worthy mention: the p-value

The standard deviation becomes an essential tool when testing the likelihood of a hypothesis. Typically, scientists construct two hypotheses: one where, say, two phenomena A and B are not connected (the null hypothesis), and one where A and B are connected (the research hypothesis).

Scientists first assume the null hypothesis is true, because that's the most intellectually conservative thing to do, and then calculate the probability of obtaining data as extreme as the kind they're observing. This calculation yields the p-value. A p-value close to zero signals that the null hypothesis is likely false, and typically that a difference very likely exists. Large p-values (p is expressed as a value between 0 and 1) imply that there is no detectable difference for the sample size used. A p-value of .05, for example, indicates that you would have only a 5% chance of obtaining a result at least as extreme as the one observed if the null hypothesis were actually true. Depending on the field (typically psychology and the other social sciences), you'll see papers use the p-value to illustrate statistical significance, while math and physics will employ sigma.
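For a normally distributed test statistic, the two-sided p-value can be computed directly from the z-score (a minimal sketch; real analyses would lean on a statistics library):

```python
import math

def two_sided_p(z):
    # Probability, assuming the null hypothesis is true, of a result at
    # least z standard deviations from the mean in either direction
    return math.erfc(abs(z) / math.sqrt(2))
```

For instance, two_sided_p(1.96) comes out at just under 0.05, which is why 1.96 standard deviations is the conventional cutoff for significance.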


The probabilities of a value lying within 1-sigma, 2-sigma and 3-sigma of the mean for a normal distribution. Credit: Wikimedia Commons.


Don’t be so sure

Sometimes just two standard deviations above or below the average, which gives a 95 percent confidence level, is reasonable. Two-sigma is, in fact, standard practice among pollsters and the deviation is directly related to that “margin of sampling error” you’ll often hear reporters mention — in this case it’s 3 percent. If a poll found that 55 percent of the entire population favors candidate A, then 95 percent of the time, a second poll that samples the same number of (random) people will find candidate A is favored somewhere between 52 and 58 percent.
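That 3 percent margin is what polling roughly a thousand people buys you; the sample size needed for a given margin at two-sigma confidence can be sketched as:

```python
import math

def sample_size(margin, z=1.96, p=0.5):
    # Respondents needed so the margin of sampling error stays within
    # `margin` at the confidence level implied by z (1.96 = two-sigma, 95%);
    # p = 0.5 is the worst case, maximizing the variance of a proportion
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)
```

Here sample_size(0.03) gives 1068 respondents for a 3 percent margin of error; halving the margin roughly quadruples the required sample.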

The table below summarizes various σ levels down to two decimal places.

σ        Confidence that result is real
1.5 σ    93.32%
2 σ      97.73%
2.5 σ    99.38%
3 σ      99.87%
3.5 σ    99.98%
> 4 σ    100% (almost)
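The values in the table are one-sided confidence levels, i.e. the probability that a standard normal value falls below the given σ; you can verify them with:

```python
import math

def one_sided_confidence(sigma):
    # P(Z < sigma) for a standard normal variable: the confidence that a
    # result at the given sigma level is not just an upward noise fluctuation
    return 0.5 * (1 + math.erf(sigma / math.sqrt(2)))
```

For example, one_sided_confidence(1.5) ≈ 0.9332 and one_sided_confidence(3) ≈ 0.9987, matching the table's 93.32% and 99.87%.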


For some fields of science, however, 2-sigma isn't enough, and neither is 3- or 4-sigma for that matter. In particle physics, for instance, scientists work with millions or even billions of data points, each corresponding to a high-energy proton collision. In 2012, CERN researchers reported the discovery of the Higgs boson, and press releases tossed the term 5-sigma around. Five-sigma corresponds to a p-value, or probability, of 3×10⁻⁷, or about 1 in 3.5 million. This is where you need to put your thinking cap on, because 5-sigma doesn't mean there's a 1 in 3.5 million chance that the Higgs boson is real or not. Rather, it means that if the Higgs boson doesn't exist (the null hypothesis), there's only a 1 in 3.5 million chance the CERN data would be at least as extreme as what they observed.
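The '1 in 3.5 million' figure is simply the one-sided tail probability beyond five standard deviations:

```python
import math

# One-sided tail probability beyond 5 sigma for a standard normal variable
p = 0.5 * math.erfc(5 / math.sqrt(2))
odds = 1 / p
# p is roughly 2.9e-7, i.e. about 1 in 3.5 million
```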

Sometimes 5-sigma isn't enough to be 'super sure' of a result. Not even six sigma, which roughly translates to one chance in half a billion that a result is a random fluke. Case in point: in 2011, another CERN experiment called OPERA found that nearly massless neutrinos appeared to travel faster than light. This claim, which bore 6-sigma confidence, was rightfully controversial because it directly violates Einstein's theory of relativity, which says the speed of light is constant for all observers and nothing can travel faster than it. Later, four independent experiments failed to reproduce the result with the same level of confidence, and OPERA scientists concluded their original measurement could be written off as the product of a faulty element in the experiment's fiber-optic timing system.

So bear in mind, just because a result falls inside an accepted interval for significance, that doesn’t necessarily make it truly significant. Context matters, especially if your results are breaking the laws of known physics.


How culture migrated and expanded from city to city in the past 2,000 years

Using nothing but birth and death records, sociologists at Northeastern University developed a working framework that details the migration patterns of some of humanity's most notable intellectuals in North America and Europe over the past 2,000 years. The data allowed the researchers to identify the major cultural centers on the two continents over two millennia. Rome, Paris, London, and New York rank among history's most prolific cultural centers.

A history of culture

The migration of the world’s intellectuals traced back in history.

The researchers relied extensively on big datasets, such as the General Artist Lexicon, which consists exclusively of artists and includes more than 150,000 names, and Freebase, with roughly 120,000 individuals, 2,200 of whom are artists. Using network tools and complexity theory, the researchers drew migration patterns that helped paint a broad picture of how culture converged and migrated from hub to hub, retracing the cultural narratives of Europe and North America.

“By tracking the migration of notable individuals for over two millennia, we could for the first time explore the boom and bust of the cultural centers of the world,” said Albert-László Barabási, Robert Gray Dodge Professor of Network Science and director of Northeastern’s Center for Complex Network Research. “The observed rapid changes offer a fascinating view of the transience of intellectual supremacy.”

For example, Rome was a major cultural hub until the late 18th century, at which point Paris took over the reins. Around the 16th century, at least in Europe, two distinct regimes could be identified: countries with intellectual 'monster hubs' that attract a substantial and constant flow of intellectuals (e.g., Paris, France), and a more dispersed regime in which the cities of a federal region (e.g., Germany) compete with each other for their share of intellectuals; each such city is clearly outnumbered by the monster hubs but sits well above average, compensating in numbers.

Where culture goes to die

The dawn of the 20th century saw New York become not only a bustling cultural center where many intellectuals flocked, but also a fantastic breeding ground where many notable figures of the time were born. Additionally, locations like Hollywood, the Alps, and the French Riviera, which have not produced large numbers of notable figures, became, at different points in history, major destinations for intellectuals, perhaps initially emerging for reasons such as the location's beauty or climate.

“We’re starting out to do something which is called cultural science, where we’re in a very similar trajectory as systems biology, for example,” said Maximilian Schich, now an associate professor in arts and technology at the University of Texas at Dallas. “As data sets about birth and death locations grow, the approach will be able to reveal an even more complete picture of history. In the next five to 10 years, we’ll have considerably larger amounts of data and then we can do more and better, address more questions.”

Possibly the most interesting tidbit from the study is that, over the past eight centuries, the distance people migrate has not increased considerably, despite major transportation advancements (motor cars, trains) and extensive colonization. The findings seem to support Ernst Georg Ravenstein's empirical observations of migration patterns in the 19th century: most migrants do not go very far, those who do aim for big cities, urban centers grow more from immigration than from procreation, and so on.

The findings were reported in the journal Science. Below you can watch a beautiful time lapse video of how culture migrated in history.


Beautiful people earn $250,000 extra on average

It's generally known that people of above-average physical looks are at a greater social advantage than people of average or below-average appearance. Beautiful people are known to be more successful, happier, and more financially fulfilled. As for that last part, there has always been controversy about the economics behind this kind of superficial advantage.

Renowned economist Daniel Hamermesh of the University of Texas at Austin decided to explore the concept and provide an insightful view into the correlation between one's physical appearance and income in his recently published book, Beauty Pays.

“In economic terms, beauty is scarce. People distinguish themselves and pay attention to beauty,” Hamermesh says. “Most of us want to look better so we can make more money. Companies realize that hiring better-looking people helps in various ways. In every market, whether it’s jobs or marriage, beauty matters.”

In the most comprehensive study of its kind to date, Hamermesh gathered and correlated data from both his own research and that of numerous other scientists to paint an accurate picture of the economics behind beauty. Beauty is indeed in the eye of the beholder; however, most people can generally agree on what counts as attractive.

After controlling for a host of confounding factors, such as by comparing people of similar background and education but different physical appearance, Hamermesh was able to make some concrete claims. He found that the best-looking one-third of the population makes 5 percent more money than average-looking people and 10 to 12 percent more than the worst-looking people. This doesn't mean that better-looking people automatically get a bigger salary; rather, it's a direct consequence of the fact that beautiful people have an easier time landing better-paying jobs and advancing up the social ladder. Connections are quite probably the most important asset in the business environment, and a good-looking individual will generally manage to do better socially.

One of the economist's leading claims, and perhaps the strongest evidence of the financial weight of physical appearance, is that the best-looking people earn an extra $250,000, on average, over their careers compared with the least attractive, and are more likely to remain employed, get promoted, and even secure loans.

Surprisingly enough, beauty affects the earnings of men in the labor market more than those of women, apparently because women have more options outside the workforce. Beautiful women statistically tend to marry high-earning men, an economic factor which contributes to the documented trend that good-looking people are happier.

This doesn't mean, however, that people of average or below-average appearance are miserable or don't succeed in life, Hamermesh says.

“Take advantage of other things: brains, brawn, personality,” he says. “This is the economic theory of ‘comparative advantage.’ You work off the things you’re good at and if looks isn’t one of them, you try to de-stress that.”