Statistics for Machine Learning — I

A Beginner’s Guide to Descriptive Statistics

Ankan Sharma
Towards Dev


Statistics makes up a sizable chunk of the journey of studying Machine Learning, yet we often avoid it, either because it doesn’t sound as fancy as “Random Forest” or “Support Vector Machine”, or because of the scary-looking formulas with weird notations.
This series of blogs is an effort to lower the entry barrier into statistics and to summarize the statistical concepts used in Machine Learning.

The segment of statistics that is heavily used in Machine Learning is Applied Statistics. Applied statistics can mainly be divided into two branches —
a. Descriptive Statistics
b. Inferential Statistics

Descriptive Statistics

Overview

As the term suggests, “descriptive” means describing data using statistics. Descriptive statistics are used to study the characteristics of a dataset. Given sample or population data, they help us understand the mean, the median, how much variance there is in the dataset, what distribution it follows, etc.

Population vs Sample Data | Credit: Quizlet

**Sample data is a subset of a larger dataset, i.e. the population data.

How are Descriptive Statistics different from Inferential Statistics?
While descriptive statistics help us understand the characteristics of the data at hand, inferential statistics help us generalize from the sample data to the larger dataset (the population) of which the sample is a part.
e.g. The average (mean) height of students in Class V ~ Descriptive Stats.
A vaccine is tested on a sample of 40,000 people to determine its efficacy for the entire country’s population ~ Inferential Stats.

Statistical Data Types


Data types in statistics can be divided into two main parts —
a. Categorical
b. Numerical

A. Categorical
This type of data includes categories like male, female, strawberry ice cream, chocolate ice cream, etc.
1. Nominal — These are discrete, orderless data with no quantitative value.
e.g. Are you male or female? Here, male and female are nominal data.
2. Ordinal — These are also discrete data but with a sense of order.
e.g. Happiness meter, socio-economic status, ranks in an organization, etc.

B. Numerical
These are data types with numerical values.
1. Interval — Ordered data with the same difference between any two consecutive units. These data are compared using differences.
e.g. Temperature in Celsius: the gap between 20°C and 30°C equals the gap between 30°C and 40°C, but 0°C does not mean “no temperature”.
2. Ratio — It has the same properties as interval data, but it also has a “true zero”, i.e. zero means a complete absence of the quantity. These data are compared using ratios.
e.g. Age is ratio data, as it has a true zero value.
3. Discrete — These data take specific, separate, fixed values, like the number of students in a class.
4. Continuous — These data can be measured and can take any value between two given points.
e.g. The speed of a bike is 86.7 km/h.

Discrete (left) and Continuous (right)

Topics to be Covered

  1. Measures of Central Tendency
  2. Measures of Dispersion
  3. Measures of Shape
  4. Outliers
  5. Shannon’s Entropy
  6. Miscellaneous

Measures of Central Tendency

The Central Tendency of a dataset is a value that describes the central location of a dataset.
There are 3 ways to measure the central tendency of a dataset —

a. Mean
A fancy word for “Average”: the sum of all the data points in a dataset divided by the total number of data points.

Formula of Mean: μ = (Σ x) / N

μ = Mean of Population Data
Σ x = Summation of all the data points
N = Total number of data points

Failure Cases:
1. Outliers heavily influence the mean.
2. The mean is also not a good measure of centrality for skewed distributions like the F-distribution.

b. Median
The value that lies in the mid-index of a sorted dataset is the Median. The median is valid for a unimodal distribution (a distribution with a single peak).

Median = the ((n + 1) / 2)-th value of the sorted dataset if n is odd, or the average of the (n / 2)-th and (n / 2 + 1)-th values if n is even.

c. Mode
It returns the value with the highest number of occurrences in the dataset.
E.g: [33,4,5,6,33,54,6,33,6,33,98,111] → the mode is 33, as it occurs 4 times in the dataset.

Formulae of Mode | Credit: Math StackExchange
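
All three measures are available out of the box in Python. Below is a minimal sketch using the built-in statistics module, reusing the dataset from the mode example above.

```python
import statistics

data = [33, 4, 5, 6, 33, 54, 6, 33, 6, 33, 98, 111]

print(statistics.mean(data))    # ~35.17 -> pulled up by the large values 98 and 111
print(statistics.median(data))  # 33 -> middle of the sorted data, robust to outliers
print(statistics.mode(data))    # 33 -> the most frequent value
```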

Measures of Dispersion

The concept of dispersion depicts how spread out the data points are around the central tendency. The variance or the standard deviation describes the dispersion of a dataset.

a. Variance

σ² = Σ (x − μ)² / N (population variance)
s² = Σ (x − x̄)² / (n − 1) (sample variance)

σ² = Population Variance
μ = Population Mean
N = Population size
s² = Sample Variance
x̄ = Sample Mean
n = Sample size

Why even bother with such complicated formulae when we can just use the range to get the spread of the dataset?
Because the range depends only on the two extreme values, it is far more prone to outliers, and it ignores all the other data points when measuring spread.

Why subtract the mean from each data point → (x − μ)?
Dispersion in a dataset is measured with respect to its central tendency, i.e. the mean. If we don’t mean-center, the result shoots up with the order of magnitude of the data instead of reflecting its spread.
E.g. X = [1,3,4,5,6], Y = [1001,1003,1004,1005,1006]
Without mean-centering, these two datasets would output very different numbers, but in reality both have exactly the same variance.
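
A quick sketch to verify this with Python’s statistics module: after mean-centering (which pvariance does internally), both datasets report the same population variance.

```python
import statistics

X = [1, 3, 4, 5, 6]
Y = [1001, 1003, 1004, 1005, 1006]

# pvariance subtracts the mean before squaring, so the magnitude
# of the raw values does not matter, only their spread does.
print(statistics.pvariance(X))  # 2.96
print(statistics.pvariance(Y))  # 2.96 -> identical
```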

Why is sample variance divided by (n − 1), unlike population variance, which is divided by N?
Dividing by (n − 1) makes the sample variance an unbiased estimator of the population variance. Sounds high-level, so let me explain with an example.

Imagine a skewed population, and suppose we pick our sample from one narrow region of it. We then calculate the sample mean, which lies within the chosen sample’s range. But because the population is skewed, the population mean lies outside that range.
So if you plug this into the formula, (x − sample_mean) will produce small values, and the variance will come out significantly smaller than the population variance.
That’s why sample variance is divided by (n − 1): to inflate it slightly and make it, on average, approximately equal to the population variance.
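
Here is a small simulation sketch of that claim: averaged over many random samples, the divide-by-n estimate undershoots the true population variance, while the divide-by-(n − 1) estimate lands close to it.

```python
import random
import statistics

random.seed(0)
population = [random.gauss(0, 10) for _ in range(100_000)]
true_var = statistics.pvariance(population)  # ~100

biased, unbiased = [], []
for _ in range(5_000):
    sample = random.sample(population, 10)
    m = statistics.mean(sample)
    ss = sum((x - m) ** 2 for x in sample)
    biased.append(ss / 10)    # divide by n
    unbiased.append(ss / 9)   # divide by n - 1 (Bessel's correction)

print(true_var)                   # true population variance
print(statistics.mean(biased))    # systematically smaller (~90% of true_var)
print(statistics.mean(unbiased))  # close to true_var
```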

b. Standard Deviation(S.D)
Standard deviation is the square root of variance.

σ = √(σ²) (population S.D) | s = √(s²) (sample S.D)

c. Coefficient of Variation(C.V)
This gives the dispersion of the data points relative to their mean. Whereas S.D gives absolute dispersion, C.V gives relative dispersion.

C.V = (σ / μ) × 100

σ = Population Standard Deviation
μ = Population Mean

E.g. X = [1,3,4,5,6], Y = [1001,1003,1004,1005,1006]
X and Y have the same S.D of 1.92, but C.V of X = 50.62 while C.V of Y = 0.19.
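
A short sketch reproducing those numbers (stdev here is the sample S.D, matching the values above):

```python
import statistics

X = [1, 3, 4, 5, 6]
Y = [1001, 1003, 1004, 1005, 1006]

sd_x, sd_y = statistics.stdev(X), statistics.stdev(Y)

# Same absolute dispersion...
print(round(sd_x, 2), round(sd_y, 2))  # 1.92 1.92

# ...but very different dispersion relative to the mean.
print(round(sd_x / statistics.mean(X) * 100, 2))  # 50.62
print(round(sd_y / statistics.mean(Y) * 100, 2))  # 0.19
```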

Measures of Shape

It describes the shape of the distribution of the dataset. Data distributions can be symmetric or asymmetric. A symmetric distribution looks like a perfect bell curve, such as the theoretical Gaussian distribution. Asymmetric distributions include the F-distribution, the chi-square distribution, etc.
In a perfectly symmetric distribution, the mean, median, and mode all overlap.

No Skew

a. Skewness
Asymmetry in a data distribution is defined by Skewness. There are two types of skewness —
a. Positive Skewness or Right Skewed
b. Negative skewness or Left Skewed

Positive Skewness — This happens when there are outliers on the right side of the distribution. The outliers pull the mean to the right, so the mean is the largest of the three measures in a positively skewed dataset.
Mode < Median < Mean

Positive Skewness

Negative Skewness — Vice versa to positive skewness: the outliers are on the left of the distribution, dragging the mean down.
Mean < Median < Mode

Negative Skewness

The mean, median, and mode spread further apart as the skewness of the dataset increases.

Pearson mode formula for skewness: Skewness = (Mean − Mode) / S.D

The above formula is usually modified, as the mode is not a good measure of central tendency. So, the mode is replaced using the relation below, which holds for approximately (moderately) skewed distributions.
Mode = 3 × Median − 2 × Mean

Modified Pearson formula for skewness: Skewness = 3 × (Mean − Median) / S.D

**Most statistical packages use the adjusted Fisher-Pearson standardized moment coefficient for skewness. The derivation of these formulae is out of the scope of this blog.
To get more depth on skewness, check out this link.

Adjusted Fisher-Pearson moment coefficient: G₁ = [√(n(n − 1)) / (n − 2)] × m₃ / m₂^(3/2), where m₂ and m₃ are the second and third sample moments.
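
For reference, SciPy exposes this coefficient directly: scipy.stats.skew with bias=False applies the adjusted Fisher-Pearson correction. A minimal sketch on a made-up right-skewed dataset:

```python
from scipy.stats import skew

data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20]  # one large outlier on the right

# bias=False -> adjusted Fisher-Pearson standardized moment coefficient
print(skew(data, bias=False))  # positive value -> right (positively) skewed
```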

b. Kurtosis
If we follow textbooks, kurtosis is defined as the “peakedness” of a distribution.

Fig: 1

In figure 1 above, both the blue and pink distributions have the same mean, the same standard deviation, and zero skew. So how do you differentiate them mathematically?
The answer is “Kurtosis”.
The blue distribution is more peaked, with fatter tails at both ends. Kurtosis captures exactly this property.

Kurtosis = m₄ / m₂², where m₂ = Σ (x − x̄)² / n and m₄ = Σ (x − x̄)⁴ / n

The above formula is derived from statistical moments and then standardized for sample data.
Kurtosis can range from 1 to infinity.

Types of Kurtosis —
a. Leptokurtic — kurtosis > 3, i.e. a pointy peak with fatter tails.
b. Mesokurtic — kurtosis = 3, i.e. a perfect normal distribution.
c. Platykurtic — kurtosis < 3, i.e. a broader peak with thinner tails.

Types of Kurtosis

**Controversy — Peakedness doesn’t matter; it’s all about the tails. If you look closely at the formula of kurtosis, the (x − μ)⁴ part of the numerator explodes the kurtosis when deviations are large (outliers), and diminishes it a lot when they are small, like fractions.
So the data points in the tails influence kurtosis heavily, and it doesn’t matter much if there are lots of data points in the central location making the curve broader.
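
A small sketch with SciPy to see the three regimes. Note that scipy.stats.kurtosis returns excess kurtosis (kurtosis − 3) by default; passing fisher=False gives the raw value used in the classification above.

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(0)
normal = rng.normal(size=100_000)           # mesokurtic reference
heavy = rng.standard_t(df=5, size=100_000)  # Student's t: fatter tails
flat = rng.uniform(size=100_000)            # uniform: thin tails, broad "peak"

print(kurtosis(normal, fisher=False))  # ~3   -> mesokurtic
print(kurtosis(heavy, fisher=False))   # >3   -> leptokurtic
print(kurtosis(flat, fisher=False))    # ~1.8 -> platykurtic
```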

Outliers

Outliers are those “evil” data points that disturb the overall pattern of the dataset.
E.g. “Sharma ji ka beta” — that guy in the class who gets 99% in the exam while the others hardly pass. (I am a “Bengali” Sharma, so the above example doesn’t fit me.)

Effects of Outlier

In the above figure, you can see how the linear model gets affected by an outlier.
There are different detection and removal techniques used to deal with outliers, like the Z-score method, IQR filtering, Euclidean distance, etc.
**It is not always right to remove outliers. Handle them according to your needs.
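
As a concrete example, here is a sketch of the IQR filtering rule mentioned above: any point outside [Q1 − 1.5·IQR, Q3 + 1.5·IQR] is flagged as an outlier. The exam-marks data is made up for illustration.

```python
import numpy as np

# Hypothetical exam marks; 99 is our "Sharma ji ka beta"
marks = np.array([12, 14, 15, 15, 16, 17, 18, 19, 20, 99])

q1, q3 = np.percentile(marks, [25, 75])
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = marks[(marks < low) | (marks > high)]
print(outliers)  # [99]
```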

Shannon’s Entropy

The concept of information entropy was introduced by Claude Shannon. The idea behind Shannon’s entropy is to quantify the uncertainty or surprise in an event.

Let’s say you are given 3 boxes with brown and black balls in them. Now, you are asked to pick one brown ball from each box. Picking from which box carries the most uncertainty of getting a brown ball?
The answer is the green box, while the lowest uncertainty comes from the blue box.

What we intuitively understand is that uncertainty or surprise increases when the probability is low, i.e. uncertainty is inversely proportional to probability.

Shannon’s Entropy Formula: H(X) = −Σ p(x) log₂ p(x)

From the above formula, we can say that Shannon’s entropy is the expected value of the uncertainty, i.e. of the surprise term −log₂ p(x), over all outcomes.
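
A tiny sketch of the ball-box intuition. The exact proportions in each box aren’t given, so the 50/50 and 90/10 splits below are assumptions chosen for illustration: an even mix maximizes entropy, a lopsided one minimizes it.

```python
import math

def entropy(probs):
    # H(X) = -sum(p * log2(p)); p = 0 terms contribute nothing
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))  # 1.0 bit   -> maximum uncertainty ("green box", assumed 50/50)
print(entropy([0.9, 0.1]))  # ~0.47 bit -> low uncertainty ("blue box", assumed 90/10)
```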

Miscellaneous

A. Inter-Quartile Range
Quartiles divide the sorted dataset into four equal parts. The median cuts the dataset into two halves, so the median is a special case of a quartile. It is also called Quartile 2.

IQR

The Inter-Quartile Range lies between Quartile 1, the 25% mark, and Quartile 3, the 75% mark. It measures how spread out the middle 50% of the data is between those two marks.

B. Degrees Of Freedom
The number of independent values that are free to vary while performing a statistical calculation over the dataset is called the Degrees of Freedom.
Let’s say you are playing a 3-cup game*. You can say for sure which cup the ball is under if and only if the contents of two cups are revealed.
So, in this game, your degrees of freedom are 2, i.e. only the contents of two cups can vary freely; the third is then fixed by the outcome.

*3-cup game (shell game) — you are given three cups and there is one ball under one cup. After the cups are shuffled, you have to guess which cup hides the ball.

Let’s look at it from a statistically relevant perspective.
Why is the sample mean divided by n but the sample variance by n − 1?
I answered this question above from a different approach. Now let’s look at it through degrees of freedom.
While calculating the sample mean, all n data points are free to vary, so the degrees of freedom are n. But for a dataset of n points, once n − 1 values and the mean are known, the nth value is fully determined.
The sample variance already uses the sample mean, so only n − 1 deviations are independent. Dividing by n would underestimate the variance by counting that redundant nth data point, which is why we divide the sample variance by n − 1, unlike the sample mean.
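
A one-liner sketch of that constraint: given the mean and any n − 1 values, the last value can always be recovered, so it carries no extra information.

```python
values = [2.0, 4.0, 6.0, 8.0]          # n = 4
mean = sum(values) / len(values)        # 5.0

known = values[:-1]                     # reveal only n - 1 values
nth = mean * len(values) - sum(known)   # the last point is pinned down
print(nth)                              # 8.0 -> equals values[-1]
```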

**Degrees of Freedom will appear heavily in inferential statistics, e.g. ANOVA, the t-test, etc.

This brings us to the end of the first part of the “Statistics for Machine Learning” series.
You can check out “Statistics for Machine Learning — II” here to see the code implementation of the above topics in Python.

I wrote a blog on one of the most influential research papers in computer vision — ALEXNET. You can check that out here.

“Consumers are statistics. Customers are people.” — Stanley Marcus
