Statistics for Machine Learning — II
Code Implementation of Descriptive Statistics in Python
In the previous part, “Statistics for Machine Learning — I”, I discussed the theory of Descriptive Statistics. In this part, I will focus on implementing those topics in Python in a Jupyter Notebook.
**In this blog, I won't explain any theoretical concepts; it will be completely code-focused. For the theory, check out the first part.**
Topics to be Covered
A. Measures of Central Tendency
B. Measures of Dispersion
C. Measures of Shape
D. Shannon’s Entropy
Measures of Central Tendency
1. Mean
In the above code, I am sampling randomly from a normal (Gaussian) distribution with mean zero and standard deviation one, then plotting it on a distribution plot. The purple line marks the mean of the distribution.
For the outlier part, I sampled from a random normal distribution and then replaced some data points with their fifth power to create outliers.
Now, if we compare the means, it's evident that the mean has shifted because of the outliers present in the “x_failure_outlier” distribution.
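The notebook cells themselves aren't reproduced here, so here is a minimal sketch of the idea. The `x_failure_outlier` name comes from the text; the seed, sample size, and the choice of which points to raise to the fifth power are my assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)

# Sample 1,000 points from a standard normal distribution
# (mean 0, standard deviation 1).
x_failure = rng.normal(loc=0, scale=1, size=1000)

# Contaminate a copy: raise the points in the right tail to the
# fifth power so they become extreme outliers.
x_failure_outlier = x_failure.copy()
tail = x_failure_outlier > 2
x_failure_outlier[tail] = x_failure_outlier[tail] ** 5

print("Mean without outliers:", x_failure.mean())
print("Mean with outliers:   ", x_failure_outlier.mean())
```

The contaminated mean is pulled toward the outliers, which is exactly why the mean is a fragile measure in their presence.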
2. Median
In the first half of the code, the median is a good measure of central tendency as the distribution is unimodal.
But for the next half, I concatenated two unimodal distributions to create a bi-modal distribution. Now, when I calculate the median, it falls in the middle of the two peaks, in a region with almost no data. So the median is not a good measure of central tendency for bi-modal or multi-modal distributions.
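Both halves of the cell can be sketched as follows (the seed, sample sizes, and the locations of the two peaks are my assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Unimodal case: the median lands near the single peak.
unimodal = rng.normal(loc=0, scale=1, size=1000)
print("Unimodal median:", np.median(unimodal))

# Bi-modal case: concatenate two well-separated normal distributions.
bimodal = np.concatenate([
    rng.normal(loc=-5, scale=1, size=1000),
    rng.normal(loc=5, scale=1, size=1000),
])
# The median falls between the two peaks, where almost no data lives.
print("Bi-modal median:", np.median(bimodal))
```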
3. Mode
In cell 2, I sampled from a normal distribution and then binned the values (by rounding each up to its bin's upper bound). In cell 5, I calculated the mode; the bin with the highest frequency is the mode of the distribution.
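Since cells 2 and 5 aren't shown, here is a sketch of the same steps, assuming unit-width bins whose upper bounds are integers:

```python
import numpy as np

rng = np.random.default_rng(1)

# Sample from a normal distribution and bin the values by
# rounding each one up to the next integer (the bin's upper bound).
samples = rng.normal(loc=0, scale=1, size=1000)
binned = np.ceil(samples)

# The mode is the bin with the highest frequency.
values, counts = np.unique(binned, return_counts=True)
mode_bin = values[np.argmax(counts)]
print("Mode bin:", mode_bin, "with count:", counts.max())
```

For a standard normal, the bins ending at 0 and at 1 each capture about 34% of the data, so the mode lands on one of those two bins.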
Measures of Dispersion
1. Variance
In cell 3, I calculated the variance with “np.var”. The “ddof” parameter is the delta degrees of freedom, and for sample variance it is 1 (so the sum is divided by n − 1). By default, in NumPy, it's zero.
In cell 4, I am calculating the variance manually. First, I mean-center the data, then square and sum the values, then divide by (sample size − 1).
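A sketch of both routes described above (seed and sample size are my assumptions):

```python
import numpy as np

rng = np.random.default_rng(7)
data = rng.normal(loc=0, scale=1, size=1000)

# Sample variance with NumPy: ddof=1 divides by (n - 1).
# NumPy's default, ddof=0, gives the population variance.
var_np = np.var(data, ddof=1)

# Manual route: mean-center, square, sum, divide by (n - 1).
centered = data - data.mean()
var_manual = np.sum(centered ** 2) / (len(data) - 1)

print("np.var(ddof=1):", var_np)
print("manual:        ", var_manual)
```

Both values agree, and for data drawn with standard deviation 1 they land close to 1.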
2. Standard Deviation
In cell 4, I plotted multiples of the standard deviation away from the mean. The data between the purple lines lie within 1 standard deviation of the mean and make up 68.2% of the data. The red lines sit 2 standard deviations from the mean and enclose 95.4% of the data, and the black dotted lines sit 3 standard deviations from the mean and enclose 99.7%.
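The plot itself isn't shown here, but the 68.2 / 95.4 / 99.7 percentages can be verified numerically with a short sketch (seed and sample size assumed):

```python
import numpy as np

rng = np.random.default_rng(3)
data = rng.normal(loc=0, scale=1, size=100_000)

mu, sigma = data.mean(), data.std()

# Fraction of points within k standard deviations of the mean;
# these are the bands between the purple, red, and black lines.
coverage = {k: np.mean(np.abs(data - mu) <= k * sigma) for k in (1, 2, 3)}
for k, frac in coverage.items():
    print(f"within {k} std: {frac:.1%}")
```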
3. Inter-Quartile Range
I created two types of distributions: the first is a normal distribution and the second is a log-normal distribution.
By plotting the box-whisker plot, I demarcated the data into quartiles; 50% of the data lies between the 1st and 3rd quartiles. The black dots beyond the whiskers are outliers.
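The boxplot image isn't reproduced here; the following sketch computes the same quartile and outlier logic numerically, using the standard 1.5 × IQR whisker convention (seed and sizes are my assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
normal_data = rng.normal(loc=0, scale=1, size=1000)
lognormal_data = rng.lognormal(mean=0, sigma=1, size=1000)

def iqr_outliers(x):
    # 50% of the data lies between the 1st and 3rd quartiles.
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    # Standard box-whisker convention: points beyond 1.5 * IQR
    # past the quartiles are drawn as outlier dots.
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return np.sum((x < low) | (x > high))

print("normal outliers:    ", iqr_outliers(normal_data))
print("log-normal outliers:", iqr_outliers(lognormal_data))
```

The heavy right tail of the log-normal distribution produces far more outlier dots than the normal one does.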
Measures of Shape
Skewness & Kurtosis
To calculate skewness, I sampled from a chi-square distribution. For a chi-square distribution, the mean equals the degrees of freedom. As the degrees of freedom increase, the distribution tends toward a normal distribution. The relationship between the chi-square and normal distributions is beyond the scope of this blog.
For kurtosis, I sampled from three different distributions: “uniform” for platykurtic, “normal” for mesokurtic, and “exponential” for leptokurtic.
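Both measurements can be sketched with `scipy.stats` (seed, sample sizes, and the specific degrees of freedom are my assumptions; `stats.kurtosis` uses the Fisher definition, i.e. excess over the normal's 3):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)

# Skewness: chi-square is right-skewed, and the skew shrinks as
# the degrees of freedom (which equal the distribution's mean) grow.
skews = {df: stats.skew(rng.chisquare(df, size=10_000)) for df in (2, 10, 50)}
for df, s in skews.items():
    print(f"chi-square df={df}: skew={s:.2f}")

# Excess kurtosis: negative, near zero, and positive respectively.
kurt = {
    "uniform (platykurtic)": stats.kurtosis(rng.uniform(size=10_000)),
    "normal (mesokurtic)": stats.kurtosis(rng.normal(size=10_000)),
    "exponential (leptokurtic)": stats.kurtosis(rng.exponential(size=10_000)),
}
for name, k in kurt.items():
    print(f"{name}: {k:.2f}")
```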
Shannon’s Entropy
In the above code, we pick a number from 0 to 5 at random 1000 times. Then we calculate the entropy of the event by summing, over all outcomes, the product of each outcome's probability and its surprise. Epsilon is a very small value added to avoid log(0).
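A sketch of that calculation (the base-2 logarithm, seed, and epsilon value are my assumptions; the text doesn't state them):

```python
import numpy as np

rng = np.random.default_rng(11)

# Pick a number from 0 to 5, uniformly at random, 1000 times.
draws = rng.integers(low=0, high=6, size=1000)

# Empirical probability of each outcome.
_, counts = np.unique(draws, return_counts=True)
probs = counts / counts.sum()

# Entropy: sum over outcomes of probability * surprise, where the
# surprise is -log2(p). Epsilon avoids taking log(0).
eps = 1e-12
entropy = -np.sum(probs * np.log2(probs + eps))
print(f"Entropy: {entropy:.3f} bits (max for 6 outcomes: {np.log2(6):.3f})")
```

Because the draws are close to uniform, the empirical entropy comes out just under the theoretical maximum of log2(6) ≈ 2.585 bits.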
To get a theoretical understanding and formula check the first part of the blog.
This brings us to the end of this short blog. In the next part, I will start with Inferential Statistics.
“Statistics don't lie. It’s the people who make up the statistics that lie.” — George Buck.