statistics python standard deviations

I showed my statistics teacher my Python file for making stem and leaf plots, and asked her if I could use the Python programs I write on my stats tests. I was worried she'd say no, but even if she didn't want me to use them, I could manage, and I would still be writing them anyway, because this helps me learn.

This is a file with three different functions in it: one finds the standard deviation of a sample, one finds the standard deviation of a population, and one finds the z-score of a value.

# Scroll to the bottom if you just want the code#


The math is not easy to explain, but basically, you want to know:
Deviation is the typical difference between the numbers in the data set and the center value, which is the mean of all those numbers.

Mean vs Median:
 The mean is the average of all the numbers: add them up and divide by how many there are.
 The median is the center data point once the data is sorted (which may not be a value that is close to the average/mean).
Sample vs Population:
If we wanted to know the average score of ALL students in a particular math class, and had a number for each student, this would be a population.
If we wanted an average of the math scores from all math classes in each college of a state, the data would probably just be samples from each college, not every student. Therefore it's a sample.
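
Here is a quick sketch of how far apart the mean and the median can land. The scores list is made up for illustration, and I'm leaning on Python's built-in statistics module rather than my own functions.

```python
import statistics

# Made-up scores; the 100 is an outlier that drags the mean upward.
scores = [2, 3, 3, 4, 100]

mean_value = statistics.mean(scores)      # sum of the values / how many there are
median_value = statistics.median(scores)  # middle value once the data is sorted

print(f"mean   = {mean_value}")
print(f"median = {median_value}")
```

One big value is enough to pull the mean far away from the median, which is why the two are worth keeping straight.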

The math: 
In the standard deviation formulas, the math is a little more complex because the mathletes (I use this term lovingly) decided to square each difference from the mean, which gets rid of the negative numbers. Now if you're programming, you might have used 'absolute value' before in your code.
With absolute value, we take any number and say we just want the positive version of it.
So if you see abs(<some number>) in code for a deviation, it is giving the positive size of the number in there. Be careful with that shortcut: averaging absolute differences gives the mean absolute deviation, which is a different number from the standard deviation. The textbook population formula squares the differences and takes a square root, just like the sample formula.
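
To see why the negatives have to go, here is a small sketch. The data list is made up, and I picked it so the mean comes out to exactly 5.

```python
import math

data = [2, 4, 4, 6, 9]          # made-up numbers with a mean of exactly 5
mean = sum(data) / len(data)

raw = [x - mean for x in data]  # signed differences: -3, -1, -1, 1, 4
absolute = [abs(d) for d in raw]
squared = [d ** 2 for d in raw]

# The signed differences cancel each other out, so their sum tells us nothing.
print(f"sum of signed differences = {sum(raw)}")

# Two different ways to remove the signs give two different measures.
mean_absolute_deviation = sum(absolute) / len(data)
population_sd = math.sqrt(sum(squared) / len(data))
print(f"mean absolute deviation = {mean_absolute_deviation}")
print(f"population standard deviation = {population_sd}")
```

Both approaches remove the signs, but they do not give the same answer, so they can't be swapped for each other on a test.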

Why is sample deviation different from the population?:
Both formulas square the differences from the mean, because that's how the mathematicians do it. The only difference is what you divide by.

The sample formula divides by (n-1) instead of n. My teacher said she's not sure why they chose one... My running theory was that you're eliminating one item from the data set to get a more accurate result, because squaring numbers can result in a massive difference from one point to another. The textbook reason is called Bessel's correction: the sample mean is calculated from the same data, so the data sits a little closer to it than to the true population mean, and dividing by n-1 instead of n makes up for that underestimate.
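
One way to check the n-1 idea without any theory is to take a population small enough that we can list every possible sample. This is only a sketch: the population is made up, the samples are pairs drawn with replacement, and it compares variances (the value before the square root), using the statistics module's pvariance() (divide by n) and variance() (divide by n-1).

```python
import itertools
import statistics

# A tiny made-up population, small enough to list every possible sample.
population = [1, 2, 3, 4]
true_variance = statistics.pvariance(population)  # divides by n

# Every ordered sample of size 2, drawn with replacement (16 in total).
samples = list(itertools.product(population, repeat=2))

# Average the two competing estimates over all of those samples.
avg_divide_by_n = sum(statistics.pvariance(s) for s in samples) / len(samples)
avg_divide_by_n_minus_1 = sum(statistics.variance(s) for s in samples) / len(samples)

print(f"population variance        = {true_variance}")
print(f"average estimate using n   = {avg_divide_by_n}")
print(f"average estimate using n-1 = {avg_divide_by_n_minus_1}")
```

Averaged over every possible sample, dividing by n comes out too small, while dividing by n-1 lands on the population variance.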

The actual explanation in Wikipedia:'s_inequality

A simpler explanation:
Let's say we have these two lists:
[2, 2, 4, 4, 5, 5, 5, 6, 6, 8, 8, 9, 9, 9] = Sample Data
[4, 4, 16, 16, 25, 25, 25, 36, 36, 64, 64, 81, 81, 81] = squares of the Sample Data.
Do you see how in the first list there isn't much difference between the numbers? Our biggest gap is 2.
(8 - 6 = 2)
Now in the other list (squares of the Sample Data) the difference between neighboring items grows much faster. The biggest difference in that set is 28.
(64 - 36 = 28) square root of 28 = *rounded* 5.29
I believe that when the set has fairly close data points, like [1, 2, 3] instead of very divergent numbers like [2, 26, 27, 98], this number will be close to the middle ground (mean/average) of the data set.
The reason this gap is so big is the squaring of the numbers from the original data. As the numbers in the first list get bigger, the differences between the numbers in the second list grow rapidly.
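
The gaps in the two lists can be checked with a few lines. The gaps() helper is just a name I made up for this sketch; it subtracts each item from the one after it.

```python
sample = [2, 2, 4, 4, 5, 5, 5, 6, 6, 8, 8, 9, 9, 9]
squares = [x ** 2 for x in sample]


def gaps(values):
    """Return the difference between each item and the one before it."""
    return [b - a for a, b in zip(values, values[1:])]


biggest_gap_sample = max(gaps(sample))
biggest_gap_squares = max(gaps(squares))

print(f"biggest gap in the sample  = {biggest_gap_sample}")   # 2
print(f"biggest gap in the squares = {biggest_gap_squares}")  # 28
```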

My explanation on n-1 &  Conclusion: 
So my theory is that we are taking one away from n (the number of items in the data set) because one extreme point in the data can get so huge once it is squared that it completely throws off a realistic (more accurate) result. The sample is not the whole population, and the formula has to adjust for that. Squaring also does the job of absolute value, because it gets rid of the negative signs. *Any real number squared is never negative* (it can be zero, and it doesn't have to be an integer).

As always I'm happy to correct, fix, add or take any comments to improve my code or this post. Feel free to comment, take the code and break it, use it, manipulate it and learn from it. I once was a newbie too and had to find ways to figure this stuff out, and I'm just hoping I make it a little easier for someone who's at the start of their journey.

The Code:

--- Start code block ----

 ###   Standard deviations (population and sample)
###   Z-score

import math
nums = [21, 21, 22, 23, 24, 24, 25, 25, 28, 29, 29, 31, 32, 33, 33, 34, 35, 36, 36, 36, 36, 38, 38, 38, 40]
def standard_deviation(alist, to_print=True):
    # formula to get the standard deviation of a population
    if to_print:
        print("*   mean and standard deviation of a population    *")
    total = 0
    # use the list that was passed in, not the global nums
    population_size = len(alist)
    for item in alist:
        total = total + item
    mean = total/population_size
    if to_print:
        print(f"MEAN = {mean}")
    total = 0
    for item in alist:
        # Square each difference so the negatives can't cancel out.
        # (Averaging abs(mean - item) instead would give the mean
        # absolute deviation, not the standard deviation.)
        x = (mean - item) ** 2
        total = total + x

    # population formula: divide by n, then take the square root
    deviation = math.sqrt(total/population_size)
    if to_print:
        print(f"Standard Deviation (Population) = {deviation}")
    #  The mean can be useful for my class, but feel free to eliminate it from the return
    if to_print:
        print("*    End method standard_deviation()      *\n")
    return deviation, mean

def standard_deviation_sample(alist):
    # Get the standard deviation of a sample
    # this is where all the squared business comes in.
    print("*     standard_deviation() of a sample     *")
    total = 0
    sample_size = len(alist)  # use the list that was passed in, not the global nums
    for item in alist:
        total = total + item
    mean = total/sample_size
    print(f"MEAN = {mean}")
    total = 0
    for item in alist:
        x = (mean - item) ** 2
        total = total + x
    n = len(alist)
    deviation = total/(n - 1)
    sample_deviation = math.sqrt(deviation)
    print(f"Standard Sample Deviation = {sample_deviation}")
    print("*      END standard_deviation_sample()     *\n")
    return sample_deviation

def z_score(value, alist=nums):
    """
    formula: (value - mean_of_alist) / deviation
    Since I have my standard_deviation() function returning the mean of our list,
    we can use that to get the mean and deviation.
    If you want to change this data set (nums) you can copy and paste your data into a new list,
    and pass that in as alist.  z_score(value, alist=yourlist)
    *The nums list is to demonstrate these functions.
    to_print=False keeps the print statements in standard_deviation() from filling
    this part with output. Change it to to_print=True to have that data printed with the z-score.
    """
    print("\n*get z-score of a value from a data set z-score() *")
    deviation, mean = standard_deviation(alist, to_print=False)
    # named score so it doesn't shadow the function name z_score
    score = (value - mean) / deviation
    print(f"z-score = {score}")
    print("*    END z-score()     *\n")
    return score

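def test():
    #  see the functions in action
    standard_deviation(nums)
    standard_deviation_sample(nums)
    # for z-score put in a value that is relational to the list of numbers.
    z_score(30)  # 30 is just an example value inside the range of nums


test()
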

--- End code block -----
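
If you want to double check the answers, Python's built-in statistics module has pstdev() for a population and stdev() for a sample. This is a sketch for comparison only, using the same nums list as above; the value of 30 for the z-score is just an example I picked from inside the range of the data.

```python
import statistics

nums = [21, 21, 22, 23, 24, 24, 25, 25, 28, 29, 29, 31, 32, 33, 33, 34, 35,
        36, 36, 36, 36, 38, 38, 38, 40]

mean = statistics.mean(nums)
population_sd = statistics.pstdev(nums)  # divides by n
sample_sd = statistics.stdev(nums)       # divides by n - 1

value = 30                               # example value, not from the class data
z = (value - mean) / population_sd       # how many deviations from the mean

print(f"mean = {mean}")
print(f"population standard deviation = {population_sd}")
print(f"sample standard deviation = {sample_sd}")
print(f"z-score of {value} = {z}")
```

The sample result always comes out a little bigger than the population result on the same list, because it divides by the smaller number n - 1.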

