python statistics normal distribution continuous density

Normal Distribution: Stats Class Code.


Colorful wind chimes of bells and clay


I was attempting to write the different formula's I would need for a normal distribution curve and the calculations we'd need for class.  There is a bit of success and a bit of failure in the code.  I may tweek it a bit more, so that someone could enter the a and b,  size of the curve, and the area range they want and it does the deviation, and mean calculations also, but this is enough for now.

As always, drop a comment if you see anything wrong, or that needs improvement.  Take the code, practice, break it, build it, have fun learning.

Lessons learned:

Now for this bit of code, I was sure I could use the formula I found in this Khan video to write my own python function to find the probability in a range under a curve.

Turns out, no, the math is way more complex for a continuous function.
I am leaving my function and the way it uses the formula in the code block.
I made an effort, but it failed.  I am under the assumption that scipy's numbers are correct, and mine should match theirs.  Mine are off by a small bit, but close.
I'm sure there's some reason I fail to see why this won't match scipy's.

The big lesson I learned from this:
    USE YOUR TOOLS.  SciPy, matplotlib, all of those amazing libraries are built by fantastic coders and they want you to use them to make your life easier.  So while doing this I did learn the math in greater detail, sometimes if you're trying to drive a nail into a board with a spatula, and someone hands you a hammer -- don't be prideful.  Use the hammer.
    It's our major trait in the animal kingdom, we use tools.  So use them.  Or not.... and spend hours figuring out the better way, learn from it, and then blog about the whole thing later.  It's actually your choice.  I'll probably do this same thing again, to learn.  If you don't need to learn this way, like me, then just use the hammer.

Links:

Here's the khan academy links to the normal distribution math:

https://www.khanacademy.org/math/statistics-probability/modeling-distributions-of-data/more-on-normal-distributions/v/introduction-to-the-normal-distribution

A stackOverflow question that helped me figure out how I was writing the formula wrong
See the 'answered : jiminy_crist'  solution.
https://stackoverflow.com/questions/12412895/calculate-probability-in-normal-distribution-given-mean-std-in-python




Code:

--start code block--


# Normal Distribution AKA Gaussian probability
import math
from scipy.stats import norm

#Note:  z-score is the amount of deviations a given x in p(x) is away from the mean.

def normal_deviation(a, b):
    """
    What is the standard deviation of the data point distribution?
    I need clarification on where this '12' comes in.  Why 12?
    deviation is:  the square root of ((the high value, minus the low value squared) divided by 12)
    """
    deviation = math.sqrt((b-a)**2 / 12)
    print("The deviation of this normal distribution is : ", deviation)
    return deviation

def normal_mean(a, b):
    """
    What is b and a?
    The range at which the distribution falls.
    example: 
    the number of paintings sold in a day are 5 - 100.
    a = 5,  b = 100
    
    The mu, or standard mean (center point of data) of this normal distribution:
    u = (a + b) / 2
    """
    mean = (a + b) / 2
    print("The standard mean of this normal distribution is: ", mean)
    return mean

def x_bar_Normal_distribution(a, b, n):
    """
    THIS HAS NOT BEEN TESTED.
    The deviation of the Normal distribution of a sample of means of a normal distribution.
    deviation / sqrt(number of samples)
    """
    mean = normal_mean(a, b)
    deviation = normal_deviation(a, b)
    normal_x_bar_deviation = deviation / math.sqrt(n)
    print("The standard deviation of the sample of means from the normal distribution ( [n] samples ) is: ", normal_x_bar_deviation)
    return normal_x_bar_deviation



def range_probability_cdf(mean, devi, range_low, range_high):
    """
    The formula will not work on a continuous probability.  SciPi has a built in method to do this math for us.
    I wanted to do it myself, so I could learn the math, and I did learn the formula below pretty good, but
    the formula we need to calculate a continuous probability is much more complex.
    Time to use the hammer instead of my hand.

    norm.cdf(x, loc(mean), scale(deviation))

    Formula probability density function:
    the area under the curve for a given point p(x) is:
    1 / ( 2 * pi * (mean squared) * (mathmatical number e to the power of (our z score squared)))
    1/ 2 * pi *  m**2 * e**(z**2)  
   [[e = math.exp, pi = math.pi]]
    for the range, we would take the larger area,  minus the smaller area to get the difference, which is the area just between
    these two p(x)s.
    """
    # 1 / (2 * pi * deviation**2) = x
    # e ** -((range_num - mean)**2 / 2*deviation**2 = y
    # area = y/x

    large = norm.cdf(range_high, mean, devi)
    print("scipy large area =  ", large)
    small = norm.cdf(range_low, mean, devi)
    print("scipy small area = ", small)
    range_area = large - small
    message = f"The area in range {range_low} - {range_high} is {range_area}"
    return range_area
    
def mycdf(mean, devi, range_low, range_high):
    """
    If the ranges are above the mean, we must get the inverse area.
    so if it's one deviation above, or any positive deviation above, we want 1 - results.
    
    Formula probability density function:
    the area under the curve for a given point p(x) is:
    1 / ( 2 * pi * (mean squared) * (mathmatical number e to the power of (our z score squared)))
    1/ 2 * pi *  m**2 * e**(z**2)  
    [[e = math.exp, pi = math.pi]]
    for the range, we would take the larger area,  minus the smaller area to get the difference, which is the area just between
    these two p(x)s.
    """

    devi_square = float(devi**2)
    low_e_num = math.exp(-((float(range_low) - float(mean))**2 / (2*devi_square)))
    denom = float( math.sqrt(2 * math.pi * devi_square) )
    high_e_num = math.exp(-((float(range_high) - float(mean))**2 / (2*devi_square)))
    low_area = float(low_e_num / denom)
    high_area = float(high_e_num / denom)
    if range_low > mean:
        low_area = 1 - low_area
    if range_high > mean:
        high_area = 1 - high_area
    print("my high_area = ", high_area)
    print("my low_area = ", low_area)
    under_curve = high_area - low_area
    message = f"The area under the curve for range {range_low} - {range_high} = {under_curve}"
    return under_curve
        

def test1():
    # test against excel data:
    # excel formula 1 = [=NORM.DIST(8.5, 10, 1.5, 1) => .1587
    # excel formula 2 = [=NORM.DIST(11.5, 10, 1.5, 1) => .8414
     #  ex1 - ex2 = .6827
   #range_probability_withmeandevi(mean, devi, range_low, range_high):
   area = range_probability_cdf(10, 1.5, 8.5, 11.5)
   print(area)


def test2():
    """
    compare results for my density function and using scipy's
    empirical rule says that this range should be about 95%
    """
    area = range_probability_cdf(12, 1.3, 9.4, 14.6)
    area2 = mycdf(12, 1.3, 9.4, 14.6)
    print("scipy result:", area)
    print("my result:", area2)

def test3():
    """
    compare results for my density function and using scipy's
    empirical rule says that this range should be about 64%
    """
    scipy_area = range_probability_cdf(10, 1.5, 8.5, 11.5)
    my_area = mycdf(10, 1.5, 8.5, 11.5)
    print("scipy result:", scipy_area)
    print("my result:", my_area)

#test1() -- pass
#test2() -- fail
#test3() -- fail






--end code block--


Comments

Popular posts from this blog

playing with color in powershell python

JavaScript Ascii animation with while loops and console.log

playing with trigonometry sin in pygame