python statistics normal distribution continuous density
Normal Distribution: Stats Class Code.
Colorful wind chimes of bells and clay |
I was attempting to write the different formula's I would need for a normal distribution curve and the calculations we'd need for class. There is a bit of success and a bit of failure in the code. I may tweek it a bit more, so that someone could enter the a and b, size of the curve, and the area range they want and it does the deviation, and mean calculations also, but this is enough for now.
As always, drop a comment if you see anything wrong, or that needs improvement. Take the code, practice, break it, build it, have fun learning.
Lessons learned:
Now for this bit of code, I was sure I could use the formula I found in this Khan video to write my own python function to find the probability in a range under a curve.
Turns out, no, the math is way more complex for a continuous function.
I am leaving my function and the way it uses the formula in the code block.
I made an effort, but it failed. I am under the assumption that scipy's numbers are correct, and mine should match theirs. Mine are off by a small bit, but close.
I'm sure there's some reason I fail to see why this won't match scipy's.
The big lesson I learned from this:
USE YOUR TOOLS. SciPy, matplotlib, all of those amazing libraries are built by fantastic coders and they want you to use them to make your life easier. So while doing this I did learn the math in greater detail, sometimes if you're trying to drive a nail into a board with a spatula, and someone hands you a hammer -- don't be prideful. Use the hammer.
It's our major trait in the animal kingdom, we use tools. So use them. Or not.... and spend hours figuring out the better way, learn from it, and then blog about the whole thing later. It's actually your choice. I'll probably do this same thing again, to learn. If you don't need to learn this way, like me, then just use the hammer.
Links:
Here's the khan academy links to the normal distribution math:
https://www.khanacademy.org/math/statistics-probability/modeling-distributions-of-data/more-on-normal-distributions/v/introduction-to-the-normal-distribution
A stackOverflow question that helped me figure out how I was writing the formula wrong
See the 'answered : jiminy_crist' solution.
https://stackoverflow.com/questions/12412895/calculate-probability-in-normal-distribution-given-mean-std-in-python
Code:
--start code block--
# Normal Distribution AKA Gaussian probability
import math
from scipy.stats import norm
#Note: z-score is the amount of deviations a given x in p(x) is away from the mean.
def normal_deviation(a, b):
"""
What is the standard deviation of the data point distribution?
I need clarification on where this '12' comes in. Why 12?
deviation is: the square root of ((the high value, minus the low value squared) divided by 12)
"""
deviation = math.sqrt((b-a)**2 / 12)
print("The deviation of this normal distribution is : ", deviation)
return deviation
def normal_mean(a, b):
"""
What is b and a?
The range at which the distribution falls.
example:
the number of paintings sold in a day are 5 - 100.
a = 5, b = 100
The mu, or standard mean (center point of data) of this normal distribution:
u = (a + b) / 2
"""
mean = (a + b) / 2
print("The standard mean of this normal distribution is: ", mean)
return mean
def x_bar_Normal_distribution(a, b, n):
"""
THIS HAS NOT BEEN TESTED.
The deviation of the Normal distribution of a sample of means of a normal distribution.
deviation / sqrt(number of samples)
"""
mean = normal_mean(a, b)
deviation = normal_deviation(a, b)
normal_x_bar_deviation = deviation / math.sqrt(n)
print("The standard deviation of the sample of means from the normal distribution ( [n] samples ) is: ", normal_x_bar_deviation)
return normal_x_bar_deviation
def range_probability_cdf(mean, devi, range_low, range_high):
"""
The formula will not work on a continuous probability. SciPi has a built in method to do this math for us.
I wanted to do it myself, so I could learn the math, and I did learn the formula below pretty good, but
the formula we need to calculate a continuous probability is much more complex.
Time to use the hammer instead of my hand.
norm.cdf(x, loc(mean), scale(deviation))
Formula probability density function:
the area under the curve for a given point p(x) is:
1 / ( 2 * pi * (mean squared) * (mathmatical number e to the power of (our z score squared)))
1/ 2 * pi * m**2 * e**(z**2)
[[e = math.exp, pi = math.pi]]
for the range, we would take the larger area, minus the smaller area to get the difference, which is the area just between
these two p(x)s.
"""
# 1 / (2 * pi * deviation**2) = x
# e ** -((range_num - mean)**2 / 2*deviation**2 = y
# area = y/x
large = norm.cdf(range_high, mean, devi)
print("scipy large area = ", large)
small = norm.cdf(range_low, mean, devi)
print("scipy small area = ", small)
range_area = large - small
message = f"The area in range {range_low} - {range_high} is {range_area}"
return range_area
def mycdf(mean, devi, range_low, range_high):
"""
If the ranges are above the mean, we must get the inverse area.
so if it's one deviation above, or any positive deviation above, we want 1 - results.
Formula probability density function:
the area under the curve for a given point p(x) is:
1 / ( 2 * pi * (mean squared) * (mathmatical number e to the power of (our z score squared)))
1/ 2 * pi * m**2 * e**(z**2)
[[e = math.exp, pi = math.pi]]
for the range, we would take the larger area, minus the smaller area to get the difference, which is the area just between
these two p(x)s.
"""
devi_square = float(devi**2)
low_e_num = math.exp(-((float(range_low) - float(mean))**2 / (2*devi_square)))
denom = float( math.sqrt(2 * math.pi * devi_square) )
high_e_num = math.exp(-((float(range_high) - float(mean))**2 / (2*devi_square)))
low_area = float(low_e_num / denom)
high_area = float(high_e_num / denom)
if range_low > mean:
low_area = 1 - low_area
if range_high > mean:
high_area = 1 - high_area
print("my high_area = ", high_area)
print("my low_area = ", low_area)
under_curve = high_area - low_area
message = f"The area under the curve for range {range_low} - {range_high} = {under_curve}"
return under_curve
def test1():
# test against excel data:
# excel formula 1 = [=NORM.DIST(8.5, 10, 1.5, 1) => .1587
# excel formula 2 = [=NORM.DIST(11.5, 10, 1.5, 1) => .8414
# ex1 - ex2 = .6827
#range_probability_withmeandevi(mean, devi, range_low, range_high):
area = range_probability_cdf(10, 1.5, 8.5, 11.5)
print(area)
def test2():
"""
compare results for my density function and using scipy's
empirical rule says that this range should be about 95%
"""
area = range_probability_cdf(12, 1.3, 9.4, 14.6)
area2 = mycdf(12, 1.3, 9.4, 14.6)
print("scipy result:", area)
print("my result:", area2)
def test3():
"""
compare results for my density function and using scipy's
empirical rule says that this range should be about 64%
"""
scipy_area = range_probability_cdf(10, 1.5, 8.5, 11.5)
my_area = mycdf(10, 1.5, 8.5, 11.5)
print("scipy result:", scipy_area)
print("my result:", my_area)
#test1() -- pass
#test2() -- fail
#test3() -- fail
Comments
Post a Comment