Normal distributions, z-scores, and the empirical rule

 
 
 
 
 

The bell-shaped normal distribution and the empirical rule

In the last section, we talked about a normal distribution, which is a bell-shaped, symmetric curve for normally distributed data, that looks something like this:


 
 
mean and median are equal in a normal distribution
 

We’ll spend a lot of time working with distributions like this, so let’s talk about some of the most important properties of a normal distribution.

The empirical rule

Normal distributions follow the empirical rule, also called the 68-95-99.7 rule. The rule tells us that, for a normal distribution, there’s a

  • $68\%$ chance a data point falls within $1$ standard deviation of the mean

  • $95\%$ chance a data point falls within $2$ standard deviations of the mean

  • $99.7\%$ chance a data point falls within $3$ standard deviations of the mean

In other words, if we want to show this graphically,

 
empirical rule for the normal distribution
 

we can show that $68\%$ of the data will fall within $1$ standard deviation of the mean, that within $2$ full standard deviations of the mean we’ll have $95\%$ of the data, and that within $3$ full standard deviations of the mean we’ll have $99.7\%$ of the data.

And we can draw all kinds of conclusions based on this information, and the fact that all the area under the graph represents $100\%$ of the data. For example, since the total area is $100\%$, and the data within three standard deviations is $99.7\%$, that means we’ll always have $0.3\%$ of the data in a normal distribution lying more than three standard deviations from the mean. Or if we wanted to know how much of our data will lie between one and two standard deviations from the mean, counting both sides of the mean together, we can say that it’s $95\%-68\%=27\%$.
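If we want to check these percentages ourselves, here is a minimal Python sketch. It assumes the standard identity that the share of a normal distribution within $k$ standard deviations of the mean is $\text{erf}(k/\sqrt{2})$. The function name `share_within` is ours, made up just for this illustration.

```python
import math

def share_within(k: float) -> float:
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

# The three numbers behind the 68-95-99.7 rule
for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.2%}")

# Share between one and two standard deviations from the mean (both sides combined)
between_1_and_2 = share_within(2) - share_within(1)

# Share more than three standard deviations from the mean
outside_3 = 1 - share_within(3)

print(f"between 1 and 2: {between_1_and_2:.2%}")
print(f"outside 3: {outside_3:.2%}")
```

The values come out to roughly $68.27\%$, $95.45\%$, and $99.73\%$, so the empirical rule's figures are rounded, and the "between one and two" share is about $27\%$ either way.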


Percentile

We look a lot at percentiles within a normal distribution. The $n$th percentile is the value such that $n$ percent of the values lie below it. In other words, a value in the 95th percentile is greater than $95\%$ of the data. The 50th percentile in a normal distribution always gives the median, and the IQR is always found using the 75th percentile minus the 25th percentile.
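To make the percentile idea concrete, here's a short sketch using `NormalDist` from Python's standard `statistics` module. The mean of $100$ and standard deviation of $15$ are hypothetical numbers chosen only for illustration.

```python
from statistics import NormalDist

# Hypothetical normal distribution, used only for illustration
dist = NormalDist(mu=100, sigma=15)

# inv_cdf(p) returns the value with a fraction p of the distribution below it
median = dist.inv_cdf(0.50)  # 50th percentile
q1 = dist.inv_cdf(0.25)      # 25th percentile
q3 = dist.inv_cdf(0.75)      # 75th percentile
iqr = q3 - q1                # interquartile range

print(f"median = {median:.2f}, Q1 = {q1:.2f}, Q3 = {q3:.2f}, IQR = {iqr:.2f}")
```

Because the distribution is symmetric, the 50th percentile lands on the mean, which is why the mean and median are equal here.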

Z-scores

A $z$-score tells you the number of standard deviations a point is from the mean. To calculate a $z$-score for normally distributed data, we use the formula

$$z=\frac{x-\mu}{\sigma}$$

where $x$ is the data point, $\mu$ is the mean, and $\sigma$ is the standard deviation.

The $z$-score for a data point is how far it is from the mean, and you always want to give the $z$-score in terms of standard deviations. Therefore, to find the $z$-score at a certain point in the distribution, we use the formula above, taking the data point, subtracting the mean, and then dividing that result by the standard deviation. That gives us a value for $z$.
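The calculation just described can be sketched in a few lines of Python. The function name `z_score` and the sample numbers are ours, picked only to illustrate the arithmetic.

```python
def z_score(x: float, mean: float, sd: float) -> float:
    """Number of standard deviations that x lies from the mean."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mean) / sd

# Hypothetical data point: 75, from a distribution with mean 70 and standard deviation 2.5
z = z_score(75, 70, 2.5)
print(z)  # 2.0, so the point is two standard deviations above the mean
```

A point below the mean gives a negative result, for example `z_score(65, 70, 2.5)` is `-2.0`.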

We’ll look up the $z$-score in a $z$-table, which is a table that takes the number of standard deviations and tells you the percentage of the area under the curve up to that point.

Data points that are less than the mean will be to the left of the mean and will have a negative $z$-score. They should be looked up in the table of negative $z$-scores:

 
negative z-score table
 

Data points that are greater than the mean will be to the right of the mean and will have a positive $z$-score. They should be looked up in the table of positive $z$-scores:

 
positive z-score table
 

A data point is considered unusual if its $z$-score puts it more than three standard deviations from the mean. Together with the $z$-table, the $z$-score tells us the percentile rank of the data point that we started with. If the $z$-table value for our data point’s $z$-score is $0.7123$, it means that the data point is greater than $71.23\%$ of the data, meaning that our data point is in the $71.23$rd percentile.

Remember, the $z$-table always gives you the percentage of data that’s below your data point. Therefore, to find the percentage of data above your data point, you have to take $1$ minus the value from the table.
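If a printed $z$-table isn't handy, the same lookup can be sketched in Python, assuming the standard normal CDF from `statistics.NormalDist` stands in for the table. The helper names `share_below` and `share_above` are ours.

```python
from statistics import NormalDist

# Standard normal distribution (mean 0, standard deviation 1), which is what a z-table tabulates
standard_normal = NormalDist()

def share_below(z: float) -> float:
    """Fraction of the data below a given z-score (the z-table value)."""
    return standard_normal.cdf(z)

def share_above(z: float) -> float:
    """Fraction of the data above a given z-score: 1 minus the table value."""
    return 1 - share_below(z)

# A z-score of 0.56 corresponds to a table value of about 0.7123
print(f"below: {share_below(0.56):.4f}")
print(f"above: {share_above(0.56):.4f}")
```

A negative $z$-score works the same way, so there's no need for separate negative and positive tables in code.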

Thresholds

Sometimes we want to know the threshold, or cutoff, in our data set. In other words, we might want to know, “What’s the minimum value needed in order to be in the top $10\%$ of the data?”

In order to figure this out, we need to work backwards starting from the $z$-table. For example, if we want to find the top $30\%$ of the data, we’d use the $z$-table to find the first table value that’s just barely above $70\%$, or $0.7000$. Then we’ll look at the row and column headers that correspond with that table value. The decimal number given by the row and column headers is the $z$-score, which tells us how many standard deviations above the mean we need to be in order to be above $70\%$, or, in the top $30\%$.

If we multiply that decimal number by the standard deviation, and then add the result to the mean, that will tell us the value that’s at the bottom of the top $30\%$. If instead we were looking up the “bottom $40\%$” in the $z$-table, we’d need to look for the $z$-table value that’s just under $0.4000$.
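Working backwards like this can also be sketched in Python. This version assumes we use the inverse CDF (`inv_cdf`) of the standard normal distribution in place of scanning a printed table, so it returns the exact $z$-score rather than the nearest table entry. The function names and the sample mean and standard deviation are hypothetical.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

def cutoff_for_top(share: float, mean: float, sd: float) -> float:
    """Smallest value that is still in the top `share` of the data."""
    z = standard_normal.inv_cdf(1 - share)  # e.g. top 30% means 70% lies below
    return mean + z * sd

def cutoff_for_bottom(share: float, mean: float, sd: float) -> float:
    """Largest value that is still in the bottom `share` of the data."""
    z = standard_normal.inv_cdf(share)
    return mean + z * sd

# Hypothetical distribution with mean 50 and standard deviation 10
print(f"top 30% starts at {cutoff_for_top(0.30, 50, 10):.2f}")
print(f"bottom 40% ends at {cutoff_for_bottom(0.40, 50, 10):.2f}")
```

The exact $z$-score for the top $30\%$ is about $0.524$, while a two-decimal table would give $0.53$ as the first entry above $0.7000$, so the two approaches can differ slightly in the last digits.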

 
 

How to find z-scores from the mean and standard deviation


 
 

 
 

 
 

How to guarantee your percentile based on the mean and standard deviation

Example

Let’s say the mean finishing time for male speed skaters at the Winter Olympics on the $500$ meter track is $70.42$ seconds, with a standard deviation of $0.34$ seconds (the data is normally distributed). What is the maximum time a skater can post if he wants to skate faster than $95\%$ of his competitors?

We know that $\mu=70.42$ and $\sigma=0.34$. This athlete wants to be faster than $95\%$ of the event’s participants, which means he wants his time to be in the fastest $5\%$.


Being in the top $5\%$ of finishers means having a finishing time in the fastest $5\%$ of all finishing times. Since faster times are smaller, that’s equivalent to having a value in the bottom $5\%$ of the normal distribution. So we need to look in a $z$-table for the negative $z$-score that will keep us under $0.05$.

looking up a value in the z-table

The $z$-score that keeps us within the top $5\%$ of participants is $-1.65$. A $z$-score of $-1.64$ would push us into the bottom $95\%$ of competitors, over the $5\%$ threshold.

If our standard deviation is $\sigma=0.34$, then a $z$-score of $-1.65$ puts us

$0.34(-1.65)=-0.561$

seconds from the mean. Therefore, if the skater’s time is

$70.420-0.561=69.859$ seconds

or faster, he’ll be within the top $5\%$ of competitors in the event.
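As a rough check on this example, here's a short Python sketch that repeats the calculation two ways: once with the table value $z=-1.65$ used above, and once with the exact $z$-score from `NormalDist.inv_cdf`, which is our own addition rather than part of the table method. The variable names are ours.

```python
from statistics import NormalDist

mean_time = 70.42  # mean finishing time in seconds
sd_time = 0.34     # standard deviation in seconds

# Table-based approach: z = -1.65 is the table entry that keeps us under 0.05
table_z = -1.65
table_cutoff = mean_time + table_z * sd_time

# Exact approach: the z-score with exactly 5% of the distribution below it
exact_z = NormalDist().inv_cdf(0.05)
exact_cutoff = mean_time + exact_z * sd_time

print(f"table cutoff: {table_cutoff:.3f} seconds")
print(f"exact cutoff: {exact_cutoff:.3f} seconds")
```

The exact $z$-score is about $-1.645$, so the exact cutoff is a couple of thousandths of a second slower than the table-based $69.859$ seconds. Rounding to $-1.65$ is the slightly more conservative choice, since it keeps the skater safely inside the fastest $5\%$.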

 