The most in-demand skills in the world right now are in Data Science & Machine Learning! In this one video I will teach you a key part of the math behind Machine Learning and Data Science: Statistics.

I took everything in a standard 500-page textbook on Statistics and put it in this one video. I will cover every formula, and I will also solve real-world problems with each formula. If you pause your way through this video while taking notes, you will master Statistics.

►► Get my Python Programming Bootcamp Series for $9.99 ( Expires August 1st ) : https://bit.ly/PythonMaster2

►► Highest Rated Python Udemy Course + 28 Hrs + 121 Videos + New Videos Every Week

**Transcript of the Video**

Statistics Tutorial

1. Statistics is the science of collecting and analyzing data taken from a sample of the population.
2. The Population represents all items or people of interest. * A Sample is a subset of the population that we can analyze. We mainly focus on Successes, or the results we are looking for in a sample. Examples include Age, Car Owner, College Graduate, Sex, Home Owner, etc. * Here M represents Successes in the Population, N the total Population, x the Successes in the sample, and n the total sample drawn from the population.
3. There are many types of data. Categorical Data describes what makes a thing unique, like Age, Car Owner, Sex, Graduate, or the answer to a Yes or No question.
4. Numerical Data is either Finite, meaning it has an ending value, or Infinite, meaning the opposite.
5. Continuous Data is data that can be broken down into infinitely smaller amounts. Think of things like distance, height, weight, etc.
6. Qualitative Data can be Nominal (Named Data), which is data used for naming something that has no order. Race is an example: there are many races, but there is no order to them. Ordinal Data is also named, but it has an order, like Bad, OK, Good, Great.
7. Quantitative Data is a Ratio or an Interval, an amount between 2 defined amounts, like numbers between 8 and 16.
8. There are many ways to visualize data. A Cross Table shows relationships between rows and columns of data. Frequency shows how often something happens. Here we can see that of 100 randomly sampled men, 78 didn't exercise.
9. With Pie Charts each slice represents a category and the size of the slice represents its frequency. What differentiates a pie chart from other charts is that its slices must always sum to 100%.
10. Bar Charts have bars that represent the categories, and the bar lengths represent the frequency.
11. A Pareto Chart lists categories in descending order and includes a line that represents the cumulative frequency, the running total of the frequencies.
12. A Frequency Distribution Table focuses on the number of occurrences, or the frequency. Here we list a range of test scores and how many students scored in that range. * A Histogram differs from a bar graph in that, in this example, it shows the distribution of grades across ranges rather than using categories like a bar graph. Also, Histograms are drawn with the bars touching.
13. The Mean, or average, provides an average value by summing all values and dividing by the number of components. μ is used to represent the mean of the population, while x̅ represents the mean of a sample. While the mean can be very useful, outliers often dramatically affect results. For example, 1, 2, 3, 4, 5 has a mean of 3, while 1, 2, 3, 4, 100 has a mean of 22.
14. The Median tries to eliminate the influence of outliers by returning the number at the center of the data set. If you have an even number of components, instead take the center 2 values and return their average.
15. The Mode returns the value that occurs most often. If all components occur at an equal rate there is no mode. If multiple values tie for most frequent, you have more than 1 mode.
16. The Variance measures how data is spread around the mean. There is a symbol for the variance of the population (σ²) and of the sample (s²). To find it we first calculate the mean. Then we sum the squared differences between each sample value and the mean. Then we divide by the number of samples minus 1 in the case of the variance of a sample, which is what we'll use.
17. Because variance squares the deviations, extra weight is given to outliers. For this reason we take the square root of the variance to find the Standard Deviation. The Standard Deviation is larger if the numbers are more spread out and smaller if they are closer to the mean.
18. The Coefficient of Variation is used to compare 2 measurements that operate on different scales.
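The descriptive statistics above (mean, median, variance, standard deviation, coefficient of variation) can be checked with Python's built-in `statistics` module. A quick sketch; the data values here are my own, not the video's:

```python
import statistics

data = [1, 2, 3, 4, 5]
outliers = [1, 2, 3, 4, 100]

# The mean is pulled hard by outliers; the median is not.
print(statistics.mean(data))        # 3
print(statistics.mean(outliers))    # 22
print(statistics.median(outliers))  # 3

# Sample variance divides the summed squared deviations by n - 1;
# the standard deviation is its square root.
variance = statistics.variance(data)  # 2.5
std_dev = statistics.stdev(data)      # sqrt(2.5) ≈ 1.58

# Coefficient of variation: standard deviation divided by the mean,
# letting us compare dispersion across different units.
cv = std_dev / statistics.mean(data)
```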
Here I'm comparing miles to kilometers. Even though they measure the same distance, because they use different units that is not seen when calculating the standard deviation. By dividing by the mean, however, we can see that they actually have the same dispersion.
19. Covariance tells us if 2 groups of data are moving in the same direction. Here I'll compare whether earnings affect the market cap of a corporation. The market cap of a corporation is the total value of all that corporation's stock. You make this calculation by taking each pair of values minus their means and multiplying. If I do that I get a value of 5803.2. If the value is greater than 0, the values are moving together. If less than 0, they are moving in opposite directions. Zero means they are independent.
20. The Correlation Coefficient adjusts the covariance so that it is easier to see the relationship between x and y. Its value can't be greater than 1 or less than -1. The closer you get to 1, the closer the relationship between the values. In this example we divide by the standard deviations of the market cap and earnings. When we do this we get a value of .6601, which means they are correlated. Perfect correlation would have a value of 1, 0 shows independence, and negative values show an inverse correlation. SPREADSHEET - HOW ARE THESE CALCULATIONS MEANINGFUL?
21. A Probability Distribution finds the probability of different outcomes. * A coin flip gives each outcome a probability of .5. * A die roll gives each face a probability of 1/6, or .167. * When you sum all probabilities you get a value of 1.
22. Here you see the probabilities of all rolls with 2 dice. * A Relative Frequency Histogram charts out all those probabilities. Pay particular attention to the shape of that chart because...
23. Next we'll talk about the Normal Distribution. A Normal Distribution is when data forms a bell curve. Also, 1 Standard Deviation covers 68% of the data.
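The covariance and correlation coefficient formulas above translate directly into code. A sketch with made-up numbers rather than the video's market-cap data:

```python
import math

def covariance(xs, ys):
    # Sample covariance: sum of (x - x̄)(y - ȳ), divided by n - 1.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

def stdev(vs):
    # Sample standard deviation.
    m = sum(vs) / len(vs)
    return math.sqrt(sum((v - m) ** 2 for v in vs) / (len(vs) - 1))

def correlation(xs, ys):
    # Correlation coefficient: covariance scaled into the range [-1, 1].
    return covariance(xs, ys) / (stdev(xs) * stdev(ys))

# Perfectly correlated series give a coefficient of 1.
print(correlation([1, 2, 3], [2, 4, 6]))  # 1.0
```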
2 standard deviations cover 95%, and 3 cover 99.7%.
24. In a Normal Distribution the Mean = Median = Mode. * Also, 50% of values are less than the mean and 50% are greater than it.
25. A Standard Normal Distribution has a mean of 0 and a standard deviation of 1. If we calculate the mean of our data we see it is 4. If we calculate the standard deviation, that comes to 1.58.
26. We can turn this into a Standard Normal Distribution by subtracting the mean from each value and dividing by the standard deviation. If we do that we get the chart here.
27. The Central Limit Theorem states that the more samples you take, the closer you get to the mean. Also, the distribution of sample means will approximate the Normal Distribution. * As you can see, as the sample size increases the standard deviation decreases.
28. The Standard Error measures the accuracy of an estimate. To find it, divide the standard deviation by the square root of the sample size. Again, notice that as the sample size increases the Standard Error decreases.
29. The Z Score gives us the value in standard deviations for the percentile we want. * For example, if we want 95% of the data it tells us how many standard deviations are required. * The formula takes the distance from the mean to x and divides by the standard deviation: z = (x - μ) / σ.
30. This will make more sense with an example. Here is a Z Table. If we know our mean is 40.8 and the standard deviation is 3.5, and we want the area to the left of the point 48, we perform our calculation to get 2.06. * We then find 2.0 on the left of the Z Table * and .06 on the top. * This tells us that the area under the curve to the left makes up .98030 of the total.
31. Now let's talk about Confidence Intervals. Point Estimates are what we have largely used, but they can be inaccurate. An alternative is an interval. * For example, if we had 3 sample means as you see here, we could instead say that they lie in the interval (5, 7). * We then state how confident we are in the interval. Common amounts are 90%, 95% and 99%.
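The standardization and z-score steps above are mechanical enough to sketch. The z-score calculation uses the slide's numbers (mean 40.8, standard deviation 3.5, point 48); the rest is a generic helper:

```python
import math

def standardize(data):
    # Subtract the mean and divide by the sample standard deviation,
    # producing a data set with mean 0 and standard deviation 1.
    n = len(data)
    mean = sum(data) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    return [(x - mean) / sd for x in data]

def standard_error(data):
    # Standard deviation divided by the square root of the sample size.
    n = len(data)
    mean = sum(data) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    return sd / math.sqrt(n)

# The z score: distance from the mean to x in standard deviations.
z = (48 - 40.8) / 3.5
print(round(z, 2))  # 2.06
```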
For example, if we have 90% confidence, that means we expect 9 out of 10 intervals to contain the mean. * Alpha represents the doubt we have, which is 1 minus the confidence.
32. Now I'll show you how to calculate a confidence interval. We need a sample mean, alpha, the standard deviation, and the number of samples represented by lowercase n. * Here the value after the plus or minus represents the Margin of Error.
33. Now I'll walk you through an example where we calculate the probable salary we would receive if we became a player for the Houston Rockets. We have the mean salary. * We want our results to be confident to 95%. * We get alpha from the confidence. * Critical Probability is calculated by subtracting alpha divided by 2 from 1. * Then we look up the Z Code in a table. If we search for .975 we find that the Z Code is 1.96. * We find our standard deviation and then plug in our values. * And when we do, we find our Confidence Interval for the salary.
34. Student's T Distribution is used when your sample size is small and/or the population variance is unknown.
35. A T Distribution looks like a Normal Distribution with fatter tails, meaning a wider dispersion of variables.
36. When we know the standard deviation we can compute the Z Score and use the Normal Distribution to calculate probabilities.
37. The formula is t = (x̅ - μ) / (s/√n), where x̅ is the sample mean, μ is the population mean, s is the Standard Deviation of the sample, and n is the sample size.
38. In this example let's say a manufacturer is promising brake pads will last for 65,000 km with a .95 confidence level. * Our sample mean is 62,456.2. * The standard deviation is 2418.4.
39. Degrees of freedom is the number of samples taken minus 1. If we take 30 samples, degrees of freedom equals 29.
40. If we know confidence is .95, we subtract .95 from 1 to get .05. If we look up 29 and .05 in the T Table we get a value of 1.699.
41. If we plug our values into our formula we find the interval for our sample.
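The interval formula above can be sketched as a small helper. The critical value 1.699 and the brake-pad numbers come from the transcript; treat the resulting interval as illustrative:

```python
import math

def confidence_interval(mean, stdev, n, critical):
    # mean ± critical * (stdev / sqrt(n)); the second term is the
    # Margin of Error. `critical` is a z or t value from a table.
    margin = critical * stdev / math.sqrt(n)
    return mean - margin, mean + margin

# Brake-pad example: 30 samples, t critical value 1.699
# (29 degrees of freedom at a .95 confidence level).
low, high = confidence_interval(62456.2, 2418.4, 30, 1.699)
```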
42. Let's talk about the difference between Dependent & Independent Samples. With Dependent Samples, 1 sample can be used to determine the other sample's results. You'll often see examples of cause & effect, or pairs of results. An example would be: if I roll a die, what is the probability that it is odd? Or: if subjects lifted dumbbells each day and recorded results before and after the week, what did we find?
43. Independent Samples are those in which samples from 1 population have no relation to another group. Normally you'll see the word "random" and not cause-and-effect terms. An example: blood samples are taken from 10 random people and tested at lab A, while 10 other random samples are tested at lab B. Or: give 1 random group a drug and another a placebo and test the results.
44. When thinking about probabilities we first must create a hypothesis. A hypothesis is an educated guess that you can test. * If you say restaurants in Los Angeles are expensive, that is a statement and not a hypothesis, because there is nothing to test it against. * If however we say restaurants in Los Angeles are expensive versus restaurants in Pittsburgh, we can test for that. * The technical name for the hypothesis we are testing is the Null Hypothesis. An example is a test to see if average used car prices fall between $19,000 and $21,000. * The Alternative Hypothesis includes all other possible prices in this example. That would be values from $0 to $19,000 and then from $21,000 and higher.
45. When you test a hypothesis, the probability of rejecting the Null Hypothesis when it is actually true is called the Significance Level, represented by α. * Common αs include .01, .05 and .1. * Previously we talked about Z Tables. If the sample mean and the population mean are equal then Z equals 0. * If we create a bell graph and we know that α is .05, then we know that the rejection region for the Null Hypothesis is found at α/2, or .025, in each tail.
* If we use a Z Table and we know µ is 0 and α/2 = .025, we find that the rejection region is less than -1.96 and greater than 1.96. (This is known as a 2 sided test.)
46. * With 1 sided tests, for example if I say I think used car prices are greater than $21,000, the rejection region is everything beyond the Z Code for α instead of α/2, which is 1 - .05 = .95. In the Z Table that is 1.65.
47. When it comes to hypothesis errors there are 2 types. Type I Errors, called False Positives, refer to a rejection of a true null hypothesis. The probability of making this error is alpha. * Then you have Type II Errors, called False Negatives, which occur when you accept a false null hypothesis. This error is normally caused by poor sampling. The probability of making this error is represented by Beta. * The goal of hypothesis testing is to reject a false null hypothesis, which has a probability of 1 - Beta, called the power of the test. You increase the power of the test by increasing the number of samples.
48. This example will clear hypothesis errors up. Say your null hypothesis is that there is no reason to apply for a job because you won't get it. You can call this the status quo belief. * If you then don't apply and the null hypothesis was correct, your decision was correct. * Also, if you rejected the null hypothesis, applied, and got the job, you again made the correct decision. * However, if the null hypothesis was correct and you applied anyway, that would be an example of a Type I Error. * And if you chose not to apply but the null hypothesis was false, that would be an example of a Type II Error.
49. Now let's talk about means testing. I want to calculate if my sample is higher or lower than the population mean. To find out I need a 2 sided test. The population mean is the Null Hypothesis. * That Null Hypothesis is that brake pads should last for 64,000 km.
50. * Here is my sample brake pad data.
51. We calculate our sample mean, * standard deviation, * sample size, * and Standard Error.
52. * We need to standardize our means so we can compare them even if they have different standard deviations. * We standardize our variable by subtracting the mean and then dividing by the standard deviation. When we do this we normalize our data, meaning we get a mean of 0 and a standard deviation of 1. Z = (x̅ - μ0) / Standard Error, where Standard Error = standard deviation / √n.
53. * We then take the absolute value of this result.
54. If my confidence is .95, α is .05, and since we are using a 2 sided test we use α/2 = .025. * If we subtract .025 from 1 we get .9750. * If we look up .9750 on the Z Table we get a Z Score of 1.96.
55. * We now compare the absolute value of the z score we calculated before, which is 8.99, to the Critical Value, which is 1.96. Since 8.99 is greater than 1.96, we reject the Null Hypothesis. To be more specific, we are saying that at a .95 confidence level we reject that the brake pads have an average lifecycle of 64,000 km.
56. The P Value is the smallest level of significance at which we can reject the Null Hypothesis. * In our example we found a Z Score of 8.99, which isn't on our chart. * Let's say instead that the Null Hypothesis was 61,750 km. That would mean the hypothesis could be rejected at 1 - .99996 = .00004 significance. So here the P Value for a 1 sided test is .00004. For a 2 sided test we multiply .00004 by 2 to get .00008.
57. Now let's talk about regression. Neural Networks are built from huge datasets that are hard to work with. We can statistically calculate outputs based on sample inputs. If we believe there is a linear relationship between 2 types of data, meaning as one increases so does the other, we can make predictions. Linear regression looks at samples and fits a line to those samples.
58. We do this like we do with any linear equation. We find the slope b1 and then b0, which is the Y intercept.
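The means test above boils down to a few lines of code. A sketch; the sample list here is invented, not the video's brake-pad table:

```python
import math

def z_statistic(sample, null_mean):
    # Standardize the sample mean against the null-hypothesis mean:
    # Z = (x̄ - μ0) / (s / √n); compare |Z| against the critical value.
    n = len(sample)
    mean = sum(sample) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
    sample_error = sd / math.sqrt(n)
    return (mean - null_mean) / sample_error

# Two-sided test at .95 confidence: reject the null hypothesis when
# the absolute z statistic exceeds the critical value 1.96.
sample = [63200, 61800, 62500, 63900, 61100]  # hypothetical readings
reject = abs(z_statistic(sample, 64000)) > 1.96
```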
We are basically fitting our line as closely as possible to the sample points. This is called the regression line. We note that it is a regression line by using ŷ (y hat) instead of y.
59. Here is the formula for calculating b1, the slope. We sum the product of each value of x minus its mean and each value of y minus its mean. We then divide by the sum of each value of x minus its mean, squared (squaring eliminates negative values). Now we have the slope. To calculate the y intercept, b0, I find ȳ - slope * x̄.
60. Here is an example of how you'd calculate the linear regression line. Get the means for x & y. Sum the product of each value of x minus its mean and each value of y minus its mean. Get the sum of all values of x minus the mean, squared. Then find the slope by dividing those values to get 5.958. Then calculate the value for the y intercept. Then you can create the formula for the line, which you can see to the right.
61. How do we find out if our regression line is a good fit for our data? We do that with something we have already covered: the correlation coefficient. Remember that the correlation coefficient calculates whether the values of x and y are related (correlated). We calculate it by finding the covariance of X & Y and then dividing by the product of the standard deviations of X & Y. If the value is close to 1 the data is highly correlated, which means our regression line should have an easy time modeling the data.
62. Let's work through an example where we find the correlation coefficient. First we must calculate the covariance for all x and y values, which equals 1733.09.
63. Now that we have the covariance we can divide it by the standard deviation of x multiplied by the standard deviation of y. When we do that we get .9618. Since .9618 is so close to 1, we know that our linear regression line will be tightly matched to the data.
64. Now I want to talk about the Coefficient of Determination. There are numerous calculations involved in creating a regression line.
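The slope and intercept formulas translate directly into code. A sketch; the sample points are mine, chosen to lie exactly on y = 2x so the result is easy to verify:

```python
def linear_regression(xs, ys):
    # b1 = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)²   (the slope)
    # b0 = ȳ - b1 * x̄                     (the y intercept)
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    b0 = my - b1 * mx
    return b0, b1

# The regression line ŷ = b0 + b1·x for points on y = 2x.
b0, b1 = linear_regression([1, 2, 3], [2, 4, 6])
print(b0, b1)  # 0.0 2.0
```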
Meanwhile it takes seconds and zero thought to calculate the mean line. So is it worth it to go through the hassle? The Coefficient of Determination tells us.
65. The Coefficient of Determination is calculated as a percentage. What we need to do is calculate the sum of the squared errors between the mean line and the sample points. We build a square from each point to the mean line. Then we create squares from the regression line to all sample points. We can then sum the areas of the squares for both, subtract, and find out how much error we eliminated with our regression line.
66. As an example, let's say the sum of the square areas for the mean line is 1000, and the sum of areas for the regression line is 150. We can then calculate that 85% of the error is eliminated when we use the regression line.
67. Root Mean Squared Deviation is a measure of the differences between the sample points and the regression line. We are using all these formulas to better understand how well our regression line is estimating the data. So we find the residual for each data point. The residual is represented by the black lines that go from the data point to the regression line. If each residual is e, we take the square root of the sum of all residuals squared divided by the number of samples minus 1. We have the table with both the samples and the regression line, so I'll find the Root Mean Squared Deviation.
68. I calculate e by subtracting the value of my regression line from the sample y. I then square all those values and find their sum. If I divide by the number of samples minus 1 and then find the square root I get 28.86. That means for 1 standard deviation, which covers 68% of all samples, our regression line will be off by at most plus or minus 28.86.
69. We could then add and subtract 28.86 and create 2 more lines that will capture 68% of all values. * We could then add in another line on the top and bottom and capture 95% of all points.
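Both measures above can be sketched as small functions. The data passed to them is hypothetical; the final line just checks the 1000/150 arithmetic from the worked example:

```python
import math

def coefficient_of_determination(ys, predictions):
    # Share of the squared error around the mean line that the
    # regression line eliminates.
    mean_y = sum(ys) / len(ys)
    ss_mean = sum((y - mean_y) ** 2 for y in ys)
    ss_line = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    return (ss_mean - ss_line) / ss_mean

def rmsd(ys, predictions):
    # Root Mean Squared Deviation: square root of the summed squared
    # residuals divided by n - 1.
    ss_line = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    return math.sqrt(ss_line / (len(ys) - 1))

# The worked example: (1000 - 150) / 1000 = 0.85, i.e. 85% eliminated.
print((1000 - 150) / 1000)  # 0.85
```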
70. Now to finish up we'll talk about Chi (pronounced "kai") Square Tests. Before you can perform the tests you must meet the conditions that the data is random, large (each expected cell count must be > 5), and independent (sample with replacement, or the 10% rule). The Chi Square Test of Homogeneity is used when you want to look at the relationship between different categories of variables. It is used when you sample from 2 groups and want to compare their probability distributions.
71. What we are trying to find is whether age has an effect on people's preference for a favorite sport. Our null hypothesis is that age doesn't affect favorite sport, and the alternative is that it does.
72. If we calculate the percentages for all columns we get these results. Under the null hypothesis we should expect that, for example, 25% of 18 to 29 year olds prefer the NBA, and the percentages should work out for all the other sports organizations. The easiest way to calculate the expected value for each cell in the chart is to multiply the cell's column total by its row total and then divide by the total number of people. So the expected value for 18 to 29 year olds who like the NBA is 66 * 35 / 142 = 16.3.
73. I calculated the expected value for each cell. You can see that the row and column totals are still the same. The Chi Square formula is χ² = Σ(observed - expected)² / expected. If we perform this calculation we get 7.28. The larger this value, the more likely these variables affect each other. We look up this value in a Chi Square table, but we also need the degrees of freedom for our data. You get that by multiplying the number of columns minus 1 by the number of rows minus 1, or 3 * 1 = 3.
74. Now we find our degrees of freedom and the closest match to 7.28 in a Chi Square table. When we do, we find that we can reject the null hypothesis at the 90% confidence level but not at 95%, suggesting age does have some effect on a person's favorite sport.
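The expected-value and χ² arithmetic above can be sketched as one function. The example table is invented; a table whose rows are in exact proportion gives χ² = 0, meaning no association:

```python
def chi_square(observed):
    # observed: a list of rows of counts. Expected cell value =
    # row total * column total / grand total; χ² sums
    # (observed - expected)² / expected over every cell.
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand = sum(row_totals)
    chi2 = sum((o - row_totals[i] * col_totals[j] / grand) ** 2
               / (row_totals[i] * col_totals[j] / grand)
               for i, row in enumerate(observed)
               for j, o in enumerate(row))
    # Degrees of freedom: (rows - 1) * (columns - 1).
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, df

# Rows in exact proportion show no association at all.
print(chi_square([[10, 20], [20, 40]]))  # (0.0, 1)
```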
