All of these are examples of variables.
Let’s say that you own a bookstore and you’ve collected information from 125 customers: each customer’s age, income, occupation, marital status, and reading preferences.
Data distribution: a listing of the values or responses associated with a particular variable in a data set.
A population (or universe) is all possible cases that meet certain criteria. It’s the total collection of cases that you’re interested in studying.
Examples:
Think of a population as the collection of all possible cases, recognizing that what constitutes the population may change over time.
A sample is simply a portion of a population.
To say that a sample is representative of the population is to say that the sample mirrors the population in important respects.
Example: Imagine a population that has a male/female split, or ratio, of 60%/40%. If a sample of the population is representative, you’d expect it to have a male/female split very close to 60%/40%.
1. Descriptive statistics are used to summarize or describe data from samples and populations.
Example: Describe your last semester scores
2. Inferential Statistics
Inferential statistics: Rely on sample data to make inferences about the population. Use sample statistics to make inferences about population parameters
Sample statistics vs. Population parameters
Can you assume that the mean you calculated for your sample is equal to the mean of the population?
Sampling error: Every sample can yield a different statistic, so a sample statistic will usually differ somewhat from the population parameter it estimates.
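As a rough illustration of sampling error, the sketch below draws several samples from one population and compares their means. The population is made up for this example (1,000 simulated salaries), and the seed, sample size, and number of samples are arbitrary assumptions.

```python
import random

# Hypothetical population: 1,000 simulated salaries (not data from these notes).
random.seed(0)  # fixed seed so the run is repeatable
population = [random.gauss(1000, 200) for _ in range(1000)]
population_mean = sum(population) / len(population)

# Draw five samples of 50 cases each and record each sample mean.
sample_means = []
for _ in range(5):
    sample = random.sample(population, 50)  # sampling without replacement
    sample_means.append(sum(sample) / len(sample))

print("population mean:", round(population_mean, 1))
print("sample means:", [round(m, 1) for m in sample_means])
```

Each sample mean should land near the population mean, but the five values will generally not match each other or the population mean exactly.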
salaries = [1000, 1200, 800, 900, 850, 1150, 1300, 5000]
x_bar = sum(salaries)/len(salaries)
x_bar
1525.0
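One of the properties of the mean is that it is sensitive to extreme scores. In the salaries above, the single value of 5000 pulls the mean up to 1525, while the median is much less affected.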
import numpy as np
salaries = [1000, 1200, 800, 900, 850, 1150, 1300, 5000]
print(np.mean(salaries))
np.median(salaries)
1525.0
1075.0
Two distributions can share the same mean, but can be very different in terms of the variability of individual scores.
Dispersion (variability) is an expression of the extent to which the scores are spread out in a distribution.
The range is a statement of the lowest score and the highest score in a distribution. It is often reported as a single number: the highest score minus the lowest score.
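A minimal sketch of the range, reusing the salaries list from the earlier cells. The variable names here are only illustrative.

```python
salaries = [1000, 1200, 800, 900, 850, 1150, 1300, 5000]

lowest = min(salaries)
highest = max(salaries)
salary_range = highest - lowest  # range as a single number

print(lowest, highest, salary_range)  # 800 5000 4200
```

Note how the range depends entirely on the two most extreme scores, so the one salary of 5000 determines most of it.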
What is the limitation of deviations from the mean?
Assuming our goal is to get a summary measure of the deviation from or about the mean, we always end up with the same sum of deviations (a value of 0), regardless of the underlying distribution, and that tells us nothing.
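To see why the deviations always sum to zero, here is a small check using the same salaries list as before. The tolerance is there only because floating-point arithmetic is not always exact.

```python
salaries = [1000, 1200, 800, 900, 850, 1150, 1300, 5000]
x_bar = sum(salaries) / len(salaries)

# Deviation of each score from the mean.
deviations = [x - x_bar for x in salaries]
total = sum(deviations)

print(deviations)
print(total)  # zero, up to floating-point rounding
```

The negative deviations (scores below the mean) exactly cancel the positive ones, whatever the data look like.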
What is the limitation of the mean deviation?
It’s based on absolute values, and absolute values are difficult to manipulate in more complex formulas.
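For reference, the mean deviation can be sketched as the average of the absolute deviations from the mean. This again reuses the salaries list from the earlier cells.

```python
salaries = [1000, 1200, 800, 900, 850, 1150, 1300, 5000]
x_bar = sum(salaries) / len(salaries)

# Taking absolute values stops positive and negative deviations from cancelling.
abs_deviations = [abs(x - x_bar) for x in salaries]
mean_deviation = sum(abs_deviations) / len(abs_deviations)

print(mean_deviation)  # 868.75
```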
What is the limitation of the variance?
What are your best and worst performances?
You found the difference between your score on a particular test and the mean for that test. Then you divided that difference by the standard deviation. The result is a Z score: the number of standard deviations your score lies above or below the mean.
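The calculation just described (score minus mean, divided by standard deviation) can be sketched as follows. The test names, scores, means, and standard deviations are hypothetical values chosen for illustration, and `z_score` is just a helper name used here.

```python
def z_score(score, mean, std_dev):
    """Return how many standard deviations a score lies from the mean."""
    return (score - mean) / std_dev

# Hypothetical tests: (your score, class mean, class standard deviation).
tests = {
    "math": (80, 70, 10),
    "history": (85, 80, 2.5),
}

z_scores = {name: z_score(*values) for name, values in tests.items()}
best = max(z_scores, key=z_scores.get)
worst = min(z_scores, key=z_scores.get)

print(z_scores)  # {'math': 1.0, 'history': 2.0}
print("best:", best, "worst:", worst)
```

With these made-up numbers, the math score is further above its mean in raw points (10 versus 5), yet the history score is the stronger performance relative to the class because its Z score is larger.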
1. Categorical data: Bar graph
2. Continuous (numeric) data: Frequency distribution
Two distributions have the same mean score (50), but beyond that, they are very different.
What do they look like?
These two distributions:
What do they look like?
What does it look like?
A symmetrical distribution is one in which the two halves of the distribution are mirror images of each other.
A skewed distribution is a distribution that departs from symmetry in the sense that most of the cases are concentrated at one end of the distribution.
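One informal way to spot skew is to compare the mean with the median. The sketch below applies this to the salaries list from the earlier cells. Treat it as a rule of thumb under simple assumptions, not a formal test of skewness.

```python
import statistics

salaries = [1000, 1200, 800, 900, 850, 1150, 1300, 5000]

mean = sum(salaries) / len(salaries)
median = statistics.median(salaries)

# Rule of thumb only: extreme high scores pull the mean above the median.
if mean > median:
    shape = "possibly skewed to the right (long tail of high scores)"
elif mean < median:
    shape = "possibly skewed to the left (long tail of low scores)"
else:
    shape = "mean equals median, consistent with symmetry"

print(mean, median, shape)
```

A histogram remains the more direct check, since the mean and median can coincide in distributions that are not symmetrical.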
import numpy as np
import matplotlib.pyplot as plt
scores = np.load("scores.npy")
avg = np.mean(scores)
med = np.median(scores)
std = np.std(scores)
print(avg, med, std)
plt.hist(scores);
3.7126436781609193 4.0 1.3188121264072776
# random seed
np.random.normal(loc=75, scale=10, size=50)
array([ 74.30093752, 78.92972416, 61.02005794, 79.30927317, 73.50646556, 74.56632259, 78.74505964, 80.58006857, 43.09238996, 74.14711304, 79.38212521, 70.52603402, 68.85559408, 74.52966114, 70.71681896, 64.66813334, 80.19544652, 76.06015824, 72.36173114, 74.90111396, 74.49191692, 67.23081953, 84.19548895, 72.85509745, 55.67532168, 79.17958047, 81.70368627, 81.10091578, 88.72521979, 59.11988725, 68.13747224, 71.36196035, 66.16370853, 68.15002101, 96.50913299, 46.99519364, 100.56229186, 87.02282881, 65.77934199, 74.92775888, 84.25130334, 84.62474095, 91.79857967, 61.26616534, 67.33894403, 88.91289692, 89.3180086 , 47.46850179, 72.36407 , 73.03471751])
Use the representative sample you have chosen above to answer the following questions:
a. Is the sample normally distributed? Justify your answer.
b. What is the value below which 50% of the data fall?
c. What is the value below which 75% of the data fall?
d. What is the value below which 25% of the data fall?
e. What are the values calculated in b, c, and d called in statistics?