Statistics¶

Chapter 1: The What and How of Statistics¶

  • Basic terminology
  • Data types
  • Samples and Population
  • Types of Statistical analysis

Consider the following:

  • Age of students
  • Attitudes toward a particular social issue
  • The number of hours people spend on social media
  • The crime rates in different cities
  • The levels of air pollution in different locations

All of these are examples of variables.

Definitions:¶

  • A variable is anything that can vary; it’s anything that can take on a different quality or quantity.
  • The information about different variables is referred to as data, a term that’s at the center of statistical analysis.
  • Statistical analysis: the collection, organization, and interpretation of data according to well-defined procedures.
  • When the data relative to some specific variables are assembled, we refer to the collection or bundle of information as a data set.
  • The individual pieces of information are referred to as data points.
  • Case (observation): the data points recorded for the same subject.

Example¶

Let’s say that you own a bookstore and you’ve collected information from 125 customers: each customer’s age, income, occupation, marital status, and reading preferences.

  • What are the variables?
  • Give examples of data points.
  • Give an example of a case.
  • What is the data set?
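
One way to picture these terms is a small table of records. The sketch below uses three invented customers (the field names and values are hypothetical, not taken from real bookstore data) to show where a variable, a case, and a data point sit inside a data set.

```python
# Hypothetical records for three of the 125 bookstore customers (all values are invented)
data_set = [
    {"age": 34, "income": 4200, "occupation": "teacher",
     "marital_status": "married", "reading_preference": "fiction"},
    {"age": 21, "income": 1500, "occupation": "student",
     "marital_status": "single", "reading_preference": "science"},
    {"age": 58, "income": 6100, "occupation": "engineer",
     "marital_status": "married", "reading_preference": "history"},
]

variables = list(data_set[0].keys())  # the characteristics that vary between customers
case = data_set[0]                    # all data points recorded for one customer
data_point = case["age"]              # a single piece of information

print("Variables:", variables)
print("One case:", case)
print("One data point:", data_point)
```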

Data Distribution¶

Data distribution: a listing of the values or responses associated with a particular variable in a data set.
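
As a minimal sketch, a distribution can be built by counting how often each value of one variable occurs. The ages below are invented for illustration.

```python
from collections import Counter

# Hypothetical ages of ten students in a class (invented for illustration)
ages = [19, 20, 20, 21, 19, 22, 20, 21, 20, 19]

# The distribution lists each value together with how often it occurs
age_distribution = Counter(ages)
for value in sorted(age_distribution):
    print(value, age_distribution[value])
```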

Example: Class data distribution¶

Data types¶

  1. Nominal: Color, gender, nationality, names, ...
  2. Ordinal: Education level, level of income, grades, ...
  3. Numerical: Height, weight, income, temperature, ...
    • Interval
    • Ratio
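
The data type determines which summaries are meaningful. The sketch below uses invented values and a hypothetical `rank` mapping to illustrate this: counting works for nominal data, ordering for ordinal data, and arithmetic for numerical data. Height is a ratio variable because it has a true zero; temperature in Celsius is an interval variable because its zero is arbitrary.

```python
import statistics

# Hypothetical values, invented for illustration
nationality = ["Palestinian", "Jordanian", "Palestinian", "Egyptian"]       # nominal
education = ["High school", "Bachelor", "Master", "Bachelor", "Bachelor"]   # ordinal
height_cm = [160.0, 172.5, 181.0, 168.5]                                    # numerical (ratio)

# Nominal: categories have no order, so only counting (e.g. the mode) is meaningful
most_common_nationality = statistics.mode(nationality)
print(most_common_nationality)

# Ordinal: categories can be ranked, so sorting and picking the middle one is meaningful
rank = {"High school": 0, "Bachelor": 1, "Master": 2}  # assumed ordering of the levels
ordered = sorted(education, key=rank.get)
middle_level = ordered[len(ordered) // 2]
print(middle_level)

# Numerical: differences and averages are meaningful
mean_height = statistics.mean(height_cm)
print(mean_height)
```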

Samples and Population¶

Population:¶

A population (or universe) is all possible cases that meet certain criteria. It’s the total collection of cases that you’re interested in studying

Examples:

  • The average scores of university students
  • The percentage of female students who earned honors last semester
  • Is the performance of computer science students affected by remote vs. onsite learning?

Can we collect data for all cases in the population?¶

  1. Sometimes it is technically not possible
  2. The population is constantly changing

We therefore think of a population as the collection of all possible cases, recognizing that what constitutes the population may be changing.

Sample¶

A sample is simply a portion of a population.

A sample should be representative of the population. To say that a sample is representative is to say that it mirrors the population in important respects.

Example: Imagine a population that has a male/female split, or ratio, of 60%/40%. If a sample of the population is representative, you’d expect it to have a male/female split very close to 60%/40%.
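
The 60%/40% example can be simulated. This is a sketch under assumptions: the population size (10,000), the sample size (2,000), and the seed are arbitrary choices, and the sample is a simple random sample drawn without replacement.

```python
import random

random.seed(0)  # arbitrary seed so the draw is reproducible

# Hypothetical population of 10,000 people with a 60%/40% male/female split
population = ["male"] * 6000 + ["female"] * 4000

# Simple random sample of 2,000 cases, drawn without replacement
sample = random.sample(population, k=2000)

male_share = sample.count("male") / len(sample)
print(f"Male share in the sample: {male_share:.1%}")
```

A random sample of this size should land close to 60%, though rarely exactly on it.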

Types of Statistical Analysis¶

  1. Descriptive statistics
  2. Inferential statistics

1. Descriptive statistics are used to summarize or describe data from samples and populations.

Example: Describe your last semester scores

2. Inferential statistics rely on sample data to make inferences about the population: sample statistics are used to make inferences about population parameters.

Sample statistics vs. population parameters:

  • The average income of all Palestinian households is a parameter.
  • The average income of a sample of Palestinian households is a statistic.

Sampling error¶

Can you assume that the mean you calculated for your sample is equal to the mean of the population?

Sampling error: the difference between a sample statistic and the corresponding population parameter. Every sample can yield a different statistic.
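
A small simulation makes this visible. The sketch below assumes an invented population of household incomes (mean 3000, standard deviation 800) and draws five samples of 50. Every number here is hypothetical.

```python
import numpy as np

np.random.seed(1)  # arbitrary seed so the run is reproducible

# Hypothetical population of 100,000 household incomes (values are invented)
population = np.random.normal(loc=3000, scale=800, size=100_000)
parameter = population.mean()

# Draw five random samples of 50 households each and compute the statistic for each
sample_means = [
    np.random.choice(population, size=50, replace=False).mean() for _ in range(5)
]

for i, statistic in enumerate(sample_means, start=1):
    print(f"sample {i}: mean = {statistic:.2f}, sampling error = {statistic - parameter:+.2f}")
```

Each sample mean differs from the population mean, and from the other sample means, by a different amount.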

Chapter 2: Describing Data and Distributions¶

  • Measures of Central Tendency
  • Measures of Variability or Dispersion

TODO: correlation¶

How to describe the data?¶

Measures of Central Tendency¶

  • The purpose behind any measure of central tendency is to get an idea about the center, or typicality, of a distribution
  • Types of central tendency:
    • Mean
    • Median
    • Mode

Mean¶

Example¶

In [2]:
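# Salaries; the last value is an extreme score
salaries = [1000, 1200, 800, 900, 850, 1150, 1300, 5000]

# Mean: the sum of the values divided by the number of values
x_bar = sum(salaries) / len(salaries)
x_bar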
Out[2]:
1525.0

One property of the mean is that it is sensitive to extreme scores: the single salary of 5000 pulls the mean (1525) above every other value in the list.
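
To see how much one extreme score matters, the sketch below recomputes the mean of the same salaries after dropping the value 5000. Removing it is only for illustration, not a recommended way to treat outliers.

```python
salaries = [1000, 1200, 800, 900, 850, 1150, 1300, 5000]

mean_all = sum(salaries) / len(salaries)

# Drop the single extreme salary (5000) and recompute the mean
typical = [s for s in salaries if s != 5000]
mean_without_extreme = sum(typical) / len(typical)

print(mean_all)                        # 1525.0
print(round(mean_without_extreme, 2))  # 1028.57
```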

The Median¶

  • The median is the point in a distribution that divides the distribution into halves: half of the scores fall below it and half fall above it.

  • The median doesn’t have to be a value that actually appears in the distribution.
  • The median is not sensitive to extreme scores
In [3]:
import numpy as np
salaries = [1000, 1200, 800, 900, 850, 1150, 1300, 5000]

print(np.mean(salaries))
np.median(salaries)
1525.0
Out[3]:
1075.0

Mode¶

  • The mode is the value that appears most frequently in a distribution
  • Usually used for categorical data (nominal and ordinal)
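
A minimal sketch of finding the mode of a nominal variable by counting. The reading preferences below are invented.

```python
from collections import Counter

# Hypothetical reading preferences of ten customers (invented values)
preferences = ["fiction", "history", "fiction", "science", "fiction",
               "history", "fiction", "science", "history", "fiction"]

# Count each category and take the most frequent one
counts = Counter(preferences)
mode_value, mode_count = counts.most_common(1)[0]
print(mode_value, mode_count)
```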

Is central tendency sufficient to describe the data?¶

Two distributions can share the same mean, but can be very different in terms of the variability of individual scores.

  • For example:
    • Section 1 has a mean score of 75, with scores ranging from 55 to 95
    • Section 2 has the same mean score of 75, but its distribution is more compact, ranging from 70 to 85
  • Statisticians are routinely interested in this matter of dispersion or variability—the extent to which scores are spread out in a distribution.
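
The two sections can be mimicked with a few invented scores that match the description (same mean of 75, different spread). The individual scores are assumptions.

```python
import numpy as np

# Hypothetical scores chosen to match the description of the two sections
section_1 = np.array([55, 65, 75, 85, 95])
section_2 = np.array([70, 72, 73, 75, 85])

for name, scores in [("Section 1", section_1), ("Section 2", section_2)]:
    print(name, "mean:", scores.mean(), "lowest:", scores.min(), "highest:", scores.max())
```

The means are identical, so the mean alone cannot tell the two sections apart.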

Measures of Variability or Dispersion¶

Dispersion (or variability) is an expression of the extent to which the scores are spread out in a distribution.

The Range¶

The range is a statement of the lowest score and the highest score in a distribution. It is often reported as a single number, the difference between the two.
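
A short sketch using the salary list from the mean example:

```python
salaries = [1000, 1200, 800, 900, 850, 1150, 1300, 5000]

lowest, highest = min(salaries), max(salaries)
spread = highest - lowest
print(f"Range: {lowest} to {highest} (a spread of {spread})")
```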

Deviations From the Mean¶

  • How far an individual or raw score in a distribution deviates from the mean of the distribution

What is the limitation of deviations from the mean?

Assuming our goal is to get a summary measure of the deviation from or about the mean, we always end up with the same sum of deviations (a value of 0), regardless of the underlying distribution, and that tells us nothing.
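
This can be checked with the salary list from the mean example: the positive and negative deviations cancel out.

```python
salaries = [1000, 1200, 800, 900, 850, 1150, 1300, 5000]
mean = sum(salaries) / len(salaries)

# Deviation of each raw score from the mean
deviations = [x - mean for x in salaries]
total = sum(deviations)

print(deviations)
print(total)  # 0.0
```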

The Mean Deviation¶

What is the limitation of the mean deviation?

It’s based on absolute values, and absolute values are difficult to manipulate in more complex formulas.
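
The mean deviation is the average of the absolute deviations from the mean. Taking absolute values keeps the deviations from cancelling out. A sketch with the salary list from the mean example:

```python
salaries = [1000, 1200, 800, 900, 850, 1150, 1300, 5000]
mean = sum(salaries) / len(salaries)

# Average of the absolute deviations from the mean
mean_deviation = sum(abs(x - mean) for x in salaries) / len(salaries)
print(mean_deviation)  # 868.75
```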

Variance¶

What are the limitations of the variance?

  • Squaring inflates the influence of large deviations
  • It is not easy to interpret, because it is expressed in squared units
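
The variance is the mean of the squared deviations from the mean. The sketch below divides by n, which treats the salaries as a whole population. That choice is an assumption, since the notes do not say which convention is intended. For a sample, dividing by n - 1 is common (`ddof=1` in NumPy).

```python
import numpy as np

salaries = [1000, 1200, 800, 900, 850, 1150, 1300, 5000]
mean = np.mean(salaries)

# Variance: the mean of the squared deviations from the mean (dividing by n)
squared_deviations = [(x - mean) ** 2 for x in salaries]
variance = sum(squared_deviations) / len(salaries)

print(variance)          # 1752500.0
print(np.var(salaries))  # same quantity; NumPy divides by n by default
```

The result is in squared salary units, which is why it is hard to read directly.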

Standard deviation¶

  • The standard deviation is the square root of the variance.
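
Taking the square root brings the measure back to the original units. A sketch with the same salaries, using NumPy's default of dividing by n:

```python
import numpy as np

salaries = [1000, 1200, 800, 900, 850, 1150, 1300, 5000]

variance = np.var(salaries)
std = np.sqrt(variance)

print(std)
print(np.std(salaries))  # np.std computes the same value directly
```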

Exercise¶

What is your best and worst performance?

Z-score¶

To compute a z-score, find the difference between your score on a particular test and the mean for that test, then divide that difference by the standard deviation. The result expresses the score in standard deviation units.
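
A minimal sketch of the calculation. The function name `z_score` and the test results are hypothetical. It shows that the same raw score can mean different things on two tests.

```python
def z_score(raw_score, mean, std):
    """Number of standard deviations that raw_score lies above (+) or below (-) the mean."""
    return (raw_score - mean) / std


# Hypothetical results on two tests (values invented for illustration)
z_math = z_score(85, mean=75, std=5)
z_physics = z_score(85, mean=80, std=10)

print(z_math, z_physics)  # 2.0 0.5
```

A raw score of 85 is the stronger performance on the first test, where it sits two standard deviations above the mean.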

Exercises¶

Exercise 1¶

Exercise 2¶

Exercise 3¶

Chapter 3: The Shape of Distributions¶

How to visualize data?¶

1. Categorical data: Bar graph

2. Continuous (numeric) data: Frequency distribution
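
Both kinds of plot can be sketched with matplotlib. The majors and scores below are invented, and the number of bins (10) is an arbitrary choice.

```python
from collections import Counter

import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data, invented for illustration
majors = ["CS", "Math", "CS", "Physics", "CS", "Math"]  # categorical
np.random.seed(0)                                        # arbitrary seed
scores = np.random.normal(loc=75, scale=10, size=200)    # numeric

fig, (ax_bar, ax_hist) = plt.subplots(1, 2, figsize=(9, 3))

# Bar graph: one bar per category, height = number of cases
counts = Counter(majors)
ax_bar.bar(list(counts.keys()), list(counts.values()))
ax_bar.set_title("Categorical: bar graph")

# Histogram: group the numeric values into bins and count the cases per bin
bin_counts, bin_edges, _ = ax_hist.hist(scores, bins=10)
ax_hist.set_title("Numeric: frequency distribution")

fig.tight_layout()  # in a notebook the figure renders at the end of the cell
```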

Question 1:¶

Two distributions have the same mean score (50), but beyond that, they are very different.

  • In Distribution A, the scores are widely dispersed, ranging from 10 to 90.
  • In Distribution B, the scores are tightly clustered about the mean, ranging from 30 to 70.

What do they look like?

Question 2:¶

These two distributions:

  • In Distribution A, the scores range from 10 to 40
  • In Distribution B, the scores range from 70 to 100

What do they look like?

Question 3:¶

  • The salaries of most employees at an organization range between 1,000 and 2,000, but a few executives earn salaries of around 15,000

What does the distribution look like?

Symmetrical distributions:¶

A symmetrical distribution is one in which the two halves of the distribution are mirror images of each other.

Skewed distribution¶

A skewed distribution is a distribution that departs from symmetry in the sense that most of the cases are concentrated at one end of the distribution.
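
The salary scenario from Question 3 is a typical skewed distribution. The sketch below builds it with invented numbers (97 employees, 3 executives, an arbitrary seed) and shows how the few large salaries pull the mean above the median.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed

# Hypothetical organization: 97 employees earning 1,000-2,000 and 3 executives near 15,000
employees = np.random.uniform(1000, 2000, size=97)
executives = np.array([14500.0, 15000.0, 15500.0])
salaries = np.concatenate([employees, executives])

mean_salary = salaries.mean()
median_salary = np.median(salaries)

print("mean:  ", round(float(mean_salary), 1))
print("median:", round(float(median_salary), 1))
```

Most cases sit at the low end while the tail stretches toward the high values, so the mean lands above the median.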

Normal Probability Distribution¶

  • A normal curve is symmetrical. Therefore, 50% of the cases (or area) under the curve are found above the mean, and 50% of the cases (or area) are found below the mean.
  • The mean, median, and mode all coincide on a normal curve
  • The mean and the standard deviation will define the exact shape of a normal curve.

Are assignment #1 scores normally distributed?¶

In [10]:
import numpy as np
import matplotlib.pyplot as plt

scores = np.load("scores.npy")
avg = np.mean(scores)
med = np.median(scores)
std = np.std(scores)
print(avg, med, std)

plt.hist(scores);
3.7126436781609193 4.0 1.3188121264072776

Which one has a larger standard deviation?¶

1-2-3 Rule (The Empirical Rule)¶
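
For a normal distribution, about 68% of the cases fall within one standard deviation of the mean, about 95% within two, and about 99.7% within three. The sketch below checks this on simulated data. The mean (75), standard deviation (10), sample size, and seed are arbitrary choices.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed
sample = np.random.normal(loc=75, scale=10, size=100_000)  # hypothetical scores

# Express every value in standard deviation units
z = (sample - sample.mean()) / sample.std()

# Share of cases within 1, 2 and 3 standard deviations of the mean
shares = {k: float(np.mean(np.abs(z) <= k)) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"within {k} standard deviation(s): {share:.1%}")
```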

How to create a random sample from normal distribution ?¶

In [16]:
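# No random seed is set, so every run returns different values.
# Call np.random.seed(<integer>) first to make the draw reproducible.

# Draw 50 values from a normal distribution with mean 75 and standard deviation 10
np.random.normal(loc=75, scale=10, size=50)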
Out[16]:
array([ 74.30093752,  78.92972416,  61.02005794,  79.30927317,
        73.50646556,  74.56632259,  78.74505964,  80.58006857,
        43.09238996,  74.14711304,  79.38212521,  70.52603402,
        68.85559408,  74.52966114,  70.71681896,  64.66813334,
        80.19544652,  76.06015824,  72.36173114,  74.90111396,
        74.49191692,  67.23081953,  84.19548895,  72.85509745,
        55.67532168,  79.17958047,  81.70368627,  81.10091578,
        88.72521979,  59.11988725,  68.13747224,  71.36196035,
        66.16370853,  68.15002101,  96.50913299,  46.99519364,
       100.56229186,  87.02282881,  65.77934199,  74.92775888,
        84.25130334,  84.62474095,  91.79857967,  61.26616534,
        67.33894403,  88.91289692,  89.3180086 ,  47.46850179,
        72.36407   ,  73.03471751])

Verify mean and median are similar¶
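
A minimal sketch of the check, assuming a fresh sample drawn with an arbitrary seed. It does not use the exact values printed above.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed so the sample is reproducible
sample = np.random.normal(loc=75, scale=10, size=50)

mean = sample.mean()
median = np.median(sample)
print(mean, median)
```

With only 50 values the two will not match exactly, but for normally distributed data they should be close.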

Calculate the z-score for every data point¶
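
With NumPy the z-score of every data point can be computed in one vectorized expression. The sketch again assumes a fresh sample with an arbitrary seed.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed so the sample is reproducible
sample = np.random.normal(loc=75, scale=10, size=50)

# Subtract the mean from every value and divide by the standard deviation
z_scores = (sample - sample.mean()) / sample.std()

print(z_scores[:5])
print(z_scores.mean(), z_scores.std())
```

By construction the z-scores have a mean of 0 and a standard deviation of 1, up to floating-point rounding.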

Assignment #3¶

  • Generate a normally distributed sample of size 10,000,000 with mean = 1500 and standard deviation = 700, using your ID as the random seed.
  • Assume the data set generated above is the population, and randomly draw samples of the following sizes from it:
    • 10
    • 100
    • 1,000
    • 10,000
  • Answer the following questions:
    • What is the relation between the sample size and the sampling error?
    • Which sample is most representative of the population? Justify your answer.
  • Use the representative sample you chose above to answer the following questions:

    a. Is the sample normally distributed? Justify your answer.

    b. What is the value below which 50% of the data fall?

    c. What is the value below which 75% of the data fall?

    d. What is the value below which 25% of the data fall?

    e. What are the values calculated in b, c, and d called in statistics?

Notes:¶

  • Zip, .py, and .html files will not be accepted
  • Add your ID to the notebook file name
  • Rubrics:
    • Clean and organised code
    • No errors
    • Clear interpretation
    • Correct solution