Pandas¶

  • It contains data structures and data manipulation tools designed to make data cleaning and analysis fast and easy in Python
  • Pandas is designed for working with tabular or heterogeneous data. Numpy not
In [ ]:
 

Series¶

A Series is a one-dimensional array-like object containing a sequence of values (of similar types to NumPy types) and an associated array of data labels, called its index.

In [1]:
# Create series
import pandas as pd
s = pd.Series([4, 7, -5, 3])
s 
Out[1]:
0    4
1    7
2   -5
3    3
dtype: int64
In [ ]:
# Create series with index
obj2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])
obj2
In [ ]:
#Change value by index
obj2['b'] = 10
obj2

Use NumPy functions or NumPy-like operations¶

In [ ]:
# Filter with boolean array
obj2[obj2 > 1 ]
In [ ]:
# scalar multiplication
obj2 * 2
In [ ]:
#apply math functions
import numpy as np
np.exp(obj2)

Dictionary operations¶

In [ ]:
# Check if contains index
"b" in obj2
In [ ]:
#Create series from dictionary
sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
obj3 = pd.Series(sdata)
obj3
In [ ]:
#Series to dictionary
obj3.to_dict()

Dataframe¶

A DataFrame represents a rectangular table of data composed of series objects share the same index

Create dataframe¶

In [ ]:
# Create from dictionary
data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
        "year": [2000, 2001, 2002, 2001, 2002, 2003],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
df = pd.DataFrame(data) 

# Show all data
df

Usually we read the data from files¶

Galton's expirement¶

In [3]:
# Read data from csv file
df = pd.read_csv('galton.csv')
df
Out[3]:
family father mother midparentHeight children childNum gender childHeight
0 1 78.5 67.0 75.43 4 1 male 73.2
1 1 78.5 67.0 75.43 4 2 female 69.2
2 1 78.5 67.0 75.43 4 3 female 69.0
3 1 78.5 67.0 75.43 4 4 female 69.0
4 2 75.5 66.5 73.66 4 1 male 73.5
... ... ... ... ... ... ... ... ...
929 203 62.0 66.0 66.64 3 1 male 64.0
930 203 62.0 66.0 66.64 3 2 female 62.0
931 203 62.0 66.0 66.64 3 3 female 61.0
932 204 62.5 63.0 65.27 2 1 male 66.5
933 204 62.5 63.0 65.27 2 2 female 57.0

934 rows × 8 columns

In [ ]:
# Dataframe metadata
df.info()
In [ ]:
# Show the first 3 rows
df.head(3)
In [ ]:
# Show the last 3 rows
df.tail(3)

Show colums¶

In [ ]:
# Show columns
df.columns
In [ ]:
# Select column by name
df['childHeight']
In [ ]:
# Select column by name, option 2
df.childHeight

Which one is more valid?¶

Add columns¶

In [ ]:
# Create new column with constant value
df['family_size'] = 'small'
df

More operations¶

In [ ]:
# Convert to Numpy array
df.to_numpy()
In [ ]:
# Transpose dataframe
df.T

Filtering data¶

In [ ]:
# Select one column
df['mother']
In [ ]:
# Select columns
df[['family', 'children']]
In [ ]:
# Select rows by index
df.loc[0]
In [ ]:
# Select by sequence index
df.iloc[0]
In [ ]:
# Select rows by range
df.loc[1:3]
In [ ]:
# Select rows and columns
df.loc[1:3 ,['childHeight', 'gender']]
In [ ]:
# Filter criteria
df[df['children'] > 3]

Element-wise operations¶

apply, map, applymap functions¶

In [ ]:
# apply funtion
#TODO: find family_size


df['family_size'] = df.apply(??, axis=1)
df
In [ ]:
# common array statistics: min, max, mean, sum,...
df.max()  
In [ ]:
 
In [ ]:
# map function
def my_format(x): 
    return f"{x:.1f}"

df['midparentHeight'] = df[' '].map(my_format)
df 
In [ ]:
#applymap function
#df['midparentHeight'] = df['midparentHeight'].astype(float)
df[['father', 'mother', 'midparentHeight']].applymap(my_format)

Sort dataframe¶

In [4]:
# Sort by columns
df.sort_values(["family", "childHeight"])
Out[4]:
family father mother midparentHeight children childNum gender childHeight
2 1 78.5 67.0 75.43 4 3 female 69.0
3 1 78.5 67.0 75.43 4 4 female 69.0
1 1 78.5 67.0 75.43 4 2 female 69.2
0 1 78.5 67.0 75.43 4 1 male 73.2
32 10 74.0 65.5 72.37 1 1 female 65.5
... ... ... ... ... ... ... ... ...
427 99 69.0 66.0 70.14 8 5 female 65.5
426 99 69.0 66.0 70.14 8 4 male 71.5
425 99 69.0 66.0 70.14 8 3 male 71.7
424 99 69.0 66.0 70.14 8 2 male 72.0
423 99 69.0 66.0 70.14 8 1 male 73.0

934 rows × 8 columns

In [5]:
# Sort by columns, descending
df.sort_values(["family", "childHeight"], ascending = False)
Out[5]:
family father mother midparentHeight children childNum gender childHeight
423 99 69.0 66.0 70.14 8 1 male 73.0
424 99 69.0 66.0 70.14 8 2 male 72.0
425 99 69.0 66.0 70.14 8 3 male 71.7
426 99 69.0 66.0 70.14 8 4 male 71.5
427 99 69.0 66.0 70.14 8 5 female 65.5
... ... ... ... ... ... ... ... ...
32 10 74.0 65.5 72.37 1 1 female 65.5
0 1 78.5 67.0 75.43 4 1 male 73.2
1 1 78.5 67.0 75.43 4 2 female 69.2
2 1 78.5 67.0 75.43 4 3 female 69.0
3 1 78.5 67.0 75.43 4 4 female 69.0

934 rows × 8 columns

Can dataframe contain duplicate index values ?¶

In [6]:
# Create dataframe with dupolicate index values
import numpy as np
obj = pd.DataFrame(np.arange(5), index=["a", "a", "b", "b", "c"], columns=['col_1'])
obj
Out[6]:
col_1
a 0
a 1
b 2
b 3
c 4
In [7]:
# Check if index is unique
obj.index.is_unique
Out[7]:
False
In [8]:
# Access duplicate index
obj.loc['b']
Out[8]:
col_1
b 2
b 3

Dataframe with missing values¶

In [9]:
# Create dataframe with some missing values
obj = pd.DataFrame([[1.4, np.nan], 
                   [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                index=["a", "b", "c", "d"],
                columns=["one", "two"])
obj
 
Out[9]:
one two
a 1.40 NaN
b 7.10 -4.5
c NaN NaN
d 0.75 -1.3
In [13]:
# Check missing values
obj.isna().sum()
Out[13]:
one    1
two    2
dtype: int64
In [ ]:
obj.isna().sum() 

Descriptive statistics¶

In [14]:
# sum
df.sum(mn )
Out[14]:
family             1111222233444445555556777777888910111111111111...
father                                                       64630.1
mother                                                       59859.4
midparentHeight                                            64639.126
children                                                        5764
childNum                                                        3349
gender             malefemalefemalefemalemalemalefemalefemalemale...
childHeight                                                  62340.7
dtype: object
In [15]:
# mean
df.mean()
/var/folders/h2/b7pvzcsn4ylbpbhjmh2sl5m40000gn/T/ipykernel_4281/3239134474.py:2: FutureWarning: The default value of numeric_only in DataFrame.mean is deprecated. In a future version, it will default to False. In addition, specifying 'numeric_only=None' is deprecated. Select only valid columns or specify the value of numeric_only to silence this warning.
  df.mean()
Out[15]:
father             69.197109
mother             64.089293
midparentHeight    69.206773
children            6.171306
childNum            3.585653
childHeight        66.745931
dtype: float64
In [19]:
# index value where the maximum values are attained
df.loc[df["childHeight"]. ()]
Out[19]:
family               72
father             70.0
mother             65.0
midparentHeight    70.1
children              7
childNum              1
gender             male
childHeight        79.0
Name: 292, dtype: object
In [20]:
# Show summary statistics
df.describe()
Out[20]:
father mother midparentHeight children childNum childHeight
count 934.000000 934.000000 934.000000 934.000000 934.000000 934.000000
mean 69.197109 64.089293 69.206773 6.171306 3.585653 66.745931
std 2.476479 2.290886 1.802370 2.729025 2.361410 3.579251
min 62.000000 58.000000 64.400000 1.000000 1.000000 56.000000
25% 68.000000 63.000000 68.140000 4.000000 2.000000 64.000000
50% 69.000000 64.000000 69.248000 6.000000 3.000000 66.500000
75% 71.000000 65.875000 70.140000 8.000000 5.000000 69.700000
max 78.500000 70.500000 75.430000 15.000000 15.000000 79.000000

Correlation¶

Correlation summarizes the relationship between two variables

How to assess the correlation ?¶

  • Option 1: Visualize data (scatter plot)
  • Option 2: Correlation coefficient

Option 1: Visualization¶

In [21]:
df.plot.scatter(x='father', y='childHeight');
In [23]:
#df.plot.scatter(x='mother', y='childHeight');
df.plot.scatter(x='midparentHeight', y='childHeight');

Option 2: Correlation Coefficient¶

  • A correlation coefficient quite close to 0 implies little or no relationship between the two variables.
  • A correlation coefficient close to +1 means a positive relationship between the two variables, with increases in one of the variables being associated with increases in the other variable.
  • A correlation coefficient close to -1 indicates a negative relationship between two variables, with an increase in one of the variables being associated with a decrease in the other variable.

In [24]:
df.corr()
/var/folders/h2/b7pvzcsn4ylbpbhjmh2sl5m40000gn/T/ipykernel_4281/1134722465.py:1: FutureWarning: The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.
  df.corr()
Out[24]:
father mother midparentHeight children childNum childHeight
father 1.000000 0.060366 0.728439 -0.151333 -0.087446 0.266039
mother 0.060366 1.000000 0.727834 -0.033582 -0.019405 0.201322
midparentHeight 0.728439 0.727834 1.000000 -0.127016 -0.073395 0.320950
children -0.151333 -0.033582 -0.127016 1.000000 0.577838 -0.126720
childNum -0.087446 -0.019405 -0.073395 0.577838 1.000000 -0.625026
childHeight 0.266039 0.201322 0.320950 -0.126720 -0.625026 1.000000

Explore categorical variables¶

In [25]:
# Get unique values
df['children'].unique()
Out[25]:
array([ 4,  2,  5,  6,  1,  3,  8,  9,  7, 11, 10, 15])
In [26]:
# value counts
df['children'].value_counts()
Out[26]:
5     140
8     136
4     124
7     119
6     108
3      66
9      63
2      46
11     44
10     40
1      33
15     15
Name: children, dtype: int64

Chapter 6: Data Loading¶

Arguments categories:¶

Data loading functions has a long list of optional arguments

for example: pandas.read_csv has around 50 argument

type inference: That means you don’t necessarily have to specify which columns are numeric, integer, Boolean, or string

Loading and writing CSV files¶

In [27]:
# Read CSV file
df = pd.read_csv("galton.csv")
df
Out[27]:
family father mother midparentHeight children childNum gender childHeight
0 1 78.5 67.0 75.43 4 1 male 73.2
1 1 78.5 67.0 75.43 4 2 female 69.2
2 1 78.5 67.0 75.43 4 3 female 69.0
3 1 78.5 67.0 75.43 4 4 female 69.0
4 2 75.5 66.5 73.66 4 1 male 73.5
... ... ... ... ... ... ... ... ...
929 203 62.0 66.0 66.64 3 1 male 64.0
930 203 62.0 66.0 66.64 3 2 female 62.0
931 203 62.0 66.0 66.64 3 3 female 61.0
932 204 62.5 63.0 65.27 2 1 male 66.5
933 204 62.5 63.0 65.27 2 2 female 57.0

934 rows × 8 columns

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

In [28]:
# Define index
df2 = pd.read_csv("galton.csv", index_col = 'family')
df2.head()
Out[28]:
father mother midparentHeight children childNum gender childHeight
family
1 78.5 67.0 75.43 4 1 male 73.2
1 78.5 67.0 75.43 4 2 female 69.2
1 78.5 67.0 75.43 4 3 female 69.0
1 78.5 67.0 75.43 4 4 female 69.0
2 75.5 66.5 73.66 4 1 male 73.5
In [ ]:
df2.iloc[0]
In [ ]:
df2.loc[0]
In [29]:
# Define hierarchical index
df_3 = pd.read_csv("galton.csv", index_col = ['family', 'childNum'])
df_3
Out[29]:
father mother midparentHeight children gender childHeight
family childNum
1 1 78.5 67.0 75.43 4 male 73.2
2 78.5 67.0 75.43 4 female 69.2
3 78.5 67.0 75.43 4 female 69.0
4 78.5 67.0 75.43 4 female 69.0
2 1 75.5 66.5 73.66 4 male 73.5
... ... ... ... ... ... ... ...
203 1 62.0 66.0 66.64 3 male 64.0
2 62.0 66.0 66.64 3 female 62.0
3 62.0 66.0 66.64 3 female 61.0
204 1 62.5 63.0 65.27 2 male 66.5
2 62.5 63.0 65.27 2 female 57.0

934 rows × 6 columns

In [31]:
# Write csv file
df.to_csv('new path')

Loading and writing Pickle (Binary) files¶

In [32]:
# Save(Serialize) df as binary file
df.to_pickle("df.pickle")
In [33]:
# Read dataframe from pickle file
pd.read_pickle('df.pickle')
Out[33]:
family father mother midparentHeight children childNum gender childHeight
0 1 78.5 67.0 75.43 4 1 male 73.2
1 1 78.5 67.0 75.43 4 2 female 69.2
2 1 78.5 67.0 75.43 4 3 female 69.0
3 1 78.5 67.0 75.43 4 4 female 69.0
4 2 75.5 66.5 73.66 4 1 male 73.5
... ... ... ... ... ... ... ... ...
929 203 62.0 66.0 66.64 3 1 male 64.0
930 203 62.0 66.0 66.64 3 2 female 62.0
931 203 62.0 66.0 66.64 3 3 female 61.0
932 204 62.5 63.0 65.27 2 1 male 66.5
933 204 62.5 63.0 65.27 2 2 female 57.0

934 rows × 8 columns

Reading from MS Excel¶

In [34]:
pd.read_excel("grocery.xlsx", sheet_name = 'prices')
Out[34]:
Item Mawares Abo Ahmad Bravo Karaz Max mar Shini Baghdad Gardens
0 Lemon 2.3 2.0 3.0 3.0 3.5 2.7 3.1 4.0
1 Tomato 2.0 2.0 2.2 2.5 3.0 2.0 3.0 2.4
2 Potato 3.3 3.3 3.3 4.0 5.0 3.0 3.5 3.5
3 Watermelon 1.0 1.5 1.5 1.5 3.0 2.0 3.0 3.0
4 Zucchini 12.0 8.0 8.0 8.0 12.0 7.0 8.0 10.0
5 Eggplant 3.0 2.5 3.0 3.0 5.0 3.0 5.0 5.0
6 Apple 7.0 7.0 8.0 7.0 13.0 8.0 10.0 12.0
7 Panana 3.3 3.3 3.0 4.0 4.0 3.5 5.0 5.0

Loading from web¶

In [35]:
import requests
url = "https://dummyjson.com/products/2"
resp = requests.get(url)
data = resp.json()
data
Out[35]:
{'id': 2,
 'title': 'iPhone X',
 'description': 'SIM-Free, Model A19211 6.5-inch Super Retina HD display with OLED technology A12 Bionic chip with ...',
 'price': 899,
 'discountPercentage': 17.94,
 'rating': 4.44,
 'stock': 34,
 'brand': 'Apple',
 'category': 'smartphones',
 'thumbnail': 'https://cdn.dummyjson.com/product-images/2/thumbnail.jpg',
 'images': ['https://cdn.dummyjson.com/product-images/2/1.jpg',
  'https://cdn.dummyjson.com/product-images/2/2.jpg',
  'https://cdn.dummyjson.com/product-images/2/3.jpg',
  'https://cdn.dummyjson.com/product-images/2/thumbnail.jpg']}
In [36]:
pd.DataFrame(data)
Out[36]:
id title description price discountPercentage rating stock brand category thumbnail images
0 2 iPhone X SIM-Free, Model A19211 6.5-inch Super Retina H... 899 17.94 4.44 34 Apple smartphones https://cdn.dummyjson.com/product-images/2/thu... https://cdn.dummyjson.com/product-images/2/1.jpg
1 2 iPhone X SIM-Free, Model A19211 6.5-inch Super Retina H... 899 17.94 4.44 34 Apple smartphones https://cdn.dummyjson.com/product-images/2/thu... https://cdn.dummyjson.com/product-images/2/2.jpg
2 2 iPhone X SIM-Free, Model A19211 6.5-inch Super Retina H... 899 17.94 4.44 34 Apple smartphones https://cdn.dummyjson.com/product-images/2/thu... https://cdn.dummyjson.com/product-images/2/3.jpg
3 2 iPhone X SIM-Free, Model A19211 6.5-inch Super Retina H... 899 17.94 4.44 34 Apple smartphones https://cdn.dummyjson.com/product-images/2/thu... https://cdn.dummyjson.com/product-images/2/thu...

Assignment #4¶

Navigate Kaggle and select a data set that contains at least 5 features and have both categorical and numerical features.

  1. With well-styled markdown cell, describe the data set and show the dictionary of the columns (for each column provide the name, data type, and description)
  2. Load the data set to your notebook and answer the following questions:
    • What is the shape of the data?
    • Does it contain missing values?
    • Which features are numerical, and which are nominal, and which are ordinal ?
    • Select two numerical features and:
      • Show their descriptive statistics, add your interpretation on each feature
      • Does they have skweness? interpret
    • Select two categorical features and verify if the data is balanced between the categories of each feature
    • Based on your intuition, select two features that you think they could be correlated. Verify your assumption using both visualization and the correlation coefficient approaches
    • Are their features that can be used as index?
  • Add your name and id in the top of the notebook
  • Use generative AI to answer the relationship and representative questions
  • replace=False/True
  • Write the code in one cell and generate the interpretation out of the code
  • Used statistical tests such as Shapiro to test normality
  • Using cuz, and such terms in inerpretation