Data Visualization¶

Why is it important?¶

Making informative visualizations (sometimes called plots) is one of the most important tasks in data science.

  • Understanding the Data: Visualization provides a way to explore and understand the underlying patterns, trends, and relationships within the data. Humans are highly visual creatures, and presenting data visually allows us to quickly grasp complex information that might be difficult to discern from raw data alone.
  • Communication: Visualization facilitates communication of findings and insights to stakeholders who may not have expertise in data analysis or statistics. Well-designed visualizations can effectively convey complex information in a clear and intuitive manner, enabling better decision-making.
  • Identifying Patterns and Anomalies: Visualization helps in identifying patterns, trends, outliers, and anomalies in the data that may not be apparent from summary statistics or tabular representations. This enables data scientists to gain deeper insights into the data and make informed decisions.
  • Model Evaluation: Visualizations are crucial for evaluating the performance of machine learning models. Plots such as ROC curves, confusion matrices, and calibration plots provide valuable insights into model performance, helping data scientists fine-tune their models for better accuracy and generalization.

Types of visualization (plots)¶

  1. Univariate plots:
    • Definition: Univariate plots display the characteristics of a single variable.
    • Purpose: Useful for initial exploration of data, identifying outliers, understanding variable distributions, and assessing data quality.
    • Examples: Histograms, box plots, bar plots, line plots, pie charts, and density plots.
  2. Multivariate plots:
    • Definition: Multivariate plots display the relationships between multiple variables simultaneously.
    • Purpose: They are used to explore interactions, patterns, correlations, and dependencies between two or more variables.
    • Examples: Scatter plots, heatmaps, and pair plots.

Univariate Plots¶

Histogram¶

  • A histogram shows the shape of values, or distribution, of a continuous variable.
  • Histograms help you see the center, spread and shape of a set of data. You can also use them as a visual tool to check for normality.
  • They can be used to check data for extreme values, or outliers.

Properties of histogram:¶

  • The horizontal axis shows your data values, where each bar includes a range of values. The vertical axis shows how many points in your data have values in the specified range for the bar.

    • For example, the first bar shows the count of values that fall between 30 and 35.
    • The histogram shows that the center of the data is somewhere around 45 and that the spread of the data is from about 30 to 65.
    • It also shows that the shape of the data is roughly bell-shaped (normal).

How to create a histogram:¶

  • Determine the range of the data.
  • Divide the range into groups of equal width, called bins.
  • Set the height of each bar to the frequency of data values in its bin.
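The three steps above can be sketched in plain Python. This is a minimal illustration, not a prescribed method: the function name `histogram_counts`, the `ages` list, and the choice of 7 bins are all assumptions made for the example, and the bars are printed as text instead of being drawn by a plotting library.

```python
def histogram_counts(values, num_bins):
    """Return (bin_edges, counts) for equal-width bins over the data range."""
    lo, hi = min(values), max(values)            # step 1: range of the data
    if hi == lo:
        raise ValueError("all values are equal, so the range has zero width")
    width = (hi - lo) / num_bins                 # step 2: equal-width bins
    edges = [lo + i * width for i in range(num_bins + 1)]
    counts = [0] * num_bins
    for v in values:                             # step 3: frequency per bin
        # The maximum value belongs to the last bin, so clamp the index.
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    return edges, counts


# Made-up data, chosen only to resemble the 30-to-65 example in the text.
ages = [30, 32, 36, 41, 43, 44, 45, 46, 47, 48, 52, 55, 58, 65]
edges, counts = histogram_counts(ages, num_bins=7)

# One text bar per bin; the bar length is the bin's frequency.
for left, right, n in zip(edges, edges[1:], counts):
    print(f"{left:5.1f} to {right:5.1f} | {'#' * n}")
```

Changing `num_bins` changes the apparent shape, so it is worth trying a few values before reading too much into any one picture.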

How extreme values are observed in histograms¶

How skewness is observed in histograms¶

Bar chart¶

  • Bar charts show the frequency counts or the percentages of values for the different levels of a categorical variable.
  • Like a histogram, a bar chart shows the shape of values, or distribution, but for a categorical variable.
  • The bars show the levels of the variable; the height of each bar shows the count or ratio of responses for that level.
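As a rough sketch of what a bar chart encodes, the snippet below computes the frequency count and percentage for each level of a categorical variable. The `responses` list is made-up data, and the text bars stand in for what a plotting library would draw.

```python
from collections import Counter

# Made-up categorical data: one survey answer per respondent.
responses = ["yes", "no", "yes", "maybe", "yes", "no", "yes", "maybe"]

counts = Counter(responses)                      # frequency count per level
total = sum(counts.values())
percentages = {level: 100 * n / total for level, n in counts.items()}

# One text bar per level, most frequent first.
for level, n in counts.most_common():
    print(f"{level:<6} {'#' * n} {n} ({percentages[level]:.1f}%)")
```

Printing one bar per row like this is effectively the horizontal layout of the same chart.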

The bar chart can also be horizontal:

Box Plot¶

  • A box plot shows the distribution of data for a continuous variable.
  • Box plots help you see the center and spread of data. You can also use them as a visual tool to check for normality or to identify points that may be outliers.

Properties of the Box Plot:¶

  • The center line in the box shows the median for the data. Half of the data is above this value, and half is below.
  • If the data are symmetrical, the median will be in the center of the box. If the data are skewed, the median will be closer to the top or to the bottom of the box.

  • The bottom and top of the box show the 25th and 75th quantiles, or percentiles. These two quantiles are also called quartiles because each cuts off a quarter (25%) of the data. The length of the box is the difference between these two percentiles and is called the interquartile range (IQR).

  • The lines that extend from the box are called whiskers. The whiskers represent the expected variation of the data. Each whisker extends to the most extreme data value that lies within 1.5 times the IQR of the top or bottom of the box.

  • If there are values that fall above or below the end of the whiskers, they are plotted as dots. These points are often called outliers.
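The quantities described above can be computed directly. The sketch below assumes the common 1.5 × IQR convention and uses the "inclusive" quartile method of Python's `statistics.quantiles`; other tools may compute quartiles slightly differently. The function name `box_plot_stats` and the `data` list are hypothetical, chosen only for illustration.

```python
import statistics


def box_plot_stats(values):
    """Return the quantities a box plot draws, using the 1.5 * IQR rule."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1                                # length of the box
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        # Whiskers end at the most extreme points still inside the fences.
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        # Anything beyond the fences would be drawn as an individual dot.
        "outliers": sorted(v for v in values if v < lower_fence or v > upper_fence),
    }


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]           # made-up data with one extreme value
stats = box_plot_stats(data)
for name, value in stats.items():
    print(f"{name:>12}: {value}")
```

With this data the value 30 falls above the upper fence, so it is reported as an outlier and the upper whisker stops at 9.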

Box vs. Histogram¶

  • The box plot helps you see skewness, because the line for the median will not be near the center of the box if the data is skewed.
  • The box plot helps identify the 25th and 75th percentiles better than the histogram.
  • The box plot helps identify outliers better than the histogram.
  • The histogram helps you see the overall shape of your data better than the box plot.

Boxplot example¶

Multivariate Plots¶

Grouped Bar Charts¶

  • Number of variables: 2 or more
  • Displays bar charts for groups defined by another variable. Grouped bar charts have a separate chart within each level of the grouping variable.

Stacked Bar Charts¶

  • Number of variables: 2 or more
  • Displays bar charts for groups defined by another variable. Stacked bar charts have a single bar for each level of the grouping variable. Colors or patterns for counts of another variable are stacked in each bar.
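Grouped and stacked bar charts draw the same cross-tabulated counts in two different layouts. The sketch below, using made-up `rows` of (group, response) pairs, shows one way the bar positions could be derived; all variable names are assumptions for illustration.

```python
from collections import Counter

# Made-up survey rows: (grouping variable, response variable).
rows = [
    ("north", "yes"), ("north", "no"), ("north", "yes"),
    ("south", "no"), ("south", "no"), ("south", "yes"), ("south", "yes"),
]

pair_counts = Counter(rows)
groups = sorted({g for g, _ in rows})
levels = sorted({r for _, r in rows})

# Grouped layout: one bar per (group, level) pair, side by side within a group.
grouped = {g: {lv: pair_counts[(g, lv)] for lv in levels} for g in groups}

# Stacked layout: one bar per group; each segment starts where the previous one ended.
stacked = {}
for g in groups:
    bottom = 0
    segments = []
    for lv in levels:
        height = pair_counts[(g, lv)]
        segments.append((lv, bottom, bottom + height))
        bottom += height
    stacked[g] = segments

print(grouped)
print(stacked)
```

The stacked layout makes group totals easy to compare, while the grouped layout makes it easier to compare individual levels across groups.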

Line Plot:¶

  • A line graph is a simple way to visually communicate how a continuous variable changes over time. A line graph may also be called a line chart, a trend plot, a run chart, or a time series plot.

  • The variable that measures time is plotted on the x-axis. The continuous variable is plotted on the y-axis.

How to deal with missing data?¶

Line plot with multiple categories¶

Scatter Plot¶

  • A scatter plot shows the relationship between two continuous variables, x and y.
  • A dot or some other symbol is placed at the (x, y) coordinates for each pair of variables.
  • The pattern of the dots can provide clues regarding how the two variables are related.
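One way to put a number on the pattern of the dots is the Pearson correlation coefficient. The sketch below is illustrative only: the helper `pearson_r` and the three small datasets are made up, and the coefficient measures only linear association.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


x = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]      # y increases with x
falling = [10, 8, 6, 4, 2]     # y decreases as x increases
curved = [4, 1, 0, 1, 4]       # y = (x - 3) ** 2, a symmetric curve

r_rising = pearson_r(x, rising)
r_falling = pearson_r(x, falling)
r_curved = pearson_r(x, curved)

print(f"rising:  {r_rising:+.2f}")
print(f"falling: {r_falling:+.2f}")
print(f"curved:  {r_curved:+.2f}")
```

The curved dataset has a coefficient of zero even though y depends entirely on x, which is one reason to look at the scatter plot itself and not only at a summary number.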

Negative relationship¶

No relationship:¶

Curved relationship:¶

Detecting outliers on scatter plot¶

Styling scatter plot:¶

Heatmaps¶

  • Heatmaps organize data in a grid, with different colors or shades indicating different levels of the data's magnitude.
  • The visual nature of heatmaps allows for immediate recognition of patterns, such as clusters, trends, and anomalies. This makes heatmaps an effective tool for exploratory data analysis.
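The core idea of a heatmap, mapping each cell's magnitude to a shade, can be sketched without a plotting library. Here the shades are text characters and the scaling is linear between the grid's minimum and maximum; the `SHADES` palette, the `to_shades` helper, and the `visits` grid are all made up for illustration.

```python
SHADES = " .:-=+*#%@"   # light to dark; an arbitrary 10-step palette


def to_shades(grid):
    """Map each number in a 2D grid to a character by linear scaling."""
    flat = [v for row in grid for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1                        # avoid dividing by zero on a flat grid
    rows = []
    for row in grid:
        idx = [round((v - lo) / span * (len(SHADES) - 1)) for v in row]
        rows.append("".join(SHADES[i] for i in idx))
    return rows


# Made-up grid: rows could be days, columns could be times of day.
visits = [
    [0, 2, 10],
    [4, 12, 18],
    [0, 6, 14],
]

shades = to_shades(visits)
for line in shades:
    print(f"[{line}]")
```

A real heatmap replaces the characters with a color scale, but the scaling step is the same, and the choice of scale strongly affects which patterns stand out.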

Correlation Matrix as Heatmap¶

Questions?¶