Scatter Plot - A Tool for Descriptive Statistics

Koushik C S
Published in The Startup
Nov 27, 2020

A scatter plot displays two variables against each other and is used to check whether there is any relationship between them. The relationship can be linear or non-linear. A scatter plot also helps to identify outliers.

Relationship between culmen_length_mm variable and body_mass_g variable from penguin dataset

We can see a cloud of dots, but what relationship do they show?

From the graph we can say that there is some linear relationship between the two variables. When the X variable (culmen_length_mm) increases, the Y variable (body_mass_g) also tends to increase (positive correlation).

How strong is the relationship?

From the above plots, we can clearly say that both plots show a linear relationship with positive correlation, but which plot has the stronger correlation?

For that we need a number to compare. Hence we use Pearson’s correlation coefficient.

The Pearson coefficient is a type of correlation coefficient that measures the strength and direction of the linear association between two continuous variables, each measured on an interval or ratio scale. Pearson correlations are only suitable for quantitative variables.

Pearson’s Correlation Coefficient formula is
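For reference, the standard sample form of the coefficient can be written as below. The symbols are introduced here for illustration and are assumptions about notation: x_i and y_i are the paired observations, n is the number of pairs, and the barred terms are the sample means.

```latex
\[
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
\]
```

The numerator captures how the two variables vary together, and the denominator rescales it by the spread of each variable, which is why the result always lies between -1 and +1.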

  • Pearson coefficients range from -1 to +1, with +1 representing a perfect positive linear correlation, -1 a perfect negative linear correlation, and 0 no linear relationship. The coefficient is independent of the units in which the variables are measured.
  • The Pearson coefficient shows correlation, not causation.
  • The correlation coefficient between the variables is symmetric, which means that the value of the correlation coefficient between Y and X or X and Y will remain the same.
  • Correlations are very sensitive to outliers. A single unusual observation may have a huge impact on a correlation. Such outliers are easily detected by a quick inspection of a scatterplot.
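The properties listed above can be checked with a few lines of Python. This is a minimal sketch using only the standard library. The helper name `pearson_r` and the small data set are made up for illustration and are not taken from the penguin dataset.

```python
from math import sqrt, isclose


def pearson_r(xs, ys):
    """Sample Pearson correlation of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    covariation = sum(a * b for a, b in zip(dx, dy))
    spread_x = sum(a * a for a in dx)
    spread_y = sum(b * b for b in dy)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return covariation / sqrt(spread_x * spread_y)


# Made-up illustrative data with a nearly linear trend.
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]

r_clean = pearson_r(xs, ys)

# Symmetry: swapping X and Y gives the same coefficient.
r_swapped = pearson_r(ys, xs)

# Unit independence: rescaling either variable leaves r unchanged.
r_rescaled = pearson_r([x / 10 for x in xs], [y / 1000 for y in ys])

# Outlier sensitivity: one unusual point pulls r down sharply.
r_outlier = pearson_r(xs + [7], ys + [0.0])

print("clean:", r_clean)
print("swapped:", r_swapped)
print("rescaled:", r_rescaled)
print("with outlier:", r_outlier)
```

Adding the single point (7, 0.0) is enough to turn a nearly perfect correlation into a weak one, which is why a quick look at the scatter plot is worthwhile before trusting the number.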

The Pearson’s coefficient for both of the above plots is 0.59. Both plots have the same correlation coefficient because the left plot is nothing but a zoomed version of the right plot.

TYPES OF CORRELATION WITH THEIR RESPECTIVE PEARSON’S CORRELATION COEFFICIENT VALUE

Positive Correlation → As the X variable increases, the Y variable also tends to increase.

Negative Correlation → As the X variable increases, the Y variable tends to decrease.

No Correlation → There is no linear relationship between the X variable and the Y variable.

From the plot we can see that the points follow a curve, but the Pearson’s correlation coefficient is zero.

Pearson’s correlation coefficient only measures linear relationships. For a non-linear relationship such as a symmetric curve or a circle, the coefficient can be close to 0 even though the variables are clearly related. Hence it is always better to visualize a dataset as a scatterplot to find any hidden non-linear patterns.
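A small sketch makes the point concrete. It assumes a hypothetical helper `pearson_r` (defined below with the standard library only) and uses made-up points on the curve y = x², first over a range symmetric around zero and then over positive values only.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Sample Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    covariation = sum(a * b for a, b in zip(dx, dy))
    return covariation / sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))


# A perfect but non-linear relationship: y is fully determined by x.
symmetric_x = [-3, -2, -1, 0, 1, 2, 3]
symmetric_y = [x ** 2 for x in symmetric_x]
r_symmetric = pearson_r(symmetric_x, symmetric_y)

# The same curve seen only on one side is close to a straight line,
# so the coefficient is high even though the relationship is not linear.
one_sided_x = [0, 1, 2, 3, 4, 5, 6]
one_sided_y = [x ** 2 for x in one_sided_x]
r_one_sided = pearson_r(one_sided_x, one_sided_y)

print("symmetric parabola:", r_symmetric)
print("one-sided parabola:", r_one_sided)
```

On the symmetric range the rising and falling halves cancel out, so the coefficient is zero. On the one-sided range the same curve gives a high coefficient, so a single number cannot tell you the shape of the relationship; the scatter plot can.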

Correlation is only an association relationship and not a causal relationship.

Correlation is not Causation

From the above picture, we can’t say that a person who owns a cat is more likely to be struck by lightning, even though there is a positive correlation. Two variables may have a high correlation coefficient without any direct dependence between them. Correlation doesn’t mean that X caused Y to happen or vice versa.

INTERPOLATION & EXTRAPOLATION

A scatter plot with point size based on a third variable actually goes by a distinct name, the bubble chart.

Bubble Chart

A scatter plot with point size based on a third variable (‘sex’) and color based on a fourth variable (‘island’) is shown in the above bubble chart.
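A chart of this kind could be drawn roughly as follows. This is only a sketch: it assumes matplotlib is installed, and the rows, the `SIZE_BY_SEX` mapping and the `bubble_chart` helper are all made up for illustration. The values are not real penguin measurements; only the column names follow the article.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Made-up illustrative rows, NOT real penguin measurements:
# (culmen_length_mm, body_mass_g, sex, island)
ROWS = [
    (38.0, 3600, "FEMALE", "Torgersen"),
    (41.5, 4100, "MALE", "Torgersen"),
    (45.0, 4700, "FEMALE", "Biscoe"),
    (49.5, 5500, "MALE", "Biscoe"),
    (37.0, 3400, "FEMALE", "Dream"),
    (50.0, 3900, "MALE", "Dream"),
]

# Hypothetical marker areas (in points squared) for the third variable.
SIZE_BY_SEX = {"FEMALE": 40, "MALE": 120}


def bubble_chart(rows):
    """Scatter plot with marker size from 'sex' and color from 'island'."""
    fig, ax = plt.subplots()
    for island in sorted({row[3] for row in rows}):
        subset = [row for row in rows if row[3] == island]
        # One scatter call per island, so each island gets the next color
        # from matplotlib's default color cycle and its own legend entry.
        ax.scatter(
            [row[0] for row in subset],
            [row[1] for row in subset],
            s=[SIZE_BY_SEX[row[2]] for row in subset],
            alpha=0.6,
            label=island,
        )
    ax.set_xlabel("culmen_length_mm")
    ax.set_ylabel("body_mass_g")
    ax.legend(title="island")
    return fig, ax


fig, ax = bubble_chart(ROWS)
# fig.savefig("bubble_chart.png")  # uncomment to write the image to disk
```

Plotting each island in its own `scatter` call is one simple way to get a distinct color and legend label per category. Marker size works best for numeric variables, so with a two-level variable such as sex the size only distinguishes the two groups.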
