Histogram: The Skyscraper of Visualization

Koushik C S
Published in The Startup
4 min read · Dec 2, 2020


A histogram is an approximate representation of the distribution of numerical (continuous) data. The term was first introduced by Karl Pearson. Histograms are used to see how the data is distributed and to estimate the probability density function of the underlying variable.

  • It helps to identify the measures of central tendency (mean, median, and mode).
  • It helps to identify the measures of spread.
  • It helps to find the skewness and determine the shape of the distribution: left-skewed, right-skewed, or symmetric.
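As a rough illustration of the measures listed above, here is a minimal sketch using only Python's standard library. The sample values are hypothetical (they are not the article's dataset), and the `skewness` helper is an assumption of this sketch: it computes the population skewness as the mean of the cubed z-scores.

```python
import statistics

# Hypothetical sample, not the article's data: a small right-skewed set of values.
values = [32, 33, 33, 34, 34, 34, 35, 36, 38, 41, 47, 59]

# Measures of central tendency.
mean = statistics.mean(values)
median = statistics.median(values)
mode = statistics.mode(values)

# Measures of spread.
value_range = max(values) - min(values)
std_dev = statistics.pstdev(values)


def skewness(data):
    """Population skewness: the average cubed z-score of the data."""
    m = statistics.mean(data)
    s = statistics.pstdev(data)
    return sum(((x - m) / s) ** 3 for x in data) / len(data)


print("mean:", mean, "median:", median, "mode:", mode)
print("range:", value_range, "std dev:", round(std_dev, 2))
print("skewness:", round(skewness(values), 2))  # positive means right-skewed
```

A positive skewness here matches the long tail of large values on the right side of the sample.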

Histogram vs Bar Plot

The major difference is that a histogram is used only to plot the frequency of values in a continuous dataset that has been divided into classes, called bins. Bar charts, on the other hand, can be used for many other types of variables, including ordinal and nominal data.

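In a histogram, the area of each bar (height of bin × width of bin) indicates the frequency of occurrences for that bin, whereas in a bar plot the height of the bar carries that information. When all bins have the same width, the heights are proportional to the frequencies as well.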
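The area-versus-height distinction is easiest to see with bins of unequal width. The sketch below is an illustration under stated assumptions: the sample values, the bin edges, and the helper `counts_and_heights` are all hypothetical and do not come from the article. Each bar's height is taken as count divided by bin width, so that height × width recovers the count.

```python
def counts_and_heights(values, edges):
    """Count values per bin and convert counts to bar heights (count / width)."""
    n_bins = len(edges) - 1
    counts = [0] * n_bins
    for v in values:
        for i in range(n_bins):
            is_last = i == n_bins - 1
            # Bins are half-open [lo, hi); the last bin also includes its upper edge.
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    widths = [edges[i + 1] - edges[i] for i in range(n_bins)]
    heights = [c / w for c, w in zip(counts, widths)]
    return counts, widths, heights


# Hypothetical data and deliberately unequal bin edges.
values = [32, 33, 33, 34, 34, 34, 35, 36, 38, 41, 47, 59]
edges = [32, 34, 36, 40, 60]

counts, widths, heights = counts_and_heights(values, edges)
areas = [h * w for h, w in zip(heights, widths)]

for c, w, h, a in zip(counts, widths, heights, areas):
    print(f"width={w:>2}  height={h:.2f}  area={a:.1f}  count={c}")
```

The last bin is the widest, so its bar is the shortest even though it holds as many values as the first bin. Reading the heights alone would be misleading; the areas match the counts.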

Binning in Histogram

The summary for the above histogram is shown below.

We can see that there are 333 data points ranging from 32 to 59, but the 333 individual data points are not shown separately in the histogram; they are grouped into buckets, commonly known as bins or class intervals.

This raises the next question: why are they grouped into bins?

The following histograms all represent the same variable, but with different bin sizes.

  • The blue histogram with Bin Size=1 shows every data point as an individual bar. It looks fine, but it makes it harder for a reader to draw insights from it.
  • The orange histogram with Bin Size=25 groups the data points into 25 buckets. With 25 buckets it looks good.
  • The green histogram with Bin Size=80 groups the data points into 80 buckets. Since 80 buckets is a lot for such a small dataset, it is very hard to draw insights from it.
  • The pink histogram with Bin Size=11 groups the data points into 11 buckets. This gives a concise view of the data.
  • The red histogram with Bin Size=333 puts all the data points into 1 bucket. It looks very bad, as we cannot see any pattern and it does not reveal any insight about the data.
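The effect of the number of buckets can be sketched without any plotting. The example below assumes a hypothetical stand-in dataset (333 random integers between 32 and 59, not the article's actual data) and a simple equal-width binning helper, `bin_counts`, written for this illustration. It reports how many buckets end up empty for each choice.

```python
import random


def bin_counts(values, n_bins):
    """Split the data range into n_bins equal-width bins and count values per bin.

    Assumes the data contains at least two distinct values.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value is placed in the last bin.
        i = min(int((v - lo) / width), n_bins - 1)
        counts[i] += 1
    return counts


random.seed(0)
# Hypothetical stand-in for the article's data: 333 integers between 32 and 59.
values = [random.randint(32, 59) for _ in range(333)]

for n_bins in (1, 11, 25, 80):
    counts = bin_counts(values, n_bins)
    empty = sum(1 for c in counts if c == 0)
    print(f"{n_bins:>3} bins: tallest bar = {max(counts)}, empty bins = {empty}")
```

With a single bucket every value lands in one bar, so no shape is visible. With 80 buckets most of them stay empty, because integers between 32 and 59 can take at most 28 distinct values.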

In the histogram below, Bin Size=1 shows all the data.

In this case, increasing the bin size further would make the plot worse.

Hence, the bin size of a histogram is determined by the range of the dataset and the context of the problem.
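When the context gives no obvious choice, a rule of thumb can provide a starting point. The sketch below is not from the article: it uses Sturges' rule, a common heuristic that picks the number of bins from the sample size, and a second helper that derives the number of bins from the data range and a chosen bin width. Both function names are hypothetical.

```python
import math


def sturges_bins(n_points):
    """Sturges' rule of thumb: ceil(log2(n)) + 1 bins for n data points."""
    return math.ceil(math.log2(n_points)) + 1


def bins_from_width(lo, hi, bin_width):
    """Number of equal-width bins needed to cover the range [lo, hi]."""
    return math.ceil((hi - lo) / bin_width)


# Figures taken from the article's example: 333 points ranging from 32 to 59.
print(sturges_bins(333))              # bins suggested by the sample size
print(bins_from_width(32, 59, 2.5))   # bins implied by a width of 2.5
```

Either number is only a first guess; the final choice still depends on what the reader needs to see in the data.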

Types of Histograms

There are different types of histograms based on the distribution of the data.

The relative positions of the mean, median, and mode follow a consistent pattern for each type of distribution, as shown in the diagram.

Many real-world datasets follow a positively skewed (right-skewed) distribution.
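The usual pattern can be checked on small examples. The three datasets below are hypothetical and chosen to be clearly right-skewed, left-skewed, and symmetric. The ordering they show (mean pulled toward the long tail) is a common rule of thumb rather than a guarantee for every dataset.

```python
import statistics

# Hypothetical datasets, one per distribution shape.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 5, 9, 18]   # long tail of large values
left_skewed = [20 - x for x in right_skewed]     # mirror image: long tail of small values
symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]


def central_tendency(data):
    """Return (mean, median, mode) for a list of numbers."""
    return statistics.mean(data), statistics.median(data), statistics.mode(data)


for name, data in [("right-skewed", right_skewed),
                   ("left-skewed", left_skewed),
                   ("symmetric", symmetric)]:
    mean, median, mode = central_tendency(data)
    print(f"{name:>12}: mean={mean}, median={median}, mode={mode}")
```

In the right-skewed sample the mean exceeds the median, in the left-skewed sample it falls below the median, and in the symmetric sample all three measures coincide.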
