The Box & Whisker plot is one of the few data visualization techniques that require computations on the dataset before it can be drawn. Other techniques that require calculation include the histogram, which needs class intervals, and the pie chart, which requires calculating the angle of each slice.
Computations are common in statistical analysis but usually minimal when visualizing data. Box & Whisker plots describe data using the five-number summary (minimum, lower quartile, median, upper quartile, and maximum).
Several steps are involved in this process, and they are explained in the rest of this article.
A box plot is a statistical data visualization technique that uses rectangular boxes to summarize groups of data through their quartiles. It may also have lines extending from the boxes, which usually indicate variability beyond the upper and lower quartiles.
The name box and whisker plot is derived from the shape of the graph: the top of each rectangular box marks the upper quartile, the bottom marks the lower quartile, the line inside the box marks the median, and the lines drawn from each end of the box are known as whiskers.
The boxes can be drawn either vertically or horizontally, depending on the goal of the visualization. Although rare, some box plots do not have whiskers.
The median is the value that falls in the middle when a set of values is arranged in ascending or descending order. When the dataset contains an odd number of values, the median is simply the middle value.
However, when the number of values is even, the median is calculated by finding the average of the two numbers in the middle. The median is also known as the second quartile.
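The odd and even cases described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation; the function name `median` and the sample values are made up for the example.

```python
def median(values):
    """Return the middle value, averaging the two middle values for an even count."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty dataset is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd count: a single value sits exactly in the middle.
        return ordered[mid]
    # Even count: average the two values around the middle.
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([7, 3, 9]))     # 7
print(median([7, 3, 9, 5]))  # 6.0
```

Python's standard library also provides `statistics.median`, which follows the same rule.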
The first quartile is also known as the lower quartile. It sits at the 25th percentile, so about a quarter of the values fall at or below it.
Its position is found by multiplying one-fourth of the number of values by 1. For example, in an ordered dataset of 100 values, the first quartile sits at about position (¼)*100*1 = 25.
The third quartile is also known as the upper quartile. It sits at the 75th percentile, so about three quarters of the values fall at or below it.
Its position is found by multiplying one-fourth of the number of values by 3. For example, in an ordered dataset of 100 values, the third quartile sits at about position (¼)*100*3 = 75.
The interquartile range is the difference between the third quartile and the first quartile (Q3 - Q1). It is often considered a better measure of spread than the range because it is not affected by extreme values.
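As a rough sketch, the quartiles and the interquartile range can be computed by splitting the ordered data into a lower and an upper half and taking the median of each. This "median of halves" convention is one of several in use, and it is an assumption here; the function name `quartiles` and the sample data are made up for illustration.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves convention."""
    ordered = sorted(values)
    if len(ordered) < 2:
        raise ValueError("need at least two values to compute quartiles")
    half = len(ordered) // 2
    lower = ordered[:half]                  # values below the middle
    upper = ordered[len(ordered) - half:]   # values above the middle
    return median(lower), median(ordered), median(upper)


data = [2, 4, 6, 8, 10, 12, 14, 16]  # made-up sample
q1, q2, q3 = quartiles(data)
print(q1, q2, q3)  # 5.0 9.0 13.0
print(q3 - q1)     # interquartile range: 8.0
```

When the number of values is odd, this version leaves the middle value out of both halves; other tools may include it or interpolate, so results can differ slightly.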
The maximum, as drawn in a box plot, is simply the highest non-outlier value in the dataset being visualized. It is therefore not necessarily the highest value in the dataset.
Given the dataset 1, 2, 3, 4, 5, 1000, for instance, the highest value is 1000. However, this is most likely not the maximum shown in the box plot, because 1000 will very probably be treated as an outlier.
The most feasible maximum is 5.
The minimum, as drawn in a box plot, is simply the lowest non-outlier value in the dataset being visualized. It is therefore not necessarily the lowest value in the dataset.
Given the dataset -100, 50, 60, 70, 80, 90, for instance, the lowest value is -100. However, this is most likely not the minimum shown in the box plot, because -100 will very probably be treated as an outlier.
The most feasible minimum is 50.
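The article does not state which rule decides whether a value counts as an outlier. A common convention, assumed in the sketch below, treats any value more than 1.5 times the interquartile range beyond the quartiles as an outlier. The function name `whisker_ends`, the parameter `k`, and the median-of-halves quartile method are all choices made for this illustration.

```python
from statistics import median


def whisker_ends(values, k=1.5):
    """Return (lowest non-outlier, highest non-outlier, outliers).

    Assumes the common 1.5 * IQR fence rule; k is the fence multiplier.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[len(ordered) - half:])
    iqr = q3 - q1
    low_fence = q1 - k * iqr
    high_fence = q3 + k * iqr
    inside = [v for v in ordered if low_fence <= v <= high_fence]
    outliers = [v for v in ordered if v < low_fence or v > high_fence]
    return inside[0], inside[-1], outliers


# The two datasets used in the text above.
print(whisker_ends([1, 2, 3, 4, 5, 1000]))       # (1, 5, [1000])
print(whisker_ends([-100, 50, 60, 70, 80, 90]))  # (50, 90, [-100])
```

Under this assumption, 1000 and -100 fall outside the fences, so the plotted maximum and minimum become 5 and 50 respectively.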
David and Bryan are both sales attendants at a phone shop. At the end of each month, they record the number of phones sold. At the end of the year, they both submitted their sales records, which showed the following monthly sales.
David: 51, 17, 25, 39, 7, 49, 62, 41, 20, 6, 43, 13.
Bryan: 30, 56, 23, 65, 42, 61, 54, 17, 21, 34, 3, 16.
The five-number summary for Bryan’s sales is 3, 19, 32, 55, 65.
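The summary can be checked with a short Python sketch. It assumes the median-of-halves method for the quartiles and takes the minimum and maximum directly from the data, without any outlier handling; the function name `five_number_summary` is made up for this example.

```python
from statistics import median


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) using median of halves."""
    ordered = sorted(values)
    half = len(ordered) // 2
    return (
        ordered[0],
        median(ordered[:half]),
        median(ordered),
        median(ordered[len(ordered) - half:]),
        ordered[-1],
    )


david = [51, 17, 25, 39, 7, 49, 62, 41, 20, 6, 43, 13]
bryan = [30, 56, 23, 65, 42, 61, 54, 17, 21, 34, 3, 16]

print(five_number_summary(bryan))  # (3, 19.0, 32.0, 55.0, 65)
print(five_number_summary(david))  # (6, 15.0, 32.0, 46.0, 62)
```

The values are listed in the order minimum, first quartile, median, third quartile, maximum.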
Consider the graph below, which describes an altered version of the monthly sales data in Example 1 above. What is the outlier in the plot? Explain how this graph can help detect inconsistencies.
Solution: In the graph, there is a lone dot above the maximum value. This lone dot has a value of 110 and belongs to Bryan’s sales.
With the box plot, the outlier makes it easy to discover inconsistencies in the visualized data. This is very useful in sales recording, especially in cases where salespeople have to meet a target.
They may, therefore, decide to alter their sales data to meet the target. With the box plot, one can easily discover inconsistencies like this.
Consider the box plot below, which describes the following data: 45, 22, 26, 27, 18, 24, 38, 20. Use the plot to identify the outlier and the five-number summary.
Confirm the validity of your answer by solving it using the required formula.
Solution: Clearly, the outlier in the chart above is 50. We can also read the following five-number summary values from the box plot.
Minimum Value = 18
Maximum Value = 45
Median = 25
Q1 = 20.5
Q3 = 35.25
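Arranging the data in ascending order, we have: 18, 20, 22, 24, 26, 27, 38, 45.
Median = (24+26)/2 = 25
Q1: The first half of the values is 18, 20, 22, 24.
Q1 = (20+22)/2 = 21. The plot shows 20.5 instead because charting tools commonly interpolate quartiles by position: the value at position (n+1)/4 = 2.25 is 20 + 0.25*(22-20) = 20.5.
Q3: The second half of the values is 26, 27, 38, 45.
Q3 = (27+38)/2 = 32.5. By the same interpolation, the value at position 3(n+1)/4 = 6.75 is 27 + 0.75*(38-27) = 35.25, which matches the plot.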
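Quartile values depend on the convention used to compute them. The sketch below compares the median-of-halves method with the position-based interpolation offered by Python's `statistics.quantiles` (`method="exclusive"`). That the chart used this kind of interpolation is an inference from the numbers matching, not something the article states.

```python
from statistics import median, quantiles

data = [45, 22, 26, 27, 18, 24, 38, 20]
ordered = sorted(data)
half = len(ordered) // 2

# Median of halves: split the ordered data and take the median of each half.
by_halves = (
    median(ordered[:half]),
    median(ordered),
    median(ordered[len(ordered) - half:]),
)

# Position-based interpolation: quartile i sits at position i * (n + 1) / 4.
interpolated = quantiles(data, n=4, method="exclusive")

print(by_halves)     # (21.0, 25.0, 32.5)
print(interpolated)  # [20.5, 25.0, 35.25]
```

Both methods agree on the median, while Q1 and Q3 differ slightly. The interpolated values match the Q1 = 20.5 and Q3 = 35.25 read from the plot.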
Before going into interpreting a Box and Whisker plot, we need to first understand the different parts of a box plot. Hence, let’s consider this box plot drawn using some data generated from Excel’s Random number generator.
On the graph above, the horizontal line inside the blue box represents the median value of the data set. In this case, it is … inches. The x near the line, also located inside the blue box, is the mean value of the data.
Note, however, that the mean does not have to be one of the values in the data. It is only a summary statistic used to represent the data.
The blue box represents the data points that fall between the 1st and 3rd quartiles of the randomly generated data set.
The top of the box represents the third quartile, while the bottom of the box represents the first quartile. The median can also be referred to as the second quartile.
You will notice two vertical lines, one drawn from the top of the box to a point in the chart, while the other was drawn from the bottom of the box to a point in the chart. These two lines are referred to as the whiskers.
The horizontal line perpendicular to the top whisker indicates the maximum value, while the one perpendicular to the bottom whisker indicates the minimum value in the data set.
Just as the box gives us the interquartile range of the data, the whiskers help us determine the range of the data set. One can easily read this information at a glance.
Lastly, the dot at the very top of the graph, somewhere above the maximum value, is called an outlier. An outlier is an unusual data point in the data set.
This explains why the maximum and minimum values shown are not necessarily the actual maximum and minimum of the data set. They represent the maximum and minimum of the usual values present in the data set.
Now that we have been able to fully understand what the different parts of a box plot mean, we can go into interpreting the Box Plot. To properly explain this, let’s consider the box plot below, describing the average yearly income of men and women that fall in particular age groups.
Notice that in the graph above, there are two sets of box plots, with blue representing the men and orange representing the women. Box plots make it easy to compare groups within a large data set.
In the above plot, for example, you can easily compare the annual income of males and females across the different age groups. Overall, it is easy to see that males generally earn more than females across generations.
We also see that the money earned by men is much more even, with a smaller pay gap across individuals. In fact, the maximum yearly income is not visible and can be inferred to be close to or the same as the third quartile. On the other hand, women’s yearly income varies more widely.
Given the much longer “whiskers” for women, we can infer that the amount of money they earn yearly is more spread out, while men’s incomes tend to cluster closer to the average.
Next, we look at the skew of the data. Skew refers to the asymmetry of your data. If you look at the women, the box and whiskers are fairly even on either side of the median/mean. However, the case is quite different for the men. Hence, we say that the men’s data is skewed.
Finally, we look for outliers, which are data points that differ markedly from the rest. We notice that only the men have outliers.
Box plots are useful little graphics that contain a lot of information in very little space. They are best used at the beginning of data analysis to identify early patterns in the data, although, as we have seen here, they are also useful for reporting results in clear and concise ways.
To explain how a box plot can be created using Excel, we will create one using the data from Example 3 above. Follow these simple steps to create a Box and Whisker Plot in Excel.
In statistical analysis, there are different ways to describe the spread of a distribution. One of these is the five-number summary: median, first quartile, third quartile, maximum, and minimum.
This is the method a box plot uses to visualize data. An alternative way of measuring spread is the mean and standard deviation.
That approach, however, has more restrictions than the five-number summary, since both the mean and the standard deviation are strongly affected by outliers and skew. Even with these restrictions, it is used more often in statistical data analysis.
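To illustrate the difference between the two ways of describing spread, the sketch below reuses the dataset 1, 2, 3, 4, 5, 1000 from earlier in the article and compares how a single extreme value affects each summary. It is only an illustration of sensitivity to outliers, which is one possible reading of the restrictions mentioned above; the variable names are made up.

```python
from statistics import mean, median, stdev

clean = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 5, 1000]

for label, data in (("clean", clean), ("with outlier", with_outlier)):
    print(
        f"{label}: mean={mean(data):.2f}, stdev={stdev(data):.2f}, "
        f"median={median(data)}"
    )
```

Adding the single value 1000 moves the median only from 3 to 3.5, while the mean rises from 3 to about 169 and the standard deviation grows from under 2 to several hundred.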