The box and whisker plot is one of the few data visualization techniques that require computations on the dataset before it can be visualized. Other methods that require calculation include the histogram, which needs class intervals, and the pie chart, which requires calculating the angle of each slice.
Computations are common in statistical analysis but minimal when visualizing data. Box and whisker plots use the five-number summary (minimum, lower quartile, median, upper quartile, and maximum) to describe data.
There are several steps involved in this process, and they are explained in the rest of this article.
Box Plot Definition
A box plot is a statistical data visualization technique that uses rectangular bars to indicate data groups through their quartiles. It may also have lines extending from the boxes, which indicate variability beyond the upper and lower quartiles.
The name box and whisker plot is derived from the appearance of the graph: the top of each rectangular bar (or box) marks the upper quartile, the bottom of the box marks the lower quartile, the line inside the box marks the median, and the lines drawn from each end of the box are known as the whiskers.
The boxes can either be drawn vertically or horizontally depending on the goal of visualizing the data. Although rare, some box plots do not have whiskers.
Elements of a Box Plot
- The Median
The median is the value that falls in the middle when a set of values is arranged in ascending or descending order. When the dataset contains an odd number of values, the median can be read off directly as the middle value.
However, when the number of values is even, the median is calculated by finding the average of the two numbers in the middle. The median is also known as the second quartile.
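The odd and even cases described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation; the function name `median_of` and the two small datasets are made up for the example.

```python
def median_of(values):
    """Return the middle value, or the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the middle value is the median.
        return ordered[mid]
    # Even count: average the two values in the middle.
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_result = median_of([7, 1, 5])      # ordered: 1, 5, 7
even_result = median_of([7, 1, 5, 3])  # ordered: 1, 3, 5, 7
print(odd_result, even_result)
```

Python's standard library offers the same calculation as `statistics.median`.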
- First Quartile (Q1)
The first quartile is also known as the lower quartile. It is the value at the 25th percentile, meaning that one-fourth of the data lies at or below it.
Its approximate position in the ordered data is found by multiplying one-fourth of the number of values by 1. For example, in a dataset of 100 ordered values, the first quartile sits at about position (¼)*100*1 = 25.
- Third Quartile (Q3)
The third quartile is also known as the upper quartile. It is the value at the 75th percentile, meaning that three-fourths of the data lies at or below it.
Its approximate position in the ordered data is found by multiplying one-fourth of the number of values by 3. For example, in a dataset of 100 ordered values, the third quartile sits at about position (¼)*100*3 = 75.
- Interquartile Range (Q3-Q1)
The interquartile range is the difference between the third quartile and the first quartile. It is often said to be a better measure of spread than the range because it is less affected by extreme values.
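As a rough sketch, the quartiles and the interquartile range can be computed by taking the median of each half of the ordered data, which is the approach used in the worked examples later in this article. The function name `quartiles` and the sample dataset are illustrative assumptions. When the count is odd, this sketch leaves the middle value out of both halves, which is one common convention among several.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves method."""
    if len(values) < 2:
        raise ValueError("need at least two values")
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]    # values below the median
    upper = ordered[-half:]   # values above the median
    return median(lower), median(ordered), median(upper)


q1, q2, q3 = quartiles([2, 4, 6, 8, 10, 12, 14, 16])
iqr = q3 - q1  # interquartile range
print(q1, q2, q3, iqr)
```

Other tools may interpolate between values instead, so their quartiles can differ slightly from the ones produced here.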
- Highest Value
This is simply the highest non-outlier value in the dataset being visualized by the box plot. The highest value, in this case, is not necessarily the highest value in the dataset.
Given the dataset 1, 2, 3, 4, 5, 1000, for instance, the highest value is 1000. However, this is most likely not the highest value in the box plot because 1000 will very probably be treated as an outlier.
The highest value shown on the box plot is most likely 5.
- Lowest Value
This is simply the lowest non-outlier value in the dataset being visualized by the box plot. The lowest value, in this case, is not necessarily the lowest value in the dataset.
Given the dataset -100, 50, 60, 70, 80, 90, for instance, the lowest value is -100. However, this is most likely not the lowest value in the box plot because -100 will very probably be treated as an outlier.
The lowest value shown on the box plot is most likely 50.
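The article does not say how a value is judged to be an outlier. A widely used convention, assumed here, treats any value more than 1.5 times the interquartile range beyond Q1 or Q3 as an outlier. Under that assumption, and using the median-of-halves quartiles, the two datasets above can be checked as follows. The function name `whisker_ends` and the parameter `k` are illustrative.

```python
from statistics import median


def whisker_ends(values, k=1.5):
    """Return (lowest non-outlier, highest non-outlier, outliers).

    Assumes the 1.5 * IQR rule for outliers (k is the multiplier).
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[-half:])
    iqr = q3 - q1
    low_fence = q1 - k * iqr
    high_fence = q3 + k * iqr
    inside = [v for v in ordered if low_fence <= v <= high_fence]
    outliers = [v for v in ordered if v < low_fence or v > high_fence]
    return inside[0], inside[-1], outliers


high_case = whisker_ends([1, 2, 3, 4, 5, 1000])
low_case = whisker_ends([-100, 50, 60, 70, 80, 90])
print(high_case)
print(low_case)
```

With this rule, 1000 and -100 are flagged as outliers, and the whiskers end at 5 and 50 respectively, in line with the text above.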
Box Plot Examples
Example 1: David and Bryan are both sales attendants at a phone shop. At the end of each month, they record the number of phones sold. At the end of the year, they both submitted their sales records, which showed the following monthly sales.
David: 51, 17, 25, 39, 7, 49, 62, 41, 20, 6, 43, 13.
Bryan: 30, 56, 23, 65, 42, 61, 54, 17, 21, 34, 3, 16.
- Arrange the monthly sales made by David and Bryan in a tabular form.
- Give a five-number summary of David and Bryan’s sales.
- Make a box and whisker plot describing the sales made by David and Bryan.
Solution
- The monthly sales made by David and Bryan are arranged in the table below
- The five-number summary of the data consists of the minimum value, first quartile, median, third quartile, and maximum value.
- David
6, 7, 13, 17, 20, 25, 39, 41, 43, 49, 51, 62.
Median = (sixth + seventh observations) ÷ 2
= (25 + 39) ÷ 2
= 32
There are six numbers below the median, namely: 6, 7, 13, 17, 20, 25.
Q1 = the median of these six items
= (third + fourth observations) ÷ 2
= (13 + 17) ÷ 2
= 15
There are six numbers above the median, namely: 39, 41, 43, 49, 51, 62.
Q3 = the median of these six items
= (third + fourth observations) ÷ 2
= (43 + 49) ÷ 2
= 46
The five-number summary for David's sales is 6, 15, 32, 46, 62.
Using the same calculations for Bryan, we have:
3, 16, 17, 21, 23, 30, 34, 42, 54, 56, 61, 65.
Median = (sixth + seventh observations) ÷ 2
= (30+34) ÷ 2
= 32
There are six numbers below the median, namely: 3, 16, 17, 21, 23, 30.
Q1 = the median of these six items
= (third + fourth observations) ÷ 2
= (17 + 21) ÷ 2
= 19
There are six numbers above the median, namely: 34, 42, 54, 56, 61, 65.
Q3 = the median of these six items
= (third + fourth observations) ÷ 2
= (54 + 56) ÷ 2
= 55
The five-number summary for Bryan’s sales is 3, 19, 32, 55, 65.
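The hand calculations above can be reproduced with a short script. This is a sketch that follows the same median-of-halves steps; the function name `five_number_summary` is an assumption made for the example.

```python
from statistics import median


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) via median of halves."""
    ordered = sorted(values)
    half = len(ordered) // 2
    return (
        ordered[0],
        median(ordered[:half]),   # Q1: median of the lower half
        median(ordered),
        median(ordered[-half:]),  # Q3: median of the upper half
        ordered[-1],
    )


david = [51, 17, 25, 39, 7, 49, 62, 41, 20, 6, 43, 13]
bryan = [30, 56, 23, 65, 42, 61, 54, 17, 21, 34, 3, 16]

david_summary = five_number_summary(david)
bryan_summary = five_number_summary(bryan)
print("David:", david_summary)
print("Bryan:", bryan_summary)
```

The results agree with the summaries worked out by hand: 6, 15, 32, 46, 62 for David and 3, 19, 32, 55, 65 for Bryan.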
- The resulting box plot from the monthly sales data can be found below.
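For readers without a charting tool at hand, a box plot can be roughed out in plain text from the five-number summaries. The sketch below is only an approximation of a real chart; the function name `text_boxplot`, the axis limits of 0 to 70, and the marker characters are all choices made for this example.

```python
def text_boxplot(summary, lo=0, hi=70, width=71):
    """Return one horizontal box plot drawn with text characters.

    summary is (minimum, Q1, median, Q3, maximum); lo and hi are the
    axis limits, and width is the number of character columns.
    """
    minimum, q1, med, q3, maximum = summary

    def col(value):
        # Map a data value to a character column on the axis.
        return round((value - lo) / (hi - lo) * (width - 1))

    row = [" "] * width
    for i in range(col(minimum), col(maximum) + 1):
        row[i] = "-"                               # whiskers
    for i in range(col(q1), col(q3) + 1):
        row[i] = "="                               # box from Q1 to Q3
    row[col(minimum)] = row[col(maximum)] = "|"    # whisker ends
    row[col(med)] = "M"                            # median marker
    return "".join(row)


# Five-number summaries taken from the worked solution above.
david_line = text_boxplot((6, 15, 32, 46, 62))
bryan_line = text_boxplot((3, 19, 32, 55, 65))
print("David", david_line)
print("Bryan", bryan_line)
```

With the default limits each character column stands for one phone sold, so the two rows can be compared directly: both medians line up at 32, while Bryan's box and whiskers stretch further to the right.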
Example 2: Consider the graph below which describes an altered version of the monthly sales data in example 1 above. What is the outlier in the plot? Hence, explain how this graph can help detect inconsistencies.
Solution: In the graph, we find a lone dot above the maximum value. This lone dot has a value of 110 and belongs to Bryan's sales.
With the box plot, we can easily discover inconsistencies in the visualized data because they show up as outliers. This is very useful in sales recording, especially in cases where salespeople have to meet a target.
They may, therefore, decide to alter their sales data to meet it. With the box plot, one can easily discover inconsistencies like this.
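The same check can be automated. The altered dataset behind the graph is not listed in this article, so the sketch below uses a hypothetical version in which Bryan's highest monthly figure (65) has been replaced by 110. It also assumes the common 1.5 times IQR rule for outliers and the median-of-halves quartiles; the function name `find_outliers` is illustrative.

```python
from statistics import median


def find_outliers(values, k=1.5):
    """Return values lying more than k * IQR beyond Q1 or Q3."""
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[-half:])
    iqr = q3 - q1
    return [v for v in ordered if v < q1 - k * iqr or v > q3 + k * iqr]


bryan_original = [30, 56, 23, 65, 42, 61, 54, 17, 21, 34, 3, 16]
# Hypothetical altered record: 65 replaced by 110 (an assumption).
bryan_altered = [30, 56, 23, 110, 42, 61, 54, 17, 21, 34, 3, 16]

original_flags = find_outliers(bryan_original)
altered_flags = find_outliers(bryan_altered)
print(original_flags, altered_flags)
```

Under these assumptions the original record produces no outliers, while the altered record flags 110, since the upper limit works out to 55 + 1.5 × 36 = 109.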
Example 3: Consider the box plot below, which describes the following data: 45, 22, 26, 27, 18, 24, 38, 20. Use the plot to identify the outlier and the five-number summary.
Confirm the validity of your answer by solving it using the required formula.
Solution: The outlier marked in the chart above is 50, although this value does not appear in the data listed in the question. We can also read the following five-number summary values from the box plot.
Minimum Value = 18
Maximum Value = 45
Median = 25
Q1 = 20.5
Q3 = 35.25
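Arranging the data in ascending order, we have: 18, 20, 22, 24, 26, 27, 38, 45.
Median = (24+26)/2 = 25
Q1: The first half of the values are: 18, 20, 22, 24.
Q1 = (20+22)/2 = 21
Q3: The second half of the values are: 26, 27, 38, 45.
Q3 = (27+38)/2 = 32.5
The minimum, maximum, and median match the plot. The quartiles read from the plot (20.5 and 35.25) differ slightly from these values; they are consistent with interpolating between the ordered values at positions (n+1)/4 and 3(n+1)/4 instead of taking the median of each half: Q1 = 20 + 0.25*(22-20) = 20.5 and Q3 = 27 + 0.75*(38-27) = 35.25.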
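The quartiles in this example can be cross-checked in code. The sketch below computes them two ways: as the median of each half, and with `statistics.quantiles` from Python's standard library (version 3.8 or later) using `method="exclusive"`, which interpolates at positions based on n+1. The variable names are illustrative, and which convention the chart itself uses is an inference from the numbers, not something stated in the article.

```python
from statistics import median, quantiles

data = [45, 22, 26, 27, 18, 24, 38, 20]
ordered = sorted(data)
half = len(ordered) // 2

# Convention 1: median of the lower and upper halves.
halves_q1 = median(ordered[:half])    # (20 + 22) / 2
halves_q3 = median(ordered[-half:])   # (27 + 38) / 2

# Convention 2: interpolation at (n + 1) * p positions.
interp_q1, interp_median, interp_q3 = quantiles(data, n=4, method="exclusive")

print("median of halves:", halves_q1, halves_q3)
print("interpolated:", interp_q1, interp_median, interp_q3)
```

The interpolated results (20.5, 25, 35.25) match the values read from the plot, while the median-of-halves method gives 21 and 32.5. Both are accepted ways of defining quartiles, so it helps to state which one is in use.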
How to Interpret a Box and Whisker Plot
Before going into interpreting a Box and Whisker plot, we need to first understand the different parts of a box plot. Hence, let's consider this box plot drawn using some data generated from Excel's Random number generator.
On the graph above, the horizontal line inside the blue box represents the median value of the data set. In this case, it is … inches. The x on top of the line, still located inside the blue box is the mean value of the data.
However, note that the mean does not necessarily have to be a value in the data. It is a summary statistic calculated from the data.
Now, let's properly identify the parts of a box plot. The blue box represents the data points that fall between the 1st and 3rd quartiles of the randomly generated data set.
The top of the box represents the third quartile, while the bottom of the box represents the first quartile. The median can also be referred to as the second quartile.
You will notice two vertical lines, one drawn from the top of the box to a point in the chart, while the other was drawn from the bottom of the box to a point in the chart. These two lines are referred to as the whiskers.
The horizontal line perpendicular to the top whisker indicates the maximum value, while the one perpendicular to the bottom whisker indicates the minimum value in the data set.
Just as the box gives us the interquartile range of the data, the whiskers help us determine the range of the data set. One can easily read this information at a glance.
Lastly, the dot near the top of the graph, somewhere above the maximum value, is called an outlier. An outlier is an unusual data point in the data set.
This explains why the maximum and minimum values on the plot are not necessarily the actual maximum and minimum of the data set. They represent the maximum and minimum of the usual values present in the data set.
Interpreting a Box Plot
Now that we have been able to fully understand what the different parts of a box plot mean, we can go into interpreting the Box Plot. To properly explain this, let's consider the box plot below, describing the average yearly income of men and women that fall in particular age groups.
Notice that in the graph above, there are two sets of box plots, with blue representing the men and orange representing the women. Box plots make it easy to compare groups within a large data set.
In the above plot, for example, you can easily see the average annual income of males and females across the different age groups. Overall, it is easy to see that males generally earn more than females across generations.
We also see that the money earned by men is much more even, with little difference in pay across individuals. In fact, the maximum yearly income is not visible and can be inferred to be close to or the same as the third quartile. On the other hand, women's yearly income varies more widely.
Secondly, given the much longer “whiskers” for women, we can infer that the amount of money they earn yearly varies more widely, while men's earnings tend to cluster closer to the average.
The third is the skew of the data. Skew refers to the asymmetry of your data. If you look at the women, the box and whiskers are fairly even on either side of the median/mean. However, the case is quite different for men. Hence, we say that the men's data is skewed.
Finally, we look for outliers, which are data points that differ markedly from the rest of the data. We notice that only the men's data has outliers.
Box plots are useful little graphics that contain a lot of information in very little space. They are best used at the beginning of data analysis to identify early patterns in the data. As we have seen here, they are also useful for reporting results clearly and concisely.
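The skew discussed above is judged by eye, but it can also be expressed as a number. One option, which this article does not itself use, is the quartile (Bowley) skewness coefficient: it compares how far the median sits from each edge of the box. The income data behind the graph is not listed here, so the sketch below applies the idea to the quartiles of David's and Bryan's sales from Example 1. The function name `quartile_skewness` is illustrative.

```python
def quartile_skewness(q1, med, q3):
    """Quartile (Bowley) skewness coefficient.

    0 means the median is centred in the box; a positive value means
    the upper part of the box is longer, a negative value the lower part.
    """
    return (q3 + q1 - 2 * med) / (q3 - q1)


# Quartiles from Example 1: (Q1, median, Q3).
david_skew = quartile_skewness(15, 32, 46)
bryan_skew = quartile_skewness(19, 32, 55)
print(round(david_skew, 3), round(bryan_skew, 3))
```

A value near zero suggests a roughly symmetric box. David's box leans slightly toward the lower side, while Bryan's leans more clearly toward the upper side.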
How to Make a Box and Whisker Plot in Excel
To explain how a box plot can be created using Excel, we will create one using the data from Example 3 above. Follow these simple steps to create a box and whisker plot in Excel.
- Enter the data into your Excel worksheet.
- Highlight the data, and go to Insert > Charts > Other Charts > Statistical | Box and Whisker as shown in the diagram below.
- Excel will immediately generate your box plot.
- Right-click one of the boxes on the chart to select that box and then, on the shortcut menu, click Format Data Series.
- In the Format Data Series pane, with Series Options selected, make the changes that you want.
Advantages of Box Plot Over Other Plots
- It can easily visualize large datasets. Because the box plot relies on the five-number summary, it can summarize large datasets and describe them compactly on the graph.
- It gives a clear summary of the datasets under consideration. It allows the reader to easily detect the symmetry of the data at a glance.
- Unlike most data visualization techniques, the box plot displays outliers within a dataset. Outliers are values in a dataset that fall outside the minimum and maximum values on the box plot. One can easily detect outliers on the box plot.
Disadvantages of Box Plot
- It does not retain the exact values of the dataset. It only displays the summary of the values in the dataset. Hence, it is advised to use a box plot together with other data visualization techniques that give a detailed analysis of the data.
- It is not easy for laymen to understand box plots, since reading one requires familiarity with quartiles and outliers.
- It is difficult to tell the shape of the underlying data from a box plot alone, since only summary values are shown.
Conclusion
In statistical analysis, there are different ways to describe the spread of a distribution. One of these is the five-number summary: median, first quartile, third quartile, maximum, and minimum.
This is the method a box plot uses to visualize data. An alternative way of measuring spread is the mean and standard deviation.
This method, however, has more restrictions than the five-number summary; for example, it is more sensitive to outliers. Even with these restrictions, it is used more often in statistical data analysis.
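One such restriction can be illustrated with the small dataset used earlier in this article (1, 2, 3, 4, 5, with and without the value 1000). The sketch below uses Python's `statistics` module; it is meant only as an illustration of how a single extreme value affects the two kinds of summary, and the variable names are made up for the example.

```python
from statistics import mean, median, stdev

clean = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 5, 1000]

for name, data in (("clean", clean), ("with outlier", with_outlier)):
    print(
        name,
        "| mean:", round(mean(data), 2),
        "| stdev:", round(stdev(data), 2),
        "| median:", median(data),
    )
```

Adding the single value 1000 moves the median only from 3 to 3.5, but it pulls the mean and the standard deviation far away from where most of the data lies.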