
Navigating the Data Universe: A Comprehensive Guide from Basics to Advanced Analytics - Part 1

By Ayoola Fakoya


Embarking on the exhilarating journey from entry-level data scientist to intermediate professional means mastering the language of data: statistics. Statistics is the compass that guides us through the complex cityscape of data science, and learning descriptive statistics is akin to mastering a city's intricate map. In this detailed blog post, we'll explore the foundational concepts of descriptive statistics: quartiles, percentiles, the interquartile range, and the five-number summary. Tailored for entry-level and intermediate data scientists and developers, this guide aims to equip you with the statistical tools that form the bedrock of advanced analytics.

A Deep Dive into Data's Core

Quartiles and Percentiles: Illuminating Data Tiers

Quartiles gracefully partition data into four distinct segments, providing a nuanced snapshot of its distribution. Imagine managing a sports team, analyzing player goal scores over a season. Quartiles reveal tiers of performance, ranging from the lower 25% to the upper 25%, acting as your compass in navigating the data landscape.

Measures of Position: Percentiles and Quartiles

Measures of position let you determine the position of a value relative to other values in a dataset. Along with center and spread, it's helpful to know the relative position of your values: for example, whether one value is higher or lower than another, or whether a value falls in the lower, middle, or upper portion of your dataset.

Percentile

A percentile is the value below which a percentage of data falls. Percentiles divide your data into 100 equal parts, giving the relative position or rank of a particular value in a dataset.

For example, percentiles are commonly used to rank test scores on school exams. Let’s say a test score falls in the 99th percentile. This means the score is higher than 99% of all test scores. If a score falls in the 75th percentile, the score is higher than 75% of all test scores. If a score falls in the 50th percentile, the score is higher than half, or 50%, of all test scores.

Figure: a distribution curve representing test score percentiles, from the 1st to the 99th percentile.

Note: Percentiles and percentages are distinct concepts. For example, say you score 90/100, or 90%, on a test. This doesn't necessarily mean your score of 90% is in the 90th percentile. Percentile depends on the relative performance of all test takers. If half of all test takers score above 90%, then a score of 90% will be in the 50th percentile.

Percentiles are useful for comparing values and putting data in context. For example, imagine you want to buy a new car. You’d like a midsize sedan with great fuel economy. In the United States, fuel economy is measured in miles per gallon of fuel, or mpg. The sedan you’re considering gets 23 mpg. Is that good or bad? Without a basis for comparison, it’s hard to know. However, if you know that 23 mpg is in the 25th percentile of all midsize sedans, you have a much clearer idea of its relative performance. In this case, 75% of all midsize sedans have a higher mpg than the car you’re thinking about buying.
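To make this concrete, here is a minimal Python sketch that computes the percentile rank of 23 mpg against a small, made-up sample of midsize sedan mpg figures (the numbers are illustrative, not real fuel-economy data):

import numpy as np

# Made-up mpg figures for a sample of midsize sedans (illustrative only)
sedan_mpg = np.array([21, 22, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31])

# Percentile rank of 23 mpg: the share of sedans with a lower mpg
rank = (sedan_mpg < 23).mean() * 100
print(f"23 mpg is roughly in the {rank:.0f}th percentile of this sample")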

Quartile

Three quartiles divide the data into four quarters. Quartiles let you compare values relative to the four quarters of data. Each quarter includes 25% of the values in your dataset.

The first quartile, Q1, is the middle value in the first half of the dataset. Q1 refers to the 25th percentile. 25% of the values in the entire dataset are below Q1, and 75% are above it.

The second quartile, Q2, is the median of the dataset. Q2 refers to the 50th percentile. 50% of the values in the entire dataset are below Q2, and 50% are above it.

The third quartile, Q3, is the middle value in the second half of the dataset. Q3 refers to the 75th percentile. 75% of the values in the entire dataset are below Q3, and 25% are above it.

Figure: a rectangle divided into four equal parts, representing three quartiles splitting a dataset into four quarters.

Calculating Quartiles

To calculate quartiles, you can follow these steps:

  1. Arrange the data in order from smallest to largest.
  2. Find the median. The median is the middle value of the dataset. If there are an even number of values, the median is the average of the two middle values.
  3. Find the first quartile (Q1). Q1 is the middle value of the lower half of the dataset.
  4. Find the third quartile (Q3). Q3 is the middle value of the upper half of the dataset.

Example: Car Sales

For example, imagine you're a data professional working for an auto dealership. The manager of the sales team wants to compare the performance of each sales representative on the team. The manager asks you to analyze data showing how many cars each sales representative sold during the past month.

Sales Representative Data

18 13 6 10 15 7 10 9

You can calculate quartiles for your data in four steps:

  1. Arrange the values in your dataset from smallest to largest: [6, 7, 9, 10, 10, 13, 15, 18]
  2. Find the median, or middle value, of your entire dataset. This is Q2. There are an even number of values in the dataset, so the median is the average of the two middle values, 10 and 10: Q2 = (10 + 10) ÷ 2 = 20 ÷ 2 = 10
  3. Find the median of the lower half of your dataset, [6, 7, 9, 10]. This is Q1. The median is the average of the two middle values, 7 and 9: Q1 = (7 + 9) ÷ 2 = 16 ÷ 2 = 8
  4. Finally, find the median of the upper half of your dataset, [10, 13, 15, 18]. This is Q3. The median is the average of the two middle values, 13 and 15: Q3 = (13 + 15) ÷ 2 = 28 ÷ 2 = 14

Dividing the data into quartiles gives you a clear idea of sales rep performance. You now know that the lower quartile (Q1) of reps sold 8 cars or fewer, and the upper quartile (Q3) sold 14 cars or more. In other words, the lower 25% of reps sold 8 cars or fewer, and the upper 25% sold 14 cars or more. The middle 50% of representatives sold between 8 and 14 cars.
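If you'd like to verify these numbers programmatically, here is a minimal pure-Python sketch of the median-of-halves procedure described above. One common convention, used here, excludes the median from both halves when the dataset has an odd number of values:

def median(values):
    """Return the median of an already sorted list."""
    n = len(values)
    mid = n // 2
    if n % 2 == 0:
        return (values[mid - 1] + values[mid]) / 2
    return values[mid]

def quartiles(data):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    s = sorted(data)
    n = len(s)
    lower = s[: n // 2]        # lower half; excludes the median when n is odd
    upper = s[(n + 1) // 2 :]  # upper half; excludes the median when n is odd
    return median(lower), median(s), median(upper)

cars_sold = [18, 13, 6, 10, 15, 7, 10, 9]
print(quartiles(cars_sold))  # (8.0, 10.0, 14.0)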

Interquartile Range (IQR)

The middle 50% of your data is called the interquartile range, or IQR. The interquartile range is the distance between the first quartile (Q1) and the third quartile (Q3), which is the same as the distance between the 25th and 75th percentiles. The IQR is useful for determining the relative position of your data values. For instance, data values outside the interval from Q1 - (1.5 * IQR) to Q3 + (1.5 * IQR) are often considered outliers.

Formula

IQR = Q3 - Q1

Example

In this case, Q3 = 14 and Q1 = 8. IQR = 14 - 8 = 6

Properties of IQR

IQR is a measure of dispersion, which means it measures the spread of the data. IQR is less sensitive to outliers than the range because it doesn't include the more extreme values in your dataset.

Applications of IQR

IQR is used in box plots, which are graphical representations of data distributions. IQR is used to identify outliers. IQR can also be used to compare the spread of data across different groups.
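As a quick illustration, here is a minimal sketch that applies the 1.5 * IQR rule to the car sales data:

q1, q3 = 8, 14                 # quartiles computed above for the car sales data
iqr = q3 - q1                  # IQR = 6
lower_fence = q1 - 1.5 * iqr   # 8 - 9 = -1
upper_fence = q3 + 1.5 * iqr   # 14 + 9 = 23

cars_sold = [18, 13, 6, 10, 15, 7, 10, 9]
outliers = [x for x in cars_sold if x < lower_fence or x > upper_fence]
print(outliers)  # [] -- every value lies within the fences, so no outliers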

Five-Number Summary

Finally, you can summarize the major divisions in your dataset with the five-number summary. The five numbers include:

  1. The minimum
  2. The first quartile (Q1)
  3. The median, or second quartile (Q2)
  4. The third quartile (Q3)
  5. The maximum

The five-number summary is useful because it gives you an overall idea of the distribution of your data, from the extreme values to the center. You can visualize it with a box plot.
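In NumPy, you can get all five numbers in one call. For this dataset, method="midpoint" (available in NumPy 1.22+) reproduces the median-of-halves quartiles computed by hand above:

import numpy as np

cars_sold = [18, 13, 6, 10, 15, 7, 10, 9]

# Minimum, Q1, median, Q3, maximum in a single call
five_num = np.percentile(cars_sold, [0, 25, 50, 75, 100], method="midpoint")
print(five_num)  # [ 6.  8. 10. 14. 18.]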

Interpreting the Box Plot

The box part of the box plot goes from Q1 to Q3. The vertical line in the middle of the box is the median (Q2). The horizontal lines on each side of the box, known as whiskers, go from Q1 to the minimum, and from Q3 to the maximum.

Figure: box plot of the car sales data.

This box plot illustrates the data on car sales. You can find the values on the box plot and determine the interquartile range (IQR): the IQR is the length of the box, or the distance from Q1 to Q3.
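To reproduce a plot like this yourself, here is a minimal Matplotlib sketch:

import matplotlib.pyplot as plt

cars_sold = [18, 13, 6, 10, 15, 7, 10, 9]

# Horizontal box plot: the box spans Q1 to Q3 with a line at the median;
# by default the whiskers reach the most extreme values within 1.5 * IQR
# of the box, which for this data is simply the minimum and maximum.
plt.boxplot(cars_sold, vert=False)
plt.xlabel("Cars sold per representative")
plt.title("Car sales: five-number summary")
plt.show()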

Python: A Gateway to Descriptive Analytics

The sheer volume of data in today's world necessitates computational assistance. Thankfully, we don't have to manually crunch numbers; software and programming languages like Python come to our rescue.

Python Libraries and Packages for Descriptive Analytics

Python's rich ecosystem includes libraries like NumPy, Pandas, and Matplotlib, transforming statistical computation into a streamlined process.

Descriptive Analytics in Python: An Example

Let's consider a hypothetical dataset, perhaps one showcasing user engagement metrics on a website. Pandas, a powerful data manipulation library, can swiftly calculate quartiles and percentiles and generate a five-number summary.

import pandas as pd

# Sample user engagement dataset
data = {'engagement_time': [10, 15, 20, 30, 40, 45, 50, 60, 70]}
df = pd.DataFrame(data)

# Calculating quartiles and summary statistics
quartiles = df['engagement_time'].quantile([0.25, 0.5, 0.75])
summary = df['engagement_time'].describe()

print("Quartiles:")
print(quartiles)
print("\nFive-Number Summary:")
print(summary)
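One note on the output: describe() reports the count, mean, and standard deviation in addition to the five-number summary (min, 25%, 50%, 75%, max). Also, quantile() defaults to linear interpolation between data points, so its quartiles can differ slightly from the median-of-halves values you would compute by hand.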

Statistical Inference: Unveiling the Hidden Truth

Descriptive statistics, while informative, offer limited insight into the underlying population from which the data was sampled. This is where statistical inference steps in.

Statistical Inference: The Cornerstone of Data-Driven Decisions

Statistical inference allows us to make generalizations about the population based on the sample data. It involves formulating hypotheses, collecting and analyzing data, and drawing conclusions about the population parameters.

Hypothesis Testing: A Journey from Assumption to Conclusion

Hypothesis testing is a cornerstone of statistical inference. It involves stating a hypothesis, typically a null hypothesis, and using data to determine whether there is sufficient evidence to reject it.
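As a small preview, here is a minimal sketch of a one-sample t-test using SciPy, reusing the hypothetical engagement data from earlier. The choice of 30 as the hypothesized population mean is purely illustrative:

from scipy import stats

# Hypothetical engagement times from the earlier example
engagement_time = [10, 15, 20, 30, 40, 45, 50, 60, 70]

# Null hypothesis: the population mean engagement time is 30
result = stats.ttest_1samp(engagement_time, popmean=30)
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.3f}")

# A small p-value (commonly below 0.05) is taken as evidence against the null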

Conclusion: Empowering Data Scientists for Informed Exploration

As we've explored in this blog post, descriptive statistics provide a foundation for understanding and summarizing data, while statistical inference allows us to draw meaningful conclusions about the population from which the data was sampled. These tools empower data scientists to make informed decisions, uncover hidden patterns, and drive impactful outcomes.

Whether you're an aspiring or experienced data scientist, embracing the power of statistics will transform your ability to navigate the intricate landscape of data and extract valuable insights. So, embark on this exciting journey of statistical discovery and unlock the hidden truths that lie within your data!

Stay Tuned

Part 2 will dig deeper into statistical inference and hypothesis testing, building on the descriptive foundations covered here.
