Before diving deeper into this chapter, please review our previous chapter where we discuss the GROUP BY clause in detail. Aggregation functions and GROUP BY go hand in hand, as they are often used together to group data and perform calculations on these groups.
Aggregation functions in SQL allow you to perform calculations on a set of values to return a single value. These functions are often used with the GROUP BY statement, which groups rows that have the same values into summary rows. Before diving into aggregation functions, it's crucial to understand the GROUP BY statement as they are commonly used together.
SELECT
COUNT(*) AS total_trips,
FROM `bigquery-public-data.san_francisco_bikeshare.bikeshare_trips`
This query counts the total number of rows in the specified table, resulting in a single value that represents the total number of records in the dataset. For example, the bikeshare_trips
table contains 1,947,417 rows in total.
Next, we calculate the average value of a specific column using the AVG function.
SELECT
AVG(duration_sec) AS avg_duration_sec
FROM `bigquery-public-data.san_francisco_bikeshare.bikeshare_trips`
This query calculates the average value of all entries in the duration_sec
column. Based on the result, the average trip duration is approximately 1,008 seconds, which is about 17 minutes. Such general insights can be effectively extracted using the AVG function.
Now, let's introduce a column into this query and use GROUP BY clause with it.
SELECT
start_station_name,
AVG(duration_sec) AS avg_duration_sec
FROM `bigquery-public-data.san_francisco_bikeshare.bikeshare_trips`
GROUP BY 1
This query calculates the average trip duration (duration_sec
) for each start_station_name
. By grouping the results based on the start station, you can determine the average duration of trips that originated from each station.
Now, let’s find out which station had the longest average duration by using the ORDER BY clause to sort the results.
SELECT
start_station_name,
AVG(duration_sec) AS avg_duration_sec
FROM `bigquery-public-data.san_francisco_bikeshare.bikeshare_trips`
GROUP BY 1
ORDER BY 2 DESC
This query calculates the average trip duration for each start station and then orders the results in descending order by the average duration (avg_duration_sec
). This way, the station with the longest average trip duration—such as University and Emerson—will appear at the top of the list.
While the average duration provides useful information, it could be skewed if a station like University and Emerson had only one exceptionally long trip. This limitation means we don't know how frequently such events occurred. To gain a clearer picture of the data, we can use the SUM function to add up all the trip durations for each station and the COUNT function to determine how many trips started from each station.
SELECT
start_station_name,
AVG(duration_sec) AS avg_duration_sec,
SUM(duration_sec) AS sum_duration_sec,
COUNT(trip_id) AS count_trip
FROM `bigquery-public-data.san_francisco_bikeshare.bikeshare_trips`
GROUP BY 1
ORDER BY 2 DESC
Note: The SUM and AVG functions can only be applied to numeric data types, such as integers and floats. They cannot be used with string or date data types.
This query calculates the average trip duration, the total duration of all trips, and the number of trips for each start station. By ordering the results based on the total duration (total_duration_sec
), we can better understand which stations have the highest cumulative trip durations and how frequently trips start from each station. This approach provides a more comprehensive view of the data beyond just the average duration.
The MIN and MAX functions help identify the smallest and largest values in a dataset.
SELECT
start_station_name,
AVG(duration_sec) AS avg_duration_sec,
SUM(duration_sec) AS sum_duration_sec,
COUNT(trip_id) AS count_trip,
MIN(duration_sec) AS min_duration_sec,
MAX(duration_sec) AS max_duration_sec,
FROM `bigquery-public-data.san_francisco_bikeshare.bikeshare_trips`
GROUP BY 1
ORDER BY 4 DESC
This query finds the minimum and maximum values in the specified column. The result shows the smallest and largest values, respectively.
Now, we will switch to shop data to use aggrregation functions in the different type of dataset.
SELECT
seller_channel,
COUNT(*) AS count_rows,
COUNT(order_number) AS count_order_number,
COUNT (discount_code) AS count_discount_code
FROM `dwh-udemy-course.3_dwh_ecommerce.order_item`
GROUP BY 1
ORDER BY 1
In summary, we previously examined how order numbers were consistently present, but discount codes were only relevant to the online shop. For a deeper understanding, we looked into product quantities ordered across different seller channels. For instance, we found that for two seller channels—'About You' and 'Online Shop'—we had quantities ordered for eBay but not for 'Auto,' indicating no products were sold through that channel.
Additionally, we calculated the average quantity of products ordered and compared it across channels. We observed that while the 'Online Shop' had a slightly higher average quantity of products ordered compared to 'About You,' it is essential to note that this average was based on order items, not the overall order level. This distinction is crucial for accurate analysis.
As we’ve seen with the BigQuery examples, aggregation functions and the GROUP BY clause provide powerful tools for analyzing data. For more detailed insights, consider experimenting with different GROUP BY columns, such as comparing start and end station names in the bike share data to assess average durations between stations. Understanding these nuances allows for a more comprehensive data analysis.
Understanding and effectively using aggregation functions in SQL is crucial for data analysis. Functions like AVG, SUM, MIN, MAX, and COUNT allow you to summarize and analyze large datasets efficiently. When combined with the GROUP BY statement, these functions provide powerful insights into your data.
Whether you're counting total entries, calculating averages, summing values, or finding minimum and maximum values, mastering these aggregation functions will enhance your data analysis capabilities. Experiment with these functions on your datasets to uncover valuable insights and trends. For more information and in-depth tutorials, check out our other blog posts and videos. Happy querying!
In the world of data analysis, proficiency in SQL is essential. One critical aspect of SQL is understanding how to use aggregation functions to summarize and analyze data. This blog post will delve into key aggregation functions—AVG, SUM, MIN, MAX, and COUNT—using Google BigQuery. We will explore practical examples to help you grasp these functions' applications and significance.