Chapter 5: Understanding BigQuery Group By Statement and COUNT Function

Introduction
‍

The `GROUP BY` statement is a fundamental aspect of SQL, essential for data analysis and manipulation. By grouping data and applying aggregation functions, we can derive meaningful insights from large datasets. In this blog, we'll explore the `GROUP BY` statement and the `COUNT` function in BigQuery, breaking down each concept step-by-step and providing practical example queries to illustrate their usage.

‍

1. Understanding `SELECT DISTINCT` vs `GROUP BY`

The `SELECT DISTINCT` statement is used to retrieve unique values from a column, filtering out duplicate entries. In contrast, `GROUP BY` groups rows that have the same values in specified columns and allows for aggregation of data, such as summing or counting occurrences.

Example Query using SELECT DISTINCT

‍

SELECT 
    DISTINCT subscriber_type
FROM `bigquery-public-data.san_francisco_bikeshare.bikeshare_trips`
ORDER BY 1

‍

Query result using SELECT DISTINCT function

‍

This query returns unique values from the `subscriber_type` column, sorting them alphabetically.

2. Basic Usage of `GROUP BY`

The same result can be obtained by GROUP BY function.

Example Query using GROUP BY Statement

SELECT 
      subscriber_type
FROM `bigquery-public-data.san_francisco_bikeshare.bikeshare_trips`
GROUP BY subscriber_type
ORDER BY 1

‍

‍

This query groups the data by `subscriber_type`, yielding the same output as the `SELECT DISTINCT` query but with the capability to aggregate data.

Although the output is similar, DISTINCT and GROUP BY work very differently. The DISTINCT function removes duplicate rows from the result set and ensures that only unique rows are returned based on the specified columns. The GROUP BY clause groups rows that have the same values in specified columns and allows aggregate functions (e.g., COUNT, SUM, AVG) to be applied to each group.

In this chapter, we will solely focus on the COUNT function. We have a separate chapter where we discuss other aggregation functions such as SUM, AVG, MAX, and MIN.

‍

3. Counting Occurrences with `COUNT`

The above query result shows all unique subscriber types but does not provide information on the frequency of each type. This is where `GROUP BY` becomes useful.

‍

Example Query

SELECT 
      subscriber_type, COUNT(*) AS trip_count
FROM `bigquery-public-data.san_francisco_bikeshare.bikeshare_trips`
GROUP BY subscriber_type
ORDER BY 1

‍

Query result using GROUP BY clause and COUNT function

‍

‍

This query counts the number of occurrences of each `subscriber_type` and orders the results by the count in descending order.
The output shows the number of trips for each subscriber type, highlighting which type has the most trips. This is crucial for understanding user behavior and trends.

4. Combining `GROUP BY` with Multiple Columns

Example Query

SELECT 
      subscriber_type, 
      COUNT(*) AS trip_count,
      COUNT(duration_sec) AS sum_duration_sec
FROM `bigquery-public-data.san_francisco_bikeshare.bikeshare_trips`
GROUP BY subscriber_type
ORDER BY subscriber_type

‍

The query result shows the row count of all the rows available in the table, as well as the row count of duration_sec. The row counts for both trip_count and duration_sec are exactly the same, indicating that there are no NULL values in the duration_sec column. We can further confirm this by adding another COUNT function to count the rows that have NULL values, if any exist.

‍

SELECT 
      subscriber_type, 
      COUNT(*) AS trip_count,
      COUNT(duration_sec) AS sum_duration_sec,
      COUNT (NULL) AS count_null
FROM `bigquery-public-data.san_francisco_bikeshare.bikeshare_trips`
GROUP BY subscriber_type
ORDER BY subscriber_type

‍

‍

This query confirms that there are no row that contains NULL values.

‍

It will be interesting to find out which subscriber type is starting their trip from which station. We can achieve this by using ORDER BY with trip_count.

‍

SELECT 
      subscriber_type, 
      start_station_name,
      COUNT(*) AS trip_count,
      COUNT(duration_sec) AS sum_duration_sec,
      COUNT (NULL) AS count_null
FROM `bigquery-public-data.san_francisco_bikeshare.bikeshare_trips`
GROUP BY subscriber_type, start_station_name
ORDER BY trip_count DESC

‍

‍

Based on our results, the most number of trips were started from San Francisco Caltrain by the nan subscriber type, where nan means neither a customer nor a subscriber. When you use the GROUP BY clause to aggregate certain metrics together, you gain a better understanding of your data. It's crucial for comprehending your dataset and is used frequently, especially when modeling data. Additionally, GROUP BY is often needed in combination with joins and subqueries.

‍

5. Aggregating Data with Additional Columns

‍

Let's see some more examples of GROUP BY clause within a shop data. First, let's query types of seller_channel in our order_item data.

‍

SELECT 
    seller_channel
 FROM `dwh-udemy-course.3_dwh_ecommerce.order_item`
GROUP BY 1
ORDER BY 1

‍

The query shows different types of seller channel avaialble in the order_item table.

‍

Simple Query using GROUP By in order_item table

‍

SELECT 
    seller_channel,
    COUNT(*) AS count_rows
 FROM `dwh-udemy-course.3_dwh_ecommerce.order_item`
GROUP BY 1
ORDER BY 1

‍

Query result showing row count of seller channel in order item table

This query groups by the seller channels and counts the number of rows for each seller channel. The results highlight which seller channels are most active and how frequently these seller channels are used. Interestingly, eBay and Limango have only one row each, indicating that either these seller channels are not active or have just started recently. These kinds of insights are incredibly significant when you are trying to understand your data.

‍

Let's do few more checks to understand he data even further..

SELECT 
    seller_channel,
    COUNT(*) AS count_rows,
    COUNT(order_number) AS count_order_number,
    COUNT (discount_code) AS count_discount_code
 FROM `dwh-udemy-course.3_dwh_ecommerce.order_item`
GROUP BY 1
ORDER BY 1

‍

‍
The results highlight which sales channels are most active and how frequently the discount codes are used, offering valuable insights for marketing and sales strategies.

Conclusion
The `GROUP BY` statement in SQL, combined with aggregation functions like `COUNT`, is a powerful tool for data analysis. It enables you to group and summarize large datasets, uncovering patterns and insights that are essential for making informed decisions. By mastering these concepts and applying them to real-world scenarios, you can unlock the full potential of your data in BigQuery. Whether you're analyzing user behavior, sales performance, or operational metrics, `GROUP BY` and `COUNT` provide the foundation for effective data-driven decision-making.

Subscribe to our newsletter