The `GROUP BY` statement is a fundamental aspect of SQL, essential for data analysis and manipulation. By grouping data and applying aggregation functions, we can derive meaningful insights from large datasets. In this blog, we'll explore the `GROUP BY` statement and the `COUNT` function in BigQuery, breaking down each concept step-by-step and providing practical example queries to illustrate their usage.
The `SELECT DISTINCT` statement is used to retrieve unique values from a column, filtering out duplicate entries. In contrast, `GROUP BY` groups rows that have the same values in specified columns and allows for aggregation of data, such as summing or counting occurrences.
Example Query using SELECT DISTINCT
SELECT
DISTINCT subscriber_type
FROM `bigquery-public-data.san_francisco_bikeshare.bikeshare_trips`
ORDER BY 1
This query returns unique values from the `subscriber_type` column, sorting them alphabetically.
The same result can be obtained by GROUP BY function.
Example Query using GROUP BY Statement
SELECT
subscriber_type
FROM `bigquery-public-data.san_francisco_bikeshare.bikeshare_trips`
GROUP BY subscriber_type
ORDER BY 1
This query groups the data by `subscriber_type`, yielding the same output as the `SELECT DISTINCT` query but with the capability to aggregate data.
Although the output is similar, DISTINCT
and GROUP BY
work very differently. The DISTINCT
function removes duplicate rows from the result set and ensures that only unique rows are returned based on the specified columns. The GROUP BY
clause groups rows that have the same values in specified columns and allows aggregate functions (e.g., COUNT
, SUM
, AVG
) to be applied to each group.
In this chapter, we will solely focus on the COUNT
function. We have a separate chapter where we discuss other aggregation functions such as SUM
, AVG
, MAX
, and MIN
.
The above query result shows all unique subscriber types but does not provide information on the frequency of each type. This is where `GROUP BY` becomes useful.
Example Query
SELECT
subscriber_type, COUNT(*) AS trip_count
FROM `bigquery-public-data.san_francisco_bikeshare.bikeshare_trips`
GROUP BY subscriber_type
ORDER BY 1
This query counts the number of occurrences of each `subscriber_type` and orders the results by the count in descending order.
The output shows the number of trips for each subscriber type, highlighting which type has the most trips. This is crucial for understanding user behavior and trends.
Example Query
SELECT
subscriber_type,
COUNT(*) AS trip_count,
COUNT(duration_sec) AS sum_duration_sec
FROM `bigquery-public-data.san_francisco_bikeshare.bikeshare_trips`
GROUP BY subscriber_type
ORDER BY subscriber_type
The query result shows the row count of all the rows available in the table, as well as the row count of duration_sec
. The row counts for both trip_count
and duration_sec
are exactly the same, indicating that there are no NULL
values in the duration_sec
column. We can further confirm this by adding another COUNT
function to count the rows that have NULL
values, if any exist.
SELECT
subscriber_type,
COUNT(*) AS trip_count,
COUNT(duration_sec) AS sum_duration_sec,
COUNT (NULL) AS count_null
FROM `bigquery-public-data.san_francisco_bikeshare.bikeshare_trips`
GROUP BY subscriber_type
ORDER BY subscriber_type
This query confirms that there are no row that contains NULL values.
It will be interesting to find out which subscriber type is starting their trip from which station. We can achieve this by using ORDER BY
with trip_count
.
SELECT
subscriber_type,
start_station_name,
COUNT(*) AS trip_count,
COUNT(duration_sec) AS sum_duration_sec,
COUNT (NULL) AS count_null
FROM `bigquery-public-data.san_francisco_bikeshare.bikeshare_trips`
GROUP BY subscriber_type, start_station_name
ORDER BY trip_count DESC
Based on our results, the most number of trips were started from San Francisco Caltrain by the nan
subscriber type, where nan
means neither a customer nor a subscriber. When you use the GROUP BY
clause to aggregate certain metrics together, you gain a better understanding of your data. It's crucial for comprehending your dataset and is used frequently, especially when modeling data. Additionally, GROUP BY
is often needed in combination with joins and subqueries.
Let's see some more examples of GROUP BY clause within a shop data. First, let's query types of seller_channel in our order_item data.
SELECT
seller_channel
FROM `dwh-udemy-course.3_dwh_ecommerce.order_item`
GROUP BY 1
ORDER BY 1
The query shows different types of seller channel avaialble in the order_item table.
SELECT
seller_channel,
COUNT(*) AS count_rows
FROM `dwh-udemy-course.3_dwh_ecommerce.order_item`
GROUP BY 1
ORDER BY 1
This query groups by the seller channels and counts the number of rows for each seller channel. The results highlight which seller channels are most active and how frequently these seller channels are used. Interestingly, eBay and Limango have only one row each, indicating that either these seller channels are not active or have just started recently. These kinds of insights are incredibly significant when you are trying to understand your data.
Let's do few more checks to understand he data even further..
SELECT
seller_channel,
COUNT(*) AS count_rows,
COUNT(order_number) AS count_order_number,
COUNT (discount_code) AS count_discount_code
FROM `dwh-udemy-course.3_dwh_ecommerce.order_item`
GROUP BY 1
ORDER BY 1
The results highlight which sales channels are most active and how frequently the discount codes are used, offering valuable insights for marketing and sales strategies.
Conclusion
The `GROUP BY` statement in SQL, combined with aggregation functions like `COUNT`, is a powerful tool for data analysis. It enables you to group and summarize large datasets, uncovering patterns and insights that are essential for making informed decisions. By mastering these concepts and applying them to real-world scenarios, you can unlock the full potential of your data in BigQuery. Whether you're analyzing user behavior, sales performance, or operational metrics, `GROUP BY` and `COUNT` provide the foundation for effective data-driven decision-making.