Hello and welcome to our blog on mastering the HAVING statement in BigQuery! If you're familiar with SQL, you likely know the WHERE clause, but today, we dive into its powerful counterpart for aggregated data: the HAVING statement. This tutorial will guide you through understanding the HAVING statement, its differences from the WHERE clause, and provide practical examples to solidify your learning. If you’re new to SQL or need a refresher on GROUP BY and aggregation functionalities, be sure to check out our previous tutorials as they lay the foundation for this lesson.
The HAVING statement is essential for filtering results after data has been aggregated. This is particularly useful when you want to impose conditions on aggregated data. Let’s break down its functionality and illustrate with examples.
Let's start with a simple query:
SELECT
start_station_name,
SUM(duration_sec) AS total_duration,
FROM `bigquery-public-data.san_francisco_bikeshare.bikeshare_trips`
GROUP BY 1
ORDER BY 2 DESC
This query aggregates the data from the bikeshare_trips
table by start_station_name
, summing up the total duration of trips for each station. It groups the results by start_station_name
and orders them in descending order based on the total_duration
.
In the result:
To analyze specific sets of data or exclude unusual entries, you need to filter based on aggregated results. For instance, if you want to only show stations with a total duration above 500,000 seconds, you might think of using a WHERE clause like this:
SELECT
start_station_name,
SUM(duration_sec) AS total_duration,
FROM `bigquery-public-data.san_francisco_bikeshare.bikeshare_trips`
WHERE total_duration > 500000
GROUP BY 1
ORDER BY 2 DESC
This query attempts to use WHERE to filter results based on total_duration
, which is an aggregated value. However, this will result in an error because the WHERE clause is applied before the aggregation (i.e., before total_duration
is computed). The WHERE clause does not recognize total_duration
since it does not exist until after the GROUP BY operation.
Instead of using WHERE, you should use the HAVING clause to filter results after aggregation:
SELECT
start_station_name,
SUM(duration_sec) AS total_duration,
FROM `bigquery-public-data.san_francisco_bikeshare.bikeshare_trips`
GROUP BY 1
HAVING total_duration > 500000
ORDER BY 2 DESC
In this query:
start_station_name
and calculate total_duration
using SUM(duration_sec)
.HAVING
clause is then used to filter the grouped results to include only those stations where total_duration
exceeds 500,000 seconds.total_duration
in descending order.
SELECT
start_station_name,
SUM(duration_sec) AS total_duration,
FROM `bigquery-public-data.san_francisco_bikeshare.bikeshare_trips`
GROUP BY 1
HAVING total_duration > 50000000
ORDER BY 2 DESC
SELECT
start_station_name,
SUM(duration_sec) AS total_duration,
FROM `bigquery-public-data.san_francisco_bikeshare.bikeshare_trips`
GROUP BY 1
HAVING total_duration BETWEEN 50000 AND 80000
ORDER BY 2 DESC
The HAVING statement is a powerful tool in SQL for filtering aggregated data, offering functionalities that the WHERE clause cannot. By leveraging HAVING, you can apply complex conditions to your grouped data and achieve more precise results in your queries. Practice using the examples provided to enhance your understanding and visit our [website](#) for more tutorials and resources. Don't forget to like this blog, subscribe to our YouTube channel, and stay tuned for more SQL tips and tricks. Happy querying!