BIGQUERY LEARNING

Welcome to our BigQuery Learning Hub!
In this beginner-friendly guide, we'll delve into BigQuery and its tools for querying and analyzing data, perfect for those starting out or seeking to enhance their expertise.

Your Free One-Stop Guide to Mastering BigQuery.

What is BigQuery

Google BigQuery is a part of the Google Cloud Platform that offers an on-demand data warehouse service. It enables you to upload organized data into tables and leverage Google's cloud infrastructure to swiftly analyze vast amounts of data rows within seconds.

Start with Free BigQuery Tutorial

Before you Begin

In this guide, we'll be accessing tables within a publicly available dataset provided by Google to explore BigQuery via the Google Cloud Platform. Hence, it's assumed that:

You have access to the Google Could Platform

You have already created a Google Cloud project.

Google sandbox environment is up and running

About the Instructor

Introducing Alexander Junke: a data expert with over seven years of hands-on experience in Data Engineering and Analytics. Equipped with a wealth of SQL and a mastery of Google Tag Manager, Alex is your go-to guru for navigating the intricate world of data. With a talent for transforming complex datasets into actionable insights, he's your trusted guide on the journey to data-driven success.

About the Course

Understanding data is paramount for anyone seeking insights into their business or any analytical endeavor. As data scales, so does the need for robust tools to handle it efficiently.

While small datasets find solace in tools like Google Sheets or Excel, when data grows, BigQuery steps in as the ultimate solution. With BigQuery, massive datasets are effortlessly managed, providing swift and profound insights in minutes.

At datadice, our seasoned data engineer, Alexander Junke, will guide you through this comprehensive yet accessible journey into BigQuery. With each video uploaded every 1 or 2 weeks, we delve into various topics, from querying data for diverse purposes to sharing invaluable tips and tricks for efficient data manipulation.

Join us as we navigate through distinct chapters, unraveling the complexities of data analysis while mastering the art of querying data effectively. Let's explore each chapter together, one step at a time.

Chapters

Chapter 1:
Understanding SQL Starting with SELECT * FROM

In this BigQuery tutorial for beginners, you'll dive into the fundamentals of SQL, focusing on the SELECT and FROM statements. Starting with an overview of BigQuery, including project structures and datasets, you'll learn how to effectively query data using SQL. Through practical examples and demonstrations, you'll understand how to retrieve specific data from tables, utilize SELECT DISTINCT for unique values, and navigate potential pitfalls like using SELECT * efficiently. By the end of this tutorial, you'll have a solid foundation in SQL querying within the BigQuery environment.

What will you learn:

  • Understand the basics of BigQuery, including projects, datasets, and tables.
  • Learn how to construct SQL queries using the SELECT and FROM statements.
  • Explore practical examples to retrieve specific data from tables in BigQuery.
  • Master the use of SELECT DISTINCT to obtain unique values from columns.
  • Gain insights into optimizing SQL queries for cost efficiency, such as avoiding SELECT *.


Introduction to BigQuery and Dataset Organization


Before diving into SQL queries, let's familiarize ourselves with BigQuery's structure. BigQuery organizes data into projects, datasets, and tables. Each project can contain multiple datasets, and each dataset can contain multiple tables.

For our tutorials, we'll primarily focus on the dwh_udemy_course dataset and the Big Public Data, which contains publicly available data. Within the dwh_udemy_course dataset, we'll explore the dwh_ecommerce dataset, comprising tables such as order item, product, and customer data. And for the BigQuery public dataset, we will be using the San Francisco Bike Share data.

1. SELECT FROM

Let's write our first query. The most basic query that you can write is SELECT * FROM. Let's take an example from the Bike Share data. The breakdown of this query will be as follows: This statement will select a column called "start_station_name" from the table bikeshare_trips which is located inside bigquery-public-dataset, inside the San Francisco Bike Share dataset.

Once you write the query, check on the top right corner, if there is a green check mark ✅,  it means the query is correct. Also, another important thing to notice is how many MBs of data will be processed when running the query. This MBs of data determines the cost of running the query. To learn more details about the cost of the query, check out this link.

SELECT start_station_name
FROM `bigquery-public-data.san_francisco_bikeshare.bikeshare_trips`



2. SELECT DISTINCT FROM STATEMENT

When running the query without using the DISTINCT function, all rows are shown in the result, including duplicates. For example, if there are multiple trips starting from the same station, each trip will be displayed as a separate row in the result.

However, if we use the DISTINCT function, the result will only show unique rows. Duplicate rows will be removed from the result set. For instance, if there are multiple trips starting from the same station, only one instance of that station will be displayed in the result, eliminating redundancy and providing a cleaner output. An example illustrating both scenarios has been illustrated in the image below.

SELECT DISTINCT start_station_name
FROM `bigquery-public-data.san_francisco_bikeshare.bikeshare_trips`

3. SELECT * FROM STATEMENT

When using the SELECT * FROM statement without any additional functions, all columns and rows from the specified table are retrieved in the query result. This means that every piece of data in the table will be displayed, including duplicates if they exist. "SELECT *"  is equivalent to "SELECT ALL"

SELECT *
FROM `bigquery-public-data.san_francisco_bikeshare.bikeshare_trips`

4. SELECT AS STATEMENT

To enhance clarity and consistency in our queries, we can rename columns and tables using aliases. Aliases provide descriptive names for columns and tables, making it easier to understand query results, especially when dealing with multiple tables and JOIN operations. This time let's try an exmaple with a different table - bikeshare_station_info.

SELECT name AS station_name
FROM `bigquery-public-data.san_francisco_bikeshare.bikeshare_station_info`


You can also write the statement without AS and it will function the same.

SELECT name station_name
FROM `bigquery-public-data.san_francisco_bikeshare.bikeshare_station_info`


Similarly, you can also add aliases for tables. For example,

SELECT name station_name
FROM `bigquery-public-data.san_francisco_bikeshare.bikeshare_station_info` AS station


Conclusion

In conclusion, mastering the SELECT * FROM statement is essential for anyone working with SQL queries in BigQuery. By understanding BigQuery's structure, utilizing keywords like DISTINCT, and optimizing queries for cost efficiency, you can efficiently extract valuable insights from your datasets.

Additionally, for visual learners, we have a comprehensive YouTube video on this topic. In the video, we also showcase examples of statements from the dwh_udemy_course dataset, providing a practical demonstration of how queries are constructed and executed. If you're interested in seeing these queries in action, check out our YouTube video.

Furthermore, in the next chapter, we delve into the WHERE statement in detail, exploring how it can be used to filter data and refine query results.

Jump to chapter 2.

Chapter 2:
Understanding BigQuery WHERE Statement

In the world of data analysis and manipulation, filtering data is a fundamental operation. Whether you're a beginner or an experienced SQL user, understanding how to efficiently filter data can significantly enhance your analytical capabilities. In this comprehensive guide, we'll delve into the intricacies of data filtering using the WHERE statement in BigQuery. From basic filtering to advanced techniques, we'll cover everything you need to know to master data filtering in BigQuery.

What will you Learn:

  • Understand the significance of the WHERE statement in BigQuery SQL queries.
  • Learn how to filter data based on specific criteria using conditions like greater than, equal to, and IN.
  • Gain insights into filtering techniques for textual data such as names and numerical data like dates and quantities.
  • Discover advanced filtering methods and best practices to streamline data analysis and manipulation.
  • Acquire practical skills to enhance data management and analysis tasks within BigQuery, empowering you to extract meaningful insights efficiently.


Understanding the WHERE Statement


The WHERE statement in SQL, including BigQuery, allows you to filter rows based on specified conditions. It serves as a powerful tool for extracting relevant data from large datasets.
Let's explore some example statements to illustrate its usage:
Please check Chapter One to see details about the dataset used in these examples.
First, let's run an example statement without WHERE statement.

SELECT name, capacity
FROM `bigquery-public-data.san_francisco_bikeshare.bikeshare_station_info`


The result will display a list of different station names and their corresponding capacities. Additionally, the total number of rows is 472.


Now, let's add the WHERE clause as capacity > 15

SELECT name, capacity
FROM `bigquery-public-data.san_francisco_bikeshare.bikeshare_station_info`
WHERE capacity > 15


In this example, we filter rows where the capacity is greater than 15. As a result, only stations with a capacity greater than 15 are displayed. The number of rows has decreased, with a total of 390 rows remaining.


2. Filtering by String values

Let's filter our table, where station name = 'Otis St at Brady St'

SELECT name, capacity
FROM `bigquery-public-data.san_francisco_bikeshare.bikeshare_station_info`
WHERE name = 'Otis St at Brady St'


This result will show table with rows based on the station name 'Otis St at Brady St'

Note:
When adding a string WHERE condition, ensure to enclose the value with single quotes, and remember that values are case-sensitive.


3. Combining multiple condition with 'OR' Operator

If you want to filter your table with multiple distinct values, you can use the 'OR' operator to achieve that.

SELECT 
    name,
    capacity
FROM `bigquery-public-data.san_francisco_bikeshare.bikeshare_station_info`
WHERE name = 'Otis St at Brady St'
    OR  name = 'Filbert St at Fillmore St'
    OR  name = 'Fillmore St at Jefferson St'
    OR  name = 'Turk Blvd at Stanyan St'



4. Using IN Operator

Instead of using the OR operator, you can achieve the same result more efficiently by using the IN operator. This approach results in a shorter and more manageable statement.

SELECT 
    name,
    capacity
FROM `bigquery-public-data.san_francisco_bikeshare.bikeshare_station_info`
WHERE name IN ('Otis St at Brady St','Filbert St at Fillmore St','Fillmore St at Jefferson St','Turk Blvd at Stanyan St')


The IN operator provides a more concise way to specify multiple conditions, achieving the same result as the previous example.


5. Filtering by Date

Next, you can also filter your query using dates. For example, if you only want rows where the event happened after April 20, 2028, then you can use this query. Additionally, here we are using a different table inside the bikeshare dataset - bikeshare_trips.

SELECT 
      *
FROM `bigquery-public-data.san_francisco_bikeshare.bikeshare_trips`
WHERE start_date > '2018-04-20'



5. Advanced filtering with Multiple Conditions

Here, we filter rows based on both date range and minimum product quantity ordered.

SELECT *
FROM dataset.table
WHERE start_date BETWEEN '2023-04-01' AND '2023-04-30'
AND product_quantity_ordered >= 10;


Best Practices and Tips:

  • Always consider the order of conditions in complex WHERE clauses to ensure accurate filtering.
  • When filtering by columns not included in the SELECT statement, it's advisable to include those columns to maintain clarity in analysis.
  • Experiment with different filtering techniques to optimize query performance and achieve desired results.


Conclusion

Mastering data filtering in BigQuery opens up a world of possibilities for data analysis and exploration. By harnessing the power of the WHERE statement and applying best practices, you can efficiently extract valuable insights from your datasets. Whether you're filtering by numeric values, string values, dates, or combining multiple conditions, understanding these techniques will empower you to make informed decisions based on your data.

Additionally, we also have a YouTube tutorial for the same topic, where we showcase more examples of the WHERE statement. If you are a visual learner (or not), then you must check out our YouTube video.

Jump to chapter 3.

Chapter 3:
Understanding BigQuery ORDER BY Statement

In this tutorial, we delve into the intricacies of the ORDER BY statement in SQL, particularly focusing on its usage within BigQuery. The blog guides beginners through understanding how to organize query results effectively. Beginning with the basic syntax, viewers learn how to apply the ORDER BY clause to sort data based on specific columns. Through examples, the tutorial demonstrates sorting in both ascending and descending orders, providing a comprehensive understanding of the functionality. Furthermore, it explores the concept of ordering by multiple columns, offering viewers a versatile approach to organizing their data. By the end, readers gain valuable insights into manipulating query results to suit their analytical needs using BigQuery.

What Will You Learn:

  • Understand the fundamental usage of the ORDER BY statement in SQL within the context of BigQuery.
  • Learn how to structure SQL queries to sort query results based on specified columns.
  • Explore sorting data in both ascending and descending orders to effectively organize information.
  • Gain proficiency in ordering query results by multiple columns for enhanced data presentation and analysis.
  • Acquire valuable skills in manipulating and optimizing query results to extract meaningful insights using BigQuery's ORDER BY functionality.


Understanding the basics of ORDER BY

Let's start with the fundamentals. The ORDER BY statement allows you to arrange your query results in a specific order. Whether you're sorting alphabetically, numerically, or by any other criteria, ORDER BY is your go-to command for organizing data precisely how you need it.

Let's take an example from our Bikeshare dataset. (For more details about the Bikeshare dataset, check chapter 1)
Let's say you want to analyze the start and end stations of trips. Here's a basic query to get you started.

SELECT start_station_name
FROM `bigquery-public-data.san_francisco_bikeshare.bikeshare_trips`
ORDER BY start_station_name


The result of this query is displaying the start_station_name values alphabetically arranged from A to Z. For instance, if the dataset contains station names like "A Street," "B Boulevard," and so on, the query results will list these stations in alphabetical order, starting from "A Street" and ending with the last station name in the dataset.


Ascending vs Descending ORDER BY

Now, let's talk about the direction of your sorting. By default, ORDER BY arranges data in ascending order. This means that numerical values start from the smallest and increase, while alphabetical values follow their natural order (from A to Z).For instance, if you're sorting station names alphabetically, "Filbert Street" would come before "Gilbert Street." Similarly, numerical values like dates or quantities would ascend from the smallest to the largest.

But what if you want to flip the script and sort in descending order? That's where the ASC and DESC keywords come into play. Use ASC for ascending order and DESC for descending order. Here's how you can tweak your query. Let's also add DISTINCT clause so that we see rows without duplicate values.

SELECT DISTINCT start_station_name
FROM `bigquery-public-data.san_francisco_bikeshare.bikeshare_trips`
ORDER BY start_station_name DESC



Adding Complexity: Multiple Columns

ORDER BY isn't limited to a single column. You can level up your sorting game by incorporating multiple columns into your query. This is particularly useful when you want to prioritize one sorting criterion over another.

Consider this example where we sort trips by start station name and then by end station name:

SELECT start_station_name, end_station_name
FROM `bigquery-public-data.san_francisco_bikeshare.bikeshare_trips`
ORDER BY start_station_name, end_station_name


By listing multiple columns after ORDER BY, you're creating a hierarchical sorting system. First, the query sorts by start station name. Then, for trips with the same start station, it further sorts by end station name.
Run the query and see for yourself!


Unlocking Insights with ORDER BY

Now that you've mastered the basics, let's explore how ORDER BY can uncover valuable insights in your data. Imagine you're analyzing Bike Share stations based on their capacity. With ORDER BY, you can identify stations with the most and least bike capacity, helping you optimize bike distribution and maintenance schedules.

Let's run the query from a different table called bikeshare_station_info

SELECT name, capacity
FROM `bigquery-public-data.san_francisco_bikeshare.bikeshare_station_info`
ORDER BY capacity DESC


By sorting stations by capacity in descending order, you'll quickly pinpoint the stations with the highest bike capacity, such as "Glen Park Bath Station" with 40 bikes.


Tips and Tricks for Success

As you embark on your ORDER BY journey, keep these tips in mind:
Choose Your Order Wisely: Consider the purpose of your analysis and choose the most relevant sorting criteria.

Optimize Performance: ORDER BY can impact query performance, especially with large datasets. Use it judiciously to avoid unnecessary processing overhead.

Experiment with Multiple Columns: Don't hesitate to experiment with sorting by multiple columns to uncover deeper insights in your data.Stay Curious: Keep exploring and refining your queries to extract maximum value from your data.


ORDER BY with Dates

You can do so much more with the ORDER BY statement, such as uncovering insights via date, using multiple sorting, and so on. To see the full demonstration, check out the YouTube video on the same topic, where we showcase how we use ORDER BY to uncover insights from the order_item table for e-commerce data.

Conclusion

In conclusion, ORDER BY is a powerful tool in your BigQuery toolkit. By mastering its intricacies and applying it strategically, you can transform raw data into actionable insights. So go ahead, dive into your datasets, and let ORDER BY lead the way to data inisights!

Jump to chapter 4

Chapter 4: BigQuery WHERE Statement (Part 2) Coming Soon