Drag-and-drop Data Modeling in Y42 (Part 1)
Start modeling your data with the Input and Field nodes. By Sanu Maharjan
In the last blog on the Y42 series, we covered data integrations into Y42. The raw data has all sorts of information embedded in it. It is important to have a variety of relevant columns and rows but too much unnecessary data could create clutter and lead to confusion.
For example, if the task is to find the correlation between money spent and revenue generated by an online store, it is not a wise decision to be looking at the customer’s shipping address. Yes, it provides the information on where the products are selling the most, but that is already a different topic. Focusing on only the required set of data should be the main goal of data cleaning. Now, let's take a look at some of Y42’s functionalities when it comes to data transformation.
At datadice, we are very fond of the F1 races. Every year we try to guess which driver would finish the laps in the shortest time, the top 5 players, and the best team. Since we are a data-driven company, we have collected all the relevant data for each race. I want to create a dashboard that looks at the points gained by each employee for correct guesses. I have already integrated the data into Y42. Need to refresh your memory on how to import data into Y42 again? Click me.
How to start modeling your data
All the data cleaning and processing are done in the ‘Model’ section. First, create a new model, by clicking on the “Add…” button on the right top corner and choosing “UI Model”, in this case. The “SQL Model” will be covered later in the series, so watch out for upcoming blogs. For now, let’s focus on the UI Model. Give it a name, like “f1_2022” and click on the “Create” button.
An empty canvas is created. This is where data pipelines are designed by a simple drag-and-drop feature. On the left, different transformation tools are sub-divided into 3 different topics, namely Table Manipulation, Column Manipulation, and Row Manipulation.
How to add data to a model
The first step in building pipelines begins with bringing the imported data to the canvas. Under Table Manipulation, there is a node called “Input”. Either by clicking it or dragging and dropping it to the canvas, you add the “Input” node to the model. When you click on the node, it gives you the option to select the data that needs to be imported. Since the first F1 race was in Bahrain, I’ll import allData-Bahrain.
After selecting the data, a preview is shown on the right side, which is a handy feature in Y42 to have a quick glance at the data you’re using. Now, click on the “Save” button below.
Let's look at some Column Manipulation nodes. As I explained earlier, it is wise to select only those columns which are essential for the end result. To filter out the unnecessary columns, click on the “Fields” node and drop it into the canvas.
Position it in front of the input node. Y42 will try to connect the nodes automatically, but it can also be done manually.
The color of the Fields node is showing up as red, which means that something is wrong. In our case, the node isn’t saved yet. In order to do that, click on the Fields node and it will show up all the columns present on the table. For better management of the nodes, it is recommended to provide a title to every node on the left top corner.
By clicking on “Input” at the bottom of the node, it shows the data that was inserted into the node. By clicking on “Output”, it shows the transformed data.
If we take a look at the data in the preview window at the right, the column names were generated automatically while importing the data. But as you can see, the last row of the table seems to show the correct column names. So let's first rename the column names and then delete the last row.
Changing column names
On the left side, different columns are shown. Under Changed ID, when double-clicked on any field, it is possible to rename the field. In a similar fashion, the new Display Name can also be changed. After renaming the rows, the color is shifted to yellow, indicating that the user has made some changes in the column name or data type.
The difference between ID and Display Name is, that ID is used for all further processes of the data workflow, whereas the Display Name is only for display, as the name suggests. When making some changes in the column, the ID name is always taken into account. I have made changes to the names of the columns, as shown in the screenshot below.
Some other differences between ID and Display Name are, that ID can’t have capital letters, spacebar, or special characters, whereas, for the display name, everything is allowed. Hence it is recommended to use the underscore symbol ( _ ) if the ID contains more than one word.
Another rich feature of Y42 while using the Fields node is that it already shows the data types. On the Type column, it is possible to change the data type. If there are columns that contain only numbers, then the data type can be converted as a number, or if the column has only boolean values (TRUE or FALSE). There is also an option for boolean data types. Note that, if the columns that contain dates are in “text” format, then it cannot be directly changed here, but there is a date parse node already built-in, which we will cover along the way.
Filtering the columns
Filtering columns is as easy as clicking. Because there is no need for the second to the last column, you just have to uncheck the box below the Show column, and BOOM! It is excluded. As simple as that.
If renaming and filtering are done, click on Output again. On the right side, it shows how the data looks after all the changes. If you see the expected result, then click on Save and it will take you back to the canvas.
I will end this session here for now but in the upcoming posts, I and my friend Debbi will build more complex pipelines and add more nodes to refine our data. In the end, we will build a dashboard and see who will have the most points for guessing correctly on the race.
This post is part of the Y42 series from datadice and the series gives you detailed information on different features in Y42.
If you think that Y42 could be an attractive solution to your data problems, be sure to get in touch with them.
Check out our LinkedIn account, to get insights into our daily working life and get important updates about BigQuery, Data Studio, and marketing analytics
We also started with our own YouTube channel. We talk about important DWH, BigQuery, Data Studio, and many more topics. Check out the channel here.
If you want to learn more about how to use Google Data Studio and take it to the next level in combination with BigQuery, check our Udemy course here.
Have trouble setting up an ETL Pipeline on Y42 or if you are looking for help to set up a modern and cost-efficient data warehouse or analytical dashboards in general, send us an email at firstname.lastname@example.org and we will schedule a call.