In the last blog on the Y42 series, we covered data integrations into Y42. The raw data has all sorts of information embedded in it. It is important to have a variety of relevant columns and rows but too much unnecessary data could create clutter and lead to confusion.
For example, if the task is to find the correlation between money spent and revenue generated by an online store, it is not a wise decision to be looking at the customer’s shipping address. Yes, it provides the information on where the products are selling the most, but that is already a different topic. Focusing on only the required set of data should be the main goal of data cleaning. Now, let's take a look at some of Y42’s functionalities when it comes to data transformation.
At datadice, we are very fond of the F1 races. Every year we try to guess which driver would finish the laps in the shortest time, the top 5 players, and the best team. Since we are a data-driven company, we have collected all the relevant data for each race. I want to create a dashboard that looks at the points gained by each employee for correct guesses. I have already integrated the data into Y42. Need to refresh your memory on how to import data into Y42 again? Click me.
All the data cleaning and processing are done in the ‘Model’ section. First, create a new model, by clicking on the “Add…” button on the right top corner and choosing “UI Model”, in this case. The “SQL Model” will be covered later in the series, so watch out for upcoming blogs. For now, let’s focus on the UI Model. Give it a name, like “f1_2022” and click on the “Create” button.
An empty canvas is created. This is where data pipelines are designed by a simple drag-and-drop feature. On the left, different transformation tools are sub-divided into 3 different topics, namely Table Manipulation, Column Manipulation, and Row Manipulation.
The first step in building pipelines begins with bringing the imported data to the canvas. Under Table Manipulation, there is a node called “Input”. Either by clicking it or dragging and dropping it to the canvas, you add the “Input” node to the model. When you click on the node, it gives you the option to select the data that needs to be imported. Since the first F1 race was in Bahrain, I’ll import allData-Bahrain.
After selecting the data, a preview is shown on the right side, which is a handy feature in Y42 to have a quick glance at the data you’re using. Now, click on the “Save” button below.
Let's look at some Column Manipulation nodes. As I explained earlier, it is wise to select only those columns which are essential for the end result. To filter out the unnecessary columns, click on the “Fields” node and drop it into the canvas.
Position it in front of the input node. Y42 will try to connect the nodes automatically, but it can also be done manually.
The color of the Fields node is showing up as red, which means that something is wrong. In our case, the node isn’t saved yet. In order to do that, click on the Fields node and it will show up all the columns present on the table. For better management of the nodes, it is recommended to provide a title to every node on the left top corner.
By clicking on “Input” at the bottom of the node, it shows the data that was inserted into the node. By clicking on “Output”, it shows the transformed data.
If we take a look at the data in the preview window at the right, the column names were generated automatically while importing the data. But as you can see, the last row of the table seems to show the correct column names. So let's first rename the column names and then delete the last row.
On the left side, different columns are shown. Under Changed ID, when double-clicked on any field, it is possible to rename the field. In a similar fashion, the new Display Name can also be changed. After renaming the rows, the color is shifted to yellow, indicating that the user has made some changes in the column name or data type.
The difference between ID and Display Name is, that ID is used for all further processes of the data workflow, whereas the Display Name is only for display, as the name suggests. When making some changes in the column, the ID name is always taken into account. I have made changes to the names of the columns, as shown in the screenshot below.
Some other differences between ID and Display Name are, that ID can’t have capital letters, spacebar, or special characters, whereas, for the display name, everything is allowed. Hence it is recommended to use the underscore symbol ( _ ) if the ID contains more than one word.
Another rich feature of Y42 while using the Fields node is that it already shows the data types. On the Type column, it is possible to change the data type. If there are columns that contain only numbers, then the data type can be converted as a number, or if the column has only boolean values (TRUE or FALSE). There is also an option for boolean data types. Note that, if the columns that contain dates are in “text” format, then it cannot be directly changed here, but there is a date parse node already built-in, which we will cover along the way.
Filtering columns is as easy as clicking. Because there is no need for the second to the last column, you just have to uncheck the box below the Show column, and BOOM! It is excluded. As simple as that.
If renaming and filtering are done, click on Output again. On the right side, it shows how the data looks after all the changes. If you see the expected result, then click on Save and it will take you back to the canvas.
I will end this session here for now but in the upcoming posts, I and my friend Debbi will build more complex pipelines and add more nodes to refine our data. In the end, we will build a dashboard and see who will have the most points for guessing correctly on the race.