Google Cloud Dataprep is an intelligent data service on Google Cloud Platform for exploring, cleaning, and preparing structured and unstructured data.
Here are five principles worth knowing before you start preparing data with Dataprep.
1. Create a baseline dataset by profiling the source data
Before you start cleaning your dataset, it is helpful to create a visual profile of the source data. First, load the dataset into the Transformer page and create a minimal recipe on it. Then, click Run Job to generate a profile of the data. You can use this profile as a baseline for validating your results and for tracing the origin of any data problems you discover later.
2. Normalize data before applying Deduplicate Transform
Removing identical rows from your dataset after a uniqueness check is a common step in data preparation. Google Cloud Dataprep provides a single transform, deduplicate, which removes identical rows from your dataset.
It has two limitations:
- This transform is case-sensitive. So, if a column has values that differ only in case, for example Darren and DARREN, the rows containing those values are not considered duplicates and are not removed by this transform.
- Whitespace at the beginning and end of values is not ignored.
You therefore need to normalize your data before applying the deduplicate transform. For example, you can use the LOWER function to make the case of every entry in a column consistent, then call the TRIM function to remove leading and trailing whitespace.
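To make the effect concrete, here is a minimal sketch in plain Python rather than in a Dataprep recipe. The helper names `normalize` and `deduplicate` and the sample rows are hypothetical; they only mimic the behavior described above (exact-match deduplication, with lowercasing and trimming applied first).

```python
def normalize(value: str) -> str:
    """Lowercase a value and strip leading and trailing whitespace."""
    return value.strip().lower()


def deduplicate(rows):
    """Drop exact duplicate rows while keeping first-seen order."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


# Hypothetical sample rows: (name, city). They differ only by case and padding.
rows = [("Darren", "Taipei"), ("DARREN", "Taipei"), (" darren ", "Taipei")]

# Exact matching treats these as three distinct rows.
raw_result = deduplicate(rows)

# Normalizing first lets the same rows collapse into one.
normalized_rows = [(normalize(name), city) for name, city in rows]
clean_result = deduplicate(normalized_rows)

print(len(raw_result))  # 3
print(clean_result)     # [('darren', 'Taipei')]
```

Without the normalization step all three rows survive; with it, only one remains.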
3. Join early and Union later
You can enrich your data by joining or unioning datasets from multiple sources. Perform join operations early in your recipe, so that later changes to your join keys are less likely to affect the results of the join.
Union operations should be performed later in the recipe. By doing them later in the process, you minimize the chance of changes to the union operation, including dataset refreshes, affecting the recipe and the output.
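The ordering argument can be sketched in plain Python. This is only an illustration under assumptions of my own: the datasets, the `join_customers` and `reformat_key` helpers, and the idea of a later step that rewrites the key are all hypothetical, and none of it is Dataprep syntax.

```python
# Hypothetical datasets: orders keyed by customer_id, plus a lookup table.
customers = {"c1": "Alice", "c2": "Bob"}
orders_2023 = [
    {"customer_id": "c1", "amount": 10},
    {"customer_id": "c2", "amount": 20},
]
orders_2024 = [{"customer_id": "c1", "amount": 30}]


def join_customers(orders, lookup):
    """Left join: attach the customer name, or None when the key has no match."""
    return [{**o, "name": lookup.get(o["customer_id"])} for o in orders]


def reformat_key(orders):
    """A later recipe step that rewrites the join key (here: uppercases it)."""
    return [{**o, "customer_id": o["customer_id"].upper()} for o in orders]


# Join early: keys are still in their original form, so every row matches.
joined_early = reformat_key(join_customers(orders_2023, customers))

# Join late: the key was changed first, so the lookup silently misses.
joined_late = join_customers(reformat_key(orders_2023), customers)

# Union last: prepare each source the same way, then append the results.
final = joined_early + reformat_key(join_customers(orders_2024, customers))

print([o["name"] for o in joined_early])  # ['Alice', 'Bob']
print([o["name"] for o in joined_late])   # [None, None]
print(len(final))                         # 3
```

The late join does not fail loudly; it just produces empty matches, which is why keeping joins ahead of key-changing steps is the safer habit.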
4. Use statistical information to evaluate generated data
After you have completed your recipe and run the job, open the profile of the generated results and the profile you created for the source data in separate browser tabs. This lets you evaluate how consistent and complete your data remains from the beginning to the end of the wrangling process.
Instead of comparing data row by row, compare the statistics in the generated profile with the statistics from the source profile, so that you can check whether your recipe steps have introduced unwanted changes to these values.
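The same idea, comparing summaries instead of rows, can be sketched outside Dataprep with Python's standard library. The `profile` and `compare_profiles` helpers and the sample column are hypothetical, and the statistics chosen here are just a small subset of what a real profile reports.

```python
import statistics


def profile(values):
    """Summarize a numeric column; None stands in for a missing value."""
    present = [v for v in values if v is not None]
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "mean": statistics.mean(present),
    }


def compare_profiles(baseline, result):
    """Return only the statistics that differ, as (baseline, result) pairs."""
    return {
        key: (baseline[key], result[key])
        for key in baseline
        if baseline[key] != result[key]
    }


# Hypothetical column before and after a recipe that drops one row
# and fills the missing value with 0.
source_amounts = [10, 20, None, 40]
output_amounts = [10, 20, 0]

diff = compare_profiles(profile(source_amounts), profile(output_amounts))
for name, (before, after) in diff.items():
    print(f"{name}: {before} -> {after}")
```

In this example every statistic changes: the row count drops from 4 to 3, the missing count from 1 to 0, and the minimum, maximum, and mean all shift. A drop in missing values may be intended, while a lower maximum is a hint that a legitimate row was removed.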
5. Keep recipe records after profiling source data
For record keeping, click View Recipe, then copy and paste the recipe used to create the profile. You can also use Download Recipe to save it as a text file.
These are the five principles worth knowing before you start working on your datasets with Google Cloud Dataprep. If you have any questions about building data pipelines or training machine learning models in the cloud, feel free to leave me a message. Thanks for reading.
Originally published at my Medium blog.