Generate Dummy Data With Faker

A guide on how to use Python’s Faker package to create a dummy dataset. By Fassil S. Yehuala


Fictitious data is required for a variety of purposes. Whether for testing, anonymising sensitive data, or adding “noise” to a training dataset, it can be beneficial to have access to a fake dataset in the same shape as the real data. Dummy data also lets you test what you have developed and see how your code reacts to different types of input.

 

However, finding the necessary data in the specific format you want can be difficult. So, where do you get dummy data for your own application? The Faker package offers an elegant solution to this problem. Faker is an open source Python library designed to generate different types of synthetic data according to your needs.

 

In this article, we’ll take a quick tour of Faker’s features and how to use them to create a dummy dataset.

Installation and Use

Faker allows you to generate random data in dozens of languages. Since Faker is an open source library maintained by its community, it is constantly evolving. Providers (generators specific to a certain type of data) are added regularly by the community. Let’s take a look at how to use it in code.

The installation can be done via pip with the command:

pip install Faker

With the following two lines of code you can initialise Faker. The first line imports the generator (the Faker class), and the second one initialises the generator with English as the default language. If you want to initialise Faker in another language, you need to pass the locale as an argument (e.g. Faker("de_DE") for German).

from faker import Faker

fake = Faker()

Generating Fakes

Now, you are ready to generate whatever data you want. The generated data is called fake. As the name suggests, it is fake data that is randomly generated. Its purpose is to act as a substitute or placeholder for the actual data. A fake is generated when the method corresponding to the data type is called.

The name() method can be used to create a full name. Let’s jump into the code and check how these methods work.

for i in range(5):       # Returns full names

  print(fake.name())

>>>Samantha Fernandez

>>>Denise Barnes

>>>Jason Strong

>>>Edward Burton

>>>Tonya Rocha

However, if you want only the first or last name instead, you can use the first_name() and last_name() methods.

fake.first_name() # Returns a first name

>>>Samuel

 

Note that each call to these methods will generate a new random name.

fake.last_name() # Returns last name

>>>Espinoza

 

To create addresses, you can use the address() method.

fake.address() # Returns an address

>>>3066 Mary Hills Suite 873

>>>Lake Stevenport, NV 32423

 

Moreover, the fake.sentence() method will return a string containing a random sentence, whereas fake.text() will return a randomly generated text.

fake.sentence() # Returns a random sentence

>>>Never across staff attention within.

As can be seen below, fake.text() generates a random paragraph.

fake.text() # Returns a random text

>>>From send bed. Could country reveal send role. Guy involve issue picture get election. Sure do memory kitchen candidate fish defense. Try paper forward to build gas human.

 

Let’s say you want to generate a list of 5 email addresses. Each time it runs, the code below generates 5 random email addresses.

 

for i in range(5):       # generates 5 random emails

   print(fake.email())

>>>garciaeric@example.com

>>>logan01@example.net

>>>contrerasaustin@example.org

>>>rpreston@example.org

>>>brandy16@example.net

But when the data gets bigger, there is a chance that you would get the same email address more than once. So, to create unique dummy data using the Faker package, you can use the .unique property of the generator.

for i in range(5):     # generates 5 unique random emails

   print(fake.unique.email())

>>>hughesbrian@example.org

>>>raymondchapman@example.org

>>>vicki25@example.com

>>>munozzachary@example.net

>>>karen44@example.org

 

Each time the above code runs, it will generate 5 unique email addresses. This is quite helpful when you are generating data such as IDs, which must not be repeated.

Faker also has a method to generate a dummy profile.

fake.profile() #Returns a fake profile

>>>{'address': '64992 Becky Stream Apt. 932\nRebeccaville, WV 34184',

>>>'birthdate': datetime.date(2000, 3, 24),

>>>'blood_group': 'O-',

>>>'company': 'Lopez and Sons',

>>>'current_location': (Decimal('78.061493'), Decimal('-114.798399')),

>>>'job': 'Pharmacologist',

>>>'mail': 'rebeccahansen@yahoo.com',

>>>'name': 'Autumn Sanchez',

>>>'residence': '8702 Matthew Circles Apt. 938\nDickersonfurt, WA 82226',

>>>'sex': 'F',

>>>'ssn': '534-29-2074',

>>>'username': 'llowe',

>>>'website': ['http://hawkins.com/', 'https://wolf.com/']}

 

So far we have used Faker generator methods like name(), first_name(), last_name(), email(), etc. Many more such methods are packaged in ‘Providers’. Some are standard providers, while others are developed by the community.

Standard Providers

There are many standard providers like address, currency, credit_card, date_time, internet, geo, person, profile, bank etc. that help create the relevant dummy data. More information on the full list of standard providers and their properties can be found here.

Let’s have a look at some examples from faker.providers.address

for i in range(5):     # Returns 5 country names

   print(fake.country())

>>>Luxembourg

>>>Vietnam

>>>Tonga

>>>Mozambique

>>>Austria

You can also get country codes.

for i in range(5): # Returns 5 country codes

   print(fake.country_code())

>>>ES

>>>RO

>>>MH

>>>MR

>>>CL

As stated before, the default language is English and the default country is set to be the United States.

fake.current_country() #Returns current country

>>>United States

When the locale is changed, the output of current_country(), current_country_code(), address(), etc. changes accordingly:

fake = Faker("de_DE")

fake.current_country_code() #Returns current country code

>>>DE

Community Providers

There are many community providers like Credit Score, Air Travel, Vehicle, Music, etc. You can also create your own provider and add it to your Faker instance. More information on the full list of community providers and their properties can be found here.

Let’s have a look at some examples from faker_music. Before you start generating fake music data using community providers, you need to install the package using pip.

pip install faker_music

And then you need to add the provider to your Faker instance:

from faker_music import MusicProvider

fake = Faker()

fake.add_provider(MusicProvider)

Now you are set to generate fake music data:

for i in range (5):    #Returns music genres

   print(fake.music_genre())

>>>Rock

>>>World

>>>Classical

>>>Pop

>>>Vocal

Localised Providers

You can create localised dummy data by passing the required locale as an argument when initialising the generator. Faker also supports multiple locales. In that case, the locales must be provided as a Python list, as in the example shown below.

fake = Faker(['de_DE', 'fr_FR', 'ja_JP'])

for _ in range(10):

   print(fake.name())

>>>山本 陽子

>>>Lina Weinhold

>>>Dorothee Huhn

>>>Anika Henck-Hörle

>>>Ilonka Drubin MBA.

>>>Philomena Rohleder

>>>高橋 裕太

>>>Jacques Dumont Le Perrin

>>>斎藤 治

>>>小林 淳

The default locale is ‘en_US’, i.e. US English. Let’s create 3 addresses in Germany.

fake = Faker("de_DE")  # German locale, so address() returns German addresses

for i in range(3):

   print(fake.address())

>>>Rafael-Mende-Platz 04

>>>04196 Steinfurt

>>>Resi-Atzler-Allee 843

>>>96746 Coburg

>>>Scheibeplatz 5/1

>>>52115 Stollberg

fake = Faker("de_DE")  # German locale, so administrative_unit() returns German federal states

for i in range(5):

   print(fake.administrative_unit())

>>>Bremen

>>>Hessen

>>>Rheinland-Pfalz

>>>Nordrhein-Westfalen

>>>Bayern

Generating a Dummy Dataset

We will create a fictitious dataset of 100 people with attributes such as id, name, email, address, date of birth, country of birth, etc. We will use methods from several standard providers to create this data and a pandas DataFrame to store it.

#Import packages

from faker import Faker

from faker_music import MusicProvider

import pandas as pd

#Declare faker object

fake = Faker()

#Add music faker

fake.add_provider(MusicProvider)

#Define function to generate fake data and return it as a nested dictionary

def generate_dummy_data(records):

   data={}

   #Iterate the loop and generate fake data

   for i in range(0, records):

       data[i]={}

       data[i]["id"] = fake.unique.random_number(8)

       data[i]["name"] = fake.name()

       data[i]["email_address"] = fake.unique.email()

       data[i]["address"] = fake.address()

       data[i]["date_of_birth"] = fake.date_between("-67y", "-18y")

       data[i]["country_of_birth"] = fake.country()

       data[i]["member_since"] = fake.date_time_between("-2y", "now")

   return data

#Call the function to generate 100 fake records

fake_data = generate_dummy_data(100)

# Convert the nested dictionary to a DataFrame (one row per record after transposing)

fake_data = pd.DataFrame(fake_data)

fake_data = fake_data.T

fake_data


Conclusion

Faker is a Python library for generating fake data, and it is practical in many situations, such as testing and anonymisation. There are alternatives, but Faker remains one of the best-known options in Python. It is popular because it offers an easy way to create fake records that look real: with a few lines of code and a simple loop, you can generate a large number of records in seconds.

I hope you enjoyed this article. If you have any questions leave a comment below.
