Generate dummy data with Faker in Python

What is dummy data?

Dummy data is fictitious information that is generated or used to simulate real data in contexts such as testing, development, and training. It is designed to mimic the characteristics and structure of actual data without containing any meaningful or sensitive information.

Why use dummy data?

Fictitious data is required for a variety of purposes. Whether for testing, anonymising sensitive data, or adding "noise" to a training dataset, it can be beneficial to have access to a fake dataset in the same shape as the real data. You may also need dummy data to test what you have developed and to see how your code reacts to different types of input.

However, finding data in the specific format you need can be difficult. So, where do you get dummy data for your own application? The Faker package offers an elegant solution to this problem. Faker is an open source Python library designed to generate different types of synthetic data according to your needs.

How to populate a database with dummy data?

In this article, we'll take a quick tour of the Faker package in Python and show how to use it to create a dummy dataset.

The Faker library in Python is a popular tool for generating fake data for a variety of uses, such as testing, development, and training machine learning models. It allows users to create dummy data that mimics real-world data in a flexible and customizable manner. Faker can generate data in various formats, including names, addresses, dates, text, and more.

How to use the Faker library in Python

Installation and Use

Faker allows you to generate random data in dozens of languages. Since Faker is an open source, community-driven library, it is constantly evolving. Providers (generators specific to a certain type of data) are added regularly by the community. Let's take a look at how to use it in code.

The installation can be done via pip with the command:

pip install Faker

The following two lines of code initialise Faker. The first line imports the generator (the Faker class), and the second creates a generator instance with US English as the default locale. To initialise Faker for another language, pass the locale as an argument (e.g. Faker("de_DE") for German).

from faker import Faker

fake = Faker()

Generating Fakes

Now you are ready to generate whatever data you want. Each generated value is called a fake. As the name suggests, it is randomly generated data that acts as a substitute or placeholder for the actual data. A fake is generated when the method corresponding to the data type is called.

The name() method can be used to create a full name. Let’s jump into the code and check how these methods work.

for i in range(5): # Prints 5 full names
    print(fake.name())

>>>Samantha Fernandez

>>>Denise Barnes

>>>Jason Strong

>>>Edward Burton

>>>Tonya Rocha

However, if you want only the first or last name, you can use the first_name() and last_name() methods.

fake.first_name() # Returns a first name

>>>Samuel

Note that each call to these methods generates a new random name.

fake.last_name() # Returns last name

>>>Espinoza

To create addresses, you can use the address() method.

fake.address() # Returns an address

>>>3066 Mary Hills Suite 873

>>>Lake Stevenport, NV 32423

Moreover, the fake.sentence() method returns a string containing a random sentence, whereas fake.text() returns a randomly generated paragraph.

fake.sentence() # Returns a random sentence

>>>Never across staff attention within.

As can be seen below, fake.text() generates a random paragraph.

fake.text() # Returns a random text

>>>From send bed. Could country reveal send role. Guy involve issue picture get election. Sure do memory kitchen candidate fish defense. Try paper forward to build gas human.

Let's say you want to generate a list of 5 email addresses. Each time it runs, the code below generates 5 random email addresses.

for i in range(5): # Generates 5 random emails
    print(fake.email())

>>>garciaeric@example.com

>>>logan01@example.net

>>>contrerasaustin@example.org

>>>rpreston@example.org

>>>brandy16@example.net

But as the dataset grows, there is a chance that you will get the same email address more than once. To create unique dummy data with the Faker package, you can use the .unique property of the generator.

for i in range(5): # Generates 5 unique random emails
    print(fake.unique.email())

>>>hughesbrian@example.org

>>>raymondchapman@example.org

>>>vicki25@example.com

>>>munozzachary@example.net

>>>karen44@example.org

Each time the above code runs, it will generate 5 unique email addresses. This is quite helpful when you are generating data such as IDs, which must not be repeated.

Faker also has a method to generate a dummy profile.

fake.profile() #Returns a fake profile

>>>{'address': '64992 Becky Stream Apt. 932\nRebeccaville, WV 34184',

>>>'birthdate': datetime.date(2000, 3, 24),

>>>'blood_group': 'O-',

>>>'company': 'Lopez and Sons',

>>>'current_location': (Decimal('78.061493'), Decimal('-114.798399')),

>>>'job': 'Pharmacologist',

>>>'mail': 'rebeccahansen@yahoo.com',

>>>'name': 'Autumn Sanchez',

>>>'residence': '8702 Matthew Circles Apt. 938\nDickersonfurt, WA 82226',

>>>'sex': 'F',

>>>'ssn': '534-29-2074',

>>>'username': 'llowe',

>>>'website': ['http://hawkins.com/', 'https://wolf.com/']}

So far we have used Faker generator methods like name(), first_name(), last_name(), email(), etc. Many more such methods are packaged in 'Providers'. Some are standard providers, while others are developed by the community.

Standard Providers

There are many standard providers, such as address, currency, credit_card, date_time, internet, geo, person, profile, and bank, that help create the relevant dummy data. The full list of standard providers and their methods can be found in the Faker documentation.

Let’s have a look at some examples from faker.providers.address

for i in range(5): # Prints 5 country names
    print(fake.country())

>>>Luxembourg

>>>Vietnam

>>>Tonga

>>>Mozambique

>>>Austria

You can also get country codes.

for i in range(5): # Prints 5 country codes
    print(fake.country_code())

>>>ES

>>>RO

>>>MH

>>>MR

>>>CL

As stated before, the default language is English and the default country is set to be the United States.

fake.current_country() #Returns current country

>>>United States

When the locale is changed, the output of current_country(), current_country_code(), address(), etc. changes accordingly:

fake = Faker("de_DE")

fake.current_country_code() #Returns current country code

>>>DE

Community Providers

There are many community providers, such as Credit Score, Air Travel, Vehicle, and Music. You can also create your own provider and add it to your Faker instance. The full list of community providers and their methods can be found in the Faker documentation.

Let's have a look at some examples from faker_music. Before you can generate fake music data with a community provider, you need to install its package using pip.

pip install faker_music

And then you need to add the provider to your Faker instance:

from faker_music import MusicProvider

fake = Faker()

fake.add_provider(MusicProvider)

Now you are set to generate fake music data:

for i in range(5): # Prints 5 music genres
    print(fake.music_genre())

>>>Rock

>>>World

>>>Classical

>>>Pop

>>>Vocal

Localised Providers

You can create localised dummy data by passing the required locale as an argument to the Faker generator. It also supports multiple locales. In that case, the locales must be provided as a Python list, as in the example shown below.

fake = Faker(['de_DE', 'fr_FR', 'ja_JP'])

for _ in range(10):
    print(fake.name())

>>>山本陽子

>>>Lina Weinhold

>>>Dorothee Huhn

>>>Anika Henck-Hörle

>>>Ilonka Drubin MBA.

>>>Philomena Rohleder

>>>高橋裕太

>>>Jacques Dumont Le Perrin

>>>斎藤治

>>>小林淳

The default locale is 'en_US', i.e. US English. Let's write code to create 3 addresses in Germany.

fake = Faker("de_DE") # Generator for German data

for i in range(3): # Prints 3 German addresses
    print(fake.address())

>>>Rafael-Mende-Platz 04

>>>04196 Steinfurt

>>>Resi-Atzler-Allee 843

>>>96746 Coburg

>>>Scheibeplatz 5/1

>>>52115 Stollberg

fake = Faker("de_DE") # Generator for German data

for i in range(5): # Prints 5 German federal states
    print(fake.administrative_unit())

>>>Bremen

>>>Hessen

>>>Rheinland-Pfalz

>>>Nordrhein-Westfalen

>>>Bayern

Generating a Dummy Dataset

We will create a fictitious dataset of 100 people with attributes such as id, name, email, address, date of birth, and country of birth. We will use methods from the standard providers to create this data and store it in a pandas DataFrame.

# Import packages
from faker import Faker
from faker_music import MusicProvider
import pandas as pd

# Create the Faker generator
fake = Faker()

# Add the music provider (not used in the records below)
fake.add_provider(MusicProvider)

# Generate fake records and return them as a dict keyed by row index
def generate_dummy_data(records):
    data = {}
    for i in range(records):
        data[i] = {
            "id": fake.unique.random_number(digits=8),
            "name": fake.name(),
            "email_address": fake.unique.email(),
            "address": fake.address(),
            "date_of_birth": fake.date_between("-67y", "-18y"),
            "country_of_birth": fake.country(),
            "member_since": fake.date_time_between("-2y", "now"),
        }
    return data

# Generate 100 fake records
fake_data = generate_dummy_data(100)

# Convert the dict of records to a DataFrame with one row per record
fake_data = pd.DataFrame.from_dict(fake_data, orient="index")

fake_data


Conclusion

Faker is a Python library for generating fake data, and it is practical in many situations. There are several alternatives to Faker, but it remains the best-known option in Python. It is popular because it makes it easy to create fake records that look real. With a few simple steps and a loop, you can generate a large amount of dummy data in seconds.

I hope you enjoyed this article. If you have any questions leave a comment below.
