Jul 8, 2024

Generate dummy data with Faker in Python

A guide on how to use the Faker package in Python to populate a dummy dataset.

What is dummy data?

Dummy data is fictitious information that is generated or used to simulate real data in various contexts such as testing, development, and training. This type of data is designed to mimic the characteristics and structure of actual data, without containing any meaningful or sensitive information. Dummy data is commonly used to:

  1. Test Software: Ensure that applications function correctly under various conditions and inputs.
  2. Develop Systems: Facilitate the creation and debugging of programs by providing sample data for processing and manipulation.
  3. Train Algorithms: Serve as input for machine learning models during the development phase.
  4. Demonstrate Features: Illustrate the capabilities of software products without exposing real user data.

Why use dummy data?

Fictitious data is needed for a variety of purposes. Whether you are testing, anonymising sensitive data, or adding “noise” to a training dataset, it helps to have a fake dataset in the same shape as the real data. You may also need dummy data for operational purposes, that is, to check how the code you have developed reacts to different types of input.

However, finding data in the specific format you need can be difficult. So, where do you get dummy data for your own application? There is an elegant solution to this problem in the form of the Faker package. With Python, you can use the Faker package to generate data according to your needs. Faker is an open source library designed to generate different types of synthetic data.

How to populate a database with dummy data?

In this article, we’ll take a quick tour of the Faker package in Python and how to use it to create a dummy dataset.

The Faker library in Python is a popular tool for generating fake data for a variety of uses, such as testing, development, and training machine learning models. It allows users to create dummy data that mimics real-world data in a flexible and customizable manner. Faker can generate data in various formats, including names, addresses, dates, text, and more.

Key features of the Faker library include:

  1. Versatile Data Generation: Faker can generate a wide range of data types, including names, addresses, phone numbers, email addresses, job titles, company names, lorem ipsum text, dates, times, and even complex data structures.
  2. Localization: Faker supports multiple locales, allowing the generation of data that is specific to different countries and regions. This includes localized names, addresses, and other culturally relevant data.
  3. Customization: Users can customize the data generation process by creating their own providers or modifying existing ones to fit specific needs.
  4. Ease of Use: Faker is designed to be simple to use, with an intuitive API that makes it easy to generate data with just a few lines of code.

How to use the Faker library in Python

Installation and Use

Faker can generate random data in dozens of languages. Because it is an open source, community-driven library, it is constantly evolving: providers (generators specific to a certain type of data) are added regularly by the community. Let’s take a look at how to use it in code.

The installation can be done via pip with the command:

pip install Faker

The following two lines of code initialise Faker. The first line imports the Faker class, and the second creates a generator with the default locale, US English (en_US). To generate data for another locale, pass it as an argument (e.g. Faker("de_DE") for German).

from faker import Faker

fake = Faker()

Generating Fakes

Now, you are ready to generate whatever data you want. The generated data is called fake. As the name suggests, it is fake data that is randomly generated. Its purpose is to act as a substitute or placeholder for the actual data. A fake is generated when the method corresponding to the data type is called.

The name() method can be used to create a full name. Let’s jump into the code and check how these methods work.

for i in range(5):       # Returns full names

 print(fake.name())

>>>Samantha Fernandez

>>>Denise Barnes

>>>Jason Strong

>>>Edward Burton

>>>Tonya Rocha

However, if you want only the first or last name instead, you can use the first_name() and last_name() methods.

fake.first_name() # Returns a first name

>>>Samuel

Note that each call to these methods will generate a random name.

fake.last_name() # Returns last name

>>>Espinoza

To create addresses, you can use the address() method.

fake.address() # Returns an address

>>>3066 Mary Hills Suite 873

>>>Lake Stevenport, NV 32423

Moreover, the fake.sentence() method returns a string containing a random sentence, whereas fake.text() returns a randomly generated paragraph of text.

fake.sentence() # Returns a random sentence

>>>Never across staff attention within.

As can be seen below, fake.text() generates a random paragraph.

fake.text() # Returns a random text

>>>From send bed. Could country reveal send role. Guy involve issue picture get election. Sure do memory kitchen candidate fish defense. Try paper forward to build gas human.

Let’s say you want to generate a list of 5 email addresses. Each time it runs, the code below generates 5 random email addresses.

for i in range(5):       # generates 5 random emails

  print(fake.email())

>>>garciaeric@example.com

>>>logan01@example.net

>>>contrerasaustin@example.org

>>>rpreston@example.org

>>>brandy16@example.net

But when the data gets bigger, there is a chance that you would get the same email address more than once. So, to create unique dummy data using the Faker package, you can use the .unique property of the generator.

for i in range(5):     # generates 5 unique random emails

  print(fake.unique.email())

>>>hughesbrian@example.org

>>>raymondchapman@example.org

>>>vicki25@example.com

>>>munozzachary@example.net

>>>karen44@example.org

Each time the above code runs, it will generate 5 unique email addresses. This is quite helpful when you are generating data such as IDs, which must not be repeated.

Faker also has a method to generate a dummy profile.

fake.profile() # Returns a fake profile

>>>{'address': '64992 Becky Stream Apt. 932\nRebeccaville, WV 34184',

>>>'birthdate': datetime.date(2000, 3, 24),

>>>'blood_group': 'O-',

>>>'company': 'Lopez and Sons',

>>>'current_location': (Decimal('78.061493'), Decimal('-114.798399')),

>>>'job': 'Pharmacologist',

>>>'mail': 'rebeccahansen@yahoo.com',

>>>'name': 'Autumn Sanchez',

>>>'residence': '8702 Matthew Circles Apt. 938\nDickersonfurt, WA 82226',

>>>'sex': 'F',

>>>'ssn': '534-29-2074',

>>>'username': 'llowe',

>>>'website': ['http://hawkins.com/', 'https://wolf.com/']}

So far we have used Faker generator methods like name(), first_name(), last_name(), email(), etc. Many more such methods are packaged in ‘Providers’. Some are standard providers that ship with Faker, while others are developed by the community.

Standard Providers

There are many standard providers, such as address, currency, credit_card, date_time, internet, geo, person, profile, and bank, that help create the relevant dummy data. The full list of standard providers and their methods can be found in the Faker documentation.

Let’s have a look at some examples from faker.providers.address

for i in range(5):     # Returns 5 country names

  print(fake.country())

>>>Luxembourg

>>>Vietnam

>>>Tonga

>>>Mozambique

>>>Austria

You can also get country codes.

for i in range(5): # Returns 5 country codes

  print(fake.country_code())

>>>ES

>>>RO

>>>MH

>>>MR

>>>CL

As stated before, the default language is English and the default country is set to be the United States.

fake.current_country() #Returns current country

>>>United States

When the locale is changed the output of current_country(), current_country_code(), address(), etc will be changed as follows:

fake = Faker("de_DE")

fake.current_country_code() #Returns current country code

>>>DE

Community Providers

There are many community providers, such as Credit Score, Air Travel, Vehicle, and Music. You can also create your own provider and add it to Faker. The full list of community providers and their methods can be found in the Faker documentation.

Let’s have a look at some examples from faker_music. Before you start generating fake music data with a community provider, you need to install its package using pip.

pip install faker_music

And then you need to add the provider to your Faker instance:

from faker_music import MusicProvider

fake = Faker()

fake.add_provider(MusicProvider)

Now you are set to generate fake music data:

for i in range(5):    # Returns music genres

  print(fake.music_genre())

>>>Rock

>>>World

>>>Classical

>>>Pop

>>>Vocal

Localised Providers

You can create localised dummy data by passing the required locale as an argument to the Faker constructor. Faker also supports multiple locales. In that case, pass all locales as a Python list, as in the example shown below.

fake = Faker(['de_DE', 'fr_FR', 'ja_JP'])

for _ in range(10):

  print(fake.name())

>>>山本 陽子

>>>Lina Weinhold

>>>Dorothee Huhn

>>>Anika Henck-Hörle

>>>Ilonka Drubin MBA.

>>>Philomena Rohleder

>>>高橋 裕太

>>>Jacques Dumont Le Perrin

>>>斎藤 治

>>>小林 淳

The default locale is ‘en_US’, i.e. US English. Let’s write code to create 3 addresses in Germany.

fake = Faker("de_DE") # German locale, so address() returns German addresses

for i in range(3):

  print(fake.address())

>>>Rafael-Mende-Platz 04

>>>04196 Steinfurt

>>>Resi-Atzler-Allee 843

>>>96746 Coburg

>>>Scheibeplatz 5/1

>>>52115 Stollberg

fake = Faker("de_DE") # German locale, so administrative_unit() returns German federal states

for i in range(5):

  print(fake.administrative_unit())

>>>Bremen

>>>Hessen

>>>Rheinland-Pfalz

>>>Nordrhein-Westfalen

>>>Bayern

Generating a Dummy Dataset

We will create a fictitious dataset of 100 people with attributes such as id, name, email address, address, date of birth, country of birth, etc. We will use methods from the standard providers to create this data and store it in a pandas DataFrame.

# Import packages
from faker import Faker
from faker_music import MusicProvider
import pandas as pd

# Create the Faker generator
fake = Faker()

# Add the community music provider
fake.add_provider(MusicProvider)

# Generate fake records and return them as a dictionary keyed by row number
def generate_dummy_data(records):
    data = {}
    for i in range(records):
        data[i] = {}
        data[i]["id"] = fake.unique.random_number(8)
        data[i]["name"] = fake.name()
        data[i]["email_address"] = fake.unique.email()
        data[i]["address"] = fake.address()
        data[i]["date_of_birth"] = fake.date_between("-67y", "-18y")
        data[i]["country_of_birth"] = fake.country()
        data[i]["member_since"] = fake.date_time_between("-2y", "now")
    return data

# Call the function to generate 100 fake records
fake_data = generate_dummy_data(100)

# Convert the dictionary to a DataFrame; transpose so each record is a row
fake_data = pd.DataFrame(fake_data)
fake_data = fake_data.T
fake_data


Conclusion

Faker is a Python library for generating fake data, and it is practical in many situations. There are several alternatives to Faker, but it remains the best-known option in Python. It is popular because it makes it easy to create fake records that look real. With a simple loop, you can generate a large amount of dummy data in seconds.

I hope you enjoyed this article. If you have any questions, leave a comment below.