Synthetic data generator and test data automation using intelligent OSS and online productivity tools

Test data is critical to the success of software products that are being developed, maintained and enhanced. But how many of us really give it the importance it deserves?

If a software product has the required volume of realistic test data, most issues or bugs can be identified or avoided before the product goes live. The cost of fixing defects stays lower and under control, and the final product can be delivered on schedule, i.e. the project can be completed on time or earlier and within budget (time and cost).

Having realistic test data will also motivate developers to code better. Otherwise, they may enter data like 'Test 1', 'Name abcd', 'bla bla', 'lorem ipsum', 123456789 for phone numbers, 1234567898762345 for the credit card, or 'Address 1, 3, test city, mystate', which looks odd and irritating to product owners, stakeholders or anyone else. And many times we go with such junk data for demos... OMG! We should stop using such junk data from the very beginning of software development.

The solution is to generate synthetic data using tools that cut the overall development and testing effort severalfold. With data that looks realistic and comes in the required volume (in the thousands), we can address both functional and performance aspects easily. Generating test data involves developers more than testers, or else testers who can program (Google calls them Software Engineers in Test, SET, and Microsoft calls them SDET).

Synthetic data generation and integration into development and testing workflow

How can we generate thousands of realistic test records (also called SYNTHETIC DATA) in various combinations, as per the domain model and industry vertical of the software you are building? It is challenging and can require more intelligence (artificial intelligence, deep learning) than the mainstream product or project you are developing. But there are many free and paid synthetic data generation tools available in the market, which can be brought into your test strategy and workflow early in the project development life cycle.


I am using Mockaroo to generate synthetic data, and the benefit I am getting is huge in terms of effort reduction; the quality of the final deliverable is also stupendous. Mockaroo has a free tier and a paid tier, but the free one will suffice for most of your needs. It also exposes a REST API, which you can easily integrate into a test automation workflow for repeatable test data generation. Usually, I use Node.js to stitch my test data generation strategy together.



Another product I like is free and open source, and its code can be forked for many purposes.

A live use case from one of my projects is given here with a fictitious data model to understand the process and benefits.

I had a geo-based (lat, lon) application into which I needed to inject thousands of test records, spread equally in all 8 directions (North, South, East, West, North East, North West, South East, South West). Each point should be plotted X meters away from the next (using the Haversine formula). You can see the complete working code on my GitHub (TBD, please stay tuned). Follow the screenshots for how I set it up:

Step # 1 Fictitious table for which test data needs to be generated


Step # 2 Signup or login

Step # 3 Define your test data schema as below



Step # 4 Preview test data for the newly created schema [I have chosen JSON as it is easy to manipulate programmatically]




Data attributes shaded in yellow in the screenshot below will be manipulated later by another program stitched into a Node.js application [this process deserves a separate blog post]. For latitude and longitude, the Haversine formula is used to plot a new lat, lon for a given lat, lon and distance in meters. For the thumbnail source, the Flickr API is used to resolve unique photo thumbnails for given tags and text search, with parental controls, etc.
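To make the lat, lon manipulation concrete, here is a minimal sketch of how points could be spread in the 8 directions at a fixed spacing. It is not the code from my project. It assumes a spherical Earth with a mean radius of 6,371 km and uses the standard destination-point formula (the companion of the Haversine distance formula) to move from a start point along a bearing. The names destinationPoint, haversineDistance and spreadPoints, and the sample centre used below, are hypothetical.

```javascript
// Sketch only: all function and constant names here are hypothetical.
const EARTH_RADIUS_M = 6371000; // mean Earth radius in meters (spherical assumption)

const toRad = (deg) => (deg * Math.PI) / 180;
const toDeg = (rad) => (rad * 180) / Math.PI;

// Bearings in degrees, measured clockwise from north.
const BEARINGS = { N: 0, NE: 45, E: 90, SE: 135, S: 180, SW: 225, W: 270, NW: 315 };

// New lat, lon reached by travelling distanceM meters from (lat, lon) along bearingDeg.
function destinationPoint(lat, lon, bearingDeg, distanceM) {
  const phi1 = toRad(lat);
  const lambda1 = toRad(lon);
  const theta = toRad(bearingDeg);
  const delta = distanceM / EARTH_RADIUS_M; // angular distance in radians

  const phi2 = Math.asin(
    Math.sin(phi1) * Math.cos(delta) +
      Math.cos(phi1) * Math.sin(delta) * Math.cos(theta)
  );
  const lambda2 =
    lambda1 +
    Math.atan2(
      Math.sin(theta) * Math.sin(delta) * Math.cos(phi1),
      Math.cos(delta) - Math.sin(phi1) * Math.sin(phi2)
    );

  // Normalise longitude to the range [-180, 180).
  const lon2 = ((toDeg(lambda2) + 540) % 360) - 180;
  return { lat: toDeg(phi2), lon: lon2 };
}

// Haversine great-circle distance in meters between two { lat, lon } points.
function haversineDistance(a, b) {
  const dPhi = toRad(b.lat - a.lat);
  const dLambda = toRad(b.lon - a.lon);
  const h =
    Math.sin(dPhi / 2) ** 2 +
    Math.cos(toRad(a.lat)) * Math.cos(toRad(b.lat)) * Math.sin(dLambda / 2) ** 2;
  return 2 * EARTH_RADIUS_M * Math.asin(Math.sqrt(h));
}

// pointsPerDirection points along each of the 8 bearings, stepM meters apart.
function spreadPoints(center, stepM, pointsPerDirection) {
  const points = [];
  for (const [direction, bearing] of Object.entries(BEARINGS)) {
    for (let i = 1; i <= pointsPerDirection; i++) {
      const p = destinationPoint(center.lat, center.lon, bearing, i * stepM);
      points.push({ direction, lat: p.lat, lon: p.lon });
    }
  }
  return points;
}

// Usage: 3 points per direction, 500 m apart, around an arbitrary sample centre.
const samplePoints = spreadPoints({ lat: 10, lon: 20 }, 500, 3);
console.log(samplePoints.length, samplePoints[0]);
```

Each generated lat, lon pair can then overwrite the placeholder coordinates in a generated record. The spherical model is an approximation, so expect small deviations from true ground distance.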



Step # 5 Save your schema


Step # 6 Save your schema

Step # 7 Copy the REST API endpoint, which can be accessed programmatically from any REST client.

Step # 8 Optional step: test the REST API (Generate data) from the Postman REST client (Postman is my choice)
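The same call can also be made from Node.js instead of Postman. The sketch below is an illustration only. It assumes Node.js 18 or later (for the built-in fetch), that the endpoint copied in Step 7 accepts the API key and record count as key and count query parameters (please verify against the Mockaroo API documentation), and that the response is JSON. The function names and the example.com URL are hypothetical placeholders.

```javascript
// Sketch only: function names are hypothetical, and the "key" and "count"
// query parameter names are assumptions to verify against the API docs.

// Build the request URL from the endpoint copied in Step 7.
function buildGenerateUrl(endpoint, apiKey, count) {
  const url = new URL(endpoint);
  url.searchParams.set("key", apiKey);
  url.searchParams.set("count", String(count));
  return url.toString();
}

// Fetch generated records and always return them as an array.
async function fetchSyntheticRecords(endpoint, apiKey, count) {
  const response = await fetch(buildGenerateUrl(endpoint, apiKey, count));
  if (!response.ok) {
    throw new Error(`Request failed with HTTP ${response.status}`);
  }
  const body = await response.json();
  // Defensive: wrap a single object so callers can always iterate.
  return Array.isArray(body) ? body : [body];
}

// Usage (placeholder endpoint and key, so it is left commented out):
// fetchSyntheticRecords("https://example.com/api/generate.json", "YOUR_API_KEY", 1000)
//   .then((records) => console.log(records.length))
//   .catch((err) => console.error(err.message));
```

Keeping the URL construction in its own function makes it easy to check the request without touching the network.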


Step # 9 Optional step: automate the data load by combining the Mockaroo-generated synthetic data with the data transformations for lat, lon and thumbnail source. Store the final result as an xls/csv file (JSON2XLS) or insert it directly into the SQL table, using any programming language of your choice.
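As an illustration of the storage part of Step 9, here is a small sketch that turns an array of JSON records into CSV text. It is a hand-rolled serializer written for this post, not the JSON2XLS package mentioned above. It assumes every record has the same flat set of fields as the first one. The function names and the sample records are hypothetical.

```javascript
// Sketch only: a minimal CSV serializer, not the JSON2XLS package.

// Quote a value if it contains a comma, a double quote or a line break.
function escapeCsvValue(value) {
  const text = value === null || value === undefined ? "" : String(value);
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Convert flat records to CSV text; columns come from the first record.
function recordsToCsv(records) {
  if (records.length === 0) return "";
  const columns = Object.keys(records[0]);
  const header = columns.map(escapeCsvValue).join(",");
  const rows = records.map((record) =>
    columns.map((column) => escapeCsvValue(record[column])).join(",")
  );
  return [header, ...rows].join("\r\n") + "\r\n";
}

// Usage with two made-up records.
const sampleCsv = recordsToCsv([
  { id: 1, name: 'Ann "A" Lee', city: "Pune, IN" },
  { id: 2, name: "Bob", city: null },
]);
console.log(sampleCsv);
```

The resulting string can be written to disk with Node's fs.writeFileSync, or the same records can be passed to a SQL client of your choice instead.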

I have used Node.js to weave all of the above tasks together. The complete application can be viewed online in my public Cloud9 IDE. [Note: to access the editor, you need a Cloud9 login, which comes with free registration.]

Alternatively, you can download the code from my Git repo.

