A Comprehensive Guide to Synthetic Qualitative Data: Techniques, Tools and Best Practices

Generating and using synthetic data for qualitative research and insights


FEB 2024 [UPDATED AUG 2024]

The concept of synthetic data isn't new, but the term has previously been reserved for data used in training ML models. Synthetic qualitative data introduces a novel approach to data generation which can aid research, analytics and decision-making.

Synthetic qualitative data may seem like a complex idea. Fortunately, with significant improvements in Large Language Models, prompt engineering, and proprietary frameworks, there are tools and techniques available that can simplify this process, leading to high quality data, improved insights and faster results.

To facilitate a deeper understanding of synthetic data, this guide will be divided into smaller sections: defining synthetic qualitative data, the advantages it offers, how to generate it, and best practices for using it. Throughout the guide you'll also find tips & tricks, and links to our hands-on guide which provides worked examples.

So, whether you're a researcher, data scientist, analyst, consultant, part of a product or analytics team, or just looking to learn, this guide is for you.

Contents:

- What is synthetic qualitative data?
- When should you use synthetic qualitative data?
- Characteristics of synthetic qualitative data
- How to generate synthetic qualitative data
- Best practices for generating synthetic qualitative data
- Summary & Key Takeaways

What is synthetic qualitative data?

At its core, synthetic data is data that is artificially generated using algorithms or simulations. It is not real-world data, but it should reflect the complexity and characteristics of real-world data. This allows researchers, analysts and data enthusiasts to augment existing datasets with generated data, supporting analysis and helping them develop findings and draw stronger conclusions.

Synthetic qualitative data is a specific type of artificial data, typically generated using Large Language Models (LLMs) rather than collected from real-world observations or events. It is designed to mimic the characteristics of real-world qualitative data without the complex, costly and time-consuming collection process, and without containing any personally identifiable information or sensitive details.

The generation of synthetic qualitative data involves the use of advanced machine learning and artificial intelligence techniques. This typically involves natural language generation (NLG) for text data, as well as other specialized AI techniques. Fortunately, with the advent of large language models (like ChatGPT), synthetic data can now be generated without code. However, there are some key considerations for how and when you use synthetic data.


When should you use synthetic qualitative data?

The use of synthetic qualitative data has several advantages. It can help overcome issues related to data privacy and confidentiality, it can be used where real-world data is scarce or difficult to collect, and it allows for the creation of large and diverse datasets, improving the richness of your analysis and insights. Here are just a few situations where synthetic data can help:


Characteristics of synthetic qualitative data

Generating synthetic data isn't trivial, and errors can lead to inaccurate outputs and analysis. But with a well-developed data generation process (we'll get into that later), synthetic qualitative data has been found to closely match the quality of human-generated data¹, while offering some unique advantages over its real-world cousin:

¹ Marketing Science (2024). https://doi.org/10.1287/mksc.2023.0454

How to generate synthetic qualitative data

Step 1: Defining your parameters

Generating high-quality synthetic data requires real-world input data.

Just as you might recruit specific participants for a qualitative study based on set parameters, you need to define these parameters for your synthetic data. Think of this as setting your synthetic sample. Your synthetic sample will be defined by your project needs, use case, and/or research question.

Successful techniques include defining demographic variables, establishing user personas, and/or modifying existing datasets to generate a set of core parameters. Within these parameters, you'll likely need to introduce a level of variability to reflect real-world samples and complexities. Key questions to consider include:

Using an LLM to help you generate your synthetic sample is a quick and useful approach, but the quality of your sample will depend on the quality of your instructions. Even with well-defined instructions it's vital to check your data for consistency.
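
To make this concrete, here's a minimal Python sketch of what setting a synthetic sample might look like in practice. The study topic, parameter names, and values are purely hypothetical; substitute the variables your research question actually requires.

```python
import random

# Illustrative parameter set for a hypothetical study on grocery-shopping habits.
# Keep the list short: every extra variable is one more thing the model must respect.
PARAMETERS = {
    "age_band": ["18-24", "25-34", "35-44", "45-54", "55+"],
    "household": ["single", "couple", "family with children"],
    "shops_online": [True, False],
}

def build_sample(n: int, seed: int = 42) -> list[dict]:
    """Draw n synthetic participants with randomized (but reproducible) traits."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in PARAMETERS.items()}
        | {"participant_id": f"SP-{i:03d}"}
        for i in range(1, n + 1)
    ]

if __name__ == "__main__":
    for participant in build_sample(3):
        print(participant)
```

The fixed seed makes the sample reproducible, while the random draws introduce the participant-to-participant variability that real-world samples exhibit.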

✔︎ Tip: There's a tradeoff here: the more variables you define, the more closely your outputs will reflect real-world data. However, too many variables increase the risk of erroneous outputs as the model struggles to stay within your parameter set. Keep parameters limited to only those you need.

Step 2: Verifying your inputs

You've defined your parameters and established your synthetic sample. But before diving into the process of generating synthetic data, it's crucial to verify your inputs.

This involves understanding any real data you're working with, including its statistical distributions, trends, patterns, and outliers. Synthetic data is meant to replicate real data, so it's important to understand the variables, context, and knowledge likely to impact your data generation process. Understanding your inputs also ensures you're able to better verify the accuracy of your outputs (more on this later).

Take the time to familiarize yourself with any input data you have. Ensure that there is no personally identifiable information (PII), and the data is free from unwanted biases, as these may be reflected in the synthetic data.
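
A first-pass input check can be partly automated. The sketch below runs hypothetical survey responses through a word-count summary and some deliberately rough PII patterns; a real project should pair this with manual review and a dedicated PII-detection tool.

```python
import re
from collections import Counter

# Hypothetical input: free-text survey responses collected from real participants.
responses = [
    "I mostly shop online because it saves time.",
    "Contact me at jane.doe@example.com if you need more detail.",
    "Prices matter more to me than brand loyalty.",
]

# Rough PII patterns -- emails and phone-like digit runs. These are a
# first-pass screen only, not a substitute for proper PII detection.
PII_PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),   # email addresses
    re.compile(r"\b(?:\+?\d[\s-]?){7,}\b"),   # phone-number-like digit runs
]

def screen_for_pii(texts: list[str]) -> list[tuple[int, str]]:
    """Return (index, matched text) for every suspected PII hit."""
    hits = []
    for i, text in enumerate(texts):
        for pattern in PII_PATTERNS:
            hits.extend((i, m) for m in pattern.findall(text))
    return hits

# Simple distribution check: bucketed word counts, so you know what
# a 'typical' response looks like before trying to replicate it.
lengths = Counter(len(r.split()) // 5 * 5 for r in responses)
print("Word-count buckets:", dict(lengths))
print("Suspected PII:", screen_for_pii(responses))
```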

Step 3: Generating your data

Once you've defined your parameters and verified your inputs, the next step is to generate your synthetic data. 

This involves using your chosen LLM to create data that mirrors the complexity and characteristics of real-world data, in whatever output format you need. Doing this well requires sufficient input data and sophisticated prompting, using a framework that will generate outputs in your desired form.

With LLMs, the quality of the output is only ever as good as the quality of your input data & prompt. If you haven't generated synthetic data before, we recommend running a number of tests with a small sample. This gives you time to ensure you're getting the desired outputs before scaling up the size and complexity of your generation.

✔︎ Tip: We recommend using an LLM that gives you access to system prompts and additional controls (e.g., Temperature, Top P, and Frequency Penalty), such as OpenAI's Playground. This gives you fuller customization and control over your outputs.

✔︎ Tip: Test various models, and always use the highest-quality model (the one with the most parameters) that you can access/afford. Models with more parameters generally produce higher-quality outputs, improving the richness of your synthetic data.
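
To illustrate both tips, here's a minimal sketch using OpenAI's Python SDK (the `openai` package, v1.0+). The model name, persona, prompt wording, and sampling values are illustrative assumptions rather than recommendations, and you'll need your own API key set in the `OPENAI_API_KEY` environment variable.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative persona, drawn from the synthetic sample defined in Step 1.
persona = "a 35-44 year old parent who mostly shops for groceries online"

# The system prompt pins the model to the persona; the sampling controls
# below trade consistency against variety in the generated responses.
response = client.chat.completions.create(
    model="gpt-4o",  # assumption: use the strongest model you can access/afford
    messages=[
        {"role": "system", "content": (
            f"You are {persona}. Answer interview questions in the first person, "
            "in 2-4 sentences, with realistic everyday detail."
        )},
        {"role": "user", "content": "How do you decide which supermarket to use?"},
    ],
    temperature=0.9,        # higher = more varied, more human-sounding answers
    top_p=0.95,
    frequency_penalty=0.4,  # discourages repeated stock phrases across answers
)

print(response.choices[0].message.content)
```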

Step 4: Verifying your outputs

After generating your synthetic data, the final step is to verify your outputs. 

This involves validating and verifying the quality of the synthetic data before using it for any analysis, predictions or insights. You can do this through manual checks (strongly recommended) or by using automated tools for data validation. 

In cases where data validity is critical, we recommend using a third party who wasn't involved in the data generation to verify your output data. This orthogonal approach to validation ensures independent verification of the quality of your data.

It's also important to ensure that the synthetic data maintains the same level of data privacy as the original data, without revealing any PII.
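
Some of this verification can be automated. The sketch below shows three illustrative checks (length bounds, an email-shaped PII scan, and a near-duplicate comparison); the thresholds and sample responses are hypothetical, and none of this replaces a manual read-through.

```python
import re
from difflib import SequenceMatcher

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def validate_outputs(outputs: list[str], min_words: int = 10, max_words: int = 120) -> list[str]:
    """Flag common failure modes in generated responses. Returns warning strings."""
    warnings = []
    for i, text in enumerate(outputs):
        words = len(text.split())
        if not (min_words <= words <= max_words):
            warnings.append(f"response {i}: length {words} words outside expected range")
        if EMAIL.search(text):
            warnings.append(f"response {i}: contains an email-like string (possible PII)")
    # Near-duplicate check: LLMs sometimes converge on the same phrasing.
    for i in range(len(outputs)):
        for j in range(i + 1, len(outputs)):
            ratio = SequenceMatcher(None, outputs[i], outputs[j]).ratio()
            if ratio > 0.85:
                warnings.append(f"responses {i} and {j} are {ratio:.0%} similar (near-duplicate)")
    return warnings

# Hypothetical generated outputs -- the second is a near-copy of the first.
sample_outputs = [
    "I usually pick whichever supermarket is cheapest that week, since feeding three kids adds up fast.",
    "I usually pick whichever supermarket is cheapest that week, since feeding the kids adds up fast.",
]
for warning in validate_outputs(sample_outputs):
    print(warning)
```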

✔︎ Tip: When producing synthetic data at scale, we recommend using an API-based solution rather than a chat interface. Models accessed via API will typically generate faster and allow for finer control over the model and its output format.
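
As a rough sketch of what API-based generation at scale might look like, the snippet below fans requests out across a thread pool. The personas, question, and model name are again illustrative assumptions; a production pipeline would also need rate-limit handling and retries.

```python
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_response(persona: str, question: str) -> str:
    """One API call per synthetic participant; see the Step 3 sketch for details."""
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative model choice
        messages=[
            {"role": "system", "content": f"You are {persona}. Answer in the first person."},
            {"role": "user", "content": question},
        ],
        temperature=0.9,
    )
    return response.choices[0].message.content

# Hypothetical slice of the synthetic sample from Step 1.
personas = [
    "a 25-34 year old single renter who shops in person",
    "a 55+ retiree in a couple household who shops online",
]
question = "How do you decide which supermarket to use?"

# A thread pool keeps several requests in flight at once -- far faster
# than pasting prompts into a chat interface one at a time.
with ThreadPoolExecutor(max_workers=4) as pool:
    answers = list(pool.map(lambda p: generate_response(p, question), personas))

for persona, answer in zip(personas, answers):
    print(persona, "->", answer)
```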

Best practices for generating synthetic qualitative data

There are numerous traditional methods and frameworks for capturing qualitative data, but synthetic data is a cutting-edge technique that should be added to a research or data team's toolkit with care. Here are a few key practices to keep in mind:


Summary & Key Takeaways

While real data will always be preferred for research, insights, and business decision-making, synthetic data is a powerful solution when real raw data is unavailable, prohibitively expensive, or too time-consuming to collect and analyze.

Generating synthetic data requires a strong grasp of data modeling and a clear understanding of the real data and its environment. By following the best practices outlined in this guide, you can effectively start generating and using synthetic data in your analysis and decision-making.

Try MindPort for better insights, faster ⬇︎

MindPort empowers researchers, data teams, and agencies to effortlessly generate synthetic data with unmatched efficiency. Whether you want us to run an end-to-end research project, or just generate the data for your in-house team, we use our proprietary data-driven platform to scope, synthesize, collect, analyze and share data with your team.

Using GPT can only get you so far. Inside MindPort, our data team can organize your projects, scope your requirements, generate a synthetic sample and track individual synthetic participants. We're able to generate validated data in a fully bespoke workflow. This means your teams get the data and insights they need, in one centralized location.

Take your insights and analytics a step further with MindPort's synthetic data and insights. MindPort transforms your data analysis process, making synthetic data generation a seamless and insightful journey from start to finish.

Interested? Contact us and we'll send you a synthetic sample for free.

Sign up to receive our insights & reports straight to your inbox. Always interesting, and never more than once per month. We promise.
