A Comprehensive Guide to Synthetic Qualitative Data: Techniques, Tools and Best Practices
Generating and using synthetic data for qualitative research and insights
FEB 2024 [UPDATED AUG 2024]
The concept of synthetic data isn't new, but the term has previously been reserved for data used in training ML models. Synthetic qualitative data introduces a novel approach to data generation which can aid research, analytics and decision-making.
Synthetic qualitative data may seem like a complex idea. Fortunately, with significant improvements in Large Language Models, prompt engineering, and proprietary frameworks, there are tools and techniques available that can simplify this process, leading to high quality data, improved insights and faster results.
To facilitate a deeper understanding of synthetic data, this guide will be divided into smaller sections: defining synthetic qualitative data, the advantages it offers, how to generate it, and best practices for using it. Throughout the guide you'll also find tips & tricks, and links to our hands-on guide which provides worked examples.
So, whether you're a researcher, data scientist, analyst, consultant, part of a product or analytics team, or just looking to learn, this guide is for you.
What is synthetic qualitative data?
At its core, synthetic data is data that is artificially generated using algorithms or simulations. It is not real-world data, but it should reflect the complexity and characteristics of real-world data. This allows researchers, analysts and data enthusiasts to augment existing datasets with generated data, supporting analysis that develops findings and draws stronger conclusions.
Synthetic qualitative data is a specific type of artificial data that is typically generated using Large Language Models (LLMs) rather than being collected from real-world observations or events. It is designed to mimic the characteristics of real-world qualitative data without the complex, costly and time-consuming collection process, and without containing any personally identifiable information or sensitive details.
The generation of synthetic qualitative data involves the use of advanced machine learning and artificial intelligence techniques. This typically involves natural language generation (NLG) for text data, as well as other specialized AI techniques. Fortunately, with the advent of large language models (like ChatGPT), synthetic data can now be generated without code. However, there are some key considerations for how and when you use synthetic data.
When should you use synthetic qualitative data?
The use of synthetic qualitative data has several advantages. It can help overcome issues related to data privacy and confidentiality, it can be used in situations where real-world data is scarce or difficult to collect, and it allows for the creation of large and diverse datasets, improving the richness and insights from analysis. Here are just a few situational factors that may benefit from synthetic data:
Data privacy concerns: Synthetic data is an excellent choice when dealing with sensitive or regulated data. It allows you to generate data that resembles your original data without requiring or revealing any personally identifiable information (PII). Since the data is synthetic, there are significantly fewer concerns about privacy, data storage, data sharing, or regulatory compliance.
Unavailability of real data: In cases where real data is unavailable, synthetic data can be a lifesaver. For example, when working with entirely new products, niche audiences, or a hard-to-reach sample, data might be hard to find or non-existent. In these cases, synthetic data can be generated to help support analysis and decision-making.
Cost and time limitations: Real-world data has its advantages, but it's also costly and time-consuming to collect. Synthetic data generation is significantly more cost-effective and quicker than collecting real-world data, especially in scenarios that involve complex or large-scale data collection.
Pilots, trials and beta-tests: You might intend to capture real-world data, but want to start with a pilot to test your ideas and experiment with your methods. Running a real-world pilot not only consumes limited resources, but might also force you to exclude valuable participants from future studies. Running a trial using synthetic data ensures your real-world study is optimally designed before you go live.
Characteristics of synthetic qualitative data
Generating synthetic data isn't trivial, and errors can lead to inaccurate output and analysis. But with a well developed data generation process (...we'll get into that later), synthetic qualitative data has been found to closely match the quality of human-generated data¹, while offering some unique advantages over its real-world cousin:
Flexibility: Synthetic qualitative data can be adapted to suit various needs and scenarios. The generation of data is complex but, once understood, can be designed, fine-tuned, and iterated upon to capture nuances and complexities that provide a level of flexibility that real-world data collection may not afford.
Improved data quality: Real-world data, as well as being difficult and expensive to acquire, is also vulnerable to human errors, inaccuracies, and biases. Although these same issues are present within machine learning models, teams generating synthetic data can exercise greater control over the quality, diversity, and balance of the data they generate.
Scalability of data: With the increasing demand for training data, data scientists are increasingly turning to synthetic data, which can be adapted in size to fit the training needs of machine learning models.
¹ Marketing Science 2024 https://doi.org/10.1287/mksc.2023.0454
How to generate synthetic qualitative data
Step 1: Defining your parameters
Generating high quality synthetic data requires real-world input data.
Just as you might recruit specific participants for a qualitative study based on set parameters, you need to define these parameters for your synthetic data. Think of this as setting your synthetic sample. Your synthetic sample will be defined by your project needs, use case, and/or research question.
Successful techniques include defining demographic variables, establishing user personas, and/or modifying existing datasets to generate a set of core parameters. Within these parameters, you'll likely need to introduce a level of variability to reflect real-world samples and complexities. Key questions to consider include:
Who your synthetic sample is composed of
How large your synthetic sample is
The level of homogeneity or heterogeneity you're seeking from your synthetic data
What type and/or format of synthetic data you want
Using an LLM to help you generate your synthetic sample is a quick and useful approach, but the quality of your sample will depend on the quality of your instructions. Even with well-defined instructions it's vital to check your data for consistency.
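To make this concrete, here's a minimal sketch (in Python) of what a programmatically defined synthetic sample could look like. The persona fields, value ranges, and sample size are illustrative assumptions, not a prescribed schema; adapt them to your own parameters.

```python
import random
from dataclasses import dataclass

# Illustrative persona schema -- the fields and value ranges are example
# assumptions, not a prescribed parameter set.
@dataclass
class SyntheticParticipant:
    age: int
    occupation: str
    region: str
    product_familiarity: str  # one source of heterogeneity in the sample

OCCUPATIONS = ["teacher", "nurse", "software engineer", "retail manager", "retiree"]
REGIONS = ["urban", "suburban", "rural"]
FAMILIARITY = ["never used", "occasional user", "daily user"]

def build_sample(n: int, seed: int = 42) -> list[SyntheticParticipant]:
    """Generate n participants with controlled variability."""
    rng = random.Random(seed)  # a fixed seed keeps the sample reproducible
    return [
        SyntheticParticipant(
            age=rng.randint(25, 70),
            occupation=rng.choice(OCCUPATIONS),
            region=rng.choice(REGIONS),
            product_familiarity=rng.choice(FAMILIARITY),
        )
        for _ in range(n)
    ]

sample = build_sample(20)
```

Each persona can then be handed to the LLM, typically as part of a system prompt, when you reach the generation step.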
✔︎ Tip: Tradeoff: The more variables you define, the more likely your outputs are to reflect real-world data. However, too many variables can increase the risk of erroneous outputs as the model struggles to keep within your parameter set. Keep parameters limited to only those you need.
Step 2: Verifying your inputs
You have defined your parameters and established your synthetic sample. But before diving into the process of generating synthetic data, it's crucial to verify your inputs.
This involves understanding any real data you're working with, including its statistical distributions, trends, patterns, and outliers. Synthetic data is meant to replicate real data, so it's important to understand the variables, context, and knowledge that are likely to impact your data generation process. Understanding your inputs also ensures you're able to better verify the accuracy of your outputs (more on this later).
Take the time to familiarize yourself with any input data you have. Ensure that there is no personally identifiable information (PII), and the data is free from unwanted biases, as these may be reflected in the synthetic data.
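As a minimal sketch, input checks like these might be run in Python with pandas, assuming your real data lives in a CSV. The file name, column names, and the simple email pattern are hypothetical placeholders rather than an exhaustive PII scan.

```python
import re
import pandas as pd

# Load whatever real input data you have; the file and column names here
# are hypothetical placeholders.
df = pd.read_csv("interview_summaries.csv")

# 1. Understand distributions, trends and outliers before generating anything.
print(df.describe(include="all"))       # summary statistics per column
print(df["age"].value_counts(bins=5))   # rough distribution of a numeric column

# 2. Basic PII screen -- a simple email pattern only; real projects need a far
#    more thorough check (names, phone numbers, addresses, IDs).
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
for col in df.select_dtypes(include="object").columns:
    hits = df[col].astype(str).str.contains(EMAIL_PATTERN).sum()
    if hits:
        print(f"Possible PII: {hits} email-like strings in column '{col}'")
```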
Step 3: Generating your data
Once you've defined your parameters and verified your inputs, the next step is to generate your synthetic data.
This involves using your chosen LLM to create data that mirrors the complexity and characteristics of real-world data. The aim is to replicate the complexity found within your original data, in whatever output format you desire. This requires sufficient input data and sophisticated prompting, utilizing a framework that will generate outputs in your desired format or form.
With LLMs, the quality of the output is only ever as good as the quality of your input data & prompt. If you haven't generated synthetic data before, we recommend running a number of tests with a small sample. This gives you time to ensure you're getting the desired outputs, before increasing the complexity of your outputs.
✔︎ Tip: We recommend using an LLM which gives you access to system-prompts and additional controls (e.g., model Temperature, Top P, and Frequency Penalty), for example OpenAI's GPT Playground. This gives you fuller customization and control of your outputs.
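As a minimal sketch, a generation call using those controls might look like the following with OpenAI's Python client. The model name, persona prompt, and sampling values are illustrative starting points, not recommendations.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The system prompt defines the synthetic participant; the persona details,
# model name, and sampling values below are illustrative assumptions.
system_prompt = (
    "You are a 42-year-old nurse living in a suburban area who has never "
    "used the product before. Answer interview questions in the first "
    "person, in 3-5 sentences, with realistic hesitations and caveats."
)

response = client.chat.completions.create(
    model="gpt-4o",                 # use the strongest model you can access/afford
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "How do you currently manage your shift schedule?"},
    ],
    temperature=0.9,                # higher temperature -> more varied responses
    top_p=0.95,
    frequency_penalty=0.4,          # discourages repetitive phrasing
)

print(response.choices[0].message.content)
```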
✔︎ Tip: Test various models, and always use the highest quality model (the model with the largest number of parameters) that you can access/afford. Models with more parameters tend to produce higher quality outputs, improving the richness of your synthetic data.
Step 4: Verifying your outputs
After generating your synthetic data, the final step is to verify your outputs.
This involves validating and verifying the quality of the synthetic data before using it for any analysis, predictions or insights. You can do this through manual checks (strongly recommended) or by using automated tools for data validation.
In cases where data validity is critical, we recommend using a 3rd party who wasn't involved in the data generation to verify your output data. This orthogonal approach to validation ensures independent verification of the quality of your data.
It's also important to ensure that the synthetic data maintains the same level of data privacy as the original data, without revealing any PII.
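As a minimal sketch of what automated checks might look like (to complement, not replace, manual and 3rd-party review): the thresholds, patterns, and sample outputs below are illustrative assumptions, not a validated framework.

```python
import re

# Illustrative automated checks; adapt the thresholds and patterns to your project.
PII_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+|\b\d{3}[- .]?\d{3}[- .]?\d{4}\b")

def validate_response(text: str, min_words: int = 20, max_words: int = 250) -> list[str]:
    """Return a list of issues found in a single synthetic response."""
    issues = []
    lowered = text.lower()
    word_count = len(text.split())
    if word_count < min_words:
        issues.append(f"too short ({word_count} words)")
    if word_count > max_words:
        issues.append(f"too long ({word_count} words)")
    if PII_PATTERN.search(text):
        issues.append("possible PII (email- or phone-like string)")
    if "as an ai" in lowered or "language model" in lowered:
        issues.append("model appears to have broken character")
    return issues

# Example usage on two illustrative outputs.
sample_outputs = [
    "I usually plan my shifts on paper. You can reach me at jane@example.com.",
    "As an AI language model, I don't keep a shift schedule.",
]
for i, reply in enumerate(sample_outputs):
    problems = validate_response(reply)
    if problems:
        print(f"Response {i}: {', '.join(problems)}")
```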
✔︎ Tip: When producing synthetic data at scale, we recommend using an API-based solution rather than a chat-interface. Models accessed via API will typically generate faster and allow for more fidelity in model control and output format.
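Building on the earlier sketches, a scaled-up batch run via the API might look something like this. It reuses the hypothetical build_sample helper from the Step 1 sketch, and the question list and model name are illustrative.

```python
from openai import OpenAI

client = OpenAI()

# Illustrative interview questions; each synthetic participant gets the same set.
questions = [
    "How do you currently manage your shift schedule?",
    "What would make you try a new scheduling tool?",
]

synthetic_responses = []
for participant in build_sample(50):  # build_sample is defined in the Step 1 sketch
    persona = (
        f"You are a {participant.age}-year-old {participant.occupation} living in a "
        f"{participant.region} area, best described as '{participant.product_familiarity}' "
        "with regard to the product. Answer interview questions in the first person."
    )
    for question in questions:
        completion = client.chat.completions.create(
            model="gpt-4o",  # illustrative; use whichever model you've validated
            messages=[
                {"role": "system", "content": persona},
                {"role": "user", "content": question},
            ],
            temperature=0.9,
        )
        synthetic_responses.append(completion.choices[0].message.content)

# Each collected response can then be passed through validate_response() from Step 4.
```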
Best practices for generating synthetic qualitative data
There are numerous traditional methods and frameworks for capturing qualitative data, but synthetic data is a cutting-edge technique that should be added to a research or data team's toolkit with care. Here are a few key practices to keep in mind:
Understand your real data first: Before generating synthetic data, it's crucial to have a clear understanding of any real data you already have. This includes understanding the statistical distribution, trends, and patterns.
Use appropriate tools: Various tools and libraries are available for generating synthetic data. Choose the one that best suits your requirements. Large language models like ChatGPT and Claude.ai are popular choices for generating text, but be aware of their limitations.
Replicate real-world complexity: When generating synthetic data, aim to replicate the complexity found within the original data. This includes replicating the distribution and trends, as well as the outliers of the original data.
Maintain data quality: The quality of synthetic data is significantly associated with the quality of the input data and the model used to generate the data. Ensure you're using a rigorous framework and a model with a large number of parameters in order to generate high quality data. If using commercial models, we recommend using systems in which you can access system-prompts in order to gain fuller control of generation and outputs.
Validate and verify: Always validate and verify the quality of the synthetic data before using it for any analysis or insights. This includes manual checks and 3rd party data validation. Remember that synthetic data is vulnerable to biases and stereotypes, just like real-world data.
Avoid extremes: When it comes to LLMs, remember that they're a reflection of the data they were trained on. If you're looking to generate insights from an extremely unusual sample, your data is likely to be less accurate than data for a more typical sample. When generating synthetic data for extreme cases, it's even more important to independently verify the data.
Build acceptance: Synthetic data is a new concept, and people who have not seen its advantages may not be ready to trust analysis and predictions based on it. Thus, creating awareness about the value of synthetic data is crucial.
Summary & Key Takeaways
While real data will always be preferred for research, insights, and business decision-making, synthetic data is a powerful solution when real data is unavailable, prohibitively expensive, or too time-consuming to collect.
Generating synthetic data requires a strong grasp of data modeling and a clear understanding of the real data and its environment. By following the best practices outlined in this guide, you can effectively start generating and using synthetic data within your analysis and decision-making.
Try MindPort for better insights, faster ⬇︎
MindPort empowers researchers, data teams, and agencies to effortlessly generate synthetic data with unmatched efficiency. Whether you want us to run an end-to-end research project, or just generate the data for your in-house team, we use our proprietary data-driven platform to scope, synthesize, collect, analyze and share data with your team.
Using GPT can only get you so far. Inside MindPort, our data team can organize your projects, scope your requirements, generate a synthetic sample and track individual synthetic participants. We're able to generate validated data in a fully bespoke workflow. This means your teams get the data and insights they need, in one centralized location.
Take your insights and analytics a step further with MindPort's synthetic data and insights:
Even before generating synthetic data, we can tailor over 180 unique categories of variable, totalling over 700 individual parameters per participant, based on your data requirements.
We can scale up from one to one thousand participants, enabling us to generate entirely unique three-dimensional personas that reflect the complexities, nuances, and fidelity of your real-world data.
We take this further by using 3rd party verification and our review framework to benchmark generated data against input data, carefully checking for biases, errors, and miscategorisations.
We're then able to package your synthesized data and insights within a team workspace, incorporating quotes for a visually compelling presentation.
MindPort transforms your data analysis process, making synthetic data generation a seamless and insightful journey from start to finish.
Interested? Contact us and we'll send you a synthetic sample for free.
Learn about our approach to AI Strategy
Explore our research into AIX and Human-Centered Design Research
Sign up to receive our insights & reports straight to your inbox. Always interesting, and never more than once per month. We promise.