Many Hollywood screenplays based on true stories take the liberty of amalgamating two or more people into a single character. Making one character do the work that several did in real life tightens the story and makes it more compelling, so it’s “a better version than the truth.” Synthetic data testing for health plan configuration is built on a similar concept. Instead of copying actual production data for testing, teams invent synthetic data using random selection processes, so the test data does not directly represent any single entity or record within the production data set.
Yet synthetic data is not arbitrary: Created well, it fully represents the richness of the solution being tested. Further, whereas production data copies are limited to real-world events, synthetic data allows health plan providers to create data that meets the precise criteria required to test the full scope of a transaction or application. That helps ensure complete configuration quality.
Payers also need to find an alternative to testing with actual member data. It’s likely that more employers and regulators will limit the use of employees’ protected health information (PHI) for testing and other purposes, even if the data is masked. Synthetic data enables payers to simultaneously protect PHI and create a pathway to more robust testing. Together, these factors build a strong case for payers to swiftly adopt synthetic data.
How to use synthetic data
The most common method of data creation for testing is making a copy of production data. However, data dumps can be huge, while the amount of information used in testing is but a tiny subset of a few thousand records. Testing teams often cannot articulate their specific data needs until they are actively developing plans for a specific test cycle. To fend off multiple requests for data, the IT organization gives the team the full data dump. But when using synthetic data, a plan provider does not need to copy a million members and 10 years of claims history from live production for a test. Here is a more effective way for teams to identify their needs and approach test-data management:
Connect: Recognizing how cases, process and data are connected helps streamline data requirements. Testing teams can better analyze what they are testing and why by identifying their test use cases, the test process they’ll use, the data required to execute the test and the metrics that signal success or failure.
Capture: Instead of using complete member records, capture the information that’s important to executing the test, such as demographics, statistics and claim models.
Create: Use that information to create a set of employer groups, members, providers and claims that look and act like production data but are invented entities. Any resemblance between an invented member and an actual member is coincidental rather than reliable or consistent.
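The capture and create steps above can be sketched in a few lines of Python. This is a minimal illustration, not a description of any particular product: the profile, its attribute names and its proportions are all invented for the example, and a real profile would be derived from aggregate production statistics rather than typed in by hand.

```python
import random
from collections import Counter

# Hypothetical "captured" statistics: aggregate proportions only, no member records.
CAPTURED_PROFILE = {
    "age_band": {"0-17": 0.22, "18-44": 0.35, "45-64": 0.28, "65+": 0.15},
    "plan_type": {"HMO": 0.5, "PPO": 0.4, "HDHP": 0.1},
}


def create_members(profile, count, seed=0):
    """Invent members by sampling each attribute from the captured proportions."""
    rng = random.Random(seed)  # seeded so a test cycle can be reproduced
    members = []
    for i in range(count):
        member = {"member_id": f"SYN-{i:06d}"}  # invented identifier
        for attribute, weights in profile.items():
            values = list(weights)
            member[attribute] = rng.choices(
                values, weights=[weights[v] for v in values]
            )[0]
        members.append(member)
    return members


members = create_members(CAPTURED_PROFILE, count=1000)
print(Counter(m["age_band"] for m in members))
```

Because each attribute is drawn independently from the captured proportions, no invented member is derived from any one production record. Sampling attributes independently is a simplification; a fuller approach would also preserve correlations between attributes.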
These created entities are termed “unicorns.” Each unicorn represents a perfect test use case or multiple use cases. Instead of searching for real-world data to fulfill the needs of each and every test, teams may build a unicorn member with the specific attributes that trigger several exact test outcomes in a single pass. A unicorn may also combine multiple perfect circumstances that meet many different testing needs. If testing vision claims, for example, the unicorn will “have” glasses. If testing limits on physical therapy, the unicorn will have “visited” a therapist 10 times. Teams may change any of the unicorns’ characteristics that have no bearing on the testing result and keep those important to testing. They can evolve unicorns, making them single or married, childless or parents to eight children, all with different health conditions by virtue of the claims assigned to them.
All the data values are valid and realistic, and they are assigned to invented entities that nonetheless look and behave as though they were actual members culled from production data.
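As a rough sketch of the unicorn idea, the Python below builds one invented member with 10 physical therapy visits and a vision claim, then evolves attributes that have no bearing on the test. The class names, fields and service labels are assumptions made for illustration and do not correspond to any real claims system.

```python
from dataclasses import dataclass, replace
from datetime import date, timedelta


@dataclass(frozen=True)
class Claim:
    service_type: str
    service_date: date


@dataclass(frozen=True)
class Unicorn:
    member_id: str
    marital_status: str = "single"
    dependents: int = 0
    claims: tuple = ()


def build_unicorn(member_id, pt_visits=10, has_vision_claim=True,
                  start=date(2024, 1, 8)):
    """Assemble an invented member whose claims trigger the desired tests."""
    claims = [
        Claim("physical_therapy", start + timedelta(weeks=i))
        for i in range(pt_visits)
    ]
    if has_vision_claim:
        claims.append(Claim("vision_eyewear", start))
    return Unicorn(member_id=member_id, claims=tuple(claims))


def visits(unicorn, service_type):
    """Count the unicorn's claims of one service type."""
    return sum(1 for c in unicorn.claims if c.service_type == service_type)


unicorn = build_unicorn("UNI-0001")
# Evolve attributes with no bearing on the test; the claims stay as they are.
married_parent = replace(unicorn, marital_status="married", dependents=8)
print(visits(married_parent, "physical_therapy"))
```

Keeping the records immutable means an evolved unicorn is a new object, so the original stays available to other tests.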
The bigger picture
The unicorns must incorporate data that goes beyond that gleaned from the claims processing solution. Dozens of business process systems might be used in any given test cycle. Creating synthetic data that provides the related information across the enterprise suite of applications enables robust and comprehensive business process testing. For example, a test suite may need to represent not only a given claim, but also specialized pricing triggers, workflow controls, care management integration and supporting documentation systems such as images or laboratory results.
This process requires analyzing what’s important in each unique application, creating data that fits each application’s requirements and then making that data consumable in a way that addresses the data schema. Applications may include HIPAA privacy modules, an image management solution or a provider network management system, as well as lab and pharmacy connected data sets. To maintain the interrelationships among the application databases, testing teams should create the required data elements once, then use that data set to write to the different databases.
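One way to picture "create once, write to many" is the sketch below, which uses two in-memory SQLite databases as stand-ins for a claims system and an image management system. The table layouts, identifiers and service labels are hypothetical; the point is only that both databases are loaded from the same canonical records, so their keys stay aligned.

```python
import sqlite3

# Canonical unicorn data, created once. Every name and field here is hypothetical.
CANONICAL = [
    {"member_id": "UNI-0001", "claim_id": "CLM-0001",
     "service": "physical_therapy", "image_ref": None},
    {"member_id": "UNI-0001", "claim_id": "CLM-0002",
     "service": "fracture_care", "image_ref": "IMG-0001"},
]


def load_claims_db(records):
    """Write the claim view of the canonical records to a stand-in claims database."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE claims (claim_id TEXT PRIMARY KEY, member_id TEXT, service TEXT)"
    )
    db.executemany(
        "INSERT INTO claims VALUES (:claim_id, :member_id, :service)", records
    )
    return db


def load_imaging_db(records):
    """Write the image view of the same records to a stand-in imaging database."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE images (image_ref TEXT PRIMARY KEY, claim_id TEXT, member_id TEXT)"
    )
    with_images = [r for r in records if r["image_ref"] is not None]
    db.executemany(
        "INSERT INTO images VALUES (:image_ref, :claim_id, :member_id)", with_images
    )
    return db


claims_db = load_claims_db(CANONICAL)
imaging_db = load_imaging_db(CANONICAL)

# Every image should point at a claim that exists in the claims database.
claim_ids = {row[0] for row in claims_db.execute("SELECT claim_id FROM claims")}
image_claim_ids = {row[0] for row in imaging_db.execute("SELECT claim_id FROM images")}
orphans = image_claim_ids - claim_ids
print("orphaned image records:", len(orphans))
```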
All the designed data is stored for repeated use. Testing teams select the data they need, format it for a specific application and load it into the processes appropriate for the test cycle. The result is unicorn test data that better meets the diverse testing needs of the organization, with much smaller non-production data sets and more efficient and effective test execution cycles.
Teams will no longer have to search far and wide for data that matches their testing requirements. All necessary data will be built into a single unicorn member.
A synthetic data storyboard
While a complete solution for creating and using synthetic data is still emerging, plans can start storyboarding for their adoption of synthetic data testing with the following steps:
Build a large suite of test use cases. This is key to getting value from synthetic data: apply it broadly and not just to one application suite, such as claims. Test teams must be clear about what they test and their various testing processes for transactions ranging from new contracts to regression testing a modified product. Ensure positive and negative test cases are covered and that interfaces and extensions can still be tested. Production data is trusted because its history and reliability are inherent. Synthetic data must have its reliability created and then validated over time. The end goal is a data set that is both reliable and trusted, with trust earned as the data proves reliable across repeated test cycles.
Implement data perturbation techniques. Data perturbation not only enables payers to better protect the PHI in their systems right now, but it will also start their transition to synthetic data. Synthetic data enhances other PHI protection methods, such as masking test data, a technique that relies on data obfuscation and anonymization. Masking solutions are expensive and complex to deploy because PHI exists in more than one system. The solutions also don’t adequately protect data. Health plan data is riddled with information — birth dates, gender, zip codes, etc. — that bad actors can use to triangulate data sources and reidentify data.
Through data perturbation, implant “noise” into data so that real data is both less recognizable and less marketable by bad actors. Payers can insert noise by slightly modifying a claim or adding lines, features or notes in a data set. This invented data builds up into reliable and trusted test data, complete with a utilization plan that supports diverse test use cases and a data history that can be used to support regression testing.
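A minimal sketch of inserting noise into a claim might look like the following. The field names, the three-day date shift and the 2 percent amount jitter are arbitrary assumptions chosen for illustration; this is not a vetted de-identification method and offers no privacy guarantee on its own.

```python
import random
from datetime import date, timedelta


def perturb_claim(claim, rng, max_day_shift=3, amount_jitter=0.02):
    """Return a slightly modified copy of a claim; the input is left untouched."""
    shift = rng.randint(-max_day_shift, max_day_shift)
    factor = 1 + rng.uniform(-amount_jitter, amount_jitter)
    return {
        **claim,
        "service_date": claim["service_date"] + timedelta(days=shift),
        "billed_amount": round(claim["billed_amount"] * factor, 2),
        "note": "perturbed",  # an added note, one of the noise types named above
    }


rng = random.Random(42)  # seeded so the same noise can be regenerated
original = {
    "claim_id": "CLM-0001",  # hypothetical record
    "service_date": date(2024, 3, 4),
    "billed_amount": 250.00,
}
noisy = perturb_claim(original, rng)
print(noisy["service_date"], noisy["billed_amount"])
```

The noise is kept small so the claim still behaves the same way in a test. The identifier is deliberately left unchanged so records remain linkable across systems.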
Expand the trusted data pool. Over time, create more synthetic John and Jane Does that seem to exist in a real world, with associated data that is as valid as that from actual members. Testing teams will gradually come to rely on these synthesized “members,” steadily reducing their need to use copies of actual production data. In addition, bad actors won’t be able to distinguish real members from synthetic ones, making plan data less valuable to them.
Payers may start using synthetic data by cloning a single test case, such as a data set for a claim representing a broken leg, to test the benefits configuration for a physician joining a plan’s network. The cloning enables plans to instantly create multiples of the “same” data types, whether two or 200, ready to be tested in multiple passes of the same test. This means:
Testing teams do not need to recreate data sets that are consumed, changed or destroyed during testing, which improves speed and efficiency.
The best and most highly valued data may be used by different teams simultaneously, even if the teams are in the same environment.
Teams may use the data to test without using PHI.
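The cloning described above can be sketched as deep-copying a template test case, so that a copy consumed or changed by one test pass leaves the template and the other copies intact. The template contents and the `clone_id` naming scheme are hypothetical.

```python
import copy

# Hypothetical template for a single test case.
TEMPLATE = {
    "case": "broken_leg",
    "member_id": "UNI-0001",
    "claim_lines": [{"service": "fracture_care", "status": "open"}],
}


def clone_case(template, count):
    """Create independent copies of a test case, each with its own identifier."""
    clones = []
    for i in range(1, count + 1):
        clone = copy.deepcopy(template)  # deep copy so nested claim lines are not shared
        clone["clone_id"] = f"{template['case']}-{i:03d}"
        clones.append(clone)
    return clones


clones = clone_case(TEMPLATE, 200)

# One test pass "consumes" its copy; the template and the other clones are unaffected.
clones[0]["claim_lines"][0]["status"] = "adjudicated"
print(clones[1]["claim_lines"][0]["status"])
```

A shallow copy would share the nested claim lines between clones, so a change made by one team would leak into another team's data.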
In the future, much of the synthetic data creation may be automated, using machine learning elements that leverage information gained from prior test iterations. Tools will extract claims data and similarly analyze non-core administrative systems to identify data types and interrelationships to create synthetic data that represents transactions across the enterprise. That eventually will result in a comprehensive pool of non-production-designed data, ready for testing any type of solution — including those invented in a payer’s imagination. Current production data represents the past. Using synthetic data, plans can test contract and physician performance under different configurations and business models, such as value-based care.
By letting testing teams create better-than-real-world characters, synthetic data enables payers to improve testing efficiency, protect PHI and intelligently test business scenarios. Transitioning to synthetic data will add larger-than-life performance to payers’ testing capabilities, tightening the test case story to make it more compelling so it’s indeed a better version than the truth.
This article was written by Tom Newman, General Manager, Cognizant Optimization Software Products.