CR 3.0 – A Manifesto for The Next Generation of Clinical Research Data Standards

During these last lingering days of summer before Labor Day, 12 months since retiring from CDISC and now 6 months with HL7, I've been contemplating a new vision for clinical research data.  My first exposure to this industry was during the teeming chaos of paper CRFs and double data entry – call that CR 1.0.  CR 2.0 brought EDC, the internet, and the first generation of data standards, courtesy of CDISC.  In formulating my personal vision for CR 3.0, viewed through the lens of my past and most recent experience, I've decided to describe my current thinking as a set of core principles – call it a manifesto for the next generation:

  1. Whenever data can be captured directly at the source, it must be. EHR data is source data.  If the source is wrong, we need to fix the source, not just correct it downstream in a separate conflicting copy.  Traceability problems vanish when the data captured are the data reviewed.
  2. We must avoid data transformations wherever possible. Each transformation can introduce error and reduce data fidelity. Instead of twisting data to fit into different specific formats, we must learn to fit analytics directly to the data as captured in its native form – using standards that exist at the source.  This is how analytics are applied throughout the modern world of technology – why not in research too?  Why not research on internet time?
  3. Pharma should take advantage of the movement to structured data catalyzed by Meaningful Use and new value- and outcomes-based reimbursement models in the USA. This means adopting prevalent healthcare standards like UCUM and LOINC, which are used extensively in healthcare records, without requiring transformations to other coding systems used only in pharma research.  Pharma should also consider including SNOMED codes applied at the point of capture, in addition to MedDRA, because they can provide additional contextual information that may be valuable to reviewers or researchers.
  4. The HL7® FHIR® standard, which is already being widely adopted throughout the world of healthcare, offers the best opportunity to date for research and other pharmaceutical processes to capitalize on the availability of rich EHR data – and can eliminate many of the inconsistencies and variations seen historically with secondary use of EHR data. FHIR can make it possible to reach inside of EHRs not just to capture data, but to monitor protocol progress, provide safety alerts, and allow much greater visibility into trial conduct and can lead to dramatic improvements in study efficiency and drug safety.  We need FHIR for better research.
  5. The current HL7 C-CDA standard provides a useful, persistent archival format for source data from EHRs, despite certain inconsistencies among different implementers. However, the next generation C-CDA on FHIR initiative should resolve many of these current limitations along a smooth migration path from the current C-CDA.
  6. While CDISC standards are currently the language for regulatory submission standards – and should continue to be so for many years to come given the lag time between study and submission – it’s critical for research to also begin adapting to new ways to power research, fueled by EHR data, based on HL7 FHIR. Now is the time to begin work on the standards for tomorrow.  But the CDISC SDTM should prioritize stability over constant change.
  7. With the widespread adoption of cloud technologies, and the ability of FHIR to access distributed data on demand wherever it resides, we are nearing the time when it will no longer be necessary to submit static copies of data from point to point. Instead, we should be planning to use FHIR to access and coalesce data in near-real time from the source, with full provenance and rich metadata, as a definitive single source of truth.  We must eliminate unnecessary redundancy, and use the full capabilities of modern technologies to move forward to the next generation of clinical research.
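To make principle 3 a little more concrete, the sketch below shows an analytics step consuming a FHIR-style Observation exactly as captured – LOINC code, UCUM unit – with no transformation into a research-only coding system. The resource values are invented for illustration; the structure follows the FHIR Observation resource.

```python
import json

# A FHIR-style Observation resource as it might arrive from an EHR
# (hypothetical values; the structure follows the FHIR Observation resource).
observation_json = """
{
  "resourceType": "Observation",
  "status": "final",
  "code": {
    "coding": [{"system": "http://loinc.org", "code": "2339-0",
                "display": "Glucose [Mass/volume] in Blood"}]
  },
  "valueQuantity": {"value": 95, "unit": "mg/dL",
                    "system": "http://unitsofmeasure.org", "code": "mg/dL"}
}
"""

def read_native(resource: dict) -> dict:
    """Consume the observation as captured -- LOINC code, UCUM unit --
    rather than remapping it to a research-only coding system."""
    coding = resource["code"]["coding"][0]
    qty = resource["valueQuantity"]
    return {"loinc": coding["code"], "value": qty["value"], "ucum": qty["code"]}

result = read_native(json.loads(observation_json))
print(result)  # {'loinc': '2339-0', 'value': 95, 'ucum': 'mg/dL'}
```

The point is not the three lines of parsing but what is absent: no mapping tables, no second copy of the data, no transformation step that could drift out of sync with the source.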

I recognize some of these may be too radical for some, and I’m sure there are many different ideas of what CR 3.0 may look like.  So I’m interested in starting a dialogue.  I’m also working on some sketches to help illustrate my manifesto which I may share eventually.  Looking forward to hearing what others may think until the next time I find a quiet summer afternoon to stare out my window at a luscious green garden and think other idle thoughts of where the future may take us.  Happy last days of summer!

FHIR® is the registered trademark of HL7 and is used with the permission of HL7.


Playing with FHIR®

“A standard is not used because we created it. It is a standard because people use it.”  This familiar quote from Dr. Chuck Jaffe, CEO of HL7, could have been the motto for the inaugural FHIR Applications Roundtable Meeting held last week at Harvard Medical School in Boston.

As so many of the smiling attendees attested, this was indeed a very different kind of meeting.

The premise was to get an indication of how widespread FHIR usage already is, and the answer was – more than we could have ever imagined.  Although FHIR is currently designated as a Standard for Trial Use (STU), it has already captivated the development community drawn to its advanced, elegant technology platform.  The Roundtable, like most FHIR events, cements the impression that interoperability through FHIR is not a pipe dream, but a burgeoning reality.

The meeting opened with a rousing talk from Dr. Shafiq Rhab on "FHIR as Enabler" describing how it's already transforming communications and processes at Hackensack University Medical Center.  We then transitioned into the meat of the meeting – a series of thirty-four 15-minute speed-dating sessions (including two of the recently announced winners of an ONC challenge grant) with applications and tools covering development and testing environments, patient and provider-facing apps using FHIR, genomics, Clinical Decision Support, and many more application areas.  We also learned how FHIR is being supported by major technology providers such as Microsoft, Computer Associates, and Lockheed Martin and academic institutions including Harvard, Duke, University of California San Francisco and Georgia Tech.

What was most impressive was that this roundtable only scratched the surface of what’s really going on.  Many attendees commented on other activities already underway – several in the audience who learned about the meeting after the submission deadline spoke of their own apps and their desire to get their chance on the podium.

The innovative meeting format was fast-moving, dynamic, and fun for all, with a palpable sense of energy and community and the invigorating appeal of a pep rally before the big homecoming game.

My read of the overall sentiment after the meeting was “FHIR is real, FHIR is already in widespread use, FHIR offers an unprecedented opportunity to transform the way we access and use healthcare information. More FHIR!”

A quick poll indicated unanimous support for repeating the whole experience, and plans are already underway to hold the next Roundtable at Duke University in Durham, NC in March 2017.

We expect to see a lot more progress by then since it’s now clear that FHIR is catching on like, well, fire.

FHIR® is the registered trademark of HL7 and is used with the permission of HL7.

EHR eSource: Sword of Change?

Note:  Some of this material will be published in Applied Clinical Trials, June 2016.

We, in the biopharmaceutical clinical research world, are creatures of habit, doing our jobs in a consistent, repeatable process, usually driven by SOPs and systems.  Change comes slowly; old habits die hard.

When EDC systems came of age around the turn of the century, health records were an inaccessible mess – and healthcare clinical data was hardly of sufficient quality to support what we felt was our gold standard of randomized clinical trials.  So we continued to treat clinical research data as an entirely separate process, with data entered into CRFs manually.  Electronic health data was scarce and dirty, didn’t line up with research databases, used different terminologies, and simply was too difficult to reuse.

But, now that the era of digital health is upon us in the USA and other countries, it’s time for a fresh look at how we collect research data.  What if we could make a great leap forward by completely changing our approach through EHR eSource?  Would we dare to try?

“Source” is the initial recording of data for a clinical study.  When the original recording is on digital media rather than paper, it’s “eSource”.  Clinical trials have used eSource for years in ECG readings, lab results and other measurements.  The FDA eSource Guidance describes different ways to transmit eSource data (from direct capture, devices, transcription, EHRs or PRO instruments) to an eCRF (or EDC) system, and several approaches have been proposed for trying to feed EHR data into our existing EDC-based processes.  Last year the FDA even asked for demonstration projects to explore such approaches.

But a more recent draft Guidance on Use of EHR Data in Clinical Investigations offers another take entirely with explicit goals to “facilitate the use of EHR data in clinical investigations” and to “promote the interoperability of EHRs” with clinical research systems.  The guidance recognizes that the ONC Health IT Certification Program can indicate the readiness of EHRs to support research.  And ONC’s Advancing Care Information initiative (the successor to Meaningful Use) relies heavily (among other things) on leveraging APIs to make health data more timely and accessible to patients and caregivers.

This is where HL7’s FHIR® platform standard comes in:

  • FHIR’s Data Access Framework will provide a universal API to EHR systems that can be used to populate much of a casebook in a clinical database.
  • The SMART on FHIR specification demonstrates how patients can grant researchers access to their data through electronic informed consent, as well as input outcome data through smartphones and browsers – data that can be directed to an EHR or a trusted third-party cloud-based research repository simply by selecting the appropriate target FHIR server.
  • Since EHR data is eSource, FHIR can also provide authorized access to remote study monitors.
  • And since FHIR can update as well as read data, it can also support the processing of data clarification transactions, thus making it possible to synchronize EHR records with clinical databases, improving transparency and traceability for both monitors and regulatory inspectors.
  • What’s more, FHIR makes it possible for regulatory reviewers to delve into the full EHR database to explore, for example, serious adverse events, in more depth than was ever possible before.
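As a small illustration of the read-access points above, here is how a client might construct a standard FHIR search request for a patient's glucose observations. The server base URL and patient ID are hypothetical, and the authorization (OAuth2/SMART) a real EHR would require is omitted; only the search parameters themselves come from the FHIR specification.

```python
from urllib.parse import urlencode

# Hypothetical FHIR server base URL -- a real deployment would supply its own,
# plus OAuth2/SMART authorization headers, which are omitted here.
FHIR_BASE = "https://ehr.example.org/fhir"

def observation_search_url(patient_id: str, loinc_code: str) -> str:
    """Build a FHIR search request for a patient's observations, filtered
    by LOINC code; 'patient', 'code' and '_sort' are standard FHIR search
    parameters, everything else is illustrative."""
    params = urlencode({
        "patient": patient_id,
        "code": f"http://loinc.org|{loinc_code}",
        "_sort": "-date",
    })
    return f"{FHIR_BASE}/Observation?{params}"

url = observation_search_url("12345", "2339-0")
print(url)
```

A study monitor's tool, a safety-alerting service, and an eCRF pre-population step could all issue essentially this same request against the same source records.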

But is it really necessary to have an eCRF system in the middle at all?  In other words, can we use EHR data directly to feed our analysis so that all health data could potentially be reused as research data?  What could be more transparent, traceable, and efficient than going from source to analysis with as few steps and transformations as necessary?  But that’s a provocative question for another day.


Thoughts on Interoperability

Within the data standards community – and especially among CDISC and HL7 – the term "interoperability" is commonly espoused as a vision, mission and goal.  For CDISC, the term refers to the ability of clinical studies to reuse data that originate as eSource from electronic health record (EHR) systems by pre-populating study CRFs through its Healthcare Link initiative, most typically for sponsored clinical trials.  But the problem goes much deeper than that.

In the world of healthcare informatics, interoperability is a broader term describing the goal of rapid and easy exchange of meaningful, usable information between participants or systems for any relevant health-related purpose.  For example, when I visit a new doctor, interoperability should make it possible for both of us to easily get all my medical information from my prior doctors.  Instead, I have to complete a new medical history form and list of current medications for each physician or specialist – since there is no comprehensive repository of all my medical information anywhere except maybe in my head.  For a more specific example, as part of a recent medical procedure, I had to visit 5 providers: primary internist, surgeon, lab, ECG, MRI.  Sure, I could get a copy of my record for each visit after the fact by logging into 5 separate portals under separate IDs and passwords, but I couldn't get one to pass anything on to the next.  In fact, in order to get my MRI from the imagist to the surgeon, I had to pick up and deliver the DVD myself.  No, that's not interoperability.

Of course interoperability is difficult, but not impossible – as many have stated, technology is not the problem.  Sure we have to ensure security, privacy, and proper authorization – but those challenges can be met.  As a matter of fact, I just did so myself for a previously unknown distant relative who shares with me a common great-great-great grandmother from the old country via 23andMe.

For all of us who will know patients or be patients at one time or another, the biggest challenges to interoperability are often between healthcare and research organizations and even within them.  There are plenty of excuses for not sharing data (as confessed in a rather controversial recent editorial of the New England Journal of Medicine).   But whatever motives may drive or block the sharing of data for individual cases, the representation of data in a syntax and vocabulary that at least has the potential to be consistently expressed and understood by separate authorized parties under the proper circumstances should be a common goal.  Indeed it should be possible for patients – and their authorized caregivers – to get the medical information they need when they need it to deliver the best possible care.

So, while data standards development may be a particularly arcane and even tedious task to the vast majority, the need to make it possible for a physician, who may be treating a vacationing patient with a sudden illness, to be able to quickly retrieve his medical records from other providers is something we should all support – it could be, truly, a matter of life or death.

And interoperability is essential to achieve some of the most ambitious and far-reaching of health-related future visions, like the Learning Health System (LHS) and the Precision Medicine Initiative (PMI).  While the link from an LHS to clinical research has always seemed tenuous because of the long lag time between processing of pre-marketing clinical studies and delivery of approved new therapies to patient care, the PMI should clearly build on a baseline of knowledge acquired through research.  As I’ve espoused in public CDISC presentations many times over the years, a most critical future objective of research is to learn how to tap into the fundamental data flows of digital healthcare as much as possible, rather than trying to operate in a separate, often redundant parallel world.

We’re at a point in time when we have the technology and awareness to finally make real progress toward interoperability.  What we need is the courage and will to put the final pieces together and really make it happen.  A goal as important as this, which can conceivably affect everyone in the world toward the betterment of health care, is not to be taken lightly.  For my part, I’ve decided to go all in.

Dear CFAST:  Please Slow Down to Catch Up

A new year and many new things to think about in the greater world of research and healthcare, though there are still plenty of unfinished thoughts about CDISC to expound upon.  In my last post, I tried to sum up my thoughts on SDTM, advocating a more restrained approach that prioritized stability of the core SDTM/SDTMIG standard over frequent change, and proposing a cookbook recipe approach to addressing gaps in the model, especially for the CFAST Therapeutic Area User Guides (TAUGs).

And today, I’d like to echo the slow food movement, with a plea directly to CFAST to also take a breather and slow down. This may seem like a departure from my past point of view – I was a strong advocate of adopting Scrum for standards in order to get more things done. But moving ahead more rapidly should not be done at the expense of consistency and clarity of approach.   CFAST TAUGs have been produced at a relatively prolific rate of 6 or more per year since 2013.  Yet the pressure during that span has been to issue even more TAUGs each year, rather than maintaining consistency in the modeling approaches used by them all.

So, as former Chair of the CFAST TA Steering Committee, I’d like to respectfully suggest that they apply the brakes for a bit until they can go back and bring the full body of published TAUGs to a common parity by fixing the inconsistencies that currently exist.

This is the long-standing “maintenance issue” that was never completely addressed during my term as Chair.  There are several cases where modeling approaches adopted in one TAUG changed when a new TAUG was issued, which might not be apparent to those who don’t read every separate publication.  These inconsistencies in approach work against the goals of standardization, and can only aggravate the types of variances between submissions that have often frustrated FDA reviewers with SDTM submissions to date.  Now’s the time to go back and fix these dangling loose ends, since the FDA has yet to issue a clear message endorsing TAUGs, and since many sponsors are still evaluating how to utilize them.

To take one prominent example, there have been at least four different approaches to represent the diagnosis date of the condition or disease under study in TAUGs over the years:

  1. Putting the date of diagnosis in a Medical History (MH) domain record, using MHCAT = ‘PRIMARY DIAGNOSIS’ and MHSCAT to distinguish between an onset course and a current course (though these important categorization values are not specified as controlled terminology). This approach was used in the Alzheimer’s, Parkinson’s and other TAUGs.
  2. Creating a Findings About (FA) record, with the FAOBJ = <disease or condition>, and the date as an observation result. This approach was used in the Asthma TAUG.
  3. Creating a supplemental qualifier record with QNAM=’MHDXDTC’, an approach that was used in the Diabetes TAUG.
  4. Creating a new MH record using the new SDTM v1.5 variable MHEVTTYP = ’DIAGNOSIS’ – which was an approach recommended by the SDS Leadership group (including me) in 2014 and was adopted as the preferred approach in some of the more recent TAUGs (e.g., MS). Unfortunately, adoption of this approach has not been embraced by the full SDS team and is limited because it’s not included in the SDTMIG and requires a new variable which is not in the currently accepted v1.4 SDTM.

Now, given that the date of diagnosis is likely to be of interest to all medical reviewers, the notion of how to represent a date of diagnosis of the disease under study consistently should be tackled independently of any TA.  And the correct approach should be easy to find by implementers.  But, currently, the method may vary depending on what document each implementer uses as a guideline.

One way to deal with this type of maintenance issue is to remove it from the individual TAUGs and make it a standardized convention that could be consulted by any implementer as a separate resource.  Using my cookbook recipe approach, such a guideline might say:

  • Always represent Diagnosis Date as an MH event record
  • Use MHEVTTYP to distinguish it from other elements of history with controlled terminology value = ‘DIAGNOSIS’.
  • If using SDTM v1.4 or earlier, represent MHEVTTYP as a supplemental qualifier, with QNAM = ‘MHEVTTYP’ and controlled terminology QVAL = ‘DIAGNOSIS’.
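The recipe above can be illustrated with a pair of hypothetical records (subject IDs, dates and terms are invented; only the variable names come from SDTM):

```python
# Illustrative records only -- subject IDs, dates, and terms are invented.
# Preferred form (SDTM v1.5+): MHEVTTYP lives on the parent MH record.
mh_v15 = {
    "USUBJID": "STUDY01-001",
    "MHSEQ": 1,
    "MHTERM": "TYPE 2 DIABETES MELLITUS",
    "MHEVTTYP": "DIAGNOSIS",
    "MHSTDTC": "2014-03-15",
}

# Fallback for SDTM v1.4 and earlier: the same fact carried as a
# supplemental qualifier row in SUPPMH, keyed back to the parent record.
suppmh_v14 = {
    "USUBJID": "STUDY01-001",
    "RDOMAIN": "MH",
    "IDVAR": "MHSEQ",
    "IDVARVAL": "1",
    "QNAM": "MHEVTTYP",
    "QVAL": "DIAGNOSIS",
}

# Either route, a reviewer looking for the diagnosis date lands on one
# consistent answer: the MH record flagged as a DIAGNOSIS event.
assert mh_v15["MHEVTTYP"] == suppmh_v14["QVAL"] == "DIAGNOSIS"
```

Either form points the reviewer to the same place, which is the whole argument for settling the convention once rather than per TAUG.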

Now (especially if supplemental qualifiers are represented within the parent domain where they belong) wouldn’t that be easier on the FDA reviewer?  Wouldn’t it be helpful to always know where to find such an important bit of data no matter who submits it, irrespective of which reference (IG, UG) an individual sponsor employee may have chosen to follow at the time?  I mean, can CDISC and CFAST really say they have an effective standard if they’re not consistent on such points?

So, one more shout in the dark:

  • Let’s ease the constant pressure to create new TAUGs, domain models and new IG versions (which often seem like they’ll never be finished anyway)
  • Stop creating new TAUGs that contradict modeling approaches of older ones until all the published ones can be made consistent to a common baseline. One way to do this would be to remove all references to how to represent certain modeling use cases, like Diagnosis Date, from the published TAUGs and replace them with a link to a document showing a preferred standard approach (though an example might still be provided in the TAUG).
  • Start providing more recipes and examples for other preferred modeling conventions, with better engagement of the CDISC standards implementer community as an interactive forum.
  • Make a firm decision once and for all between “Taste Great” and “Less Filling” as key modeling principles moving forward.

If you’d like to learn more about some of the significant variations across the CFAST TAUGs, please refer to this excellent paper by Johannes Ulander and Niels Both presented at PhUSE 2015.

A Call to CDISC Arms – Recapping My SDTM Posts

I’ve been posting a series of thoughts on the CDISC SDTM over the past few weeks, and realize that most people won’t have time to read them all, so thought I’d sum up with these CliffsNotes before moving on to other thoughts for improving clinical research.  So, as former CDISC CTO, here are my essential hindsight thoughts in a nutshell on what should be done with the SDTMIG:

  1. Prioritize stability for the SDTMIG so the world can catch up. In SDTM as a Cookbook I make a plea to aim for more SDTMIG stability by minimizing new versions, suggesting that new domains be presented as recipes to be crafted from the SDTM rather than as fixed domain models in the IG.  I propose a leaner, more flexible approach to applying the SDTM to create new custom domains for data not previously addressed in the current SDTMIG, a suggestion which should be adopted by TAUGs immediately.
  2. Fix supplemental qualifiers now! RIP Time for Supplemental Qualifiers makes another plea to represent non-standard variables (NSVs) in the parent domain where they belong, using Define-XML tags.  Acknowledging past FDA concerns about file size, it suggests an alternative of representing these as a separate file that can easily be merged onto the parent as extra columns (much more easily than with the vertical structure of today’s SUPPXX files).  This has been a nuisance for far too long – fix it.
  3. Hit the brakes on creating so many new domains. As in the “Cookbook” blog,  The Lumping and Splitting Acid Test makes another plea to be more conservative in creation of new SDTMIG domains, proposing a default “lumping” approach which should only support new domains when there’s no ambiguity about what data to put there.  The more domains we give adopters to choose from, the more likely that they’ll make different decisions – the exact opposite of what a standard should do.  It’s also time to realize that such granular separation of domains probably won’t be the way of the future with patient data in integrated data repositories – see the FDA’s Janus CTR for an example.
  4. Describe use cases for applying the SDTM instead of IG version updates. A Recipe from the SDTM Cookbook gives an example of creating a new domain following the SDTM as a Cookbook approach.   This can be done without creating new IG versions, meaning that people can actually get accurate new information more quickly instead of waiting years for the next IG version.
  5. Time to plan the next generation. Life after the v3.x SDTMIG offers yet another plea to refrain from creating too many new versions of the SDTMIG v3.x series, and begin work instead on a next generation v2 SDTM and, subsequently, v4 SDTMIG that fixes long-standing problems and limitations and is more compatible with interoperability goals and newer technologies.
  6. Clarify the role of CFAST Therapeutic Area User Guides. No blog on this yet, but it’s essential to clarify that CFAST Therapeutic Area User Guides should NOT be considered as separate new standards – they should be viewed as ways to apply current standards. We can provide new examples of how to use a standard whenever a valid new example is created – but we simply can’t allow more confusion to be introduced into an industry still learning how to use our standards by frequently pushing out conflicting information or confusing directions about draft domains or modeling approaches that don’t conform to current standards acceptable to regulators.  My feeling is that TAUGs should focus on describing use cases and recipes to apply the current standards as well as identify new terminology to be applied.  Of course, it would be better if we fixed a few things first – notably supplemental qualifiers and probably the disease milestones approach introduced with the Diabetes TAUG.  So maybe a v3.3 SDTMIG is necessary – but let’s draw the 3.x line there.
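The fix proposed in item 2 – putting non-standard variables back on the parent record – amounts to a simple pivot-and-merge. A minimal sketch in pure Python, using invented AE/SUPPAE records (variable names and values are illustrative only):

```python
# Sketch of merging vertical SUPPXX rows back onto parent-domain records
# as plain extra columns (all records here are invented for illustration).
parent_ae = [
    {"USUBJID": "001", "AESEQ": 1, "AETERM": "HEADACHE"},
    {"USUBJID": "001", "AESEQ": 2, "AETERM": "NAUSEA"},
]
supp_ae = [
    {"USUBJID": "001", "IDVAR": "AESEQ", "IDVARVAL": "1",
     "QNAM": "AETRTEM", "QVAL": "Y"},
    {"USUBJID": "001", "IDVAR": "AESEQ", "IDVARVAL": "2",
     "QNAM": "AETRTEM", "QVAL": "N"},
]

def merge_supp(parent, supp):
    """Attach each QNAM/QVAL pair to its parent record as a new column,
    matching on USUBJID plus the IDVAR/IDVARVAL key."""
    merged = [dict(rec) for rec in parent]
    for row in supp:
        for rec in merged:
            if (rec["USUBJID"] == row["USUBJID"]
                    and str(rec[row["IDVAR"]]) == row["IDVARVAL"]):
                rec[row["QNAM"]] = row["QVAL"]
    return merged

wide = merge_supp(parent_ae, supp_ae)
print(wide[0])  # {'USUBJID': '001', 'AESEQ': 1, 'AETERM': 'HEADACHE', 'AETRTEM': 'Y'}
```

That every consumer of SDTM data has to reinvent some version of this merge is exactly the argument for delivering NSVs on the parent record in the first place.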

As we get closer to the effective date of the binding guidance, we need to think differently if we want to make implementation of data standards the success story it ought to be.  So (maybe after finishing v3.3 soon, I hope) let’s take a breather, let the soup simmer awhile.  Meanwhile, let’s arouse the community to share their implementation challenges and experiences more openly and widely (as some of the User Groups already do, with a special mention to the energetic example set by the CDISC Japan User Group).  Let’s agree we need to gather more common practical experience and knowledge before we jump to develop new disruptive content.

And let’s get going collecting those requirements for the next, great leap forward instead.

That’s the pitch, so let the cards and rotten tomatoes fly.

Life after the v3.x SDTMIGs

Hard to believe that it’s been 11 years since the release of v3.1 of the SDTMIG.  Since then there have been 4 additional versioned releases, all based on the SDTM general class model, intended for representation as SAS v5 XPORT files.  SDTMIG still has plenty of life to it – in fact, one might argue that it’s just beginning to hit its stride now that use of CDISC standards will be mandatory in the US and Japan late in 2016.   But, let’s face it, as a standard that predates Facebook, YouTube, smartphones and reality TV, it’s also getting long in the tooth, and, indeed, may already be something of a legacy standard.

Perhaps the biggest limitation to the current SDTMIG is the restriction to use SAS v5 XPORT, a more than 30-year old format devised in the days of MS-DOS and floppy disks that is still the only data exchange format that the FDA and PMDA will currently accept for study data in submissions. While alternative formats have been proposed – the HL7 v3 Subject Data format in 2008, RDF in an FDA public meeting in 2012, the CDISC Dataset-XML standard in 2013 – the FDA is still stuck on XPORT.  Recently they’ve asked the PhUSE CSS Community to help evaluate alternatives, which indicates that things haven’t progressed much closer to a decision yet.

The ripple effects of XPORT have severely limited the usefulness and acceptance of the SDTM beyond regulatory submissions – especially to those who haven’t grown up as SAS programmers working with domain and analysis datasets.  So any major new revision of the SDTMIG needs to start there, to split out all the XPORT-specific stuff.  This involves using longer field names, richer metadata, more advanced data types and eliminating field length restrictions.  That’s the easy part, but that’s not enough.  If we’re going to reconsider the SDTMIG, then we should use the opportunity to think broadly and address other needs as well.

We need a longer-term replacement, but we also need to keep the current trains running on time now.  Now that people are just getting used to the idea of a regulatory mandate to use SDTM and SEND, we certainly don’t want to change too much just yet.  We need to keep it stable enough so new adopters can get used to it – rapidly changing terminology gives them enough of a challenge to deal with without the pressure of adopting new IG versions.  I recently described one way to help minimize the number of necessary future versions of the existing XPORT-bound IG as a recipe.   We could do this now with the current version 3.2 and address many new needs.

On the other hand, we should be working on the next generation while we keep that venerable current one going.  In Chicago, the White Sox didn’t tear down the old Comiskey Park until the new U.S. Cellular Field was finished – they built the new while using the old.  And they minimized making too many repairs to the old once they started working on the new.   So while we can assume we’ll need XPORT for some time even if a replacement exchange format is finally chosen, that shouldn’t stop us from rethinking the SDTMIG to better meet future needs now.  It’s time to think ahead.

What might a next generation SDTM look like?  A new SDTM for the future might have some of the following characteristics:

  1. As implied above, it should support standard content that’s independent of the exchange format. The standard should be easily representable in RDF, JSON (with HL7 FHIR resources and profiles), XML (and, yes, even XPORT for legacy purposes – at least for some years).
  2. A general class structure as used in the current model must remain as the heart of SDTM, though likely with some variations. We’ll want to retain the 3 general classes and most, but maybe not all variables (though such variables need precise definitions and more robust datatypes).  The core variables are essential, but perhaps some variables that are unique to a specific use case (such as those being introduced with new TAs or for SEND) can be packaged as supplements to augment the core under certain conditions.  What if there was a way to add new variables to general classes, timing and identifiers without necessarily creating a new IG version?  Rather than having to keep issuing new versions each time we want more variables, can’t a curated dictionary of non-standard variables – all defined with full metadata and applicable value sets – be used and managed separately in a manner similar to coding dictionaries?
  3. We may need some new general classes as well, such as the long-recognized need for a general class to represent activities such as procedures.
  4. We should reassess, with the benefit of hindsight, what data really belongs in which class. For example, perhaps substance use data (smoking, recreational drugs, alcohol) might be better represented as findings along with other lifestyle characteristics, which would better align with how such data is represented in healthcare systems.  Disposition data might fit better as an activity rather than event.
  5. Thorough definitions for each variable (a task already in progress), and variable names that are more intelligible – without being limited to 8 characters with a domain prefix – are mandatory.
  6. We should remove redundant information that can easily be looked up (as Jozef Aerts has long proposed). Lookups can be made via define-xml codelists or web services.
  7. Other non-backwards-compatible corrections to known issues, deep in the weeds, should also be addressed – such as distinguishing timings associated with specimen collection from point-in-time result findings – and resolving that strange confusion between collection date and start date in the Findings class.
  8. Perhaps a reconsideration and simplification of the key structure is in order: replacing the Sequence variable with a unique observation identifier/Uniform Resource Identifier (URI) that can be referenced for linked-data purposes would make it easier to represent more complex associations and relationships (including the ability to extend dimensionally with meta-observations such as attributions and interpretations). This would be part of a richer metadata structure that should also support the representation of concepts.
  9. A more advanced extension mechanism that replaces the cumbersome supplemental qualifier approach is critical (such as the one already proposed by SDS) so users can easily incorporate those special use case variables mentioned in item 2 above.
  10. And we need the ability to align better with other healthcare-related information, to make it possible to use clinical study data with other real world data sources, and the courage to modify the SDTM to facilitate such alignment where appropriate.
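
To make item 2 a bit more concrete, a curated NSV dictionary might amount to little more than metadata records with bound value sets that tools can query, much as they query coding dictionaries today. This is only a sketch: the entries, labels and field names below are illustrative assumptions, not existing CDISC metadata.

```python
# Illustrative sketch of a curated non-standard-variable (NSV) dictionary.
# All entries, labels and field names here are hypothetical, not official
# CDISC metadata; a real dictionary would carry full variable-level metadata.
NSV_DICTIONARY = {
    "REORREF": {
        "label": "Reference result (original units)",  # hypothetical definition
        "general_class": "Findings",
        "datatype": "text",
        "codelist": None,                              # no bound value set
    },
    "REIRESFL": {
        "label": "Incomplete result flag",             # hypothetical definition
        "general_class": "Findings",
        "datatype": "text",
        "codelist": ["Y"],                             # bound value set
    },
}

def validate_nsv(name, value):
    """Look up an NSV and check a value against its bound value set, if any."""
    entry = NSV_DICTIONARY.get(name)
    if entry is None:
        raise KeyError(f"{name} is not in the curated NSV dictionary")
    codelist = entry["codelist"]
    return codelist is None or value in codelist
```

The same lookup could just as easily be served from define.xml codelists or a web service (item 6 above), rather than an in-memory table.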

Now, some might argue that we are still limiting ourselves to 2-dimensional representations here – which is a valid criticism.  But maybe the longer-term solution involves more than one representation of the data.  Perhaps we have a broad patient file with both structured and unstructured source information as a sort of case history, and tabular representations/views derived from it – an old idea which might be getting closer to prime time.  Thinking beyond tables and datasets should certainly be part of the exercise.

I know many are already impatient for change (at least as far as XPORT is concerned), and others feel we should just throw it all away and adopt more radical solutions.  But my personal feeling is that we need to keep what we have – which has already taken us much farther than we could have imagined 15 years ago – and build from that.  The approach echoes a 2009 New Yorker article by the great Atul Gawande about the then-upcoming healthcare reform, in which he advocated building up from our history of employer-provided insurance rather than jumping to something radically different, like single-payer: “Each country has built on its own history, however imperfect, unusual, and untidy… we have to start with what we have.”

So whatever we do, we should start with the SDTM as a governing model that really drives implementation, with more extensive metadata, clear definitions, complex datatypes, and a simpler extension mechanism.  An improved SDTM can drive implementation and result in a more streamlined implementation guide that also shows how to apply research/biomedical concepts, controlled terminologies and computer-executable rules (e.g. for verifying conformance, derivations, relationships, etc.), and where to find use cases and examples. Such use cases and examples (as for Therapeutic Areas) could be maintained separately in a knowledge repository, and the SHARE metadata repository would provide all the pieces and help put them together.  We start with the SDTM and metadata and build out from there.  But we need to build in a way that converges with the opportunities provided by what’s going on in the world of healthcare, technology and science.  Like the Eastbound and Westbound teams of the transcontinental railroad 150 years ago, we should endeavor to meet in the middle.

A Recipe from the SDTM Cookbook

In my earlier posting on SDTM as a Cookbook, I described an alternative approach for defining new domain models for use with CFAST Therapeutic Area User Guides (TAUGs). Based on an internal poll of SDS team members, there seems to be a desire to create many domain models (a predilection toward splitting, rather than the lumping approach I favor).  Yet creating new domains is a frustrating and lengthy process. Although these are now mostly modeled by CFAST teams with very specific use cases, there has been a tendency to also vet them through SDS in a more generalized form as part of a batch associated with a future SDTMIG release – a process which can take 2-3 years or more.  In the meantime, TAUGs must propose draft domain models under a stricter timeline, well before they exist in the officially sanctioned, normative SDTMIG world.

What a waste – and it gets worse.  Once a domain is issued as final in an SDTMIG version update (assuming SDS team consensus allows it and it survives the public comment process), it then has to be evaluated by FDA before they can determine whether they’re ready to accept it.  Although 15 TAUGs have been posted to date, the FDA has yet to clearly indicate its readiness to accept any of them.  The acceptance process has also been excruciatingly long (it took nearly 2 years for FDA to announce readiness to accept SDTMIG v3.2 – and even then with some restrictions). In the meantime, people simply make up some other approach to get their daily work done – the antithesis of standards.

Let’s take a current example of how we might apply the cookbook approach to the draft domains included with the just-posted TAUG-TBv2. This TAUG includes 5 new draft domains as well as revisions to 3 existing domains, all presented as SDTMIG domain models.  One of the new domains (vetted with SDTMIG v3.3 Batch 2 and also used in Asthma, but not yet released as final) is the RE (Respiratory Physiology) domain. This is a Findings General Class domain, mostly consistent with the current SDTM v1.4 except for the addition of 3 new variables:  REORREF, RESTREFN and REIRESFL. (An earlier version of this domain was also included in the Asthma and Influenza TAUGs.)

Now a cookbook recipe might present this RE domain as a list of steps to follow to “roll your own” domain.  Instructions might include:

  1. Create a new custom domain with the standard identifiers and all variables from the SDTM v1.4 Findings General Class.
  2. Assign the Domain Code and prefix “RE” from controlled terminology.
  3. Insert the standard timing variables that are typically provided with a Findings Physiology domain.
  4. Create the following 3 new Non Standard Variables (NSVs):
    1. REORREF
    2. RESTREFN
    3. REIRESFL

(Note definitions for these might be pulled from a newer version of the SDTM which is now being updated more frequently, or else from a CDISC Wiki resource).

  5. Remove any unnecessary or irrelevant permissible variables such as REMODIFY, RETSTDTL, etc. – just as you do with published domains. (Note that these are all permissible variables – assigning a new NSV as Required or Expected would be a complication, but that would be an odd choice for a newly created variable anyway.)
  6. Add any additional NSVs in the usual manner (this would be much smoother if the new proposed method of putting NSVs in the parent domain were adopted).
  7. Apply other controlled terminology bindings for variables within the domain (such as RECAT, RETESTCD, etc.) that are declared in a sample define.xml file posted along with the recipe.

As the output of this exercise, one would normally create the define.xml metadata for the domain as well as an RE.XPT file (which would later be populated with data values).   The sample define file included with the recipe would also specify which controlled terminologies apply to both the standard and new non-standard variables (I assume that new terminology values intended for use in this domain would be created through the normal terminology request process and simply referenced in the define.xml example). The recipe could still provide a draft domain in the usual Word table or Excel format – but presented as an example rather than a normative specification, similar to including an illustration or photo in a recipe.
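
The recipe steps above can be sketched as code. This is a rough illustration only: the variable lists are a small, assumed subset of the SDTM v1.4 Findings class and typical timing variables, not the full model, and the helper function is mine, not part of any CDISC toolkit.

```python
# Sketch of the "roll your own domain" recipe. The variable lists below are a
# small, illustrative subset of the SDTM v1.4 Findings class, not the full model.
FINDINGS_CLASS_VARS = [
    "STUDYID", "DOMAIN", "USUBJID", "--SEQ",       # identifiers (subset)
    "--TESTCD", "--TEST", "--CAT", "--ORRES",      # topic and result (subset)
    "--ORRESU", "--STRESC", "--MODIFY", "--TSTDTL",
]
TIMING_VARS = ["VISITNUM", "VISIT", "--DTC"]       # typical timing subset

def build_domain(prefix, nsvs, drop=()):
    """Apply the domain prefix to class variables (recipe steps 1-2), append
    timing variables (step 3) and new NSVs (step 4), and remove unneeded
    permissible variables (step 5)."""
    variables = [v.replace("--", prefix) for v in FINDINGS_CLASS_VARS + TIMING_VARS]
    variables = [v for v in variables if v not in drop]
    return variables + list(nsvs)

# The draft RE (Respiratory Physiology) domain, per the recipe:
re_domain = build_domain(
    "RE",
    nsvs=["REORREF", "RESTREFN", "REIRESFL"],
    drop=["REMODIFY", "RETSTDTL"],
)
```

The real output of the recipe would of course be the define.xml metadata and the dataset file, with the controlled terminology bindings of steps 6-7 declared there rather than in code.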

I believe it should be sufficient to apply the standard class-level validation rules (which include checking for controlled terminology assignments); these can be addressed separately from the domain model, so there should not be any specific new user acceptance testing required by FDA. FDA might also specify separate content-based checks, but these can be added at any time later, once they’ve had a chance to review submissions using this model – and new rules can also be added outside the IG.  And while it will technically be a custom v3.2 domain in each submission, if it conforms to the recipe (which can be clearly stated in the define file) it can serve the same purpose as a new SDTMIG domain in a future version. The difference is that it can be put directly into use.  A beneficial side effect is that this encourages early testing among the research community, which might result in beneficial tweaks to the recipe.  The recipe can be maintained over time and augmented with more and more examples suggested by adopters in a crowd-sourced Wiki sharing environment, which should only make the domain model more solid.  Sure, this might require review and curation by the SDS team, but that should be far less onerous than the current process.
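
As a sketch of what class-level rules might look like in executable form – the specific rules, names and record shape here are my own illustrative assumptions, not the official conformance rule set:

```python
# Minimal sketch of class-level conformance checks applied to one observation.
# The rules and variable names are illustrative, not an official rule set.
def check_record(record, domain_prefix, codelists):
    """Return a list of conformance findings for a single observation record."""
    problems = []
    # Required identifiers must be present and non-empty.
    for var in ("STUDYID", "DOMAIN", "USUBJID"):
        if not record.get(var):
            problems.append(f"{var} is required but missing")
    # DOMAIN must match the declared domain code.
    if record.get("DOMAIN") != domain_prefix:
        problems.append(f"DOMAIN should be {domain_prefix}")
    # Controlled-terminology bindings, e.g. as declared in define.xml.
    for var, allowed in codelists.items():
        value = record.get(var)
        if value is not None and value not in allowed:
            problems.append(f"{var} value {value!r} not in bound codelist")
    return problems

# Hypothetical usage: one RE record checked against an assumed RETESTCD codelist.
record = {"STUDYID": "S1", "DOMAIN": "RE", "USUBJID": "S1-001", "RETESTCD": "FEV1"}
issues = check_record(record, "RE", {"RETESTCD": {"FEV1", "FVC"}})
```

Because such checks operate on the class structure and declared terminology rather than on any one domain, they would apply to a recipe-built domain and a published one alike.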

The benefits of such an approach include:

  • Making it simpler and easier to create new domain models based on existing published versions, which might help shorten the development time for TAUGs
  • Allowing sponsors to adopt these new models more rapidly without waiting for new domains or FDA announcements
  • Making it possible for FDA to accept these models without a lengthy acceptance process
  • Providing an improved, rapidly evolving Wiki-based knowledge resource to help sponsors represent data that doesn’t fit in existing final domains more consistently.
  • Minimizing the number of new versions of the SDTMIG that have to be handled by industry and regulatory authorities.

Of course, adopting such an approach is not trivial.  It would require buy-in by FDA and industry, and would require new methods for sharing these recipe guidelines (probably via the Wiki) and a whole lot of communication and training.  But it seems to me it would be a much more practical way to move forward to extend the reach of the SDTM for new TAs in a leaner, quicker manner with fewer maintenance and version management headaches.

The “Cubs Way” to Future Submission Data Standards

Even if you don’t follow baseball, you must have heard something about the storybook year of the out-of-nowhere 2015 Chicago Cubs.  No, they’re not going to win the 2015 World Series, but they made baseball’s version of the Final Four, and somehow that didn’t feel like losing this time around.

You must know this about the Cubs: it has been 107 years since their last championship, which is generally acknowledged as the benchmark of futility in professional sports.  For clinical data geeks, you might think in terms of a similar drought – the many years we’ve been handicapped by the SAS V5 transport format (XPT).  XPT stems from the days of the Commodore computer, 5-1/4” floppy disks and MS-DOS 640KB memory limits, and while it hasn’t been around quite as long as the Cubs’ last World Series trophy, it’s a Methuselah in tech years.

However, just like the Cubs and their venerable Wrigley Field, it looks like it’s going to be around for a while, and it definitely needs some attention.  So can we learn any relevant lessons from the 2015 Cubs?

  1. Think long term – with a plan. The old Cubs way (overpriced has-been free agents and bad trades) had never worked, so the new regime sacrificed current performance for the promise of future competitiveness, losing enough games to gain high draft picks and flipping useful veterans for uncertain prospects.  With respect to XPT, this might mean living with a partial improvement (like the CDISC Dataset-XML) for a while, while working on a separate longer-term solution that will keep us competitive for decades.
  2. Keep meeting current needs (but only to a point). The Cubs still had to field a team that showed enough to keep fans on board and invested in the future.  In our world, that means giving users time to gain basic literacy and get the most value possible out of current CDISC data standards with XPT (and maybe Dataset-XML), now that those will be required by FDA and PMDA (who aren’t about to change suddenly before the rule formally goes into effect).   This might also mean limiting the degree of change to the current published standards to some minimal fine-tuning that users can easily absorb, while concentrating most of the attention on a much more robust next-generation solution that can make the big leaps tomorrow.
  3. Be patient so the prospects can develop.  In other words, even if the future solution isn’t mature now, that may be fine as long as it’s got the talent to take you where you need to go.  Such a description might fit HL7 FHIR and the Semantic Web, for example.
  4. Fill in the missing pieces along the way. The Cubs soon realized they needed more starting pitching and situational hitting, which will guide their winter and spring moves for next year.
  5. Don’t worry about future salaries (I mean file size)! In 1908, the highest-paid star baseball player made $8,500, and in 1988 a floppy disk held 1.44 MB – less than a typical MP3 song that you can now play from your watch.  File size should not be an obstacle to moving beyond XPT.  Things get bigger over time; get over it.

Of course, the jury’s still out on whether the Cubs will ever make it, but it seems there’s more excitement about next year here in the Windy City than ever before.  It would be wonderful if we could say the same sort of thing about the future of clinical data by spring training, 2017.