Life after the v3.x SDTMIGs

Hard to believe that it’s been 11 years since the release of v3.1 of the SDTMIG.  Since then there have been 4 additional versioned releases, all based on the SDTM general class model, intended for representation as SAS v5 XPORT files.  SDTMIG still has plenty of life to it – in fact, one might argue that it’s just beginning to hit its stride now that use of CDISC standards will be mandatory in the US and Japan late in 2016.   But, let’s face it, as a standard that predates Facebook, YouTube, smartphones and reality TV, it’s also getting long in the tooth, and, indeed, may already be something of a legacy standard.

Perhaps the biggest limitation to the current SDTMIG is the restriction to use SAS v5 XPORT, a more than 30-year old format devised in the days of MS-DOS and floppy disks that is still the only data exchange format that the FDA and PMDA will currently accept for study data in submissions. While alternative formats have been proposed – the HL7 v3 Subject Data format in 2008, RDF in an FDA public meeting in 2012, the CDISC dataset-xml standard in 2013 – the FDA is still stuck on XPORT.  Recently they’ve asked the PhUSE CSS Community to help evaluate alternatives, which indicates that things haven’t progressed much closer to a decision yet.

The ripple effects of XPORT have severely limited the usefulness and acceptance of the SDTM beyond regulatory submissions – especially to those who haven’t grown up as SAS programmers working with domain and analysis datasets.  So any major new revision of the SDTMIG needs to start there, to split out all the XPORT-specific stuff.  This involves using longer field names, richer metadata, more advanced data types and eliminating field length restrictions.  That’s the easy part, but that’s not enough.  If we’re going to reconsider the SDTMIG, then we should use the opportunity to think broadly and address other needs as well.

We need a longer-term replacement, but we also need to keep the current trains running on time now.  Now that people are just getting used to the idea of a regulatory mandate to use SDTM and SEND, we certainly don’t want to change too much just yet.  We need to keep it stable enough so new adopters can get used to it – rapidly changing terminology gives them enough of a challenge to deal with without the pressure of adopting new IG versions.  I recently described one way to help minimize the number of necessary future versions of the existing XPORT-bound IG as a recipe.   We could do this now with the current version 3.2 and address many new needs.

On the other hand, we should be working on the next generation while we keep that venerable current one going.  In Chicago, the White Sox didn’t tear down the old Comiskey Park until the new U.S. Cellular field was finished — they built the new while using the old.  And they minimized making too many repairs to the old once they started working on the new.   So while we can assume we’ll need XPORT for some time even if a replacement exchange format is finally chosen, that shouldn’t stop us from rethinking the SDTMIG to better meet future needs now.  It’s time to think ahead.

What might a next generation SDTM look like?  A new SDTM for the future might have some of the following characteristics:

  1. As implied above, it should support standard content that’s independent of the exchange format. The standard should be easily representable in RDF, JSON (with HL7 FHIR resources and profiles), XML (and, yes, even XPORT for legacy purposes – at least for some years).
  2. A general class structure as used in the current model must remain as the heart of SDTM, though likely with some variations. We’ll want to retain the 3 general classes and most, but maybe not all variables (though such variables need precise definitions and more robust datatypes).  The core variables are essential, but perhaps some variables that are unique to a specific use case (such as those being introduced with new TAs or for SEND) can be packaged as supplements to augment the core under certain conditions.  What if there was a way to add new variables to general classes, timing and identifiers without necessarily creating a new IG version?  Rather than having to keep issuing new versions each time we want more variables, can’t a curated dictionary of non-standard variables – all defined with full metadata and applicable value sets – be used and managed separately in a manner similar to coding dictionaries?
  3. We may need some new general classes as well, such as the long-recognized need for a general class to represent activities such as procedures.
  4. We should reassess, with the benefit of hindsight, what data really belongs in which class. For example, perhaps substance use data (smoking, recreational drugs, alcohol) might be better represented as findings along with other lifestyle characteristics, which would better align with how such data is represented in healthcare systems.  Disposition data might fit better as an activity rather than event.
  5. Thorough definitions for each variable (a task already in progress), and variable names that are more intelligible – without being limited to 8 characters with a domain prefix – are mandatory.
  6. We should remove redundant information that can easily be looked up (as Jozef Aerts has long proposed). Lookups can be made via define-xml codelists or web services.
  7. Other non-backwards compatible corrections to known issues, deep in the weeds should also be addressed – such as distinguishing timings associated with specimen collection from point in time result findings – and resolving that strange confusion between collection data and start date in the Findings class.
  8. Perhaps a reconsideration and simplification of the key structure is in order, replacing the Sequence variable with a unique observation identifier/Uniform Resource Identifier (URI) that can be referenced for linked data purposes and make it easier to represent more complex associations and relationships (including the ability to be extended dimensionally with meta observations such as attributions and interpretations). This would be part of a richer metadata structure that should also support the representation of concepts.
  9. A more advanced extension mechanism that replaces the cumbersome supplemental qualifier approach is critical (such as the one already proposed by SDS) so users can easily incorporate those special use case variables mentioned in item 2 above.
  10. And we need the ability to align better with other healthcare-related information, to make it possible to use clinical study data with other real world data sources, and the courage to modify the SDTM to facilitate such alignment where appropriate.

Now, some might argue that this is still limiting ourselves to 2-dimensional representations here – which is indeed a valid criticism.  But maybe the longer term solution involves more than one representation of the data.  Perhaps we have a broad patient file with both structured and unstructured source information as a sort of case history, and representations/views in tabular structures that are derived from it – an old idea which might be getting closer to prime time.  Thinking beyond the table/dataset way of thinking should certainly be part of the exercise.

I know many are already impatient for change (at least as far as XPORT is concerned), and others feel we should just throw it all away and adopt more radical solutions.   But my personal feeling is that we need to keep what we have, which has already taken us much farther than we could have imagined 15 years ago, and build from that.  The approach echoes that of a 2009  New Yorker article by the great Atul Gawande about the upcoming healthcare reform, where he advocated building up from our history of employer-provided insurance rather than jumping to something radically different, like single-payer.  “Each country has built on its own history, however imperfect, unusual, and untidy… we have to start with what we have.”

So whatever we do, we should start with SDTM as governing model that really drives implementation, with more extensive metadata, clear definitions, complex datatypes, and a simpler extension mechanism.  An improved SDTM can drive implementation and result in a more streamlined implementation guide, that also shows how to apply research/biomedical concepts, controlled terminologies and computer-executable rules (e.g. for verifying conformance, derivations, relationships, etc.) and where to find use cases and examples. Such use cases and examples (as for Therapeutic Areas) could be maintained separately in a knowledge repository, and the SHARE metadata repository would provide all the pieces and help put them together.  We start with the SDTM and metadata and build out from there.  But we need to build in a way to converge with the opportunities provided by what’s going on in the world of healthcare, technology and science.  Like the Eastbound and Westbound project teams of the transcontinental railroad 150 years ago, we should endeavor to meet in the middle.

A Recipe from the SDTM Cookbook

In my earlier posting on SDTM as a Cookbook, I described an alternative approach for defining new domain models for use with CFAST Therapeutic Area User Guides (TAUGs). Based on an internal poll of SDS team members, there seems to be a desire to create many domain models (a predilection toward splitting, rather than the lumping approach I favor).  Yet creating new domains is a frustrating and lengthy process. Although these are now mostly being modeled by CFAST teams with very specific use cases, there has been a tendency to also vet them through SDS in a more generalized form as part of a batch associated with a future SDTMIG release, a process which can take 2-3 years or more.  In the meantime, TAUGs are faced with proposing draft domain models under a stricter timeline, well before they exist in the officially sanctioned normative SDTMIG world.

What a waste — and it gets worse.  Once the domain is issued as final as part of an SDTMIG version update (indeed, that’s assuming the SDS team consensus allows it to and it actually passes the comment process) it now has to be evaluated by FDA before they can determine whether they’re ready to accept it.   Although 15 TAUGs have been posted to date, the FDA has yet to clearly indicate their readiness to accept any of them.  And the acceptance process has also been excruciatingly long (it took nearly 2 years for FDA to announce readiness to accept SDTMIG v3.2 – even then with some restrictions). In the meantime, people simply make up some other approach to get their daily work done – the antithesis of standards.

Let’s take a current example of how we may have applied the cookbook approach to draft domains included with the just-posted TAUG-TBv2. This TAUG includes 5 new draft domains as well as revisions to 3 existing domains which are presented as SDTMIG domain models.  One of the new domains (which was vetted with SDTMIG v3.3 Batch 2 and also used in Asthma but not yet released as final) is the RE (Respiratory Physiology) domain. This is a Findings General Class domain, which is mostly consistent with the current version of the SDTM v1.4 except for the addition of 3 new variables:  REORREF, RESTREFN and REIRESFL. (An earlier version of this domain was also included in the Asthma and Influenza TAUGs.)

Now a cookbook recipe might present this RE domain as a list of steps to follow to “roll your own” domain.  Instructions might include:

  1. Create a new custom domain with the standard identifiers and all variables from the SDTM v1.4 Findings General Class.
  2. Assign the Domain Code and prefix “RE” from controlled terminology.
  3. Insert the standard timing variables that are typically provided with a Findings Physiology domain.
  4. Create the following 3 new Non Standard Variables (NSVs):
    1. REORREF

(Note definitions for these might be pulled from a newer version of the SDTM which is now being updated more frequently, or else from a CDISC Wiki resource).

  1. Remove any unnecessary or irrelevant permissible variables such as REMODIFY, RETSTDTL, etc. – just like you do with published domains. (Note that these are all permissible variables – assigning a new NSV as Required or Expected would be a complication, but this would be an odd choice for a newly created variable anyway).
  2. Add any additional NSVs in the usual manner (this would be much smoother if the new proposed method of putting NSVs in the parent domain was adopted).
  3. Apply other controlled terminology bindings for variables within the domain (such as RECAT, RETESTCD, etc.  that are declared in a sample define.xml file that’s posted along with the recipe.

As the output of this exercise one would normally create the define.xml metadata for the domain as well as a RE.XPT file (which will later be populated with data values).   The sample define file included with the recipe would also specify which controlled terminologies apply to both the standard and new non-standard variables (I assume that new terminology values that would be intended for use in this domain would be created through the normal terminology request process, and simply referenced in the define-xml example). The recipe could still provide a draft domain in the usual Word Table or Excel format – but this would be presented as an example rather than a normative specification, similar to including an illustration or photo in a recipe.

I believe it should be sufficient to apply the standard class-level validation rules (which include checking for controlled terminology assignments), which can be addressed separately from the domain model, so there should not be any specific new user acceptance testing required by FDA. FDA might also specify separate content-based checks they may want, but these can be added at any time later, once they’ve had a chance to review submissions using this model.  But new rules can also be added outside the IG.  And while it will be technically a non-binding custom v3.2 domain in each submission, if it’s conforming to the recipe (which can be clearly stated in the define) it can serve the same purpose as a new SDTMIG domain in a future version. The difference is that it can be put directly into use.  A beneficial side effect is that this also encourages early testing among the research community, which might result in beneficial tweaks to the recipe, which can be maintained over time and augmented with more and more examples suggested by adopters in a crowd-sourced Wiki sharing environment, which should only serve to make the domain model more solid over time.   Sure, this might require review and curation by the SDS team, but that should be a lot less onerous than the current process.

The benefits of such an approach include:

  • Making it simpler and easier to create new domain models based on existing published versions, which might help shorten the development time for TAUGs
  • Allowing sponsors to adopt these new models more rapidly without waiting for new domains or FDA announcements
  • Making it possible for FDA to accept these models without a lengthy acceptance process
  • Providing an improved, rapidly evolving Wiki-based knowledge resource to help sponsors address representation of data that doesn’t fit in existing final domains in a more consistent manner.
  • Minimizing the number of new versions of the SDTMIG that have to be handled by industry and regulatory authorities.

Of course, adopting such an approach is not trivial.  It would require buy-in by FDA and industry, and would require new methods for sharing these recipe guidelines (probably via the Wiki) and a whole lot of communication and training.  But it seems to me it would be a much more practical way to move forward to extend the reach of the SDTM for new TAs in a leaner, quicker manner with fewer maintenance and version management headaches.

The “Cubs Way” to Future Submission Data Standards

Even for those who don’t follow baseball, you must have heard something about the storybook year of the out of nowhere Chicago Cubs in 2015.  No, they’re not going to win the 2015 World Series, but they made the Final Four, and somehow, that didn’t feel like losing this time around.

You must know this about the Cubs: 107 years since their last championship, which is generally acknowledged as the benchmark of futility in professional sports.  For clinical data geeks, you might think in terms of a similar drought — the many years we’ve been handicapped with SAS V5 transport format (XPT).  XPT stems from the days of the Commodore computer, 5-1/4” floppy disks and MS-DOS 640kb memory limits, and while it hasn’t been around quite as long as the Cubs’ last World Series trophy, it’s a Methuselah in tech years.

However, just like the Cubs and their venerable Wrigley Field, it looks like it’s going to be around for awhile, and definitely needs some attention.  So can we learn any relevant lessons from the 2015 Cubs?

  1. Think long term – with a plan. The old Cubs way (overpriced has-been free agents and bad trades) had never worked, so the new regime sacrificed current performance for the promise of future competitiveness, losing enough games to gain high draft picks and flip-trading useful veterans for uncertain prospects.  With respect to XPT, this might mean living with a partial improvement (like the CDISC Dataset-XML) for awhile while working on a separate longer-term solution that will will keep us competitive for decades.
  1. Keep meeting current needs (but only to a point). The Cubs still had to field a team that showed enough to keep fans on board and invested in the future.  In our world that means giving users time to gain basic literacy and get the most value possible out of current CDISC data standards with XPT (and maybe Dataset-XML), now that those will be required by FDA and PMDA (who aren’t about to change suddenly before the rule formally goes into effect).   This might also mean that we limit the degree of change to the current published standards with some minimal fine-tuning that users can easily absorb until they gain basic literacy, while concentrating most of the attention on that much more robust next generation solution that will make the big leaps tomorrow.
  1. Be patient so the prospects can develop.  In other words, even if the future solution isn’t necessarily mature now, that may be fine as long as it’s got the talent to take you a where you need to go in the future.  Such a description might fit HL7 FHIR and the Semantic Web, for example.
  1. Fill in the missing pieces along the way. – The Cubs soon realized they needed more starting pitching and situational hitting, which will guide their winter and spring moves for next year.
  1. Don’t worry about future salaries (I mean file size)! In 1908, the highest paid star baseball player made $8500, and in 1988 a floppy disk held 1.44 MB, less than a typical MP3 song that you can play from your watch.  This should not be an obstacle to moving beyond XPT.  Things get bigger over time; get over it.

Of course, the jury’s still out on whether the Cubs will ever make it, but it seems there’s more excitement about next year here in the Windy City than ever before.  It would be wonderful if we could say the same sort of thing about the future of clinical data by spring training, 2017.

R.I.P. Time for Supplemental Qualifiers

Warning:  this one’s primarily for SDTM geeks.

Back when the SDTM and SDTMIG v3.1 were being created circa 2003, there was never a delusion that the SDS team had thought of everything.  The SDTMIG domains were created by taking the least common denominator among CRFs from several major pharma companies.  It was always understood that we could only standardize on a core set of variables – that individual studies would almost always find cases when they’d need to add additional variables for some specific purpose.

The chosen solution for handling these extra variables was (shudder) supplemental qualifiers (SQs).  The original use case for SQs was to provide a way to represent flag variables like “clinically significant” that might be attributed to different people – an independent reviewer or DSMB, for instance.  But this was expanded to handle other variables that didn’t fit within any variable defined in the general classes.  A name/value pair structure with a different row for each variable and its value was adopted – quite flexible, but not very user friendly.  This was not viewed as a problem by all — there was a perception (held by one our FDA observers among others) that by making it difficult to represent SQs, sponsors would be disinclined to use them, and thus the standard would be leaner and more consistent and not get cluttered with other messy data.

But that assumption was wrong.  It turned out that many standard domains often need additional variables to fully describe a record – often essential information related to a specific therapeutic area.  So the SQ files kept getting bigger and bigger.

And SQs were clunky in many ways.  It was necessary to use value-level metadata to describe the variable in define.xml files.  And some tools or users had difficulty merging them back into parent domains.  And because they were so unwieldy, voluminous and hard to read, some reviewers simply gave up looking at them at all, resulting in risks that critical information might be missed during a review.

So some SDS team members wisely proposed an alternative proposal to place these SQs (which they renamed “Non-Standard Variables” or “NSV’s”) in the parent domain.  Instead of physically separating these out into another file structured differently, the proposal appended these to the end of the dataset record and relied on Define metadata to tag these as non-standard.  The metadata tag would make it straightforward to strip these out into a separate SuppQual structure if that was still needed for some reason (such as conforming to a database load program expecting such a file) but the dataset would already include these variables where they belong so they’d be less likely to be missed.

But this reasonable proposal wasn’t viewed as a panacea by everyone.  FDA was still concerned that this would encourage sponsors to add more and more unnecessary variables – which might just be noise to a reviewer.  And they worried about increasing the file size beyond their acceptable limits.  (But at least they didn’t disagree that these were a whole lot more trouble in their present form than they anticipated).

Meanwhile, other members of the SDS team objected to the proposal as an unnecessary change – since most companies had already invested in ways to create these and didn’t want to have to change again (even if the datasets would be more useful and their processes simpler if they did).  This, of course, is the notoriously stubborn  “sunk cost” fallacy.

But let’s pause now for a moment.  We know that the current SuppQual method is a clunky solution, which was already revised once (in SDTMIG v3.1.1, when a single file proved unmanageable and too big to submit), and that we still hear it can cause review problems for many and is seen as an unnecessary extra non-value added step by many more.  But we don’t want to offer a simpler and more efficient solution instead because we’ve already invested in the clunky solution?  Hello?

So, here’s another suggestion.  Let’s create a separate file with the exact same structure as the parent domain – namely, use the SDTM unique keys (STUDYID-DOMAIN-USUBJID-XXSEQ) and add in all the NSVs as additional columns.  Such a structure would allow full metadata representation in Define-XML – just like the other variables — and is a much simpler merge (and, for sponsors, also a simple split to take them out).  To allow applications to realize that this is a different merge from the normal SUPP—format, perhaps a new file name prefix can be used, such as SUPW (for “wide” or some other name, whatever).

Under such a scenario, FDA should be happy that file sizes are smaller (and smaller than the current tall and skinny SuppXX files, since they tend to expand to reserve as much space for all variables as required by the biggest), and the variables can be easily viewed in the dataset whether they’re merged or not – making it possible to only merge in the ones of interest if the possibility of noise is still a concern.

Not quite as elegant as a single file solution, but certainly seems to me better than the status quo.  And for those SDTM old-timers who still want to do it the old way, well, they can probably adapt the code they’ve already written to strip out the NSVs when they create SDTM (and put them back for their statistical programmers and internal reviewers) already and keep wasting time doing it the old way as well if that’s what really makes them happy.

Seriously, can’t we bury these SUPPxx files once and for all and try to agree to make SDTM datasets just a little more useful?  What’s the controversy with that?