SDTM as a Cookbook

The original CDISC Study Data Tabulation Model Implementation Guide (SDTMIG) was built around a core set of domains describing data commonly used in most clinical trials, much of which was necessary to understand basic product safety.  Over time, the SDTMIG has been expanded with new versions that incorporated more and more domain models.  SDTMIG v3.2 has nearly 50 domains, with several more to come with the next version.

This domain-based approach, while very familiar to those involved in study data management and analysis, has been a mixed blessing in many ways:

  • FDA acceptance testing of new SDTMIG versions has been taking a year or more post publication, and sometimes this belated support for new versions comes with exceptions — a serious disincentive to early adoption by sponsors.
  • While continually adding new domain models may be helpful to some, others may find these threatening. This is due partly to the excessive effort of change management in a regulated systems environment in general, and especially so if they’ve already developed alternative approaches to modeling similar data in custom domains, which might then require last-minute database changes to support new SDTMIG versions for ongoing drug development programs.
  • It takes the SDS team at least two years to release a new SDTMIG version, and its already massive but steadily increasing size and complexity make updating more difficult, and adoption and understanding by end users burdensome. And publicly available validation tools still have not been able to keep up.
  • Meanwhile, the long timeline has made it necessary for the timeboxed CFAST Therapeutic Area (TA) User Guides to issue Draft or Provisional domain models to capture TA-specific requirements, which FDA seems reluctant to accept, again discouraging early adoption and injecting more fear, uncertainty and doubt since such domains may still undergo change before being added to a new SDTMIG version.

So this spiral of steadily increasing complexity and version management seems to be a cause of widespread consternation.

Cooking Up an Alternative Solution

What if we considered the SDTMIG as something more like a cookbook of recipes rather than a blueprint?  A recipe describes the general ingredients, but the cook has discretion to substitute or add some additional ingredients and season to taste.  What if the SDS and CFAST teams took a more flexible, forgiving approach, rather than continuously creating more and more new domain models (sometimes with seemingly overlapping content) and waiting to embed these within a new version of the normative SDTMIG every few years to address any changes or additions?

A new Therapeutic Area User Guide would continue to describe which existing final SDTMIG domains may be most appropriate for representing TA data such as relevant disease history and baseline conditions and, when necessary, would include directions for cooking up a new custom domain for, say, key efficacy data.  The recipe would state which SDTM general class to use, which variables from that class apply, what current or new controlled terminologies should be used, and other implementation details, such as any incremental business rules that might be used for validation checks.  In lieu of a traditional domain model, the recipe could include an example of the Define-XML metadata for the domain.  But it’s up to the cook to adjust for the number of guests at the table, prepare, add garnishes and serve.  Such a recipe could be referenced (just like an SDTMIG version and a domain version) in a Define-XML file (a proposed new v2.1 of Define-XML has features to support this) as if it were a standard, but, in reality, it would be a best practice guideline.
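To make the idea of a recipe a little more concrete, here is a minimal Python sketch of what one might contain and how a dataset could be compared against it. Everything in it is an assumption made for illustration: the domain code "XE", the variable names, the core designations and the class and method names are hypothetical and are not taken from any published TA User Guide or from Define-XML.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RecipeVariable:
    name: str                       # variable name, e.g. "XETESTCD" (hypothetical)
    core: str                       # "Req", "Exp" or "Perm"
    codelist: Optional[str] = None  # name of a controlled terminology list, if any


@dataclass
class DomainRecipe:
    domain: str         # custom domain code (hypothetical)
    general_class: str  # "Findings", "Events" or "Interventions"
    variables: List[RecipeVariable] = field(default_factory=list)

    def missing_columns(self, columns):
        """Return Req/Exp variables that the dataset does not contain."""
        present = set(columns)
        return [v.name for v in self.variables
                if v.core in ("Req", "Exp") and v.name not in present]

    def unexpected_columns(self, columns):
        """Return dataset columns that the recipe does not describe."""
        known = {v.name for v in self.variables}
        return [c for c in columns if c not in known]


# An illustrative recipe for a made-up efficacy domain in the Findings class.
recipe = DomainRecipe(
    domain="XE",
    general_class="Findings",
    variables=[
        RecipeVariable("STUDYID", "Req"),
        RecipeVariable("DOMAIN", "Req"),
        RecipeVariable("USUBJID", "Req"),
        RecipeVariable("XESEQ", "Req"),
        RecipeVariable("XETESTCD", "Req", codelist="XETESTCD"),
        RecipeVariable("XETEST", "Req", codelist="XETEST"),
        RecipeVariable("XEORRES", "Exp"),
        RecipeVariable("XEORRESU", "Perm", codelist="UNIT"),
    ],
)

columns = ["STUDYID", "DOMAIN", "USUBJID", "XESEQ", "XETESTCD", "XETEST", "XEGRADE"]
print(recipe.missing_columns(columns))     # Req/Exp variables absent from the dataset
print(recipe.unexpected_columns(columns))  # columns the recipe does not describe
```

In the cookbook spirit, a column the recipe does not describe need not be an error. It may simply be the cook's own addition, to be documented in the define file.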

Representing domain models as recipes (or guidelines if you prefer) would have the advantage of producing domains that would already be acceptable for FDA submission (since custom domains are being accepted by FDA), while still being checked for consistency with controlled terminology and validation rules.  And adopters of CFAST TA standards might start using these more quickly, allowing these to be field tested so they can evolve more rapidly without waiting years for a new SDTMIG version to bless them.  As people learn more and gain more confidence, they can post additional examples and comments to a Wiki that can inform future users, so everyone builds on everyone else’s experience over time.
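As a rough illustration of what "checked for consistency with controlled terminology" could mean for a recipe-driven domain, the sketch below tests the values of one variable against a codelist. The function name, the record layout (a list of dicts) and the codelist contents are assumptions for this example only, not an existing validation tool or a published terminology.

```python
def check_codelist(records, variable, allowed, extensible=False):
    """Return (row_index, value, severity) for values not found in the codelist.

    For an extensible codelist an unknown value is only flagged as a warning,
    since the sponsor may legitimately have added a term.
    """
    severity = "warning" if extensible else "error"
    findings = []
    for i, rec in enumerate(records):
        value = rec.get(variable)
        if value is None or value == "":
            continue  # null handling is left to a separate Req/Exp check
        if value not in allowed:
            findings.append((i, value, severity))
    return findings


# Hypothetical test codes for a made-up custom domain.
xetestcd_terms = {"TUMSIZE", "LESCNT"}
records = [
    {"XETESTCD": "TUMSIZE"},
    {"XETESTCD": "tumsize"},  # wrong case, so not in the codelist
    {"XETESTCD": ""},         # empty, skipped by this check
]

print(check_codelist(records, "XETESTCD", xetestcd_terms))
print(check_codelist(records, "XETESTCD", xetestcd_terms, extensible=True))
```

A check like this depends only on the recipe's terminology, not on the domain having been blessed in a particular SDTMIG version.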

Under this approach, the SDTM becomes the real model/basis for the standard, and the domain models would be treated as representative use cases.  The recipe can also include incremental validation rules, but must remain faithful to any SDTM model-level rules (which would have to be promoted upward from the IG presumably). New models may need new variables in many cases, and these variables should be added to newer versions of the SDTM.  But, assuming the SDS team finally adopts a more efficient manner of representing Non-Standard Variables (as they have already proposed in two drafts for comment), it would be easy enough for custom recipe-driven domains conforming to a current version of the SDTM to add each necessary new variable as an NSV at first.  These NSVs could then be uplifted to become standard variables later in a new SDTM version, which would only require a simple change to an attribute in the define.xml once that new version becomes available. Either way, the new domain model can be used immediately by applying currently available structures in the SDTM with some incremental new terminology (such as a new domain code describing the contents and a new controlled term for a variable name).
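The "simple change to an attribute" can be pictured with a small Python sketch. It assumes that each variable's metadata carries a flag marking it as non-standard; the flag name is_nonstandard, the variable names and the promote function are hypothetical and do not reproduce the SDS draft proposals or the actual define.xml attribute.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class VariableMetadata:
    name: str
    label: str
    is_nonstandard: bool  # hypothetical flag, standing in for a define.xml attribute


def promote(variables, newly_standard):
    """Return new metadata in which NSVs named in newly_standard become standard.

    The data itself is untouched; only the metadata flag changes.
    """
    return [
        replace(v, is_nonstandard=False)
        if v.is_nonstandard and v.name in newly_standard else v
        for v in variables
    ]


# Metadata for a made-up custom domain built against the current SDTM version.
before = [
    VariableMetadata("XETESTCD", "Short Name of Test", False),
    VariableMetadata("XERESP", "Response Category", True),  # added as an NSV
    VariableMetadata("XENOTE", "Sponsor Note", True),       # stays an NSV
]

# Suppose a later SDTM version adopts XERESP as a standard variable.
after = promote(before, newly_standard={"XERESP"})
for v in after:
    print(v.name, "NSV" if v.is_nonstandard else "standard")
```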

This does not mean the SDTMIG goes away – it’s still needed to describe the context, general assumptions and rules, and show how to implement key SDTM concepts like the 3 general classes, trial design model, special purpose and relationship classes.  But the IG wouldn’t need to cover each new variable and domain anymore and could focus on explaining core SDTM concepts with assumptions and rules instead.  It could make for a leaner SDTMIG, which doesn’t have to change so often, and impart more flexibility and agility in the use of domains.

Such an approach could also be a good first step toward decoupling the SDTMIG from the highly cumbersome, restrictive use of SAS v5 Transport files, and make the model more adaptable for implementation in new technological frameworks, such as HL7 FHIR and the semantic web.

Of course this still leaves a number of challenges in terms of terminology management, but that’s a topic for another day.

In the meantime, what do you think about a simpler SDTMIG that emphasizes applying the general model with terminology to meet the needs for new CFAST domains?


9 thoughts on “SDTM as a Cookbook”

  1. When I started working with SDTM, implementing it into my SDTM-ETL software, I regarded the SDTM-IG as a cookbook, and I still more or less do so. The SDTM-IG told me which variables I really needed for a domain (the „required“ ingredients), which ones the FDA would like to see (the „expected“ ingredients) and which ones are optional („permissible“ ingredients), but can improve the quality of my cake or dish (or submission). It also told me the order in which I needed to mix the ingredients (the order of the variables), although this may be less critical.
    Unfortunately people have more and more transformed it into a „code of law“. I see the following reasons for this:
    – fear for the FDA (if I strictly follow the IG, I cannot do anything wrong…)
    – overinterpretation of the IG (even examples are interpreted as being the „law“)
    leading to „rules“ developed by a few people (not consensus based) that were never intended to be so by the IG developers, and gratefully taken over by the FDA
    – the development team giving in to FDA „requirements“ even when the latter do not make sense, damage the model or make it more complex, or introduce redundancy and thus must lead to inconsistencies in the data (e.g. many of the „derived“ variables)
    – the myth or misbelief that (in a first step) the quality of a submission can be estimated/measured by counting the errors when using the validation tool (FDA data fitness)

    My first experience of this was when I was asked to check a set of SDTM datasets. The sponsor had slightly changed the label on one of the variables to BETTER explain it to the reviewers. Panic arose at the sponsor’s regulatory department when this led to an ERROR reported by the validator tool used by the FDA. What was meant to increase the quality was interpreted by the tool as decreasing the quality.
    It is a shame that I must justify each deviation from these „rules“ in the reviewers guide, even when the rule is wrong, leads to false positives, or overinterprets the IG. It does not make sense that I need to justify that my submission is not using the newest CDISC controlled terminology when I have a 15 year old legacy study. Similarly, I should not need to justify to my family replacing the strawberries on the top of the cake with pieces of apple because it is apple season and not strawberry season. I followed the cookbook but decided to replace one of the ingredients for very good and obvious reasons.

    We need to start seeing the define.xml as „leading“, i.e. what is in the define.xml is the „sponsor’s truth“. This may completely comply with the SDTM-IG (nice!) but can also deviate considerably, yet it remains the „sponsor’s truth“. For example, I can explain in my define.xml that I replaced one ingredient (variable) by another, as this better explains the data in the submission. But unfortunately, current validation tools usually do not even look into the contents of the define.xml…, leaving no flexibility at all. And when I then see that people start manipulating their data in order to comply with the SDTM-IG and/or controlled terminology, …


  2. Thanks, Jozef. Indeed, basing the cookbook on the SDTM rather than the SDTMIG was proposed to allow for more flexibility. This would likely require more work on the SDTM (for example, certain terminology requirements could be promoted up from the IG level, and some validation rules can apply to the general classes as well as individual domains). It should be clear that not all validation rules are equal in impact, that examples are not intended to be normative, and that not every use of the SDTM requires a new standard domain model to be defined. I think it would be nice to see more people cooking their own, ideally driving their projects with define as you suggest, and see what comes of that. Such a smorgasbord might teach us a lot by testing things in real applications first before they’re designated as standards.


  3. Wayne and Jozef. You both are so knowledgeable that I cannot argue with you. What I would introduce to this conversation is the interest of the FDA in importing these data and combining them into a standard warehouse view. Of course it is possible to account for these variations from the SDTMIG during importing but it would make the task more complex. Can we account for this without such rigidity in standards?


    1. Dave,
      Thanks for your comment. I think we often delude ourselves regarding the maturity of our current standards and what needs to be rigid. A cookbook approach would still require rigorous adherence to the SDTM, business rules and a core set of domain models, and a much richer metadata description of custom domains together with use of controlled terminologies. Truth is there’s already plenty of wiggle room in existing standards, and a lot of time can be wasted on conformance that isn’t necessarily value-added (like the labels Jozef mentions). And there’s not enough time spent on testing and evaluation of new extensions. I believe an SDTM-based approach with a robust set of controlled terminologies would fit very well within the FDA’s data warehousing strategy — possibly better than continually defining more and more domain models which may suggest more rigidity than warranted. But we have a long way to go before we can realize the value of the current standards — we don’t have the protocol context associated with the data, we don’t have all the necessary associations and relationships clearly identified, and we don’t have a submission format that allows for richer datatypes and more robust representation of truth. But I feel we can get there sooner with a cookbook approach based on the SDTM.


  4. Thanks Dave!
    Maybe I understand something else by “data warehouse”, but I always thought that in order to populate a data warehouse (through an ETL process) one first needs to have good transactional/operational databases. I have not heard yet that the FDA has a well organized database where it stores all its SDTM submissions.
    I have experimented a lot with native XML databases, and would consider one an ideal solution for the FDA, i.e. with each submission being a separate “collection” in the native XML database. This would then be the basis for doing comparisons between studies, doing aggregations etc. Based on XML, this would also free the FDA from requiring SUPPQUAL datasets for NSVs and for records > 200 characters.
    The advantage of a native XML database is that one can easily query and do ETL work over different collections (i.e. submissions), and store the results in other collections, i.e. the native XML database can serve both as an OLTP and as an OLAP database.
    In my opinion, having a more flexible SDTM would not undermine such activities. On the contrary, I think data quality would improve, as we now often see that people are “pushing” information into certain SDTM variables, just for the sake of being compliant with the SDTM standard.


  5. This is one of those discussions where you want to be in a room with a very large and very empty whiteboard, as against at a keyboard, isolated, while it’s dark outside and the first mug of tea of the day is still being consumed 🙂

    My immediate thought in reaction to what Wayne has written is ‘what are we trying to solve’. We have several issues raised within Wayne’s text and other issues within Jozef’s and Dave’s comments. The submission problem and getting the product approved, data pooling or what? Is there a single answer to all? I always fear that ‘SDTM’ being the answer to multiple challenges is not the answer. To continue the analogy, are we asking for someone to cook a single dish that I can serve for every course of a banquet?

    The idea of a cookbook with the current industry mentality fills me with fear. Show me a recipe and I have no doubt that my ‘creation’ won’t look like the pretty picture in the cookbook! 🙂 Show that recipe to 100 folks and god knows what will happen.

    I agree with Wayne that we need to have a better way to deliver standards (new versions, fix errors). We are flooding industry at the present time.

    We can say what variables are required and which are optional etc but it is the use of variables in combination that causes a lot of issues (if TEST is X then this other variable is required, when Y it is not etc etc etc). Knowing where to put X, domain A, B or C and how best to put the combined information about X into several variables. We need more sophisticated content, we need more complete content, we are delivering standards today with holes in.

    I saw this week a nice presentation by Niels Both and Johannes Ulander (S-Cubed) at PhUSE on the use of --CAT within the new TA domains. They identified, if memory serves me, 4 different mechanisms in the way --CAT was being used within MH across several TAs. One variable, 4 mechanisms! I have an example where I looked at one scale from one TA. I found holes in the specification (as Jozef said, examples being used to fill holes in specification and then being taken as gospel).

    So I see many issues: inconsistent specification, incomplete specification, delivery of specification to industry and ease of consumption (where Wayne was directing his thoughts I think), interpretation of standards within industry, the mind-set within industry, FDA capability and the regulators’ mindset. And soon we will see differences in regulatory consumption with Japan coming on-line and the inevitable US v Japan SDTM flavours.

    I cannot disagree with anything in the blog or comments from Jozef & Dave. But are we talking about a single problem? Is SDTM the only answer?

    To the whiteboard Robin!


    1. Good point asking “what problem was I trying to solve?” I certainly did not mean to imply that SDTM is the only answer to anything, but in cases when we are limited to SDTMIG SAS XPT implementations (e.g., for regulatory submissions) what can we do to make it easier to evolve SDTMIG (e.g. for TAUGs) without waiting 3+ years for a new version to come out and then be acceptable to FDA? It is my view that we need a 2-pronged approach: (1) what can we do within the current constraints to make things better, while simultaneously working on (2) what can we do to define a better long-term solution and hopefully wait for the industry to come to their senses and use it. I fully agree that there are many issues with the current SDTMIG (such as the --CAT example you mention), but I think tweaking a recipe to specify an ingredient, like using a specific codelist for a defined purpose, is easier than creating a new version of the SDTMIG. Is there a virtual whiteboard we can link to from WordPress in your utility belt, Batman?

