The Lumping and Splitting Acid Test

Call me “Lumpy.”  And I’m a lumpaholic.  I acknowledge this predilection, because I simply feel lumping makes things easier, and I’m generally more comfortable with big pictures than detailed nuances.

But I’m not a lumping zealot.  I have many close friends who are splitters, and I can respect their point of view some of the time.  It’s not unlike the dynamic between friends who are either Cubs or Sox fans in Chicago, for instance – we can agree to disagree, even when they’re wrong.

The term “lumping and splitting” apparently was coined by Charles Darwin with respect to how to classify species, and is explained nicely here:

  • (For lumpers) “Two things are in the same category unless there is some convincing reason to divide them.”
  • (For splitters) “Two things are in different categories unless there is some convincing reason to unite them.”

This clearly defines the opening position of each camp, and relies on the existence of a compelling and logical reason to go the other way, which sounds reasonable enough.

Now a partisan divide of lumpers and splitters also exists in the CDISC SDS/SDTM microcosm, where the term is applied to the decision on whether to create new domains.   Lumpers believe there should be a limited number of domains, to make it easier for sponsors to know where to put data.  Splitters want to create many more domains with finer distinctions between them.  The CDISC SEND team follows a very fine-grained splitting approach.  But that does not necessarily mean human trials should follow the same pattern, since a lumping approach has also been followed with questionnaires and lab results data.  In the latter cases, the SDTMIG describes a way to split these often massive lumped domains into separate files to conform to FDA file limits, but they’re still considered the same domain.

This difference of opinion has recently been illuminated as the team has wrestled with how to represent morphology and physiology data, a topic that’s critical to CFAST Therapeutic Area standards.  Some years ago, the SDS team decided to separate physiology data by body system, reserving some 15 separate physiology domain codes even though specific use cases had not yet been defined for all of them, while lumping morphological observations together in a single domain.  This proved problematic, because it wasn’t always clear which type of test belonged where.  So the team decided (through an internal poll) to merge morphological observations into the various physiology domains.  An alternative lumping proposal, to combine all physiology data in a single domain and use a separate variable for body system (as was already the case in Events and the PE domain), was put forward but did not gain sufficient traction.

Splitting may be fine as long as there’s that convincing reason and it’s clear where to split – like the “Tear Here” perforation on a package.  We can easily see that AEs differ from Vital Signs (though the latter may indicate a potential AE).  And I’m not suggesting that we lump together domains previously released (well, not all, anyway).  But what do you do with a complex disease such as cancer or diabetes, or a traumatic injury that affects multiple body systems?  What happens with, say, observations related to a bullet wound that penetrates skin, muscle and organs when you have to split these over multiple domains?

In such cases, wouldn’t a physician – or FDA medical reviewer – want to see all of the observations relevant to the patient’s disease state together rather than jumping around from file to file and trying to link them up?  And how are patient profile visualizations affected when multiple, possibly related morphological and physiological observations are split across many separate files?  Patient profiles tend to look in a finite number of domains for core data (typically DM, EX, LB, VS, AE and CM), and adding in dozens of others is likely to be challenging.  And maybe lumping would reduce the constant stress of having to accommodate more and more new domains with each new version, if SDS didn’t feel a need to keep adding them each time.

This brings me back to a “first principle” – making it easy on the reviewer.  If you know exactly what you’re looking for, and you’re confident everyone else knows to put the same type of data in the same place, then maybe it’s easy to point to a separate fine-grained domain.  But what if you’re looking to see possible associations and relationships that may not be explicit (a difficulty FDA reviewers have lamented previously)?  I think for most of us who are familiar with spreadsheets and other query and graphics tools, it’s far easier to have all the data you want in one place and rely on filter and search tools within a structured dataset than to search and query across a directory full of files.  And that’s one reason why FDA has been working so long on moving from file-based data review to their Janus Clinical Data Repository (CDR) – so reviewers can find the data they want through a single interface.  In my experience, CDRs (including Janus) do not usually split most findings-type data by domain.
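The filtering argument above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the test codes and results, and the helper name `filter_findings` are all hypothetical, and the field names merely echo SDTM-style variables rather than any actual SDTMIG domain.

```python
# Hypothetical lumped findings records; all values are invented for illustration.
findings = [
    {"USUBJID": "S1-001", "BODSYS": "CARDIOVASCULAR", "TESTCD": "LVEF", "ORRES": "55"},
    {"USUBJID": "S1-001", "BODSYS": "RESPIRATORY", "TESTCD": "FEV1", "ORRES": "2.9"},
    {"USUBJID": "S1-002", "BODSYS": "CARDIOVASCULAR", "TESTCD": "LVEF", "ORRES": "48"},
    {"USUBJID": "S1-002", "BODSYS": "NERVOUS", "TESTCD": "NCV", "ORRES": "41"},
]


def filter_findings(records, **criteria):
    """Return the records whose fields equal every given criterion."""
    return [
        rec for rec in records
        if all(rec.get(field) == value for field, value in criteria.items())
    ]


# Everything recorded for one subject, regardless of body system.
subject_view = filter_findings(findings, USUBJID="S1-001")

# One body system across all subjects: the "split" view, recovered by a filter.
cardio_view = filter_findings(findings, BODSYS="CARDIOVASCULAR")

print([rec["TESTCD"] for rec in subject_view])
print([rec["USUBJID"] for rec in cardio_view])
```

With one lumped structure, both the per-patient view and the per-body-system view are a single filter away; with one file per body system, the per-patient view would first require opening and combining every file.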

Lumping solutions should be more consistent between sponsors and studies, since there’s a simpler decision on where to put data.  Just like when shopping at Costco, it’s easier to make a decision when there are fewer choices involved. And just as it’s quicker and easier to pick out some bathroom reading from a single shelf rather than have to search through an entire library, lumping should make it quicker and easier to access the data you want.

So, what exactly is the acid test?  Whichever approach is chosen for SDTMIG in the future (and it should be up to the community to weigh in when the next comment period begins), one overriding principle must be in place: a new domain split should never be created unless there’s a clear rationale for splitting (and in the case of Physiology, I don’t really see that myself), and it’s absolutely, unambiguously clear what kind of data the new domain is to be used for.  If there’s any ambiguity about whether a certain type of data could go here or there, then we should opt for a lumping solution instead, and allow users to rely on query and filter display tools to pick out what they want.  Meanwhile, we can rely on our statistical programmers to put data logically together in analysis files just as we always have.

So, lumpers of the world, unite!  There are simply too many other problems to tackle besides the “What domain do I use this time?” game.  Like defining rich metadata models for sets of variables with valid terminology (i.e., concepts), or defining a next generation SDTM that isn’t based on SAS XPT.

In the meantime, another scoop of mashed potatoes for me, please – lumps and all.


10 thoughts on “The Lumping and Splitting Acid Test”

  1. I’m another lumper, in whole-hearted agreement with your acid test. I just don’t see that any one study (or submission) would be likely to have findings from so many different body systems that it would become overwhelming to sort them out. Even if a submission included findings from all 15 body systems, I’m willing to rely on filters to find what I need. I remember that the original proposal for SDTM saw domain codes for findings as ways to filter what was essentially one table.
    Are there reasons not to split? I’m not looking forward to more “which domain?” debates, and I specifically worry about tests that are hard to classify (e.g., eye or nervous system?) and those that apply to the body as a whole.
    In an internal vote, the decision was to split, so that’s what will be proposed for public review. I can live with separate domains, but I think we’re making life unnecessarily hard for ourselves.


  2. I would like to throw my ideas into the ring on this topic. I believe there is a way to keep both the “lumper” and “splitter” camps happy. Would it not be possible to have a few high-level domains that branch off into subdomains? This would fulfill the “first principle” of making life easier for the reviewer, but it would also allow the possibility to drill down to the finer details if and when needed. Instead of having flat tables as submission data, we could have a relational-database-type structure, whereby data can be accessed in multiple ways. This would be more flexible, easier to maintain and maybe even more robust to future SDTM philosophy adaptations.


    1. Yes, I think that way of thinking works well with my proposal, as long as we get away from the one file per domain approach. Drilling down within a lumped domain file using filters and queries is a pretty tried and true method, but we have to move away from so many separate XPT files. I have always wondered why we need a DOMAIN field on every record if we’re holding fast to the one domain per file rule. If we allowed DOMAIN (or --BODSYS) to represent the more granular domain, we could keep them all in one file. Regarding a relational DB structure: that’s more a matter of where you put the data, and we’re still constrained by FDA’s requirement to use SAS XPT to exchange data.
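As a rough sketch of the idea in this reply, the snippet below keeps everything in one collection and lets a per-record DOMAIN value produce the finer-grained views on demand. The helper name `partition_by`, the records, and the two-letter codes are placeholders chosen for illustration, not a proposal for actual domain assignments.

```python
from collections import defaultdict


def partition_by(records, key):
    """Group records into a dict of lists keyed by the value of `key`."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)
    return dict(groups)


# Hypothetical single-file findings data; DOMAIN carries the granular code.
lumped = [
    {"DOMAIN": "CV", "USUBJID": "S1-001", "TESTCD": "LVEF"},
    {"DOMAIN": "RE", "USUBJID": "S1-001", "TESTCD": "FEV1"},
    {"DOMAIN": "CV", "USUBJID": "S1-002", "TESTCD": "LVEF"},
]

# Splitters get their per-domain views; lumpers keep the single source.
views = partition_by(lumped, "DOMAIN")
print(sorted(views))
print(len(views["CV"]))
```

The same call with `"USUBJID"` as the key would give per-subject views, which is the sense in which one lumped file can serve both camps.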


  3. The SDTM standard has evolved, and the principle of “making it easier for the reviewer” has led us to abandon the first principles of SDTM itself, like “no derived data”. Many things for “making it easier for the reviewer” should be done by the tools, not by the model. Examples are automatic derivation of “study day” (--DY variables), or even derivation of EPOCH. Tools like the open source “Smart Dataset-XML Viewer” do exactly this, so it can’t be that hard.
    Many people still think about the SDTM as a relational database (it once nearly was), but by leaving the SDTM first principles, it nowadays is a “view” on a relational database at best, offending many of the rules of good database design, like “no derivations”, “no data redundancy”, “no transitive dependencies” (the well-known normal forms). It is an illusion that one can regenerate a correct relational database from a view. The only real relational database resides at the sponsor (if at all), not at the FDA. When treating a view on a database as being a relational database, disaster is preprogrammed.
    Another problem is that the way SDTM describes relations between data points (Wayne’s bullet wound example) is lousy. We use a RELREC table to describe that records (not data points) are related, but do not even describe the relationship in detail. For example, was the lung dysfunction cause or result of the bullet wound? SDTM does not tell us.
    When starting working on Dataset-XML, I proposed to move “STUDYID” and “USUBJID” out of the “table” (ItemGroupData) as the ODM already sets the study ID near the top of the file (StudyOID) and we group the data by subject ID anyway (“SubjectKey” in ODM). But for that, the SDTM model (defining everything as 2-dimensional tables) would have to be changed, or at least an alternative transport mechanism published in an Implementation Guide. That was too revolutionary.
    Another problem is that in SDTM, the order of the variables is fixed. Why? In database tables, the order of columns doesn’t matter, why should it in SDTM? Probably for “ease of review”. But isn’t that the task of the tool? The “Smart Dataset-XML viewer” allows you to swap and shift columns.
    I have no problem with the principle of “1 domain, 1 file”, though nowadays it is not reality anyway. When generating SDTM datasets using our “SDTM-ETL” software, I always advise my customers not to let it come to the point of having to split datasets due to the FDA size limitations. Instead, I advise them to generate different instances of the same domain right from the start. So for example for LB, I advise them to have different instances based on e.g. data source (e.g. different labs), or category (chemistry, hematology, …). That makes mapping and life so much easier.
    Combining data for visualization or analysis is again a task for the tool; essentially it should be more or less independent from how the data is organized. I still have the strong impression that the Janus-CTR does not really allow this, and that reviewers are still “jumping around from file to file and trying to link them up” (Wayne’s words). Essentially, it should take one key stroke (F1 or so) to toggle between related data points (again, the “Smart Dataset-XML Viewer” does this).
    Lumping or splitting? I too believe that splitting (different domains for similar things) is a sign of weakness. Weakness of the model, weakness of the tools. Why shouldn’t it be allowed to have a non-standard variable (NSV) in the middle of normal variables? We can mark them as such in the define.xml anyway. When my mapping tool generates the domains table, the user is now already overwhelmed with a huge table in the GUI. Beginners are immediately confused… When I started working on the software about 10 years ago, I believed what the SDTM people told me: that there would never be more than 20 domains.
    I am an admirer of HL7-FHIR. Those having to cope with HL7-CDA (I do) know how hard it is to learn it (just like SDTM). FHIR was an enlightenment. Due to its flexibility, it is much easier to learn and to implement. In FHIR, the order of the data points (“resources”) does not really matter (in CDA it does, just like in SDTM). FHIR is an 80/20 solution: one has standardized resources (like standard SDTM variables), which should take care of 80% of the data, and one has “extensions” (like NSVs in SDTM) for things that are not standardized. Related data points are well described using the “References” resource, and easy to implement, as each data point (that is different from each record – in SDTM one can only make relations between records) can be given an ID, and can be referenced from another resource (data point). Try that in SAS XPT…
    If we apply the same design principles that led to FHIR (I am NOT saying that we should use the FHIR format) to SDTM, allowing flexibility, not using SAS-XPT anymore, I think we can again come to less than 20 domains. If we go away from SAS-XPT, and move to XML or even to JSON, we could keep the relationships between data points where they belong: within the data point, e.g. using an XPath expression, or using identifiers and references to them. RELREC would then disappear. The same applies for “Comments” (CO). If we would allow variations within domains using the FHIR design principles, allowing flexibility and 80/20, allowing using different ingredients for each of our domains, we could lump domains as Wayne proposes. This would also make it so much easier for combining data (like Wayne’s patient profiles), done by tools (not by the model).
    But isn’t all this the idea of Wayne’s “SDTM as a cookbook”?
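To illustrate the point in this comment that a derived variable such as study day could be computed by a tool rather than stored in the data, here is a minimal sketch. It assumes the usual SDTM convention that there is no day 0 (the reference start date is day 1 and the day before it is day -1) and that both dates are complete; the function name `study_day` and the example dates are hypothetical, and partial ISO 8601 dates are not handled.

```python
from datetime import date


def study_day(obs_date, ref_start):
    """Derive a study day with no day 0: the reference start date is day 1,
    and the day before it is day -1."""
    delta = (obs_date - ref_start).days
    return delta + 1 if delta >= 0 else delta


ref = date(2015, 3, 10)  # hypothetical reference start date
print(study_day(date(2015, 3, 10), ref))  # 1
print(study_day(date(2015, 3, 12), ref))  # 3
print(study_day(date(2015, 3, 9), ref))   # -1
```

A viewer that applies a rule like this at display time needs only the observation date and the subject’s reference start date, so the derived value would not have to travel with every record.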


  4. Thanks, Jozef. Many great comments here. One critical point is to reach clarity on what we mean by “making it easy on the reviewer.” While things have changed since SDTMIG 3.1 was released in 2004 (when tools were not available), this first principle should be based on clear and unambiguous content, not fringe syntactical points like order of variables (the order matters for viewing the metadata consistently with the SDTM model, but I agree it should not matter for the define and dataset; putting an NSV at the end of a file can be quite at odds with making it easier on the reviewer by putting a variable where it logically belongs according to the content). It’s hard to generalize about removing derived variables across so many different tools, but clearly the model could be streamlined by some very basic improvements. Unfortunately, I suspect we still have many people browsing a dataset rather than using more sophisticated tools. But that should not hold back the standard. Clearly relationships can be expressed better. I am also a fan of the HL7 FHIR approach, and will be posting several other views on the future of the SDTM in coming months. Thanks for continuing the dialogue.


  5. Thanks Wayne,
    Do you happen to know whether the reviewers at the FDA inspect the XPT files directly, or whether they query and get “views” on the data from the Janus CTR? If they (still) inspect the files directly, there is not much we can do about the model itself, I am afraid.


  6. Jozef, the last I heard, Janus was still not fully in production. Though some reviewers may be using it to some degree, I believe many are still working directly with files at this time. Would be helpful to time box when that situation may finally change.


  7. Imagine Amazon being a company where one can only download files with products: one file with books, one file with movies, one file with electronic devices, …
    One would then need to search in the file with books for a specific book, and then in the file with movies to find out whether there is a movie based on the selected book, …
    Sounds like SDTM, doesn’t it? 😉 At least like how SDTM is used by many …

