Call me “Lumpy.” And I’m a lumpaholic. I acknowledge this predilection, because I simply feel lumping makes things easier, and I’m generally more comfortable with big pictures than detailed nuances.
But I’m not a lumping zealot. I have many close friends who are splitters, and I can respect their point of view some of the time. It’s not unlike the dynamic between friends who are either Cubs or Sox fans in Chicago, for instance – we can agree to disagree, even when they’re wrong.
The term “lumping and splitting” apparently was coined by Charles Darwin with respect to how to classify species, and is explained nicely here:
- (For lumpers) “Two things are in the same category unless there is some convincing reason to divide them.”
- (For splitters) “Two things are in different categories unless there is some convincing reason to unite them.”
This clearly defines the opening position of each camp, and relies on the existence of a compelling and logical reason to go the other way, which sounds reasonable enough.
Now a partisan divide of lumpers and splitters also exists in the CDISC SDS/SDTM microcosm, where the term is applied to the decision on whether to create new domains. Lumpers believe there should be a limited number of domains, to make it easier for sponsors to know where to put data. Splitters want to create many more domains with finer distinctions between them. The CDISC SEND team follows a very fine-grained splitting approach. But that does not necessarily mean human trials should follow the same pattern, since a lumping approach has also been followed with questionnaires and lab results data. In the latter cases, the SDTMIG describes a way to split these often massive lumped domains into separate files to conform to FDA file limits, but they’re still considered the same domain.
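As a rough illustration of that last point, here is a minimal Python sketch of splitting one lumped domain into several files while every record keeps the same domain code. The records, the category values, and the idea of splitting on QSCAT are invented for illustration; this is not the SDTMIG's own procedure or naming scheme.

```python
from collections import defaultdict

# Hypothetical lumped questionnaire records; all values are made up.
records = [
    {"DOMAIN": "QS", "USUBJID": "001", "QSCAT": "SF-36", "QSTESTCD": "Q01", "QSORRES": "3"},
    {"DOMAIN": "QS", "USUBJID": "001", "QSCAT": "EQ-5D", "QSTESTCD": "Q01", "QSORRES": "1"},
    {"DOMAIN": "QS", "USUBJID": "002", "QSCAT": "SF-36", "QSTESTCD": "Q02", "QSORRES": "4"},
]


def split_for_submission(rows, split_key):
    """Group rows into per-file chunks by split_key; DOMAIN is left untouched."""
    files = defaultdict(list)
    for row in rows:
        files[row[split_key]].append(row)
    return dict(files)


files = split_for_submission(records, "QSCAT")

# Each chunk would become its own file, yet every row still says DOMAIN == "QS".
for name, rows in files.items():
    print(name, len(rows), {r["DOMAIN"] for r in rows})
```

The point of the sketch is only that the physical file boundary and the logical domain are separate decisions: the grouping key changes where a row is stored, not what domain it belongs to.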
This difference of opinion has recently been illuminated as the team has wrestled with how to represent morphology and physiology data, a topic that’s critical to CFAST Therapeutic Area standards. Some years ago, the SDS team decided to separate physiology data by body system and reserved some 15 physiological domain codes, even though specific use cases had not yet been defined for all of them, while lumping morphological observations together in a single domain. This proved problematic, because it wasn’t always clear which type of test belonged where. So the team decided (through an internal poll) to merge morphological observations into the various physiology domains. An alternative lumping proposal, to combine all physiology data in a single domain and use a separate variable for body system (as is already the case in Events and the PE domain), did not gain sufficient traction.
Splitting may be fine as long as there’s that convincing reason and it’s clear where to split – like the “Tear Here” perforation on a package. We can easily see that AEs differ from Vital Signs (though the latter may indicate a potential AE). And I’m not suggesting that we lump together domains previously released (well, not all, anyway). But what do you do with a complex disease such as cancer or diabetes, or a traumatic injury that affects multiple body systems? What happens with, say, observations related to a bullet wound that penetrates skin, muscle and organs when you have to split these over multiple domains?
In such cases, wouldn’t a physician – or FDA medical reviewer – want to see all of the observations relevant to the patient’s disease state together rather than jumping around from file to file and trying to link them up? And how are patient profile visualizations affected when multiple, possibly related morphological and physiological observations are split across many separate files? Patient profiles tend to look in a finite number of domains for core data (typically DM, EX, LB, VS, AE and CM), so adding in dozens of others is likely to be challenging. And maybe the constant stress of accommodating new domains with each new version would ease if SDS didn’t feel a need to keep adding more of them each time.
This brings me back to a “first principle” – making it easy on the reviewer. If you know exactly what you’re looking for, and you’re confident everyone else knows to put the same type of data in the same place, then maybe it’s easy to point to a separate fine-grained domain. But what if you’re looking to see possible associations and relationships that may not be explicit (which FDA reviewers have lamented previously)? I think for most of us who are familiar with spreadsheets and other query and graphics tools, it’s far easier to have all the data you want in one place and rely on filter and search tools within a structured dataset than to search or query across a directory full of files. And that’s one reason why FDA has been working so long on moving from file-based data review to their Janus Clinical Data Repository (CDR) – so reviewers can find the data they want through a single interface. In my experience, CDRs (including Janus) do not usually split most findings-type data by domain.
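To make the "one place plus a filter" argument concrete, here is a small Python sketch that pulls one subject's observations from a split layout and from a lumped layout. Everything in it is an assumption for illustration: the dataset codes, the simplified TEST and RESULT column names, and the values are made up and do not come from any released standard or real study.

```python
# Split layout: one dataset per body system (codes are illustrative only).
split_datasets = {
    "CV": [{"USUBJID": "001", "TEST": "Ejection fraction", "RESULT": "55"}],
    "RE": [
        {"USUBJID": "001", "TEST": "FEV1", "RESULT": "2.9"},
        {"USUBJID": "002", "TEST": "FEV1", "RESULT": "3.4"},
    ],
    "MK": [{"USUBJID": "001", "TEST": "Grip strength", "RESULT": "30"}],
}

# Lumped layout: the same rows in one dataset, with body system as a variable.
lumped = [
    dict(row, BODSYS=code)
    for code, rows in split_datasets.items()
    for row in rows
]


def subject_rows_lumped(rows, usubjid):
    """One dataset, one filter."""
    return [r for r in rows if r["USUBJID"] == usubjid]


def subject_rows_split(datasets, usubjid):
    """Every dataset must be known, opened, searched, and re-labelled."""
    found = []
    for code, rows in datasets.items():
        for r in rows:
            if r["USUBJID"] == usubjid:
                found.append(dict(r, BODSYS=code))
    return found


from_lumped = subject_rows_lumped(lumped, "001")
from_split = subject_rows_split(split_datasets, "001")
print(from_lumped == from_split)
```

Both routes return the same rows; the difference is that the split route only works if the caller already knows the full list of datasets to visit, which is exactly the burden a lumped layout removes.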
Lumping solutions should be more consistent between sponsors and studies, since there’s a simpler decision on where to put data. Just like when shopping at Costco, it’s easier to make a decision when there are fewer choices involved. And just as it’s quicker and easier to pick out some bathroom reading from a single shelf rather than have to search through an entire library, lumping should make it quicker and easier to access the data you want.
So, what exactly is the acid test? Whichever approach is chosen for SDTMIG in the future (and it should be up to the community to weigh in when the next comment period begins), one overriding principle must be in place: a new domain split should never be created unless there’s a clear rationale for splitting (and in the case of Physiology, I don’t really see that myself), and it’s absolutely, unambiguously clear when the new domain is to be used for what kind of data. If there’s any ambiguity about whether a certain type of data could go here or there, then we should opt for a lumping solution instead, and allow users to rely on query and filter display tools to pick out what they want. Meanwhile, we can rely on our statistical programmers to put data logically together in analysis files just as we always have.
So, lumpers of the world, unite! There are simply too many other problems to tackle besides the “What domain do I use this time?” game, like defining rich metadata models for sets of variables with valid terminology (i.e., concepts), or defining a next generation SDTM that isn’t based on SAS XPT.
In the meantime, another scoop of mashed potatoes for me, please – lumps and all.