Warning: this one’s primarily for SDTM geeks.
Back when the SDTM and SDTMIG v3.1 were being created circa 2003, there was never any delusion that the SDS team had thought of everything. The SDTMIG domains were created by taking the least common denominator among CRFs from several major pharma companies. It was always understood that we could only standardize on a core set of variables – that individual studies would almost always find cases where they’d need to add variables for some specific purpose.
The chosen solution for handling these extra variables was (shudder) supplemental qualifiers (SQs). The original use case for SQs was to provide a way to represent flag variables like “clinically significant” that might be attributed to different people – an independent reviewer or DSMB, for instance. But this was expanded to handle other variables that didn’t fit any variable defined in the general classes. A name/value pair structure, with a separate row for each variable and its value, was adopted – quite flexible, but not very user friendly. Not everyone saw this as a problem — there was a perception (held by one of our FDA observers, among others) that by making SQs difficult to represent, sponsors would be disinclined to use them, and thus the standard would stay leaner and more consistent and not get cluttered with other messy data.
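To see why that name/value structure is unfriendly, here’s a minimal sketch in Python/pandas of a hypothetical SUPPAE file and the pivot-then-merge dance needed to get its variables back alongside the parent AE records. The variable names `AECLSIG` and `AEDSMB` are invented for illustration; real SUPP-- files carry the same QNAM/QVAL/IDVAR machinery shown here.

```python
import pandas as pd

# Hypothetical parent AE domain records (column names follow SDTM conventions)
ae = pd.DataFrame({
    "STUDYID": ["S1", "S1"],
    "USUBJID": ["001", "002"],
    "AESEQ":   [1, 1],
    "AETERM":  ["HEADACHE", "NAUSEA"],
})

# The SuppQual name/value structure: one row per extra variable per record
suppae = pd.DataFrame({
    "STUDYID":  ["S1", "S1", "S1"],
    "RDOMAIN":  ["AE", "AE", "AE"],
    "USUBJID":  ["001", "001", "002"],
    "IDVAR":    ["AESEQ", "AESEQ", "AESEQ"],
    "IDVARVAL": ["1", "1", "1"],
    "QNAM":     ["AECLSIG", "AEDSMB", "AECLSIG"],   # hypothetical NSV names
    "QLABEL":   ["Clinically Significant", "DSMB Flag", "Clinically Significant"],
    "QVAL":     ["Y", "N", "N"],
})

# Merging back requires pivoting QNAM/QVAL into columns first -- the clunky part
wide = (suppae.pivot_table(index=["STUDYID", "USUBJID", "IDVARVAL"],
                           columns="QNAM", values="QVAL", aggfunc="first")
              .reset_index())
wide["AESEQ"] = wide["IDVARVAL"].astype(int)   # IDVARVAL is text; keys must match types

merged = ae.merge(wide.drop(columns="IDVARVAL"),
                  on=["STUDYID", "USUBJID", "AESEQ"], how="left")
```

Every consumer of the data has to reinvent roughly this transformation (and handle IDVAR values that vary by row) before the supplemental variables are usable at all.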
But that assumption was wrong. It turned out that many standard domains often need additional variables to fully describe a record – often essential information related to a specific therapeutic area. So the SQ files kept getting bigger and bigger.
And SQs were clunky in many ways. Value-level metadata was needed to describe each variable in define.xml files. Some tools and users had difficulty merging them back into parent domains. And because they were so unwieldy, voluminous, and hard to read, some reviewers simply gave up looking at them at all, creating a risk that critical information might be missed during a review.
So some SDS team members wisely proposed an alternative: place these SQs (which they renamed “Non-Standard Variables,” or “NSVs”) in the parent domain. Instead of physically separating them out into another, differently structured file, the proposal appended them to the end of the dataset record and relied on Define metadata to tag them as non-standard. The metadata tag would make it straightforward to strip them out into a separate SuppQual structure if that was still needed for some reason (such as conforming to a database load program expecting such a file), but the dataset would already include these variables where they belong, so they’d be less likely to be missed.
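The strip-out step really is that simple once the metadata tag exists. A sketch, again with the hypothetical NSV `AECLSIG`, assuming the list of non-standard column names has been read from the Define metadata:

```python
import pandas as pd

# Hypothetical AE record with the NSV carried in-line in the parent domain
ae = pd.DataFrame({
    "STUDYID": ["S1"], "DOMAIN": ["AE"], "USUBJID": ["001"], "AESEQ": [1],
    "AETERM":  ["HEADACHE"],
    "AECLSIG": ["Y"],   # hypothetical non-standard variable
})

# In practice this list would come from the Define-XML metadata tags
nonstandard = ["AECLSIG"]
keys = ["STUDYID", "DOMAIN", "USUBJID", "AESEQ"]

# The parent dataset with NSVs stripped out...
standard = ae.drop(columns=nonstandard)
# ...and a separate extract keyed like the parent, should anyone still need one
supp = ae[keys + nonstandard]
```

The split is a column selection, not a restructuring – which is the whole point of the proposal.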
But not everyone viewed this reasonable proposal as a panacea. FDA was still concerned that it would encourage sponsors to add more and more unnecessary variables – which might just be noise to a reviewer. And they worried about file sizes growing beyond their acceptable limits. (But at least they didn’t dispute that SQs in their present form had turned out to be a whole lot more trouble than anyone anticipated.)
Meanwhile, other members of the SDS team objected to the proposal as an unnecessary change – since most companies had already invested in ways to create these and didn’t want to have to change again (even if the datasets would be more useful and their processes simpler if they did). This, of course, is the notoriously stubborn “sunk cost” fallacy.
But let’s pause now for a moment. We know that the current SuppQual method is a clunky solution, which was already revised once (in SDTMIG v3.1.1, when a single file proved unmanageable and too big to submit), and that we still hear it can cause review problems for many and is seen as an unnecessary extra non-value added step by many more. But we don’t want to offer a simpler and more efficient solution instead because we’ve already invested in the clunky solution? Hello?
So, here’s another suggestion. Let’s create a separate file with the exact same structure as the parent domain – namely, use the SDTM unique keys (STUDYID-DOMAIN-USUBJID-XXSEQ) and add all the NSVs as additional columns. Such a structure would allow full metadata representation in Define-XML – just like the other variables — and makes for a much simpler merge (and, for sponsors, a simple split to take them back out). To let applications recognize that this requires a different merge from the normal SUPP-- format, perhaps a new file name prefix could be used, such as SUPW (for “wide,” or some other name, whatever).
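Under this scheme the merge collapses to a plain key join. A sketch, using an invented `SUPWVS` file and a hypothetical NSV `VSPOS2` (the prefix SUPW is the speculative name floated above, not an existing standard):

```python
import pandas as pd

# Parent VS domain
vs = pd.DataFrame({
    "STUDYID": ["S1", "S1"], "DOMAIN": ["VS", "VS"],
    "USUBJID": ["001", "001"], "VSSEQ": [1, 2],
    "VSTESTCD": ["SYSBP", "DIABP"], "VSORRES": ["120", "80"],
})

# Hypothetical "wide" supplemental file: same keys, NSVs as ordinary columns
supwvs = pd.DataFrame({
    "STUDYID": ["S1", "S1"], "DOMAIN": ["VS", "VS"],
    "USUBJID": ["001", "001"], "VSSEQ": [1, 2],
    "VSPOS2": ["SITTING", "SITTING"],   # hypothetical NSV
})

# A plain join on the SDTM keys -- no pivoting, no IDVAR/IDVARVAL gymnastics
merged = vs.merge(supwvs, on=["STUDYID", "DOMAIN", "USUBJID", "VSSEQ"], how="left")
```

Because the supplemental file shares the parent’s keys and row grain, a reviewer could also merge in only the columns of interest, or simply open the file on its own and read it like any other domain.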
Under such a scenario, FDA should be happy that file sizes are smaller (smaller even than the current tall-and-skinny SuppXX files, since those must pad every QVAL value to the width of the longest value across all variables), and the variables can be easily viewed in the dataset whether they’re merged or not – making it possible to merge in only the ones of interest if the possibility of noise is still a concern.
Not quite as elegant as a single-file solution, but it certainly seems better to me than the status quo. And those SDTM old-timers who still want to do it the old way can probably adapt the code they’ve already written to strip out the NSVs when they create SDTM (and put them back for their statistical programmers and internal reviewers) – and keep wasting time doing it the old way as well, if that’s what really makes them happy.
Seriously, can’t we bury these SUPPxx files once and for all and try to agree to make SDTM datasets just a little more useful? What’s the controversy with that?