Warning: this one’s primarily for SDTM geeks.
Back when the SDTM and SDTMIG v3.1 were being created circa 2003, there was never a delusion that the SDS team had thought of everything. The SDTMIG domains were created by taking the least common denominator among CRFs from several major pharma companies. It was always understood that we could only standardize on a core set of variables – that individual studies would almost always find cases when they’d need to add additional variables for some specific purpose.
The chosen solution for handling these extra variables was (shudder) supplemental qualifiers (SQs). The original use case for SQs was to provide a way to represent flag variables like “clinically significant” that might be attributed to different people – an independent reviewer or DSMB, for instance. But this was expanded to handle other variables that didn’t fit within any variable defined in the general classes. A name/value pair structure with a different row for each variable and its value was adopted – quite flexible, but not very user friendly. This was not viewed as a problem by all – there was a perception (held by one of our FDA observers among others) that by making it difficult to represent SQs, sponsors would be disinclined to use them, and thus the standard would be leaner and more consistent and not get cluttered with other messy data.
But that assumption was wrong. It turned out that many standard domains often need additional variables to fully describe a record – often essential information related to a specific therapeutic area. So the SQ files kept getting bigger and bigger.
And SQs were clunky in many ways. It was necessary to use value-level metadata to describe the variable in define.xml files. And some tools or users had difficulty merging them back into parent domains. And because they were so unwieldy, voluminous and hard to read, some reviewers simply gave up looking at them at all, resulting in risks that critical information might be missed during a review.
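To make the merge problem concrete, here is a minimal Python sketch, assuming a toy AE domain and a SUPPAE dataset in the name/value layout (RDOMAIN, IDVAR, IDVARVAL, QNAM, QVAL). The records, the helper name `merge_supp`, and the qualifier name AECLSIG are invented for illustration; real SUPP-- datasets also carry columns such as QLABEL, QORIG and QEVAL, which are left out here.

```python
# Toy parent domain: one row per adverse event (values are invented).
ae = [
    {"STUDYID": "S1", "DOMAIN": "AE", "USUBJID": "S1-001", "AESEQ": 1, "AETERM": "HEADACHE"},
    {"STUDYID": "S1", "DOMAIN": "AE", "USUBJID": "S1-001", "AESEQ": 2, "AETERM": "NAUSEA"},
]

# Toy SUPPAE: one row per (parent record, non-standard variable).
suppae = [
    {"STUDYID": "S1", "RDOMAIN": "AE", "USUBJID": "S1-001",
     "IDVAR": "AESEQ", "IDVARVAL": "1", "QNAM": "AETRTEM", "QVAL": "Y"},
    {"STUDYID": "S1", "RDOMAIN": "AE", "USUBJID": "S1-001",
     "IDVAR": "AESEQ", "IDVARVAL": "2", "QNAM": "AETRTEM", "QVAL": "N"},
    {"STUDYID": "S1", "RDOMAIN": "AE", "USUBJID": "S1-001",
     "IDVAR": "AESEQ", "IDVARVAL": "2", "QNAM": "AECLSIG", "QVAL": "Y"},
]


def merge_supp(parent, supp):
    """Transpose name/value rows into columns and attach them to the parent."""
    extras = {}
    for row in supp:
        # IDVARVAL is stored as text, so the parent key is compared as text too.
        key = (row["STUDYID"], row["USUBJID"], row["IDVAR"], row["IDVARVAL"])
        extras.setdefault(key, {})[row["QNAM"]] = row["QVAL"]

    id_vars = {row["IDVAR"] for row in supp}
    merged = []
    for rec in parent:
        out = dict(rec)
        for id_var in id_vars:
            key = (rec["STUDYID"], rec["USUBJID"], id_var, str(rec[id_var]))
            out.update(extras.get(key, {}))
        merged.append(out)
    return merged


ae_full = merge_supp(ae, suppae)
```

Even in this toy case the merge needs a transpose, an indirect key lookup through IDVAR, and a text-to-number comparison, which hints at why some tools and users struggled with it.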
So some SDS team members wisely proposed an alternative: place these SQs (which they renamed “Non-Standard Variables” or “NSVs”) in the parent domain. Instead of physically separating these out into another file structured differently, the proposal appended them to the end of the dataset record and relied on Define metadata to tag them as non-standard. The metadata tag would make it straightforward to strip these out into a separate SuppQual structure if that was still needed for some reason (such as conforming to a database load program expecting such a file) but the dataset would already include these variables where they belong so they’d be less likely to be missed.
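As a rough sketch of how a metadata tag could drive that split, the Python below assumes a plain dictionary standing in for the Define metadata (variable name mapped to a standard/non-standard flag). The dictionary, the function name `split_nsvs`, and the sample records are all hypothetical; a real implementation would read the tag from the Define file itself.

```python
# Hypothetical stand-in for Define metadata: variable name -> is it standard?
ae_metadata = {
    "STUDYID": True, "DOMAIN": True, "USUBJID": True, "AESEQ": True,
    "AETERM": True,
    "AETRTEM": False,  # tagged as a non-standard variable (NSV)
}

# Parent domain with the NSV appended to the end of each record.
ae = [
    {"STUDYID": "S1", "DOMAIN": "AE", "USUBJID": "S1-001", "AESEQ": 1,
     "AETERM": "HEADACHE", "AETRTEM": "Y"},
    {"STUDYID": "S1", "DOMAIN": "AE", "USUBJID": "S1-001", "AESEQ": 2,
     "AETERM": "NAUSEA", "AETRTEM": ""},
]


def split_nsvs(records, metadata, seq_var):
    """Return (standard-only records, SuppQual-style rows) using the NSV tag."""
    nsv_names = [name for name, is_standard in metadata.items() if not is_standard]
    parent, supp = [], []
    for rec in records:
        parent.append({k: v for k, v in rec.items() if k not in nsv_names})
        for name in nsv_names:
            value = rec.get(name, "")
            if value == "":
                continue  # this sketch writes no supplemental row for empty values
            supp.append({
                "STUDYID": rec["STUDYID"], "RDOMAIN": rec["DOMAIN"],
                "USUBJID": rec["USUBJID"], "IDVAR": seq_var,
                "IDVARVAL": str(rec[seq_var]), "QNAM": name, "QVAL": str(value),
            })
    return parent, supp


ae_std, suppae = split_nsvs(ae, ae_metadata, "AESEQ")
```

The point of the design is that the split becomes a mechanical, metadata-driven step that only those who still need a SuppQual file have to run.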
But this reasonable proposal wasn’t viewed as a panacea by everyone. FDA was still concerned that this would encourage sponsors to add more and more unnecessary variables – which might just be noise to a reviewer. And they worried about increasing the file size beyond their acceptable limits. (But at least they didn’t disagree that these were a whole lot more trouble in their present form than they had anticipated.)
Meanwhile, other members of the SDS team objected to the proposal as an unnecessary change – since most companies had already invested in ways to create these and didn’t want to have to change again (even if the datasets would be more useful and their processes simpler if they did). This, of course, is the notoriously stubborn “sunk cost” fallacy.
But let’s pause now for a moment. We know that the current SuppQual method is a clunky solution, which was already revised once (in SDTMIG v3.1.1, when a single file proved unmanageable and too big to submit), and that we still hear it can cause review problems for many and is seen as an unnecessary, non-value-added extra step by many more. But we don’t want to offer a simpler and more efficient solution instead because we’ve already invested in the clunky solution? Hello?
So, here’s another suggestion. Let’s create a separate file with the exact same structure as the parent domain – namely, use the SDTM unique keys (STUDYID-DOMAIN-USUBJID-XXSEQ) and add in all the NSVs as additional columns. Such a structure would allow full metadata representation in Define-XML – just like the other variables – and is a much simpler merge (and, for sponsors, also a simple split to take them out). To allow applications to realize that this is a different merge from the normal SUPP-- format, perhaps a new file name prefix can be used, such as SUPW (for “wide” or some other name, whatever).
Under such a scenario, FDA should be happy that file sizes are smaller (and smaller than the current tall and skinny SuppXX files, since they tend to expand to reserve as much space for all variables as required by the biggest), and the variables can be easily viewed in the dataset whether they’re merged or not – making it possible to only merge in the ones of interest if the possibility of noise is still a concern.
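The wide-file idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the dataset name `supwae`, the helper `merge_wide`, the qualifier AECLSIG and all values are invented, and the key list simply follows the STUDYID-DOMAIN-USUBJID-XXSEQ pattern with AESEQ filled in for the AE domain.

```python
KEYS = ("STUDYID", "DOMAIN", "USUBJID", "AESEQ")

ae = [
    {"STUDYID": "S1", "DOMAIN": "AE", "USUBJID": "S1-001", "AESEQ": 1, "AETERM": "HEADACHE"},
    {"STUDYID": "S1", "DOMAIN": "AE", "USUBJID": "S1-001", "AESEQ": 2, "AETERM": "NAUSEA"},
]

# Hypothetical wide companion file: same keys as the parent, one column per NSV.
supwae = [
    {"STUDYID": "S1", "DOMAIN": "AE", "USUBJID": "S1-001", "AESEQ": 1,
     "AETRTEM": "Y", "AECLSIG": "N"},
    {"STUDYID": "S1", "DOMAIN": "AE", "USUBJID": "S1-001", "AESEQ": 2,
     "AETRTEM": "N", "AECLSIG": "Y"},
]


def merge_wide(parent, wide, keys=KEYS, only=None):
    """Join NSV columns onto the parent by key; `only` limits which NSVs come in."""
    lookup = {tuple(row[k] for k in keys): row for row in wide}
    merged = []
    for rec in parent:
        out = dict(rec)
        extra = lookup.get(tuple(rec[k] for k in keys), {})
        for name, value in extra.items():
            if name in keys:
                continue  # key columns are already present on the parent record
            if only is None or name in only:
                out[name] = value
        merged.append(out)
    return merged


ae_all = merge_wide(ae, supwae)                      # bring in every NSV
ae_some = merge_wide(ae, supwae, only={"AETRTEM"})   # bring in only one of interest
```

Compared with the name/value layout, there is no transpose and no IDVAR indirection: the join is a plain key match, and leaving out unwanted variables is a column filter.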
Not quite as elegant as a single file solution, but certainly seems to me better than the status quo. And for those SDTM old-timers who still want to do it the old way, well, they can probably adapt the code they’ve already written to strip out the NSVs when they create SDTM (and put them back for their statistical programmers and internal reviewers) and keep wasting time doing it the old way as well if that’s what really makes them happy.
Seriously, can’t we bury these SUPPxx files once and for all and try to agree to make SDTM datasets just a little more useful? What’s the controversy with that?
15 thoughts on “R.I.P. Time for Supplemental Qualifiers”
While I still favor including NSVs in the parent domain, the “extra columns” proposal is interesting. I’d propose a naming convention in which the dataset holding the additional columns begins with the two-letter domain code. This would at least ensure that the extra-column datasets would reside right below the parent domains in the normal alphabetized directory listing, rather than clustered between the SU and VS domains, where they sometimes get ignored if one is looking at the top of the directory (AE data, for example). Something like AENSV might work. This naming, without a “SUP” in it, dispenses with the notion that these are “supplemental”, which in some minds reads “less important”.
Your renaming suggestion makes sense. I’d go with that.
A nice read. Very informative. Thanks Wayne.
Really? Do we still want to separate standard and non-standard variables in 2 separate data sets when the metadata already will tell us which variables are non-standard?
Also, when you talk about file sizes, I think you are talking about the XPT world, not the emerging Dataset-XML standard. I’m not even convinced that your proposal would always result in smaller sizes.
Lex, I’d much rather have them in the parent domain, in accordance with the SDS proposal. But it seems that half the SDS team disagreed, and the FDA was also unwilling to support it because of their fear of XPT file sizes (since they’re not using Dataset-XML). Just trying to offer a conciliatory alternative.
Very interesting blog indeed!
It gives a lot of insight into how and why suppqual domains were established in the past. That is a new aspect for me. Especially sentences like “that by making it difficult to represent SQs, sponsors would be disinclined to use them..” are real eye-openers for me. Or the sentence “FDA was still concerned that this would encourage sponsors to add more and more unnecessary variables – which might just be noise to a reviewer”…
Honestly said, I am more concerned about sponsors “squeezing” information into standard variables to avoid creating an NSV. This might trigger a discussion about “data quality versus ease of review” (both are important).
Although I am not really in favor of Wayne’s proposal (but some people say I am an extremist…), it might be an acceptable compromise for the FDA. I support Fred’s suggestion for the naming of the files too. My opinion however remains that NSVs should be kept where they belong: in the original domain, independently of whether Dataset-XML or SAS Transport 5 is used.
I also still do not understand the fear for file sizes at the FDA. Once received at the FDA, submissions should not be kept in files (except for archival), they belong in high-performance databases. Doesn’t the Janus CTR data warehouse do that? Or does it only keep metadata and standardized analysis information about the submissions that physically still reside in files? Can anyone give me information on this?
If the submissions are still kept in files, I think having real SDTM databases should be FDA’s highest priority. Amazon does not keep its lists of products in files either, does it?
If someone really insists on putting non-standard variables in a separate dataset, this “wide” NSV file makes more sense than the current supplemental qualifiers dataset. Of course, even better is including the NSV variables in the parent domain.
@Wayne: Understood! I certainly like the SUPW concept better than SUPPxx.
Thanks for starting this discussion.
I think everyone can probably agree that the status quo is pretty poor for a number of reasons, however creating another way of doing it that is essentially yet another awkward implementation of the same thing seems like madness. Yes I agree it’s less awkward than what we have today, but it’s a whole lot more awkward than just putting them in the parent domain.
If XPT size is the main issue – is that not essentially already solved? Today we already have to deal with XPTs that are larger than allowed, and we have a way of splitting them up in that circumstance. That would work fine here too, so why is there a problem?
Perhaps I’m being too simplistic.
Except for this tiny matter of regulatory compliance in the USA and Japan, Kevin, I’d have to agree. But given the current state and our continued dependence on the real enemy, SAS v5 XPT, I think it’s worth taking a few small steps in the right direction in the meantime.
Although I don’t like SUPPxx, this discussion reminds me of what an Oracle guy said during the last European CDISC Interchange in Paris: “Why are you removing standard variables that have been there for ages? Could you please stop doing it?” I understand it is not ideal to have important variables in SUPPxx, like the treatment-emergent AE flag, study populations, etc., but I guess one may propose that the SDS team consider adding these important variables to the parent domain in future versions, as we do for new domains coming from the TA UGs or for new concepts, e.g. the new Diabetes ADaM UG makes a good proposal for variables used for stratifying the randomization.
To my mind, the concerns over adopting NSVs in the parent domain had more to do with “why now” rather than “why”. I suspect if the dataset XML implementation were to adopt the option (or requirement) of NSVs in the parent domain, there would be relatively few objections as the process implications became clear. Dataset XML not only requires a new approach to generating and extracting the data, it also reinforces the perspective that the format is for transmission and not operational use. That alone would be a “win”, hopefully fundamentally changing the concerns over transmission file size.
Valid comment, Carlo. My answer to “why now?” is that rather than wait endlessly for XPT to be replaced by something better (such as Dataset-XML), which we’ve been doing for a decade and which has no imminent solution, it’s better to do something now. Others may differ, but to me it’s clear that we’ll be using XPT for several more years, so why not minimize our pain, while we search for a longer-term cure in parallel?
[…] supplemental qualifiers now! RIP Time for Supplemental Qualifiers makes another plea to represent non-standard variables (NSVs) in the parent domain where they […]
[…] (especially if supplemental qualifiers are represented within the parent domain where they belong) wouldn’t that be easier on the FDA reviewer? Wouldn’t it be helpful to […]