I've been fortunate to be in an academic library considered by many to be one of the more advanced in research data management (RDM) planning. Confronted with ever-changing guidelines and funding requirements around data sharing, data preservation and the submission of data management plans, academic libraries at universities across North America are clamouring to understand what is needed, and to attain a fuller understanding of their users’ research data management practices and attitudes. Certainly, much has been written about the work already under way. My friend and colleague Eugene Barsky, for example, has researched and published much in the area of data management. RDM is such a new area that it sometimes feels very much like the early days of the Wild West. Where to begin? It seems like everything we collect becomes data; sorting and organizing it all is a challenge unto itself.
On the surface it's messy, but underneath it's even more complicated, as not all of it can or should be archived. Since federal granting agencies in Canada are now advocating for "open science," whereby future researchers can access and reuse research data, we often assume that all data is important and all data is equal. But just because it's data doesn't mean it's usable, let alone preservable. University of Alberta librarians Janice Kung and Sandy Campbell's "What Not to Keep: Not All Data Have Future Research Value" offers a remarkably cogent and sensible examination of what faculty, clinicians and graduate students from the health and medical sciences deem to be research data, and of what types of data should not be kept by libraries and archives for the purpose of reuse. There are eight themes identified here:
Bad or Junk Data - Data that has missing values or malformed records, or that is stored in problematic file formats, has no research value and is therefore unusable.
Cannot be used by others - When datasets are too specific to be combined with other datasets, or cannot be used without knowledge of their particular context or subject, other researchers cannot manipulate them in a meaningful way.
Easily Replicable - When data can be regenerated cost-effectively on demand (citation analysis data, for example), preserving it can be impractical.
Without good metadata - Since descriptive metadata must accompany research data to ensure future use and interpretation, the ability to reuse datasets can be hindered by suboptimal metadata.
Data without cultural or historical value - Since server space and administrative budgets are finite, not all data are valued equally and it's necessary to evaluate the feasibility of archiving everything. Data that cover short periods of time, draw on small samples, or have no cultural or historical content would have less value than longitudinal, large, and culturally based studies, and in such instances might need to be "weeded."
Pilot or test data - Data derived from instrument testing or trial runs have little future research value, since they are used to test the data collection methods and ensure quality control. Sometimes so many iterations of data are generated in developing a method that such "raw data" is not required for validation.
Proprietary data - Often researchers do not have ownership rights to the data they work with; it is released to them under contract by companies or organizations for a specific project only.
Confidential data - When research involving human subjects is being conducted, ethics agreements define when data must be destroyed and researchers must abide by these restrictions.
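For anyone who likes to think in checklists, the eight themes above could be sketched as a simple appraisal record. This is only an illustration under my own assumptions: the class name, field names, and function below are hypothetical and do not come from Kung and Campbell's study, and a real appraisal would involve far more judgment than a set of yes/no flags.

```python
from dataclasses import dataclass

# Hypothetical yes/no flags, one per theme in the list above.
# These names are illustrative only; they are not taken from the study.
@dataclass
class DatasetAppraisal:
    is_junk: bool = False
    cannot_be_used_by_others: bool = False
    easily_replicable: bool = False
    lacks_good_metadata: bool = False
    lacks_cultural_or_historical_value: bool = False
    is_pilot_or_test: bool = False
    is_proprietary: bool = False
    is_confidential: bool = False

# Maps each flag to the theme label used in the list above.
THEME_LABELS = {
    "is_junk": "Bad or junk data",
    "cannot_be_used_by_others": "Cannot be used by others",
    "easily_replicable": "Easily replicable",
    "lacks_good_metadata": "Without good metadata",
    "lacks_cultural_or_historical_value": "Data without cultural or historical value",
    "is_pilot_or_test": "Pilot or test data",
    "is_proprietary": "Proprietary data",
    "is_confidential": "Confidential data",
}

def reasons_not_to_keep(appraisal: DatasetAppraisal) -> list[str]:
    """Return the theme labels for every flag that is set on the appraisal."""
    return [
        label
        for field_name, label in THEME_LABELS.items()
        if getattr(appraisal, field_name)
    ]

# Example: a trial-run dataset that was never properly described.
pilot = DatasetAppraisal(is_pilot_or_test=True, lacks_good_metadata=True)
print(reasons_not_to_keep(pilot))
```

An empty list here would only mean that none of the eight themes was flagged; it would not by itself mean the dataset deserves to be archived.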
Of course, the study is not exhaustive by any means as it offers only a viewpoint of the health sciences. But what about other subject domains? For a more comprehensive contribution to the establishment of more detailed library and archival best practices, policies, and procedures, we need to further examine the digital humanities, for instance. This is a good, early start. But more is to come. Stay tuned.