What makes code a research object? #18

blahah opened this issue Sep 21, 2015 · 12 comments

@blahah

blahah commented Sep 21, 2015

Following on from this twitter discussion, I figure this is a better place to ask.

I'm interested in what makes some bundle of code a research object. It seems as though this project treats 'having a DOI' as the criterion. If so, why is this?

My position, from which I would love to be moved, and which may be factually wrong, is this:

  • A DOI is not in itself valuable - the thing that makes DOIs useful in science is that they are the identifiers for metadata collection and exchange via Crossref. I would say the main value comes from cited-by - that you can use them to track when your work is cited. Only Crossref-issued DOIs carry this value as far as I am aware.
  • Zenodo DOIs are issued by DataCite, not by Crossref. Is there some way to get similar value from DataCite DOIs?
  • I am more confident in the longevity of GitHub than Zenodo (long story short: science funding ~ politics, GitHub funding ~ the value they provide)

If a research object is a fixed archive of some artefacts of research, then what is wrong with git commit, or better, a tagged release?
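
For concreteness, the 'tagged release' alternative would amount to something like the minimal sketch below (the tag name and message are invented for illustration):

```sh
# create an annotated tag marking the exact state of the code used in the work,
# then publish it so the commit it points to is preserved on the remote
git tag -a v1.0.0 -m "code as used for the analysis in the paper"
git push origin v1.0.0
```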

@robldavidson

Hi,
CrossRef and DataCite both agree on the DOI standard and work closely together.
The value of a DOI is that the organisation that 'mints' the DOI has had to provide various guarantees of long-term solvency and the like to a member body of the global DOI community. So, for example, BMC's GigaScience journal makes such promises to the British Library, a DataCite member, and others like figshare make these promises to the California Digital Library, another DataCite representative.
Although the DOI itself is not greatly different from, e.g., an accession number or a GitHub URL, the DOI has the backing of someone who has agreed to maintain a standard, and to do so for a good length of time. This is quite different from normal private-sector working practices, where anyone could buy anyone else out at any time, and products could be dropped or standards changed for any number of reasons.
Although DOI tracking and citation metrics are very useful, they are very much an added service (Thomson Reuters has another DOI tracking service, for example) rather than the reason for DOIs.
As for the concern that science funding is short-term while successful private-sector ventures are longer-lived: that is normally perfectly true, except that the science organisations allowed to mint DOIs have guaranteed a certain level of funding for a certain length of time - Zenodo is an offshoot of CERN, which has pretty big backing that can be guaranteed for a couple of decades at least.
Comparatively, as I've said above, GitHub may seem like the bee's knees currently, but e.g. China's Alibaba could buy it next year and suddenly change the landing page of every repository - not enough to ruin the basic functionality, but in a way that might disrupt metadata searching/scraping algorithms or break links. DOIs guarantee that the landing page will persistently contain certain fields.

So, in terms of code becoming a research object: the DOI guarantees persistence of the metadata, whereas a private-sector interest makes no such guarantees beyond "we won't annoy our main earnings source too much".
Or at least, that's how I've been thinking of it.

@blahah

blahah commented Sep 22, 2015

Thank you for these excellent points.

I was not aware of the solvency guarantee condition for DOI providers, that is certainly a tip in favour of DOIs.

The idea of guaranteed metadata persistence is clearly valuable at the systems-interoperability level, but I think it should be made clear why this helps science.

It still seems to me that it's ultimately the services built on that layer, rather than the layer itself, that provide the visible value. And DataCite doesn't seem to have many of those services, at least not researcher-facing ones. Perhaps this isn't the place to ask, but would it be possible to build an open citation-tracking service on top of both Crossref and DataCite? Perhaps this is something we at ContentMine could work on.

I think it would be worthwhile to explain these and any other reasons why a DOI is useful on the site. I'll wait a while to see if anyone else contributes ideas, then I'll make a PR adding such an explanation.

@mbjones

mbjones commented Sep 22, 2015

@blahah The Making Data Count project has been building just such an open citation index for the last year. We recently announced that our prototype citation tracking index has moved to DataCite servers, and we'll be adding additional reports there soon. This work is based on the Lagotto tool by @mfenner, and on data usage statistics collation and reporting services that we developed at @DataONEorg. We learned several lessons in building this prototype, some of which are being presented by J. Lin at this week's DataCite PID workshop. These include issues with the effectiveness of current citation practices with respect to complex data objects, data versioning, and support for multiple persistent identifier standards (DOIs are not magic, nor are they ubiquitous, but they do have some desirable features).

The hard part of a citation index is making it open. We index a variety of open sources for citations (e.g., PubMed Central), but of course there are many closed repositories of articles to which we do not have text mining rights. An ongoing struggle with academic publishing models...

@blahah

blahah commented Sep 22, 2015

@mbjones this is a very interesting project! Thanks for the information - I have been looking at lagotto but did not know about Making Data Count.

I believe there is a solution to the text-mining rights problem. In the UK we now have a copyright exception for non-commercial text and data mining. In addition, a citation is un-copyrightable, being a fact (specifically an entity relation: article x cites article y). Thus, provided a collaborator can access the works to read, they have the right to mine them in the UK for non-commercial purposes, and any contractual restrictions that try to prevent it are not enforceable. Subsequently releasing those facts is protected.

I'd be happy to explore how I can help with this - rds45@cam.ac.uk

@lnielsen

> I would say the main value comes from cited-by - that you can use them to track when your work is cited. Only Crossref-issued DOIs carry this value as far as I am aware.

DataCite DOIs also have linkage information via the RelatedIdentifiers property. CrossRef has stricter enforcement that publishers must provide these links (AFAIK). There is, however, also a reason why PubMed, NASA ADS, INSPIRE HEP and all the other indexing services keep extracting references from e.g. PDF files (which is pretty ugly compared to having them explicitly in the metadata).
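
For illustration, a related-work link in DataCite metadata looks roughly like the sketch below (element names follow the DataCite metadata schema; the DOI value, repository URL and relation types are invented, and a real record may use different ones):

```xml
<relatedIdentifiers>
  <!-- hypothetical link from a software deposit to the article it supplements -->
  <relatedIdentifier relatedIdentifierType="DOI" relationType="IsSupplementTo">10.1234/example.article</relatedIdentifier>
  <!-- hypothetical link back to the source repository release -->
  <relatedIdentifier relatedIdentifierType="URL" relationType="IsSupplementTo">https://github.com/example-org/example-tool/tree/v1.2.0</relatedIdentifier>
</relatedIdentifiers>
```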

The second point is that CrossRef simply doesn't allow you to register DOIs for datasets or software. If they did, data repositories like Zenodo and figshare could just mint CrossRef DOIs.

> I am more confident in the longevity of GitHub than Zenodo

It's not GitHub's longevity I'm worried about. It's the individual researchers who worry me the most. Researchers can rewrite their history so that a commit is no longer available, they can move their repository (GitHub will redirect if they do it correctly), or worse, they can simply delete their repository. A DOI offers protection against this, because those who mint DOIs (either via CrossRef or DataCite) agree to adhere to certain principles - e.g. not changing the underlying files. Also, as mentioned before, even if the files are deleted, the metadata persists. On Zenodo we've archived on the order of 2000 repositories and 4000 individual releases so far, and after a year roughly 40 repositories had been deleted (not just moved).

> (long story short: science funding ~ politics, GitHub funding ~ the value they provide)

Long-term archiving is pretty difficult - no matter whether you're a company, an academic institution or a state library - and the way GitHub is built now, it's not a place for long-term archiving (and I doubt they have any interest in it either).

> If a research object is a fixed archive of some artefacts of research, then what is wrong with git commit, or better, a tagged release?

I think the important point here is that archiving != identification. Long-term archiving is difficult - say you archive a git repository; then you'll likely need git version x to be able to read and understand that archive 10-20 years from now. For citations, I'd go as far as saying that any persistent, globally unique identifier is better than none at all, and DOIs are just one option. See more in https://dx.doi.org/10.7717/peerj-cs.1

Also you might be interested in http://dliservice.research-infrastructures.eu/#/

@drj11

drj11 commented Sep 25, 2015

Where are these "DOI guarantees" documented? Section 3.2 of the DOI Handbook (http://www.doi.org/doi_handbook/3_Resolution.html) makes it clear that the entity in question "may or may not be an Internet-accessible file". So how is that any different from a git checksum?

@lnielsen

@drj11: It depends on the DOI registration agency (http://www.doi.org/registration_agencies.html). The difference between a git checksum and a DOI is that you can resolve a DOI - i.e. stick it in a browser and get redirected to a landing page (which can tell you how to get access to the object if it's not Internet-accessible). And in case the digital object is lost, CrossRef/DataCite will still have the metadata for the object.
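
As a quick illustration of that resolution step, you can follow a DOI's redirects from the command line, something like the sketch below (using the PeerJ CS DOI cited above; the exact redirect chain will vary by registration agency and landing page):

```sh
# -I fetches headers only, -L follows the redirect(s) from the DOI resolver to the landing page
curl -IL https://doi.org/10.7717/peerj-cs.1
```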

@hvdsomp

hvdsomp commented Sep 25, 2015

From my perspective, any kind of object, including software, can qualify as a research object. It can obtain that status by claiming an official place in the scholarly record. From the perspective of the ongoing discussions, this entails two core requirements:

  1. Having an archival copy of the object that can be accessed long into the future
  2. The ability to actually access that archival copy

For example, in the paper-based journal system:

  1. Archival copies of journal articles (well, of journals that contain journal articles) were redundantly stored by libraries, worldwide
  2. Access to a journal article was (typically) gained by visiting a library and obtaining the journal issue that contained the desired article

Moving on to code as a research object in the GitHub/Zenodo scenario:

  1. The archival copy is hosted in Zenodo. While there may be uncertainties about the longevity of Zenodo, it clearly has archival aspirations.
  2. Access to the archival copy in Zenodo is via an HTTP-DOI. The core value proposition of the DOI is that, if the archival copy were to move to another web location (for example, because Zenodo ceases to exist), the HTTP-DOI would be made to point to that new location. This assumes, of course, that someone takes on the responsibility of making that happen.

Based on my work with web archiving, Memento and Robust Links, I can sketch another approach that meets the requirements:

  1. The archival copy of the GitHub code is hosted in a web archive that supports on-demand archiving requests, such as the Internet Archive, perma.cc, or archive.today.
  2. Access to the archival copy is by means of Robust Links powered by link decoration. A decorated link is of the form <a href="..." data-versionurl="..." data-versiondate="...">link text</a>, where href carries the Original URI of the code, data-versionurl carries a Version URI for the specific version that is part of the scholarly record, and data-versiondate carries that version's date.
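
A minimal sketch of such a decorated link (the repository URL, tag URL and date are made up purely for illustration):

```html
<!-- hypothetical example: href points at the live repository, data-versionurl at the
     exact tagged version that is part of the scholarly record, and data-versiondate
     at the datetime of that version -->
<a href="https://github.com/example-org/example-tool"
   data-versionurl="https://github.com/example-org/example-tool/tree/v1.2.0"
   data-versiondate="2015-09-21">example-tool source code</a>
```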

Link decoration results in the following behavior, as long as the GitHub code repository remains available:

  • Via the Original URI provided in href it is possible to visit the current version of the code
  • Via the version URI provided in data-versionurl it is possible to visit the specific version of the code that is part of the scholarly record in GitHub
  • The combination of Original URI and version date, using Memento infrastructure, leads to the specific version of the code that is part of the scholarly record in GitHub. Try it in the Time Travel portal.

Link decoration, combined with Memento infrastructure, results in the following behavior, if the GitHub code repository becomes unavailable:

  • The Original URI yields a 404
  • The Version URI yields a 404
  • The combination of Original URI and version date, using Memento infrastructure, leads to the specific version of the code that is part of the scholarly record available in the Web Archive in which it was deposited. That's just how the Memento protocol and associated infrastructure works.

Notes:

  • GitHub does not currently support the Memento protocol natively and hence the behavior in the Time Travel portal is the result of a Memento proxy for GitHub operated by us. However, we have been in touch with GitHub about adding Memento support.
  • The syntax of a decorated link can also be <a href="..." data-originalurl="..." data-versiondate="...">link text</a> in which case the Version URI is provided in href.
  • One could also choose to use the URI of the code snapshot in the web archive as value for Version URI. But, in the above approach, traffic is directed as much as possible to GitHub.
  • Link decorations can be made operational through simple JavaScript, see robustlinks.js. They are also supported by the Memento extension for Chrome (http://bit.ly/memento-for-chrome).
  • In the current GitHub/Zenodo approach, there is no (machine-actionable) connection from the Zenodo snapshot to the GitHub repository, i.e. there is no expression of the fact that the code in Zenodo is a snapshot of code in GitHub. This could be achieved by inserting a duplicate HTTP Link from the Zenodo snapshot to the Version URI in GitHub. It can also be achieved by treating the snapshot in Zenodo as a Memento (add a Memento-Datetime header and an original HTTP Link from the Zenodo snapshot to the Original URI in GitHub); a sketch of what such response headers could look like follows these notes.
  • In the GitHub/Zenodo approach, there is an abundance of (machine-actionable) connections between related resources. There are the built-in connections in GitHub between Original URI and Version URIs. These connections could be fundamentally strengthened by supporting the Memento protocol. This would, for example, allow direct access to the version of a codebase that was operational at some time in the past. But, via the Memento protocol (supported by all major public web archives), the snapshot in the web archive is also connected to the Original URI in GitHub via an original HTTP Link. This allows navigating from the snapshot in the web archive to the Original URI in GitHub, but also from that snapshot to a version in GitHub that was operational at some specified point in time.
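
As a rough sketch, treating a Zenodo snapshot as a Memento of the GitHub resource could mean the snapshot's HTTP response carrying headers along these lines (following RFC 7089; the repository URL and date are invented for illustration):

```http
HTTP/1.1 200 OK
Memento-Datetime: Mon, 21 Sep 2015 12:00:00 GMT
Link: <https://github.com/example-org/example-tool>; rel="original"
Content-Type: application/zip
```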

@owlice

owlice commented Sep 27, 2015

Interesting discussion! I was glad to see it; thank you for starting it!

I would say code is a research object that should be cited when it enables research. Clearly there's a continuum for software: at one end are small scripts that don't manipulate data but simply move it, for example, and general tools such as Excel that are not research-specific, which one probably wouldn't consider research objects; at the other end is scientist-written code that uniquely enables research results, which clearly is, or should be, considered a research object.

I agree with you that having a DOI is not necessary for a code to be a research object, and indeed, as the editor of the Astrophysics Source Code Library (ASCL), I could not possibly defend DOI necessity! Out of our more than 1200 site links, only two are DOI links. We link to the GitHub repos for both of those codes in addition to their DOI pages. Though the archived version is useful for revealing the specific version of the code that enabled the specific research reported in a specific paper, anyone interested in using a code for his/her own research would likely want to get the software from its development site rather than an archive site, as code may undergo additional development (or bug fixes) after archiving.

The ASCL is citable and is indexed by the main indexing service for astrophysics, the Astrophysics Data System (ADS). You might be interested in this (incomplete) Google doc about astronomy software citation; it includes a short section called Citable works that pulls in discussion elsewhere on what makes something citable and the difference between attribution and citation.

The ASCL started out in 1999 as a repository, and though it can and does store codes, we have found that most authors prefer to keep their software close to them rather than on a site they don’t control. I suspect this is why there is little uptake of code archiving in astronomy. We do recommend software authors take steps to save their codes, and in fact the founder of the ASCL will be presenting at this January's AAS meeting on what to do with a dead code.

@robldavidson

This is an excellent discussion.

Inspired by @owlice's recent post, I have these thoughts on when code is a research object:

I tend to think of research objects as a synonym for research 'outputs', or anything that the researcher has had to work to produce. This is perhaps restrictive, but stems from two notions:

First, the need to improve research efficiency (see Ioannidis, 2014, in PLOS: 85% of research resources are wasted) makes me want to reduce duplication of effort by improving accessibility and reuse of anything a researcher has created - so metadata, indexing and persistence are important for researcher 'outputs', but not so much for pre-existing code that is an 'input'.

Second, the traditional research object/output is the publication, and that has become the thing that gets researchers funding, leading to quick-and-dirty, irreproducible publications being preferred over more time-consuming reproducible ones. If we can provide citations and tracking for all the other outputs/objects, perhaps we can get researchers rewarded (via funding/employment) for sharing these useful outputs (which feeds back into my first point).

These two (closely linked) considerations make me focus on outputs when considering all possible research objects. Of course, it's also important to report the research objects that are 'inputs' but this can be done as part of the metadata for the work that led to 'outputs'.

With that in mind, I'd say code is a research object when a researcher has put work into creating/adapting it. A researcher has not put work into creating Excel, but they have put work into creating an Excel spreadsheet and/or any macros or Visual Basic code that might be used with Excel. Thus, by my limited definition, Excel wouldn't be a research object, but the Excel version and other metadata would be reported alongside the research data object (the Excel spreadsheet), which may include research code objects (Excel macros/Visual Basic).

@robldavidson

Regarding author retention of their code objects:

At GigaScience, we encourage authors to have a project page for them to better control the definitive version of their code - but we also encourage a GitHub location (and will fork/host at our GitHub group) to facilitate community engagement/support, AND we take a snapshot of the code at time of publication so that we ensure we have control over the persistence of a version linked to our publication. Perhaps this is overkill but we feel each avenue has its merits and that they complement one another.

That level of control by the research repository (GigaScience) stems from concerns that are well highlighted in this report by the University of Arizona's computer science department. They systematically tried to contact the authors of computer science papers and conference proceedings and found that fewer than half of the papers were 'weakly repeatable'. Responses when trying to contact authors included the classics 'oh, that was a PhD student/post-doc who has left', 'oh, I've lost it' and... no response. Thus, leaving research outputs/objects solely in the loving arms of authors is a real problem for the reuse of those research outputs.

That said - the astrophysics community is particularly awesome at this sort of thing, and there are no broken links in the ASCL's list of homepages (going by a quick online link-checker test), so the need to supplement author management of research code objects may be field-dependent (or perhaps the ASCL has excellent stipulations when registering a project and these ensure persistence - if so, please share!).

@owlice

owlice commented Oct 2, 2015

> I'd say code is a research object when a researcher has put work into creating/adapting it.

I like that. I'd exempt small scripts or other outputs that anyone with basic computer skills could produce, but yeah, overall, that works.

> At GigaScience, we encourage authors to have a project page for them to better control the definitive version of their code - but we also encourage a GitHub location (and will fork/host at our GitHub group) to facilitate community engagement/support, AND we take a snapshot of the code at time of publication so that we ensure we have control over the persistence of a version linked to our publication.

I like this approach! Other journals would do well to emulate your approach of taking a snapshot of the code at publication and controlling that file.

> there are no broken links in ASCL's list of homepages (going by a quick online link checker test)

Thanks! We do have links fail, of course, but Associate Editor Kim DuPrie runs a link checker regularly and does an excellent job keeping the links up-to-date. We've had very very few codes disappear permanently over the years.

Our policies for registering a code are pretty simple: it must be used in refereed research or research submitted for refereeing, and it has to be available for download.
