
Notebooks as a research tool #112

Open
khinsen opened this issue Mar 4, 2016 · 16 comments

khinsen commented Mar 4, 2016

This is a follow-up to a Twitter conversation which I think is relevant for everpub.


betatim commented Mar 4, 2016

My two cents: notebooks are the (shiny), simple-to-make GUI for your research project.

People quickly learn when they start using notebooks that editing and managing more than a few hundred lines of code in a notebook is unwieldy. This is a good experience, because you shouldn't do it. Sometimes they then conclude "notebooks are rubbish".

IMO The Right Way ™️ to use notebooks is for:

  1. experimenting and prototyping
  2. driving your analysis

Driving? Yes. Let me explain. Put the heavy lifting in plain old text files, organised as a library/package/module. Import tools from your module into the main notebook and use them. "Importing" could also mean calling `make step1`, followed by `./bin/step1 someargument`.
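
A minimal sketch of such a "driver" cell, assuming a hypothetical package name `mytools` and a hypothetical `make` target:

```python
# The heavy lifting lives in an ordinary package (mytools is a
# hypothetical name); the notebook only calls into it and shows results.
from mytools.analysis import run_step1   # hypothetical module/function

results = run_step1("someargument")

# "Importing" can equally mean shelling out to existing tooling:
import subprocess
subprocess.run(["make", "step1"], check=True)
subprocess.run(["./bin/step1", "someargument"], check=True)
```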

Intersperse these calls to your module, shell scripts, or compiled executables with commentary on the science you are doing: why it makes sense, why you chose this approach, and the conclusions you draw from the output of the utility you just invoked.

Use the notebook to display the results in a nice way. Maybe add some interactivity, such as a slider that adjusts a threshold in your analysis. This makes it easier for you to experiment (though you will have to note this in the narrative) and helps future readers who doubt that you chose the best threshold.
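
A minimal sketch of the slider idea using ipywidgets (the data here is a random stand-in for real analysis output):

```python
import numpy as np
import matplotlib.pyplot as plt
from ipywidgets import interact

np.random.seed(0)
data = np.random.randn(1000)  # stand-in for real analysis output

# interact() turns the keyword range into a slider and re-runs the
# function whenever the slider moves.
@interact(threshold=(0.0, 3.0, 0.1))
def show_selection(threshold=1.0):
    kept = data[np.abs(data) > threshold]
    plt.hist(kept, bins=30)
    plt.title("%d samples with |x| > %.1f" % (len(kept), threshold))
    plt.show()
```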

Maybe the one sentence summary of all this: notebooks are the executable README for your research project.


khinsen commented Mar 4, 2016

A few people have reported that notebooks don't support good practices in scientific computing as much as would be desirable. That's my experience as well, and I will summarize it here.

The problematic aspect that has been discussed most is version control. It has a superficial side, due to the JSON format used by Jupyter, which is addressed by proposals such as ipymd. It also has a more profound side, due to the fact that notebooks combine human-generated information and computed results in a single document, although only the former should be under version control. This is addressed in use_ipynb_git. I think none of these fixes is sufficient, but they are good steps towards exploring the issues.
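
A minimal sketch of the underlying idea (not ipymd or use_ipynb_git themselves): strip the computed results so that only the human-generated part gets committed. nbformat ships with Jupyter; the file names are hypothetical.

```python
import nbformat

# Read the notebook and drop everything that was computed rather than written.
nb = nbformat.read("analysis.ipynb", as_version=4)
for cell in nb.cells:
    if cell.cell_type == "code":
        cell.outputs = []             # computed results
        cell.execution_count = None   # execution counters
nbformat.write(nb, "analysis.stripped.ipynb")
```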

Another problematic aspect is that notebooks do not support the code maturation process in computational science well. It is very common for a computational method to start out as a few lines of code in a script, which are then reorganized into a function; the function is transferred into a module shared by several scripts, and perhaps ends up as part of a published library. Notebooks discourage the step to a module shared by several notebooks. This step requires a change of tools and results in a loss of readability of the notebook: a function essential for understanding what happens is no longer "in view", and the nicely formatted explanation around it (Markdown, formulas) must be replaced by less readable comments in a module. As a consequence, people tend to keep too much code in their notebooks and copy/paste it to other notebooks.

A related issue is that code from notebooks is hard to reuse. For interactive exploration, copying the notebook and changing it is fine. But reusing a method for a different application is often very inconvenient.


khinsen commented Mar 4, 2016

@betatim I agree that what you describe is a good way to work with notebooks. But that's not what I see happening. Worse, I don't do it myself when I use notebooks, which is why I don't use them much any more. My impression is that I lose most of the advantages of notebooks when I follow your approach, so I'd better stick to plain scripts.

lukasheinrich commented:

I mostly agree with the above points. For me, notebooks are good for exploration, but I quickly move stuff into actual Python modules. I like @betatim's analogy of the executable README. Within everpub it might (unless there are better options) be the presentation frontend: the replacement for the paper, but with short code snippets that show how to use the project's libraries to produce e.g. the plots.


betatim commented Mar 4, 2016

I agree it is more of a vision than reality. It also involves several steps that require "common sense", which means the answer depends on the particular human doing the work (and we will all agree to disagree with the choices ... and 🚲 🏠 ensues).

I believe having the human text (with rendered equations, figures and tables) next to the (high-level) code is worth it. We need to build better tools for transitioning code, and establish social norms about checking in the rendered vs unrendered versions of notebooks, diffing, etc.

I'm not yet ready to give up on notebooks (which by the way could be .md files!) just because there is a lot of potential for shooting yourself in the foot. Python allows you to do truly awful things in terms of maintainability ... instead of making them technically impossible, we use best-practice guides to stop people from making excessive use of them.

lukasheinrich commented:

Though I should add that I have never tried to dive deep into the notebook paradigm (mostly using it as described above). Reading Fernando's short history here: http://blog.fperez.org/2012/01/ipython-notebook-historical.html I think there is a good chunk of users (definitely also in HEP theory) who live by the notebook paradigm for much more.

lukasheinrich commented:

👍 @betatim


khinsen commented Mar 4, 2016

I don't want to give up on notebooks either, I hope they will evolve to address all these issues.

One idea would be to have notebook-style modules: imported like plain Python code, but with all the documentation features of the notebook, and managed with the same tools. That would add good old literate programming to the notebook universe.

Another idea would be to show imported functions in a notebook as non-editable cells. The notebook would still show all the relevant code, but the code would also be reusable.


rougier commented Mar 4, 2016

One thing that puzzles me with notebooks is the splitting of the code into different cells that can be executed on their own, independently of the others (maybe this has changed in later versions, I have not tested recently). I often find myself re-running everything every time because I just don't know whether some cell I just changed will have an impact on the other cells. This led me to adapt and structure my code to the notebook paradigm.


davidmam commented Mar 4, 2016

I found that the best way to run with notebooks is to have two things running side by side: Spyder, in which I keep library code/functions etc., and Jupyter, in which I keep an overview of the whole process in a way that is readable and pretty much in report format. If I have a block of reusable code, it gets copied/pasted to the library and called from there. In many ways it gives the best of both worlds. Having dynamic library/module reloading would be nice. My code is under version control; the notebook doesn't contain too much detail or reusable code, but is sufficient (with the module) to repeat the workflow.
The major flaw at the moment is the frequent reloading of the notebook (clear all content and restart kernel), which can be a bit of a pain.
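
For what it's worth, IPython's autoreload extension covers part of this; a minimal sketch (mylib is a hypothetical module living next to the notebook):

```python
# Re-import changed modules before each cell execution, so edits made
# in Spyder are picked up without a kernel restart.
%load_ext autoreload
%autoreload 2

import mylib
mylib.run_workflow()  # runs against the current mylib.py on disk
```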

lukasheinrich commented:

So it seems like we're all on the same page on this. I agree that notebooks need some further development (as @khinsen describes), and so does the question of what the general day-to-day workflow should be. But this is probably beyond the scope of this project, at least at this early stage. It seems we all agree that the notebook could be a good presentation layer, with short code snippets (calling various libs etc.) to do light data transformations, plot generation, and so on.


tritemio commented Mar 4, 2016

Importing notebooks is somewhat possible, but I think it would be wrong to do it: notebooks normally contain commands that get executed, and you want to import only the functions.
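
A sketch of why, condensed from the notebook-import example in the Jupyter docs: "importing" a notebook amounts to executing every code cell into a module namespace, so top-level commands run too, not just the function definitions (my_helpers.ipynb is a hypothetical file):

```python
import types
import nbformat
from IPython.core.interactiveshell import InteractiveShell

def import_notebook(path, name="nb_module"):
    """Execute all code cells of a notebook into a fresh module."""
    nb = nbformat.read(path, as_version=4)
    mod = types.ModuleType(name)
    shell = InteractiveShell.instance()
    for cell in nb.cells:
        if cell.cell_type == "code":
            # Rewrite IPython-specific syntax (magics, !shell) to plain Python
            code = shell.input_transformer_manager.transform_cell(cell.source)
            exec(code, mod.__dict__)  # side effects included!
    return mod

helpers = import_notebook("my_helpers.ipynb")
```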

In practice, when you want to reuse a function you just move it to a .py file, as @betatim said. A .py file is automatically a module. Then, if you have more than a few modules, you can create a package. Python really shines here in scaling up from simple to complex.

One advantage of notebooks compared to a purely script-based approach is that you can see the sequence of results/figures, and you can add "comments" pointing out why you are trying a particular analysis. You also have section headers with links (TOC) for quick navigation, as well as equations, links, etc. With scripts, you have to somehow reconnect all the generated figures to the script, instead of having it all inline. A notebook is also a quasi-GUI: just share it and tell the user to click Run All. Easier than calling commands from the command line.

For reproducibility, the first thing is the habit of doing a "Restart and Run All" before saving/committing the notebook, which helps maintain a coherent state. I also print the versions of all imported libraries at the beginning of the notebook, and this info is saved with the notebook. Finally, I often create a conda environment per project folder. The environment is saved in a simple YAML file that is kept under version control and contains the exact version of every installed package. In this way, I can easily reproduce work from 1 or 2 years ago (when I was still using Python 2!) with no problem.
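
A minimal sketch of the version-printing habit (the library names are just examples; the environment itself can be captured with `conda env export > environment.yml`):

```python
# First cell of the notebook: record the exact versions used, so they
# are saved together with the results.
import sys
import numpy, matplotlib
print("python    ", sys.version.split()[0])
print("numpy     ", numpy.__version__)
print("matplotlib", matplotlib.__version__)
```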

The pieces for reproducibility are all here. We just have to promote best practices, which at the end of the day is the spirit of this proposal.

Repositorian commented:

(Possibly more than) a few euros about semantics and object models, from a librarian lurker and cheerleader:

This discussion about the object model for everpub (the notebook as wrapper/electronic binding) is reminiscent of conversations in library/publishing circles around compound digital objects that comprise many individually referenceable and reusable parts. I think the discussions are related and important because, at the end of the day, we all see the 'publication' paradigm as the canonical unit of release, identifier assignment, cataloging/metadata, citation, and, yes, copyright assignment. It is the ultimate PARENT of all the child parts, each of which can also be rendered, referenced, assigned identifiers, cited, etc., but always as a part of a greater whole from which the children inherit multiple properties and provenance.

Borrowing from the copyright paradigm (because legal restrictions end up governing lots of what we do, at least in today's world), it sounds like everpub is a compilation or anthology. It has integrity, value, and is managed as a whole, yet it contains discrete parts that can stand alone. When Titus and I tweeted yesterday around commits as indicators of authorship, I mentioned the difference between joint authorship and ownership in the compilation and individual authorship in the commits, which themselves represent derivative works of the parent code. On such a view, the released software represents a compilation, with editors and copyright owners in the whole; and authors and owners in the constituent parts.

I recognize that this Notebook discussion has focused on technical aspects of publication more than societal/humanistic ones, and the technical aspects need to be sorted out (for which I am essentially ignorant and do profusely apologize for any dumb comments!). But it does sound like you are looking to replace the PDF, which electronically binds page images plus full-text files plus references, with something more robust and open, yes? What about XML or HTML5? Everpub could have its own scheme. Your technical spec for the Everpub schema could render on the fly into various formats for whatever purpose is needed in context, the way LaTeX can output to PDF for the journal publishers or the digital repository.

Finally, a shameless plug and also a really significant use case for everpub: theses, because they are generally produced as compilations and have only universities as publishers. No other gods to please. We are currently working with Overleaf to coax grad students off PDF and into LaTeX to create more open, reusable research outputs. This is our 'Author Carpentry' approach. Caltech would love to pilot everpub for our dissertations!

lukasheinrich commented:

Interesting points @Repositorian. I agree theses are a good first target (and part of why I am here; ideally I'll have the first everpub'ed thesis :-P).

You say that we all see the publication as a parent with various child data fragments (datasets, code, auxiliary material). I think this is a notion we would like to break up somewhat. Similar to how an essay/article can be published in multiple places, I think we would like to come to a point where the actual pdf/tex/prose document is only one of many interconnected data fragments that do not necessarily all belong to a single publication.

An everpub project might contain

  • a new data analysis method in the form of new computer programs
  • main text / exposition
  • datasets
  • the datasets might be shared with many other everpub projects
  • the computer programs might be shared with a small set of other publications as well
  • the resulting exposition might be published in multiple places

Does the librarian community have concepts of breaking up a record into multiple sub-records on equal footing? I see the 'main record' more like a 'tag' (perhaps a DOI) that can be attached to an arbitrary number of records (each with their own DOI); Zenodo, I think, has some notion of relations between DOIs.

PS: maybe this should be its own topic or moved to #105

Repositorian commented:

So I'll defer to you all on whether to move this discussion to the Archiving issue (#105)... IMO, our effort here to envision and describe a "publishing object model" of the future is messy, rapidly evolving, and emergent, and so is hard to categorize. It feels like with everpub we are trying to think well beyond existing models, rather than devise incremental changes to them (as has been the case with the Research Data Alliance's data publishing groups and basically everybody else since journals went electronic in the early 1990s!). Something more transformative seems to be happening with everpub ... a new species is being birthed.

No one on this site (including me) appears to suggest that an everpub model (or any new "paper of the future" proposal) need privilege text as the preferred component. But I do think that text is a very convenient 'grout' between the tiles of the mosaic, because it is both machine AND human readable. Ultimately, a big subset of the users of 'everpub' will be readers, peer reviewers, thesis advisers, editors, funders and other humans who will render the content of the 'publication' using eyes as the technology.

In the library paradigm (which derives from the formal publishing ecosystem), an object needs to have some kind of fixity and persistence to earn our commitment. Fixity and persistence are sacred cows that seem difficult to deconstruct.

That is not to say that everpub cannot totally deconstruct and then reconstruct an object model that still provides fixity and persistence. There may be some models to draw on, like the RDA IG on Citing Dynamic Datasets... (essentially they propose a way to issue and version precise references for computationally derived database subsets).


khinsen commented Mar 5, 2016

@Repositorian I am moving this to a new issue: #113
