Case Study: An Introduction to Code Ocean—Creating and Uploading Content into Overleaf

Graham · September 21, 2017

Post-publication update (9 November 2017)

Subsequent to publication of this blog post on 21 September 2017, Overleaf and Code Ocean decided to publish the example Code Ocean project discussed within this article. It was released on 7 November 2017 and is now available on the Code Ocean platform: FreeType and PDF tests for Overleaf/Code Ocean blogpost.

Introduction

Code Ocean is a cloud-based computational reproducibility platform that provides researchers and developers with an easy way to share, discover and run code published in academic journals and conferences. In this article we show how files produced by Code Ocean algorithms and projects can be uploaded into an Overleaf LaTeX document. We begin with a very brief overview of reproducibility issues and cite an example based on a journal paper (from 1997) which discusses re-implementation of TeX’s mathematical typesetting algorithms. Following an overview of Code Ocean’s features, which shows how Overleaf can import content from published algorithms, we conclude by discussing an example of a Code Ocean project which generates content specifically for use within Overleaf LaTeX documents.

Note: This article is based on Code Ocean’s services as defined in September 2017. As with any new cloud-based service, it is likely to evolve rapidly and introduce new features beyond those discussed here. Additionally, screen images incorporated within this article will, of course, reflect the design of Code Ocean’s interface at the time those images were produced.

Reproducibility 101

The reproducibility of academic research is a much-discussed topic which has been debated for some years and has recently become the focus of renewed interest. An exploration of reproducibility in general is outside the remit of this article; our objective is rather more prosaic and practical: what does Code Ocean do and offer, and how might you use it with Overleaf?

A quick Google search will generate an extensive list of articles, blog posts and other resources which discuss reproducibility issues in great detail. For example, these articles in Nature publications:

However, for the purposes of this article a practical working definition of reproducible research will suffice:

“The term reproducible research refers to the idea that the ultimate product of academic research is the paper along with the laboratory notebooks and full computational environment used to produce the results in the paper such as the code, data, etc. that can be used to reproduce the results and create new work based on the research.”

Source: https://en.wikipedia.org/wiki/Reproducibility#Reproducible_research

Our focus here, and that of Code Ocean, is computational reproducibility: that component of reproducible research which focuses on being able to reuse data, code or the implementation of an algorithm which might be published as part of a research paper.

Re-using code: A TeX algorithm example

Suppose you have just read a paper which contains interesting algorithms, code or perhaps a detailed statistical analysis of some data. Maybe you’d like to reuse some, or all, of that work in your own research: perhaps exploring it to gain a deeper understanding, possibly to enhance it, or simply to use it. Perhaps reflecting this author’s own interests, but to illustrate the general point, here’s a TeX-related algorithm example from a paper published in 1997 titled A Functional Description of TeX’s Formula Layout by Reinhold Heckmann and Reinhard Wilhelm (the formal publication is here).

In that paper the researchers discuss their re-implementation of TeX’s math typesetting algorithms using the functional language SML. To build an understanding of TeX’s math typesetting algorithms sufficient to re-implement them in another programming language is a truly Herculean undertaking and an incredible achievement by the paper’s authors, as anyone who has browsed or read the relevant sections of TeX: The Program can attest.

The authors provide extensive fragments of SML code to demonstrate various aspects of their re-implementation of TeX’s math typesetting algorithms and, with regard to the entire source code, the authors note:

“The current version is available in directory formulae of the ftp server ftp.cs-uni-sb.de of the University of the Saarland.”

Remarkably, at the time of writing that FTP directory still exists—some 20 years after that article was published—and yes the code is there for those who want to try running it. It turns out that the code is also available on GitHub.

Dated 17 July 1996 (on the FTP server), the README text file accompanying the code says:

This directory contains a re-implementation of TeX's
formula layout written in Standard ML of New Jersey,
version 109.2
...
...
Unfortunately, there is no documentation of the usage
of the system other than the remarks in this file
(mainly because we have not yet implemented a
nice user interface).

Clearly, that paper represents a very considerable investment of time, expertise and perseverance. Imagine if, at the time they produced this work, those authors had access to (and used) a platform on which to develop and publish the code for others to run, explore and investigate. It’s interesting to speculate if that might have increased the visibility of this very interesting work, possibly inspiring others to build on their results—which is indeed a stated aim of their work and implementation:

“… it extracts this particular subtask of TeX from the monolithically designed TeX system leading to the possibility to study it independently and to potentially use it in systems other than TeX.”

Additional re-implementations of TeX’s math typesetting algorithms, perhaps in other programming languages, might provide standalone libraries of code which could enable sophisticated mathematical typesetting in many more open source projects. Of course, the MathJax developers have undertaken the same heroic journey to produce a JavaScript implementation, as has Khan Academy with KaTeX.

You’ve got the code, what’s the problem?

Having discovered a paper which contains code, data or an algorithm of interest, one of the first challenges is obtaining the actual code (and/or data). Today that might be less of an issue, given the availability of code repositories such as GitHub, and with governments, research funders and publishers requiring code and data to be made publicly available. Even when you have obtained the code, the job is not necessarily done: you still need to get it to run, and here’s where the “fun” might actually start.

Dante’s off-by-one error: “dependency hell”

If you’re not a programmer you might pose the question “You’ve got the code, what’s the problem?” and it’s a good question; however, the key issue is often one of dependencies. When the author(s) originally wrote and published their code, we assume it worked for them within the runtime environment of the machine(s) on which it was written. For anyone else to run that same program, a key prerequisite might be the need to recreate the authors’ computation environment: putting into place any resources upon which the code depends for successful execution, namely its dependencies. I dare say that users of LaTeX understand the consequences of dependencies!

Those dependencies might comprise the operating system, the chosen programming language, configuration files, data, packages and external code libraries. To compound the challenge, many of those dependencies can require specific versions—which version of the operating system, programming language or external code libraries were used? Code libraries (providing APIs) change over time as functions are renamed or even removed, potentially generating complex version-specific compatibility issues which can be extremely time-consuming to identify and resolve. As time passes, the number of version-specific dependencies is likely to increase and many of those may not have been fully documented in the original article—only coming to light as you try to run the code and hunt down the resources required by the author’s program.

This is often referred to as “dependency hell” because resolving one dependency can frequently generate a cascade of additional dependencies, creating a chain of related resources that all need to be in place before the code will actually run.
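
As a rough illustration of such a cascade, here is a minimal Python sketch that walks a chain of dependencies and lists everything that must be in place before a program will run. The package names and the REQUIRES table are hypothetical, invented purely for illustration; real package managers also have to reconcile version constraints, which this sketch ignores.

```python
def resolve_order(package, requires, resolved=None, in_progress=None):
    """Return a list in which every dependency appears before the package needing it."""
    if resolved is None:
        resolved = []
    if in_progress is None:
        in_progress = set()
    if package in resolved:
        return resolved
    if package in in_progress:
        # We have come back to a package we are still resolving: a cycle.
        raise ValueError(f"circular dependency involving {package!r}")
    in_progress.add(package)
    for dependency in requires.get(package, []):
        resolve_order(dependency, requires, resolved, in_progress)
    in_progress.discard(package)
    resolved.append(package)
    return resolved


# Hypothetical dependency table: each key needs every package in its list.
REQUIRES = {
    "analysis-script": ["plot-lib", "stats-lib"],
    "plot-lib": ["image-lib", "font-lib"],
    "stats-lib": ["linear-algebra-lib"],
    "image-lib": ["compression-lib"],
}

order = resolve_order("analysis-script", REQUIRES)
print(order)
```

In this made-up table, asking for a single script pulls in six further packages, each of which has to be located and installed first.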

An Overview of Code Ocean

As discussed, academic papers often use, or reference, prior work such as data sets or code/algorithms, or they might publish/present brand new algorithms together with interesting code and/or data. In either case, readers may want to use the code/algorithms in their own work but, aside from actually locating usable code, getting that code to run on their local computer/device can be challenging—particularly for non-programmers. Clearly, such technical difficulties can create barriers to reuse which is not conducive to independent reproduction, checking and validation of an author’s work—and might inhibit others from building on existing research to generate new results or insights.

Code Ocean aims to address the challenges of computational reproducibility by simplifying the processes involved in sharing code and algorithms: not by hosting them within a static repository but by providing an infrastructure which enables users to execute the code, re-running algorithms to generate new results after uploading their own data, changing settings, adjusting parameters, or even editing/changing the code. In effect, the platform acts as an intermediary that provides a shared runtime and computation environment.

After publication on the Code Ocean platform, an algorithm’s code is packaged into easily shared and reusable components which can be distributed through simple web links or fully functional “widgets”. These “widgets” can be embedded into web pages or within the full-text HTML of journal articles—providing a run-time version of your algorithm that readers can engage and interact with, in situ and in context as part of a published paper. Code Ocean includes an interface designer which can help users by making it easy and non-intimidating to engage with published algorithms—especially for non-programmers who may prefer not to directly edit or amend the code. Users are free to design their own interface if the algorithm’s author chose not to provide one.

Example of embedded Code Ocean “widget”

Here is an example project published on Code Ocean: 3D convolutional Neural Networks for Audio-Visual Recognition. We will embed this example using a “widget” provided by Code Ocean:

At the time of writing, Code Ocean provides “out of the box” support for 10 programming languages commonly used in academic research: Python, R, MATLAB, Julia, Octave, C/C++, Fortran, Perl, Java, and Lua—the platform can support other open-source languages capable of being installed onto a Linux system. Programmers can upload code to Code Ocean from their computer, import it from a GitHub repository or write code directly on the Code Ocean platform. Collaborators can be invited to join your project—perhaps to contribute code, check your work or test it. Whilst you are developing, testing and debugging your work the projects are private and not publicly accessible.

The need for distributable content components which can travel to where the readers happen to be—now, or in the future—is vital in today’s world: it’s far better to take your content to its readers than trying to make them travel to your content. Functional content which is easily shared, and works across different platforms and devices, not only helps to raise the visibility of its authors and originators but also rides the fickle winds of fashion, ready for the “next big thing” in the evolution of popular social media and publication platforms.

Publication on Code Ocean: DOIs, citability and recognition

To make a piece of work hosted on Code Ocean available to the research community you’ll need to publish it through Code Ocean’s validation procedures. Your work will be checked prior to release and upon publication it will be allocated a Digital Object Identifier (DOI) which, in effect, provides your algorithm with a unique identifier that can be used to cite your work within academic papers—helping to ensure you receive due credit for the contribution made by your algorithm.

A DOI creates a permanent identifier which, with the associated object metadata, provides a defined object identity and existence which persists outside of any literature in which it might appear or be referenced. DOIs not only facilitate citation but also make it easy to find and access the data/code via DOI-lookup technologies: readers can be taken directly to the code/data published on Code Ocean.

Using Code Ocean algorithms with Overleaf

Code Ocean was launched in February 2017, just 7 months before this blog post was written; so it’s early days but already you can browse a list of published algorithms which are categorized to indicate the intended application domain (physics, bioinformatics, engineering, computer science, earth science, etc.).

How do you get Code Ocean output/content into Overleaf?

The form and type of output produced by each algorithm published on Code Ocean is specific to each program: some will output graphics in PNG or PDF format, others will generate HTML, plain text files or some other specialized format.

When a project is published on Code Ocean the algorithm’s author(s) might choose to include a selection of sample output files to be published along with the code. Users can download those published examples and cite them in their work—they won’t change because, in effect, those files are “frozen” at the point of publication. Each time you run a published algorithm it will generate a fresh set of results—perhaps based on any parameter changes or settings you have used. The published results and the results from your run will both be listed in the Results pane, and available to download, but only the results published by the algorithm’s author(s) are accessible via a URL—all graphics produced from your runs of the algorithm have to be downloaded to your local device. The following annotated graphic provides an example:

Using URLs to upload published results into an Overleaf project

To upload one of the Code Ocean graphics listed under “Published Result”:

  1. On Code Ocean: Select the graphic of interest and use the Copy Link option under the Actions drop-down menu:
  2. On Overleaf: Select the files drop-down menu and choose URL:
  3. Paste the URL (provided by Code Ocean) into the Overleaf dialog box, and give the file the name you wish to use in your Overleaf project. Select ADD FILE to upload the file into your Overleaf project.

Using non-published results in an Overleaf project

All graphics which result from running the algorithm will need to be downloaded to a local device and then re-uploaded to Overleaf in the usual way.

  1. On Code Ocean: Within the Results pane, select the graphic of interest and under the Actions drop-down menu select Download. Note that you can download individual graphics or download the entire /results directory as a ZIP file:
  2. On Overleaf: Select the files drop-down menu, choose Computer and follow the standard Overleaf procedures to upload the Code Ocean file:

Creating content for LaTeX: A sample Code Ocean project

Code Ocean was launched to address the issue of computational reproducibility; however, it’s quite possible to leverage Code Ocean’s capabilities for day-to-day projects which aren’t destined for publication. It does, after all, provide a convenient computation and development platform upon which to generate a wide variety of content for upload into Overleaf and use within LaTeX projects.

As part of the research for this article I created a Code Ocean project which reflects my personal interests in content production. Being neither an academic researcher nor an inventor of clever algorithms, I have produced a sample project that is rather mundane, and the absence of a published version with a DOI is unlikely to be detrimental to scientific progress.

My preferred programming languages are Lua (for scripting) and C (with a sprinkling of C++ for string handling…) so by way of a small, but slightly pointless, example I explored direct generation of PDFs using several C code libraries available on the Code Ocean platform:

  • the Haru PDF C library (for generating PDF files);
  • FreeType C library (for rasterizing font glyphs);
  • libPNG C library (for reading/writing + processing PNG files).

FreeType was used to rasterize some glyphs (from the font Amiri) and libPNG was used to output a transparent PNG file containing FreeType’s rasterized glyph image data. PDF files are generated using the Haru PDF C library—the PDFs contain just a rasterized glyph (incorporated via a transparent PNG file) plus some 1750 circles randomly located on the page, filled with random colours. I used a random number generator (RNG) from http://pcg-random.org which provides several implementations of RNGs in C/C++—all available for download. That site, by Melissa O'Neill, Professor of Computer Science, also provides background theory on RNGs.
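
To make the circle-generation step concrete, here is a minimal sketch of how the parameters for such a page might be produced. It is not the author's code: it is written in Python rather than C, it uses Python's built-in random module in place of the PCG generator, and it stops short of writing a PDF. The page size (A4 in points), the radius range and the seed are all assumptions chosen for illustration.

```python
import random


def random_circles(count=1750, page_width=595.0, page_height=842.0,
                   max_radius=20.0, seed=2017):
    """Return `count` circles as (x, y, radius, (r, g, b)) tuples.

    Page size, radius range and seed are illustrative assumptions.
    """
    rng = random.Random(seed)  # seeded, so the same page can be regenerated
    circles = []
    for _ in range(count):
        radius = rng.uniform(1.0, max_radius)
        # Keep the whole circle on the page by insetting the centre by its radius.
        x = rng.uniform(radius, page_width - radius)
        y = rng.uniform(radius, page_height - radius)
        # RGB components between 0 and 1, as PDF colour operators expect.
        colour = (rng.random(), rng.random(), rng.random())
        circles.append((x, y, radius, colour))
    return circles


circles = random_circles()
print(len(circles), "circles; first:", circles[0])
```

Each tuple could then be handed to whichever PDF library is in use to draw and fill the circle. Fixing the seed is a small nod to reproducibility: the same "random" page can be regenerated on demand.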

Here is a screenshot showing FreeType, libPNG and the Haru PDF library being incorporated into a Code Ocean C/C++ project:

Because I don’t program in most of the languages supported by Code Ocean I’m unable to comment or make suggestions concerning their graphics production capabilities. However, if you are a C/C++ programmer you can leverage Code Ocean’s infrastructure to use a wide range of pre-built C/C++ libraries available on the platform, which is particularly useful if you have specialist graphics production or text-processing requirements. You are not restricted to producing graphics: you can also process and output textual content and download that for use within your LaTeX project.

Here is a screenshot showing how to browse all the packages (libraries) available for use within Code Ocean C/C++ projects—at the bottom-left is a Show all packages check box.

The Code Ocean interface

The main interface to your project is divided into several panes as shown in this annotated screenshot:

As noted in the Code Ocean documentation, any output that you want to preserve after the code has completed execution needs to be stored in a directory called /results—or within a subdirectory of /results. For example, if you want to create a file called myresults.txt you need to create that file using a path such as /results/myresults.txt or /results/myoutput/myresults.txt if you have created the subdirectory /myoutput within /results.
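
The /results convention can be sketched in a few lines of Python. The helper name save_result is hypothetical, and the results_root parameter is an assumption added so that the sketch can also be tried on a machine where /results does not exist.

```python
import tempfile
from pathlib import Path


def save_result(text, relative_path, results_root="/results"):
    """Write `text` to results_root/relative_path, creating subdirectories as needed."""
    target = Path(results_root) / relative_path
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(text, encoding="utf-8")
    return target


# Outside Code Ocean there is usually no writable /results directory,
# so this demonstration points the helper at a temporary directory instead.
with tempfile.TemporaryDirectory() as scratch:
    saved = save_result("run finished\n", "myoutput/myresults.txt", results_root=scratch)
    print(saved.relative_to(scratch))
```

On the platform itself the default results_root would be left alone, so the same call would write to /results/myoutput/myresults.txt.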

Within the above screenshot you can see a file called main.sh: this is a so-called bash script file and, for my project, it is set as the main/entry point file, which Code Ocean uses to compile the source code (using gcc and g++), build the executables and run them to generate output files saved into the /results directory.

Adding an interface to your Code Ocean project

Code Ocean includes an interface designer so that users can engage and interact with your code through that interface rather than work directly with the code—essential for non-programmers and anyone unfamiliar with the programming language used to implement the algorithm.

Here is a very simple interface added to my project—I did not implement the ability for a user to input data values (change parameters) and pass those to my code but, of course, that is supported by Code Ocean’s interface designer. This interface simply allows you to run the program and view the results produced by my project: a PDF file containing a transparent PNG graphic superimposed on top of coloured circles.

Accessing the output: Importing into Overleaf

When a Code Ocean project is unpublished, the only option is to manually download the files listed in the Results pane on the right-hand side of the interface:

As explained above, uploading unpublished results (files) from a Code Ocean project into Overleaf is extremely easy: download them to your local device and then re-upload them into your Overleaf project. The following screenshot contains a graphic (PDF format) produced on Code Ocean and imported into Overleaf: 1750 randomly-sized and coloured circles overlaid by a large transparent letter ‘O’ (rasterized by FreeType). The PDF file was produced by the Haru PDF C library.

None of this example will win an award for artistic excellence but the key message here is that Code Ocean can be used as an external platform for producing a wide range of programmatically-generated content which can be incorporated into your LaTeX documents:

About Code Ocean

Code Ocean is a cloud-based computational reproducibility platform that provides researchers and developers an easy way to share, discover and run code published in academic journals and conferences. More and more of today's research includes software code, statistical analysis and algorithms that are not included in traditional publishing. But these are often essential to reproducing the research results and reusing them in a new product or research. This creates a major roadblock for researchers, one that inspired the first steps of Code Ocean as part of the 2014 Runway Startup Postdoc Program at the Jacobs Technion Cornell Institute. The company officially launched the product in February 2017 and today features over 100 publicly accessible and executable “Compute Capsules” accompanying scientific articles.
