Starrydata = Starrydata2 ≠ Starry data

We apologize for the confusion over the naming of our web system. ‘Starry data’ was our first web system, which did not work well. ‘Starrydata’ (also called ‘Starrydata2’) is the second web system, which we developed from scratch.

Both web systems were developed to collect data efficiently from plots in published papers. Unfortunately, the first version, ‘Starry data’, turned out to be inefficient, mainly because of the unexpected variety of PDF formats. We learned from that experience and designed a completely new web system, ‘Starrydata2’ (or simply ‘Starrydata’).

We present the differences between the two web systems in the table below.

| | Old version (April 2016-) | New version (June 2017-) |
|---|---|---|
| Name | Starry data / Starrydata1 | Starrydata / Starrydata2 |
| Programming language | Ruby (Ruby on Rails) | Python (Django) |
| Backend database | MySQL | MongoDB |
| Files stored on server | Full-text PDFs, figure images, data from plots | Text only (bibliographic information & data from plots) |
| Processing order of the figures | System determines | User selects which figure to work on next |
| Data extraction from images | Manual clicking | Automatic color detection / manual clicking |
| Data editing | Not allowed | Allowed, with version-controlling system |
| Language | Japanese | English |
| Basic user interface | Data extractor | Reference manager |
| Plot digitization | Original digitization program (minimum features of data point clicking) | WebPlotDigitizer (includes various features: semi-automatic digitization, zoom-up/down) |
| How to open plot images | Automatic extraction from PDF file | By partial screenshot of a PDF file |

Data collection workflow in the old version ‘Starry data’

  1. Search for and download the PDF file of a paper of interest.
  2. Upload the PDF file to Starry data. (This may be questionable in terms of copyright.)
  3. If the DOI is readable, the paper appears on Starry data. (If not, the paper cannot be used.)
  4. The user clicks all the images extracted from the PDF. (If the PDF contains many unnecessary images, the user has to delete all of them. Sometimes tens to hundreds of such images appeared, which made it hard to find the figure of interest.)
  5. The user goes to the classification page, classifies all the extracted figure images, labels the axes, and inputs the number of samples.
  6. The user goes to the extraction page and clicks all the data points.
  7. The user goes to the details page and writes down all the sample details.
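
Step 3 depends on a DOI being readable from the uploaded PDF. The Python sketch below is only an illustration of that kind of check, not the code used in Starry data (which was written in Ruby). It assumes the text of the paper has already been extracted into a string, and it uses the DOI pattern commonly recommended by Crossref for modern DOIs. The helper name `find_doi` and the sample DOI are hypothetical.

```python
import re

# Pattern for modern DOIs (the form commonly recommended by Crossref).
# Older or unusual DOIs may not match; this is an assumption of the sketch.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")


def find_doi(text):
    """Return the first DOI-like string in `text`, or None if none is found."""
    match = DOI_PATTERN.search(text)
    if match is None:
        return None
    # Trailing punctuation usually belongs to the sentence, not to the DOI.
    return match.group(0).rstrip(".,;")


# Placeholder text standing in for the first page of a paper.
first_page = "Published online. https://doi.org/10.1000/xyz123."
doi = find_doi(first_page)
if doi is None:
    print("DOI not readable: the paper cannot be registered")
else:
    print("DOI found:", doi)
```

A check like this fails whenever the DOI is missing, printed as an image, or broken across lines, so a paper can fail registration even though the PDF itself is intact.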

When we started extracting data with ‘Starry data’, we found that this workflow was not efficient for collecting data from papers: about one third of the papers were lost to processing failures. These papers failed either in the automatic retrieval of bibliographic information or in the automatic extraction of images, because of the variation in PDF formats. Once a paper had failed, it could no longer be accessed. Even when the images were extracted successfully, mistakes in the input were difficult to fix, because the system did not allow editing.
