Starrydata = Starrydata2 ≠ Starry data

We apologize for the confusion over the naming of our web system. ‘Starry data’ was our first web system, which did not work well. ‘Starrydata’, or ‘Starrydata2’, is the second web system, which we developed from scratch.

Both web systems were developed to collect data efficiently from plots in published papers. Unfortunately, the first version, ‘Starry data’, could not make this very efficient, mainly because of the unexpected variety of PDF formats. We learned from the first version and designed a completely new web system, ‘Starrydata2’ or ‘Starrydata’.

The table below summarizes the differences between the two web systems.

	Old version (April 2016-)	New version (June 2017-)
Name	Starry data / Starrydata1	Starrydata / Starrydata2
Programming language	Ruby (Ruby on Rails)	Python (Django)
Backend database	MySQL	MongoDB
Files stored on server	Full-text PDFs, figure images, data from plots	Text only (Bibliographic information & data from plots)
Processing order of the figures	System determines	User selects which figure to work on next.
Data extraction from images	Manual clicking	Automatic color detection / manual clicking
Data editing	Not allowed	Allowed, with version control
Language	Japanese	English
Basic user interface	Data extractor	Reference manager
Plot digitization	Original digitization program (minimal feature set: data point clicking)	WebPlotDigitizer (includes various features: semi-automatic digitization, zoom in/out)
How to open plot images	Automatic extraction from PDF file	By partial screenshot of a PDF file
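
To illustrate what "automatic color detection" in the table can mean, here is a minimal sketch in Python. It is not WebPlotDigitizer's algorithm. It assumes the image is a grid of RGB tuples, the data points are drawn in a known color, and both axes are linear. The names `find_colored_pixels` and `pixel_to_data` are hypothetical.

```python
from typing import List, Tuple

RGB = Tuple[int, int, int]


def find_colored_pixels(
    image: List[List[RGB]], target: RGB, tolerance: int = 30
) -> List[Tuple[int, int]]:
    """Return (column, row) positions whose color is close to the target color."""
    hits = []
    for row_index, row in enumerate(image):
        for col_index, pixel in enumerate(row):
            # Sum of absolute channel differences as a simple color distance.
            distance = sum(abs(a - b) for a, b in zip(pixel, target))
            if distance <= tolerance:
                hits.append((col_index, row_index))
    return hits


def pixel_to_data(
    pixel: float, pixel_a: float, pixel_b: float, value_a: float, value_b: float
) -> float:
    """Map a pixel coordinate to an axis value from two calibration points.

    Assumes a linear axis; a logarithmic axis would need a different mapping.
    """
    return value_a + (pixel - pixel_a) * (value_b - value_a) / (pixel_b - pixel_a)


# Tiny 3x3 white image with one red "data point" at column 2, row 1.
WHITE, RED = (255, 255, 255), (255, 0, 0)
image = [
    [WHITE, WHITE, WHITE],
    [WHITE, WHITE, RED],
    [WHITE, WHITE, WHITE],
]
points = find_colored_pixels(image, RED)
```

In this sketch the user still has to calibrate each axis by giving two pixel positions and their axis values; only the search for the points themselves is automated.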

Data collection workflow in the old version ‘Starry data’

Search for and download the PDF file of a paper of interest.
Upload the PDF file to Starry data. (Questionable in terms of copyright?)
If the DOI is readable, the paper appears on Starry data.
(If not, the paper cannot be used.)
The user clicks all the images extracted from the PDF.
(If the PDF is composed of many unnecessary images, the user has to delete all of them. Sometimes tens to hundreds of such images appeared, which made it hard to find the figure of interest.)
The user goes to the classification page, classifies all the extracted figure images, labels the axes, and inputs the number of samples.
The user goes to the extraction page and clicks all the data points.
The user goes to the details page and writes down all the sample details.
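
The "DOI is readable" gate in the workflow above can be sketched in Python as follows. This is only an illustration: the post does not say how Starry data actually read the DOI, the regular expression is a commonly used pattern for modern DOIs rather than the one the system used, and `find_doi` and `register_paper` are hypothetical names.

```python
import re
from typing import Optional

# Commonly used pattern for modern DOIs. An assumption for this sketch,
# not the expression used by Starry data.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")


def find_doi(page_text: str) -> Optional[str]:
    """Return the first DOI-like string in the text, or None if there is none."""
    match = DOI_PATTERN.search(page_text)
    if match is None:
        return None
    # Drop punctuation that usually belongs to the sentence, not the DOI.
    return match.group(0).rstrip(".,;")


def register_paper(page_text: str) -> str:
    """Mimic the old workflow: without a readable DOI the paper is unusable."""
    doi = find_doi(page_text)
    if doi is None:
        return "rejected: no readable DOI"
    return f"registered: {doi}"


readable = register_paper("J. Example 1 (2016). doi:10.1000/xyz123.")
unreadable = register_paper("scanned page with no text layer")
```

A PDF whose first page is a scanned image, or whose DOI is printed in an unusual form, yields no match here, which is the kind of hard failure the old workflow could not recover from.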

When we started data extraction using ‘Starry data’, we found that this workflow was not efficient for collecting data from papers: about one third of the papers were lost to processing failures. These papers failed in the automatic retrieval of bibliographic information or in the automatic extraction of images, owing to the variation in PDF formats. Once a paper failed, it could no longer be accessed. Even when the images were successfully extracted, mistakes in the input were difficult to fix, because the system did not allow editing.

Published by starrydata
