Lafka paper

by José Manuel Barrueco Cruz and Thomas Krichel

0. Status

This is a requirements document for the CitEcCyr task of downloading PDF files. CitEcCyr is a project funded by RANEPA.

This is the current version. We have archived versions of this document.

1. Basic assumptions

Papers’ full texts are contained only in PDF payloads. We reach a PDF payload through at most one intermediate HTML payload.

2. Input

Each paper has an identifier, henceforth pid. Its metadata may also contain a number of presumed full-text links, called futlis. These may lead to full text, but we cannot guarantee this.

The software described here does not read the metadata. Instead, it is given a futli and a pid.

3. Fetch

We use one warc for each paper. To find the warc from the pid, we use a function from a library. The functions are kept in a hash, with each collection (RePEc or Socionet) having its own function.
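
As an illustration, the dispatch could look as follows in Perl; the directory layout and file-naming scheme here are assumptions, not the actual library code:

  my %warc_path_for = (
      'RePEc'    => \&repec_warc_path,
      'Socionet' => \&socionet_warc_path,
  );

  sub warc_path {
      my ($collection, $pid) = @_;
      my $fun = $warc_path_for{$collection}
          or die "no warc function for collection $collection";
      return $fun->($pid);
  }

  sub repec_warc_path {
      my ($pid) = @_;
      # hypothetical naming scheme: flatten the handle into a file name
      (my $file = lc $pid) =~ s/[^a-z0-9]+/_/g;
      return "/warc/repec/$file.warc";
  }

  sub socionet_warc_path {
      my ($pid) = @_;
      (my $file = lc $pid) =~ s/[^a-z0-9]+/_/g;
      return "/warc/socionet/$file.warc";
  }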

On any run, we first store a copy of the futli payload in a temporary warc. We then transfer it to the paper’s warc if it is not already in that warc.
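
The transfer step might be sketched as follows; read_payload_digests and append_record are hypothetical helpers, and the digesting scheme is an assumption:

  use Digest::SHA qw(sha1_hex);

  sub transfer_payload {
      my ($payload, $paper_warc) = @_;
      my $digest = sha1_hex($payload);
      # hypothetical helper: digests of payloads already in the paper's warc
      my %seen = map { $_ => 1 } read_payload_digests($paper_warc);
      # hypothetical helper: append a record to the paper's warc
      append_record($paper_warc, $payload) unless $seen{$digest};
  }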

The payload from accessing the futli can be either PDF or HTML; otherwise we write a message to a log for further examination. If the accessed resource is a PDF, it is called a full text (henceforth fut). If the accessed resource is an HTML payload, it is called a splap.

If the payload is a PDF, we place it in the warc and stop. Otherwise the HTML is presumed to be a splap, and we place it in the warc.
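
A sketch of this dispatch on the payload type, using LWP::UserAgent; store_in_warc and secondary_processing are hypothetical helpers:

  use LWP::UserAgent;

  my $ua  = LWP::UserAgent->new(timeout => 30);
  my $res = $ua->get($futli);

  my $type = $res->content_type // '';
  if ($type eq 'application/pdf') {
      # a fut: store it and stop
      store_in_warc($warc, $res);
  }
  elsif ($type eq 'text/html') {
      # a splap: store it, then do secondary processing
      store_in_warc($warc, $res);
      secondary_processing($warc, $res->decoded_content, $res->base);
  }
  else {
      # neither PDF nor HTML: log for further examination
      warn "unexpected content type '$type' for futli $futli\n";
  }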

Then we do further processing. This is called secondary processing. It works as follows. We first look for the HTML meta tag citation_pdf_url and try that first. If there is no such tag, or if the payload of the citation_pdf_url is not a fut, we try all the links (<a> elements) in the HTML. We store the PDFs in the warc. We stop at this point. The combination of a PDF fut, or of a splap and a fut, for a futli is called the futli trail at a point in time. If the trail leads nowhere, the trail contains only the splap.
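
A sketch of the secondary processing, here using Mojo::DOM for the HTML parsing (any HTML parser would do); fetch_fut is a hypothetical helper that downloads a URL and, if the payload is a PDF, stores it in the warc and returns true:

  use Mojo::DOM;
  use URI;

  sub secondary_processing {
      my ($warc, $html, $base_url) = @_;
      my $dom = Mojo::DOM->new($html);

      # try the citation_pdf_url meta tag first
      if (my $meta = $dom->at('meta[name="citation_pdf_url"]')) {
          my $url = URI->new_abs($meta->attr('content'), $base_url);
          return if fetch_fut($warc, $url);   # hypothetical helper
      }

      # otherwise try every <a> link on the page
      for my $a ($dom->find('a[href]')->each) {
          my $url = URI->new_abs($a->attr('href'), $base_url);
          fetch_fut($warc, $url);             # hypothetical helper
      }
  }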

4. Custom header

We use custom headers in the warcinfo record. For every URL that we download a copy of, we use

Paper-Id:
the identifier of the paper.
Futli:
the futli.

In addition, for secondary processing we add

HTML-Url:
the URL of the HTML page where we got the link from. Note that the link may come from the citation_pdf_url metadata element.
HTML-Date:
the date, as marked in the warc metadata, at which the HTML is stored. This allows us to distinguish various versions of the stored contents of HTML-Url.

The date is a 14-digit UTC timestamp formatted according to YYYY-MM-DDThh:mm:ssZ, described in the W3C profile of ISO8601.
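
For illustration, the custom headers on a record written during secondary processing might look as follows (all values hypothetical):

  Paper-Id: RePEc:foo:wpaper:123
  Futli: http://foo.org/papers/123.html
  HTML-Url: http://foo.org/papers/123.html
  HTML-Date: 2017-03-01T12:00:00Z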

5. Summary

We need to summarize data for papers in order to access the data in the warc in a timely fashion. Thus, when a warc is updated, a corresponding summary for the warc must be updated as well. Software that accesses futs will never actually read the warc first. Instead, it will read the summary file and, using instructions in the summary, access the warc. Thus a summary file contains the information that an application needs to access the data in the warc. The warc summary file, henceforth suwaf, is a file that contains a simple JSON structure. This contains the handle of the paper in the id key, as well as a series of futli elements. In Perl parlance:

  $f->{$id}->{$futli_1}->{$time_1}->{'s'}   # start byte in the warc
  $f->{$id}->{$futli_1}->{$time_1}->{'l'}   # length of the fut payload for the futli
  $f->{$id}->{$futli_2}->{$time_2}->{'s'}   # start byte in the warc
  $f->{$id}->{$futli_2}->{$time_2}->{'l'}   # length of the fut payload for the futli
  $f->{$id}->{$futli_2}->{$time_3}->{'s'}   # start byte in the warc
  $f->{$id}->{$futli_2}->{$time_3}->{'l'}   # length of the fut payload for the futli

The time is the time at which the fut is obtained. We need the time to potentially distinguish between various futs coming out of the same futli. Note that we count any futlis that are distinct as character strings as different here, even though there are trivial equivalence rules. E.g., http://foo.org/blar/../buff is the same resource as http://foo.org/buff, but we treat them as different. However, since we never store two futs with the same signature, this should not be an issue. The start and length parameters would simply coincide.
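
For illustration, a suwaf for a single paper might look as follows, under one plausible reading of the structure above (handle, URLs, times, offsets, and lengths all hypothetical):

  {
    "id": "RePEc:foo:wpaper:123",
    "http://foo.org/papers/123.pdf": {
      "2017-03-01T12:00:00Z": { "s": 2048, "l": 531441 }
    },
    "http://bar.org/mirror/123.pdf": {
      "2017-02-01T09:30:00Z": { "s": 533490, "l": 530000 },
      "2017-03-05T10:00:00Z": { "s": 1063491, "l": 529800 }
    }
  }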

To store the summaries, we use a different base directory from the warc directory, combined with the same collection-based function that assigns a file to an identifier.

6. Legacy data

We have a number of legacy papers that are not in a warc but use an earlier scheme. For these, we set up a special ArchEc server to pull them into the warcs to start with.

7. Modules

We have some partial implementation software written in Perl.