Web

The Web Agent was used early in the development of Event Data but is no longer active.


Agent Source token	`d9c55bad-73db-4860-be18-520d3891b01f`
Consumes Artifacts	`domain-list`
Subject coverage	Any webpage.
Object coverage	All DOIs, all Article Landing Pages
Data contributor	Various
Data origin	Authors of webpages
Freshness	Infrequent
Identifies	Linked DOIs, unlinked DOIs, landing page URLs
License	Creative Commons CC0 1.0 Universal (CC0 1.0)
Looks in	Text of webpages.
Name	Web
Operated by	Crossref
Produces Evidence Records	Yes
Produces relation types	`mentions`
Source ID	`web`
Updates or deletions	None expected

What it is

The Web source is a catch-all name we give to Events collected from the Web when we follow links that fall outside any other source. As with all other sources, we don't visit webpages that belong to Crossref members.

Many Agents, such as Reddit Links, Newsfeed and Wikipedia, follow links.

The Web source contains Events from any non-member webpage that we think might be relevant. We monitor a list of URLs that we think might link to Items via their DOIs or landing pages, and we visit each one to see whether we can find any Items. We curate this list as best we can, so that we never follow a link that we believe belongs to a Crossref member or that robots.txt directs us not to follow.

The list of URLs can come from a range of sources, including those submitted by users. If you have such a list, feel free to contact us.
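The curation rule described above (skip member sites, respect robots.txt) can be sketched as follows. This is a minimal illustration using only the Python standard library, not the Agent's actual code: the member domain set, the user agent string and the function names are all hypothetical, and the robots.txt text is passed in rather than fetched.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Hypothetical inputs: neither the member domain set nor the user agent
# string comes from the Event Data documentation.
MEMBER_DOMAINS = {"publisher.example"}
USER_AGENT = "ExampleEventDataBot"


def belongs_to_member(url, member_domains=MEMBER_DOMAINS):
    """Return True if the URL's host is a member domain or a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in member_domains)


def allowed_by_robots(url, robots_txt, user_agent=USER_AGENT):
    """Check a URL against the text of its site's robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def should_visit(url, robots_txt):
    """Visit only non-member pages that robots.txt does not disallow."""
    return not belongs_to_member(url) and allowed_by_robots(url, robots_txt)
```

For example, with a robots.txt that disallows `/private/` for all user agents, `should_visit` accepts `http://blog.example/post/1` but rejects both `http://blog.example/private/x` and any page on `www.publisher.example`.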

What it does

We maintain a list of URLs. The Agent submits every URL on the list to the Percolator, which looks for linked or unlinked DOIs, or linked Article Landing Pages, in the HTML of each page.
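To make the distinction between linked and unlinked DOIs concrete, here is a rough sketch of that kind of search over a page's HTML. It is not the Percolator's implementation: the regular expressions are simplified assumptions, the class and function names are hypothetical, and landing page detection is left out entirely.

```python
import re
from html.parser import HTMLParser

# Simplified patterns; assumptions for illustration, not the Percolator's rules.
DOI_RE = re.compile(r"10\.\d{4,9}/[^\s<>\"']+")
DOI_LINK_RE = re.compile(r"^https?://(?:dx\.)?doi\.org/(10\.\d{4,9}/\S+)$", re.I)


class DoiFinder(HTMLParser):
    """Collect DOIs found in link targets and in plain text."""

    def __init__(self):
        super().__init__()
        self.linked = set()
        self.unlinked = set()

    def handle_starttag(self, tag, attrs):
        # A linked DOI is the target of an <a href="https://doi.org/...">.
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        match = DOI_LINK_RE.match(href.strip())
        if match:
            self.linked.add(match.group(1).lower())

    def handle_data(self, data):
        # An unlinked DOI appears only in the text of the page.
        for match in DOI_RE.finditer(data):
            self.unlinked.add(match.group(0).rstrip(".,;)").lower())


def find_dois(html):
    """Return (linked, unlinked) sets of lower-cased DOIs."""
    finder = DoiFinder()
    finder.feed(html)
    finder.close()
    # A DOI that also appears as a link target is reported as linked only.
    return finder.linked, finder.unlinked - finder.linked
```

Calling `find_dois` on a page returns two sets, so a caller can treat hyperlinked DOIs and DOIs that only appear as text differently.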

Where data comes from

A list of URLs, some compiled internally and some submitted by users.
The content of each webpage on the list.

Example Event

{
  "obj_id": "https://doi.org/10.1017/s0963180100005168",
  "source_token": "d9c55bad-73db-4860-be18-520d3891b01f",
  "occurred_at": "2017-03-13T10:10:38Z",
  "subj_id": "http://philpapers.org/rec/ANNAAS",
  "id": "00003c22-1571-4bd3-924b-0438f6f7ff54",
  "evidence_record": "https://evidence.eventdata.crossref.org/evidence/20170313e86bef03-4556-4ecc-8401-0e71af4d0bb6",
  "terms": "https://doi.org/10.13003/CED-terms-of-use",
  "action": "add",
  "subj": {
    "pid": "http://philpapers.org/rec/ANNAAS",
    "work-type": "webpage",
    "url": "http://philpapers.org/rec/ANNAAS"
  },
  "source_id": "web",
  "obj": {
    "pid": "https://doi.org/10.1017/s0963180100005168",
    "url": "https://doi.org/10.1017/s0963180100005168"
  },
  "timestamp": "2017-03-13T10:11:19Z",
  "relation_type_id": "mentions"
}
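As a small illustration of how a consumer might read such an Event, the sketch below parses a trimmed copy of the example and pulls out the webpage, the DOI and the time. The `summarize` function and the shape of its result are hypothetical, and it assumes the `obj_id` is always a doi.org URL, as it is in this example.

```python
import json
from datetime import datetime, timezone
from urllib.parse import urlsplit

# Trimmed copy of the example Event; only the fields used below are kept.
EVENT_JSON = """
{
  "obj_id": "https://doi.org/10.1017/s0963180100005168",
  "occurred_at": "2017-03-13T10:10:38Z",
  "subj_id": "http://philpapers.org/rec/ANNAAS",
  "source_id": "web",
  "relation_type_id": "mentions"
}
"""


def summarize(event):
    """Return the webpage, the DOI it mentions, and the occurred_at time."""
    if event["source_id"] != "web":
        raise ValueError("not a Web source Event")
    occurred = datetime.strptime(event["occurred_at"], "%Y-%m-%dT%H:%M:%SZ")
    return {
        "webpage": event["subj_id"],
        "relation": event["relation_type_id"],
        # The DOI is the path of the doi.org URL, without the leading slash.
        "doi": urlsplit(event["obj_id"]).path.lstrip("/"),
        # The trailing "Z" means UTC, so attach that timezone explicitly.
        "occurred_at": occurred.replace(tzinfo=timezone.utc),
    }


summary = summarize(json.loads(EVENT_JSON))
```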

Evidence Record

The Evidence Record contains observations of type `content-url`, one for each URL visited.

Edits / deletion

We may mark Events as deleted if we later find that the `subj_id` doesn't conform to the aims of Event Data (e.g. if it belongs to a member).

Quirks

The selection of URLs doesn't follow any particular pattern.

Failure modes

Publisher sites may block the Event Data Bot from collecting landing pages.

Further information

None.