Sources in Depth
The following is a description of the Sources of data available in CED. Every Data Source requires an Agent to process the data, so the following section describes the format of data, the agent used to collect it and issues surrounding each source.
Crossref to DataCite Links
Property | Value |
---|---|
Name | crossref_datacite |
Consumes Artifacts | none |
Matches by | DOI |
Produces relation types | cites |
Freshness | Daily |
Data Source | Crossref Metadata API |
Coverage | All DOIs |
Relevant concepts | Occurred-at vs collected-at, Duplicate Data |
Operated by | Crossref |
Agent | Cayenne |
When members of Crossref (who are mostly Scholarly Publishers) deposit metadata, they can deposit links to datasets via their DataCite DOIs. The Crossref Metadata API monitors these links and sends them to Event Data. As this is an internal system there are no Artifacts as the data comes straight from the source.
Example Event
{
"obj_id":"https://doi.org/10.13127/ITACA/2.1",
"occurred_at":"2016-08-19T20:30:00Z",
"subj_id":"https://doi.org/10.1007/S10518-016-9982-8",
"total":1,
"id":"71e62cbd-28a8-4a41-9b74-7e58dca03efc",
"message_action":"create",
"source_id":"crossref_datacite",
"timestamp":"2016-08-19T22:14:33Z",
"relation_type_id":"cites"
}
Methodology
- The Metadata API scans incoming Content Registration items and when it finds links to DataCite DOIs, it adds the Events to CED.
- It can also scan back-files for links.
Notes
- Because the Agent can scan for back-files, it is possible that duplicate Events may be re-created. See Duplicate Data.
- Because the Agent can scan for back-files, Events may be created with
occurred_at
in the past. See Occurred-at vs collected-at.
DataCite to CrossRef Links
Property | Value |
---|---|
Name | datacite_crossref |
Consumes Artifacts | none |
Matches by | DOI |
Produces relation types | cites |
Fields in Evidence Record | no evidence record |
Freshness | daily |
Data Source | DataCite API |
Coverage | All DOIs |
Relevant concepts | External Agents, Occurred-at vs collected-at |
Operated by | DataCite |
When members of DataCite deposit datasets, they can include links to Crossref Registered Content via their Crossref DOIs. The DataCite agent monitors these links and sends them to Event Data. As this is an External Agent, there are no Artifacts or Evidence Records.
Example Event
{
"obj_id":"https://doi.org/10.1007/S10518-016-9982-8",
"occurred_at":"2016-08-19T20:30:00Z",
"subj_id":"https://doi.org/10.13127/ITACA/2.1",
"total":1,
"id":"71e62cbd-28a8-4a41-9b74-7e58dca03efc",
"message_action":"create",
"source_id":"datacite_crossref",
"timestamp":"2016-08-19T22:14:33Z",
"relation_type_id":"cites"
}
Methodology
- DataCite operate an Agent that scans its Metadata API for new citations to Crossref DOIs. When it finds links, it deposits them.
- It can also scan for back-files
Notes
- Because the Agent can scan for back-files, it is possible that duplicate Events may be re-created. See Duplicate Data.
- Because the Agent can scan for back-files, Events may be created with
occurred_at
in the past. See Occurred-at vs collected-at.
Property | Value |
---|---|
Name | |
Matches by | Landing Page URL |
Consumes Artifacts | high-urls , medium-urls , all-urls |
Produces relation types | bookmarks , shares |
Fields in Evidence Record | Complete API response |
Freshness | Three schedules |
Data Source | Facebook API |
Coverage | All DOIs where there is a unique URL mapping |
Relevant concepts | Unambiguously linking URLs to DOIs, Individual Events vs Pre-Aggregated, Sources that must be queried once per Item |
Operated by | Crossref |
Agent | event-data-facebook-agent |
The Facebook Data Source polls Facebook for Items via their Landing Page URLs. It records how many 'likes' a given Item has received at that point in time, via its Landing Page URL. A Facebook Event records the current number of Likes an Item has on Facebook at a given point in time. It doesn't record who liked the Item or when then the liked it. See Individual Events vs Pre-Aggregated for further discussion. The timestamp represents the time at which the query was made.
Because of the structure of the Facebook API, it is necessary to make one API query per Item, which means that it can take a long time to work through the entire list of Items. This means that, whilst we try and poll as often and regularly as possible, the time between Facebook Events for a given Item can be unpredictable.
Freshness
The Facebook Agent uses three categories of Item: high-urls
, medium-urls
and all-urls
(see the URL Artifact lists documentation for more detail). It processes the three categories in parallel. In each category it scans the current list of all Items with URLs from start to finish, and queries the Facebook API for each one. It does this in a loop, each time fetching the most recent list of URLs.
The Facebook Agent works within rate limits of Facebook API. If the Facebook API indicates that the rate of traffic is too high then the Agent will lower the rate of querying and a complete scan will take longer.
Subject URIs and PIDs
As Facebook Events are pre-aggregated and don't record the relationship between the liker and the Item, Events are recorded against Facebook as a whole. Because we don't expect to collect Events more than once per month per Item, we create an entity that represents Facebook in a given month.
Each 'Facebook Month' is recorded as a separate subject PID, e.g. https://facebook.com/2016/8
. This PID is a URI and doesn't correspond to an extant URL. Note that the metadata contains the URL of https://facebook.com
.
This approach strikes the balance between recording data against a consistent Subject whilst allowing easy analysis of numbers on a per-month basis.
If you just want to find 'all the Facebook data for this DOI' remember that you can filter by the source_id
.
Example Event
{
"obj_id":"https://doi.org/10.1080/13600820802090512",
"occurred_at":"2016-08-11T00:00:30Z",
"subj_id":"https://facebook.com/2016/8",
"total":5681,
"id":"55492dc1-ce8a-4c5d-85d0-97a5192519c7",
"subj":{
"pid":"https:/facebook.com/2016/8",
"URL":"https://facebook.com",
"title":"Facebook activity for August 2016",
"type":"webpage",
"issued":"2016-08-01"
},
"message_action":"create",
"source_id":"Facebook",
"timestamp":"2016-08-11T00:26:48Z",
"relation_type_id":"references"
}
Landing Page URLs vs DOI URLs in Facebook
Facebook Users may share links to Items two ways: they may link using the DOI URL, or they may link using the Landing Page URL. When a DOI is used, Facebook records and shows the DOI URL but records statistics against the Landing Page URL it resolves to. This means that Facebook doesn't necessarily maintain a one-to-one mapping between URLs and statistics for that URL.
Event Data always uses the Landing page URL when it queries Facebook and never the DOI URL. If a Facebook user shared an Item using its Landing Page URL then there would be no results for the DOI, and if they used the DOI, the statistics would be recorded against the Landing Page anyway.
Here is a justification of the above approach using examples from the Facebook Graph API v2.7. Note that these API results capture a point in time and the same results may not be returned now.
Where a Facebook User has shared an Item using its DOI, Facebook's system resolves the DOI discover the Landing page. In cases where Facebook has seen the DOI URL it is possible to query using it, e.g. https://graph.facebook.com/v2.7/http://doi.org/10.5555/12345678?access_token=XXXX
gives:
{
og_object: {
id: "10150995451832648",
title: "Toward a Unified Theory of High-Energy Metaphysics: Silly String Theory",
type: "website",
updated_time: "2016-08-25T01:23:00+0000"
},
share: {
comment_count: 0,
share_count: 3
},
id: "http://doi.org/10.5555/12345678"
}
If we query for the current Landing Page URL for the same Item we see the same results. https://graph.facebook.com/v2.7/http://0-psychoceramics-labs-crossref-org.library.alliant.edu/10.5555-12345678.html?access_token=XXXX
gives:
{
og_object: {
id: "10150995451832648",
title: "Toward a Unified Theory of High-Energy Metaphysics: Silly String Theory",
type: "website",
updated_time: "2016-08-25T01:23:00+0000"
},
share: {
comment_count: 0,
share_count: 3
},
id: "http://0-psychoceramics-labs-crossref-org.library.alliant.edu/10.5555-12345678.html"
}
Here we see that Facebook considers the DOI URL and the Landing Page to have the same id
of 10150995451832648
, because the DOI URL redirected to the Landing Page URL.
DOIs can be expressed a number of different ways using different resolvers and protocols, e.g. http://doi.org/10.5555/12345678
, https://doi.org/10.5555/12345678
, http://dx.oi.org/10.5555/12345678
, https://0-dx-doi-org.library.alliant.edu/10.5555/12345678
. These may all treated as different URLs by Facebook. Therefore there is no 'canonical' DOI URL from Facebook's point of view. As they all redirect to the same Landing Page, the Landing Page is the only thing that they have in common from Facebook's perspective.
Where a user has shared the Item using its Landing Page, Facebook is not aware of the DOI. In this example, there is data for the Landing Page of an Item: https://graph.facebook.com/v2.7/http://0-www-emeraldinsight-com.library.alliant.edu/doi/abs/10.1108/RSR-11-2015-0046?access_token=XXXX
{
og_object: {
id: "1034517766662581",
description: "Impact of web-scale discovery on reference inquiryArticle Options and ToolsView: PDFAdd to Marked ListDownload CitationTrack CitationsAuthor(s): Kimberly Copenhaver ( Eckerd College St. Petersburg United States ) Alyssa Koclanes ( Eckerd College St. Petersburg United States )Citation: Kimberly Copen…",
title: "Impact of web-scale discovery on reference inquiry: Reference Services Review: Vol 44, No 3",
type: "website",
updated_time: "2016-06-30T05:01:41+0000"
},
share: {
comment_count: 0,
share_count: 8
},
id: "http://0-www-emeraldinsight-com.library.alliant.edu/doi/abs/10.1108/RSR-11-2015-0046"
}
But a Query using its DOI fails https://graph.facebook.com/v2.7/http://doi.org/10.1108/RSR-11-2015-0046?access_token=XXXX
:
{
id: "http://doi.org/10.1108/RSR-11-2015-0046"
}
Therefore, whilst Facebook returns results for some DOIs, we use exclusively use the Landing Page URL to query Facebook for activity. This takes account of users sharing via the DOI and via the Landing Page.
HTTP and HTTPS in Facebook
Many websites allow users to access the same content over HTTP and HTTPS, and serve up the same content. Whilst the web server may consider the two URLs equal in some way, Facebook doesn't automatically treat HTTPS and HTTP versions of the same URL as equal. The WHATWG URL Specification supports this position.
If we take the example of a website that allows serving of both HTTP and HTTPS content, e.g. The Co-operative Bank, we see that Facebook assigns different OpenGraph IDs and different share_count
results.
https://graph.facebook.com/v2.7/http://co-operativebank.co.uk?access_token=XXXX
{
og_object: {
id: "10150337668163877",
description: "The Co-operative Bank provides personal banking services including current accounts, credit cards, online and mobile banking, personal loans, savings and more",
title: "Personal banking | Online banking | Co-op Bank",
type: "website",
updated_time: "2016-08-31T14:07:30+0000"
},
share: {
comment_count: 0,
share_count: 910
},
id: "http://co-operativebank.co.uk"
}
https://graph.facebook.com/v2.7/https://co-operativebank.co.uk?access_token=XXXX
{
og_object: {
id: "742866445762882",
type: "website",
updated_time: "2014-09-11T17:38:25+0000"
},
share: {
comment_count: 0,
share_count: 0
},
id: "https://co-operativebank.co.uk"
}
Other sites implement automatic redirects, and an HTTP URL will immediately redirect to an HTTPS version. For example, PLoS HTTP:
https://graph.facebook.com/v2.7/http://plos.org?access_token=XXXX
{
og_object: {
id: "393605900711524",
description: "A Model for an Angular Velocity-Tuned Motion Detector Accounting for Deviations in the Corridor-Centering Response of the Bee",
title: "PLOS | Public Library Of Science",
type: "website",
updated_time: "2016-08-30T18:28:58+0000"
},
share: {
comment_count: 0,
share_count: 523
},
id: "http://plos.org"
}
And the HTTPS version:
https://graph.facebook.com/v2.7/https://plos.org?access_token=XXXX
{
og_object: {
id: "393605900711524",
description: "A Model for an Angular Velocity-Tuned Motion Detector Accounting for Deviations in the Corridor-Centering Response of the Bee",
title: "PLOS | Public Library Of Science",
type: "website",
updated_time: "2016-08-30T18:28:58+0000"
}
share: {
comment_count: 0,
share_count: 523
},
id: "https://plos.org"
}
Note the same share_count
and id
.
Therefore Facebook considers HTTP and HTTPS URLs to be equivalent if the HTTP site redirects to HTTPS.
Crossref Event Data uses the Landing Page that the DOI resolved to. If this is HTTP, then we use HTTP, and this means we query Facebook for the same URL that Facebook users share. If the site subsequently adds HTTPS redirects but CED has an outdated HTTP Landing Page URL, the way Facebook treats redirects will ensure we get the correct results.
If a situation arises where the publisher serves the same Landing Page both over HTTP and HTTPS without redirecting, CED will use the Landing Page URL that the DOI resolves to. This may result in some views not being accounted for, but it is the most accurate and consistent.
Methodology
The Agent has three parallel processes. They operate on three Artifacts: high-urls
, medium-urls
and all-urls
. The last of these contains the mapping of all known DOI to URL mappings. The first two contain subsets of these.
Each process:
- fetches the most recent version of the relevant URL List Artifact
- iterates over each the URL. It uses the Facebook Graph API 2.7 to query data for the Landing Page URL.
- the
comment_count
is recorded as an Event with the giventotal
field and therelation_type_id
ofshares
. - the
comment_count
is subtracted from theshare_count
and the result is recorded as an Event with the giventotal
field and therelation_type_id
ofbookmarks
. - When the end of the list is reached, it starts again at step 1.
Further information
Methodology
- The Mendeley agent consumes three Artifacts:
high-dois
,medium-dois
andall-dois
. It runs a three parallel processes, one per list. - For each list, the agent fetches the most recent version of the Artifact.
- It scans over the entire list, making one query per DOI.
- For each Item for which there is data, two Event is created with total values. The
reader_count
total is stored in an event withrelation_type_id
ofbookmarks
. Thegroup_count
total is stored in an event with therelation_type_id
oflikes
. - When it has finished the list, it starts again at step 1.
Further information
Newsfeed
Property | Value |
---|---|
Name | newsfeed |
Matches by | Landing Page URL |
Consumes Artifacts | newsfeed-list |
Produces relation types | mentions |
Fields in Evidence Record | |
Freshness | half-hourly |
Data Source | Multiple blog and aggregator RSS feeds |
Coverage | All DOIs |
Relevant concepts | Unambiguously linking URLs to DOIs, Duplicate Data, Landing Page Domains, Sources that must be queried in their entirety, DOI Reversal Service |
Operated by | Crossref |
Agent | event-data-newsfeed-agent |
The Newsfeed agent monitors RSS and Atom feeds from blogs and blog aggregators. Crossref maintains a list of newsfeeds, including
- ScienceSeeker blog aggregator
- ScienceBlogging blog aggregator
- BBC News
You can see the latest version of the newsfeed-list by using the Evidence Service: http://0-evidence-eventdata-crossref-org.library.alliant.edu/artifacts/newsfeed-list/current.
Example Event
{
obj_id: "https://doi.org/10.1145/2933057.2933107",
occurred_at: "2016-09-26T00:25:08Z",
subj_id: "https://rjlipton.wordpress.com/2016/09/25/a-creeping-model-of-computation/",
total: 1,
id: "170678af-92da-4375-967c-b056d828525d",
subj: {
pid: "https://rjlipton.wordpress.com/2016/09/25/a-creeping-model-of-computation/",
title: "A Creeping Model Of Computation",
issued: "2016-09-26T00:25:08.000Z",
URL: "https://rjlipton.wordpress.com/2016/09/25/a-creeping-model-of-computation/",
type: "post-weblog"
},
message_action: "create",
source_id: "newsfeed",
timestamp: "2016-09-26T00:30:18Z",
relation_type_id: "discusses"
},
Example Evidence Record
{
"artifacts": [
"http://0-evidence-eventdata-crossref-org.library.alliant.edu/artifacts/newsfeed-list/versions/41ac1c7ecf505785411b0e0b498c4cef",
"http://0-evidence-eventdata-crossref-org.library.alliant.edu/artifacts/domain-list/versions/1b2bcc1f6e77196b9b40be238675101c"
],
"input": {
"newsfeed-url": "http://www.inoreader.com/stream/user/1005830516/tag/Artificial%20Intelligence%2C%20Computer%20Science",
"blog-urls": [
"https://rjlipton.wordpress.com/2016/09/25/a-creeping-model-of-computation/",
"http://feedproxy.google.com/~r/blogspot/wCeDd/~3/pY5hWW0nwXM/sunday-morning-video-bay-area-deep.html",
« ... removed ... »
],
"blog-urls-seen": [
{
"seen-before": true,
"seen-before-date": "2016-09-25T15:59:24.000Z",
"seen-before-feed": "http://www.inoreader.com/stream/user/1005830516/tag/Artificial%20Intelligence%2C%20Computer%20Science",
"url": "http://feedproxy.google.com/~r/blogspot/wCeDd/~3/pY5hWW0nwXM/sunday-morning-video-bay-area-deep.html"
},
« ... removed ... »
],
"blog-urls-unseen": [
"https://rjlipton.wordpress.com/2016/09/25/a-creeping-model-of-computation/"
]
},
"processing": {
"https://rjlipton.wordpress.com/2016/09/25/a-creeping-model-of-computation/": {
"data": {
"seen-before": false,
"seen-before-date": null,
"seen-before-feed": null,
"url": "https://rjlipton.wordpress.com/2016/09/25/a-creeping-model-of-computation/",
"blog-item": {
"title": "A Creeping Model Of Computation",
"link": "https://rjlipton.wordpress.com/2016/09/25/a-creeping-model-of-computation/",
"id": "http://www.inoreader.com/article/3a9c6e7f83b41e90",
"updated": "2016-09-26T00:25:08.000Z",
"summary": "<p><br><em>Local rules can achieve global behavior</em><br> « ... removed ... »</p>",
"feed-url": "http://www.inoreader.com/stream/user/1005830516/tag/Artificial%20Intelligence%2C%20Computer%20Science",
"fetch-date": "2016-09-26T00:29:20.616Z"
}
},
"dois": [
"10.1145/2933057.2933107"
],
"url-doi-matches": {
"http://arxiv.org/abs/1603.07991": {
"doi": "10.1145/2933057.2933107",
"version": null
}
}
}
},
"deposits": [
{
"obj_id": "https://doi.org/10.1145/2933057.2933107",
"source_token": "c1bfb47c-39b8-4224-bb18-96edf85e3f7b",
"occurred_at": "2016-09-26T00:25:08.000Z",
"subj_id": "https://rjlipton.wordpress.com/2016/09/25/a-creeping-model-of-computation/",
"action": "added",
"subj": {
"title": "A Creeping Model Of Computation",
"issued": "2016-09-26T00:25:08.000Z",
"pid": "https://rjlipton.wordpress.com/2016/09/25/a-creeping-model-of-computation/",
"URL": "https://rjlipton.wordpress.com/2016/09/25/a-creeping-model-of-computation/",
"type": "post-weblog"
},
"uuid": "170678af-92da-4375-967c-b056d828525d",
"source_id": "newsfeed",
"relation_type_id": "discusses"
}
]
}
Methodology
- Every hour, the latest 'newsfeed-list' Artifact is retrieved.
- For every feed URL in the list, the agent queries the newsfeed to see if there are any new blog posts.
- The content of the body in the RSS feed item are inspected to look for DOIs and URLs. The Agent queries the DOI Reversal Service for each URL to try and convert it into a DOI.
- The URL of the blog post is retrieved and the body is inspected to look for DOIs and URLs. The Agent queries the DOI Reversal Service for each URL to try and convert it into a DOI.
- For every DOI found an Event is created with a
relation_type_id
ofmentions
.
Notes
Because the Newsfeed Agent connects to blogs and blog aggregators, it is possible that the same blog post may be picked up by two different routes. In this case, the same blog post may be reported in more than one event See Duplicate Data.
Property | Value |
---|---|
Name | event-data-reddit-agent |
Matches by | DOI |
Consumes Artifacts | domain-list |
Produces relation types | discusses |
Freshness | Polling approximately every 30 minutes |
Data Source | Reddit API |
Coverage | All landing page URLs and DOI URLs |
Relevant concepts | Unambiguously linking URLs to DOIs, Pre-filtering |
Operated by | Crossref |
Agent | event-data-reddit-agent |
The Reddit agent queries the Reddit API for each domain in the Landing Page Domain list. It finds discussions and comments that mention Items via their landing pages or DOIs.
Methodology
- The Reddit agent runs a loop, with a delay of a 30 minutes between runs.
- The most recent
domain-list
Artifact is fetched at the start of each loop. - During the loop, for each domain in the
domain-list
- The Agent requests all data for the domain, ordered by date descending.
- The Agent continues fetching pages of results until it finds inputs it has seen before.
- The Agent looks at every result. Where it has not seen a link before, it tries to reverse it to an Item DOI.
- Where an Item is found, an Event is created.
Example Event
{
"obj_id": "https://doi.org/10.1523/JNEUROSCI.1907-16.2016",
"occurred_at": "2016-09-25T16:59:52Z",
"subj_id": "https://reddit.com/r/science/comments/54fyzt/many_supposed_features_of_alzheimers_are/",
"total": 1,
"id": "7cc890a6-ca68-4d7c-8853-fb243aa59279",
"subj": {
"pid": "https://reddit.com/r/science/comments/54fyzt/many_supposed_features_of_alzheimers_are/",
"title": "Many supposed features of Alzheimers are artifacts of the mouse models used. The findings of over 3000 publications may need to be re-evaluated.",
"issued": "2016-09-25T16:59:52.000Z",
"URL": "https://reddit.com/r/science/comments/54fyzt/many_supposed_features_of_alzheimers_are/",
"type": "post"
},
"message_action": "create",
"source_id": "reddit",
"timestamp": "2016-09-25T20:31:36Z",
"relation_type_id": "discusses"
}
Example Evidence Record
{
"agent": {
"name": "reddit",
"version": "0.1.1"
},
"run": "2016-09-25T20:24:01.392Z",
"artifacts": [
"http://0-evidence-eventdata-crossref-org.library.alliant.edu/artifacts/domain-list/versions/1b2bcc1f6e77196b9b40be238675101c"
],
"input": {
"https://oauth.reddit.com/domain/www.jneurosci.org/new.json?sort=new&after=": {
"after-token": "t3_46qn9t",
"items": [
{
"url": "http://www.jneurosci.org/content/36/38/9933.abstract?etoc",
"id": "54fyzt",
"title": "Many supposed features of Alzheimers are artifacts of the mouse models used. The findings of over 3000 publications may need to be re-evaluated.",
"permalink": "/r/science/comments/54fyzt/many_supposed_features_of_alzheimers_are/",
"created_utc": 1474822792,
"subreddit": "science",
"kind": "t3"
},
« ... removed ... »
]
}
},
"processing": {
"items": [
{
"url": "http://www.jneurosci.org/content/36/38/9933.abstract?etoc",
"id": "54fyzt",
"title": "Many supposed features of Alzheimers are artifacts of the mouse models used. The findings of over 3000 publications may need to be re-evaluated.",
"permalink": "/r/science/comments/54fyzt/many_supposed_features_of_alzheimers_are/",
"created_utc": 1474822792,
"subreddit": "science",
"kind": "t3",
"seen-before-date": null,
"url-doi-match": {
"doi": "10.1523/jneurosci.1907-16.2016",
"version": null,
"query": "http://www.jneurosci.org/content/36/38/9933.abstract?etoc"
}
},
« ... removed ... »
],
"interested-items": [
{
"url": "http://www.jneurosci.org/content/36/38/9933.abstract?etoc",
"id": "54fyzt",
"title": "Many supposed features of Alzheimers are artifacts of the mouse models used. The findings of over 3000 publications may need to be re-evaluated.",
"permalink": "/r/science/comments/54fyzt/many_supposed_features_of_alzheimers_are/",
"created_utc": 1474822792,
"subreddit": "science",
"kind": "t3",
"seen-before-date": null,
"url-doi-match": {
"doi": "10.1523/jneurosci.1907-16.2016",
"version": null,
"query": "http://www.jneurosci.org/content/36/38/9933.abstract?etoc"
}
}
]
},
"deposits": [
{
"source_token": "a6c9d511-9239-4de8-a266-b013f5bd8764",
"uuid": "7cc890a6-ca68-4d7c-8853-fb243aa59279",
"action": "added",
"subj_id": "https://reddit.com/r/science/comments/54fyzt/many_supposed_features_of_alzheimers_are/",
"subj": {
"title": "Many supposed features of Alzheimers are artifacts of the mouse models used. The findings of over 3000 publications may need to be re-evaluated.",
"issued": "2016-09-25T16:59:52.000Z",
"pid": "https://reddit.com/r/science/comments/54fyzt/many_supposed_features_of_alzheimers_are/",
"URL": "https://reddit.com/r/science/comments/54fyzt/many_supposed_features_of_alzheimers_are/",
"type": "post"
},
"source_id": "reddit",
"relation_type_id": "discusses",
"obj_id": "https://doi.org/10.1523/jneurosci.1907-16.2016",
"occurred_at": "2016-09-25T16:59:52.000Z"
}
]
}
Property | Value |
---|---|
Name | |
Matches by | DOI |
Consumes Artifacts | domain-list , doi-prefix-list |
Produces relation types | discusses |
Freshness | continual |
Data Source | Twitter via Gnip |
Coverage | All DOIs, all known Landing Pages |
Relevant concepts | Pre-filtering |
Operated by | Crossref |
Agent | event-data-twitter-agent |
The Twitter source identifies Items that have been mentioned in Tweets. It matches Items using their Landing Page or DOI URL. Each event contains subject metadata including:
- tweet author ID
- tweet id
- tweet type (tweet or retweet)
- tweet publication date
When Items are matched using their Landing Page URL the URL Reversal Service is used.
Methodology
- On a periodic basis (approximately every 24 hours) the most recent version of the
domain-list
Artifact is retrieved. A set of Gnip PowerTrack rules are compiled and sent to Gnip. The list of rules specifies that Gnip should send all tweets that: - Mention a DOI URL
- Mention a URL that uses an article Landing Page domain
- Contain a DOI prefix, e.g.
10.5555
- The Twitter agent connects to Gnip PowerTrack.
- All Tweets that the agent recieves from PowerTrack have been sent because they match a rule. Gnip also extracts all URLs and follows them to their destination. All URLs extracted and sent along with the data for the Tweet.
- The Agent attempts to reverse every URL using the DOI Reversal Service. For every recognised DOI an Event is created.
Example Event
{
"obj_id": "https://doi.org/10.1038/nature19798",
"occurred_at": "2016-09-26T15:23:13.000Z",
"subj_id": "http://twitter.com/randomshandom/statuses/780427511956180992",
"total": 1,
"id": "35ec2a67-a765-4f26-9c37-7f9eb9a1c7a8",
"subj": {
"pid": "http://twitter.com/randomshandom/statuses/780427511956180992",
"author": {
"literal": "http://www.twitter.com/randomshandom"
},
"issued": "",
"URL": "http://twitter.com/randomshandom/statuses/780427511956180992",
"type": "tweet"
},
"message_action": "create",
"source_id": "twitter",
"timestamp": "2016-09-26T15:23:13.000Z",
"relation_type_id": "discusses"
}
Example Evidence Record
{
artifacts: [
"http://0-evidence-eventdata-crossref-org.library.alliant.edu/artifacts/domain-list/versions/1b2bcc1f6e77196b9b40be238675101c",
"http://0-evidence-eventdata-crossref-org.library.alliant.edu/artifacts/doi-prefix-list/versions/797e77470ed94b2f7b336adab4cbaf19"
],
input: {
tweet-url: "http://twitter.com/randomshandom/statuses/780427511956180992",
author: "http://www.twitter.com/randomshandom",
posted-time: "2016-09-26T15:23:13.000Z"
urls: [
"http://www.nature.com/nature/journal/vaop/ncurrent/full/nature19798.html"
],
matching-rules: [
"url_contains:"//www.nature.com/""
]
},
agent: {
name: "twitter",
version: "0.1.2"
},
working: {
matching-rules: [
"url_contains:"//www.nature.com/""
],
matching-dois: [
{
doi: "10.1038/nature19798",
version: null,
query: "http://www.nature.com/nature/journal/vaop/ncurrent/full/nature19798.html"
}
],
match-attempts: [
{
doi: "10.1038/nature19798",
version: null,
query: "http://www.nature.com/nature/journal/vaop/ncurrent/full/nature19798.html"
}
],
original-tweet-author: null,
original-tweet-url: "http://twitter.com/randomshandom/statuses/780427511956180992"
},
deposits: [
{
obj_id: "https://doi.org/10.1038/nature19798",
source_token: "45a1ef76-4f43-4cdc-9ba8-5a6ad01cc231",
occurred_at: "2016-09-26T15:23:13.000Z",
subj_id: "http://twitter.com/randomshandom/statuses/780427511956180992",
action: "add",
subj: {
author: {
literal: "http://www.twitter.com/randomshandom"
},
issued: "2016-09-26T15:23:13.000Z",
pid: "http://twitter.com/randomshandom/statuses/780427511956180992",
URL: "http://twitter.com/randomshandom/statuses/780427511956180992",
type: "tweet"
},
uuid: "35ec2a67-a765-4f26-9c37-7f9eb9a1c7a8",
source_id: "twitter",
relation_type_id: "discusses"
}
]
}
Wikipedia
Property | Value |
---|---|
Name | Wikipedia |
Matches by | DOI |
Consumes Artifacts | |
Produces relation types | references |
Freshness | continual |
Data Source | Wikipedia Recent Changes Stream, Wikipedia RESTBase |
Coverage | All Wikimedia properties. DOI URL references only. |
Relevant concepts | Matching by DOIs |
Operated by | Crossref |
Agent | event-data-wikipedia-agent |
Methodology
- The agent subscribes to the Recent Changes Stream using the wildcard "
*
". This includes all Wikimedia properties. - The Recent Changes Stream server sends the Agent every change to a page. Every change event includes the page title, the old and new revision and other data.
- For every change, the Agent fetches the HTML of the old and the new pages using the RESTBase API.
- For every URL in the old version, the Agent looks for those that are DOI URLs.
- For every URL in the new version, the Agent looks for those that are DOI URLs.
- DOIs are split into those that were added and those that were removed.
- For every DOI that was removed an Event with the
action: "delete"
is produced. - For every DOI that was added an Event with the
action: "add"
is produced.
- For every DOI that was removed an Event with the
Example Event
{
obj_id: "https://doi.org/10.1093/EMBOJ/20.15.4132",
occurred_at: "2016-09-25T23:58:58Z",
subj_id: "https://es.wikipedia.org/wiki/Se%C3%B1alizaci%C3%B3n_paracrina",
total: 1,
id: "d24e5449-7835-44f4-b7e6-289da4900cd0",
subj: {
pid: "https://es.wikipedia.org/wiki/Se%C3%B1alizaci%C3%B3n_paracrina",
title: "Señalización paracrina",
issued: "2016-09-25T23:58:58.000Z",
URL: "https://es.wikipedia.org/wiki/Se%C3%B1alizaci%C3%B3n_paracrina",
type: "entry-encyclopedia"
},
message_action: "create",
source_id: "wikipedia",
timestamp: "2016-09-26T00:03:52Z",
relation_type_id: "references"
}
Example Evidence Record
{
artifacts: [ ],
agent: {
name: "wikipedia",
version: "0.1.5"
},
input: {
stream-input: {
bot: false, user: "J3D3",
id: 133112611,
timestamp: 1474847938,
wiki: "eswiki",
revision: {
new: 93906371, old: 93391161
},
server_script_path: "/w",
minor: false,
server_url: "https://es.wikipedia.org",
server_name: "es.wikipedia.org",
length: {
new: 51542, old: 51700
},
title: "Señalización paracrina",
type: "edit",
namespace: 0,
comment: "Traduciendo otra pequeña parte"
},
old-revision-id: 93391161,
new-revision-id: 93906371,
old-body: "<!DOCTYPE html> <html prefix="dc: http://purl.org/dc/terms/ mw: http://mediawiki.org/rdf/" about="http://es.wikipedia.org/wiki/Special:Redirect/revision/93391161">« ... removed ... »</html>",
new-body: "<!DOCTYPE html> <html prefix="dc: http://purl.org/dc/terms/ mw: http://mediawiki.org/rdf/" about="http://es.wikipedia.org/wiki/Special:Redirect/revision/93906371">« ... removed ... »</html>"
},
processing: {
canonical: "https://es.wikipedia.org/wiki/Se%C3%B1alizaci%C3%B3n_paracrina",
dois-added: [
« ... removed ... »
{
action: "add",
doi: "10.1016/S1097-2765(01)00421-X",
event-id: "48de8c32-a901-4cc5-b911-544c959332f5"
}
],
dois-removed: [ ]
},
deposits: [
« ... removed ... »
{
obj_id: "https://doi.org/10.1016/s1097-2765(01)00421-x",
source_token: "36c35e23-8757-4a9d-aacf-345e9b7eb50d",
occurred_at: "2016-09-25T23:58:58.000Z",
subj_id: "https://es.wikipedia.org/wiki/Se%C3%B1alizaci%C3%B3n_paracrina",
action: "add",
subj: {
title: "Señalización paracrina",
issued: "2016-09-25T23:58:58.000Z",
pid: "https://es.wikipedia.org/wiki/Se%C3%B1alizaci%C3%B3n_paracrina",
URL: "https://es.wikipedia.org/wiki/Se%C3%B1alizaci%C3%B3n_paracrina",
type: "entry-encyclopedia"
},
uuid: "48de8c32-a901-4cc5-b911-544c959332f5",
source_id: "wikipedia",
relation_type_id: "references"
}
]
}
Failure modes
- The stream has no catch-up. If the agent is disconnected (which can happen from time to time), then edit events may be missed.
- The RESTBase API occasionally does not contain the edit mentioned in the change. Although the Agent will retry several times, if it repeatedly receives an error for retriving either the old or the new versions, no event will be returned. This will be recorded in the Evidence Record as an empty input.