Newsfeed
Agent Source token | c1bfb47c-39b8-4224-bb18-96edf85e3f7b |
Consumes Artifacts | newsfeed-list |
Subject coverage | Blog posts mentioned in RSS and Atom feeds on newsfeed-list |
Object coverage | All DOIs, landing page URLs, plain-text DOIs. |
Data contributor | Curators of RSS and Atom feed aggregators. Authors of blog posts. |
Data origin | RSS and Atom feeds, and the blog posts they point to. |
Freshness | Every few hours. |
Identifies | Linked DOIs, unlinked DOIs, Landing Page URLs |
License | Creative commons CC0 1.0 Universal (CC0 1.0) |
Looks in | HTML of webpages (mostly blog posts) linked to from RSS and Atom feeds. |
Name | Newsfeed |
Operated by | Crossref |
Produces Evidence Records | Yes |
Produces relation types | discusses |
Source ID | newsfeed |
Updates or deletions | None expected |
What it is
Links from blogs and other content with a newsfeed.
What it does
The Agent has a list of RSS feeds. It monitors each one for links to blog posts. If a blog post links to registered content, or mentions DOIs in the text, they are extracted into Events.
Where data comes from
The newsfeed-list
Artifact is consulted. On a regular basis the Agent retrieves the Artifact, then follows the link to every blog post or page mentioned. Data sources:
- The
newsfeed-list
Artifact, curated by Crossref. - The content of each newsfeed. Each newsfeed may be operated by a different organization.
- The content of the individual blog posts.
Example RSS feeds include:
- ScienceSeeker blog aggregator
- ScienceBlogging blog aggregator
Example Event
Content to follow.
Methodology
On a regular basis (approximately every hour) the Newsfeed Agent starts a scan. Each scan:
- It retrieves the most recent version of the
newsfeed-list
Artifact. - It scans over every RSS feed.
- It passes the URL of every blog post to the Percolator.
Evidence Record
- Includes batches of
landing-page-url
type observations.
Edits / deletion
- Events may be edited if they are found to be faulty, e.g. non-existent DOIs
Quirks
- Links to blog posts are followed. If a summary of the blog post is included in the RSS feed, it is not consulted.
Failure modes
- RSS feeds may be taken offline.
- RSS feeds may contain incomplete data.
- RSS feeds may update too quickly for the Agent to keep up.
- Publisher sites may block the Event Data Bot collecting landing pages.
- Publisher sites may prevent the Event Data Bot collecting landing pages with robots.txt
- Blog sites may block the Event Data Bot collecting landing pages.
- Blog sites may prevent the Event Data Bot collecting landing pages with robots.txt