NISO Altmetrics Working Group on Data Quality - Plum Analytics Code of Conduct Report

Posted: March 31, 2016 by Andrea Michalek

(Update: we now have an audit log as referenced in question Q5 below. The audit log can be found here. This post and the audit log have been updated as of October 15, 2018.)

We’ve enjoyed working with NISO on the NISO Altmetrics Initiative since it began in June of 2013. Their Working Group C has created a Draft of the Recommended Practice around altmetrics data quality and a code of conduct.

One of the recommendations was for each altmetrics aggregator to document its compliance with the proposed Code of Conduct. The answers for the PlumX Suite can be downloaded from our resources page and are included below.

Thanks to NISO and all of the members who have been working towards making this happen.

Responses

Q1: List all available data and metrics (providers and aggregators) and altmetric data providers from which data are collected (aggregators).

A: Plum Analytics has a suite of products called PlumX. A description of each PlumX product can be found on our product pages.

PlumX collects metrics data from many sources and groups them into 5 categories of metrics. Sources for each category are defined below:

Usage – Airiti, bepress, bit.ly, CABI, Dryad, DSpace, EBSCO, ePrints, Facebook, figshare, Forbes, GitHub, Institutional Repositories, OJS Journals, PLOS, PubMed Central, Pure, RePEc, SlideShare, SSRN, WorldCat More info.

Captures – EBSCO, GitHub, Goodreads, Mendeley, SlideShare, Vimeo, YouTube More info.

Mentions – Amazon, blogs, Facebook, GitHub, Goodreads, mainstream media, Reddit, SlideShare, SourceForge, StackExchange, Vimeo, YouTube, Wikipedia More info.

Social Media – Amazon, Facebook, Figshare, Goodreads, SourceForge, Reddit, Twitter, Vimeo, YouTube More info.

Citations – Airiti Academic Citation Index, CrossRef, PubMed Central, PubMed Central Europe, RePEc, Scopus (for mutual customers), SSRN, United States Patent and Trademark Office, Policy Citations, DynaMed Plus Topics, National Institute for Health and Care Excellence (NICE) Guidelines, PubMed Guidelines More info.

Q2: Provide a clear definition of each metric.

A: The PlumX Suite provides the raw usage, capture, mention, social media, or citation counts by source, e.g., the number of Wikipedia articles we have mined about a specific book or article. Raw counts can be viewed in the application, embedded in other sites through widgets, or exported. We strive to keep the naming of these metrics consistent with the terms used by the source we harvest them from, e.g., Mendeley “readers” and Delicious “bookmarks.” We calculate over 35 specific, granular metrics. A complete list and definition of each can be found here.

Q3: Describe the method(s) by which data are generated or collected and how data are maintained over time.

A: Data are collected via a range of methods, largely via data provider APIs, third-party provider APIs, FTP data transfers, OAI-PMH harvesting, web crawlers and RSS feeds.

The data are maintained over time as described in Q7 below.

Q4: Describe all known limitations of the data.

A: When PlumX begins utilizing a source of metrics, the amount of historic data from that source will vary.

Our text mining for calculating mentions of artifacts often requires that the artifact is mentioned by URL or another scholarly identifier to associate the mention with the artifact.

Links to original posts on third party blog and news sources may break or posts may be deleted.

Our match and merge algorithms, which combine and aggregate metrics from all the different online locations where an artifact is published (as described in Q6 below), depend upon a knowledge base of how to crosswalk different identifiers (like going from a DOI to a PubMed ID). If there are errors in this crosswalk data, it is possible to “over-merge” a record. Any examples of this can be reported to PlumXSupport@ebsco.com. Similarly, if there is not enough data to automatically merge two preprints from two different services, the PlumX staff may need to identify and merge them manually.

We license Twitter data in PlumX directly through Twitter/GNIP. We have a filtered view of all tweets based upon the domain names of the links in the tweets. Our historic Twitter data begins on January 1, 2011. We accommodate URL shorteners and have match and merge technology for combining tweets from multiple, separate URLs into a single view for a given artifact. However, if the original artifact is published at a domain that we do not yet track, the Plum Analytics team must first identify and add that domain, and Twitter mentions for it are counted only from the time it is added.
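The domain-based filtering described above can be illustrated with a minimal sketch. It is not the PlumX implementation: the function name `is_counted`, the `tracked_domains` mapping, and the example domains are all hypothetical, and URL-shortener resolution is left out entirely.

```python
from datetime import datetime, timezone
from urllib.parse import urlparse

# Start of the historic Twitter data, as stated in the answer above.
HISTORIC_START = datetime(2011, 1, 1, tzinfo=timezone.utc)


def is_counted(tweet_time, link, tracked_domains):
    """Return True if a tweet linking to `link` would be counted.

    `tracked_domains` is a hypothetical mapping from domain name to the
    date on which tracking of that domain began.
    """
    domain = urlparse(link).hostname
    tracked_since = tracked_domains.get(domain)
    if tracked_since is None:
        return False  # the domain is not tracked at all
    # Tweets count only from the later of the two start dates.
    return tweet_time >= max(tracked_since, HISTORIC_START)


tracked = {
    "journal.example.org": HISTORIC_START,
    "new.example.com": datetime(2016, 3, 1, tzinfo=timezone.utc),
}
old_domain = is_counted(
    datetime(2012, 5, 1, tzinfo=timezone.utc),
    "https://journal.example.org/article/1", tracked)
before_added = is_counted(
    datetime(2015, 5, 1, tzinfo=timezone.utc),
    "https://new.example.com/paper/2", tracked)
after_added = is_counted(
    datetime(2016, 6, 1, tzinfo=timezone.utc),
    "https://new.example.com/paper/2", tracked)
```

In this sketch the 2015 tweet is not counted, because it predates the assumed date on which `new.example.com` was added.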

Q5: Provide a documented audit trail of how and when data generation and collection methods change over time and list all known effects of these changes. Documentation should note whether changes were applied historically or only from change date forward.

A: ~~PlumX does not have an audit trail. Tracking of an audit trail for PlumX will begin in April 2016.~~

Update: The PlumX audit log can be found here: https://plumanalytics.com/learn/resources/plum-analytics-metrics-audit-log/

Q6: Describe how data are aggregated.

A: Each research output in PlumX is called an artifact. PlumX tracks over 40 different types of artifacts including books, book chapters, conference proceedings, journal articles, slide presentations, videos, etc. A full list of artifact types can be found here.

Online events about different versions of the same artifact (Publisher + Green Open Access + Preprint + Aggregated versions + A&I) are collected and aggregated based on algorithms that examine matching identifiers (such as DOI, ISBN, or URI) across versions.
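One common way to group versions by matching identifiers is a union-find pass over the records. The sketch below is only an illustration under that assumption; the record layout, the `merge_versions` function, and the sample identifiers are hypothetical and do not describe the actual PlumX algorithms.

```python
from collections import defaultdict


def merge_versions(records):
    """Group records that share at least one (identifier type, value) pair."""
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the group root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[max(root_i, root_j)] = min(root_i, root_j)

    first_seen = {}  # (identifier type, value) -> index of first record with it
    for index, record in enumerate(records):
        for key in record["ids"].items():
            if key in first_seen:
                union(index, first_seen[key])
            else:
                first_seen[key] = index

    groups = defaultdict(list)
    for index, record in enumerate(records):
        groups[find(index)].append(record["name"])
    return sorted(groups.values())


records = [
    {"name": "publisher", "ids": {"doi": "10.1000/example"}},
    {"name": "repository", "ids": {"doi": "10.1000/example",
                                   "uri": "https://repo.example.org/1"}},
    {"name": "preprint", "ids": {"uri": "https://repo.example.org/1"}},
    {"name": "unrelated", "ids": {"isbn": "9780000000002"}},
]
merged = merge_versions(records)
```

Here the publisher and preprint records share no identifier directly, yet they land in one group because the repository record links them. The same transitivity is why a single wrong identifier mapping can over-merge unrelated records.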

Usage, Capture, Social Media, and Mention metric counts are summed across all versions of each artifact.

Citation counts are not added together across different providers since this would result in double-counting the citations. Instead, we represent the cited by count for an artifact as the maximum value reported.
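The two aggregation rules (sum across versions, maximum across citation providers) can be sketched as follows. The function name, category keys, and numbers are made up for illustration and are not taken from PlumX.

```python
# Categories whose counts are summed across all versions of an artifact.
SUMMED_CATEGORIES = ("usage", "captures", "social_media", "mentions")


def aggregate_artifact(version_metrics, citation_counts):
    """Combine per-version metrics and per-provider citation counts."""
    totals = {category: 0 for category in SUMMED_CATEGORIES}
    for metrics in version_metrics:
        for category in SUMMED_CATEGORIES:
            totals[category] += metrics.get(category, 0)
    # Citations are not summed across providers, since that would
    # double-count; the maximum reported value is used instead.
    totals["citations"] = max(citation_counts.values(), default=0)
    return totals


versions = [
    {"usage": 120, "captures": 8, "social_media": 15, "mentions": 2},
    {"usage": 45, "captures": 3},
]
citations = {"CrossRef": 12, "Scopus": 17, "PubMed Central": 9}
result = aggregate_artifact(versions, citations)
```

With these sample values, usage is 165 (120 + 45) and the cited-by count is 17, the largest of the three provider counts, rather than their sum of 38.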

Within PlumX Dashboards and PlumX +Grants, metrics are aggregated based on researcher, grant, or any other customer-defined group hierarchy for comparisons at the aggregate level. Group hierarchies are defined by each client and might include grouping by school or department, by geography or by journal issue or volume.

Within PlumX Benchmarks, metrics are aggregated per institution, to allow comparisons between all institutions who have received NIH funding from 2012-2015. They are also aggregated at the NIH grant level, so that users can see the ROI on any NIH grant.

Q7: Detail how often data are updated.

A: Metrics data is kept up to date by re-harvesting at the frequency at which the source of the metrics updates. For some data providers, like Twitter, we license part of the Twitter firehose and get the metrics in real time. For other sources, we get daily updates of metrics. For example, we update usage data from EBSCO on a daily basis. Other providers only give us their data on a weekly or monthly basis.

Every 3-4 hours we refresh the entire PlumX index to have the most up to date metrics from all of our sources.

Q8: Describe how data can be accessed.

A: Plum Analytics provides access to the data via end-user interfaces, widgets that customers can integrate into their sites, free artifact widgets, or our open Application Programming Interface (API).

Article level widgets can be accessed by the following identifier types:

arxiv
cabi_abstract_id
doi
github_repo_id
isbn
nct_id
oclc
pmid
repo_url
slideshare_slideshow_id
sourceforge_repo_id
ssrn_id
us_patent_publication_id
vimeo_video_id
youtube_video_id

Author level widgets can be accessed by their PlumX user id. This user id can also be associated with publicly available author identifiers such as ORCID, with institution-specific unique author identifiers, or with both. Each customer of PlumX can decide if their PlumX user ids are public or private.

Group level widgets can be accessed by their PlumX group id. These group ids can be mapped to and associated with institution-specific group ids. Each customer of PlumX can decide if their PlumX group ids are public or private.

Grant level widgets can be accessed by their PlumX grant id. These grant ids can be mapped to and associated with institution-specific or funder-specific grant ids. Each customer of PlumX can decide if their PlumX grant ids are public or private.

Documentation about our widgets and API is available here.

Q9: Confirm that data provided to different data aggregators and users at the same time are identical and, if not, how and why they differ.

A: All Plum Analytics applications are based on the same set of data. Users access the same data across each tool, except where data is restricted according to access level. Access level varies across products, but all products require a subscription to access all data. Artifact-level PlumX pages are free and publicly accessible; they provide access to all our article-level metrics.

Q10: Confirm that all retrieval methods lead to the same data and, if not, how and why they differ.

A: We eat our own dog food at Plum Analytics, and the entire PlumX product suite is developed on top of the same API we expose to customers. Different retrieval methods will lead to the same data.

Q11: Describe the data-quality monitoring process.

A: Data quality is monitored in a variety of ways. Some sources of data (such as the set of blogs that PlumX covers) are hand-curated to focus on research-oriented blogs. This set of blogs is created in conjunction with customer driven requests, and metadata librarians on the Plum Analytics team facilitate this process.

Outlier analysis is done on our data to identify and investigate potential gaming or erroneous metrics.
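The answer does not say which outlier method is used. As one possible illustration only, the sketch below flags counts with a large modified z-score based on the median absolute deviation (MAD); the function name, the conventional 3.5 threshold, and the sample counts are assumptions, not PlumX details.

```python
from statistics import median


def flag_outliers(counts, threshold=3.5):
    """Return the keys whose counts are far from the median.

    Uses the modified z-score 0.6745 * |x - median| / MAD. This is an
    assumed technique, chosen because it is robust to the outliers it
    is trying to find.
    """
    values = list(counts.values())
    center = median(values)
    mad = median(abs(value - center) for value in values)
    if mad == 0:
        return []  # no spread around the median, so nothing can be scored
    return sorted(
        key for key, value in counts.items()
        if 0.6745 * abs(value - center) / mad > threshold
    )


# Hypothetical daily usage counts for six artifacts.
daily_usage = {"a": 10, "b": 12, "c": 11, "d": 9, "e": 13, "f": 400}
suspicious = flag_outliers(daily_usage)
```

A flagged artifact such as `f` is only a candidate for investigation; a high count may reflect genuine attention rather than gaming or an error.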

Our match and merge technology for bibliographic data prioritizes high quality sources like CrossRef over unedited sources, such as data that our crawlers harvest from the open web.

Each new source of metrics goes through a rigorous data quality assurance cycle before being added to PlumX.

Q12: Provide a process by which data can be independently verified (aggregators only).

A: See Q8 above – all PlumX Suite tools and services use the API documented here.

We consider all the metrics in all of the PlumX products to be fully auditable. For all third-party data providers like Facebook, Twitter, etc., metrics can be verified with those parties. For example, PlumX reports a Facebook likes count of 141 for this article: https://plu.mx/a/0ekxGOXPjTF3JmszhgRPI1yvt4F39t2sc54bCJ4BeHg/?display-tab=summary-content#LIKE_COUNT. A search of the Facebook API for that article by URL shows the same count.

In all cases, we clearly break down metric counts and indicate the provider. Our data is fully transparent. We do not calculate any scores on top of the raw results returned by each data partner.

Q13: Provide a process for reporting and correcting data or metrics that are suspected to be inaccurate.

A: Suspected inaccurate metrics or data can be reported to PlumXSupport@ebsco.com.