OpenAPC - from Raw to Meaningful

OpenAPC is an initiative to release datasets on the fees institutions pay for open access journal articles, as a contribution to a transparent and reproducible OA movement. My colleague, Jeanette Hatherill (Scholarly Communications Librarian at uOttawa), wanted to release datasets on the APCs that the University of Ottawa Library has paid since 2010 to financially support researchers affiliated with uOttawa. I joined her initiative as a data analyst whose main responsibility was to combine the different files generated by previous scholarly communications librarians, clean them, and develop an internal report as well as a final version to submit to the INTACT OpenAPC initiative.

In order to do that, I followed the data science process proposed by Garrett Grolemund and Hadley Wickham in R for Data Science: Import - Tidy - Explore (Transform, Visualize, Model) - Communicate. For this project, I didn't need a model to develop or confirm a hypothesis, so I skipped that step. As Grolemund and Wickham point out, it isn't really a linear process, so I had to go back and forth in the wrangling stages (tidy-transform). I chose R with RStudio and OpenRefine as tools because I am familiar with them, but you could use Python, Ruby, Julia, or any other available language.

Goals

There were basically two goals I wanted to accomplish with this project: a (reproducible) internal report and data curation for OpenAPC. First, I was asked to create an internal report based on the datasets on OA fees paid directly by the Library, including the following information:

  • number of articles supported by year, by department, by faculty, and by publisher
  • number of unique authors
  • number of requests by unique authors
  • number of unique departments

At the same time, I wanted to develop a reproducible report with the R Markdown and knitr packages so that the annual report could be easily generated and reproduced in the future.

In addition, releasing datasets to OpenAPC requires a set of mandatory variables: organization, period (year), APC amount (euro), article DOI (doi), and whether the journal is hybrid or gold OA (is_hybrid). Some of these fields had been collected over the years, while others had not. Therefore, I had to curate the original datasets and derive the missing values from the information we did collect.

Import

There were 10 different files with various formats and variables.

Internal library datasets

From 2010 to 2015:

  • Author Fund - 2010-2011.xlsx
  • Author Fund - 2011-2012.xlsx
  • Author Fund - 2012-2013.xlsx
  • Author Fund - 2013-2014.xlsx
  • Author Fund - 2014-2015.xlsx

These Microsoft Excel files had either 3 sheets (Applications, Budget, and Processing) or 4 sheets (Applications, Processing, Budget, and Survey). They were connected via the unique Prop.# variable.

Since 2016, there has been one master file in Google Sheets with six sheets (2016/2017 Accepted, 2016/2017 Rejected/withdrawn, 2017/2018 Accepted, 2017/2018 Rejected/withdrawn, 2018/2019 Accepted, 2018/2019 Rejected/withdrawn).

Despite using different variable names, all of these files contained the following variables:

  • Name
  • Faculty
  • Department
  • Status
  • Article title
  • Co-authors
  • Journal title
  • Publisher
  • Fee charged
  • Currency
  • Fee paid
  • Dates

Datasets from publishers

The University of Ottawa has institutional agreements with several open access publishers, so we also received the following files from publishers:

  • BMC account statement 2015-2019.xlsx
  • Frontiers total articles 2015 to Feb 2019.xlsx
  • PeerJ basic memberships to date March 2019.docx
  • PLoS Pre-invoice details.xlsx

The first step was to import these datasets into R. You can import almost any data or file format into R using various packages, but since the files were either xlsx or Google Sheets, I used the readxl and googlesheets packages to open each sheet. For the 2010-2015 files, I combined the sheets within each year via the Prop.# variable; I kept the Google Sheets as they were, since all the information from 2016 to 2019 was already in one sheet per year.
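A minimal sketch of that import step, assuming the file and sheet names listed above (the Google Sheets title is hypothetical):

library(readxl)       # read .xlsx files
library(googlesheets) # read Google Sheets (the package current in 2019)
library(dplyr)

# Read two sheets from one annual workbook and join them on the
# shared Prop.# key
applications <- read_excel("Author Fund - 2014-2015.xlsx", sheet = "Applications")
processing   <- read_excel("Author Fund - 2014-2015.xlsx", sheet = "Processing")

fund_2014_2015 <- left_join(applications, processing, by = "Prop.#")

# The post-2016 master file lives in Google Sheets; register it by
# title and read one worksheet
master <- gs_title("OA Fund master file")  # hypothetical title
accepted_2016_2017 <- gs_read(master, ws = "2016/2017 Accepted")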

Tidy

There is a joke about data science: “Data Scientists spend up to 80% of time on data cleaning in preparation for data analysis, statistical modeling, & machine learning.” Although I am not a data scientist, I found this to be very true. Because these files were maintained by different librarians, slightly different rules had been applied to date formats, department names, authors' names with French accents, and so on, so my first task was to tidy them.

According to Hadley Wickham, the characteristics of tidy data are:

  • Each variable forms a column.
  • Each observation forms a row.
  • Each type of observational unit forms a table.

For example, multiple values were stored in one column: faculty and department information were both saved in the Faculty variable, leaving the Department variable empty. There were also multiple types of observational units stored in the same column: notes (characters mixed with the date type) in the Date field caused dates to be automatically transformed into Excel numeric dates (i.e., 42752 instead of 2017-01-17) when I imported them into R. I used as.Date() to convert the Excel numeric dates into YYYY-MM-DD date format.
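A one-line sketch of that conversion (Excel on Windows counts days from 1899-12-30):

as.Date(42752, origin = "1899-12-30")
#> [1] "2017-01-17"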

I also noticed that all the records from previous years had been copied and pasted into the following years, so I had to remove duplicate records. Since Prop.# was unique across the years, I identified whether any Prop.# appeared more than once using count() from the tidyverse packages. And since OpenAPC only tracks costs paid for journal articles, I also had to remove records for book chapters and anything else that was not a journal article.
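A sketch of those two steps, assuming a merged data frame called combined and a hypothetical publication-type column:

library(dplyr)

# Flag any Prop.# that appears more than once ...
combined %>% count(`Prop.#`) %>% filter(n > 1)

# ... then keep a single copy of each proposal
deduped <- distinct(combined, `Prop.#`, .keep_all = TRUE)

# OpenAPC tracks journal articles only; the column name is an assumption
articles <- filter(deduped, `Publication type` == "Journal article")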

Transform

Whenever I work with data, I am reminded that the whole process is really dynamic and iterative. To complete the tidying, I had to add a new variable, date_paid, with empty values to the pre-2015 data frames. Furthermore, there was no exact date of when the APC was paid before 2015, only Email decision, Documents sent to Librarian, and Deposits in institutional repository dates. So, using substring(), I retrieved just the year from the Email decision variable and stored it in a new variable called Period.
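A sketch of that step, assuming the Email decision values start with a four-digit year (e.g., "2013-05-21") and pre2015 stands in for one of those data frames:

# Add an empty date_paid column, then pull the year out of the
# decision date to create Period
pre2015$date_paid <- NA
pre2015$Period <- substring(pre2015$`Email decision`, 1, 4)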

I also renamed and standardized the variables so that I could combine all the data frames: Name, Faculty, Department, Status, Article title, Journal title, Publisher, OA type, Currency, Amount paid, Period, and Date Paid. Since all the annual data frames then shared these common fields, it was easy to merge them with the rbind() function.
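A sketch of the rename-and-stack step; the original column names on the right-hand side of rename() are hypothetical:

library(dplyr)

# Standardize one year's column names to the common set
fund_2014_2015 <- rename(fund_2014_2015,
                         `Article title` = `Title of Article`,
                         `Amount paid`   = `Fee paid`)

# Stack all the annual data frames into one
all_years <- rbind(fund_2010_2011, fund_2011_2012, fund_2012_2013,
                   fund_2013_2014, fund_2014_2015)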

Tidy (Again)

To clean up the values themselves, I used OpenRefine instead of R. You can do this task in R, but I personally find OpenRefine more intuitive and user-friendly. Using the rrefine package, I was able to move data between R and OpenRefine and back without any issues.
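A sketch of that round trip, assuming a local OpenRefine instance at its default address and a hypothetical project name:

library(rrefine)

# Push the merged data frame to OpenRefine ...
write.csv(all_years, "all_years.csv", row.names = FALSE)
refine_upload(file = "all_years.csv", project.name = "apc-cleanup")

# ... clean the values in the OpenRefine UI, then pull the result back
cleaned <- refine_export(project.name = "apc-cleanup")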

In the OpenRefine environment, I cleaned up Name, Faculty, Department, Journal Title, Publisher, Status, and Currency using one of its functions, Cluster. For instance, there were different name variations for one person: with/without French accents, or with/without a middle name. The same principle applied to Faculty, Department, Journal Title, and Publisher. I also merged different labels that were technically one category (Grad and Graduate student for status, USD and US for currency).
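If you preferred to stay in R for those category merges, dplyr::recode could express them; a sketch, assuming the target labels shown:

library(dplyr)

# Collapse variant labels into one canonical value per category
cleaned <- cleaned %>%
  mutate(Status   = recode(Status, "Grad" = "Graduate student"),
         Currency = recode(Currency, "US" = "USD"))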

Since 2015, the Library has supported only Gold OA, so no OA-type information has been collected since then. However, the earlier datasets did contain the OA status, with the categories TJ, OA, Gold, and Green. I converted TJ and Green to TRUE (the value for the is_hybrid field) and OA to FALSE.
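A sketch of that mapping; the post only states TJ/Green to TRUE and OA to FALSE, so mapping Gold to FALSE is my assumption:

library(dplyr)

# Map the legacy OA status onto OpenAPC's is_hybrid flag;
# anything unmatched becomes NA
cleaned <- cleaned %>%
  mutate(is_hybrid = case_when(
    `OA type` %in% c("TJ", "Green") ~ TRUE,
    `OA type` %in% c("OA", "Gold")  ~ FALSE  # Gold -> FALSE is an assumption
  ))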

Visualize and Communicate

The nice thing about any data science programming language is that once you develop your code or script, the analysis can be reproduced with one click as long as the data stays consistent. I wanted to apply this principle to the report as well. The rmarkdown package provides a framework for data science: you can combine your code, its results, and graphics, and you can generate documents in multiple formats such as PDF, Word, slideshows, HTML, and more. For me, this meant I could combine data visualization with code and explanation to communicate with other people. I shared my R Markdown code on my GitHub and hope to share the report in the uOttawa institutional repository soon.
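For illustration, a minimal R Markdown skeleton for such a report might look like this (the title, file names, and dataset are hypothetical):

---
title: "uOttawa OA Author Fund: Annual Report"
output: word_document
---

```{r setup, include=FALSE}
library(tidyverse)
apc <- read_csv("apc_clean.csv")  # hypothetical cleaned dataset
```

Articles supported, by publisher:

```{r by-publisher, echo=FALSE}
apc %>% count(Publisher, sort = TRUE)
```

Rendering it with rmarkdown::render("report.Rmd"), or the Knit button in RStudio, regenerates the whole report from the current data.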

Transform for OpenAPC

To release datasets to OpenAPC, I had to combine two sets of data: one collected internally and the other generated by publishers. I followed the same processes described above for the publisher datasets, except for PeerJ. That one was a Word file, which was not ideal, and it contained no information except authors' names. There was a View details section, but it wasn't clickable, so I couldn't trace the articles.

The internal datasets had no DOI information, only author, journal title, article title, and publisher. With insights from Jeanette, Jere Odell (Scholarly Communications Librarian at IUPUI), and Andy Smith (Developer at IUPUI), I used the rcrossref package to query the Crossref API for DOIs based on the information I had. Below is the simple code I developed based on the cr_works() function from rcrossref.

library(rcrossref)

# Query Crossref for the most likely DOI, given an article title (x)
# and a journal title (y)
Find_doi <- function(x, y) {
  if (is.na(y)) {
    y <- ""
  }
  # Search by article title, boosting matches on the journal title,
  # and keep only the top result
  article_data <- cr_works(query = x,
                           flq = c(`query.container-title` = y),
                           limit = 1)
  doi <- article_data$data$doi
  # cr_works() may return no DOI at all; fall back to an empty string
  if (identical(doi, character(0)) || is.null(doi)) {
    doi <- ""
  }
  return(doi)
}

# Apply Find_doi pairwise over the article and journal title columns
doi_list_all <- mapply(Find_doi, YourDataFrameName$`Article title`, YourDataFrameName$`Journal title`)
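mapply() returns a named character vector, so a small final step (using the same placeholder data frame name) attaches the results:

# Drop the names and store the DOIs as a new column
YourDataFrameName$doi <- unname(doi_list_all)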

Now I have the institution, period, amount, currency, date_paid, doi, and is_hybrid variables. I included the amount, currency, and date_paid fields because OpenAPC can convert them to euro, the official currency of the OpenAPC project.
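A sketch of assembling the submission file; the source column names follow the standardized variables above, and the output file name is hypothetical:

library(dplyr)
library(readr)

# Add the institution, pick the OpenAPC columns, and write the file
openapc <- YourDataFrameName %>%
  mutate(institution = "University of Ottawa") %>%
  select(institution, period = Period, amount = `Amount paid`,
         currency = Currency, date_paid = `Date Paid`, doi, is_hybrid)

write_csv(openapc, "uottawa_apc.csv")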

I hope we can share our datasets with the open access community soon.

What I learned

I now work with our new Scholarly Communications Librarian (Mathieu Cloutier), as Jeanette is on academic leave, to report to SPARC on the Campus OA Funds. Since the datasets are now tidy, this can be done easily. Along the way, I learned about new tools and possibilities for bibliometric work with R or Python. I am not yet sure how I will use them, but I believe there is relevant work already done, especially on open access, so I may start with a literature review.

Published 24 Jul 2019

Yoo Young Lee