(Hidden) research data analysis processes

In traditional scholarly communication, a reader sees only the final product of research, usually in article format: Title, Abstract, Keywords, Introduction, Literature Review, Methodology, Analysis and Results, Discussion, Conclusion, and References. Research is an iterative process, but oftentimes it is hard to capture that process and share it with others. The open scholarship (open science) movement encourages researchers to make their research open, not just the final product but also the entire process and data, valuing transparency and accessibility. Open source tools like the OSF (Open Science Framework) and Project Jupyter enable them to do that.

I had a great opportunity to work with Ayoung Yoon, Assistant Professor in the Department of Library and Information Science at the Indiana University School of Informatics and Computing (IUPUI), on her research project on what trust factors affect data reuse practices (publication: Factors of trust in data reuse). In the final publication, we simply described which statistical method was used (Partial Least Squares Path Modeling, PLS-PM) and with which software and package (the plspm R package, version 0.4.9), but there was a (hidden) process of how we arrived there that generally goes unmentioned in an article.

When I read scholarly articles, I am always in awe of the research methodologies and analyses, especially how researchers arrive at their conclusions from their research questions. It looks simple yet beautiful. However, based on my own research experience, it is not that simple and beautiful. In this blog post, to make a small contribution to the open science movement, I would like to share the trial and error behind the data analysis in our work.

Preparation for data analysis

The study used a survey administered through a SurveyMonkey institutional account, and the collected data were exported in CSV and Excel formats. I imported the original CSV dataset into RStudio as a data frame and then started tidying it with data validation (checking compliance with the research criteria; in our case, data reuse experience) and data coding (renaming columns; in our case, from the question "The producers of the data are the experts in the domain of this research" to ABILITY01).
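To illustrate that preparation step, here is a minimal R sketch. The file name (survey_export.csv), the screening column (reuse_experience), and the raw question column (Q12_producers_experts) are hypothetical placeholders, not our actual survey export.

```r
library(dplyr)

# Import the raw SurveyMonkey export (file name is hypothetical)
survey_raw <- read.csv("survey_export.csv", stringsAsFactors = FALSE)

survey <- survey_raw %>%
  # Data validation: keep only respondents who meet the research
  # criteria, here prior data reuse experience (column name assumed)
  filter(reuse_experience == "Yes") %>%
  # Data coding: rename the verbose question text to a short code,
  # e.g., "The producers of the data are the experts in the domain
  # of this research" becomes ABILITY01 (raw column name assumed)
  rename(ABILITY01 = Q12_producers_experts)
```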

Why R and RStudio?

R is a powerful, open source language for statistical analysis with a strong community behind it. Mplus is a great tool for latent variable model analyses, including exploratory factor analysis (EFA), structural equation modeling (SEM), item response theory (IRT) analysis, and many more, but it is very expensive. I knew that the COOL RDC at the uOttawa Library has a copy of Mplus, but I learned that access to the COOL RDC is very restricted. In R, there are packages such as lavaan, developed by Yves Rosseel, and psych (aimed mostly at researchers in psychology, but it also provides functions for factor analysis), developed by William Revelle. In addition, RStudio is an easy, user-friendly integrated development environment for R in which I can save all my code, data, workflow, command history, etc. as a project and reload them whenever I want. Like Python and Jupyter Notebook, R and RStudio enable researchers to make their research open, reproducible, and easily shareable.

From CFA (Confirmatory Factor Analysis) to PLS-PM (Partial Least Squares Path Modeling)

Since we wanted to identify underlying latent factors from a set of observed variables (e.g., ABILITY01, ABILITY02, ABILITY03, ETHICS01, ETHICS02, ETHICS03, COMMIT01, COMMIT02, RAPPORT01, and RAPPORT02 for Data Producer), the first method we considered was confirmatory factor analysis (CFA) using the lavaan package. One of the assumptions of CFA, though, is that the variables are normally distributed, as the method is based on their covariances. We examined each variable, and our data were not normally distributed, so we could either transform those variables, drop them from our model, or simply choose another method.
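For context, this is roughly what that first attempt looked like in R. The model syntax below is an illustrative reconstruction, assuming the survey data frame from the sketch above and a simple two-factor structure; it is not the exact model we fit.

```r
library(lavaan)

# Illustrative CFA: observed items loading on latent factors
# (this factor structure is an assumption, not our published model)
cfa_model <- '
  ability =~ ABILITY01 + ABILITY02 + ABILITY03
  ethics  =~ ETHICS01  + ETHICS02  + ETHICS03
'
cfa_fit <- cfa(cfa_model, data = survey)
summary(cfa_fit, fit.measures = TRUE, standardized = TRUE)

# One way to examine normality item by item, e.g., a Shapiro-Wilk test
shapiro.test(survey$ABILITY01)
```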

The second method we considered was the Rasch model (one of the best-known item response theory (IRT) models) using the ltm package. Since our datasets were skewed, we transformed the responses from continuous data to categorical data coded as "0" (not agree) and "1" (agree). With this approach, we were able to identify how each item related to the TRUST factor, but we could not model the relationships between the observed variables and multiple latent variables.
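A minimal sketch of that second attempt, again under assumed names; the item subset and the dichotomization threshold (agreement above the midpoint of a 5-point scale) are illustrative choices, not necessarily the ones we used.

```r
library(ltm)

# Illustrative item subset (assumed to be numeric Likert responses)
items <- survey[, c("ABILITY01", "ABILITY02", "ABILITY03")]

# Dichotomize: responses above the midpoint of a 5-point scale
# become 1 ("agree"), everything else 0 ("not agree")
items_binary <- as.data.frame(lapply(items, function(x) as.integer(x > 3)))

# Fit a Rasch model: one latent trait, equal discrimination across items
rasch_fit <- rasch(items_binary)
summary(rasch_fit)
```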

Finally, we chose partial least squares path modeling (PLS-PM) with the plspm package as our method. This approach involves two components: a measurement model that constructs each latent variable from its observed indicators, and a structural model that captures the relationships between the constructs. In addition, the PLS-PM approach does not require normally distributed data or a large sample size, so it was a good match for pursuing our research question with the data we had collected.
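To make that final choice concrete, here is a minimal plspm sketch. The two-predictor inner model (hypothetical ABILITY and ETHICS constructs predicting TRUST) and the TRUST indicator names are assumptions for illustration; the actual model specification is in the published article.

```r
library(plspm)  # we used version 0.4.9

# Inner (structural) model: a lower-triangular path matrix
# (this particular structure is an assumption for illustration)
ABILITY <- c(0, 0, 0)
ETHICS  <- c(0, 0, 0)
TRUST   <- c(1, 1, 0)
path_matrix <- rbind(ABILITY, ETHICS, TRUST)
colnames(path_matrix) <- rownames(path_matrix)

# Outer (measurement) model: which data columns indicate which
# construct (the TRUST item names here are hypothetical)
blocks <- list(
  c("ABILITY01", "ABILITY02", "ABILITY03"),
  c("ETHICS01", "ETHICS02", "ETHICS03"),
  c("TRUST01", "TRUST02")
)

# All constructs measured reflectively (mode "A")
pls_fit <- plspm(survey, path_matrix, blocks, modes = rep("A", 3))
summary(pls_fit)
```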

Same datasets but different approaches

I attended a session delivered by Thomas Lindsay and Alicia Hofelich Mohr from the University of Minnesota at the IASSIST & CARTO 2018 conference, and I was fascinated by the study that they introduced: Many analysts, one data set: Making transparent how variations in analytic choices affect results, which shows how different analysts address the same research question with the same dataset using different analytic approaches.

Although we chose the PLS-PM approach, there may be other approaches that could answer our original research question, and I would like to keep learning and improving. I am afraid of my mistakes and of public embarrassment, but I learn from mistakes. I want to contribute to the open scholarship movement; when I think back, I have always benefited from a community willing to open its research and share its knowledge so that I can become a better librarian and researcher.

Published 22 Nov 2019
