Datasets, Library, and Artificial Intelligence

I attended a public lecture and workshop on Machine Learning in the Presence of Class Imbalance, organized by the Fields Centre of Quantitative Modelling and Analysis at Carleton University on June 20 and 21, 2019. I have been interested in this topic since 2016, when Google DeepMind’s AlphaGo defeated Lee Sedol at the board game Go. At the time, I was truly shocked: Lee Sedol is one of the great Go masters, and I knew how difficult the game is. Since then, I’ve tried to learn about the topic by taking courses (e.g., the Deep Learning Specialization) and participating in workshops (e.g., Stanford University Libraries AI Studio Experiments), and by considering how these new technologies could be applied to my daily work.

My basic understanding of AI (Artificial Intelligence), ML (Machine Learning), and DL (Deep Learning) is that:

  • AI: the broad concept of computers performing tasks that normally require human intelligence, such as decision-making and translation.
  • ML: a subset of AI in which computers learn from data rather than from explicit instructions.
  • DL: a subset of ML in which computers process information through layered artificial neural networks, loosely modelled on neurons in the human brain, allowing more fine-tuned results.

I’m not sure whether I’ve understood these correctly, or how they could immediately improve my current work at the Library, but one thing I’ve been thinking about and would like to explore further is the (digital) datasets (numeric, text, images, audio, etc.) used to train AI, ML, and DL algorithms: how researchers obtain their train, dev, and test sets, and how other libraries apply AI to their work.
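To make the idea of train/dev/test sets concrete, here is a minimal pure-Python sketch of one common way to split a dataset. The function name and the 80/10/10 ratios are my own assumptions for illustration, not anything prescribed at the workshop:

```python
import random

def train_dev_test_split(records, dev_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle records reproducibly, then carve off test and dev subsets."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_dev = int(len(shuffled) * dev_frac)
    test = shuffled[:n_test]
    dev = shuffled[n_test:n_test + n_dev]
    train = shuffled[n_test + n_dev:]
    return train, dev, test

# Example: 1,000 records split roughly 80/10/10
train, dev, test = train_dev_test_split(list(range(1000)))
print(len(train), len(dev), len(test))  # 800 100 100
```

In practice, researchers often rely on library helpers (e.g., scikit-learn’s train_test_split) rather than rolling their own, but the underlying idea is the same: hold data out before training so the dev and test sets never influence the model.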

Public datasets

Presenters at the workshop came from different fields - academia, government, and the private sector. One thing I noticed they had in common was that most of them used public datasets to train/dev/test their models. The keynote presenter, Benjamin Fung, applied one of the ICWSM datasets to train/dev/test his StyloMatric, which infers an author’s characteristics from the writing style of tweets (paper and source code). Robin Grosset from MindBridge Analytics Inc. used one of the Kaggle datasets to train/dev/test his model for fraud detection and prediction.
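The workshop’s theme - class imbalance - is easy to see in a task like fraud detection, where fraudulent records are rare. One common mitigation is to weight each class inversely to its frequency, so the rare class counts more during training. The sketch below is my own illustration of that heuristic (the same one scikit-learn calls “balanced” class weights), not the presenters’ actual code:

```python
from collections import Counter

def class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * counts[c]) for c in counts}

# Example: a toy fraud-detection label set, 90% legitimate
labels = ["legit"] * 90 + ["fraud"] * 10
weights = class_weights(labels)
print(weights)  # the rare "fraud" class gets weight 5.0, "legit" about 0.56
```

These weights can then be passed to a loss function or classifier so that misclassifying a rare fraud case costs roughly as much, in aggregate, as misclassifying the many legitimate ones.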

There are many posts and websites that tell us where to find public datasets (e.g., The Best Public Datasets for Machine Learning and Data Science and Awesome Public Datasets). In addition, Google Dataset Search makes it easier to search for data across repositories, personal websites, and more. I learned that the Canada Science and Technology Museum releases open data about its collections and operations in a machine-readable way, and that the UCI Machine Learning Repository is one of the oldest data sources for the ML community.

I looked into how these datasets are shared and accessed. There are platforms like Kaggle, software development platforms like GitHub, repositories like UCI, portals like the Open Government Portal, databases like ImageNet, and websites like the Stanford Dogs Dataset. These made me think about how academic libraries could support AI researchers: platforms (or repositories) for preservation and access, data curation for high-quality datasets, and digitization to make collections available in a machine-readable way.

Library

Recently, I’ve thought a lot about “digital” and “open” at the library - digital preservation, open access, digital scholarship, digital strategy, digital technologies, open source, open infrastructure, open education, and digitization. I believe that “digital” affects all aspects of scholarship, from discovery and integration to application and teaching, and that it enables scholars to practise openly [1]. Regarding the first dimension of scholarship defined by Boyer - discovery - the authors of [1] argue that digital technologies make it possible for researchers to generate, analyze, and share massive amounts of data, so that datasets have become an essential part of scholarly communication.

Data can be defined in a variety of ways, but from my layman’s view, data is the ground evidence for research, so it could be anything: numeric, text, images, audio, video, etc. Academic libraries play an important role in storing, organizing, preserving, and providing access to the scholarly record [2], and I believe we hold great resources in various formats through digital collections, special collections, institutional repositories, and more. Some institutions have already been pioneers in preserving and providing access to born-digital and digitized materials. To support researchers in this era of advanced digital technologies, I think it is important for academic libraries to provide our resources digitally as well, so that researchers can easily explore them to discover new findings and incorporate materials into their teaching.

AI at the Library

There are library communities exploring AI - AI4LAM, conferences such as Fantastic Futures, and many workshops like the SUL AI Studio Experiments, whose online webinar I watched in January 2019. They applied AI techniques to their digital and special collections to improve metadata, generate automatic transcripts from audio recordings, analyze images with IIIF, and enhance search features. In addition, in an interview, Chris Erdmann (Chief Strategist for Research Collaboration at North Carolina State University) and Karim Boughida (Dean of University Libraries at the University of Rhode Island) mentioned using AI-powered chat to interact with students and the community. There was also a competition, sponsored by New York University’s Coleridge Initiative, to develop algorithms that discover relationships between datasets, researchers, publications, research methods, and fields.

I strongly believe that there are many possibilities for AI in the library, and that this could have a great impact on open and digital scholarship in general. Since our critical role is to store, organize, preserve, and provide access to our resources, we cannot avoid digitization, born-digital materials, and massive amounts of data if we are to better support researchers whose practices have been reshaped by digital technologies. Without “digital”, it is hard to explore new or upcoming technologies like AI in the library, and without our “digital” practice, it is hard to innovate or improve our current workflows. These days especially, if one critical component of the digital ecosystem is missing, it is hard to accomplish the rest.

Conclusion

I really enjoyed the workshop even though it wasn’t directly related to the library setting. Some of the techniques and tools the presenters applied would be of great use to the library, such as privacy-preserving data augmentation in medical text processing. At my current institution, I talked with our Scholarly Communications Librarian (Jeanette Hatherill) about possible AI techniques for improving the metadata of the thesis collections in our institutional repository. These days, I also learn a lot from, and work with, the Head of Archives and Special Collections (Marina Bokovay) on preservation and curation, and some digital preservation librarians have already discussed using AI to improve or automate parts of the digital preservation and access workflow. At the same time, I would like to invest more time in biased data (yes, data can be biased too) and privacy in the AI era, but that is for next time.

References

Published 28 Jun 2019

Yoo Young Lee on Twitter