Diverse, Unbiased Data Sets are Essential for Shaping the AI of Our Future

Note: This is a preview of a talk I will be giving at the Silicon Valley Deep Learning group meetup this fall.

In the new information economy, data is more valuable than oil. It’s essential that as developers we’re aware of how data is being used in the tech products we build. Our book app, MyLibrarian, is powered by a diverse data set and is an example of curation by unbiased experts for better recommendations and more inclusive AI.

At In the Stacks, we’ve created an app for book discovery. In recent years, we’ve been building and scaling our product, and we’ve focused on creating a data set that will shape the way book lovers get their book recommendations. Instead of asking friends what to read (and risking staying inside a biased bubble of people who think alike), why not ask unbiased book experts, whose job it is to find books that will change readers’ lives? When librarians choose your next read, you’ll get a diverse list of book recommendations based on the descriptive keywords you choose and on how highly librarians rate each book.

We’ve worked on building this product as a part-time project for several years now, and have found real value in creating a data set, which we call the Librarian Brain, that generates fresh new book picks. Most online booksellers still rely on cross-selling recommendations, reviews from friends and family, or even paid reviews. Our product is shaped by a data set of tens of thousands of books librarians love, organized with a custom-designed taxonomy. Our data includes the best book industry keyword descriptors, to create an experience similar to asking a librarian at a library reference desk what to read. A larger book catalog is then run through our set to provide readers with five book picks. We provide outbound readers’ advisory: bringing that experience out of the library and online, in an app.
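For the developers in the audience, here is a rough sketch of how keyword-plus-rating picks like these could work. It is not our production code: the CuratedBook fields, the 0-to-5 rating scale, the placeholder titles, and the rule of ranking by keyword overlap first and librarian rating second are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class CuratedBook:
    """One entry in a hypothetical librarian-curated set."""
    title: str
    keywords: FrozenSet[str]   # descriptive keywords, lowercase
    librarian_rating: float    # assumed 0-5 scale, higher is better


def pick_books(books: Iterable[CuratedBook],
               chosen_keywords: Iterable[str],
               limit: int = 5) -> List[str]:
    """Return up to `limit` titles matching the reader's keywords.

    Assumed ranking: more matching keywords first, then higher
    librarian rating, then title as a stable tie-breaker.
    """
    wanted = {k.lower() for k in chosen_keywords}
    scored = []
    for book in books:
        overlap = len(wanted & book.keywords)
        if overlap == 0:
            continue  # skip books that match none of the keywords
        scored.append((overlap, book.librarian_rating, book.title))
    scored.sort(key=lambda s: (-s[0], -s[1], s[2]))
    return [title for _, _, title in scored[:limit]]


# Placeholder data, invented purely for this example.
SAMPLE = [
    CuratedBook("Book A", frozenset({"mystery", "historical"}), 4.5),
    CuratedBook("Book B", frozenset({"mystery"}), 4.9),
    CuratedBook("Book C", frozenset({"romance"}), 5.0),
    CuratedBook("Book D", frozenset({"mystery", "historical", "funny"}), 3.8),
    CuratedBook("Book E", frozenset({"historical"}), 4.0),
    CuratedBook("Book F", frozenset({"mystery", "historical"}), 4.7),
    CuratedBook("Book G", frozenset({"historical"}), 4.2),
]

if __name__ == "__main__":
    for title in pick_books(SAMPLE, ["Mystery", "historical"]):
        print(title)
```

Ranking on keyword match before rating is only one possible design choice; a weighted blend of the two would be just as reasonable a starting point.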

Creating data sets to offset the bias we’re seeing in ML/AI is essential. At least some human curation is beneficial, and librarians are ideal for this kind of work. Librarians are a progressive, inclusive group who all have Information Science master’s degrees, are technical, and are sensitive to bias and discrimination. Plus, we are detail-oriented data nerds. Librarians are champions of banned books, free speech and marginalized groups. We promote diverse books, and our recommendations help to offset the often unintentional partiality and bias inherent in some products, and raise awareness of the quality of the data fueling the products we’re coding and designing.

A Little More Background

For the last four years, our team has been searching the book review sphere for the best reviews and scraping the open source data from these reviews into a database. We named our database the Librarian Brain because the reviews all come from smart librarians who have graduate degrees.

We debuted an early version of the app, an internal alpha working locally, at a demo day in 2016 (view the program here: https://drive.google.com/open?id=0B-Ug9haIJqKmTFQ0SF9nYzE3QWc). Since then, we’ve spent time working on the business model, through programs offered by the NASDAQ Entrepreneurial Center SF, landed on a B2C business model, and have shown our demo to several online booksellers. Our data set currently runs through the open source book corpus from Internet Archive.

This past year, I’ve pitched to many investors and have built relationships with some who have become mentors. We’ve raised about $25K through crowdfunding and other sources, and are still bootstrapping.

I create all aspects of the product, build what I can, and oversee the execution of the remainder with our team of 8+. I created the data set taxonomy and the product UI and functionality, and my contractors improve on them. Timing is everything with this product, and I’m glad we have more of the business and marketing plan in place before the app is in the stores. If you’re interested in hearing more and viewing our pitch deck, get in touch at info@inthestacks.tv.

Michelle Z.

P.S. Here’s a paper about another human-curated training data set we worked on, “Applying Narrative to Metadata for News.”