Helping us judge a book by its cover: software help request


The Internet Archive would appreciate help from a volunteer programmer to create software that determines whether a book cover is useful to our users as a thumbnail, or whether we should use the title page instead. Many of our older books have cloth covers that are not useful, for instance:

But others are useful:

Going by age alone is not enough, because even 1923 cloth covers are sometimes good indicators of what the book is about (and are nice looking):

We would like a piece of code that can help us determine if the cover is useful or not to display as the thumbnail of a book. It does not have to be exact, but it would be useful if it knew when it didn’t have a good determination so we could run it by a person.
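One way to read the "know when it doesn't know" requirement is as a triage step on top of whatever classifier is used. The sketch below is only an illustration: it assumes some model already produces a probability that a cover is useful, and the names `Verdict`, `triage`, and the `review_below` threshold of 0.8 are hypothetical choices, not anything the Archive has specified.

```python
from typing import NamedTuple


class Verdict(NamedTuple):
    label: str          # "useful" or "not-useful"
    confidence: float   # between 0.5 and 1.0
    needs_review: bool  # True when a person should look at the cover


def triage(p_useful: float, review_below: float = 0.8) -> Verdict:
    """Turn a probability that a cover is useful into a label plus a review flag.

    p_useful is assumed to come from some upstream classifier (not shown).
    review_below is a hypothetical cut-off: verdicts less confident than
    this are flagged for a human instead of being trusted.
    """
    if not 0.0 <= p_useful <= 1.0:
        raise ValueError("p_useful must be between 0 and 1")
    label = "useful" if p_useful >= 0.5 else "not-useful"
    # Confidence is how far the probability leans toward the chosen label.
    confidence = max(p_useful, 1.0 - p_useful)
    return Verdict(label, confidence, confidence < review_below)


if __name__ == "__main__":
    for p in (0.95, 0.60, 0.10):
        print(p, triage(p))
```

Raising `review_below` sends more covers to a person and lowers the number of wrong automatic decisions; the right value would have to be chosen by looking at results on the example folders.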

To help any potential programmer volunteers, we have created folders of hundreds of examples in 3 categories: year 1923 books with not-very-useful covers, year 1923 books with useful covers, and year 2000 books with useful covers. The filename of each image is the Internet Archive item identifier, which can be used to find the full item: 1922forniaminera00bradrich.jpg would come from https://archive.org/details/1922forniaminera00bradrich. We would like a program (hopefully fast, small, and free/open source) that outputs useful or not-useful along with a confidence.
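The filename convention can be turned back into an item link with a few lines of code. This is a minimal sketch: `item_url` follows the mapping described above, while `result_line` and its tab-separated output format are hypothetical, since the post does not say how results should be reported.

```python
from pathlib import PurePosixPath


def item_url(image_filename: str) -> str:
    """Map an example image filename to its archive.org details page."""
    # The identifier is the filename without its final ".jpg" extension.
    identifier = PurePosixPath(image_filename).stem
    return f"https://archive.org/details/{identifier}"


def result_line(image_filename: str, label: str, confidence: float) -> str:
    """Format one result as identifier, label, confidence (hypothetical format)."""
    identifier = PurePosixPath(image_filename).stem
    return f"{identifier}\t{label}\t{confidence:.2f}"


if __name__ == "__main__":
    print(item_url("1922forniaminera00bradrich.jpg"))
    print(result_line("1922forniaminera00bradrich.jpg", "useful", 0.93))
```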

Interested in helping? Brenton at archive.org is a good point of contact on this project.   Thank you for considering this. We can use the help. You can also use the comments on this post for any questions.

FYI: To create these datasets, I ran these command lines, and then by hand pulled some of the 1923 covers into the “useful” folder.

bash-3.2$ ia search "date:1923 AND mediatype:texts AND NOT collection:opensource AND NOT collection:universallibrary AND scanningcenter:*" --itemlist --sort=downloads\ desc | head -1000 | parallel --will-cite -j10 "curl -Ls https://archive.org/download/{}/page/cover_.jpg?fail=soon.jpg\&cnt=0 >> ~/tmp/cloth/{}.jpg"

bash-3.2$ ia search "date:2000 AND mediatype:texts AND scanningcenter:cebu" --itemlist --sort=downloads\ desc | head -1000 | parallel --will-cite -j10 "curl -Ls https://archive.org/download/{}/page/cover_.jpg?fail=soon.jpg\&cnt=0 >> ~/tmp/picture/{}.jpg"
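For anyone who rebuilds the folders locally with the commands above, a small loader like the following could pair each image with its label. It is a sketch under assumptions: the `cloth` and `picture` folder names come from the commands, but treating the hand-sorted covers as a sibling folder named `useful` is a guess, and the published example folders may be laid out differently. Empty files are skipped because the `>>` redirect creates a file even when curl returns nothing.

```python
from pathlib import Path

# Folder name -> label. "cloth" and "picture" appear in the commands above;
# "useful" as a sibling folder name is an assumption.
FOLDER_LABELS = {
    "cloth": "not-useful",
    "useful": "useful",
    "picture": "useful",
}


def labeled_examples(root):
    """Yield (identifier, label, path) for every non-empty .jpg under root."""
    for folder, label in FOLDER_LABELS.items():
        directory = Path(root) / folder
        if not directory.is_dir():
            continue  # tolerate folders that were never created
        for path in sorted(directory.glob("*.jpg")):
            if path.stat().st_size == 0:
                continue  # download produced no data
            yield path.stem, label, path


if __name__ == "__main__":
    for identifier, label, _ in labeled_examples(Path.home() / "tmp"):
        print(identifier, label)
```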