Making a book search engine with Python and Elasticsearch


Ever wondered what Nietzsche said about dragons? Or what Spinoza said about beauty? No? Well, anyway, this post is about how to make a book search engine that you can load with a set of books and then mine for information.

Assuming you have text versions of some books, why not just grep for the information you want? You could, and it would work fine for simple queries, but grep doesn’t have much potential for richer searches, augmenting results with contextual information, built-in scrolling, or an API you can wrap an application around.

For the book search engine we’ll use Elasticsearch to index the books and serve the queries, and Python to write the data-load and query tools. Elasticsearch is an open-source search technology based on Lucene that provides an easy-to-use REST API for indexing and searching, trivial installation, scalability, and a configuration that works out of the box. It also integrates with a rich set of products for data analysis, reporting, graphing, and so on, though I won’t get into them here.

Since Elasticsearch is Java-based and the tools are written in Python, all of this runs on any platform.

Install Elasticsearch

Download Elasticsearch from https://www.elastic.co/downloads/elasticsearch/ and follow the installation steps. Java is required to run it. On Windows that meant unzipping the file, making sure the %JAVA_HOME% environment variable was set, and running bin\elasticsearch.bat from a command window. The default settings start a server in the foreground, listening on local port 9200.


Install Python Elasticsearch client library

Note: The examples in this article assume you have Python 3 installed.

There is a low-level Python library called elasticsearch-py, and a higher-level client called elasticsearch-dsl. The difference is explained here. The higher-level client becomes very useful when you want composable queries, which provide more flexibility. I’ll be using the low-level client, which is installed with:
    pip install elasticsearch

Download some books

The goal is to index some basic UTF-8 text books. Project Gutenberg is a good source of free text books; The Internet Archive may be an even better one. Pick some authors or topics that interest you and download the books in .txt format. I downloaded some books by Spinoza and Nietzsche for a mini philosophy library.

Once downloaded, open each file in a text editor and remove any header and footer material you don’t want to index (such as the contents or index pages), so that what’s left is just the body of the book. Gutenberg books come with a header and a lot of license information at the end. If you plan to use the books for anything other than private use, use The Internet Archive or another source, or make sure you understand the license requirements before deleting anything.
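If you’d rather not edit by hand, the Gutenberg boilerplate can be stripped with a short script. This is only a sketch: it assumes the usual `*** START OF ... ***` / `*** END OF ... ***` marker lines, which vary a little between books, so check the output.

```python
import re

# Gutenberg texts wrap the body in marker lines such as:
#   *** START OF THIS PROJECT GUTENBERG EBOOK ETHICS ***
#   *** END OF THIS PROJECT GUTENBERG EBOOK ETHICS ***
# (assumed format; older books vary, so verify the result)
START_RE = re.compile(r'\*\*\* ?START OF.*\*\*\*')
END_RE = re.compile(r'\*\*\* ?END OF.*\*\*\*')

def strip_gutenberg(lines):
    """Return only the lines between the START and END markers."""
    body = []
    in_body = False
    for line in lines:
        if not in_body and START_RE.search(line):
            in_body = True   # the body starts on the next line
        elif in_body and END_RE.search(line):
            break            # stop before the license footer
        elif in_body:
            body.append(line)
    return body
```

For example, `strip_gutenberg(open('ethics.txt', encoding='utf-8'))` returns just the body lines, ready to write back out.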

Data-indexing tool

This Python program indexes a text file with Elasticsearch. It takes the file path, author, and title as arguments, and uses the author as the Elasticsearch index name and the title as the document type. Note that Elasticsearch requires index names to be lower case, and that document types were removed in Elasticsearch 7, so this code targets older versions. E.g. to load Spinoza’s Ethics:

loadbook.py .\ethics.txt spinoza ethics

import elasticsearch
import sys
import os

def usage():
    sys.exit('Usage: python ' + sys.argv[0] + ' filename author title')

# check command line    
if len(sys.argv) != 4:
    print(str(len(sys.argv)) + ' args provided')
    usage()

# get filename, author, title, from command line
fileName = sys.argv[1]
author = sys.argv[2]
title = sys.argv[3]

# check file exists
if not os.path.isfile(fileName):
    print('File not found: ' + fileName)
    usage()

es = elasticsearch.Elasticsearch()  # use default of localhost, port 9200

# open file
book = open(fileName, encoding='utf-8')
lineNum = 0  # line number, including empty lines

# read and index each line, using the line number as the document id.
# Empty lines are indexed too, so that the query tool's lookups of
# neighbouring lines by id always succeed.
try:
    for lineText in book:
        lineNum += 1
        es.index(index=author, doc_type=title, id=lineNum, body={
            'lineNum': lineNum,
            'text': lineText
        })
except UnicodeDecodeError as e:
    print('Decode error at line ' + str(lineNum + 1))
    print(e)
finally:
    book.close()

print(es.get(index=author, doc_type=title, id=lineNum))
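The script trusts that the author and title you pass in are already valid index names. Since Elasticsearch rejects upper-case index names and characters such as spaces or slashes, a small helper (hypothetical, not part of the script above) could normalize the arguments first:

```python
import re

def to_index_name(name):
    """Lower-case a name and replace characters that Elasticsearch
    rejects in index names (spaces, slashes, etc.) with hyphens."""
    return re.sub(r'[^a-z0-9_-]+', '-', name.lower()).strip('-')
```

For example, `to_index_name('The Ethics')` gives `'the-ethics'`.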


I’ve noticed that some books fail to load because they contain bytes that aren’t valid UTF-8. If that happens you have a choice of skipping the book, or finding the characters that caused the problem and removing them.

Query tool

The query tool takes the author name and a search term as arguments. It prints the number of matches and, for each hit, the title, line number, score, and the matching text together with the lines before and after it. By default it returns the first 10 results; an optional third argument sets how many results to print.

import elasticsearch
import sys
import json

def usage():
    sys.exit('Usage: ' + sys.argv[0] + ' <author> <query> [num results]')

# check command line
numArgs = len(sys.argv)
if numArgs < 3:
    print(str(len(sys.argv)) + ' args provided')
    usage()
author = sys.argv[1]
query = sys.argv[2]
if numArgs == 4:
    numResults = int(sys.argv[3])
else:
    numResults = 10
    
es = elasticsearch.Elasticsearch()  # use default of localhost, port 9200

# single word query:
#results = es.search(index=author, q=query, size=numResults)
# phrase match query:
results = es.search(
    index=author,
    body={
        "size": numResults,
        "query": {"match_phrase": {"text": query}}})
# print(json.dumps(results, sort_keys=False, indent=2, separators=(',', ': ')))
hitCount = results['hits']['total']  # in Elasticsearch 7+ this is a dict: use ['total']['value']
if hitCount > 0:
    # the next might be needed if text is UTF-8 and there are mapping errors
    #utf8stdout = open(1, 'w', encoding='utf-8', closefd=False)
    if hitCount == 1:
        print(str(hitCount) + ' result')
    else:
        print(str(hitCount) + ' results')

    for hit in results['hits']['hits']:
        text = hit['_source']['text']
        lineNum = hit['_source']['lineNum']
        score = hit['_score']
        title = hit['_type']
        # fetch the surrounding lines for context, guarding both ends of the book
        previousText = ''
        if lineNum > 1:
            previousLine = es.get(index=author, doc_type=title, id=lineNum - 1)
            previousText = previousLine['_source']['text']
        try:
            nextLine = es.get(index=author, doc_type=title, id=lineNum + 1)
            nextText = nextLine['_source']['text']
        except elasticsearch.NotFoundError:
            nextText = ''
        #print(str(lineNum) + ' (' + str(score) + '): ' + text, file=utf8stdout)
        print(title + ': ' + str(lineNum) + ' (' + str(score) + '):')
        print(previousText + text + nextText)
else:
    print('No results')

E.g. to see what Spinoza said about beauty:

booksearch.py spinoza beauty


If you ever get a Unicode error during a search, it means the book you loaded had some bad characters in it. You can uncomment the ‘utf8stdout’ lines in the code above to print directly as UTF-8, or look at the line where it failed, edit the book, and replace the offending characters (what looks like a blank space is often the bad character if you view the file in an editor such as Notepad++; to identify it properly, view the file in hex).

The first version of the es.search() call (commented out) matches only single words. It was replaced by a phrase match query so you can search for phrases as well. At this point it’s probably time to start using the higher-level elasticsearch-dsl library.
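For reference, here is a sketch of the two request bodies as plain dicts. The phrase form requires the terms to appear adjacent and in order; the word form matches any of the terms. The field name `text` matches the loader above; everything else is standard query DSL.

```python
def word_query(field, word, size=10):
    """Build a plain match body: terms are analyzed and OR'ed together."""
    return {
        "size": size,
        "query": {"match": {field: word}},
    }

def phrase_query(field, phrase, size=10):
    """Build a match_phrase body: terms must appear in order with no gaps."""
    return {
        "size": size,
        "query": {"match_phrase": {field: phrase}},
    }
```

For example, `es.search(index='spinoza', body=phrase_query('text', 'free man'))`.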

Book extract tool

If one of the search hits looks interesting and you want to read more about the topic, here’s a book extract tool which takes the author, title, and beginning and (optionally) ending line numbers as arguments. Since in this model the line number corresponds to the document id, all it has to do is fetch the range of ids matching the arguments. It also strips the end-of-line character from each line (perhaps this should have been done during the indexing phase instead, but never mind):

import elasticsearch
import sys

def usage():
    sys.exit('Usage: ' + sys.argv[0] + ' <author> <title> <begin line> [end line]')
# check command line
numArgs = len(sys.argv)
if numArgs < 4:
    print(str(len(sys.argv)) + ' args provided')
    usage()
author = sys.argv[1]
title = sys.argv[2]
beginLine = int(sys.argv[3])
if numArgs == 5:
    endLine = int(sys.argv[4])
else:
    # if no end line is provided, print a single line
    endLine = beginLine

es = elasticsearch.Elasticsearch()  # use default of localhost, port 9200

# get the range of lines (inclusive) from the elasticsearch index
for lineNum in range(beginLine, endLine + 1):
    line = es.get(index=author, doc_type=title, id=lineNum)
    # print the line, stripping the trailing newline since print adds one
    print(line['_source']['text'].rstrip())

E.g. to see lines 8360 to 8370 of Spinoza’s Ethics:

getlines.py spinoza ethics 8360 8370

Next Steps

This is a very simple example of using Elasticsearch. Simple enhancements would include:

– Adding a stemmer token filter so you don’t need wildcard searches to match “dragon” and “dragons”, for example.

– Adding scrolling support (e.g. https://gist.github.com/drorata/146ce50807d16fd4a6aa).

– Putting this behind a web interface (e.g. a Django app).

– Supporting more freeform queries, using the elasticsearch-dsl library.
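For the stemming idea, the index would need to be created with an analyzer that includes the built-in English stemmer token filter before any documents are loaded. Here is a sketch of the settings; the analyzer name `english_text` is my own, while the `stemmer` filter type is standard:

```python
# Settings for an index whose analyzer stems English words, so that
# "dragons" is indexed as "dragon" and both match the same query.
# Pass this as the body of es.indices.create, then map the 'text'
# field to the 'english_text' analyzer.
stemming_settings = {
    "settings": {
        "analysis": {
            "filter": {
                "english_stemmer": {"type": "stemmer", "language": "english"}
            },
            "analyzer": {
                "english_text": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "english_stemmer"],
                }
            },
        }
    }
}
```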
