Some basic notes on installing ElasticSearch (2.1.1) and Sense and starting to use it to search text.

Although Elastic’s documentation is great, I sometimes struggled to see the actual results of the queries, so I wrote this little introduction post.

Installing ElasticSearch (2.1.1)

First download and unpack ElasticSearch:

$ cd ~/

$ wget https://download.elastic.co/elasticsearch/elasticsearch/elasticsearch-2.1.1.tar.gz

$ tar -xvzf elasticsearch-2.1.1.tar.gz

$ rm elasticsearch-2.1.1.tar.gz

$ mv elasticsearch-2.1.1/ elastic-elasticsearch-2.1.1/

$ cd elastic-elasticsearch-2.1.1

Next we need to install the license plugin.

$ cd ~/elastic-elasticsearch-2.1.1/

$ sudo bin/plugin install license
-> Installing license...
Plugins directory [/home/vagrant/elasticsearch-2.1.1/plugins] does not exist. Creating...
Trying https://download.elastic.co/elasticsearch/release/org/elasticsearch/plugin/license/2.1.1/license-2.1.1.zip ...
Downloading .......DONE
Verifying https://download.elastic.co/elasticsearch/release/org/elasticsearch/plugin/license/2.1.1/license-2.1.1.zip checksums if available ...
Downloading .DONE
Installed license into /home/vagrant/elastic-elasticsearch-2.1.1/plugins/license

Install Kibana, the visualisation toolkit, which we need to run the UI for ElasticSearch (i.e., Sense).

$ cd ~/

$ wget https://download.elastic.co/kibana/kibana/kibana-4.3.1-linux-x64.tar.gz

$ tar -xvzf kibana-4.3.1-linux-x64.tar.gz

$ rm kibana-4.3.1-linux-x64.tar.gz

$ mv kibana-4.3.1-linux-x64/ elastic-kibana-4.3.1/

$ cd elastic-kibana-4.3.1/
$ ./bin/kibana plugin --install elastic/sense
Installing sense
Attempting to extract from https://download.elastic.co/elastic/sense/sense-latest.tar.gz
Downloading 318236 bytes....................
Extraction complete
Optimizing and caching browser bundles...

To summarise: we have installed ElasticSearch, Kibana and the Sense UI plugin.

Move back to your home directory using cd ~/ to finally start using ElasticSearch.

Starting the Service(s)

To start ElasticSearch we can run:

# Use the -d flag to run it as a background daemon.
$ ./elastic-elasticsearch-2.1.1/bin/elasticsearch -d 

To see what the process id is:

$ ps aux | grep elastic
vagrant   2134 60.5  8.6 1886424 177556 pts/0  Sl   15:57   0:04 /usr/bin/java -Xms256m -Xmx1g -Djava.awt.headless=true -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -XX:+DisableExplicitGC -Dfile.encoding=UTF-8 -Djna.nosys=true -Des.path.home=/home/vagrant/elasticsearch-2.1.1 -cp /home/vagrant/elasticsearch-2.1.1/lib/elasticsearch-2.1.1.jar:/home/vagrant/elasticsearch-2.1.1/lib/* org.elasticsearch.bootstrap.Elasticsearch start -d
vagrant   2187  0.0  0.0  10432   668 pts/0    S+   15:57   0:00 grep elastic

To start Kibana we can run:

# To start Kibana, suppress its output and run it as a background process
$ nohup ./elastic-kibana-4.3.1/bin/kibana &

We can now talk to ElasticSearch through a web interface at:

  • Sense on Kibana - http://127.0.0.1:5601/app/sense

ElasticSearch via Command-Line

Since our database is now empty, let’s start putting some information in. Before we do so, here are a few pointers for getting information about the database itself via the command line.

You can run these from outside the VM as well, since ElasticSearch listens on its standard port, 9200.

$ curl http://127.0.0.1:9200/_cat/health?v
# output:
epoch      timestamp cluster       status node.total node.data shards pri relo init unassign pending_tasks
1444212818 10:13:38  elasticsearch green           1         1      0   0    0    0        0             0

Nodes

$ curl http://127.0.0.1:9200/_cat/nodes?v
# output:
host                     ip        heap.percent ram.percent load node.role master name
vagrant-ubuntu-trusty-64 10.0.2.15            3          15 0.08 d         *      Fagin

Indices

$ curl http://127.0.0.1:9200/_cat/indices?v
# output:
health status index pri rep docs.count docs.deleted store.size pri.store.size

Creating an index named news

$ curl -XPUT http://127.0.0.1:9200/news?pretty
# output:
{
  "acknowledged" : true
}
$ curl http://127.0.0.1:9200/_cat/indices?v
# output:
health status index    pri rep docs.count docs.deleted store.size pri.store.size
yellow open   news     5   1          0            0       575b           575b

Similarly you can delete an index. Delete the news index again here; we will recreate it with an explicit mapping in a moment, and creating an index that already exists returns an error.

$ curl -XDELETE http://127.0.0.1:9200/news

Indices, Types and Documents

Before moving on, it’s a good idea to get a better understanding of the three main concepts in ElasticSearch:

  • Index: An index is a collection of documents that have somewhat similar characteristics. For example, you can have an index for customer data, another index for a product catalog, and yet another index for order data. An index is identified by a name (that must be all lowercase) and this name is used to refer to the index when performing indexing, search, update, and delete operations against the documents in it. In a single cluster, you can define as many indexes as you want.
    • Must read: Index Management. Read through the pages and pay close attention to analyzers and mapping. Just like in relational databases, you need to think carefully about which analyzers and mappings you use for your data. It’s a bit like deciding what type of data you want to store and how it should be searchable/sortable as text (sort by full sentence, or bag-of-words, etc.).
  • Type: Within an index, you can define one or more types. A type is a logical category/partition of your index whose semantics is completely up to you. In general, a type is defined for documents that have a set of common fields. For example, let’s assume you run a blogging platform and store all your data in a single index. In this index, you may define a type for user data, another type for blog data, and yet another type for comments data.
  • Document: A document is a basic unit of information that can be indexed. For example, you can have a document for a single customer, another document for a single product, and yet another for a single order. A document is expressed in JSON (JavaScript Object Notation), a ubiquitous internet data interchange format. Within an index/type, you can store as many documents as you want. Note that although a document physically resides in an index, a document actually must be indexed/assigned to a type inside an index (see the small example after this list).
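
To make the hierarchy concrete, here is a quick, throwaway example (the blog index, users type and _id 1 are made up purely for illustration). A document lives at /{index}/{type}/{id}:

# Hypothetical example: index = blog, type = users, document _id = 1
$ curl -XPUT http://127.0.0.1:9200/blog/users/1 -d '{"name": "Alice", "joined": "2015-12-01"}'

# Fetch the same document back via its /index/type/id path
$ curl http://127.0.0.1:9200/blog/users/1?pretty

# Remove the throwaway index again
$ curl -XDELETE http://127.0.0.1:9200/blog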

Sense Interface

For this we will use the Sense Kibana app. Go to http://127.0.0.1:5601/app/sense to open it. Since it runs inside Kibana within the VM, we can have it connect locally to http://localhost:9200.

Creating an Index with Mapping

To create a new index we can use PUT /<INDEX-NAME>. Here we also supply a mapping for the articles type that indexes webTitle in several ways (with the standard, whitespace and english analyzers, plus a not_analyzed raw version); we will need these later for sorting.

PUT /news
{
	"mappings": {
		"articles": {
			"properties": {
				"webTitle": {
					"type":     "string",
					"analyzer": "standard",
					"fields": {
						"whitespace": {
							"type":     "string",
							"analyzer": "whitespace"
						},
						"english": {
							"type":     "string",
							"analyzer": "english"
						},
						"raw": {
							"type":  "string",
							"index": "not_analyzed"
						}
					}
				}
			}
		}
	}
}

Adding Documents

We’ll now manually create a document in the news index. We give it the type articles and no _id, so one will be generated automatically. To do this, use the following ‘query’.

POST /news/articles
{
    "type": "liveblog",
    "sectionId": "politics",
    "webTitle": "Tory minister attacks BBC for its coverage of Elliott Johnson story - Politics live",
    "webPublicationDate": "2015-12-10T15:08:42Z",
    "id": "politics/blog/live/2015/dec/10/cameron-fails-to-win-support-of-polish-pm-for-his-plan-to-change-eu-benefit-rules-politics-live",
    "webUrl": "http://www.theguardian.com/politics/blog/live/2015/dec/10/cameron-fails-to-win-support-of-polish-pm-for-his-plan-to-change-eu-benefit-rules-politics-live",
    "apiUrl": "http://content.guardianapis.com/politics/blog/live/2015/dec/10/cameron-fails-to-win-support-of-polish-pm-for-his-plan-to-change-eu-benefit-rules-politics-live",
    "sectionName": "Politics"
}

The output will look something like:

{
   "_index": "news",
   "_type": "articles",
   "_id": "AVGMeZ9hbQL4kpTUdIf8",
   "_version": 1,
   "created": true
}

Adding an article without specifying an _id uses POST. If you want to insert one with a specific _id, use a PUT request.

PUT /{INDEX}/{TYPE}/{ID}
{
    "field1" : "text",
    "field2" : "text"
}
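
As a concrete, hypothetical example (the _id example-id-1 is made up), we can index a small document under an _id of our own choosing and remove it again straight away so it does not interfere with the article counts below. Repeating the PUT with the same _id replaces the document and increments its _version.

PUT /news/articles/example-id-1
{
    "webTitle": "An article indexed with an explicit _id"
}

DELETE /news/articles/example-id-1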

Let’s add some more articles so we have something to work with from here on.

POST /news/articles
{
    "type": "liveblog",
    "sectionId": "business",
    "webTitle": "VW emissions scandal: misconduct, process failure and tolerance of rule-breaking blamed – as it happened",
    "webPublicationDate": "2015-12-10T14:59:44Z",
    "id": "business/live/2015/dec/10/volkswagen-vw-grilling-emissions-scandal-bank-of-england-business-live",
    "webUrl": "http://www.theguardian.com/business/live/2015/dec/10/volkswagen-vw-grilling-emissions-scandal-bank-of-england-business-live",
    "apiUrl": "http://content.guardianapis.com/business/live/2015/dec/10/volkswagen-vw-grilling-emissions-scandal-bank-of-england-business-live",
    "sectionName": "Business"
}

POST /news/articles 
{
    "type": "liveblog",
    "sectionId": "environment",
    "webTitle": "Paris talks: negotiators await new draft climate deal - live blog",
    "webPublicationDate": "2015-12-10T14:58:25Z",
    "id": "environment/blog/live/2015/dec/10/paris-climate-talks-cop21-draft-deal-negotiations-continue-live-blog",
    "webUrl": "http://www.theguardian.com/environment/blog/live/2015/dec/10/paris-climate-talks-cop21-draft-deal-negotiations-continue-live-blog",
    "apiUrl": "http://content.guardianapis.com/environment/blog/live/2015/dec/10/paris-climate-talks-cop21-draft-deal-negotiations-continue-live-blog",
    "sectionName": "Environment"
}

POST /news/articles
{
    "type": "article",
    "sectionId": "world",
    "webTitle": "Japanese PM's website hacked by whaling protesters",
    "webPublicationDate": "2015-12-10T14:40:35Z",
    "id": "world/2015/dec/10/japanese-pms-website-hacked-whaling-protesters",
    "webUrl": "http://www.theguardian.com/world/2015/dec/10/japanese-pms-website-hacked-whaling-protesters",
    "apiUrl": "http://content.guardianapis.com/world/2015/dec/10/japanese-pms-website-hacked-whaling-protesters",
    "sectionName": "World news"
}

POST /news/articles
{
    "type": "article",
    "sectionId": "politics",
    "webTitle": "Senior minister attacks Newsnight over Tory bullying claims coverage",
    "webPublicationDate": "2015-12-10T14:36:54Z",
    "id": "politics/2015/dec/10/nick-boles-minister-attacks-newsnight-tory-bullying-claims-lord-feldman",
    "webUrl": "http://www.theguardian.com/politics/2015/dec/10/nick-boles-minister-attacks-newsnight-tory-bullying-claims-lord-feldman",
    "apiUrl": "http://content.guardianapis.com/politics/2015/dec/10/nick-boles-minister-attacks-newsnight-tory-bullying-claims-lord-feldman",
    "sectionName": "Politics"
}

So we have now added 5 articles. Notice that 2 of them contain the word Tory.

Searching and Retrieving Documents

Retrieve an article by its _id:

GET /{INDEX}/{TYPE}/{ID}

Usually, though, you will want to search the database for matching articles. To return all documents you can ‘search empty’, i.e. run a search without a query. Any of the following will return everything for now…

// Everything
GET /_all/_search

// Everything within news
GET /news/_search

// Everything within news/articles
GET /news/articles/_search

Because we don’t have any other indices or types, these all return the same documents for now.

Before looking at actual searching, let’s take a look at how to ‘filter’ the fields that come back for each document. To retrieve only specific fields we list exactly the fields we want. You can use both a POST and a GET request for this.

POST /news/articles/_search
{
    "fields" : [ "webTitle", "sectionName", "webPublicationDate" ]
}

GET /news/articles/_search?fields=webTitle,sectionName,webPublicationDate

So we know how to retrieve all documents in an index/type and how to specify which fields we want back. Let’s look at searching! (The ElasticSearch documentation covers the search API and the query DSL in much more detail.)

We can search using queries with both POST and GET.

POST /news/articles/_search 
{
    "fields" : ["webTitle", "sectionName", "webPublicationDate"],
    "query": { 
        "match": { "webTitle" : "Tory" } 
    }
}

GET /news/articles/_search?q=webTitle:Tory&fields=webTitle,sectionName,webPublicationDate

To search on multiple words, but not as an exact match, we can again use the match query (the individual words are OR-ed by default).

POST /news/articles/_search 
{
    "fields" : ["webTitle", "sectionName", "webPublicationDate"],
    "query": { 
        "match": { "webTitle" : "Tory Paris" } 
    }
}

GET /news/articles/_search?q=webTitle:(Tory+Paris)&fields=webTitle,sectionName,webPublicationDate

Exact matching of a string uses match_phrase. Note that a phrase has to match the words in that exact order, so this particular query should come back empty for our five articles:

POST /news/articles/_search 
{
    "fields" : ["webTitle", "sectionName", "webPublicationDate"],
    "query": { 
        "match_phrase": { "webTitle" : "Tory newsnight" } 
    }
}

Boolean matching

POST /news/articles/_search 
{
    "fields" : ["webTitle", "sectionName", "webPublicationDate"],
    "query": { 
        "bool": {
            "must": [
                { "match": { "webTitle" : "Tory" } },
                { "match": { "webTitle" : "newsnight" } }
            ]
        }
    }
}

There are more options: should, must_not.
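
As a rough sketch of how these combine (using the articles we added above): must_not excludes matching documents, while should only boosts the score of optional matches. With our five articles the query below should return just the Newsnight piece, since the other title containing minister belongs to a liveblog.

POST /news/articles/_search
{
    "fields" : ["webTitle", "sectionName", "webPublicationDate"],
    "query": {
        "bool": {
            "must":     [ { "match": { "webTitle" : "minister" } } ],
            "must_not": [ { "match": { "type" : "liveblog" } } ],
            "should":   [ { "match": { "webTitle" : "Newsnight" } } ]
        }
    }
}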

Sorting Documents

Sorting documents is partly very easy and partly not so much. Let’s start with the simple stuff (see the ElasticSearch documentation on sorting for more detail). Basic sorting can be done using both POST and GET.

POST /news/articles/_search
{
    "fields" : ["webTitle", "sectionName", "webPublicationDate"],
    "sort" : [
        { "webPublicationDate" : { "order" : "asc" } }
    ]
}

GET /news/articles/_search?sort=webPublicationDate:asc&fields=webTitle,sectionName,webPublicationDate

This sorts on webPublicationDate. Sorting on anything that is not a string is quite straightforward. Now let’s try sorting on webTitle instead of webPublicationDate.

‘Normal’ (analyzed) string fields are treated as a bag of words, which has consequences for sorting:

  • https://www.elastic.co/guide/en/elasticsearch/guide/current/multi-fields.html
  • https://www.elastic.co/guide/en/elasticsearch/guide/current/sorting-collations.html

Sorting on a string field again supports both POST and GET requests.

POST /news/articles/_search
{
    "fields" : ["webTitle", "sectionName", "webPublicationDate"],
    "sort" : [
        "webTitle",
        "_score"
    ]
}

GET /news/articles/_search?sort=webTitle,_score&fields=webTitle,sectionName,webPublicationDate

We can already see this go wrong: it sorts on individual words like and and attacks. The sentence is split into a bag of words and the minimum or maximum of those terms is used as the sort value for the document. Ideally we would like to sort on the full title whilst still retaining the flexibility of searching on the individual words. This all comes down to the mapping of the documents.
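
To see that bag of words for yourself, the _analyze API shows how a title is tokenised. A quick illustrative query (analysing a fragment of the first title against the webTitle field):

GET /news/_analyze?field=webTitle&text=Tory%20minister%20attacks%20BBC

This should come back with the lowercased tokens tory, minister, attacks and bbc; it is the smallest or largest of those per document that ends up being used as the sort key.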

Mapping

You can see the mapping of all content with:

GET /news/_mapping

GET /_all/_mapping

This will give you something like:

{
   "news": {
      "mappings": {
         "articles": {
            "properties": {
               ...
               "webTitle": {
                  "type": "string"
               },
               ...

So an index (news) has mappings, and each type (articles) has its own mapping per field. To get the mapping for a single field directly you can use the following query:

GET /news/_mapping/articles/field/webTitle

giving

"webTitle": {
    "type": "string",
    "analyzer": "standard",
    "fields": {
      "english": {
        "type": "string",
        "analyzer": "english"
      },
      "raw": {
        "type": "string",
        "index": "not_analyzed"
      },
      "whitespace": {
        "type": "string",
        "analyzer": "whitespace"
      }
    }
  }

We can use webTitle.raw for the sorting, for example. Let’s see how this works…

Sorting continued

To sort by the full title as one string, descending (note the webTitle.raw):

POST /news/articles/_search
{
    "fields" : ["webTitle", "sectionName", "webPublicationDate"],
    "query": { 
        "bool": {
            "must": [
                { "match": { "webTitle" : "Tory" } }
            ]
        }
    },
    "sort" : [
        "_score",
        { "webTitle.raw" : { "order" : "desc" } }
    ]
}

Aggregation

Aggregating counts/stats using aggs.

  • https://www.elastic.co/guide/en/elasticsearch/reference/1.4/_executing_aggregations.html

This works like a COUNT(*) … GROUP BY in SQL:

POST /news/articles/_search
{
    "size": 0,
    "aggs": {
        "group_by_section": {
            "terms": {
                "field": "sectionName"
            }
        }
    }
}
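
One thing to keep in mind: sectionName is an analyzed string, so the terms aggregation builds its buckets from the individual lowercased tokens, meaning a value like World news is counted under world and under news separately. A field whose values are already single tokens, such as sectionId, groups cleanly; for multi-word fields you would aggregate on a not_analyzed sub-field instead (like the webTitle.raw we defined earlier).

POST /news/articles/_search
{
    "size": 0,
    "aggs": {
        "group_by_section_id": {
            "terms": {
                "field": "sectionId"
            }
        }
    }
}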