• ElasticSearch: The Future of Fulltext

  • Who am I?

    Mat Brown

    @0utoftime

    http://github.com/outoftime

  • What is ElasticSearch?

    A standalone distributed HTTP-based fulltext search index

  • Standalone

    Runs as its own instance (or cluster)

  • Distributed

    Designed to distribute both data and processing over many machines

  • HTTP-based

    All interaction is via a RESTful, JSON API

  • Fulltext search index

    Efficiently retrieve documents whose text content matches user input

  • What problems does it solve?

    • Fulltext search
    • Drill-down search -- what are my options for refinement?
    • Maybe some other problems???
  • Let's have a look around

  • Install that business

    $ wget https://github.com/downloads/elasticsearch/elasticsearch/elasticsearch-0.17.6.tar.gz
    $ tar -xvf elasticsearch-0.17.6.tar.gz
    $ cd elasticsearch-0.17.6
    $ ./bin/elasticsearch -f
  • Index a document

    > post('/myapp/post', { :title => 'My First Post', :blog_id => 1 })
    {
              "ok" => true,
          "_index" => "myapp",
           "_type" => "post",
             "_id" => "lVRL_5FERsWW2tf_oX1WJw",
        "_version" => 1
    }
    
  • Awesome Fact: ElasticSearch is JSON-based

    JSON in, JSON out
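
    The console examples in these slides call `post`, `put`, and `get` helpers that are never defined. A minimal sketch of what such helpers might look like, using only Ruby's standard library and assuming a node listening on localhost:9200 (the default HTTP port). The names `request_class`, `build_request`, and `es_request` are hypothetical, not part of any client library.

```ruby
require 'json'
require 'net/http'

# Map a verb symbol to the matching Net::HTTP request class.
def request_class(verb)
  case verb
  when :get  then Net::HTTP::Get
  when :post then Net::HTTP::Post
  when :put  then Net::HTTP::Put
  else raise ArgumentError, "unsupported verb: #{verb}"
  end
end

# Build (but do not send) a request with an optional JSON body.
def build_request(verb, path, body = nil)
  request = request_class(verb).new(path)
  unless body.nil?
    request['Content-Type'] = 'application/json'
    request.body = JSON.generate(body)
  end
  request
end

# Send the request and parse the JSON response into a Ruby hash.
# The host and port defaults are assumptions about a local setup.
def es_request(verb, path, body = nil, host: 'localhost', port: 9200)
  response = Net::HTTP.start(host, port) do |http|
    http.request(build_request(verb, path, body))
  end
  JSON.parse(response.body)
end

def get(path)
  es_request(:get, path)
end

def post(path, body)
  es_request(:post, path, body)
end

def put(path, body)
  es_request(:put, path, body)
end
```

    With these loaded in irb, `post('/myapp/post', { :title => 'My First Post', :blog_id => 1 })` sends the hash as JSON and returns the parsed JSON reply.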

  • ElasticSearch resource structure

    You can have as many indexes as you like.

  • Create a mapping

    > put('/myapp/post/_mapping', {
        :post => {
          :properties => {
            :title => { :type => 'string', :index => 'analyzed' }
          }
        }
      })
  • Awesome Fact: Runtime Configuration

    Almost everything in ElasticSearch is configurable at runtime, via the JSON API.

  • Awesome Fact: Dynamic Schema

    • ElasticSearch has a concept of document structure
    • But it's easy to change the structure when you need to
  • Let's search this business

    > post('/myapp/post/_search', { :query => { :term => { :blog_id => 1 }}})
    {
      "hits" => {
        "total" => 1,
        "max_score" => 1.0,
        "hits" => [
          [0] {
            "_index" => "myapp",
            "_type" => "post",
            "_id" => "sHUhJNq6R7-8atpaXdzcfw",
            "_score" => 1.0,
            "_source" => {
              "title" => "My First Post",
              "blog_id" => 1
            }
          }
        ]
      }
    }
  • Awesome Fact: The search DSL

    JSON search DSL exposes the underlying Lucene search API in tremendous detail.

  • Awesome Fact: The _source

    ElasticSearch retains the original JSON document that you sent it, and returns it with search results (by default).

  • Wait, this is a search engine?

    > get('/myapp/post/sHUhJNq6R7-8atpaXdzcfw')
    {
          "_index" => "myapp",
           "_type" => "post",
             "_id" => "sHUhJNq6R7-8atpaXdzcfw",
        "_version" => 1,
          "exists" => true,
         "_source" => {
              "title" => "My First Post",
            "blog_id" => 1
        }
    }
  • What is ElasticSearch?

    A standalone distributed HTTP-based fulltext search index.

    An HTTP-based distributed document-oriented data store with a search-based query system.

  • More Awesome

  • Distribution

    • Multiple shards per index, multiple replicas per shard
    • Talk to any instance you want
    • Sharding and replication completely behind the scenes
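
    One way to picture why any instance can serve any request: the shard that owns a document can be computed from its ID alone. A rough sketch, assuming routing works as "hash the ID, then take it modulo the number of primary shards". `shard_for` is a hypothetical name, and CRC32 is only a stand-in hash chosen for this illustration, not the hash ElasticSearch itself uses.

```ruby
require 'zlib'

# Pick a shard for a document ID: hash the ID, then take the result
# modulo the number of primary shards. CRC32 is a stand-in hash for
# illustration only; it is not ElasticSearch's own routing hash.
def shard_for(doc_id, shard_count)
  Zlib.crc32(doc_id.to_s) % shard_count
end

ids = %w[lVRL_5FERsWW2tf_oX1WJw sHUhJNq6R7-8atpaXdzcfw]
ids.each do |id|
  puts "#{id} -> shard #{shard_for(id, 5)}"
end
```

    Because the result depends only on the ID and the shard count, every node arrives at the same answer without coordination.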
  • Movie break

  • Near-real-time search

    • Indexed documents become searchable in near real time; writes to disk happen transparently
    • Document GETs are true real time (by default)
  • Nested documents

    • ElasticSearch JSON documents can be deeply nested
    • Under the hood, these are flattened to namespaced fields
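
    The flattening idea can be sketched in a few lines of plain Ruby. This is an illustration of the concept only, not ElasticSearch's internal code; `flatten_fields` is a hypothetical name.

```ruby
# Recursively flatten a nested hash into dot-separated field names.
def flatten_fields(document, prefix = nil)
  document.each_with_object({}) do |(key, value), fields|
    name = prefix ? "#{prefix}.#{key}" : key.to_s
    if value.is_a?(Hash)
      # Descend into sub-documents, carrying the namespace along.
      fields.merge!(flatten_fields(value, name))
    else
      fields[name] = value
    end
  end
end

doc = { 'author' => { 'name' => 'Mat', 'department_id' => 4 } }
p flatten_fields(doc)
```

    The nested `author` object ends up as two flat fields, `author.name` and `author.department_id`.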
  • Mapping nested documents

    {
      "post": {
        "properties": {
          "author": {
            "properties": {
              "name": { "type": "string" },
              "department_id": { "type": "integer" }
            }
          }
        }
      }
    }
  • Creating nested documents

    { "author": { "name": "Mat", "department_id": 4 }}
  • Searching nested documents

    { "query": { "term": { "author.name": "Mat" }}}
  • Multi-fields

    Index the same data in different ways.

    We might want to index a title as a scalar string, as standard analyzed fulltext, and as substrings.

  • Defining a multi-field

    {
      "post": {
        "properties": {
          "title": {
            "type": "multi_field",
            "fields": {
              "title": { "type": "string", "index": "analyzed" },
              "scalar": { "type": "string", "index": "not_analyzed" },
              "substrings": { "type": "string", "index": "analyzed", "analyzer": "n_gram" }
            }
          }
        }
      }
    }
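
    The `substrings` sub-field relies on an analyzer named `n_gram`, which presumably has to be defined in the index settings (not shown on these slides). To give a feel for what n-gram analysis produces, here is a plain-Ruby sketch; `ngrams` is a hypothetical helper, the gram sizes 2 to 3 are arbitrary, and a real analyzer may tokenize differently.

```ruby
# Return every substring of `text` whose length is between min and max,
# shortest grams first.
def ngrams(text, min, max)
  (min..max).flat_map do |size|
    text.chars.each_cons(size).map(&:join)
  end
end

p ngrams('pizza', 2, 3)
```

    For 'pizza' this yields the grams pi, iz, zz, za, piz, izz, zza, so a search for "izz" can match a title containing "pizza".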
  • Index aliases

    • Point an alias at an index
    • Swap out the index at a given name without client knowledge
    • Point a single alias at multiple indices
    • Run an alias through a filter
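
    A sketch of the swap, assuming the aliases endpoint (`POST /_aliases`) with its `actions` list of `add` and `remove` entries. The index names `myapp_v1` and `myapp_v2` and the helper `swap_alias_actions` are made up for illustration.

```ruby
require 'json'

# Build the body for POST /_aliases that moves an alias from one index
# to another in a single request.
def swap_alias_actions(alias_name, old_index, new_index)
  {
    actions: [
      { remove: { index: old_index, alias: alias_name } },
      { add:    { index: new_index, alias: alias_name } }
    ]
  }
end

puts JSON.pretty_generate(swap_alias_actions('myapp', 'myapp_v1', 'myapp_v2'))
```

    Clients keep talking to `myapp` throughout; only the index behind the name changes.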
  • Index templates

    • Designed for setups with many indexes
    • Define a wildcard matcher for index names
    • Any new index that matches gets its configuration from the template
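
    The matching step can be pictured with a small Ruby sketch. The template names, patterns, and settings below are invented for illustration, and `File.fnmatch` is only a stand-in for ElasticSearch's own wildcard matching, which may differ in detail.

```ruby
# Return the templates whose wildcard pattern matches the new index name.
# File.fnmatch stands in for ElasticSearch's own pattern matching.
def matching_templates(templates, index_name)
  templates.select { |_name, tpl| File.fnmatch(tpl[:template], index_name) }
end

# Hypothetical templates, keyed by name.
templates = {
  'logs'  => { template: 'logs-*', settings: { number_of_shards: 1 } },
  'blogs' => { template: 'blog_*', settings: { number_of_shards: 5 } }
}

p matching_templates(templates, 'logs-2011-09').keys
```

    Here only the `logs` template applies, so a new `logs-2011-09` index would pick up its settings.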
  • Rich queries

    {
      "query": {
        "filtered": {
          "query": {
            "query_string": {
              "query": "delicious pizza",
              "fields": ["title", "body", "tags", "author.name"]
            }
          },
          "filter": {
            "or": [
              { "term": { "blog_id": 1 }},
              {
                "geo_distance": {
                  "distance": "5mi",
                  "location": {
                    "lat": 40,
                    "lon": -70
                  }
                }
              }
            ]
          }
        }
      }
    }
    
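
    Since the query is just nested JSON, it can be assembled from ordinary Ruby hashes. A small sketch that rebuilds the query on this slide; `filtered_search` is a hypothetical helper name, not part of any client library.

```ruby
require 'json'

# Combine a query_string query with an "or" of filters, mirroring the
# structure of the JSON on this slide.
def filtered_search(text, fields, filters)
  {
    query: {
      filtered: {
        query: { query_string: { query: text, fields: fields } },
        filter: { or: filters }
      }
    }
  }
end

body = filtered_search(
  'delicious pizza',
  %w[title body tags author.name],
  [
    { term: { blog_id: 1 } },
    { geo_distance: { distance: '5mi', location: { lat: 40, lon: -70 } } }
  ]
)
puts JSON.pretty_generate(body)
```

    The resulting hash can be passed as the body of a search request against `/myapp/post/_search`.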
  • Clients

    • Tire -- http://github.com/karmi/tire
    • Elastictastic -- TBA
    • Others -- http://www.elasticsearch.org/guide/appendix/clients.html
  • K all done!