Elastic in Action Notes
ElasticSearch In Action#
4. Mapping#
4.1 Mapping Overview#
Mapping is the schema definition
In Elasticsearch one does not need to define the schema/mapping before inserting documents (although it is recommended).
Sending this request (without anything else being done prior):
PUT movies/_doc/1
{
"title": "Godfather",
"rating": 4.5,
"release_year": "1972/08/01"
}
does the following:
- A new index, movies, is created automatically.
- A new schema (mapping) is created for the index. For example, title is set to the text and keyword types, rating to a float, and release_year to a date type.
- The document is indexed and stored in the Elasticsearch data store.
- Subsequent documents are stored without any of these preliminary steps.
The dynamically generated schema can be retrieved from the mapping API:
GET movies/_mapping
{
"movies": {
"mappings": {
"properties": {
"rating": {
"type": "float"
},
"release_year": {
"type": "date",
"format": "yyyy/MM/dd HH:mm:ss||yyyy/MM/dd||epoch_millis",
"print_format": "yyyy/MM/dd HH:mm:ss"
},
"title": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
}
This process is called dynamic mapping: when a field is seen for the first time, Elasticsearch infers its type from the value. The inference is a best guess and can often be wrong.
Note that title is a multi-field: it is mapped as text, and its keyword sub-field is accessed as title.keyword.
keyword fields are used for exact-value searches. They are left untouched and do not go through an analysis phase: they are not tokenised, synonymised, or stemmed. They use a special analyser called the no-op (no operation) analyser, which emits the entire field value as a single token. One can customise a keyword field by enabling a normaliser on it (expected to be done prior to indexing).
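To make the difference concrete, here is a minimal Python sketch of the two behaviours. It is a toy model, not Elasticsearch code: text_tokens is a rough stand-in (an assumption) for a standard-style analyser, and both function names are hypothetical.

```python
import re

def keyword_tokens(value):
    """Model of a keyword field: the whole value becomes one unmodified token."""
    return [value]

def text_tokens(value):
    """Rough stand-in for text analysis: lowercase, then split on non-alphanumerics."""
    return [token for token in re.split(r"[^0-9a-z]+", value.lower()) if token]

print(keyword_tokens("The Godfather"))  # ['The Godfather']
print(text_tokens("The Godfather"))     # ['the', 'godfather']
```

An exact-value search for "The Godfather" can only match the single keyword token, whereas a search for "godfather" matches one of the text tokens.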
4.2 Dynamic Mapping#
Elasticsearch can deduce whether a field is float or matches the format of a date.
Should you declare the field as a date type explicitly (using explicit mapping, covered later) without providing a custom format, the field defaults to the strict_date_optional_time format, which conforms to the ISO date format, i.e., yyyy-MM-dd or yyyy-MM-ddTHH:mm:ss.
During dynamic mapping, if a string value does not match one of the recognised date formats exactly, Elasticsearch maps the field as text rather than date.
Also, if the rating were 4 instead of 4.5, Elasticsearch would set the data type to long.
Incorrect types cause problems in an application, because the affected fields become ineligible for sorting, filtering, and aggregations.
The main downside of dynamic mapping is that Elasticsearch can misinterpret the document field values and derive an incorrect mapping, which voids the field's eligibility for appropriate search, sort, and aggregation capabilities.
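The deductions described above can be sketched as a small Python model. This is an illustration under stated assumptions, not Elasticsearch's implementation: infer_type and DATE_PATTERNS are hypothetical names, the pattern list is an assumed subset of the formats tried, and Python's strptime is more lenient than the strict formats Elasticsearch uses.

```python
from datetime import datetime

# Assumed subset of the date patterns checked during dynamic mapping.
DATE_PATTERNS = ("%Y-%m-%d", "%Y-%m-%dT%H:%M:%S", "%Y/%m/%d", "%Y/%m/%d %H:%M:%S")

def looks_like_date(value):
    """Return True if the string parses under any of the assumed patterns."""
    for pattern in DATE_PATTERNS:
        try:
            datetime.strptime(value, pattern)
            return True
        except ValueError:
            continue
    return False

def infer_type(value):
    """Guess a field type from a JSON value, loosely mirroring dynamic mapping."""
    if isinstance(value, bool):   # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "date" if looks_like_date(value) else "text"
    raise TypeError(f"unsupported value: {value!r}")

print(infer_type(4.5))           # float
print(infer_type(4))             # long
print(infer_type("1972/08/01"))  # date
print(infer_type("01-08-1972"))  # text
```

The last call shows the pitfall: a perfectly valid date in an unrecognised format silently becomes text.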
Example:
PUT students/_doc/1
{
"name": "John",
"age": "12"
}
GET students/_mapping
{
"students": {
"mappings": {
"properties": {
"age": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
}
Because the age value is enclosed in quotes, Elasticsearch treats it as text.
Then attempting to search and sort on age:
GET students/_search
{
"sort": [
{
"age": {
"order": "desc"
}
}
]
}
Returns an error (excerpt):
{
"error": {
"root_cause": [
{
"type": "illegal_argument_exception",
"reason": "Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [age] in order to load field data by uninverting the inverted index. Note that this can use significant memory."
}
By default, Elasticsearch cannot sort on fields of type text.
One can enable fielddata:
PUT students_with_fielddata_setting
{
"mappings": {
"properties": {
"age": {
"type": "text",
"fielddata": true
}
}
}
}
Enabling fielddata is expensive: the field data is loaded into the heap-based field data cache. The recommendation is to use keyword fields instead.
Using Keyword#
Fortunately, dynamic mapping creates every string field as a multi-field with keyword as the second type by default.
Changing the sort to use age.keyword works:
GET students/_search
{
"sort": [
{
"age.keyword": {
"order": "desc"
}
}
]
}
Sorting can be applied to keyword type fields.
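One caveat worth keeping in mind: a keyword field sorts by string value, not numerically, so sorting on age.keyword succeeds but may not give the order you expect. The short Python sketch below illustrates the difference on made-up sample ages; the function names are hypothetical.

```python
def sort_desc_as_keyword(values):
    """Descending sort on string values, as a keyword field would order them."""
    return sorted(values, reverse=True)

def sort_desc_as_number(values):
    """Descending sort on numeric values, as an integer field would order them."""
    return sorted((int(v) for v in values), reverse=True)

ages = ["12", "9", "100"]  # sample data, stored as strings like in the students index
print(sort_desc_as_keyword(ages))  # ['9', '12', '100']
print(sort_desc_as_number(ages))   # [100, 12, 9]
```

This is another reason to map numeric fields such as age with a numeric type explicitly rather than relying on the keyword sub-field.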
Date formats#
Date values in formats other than yyyy-MM-dd or yyyy/MM/dd are mapped as text, and are therefore ineligible for date-based sorting, filtering, and aggregation.
4.3 Explicit Mapping#
There are 2 ways to create the mappings explicitly:
- Indexing APIs - at the time of index creation
- Mapping APIs - add and modify mappings with the _mapping endpoint
Indexing API:
PUT movies
{
"mappings": {
"properties": {
"title": {
"type": "text"
}
}
}
}
Issue a PUT command for the index and pass a mappings object with all the required fields.
Mapping API:
PUT movies/_mapping
{
"properties": {
"release_date": {
"type": "date",
"format": "dd-MM-yyyy"
}
}
}
Issue a PUT command against the index's _mapping endpoint and pass a properties object with the new fields.
4.3.3 Modifying Existing Fields is Not Allowed#
Once an index is live, any modification of its existing fields is prohibited, because modifying them in place would lead to search failures.
To make updates one needs to use the reindexing technique:
- Create a new index with the updated schema
- Copy the data from the old index to the new index
- When the new index is ready, switch the application to use it
- Remove the old index once the new index is confirmed to work
To reindex, we issue:
POST _reindex
{
"source": {"index": "orders"},
"dest": {"index": "orders_new"}
}
Use aliases:
If your application is tightly coupled to an existing index, moving to the new index may require a code or configuration change. The ideal way to avoid this is to use aliases. Aliases are alternate names given to indices. If the application addresses the alias rather than the index, the alias can be repointed to a new index seamlessly, with zero downtime.
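As a sketch of how the switch might look, the Python snippet below builds a request body for the POST _aliases endpoint that removes the alias from the old index and adds it to the new one in a single call. The alias name orders_alias and the helper alias_swap_body are hypothetical; orders and orders_new are the index names from the reindex example.

```python
import json

def alias_swap_body(alias, old_index, new_index):
    """Build a body for POST _aliases that repoints `alias` from old to new index."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }

# orders_alias is an assumed alias name that the application would query.
body = alias_swap_body("orders_alias", "orders", "orders_new")
print(json.dumps(body, indent=2))
```

Because both actions travel in one request, the application querying the alias should not observe a moment where the alias points at no index.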