Elasticsearch
1. What Elasticsearch is
- Elasticsearch is a distributed search engine with a REST interface, based on the Lucene library.
- Indexed documents are available for search in near real-time.
- Official documentation
2. Elasticsearch concepts
Cluster
- A cluster consists of one or more nodes which share the same cluster name.
- Each cluster has a single master node which is chosen automatically by the cluster and which can be replaced if the current master node fails.
Node
A node is a running instance of Elasticsearch. Nodes come in at least two types: master nodes and data nodes.
Shard
Each document is stored in a single primary shard. When you index a document, it is indexed first on the primary shard, then on all replicas of the primary shard.
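As a rough illustration of how a document ends up in exactly one primary shard: Elasticsearch hashes a routing value (by default the document ID) and takes the remainder modulo the number of primary shards. The sketch below is a simplification under stated assumptions: it uses CRC32 as a stand-in hash (Elasticsearch uses a Murmur3-based hash internally), and `pick_shard` is a hypothetical helper name, not part of any Elasticsearch API.

```python
import zlib


def pick_shard(routing: str, num_primary_shards: int) -> int:
    """Map a routing value (by default the document ID) to a primary shard number."""
    if num_primary_shards < 1:
        raise ValueError("an index needs at least one primary shard")
    # CRC32 is only a stand-in here; Elasticsearch uses its own Murmur3-based hash.
    return zlib.crc32(routing.encode("utf-8")) % num_primary_shards


if __name__ == "__main__":
    for doc_id in ("user-1", "user-2", "user-3"):
        print(doc_id, "-> shard", pick_shard(doc_id, 5))
```

The same routing value always maps to the same shard, which is what lets a later get request find the document again.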
Replica
- A replica is a copy of a primary shard and serves two purposes:
  - Improve failover: a replica shard can be promoted to a primary shard if the primary fails.
  - Improve performance: get and search requests can be handled by either primary or replica shards.
- By default, each primary shard has one replica, but the number of replicas can be changed dynamically on an existing index. A replica shard is never started on the same node as its primary shard.
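Changing the replica count on an existing index can be sketched as follows. The snippet only builds the JSON body that would be sent with `PUT /index_name/_settings`; `replica_settings_body` is a hypothetical helper, and no request is actually made.

```python
import json


def replica_settings_body(number_of_replicas: int) -> str:
    """Build the body for an index settings update that changes the replica count."""
    if number_of_replicas < 0:
        raise ValueError("number_of_replicas must be zero or greater")
    return json.dumps({"index": {"number_of_replicas": number_of_replicas}})


if __name__ == "__main__":
    # Would be sent as: PUT http://localhost:9200/index_name/_settings
    print(replica_settings_body(2))
```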
Field
A field is the smallest unit of data in Elasticsearch.
Document
- A document is a JSON object stored in Elasticsearch. It is comparable to a row in a table in a relational database.
- Each document has its data in fields.
- The _source field contains the original JSON document body that was passed at index time. The _source field itself is not indexed (and thus is not searchable), but it is stored so that it can be returned when executing get and search requests.
Index
- An index can be thought of as an optimized collection of documents and each document is a collection of fields, which are the key-value pairs that contain your data.
- By default, Elasticsearch indexes all data in every field and each indexed field has a dedicated, optimized data structure. For example, text fields are stored in inverted indices, and numeric and geo fields are stored in BKD trees. The ability to use the per-field data structures to assemble and return search results is what makes Elasticsearch so fast.
- The hidden .monitoring-es index stores cluster monitoring data, which makes it possible to track metrics such as request rate (RPS), memory, and CPU usage.
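To make the inverted index mentioned above more concrete, here is a toy version in plain Python. It is a sketch only: real Lucene indices also store positions, frequencies, and compressed postings, and the function names `build_inverted_index` and `search` are made up for this illustration.

```python
from collections import defaultdict
from typing import Dict, Set


def build_inverted_index(docs: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map every lowercased term to the set of document IDs that contain it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return dict(index)


def search(index: Dict[str, Set[int]], query: str) -> Set[int]:
    """Return the IDs of documents that contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


if __name__ == "__main__":
    docs = {1: "quick brown fox", 2: "quick red fox", 3: "lazy dog"}
    index = build_inverted_index(docs)
    print(search(index, "quick fox"))
```

A lookup touches only the postings of the query terms instead of scanning every document, which is the main reason this structure suits full-text search.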
Alias
- An alias is a secondary name for a group of data streams or indices. Most Elasticsearch APIs accept an alias in place of a data stream or index name.
- You can change the data streams or indices of an alias at any time. If you use aliases in your application’s Elasticsearch requests, you can reindex data with no downtime or changes to your app’s code.
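A zero-downtime switch after reindexing might look like the sketch below. It builds the body for `POST /_aliases`, removing the alias from the old index and adding it to the new one in a single request. The helper name and the index names in the usage example are hypothetical.

```python
import json


def alias_swap_body(alias: str, old_index: str, new_index: str) -> str:
    """Build an _aliases request body that moves an alias from one index to another."""
    actions = [
        {"remove": {"index": old_index, "alias": alias}},
        {"add": {"index": new_index, "alias": alias}},
    ]
    return json.dumps({"actions": actions}, indent=2)


if __name__ == "__main__":
    # Would be sent as: POST http://localhost:9200/_aliases
    print(alias_swap_body("products", "products_v1", "products_v2"))
```

Because the application only ever queries the alias, it does not need to know which concrete index currently backs it.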
Analyzer
An analyzer is applied to a field and consists of the following three units:
- zero or more character filters. A character filter receives the original text as a stream of characters and can transform the stream by adding, removing, or changing characters.
- one tokenizer. A tokenizer receives a stream of characters, breaks it up into individual tokens (usually individual words), and outputs a stream of tokens.
- zero or more token filters. A token filter receives the token stream and may add, remove (stop token filter), or change (lowercase or synonym token filter) tokens.
Elasticsearch applies a field's analyzer at index time and, if one is configured, its search_analyzer at search time; otherwise the same analyzer is used for both. Configuring different analyzers for indexing and searching can cause unexpected results.
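The three stages above can be sketched as a small pipeline in plain Python. This is a toy model, not how Elasticsearch implements analysis; all function names and the stop-word list are assumptions made for the example.

```python
import re
from typing import Callable, Iterable, List

STOP_WORDS = {"a", "an", "the", "is"}  # tiny illustrative list


def strip_html(text: str) -> str:
    """Character filter: replace HTML-like tags with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def whitespace_tokenizer(text: str) -> List[str]:
    """Tokenizer: split the character stream on whitespace."""
    return text.split()


def lowercase_filter(tokens: List[str]) -> List[str]:
    """Token filter that changes tokens."""
    return [token.lower() for token in tokens]


def stop_filter(tokens: List[str]) -> List[str]:
    """Token filter that removes tokens."""
    return [token for token in tokens if token not in STOP_WORDS]


def analyze(
    text: str,
    char_filters: Iterable[Callable[[str], str]],
    tokenizer: Callable[[str], List[str]],
    token_filters: Iterable[Callable[[List[str]], List[str]]],
) -> List[str]:
    """Run character filters, then exactly one tokenizer, then token filters, in order."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


if __name__ == "__main__":
    print(analyze("<b>The</b> Quick fox", [strip_html],
                  whitespace_tokenizer, [lowercase_filter, stop_filter]))
```

Filter order matters: here the stop filter runs after lowercasing, so "The" is removed as well.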
Mapping
- Mapping is the process of defining how a document, and the fields it contains, are stored and indexed.
- Each document is a collection of fields, which each have their own data type. When mapping your data, you create a mapping definition, which contains a list of fields that are pertinent to the document. A mapping definition also includes metadata fields, like the _source field, which customize how a document’s associated metadata is handled.
Mapping type
Each field has a field data type, or field type. This type indicates the kind of data the field contains, such as strings or boolean values, and its intended use. For example, you can index strings to both text and keyword fields. However, text field values are analyzed for full-text search while keyword strings are left as-is for filtering and sorting.
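As an illustration of the text/keyword distinction, the sketch below builds a mapping in which the same string is indexed both ways using a multi-field. The body would be sent with `PUT /index_name` when creating the index. The field names (`title`, `raw`, `status`, `published`) are invented for the example.

```python
import json


def build_mapping() -> dict:
    """Return an index-creation body with an explicit mapping definition."""
    return {
        "mappings": {
            "properties": {
                # Analyzed for full-text search, plus an exact-value sub-field.
                "title": {
                    "type": "text",
                    "fields": {"raw": {"type": "keyword"}},
                },
                # Left as-is for filtering and sorting.
                "status": {"type": "keyword"},
                "published": {"type": "boolean"},
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_mapping(), indent=2))
```

With such a mapping, full-text queries would target `title`, while sorting or exact filtering would use `title.raw`.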
Index template
An index template is a way to tell Elasticsearch how to configure an index when it is created. For data streams, the index template configures the stream’s backing indices as they are created. Templates are configured prior to index creation. When an index is created, either manually or by indexing a document, the template settings are used as a basis for creating the index.
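A template body might be put together as in the sketch below. It assumes the composable template API (`PUT /_index_template/<name>`, available in recent Elasticsearch versions); the pattern `logs-*`, the shard counts, and the `@timestamp` field are illustrative values, and `index_template_body` is a hypothetical helper.

```python
import json


def index_template_body(pattern: str, shards: int, replicas: int) -> dict:
    """Build a template that applies to every new index whose name matches the pattern."""
    return {
        "index_patterns": [pattern],
        "template": {
            "settings": {
                "number_of_shards": shards,
                "number_of_replicas": replicas,
            },
            "mappings": {
                "properties": {"@timestamp": {"type": "date"}},
            },
        },
    }


if __name__ == "__main__":
    # Would be sent as: PUT http://localhost:9200/_index_template/logs_template
    print(json.dumps(index_template_body("logs-*", 1, 1), indent=2))
```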
3. Endpoints
- List all indices: http://localhost:9200/_cat/indices
- Show an index's content: http://localhost:9200/index_name/_search
- Show an index's mapping: http://localhost:9200/index_name/_mapping
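The endpoints above can be called from a script using only the standard library. The sketch assumes a node listening on localhost:9200 without authentication; `endpoint` and `fetch` are hypothetical helper names.

```python
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "http://localhost:9200"


def endpoint(*parts: str) -> str:
    """Join path segments onto the base URL, escaping each segment."""
    return BASE_URL + "/" + "/".join(quote(part, safe="") for part in parts)


def fetch(url: str) -> str:
    """Send a GET request and return the response body as text."""
    with urlopen(url, timeout=5) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # These calls need a running Elasticsearch node to succeed.
    print(fetch(endpoint("_cat", "indices")))
    print(fetch(endpoint("index_name", "_mapping")))
```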
4. Ports
- Port 9200 serves the REST API.
- Port 9300 is used for communication between nodes.
- Any node can accept requests, but searches are executed on data nodes. If a client sends a search request to another node, that node forwards it to the data nodes, then gathers and post-processes the results.
5. Data types
binary
Binary value encoded as a Base64 string.
boolean
true and false values.
keyword
Used for structured content such as IDs, email addresses, hostnames, status codes, zip codes, or tags.
text
The traditional field type for full-text content such as the body of an email or the description of a product.
completion
Supports the completion suggester, which provides auto-complete/search-as-you-type functionality. The suggester uses data structures that enable fast lookups but are costly to build and are stored in memory.
float
A single-precision 32-bit IEEE 754 floating point number, restricted to finite values.
integer
A signed 32-bit integer with a minimum value of -2^31 and a maximum value of 2^31-1.
date
Internally, dates are converted to UTC (if the time zone is specified) and stored as a long number representing milliseconds since the epoch.
alias
An alias mapping defines an alternate name for a field in the index. The alias can be used in place of the target field in search requests.
object
The default field type for inner JSON objects.
nested
The nested type is a specialised version of the object data type that allows arrays of objects to be indexed in a way that they can be queried independently of each other.
join
A special field that creates a parent/child relation between documents of the same index. The relations section defines a set of possible relations within the documents, each relation being a parent name and a child name. Mapping settings limit
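Why the nested type exists is easier to see with a toy model. With the plain object type, an array of objects is flattened into one list of values per field, so the link between values of the same object is lost. The sketch below imitates that behavior in plain Python; the function names and sample data are invented for the illustration.

```python
from typing import Dict, List


def flatten_object_array(field: str, objects: List[Dict[str, str]]) -> Dict[str, List[str]]:
    """Imitate the object type: collect the values of each sub-field into one list."""
    flat: Dict[str, List[str]] = {}
    for obj in objects:
        for key, value in obj.items():
            flat.setdefault(f"{field}.{key}", []).append(value)
    return flat


def matches_flattened(flat: Dict[str, List[str]], field: str, wanted: Dict[str, str]) -> bool:
    """Every wanted value appears somewhere, possibly in different objects."""
    return all(value in flat.get(f"{field}.{key}", []) for key, value in wanted.items())


def matches_nested(objects: List[Dict[str, str]], wanted: Dict[str, str]) -> bool:
    """Imitate the nested type: all wanted values must occur in the same object."""
    return any(all(obj.get(k) == v for k, v in wanted.items()) for obj in objects)


if __name__ == "__main__":
    users = [{"first": "John", "last": "Smith"}, {"first": "Alice", "last": "White"}]
    wanted = {"first": "Alice", "last": "Smith"}  # no single user has both values
    print(matches_flattened(flatten_object_array("user", users), "user", wanted))
    print(matches_nested(users, wanted))
```

The flattened check reports a match for "Alice Smith" even though no such user exists, while the per-object check does not.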
6. Query DSL
Boolean query
A query that matches documents matching boolean combinations of other queries.
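A boolean query combines other queries through the clauses `must`, `filter`, `should`, and `must_not`. The sketch below builds such a request body as a Python dict; the field names are borrowed from the other examples in this section, and the `status` field is made up.

```python
import json


def build_bool_query() -> dict:
    """Combine match, term, and range queries in one bool query."""
    return {
        "query": {
            "bool": {
                # Must match and contributes to the relevance score.
                "must": [{"match": {"message": "this is a test"}}],
                # Must match, but does not affect the score.
                "filter": [
                    {"term": {"user.id": {"value": "123"}}},
                    {"range": {"age": {"gte": 10, "lte": 20}}},
                ],
                # Documents matching these clauses are excluded.
                "must_not": [{"term": {"status": {"value": "deleted"}}}],
            }
        }
    }


if __name__ == "__main__":
    # Would be sent as the body of: GET http://localhost:9200/index_name/_search
    print(json.dumps(build_bool_query(), indent=2))
```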
Term query
- Returns documents that contain an exact term in a provided field. Avoid using the term query for text fields.
{ "query": { "term": { "user.id": { "value": "123" } } } }
Range query
- Returns documents that contain terms within a provided range.
{ "query": { "range": { "age": { "gte": 10, "lte": 20 } } } }
Nested query
- The nested query searches nested field objects as if they were indexed as separate documents. If an object matches the search, the nested query returns the root parent document.
- To use the nested query, your index must include a nested field mapping.
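A nested query body can be sketched as follows. It assumes an index whose mapping declares a field named `user` with type nested; that field name, its sub-fields, and the helper `nested_query` are hypothetical.

```python
import json


def nested_query(path: str, inner_query: dict) -> dict:
    """Wrap a query so that it runs against each nested object under the given path."""
    return {"query": {"nested": {"path": path, "query": inner_query}}}


if __name__ == "__main__":
    inner = {
        "bool": {
            "must": [
                {"match": {"user.first": "Alice"}},
                {"match": {"user.last": "White"}},
            ]
        }
    }
    # Both conditions have to hold within the same nested object.
    print(json.dumps(nested_query("user", inner), indent=2))
```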
Match query
- Returns documents that match a provided text, number, date or boolean value. The provided text is analyzed before matching. The match query is the standard query for performing a full-text search, including options for fuzzy matching.
{ "query": { "match": { "message": "this is a test" } } }
Explain
Returns information about why a specific document matches (or doesn’t match) a query.
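An explain request targets one document and carries the query in its body. The sketch below only prepares the URL and body for `GET /index_name/_explain/<id>`; it assumes a recent Elasticsearch version (older versions included a mapping type in the path), and the document ID `1` and the helper name are placeholders.

```python
import json
from typing import Tuple
from urllib.parse import quote


def explain_request(index: str, doc_id: str, query: dict,
                    base_url: str = "http://localhost:9200") -> Tuple[str, str]:
    """Return the URL and JSON body for explaining how one document scores against a query."""
    url = f"{base_url}/{quote(index, safe='')}/_explain/{quote(doc_id, safe='')}"
    body = json.dumps({"query": query})
    return url, body


if __name__ == "__main__":
    url, body = explain_request("index_name", "1", {"match": {"message": "this is a test"}})
    print(url)
    print(body)
```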