Connecting to Elasticsearch
Set the Server and Port connection properties, in addition to the authentication properties. The Lyftron Provider for Elasticsearch 2019 uses X-Pack Security for authentication and TLS/SSL encryption.
Authenticating to Elasticsearch
Set the User and Password properties and/or use PKI (public key infrastructure) to authenticate. Once the provider is connected, X-Pack performs user authentication and grants role permissions based on the realms you have configured.
To use PKI, set the SSLClientCert, SSLClientCertType, SSLClientCertSubject, and SSLClientCertPassword properties.
Note: TLS/SSL and client authentication must be enabled on X-Pack to use PKI.
Securing Elasticsearch Connections
To enable TLS/SSL in the provider, prefix the Server value with 'https://'.
Connecting to Amazon Elasticsearch Service
To connect to Amazon ES, set the Server, Port, AccessKey, SecretKey, and Region properties.
- Server should be set to the Endpoint URL for the Amazon ES instance.
- Port should be set to 443.
- Region should be set to the Amazon AWS region where the Elasticsearch instance is being hosted (the provider will attempt to automatically identify the region based on the Server value).
The AccessKey and SecretKey properties are used to authenticate an AWS user. Note: Requests are signed using AWS Signature Version 4.
Accessing NoSQL Tables
The provider implements Automatic Schema Discovery that is highly configurable. The following sections outline the provider's defaults and link to ways to further customize.
Flattening Nested JSON
By default, the provider projects columns over the properties of objects, including objects nested in objects. Arrays are returned as JSON strings, by default. You can use the following properties to access array elements, including objects nested in arrays.
- FlattenArrays: Set this property to the number of array elements that you want to return as column values. You can also use this property with FlattenObjects to extract the properties of objects nested in arrays.
-
FlattenObjects: By default, this is true; that is, the properties of objects and nested objects are returned as columns. When you set FlattenArrays, objects nested in the specified array elements are also flattened and returned as columns.
Other mechanisms for accessing nested objects are detailed in Searching with SQL.
Querying Multiple Indices
Multiple indices can be queried by executing a query using one of the following formats:
-
Query all indices via the _all view: SELECT * FROM [_all]
-
Query a list of indices: SELECT * FROM [index1,index2,index3]
-
Query indices matching a wildcard pattern: SELECT * FROM [index*]
Note, index lists can contain wildcards and indices can be excluded by prefixing an index with '-'. For example: SELECT * FROM [index*,-index3]
Fine Tuning Data Access
You can use the following properties to gain greater control over Elasticsearch API features and the strategies the provider uses to surface them:
-
GenerateSchemaFiles: This property enables you to persist table metadata in static schema files that are easy to customize, to persist your changes to column data types, for example. You can set this property to "OnStart" to generate schema files for all tables in your database at connection. The resulting schemas are based on the connection properties you use to configure Automatic Schema Discovery.
Or, you can set this property to "OnUse" to generate schemas based on a query.
To use the resulting schema files, set the Location property to the folder containing the schemas.
- QueryPassthrough: This property enables you to use Elasticsearch's Search DSL language instead of SQL.
- RowScanDepth: This property determines the number of rows that will be scanned to detect column data types when generating table metadata. This property applies if you are working with the dynamic schemas generated from Automatic Schema Discovery or if you are using QueryPassthrough.
Fine Tuning Performance
-
PageSize: This property enables you to optimize performance based on your resource provisioning. Paging has an impact on sorting performance in a distributed system, as each shard must first sort results before submitting them to the coordinating server.
By default, the provider requests a page size of 10,000. This is the default index.max_result_window setting in Elasticsearch.
-
MaxResults: This property sets a limit on the results for queries at connection time, without requiring that you specify a LIMIT clause. By default, this is the same value as the index.max_result_window setting in Elasticsearch.
If you are using the Scroll API, set ScrollDuration instead.
- ScrollDuration: This property specifies how long the server should keep the search context alive. Setting this property to a nonzero value and time unit enables the Scroll API.
Connecting Through a Firewall or Proxy
To connect through the Windows system proxy, set only the connection properties needed to authenticate and connect. To connect to other proxies, set ProxyAutoDetect to false and in addition set the following.
To authenticate to an HTTP proxy, set ProxyAuthScheme, ProxyUser, and ProxyPassword, in addition to ProxyServer and ProxyPort.
To connect to other proxies, set FirewallType, FirewallServer, and FirewallPort. To tunnel the connection, set FirewallType to TUNNEL. To authenticate to a SOCKS proxy, set FirewallType to SOCKS5. Additionally, specify FirewallUser and FirewallPassword.
Custom URLs
If a custom URL is required, using the form [Server]:[Port]/[URLPathPrefix], the 'URLPathPrefix' value can be specified via the Other property. For example: URLPathPrefix=myprefix
The provider will use the specified path prefix to build the URL required for connecting to the Elasticsearch API endpoints.
Troubleshooting the Connection
To show provider activity from query execution to HTTP calls, use Logfile and Verbosity. The examples of common connection errors below show how to use these properties to get more context. Contact the support team for help tracing the source of an error or circumventing a performance issue.
- Authentication errors: Typically, recording a Logfile at Verbosity 4 is necessary to get full details on an authentication error.
- Queries time out: A server that takes too long to respond will exceed the provider's client-side timeout. Often, setting the Timeout property to a higher value will avoid a connection error. Another option is to disable the timeout by setting the property to 0. Setting Verbosity to 2 will show where the time is being spent.
Searching with SQL
Elasticsearch is a document-oriented database that provides high performance searching, flexibility, and scalability. These features are not necessarily incompatible with a standards-compliant query language like SQL-92. In this section we will show various schemes that the provider offers to bridge the gap with relational SQL and an Elasticsearch database.
The provider models Elasticsearch objects into relational tables and translates SQL queries into Elasticsearch queries to get the requested data. See Schema Mapping for more details on how Elasticsearch objects are mapped to tables to generate schemas. See Query Mapping for more details on how various Elasticsearch operations are represented as SQL.
The Automatic Schema Discovery scheme automatically finds the data types by retrieving the mapping for the Elasticsearch type. You can use RowScanDepth, FlattenArrays, and FlattenObjects to control the relational representation of the collections in Elasticsearch.
Optionally, you can use Custom Schema Definitions to project your chosen relational structure on top of a Elasticsearch object. This allows you choose your own column names, their data types, and the location of their values in the collection.
When GenerateSchemaFiles is set, you can persist schemas for all collections in the database or for the results of SELECT queries.
Schema Mapping
The Lyftron Provider for Elasticsearch 2019 models the Elasticsearch REST APIs as relational tables and stored procedures that can be accessed with standard SQL. This enables access from standards-based tools.
The table definitions are dynamically retrieved. When you connect, the provider connects to Elasticsearch and retrieves the schemas, list of tables, and the metadata for the tables by querying the Elasticsearch REST server. Any changes to the remote data are immediately reflected in your queries.
The following table maps Elasticsearch concepts to relational ones:
Elasticsearch Versions 6 and Above:
Elasticsearch Concept | SQL Concept |
Index | Table |
Alias | View |
Document | Row (each document is a row and the document's JSON structure is represented as columns) |
Field | Column |
Note: Starting in Elasticsearch 6, indices are limited to a single type. Therefore the type is no longer treated as a table, since an index and type have a one-to-one relation. Types are hidden and used internally where necessary to issue the proper request to Elasticsearch.
Elasticsearch Versions Prior to Version 6:
Elasticsearch Concept | SQL Concept |
Index | Schema |
Type | Table |
Alias | View |
Document | Row (each document is a row and the document's JSON structure is represented as columns) |
Field | Column |
Parent-Child Relationships
Elasticsearch contains the ability to establish parent-child relationships. This relationship maps closely to SQL JOIN functionality. The provider models these parent-child relationships in a way to enable the ability to perform JOIN queries.
Elasticsearch Versions 6 and Above:
In version 6 and above of Elasticsearch, relationships are established by using the join datatype. Included in this functionality is the ability to define multiple children for a single parent and to create multiple levels of relations.
The provider supports all of these relationships and will generate a separate table for each relation in Elasticsearch. The table name will be in the form: [index]_[relation].
All child tables will have an additional column containing the parent table id. The column name will be in the form: _[parent_table]_id. This column is a foreign key to the _id column of the parent table and can be used to perform SQL JOIN queries.
When querying these tables individually, filtering logic is pushed to the server to improve performance by only returning the data relevant to the table selected.
Elasticsearch Versions Prior to Version 6:
In versions prior to 6, a relationship is established between two types via a _parent field. This creates a single parent-child relationship.
The tables identified in this parent-child relationship do not change (they are still based on the Elasticsearch type). However the child table will have an additional column containing the parent id. The column name will be in the form: _[parent_table]_id. This column is a foreign key to the _id column of the parent table and can be used to perform SQL JOIN queries.
Raw Data
Below is the raw data used throughout this chapter. Following is the mapping for the "insured" table (index):
{ "insured" : { "mappings" : { "properties" : { "name" : { "type" : "string" }, "address" : { "street" : { "type" : "string" }, "city" : { "type" : "string" }, "state" : { "type" : "string" } }, "insured_ages" : { "type" : "integer" }, "vehicles" : { "type" : "nested" , "properties" : { "year" : { "type" : "integer" }, "make" : { "type" : "string" }, "model" : { "type" : "string" }, "body_style" { "type" : "string" } } } } } } } |
The following is the sample data set for the "insured" table (index):
{ "hits" : { "total" : 2, "max_score" : 1, "hits" : [ { "_index" : "insured" , "_type" : "_doc" , "_id" : "1" , "_score" : 1, "_source" : { "name" : "John Smith" , "address" : { "street" : "Main Street" , "city" : "Chapel Hill" , "state" : "NC" }, "insured_ages" : [ 17, 43, 45 ], "vehicles" : [ { "year" : 2015, "make" : "Dodge" , "model" : "RAM 1500" , "body_style" : "TK" }, { "year" : 2015, "make" : "Suzuki" , "model" : "V-Strom 650 XT" , "body_style" : "MC" }, { "year" : 1992, "make" : "Harley Davidson" , "model" : "FXR" , "body_style" : "MC" } ] } }, { "_index" : "insured" , "_type" : "_doc" , "_id" : "2" , "_score" : 1, "_source" : { "name" : "Joseph Newman" , "address" : { "street" : "Oak Street" , "city" : "Raleigh" , "state" : "NC" }, "insured_ages" : [ 23, 25 ], "vehicles" : [ { "year" : 2010, "make" : "Honda" , "model" : "Accord" , "body_style" : "SD" }, { "year" : 2008, "make" : "Honda" , "model" : "Civic" , "body_style" : "CP" } ] } } ] } } |
Automatic Schema Discovery
The provider automatically infers a relational schema by retrieving the mapping of the Elasticsearch type. The columns and data types are generated from the retrieved mapping.
Detecting Arrays
Any field within Elasticsearch can be an array of values, but this is not explicitly defined within the mapping. To account for this, the provider will query the data to detect if any fields contain arrays. The number of Elasticsearch documents retrieved during this array scanning is based on the RowScanDepth property.
Elasticsearch nested types are special types that denote an array of objects and thus will always be treated as such when generating the metadata.
Detecting Columns
The columns identified during the discovery process depend on the FlattenArrays and FlattenObjects properties.
Example Data Set
To provide an example of how these options work, consider the following mapping (where 'insured' is the name of the table):
{ "insured" : { "properties" : { "name" : { "type" : "string" }, "address" : { "street" : { "type" : "string" }, "city" : { "type" : "string" }, "state" : { "type" : "string" } }, "insured_ages" : { "type" : "integer" }, "vehicles" : { "type" : "nested" , "properties" : { "year" : { "type" : "integer" }, "make" : { "type" : "string" }, "model" : { "type" : "string" }, "body_style" { "type" : "string" } } } } } } |
Also consider the following example data for the above mapping:
{ "_source" : { "name" : "John Smith" , "address" : { "street" : "Main Street" , "city" : "Chapel Hill" , "state" : "NC" }, "insured_ages" : [ 17, 43, 45 ], "vehicles" : [ { "year" : 2015, "make" : "Dodge" , "model" : "RAM 1500" , "body_style" : "TK" }, { "year" : 2015, "make" : "Suzuki" , "model" : "V-Strom 650 XT" , "body_style" : "MC" }, { "year" : 2012, "make" : "Honda" , "model" : "Accord" , "body_style" : "4D" } ] } } |
Using FlattenObjects
If FlattenObjects is set, all nested objects will be flattened into a series of columns. The above example will be represented by the following columns:
Column Name | Data Type | Example Value |
name | String | John Smith |
address.street | String | Main Street |
address.city | String | Chapel Hill |
address.state | String | NC |
insured_ages | String | [ 17, 43, 45 ] |
vehicles | String | [ { "year": "2015", "make": "Dodge", ... }, { "year": "2015", "make": "Suzuki", ... }, { "year": "2012", "make": "Honda", ... } ] |
If FlattenObjects is not set, then the address.street, address.city, and address.state columns will not be broken apart. The address column of type string will instead represent the entire object. Its value would be the following:
{street: "Main Street" , city: "Chapel Hill" , state: "NC" } |
Using FlattenArrays
The FlattenArrays property can be used to flatten array values into columns of their own. This is only recommended for arrays that are expected to be short. It is best to leave unbounded arrays as they are and piece out the data for them as needed using JSON Functions.
Note: Only the top-most array will be flattened. Any subarrays will be represented as the entire array.
The FlattenArrays property can be set to 3 to represent the arrays in the example above as follows (this example is with FlattenObjects not set):
Column Name | Data Type | Example Value |
insured_ages | String | [ 17, 43, 45 ] |
insured_ages.0 | Integer | 17 |
insured_ages.1 | Integer | 43 |
insured_ages.2 | Integer | 45 |
vehicles | String | [ { "year": "2015", "make": "Dodge", ... }, { "year": "2015", "make": "Suzuki", ... }, { "year": "2012", "make": "Honda", ... } ] |
vehicles.0 | String | { "year": "2015", "make": "Dodge", "model": "RAM 1500", "body_style": "TK" } |
vehicles.1 | String | { "year": "2015", "make": "Suzuki", "model": "V-Strom 650 XT", "body_style": "MC" } |
vehicles.2 | String | { "year": "2012", "make": "Honda", "model": "Accord", "body_style": "4D" } |
Using Both FlattenObjects and FlattenArrays
If FlattenObjects is set along with FlattenArrays (set to 1 for brevity), the vehicles field will be represented as follows:
Column Name | Data Type | Example Value |
vehicles | String | [ { "year": "2015", "make": "Dodge", ... }, { "year": "2015", "make": "Suzuki", ... }, { "year": "2012", "make": "Honda", ... } ] |
vehicles.0.year | String | 2015 |
vehicles.0.make | String | Dodge |
vehicles.0.model | String | RAM 1500 |
vehicles.0.body_style | String | TK |