
The alternatives to ETL into Elasticsearch: Virtualize or Reflect?

Published: Friday, April 12th, 2019

The Elasticsearch stack is great for searching and analyzing all sorts of enterprise data, which is why it is common to Extract -> Transform -> Load (ETL) data from various systems into Elasticsearch.

For certain kinds of data, e.g. streaming machine logs, it is “natural” to use log shipping systems like Beats.

ETL from databases, however, is far from ideal:

  • It requires external software like Logstash (which itself needs to be monitored), which in turn requires a large amount of tricky configuration at the system administration level.
  • Data refresh issues: the data is stale while you wait for the next refresh, which is not acceptable for every use case.
  • With regulations like GDPR and new ones in the US, it is becoming clear that simply “copying” data can be a bad and costly idea.

And yet we really do want to analyze multiple datasources in the same UI, e.g. Kibana, alongside Elasticsearch data.

Here is the good news: with Siren you have alternatives, Virtualization and Data Reflection.

Enter “Virtual Indices” in Elasticsearch (from Siren 10)

At Siren we love Elasticsearch and are on a mission to supercharge it. As part of this, our Siren Federate plugin allows the creation of Elasticsearch Virtual Indices, which directly virtualize access to JDBC datasources.

With Virtual Indices, launched last May with Siren 10, data is never copied; instead, queries to Virtual Indices are translated to native SQL queries and back in real time.
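
To give an idea of what this looks like in practice, here is a minimal sketch using the Python Elasticsearch client (8.x style) against a hypothetical Virtual Index named “sales_virtual”: the client treats it exactly like a regular index, and Siren Federate handles the SQL translation behind the scenes.

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    # From the client's point of view this is an ordinary Elasticsearch search;
    # Siren Federate translates it to SQL against the backing JDBC datasource.
    resp = es.search(
        index="sales_virtual",               # hypothetical Virtual Index name
        query={"match": {"region": "EMEA"}},
        size=10,
    )
    for hit in resp["hits"]["hits"]:
        print(hit["_source"])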

From our frontend (Siren Investigate, our Kibana extension), the data in any of the supported backends can be viewed in dashboards alongside Elasticsearch data.

Here is a Siren installation showing data from MS SQL Server, MySQL, Postgres, Oracle, Sybase, Impala, Presto, Spark, Dremio, and Denodo at the same time (even with certain join capabilities among them!).

While this is great, there are a few things virtualization can’t do, which is why we took it a step further.

Enter Elasticsearch “Reflection” (Siren 10.2)

Siren 10.2 introduces the concept of ‘Reflection’ of Siren datasources.

Similar to the way the term reflection is used in other systems, a Siren reflection is a recurrent, fully managed ingestion from a datasource that builds a ‘cache/temporary/reflected’ index in Elasticsearch.

Reflections are great in the following cases:

  • Search and unique analytics: you are actually interested in the unique abilities of the Elasticsearch backend, e.g. great search (with ranking of results, support for stemming, very sophisticated text analysis, an advanced search query language), word cloud visualizations, significant term analysis, “more like this” similarity queries, and much more.
  • Denormalization into a far more advanced data model: Elasticsearch, being an information retrieval system, has “multi-value” attributes as a core feature. This is really powerful for better analytics. Say that a user is a member of multiple “groups”. In SQL systems these memberships have to be stored in a separate associative table to be queried effectively. Not in Elasticsearch, where one can simply have an array, with search, drilldowns and rollups working coherently (see the sketch after this list).
  • Performance and scalability: data reflection is typically much faster than SQL backends for explorative drilldowns and search. You can expect high scalability and a multi-fold decrease in dashboard loading time compared to dashboards that use virtualized indices on systems like Spark (data stored in Parquet format; we’ll publish a benchmark and more).
  • Decreased load on the remote system: the remote datasource is not hit on each access.
  • Ease of setup: our out-of-the-box UI makes configuration really easy, as shown below.
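
Here is a minimal sketch of the “groups” example above, using the Python Elasticsearch client (index and field names are illustrative only): a field becomes multi-valued simply by indexing an array, and a terms aggregation rolls it up coherently.

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    # No associative table needed: "groups" holds an array of values per user.
    es.indices.create(
        index="users",
        mappings={
            "properties": {
                "name": {"type": "text"},
                "groups": {"type": "keyword"},  # multi-valued by indexing a list
            }
        },
    )
    es.index(index="users", document={"name": "Alice", "groups": ["admins", "analysts"]})
    es.index(index="users", document={"name": "Bob", "groups": ["analysts"]}, refresh=True)

    # Drilldowns and rollups work directly on the multi-value field.
    resp = es.search(
        index="users",
        size=0,
        aggs={"by_group": {"terms": {"field": "groups"}}},
    )
    print(resp["aggregations"]["by_group"]["buckets"])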

Let’s see them in action!

You start with a pretty little icon on the left:

Data Reflection

Within Data Reflection, we have two modules:

Siren Data Reflection App

Reflection Jobs:

Upon selection, you’ll see an overview of all your jobs. From here you may perform actions on your jobs such as enabling/disabling scheduling, editing, running, deleting or cloning.

Reflection Jobs

You may also monitor how reflections are going (updated live) and review past jobs.

Let me walk you through creating a reflection job from a JDBC datasource or Virtual Index previously configured in Siren.

Configuring a Data Reflection Job:

Step 1: Input Data

Data Reflection - Input Data

  1. Define a name for your job. Description is optional.
  2. Select a configured Datasource OR Virtual Index.
  3. Write a SELECT query for your datasource. If you chose a Virtual Index, its query is inserted automatically and you may alter it if desired.
  4. Press ‘Test’ to get sample data and confirm the query works.
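
As an illustration, the input query can be as simple as a plain SELECT over the table you want to reflect; the “orders” table and its columns below are hypothetical.

    # Hypothetical input query for Step 1; adjust table and columns to your datasource.
    input_query = """
    SELECT id, customer_id, total, created_at
    FROM orders
    """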

Step 2: Transform Pipeline (Optional)

Data Reflection - Transform Pipeline

You may configure an Elasticsearch ingestion pipeline to enrich your documents on the fly before they are reflected into an index. Have a look at the sample pipelines to get you started.

In the UI you get a selection of samples and the ability to debug/test your pipeline until you’re ready to go to the next step.
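
For illustration, here is a minimal sketch of a standard Elasticsearch ingest pipeline of the kind the transform step can apply, registered and tested with the Python Elasticsearch client; the pipeline id and field names are assumptions, not Siren samples.

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    # Register a small pipeline that enriches documents before they are reflected.
    es.ingest.put_pipeline(
        id="orders-enrich",                    # hypothetical pipeline id
        description="Enrich documents before reflection",
        processors=[
            {"set": {"field": "ingested_at", "value": "{{_ingest.timestamp}}"}},
            {"lowercase": {"field": "customer_id", "ignore_missing": True}},
        ],
    )

    # The simulate API is a convenient way to debug the pipeline on sample documents.
    resp = es.ingest.simulate(
        id="orders-enrich",
        docs=[{"_source": {"customer_id": "ACME-42", "total": 99.5}}],
    )
    print(resp["docs"][0]["doc"]["_source"])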

Step 3: Mapping

Data Reflection - Mapping
The mapping is auto-detected and already translated to the data types available in Elasticsearch. However, you have to define a mapping for any new field that your pipeline adds.
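
For example, if a pipeline like the sketch in Step 2 adds an “ingested_at” field, its mapping entry would be something along these lines (an illustrative snippet, not generated by the UI):

    # Mapping entry for a field added by the pipeline; the field name is hypothetical.
    extra_mappings = {
        "properties": {
            "ingested_at": {"type": "date"},
        }
    }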

Besides this, you may also use the available transformations for custom mapping, like the ‘Copy’ transformation used here. Available transformations:

Transformations

Step 4: Target Index

Data Reflection - Target Index

  1. Define a target index.
  2. Select a primary key (you may also ask Elasticsearch to auto-generate one); see the sketch after this step for what the primary key means at the index level.
  3. Select an indexing strategy. Available strategies:

Reflection Strategies

  • If using the ‘Replace’ strategy, you have the option to give a desired prefix to your staging indices.
  • (Optional) Provide custom credentials to override the ones set in the datasource.
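
As a rough mental model (an assumption for illustration, not a description of Siren’s internals), selecting a primary key amounts to using it as the Elasticsearch document _id, so a re-run of the job updates existing documents instead of duplicating them, which is exactly what an incremental strategy needs:

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    row = {"id": 42, "customer_id": "acme-42", "total": 99.5}

    # Indexing with an explicit _id overwrites the same document on a re-run
    # instead of creating a duplicate (index and field names are hypothetical).
    es.index(index="orders_reflected", id=str(row["id"]), document=row)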

Step 5: Schedule

Data Reflection - Schedule

Schedule your Data Reflection using cron syntax; for example, in standard five-field cron syntax, “0 2 * * *” runs the job every day at 02:00.

Now hit Save in the top right corner and you’re done! If you’re not scheduling the job, you may run it manually after you are redirected to the jobs page.

Extra Tips:

For a near real-time reflection, you may want to use the ‘Incremental’ strategy in Step 4 with a primary key. To filter out already reflected data, you may write the query using the mustache variables we make available (see the example after this list):

  • max_primary_key: The maximum value of the primary key in Elasticsearch (available if the primary key is numeric).
  • last_record_timestamp: The UTC timestamp at which the last record was successfully processed by the reflection job.
  • last_record: An array with the scalar values in the last record that was successfully processed by the reflection job.
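
For example, a minimal incremental input query for a hypothetical “orders” table with a numeric primary key could look like this; Siren substitutes the mustache variable before sending the query to the datasource.

    # Only rows newer than what is already reflected are fetched on each run.
    incremental_query = """
    SELECT id, customer_id, total, created_at
    FROM orders
    WHERE id > {{max_primary_key}}
    """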

If disk space is a consideration, you may disable the _source field in your index, use the reflected index in your dashboards for aggregations, and fetch the documents from the virtual index.
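
Disabling _source is a standard Elasticsearch mapping option; a minimal sketch of a target index created this way (index and field names are illustrative) would be:

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    es.indices.create(
        index="orders_reflected",
        mappings={
            "_source": {"enabled": False},   # documents are not stored, only indexed
            "properties": {
                "customer_id": {"type": "keyword"},
                "total": {"type": "double"},
            },
        },
    )

Aggregations still work on such an index because they read doc values, while the full documents are fetched from the virtual index as described above.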

Conclusions

Data Virtualization and Data Reflection in Siren are a great way to enjoy a unified, searchable, analytical view of data coming from many enterprise sources.

The fully managed nature of Data Reflection makes the setup highly reliable and eliminates the time-consuming part of getting into the ecosystem. Yet you get all the benefits of Elasticsearch and of the additional features we add in Siren (relational correlation capabilities, link analysis and others).
