How to Use Siren Federate Join in Elasticsearch Aggregations

Introduction

Aggregations in Elasticsearch are fundamental for gaining insights from data by grouping and computing metrics, statistics, and other analytics. Analysts often need to perform analytics that require connecting and analyzing data that are located in separate Elasticsearch indices.

A common need in analytics is to segment or aggregate data from one source (e.g., customers stored in a customers index) based on contextual attributes stored in another source (e.g., their orders stored in an orders index). For instance, analysts may want to group customers by the status of their purchases (like delivered or cancelled), which is information stored in the orders index, not in the customers index itself.

However, natively performing this kind of analysis within Elasticsearch aggregations across different indices to enhance these insights has traditionally been complex. The challenge lies in the lack of native support for expressing cross-index joins directly within Elasticsearch aggregation contexts.

Due to Elasticsearch’s document-oriented nature and the typical denormalization of documents at index time, cross-index relationships are difficult to express and manage efficiently within the aggregations. This makes such analytical use cases either complex or inefficient to implement with standard features.

Starting from version 36, Siren Federate addresses this challenge by extending Elasticsearch aggregation to allow joins to be expressed within aggregations. This means that aggregation buckets can now be defined based on the outcome of join operations between documents in different indices, allowing for a much richer set of aggregation patterns that reflects real-world data models.

This feature enables new analytical patterns such as cross-index segmentation, aggregation by related properties and more accurate and context-aware metrics – all within a single Elasticsearch query.

“Think of it as bringing the flexibility of SQL-style joins into the world of Elasticsearch aggregations, where metrics can now be grouped and filtered based on relationships across multiple datasets.”

This blog post will explore how Siren Federate’s join capabilities within aggregation works and demonstrate a practical e-commerce scenario.

The Scenario

An e-commerce platform wants to gain insights into its customer purchase funnel to optimize marketing efforts and improve sales. Specifically, they need to analyze the behaviour of customers from different acquisition channels like Facebook and track their journey from signup to order placement, including the value and status of those orders. This requires the ability to connect and analyze the customer profiles and order events, which are stored in separate data indices, to answer questions like:

How many customers from the US who signed up via Facebook have placed orders ?
Among these customers, how many have placed orders with a value over $100 that has been delivered?
How many customers have placed cancelled orders?

By addressing this problem, the e-commerce platform can achieve cross-index segmentation to understand customer behaviour across related entities, enable nested filtering to evaluate key performance indicators like conversion rates by customer segment, and perform selective aggregation to focus on specific subsets of data.

Lets start with an example, customer profiles are stored in a customers index, and order events are recorded in an orders index

Customers index sample data

{
  "customer_id": "cust123",
  "customer_country": "US",
  "acquisition_channel": "facebook"
}

Orders index sample data

{
  "order_id": "order789",
  "customer_id": "cust123",
  "order_total_amount": 120.5,
  "order_status": "delivered"
}

Table 1: Sample Data for 5 Customers

customer_id	customer_country	acquisition_channel
cust001	US	facebook
cust002	CA	google
cust002	US	email
cust004	UK	facebook
cust005	US	facebook

Table 2: Sample Data for 5 Orders

order_id	customer_id	order_total_amount	order_status
ord001	cust001	150.00	delivered
ord002	cust003	75.50	cancelled
ord003	cust005	125.75	delivered
ord004	cust001	90.20	delivered
ord005	cust007	200.00	delivered

Order index mappings

{
  "mappings": {
    "properties": {
      "order_id": {
        "type": "keyword"
      },
      "customer_id": {
        "type": "keyword"
      },
      "order_total_amount": {
        "type": "double"
      },
      "order_status": {
        "type": "keyword"
      },
      "customer_id_0": {
        "type": "keyword"
      },
      "customer_id_1": {
        "type": "keyword"
      },
      "customer_id_2": {
        "type": "keyword"
      },
      "customer_key": {
        "type": "keyword"
      }
    }
  }
}

Results

"aggregations" : {
  "total_customers_with_orders" : { "value" : 2 },
  "customers_with_high_value_delivered_orders" : {
    "doc_count" : 8,
    "distinct_customers" : { "value" : 2 }
  },
  "customers_with_cancelled_orders" : {
    "doc_count" : 1,
    "distinct_customers" : { "value" : 1 }
  }
}

In a standard Elasticsearch setup, correlating customer attributes with their order history for aggregated analysis historically lacked direct join capabilities. Achieving this typically involves significant data manipulation, potentially requiring data denormalization by putting related data in the same index, or implementing complex application-level logic for joining data after retrieval. In the example above, this can lead to data duplication, especially when a single customer has many matching orders.

If aggregations are performed without deduplication, it may result in wrong aggregation counts. From the example above, we need two queries; one for search on the customers index and the other for search on the orders index.

We will have to use the correct customer_ids which are error prone and do not scale. From the results, we can see that the total doc_count for customers_with_high_value_delivered_orders is 8 which can be misleading since it represents the total number of high-valued orders, not the number of unique customers who placed them. See full sample queries in the gist

https://gist.github.com/GeorgeAp/50180cf680450241200c34c1be0e8d49

While Elasticsearch later introduced features like nested fields, has_parent/ has_child relationships and enrich policies to address this, these features came with limitations. Nested fields could become complex, parent-child relationships could impact query performance and enrich policies involve separate processing.

Ultimately, these solutions hamper scalability for large data volumes and limit flexibility for diverse data and use cases.

Siren Federate overcomes these limitations by enabling direct join of separate indices within our Elasticsearch queries, including within aggregations, unlocking cross-index segmentation and nested filtering capabilities.

The Challenge

Our goal is to perform a deep customer segmentation. We want to analyse customer purchase funnels by:

Filtering customers from the US who signed up to our platform using Facebook.
Joining this segment with the orders index to focus only on customers who have placed orders. [using agg 1]
Using nested joins within aggregations to gain deeper insights:

*Count customers with delivered orders over $100 [using agg 2]

*Count customers with cancelled orders [using agg 3]

The Siren Federate Solution: An Adapted Query

The technical challenge lies in the fact that joins within aggregations are not part of the top-level Elasticsearch query, but are defined inside the aggregation object.

Siren Federate enables join in aggregation by pushing down the filters from the join context into the parent query’s filter context. This approach is different from predicate pushdown in a search query where filters are directly applied at the query level.

Here is how a Siren Federate query can achieve this:

{
  "query": {
	"join": {
  	"indices": [
    	"orders"
  	],
  	"on": ["customer_id","customer_id"]
	}
  },
  "aggregations": {
	"customers_with_orders": {               agg_1
  	"filter": {
    	"bool": {
      	"filter": [
        	{
          	"term": {
            	"customer_country": {
              	"value": "US"
            	}
          	}
        	},
        	{
          	"term": {
            	"acquisition_channel": {
              	"value": "facebook"
            	}
          	}
        	},
        	{
          	"join": {            
            	"indices": ["orders"],
            	"on": ["customer_id","customer_id"],
            	"request": {
              	"query": {
                	"term": {
                  	"order_status": {
                    	"value": "delivered"
                  	}
                	}
              	}
            	}
          	}
        	}
      	]
    	}
  	},

"aggregations": {
    	"high_value_orders": {                    agg_2
      	"filter": {
        	"join": {
          	"indices": ["orders"],
          	"on": ["customer_id","customer_id"],
          	"request": {
            	"query": {
              	"range": {
                	"order_total_amount": {
                  	"gte": 100
                	}
              	}
            	}
          	}
        	}
      	}
    	},
    	"cancelled_orders": {                    agg_3
      	"filter": {
        	"join": {
          	"indices": [
            	"orders"
          	],
          	"on": ["customer_id","customer_id"],
          	"request": {
            	"query": {
              	"term": {
                	"order_status": {
                  	"value": "cancelled"
                	}
              	}
            	}
          	}
        	}
      	}
    	}
  	}
	}
  }
}

In this query:

The outer join clause initially brings together customers and orders based on the customer_id fields. This top-level join connects the two indices based on these matching fields.
Inside the aggregations, we first filter for US customers who signed up via Facebook. This occurs within the customers_with_orders aggregation using term queries on customer_country and acquisition_channel.
Then, a nested join within the customers_with_orders aggregation ensures we only consider customers who have placed at least one delivered order.
Further nested join clauses within the high_value_orders and cancelled_orders aggregations allow us to count delivered orders over $100 and cancelled orders for this specific customer segment.

These subsequent nested joins within high_value_orders and cancelled_orders use customer_id to further analyze the orders based on their total amount and status, respectively.

The results would provide valuable insights into our customer segment’s purchasing behaviour.

Result

{
  "aggregations": {
	"customers_with_orders": {
  	"doc_count": 2,
  	"cancelled_orders": {
    	"doc_count": 1
  	},
  	"high_value_orders": {
    	"doc_count": 2
  	}
	}
  }
}

Explanation of the results:

The filter aggregation at the top level of the aggregations in the query is designed to include customers who meet the following criteria:

country is US
signup_source is facebook
They have at least one corresponding document in the orders index due to the initial join query

Looking at our sample data:

cust001 (US facebook) has multiple orders
cust005 (US facebook) has multiple orders
No other customers meet both the country and signup source criteria.

Therefore, the doc_count for customers_with_orders is expected to be 2.

high_value_orders: This is a nested filter aggregation within customers_with_orders. It further filters the customers who meet the initial criteria to count only those who have at least one order with an amount greater than or equal to 100 and a status of delivered.

cust001 has delivered orders with amounts $150.00, $130.40, $105.50, $180.00, and $115.80, which are all over $100.
cust005 has delivered orders with amounts $125.75, $140.25, and $152.30, which are all over $100.

Thus, the doc_count for high_value_orders is expected to be 2.

cancelled_orders: This is another nested filter aggregation within customers_with_orders . It filters for customers who have at least one order with a status of cancelled:

cust001 has no orders with the status cancelled.
cust005 has orders ord008 and ord028 with the status cancelled.

Therefore, the doc_count for cancelled_orders is expected to be 1.

This output demonstrates the power of using joins within aggregations to perform cross-index segmentation and nested filtering to analyze customer behavior based on data residing in different indices.

Why Joins in Aggregations are beneficial

Using joins within Elasticsearch aggregations, powered by Siren Federate unlocks several key benefits:

Cross-Index Segmentation: Easily analyse behaviours across related entities stored in different indices. For example, understanding which types of customers tend to cancel their orders.
Nested Filtering: Apply joins inside filter aggregations to evaluate complex business KPIs, such as tracking conversion rates or average order values by specific customer segments.
Selective Aggregation: Aggregate only documents that meet the joined conditions. This is incredibly useful for use cases like fraud detection (joining customer data with transaction anomalies) and churn analysis (joining customer data with support tickets).
Real-world Successful Deployment: Joins in Aggregations is a proven and successfully deployed feature, for instance, by Apollo on over 350 nodes cluster, as highlighted in their tech blog https://www.apollo.io/tech-blog/cross-entity-searching and a Siren case study https://siren.io/apollo-io-elevates-enterprise-search-with-siren/.

Conclusion

Siren Federate’s ability to perform joins within Elasticsearch aggregations provides a powerful and efficient way to conduct complex analytical use cases involving data across multiple indices. By enabling cross-index segmentation, nested filtering, and selective aggregation, it empowers users to gain deeper insights from their data and address critical business questions. This e-commerce example demonstrates just one of the many scenarios where this capability can provide significant value.

The dataset and queries used in the blog post are in the gist below:

Vanilla Elasticsearch

https://gist.github.com/GeorgeAp/50180cf680450241200c34c1be0e8d49

With Siren Federate Join

https://gist.github.com/GeorgeAp/3b3ee29ccc17c2e330f376b0d7be6bef

Note:

This blog post highlights Siren Federate’s ability to perform joins within Elasticsearch aggregations, which therefore provides a powerful and efficient way to conduct complex analytical use cases involving data across multiple indices. By enabling cross-index segmentation, nested filtering, and selective aggregation, Siren Federate empowers users to gain deeper insights from their data and address critical business questions. Join in Elasticsearch aggregations has been available since Siren Federate version 36.