Back
Blog 5 May 2026

How to Use Siren Federate Join in Elasticsearch Aggregations

Author: George Apaaboah
Author George Apaaboah
How to Use Siren Federate Join in Elasticsearch Aggregations

Introduction

Aggregations in Elasticsearch are fundamental for gaining insights from data by grouping and computing metrics, statistics, and other analytics. Analysts often need to perform analytics that require connecting and analyzing data that are located in separate Elasticsearch indices.

A common need in analytics is to segment or aggregate data from one source (e.g., customers stored in a customers index) based on contextual attributes stored in another source (e.g., their orders stored in an orders index). For instance, analysts may want to group customers by the status of their purchases (like delivered or cancelled), which is information stored in the orders index, not in the customers index itself.

However, natively performing this kind of analysis within Elasticsearch aggregations across different indices to enhance these insights has traditionally been complex. The challenge lies in the lack of native support for expressing cross-index joins directly within Elasticsearch aggregation contexts.

Due to Elasticsearch’s document-oriented nature and the typical denormalization of documents at index time, cross-index relationships are difficult to express and manage efficiently within the aggregations. This makes such analytical use cases either complex or inefficient to implement with standard features.

Starting from version 36, Siren Federate addresses this challenge by extending Elasticsearch aggregation to allow joins to be expressed within aggregations. This means that aggregation buckets can now be defined based on the outcome of join operations between documents in different indices, allowing for a much richer set of aggregation patterns that reflects real-world data models.

This feature enables new analytical patterns such as cross-index segmentation, aggregation by related properties and more accurate and context-aware metrics – all within a single Elasticsearch query. 

“Think of it as bringing the flexibility of SQL-style joins into the world of Elasticsearch aggregations, where metrics can now be grouped and filtered based on relationships across multiple datasets.”

This blog post will explore how Siren Federate’s join capabilities within aggregation works and demonstrate a practical e-commerce scenario.

The Scenario

An e-commerce platform wants to gain insights into its customer purchase funnel to optimize marketing efforts and improve sales. Specifically, they need to analyze the behaviour of customers from different acquisition channels like Facebook and track their journey from signup to order placement, including the value and status of those orders. This requires the ability to connect and analyze the customer profiles and order events, which are stored in separate data indices, to answer questions like:

By addressing this problem, the e-commerce platform can achieve cross-index segmentation to understand customer behaviour across related entities, enable nested filtering to evaluate key performance indicators like conversion rates by customer segment, and perform selective aggregation to focus on specific subsets of data. 

Lets start with an example, customer profiles are stored in a customers index, and order events are recorded in an orders index

Customers index sample data
{
  "customer_id": "cust123",
  "customer_country": "US",
  "acquisition_channel": "facebook"
}
Orders index sample data
{
  "order_id": "order789",
  "customer_id": "cust123",
  "order_total_amount": 120.5,
  "order_status": "delivered"
}

Table 1: Sample Data for 5 Customers

customer_idcustomer_countryacquisition_channel
cust001USfacebook
cust002CAgoogle
cust002USemail
cust004UKfacebook
cust005USfacebook

Table 2: Sample Data for 5 Orders

order_idcustomer_idorder_total_amountorder_status
ord001cust001150.00delivered
ord002cust00375.50cancelled
ord003cust005125.75delivered
ord004cust00190.20delivered
ord005cust007200.00delivered
Order index mappings
{
  "mappings": {
    "properties": {
      "order_id": {
        "type": "keyword"
      },
      "customer_id": {
        "type": "keyword"
      },
      "order_total_amount": {
        "type": "double"
      },
      "order_status": {
        "type": "keyword"
      },
      "customer_id_0": {
        "type": "keyword"
      },
      "customer_id_1": {
        "type": "keyword"
      },
      "customer_id_2": {
        "type": "keyword"
      },
      "customer_key": {
        "type": "keyword"
      }
    }
  }
}
Results
"aggregations" : {
  "total_customers_with_orders" : { "value" : 2 },
  "customers_with_high_value_delivered_orders" : {
    "doc_count" : 8,
    "distinct_customers" : { "value" : 2 }
  },
  "customers_with_cancelled_orders" : {
    "doc_count" : 1,
    "distinct_customers" : { "value" : 1 }
  }
}

In a standard Elasticsearch setup, correlating customer attributes with their order history for aggregated analysis historically lacked direct join capabilities. Achieving this typically involves significant data manipulation, potentially requiring data denormalization by putting related data in the same index, or implementing complex application-level logic for joining data after retrieval. In the example above, this can lead to data duplication, especially when a single customer has many matching orders.

If aggregations are performed without deduplication, it may result in wrong aggregation counts. From the example above, we need two queries; one for search on the customers index and the other for search on the orders index.

We will have to use the correct customer_ids which are error prone and do not scale. From the results, we can see that the total doc_count for customers_with_high_value_delivered_orders is 8 which can be misleading since it represents the total number of high-valued orders, not the number of unique customers who placed them. See full sample queries in the gist

https://gist.github.com/GeorgeAp/50180cf680450241200c34c1be0e8d49

While Elasticsearch later introduced features like nested fields, has_parent/ has_child relationships and enrich policies to address this, these features came with limitations. Nested fields could become complex, parent-child relationships could impact query performance and enrich policies involve separate processing.

Ultimately, these solutions hamper scalability for large data volumes and limit flexibility for diverse data and use cases.

Siren Federate overcomes these limitations by enabling direct join of separate indices within our Elasticsearch queries, including within aggregations, unlocking cross-index segmentation and nested filtering capabilities.

The Challenge

Our goal is to perform a deep customer segmentation. We want to analyse customer purchase funnels by:

  1. Filtering customers from the US who signed up to our platform using Facebook.
  2. Joining this segment with the orders index to focus only on customers who have placed orders. [using agg 1]
  3. Using nested joins within aggregations to gain deeper insights: 

*Count customers with delivered orders over $100 [using agg 2]

*Count customers with cancelled orders [using agg 3]

The Siren Federate Solution: An Adapted Query

The technical challenge lies in the fact that joins within aggregations are not part of the top-level Elasticsearch query, but are defined inside the aggregation object.

Siren Federate enables join in aggregation by pushing down the filters from the join context into the parent query’s filter context. This approach is different from predicate pushdown in a search query where filters are directly applied at the query level. 

Here is how a Siren Federate query can achieve this:

{
  "query": {
	"join": {
  	"indices": [
    	"orders"
  	],
  	"on": ["customer_id","customer_id"]
	}
  },
  "aggregations": {
	"customers_with_orders": {               agg_1
  	"filter": {
    	"bool": {
      	"filter": [
        	{
          	"term": {
            	"customer_country": {
              	"value": "US"
            	}
          	}
        	},
        	{
          	"term": {
            	"acquisition_channel": {
              	"value": "facebook"
            	}
          	}
        	},
        	{
          	"join": {            
            	"indices": ["orders"],
            	"on": ["customer_id","customer_id"],
            	"request": {
              	"query": {
                	"term": {
                  	"order_status": {
                    	"value": "delivered"
                  	}
                	}
              	}
            	}
          	}
        	}
      	]
    	}
  	},
"aggregations": {
    	"high_value_orders": {                    agg_2
      	"filter": {
        	"join": {
          	"indices": ["orders"],
          	"on": ["customer_id","customer_id"],
          	"request": {
            	"query": {
              	"range": {
                	"order_total_amount": {
                  	"gte": 100
                	}
              	}
            	}
          	}
        	}
      	}
    	},
    	"cancelled_orders": {                    agg_3
      	"filter": {
        	"join": {
          	"indices": [
            	"orders"
          	],
          	"on": ["customer_id","customer_id"],
          	"request": {
            	"query": {
              	"term": {
                	"order_status": {
                  	"value": "cancelled"
                	}
              	}
            	}
          	}
        	}
      	}
    	}
  	}
	}
  }
}

In this query:

These subsequent nested joins within high_value_orders and cancelled_orders use customer_id to further analyze the orders based on their total amount and status, respectively.

The results would provide valuable insights into our customer segment’s purchasing behaviour.

Result

{
  "aggregations": {
	"customers_with_orders": {
  	"doc_count": 2,
  	"cancelled_orders": {
    	"doc_count": 1
  	},
  	"high_value_orders": {
    	"doc_count": 2
  	}
	}
  }
}

Explanation of the results:

The filter aggregation at the top level of the aggregations in the query is designed to include customers who meet the following criteria:

Looking at our sample data:

Therefore, the doc_count for customers_with_orders is expected to be 2.

high_value_orders: This is a nested filter aggregation within customers_with_orders. It further filters the customers who meet the initial criteria to count only those who have at least one order with an amount greater than or equal to 100 and a status of delivered.

Thus, the doc_count for high_value_orders is expected to be 2.

cancelled_orders: This is another nested filter aggregation within customers_with_orders . It filters for customers who have at least one order with a status of cancelled:

Therefore, the doc_count for cancelled_orders is expected to be 1.

This output demonstrates the power of using joins within aggregations to perform cross-index segmentation and nested filtering to analyze customer behavior based on data residing in different indices.

Why Joins in Aggregations are beneficial

Using joins within Elasticsearch aggregations, powered by Siren Federate unlocks several key benefits: 

Conclusion

Siren Federate’s ability to perform joins within Elasticsearch aggregations provides a powerful and efficient way to conduct complex analytical use cases involving data across multiple indices. By enabling cross-index segmentation, nested filtering, and selective aggregation, it empowers users to gain deeper insights from their data and address critical business questions. This e-commerce example demonstrates just one of the many scenarios where this capability can provide significant value.

The dataset and queries used in the blog post are in the gist below: 

Vanilla Elasticsearch

https://gist.github.com/GeorgeAp/50180cf680450241200c34c1be0e8d49

With Siren Federate Join

https://gist.github.com/GeorgeAp/3b3ee29ccc17c2e330f376b0d7be6bef

Note:

This blog post highlights Siren Federate’s ability to perform joins within Elasticsearch aggregations, which therefore provides a powerful and efficient way to conduct complex analytical use cases involving data across multiple indices. By enabling cross-index segmentation, nested filtering, and selective aggregation, Siren Federate empowers users to gain deeper insights from their data and address critical business questions. Join in Elasticsearch aggregations has been available since Siren Federate version 36.

OTHER AREAS

Explore our topics

Close