In a previous post we saw how to use Elasticsearch to search for our dream job among the ones posted on Hacker News.
Since Elasticsearch queries are basically JSON, it’s really easy to lose track when we start nesting them.
In this post we are going to define a Python class that will create the required query (read: JSON) on demand.
Anatomy of an Elasticsearch query
A query in Elasticsearch in its general form is:
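A sketch of that general shape, written as a Python dict so it can be dumped to JSON. The empty lists are placeholders for clauses, and the top-level `query`/`bool` wrapper is the usual bool-query layout, assumed here because it matches the rest of the post.

```python
import json

# General shape of a boolean query; each list holds zero or more clauses.
general_form = {
    "query": {
        "bool": {
            "must": [],      # AND
            "should": [],    # OR
            "must_not": [],  # AND NOT
        }
    }
}

print(json.dumps(general_form, indent=2))
```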
Where must, should and must_not correspond to the AND, OR and AND NOT operators respectively.
To use an earlier example, the following query will require both ‘san francisco’ and ‘machine learning’ to be present in the text field.
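As a sketch, that query could look like the dict below. The `match_phrase` clause type is an assumption made for these examples; a plain `match` clause has the same shape.

```python
import json

# Both phrases must be present in the `text` field (AND).
# `match_phrase` is assumed here; swap in `match` if that suits your index.
query = {
    "query": {
        "bool": {
            "must": [
                {"match_phrase": {"text": "san francisco"}},
                {"match_phrase": {"text": "machine learning"}},
            ]
        }
    }
}

print(json.dumps(query, indent=2))
```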
So far so good. But what if we want to nest a boolean query inside another boolean query?
For example, what if we want to find something that matches (‘san francisco’ OR ‘bay area’) AND (‘machine learning’ OR ‘data analysis’)?
The resulting query will be something like this:
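A sketch of the nested version, again as a Python dict and again assuming `match_phrase` clauses on a `text` field. Each OR group becomes its own `bool`/`should` clause inside the outer `must`.

```python
import json

nested_query = {
    "query": {
        "bool": {
            "must": [
                # ('san francisco' OR 'bay area')
                {"bool": {"should": [
                    {"match_phrase": {"text": "san francisco"}},
                    {"match_phrase": {"text": "bay area"}},
                ]}},
                # ('machine learning' OR 'data analysis')
                {"bool": {"should": [
                    {"match_phrase": {"text": "machine learning"}},
                    {"match_phrase": {"text": "data analysis"}},
                ]}},
            ]
        }
    }
}

print(json.dumps(nested_query, indent=2))
```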
You can see how it’s easy to get lost here.
Python to the rescue!
We want to represent the queries in a more human friendly way.
Before we start coding, let’s design the class. This will save us so much time afterwards. What are the requirements, i.e. what is the expected behavior of our class (and its methods)?
it should have arguments that correspond to the three boolean operators
each operator can accept one or more terms
we need to be able to specify which field we want to search in
each term can be another query
Let’s write down some obvious use cases.
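A sketch of what those use cases might look like. The `Query` class below is only a placeholder that stores its arguments, and the `'query'` key used for inner queries in `query_3` is a hypothetical choice for illustration.

```python
class Query:
    """Placeholder: stores the three operators, no conversion logic yet."""

    def __init__(self, must=None, should=None, must_not=None):
        self.must = must
        self.should = should
        self.must_not = must_not


# one field, one term
query_1 = Query(must={'text': 'san francisco'})

# one field, several terms
query_2 = Query(should={'text': ['san francisco', 'bay area']})

# terms that are themselves queries (the 'query' key is hypothetical)
query_3 = Query(must={'query': [
    query_2,
    Query(should={'text': ['machine learning', 'data analysis']}),
]})
```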
Most important of all, our class needs to have a to_elasticsearch method to produce the desired json on demand.
Let’s start coding something that would work for the first query, and then let’s improve on that.
Since we are in the easy case, we can assume that the values passed are dicts.
In order to build the ES query, we then need to figure out which arguments have been passed (i.e. are not None) and put them in the query.
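A minimal sketch of this first version. It assumes each argument is a dict holding one field and one string, and it uses a `match_phrase` clause, which is an assumption rather than something the post fixes.

```python
class Query:
    def __init__(self, must=None, should=None, must_not=None):
        self.must = must
        self.should = should
        self.must_not = must_not

    def to_elasticsearch(self):
        operators = {
            'must': self.must,
            'should': self.should,
            'must_not': self.must_not,
        }
        query = {}
        for name, value in operators.items():
            # only the operators that were actually passed end up in the query
            if value is not None:
                # value is assumed to be a dict such as {'text': 'san francisco'}
                query[name] = {'match_phrase': value}
        return {'query': {'bool': query}}


query_1 = Query(must={'text': 'san francisco'})
print(query_1.to_elasticsearch())
# {'query': {'bool': {'must': {'match_phrase': {'text': 'san francisco'}}}}}
```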
So far so good. But what if the same field has multiple values, as in query_2?
Our function needs to adapt. For one, we need to use the same key for all the values. So in the example above the key text has to be applied to both 'san francisco' and 'bay area'.
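One way to sketch that adaptation: loop over the fields of each argument and emit one clause per value, reusing the field name as the key. The class layout and the `match_phrase` clause type are assumptions carried through these examples.

```python
class Query:
    def __init__(self, must=None, should=None, must_not=None):
        self.must = must
        self.should = should
        self.must_not = must_not

    def to_elasticsearch(self):
        operators = {
            'must': self.must,
            'should': self.should,
            'must_not': self.must_not,
        }
        query = {}
        for name, argument in operators.items():
            if argument is None:
                continue
            for field_name, field_values in argument.items():
                # the same key (field_name) is applied to every value
                query[name] = [
                    {'match_phrase': {field_name: v}} for v in field_values
                ]
        return {'query': {'bool': query}}


query_2 = Query(should={'text': ['san francisco', 'bay area']})
print(query_2.to_elasticsearch())
```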
Now we’ve got it working for lists. What happens if we try a mixed case, with a list of values for one field and a single string for another?
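A hypothetical input of this kind, with illustrative field names and values (`text` holds a list, `title` a bare string):

```python
# Hypothetical mixed input: a list for `text`, a single string for `title`.
mixed_case = {
    'must': {
        'text': ['san francisco', 'bay area'],
        'title': 'python',
    }
}
# It would be passed to the class as Query(**mixed_case).
```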
There are 2 things that will break:
query[name] = ... will overwrite previous results (in our case the title field will overwrite the text one)
the for v in field_values part will not behave as expected (e.g. it will unpack a string into single characters)
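Both failures can be reproduced with a plain loop, without the class. This is a sketch using a hypothetical mixed input and a `match_phrase` clause type, both of which are assumptions made for illustration.

```python
# Hypothetical mixed input: a list for `text`, a single string for `title`.
must = {'text': ['san francisco', 'bay area'], 'title': 'python'}

query = {}
for field_name, field_values in must.items():
    # plain assignment: each field replaces whatever was stored before it
    query['must'] = [{'match_phrase': {field_name: v}} for v in field_values]

print(len(query['must']))  # 6
print(query['must'][0])    # {'match_phrase': {'title': 'p'}}
```

The two `text` clauses are gone, and `'python'` has been split into six one-letter clauses.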
To fix the first problem, we can just make query a defaultdict and extend it. To avoid problems communicating with the Elasticsearch client, we will convert back to a regular dict before returning.
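A minimal sketch of that fix, shown on a single operator. The input already has both values in lists, since the string problem is dealt with separately; the field names and the `match_phrase` clause type are assumptions.

```python
from collections import defaultdict

must = {'text': ['san francisco', 'bay area'], 'title': ['python']}

# missing keys start out as empty lists, so clauses can be accumulated
query = defaultdict(list)
for field_name, field_values in must.items():
    query['must'].extend(
        {'match_phrase': {field_name: v}} for v in field_values
    )

# convert back to a regular dict before handing it to the client
result = {'query': {'bool': dict(query)}}
print(result)
```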
The most elegant way of solving the second problem is to transform the input into a standard form. That way our to_elasticsearch method will be independent of the original input form.
For each (non null) argument, we want to make sure that its values are wrapped in a list.
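Spelled out step by step, this could look like the sketch below. The helper name `wrap_values` is hypothetical.

```python
def wrap_values(argument):
    """Return a copy of `argument` in which every value is a list."""
    if argument is None:
        # nothing was passed for this operator: keep it as None
        return None
    wrapped = {}
    for field_name, value in argument.items():
        if isinstance(value, list):
            wrapped[field_name] = value
        else:
            wrapped[field_name] = [value]
    return wrapped


print(wrap_values({'text': ['san francisco', 'bay area'], 'title': 'python'}))
# {'text': ['san francisco', 'bay area'], 'title': ['python']}
```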
Or, in a more compact way:
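A sketch of the compact form, using a dict comprehension. As before, `wrap_values` is a hypothetical helper name.

```python
def wrap_values(argument):
    """Return a copy of `argument` in which every value is a list."""
    if argument is None:
        return None
    return {
        field_name: value if isinstance(value, list) else [value]
        for field_name, value in argument.items()
    }


print(wrap_values({'title': 'python'}))  # {'title': ['python']}
```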
Some stuff to note here:
if the field is None, we want to keep it that way, not wrap it into a list. Returning {} would be fine too
[value] is not the same as list(value)
query_1 output is now changed (there is an extra list), but it’s still a valid ES query, and it returns the same result.
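The difference between `[value]` and `list(value)` is easy to check in isolation:

```python
value = 'python'

print([value])      # ['python']
print(list(value))  # ['p', 'y', 't', 'h', 'o', 'n']
```

Wrapping keeps the string whole, while `list()` iterates over it character by character.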
Let’s do this step by step. First of all, we want the method to_elasticsearch to expand any inner queries, if present. Since inner queries are still instances of Query we can call the same method on them.
Therefore, we need to distinguish between a real Query object and just a query term.
The minimal edit to the previous code will be something like this:
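A sketch of that edit, assuming a class built along the lines described so far: values wrapped in lists, a defaultdict for accumulation, and `match_phrase` as the assumed clause type. The only new part is the `isinstance` check.

```python
from collections import defaultdict


class Query:
    def __init__(self, must=None, should=None, must_not=None):
        self.must = self._wrap(must)
        self.should = self._wrap(should)
        self.must_not = self._wrap(must_not)

    @staticmethod
    def _wrap(argument):
        # make sure every value is a list; None stays None
        if argument is None:
            return None
        return {k: v if isinstance(v, list) else [v]
                for k, v in argument.items()}

    def to_elasticsearch(self):
        operators = {
            'must': self.must,
            'should': self.should,
            'must_not': self.must_not,
        }
        query = defaultdict(list)
        for name, argument in operators.items():
            if argument is None:
                continue
            for field_name, field_values in argument.items():
                for v in field_values:
                    if isinstance(v, Query):
                        # an inner query expands itself; field_name is ignored
                        query[name].append(v.to_elasticsearch())
                    else:
                        query[name].append({'match_phrase': {field_name: v}})
        return {'query': {'bool': dict(query)}}
```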
Just to reiterate: everything is the same as before, except we now check if we encounter another instance of Query. If that’s the case, the instance itself will take care of transforming its portion of the query, which might cause another instance to be found…
Also note that if there is another instance of Query, we just ignore the field_name variable. This means that you could pass the internal query as {'foo':query_2}.
Unfortunately this does not work as expected:
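A sketch that reproduces the output. The class is a compact version of the recursive approach described above; its layout, the `match_phrase` clause type and the `'query'` key used for the inner queries are all assumptions.

```python
from collections import defaultdict


class Query:
    def __init__(self, must=None, should=None, must_not=None):
        self.operators = {'must': must, 'should': should, 'must_not': must_not}

    def to_elasticsearch(self):
        query = defaultdict(list)
        for name, argument in self.operators.items():
            for field_name, values in (argument or {}).items():
                values = values if isinstance(values, list) else [values]
                for v in values:
                    if isinstance(v, Query):
                        query[name].append(v.to_elasticsearch())
                    else:
                        query[name].append({'match_phrase': {field_name: v}})
        return {'query': {'bool': dict(query)}}


query_2 = Query(should={'text': ['san francisco', 'bay area']})
query_3 = Query(should={'text': ['machine learning', 'data analysis']})
nested = Query(must={'query': [query_2, query_3]})

# look at the first clause inside the outer `must`
print(nested.to_elasticsearch()['query']['bool']['must'][0])
# {'query': {'bool': {'should': [{'match_phrase': {'text': 'san francisco'}}, {'match_phrase': {'text': 'bay area'}}]}}}
```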
Can you spot the mistake? The key query appears twice, so the previous output is not a valid ES query.