External Hazelcast Queries

We can send an SQL query to Hazelcast and get results, a great feature.

But did you know the data doesn’t have to be stored in Hazelcast to be queryable via Hazelcast?

It could be somewhere like a Kafka topic or a CSV file, and you can still query it using Hazelcast’s SQL.

This makes Hazelcast a “one-stop-shop” for all your querying needs.

Example

Imagine this query:

SELECT * FROM xyz

Where is the data stored that it is querying? An IMap? Somewhere else?

The answer, if you are a user, is that you don’t care!

All you are interested in really is the results, not where the data is stored.

Hazelcast knows where the data is. The data may be in Hazelcast or elsewhere such as a Kafka topic. What matters is if Hazelcast can locate the data to run the query against it.

Background

While it’s easy for the user to query external things via Hazelcast, there are two caveats to be aware of.

Mappings

Firstly, there needs to be a definition of the external source so that Hazelcast knows how to find it and work with it.

For example:

CREATE EXTERNAL MAPPING IF NOT EXISTS kf_trades EXTERNAL NAME kf_trades
(  id             VARCHAR,
   price          BIGINT,
   quantity       BIGINT,
   symbol         VARCHAR,
   "timestamp"    BIGINT
)
  TYPE Kafka  
OPTIONS
(  'keyFormat' = 'java', 'keyJavaClass' = 'java.lang.String',
   'valueFormat' = 'json',
   'auto.offset.reset' = 'earliest',
   'bootstrap.servers' = 'kafka-broker0:9092,kafka-broker1:9093,kafka-broker2:9094'
)

This defines a Kafka topic called “kf_trades“. It has field elements in a JSON value – id, price, quantity, symbol and timestamp. We also need to confirm the location of the Kafka brokers, kafka-broker0:9092,kafka-broker1:9093,kafka-broker2:9094.

So not exactly difficult, and a one-off task to submit the “CREATE MAPPING” Sql command to the Sql engine when the cluster. starts.

Performance

Secondly, the external source won’t be as fast as Hazelcast. So the end-user may have to wait a bit for their results. Just don’t blame Hazelcast here.

Now This Is An Example

Now let’s stretch the imagination with this query:

SELECT k.id, k.symbol, k."timestamp", s.* FROM kf_trades AS k LEFT JOIN symbols AS s ON k.symbol = s.__key

Now what we’re doing is a join on the Kafka topic defined above and “symbols” which happens to be an IMap.

So we can take a sequence of Stock Market trades on a Kafka topic and enrich them with information about the stock symbol to form our result.

ID                                    ,  SYMBOL  ,  TIMESTAMP      ,  __KEY  ,  FINANCIALSTATUS  ,  MARKETCATEGORY        ,  SECURITYNAME                                                  
4a696e0c-45e3-4c88-8d6f-a5bc2c966966  ,  LULU    ,  1624528869786  ,  LULU   ,  NORMAL           ,  GLOBAL_SELECT_MARKET  ,  lululemon athletica inc. - Common Stock                       
9b25cb15-9b94-4578-815b-a9db035f4da9  ,  BSRR    ,  1624528873366  ,  BSRR   ,  NORMAL           ,  GLOBAL_SELECT_MARKET  ,  Sierra Bancorp - Common Stock                                 
b0c3b163-4122-4e49-81a3-25f32c60eab1  ,  VDSI    ,  1624528883836  ,  VDSI   ,  NORMAL           ,  CAPITAL_MARKET        ,  VASCO Data Security International, Inc. - Common Stock        
0f5426ab-86cf-4e5a-b60b-6369c2470432  ,  AMDA    ,  1624528886519  ,  AMDA   ,  DEFICIENT        ,  CAPITAL_MARKET        ,  Amedica Corporation - Common Stock                            
6f6b2f74-7f48-414c-bc82-5aed441c0c35  ,  ASBB    ,  1624528887603  ,  ASBB   ,  NORMAL           ,  GLOBAL_MARKET         ,  ASB Bancorp, Inc. - Common Stock                              
c6b4d97d-18ec-480b-b1dd-20491713dc82  ,  TANNI   ,  1624528879335  ,  TANNI  ,  NORMAL           ,  GLOBAL_SELECT_MARKET  ,  TravelCenters of America LLC - 8.25% Senior Notes due 2028    

So our trades have a stock symbol, and by joining on the map with symbol information, we can find out the status and name of the stock.

Some input is external here (Kafka) and some internal (IMap), but all the end-user need to concern themselves with is the output.

Or CSV

Or how about a CSV file?

It’s common to pass data from system to system and not normally something you can query easily. Now you can. All you do is tell Hazelcast where to find it, somewhere all members of the cluster can reach.

For this CSV file:

English,French,Spanish,Italian
one,un,uno,uno
two,deux,dos,due
three,trois,tres,tre
four,quatre,cuatro,quattro

With this definition:

CREATE MAPPING csv (  English VARCHAR, French VARCHAR, Spanish VARCHAR, Italian VARCHAR )  TYPE File  OPTIONS (  'format' = 'csv', 'path' = '/tmp/blog' )

We can run this query:

SELECT English, French FROM csv

And get the results we expect:

one un
four quatre
two deux
three trois

Of course, we may want to join our CSV file with a Hazelcast map.

Our map has two pairs of Strings, for French and German.

hazelcastInstance.getMap("map").set("deux", "zwei");
hazelcastInstance.getMap("map").set("trois", "drei");

So now our query becomes:

SELECT c.*, m.* FROM csv AS c LEFT JOIN map AS m ON c.French = m.__key

And our result:

one un uno uno null null
four quatre cuatro quattro null null
two deux dos due deux zwei
three trois tres tre trois drei

Two rows have nulls because it’s a LEFT JOIN.

Ordering

Did you spot the order of the query results wasn’t the same as the order of lines in the file?

That’s because Hazelcast is distributed.

It’s easily fixed if you need it, with the ORDER BY clause.

SELECT English, French FROM csv ORDER BY 1

Remember also this is the order of the output. We don’t care about the order of the input.

Summary

Hazelcast allows you to query data in its stores and data elsewhere.

This is a great convenience for the end-user.

We can query different things such as Kafka and Hazelcast from one place.

The end-user doesn’t need to know (and may not care) the source of the data.

After all, it’s the results of the query we want, not the input.

Keep Reading