Real-Time Stream Processing in Jupyter Notebook with Hazelcast Viridian Serverless and Python Client

Fawaz Ghali | Sep 21, 2022

We previously explained how to get started with the Hazelcast Python client. In this tutorial, we show how to use the Python client in Jupyter Notebook, with a notebook that demonstrates the client’s SQL support. Hazelcast provides in-depth SQL support for Map data structures kept in Hazelcast clusters: you create a mapping between your data and a database table, then execute SQL queries on the Map. This gives you fast in-memory computing with SQL, without writing complex functions that iterate through your maps.

Throughout this tutorial, you can use either a local cluster or Hazelcast Viridian. We use Hazelcast Viridian as our cluster provider so that we do not have to worry about setup or installation. Hazelcast Viridian Serverless offers free registration with 2GiB of storage. You can also run this notebook in a Google Colaboratory environment without dealing with local installations.

We look forward to your feedback and comments about this blog post. Don’t hesitate to share your experience with us in our community Slack or GitHub repository.

Why run Hazelcast in Jupyter Notebook?

Jupyter Notebook offers an all-in-one, web-based interactive environment (it runs in the web browser) that you can use to combine Python code, formatted text, animated images and graphs, videos, mathematical equations, plots, maps, and interactive figures, all in a single document. It is also easy to share and collaborate on because it uses structured text formats, and it offers extensions for a wide range of data science and ML projects. Hazelcast, for its part, is a real-time data processing platform that provides a simple way to evaluate and process your data quickly. Using the Hazelcast SQL engine, you can skip the low-level details and work directly on the value of your customer’s data. You can extract a lot of information without dealing with hundreds of lines of code or slow executions.

Setup Section

Notebook version and setup

To run the notebook file locally on your computer, you must install Jupyter Notebook by following its installation guide. We used Notebook version 6.4.12, but the exact version is not important for our setup as long as it is up to date. Instead of installing Jupyter Notebook locally, you can use the Google Colaboratory environment, which provides a virtual machine for executing IPython notebook files. When you open the notebook link, you can save a copy of the notebook file to your Google Drive and work on a virtual machine created for you. Another advantage of Colaboratory is that you do not have to worry about package installations, because they are confined to this virtual environment. We will explain the Hazelcast setup later, but note that the Google Colaboratory version connects to the Hazelcast Viridian cloud service, since Hazelcast is not installed on Google’s servers. If you have already installed Hazelcast and Jupyter Notebook locally, we recommend running the notebook on your local machine for better CPU performance. If you are new to Hazelcast and want to learn, use Google Colaboratory and Hazelcast Viridian Serverless to skip all local installation details.

API setup and how to replace with a different API

Instead of inserting data manually, we pull it from an API to simulate real-time use cases of Hazelcast. We used The Movie Database (TMDB) API to pull movie and actor data. TMDB, like most API providers, asks for an API key to validate and accept incoming requests to its servers. So, you need to create an account on its website and go to the Settings > API > Create a new API key section. It may ask you some questions about your project, and short answers like “Experimenting API requests” are enough. The notebook has a form section for the API key at the beginning of the file, where you paste your API key. The same process applies to other API providers too, so you can use other API sources for the notebook instead of the TMDB API. To load data from a different API, change the endpoint URLs under the “Load Data From API” section. Please keep in mind that all other queries and mappings are configured for the movie scenario. Of course, feel free to change them according to your API source and experiment with different scenarios via the easy-to-use interface of Hazelcast.
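
As a rough illustration of the loading step described above, here is a minimal sketch that builds a request URL and trims the response down to the fields used later in the SQL queries. It is not the notebook’s actual code. The base URL, the movie/popular endpoint, and the helper names build_url, extract_movies and fetch_movies are assumptions made for this example; replace them to match your API provider.

```python
import json
import urllib.parse
import urllib.request

# Assumption: TMDB v3 base URL. Replace it when you switch to another API.
BASE_URL = "https://api.themoviedb.org/3"

# Fields kept from each result; these match the columns queried later on.
MOVIE_FIELDS = ("id", "title", "vote_average", "vote_count",
                "popularity", "release_date")


def build_url(endpoint, api_key, page=1):
    """Return the request URL for one page of results (hypothetical helper)."""
    query = urllib.parse.urlencode({"api_key": api_key, "page": page})
    return f"{BASE_URL}/{endpoint.strip('/')}?{query}"


def extract_movies(payload, fields=MOVIE_FIELDS):
    """Map movie id -> trimmed record, assuming a top-level "results" list."""
    movies = {}
    for item in payload.get("results", []):
        # Missing fields become None instead of raising KeyError.
        movies[item["id"]] = {name: item.get(name) for name in fields}
    return movies


def fetch_movies(api_key, endpoint="movie/popular", page=1):
    """Download one page and return the trimmed records. Needs network access."""
    url = build_url(endpoint, api_key, page)
    with urllib.request.urlopen(url, timeout=10) as response:
        return extract_movies(json.load(response))
```

Keeping the URL building and the response trimming separate from the network call means that switching to another provider mostly comes down to changing BASE_URL, the endpoint, and the field list.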

Hazelcast Viridian Serverless setup

To use one of the six Hazelcast clients, you need a running Hazelcast cluster instance. You can either install Hazelcast locally and run a cluster on your localhost or connect to your Viridian Serverless cluster. To connect to your local cluster, remove the config options from the hazelcast.HazelcastClient(…) call inside the “Connect To Hazelcast Cluster” section. In that case, the client tries to connect to the localhost. Alternatively, Hazelcast Viridian Serverless is our service that provides running Hazelcast clusters in the cloud without any local installation. You can create an account and deploy a cluster with up to 2GiB of storage for free. After creating a cluster, select it and go to the Connect Client > Advanced Setup section. You will see the cluster name, discovery token and SSL password for your Hazelcast Viridian cluster. Run the “Hazelcast Viridian Authentication Tokens” cell to enter these values. It opens text boxes where you can paste them. Since Hazelcast Viridian Serverless requires a secure connection, the cell also asks you to select the ZIP file that contains the SSL certificates generated for your cluster. Select it in the file dialog that opens when you run the cell.

CLUSTER_NAME = "YOUR_CLUSTER_NAME"
DISCOVERY_TOKEN = "YOUR_DISCOVERY_TOKEN"
SSL_PASSWORD = "YOUR_SSL_PASSWORD"
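
To show how these three values might be passed to the client, here is a minimal sketch rather than the notebook’s exact cell. The keyword arguments are configuration options of the Hazelcast Python client. The helper names build_client_config and connect, and the certificate file names ca.pem, cert.pem and key.pem inside the extracted ZIP, are assumptions for this example, so check the contents of your own ZIP file.

```python
import os


def build_client_config(cluster_name, discovery_token, ssl_password, cert_dir):
    """Collect keyword arguments for hazelcast.HazelcastClient (hypothetical helper)."""
    return {
        "cluster_name": cluster_name,
        "cloud_discovery_token": discovery_token,
        "ssl_enabled": True,
        # Assumption: the extracted ZIP contains these three PEM files.
        "ssl_cafile": os.path.join(cert_dir, "ca.pem"),
        "ssl_certfile": os.path.join(cert_dir, "cert.pem"),
        "ssl_keyfile": os.path.join(cert_dir, "key.pem"),
        "ssl_password": ssl_password,
    }


def connect(config=None):
    """Start a client. With no config, it tries to connect to a local cluster."""
    import hazelcast  # imported here so the config helper works without the package

    return hazelcast.HazelcastClient(**(config or {}))
```

With a Viridian cluster, you would call connect(build_client_config(CLUSTER_NAME, DISCOVERY_TOKEN, SSL_PASSWORD, "certs")), where "certs" stands for the folder you extracted the ZIP into. For a local cluster, connect() with no arguments is enough.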

SQL queries

Now we have all the ingredients to use Hazelcast functionalities. Hazelcast supports querying your traditional map entries using SQL syntax. To make this possible, we need to create a mapping between your data and a map. These mapping queries are under the “Create Mapping between Map and Table” section. We inserted our data as HazelcastJsonValue, which is our serialization method for JSON objects. We can refer to these JSON fields directly during the mapping and assign them as table columns. There is no restriction on the mapping, so you don’t have to select all the fields. After creating the mappings for your maps, you can execute SQL queries on them using Hazelcast SQL functions. Most SQL features and functions are available in Hazelcast, so you can skip hand-written lookup functions and use SQL syntax to search the data on your map. We have provided some sample queries for you under the “Fun Part: SQL queries” section. Feel free to change them as you want and see how Hazelcast behaves. Since it is an IPython notebook, you can rerun a cell repeatedly without having to rerun the previous cells.

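# Titles of heavily voted, well-rated movies released before 2015, most popular first.
query = """
    SELECT m.title AS name
    FROM movies m
    WHERE m.vote_count > 20000 AND m.vote_average > 7 AND m.release_date < '2015-01-01'
    ORDER BY m.popularity DESC
"""

# execute() returns a future, and result() blocks until the SQL result is ready.
result = client.sql.execute(query).result()

for row in result:
    print(row['name'])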
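
A query like the one above only works after a mapping for the movies map has been created. The following is a minimal sketch of what such a mapping statement could look like, not the notebook’s exact mapping. It assumes integer keys and JSON values whose top-level fields become columns through the json-flat value format. The column types in MOVIE_COLUMNS and the helper names build_mapping_sql and create_mapping are assumptions for this example.

```python
# Assumption: column names follow the fields used in the query above,
# and the SQL types are a guess at what suits the movie data.
MOVIE_COLUMNS = {
    "title": "VARCHAR",
    "vote_average": "DOUBLE",
    "vote_count": "INT",
    "popularity": "DOUBLE",
    "release_date": "VARCHAR",
}


def build_mapping_sql(map_name, columns, key_type="INT", key_format="int"):
    """Return a CREATE OR REPLACE MAPPING statement for a map of JSON values."""
    column_lines = [f"    __key {key_type}"]
    column_lines += [f"    {name} {sql_type}" for name, sql_type in columns.items()]
    return (
        f"CREATE OR REPLACE MAPPING {map_name} (\n"
        + ",\n".join(column_lines)
        + "\n)\nTYPE IMap\nOPTIONS (\n"
        + f"    'keyFormat' = '{key_format}',\n"
        + "    'valueFormat' = 'json-flat'\n)"
    )


def create_mapping(client, map_name="movies", columns=MOVIE_COLUMNS):
    """Run the mapping statement on a connected Hazelcast client."""
    client.sql.execute(build_mapping_sql(map_name, columns)).result()
```

Because you don’t have to map every JSON field, removing an entry from MOVIE_COLUMNS simply leaves that field out of the table, while the data stored in the map stays unchanged.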

Summary

In this tutorial, we explained how to use the Hazelcast Python client in Jupyter Notebook. The notebook demonstrates the SQL support of the Hazelcast Python client. Hazelcast provides in-depth SQL support for your distributed Map data structures kept in Hazelcast clusters. Using Hazelcast SQL support, you can create mappings between your data and a database table and execute SQL queries on the Map. We used Hazelcast Viridian Serverless as our cluster provider to simplify installation and setup. Don’t hesitate to share your experience in our community Slack or GitHub repository.

Finally, we would like to acknowledge our Software Engineer Intern, Mehmet Tokgöz, for his input on this project.

Notebook link

For local file version: https://github.com/mehmettokgoz/hazelcast-python-sql-notebook

Google Colaboratory environment: https://colab.research.google.com/drive/1ujUt_XJI2moWSWMcF5_MPiWPg4LCJuot?usp=sharing

 

About the Author

Fawaz Ghali

Developer Advocate

Fawaz Ghali is a Developer Advocate at Hazelcast with 20+ years’ experience in software development, machine learning and real-time intelligent applications. He holds a PhD in Computer Science and has worked in the private sector as well as in academia as a Researcher and Senior Lecturer. He has published over 46 scientific papers in the fields of machine learning and data science. His strengths and skills lie within the fields of low latency applications, IoT & Edge, distributed systems and cloud technologies.
