Introducing Vector Collections and Vector Search
Hazelcast Platform 5.5 introduces a new capability, Vector Search across vector collections. In this blog, we’ll examine what vector collections are and why you need Vector Search.
Vector Search is a feature intended for AI and ML workloads, but here, we will take a beginner’s perspective to increase everyone’s understanding of this feature. This feature is currently in BETA status for Hazelcast Platform Enterprise Edition customers.
Background
It’s probably useful to start by knowing what vectors and vector collections are and why you might want to search them. We will begin with the periodic table and a Chemistry example describing elements.
Vector
A “Vector” is just a list of numbers. Taking the Periodic Table, each element has a fixed number of electrons, and the way they are arranged (in particular, how many sit in the outermost orbit, which determines the element’s “valency”) influences how the element reacts with other elements. Potassium has 19 electrons, arranged around the nucleus in four orbits of 2, 8, 8, and 1. One of the ways to describe Potassium would be to note its electron configuration as a vector of “[2, 8, 8, 1, 0, 0, 0]”. The vector has 7 numbers, as seven orbits are possible. Potassium only has enough electrons to fill the first four.
Vector Values
There are other attributes of Potassium, not just the electron configuration. For example, the name itself, “Potassium.” If we take the ASCII codes for the letters, this could be the vector “[80, 111, 116, 97, 115, 115, 105, 117, 109]”. The “Vector Values” for Potassium would be two unconnected vectors describing Potassium attributes.
We might think of Potassium as being described by these vector values:
Orbit, [2, 8, 8, 1, 0, 0, 0]
Spelling, [80, 111, 116, 97, 115, 115, 105, 117, 109]
We have represented Potassium with a vector for the “Orbit” of electrons and with a vector for the “Spelling” of the word.
Vector Collection
If we work with the periodic table, we need to store the description of more than just Potassium. We must store most or all elements in a “vector collection.”
In the Hazelcast Platform, this is just a key-value store, like an IMap, ReplicatedMap, and so on. So we might have a VectorCollection named “Elements” which might contain an entry with the key “Potassium” and the vector values shown previously, as well as entries for all the other elements we are interested in.
While we can retrieve “Potassium” using the primary key, the main use-case for such vectors is approximate searching.
Vector Search
“Vector Search” is a similarity search rather than an exact matching search. If we think of chemical elements, we might be interested in ones with similar behaviors. We might wish to search for elements with an electron formation similar to Potassium to validate or disprove our hypothesis that these elements will have similar behavioral properties.
Here, our vector search would be against the “Orbit” vector to find matches for “[2, 8, 8, 1, 0, 0, 0]”, which will find Potassium. It will also find Calcium (“[2, 8, 8, 2, 0, 0, 0]”) and Titanium (“[2, 8, 10, 2, 0, 0, 0]”) as close and more distant matches, respectively.
So, if our chemistry knowledge is valid, we would expect Calcium to behave more like Potassium than Titanium. Whether it does or not is a different question. We used vectors to describe attributes of something we were interested in and vector search to find close matches.
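To make “close” and “more distant” concrete, here is a minimal sketch in plain Java (no Hazelcast involved) that computes the squared Euclidean distance between the electron vectors above. The class and method names are made up for illustration.

```java
public class OrbitDistance {

    // Squared Euclidean distance: the sum of the squared element-wise differences.
    public static double squaredDistance(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    public static void main(String[] args) {
        float[] potassium = {2f, 8f, 8f, 1f, 0f, 0f, 0f};
        float[] calcium   = {2f, 8f, 8f, 2f, 0f, 0f, 0f};
        float[] titanium  = {2f, 8f, 10f, 2f, 0f, 0f, 0f};

        System.out.println("K-K  " + squaredDistance(potassium, potassium)); // 0.0
        System.out.println("K-Ca " + squaredDistance(potassium, calcium));   // 1.0
        System.out.println("K-Ti " + squaredDistance(potassium, titanium));  // 5.0
    }
}
```

Potassium is at distance zero from itself, Calcium differs by one electron in one orbit, and Titanium differs in two orbits, so it ranks further away.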
How about some code?
Configuration
In our “hazelcast.yml” file, our servers have this configuration element.
<pre>  vector-collection:
    'Elements':
      indexes:
        - name: Orbit
          dimension: 7
          metric: EUCLIDEAN
        - name: Spelling
          dimension: 13
          metric: EUCLIDEAN
</pre>
There is a vector collection named “Elements” with an index for each of the vectors. Indexes are mandatory. The whole purpose of vector collections is searching, so we need to define the vectors we will search so that they can be indexed. The vector “Orbit” uses a Euclidean distance index against a vector length of 7. Recall that there are 7 possible orbits around the nucleus, though not all may contain electrons. The vector “Spelling” also uses a Euclidean distance index. It has a size of 13 as the longest element name has 13 letters (Rutherfordium).
Other distance metrics are available, depending on the search criteria. Euclidean is used here for simplicity.
Java
Vector search is also available in Python, a more typical home for AI/ML workloads. However, let’s do it from Java.
The Vector Collection
We obtain a reference to the vector collection like this:
<pre>VectorCollection<String, String> elements =
        VectorCollection.getCollection(hazelcastInstance, "Elements");
</pre>
The vector collection has the name “Elements”, and takes a key that is a String and a value description that is also a String.
As is usual for Hazelcast Platform, the vector collection is created on the server side when we first access it. Note, though, that the configuration must be known. We cannot dynamically create a searchable object without defining the indexes that dictate how it can be searched.
Plutonium
We want to insert data for the element Plutonium. We might create its vectors like this:
<pre>VectorValues pu = VectorValues.of(
        // Electrons per orbit, innermost first
        "Orbit", new float[] { 2f, 8f, 18f, 32f, 24f, 8f, 2f },
        // Character codes for "Plutonium", zero-padded to 13 dimensions
        "Spelling", new float[] { 80.0f, 108.0f, 117.0f, 116.0f, 111.0f, 110.0f, 105.0f,
                117.0f, 109.0f, 0.0f, 0.0f, 0.0f, 0.0f });
</pre>
Plutonium here has two vectors. The first is “Orbit”, defining how many electrons are in each of seven orbits, 2 in the first, then 8 and so on for a total of 94. The second is “Spelling”, which starts with the character for a capital “P” then a lowercase “l” and so on.
Next, we save this into the vector collection.
<pre>elements.putAsync("Pu", VectorDocument.of("Plutonium", pu)); </pre>
The key is the String “Pu”, the chemical symbol for Plutonium. The value has a String description, “Plutonium”, and the two named vectors.
Other elements
We can insert 5 other elements similarly:
<pre>VectorValues am = VectorValues.of(
        "Orbit", new float[] { 2f, 8f, 18f, 32f, 25f, 8f, 2f },
        "Spelling", spell("Americium"));
VectorValues cm = VectorValues.of(
        "Orbit", new float[] { 2f, 8f, 18f, 32f, 25f, 9f, 2f },
        "Spelling", spell("Curium"));
VectorValues bk = VectorValues.of(
        "Orbit", new float[] { 2f, 8f, 18f, 32f, 27f, 8f, 2f },
        "Spelling", spell("Berkelium"));
VectorValues cf = VectorValues.of(
        "Orbit", new float[] { 2f, 8f, 18f, 32f, 28f, 8f, 2f },
        "Spelling", spell("Californium"));
VectorValues es = VectorValues.of(
        "Orbit", new float[] { 2f, 8f, 18f, 32f, 29f, 8f, 2f },
        "Spelling", spell("Einsteinium"));
</pre>
spell()
If you’re interested, this is how the spell() method is implemented.
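<pre>private static float[] spell(String s) {
    // One slot per dimension of the "Spelling" index; unused slots stay 0
    float[] result = new float[13];
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        // Ignore any characters beyond the 13th
        if (i < result.length) {
            result[i] = c;
        }
    }
    return result;
}
</pre>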
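As a quick check of what spell() produces, the sketch below wraps the same logic in a standalone class (the class name is made up for illustration) and prints the vector for “Curium”. A six-letter name fills the first six slots and leaves the remaining seven at zero.

```java
import java.util.Arrays;

public class SpellDemo {

    // Same logic as spell(): character codes, zero-padded (or truncated)
    // to the 13 dimensions of the "Spelling" index.
    public static float[] spell(String s) {
        float[] result = new float[13];
        for (int i = 0; i < Math.min(s.length(), result.length); i++) {
            result[i] = s.charAt(i);
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints [67.0, 117.0, 114.0, 105.0, 117.0, 109.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
        System.out.println(Arrays.toString(spell("Curium")));
    }
}
```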
Searching
Now, we want to search our data. We don’t have Thorium stored but are looking for something similar.
Thorium has 90 electrons arranged “[2, 8, 18, 32, 18, 10, 2]”.
We would search for it like this:
<pre>float[] targetOrbit = new float[] { 2f, 8f, 18f, 32f, 18f, 10f, 2f };
VectorValues targetVectorValues = VectorValues.of("Orbit", targetOrbit);

// Run the search and wait for the result
SearchResults searchResults = elements
        .searchAsync(targetVectorValues, SearchOptions.builder().limit(2).build())
        .toCompletableFuture().get();

Iterator<SearchResult> iterator = searchResults.results();
while (iterator.hasNext()) {
    SearchResult searchResult = iterator.next();
    System.out.println(searchResult.getKey() + " score==" + searchResult.getScore());
}
</pre>
We build a search where the input is the “Orbit” vector for Thorium, and with “limit(2)” we have requested the two best matches.
This gives us the output below:
Pu score==0.024390243
Cm score==0.019607844
Plutonium (Pu) is the closest match, with a score of 0.024, followed by Curium (Cm).
Validation
Let’s validate the result for the chemists and mathematicians amongst us.
Plutonium is the nearest match. It has 94 electrons arranged [2, 8, 18, 32, 24, 8, 2], whereas Thorium is [2, 8, 18, 32, 18, 10, 2]. The element-by-element difference is [0, 0, 0, 0, 6, -2, 0]. The score is 1 divided by (1 + the squared Euclidean distance), so a higher score means a closer match. Here it is 1/(1 + 6 squared + 2 squared) = 1/41, or 0.02439.
Curium has 96 electrons arranged [2, 8, 18, 32, 25, 9, 2]. The difference from Thorium is [0, 0, 0, 0, 7, -1, 0]. The score is 1/(1 + 7 squared + 1 squared) = 1/51, or 0.01960.
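The arithmetic can also be checked with a brute-force sketch in plain Java. It does not use Hazelcast and says nothing about how the real index works internally. It simply scores all six stored “Orbit” vectors against Thorium using 1/(1 + squared distance) and keeps the best two. The class and method names are invented for illustration, and keying the other five elements by chemical symbol is an assumption based on the “Cm” key in the search output.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class OrbitScoreCheck {

    // Score used in the validation: 1 / (1 + squared Euclidean distance).
    public static float score(float[] a, float[] b) {
        float squared = 0f;
        for (int i = 0; i < a.length; i++) {
            float diff = a[i] - b[i];
            squared += diff * diff;
        }
        return 1f / (1f + squared);
    }

    // The six "Orbit" vectors from the article, keyed by chemical symbol (assumed keys).
    public static Map<String, float[]> storedOrbits() {
        Map<String, float[]> orbits = new LinkedHashMap<>();
        orbits.put("Pu", new float[] {2f, 8f, 18f, 32f, 24f, 8f, 2f});
        orbits.put("Am", new float[] {2f, 8f, 18f, 32f, 25f, 8f, 2f});
        orbits.put("Cm", new float[] {2f, 8f, 18f, 32f, 25f, 9f, 2f});
        orbits.put("Bk", new float[] {2f, 8f, 18f, 32f, 27f, 8f, 2f});
        orbits.put("Cf", new float[] {2f, 8f, 18f, 32f, 28f, 8f, 2f});
        orbits.put("Es", new float[] {2f, 8f, 18f, 32f, 29f, 8f, 2f});
        return orbits;
    }

    // Brute force: sort all keys by descending score and keep the first "limit".
    public static List<String> bestMatches(Map<String, float[]> stored, float[] target, int limit) {
        List<String> keys = new ArrayList<>(stored.keySet());
        keys.sort((x, y) -> Float.compare(score(stored.get(y), target), score(stored.get(x), target)));
        return new ArrayList<>(keys.subList(0, Math.min(limit, keys.size())));
    }

    public static void main(String[] args) {
        float[] thorium = {2f, 8f, 18f, 32f, 18f, 10f, 2f};
        Map<String, float[]> stored = storedOrbits();
        for (String key : bestMatches(stored, thorium, 2)) {
            System.out.println(key + " score==" + score(stored.get(key), thorium));
        }
    }
}
```

Running this lists Pu first (a score of 1/41) and Cm second (1/51), in line with the search output shown earlier. Americium comes third at 1/54, which is why it misses the limit of two.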
More Searching
We have defined our elements as having two vectors. Let’s search for an element with a similar name on the “Spelling” vector.
If we search for Potassium, we search for the nearest match to the vector [80, 111, 116, 97, 115, 115, 105, 117, 109, 0, 0, 0, 0] for capital “P”, lowercase “o” and so on. As it turns out, for the 6 elements we have stored, the closest match is Plutonium with the vector [80, 108, 117, 116, 111, 110, 105, 117, 109, 0, 0, 0, 0]. It’s questionable chemistry to assume two elements will behave the same because their names are similar, but there’s an important point here: this algorithm scores vector elements individually based on mathematical distance.
For example, if we were searching for “Pot” [80, 111, 116], the difference from “Plu” [80, 108, 117] is small because one letter is 3 away and another is 1 away. Linguistically, though, “Bot” [66, 111, 116] would be a closer match in two ways: only one letter is different, and the pronunciations of “B” and “P” are related. Similarly, this algorithm would consider “Plot” [80, 108, 111, 116] a poor match to “Pot” as three letters differ. The 2nd element in the input is compared to the 2nd element in the target, not adjacent elements.
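The effect of position-by-position comparison can be seen with a small plain-Java sketch. It pads each word to four character codes (an arbitrary length chosen for this illustration, as are the class and method names) and prints the squared Euclidean distance from “Pot”.

```java
public class SpellingDistance {

    // Character codes of a word, zero-padded (or truncated) to a fixed length.
    public static float[] codes(String word, int length) {
        float[] result = new float[length];
        for (int i = 0; i < Math.min(word.length(), length); i++) {
            result[i] = word.charAt(i);
        }
        return result;
    }

    // Squared Euclidean distance, comparing position i only with position i.
    public static double squaredDistance(float[] a, float[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    public static void main(String[] args) {
        float[] pot = codes("Pot", 4);
        for (String candidate : new String[] {"Plu", "Bot", "Plot"}) {
            System.out.println(candidate + " " + squaredDistance(pot, codes(candidate, 4)));
        }
        // Plu 10.0, Bot 196.0, Plot 13490.0
    }
}
```

“Plu” scores as far closer than “Bot”, and “Plot” is pushed furthest away, mostly because its fourth letter is compared against the zero padding of “Pot”.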
Understanding the nature of the similarity search is important to ensure that you get the answers you want.
Vector Search in AI
This blog post does not intend to discuss the direct correlation to AI/ML. However, we need to cover two points that are not immediately obvious.
Transformer
As noted earlier, a vector is just a list of numbers, which is convenient for storage and searching, but it is not where we started. Our starting point was a more nebulous thing, an element. We needed a mechanism to transform the concept of “Plutonium” into vectors and used two such mechanisms.
The first transformer took the concept “Plutonium” and gave us the vector [2.0, 8.0, 18.0, 32.0, 24.0, 8.0, 2.0] for the orbit of the electrons. This transformer was a human being searching the internet.
The second transformer took the concept “Plutonium” and gave the vector [80.0, 108.0, 117.0, 116.0, 111.0, 110.0, 105.0, 117.0, 109.0, 0.0, 0.0, 0.0, 0.0] for the spelling of the word. This was coding!
AI uses more sophisticated transformation algorithms that produce longer vectors, but they’re still just vectors of numbers. For example, an entry system might capture a picture of an employee entering a building, and the AI transformation algorithm could take the recognizable features of their face and turn this into a vector.
Vector search can now find which employee is entering the building. Staff can walk from place to place in the building, and secured doors can be opened automatically as their faces become their “keys.”
Did you spot float[]?
A last point to note is that vectors in Hazelcast Platform are arrays of floating-point numbers. In the simplistic example of the periodic table, the electron counts happen to be whole numbers, but this is not a requirement for vectors. The values will usually be fractional.
Summary
Vector Search underpins a significant facet of AI: the ability to take arbitrary input and search for similar stored data.
In Hazelcast Platform, this data storage is just a key-value store. It is indexed for optimal searching, allowing us to find similar data records quickly.
More details on Vector Search in Hazelcast Platform can be found in our documentation.