Vector databases are hot right now. At the end of this article, I hope you'll have an intuitive feel for what a vector database is actually doing and how it could prove useful when we want to query things that are *similar*.

This sounds obvious, but a vector database is chiefly concerned with the efficient storage, retrieval and querying of vectors. To me, this makes them intriguingly different to most other database categories which tend to equally embrace a variety of data types - strings, integers, dates, floats and so on.

## How Does It Work?

We're going to motivate this with a little bit of maths. It's fun, I promise.

Let's start by considering a pair of points^{[1]} on a two-dimensional plane: $(4,4)$ and $(-4,-4)$. I've selected those because it's deliciously straightforward to plot them out, and I want this article to be less than 1200 words.

Voila! You'll see those two points in pink, but I've also cheekily added a third point in cyan: $(3,4)$. We can see with our own eyes that $(3,4)$ is much closer to $(4,4)$ than it is to $(-4,-4)$, but as of July 2024 computers have absolutely zero intuition - so how might we calculate that? One formula we can employ is called the L2 distance^{[2]}.

If we wanted to calculate the L2 distance, $d$, between points $p$ and $q$ in an abstract $n$-dimensional vector space, we'd do this:

$d(p,q) = \sqrt{\sum_{i=1}^n(q_i - p_i)^2}$

Don't worry - we'll go through this step by step. The big fancy sigma ($\Sigma$) letter makes this quite confusing, but it's saying that, for each dimension, we subtract the coordinate of $p$ from the coordinate of $q$ and square the result. We then add all of those squares together and make $d$ the square root of the total. We're popping it here now because later in the article we'll be contemplating vectors with hundreds of dimensions, and also it looks so cool.

Given $p=(4,4)$ and $q=(3,4)$, we can calculate $d$ within a less generalised two-dimensional vector space. This makes $n = 2$, and our resulting formula becomes $d(p,q) = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2}$. Which is a bit easier to read.

We substitute in our values to calculate the differences for each dimension of $p$ and $q$:

$\begin{aligned} (q_1 - p_1)^2 = (3 - 4)^2 = (-1)^2 &= 1 \\ (q_2 - p_2)^2 = (4 - 4)^2 = 0^2 &= 0 \end{aligned}$

And then take the square root of the sum of the squared differences to get the value of $d$:

$d = \sqrt{1+0} = 1$

All that work for a $1$! Alas. We can do the same but where $p=(-4,-4)$ and $q = (3,4)$:

$\begin{aligned} q_1 - p_1 = 3 - (-4) &= 7 \\ q_2 - p_2 = 4 - (-4) &= 8 \\ (q_1 - p_1)^2 = 7^2 &= 49 \\ (q_2 - p_2)^2 = 8^2 &= 64 \\ \sqrt{49 + 64} &\approx 10.63 \\ d &\approx 10.63 \end{aligned}$

That's it. That's all the maths^{[3]}. We can see the $d$ value is much greater between $(-4,-4)$ and $(3, 4)$.
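If you'd rather see the formula as code, here's a minimal Python sketch of the same calculation. The function name `l2_distance` is my own invention for illustration; it simply loops over each dimension, squares the difference, sums the squares and takes the square root, exactly as the sigma notation describes.

```python
import math


def l2_distance(p, q):
    """L2 (Euclidean) distance between two points with the same number of dimensions."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of dimensions")
    # One squared difference per dimension: (q_i - p_i)^2
    squared_differences = [(q_i - p_i) ** 2 for p_i, q_i in zip(p, q)]
    # The square root of the total gives d
    return math.sqrt(sum(squared_differences))


near = l2_distance((4, 4), (3, 4))    # the pair we worked through first
far = l2_distance((-4, -4), (3, 4))   # the pair that came out at roughly 10.63
print(near, far)
```

Because it zips over the coordinates rather than hard-coding two of them, the same function should work unchanged for vectors with hundreds of dimensions.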

Now, let's fling all this into a vector database. There are dozens to choose from but, for simplicity, I am going to use reliable, friendly Postgres and its `pgvector` extension. The team makes a pre-cooked installation available via Docker, so we can spin it up locally:

```
$ docker run \
--name pgvector \
-e POSTGRES_PASSWORD=vector \
-d \
-p 5432:5432 \
--rm \
pgvector/pgvector:pg16
```

And then feed it the relevant SQL. Our `vector(2)` column type will receive two-dimensional vectors, and we'll `INSERT` $(4,4)$ and $(-4,-4)$ as vectors into it.

```
CREATE EXTENSION vector;

CREATE TABLE vectors (id BIGSERIAL PRIMARY KEY, vector vector(2));

INSERT INTO vectors (vector) VALUES ('[4,4]'), ('[-4,-4]');
```

Now we can query the database directly for the L2 distances from $(3,4)$ with the `<->` operator.

```
SELECT vectors.vector AS "Vector",
       vectors.vector <-> '[3,4]' AS "Distance From [3,4]"
FROM vectors;
```

Which returns the exact same results as our calculations earlier:

| | Vector | Distance From [3,4] |
|---|---|---|
| 1 | [4,4] | 1 |
| 2 | [-4,-4] | 10.63014581273465 |

Cool! Mathematics wins again!

## But, Why?

Up until now we've been considering our vectors in a relatively abstract space, where $(3,4)$ had no particular meaning other than it was closer to $(4,4)$ than it was to $(-4,-4)$.

But we can take these vectors out of their happy abstract space and soak them in all kinds of meaning with a process called embedding.

It's easiest to think of an embedding model as, well, pure magic at this point. Think of it as a ready-made 'text to vectors' delivery service. For our purposes we'll use one called `all-MiniLM-L6-v2`, which will take a text input and return a 384-dimensional vector. This model was tuned on 1 billion sentence pairs, and as part of that training it picks up some ability to group sentences close to each other across those many dimensions. If two sentences are closely located within 384-dimensional space, the model has decided they share semantic meaning.

We won't be venturing into the LLM-world in this article, so no worrying about ChatGPT, Gemini or Claude. But one common use case for vector databases right now is to enrich a prompt with some memory and/or context. So you might store all your relevant internal documents in a vector database, and then pull out some similar text before sending your prompt with those results to the LLM.

Let's throw a few sentences at `all-MiniLM-L6-v2`:

```
sentences = [
    "Crystal Palace finished 10th in the premier league in 2023/24",
    "Man City finished 1st in the premier league in 2023/24",
    "Man City finished 1st in the premier league in 2022/23",
    "Brighton finished 11th in the premier league",
    "Preston finished 10th in the championship",
    "Cucumbers taste good when pickled",
    "Ginger tastes good when pickled",
    "Milk does not taste good when pickled",
    "Cats have four legs",
    "Dogs have four legs",
    "Humans have two legs"
]
```

We've got 11 sentences bucketed into three categories: football finishing positions, pickled foods and limbed animals.
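For the curious, here's roughly how the 'text to vectors' step might look in Python. This is a sketch under assumptions: I'm assuming the `sentence-transformers` package is installed, which is one common way to load `all-MiniLM-L6-v2`, and the helper name `embed_sentences` is purely illustrative.

```python
def embed_sentences(sentences, model_name="all-MiniLM-L6-v2"):
    """Return one embedding vector per input sentence.

    Assumes the third-party `sentence-transformers` package is installed.
    """
    # Imported lazily so that defining this helper doesn't require the package
    from sentence_transformers import SentenceTransformer

    # The model weights are downloaded the first time this runs
    model = SentenceTransformer(model_name)
    # encode() accepts a list of strings and returns one vector per string
    return model.encode(sentences)
```

Calling `embed_sentences(sentences)` on a list like the one above should hand back one 384-dimensional vector per sentence.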

It's hard for a tiny human brain to visualise 384-dimensional space, but at the same time I want to have a look at them. We can use the t-SNE algorithm (again, for now, let's just consider the implementation details pure magic) to bring that 384-dimensional data back into two dimensions, and then visualise the results.

We can see that the model has semantically grouped all the football finishing positions, pickled foods and limbed animals close to one another. These vectors have meaning!

Now, what if we create an embedding for another sentence: `Did palace win the premier league?` Imagine this takes the place of our $(3,4)$ vector from earlier, and now we want to find the closest vector stored in our database. We can once again go back to calculating the L2 distance for the nearest result.
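That lookup can be sketched as a brute-force nearest-neighbour search: measure the L2 distance from the query to every stored vector and keep the smallest. The coordinates below are made-up two-dimensional stand-ins rather than real 384-dimensional embeddings, and the names `l2_distance` and `nearest` are hypothetical helpers for illustration only.

```python
import math


def l2_distance(p, q):
    """L2 (Euclidean) distance between two equal-length vectors."""
    return math.sqrt(sum((q_i - p_i) ** 2 for p_i, q_i in zip(p, q)))


def nearest(query, stored):
    """Return the (text, vector) pair whose vector is closest to the query."""
    return min(stored.items(), key=lambda item: l2_distance(item[1], query))


# Made-up 2-D coordinates standing in for real embeddings (NOT model output)
stored = {
    "Crystal Palace finished 10th in the premier league in 2023/24": (4.0, 4.0),
    "Cucumbers taste good when pickled": (-4.0, -4.0),
    "Cats have four legs": (4.0, -4.0),
}

# Pretend this is the embedding of our question
query = (3.0, 4.0)
text, vector = nearest(query, stored)
print(text)
```

Checking every stored vector like this is fine for a handful of rows, but it gets slow as the table grows, which is part of why dedicated vector indexes exist.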

If you're interested in going further, note that we're heading into a territory known as semantic search, which is one of the most exciting use cases for a vector database.

The response? `Crystal Palace finished 10th in the premier league in 2023/24`. Oh well. Maybe next year.

1. Vectors are *technically* not points and they aren't held in a fixed position in space. In this article, let's just say we will anchor our points relative to the origin $(0,0)$ of our coordinate system and we'll use the terms interchangeably for the sake of simplicity. Another term we can use for these is position vectors.

2. This is also known as Euclidean distance. There are also other distance measures, such as cosine distance and L1/Manhattan distance. I think cosine distance is extremely popular, but I personally find L2 the most straightforward to visualise.

3. OK, quickly, one more thing: you could *also* look at this as if it was $7^2 + 8^2 \approx 10.63^2$, or generalised further into Pythagoras' famous $a^2 + b^2 = c^2$, which blows my mind. Triangles are everywhere!