IT Threat Detection using Neural Search: Part 2

Community Spotlight

Deep-learning powered cybersecurity dashboard to simulate real-time network traffic monitoring for malicious events.

Background

In part one of this blog series, we started building a Jina application that leverages similarity search to classify network traffic flow as benign or malicious. We saw performance improvements in precision/recall by repurposing a pre-trained neural network as a feature extractor and classifying network traffic based on similarity rather than the output label of a neural network.¹

Results (Before/After)

But in a world where “seeing is believing,” the aesthetics of your project are as important as the technology behind it. In part two of the IT Threat Detector series, we will explore how to wrap our Jina search pipeline in a Streamlit front-end, making it easy to simulate network traffic monitoring in real time through the browser!

Data: Too much of a good thing?

Nowadays, companies use their data not only to understand the past but also to predict the future. The problem is that they’re drowning in the data they are collecting, unable to make sense of the past or present. In fact, 90% of the world’s data was created in the last two years.²

Everyone can get on board with the idea that collecting data for the sake of it is boneheaded. But how is a developer supposed to bring their “dark data” to light if they don’t have any front-end expertise?

Application Dashboard powered by Streamlit

Streamlit is an open-source framework specifically designed for ML engineers working with Python. It allows you to create beautiful interactive dashboards in just a few lines of code. The best part? You can build an entire application without having to Google “how to center a div”.

Using Streamlit, we can wrap our Jina neural search pipeline in a powerful, production-ready frontend that allows us to draw intelligence from our data in a web browser without getting bogged down in the numbers. It’s perfect for the situation where you have an amazing data app but no frontend expertise to bring it to a larger audience.³

Before our app goes en fuego with Streamlit, let’s step back for a second and remind ourselves where our data is, how it’s stored, and why we need a “query Flow” in the first place.

Why do you need a query Flow?

Recall from part one that we utilized a complex Flow topology to index Documents to separate indexes in parallel. The idea behind this was to create an “ensemble” of predictions at query time from indexes that use different search algorithms.

Application Index Flow from part 1.

Those new to Jina may be wondering at this point, why on earth do we need a Flow again? Aren’t we building a front-end with Streamlit?

Remember, all we have done so far is store the vector embeddings generated by our feature extractor in a way that makes them searchable. We used a fancy name for this process, indexing, but all it really means is that we have data stored in a way that allows it to be searched against.

You wouldn’t waste blood, sweat, and tears building a sweet index full of interesting information if you didn’t plan on using it, would you? The one thing we still cannot do is access the data. To connect to the Flow and send it Documents that we want to classify as benign or malicious, we need to define a query Flow.

Application Query Flow

Our query Flow will perform an important function in our application by exposing an API Gateway to receive requests over the network. It will do this via an Executor we will call ITPredictor at an endpoint called /predict (below).

Matching nearest neighbors is easy with DocArray

That way, when a request arrives at the “/predict” endpoint, the Executor can easily return the nearest-neighbor matches (i.e., predictions) from each index to the Client in a DocumentArray.

The “/predict” endpoint returns nearest-neighbor matches (i.e., “predictions”)

But you may wonder, how do we actually “talk” to the Flow? Where is the Client? I spent all this time telling you how important it is to gain actionable business insights from your data, but how does the Streamlit front-end facilitate any of that?

Let’s discuss how we use a Client to connect to the Flow and send it Documents that we want to classify as benign or malicious.

Connecting Flow and Documents via Client

To connect to the Flow and send it Documents that we want to classify as benign or malicious, we can use Jina’s Client object.⁴ The Client lets you send Documents to a running Flow over several protocols (HTTP, gRPC, WebSocket, and GraphQL).

Sending data to the Flow from a Client

After a Client has connected to a Flow, it can send requests to the Flow using its post() method. That way, you can send your Documents to the Executor methods that you want to target.

In the demonstration below from the Jina docs, note how Executor methods decorated with “requests” are bound to specific endpoints and respond to the network requests sent to those endpoints.

The requests decorator allows an Executor’s methods to be targeted specifically
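
To make the binding idea concrete, here is a library-free toy model. It is not Jina's implementation and it is not the demonstration from the Jina docs: toy_requests, ToyExecutor, and post are hypothetical stand-ins that only mimic the shape described above, where a decorator tags a method with an endpoint and a post call routes the inputs to the matching method.

```python
from typing import Callable, Dict, List


def toy_requests(on: str) -> Callable:
    """Stand-in decorator (NOT jina.requests): tag a method with an endpoint."""
    def decorator(func: Callable) -> Callable:
        func._endpoint = on  # remember which endpoint this method serves
        return func
    return decorator


class ToyExecutor:
    """Hypothetical executor with two endpoint-bound methods."""

    @toy_requests(on="/index")
    def index(self, docs: List[str]) -> List[str]:
        return [f"indexed:{d}" for d in docs]

    @toy_requests(on="/predict")
    def predict(self, docs: List[str]) -> List[str]:
        return [f"predicted:{d}" for d in docs]

    def routes(self) -> Dict[str, Callable]:
        """Collect every method tagged by toy_requests, keyed by endpoint."""
        table: Dict[str, Callable] = {}
        for name in dir(self):
            member = getattr(self, name)
            endpoint = getattr(member, "_endpoint", None)
            if callable(member) and endpoint is not None:
                table[endpoint] = member
        return table


def post(executor: ToyExecutor, on: str, inputs: List[str]) -> List[str]:
    """Toy counterpart of a client call: route inputs to the bound method."""
    handler = executor.routes().get(on)
    if handler is None:
        raise KeyError(f"no method bound to {on!r}")
    return handler(inputs)


# Only the method bound to "/predict" handles this request.
result = post(ToyExecutor(), "/predict", ["flow-1", "flow-2"])
print(result)
```

The point of the sketch is the routing table: the caller names an endpoint, and only the method tagged with that endpoint runs.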

In our application, we will use a similar approach: we instantiate a Client object and call its .post() method when our application starts, in order to query our indexed Documents and determine which events are malicious.

Our DataFrame is populated with data returned by get_predictions() in the clean_data() function.

In particular, note how the get_predictions() function (below) populates our Pandas DataFrame (above) by querying our index via the /predict endpoint exposed by our query Flow. It assigns each unknown network traffic record the label of its nearest neighbor in our index, as measured by the cosine similarity of their embeddings. It does this for both our DocumentArrayIndexer and WeaviateIndexer.

The “/predict” endpoint exposed by the query Flow returns classified network traffic back to the Client.
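
As a rough illustration of that labelling step, the sketch below gives a query vector the label of its most cosine-similar neighbor. It uses plain Python rather than DocArray or Weaviate, and the two-dimensional embeddings, the toy index, and the predict_label helper are all made up for the example.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def predict_label(
    query: Sequence[float],
    index: List[Tuple[Sequence[float], str]],
) -> Tuple[str, float]:
    """Return the label and similarity of the indexed vector closest to query."""
    if not index:
        raise ValueError("index is empty")
    embedding, label = max(index, key=lambda item: cosine_similarity(query, item[0]))
    return label, cosine_similarity(query, embedding)


# Hypothetical index: (embedding, known label) pairs from training data.
toy_index = [
    ([1.0, 0.0], "benign"),
    ([0.0, 1.0], "malicious"),
]

label, score = predict_label([0.1, 0.9], toy_index)
print(label, round(score, 3))
```

A real index would hold many high-dimensional embeddings and use an approximate search rather than this linear scan, but the rule for picking the label is the same.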

In the future, it would be nice to redesign this application around an input “stream” of data. For now, we simulate real-time detection by classifying the network traffic based on similarity when the app starts up and our DataFrame loads with the get_data() function (below).

Build an interactive dashboard with a few lines of Python code.

In other words, since we don’t have real-time streaming data, we simulate it as best we can by classifying the data as late as possible.
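
One way to picture “as late as possible” is a loader that does no classification until the dashboard first asks for data, then reuses the result. This is only a sketch under assumptions: the get_predictions() body here is a hypothetical stand-in that returns canned labels instead of querying the Flow, and functools.lru_cache is our choice for the example, not necessarily what the app uses.

```python
from functools import lru_cache
from typing import Dict, List, Tuple

# Counts how often the (stand-in) classification step actually runs.
CALLS: Dict[str, int] = {"get_predictions": 0}


def get_predictions() -> List[Tuple[str, str]]:
    """Hypothetical stand-in for querying the /predict endpoint."""
    CALLS["get_predictions"] += 1
    return [("flow-1", "benign"), ("flow-2", "malicious")]


@lru_cache(maxsize=1)
def get_data() -> Tuple[Tuple[str, str], ...]:
    """Classify only when first called, then serve the cached rows."""
    return tuple(get_predictions())


# Nothing has been classified yet at this point.
calls_before = CALLS["get_predictions"]

first = get_data()   # classification happens here, on first use
second = get_data()  # served from the cache, no second query

print(calls_before, CALLS["get_predictions"], first is second)
```

Deferring the query this way keeps the labels as fresh as they can be without a true stream, while the cache avoids re-querying on every later call.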

The results are presented in a suite of interactive figures and performance indicators that make it easy for the user to determine the overall state of their network and the identified threats at a glance.

Key performance metrics (simulated values)
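
The exact indicators shown on the dashboard are not listed here. Assuming they include the precision and recall we tracked in part one, a minimal sketch of how such numbers could be computed from true and predicted labels looks like this. The precision_recall helper and the five sample labels are invented for illustration.

```python
from typing import Sequence, Tuple


def precision_recall(
    y_true: Sequence[str],
    y_pred: Sequence[str],
    positive: str = "malicious",
) -> Tuple[float, float]:
    """Precision and recall for the positive class (0.0 when undefined)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up labels: 2 true positives, 1 false positive, 1 false negative.
y_true = ["malicious", "malicious", "benign", "benign", "malicious"]
y_pred = ["malicious", "benign", "benign", "malicious", "malicious"]

precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f}")
```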

We can even identify which rows are misclassified using the sortable “is_wrong” column, as well as determine how each index classified a particular row.

Streamlit plays nicely with Pandas DataFrames
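
Here is a small sketch of how a table like that could be assembled, using plain Python dictionaries in place of a Pandas DataFrame. The build_rows helper and the sample labels are hypothetical, and so is the rule for “is_wrong”: we assume a row is flagged when any index disagrees with the true label, which may differ from the rule the app applies.

```python
from typing import Dict, List


def build_rows(
    true_labels: List[str],
    predictions_by_index: Dict[str, List[str]],
) -> List[dict]:
    """One row per traffic record, with each index's prediction side by side."""
    rows = []
    for i, truth in enumerate(true_labels):
        row = {"row": i, "label": truth}
        for index_name, preds in predictions_by_index.items():
            row[index_name] = preds[i]
        # Assumption: wrong if ANY index disagrees with the true label.
        row["is_wrong"] = any(
            preds[i] != truth for preds in predictions_by_index.values()
        )
        rows.append(row)
    return rows


# Made-up labels and predictions for three traffic records.
true_labels = ["benign", "malicious", "benign"]
predictions = {
    "DocumentArrayIndexer": ["benign", "malicious", "malicious"],
    "WeaviateIndexer": ["benign", "benign", "benign"],
}

rows = build_rows(true_labels, predictions)

# Sort so misclassified rows come first, as with the sortable column.
misclassified_first = sorted(rows, key=lambda r: r["is_wrong"], reverse=True)
for r in misclassified_first:
    print(r)
```

Keeping one column per index makes it easy to see which index got a given row wrong.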

Indeed, in a world where “seeing is believing,” our Streamlit front-end allows us to bring aesthetics to our project that match its technology.

You can find the source code here 👇

Conclusion

In this series of articles, we used Jina to demonstrate how similarity search can be used to solve a common business challenge like network security.

We accomplished this by developing a feature extractor from a pre-trained neural network that allowed us to predict whether network traffic is malicious based on its similarity with other known training samples rather than the output label of the network.

Then, without any front-end expertise, we learned how to “wrap” our Jina search pipeline in a Streamlit front-end, making it easy for us to simulate network traffic monitoring in real-time through the browser!

The crazy part? We built this whole project without having to Google “how to center a div” 🥳

References