Ralf Nieuwenhuijsen on how Owlin can analyze a million articles in real time

At Owlin, we couldn’t do what we are doing without the great work of all the different people working with us. In this interview series, “The people of Owlin”, we ask them about their daily work, their background, and where they see Owlin going in the future. This month: Ralf Nieuwenhuijsen, one of our Software Engineers, tells us about working for Owlin and how Owlin can analyze a million articles in real time.

First things first: when did you join Owlin?

“I joined Owlin about eight years ago after running into Bart & Bas (one of the founders) at Kafe België in Utrecht. I was a freelancer at the time, doing projects on and off, extending my student lifestyle somewhat indefinitely. After enjoying talking shop for many nights, with some fine beers, I was eventually convinced to join Owlin.”

What did you like about Owlin?

“Although I had a lot of experience working with customers and managing my own projects at the time, working collaboratively as a team in this kind of setting was a very new experience for me. Owlin provided the space for me to grow and allowed me to build something substantial.”

How did Owlin develop during the last eight years?

“Since I started at Owlin, we have gone through many different iterations of our infrastructure. When I had just joined, the risk dashboard designed by Wessel, which would now be considered the main product, was one of many ad-hoc side projects.

To ensure all the different projects would be maintainable and have a quantifiable and manageable cost profile, I spearheaded setting up a unified API surface, unified account management, and our scheduler. This automatic job runner would fetch and cache all the relevant data for a customer’s dataset. This decoupling of a dataset from the product enabled the data analyst team and the serving team to deliver products much more asynchronously without everyone drowning in technical debt.

I also developed components such as the Owlin Query Language, our filter manager, many API integrations, and many other internal tools that enable our data analysts and commercial team to independently QA and deliver datasets to our dashboards.

In the last couple of years, our focus has mainly been on ‘platform’, our new infrastructure where we maintain a single data warehouse for all of our needs, from keeping track of portfolios and user accounts to alert emails and articles. This enables us to apply our enrichment components much more freely to different data types and serve other use cases without changing our systems’ underlying architecture.

Whenever rules change, we automatically calculate a delta and apply it retroactively to articles that have already been processed. This enables our architecture to work efficiently both for the real-time use case of quickly processing freshly scraped news articles and for propagating changes to our historic catalog of articles in a cost-efficient way.”
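As an illustration of the delta idea Ralf describes, here is a minimal sketch in Python. It is not Owlin's implementation: the rule format (a rule name mapped to a set of required words) and the names `rule_delta` and `apply_delta` are assumptions made up for this example. The point it shows is that only added, changed, or removed rules touch the historic articles, while tags from unchanged rules are left alone.

```python
def rule_delta(old_rules, new_rules):
    """Compare two rule sets (hypothetical format: name -> set of required words).

    Returns (changed, removed): rules that are new or modified, and rules
    that no longer exist.
    """
    removed = {name for name in old_rules if name not in new_rules}
    changed = {
        name
        for name, terms in new_rules.items()
        if name not in old_rules or old_rules[name] != terms
    }
    return changed, removed


def apply_delta(articles, tags, new_rules, changed, removed):
    """Re-tag already processed articles, evaluating only the rules in the delta."""
    for article_id, text in articles.items():
        words = set(text.lower().split())
        current = tags.setdefault(article_id, set())
        # Drop tags whose rule was removed or changed; they may be stale.
        current -= removed | changed
        # Re-evaluate only the changed rules; unchanged rules are not touched.
        current |= {name for name in changed if new_rules[name] <= words}
    return tags
```

With thousands of rules and a large archive, the saving comes from the size of the delta: a single edited rule means a single rule is re-evaluated per historic article, rather than the full rule set.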

What has your main focus been recently?

“One of the things we never really got around to until this most recent iteration of our infrastructure was percolation: the ability to turn a search query inside out so that a new incoming article can quickly be matched to the existing rules in our system.

Whenever a new article comes into our system, we need to match this article to thousands of different rules. Some of these rules will tell us the news category, whereas others might inform us about the region the event takes place in or the type of risk that may be applicable.

To do all of this efficiently, it is the rules themselves that are indexed into our systems. This helps limit the number of candidate rules we need to consider for further evaluation. Once we establish that a particular article matches a specific rule, we tag the article with that rule, almost as if we put stickers on every article. We do multiple percolation rounds, allowing our more complex rules to build on other rules without re-evaluating them repeatedly.

This recursive percolation allows us to quickly classify millions of articles as they come into our systems.”
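The percolation rounds described above can be sketched roughly as follows. This is a toy illustration under stated assumptions, not Owlin's code: the `Rule` and `Percolator` classes, the tag names, and the idea that a rule is just "all of these words plus all of these earlier tags" are hypothetical simplifications. It shows the two ideas from the interview: rules are indexed so that an article only retrieves candidate rules, and later rounds build on the tags ("stickers") applied in earlier rounds.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    tag: str                        # the "sticker" applied when the rule matches
    terms: frozenset = frozenset()  # words that must all appear in the article
    needs: frozenset = frozenset()  # tags that earlier rounds must have applied


class Percolator:
    def __init__(self, rules):
        # Index the rules (not the articles) under every word and tag they depend on.
        self.index = defaultdict(list)
        for rule in rules:
            for word in rule.terms:
                self.index[("term", word)].append(rule)
            for tag in rule.needs:
                self.index[("tag", tag)].append(rule)

    def percolate(self, text):
        words = frozenset(text.lower().split())
        tags = set()
        # Round 1 looks up candidates by word; later rounds by newly applied tags.
        frontier = {("term", word) for word in words}
        while frontier:
            candidates = {rule for key in frontier for rule in self.index.get(key, ())}
            new_tags = {
                rule.tag
                for rule in candidates
                if rule.tag not in tags and rule.terms <= words and rule.needs <= tags
            }
            tags |= new_tags
            frontier = {("tag", tag) for tag in new_tags}
        return tags
```

For example, a hypothetical rule `Rule("risk:supply-chain", needs=frozenset({"category:strike", "region:nl"}))` is only evaluated once one of the tags it depends on has been applied, so it never has to repeat the work of the simpler category and region rules.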

What kind of systems do you use?

“Generally speaking, at Owlin, we use a mixture of machine learning and rule-based systems. Rule-based systems have value because they are easier to adjust on the spot and can explain themselves well (they can tell us why they think a particular classification is correct). They can also be audited to verify that they apply regulatory requirements correctly. On the other hand, machine learning-based approaches tend to have more flexibility and nuance.

Our philosophy is to use both approaches and have them feed into each other. Our rule-based classifications might use annotations from a machine learning-based approach and vice versa.”

What are you going to work on in the coming years?

“Now that our architecture has stabilized and is future-proof, we are getting into this exciting space where, even with a small team, we can quickly iterate on many complex approaches. This work is much more customer-facing than it has been in the past two years. It includes trying to combine our data with third-party data sources that can automatically tell us the complete company hierarchy, or scraping customer review data ourselves and using that to detect issues with order fulfillment immediately.”

Thanks, Ralf!