Samza

Fighting Adult Content with Kafka, Samza, and Google Safe Search by Daniel Ehrman

Originally published on Code Red on August 17, 2018.

Recently, we added the ability to upload photos of your home renovations to Redfin. To get there, we first had to protect the integrity of our public-facing content by filtering out adult images and the like. Let’s look at how we used Kafka, Samza, and Google Safe Search to put it all together.

First, Why Photos?

Kitchen photo uploaded by a Redfin homeowner

User testing showed that owners want a way to show off their homes and the hard work they’ve put into recent updates. They also want a more accurate Redfin Estimate. One limitation we face is that we can’t see inside their homes, which keeps us from fully understanding a home’s true value. With enough training data, photos may help us give a more accurate Estimate.

System Requirements

We had five goals in designing our photo-filtering system:

  1. Accurate: the photo filter should be accurate enough to catch nearly all inappropriate content.
  2. Affordable: we should be able to validate a large volume of photos without breaking the bank.
  3. Non-blocking: homeowners should be able to upload renovation photos without waiting for them to be validated first.
  4. Handles bursts: because we’re likely to see big batches of photos from a single homeowner, separated by long stretches of inactivity, the system should be able to handle bursts without backing up.
  5. Testable: we should be able to test the system in its entirety with photos that are accessible only from within our VPN.
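
To make requirements 3 and 4 concrete, here is a minimal sketch of the general pattern they imply: accept the upload right away, put it on a queue, and validate it in the background. This is an illustration only, not our production code. The in-memory queue stands in for a durable message queue, and `Photo`, `PhotoPipeline`, and `looks_inappropriate` are hypothetical names, with the latter a stub in place of a real image classifier.

```python
import queue
import threading
from dataclasses import dataclass


@dataclass
class Photo:
    photo_id: str
    status: str = "pending"  # pending -> approved | rejected


def looks_inappropriate(photo: Photo) -> bool:
    """Hypothetical stub standing in for a call to an external image classifier."""
    return photo.photo_id.startswith("bad-")


class PhotoPipeline:
    def __init__(self) -> None:
        # In-memory stand-in for a durable message queue (assumption for the sketch).
        self.pending: queue.Queue = queue.Queue()
        worker = threading.Thread(target=self._validate_loop, daemon=True)
        worker.start()

    def upload(self, photo_id: str) -> Photo:
        """Accept the upload immediately; validation happens later (non-blocking)."""
        photo = Photo(photo_id)
        self.pending.put(photo)
        return photo

    def _validate_loop(self) -> None:
        # Works through the backlog at its own pace, so a burst only grows the queue.
        while True:
            photo = self.pending.get()
            photo.status = "rejected" if looks_inappropriate(photo) else "approved"
            self.pending.task_done()

    def drain(self) -> None:
        """Block until every queued photo has been validated (handy in tests)."""
        self.pending.join()


if __name__ == "__main__":
    pipeline = PhotoPipeline()
    burst = [pipeline.upload(f"kitchen-{i}") for i in range(100)]
    pipeline.drain()
    print(sum(p.status == "approved" for p in burst), "photos approved")
```

Because `upload` only enqueues, a homeowner never waits on validation, and a large batch simply lengthens the queue until the worker catches up. Swapping the stub for a different classifier also leaves the rest of the flow untouched, which helps with requirement 5.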

Our Design

Given the requirements, we elected to use Kafka and Samza to meet our performance and scalability needs. We chose Google Safe Search because of Google’s established reputation in computer vision, and because we’d only be charged a fraction of a cent per image. Here’s the solution we landed on....

Continue reading on Code Red....