
The Black Box Illusion: ML APIs are an invitation to theft

By Patrick Tavares · 10 min read

Look, let’s be real here: the premise of Machine Learning as a Service (MLaaS) has always been kind of naive. You drop millions on GPUs, hire PhDs that cost an arm and a leg, scrub terabytes of proprietary data for months… and in the end, you expose all that intellectual property through an open HTTP door to the world, believing that hiding the model binaries is protection enough.

Spoiler: it isn’t.

The reality is that this so-called “Black Box” is transparent to anyone who knows where to knock. Model Stealing (or Model Extraction) is no longer just academic-paper material that nobody implements. It has turned into an economic engineering discipline, complete with calculated ROI. If you have a public API, rest assured: someone is training a clone model behind your back, using your inference budget to subsidize their R&D.

Let me tell you how this works.

The Attack Paradigm: Efficiency is Key

Forget those brute-force attacks from 2020. Nobody is going to query your API randomly until they break the bank (well, almost nobody). The game now is budget efficiency. The attacker wants to replicate your function f_v(x) with a substitute model f_s(x) while spending the absolute minimum on API calls.

Important pause here: do not confuse Extraction with Inversion. They are completely different attacks:

  • Extraction (Stealing): “I want to copy your model’s functionality so I don’t have to pay for it anymore.” This is IP theft, pure and simple.
  • Inversion: “I want to reconstruct the faces/data you used in training.” This is a privacy violation.

What I’m discussing here is extraction. And it has become frighteningly efficient.

The AugSteal Revolution

If your API returns only “Hard-Labels” (like “Cat” or “Dog”, without probabilities), you probably think you’re safe. My friend, I have bad news.

The AugSteal framework [1] has completely destroyed that notion of security. Its logic is as simple as it is brilliant: the attacker doesn’t care what your model knows. They want to know what their substitute model doesn’t know.

It works like this:

  1. Active Learning (AL): The attacker selects inputs where their clone model has high entropy (uncertainty). You know those cases where the AI isn’t sure? Those decision boundary points? That’s exactly where they attack.

  2. Surgical Querying: They only pay to query your API at these critical points. No waste.

  3. MixMatch/Data Augmentation: They use rotations, noise, and augmentations to extract consistency from a single “Hard-Label” response. It’s like squeezing blood from a stone, but it works.
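
The selection step above can be sketched in a few lines of numpy. This is a hypothetical illustration, not AugSteal’s actual code: the helper names and the tiny two-class pool are invented.

```python
# Hypothetical sketch of the Active Learning selection step (names invented):
# score a pool of candidate inputs by the substitute model's uncertainty and
# only pay to query the victim API on the most uncertain ones.
import numpy as np

def predictive_entropy(probs):
    """Shannon entropy of each row of class probabilities."""
    eps = 1e-12  # avoid log(0)
    return -np.sum(probs * np.log(probs + eps), axis=1)

def select_queries(probs, budget):
    """Indices of the `budget` most uncertain candidates."""
    return np.argsort(predictive_entropy(probs))[-budget:]

# Substitute model's predictions on an unlabeled candidate pool (toy values):
pool_probs = np.array([
    [0.98, 0.02],  # confident: not worth a paid API call
    [0.55, 0.45],  # near the decision boundary: query this
    [0.90, 0.10],
    [0.51, 0.49],  # near the boundary: query this
])
chosen = select_queries(pool_probs, budget=2)
print(sorted(chosen.tolist()))  # [1, 3]
```

Only the two boundary-hugging samples get paid queries; the confident ones are labeled by the clone itself for free.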

The result? Replicas with over 90% fidelity costing a ridiculous fraction of the original training. We are talking about saving millions here.

LLMs: The Mathematics of Extraction

Now things get more interesting (and mathematically elegant, if you’re into that sort of thing). LLMs are, essentially, giant matrix multipliers. And linear algebra is ruthless.

Recent research [2] has demonstrated that it is trivial to extract the embedding projection layer from models like the GPT family. Trivial, no exaggeration.

How the mathematical attack works

To understand, you need to know how an LLM works internally. When you send text, the model:

  1. Converts it into numbers (embeddings) - the representation “h”
  2. Multiplies by a giant matrix “W” (the projection matrix we want to steal)
  3. Adds a bias “b”
  4. Generates the final scores “z” (logits) for every possible word in the vocabulary

Mathematically:

z = W · h + b

Where:

  • z = logits (the raw scores the model gives for each word in the vocabulary)
  • W = projection matrix (the secret “weights” we want to steal)
  • h = hidden embedding (the internal representation of your text)
  • b = bias (a constant adjustment term)
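
In numpy, that forward step is one line. The shapes below are toy values for illustration (real vocabularies run to tens of thousands of tokens):

```python
# Toy illustration of the projection step z = W·h + b (shapes invented;
# real models have ~50k-token vocabularies and thousands of hidden dims).
import numpy as np

vocab_size, hidden_dim = 5, 3
rng = np.random.default_rng(0)

W = rng.normal(size=(vocab_size, hidden_dim))  # the secret projection matrix
b = np.zeros(vocab_size)                       # logit_bias gets injected here
h = rng.normal(size=hidden_dim)                # hidden embedding of the prompt

z = W @ h + b                                  # logits: one raw score per token
probs = np.exp(z) / np.exp(z).sum()            # softmax -> what logprobs expose
print(z.shape, round(probs.sum(), 6))          # (5,) 1.0
```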

The critical problem is that some APIs (like OpenAI’s until 2024) exposed two dangerous features:

  • logprobs (log-probabilities): the exact probabilities the model assigns to each word. Instead of just returning “cat”, the API says “cat has 85% chance, dog 10%, rat 5%”.
  • logit_bias: a parameter that allows you to inject values into the “b” of the equation above, forcing the model to prefer or avoid certain words.

When you combine these two features, it becomes college-level math. The attacker:

  1. Sends the same phrase multiple times
  2. Each time, changes the logit_bias (manipulates “b”)
  3. Observes how the logprobs (the “z”) change
  4. With N queries, sets up a system of linear equations
  5. Solves to find “W” using basic linear algebra

Let me show you the concept in a simplified way:

Model Extraction

```python
# Simplified concept of W extraction.
# The attacker sends queries varying the bias (b) and observes the exact
# changes in the output log-probs (z). With N queries, he solves:
#     W * h = z_observed - b_injected
import numpy as np

def solve_projection_layer(queries_results):
    # Solves the linear system A·x = B using least squares.
    # This extracts the matrix W the provider wanted to keep secret.
    projection_matrix, residuals, rank, _ = np.linalg.lstsq(
        queries_results['h'],           # hidden states (the "A")
        queries_results['z_adjusted'],  # observed logits minus injected bias
        rcond=None,
    )
    return projection_matrix
```
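
To convince yourself the algebra actually works, here is a self-contained simulation with entirely synthetic data: invent a secret W, fake the biased queries, recover W with least squares. (One simplification: here the attacker knows the hidden states H directly; in the real Carlini et al. attack, the structure is recovered from the observed logits alone.)

```python
# Self-contained simulation of the extraction (all data synthetic).
import numpy as np

rng = np.random.default_rng(42)
vocab, hidden, n_queries = 8, 4, 50

W_secret = rng.normal(size=(vocab, hidden))  # what the provider hides
H = rng.normal(size=(n_queries, hidden))     # hidden states across the queries
B = rng.normal(size=(n_queries, vocab))      # logit_bias injected per query
Z = H @ W_secret.T + B                       # logits the API leaks via logprobs

# The attacker subtracts the bias he injected and solves the linear system:
W_stolen, *_ = np.linalg.lstsq(H, Z - B, rcond=None)
print(np.allclose(W_stolen.T, W_secret))     # True: W recovered exactly
```

With more queries than hidden dimensions the system is overdetermined and the recovery is exact up to floating-point noise.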

Want to know the estimated cost to steal the projection matrix of gpt-3.5-turbo? Under $2,000 USD [2].

Let that sink in for a second. Less than two grand. This reveals the exact vocabulary, the hidden dimension, and provides a perfect “head” for subsequent adversarial attacks. It’s basically buying the blueprints to the house you want to break into.

Important note: It is worth highlighting that the researchers in the Carlini et al. paper obtained prior permission from OpenAI to conduct this attack ethically and responsibly. OpenAI has since fixed this vulnerability, and today you can no longer use logit_bias freely in their API. But the attack served as a devastating proof of concept: if your API leaks too much information, linear algebra does the rest.

The Replica is the Attacker’s Dojo

Now you might be thinking: “Alright Patrick, but why would anyone steal my model besides saving on the API bill?”

Good question. The answer is: to break it.

There is a property in Deep Learning called Transferability [3]. Adversarial examples (inputs surgically designed to fool the AI) created for Model A often fool Model B, if the architectures or datasets are similar.

So the attack flow looks like this:

  1. I steal your model and have f_s(x) running locally on my GPU.
  2. I can generate millions of adversarial attacks per second (White-Box attack) without paying a cent in API fees and without showing up in your rate limiting logs.
  3. I optimize the attack until I hit 100% success on my local model.
  4. I fire a SINGLE request against your production API. And it gets through.
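
As a toy illustration of step 2, here is FGSM (the Fast Gradient Sign Method) run against a local logistic-regression “replica” in pure numpy. The weights and inputs are made up; the point is that the adversarial input is crafted entirely offline:

```python
# Toy FGSM attack on a local logistic-regression "replica" (all values
# invented). No API calls happen anywhere in this snippet.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

w = np.array([2.0, -1.0])  # weights of the stolen substitute model f_s
b = 0.5

def predict(x):
    return sigmoid(x @ w + b)  # P(class 1)

x = np.array([0.2, 1.2])       # clean input: classified as class 0
p_clean = predict(x)

# FGSM: one step in the sign of the input gradient of the loss.
# For logistic loss with true label y = 0: dL/dx = p * w.
grad = p_clean * w
eps = 0.5
x_adv = x + eps * np.sign(grad)  # crafted offline, for free

p_adv = predict(x_adv)
print(p_clean < 0.5, p_adv > 0.5)  # True True: the label flips
```

Against a deep model the attacker runs the same loop with autograd and thousands of iterations per second; the economics are identical.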

Your API becomes a static target. The attacker trains in the dojo (the replica) and wins in the ring (production) with a single punch. It’s like training against a dummy of you before the real fight.

Side-Channels: The Silence that Speaks

Even if you block logprobs, hide confidence scores, and ban suspicious IPs, your physical infrastructure still gives away information. Timing Attacks on NLP APIs are absolutely devastating.

Mixture-of-Experts (MoE) models, like Mixtral (and, reportedly, GPT-4), have variable inference times depending on how many and which “experts” are activated internally.

If input A consistently takes 200ms and input B takes 250ms, I’ve gained a bit of information about the complexity or routing path of your network. Seems like nothing? Multiply that by 10,000 queries.

In classification systems, if the “Fraud” class goes through 3 extra layers of verification and auditing, the latency reveals the result before the JSON even arrives. It’s like poker, but your server has obvious tells.
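
A sketch of how little the attacker needs: with simulated latencies (all numbers invented), averaging over repeated probes makes even a modest per-class delay stand out from the jitter:

```python
# Simulated timing side-channel (all latencies invented): the "Fraud" path
# adds ~50 ms of auditing, and averaging repeated probes exposes it.
import random
import statistics

random.seed(1)

def simulated_latency(fraud_path):
    base = random.gauss(200, 15)             # ms of normal network jitter
    return base + (50 if fraud_path else 0)  # extra verification layers

normal = [simulated_latency(False) for _ in range(1000)]
fraud = [simulated_latency(True) for _ in range(1000)]

gap = statistics.mean(fraud) - statistics.mean(normal)
print(f"mean gap: {gap:.1f} ms")  # ~50 ms: the class leaks before the JSON
```

One request tells you nothing; a thousand averaged requests shrink the noise by a factor of ~30 and the signal is unmistakable.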

For the attacker, your latency jitter is Morse code broadcasting details of your architecture.

Defense: A Game of Cat and Mouse (with State)

Okay, enough with the bad news. How do we defend ourselves?

Passive defense is dead. The only way out is Stateful Detection.

The End of Statelessness

REST APIs were designed to be stateless. This is great for scalability. And terrible for ML security. If you don’t keep a history of a user’s queries, you are fighting blindfolded.

Effective defense requires behavioral monitoring. Modern detection frameworks like GuardNet [4] and PRADA [5] specifically address the detection of model extraction attacks by analyzing query patterns over time:

GuardNet (2024) combines three critical components:

  1. Boundary features: Detects queries that explore decision boundaries (the natural target of Active Learning based attacks like AugSteal).
  2. Inter-sample distance: Analyzes the distance between consecutive samples to identify systematic exploration patterns.
  3. Distribution divergences: Uses Variational Autoencoders (VAE) and Wasserstein distance to distinguish legitimate distribution shifts versus adversarial patterns.

GuardNet’s differentiator is its ability to detect model stealing using fewer queries and minimize false positives caused by natural shifts in legitimate user distribution—a critical problem that destroyed previous approaches based on fixed thresholds.

PRADA (2019), the classic detection framework, established the concept of identifying statistical patterns of knowledge distillation in queries. While effective against older attacks, recent research has shown it can be evaded by techniques like ActiveThief. Still, it serves as an important baseline for layered defense systems.
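
A minimal stateful detector in the spirit of PRADA’s inter-sample distance analysis might look like this. The threshold and window are invented for illustration; a real system fits them to the benign query distribution:

```python
# Hedged sketch of a stateful detector inspired by inter-sample distance
# analysis (PRADA-style). Threshold and window values are invented.
import numpy as np

class QueryHistoryDetector:
    """Flags users whose new queries land suspiciously close to their own
    recent queries: the signature of systematic boundary exploration."""

    def __init__(self, threshold=0.1, window=100):
        self.threshold = threshold
        self.window = window
        self.history = {}  # user -> list of past query vectors

    def observe(self, user, x):
        past = self.history.setdefault(user, [])
        suspicious = False
        if past:
            # distance from the new query to the user's nearest recent query
            d = min(np.linalg.norm(x - p) for p in past[-self.window:])
            suspicious = d < self.threshold
        past.append(x)
        return suspicious

det = QueryHistoryDetector()
rng = np.random.default_rng(0)

# A legitimate user sends well-spread inputs; an attacker probes one region.
legit = [det.observe("alice", rng.normal(size=8)) for _ in range(50)]
probe = rng.normal(size=8)
attack = [det.observe("mallory", probe + 0.01 * rng.normal(size=8))
          for _ in range(50)]
print(sum(legit), sum(attack))  # only the attacker trips the alarm
```

Note the cost of this defense: per-user state, which is exactly what stateless REST design tells you not to keep.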

Let me show you the defense approaches and how effective they really are:

Rate Limiting (IP/Token throttling)

  • Effectiveness: Low
  • Why: Easily bypassed via Sybil attacks and rotating proxies.

Hard-Label Only (Hiding probabilities)

  • Effectiveness: Medium
  • Why: Broken by AugSteal, as we’ve seen.

POW (Proof-of-Work)

  • Effectiveness: Medium
  • Why: Increases attack cost, but seriously hurts UX for legitimate users.

Stateful Detection (GuardNet/PRADA)

  • Effectiveness: High (GuardNet demonstrates effective detection with fewer queries and low false positives)
  • Why: Analyzes behavioral patterns over time (boundary exploration, inter-sample distance, distribution shifts).
  • Limitation: Requires stateful monitoring infrastructure and continuous analysis.

The Compliance Nightmare

And here comes a legal bombshell that few people are discussing:

If I steal your model trained with sensitive medical data, my replica (the stolen model) might memorize and leak that data. The legal question of 2026 will be: Who is liable for the data leak via the stolen model?

The victim company that didn’t adequately protect the API, or the attacker?

If GDPR/LGPD decides that the security flaw in the API was gross negligence, YOU pay the fine for the data the THIEF leaked. It’s like getting mugged and then having to pay for the damages the robber caused with what he stole from you.

Think about that.

Conclusion

Your model is your product. If you expose it via API, you are selling free samples of your intellectual property with every HTTP 200 response.

Security by obscurity is over. Dead. Gone. Buried. If you aren’t implementing state monitoring to detect Active Learning patterns, or measuring the entropy of incoming queries, your model is likely already:

  • Someone else’s training dataset, OR
  • The punching bag for an adversarial script running 24/7

The golden rule: Protect the weights, but protect the gradients EVEN MORE.

And if you think this is paranoia, maybe it’s time to check your API logs. Those 50,000 “suspicious” requests from last week? They weren’t scraping bots.

They were someone cloning you.



Footnotes

  1. Gao et al. (2024). AugSteal: Advancing Model Steal With Data Augmentation in Active Learning Frameworks. IEEE Transactions on Information Forensics and Security. 10.1109/TIFS.2024.3384841

  2. Carlini et al. (2024). Stealing Part of a Production Language Model. Proceedings of the 41st International Conference on Machine Learning (ICML). 10.48550/arXiv.2403.06634

  3. Tramèr et al. (2016). Stealing Machine Learning Models via Prediction APIs. 25th USENIX Security Symposium. 10.48550/arXiv.1609.02943

  4. Zhang et al. (2024). Making models more secure: An efficient model stealing detection method. Computers and Electrical Engineering, Volume 117. 10.1016/j.compeleceng.2024.109266

  5. Juuti et al. (2019). PRADA: Protecting Against DNN Model Stealing Attacks. IEEE European Symposium on Security and Privacy (EuroS&P). 10.1109/EuroSP.2019.00044