fix: prevent SQL injection in query_metrics and natural_language_query endpoints by atiliohector33 · Pull Request #89 · PostHog/HouseWatch

atiliohector33 · 2026-06-27T22:43:01Z

🔒 What's the problem?

Two API endpoints were building SQL queries by pasting raw user input directly into SQL strings — no validation, no sanitization.

This is a classic SQL Injection (CWE-89) vulnerability. An attacker who knows how the API works (or just guesses) can manipulate the query that gets sent to ClickHouse and read data they shouldn't see, or worse.

🔍 Vulnerable endpoints

1. `GET /api/analyze/{hash}/query_metrics/`

The {hash} in the URL comes straight from the browser and was dropped into the SQL with no checks:

# BEFORE — vulnerable
conditions = "AND event_time > now() - INTERVAL 1 WEEK AND toString(normalized_query_hash) = '{}'".format(pk)

That conditions string then gets pasted into four different SQL templates via %(conditions)s.

Normal request:

GET /api/analyze/14763824958273648/query_metrics/

Produces safe SQL:

AND toString(normalized_query_hash) = '14763824958273648'

Attack request:

GET /api/analyze/' OR 1=1 --/query_metrics/

Produces malicious SQL:

AND toString(normalized_query_hash) = '' OR 1=1 --'
-- OR has lower precedence than AND, so OR 1=1 makes the filter always true
-- and the trailing -- comments out the leftover quote
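
The string building above can be reproduced in isolation. The sketch below is not HouseWatch code: `build_conditions` is a hypothetical helper that mirrors the vulnerable `.format()` call, so the effect of each `pk` value is visible without a ClickHouse server.

```python
# Hypothetical helper mirroring the vulnerable .format() call; not part of HouseWatch.
def build_conditions(pk: str) -> str:
    return (
        "AND event_time > now() - INTERVAL 1 WEEK "
        "AND toString(normalized_query_hash) = '{}'".format(pk)
    )


normal = build_conditions("14763824958273648")
attack = build_conditions("' OR 1=1 --")

print(normal)  # the hash stays inside the quoted literal
print(attack)  # the quote in pk closes the literal and appends OR 1=1
```

Nothing in `.format()` distinguishes data from SQL syntax, which is why the attack string changes the structure of the clause rather than just its value.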

2. `POST /api/analyze/natural_language_query/`

The AI feature lets users pick which tables to query. Table names came from the request body and were interpolated directly into SQL:

# BEFORE — vulnerable
database, table = full_table_name.split(">>>>>")
condition = f"(database = '{database}' AND table = '{table}')"

Normal request body:

{
  "tables_to_query": ["default>>>>>events"],
  "query": "show me the top 10 slowest queries"
}

Produces safe SQL:

WHERE (database = 'default' AND table = 'events')

Attack request body:

{
  "tables_to_query": ["default>>>>>' OR 1=1 UNION SELECT name, '', query FROM system.users --"],
  "query": "anything"
}

Produces malicious SQL:

WHERE (database = 'default' AND table = '' OR 1=1 UNION SELECT name, '', query FROM system.users --')
-- Illustrative: once the opening parenthesis is balanced, a payload like this could read ClickHouse's internal users table
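
The original parsing has a second weakness besides the injection: unpacking the result of `split(">>>>>")` assumes exactly one separator. The sketch below uses a hypothetical `build_condition` helper (an assumption that mirrors the two vulnerable lines, not HouseWatch code) to show both the injected fragment and the unpacking failure.

```python
# Hypothetical helper mirroring the vulnerable lines; not part of HouseWatch.
def build_condition(full_table_name: str) -> str:
    database, table = full_table_name.split(">>>>>")
    return f"(database = '{database}' AND table = '{table}')"


# Intended input: one separator, plain names.
safe = build_condition("default>>>>>events")

# Injection: the quote in the table part closes the string literal early.
injected = build_condition("default>>>>>' OR 1=1 --")

# A second separator yields three parts, so tuple unpacking raises ValueError.
try:
    build_condition("a>>>>>b>>>>>c")
    unpack_failed = False
except ValueError:
    unpack_failed = True
```

An unhandled `ValueError` in a request handler would normally surface as a server error rather than a 400, so malformed input needs an explicit check either way.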

🤔 Why not just use parameterized queries?

Good question. The run_query() function already supports parameters — but parameterization only works for values (strings, numbers, dates). The %(conditions)s slot injects an entire SQL clause, not a single value:

# run_query uses Python % substitution, not driver-level binding
final_query = query % (params or {})

-- This is what %(conditions)s looks like in the template
WHERE
    query_start_time > now() - INTERVAL %(days)s day
    AND type = 2
    AND is_initial_query %(conditions)s
--                        ^^^^^^^^^^
--                        This is a SQL fragment, not a value.
--                        No driver can safely bind a fragment.

The conditions pattern is used across multiple SQL templates to let different endpoints share the same query structure. Refactoring all templates would be a much larger change. Input validation is the correct minimal fix here.
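
The difference between a value slot and a fragment slot can be shown with plain Python `%` substitution, which is what the `run_query` line above performs. The template and parameter values below are shortened stand-ins chosen for illustration, not the actual HouseWatch templates.

```python
# Shortened stand-in template (an assumption); the real templates are longer.
TEMPLATE = (
    "SELECT count() FROM system.query_log "
    "WHERE query_start_time > now() - INTERVAL %(days)s day "
    "AND type = 2 AND is_initial_query %(conditions)s"
)

params = {
    # A value: becomes a single token in the final SQL.
    "days": 7,
    # A fragment: contributes keywords, an operator, and a literal.
    "conditions": "AND normalized_query_hash = 14763824958273648",
}

final_query = TEMPLATE % params
print(final_query)
```

Because the fragment contributes keywords and operators, the code that builds it has to guarantee its contents; the substitution step cannot.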

✅ The fix

Fix #1 — `query_metrics`: validate that the hash is a plain integer

normalized_query_hash in ClickHouse is a UInt64 — always a number like 14763824958273648. Numbers can't carry SQL syntax. If we confirm it's a valid integer before touching the SQL, we're safe.

# AFTER — fixed
if not pk.isdigit():
    return Response(status=400, data={"error": "Invalid query hash"})

# Cast to int: now it's type-safe, and we compare UInt64 = UInt64 directly
# (removes the unnecessary toString() call too)
conditions = f"AND event_time > now() - INTERVAL 1 WEEK AND normalized_query_hash = {int(pk)}"

The attack ' OR 1=1 -- fails .isdigit() and gets a 400 immediately — never reaches the SQL builder.
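
One edge case is worth noting: Python's `str.isdigit()` also returns `True` for some non-ASCII digit characters, such as the superscript `²`, which `int()` then rejects with a `ValueError`. A stricter check, sketched below with a hypothetical `parse_query_hash` helper (not part of this PR), limits input to ASCII digits and to the UInt64 range.

```python
from typing import Optional

UINT64_MAX = 2**64 - 1  # 18446744073709551615, which has 20 digits


def parse_query_hash(pk: str) -> Optional[int]:
    """Return pk as an int if it is an ASCII decimal within UInt64, else None."""
    # isascii() excludes characters such as "²" that isdigit() accepts.
    if not (pk.isascii() and pk.isdigit()):
        return None
    # Reject over-long input before calling int().
    if len(pk) > 20:
        return None
    value = int(pk)
    return value if value <= UINT64_MAX else None
```

A view could then return a 400 whenever the helper returns `None`, and interpolate the returned integer otherwise.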

Fix #2 — `natural_language_query`: validate identifiers with a regex

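Unquoted database and table names in ClickHouse consist of letters, digits, and underscores ([a-zA-Z0-9_]), so they can't contain quotes, semicolons, spaces, or any other SQL-special character. ClickHouse also accepts quoted identifiers containing other characters; this check deliberately rejects those, which makes it stricter than the server. We enforce that rule before building the fragment: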

# AFTER — fixed (added at the top of analyze.py)
import re
_IDENTIFIER_RE = re.compile(r"^[a-zA-Z0-9_]+$")

# Inside natural_language_query:
for full_table_name in request.data["tables_to_query"]:
    if ">>>>>" not in full_table_name:
        return Response(status=400, data={"error": f"Invalid table format: {full_table_name!r}"})

    database, table = full_table_name.split(">>>>>", 1)   # maxsplit=1 for safety
    database, table = database.strip(), table.strip()

    if not _IDENTIFIER_RE.match(database) or not _IDENTIFIER_RE.match(table):
        return Response(status=400, data={"error": "Table names must contain only letters, digits, and underscores"})

    condition = f"(database = '{database}' AND table = '{table}')"
    table_schema_sql_conditions.append(condition)

The attack payload "default>>>>>' OR 1=1 --" gives table = "' OR 1=1 --" — which fails the regex and returns 400 Bad Request.
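
The validation steps can also be gathered into one function that is easy to unit-test. The sketch below is an assumption about how that might look, not code from this PR: `parse_table_ref` is a hypothetical name, and it uses `re.fullmatch` so the result does not depend on stripping a trailing newline first (a bare `$` also matches just before one).

```python
import re
from typing import Tuple

_IDENTIFIER_RE = re.compile(r"[a-zA-Z0-9_]+")
_SEPARATOR = ">>>>>"


def parse_table_ref(full_table_name: str) -> Tuple[str, str]:
    """Split 'database>>>>>table' and validate both parts, or raise ValueError."""
    if not isinstance(full_table_name, str) or _SEPARATOR not in full_table_name:
        raise ValueError("expected 'database>>>>>table'")
    # maxsplit=1 guarantees exactly two parts once the separator is present.
    database, table = (part.strip() for part in full_table_name.split(_SEPARATOR, 1))
    for name in (database, table):
        if _IDENTIFIER_RE.fullmatch(name) is None:
            raise ValueError("names must contain only letters, digits, and underscores")
    return database, table
```

The endpoint could catch the `ValueError` and turn it into a 400 response, keeping the SQL-building code free of validation branches.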

📋 Changes summary

| Location | Vulnerability | Fix |
| --- | --- | --- |
| `query_metrics` (line 79) | `pk` from URL pasted into SQL via `.format()` | `pk.isdigit()` guard + integer cast |
| `natural_language_query` (lines 273–274) | `database`/`table` from request body pasted into SQL via f-string | Regex `^[a-zA-Z0-9_]+$` validates both before use |

Files changed: housewatch/api/analyze.py

fix: solve sql injection problem on analyze.py

1ebd79b

atiliohector33 force-pushed the fix/sql-injection-problem branch from 1a49f2e to 1ebd79b on June 27, 2026 22:45

atiliohector33 wants to merge 1 commit into PostHog:main from atiliohector33:fix/sql-injection-problem
