fix: prevent SQL injection in query_metrics and natural_language_query endpoints #89

Open
atiliohector33 wants to merge 1 commit into PostHog:main from atiliohector33:fix/sql-injection-problem

@atiliohector33


🔒 What's the problem?

Two API endpoints were building SQL queries by pasting raw user input directly into SQL strings — no validation, no sanitization.

This is a classic SQL Injection (CWE-89) vulnerability. An attacker who knows how the API works (or just guesses) can manipulate the query that gets sent to ClickHouse and read data they shouldn't see, or worse.


🔍 Vulnerable endpoints

1. GET /api/analyze/{hash}/query_metrics/

The {hash} in the URL comes straight from the browser and was dropped into the SQL with no checks:

# BEFORE — vulnerable
conditions = "AND event_time > now() - INTERVAL 1 WEEK AND toString(normalized_query_hash) = '{}'".format(pk)

That conditions string then gets pasted into four different SQL templates via %(conditions)s.

Normal request:

GET /api/analyze/14763824958273648/query_metrics/

Produces safe SQL:

AND toString(normalized_query_hash) = '14763824958273648'

Attack request:

GET /api/analyze/' OR 1=1 --/query_metrics/

Produces broken SQL:

AND toString(normalized_query_hash) = '' OR 1=1 --'
-- AND binds tighter than OR, so the trailing OR 1=1 makes the whole WHERE clause true for every row
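The effect is easy to reproduce outside of Django. The sketch below reuses the format string from the vulnerable line; the helper name `build_conditions` is hypothetical and exists only for illustration.

```python
# Format string copied from the vulnerable line in query_metrics.
TEMPLATE = (
    "AND event_time > now() - INTERVAL 1 WEEK "
    "AND toString(normalized_query_hash) = '{}'"
)


def build_conditions(pk: str) -> str:
    """Hypothetical helper mirroring the pre-fix behaviour: no validation."""
    return TEMPLATE.format(pk)


benign = build_conditions("14763824958273648")
hostile = build_conditions("' OR 1=1 --")

print(benign)   # the hash stays inside the quoted literal
print(hostile)  # the quote in the payload closes the literal early
```

In the second case the single quote supplied by the caller terminates the string literal, and everything after it is parsed as SQL rather than data.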

2. POST /api/analyze/natural_language_query/

The AI feature lets users pick which tables to query. Table names came from the request body and were interpolated directly into SQL:

# BEFORE — vulnerable
database, table = full_table_name.split(">>>>>")
condition = f"(database = '{database}' AND table = '{table}')"

Normal request body:

{
  "tables_to_query": ["default>>>>>events"],
  "query": "show me the top 10 slowest queries"
}

Produces safe SQL:

WHERE (database = 'default' AND table = 'events')

Attack request body:

{
  "tables_to_query": ["default>>>>>' OR 1=1 UNION SELECT name, '', query FROM system.users --"],
  "query": "anything"
}

Produces malicious SQL:

WHERE (database = 'default' AND table = '' OR 1=1 UNION SELECT name, '', query FROM system.users --')
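-- The goal is to read ClickHouse's system.users table; a real payload would also have to close the opening parenthesis that the -- comment leaves unbalanced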
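The same behaviour can be reproduced in isolation. This is a minimal sketch of the pre-fix logic; `build_table_condition` is a hypothetical name, and the payload is shortened to keep the output readable.

```python
def build_table_condition(full_table_name: str) -> str:
    """Hypothetical helper mirroring the pre-fix code: split, then interpolate."""
    database, table = full_table_name.split(">>>>>")
    return f"(database = '{database}' AND table = '{table}')"


benign = build_table_condition("default>>>>>events")
hostile = build_table_condition("default>>>>>' OR 1=1 --")

# A second separator makes the tuple unpacking itself fail, which the
# pre-fix code did not handle either.
try:
    build_table_condition("a>>>>>b>>>>>c")
    unpack_failed = False
except ValueError:
    unpack_failed = True

print(benign)
print(hostile)
print(unpack_failed)
```

Besides the injection, the unguarded unpacking means a malformed entry raises `ValueError` instead of producing a clean 400 response.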

🤔 Why not just use parameterized queries?

Good question. The run_query() function already accepts parameters, but parameterization only works for values (strings, numbers, dates). The %(conditions)s slot injects an entire SQL clause, not a single value, and run_query() fills it with plain Python % formatting:

# run_query uses Python % substitution, not driver-level binding
final_query = query % (params or {})

-- This is what %(conditions)s looks like in the template
WHERE
    query_start_time > now() - INTERVAL %(days)s day
    AND type = 2
    AND is_initial_query %(conditions)s
--                        ^^^^^^^^^^
--                        This is a SQL fragment, not a value.
--                        No driver can safely bind a fragment.

The conditions pattern is used across multiple SQL templates to let different endpoints share the same query structure. Refactoring all templates would be a much larger change. Input validation is the correct minimal fix here.
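To make the point about `%` substitution concrete, here is a small sketch. The WHERE clause is the one quoted above; the `SELECT count() FROM system.query_log` wrapper and the `render` function are assumptions added so the example runs on its own.

```python
# WHERE clause taken from the template shown above; the SELECT/FROM lines
# are an illustrative assumption, not the project's actual query.
QUERY_TEMPLATE = """
SELECT count()
FROM system.query_log
WHERE
    query_start_time > now() - INTERVAL %(days)s day
    AND type = 2
    AND is_initial_query %(conditions)s
"""


def render(query: str, params: dict) -> str:
    """Hypothetical stand-in for the substitution step inside run_query()."""
    return query % (params or {})


rendered = render(
    QUERY_TEMPLATE,
    {"days": 7, "conditions": "AND normalized_query_hash = 42 OR 1=1"},
)
print(rendered)
```

The `conditions` value is copied into the query text verbatim, so whatever code builds that fragment is the only line of defence. That is why the validation has to happen before the fragment is assembled.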


✅ The fix

Fix #1 (query_metrics): validate that the hash is a plain integer

normalized_query_hash in ClickHouse is a UInt64 — always a number like 14763824958273648. Numbers can't carry SQL syntax. If we confirm it's a valid integer before touching the SQL, we're safe.

# AFTER — fixed
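# isascii() is required as well: str.isdigit() alone accepts characters
# such as "²", which int() then rejects with a ValueError (HTTP 500).
if not (pk.isascii() and pk.isdigit()):
    return Response(status=400, data={"error": "Invalid query hash"})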

# Cast to int: now it's type-safe, and we compare UInt64 = UInt64 directly
# (removes the unnecessary toString() call too)
conditions = f"AND event_time > now() - INTERVAL 1 WEEK AND normalized_query_hash = {int(pk)}"

The attack ' OR 1=1 -- fails .isdigit() and gets a 400 immediately — never reaches the SQL builder.
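A slightly stricter variant of this check could live in a small helper. The sketch below is an assumption about how one might write it, not code from this PR: `parse_query_hash` and `UINT64_MAX` are hypothetical names. It restricts input to ASCII digits and also rejects values that do not fit in a UInt64.

```python
import re
from typing import Optional

UINT64_MAX = 2**64 - 1
# UInt64 values have at most 20 decimal digits; the length cap also keeps
# int() away from pathologically long inputs.
_UINT64_RE = re.compile(r"[0-9]{1,20}")


def parse_query_hash(pk: str) -> Optional[int]:
    """Return pk as an int if it is a valid UInt64 literal, otherwise None."""
    if not _UINT64_RE.fullmatch(pk):
        return None
    value = int(pk)
    return value if value <= UINT64_MAX else None


print(parse_query_hash("14763824958273648"))
print(parse_query_hash("' OR 1=1 --"))
print(parse_query_hash("²"))
```

An explicit `[0-9]` character class avoids depending on Unicode digit semantics, and `fullmatch` rejects a trailing newline that `match` with `$` would let through.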


Fix #2 (natural_language_query): validate identifiers with a regex

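Unquoted database and table names in ClickHouse consist only of letters, digits, and underscores. Names containing other characters are legal but must be quoted, and this endpoint has no need to support them. Restricting both parts to [a-zA-Z0-9_] rules out quotes, semicolons, spaces, and every other SQL-special character, so we enforce that rule before building the fragment: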

# AFTER — fixed (added at the top of analyze.py)
import re
_IDENTIFIER_RE = re.compile(r"^[a-zA-Z0-9_]+$")

# Inside natural_language_query:
for full_table_name in request.data["tables_to_query"]:
    if ">>>>>" not in full_table_name:
        return Response(status=400, data={"error": f"Invalid table format: {full_table_name!r}"})

    database, table = full_table_name.split(">>>>>", 1)   # maxsplit=1 for safety
    database, table = database.strip(), table.strip()

    if not _IDENTIFIER_RE.match(database) or not _IDENTIFIER_RE.match(table):
        return Response(status=400, data={"error": "Table names must contain only letters, digits, and underscores"})

    condition = f"(database = '{database}' AND table = '{table}')"
    table_schema_sql_conditions.append(condition)

The attack payload "default>>>>>' OR 1=1 --" gives table = "' OR 1=1 --" — which fails the regex and returns 400 Bad Request.
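The same validation can be expressed as a standalone function, which makes it easy to unit test. This is a sketch under assumptions: `parse_table_ref` and `SEPARATOR` are hypothetical names, and the `isinstance` guard is an addition that the PR code does not contain.

```python
import re
from typing import Optional, Tuple

SEPARATOR = ">>>>>"
_IDENTIFIER_RE = re.compile(r"[a-zA-Z0-9_]+")


def parse_table_ref(full_table_name: object) -> Optional[Tuple[str, str]]:
    """Return (database, table) if both parts are plain identifiers, else None."""
    # JSON bodies can contain numbers or nulls, so check the type first.
    if not isinstance(full_table_name, str):
        return None
    database, sep, table = full_table_name.partition(SEPARATOR)
    if not sep:
        return None  # separator missing
    database, table = database.strip(), table.strip()
    if not (_IDENTIFIER_RE.fullmatch(database) and _IDENTIFIER_RE.fullmatch(table)):
        return None
    return database, table


print(parse_table_ref("default>>>>>events"))
print(parse_table_ref("default>>>>>' OR 1=1 --"))
print(parse_table_ref("a>>>>>b>>>>>c"))
```

`partition` splits on the first separator only, so a second `>>>>>` ends up in the table part and fails the identifier check. Using `fullmatch` means the result does not depend on the earlier `strip()` call to remove a trailing newline.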


📋 Changes summary

| Location | Vulnerability | Fix |
| --- | --- | --- |
| query_metrics (line 79) | pk from URL pasted into SQL via .format() | pk.isdigit() guard + integer cast |
| natural_language_query (lines 273–274) | database/table from request body pasted into SQL via f-string | Regex ^[a-zA-Z0-9_]+$ validates both before use |

Files changed: housewatch/api/analyze.py

atiliohector33 force-pushed the fix/sql-injection-problem branch from 1a49f2e to 1ebd79b on June 27, 2026 22:45