High-performance URL blacklist aggregator with multi-bloom filtering and scoring.
Blacked collects threat intelligence from multiple sources (OISD, URLHaus, OpenPhish, PhishTank), decomposes every URL across 6 bloom dimensions, and answers "is this URL blocked?" in ~0.4 ms.
- Provider → Source pipeline with independent fetch, parse, and schedule per source. 3 active providers feeding 938K+ entries.
- 6-layer parallel check at ~0.4 ms. One entry → one bloom type. First hit wins, with parent-path cascade.
- Provider trust × depth weight. Single match uses trust directly. 5 levels: critical → informational.
- HTTP-agnostic `internal/query/` package. Testable standalone. Adapter pattern, zero framework lock-in.

| Capability | Detail |
|---|---|
| Multi-Source Aggregation | Provider → Source hierarchy with independent fetch/parse pipelines |
| Parallel Bloom Engine | 6 bloom layers (Domain → Host → HostPath → File → FullURL → IP), checked concurrently — first hit wins |
| Cascading Parent Match | /a/b/c/file.exe matches /a or /a/b via parent-path traversal at check time |
| Scoring & Levels | Provider trust × depth weight → 5 confidence levels (critical → informational) |
| Schedule-Aware Cache | Parametric TTL per source/provider, cron-triggered invalidation, app-restart resilience |
| Dual API | Bloom-only check (~0.4ms) and full hit (bloom + DB + score, ~5-15ms) |
| HTTP Agnostic Core | internal/query/ package decoupled from Echo, testable standalone |
| Built-in Metrics | Prometheus endpoints, execution tracing, pprof profiling |
| No Legacy | Greenfield schema, clean-slate policy — zero backward compatibility debt |
| Host Normalization | Entry.Host = url.Hostname() — port stripped. Bloom keys and DB confirmation use same format, no mismatch |
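
The host normalization row can be illustrated with a short sketch. It relies only on the standard library's `url.Parse` and `(*url.URL).Hostname()`; the helper name `normalizeHost` and the scheme-prepending step are assumptions made for illustration, not code taken from the project.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// normalizeHost returns the host of rawURL with any port removed.
// A scheme is prepended when none is present, because url.Parse only
// fills in the Host field for inputs shaped like "scheme://host/...".
func normalizeHost(rawURL string) (string, error) {
	if !strings.Contains(rawURL, "://") {
		rawURL = "http://" + rawURL
	}
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}
	// Hostname() strips the port and the brackets around IPv6 literals.
	return u.Hostname(), nil
}

func main() {
	inputs := []string{
		"https://cdn.evil.com:8443/malware/exploit.php",
		"cdn.evil.com:8080/malware",
		"http://[2001:db8::1]:443/x",
	}
	for _, raw := range inputs {
		host, err := normalizeHost(raw)
		fmt.Println(host, err)
	}
}
```

Using the same function at insert time and at check time is what keeps bloom keys and DB lookups in the same format.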
| Metric | Value |
|---|---|
| Bloom Check (P99) | 0.4 ms |
| Full Hit (bloom + DB + score) | 5–15 ms |
| CPU Usage (idle, 820K entries) | 1.28% |
| Heap (idle) | 101 MB |
| Sync Alloc (before → after perf fixes) | 2.36 GB → ~1.73 GB (−628 MB) |
| Sync Duration (3 providers, 826K entries) | ~109 s |
| E2E Tests | 14 / 14 · 0.59 s · No network calls |
┌──────────┐   ┌──────────┐   ┌───────────┐
│ Provider │   │ Provider │   │ Provider  │
│  (OISD)  │   │ (URLHaus)│   │(OpenPhish)│
└────┬─────┘   └────┬─────┘   └─────┬─────┘
     │              │               │
     ▼              ▼               ▼
┌─────────────────────────────────────────┐
│              Source Layer               │
│    (Fetcher + Parser per source URL)    │
└────────────────┬────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────┐
│     Pond Collector (batched writer)     │
│  ┌──────────┐  ┌──────────┐  ┌───────┐  │
│  │  SQLite  │  │  Badger  │  │ Bloom │  │
│  │  (WAL)   │  │  Cache   │  │ Sets  │  │
│  └──────────┘  └──────────┘  └───────┘  │
└─────────────────────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────┐
│      Query Core (internal/query/)       │
│     Check (bloom only) → Hit (full)     │
│      ┌──────────┐     ┌──────────┐      │
│      │  Scorer  │     │ Adapter  │      │
│      └──────────┘     └──────────┘      │
└────────────────┬────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────┐
│             REST API (Echo)             │
│  /api/v1/check        /api/v1/hit       │
│  /api/v1/bulk-check   /api/v1/bulk-hit  │
└─────────────────────────────────────────┘
Check URL "cdn.evil.com/malware/exploit.php?ref=bad"
        │
        ▼
ParseURL → GenerateKeys()
        │
        ├── Domain:   evil.com            → BloomDomain   ──┐
        ├── Host:     cdn.evil.com        → BloomHost       │
        ├── HostPath: cdn.evil.com/ma...  → BloomHostPath   ├── PARALLEL
        ├── File:     exploit.php         → BloomFile       │   First Hit
        ├── FullURL:  ...exploit.php?ref  → BloomFullURL  ──┘   Cancels All
        └── IP:       103.224.212.251     → BloomIP
                           │
                           ▼
        ┌──────────────────┴──────────────┐
        ▼                                 ▼
  ✔ HIT → 200 OK                   ❌ MISS → 204
  { type: "file",                   No Content
    source: "oisd",
    key: "exploit.php",
    confidence: 0.85,
    level: "high" }
- Go 1.26+
- Git
git clone https://github.com/runaho/blacked.git
cd blacked
# Download dependencies
go mod download
# Configure
cp .env.toml.copy .env.toml
# Edit to suit your environment
# Run the server
go run . serve

The server starts at http://localhost:8082.
# Process all providers immediately
go run . process
# Query a URL
go run main.go query --url "https://evil.com/path"
# JSON output
go run main.go query --url "https://evil.com" --json

| Endpoint | Method | Description | Latency |
|---|---|---|---|
| `/api/v1/check?url=` | GET | Bloom-only check (fast negative) | ~0.4 ms |
| `/api/v1/hit?url=` | GET | Bloom + DB confirmation + scorer; returns confidence, level, and matches | ~5–15 ms |
| `/api/v1/bulk-check` | POST | Batch bloom check (up to N URLs) | ~0.4 ms × N |
| `/api/v1/bulk-hit` | POST | Batch bloom + DB + scorer | ~5–15 ms × N |

Hit (200) — URL is blocked:
{
  "url": "https://cdn.evil.com/malware/exploit.php",
  "blocked": true,
  "confidence": 0.85,
  "level": "high",
  "matches": [{
    "type": "full_url",
    "key": "cdn.evil.com/malware/exploit.php",
    "source_id": "urlhaus-online"
  }]
}

Miss (204), URL is clean (or missing url parameter):
No Content
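
A client only needs the status code to interpret a check: 200 means blocked, 204 means clean. The sketch below shows one way to call the endpoint from Go, assuming the server runs at the default `http://localhost:8082`. The helper names `verdictFromStatus` and `isBlocked` are hypothetical and not part of the project.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
)

// verdictFromStatus maps the documented status codes to a verdict:
// 200 means blocked, 204 means clean. Anything else is reported as an error.
func verdictFromStatus(code int) (bool, error) {
	switch code {
	case http.StatusOK:
		return true, nil
	case http.StatusNoContent:
		return false, nil
	default:
		return false, fmt.Errorf("unexpected status %d", code)
	}
}

// isBlocked performs GET {baseURL}/api/v1/check?url=<target>.
// The target is query-escaped so its own "?" and "&" survive the trip.
func isBlocked(client *http.Client, baseURL, target string) (bool, error) {
	resp, err := client.Get(baseURL + "/api/v1/check?url=" + url.QueryEscape(target))
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	return verdictFromStatus(resp.StatusCode)
}

func main() {
	blocked, err := isBlocked(http.DefaultClient, "http://localhost:8082",
		"https://cdn.evil.com/malware/exploit.php?ref=bad")
	if err != nil {
		fmt.Println("check failed:", err)
		return
	}
	fmt.Println("blocked:", blocked)
}
```

Because a missing `url` parameter also yields 204, a caller that cares about the difference should validate its input before sending the request.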
Blacked uses .env.toml (TOML format). Key sections:
[APP]
environment = "development" # or "production"
log_level = "info"
[Server]
port = 8082
host = "localhost"
[Cache]
use_bloom = true
badger_path = ""
[Collector]
batch_size = 1000
cron_schedule = "0 0 * * *"
# Each provider is independently configured.
# enabled = false → provider is skipped entirely.
[providers.oisd-big]
enabled = true
source_url = "https://big.oisd.nl/domainswild2"
cron = "0 6 * * *"
category = "blocklist"
parser_workers = 4
parser_batch_size = 1000
[providers.phishtank-online-valid]
enabled = false
source_url = "https://data.phishtank.com/data/{api_key}/online-valid.json"
api_key = ""
cron = "45 */6 * * *"
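category = "phishing"

All provider settings come from .env.toml: source URLs, cron schedules, and categories are configured there rather than fixed in code, although a provider may still fall back to built-in defaults when a field is left empty. API keys are never committed to code; they live in the api_key field of the provider block or are injected via environment variables.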
Each provider is a Go package in features/providers/. Add a new TOML block in .env.toml, then implement a constructor:
// features/providers/myprovider/myprovider.go
func NewMyProvider(cfg *config.Config, collyClient *colly.Collector) base.Provider {
	const providerName = "myprovider"

	opts, ok := cfg.Providers[providerName]
	if !ok || opts == nil {
		opts = &config.ProviderOptions{} // defaults kick in
	}
	if opts.Enabled != nil && !*opts.Enabled {
		return nil
	}

	sourceURL := opts.SourceURL
	if sourceURL == "" {
		sourceURL = "https://example.com/feed.txt" // built-in default
	}
	cron := opts.Cron
	if cron == "" {
		cron = "0 */6 * * *" // built-in default
	}
	category := opts.Category
	if category == "" {
		category = "blocklist" // built-in default
	}
	workers := opts.ParserWorkers
	if workers <= 0 {
		workers = 4
	}
	batchSize := opts.ParserBatchSize
	if batchSize <= 0 {
		batchSize = 1000
	}

	client := base.BuildCollyClientForProvider(collyClient, opts)
	parseFunc := func(data io.Reader, collector entry_collector.Collector) error {
		return base.ParseLinesParallel(data, collector, providerName,
			workers, batchSize, func(line, processID string) (*entries.Entry, error) {
				// ... parse logic ...
			})
	}

	provider := base.NewBaseProvider(providerName, sourceURL, category, client, parseFunc)
	provider.SetCronSchedule(cron).Register()
	return provider
}

# .env.toml
[providers.myprovider]
enabled = true
source_url = "https://example.com/feed.txt"
cron = "0 */6 * * *"
category = "malware"
parser_workers = 4
parser_batch_size = 1000

# All unit and integration tests
go test ./... -count=1 -timeout 120s
# E2E bloom-aware tests (no network calls)
go test -tags=e2e ./features/e2e/... -v -timeout 60s
# Performance benchmarks
go test -bench=. ./features/web/handlers/benchmark/...

| # | Test | What it verifies |
|---|---|---|
| 1 | DomainBloom | Domain-level match |
| 2 | HostBloom | Exact host match |
| 3 | HostPathBloom | Path-level match |
| 4 | ParentPathBloom | Parent path traversal (/a → /a/b/c) |
| 5 | FileBloom | File name match (.exe) |
| 6 | FullURLBloom | File + query match; different query = miss |
| 7 | IPBloom | IP bloom populate |
| 8 | FirstHitWinsDomain | Domain wins over HostPath on same URL |
| 9 | CleanMiss | Clean URL → 204 |
| 10–12 | HitEndpoint, HitClean, EmptyURL | Hit response, clean hit, empty param |
| 13–14 | BulkCheck, BulkHit | Batch endpoints |
Each blacklist entry goes into exactly one bloom set — determined by what the source provides:

| Source provides | Bloom type | Key | Scope |
|---|---|---|---|
| evil.com | Domain | evil.com | Covers all subdomains |
| cdn.evil.com | Host | cdn.evil.com | Exact subdomain |
| cdn.evil.com/malware/ | HostPath | cdn.evil.com/malware | Folder-level block |
| exploit.php | File | exploit.php | File name, any path |
| cdn.evil.com/exploit.php?ref=x | FullURL | cdn.evil.com/exploit.php?ref=x | Exact request |
| 103.224.212.251 | IP | 103.224.212.251 | IP address |

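At check time, all 6 bloom sets are queried in parallel goroutines. The first true response cancels the rest through context cancellation (the cancel function returned by context.WithCancel). Bloom Test() runs in constant time regardless of how many entries the set holds, so the added goroutine overhead is negligible (~50 ns).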
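
The first-hit-wins pattern can be sketched as follows. This is a minimal illustration, not the project's implementation: the `layer` type uses a plain map as a stand-in for a real bloom filter, and the names `firstHit` and `check` are hypothetical.

```go
package main

import (
	"context"
	"fmt"
)

// layer stands in for one bloom set. A plain map replaces the real
// bloom filter so that the sketch has no external dependencies.
type layer struct {
	name string
	keys map[string]bool
}

// Test reports whether key is (probably) a member of the layer.
func (l layer) Test(key string) bool { return l.keys[key] }

// firstHit tests every layer in its own goroutine and returns the name of
// the first layer that reports a hit, or "" when every layer misses.
// keys maps a layer name to the key generated for that layer.
func firstHit(parent context.Context, layers []layer, keys map[string]string) string {
	ctx, cancel := context.WithCancel(parent)
	defer cancel()

	// Buffered so that late goroutines can always send and exit.
	results := make(chan string, len(layers))
	for _, l := range layers {
		go func(l layer) {
			if ctx.Err() != nil { // another layer already won
				results <- ""
				return
			}
			if l.Test(keys[l.name]) {
				results <- l.name
				return
			}
			results <- ""
		}(l)
	}

	for range layers {
		if name := <-results; name != "" {
			cancel() // tell the remaining goroutines to skip their test
			return name
		}
	}
	return ""
}

// check is a convenience wrapper that uses a background context.
func check(layers []layer, keys map[string]string) string {
	return firstHit(context.Background(), layers, keys)
}

func main() {
	layers := []layer{
		{name: "domain", keys: map[string]bool{"evil.com": true}},
		{name: "host", keys: map[string]bool{}},
		{name: "file", keys: map[string]bool{}},
	}
	keys := map[string]string{
		"domain": "evil.com",
		"host":   "cdn.evil.com",
		"file":   "exploit.php",
	}
	fmt.Println(check(layers, keys))
}
```

When more than one layer would hit, the winner depends on goroutine scheduling, so a caller that needs every match (as the full hit endpoint does) has to collect all results instead of returning early.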
Check: cdn.x.com/a/b/c/file.exe
Generate HostPath keys (shallowest → deepest):
/a
/a/b
/a/b/c
If source blacklisted cdn.x.com/a/b → HIT via parent path
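
Key generation for the parent-path cascade can be sketched like this. The function name `parentPathKeys` is hypothetical, and the sketch assumes that the last path segment is a file name and that keys take the `host/path` form without a trailing slash, as in the HostPath examples.

```go
package main

import (
	"fmt"
	"strings"
)

// parentPathKeys returns host+path prefixes ordered from shallowest to
// deepest. The final segment is treated as a file name and is excluded.
func parentPathKeys(host, path string) []string {
	segments := strings.Split(strings.Trim(path, "/"), "/")
	if len(segments) < 2 {
		return nil // no directory above the last segment
	}
	keys := make([]string, 0, len(segments)-1)
	prefix := host
	for _, seg := range segments[:len(segments)-1] {
		if seg == "" {
			continue // tolerate doubled slashes such as "/a//b"
		}
		prefix += "/" + seg
		keys = append(keys, prefix)
	}
	return keys
}

func main() {
	for _, key := range parentPathKeys("cdn.x.com", "/a/b/c/file.exe") {
		fmt.Println(key)
	}
}
```

Each generated key is then tested against the HostPath bloom set, so an entry for `cdn.x.com/a/b` matches any URL below that folder.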
Single match: confidence = provider trust score directly. A domain from a trusted source should reflect that trust — not be penalized for being "shallow."
Multiple matches (2+ bloom layers hit): depth weights are used to weigh matches against each other:
confidence = Σ(trust_score × depth_weight) / Σ(trust_score)
| Level | Score Range |
|---|---|
| Critical | ≥ 0.90 |
| High | ≥ 0.70 |
| Medium | ≥ 0.50 |
| Low | ≥ 0.25 |
| Informational | < 0.25 |
Depth weights: Domain 0.3 · Host 0.5 · HostPath 1.0 · File 0.7 · FullURL 1.5 · IP 0.8
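
The scoring rules above can be sketched in a few lines. The type and function names are hypothetical, the layer identifiers `host_path` and `ip` are assumed by analogy with the `full_url` and `file` values shown in the responses, and trust scores are assumed to lie in [0, 1].

```go
package main

import "fmt"

// match is one bloom layer hit together with the trust score of the
// provider that supplied the entry.
type match struct {
	layer string
	trust float64
}

// depthWeight holds the depth weights listed above.
var depthWeight = map[string]float64{
	"domain": 0.3, "host": 0.5, "host_path": 1.0,
	"file": 0.7, "full_url": 1.5, "ip": 0.8,
}

// confidence uses the provider trust directly for a single match and the
// trust-weighted average of depth weights for two or more matches.
func confidence(matches []match) float64 {
	switch len(matches) {
	case 0:
		return 0
	case 1:
		return matches[0].trust
	}
	var num, den float64
	for _, m := range matches {
		num += m.trust * depthWeight[m.layer]
		den += m.trust
	}
	if den == 0 {
		return 0
	}
	return num / den
}

// level maps a score to one of the five levels from the table above.
func level(score float64) string {
	switch {
	case score >= 0.90:
		return "critical"
	case score >= 0.70:
		return "high"
	case score >= 0.50:
		return "medium"
	case score >= 0.25:
		return "low"
	default:
		return "informational"
	}
}

func main() {
	single := []match{{layer: "domain", trust: 0.85}}
	multi := []match{
		{layer: "domain", trust: 0.8},
		{layer: "host_path", trust: 0.8},
	}
	fmt.Println(level(confidence(single)))
	fmt.Println(level(confidence(multi)))
}
```

Because the FullURL weight is 1.5, the weighted average can exceed 1.0 when a FullURL match dominates. Whether the result is clamped is not stated here, so the sketch leaves the value as computed.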
features/
├── bloom/ # Multi-Bloom Engine (types, manager, URL parser)
├── cache/ # BadgerDB cache layer
├── entries/ # Entry model, repository, services
├── entry_collector/ # Pond collector (batch writer + cache sync)
├── providers/ # Provider system (OISD, URLHaus, OpenPhish, PhishTank)
├── tests/ # Integration tests
├── web/ # Echo handlers, routes, middleware
└── e2e/ # Bloom-aware E2E tests (no network)
internal/
├── collector/ # Prometheus metrics collector
├── colly/ # Colly HTTP client wrapper
├── config/ # TOML-based configuration
├── db/ # SQLite connection pool (read/write split), migrations
├── db/models/ # DB models (Provider, Source, Entry)
├── logger/ # Zerolog logger setup
├── query/ # HTTP-agnostic query core (service, scorer, types)
├── runner/ # gocron scheduler + provider executor
├── telemetry/ # OTLP tracing setup
├── testutil/ # Test helpers (DB, collector init)
├── tracing/ # Execution tracing
└── utils/ # Response cache, utilities
MIT — see LICENSE.