A CUDA implementation of the by-C^z parametrization for empirical verification of Beal's conjecture, designed for long-running searches on a single consumer GPU.
Beal's conjecture states that if A^x + B^y = C^z with positive integers and exponents x, y, z ≥ 3, then A, B, C must share a common prime factor. Empirical verification has been done by Norvig (2000), Vanderkam, Watkins, Jarnicki & Konerding, William Root, notabs.org, and others — most using a by-base parametrization (iterate over A and B, check if A^x + B^y is a perfect power).
Bill Butler ("Durango Bill") documented a different approach: by-C^z (iterate over (C, z), search for (A, x, B, y) on the left side via BSGS-style hashing). His work ran on CPU.
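For intuition, the by-C^z idea can be sketched in a few lines of pure Python. This is a toy version under illustrative bounds: a plain dict stands in for the GPU hash table, and `by_cz` is a hypothetical name, not a function from this repo.

```python
# Toy by-C^z search: outer loop over (C, z), then for each left exponent x,
# map A^x -> A and probe with C^z - B^y. Illustrative only, not the GPU kernel.
def by_cz(c_max, exp_max):
    hits = []
    for C in range(2, c_max + 1):
        for z in range(3, exp_max + 1):
            T = C ** z
            for x in range(3, exp_max + 1):
                powmap = {}                      # A^x -> A ("build" side)
                A = 2
                while A ** x < T:
                    powmap[A ** x] = A
                    A += 1
                for y in range(3, exp_max + 1):  # "sweep" side
                    B = 2
                    while B ** y < T:
                        A_hit = powmap.get(T - B ** y)
                        if A_hit is not None:
                            hits.append((A_hit, x, B, y, C, z))
                        B += 1
    return hits
```

Note that A and B are bounded only by A^x < C^z, which is the structural reason this parametrization needs no explicit cap on the bases.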
beal_bigint is a GPU implementation of the by-C^z parametrization. It
complements the existing CPU work and surfaces hits that pure by-base searches
do not naturally produce — for example:
9565938^3 + 27^14 = 81^11 (gcd = 27, derives from 2³ × 3⁴²)
This is not a counterexample (gcd > 1). But it lives in the (A, B) space where one base far exceeds typical search bounds, which by-base searches restrict for tractability.
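The example hit is easy to confirm with Python's arbitrary-precision integers:

```python
from math import gcd

# Verify the identity and its structure
assert 9565938**3 + 27**14 == 81**11
assert gcd(gcd(9565938, 27), 81) == 27   # common factor, hence not a counterexample
assert 9565938 == 2 * 3**14              # so the left base cubed is 2^3 * 3^42
```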
beal_bigint removes that restriction by:
- Outer iteration over (C, z) instead of (A, B) — no upper bound on A or B
- BSGS-style hashing with dual Solinas primes (P1=2^61-31, P2=2^59-55)
- GPU-resident hash table (up to 8 GB / 512M slots) for parallel lookup
- M6 coverage threshold (skip A_max,B_max > 10M) — 20× speedup with ~12% hit loss
- M7 checkpoint+resume — 28-byte hit records persisted with fsync after each (C, z)
Tested on RTX 5070 (SM 12.0 / Blackwell), CUDA 12.9, Linux Mint 22.1.
| Tier | C bound | Exponent range | Wall time | Hits | Coprime hits | Throughput |
|---|---|---|---|---|---|---|
| 1 | 1,000 | 3-12 | 30 s | 829 | 0 | 28 hits/s |
| 2 | 2,000 | 3-12 | 58 s | 1,031 | 0 | 18 hits/s |
| 3 | 5,000 | 3-15 | 187 s | 1,646 | 0 | 9 hits/s |
| 4 | 10,000 | 3-15 | 370 s | 2,096 | 0 | 6 hits/s |
| 5 | 20,000 | 3-15 | 754 s | 2,719 | 0 | 4 hits/s |
| 6 | 50,000 | 3-15 | 1,625 s | 3,858 | 0 | 2 hits/s |
| 9 | 500,000 | 3-15 | 13,377 s | 8,162 | 0 | 0.6 hits/s |
Cumulative wall time: ~4.4 hours. Total unique hits: 20,341. Counterexamples found: 0 (as expected; Beal is widely conjectured true).
Doubling the C bound grows wall time by ~1.85-2.04× (roughly linear in the bound). Hits grow only ~1.27-1.30× per doubling: diminishing returns, as new C values contribute fewer factor-rich combinations.
| Mode | Wall | Hits | Throughput |
|---|---|---|---|
| M6 default (skip > 10M) | 5.6 s | 511 | 91 hits/s |
| `--exhaustive` (no skip) | 115 s | 572 | 5 hits/s |
Default loses ~12% of hits, runs ~20× faster. At bound ≥ 1000, exhaustive mode becomes impractical on a single RTX 5070.
- build_hash kernel: 151 s (40% of GPU work)
- sweep_mega kernel: 205 s (55%)
- memsets/memcpy/sync: 14 s (4%)
- host setup: 0.7 s (~0.2%)
Cycles with A_max in [1M, 10M] account for >90% of GPU time. The M6 skip cuts the >10M tail entirely, which is the entire reason overnight bound=500K runs are feasible.
```
┌──────────────────────────────────────────────────┐
│ Host (Ryzen 9800X3D, GMP, libgmp)                │
│  ├── (C, z) outer loop — 128-bit pow             │
│  ├── iroot128(T, x) — A_max via GMP              │
│  ├── checkpoint + hits file (fsync per C,z)      │
│  └── final GMP verification of candidates        │
├──────────────────────────────────────────────────┤
│ GPU (RTX 5070 Blackwell, 12 GB VRAM)             │
│  ├── d_H[] — 8 GB hash table (512M slots)        │
│  ├── d_epoch[] — 2 GB epoch tags (M4: no memset) │
│  ├── d_out[] — match buffer                      │
│  ├── kernel_build_hash<P1> (Solinas 61-bit)      │
│  └── kernel_sweep_mega<P1,P2> (dual-prime filt)  │
├──────────────────────────────────────────────────┤
│ Solinas reduction primitives:                    │
│   P1 = 2^61 - 31, P2 = 2^59 - 55                 │
│   Knuth k2 secondary hash (16-bit suffix)        │
└──────────────────────────────────────────────────┘
```
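The host's A_max computation is an integer x-th root of T = C^z (GMP's root function in the real code). A minimal Python stand-in by bisection, with `iroot` as an illustrative name:

```python
# floor(t ** (1/k)) for non-negative integers, exact (no floating point)
def iroot(t: int, k: int) -> int:
    if t < 0 or k < 1:
        raise ValueError("t must be >= 0 and k >= 1")
    lo, hi = 0, 1
    while hi ** k <= t:          # find an upper bound by doubling
        hi <<= 1
    while hi - lo > 1:           # invariant: lo**k <= t < hi**k
        mid = (lo + hi) // 2
        if mid ** k <= t:
            lo = mid
        else:
            hi = mid
    return lo
```

A cycle's A_max would then be `iroot(C**z, x)`.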
```
make ARCH=sm_120

# V1-compat sanity (must produce 17 hits in ~0.02s):
./beal_bigint 50 3 6

# Validation matrix (~15s, must pass before production):
make calibrate

# Tier 1 example (~30s):
./scripts/long_run.sh 1000 3 12 --raw

# Tier 6 example (~30 min):
./scripts/long_run.sh 50000 3 15 --raw

# Tier 9 overnight (~3-7 hours):
./scripts/long_run.sh 500000 3 15 --raw
```

If interrupted (Ctrl-C, crash, power loss), re-run the same command;
auto-resume continues from the last completed (C, z). To start fresh and
discard the previous checkpoint, add --restart.
Clean shutdown: `pkill -INT beal_bigint` (or Ctrl-C if foreground)
waits for the current (C, z) to finish, saves state, and exits.
| File | Description |
|---|---|
| `src/beal_bigint.cu` | CUDA + host main (everything in one TU) |
| `oracle/oracle_by_cz.py` | Python oracle for cross-validation |
| `scripts/long_run.sh` | Production wrapper (M7-aware, line-buffered) |
| `scripts/calibrate.sh` | Validation matrix |
| `scripts/run_range.sh` | Single-range wrapper |
| `scripts/coverage_ledger.py` | Ledger query/status |
| `scripts/memory_budget.py` | VRAM planning |
| `docs/DESIGN.md` | Architecture notes |
| `RESULTS.md` | Production tier results (numbers + hits found) |
| `NOTES.md` | Per-milestone changelog (M2-M7.1) |
| `M6_NOTES.md` | M6 trade-off rationale |
| `M7_NOTES.md` | M7 checkpoint design |
- NVIDIA GPU (Volta or newer, SM 7.0+; tested on Blackwell SM 12.0)
- CUDA Toolkit 11.0+ (12.x recommended)
- GMP library (libgmp-dev) — for host-side 128-bit arithmetic
- Linux (tested on Linux Mint 22.1)
```
# Ubuntu/Debian/Mint
sudo apt install libgmp-dev

# Build
make ARCH=sm_120
```

The code uses `__uint128_t` (host side) and Solinas reduction templated by prime constant; nothing is intrinsically Blackwell-specific.
| Architecture | SM | GPUs | Status |
|---|---|---|---|
| Volta | 7.0 | V100 | Should work (untested) |
| Turing | 7.5 | RTX 2080 | Should work (untested) |
| Ampere | 8.x | RTX 3090, A100 | Should work (untested) |
| Ada Lovelace | 8.9 | RTX 4090 | Should work (untested) |
| Hopper | 9.0 | H100 | Should work (untested) |
| Blackwell | 12.0 | RTX 5070/5080/5090 | Tested ✓ |
Contributions of benchmark results on other GPUs are welcome.
By default, the search skips:
- (C, z, x) cycles where A_max > 10M
- y-slots where B_max > 10M
This cuts ~95% of GPU work while losing ~12% of hits at bound=500. Profiling showed cycles with A_max in [1M, 10M] dominate runtime.
Tunable:
| Flag | Behavior |
|---|---|
| (default) | Skip A_max, B_max > 10M |
| `--max-Amax=N` | Override threshold to N |
| `--exhaustive` | Disable skip entirely (impractical at scale) |
Production runs save state after every (C, z) cycle:
- `<checkpoint>`: text file with the last completed (C, z)
- `<checkpoint>.hits`: binary append-only file, 28 bytes per hit (A, x, B, y, C, z, gcd, each a `uint32_t`)
State is fsync-ed after each (C, z). Auto-resume is the default. The
wrapper `scripts/long_run.sh` sets all paths automatically.
Reading the binary hits file in Python:
```python
import struct

with open("results/run.checkpoint.hits", "rb") as f:
    while chunk := f.read(28):
        A, x, B, y, C, z, g = struct.unpack("7I", chunk)
        print(f"{A}^{x} + {B}^{y} = {C}^{z}  gcd={g}")
```

The summary line `coprime hits (Beal!): N` reports confirmed hits where
gcd(A, B, C) = 1. A coprime hit with x, y, z ≥ 3 would constitute a
counterexample to Beal's conjecture.
Coprime hits are always verified with full-precision GMP, regardless of
`--skip-gmp-verify`. Across all runs documented in RESULTS.md,
this counter has been zero.
If you obtain a coprime hit while running this software:
- Preserve the log file and the `.checkpoint.hits` file.
- Re-verify independently with a different toolchain (Python + GMP, SageMath, PARI/GP, Mathematica).
- Open an issue here with the verification artifacts.
Both primes are chosen for cheap modular reduction via Solinas form:
```
P1 = 2^61 - 31  →  x mod P1 via (x_hi * 31 + x_lo), folded twice, then one conditional subtract
P2 = 2^59 - 55  →  same pattern with constant 55
```
The templated kernel `mulmod_solinas_dev<P, C>` is instantiated once per prime, and was cross-validated against the bitcoin-core/secp256k1 reference implementation.
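The fold-twice pattern can be sketched in Python for primes of the form P = 2^k - c, assuming the input is a product of two residues below P (so x < P^2). `solinas_mod` is an illustrative name, not the kernel's:

```python
# Reduce x modulo P = 2^k - c using the Solinas identity
# x_hi * 2^k ≡ x_hi * c (mod P): fold the high bits down twice,
# then at most one conditional subtraction remains.
def solinas_mod(x: int, k: int, c: int) -> int:
    p = (1 << k) - c
    mask = (1 << k) - 1
    for _ in range(2):
        x = (x >> k) * c + (x & mask)   # fold high part back into range
    if x >= p:
        x -= p
    return x
```

For the small constants used here (31 and 55), two folds bring any x < P^2 below 2P, which is why a single conditional subtraction suffices.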
16-byte HashEntry: 32-bit A, 32-bit x, 64-bit secondary hash (Knuth k2). Open-addressing with linear probing. Load factor capped at 0.5. False positive rate from Solinas+Knuth is below detection threshold across 12,179 confirmed hits (0 FPs).
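The 16-byte layout can be illustrated with Python's `struct` module (the field order is assumed from the description above, not taken from the source):

```python
import struct

# u32 A, u32 x, u64 secondary hash — 4 + 4 + 8 = 16 bytes, little-endian
entry = struct.pack("<IIQ", 9565938, 3, 0x0123456789ABCDEF)
assert len(entry) == 16
A, x, h2 = struct.unpack("<IIQ", entry)
```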
Per-cycle hash table reset would require 8 GB memset (~600 ms each) — prohibitive at 100K+ cycles. Instead, each entry carries a 32-bit epoch tag; lookups treat entries with stale epoch as empty. Epoch increments per (C, z, x), wraps every 4B cycles (never observed in practice). Eliminates per-cycle memset entirely.
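The epoch mechanism can be modeled in a few lines of Python (class and field names are hypothetical, not the kernel's layout):

```python
# A slot whose epoch tag differs from the current epoch reads as empty,
# so advancing the epoch "clears" the table in O(1) instead of a bulk memset.
class EpochTable:
    def __init__(self, nslots):
        self.epoch = [0] * nslots   # per-slot epoch tag (0 = never used)
        self.key = [0] * nslots
        self.val = [0] * nslots
        self.cur = 1                # current cycle's epoch

    def new_cycle(self):
        self.cur += 1               # every existing entry goes stale at once

    def insert(self, k, v):
        i = k % len(self.key)
        while self.epoch[i] == self.cur:        # linear probe over live slots
            i = (i + 1) % len(self.key)         # (real table caps load at 0.5)
        self.epoch[i], self.key[i], self.val[i] = self.cur, k, v

    def lookup(self, k):
        i = k % len(self.key)
        while self.epoch[i] == self.cur:        # stop at first stale/empty slot
            if self.key[i] == k:
                return self.val[i]
            i = (i + 1) % len(self.key)
        return None
```

After `new_cycle()`, every lookup for an old key returns None without any per-slot work having been done.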
- M2 (binary exp + Solinas): baseline functional kernel
- M3.7 (sync amortization): D2H out of y-loop. Helped small cases (1.8-2.5×), no effect on raw_b500.
- M3.8 (mega-sweep, 1 launch all y): no improvement, debunked the latency hypothesis.
- M3.9 (3-bucket timing instrumentation): revealed gpu_work = 100% of wall, not host launch overhead.
- M4 (epoch hash, no memset): eliminates per-cycle 8 GB clear.
- M5 (cudaEvents per-kernel): build_hash 41%, sweep_mega 45%, kernels are real bottleneck.
- M6 (skip A_max,B_max > 10M): 20× speedup at 12% hit cost. The single most important engineering decision in the project.
- M7 (checkpoint + incremental hits): enables overnight runs surviving Ctrl-C, crash, power loss.
- M7.1 (auto-resume default): re-run the same command; no `--resume` flag needed.
- `--exhaustive` mode at scale: removing the M6 skip costs 20× wall time for ~12% more hits at bound = 500. At bound ≥ 1000, the runs become impractical on a single RTX 5070. The skip is necessary, not optional.
- GPU Bloom filter for hash pre-screening: added complexity, no measurable speedup at observed load factors. Removed.
- Endomorphism canonical form / cheap-second-point in kernel: code complexity not justified by measured impact. (The same engineering pattern was refuted in PSCKangaroo by independent forum review.)
- SSD-Hybrid TAME-store architecture: GPU produces ~90K DPs/s; no random-write SSD strategy keeps up while preserving hash semantics. Abandoned. In-RAM 16-byte compact entries are correct for this workload.
- Diagnosis-by-guessing: spent multiple iterations theorizing about the bottleneck (host CPU loop? buffering? GPU stall?) before instrumenting per-kernel timing (M5). Instrument before theorizing; added as a guideline.
This is the third iteration of an ongoing Beal verification project by the same author. Earlier iterations are not currently published:
- V1 (`beal_blackwell`): Frente A (by-base parametrization), validated up to bound 1M, 15 datasets, internal release only.
- V2 (`beal_blackwell_v2`): Frente A rebalanced with 36-bit C_idx, bound 10M, 11 datasets, internal release only.
- V3 (`beal_bigint`, this repo): Frente B (by-C^z parametrization), GPU implementation. First publicly released iteration of this project.

V3 was cross-validated against V1's by-base output via the `--v1-compat` mode.
Developed and validated on:
- NVIDIA RTX 5070 (Blackwell, SM 12.0), 12 GB VRAM
- Ryzen 9800X3D, 128 GB system RAM
- Linux Mint 22.1, CUDA 12.9, GCC 13, GMP 6.3
- Off-grid solar power supply
pscamillo — independent developer working on GPU kernels for number-theoretic computation.
- mr_blackwell — Native Miller-Rabin GPU kernel for NVIDIA Blackwell.
- PSCKangaroo — GPU-accelerated Pollard's Kangaroo for secp256k1 ECDLP.
Apache 2.0. See LICENSE and NOTICE.
- Bill Butler ("Durango Bill") for documenting the by-C^z approach
- Peter Norvig, Dan Vanderkam, Noah Watkins, Witold Jarnicki & David Konerding, William Root, notabs.org — prior empirical verification work
- GMP for runtime arbitrary-precision arithmetic
- Anthropic Claude — pair-programming partner
If this project is useful to you, consider starring the repository — it helps visibility and signals to others that the work matters.
If you want to support continued development directly:
BTC (bech32): bc1qxahfnh6pcpv80plw6lxv8enus0ggpqxu0cl243
Hardware, electricity, and tooling costs are real, though partially offset by solar power. Contributions of code, benchmarks, bug reports, and documentation are also welcome — open an issue or pull request.