beal_bigint — GPU search for Beal counterexamples (by-C^z parametrization)

A CUDA implementation of the by-C^z parametrization for empirical verification of Beal's conjecture, designed for long-running searches on a single consumer GPU.

Why?

Beal's conjecture states that if A^x + B^y = C^z with positive integers and exponents x, y, z ≥ 3, then A, B, C must share a common prime factor. Empirical verification has been done by Norvig (2000), Vanderkam, Watkins, Jarnicki & Konerding, William Root, notabs.org, and others — most using a by-base parametrization (iterate over A and B, check if A^x + B^y is a perfect power).

Bill Butler ("Durango Bill") documented a different approach: by-C^z (iterate over (C, z), search for (A, x, B, y) on the left side via BSGS-style hashing). His work ran on CPU.

beal_bigint is a GPU implementation of the by-C^z parametrization. It complements the existing CPU work and surfaces hits that pure by-base searches do not naturally produce — for example:

9565938^3 + 27^14 = 81^11   (gcd = 27, derives from 2³ × 3⁴²)

This is not a counterexample (gcd > 1). But it lies in a region of (A, B) space where one base far exceeds the bounds that by-base searches impose for tractability.

beal_bigint reaches that region by:

  • Outer iteration over (C, z) instead of (A, B) — no upper bound on A or B
  • BSGS-style hashing with dual Solinas primes (P1=2^61-31, P2=2^59-55)
  • GPU-resident hash table (up to 8 GB / 512M slots) for parallel lookup
  • M6 coverage threshold (skip A_max,B_max > 10M) — 20× speedup with ~12% hit loss
  • M7 checkpoint+resume — 28-byte hit records persisted with fsync after each (C, z)
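The overall by-C^z flow can be sketched in pure Python. This is a toy model only: `search_cz`, the exponent range, and the `base_cap` bound are illustrative, and an exact-integer dict stands in for the GPU hash table of Solinas residues.

```python
import math

def search_cz(C, z, exp_min=3, exp_max=6, base_cap=200):
    """For a fixed (C, z), find all (A, x, B, y) with A^x + B^y = C^z.

    Sketch of the by-C^z idea: tabulate A^x values below T = C^z, then
    for each candidate B^y look up the complement T - B^y. The real
    kernel stores residues mod two Solinas primes and probes a GPU hash
    table; this sketch keeps exact integers for clarity.
    """
    T = C ** z
    table = {}  # A^x -> (A, x); colliding representations overwrite (fine for a toy)
    for x in range(exp_min, exp_max + 1):
        A_max = int(round(T ** (1.0 / x))) + 1  # float root + margin; v < T filters
        for A in range(2, min(A_max, base_cap) + 1):
            v = A ** x
            if v < T:
                table[v] = (A, x)
    hits = []
    for y in range(exp_min, exp_max + 1):
        for B in range(2, base_cap + 1):
            rem = T - B ** y
            if rem <= 1:
                break  # B^y has reached T; larger B only overshoots further
            if rem in table:
                A, x = table[rem]
                g = math.gcd(math.gcd(A, B), C)
                hits.append((A, x, B, y, C, z, g))
    return hits
```

For example, `search_cz(3, 5)` finds the classic identity 3^3 + 6^3 = 3^5 (in both (A, B) orderings, each with gcd 3).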

Performance

Tested on RTX 5070 (SM 12.0 / Blackwell), CUDA 12.9, Linux Mint 22.1.

Production tier ladder (RAW mode, M6 default 10M skip)

| Tier | C bound | Exponents | Wall time | Hits | Coprime | Throughput |
|------|---------|-----------|-----------|------|---------|------------|
| 1 | 1,000 | 3-12 | 30 s | 829 | 0 | 28 hits/s |
| 2 | 2,000 | 3-12 | 58 s | 1,031 | 0 | 18 hits/s |
| 3 | 5,000 | 3-15 | 187 s | 1,646 | 0 | 9 hits/s |
| 4 | 10,000 | 3-15 | 370 s | 2,096 | 0 | 6 hits/s |
| 5 | 20,000 | 3-15 | 754 s | 2,719 | 0 | 4 hits/s |
| 6 | 50,000 | 3-15 | 1,625 s | 3,858 | 0 | 2 hits/s |
| 9 | 500,000 | 3-15 | 13,377 s | 8,162 | 0 | 0.6 hits/s |

Cumulative wall time: ~4.4 hours. Total unique hits: 20,341. Counterexamples found: 0 (as expected; Beal is widely conjectured true).

Doubling C grows wall time ~1.85-2.04× (sub-linear). Hits grow ~1.27-1.30× per doubling — diminishing returns as new C values have fewer factor-rich combinations.

M6 trade-off (validated at bound = 500, exp 3-10)

| Mode | Wall time | Hits | Throughput |
|------|-----------|------|------------|
| M6 default (skip > 10M) | 5.6 s | 511 | 91 hits/s |
| --exhaustive (no skip) | 115 s | 572 | 5 hits/s |

Default loses ~12% of hits, runs ~20× faster. At bound ≥ 1000, exhaustive mode becomes impractical on a single RTX 5070.

Bottleneck breakdown (M5 cudaEvents, Tier 4)

```
build_hash kernel:     151 s   (40% of GPU work)
sweep_mega kernel:     205 s   (55%)
memsets/memcpy/sync:    14 s   (4%)
host setup:              0.7 s (~0.2%)
```

Cycles with A_max in [1M, 10M] account for >90% of GPU time. The M6 skip cuts the >10M tail entirely, which is the entire reason overnight bound=500K runs are feasible.

Architecture

```
┌──────────────────────────────────────────────────┐
│  Host (Ryzen 9800X3D, GMP, libgmp)               │
│  ├── (C, z) outer loop — 128-bit pow             │
│  ├── iroot128(T, x) — A_max via GMP              │
│  ├── checkpoint + hits file (fsync per C,z)      │
│  └── final GMP verification of candidates        │
├──────────────────────────────────────────────────┤
│  GPU (RTX 5070 Blackwell, 12 GB VRAM)            │
│  ├── d_H[]    — 8 GB hash table (512M slots)     │
│  ├── d_epoch[]— 2 GB epoch tags (M4: no memset)  │
│  ├── d_out[]  — match buffer                     │
│  ├── kernel_build_hash<P1>     (Solinas 61-bit)  │
│  └── kernel_sweep_mega<P1,P2>  (dual-prime filt) │
├──────────────────────────────────────────────────┤
│  Solinas reduction primitives:                   │
│  P1 = 2^61 - 31, P2 = 2^59 - 55                  │
│  Knuth k2 secondary hash (16-bit suffix)         │
└──────────────────────────────────────────────────┘
```
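The `iroot128(T, x)` step in the host column computes A_max = floor(T^(1/x)). A minimal pure-Python model of that step (the real code calls into GMP; `iroot` here is a hypothetical stand-in):

```python
def iroot(T, x):
    """Floor of the x-th root of a non-negative integer T.

    Models the host-side iroot128(T, x) step (the real code uses GMP).
    A float seed plus integer correction keeps the result exact even
    when the float root rounds the wrong way.
    """
    if T < 2:
        return T
    r = int(round(T ** (1.0 / x)))  # float seed, possibly off by one
    while r ** x > T:               # correct downward
        r -= 1
    while (r + 1) ** x <= T:        # correct upward
        r += 1
    return r
```

For x ≥ 3 and T below 2^128 the float seed is within one of the true root, so the correction loops terminate almost immediately.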

Quick Start

Sanity check

```
make ARCH=sm_120

# V1-compat sanity (must produce 17 hits in ~0.02s):
./beal_bigint 50 3 6
```

Full validation

```
# Validation matrix (~15s, must pass before production):
make calibrate
```

Production run

```
# Tier 1 example (~30s):
./scripts/long_run.sh 1000 3 12 --raw

# Tier 6 example (~30 min):
./scripts/long_run.sh 50000 3 15 --raw

# Tier 9 overnight (~3-7 hours):
./scripts/long_run.sh 500000 3 15 --raw
```

If interrupted (Ctrl-C, crash, power loss), re-run the same command — auto-resume continues from the last completed (C, z). To start fresh and discard the previous checkpoint, add --restart.

Clean shutdown: `pkill -INT beal_bigint` (or Ctrl-C in the foreground) waits for the current (C, z) cycle to finish, saves state, and exits.

Files

| File | Description |
|------|-------------|
| src/beal_bigint.cu | CUDA + host main (everything in one TU) |
| oracle/oracle_by_cz.py | Python oracle for cross-validation |
| scripts/long_run.sh | Production wrapper (M7-aware, line-buffered) |
| scripts/calibrate.sh | Validation matrix |
| scripts/run_range.sh | Single-range wrapper |
| scripts/coverage_ledger.py | Ledger query/status |
| scripts/memory_budget.py | VRAM planning |
| docs/DESIGN.md | Architecture notes |
| RESULTS.md | Production tier results (numbers + hits found) |
| NOTES.md | Per-milestone changelog (M2-M7.1) |
| M6_NOTES.md | M6 trade-off rationale |
| M7_NOTES.md | M7 checkpoint design |

Build Requirements

  • NVIDIA GPU (Volta or newer, SM 7.0+; tested on Blackwell SM 12.0)
  • CUDA Toolkit 11.0+ (12.x recommended)
  • GMP library (libgmp-dev) — for host-side 128-bit arithmetic
  • Linux (tested on Linux Mint 22.1)

```
# Ubuntu/Debian/Mint
sudo apt install libgmp-dev

# Build
make ARCH=sm_120
```

Supported architectures

The code uses `__uint128_t` (host) and Solinas reduction templated on the prime constant; nothing is intrinsically Blackwell-specific.

| Architecture | SM | GPUs | Status |
|--------------|-----|------|--------|
| Volta | 7.0 | V100 | Should work (untested) |
| Turing | 7.5 | RTX 2080 | Should work (untested) |
| Ampere | 8.x | RTX 3090, A100 | Should work (untested) |
| Ada Lovelace | 8.9 | RTX 4090 | Should work (untested) |
| Hopper | 9.0 | H100 | Should work (untested) |
| Blackwell | 12.0 | RTX 5070/5080/5090 | Tested ✓ |

Contributions of benchmark results on other GPUs are welcome.

Coverage strategy (M6)

By default, the search skips:

  • (C, z, x) cycles where A_max > 10M
  • y-slots where B_max > 10M

This cuts ~95% of GPU work while losing ~12% of hits at bound=500. Profiling showed cycles with A_max in [1M, 10M] dominate runtime.

Tunable:

| Flag | Behavior |
|------|----------|
| (default) | Skip A_max, B_max > 10M |
| --max-Amax=N | Override threshold to N |
| --exhaustive | Disable skip entirely (impractical at scale) |

Checkpoint and resume (M7)

Production runs save state after every (C, z) cycle:

  • <checkpoint> — text file with last completed (C, z)
  • <checkpoint>.hits — binary append-only, 28 bytes per hit (A, x, B, y, C, z, gcd as uint32_t)

State is fsync-ed after each (C, z). Auto-resume is the default. The wrapper scripts/long_run.sh sets all paths automatically.

Reading the binary hits file in Python:

```python
import struct
with open("results/run.checkpoint.hits", "rb") as f:
    while chunk := f.read(28):
        A, x, B, y, C, z, g = struct.unpack("7I", chunk)
        print(f"{A}^{x} + {B}^{y} = {C}^{z}  gcd={g}")
```

Coprime hits and counterexamples

The summary line `coprime hits (Beal!): N` reports confirmed hits where gcd(A, B, C) = 1. A coprime hit with x, y, z ≥ 3 would constitute a counterexample to Beal's conjecture.

Coprime hits are always verified with full-precision GMP, regardless of --skip-gmp-verify. Across all runs documented in RESULTS.md, this counter has been zero.

If you obtain a coprime hit while running this software:

  1. Preserve the log file and .checkpoint.hits file.
  2. Re-verify independently with a different toolchain (Python+GMP, SageMath, PARI/GP, Mathematica).
  3. Open an issue here with the verification artifacts.
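For step 2, a first independent re-check needs nothing beyond exact integer arithmetic. A sketch in plain Python (no GMP dependency; `verify_hit` is a hypothetical helper, not part of this repo):

```python
from math import gcd

def verify_hit(A, x, B, y, C, z):
    """Re-verify a candidate hit with exact integer arithmetic.

    Returns (is_identity, is_coprime): whether A^x + B^y = C^z holds
    exactly, and whether gcd(A, B, C) = 1. A coprime identity with
    x, y, z >= 3 would be a Beal counterexample.
    """
    ok = A ** x + B ** y == C ** z
    coprime = gcd(gcd(A, B), C) == 1
    return ok, coprime

# The gcd = 27 example hit from above:
print(verify_hit(9565938, 3, 27, 14, 81, 11))  # → (True, False)
```

Python integers are arbitrary-precision, so this check is exact at any size the search can reach.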

Technical Details

Solinas reduction

Both primes are chosen for cheap modular reduction via Solinas form:

```
P1 = 2^61 - 31    →  x mod P1  =  (x_hi * 31 + x_lo), reduce ×2
P2 = 2^59 - 55    →  same pattern with 55
```

The templated device function `mulmod_solinas_dev<P, C>` is instantiated once per prime. Cross-validated against the bitcoin-core/secp256k1 reference implementation.
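The folding pattern can be modeled in a few lines of Python. This is an illustrative model of the reduction only, not the CUDA code (which does the same arithmetic on 64-bit limbs): since 2^bits ≡ c (mod P), the product splits into hi·2^bits + lo and folds to hi·c + lo; two folds plus one conditional subtract land below P.

```python
P1 = (1 << 61) - 31   # Solinas prime for build_hash
P2 = (1 << 59) - 55   # second prime for the dual-prime filter

def mulmod_solinas(a, b, bits=61, c=31):
    """Model of a*b mod P for P = 2^bits - c via Solinas folding.

    First fold drops a ~122-bit product to ~66 bits, second fold to
    below 2P, so a single conditional subtract finishes the reduction
    (the "reduce ×2" pattern above).
    """
    P = (1 << bits) - c
    mask = (1 << bits) - 1
    x = a * b
    x = (x >> bits) * c + (x & mask)   # first fold
    x = (x >> bits) * c + (x & mask)   # second fold: now x < 2P
    if x >= P:
        x -= P
    return x
```

The same function models P2 with `bits=59, c=55`; both choices keep the post-fold value below 2P, which is what makes the single conditional subtract sufficient.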

Hash structure

16-byte `HashEntry`: 32-bit A, 32-bit x, 64-bit secondary hash (Knuth k2). Open addressing with linear probing; load factor capped at 0.5. The false-positive rate from the Solinas + Knuth combination is below the detection threshold: 0 false positives across 12,179 confirmed hits.
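The slot layout and probe sequence can be sketched in Python. This is a hypothetical simplification: `N_SLOTS`, the all-zero empty sentinel, and a host-side list stand in for the up-to-512M-slot device table, and the hash values are supplied by the caller rather than computed from Solinas residues.

```python
import struct

EMPTY = b"\x00" * 16   # all-zero slot = empty (sketch-level sentinel)
N_SLOTS = 1 << 12      # illustrative; the GPU table uses up to 512M slots

def pack_entry(A, x, h2):
    """16-byte entry: 32-bit A, 32-bit x, 64-bit secondary hash."""
    return struct.pack("<IIQ", A, x, h2)

def insert(table, key, A, x, h2):
    """Open addressing with linear probing on the primary hash."""
    i = key % N_SLOTS
    while table[i] != EMPTY:
        i = (i + 1) % N_SLOTS
    table[i] = pack_entry(A, x, h2)

def lookup(table, key, h2):
    """Probe until an empty slot; accept only on secondary-hash match."""
    i = key % N_SLOTS
    out = []
    while table[i] != EMPTY:
        A, x, stored = struct.unpack("<IIQ", table[i])
        if stored == h2:
            out.append((A, x))
        i = (i + 1) % N_SLOTS
    return out
```

The secondary hash is what keeps false positives rare: a primary-slot collision alone is not enough to report a match.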

M4 epoch tags

Per-cycle hash table reset would require 8 GB memset (~600 ms each) — prohibitive at 100K+ cycles. Instead, each entry carries a 32-bit epoch tag; lookups treat entries with stale epoch as empty. Epoch increments per (C, z, x), wraps every 4B cycles (never observed in practice). Eliminates per-cycle memset entirely.
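The epoch mechanism replaces an 8 GB clear with a single counter bump. A minimal Python model (hypothetical structure; the real table keeps its tags in the separate d_epoch[] array on the device):

```python
class EpochTable:
    """Hash table whose 'clear' is O(1): bump an epoch counter instead
    of zeroing storage. A slot is live only if its tag equals the
    current epoch, so stale entries from earlier (C, z, x) cycles read
    as empty without ever being touched.
    """
    def __init__(self, n_slots):
        self.vals = [None] * n_slots
        self.tags = [0] * n_slots   # models the d_epoch[] array
        self.epoch = 1              # 0 = never written

    def clear(self):
        self.epoch += 1             # replaces the 8 GB memset

    def put(self, i, v):
        self.vals[i] = v
        self.tags[i] = self.epoch

    def get(self, i):
        return self.vals[i] if self.tags[i] == self.epoch else None
```

The cost is one extra tag comparison per lookup plus the tag storage itself, which is the 2 GB d_epoch[] line in the architecture diagram.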

Optimization history

  1. M2 (binary exp + Solinas): baseline functional kernel
  2. M3.7 (sync amortization): D2H out of y-loop. Helped small cases (1.8-2.5×), no effect on raw_b500.
  3. M3.8 (mega-sweep, 1 launch all y): no improvement, debunked the latency hypothesis.
  4. M3.9 (3-bucket timing instrumentation): revealed gpu_work = 100% of wall, not host launch overhead.
  5. M4 (epoch hash, no memset): eliminates per-cycle 8 GB clear.
  6. M5 (cudaEvents per-kernel): build_hash 41%, sweep_mega 45%, kernels are real bottleneck.
  7. M6 (skip A_max,B_max > 10M): 20× speedup at 12% hit cost. The single most important engineering decision in the project.
  8. M7 (checkpoint + incremental hits): enables overnight runs surviving Ctrl-C, crash, power loss.
  9. M7.1 (auto-resume default): re-run the same command, no --resume flag needed.

What didn't work

  • --exhaustive mode at scale: removing M6 skip costs 20× wall time for ~12% more hits at bound = 500. At bound ≥ 1000, the runs become impractical on a single RTX 5070. The skip is necessary, not optional.

  • GPU Bloom filter for hash pre-screening: added complexity, no measurable speedup at observed load factors. Removed.

  • Endomorphism canonical form / cheap-second-point in kernel: code complexity not justified by measured impact. (Same engineering pattern as refuted in PSCKangaroo by independent forum review.)

  • SSD-Hybrid TAME-store architecture: GPU produces ~90K DPs/s; no random-write SSD strategy keeps up while preserving hash semantics. Abandoned. In-RAM 16-byte compact entries are correct for this workload.

  • Diagnosis-by-guessing: spent multiple iterations theorizing about the bottleneck (host CPU loop? buffering? GPU stall?) before instrumenting per-kernel timing (M5). Instrument before theorizing — added as guideline.

Lineage

This is the third iteration of an ongoing Beal verification project by the same author. Earlier iterations are not currently published:

  • V1 (beal_blackwell) — Frente A (by-base parametrization), validated up to bound 1M, 15 datasets, internal release only.
  • V2 (beal_blackwell_v2) — Frente A rebalanced 36-bit C_idx, bound 10M, 11 datasets, internal release only.
  • V3 (beal_bigint, this repo) — Frente B (by-C^z parametrization), GPU implementation. First publicly released iteration of this project.

V3 was cross-validated against V1's by-base output via the --v1-compat mode.

Hardware

Developed and validated on:

  • NVIDIA RTX 5070 (Blackwell, SM 12.0), 12 GB VRAM
  • Ryzen 9800X3D, 128 GB system RAM
  • Linux Mint 22.1, CUDA 12.9, GCC 13, GMP 6.3
  • Off-grid solar power supply

Author

pscamillo — independent developer working on GPU kernels for number-theoretic computation.

See also

  • mr_blackwell — Native Miller-Rabin GPU kernel for NVIDIA Blackwell.
  • PSCKangaroo — GPU-accelerated Pollard's Kangaroo for secp256k1 ECDLP.

License

Apache 2.0. See LICENSE and NOTICE.

Acknowledgments

  • Bill Butler ("Durango Bill") for documenting the by-C^z approach
  • Peter Norvig, Dan Vanderkam, Noah Watkins, Witold Jarnicki & David Konerding, William Root, notabs.org — prior empirical verification work
  • GMP for runtime arbitrary-precision arithmetic
  • Anthropic Claude — pair-programming partner

Supporting development

If this project is useful to you, consider starring the repository — it helps visibility and signals to others that the work matters.

If you want to support continued development directly:

BTC (bech32): bc1qxahfnh6pcpv80plw6lxv8enus0ggpqxu0cl243

Hardware, electricity, and tooling costs are real, though partially offset by solar power. Contributions of code, benchmarks, bug reports, and documentation are also welcome — open an issue or pull request.
