beal_bigint — GPU search for Beal counterexamples (by-C^z parametrization)

A CUDA implementation of the by-C^z parametrization for empirical verification of Beal's conjecture, designed for long-running searches on a single consumer GPU.

Why?

Beal's conjecture states that if A^x + B^y = C^z with positive integers and exponents x, y, z ≥ 3, then A, B, C must share a common prime factor. Empirical verification has been done by Norvig (2000), Vanderkam, Watkins, Jarnicki & Konerding, William Root, notabs.org, and others — most using a by-base parametrization (iterate over A and B, check if A^x + B^y is a perfect power).

Bill Butler ("Durango Bill") documented a different approach: by-C^z (iterate over (C, z), search for (A, x, B, y) on the left side via BSGS-style hashing). His work ran on CPU.

beal_bigint is a GPU implementation of the by-C^z parametrization. It complements the existing CPU work and surfaces hits that pure by-base searches do not naturally produce — for example:

9565938^3 + 27^14 = 81^11   (gcd = 27, derives from 2³ × 3⁴²)

This is not a counterexample (gcd > 1). But it lies in a region of (A, B) space where one base far exceeds the bounds that by-base searches impose for tractability.

beal_bigint reaches that region by:

  • Outer iteration over (C, z) instead of (A, B) — no upper bound on A or B
  • BSGS-style hashing with dual Solinas primes (P1=2^61-31, P2=2^59-55)
  • GPU-resident hash table (up to 8 GB / 512M slots) for parallel lookup
  • M6 coverage threshold (skip A_max,B_max > 10M) — 20× speedup with ~12% hit loss
  • M7 checkpoint+resume — 28-byte hit records persisted with fsync after each (C, z)
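The overall by-C^z flow can be sketched in pure Python. This is a toy model only: `search_cz`, the exponent range, and the `base_cap` bound are illustrative, and an exact-integer dict stands in for the GPU hash table of Solinas residues.

```python
import math

def search_cz(C, z, exp_min=3, exp_max=6, base_cap=200):
    """For a fixed (C, z), find all (A, x, B, y) with A^x + B^y = C^z.

    Sketch of the by-C^z idea: tabulate A^x values below T = C^z, then
    for each candidate B^y look up the complement T - B^y. The real
    kernel stores residues mod two Solinas primes and probes a GPU hash
    table; this sketch keeps exact integers for clarity.
    """
    T = C ** z
    table = {}  # A^x -> (A, x); colliding representations overwrite (fine for a toy)
    for x in range(exp_min, exp_max + 1):
        A_max = int(round(T ** (1.0 / x))) + 1  # float root + margin; v < T filters
        for A in range(2, min(A_max, base_cap) + 1):
            v = A ** x
            if v < T:
                table[v] = (A, x)
    hits = []
    for y in range(exp_min, exp_max + 1):
        for B in range(2, base_cap + 1):
            rem = T - B ** y
            if rem <= 1:
                break  # B^y has reached T; larger B only overshoots further
            if rem in table:
                A, x = table[rem]
                g = math.gcd(math.gcd(A, B), C)
                hits.append((A, x, B, y, C, z, g))
    return hits
```

For example, `search_cz(3, 5)` finds the classic identity 3^3 + 6^3 = 3^5 (in both (A, B) orderings, each with gcd 3).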

Performance

Tested on RTX 5070 (SM 12.0 / Blackwell), CUDA 12.9, Linux Mint 22.1.

Production tier ladder (RAW mode, M6 default 10M skip)

| Tier | C bound | Exponents | Wall time | Hits | Coprime | Throughput |
|------|---------|-----------|-----------|------|---------|------------|
| 1 | 1,000 | 3-12 | 30 s | 829 | 0 | 28 hits/s |
| 2 | 2,000 | 3-12 | 58 s | 1,031 | 0 | 18 hits/s |
| 3 | 5,000 | 3-15 | 187 s | 1,646 | 0 | 9 hits/s |
| 4 | 10,000 | 3-15 | 370 s | 2,096 | 0 | 6 hits/s |
| 5 | 20,000 | 3-15 | 754 s | 2,719 | 0 | 4 hits/s |
| 6 | 50,000 | 3-15 | 1,625 s | 3,858 | 0 | 2 hits/s |
| 9 | 500,000 | 3-15 | 13,377 s | 8,162 | 0 | 0.6 hits/s |

Cumulative wall time: ~4.4 hours. Total unique hits: 20,341. Counterexamples found: 0 (as expected; Beal is widely conjectured true).

Doubling C grows wall time ~1.85-2.04× (sub-linear). Hits grow ~1.27-1.30× per doubling — diminishing returns as new C values have fewer factor-rich combinations.

M6 trade-off (validated at bound = 500, exp 3-10)

| Mode | Wall time | Hits | Throughput |
|------|-----------|------|------------|
| M6 default (skip > 10M) | 5.6 s | 511 | 91 hits/s |
| --exhaustive (no skip) | 115 s | 572 | 5 hits/s |

Default loses ~12% of hits, runs ~20× faster. At bound ≥ 1000, exhaustive mode becomes impractical on a single RTX 5070.

Bottleneck breakdown (M5 cudaEvents, Tier 4)

```
build_hash kernel:     151 s   (40% of GPU work)
sweep_mega kernel:     205 s   (55%)
memsets/memcpy/sync:    14 s   (4%)
host setup:              0.7 s (~0.2%)
```

Cycles with A_max in [1M, 10M] account for >90% of GPU time. The M6 skip cuts the >10M tail entirely, which is the entire reason overnight bound=500K runs are feasible.

Architecture

```
┌──────────────────────────────────────────────────┐
│  Host (Ryzen 9800X3D, GMP, libgmp)               │
│  ├── (C, z) outer loop — 128-bit pow             │
│  ├── iroot128(T, x) — A_max via GMP              │
│  ├── checkpoint + hits file (fsync per C,z)      │
│  └── final GMP verification of candidates        │
├──────────────────────────────────────────────────┤
│  GPU (RTX 5070 Blackwell, 12 GB VRAM)            │
│  ├── d_H[]    — 8 GB hash table (512M slots)     │
│  ├── d_epoch[]— 2 GB epoch tags (M4: no memset)  │
│  ├── d_out[]  — match buffer                     │
│  ├── kernel_build_hash<P1>     (Solinas 61-bit)  │
│  └── kernel_sweep_mega<P1,P2>  (dual-prime filt) │
├──────────────────────────────────────────────────┤
│  Solinas reduction primitives:                   │
│  P1 = 2^61 - 31, P2 = 2^59 - 55                  │
│  Knuth k2 secondary hash (16-bit suffix)         │
└──────────────────────────────────────────────────┘
```
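The `iroot128(T, x)` step in the host column computes A_max = floor(T^(1/x)). A minimal pure-Python model of that step (the real code calls into GMP; `iroot` here is a hypothetical stand-in):

```python
def iroot(T, x):
    """Floor of the x-th root of a non-negative integer T.

    Models the host-side iroot128(T, x) step (the real code uses GMP).
    A float seed plus integer correction keeps the result exact even
    when the float root rounds the wrong way.
    """
    if T < 2:
        return T
    r = int(round(T ** (1.0 / x)))  # float seed, possibly off by one
    while r ** x > T:               # correct downward
        r -= 1
    while (r + 1) ** x <= T:        # correct upward
        r += 1
    return r
```

For x ≥ 3 and T below 2^128 the float seed is within one of the true root, so the correction loops terminate almost immediately.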

Quick Start

Sanity check

```
make ARCH=sm_120

# V1-compat sanity (must produce 17 hits in ~0.02s):
./beal_bigint 50 3 6
```

Full validation

```
# Validation matrix (~15s, must pass before production):
make calibrate
```

Production run

```
# Tier 1 example (~30s):
./scripts/long_run.sh 1000 3 12 --raw

# Tier 6 example (~30 min):
./scripts/long_run.sh 50000 3 15 --raw

# Tier 9 overnight (~3-7 hours):
./scripts/long_run.sh 500000 3 15 --raw
```

If interrupted (Ctrl-C, crash, power loss), re-run the same command — auto-resume continues from the last completed (C, z). To start fresh and discard the previous checkpoint, add --restart.

Clean shutdown: `pkill -INT beal_bigint` (or Ctrl-C in the foreground) waits for the current (C, z) cycle to finish, saves state, and exits.

Files

| File | Description |
|------|-------------|
| src/beal_bigint.cu | CUDA + host main (everything in one TU) |
| oracle/oracle_by_cz.py | Python oracle for cross-validation |
| scripts/long_run.sh | Production wrapper (M7-aware, line-buffered) |
| scripts/calibrate.sh | Validation matrix |
| scripts/run_range.sh | Single-range wrapper |
| scripts/coverage_ledger.py | Ledger query/status |
| scripts/memory_budget.py | VRAM planning |
| docs/DESIGN.md | Architecture notes |
| RESULTS.md | Production tier results (numbers + hits found) |
| NOTES.md | Per-milestone changelog (M2-M7.1) |
| M6_NOTES.md | M6 trade-off rationale |
| M7_NOTES.md | M7 checkpoint design |

Build Requirements

  • NVIDIA GPU (Volta or newer, SM 7.0+; tested on Blackwell SM 12.0)
  • CUDA Toolkit 11.0+ (12.x recommended)
  • GMP library (libgmp-dev) — for host-side 128-bit arithmetic
  • Linux (tested on Linux Mint 22.1)

```
# Ubuntu/Debian/Mint
sudo apt install libgmp-dev

# Build
make ARCH=sm_120
```

Supported architectures

The code uses `__uint128_t` (host) and Solinas reduction templated on the prime constant; nothing is intrinsically Blackwell-specific.

| Architecture | SM | GPUs | Status |
|--------------|-----|------|--------|
| Volta | 7.0 | V100 | Should work (untested) |
| Turing | 7.5 | RTX 2080 | Should work (untested) |
| Ampere | 8.x | RTX 3090, A100 | Should work (untested) |
| Ada Lovelace | 8.9 | RTX 4090 | Should work (untested) |
| Hopper | 9.0 | H100 | Should work (untested) |
| Blackwell | 12.0 | RTX 5070/5080/5090 | Tested ✓ |

Contributions of benchmark results on other GPUs are welcome.

Coverage strategy (M6)

By default, the search skips:

  • (C, z, x) cycles where A_max > 10M
  • y-slots where B_max > 10M

This cuts ~95% of GPU work while losing ~12% of hits at bound=500. Profiling showed cycles with A_max in [1M, 10M] dominate runtime.

Tunable:

| Flag | Behavior |
|------|----------|
| (default) | Skip A_max, B_max > 10M |
| --max-Amax=N | Override threshold to N |
| --exhaustive | Disable skip entirely (impractical at scale) |

Checkpoint and resume (M7)

Production runs save state after every (C, z) cycle:

  • <checkpoint> — text file with last completed (C, z)
  • <checkpoint>.hits — binary append-only, 28 bytes per hit (A, x, B, y, C, z, gcd as uint32_t)

State is fsync-ed after each (C, z). Auto-resume is the default. The wrapper scripts/long_run.sh sets all paths automatically.

Reading the binary hits file in Python:

```python
import struct
with open("results/run.checkpoint.hits", "rb") as f:
    while chunk := f.read(28):
        A, x, B, y, C, z, g = struct.unpack("7I", chunk)
        print(f"{A}^{x} + {B}^{y} = {C}^{z}  gcd={g}")
```

Coprime hits and counterexamples

The summary line `coprime hits (Beal!): N` reports confirmed hits where gcd(A, B, C) = 1. A coprime hit with x, y, z ≥ 3 would constitute a counterexample to Beal's conjecture.

Coprime hits are always verified with full-precision GMP, regardless of --skip-gmp-verify. Across all runs documented in RESULTS.md, this counter has been zero.

If you obtain a coprime hit while running this software:

  1. Preserve the log file and .checkpoint.hits file.
  2. Re-verify independently with a different toolchain (Python+GMP, SageMath, PARI/GP, Mathematica).
  3. Open an issue here with the verification artifacts.
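For step 2, a first independent re-check needs nothing beyond exact integer arithmetic. A sketch in plain Python (no GMP dependency; `verify_hit` is a hypothetical helper, not part of this repo):

```python
from math import gcd

def verify_hit(A, x, B, y, C, z):
    """Re-verify a candidate hit with exact integer arithmetic.

    Returns (is_identity, is_coprime): whether A^x + B^y = C^z holds
    exactly, and whether gcd(A, B, C) = 1. A coprime identity with
    x, y, z >= 3 would be a Beal counterexample.
    """
    ok = A ** x + B ** y == C ** z
    coprime = gcd(gcd(A, B), C) == 1
    return ok, coprime

# The gcd = 27 example hit from above:
print(verify_hit(9565938, 3, 27, 14, 81, 11))  # → (True, False)
```

Python integers are arbitrary-precision, so this check is exact at any size the search can reach.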

Technical Details

Solinas reduction

Both primes are chosen for cheap modular reduction via Solinas form:

```
P1 = 2^61 - 31    →  x mod P1  =  (x_hi * 31 + x_lo), reduce ×2
P2 = 2^59 - 55    →  same pattern with 55
```

The templated device function `mulmod_solinas_dev<P, C>` is instantiated once per prime. Cross-validated against the bitcoin-core/secp256k1 reference implementation.
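The folding pattern can be modeled in a few lines of Python. This is an illustrative model of the reduction only, not the CUDA code (which does the same arithmetic on 64-bit limbs): since 2^bits ≡ c (mod P), the product splits into hi·2^bits + lo and folds to hi·c + lo; two folds plus one conditional subtract land below P.

```python
P1 = (1 << 61) - 31   # Solinas prime for build_hash
P2 = (1 << 59) - 55   # second prime for the dual-prime filter

def mulmod_solinas(a, b, bits=61, c=31):
    """Model of a*b mod P for P = 2^bits - c via Solinas folding.

    First fold drops a ~122-bit product to ~66 bits, second fold to
    below 2P, so a single conditional subtract finishes the reduction
    (the "reduce ×2" pattern above).
    """
    P = (1 << bits) - c
    mask = (1 << bits) - 1
    x = a * b
    x = (x >> bits) * c + (x & mask)   # first fold
    x = (x >> bits) * c + (x & mask)   # second fold: now x < 2P
    if x >= P:
        x -= P
    return x
```

The same function models P2 with `bits=59, c=55`; both choices keep the post-fold value below 2P, which is what makes the single conditional subtract sufficient.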

Hash structure

16-byte `HashEntry`: 32-bit A, 32-bit x, 64-bit secondary hash (Knuth k2). Open addressing with linear probing; load factor capped at 0.5. The false-positive rate from the Solinas + Knuth combination is below the detection threshold: 0 false positives across 12,179 confirmed hits.
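The slot layout and probe sequence can be sketched in Python. This is a hypothetical simplification: `N_SLOTS`, the all-zero empty sentinel, and a host-side list stand in for the up-to-512M-slot device table, and the hash values are supplied by the caller rather than computed from Solinas residues.

```python
import struct

EMPTY = b"\x00" * 16   # all-zero slot = empty (sketch-level sentinel)
N_SLOTS = 1 << 12      # illustrative; the GPU table uses up to 512M slots

def pack_entry(A, x, h2):
    """16-byte entry: 32-bit A, 32-bit x, 64-bit secondary hash."""
    return struct.pack("<IIQ", A, x, h2)

def insert(table, key, A, x, h2):
    """Open addressing with linear probing on the primary hash."""
    i = key % N_SLOTS
    while table[i] != EMPTY:
        i = (i + 1) % N_SLOTS
    table[i] = pack_entry(A, x, h2)

def lookup(table, key, h2):
    """Probe until an empty slot; accept only on secondary-hash match."""
    i = key % N_SLOTS
    out = []
    while table[i] != EMPTY:
        A, x, stored = struct.unpack("<IIQ", table[i])
        if stored == h2:
            out.append((A, x))
        i = (i + 1) % N_SLOTS
    return out
```

The secondary hash is what keeps false positives rare: a primary-slot collision alone is not enough to report a match.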

M4 epoch tags

Per-cycle hash table reset would require 8 GB memset (~600 ms each) — prohibitive at 100K+ cycles. Instead, each entry carries a 32-bit epoch tag; lookups treat entries with stale epoch as empty. Epoch increments per (C, z, x), wraps every 4B cycles (never observed in practice). Eliminates per-cycle memset entirely.
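The epoch mechanism replaces an 8 GB clear with a single counter bump. A minimal Python model (hypothetical structure; the real table keeps its tags in the separate d_epoch[] array on the device):

```python
class EpochTable:
    """Hash table whose 'clear' is O(1): bump an epoch counter instead
    of zeroing storage. A slot is live only if its tag equals the
    current epoch, so stale entries from earlier (C, z, x) cycles read
    as empty without ever being touched.
    """
    def __init__(self, n_slots):
        self.vals = [None] * n_slots
        self.tags = [0] * n_slots   # models the d_epoch[] array
        self.epoch = 1              # 0 = never written

    def clear(self):
        self.epoch += 1             # replaces the 8 GB memset

    def put(self, i, v):
        self.vals[i] = v
        self.tags[i] = self.epoch

    def get(self, i):
        return self.vals[i] if self.tags[i] == self.epoch else None
```

The cost is one extra tag comparison per lookup plus the tag storage itself, which is the 2 GB d_epoch[] line in the architecture diagram.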

Optimization history

  1. M2 (binary exp + Solinas): baseline functional kernel
  2. M3.7 (sync amortization): D2H out of y-loop. Helped small cases (1.8-2.5×), no effect on raw_b500.
  3. M3.8 (mega-sweep, 1 launch all y): no improvement, debunked the latency hypothesis.
  4. M3.9 (3-bucket timing instrumentation): revealed gpu_work = 100% of wall, not host launch overhead.
  5. M4 (epoch hash, no memset): eliminates per-cycle 8 GB clear.
  6. M5 (cudaEvents per-kernel): build_hash 41%, sweep_mega 45%, kernels are real bottleneck.
  7. M6 (skip A_max,B_max > 10M): 20× speedup at 12% hit cost. The single most important engineering decision in the project.
  8. M7 (checkpoint + incremental hits): enables overnight runs surviving Ctrl-C, crash, power loss.
  9. M7.1 (auto-resume default): re-run the same command, no --resume flag needed.

What didn't work

  • --exhaustive mode at scale: removing M6 skip costs 20× wall time for ~12% more hits at bound = 500. At bound ≥ 1000, the runs become impractical on a single RTX 5070. The skip is necessary, not optional.

  • GPU Bloom filter for hash pre-screening: added complexity, no measurable speedup at observed load factors. Removed.

  • Endomorphism canonical form / cheap-second-point in kernel: code complexity not justified by measured impact. (Same engineering pattern as refuted in PSCKangaroo by independent forum review.)

  • SSD-Hybrid TAME-store architecture: GPU produces ~90K DPs/s; no random-write SSD strategy keeps up while preserving hash semantics. Abandoned. In-RAM 16-byte compact entries are correct for this workload.

  • Diagnosis-by-guessing: spent multiple iterations theorizing about the bottleneck (host CPU loop? buffering? GPU stall?) before instrumenting per-kernel timing (M5). Instrument before theorizing — added as guideline.

Lineage

This is the third iteration of an ongoing Beal verification project by the same author. Earlier iterations are not currently published:

  • V1 (beal_blackwell) — Frente A (by-base parametrization), validated up to bound 1M, 15 datasets, internal release only.
  • V2 (beal_blackwell_v2) — Frente A rebalanced 36-bit C_idx, bound 10M, 11 datasets, internal release only.
  • V3 (beal_bigint, this repo) — Frente B (by-C^z parametrization), GPU implementation. First publicly released iteration of this project.

V3 was cross-validated against V1's by-base output via the --v1-compat mode.

Hardware

Developed and validated on:

  • NVIDIA RTX 5070 (Blackwell, SM 12.0), 12 GB VRAM
  • Ryzen 9800X3D, 128 GB system RAM
  • Linux Mint 22.1, CUDA 12.9, GCC 13, GMP 6.3
  • Off-grid solar power supply

Author

pscamillo — independent developer working on GPU kernels for number-theoretic computation.

See also

  • mr_blackwell — Native Miller-Rabin GPU kernel for NVIDIA Blackwell.
  • PSCKangaroo — GPU-accelerated Pollard's Kangaroo for secp256k1 ECDLP.

License

Apache 2.0. See LICENSE and NOTICE.

Acknowledgments

  • Bill Butler ("Durango Bill") for documenting the by-C^z approach
  • Peter Norvig, Dan Vanderkam, Noah Watkins, Witold Jarnicki & David Konerding, William Root, notabs.org — prior empirical verification work
  • GMP for runtime arbitrary-precision arithmetic
  • Anthropic Claude — pair-programming partner

Supporting development

If this project is useful to you, consider starring the repository — it helps visibility and signals to others that the work matters.

If you want to support continued development directly:

BTC (bech32): bc1qxahfnh6pcpv80plw6lxv8enus0ggpqxu0cl243

Hardware, electricity, and tooling costs are real, though partially offset by solar power. Contributions of code, benchmarks, bug reports, and documentation are also welcome — open an issue or pull request.
