emceeKim/AI-RVS

AI Recommendation Verification Standard (AIRVS)

An open, version-controlled, peer-reviewable standard for evaluating AI-generated investment recommendations. The standard defines (1) six process axes with mandatory evidence, (2) macro/micro coherence in three tiers, (3) outcome time-series at four time points, and (4) a five-tier verdict label vocabulary. The standard defines how to measure and what the labels mean; the algorithm that maps measurements to a label is implementer-defined (each evaluator publishes their own decision rule). A reference implementation by the maintainer is available as a separate repository (mc-ai-labs-airvs-implementation/).

Current version: v1.0.0 (first public release)
License: CC BY 4.0
Maintainer: MC AI Labs, Mincheol Kim ([email protected])
Repository purpose: Public, citable, falsifiable standard for third-party AI recommendation review. Distinct from any specific evaluator's implementation.


TL;DR

If you read one paragraph, read this. Most AI investment recommendations published today carry no falsifiable record of how they were produced, what they assumed, what data they cited, or how they performed. Readers cannot tell a careful recommendation from a hallucinated one — and the recommender has no incentive to disclose either. This repository defines an evaluation sheet a third party can apply to any AI (or human) recommendation, producing a record that is (1) procedurally reproducible, (2) outcome-measured at four time points, and (3) summarized as a single colored verdict label so a non-expert reader can act on it. It is versioned with SemVer; v1.0.0 is the first stable public release. Subsequent breaking changes will go through future MAJOR versions (v2.0.0+) and external peer review, per the methodology's own self-rules.

Global by design

The methodology is intended to be applied to recommendations about any asset class in any market. The macro indicator framework in core.md §4-3 uses US assets as the primary named example (because USD-denominated pricing dominates global cross-asset linkages) and provides substitutable framework rows for other major developed markets and for emerging markets. The Tier 1 source rulebook in tier-rulebook.md lists regulators, central banks, and exchanges from multiple jurisdictions as parallel examples. Defamation and right-of-reply procedures in core.md §9 are written generically, with the explicit note that any evaluator adopting this sheet should seek local legal advice for their own publishing jurisdiction. The author is based in one specific jurisdiction and discloses that openly (see Contact), but the methodology is not specialized for it.


Why this exists

The full statement of motivation is in WHY.md. In short:

  1. In their default form, most AI recommendations are non-falsifiable. They typically cite no sources, name no time window, declare no stop-loss, list no counter-scenario, and disappear from the feed within hours. There is no record to verify against later.
  2. "Trust score" sites are graded by their own publishers, which means the highest-scored recommendation and the harshest grading rubric come from the same party. The information value is near zero.
  3. A standard that anyone can apply — to anyone else's recommendation — fixes the incentive structure. The recommender does not get to grade themselves; the grader does not get to invent the rules after the fact; the reader gets a label they can act on.

This is the first public release of that standard.


Status

v1.0.0 is the first stable public release. The methodology is stable as published — it is safe to cite, apply, and build tooling against. Bug-fix patches (v1.0.1, v1.0.2, …) may revise wording and edge cases without changing the underlying rules. Breaking changes — axis count, Pass criterion model, verdict-tier structure, decision-rule branches, outcome timepoints, scope split — will be released as future MAJOR versions (v2.0.0, …) only after external peer review, per the methodology's own self-rules in core.md §11.
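The versioning policy in this paragraph can be sketched as a small classifier. This is a minimal sketch, not part of the standard: the category identifiers (`axis_count`, `wording`, and so on) are hypothetical names for the change types listed above, and the function refuses to guess for anything the paragraph does not mention.

```python
# Change categories the Status section names as breaking.
# The identifiers themselves are hypothetical labels, not defined by AIRVS.
BREAKING = frozenset({
    "axis_count",
    "pass_criterion_model",
    "verdict_tier_structure",
    "decision_rule_branches",
    "outcome_timepoints",
    "scope_split",
})

# Change categories the Status section allows in bug-fix patches.
NON_BREAKING = frozenset({"wording", "edge_case"})


def required_bump(changes):
    """Return "MAJOR" or "PATCH" for a collection of change categories."""
    changes = set(changes)
    if not changes:
        raise ValueError("no changes given")
    unknown = changes - BREAKING - NON_BREAKING
    if unknown:
        # The README does not say how other kinds of change are versioned,
        # so this sketch stops rather than inventing a policy.
        raise ValueError(f"unclassified change categories: {sorted(unknown)}")
    return "MAJOR" if changes & BREAKING else "PATCH"
```

For example, `required_bump(["wording"])` gives `"PATCH"`, while any set that includes `"outcome_timepoints"` gives `"MAJOR"`.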

Content evaluated under v1.0.0 is permanently frozen at this version in its frontmatter. If the rules later change in v2.0.0, prior evaluations are not re-scored — instead, a new addendum may be appended showing what the new rules would have produced, with the original evaluation preserved.
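One way to picture the freeze-and-addendum rule is an immutable record that can only grow by appending. The sketch below is an assumption about how an implementer might model it; the class and field names (`Evaluation`, `Addendum`, `standard_version`, `addenda`) are hypothetical and do not come from core.md.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Addendum:
    standard_version: str  # version whose rules were applied after the fact
    label: str             # label those later rules would have produced


@dataclass(frozen=True)
class Evaluation:
    standard_version: str  # frozen at evaluation time, e.g. "1.0.0"
    label: str             # original verdict label, never overwritten
    addenda: tuple = ()    # later what-if results, oldest first


def with_addendum(evaluation, new_version, label_under_new_rules):
    """Return a copy with one more addendum; the original fields stay as they were."""
    if new_version == evaluation.standard_version:
        raise ValueError("an addendum must refer to a different standard version")
    extra = Addendum(new_version, label_under_new_rules)
    return replace(evaluation, addenda=evaluation.addenda + (extra,))
```

Because both dataclasses are frozen, there is no code path that re-scores the original: the only available operation is to produce a new record with an extra addendum.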

The author's internal pre-publication self-review is committed in PEER-REVIEWS/internal-3-persona-review.md for transparency; external peer review is actively invited via CONTRIBUTING.md.

See CHANGELOG.md for the v1.0.0 release log and known limitations.


What this sheet evaluates

The sheet is applied only to external recommendations, that is, recommendations the evaluator did not write. (The companion problem of self-grading is explicitly excluded; see Scope below.) For each evaluated recommendation, the sheet produces four separate records:

| Dimension | When | What it measures | Form |
| --- | --- | --- | --- |
| Process Score | At publication | Whether six methodological axes are satisfied with evidence | Pass / Fail per axis (6 axes) |
| Macro / Micro Coherence | At publication | Whether the recommendation accounts for the macro and micro environment | Sufficient / Partial / Missing |
| Outcome (time-series) | 30 / 60 / 90 / 180 days after publication | Market result, drawdown trajectory, counter-scenario realization | Quantitative time-series |
| Verdict Label | Provisional at D+0, Confirmed at D+90 | Single label from a fixed 5-tier vocabulary, produced by the evaluator's own pre-published decision rule | One of 5 colored labels |
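The timepoints in the Outcome and Verdict Label rows can be turned into concrete dates with a few lines of date arithmetic. This is a sketch under two assumptions that this README does not settle: offsets are counted in calendar days rather than trading days, and the label counts as Confirmed on D+90 itself. The function names are hypothetical.

```python
from datetime import date, timedelta

OUTCOME_OFFSETS_DAYS = (30, 60, 90, 180)  # outcome measurement points
CONFIRMATION_OFFSET_DAYS = 90             # label moves from Provisional to Confirmed


def outcome_checkpoints(published):
    """Map "D+30", "D+60", ... to the calendar date of each measurement."""
    return {f"D+{n}": published + timedelta(days=n) for n in OUTCOME_OFFSETS_DAYS}


def label_status(published, today):
    """Return "Provisional" before D+90 and "Confirmed" from D+90 onward."""
    if today < published:
        raise ValueError("today is before the publication date")
    confirmed_on = published + timedelta(days=CONFIRMATION_OFFSET_DAYS)
    return "Confirmed" if today >= confirmed_on else "Provisional"
```

For a recommendation published on 2025-01-01, the checkpoints fall on 2025-01-31, 2025-03-02, 2025-04-01, and 2025-06-30.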

The four dimensions are never summed into a single score. Single scores invite gaming. The five-tier verdict label vocabulary is standard, but the algorithm that maps measurements to a label is implementer-defined — each evaluator publishes their own decision rule before any evaluation, and version-locks it. This prevents per-evaluation cherry-picking while allowing reasonable algorithm diversity. A reference implementation by the maintainer is available in a separate repository (mc-ai-labs-airvs-implementation/).

The five verdict labels are: 🟢 Trustworthy, 🔵 Acceptable, 🟡 Questionable, 🟠 Unreliable, 🔴 Hallucinated.
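The split between the standard (records and label vocabulary) and the implementer (decision rule) can be sketched as follows. Everything about `toy_decide` is invented for illustration: its thresholds are not taken from AIRVS or from the maintainer's reference implementation, and it never emits Hallucinated because this README does not describe the criteria for that label. The class and field names are likewise hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional, Tuple


class Verdict(Enum):
    TRUSTWORTHY = "🟢 Trustworthy"
    ACCEPTABLE = "🔵 Acceptable"
    QUESTIONABLE = "🟡 Questionable"
    UNRELIABLE = "🟠 Unreliable"
    HALLUCINATED = "🔴 Hallucinated"


class Coherence(Enum):
    SUFFICIENT = "Sufficient"
    PARTIAL = "Partial"
    MISSING = "Missing"


@dataclass(frozen=True)
class Records:
    process: Tuple[bool, ...]       # Pass/Fail per axis, six axes
    coherence: Coherence
    outcome: Optional[dict] = None  # time-series; None until measured

    def __post_init__(self):
        if len(self.process) != 6:
            raise ValueError("exactly six process axes are expected")


@dataclass(frozen=True)
class DecisionRule:
    rule_id: str   # hypothetical identifier of the published rule
    version: str   # version-locked before any evaluation is issued
    decide: Callable[[Records], Verdict]


def toy_decide(records):
    """Illustrative thresholds only; a real rule is published by each evaluator."""
    passes = sum(records.process)  # counts passes within the process record only
    if passes == 6 and records.coherence is Coherence.SUFFICIENT:
        return Verdict.TRUSTWORTHY
    if passes >= 4 and records.coherence is not Coherence.MISSING:
        return Verdict.ACCEPTABLE
    if passes >= 2:
        return Verdict.QUESTIONABLE
    return Verdict.UNRELIABLE


TOY_RULE = DecisionRule("toy-rule", "0.0.1", toy_decide)


def issue_label(records, rule):
    """Attach the rule identity to the label so a reader can audit the mapping."""
    return {"label": rule.decide(records).value, "rule": f"{rule.rule_id}@{rule.version}"}
```

The records stay separate fields and are never collapsed into one number; only the pluggable `decide` function looks across them, and its identity travels with every label it produces.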


Repository layout

airvs/                              ← this repository (the AIRVS standard itself)
├── README.md                       ← this file
├── WHY.md                          ← author's motivation in full
├── CHANGELOG.md                    ← v1.0.0 release log + known limitations
├── CONTRIBUTING.md                 ← how to submit a peer review
├── LICENSE                         ← CC BY 4.0
│
├── v1.0.0/                         ← current standard version
│   ├── core.md                     ← main standard text (§0–§15)
│   ├── annex-a-ai.md               ← AI-specific assessment items (11 items)
│   └── tier-rulebook.md            ← source tier classification rulebook
│
└── PEER-REVIEWS/
    └── internal-3-persona-review.md  ← author's own adversarial pre-review
                                       (3 personas × 3 rounds, full transcript)

(separate sibling repository, not part of the AIRVS standard)
mc-ai-labs-airvs-implementation/    ← the maintainer's reference implementation
├── README.md
├── LICENSE
└── v1.0.0/
    └── decision-rule.md            ← MC AI Labs' specific decision rule
                                      satisfying AIRVS §6-2

The decision rule (the algorithm that maps Process / Coherence / Outcome records into a 5-tier label) is not part of the AIRVS standard. Each evaluator publishes their own decision rule, version-locks it, and links to it from every evaluation's frontmatter. The maintainer's own decision rule is published in the sibling repository above as a worked reference; other evaluators may fork it, adopt it as-is, or write their own.
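A pre-publication check for the frontmatter link might look like the sketch below. The key names (`airvs_version`, `decision_rule_url`, `decision_rule_version`) are assumptions made for this example; the actual frontmatter schema is whatever core.md specifies.

```python
# Hypothetical frontmatter keys; the real schema is defined by core.md.
REQUIRED_KEYS = ("airvs_version", "decision_rule_url", "decision_rule_version")


def missing_frontmatter_keys(frontmatter):
    """Return the required keys that are absent or blank, in a stable order."""
    missing = []
    for key in REQUIRED_KEYS:
        value = frontmatter.get(key)
        if value is None or not str(value).strip():
            missing.append(key)
    return missing
```

An evaluation whose frontmatter yields a non-empty list here would not yet identify the rule that produced its label.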

When a future MAJOR version of the AIRVS standard ships, it will live in a parallel directory (v2.0.0/) without modifying v1.0.0/. This is per the SemVer directory convention in core.md §11-3. Implementer decision rules are versioned independently.


Quick start (for evaluators)

Read in this order:

  1. WHY.md — understand what problem this standard is solving and what it deliberately is not solving.
  2. v1.0.0/core.md — the standard itself, §0 through §15.
  3. v1.0.0/annex-a-ai.md — required additional checks when the recommender is an AI (LLM response).
  4. v1.0.0/tier-rulebook.md — how to classify a source as Tier 1 / 2 / 3 without it becoming a judgment call.
  5. Your own (or an existing) decision rule — the algorithm satisfying core.md §6-2 that you will publish and version-lock before issuing any evaluation. If you want a starting point, see the maintainer's reference implementation at mc-ai-labs-airvs-implementation/v1.0.0/decision-rule.md.
  6. PEER-REVIEWS/internal-3-persona-review.md — the adversarial pre-review the author ran against this very document, including the unresolved disagreements.

If you only want to evaluate one recommendation as a trial run: read core.md §1, §3, §5, §6, §10 in that order — those five sections are sufficient to produce one full evaluation (you will also need a decision rule per step 5 above).


Quick start (for recommenders being evaluated)

If your recommendation is about to be evaluated under this standard, you are entitled to a seven-day pre-publication notice (core.md §9) and one round of rebuttal. The rebuttal is published alongside the evaluation, unedited. Read core.md §9 for the full procedure. The maintainer is committed to following this procedure on every evaluation issued under their own decision rule, and any deviation should be reported as a bug. Other evaluators adopting AIRVS are bound by the same procedure.
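The notice window and the unedited-rebuttal rule can be sketched as a guard around publication. This assumes the seven days are calendar days counted from the date the notice was sent; core.md §9 is the authority on how the period is actually counted. The function names and the output layout are hypothetical.

```python
from datetime import date, timedelta

NOTICE_DAYS = 7  # pre-publication notice period


def earliest_publication(notice_sent):
    """First date on which the evaluation may be published."""
    return notice_sent + timedelta(days=NOTICE_DAYS)


def publish(evaluation_text, notice_sent, today, rebuttal=None):
    """Return the text to publish, with any rebuttal appended verbatim."""
    if today < earliest_publication(notice_sent):
        raise ValueError("the seven-day notice period has not elapsed")
    if rebuttal is None:
        return evaluation_text
    # The rebuttal is concatenated as received, with no trimming or edits.
    return evaluation_text + "\n\nRebuttal (unedited):\n" + rebuttal
```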


Quick start (for peer reviewers)

See CONTRIBUTING.md. In short: open an issue tagged peer-review with your review attached. Reviews of any length are welcome; even one-page critiques pointing at a single defect are useful. All accepted peer reviews are committed to PEER-REVIEWS/ with attribution, unless the reviewer r
