From Conversation Ratings to Predictive Revenue Intelligence:

An Ordinal-Regression Approach to Learning Sales Quality

Authors: Aonxi Research Collective (Industry Submission, 2025)

Contact: origin@aonxi.com

Abstract

Aonxi is a self-improving revenue cognition system that turns human-rated sales interactions into a closed-loop learning engine. Technically, it combines: (i) Transformer models to understand long, multi-turn speech/text sequences, enabling precise extraction of intents, objections and commitments; (ii) preference-based reinforcement learning so model policies are updated toward what humans judge as "good" outcomes; and (iii) tamper-evident logging so changes to messaging and targeting can be audited and attributed to concrete evidence.

• Transformers provide parallel, attention-based sequence modeling for transcripts and summaries.

• Learning from human feedback is grounded in established preference-optimization methods for RL.

• Hash-chained logs follow the tamper-evident approach popularized by the Bitcoin whitepaper for verifiable history.

• Each client runs an isolated, private model fine-tuned only on that client's rated conversations, hosted on NVIDIA Hopper-class (H100/H200) GPUs for low-latency training/serving.

1. Problem: Misaligned Proxies

Most growth stacks optimize intermediate signals (clicks, impressions) that can diverge from what the business values (qualified opportunities, profit). In RL terms, the reward is misspecified. Aonxi reframes the rated conversation as the atomic unit of truth: human scores on a 1–10 scale determine which conversations count as positive signals (ratings ≥ 8). Preference-based training then steers policies toward these human-validated outcomes.
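A minimal sketch of this thresholding step (the constant and function name are illustrative; Section 3 formalizes the same rule as R(τ) = 𝟙[r ≥ κ]):

    KAPPA = 8  # human-rating threshold for a positive preference signal

    def is_positive_signal(rating: int, kappa: int = KAPPA) -> bool:
        """Return True when a 1-10 human rating marks the conversation as a positive signal."""
        if not 1 <= rating <= 10:
            raise ValueError(f"rating must be in 1..10, got {rating}")
        return rating >= kappa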

2. System Architecture

Conversation → Transcription/Summary (Transformer)
→ Human rating r∈{1…10} (≥8 = positive preference)
→ Policy/value update (preference RL)
→ Asset generation + targeting decisions
→ Deploy + measure outcomes
→ Hash-chained audit + nightly retrain

Core Principles

  • Sequence understanding: attention over long transcripts (sales, support, demos)
  • Preference RL: learn policies that produce more human-preferred outcomes
  • Auditability: append-only, hash-chained logs for changes, datasets and outcomes
  • Serving/training: Hopper-class accelerators (H100/H200) for throughput and memory bandwidth

Layer | Purpose | Primary Grounding
1. Signal capture | Transcribe/segment multi-turn speech; structure chats/emails | Transformer attention for long sequences
2. Interpretive intelligence | Human ratings 1–10; ≥8 flagged as positive | Preference learning from human feedback
3. Computational intelligence | Train per-client policy/value models | Policy/value learning; bandit exploration for variants
4. Neural execution | Generate/test assets; log outcomes on hash chain | Hash-chained provenance

3. Mathematical Footing

Preference Signals

Let a transcript trajectory be τ with human rating r ∈ {1, …, 10}. Define a simple binary reward:

R(τ) = 𝟙[r ≥ κ], κ = 8

Policy parameters θ maximize expected return J(θ) = 𝔼_{τ∼π_θ}[R(τ)]. A REINFORCE-style update is:

Δθ = α (R(τ) − b) Σ_t ∇_θ log π_θ(a_t | s_t)

with a learned value baseline b to reduce variance—standard in policy/value splits.
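A minimal sketch of this update on a single trajectory, assuming per-step log-probabilities from the policy and a scalar value-head baseline are available as PyTorch tensors, and that the optimizer holds both the policy and baseline parameters (illustrative names, not the production training loop):

    import torch

    def reinforce_step(log_probs: torch.Tensor,      # shape [T]: log π_θ(a_t | s_t) per step
                       rating: int,                  # human score r ∈ {1..10}
                       baseline: torch.Tensor,       # scalar value-head estimate b
                       optimizer: torch.optim.Optimizer,
                       kappa: int = 8) -> None:
        """One REINFORCE-style update with a learned baseline to reduce variance."""
        reward = 1.0 if rating >= kappa else 0.0      # R(τ) = 1[r ≥ κ]
        advantage = reward - baseline.detach()        # (R(τ) − b); baseline gets gradient only from the value loss
        policy_loss = -advantage * log_probs.sum()    # maximizing J(θ) ⇔ minimizing −J(θ)
        value_loss = (baseline - reward) ** 2         # one simple way to fit the baseline b
        loss = policy_loss + value_loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()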

Pairwise Formulation (Optional)

With two trajectories τ⁺ (r ≥ 8) and τ⁻ (r < 8), a Bradley–Terry-style logistic loss on a value head V_φ encourages V_φ(τ⁺) > V_φ(τ⁻):

ℒ = −log σ(V_φ(τ⁺) − V_φ(τ⁻))
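A minimal sketch of this pairwise objective, assuming the value head V_φ has already scored a batch of positive and negative trajectories (PyTorch-based; names and scores are illustrative):

    import torch
    import torch.nn.functional as F

    def pairwise_preference_loss(v_pos: torch.Tensor, v_neg: torch.Tensor) -> torch.Tensor:
        """Bradley–Terry-style logistic loss encouraging V_φ(τ⁺) > V_φ(τ⁻).

        v_pos: value-head scores for trajectories rated ≥ 8, shape [B]
        v_neg: value-head scores for trajectories rated < 8, shape [B]
        """
        # −log σ(V_φ(τ⁺) − V_φ(τ⁻)), written with logsigmoid for numerical stability
        return -F.logsigmoid(v_pos - v_neg).mean()

    # Example with dummy scores for three preference pairs:
    v_pos = torch.tensor([1.2, 0.4, 2.0], requires_grad=True)
    v_neg = torch.tensor([0.3, 0.9, -1.0], requires_grad=True)
    loss = pairwise_preference_loss(v_pos, v_neg)
    loss.backward()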

ROI Break-Even (Engineering Check)

If N_{8+} is the number of ≥8 conversations in a cycle, P the close rate per such lead, M the average profit per sale, and S the spend in the cycle, a simple break-even condition is:

N_{8+} ≥ S / (P × M)

Note: This equation is a unit-economics check; it does not claim universal performance. It defines the target needed to be ROI-positive with your inputs.
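In code, the check reduces to a ceiling division (a planning calculator under your own inputs, not a performance claim; the function name is illustrative):

    import math

    def break_even_conversations(spend: float, close_rate: float, profit_per_sale: float) -> int:
        """Smallest whole number of ≥8 conversations satisfying N_{8+} ≥ S / (P × M)."""
        if close_rate <= 0 or profit_per_sale <= 0:
            raise ValueError("close_rate and profit_per_sale must be positive")
        return math.ceil(spend / (close_rate * profit_per_sale))

The Worked Example section below exercises the same arithmetic with illustrative inputs.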

Private LLMs (Per-Client Isolation)

Each client has a separate policy/value stack fine-tuned only on that client's rated conversations and measured outcomes. Models are served and (nightly) updated on NVIDIA H100/H200 systems—selected for tensor throughput and HBM bandwidth required by long-context Transformer workloads.

Why Per-Client?

  • Data isolation and compliance are simpler to reason about
  • The model becomes a proprietary asset tuned to one firm's language and playbook
  • No claims of cross-tenant data sharing are made or needed

Execution Engine (Language → Actions)

  1. Generate: policy proposes copy/scripts/offers conditioned on segment features
  2. Select: a small-ε exploration policy (bandit) tries controlled variants while exploiting top candidates (see the sketch below)
  3. Deploy: push to Ads/Email/CRM with version IDs
  4. Log: outcomes + datasets + model hashes → hash chain (tamper-evident)
  5. Retrain: nightly updates from new ≥8 signals and outcomes

This closes the loop from speech → preference → policy → distribution → evidence.
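A minimal sketch of the ε-greedy selection in step 2, assuming each candidate asset variant already carries a value-head score (variant names, scores, and ε are illustrative):

    import random

    def select_variant(scores: dict[str, float], epsilon: float = 0.1) -> str:
        """ε-greedy bandit: exploit the top-scoring variant, explore with probability ε."""
        if random.random() < epsilon:
            return random.choice(list(scores))     # bounded exploration over all variants
        return max(scores, key=scores.get)         # exploit the best-known candidate

    # Example: value-head scores for three candidate ad copies (illustrative numbers)
    variant_scores = {"copy_a": 0.62, "copy_b": 0.48, "copy_c": 0.55}
    chosen = select_variant(variant_scores, epsilon=0.05)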

Governance, Risk, Alignment

Reward Drift

Audit that "≥8" correlates with commercial value; re-calibrate if needed (preference RL best practice).

Bias & Safety

Monitor value-head calibration; keep exploration budgets bounded; maintain allowlists for execution targets.

Provenance

Hash chain maintains an immutable trail of what changed, why, and with which data.
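A minimal sketch of the hash-chaining idea using only the standard library (record fields are illustrative, not the production schema):

    import hashlib
    import json

    def append_record(chain: list[dict], record: dict) -> dict:
        """Append a record whose hash commits to its payload and the previous entry."""
        prev_hash = chain[-1]["hash"] if chain else "0" * 64
        payload = json.dumps(record, sort_keys=True)
        entry = {
            "record": record,
            "prev_hash": prev_hash,
            "hash": hashlib.sha256((prev_hash + payload).encode()).hexdigest(),
        }
        chain.append(entry)
        return entry

    def verify_chain(chain: list[dict]) -> bool:
        """Recompute every hash; editing any earlier record breaks all later links."""
        prev_hash = "0" * 64
        for entry in chain:
            payload = json.dumps(entry["record"], sort_keys=True)
            expected = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
            if entry["prev_hash"] != prev_hash or entry["hash"] != expected:
                return False
            prev_hash = entry["hash"]
        return True

    # Example: log a nightly retrain with the dataset and model identifiers that produced it
    audit_log: list[dict] = []
    append_record(audit_log, {"event": "nightly_retrain",
                              "dataset_sha256": "<hash of training batch>",
                              "model_version": "v12"})
    assert verify_chain(audit_log)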

Privacy

Per-client models and stores. No cross-tenant training.

Implementation Notes

  • Transcription/segmentation: long-form Transformers are suitable for diarization-aware summarization and intent/objection tagging
  • Human-in-the-loop: keep a simple rater UI (1–10). Use ≥8 as a contracted truth signal; re-check quarterly
  • Hardware: For fast iteration on long contexts and nightly updates, Hopper-class GPUs (H100/H200) provide appropriate memory bandwidth and tensor throughput
  • Change control: all shipped assets carry a version + hash; dashboards show which call clusters influenced which copy changes—auditable down to the training batch

Worked Example (Illustrative, Not a Result)

Assume a 60-day cycle with spend S=$18,000, close rate per ≥8 lead P=0.30, profit per sale M=$900.

Break-even number of ≥8 conversations:

N_{8+} ≥ 18,000 / (0.30 × 900) = 18,000 / 270 = 66.7 ⇒ 67

If you logged N_{8+} = 75, the expected profit (illustrative) is 75 × 0.30 × 900 = $20,250.

ROI = (20,250 − 18,000) / 18,000 = 12.5%

This is just a calculator, not a guarantee. Replace with your actuals to plan targets.
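The same arithmetic in code (illustrative inputs copied from above; substitute your own):

    import math

    spend, close_rate, profit_per_sale = 18_000.0, 0.30, 900.0

    break_even = math.ceil(spend / (close_rate * profit_per_sale))   # 66.7 → 67
    logged_8_plus = 75
    expected_profit = logged_8_plus * close_rate * profit_per_sale   # 20,250.0
    roi = (expected_profit - spend) / spend                          # 0.125, i.e. 12.5%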

What This Is (and Isn't)

✓ IS

A rigorously grounded system that learns from human-rated conversations using Transformers + preference RL and ships changes with verifiable provenance.

✗ IS NOT

A claim of universal uplift or specific industry benchmarks. Aonxi provides the loop; results depend on inputs and execution.

References (Primary Sources)

Vaswani et al., "Attention Is All You Need." NeurIPS (2017)

(Transformer/attention for long-context sequence modeling)

Christiano et al., "Deep Reinforcement Learning from Human Preferences." NeurIPS (2017)

(Preference-based RL for aligning policies with human judgments)

Bradley & Terry, "Rank Analysis of Incomplete Block Designs: I. The Method of Paired Comparisons." Biometrika (1952)

(Paired comparison model, widely used in preference learning)

Nakamoto, "Bitcoin: A Peer-to-Peer Electronic Cash System." (2008)

(Hash-chained, tamper-evident logging)

NVIDIA, H100 Tensor Core GPU (Hopper architecture) docs/whitepaper

(Hardware characteristics for training/serving)

NVIDIA, H200 Tensor Core GPU

(HBM3e capacity/bandwidth overview)
