From Conversation Ratings to Predictive Revenue Intelligence:
An Ordinal-Regression Approach to Learning Sales Quality
Authors: Aonxi Research Collective (Industry Submission, 2025)
Contact: origin@aonxi.com
Abstract
Aonxi is a self-improving revenue cognition system that turns human-rated sales interactions into a closed-loop learning engine. Technically, it combines: (i) Transformer models to understand long, multi-turn speech/text sequences, enabling precise extraction of intents, objections and commitments; (ii) preference-based reinforcement learning so model policies are updated toward what humans judge as "good" outcomes; and (iii) tamper-evident logging so changes to messaging and targeting can be audited and attributed to concrete evidence.
• Transformers provide parallel, attention-based sequence modeling for transcripts and summaries.
• Learning from human feedback is grounded in established preference-optimization methods for RL.
• Hash-chained logs follow the widely used approach introduced in the Bitcoin whitepaper for verifiable history.
• Each client runs an isolated, private model fine-tuned only on that client's rated conversations, hosted on NVIDIA Hopper-class (H100/H200) GPUs for low-latency training/serving.
1. Problem: Misaligned Proxies
Most growth stacks optimize intermediate signals (clicks, impressions) that can diverge from what the business values (qualified opportunities, profit). In RL terms, the reward is misspecified. Aonxi reframes the rated conversation as the atomic unit of truth: human scores (1–10) mark which trajectories count as positive signals (rating ≥ 8). Preference-based training then steers policies toward these human-validated outcomes.
2. System Architecture
Core Principles
• Sequence understanding: attention over long transcripts (sales, support, demos)
• Preference RL: learn policies that produce more human-preferred outcomes
• Auditability: append-only, hash-chained logs for changes, datasets and outcomes
• Serving/training: Hopper-class accelerators (H100/H200) for throughput and memory bandwidth
| Layer | Purpose | Primary Grounding |
|---|---|---|
| 1. Signal capture | Transcribe/segment multi-turn speech; structure chats/emails | Transformer attention for long sequences |
| 2. Interpretive intelligence | Human ratings 1–10; ≥8 flagged as positive | Preference learning from human feedback |
| 3. Computational intelligence | Train per-client policy/value models | Policy/value learning; bandit exploration for variants |
| 4. Neural execution | Generate/test assets; log outcomes on hash chain | Hash-chained provenance |
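As a concrete illustration of the hand-off between layers 1 and 2, the sketch below (our own Python illustration; the field names such as `transcript_turns` and `rating` are assumptions, not a published schema) shows a minimal rated-conversation record with the ≥8 positive flag derived from the human score.

```python
from dataclasses import dataclass
from typing import List

POSITIVE_THRESHOLD = 8  # ratings at or above this value count as positive signals


@dataclass
class RatedConversation:
    """Minimal record passed from signal capture (layer 1) to interpretive intelligence (layer 2)."""
    client_id: str               # per-client isolation key
    transcript_turns: List[str]  # diarized, segmented multi-turn transcript
    rating: int                  # human score in 1..10

    @property
    def is_positive(self) -> bool:
        # ">= 8" is the contracted truth signal used downstream for preference training
        return self.rating >= POSITIVE_THRESHOLD


convo = RatedConversation(client_id="acme", transcript_turns=["Hi ...", "We need ..."], rating=9)
assert convo.is_positive
```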
3. Mathematical Footing
Preference Signals
Let a transcript trajectory be τ with human rating r(τ) ∈ {1, ..., 10}. Define a simple binary reward: R(τ) = 1 if r(τ) ≥ 8, and R(τ) = 0 otherwise.
Policy parameters θ maximize the expected return J(θ) = 𝔼τ∼πθ[R(τ)]. A REINFORCE-style update is ∇θ J(θ) ≈ 𝔼τ∼πθ[∇θ log πθ(τ) · (R(τ) − b(τ))],
with a learned value baseline b to reduce variance, the standard policy/value split.
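A minimal PyTorch sketch of that update (our own illustration, not Aonxi's production code): `logprobs` is the summed log-probability of the actions in each sampled trajectory, `rewards` is the binary R(τ), and `values` is the learned baseline b.

```python
import torch
import torch.nn.functional as F


def reinforce_with_baseline_loss(logprobs: torch.Tensor,
                                 rewards: torch.Tensor,
                                 values: torch.Tensor,
                                 value_coef: float = 0.5) -> torch.Tensor:
    """REINFORCE objective with a learned value baseline.

    logprobs: sum_t log pi_theta(a_t | s_t) per trajectory, shape (batch,)
    rewards:  binary R(tau) in {0., 1.} (rating >= 8), float, shape (batch,)
    values:   baseline b(tau) predicted by the value head, shape (batch,)
    """
    advantages = rewards - values.detach()          # stop gradient through the baseline
    policy_loss = -(logprobs * advantages).mean()   # score-function (REINFORCE) estimator
    value_loss = F.mse_loss(values, rewards)        # fit the baseline to observed returns
    return policy_loss + value_coef * value_loss
```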
Pairwise Formulation (Optional)
With two trajectories τ+ (r ≥ 8) and τ− (r < 8), a Bradley–Terry-style logistic loss on a value head Vφ encourages Vφ(τ+) > Vφ(τ−): ℒ(φ) = −log σ(Vφ(τ+) − Vφ(τ−)), where σ is the logistic sigmoid.
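A sketch of that pairwise loss (our illustration; `v_pos` and `v_neg` stand for the value head's scalar outputs on τ+ and τ−):

```python
import torch
import torch.nn.functional as F


def bradley_terry_loss(v_pos: torch.Tensor, v_neg: torch.Tensor) -> torch.Tensor:
    """Logistic (Bradley-Terry) preference loss: -log sigma(V(tau+) - V(tau-)).

    Minimizing this pushes the value head to score >=8-rated trajectories
    above <8-rated ones; both inputs have shape (num_pairs,).
    """
    return -F.logsigmoid(v_pos - v_neg).mean()
```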
ROI Break-Even (Engineering Check)
If N8+ is the number of ≥8-rated conversations in a cycle, P the close rate per such lead, M the average profit per sale, and S the spend in the cycle, a simple break-even condition is: N8+ × P × M ≥ S, equivalently N8+ ≥ S / (P × M).
Note: This equation is a unit-economics check; it does not claim universal performance. It defines the target needed to be ROI-positive with your inputs.
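The same check as a small calculator (a sketch under the definitions above; function and variable names are ours):

```python
import math


def break_even_positive_conversations(spend: float, close_rate: float, profit_per_sale: float) -> int:
    """Smallest integer N8+ satisfying N8+ * P * M >= S."""
    return math.ceil(spend / (close_rate * profit_per_sale))


def cycle_roi(n_positive: int, close_rate: float, profit_per_sale: float, spend: float) -> float:
    """Simple cycle ROI: (expected profit - spend) / spend."""
    return (n_positive * close_rate * profit_per_sale - spend) / spend
```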
Private LLMs (Per-Client Isolation)
Each client has a separate policy/value stack fine-tuned only on that client's rated conversations and measured outcomes. Models are served and (nightly) updated on NVIDIA H100/H200 systems—selected for tensor throughput and HBM bandwidth required by long-context Transformer workloads.
Why Per-Client?
• Data isolation and compliance are simpler to reason about
• The model becomes a proprietary asset tuned to one firm's language and playbook
• No claims of cross-tenant data sharing are made or needed
Execution Engine (Language → Actions)
1. Generate: policy proposes copy/scripts/offers conditioned on segment features
2. Select: a small-ε exploration policy (bandit) tries controlled variants while exploiting top candidates (see the sketch below)
3. Deploy: push to Ads/Email/CRM with version IDs
4. Log: outcomes + datasets + model hashes → hash chain (tamper-evident)
5. Retrain: nightly updates from new ≥8 signals and outcomes
This closes the loop from speech → preference → policy → distribution → evidence.
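Step 2 can be as simple as an ε-greedy bandit over candidate assets. A minimal sketch (our own illustration; in practice the value estimates would come from the client's value head or observed outcomes):

```python
import random
from typing import Dict, List


def select_variant(variants: List[str],
                   estimated_value: Dict[str, float],
                   epsilon: float = 0.1) -> str:
    """Epsilon-greedy selection: mostly exploit the best-scoring asset,
    occasionally explore a random controlled variant."""
    if random.random() < epsilon:
        return random.choice(variants)                       # explore
    return max(variants, key=lambda v: estimated_value[v])   # exploit


# Example: pick among three subject-line variants scored by the value model.
chosen = select_variant(["v1", "v2", "v3"], {"v1": 0.62, "v2": 0.71, "v3": 0.55})
```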
Governance, Risk, Alignment
Reward Drift
Audit that "≥8" correlates with commercial value; re-calibrate if needed (preference RL best practice).
Bias & Safety
Monitor value-head calibration; keep exploration budgets bounded; maintain allowlists for execution targets.
Provenance
Hash chain maintains an immutable trail of what changed, why, and with which data.
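A minimal sketch of such an append-only hash chain (our illustration, not Aonxi's log format): each entry commits to the previous entry's hash, so any later edit breaks verification.

```python
import hashlib
import json
import time
from typing import Any, Dict, List


def append_entry(chain: List[Dict[str, Any]], payload: Dict[str, Any]) -> Dict[str, Any]:
    """Append a tamper-evident entry whose hash covers the payload and the previous hash."""
    prev_hash = chain[-1]["hash"] if chain else "0" * 64  # genesis entries link to an all-zero hash
    entry = {"timestamp": time.time(), "payload": payload, "prev_hash": prev_hash}
    entry["hash"] = hashlib.sha256(json.dumps(entry, sort_keys=True).encode()).hexdigest()
    chain.append(entry)
    return entry


def verify(chain: List[Dict[str, Any]]) -> bool:
    """Recompute every hash and link; returns False if any entry was altered."""
    prev_hash = "0" * 64
    for entry in chain:
        body = {k: v for k, v in entry.items() if k != "hash"}
        if entry["prev_hash"] != prev_hash:
            return False
        if hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest() != entry["hash"]:
            return False
        prev_hash = entry["hash"]
    return True
```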
Privacy
Per-client models and stores. No cross-tenant training.
Implementation Notes
• Transcription/segmentation: long-form Transformers are suitable for diarization-aware summarization and intent/objection tagging
• Human-in-the-loop: keep a simple rater UI (1–10). Use ≥8 as a contracted truth signal; re-check quarterly
• Hardware: for fast iteration on long contexts and nightly updates, Hopper-class GPUs (H100/H200) provide appropriate memory bandwidth and tensor throughput
• Change control: all shipped assets carry a version + hash; dashboards show which call clusters influenced which copy changes, auditable down to the training batch
Worked Example (Illustrative, Not a Result)
Assume a 60-day cycle with spend S=$18,000, close rate per ≥8 lead P=0.30, profit per sale M=$900.
Break-even ≥8 threshold: N8+ ≥ S / (P × M) = 18,000 / (0.30 × 900) ≈ 66.7, i.e., at least 67 such conversations.
If you logged N8+=75, then expected profit (illustrative) is 75 × 0.30 × 900 = $20,250.
ROI = (20,250 − 18,000) / 18,000 = 12.5%
This is just a calculator, not a guarantee. Replace with your actuals to plan targets.
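The same arithmetic as a quick check (our sketch; plug in your own inputs):

```python
spend, close_rate, profit_per_sale, n_positive = 18_000, 0.30, 900, 75

break_even = spend / (close_rate * profit_per_sale)          # ~66.7 -> need at least 67 conversations
expected_profit = n_positive * close_rate * profit_per_sale  # 75 * 0.30 * 900 = 20_250
roi = (expected_profit - spend) / spend                      # 0.125 -> 12.5%
```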
What This Is (and Isn't)
✓ IS
A rigorously grounded system that learns from human-rated conversations using Transformers + preference RL and ships changes with verifiable provenance.
✗ IS NOT
A claim of universal uplift or specific industry benchmarks. Aonxi provides the loop; results depend on inputs and execution.
References (Primary Sources)
Vaswani et al., "Attention Is All You Need." NeurIPS (2017)
(Transformer/attention for long-context sequence modeling)
Christiano et al., "Deep Reinforcement Learning from Human Preferences." NeurIPS (2017)
(Preference-based RL for aligning policies with human judgments)
Bradley & Terry, "Rank Analysis of Incomplete Block Designs: I. The Method of Paired Comparisons." Biometrika (1952)
(Paired comparison model, widely used in preference learning)
Nakamoto, "Bitcoin: A Peer-to-Peer Electronic Cash System." (2008)
(Hash-chained, tamper-evident logging)
NVIDIA, H100 Tensor Core GPU (Hopper architecture) docs/whitepaper
(Hardware characteristics for training/serving)