An honest evaluation of a composite intelligence system: what the artifacts support, and what they refute.
Draft v1.1 — 2026-06-21. Results current as of the 2026-06-21 Attention Diagnostic Suite, Phase-0 forward-holdout panel, GNN promotion audit, fit_study staging run, and the 2026-06-08 LF-1 anchored test (attempt #2, closed FAIL).
The healthcare sector generates thousands of news signals daily across regulatory actions, M&A transactions, clinical trial outcomes, executive movements, and policy changes. Traditional intelligence systems process these signals reactively — Bloomberg reports the deal, Reuters covers the announcement, sentiment scores measure the reaction.
We propose the Plocamium Signal Index (PSI), an intelligence system that characterizes the information structure around healthcare entities. PSI was designed on the hypothesis that events in healthcare M&A, regulatory action, and strategic partnerships are preceded by observable changes in information flow — attention cascades, narrative shifts, graph restructuring, sentiment-momentum divergence, source concentration — days to weeks before formal announcements. That forecasting hypothesis was tested externally and did not hold (§6.4: on SEC-filing-dated deal events, PSI does not beat a naive mention-spike baseline). PSI’s demonstrated value is therefore as a structural and enrichment system — quantifying and explaining an entity’s information profile — rather than as an event forecaster. This paper reports that distinction honestly; §6 documents both what is and is not supported.
We model entity mention arrivals as a self-exciting Hawkes process. For entity e, the conditional intensity function is:
λ(t) = μ_e + α Σ_{t_i < t} β exp(-β(t - t_i))
where μ_e is the baseline mention rate estimated from the trailing 14-day window, α is the self-excitation parameter (fixed at 0.5), and β is the decay rate (fixed at 1.0 day⁻¹). The ACI signal for entity e on day d is the standardized residual:
ACI_e(d) = (observed_count - expected_count) / σ_e
where expected_count is the integral of λ(t) over day d and σ_e is the historical standard deviation of residuals. Positive ACI indicates attention exceeding the entity’s own self-exciting baseline — a signal that exogenous information is driving coverage beyond endogenous momentum.
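A minimal sketch of the ACI computation on daily-binned counts may help make the integral concrete. It rests on assumptions the text does not state. Each earlier day's mentions are placed at that day's midpoint, and only earlier days excite day d. μ_e is taken as the plain mean of the trailing 14 daily counts, and σ_e as the sample standard deviation of the entity's earlier residuals. The helper names `expected_count` and `aci_series` are hypothetical.

```python
import numpy as np

ALPHA = 0.5  # self-excitation parameter (fixed in the text)
BETA = 1.0   # decay rate in day^-1 (fixed in the text)


def expected_count(counts, d, mu, alpha=ALPHA, beta=BETA):
    """Integral of lambda(t) over day d, i.e. the interval [d, d + 1).

    Assumption: mentions from earlier day k sit at t_k = k + 0.5, and
    same-day excitation is ignored.
    """
    t_k = np.arange(d) + 0.5
    # Closed-form integral of beta * exp(-beta (t - t_k)) over [d, d + 1)
    kernel_mass = np.exp(-beta * (d - t_k)) - np.exp(-beta * (d + 1 - t_k))
    return mu + alpha * np.sum(counts[:d] * kernel_mass)


def aci_series(counts, window=14, min_history=2):
    """Standardized Hawkes residuals; NaN where history is insufficient."""
    counts = np.asarray(counts, dtype=float)
    n = len(counts)
    residuals = np.full(n, np.nan)
    aci = np.full(n, np.nan)
    for d in range(window, n):
        mu = counts[d - window:d].mean()  # trailing 14-day baseline rate
        residuals[d] = counts[d] - expected_count(counts, d, mu)
        history = residuals[window:d]  # residuals from earlier days only
        if len(history) >= min_history:
            sigma = history.std(ddof=1)
            if sigma > 0:
                aci[d] = residuals[d] / sigma
    return aci
```

With a month of counts hovering around 5 per day followed by a day of 30 mentions, the last entry of `aci_series` is strongly positive, which is the "attention beyond the self-exciting baseline" case described above.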
For each entity e, we compute the centroid of Amazon Titan (1,536-dim) embeddings from all articles mentioning e in the trailing 7-day window. NED is the cosine distance between the current week’s centroid and the prior week’s centroid:
NED_e(d) = 1 - cos(c_e^{current}, c_e^{prior})
High NED indicates that the narrative context surrounding an entity has shifted — the same entity is being discussed in materially different terms. For emerging entities (first appearance), NED is set to 1.0 (maximum drift), reflecting complete novelty. NED captures qualitative narrative change that volume-based signals miss entirely: an entity can maintain constant mention volume while undergoing a complete narrative reframing.
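A minimal sketch of NED under stated assumptions. The function name `ned` is hypothetical, and each input is a list of per-article embedding vectors for one 7-day window. Production vectors would be 1,536-dimensional Titan embeddings; the sketch itself is dimension-agnostic.

```python
import numpy as np


def ned(current_embeddings, prior_embeddings):
    """Narrative embedding drift: 1 - cosine(current centroid, prior centroid)."""
    if prior_embeddings is None or len(prior_embeddings) == 0:
        return 1.0  # emerging entity: maximum drift by convention
    c_cur = np.mean(np.asarray(current_embeddings, dtype=float), axis=0)
    c_pri = np.mean(np.asarray(prior_embeddings, dtype=float), axis=0)
    cos = c_cur @ c_pri / (np.linalg.norm(c_cur) * np.linalg.norm(c_pri))
    return float(1.0 - cos)
```

Because only the centroid direction matters, an entity discussed in the same terms but twice as often has NED near 0. This is the volume-invariance the paragraph above relies on.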
We construct the entity co-occurrence graph G_d = (V, E_d) where edges are weighted by co-mention frequency in the trailing 14-day window. The normalized Laplacian is:
L = I - D^{-1/2} A D^{-1/2}
where A is the adjacency matrix and D is the degree matrix. We compute the top-k eigenvalues (k = min(50, |V|)) of L for consecutive days and measure:
GSS(d) = ||λ(d) - λ(d-1)||_2
For per-entity attribution, we compute the change in each entity’s eigenvector centrality between consecutive snapshots. GSS is designed to capture structural reorganization of the entity network; the design hypothesis is that mergers, new alliances, and cluster dissolution manifest as spectral shifts before they appear in headlines.
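A minimal sketch of GSS on dense adjacency matrices. It assumes "top-k" means the k largest eigenvalues. When the two days' vertex sets differ in size, it compares only the overlapping prefix of the sorted spectra; the text does not say how size changes are handled. Isolated nodes get a zero D^{-1/2} entry. All function names are hypothetical.

```python
import numpy as np


def normalized_laplacian(A):
    """L = I - D^{-1/2} A D^{-1/2}; isolated nodes get a zero D^{-1/2} entry."""
    A = np.asarray(A, dtype=float)
    deg = A.sum(axis=1)
    inv_sqrt = np.zeros_like(deg)
    nonzero = deg > 0
    inv_sqrt[nonzero] = 1.0 / np.sqrt(deg[nonzero])
    return np.eye(len(A)) - inv_sqrt[:, None] * A * inv_sqrt[None, :]


def top_k_eigenvalues(A, k_max=50):
    L = normalized_laplacian(A)
    k = min(k_max, len(L))
    return np.linalg.eigvalsh(L)[::-1][:k]  # eigvalsh returns ascending order


def gss(A_today, A_yesterday, k_max=50):
    """L2 distance between the top-k spectra of two daily snapshots."""
    a = top_k_eigenvalues(A_today, k_max)
    b = top_k_eigenvalues(A_yesterday, k_max)
    k = min(len(a), len(b))  # vertex sets may differ in size between days
    return float(np.linalg.norm(a[:k] - b[:k]))
```

For example, a triangle has normalized-Laplacian spectrum {0, 1.5, 1.5} and a three-node path has {0, 1, 2}, so rewiring one into the other gives GSS = √0.5 ≈ 0.707.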
For each entity e, we fit an OLS regression of sentiment on momentum (trailing 7-day mention velocity):
sentiment_e(d) = β_0 + β_1 · momentum_e(d) + ε_e(d)
The SMD signal is the standardized residual ε̂_e(d). Positive SMD indicates sentiment exceeding what momentum alone would predict (the market is more positive than attention warrants). Negative SMD indicates sentiment lagging momentum (rising attention with deteriorating narrative). SMD is intended as a contrarian signal: the design hypothesis is that extreme positive SMD precedes corrections, while extreme negative SMD (attention rising, sentiment falling) precedes adverse events such as regulatory actions or earnings misses.
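A minimal sketch of SMD for one entity. It assumes momentum (trailing 7-day mention velocity) has already been computed. It standardizes by the residual standard error with n − 2 degrees of freedom, which is one common definition of a standardized residual; the text does not specify which is used. The function name `smd` is hypothetical.

```python
import numpy as np


def smd(sentiment, momentum):
    """Standardized OLS residuals of sentiment on momentum for one entity."""
    y = np.asarray(sentiment, dtype=float)
    x = np.asarray(momentum, dtype=float)
    X = np.column_stack([np.ones_like(x), x])  # intercept beta_0 and slope beta_1
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    # Residual standard error with n - 2 degrees of freedom (two fitted parameters)
    s = np.sqrt(np.sum(resid ** 2) / (len(y) - 2))
    return resid / s
```

A day whose sentiment sits well above the fitted line gets the largest positive SMD. Because the regression includes an intercept, the SMD series averages to zero.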
We compute the Herfindahl-Hirschman Index across source domains for each entity’s mentions:
SC_e(d) = Σ_i s_i^2
where s_i is the share of mentions from source domain i. SC ranges from 1/N (perfect diversification across N source domains) to 1.0 (single source). High SC indicates information asymmetry: when a story is concentrated in one or two sources, it may represent a leak, exclusive, or planted narrative rather than broad consensus. SC serves as a signal quality modifier: high-SC signals should be treated with higher uncertainty.
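The Herfindahl-Hirschman computation is short enough to show in full. This sketch assumes a list of source domains, one per mention, with at least one mention. The function name is hypothetical.

```python
from collections import Counter


def source_concentration(domains):
    """Herfindahl-Hirschman Index of mention shares across source domains."""
    counts = Counter(domains)
    total = sum(counts.values())
    return sum((c / total) ** 2 for c in counts.values())
```

For example, `source_concentration(["reuters.com", "reuters.com", "statnews.com"])` is (2/3)² + (1/3)² = 5/9 ≈ 0.556. Four equally represented domains give the floor 1/N = 0.25.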
The five raw signals are orthogonalized via PCA to remove residual correlation. We apply the Kaiser criterion (retain components with eigenvalue > 1) followed by VIF rejection (remove any component with Variance Inflation Factor > 5). This ensures that the composite score is not dominated by correlated signal pairs. The retained principal components are rotated back to the original signal space for interpretability.
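A minimal sketch of the Kaiser step, assuming PCA on the correlation matrix of standardized signals. In-sample principal-component scores are mutually uncorrelated, so their VIF is 1 by construction. The sketch therefore shows the VIF helper applied to the raw signals, which is one possible reading of the VIF rejection step. "Rotated back" is read here as reconstruction through the retained loadings. Function names are hypothetical.

```python
import numpy as np


def vif(X):
    """Variance Inflation Factor per column: 1 / (1 - R^2) of column j on the rest."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        r2 = 1.0 - np.var(y - others @ beta) / np.var(y)
        out[j] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return out


def kaiser_pca(X):
    """Correlation-matrix PCA keeping components with eigenvalue > 1 (Kaiser)."""
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    eigvals, eigvecs = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals > 1.0
    loadings = eigvecs[:, keep]
    scores = Z @ loadings          # retained principal components
    Z_back = scores @ loadings.T   # mapped back to the original signal axes
    return eigvals[keep], scores, Z_back
```

With two nearly collinear signals, both show VIF well above 5. They also merge into a leading component with eigenvalue close to 2, which is the correlated-pair domination the paragraph above aims to prevent.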
We implement Bayesian online changepoint detection (BOCPD; Adams and MacKay, 2007) with a Normal-Inverse-Gamma conjugate prior on the composite signal stream. The prior parameters are:
Regime classification uses cumulative probability mass at short run lengths (< 7 days). A 7-day transition window smooths regime boundaries to prevent flip-flopping. Regimes are labeled: QUIET, NORMAL, ELEVATED, HIGH, EXTREME.
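A minimal sketch of BOCPD with a Normal-Inverse-Gamma prior, returning the run-length mass below 7 days that the regime logic uses. The prior values are not given here, so the hyperparameters below are placeholders, as is the constant hazard of 1/100. The posterior predictive is the standard Student-t for this conjugate pair. The 7-day transition smoothing is not shown, and function names are hypothetical.

```python
import numpy as np
from math import lgamma, pi


def student_t_logpdf(x, loc, scale2, nu):
    """Log density of a Student-t with squared scale `scale2` and `nu` dof."""
    log_norm = np.array([lgamma((v + 1) / 2) - lgamma(v / 2) for v in nu])
    return (log_norm - 0.5 * np.log(nu * pi * scale2)
            - (nu + 1) / 2 * np.log1p((x - loc) ** 2 / (nu * scale2)))


def bocpd_short_run_mass(xs, hazard=1 / 100, mu0=0.0, kappa0=1.0,
                         alpha0=1.0, beta0=1.0, short=7):
    """Return P(run length < short) after each observation (placeholder priors)."""
    mu, kappa = np.array([mu0]), np.array([kappa0])
    alpha, beta = np.array([alpha0]), np.array([beta0])
    R = np.array([1.0])  # run-length posterior
    out = []
    for x in xs:
        # Posterior predictive under each current run length
        scale2 = beta * (kappa + 1) / (alpha * kappa)
        pred = np.exp(student_t_logpdf(x, mu, scale2, 2 * alpha))
        growth = R * pred * (1 - hazard)
        changepoint = np.sum(R * pred * hazard)
        R = np.append(changepoint, growth)
        R = R / R.sum()
        # Conjugate NIG updates; a fresh prior is prepended for run length 0
        new_mu = (kappa * mu + x) / (kappa + 1)
        new_beta = beta + kappa * (x - mu) ** 2 / (2 * (kappa + 1))
        mu = np.append(mu0, new_mu)
        kappa = np.append(kappa0, kappa + 1)
        alpha = np.append(alpha0, alpha + 0.5)
        beta = np.append(beta0, new_beta)
        out.append(R[:short].sum())
    return np.array(out)
```

On a series that jumps from mean 0 to mean 8, the short-run mass stays low inside the first segment and rises close to 1 in the days right after the jump. That rise is what would move an entity out of its current regime.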
Each signal component receives weight inversely proportional to its trailing variance:
w_i = (1/σ_i^2) / Σ_j (1/σ_j^2)
This auto-updating scheme prioritizes stable, informative signals and downweights noisy components. Weights are recalculated daily on a 30-day trailing window.
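A minimal sketch of the inverse-variance weighting over the trailing 30-day window. It assumes a (days × signals) array with the most recent day last and non-constant signals; a constant signal would divide by zero. The function name is hypothetical.

```python
import numpy as np


def inverse_variance_weights(signal_history, window=30):
    """w_i = (1 / var_i) / sum_j (1 / var_j) over the trailing window."""
    recent = np.asarray(signal_history, dtype=float)[-window:]
    inv_var = 1.0 / recent.var(axis=0, ddof=1)
    return inv_var / inv_var.sum()
```

A signal with four times the variance of another receives a quarter of its weight. With two signals, the split is 0.8 / 0.2.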
We apply Harvey’s (1989) local linear trend model:
State: x_t = x_{t-1} + v_{t-1} + η_t, η_t ~ N(0, Q_level)
Velocity: v_t = v_{t-1} + ζ_t, ζ_t ~ N(0, Q_trend)
Obs: y_t = x_t + ε_t, ε_t ~ N(0, R)
The Rauch-Tung-Striebel backward pass decomposes each entity’s score into:
- Trend (smoothed state): reported as the PSI score
- Mean-reverting component (innovation residual): used for alert generation
- Noise (observation error): discarded
Confidence intervals are derived directly from the smoothed state covariance matrix P_{t|T}.
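A minimal sketch of the local linear trend filter and RTS smoother. The noise variances Q_level, Q_trend, and R are not given in the text, so the defaults are placeholders. The initial state is a loose prior centred on the first observation, and the band is ±1.96 standard deviations of the smoothed level taken from P_{t|T}. The function name is hypothetical.

```python
import numpy as np


def llt_smooth(y, q_level=0.1, q_trend=0.01, r=1.0, z=1.96):
    """Local linear trend Kalman filter + Rauch-Tung-Striebel smoother.

    Returns the smoothed level and a +/- z * sd band from P_{t|T}.
    Noise variances are placeholders, not production settings.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    F = np.array([[1.0, 1.0], [0.0, 1.0]])  # level += velocity; velocity persists
    H = np.array([[1.0, 0.0]])              # observe the level plus noise
    Q = np.diag([q_level, q_trend])
    x_pred = np.zeros((n, 2)); P_pred = np.zeros((n, 2, 2))
    x_filt = np.zeros((n, 2)); P_filt = np.zeros((n, 2, 2))
    x, P = np.array([y[0], 0.0]), np.eye(2) * 1e3  # loose initial prior
    for t in range(n):
        if t > 0:  # time update
            x, P = F @ x, F @ P @ F.T + Q
        x_pred[t], P_pred[t] = x, P
        S = (H @ P @ H.T)[0, 0] + r   # innovation variance
        K = (P @ H.T)[:, 0] / S       # Kalman gain, shape (2,)
        x = x + K * (y[t] - x[0])
        P = P - np.outer(K, H @ P)
        x_filt[t], P_filt[t] = x, P
    x_s, P_s = x_filt.copy(), P_filt.copy()
    for t in range(n - 2, -1, -1):   # RTS backward pass
        C = P_filt[t] @ F.T @ np.linalg.inv(P_pred[t + 1])
        x_s[t] = x_filt[t] + C @ (x_s[t + 1] - x_pred[t + 1])
        P_s[t] = P_filt[t] + C @ (P_s[t + 1] - P_pred[t + 1]) @ C.T
    return x_s[:, 0], z * np.sqrt(P_s[:, 0, 0])
```

The innovation residual used for alerts could then be approximated as `y - level`. That split is an illustration, not necessarily how the pipeline separates the components.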
Entities with fewer than 30 observations receive James-Stein shrinkage toward their BICS sector mean:
PSI_shrunk = (1 - B) · PSI_raw + B · PSI_sector
where B is the James-Stein shrinkage factor and confidence ramps linearly from 0 at n=0 to 1 at n=30. Entities with fewer than 10 observations are suppressed entirely (score = 0, flagged as insufficient data).
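The text does not give the formula for B or how it interacts with the confidence ramp. This sketch therefore assumes the positive-part Efron-Morris form of James-Stein shrinkage toward an estimated sector mean, using the (k − 3) factor that applies when the mean is estimated. It treats `noise_var`, the sampling variance of a raw score, as a hypothetical input. It also leaves scores unchanged when a sector has three or fewer entities, where the estimator is undefined. All names are hypothetical.

```python
import numpy as np

MIN_OBS_SUPPRESS = 10      # below this, the score is suppressed
FULL_CONFIDENCE_OBS = 30   # confidence reaches 1 here


def james_stein_toward_mean(psi_raw, noise_var):
    """Positive-part James-Stein shrinkage of one sector's scores toward its mean."""
    x = np.asarray(psi_raw, dtype=float)
    k, mean = len(x), x.mean()
    ss = np.sum((x - mean) ** 2)
    if k <= 3 or ss == 0.0:
        return x.copy(), 0.0  # estimator undefined: leave raw scores unchanged
    B = min(1.0, (k - 3) * noise_var / ss)
    return (1.0 - B) * x + B * mean, B


def confidence(n_obs):
    """Linear ramp from 0 at n = 0 to 1 at n = 30."""
    return min(n_obs, FULL_CONFIDENCE_OBS) / FULL_CONFIDENCE_OBS


def published_score(psi_shrunk, n_obs):
    return 0.0 if n_obs < MIN_OBS_SUPPRESS else psi_shrunk  # insufficient data
```

Shrinkage pulls every score toward the sector mean by the same fraction B. The sector mean is therefore preserved, and noisier sectors (larger `noise_var` relative to spread) are pulled harder.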
The final PSI score is a z-score with regime classification:
Table 1: PSI regime classification thresholds (z-score bands).
| Regime | Z-Score Range | Interpretation |
|---|---|---|
| QUIET | < -1 | Below-normal activity; entity fading from coverage |
| NORMAL | -1 to +1 | Baseline activity; no actionable signal |
| ELEVATED | +1 to +2 | Above-normal activity; worth monitoring |
| HIGH | +2 to +3 | Significant disruption; likely event precursor |
| EXTREME | > +3 | Rare signal intensity; immediate attention required |
For the first 30 days of system operation, percentile calibration supplements the z-score thresholds to account for limited distributional data.
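Table 1 can be read as a simple mapping from z-score to label. Table 1 does not say which band owns the exact boundary values; this sketch assigns them to the less extreme band. It omits the BOCPD transition window and the early percentile calibration. The function name is hypothetical.

```python
def psi_regime(z):
    """Map a PSI z-score to its Table 1 regime (boundaries go to the milder band)."""
    if z < -1.0:
        return "QUIET"
    if z <= 1.0:
        return "NORMAL"
    if z <= 2.0:
        return "ELEVATED"
    if z <= 3.0:
        return "HIGH"
    return "EXTREME"
```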
We compute four classical link prediction scores for all non-adjacent entity pairs:
These heuristics capture structural proximity from different perspectives — local neighborhood overlap (Jaccard, CN), weighted overlap penalizing high-degree intermediaries (Adamic-Adar), and global popularity (PA).
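A minimal sketch of the four heuristics for one candidate pair. It assumes an unweighted adjacency given as a dict of neighbour sets. The production graph is weighted by co-mention frequency, and weighted variants of these scores exist. The function name is hypothetical.

```python
import math


def link_heuristics(adj, u, v):
    """Four classical scores for a candidate pair (u, v); adj maps node -> neighbour set."""
    nu, nv = adj[u], adj[v]
    common = nu & nv
    union = nu | nv
    return {
        "common_neighbors": len(common),
        "jaccard": len(common) / len(union) if union else 0.0,
        # Shared neighbours with high degree contribute less (1 / log degree)
        "adamic_adar": sum(1.0 / math.log(len(adj[w])) for w in common if len(adj[w]) > 1),
        "preferential_attachment": len(nu) * len(nv),
    }
```

For two entities sharing neighbours c (degree 2) and d (degree 3), Adamic-Adar gives 1/ln 2 + 1/ln 3. The shared hub d therefore counts for less than c.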
A logistic regression model trained on 10 pair features:
Training uses 80/20 stratified split with class-balanced sampling.
A 2-layer GCN encoder following Kipf and Welling (2017):
H^{(1)} = ReLU(Â X W^{(0)})
Z = Â H^{(1)} W^{(1)}
where Â = D̃^{-1/2} Ã D̃^{-1/2} is the normalized adjacency with self-loops (Ã = A + I, with D̃ the degree matrix of Ã) and X is the node feature matrix (PSI scores, degree, sector encoding). Link scores are computed via cosine similarity: score(u,v) = cos(z_u, z_v). Training minimizes binary cross-entropy on held-out edges.
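A minimal sketch of the two-layer GCN forward pass and cosine link score. The weight matrices are passed in as arguments; in the sketch, random matrices stand in for trained ones, and the BCE training loop is not shown. Function names are hypothetical.

```python
import numpy as np


def normalized_adjacency(A):
    """A_hat = D_tilde^{-1/2} (A + I) D_tilde^{-1/2}."""
    A_tilde = np.asarray(A, dtype=float) + np.eye(len(A))
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]


def gcn_encode(A, X, W0, W1):
    """Two-layer GCN: Z = A_hat ReLU(A_hat X W0) W1."""
    A_hat = normalized_adjacency(A)
    H1 = np.maximum(A_hat @ X @ W0, 0.0)
    return A_hat @ H1 @ W1


def cosine_link_score(Z, u, v):
    zu, zv = Z[u], Z[v]
    denom = np.linalg.norm(zu) * np.linalg.norm(zv)
    return float(zu @ zv / denom) if denom > 0 else 0.0
```

Self-loops keep each node's own features in its embedding. Symmetric normalization keeps high-degree entities from dominating their neighbours' representations.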
Biased random walks with return parameter p = 1 and in-out parameter q = 0.5; because q < 1, walks are biased toward outward, DFS-like exploration rather than BFS-like exploration. Walk parameters:
Training uses skip-gram with negative sampling, optimized with SGD. Link scores are computed as the dot product of learned embeddings.
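A minimal sketch of the node2vec second-order walk. It assumes an unweighted graph given as a dict of neighbour sets and a uniform first step. Neighbours are sorted so a fixed seed is reproducible. Skip-gram training is not shown, and the function names are hypothetical.

```python
import random


def transition_weights(adj, prev, cur, p=1.0, q=0.5):
    """Unnormalized node2vec weights for leaving cur, having arrived from prev."""
    weights = {}
    for x in sorted(adj[cur]):
        if x == prev:
            weights[x] = 1.0 / p   # return to the previous node
        elif x in adj[prev]:
            weights[x] = 1.0       # stay at distance 1 from prev
        else:
            weights[x] = 1.0 / q   # move outward (favoured when q < 1)
    return weights


def node2vec_walk(adj, start, length, p=1.0, q=0.5, seed=0):
    rng = random.Random(seed)
    walk = [start]
    while len(walk) < length and adj[walk[-1]]:
        cur = walk[-1]
        if len(walk) == 1:
            walk.append(rng.choice(sorted(adj[cur])))  # first step is uniform
            continue
        w = transition_weights(adj, walk[-2], cur, p, q)
        walk.append(rng.choices(list(w), weights=list(w.values()), k=1)[0])
    return walk
```

With q = 0.5, a neighbour two hops from the previous node gets weight 2 versus 1 for a neighbour that stays close. That is why these settings lean toward DFS-like exploration.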
A 2-layer GCN encoder produces mean (μ) and log-variance (log σ²) vectors for each node:
μ = GCN_μ(X, A)
log σ² = GCN_σ(X, A)
z = μ + σ ⊙ ε, ε ~ N(0, I)
The decoder reconstructs the adjacency matrix via inner product: Â = σ(Z Z^T). Training maximizes the ELBO:
L = E_q[log p(A|Z)] - KL[q(Z|X,A) || p(Z)]
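A minimal sketch of a single-sample Monte Carlo estimate of this ELBO. It takes the encoder outputs μ and log σ² as given; the GCN_μ and GCN_σ networks are not shown. It treats every adjacency entry, including the diagonal, as an independent Bernoulli, which is an assumption. Function names are hypothetical.

```python
import numpy as np


def kl_to_standard_normal(mu, logvar):
    """KL[N(mu, diag(exp(logvar))) || N(0, I)], summed over nodes and dimensions."""
    return -0.5 * np.sum(1.0 + logvar - mu ** 2 - np.exp(logvar))


def vgae_elbo(A, mu, logvar, rng):
    """Single-sample Monte Carlo estimate of the VGAE ELBO."""
    eps = rng.standard_normal(mu.shape)
    Z = mu + np.exp(0.5 * logvar) * eps  # reparameterization: sigma = exp(logvar / 2)
    logits = Z @ Z.T                      # inner-product decoder
    # log sigmoid(s) = -log(1 + e^{-s}); log(1 - sigmoid(s)) = -log(1 + e^{s})
    log_lik = np.sum(A * -np.logaddexp(0.0, -logits)
                     + (1.0 - A) * -np.logaddexp(0.0, logits))
    return log_lik - kl_to_standard_normal(mu, logvar)
```

Writing the Bernoulli log-likelihood with `np.logaddexp` avoids overflow for large logits. Training would maximize this quantity, averaged over samples of ε.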
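Entities and relations are embedded in ℝ^d such that h + r ≈ t for valid triples (h, r, t), following the TransE formulation. TransE was later removed from the deployed ensemble in A3 for non-reproducibility. We define four relation types derived from graph context: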
Training minimizes the margin-based ranking loss with negative sampling. Link prediction scores are computed as -||h + r - t|| for each candidate relation type, taking the maximum.
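A minimal sketch of the TransE scoring rule and margin ranking loss. It assumes the L2 norm; the text does not say whether L1 or L2 is used. Embedding training and negative sampling are not shown. Function names are hypothetical.

```python
import numpy as np


def transe_distance(h, r, t):
    """L2 distance ||h + r - t||; small for plausible triples."""
    return float(np.linalg.norm(np.asarray(h) + np.asarray(r) - np.asarray(t)))


def transe_link_score(h, t, relations):
    """Score a pair as the max over relation types of -||h + r - t||."""
    return max(-transe_distance(h, r, t) for r in relations)


def margin_ranking_loss(pos_dist, neg_dist, margin=1.0):
    """Mean of max(0, margin + d(positive) - d(corrupted))."""
    return float(np.mean(np.maximum(0.0, margin + np.asarray(pos_dist) - np.asarray(neg_dist))))
```

The loss is zero once every true triple is at least `margin` closer than its corrupted counterpart. Only violating pairs contribute gradient.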
The eight model scores are combined via weighted average:
score_ensemble(u,v) = Σ_i w_i · score_i(u,v)
Weights are optimized through a two-phase process:
1. Dirichlet search: Sample 200 random weight vectors from Dir(α=1) and evaluate them on validation edges.
2. Perturbation refinement: Take the best Dirichlet sample, perturb each weight by ±0.05, and keep improvements.
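A minimal sketch of the two-phase weight search. Several details are assumptions rather than stated facts. Each model's scores are taken to be on a comparable scale already (for example, rank-normalized). ROC-AUC serves as the validation objective. Weights are renormalized after each ±0.05 perturbation, and the perturbation pass repeats until no single change improves AUC. Function names are hypothetical.

```python
import numpy as np


def roc_auc(scores, labels):
    """Mann-Whitney ROC-AUC; ties count one half."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    diff = scores[labels][:, None] - scores[~labels][None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())


def optimize_weights(S, labels, n_samples=200, step=0.05, seed=0):
    """S: (n_pairs, n_models) validation scores. Returns (weights, best AUC)."""
    rng = np.random.default_rng(seed)
    n_models = S.shape[1]
    # Phase 1: Dirichlet(1, ..., 1) random search
    W = rng.dirichlet(np.ones(n_models), size=n_samples)
    aucs = [roc_auc(S @ w, labels) for w in W]
    best_w, best = W[int(np.argmax(aucs))], max(aucs)
    # Phase 2: +/- step perturbation of each weight, keeping strict improvements
    improved = True
    while improved:
        improved = False
        for i in range(n_models):
            for delta in (step, -step):
                w = best_w.copy()
                w[i] = max(0.0, w[i] + delta)
                if w.sum() == 0:
                    continue
                w = w / w.sum()
                auc = roc_auc(S @ w, labels)
                if auc > best:
                    best, best_w, improved = auc, w, True
    return best_w, best
```

The refinement loop terminates because AUC takes finitely many values and each accepted step strictly increases it. Like any search on a single validation set, the chosen weights can overfit that set.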
Post-processing filters:
- BICS sector filter: Both entities must have a sector classification
- Entity deduplication: Substring pairs excluded (e.g., “UnitedHealth” / “UnitedHealth Group”)
- Confidence threshold: Model agreement (how many of the 8 models rank the pair in their top-20)
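A minimal sketch of the three filters. It assumes case-insensitive substring matching, candidate pairs stored as tuples in a canonical order, and one ranked list of pairs per model. The agreement threshold itself is not fixed here. All names are hypothetical.

```python
def is_substring_pair(a, b):
    """True when one entity name contains the other (case-insensitive)."""
    a, b = a.casefold(), b.casefold()
    return a in b or b in a


def model_agreement(pair, ranked_lists, top_n=20):
    """Number of models whose top-n predicted pairs include `pair`."""
    return sum(pair in set(ranked[:top_n]) for ranked in ranked_lists)


def keep_pair(u, v, sector_of):
    """BICS sector filter plus substring deduplication."""
    return (sector_of.get(u) is not None and sector_of.get(v) is not None
            and not is_substring_pair(u, v))
```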
Each entity receives a natural-language interpretation generated via a template-based system enhanced by Bedrock Haiku:
Interpretations are regenerated daily and surfaced on the intelligence dashboard alongside the entity card metrics.
The Attention Diagnostic Suite (formerly “validation gates”) runs five statistical probes weekly on PSI history (MIN_VALIDATION_DATE = 2026-04-27). It characterizes signal structure — it is not a deployment go/no-go, does not authorize deal-forecast claims, and does not gate link-prediction promotion (those use separate metrics; see §6.1b–§6.1c and psi-production-model-policy.md).
The diagnostic summary threshold (≥4/5 gates) is informational only; the current result of 2/5 (2026-06-21) does not block production PSI publication.
LF-1 pre-registration. External deal-forecast evaluation uses a locked protocol (LF-1): events anchored to SEC EDGAR filing timestamps, PSI regime flags within ~20 days before filing, success defined as beating a mention-spike baseline with median lead ≥ 3 days. Attempt #2 (2026-06-08) pre-registered before inspection; attempt #3+ requires new pre-registration. See §6.4 and artifact models/psi/lf1_anchored_test.json.
The production link-prediction layer runs eight reproducible models on the
~400-entity production graph (built by global_chart_render from per-article
co-occurrence >= 2). Per-model holdout ROC-AUC and the full Dirichlet-weighted
ensemble are measured monthly by the psi-fit-study Fargate job and reported
here from models/psi/fit_study.json.
Values are taken from the controlled prod ECS run, run_id
2026-06-22T01:50:03Z (87 days of history, 400-entity universe, 5 CV folds,
7-day holdout; task fea0dbb4, exit 0). Holdout ROC-AUC and fit-gap are from fit_study.json;
the Weight column is the production Dirichlet weight from the daily
link_predictions.json optimizer.
Table 2: Link prediction holdout metrics from fit_study (2026-06-21 prod run).
| Model | Holdout ROC-AUC | Fit-Gap | Weight |
|---|---|---|---|
| Logistic Regression | 0.677 | 0.048 | 0.00 |
| Adamic-Adar | 0.555 | 0.061 | 0.00 |
| Common Neighbors | 0.552 | 0.062 | 0.00 |
| Jaccard | 0.578 | 0.053 | 0.00 |
| GCN | 0.499 | 0.061 | 0.00 |
| Node2Vec | 0.570 | 0.043 | 1.00 |
| Preferential Attachment | 0.472 | 0.069 | 0.00 |
| Ensemble (Dirichlet-weighted) | 0.676 | 0.048 | 1.00 |
Two honest observations from this run:

1. The ensemble fit-gap is 0.048 (overall.status = "HEALTHY"), consistent with the staging run the same day (trend over the last runs: 0.078 → 0.052 → 0.049 → 0.048). The worst single-model fit is preferential_attachment (fit-gap ~0.069); Node2Vec is tightest (~0.043).
2. The fit_study winner is not the production winner. fit_study ranks Logistic Regression highest (holdout ROC-AUC 0.677), while the daily Dirichlet optimizer, run on the psi_compute link-prediction eval set, assigns weight 1.00 to Node2Vec and 0.00 to everything else. Phase-0 forward holdout (§6.1b) ranks GBM above Node2Vec. These are different evaluation sets; “best model” is eval-set dependent.

Task fea0dbb4 wrote models/psi/fit_study.json (~31 KB, run_id 2026-06-22T01:50:03Z, 87 history days). The staging artifact fit_study_STAGING.json (run_id 2026-06-21T22:41:31Z) reports an ensemble fit-gap of 0.0494, within sampling noise of prod. The weekly schedule psi-fit-study-weekly is ENABLED (Sundays 09:00 UTC). Treat the monthly fit_study as an overfit diagnostic only; it does not gate GNN promotion (pinned holdout AUC vs 0.4293) or authorize deal-forecast claims.

Forward link prediction for production scoring policy is evaluated on a pinned 14-day holdout with HeaRT-style hard negatives, three rolling-origin anchors, and five fixed seeds (phase0_baseline_panel.json, generated_for: 2026-06-21). This is Mode E (GBM parity): the graph has not yet shown GNN attention beating a gradient-boosted baseline on forward edge formation.
Table 3: Phase-0 forward holdout panel — production link policy (2026-06-21).
| Arm | Mean AUPRC | Mean AUC | Notes |
|---|---|---|---|
| GBM | 0.717 | ~0.71 | Production-first policy |
| GNN (GATv2 encoder) | 0.589 | ~0.46 | Below GBM and Node2Vec on 2/3 windows |
| Node2Vec-only | 0.580 | ~0.60 | Ablation-backed prod arm; beats GNN on mean AUC and on 2/3 windows |
| Heuristics (best: Adamic-Adar) | 0.556 | ~0.46 | Below GNN AUPRC on aggregate; mean AUC on par with GNN |
| EdgeBank | 0.500 | 0.500 | Recurrence floor |
Baseline hierarchy: EdgeBank (0.50) → heuristics (~0.56) → Node2Vec (0.58) → GNN (0.59) → GBM (0.72). Production policy is GBM / Node2Vec-first; the eight-model Dirichlet ensemble remains a reproducibility and fit_study diagnostic construct — do not claim it beats GBM on forward holdout.
GNN attention (models/psi-gnn/latest.json, May 4 four-key manifest) is a
separate refinement layer from link-prediction scoring. Promotion is gated
solely on pinned forward holdout ROC-AUC — val AUC during GPU training does
not gate promotion.
Table 4: GNN attention promotion gate (holdout AUC vs production bar).
| Field | Value (2026-06-21) |
|---|---|
| Production bar | 0.4293 (encoder_heads_20260504_154154.npz) |
| Holdout window | 2026-06-07 .. 2026-06-21, n_pos 1932 |
| Latest candidate holdout | 0.4205 (Fargate gnn_retrain, 2026-06-21) |
| Decision | SKIPPED — candidate ≤ prod; latest.json untouched |
| GPU val AUC (same run) | 0.7217 — informational only; holdout 0.4286 still below bar |
Weekly auto-retrain (EventBridge) remains disabled until a manual run PROMOTES above the holdout bar. GNN attention does not beat GBM on the Phase-0 panel (§6.1b); scaling GNN is not the validated path for link value.
Per-model marginal contribution is measured by leave-one-out: for each of the eight models, the ensemble is re-optimized on the remaining seven features and the resulting AUC is compared to the full ensemble AUC.
Leave-one-out results from ablation_study.json (2026-06-08 run; full-ensemble
AUC 0.822 on 1,524 positive / 7,482 negative held-out edges):
Table 5: Leave-one-out ablation study (2026-06-08).
| Model | Full AUC | LOO AUC | Drop | Drop (% of Full AUC) |
|---|---|---|---|---|
| Node2Vec | 0.822 | 0.006 | 0.816 | 99.3% |
| Logistic Regression | 0.822 | 0.727 | 0.095 | 11.5% |
| GCN | 0.822 | 0.731 | 0.091 | 11.1% |
| VGAE | 0.822 | 0.733 | 0.089 | 10.8% |
| Adamic-Adar | 0.822 | 0.735 | 0.087 | 10.6% |
| Common Neighbors | 0.822 | 0.735 | 0.087 | 10.6% |
| Jaccard | 0.822 | 0.736 | 0.086 | 10.5% |
| Preferential Attachment | 0.822 | 0.737 | 0.085 | 10.4% |
The previous ablate-by-zeroing logic in psi_compute/ablation.py reported
auc_drop = 0 for any model whose full-ensemble weight was already zero
(an artifact of the daily Dirichlet search). Leave-one-out, landed in A3,
reports each model’s marginal value honestly.
The honest reading: production link prediction is effectively a one-model ensemble. Removing Node2Vec collapses the AUC from 0.822 to 0.006; it carries 99.3% of the contribution. The other seven models are near-redundant: each leave-one-out drop is ~0.085–0.095, because whenever one of them is removed the re-optimizer simply re-fits the survivors to ~0.73. All eight models are run and scored (the “8-model reproducible ensemble” claim is structurally true), but predictively the ensemble is Node2Vec.
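The leave-one-out procedure can be sketched as follows. For brevity, the sketch assumes re-optimization uses the Dirichlet random search only, with no perturbation refinement. It defines the percentage column as drop divided by full-ensemble AUC, the reading under which Table 5's 0.816 / 0.822 gives 99.3%. Function names are hypothetical.

```python
import numpy as np


def roc_auc(scores, labels):
    """Mann-Whitney ROC-AUC; ties count one half."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    diff = scores[labels][:, None] - scores[~labels][None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())


def best_dirichlet_auc(S, labels, n_samples=200, seed=0):
    """Re-optimize ensemble weights by Dirichlet(1, ..., 1) search only."""
    rng = np.random.default_rng(seed)
    W = rng.dirichlet(np.ones(S.shape[1]), size=n_samples)
    return max(roc_auc(S @ w, labels) for w in W)


def leave_one_out(S, labels, names):
    """Per-model drop in re-optimized AUC when that model's column is removed."""
    full = best_dirichlet_auc(S, labels)
    table = {}
    for j, name in enumerate(names):
        loo = best_dirichlet_auc(np.delete(S, j, axis=1), labels)
        table[name] = {"full": full, "loo": loo, "drop": full - loo,
                       "pct": (full - loo) / full}  # drop as a share of full AUC
    return table
```

Because the ensemble is re-optimized without each model rather than having that model's weight zeroed, a model with zero weight in the full ensemble can still show a nonzero drop. This is the behavior the paragraph on ablate-by-zeroing above contrasts against.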
The psi_validate Lambda runs weekly (Sun 08:00 UTC) as the Attention
Diagnostic Suite — characterizing PSI vs forward 7-day log growth in entity
mention count. It is not a deployment go/no-go and does not authorize
deal-forecast claims (external anchor: LF-1 only, §6.4).
All runs cited here use the single post-GNN-blend regime
(MIN_VALIDATION_DATE = 2026-04-27). Earlier weekly runs mixed pre-Apr-15
sparse graph-cap records with post-Apr-27 GNN-blended records; numbers from
those mixed runs should not be cited.
Table 6: Attention Diagnostic Suite summary (2026-06-21).
| Field | Value |
|---|---|
| Validation date | 2026-06-21 |
| N records | 48 |
| N dates | (artifact) |
| Date range | 2026-04-27 → 2026-06-13 |
| Gates passed | 2 / 5 (overall_passed: false) |
Per-gate results:
Honest interpretation. The negative correlation is the substantive finding, not a failure mode: PSI captures attention peaking ahead of mean reversion in next-7-day mention growth. This is a different claim than link-prediction AUC (§6.1–§6.1b). The suite’s 2/5 result is diagnostic characterization — it does not block psi-compute, enrichment, or link prediction in production.
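The per-gate definitions are not listed here. As a generic illustration of the permutation probe, this sketch tests the Pearson correlation between PSI on day d and forward 7-day log growth from d to d + 7. Two details are assumptions: log1p handles zero counts, and the p-value uses the (1 + count) / (n + 1) form. That form never reports exactly 0, whereas a raw proportion can when no permutation matches the observed statistic. Permuting days also ignores autocorrelation; a block permutation would be more conservative. Function names are hypothetical.

```python
import numpy as np


def forward_log_growth(counts, horizon=7):
    """log(1 + c[d + h]) - log(1 + c[d]) for every day d with a full horizon."""
    c = np.asarray(counts, dtype=float)
    return np.log1p(c[horizon:]) - np.log1p(c[:-horizon])


def permutation_test(x, y, n_perm=1000, seed=0):
    """Two-sided permutation test of the Pearson correlation between x and y."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    r_obs = np.corrcoef(x, y)[0, 1]
    null = np.array([np.corrcoef(x, rng.permutation(y))[0, 1] for _ in range(n_perm)])
    p_value = (1 + np.sum(np.abs(null) >= abs(r_obs))) / (n_perm + 1)
    return float(r_obs), float(p_value)
```

To align the series, one would pair `psi[:-7]` with `forward_log_growth(counts)` for the same entity before calling `permutation_test`.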
The event study (models/psi/event_study.json, 2026-06-08 run) tests whether
PSI was ELEVATED in the days before a deal/announcement event. From the
published corpus it extracted 171 events spanning 232 entity-event pairs (after
dropping 2,618 non-PSI entities, 316 duplicate pairs, and 27 events that
predate PSI coverage, which begins 2026-03-23). For each pair PSI is sampled in
six windows: pre-14d, pre-7d, pre-3d, event day, post-3d, post-7d.
Table 7: Internal event study coverage (2026-06-08; not an external anchor).
| Metric | Value |
|---|---|
| Entity-event pairs | 232 |
| Pairs with ELEVATED PSI before the event | 0 / 232 (0.0%) |
| Mean pre-event PSI | -1.0 |
This is not yet an interpretable result, and should not be cited as a null.
The 0/232 hit rate is dominated by coverage, not signal: most pre-event windows
resolve to NO_DATA because the event corpus largely predates dense daily PSI
history (the GNN-blend regime only begins 2026-04-27; see §6.3). Where data
exists, pre-event PSI sits near the -1.0 floor — consistent with the §6.3
finding that PSI is depressed (not elevated) around attention peaks, but the
sample of events with complete pre-event windows is currently too small to
separate that effect from missing data. The event study becomes evaluable only
once the clean-regime corpus is long enough to contain events with fully
populated 14-day lookbacks — estimated Day 90+ of the post-Apr-27 regime
(see §8.3, §9 limitation 5).
Deal-forecast hypothesis tested under pre-registered protocol before results inspection.
Events anchored to SEC EDGAR filing timestamps (8-K Item 1.01/2.01, SC-13D, S-4, DEFM14A) — exogenous to the news stream.
Success rule locked: PSI ELEVATED/CRITICAL within ~20 calendar days before filing must beat mention-spike baseline (bootstrap CI) with median lead ≥ 3 days.
Attempt #2 (2026-06-08) reused attempt #1 rules; only harness fixes (CIK join + trailing baseline) changed — no goalpost moving.
Artifact: models/psi/lf1_anchored_test.json.
Do not reopen without new pre-registration (attempt #3+) and explicit sign-off.
External anchored test — LF-1 attempt #2 (2026-06-08, closed FAIL). The internal
event study above is circular: it measures PSI against the same coverage stream
PSI is built from. We therefore ran a pre-registered, externally anchored test
(models/psi/lf1_anchored_test.json): deal events dated by their SEC EDGAR
filing timestamp (8-K Item 1.01/2.01, SC-13D, S-4, DEFM14A) — exogenous to the
news stream — joined to PSI entities by resolved CIK. Pre-registered rule: an
entity’s deal is “flagged” if PSI was ELEVATED/CRITICAL within ~20 calendar days
before the filing; success = beat a mention-spike baseline (bootstrap-CI margin)
and median lead ≥ 3 days. Attempt #2 uses the same locked rule as attempt #1;
only harness fixes (CIK join + trailing baseline) changed — no goalpost moving.
Result: CLOSED FAIL — PSI does NOT lead deal events. PSI hit-rate 10.8% (CI 4.8–18.1%) vs naive mention-spike baseline 49.4% (CI 39.8–60.2%) — non-overlapping, PASS: False (n=83 evaluable clean-regime events). A trivial attention trigger anticipates ~half of deal events; PSI’s regime flag catches ~1 in 9. Do not reopen without new pre-registration (attempt #3+) and explicit sign-off. This corroborates §6.1b (GBM beats GNN on forward holdout), §6.2 (link prediction is Node2Vec-dominated in production), and §6.3 (PSI is a forward-attention contra-indicator). The deal-forecasting hypothesis is refuted on the one target with a clean external anchor. PSI’s validated value is enrichment — explaining why an entity matters — not event forecasting.
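The artifact's exact bootstrap scheme is not described here. The sketch below shows a percentile bootstrap for a hit rate, under that caveat. The 0/1 vectors in the check are illustrative only. They were chosen so that 9/83 ≈ 10.8% and 41/83 ≈ 49.4% match the reported rates at n = 83, but they are not the artifact's event-level data. The function name is hypothetical.

```python
import numpy as np


def bootstrap_rate_ci(hits, n_boot=10_000, level=0.95, seed=0):
    """Percentile bootstrap CI for a hit rate (mean of 0/1 outcomes)."""
    hits = np.asarray(hits, dtype=float)
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(hits), size=(n_boot, len(hits)))
    rates = hits[idx].mean(axis=1)
    tail = (1.0 - level) / 2.0
    lo, hi = np.quantile(rates, [tail, 1.0 - tail])
    return float(hits.mean()), float(lo), float(hi)
```

Since PSI and the spike baseline are scored on the same events, a paired bootstrap of the difference in hit rates is a natural alternative to checking whether two separate intervals overlap.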
The EDGAR layer (models/psi/edgar_signals.json, 2026-06-07 run) cross-
references PSI entities against recent SEC filings: 200 entities scanned, 155
(77.5%) with at least one filing in the lookback window.
The integration is wired and producing per-entity signals, but the current
signal quality is low and is reported here honestly rather than as a validated
result. The top-scoring “entities” are generic terms — Iran, President, China —
matched to unrelated registrants (e.g. Iran → a net-lease REIT, President →
a SPAC), with form_type and filing description frequently empty and
material_filings at 0. The per-entity score is therefore driven by raw filing
volume, not filing materiality or genuine entity identity. The bottleneck
is entity resolution: PSI entities are extracted from news narrative (including
geopolitical and role nouns) and do not map cleanly to SEC registrant names or
CIKs.
Update (2026-06-08): a high-precision entity→CIK resolver now exists
(lambdas/psi_compute/entity_resolver.py; 0.47% false-positive rate, ~1,000
entities mapped), so EDGAR signals can be keyed on resolved CIKs rather than the
substring match. The resolver also exposed a structural ceiling: only ~10–25%
of PSI entities resolve to any CIK — the universe is dominated by
non-company nouns (people, places, geopolitics). This is consistent with the
entity triage (only ~25% of the graph is market-linkable) and reframes EDGAR/
deal data as usable for a minority enrichment slice, not a universe-wide
forecasting signal.
Aligned with psi-poc-scorecard.md and the methodology page “Studies &
Validation” section. All numbers cite dated S3 artifacts; clean-corpus floor
≥ 2026-04-27.

Claims the artifacts support:

| Claim | Evidence |
|---|---|
| GBM beats GNN on forward link holdout | Phase-0 panel: GBM AUPRC 0.717 vs Node2Vec 0.580 vs GNN 0.589 |
| Production link ensemble is predictively Node2Vec-only | Dirichlet weight 1.00 Node2Vec; LOO ablation 99.3% (2026-06-08) |
| PSI composite is a forward-attention contra-indicator (internal) | Attention Diagnostic Suite: permutation PASS, r ≈ −0.91, p=0 (2026-06-21) |
| Signal decay is measurable | Decay gate PASS; half-life 11 observations |
| GNN promotion gate works (audit trail) | 2026-06-21 retrain SKIPPED at holdout 0.4205 ≤ prod 0.4293 |
| Fit study prod healthy | fit_study.json: fit-gap 0.0483, status HEALTHY (87 history days; weekly schedule enabled) |

Claims the artifacts do not support:

| Claim | Result |
|---|---|
| PSI leads deal events (external) | LF-1 attempt #2 closed FAIL: 10.8% vs spike 49.4% |
| Attention Diagnostic Suite “passes” | 2 / 5 gates — diagnostic only, not a prod block |
| GNN attention beats GBM on forward holdout | GNN AUPRC 0.589 < GBM 0.717 |
| Promote latest GNN retrain on val AUC | GPU val 0.7217 → holdout 0.4286 < prod 0.4293 |
| Eight-model ensemble beats GBM on forward edges | Reproducibility construct; Phase-0 ranks GBM first |
| Internal event study as anticipatory null | 0 / 232 ELEVATED — not interpretable (coverage) |
Metric separation (do not conflate): fit_study holdout AUC → overfit bands only;
Phase-0 AUPRC → link-structure / GBM-first policy; GNN pinned holdout AUC →
promote.py only; Attention Diagnostic Suite → signal characterization; LF-1 →
deal-forecast hypothesis only.
The per-model leave-one-out ablation is reported in §6.2. The deployed ensemble is the eight reproducible models (four heuristics + Logistic + GCN + Node2Vec + VGAE); TransE was removed in A3 for non-reproducibility and is not ablated here. The headline result: Node2Vec carries ~99% of the contribution, the other seven models are near-redundant — see §6.2 for the full table.
Not yet computed. Unlike the model ablation (§6.2), which is produced daily by the link-prediction pipeline, there is currently no artifact that decomposes the composite PSI into its five constituent signals — ACI (Hawkes self-excitation), NED (embedding drift), GSS (spectral shift), SMD (sentiment-momentum divergence), and SC (source concentration) — and measures each one’s marginal contribution. The intended methodology is:
| Signal | Variance Explained | Correlation w/ PSI | IC (rank) |
|---|---|---|---|
| ACI (Hawkes) | pending | pending | pending |
| NED (Embedding Drift) | pending | pending | pending |
| GSS (Spectral Shift) | pending | pending | pending |
| SMD (Sent-Mom Divergence) | pending | pending | pending |
| SC (Source Concentration) | pending | pending | pending |
To fill this honestly requires a per-component analysis run over the daily PSI
history that (a) computes the Spearman rank correlation (rank IC) of each component
with forward 7-day mention growth, (b) computes each component’s correlation with the
composite, and (c) attributes variance via a PCA or leave-one-signal-out on the
James-Stein composite. That computation does not exist yet; the numbers are
deliberately left as pending rather than estimated. This is the natural
companion to the signal-decomposition work and is the most defensible next
addition to the validation suite.
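For step (a), rank IC is the Spearman correlation, which is the Pearson correlation of ranks with ties averaged. This is a minimal sketch assuming one component series aligned day-by-day with forward 7-day mention growth. Step (b) is a plain `np.corrcoef` against the composite. Function names are hypothetical, and no values for the pending table are implied.

```python
import numpy as np


def average_ranks(a):
    """1-based ranks with ties assigned their average rank."""
    a = np.asarray(a, dtype=float)
    order = np.argsort(a, kind="mergesort")
    sorted_a = a[order]
    ranks = np.empty(len(a))
    i = 0
    while i < len(a):
        j = i
        while j + 1 < len(a) and sorted_a[j + 1] == sorted_a[i]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2.0 + 1.0  # mean of positions i..j, 1-based
        i = j + 1
    return ranks


def rank_ic(component, forward_growth):
    """Spearman rank correlation between a signal and forward 7-day mention growth."""
    return float(np.corrcoef(average_ranks(component), average_ranks(forward_growth))[0, 1])
```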
Models. The link-prediction ensemble is, predictively, a one-model system: the §6.2 leave-one-out ablation shows Node2Vec carrying ~99% of the ensemble AUC, with the remaining seven models near-redundant (each LOO drop ~0.085–0.095 because the optimizer re-fits the survivors). The honest implication is that the “eight-model ensemble” is a robustness and reproducibility claim, not a performance one — seven of the eight could be dropped from production scoring with negligible AUC loss, though they are cheap to compute and retaining them guards against Node2Vec degrading on a future graph. The fit_study diagnostic (§6.1) tells a different story — there Logistic Regression is the strongest single model — which underlines that “most critical model” is eval-set dependent and should not be over-read from either run alone.
Signals. The five-signal ablation (§7.2) is not yet computed, so no claim is made here about which of ACI/NED/GSS/SMD/SC is most critical. What §6.3 does establish is a property of the composite: it is a statistically significant forward-attention contra-indicator (permutation p≈0, factor-regression psi_beta -0.65), not a continuation signal. Determining which constituent signals drive that behavior is exactly what §7.2 is meant to answer and remains open.
Net. The defensible, evidence-backed claims today are narrow: (1) a reproducible link-prediction layer whose predictive content is concentrated in Node2Vec, and (2) a composite PSI that is a significant negative predictor of forward attention. Broader claims about per-signal contribution and event lead-time are not yet supported by the artifacts and are marked pending rather than asserted.
Scope note (2026-06-21). This section argued a forecasting moat. LF-1 attempt #2 (§6.4) closed FAIL: PSI does not lead deal events better than a naive baseline. Phase-0 (§6.1b) ranks GBM above GNN and Node2Vec on forward holdout; GNN promotion remains gated at holdout 0.4293 with 2026-06-21 candidates SKIPPED. The defensible moat is the earned graph + multi-signal enrichment asset (§8.1, §8.2), not event prediction. §8.3 timeline items on deal probability are contradicted by evidence — read as research agenda only.
We are not aware of a published system that combines Hawkes processes, spectral graph analysis, BOCPD, and Kalman smoothing for healthcare intelligence. Individual components are well-studied; the specific combination and domain application are novel. The closest related work in financial signal processing (e.g., Bloomberg’s event-driven analytics) operates on price and volume data rather than narrative structure.
The entity co-occurrence graph required months of continuous ingestion from ~140 feeds. With a 400-node backend graph (12,791 edges as of 2026-06-11; 150-entity display subset) and daily temporal history, this graph cannot be replicated from static data. The temporal dimension — how edges form, strengthen, weaken, and dissolve over time — represents information that can only be accumulated through sustained operation.
System capability increases non-linearly with data accumulation:
| Milestone | Capability Unlocked |
|---|---|
| Day 1 | Signal computation only |
| Day 7 | Lead-lag detection activates |
| Day 30 | Validation gates fire, link prediction gets temporal features |
| Day 90 | Event study reaches statistical significance — but see §6.4: the externally-anchored event study already ran and PSI failed to beat a naive baseline |
| Day 180 | Deal probability model viable — first attempt (LF-1) already refuted; not on track |
| Day 365 | Seasonal pattern detection, annual cycle modeling |
Each additional day of operation adds to the baseline rate estimates (Hawkes), the embedding history (NED), the spectral trajectory (GSS), and the regression training data (SMD), creating a widening moat against systems that start later.
Eight models from three methodological families (classical heuristics, ML, deep graph) provide reproducibility and diagnostic redundancy. Leave-one-out ablation (§6.2) shows Node2Vec dominates (~99% contribution) in production scoring; Phase-0 forward holdout (§6.1b) ranks GBM first. Production policy is GBM / Node2Vec-first — not “eight models contribute equally.” Model agreement tiers (6–8 models agree) remain a qualitative confidence signal for predicted pairs, but should not be read as evidence that all eight models carry predictive content on forward holdout.
PSI assembles statistical signal processing, graph learning, and NLP into a single operational pipeline that runs daily at production scale (~400-entity graph, 1,900+ scored entities, clean single-regime history since 2026-04-27). This draft reports what the artifacts currently support — and, deliberately, no more.
What is established. Evidence-backed claims as of 2026-06-21:
fit_study.json fit-gap 0.0483; weekly Fargate schedule enabled (psi-fit-study:13, image f1a03d6).

What is not yet established. Per-signal ablation (§7.2) is uncomputed; the internal event study (§6.4) is coverage-limited; EDGAR integration (§6.5) is entity-resolution-bound. The eight-model ensemble does not beat GBM on forward holdout. The 2026-06-07 prod fit_study OVERFIT band (0.078) was a single monthly snapshot; the 2026-06-21 prod run (0.048) and the staging run (0.0494) are both HEALTHY.
The honest position is that PSI is a working, defensible enrichment system (a reproducible relationship-prediction layer plus a five-signal characterization of an entity’s information structure) but not an event forecaster. Its one internally significant relationship (the forward-attention contra-indicator, §6.3) is endogenous and therefore not external validation; the single test with a clean external anchor (§6.4, SEC-filing-dated deal events) refuted the forecasting hypothesis (PSI 10.8% vs naive baseline 49.4%). The Attention Diagnostic Suite (formerly framed as “validation gates”) characterizes signal structure and is not a deployment go/no-go. The value of this evaluation is that it surfaces the distinction, separating a real enrichment asset from an unsupported forecasting claim rather than obscuring the difference.
Future work, in priority order: (1) lean into the validated direction by shipping the graph and signals as enrichment/retrieval (the “why an entity matters” use case). The first such product outcome shipped 2026-06-11: the mention-spike baseline itself (the rule that beat PSI in §6.4) is now a watchlist alert trigger, with its pre-registered constants unchanged. (2) Key EDGAR/deal data on the new entity→CIK resolver to strengthen that enrichment layer. (3) Compute the §7.2 per-signal ablation. (4) If forecasting is revisited, do it only against a fresh external anchor (e.g. regulatory/FDA events) under the same pre-registration discipline, not by scaling the GNN, which Phase 0 showed performs at parity with GBM. The previously planned “deal-probability model” is removed from the roadmap: its premise was tested early (LF-1) and did not hold.
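A mention-spike rule of the kind described as the watchlist trigger fits in a few lines. This is a minimal sketch that assumes the rule fires when today's mention count exceeds the trailing mean by k trailing standard deviations, subject to a minimum-count floor. The window, k, and floor values are hypothetical stand-ins, not the pre-registered constants, which are not restated in this section. The function names are also illustrative.

```python
from statistics import mean, pstdev
from typing import Dict, List, Sequence, Set

# Hypothetical stand-ins; the pre-registered constants are not restated here.
WINDOW_DAYS = 14
K_SIGMA = 3.0
MIN_COUNT = 5

def mention_spike(history: Sequence[int], today: int, window: int = WINDOW_DAYS,
                  k: float = K_SIGMA, min_count: int = MIN_COUNT) -> bool:
    """True when today's mentions exceed the trailing mean by k standard deviations."""
    trailing = list(history)[-window:]
    if len(trailing) < window or today < min_count:
        return False  # too little history, or too few mentions to be meaningful
    mu = mean(trailing)
    sigma = max(pstdev(trailing), 1.0)  # floor so a flat series needs a real jump
    return today > mu + k * sigma

def watchlist_alerts(counts: Dict[str, List[int]], watchlist: Set[str]) -> List[str]:
    """Watchlisted entities whose latest daily count (last element) is a spike."""
    return sorted(
        e for e in watchlist
        if len(counts.get(e, [])) >= 2 and mention_spike(counts[e][:-1], counts[e][-1])
    )
```

Freezing the constants, as the roadmap item states, is what keeps the alert tied to the rule that was actually evaluated in §6.4 rather than a retuned variant.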
All paths relative to s3://plocamium-content/models/psi/ unless noted.
- fit_study.json: monthly link-prediction CV diagnostic (prod key; staging: fit_study_STAGING.json)
- phase0_baseline_panel.json: Phase-0 forward holdout panel (GBM / Node2Vec / GNN arms)
- ablation_study.json: leave-one-out ensemble ablation (2026-06-08)
- validation/latest.json: Attention Diagnostic Suite weekly output
- lf1_anchored_test.json: LF-1 pre-registered external deal anchor (attempt #2, closed FAIL)
- event_study.json: internal event-study coverage artifact
- edgar_signals.json: SEC EDGAR cross-reference signals
- psi-gnn/latest.json: GNN attention promotion manifest (holdout gate 0.4293)

Four AWS Lambda functions (arm64, Python 3.12):
| Lambda | Purpose | Schedule | Timeout |
|---|---|---|---|
| psi-compute | 5 signals, PCA, BOCPD, Kalman, James-Stein, EDGAR boost | Daily 13:30 UTC | 900 s |
| card-compute | Entity intelligence cards with 5 metrics | Async (post-compute) | 900 s |
| psi-validate | Attention Diagnostic Suite (5 probes) | Weekly | 900 s |
| psi-enrichment | Link prediction, EDGAR scan, velocity, event study, lead-lag, peer-relative, interpretations | Async (post-compute) | 900 s |
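Two of the four functions run "Async (post-compute)". One plausible way to wire that is sketched here as an assumption, not as the deployed mechanism: psi-compute finishes by invoking the downstream functions through boto3's Lambda client with `InvocationType="Event"`, which queues each invocation and returns immediately. The handler shape, the `run_date` payload field, and the `fan_out` helper are hypothetical. The schedule strings are EventBridge expressions matching the table. The weekday for psi-validate is not stated, so `rate(7 days)` stands in.

```python
import json
from typing import Any, Dict, Iterable

# EventBridge schedule expressions matching the table (weekly day is unspecified).
SCHEDULES = {
    "psi-compute": "cron(30 13 * * ? *)",  # daily 13:30 UTC
    "psi-validate": "rate(7 days)",
}
DOWNSTREAM = ("card-compute", "psi-enrichment")  # the "Async (post-compute)" rows

def fan_out(lambda_client: Any, run_date: str,
            targets: Iterable[str] = DOWNSTREAM) -> Dict[str, int]:
    """Fire-and-forget invocation of each downstream Lambda; returns status codes."""
    payload = json.dumps({"run_date": run_date}).encode("utf-8")
    statuses: Dict[str, int] = {}
    for name in targets:
        resp = lambda_client.invoke(FunctionName=name, InvocationType="Event",
                                    Payload=payload)
        statuses[name] = resp["StatusCode"]  # 202 means queued for async execution
    return statuses

def handler(event: Dict[str, Any], context: Any, lambda_client: Any = None) -> Dict[str, Any]:
    """Hypothetical tail of psi-compute: compute signals, then trigger consumers."""
    if lambda_client is None:
        import boto3  # imported lazily so the module loads without AWS dependencies
        lambda_client = boto3.client("lambda")
    run_date = event.get("run_date", "unknown")
    # ... daily signal computation would happen here ...
    return {"fanned_out": fan_out(lambda_client, run_date)}
```

Asynchronous invocation keeps downstream work out of psi-compute's 900 s budget. The trade-off is that downstream failures surface only through Lambda's asynchronous retry and failure-destination settings, not in psi-compute's own response.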
All Lambdas share a common layer with NumPy, SciPy, scikit-learn, and graph libraries. State is persisted in S3 (signal history) and DynamoDB (entity metadata, scores).
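The S3/DynamoDB split can be sketched as one write path per entity-day. This is a minimal illustration in which the clients are passed in so it can be exercised without AWS. The key layout `signals/{entity}/{date}.json`, the item shape, and the function name are hypothetical. One real detail is worth showing: the boto3 DynamoDB resource rejects Python floats, so numeric fields are converted to `Decimal` before `put_item`.

```python
import json
from decimal import Decimal
from typing import Any, Dict

def to_dynamo(value: Any) -> Any:
    """Recursively convert floats to Decimal (boto3's DynamoDB resource rejects float)."""
    if isinstance(value, float):
        return Decimal(str(value))
    if isinstance(value, dict):
        return {k: to_dynamo(v) for k, v in value.items()}
    if isinstance(value, list):
        return [to_dynamo(v) for v in value]
    return value

def persist_entity_day(s3_client: Any, table: Any, bucket: str, entity: str,
                       date: str, signals: Dict[str, float], psi_score: float) -> str:
    """Append the day's signals to S3 history and upsert the latest score in DynamoDB."""
    key = f"signals/{entity}/{date}.json"  # hypothetical key layout
    body = json.dumps({"entity": entity, "date": date, "signals": signals})
    s3_client.put_object(Bucket=bucket, Key=key, Body=body.encode("utf-8"),
                         ContentType="application/json")
    table.put_item(Item=to_dynamo({"entity": entity, "date": date,
                                   "psi_score": psi_score, "signals": signals}))
    return key
```

Converting through `str` keeps the stored decimal equal to the printed float (for example `-0.7`), avoiding the long binary expansion that `Decimal(-0.7)` would produce.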
All code is available at github.com/jtannahill/plocamium-content-engine. The system comprises 16 Python modules totaling ~4,000 lines of signal processing, graph learning, and statistical analysis code, with 145+ unit tests.
Key modules:
- psi_signals.py — Five signal computations (ACI, NED, GSS, SMD, SC)
- psi_pipeline.py — Six-layer processing pipeline (PCA, BOCPD, IVW, Kalman, James-Stein, output)
- link_prediction.py — Eight-model reproducible ensemble with auto-weight optimization (leave-one-out ablation)
- psi_validation.py — Attention Diagnostic Suite (five statistical probes)
- psi_enrichment.py — EDGAR integration, event study, lead-lag analysis
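psi_pipeline.py names inverse-variance weighting (IVW) and James-Stein shrinkage among its layers. The sketch below illustrates what those two steps do in general; it is not the module's actual code. It fuses several noisy estimates with weights 1/σ², then shrinks each entity's fused score toward the cross-entity mean using a positive-part James-Stein factor. Treating inputs as independent and using one common noise variance for all entities are assumptions, and the function names are hypothetical.

```python
from statistics import fmean
from typing import Dict, Sequence, Tuple

def ivw(values: Sequence[float], variances: Sequence[float]) -> Tuple[float, float]:
    """Inverse-variance weighted mean and its variance (independent inputs assumed)."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    est = sum(w * x for w, x in zip(weights, values)) / total
    return est, 1.0 / total

def james_stein(estimates: Dict[str, float], noise_var: float) -> Dict[str, float]:
    """Positive-part James-Stein shrinkage toward the grand mean (common noise variance)."""
    p = len(estimates)
    if p < 4:
        return dict(estimates)  # shrinking toward an estimated mean needs p >= 4
    grand = fmean(estimates.values())
    spread = sum((y - grand) ** 2 for y in estimates.values())
    factor = max(0.0, 1.0 - (p - 3) * noise_var / spread) if spread > 0 else 0.0
    return {e: grand + factor * (y - grand) for e, y in estimates.items()}
```

The shrinkage factor approaches 1 when entities differ far more than the noise level would explain. It falls to 0 when the observed spread is consistent with noise alone, in which case every entity is pulled to the grand mean.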
The card-compute Lambda has produced its first batches; cards_latest.json
(2026-06-08) holds 87 live entity cards. A real card, rendered from production
data, follows. The original draft showed a hypothetical “UnitedHealth Group”
card; it is replaced here with an actual one to avoid presenting fabricated
numbers as output.
┌─────────────────────────────────────────┐
│ ENTITY: FDA │
│ PSI Score: -0.70 (NORMAL) │
│ Momentum: 60.4 Degree: 46 │
│ Network position: 0.159 │
├─────────────────────────────────────────┤
│ ACI (Attention cascade): -0.79 │
│ NED (Narrative drift): 0.00 │
│ GSS (Spectral shift): 464.70 │
│ SMD (Sent-Mom divergence): 0.00 │
│ SC (Source concentration): 0.91 │
├─────────────────────────────────────────┤
│ Dominant signal: spectral_shift │
└─────────────────────────────────────────┘
Two honest observations from the live cards, both consistent with limitations already noted:
- GSS dominates: its raw magnitude (464.70 on the FDA card above) dwarfs the other four signals, so spectral_shift is the dominant_signal for nearly every card. It reads as a shared, un-normalized global spectral metric rather than an entity-specific contribution. The per-component normalization needed to make this column comparable is the same work blocking the §7.2 signal ablation, and the card interpretation text is intentionally omitted until it lands.
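The dominant_signal issue is a scale problem: an argmax over raw magnitudes rewards whichever signal has the largest units. The sketch below contrasts that raw rule with a per-component z-score rule. It assumes that dominant_signal is currently the argmax of absolute raw values, which is consistent with the FDA card, though the card code is not shown here. It also assumes that per-signal cross-entity means and standard deviations are available. The function names are hypothetical.

```python
from typing import Dict

def dominant_raw(signals: Dict[str, float]) -> str:
    """Assumed current rule: the largest absolute raw value wins."""
    return max(signals, key=lambda s: abs(signals[s]))

def dominant_normalized(signals: Dict[str, float],
                        means: Dict[str, float], stds: Dict[str, float]) -> str:
    """Largest absolute z-score relative to each signal's own cross-entity spread."""
    def z(s: str) -> float:
        # A component with zero spread across entities carries no entity-specific info.
        return (signals[s] - means[s]) / stds[s] if stds[s] > 0 else 0.0
    return max(signals, key=lambda s: abs(z(s)))

# FDA card values from the live card above.
fda = {"ACI": -0.79, "NED": 0.00, "GSS": 464.70, "SMD": 0.00, "SC": 0.91}
print(dominant_raw(fda))  # GSS
```

If GSS really is a shared global value, its cross-entity standard deviation is near zero, and the z-score rule stops selecting it by construction. Which signal then dominates depends entirely on the real population statistics, which this sketch does not supply.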