CuratAI · AI-Native Multimodal Data Curation Platform

An on-premise multimodal data curation platform for clinical research: six stages, end to end

Retrieval, ingest (unstructured → structured), de-identification across all five PHI vectors, AI-assisted annotation, a custom AI plugin slot, and cross-institution sharing. Inside the institutions that already do the work.


Stage 1 · Retrieval

Multi-source. Multi-modal.

Pull from PACS, EHR, and pathology systems. Batch via REDCap, Excel, or CSV exports — whatever shape your research data is already in. One pipeline for every modality you work with.

Multi-source retrieval

PACS, EHR, pathology. Batch pulls driven by REDCap, Excel, or CSV.

Multi-modal coverage

DICOM, NIfTI, WSI, PDF, clinical notes. One pipeline.

Connectors and queues

Configure once. Studies arrive automatically as they're scanned or charted.

Multimodal data sources flowing into CuratAI

Stage 2 · Ingest

Unstructured → Structured — local LLM

Pulls structured fields out of free-text clinical notes, reports, images, and PDFs. Adapts to each target registry's structure. A local 7B model on a 12 GB consumer GPU is enough; accuracy depends on the registry.

Free-text clinical note → local LLM (7B parameters, 12 GB GPU, inside the firewall) → structured registry fields with per-field confidence scores. The LLM runs inside the institution; no PHI leaves the firewall.

Validation — STAR Neurovascular Registry

The Stroke Thrombectomy and Aneurysm Registry (founded at MUSC, 85+ sites, 15,000+ patients) requires structured abstraction across 341 fields per case from the H&P, procedure note, and discharge summary. We validated on 29 patients, 170 evaluable fields, against the site's REDCap ground truth.

Field type	Fields (n)	CuratAI 7B accuracy	Cloud baseline accuracy
Yes / No	49	91.6%	92.3%
Multiple-choice	23	75.6%	79.6%
Cascaded	94	90.1%	93.2%
Free text	4	53.4%	57.1%
Overall	170	87.7%	90.9%

Confidence-triaged workflow

86% of fields land at confidence 1 with ~89% accuracy — auto-accept.
14% land at confidence 2 or 3 — surfaced for human review.
End-to-end: 26 min per patient across 341 fields.
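
The triage rule above can be sketched in a few lines. This is a minimal illustration, not CuratAI's implementation: the `ExtractedField` type, the `triage` function, and the field names are hypothetical, and it assumes confidence 1 is the most confident level, as the auto-accept rule implies.

```python
from dataclasses import dataclass

# Assumption: 1 is the most confident level; 2 and 3 go to a human.
AUTO_ACCEPT_LEVEL = 1
VALID_LEVELS = (1, 2, 3)


@dataclass(frozen=True)
class ExtractedField:
    name: str        # registry field name (hypothetical examples below)
    value: str       # value proposed by the local LLM
    confidence: int  # 1, 2, or 3


def triage(fields):
    """Split extracted fields into (auto_accepted, needs_review)."""
    accepted, review = [], []
    for field in fields:
        if field.confidence not in VALID_LEVELS:
            raise ValueError(f"unexpected confidence for {field.name}")
        if field.confidence == AUTO_ACCEPT_LEVEL:
            accepted.append(field)
        else:
            review.append(field)
    return accepted, review


fields = [
    ExtractedField("nihss_admission", "14", 1),
    ExtractedField("occlusion_site", "M1", 2),
    ExtractedField("complication_notes", "none documented", 3),
]
accepted, review = triage(fields)
print(len(accepted), "auto-accepted;", len(review), "queued for review")
```

In this sketch only the review queue needs a human; the auto-accepted fields can be written to the registry record directly.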

Local 7B reaches 87.7% overall on a complex neurovascular registry — within 3.2 points of the cloud baseline. Registry-agnostic prompts; the same engine ports to GWTG-Stroke, oncology, and custom institutional registries.
Automated Population of the STAR Neurovascular Registry Using a Local Language Model. Submitted to Neurosurgery, 2026.

Stage 3 · De-identify

PHI lives in 5 places. We cover all of them.

Headers, dates, burned-in pixel text, facial features, and clinical reports. Each gets a dedicated mechanism, because naive blanket blackouts destroy research utility.

Headers: Rule-based DICOM de-id with multi-layer PHI checks (known, residual, user-defined fields).
Dates: Patient-specific random shift. Identity gone; longitudinal structure kept.
Burned-in pixel text: OCR plus local-LLM reasoning. Retains research-critical text; removes PHI only.
Facial features: Mask-based defacing that preserves intracranial anatomy.
Clinical reports: On-prem LLM strips PHI from notes, reports, and PDFs while preserving clinical meaning. Validated on i2b2 2014 (514 discharge summaries, 2,883 patient/clinician names).
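
One way to picture the three header layers (known, residual, user-defined) is the sketch below. It works on a plain dictionary keyed by DICOM keyword rather than a real DICOM file, the keyword list is a small illustrative subset, and the function name `deidentify_header` is hypothetical; CuratAI's actual rules are not described on this page.

```python
import re

# Assumption: an illustrative subset of DICOM keywords that carry PHI.
KNOWN_PHI_KEYWORDS = frozenset({
    "PatientName", "PatientID", "PatientBirthDate",
    "ReferringPhysicianName", "InstitutionName",
})
MIN_TOKEN_LENGTH = 3  # ignore very short tokens to limit over-redaction


def deidentify_header(header, user_defined=frozenset()):
    """Return a copy of `header` with known, residual, and user-defined PHI removed."""
    blocked = KNOWN_PHI_KEYWORDS | set(user_defined)

    # Collect identifying tokens from known PHI fields, e.g. "DOE^JANE" -> DOE, JANE.
    tokens = set()
    for keyword in KNOWN_PHI_KEYWORDS:
        for part in re.split(r"[\^\s,]+", str(header.get(keyword, ""))):
            if len(part) >= MIN_TOKEN_LENGTH:
                tokens.add(part)

    # One case-insensitive pattern, longest tokens first, applied in a single pass.
    pattern = None
    if tokens:
        ordered = sorted(tokens, key=len, reverse=True)
        pattern = re.compile("|".join(re.escape(t) for t in ordered), re.IGNORECASE)

    cleaned = {}
    for keyword, value in header.items():
        if keyword in blocked:
            continue  # layers 1 and 3: drop known and user-defined fields
        if pattern is not None and isinstance(value, str):
            value = pattern.sub("[REDACTED]", value)  # layer 2: residual PHI
        cleaned[keyword] = value
    return cleaned


header = {
    "PatientName": "DOE^JANE",
    "PatientID": "MRN12345",
    "Modality": "CT",
    "StudyDescription": "CT head for Jane Doe",
    "StationName": "SCANNER-7",
}
print(deidentify_header(header, user_defined={"StationName"}))
```

The residual layer is the one that catches a name copied into a free-text field such as the study description, which a fixed tag list alone would miss.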

Facial features

Mask-based defacing removes facial geometry while leaving brain, skull base, and surrounding anatomy untouched. Penn benchmarked four defacing algorithms — CuratAI's delivered the best privacy–utility trade-off.

Original ultrasound and CT panels with burned-in text — PHI removed while measurements and modality markers are preserved

Burned-in pixel text

OCR plus local-LLM reasoning classifies each detected region. PHI is replaced; measurements, scale bars, side markers, and modality annotations are preserved — so the image stays readable for research.

Patient-specific date shifting — absolute dates removed, intervals between studies preserved

Date-interval retention

A patient-specific random offset removes the absolute date but preserves the intervals — baseline to follow-up, procedure to discharge, treatment to event — exactly.
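
The interval-preserving property can be shown with a short sketch. It assumes the offset is derived from a keyed hash of the patient identifier, so the same patient always gets the same shift; the page says only that the offset is random and patient-specific, so a stored random offset kept in the linking file would serve equally well. The function names, the key, and the 365-day bound are hypothetical.

```python
import hashlib
import hmac
from datetime import date, timedelta


def patient_offset(patient_id, secret_key, max_days=365):
    """Derive a stable, non-zero offset of 1..max_days days, forward or backward."""
    digest = hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256).digest()
    n = int.from_bytes(digest[:8], "big")
    magnitude = 1 + (n >> 1) % max_days  # never zero, so no date stays unshifted
    sign = 1 if n & 1 else -1
    return timedelta(days=sign * magnitude)


def shift_dates(dates, offset):
    """Apply one offset to every date belonging to the same patient."""
    return [d + offset for d in dates]


key = b"site-local-secret"  # hypothetical; would stay inside the firewall
visits = [date(2024, 1, 10), date(2024, 4, 9), date(2025, 1, 10)]
shifted = shift_dates(visits, patient_offset("patient-001", key))

# Differences between consecutive visits are unchanged by the shift.
original_gaps = [b - a for a, b in zip(visits, visits[1:])]
shifted_gaps = [b - a for a, b in zip(shifted, shifted[1:])]
print(original_gaps == shifted_gaps)
```

Because every date for a patient moves by the same number of days, any difference between two dates is identical before and after shifting.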

Stage 4 · Annotate

AI-assisted annotation

The hidden bottleneck in every research pipeline is the graduate student clicking through 12,000 scans. Prediction and interpolation turn a 6-month clicking project into a 3-week supervision project.

Web-based, no install

Runs in the browser. No per-workstation install, no IT ticket to onboard a new annotator.

Prediction + interpolation

Annotator draws on representative slices; the model predicts and fills in the rest of the 3-D segmentation. Reviewer corrects what's needed.
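
To illustrate the interpolation half of that workflow, here is a minimal sketch that fills the slices between two annotated contours. It assumes both contours have the same number of vertices in matching order; the function name is hypothetical, and plain linear interpolation stands in for the model's prediction, which this page does not specify.

```python
def interpolate_contours(z0, contour0, z1, contour1):
    """Linearly interpolate (x, y) vertices for every slice strictly between z0 and z1.

    Returns {slice_index: [(x, y), ...]} for the unannotated slices only.
    """
    if z1 <= z0:
        raise ValueError("z1 must be greater than z0")
    if len(contour0) != len(contour1):
        raise ValueError("contours must have the same number of vertices")

    filled = {}
    for z in range(z0 + 1, z1):
        t = (z - z0) / (z1 - z0)  # 0 at the first slice, 1 at the second
        filled[z] = [
            ((1 - t) * x0 + t * x1, (1 - t) * y0 + t * y1)
            for (x0, y0), (x1, y1) in zip(contour0, contour1)
        ]
    return filled


# Annotator draws a small square on slice 10 and a larger one on slice 14.
small = [(0, 0), (10, 0), (10, 10), (0, 10)]
large = [(0, 0), (20, 0), (20, 20), (0, 20)]
for z, contour in interpolate_contours(10, small, 14, large).items():
    print(z, contour)
```

The reviewer then only adjusts the proposed contours on slices 11 to 13 instead of drawing each one from scratch.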

Multi-user roles

PI, annotator, and reviewer roles with shared queues and live cursors. Reviewer approval baked in.

Fully traceable

Every edit logged with user, timestamp, and reversible diff. Inter-observer variability is visible in the viewer.
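
A logged, reversible edit can be modeled as a record that stores both the old and the new label for each changed voxel. The sketch below is an assumption about how such a log might look, not CuratAI's data model: the `EditRecord` and `AnnotationLog` names are hypothetical, and unlabeled voxels are assumed to read as 0.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class EditRecord:
    user: str
    timestamp: datetime
    changes: tuple  # ((voxel_index, old_label, new_label), ...)


class AnnotationLog:
    """Label map plus an append-only history of edits."""

    BACKGROUND = 0  # assumption: unlabeled voxels read as 0

    def __init__(self):
        self.labels = {}
        self.history = []

    def apply(self, user, new_labels):
        """Apply {voxel_index: label} and record who changed what, and when."""
        changes = tuple(
            (idx, self.labels.get(idx, self.BACKGROUND), new)
            for idx, new in new_labels.items()
        )
        for idx, _, new in changes:
            self.labels[idx] = new
        record = EditRecord(user, datetime.now(timezone.utc), changes)
        self.history.append(record)
        return record

    def revert(self, record, user):
        """Restore the pre-edit labels by logging the inverse edit."""
        return self.apply(user, {idx: old for idx, old, _ in record.changes})


log = AnnotationLog()
first = log.apply("annotator_a", {(10, 20, 5): 1, (10, 21, 5): 1})
log.revert(first, "reviewer_b")
print(len(log.history), "edits logged")
```

Reverting appends a new record rather than deleting the old one, so the trail of who changed what stays complete.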

Multi-planar + 3-D

Linked axial, sagittal, coronal, and MIP views, plus 3-D rendering. DICOM-RT, NIfTI, and DICOM-SEG export.

CuratAI annotation viewer with AI-assisted annotation and multi-user assignment

Stage 5 · AI Plugins

Run your own AI models on your de-identified cohort

CuratAI's plugin slot is what makes it a platform: your research question, our stack, your model — running on your hardware, inside your firewall, on data that stays inside the institution.

Drop a folder, register a model

Any language — Python, C++, R, MATLAB, Julia. CuratAI handles the data plumbing, the cohort selection, the I/O contracts.

Runs on your hardware

GPU on the same workstation, on a local Linux server, or on your institutional compute. No cloud, no telemetry, no PHI off-site.

Same de-identified cohort

Your model trains and infers on data that's already passed pixel-level PHI removal, audit logging, and IRB-scoped project access.

Deployment

Inside the institutions that already do the work

Inputs and processing stay inside the institutional firewall. Only de-identified shared projects leave it, and only by explicit user action.

CuratAI runs entirely inside your institution's network. The full stack — application, database, on-premise large language model, custom-model runner — installs on a Windows workstation or Linux server you control. No cloud component, no telemetry, no outbound connection that carries patient data. The linking file required for re-identification stays inside your firewall.

Audit logs remain local to your installation.
Re-identification linking files stay on your storage; CuratAI never transmits them.
Exports are explicit user actions.
Cross-institution sharing is gated by per-project safeguards.

HIPAA-aligned
Audit-ready
IRB-compatible
No cloud, no external BAA

And then — the same stack

Research today, FDA-cleared product tomorrow

A research group can de-identify on Monday, annotate by Friday, train over the following weeks, publish within months, and ship an FDA-cleared product on the same stack. Same company, same deployment footprint, same team.

See the clinical AI products built on CuratAI's stack: INTContour (FDA-cleared auto-contouring), OncoAI Suite, QuantBrain, and INTDose.

Curate your next cohort with CuratAI

Request a demo