Oct 21, 2025

Release notes: AI

Report assistant 3.0

Introduction


What is Report Assistant?

Report Assistant is an advanced AI tool designed to support reading physicians in creating, refining, and managing medical imaging reports. It enhances workflow efficiency and ensures high-quality report content while preserving physician oversight.


Core Purpose:

To assist, not replace, the reading physician in:

Current Release - Release 3.0

Generating specific sections of the report such as Impression and Recommendations. 

Future Release - Release 3.1

Report corrector and prompt insertion.

Future Release - Release 3.2

Provide clinical guidance for breast reporting.

Future Release - Release 3.3

Personalizing reports to match the physician's individual style and preference. Supporting clear, professional, and accurate reporting.


Report Assistant 3.0 Sub-Agents
Report Impression 1.0
Report Recommendation 1.0

All outputs are reviewable suggestions — the reading physician retains full control.


When to Use It

After completing the Findings section.

To help generate clear summary sections.


What It Is Not

Not a diagnostic tool — it does not analyze images. 

Not a full report generator. 

Not autonomous — all outputs require physician review. 

Not a replacement for the reading physician’s judgment and editing.


User Guide


Report Impression 1.0


Gap

Reading physicians can face inconsistency and variability in impression statements due to time constraints, fatigue, or differences in dictation style.

Solution

The Impression Agent converts report findings into consistent, concise, and clinically accurate list of impression statements.


User Flow:


Login and access the radiology report.



Select the "Impress" button.



Consistent and clear list of impressions are generated in the IMPRESSION section.​



Report Recommendation 1.0


Gap

Gathering all the recommendations in the report can be time-consuming for the reading physicians.

Solution

The Recommendation Agent extracts any mention of recommendations from the report and generates a list of concise recommendations that is organized under a new section in the report.


User Flow:


Login and access the radiology report that contains recommendations.



Select the "Impress" button.



Consistent and clear list recommendations is generated in the RECOMMENDATIONS section.​



Sub-Agent Development


Report Assistant 3.0 uses the following two sub-agents:


Main Agent

Report Assistant 3.0

Sub-Agent

Report Impression 1.0

Sub-Agent

Report Recommendation 1.0


The main agent, Report Assistant, is responsible for orchestrating the information between its sub-agents and producing the final output.​ Communications between agents are done through secured POST API calls.


Flowchart


The logic is as follows:

Report Assistant takes a report as the input and makes parallel call to the two sub-agents to process it.​

Report Assistant then takes the output and parses the result in a JSON structured output with the list of Impressions and Recommendations. If there are no recommendations, an empty list is sent out.



Bench Testing & Performance Evaluation


Dataset Selection: Platform Data


We filtered reports from our database using the following conditions:
  1. Exams with a report.

  1. Exams with an Impression section​. 

  1. Gender: Male or female.

  1. Acceptable facilities and shortlisted physicians who generate high quality reports.

  1. Patient age: More or equal to 21.


330 reports were selected, focusing on fair distribution of gender, age group, facility, reading physician, report with recommendations, modality, and exam type. This was used as the final Bench Testing Set. The impressions and recommendations from these reports were clinically approved by the reading physicians and therefore could be used as the ground truth.


Dataset Distribution: Gender and Age Group
Distribution of Gender
Gender
Patient/Study/Reports

Female

182 (55%)

Male

148 (45%)

Total
330 (100%)
Distribution of Patient Age Group / Gender (Study)
Patient Age Group
Female
Male
Total

21-29

19 (10%)

6 (4%)

25 (8%)

30-39

23 (13%)

17 (11%)

40 (12%)

40-49

30 (16%)

16 (11%)

46 (14%)

50-59

24 (13%)

18 (12%)

42 (13%)

60-69

24 (13%)

37 (25%)

61 (18%)

70-79

31 (17%)

32 (22%)

63 (19%)

80+

31 (17%)

22 (15%)

53 (16%)

Total
182 (100%)
148 (100%)
330 (100%)


Dataset Distribution: Report with Recommendation, Facilities, and Gender​
Distribution of Report with Recommendation / Gender (Study)
Report with Recommendation
Female
Male
Total

FALSE

115 (63%)

98 (66%)

213 (65%)

TRUE

67 (37%)

50 (34%)

117 (35%)

Total
182 (100%)
148 (100%)
330 (100%)
Distribution of Facility Group / Gender (Study)
Facility Group
Female
Male
Total

Facility Group 1

103 (57%)

80 (54%)

183 (55%)

Facility Group 2

38 (21%)

36 (24%)

74 (22%)

Facility Group 3

28 (15%)

24 (16%)

52 (16%)

Facility Group 4

13 (7%)

8 (5%)

21 (6%)

Total
182 (100%)
148 (100%)
330 (100%)


Dataset Distribution: US State, Modality, and Gender​
Distribution of US State / Gender (Study)
US State
Female
Male
Total

Arizona

82 (45%)

67 (45%)

149 (45%)

California

48 (26%)

40 (27%)

88 (27%)

Nevada

31 (17%)

29 (20%)

60 (18%)

New York

17 (9%)

8 (5%)

25 (8%)

Florida

4 (2%)

4 (3%)

8 (2%)

Total
182 (100%)
148 (100%)
330 (100%)
Distribution of Modality / Gender (Study)
Modality
Female
Male
Total

CT

71 (39%)

71 (48%)

142 (43%)

XR

46 (25%)

27 (18%)

73 (22%)

MR

34 (19%)

36 (24%)

70 (21%)

US

31 (17%)

14 (9%)

45 (14%)

Total
182 (100%)
148 (100%)
330 (100%)


Dataset Distribution: Reading Physician, Exam Type, and Gender​
Distribution of US State / Gender (Study)
Reading Physician
Female
Male
Total

Reading physician 1

77 (42%)

61 (41%)

138 (42%)

Reading physician 2

38 (21%)

36 (24%)

74 (22%)

Reading physician 3

28 (15%)

24 (16%)

52 (16%)

Reading physician 4

26 (14%)

19 (13%)

45 (14%)

Reading physician 5

13 (7%)

8 (5%)

21 (6%)

Total
182 (100%)
148 (100%)
330 (100%)
Distribution of Exam Type / Gender (Study)
Exam Type
Female
Male
Total

Chest/Abdomen/Pelvis

68 (37%)

46 (31%)

114 (35%)

Chest

37 (20%)

37 (25%)

74 (22%)

Spine

40 (22%)

29 (20%)

69 (21%)

Extremity

16 (9%)

25 (17%)

41 (12%)

Head/Brain/Neck

21 (12%)

11 (7%)

32 (10%)

Total
182 (100%)
148 (100%)
330 (100%)


Description of Evaluation Metrics​


Deterministic Evaluations:


  1. Similarity

Cosine Similarity​

Jaccard Similarity​

Levenshtein Similarity

  1. Count of Items generated
  1. Confusion Matrix

True Positives (TP)​

True Negatives (TN)​

False Positives (FP)​

False Negatives (FN)​

Accuracy​

Precision​

Sensitivity​

Specificity​

F1 Score


LLM Grader Evaluations:
  1. Similarity Score​

  1. Hallucination Score​

  1. Correctness Score​


Cosine Similarity​

What it is: Cosine Similarity measures the angle between two text vectors, showing semantic closeness even if word order differs.


Formula



Example 1 (High Similarity)

GT: “No acute intracranial abnormality is seen.”​

AI: “There is no evidence of acute intracranial pathology.”​​

Cosine Similarity: 0.92


Example 2 (Low Similarity)

GT: “Mild disc degeneration at L4-L5.”​

AI: “Fracture seen in the right femur.”​

Cosine Similarity: 0.36


Jaccard Similarity​​

What it is: Jaccard Similarity measures the overlap of unique words between two strings.


Formula



Example 1 (Moderate Similarity)

GT: “No acute intracranial abnormality is seen.”​

AI: “There is no evidence of acute intracranial pathology.”​​

Jaccard Similarity: 0.53


Example 2 (Low Similarity)

GT: “Mild disc degeneration at L4-L5.”​

AI: “Fracture seen in the right femur.”​

Jaccard Similarity: 0.15

Levenshtein Similarity​

What it is: Measures edit distance, the number of character-level insertions, deletions, or substitutions needed to match two texts.


Formula



Example 1 (High Similarity)

GT: “No acute intracranial abnormality is seen.”​

AI: “There is no evidence of acute intracranial pathology.”​​

Levenshtein Similarity: 0.88


Example 2 (Low Similarity)

GT: “Mild disc degeneration at L4-L5.”​

AI: “Fracture seen in the right femur.”​

Levenshtein Similarity: 0.42


Count of Items Generated​

What it is: A simple yet informative metric, count the number of distinct items generated in the AI output and compare it to the ground truth (GT) provided by reading physicians.


Why it matters:

Helps assess over generation or under generation.​

Indicates whether the AI is being too verbose or missing expected findings.​

Important for evaluating clinical usefulness and readability.


Example:
A. Ground Truth (GT) Impression (2 items):

No acute intracranial abnormality.​

Mild chronic microvascular ischemic changes.

B. AI-Generated Impression (4 items):​

No evidence of acute intracranial pathology.​

Mild chronic ischemic changes noted.​

Age-related brain atrophy present.​

No mass effect or midline shift.


GT Count: 2​
AI Count: 4​
Difference: +2 (Overgeneration)


Confusion Matrix

What it is: To determine the True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN) of the AI agent in terms of at least generating one item if the GT has one or more item, and vice versa.


Why it's useful: Helps assess overall agent sensitivity and specificity at the report level.


Definitions of TP, TN, FP, FN:


For Impression Agent

AI
GT

TP

At least one IMPRESSION item​

At least one IMPRESSION item​

TN

No IMPRESSION item

No IMPRESSION item

FP

At least one IMPRESSION item

No IMPRESSION item

FN

No IMPRESSION item

At least one IMPRESSION item​


For Recommendation Agent

AI
GT

TP

At least one RECOMMENDATION item

At least one RECOMMENDATION item

TN

No RECOMMENDATION item

No RECOMMENDATION item

FP

At least one RECOMMENDATION item

No RECOMMENDATION item

FN

No RECOMMENDATION item

At least one RECOMMENDATION item


Metric
AI

Accuracy​

Proportion of total correct predictions.​

Precision

How many predicted positives are correct.​

Sensitivity​

How many actual positives were correctly predicted.​

Specificity​

How many actual negatives were correctly predicted.​

F1 Score​

Harmonic mean of precision and sensitivity.​


LLM Grader: AI-Based Quality Evaluation

A custom Chat Flow, powered by GPT-4o, was developed to evaluate AI-generated Impression and
Recommendation sections. The evaluation compares each AI output to both:

  • The Reading physician-reviewed Ground Truth (GT)​

  • The Original Report written by the reading physician


Each AI output is scored on three key criteria, using a numeric scale from 0 to 1 along with textual justifications.


Evaluation Criteria
Similarity Score​

How closely the AI output matches the Ground Truth in structure and meaning.

Hallucination Score​

Assesses whether the AI introduces unsupported or fabricated content.​

Compares the AI output to the Original Report.​

Penalizes clinically incorrect or invented statements.

Correctness Score​

Measures how accurately the AI reflects the clinical content in the Ground Truth.


LLM Grader: Similarity Score​


What it measures:

Evaluates how closely the AI-generated section aligns in meaning and structure with the Ground Truth (GT). This score focuses on semantic overlap, organization, and phrasing consistency.


Good Example – Score: 0.90

GT: "Slightly coarsened hepatic echotexture. 3.6 humeri simple right renal cyst."​

AI Output: "Slightly coarsened liver echotexture. 3.6 cm right renal simple cyst."

Equivalent meaning with slight rephrasing and different synonyms.​
High alignment with GT content and structure.


Bad Example – Score: 0.00

GT: "No renal stones or acute process seen."​

AI Output: "Lumbar fusion with mild S-shaped scoliosis and multilevel degenerative joint disease (DJD). Vascular calcifications noted."

The AI response and GT have completely different content.
The AI response mentions lumbar fusion, scoliosis, DJD, and vascular calcifications, while the GT states no renal stones or acute process.


LLM Grader: Hallucination Score​


What it measures:

Assesses whether the AI introduces unsupported or fabricated findings not present in the Original Report. It focuses on clinical safety and factual grounding.


Good Example – Score: 0.00

Original Report: "LIVER: Slightly coarsened echotexture without focal hepatic mass. Midclavicular length: 13.9 cm. RIGHT KIDNEY: Size: 10.7 x 4.0 x 5.2 cm (L x W x H). 3.6 cm right renal simple cyst. " ​

AI Output: "Slightly coarsened liver echotexture. 3.6 cm right renal simple cyst."

Faithful to source content
No hallucinated findings or clinical interpretation


Bad Example – Score: 0.25

Original Report: "PRE-FINDINGS: EXAM: CT OF THE HEAD/BRAIN WITHOUT INTRAVENOUS CONTRAST COMPARISON: None. INDICATIONS: TRAUMA TECHNIQUE: An axial spiral CT examination was performed without intravenous contrast. In addition, sagittal and coronal reformatted images were created and reviewed. This CT uses dose modulation, interval reconstruction, and/or weight-based dosing when appropriate to reduce radiation dose to as low as reasonably achievable. FINDINGS: BRAIN INCLUDING MENINGES AND EXTRA AXIAL SPACES: Acute abnormality. Severe chronic small vessel disease. Generalized cerebral volume loss. The hippocampi appear decreased in volume, with extra vacuum enlargement of the temporal horns. No upward bowing of the corpus callosum or asymmetric narrowing of the hy- convexity sulci is suggests hydrocephalus. VASCULATURE: No atherosclerotic changes in the major vessels. PITUITARY: No mass or deformity of the sella. SKULL: No significant lesion or fracture. SINUSES: Limited views demonstrate no significant mucosal thickening or fluid. ORBITS: Left-sided scleral band. Right lens replacement. OTHER: There are scattered foci of air in the left masticator space and in the left preauricular region. Trace left mastoid effusion."​

AI Output: "Acute abnormality in the brain. Severe chronic small vessel disease. Generalized cerebral volume loss with decreased hippocampal volume and enlargement of temporal horns. Left-sided scleral band and right lens replacement. Scattered foci of air in the left masticator space and left preauricular region. Trace left mastoid effusion. ”

The AI response is mostly grounded in the Original Report, but it incorrectly states an 'acute abnormality in the brain,' which is not supported by the Original Report. Other findings are consistent with the Original Report.


LLM Grader: Correctness Score


What it measures:

Evaluates how accurately the AI captures the clinical meaning of the Ground Truth. Focuses on factual consistency and correct medical interpretation.


Good Example – Score: 1.00

GT: "Slightly coarsened hepatic echotexture. 3.6 humeri simple right renal cyst."​

AI Output: "Slightly coarsened liver echotexture. 3.6 cm right renal simple cyst."

Clinically aligned and accurate.
Different medical terms used, still valid.​


Bad Example – Score: 0.5

GT: "No renal stones or acute process seen."​

AI Output: "Lumbar fusion with mild S-shaped scoliosis and multilevel degenerative joint disease (DJD). Vascular calcifications noted."

The AI response is plausible as it accurately reflects findings from the Original Report, but it does not align with the GT, which mentions no renal stones or acute process.

Human Expert Evaluation

An expert Canadian and US Board Certified reading physician with 20+ years of experience, fellowship training in pediatrics, interventional radiology and a mini fellowship in musculoskeletal imaging was assigned to manually review the original reports alongside the AI-generated Impressions and Recommendations for the 330 cases that comprise our benchmark testing set.


An expert Canadian and US Board Certified reading physician with 20+ years of experience, fellowship training in pediatrics, interventional radiology and a mini fellowship in musculoskeletal imaging was assigned to manually review the original reports alongside the AI-generated Impressions and Recommendations for the 330 cases that comprise our benchmark testing set.
  1. Correctness Score: For a given report, score the correctness of the AI-generated results using a Likert scale from 1 to 5. The breakdown of this score is given below:​

5 → Fully Correct​

4 → Minor inaccuracies​

3 → Half correct items​

2 → Major inaccuracies​

1 → Fully incorrect

  1. Hallucination Score: For a given report, determine if any of the AI result contain information that is not found in the original report. In other words, score if there are Hallucinations or Not. The breakdown of the score is given below:

No Hallucination → All AI results are grounded and are based on the original report​

Not Applicable → There is no recommendation in the original report and AI correctly generated no recommendations; Therefore Hallucination score doesn't apply here.​

Contains Hallucination → The generated results contain information outside the original report


Human Expert Evaluation - Overall Report Level Results
Correctness Score:


Correctness Score-Meaning​
Count (Percentage)​

5-Fully Correct​

266 (80.6%)​

4-Minor Inaccuracies​

49 (14.8%)​

3-Half Correct Items​

12 (3.6%)​

2-Major Inaccuracies

3 (0.9%)​

1-Fully Incorrect​

0 (0.0%)​

Total
330 (100%)


Statistical Summary​
Value

Count

330​

Mean

4.75

Median

5​

Std Dev

0.56​

Min

2

Max

5

Mode
5 (appeared 266 times)​


Correctness Score Distribution for 330 Reports


Hallucination Score:​
Correctness Score-Meaning​
Count (Percentage)​

No Hallucination​

321 (97.3%)​​

Contains Hallucination​

9 (2.7%)​

Total
330 (100%)


Hallucination Score Distribution for 330 Reports


Human Expert Evaluation - Report Impression 1.0 - Impressions Only
Correctness Score:


Correctness Score-Meaning​
Count (Percentage)​

5-Fully Correct​

266 (80.6%)​

4-Minor Inaccuracies​

49 (14.8%)​

3-Half Correct Items​

12 (3.6%)​

2-Major Inaccuracies

3 (0.9%)​

1-Fully Incorrect​

0 (0.0%)​

Total
330 (100%)


Statistical Summary​
Value

Count

330​

Mean

4.75

Median

5​

Std Dev

0.56​

Min

2

Max

5

Mode
5 (appeared 266 times)​


Correctness Score Distribution for 330 Reports


Hallucination Score:​
Correctness Score-Meaning​
Count (Percentage)​

No Hallucination​

321 (97.3%)​​

Contains Hallucination​

9 (2.7%)​

Total
330 (100%)


Correctness Score Distribution for 330 Reports


Human Expert Evaluation - Report Impression 1.0 - Impressions Only
Correctness Score:


Correctness Score-Meaning​
Count (Percentage)​

5-Fully Correct​

306 (92.7%)​

4-Minor Inaccuracies​

6 (1.8%)​

3-Half Correct Items​

2 (0.6%)​

2-Major Inaccuracies

2 (0.6%)​

1-Fully Incorrect​

14 (4.2%)​

Total
330 (100%)


Statistical Summary​
Value

Count

330​

Mean

4.78​

Median

5​

Std Dev

0.85​

Min

1

Max

5

Mode
5 (appeared 306 times)​


Correctness Score Distribution for 330 Reports


Hallucination Score:​
Correctness Score-Meaning​
Count (Percentage)​

No Hallucination​

103 (31.2%)​

Not Applicable​

227 (68.8%)​

Contains Hallucination​
0 (0.0%)​


Hallucination Score Distribution for 330 Reports