Physician in Radiodiagnosis JEFFERSON ABINGTON HOSPITAL Huntingdon Valley, Pennsylvania, United States
Purpose: This study aimed to develop a deep NLP model to classify tumor response categories (TRCs) from free-text oncology reports (FTOR) and to compare its performance with conventional NLP approaches and human annotators with different levels of radiology experience.
Methods/Materials: In this retrospective study, 10,455 oncology radiology reports were collected from three radiology departments, including 9,653 SOR and 802 FTOR. TRCs were extracted from SOR using the Response Evaluation Criteria in Solid Tumors (RECIST) to provide ground truth labels for model training. A BERT-based deep NLP model was fine-tuned on these structured findings, while three conventional NLP models—Linear-SVC, k-nearest neighbors, and multinomial naive Bayes—were trained on bag-of-words representations. Model performance was compared with human annotators, including radiologists, medical students, and radiology technologist students, using F1 score, accuracy, precision, and recall. Lexical complexity and semantic ambiguity analyses were conducted to identify factors affecting performance.
Results: The BERT model achieved an overall F1 score of 0.70 on FTOR, outperforming Linear-SVC (0.63) and radiology technologist students (0.65), and approaching the performance of medical students (0.73), though below radiologists (0.79). Lexical complexity and semantic ambiguities in FTOR reduced both model and human performance, with maximum F1 drops of −0.19 and −0.17, respectively. On structured reports, BERT achieved an F1 score of 0.86, demonstrating excellent performance when structured data were available.
Conclusions: Deep NLP models trained on structured oncology reports can accurately classify tumor response in free-text reports, achieving near-human performance. This approach enables automated, scalable extraction of oncologic outcomes, supporting clinical workflow optimization, longitudinal patient monitoring, and oncology research. Future studies should explore multilingual adaptatio