Evaluation of departmental inter-rater reliability when scoring thyroid nodules according to the BTA U-classification model. Is there significant disagreement?

Nabil Rtam, Yeovil District Hospital

Introduction

The BTA U-classification is a risk stratification model that grades thyroid nodules (TNs) from U2 to U5 based on their sonographic appearance. Variability between ultrasound operators in U-scoring is reported in the literature, and there is some evidence of it in the author’s department. The aim of this study was to investigate whether there is significant disagreement in the department and to identify potential reasons for variability.

Methods

Eight operators (radiologists and sonographers) were recruited to grade 33 TNs and answer a tick-box questionnaire using the BTA lexicon. Inter-operator variability for the U-categories, indication for fine-needle aspiration biopsy (FNAB) and ultrasound (US) features was assessed using Fleiss’ kappa and Gwet’s AC1. The operators’ accuracy was measured against the most experienced operator in the department using Cohen’s kappa and percentage agreement.
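The two multi-rater statistics named above can be sketched as follows. This is a minimal illustration, not the study's analysis code: it assumes every nodule is graded by the same number of operators, and the function names, the `CATEGORIES` list and the `toy_ratings` data are hypothetical and not taken from the study. Both statistics share the same observed agreement and differ only in how chance agreement is estimated.

```python
from collections import Counter

# Hypothetical category labels for illustration (the BTA U2-U5 grades).
CATEGORIES = ["U2", "U3", "U4", "U5"]


def count_table(ratings, categories=CATEGORIES):
    """Convert per-nodule lists of grades into per-nodule category counts."""
    return [[Counter(row)[c] for c in categories] for row in ratings]


def _observed_and_prevalence(counts):
    """Return (observed agreement, overall proportion of ratings per category)."""
    n_subjects = len(counts)
    n_raters = sum(counts[0])  # assumption: same number of raters per nodule
    # Per-nodule proportion of agreeing rater pairs, averaged over nodules.
    p_obs = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_subjects
    p_cat = [sum(col) / (n_subjects * n_raters) for col in zip(*counts)]
    return p_obs, p_cat


def fleiss_kappa(counts):
    """Fleiss' kappa: chance agreement is the sum of squared category proportions."""
    p_obs, p_cat = _observed_and_prevalence(counts)
    p_exp = sum(p * p for p in p_cat)
    return (p_obs - p_exp) / (1 - p_exp)


def gwet_ac1(counts):
    """Gwet's AC1: chance agreement is sum(p * (1 - p)) / (q - 1) over q categories."""
    p_obs, p_cat = _observed_and_prevalence(counts)
    p_exp = sum(p * (1 - p) for p in p_cat) / (len(p_cat) - 1)
    return (p_obs - p_exp) / (1 - p_exp)


# Made-up ratings: 4 nodules, each graded by 3 operators (not study data).
toy_ratings = [
    ["U2", "U2", "U2"],
    ["U2", "U2", "U3"],
    ["U5", "U5", "U5"],
    ["U3", "U4", "U4"],
]
toy_counts = count_table(toy_ratings)
toy_kappa = fleiss_kappa(toy_counts)
toy_ac1 = gwet_ac1(toy_counts)
```

Because AC1 estimates chance agreement differently, it tends to be less sensitive than kappa to skewed category prevalence, which is one common reason for reporting both.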

Results

Fair agreement (Fleiss’ K=0.213) was obtained between the participants when U-scoring (U2-5). Fair to moderate agreement was noted between sonographers (K=0.396). Significant variability was demonstrated between radiologists, whose agreement could not be distinguished from chance (P>0.05). U5 was the most agreed-upon category (K=0.561), and U3 and U4 were the least agreed upon (K=0.115 and 0.187 respectively). Indication for FNAB reached fair to almost substantial agreement (radiologists’ AC1=0.33, sonographers’ AC1=0.581, overall AC1=0.413). No significant variability was measured for echogenicity (K=0.291), composition (K=0.332), shape (K=0.584), margin (K=0.452), halo (K=0.331) and vascularity (K=0.435). Significant variability was noted for the “micro-cystic/spongiform” feature, to the extent that agreement due to chance could not be excluded (P>0.05). Accuracy reached fair agreement (mean Cohen’s K=0.289) for the U-categories and moderate agreement (mean AC1=0.53) for FNAB. Radiologists demonstrated lower accuracy.
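The verbal labels used above ("fair", "moderate", "substantial") are consistent with the widely used Landis and Koch bands for kappa, although the abstract does not name the scale, so this mapping is an assumption. A minimal sketch follows, with a hypothetical function name; applying the same bands to AC1 is a further assumption.

```python
def landis_koch_band(value):
    """Map an agreement coefficient to a Landis and Koch label.

    Assumed bands: <0 poor, 0-0.20 slight, 0.21-0.40 fair,
    0.41-0.60 moderate, 0.61-0.80 substantial, 0.81-1.00 almost perfect.
    """
    if value < 0:
        return "poor"
    bands = [
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
        (1.00, "almost perfect"),
    ]
    for upper, label in bands:
        if value <= upper:
            return label
    raise ValueError("agreement coefficients cannot exceed 1")


# Values reported in the Results, labelled under the assumed scale.
overall_u_score = landis_koch_band(0.213)
u5_category = landis_koch_band(0.561)
```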

Conclusion

This study demonstrated that there is no significant inter-rater variability in U-scoring or in recommending FNAB across all the US operators in the department. The study showed, however, room for improvement, particularly in the radiologist group. Reliability and accuracy could be improved by addressing the problematic categories and features identified in this study.
