ORIGINAL RESEARCH
From the Departments of Obstetrics and Gynecology and Clinical Chemistry, University of Helsinki, Helsinki, Finland; and Program for Appropriate Technology in Health, Seattle, Washington.
Address reprint requests to: Pontus Molander, MD, Department of Obstetrics and Gynecology, University of Helsinki, Haartmaninkatu 2, 00290, Helsinki, Finland; E-mail: pontus.molander{at}hus.fi.
| ABSTRACT |
|---|
METHODS: Three senior consultants and three residents in training in obstetrics and gynecology scored the four laparoscopic images (adnexa, cul-de-sac, and pelvic panoramic view) from each of 40 patients and repeated the process 2 days later after the order of presentation had been randomized. A standardized predesigned scoring form was used. Histopathologically proven PID was used as the gold standard.
RESULTS: The overall accuracy of the laparoscopic diagnosis of PID was 78%, the sensitivity was 27%, and the specificity was 92%. The overall intraobserver reproducibility of the diagnosis of PID was only fair (κ = 0.58), and it was clearly better among the consultants than among the residents (κ = 0.76 and 0.39, respectively). The overall interobserver reproducibility was poor to fair (κ = 0.43), and it was again better among the consultants than among the residents (κ = 0.48 and 0.38, respectively). When specific diagnostic features (including tubal erythema, edema, adhesions, cul-de-sac fluid) were separately analyzed, the results were no different, suggesting only poor-to-fair reproducibility.
CONCLUSION: Based on photographic images, the observer reproducibility and the overall diagnostic accuracy of the laparoscopic diagnosis of PID are unsatisfactory when histopathologically proven PID is used as the gold standard.
Laparoscopy has been recognized as the gold standard for the diagnosis of pelvic inflammatory disease (PID).1 Many studies have been performed on the correlation between the clinical and the laparoscopic diagnosis of PID. Most studies have demonstrated a relatively low sensitivity for the clinical diagnosis, usually in the range of 60–70%.2–4 Similarly, laboratory criteria seem to have a low sensitivity in the diagnosis of PID.2 Several other diagnostic modalities including transvaginal ultrasound,5 magnetic resonance imaging,6 endometrial biopsy,7 and evidence of lower genital tract infection8 have also been compared with laparoscopy in the diagnosis of PID.
Laparoscopic criteria for the diagnosis of PID include tubal erythema, tubal edema, and the presence of purulent exudate.1,9 In some studies, the acute salpingitis score has been derived to improve laparoscopic diagnosis and also to determine the severity of PID.10 However, uniform classification systems for the laparoscopic diagnosis of PID have been difficult to develop and implement. In fact, laparoscopy has never been properly validated as the gold standard in the diagnosis of PID. An expectation bias by the laparoscopist can be a major problem.11 Sellors et al11 demonstrated that laparoscopy was no better than chance in the diagnosis of salpingitis when fimbrial minibiopsy showing histopathologic evidence of salpingitis was used as the gold standard. This study clearly demonstrated the inaccuracy of the laparoscopic diagnosis of PID. Surprisingly, only a few studies have tested observer reproducibility of specific laparoscopic findings.12–15 Thus, there is a growing concern about the use of laparoscopy as the gold standard. Therefore, we decided to perform a systematic study of the observer reproducibility and diagnostic accuracy of the laparoscopic diagnosis of PID among unselected patients referred for acute pelvic pain. We used the histopathologic diagnosis of PID as the gold standard.
| MATERIALS AND METHODS |
|---|
The intraobserver reproducibility (agreement between two examinations by the same observer) was analyzed for each observer based on two separate readings of the same photographic images. Overall, 240 pairs of observations were analyzed (six observers viewing 40 sets of images that included four slides of each case on two occasions). The intraobserver reproducibility was also analyzed separately for six individual diagnostic features including erythema of the fallopian tubes (none, mild, moderate, or severe), erythema of the uterus (none, mild, moderate, or severe), edema of the fallopian tubes (none, mild, moderate, or severe), fimbrial status (normal, edematous, phimotic, clubbed), presence of tubal adhesions (none, filmy, dense with distortion, dense with enclosure), and characteristics of fluid in the cul-de-sac (none, serous, purulent, hemorrhagic).
The interobserver reproducibility (agreement between two observers reading the same set of photographic images) was analyzed from the results of the first review session. Agreement between the six observers was tested both by the overall PID diagnosis and by the individual diagnostic features of PID.
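The counts implied by this design can be checked with a short sketch. The observer labels below are hypothetical placeholders, and the enumeration of observer pairs is an assumption for illustration only: this section does not state whether agreement was computed pair by pair or with a multi-rater statistic.

```python
from itertools import combinations

# Hypothetical labels for the six observers described in the study design
# (three senior consultants and three residents).
consultants = ["consultant_1", "consultant_2", "consultant_3"]
residents = ["resident_1", "resident_2", "resident_3"]
observers = consultants + residents

n_cases = 40  # sets of images, one set per patient

# Intraobserver: each observer reads each case twice, giving one
# (first reading, second reading) pair per observer per case.
intra_pairs = len(observers) * n_cases

# Interobserver (assumed pairwise): every unordered pair of observers
# can be compared on the first review session.
observer_pairs = list(combinations(observers, 2))
consultant_pairs = list(combinations(consultants, 2))
resident_pairs = list(combinations(residents, 2))

print(intra_pairs)            # 240, matching the figure given in the text
print(len(observer_pairs))    # 15 possible observer pairs
print(len(consultant_pairs), len(resident_pairs))  # 3 and 3 within-group pairs
```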
Diagnostic reproducibility was assessed in terms of accuracy, sensitivity, and specificity. The intraobserver and interobserver agreement was measured by the κ statistic. κ measures agreement between observers above what would be expected by chance alone. κ values above 0.75 represent excellent agreement beyond chance; κ values between 0.40 and 0.75 represent fair agreement beyond chance; and κ values less than 0.40 represent poor agreement beyond chance. A κ value of 1 indicates perfect agreement, whereas 0 indicates a result obtained by chance alone.16–18 The P value for a κ estimate was used to determine whether a κ value differs significantly from a result obtained by chance (null hypothesis: κ = 0). A cutoff P value of .05 was used to demonstrate significance. The distributions of mean κ values among juniors and seniors were compared using the Student t test.
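The statistics described above can be illustrated with a minimal sketch. It assumes the two-rater (Cohen) form of κ, which may differ from the exact variant the authors used, and all function names and ratings are hypothetical examples rather than study data. The interpretation thresholds are the ones stated in the text.

```python
from collections import Counter
from typing import Hashable, Sequence


def diagnostic_metrics(predicted: Sequence[bool], reference: Sequence[bool]) -> dict:
    """Accuracy, sensitivity, and specificity against a reference standard."""
    if len(predicted) != len(reference):
        raise ValueError("both sequences must have the same length")
    pairs = list(zip(predicted, reference))
    tp = sum(1 for p, r in pairs if p and r)          # true positives
    tn = sum(1 for p, r in pairs if not p and not r)  # true negatives
    fp = sum(1 for p, r in pairs if p and not r)      # false positives
    fn = sum(1 for p, r in pairs if not p and r)      # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
    }


def cohen_kappa(first: Sequence[Hashable], second: Sequence[Hashable]) -> float:
    """Two-rater kappa: (observed - expected) / (1 - expected) agreement."""
    if len(first) != len(second) or not first:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(first)
    observed = sum(1 for a, b in zip(first, second) if a == b) / n
    counts_first, counts_second = Counter(first), Counter(second)
    # Chance agreement: product of the two raters' marginal proportions,
    # summed over categories.
    expected = sum(counts_first[c] * counts_second[c] for c in counts_first) / n**2
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1 - expected)


def interpret_kappa(kappa: float) -> str:
    """Thresholds as stated in the text: >0.75 excellent, 0.40-0.75 fair, <0.40 poor."""
    if kappa > 0.75:
        return "excellent"
    if kappa >= 0.40:
        return "fair"
    return "poor"


# Hypothetical PID ratings (1 = PID, 0 = no PID) for ten cases.
reading_1 = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
reading_2 = [1, 0, 0, 0, 1, 0, 1, 1, 0, 0]
kappa = cohen_kappa(reading_1, reading_2)
print(round(kappa, 2), interpret_kappa(kappa))

# Hypothetical laparoscopic calls versus a histopathologic reference.
laparoscopy = [True, False, False, False, True, False, False, False]
histology = [True, True, True, False, False, False, False, False]
print(diagnostic_metrics(laparoscopy, histology))
```

In this invented example the two readings agree on 8 of 10 cases, but 0.52 agreement is expected by chance from the marginal totals, so κ is about 0.58 and falls in the "fair" band.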
| RESULTS |
|---|
[...] (κ = 0.58). The κ coefficients for the six observers ranged from 0.26 to 0.90 (Table 2). [...] (κ = 0.76 versus 0.39), although the difference between the groups was not significant (P = .067) (Table 2). [...] κ estimates for the residents ranged from 0.18 to 0.59, and for the consultants from 0.45 to 0.70 (Table 3).
[...] (κ = 0.43), ranging from 0.07 to 0.68. Again, the consultants performed better than the residents (κ = 0.48 versus 0.38), although the difference was not significant (Table 4).
| DISCUSSION |
|---|
[...] (κ = 0.45 and 0.28). In another study, the interobserver κ estimate was 0.44, suggesting again only poor to fair reproducibility.13 In these studies, laparoscopic videotapes, color photographs, or color slides were used. These studies suggest that the American Fertility Society classification system for endometriosis is inadequate because of lack of observer reproducibility. Another study evaluated the interobserver reproducibility of assessing pelvic adhesions using the American Fertility Society scoring system for adhesions. The surgeons scored videotapes and recorded outcome of surgery, providing recommendations for management. The interobserver reproducibility was poor (κ = 0.21, 0.32, 0.13).14 Similar results have also been obtained by other investigators.15 Thus, our results are not totally unexpected. Although the slide sessions were conducted in a uniform fashion, other limitations certainly could apply. The observers had no access to medical history, laboratory results, or other test results, and this could have affected the interpretation of the photographs compared with an in vivo situation. Similarly, there was no possibility to manipulate pelvic structures and to perform intraoperative interventions, such as adhesiolysis, which again might provide valuable information and have an effect on the overall interpretation of the pelvic findings. However, in other studies, videotapes have also been used, and these studies have similarly shown surprisingly poor observer reproducibility.12–15
The scoring system developed for our study may not be ideal or optimal and has so far not been extensively validated by other research groups. However, the development of such scoring systems and forms is not easy. A scoring system cannot be too simple, and it should allow descriptive comments in addition to simple scoring. On the other hand, scoring forms cannot be too complicated or detailed and time consuming. Our results suggest that laparoscopy is not appropriate as a single reference standard for the diagnosis of PID in comparative clinical trials in which different diagnostic modalities are tested. Therefore, in future clinical PID studies, a combination of laparoscopic and histopathologic diagnoses should be considered as the gold standard. Poor to fair intraobserver and interobserver agreement suggests that laparoscopic diagnosis of PID needs further evaluation.
One straightforward conclusion from this study was that experienced gynecologists performed better than residents in training. Senior consultants even reached fair to excellent agreement for the intraobserver PID diagnosis. Interpretation of certain features of PID diagnosis showed large variation between consultants and residents, which suggests that consensus agreement on these specific findings is important. One potential application for studies like this would be to include such observer agreement tests in residency training programs. Another implication of such findings is that any diagnostic feature with poor reproducibility warrants attempts to better define its categories. If this process of consensus building on definitions does not result in increased agreement, then consideration should be given to discounting the feature as a reliable sign of disease.
The clinical diagnosis of PID is extremely difficult, leading both to underdiagnosis and overdiagnosis, and both can be associated with serious sequelae such as overuse of antimicrobials or long-term complications such as tubal factor infertility, ectopic pregnancy, and chronic pelvic pain. However, laparoscopy does not seem to solve the diagnostic problems because of lack of observer agreement and lack of diagnostic sensitivity. Diagnostic accuracy and equity of health services delivery depend greatly on the reproducibility of a test, not just in the hands of the experts who develop the test but also when others attempt to use it.
| Footnotes |
|---|
doi:10.1016/S0029-7844(03)00013-9
Received July 18, 2002. Received in revised form November 18, 2002. Accepted November 27, 2002.
| REFERENCES |
|---|
2. Munday PE. Pelvic inflammatory disease: An evidence-based approach to diagnosis. J Infect 2000;40:31–41.
3. Weström L, Joesoef R, Reynolds G, Hagdu A, Thompson SE. Pelvic inflammatory disease and infertility: A cohort study of 1844 women with laparoscopically verified disease and 657 control women with normal laparoscopic results. Sex Transm Dis 1992;19:185–92.
4. Bevan CD, Johal BJ, Mumtaz G, Ridgway GL, Siddle NC. Clinical, laparoscopic and microbiological findings in acute salpingitis: Report on a UK cohort. Br J Obstet Gynaecol 1995;102:407–14.
5. Molander P, Sjöberg J, Paavonen J, Cacciatore B. Transvaginal power Doppler findings in laparoscopically proven acute PID. Ultrasound Obstet Gynecol 2001;17:233–8.
6. Tukeva T, Aronen HJ, Karjalainen PT, Molander P, Paavonen T, Paavonen J. MR imaging in pelvic inflammatory disease. Comparison with laparoscopy and US. Radiology 1999;210:209–16.
7. Paavonen J, Teisala K, Heinonen PK, Aine R, Laine S, Lehtinen M, et al. Microbiological and histopathological findings in acute pelvic inflammatory disease. Br J Obstet Gynaecol 1987;24:211–20.
8. Peipert JF, Boardman L, Hogan JW, Sung J, Mayer KH. Laboratory evaluation of acute upper genital tract infection. Obstet Gynecol 1996;87:730–6.
9. Hager WD, Eshenbach DA, Spence MR, Sweet RL. Criteria for the diagnosis and grading of salpingitis. Obstet Gynecol 1983;61:113–4.
10. Henry-Suchet J, Tesquier L. Role of laparoscopy in the management of pelvic adhesions and pelvic sepsis. Clin Obstet Gynecol 1994;8:759–72.
11. Sellors J, Mahony J, Goldsmith C, Rath D, Mander R, Hunter B, et al. The accuracy of clinical findings and laparoscopy in acute pelvic inflammatory disease. Am J Obstet Gynecol 1991;164:113–20.
12. Hornstein MD, Gleason RE, Orav J, Haas ST, Friedman JA, Rein MS, et al. The reproducibility of the revised American Fertility Society classification of endometriosis. Fertil Steril 1993;59:1015–21.
13. Rock JA. The revised American Fertility Society classification of endometriosis: Reproducibility of scoring. Fertil Steril 1995;63:1108–10.
14. Bowman MC, Tin-Chiu L, Cooke ID. Inter-observer variability at laparoscopic assessment of pelvic adhesions. Hum Reprod 1995;10:155–60.
15. Corson SL, Batzer FR, Gocial B, Kelly M, Gutmann JN, Maislin G. Intra-observer and inter-observer variability in scoring laparoscopic diagnosis of pelvic adhesions. Hum Reprod 1995;10:161–4.
16. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics 1977;33:159–74.
17. Rosner B. Fundamentals of biostatistics. Boston: PWS-Kent Publishing, 1990.
18. Brennan P, Silman A. Statistical methods for assessing observer variability in clinical measures. BMJ 1992;304:1491–4.
19. Raab SS, Snider TE, Potts SA, McDaniel HL, Robinson RA, Nelson DL, et al. Atypical glandular cells of undetermined significance: Diagnostic accuracy and interobserver variability using select cytologic criteria. Am J Clin Pathol 1997;107:299–307.
20. Stoler MH. Does every little cell count? Don't "ASCUS." Cancer 1999;87:45–7.
21. Stoler MH, Schiffman M. Interobserver reproducibility of cervical cytologic and histologic interpretations: Realistic estimates from the ASCUS-LSIL triage study. JAMA 2001;285:1500–5.
22. Sellors JW, Nieminen P, Vesterinen E, Paavonen J. Observer variability in the scoring of colpophotographs. Obstet Gynecol 1990;76:1006–8.
23. Ferris DG, Cox JT, Burke L, Litaker MS, Harper DM, Campion MJ, et al. Colposcopy quality control: Establishing colposcopy criterion standards for the ALTS trial using cervigrams. J Lower Genital Tract Dis 1998;2:195–203.
24. Sellors JW, Jeronimo J, Sankaranarayanan R, Wright TC, Howard M, Blumenthal PD. Assessment of the cervix after acetic acid wash: Inter-rater agreement using photographs. Obstet Gynecol 2002;99:635–40.
25. Mavrides E, Holden J, Bland JM, Tekay A, Thilaganathan B. Intraobserver and interobserver variability of transabdominal doppler velocimetry measurements of the fetal ductus venosus between 10 and 14 weeks of gestation. Ultrasound Obstet Gynecol 2001;17:306–10.
26. Reid MC, Lachs MS, Feinstein AR. Use of methodologic standards in diagnostic test research: Getting better but still not good. JAMA 1995;274:645–51.