Obstetrics & Gynecology 2003;101:875-880
© 2003 by The American College of Obstetricians and Gynecologists

ORIGINAL RESEARCH

Observer Agreement With Laparoscopic Diagnosis of Pelvic Inflammatory Disease Using Photographs

Pontus Molander, MD, Patrik Finne, MD, PhD, Jari Sjöberg, MD, PhD, John Sellors, MD, PhD and Jorma Paavonen, MD, PhD

From the Departments of Obstetrics and Gynecology and Clinical Chemistry, University of Helsinki, Helsinki, Finland; and Program for Appropriate Technology in Health, Seattle, Washington.

Address reprint requests to: Pontus Molander, MD, Department of Obstetrics and Gynecology, University of Helsinki, Haartmaninkatu 2, 00290, Helsinki, Finland; E-mail: pontus.molander@hus.fi.


    ABSTRACT
 
OBJECTIVE: To test the intraobserver (does an observer agree with his or her own earlier reading?) and interobserver (do different observers agree with one another?) reproducibility and the overall diagnostic accuracy of laparoscopic diagnosis of pelvic inflammatory disease (PID).

METHODS: Three senior consultants and three residents in training in obstetrics and gynecology scored the four laparoscopic images (left and right adnexa, cul-de-sac, and pelvic panoramic view) from each of 40 patients and repeated the process 2 days later after the order of presentation had been randomized. A standardized predesigned scoring form was used. Histopathologically proven PID was used as the gold standard.

RESULTS: The overall accuracy of the laparoscopic diagnosis of PID was 78%, the sensitivity was 27%, and the specificity was 92%. The overall intraobserver reproducibility of the diagnosis of PID was only fair (κ = 0.58), and it was clearly better among the consultants than among the residents (κ = 0.76 and 0.39, respectively). The overall interobserver reproducibility was poor to fair (κ = 0.43), and it was again better among the consultants than among the residents (κ = 0.48 and 0.38, respectively). When specific diagnostic features (including tubal erythema, edema, adhesions, cul-de-sac fluid) were separately analyzed, the results were no different, suggesting only poor-to-fair reproducibility.

CONCLUSION: Based on photographic images, the observer reproducibility and the overall diagnostic accuracy of the laparoscopic diagnosis of PID are unsatisfactory when histopathologically proven PID is used as the gold standard.

Laparoscopy has been recognized as the gold standard for the diagnosis of pelvic inflammatory disease (PID).1 Many studies have been performed on the correlation between the clinical and the laparoscopic diagnosis of PID. Most studies have demonstrated a relatively low sensitivity for the clinical diagnosis usually in the range of 60–70%.2–4 Similarly, laboratory criteria seem to have a low sensitivity in the diagnosis of PID.2 Several other diagnostic modalities including transvaginal ultrasound,5 magnetic resonance imaging,6 endometrial biopsy,7 and evidence of lower genital tract infection8 have also been compared with laparoscopy in the diagnosis of PID.

Laparoscopic criteria for the diagnosis of PID include tubal erythema, tubal edema, and the presence of purulent exudate.1,9 In some studies, the acute salpingitis score has been derived to improve laparoscopic diagnosis and also to determine the severity of PID.10 However, uniform classification systems for the laparoscopic diagnosis of PID have been difficult to develop and implement. In fact, laparoscopy has never been properly validated as the gold standard in the diagnosis of PID. An expectation bias by the laparoscopist can be a major problem.11 Sellors et al11 demonstrated that laparoscopy was no better than chance in the diagnosis of salpingitis when fimbrial minibiopsy showing histopathologic evidence of salpingitis was used as the gold standard. This study clearly demonstrated the inaccuracy of the laparoscopic diagnosis of PID. Surprisingly, only a few studies have tested observer reproducibility of specific laparoscopic findings.12–15 Thus, there is a growing concern about the use of laparoscopy as the gold standard. Therefore, we decided to perform a systematic study of the observer reproducibility and diagnostic accuracy of the laparoscopic diagnosis of PID among unselected patients referred for acute pelvic pain. We used the histopathologic diagnosis of PID as the gold standard.


    MATERIALS AND METHODS
 
Three senior consultants (experience in laparoscopic surgery 6–15 years, mean 12 years) and three residents in training (third-year residents undergoing rotation in gynecologic surgery) from the Department of Obstetrics and Gynecology, University of Helsinki, Finland, participated in the study. Laparoscopic color slides showing pelvic structures were obtained from women referred to the emergency room for acute pelvic pain, as previously described.11 Briefly, in Burlington, Ontario, Canada, consecutive women with pelvic pain presenting to 65 family physicians and six casualty officers at the hospital emergency department were referred to one of three gynecologists for laparoscopy if the primary care physician considered the pain to represent PID or another important pelvic condition, such as ectopic pregnancy or appendicitis. Diagnoses were made as usual, and there were no preset criteria insisted on for the diagnosis of PID. Primary care physicians and gynecologists were allowed to use any diagnostic tests available to them before they recorded their preoperative diagnoses. Eligibility criteria were as follows: The women had to be premenopausal and not known to be pregnant; no antimicrobial treatment, abortion, or abdominal surgery had occurred in the previous 6 weeks; the uterus and at least one fallopian tube had to be present; the pain had not been present for more than 2 months; and the women had to be fit for surgery. Of the 95 women who signed a consent form and underwent surgical exploration, 63% presented to the primary care physician in the office, 33% came to the emergency department, and 4% were seen on the wards. Endometrial biopsies and fimbrial minibiopsies were obtained during the acute phase laparoscopy.11 A total of 160 color slides with four slides each of a convenience sample of 40 patients were available for review.
This sample represented women chosen for high-quality photographs and severity of PID matching what was found in the total sample. Each set of four slides per case showed pelvic structures including the uterus, the ovaries, the fallopian tubes, and the pelvic peritoneum in a systematic fashion (views of left and right adnexae, pelvic panorama, and cul-de-sac). Each set of four slides had been taken so that the pelvic anatomy was shown and visible. The slides considered not evaluable by the observers were excluded from the final analysis. The slides were projected on a large screen to produce an image 1 m wide. All four views of each patient were presented at the same time using four slide projectors. All observers were placed at a distance of approximately 3 m from the images in a darkened room. The observers were not allowed to discuss the cases and were allowed to use as much time as they needed to complete scoring of the slides. The mean time required for scoring of one case was approximately 2 minutes. The slide review was conducted during two identical sessions, 2 days apart. Each of the two sessions took approximately 2 hours. For the second session, the same slides were shown again but in a random order to minimize any recall bias based on the first review. To study the intraobserver reproducibility, the observers were not told that the slides from the same cases would be shown again. Each set of four slides was scored systematically using a predesigned form. An overall diagnosis was made in each case. Each observer interpreted the slides independently and was not aware of any of the interpretations of the other observers. To avoid bias, specific training with the grading system before initiation of the study was not performed. Of the 40 patients included in the review, nine had histopathologically proven PID and 31 had other pelvic pathology or normal findings. The results by the senior consultants and the residents were analyzed separately. 
The overall diagnostic accuracy was analyzed in relation to the presence or absence of proven PID. Pelvic inflammatory disease was defined by the presence of salpingitis by fimbrial minibiopsy or the presence of plasma cell endometritis by endometrial biopsy obtained during the acute phase laparoscopy.11

The intraobserver reproducibility (agreement between two examinations by the same observer) was analyzed for each observer based on two separate readings of the same photographic images. Overall, 240 pairs of observations were analyzed (six observers viewing 40 sets of images that included four slides of each case on two occasions). The intraobserver reproducibility was also analyzed separately for six individual diagnostic features including erythema of the fallopian tubes (none, mild, moderate, or severe), erythema of the uterus (none, mild, moderate, or severe), edema of the fallopian tubes (none, mild, moderate, or severe), fimbrial status (normal, edematous, phimotic, clubbed), presence of tubal adhesions (none, filmy, dense with distortion, dense with enclosure), and characteristics of fluid in the cul-de-sac (none, serous, purulent, hemorrhagic).

The interobserver reproducibility (agreement between two observers reading the same set of photographic images) was analyzed from the results of the first review session. Agreement between the six observers was tested both by the overall PID diagnosis and by the individual diagnostic features of PID.
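
As a rough illustration of the pairwise comparisons described above, the following Python sketch enumerates the observer pairs among six readers and averages an agreement measure over them. The text does not state how agreement was pooled across pairs, so the averaging step, the observer labels, the toy ratings, and the use of raw proportion agreement (in place of a chance-corrected measure) are all assumptions made for illustration.

```python
from itertools import combinations

# Hypothetical labels for the three consultants and three residents.
consultants = ["C1", "C2", "C3"]
residents = ["R1", "R2", "R3"]


def observer_pairs(observers):
    """All unordered pairs of observers."""
    return list(combinations(observers, 2))


def proportion_agreement(x, y):
    """Share of cases on which two readings coincide (not chance-corrected)."""
    return sum(i == j for i, j in zip(x, y)) / len(x)


def mean_pairwise(ratings, observers, agreement):
    """Average an agreement measure over every pair of the given observers."""
    pairs = observer_pairs(observers)
    return sum(agreement(ratings[a], ratings[b]) for a, b in pairs) / len(pairs)


# Six observers give 15 pairs overall and 3 pairs within each group.
all_pairs = observer_pairs(consultants + residents)
print(len(all_pairs), len(observer_pairs(consultants)), len(observer_pairs(residents)))

# Toy first-session diagnoses (1 = PID, 0 = no PID) for four invented cases.
toy_ratings = {"C1": [1, 0, 0, 1], "C2": [1, 0, 0, 0], "C3": [1, 0, 1, 1]}
toy_mean = mean_pairwise(toy_ratings, consultants, proportion_agreement)
```

Any two-rater agreement function can be passed as the `agreement` argument; in a real analysis a chance-corrected statistic would replace the raw proportion used here.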

Diagnostic validity was assessed in terms of accuracy, sensitivity, and specificity. Intraobserver and interobserver agreement was measured by the κ statistic, which measures agreement between observers beyond what would be expected by chance alone. κ values above 0.75 represent excellent agreement beyond chance; κ values between 0.40 and 0.75 represent fair agreement beyond chance; and κ values less than 0.40 represent poor agreement beyond chance. A κ value of 1 indicates perfect agreement, whereas 0 indicates agreement no better than chance.16–18 The P value for a κ estimate was used to determine whether the κ value differed significantly from a result obtained by chance (null hypothesis: κ = 0). A cutoff P value of .05 was used to demonstrate significance. The mean κ values of the residents and the consultants were compared using the Student t test.
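
The agreement statistic and interpretation bands described above can be sketched in Python as follows. This is a minimal illustration of the standard two-rater (Cohen) form of the statistic, not the authors' analysis code; the function names and the example readings are hypothetical, and the handling of the exact boundary values 0.40 and 0.75 is an assumption, since the text does not say which band they belong to.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two equal-length lists of categories."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(ratings_a)
    # Observed proportion of cases on which the two readings agree.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from each reading's category frequencies.
    freq_a = Counter(ratings_a)
    freq_b = Counter(ratings_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both readings used one identical category throughout
    return (observed - expected) / (1.0 - expected)


def agreement_band(kappa):
    """Map a kappa value to the bands used in the text (boundaries assumed)."""
    if kappa > 0.75:
        return "excellent"
    if kappa >= 0.40:
        return "fair"
    return "poor"


# Hypothetical first and second readings of ten cases by one observer.
first_reading = ["PID", "PID", "no PID", "no PID", "no PID",
                 "PID", "no PID", "no PID", "no PID", "no PID"]
second_reading = ["PID", "no PID", "no PID", "no PID", "no PID",
                  "PID", "PID", "no PID", "no PID", "no PID"]

kappa = cohen_kappa(first_reading, second_reading)
print(f"{kappa:.2f}", agreement_band(kappa))  # prints "0.52 fair"
```

In this invented example the two readings coincide in 8 of 10 cases, but because 58% agreement would be expected by chance from the category frequencies alone, the chance-corrected value is only about 0.52.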


    RESULTS
 
Based on photographic images, the overall accuracy of the diagnosis of PID by laparoscopy was 78%, the sensitivity was 27%, and the specificity was 92% (Table 1). With respect to accuracy, senior consultants and residents in training performed equally well. Measures were mean values of the first and second viewing sessions. The overall intraobserver reproducibility (agreement between two examinations) was only fair (κ = 0.58). The κ coefficients for the six observers ranged from 0.26 to 0.90 (Table 2). Senior consultants performed clearly better than residents (mean κ = 0.76 versus 0.39), although the difference between the groups was not significant (P = .067) (Table 2). The intraobserver reproducibility was then analyzed by specific features of the laparoscopic diagnosis of PID. The results were highly variable, ranging from -0.08 to 1 for the individual observers, and mean values showed again only poor to fair reproducibility. Again, the consultants performed better than the residents. Evaluation of tubal erythema and tubal fimbria produced clearly better agreement among the consultants with a significant difference for interpretation of tubal erythema (P = .015). κ estimates for the residents ranged from 0.18 to 0.59, and for the consultants from 0.45 to 0.70 (Table 3).
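
The relation between the three validity figures reported above can be checked with a short Python sketch. It assumes the case mix given in Materials and Methods (9 patients with histopathologically proven PID and 31 without) and treats the published percentages as exact, although they are rounded means over observers and sessions and some slides were excluded as not evaluable. The function names and the 2×2 counts in the second example are hypothetical.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, and accuracy from 2x2 counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy


def accuracy_from_rates(sensitivity, specificity, n_diseased, n_healthy):
    """Accuracy is the case-mix-weighted average of sensitivity and specificity."""
    correct = sensitivity * n_diseased + specificity * n_healthy
    return correct / (n_diseased + n_healthy)


# Reported sensitivity and specificity combined with the 9 / 31 case mix.
implied_accuracy = accuracy_from_rates(0.27, 0.92, 9, 31)
print(round(implied_accuracy, 3))  # prints 0.774

# Invented counts for a single reading session, for illustration only.
example = diagnostic_metrics(tp=2, fn=7, tn=29, fp=2)
```

The implied value of about 0.77 is in line with the reported overall accuracy of 78% once rounding is taken into account. It also shows why accuracy can look acceptable while sensitivity is low: with only 9 of 40 cases positive, the specificity term dominates the weighted average.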


 
Table 1. Overall Validity of the Laparoscopic Diagnosis of PID
 

 
Table 2. Intraobserver Reproducibility of the Laparoscopic Diagnosis of PID
 

 
Table 3. Intraobserver Reproducibility of Specific Features of the Laparoscopic Diagnosis of PID
 
The overall interobserver reproducibility (agreement between two observers) of PID diagnosis was only fair (κ = 0.43), ranging from 0.07 to 0.68. Again, the consultants performed better than the residents (κ = 0.48 versus 0.38), although the difference was not significant (Table 4). The results of the interobserver agreement analyses of specific PID features were comparable to those of the intraobserver analysis, showing only poor to fair reproducibility (Table 5). The consultants performed better for each of the separate components except for interpretation of edema of the fallopian tubes, which showed no difference between the consultants and the residents.


 
Table 4. Interobserver Reproducibility of the Laparoscopic Diagnosis of PID
 

 
Table 5. Interobserver Reproducibility of Specific Features of the Laparoscopic Diagnosis of PID
 

    DISCUSSION
 
Laparoscopy has been recognized as the gold standard for the diagnosis of PID. Therefore, it is striking that the reproducibility of the laparoscopic diagnosis of PID is not known. We decided to perform a systematic study of the reproducibility of the laparoscopic diagnosis of PID using a set of color slides obtained from 40 patients with or without PID. Our results were unexpected, showing surprisingly low intraobserver and interobserver reproducibility of the PID diagnosis. The overall sensitivity of the laparoscopic diagnosis in relation to histopathologically proven diagnosis turned out to be surprisingly low. Similar observer reproducibility studies have been performed for cervical cytology, histopathology, visual inspection of the cervix with acetic acid, and colposcopy.19–24 These studies consistently have shown only modest observer agreement. We have previously studied observer reproducibility of the colposcopic assessment of cervical transformation zone findings. When four experienced colposcopists reviewed a set of colposcopic slides showing specific features of the atypical transformation zone, observer agreement was poor.22 Similar results have been obtained by others as well.23 Observer reproducibility studies have also been performed on transabdominal ultrasound measurements of fetal blood flow, again showing poor interobserver reproducibility.25 These results are a cause of major concern, suggesting that observer reproducibility should be systematically analyzed in comparative clinical trials of new diagnostic or screening tests.26 Laparoscopic reproducibility studies have been performed on the diagnosis of endometriosis.12,13 A high degree of variability was demonstrated when the American Fertility Society classification system was used as the gold standard for the laparoscopic diagnosis of endometriosis. For instance, Hornstein et al12 found that intraobserver and interobserver reproducibility was only moderate to fair (κ = 0.45 and 0.28).
In another study, the interobserver κ estimate was 0.44, suggesting again only poor to fair reproducibility.13 In these studies, laparoscopic videotapes, color photographs, or color slides were used. These studies suggest that the American Fertility Society classification system for endometriosis is inadequate because of lack of observer reproducibility. Another study evaluated the interobserver reproducibility of assessing pelvic adhesions using the American Fertility Society scoring system for adhesions. The surgeons scored videotapes and recorded outcome of surgery, providing recommendations for management. The interobserver reproducibility was poor (κ = 0.21, 0.32, 0.13).14 Similar results have also been obtained by other investigators.15 Thus, our results are not totally unexpected.

Although the slide sessions were conducted in a uniform fashion, other limitations certainly could apply. The observers had no access to medical history, laboratory results, or other test results, and this could have affected the interpretation of the photographs compared with an in vivo situation. Similarly, there was no possibility to manipulate pelvic structures and to perform intraoperative interventions, such as adhesiolysis, which again might provide valuable information and have an effect on the overall interpretation of the pelvic findings. However, in other studies, videotapes have also been used, and these studies have similarly shown surprisingly poor observer reproducibility.12–15

The scoring system developed for our study may not be ideal or optimal and has so far not been extensively validated by other research groups. However, the development of such scoring systems and forms is not easy. A scoring system cannot be too simple, and it should allow descriptive comments in addition to simple scoring. On the other hand, scoring forms cannot be too complicated or detailed and time consuming. Our results suggest that laparoscopy is not appropriate as a single reference standard for the diagnosis of PID in comparative clinical trials in which different diagnostic modalities are tested. Therefore, in future clinical PID studies, a combination of laparoscopic and histopathologic diagnoses should be considered as the gold standard. Poor to fair intraobserver and interobserver agreement suggests that laparoscopic diagnosis of PID needs further evaluation.

One straightforward conclusion from this study was that experienced gynecologists performed better than residents in training. Senior consultants reached even fair to excellent agreement for the intraobserver PID diagnosis. Interpretation of certain features of PID diagnosis showed large variation between consultants and residents, which suggests that consensus agreement of these specific findings is important. One potential application for studies like this would be to include such observer agreement tests in the residency training programs. Another implication of such findings is that any diagnostic feature with poor reproducibility warrants attempts to better define its categories. If this process of consensus building on definitions does not result in increased agreement, then consideration should be given to discounting the feature as a reliable sign of disease.

The clinical diagnosis of PID is extremely difficult, leading both to underdiagnosis and overdiagnosis, and both can be associated with serious sequelae such as overuse of antimicrobials or long-term complications such as tubal factor infertility, ectopic pregnancy, and chronic pelvic pain. However, laparoscopy seems not to solve the diagnostic problems because of lack of observer agreement and lack of diagnostic sensitivity. Diagnostic accuracy and equity of health services delivery depend greatly on the reproducibility of a test—not just in the hands of the experts who develop the test but also when others attempt to use it.


    Footnotes
 
This study was supported by grants from the Helsinki University Research Funds, K. Albin Johansson Funds, and Medical Society of Finland.

doi:10.1016/S0029-7844(03)00013-9

Received July 18, 2002. Received in revised form November 18, 2002. Accepted November 27, 2002.


    REFERENCES
 
1. Jacobson L, Weström L. Objectivized diagnosis of pelvic inflammatory disease. Am J Obstet Gynecol 1969;105:1088–98.

2. Munday PE. Pelvic inflammatory disease: An evidence-based approach to diagnosis. J Infect 2000;40:31–41.

3. Weström L, Joesoef R, Reynolds G, Hadgu A, Thompson SE. Pelvic inflammatory disease and infertility: A cohort study of 1844 women with laparoscopically verified disease and 657 control women with normal laparoscopic results. Sex Transm Dis 1992;19:185–92.

4. Bevan CD, Johal BJ, Mumtaz G, Ridgway GL, Siddle NC. Clinical, laparoscopic and microbiological findings in acute salpingitis: Report on a UK cohort. Br J Obstet Gynaecol 1995;102:407–14.

5. Molander P, Sjöberg J, Paavonen J, Cacciatore B. Transvaginal power Doppler findings in laparoscopically proven acute PID. Ultrasound Obstet Gynecol 2001;17:233–8.

6. Tukeva T, Aronen HJ, Karjalainen PT, Molander P, Paavonen T, Paavonen J. MR imaging in pelvic inflammatory disease. Comparison with laparoscopy and US. Radiology 1999;210:209–16.

7. Paavonen J, Teisala K, Heinonen PK, Aine R, Laine S, Lehtinen M, et al. Microbiological and histopathological findings in acute pelvic inflammatory disease. Br J Obstet Gynaecol 1987;24:211–20.

8. Peipert JF, Boardman L, Hogan JW, Sung J, Mayer KH. Laboratory evaluation of acute upper genital tract infection. Obstet Gynecol 1996;87:730–6.

9. Hager WD, Eschenbach DA, Spence MR, Sweet RL. Criteria for the diagnosis and grading of salpingitis. Obstet Gynecol 1983;61:113–4.

10. Henry-Suchet J, Tesquier L. Role of laparoscopy in the management of pelvic adhesions and pelvic sepsis. Clin Obstet Gynecol 1994;8:759–72.

11. Sellors J, Mahony J, Goldsmith C, Rath D, Mander R, Hunter B, et al. The accuracy of clinical findings and laparoscopy in acute pelvic inflammatory disease. Am J Obstet Gynecol 1991;164:113–20.

12. Hornstein MD, Gleason RE, Orav J, Haas ST, Friedman JA, Rein MS, et al. The reproducibility of the revised American Fertility Society classification of endometriosis. Fertil Steril 1993;59:1015–21.

13. Rock JA. The revised American Fertility Society classification of endometriosis: Reproducibility of scoring. Fertil Steril 1995;63:1108–10.

14. Bowman MC, Tin-Chiu L, Cooke ID. Inter-observer variability at laparoscopic assessment of pelvic adhesions. Hum Reprod 1995;10:155–60.

15. Corson SL, Batzer FR, Gocial B, Kelly M, Gutmann JN, Maislin G. Intra-observer and inter-observer variability in scoring laparoscopic diagnosis of pelvic adhesions. Hum Reprod 1995;10:161–4.

16. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics 1977;33:159–74.

17. Rosner B. Fundamentals of biostatistics. Boston: PWS-Kent Publishing, 1990.

18. Brennan P, Silman A. Statistical methods for assessing observer variability in clinical measures. BMJ 1992;304:1491–4.

19. Raab SS, Snider TE, Potts SA, McDaniel HL, Robinson RA, Nelson DL, et al. Atypical glandular cells of undetermined significance: Diagnostic accuracy and interobserver variability using select cytologic criteria. Am J Clin Pathol 1997;107:299–307.

20. Stoler MH. Does every little cell count? Don’t "ASCUS." Cancer 1999;87:45–7.

21. Stoler MH, Schiffman M. Interobserver reproducibility of cervical cytologic and histologic interpretations: Realistic estimates from the ASCUS-LSIL triage study. JAMA 2001;285:1500–5.

22. Sellors JW, Nieminen P, Vesterinen E, Paavonen J. Observer variability in the scoring of colpophotographs. Obstet Gynecol 1990;76:1006–8.

23. Ferris DG, Cox JT, Burke L, Litaker MS, Harper DM, Campion MJ, et al. Colposcopy quality control: Establishing colposcopy criterion standards for the ALTS trial using cervigrams. J Lower Genital Tract Dis 1998;2:195–203.

24. Sellors JW, Jeronimo J, Sankaranarayanan R, Wright TC, Howard M, Blumenthal PD. Assessment of the cervix after acetic acid wash: Inter-rater agreement using photographs. Obstet Gynecol 2002;99:635–40.

25. Mavrides E, Holden J, Bland JM, Tekay A, Thilaganathan B. Intraobserver and interobserver variability of transabdominal Doppler velocimetry measurements of the fetal ductus venosus between 10 and 14 weeks of gestation. Ultrasound Obstet Gynecol 2001;17:306–10.

26. Reid MC, Lachs MS, Feinstein AR. Use of methodologic standards in diagnostic test research: Getting better but still not good. JAMA 1995;274:645–51.




