What are other model variations that should be considered when using Intraclass Correlation Coefficients (ICC)? Part III (Consistency)
ICC (Adjusted / Consistency), commonly written ICC(C,1) or ICC(3,1), is a form of the ICC that measures the consistency of measurements between raters or instruments after adjusting for systematic differences between them. The "consistency" ICC does not require exact agreement on the measured values; instead, it evaluates whether the relative orderings of subjects by different raters agree, even if one rater scores systematically higher or lower than another.
- Consistency focuses on whether raters rank subjects similarly, or whether measurements retain the same relative values across raters, even if their absolute values differ. For instance, one rater may consistently score higher than another; if the rankings and relative differences between subjects are the same, the consistency ICC will still indicate a high degree of agreement.
- Adjusted means that the ICC accounts for these systematic differences: one rater might consistently give higher scores than another, but that bias is not penalized as long as the ranking remains the same. The sketch below makes this concrete.
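Here is a minimal sketch of the single-rater consistency coefficient, ICC(C,1), computed from the standard two-way ANOVA mean squares; the function name and the example ratings matrix are illustrative only, not from any particular library.

```python
import numpy as np

def icc_consistency(x: np.ndarray) -> float:
    """ICC(C,1) = (MS_rows - MS_err) / (MS_rows + (k-1) * MS_err)
    for an n-subjects x k-raters matrix of ratings."""
    n, k = x.shape
    grand = x.mean()
    ss_rows = k * np.sum((x.mean(axis=1) - grand) ** 2)  # subject variability
    ss_cols = n * np.sum((x.mean(axis=0) - grand) ** 2)  # systematic rater differences
    ss_err = np.sum((x - grand) ** 2) - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    # The rater (column) effects are removed via ss_cols, so a constant
    # offset between raters does not lower the coefficient.
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)

# Rater 2 scores every subject exactly 2 points higher than rater 1:
ratings = np.array([[4.0, 6.0],
                    [2.0, 4.0],
                    [5.0, 7.0],
                    [3.0, 5.0]])
print(icc_consistency(ratings))  # 1.0 despite the systematic offset
```

Because the column (rater) sum of squares is subtracted out before the residual is formed, the two-point offset contributes nothing to the error term, and the coefficient reflects only how consistently the subjects are ordered.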
Practical Applications of ICC (Adjusted / Consistency)
The consistency ICC is useful when the question is whether raters or measurement instruments rank individuals consistently, rather than whether their absolute scores are identical. Below are some key applications:
- Medical and Clinical Research
- Subjective Ratings of Symptoms: In clinical trials, different doctors might rate the severity of symptoms (e.g., pain, depression, or anxiety). If one doctor consistently gives higher scores but both doctors rank patients in the same order of severity, ICC (Consistency) will indicate high reliability.
- Inter-rater Reliability in Diagnoses: Consistency ICC can be used to assess whether two radiologists, pathologists, or other healthcare professionals agree on which patients have more severe conditions, even if they disagree slightly on absolute scores (see the worked example after this list).
- Behavioral and Psychological Research
- Psychological Assessments: When different psychologists administer subjective assessments, ICC (Consistency) helps establish whether they agree on which individuals show higher or lower levels of a trait, such as anxiety or cognitive ability, even if they have slight biases in scoring.
- Behavioral Observation: In observational studies of human behavior (e.g., classroom behavior, social interactions), consistency ICC assesses whether different observers rate the relative frequency or intensity of behaviors similarly, ignoring small differences in how they score each behavior.
- Sports Science and Coaching
- Skill Ratings by Coaches: In sports, multiple coaches may evaluate an athlete’s performance in various skills (e.g., passing, shooting, or speed). ICC (Consistency) checks whether the coaches consistently agree on which athletes perform better, even if they score the absolute performance differently.
- Fitness Assessments: If two trainers assess athletes’ physical fitness (e.g., endurance, strength), the consistency ICC indicates whether both trainers agree on the rank-ordering of athletes by fitness level, without requiring identical scores.
- Educational Testing and Assessment
- Grading of Subjective Assignments: Teachers grading essays or other subjective work may have different grading standards, but ICC (Consistency) checks whether they rank the essays similarly in quality, even if one teacher grades more harshly or leniently.
- Standardized Test Scoring: In scoring tests with subjective elements, like writing or speaking, ICC (Consistency) assesses whether raters agree on who performed better, even if they apply slightly different grading rubrics.
- Survey Research
- Perception and Attitudinal Studies: In surveys where multiple interviewers rate respondents’ attitudes or perceptions, consistency ICC helps determine whether interviewers rank respondents similarly, even if one interviewer tends to give higher or lower ratings overall.
- Market Research: Multiple raters may score consumer feedback when evaluating consumer satisfaction. Consistency ICC assesses whether raters agree on which products or services were rated better, even if their absolute scoring is not identical.
- Environmental Science and Quality Control
- Instrument Calibration: Different instruments might measure environmental parameters like temperature or pollution levels. ICC (Consistency) would determine whether these instruments consistently rank sites or samples in the same order, even if their absolute measurements differ slightly.
- Quality Control in Manufacturing: In manufacturing processes, consistency ICC can be used to assess whether different inspectors consistently agree on which products are more defective, even if they differ in their strictness in scoring the defects.
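As a concrete version of the radiologist scenario above, here is one way the coefficient is computed in practice, assuming the pingouin package is available; the patient IDs and severity scores are invented for illustration.

```python
import pandas as pd
import pingouin as pg

# Hypothetical severity ratings: radiologist B runs about one point
# higher than A but largely preserves the ordering of patients.
df = pd.DataFrame({
    "patient":     [1, 2, 3, 4, 5, 6] * 2,
    "radiologist": ["A"] * 6 + ["B"] * 6,
    "severity":    [3, 5, 2, 4, 6, 1,    # radiologist A
                    4, 6, 3, 5, 7, 3],   # radiologist B
})

icc = pg.intraclass_corr(data=df, targets="patient",
                         raters="radiologist", ratings="severity")
# The ICC3 row is the single-rater consistency coefficient; compare it
# with the ICC2 (absolute agreement) row to see the effect of B's offset.
print(icc[["Type", "Description", "ICC"]])
```

Note that pingouin expects long-format data (one row per patient-rater pair) and reports the full set of single-rater and average-rater ICC forms in one table.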
Why Use ICC (Adjusted / Consistency)?
- Focus on Relative Rankings: ICC (Consistency) is appropriate when the goal is to assess whether raters or instruments rank subjects consistently and exact agreement on the magnitude of scores is less important.
- Systematic Differences Are Acceptable: In many real-world applications, raters have different scoring tendencies (some are stricter, some more lenient), but as long as their rankings are consistent, this is acceptable. The consistency ICC adjusts for these biases, making it more forgiving than the absolute-agreement ICC.
For instance, in a clinical setting, two doctors might agree on which patients are the most and least ill yet systematically give different severity scores. The consistency ICC asks whether these rankings are reliable across raters while allowing for differences in absolute scores.
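A small numeric sketch of that point, extending the earlier sketch with the standard absolute-agreement form (the severity scores and function name are invented): with a constant two-point offset between the doctors, the consistency coefficient stays at 1.0 while the agreement coefficient, which penalizes the offset, drops to about 0.74.

```python
import numpy as np

def icc_from_matrix(x: np.ndarray) -> tuple[float, float]:
    """Return (consistency ICC(C,1), agreement ICC(A,1)) for an
    n-subjects x k-raters matrix, via two-way ANOVA mean squares."""
    n, k = x.shape
    grand = x.mean()
    ms_rows = k * np.sum((x.mean(axis=1) - grand) ** 2) / (n - 1)
    ms_cols = n * np.sum((x.mean(axis=0) - grand) ** 2) / (k - 1)
    ss_err = (np.sum((x - grand) ** 2)
              - (n - 1) * ms_rows - (k - 1) * ms_cols)
    ms_err = ss_err / ((n - 1) * (k - 1))
    consistency = (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)
    # The agreement form keeps the rater mean square in the denominator,
    # so systematic offsets between raters lower it.
    agreement = (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k / n * (ms_cols - ms_err))
    return consistency, agreement

# Doctor 2 rates every patient exactly 2 points sicker than doctor 1.
severity = np.array([[5.0, 7.0], [3.0, 5.0], [8.0, 10.0],
                     [2.0, 4.0], [6.0, 8.0]])
c, a = icc_from_matrix(severity)
print(f"consistency={c:.2f}, agreement={a:.2f}")  # consistency=1.00, agreement=0.74
```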
Summary
ICC (Adjusted / Consistency) measures how reliably raters or instruments order subjects relative to one another, allowing for systematic differences between raters. It is widely used in fields such as clinical research, psychology, education, sports, and quality control, where consistent rankings or judgments matter more than exact agreement on the magnitude of measurements.