It is time to upgrade how you analyze your assessment data…


Many health-professions education institutions rely heavily on high-stakes examinations to determine students’ performance and progression. High-stakes exams require a test that is highly reliable and valid (Sloane & Kelly, 2003). Validity refers to the question of whether the test results adequately represent the knowledge (skills etc.) the test was intended to measure. Reliability is often conceived as a form of reproducibility: if we administered the test again, would the students obtain the same scores? Reliability is important because there are only a few high-stakes exams (often only one or two), and the test scores need to be precise since we have only one opportunity to collect the sample.

Most health-professions education providers apply a form of test item analysis after the test has been administered (Tavakol & O’Brien, 2022). This analysis typically consists of descriptive statistics, point-biserial correlation, facility (percentage correct), distractor analysis, and internal consistency measures. In addition, the mean scores for all test items are generated and a pass mark (i.e., cutoff score) is determined. This approach to dealing with test data is based on Classical Test Theory (CTT).
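To make these classical statistics concrete, the following is a minimal sketch of how facility, a (corrected) point-biserial correlation, and Cronbach’s Alpha might be computed from a matrix of scored responses. The function name and the toy data are hypothetical and serve illustration only; real item-analysis software will typically add distractor analysis and handle missing responses.

```python
import numpy as np

def ctt_item_analysis(responses):
    """Classical item statistics for a students x items matrix of 0/1 scores."""
    x = np.asarray(responses, dtype=float)
    n_items = x.shape[1]
    total = x.sum(axis=1)

    # Facility: proportion of students who answered each item correctly.
    facility = x.mean(axis=0)

    # Corrected point-biserial: correlate each item with the total score on
    # the remaining items, so that the item is not correlated with itself.
    point_biserial = np.empty(n_items)
    for j in range(n_items):
        rest = total - x[:, j]
        point_biserial[j] = np.corrcoef(x[:, j], rest)[0, 1]

    # Cronbach's Alpha from the item variances and the total-score variance.
    item_var = x.var(axis=0, ddof=1)
    alpha = n_items / (n_items - 1) * (1.0 - item_var.sum() / total.var(ddof=1))
    return facility, point_biserial, alpha

# Hypothetical toy data: 6 students (rows) by 4 items (columns).
toy_responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
]
facility, point_biserial, alpha = ctt_item_analysis(toy_responses)
print("Facility:", facility.round(2))
print("Point-biserial:", point_biserial.round(2))
print("Alpha:", round(alpha, 2))
```

Note that every statistic here is tied to the particular group of students who sat the test: the same item would show a different facility if a stronger or weaker cohort had answered it.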

The problem

Item analysis, as the name suggests, only provides information about an individual test item; very limited information is obtained about the overall test, with the exception of the mean and other descriptive statistics. This is problematic if one is concerned with establishing the validity of the entire test.

Validity is typically determined by means of content validity. That is, a blueprint containing test items matching the curriculum content is used to select test items from an item bank (or stations in the case of an OSCE). It is then assumed that the test represents the knowledge (skills etc.) taught through the curriculum. Sometimes a board of (external) examiners convenes to establish the face validity (i.e., a last check before the test administration that the test items are a good representation of what was taught).

Reliability is typically established by means of measures of internal consistency such as Cronbach’s Alpha. Besides the fact that Alpha is often an inappropriate measure of internal consistency because its statistical requirements are rarely met (Sijtsma, 2009), the idea of test items being internally consistent does not make sense in the context of assessment. Test items are highly diverse: one set measures basic sciences, another physiology, yet another ethics, and so on. These items measure diverse things and should therefore not be internally consistent!

In summary, common psychometric approaches aimed at establishing the validity and reliability of high-stakes tests are often inadequate for making high-stakes decisions. Overly simplistic metrics are used that are poorly understood and often wrongly applied (e.g., the Alpha). What is needed is a psychometric approach that can produce validity and reliability evidence by evaluating the test in its totality, with the option of scrutinizing individual test items. A promising approach to achieve this goal is Item Response Theory.

The solution

Item Response Theory, or IRT, is a statistical approach that provides powerful insights into the validity and reliability of an entire test, while allowing for in-depth analysis of individual test items (or stations when an OSCE is concerned). A powerful feature of IRT is that construct validity can be established quantitatively. This is operationalized by comparing test item difficulty with test-taker performance on a standardized scale (logits or Theta).

An example may elucidate this point. Consider responses to a test item. Say, 90% of the students provided the correct answer. This may suggest that the students were all well prepared and had the knowledge to answer the question correctly. However, one could also argue that the item was very easy and even without preparation, most students could guess the correct answer.

With IRT one can statistically disentangle student ability and item difficulty by taking into account how all students performed on the entire test and how difficult all items on the test were. Test item difficulty and student ability can then be plotted on a standardized scale (i.e., a Wright plot). For an example, see Figure 1a-b.
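The idea of placing ability and difficulty on one scale can be sketched with the simplest IRT model. The snippet below assumes the Rasch (one-parameter logistic) model, which is only one of several IRT models and is chosen here for illustration; the function name and the numbers are hypothetical. It shows that the probability of a correct answer depends only on the difference between student ability (Theta) and item difficulty, both in logits.

```python
import numpy as np

def rasch_probability(theta, b):
    """Probability of a correct answer under the Rasch (1PL) model.

    theta: student ability in logits; b: item difficulty in logits.
    """
    theta = np.asarray(theta, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 / (1.0 + np.exp(-(theta - b)))

# The same success rate of about 90% can arise in two very different ways:
# an able student on an item of average difficulty ...
p_able_student = rasch_probability(theta=2.2, b=0.0)
# ... or an average student on a very easy item.
p_easy_item = rasch_probability(theta=0.0, b=-2.2)

print(round(float(p_able_student), 3), round(float(p_easy_item), 3))
```

A single percentage correct cannot tell these two situations apart. Estimating abilities and difficulties jointly from all responses is what allows them to be separated and then compared on the same scale.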

This is a very powerful way of establishing the construct validity of a test. It addresses the question of whether the test, with all its items, was constructed to measure adequate levels of student ability.

An additional feature of construct validity in IRT is that one can estimate the extent to which test items provide adequate information, summarized in the so-called test information curve. A good test should be able to measure across a wide ability spectrum (i.e., high-, medium- and low-performing students). In IRT, a test information curve can visualize this information. See Figure 2.

The test information curve is particularly important for high-stakes examinations, because one has to ensure that there is sufficient test information available at the cut-off point where students pass or fail the test. If a test does not have sufficient information at that point (i.e., too few items), no valid pass/fail decisions can be made with the test at hand.

Besides establishing the validity of a test, IRT can also determine its reliability. It does so by estimating conditional reliability. IRT approaches the concept of reliability differently from the traditional CTT approach, which uses the coefficient Alpha or similar absolute indices. The CTT approach assumes that reliability can be expressed as a single value that applies to all scale scores. IRT analysis, in contrast, can determine scale reliability along the ability spectrum. The conditional reliability is expressed as a curve that is mathematically related to both scale information and conditional standard errors through simple transformations. See Figure 3 for an overview.
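These transformations can be sketched in a few lines. The conditional standard error is the inverse square root of the test information; for conditional reliability, the snippet uses one common convention, I / (I + 1), which assumes that ability is scaled to unit variance. Other software may use a different but related formula, so treat the function names and this particular convention as assumptions.

```python
import numpy as np

def conditional_standard_error(information):
    """Standard error of the ability estimate: 1 / sqrt(information)."""
    information = np.asarray(information, dtype=float)
    return 1.0 / np.sqrt(information)

def conditional_reliability(information):
    """Conditional reliability as I / (I + 1), assuming unit-variance ability.

    Equivalent to 1 / (1 + SE**2), where SE is the conditional standard error.
    """
    information = np.asarray(information, dtype=float)
    return information / (information + 1.0)

# Hypothetical test information at three points along the ability scale.
information = np.array([1.0, 4.0, 9.0])
se = conditional_standard_error(information)
reliability = conditional_reliability(information)
print("Standard errors:", se.round(3))
print("Reliability:", reliability.round(3))
```

Where the test carries more information, the standard error shrinks and the reliability rises, so a single test can be highly reliable for some students and much less so for others.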

At this point it should be noted that although this approach is also applicable to OSCEs, it is often overlooked that the ONLY appropriate manner for determining reliability evidence for an OSCE is inter-rater agreement.

Finally, IRT is also capable of determining cut-off points by using all available information about item difficulty and student ability. This is a more valid approach since it is based on the actual test data and not some external standard that does not adequately make use of all available information. See Figure 4 for an example.

If you are interested in finding out more about the use of IRT (and other statistical approaches) for establishing validity and reliability evidence for your high-stakes exams, please contact us and we will provide more information.