A Rasch analysis of the Burnout Assessment Tool (BAT)

Emina Hadžibajramović, Conceptualization, Formal analysis, Investigation, Methodology, Project administration, Visualization, Writing – original draft, Writing – review & editing, 1, 2, *

Wilmar Schaufeli, Conceptualization, Data curation, Funding acquisition, Investigation, Methodology, Writing – original draft, Writing – review & editing, 3, 4

Hans De Witte, Conceptualization, Data curation, Funding acquisition, Investigation, Methodology, Writing – review & editing, 4, 5

Emina Hadžibajramović

1 Institute of Stress Medicine, Region Västra Götaland, Gothenburg, Sweden

2 Biostatistics, Department of Public Health and Community Medicine, Institute of Medicine, University of Gothenburg, Gothenburg, Sweden

Wilmar Schaufeli

3 Department of Psychology, Utrecht University, Utrecht, The Netherlands

4 Research Unit Occupational & Organizational Psychology and Professional Learning, KU Leuven, Leuven, Belgium

Hans De Witte

4 Research Unit Occupational & Organizational Psychology and Professional Learning, KU Leuven, Leuven, Belgium

5 Optentia Research Focus Area, North-West University, Potchefstroom, South Africa

Stefan Hoefer, Editor

Medical University Innsbruck, AUSTRIA

Competing Interests: The authors have declared that no competing interests exist.

Received 2019 Dec 10; Accepted 2020 Oct 30.

Copyright © 2020 Hadžibajramović et al

This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Associated Data

S1 Appendix: The Burnout Assessment Tool (BAT). (PDF) GUID: F30D3BE3-5A06-4203-89B5-F3B806934A0F
S2 Appendix: The observed residual correlation matrix for the Burnout Assessment Tool. (DOCX) GUID: 1AA562DB-561F-4146-8EB5-14B646543B4E
S3 Appendix: Conversion tables from mean into metric score for the four BAT subscales. (DOCX) GUID: 4F42DD6B-BCCD-4C37-91D1-3A75375B8692
S1 Data: Complete data n = 2978. (XLSX) GUID: 7973FE53-AA9D-44CA-B73F-A79F39165DAF
S2 Data: Sample 1 n = 800. (XLSX) GUID: 860F066D-EA70-449D-9A94-0E7C7429E632
S3 Data: Sample 2 n = 800. (XLSX) GUID: 85EFE56D-3BFD-4592-AAA1-B0976663EBE0

All relevant data are within the manuscript and its Supporting information files.

Abstract

Burnout as a concept indicative of a work-related state of mental exhaustion is recognized around the globe. Numerous studies showed that burnout has negative consequences for both individuals and organizations but also for society at large, especially in welfare states where sickness absence and work incapacitation are covered by social funds. This underlines the importance of a valid and reliable tool that can be used to assess employee burnout levels. Although the Maslach Burnout Inventory is by far the most frequently used questionnaire for assessing burnout, it is associated with several shortcomings and has been criticized on theoretical as well as empirical grounds. Thus, there is a need for an alternative questionnaire with a strong conceptual basis and proper psychometric qualities. This challenge has been taken up by introducing the Burnout Assessment Tool (BAT), according to which burnout is conceived as a work-related state of exhaustion among employees, characterized by extreme tiredness, reduced ability to regulate cognitive and emotional processes, and mental distancing. Given that the BAT is a new measure of burnout, its psychometric properties need to be evaluated. This paper focuses on an evaluation of the internal construct validity of the BAT using Rasch analysis in two random samples (n = 800, each) drawn from larger representative samples of the working population of the Netherlands and Flanders (Belgium). The BAT has sound psychometric properties and fulfils the measurement criteria according to the Rasch model. The BAT score reflects the scoring structure indicated by the developers of the scale and the BAT’s four subscales can be summarized into a single burnout score. The BAT score also works invariantly for women and men, younger and older respondents, and across both countries. Hence, the BAT can be used in organizations for screening and identifying employees who are at risk of burnout.

Introduction

Burnout, a metaphor referring to a work-related state of mental exhaustion, was first used in the United States at the end of the 1970s [1]. Since then the concept has spread around the globe and, according to PsycINFO, the largest database of psychological research, over 12,000 peer-reviewed scientific publications on the subject have appeared. Numerous studies have shown that burnout has negative consequences both for individual employees and for the organizations for which they work. For instance, burnout is associated with poor physical and mental health of employees, such as type-2 diabetes, cardiovascular disease, anxiety and depression [2]. In addition, it leads to high replacement costs due to turnover and sickness absence [3] and work incapacitation [4], and to poor business outcomes in terms of job performance [5], safety [6], productivity [7] and quality of care [8]. Moreover, burnout is not only an individual and organizational problem, but also a problem for society at large, especially in welfare states where sickness absence and work incapacitation are covered by social funds. Hence, it is not surprising that European legislation calls on employers to periodically assess psychosocial risks among their employees and to implement policies to prevent burnout and work stress. In a number of European countries, including Belgium and the Netherlands, burnout is recognized as an occupational disease or work-related disorder [9]. This underlines the importance of a valid and reliable tool that can be used to assess employee burnout levels.

Arguably, the gold standard for assessing burnout is the Maslach Burnout Inventory [10]. The MBI is based on the definition of burnout by Maslach and Jackson [11] as a syndrome of emotional exhaustion, depersonalization and reduced personal accomplishment, later referred to as exhaustion, cynicism and lack of professional efficacy, respectively [10]. Boudreau and Mauthe-Kaddoura [12] estimated that the MBI is used in 88% of all empirical studies on burnout. As a matter of fact, that means that burnout is what the MBI measures, and vice versa. This circularity and mutual dependence of concept and assessment—linked to the dominance of the MBI—is undesirable because it impedes new and innovative research that leads to a better understanding of burnout. Moreover, the MBI has been criticized on theoretical as well as empirical grounds. For instance, Schaufeli and Taris [13] have argued that rather than a constituting element of burnout, reduced professional efficacy should be considered as a consequence. Furthermore, it is maintained that reduced cognitive functioning was wrongfully excluded as a constituting element of burnout [14]. On the technical side, the MBI has been criticized for (a) skewed answering patterns that may affect its reliability [15]; (b) including positive (professional efficacy) items to assess a negative state [16]; (c) lack of clinically validated cut-off values [17]; (d) lack of statistical norms that are based on national representative samples [18] and (e) the fact that it yields three different subscale scores instead of a single burnout score [19].

Taken together, these criticisms call for an alternative self-report burnout instrument with a strong conceptual basis and proper technical qualities. This challenge has been taken up by introducing the Burnout Assessment Tool [20, 21]. The conceptual basis of the BAT builds on the analysis of Schaufeli and Taris [13], who argued that burnout represents both the inability and the unwillingness to expend effort, which is reflected by its energetic and motivational component, respectively. The unwillingness to perform manifests itself by increased resistance, reduced commitment, lack of interest, disengagement, and so on—in short, mental distancing. Thus, according to Schaufeli and Taris [13], inability (exhaustion) and unwillingness (distancing) are the key components that constitute two sides of the same burnout coin. The BAT was developed using a combination of an inductive and deductive approach. More specifically, 49 in-depth structured interviews were conducted with Flemish and Dutch professionals who frequently deal with persons who suffer from burnout, such as general practitioners, occupational physicians, occupational health psychologists, and career counselors. The aim was to find out which symptoms they would identify as being typical for burnout. For classifying the burnout symptoms mentioned in the interviews, the conceptual approach of Schaufeli and Taris [13] was used. Attention was also given to non-specific, atypical symptoms that are observed in other psychological disorders as well as in cancer, hypo- or hyperthyroidism, mood disorder or anxiety disorder.

As a final result, four symptom clusters emerged (see also the BAT test manual: www.burnoutassessmenttool.be) [21]. Not surprisingly, fatigue was mentioned unanimously as most important and distinctive for burnout (e.g., "exhaustion", "feeling empty", "completely exhausted", "having no energy", and "looking tired"). In addition, symptoms emerged that refer to cognitive and emotional impairment. Examples of the former are: "concentration problems", "making mistakes", "disturbed imprinting", "being less efficient", and "forgetfulness"; and of the latter: "weeping", "irritability", "anger", "hot temper", and "being emotional". Finally, symptoms of mental distance were mentioned, such as "no motivation", "withdrawal", "finding one’s job meaningless", "indifference", and "cynicism".

Accordingly, burnout was described as a work-related state of exhaustion that occurs among employees, characterized by extreme tiredness, reduced ability to regulate cognitive and emotional processes, and mental distancing. Because of the exhaustion experienced, the necessary energy is lacking to adequately regulate one’s emotional and cognitive processes. In other words, when experiencing burnout, the functional capacity for regulating emotional and cognitive processes is impaired. This is subjectively experienced as a loss of emotional and cognitive control. By way of self-protection and in order to prevent further energy depletion and loss of control, mental distancing occurs. In this conceptualization, both sides of the burnout coin are represented by exhaustion and its concurrent cognitive and emotional impairment on the one hand, and mental distancing on the other hand. Based on this conceptualization of burnout, BAT items were carefully formulated and tested for four burnout symptoms: exhaustion, mental distance, and emotional and cognitive impairment (for further details see below). In addition to the four core symptoms, burnout is associated with secondary symptoms—psychological distress, psychosomatic complaints and depressed mood. These symptoms often occur together with burnout but are not specific to burnout.

Given that the BAT is a new measure of burnout, its psychometric properties need to be evaluated. Moreover, it needs to be tested whether the BAT can be used to obtain a single burnout score, which is impossible with the MBI. In a recently published paper, the measurement invariance of the BAT across seven cross-national representative samples was investigated, and the BAT was successfully modelled as a second-order factor with a good fit to the data [22]. The current paper focuses on an evaluation of the internal construct validity of the BAT using Rasch analysis. The Rasch measurement model, usually referred to as Rasch analysis, belongs to the modern psychometric approaches known as item response theory (IRT). The Rasch model has been used in a variety of applications since its introduction in education during the 1950s, and has been widely used in the health sciences over the last two decades. Short introductions to Rasch analysis are given elsewhere [23–25] and a comprehensive overview of the statistical theory of Rasch models is given in a recent textbook [26]. An advantage of the Rasch model over classical test theory approaches such as factor analysis is that it does not require normally distributed scores. Hence it is the preferred choice for the analysis of ordinal data produced by multi-item questionnaires with ordered categorical responses. In addition, more detailed information about the persons, items and response categories is obtained in a more feasible way, as Rasch analysis allows a unified approach to measurement issues such as unidimensionality, appropriate category ordering of polytomous items, testing the invariance of items, and differential item functioning (DIF) (these concepts are briefly explained in the method section below).

The Rasch measurement model [27] operationalizes the axioms of additive conjoint measurements, which are the requirements for the measurement construction [28–31]. In other words, the Rasch model is a mathematical model describing how data are expected to behave in order to approximate a unidimensional measurement with interval scale properties. A unique feature in Rasch analysis is that fitting the data to the Rasch model places both item and person estimates on the same log-odds units (logit) scale, and in the case of model fit these are independent parameters. Given that the data fit the Rasch model, construct validity and objective measurement is achieved and the total score is a sufficient statistic [26]. In case that the data do not fit the Rasch model, this is interpreted as an indication that the questionnaire does not have the right psychometric properties and hence needs to be revised and improved.
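To make the shared logit scale concrete, the following minimal sketch shows the dichotomous form of the Rasch model, in which the log-odds of endorsing an item equal the difference between the person location and the item location. It is an illustration only: the function names and the numeric values are assumptions made here, and the BAT items themselves are polytomous and were analyzed with the partial credit model described in the methods.

```python
import math


def rasch_probability(person_location: float, item_location: float) -> float:
    """Probability of endorsing a dichotomous item under the Rasch model.

    Both arguments are expressed in logits on the same scale.
    """
    return 1.0 / (1.0 + math.exp(-(person_location - item_location)))


def log_odds(probability: float) -> float:
    """Convert a probability into log-odds (logits)."""
    return math.log(probability / (1.0 - probability))


# Hypothetical locations: a person 1 logit above the item location.
p = rasch_probability(person_location=1.5, item_location=0.5)
print(round(log_odds(p), 6))  # recovers the 1-logit difference
```

When the person and item locations coincide, the probability of endorsement is 0.5, which is why item locations can be read as the point on the burnout continuum where the item is most informative.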

For instance, Rasch analysis of the MBI Student Survey (MBI-SS) among US preclinical medical students showed that the three MBI scales function adequately but not optimally, so the authors recommend including additional items and increasing the number of response options from seven to nine [32]. Another study examined the MBI Human Service Surveys (MBI-HSS) among Dutch nursing graduates and found problems with disordered response ordering for all items, as well as redundancy of the personal accomplishment scale [33]. Finally, a study among UK pediatric oncology staff showed that emotional exhaustion and personal accomplishment seem to work well, but that the depersonalization subscale is problematic [34]. Hence, it appears that occasional Rasch studies with the MBI show mixed results, suggesting that MBI data do not fit the Rasch model unequivocally. Besides, these studies used specific occupational and student samples, so results cannot be generalized beyond these groups.

The current paper reports on a Rasch analysis of the BAT using two representative samples of the working population of the Netherlands and Flanders (Belgium), respectively. More specifically, the aims of this study are to evaluate: (a) the BAT’s construct validity using Rasch analysis; (b) whether the BAT’s four subscales can be combined into a single burnout score; (c) possible differential item functioning regarding gender, age and country.

Material and methods

Sample

Data come from two representative samples of national working populations in terms of age, gender and industry in the Netherlands (n = 1500) and Flanders (n = 1500), collected in the summer of 2017. Details about the sampling procedure and sample characteristics are described in the BAT test manual [21]. The study was reviewed and approved by the Social and Societal Ethics Committee (SMEC) of KU Leuven (https://www.kuleuven.be/english/research/ethics/committees/smec) on October 22, 2015 (reference number: G-2015 10 353). Before filling out the questionnaire, participants were informed about the purpose of the study, that participation was voluntary, that they could stop at any moment if they wished to do so, that questions could be directed to a contact person (name and email address provided) and that complaints could be filed with the ethical committee (email address provided). Participants declared that they agreed with these terms by clicking on “next”. This consent procedure was approved by the ethical committee.

This study only considered complete cases for the analyses. Complete cases on all items were obtained for n = 2978 (NL = 1500, FL = 1478). Given the large sample sizes it was possible to use cross-validation to check the robustness of the results. Also, equal sizes for each of the compared groups are recommended for the evaluation of differential item functioning (DIF, explained below), to ensure that in case of DIF the largest group does not dominate the estimation of parameters [35]. Differential item functioning was evaluated for country (NL/FL), gender (male/female), and age (under/above the median age of 41). Therefore, the total sample was divided into four homogeneous strata of men/NL, men/FL, women/NL and women/FL. Next, a random sample of 200 respondents from each stratum was drawn twice, resulting in two subsamples of 800 individuals each (Table 1). The median age in the two samples was 41.
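The stratified draw described above can be sketched as follows. This is a hedged illustration, not the procedure actually used: the respondent identifiers and stratum sizes are hypothetical, and the text does not state whether the two draws were allowed to overlap, so this sketch simply draws twice independently with different seeds.

```python
import random


def draw_stratified_sample(strata, per_stratum, seed):
    """Draw `per_stratum` respondents at random, without replacement, from every stratum."""
    rng = random.Random(seed)
    sample = []
    for name in sorted(strata):
        sample.extend(rng.sample(strata[name], per_stratum))
    return sample


# Hypothetical respondent IDs; the stratum sizes are made up for illustration.
strata = {
    "men/NL": [f"men/NL-{i}" for i in range(750)],
    "men/FL": [f"men/FL-{i}" for i in range(740)],
    "women/NL": [f"women/NL-{i}" for i in range(750)],
    "women/FL": [f"women/FL-{i}" for i in range(738)],
}

# Two independent draws of 4 x 200 = 800 respondents each (overlap is possible).
sample_1 = draw_stratified_sample(strata, per_stratum=200, seed=1)
sample_2 = draw_stratified_sample(strata, per_stratum=200, seed=2)
print(len(sample_1), len(sample_2))
```

Equal stratum sizes keep the country and gender groups balanced, which is the property recommended above for DIF evaluation.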

Table 1

Random samples used in Rasch analysis drawn from the representative samples of the working population of the Netherlands (NL) and Flanders (FL); count within each group.

Group        Sample 1 (n = 800)   Sample 2 (n = 800)
Women/NL     200                  200
Men/NL       200                  200
Women/FL     200                  200
Men/FL       200                  200
≤41 years    391                  404
>41 years    409                  396

Measure

The Burnout Assessment Tool (BAT) is a self-report questionnaire consisting of 23 items (see S1 Appendix) grouped in four subscales: exhaustion (8 items), mental distance (5 items), cognitive impairment (5 items), and emotional impairment (5 items). All items are expressed as statements with five frequency-based response categories (1 = never, 2 = rarely, 3 = sometimes, 4 = often, 5 = always). The total burnout score is calculated as the mean of all 23 items (range 1–5), and a high score is indicative of high levels of burnout. The BAT also contains two subscales for secondary symptoms of psychological distress and psychosomatic complaints (five items each, not analyzed in this study). Detailed information about the development of the BAT is given in the BAT test manual [20, 21].
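As an illustration of the scoring rule, the sketch below computes the four subscale means and the total score for one complete case. The item labels follow the EX/MD/CI/EI naming used for the BAT items in this paper; the dictionary layout and the function name are assumptions made for this example.

```python
from statistics import mean

# Item labels per subscale (8 + 5 + 5 + 5 = 23 items).
SUBSCALES = {
    "exhaustion": [f"EX{i}" for i in range(1, 9)],
    "mental_distance": [f"MD{i}" for i in range(1, 6)],
    "cognitive_impairment": [f"CI{i}" for i in range(1, 6)],
    "emotional_impairment": [f"EI{i}" for i in range(1, 6)],
}


def score_bat(responses):
    """Return subscale means and the total BAT score for one complete case.

    `responses` maps item labels to integer answers from 1 (never) to 5 (always).
    """
    items = [item for names in SUBSCALES.values() for item in names]
    missing = [item for item in items if item not in responses]
    if missing:
        raise ValueError(f"incomplete case, missing items: {missing}")
    for item in items:
        if responses[item] not in (1, 2, 3, 4, 5):
            raise ValueError(f"response to {item} must be an integer from 1 to 5")
    scores = {
        name: mean(responses[item] for item in names)
        for name, names in SUBSCALES.items()
    }
    # The total score is the mean of all 23 items, not the mean of subscale means.
    scores["total"] = mean(responses[item] for item in items)
    return scores


# Hypothetical respondent who answers "sometimes" (3) to every item.
example = {item: 3 for names in SUBSCALES.values() for item in names}
print(score_bat(example)["total"])
```

Because the exhaustion subscale has eight items and the others five, the mean of all 23 items weights exhaustion more heavily than a mean of the four subscale means would.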

The Rasch model

The goal of the Rasch analysis is to evaluate whether the observed data satisfy the assumptions of the Rasch model, in which case the measurement is construct valid. Important concepts in Rasch analysis are unidimensionality, monotonicity, invariance, DIF and local dependency.

Unidimensionality is a basic prerequisite for combining a set of items into a single burnout score, i.e. all items should represent a common latent trait. Monotonicity implies that the item responses are positively related to the latent trait. The response structure required by the Rasch model is a stochastically consistent item order; i.e. a probabilistic Guttman pattern [36]. This implies that persons experiencing higher levels of burnout are expected to get higher scores on the BAT and vice versa. Moreover, this pattern of responses needs to be observed across all response categories for each item. Analogously, increasing levels of severity of burnout across response categories for each item need to be reflected in the data. The invariance criterion implies that the items need to work invariantly across the whole burnout continuum for all individuals, i.e. the ratio between the location values (items’ positioning on the latent burnout logit scale) of any two items must be constant along the latent construct. Invariance also implies that the items need to work in the same way (invariantly) for all comparable groups, which is known as lack of DIF. The Rasch model contains only the latent variable and the items, and it is implicitly assumed that the model applies to all persons within a specific population. Thus, if a specific population contains both women and men, it is assumed that both the measurement model and the item parameters are the same for both groups. Simply put, given the same level of burnout, the scale should function in a similar way for both women and men.

Local independence implies that, having extracted the unidimensional latent trait of burnout, there should be no other meaningful patterns in the residuals [23]. Local independence may be violated by response dependency and/or multidimensionality [37], and such violations affect the fit of the data to the model. Response dependency can result in increased similarity of the responses of persons across items, so that responses are more Guttman-like than they should be under no dependency. In contrast, multidimensionality results in responses being less Guttman-like than they should be under no dependency [38].

Response dependency occurs when items are linked in some way so that the response to one item depends on the response to another item. A known example from rheumatology is when several items assessing walking ability are included in the same questionnaire. If a person is able to walk several miles without difficulty, then that person is also able to walk 1 mile or less without difficulty [23]. In this way, items are response-dependent as there is no other logical way in answering the two items. Another form of response dependency, known as redundancy dependency, may be caused by the degree of overlap of the content of two items, so that a particular rating for one item implies logically the same rating for another item, e.g. two items reflecting reversed statements such as “I feel tired” and “I feel alert” [39]. Logically, multidimensionality occurs when items are measuring more than one latent dimension.

Data analysis

Rasch analysis was performed on the two samples separately. A sample size of 800 is sufficient to yield a high degree of precision [40]. All analyses were done in RUMM2030 [41], where pairwise conditional maximum likelihood is used for computation of expected value estimates, based on the total raw scores (mean values of the 23 BAT items) and the actual response frequencies on each item, under the assumption that these observed scores fit the Rasch measurement model. Residuals, i.e. the differences between the model-expected values and the observed values, are scrutinized in several ways in order to evaluate whether the data fit the Rasch model, where both items and individuals can be ordered according to their burnout levels on a common logit (burnout) scale. The partial credit model was used, which allows the distances between thresholds to vary across items [42]. To control for the large number of comparisons, the significance level was set at 0.01 and Bonferroni adjusted.
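For readers unfamiliar with the partial credit model, the sketch below computes the category probabilities of a single polytomous item from a person location and the item's own thresholds. It illustrates the standard model form and is not RUMM2030's estimation code; the threshold values are hypothetical.

```python
import math


def pcm_category_probabilities(theta, thresholds):
    """Category probabilities for one item under the partial credit model.

    `theta` is the person location in logits and `thresholds` are the item's
    threshold locations; an item with m thresholds has m + 1 categories.
    """
    # Cumulative sums of (theta - threshold); the lowest category has sum 0.
    cumulative = [0.0]
    for tau in thresholds:
        cumulative.append(cumulative[-1] + (theta - tau))
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(cumulative)
    weights = [math.exp(value - peak) for value in cumulative]
    total = sum(weights)
    return [weight / total for weight in weights]


# Hypothetical ordered thresholds for a five-category item (never ... always).
thresholds = [-2.0, -1.0, 1.0, 2.0]
for theta in (-3.0, 0.0, 3.0):
    probabilities = pcm_category_probabilities(theta, thresholds)
    print(theta, [round(p, 3) for p in probabilities])
```

At a threshold, the two adjacent categories are equally likely, and because each item has its own list of thresholds, the spacing between categories is free to differ across items.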

The first step was to investigate the BAT’s construct validity: we fitted the Rasch model to all 23 items and evaluated whether the items within each subscale would cluster together in a residual correlation matrix in a pattern that is consistent with the underlying conceptualization of the BAT. When instruments consist of a bundle of items measuring different aspects of the latent trait, it is expected that the correlation matrix of residuals reveals the clustering of the items within each subscale [39]. Any residual correlation between the items 0.2 above the average observed correlation is indicative of local dependency [39]. Moreover, in this step, the functioning of each item is evaluated in terms of (a) threshold ordering (i.e. appropriateness of the response categories, evaluated graphically and by threshold estimates for each item); (b) discriminant ability (item fit residual within range of ±2.5); (c) a non-significant item χ² statistic; (d) local dependency (residual correlation matrix), and (e) absence of DIF for age, gender and country.
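The cut-off for local dependency can be sketched as below. The sketch assumes that the "average observed correlation" is the mean of all off-diagonal residual correlations; the small matrix and its values are hypothetical and serve only to show the mechanics.

```python
def flag_locally_dependent_pairs(residual_corr, labels, margin=0.2):
    """Flag item pairs whose residual correlation exceeds the average by `margin`.

    `residual_corr` is a symmetric matrix (list of lists) of residual
    correlations; only the pairs above the diagonal are used.
    """
    n = len(labels)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    average = sum(residual_corr[i][j] for i, j in pairs) / len(pairs)
    cutoff = average + margin
    flagged = [
        (labels[i], labels[j]) for i, j in pairs if residual_corr[i][j] > cutoff
    ]
    return cutoff, flagged


# Hypothetical residual correlations for two exhaustion and two distance items.
labels = ["EX1", "EX2", "MD1", "MD2"]
residual_corr = [
    [1.00, 0.30, -0.20, -0.20],
    [0.30, 1.00, -0.20, -0.20],
    [-0.20, -0.20, 1.00, 0.25],
    [-0.20, -0.20, 0.25, 1.00],
]
cutoff, flagged = flag_locally_dependent_pairs(residual_corr, labels)
print(round(cutoff, 3), flagged)
```

In this toy matrix only the within-subscale pairs are flagged, which is the clustering pattern the first analysis step looks for.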

The assumption of unidimensionality was tested by Smith’s test of unidimensionality [43]. For this test, first a principal component analysis (PCA) on residuals was performed. Next, items loading positively and negatively on the first principal component were used to obtain an independent person estimate. In the next step, independent t-tests for differences in these estimates for each person were performed [43]. Less than 5% of such tests being outside the range of ±1.96 supports the unidimensionality of the scale. A 95% binomial confidence interval of proportions [44] was used to show whether the lower limit of the observed proportion is below the 5% level [43]. When local dependency was detected we followed the method of combining correlated items into testlets, as recommended by Marais and colleagues [37, 38, 45]. This method combines correlated items into one or more testlets (preferably based on theoretical considerations) and the data are re-analyzed using testlets instead of individual items. Thus, the second step was to fit the model with the four testlets based on the BAT’s four subscales. The testlets’ model fit was compared with the fit obtained from the initial analysis of the individual items. The latent correlation among the subscales was also calculated, as well as the proportion of the non-error common variance accounted for when the testlets were added together to make a total score (also known as explained common variance) [45–47].
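The final steps of Smith's test, counting significant person-level t-tests and placing a confidence interval around the resulting proportion, can be sketched as follows. Two points are assumptions of this sketch: the t statistic is taken as the difference between the two person estimates divided by the root of the summed squared standard errors, and the Wilson score interval stands in for the binomial interval of reference [44], whose exact form is not specified here. The counts in the example are hypothetical.

```python
import math


def person_t(estimate_a, se_a, estimate_b, se_b):
    """t statistic comparing two person estimates obtained from two item subsets."""
    return (estimate_a - estimate_b) / math.sqrt(se_a ** 2 + se_b ** 2)


def count_significant(t_values, critical=1.96):
    """Return (number of |t| values above `critical`, number of tests)."""
    significant = sum(1 for t in t_values if abs(t) > critical)
    return significant, len(t_values)


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (assumed interval type)."""
    p = successes / n
    denominator = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denominator
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denominator
    return centre - half_width, centre + half_width


# Hypothetical outcome: 38 of 800 person-level t-tests fall outside +/-1.96.
lower, upper = wilson_interval(38, 800)
# Unidimensionality is supported when the lower limit is below 5%.
print(f"{38 / 800:.3f} ({lower:.3f}; {upper:.3f})", lower < 0.05)
```

The decision rule looks at the lower confidence limit only, so a scale can pass even when the observed proportion sits slightly above 5%.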

DIF was tested by conducting ANOVA of standardized residuals, which enables separate estimations of misfit along the latent trait, uniform and non-uniform DIF. It is important to distinguish between real and artificial DIF. As explained by Andrich and Hagquist [35, 48], artificial DIF is an artefact of the procedure for identifying DIF. Therefore, following their recommendation, DIF items detected by ANOVA were resolved sequentially; initial and resolved analyses were compared, and magnitude and impact of DIF were investigated [48, 49]. Real DIF can be dealt with by splitting a mis-fitting item into two items, e.g. one item for women, with missing values for men, and the other for men, with missing values for women, and subsequently reanalyzing the data. In addition to formal tests, DIF was also evaluated graphically by means of the item characteristic curve.
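The item-splitting step for resolving real DIF can be illustrated with a small data-handling sketch. The row layout, the function name and the group labels are assumptions made for this example; the subsequent re-estimation of the Rasch model is not shown.

```python
def split_item_by_group(rows, groups, item, group_a, group_b):
    """Replace `item` with two group-specific items.

    Each respondent keeps their answer on the item for their own group and
    gets a missing value (None) on the item for the other group.
    """
    split_rows = []
    for row, group in zip(rows, groups):
        new_row = {name: value for name, value in row.items() if name != item}
        new_row[f"{item}_{group_a}"] = row[item] if group == group_a else None
        new_row[f"{item}_{group_b}"] = row[item] if group == group_b else None
        split_rows.append(new_row)
    return split_rows


# Hypothetical responses of two people to two items.
rows = [{"EX1": 3, "EX2": 4}, {"EX1": 2, "EX2": 5}]
groups = ["women", "men"]
print(split_item_by_group(rows, groups, "EX1", "women", "men"))
```

After the split, each group-specific item receives its own location estimate, so the size of the DIF can be read from the difference between the two.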

The adequacy of the fit to the Rasch model was evaluated by means of three overall summary fit statistics. The item-trait interaction statistic was computed to test whether the hierarchical ordering of the items was invariant across the burnout trait. A non-significant value of this χ² statistic indicates invariance. Two other indices of the overall fit to the model are the means and standard deviations of the item and person residuals. These were computed and compared to the model-expected values of a mean of zero and an SD of 1.

The internal consistency of the scale and the power of the BAT scale to discriminate among respondents with different levels of burnout were evaluated with the Person Separation Index (PSI). The PSI ranges from 0 to 1 and is similar to Cronbach’s alpha.
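A common formulation of the PSI is the share of the variance in person estimates that is not attributable to measurement error. The sketch below uses that formulation; whether RUMM2030 computes the index in exactly this way is an assumption, and the person estimates and standard errors shown are hypothetical.

```python
from statistics import mean, variance


def person_separation_index(estimates, standard_errors):
    """Share of the variance in person estimates not due to measurement error."""
    observed_variance = variance(estimates)  # sample variance of person locations
    error_variance = mean([se ** 2 for se in standard_errors])
    return (observed_variance - error_variance) / observed_variance


# Hypothetical person locations (logits) and their standard errors.
estimates = [-1.0, 0.0, 1.0]
standard_errors = [0.5, 0.5, 0.5]
print(person_separation_index(estimates, standard_errors))
```

The index rises when persons are spread widely relative to the precision of their estimates, which is what allows a scale to discriminate among respondents.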

Targeting (distribution on a logit scale) of the BAT items and persons in the sample was evaluated graphically in a person-item-threshold graph. Targeting is an aspect of how well the items are targeted for severity levels of burnout as reported by the respondents. In a person-item-threshold graph the distribution of the person parameter estimates is compared with the distribution of the item thresholds. In that way, thresholds which are extreme compared to persons can be identified, as they provide little information in the population. This is important for the precision of person parameter estimates. In other words, responses to such items will have little impact on the precision of the person estimates as these items are out of target. For a well-targeted instrument, the mean location for persons would be around the value of zero.

Finally, in case of good fit to the model, Rasch person estimates, which are logits, can be transformed into a convenient range (henceforth referred to as metric score) [50].
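One simple way to obtain such a metric score is a linear rescaling of the logit range, sketched below. This is an assumption for illustration: the transformation in reference [50] and the target range used in S3 Appendix are not spelled out in this section, and the 0–100 range and logit bounds are hypothetical.

```python
def logits_to_metric(logit, logit_min, logit_max, metric_min=0.0, metric_max=100.0):
    """Linearly rescale a person estimate in logits to a chosen metric range."""
    if logit_max <= logit_min:
        raise ValueError("logit_max must be greater than logit_min")
    fraction = (logit - logit_min) / (logit_max - logit_min)
    return metric_min + fraction * (metric_max - metric_min)


# Hypothetical logit bounds corresponding to the lowest and highest raw scores.
for logit in (-4.0, 0.0, 4.0):
    print(logit, logits_to_metric(logit, logit_min=-4.0, logit_max=4.0))
```

A linear transformation preserves the interval properties of the logit scale, so differences between metric scores keep the same meaning along the whole range.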

Results

Rasch analysis on sample 1

In the first step, the Rasch model was fitted to all 23 items. The residual correlation matrix between the items is found in S2 Appendix in Table A1. Observed residual correlations indicated violation of local independence. As expected, correlations higher than expected under the condition of local independence (in our sample a value >0.16) were found for most of the item pairs within each subscale and none between different subscales; exhaustion: EX1-EX4, EX1-EX8, EX3-EX4, EX3-EX5, EX3-EX8, EX4-EX7, EX4-EX8, EX7-EX8; mental distance: MD1-MD3, MD1-MD4, MD1-MD5, MD2-MD3, MD2-MD4, MD3-MD4, MD4-MD5; cognitive impairment: all pairs; and emotional impairment: EI1-EI2, EI1-EI4, EI1-EI5, EI2-EI4, EI2-EI5, EI3-EI4, EI3-EI5, EI4-EI5. Smith’s test confirmed the presence of multidimensionality, as the percentage of significant t-tests was 20.9 (CI 18.2;23.9), and thus confirmed the patterns observed in the correlation matrix (see Table 3, BAT 23 items). Overall fit statistics are presented in Table 3. The analysis on all 23 BAT items indicated poor fit to the model, with a significant χ² statistic and high standard deviations for the person and item fit residuals.

Table 3

Overall fit statistics in sample 1 and sample 2 (n = 800 each) and total sample of 2978.

                       Item residual    Person residual   Chi square                Unidimensionality
Analysis name          Mean     SD      Mean     SD       Value     p       PSI     Test % (95% CI)
Sample 1 n = 800
BAT 23 items           -0.15    2.91    -0.86    2.86     416.51            0.95    20.9 (18.2;23.9)
BAT 4 testlets         0.07     1.17    -0.52    1.13     56.58     0.016   0.85    4.8 (3.5;6.6)
Sample 2 n = 800
BAT 23 items           0.03     2.52    -0.77    2.41     378.55            0.95    22.8 (20.0;26.0)
BAT 4 testlets         0.22     1.73    -0.54    1.12     40.75     0.26    0.83    4.6 (3.3;6.3)
Total sample n = 2978
BAT 23 items           -0.28    5.17    -0.80    2.48     970.30            0.95    21.1 (19.6;22.6)
BAT 4 testlets         -0.01    2.38    -0.53    1.1      109.65            0.85    4.4 (3.7;5.2)
Ideal values           0.0      .4      0.0      .4                 >0.01   >0.7    (LCI <5%)

Analyses on item level showed that all items had ordered thresholds. As seen in Fig 1, displayed as an illustrative example for item EX1, the probability of the response category never was highest at the lowest level of the latent estimate of burnout (person locations) and decreased when moving along the logit scale. In a similar way, the probability of choosing response categories implying higher levels of burnout increased with increasing levels of latent estimates of burnout.

Fig 1

Category probability curves for the item EX1 (“At work, I feel mentally exhausted”).

Item fit residuals outside the predefined range of ±2.5 were observed for exhaustion items EX2, EX7 and EX8, mental distance items MD1, MD2 and MD3, cognitive impairment item CI2 and emotional impairment items EI3, EI4 and EI5 (Table 2). Among those, only items EX7 and MD2 showed a significant χ² statistic. High positive and negative fit residual values are indicative of under- and over-discrimination of items, respectively. However, visual examination of the item characteristic curves (ICC) showed that the observed values in most cases were located close to the expected value, as shown in Fig 2 for item EX2 as an illustrative example (the solid line represents expected values and dots are observed values within different class intervals). Table 2 shows item locations (i.e. the mean of threshold estimates), with a higher item location representing more severe burnout symptoms.