This Glossary was developed by Robin Christensen and Francis Guillemin, 2012, based on the COSMIN documents.

Internal consistency

Internal consistency is defined as the interrelatedness among the items; assessing it includes establishing whether the instrument is based on a reflective or a formative model. Internal consistency, which presupposes unidimensionality, is only relevant in a reflective model.

A reflective model is a model in which all items are a manifestation of the same underlying construct. These items are called effect indicators and are expected to be highly correlated and interchangeable.

In a formative model, the items only together form a construct. These items do not need to be correlated.

Therefore, internal consistency is not relevant for items that form a formative model. If very different items are needed to assess the overall construct (e.g., emotional stress), these items do not need to be correlated, and internal consistency is therefore not relevant for such an instrument.

Unidimensionality of a scale (under a reflective model) can be investigated with, for example, a factor analysis, but not with an assessment of internal consistency. Unidimensionality of a scale is a prerequisite for a clear interpretation of the internal consistency statistics.
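As an illustration of an internal consistency statistic, the sketch below computes Cronbach's alpha. The glossary does not name a specific statistic, so the choice of alpha is an assumption, and the item scores are hypothetical. Consistent with the text above, a high alpha does not demonstrate unidimensionality; it is only interpretable once unidimensionality has been established by other means.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for a list of items.

    `items` holds one list of scores per item; every list covers the
    same respondents in the same order.
    """
    k = len(items)
    if k < 2:
        raise ValueError("need at least two items")
    # Total score per respondent (sum across items).
    totals = [sum(scores) for scores in zip(*items)]
    item_variance_sum = sum(variance(item) for item in items)
    total_variance = variance(totals)
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)


# Hypothetical data: 3 items answered by 5 respondents.
item_scores = [
    [1, 2, 3, 4, 5],
    [2, 3, 4, 5, 6],
    [1, 3, 3, 5, 5],
]
alpha = cronbach_alpha(item_scores)
print(f"Cronbach's alpha: {alpha:.3f}")
```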

Content validity

For a patient-reported outcome (PRO) instrument to have “content validity”, it needs to truthfully reflect what it is supposed to measure. Is the output from a given PRO instrument an adequate reflection of the construct to be measured (e.g., pain)?

The assessment of content validity necessitates making a (clinical) judgment about the relevance and the comprehensiveness of the items.

Relevance: Judging whether the items are relevant for (i) the construct to be measured, (ii) the study population, and (iii) the purpose of the PRO. It is important that the PRO was developed and/or subsequently tested in the target population; if the instrument is later used in a population other than the one for which it was developed, it should be assessed whether all items are relevant for this new study population. Despite the use of technical terms in the psychometrics literature, when the focus is on content validity of a PRO instrument, patients should always be considered as experts when judging the relevance of the items for the patient population.

Comprehensiveness: To assess the comprehensiveness of the items, the following should be taken into account: the content coverage of the items, the description of the domains, and the theoretical foundation. If the items and the domains cover all aspects of the construct, we can feel confident about the coverage and description of the domains. The theoretical foundation refers to the availability of a clear description of the construct, and the theory on which it is based. An indication that the comprehensiveness of the items was assessed could be that patients or experts were asked whether they missed items. Large floor and ceiling effects could be an indication that a scale is not comprehensive.
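Floor and ceiling effects can be quantified as the share of respondents who obtain the lowest or the highest possible score. The following is a minimal sketch with hypothetical scores on an assumed 0 to 100 scale; the glossary does not specify what counts as "large", so no cut-off is applied here.

```python
def floor_ceiling(scores, min_score, max_score):
    """Return the proportions of respondents at the scale minimum and maximum."""
    n = len(scores)
    floor = sum(1 for s in scores if s == min_score) / n
    ceiling = sum(1 for s in scores if s == max_score) / n
    return floor, ceiling


# Hypothetical total scores on a scale that runs from 0 to 100.
scores = [0, 0, 0, 10, 25, 40, 55, 100, 100, 100]
floor, ceiling = floor_ceiling(scores, min_score=0, max_score=100)
print(f"floor: {floor:.0%}, ceiling: {ceiling:.0%}")
```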

Construct validity

Construct validity is the degree to which the scores of a PRO instrument are consistent with hypotheses based on the assumption that the PRO instrument validly measures the construct to be measured. Construct validity has three aspects: structural validity, which concerns the internal relationships, and hypotheses testing and cross-cultural validity, which both concern the relationships to scores of other instruments or differences between relevant groups.

Specific hypotheses to be tested should be formulated a priori about expected mean differences between known groups or expected correlations between the scores on the instrument and other variables. The expected direction (positive or negative) and magnitude (absolute or relative) of the correlations or differences should be included in the hypotheses. For example, an investigator may theorize that two PROs intended to assess the same construct should correlate. The investigator would then test whether the observed correlation meets the expected correlation (e.g. > 0.70). The hypotheses may also concern the relative magnitude of correlations, for example “it is expected that the score on measure A correlates higher (e.g. 0.10 higher) with the score on measure B than with the score on measure C”.
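The two example hypotheses above can be checked as in the following sketch. The thresholds 0.70 and 0.10 are taken from the examples in the text; the scores, the function names, and the use of the Pearson correlation are assumptions made for illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def check_hypotheses(r_ab, r_ac, min_r=0.70, min_gap=0.10):
    """Compare observed correlations with hypotheses stated a priori."""
    return {
        "A-B correlation above expected magnitude": r_ab > min_r,
        "A-B exceeds A-C by expected gap": (r_ab - r_ac) >= min_gap,
    }


# Hypothetical scores of five patients on three measures.
measure_a = [10, 20, 30, 40, 50]
measure_b = [12, 18, 33, 41, 48]
measure_c = [30, 10, 50, 20, 40]

r_ab = pearson_r(measure_a, measure_b)
r_ac = pearson_r(measure_a, measure_c)
results = check_hypotheses(r_ab, r_ac)
for hypothesis, confirmed in results.items():
    print(f"{hypothesis}: {'confirmed' if confirmed else 'rejected'}")
```

Note that no p-value is computed: each hypothesis is judged only on the direction and magnitude of the observed correlations.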

A hypothesis can also concern differences in scores between groups. When assessing differences between groups, it is less relevant whether these differences are statistically significant (which depends on the sample size) than whether they have the expected magnitude. It is preferable to specify a minimally important between-group difference, i.e., to use a statistical indicator that does not depend on sample size (e.g. the magnitude of a correlation coefficient or the magnitude of a difference). For the same reason, p-values should be avoided in the hypotheses, because it is not relevant to examine whether correlations or differences differ statistically from zero. Thus, formal hypothesis testing is preferably based on the expected magnitude of correlations and differences, rather than on p-values.
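A known-groups hypothesis of this kind can be evaluated on the magnitude of the difference alone, as in this minimal sketch. The two groups, their scores, and the pre-specified minimal difference of 15 points are hypothetical.

```python
from statistics import mean


def known_groups_check(group_high, group_low, expected_min_diff):
    """Compare the observed mean difference with a pre-specified magnitude."""
    diff = mean(group_high) - mean(group_low)
    return diff, diff >= expected_min_diff


# Hypothetical scores for patients with active disease and in remission.
active = [60, 65, 70, 75]
remission = [40, 45, 50, 55]

diff, confirmed = known_groups_check(active, remission, expected_min_diff=15)
print(f"observed difference: {diff}, hypothesis confirmed: {confirmed}")
```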

Criterion validity

Criterion validity is defined as the degree to which the scores of a PRO instrument are an adequate reflection of “the truth” in the form of a “gold standard”. Thus, the criterion used should be considered a reasonable “gold standard”.

Usually, however, no gold standards exist for PRO instruments, and therefore the whole aspect of criterion validity might rightfully be excluded from any “PRO validity” checklist. This argument is important because criterion validity is something very different from construct validity. Authors often wrongly consider their comparator instrument a gold standard, for example when they compare the scores of a new instrument to a widely used instrument like the SF-36. When the new instrument is compared to the SF-36, we consider it construct validation, and hypotheses about the expected magnitude and direction of the correlation between (subscales of) the instruments should be formulated and tested.

The COSMIN panel decided that the only exception is when a shortened PRO instrument is compared to the original long version. In that case, the original long version can be considered the gold standard per se.


Responsiveness

Responsiveness is defined as the ability of a PRO instrument to detect change over time in the construct to be measured. The only difference between cross-sectional (construct) validity and responsiveness is that validity refers to the validity of a single score, and responsiveness refers to the validity of a change score.

The standards for responsiveness should be analogous to the standards for construct validity. Appropriate measures to evaluate responsiveness are the same as those for hypotheses testing and criterion validity, with the only difference that hypotheses should focus on the change score of an instrument. For example, one could calculate correlations between the change scores on the different measurement instruments and interpret whether the correlations are as expected.

A number of statistical parameters have been proposed in the literature to assess responsiveness, but there is currently no consensus about the best one to apply. Paired effect sizes such as the standardised response mean (mean change score / SD of the change score), Norman’s responsiveness coefficient (SD² change / (SD² change + SD² error)), and the relative efficacy statistic ((t₁/t₂)², the squared ratio of two t-statistics) have previously been used with some success. Guyatt’s responsiveness ratio (MIC / SD of the change score of stable patients) has also been considered, but it comes with a caveat because it takes the minimal important change (MIC) into account.
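The formulas listed above can be written out as in the sketch below. All input values are hypothetical, the function names are chosen for illustration, and, since there is no consensus on the best parameter, none of these should be read as a recommended measure of responsiveness.

```python
from statistics import mean, stdev


def standardised_response_mean(change_scores):
    """Mean change score divided by the SD of the change scores."""
    return mean(change_scores) / stdev(change_scores)


def norman_coefficient(var_change, var_error):
    """Variance due to change divided by change plus error variance."""
    return var_change / (var_change + var_error)


def relative_efficacy(t_1, t_2):
    """Squared ratio of the t-statistics of two instruments."""
    return (t_1 / t_2) ** 2


def guyatt_ratio(mic, stable_change_scores):
    """Minimal important change divided by the SD of change in stable patients."""
    return mic / stdev(stable_change_scores)


# Hypothetical change scores (follow-up minus baseline).
changed_patients = [2, 4, 6, 8, 10]
stable_patients = [-1, 0, 1, 0, 0]

srm = standardised_response_mean(changed_patients)
norman = norman_coefficient(var_change=3.0, var_error=1.0)
rel_eff = relative_efficacy(t_1=4.0, t_2=2.0)
guyatt = guyatt_ratio(mic=2.0, stable_change_scores=stable_patients)
print(f"SRM={srm:.2f}, Norman={norman:.2f}, RE={rel_eff:.2f}, Guyatt={guyatt:.2f}")
```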

Statistical tests such as the paired t-test are, however, considered inappropriate, because they measure statistically significant change rather than valid change and depend on the sample size of the study. These measures are considered measures of the magnitude of change due to an intervention or other event, rather than measures of the quality of the measurement instrument.

Finally, it is generally agreed that the ‘minimal important change’ concerns the interpretation of the change score, but not the validity of the change score.