
Reliability and Validity – Definitions, Types & Examples

Published by Alvin Nicolas on August 16th, 2021; revised on October 26, 2023

A researcher must test the collected data before drawing any conclusions. Every research design needs to address reliability and validity, because they determine the quality of the research.

What is Reliability?

Reliability refers to the consistency of a measurement. It shows how trustworthy the scores of a test are. If the collected data show the same results after being tested with different methods and sample groups, the information is reliable. Reliability is a precondition for validity, but a reliable method does not automatically produce valid results.

Example: If you weigh yourself on a weighing scale throughout the day, you’ll get the same results. These are considered reliable results obtained through repeated measures.

Example: A teacher gives her students a maths test and repeats it the next week with the same questions. If the students obtain the same scores, the reliability of the test is high.

What is Validity?

Validity refers to the accuracy of the measurement. It shows how suitable a specific test is for a particular situation. If the results correspond to the researcher’s situation, explanation, and prediction, then the research is valid.

If the method of measuring is accurate, it will produce accurate results. A method that is not reliable cannot be valid, but reliability alone does not guarantee validity.

Example: Your weighing scale shows a different result each time you weigh yourself during the day, even though you handle it carefully and weigh yourself before and after meals. The scale is probably malfunctioning, so the method has low reliability, and the inconsistent results cannot be valid either.

Example: Suppose a questionnaire about the quality of a skincare product is distributed to one group of people and then repeated with several other groups. If you get the same responses from the various participants, the questionnaire has high reliability, which strengthens the case for its validity.

Validity is often difficult to establish even when the measurement process is reliable, because it is not easy to judge whether the results reflect the real situation.

Example: If the weighing scale shows the same result each time, say 70 kg, even though your actual weight is 55 kg, the scale is miscalibrated. Because it gives consistent readings it is reliable, but because the readings are wrong it is not valid.
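To make the distinction concrete, here is a minimal Python sketch (with made-up numbers, not data from any real study) that simulates three weighing scales: one reliable and valid, one reliable but biased, and one unreliable. A small spread across repeated readings indicates reliability; a small gap between the average reading and the true weight indicates validity.

```python
import numpy as np

rng = np.random.default_rng(0)
true_weight = 55.0  # kg, the value we are trying to measure

# Simulated repeated readings from three hypothetical scales
reliable_valid = true_weight + rng.normal(0.0, 0.1, size=10)    # consistent and accurate
reliable_biased = 70.0 + rng.normal(0.0, 0.1, size=10)          # consistent but biased
unreliable = true_weight + rng.normal(0.0, 8.0, size=10)        # inconsistent

for name, readings in [("reliable & valid", reliable_valid),
                       ("reliable, not valid", reliable_biased),
                       ("unreliable", unreliable)]:
    spread = readings.std(ddof=1)          # low spread -> reliable
    bias = readings.mean() - true_weight   # small bias -> valid
    print(f"{name:20s} spread = {spread:5.2f} kg   bias = {bias:+6.2f} kg")
```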

Internal Vs. External Validity

One of the key features of randomised designs is their high internal validity; with representative sampling, they can also achieve good external validity.

Internal validity is the ability to draw a causal link between your treatment and the dependent variable of interest. The observed changes should be due to the experiment itself, and no external factor should influence the variables.

Examples of such factors include age, height, and grade level.

External validity  is the ability to identify and generalise your study outcomes to the population at large. The relationship between the study’s situation and the situations outside the study is considered external validity.



Threats to Internal Validity

Threat | Definition | Example
Confounding factors | Unexpected events during the experiment that are not part of the treatment. | You attribute the increased weight of your participants to a lack of physical activity, when it was actually due to their consumption of sugary coffee.
Maturation | Changes in participants over the course of the study (rather than the treatment) influence the outcome. | During a long-term experiment, subjects may become tired, bored, or hungry.
Testing | The results of one test affect the results of a later test. | Participants of the first experiment may react differently during the second experiment.
Instrumentation | Changes in the calibration or use of the measuring instrument over the course of the study. | A change in the measuring instrument may give different results than expected.
Statistical regression | Groups selected on the basis of extreme scores are less extreme on subsequent testing. | Students who failed the pre-final exam are likely to pass the final exam; they might be more confident and conscientious than before.
Selection bias | Comparison groups are chosen without randomisation. | A group of trained and efficient teachers is selected to teach children communication skills instead of being chosen randomly.
Experimental mortality | Participants drop out when the experiment runs on longer than expected. | Because of multi-tasking and varying levels of competition, participants may leave the study out of dissatisfaction with the time extension, even if they were doing well.

Threats to External Validity

Threat | Definition | Example
Reactive/interactive effects of testing | A pre-test makes participants more aware of, or sensitive to, the treatment, so the treatment may not work the same way for people who were never pre-tested. | Students who failed the pre-final exam are likely to pass the final exam; they might be more confident and conscientious than before.
Selection of participants | Participants are selected for specific characteristics, and the treatment may only work on people who possess those characteristics. | If an experiment is conducted specifically on the health issues of pregnant women, the same treatment cannot be given to male participants.

How to Assess Reliability and Validity?

Reliability can be assessed by comparing the consistency of the procedure and its results. There are various methods to measure validity and reliability. Reliability can be estimated through various statistical methods depending on the type of reliability, as explained below:

Types of Reliability

Type of reliability | What does it measure? | Example
Test-retest | The consistency of results at different points in time; it identifies whether the results stay the same after repeated measurement. | Suppose a questionnaire about the quality of a skincare product is distributed to a group of people and then repeated with several other groups. If the various groups give the same responses, the questionnaire has high test-retest reliability.
Inter-rater | The consistency of results obtained at the same time by different raters (researchers). | Suppose five researchers measure the academic performance of the same student with questions drawn from all the academic subjects and submit very different results. This shows that the assessment has low inter-rater reliability.
Parallel forms | Equivalence: different forms of the same test are given to the same participants. | Suppose the same researcher administers two different forms of a test on the same topic to the same students, for example a written and an oral test. If the results agree, the parallel-forms reliability of the test is high; if they differ, it is low.
Internal consistency | The consistency of the measurement within the test itself: the results of the same test are split into two halves and compared with each other. | If there is a large difference between the two halves, the internal consistency (split-half reliability) of the test is low.

Types of Validity

As discussed above, the reliability of a measurement alone cannot establish its validity. Validity is difficult to measure even if the method is reliable. The following types of tests are used to assess validity.

Type of validity | What does it measure? | Example
Content validity | Whether all aspects of the concept being tested are covered. | A language test designed to measure reading, writing, listening, and speaking skills covers the whole construct, which indicates high content validity.
Face validity | Whether the test appears, on the surface, to be a suitable measure; it concerns the appearance and procedure of the test. | The type of questions included in the question paper, the time and marks allotted, and the number and categories of questions: does it look like a good paper for measuring students’ academic performance?
Construct validity | Whether the test measures the intended construct (ability, attribute, trait, or skill). | Is a test designed to measure communication skills actually measuring communication skills?
Criterion validity | Whether the test scores correspond to other measures of the same concept. | If the results of a graduate’s pre-final exam accurately predict the results of the later final exam, the test has high criterion validity.


How to Increase Reliability?

  • Use an appropriate questionnaire to measure the competency level.
  • Ensure a consistent environment for participants.
  • Make the participants familiar with the criteria of assessment.
  • Train the participants appropriately.
  • Analyse the research items regularly to avoid poor performance.

How to Increase Validity?

Ensuring validity is not an easy job either. Some practical ways to improve validity are given below:

  • Minimise reactivity as a first concern.
  • The Hawthorne effect should be reduced.
  • The respondents should be motivated.
  • The intervals between the pre-test and post-test should not be lengthy.
  • Dropout rates should be minimised.
  • The inter-rater reliability should be ensured.
  • Control and experimental groups should be matched with each other.

How to Implement Reliability and Validity in your Thesis?

According to experts, it is helpful to implement the concepts of reliability and validity, especially in a thesis or dissertation. A way of addressing them in each section is given below:

Segment | Explanation
Methodology | Discuss all the planning around reliability and validity here, including the chosen sample and its size and the techniques used to measure reliability and validity.
Results | Report the level of reliability and validity of your results and their influence on the values obtained.
Discussion | Discuss the contribution of other researchers to improving reliability and validity.

Frequently Asked Questions

What is reliability and validity in research?

Reliability in research refers to the consistency and stability of measurements or findings. Validity relates to the accuracy and truthfulness of results, measuring what the study intends to. Both are crucial for trustworthy and credible research outcomes.

What is validity?

Validity in research refers to the extent to which a study accurately measures what it intends to measure. It ensures that the results are truly representative of the phenomena under investigation. Without validity, research findings may be irrelevant, misleading, or incorrect, limiting their applicability and credibility.

What is reliability?

Reliability in research refers to the consistency and stability of measurements over time. If a study is reliable, repeating the experiment or test under the same conditions should produce similar results. Without reliability, findings become unpredictable and lack dependability, potentially undermining the study’s credibility and generalisability.

What is reliability in psychology?

In psychology, reliability refers to the consistency of a measurement tool or test. A reliable psychological assessment produces stable and consistent results across different times, situations, or raters. It ensures that an instrument’s scores are not due to random error, making the findings dependable and reproducible in similar conditions.

What is test retest reliability?

Test-retest reliability assesses the consistency of measurements taken by a test over time. It involves administering the same test to the same participants at two different points in time and comparing the results. A high correlation between the scores indicates that the test produces stable and consistent results over time.

How to improve reliability of an experiment?

  • Standardise procedures and instructions.
  • Use consistent and precise measurement tools.
  • Train observers or raters to reduce subjective judgments.
  • Increase sample size to reduce random errors.
  • Conduct pilot studies to refine methods.
  • Repeat measurements or use multiple methods.
  • Address potential sources of variability.

What is the difference between reliability and validity?

Reliability refers to the consistency and repeatability of measurements, ensuring results are stable over time. Validity indicates how well an instrument measures what it’s intended to measure, ensuring accuracy and relevance. While a test can be reliable without being valid, a valid test must inherently be reliable. Both are essential for credible research.

Are interviews reliable and valid?

Interviews can be both reliable and valid, but they are susceptible to biases. The reliability and validity depend on the design, structure, and execution of the interview. Structured interviews with standardised questions improve reliability. Validity is enhanced when questions accurately capture the intended construct and when interviewer biases are minimised.

Are IQ tests valid and reliable?

IQ tests are generally considered reliable, producing consistent scores over time. Their validity, however, is a subject of debate. While they effectively measure certain cognitive skills, whether they capture the entirety of “intelligence” or predict success in all life areas is contested. Cultural bias and over-reliance on tests are also concerns.

Are questionnaires reliable and valid?

Questionnaires can be both reliable and valid if well-designed. Reliability is achieved when they produce consistent results over time or across similar populations. Validity is ensured when questions accurately measure the intended construct. However, factors like poorly phrased questions, respondent bias, and lack of standardisation can compromise their reliability and validity.



Statistics By Jim

Making statistics intuitive

Reliability vs Validity: Differences & Examples

By Jim Frost

Reliability and validity are criteria by which researchers assess measurement quality. Measuring a person or item involves assigning scores to represent an attribute. This process creates the data that we analyze. However, to provide meaningful research results, that data must be good. And not all data are good!


For data to be good enough to allow you to draw meaningful conclusions from a research study, they must be reliable and valid. What are the properties of good measurements? In a nutshell, reliability relates to the consistency of measures, and validity addresses whether the measurements are quantifying the correct attribute.

In this post, learn about reliability vs. validity, their relationship, and the various ways to assess them.

Learn more about Experimental Design: Definition, Types, and Examples .

Reliability

Reliability refers to the consistency of the measure. High reliability indicates that the measurement system produces similar results under the same conditions. If you measure the same item or person multiple times, you want to obtain comparable values. They are reproducible.

If you take measurements multiple times and obtain very different values, your data are unreliable. Numbers are meaningless if repeated measures do not produce similar values. What’s the correct value? No one knows! This inconsistency hampers your ability to draw conclusions and understand relationships.

Suppose you have a bathroom scale that displays very inconsistent results from one time to the next. It’s very unreliable. It would be hard to use your scale to determine your correct weight and to know whether you are losing weight.

Inadequate data collection procedures and low-quality or defective data collection tools can produce unreliable data. Additionally, some characteristics are more challenging to measure reliably. For example, the length of an object is concrete. On the other hand, psychological constructs, such as conscientiousness, depression, and self-esteem, can be trickier to measure reliably.

When assessing studies, evaluate data collection methodologies and consider whether any issues undermine their reliability.

Validity

Validity refers to whether the measurements reflect what they’re supposed to measure. This concept is a broader issue than reliability. Researchers need to consider whether they’re measuring what they think they’re measuring, or whether the measurements reflect something else. Does the instrument measure what it says it measures? This question addresses the appropriateness of the data rather than whether the measurements are repeatable.

Validity is a smaller concern for tangible measurements like height and weight. You might have a biased bathroom scale if it tends to read too high or too low—but it still measures weight. Validity is a bigger concern in the social sciences, where you can measure elusive concepts such as positive outlook and self-esteem. If you’re assessing the psychological construct of conscientiousness, you need to confirm that the instrument poses questions that appraise this attribute rather than, say, obedience.

Reliability vs Validity

A measurement must be reliable first before it has a chance of being valid. After all, if you don’t obtain consistent measurements for the same object or person under similar conditions, it can’t be valid. If your scale displays a different weight every time you step on it, it’s unreliable, and it is also invalid.

So, having reliable measurements is the first step towards having valid measures. Reliability is necessary for validity, but it is insufficient by itself.

Suppose you have a reliable measurement. You step on your scale a few times in a short period, and it displays very similar weights. It’s reliable. But the weight might be incorrect.

Just because you can measure the same object multiple times and get consistent values, it does not necessarily indicate that the measurements reflect the desired characteristic.

How can you determine whether measurements are both valid and reliable? Assessing reliability vs. validity is the topic for the rest of this post!

Reliability | Validity
Similar measurements for the same person/item under the same conditions. | Measurements reflect what they’re supposed to measure.
Stability of results across time, between observers, and within the test. | Measures have appropriate relationships to theories, similar measures, and different measures.
Unreliable measurements typically cannot be valid. | Valid measurements are also reliable.

How to Assess Reliability

Reliability relates to measurement consistency. To evaluate reliability, analysts assess consistency over time, within the measurement instrument, and between different observers. These types of consistency are known as test-retest, internal, and inter-rater reliability, respectively. Typically, appraising these forms of reliability involves taking multiple measures of the same person, object, or construct and assessing scatterplots and correlations of the measurements. Reliable measurements have high correlations because the scores are similar.

Test-Retest Reliability

Analysts often assume that measurements should be consistent across a short time. If you measure your height twice over a couple of days, you should obtain roughly the same measurements.

To assess test-retest reliability, the experimenters typically measure a group of participants on two occasions within a few days. Usually, you’ll evaluate the reliability of the repeated measures using scatterplots and correlation coefficients . You expect to see high correlations and tight lines on the scatterplot when the characteristic you measure is consistent over a short period, and you have a reliable measurement system.

This type of reliability establishes the degree to which a test can produce stable, consistent scores across time. However, in practice, measurement instruments are never entirely consistent.

Keep in mind that some characteristics should not be consistent across time. A good example is your mood, which can change from moment to moment. A test-retest assessment of mood is not likely to produce a high correlation even though it might be a useful measurement instrument.
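As a rough illustration of the approach described above, the sketch below uses hypothetical scores for ten participants measured on two occasions and summarises test-retest reliability with a Pearson correlation. All numbers are invented purely for demonstration.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical scores for the same ten participants measured on two occasions
time_1 = np.array([12, 15, 9, 22, 17, 14, 20, 11, 18, 16])
time_2 = np.array([13, 14, 10, 21, 18, 15, 19, 12, 17, 17])

# Test-retest reliability is usually summarised by the correlation between
# the two occasions; values near 1 suggest stable, reproducible scores.
r, p = pearsonr(time_1, time_2)
print(f"test-retest correlation r = {r:.2f} (p = {p:.3f})")
```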

Internal Reliability

This type of reliability assesses consistency across items within a single instrument. Researchers evaluate internal reliability when they’re using instruments such as a survey or personality inventories. In these instruments, multiple items relate to a single construct. Questions that measure the same characteristic should have a high correlation. People who indicate they are risk-takers should also note that they participate in dangerous activities. If items that supposedly measure the same underlying construct have a low correlation, they are not consistent with each other and might not measure the same thing.

Inter-Rater Reliability

This type of reliability assesses consistency across different observers, judges, or evaluators. When various observers produce similar measurements for the same item or person, their scores are highly correlated. Inter-rater reliability is essential when the subjectivity or skill of the evaluator plays a role. For example, assessing the quality of a writing sample involves subjectivity. Researchers can employ rating guidelines to reduce subjectivity. Comparing the scores from different evaluators for the same writing sample helps establish the measure’s reliability. Learn more about inter-rater reliability .
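One common statistic for inter-rater agreement on categorical ratings is Cohen’s kappa, mentioned again later on this page. Below is a minimal sketch that computes it from scratch for two hypothetical raters; the ratings are invented for illustration only.

```python
import numpy as np

# Hypothetical quality ratings (0 = weak, 1 = adequate, 2 = strong)
# assigned by two raters to the same ten writing samples
rater_a = np.array([2, 1, 0, 2, 1, 1, 2, 0, 1, 2])
rater_b = np.array([2, 1, 0, 1, 1, 1, 2, 0, 2, 2])

def cohens_kappa(a, b):
    """Agreement between two raters, corrected for chance agreement."""
    categories = np.union1d(a, b)
    p_observed = np.mean(a == b)
    # Chance agreement: product of each rater's marginal proportions per category
    p_chance = sum(np.mean(a == c) * np.mean(b == c) for c in categories)
    return (p_observed - p_chance) / (1 - p_chance)

print(f"Cohen's kappa = {cohens_kappa(rater_a, rater_b):.2f}")
```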

Related post : Interpreting Correlation

Cronbach’s Alpha

Cronbach’s alpha measures the internal consistency, or reliability, of a set of survey items. Use this statistic to help determine whether a collection of items consistently measures the same characteristic. Learn more about Cronbach’s Alpha .
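As a rough sketch of how this looks in practice, the function below computes Cronbach’s alpha from a small matrix of hypothetical Likert responses using the standard formula alpha = k/(k-1) * (1 - sum of item variances / variance of total scores). The data are made up for illustration.

```python
import numpy as np

def cronbachs_alpha(items):
    """items: respondents x items matrix of scores for a single construct."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)      # variance of each item
    total_variance = items.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical 5-point Likert responses: 6 respondents x 4 items
survey = [[4, 5, 4, 4],
          [2, 2, 3, 2],
          [5, 4, 5, 5],
          [3, 3, 2, 3],
          [4, 4, 4, 5],
          [1, 2, 1, 2]]

print(f"Cronbach's alpha = {cronbachs_alpha(survey):.2f}")
```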

Gage R&R Studies

These studies evaluate a measurement system’s reliability and identify sources of variation, which can help you target improvement efforts effectively. Learn more about Gage R&R Studies .

How to Assess Validity

Validity is more difficult to evaluate than reliability. After all, with reliability, you only assess whether the measures are consistent across time, within the instrument, and between observers. On the other hand, evaluating validity involves determining whether the instrument measures the correct characteristic. This process frequently requires examining relationships between these measurements, other data, and theory. Validating a measurement instrument requires you to use a wide range of subject-area knowledge and different types of constructs to determine whether the measurements from your instrument fit in with the bigger picture!

An instrument with high validity produces measurements that correctly fit the larger picture with other constructs. Validity assesses whether the web of empirical relationships aligns with the theoretical relationships.

The measurements must have a positive relationship with other measures of the same construct. Additionally, they need to correlate in the correct direction (positively or negatively) with the theoretically correct constructs. Finally, the measures should have no relationship with unrelated constructs.
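A simplified sketch of those correlation checks might look like the following, using simulated scores (purely hypothetical) for a new conscientiousness instrument, an established measure of the same construct, and an unrelated variable. A strong correlation with the similar measure and a near-zero correlation with the unrelated one are the patterns you would hope to see.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)

# Hypothetical scores for 50 participants
new_instrument = rng.normal(0, 1, 50)                          # your new conscientiousness scale
established_scale = new_instrument + rng.normal(0, 0.5, 50)    # existing measure of the same construct
unrelated_trait = rng.normal(0, 1, 50)                         # construct that should be unrelated

# Convergent evidence: strong correlation with a measure of the same construct
r_convergent, _ = pearsonr(new_instrument, established_scale)
# Discriminant evidence: near-zero correlation with an unrelated construct
r_discriminant, _ = pearsonr(new_instrument, unrelated_trait)

print(f"convergent r = {r_convergent:.2f}, discriminant r = {r_discriminant:.2f}")
```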

If you need more detailed information, read my post that focuses on Measurement Validity . In that post, I cover the various types, how to evaluate them, and provide examples.

Experimental validity relates to experimental designs and methods. To learn about that topic, read my post about Internal and External Validity .

Whew, that’s a lot of information about reliability vs. validity. Using these concepts, you can determine whether a measurement instrument produces good data!


Grad Coach

Validity & Reliability In Research

A Plain-Language Explanation (With Examples)

By: Derek Jansen (MBA) | Expert Reviewer: Kerryn Warren (PhD) | September 2023

Validity and reliability are two related but distinctly different concepts within research. Understanding what they are and how to achieve them is critically important to any research project. In this post, we’ll unpack these two concepts as simply as possible.

This post is based on our popular online course, Research Methodology Bootcamp . In the course, we unpack the basics of methodology using straightforward language and loads of examples. If you’re new to academic research, you definitely want to use this link to get 50% off the course (limited-time offer).

Overview: Validity & Reliability

  • The big picture
  • Validity 101
  • Reliability 101 
  • Key takeaways

First, The Basics…

First, let’s start with a big-picture view and then we can zoom in to the finer details.

Validity and reliability are two incredibly important concepts in research, especially within the social sciences. Both validity and reliability have to do with the measurement of variables and/or constructs – for example, job satisfaction, intelligence, productivity, etc. When undertaking research, you’ll often want to measure these types of constructs and variables and, at the simplest level, validity and reliability are about ensuring the quality and accuracy of those measurements .

As you can probably imagine, if your measurements aren’t accurate or there are quality issues at play when you’re collecting your data, your entire study will be at risk. Therefore, validity and reliability are very important concepts to understand (and to get right). So, let’s unpack each of them.


What Is Validity?

In simple terms, validity (also called “construct validity”) is all about whether a research instrument accurately measures what it’s supposed to measure .

For example, let’s say you have a set of Likert scales that are supposed to quantify someone’s level of overall job satisfaction. If this set of scales focused purely on only one dimension of job satisfaction, say pay satisfaction, this would not be a valid measurement, as it only captures one aspect of the multidimensional construct. In other words, pay satisfaction alone is only one contributing factor toward overall job satisfaction, and therefore it’s not a valid way to measure someone’s job satisfaction.


Oftentimes in quantitative studies, the way in which the researcher or survey designer interprets a question or statement can differ from how the study participants interpret it . Given that respondents don’t have the opportunity to ask clarifying questions when taking a survey, it’s easy for these sorts of misunderstandings to crop up. Naturally, if the respondents are interpreting the question in the wrong way, the data they provide will be pretty useless . Therefore, ensuring that a study’s measurement instruments are valid – in other words, that they are measuring what they intend to measure – is incredibly important.

There are various types of validity and we’re not going to go down that rabbit hole in this post, but it’s worth quickly highlighting the importance of making sure that your research instrument is tightly aligned with the theoretical construct you’re trying to measure .  In other words, you need to pay careful attention to how the key theories within your study define the thing you’re trying to measure – and then make sure that your survey presents it in the same way.

For example, sticking with the “job satisfaction” construct we looked at earlier, you’d need to clearly define what you mean by job satisfaction within your study (and this definition would of course need to be underpinned by the relevant theory). You’d then need to make sure that your chosen definition is reflected in the types of questions or scales you’re using in your survey . Simply put, you need to make sure that your survey respondents are perceiving your key constructs in the same way you are. Or, even if they’re not, that your measurement instrument is capturing the necessary information that reflects your definition of the construct at hand.

If all of this talk about constructs sounds a bit fluffy, be sure to check out Research Methodology Bootcamp , which will provide you with a rock-solid foundational understanding of all things methodology-related. Remember, you can take advantage of our 60% discount offer using this link.


What Is Reliability?

As with validity, reliability is an attribute of a measurement instrument – for example, a survey, a weight scale or even a blood pressure monitor. But while validity is concerned with whether the instrument is measuring the “thing” it’s supposed to be measuring, reliability is concerned with consistency and stability . In other words, reliability reflects the degree to which a measurement instrument produces consistent results when applied repeatedly to the same phenomenon , under the same conditions .

As you can probably imagine, a measurement instrument that achieves a high level of consistency is naturally more dependable (or reliable) than one that doesn’t – in other words, it can be trusted to provide consistent measurements . And that, of course, is what you want when undertaking empirical research. If you think about it within a more domestic context, just imagine if you found that your bathroom scale gave you a different number every time you hopped on and off of it – you wouldn’t feel too confident in its ability to measure the variable that is your body weight 🙂

It’s worth mentioning that reliability also extends to the person using the measurement instrument . For example, if two researchers use the same instrument (let’s say a measuring tape) and they get different measurements, there’s likely an issue in terms of how one (or both) of them are using the measuring tape. So, when you think about reliability, consider both the instrument and the researcher as part of the equation.

As with validity, there are various types of reliability and various tests that can be used to assess the reliability of an instrument. A popular one that you’ll likely come across for survey instruments is Cronbach’s alpha , which is a statistical measure that quantifies the degree to which items within an instrument (for example, a set of Likert scales) measure the same underlying construct . In other words, Cronbach’s alpha indicates how closely related the items are and whether they consistently capture the same concept . 

Reliability reflects whether an instrument produces consistent results when applied to the same phenomenon, under the same conditions.

Recap: Key Takeaways

Alright, let’s quickly recap to cement your understanding of validity and reliability:

  • Validity is concerned with whether an instrument (e.g., a set of Likert scales) is measuring what it’s supposed to measure
  • Reliability is concerned with whether that measurement is consistent and stable when measuring the same phenomenon under the same conditions.

In short, validity and reliability are both essential to ensuring that your data collection efforts deliver high-quality, accurate data that help you answer your research questions . So, be sure to always pay careful attention to the validity and reliability of your measurement instruments when collecting and analysing data. As the adage goes, “rubbish in, rubbish out” – make sure that your data inputs are rock-solid.



Reliability and Validity in Research: Definitions, Examples


Contents:

  • What is Reliability?
  • The Reliability Coefficient
  • What is Validity?
  • Curricular Validity

Overview of Reliability and Validity

Outside of statistical research, reliability and validity are used interchangeably. For research and testing, there are subtle differences. Reliability implies consistency: if you take the ACT five times, you should get roughly the same results every time. A test is valid if it measures what it’s supposed to.

Tests that are valid are also reliable. The ACT is valid (and reliable) because it measures what a student learned in high school. However, tests that are reliable aren’t always valid. For example, let’s say your thermometer was a degree off. It would be reliable (giving you the same results each time) but not valid (because the thermometer wasn’t recording the correct temperature).

Reliability is a measure of the stability or consistency of test scores. You can also think of it as the ability for a test or research findings to be repeatable. For example, a medical thermometer is a reliable tool that would measure the correct temperature each time it is used. In the same way, a reliable math test will accurately measure mathematical knowledge for every student who takes it and reliable research findings can be replicated over and over.

Of course, it’s not quite as simple as saying you think a test is reliable. There are many statistical tools you can use to measure reliability. For example:

  • Kuder-Richardson 20: a measure of internal reliability for a binary test (i.e. one with right or wrong answers).
  • Cronbach’s alpha: measures internal reliability for tests with multiple possible answers.

Internal vs. External Reliability

Internal reliability , or internal consistency, is a measure of how consistently the different items within your test measure the same thing. External reliability means that your test or measure can be generalized beyond what you’re using it for. For example, a claim that individual tutoring improves test scores should apply to more than one subject (e.g. to English as well as math). A test for depression should be able to detect depression in different age groups, for people in different socio-economic statuses, or introverts.

One specific type is parallel forms reliability , where two equivalent tests are given to students a short time apart. If the forms are parallel, then the tests produce the same observed results.

A reliability coefficient is a measure of how well a test measures achievement. It is the proportion of variance in observed scores (i.e. scores on the test) attributable to true scores (the theoretical “real” score that a person would get if a perfect test existed).

The term “reliability coefficient” actually refers to several different coefficients. Several methods exist for calculating it, including test-retest, parallel forms, and alternate forms:

  • Cronbach’s alpha: the most widely used internal-consistency coefficient.
  • A simple correlation between two scores from the same person is one of the simplest ways to estimate a reliability coefficient. If the scores are taken at different times, this estimates test-retest reliability; different forms of the test given on the same day can estimate parallel-forms reliability.
  • Pearson’s correlation can be used to estimate the theoretical reliability coefficient between parallel tests.
  • The Spearman-Brown formula is a measure of reliability for split-half tests (see the sketch after the thresholds below).
  • Cohen’s Kappa measures interrater reliability.

The range of the reliability coefficient is from 0 to 1. Rule of thumb for preferred levels of the coefficient:

  • For high stakes tests (e.g. college admissions), > 0.85. Some authors suggest this figure should be above 0.90.
  • For low stakes tests (e.g. classroom assessment), > 0.70. Some authors suggest this figure should be above 0.80
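To tie the split-half approach and the Spearman-Brown formula together, here is a minimal sketch with hypothetical right/wrong scores for eight students on a ten-item test: it correlates the odd-item and even-item half-tests and then applies the Spearman-Brown correction to estimate full-test reliability. (For binary items like these, KR-20 would be the usual internal-consistency statistic; this example only illustrates the split-half idea.)

```python
import numpy as np

# Hypothetical right (1) / wrong (0) answers: 8 students x 10 items
scores = np.array([
    [1, 1, 0, 1, 1, 1, 0, 1, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
    [0, 0, 0, 1, 0, 0, 1, 0, 0, 0],
    [1, 0, 1, 1, 1, 1, 0, 1, 1, 1],
    [1, 1, 0, 0, 1, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 1, 1, 1, 1, 1],
])

# Split the test into odd and even items and total each half per student
half_1 = scores[:, 0::2].sum(axis=1)
half_2 = scores[:, 1::2].sum(axis=1)

# Correlation between the two half-tests
r_half = np.corrcoef(half_1, half_2)[0, 1]

# Spearman-Brown correction estimates the reliability of the full-length test
r_full = 2 * r_half / (1 + r_half)

print(f"split-half r = {r_half:.2f}, Spearman-Brown corrected = {r_full:.2f}")
```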


Specific types of reliability and validity include:

  • Composite Reliability
  • Concurrent Validity
  • Content Validity
  • Convergent Validity
  • Consequential Validity
  • Criterion Validity
  • Curricular Validity and Instructional Validity
  • Ecological Validity
  • External Validity
  • Face Validity
  • Formative Validity & Summative Validity
  • Incremental Validity
  • Internal Validity
  • Predictive Validity
  • Sampling Validity
  • Statistical Conclusion Validity

What is Curricular Validity?


Validity is defined by how well a test measures what it’s supposed to measure. Curricular validity refers to how well test items reflect the actual curriculum (i.e. a test is supposed to be a measure of what’s on the curriculum). It usually refers to a specific, well-defined curriculum, like those provided by states to schools. McClung (1978) defines it as

“…a measure of how well test items represent the objectives of the curriculum”.

A similar term is instructional validity, which is how well the test items reflect what is actually taught . McClung defines instructional validity as “an actual measure of whether the schools are providing students with instruction in the knowledge and skills measured by the test.”

In an ideal educational world, there would be no need for a distinction between instructional and curricular validity: teachers follow a curriculum, students learn what is on the curriculum through their teachers. However, it doesn’t always follow that a child will be taught what is on the curriculum. Many things can have an impact on what parts of the curriculum are taught (or not taught), including:

  • Inexperienced teachers,
  • Substitute teachers,
  • Poorly managed schools/flow of information,
  • Teachers may choose not to teach specific parts of the curriculum they don’t agree with (e.g. evolution or sex education ),
  • Teachers might skip over parts of the curriculum they don’t fully understand (like mathematics; according to this report , elementary school teachers struggle with basic math concepts).

How to Measure Curricular Validity

Curricular validity is usually measured by a panel of curriculum experts. It’s not measured statistically, but rather by a rating of “valid” or “not valid.” A test that meets one definition of validity might not meet another. For example, a test might have curricular validity, but not instructional validity and vice versa.

References:
McClung, M. S. (1978). Competency testing programs: Legal and educational issues. Fordham Law Review, 47, 651-712.
Ostashevsky, L. (2016). Elementary school teachers struggle with Common Core math standards.
Everitt, B. S., & Skrondal, A. (2010). The Cambridge Dictionary of Statistics. Cambridge University Press.
Gonick, L. (1993). The Cartoon Guide to Statistics. HarperPerennial.


Reliability vs Validity in Research | Differences, Types & Examples

Published on 3 May 2022 by Fiona Middleton. Revised on 10 October 2022.

Reliability and validity are concepts used to evaluate the quality of research. They indicate how well a method , technique, or test measures something. Reliability is about the consistency of a measure, and validity is about the accuracy of a measure.

It’s important to consider reliability and validity when you are creating your research design , planning your methods, and writing up your results, especially in quantitative research .

Reliability vs validity

 | Reliability | Validity
What does it tell you? | The extent to which the results can be reproduced when the research is repeated under the same conditions. | The extent to which the results really measure what they are supposed to measure.
How is it assessed? | By checking the consistency of results across time, across different observers, and across parts of the test itself. | By checking how well the results correspond to established theories and other measures of the same concept.
How do they relate? | A reliable measurement is not always valid: the results might be reproducible, but they’re not necessarily correct. | A valid measurement is generally reliable: if a test produces accurate results, they should be reproducible.

Table of contents

  • Understanding reliability vs validity
  • How are reliability and validity assessed?
  • How to ensure validity and reliability in your research
  • Where to write about reliability and validity in a thesis

Reliability and validity are closely related, but they mean different things. A measurement can be reliable without being valid. However, if a measurement is valid, it is usually also reliable.

What is reliability?

Reliability refers to how consistently a method measures something. If the same result can be consistently achieved by using the same methods under the same circumstances, the measurement is considered reliable.

What is validity?

Validity refers to how accurately a method measures what it is intended to measure. If research has high validity, that means it produces results that correspond to real properties, characteristics, and variations in the physical or social world.

High reliability is one indicator that a measurement is valid. If a method is not reliable, it probably isn’t valid.

However, reliability on its own is not enough to ensure validity. Even if a test is reliable, it may not accurately reflect the real situation.

Validity is harder to assess than reliability, but it is even more important. To obtain useful results, the methods you use to collect your data must be valid: the research must be measuring what it claims to measure. This ensures that your discussion of the data and the conclusions you draw are also valid.


Reliability can be estimated by comparing different versions of the same measurement. Validity is harder to assess, but it can be estimated by comparing the results to other relevant data or theory. Methods of estimating reliability and validity are usually split up into different types.

Types of reliability

Different types of reliability can be estimated through various statistical methods.

Type of reliability | What does it assess? | Example
Test-retest | The consistency of a measure across time: do you get the same results when you repeat the measurement? | A group of participants complete a questionnaire designed to measure personality traits. If they repeat the questionnaire days, weeks, or months apart and give the same answers, this indicates high test-retest reliability.
Interrater | The consistency of a measure across raters or observers: do you get the same results when different people conduct the same measurement? | Based on an assessment criteria checklist, five examiners submit substantially different results for the same student project. This indicates that the assessment checklist has low inter-rater reliability (for example, because the criteria are too subjective).
Internal consistency | The consistency of the measurement itself: do you get the same results from different parts of a test that are designed to measure the same thing? | You design a questionnaire to measure self-esteem. If you randomly split the results into two halves, there should be a strong correlation between the two sets of results. If the two results are very different, this indicates low internal consistency.

Types of validity

The validity of a measurement can be estimated based on three main types of evidence. Each type can be evaluated through expert judgement or statistical methods.

Type of validity | What does it assess? | Example
Construct validity | The adherence of a measure to existing theory and knowledge of the concept being measured. | A self-esteem questionnaire could be assessed by measuring other traits known or assumed to be related to the concept of self-esteem (such as social skills and optimism). Strong correlation between the scores for self-esteem and associated traits would indicate high construct validity.
Content validity | The extent to which the measurement covers all aspects of the concept being measured. | A test that aims to measure a class of students’ level of Spanish contains reading, writing, and speaking components, but no listening component. Experts agree that listening comprehension is an essential aspect of language ability, so the test lacks content validity for measuring the overall level of ability in Spanish.
Criterion validity | The extent to which the result of a measure corresponds to other valid measures of the same concept. | A survey is conducted to measure the political opinions of voters in a region. If the results accurately predict the later outcome of an election in that region, this indicates that the survey has high criterion validity.

To assess the validity of a cause-and-effect relationship, you also need to consider internal validity (the design of the experiment ) and external validity (the generalisability of the results).

The reliability and validity of your results depend on creating a strong research design , choosing appropriate methods and samples, and conducting the research carefully and consistently.

Ensuring validity

If you use scores or ratings to measure variations in something (such as psychological traits, levels of ability, or physical properties), it’s important that your results reflect the real variations as accurately as possible. Validity should be considered in the very earliest stages of your research, when you decide how you will collect your data .

  • Choose appropriate methods of measurement

Ensure that your method and measurement technique are of high quality and targeted to measure exactly what you want to know. They should be thoroughly researched and based on existing knowledge.

For example, to collect data on a personality trait, you could use a standardised questionnaire that is considered reliable and valid. If you develop your own questionnaire, it should be based on established theory or the findings of previous studies, and the questions should be carefully and precisely worded.

  • Use appropriate sampling methods to select your subjects

To produce valid generalisable results, clearly define the population you are researching (e.g., people from a specific age range, geographical location, or profession). Ensure that you have enough participants and that they are representative of the population.

Ensuring reliability

Reliability should be considered throughout the data collection process. When you use a tool or technique to collect data, it’s important that the results are precise, stable, and reproducible.

  • Apply your methods consistently

Plan your method carefully to make sure you carry out the same steps in the same way for each measurement. This is especially important if multiple researchers are involved.

For example, if you are conducting interviews or observations, clearly define how specific behaviours or responses will be counted, and make sure questions are phrased the same way each time.

  • Standardise the conditions of your research

When you collect your data, keep the circumstances as consistent as possible to reduce the influence of external factors that might create variation in the results.

For example, in an experimental setup, make sure all participants are given the same information and tested under the same conditions.

It’s appropriate to discuss reliability and validity in various sections of your thesis or dissertation or research paper. Showing that you have taken them into account in planning your research and interpreting the results makes your work more credible and trustworthy.

Reliability and validity in a thesis
Section | Discuss
Literature review | What have other researchers done to devise and improve methods that are reliable and valid?
Methodology | How did you plan your research to ensure reliability and validity of the measures used? This includes the chosen sample set and size, sample preparation, external conditions, and measuring techniques.
Results | If you calculate reliability and validity, state these values alongside your main results.
Discussion | This is the moment to talk about how reliable and valid your results actually were. Were they consistent, and did they reflect true values? If not, why not?
Conclusion | If reliability and validity were a big problem for your findings, it might be helpful to mention this here.



Reliability vs. Validity in Research: Types & Examples

Explore how reliability vs validity in research determines quality. Learn the differences and types + examples. Get insights!

When it comes to research, getting things right is crucial. That’s where the concepts of “Reliability vs Validity in Research” come in. 

Imagine it like a balancing act – making sure your measurements are consistent and accurate at the same time. This is where test-retest reliability, having different researchers check things, and keeping things consistent within your research plays a big role. 

As we dive into this topic, we’ll uncover the differences between reliability and validity, see how they work together, and learn how to use them effectively.

Understanding Reliability vs. Validity in Research

When it comes to collecting data and conducting research, two crucial concepts stand out: reliability and validity. 

These pillars uphold the integrity of research findings, ensuring that the data collected and the conclusions drawn are both meaningful and trustworthy. Let’s dive into the heart of the concepts, reliability, and validity, to comprehend their significance in the realm of research truly.

What is reliability?

Reliability refers to the consistency and dependability of the data collection process. It’s like having a steady hand that produces the same result each time it reaches for a task. 

In the research context, reliability is all about ensuring that if you were to repeat the same study using the same reliable measurement technique, you’d end up with the same results. It’s like having multiple researchers independently conduct the same experiment and getting outcomes that align perfectly.

Imagine you’re using a thermometer to measure the temperature of the water. You have a reliable measurement if you dip the thermometer into the water multiple times and get the same reading each time. This tells you that your method and measurement technique consistently produce the same results, whether it’s you or another researcher performing the measurement.

What is validity?

On the other hand, validity refers to the accuracy and meaningfulness of your data. It’s like ensuring that the puzzle pieces you’re putting together actually form the intended picture. When you have validity, you know that your method and measurement technique are consistent and capable of producing results aligned with reality.

Think of it this way; Imagine you’re conducting a test that claims to measure a specific trait, like problem-solving ability. If the test consistently produces results that accurately reflect participants’ problem-solving skills, then the test has high validity. In this case, the test produces accurate results that truly correspond to the trait it aims to measure.

In essence, while reliability assures you that your data collection process is like a well-oiled machine producing the same results, validity steps in to ensure that these results are not only consistent but also relevantly accurate. 

Together, these concepts provide researchers with the tools to conduct research that stands on a solid foundation of dependable methods and meaningful insights.

Types of Reliability

Let’s explore the various types of reliability that researchers consider to ensure their work stands on solid ground.

Test-retest reliability

Test-retest reliability involves assessing the consistency of measurements over time. It’s like taking the same measurement or test twice – once and then again after a certain period. If the results align closely, it indicates that the measurement is reliable over time. Think of it as capturing the essence of stability. 

Inter-rater reliability

When multiple researchers or observers are part of the equation, interrater reliability comes into play. This type of reliability assesses the level of agreement between different observers when evaluating the same phenomenon. It’s like ensuring that different pairs of eyes perceive things in a similar way. 

Internal reliability

Internal consistency dives into the harmony among different items within a measurement tool aiming to assess the same concept. This often comes into play in surveys or questionnaires, where participants respond to various items related to a single construct. If the responses to these items consistently reflect the same underlying concept, the measurement is said to have high internal consistency. 

Types of validity

Let’s explore the various types of validity that researchers consider to ensure their work stands on solid ground.

Content validity

It delves into whether a measurement truly captures all dimensions of the concept it intends to measure. It’s about making sure your measurement tool covers all relevant aspects comprehensively. 

Imagine designing a test to assess students’ understanding of a history chapter. It exhibits high content validity if the test includes questions about key events, dates, and causes. However, if it focuses solely on dates and omits causation, its content validity might be questionable.

Construct validity

It assesses how well a measurement aligns with established theories and concepts. It’s like ensuring that your measurement is a true representation of the abstract construct you’re trying to capture. 

Criterion validity

Criterion validity examines how well your measurement corresponds to other established measurements of the same concept. It’s about making sure your measurement accurately predicts or correlates with external criteria.

Differences between reliability and validity in research

Let’s delve into the differences between reliability and validity in research.

No. | Category | Reliability | Validity
01 | Meaning | Focuses on the consistency of measurements over time and conditions. | Concerns the accuracy and relevance of measurements in capturing the intended concept.
02 | What it assesses | Whether the same results can be obtained consistently from repeated measurements. | Whether measurements truly measure what they are intended to measure.
03 | Assessment methods | Evaluated through test-retest consistency, interrater agreement, and internal consistency. | Assessed through content coverage, construct alignment, and criterion correlation.
04 | Interrelation | A measurement can be reliable (consistent) without being valid (accurate). | A valid measurement is typically reliable, but high reliability doesn’t guarantee validity.
05 | Importance | Ensures data consistency and replicability. | Guarantees meaningful and credible results.
06 | Focus | The stability and consistency of measurement outcomes. | The meaningfulness and accuracy of measurement outcomes.
07 | Outcome | Reproducibility of measurements is the key outcome. | Meaningful and accurate measurement outcomes are the primary goal.

While both reliability and validity contribute to trustworthy research, they address distinct aspects. Reliability ensures consistent results, while validity ensures accurate and relevant results that reflect the true nature of the measured concept.

Example of Reliability and Validity in Research

In this section, we’ll explore instances that highlight the differences between reliability and validity and how they play a crucial role in ensuring the credibility of research findings.

Example of reliability

Imagine you are studying the reliability of a smartphone’s battery life measurement. To collect data, you fully charge the phone and measure the battery life three times in the same controlled environment—same apps running, same brightness level, and same usage patterns. 

If the measurements consistently show a similar battery life duration each time you repeat the test, it indicates that your measurement method is reliable. The consistent results under the same conditions assure you that the battery life measurement can be trusted to provide dependable information about the phone’s performance.

Example of validity

Researchers collect data from a group of participants in a study aiming to assess the validity of a newly developed stress questionnaire. To ensure validity, they compare the scores obtained from the stress questionnaire with the participants’ actual stress levels measured using physiological indicators such as heart rate variability and cortisol levels. 

If participants’ scores correlate strongly with their physiological stress levels, the questionnaire is valid. This means the questionnaire accurately measures participants’ stress levels, and its results correspond to real variations in their physiological responses to stress. 

Validity, assessed through the correlation between questionnaire scores and physiological measures, ensures that the questionnaire is effectively measuring what it claims to measure: participants’ stress levels.
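
To make the idea concrete, here is a minimal, illustrative sketch (not from the original study) of how such a criterion-validity check might be computed in Python. The scores, cortisol values, and variable names are all invented for the example.

```python
# Illustrative sketch: correlating stress questionnaire scores with a
# physiological criterion (cortisol). All values and names are hypothetical.
from scipy.stats import pearsonr

questionnaire_scores = [32, 45, 28, 51, 39, 47, 30, 42]             # questionnaire totals
cortisol_levels = [11.2, 16.8, 10.1, 18.3, 13.5, 17.0, 10.9, 15.2]  # matching cortisol readings

r, p = pearsonr(questionnaire_scores, cortisol_levels)
print(f"Correlation with physiological criterion: r = {r:.2f} (p = {p:.3f})")
# A strong, statistically significant correlation would support criterion validity.
```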

In the world of research, differentiating between reliability and validity is crucial. Reliability ensures consistent results, while validity confirms accurate measurements. Using tools like QuestionPro enhances data collection for both reliability and validity. For instance, measuring self-esteem over time showcases reliability, and aligning questions with theories demonstrates validity. 



The 4 Types of Reliability in Research | Definitions & Examples

Published on August 8, 2019 by Fiona Middleton. Revised on June 22, 2023.

Reliability tells you how consistently a method measures something. When you apply the same method to the same sample under the same conditions, you should get the same results. If not, the method of measurement may be unreliable or bias may have crept into your research.

There are four main types of reliability. Each can be estimated by comparing different sets of results produced by the same method.

Type of reliability | Measures the consistency of…
Test-retest | The same test over time.
Interrater | The same test conducted by different people.
Parallel forms | Different versions of a test that are designed to be equivalent.
Internal consistency | The individual items of a test.


Test-retest reliability measures the consistency of results when you repeat the same test on the same sample at a different point in time. You use it when you are measuring something that you expect to stay constant in your sample.

Why it’s important

Many factors can influence your results at different points in time: for example, respondents might experience different moods, or external conditions might affect their ability to respond accurately.

Test-retest reliability can be used to assess how well a method resists these factors over time. The smaller the difference between the two sets of results, the higher the test-retest reliability.

How to measure it

To measure test-retest reliability, you conduct the same test on the same group of people at two different points in time. Then you calculate the correlation between the two sets of results.
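
As a rough illustration of this step, the sketch below correlates two administrations of the same test in Python. The scores and variable names are invented for the example, and scipy’s Pearson correlation is just one way to compute it.

```python
# Illustrative sketch: test-retest reliability as the correlation between
# two administrations of the same test. Scores are made up.
from scipy.stats import pearsonr

time_1 = [98, 102, 110, 95, 120, 105, 99, 108]   # first administration
time_2 = [101, 100, 112, 97, 118, 103, 98, 110]  # same participants, later administration

r, _ = pearsonr(time_1, time_2)
print(f"Test-retest correlation: r = {r:.2f}")   # values near 1 suggest high reliability
```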

Test-retest reliability example

You devise a questionnaire to measure the IQ of a group of participants (a property that is unlikely to change significantly over time). You administer the test two months apart to the same group of people, but the results are significantly different, so the test-retest reliability of the IQ questionnaire is low.

Improving test-retest reliability

  • When designing tests or questionnaires , try to formulate questions, statements, and tasks in a way that won’t be influenced by the mood or concentration of participants.
  • When planning your methods of data collection , try to minimize the influence of external factors, and make sure all samples are tested under the same conditions.
  • Remember that changes or recall bias can be expected to occur in the participants over time, and take these into account.


Interrater reliability (also called interobserver reliability) measures the degree of agreement between different people observing or assessing the same thing. You use it when data is collected by researchers assigning ratings, scores or categories to one or more variables, and it can help mitigate observer bias.

People are subjective, so different observers’ perceptions of situations and phenomena naturally differ. Reliable research aims to minimize subjectivity as much as possible so that a different researcher could replicate the same results.

When designing the scale and criteria for data collection, it’s important to make sure that different people will rate the same variable consistently with minimal bias. This is especially important when there are multiple researchers involved in data collection or analysis.

To measure interrater reliability, different researchers conduct the same measurement or observation on the same sample. Then you calculate the correlation between their different sets of results. If all the researchers give similar ratings, the test has high interrater reliability.

Interrater reliability example

A team of researchers observe the progress of wound healing in patients. To record the stages of healing, rating scales are used, with a set of criteria to assess various aspects of wounds. The results of different researchers assessing the same set of patients are compared, and there is a strong correlation between all sets of results, so the test has high interrater reliability.
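
The following sketch shows one way such agreement might be checked in Python. The ratings are invented; the article describes a simple correlation between raters, and Cohen’s kappa is shown alongside it as a common alternative for categorical ratings.

```python
# Illustrative sketch: agreement between two raters assigning a healing stage (1-4)
# to the same ten wounds. Ratings are made up.
import numpy as np
from sklearn.metrics import cohen_kappa_score

rater_a = np.array([1, 2, 2, 3, 4, 3, 2, 1, 4, 3])
rater_b = np.array([1, 2, 3, 3, 4, 3, 2, 1, 4, 2])

correlation = np.corrcoef(rater_a, rater_b)[0, 1]   # simple correlation, as in the text
kappa = cohen_kappa_score(rater_a, rater_b)         # chance-corrected agreement

print(f"Inter-rater correlation: {correlation:.2f}")
print(f"Cohen's kappa: {kappa:.2f}")
```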

Improving interrater reliability

  • Clearly define your variables and the methods that will be used to measure them.
  • Develop detailed, objective criteria for how the variables will be rated, counted or categorized.
  • If multiple researchers are involved, ensure that they all have exactly the same information and training.

Parallel forms reliability measures the correlation between two equivalent versions of a test. You use it when you have two different assessment tools or sets of questions designed to measure the same thing.

If you want to use multiple different versions of a test (for example, to avoid respondents repeating the same answers from memory), you first need to make sure that all the sets of questions or measurements give reliable results.

The most common way to measure parallel forms reliability is to produce a large set of questions to evaluate the same thing, then divide these randomly into two question sets.

The same group of respondents answers both sets, and you calculate the correlation between the results. High correlation between the two indicates high parallel forms reliability.

Parallel forms reliability example

A set of questions is formulated to measure financial risk aversion in a group of respondents. The questions are randomly divided into two sets, and the respondents are randomly divided into two groups. Both groups take both tests: group A takes test A first, and group B takes test B first. The results of the two tests are compared, and the results are almost identical, indicating high parallel forms reliability.

Improving parallel forms reliability

  • Ensure that all questions or test items are based on the same theory and formulated to measure the same thing.

Internal consistency assesses the correlation between multiple items in a test that are intended to measure the same construct.

You can calculate internal consistency without repeating the test or involving other researchers, so it’s a good way of assessing reliability when you only have one data set.

When you devise a set of questions or ratings that will be combined into an overall score, you have to make sure that all of the items really do reflect the same thing. If responses to different items contradict one another, the test might be unreliable.

Two common methods are used to measure internal consistency.

  • Average inter-item correlation : For a set of measures designed to assess the same construct, you calculate the correlation between the results of all possible pairs of items and then calculate the average.
  • Split-half reliability : You randomly split a set of measures into two sets. After testing the entire set on the respondents, you calculate the correlation between the two sets of responses.

Internal consistency example

A group of respondents are presented with a set of statements designed to measure optimistic and pessimistic mindsets. They must rate their agreement with each statement on a scale from 1 to 5. If the test is internally consistent, an optimistic respondent should generally give high ratings to optimism indicators and low ratings to pessimism indicators. The correlation is calculated between all the responses to the “optimistic” statements, but the correlation is very weak. This suggests that the test has low internal consistency.
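
For illustration only, here is a small sketch of the split-half approach mentioned above, using invented half-scores. The Spearman-Brown formula (a standard correction, not something specific to this article) adjusts the half-test correlation to estimate full-test reliability.

```python
# Illustrative sketch: split-half reliability. The scale's items are split into two
# halves, each respondent gets a score per half, and the half-scores are correlated.
# All numbers are made up.
import numpy as np

half_a = np.array([22, 18, 25, 15, 20, 24, 17, 21])   # summed scores on one half
half_b = np.array([21, 17, 24, 16, 19, 25, 18, 20])   # summed scores on the other half

r = np.corrcoef(half_a, half_b)[0, 1]
split_half_reliability = (2 * r) / (1 + r)   # Spearman-Brown correction
print(f"Half-score correlation: {r:.2f}, corrected reliability: {split_half_reliability:.2f}")
```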

Improving internal consistency

  • Take care when devising questions or measures: those intended to reflect the same concept should be based on the same theory and carefully formulated.

It’s important to consider reliability when planning your research design, collecting and analyzing your data, and writing up your research. The type of reliability you should calculate depends on the type of research and your methodology.

What is my methodology? | Which form of reliability is relevant?
Measuring a property that you expect to stay the same over time. | Test-retest
Multiple researchers making observations or ratings about the same topic. | Interrater
Using two different tests to measure the same thing. | Parallel forms
Using a multi-item test where all the items are intended to measure the same variable. | Internal consistency

If possible and relevant, you should statistically calculate reliability and state this alongside your results.


Frequently asked questions about types of reliability

Reliability and validity are both about how well a method measures something:

  • Reliability refers to the consistency of a measure (whether the results can be reproduced under the same conditions).
  • Validity refers to the accuracy of a measure (whether the results really do represent what they are supposed to measure).

If you are doing experimental research, you also have to consider the internal and external validity of your experiment.

You can use several tactics to minimize observer bias .

  • Use masking (blinding) to hide the purpose of your study from all observers.
  • Triangulate your data with different data collection methods or sources.
  • Use multiple observers and ensure interrater reliability.
  • Train your observers to make sure data is consistently recorded between them.
  • Standardize your observation procedures to make sure they are structured and clear.

Reproducibility and replicability are related terms.

  • A successful reproduction shows that the data analyses were conducted in a fair and honest manner.
  • A successful replication shows that the reliability of the results is high.

Research bias affects the validity and reliability of your research findings , leading to false conclusions and a misinterpretation of the truth. This can have serious implications in areas like medical research where, for example, a new form of treatment may be evaluated.


Reliability vs Validity in Research: Types & Examples

By busayo.longe

In everyday life, we probably use reliability to describe how something is valid. However, in research and testing, reliability and validity are not the same things.

When it comes to data analysis, reliability refers to how easily replicable an outcome is. For example, if you measure a cup of rice three times, and you get the same result each time, that result is reliable.

Validity, on the other hand, refers to the measurement’s accuracy. This means that if the standard weight for a cup of rice is 5 grams, and you measure a cup of rice, it should be 5 grams.

So, while reliability and validity are intertwined, they are not synonymous. If one of the measurement parameters, such as your scale, is distorted, the results will be consistent but invalid.

Data must be consistent and accurate to be used to draw useful conclusions. In this article, we’ll look at how to assess data reliability and validity, as well as how to apply it.


What is Reliability?

When a measurement is consistent it’s reliable. But of course, reliability doesn’t mean your outcome will be the same, it just means it will be in the same range. 

For example, if you scored 95% on a test the first time and 96% the next time, your results are reliable. So, even if there is a minor difference in the outcomes, as long as it is within the error margin, your results are reliable.

Reliability allows you to assess the degree of consistency in your results. So, if you’re getting similar results, reliability provides an answer to the question of how similar your results are.

What is Validity?

A measurement or test is valid when it correlates with the expected result. It examines the accuracy of your result.

Here’s where things get tricky: to establish the validity of a test, the results must be consistent. Looking at most experiments (especially physical measurements), the standard value that establishes the accuracy of a measurement is the outcome of repeating the test to obtain a consistent result.


For example, before I can conclude that all 12-inch rulers are one foot, I must repeat the experiment several times and obtain very similar results, indicating that 12-inch rulers are indeed one foot.

Most scientific experiments are inextricably linked in terms of validity and reliability. For example, if you’re measuring distance or depth, valid answers are likely to be reliable.

But for social experiences, one isn’t an indication of the other. For example, most people believe that people who wear glasses are smart.

Of course, I’ll find examples of people who wear glasses and have high IQs (reliability), but the truth is that most people who wear glasses simply need their vision to be better (validity). 

So reliable answers aren’t always correct but valid answers are always reliable.

How Are Reliability and Validity Assessed?

When assessing reliability, we want to know if the measurement can be replicated. Of course, we’d have to change some variables to ensure that this test holds, the most important of which are time, items, and observers.

If the main factor you change when performing a reliability test is time, you’re performing a test-retest reliability assessment.


However, if you are changing items, you are performing an internal consistency assessment. It means you’re measuring multiple items with a single instrument.

Finally, if you’re measuring the same item with the same instrument but using different observers or judges, you’re performing an inter-rater reliability test.

Assessing Validity

Evaluating validity can be more tedious than reliability. With reliability, you’re attempting to demonstrate that your results are consistent, whereas, with validity, you want to prove the correctness of your outcome.

Although validity is mainly categorized under two sections (internal and external), there are more than fifteen ways to check the validity of a test. In this article, we’ll be covering four.

First, content validity measures whether the test covers all the content it needs to produce the outcome you’re expecting.

Suppose I wanted to test the hypothesis that 90% of Generation Z uses social media polls for surveys while 90% of millennials use forms. I’d need a sample size that accounts for how Gen Z and millennials gather information.

Next, criterion validity is when you compare your results to what you’re supposed to get based on a chosen criterion. It can be measured in two ways: predictive or concurrent validity.


Following that, we have face validity. It concerns whether a test appears, on the surface, to measure what it should. For instance, when answering a customer service survey, I’d expect to be asked about how I feel about the service provided.

Lastly, there is construct-related validity. This is a little more complicated, but it helps to show how the validity of research is based on different findings.

As a result, it provides information that either proves or disproves that certain things are related.

Types of Reliability

We have three main types of reliability assessment and here’s how they work:

1) Test-retest Reliability

This assessment refers to the consistency of outcomes over time. Testing reliability over time does not imply changing the amount of time it takes to conduct an experiment; rather, it means repeating the experiment multiple times in a short time.

For example, if I measure the length of my hair today, and tomorrow, I’ll most likely get the same result each time. 

A short period is relative in terms of reliability; two days for measuring hair length is considered short. But that’s far too long to test how quickly water dries on the sand.

A test-retest correlation is used to compare the consistency of your results. This is typically a scatter plot that shows how similar your values are between the two experiments.

If your answers are reliable, your scatter plots will most likely have a lot of overlapping points, but if they aren’t, the points (values) will be spread across the graph.
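
As a quick illustration, the sketch below draws such a scatter plot with matplotlib; the scores are invented and serve only to show the shape of the check.

```python
# Illustrative sketch: visualising test-retest consistency as a scatter plot.
# Scores from the two runs are made up.
import matplotlib.pyplot as plt

first_run = [12, 15, 14, 20, 18, 22, 16, 19]
second_run = [13, 14, 15, 21, 17, 22, 15, 20]

plt.scatter(first_run, second_run)
plt.xlabel("Score at time 1")
plt.ylabel("Score at time 2")
plt.title("Test-retest consistency")
plt.show()   # points clustered along a diagonal suggest reliable results
```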


2) Internal Consistency

It’s also known as internal reliability. It refers to the consistency of results for various items when measured on the same scale.

This is particularly important in social science research, such as surveys, because it helps determine the consistency of people’s responses when asked the same questions.

Most introverts, for example, would say they enjoy spending time alone and having few friends. However, if some introverts claim that they either do not want time alone or prefer to be surrounded by many friends, it doesn’t add up.

Either these people aren’t really introverts, or this factor isn’t a reliable way of measuring introversion.

Internal reliability helps you prove the consistency of a test by varying factors. It’s a little tough to measure quantitatively but you could use the split-half correlation.

The split-half correlation simply means dividing the factors used to measure the underlying construct into two and plotting them against each other in the form of a scatter plot.

Introverts, for example, are assessed on their need for alone time as well as their desire to have as few friends as possible. If this plot is dispersed, it is likely that one of the traits does not indicate introversion.

3) Inter-Rater Reliability

This method of measuring reliability helps prevent personal bias. Inter-rater reliability assessment helps judge outcomes from the different perspectives of multiple observers.

A good example is if you ordered a meal and found it delicious. You could be biased in your judgment for several reasons: your perception of the meal, your mood, and so on.

But it’s highly unlikely that six more people would agree that the meal is delicious if it isn’t. Another factor that could lead to bias is expertise. Professional dancers, for example, would perceive dance moves differently than non-professionals. 


So, if a person dances and records it, and both groups (professional and unprofessional dancers) rate the video, there is a high likelihood of a significant difference in their ratings.

But if they both agree that the person is a great dancer, despite their opposing viewpoints, the person is likely a great dancer.

Types of Validity

Researchers use validity to determine whether a measurement is accurate or not. The accuracy of measurement is usually determined by comparing it to the standard value.

When a measurement is consistent over time and has high internal consistency, it increases the likelihood that it is valid.

1) Content Validity

This refers to determining validity by evaluating what is being measured. So content validity tests if your research is measuring everything it should to produce an accurate result.

For example, if I were to measure what causes hair loss in women, I’d have to consider things like postpartum hair loss, alopecia, hair manipulation, dryness, and so on.

By omitting any of these critical factors, you risk significantly reducing the validity of your research because you won’t be covering everything necessary to make an accurate deduction. 


For example, a certain woman is losing her hair due to postpartum hair loss, excessive manipulation, and dryness, but in my research, I only look at postpartum hair loss. My research will show that she has postpartum hair loss, which isn’t the complete picture.

Yes, my conclusion is correct, but it does not fully account for the reasons why this woman is losing her hair.

2) Criterion Validity

This measures how well your measurement correlates with the variables you want to compare it with to get your result. The two main classes of criterion validity are predictive and concurrent.

3) Predictive validity

It helps predict future outcomes based on the data you have. For example, if a large number of students performed exceptionally well in a test, you can use this to predict that they understood the concept on which the test was based and will perform well in their exams.

4) Concurrent validity

Concurrent validity, on the other hand, involves testing different variables at the same time. For example, setting up a literature test for your students on two different books and assessing them at the same time.

You’re measuring your students’ literature proficiency with these two books. If your students truly understood the subject, they should be able to correctly answer questions about both books.

5) Face Validity

Quantifying face validity might be a bit difficult because you are measuring the perceived validity, not the validity itself. So, face validity is concerned with whether the method used for measurement appears likely to produce accurate results, rather than with the measurement itself.

If the method used for measurement doesn’t appear to test the accuracy of a measurement, its face validity is low.

Here’s an example: less than 40% of men over the age of 20 in Texas, USA, are at least 6 feet tall. The most logical approach would be to collect height data from men over the age of twenty in Texas, USA.

However, asking men over the age of 20 what their favorite meal is to determine their height is pretty bizarre. The method I am using to assess the validity of my research is quite questionable because it lacks correlation to what I want to measure.

6) Construct-Related Validity

Construct-related validity assesses the accuracy of your research by collecting multiple pieces of evidence. It helps determine the validity of your results by comparing them to evidence that supports or refutes your measurement.

7) Convergent validity

If you’re assessing evidence that strongly correlates with the concept, that’s convergent validity.

8) Discriminant validity

Discriminant validity examines the validity of your research by determining what not to base it on. You are removing elements that are not a strong factor to help validate your research. Being a vegan, for example, does not imply that you are allergic to meat.

How to Ensure Validity and Reliability in Your Research

You need a bulletproof research design to ensure that your research is both valid and reliable. This means that your methods, sample, and even you, the researcher, shouldn’t be biased.

  • Ensuring Reliability

To enhance the reliability of your research, you need to apply your measurement method consistently. The chances of reproducing the same results for a test are higher when you maintain the method you’re using to experiment.

For example, you want to determine the reliability of the weight of a bag of chips using a scale. You have to consistently use this scale to measure the bag of chips each time you experiment.

You must also keep the conditions of your research consistent. For instance, if you’re experimenting to see how quickly water dries on sand, you need to consider all of the weather elements that day.

So, if you experimented on a sunny day, the next experiment should also be conducted on a sunny day to obtain a reliable result.

  • Ensuring Validity

There are several ways to determine the validity of your research, and the majority of them require the use of highly specific and high-quality measurement methods.

Before you begin your test, choose the best method for producing the desired results. This method should be pre-existing and proven.

Also, your sample should be very specific. If you’re collecting data on how dogs respond to fear, your results are more likely to be valid if you base them on a specific breed of dog rather than dogs in general.

Validity and reliability are critical for achieving accurate and consistent results in research. While reliability does not always imply validity, validity establishes that a result is reliable. Validity is heavily dependent on previous results (standards), whereas reliability is dependent on the similarity of your results.


Validity vs. Reliability in Research: What's the Difference?


Introduction


In research, validity and reliability are crucial for producing robust findings. They provide a foundation that assures scholars, practitioners, and readers alike that the research's insights are both accurate and consistent. However, the nuanced nature of qualitative data often blurs the lines between these concepts, making it imperative for researchers to discern their distinct roles.

This article seeks to illuminate the intricacies of reliability and validity, highlighting their significance and distinguishing their unique attributes. By understanding these critical facets, qualitative researchers can ensure their work not only resonates with authenticity but also trustworthiness.


In the domain of research, whether qualitative or quantitative, two concepts often arise when discussing the quality and rigor of a study: reliability and validity. These two terms, while interconnected, have distinct meanings that hold significant weight in the world of research.

Reliability, at its core, speaks to the consistency of a study. If a study or test measures the same concept repeatedly and yields the same results, it demonstrates a high degree of reliability. A common method for assessing reliability is through internal consistency reliability, which checks if multiple items that measure the same concept produce similar scores.

Another method often used is inter-rater reliability, which gauges the consistency of scores given by different raters. This approach is especially amenable to qualitative research, and it can help researchers assess the clarity of their code system and the consistency of their codings. For a study to be more dependable, it’s imperative to ensure a sufficient measurement of reliability is achieved.

On the other hand, validity is concerned with accuracy. It looks at whether a study truly measures what it claims to. Within the realm of validity, several types exist. Construct validity, for instance, verifies that a study measures the intended abstract concept or underlying construct. If a study aims to measure self-esteem and accurately captures this abstract trait, it demonstrates strong construct validity.

Content validity ensures that a test or study comprehensively represents the entire domain of the concept it seeks to measure. For instance, if a test aims to assess mathematical ability, it should cover arithmetic, algebra, geometry, and more to showcase strong content validity.

Criterion validity is another form of validity that ensures that the scores from a test correlate well with a measure from a related outcome. A subset of this is predictive validity, which checks if the test can predict future outcomes. For instance, if an aptitude test can predict future job performance, it can be said to have high predictive validity.

The distinction between reliability and validity becomes clear when one considers the nature of their focus. While reliability is concerned with consistency and reproducibility, validity zeroes in on accuracy and truthfulness.

A research tool can be reliable without being valid. For instance, faulty instrument measures might consistently give bad readings (reliable but not valid). Conversely, in discussions about test reliability, the same test measure administered multiple times could sometimes hit the mark and at other times miss it entirely, producing different test scores each time. This would make it valid in some instances but not reliable.

For a study to be robust, it must achieve both reliability and validity. Reliability ensures the study's findings are reproducible while validity confirms that it accurately represents the phenomena it claims to. Ensuring both in a study means the results are both dependable and accurate, forming a cornerstone for high-quality research.


Understanding the nuances of reliability and validity becomes clearer when contextualized within a real-world research setting. Imagine a qualitative study where a researcher aims to explore the experiences of teachers in urban schools concerning classroom management. The primary method of data collection is semi-structured interviews.

To ensure the reliability of this qualitative study, the researcher crafts a consistent list of open-ended questions for the interview. This ensures that, while each conversation might meander based on the individual’s experiences, there remains a core set of topics related to classroom management that every participant addresses.

The essence of reliability in this context isn’t necessarily about garnering identical responses but rather about achieving a consistent approach to data collection and subsequent interpretation. As part of this commitment to reliability, two researchers might independently transcribe and analyze a subset of these interviews. If they identify similar themes and patterns in their independent analyses, it suggests a consistent interpretation of the data, showcasing inter-rater reliability.

Validity, on the other hand, is anchored in ensuring that the research genuinely captures and represents the lived experiences and sentiments of teachers concerning classroom management. To establish content validity, the list of interview questions is thoroughly reviewed by a panel of educational experts. Their feedback ensures that the questions encompass the breadth of issues and concerns related to classroom management in urban school settings.

As the interviews are conducted, the researcher pays close attention to the depth and authenticity of responses. After the interviews, member checking could be employed, where participants review the researcher's interpretation of their responses to ensure that their experiences and perspectives have been accurately captured. This strategy helps in affirming the study's construct validity, ensuring that the abstract concept of "experiences with classroom management" has been truthfully and adequately represented.

In this example, we can see that while the interview study is rooted in qualitative methods and subjective experiences, the principles of reliability and validity can still meaningfully inform the research process. They serve as guides to ensure the research's findings are both dependable and genuinely reflective of the participants' experiences.

Ensuring validity and reliability in research, irrespective of its qualitative or quantitative nature, is pivotal to producing results that are both trustworthy and robust. Here's how you can integrate these concepts into your study to ensure its rigor:

Reliability is about consistency. One of the most straightforward ways to gauge it in quantitative research is using test-retest reliability. It involves administering the same test to the same group of participants on two separate occasions and then comparing the results.

A high degree of similarity between the two sets of results indicates good reliability. This can often be measured using a correlation coefficient, where a value closer to 1 indicates a strong positive consistency between the two test iterations.

Validity, on the other hand, ensures that the research genuinely measures what it intends to. There are various forms of validity to consider. Convergent validity ensures that two measures of the same construct, or measures that should theoretically be related, are indeed correlated. For example, two different measures assessing self-esteem should show similar results for the same group, highlighting that they are measuring the same underlying construct.

Face validity is the most basic form of validity and is gauged by the sheer appearance of the measurement tool. If, at face value, a test seems like it measures what it claims to, it has face validity. This is often the first step and is usually followed by more rigorous forms of validity testing.

Criterion-related validity, a subtype of the previously discussed criterion validity, evaluates how well the outcomes of a particular test or measurement correlate with another related measure. For example, if a new tool is developed to measure reading comprehension, its results can be compared with those of an established reading comprehension test to assess its criterion-related validity. If the results show a strong correlation, it's a sign that the new tool is valid.

Ensuring both validity and reliability requires deliberate planning, meticulous testing, and constant reflection on the study's methods and results. This might involve using established scales or measures with proven validity and reliability, conducting pilot studies to refine measurement tools, and always staying cognizant of the fact that these two concepts are important considerations for research robustness.

While reliability and validity are foundational concepts in many traditional research paradigms, they have not escaped scrutiny, especially from critical and poststructuralist perspectives. These critiques often arise from the fundamental philosophical differences in how knowledge, truth, and reality are perceived and constructed.

From a poststructuralist viewpoint, the very pursuit of a singular "truth" or an objective reality is questionable. In such a perspective, multiple truths exist, each shaped by its own socio-cultural, historical, and individual contexts.

Reliability, with its emphasis on consistent replication, might then seem at odds with this understanding. If truths are multiple and shifting, how can consistency across repeated measures or observations be a valid measure of anything other than the research instrument's stability?

Validity, too, faces critique. In seeking to ensure that a study measures what it purports to measure, there's an implicit assumption of an observable, knowable reality. Poststructuralist critiques question this foundation, arguing that reality is too fluid, multifaceted, and influenced by power dynamics to be pinned down by any singular measurement or representation.

Moreover, the very act of determining "validity" often requires an external benchmark or "gold standard." This brings up the issue of who determines this standard and the power dynamics and potential biases inherent in such decisions.

Another point of contention is the way these concepts can inadvertently prioritize certain forms of knowledge over others. For instance, privileging research that meets stringent reliability and validity criteria might marginalize more exploratory, interpretive, or indigenous research methods. These methods, while offering deep insights, might not align neatly with traditional understandings of reliability and validity, potentially relegating them to the periphery of "accepted" knowledge production.

To be sure, reliability and validity serve as guiding principles in many research approaches. However, it's essential to recognize their limitations and the critiques posed by alternative epistemologies. Engaging with these critiques doesn't diminish the value of reliability and validity but rather enriches our understanding of the multifaceted nature of knowledge and the complexities of its pursuit.


Reliability – Types, Examples and Guide


Reliability

Definition:

Reliability refers to the consistency, dependability, and trustworthiness of a system, process, or measurement to perform its intended function or produce consistent results over time. It is a desirable characteristic in various domains, including engineering, manufacturing, software development, and data analysis.

Reliability In Engineering

In engineering and manufacturing, reliability refers to the ability of a product, equipment, or system to function without failure or breakdown under normal operating conditions for a specified period. A reliable system consistently performs its intended functions, meets performance requirements, and withstands various environmental factors, stress, or wear and tear.

Reliability In Software Development

In software development, reliability relates to the stability and consistency of software applications or systems. A reliable software program operates consistently without crashing, produces accurate results, and handles errors or exceptions gracefully. Reliability is often measured by metrics such as mean time between failures (MTBF) and mean time to repair (MTTR).
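
As a rough, illustrative example of those two metrics (the figures below are invented), MTBF is typically computed as total operating time divided by the number of failures, and MTTR as total repair time divided by the number of repairs:

```python
# Illustrative sketch: MTBF and MTTR for a software service. All figures are made up.
total_uptime_hours = 2000.0    # time the service was operational
total_repair_hours = 10.0      # cumulative time spent restoring service
failure_count = 4              # number of failures in the same period

mtbf = total_uptime_hours / failure_count   # mean time between failures
mttr = total_repair_hours / failure_count   # mean time to repair

print(f"MTBF: {mtbf:.0f} hours, MTTR: {mttr:.1f} hours")
```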

Reliability In Data Analysis and Statistics

In data analysis and statistics, reliability refers to the consistency and repeatability of measurements or assessments. For example, if a measurement instrument consistently produces similar results when measuring the same quantity or if multiple raters consistently agree on the same assessment, it is considered reliable. Reliability is often assessed using statistical measures such as test-retest reliability, inter-rater reliability, or internal consistency.

Research Reliability

Research reliability refers to the consistency, stability, and repeatability of research findings. It indicates the extent to which a research study produces consistent and dependable results when conducted under similar conditions. In other words, research reliability assesses whether the same results would be obtained if the study were replicated with the same methodology, sample, and context.

What Affects Reliability in Research

Several factors can affect the reliability of research measurements and assessments. Here are some common factors that can impact reliability:

Measurement Error

Measurement error refers to the variability or inconsistency in the measurements that is not due to the construct being measured. It can arise from various sources, such as the limitations of the measurement instrument, environmental factors, or the characteristics of the participants. Measurement error reduces the reliability of the measure by introducing random variability into the data.

Rater/Observer Bias

In studies involving subjective assessments or ratings, the biases or subjective judgments of the raters or observers can affect reliability. If different raters interpret and evaluate the same phenomenon differently, it can lead to inconsistencies in the ratings, resulting in lower inter-rater reliability.

Participant Factors

Characteristics or factors related to the participants themselves can influence reliability. For example, factors such as fatigue, motivation, attention, or mood can introduce variability in responses, affecting the reliability of self-report measures or performance assessments.

Instrumentation

The quality and characteristics of the measurement instrument can impact reliability. If the instrument lacks clarity, has ambiguous items or instructions, or is prone to measurement errors, it can decrease the reliability of the measure. Poorly designed or unreliable instruments can introduce measurement error and decrease the consistency of the measurements.

Sample Size

Sample size can affect reliability, especially in studies where the reliability coefficient is based on correlations or variability within the sample. A larger sample size generally provides more stable estimates of reliability, while smaller samples can yield less precise estimates.

Time Interval

The time interval between test administrations can impact test-retest reliability. If the time interval is too short, participants may recall their previous responses and answer in a similar manner, artificially inflating the reliability coefficient. On the other hand, if the time interval is too long, true changes in the construct being measured may occur, leading to lower test-retest reliability.

Content Sampling

The specific items or questions included in a measure can affect reliability. If the measure does not adequately sample the full range of the construct being measured or if the items are too similar or redundant, it can result in lower internal consistency reliability.

Scoring and Data Handling

Errors in scoring, data entry, or data handling can introduce variability and impact reliability. Inaccurate or inconsistent scoring procedures, data entry mistakes, or mishandling of missing data can affect the reliability of the measurements.

Context and Environment

The context and environment in which measurements are obtained can influence reliability. Factors such as noise, distractions, lighting conditions, or the presence of others can introduce variability and affect the consistency of the measurements.

Types of Reliability

There are several types of reliability that are commonly discussed in research and measurement contexts. Here are some of the main types of reliability:

Test-Retest Reliability

This type of reliability assesses the consistency of a measure over time. It involves administering the same test or measure to the same group of individuals on two separate occasions and then comparing the results. If the scores are similar or highly correlated across the two testing points, it indicates good test-retest reliability.

Inter-Rater Reliability

Inter-rater reliability examines the degree of agreement or consistency between different raters or observers who are assessing the same phenomenon. It is commonly used in subjective evaluations or assessments where judgments are made by multiple individuals. High inter-rater reliability suggests that different observers are likely to reach the same conclusions or make consistent assessments.

Internal Consistency Reliability

Internal consistency reliability assesses the extent to which the items or questions within a measure are consistent with each other. It is commonly measured using techniques such as Cronbach’s alpha. High internal consistency reliability indicates that the items within a measure are measuring the same construct or concept consistently.
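
For readers who want to see the calculation, here is a minimal sketch of Cronbach’s alpha computed by hand with NumPy on an invented response matrix (rows are respondents, columns are items); real analyses would typically use a dedicated statistics package.

```python
# Illustrative sketch: Cronbach's alpha for a 4-item scale. Responses are made up.
import numpy as np

scores = np.array([
    [4, 5, 4, 4],
    [3, 3, 4, 3],
    [5, 5, 5, 4],
    [2, 3, 2, 3],
    [4, 4, 5, 4],
])

k = scores.shape[1]                               # number of items
item_variances = scores.var(axis=0, ddof=1)       # variance of each item
total_variance = scores.sum(axis=1).var(ddof=1)   # variance of the summed scale

alpha = (k / (k - 1)) * (1 - item_variances.sum() / total_variance)
print(f"Cronbach's alpha: {alpha:.2f}")
```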

Parallel Forms Reliability

Parallel forms reliability assesses the consistency of different versions or forms of a test that are intended to measure the same construct. Two equivalent versions of a test are administered to the same group of individuals, and the scores are compared to determine the level of agreement between the forms.

Split-Half Reliability

Split-half reliability involves splitting a measure into two halves and examining the consistency between the two halves. It can be done by dividing the items into odd-even pairs or by randomly splitting the items. The scores from the two halves are then compared to assess the degree of consistency.

Alternate Forms Reliability

Alternate forms reliability is similar to parallel forms reliability, but it involves administering two different versions of a test to the same group of individuals. The two forms should be equivalent and measure the same construct. The scores from the two forms are then compared to assess the level of agreement.

Applications of Reliability

Reliability has several important applications across various fields and disciplines. Here are some common applications of reliability:

Psychological and Educational Testing

Reliability is crucial in psychological and educational testing to ensure that the scores obtained from assessments are consistent and dependable. It helps to determine the accuracy and stability of measures such as intelligence tests, personality assessments, academic exams, and aptitude tests.

Market Research

In market research, reliability is important for ensuring consistent and dependable data collection. Surveys, questionnaires, and other data collection instruments need to have high reliability to obtain accurate and consistent responses from participants. Reliability analysis helps researchers identify and address any issues that may affect the consistency of the data.

Health and Medical Research

Reliability is essential in health and medical research to ensure that measurements and assessments used in studies are consistent and trustworthy. This includes the reliability of diagnostic tests, patient-reported outcome measures, observational measures, and psychometric scales. High reliability is crucial for making valid inferences and drawing reliable conclusions from research findings.

Quality Control and Manufacturing

Reliability analysis is widely used in industries such as manufacturing and quality control to assess the reliability of products and processes. It helps to identify and address sources of variation and inconsistency, ensuring that products meet the required standards and specifications consistently.

Social Science Research

Reliability plays a vital role in social science research, including fields such as sociology, anthropology, and political science. It is used to assess the consistency of measurement tools, such as surveys or observational protocols, to ensure that the data collected is reliable and can be trusted for analysis and interpretation.

Performance Evaluation

Reliability is important in performance evaluation systems used in organizations and workplaces. Whether it’s assessing employee performance, evaluating the reliability of scoring rubrics, or measuring the consistency of ratings by supervisors, reliability analysis helps ensure fairness and consistency in the evaluation process.

Psychometrics and Scale Development

Reliability analysis is a fundamental step in psychometrics, which involves developing and validating measurement scales. Researchers assess the reliability of items and subscales to ensure that the scale measures the intended construct consistently and accurately.

Examples of Reliability

Here are some examples of reliability in different contexts:

Test-Retest Reliability Example: A researcher administers a personality questionnaire to a group of participants and then administers the same questionnaire to the same participants after a certain period, such as two weeks. The scores obtained from the two administrations are highly correlated, indicating good test-retest reliability.

Inter-Rater Reliability Example: Multiple teachers assess the essays of a group of students using a standardized grading rubric. The ratings assigned by the teachers show a high level of agreement or correlation, indicating good inter-rater reliability.

Internal Consistency Reliability Example: A researcher develops a questionnaire to measure job satisfaction. The researcher administers the questionnaire to a group of employees and calculates Cronbach’s alpha to assess internal consistency. The calculated value of Cronbach’s alpha is high (e.g., above 0.8), indicating good internal consistency reliability.

Parallel Forms Reliability Example: Two versions of a mathematics exam are created, which are designed to measure the same mathematical skills. Both versions of the exam are administered to the same group of students, and the scores from the two versions are highly correlated, indicating good parallel forms reliability.

Split-Half Reliability Example: A researcher develops a survey to measure self-esteem. The survey consists of 20 items, and the researcher randomly divides the items into two halves. The scores obtained from each half of the survey show a high level of agreement or correlation, indicating good split-half reliability.

Alternate Forms Reliability Example: A researcher develops two versions of a language proficiency test, which are designed to measure the same language skills. Both versions of the test are administered to the same group of participants, and the scores from the two versions are highly correlated, indicating good alternate forms reliability.

Where to Write About Reliability in A Thesis

When writing about reliability in a thesis, there are several sections where you can address this topic. Here are some common sections in a thesis where you can discuss reliability:

Introduction :

In the introduction section of your thesis, you can provide an overview of the study and briefly introduce the concept of reliability. Explain why reliability is important in your research field and how it relates to your study objectives.

Theoretical Framework:

If your thesis includes a theoretical framework or a literature review, this is a suitable section to discuss reliability. Provide an overview of the relevant theories, models, or concepts related to reliability in your field. Discuss how other researchers have measured and assessed reliability in similar studies.

Methodology:

The methodology section is crucial for addressing reliability. Describe the research design, data collection methods, and measurement instruments used in your study. Explain how you ensured the reliability of your measurements or data collection procedures. This may involve discussing pilot studies, inter-rater reliability, test-retest reliability, or other techniques used to assess and improve reliability.

Data Analysis:

In the data analysis section, you can discuss the statistical techniques employed to assess the reliability of your data. This might include measures such as Cronbach’s alpha, Cohen’s kappa, or intraclass correlation coefficients (ICC), depending on the nature of your data and research design. Present the results of reliability analyses and interpret their implications for your study.

Discussion:

In the discussion section, analyze and interpret the reliability results in relation to your research findings and objectives. Discuss any limitations or challenges encountered in establishing or maintaining reliability in your study. Consider the implications of reliability for the validity and generalizability of your results.

Conclusion:

In the conclusion section, summarize the main points discussed in your thesis regarding reliability. Emphasize the importance of reliability in research and highlight any recommendations or suggestions for future studies to enhance reliability.

Importance of Reliability

Reliability is of utmost importance in research, measurement, and various practical applications. Here are some key reasons why reliability is important:

  • Consistency : Reliability ensures consistency in measurements and assessments. Consistent results indicate that the measure or instrument is stable and produces similar outcomes when applied repeatedly. This consistency allows researchers and practitioners to have confidence in the reliability of the data collected and the conclusions drawn from it.
  • Accuracy : Reliability is closely linked to accuracy. A reliable measure produces results that are close to the true value or state of the phenomenon being measured. When a measure is unreliable, it introduces error and uncertainty into the data, which can lead to incorrect interpretations and flawed decision-making.
  • Trustworthiness : Reliability enhances the trustworthiness of measurements and assessments. When a measure is reliable, it indicates that it is dependable and can be trusted to provide consistent and accurate results. This is particularly important in fields where decisions and actions are based on the data collected, such as education, healthcare, and market research.
  • Comparability : Reliability enables meaningful comparisons between different groups, individuals, or time points. When measures are reliable, differences or changes observed can be attributed to true differences in the underlying construct, rather than measurement error. This allows for valid comparisons and evaluations, both within a study and across different studies.
  • Validity : Reliability is a prerequisite for validity. Validity refers to the extent to which a measure or assessment accurately captures the construct it is intended to measure. If a measure is unreliable, it cannot be valid, as it does not consistently reflect the construct of interest. Establishing reliability is an important step in establishing the validity of a measure.
  • Decision-making : Reliability is crucial for making informed decisions based on data. Whether it’s evaluating employee performance, diagnosing medical conditions, or conducting research studies, reliable measurements and assessments provide a solid foundation for decision-making processes. They help to reduce uncertainty and increase confidence in the conclusions drawn from the data.
  • Quality Assurance : Reliability is essential for maintaining quality assurance in various fields. It allows organizations to assess and monitor the consistency and dependability of their processes, products, and services. By ensuring reliability, organizations can identify areas of improvement, address sources of variation, and deliver consistent and high-quality outcomes.

Limitations of Reliability

Here are some limitations of reliability:

  • Limited to consistency: Reliability primarily focuses on the consistency of measurements and findings. However, it does not guarantee the accuracy or validity of the measurements. A measurement can be consistent but still systematically biased or flawed, leading to inaccurate results. Reliability alone cannot address validity concerns.
  • Context-dependent: Reliability can be influenced by the specific context, conditions, or population under study. A measurement or instrument that demonstrates high reliability in one context may not necessarily exhibit the same level of reliability in a different context. Researchers need to consider the specific characteristics and limitations of their study context when interpreting reliability.
  • Inadequate for complex constructs: Reliability is often based on the assumption of unidimensionality, which means that a measurement instrument is designed to capture a single construct. However, many real-world phenomena are complex and multidimensional, making it challenging to assess reliability accurately. Reliability measures may not adequately capture the full complexity of such constructs.
  • Susceptible to systematic errors: Reliability focuses on minimizing random errors, but it may not detect or address systematic errors or biases in measurements. Systematic errors can arise from flaws in the measurement instrument, data collection procedures, or sample selection. Reliability assessments may not fully capture or address these systematic errors, leading to biased or inaccurate results.
  • Relies on assumptions: Reliability assessments often rely on certain assumptions, such as the assumption of measurement invariance or the assumption of stable conditions over time. These assumptions may not always hold true in real-world research settings, particularly when studying dynamic or evolving phenomena. Failure to meet these assumptions can compromise the reliability of the research.
  • Limited to quantitative measures: Reliability is typically applied to quantitative measures and instruments, which can be problematic when studying qualitative or subjective phenomena. Reliability measures may not fully capture the richness and complexity of qualitative data, limiting their applicability in certain research domains.

Also see Reliability Vs Validity


Reliability In Psychology Research: Definitions & Examples

Saul McLeod, PhD, and Olivia Guy-Evans, MSc (Simply Psychology)

Reliability in psychology research refers to the reproducibility or consistency of measurements. Specifically, it is the degree to which a measurement instrument or procedure yields the same results on repeated trials. A measure is considered reliable if it produces consistent scores across different instances when the underlying thing being measured has not changed.

Reliability ensures that responses are consistent across times and occasions for instruments like questionnaires . Multiple forms of reliability exist, including test-retest, inter-rater, and internal consistency.

For example, if people weigh themselves during the day, they would expect to see a similar reading. Scales that measured weight differently each time would be of little use.

The same analogy could be applied to a tape measure that measures inches differently each time it is used. It would not be considered reliable.

If findings from research are replicated consistently, they are reliable. A correlation coefficient can be used to assess the degree of reliability. If a test is reliable, it should show a high positive correlation.

Of course, it is unlikely the same results will be obtained each time as participants and situations vary. Still, a strong positive correlation between the same test results indicates reliability.

Reliability is important because unreliable measures introduce random error that attenuates correlations and makes it harder to detect real relationships.

Ensuring high reliability for key measures in psychology research helps boost the sensitivity, validity, and replicability of studies. Estimating and reporting reliable evidence is considered an important methodological practice.

There are two types of reliability: internal and external.
  • Internal reliability refers to how consistently different items within a single test measure the same concept or construct. It ensures that a test is stable across its components.
  • External reliability measures how consistently a test produces similar results over repeated administrations or under different conditions. It ensures that a test is stable over time and situations.
Some key aspects of reliability in psychology research include:
  • Test-retest reliability : The consistency of scores for the same person across two or more separate administrations of the same measurement procedure over time. High test-retest reliability suggests the measure provides a stable, reproducible score.
  • Interrater reliability : The level of agreement in scores on a measure between different raters or observers rating the same target. High interrater reliability suggests the ratings are objective and not overly influenced by rater subjectivity or bias.
  • Internal consistency reliability : The degree to which different test items or parts of an instrument that measure the same construct yield similar results. Analyzed statistically using Cronbach’s alpha, a high value suggests the items measure the same underlying concept.

Test-Retest Reliability

The test-retest method assesses the external consistency of a test. Examples of appropriate tests include questionnaires and psychometric tests. It measures the stability of a test over time.

A typical assessment would involve giving participants the same test on two separate occasions. If the same or similar results are obtained, then external reliability is established.

Here’s how it works:

  • A test or measurement is administered to participants at one point in time.
  • After a certain period, the same test is administered again to the same participants without any intervention or treatment in between.
  • The scores from the two administrations are then correlated using a statistical method, often Pearson’s correlation.
  • A high correlation between the scores from the two test administrations indicates good test-retest reliability, suggesting the test yields consistent results over time.
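As an illustration of the correlation step above, the following is a minimal sketch that computes a test-retest coefficient with Pearson's correlation. The participant scores for the two administrations are hypothetical.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical questionnaire scores for the same ten participants, two weeks apart
time1 = np.array([24, 31, 18, 27, 35, 22, 29, 16, 33, 25])
time2 = np.array([26, 30, 20, 25, 36, 21, 31, 17, 32, 27])

r, p = pearsonr(time1, time2)
print(f"Test-retest reliability (Pearson r) = {r:.2f}")  # values near 1 suggest stable scores
```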

This method is especially useful for tests that measure stable traits or characteristics that aren’t expected to change over short periods.

The disadvantage of the test-retest method is that it takes a long time for results to be obtained. The reliability can be influenced by the time interval between tests and any events that might affect participants’ responses during this interval.

Beck et al. (1996) studied the responses of 26 outpatients across two therapy sessions one week apart and found a correlation of .93, demonstrating high test-retest reliability of the depression inventory.

This is an example of why reliability in psychological research is necessary: if such tests were not reliable, some individuals might not be successfully diagnosed with disorders such as depression and, consequently, would not be given appropriate therapy.

The timing of the test is important; if the duration is too brief, then participants may recall information from the first test, which could bias the results.

Alternatively, if the duration is too long, it is feasible that the participants could have changed in some important way which could also bias the results.

Inter-rater reliability refers to the degree to which different raters give consistent estimates of the same behavior. It can be used for interviews.

Inter-Rater Reliability

Inter-rater reliability, often termed inter-observer reliability, refers to the extent to which different raters or evaluators agree in assessing a particular phenomenon, behavior, or characteristic. It’s a measure of consistency and agreement between individuals scoring or evaluating the same items or behaviors.

High inter-rater reliability indicates that the findings or measurements are consistent across different raters, suggesting the results are not due to random chance or subjective biases of individual raters.

Statistical measures, such as Cohen’s Kappa or the Intraclass Correlation Coefficient (ICC), are often employed to quantify the level of agreement between raters, helping to ensure that findings are objective and reproducible.
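As a sketch of how such an agreement statistic might be obtained in practice, the example below computes Cohen's kappa for two raters' categorical codes using scikit-learn's cohen_kappa_score; the ratings themselves are hypothetical.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical codes assigned by two raters to the same ten observations
rater_a = ["aggressive", "calm", "calm", "aggressive", "calm",
           "calm", "aggressive", "calm", "aggressive", "calm"]
rater_b = ["aggressive", "calm", "calm", "calm", "calm",
           "calm", "aggressive", "calm", "aggressive", "calm"]

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa = {kappa:.2f}")  # 1 = perfect agreement, 0 = chance-level agreement
```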

Ensuring high inter-rater reliability is essential, especially in studies involving subjective judgment or observations, as it provides confidence that the findings are replicable and not heavily influenced by individual rater biases.

Note it can also be called inter-observer reliability when referring to observational research. Here, researchers observe the same behavior independently (to avoid bias) and compare their data. If the data is similar, then it is reliable.

Where observer scores do not correlate significantly, reliability can be improved by:

  • Training observers in the observation techniques and ensuring everyone agrees with them.
  • Ensuring behavior categories have been operationalized, meaning they have been objectively defined.
For example, if two researchers are observing ‘aggressive behavior’ of children at a nursery, they would each have their own subjective opinion regarding what aggression comprises.

In this scenario, they would be unlikely to record aggressive behavior the same, and the data would be unreliable.

However, if they were to operationalize the behavior category of aggression, this would be more objective and make it easier to identify when a specific behavior occurs.

For example, while “aggressive behavior” is subjective and not operationalized, “pushing” is objective and operationalized. Thus, researchers could count how many times children push each other over a certain duration of time.

Internal Consistency Reliability

Internal consistency reliability refers to how well different items on a test or survey that are intended to measure the same construct produce similar scores.

For example, a questionnaire measuring depression may have multiple questions tapping issues like sadness, changes in sleep and appetite, fatigue, and loss of interest. The assumption is that people’s responses across these different symptom items should be fairly consistent.

Cronbach’s alpha is a common statistic used to quantify internal consistency reliability. It is computed from the average inter-item correlation and the number of items on the test. Values range from 0 to 1, with higher values indicating greater internal consistency. A good rule of thumb is that alpha should generally be above .70 to suggest adequate reliability.

An alpha of .90 for a depression questionnaire, for example, means there is a high average correlation between respondents’ scores on the different symptom items.

This suggests all the items are measuring the same underlying construct (depression) in a consistent manner. It taps the unidimensionality of the scale – evidence it is measuring one thing.

If some items were unrelated to others, the average inter-item correlations would be lower, resulting in a lower alpha. This would indicate the presence of multiple dimensions in the scale, rather than a unified single concept.

So, in summary, high internal consistency reliability evidenced through high Cronbach’s alpha provides support for the fact that various test items successfully tap into the same latent variable the researcher intends to measure. It suggests the items meaningfully cohere together to reliably measure that construct.
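To make the calculation concrete, here is a minimal sketch of how Cronbach's alpha could be computed from a respondents-by-items score matrix, using the standard variance-based formula. The data and items are hypothetical.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for a (respondents x items) score matrix."""
    k = items.shape[1]                               # number of items
    item_variances = items.var(axis=0, ddof=1)       # variance of each item
    total_variance = items.sum(axis=1).var(ddof=1)   # variance of the total scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Five respondents answering four depression items on a 1-5 scale (hypothetical data)
scores = np.array([
    [4, 4, 5, 4],
    [2, 3, 2, 2],
    [5, 4, 5, 5],
    [3, 3, 3, 2],
    [1, 2, 1, 2],
])
print(f"Cronbach's alpha: {cronbach_alpha(scores):.2f}")
```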

Split-Half Method

The split-half method assesses the internal consistency of a test, such as a psychometric test or questionnaire.

It measures the extent to which all parts of the test contribute equally to what is being measured.

The split-half approach provides another method of quantifying internal consistency by taking advantage of the natural variation when a single test is divided in half.

It’s somewhat cumbersome to implement but avoids some limitations associated with Cronbach’s alpha. However, alpha remains much more widely used in practice due to its relative ease of calculation. Here’s how it works:

  • A test or questionnaire is split into two halves, typically by separating even-numbered items from odd-numbered items, or first-half items vs. second-half.
  • Each half is scored separately, and the scores are correlated using a statistical method, often Pearson’s correlation.
  • The correlation between the two halves gives an indication of the test’s reliability. A higher correlation suggests better reliability.
  • To adjust for the test’s shortened length (because we’ve split it in half), the Spearman-Brown prophecy formula is often applied to estimate the reliability of the full test based on the split-half reliability.
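The sketch below illustrates these steps on simulated data: a hypothetical 20-item self-esteem survey is split into odd- and even-numbered items, the two halves are correlated, and the Spearman-Brown formula is applied to estimate full-test reliability.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
true_score = rng.normal(size=(50, 1))                          # latent trait for 50 respondents
responses = true_score + rng.normal(scale=0.8, size=(50, 20))  # 20 noisy items (hypothetical)

odd_half = responses[:, 0::2].sum(axis=1)    # items 1, 3, 5, ...
even_half = responses[:, 1::2].sum(axis=1)   # items 2, 4, 6, ...

r_half, _ = pearsonr(odd_half, even_half)
r_full = 2 * r_half / (1 + r_half)           # Spearman-Brown prophecy formula
print(f"Split-half r = {r_half:.2f}, estimated full-test reliability = {r_full:.2f}")
```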

The reliability of a test could be improved by using this method. For example, any items on separate halves of a test with a low correlation (e.g., r = .25) should either be removed or rewritten.

The split-half method is a quick and easy way to establish reliability. However, it can only be effective with large questionnaires in which all questions measure the same construct. This means it would not be appropriate for tests that measure different constructs.

For example, the Minnesota Multiphasic Personality Inventory has subscales measuring different behaviors, such as depression, schizophrenia, and social introversion. Therefore, the split-half method would not be an appropriate way to assess the reliability of this personality test.

Validity vs. Reliability In Psychology

In psychology, validity and reliability are fundamental concepts that assess the quality of measurements.

  • Validity refers to the degree to which a measure accurately assesses the specific concept, trait, or construct that it claims to be assessing. It refers to the truthfulness of the measure.
  • Reliability refers to the overall consistency, stability, and repeatability of a measurement. It is concerned with how much random error might be distorting scores or introducing unwanted “noise” into the data.

A key difference is that validity refers to what’s being measured, while reliability refers to how consistently it’s being measured.

An unreliable measure cannot be truly valid because if a measure gives inconsistent, unpredictable scores, it clearly isn’t measuring the trait or quality it aims to measure in a truthful, systematic manner. Establishing reliability provides the foundation for determining the measure’s validity.

A pivotal understanding is that reliability is a necessary but not sufficient condition for validity.

It means a test can be reliable, consistently producing the same results, without being valid, or accurately measuring the intended attribute.

However, a valid test, one that truly measures what it purports to, must be reliable. In the pursuit of rigorous psychological research, both validity and reliability are indispensable.

Ideally, researchers strive for both: validity, to make sure they are measuring the correct construct, and reliability, to make sure they are measuring it consistently and precisely. The two qualities are independent, but both are crucial elements of strong measurement procedures.


References

Beck, A. T., Steer, R. A., & Brown, G. K. (1996). Manual for the Beck Depression Inventory. San Antonio, TX: The Psychological Corporation.

Clifton, J. D. W. (2020). Managing validity versus reliability trade-offs in scale-building decisions. Psychological Methods, 25(3), 259–270. https://doi.org/10.1037/met0000236

Guttman, L. (1945). A basis for analyzing test-retest reliability. Psychometrika, 10(4), 255–282. https://doi.org/10.1007/BF02288892

Hathaway, S. R., & McKinley, J. C. (1943). Manual for the Minnesota Multiphasic Personality Inventory. New York: Psychological Corporation.

Jannarone, R. J., Macera, C. A., & Garrison, C. Z. (1987). Evaluating interrater agreement through “case-control” sampling. Biometrics, 43(2), 433–437. https://doi.org/10.2307/2531825

LeBreton, J. M., & Senter, J. L. (2008). Answers to 20 questions about interrater reliability and interrater agreement. Organizational Research Methods, 11(4), 815–852. https://doi.org/10.1177/1094428106296642

Watkins, M. W., & Pacheco, M. (2000). Interobserver agreement in behavioral research: Importance and calculation. Journal of Behavioral Education, 10, 205–212.


J Family Med Prim Care. 2015 Jul-Sep; 4(3).

Validity, reliability, and generalizability in qualitative research

Lawrence Leung

1 Department of Family Medicine, Queen's University, Kingston, Ontario, Canada

2 Centre of Studies in Primary Care, Queen's University, Kingston, Ontario, Canada

In general practice, qualitative research contributes as significantly as quantitative research, in particular regarding psycho-social aspects of patient-care, health services provision, policy setting, and health administrations. In contrast to quantitative research, qualitative research as a whole has been constantly critiqued, if not disparaged, by the lack of consensus for assessing its quality and robustness. This article illustrates with five published studies how qualitative research can impact and reshape the discipline of primary care, spiraling out from clinic-based health screening to community-based disease monitoring, evaluation of out-of-hours triage services to provincial psychiatric care pathways model and finally, national legislation of core measures for children's healthcare insurance. Fundamental concepts of validity, reliability, and generalizability as applicable to qualitative research are then addressed with an update on the current views and controversies.

Nature of Qualitative Research versus Quantitative Research

The essence of qualitative research is to make sense of and recognize patterns among words in order to build up a meaningful picture without compromising its richness and dimensionality. Like quantitative research, qualitative research aims to seek answers to questions of “how, where, when, who and why” with a perspective to build a theory or refute an existing theory. Unlike quantitative research, which deals primarily with numerical data and their statistical interpretations under a reductionist, logical and strictly objective paradigm, qualitative research handles nonnumerical information and its phenomenological interpretation, which inextricably ties in with human senses and subjectivity. While human emotions and perspectives from both subjects and researchers are considered undesirable biases confounding results in quantitative research, the same elements are considered essential and inevitable, if not treasurable, in qualitative research as they invariably add extra dimensions and colors to enrich the corpus of findings. However, the issue of subjectivity and contextual ramifications has fueled incessant controversies regarding yardsticks for quality and trustworthiness of qualitative research results for healthcare.

Impact of Qualitative Research upon Primary Care

In many ways, qualitative research contributes significantly, if not more so than quantitative research, to the field of primary care at various levels. Five qualitative studies are chosen to illustrate how various methodologies of qualitative research helped in advancing primary healthcare, from novel monitoring of chronic obstructive pulmonary disease (COPD) via mobile-health technology,[ 1 ] informed decision-making for colorectal cancer screening,[ 2 ] triaging out-of-hours GP services,[ 3 ] evaluating care pathways for community psychiatry[ 4 ] and finally prioritization of healthcare initiatives for legislation purposes at national levels.[ 5 ] With recent advances in information technology and mobile connected devices, self-monitoring and management of chronic diseases via tele-health technology may seem beneficial to both the patient and healthcare provider. Recruiting COPD patients who were given tele-health devices that monitored lung functions, Williams et al. [ 1 ] conducted phone interviews and analyzed the transcripts via a grounded theory approach, identifying themes which enabled them to conclude that such a mobile-health setup and application helped to engage patients with better adherence to treatment and overall improvement in mood. Such positive findings were in contrast to previous studies, which opined that elderly patients were often challenged by operating computer tablets,[ 6 ] or conversing with the tele-health software.[ 7 ] To explore the content of recommendations for colorectal cancer screening given out by family physicians, Wackerbarth et al. [ 2 ] conducted semi-structured interviews with subsequent content analysis and found that most physicians delivered information to enrich patient knowledge with little regard to patients’ true understanding, ideas, and preferences in the matter. These findings suggested room for improvement for family physicians to better engage their patients in recommending preventative care. Faced with various models of out-of-hours triage services for GP consultations, Egbunike et al. [ 3 ] conducted thematic analysis on semi-structured telephone interviews with patients and doctors in various urban, rural and mixed settings. They found that the efficiency of triage services remained a prime concern from both users and providers, among issues of access to doctors and unfulfilled/mismatched expectations from users, which could arouse dissatisfaction and legal implications. In the UK, a care pathways model for community psychiatry had been introduced but its benefits were unclear. Khandaker et al. [ 4 ] hence conducted a qualitative study using semi-structured interviews with medical staff and other stakeholders; adopting a grounded-theory approach, major themes emerged which included improved equality of access, more focused logistics, increased work throughput and better accountability for community psychiatry provided under the care pathway model. Finally, at the US national level, Mangione-Smith et al. [ 5 ] employed a modified Delphi method to gather consensus from a panel of nominators who were recognized experts and stakeholders in their disciplines, and identified a core set of quality measures for children’s healthcare under the Medicaid and Children’s Health Insurance Program. These core measures were made transparent for public opinion and later passed on for full legislation, hence illustrating the impact of qualitative research upon social welfare and policy improvement.

Overall Criteria for Quality in Qualitative Research

Given the diverse genera and forms of qualitative research, there is no consensus for assessing any piece of qualitative research work. Various approaches have been suggested, the two leading schools of thoughts being the school of Dixon-Woods et al. [ 8 ] which emphasizes on methodology, and that of Lincoln et al. [ 9 ] which stresses the rigor of interpretation of results. By identifying commonalities of qualitative research, Dixon-Woods produced a checklist of questions for assessing clarity and appropriateness of the research question; the description and appropriateness for sampling, data collection and data analysis; levels of support and evidence for claims; coherence between data, interpretation and conclusions, and finally level of contribution of the paper. These criteria foster the 10 questions for the Critical Appraisal Skills Program checklist for qualitative studies.[ 10 ] However, these methodology-weighted criteria may not do justice to qualitative studies that differ in epistemological and philosophical paradigms,[ 11 , 12 ] one classic example will be positivistic versus interpretivistic.[ 13 ] Equally, without a robust methodological layout, rigorous interpretation of results advocated by Lincoln et al. [ 9 ] will not be good either. Meyrick[ 14 ] argued from a different angle and proposed fulfillment of the dual core criteria of “transparency” and “systematicity” for good quality qualitative research. In brief, every step of the research logistics (from theory formation, design of study, sampling, data acquisition and analysis to results and conclusions) has to be validated if it is transparent or systematic enough. In this manner, both the research process and results can be assured of high rigor and robustness.[ 14 ] Finally, Kitto et al. [ 15 ] epitomized six criteria for assessing overall quality of qualitative research: (i) Clarification and justification, (ii) procedural rigor, (iii) sample representativeness, (iv) interpretative rigor, (v) reflexive and evaluative rigor and (vi) transferability/generalizability, which also double as evaluative landmarks for manuscript review to the Medical Journal of Australia. Same for quantitative research, quality for qualitative research can be assessed in terms of validity, reliability, and generalizability.

Validity

Validity in qualitative research means “appropriateness” of the tools, processes, and data: whether the research question is valid for the desired outcome, the choice of methodology is appropriate for answering the research question, the design is valid for the methodology, the sampling and data analysis are appropriate, and finally the results and conclusions are valid for the sample and context. In assessing validity of qualitative research, the challenge can start from the ontology and epistemology of the issue being studied, e.g. the concept of “individual” is seen differently between humanistic and positive psychologists due to differing philosophical perspectives:[ 16 ] Where humanistic psychologists believe “individual” is a product of existential awareness and social interaction, positive psychologists think the “individual” exists side-by-side with formation of any human being. Set off in different pathways, qualitative research regarding the individual's wellbeing will be concluded with varying validity. Choice of methodology must enable detection of findings/phenomena in the appropriate context for it to be valid, with due regard to cultural and contextual variability. For sampling, procedures and methods must be appropriate for the research paradigm and be distinctive between systematic,[ 17 ] purposeful[ 18 ] or theoretical (adaptive) sampling,[ 19 , 20 ] where systematic sampling has no a priori theory, purposeful sampling often has a certain aim or framework, and theoretical sampling is molded by the ongoing process of data collection and theory in evolution. For data extraction and analysis, several methods were adopted to enhance validity, including first-tier triangulation (of researchers) and second-tier triangulation (of resources and theories),[ 17 , 21 ] well-documented audit trail of materials and processes,[ 22 , 23 , 24 ] multidimensional analysis as concept- or case-orientated[ 25 , 26 ] and respondent verification.[ 21 , 27 ]

Reliability

In quantitative research, reliability refers to exact replicability of the processes and the results. In qualitative research with diverse paradigms, such a definition of reliability is challenging and epistemologically counter-intuitive. Hence, the essence of reliability for qualitative research lies with consistency.[ 24 , 28 ] A margin of variability for results is tolerated in qualitative research provided the methodology and epistemological logistics consistently yield data that are ontologically similar but may differ in richness and ambience within similar dimensions. Silverman[ 29 ] proposed five approaches to enhancing the reliability of process and results: refutational analysis, constant data comparison, comprehensive data use, inclusion of the deviant case and use of tables. As data were extracted from the original sources, researchers must verify their accuracy in terms of form and context with constant comparison,[ 27 ] either alone or with peers (a form of triangulation).[ 30 ] The scope and analysis of data included should be as comprehensive and inclusive as possible, with reference to quantitative aspects where applicable.[ 30 ] Adopting the Popperian dictum of falsifiability as the essence of truth and science, attempts to refute the qualitative data and analyses should be made in order to assess reliability.[ 31 ]

Generalizability

Most qualitative research studies, if not all, are meant to study a specific issue or phenomenon in a certain population or ethnic group, of a focused locality in a particular context, hence generalizability of qualitative research findings is usually not an expected attribute. However, with rising trend of knowledge synthesis from qualitative research via meta-synthesis, meta-narrative or meta-ethnography, evaluation of generalizability becomes pertinent. A pragmatic approach to assessing generalizability for qualitative studies is to adopt same criteria for validity: That is, use of systematic sampling, triangulation and constant comparison, proper audit and documentation, and multi-dimensional theory.[ 17 ] However, some researchers espouse the approach of analytical generalization[ 32 ] where one judges the extent to which the findings in one study can be generalized to another under similar theoretical, and the proximal similarity model, where generalizability of one study to another is judged by similarities between the time, place, people and other social contexts.[ 33 ] Thus said, Zimmer[ 34 ] questioned the suitability of meta-synthesis in view of the basic tenets of grounded theory,[ 35 ] phenomenology[ 36 ] and ethnography.[ 37 ] He concluded that any valid meta-synthesis must retain the other two goals of theory development and higher-level abstraction while in search of generalizability, and must be executed as a third level interpretation using Gadamer's concepts of the hermeneutic circle,[ 38 , 39 ] dialogic process[ 38 ] and fusion of horizons.[ 39 ] Finally, Toye et al. [ 40 ] reported the practicality of using “conceptual clarity” and “interpretative rigor” as intuitive criteria for assessing quality in meta-ethnography, which somehow echoed Rolfe's controversial aesthetic theory of research reports.[ 41 ]

Food for Thought

Despite various measures to enhance or ensure quality of qualitative studies, some researchers opined from a purist ontological and epistemological angle that qualitative research is not a unified, but ipso facto diverse field,[ 8 ] hence any attempt to synthesize or appraise different studies under one system is impossible and conceptually wrong. Barbour argued from a philosophical angle that these special measures or “technical fixes” (like purposive sampling, multiple-coding, triangulation, and respondent validation) can never confer the rigor as conceived.[ 11 ] In extremis, Rolfe et al. opined from the field of nursing research, that any set of formal criteria used to judge the quality of qualitative research are futile and without validity, and suggested that any qualitative report should be judged by the form it is written (aesthetic) and not by the contents (epistemic).[ 41 ] Rolfe's novel view is rebutted by Porter,[ 42 ] who argued via logical premises that two of Rolfe's fundamental statements were flawed: (i) “The content of research report is determined by their forms” may not be a fact, and (ii) that research appraisal being “subject to individual judgment based on insight and experience” will mean those without sufficient experience of performing research will be unable to judge adequately – hence an elitist's principle. From a realism standpoint, Porter then proposes multiple and open approaches for validity in qualitative research that incorporate parallel perspectives[ 43 , 44 ] and diversification of meanings.[ 44 ] Any work of qualitative research, when read by the readers, is always a two-way interactive process, such that validity and quality has to be judged by the receiving end too and not by the researcher end alone.

In summary, the three gold criteria of validity, reliability and generalizability apply in principle to assess quality for both quantitative and qualitative research, what differs will be the nature and type of processes that ontologically and epistemologically distinguish between the two.

Source of Support: Nil.

Conflict of Interest: None declared.

Validity, Accuracy and Reliability Explained with Examples

This is part of the NSW HSC science curriculum, under the Working Scientifically skills.

Part 1 – Validity

Part 2 – Accuracy

Part 3 – Reliability

Science experiments are an essential part of high school education, helping students understand key concepts and develop critical thinking skills. However, the value of an experiment lies in its validity, accuracy, and reliability. Let's break down these terms and explore how they can be improved and reduced, using simple experiments as examples.

Target Analogy to Understand Accuracy and Reliability

The target analogy is a classic way to understand the concepts of accuracy and reliability in scientific measurements and experiments. 


Accuracy refers to how close a measurement is to the true or accepted value. In the analogy, it's how close the arrows come to hitting the bullseye (represents the true or accepted value).

Reliability  refers to the consistency of a set of measurements. Reliable data can be reproduced under the same conditions. In the analogy, it's represented by how tightly the arrows are grouped together, regardless of whether they hit the bullseye. Therefore, we can have scientific results that are reliable but inaccurate.

  • Validity refers to how well an experiment investigates the aim or tests the underlying hypothesis. While validity is not represented in this target analogy, the validity of an experiment can sometimes be assessed by using the accuracy of results as a proxy. Experiments that produce accurate results are likely to be valid, as invalid experiments usually do not yield accurate results.

Validity refers to how well an experiment measures what it is supposed to measure and investigates the aim.

Ask yourself the questions:

  • "Is my experimental method and design suitable?"
  • "Is my experiment testing or investigating what it's suppose to?"


For example, if you're investigating the effect of the volume of water (independent variable) on plant growth, your experiment would be valid if you measure growth factors like height or leaf size (these would be your dependent variables).

However, validity entails more than just what's being measured. When assessing validity, you should also examine how well the experimental methodology investigates the aim of the experiment.

Assessing Validity

An experiment’s procedure, the subsequent methods of analysis of the data, the data itself, and the conclusion you draw from the data all have their own associated validities. It is important to understand this division because there are different factors to consider when assessing the validity of any single one of them. The validity of an experiment as a whole depends on the individual validities of these components.

When assessing the validity of the procedure , consider the following:

  • Does the procedure control all necessary variables except for the dependent and independent variables? That is, have you isolated the effect of the independent variable on the dependent variable?
  • Does this effect you have isolated actually address the aim and/or hypothesis?
  • Does your method include enough repetitions for a reliable result? (Read more about reliability below)

When assessing the validity of the method of analysis of the data , consider the following:

  • Does the analysis extrapolate or interpolate the experimental data? Generally, interpolation is valid, but extrapolation is invalid. This is because by extrapolating, you are ‘peering out into the darkness’ – just because your data showed a certain trend for a certain range does not mean that the trend will hold outside that range.
  • Does the analysis use accepted laws and mathematical relationships? That is, do the equations used for analysis have scientific or mathematical base? For example, `F = ma` is an accepted law in physics, but if in the analysis you made up a relationship like `F = ma^2` that has no scientific or mathematical backing, the method of analysis is invalid.
  • Is the most appropriate method of analysis used? Consider the differences between using a table and a graph. In a graph, you can use the gradient to minimise the effects of systematic errors and can also reduce the effect of random errors. The visual nature of a graph also allows you to easily identify outliers and potentially exclude them from analysis. This is why graphical analysis is generally more valid than using values from tables.

When assessing the validity of your results , consider the following: 

  • Is your primary data (data you collected from your own experiment) BOTH accurate and reliable? If not, it is invalid.
  • Are the secondary sources you may have used BOTH reliable and accurate?

When assessing the validity of your conclusion , consider the following:

  • Does your conclusion relate directly to the aim or the hypothesis?

How to Improve Validity

Ways of improving validity will differ across experiments. You must first identify what area(s) of the experiment’s validity is lacking (is it the procedure, analysis, results, or conclusion?). Then, you must come up with ways of overcoming the particular weakness. 

Below are some examples of this.

Example – Validity in Chemistry Experiment 

Let's say we want to measure the mass of carbon dioxide in a can of soft drink.

Heating a can of soft drink

The following steps are followed:

  • Weigh an unopened can of soft drink on an electronic balance.
  • Open the can.
  • Place the can on a hot plate until it begins to boil.
  • When cool, re-weigh the can to determine the mass loss.

To ensure this experiment is valid, we must establish controlled variables:

  • type of soft drink used
  • temperature at which this experiment is conducted
  • period of time before soft drink is re-weighed

Despite these controlled variables, this experiment is invalid because it actually doesn't help us measure the mass of carbon dioxide in the soft drink. This is because by heating the soft drink until it boils, we are also losing water due to evaporation. As a result, the mass loss measured is not only due to the loss of carbon dioxide, but also water. A simple way to improve the validity of this experiment is to not heat it; by simply opening the can of soft drink, carbon dioxide in the can will escape without loss of water.

Example – Validity in Physics Experiment

Let's say we want to measure the value of gravitational acceleration `g` using a simple pendulum system, and the following equation:

$$T = 2\pi \sqrt{\frac{l}{g}}$$

  • `T` is the period of oscillation
  • `l` is the length of string attached to the mass
  • `g` is the acceleration due to gravity

Pendulum practical

  • Cut a piece of a string or dental floss so that it is 1.0 m long.
  • Attach a 500.0 g mass of high density to the end of the string.
  • Attach the other end of the string to the retort stand using a clamp.
  • Starting at an angle of less than 10º, allow the pendulum to swing and measure the pendulum’s period for 10 oscillations using a stopwatch.
  • Repeat the experiment with 1.2 m, 1.5 m and 1.8 m strings.

The controlled variables we must establish in this experiment include:

  • mass used in the pendulum
  • location at which the experiment is conducted

The validity of this experiment depends on the starting angle of oscillation. The above equation (method of analysis) is only true for small angles (`\theta < 15^{\circ}`), such that `\sin \theta \approx \theta`. We also want to make sure the pendulum system has a small enough surface area to minimise the effect of air resistance on its oscillation.


In this instance, it would be invalid to use a single pair of values (length and period) to calculate the value of gravitational acceleration. A more appropriate method of analysis would be to plot period squared against length to obtain a linear relationship, then use the value of the gradient of the line of best fit to determine the value of `g`.
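Squaring both sides of the equation above gives a linear relationship between `T^2` and `l`:

$$T^2 = \frac{4\pi^2}{g} \, l$$

so the gradient of a `T^2` versus `l` plot equals `4\pi^2 / g`. The sketch below shows one way this analysis could be carried out; the period values are hypothetical measurements rather than data from the procedure above.

```python
import numpy as np

# Hypothetical measurements: string lengths (m) and measured periods (s)
lengths = np.array([1.0, 1.2, 1.5, 1.8])
periods = np.array([2.01, 2.19, 2.47, 2.70])

# Fit T^2 = (4*pi^2/g) * l and recover g from the gradient
slope, intercept = np.polyfit(lengths, periods**2, 1)
g_estimate = 4 * np.pi**2 / slope
print(f"Gradient = {slope:.3f} s^2/m, estimated g = {g_estimate:.2f} m/s^2")
```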

Accuracy refers to how close the experimental measurements are to the true value.

Accuracy depends on

  • the validity of the experiment
  • the degree of error:
  • systematic errors are those that are built into your experiment. That is, they affect every single one of your data points consistently, meaning that the cause of the error is always present. For example, it could be a badly calibrated temperature gauge that reports every reading 5 °C above the true value.
  • random errors are errors that occur inconsistently. For example, the temperature gauge readings might be affected by random fluctuations in room temperature. Some readings might be above the true value, some might then be below the true value.
  • sensitivity of equipment used.

Assessing Accuracy 

The effect of errors and insensitive equipment can both be captured by calculating the percentage error:

$$\text{\% error} = \frac{|\text{experimental value} - \text{true value}|}{\text{true value}} \times 100\%$$

Generally, measurements are considered accurate when the percentage error is less than 5%. You should always take the context of the experiment into account when assessing accuracy.
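A minimal sketch of this calculation, using a hypothetical pendulum period measurement compared against a calculated value:

```python
def percent_error(experimental: float, true_value: float) -> float:
    """Percentage error of a measurement relative to the true (accepted) value."""
    return abs(experimental - true_value) / abs(true_value) * 100

# Hypothetical: measured period of 2.06 s vs accepted value of 2.01 s
print(f"{percent_error(2.06, 2.01):.1f}% error")  # ~2.5%, within the 5% rule of thumb
```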

While accuracy and validity have different definitions, the two are closely related. Accurate results often suggest that the underlying experiment is valid, as invalid experiments are unlikely to produce accurate results.

In a simple pendulum experiment, if your measurements of the pendulum's period are close to the calculated value, your experiment is accurate. A table showing sample experimental measurements vs accepted values from using the equation above is shown below. 

[Table: sample experimental period measurements compared with accepted values calculated using the equation above.]

All experimental values in the table above are within 5% of the accepted (theoretical) values, so they are considered accurate.

How to Improve Accuracy

  • Remove systematic errors : for example, if the experiment’s measuring instruments are poorly calibrated, then you should correctly calibrate it before doing the experiment again.
  • Reduce the influence of random errors : this can be done by having more repetitions in the experiment and reporting the average values. This is because if you have enough of these random errors – some above the true value and some below the true value – averaging them will make them cancel each other out. This brings your average value closer and closer to the true value.
  • Use more sensitive equipment : for example, use a video recording to measure time by analysing the motion of an object frame by frame, instead of using a stopwatch. The sensitivity of a piece of equipment can be measured by its limit of reading . For example, stopwatches may only measure to the nearest millisecond – that is their limit of reading. But recordings can be analysed to the frame and, depending on the frame rate of the camera, this could mean measuring to the nearest microsecond.
  • Obtain more measurements over a wider range : in some cases, the relationship between two variables can be more accurately determined by testing over a wider range. For example, in the pendulum experiment, the periods for strings of various lengths can be measured. In this instance, repeating the experiment does not relate to reliability because we have changed the value of the independent variable tested.

Reliability

Reliability involves the consistency of your results over multiple trials.

Assessing Reliability

The reliability of an experiment can be broken down into the reliability of the procedure and the reliability of the final results.

The reliability of the procedure refers to how consistently the steps of your experiment produce similar results. For example, if an experiment produces the same values every time it is repeated, then it is highly reliable. This can be assessed quantitatively by looking at the spread of measurements, using statistical tests such as greatest deviation from the mean, standard deviations, or z-scores.
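As a sketch of how this spread could be quantified, the example below computes the sample standard deviation, the greatest deviation from the mean, and z-scores for three hypothetical repeated measurements:

```python
import numpy as np

trials = np.array([2.35, 2.41, 2.38])   # hypothetical repeated measurements (e.g. mass loss in g)
mean = trials.mean()
std_dev = trials.std(ddof=1)            # sample standard deviation
max_dev = np.abs(trials - mean).max()   # greatest deviation from the mean
z_scores = (trials - mean) / std_dev    # each trial's deviation in units of standard deviation

print(f"mean = {mean:.2f}, s = {std_dev:.3f}, greatest deviation = {max_dev:.3f}")
print("z-scores:", np.round(z_scores, 2))
```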

Ask yourself: "Is my result reproducible?"

The reliability of results cannot be assessed if there is only one data point or measurement obtained in the experiment. There must be at least 3. When you're repeating the experiment to assess the reliability of its results, you must follow the  same steps , use the  same value  for the independent variable. Results obtained from methods with different steps cannot be assessed for their reliability.

Obtaining only one measurement in an experiment is not enough because it could be affected by errors and have been produced due to pure chance. Repeating the experiment and obtaining the same or similar results will increase your confidence that the results are reproducible (therefore reliable).

In the soft drink experiment, reliability can be assessed by repeating the steps at least three times:

[Table: mass loss measured in three repeated trials of the soft drink experiment.]

The mass losses measured in all three trials are fairly consistent, suggesting that the reliability of the underlying method is high.

The reliability of the final results refers to how consistently your final data points (e.g. average values of repeated trials) point towards the same trend. That is, how close are they all to the trend line? This can be assessed quantitatively using the `R^2` value. The `R^2` value ranges between 0 and 1: a value of 0 suggests there is no correlation between data points, and a value of 1 suggests a perfect correlation with no variance from the trend line.

In the pendulum experiment, we can calculate the `R^2` value (done in Excel) by using the final average period values measured for each pendulum length.


Here, an `R^2` value of 0.9758 suggests the four average values are fairly close to the overall linear trend line (low variance from the trend line). Thus, the results are fairly reliable.
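The same `R^2` calculation the text describes doing in Excel can be reproduced directly. The sketch below fits a line to hypothetical averaged data and computes `R^2` from the residuals.

```python
import numpy as np

lengths = np.array([1.0, 1.2, 1.5, 1.8])            # pendulum lengths (m)
avg_period_sq = np.array([4.05, 4.72, 6.18, 7.22])  # hypothetical average T^2 values (s^2)

slope, intercept = np.polyfit(lengths, avg_period_sq, 1)
predicted = slope * lengths + intercept
ss_res = np.sum((avg_period_sq - predicted) ** 2)             # residual sum of squares
ss_tot = np.sum((avg_period_sq - avg_period_sq.mean()) ** 2)  # total sum of squares
r_squared = 1 - ss_res / ss_tot
print(f"R^2 = {r_squared:.4f}")  # values near 1 indicate low variance from the trend line
```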

How to Improve Reliability

A common misconception is that increasing the number of trials increases the reliability of the procedure . This is not true. The only way to increase the reliability of the procedure is to revise it. This could mean using instruments that are less susceptible to random errors, which cause measurements to be more variable.

Increasing the number of trials actually increases the reliability of the final results . This is because having more repetitions reduces the influence of random errors and brings the average values closer to the true values. Generally, the closer experimental values are to true values, the closer they are to the true trend. That is, accurate data points are generally reliable and all point towards the same trend.

Reliable but Inaccurate / Invalid

It is important to understand that results from an experiment can be reliable (consistent), yet inaccurate (deviating greatly from theoretical values) and/or invalid. In this case, your procedure is reliable, but your final results are likely not valid.

Examples of Reliability

Using the soft drink example again, if the mass losses measured for three soft drinks (same brand and type of drink) are consistent, then it's reliable. 

Using the pendulum example again, if you get similar period measurements every time you repeat the experiment, it’s reliable.  

However, in both cases, if the underlying methods are invalid, the consistent results would be invalid and inaccurate (despite being reliable).

Do you have trouble understanding validity, accuracy or reliability in your science experiment or depth study?

Consider getting personalised help from our 1-on-1 mentoring program !



Issues of validity and reliability in qualitative research

Helen Noble, School of Nursing and Midwifery, Queen's University Belfast, Belfast, UK

Joanna Smith, School of Human and Health Sciences, University of Huddersfield, Huddersfield, UK

Correspondence to Dr Helen Noble, School of Nursing and Midwifery, Queen's University Belfast, Medical Biology Centre, 97 Lisburn Rd, Belfast BT9 7BL, UK; helen.noble@qub.ac.uk

https://doi.org/10.1136/eb-2015-102054


Evaluating the quality of research is essential if findings are to be utilised in practice and incorporated into care delivery. In a previous article we explored ‘bias’ across research designs and outlined strategies to minimise bias. 1 The aim of this article is to further outline rigour, or the integrity in which a study is conducted, and ensure the credibility of findings in relation to qualitative research. Concepts such as reliability, validity and generalisability typically associated with quantitative research and alternative terminology will be compared in relation to their application to qualitative research. In addition, some of the strategies adopted by qualitative researchers to enhance the credibility of their research are outlined.

Are the terms reliability and validity relevant to ensuring credibility in qualitative research?

Although the tests and measures used to establish the validity and reliability of quantitative research cannot be applied to qualitative research, there are ongoing debates about whether terms such as validity, reliability and generalisability are appropriate to evaluate qualitative research.2–4 In the broadest context these terms are applicable, with validity referring to the integrity and application of the methods undertaken and the precision with which the findings accurately reflect the data, while reliability describes consistency within the employed analytical procedures.4 However, if qualitative methods are inherently different from quantitative methods in terms of philosophical positions and purpose, then alternative frameworks for establishing rigour are appropriate.3 Lincoln and Guba 5 offer alternative criteria for demonstrating rigour within qualitative research, namely truth value, consistency and neutrality, and applicability. Table 1 outlines the differences in terminology and criteria used to evaluate qualitative research.

Table 1: Terminology and criteria used to evaluate the credibility of research findings

What strategies can qualitative researchers adopt to ensure the credibility of the study findings?

Unlike quantitative researchers, who apply statistical methods for establishing validity and reliability of research findings, qualitative researchers aim to design and incorporate methodological strategies to ensure the ‘trustworthiness’ of the findings. Such strategies include:

  • Accounting for personal biases which may have influenced findings;6
  • Acknowledging biases in sampling and ongoing critical reflection of methods to ensure sufficient depth and relevance of data collection and analysis;3
  • Meticulous record keeping, demonstrating a clear decision trail and ensuring interpretations of data are consistent and transparent;3,4
  • Establishing a comparison case/seeking out similarities and differences across accounts to ensure different perspectives are represented;6,7
  • Including rich and thick verbatim descriptions of participants' accounts to support findings;7
  • Demonstrating clarity in terms of thought processes during data analysis and subsequent interpretations;3
  • Engaging with other researchers to reduce research bias;3
  • Respondent validation: includes inviting participants to comment on the interview transcript and whether the final themes and concepts created adequately reflect the phenomena being investigated;4
  • Data triangulation,3,4 whereby different methods and perspectives help produce a more comprehensive set of findings.8,9

Table 2 provides some specific examples of how some of these strategies were utilised to ensure rigour in a study that explored the impact of being a family carer to patients with stage 5 chronic kidney disease managed without dialysis. 10

Table 2: Strategies for enhancing the credibility of qualitative research

In summary, it is imperative that all qualitative researchers incorporate strategies to enhance the credibility of a study during research design and implementation. Although there is no universally accepted terminology and criteria used to evaluate qualitative research, we have briefly outlined some of the strategies that can enhance the credibility of study findings.



Competing interests None.


Open access | Published: 01 July 2024

Development and validation of the MultiScent-20 digital odour identification test using item response theory

  • Marcio Nakanishi 1   na1 ,
  • Pedro Renato de Paula Brandão 2 , 3   na1 ,
  • Gustavo Subtil Magalhães Freire 1 ,
  • Luis Gustavo do Amaral Vinha 4 ,
  • Marco Aurélio Fornazieri 5 ,
  • Wilma Terezinha Anselmo-Lima 6 ,
  • Danilo Assis Pereira 7 ,
  • Gustavo Henrique Campos de Sousa 2 ,
  • Claudia Galvão 8 &
  • Thomas Hummel 9  

Scientific Reports, volume 14, Article number: 15059 (2024)


  • Biomedical engineering
  • Neurological disorders
  • Neuroscience
  • Olfactory system
  • Respiratory tract diseases
  • Sensory processing

Although validated and reliable psychophysical tests of olfactory function are available, an easy-to-use and feasible test has yet to be developed. This study aimed to design a digital odour identification test, evaluate its validity, assess its reliability, establish a normative curve, and explore the impact of demographic factors. The odour identification test was presented with the Multiscent-20, a hand-held, tablet-like digital scent device that features an integrated odour digital delivery system. The identification performance on the 20 odours was assessed using item response theory (IRT). The normative curve was established by administering the test to a large sample of participants (n = 1299). The mean identification score was 17.5 (SD = 2.1). The two-parameter logistic IRT model provided the best fit, revealing variation in item discrimination and difficulty parameters. Educational attainment influenced performance, with primary education associated with lower scores. Additionally, sex was not found to be associated with performance. This study provides initial evidence supporting the validity and reliability of use of the Multiscent-20 as a digital odour identification test. The test’s automation and portability enable the standardized delivery of olfactory stimuli and efficient automatic recording and scoring of responses.

Introduction

Olfactory function assessment is vital for evaluating human health, providing information about an individual’s neurological and cognitive status. At present, there are reliable and validated psychophysical tests for olfactory function assessment in clinical practice, including the University of Pennsylvania Smell Identification Test (UPSIT) 1 , which employs microencapsulation of odorous liquids, and the Sniffin’ Sticks test 2 , 3 , 4 , which uses felt-tip pens infused with odorized fluids. While these tests are elegant and widely used, they have some limitations, such as the time required for testing, repeated use, cost, and the requirement of having a trained examiner conduct the tests. As a result, there is an increasing need for automated testing procedures capable of delivering stimuli, recording responses, and scoring assessments in a cost-effective manner. To overcome these challenges, we developed the Multiscent-20 (Noar, São Paulo, Brazil), a portable tablet computer with an integrated odour dispensing system; this device provides an efficient, mobile, and user-friendly method for olfactory function assessment.

A recent proof-of-concept study demonstrated the feasibility of using the Multiscent-20 to present an olfactory function assessment 5 . We enlisted 180 participants and compared performance on the Multiscent-20 with that on the 40-item UPSIT. The findings indicated that participants completed this test in a shorter duration and that there was a strong correlation between performance on the two tests. The Multiscent-20 showed high test–retest reliability and was regarded as easy to use. Additionally, due to the COVID-19 pandemic, recent years have seen a surge in interest in olfactory function evaluation and research, underscoring the need for a universally accepted, feasible assessment tool that is validated across diverse populations globally 6 , 7 . The creation of such a tool would substantially expedite cross-cultural research and ensure consistent evaluation criteria among diverse populations. To address this demand, we endeavoured to develop a new self-administered digital olfactory function test based on the classic 4-alternative forced-choice (AFC) paradigm to provide a universally applicable olfactory function assessment.

In this study, we describe the development, validation, and establishment of the normative curve of the Multiscent-20 olfactory identification test in its preliminary version. The selection of odours for this digital assessment was predicated upon the assumption that they embodied universally recognized and familiar odours across all five continents, thereby ensuring the potential to apply this assessment in a variety of cultures 6 .

Evaluating the reliability and identification performance on individual items in any psychophysical test is crucial to ensure the test’s overall validity. Item response theory (IRT) is a framework for examining these elements, modelling the relationship between an individual’s latent trait (olfactory ability, in this instance) and their responses to individual test items 8 . IRT enables assessment of the interplay between item parameters, such as difficulty and discrimination, in conjunction with the latent ability of tested individuals, thereby enhancing the precision of measurement and facilitating adjustments and adaptations for diverse test versions.

The objectives of this study are as follows:

To develop an odour identification test utilizing the Multiscent-20 that can effectively and adequately measure individual olfactory identification ability.

To assess performance on the 20 items of the Multiscent-20 through IRT analysis, ensuring accuracy and reliability, and examining item parameters such as discrimination and difficulty. Additionally, this assessment will gauge the quality of information generated by the test.

To establish the normative curve and investigate the influence of demographic factors on participant performance.

Participants

Table 1 displays the demographic and clinical data of a sample of 1832 individuals. The majority of the participants reported that they did not have a cold or flu at the time of testing (85%), difficulty smelling things (90%), or major memory issues (98%). Similarly, most participants reported that they were literate (99%) and had never had a head injury (98%). Among those who underwent COVID-19 testing, 31% tested positive, and 19% of those who tested positive reported a loss of smell. The majority of participants had completed secondary education (69%), followed by those who had completed tertiary education (23%), and those who had completed only primary education (8%). Most participants were male (55%). The mean age of the sample was 37 years, with the male participants exhibiting a slightly higher mean age than the female participants (38.4 vs. 34.8 years, respectively).

Dimensionality assessment

The unidimensionality of test results was evaluated through confirmatory factor analysis (CFA). The factor model fit statistics were robust. The chi-square test for the factor model yielded a significant result (χ²(170) = 245.189, p < 0.001), indicating a good fit. The model also demonstrated strong performance in various other fit indices, including the comparative fit index (CFI = 0.983), Tucker‒Lewis index (TLI = 0.981), and Bentler-Bonett normed fit index (NFI = 0.947). Moreover, the model exhibited a low root mean square error of approximation (RMSEA) value of 0.015, with a 90% confidence interval ranging from 0.011 to 0.020. Additionally, the standardized root mean square residual (SRMSR) was well within acceptable limits (0.067). In summary, the CFA provided strong support for the unidimensionality of test results.

Item-level IRT analysis

Selection of the most suitable IRT model

Three IRT models were compared to determine the best-fitting model for the data: the one-parameter logistic (1PL) model, the two-parameter logistic (2PL) model, and the three-parameter logistic (3PL) model. The goodness-of-fit statistics for each model included the M2, RMSEA, SRMR, TLI, and CFI values.

The 1PL model showed a relatively poor fit, with an M2 of 1046.848 (df = 189, p < 0.001), RMSEA of 0.0495, SRMSR of 0.0914, TLI of 0.8458, and CFI of 0.8466. In contrast, the 2PL model demonstrated a considerably better fit, with an M2 of 316.1901 (df = 170, p < 0.001), RMSEA of 0.0216, SRMSR of 0.0395, TLI of 0.9708, and CFI of 0.9739. The 3PL model showed a similarly good fit, with an M2 of 258.4386 (df = 150, p < 0.001), RMSEA of 0.0198, SRMSR of 0.0409, TLI of 0.9754, and CFI of 0.9806.

A comparison of the three models using the Akaike information criterion (AIC), the sample-size-adjusted Bayesian information criterion (SABIC), the Hannan-Quinn information criterion (HQ), and the Bayesian information criterion (BIC) revealed that the 2PL model had the lowest values for all criteria, indicating the best fit to the data (Table 2). Furthermore, an analysis of variance (ANOVA) comparing the three models showed a significant improvement in fit for the 2PL model over the 1PL model (χ² = 327.282, df = 19, p < 0.001) but no significant difference in fit between the 2PL and 3PL models (χ² = 19.075, df = 20, p = 0.517).

Based on these results, the 2PL model was chosen as the most suitable model for the data. This model provides a more flexible approach than the 1PL model by allowing items to have different discrimination parameters, which can better capture the underlying relationships between the items and the latent trait. The lack of significant improvement in fit from the 2PL model to the 3PL model suggests that the added complexity of the 3PL model, which incorporates a guessing parameter, was unnecessary for adequate representation of the data.
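A minimal sketch of this model comparison in R, assuming the scored responses are held in a 0/1 matrix called `responses` (one row per participant, one column per odour item; the name is an assumption); the 'mirt' package named in the Methods provides the fitting and comparison functions:

```r
library(mirt)

fit_1pl <- mirt(responses, model = 1, itemtype = "Rasch")  # equal slopes (1PL/Rasch)
fit_2pl <- mirt(responses, model = 1, itemtype = "2PL")    # item-specific discrimination
fit_3pl <- mirt(responses, model = 1, itemtype = "3PL")    # adds a guessing parameter

anova(fit_1pl, fit_2pl)   # likelihood-ratio test plus AIC/SABIC/HQ/BIC: 2PL vs 1PL
anova(fit_2pl, fit_3pl)   # 2PL vs 3PL
M2(fit_2pl)               # M2 statistic with RMSEA, SRMSR, TLI and CFI for the 2PL fit
```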

Item and person fit

The item fit parameters for the 2PL IRT model were assessed using S-χ², RMSEA, and p values. The S-χ² statistic evaluates the discrepancy between observed and expected item responses, with higher values indicating a worse fit. RMSEA provides an estimate of the item fit error, with values closer to 0 indicating a better fit. The p value assesses the statistical significance of the item misfit.

In the present analysis, most items exhibited good fit, with nonsignificant p values (> 0.05) and low RMSEA values. However, a few items showed potential misfit: Item 3 (Coconut) with an S-χ² of 26.757, df = 12, RMSEA = 0.026, and p value = 0.008; Item 6 (Tire) with an S-χ² of 23.440, df = 12, RMSEA = 0.023, and p value = 0.024; and Item 14 (Strawberry) with an S-χ² of 23.926, df = 11, RMSEA = 0.025, and p value = 0.013.

The person-fit statistics for the 2PL IRT model indicated that the majority of individual response patterns were consistent with the model expectations, with 99.946% and 98.920% fitting based on infit and outfit statistics, respectively.

Discrimination and difficulty parameters

The discrimination (a) and difficulty (b) parameters of the 2PL IRT model for the 20 items provided meaningful insights into the characteristics of the test items. The discrimination parameters ranged from 0.38 to 2.43, indicating that items vary in their ability to differentiate between individuals with differing levels of the underlying latent trait. For instance, the item ‘Vanilla’ (a = 2.43) showed higher discrimination, meaning that it better distinguished between individuals along the ability continuum. In contrast, the item ‘Rose’ (a = 0.38) had lower discrimination and was less effective in this regard (Supplementary Table S2 ).

Figure  1 presents the item characteristic curves (ICCs) for the Multiscent-20 items, illustrating the relationship between individual ability and item difficulty on the same dimension. Figure  2 shows the item information curves (IICs), visually depicting the contribution of individual items to the overall precision of the scale.

Figure 1:
Item Characteristic Curves. Each curve depicts the relationship between an individual’s odour identification ability (θ, theta) on the horizontal axis and the probability of providing a correct answer to the question on the vertical axis. The person parameter (θ, latent trait, or ability) is on a scale from − 4 (severely impaired odour identification ability) to + 4 (excellent odour identification ability). The P(θ) depicted on the y-axis of each curve represents the probability of providing the correct answer on the specific item.

Figure 2:
Item Information Curves. Each curve represents an item in the Multiscent-20. The shape of the curve indicates how much information the item provides at different levels of latent ability. A steeper curve indicates that the item offers more information at a given level of odour identification ability, while a flatter curve indicates that the item provides less information. The person parameter (θ, latent trait, or ability) is on a scale that goes from − 4 (low odour identification ability) to + 4 (high odour identification ability). The curves show that Menthol, Bubble-gum, Strawberry, Garlic, Onion, Pizza, Mint, and Vanilla are the most informative items.

The difficulty parameters ranged from − 3.22 to − 0.60, providing information about the relative difficulty of each item. A more negative value of difficulty corresponds to an easier item, while a less negative value of difficulty denotes a more difficult item. For example, the item ‘Coffee’ (b =  − 3.22) is relatively easier to identify, whereas the item ‘Rose’ (b =  − 0.60) is more challenging for test-takers.
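For reference, the 2PL model expresses the probability that person $j$ with latent ability $\theta_j$ correctly identifies item $i$ as

$$P(X_{ij} = 1 \mid \theta_j) = \frac{1}{1 + \exp\left[-a_i(\theta_j - b_i)\right]},$$

where $a_i$ is the item's discrimination and $b_i$ its difficulty. As a worked example with the reported parameters for 'Rose' ($a = 0.38$, $b = -0.60$), a test-taker of average ability ($\theta = 0$) identifies the item correctly with probability $1/(1 + e^{-0.38 \times 0.60}) \approx 0.56$; the item's low discrimination is also why it contributes relatively little information to the scale.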

Test information curves

The test information curve (TIC; Fig.  3 , left panel) illustrates the diagnostic test’s precision across the spectrum of the latent trait, reflecting its capacity to differentiate between individuals possessing varying levels of the trait. The scale characteristic curve (SCC; Fig.  3 , right panel) also serves as an invaluable resource for evaluating diagnostic tests, encompassing facets such as test reliability, precision, measurement gap identification, and test calibration and equating. The SCC revealed that individuals with a θ value of approximately 0, corresponding to average latent ability, obtained an identification score of approximately 17–18 points. In contrast, those with a θ value of − 4 to − 2, indicating very low levels of odour identification ability, attained identification scores between 4 and 12.5.

Figure 3:
Multiscent-20 test information and scale characteristic curves. The left panel displays the item information curve (red line) and the standard error (dotted brown line). The right panel presents the scale characteristic curve. The test is most informative for a range of θ values between − 4 and − 1, indicating a higher precision and better discrimination in estimating the loss of odour identification ability for individuals within this ability range.

Influence of demographic factors

The one-way ANOVA revealed a statistically significant difference in the number of correct answers across the three educational attainment categories (F(2, 1829) = 13.55, p < 0.001). The Tukey honestly significant difference (HSD) test showed lower performance in the primary education group than in the secondary and tertiary education groups (primary vs. secondary education: mean difference = 1.05, 95% CI [0.53, 1.57], p < 0.001; primary vs. tertiary education: mean difference = 1.26, 95% CI [0.69, 1.83], p < 0.001). However, no statistically significant difference was found between the secondary education (high school) and tertiary education (college) groups (mean difference = 0.21, 95% CI [− 0.10, 0.52], p = 0.257).

A multiple linear regression analysis was conducted to examine the relationship between the Multiscent-20 identification score (the dependent variable) and sex, age group, and educational attainment (independent variables). The results indicated that educational attainment was significantly associated with the Multiscent-20 identification score (secondary education: p = 6.94e−06, tertiary education: p = 6.26e−07). Specifically, participants with secondary and tertiary education had higher scores than those with primary education.

However, no significant relationship was found between age group and Multiscent-20 identification score (30–40 years: p = 0.97, 40–50 years: p = 0.89, 50–60 years: p = 0.17, > 60 years: p = 0.50). The analysis did not reveal any significant differences between sexes in the Multiscent-20 identification score (p = 0.096). The multiple R² value of 0.018 indicated that the model explained only a small proportion of the variation in the Multiscent-20 identification score.

Descriptive data of the preliminary “normative” sample

A total of 533 individuals were excluded from the analysis of the preliminary normative curve. Exclusions were made according to the following criteria: recent infection (n = 240), illiteracy (n = 16), severe memory impairment (n = 34), prior head trauma (n = 107), nasal surgery (n = 101), persistent post-COVID-19 hyposmia (n = 4), self-reported hyposmia (n = 135), diagnosed neurological disease (n = 13), or scores of 23 or less on the Mini-Mental State Examination (MMSE) (n = 18).

The sample of 1299 individuals had a mean Multiscent-20 identification score of 17.5, with a standard deviation of 2.1, as seen in Fig.  4 . The median score was 18, and the trimmed mean was 17.8, indicating that the central tendency of the scores was relatively high. The scores ranged from a minimum of 2 to a maximum of 20. The skewness value of − 2.3 suggests a negative (left) skew, indicating that the majority of the participants scored higher on the Multiscent-20 identification scale, with a smaller number of individuals scoring lower. The kurtosis value of 7.3 implies a leptokurtic distribution, meaning that the scores were more concentrated around the mean than in a normal distribution.

Figure 4:
MultiScent-20 Odour Identification Score: distribution and descriptive statistics of central tendencies, dispersion, and percentiles of normative performance.
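A minimal sketch of how such descriptive statistics could be reproduced in R, assuming the 1299 normative identification scores are stored in a vector named `norm_scores` (a hypothetical name):

```r
library(psych)

describe(norm_scores)                                     # mean, trimmed mean, median, sd, skew, kurtosis
quantile(norm_scores, probs = c(0.02, 0.10, 0.25, 0.50))  # percentile landmarks used for the cut-offs below
hist(norm_scores, breaks = 0:20, main = "Multiscent-20 identification scores")
```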

Assessment of internal and external validity, and test–retest reliability

Comparison of average performance: olfactory dysfunction vs. normosmic groups.

Patients diagnosed with olfactory dysfunction resulting from chronic sinusitis demonstrated an average performance score of 7.0 ± 2.4. This score is significantly different from the normosmic control group’s average score of 17.8 ± 1.84 (Welch’s t-test, t(15.5) = 15.1, p < 0.001, mean difference = 10.8, Cohen’s d = 5.03, Levene’s F(1, 66) = 1.02, p = 0.317, Shapiro–Wilk p = 0.011; Anderson–Darling p = 0.003) (Fig.  5 A).

Figure 5:
MultiScent-20 in Olfactory Function Assessment. ( A ) Compares MultiScent-20 scores between chronic sinusitis patients (hyposmic group) and normosmic controls, showing a significant difference. ( B ) Compares scores between Parkinson’s disease patients and age-matched controls, indicating a substantial olfactory deficit in Parkinson’s. ( C ) Scatter plot demonstrating the convergent validity of MultiScent-20 with the Sniffin’ Sticks 16-item score. Violin plots show score distributions; central red dot is the mean. Statistical details (effect sizes, p-values, CI 95%, n obs/n pairs) are provided.

Additionally, the subgroup comprising nine patients with Parkinson's Disease, who are recognized to suffer from olfactory disorders,9 recorded an average identification score of 5.11 ± 2.32. This was in contrast to age- and education-matched controls (n = 8), who achieved an average score of 14.38 ± 1.60 (Student's t-test, t(15) = 9.472, p < 0.001, 95% CI [7.18, 11.35], Shapiro–Wilk p = 0.162, Levene's F(1, 15) = 1.34, p = 0.265). The test revealed a highly significant mean difference with an extremely large effect size (Cohen's d = 4.60). It is important to highlight that the control group, being older, exhibited a lower performance relative to the normative sample (Fig. 5B).

Test–retest reliability

The evaluation of test–retest reliability for the olfactory test, conducted over an average interval of 31.73 days (SD = 2.1 days), revealed a Pearson's r correlation coefficient of 0.94 (95% CI [0.91–0.96]). A linear model demonstrated a strong predictive capability for 'retest' scores based on 'test' scores, with a significant positive correlation (standardized β = 0.94), explaining 89% of the variance. The model's intercept was statistically significant at 1.60. Additionally, the Intraclass Correlation Coefficient (ICC) for a one-way random model, measuring agreement (ICC1), was 0.93 with a 95% confidence interval ranging from 0.89 to 0.95. The tenfold cross-validation results also indicated high reliability in test–retest measurements, with a mean ICC of 0.9218. The median ICC was 0.9579 (range 0.809–0.987), demonstrating high reliability even in the worst cases. This result indicates a strong consistency in the test outcomes over time among the participants.
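A minimal sketch of these test–retest analyses in R, assuming a data frame `trt` (a hypothetical name) with one row per participant and columns `test` and `retest` holding the two Multiscent-20 scores; the 'irr' package mentioned in the Methods supplies the ICC:

```r
library(irr)

cor.test(trt$test, trt$retest)                        # Pearson's r with 95% CI
icc(trt[, c("test", "retest")],
    model = "oneway", type = "agreement")             # one-way random agreement ICC (ICC1)
summary(lm(retest ~ test, data = trt))                # retest predicted from test scores
```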

Correlation and agreement with Sniffin’ Sticks test (SS-16)

Pearson's correlation analysis conducted to assess the convergent validity between the MultiScent-20 test and the Sniffin' Sticks test (SS-16), a well-known olfactory test, yielded significant results among both Parkinson's Disease (PD) patients and the corresponding control group. With a correlation coefficient (r) of 0.88 and a 95% confidence interval ranging from 0.68 to 0.95 (p < 0.001), the analysis demonstrated a strong positive relationship between the two olfactory assessments. Furthermore, the determination coefficient (R²) of 0.77 indicates that approximately 77% of the variance in olfactory test scores between the SS-16 and MultiScent-20 tests is shared, underscoring the substantial overlap in the measures of olfactory function provided by both tests (Fig. 5C).

Internal consistency

McDonald’s omega reliability index was determined to be 0.902, indicating high internal consistency. The 95% confidence interval for this measure ranged from 0.888 to 0.916.

In the present study, we developed a novel digital odour identification test presented on a dedicated tablet that releases odours and evaluated its psychometric properties. This new test is automated and portable, enabling standardized olfactory stimuli delivery, swift recording, and scoring of responses in a cost-effective manner. The validity of the test was evaluated with the IRT framework. The findings revealed that the 2PL model yielded the most appropriate fit to the data, indicating variations in item discrimination and difficulty. Furthermore, the adequate fit (based on infit and outfit statistics) showed that the majority of participant responses were in line with the model expectations.

The TIC and SCC provide insights into the performance of diagnostic tests in general and may also be used for odour identification tests such as in the present study. The TIC provides a visual representation of the test’s capability to differentiate between individuals with different ability levels. For the MultiScent-20, the TIC revealed that the test has a high level of precision across lower levels of the latent trait, suggesting that it can effectively differentiate between individuals with hyposmia and anosmia. The SCC showed that individuals with an average odour identification ability obtained a score of 17–18 points, indicating that they correctly identified most of the scents presented. In contrast, individuals with the lowest levels of olfactory ability (θ values lower than − 2) obtained lower scores, ranging between 4 and 12.5 points.

The present findings also highlighted variations across demographic groups, consistent with the complexity of olfactory function testing. The data showed that while higher education levels were correlated with better performance on the Multiscent-20, sex was only marginally associated with olfactory ability. This is in line with previous research finding sex differences in olfaction with relatively small effect sizes 10 . Moreover, previous research has shown that in odour identification, verbal ability 11 , educational attainment 12 , and cognitive factors 13 , 14 play a significant role in modulating both odor discrimination and identification capabilities.

The percentile-based interpretation of the normative values was guided by the contribution of the ‘Sniffin’ Sticks’ test and the UPSIT, prompting us to consider a score below the 10th percentile as indicative of hyposmia 1 , 4 . A similar standardized classification system for an objective assessment of participant performance based on percentiles is included in the guidelines of the American Academy of Clinical Neuropsychology consensus conference statement on labelling performance in tests with non-normal distributions 15 . In this classification, scores above the 24th percentile (≥ 17 points) are classified “Within Normal Expectations Score”, “Within Normal Limits Score”, or “normosmia”, while scores ranging from the 10th to the 24th percentile (15–16 points) are labelled as “Low Average Score” but still normosmia. Scores falling between the 2nd and 8th percentiles are designated as “Below Average Score” or “hyposmia”, while scores below the 2nd percentile are classified as “Exceptionally Low Score” or “anosmia”. Following these guidelines provided a consistent and objective assessment of subject performance in our study.

The significant difference in performance scores between patients with olfactory dysfunction due to chronic sinusitis and the normosmic control group is a critical finding. With a marked mean difference and a large effect size (Cohen’s d = 5.03), the study demonstrates the extent of olfactory impairment in individuals with chronic sinusitis. These results are consistent with the existing literature that documents the impact of chronic sinusitis on olfactory capabilities. Moreover, the subgroup of PD patients showed a more pronounced olfactory deficit compared to their matched controls. The substantial effect size (Cohen’s d = 4.60) in this comparison highlights the severity of olfactory impairment in PD, aligning with prior research indicating olfactory dysfunction as a common non-motor symptom of PD. This stark contrast between PD patients and controls further validates the sensitivity of the olfactory test in detecting olfactory deficits in neurodegenerative conditions.

The high test–retest reliability observed in this study is indicative of the consistency and stability of the olfactory test over time. This finding is particularly relevant for clinical settings where repeated measurements are essential to monitor the progression of olfactory dysfunction or the efficacy of therapeutic interventions. The strong correlation coefficient suggests that the test can reliably measure olfactory function over a period of about one month, an important consideration for longitudinal studies.

The strong positive correlation between the MultiScent-20 test and the Sniffin' Sticks test (SS-16) underscores the convergent validity of these olfactory assessments. The high correlation coefficient (r = 0.88) and determination coefficient (R² = 0.77) indicate that both tests are measuring similar constructs of olfactory function.

The olfactory function test presented with this digital device can enable large-scale administration without the need for examiners, as scoring is automatic, and the test is self-administered. This feature facilitates epidemiological screening in populations 16 , 17 , 18 . Moreover, the equipment enables the collection of multidimensional data, including discrimination, threshold, familiarity, pleasantness, and preference for different odours. The collection and analysis of multidimensional data provide a more comprehensive and detailed understanding of olfactory responses and individual or collective patterns of odour perception 19 . Furthermore, the device facilitates the utilization of artificial intelligence, as analysis of the generated big data can provide novel insights into the biology of olfactory perception and the cognitive processes involved in odour identification 13 . The application of IRT may also be of interest in enabling computerized adaptive tests.

Moreover, incorporating “universal odours” in the Multiscent-20 and the initial exploration of cross-cultural research underlines the tool’s potential to transcend cultural boundaries 20 , 21 . Subsequent studies may refine the test by addressing any discrepancies in item calibration and investigating other potential factors that could account for variation in test performance.

The development of the Multiscent-20 marks a significant advancement in olfactory testing, shifting from traditional, manual methods like the UPSIT and Sniffin’ Sticks to a digital approach that enhances scalability and reduces the need for trained personnel 1 , 3 . This device automates odor delivery and response scoring, integrating algorithms for precise analysis. The use of a digital platform allows for large-scale screenings and complex data analysis, significantly improving upon the limitations of traditional methods and setting a new standard in olfactory research 22 .

This study also has some limitations. It is crucial to acknowledge that the multiple linear regression model explained only a small proportion of variance in Multiscent-20 identification scores, emphasizing the need for future research to identify additional factors influencing olfactory ability. The study's participant sample, primarily comprising workers from a glass factory, may not fully represent the broader population, particularly in terms of ethnic and cultural diversity, socioeconomic status, and educational background. The proposed cut-off values are preliminary and based on the recommendation by Guilmette et al.15. These values are subject to updates with the expansion of the normative sample, as well as the inclusion of more patients with olfactory dysfunction of known aetiology.

These limitations highlight the importance of expanding future research to diverse demographics to enhance the generalizability of findings. They also underscore the importance of refining the Multiscent-20 items that showed misfit, suggesting future modifications to odour concentration and test design to improve clinical relevance while acknowledging the test's ongoing evolution.

In conclusion, the findings of the present study support the validity of the Multiscent-20 as an odour identification test. The 2PL IRT model provides a robust framework for understanding the performance on test items. The test represents an important step forward in olfactory function testing, with potential benefits for both clinical practice and research.

Study design and ethical approval

The study was approved by the Ethics Committee of the Faculty of Medicine of the University of Brasilia (Approval Number: 43188021-9-1001-5558). All methods were performed in accordance with the guidelines and regulations of the National Council of Research Ethics and the Declaration of Helsinki, and informed consent was obtained from all participants. Participant data are protected by national data protection legislation. The study had a cross-sectional observational design and utilized nonprobabilistic convenience sampling by quotas. In the following sections, we provide a detailed description of the experimental methodology and preliminary results.

Participant enrolment and inclusion criteria

To be eligible for participation, individuals had to be at least 18 years of age and native speakers of Portuguese. The study participants consisted of healthcare professionals from the university hospital, medical students, administrative staff, and employees from a glass factory. We excluded participants who provided careless or potentially biased responses to the olfactory function test, such as selecting answers at random or choosing the same option for all responses. We purposely did not exclude individuals who had a recent cold or flu or who reported olfactory loss because a sample with sufficient variability in test performance is essential to evaluate the test’s ability to measure the loss of the sense of smell.

However, for the normative curve of the test, other inclusion and exclusion criteria were established to ensure the inclusion of normosmic individuals. Participants had to be 18 or older, native speakers of Portuguese, and free of any known impairment of smell or taste. Participants who had been sick (with a cold or flu) within 15 days prior to the exam were excluded, as were those who reported neurological or psychiatric diseases and those with a history of head trauma. Participants who demonstrated negligent performance in the exam were also excluded. Additionally, those aged above 60 years and scoring below 24 points on the MMSE were excluded 23 .

Description of device features

The Multiscent-20 is a dedicated tablet featuring a 7-inch touchscreen specifically designed for use in the cosmetics industry to present perfume fragrances to customers (Fig.  6 ) 5 .

Figure 6:
Front (left) and back (right) views of the Multiscent-20. It is a dedicated tablet with a 7-inch touchscreen used to present aromas and record responses digitally. The opening for odour release is indicated by an arrow on the front of the device. The fragrance capsule is also shown. The device can be loaded with up to 20 capsules made of oil-resistant polymer, in which olfactory stimuli are stored. The capsules are loaded through an insertion port (arrow) on the back of the device.

The Multiscent-20 is a portable tablet computer with an integrated hardware system consisting of a central processing unit, touchscreen, Wi-Fi antenna, USB dock, power supply, and rechargeable battery. Additionally, it contains an odour system composed of 20 microcartridges, an air filter, a system that generates a dry air stream, and an odour-dispensing opening. The device is capable of presenting 20 different odours from individual odour capsules, which store the olfactory stimuli by incorporating the odours into an oil-resistant polymer. The capsules are loaded through an insertion port on the back of the device.

A software-controlled processing unit governs the flow of dry air, generating a constant air stream that passes through the capsule, releasing individual odours through a small opening at the upper front of the device. In this study, the intra-stimulus interval, i.e., the release of each odour stimulus, had a duration of 5 s, and odour release was initiated by participant activation via a touch screen interface. The participants had complete autonomy to activate the next odour at their own pace. The minimum inter-stimulus interval is 6 s, but the actual interval tended to be longer, as participants are required to read four descriptors before releasing each odour. Each capsule contained 35 µL of oil-based odour solution, allowing the device to maintain consistent odour intensity and identifiability for up to 100 activations, as per the manufacturer’s instructions.

A digital application (software) was developed to present odours and record olfactory function test results. The device delivers odours through a dry air system, leaving no residue in the environment or on users. Users can access the test using an application on the device (screen version) or mobile phone. The results are accessible through the integrated software, allowing for storage and analysis of the data.

Selection of the “universal” odours

The methodology employed for selecting “universal” odours for the Multiscent-20 olfactory assessment tool was anchored in three criteria, with each odour required to meet at least two of them for inclusion. First, an odour had to be among the scents identified in the study conducted by Schriever et al.6. Although that test was initially designed for a paediatric cohort with a selection of 12 items, our research adapted the approach to serve the adult population, positing that odours recognizable by children would be even more identifiable to adults, given their broader spectrum of life experiences and exposures.

Second, among the 77 odours used in major olfactory tests described in the literature (Supplementary Table S3), we identified 35 that were used in two or more tests; the second criterion required that an odour be among these 35.

Third, the chosen odours were required to have demonstrated an identification accuracy of at least 80% in the proof-of-concept study by Nakanishi et al.5, ensuring a high likelihood of accurate recognition. Additionally, worldwide prevalence and commercial availability across five continents were taken into account.

The methodology employed for the selection of distractors was adapted from the framework established by Schriever et al.6:

1. Four-alternative forced-choice (4-AFC) procedure. Each odour was presented with one correct answer and three distractors.

2. Target-related selection. One of the three distractors for each odour was chosen for its semantic relation to the target odour, thereby enhancing the test's discriminatory demand. Including both semantically related and unrelated distractors increased the complexity of the identification task, requiring more refined differentiation skills from participants.

3. Odour selection variability. Odours were selected across various levels of difficulty to present a graduated challenge to all participants. This variation was intended to prevent a “ceiling effect”, in which an overly easy test would produce disproportionately high scores across participants and undermine the test's discriminative capacity (Supplementary Table S1).

The selected odours were Smoked, Lavender, Coconut, Mint, Vanilla, Tire, Rose, Coffee, Grass, Bubble-gum, Menthol, Grape, Clove, Strawberry, Banana, Orange, Cinnamon, Garlic, Pizza, and Onion. Specifications are delineated in Supplementary Table S4 .

Experimental setup

The Multiscent-20 odour identification test commences with the participant reviewing the step-by-step instructions displayed on the initial screen. These instructions are presented below:

This assessment consists of a test containing 20 distinct odours.

Enter your identification information and press the “next” button.

The next screen will show a “Try it” button, the phrase “This smell resembles”, and four alternative answers. Please read all the choices before pressing the “Try it” button.

Upon pressing the “Try it” button, a small opening at the top front of the device will release the odour for 5 s. You can press the “Try it” button up to three times. Please maintain the device approximately 10 cm from your nostrils after pressing the button. After perceiving the odour, please select one of the alternatives, touch the corresponding letter, and press the “next odour” button to proceed.

The number of correct responses will be displayed upon completion of the test.

This assessment was a 4-AFC test, meaning that participants were asked to choose one of the four alternatives to advance to the next odour. The software was specifically developed to present questions and manage responses on the device. After the test, the results are displayed, and the device automatically synchronizes with the database stored in the cloud.

Data analysis (descriptive statistics and dimensionality assessment)

For the descriptive analysis, categorical variables are presented as the frequency (n) and percentage (%). Continuous variables are either described as the mean and standard deviation (SD) or median and interquartile range (IQR), depending on the normality of the data, as determined with the Shapiro‒Wilk test. Data analysis was mainly performed in R software (v4.3.0, R Foundation for Statistical Computing, Vienna, Austria). All statistical tests were two-sided, with a significance threshold set at 0.05.

To assess the dimensionality of the dataset, a CFA was performed using JASP 24 . Model fit indices, such as the SRMSR, TLI, and RMSEA, were examined to evaluate the model fit. This procedure aimed to evaluate the suitability of the dataset for further analysis using an IRT model.
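The CFA itself was run in JASP (and later in Mplus); as a stand-in, here is a minimal sketch of an equivalent single-factor model in R with the open-source 'lavaan' package, assuming the same hypothetical 0/1 item matrix `responses` used in the earlier sketch:

```r
library(lavaan)

items <- colnames(responses)                              # the 20 odour item columns
model <- paste("olfaction =~", paste(items, collapse = " + "))
fit   <- cfa(model, data = as.data.frame(responses), ordered = items)
fitMeasures(fit, c("cfi", "tli", "nfi", "rmsea", "srmr")) # indices of the kind reported in the Results
```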

Item response theory analysis

To conduct the IRT analysis, the ‘mirt’ package 25 in R was utilized. The following steps were followed for the analysis:

Data preparation: Data were organized into a matrix format, with rows representing individual participants and columns representing dichotomous responses (correct/incorrect) to each of the 20 odour items.

Model estimation: Using the 'mirt' function, the parameters of the 1PL, 2PL, and 3PL models were estimated. The 1PL model assumed a constant discrimination parameter across items, while the 2PL model allowed for variation in discrimination parameters. The 3PL model additionally included a guessing parameter, accounting for the possibility that participants may have guessed the correct answer by chance.

Model comparison: To determine the best-fitting model for the present dataset, goodness-of-fit statistics such as the AIC and BIC were compared. Lower values of these statistics indicated a better-fitting model.

Item analysis: For the best-fitting model, the ICCs were examined. These provide information about the relationship between participant latent trait levels and their probability of correctly identifying each odour. Item information functions (IIFs) were examined to understand the precision of each item in measuring the latent trait.
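Continuing the earlier sketch, the item-level outputs described in these steps could be obtained from the retained 2PL fit (`fit_2pl`) roughly as follows:

```r
coef(fit_2pl, IRTpars = TRUE, simplify = TRUE)$items  # a (discrimination) and b (difficulty) per item
itemfit(fit_2pl, fit_stats = "S_X2")                  # S-chi-square item fit with RMSEA and p values
head(personfit(fit_2pl))                              # infit/outfit person-fit statistics
plot(fit_2pl, type = "trace")                         # item characteristic curves
plot(fit_2pl, type = "infotrace")                     # item information curves
plot(fit_2pl, type = "info")                          # test information curve
```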

Influence of demographic factors (sex, age, and educational attainment)

To investigate the relationship between participant educational attainment and the number of correct answers on the administered test, a one-way ANOVA was conducted. Participants were divided into three categories based on their educational attainment: primary (the fundamental level in Brazil), secondary (high school), and tertiary (college) education. The dependent variable was the number of correct answers in the odour identification test. Subsequently, Tukey’s HSD test was employed to identify pairwise differences between the three educational attainment groups, provided that the ANOVA yielded significant results.

A multiple linear regression model was employed to investigate the relationship between the Multiscent-20 identification scores (dependent variable) and sex, age group, and educational attainment (independent variables). The dataset was analysed using R software, with the “lm” function utilized to create the regression model.
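A minimal sketch of these two analyses in base R, assuming a data frame `dat` (a hypothetical name) with one row per participant and columns `score`, `education`, `sex` and `age_group`:

```r
fit_aov <- aov(score ~ education, data = dat)   # one-way ANOVA across education levels
summary(fit_aov)
TukeyHSD(fit_aov)                               # pairwise education-group comparisons

fit_lm <- lm(score ~ sex + age_group + education, data = dat)
summary(fit_lm)                                 # multiple linear regression on the identification score
```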

Evaluation of internal and external validity, and assessment of test–retest reliability

Administrative staff were selected from the sample described in the “Participants” section, comprising 55 healthy individuals without olfactory complaints, including 25 men and 30 women (mean age = 31.7 years, SD = 4.3 years, range 21 to 42 years). Additionally, 13 individuals (5 men and 8 women; mean age = 57.4 years, SD = 16.8 years, range 24 to 75 years) diagnosed with hyposmia or anosmia and chronic sinusitis 26, who were receiving follow-up care at the otorhinolaryngology outpatient clinic of the University Hospital of Brasília, were included.

Parkinson's Disease subgroup and matched controls. Nine patients diagnosed with Parkinson's Disease 27 (mean age = 65.8 years, SD = 11.5 years, range 47 to 78 years), followed in the neurology outpatient clinic, and eight age- and education-matched control subjects (mean age = 61.4 years, SD = 6.2 years, range 53 to 71 years) were selected for the comparison of olfactory performance and the cross-validation study with the Sniffin' Sticks test. The difference in age between the groups was not statistically significant (p = 0.352). Gender distribution across the groups was also not significantly different (p = 0.772), with 62.5% females (N = 5) in the control group and 55.6% females (N = 5) in the Parkinson's group; males constituted 37.5% (N = 3) of the control group and 44.4% (N = 4) of the Parkinson's group.

With the aim of evaluating the differences in olfactory performance among three groups—those with olfactory dysfunction due to chronic sinusitis, a normosmic control group, and the Parkinson’s Disease Subgroup and Matched Control—we employed t-tests.

To evaluate the temporal stability of olfactory test outcomes, we conducted the assessment on the same cohort of individuals at two distinct time points, separated by an average interval of 31.7 days (sd = 2.1). For the analysis, Pearson’s correlation coefficient and the Intraclass Correlation Coefficient (ICC) were employed to quantify the degree of consistency between the two sets of measurements. We implemented a tenfold cross-validation approach using the entire test–retest dataset without partitioning it into groups. Using the ‘ caret’ package in R, we configured a tenfold cross-validation procedure, with each fold serving as a validation set once and the remaining folds as the training set. ICCs were calculated for each fold using the ‘ icc’ function from the ‘ irr’ package. This approach enhances generalizability by ensuring that the findings are not contingent upon a specific data subset and mitigating overfitting. Furthermore, a linear regression model was applied, designating the results of the retest as the dependent variable and the initial test outcomes as the independent variable, to investigate the predictive relationship between the two measurements.
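A minimal sketch of the tenfold ICC check, reusing the hypothetical `trt` data frame from above; folds are built with 'caret' and an agreement ICC is computed on each held-out fold with 'irr', as one plausible reading of the procedure described here:

```r
library(caret)
library(irr)

set.seed(123)
folds <- createFolds(seq_len(nrow(trt)), k = 10)          # ten validation folds
fold_icc <- sapply(folds, function(idx)
  icc(trt[idx, c("test", "retest")], model = "oneway", type = "agreement")$value)

c(mean = mean(fold_icc), median = median(fold_icc),
  min = min(fold_icc), max = max(fold_icc))               # summary of per-fold ICCs
```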

Correlation with Sniffin’ Sticks test (SS-16)

To determine the convergent validity with a well-known test that measures the same construct, a correlation analysis was conducted to examine the relationship between the Sniffin’ Sticks (SS-16) and MultiScent-20 test scores. The Pearson correlation coefficient (r) was calculated along with its corresponding p-value, and the 95% confidence interval (CI) limits.

In a confirmatory factor analysis (CFA) conducted using Mplus software (version 8.8), we modelled all items as indicators of a single latent factor, with factor loadings, thresholds, and variances extracted to determine model fit using indices such as RMSEA, CFI, TLI, and SRMR. Subsequently, we used the NumPy library (in Python) to calculate McDonald's omega reliability coefficient 28. This step involved extracting standardized factor loadings from the Mplus output and computing error variances for each item.

This coefficient represents the scale's internal consistency, that is, how reliably the items measure the intended construct.
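As a point of reference, for a single-factor model with standardized loadings $\lambda_i$ this corresponds to the usual formulation

$$\omega = \frac{\left(\sum_{i=1}^{20} \lambda_i\right)^{2}}{\left(\sum_{i=1}^{20} \lambda_i\right)^{2} + \sum_{i=1}^{20} \left(1 - \lambda_i^{2}\right)},$$

where $1 - \lambda_i^{2}$ is the error variance of item $i$, matching the loadings-plus-error-variances procedure described above.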

Data availability

The anonymized data that support the findings of this study are openly available in the Zenodo public repository at https://doi.org/10.5281/zenodo.8079860 (reference number 8079860).

Doty, R. L. et al. Development of the University of Pennsylvania Smell Identification Test: A standardized microencapsulated test of olfactory function. Physiol. Behav. 32 , 489–502 (1984).


Kobal, G. et al. Sniffin “Sticks”: Screening of olfactory performance. Rhinology 34 , 222–226 (1996).


Hummel, T., Sekinger, B., Wolf, S. R., Pauli, E. & Kobal, G. ‘Sniffin’ Sticks’: Olfactory performance assessed by the combined testing of odor identification, odor discrimination and olfactory threshold. Chem. Senses 22 , 39–52 (1997).

Hummel, T., Kobal, G., Gudziol, H. & Mackay-Sim, A. Normative data for the ‘Sniffin’’ Sticks” including tests of odor identification, odor discrimination, and olfactory thresholds: An upgrade based on a group of more than 3,000 subjects. Eur. Arch. Oto-Rhino-Laryngol. 264 , 237–243 (2007).


Nakanishi, M. et al. The digital scent device as a new concept for olfactory assessment. Int. Forum Allergy Rhinol. 12 , 1263–1272 (2022).


Schriever, V. A. et al. Development of an international odor identification test for children: The Universal Sniff Test. J. Pediatr. 198 , 265–272 (2018).

Renaud, M. et al. Clinical outcomes for patients with anosmia 1 year after COVID-19 diagnosis. JAMA Netw. Open 4 , e2115352 (2021).


Cai, L., Choi, K., Hansen, M. & Harrell, L. Item response theory. Annu. Rev. Stat. Appl. 3 , 297–321 (2016).


Doty, R. L. Clinical disorders of olfaction. In Handbook of Olfaction and Gustation 3rd edn (ed. Doty, R. L.) 453–478 (Wiley, 2015).


Sorokowski, P. et al. Sex differences in human olfaction: A meta-analysis. Front. Psychol. 10 , 242 (2019).

Lillqvist, M., Claeson, A. S., Zakrzewska, M. & Andersson, L. Comparable responses to a wide range of olfactory stimulation in women and men. Sci. Rep. 13 , 9059 (2023).


Fornazieri, M. A. et al. Relationship of socioeconomic status to olfactory function. Physiol. Behav. 198 , 84–89 (2019).

Hedner, M., Larsson, M., Arnold, N., Zucco, G. M. & Hummel, T. Cognitive factors in odor detection, odor discrimination, and odor identification tasks. J. Clin. Exp. Neuropsychol. 32 , 1062–1067 (2010).

Challakere Ramaswamy, V. M. & Schofield, P. W. Olfaction and executive cognitive performance: A systematic review. Front. Psychol. 13 , 871391 (2022).

Guilmette, T. J. et al. American Academy of Clinical Neuropsychology consensus conference statement on uniform labeling of performance test scores. Clin. Neuropsychol. 34 , 437–453 (2020).

Wysocki, C. J. & Gilbert, A. N. National geographic smell survey: Effects of age are heterogenous. Ann. N. Y. Acad. Sci. 561 , 1 (1989).

Oleszkiewicz, A. et al. Global study of variability in olfactory sensitivity. Behav. Neurosci. 134 , 394 (2020).

Arshamian, A. et al. The perception of odor pleasantness is shared across cultures. Curr. Biol. 32 , 2061–2066 (2022).

Hummel, T. et al. Position paper on olfactory dysfunction. Rhinology 54 , 1–30 (2017).

Majid, A. Human olfaction at the intersection of language, culture, and biology. Trends Cogn. Sci. 25 , 111–123 (2021).

Wysocki, C., Pierce, J. & Gilbert, A. Geographic, cross-cultural, and individual variation in human olfaction. In Smell and Taste in Health and Disease Vol. 1 (eds Getchell, T. V. et al. ) 1–906 (Raven Press, 1991).


Marin, C. et al. Olfactory dysfunction in neurodegenerative diseases. Curr. Allergy Asthma Rep. 18 , 42 (2018).

Folstein, M. F., Folstein, S. E. & Mchugh, P. R. ‘Mini-mental state’. A practical method for grading the cognitive state of patients for the clinician. J. Psychiatr. Res. 12 , 189–198 (1975).

Love, J. et al. JASP: Graphical statistical software for common statistical designs. J. Stat. Softw. 88 , 1–17 (2019).

Chalmers, R. P. et al. mirt: A multidimensional item response theory package for the R environment. J. Stat. Softw. 48 , 1–29 (2012).

Fokkens, W. J. et al. European position paper on Rhinosinusitis and Nasal Polyps 2020. Rhinology 58 , 1–464 (2020).

Postuma, R. B. et al. Validation of the MDS clinical diagnostic criteria for Parkinson’s disease. Mov. Disord. 33 , 1601 (2018).

McDonald, R. P. Test Theory: A Unified Treatment (Lawrence Erlbaum Associates, 1999).


Acknowledgements

The authors gratefully acknowledge the study participants and patients, as well as Fernando Araujo Rodrigues Oliveira, Ana Paula Matos Rodrigues, and Rosane Antunes Carreiro Chaves at the Clinical Research Center of the University Hospital of Brasília. They also thank Peter Gottschalk, Hugo Zambotti, Claudia Ciszevsk, Daiana Saraiva at the Noar Company. Finally, they thank Nature Research Editing Service for the English language editing and manuscript formatting.

Funding information Noar Company, Grant/Award Number: 43188021.9.1001.5558.

Author information

These authors contributed equally: Marcio Nakanishi and Pedro Renato de Paula Brandão.

Authors and Affiliations

Department of Otorhinolaryngology, University Hospital of Brasília, Graduate Program in Medical Sciences, School of Medicine, University of Brasília, Brasília, DF, Brazil

Marcio Nakanishi & Gustavo Subtil Magalhães Freire

Research Institute, Hospital Sírio-Libanês, Brasília, DF, Brazil

Pedro Renato de Paula Brandão & Gustavo Henrique Campos de Sousa

School of Medicine, University of Brasília, Brasília, DF, Brazil

Pedro Renato de Paula Brandão

Statistics Department, University of Brasília, Brasília, DF, Brazil

Luis Gustavo do Amaral Vinha

State University of Londrina, Pontifical Catholic University of Paraná, Londrina, Brazil

Marco Aurélio Fornazieri

Department of Ophthalmology and Otorhinolaryngology, Ribeirão Preto Medical School-University of São Paulo, Ribeirão Preto, Brazil

Wilma Terezinha Anselmo-Lima

Brazilian Institute of Neuropsychology and Cognitive Sciences, Brasília, DF, Brazil

Danilo Assis Pereira

NOAR Brasil Ltda, São Paulo, SP, Brazil

Claudia Galvão

Department of Otorhinolaryngology, Smell and Taste Clinic, Technische Universität Dresden, Dresden, Germany

Thomas Hummel


Contributions

M.N., G.S.M.F., G.H.C.S., and C.G. performed the experiments. P.R.P.B., D.A.P., and L.G.A.V. analyzed data. M.N. and P.R.P.B. wrote and revised the manuscript. G.S.M.F., L.G.A.V., M.A.F., W.T.A.L., C.G., and T.H. designed the study, made revisions, and edited the manuscript.

Corresponding author

Correspondence to Marcio Nakanishi .

Ethics declarations

Competing interests

M.N. participated as a medical consultant and received speaker fees from Noar Company. C.G. is CEO shareholder of the Noar Company, the manufacturer of the Noar MultiScent device described in this article. The remaining authors have no competing interests to disclose.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary tables

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article

Nakanishi, M., Brandão, P.R.d.P., Freire, G.S.M. et al. Development and validation of the MultiScent-20 digital odour identification test using item response theory. Sci Rep 14 , 15059 (2024). https://doi.org/10.1038/s41598-024-65915-3


Received : 03 July 2023

Accepted : 25 June 2024

Published : 01 July 2024

DOI : https://doi.org/10.1038/s41598-024-65915-3


99 Protecting workers from respirable crystalline silica exposure: Validity and reliability assessment of dust control climate questionnaire


Frederick Anlimah, Vinod Gopaldasani, Catherine MacPhail, Brian Davies, 99 Protecting workers from respirable crystalline silica exposure: Validity and reliability assessment of dust control climate questionnaire, Annals of Work Exposures and Health , Volume 68, Issue Supplement_1, June 2024, Page 1, https://doi.org/10.1093/annweh/wxae035.044


This research aims to assess and improve the management of respirable crystalline silica (RCS) exposure among workers in Australian tunnelling projects by introducing and measuring the “dust control climate.”

Tunnelling projects contribute towards infrastructure development but expose workers to RCS, leading to silicosis, lung cancer, and other health issues. Despite existing control measures, RCS exposure remains a concern. Previous research has identified factors affecting RCS exposure control in tunnelling but lacks worker input. Additionally, while safety climate studies are common, little focus is given to occupational hygiene, particularly RCS exposure.

Based on the NOSACQ-50 safety climate questionnaire, the concept of dust control climate is defined as workers’ perceptions and awareness of dust prevention and control strategies. The adapted tool contained 52 items and was reviewed by tunnelling experts for content validity. After the questionnaire was administered on a tunnelling project, the psychometric properties of the dust control climate tool were tested.

The questionnaire yielded a Cronbach’s alpha of 0.93, and factor analysis revealed robust inter-component relationships, affirming its reliability and validity in assessing dust control climate.

A tool was developed to measure dust control climate using the NOSACQ-50 safety climate tool as a foundation. Its reliability and validity were established by applying and analysing the tool’s constructs. This tool holds promise for enhancing dust control practices in tunnelling and various industries, as it considers the perspectives of workers and management.
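Purely as an illustration of how an internal-consistency coefficient such as the reported Cronbach’s alpha can be obtained from raw questionnaire responses, the following Python sketch computes alpha from a respondents-by-items score matrix. The data, function name, and resulting value are illustrative assumptions, not outputs from this study.

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_respondents, n_items) matrix of item scores."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]                              # number of items
    item_variances = items.var(axis=0, ddof=1)      # variance of each item
    total_variance = items.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Illustrative data only: 10 respondents answering 5 Likert items scored 1-5.
rng = np.random.default_rng(0)
base = rng.integers(1, 6, size=(10, 1))
responses = np.clip(base + rng.integers(-1, 2, size=(10, 5)), 1, 5)
print(f"Cronbach's alpha = {cronbach_alpha(responses):.2f}")
```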


Cyber Evaluation and Management Toolkit (CEMT): Face Validity of Model-Based Cybersecurity Decision Making


1. Introduction

1.1. Background

“The DON needs a new approach to cybersecurity that goes beyond compliance because our over-reliance on compliance has resulted in insecure systems, which jeopardise the missions these systems support. Instead of a compliance mindset, the DON will shift to Cyber Ready, where the right to operate is earned and managed every day. The DON will make this transition by adhering to an operationally relevant, threat-informed process that affordably reduces risk and produces capabilities that remain secure after they have been delivered at speed”. [ 5 ] (p. 7)

1.2. Literature Review

1.3. Cyberworthiness

“The desired outcome of a range of policy and assurance activities that allow the operation of Defence platforms, systems and networks in a contested cyber environment. It is a pragmatic, outcome-focused approach designed to ensure all Defence capabilities are fit-for-purpose against cyber threats”. [ 43 ]
“2.10 The seaworthiness governance principles require that seaworthiness decisions are made:
a. mindfully—decisions are more effective and less likely to have unintended consequences when they are made with a thorough understanding of the context, the required outcome, the options available, and their implications now and in the future
b. collaboratively—obtaining input from all stakeholders and engaging in joint problem-solving results in better decisions (bearing in mind that collaboration does not necessarily require consensus)
c. accountably—decisions only become effective when people take accountability for making them happen
d. transparently—decisions are more effective when everyone understands what has been decided and why”. [ 44 ] (p.33)

1.4. Addressing the Problem

  • Usability—there is limited ability to easily create and review these graph-based threat assessments, especially in large, complex systems;
  • Efficiency—reusability of these assessments is limited in comparison to compliance-based approaches that re-apply a common control set;
  • Maintainability—it is difficult to update complex graph-based assessments without specialised toolsets as the system or threat environment evolves.
  • Are integrated threat models, developed using model-based systems engineering (MBSE) techniques, an effective and efficient basis for the assessment and evaluation of cyberworthiness?
  • Do the developed threat models provide decision makers with the necessary understanding to make informed security risk decisions?
  • Does the process provide sufficient reusability and maintainability that the methodology is more efficient than prevailing compliance-based approaches?
  • Do cybersecurity risk practitioners prefer the integrated threat model approach to traditional security risk assessment processes?

2. Materials and Methods

2.1. Threat-Based Cybersecurity Engineering

  • Threat Context, derived from the system or capability design/architecture;
  • Threat Identification, provided by the Cyber Threat Intelligence function within an organisation;
  • Threat Insight, contributed by the Cyber Threat Emulation function within an organisation;
  • Best Practice Controls, distilled from the various cybersecurity frameworks available within the cybersecurity body of knowledge.
  • Preventative Controls, a baseline of preventative cybersecurity controls within the system, for inclusion in the system design;
  • Detecting Controls, a baseline of detection and response controls relevant to the system, for implementation by the Cyber Operations function within an organisation;
  • Recovery Controls, a baseline of recovery and resilience controls relevant to the system, for implementation by the System Operations function within an organisation;
  • Residual Risk, the overall risk presented by the threats to the capability given the mitigation mechanisms that are in place.

2.2. Cyber Evaluation and Management Toolkit (CEMT)

2.3. CEMT Sample Model

2.3.1. Threat Modelling

  • Misuse case diagrams;
  • Intermediate mal-activity diagrams;
  • Detailed mal-activity diagrams.

2.3.2. Threat Mitigation

  • Allocating assets to the threat model;
  • Tracing controls to the threat model.

2.3.3. Risk Assessment

  • Attack tree assessment;
  • Parametric risk analysis (a minimal quantitative sketch follows this list);
  • Risk evaluation.
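The CEMT itself is not reproduced here, but the general idea behind a quantitative attack tree assessment can be sketched in a few lines. In the example below (plain Python, with hypothetical node names and probabilities), leaf nodes carry an assumed probability that an adversary action succeeds, OR nodes succeed if any branch succeeds, and AND nodes require every branch to succeed. This is a generic illustration of the technique, not the parametric risk analysis implemented in the toolkit.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    """A node in a simple attack tree: leaves carry a success probability,
    internal nodes combine their children with AND / OR semantics."""
    name: str
    kind: str = "leaf"            # "leaf", "and" or "or"
    probability: float = 0.0      # only used for leaves
    children: List["Node"] = field(default_factory=list)

def attack_probability(node: Node) -> float:
    if node.kind == "leaf":
        return node.probability
    child_ps = [attack_probability(c) for c in node.children]
    if node.kind == "and":        # every child step must succeed
        p = 1.0
        for cp in child_ps:
            p *= cp
        return p
    # "or": the attack succeeds if any branch succeeds
    q = 1.0
    for cp in child_ps:
        q *= (1.0 - cp)
    return 1.0 - q

# Hypothetical tree: compromise a system either by phishing an operator
# or by chaining a perimeter exploit with lateral movement.
root = Node("compromise system", "or", children=[
    Node("phish operator", probability=0.3),
    Node("exploit then move laterally", "and", children=[
        Node("exploit perimeter service", probability=0.2),
        Node("move laterally to target", probability=0.5),
    ]),
])
print(f"P(attack succeeds) = {attack_probability(root):.2f}")   # 0.37
```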

2.4. Achieving Threat-Based Cybersecurity Engineering

2.5. Efficiency through Automation

  • Automated update of complex drawings and simulations to ensure that changes to the design or threat environment can be incorporated efficiently into the threat model;
  • Automated model validation to ensure that basic review tasks are automated, allowing expert reviewers to focus on the actual threat assessment component;
  • Automated documentation to ensure that the process of creating enduring design artefacts is efficient and accurate.

3.1. Face Validity Trial Setup

3.2. Face Validity Trial Data Collection and Setup

4. Discussion

  • Appropriateness of the assessed controls to the system being assessed, as demonstrated by the responses to Question 1;
  • Prioritisation of controls, as demonstrated by the responses to Questions 6 and 14;
  • Ability for non-expert decision makers to understand the assessment, as demonstrated by Questions 7, 8, and 17.

4.1. Significance

  • Extended Model-Based Taxonomy—an extension of an open model-based systems engineering language such as UML or SysML; this is provided to facilitate a model-based approach;
  • Threat Focused—the threats to the system, rather than a best-practice control baseline or asset hierarchy, are used as the focal point of the assessment;
  • Detailed Adversary Modelling—the actions of the adversary are modelled in detail, facilitating a precise discussion and review of any threat analysis;
  • Visualisation and Simulation of Threat—detailed adversary modelling is expressed in simplified graphs such as attack trees, and branches of those graphs can be simulated quantitatively;
  • Explicit Traceability to Threats—derived security controls are directly traceable to adversary actions, facilitating discussion and review of the importance of each control in terms of the malicious action it mitigates.

4.2. Future Work

5. Conclusions

Author Contributions

Data Availability Statement

Acknowledgments

Conflicts of Interest

  • Australian Government—Department of Home Affairs, Protective Security Policy Framework. Available online: https://www.protectivesecurity.gov.au (accessed on 25 April 2024).
  • National Institute of Standards and Technology (NIST) Computer Security Resource Center (CSRC), NIST Risk Management Framework (RMF). Available online: https://csrc.nist.gov/projects/risk-management/about-rmf (accessed on 25 April 2024).
  • Australian Government—Australian Signals Directorate, Information Security Manual (ISM). Available online: https://www.cyber.gov.au/resources-business-and-government/essential-cyber-security/ism (accessed on 25 April 2024).
  • National Institute of Standards and Technology (NIST) Computer Security Resource Center (CSRC), NIST Special Publication 800-53 Rev. 5: Security and Privacy Controls for Information Systems and Organizations. Available online: https://csrc.nist.gov/pubs/sp/800/53/r5/upd1/final (accessed on 25 April 2024).
  • U.S. Department of Navy; Cyber Strategy, November 2023. Available online: https://dvidshub.net/r/irstzr (accessed on 25 April 2024).
  • Australian Government—Australian Signals Directorate, System Security Plan Annex Template (March 2024). Available online: https://www.cyber.gov.au/sites/default/files/2024-03/System%20Security%20Plan%20Annex%20Template%20%28March%202024%29.xlsx (accessed on 25 April 2024).
  • National Institute of Standards and Technology (NIST) Computer Security Resource Center (CSRC), Control Catalog (spreadsheet). Available online: https://csrc.nist.gov/files/pubs/sp/800/53/r5/upd1/final/docs/sp800-53r5-control-catalog.xlsx (accessed on 25 April 2024).
  • National Institute of Standards and Technology (NIST), OSCAL: The Open Security Controls Assessment Language. Available online: https://pages.nist.gov/OSCAL/ (accessed on 25 April 2024).
  • MITRE ATT&CK Framework. Available online: https://attack.mitre.org/ (accessed on 25 April 2024).
  • The Department of Defense Cyber Table Top Guide, Version 2, 16 September 2021. Available online: https://www.cto.mil/wp-content/uploads/2023/06/DoD-Cyber-Table-Top-Guide-v2-2021.pdf (accessed on 25 April 2024).
  • Monroe, M.; Olinger, J. Mission-Based Risk Assessment Process for Cyber (MRAP-C). ITEA J. Test Eval. 2020 , 41 , 229–232. [ Google Scholar ]
  • Kuzio de Naray, R.; Buytendyk, A.M. Analysis of Mission Based Cyber Risk Assessments (MBCRAs) Usage in DoD’s Cyber Test and Evaluation ; Institute for Defense Analyses: Alexandria, VA, USA, 2022; IDA Publication P-33109. [ Google Scholar ]
  • Kordy, B.; Piètre-Cambacédès, L.; Schweitzer, P.P. DAG-based attack and defense modeling: Don’t miss the forest for the attack trees. Comput. Sci. Rev. 2014 , 13–14 , 1–38. [ Google Scholar ] [ CrossRef ]
  • Weiss, J.D. A system security engineering process. In Proceedings of the 14th Annual NCSC/NIST National Computer Security Conference, Washington, DC, USA, 1–4 October 1991. [ Google Scholar ]
  • Schneier, B. Attack trees: Modeling security threats. Dr Dobb’s J. Softw. Tools 1999 , 12–24 , 21–29. Available online: https://www.schneier.com/academic/archives/1999/12/attack_trees.html (accessed on 25 April 2024).
  • Paul, S.; Vignon-Davillier, R. Unifying traditional risk assessment approaches with attack trees. J. Inf. Secur. Appl. 2014 , 19 , 165–181. [ Google Scholar ] [ CrossRef ]
  • Kordy, B.; Pouly, M.; Schweitzer, P. Probabilistic reasoning with graphical security models. Inf. Sci. 2016 , 342 , 111–131. [ Google Scholar ] [ CrossRef ]
  • Gribaudo, M.; Iacono, M.; Marrone, S. Exploiting Bayesian Networks for the analysis of combined Attack Trees. Electron. Notes Theor. Comput. Sci. 2015 , 310 , 91–111. [ Google Scholar ] [ CrossRef ]
  • Holm, H.; Korman, M.; Ekstedt, M. A Bayesian network model for likelihood estimations of acquirement of critical software vulnerabilities and exploits. Inf. Softw. Technol. 2015 , 58 , 304–318. [ Google Scholar ] [ CrossRef ]
  • Moskowitz, I.; Kang, M. An insecurity flow model. In Proceedings of the 1997 Workshop on New Security Paradigms, Cumbria, UK, 23–26 September 1997; pp. 61–74. [ Google Scholar ]
  • McDermott, J.; Fox, C. Using abuse case models for security requirements analysis. In Proceedings of the 15th Annual Computer Security Applications Conference, Phoenix, AZ, USA, 6–10 December 1999; pp. 55–64. [ Google Scholar ]
  • Sindre, G.; Opdahl, A.L. Eliciting security requirements with misuse cases. Requir. Eng. 2004 , 10 , 34–44. [ Google Scholar ] [ CrossRef ]
  • Karpati, P.; Sindre, G.; Opdahl, A.L. Visualizing cyber attacks with misuse case maps. In Requirements Engineering: Foundation for Software Quality ; Springer: Berlin/Heidelberg, Germany, 2010; pp. 262–275. [ Google Scholar ]
  • Abdulrazeg, A.; Norwawi, N.; Basir, N. Security metrics to improve misuse case model. In Proceedings of the 2012 International Conference on Cyber Security, Cyber Warfare and Digital Forensics, Kuala Lumpur, Malaysia, 26–28 June 2012. [ Google Scholar ]
  • Saleh, F.; El-Attar, M. A scientific evaluation of the misuse case diagrams visual syntax. Inf. Softw. Technol. 2015 , 66 , 73–96. [ Google Scholar ] [ CrossRef ]
  • Mai, P.; Goknil, A.; Shar, L.; Pastore, F.; Briand, L.C.; Shaame, S. Modeling Security and Privacy Requirements: A Use Case-Driven Approach. Inf. Softw. Technol. 2018 , 100 , 165–182. [ Google Scholar ] [ CrossRef ]
  • Matulevičius, R. Fundamentals of Secure System Modelling ; Springer International Publishing: Cham, Switzerland, 2017; pp. 93–115. [ Google Scholar ]
  • Sindre, G. Mal-activity diagrams for capturing attacks on business processes. In Requirements Engineering: Foundation for Software Quality ; Springer: Berlin/Heidelberg, Germany, 2007; pp. 355–366. [ Google Scholar ]
  • Opdahl, A.; Sindre, G. Experimental comparison of attack trees and misuse cases for security threat identification. Inf. Softw. Technol. 2009 , 51 , 916. [ Google Scholar ] [ CrossRef ]
  • Karpati, P.; Redda, Y.; Opdahl, A.; Sindre, G. Comparing attack trees and misuse cases in an industrial setting. Inf. Softw. Technol. 2014 , 56 , 294. [ Google Scholar ] [ CrossRef ]
  • Tondel, I.A.; Jensen, J.; Rostad, L. Combining Misuse Cases with Attack Trees and Security Activity Models. In Proceedings of the 2010 International Conference on Availability, Reliability and Security, Krakow, Poland, 15–18 February 2010; pp. 438–445. [ Google Scholar ]
  • Meland, P.H.; Tondel, I.A.; Jensen, J. Idea: Reusability of threat models—Two approaches with an experimental evaluation. In Engineering Secure Software and Systems ; Springer: Berlin/Heidelberg, Germany, 2010; pp. 114–122. [ Google Scholar ]
  • Purton, L.; Kourousis, K. Military Airworthiness Management Frameworks: A Critical Review. Procedia Eng. 2014 , 80 , 545–564. [ Google Scholar ] [ CrossRef ]
  • Mo, J.P.T.; Downey, K. System Design for Transitional Aircraft Support. Int. J. Eng. Bus. Manag. 2014 , 6 , 45–56. [ Google Scholar ] [ CrossRef ]
  • Hodge, R.J.; Craig, S.; Bradley, J.M.; Keating, C.B. Systems Engineering and Complex Systems Governance—Lessons for Better Integration. INCOSE Int. Symp. 2019 , 29 , 421–433. [ Google Scholar ] [ CrossRef ]
  • Simmonds, S.; Cook, S.C. Use of the Goal Structuring Notation to Argue Technical Integrity. INCOSE Int. Symp. 2017 , 27 , 826–841. [ Google Scholar ] [ CrossRef ]
  • United States Government Accountability Office. Weapon Systems Cybersecurity: DOD just Beginning to Grapple with Scale of Vulnerabilities. GAO-19-129 . 2018. Available online: https://www.gao.gov/products/gao-19-128 (accessed on 15 June 2024).
  • Joiner, K.F.; Tutty, M.G. A tale of two allied defence departments: New assurance initiatives for managing increasing system complexity, interconnectedness and vulnerability. Aust. J. Multi-Discip. Eng. 2018 , 14 , 4–25. [ Google Scholar ] [ CrossRef ]
  • Joiner, K.F. How Australia can catch up to U.S. cyber resilience by understanding that cyber survivability test and evaluation drives defense investment. Inf. Secur. J. A Glob. Perspect. 2017 , 26 , 74–84. [ Google Scholar ] [ CrossRef ]
  • Thompson, M. Towards Mature ADF Information Warfare—Four Years of Growth. Defence Connect Multi-Domain . 2020. Available online: https://www.defenceconnect.com.au/supplements/multi-domain-2 (accessed on 15 June 2024).
  • Fowler, S.; Sweetman, C.; Ravindran, S.; Joiner, K.F.; Sitnikova, E. Developing cyber-security policies that penetrate Australian defence acquisitions. Aust. Def. Force J. 2017 , 102 , 17–26. [ Google Scholar ]
  • Australian Senate. Budget Hearings on Foreign Affairs Defence and Trade, Testimony by Vice Admiral Griggs, Major General Thompson and Minister of Defence (29 May, 2033–2035 hours). 2018. Available online: https://parlview.aph.gov.au/mediaPlayer.php?videoID=399539timestamp3:19:43 (accessed on 15 June 2024).
  • Australian Government. ADF Cyberworthiness Governance Framework ; Australian Government: Canberra, Australia, 2020.
  • Australian Government. Defence Seaworthiness Management System Manual. 2018. Available online: https://www.defence.gov.au/sites/default/files/2021-01/SeaworthinessMgmtSystemManual.pdf (accessed on 15 June 2024).
  • Allen, M.S.; Robson, D.A.; Iliescu, D. Face Validity: A Critical but Ignored Component of Scale Construction in Psychological Assessment. Eur. J. Psychol. Assess. Off. Organ Eur. Assoc. Psychol. Assess. 2023 , 39 , 153–156. [ Google Scholar ] [ CrossRef ]
  • Fowler, S.; Sitnikova, E. Toward a framework for assessing the cyber-worthiness of complex mission critical systems. In Proceedings of the 2019 Military Communications and Information Systems Conference (MilCIS), Canberra, Australia, 12–14 November 2019. [ Google Scholar ]
  • Fowler, S.; Joiner, K.; Sitnikova, E. Assessing cyber-worthiness of complex system capabilities using MBSE: A new rigorous engineering methodology. IEEE Syst. J. 2022. submitted . Available online: https://www.techrxiv.org/users/680765/articles/677291-assessing-cyber-worthiness-of-complex-system-capabilities-using-mbse-a-new-rigorous-engineering-methodology (accessed on 25 April 2024).
  • Cyber Evaluation and Management Toolkit (CEMT). Available online: https://github.com/stuartfowler/CEMT (accessed on 25 April 2024).
  • Fowler, S. Cyberworthiness Evaluation and Management Toolkit (CEMT): A model-based approach to cyberworthiness assessments. In Proceedings of the Systems Engineering Test & Evaluation (SETE) Conference 2022, Canberra, Australia, 12–14 September 2022. [ Google Scholar ]
  • National Institute of Standards and Technology (NIST) Computer Security Resource Center (CSRC), NIST Special Publication 800-160 Rev. 2: Developing Cyber-Resilient Systems: A Systems Security Engineering Approach. Available online: https://csrc.nist.gov/pubs/sp/800/160/v2/r1/final (accessed on 25 April 2024).
  • National Institute of Standards and Technology (NIST), CSF 2.0: Cybersecurity Framework. Available online: https://www.nist.gov/cyberframework (accessed on 25 April 2024).
  • Madni, A.; Purohit, S. Economic analysis of model-based systems engineering. Systems 2019 , 7 , 12. [ Google Scholar ] [ CrossRef ]
  • Bussemaker, J.; Boggero, L.; Nagel, B. The agile 4.0 project: MBSE to support cyber-physical collaborative aircraft development. INCOSE Int. Symp. 2023 , 33 , 163–182. [ Google Scholar ] [ CrossRef ]
  • Amoroso, E.G. Fundamentals of Computer Security Technology ; Pearson College Div: Englewood Cliffs, NJ, USA, 1994. [ Google Scholar ]
  • INCOSE. Systems Engineering Vision 2020 ; International Council on Systems Engineering: Seattle, WA, USA, 2007. [ Google Scholar ]
  • Madni, A.M.; Sievers, M. Model-based systems engineering: Motivation, current status, and research opportunities. Syst. Eng. 2018 , 21 , 172–190. [ Google Scholar ] [ CrossRef ]
  • Huang, J.; Gheorghe, A.; Handley, H.; Pazos, P.; Pinto, A.; Kovacic, S.; Collins, A.; Keating, C.; Sousa-Poza, A.; Rabadi, G.; et al. Towards digital engineering—The advent of digital systems engineering. Int. J. Syst. Syst. Eng. 2020 , 10 , 234–261. [ Google Scholar ] [ CrossRef ]
  • Chelouati, M.; Boussif, A.; Beugin, J.; El Koursi, E.-M. Graphical safety assurance case using goal structuring notation (gsn)– challenges, opportunities and a framework for autonomous trains. Reliab. Eng. Syst. Saf. 2023 , 230 , 108–933. [ Google Scholar ] [ CrossRef ]
  • Sujan, M.; Spurgeon, P.; Cooke, M.; Weale, A.; Debenham, P.; Cross, S. The development of safety cases for healthcare services: Practical experiences, opportunities and challenges. Reliab. Eng. Syst. Saf. 2015 , 140 , 200–207. [ Google Scholar ] [ CrossRef ]
  • Nguyen, P.H.; Ali, S.; Yue, T. Model-based security engineering for cyber-physical systems: A systematic mapping study. Inf. Softw. Technol. 2017 , 83 , 116–135. [ Google Scholar ] [ CrossRef ]
  • Geismann, J.; Bodden, E. A systematic literature review of model-driven security engineering for cyber–physical systems. J. Syst. Softw. 2020 , 169 , 110697. [ Google Scholar ] [ CrossRef ]
  • Carter, B.; Adams, S.; Bakirtzis, G.; Sherburne, T.; Beling, P.; Horowitz, B. A preliminary design-phase security methodology for cyber–physical systems. Systems 2019 , 7 , 21. [ Google Scholar ] [ CrossRef ]
  • Larsen, M.H.; Muller, G.; Kokkula, S. A Conceptual Model-Based Systems Engineering Method for Creating Secure Cyber-Physical Systems. INCOSE Int. Symp. 2022 , 32 , 202–213. [ Google Scholar ] [ CrossRef ]
  • Japs, S.; Anacker, H.; Dumitrescu, R. SAVE: Security & safety by model-based systems engineering on the example of automotive industry. In Proceedings of the 31st CIRP Design Conference, Online, 19–21 May 2021. [ Google Scholar ]
  • Navas, J.; Voirin, J.; Paul, S.; Bonnet, S. Towards a model-based approach to systems and cybersecurity: Co-engineering in a product line context. Insight (Int. Counc. Syst. Eng.) 2020 , 23 , 39–43. [ Google Scholar ] [ CrossRef ]
  • Geismann, J.; Gerking, C.; Bodden, E. Towards ensuring security by design in cyber-physical systems engineering processes. In Proceedings of the International Conference on the Software and Systems Process, Gothenburg, Sweden, 26–27 May 2018. [ Google Scholar ]
  • Mažeika, D.; Butleris, R. MBSEsec: Model-based systems engineering method for creating secure systems. Appl. Sci. 2020 , 10 , 2574. [ Google Scholar ] [ CrossRef ]
  • Object Management Group. UAF: Unified Architecture Framework. 2022. Available online: https://www.omg.org/spec/UAF. (accessed on 15 June 2024).
  • Jurjens, J. Secure Systems Development with UML ; Springer: Berlin/Heidelberg, Germany, 2005. [ Google Scholar ]
  • Apvrille, L.; Roudier, Y. Towards the model-driven engineering of secure yet safe embedded systems. Int. Workshop Graph. Models Secur. 2014 , 148 , 15–30. [ Google Scholar ] [ CrossRef ]


| Item | Survey Question | Strongly Disagree | Disagree | Neutral | Agree | Strongly Agree |
|---|---|---|---|---|---|---|
| Q1 | The CEMT produces risk assessments that are tailored to the context in which the system operates | 0 | 0 | 15 | 50 | 35 |
| Q2 | Cyberworthiness assessments are simple to produce using the CEMT | 5 | 0 | 40 | 35 | 20 |
| Q3 | The CEMT is an effective use of time | 0 | 0 | 30 | 25 | 45 |
| Q4 | The CEMT process is intuitive | 0 | 5 | 25 | 45 | 25 |
| Q5 | The CEMT encourages stakeholders to work collaboratively to determine the residual risk level | 0 | 0 | 10 | 35 | 55 |
| Q6 | The CEMT clearly identifies which security controls are important to the system | 0 | 0 | 5 | 55 | 40 |
| Q7 | The CEMT produces transparent cyberworthiness assessments | 0 | 5 | 10 | 40 | 45 |
| Q8 | The CEMT facilitates informed decision making with respect to the identified cybersecurity risks | 0 | 0 | 5 | 50 | 45 |
| Q9 | The CEMT produces cyberworthiness assessments that have ongoing value through the future phases of the capability life cycle | 0 | 0 | 10 | 40 | 50 |
| Q10 | The CEMT would improve my understanding of the cyberworthiness of a system | 0 | 0 | 10 | 20 | 70 |
| Q11 | The CEMT produces accurate assessments of a system’s cyberworthiness | 0 | 10 | 20 | 35 | 35 |
| Q12 | The CEMT facilitates the engagement of stakeholders and the provision of meaningful input from those stakeholders into a cyberworthiness assessment | 0 | 0 | 20 | 40 | 40 |
| Q13 | The cyberworthiness assessments produced by the CEMT are sufficiently detailed | 0 | 5 | 20 | 30 | 45 |
| Q14 | The CEMT identifies the relative impact of security controls with respect to the cyberworthiness of the system | 0 | 5 | 15 | 40 | 40 |
| Q15 | The CEMT is not overly dependent on the subjective opinion of subject matter experts | 0 | 0 | 30 | 50 | 20 |
| Q16 | The CEMT provides sufficient information to allow decision makers to be accountable for their decisions | 0 | 10 | 15 | 35 | 40 |
| Q17 | The CEMT clearly highlights the areas of greatest cyber risk to the system | 0 | 0 | 15 | 35 | 50 |
| Q18 | The CEMT adds value to a system and/or project | 0 | 0 | 5 | 35 | 60 |
| Q19 | The CEMT provides a complete and comprehensive approach to determining cyberworthiness | 5 | 10 | 10 | 50 | 25 |
| Q20 | The CEMT is an improvement over existing cyberworthiness assessment processes | 0 | 5 | 10 | 20 | 65 |
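One common way to summarise responses like those tabulated above is the share of respondents who agree or strongly agree with each statement, or a mean score with the five categories coded 1 to 5. The short Python sketch below shows that calculation for a few rows; it assumes the tabulated values are percentages of respondents (each row sums to 100), which is an inference rather than a statement from the paper.

```python
# Likert distribution per question, in the order:
# [Strongly Disagree, Disagree, Neutral, Agree, Strongly Agree].
# Values are taken from the table above and assumed to be percentages.
responses = {
    "Q1": [0, 0, 15, 50, 35],
    "Q2": [5, 0, 40, 35, 20],
    "Q3": [0, 0, 30, 25, 45],
    "Q20": [0, 5, 10, 20, 65],
}

for question, dist in responses.items():
    agreement = dist[3] + dist[4]                      # Agree + Strongly Agree
    weights = range(1, 6)                              # score categories 1..5
    mean_score = sum(w * p for w, p in zip(weights, dist)) / sum(dist)
    print(f"{question}: {agreement}% agreement, mean score {mean_score:.2f}")
```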
| # | Model-Based Security Assessment Approach | Extended Model-Based Taxonomy | Threat Focused | Detailed Adversary Modelling | Visualisation and Simulation of Threats | Explicit Traceability to Threats |
|---|---|---|---|---|---|---|
| 1 | CSRM [ ] | Y | N | N | N | N |
| 2 | Larsen et al. [ ] | Y | N | N | N | N |
| 3 | SAVE [ ] | Y | Y | N | N | N |
| 4 | Navas et al. [ ] | Y | Y | N | N | N |
| 5 | Geissman et al. [ ] | Y | Y | N | N | N |
| 6 | MBSESec [ ] | Y | Y | Y | N | N |
| 7 | UAF [ ] | Y | N | N | N | N |
| 8 | UMLSec [ ] | Y | N | N | N | N |
| 9 | SysML-Sec [ ] | Y | N | N | N | N |
| 10 | CEMT | Y | Y | Y | Y | Y |
The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

Fowler, S.; Joiner, K.; Ma, S. Cyber Evaluation and Management Toolkit (CEMT): Face Validity of Model-Based Cybersecurity Decision Making. Systems 2024 , 12 , 238. https://doi.org/10.3390/systems12070238

Fowler S, Joiner K, Ma S. Cyber Evaluation and Management Toolkit (CEMT): Face Validity of Model-Based Cybersecurity Decision Making. Systems . 2024; 12(7):238. https://doi.org/10.3390/systems12070238

Fowler, Stuart, Keith Joiner, and Siqi Ma. 2024. "Cyber Evaluation and Management Toolkit (CEMT): Face Validity of Model-Based Cybersecurity Decision Making" Systems 12, no. 7: 238. https://doi.org/10.3390/systems12070238


  • Open access
  • Published: 29 June 2024

Psychometric properties of the Arabic translation of the Physical Appearance Comparison Scale-Revised (PACS-R) in adults

Marie Anne El Khoury, Diana Malaeb, Mirna Fawaz, Nancy Chammas, Michel Soufia, Feten Fekih-Romdhane, Sahar Obeid & Souheil Hallit

BMC Psychology volume 12, Article number: 371 (2024)


Physical comparison may be a factor in body dissatisfaction and related issues, like eating disorders and depression. The Physical Appearance Comparison Scale-Revised (PACS-R) is a scale developed to assess the frequency of physical comparison. Because there is no validated scale for body comparison in Arabic, this study aims to address this gap by validating the PACS-R in the Arabic language.

The PACS-R was translated to Arabic following a conventional forward-backward translation procedure, and was administered to a sample of 359 Lebanese adults along with The Depression Anxiety Stress Scale, and the Rosenberg self-esteem scale (RSES) for convergent validity. The factor structure was studied by confirmatory factor analysis (CFA), and composite reliability was assessed using McDonald’s omega and Cronbach’s alpha.

Results suggested a one-factor structure of the Arabic PACS-R, with good internal consistency (McDonald’s ω = 0.97 / Cronbach α = 0.97). Measurement invariance was established across sex groups, with no significant difference being reported between males and females in terms of PACS-R scores (15.42 ± 10.64 vs. 13.16 ± 11.88; t(357) = 1.84; p  = .066). Finally, convergent validity was found to be adequate, with PACS-R scores correlating negatively with self-esteem and positively with psychological distress.

The present findings preliminarily establish the Arabic PACS-R as an effective instrument for researchers and practitioners aiming to explore the physical comparison among Arabic-speaking populations, thus contributing to research and clinical work in the Arabic community.


Introduction

Body dissatisfaction represents a pervasive concern within contemporary society, impacting individuals across various age groups, genders, and cultural backgrounds [ 1 , 2 ]. Furthermore, it is a core symptom of eating disorders [ 3 ] and one of their leading causes [ 4 ]. It is also implicated in depression and low self-esteem, affecting both sexes, adolescents and adults [ 1 , 5 , 6 , 7 , 8 , 9 , 10 ]. Body dissatisfaction is a multifaceted phenomenon influenced by psychological, sociocultural, and environmental factors [ 2 , 11 , 12 ]. It can stem from comparison with societal ideals, often internalized through media exposure and reinforced by peer and familial attitudes. This leads to the theory of social comparison, first introduced by Festinger in 1954 [ 13 ], which suggests that people have a natural drive to evaluate their own opinions and abilities. When lacking objective measures, they instinctively compare themselves to others. The comparison can be upward or downward, depending on whether individuals compare themselves with others perceived to be superior or inferior in some way. This theory can be applied to different psychological and social contexts, notably body image [ 12 , 14 , 15 ]. In the context of body image, this theory has played a key role in understanding how comparative evaluation with peers, media portrayals, and societal beauty norms shapes individual perceptions of physical attractiveness and value [ 16 , 17 ]. It has been recognized that unintended comparisons can take place, and the benchmark used in the comparison might involve someone quite different from oneself [ 12 ].

Research highlights the potentially harmful effects of engaging in social comparisons based on appearance, whether it is peer comparison or social media comparison [ 14 , 17 , 18 , 19 ]. Based on social comparison theory, comparison can happen upward, toward idealized body images portrayed by social media and television, which frequently results in feelings of insufficiency, dissatisfaction with one’s body, and a negative self-image [ 19 , 20 ]. Thus, upward comparisons are linked to a more negative impact than downward comparisons [ 21 , 22 ]. Moreover, comparison with media tends to have a more harmful effect [ 21 , 23 ]. Social comparison, and more specifically appearance comparison, is associated with body dissatisfaction, disordered eating, and low self-esteem [ 14 , 24 ]. Social comparison correlates positively with psychological distress [ 25 ], depression and anxiety [ 26 ]. Physical comparison has also been associated with higher anxiety [ 24 , 27 ] and depression [ 28 ]. In addition, sex differences appear to exist in physical social comparison, with females and males affected differently. Females seem more inclined than males to compare their appearance to others, which is associated with several negative psychological outcomes such as lower self-esteem, depression, body dissatisfaction, and dieting behaviors [ 17 , 29 ]. While males also engage in appearance comparisons, they do so less frequently and with fewer negative consequences for their body image [ 30 ]. Overall, this body of research underscores the significant, and often harmful, impact of appearance comparisons on both females’ and males’ mental health and body image, with a stronger effect observed in females [ 31 ]. Considering the significant role that appearance comparisons play in issues related to body image and eating disorders, it is crucial to possess a tool that effectively measures an individual’s propensity for engaging in physical appearance comparisons.

Measurement instruments of physical appearance comparison

Different scales have been created to evaluate the inclination towards appearance comparison, but most come with considerable drawbacks. Among the first to be developed was the Body Comparison Scale (BCS) [ 11 ], which evaluates the frequency with which an individual compares specific parts of their body with others. However, a notable limitation of this tool is that it does not directly assess comparisons of weight or body fat [ 31 ]. Additionally, the scale lacks details about the comparison’s target and the context, both of which are vital for understanding the dynamics and potential triggers of appearance comparisons. O’Brien et al. [ 32 ] introduced scales designed to measure the propensity for engaging in comparisons with those deemed significantly more attractive (Upward Physical Appearance Comparison Scale, UPACS) and those considered much less attractive (Downward Appearance Comparison Scale, DACS). Their validation was confined to the Chinese cultural milieu, wherein their psychometric characteristics were found to be satisfactory [ 33 ]. Nonetheless, Schaefer and Thompson [ 31 ] raised critiques regarding the UPACS and DACS scales, pointing out that these scales judge appearance comparisons through the lens of attractiveness stereotypes and do not cover lateral comparisons, where individuals compare themselves to others of perceived similar attractiveness; consequently, they might only offer a narrow view of the frequency with which individuals engage in appearance comparisons.

The Physical Appearance Comparison Scale (PACS), created by Thompson et al. in 1991 [ 34 ], was considered one of the primary validated tools for assessing how individuals compare their looks with others [ 17 ]. It is a 5-item scale primarily developed for females; sex differences in body image concerns therefore highlight a potential limitation of the original PACS, as males and females aspire to different physical ideals, which may not be fully captured by the scale. The Physical Appearance Comparison Scale-Revised (PACS-R) addressed this issue, among others, including the evaluation of weight and shape and a wider variety of comparison contexts [ 31 ]. The PACS-R demonstrates excellent internal consistency (Cronbach’s alpha of 0.97), featuring 11 items phrased neutrally and encompassing a broader range of contexts for evaluation. Exploratory factor analysis and parallel analysis suggested a single-factor structure for the PACS-R, as well as strong convergent validity with indices of body satisfaction, eating disorders, the impact of sociocultural standards on appearance, and self-esteem among female college students [ 31 ]. The PACS-R has been translated and validated in several languages, including Spanish [ 35 ], Iranian [ 36 ], and Brazilian Portuguese [ 37 ], all of which found single-factor structures. In addition, high physical comparison is associated with low self-esteem, as seen in the original PACS-R study [ 31 ] and other validation studies [ 35 , 37 ]. Strong associations were found with eating disorders and stress [ 31 , 35 , 37 ], aligning with previous research showing an association of physical comparison with stress, anxiety, and depression [ 38 , 39 ]. To date, however, there has been no translation and validation of the PACS-R into the Arabic language.

The present study

In the Arab world, around one-third of females display restrictive eating patterns [ 40 ]. Several studies [ 41 , 42 , 43 , 44 , 45 , 46 , 47 , 48 ] showed how media exposure, societal and peer pressures, and individual factors (like sex, age, and BMI) contribute to body image concerns and eating disorders in the Arab context. The findings highlight a need for comprehensive health education, media literacy initiatives, and mental health support tailored to the unique cultural and societal framework of the Arab world. These efforts aim to mitigate the impact of negative body image and eating disorders among youth, advocating for a healthier, more inclusive understanding of body image and self-esteem. Along these lines, an Arabic version of the PACS-R is needed to address physical comparison in Arabic-speaking populations. Moreover, applying social comparison theory to the Arab world, findings show that higher levels of collectivism are linked with a greater overall inclination to engage in comparison, a heightened interest in making upward comparisons, and a reduced interest in making downward comparisons [ 49 ]. Hofstede [ 50 ] posits that Arab nations are characterized by a collectivist cultural orientation, and thus social comparison has a considerable impact. While the concept of body comparison holds significant importance, the absence of a validated Arabic measure stands as a gap. Advancing research in this field necessitates the creation of reliable and valid tools. The current study has the following objectives: first, to analyze the factor structure and assess the model fit of the PACS-R adapted into Arabic; second, to investigate the consistency of its measurement across sex; and third, to evaluate the validity of our Arabic translated version of the PACS-R by exploring its association with self-esteem and psychological distress. We hypothesized that the Arabic PACS-R would reveal a unidimensional structure with a satisfactory level of internal consistency and would display measurement invariance across sex. Moreover, we anticipated that PACS-R scores would correlate positively with psychological distress and negatively with self-esteem.

Study design and participants

A total of 359 Lebanese participants were enrolled in this cross-sectional study, conducted between September and November 2022 through convenience sampling in several Lebanese governorates. The research team approached people and asked them to complete the survey; those who accepted were asked to forward the link to other people they knew, in line with the snowball sampling technique used. The survey was a Google Forms questionnaire administered online. Participants were informed about the study and provided with an online link; clicking the link led interested participants to the consent and information form (outlining the current study’s objectives, anonymity, and the voluntary nature of participation). When confidentiality is assured, participants are encouraged to respond honestly and deliver more accurate information. In addition, detailed instructions explaining the purpose of the survey and the importance of thoughtful responses minimized inaccuracy. No rewards were given to participants in return for participation.

The questionnaire used was anonymous and in Arabic, the native language in Lebanon. It required approximately 10 to 15 min to complete and consisted of three parts. The first part explained the study’s topic and objective and included a statement ensuring the anonymity of respondents. Participants had to select the option stating “I consent to participate in this study” to be directed to the questionnaire.

Sociodemographic survey

Participants provided self-reports on their age, sex, marital status, body mass index (calculated from self-reported weight and height), and the household crowding index, which reflects socioeconomic status (calculated by dividing the number of persons in the household by the number of rooms, excluding the kitchen and bathrooms) [ 51 ].

Revised Physical Appearance Comparison Scale (PACS-R) [ 31 ]: The PACS-R comprises 11 items designed to assess how often individuals compare their physical appearance to that of others across a wide range of social contexts. Responses are collected using a 5-point Likert scale, with options extending from “Never” to “Always.” A higher score on the scale signifies a greater frequency of appearance comparison. The Arabic version of the PACS-R scale was translated and culturally adapted before being used in this study. This involved translating the scale into Arabic in line with international standards and recommendations to ensure semantic equivalence between the original measurements and their Arabic counterparts [ 52 ]. We used a forward and back-translation procedure. The Arabic version was initially translated from English by a Lebanese translator. Subsequently, a Lebanese psychologist fluent in English retranslated the Arabic text back into English, ensuring that each translation, whether specific or literal, was suitable. In addition to the study team, two psychiatrists and a psychologist reviewed both the original and retranslated English versions to identify and rectify any discrepancies, ensuring the accuracy of the translation. A specialized measure was implemented to confirm that the Arabic and original versions are conceptually equivalent. This step was designed to address any potential misunderstandings concerning the language and readability of the items [ 53 ]. A pilot study was conducted on 20 persons before the start of the official data collection to make sure all questions were well understood; no changes were made as a result.

The DASS scale

The Depression Anxiety Stress Scales [ 54 ] is a self-report questionnaire created to quantify three negative emotional states: depression, anxiety, and stress. We used a shorter, 8-item version (DASS-8) [ 55 ], which has demonstrated high validity and reliability. It is composed of three subscales: depression (3 items, ω = 0.82 / α = 0.82), anxiety (3 items, ω = 0.81 / α = 0.81), and stress (2 items, α = 0.68). Items are rated on a four-point scale from 0 to 3.

The Rosenberg Self-Esteem scale (RSES)

The RSES [ 56 ] was employed to assess trait self-esteem. This instrument includes 10 items, half of which are reverse-scored. It utilizes a 4-point Likert scale ranging from “Strongly Disagree” to “Strongly Agree,” where higher scores signify greater self-esteem. The scale has been previously utilized in its Arabic-translated form in various studies [ 57 , 58 ].

Analytic Strategy

Confirmatory factor analysis

There were no missing responses in the dataset. We used data from the total sample to conduct a CFA using the SPSS AMOS v.26 software. We aimed to enroll a minimum of 220 participants, following the recommendation of Mundfrom et al. of 3 to 20 times the number of the scale’s variables [ 59 ]. Parameter estimates were obtained using the maximum likelihood method. Multiple fit indices were calculated: the Steiger-Lind root mean square error of approximation (RMSEA), the standardized root mean square residual (SRMR), the Tucker-Lewis Index (TLI), and the Comparative Fit Index (CFI). Values ≤ 0.08 for RMSEA, ≤ 0.05 for SRMR, and ≥ 0.90 for CFI and TLI indicate a good fit of the model to the data [ 60 ]. Additionally, values of the average variance extracted (AVE) ≥ 0.50 indicated evidence of convergent validity [ 61 ]. Multivariate normality was not met (Bollen-Stine bootstrap p  = .002); therefore, we performed a non-parametric bootstrapping procedure.
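The fit indices listed above are simple functions of the model and baseline (independence) chi-square statistics. The study computed them with SPSS AMOS; purely to illustrate the underlying formulas, the Python sketch below reimplements them with made-up input values that are not the study’s results.

```python
import math

def fit_indices(chi2_model, df_model, chi2_baseline, df_baseline, n):
    """Standard chi-square-based fit indices used to judge CFA model fit."""
    # RMSEA: misfit per degree of freedom, adjusted for sample size
    rmsea = math.sqrt(max(chi2_model - df_model, 0) / (df_model * (n - 1)))
    # CFI: improvement over the baseline (independence) model, bounded to [0, 1]
    d_model = max(chi2_model - df_model, 0)
    d_base = max(chi2_baseline - df_baseline, d_model)
    cfi = 1 - d_model / d_base if d_base > 0 else 1.0
    # TLI (non-normed fit index)
    tli = ((chi2_baseline / df_baseline) - (chi2_model / df_model)) / \
          ((chi2_baseline / df_baseline) - 1)
    return rmsea, cfi, tli

# Purely illustrative numbers, not the study's output.
rmsea, cfi, tli = fit_indices(chi2_model=250.0, df_model=44,
                              chi2_baseline=3500.0, df_baseline=55, n=359)
print(f"RMSEA={rmsea:.3f}, CFI={cfi:.3f}, TLI={tli:.3f}")
```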

Sex invariance

To examine sex invariance of PACS-R scores, we conducted multi-group CFA [ 62 ] using the total sample. Measurement invariance was assessed at the configural, metric, and scalar levels [ 63 ]. We accepted ΔCFI ≤ 0.010 and ΔRMSEA ≤ 0.015 or ΔSRMR ≤ 0.010 as evidence of invariance [ 64 ].
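As a minimal sketch of how this decision rule can be applied once fit indices for the nested multi-group models are available, the following Python fragment compares a constrained model against its reference model. The numbers are hypothetical, and the and/or grouping of the criteria reflects one common reading of the rule.

```python
def invariance_supported(fit_constrained: dict, fit_reference: dict) -> bool:
    """Change-in-fit criteria described in the text: invariance is retained
    if ΔCFI <= .010 and (ΔRMSEA <= .015 or ΔSRMR <= .010)."""
    d_cfi = fit_reference["cfi"] - fit_constrained["cfi"]
    d_rmsea = fit_constrained["rmsea"] - fit_reference["rmsea"]
    d_srmr = fit_constrained["srmr"] - fit_reference["srmr"]
    return d_cfi <= 0.010 and (d_rmsea <= 0.015 or d_srmr <= 0.010)

# Illustrative comparison of a metric (loadings-constrained) model
# against the configural model; the fit values are hypothetical.
configural = {"cfi": 0.941, "rmsea": 0.124, "srmr": 0.031}
metric     = {"cfi": 0.936, "rmsea": 0.122, "srmr": 0.038}
print(invariance_supported(metric, configural))   # True
```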

Further analyses

Composite reliability was assessed using McDonald’s ω and Cronbach’s α, with values greater than 0.70 reflecting adequate composite reliability [ 65 ]. Normality was verified since the skewness and kurtosis values for each item of the scale varied between − 1 and + 1 [ 66 ]. The Pearson correlation test was used to correlate the PACS-R scores with the other scales in the survey, and the Student t test was used to compare means between two groups.
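McDonald’s omega and the average variance extracted reported in the Results are both simple functions of the standardized loadings of a one-factor model. The sketch below shows the standard formulas for a congeneric (single-factor) scale; the eleven loadings are hypothetical placeholders, not the estimates obtained in this study.

```python
import numpy as np

def omega_and_ave(loadings):
    """McDonald's omega and average variance extracted (AVE) for a
    single-factor model, given standardized factor loadings."""
    lam = np.asarray(loadings, dtype=float)
    uniqueness = 1.0 - lam ** 2                 # residual variance per item
    omega = lam.sum() ** 2 / (lam.sum() ** 2 + uniqueness.sum())
    ave = np.mean(lam ** 2)
    return omega, ave

# Hypothetical standardized loadings for an 11-item, one-factor scale.
loadings = [0.85, 0.88, 0.82, 0.86, 0.84, 0.87, 0.83, 0.85, 0.86, 0.84, 0.83]
omega, ave = omega_and_ave(loadings)
print(f"omega = {omega:.2f}, AVE = {ave:.2f}")
```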

Participants

Three hundred and fifty-nine participants were included in this study, with a mean age of 22.75 ± 7.04 years (age range 18–58); 59.9% were females and 92.2% were single. In addition, the mean BMI was 24.12 ± 5.12 kg/m² and the mean household crowding index was 1.28 ± 1.92 persons/room.

Confirmatory factor analysis of the PACS-R scale

CFA indicated that the fit of the one-factor model of the PACS-R scale was acceptable: RMSEA = 0.125 (90% CI 0.112, 0.139), SRMR = 0.031, CFI = 0.940, TLI = 0.924. The standardized factor loadings were all adequate (Fig. 1). Composite reliability of scores was adequate in the total sample (ω = 0.97 / α = 0.97). Convergent validity for this model was very good, as AVE = 0.72.

Figure 1. Standardized loading factors of the Physical Appearance Comparison Scale-Revised (PACS-R) items in Arabic.

Gender invariance

We were able to show the invariance across sex at the configural, metric, and scalar levels (Table  1 ). No significant difference was seen between males and females in terms of PACS-R scores (15.42 ± 10.64 vs. 13.16 ± 11.88; t (357) = 1.84; p  = .066).

Concurrent validity

Higher physical appearance comparison scores were significantly associated with lower self-esteem ( r  = − .43; p  < .001) and higher psychological distress ( r  = .37; p  < .001).
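For illustration only, the kind of analysis behind these figures (Pearson correlations and an independent-samples t test) can be reproduced with a few lines of SciPy; the simulated scores, variable names, and effect sizes below are invented and serve only to show the mechanics.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical scores for illustration only.
pacs_r = rng.normal(14, 11, size=359)
self_esteem = 30 - 0.4 * pacs_r + rng.normal(0, 4, size=359)
distress = 5 + 0.3 * pacs_r + rng.normal(0, 5, size=359)
sex = rng.integers(0, 2, size=359)          # 0 = female, 1 = male

r_se, p_se = stats.pearsonr(pacs_r, self_esteem)
r_di, p_di = stats.pearsonr(pacs_r, distress)
t, p_t = stats.ttest_ind(pacs_r[sex == 1], pacs_r[sex == 0])

print(f"PACS-R vs self-esteem: r = {r_se:.2f}, p = {p_se:.3f}")
print(f"PACS-R vs distress:    r = {r_di:.2f}, p = {p_di:.3f}")
print(f"Male vs female t-test: t = {t:.2f}, p = {p_t:.3f}")
```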

Discussion

The objective of this study was to translate the PACS-R into Arabic and to examine its psychometric properties in terms of factor structure, internal consistency reliability, cross-sex measurement invariance and concurrent validity. To this end, CFA, reliability evaluation, and correlational analyses were conducted. The findings of our study support the satisfactory psychometric characteristics of the Arabic version of the PACS-R. The evaluation of the Arabic PACS-R in a sample of Arabic-speaking Lebanese adults identified a single-factor structure with all 11 items retained, which aligns with the original model [ 31 ]. As expected, the Arabic PACS-R also exhibited good reliability and concurrent validity, suggesting its suitability for use among Arabic-speaking adults in community settings.

Our results share the single-factor structure found in the original scale validation [ 31 ], whose authors initially considered a multi-factor solution but ultimately supported a single-factor solution through additional analyses. Similar findings supporting the one-dimensional structure were observed in subsequent translation validations [ 35 , 36 , 37 ]. Our findings also showed that the composite reliability of the Arabic version of the PACS-R was excellent (ω = 0.97 / α = 0.97). These high values indicate that the scale items are consistent and effectively measure the same underlying construct. This is consistent with validations in other languages, which reported high internal consistency, with Cronbach’s alpha and McDonald’s omega values above 0.95, indicating excellent reliability [ 35 , 36 , 37 ].

Sex invariance of the Arabic PACS-R was established, indicating the scale’s applicability across sexes. This aligns with another mixed-sex PACS-R validation [ 35 ] and means that the PACS-R measures the same construct in the same way for males and females, allowing for direct comparisons. Since other validations used female-only samples [ 36 , 37 ], this study is, along with the Spanish version, one of the few that validated the scale in a mixed-sex sample. Our sample had a good ratio of males to females, which is a notable strength of the Arabic PACS-R version. As for between-sex comparisons, our study showed no significant difference in PACS-R scores. This contrasts with the findings from the Spanish translation, where a significant sex difference was observed in PACS-R scores, highlighting the influence of sex on physical appearance comparison concerns [ 35 ]. These differences can be attributed to the characteristics of the study samples (such as age range and social and economic backgrounds) and to the specific populations sampled (e.g., university students vs. the general population). For example, a sample composed mostly of young adults from a university setting might reflect more homogeneous attitudes toward appearance, potentially minimizing or exaggerating the sex differences seen in a broader, more diverse population.

Finally, the Arabic PACS-R showed good concurrent validity with measures of self-esteem and psychological distress. In particular, increased frequency of body comparison correlated with lower self-esteem. This is supported by the original validation study of the PACS-R [ 31 ], as well as other studies [ 24 , 35 , 37 ]. Correspondingly, greater PACS-R scores also correlated with increased psychological distress. These outcomes align with previous research linking increased physical comparison to higher levels of depression and anxiety [ 67 , 68 ]. Indeed, low self-esteem has been linked to upward social comparison [ 18 ]. This relation appears bidirectional: upward social comparison seems to lower self-esteem, while people with low self-esteem and negative mood tend to engage in upward social comparisons [ 19 , 20 ]. Additionally, physical appearance is recognized as one of the most prominent aspects of self-esteem, especially among teenagers and young adults [ 69 ]. Thus, the association between higher PACS-R scores and both lower self-esteem and greater psychological distress might fall under this bidirectional relation. Moreover, upward social comparisons have also been associated with additional adverse outcomes, such as depressive symptoms [ 26 , 70 , 71 ], and self-esteem has been shown to partially mediate the relationship between depressive symptoms and upward social comparisons [ 69 , 72 ].

Study limitations

The current study’s limitations should be acknowledged. Primarily, the data were collected through convenience (non-probabilistic), web-based sampling, which might restrict the generalizability of our findings. The sample was composed mostly of Lebanese young adults, with slightly more females than males, which may limit the applicability of the findings to broader demographic populations. Furthermore, cultural differences between other Arabic-speaking countries and our Lebanese sample need to be taken into account; the Arabic PACS-R therefore requires further validation across different demographics, including older participants and respondents from other Arabic-speaking countries. Next, the reliance on self-reported surveys may introduce biases related to memory recall and social desirability. Finally, certain critical psychometric properties of the PACS-R, such as test-retest reliability, have not been assessed. These aspects warrant further examination in subsequent research.

Despite these limitations, the study offers substantial evidence that the Arabic version of the PACS-R possesses robust psychometric qualities. These results preliminarily establish the Arabic PACS-R as an effective instrument for researchers and practitioners aiming to explore physical appearance comparison among Arabic-speaking populations, thus contributing to research and clinical work in the Arabic-speaking community. Future studies across the lifespan (e.g., in adolescents), using larger samples of Arabic-speaking adults from different countries as well as clinical samples, are required to confirm the present findings.

Data availability

The datasets generated and/or analysed during the current study are not publicly available due to restrictions from the ethics committee but are available from the corresponding author on reasonable request.

References

Gogolin T, Norris E, Murch H, Volk F. The effect of male body dissatisfaction on sexual and relationship satisfaction in the presence of pornography use and depression. J Men’s Stud. 2024;10608265241236478. https://doi.org/10.1177/10608265241236478 .

Holmqvist K, Frisén A. Body dissatisfaction across cultures: findings and research problems. Eur Eat Disorders Rev. 2010;18(2):133–46. https://doi.org/10.1002/erv.965 .


American Psychiatric Association. Diagnostic and statistical manual of mental disorders: DSM-5. 5th ed. 2013.

Stice E, Shaw HE. Role of body dissatisfaction in the onset and maintenance of eating pathology: a synthesis of research findings. J Psychosom Res. 2002;53(5):985–93. https://doi.org/10.1016/S0022-3999(02)00488-9 .


Barnes M, Abhyankar P, Dimova E, Best C. Associations between body dissatisfaction and self-reported anxiety and depression in otherwise healthy men: a systematic review and meta-analysis. PLoS ONE. 2020;15(2):e0229268. https://doi.org/10.1371/journal.pone.0229268 .


Brechan I, Kvalem IL. Relationship between body dissatisfaction and disordered eating: mediating role of self-esteem and depression. Eat Behav. 2015;17:49–58. https://doi.org/10.1016/j.eatbeh.2014.12.008 .

Chen J, Peng S, Wei Y. New media facilitate adolescents’ body dissatisfaction and eating disorders in Mainland China. Trends Mol Med. 2024. https://doi.org/10.1016/j.molmed.2024.02.011 .

Corno G, Paquette A, Burychka D, Miragall M, Rivard M-C, Baños RM, Bouchard S. Development of a visual-perceptual method to assess body image: a cross-cultural validation in Canadian and Spanish women. Eur Eat Disorders Review: J Eat Disorders Association. 2024. https://doi.org/10.1002/erv.3086 .

Faria K, Haçul BE, Lopes J, de Andrade GF. Body image and sexual dissatisfaction of women assisted in a basic health unit. Braz J Phys Ther. 2024;28:100825. https://doi.org/10.1016/j.bjpt.2024.100825 .

Karazsia BT, Murnen SK, Tylka TL. Is body dissatisfaction changing across time? A cross-temporal meta-analysis. Psychol Bull. 2017;143(3):293–320. https://doi.org/10.1037/bul0000081 .

Fisher E, Dunn M, Thompson JK. Social comparison and body image: an investigation of body comparison processes using Multidimensional Scaling. J Soc Clin Psychol. 2002;21(5):566–79. https://doi.org/10.1521/jscp.21.5.566.22618 .

Morrison TG, Kalin R, Morrison MA. Body-image evaluation and body-image investment among adolescents: a test of sociocultural and social comparison theories. Adolescence. 2004;39(155):571–92.


Festinger L. A theory of social comparison processes. Hum Relat. 1954;7(2):117–40. https://doi.org/10.1177/001872675400700202 .

Bailey SD, Ricciardelli LA. Social comparisons, appearance related comments, contingent self-esteem and their relationships with body dissatisfaction and eating disturbance among women. Eat Behav. 2010;11(2):107–12. https://doi.org/10.1016/j.eatbeh.2009.12.001 .

Krayer A, Ingledew DK, Iphofen R. Social comparison and body image in adolescence: a grounded theory approach. Health Educ Res. 2008;23(5):892–903. https://doi.org/10.1093/her/cym076 .

Fitzsimmons-Craft EE, Harney MB, Koehler LG, Danzi LE, Riddell MK, Bardone-Cone AM. Explaining the relation between thin ideal internalization and body dissatisfaction among college women: the roles of social comparison and body surveillance. Body Image. 2012;9(1):43–9. https://doi.org/10.1016/j.bodyim.2011.09.002 .

Myers TA, Crowther JH. Social comparison as a predictor of body dissatisfaction: a meta-analytic review. J Abnorm Psychol. 2009;118(4):683–98. https://doi.org/10.1037/a0016763 .

Vogel E, Rose J, Roberts L, Eckles K. Social Comparison, Social Media, and self-esteem. Psychol Popular Media Cult. 2014;3:206–22. https://doi.org/10.1037/ppm0000047 .

Vogel EA, Rose JP, Okdie BM, Eckles K, Franz B. Who compares and despairs? The effect of social comparison orientation on social media use and its outcomes. Pers Indiv Differ. 2015;86:249–56. https://doi.org/10.1016/j.paid.2015.06.026 .

Tibber MS, Zhao J, Butler S. The association between self-esteem and dimensions and classes of cross-platform social media use in a sample of emerging adults – evidence from regression and latent class analyses. Comput Hum Behav. 2020;109:106371. https://doi.org/10.1016/j.chb.2020.106371 .

Schaefer L, Thompson J. The Development and Validation of the physical appearance comparison Scale-3 (PACS-3). Psychol Assess. 2018;30. https://doi.org/10.1037/pas0000576 .

Thompson JK, Heinberg LJ, Altabe M, Tantleff-Dunn S. Exacting beauty: theory, assessment, and treatment of body image disturbance. American Psychological Association; 1999.

Ridolfi D, Myers T, Crowther J, Ciesla J. Do Appearance focused cognitive distortions moderate the relationship between Social Comparisons to peers and media images and body image disturbance? Sex Roles. 2011;65:491–505. https://doi.org/10.1007/s11199-011-9961-0 .

Alcaraz-Ibáñez M, Sicilia Á, Díez-Fernández DM, Paterna A. Physical appearance comparisons and symptoms of disordered eating: the mediating role of social physique anxiety in Spanish adolescents. Body Image. 2020;32:145–9. https://doi.org/10.1016/j.bodyim.2019.12.005 .

Webb C. Psychological distress in clinical obesity: the role of eating disorder beliefs and behaviours, social comparison and shame. ProQuest Dissertations & Theses Global; 2000. https://www.proquest.com/dissertations-theses/psychological-distress-clinical-obesity-role/docview/900298535/se-2 .

McCarthy PA, Morina N. Exploring the association of social comparison with depression and anxiety: a systematic review and meta-analysis. Clinical Psychology Psychotherapy. 2020;27(5):640–71. https://doi.org/10.1002/cpp.2452 .

Rapee RM, Magson NR, Forbes MK, Richardson CE, Johnco CJ, Oar EL, Fardouly J. Risk for social anxiety in early adolescence: longitudinal impact of pubertal development, appearance comparisons, and peer connections. Behav Res Ther. 2022;154:104126. https://doi.org/10.1016/j.brat.2022.104126 .

Schlechter P, Morina N. The role of aversive appearance-related comparisons and self-discrepancy in Depression and Well-being from a Longitudinal General Comparative-Processing Perspective. Behav Ther. 2023. https://doi.org/10.1016/j.beth.2023.09.003 .

Keery H, van den Berg P, Thompson JK. An evaluation of the tripartite influence model of body dissatisfaction and eating disturbance with adolescent girls. Body Image. 2004;1(3):237–51. https://doi.org/10.1016/j.bodyim.2004.03.001 .

Karazsia BT, Crowther JH. Social body comparison and internalization: mediators of social influences on men’s muscularity-oriented body dissatisfaction. Body Image. 2009;6(2):105–12. https://doi.org/10.1016/j.bodyim.2008.12.003 .

Schaefer LM, Thompson JK. The development and validation of the physical appearance comparison scale-revised (PACS-R). Eat Behav. 2014;15(2):209–17. https://doi.org/10.1016/j.eatbeh.2014.01.001 .

O’Brien KS, Caputi P, Minto R, Peoples G, Hooper C, Kell S, Sawley E. Upward and downward physical appearance comparisons: development of scales and examination of predictive qualities. Body Image. 2009;6(3):201–6. https://doi.org/10.1016/j.bodyim.2009.03.003 .

Liao J, Jackson T, Chen H. The structure and validity of directional measures of appearance social comparison among emerging adults in China. Body Image. 2014;11(4):464–73. https://doi.org/10.1016/j.bodyim.2014.07.001 .

Thompson J, Heinberg L, Tantleff-Dunn S. The physical appearance comparison scale. Behav Therapist. 1991;14:174. https://digitalcommons.usf.edu/psy_facpub/2116 .


Vall Roqué H, Andrés A, Saldaña C. Validation of the Spanish version of the physical appearance comparison scale-revised (PACS-R): psychometric properties in a mixed-gender community sample. Behav Psychology/Psicologia Conductual. 2022;30:269–89. https://doi.org/10.51668/bp.8322114n .

Atari M, Akbari-Zardkhaneh S, Soufiabadi M, Mohammadi L. Cross-cultural adaptation of the Physical Appearance Comparison Scale-Revised in Iran. International Journal of Body, Mind and Culture. 2015;2.

Claumann GS, Laus MF, Folle A, Silva DAS, Pelegrini A. Translation and validation of the Brazilian version of the physical appearance comparison scale-revised in college women. Body Image. 2021;38:157–61. https://doi.org/10.1016/j.bodyim.2021.03.018 .

Cattarin JA, Thompson JK, Thomas C, Williams R. Body image, Mood, and televised images of attractiveness: the role of Social Comparison. J Soc Clin Psychol. 2000;19(2):220–39. https://doi.org/10.1521/jscp.2000.19.2.220 .

McCreary DR, Saucier DM. Drive for muscularity, body comparison, and social physique anxiety in men and women. Body Image. 2009;6(1):24–30. https://doi.org/10.1016/j.bodyim.2008.09.002 .

Fekih-Romdhane F, Hallit R, Malaeb D, Sakr F, Dabbous M, Sawma T, Obeid S, Hallit S. Psychometric properties of an Arabic translation of the Nine Item Avoidant/Restrictive Food Intake Disorder Screen (NIAS) in a community sample of adults. J Eat Disorders. 2023;11:143. https://doi.org/10.1186/s40337-023-00874-0 .

Alharballeh S, Dodeen H. Prevalence of body image dissatisfaction among youth in the United Arab Emirates: gender, age, and body mass index differences. Curr Psychol. 2023;42(2):1317–26. https://doi.org/10.1007/s12144-021-01551-8 .

Al-Musharaf S, Rogoza R, Mhanna M, Soufia M, Obeid S, Hallit S. Factors of body dissatisfaction among Lebanese adolescents: the indirect effect of self-esteem between mental health and body dissatisfaction. BMC Pediatr. 2022;22(1):302. https://doi.org/10.1186/s12887-022-03373-4 .

Haddad C, Zakhour M, Akel M, Honein K, Akiki M, Hallit S, Obeid S. Factors associated with body dissatisfaction among the Lebanese population. Eating and Weight Disorders - Studies on Anorexia, Bulimia and Obesity. 2019;24(3):507–19. https://doi.org/10.1007/s40519-018-00634-z .

Hallit S, Mhanna M, Soufia M, Obeid S. Factors of body dissatisfaction among Lebanese adolescents: the indirect effect of self-esteem between mental health and body dissatisfaction. BMC Pediatr. 2022. https://doi.org/10.1186/s12887-022-03373-4 .

Mousa TY, Mashal RH, Al-Domi HA, Jibril MA. Body image dissatisfaction among adolescent schoolgirls in Jordan. Body Image. 2010;7(1):46–50. https://doi.org/10.1016/j.bodyim.2009.10.002 .

Musaiger AO, Al-Mannai M. Association between exposure to media and body weight concern among female university students in five Arab countries: a preliminary cross-cultural study. J Biosoc Sci. 2014;46(2):240–7. https://doi.org/10.1017/S0021932013000278 .

Obeid S, Chok A, Sacre H, Haddad C, Tahan F, Ghanem L, Azar J, Hallit S. Are eating disorders associated with bipolar disorder type I? Results of a Lebanese case-control study. Perspect Psychiatr Care. 2021;57(1):326–34. https://doi.org/10.1111/ppc.12567 .

Schulte SJ, Thomas J. Relationship between eating pathology, body dissatisfaction and depressive symptoms among male and female adolescents in the United Arab Emirates. Eat Behav. 2013;14(2):157–60. https://doi.org/10.1016/j.eatbeh.2013.01.015 .

Chung T, Mallery P. Social comparison, individualism-collectivism, and self-esteem in China and the United States. Curr Psychol. 1999;18(4):340–52. https://doi.org/10.1007/s12144-999-1008-0 .

Hofstede G. Dimensionalizing cultures: the Hofstede Model in Context. Online Readings Psychol Cult. 2011;2(1). https://doi.org/10.9707/2307-0919.1014 .

Melki IS, Beydoun HA, Khogali M, Tamim H, Yunis KA, National Collaborative Perinatal Neonatal Network (NCPNN). Household crowding index: a correlate of socioeconomic status and inter-pregnancy spacing in an urban setting. J Epidemiol Commun Health. 2004;58(6):476–80. https://doi.org/10.1136/jech.2003.012690 .

van Widenfelt BM, Treffers PDA, de Beurs E, Siebelink BM, Koudijs E. Translation and cross-cultural adaptation of assessment instruments used in psychological research with children and families. Clin Child Fam Psychol Rev. 2005;8(2):135–47. https://doi.org/10.1007/s10567-005-4752-1 .

Ambuehl B, Inauen J. Contextualized measurement scale adaptation: a 4-Step tutorial for health psychology research. Int J Environ Res Public Health. 2022;19(19):12775. https://doi.org/10.3390/ijerph191912775 .

Brown TA, Chorpita BF, Korotitsch W, Barlow DH. Psychometric properties of the Depression anxiety stress scales (DASS) in clinical samples. Behav Res Ther. 1997;35(1):79–89. https://doi.org/10.1016/s0005-7967(96)00068-x .

Ali AM, Hori H, Kim Y, Kunugi H. The Depression anxiety stress scale 8-Items expresses Robust Psychometric properties as an Ideal Shorter Version of the Depression anxiety stress scale 21 among healthy respondents from three continents. Front Psychol. 2022;13:799769. https://doi.org/10.3389/fpsyg.2022.799769 .

Rosenberg M. Rosenberg Self-Esteem Scale (RSE). Acceptance and Commitment Therapy Measures Package; 1965.

Obeid S, Haddad C, Zakhour M, Fares K, Akel M, Salameh P, Hallit S. Correlates of self-esteem among the Lebanese population: a cross-sectional study. Psychiatria Danubina. 2019;31(4):429–39. https://doi.org/10.24869/psyd.2019.429 .

Mhanna M, Azzi R, Hallit S, Obeid S, Soufia M. Correlates of orthorexia nervosa in a sample of Lebanese adolescents: the co-moderating effect of body dissatisfaction and self-esteem between mental health issues and orthorexia nervosa. Vulnerable Child Youth Stud. 2023;18(4):610–22. https://doi.org/10.1080/17450128.2022.2163732 .

Mundfrom DJ, Shaw DG, Ke TL. Minimum sample size recommendations for conducting factor analyses. Int J Test. 2005;5(2):159–68. https://doi.org/10.1207/s15327574ijt0502_4 .

Hu L, Bentler PM. Cutoff criteria for fit indexes in covariance structure analysis: conventional criteria versus new alternatives. Struct Equation Modeling: Multidisciplinary J. 1999;6(1):1–55. https://doi.org/10.1080/10705519909540118 .

Malhotra NK, Dash S. Marketing research: an applied orientation. 6th ed. Delhi: Pearson-Dorling Kindersley; 2011.

Chen FF. Sensitivity of goodness of fit indexes to lack of Measurement Invariance. Struct Equation Modeling: Multidisciplinary J. 2007;14(3):464–504. https://doi.org/10.1080/10705510701301834 .

Vandenberg R, Lance C. A review and synthesis of the Measurement Invariance Literature: suggestions, practices, and recommendations for Organizational Research. Organizational Res Methods. 2000;3:4–69. https://doi.org/10.1177/109442810031002 .

Fekih-Romdhane F, Malaeb D, Fawaz M, Chammas N, Soufia M, Obeid S, Hallit S. Psychometric properties of an Arabic translation of the multidimensional assessment of interoceptive awareness (MAIA-2) questionnaire in a non-clinical sample of Arabic-speaking adults. BMC Psychiatry. 2023;23(1):577. https://doi.org/10.1186/s12888-023-05067-2 .

Dunn TJ, Baguley T, Brunsden V. From alpha to omega: a practical solution to the pervasive problem of internal consistency estimation. Br J Psychol (London England: 1953). 2014;105(3):399–412. https://doi.org/10.1111/bjop.12046 .

Hair J, Sarstedt M, Ringle C, Gudergan S. Advanced issues in partial least squares structural equation modeling. 2017.

Alfonso-Fuertes I, Alvarez-Mon MA, Hoyo RS, del, Ortega MA, Alvarez-Mon M, Molina-Ruiz RM. Time Spent on Instagram and Body Image, Self-esteem, and physical comparison among young adults in Spain: Observational Study. JMIR Formative Res. 2023;7(1):e42207. https://doi.org/10.2196/42207 .

Etu SF, Gray JJ. A preliminary investigation of the relationship between induced rumination and state body image dissatisfaction and anxiety. Body Image. 2010;7(1):82–5. https://doi.org/10.1016/j.bodyim.2009.09.004 .

Aggarwal R, Ranjan D, Chandola R. Effect of body image on Self Esteem: a systematic literature review and future implication. Eur Chem Bull. 2023;12:6087–95. https://doi.org/10.48047/ecb/2023.12.si4.5412023.09/05/2023 .

Nesi J, Prinstein MJ. Using Social Media for Social Comparison and Feedback-Seeking: gender and Popularity Moderate associations with depressive symptoms. J Abnorm Child Psychol. 2015;43(8):1427–38. https://doi.org/10.1007/s10802-015-0020-0 .

Wang W, Wang M, Hu Q, Wang P, Lei L, Jiang S. Upward social comparison on mobile social media and depression: the mediating role of envy and the moderating role of marital quality. J Affect Disord. 2020;270. https://doi.org/10.1016/j.jad.2020.03.173 .

Schmuck D, Karsay K, Matthes J, Stevic A. Looking up and feeling down. The influence of mobile social networking site use on upward social comparison, self-esteem, and well-being of adult smartphone users. Telematics Inform. 2019;42:101240. https://doi.org/10.1016/j.tele.2019.101240 .


Acknowledgements

The authors would like to thank all participants.

Author information

Feten Fekih-Romdhane, Sahar Obeid and Souheil Hallit are last coauthors.

Authors and Affiliations

Social and Education Sciences Department, School of Arts and Sciences, Lebanese American University, Beirut, Lebanon

Marie Anne El Khoury & Sahar Obeid

College of Pharmacy, Gulf Medical University, Ajman, United Arab Emirates

Diana Malaeb

College of Health Sciences, American University of the Middle East, Kuwait, Kuwait

Mirna Fawaz

School of Medicine and Medical Sciences, Holy Spirit University of Kaslik, P.O. Box 446, Jounieh, Lebanon

Nancy Chammas, Michel Soufia & Souheil Hallit

The Tunisian Center of Early Intervention in Psychosis, Department of Psychiatry “Ibn Omrane”, Razi hospital, Manouba, 2010, Tunisia

Feten Fekih-Romdhane

Faculty of Medicine of Tunis, Tunis El Manar University, Tunis, Tunisia

Psychology Department, College of Humanities, Effat University, Jeddah, 21478, Saudi Arabia

Souheil Hallit

Applied Science Research Center, Applied Science Private University, Amman, Jordan


Contributions

FFR, SO and SH designed the study; MK drafted the manuscript; SH carried out the analysis and interpreted the results; NC and MF collected the data; DM and MS reviewed the paper for intellectual content; all authors reviewed the final manuscript and gave their consent.

Corresponding authors

Correspondence to Feten Fekih-Romdhane or Souheil Hallit .

Ethics declarations

Ethics approval and consent to participate.

The Ethics and Research Committee at the Lebanese International University approved this study protocol (2022RC-051-LIUSOP). Written informed consent was considered obtained from each participant upon submission of the online form, and from the parents or legal guardian(s) of any participants below 16 years of age involved in the study. All methods were performed in accordance with the relevant guidelines and regulations.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary Material 1

Rights and permissions.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article

Cite this article.

El Khoury, M.A., Malaeb, D., Fawaz, M. et al. Psychometric properties of the Arabic translation of the Physical Appearance Comparison Scale-Revised (PACS-R) in adults. BMC Psychol 12, 371 (2024). https://doi.org/10.1186/s40359-024-01871-x


Received : 22 April 2024

Accepted : 25 June 2024

Published : 29 June 2024

DOI : https://doi.org/10.1186/s40359-024-01871-x


Keywords

  • Physical appearance
  • Appearance comparison
  • Psychometric properties
  • Arabic validation
