Note: A huge shoutout to Leyah Dizon, a researcher with Milieu, who designed and executed the experiment discussed in this article.
The meaning and interpretation of numerical ratings can be culturally specific, and can vary depending on the context and purpose for which the ratings are used.
For instance, in many countries higher exam scores indicate better performance, and higher star ratings in reviews imply a better product or service.
However, in other countries like Germany, the contrary is usually true. Lower grade point averages in the German education system indicate better student performance (grades range from 1 = excellent to 5 = insufficient). Specific to consumer research, Germany’s national consumer testing organisation, Stiftung Warentest, uses an inverse numerical rating system in which a lower number denotes greater product quality (0.5–1.5 = very good and 4.6–5.5 = unsatisfactory).
Even within cultures where higher-is-better ratings are the norm, there are contexts in which we associate a lower number with higher importance. For instance, when it comes to rankings, 1 is always associated with the best.
This raises the question: how should we design numerical rating scales? Should we always stick to what is intuitive for a particular market, or is it okay to reverse the scale? It is worth giving this some thought, since the design of numerical rating scales can have a significant impact on the results obtained.
We often receive questionnaires from clients in which the values assigned to rating scales differ from the higher-is-better rating system that respondents are accustomed to. Take the example below.
This particular client wanted to test how well their advertisements were received by the audience via an Ad Effectiveness Study. One of the questions asked in the survey was:
With clear instructions in the question text and response labels, you would think that the lower-is-better rating system wouldn’t pose an issue. However, an experiment we conducted suggests otherwise.
Experimental design
To test whether the numerical values assigned to rating scales affect how respondents answer, we ran an experiment in which respondents were randomly assigned to two groups.
The two groups were presented with exactly the same survey questions; they differed only in their response scales.
Group A (N = 700) was presented with a survey in which 1 on the response scale was attached to the highest rating (extremely popular) and 10 to the lowest rating (extremely unpopular), as seen in the image below.
Group B (N = 700) was presented with the same survey, but with 1 on the response scale attached to the lowest rating (extremely unpopular) and 10 to the highest rating (extremely popular), as can be seen below.
A point to note is that this question was part of a longer survey with 20 related questions, to provide respondents with a routine survey experience. The questions and their order were kept consistent between the two groups to prevent any order effects.
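As a minimal sketch of the setup (hypothetical label mappings and helper function, not the actual configuration of Milieu's survey platform), the two scale variants and the random assignment could be represented as:

```python
import random

# Hypothetical encoding of the two scale variants (illustration only)
GROUP_A_SCALE = {1: "Extremely popular", 10: "Extremely unpopular"}   # lower-is-more-popular
GROUP_B_SCALE = {1: "Extremely unpopular", 10: "Extremely popular"}   # higher-is-more-popular

def assign_scale_variant(respondent_ids, seed=42):
    """Randomly split respondents into the two scale-polarity groups."""
    rng = random.Random(seed)
    return {rid: rng.choice(["A", "B"]) for rid in respondent_ids}
```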
This particular question was chosen for analysis in this article because fried chicken is highly popular in both Singapore and the Philippines, so it is reasonable to expect a high percentage of respondents to select a "popular" rating.
The two sets of surveys were run in Singapore and the Philippines to assess whether the observations differ as a function of cultural setting. Each survey had a sample size of N = 700 (margin of error of approximately ±4%) and was representative of the respective country’s population by age, gender and region.
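As a point of reference, the quoted margin of error can be roughly reproduced with the standard formula for a proportion at 95% confidence. This is a minimal sketch assuming simple random sampling and the most conservative case of p = 0.5; the exact figure depends on weighting and design effects.

```python
import math

# Margin of error for a proportion at a 95% confidence level,
# assuming simple random sampling and the conservative case p = 0.5.
n = 700          # respondents per group
p = 0.5          # most conservative assumption for an unknown proportion
z = 1.96         # z-score for 95% confidence

moe = z * math.sqrt(p * (1 - p) / n)
print(f"Margin of error: +/-{moe * 100:.1f} percentage points")  # ~ +/-3.7, i.e. roughly +/-4
```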
The assignment of numerical ratings had a huge impact on the results
We observed that in both Singapore and the Philippines, using a higher-is-more-popular format (10 = extremely popular, 1 = extremely unpopular) led to vastly different results when compared to a lower-is-more-popular format with reversed rating poles (10 = extremely unpopular, 1 = extremely popular).
T4B = % of respondents who rated 10, 9, 8 or 7; Middle = % of respondents who rated 5 or 6; B4B = % of respondents who rated 1, 2, 3 or 4.
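For clarity, the three buckets can be computed from the raw 1–10 responses as in the sketch below (`responses` is a hypothetical list of individual ratings, not Milieu's actual data). Note that the buckets are defined on the numbers alone, so which bucket maps to the "popular" end depends on the scale's polarity.

```python
def summarise_ratings(responses):
    """Bucket 1-10 ratings into T4B (7-10), Middle (5-6) and B4B (1-4) shares."""
    n = len(responses)
    t4b = sum(r >= 7 for r in responses) / n
    middle = sum(5 <= r <= 6 for r in responses) / n
    b4b = sum(r <= 4 for r in responses) / n
    return {"T4B": round(t4b * 100), "Middle": round(middle * 100), "B4B": round(b4b * 100)}

# Example with made-up ratings, not actual survey data
print(summarise_ratings([10, 9, 8, 7, 7, 6, 5, 4, 9, 10]))  # {'T4B': 70, 'Middle': 20, 'B4B': 10}
```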
In the case of Singapore, we observe that when the higher-is-more-popular format (10 = extremely popular) is used, 84% of respondents assign a high popularity level to fried chicken. This figure is even greater in the Philippines, at 96%. However, it drops drastically to 43% and 54% respectively for the two countries when the polarity of the response scale is flipped (10 = extremely unpopular).
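To give a sense of the size of this gap, here is a back-of-the-envelope two-proportion z-test on the Singapore figures. It assumes the counts implied by the rounded 84% and 43% shares with N = 700 per group; the actual analysis may have used different methods and weighting.

```python
import math

# Two-proportion z-test comparing the share of "high popularity" ratings
# between the two scale formats in Singapore (illustrative counts only).
n1, n2 = 700, 700
x1 = round(0.84 * n1)   # ~588 respondents in the higher-is-more-popular group
x2 = round(0.43 * n2)   # ~301 respondents in the reversed-scale group

p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se
print(f"z = {z:.1f}")   # far beyond the ~1.96 threshold for significance at the 5% level
```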
Given that fried chicken is very popular in both Singapore and the Philippines, it could reasonably be argued that the more intuitive higher-is-more-popular format is closer to the true opinion of respondents.
The stark difference in results could have crucial implications for the insights gleaned if this question were part of a brand dipstick or consumer research survey.
Why do we observe this?
There could be two reasons, acting in tandem:
1. Rating polarity effect
A compelling hypothesis is that respondents may be drawing on cultural norms and numerical associations that they have learned throughout their lives when interpreting and responding to scales, rather than actively thinking about how to interpret the scale. This could be because these associations are so deeply ingrained in our thinking that we rely on them without even realizing it. For instance, 8 is considered a lucky number in Chinese culture. If a scale with numbers ranging from 1 to 10 is presented to individuals from that culture, they may subconsciously assign more positive attributes to the number 8, or pick 8 from a list, even if they are not consciously aware of doing so.
Furthermore, respondents may have learned certain associations between numbers and concepts through their experiences. For example, if someone has been repeatedly exposed to the idea that a score of 10 is considered "good," they may automatically interpret a score of 10 as positive without actively thinking about it.
In the context of rating scales, a 2017 study by Kyung and colleagues published in the Journal of Consumer Research sheds some light on this observation:
"The culturally determined numerical association that people learn over time becomes part of their implicit memory—the type of memory that influences judgment without conscious awareness. This numerical association in implicit memory can then interfere with people’s ability to make evaluations using a newly learned format with opposite rating polarity, resulting in judgments that are less sensitive to numeric differences in quality level."
This phenomenon is termed the “rating polarity effect”. The authors of the paper show, via numerous studies, that even when respondents are well aware that they are using a numerical rating system with opposite polarity, their judgments are still influenced by the rating polarity effect.
The different experiments in the paper also demonstrate that the effect is consistently observed across a wide range of tasks such as auction bidding, visual perception, purchase intent, and willingness to pay.
2. Primacy effect or top choice bias
Primacy bias refers to the tendency for people to place more importance or weight on the first items presented to them.
In many cultures, we see a consistent top choice bias irrespective of the length of the scale, the type of scale used or the question topic. For example, if a survey asks respondents to rate their level of agreement with a statement using a scale of 1 to 5, with 1 being "strongly disagree" and 5 being "strongly agree," the order in which these response options are presented can influence how respondents answer. If "strongly disagree" is presented first, respondents may be more likely to choose that option even if they do not strongly disagree with the statement.
This may be simply because it is the first option presented (low-effort satisficing), or due to a social desirability effect, where respondents give answers that they believe are socially acceptable or desirable rather than their true opinions or behaviours.
What does this mean for you?
The next time you design a survey that involves numerical rating scales, consider the following:
1. Use intuitive response scale formats. People may rely on implicit associations and cultural norms when interpreting scales, rather than consciously processing the information presented to them. The results of the experiment in this article highlight that using a rating scale that is intuitive for your survey audience can help improve the accuracy of results, as respondents are more likely to understand and use the scale as intended. For instance, in Singapore and the Philippines, assign higher numbers to more positive ratings and lower numbers to less positive ratings.
2. Consider alternatives to numerical rating scales. Instead of using numerical ratings that require respondents to assign meaning to the numbers, consider alternatives such as a Likert-type ordinal scale that presents a range of possible responses, each with a qualitative label that indicates the degree of polarity (see image below). This way, any cognitive dissonance between the text label and the number can be avoided.
3. Counteract top choice bias by designing better rating scales. Choosing the right response scale format can minimise response bias and yield more accurate responses. Our research found that the format in which a rating scale is presented to the respondent can have a significant impact on the results. For example, we ran one version of our test using a standard single-select rating scale (shown on the left below), as well as another format in which we show respondents the same scale using a spinner, a native format on iOS and Android devices (shown on the right below). The spinner was also configured so that the middle of the scale (i.e. 5) was displayed by default upon loading the question, and the respondent needed to interact with the spinner in order to proceed to the next question. Our findings showed that the spinner design reduced top choice bias by a significant margin in the countries that exhibited the strongest bias.
4. Clean out potentially inattentive respondents. Watch out for respondents who provide responses without careful consideration. One example is straight-lining, where respondents select the same response for every question regardless of the question content or their actual opinion; a minimal sketch of such a check is shown after this list. You could also introduce manipulation checks to weed out respondents who may be inattentive while answering questions.
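As an illustration of point 4, a simple straight-lining check might look like the sketch below. The data structures are hypothetical, and real cleaning rules would typically combine a check like this with speeding checks and manipulation checks.

```python
def flag_straightliners(respondent_answers, min_questions=5):
    """Flag respondents who gave the same rating to every question in a battery.

    respondent_answers: dict mapping respondent_id -> list of numeric ratings
    (hypothetical structure for illustration).
    """
    flagged = []
    for respondent_id, answers in respondent_answers.items():
        # Only consider respondents with enough answers to make the check meaningful
        if len(answers) >= min_questions and len(set(answers)) == 1:
            flagged.append(respondent_id)
    return flagged

# Example with made-up responses
answers = {
    "r001": [7, 7, 7, 7, 7, 7],   # straight-liner: identical rating everywhere
    "r002": [8, 6, 9, 7, 5, 8],   # varied responses
}
print(flag_straightliners(answers))  # ['r001']
```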
Feel free to check out more survey design best practices under our Learn Section.
If you have questions related to Milieu's survey platform and/or other product/service offerings, feel free to reach out to sales@mili.eu.
Milieu Insight is a recognized survey software and consumer research agency, dedicated to helping businesses excel through data insights.