Subjective evaluation measures - Revision history

Usabart at 02:46, 1 March 2011

2011-03-01T02:46:32Z

Usabart at 22:09, 21 February 2011

2011-02-21T22:09:53Z

Usabart at 22:05, 21 February 2011

2011-02-21T22:05:14Z

Usabart: /* Good questions */

2011-02-21T22:03:43Z

Good questions

Usabart: /* Multiple items, scale development */

2011-02-21T22:03:28Z

Multiple items, scale development

Usabart: Created page with "== Measuring usability and user experience == ""Subjective evaluation measures"" are expressions of the users about the system or their interaction with the system. They are ther..."

2011-02-21T22:02:10Z

Created page with "== Measuring usability and user experience == ""Subjective evaluation measures"" are expressions of the users about the system or their interaction with the system. They are ther..."

New page

== Measuring usability and user experience ==
'''Subjective evaluation measures''' are users' own statements about a system or about their interaction with it. They are therefore typically used to evaluate the usability and user experience of recommender systems. In qualitative studies, subjective measures take the form of user comments, interview answers, or questionnaire responses. Subjective evaluations can also be used quantitatively; in that case, closed-format responses (typically questionnaire items) are required for statistical analysis.

== Good questions ==
Care must be taken that the way user responses are elicited does not influence the responses themselves. Double-barreled questions ("Did the recommender provide novel and relevant items?") can cause confusion and are often imprecise (what if the user found the items novel, but not relevant?). Leading questions ("How great was our system?") and imbalanced response categories ("How do you rate our system?" - bad, good, great or awesome) can inadvertently push participants' answers in a certain direction. A typical way to avoid these issues is to ask the user to agree or disagree with a number of statements on a 5- or 7-point scale, e.g.:

"The system helped me make better choices." - completely disagree, somewhat disagree, neither agree nor disagree, somewhat agree, completely agree

"The system did not provide me any benefits." - completely disagree, somewhat disagree, neither agree nor disagree, somewhat agree, completely agree

Note that in order to avoid response format bias, it is good practice to provide both positively and negatively phrased items. Also note that the middle category is not the same as "not applicable", which should be a separate category (if provided at all).
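
As a minimal sketch of the last point, the following Python snippet codes agreement labels as numbers and treats "not applicable" as a missing value rather than as the scale midpoint. The label wording, the function names, and the 1-to-5 coding are assumptions made for illustration.

```python
# Assumed 5-point agreement labels, ordered from lowest to highest.
LABELS = [
    "completely disagree",
    "somewhat disagree",
    "neither agree nor disagree",
    "somewhat agree",
    "completely agree",
]


def code_response(label):
    """Map a response label to 1..5; "not applicable" becomes None (missing)."""
    if label == "not applicable":
        return None
    return LABELS.index(label) + 1


def mean_score(labels):
    """Average the coded responses, skipping missing values."""
    scores = [s for s in map(code_response, labels) if s is not None]
    if not scores:
        return None
    return sum(scores) / len(scores)


answers = ["completely agree", "not applicable", "neither agree nor disagree"]
average = mean_score(answers)
```

Coding "not applicable" as the middle value would pull the average toward the midpoint; excluding it keeps the average based only on participants who actually expressed an opinion.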

== Multiple items, scale development ==
Usability and user experience concepts such as "satisfaction", "usefulness", and "choice difficulty" are nuanced, and it is very hard to measure them robustly with a single question. It is therefore better practice to ask multiple questions per concept. There are two ways to combine the answers to these questions into a single scale. The simpler approach is to sum the answers (after reverse-coding the negatively phrased items). For this to be valid, a reliability analysis (Cronbach's alpha) should be performed on the answers. This procedure handles each scale separately.
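
The simple approach can be sketched in a few lines of Python. The response data, the choice of which item is negatively phrased, and the function names are hypothetical; the alpha formula is the standard one based on item variances and the variance of the summed score.

```python
from statistics import variance


def reverse_code(score, scale_min=1, scale_max=5):
    """Flip a score on a scale_min..scale_max scale (1 <-> 5, 2 <-> 4, ...)."""
    return scale_min + scale_max - score


def cronbach_alpha(rows):
    """Cronbach's alpha for rows of per-participant item scores.

    All items must already point in the same direction (reverse-coded).
    """
    k = len(rows[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")
    item_variances = [variance([row[i] for row in rows]) for i in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical answers of 5 participants to 3 items on a 5-point scale.
raw = [
    [5, 4, 1],
    [4, 4, 2],
    [2, 3, 4],
    [1, 2, 5],
    [3, 3, 3],
]
negatively_phrased = {2}  # assumed: the third item is negatively phrased

coded = [
    [reverse_code(s) if i in negatively_phrased else s for i, s in enumerate(row)]
    for row in raw
]
scale_scores = [sum(row) for row in coded]  # one summed score per participant
alpha = cronbach_alpha(coded)
```

Forgetting the reverse-coding step would make the third item run against the other two and drive alpha down, which is one reason to run the reliability analysis before trusting the summed scores.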

The more advanced approach is to construct and test all scales at the same time with a factor analysis. A factor analysis evaluates the latent structure of a set of responses by analyzing its covariance matrix. An exploratory factor analysis tries to find an "elegant" factor solution with a specified number of factors. A confirmatory factor analysis tests a predefined factor structure. Even when the factor structure is theoretically determined beforehand, it is good practice to check whether an exploratory factor analysis returns the predicted factor structure. Often, one or two items do not fit the predicted factor structure (they contribute to the wrong factor, to several factors, or to none of the factors); these items can be deleted from the analysis.
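
The sketch below is not a factor analysis. It only computes the inter-item correlation matrix (the standardized covariance matrix), which is the input a factor analysis works on, so that the intended item groupings can be eyeballed first. The data and the assignment of items to concepts are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def correlation_matrix(rows):
    """Item-by-item correlation matrix from rows of per-participant scores."""
    columns = list(zip(*rows))
    return [[pearson(a, b) for b in columns] for a in columns]


# Hypothetical data: items 0-1 are meant to measure one concept,
# items 2-3 another; 6 participants, 5-point scale.
responses = [
    [5, 5, 1, 2],
    [4, 5, 2, 1],
    [2, 1, 4, 5],
    [1, 2, 5, 4],
    [4, 4, 5, 5],
    [2, 2, 1, 1],
]
matrix = correlation_matrix(responses)
for row in matrix:
    print(" ".join(f"{value:6.2f}" for value in row))
```

Items that belong to the same factor should correlate strongly with each other and more weakly with the items of other factors. An item that correlates with everything, or with nothing, is a candidate for deletion, but that decision should rest on the factor analysis itself.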

Taking this one step further, one can check for measurement invariance. This procedure ensures that the answers of different types of participants (e.g. males and females, those using system PA and those using system PB) adhere to the same conceptual structure. For example: does "satisfaction" mean the same thing to experts and novices?

Developing a robust scale is usually a complex procedure that takes several iterations. After deleting "bad" questions, a scale should still consist of at least 5 to 7 items to be a robust measurement of the underlying concept. To ensure enough power for adequate scale development, one should have about 5 participants per item. Simultaneously developing 5 robust subjective scales of about 6 items each (30 items in total), then, takes about 150 participants. Finally, the developed scales should be correlated (triangulated) with other subjective or objective measures to ensure their external validity.
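
The participant estimate above is simple arithmetic, sketched here under the assumption that every scale has the same number of items. The function name and parameters are illustrative.

```python
def required_participants(n_scales, items_per_scale, participants_per_item=5):
    """Rule-of-thumb sample size: about 5 participants per questionnaire item."""
    return n_scales * items_per_scale * participants_per_item


# 5 scales with 5, 6 or 7 items each.
low = required_participants(5, 5)
typical = required_participants(5, 6)
high = required_participants(5, 7)
```

With 5 to 7 items per scale the estimate ranges from 125 to 175 participants; the figure of about 150 corresponds to the midpoint of 6 items per scale.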

== Structural Equation Models ==
A final step in subjective evaluations is to combine scale validation (factor analysis) and causal inference (ANOVA or linear regression) into a single analysis. These '''Structural Equation Models''' provide added statistical power, because they can use the estimated robustness of the constructed scales to provide better estimates of the regression coefficients. Experimental manipulations and [[objective evaluation measures]] can be included in the Structural Equation Model, and the fit of the entire model can be tested, as well as the individual regression coefficients.
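
One common way to write such a model down, given here as a sketch in conventional notation rather than as the notation of any particular study, is as a measurement part and a structural part. All symbols are assumptions introduced for illustration: x and y are observed questionnaire items, ξ and η are the latent concepts they measure, the Λ matrices hold factor loadings, B and Γ hold regression coefficients, and δ, ε, ζ are error terms.

```latex
% Measurement part (the factor analysis): items load on latent concepts
x = \Lambda_x \xi + \delta, \qquad y = \Lambda_y \eta + \varepsilon

% Structural part (the regression): latent concepts predict each other
\eta = B \eta + \Gamma \xi + \zeta
```

Because both parts are estimated together, the measurement error in the items is accounted for when the regression coefficients in B and Γ are estimated.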

← Older revision		Revision as of 02:46, 1 March 2011
Line 25:		Line 25:
	[[Category:Evaluation]]		[[Category:Evaluation]]
	[[Category:Evaluation measure]]		[[Category:Evaluation measure]]
		+	[[Category:User-centric evaluation]]

← Older revision		Revision as of 22:09, 21 February 2011
Line 22:		Line 22:
	== Structural Equation Models ==		== Structural Equation Models ==
	A final step in subjective evaluations is to combine scale validation (factor analysis) and causal inference (ANOVA or linear regression) into a single analysis. These '''Structural Equation Models''' provide added statistical power, because they can use the estimated robustness of the constructed scales to provide better estimates of the regression coefficients. Experimental manipulations and [[objective evaluation measures]] can be included into the Structural Equation Model, and the fit of the entire model can be tested as well as the specific regression coefficients.		A final step in subjective evaluations is to combine scale validation (factor analysis) and causal inference (ANOVA or linear regression) into a single analysis. These '''Structural Equation Models''' provide added statistical power, because they can use the estimated robustness of the constructed scales to provide better estimates of the regression coefficients. Experimental manipulations and [[objective evaluation measures]] can be included into the Structural Equation Model, and the fit of the entire model can be tested as well as the specific regression coefficients.
		+
		+	[[Category:Evaluation]]
		+	[[Category:Evaluation measure]]

@@ Line 1: / Line 1: @@
 == Measuring usability and user experience ==
-""Subjective evaluation measures"" are expressions of the users about the system or their interaction with the system. They are therefore typically used to evaluate the usability and user experience of recommender systems. In qualitative studies, subjective measures are user comments, interviews, or questionnaire responses. Subjective evaluations can also be used quantitatively. In this case, closed-format responses (typically questionnaire items) are required for statistical analysis.
+'''Subjective evaluation measures''' are expressions of the users about the system or their interaction with the system. They are therefore typically used to evaluate the usability and user experience of recommender systems. In qualitative studies, subjective measures are user comments, interviews, or questionnaire responses. Subjective evaluations can also be used quantitatively. In this case, closed-format responses (typically questionnaire items) are required for statistical analysis.
 == Good questions ==
@@ Line 21: / Line 21: @@
 == Structural Equation Models ==
-A final step in subjective evaluations is to combine scale validation (factor analysis) and causal inference (ANOVA or linear regression) into a single analysis. These ""Structural Equation Models"" provide added statistical power, because they can use the estimated robustness of the constructed scales to provide better estimates of the regression coefficients. Experimental manipulations and [[objective evaluation measures]] can be included into the Structural Equation Model, and the fit of the entire model can be tested as well as the specific regression coefficients.
+A final step in subjective evaluations is to combine scale validation (factor analysis) and causal inference (ANOVA or linear regression) into a single analysis. These '''Structural Equation Models''' provide added statistical power, because they can use the estimated robustness of the constructed scales to provide better estimates of the regression coefficients. Experimental manipulations and [[objective evaluation measures]] can be included into the Structural Equation Model, and the fit of the entire model can be tested as well as the specific regression coefficients.

← Older revision		Revision as of 22:03, 21 February 2011
Line 6:		Line 6:

	"The system helped me make better choices." - completely disagree, somewhat disagree, agree nor disagree, agree, completely agree		"The system helped me make better choices." - completely disagree, somewhat disagree, agree nor disagree, agree, completely agree
		+
	"The system did not provide me any benefits" - completely disagree, somewhat disagree, agree nor disagree, agree, completely agree		"The system did not provide me any benefits" - completely disagree, somewhat disagree, agree nor disagree, agree, completely agree

@@ Line 17: / Line 17: @@
 Taking this one step further, one can check for measurement invariance. This procedure ensures that the answers of different types of participants (e.g. males and females, those using system PA and those using system PB) adhere to the same conceptual structure. E.g.: Does "satisfaction" mean the same thing for experts and novices?
-Developing a robust scale is usually a complex procedure that takes several iterations. After deleting "bad" questions, a scale should consist of at least 5-7 items to be a robust measurement of the underlying concept. To ensure enough power for adequate scale development, one should have about 5 responses per item. Simultaneously developing 5 robust subjective scales, then, takes about 150 participants. Finally, the developed scales should be correlated (triangulated) with other subjective or objective measures to ensure their external validity.
+Developing a robust scale is usually a complex procedure that takes several iterations. After deleting "bad" questions, a scale should consist of at least 5-7 items to be a robust measurement of the underlying concept. To ensure enough power for adequate scale development, one should have about 5 responses per item. Simultaneously developing 5 robust subjective scales, then, takes about 150 participants. Finally, the developed scales should be correlated (triangulated) with other subjective or objective measures to ensure their external validity. A good subjective scale, however, usually provides far more robust results than most [[objective evaluation measures]], which are typically inherently noisy.
 == Structural Equation Models ==
 A final step in subjective evaluations is to combine scale validation (factor analysis) and causal inference (ANOVA or linear regression) into a single analysis. These ""Structural Equation Models"" provide added statistical power, because they can use the estimated robustness of the constructed scales to provide better estimates of the regression coefficients. Experimental manipulations and [[objective evaluation measures]] can be included into the Structural Equation Model, and the fit of the entire model can be tested as well as the specific regression coefficients.