## CS 160: Lecture 16


**CS 160: Lecture 16**
Professor John Canny, Fall 2004

**Outline**
• Basics of quantitative methods
• Random variables, probabilities, distributions
• Review of statistics
• Collecting data
• Analyzing the data

**Qualitative vs. Quantitative Studies**
• Qualitative: what we’ve been doing so far:
  • Contextual Inquiry: trying to understand users’ tasks and their conceptual model.
  • Usability Studies: looking for critical incidents in a user interface.
• In general, we use qualitative methods to understand what’s going on, look for problems, or get a rough idea of the usability of an interface.

**Qualitative vs. Quantitative Studies**
• Quantitative:
  • Use to reliably measure something.
  • Requires us to know how to measure it.
• Examples:
  • Time to complete a task.
  • Average number of errors on a task.
  • Users’ ratings of an interface (*): ease of use, elegance, performance, robustness, speed, …
(*) You could argue that users’ perception of speed, error rates, etc. is more important than their actual values.

**Quantitative Methods**
• Very often, we want to compare values for two different designs (which is faster?). This requires some statistical methods.
• We begin by defining some quantities to measure: variables.

**Random variables**
• Random variables take on different values according to a probability distribution.
• E.g. X ∈ {1, 2, 3} is a discrete random variable.
• To characterize the variable, we need to define the probabilities:
  • Pr[X=1] = Pr[X=2] = ¼, Pr[X=3] = ½

**Random variables**
• Given Pr[X=1] = Pr[X=2] = ¼, Pr[X=3] = ½, we can also represent the distribution with a graph:
[Figure: bar chart with bars of height ¼ at X = 1 and X = 2, and height ½ at X = 3]

**Continuous Random variables**
• Some random variables take on continuous values, e.g. Y ∈ [-1, 1].
• The probability must be defined by a probability density function (pdf).
• E.g. p(Y) = ¾ (1 − Y²)
• Note that the area under the curve is the total probability, which is 1.
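
As a quick check of the two example distributions above, the following minimal Python sketch confirms that each one carries a total probability of 1. The helper name `integrate_midpoint` and the choice of a midpoint-rule integrator are illustrative assumptions, not part of the lecture.

```python
from fractions import Fraction

# Discrete example from the slides: Pr[X=1] = Pr[X=2] = 1/4, Pr[X=3] = 1/2.
discrete_pmf = {1: Fraction(1, 4), 2: Fraction(1, 4), 3: Fraction(1, 2)}


def pdf(y: float) -> float:
    """Density from the slides: p(Y) = 3/4 * (1 - Y^2) on [-1, 1], zero elsewhere."""
    return 0.75 * (1.0 - y * y) if -1.0 <= y <= 1.0 else 0.0


def integrate_midpoint(f, a: float, b: float, steps: int = 100_000) -> float:
    """Approximate the integral of f over [a, b] with the midpoint rule.

    Hypothetical helper for illustration; any numerical integrator would do.
    """
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width


# Total probability of the discrete distribution (exact arithmetic).
discrete_total = sum(discrete_pmf.values())

# Area under the continuous density over its whole range (numerical).
continuous_total = integrate_midpoint(pdf, -1.0, 1.0)

print(discrete_total)
print(round(continuous_total, 6))
```

The discrete case uses exact fractions so the sum is exactly 1; the continuous case is only a numerical approximation of the area under the curve.
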
**Continuous Random variables**
• The area between two values gives the probability that the value of the variable lies in that range.
• i.e. $\Pr[a < Y < b] = \int_a^b p(Y)\,dY$
[Figure: the density p(Y) on [-1, 1], with the range from a to b marked]

**Meaning of the distribution**
• As the width of the range [a, b] goes to zero, the area divided by the width gives the value of p(Y): Pr[a < Y < a+dY] = p(Y) dY
[Figure: the density p(Y) on [-1, 1], with a narrow strip at a marked]

**CDF: Cumulative Distribution**
• The CDF is the area under the distribution from −∞ to some value v.
• So C(−∞) = 0 and C(∞) = 1
[Figure: the CDF plotted over [-1, 1], with the value v marked]

**Mean and Variance**
• The mean is the expected value of the variable. It’s roughly the average value of the variable over many trials.
• Mean = $E[Y] = \int Y\,p(Y)\,dY$
• In this case E[Y] = ½
[Figure: a density with its mean marked]

**Variance**
• Variance is the expected value of the squared difference from the mean. It’s roughly the squared “width” of the distribution.
• $\mathrm{Var}[Y] = \int (Y - E[Y])^2\,p(Y)\,dY$
• Standard deviation is the square root of variance.

**Mean and Variance**
• What is the mean and variance for the following distribution?
[Figure: discrete distribution over the values 2, 3 and 4, with bar heights ¼ and ½]

**Independent trials**
• For independent trials, both the mean and the variances add, i.e. for r.v.s X and Y:
  • E[X+Y] = E[X] + E[Y]
  • Var[X+Y] = Var[X] + Var[Y]

**Identical trials**
• For independent trials with the same mean and variance:
  • E[X1 + … + Xn] = n E[X]
  • Var[X1 + … + Xn] = n Var[X]
  • Std[X1 + … + Xn] = √n Std[X]

**Identical trials**
• As the number of trials increases, the ratio of std. deviation to mean decreases (it scales as 1/√n).
• i.e. the distribution narrows in a relative sense.

**Variable types**
• Independent variables: the ones you control
  • Aspects of the interface design
  • Characteristics of the testers
  • Discrete: A, B or C
  • Continuous: Time between clicks for double-click
• Dependent variables: the ones you measure
  • Time to complete tasks
  • Number of errors

**Deciding on Data to Collect**
• Two types of data
  • process data
    • observations of what users are doing & thinking
  • bottom-line data
    • summary of what happened (time, errors, success…)
    • i.e., the dependent variables

**Process Data vs.
Bottom Line Data**
• Focus on process data first
  • gives good overview of where problems are
• Bottom-line doesn’t tell you where to fix
  • just says: “too slow”, “too many errors”, etc.
• Hard to get reliable bottom-line results
  • need many users for statistical significance

**Some statistics**
• Variables X & Y
• A relation (hypothesis), e.g. X > Y
• We would often like to know if a relation is true
  • e.g. X = time taken by novice users
  • Y = time taken by users with some training
• To find out if the relation is true we do experiments to get lots of x’s and y’s (observations)
• Suppose avg(x) > avg(y), or that most of the x’s are larger than all of the y’s. What does that prove?

**Significance**
• The significance or p-value of an outcome is the probability that it happens by chance if the relation does not hold.
• E.g. p = 0.05 means that there is a 1/20 chance that the observation happens if the hypothesis is false.
• So the smaller the p-value, the greater the significance.

**Significance**
• For instance p = 0.001 means there is a 1/1000 chance that the observation would happen if the hypothesis is false. So the hypothesis is almost surely true.
• Significance increases with number of trials.
• CAVEAT: You have to make assumptions about the probability distributions to get good p-values. There is always an implied model of user performance.

**Normal distributions**
• Many variables have a Normal distribution (pdf)
• At left is the density, right is the cumulative prob.
• Normal distributions are completely characterized by their mean and variance (mean squared deviation from the mean).

**Normal distributions**
• One standard deviation from the mean is where the density of a normal distribution falls to about 60% of its peak value.
[Figure: normal density with one standard deviation marked]

**T-test**
• The T-test asks for the probability that E[X] > E[Y] is false.
• i.e. the null hypothesis for the T-test is that E[X] = E[Y].
• What is the probability of that given the observations?

**T-test**
• We actually ask for the probability that E[X] and E[Y] are at least as different as the observed means.
[Figure: the distributions of X and Y]

**Analyzing the Numbers**
• Example: prove that task 1 is faster on design A than design B.
• Suppose the average time for design B is 20% higher than A.
• Suppose subjects’ times in the study have a std. dev. which is 30% of their mean time (typical).
• How many subjects are needed?
  • Need at least 13 subjects for significance p = 0.01
  • Need at least 22 subjects for significance p = 0.001
  • (assumes subjects use both designs)

**Analyzing the Numbers (cont.)**
• i.e. even with a strong (20%) difference, need lots of subjects to prove it.
• Usability test data is quite variable
  • 4 times as many tests will only narrow range by 2x
  • breadth of range depends on sqrt of # of test users
• This is when online methods become useful
  • easy to test w/ large numbers of users (e.g., Landay’s NetRaker system)

**Lies, damn lies and statistics…**
• A common mistake (made by famous HCI researchers (*)):
  • Increasing n, the number of trials, by running each subject several times.
• No! The analysis only works when trials are independent.
• All the trials for one subject are dependent, because that subject may be faster/slower/less error-prone than others.
(*) Making this error will not help you become a famous HCI researcher.

**Statistics with care:**
• What you can do to get better significance:
  • Run each subject several times, compute the average for each subject.
  • Run the analysis as usual on subjects’ average times, with n = number of subjects.
• This decreases the per-subject variance, while keeping data independent.

**Measuring User Preference**
• How much users like or dislike the system
  • can ask them to rate on a scale of 1 to 10
  • or have them choose among statements
    • “best UI I’ve ever…”, “better than average”…
  • hard to be sure what data will mean
    • novelty of UI, feelings, not realistic setting, etc.
• If many give you low ratings -> trouble
• Can get some useful data by asking
  • what they liked, disliked, where they had trouble, best part, worst part, etc. (redundant questions)

**Using Subjects**
• Between subjects experiment
  • Two groups of test users
  • Each group uses only 1 of the systems
• Within subjects experiment
  • One group of test users
  • Each person uses both systems
[Figure: systems A and B]

**Between subjects**
• Two groups of testers, each using 1 system
• Advantages:
  • Users only have to use one system (practical).
  • No learning effects.
• Disadvantages:
  • Per-user performance differences confounded with system differences:
    • Much harder to get significant results (many more subjects needed).
    • Harder to even predict how many subjects will be needed (depends on subjects).

**Within subjects**
• One group of testers who use both systems
• Advantages:
  • Much more significance for a given number of test subjects.
• Disadvantages:
  • Users have to use both systems (two sessions).
  • Order and learning effects (can be minimized by experiment design).

**Example**
• Same experiment as before:
  • System B is 20% slower than A
  • Subjects have 30% std. dev. in their times.
• Within subjects:
  • Need 13 subjects for significance p = 0.01
• Between subjects:
  • Typically require 52 subjects for significance p = 0.01.
• But depending on the subjects, we may get lower or higher significance.

**Experimental Details**
• Order of tasks
  • choose one simple order (simple -> complex)
  • unless doing within groups experiment
• Training
  • depends on how real system will be used
• What if someone doesn’t finish
  • assign very large time & large # of errors
• Pilot study
  • helps you fix problems with the study
  • do 2, first with colleagues, then with real users

**Reporting the Results**
• Report what you did & what happened
• Images & graphs help people get it!

**Summary**
• Random variables
• Distributions
• Some statistics
• Experiment design guidelines
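
The within-subjects counts quoted in the Analyzing the Numbers and Example slides can be approximated with a short calculation. The sketch below is one plausible reconstruction, not necessarily the lecture's own derivation: it assumes a one-sided test, a normal approximation to the test statistic, and that the per-subject difference between the two designs has a standard deviation of 30% of the mean time. The function name `subjects_needed` and its parameters are hypothetical.

```python
import math
from statistics import NormalDist


def subjects_needed(effect: float, std_dev: float, p_value: float) -> int:
    """Smallest n for which the mean difference reaches the given significance.

    Assumptions (not stated in the slides): one-sided test, normal
    approximation, and `std_dev` is the spread of the per-subject difference.
    `effect` and `std_dev` are both expressed as fractions of the mean time.
    """
    # Critical z value: the observed mean difference must lie this many
    # standard errors above zero.
    z = NormalDist().inv_cdf(1.0 - p_value)
    # Standard error of the mean difference is std_dev / sqrt(n), so we need
    # effect / (std_dev / sqrt(n)) >= z, i.e. n >= (z * std_dev / effect) ** 2.
    return math.ceil((z * std_dev / effect) ** 2)


n_p01 = subjects_needed(effect=0.20, std_dev=0.30, p_value=0.01)
n_p001 = subjects_needed(effect=0.20, std_dev=0.30, p_value=0.001)

print(n_p01)   # 13
print(n_p001)  # 22
```

Under these assumptions the sketch gives 13 and 22 subjects, matching the within-subjects figures on the slides. Because n grows with the square of `std_dev / effect`, halving the detectable difference roughly quadruples the number of subjects, which is the square-root relationship the slides describe. The between-subjects figure of 52 depends on how per-user differences are modeled and is not reproduced here.
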