Analysis of “Specifying Gestures by Example”

Author:

Dean Rubine
Summary:

1) The author begins by discussing the hand-coded gesture recognizers of 1991, which he describes as difficult to build.  He then claims that hand coding can be done away with by training on example gestures within his GRANDMA architecture.  Few tools of this kind existed at the time, and the author says his architecture can build small, fast, accurate recognizers that are trained on a small number of examples per gesture.

2) An example program called GDP is built on the framework. The author highlights a few use cases through screenshots and explains that GRANDMA has a two-phase operation: gesture collection and classification, then manipulation. GRANDMA also intentionally supports only single-stroke gestures in order to:

  1. avoid the problem of segmentation
  2. support the two-phase process
  3. contribute to a positive user experience

3) The user, or “gesture designer”, selects at run-time the command, or “class”, that he wishes to train and then provides around fifteen (empirically determined) examples of the gesture that issues the command.  GRANDMA is an MVC framework.  The designer then specifies three semantic components: “recog”, which defines the attributes, handler, and view when the gesture is recognized; “manip”, for the manipulation phase; and “done”.  When a gesture is made over multiple views (remember MVC), priority is given to the top-most view (for example, a gesture that goes over both an object view and the main window view).
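
The top-most-view rule can be sketched roughly as follows. This is only an illustration: the `View` class, its `z` field, and the idea of testing the gesture's starting point against rectangular bounds are my own assumptions, not GRANDMA's actual API.

```python
from dataclasses import dataclass


@dataclass
class View:
    """Hypothetical view: a named rectangle with a stacking order."""
    name: str
    z: int  # assumed convention: larger z is closer to the user
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, x, y):
        return self.x0 <= x <= self.x1 and self.y0 <= y <= self.y1


def views_by_priority(views, point):
    """Return the views under `point`, top-most first."""
    x, y = point
    hits = [v for v in views if v.contains(x, y)]
    return sorted(hits, key=lambda v: v.z, reverse=True)


window = View("window", 0, 0, 0, 100, 100)
shape = View("rectangle", 1, 10, 10, 40, 40)

# A gesture starting at (20, 20) is over both views; the shape wins.
print([v.name for v in views_by_priority([window, shape], (20, 20))])
```

A gesture starting at (80, 80) would only hit the window, so the window's gesture set would be the only candidate.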

4) Classification of an input gesture ‘g’ into one of the available classes ‘C’ is done through statistical analysis of a feature vector extracted from the input. Features were chosen empirically and include the cosine and sine of the starting angle, the total length of the gesture, the size of the bounding box, and the speed of drawing (so it’s not just a static image), among others.  Each feature needs to be computable in constant time per input point and be meaningful, and there need to be enough features to distinguish between gestures. A weight vector for each class is determined from its examples in the training process.  The linear classifier then classifies the gesture from the feature vector; an estimated probability and a distance measured in standard deviations can be used to reject doubtful gestures (though the author says to forget rejection if you just have an “undo” feature).
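
To make the pipeline concrete, here is a minimal sketch of feature extraction followed by a linear classifier. It uses only a handful of the features mentioned above (with duration standing in for the speed-related ones), and the weights are made-up values chosen by hand for illustration; in the paper they come from training, which is not shown here. The function names are mine.

```python
import math


def features(points):
    """points: list of (x, y, t) samples of one stroke (at least 3)."""
    if len(points) < 3:
        raise ValueError("need at least 3 samples")
    (x0, y0, t0), (x2, y2, _) = points[0], points[2]
    d = math.hypot(x2 - x0, y2 - y0) or 1.0  # avoid dividing by zero
    cos_a = (x2 - x0) / d                    # cosine of starting angle
    sin_a = (y2 - y0) / d                    # sine of starting angle
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    bbox_diag = math.hypot(max(xs) - min(xs), max(ys) - min(ys))
    length = sum(math.hypot(bx - ax, by - ay)
                 for (ax, ay, _), (bx, by, _) in zip(points, points[1:]))
    duration = points[-1][2] - t0
    return [cos_a, sin_a, bbox_diag, length, duration]


def classify(f, weights):
    """weights: {class_name: [w0, w1, ..., wn]}; highest linear score wins."""
    def score(w):
        return w[0] + sum(wi * fi for wi, fi in zip(w[1:], f))
    return max(weights, key=lambda c: score(weights[c]))


# Hypothetical hand-picked weights, NOT trained values.
weights = {
    "horizontal": [0, 1, 0, 0, 0, 0],  # rewards cos of starting angle
    "vertical":   [0, 0, 1, 0, 0, 0],  # rewards sin of starting angle
}

stroke = [(0, 0, 0.0), (1, 0, 0.1), (2, 0, 0.2), (3, 0, 0.3)]
print(classify(features(stroke), weights))
```

Each feature here could also be updated incrementally as points arrive, which is what makes the constant-time requirement practical.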

5) GDP has “eager recognition”: it attempts to classify the gesture at each new data point and moves into the manipulation phase once the ambiguity has been resolved, continuing for as long as the mouse button is held.  Multi-touch is also supported by combining single-stroke recognition with a decision tree over the recognized gestures.
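
A rough sketch of the eager idea: classify every prefix of the stroke and commit as soon as the result looks unambiguous. Note that the ambiguity test used here (a margin between the two best scores) is a simple stand-in I picked for illustration, not the method from the paper, and `toy_scores` is a made-up scorer.

```python
def eager_recognize(points, score_fn, margin=1.0, min_points=3):
    """Return (class, number_of_points_used), or (None, len(points))
    if no prefix was ever unambiguous."""
    for n in range(min_points, len(points) + 1):
        scores = score_fn(points[:n])
        ranked = sorted(scores, key=scores.get, reverse=True)
        # Stand-in ambiguity test: the best class must lead by `margin`.
        if len(ranked) == 1 or scores[ranked[0]] - scores[ranked[1]] >= margin:
            return ranked[0], n
    return None, len(points)


def toy_scores(prefix):
    """Hypothetical scorer: net horizontal vs. net vertical travel."""
    dx = abs(prefix[-1][0] - prefix[0][0])
    dy = abs(prefix[-1][1] - prefix[0][1])
    return {"horizontal": dx, "vertical": dy}


stroke = [(0, 0), (1, 1), (2, 1), (3, 1), (5, 1), (8, 1)]
print(eager_recognize(stroke, toy_scores, margin=2.0))
```

With this stroke the gesture is committed after four points; the remaining points would then be handled by the manipulation phase.
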
Discussion:

I had commented on Nabeel’s post about using a time-out for determining the end of a gesture (here Rubine uses 0.2 seconds).  Interesting to me to see it was actually used.

If you have an ambiguous gesture, is it really absolutely horrible to ask the user what command he wanted? Or just like a live spell checker or the “Did you mean:” on Google search results, perform the most probable action but allow the user to select alternatives from an unobtrusive list? The “pack” gesture on page 2 looks like it could have been a rectangle, thus bringing this question to mind.

The problem definitely gets easier with only one stroke.  Could we just wait 0.2 seconds for another stroke before going into the manipulation phase? The user might think the application is slow, but then you could still keep the two-phase approach. If I’m going to draw an “X” or an arrow, I’ll make the accompanying marks pretty quickly.
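
Here is a minimal sketch of what I mean by waiting for another stroke. It assumes each stroke is reduced to a hypothetical `(start_time, end_time)` pair in seconds, and it groups strokes whose pen-up to pen-down gap is within the timeout.

```python
def group_strokes(strokes, timeout=0.2):
    """strokes: list of (start_time, end_time), sorted by start_time.
    Strokes separated by a gap of at most `timeout` seconds are
    treated as parts of the same gesture."""
    groups = []
    for stroke in strokes:
        # Compare this stroke's start with the previous stroke's end.
        if groups and stroke[0] - groups[-1][-1][1] <= timeout:
            groups[-1].append(stroke)
        else:
            groups.append([stroke])
    return groups


# Two quick marks (an "X", say), then a separate gesture much later.
strokes = [(0.0, 0.3), (0.4, 0.6), (1.5, 1.8)]
print(group_strokes(strokes))
```

The cost is that every gesture, including single-stroke ones, is delayed by the timeout before manipulation can begin.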

Big fan of the MVC paradigm.

I like the concept of eager recognition, but the system should also take into account that it could be getting it wrong.  The author discussed that the best solution is just to have an undo feature.  But what if I’m attempting the ‘pack’ command three times and eager recognition thinks I want to draw a rectangle each time?  Three times I’d execute the undo command, which should be a hint to the system to try something different.
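
One way the undo hint could be used, as a sketch: remember which classes the user has just undone and skip them the next time they come up as the top guess. The class and method names are hypothetical, and it assumes the recognizer can hand back its candidate classes ranked best first.

```python
class UndoAwareChooser:
    """Skips classes the user has undone since the last accepted gesture."""

    def __init__(self):
        self.rejected = set()

    def choose(self, ranked):
        """ranked: candidate class names, most probable first."""
        for cls in ranked:
            if cls not in self.rejected:
                return cls
        # Everything was rejected; fall back to the top guess.
        return ranked[0] if ranked else None

    def undo(self, cls):
        """The user undid the action issued for `cls`."""
        self.rejected.add(cls)

    def accept(self):
        """The user kept the result, so forget earlier rejections."""
        self.rejected.clear()


chooser = UndoAwareChooser()
ranked = ["rectangle", "pack"]
first = chooser.choose(ranked)   # top guess
chooser.undo(first)              # user hits undo
second = chooser.choose(ranked)  # next-best guess is tried instead
print(first, second)
```

This reacts after a single undo; requiring two or three undos before switching would be a small change.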

Still reading on covariance matrices, but wouldn’t a plain vector space comparison between the candidate feature vector and the average feature vector for each class suffice too?
Honestly, this paper just makes me excited.
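
On the covariance question above, a small experiment suggests one reason a plain distance to the class mean might not suffice: features live on very different scales. The sketch below compares plain Euclidean distance with a distance scaled by per-feature variance. The per-feature scaling is a simplification of a full covariance matrix, and the data are invented for illustration.

```python
import math


def mean_vector(examples):
    return [sum(col) / len(examples) for col in zip(*examples)]


def variances(examples, mean):
    n = len(examples)
    return [sum((v - m) ** 2 for v in col) / (n - 1)
            for col, m in zip(zip(*examples), mean)]


def euclidean(f, mean):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f, mean)))


def scaled_distance(f, mean, var):
    # Each squared difference is divided by that feature's variance
    # (assumes no feature has zero variance).
    return math.sqrt(sum((a - b) ** 2 / v for a, b, v in zip(f, mean, var)))


def nearest(f, means, dist):
    return min(means, key=lambda c: dist(f, means[c]))


# Invented examples: (length in pixels, sine of starting angle).
class_a = [(100, 0.0), (140, 0.1), (120, -0.1)]
class_b = [(110, 0.9), (150, 1.0), (130, 1.1)]
means = {"A": mean_vector(class_a), "B": mean_vector(class_b)}
var_a = variances(class_a, means["A"])
var_b = variances(class_b, means["B"])
pooled = [(a + b) / 2 for a, b in zip(var_a, var_b)]

candidate = (131, 0.05)  # starting angle clearly matches class A
print(nearest(candidate, means, euclidean))
print(nearest(candidate, means, lambda f, m: scaled_distance(f, m, pooled)))
```

The plain distance is dominated by the length feature and picks class B, while the scaled distance picks class A because the starting angle varies so little within a class.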

Comments made elsewhere:

  1. Aksha’s Blog
  2. Andrew’s Blog
  3. Manoj’s Blog

2 Comments so far

  1. manoj on September 2nd, 2008

    Actually asking the user to select gestures from the matching hits would be a bad idea
    1. Efficient use of real estate is important in cases where the screen size is small (as in Windows Mobile, palmtops).
    2. It looks similar to menu bar / drop down list after right click. This totally defeats the purpose of gestures

  2. manoj on September 2nd, 2008

    Thank you for referring resources on MVC
