Pointing at the Screen with Kinect

Posted: March 30, 2011 in Kinect

So I put together a fancy gesture recognition library for the Kinect. I used some fancy math to get the system to recognize common movements, and translated those directly into specific events that an application could work with.

To get an application to interpret the events directly, I implemented a focus model for controls very similar to keyboard focus. When you press a key on the keyboard, the system sends the events to the control with focus. It’s no different with Kinect gestures. To focus on a particular control, all you had to do was point at it (I’m using past tense on purpose… I’ll get to that story later).

So, one major issue was how to take the “raw” (relatively speaking) data of the user’s hand, elbow, and shoulder positions, and convert that into a specific pixel on the screen being pointed at.

There are three basic steps to this process:

  1. Figure out whether the user is using their arm to point at something. If so, find a ray in world space that represents the pointing arm.
  2. The display screen lies on a plane in world space. We must know this plane, and must intersect it with the ray we just computed to find a specific point on the display plane.
  3. Once we have that point, a simple transform can convert it into screen space (a pixel coordinate). Of course, the trick is getting that transform in the first place.

Problem #1 is fairly simple. When a user is pointing at something, their shoulder, elbow, and hand are all basically collinear. To test this, I construct unit vectors representing the forearm and upper arm, and then use the property of dot products that states:

a \cdot b = |a| |b| \cos \theta

Where \theta is the angle between the two vectors. We want that angle to be close to zero, so the cosine of it should be close to 1. Since the upper arm and forearm vectors are unit vectors already, their lengths are both 1, so those drop out of the equation. Realistically, it’s impossible to expect your hand, elbow, and shoulder to be precisely collinear, so we threshold it. In the end, the code looks like:

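// Shoulder, Elbow and Hand are the tracked joint positions in world space.
Vector3D upperarm = Elbow - Shoulder;
Vector3D forearm = Hand - Elbow;
// Normalize so that the dot product below is exactly cos(theta).
upperarm.Normalize();
forearm.Normalize();
// A cosine above 0.95 means the arm is nearly straight.
bool ispointing = Vector3D.DotProduct(upperarm, forearm) > 0.95;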

So by checking whether the dot product is greater than 0.95 (an angle of roughly 18° or less between the two segments), the system allows you to be ‘close enough’ without any trouble. Next, we need to define the ray that represents the pointing arm. I just use the vector that goes from the user’s shoulder to the hand. That gives us our ray (which originates at R_o and goes in direction R_d), and our solution to problem #1:

R_o = Shoulder
R_d = Hand - Shoulder
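
Putting the straightness test and the ray together, a minimal sketch might look like the following. It assumes `System.Numerics.Vector3` in place of WPF's `Vector3D` so that it compiles without a WPF reference; the class name `PointingRay`, the method `TryGetRay`, and the degenerate-joint guard are illustrative additions, not part of the original library.

```csharp
using System;
using System.Numerics;

public static class PointingRay
{
    // Returns true when shoulder, elbow and hand are close to collinear.
    // origin/direction describe the ray from the shoulder through the hand.
    // minCosine = 0.95 is the threshold quoted in the post.
    public static bool TryGetRay(Vector3 shoulder, Vector3 elbow, Vector3 hand,
                                 out Vector3 origin, out Vector3 direction,
                                 float minCosine = 0.95f)
    {
        origin = shoulder;            // R_o
        direction = hand - shoulder;  // R_d (not normalized)

        Vector3 upperArm = elbow - shoulder;
        Vector3 forearm = hand - elbow;

        // Assumption: reject joints reported at the same position, since
        // normalizing a zero-length vector would produce NaN.
        if (upperArm.LengthSquared() < 1e-12f || forearm.LengthSquared() < 1e-12f)
            return false;

        // For unit vectors the dot product is cos(theta).
        float cosine = Vector3.Dot(Vector3.Normalize(upperArm),
                                   Vector3.Normalize(forearm));
        return cosine > minCosine;
    }
}
```

Leaving R_d unnormalized is fine here, because the intersection parameter in the next step scales with whatever length the direction has.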

For problem #2, we need to know the plane that the display lies on. This requires a calibration step that I think I’ll save for my next blog post, since it’s kind of its own topic. So for now I’ll just assume I know these numbers magically. To define the plane, we use the traditional plane equation:

Ax + By + Cz + D = 0

To intersect the ray with the plane, first realize that the ray breaks down to three different equations, one for each dimension in 3D space:

x = x_0 + \delta_x u
y = y_0 + \delta_y u
z = z_0 + \delta_z u

where u > 0. To find our point, we substitute these into the plane equation to eliminate the x, y, and z variables, leaving u as the only unknown. Remember, x_0, y_0, z_0, \delta_x, \delta_y, and \delta_z were all figured out in step 1 (they’re the components of R_o and R_d). Substitution yields this nasty thing:

A(x_0 + \delta_x u) + B(y_0 + \delta_y u) + C(z_0 + \delta_z u) + D = 0

With a little bit of swapping things around and simplifying, we can figure out u:

u = -\frac{Ax_0+By_0+Cz_0+D}{A\delta_x+B\delta_y+C\delta_z}
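
For anyone who wants the "swapping things around" spelled out, the intermediate steps are roughly these: expand the products, collect the terms containing u, then divide. The division is only valid when the denominator is non-zero.

```latex
Ax_0 + A\delta_x u + By_0 + B\delta_y u + Cz_0 + C\delta_z u + D = 0

u\,(A\delta_x + B\delta_y + C\delta_z) = -(Ax_0 + By_0 + Cz_0 + D)

u = -\frac{Ax_0 + By_0 + Cz_0 + D}{A\delta_x + B\delta_y + C\delta_z},
\qquad A\delta_x + B\delta_y + C\delta_z \neq 0
```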

Personally, I’m a big fan of vector notation. It makes everything look so much cleaner! Did you know that in the plane equation, the coefficients A, B, and C are actually the components of the plane’s normal? Just for fun, if I let that same normal be represented as N_s, the previous equation becomes

u = -\frac{N_s \cdot R_o + D}{N_s \cdot R_d}

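Ahhh… clean. Once you know u, you can calculate the point in space using the equations for the ray I listed earlier. Or, just P = R_o + R_d u. Hee! (If N_s \cdot R_d is zero, the arm is parallel to the screen and there is no intersection; a negative u means the user is pointing away from the screen.)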
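
The vector form of the intersection translates almost line for line into code. The sketch below again assumes `System.Numerics.Vector3` rather than WPF's `Vector3D`; `RayPlane`, `TryIntersect`, and the `1e-6` parallel tolerance are hypothetical names and values chosen for illustration.

```csharp
using System;
using System.Numerics;

public static class RayPlane
{
    // Plane: Dot(normal, p) + d = 0, where normal = (A, B, C) and d = D.
    // Ray:   p = origin + direction * u, with u >= 0.
    public static bool TryIntersect(Vector3 origin, Vector3 direction,
                                    Vector3 normal, float d, out Vector3 hit)
    {
        hit = Vector3.Zero;

        // N_s . R_d: zero means the ray runs parallel to the plane.
        float denominator = Vector3.Dot(normal, direction);
        if (Math.Abs(denominator) < 1e-6f)
            return false;

        // u = -(N_s . R_o + D) / (N_s . R_d)
        float u = -(Vector3.Dot(normal, origin) + d) / denominator;
        if (u < 0f)
            return false; // the plane is behind the ray origin

        hit = origin + direction * u; // P = R_o + R_d u
        return true;
    }
}
```

For example, with the plane z = 0 (normal (0, 0, 1), D = 0), a ray starting at (0, 0, 2) in direction (0.5, 0, -1) gives u = 2 and hits the plane at (1, 0, 0).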

So that solves problem #2! We’re almost done. Problem #3 involves converting this point of ours into a pixel location on the screen. We do this using a 4×4 transform matrix. The difficulty with problem #3 is entirely in finding this matrix, which is part of the calibration step and therefore something for my next blog post. So, assuming we have it, just use it to transform the point P we just found. I set my matrix up so that the pixel locations are in the X and Y coordinates of the output vector. The Z should be zero, but if it doesn’t wind up that way… I don’t really care. I just want the pixel position.

So that’s it! After transforming P, we have an onscreen pixel (well, you’ll have to do some bounds checking to make sure of that) that the user is pointing at.
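
As a rough illustration of the transform plus the bounds check, here is a sketch that assumes `System.Numerics.Matrix4x4` with its row-vector convention and an affine calibration matrix (no perspective divide). The names `ScreenMapper` and `TryGetPixel` are hypothetical, and how the matrix itself is obtained is left to the calibration step.

```csharp
using System;
using System.Numerics;

public static class ScreenMapper
{
    // Transforms a world-space point on the display plane into a pixel and
    // reports whether that pixel actually lies on the screen.
    // Assumption: worldToScreen puts pixel coordinates in X and Y.
    public static bool TryGetPixel(Vector3 worldPoint, Matrix4x4 worldToScreen,
                                   int screenWidth, int screenHeight,
                                   out int pixelX, out int pixelY)
    {
        // Treats worldPoint as (x, y, z, 1) and multiplies by the matrix.
        Vector3 p = Vector3.Transform(worldPoint, worldToScreen);

        // Z is ignored: only the pixel position matters.
        pixelX = (int)Math.Round(p.X);
        pixelY = (int)Math.Round(p.Y);

        return pixelX >= 0 && pixelX < screenWidth
            && pixelY >= 0 && pixelY < screenHeight;
    }
}
```

With a made-up matrix that scales by 1920 pixels per metre, flips Y, and puts the world origin at the centre of a 1920×1080 screen, the world point (0, 0, 0) maps to pixel (960, 540), while (1, 0, 0) lands at X = 2880 and is rejected by the bounds check.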

As an aside, the reason I originally referred to this method in the past tense is that it is fairly inaccurate, no matter what you do. The idea that I can pinpoint a precise pixel with my arm is a bit silly, and in practice you do indeed get a lot of jitter in the input, and the pixel that is computed winds up being offset from where you think it should be. This isn’t because of a defect in the calibration process, but because of our own perceptions (which in my experience change rather frequently, depending on subtle differences in your stance). I think in the future, when the sensors have better precision, and as long as your UI is either very simple or displayed on a VERY large screen, this technique might be useful. Lately I’ve opted to just take pointing as a rough suggestion, based simply on the direction the user is pointing in. It doesn’t require calibration, the precision is much better, and the fact that you aren’t physically pointing at anything doesn’t seem to be a bother.

In the future I may try adjusting my algorithm in step #1, so that the pointing ray actually starts from the user’s eye, instead of the shoulder, but I’m really not sure that will have a huge impact on the overall quality of the experience. I’m happy with the simpler approach, but when it works well, being able to physically point at the screen and see it respond is very impressive.
