A couple of days ago, Google blogged about a technology it’s working on: MediaPipe Holistic. It caught my eye because the post featured a GIF of the technology detecting the body, face, and hand motions of a prominent American Sign Language (ASL) instructor, Dr. Bill Vicars (I highly recommend his website, lifeprint.com, to anyone interested in learning more about sign). Google claims MediaPipe Holistic can detect human poses, facial expressions, and hand motions in real time.
Does this mean we’ll have ASL versions of Google Translate and Google Assistant? Will Dr. Vicars be able to auto-grade his students’ ASL homework assignment videos? Probably not anytime soon.
This isn’t a new technology, just three existing technologies combined: pose, face, and hand tracking. First, it detects your overall body shape and creates a stick-figure pose outline. Next, it identifies where your face and hands are, then creates a skeleton of your hand joint landmarks and a more detailed grid outline of your face. So far, that’s all it does. No translation capabilities. Yet.
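To make that two-step flow concrete, here’s a toy sketch in Python. It is not Google’s code and doesn’t call MediaPipe at all. The landmark counts are the ones Google has published for Holistic (33 for the pose, 468 for the face, 21 per hand); everything else, including the `Region` class, the `regions_from_pose` function, the joint names, and the crop size, is made up for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass

# Landmark counts per component, as published by Google for MediaPipe Holistic.
LANDMARK_COUNTS = {"pose": 33, "face": 468, "left_hand": 21, "right_hand": 21}


@dataclass
class Region:
    """A square crop in normalized image coordinates (0.0 to 1.0)."""
    cx: float
    cy: float
    size: float


def regions_from_pose(pose: dict[str, tuple[float, float]],
                      size: float = 0.2) -> dict[str, Region]:
    """Use coarse pose points to decide where to look for the face and hands.

    The joint names and the fixed crop size are hypothetical.
    """
    anchors = {"face": "nose", "left_hand": "left_wrist", "right_hand": "right_wrist"}
    return {
        part: Region(*pose[joint], size)
        for part, joint in anchors.items()
        if joint in pose
    }


# Step 1 would come from a pose model; these coordinates are invented.
pose = {"nose": (0.5, 0.2), "left_wrist": (0.3, 0.6), "right_wrist": (0.7, 0.6)}

# Step 2: pick the regions where more detailed face and hand models would run.
regions = regions_from_pose(pose)

total_landmarks = sum(LANDMARK_COUNTS.values())
```

The point is only the ordering: a rough body outline comes first, and it tells the more detailed face and hand models where to look.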
Right now, the technology is just a clunky proof of concept. You can try it out on their demo page like I did. What it does show is that computers can do a fairly decent job of detecting what your face and hands are doing, even from different camera angles and perspectives. Somewhere far down the road, we might be able to map these hand and face shapes to meanings in a database.
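That database idea can be sketched as a toy nearest-match lookup. To be clear, this is my own illustration, not anything Google has announced: the `SHAPE_DB` table, its labels, and the five-number "how extended is each finger" vectors are all invented, and real signs depend on movement, location, and facial grammar that a single static hand shape can’t capture.

```python
import math

# Hypothetical "database": a label mapped to a tiny, made-up feature vector
# (thumb to pinky, 0.0 = curled, 1.0 = fully extended).
SHAPE_DB = {
    "open_hand": [1.0, 1.0, 1.0, 1.0, 1.0],
    "fist": [0.0, 0.0, 0.0, 0.0, 0.0],
    "point": [0.0, 1.0, 0.0, 0.0, 0.0],
}


def closest_shape(features, db=SHAPE_DB):
    """Return the label whose stored vector is nearest (Euclidean distance)."""
    return min(db, key=lambda label: math.dist(features, db[label]))


# A noisy observation: index finger mostly extended, the rest mostly curled.
observed = [0.1, 0.9, 0.2, 0.0, 0.1]
label = closest_shape(observed)
print(label)  # prints "point"
```

Even in this toy form, you can see why translation is a long way off: the lookup only answers "which stored shape is this closest to?", not "what does this mean in a sentence?"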
Traditionally, translation has focused solely on text, but what about emotion? ASL is a great example because it’s a language in which physical details like eyebrow placement are important grammatical components. Spoken languages could also benefit from paying closer attention to emotion: there’s a big difference between widening your eyes, waving your hands, smiling, and saying “fantastic!” versus heaving your shoulders in a sigh, rolling your eyes, and saying “fantastic.”