Goals

Movies and TV are a rich source of diverse and complex video of people, objects, actions and locales “in the wild”. This provides us with realistic data for video understanding, with many applications in computer vision and machine learning. In particular, we would like to know:

Who is on screen (person recognition)
What they are doing, and with what objects (action & object recognition, pose estimation)
Where things happen (scene recognition/understanding)
Why? (semantic understanding, content analysis)

Approach

The above are challenging tasks. Thankfully, we can learn from vast quantities of weakly labeled data: DVDs, scripts and closed captions for TV shows and movies are easily obtainable for free on the internet.
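
As a rough illustration of how closed captions can act as weak labels, the sketch below turns caption text into timestamped records and looks up the time spans in which a word is spoken. It assumes an SRT-style caption layout (a cue number, a "start --> end" line, then the text), which is only one common format and not necessarily the one used in this project. The names Caption, parse_captions and captions_mentioning are hypothetical, and the sample dialogue is made up.

```python
import re
from typing import List, NamedTuple


class Caption(NamedTuple):
    start: float  # cue start time, in seconds
    end: float    # cue end time, in seconds
    text: str     # spoken text shown during the cue


# Matches an SRT-style timing line, e.g. "00:00:01,000 --> 00:00:03,500".
_TIME = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def _seconds(h: str, m: str, s: str, ms: str) -> float:
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000.0


def parse_captions(srt_text: str) -> List[Caption]:
    """Parse SRT-style caption text (an assumed format) into Caption records."""
    captions = []
    # Cues are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        for i, ln in enumerate(lines):
            match = _TIME.search(ln)
            if match:
                g = match.groups()
                # Everything after the timing line is the caption text.
                text = " ".join(lines[i + 1:])
                captions.append(Caption(_seconds(*g[:4]), _seconds(*g[4:]), text))
                break
    return captions


def captions_mentioning(captions: List[Caption], word: str) -> List[Caption]:
    """Return cues whose text contains `word` (case-insensitive substring match)."""
    needle = word.lower()
    return [c for c in captions if needle in c.text.lower()]


# Made-up sample data, for illustration only.
SAMPLE = """1
00:00:01,000 --> 00:00:03,500
Pick up the phone, Jack.

2
00:00:04,000 --> 00:00:06,000
I'm not answering that.
"""

if __name__ == "__main__":
    for cue in captions_mentioning(parse_captions(SAMPLE), "phone"):
        print(f"{cue.start:.1f}s to {cue.end:.1f}s: {cue.text}")
```

The labels obtained this way are weak: a cue that mentions "phone" only suggests that a phone may be visible or in use somewhere near that time span, so any such record is a noisy hint rather than a ground-truth annotation.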