Teaching a machine to recognize human actions has many potential applications, from automatically detecting worker falls on a construction site to enabling a smart home robot to interpret a user’s gestures.
To do this, researchers train machine learning models using large datasets of video clips that show humans performing actions. However, not only is it expensive and laborious to collect and label millions or billions of videos, but the clips often contain sensitive information, such as people’s faces or license plate numbers. Use of these videos may also violate copyright or data protection laws. And that assumes the video data is publicly available in the first place – many datasets are owned by companies and not free to use.
Thus, researchers turn to synthetic datasets. These are created by a computer that uses 3D models of scenes, objects and humans to quickly produce many varied clips of specific actions – without the potential copyright issues or ethical concerns that come with the actual data.
But is synthetic data as “good” as real data? How well does a model trained with this data perform when asked to classify real human actions? A team of researchers from MIT, the MIT-IBM Watson AI Lab and Boston University set out to answer this question. They built a synthetic dataset of 150,000 video clips that captured a wide range of human actions, which they used to train machine learning models. Then they showed these models six real-world video datasets to see how well they could learn to recognize actions in these clips.
The researchers found that the synthetically trained models performed even better than models trained on real videos when classifying clips with fewer background objects.
This work could help researchers use synthetic datasets in such a way that models achieve higher accuracy on real-world tasks. It could also help scientists identify which machine learning applications may be best suited for training with synthetic data, in an effort to mitigate some of the ethical, privacy, and copyright concerns that come with using real datasets.
“The ultimate goal of our research is to replace pre-training on real data with pre-training on synthetic data. There is a cost to creating an action in synthetic data, but once that is done, you can generate an unlimited number of images or videos by changing the pose, the lighting, etc. That’s the beauty of synthetic data,” says Rogerio Feris, senior scientist and head of the MIT-IBM Watson AI Lab, and co-author of a paper detailing this research.
The paper was written by lead author Yo-whan “John” Kim ’22; Aude Oliva, director of strategic industry engagement at the MIT Schwarzman College of Computing, MIT director of the MIT-IBM Watson AI Lab, and a principal investigator at the Computer Science and Artificial Intelligence Laboratory (CSAIL); and seven others. The research will be presented at the Conference on Neural Information Processing Systems.
Building a synthetic dataset
The researchers began by compiling a new dataset using three publicly available datasets of synthetic video clips that captured human actions. Their dataset, called Synthetic Action Pre-training and Transfer (SynAPT), contained 150 action categories, with 1,000 video clips per category.
They selected as many action categories as possible, such as people waving or falling to the ground, based on the availability of clips with clean video data.
Once the dataset was prepared, they used it to pre-train three machine learning models to recognize actions. Pre-training involves training a model on one task to give it a head start in learning other tasks. Inspired by the way people learn – we reuse old knowledge when we learn something new – a pre-trained model can use the parameters it has already learned to help it learn a new task with a new dataset faster and more efficiently.
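The pre-train-then-transfer recipe described above can be sketched in miniature. This is only a hedged illustration with a toy linear classifier in plain NumPy – the feature shapes, learning rate, and "synthetic" versus "real" tasks are all invented for the example, and the authors' actual pipeline used video models pre-trained on SynAPT, not this code.

```python
import numpy as np

rng = np.random.default_rng(0)

def train(X, y, W=None, lr=0.1, epochs=200):
    """Logistic-regression training loop; W lets us start from pre-trained weights."""
    if W is None:
        W = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1 / (1 + np.exp(-X @ W))          # sigmoid predictions
        W -= lr * X.T @ (p - y) / len(y)      # gradient step
    return W

def accuracy(W, X, y):
    return ((X @ W > 0) == y).mean()

# "Synthetic" pre-training task: plentiful data, labels depend on feature 0.
X_syn = rng.normal(size=(500, 10))
y_syn = (X_syn[:, 0] > 0).astype(float)

# "Real" downstream task: scarce data, a related but not identical rule.
X_real = rng.normal(size=(50, 10))
y_real = (X_real[:, 0] + 0.3 * X_real[:, 1] > 0).astype(float)

W_pre = train(X_syn, y_syn)                                     # pre-train on synthetic data
W_scratch = train(X_real, y_real, epochs=20)                    # short training from scratch
W_finetune = train(X_real, y_real, W=W_pre.copy(), epochs=20)   # same budget, warm start

print("from scratch:", accuracy(W_scratch, X_real, y_real))
print("fine-tuned  :", accuracy(W_finetune, X_real, y_real))
```

Because the pre-trained weights already encode a rule close to the downstream one, the warm-started model typically needs far less real data and fewer steps to reach good accuracy – the same intuition behind pre-training on synthetic video.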
They tested the pre-trained models using six real video clip datasets, each capturing different classes of actions than the training data.
The researchers were surprised to see that the three synthetic models outperformed the models trained with real video clips on four of the six datasets. Their accuracy was highest for datasets containing video clips with “low scene-object bias”.
A low scene-object bias means that the model cannot recognize the action by looking at the background or other objects in the scene – it has to focus on the action itself. For example, if the model is tasked with classifying diving poses in video clips of people diving into a swimming pool, it cannot identify a pose by looking at the water or the tiles on the wall. It should focus on the movement and position of the person to classify the action.
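The effect of scene-object bias can be made concrete with a toy simulation. This is a hedged sketch, not the study's methodology: each "clip" is reduced to two invented scalar features, one for background appearance and one for the person's motion, so we can see why a background-shortcut model succeeds only when the background happens to correlate with the action.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# Each toy "clip" has two features:
#   motion – temporal dynamics of the person's movement (always tracks the action)
#   bg     – appearance of background/objects (e.g., water, pool tiles)
labels = rng.integers(0, 2, n)
motion = labels + 0.3 * rng.normal(size=n)

# High scene-object bias: background correlates with the action label.
bg_high = labels + 0.3 * rng.normal(size=n)
# Low scene-object bias: background carries no information about the action.
bg_low = rng.normal(size=n)

def bg_only_accuracy(bg, labels):
    """A 'shortcut' model that classifies by thresholding background appearance."""
    return ((bg > 0.5) == labels).mean()

print("bg-only model, high-bias data:", bg_only_accuracy(bg_high, labels))
print("bg-only model, low-bias data :", bg_only_accuracy(bg_low, labels))
print("motion model,  either dataset:", ((motion > 0.5) == labels).mean())
```

On the low-bias data the background shortcut collapses to chance, while a model attending to motion stays accurate – mirroring the finding that low scene-object bias forces models to learn the action itself.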
“In low scene-object bias videos, the temporal dynamics of actions is more important than the appearance of objects or background, and this seems to be captured well with synthetic data,” says Feris.
“A high scene-object bias can actually act as a hindrance. The model can misclassify an action by looking at an object, not the action itself. It can confuse the model,” says Kim.
Building on these findings, the researchers want to include more action classes and additional synthetic video platforms in future work, eventually creating a catalog of models that have been pre-trained using synthetic data, says co-author Rameswar Panda, a member of the research staff at the MIT-IBM Watson AI Lab.
“We want to build models that have very similar or even better performance than existing models in the literature, but without being bound by any of these biases or security issues,” he adds.
They also want to combine their work with research aimed at generating more accurate and realistic synthetic videos, which could improve the performance of the models, says SouYoung Jin, co-author and CSAIL postdoc. She also wants to explore how models can learn differently when trained with synthetic data.
“We use synthetic datasets to avoid privacy concerns or contextual or social biases, but what does the model actually learn? Is it learning something unbiased?” she says.
Now that they have demonstrated this potential for using synthetic videos, they hope that other researchers will build on their work.
“Although there is a lower cost to obtaining well-annotated synthetic data, we currently do not have a dataset at scale that can compete with the biggest annotated datasets of real video. By discussing the various costs and concerns with real videos, and showing the efficacy of synthetic data, we hope to motivate efforts in this direction,” adds co-author Samarth Mishra, a graduate student at Boston University (BU).
Additional co-authors include Hilde Kuehne, professor of computer science at Goethe University in Germany and affiliated professor at the MIT-IBM Watson AI Lab; Leonid Karlinsky, research staff member at the MIT-IBM Watson AI Lab; Venkatesh Saligrama, professor in the Department of Electrical and Computer Engineering at BU; and Kate Saenko, associate professor in the Department of Computer Science at BU and consulting professor at the MIT-IBM Watson AI Lab.
This research was supported by the Defense Advanced Research Projects Agency LwLL, as well as the MIT-IBM Watson AI Lab and its member companies, Nexplore and Woodside.