W2D2S2: Survey on Few-Shot Learning / Fine-Grained Classroom Activity Detection

Description

A Survey on Few-Shot Learning with Audio, Piper Wolters (WWU)

Recent advances in deep learning have resulted in state-of-the-art performance on various audio classification tasks, but unlike humans, machines traditionally require large amounts of labeled data to classify correctly. Few-shot learning refers to machine learning methods in which the model is able to generalize to new classes from very few training examples. In this research, we address speaker identification and audio segment classification with the Prototypical Network few-shot learning algorithm. We systematically compare options for the key architectural decision: the encoder, which performs feature extraction on the raw data. Our encoders include recurrent neural networks as well as one- and two-dimensional convolutional neural networks. For a 5-way speaker identification task on the VoxCeleb dataset, with only five training examples per speaker, our best model obtains 94.9% accuracy. On a 5-way audio classification task using the Kinetics 600 dataset of YouTube videos, with only five examples per class, we obtain 49.0% accuracy. We are currently extending this work to few-shot audio event detection and speaker identification, so that audio events and speakers can be detected in long audio documents with minimal supervision.
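
To make the method concrete, below is a minimal sketch of a single Prototypical Network episode (Snell et al., 2017) in PyTorch, assuming a 5-way task as in the abstract. The `encoder` argument stands in for the RNN and 1D/2D CNN encoders compared in the talk; all names and shapes here are illustrative assumptions, not the authors' actual code.

```python
# Minimal sketch of one Prototypical Network episode (5-way by default).
# Assumptions: `encoder` maps a batch of audio inputs to fixed-size embeddings;
# support_y holds class indices in [0, n_way).
import torch
import torch.nn.functional as F

def prototypical_episode(encoder, support_x, support_y, query_x, n_way=5):
    z_support = encoder(support_x)   # [n_way * k_shot, dim]
    z_query = encoder(query_x)       # [n_query, dim]
    # Prototype = mean embedding of each class's support examples.
    prototypes = torch.stack(
        [z_support[support_y == c].mean(dim=0) for c in range(n_way)]
    )                                # [n_way, dim]
    # Classify each query by squared Euclidean distance to the prototypes.
    dists = torch.cdist(z_query, prototypes) ** 2   # [n_query, n_way]
    return F.log_softmax(-dists, dim=1)             # per-class log-probabilities
```

Training proceeds episodically: each episode samples a fresh set of classes and support/query examples, and the negative log-likelihood of the query labels under these log-probabilities is minimized, which is what lets the encoder generalize to classes unseen during training.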

Fine-Grained Classroom Activity Detection with Neural Networks, Chris Daw (WWU)

Instructors frequently refine their teaching methodologies in pursuit of higher class participation and comprehension. To measure how these adjustments affect their students, an instructor first needs data quantifying how class time is used. To provide such data about classroom activity to instructors, we previously developed a deep learning-based system to segment classroom audio and detect three categories: no-voice (e.g. individual work), single-voice (e.g. lecture), and multi-voice (e.g. group work). However, the coarseness of these activity classes is a limitation; for example, that system cannot distinguish between students and instructors, nor between questions and answers. To overcome these limitations, we are now developing a new system to automatically segment and classify classroom audio into nine fine-grained categories. This system, which we call Classroom Activity Annotation with Machine Learning (CAAML), will be able to identify lecture, student questions and answers, instructor questions and answers, group work, silence, and “other.” We are exploring several deep learning architectures for CAAML, including deep neural networks, Long Short-Term Memory (LSTM) networks, and dilated temporal convolutional neural networks (a sketch of the last of these follows below). To ease adoption, we assume audio and motion features will be captured using a commodity webcam connected to the instructor station. Once implemented, we plan to make our system publicly available to instructors via an online interface.
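
As a sketch of the dilated temporal convolutional architecture mentioned above, the PyTorch module below classifies each audio frame into one of nine activity classes. The feature dimension, layer count, and channel widths are assumptions chosen for illustration, not the authors' actual CAAML configuration.

```python
# Hypothetical dilated temporal CNN for per-frame classroom activity
# classification. Assumptions: input is [batch, n_features, n_frames]
# (e.g. log-mel frame features); output gives nine-way logits per frame.
import torch.nn as nn

class DilatedTCN(nn.Module):
    def __init__(self, n_features=40, n_classes=9, channels=64, n_layers=5):
        super().__init__()
        layers = []
        in_ch = n_features
        for i in range(n_layers):
            dilation = 2 ** i  # doubling dilation grows the receptive field exponentially
            layers += [
                nn.Conv1d(in_ch, channels, kernel_size=3,
                          padding=dilation, dilation=dilation),
                nn.ReLU(),
            ]
            in_ch = channels
        self.backbone = nn.Sequential(*layers)
        # 1x1 convolution produces per-frame class logits.
        self.head = nn.Conv1d(channels, n_classes, kernel_size=1)

    def forward(self, x):
        # [batch, n_features, n_frames] -> [batch, n_classes, n_frames]
        return self.head(self.backbone(x))
```

Because the padding matches the dilation at each layer, the output keeps one logit vector per input frame; contiguous runs of the per-frame argmax class then yield the activity segments, and the exponentially growing receptive field lets the model condition on context long enough to span activities like lecture or group work.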