Abstract:
Instrument playing technique classification is a problem in music information retrieval (MIR) that has only selectively been explored in the context of specific instrumentations or datasets. Classifying playing techniques jointly with pitch is a further challenge that takes a step closer to automatic music transcription (AMT) with playing technique annotation. Traditional deep learning methods have been applied to instrument classification, playing technique classification and multiple-instrument transcription; however, annotated data for the combined problems are scarce, so it is difficult to train a sufficiently complex deep neural network that would generalize to many different instruments, playing styles and recording conditions. This study presents a few-shot learning model for joint instrument, playing technique and pitch classification of single tones using prototypical networks. The few-shot nature of the model allows it to be trained on the data that are available and to adapt to new instruments, playing techniques or recording conditions at inference time from only a few examples. This model could form part of a tutorial system in which a music student records scales in a given playing technique under the supervision of a music teacher; these recordings would later be used to match and evaluate performances of the technique.
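For context, a minimal sketch of the standard prototypical network formulation (Snell et al., 2017), assuming an embedding network $f_\phi$, a support set $S_k$ of labelled examples for class $k$, and a distance $d$ (typically squared Euclidean); the paper's exact embedding architecture may differ:

\[
\mathbf{c}_k = \frac{1}{|S_k|} \sum_{(\mathbf{x}_i, y_i) \in S_k} f_\phi(\mathbf{x}_i),
\qquad
p_\phi(y = k \mid \mathbf{x}) = \frac{\exp\!\left(-d\!\left(f_\phi(\mathbf{x}), \mathbf{c}_k\right)\right)}{\sum_{k'} \exp\!\left(-d\!\left(f_\phi(\mathbf{x}), \mathbf{c}_{k'}\right)\right)}
\]

A query tone is assigned to the class whose prototype is nearest in the embedding space; adapting to a new instrument, technique or recording condition therefore only requires computing new prototypes from a few labelled examples, with no retraining.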
Different deep neural network (DNN) architectures and both log-mel spectrogram and constant-Q transform (CQT) input features are compared. The few-shot models are compared to standard neural network classifiers with transfer learning, showing that the few-shot models generalize better to previously unseen playing techniques. Model training is tuned with Bayesian optimization. The prototypical models outperform the standard classifiers with transfer learning in all experiments.
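As an illustration of the two input representations compared, a minimal feature-extraction sketch using librosa might look as follows; the frame, hop and bin settings here are illustrative assumptions, not the paper's configuration:

import librosa
import numpy as np

def extract_features(path, sr=22050):
    """Compute the two input representations compared in this study:
    a log-mel spectrogram and a constant-Q transform (CQT).
    All analysis parameters below are illustrative assumptions."""
    y, sr = librosa.load(path, sr=sr)

    # Log-mel spectrogram: mel-scaled power spectrogram in decibels.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                         hop_length=512, n_mels=128)
    log_mel = librosa.power_to_db(mel, ref=np.max)

    # CQT: log-frequency transform whose bins align with musical pitch,
    # which is convenient for joint pitch classification.
    cqt = np.abs(librosa.cqt(y=y, sr=sr, hop_length=512,
                             n_bins=84, bins_per_octave=12))
    log_cqt = librosa.amplitude_to_db(cqt, ref=np.max)

    return log_mel, log_cqt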
The 3-shot CQT convolutional neural network (CNN) model performs best on the joint classification task and achieves a macro F-score of 0.64 on previously unseen playing technique classes from the Studio On Line (OrchideaSOL) string instrument playing technique dataset, demonstrating the prototypical model's ability to generalize to a new dataset without much loss of performance compared to evaluation on the training classes. The model also achieves a macro F-score of up to 0.855 on individual instruments, which shows promise for its use in a tutorial setup for any of the string instruments. The models perform just as well when evaluated on extracts from YouTube tutorials and on examples of clarinet playing techniques from the Real World Computing (RWC) dataset. The few-shot model also functions as a multitask model, capable of classifying pitch, playing technique or instrument from a recorded sample. The best joint instrument, playing technique and pitch classification prototypical model accurately classifies both playing technique and pitch, as well as or better than models trained more specifically on these problems when compared on the same data. Furthermore, the scenario of instrument, playing technique and pitch classification in the presence of piano accompaniment is investigated; this results in some loss of generalization but still shows promise for main melody extraction, as pitch classification accuracy remains high.