CNN-based multi-class multi-label classification of sound scenes in the context of wind turbine sound emission measurements

N. Poschadel, C. Gill, S. Preihs, J. Peissig

Within the scope of the interdisciplinary project WEA-Akzeptanz, measurements of the sound emission of wind turbines were carried out at the Leibniz University Hannover. Due to the environment, there are interfering components (e.g. traffic, birdsong, wind, rain) in the recorded signals. Depending on the subsequent signal processing and analysis, it may be necessary to identify sections with the raw sound of a wind turbine, recordings with the purest possible background noise, or even a specific combination of interfering noises. Due to the amount of data, a manual classification of the audio signals is usually not feasible, and an automated classification becomes necessary. In our paper, we extend our previously proposed multi-class single-label classification model to a multi-class multi-label model, which reflects the real-world acoustic conditions around wind turbines more accurately and allows for finer-grained evaluations. We first provide a short overview of the data acquisition and the dataset. We then briefly summarize our previous approach, extend it to a multi-class multi-label formulation, and analyze the trained convolutional neural network with respect to different metrics. All in all, the model delivers very reliable classification results with an overall example-based F1-score of about 79 % for a multi-label classification of 12 classes.
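To make the reported metric concrete: the example-based F1-score computes an F1-score per example over its set of labels and then averages over all examples. The following is a minimal sketch in Python, not the evaluation code used in the paper, just an illustration of the metric:

```python
import numpy as np

def example_based_f1(y_true, y_pred):
    """Example-based F1-score for multi-label classification.

    y_true, y_pred: binary arrays of shape (num_examples, num_classes).
    The F1-score is computed per example (over the 12-class label set)
    and then averaged over all examples.
    """
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    scores = []
    for t, p in zip(y_true, y_pred):
        tp = np.sum(t & p)                 # labels predicted and actually present
        denom = np.sum(t) + np.sum(p)      # |true labels| + |predicted labels|
        # F1 = 2*TP / (|true| + |pred|); define F1 = 1 if both sets are empty
        scores.append(1.0 if denom == 0 else 2.0 * tp / denom)
    return float(np.mean(scores))
```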

On this webpage, we show some example classification results of our trained model. However, it should be noted that these are single examples and the results are not representative of all recordings and classifications.

Visual representation

In order to get a qualitative impression of the obtained classification results and a quick overview of a classified scene, a methodology for a visual representation of the classification data was developed. The output of this visual mapping is a video which contains the original classified audio scene as its audio track and shows a fine-grained matrix with a color-coded value for each potential class and time instant. When playing the video, a vertical bar indicates the current position in the scene. Each timestamp has a total of 12 entries mapped to the classes along the y-axis of the image. The entries of this matrix are values between 0.0 and 1.0 and correspond to the output of the final sigmoid layer of the neural network. Thus, they reflect the probability of the presence of a class as predicted by the neural network, with values greater than 0.5 being considered a detection of the class. In the following, examples of this visual representation are shown. The first videos contain only the predictions of the neural network; in the second section, the videos also show the manual labels assigned to the respective sound scene.
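As a rough illustration of this class-versus-time matrix (not our actual video rendering pipeline), the following Python sketch plots sigmoid outputs as a color-coded image and thresholds them at 0.5. The class names and the synthetic scores are placeholders; the real set of 12 classes is defined by the dataset.

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder class names for illustration only
CLASSES = ["Aircraft", "Bird", "Car", "Cricket", "Frog", "Harvester",
           "Hiss", "Rain", "Silence", "Traffic", "Wind", "Wind Turbine"]

def plot_prediction_matrix(scores, hop_seconds=1.0, threshold=0.5):
    """Plot sigmoid outputs as a color-coded class-vs-time matrix.

    scores: array of shape (num_classes, num_frames) with values in [0, 1],
            i.e. the outputs of the final sigmoid layer.
    """
    fig, ax = plt.subplots(figsize=(10, 4))
    im = ax.imshow(scores, aspect="auto", vmin=0.0, vmax=1.0,
                   cmap="viridis", interpolation="nearest")
    ax.set_yticks(range(len(CLASSES)))
    ax.set_yticklabels(CLASSES)
    ax.set_xlabel(f"time frame ({hop_seconds:g} s per frame)")
    fig.colorbar(im, ax=ax, label="predicted probability")
    # Scores above the threshold count as a detection of the class
    detections = scores > threshold
    return fig, detections

# Example with synthetic scores for a 60-frame scene
rng = np.random.default_rng(0)
fig, detections = plot_prediction_matrix(rng.random((len(CLASSES), 60)))
plt.show()
```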

Examples without manual labels

The model is able to detect the constant presence of the wind turbine. In addition, it picks up quiet chirping as well as the wind as soon as the latter gets stronger. However, some predictions remain below the classification threshold of 0.5. Towards the end, the approaching car is classified as such, although the classification starts slightly after the sound. At the same time, a misclassification of birds occurs.


In this example, the wind turbine is not as loud as in the previous example and is accordingly detected less consistently. The clip also contains more wind, which is classified as such by the neural network. The presence of crickets in the background as well as the brief appearance of birdsong are well reflected in the classification result.


The aircraft passing over the area is classified as such throughout the entire clip. The neural network also picks up on a few birds chirping, but the predictions remain below the classification threshold.


The wind turbine is the dominant sound source and is classified correctly. The short chirping of crickets at the beginning is correctly identified, as are the brief appearances of birds.


The constant presence of the wind turbine and frogs is classified as such. When the chirping of crickets gets louder, it is correctly identified. The very short (and uncertain) detection of birds is incorrect.


Birds are chirping constantly and are correctly labeled as such. The noise in the background should be classified as Hiss but is not picked up. Towards the end, the sound of passing cars becomes more dominant and is identified by the neural network.


This sample is a quieter mix of Birds, Crickets, Wind and the Wind Turbine. At the beginning and at the end, these classes are recognized correctly. In between, there are occasional misclassifications, for example the briefly appearing predictions of Rain and Silence.


The wind turbine can be heard clearly and is classified accordingly. Around the 20-second mark, the wind picks up and is misclassified as Rain; a clear misclassification, but a somewhat plausible one.


The wind turbine is the dominant sound source and is correctly identified as such. The wind picks up at around the 25-second mark, but the neural network predicts it only with a score close to the threshold of 0.5. The few detections of crickets are mostly correct.

Examples with manual labels

This clip contains the harvester driving near the recording station (verified by the recording protocol) as well as birds chirping. Both sounds are classified as such for most occurrences, although the bird chirping is missed a few times.


The sample consists of the three classes Bird, Frog and Wind Turbine. Frog and Wind Turbine are classified correctly the entire time. The intensity of Bird varies throughout the clip, which is reflected in the classification as well. Additionally, the neural network detects Crickets in the sample, which have not been labeled and are most likely not present.


This clip has been labeled as Frogs and Wind Turbine, and the neural network identifies both classes throughout. A few misclassifications occur for the classes Bird and Cricket.


The birds in this clip are classified very well. However, the wind turbine class is divided among Aircraft, Traffic and Wind Turbine. This is an example of the confusions that occur in our model between these three classes; many of these cases are very difficult to classify correctly, even for humans.


Similar to the sample above, this clip shows that the neural network has trouble distinguishing between Aircraft and Wind Turbine and thus misclassifies the Wind Turbine as Aircraft about half of the time.


In the first 20 seconds, the rain is clearly audible to the human ear, but the network does not recognize it. After that, the sound of the rain changes slightly and the network then detects it with a high degree of certainty. The wind turbine is not detected, but it is also barely audible.


The weaker rain in this clip allows for a correct classification of Rain throughout the entire clip. Additionally, the Wind Turbine is classified correctly as well. The chirping that is predicted to come from Birds cannot be heard. The Wind is also not very strong in this clip and therefore should not have been classified as such.


The high-pitched noise is correctly identified as Hiss. The bird chirping is consistent with the labeling. The Wind Turbine might not be picked up because the Hiss is more dominant in this clip.


This clip consists of Bird, Cricket and Wind Turbine. The first two classes are identified well. The Wind Turbine, however, is confused with Aircraft a couple of times, but is otherwise correctly predicted.