CNN-based multi-class multi-label classification of sound scenes in the context of wind turbine sound emission measurements – Institute of Communications Technology

Visual representation

To give a qualitative impression of the classification results and a quick overview of a classified scene, a method for visualizing the classification data was developed. Its output is a video that contains the original audio of the classified scene as its audio track and shows a fine-grained matrix with a color-coded value for each potential class and time instant. During playback, a vertical bar indicates the current position in the scene. Each time instant has a total of 12 entries, one per class, arranged along the y-axis of the image. The entries of this matrix are values between 0.0 and 1.0 and correspond to the output of the final sigmoid layer of the neural network. They thus reflect the probability of the presence of a class as predicted by the neural network, and values greater than 0.5 are considered a detection of that class. Examples of this visual representation are shown in the following. The videos in the first section contain only the predictions of the neural network; the videos in the second section also show the manual labels assigned to the respective sound scene.
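The thresholding step described above can be illustrated with a minimal sketch. It is not the code used to produce the videos: the function names, the three class names picked from the 12 classes, and all score values are hypothetical and chosen only for illustration. The sketch assumes the scores are stored as one row per class and one column per time instant.

```python
THRESHOLD = 0.5  # detection threshold stated in the text


def detect(scores, threshold=THRESHOLD):
    """Binarize a class-by-time score matrix: True where score > threshold."""
    return [[value > threshold for value in row] for row in scores]


def active_segments(flags):
    """Return (start, end) index pairs (end exclusive) of consecutive True runs."""
    segments = []
    start = None
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i  # a detection run begins here
        elif not flag and start is not None:
            segments.append((start, i))  # the run ended at the previous instant
            start = None
    if start is not None:
        segments.append((start, len(flags)))  # run lasts until the end of the clip
    return segments


# Hypothetical sigmoid outputs for three classes over six time instants.
class_names = ["Wind Turbine", "Bird", "Traffic"]
scores = [
    [0.9, 0.8, 0.9, 0.7, 0.9, 0.8],
    [0.1, 0.4, 0.6, 0.3, 0.2, 0.1],
    [0.0, 0.1, 0.2, 0.4, 0.7, 0.9],
]

detections = detect(scores)
segments_by_class = {
    name: active_segments(row) for name, row in zip(class_names, detections)
}
```

In this made-up example, the wind turbine is detected throughout, the bird only at one time instant, and traffic only at the last two. A score of exactly 0.5 does not count as a detection, because the text specifies values greater than 0.5.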

Examples without manual labels

The model is able to detect the constant presence of the wind turbine. In addition, it picks up quiet chirping and the wind as soon as it gets stronger. However, some predictions are below the classification threshold of 0.5. Towards the end, the approaching car is classified as such, although the classification starts a bit later than the sound. At the same time, a misclassification of birds occurs.

In this example, the wind turbine is not as loud as in the previous example and is also classified less overall. The clip also contains more wind, which is classified as such by the neural network. The presence of crickets in the background as well as the brief appearance of birdsong are well reflected by the classification result.

The aircraft passing over the area is classified as such throughout the entire clip. The neural network also picks up on a few birds chirping, but these predictions are below the classification threshold.

The wind turbine is the dominant source of sound and is classified correctly. The brief chirping of crickets at the beginning is correctly identified, as are the brief occurrences of birds.

The constant presence of wind turbine and frogs is classified as such. When the chirping from crickets gets louder it is correctly identified. The very short (and uncertain) classification of birds is not correct.

Birds are constantly chirping and correctly labeled as such. The noise in the background should be classified as Hiss but is not picked up on. Towards the end the sound of cars driving by gets more dominant and is identified by the neural network.

This sample is a quieter mix of Birds, Crickets, Wind and the Wind Turbine. At the beginning and at the end, these classes are recognized correctly. In between, there are occasional misclassifications, for example the briefly appearing predictions of Rain and Silence.

The wind turbine can be heard clearly and is classified accordingly. Around the 20-second mark the wind picks up, which is misclassified as Rain. This is clearly a misclassification, but a partially plausible one.

The wind turbine is the dominant source of sound and is correctly identified as such. The wind picks up at around the 25-second mark, but the neural network predicts it only with a score close to the threshold of 0.5. The few classifications of crickets are mostly correct.

Examples with manual labels

This clip contains the harvester driving near the recording station (verified by the recording protocol) as well as birds chirping. Both sounds are classified as such for most occurrences, though the bird chirping is left out a few times.

The sample consists of the three classes Bird, Frog and Wind Turbine. Frog and Wind Turbine are classified correctly the entire time. The intensity of Bird varies throughout the clip, which is reflected in the classification as well. Additionally, the neural network identifies Crickets in the sample, which have not been labeled and are most likely not present.

This clip has been labeled as Frogs and Wind Turbine and the neural network is able to fully identify both classes as such. A few misclassifications are present in the classes Bird and Cricket.

The birds in this clip are classified very well. However, the predictions for the wind turbine are split across the classes Aircraft, Traffic and Wind Turbine. This is an example of the confusions that occur in our model between these three classes. That said, many of these cases are very difficult to classify correctly even for humans.

Similar to the sample above, this clip shows that the neural network has trouble distinguishing between Aircraft and the Wind Turbine and thus misclassifies the Wind Turbine as Aircraft around half of the time.

In the first 20 seconds, the rain is very clearly audible to the human ear, but the network does not recognize it. After that, the sound of the rain changes slightly and the network detects it with a high degree of certainty. The wind turbine is not detected, but it can also only be heard very quietly.

The weaker rain in this clip allows for a correct classification of Rain throughout the entire clip. Additionally, the Wind Turbine is classified correctly. The chirping that is predicted to come from Birds cannot be heard. The Wind is also not very strong in this clip and thus should not have been classified as such.

The high-pitched noise is correctly identified as Hiss. The bird chirping is consistent with the labeling. The Wind Turbine might not be picked up because the Hiss is more dominant in this clip.

This clip consists of Bird, Cricket and Wind Turbine. The first two classes are identified well. Wind Turbine, however, is confused with Aircraft a couple of times but is otherwise predicted correctly.
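Observations such as "around half of the time" in the labeled examples could be quantified with a short sketch like the following. It rests on several assumptions that do not come from the original work: the manual labels are treated as a set of class names per clip, and the function names and all score values are hypothetical.

```python
def detection_rate(row, threshold=0.5):
    """Fraction of time instants at which a class score exceeds the threshold."""
    if not row:
        return 0.0
    return sum(1 for value in row if value > threshold) / len(row)


def summarize(scores_by_class, labeled_classes, threshold=0.5):
    """Split per-class detection rates into manually labeled and unlabeled classes."""
    rates = {
        name: detection_rate(row, threshold)
        for name, row in scores_by_class.items()
    }
    # Labeled classes: how often the network agrees with the manual label.
    labeled = {n: r for n, r in rates.items() if n in labeled_classes}
    # Unlabeled classes that were detected at least once: candidate confusions.
    unlabeled = {
        n: r for n, r in rates.items() if n not in labeled_classes and r > 0
    }
    return labeled, unlabeled


# Hypothetical scores for a clip that is manually labeled as Wind Turbine only.
clip_scores = {
    "Wind Turbine": [0.8, 0.7, 0.3, 0.2, 0.9, 0.4, 0.3, 0.6],
    "Aircraft": [0.2, 0.3, 0.7, 0.8, 0.1, 0.6, 0.7, 0.4],
    "Bird": [0.1, 0.0, 0.1, 0.2, 0.1, 0.0, 0.1, 0.1],
}

labeled, unlabeled = summarize(clip_scores, {"Wind Turbine"})
```

With these made-up values, Wind Turbine is detected at half of the time instants and Aircraft at the other half, which mirrors the kind of confusion described for the labeled clips.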