EP4189673A1

EP4189673A1 - Computer-implemented method and computer program for machine-learning a robustness of an acoustic classifier, acoustic classification system for automatically operable driving systems, and automatically operable driving system

Info

Publication number: EP4189673A1
Application number: EP21742385.4A
Authority: EP
Inventors: Georg Schneider; Fabian Woitschek
Original assignee: ZF Friedrichshafen AG
Current assignee: ZF Friedrichshafen AG
Priority date: 2020-07-27
Filing date: 2021-07-12
Publication date: 2023-06-07
Also published as: WO2022023008A1; DE102020209446A1

Abstract

A computer-implemented method for machine-learning a robustness of an acoustic classifier (AK), wherein a driving system is controlled automatically on the basis of classifications and/or locations of the acoustic classifier (AK), the method comprising the steps of providing first input signals by way of a driving system acoustic sensor for the acoustic classifier (AK) (V1), receiving interference (S) on the basis of the first input signals for fraud identification, fraud avoidance and/or fraud protection purposes and/or for improving a recognition and/or classification performance of the acoustic classifier (AK), wherein an audibility of the interference is reduced (V2), receiving second input data from an addition of the first input data and the interference (V3), inputting combinations of the first and second input data into the acoustic classifier (AK) (V4) and machine-learning the combinations (V5), wherein the acoustic classifier (AK) learns to classify and/or locate acoustic events and in the process becomes robust to interference.

Description

Computer-implemented method and computer program for machine learning of a robustness of an acoustic classifier, acoustic classification system for automated operable driving systems and automated operable driving system

The invention relates to a computer-implemented method and a computer program for machine learning of a robustness of an acoustic classifier, an acoustic classification system for driving systems that can be operated in an automated manner, and a driving system that can be operated in an automated manner.

DE 10 2020 205 825.3 generally discloses a system for detecting, avoiding and protecting against deception of ADAS functions. The control system disclosed there is set up and intended for use in a motor vehicle in order to recognize, based on environmental data obtained from at least one environmental sensor and/or signal receiver assigned to the motor vehicle, lanes, roadway boundaries, roadway markings, other motor vehicles, traffic signs, light signal systems and/or other objects in an area in front of, to the side of and/or behind the motor vehicle. The environmental sensor and/or signal receiver is set up to provide the control system with the environmental data reflecting the area in front of, to the side of and/or behind the motor vehicle. The control system is at least set up and intended to assign the environmental data provided to at least one traffic category using a machine learning classifier, each of the at least one traffic category being one of several categories of potential driving situations, and the machine learning classifier having previously been trained on known environmental data with already assigned traffic categories. If the at least one traffic category was assigned incorrectly to the provided environmental data, a correction signal is received which indicates which at least one traffic category the provided environmental data is correctly assigned to, the correction signal preferably originating from a user input. The machine learning classifier is trained on the basis of the provided environmental data and the corrected at least one traffic category. The motor vehicle is controlled according to the corrected at least one traffic category.

DE 10 2020 205 825.3 discloses a front camera, rear camera, side camera, a radar sensor, a lidar sensor, an ultrasonic sensor and/or an inertial sensor as surroundings sensors.

In addition to optical signals, radar signals and ultrasonic signals, driving systems with AD/ADAS functions should also be able to record, analyze and evaluate acoustic signals outside the driving system. A human driver also uses the sense of hearing to a considerable extent, for example to determine the arrival and location of an emergency vehicle. The acoustic assessment of the road condition by a human driver, for example wetness recognized from a changed background noise, should likewise be taken over by an automated driving system. At the same time, noise is recorded, analyzed and evaluated in the vehicle interior. Examples are voice commands from the driver, rattling noises from the driving system or noises that indicate the condition of the driver and the occupants.

The evaluation of these acoustic signals is increasingly being taken over by algorithm modules based on artificial intelligence and, here, machine learning in particular. However, such sensor systems can be deliberately deceived and/or attacked.

The invention was based on the object, on the one hand, of making the acoustic sensors of the driving system robust against all types of attacks and, on the other hand, of improving the general ability to generalize, i.e. the recognition performance and classification performance, of the acoustic sensor system.

The subject matter of claims 1, 5, 6 and 10 achieves this object by robustness training for acoustic sensor detection systems, RASES for short. In one aspect, the invention provides a computer-implemented method for machine learning a robustness of an acoustic classifier. A driving system is automatically controlled depending on classifications and/or localizations of the acoustic classifier. The method includes the steps:

• Provision of first input signals by means of a driving system acoustic sensor for the acoustic classifier,

• obtaining interference as a function of the first input signals for deception detection, avoidance and/or protection and/or for improving a detection and/or classification performance of the acoustic classifier, the audibility of the interference being reduced,

• obtaining second input data from an addition of the first input data and the disturbances,

• inputting combinations of the first and the second input data into the acoustic classifier and

• machine learning of the combinations, whereby the acoustic classifier learns to classify and/or localize acoustic events and thereby becomes robust against the disturbances.

In another aspect, the invention provides a computer program for machine learning a robustness of an acoustic classifier. The program includes program instructions that cause a computer to execute a method according to the invention when the program is run on the computer. The program instructions are written, for example, in an object-oriented programming language, such as C++.

According to a further aspect, the invention provides an acoustic classification system for driving systems that can be operated automatically, for classifying and/or localizing acoustic events in the exterior and/or interior of the driving system. The acoustic classification system includes an acoustic sensor and an acoustic classifier, wherein the acoustic classifier has learned, according to a method according to the invention, to classify and/or localize acoustic events in a robust manner against disturbances. According to a further aspect, the invention provides a driving system that can be operated automatically, comprising an acoustic classification system according to the invention, a control unit for automated driving and actuators for longitudinal and/or lateral guidance of the driving system. Depending on classifications and/or localizations of acoustic events of the acoustic classification system, the control device determines regulation and/or control signals and provides these to the actuators. Disturbances are added to the first input data of the acoustic classifier in the form of signals from a loudspeaker arranged outside the driving system, a carrier signal from a loudspeaker arranged inside the driving system and/or from driving system parts that produce noise.

Sound-producing driving system parts include, for example, an infected water pump that produces sounds to perform a targeted attack.

Advantageous refinements of the invention result from the dependent claims, the drawing and the description of preferred exemplary embodiments.

Machine learning is a technology that teaches computers and other data processing devices to perform tasks by learning from data, rather than being explicitly programmed to do the tasks. As long as an artificial intelligence is trained on data, RASES can be used to increase its robustness against any interfering signals, including noise or attacks. Attacks include deception. Increasing robustness against noise includes making an acoustic classifier robust against overfitting by means of RASES. RASES thus provides an improved, generalized acoustic recognition system that more reliably and correctly recognizes acoustic signals that were not part of the training, in particular noise signals.

An acoustic classifier is an artificial intelligence comprising software and/or hardware components that can be trained and/or has been trained to recognize, classify and/or localize sounds and/or speech. Acoustic signals are classified, for example, into the categories of rescue vehicle, falling branch, children playing, deer crossing, grinding noises. According to one aspect of the invention, the acoustic classifier evaluates a continuous data stream from the driving system acoustic sensor. According to one aspect of the invention, the acoustic classifier classifies overlapping time signals, for example the last 1 s every 0.2 ms.
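The classification of overlapping time signals from a continuous data stream can be sketched as a sliding window. This is a minimal illustration only; the function name `overlapping_windows` and the toy window and hop lengths are assumptions and do not come from the application.

```python
from typing import Iterator, List, Sequence


def overlapping_windows(
    samples: Sequence[float], window_len: int, hop_len: int
) -> Iterator[List[float]]:
    """Yield the most recent `window_len` samples every `hop_len` samples."""
    if window_len <= 0 or hop_len <= 0:
        raise ValueError("window_len and hop_len must be positive")
    # The first window is available once `window_len` samples have arrived.
    for end in range(window_len, len(samples) + 1, hop_len):
        yield list(samples[end - window_len:end])


# Toy stream: 10 samples, window of 4 samples, a new window every 2 samples.
stream = [float(i) for i in range(10)]
windows = list(overlapping_windows(stream, window_len=4, hop_len=2))
```

Each yielded window would be passed to the acoustic classifier. At a hypothetical sampling rate of 16 kHz, a window of the last 1 s would correspond to `window_len=16000`.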

The acoustic classifier represents a sense of hearing for a driving system. For example, the acoustic classifier determines the arrival and/or the position of an emergency vehicle depending on a siren signal. This determination is made available as a signal to a control unit of the driving system, for example an ADAS/AD domain ECU, i.e. an electronic control unit for assisted or automated/autonomous driving. Depending on the classification and/or localization of the acoustic classifier, the control unit determines control and/or regulation signals for actuators for longitudinal and/or lateral guidance of the driving system in order to automatically control the driving system.

The software components of the acoustic classifier are available, for example, as program commands in the programming language Python, for example using the TensorFlow framework. The analysis of the first input data is carried out, for example, with the Python program package LibROSA, which includes routines for music and audio analysis. The hardware components include GPUs and/or tensor processing units with a microarchitecture for parallelized processing of tasks and execution of matrix multiplications. This makes the training and use of a trained artificial intelligence more efficient.

The driving system includes cars, commercial vehicles, trucks, buses, people movers, robots such as industrial robots, drones, rail vehicles, ships and airplanes. The driving system includes technical equipment for operating the driving system in accordance with SAE J3016 levels 1 to 5. According to one aspect of the invention, the driving system is a road vehicle with an automation level SAE J3016 levels 2+ to 5. The first input data includes acoustic signals from the driving system acoustic sensor. Compared to other acoustic sensors, the driving system acoustic sensor is particularly suitable for automotive use. For example, the driving system acoustic sensor, when used outside of the driving system, includes a protective grille to protect against the ingress of foreign bodies, an acoustically permeable, hydrophobic and/or lipophobic membrane to protect against splash water and grease, and a flow bypass to guide fluids or foreign bodies out of the sensor. According to the invention, the driving system acoustic sensor is also used in the interior of the driving system.

The disturbances for deception detection, avoidance and/or protection correspond to signals that a disturber, i.e. an attacker, calculates and plays back in order to deceive the acoustic classifier. Sound and speech recognition are vulnerable to such attacks.

The basic idea for deceiving an acoustic classifier is that a loudspeaker is used through which interference signals are played back. The classifier is supposed to be deceived by these interference signals, so that the original/actual event is not recognized or another desired event is recognized, although in reality no such acoustic event has occurred. Existing loudspeakers in the vehicle can be used for this purpose, for example infotainment or mobile phones, or loudspeakers can be set up in a targeted manner at the desired location, for example a residential area, the edge of a forest or a bus stop.

These interference signals are either integrated into carrier signals or exist as a separate signal. An example of integration into carrier signals is introducing the interference into music. The modified music signal is then uploaded to a popular platform, such as YouTube or Spotify, and played back via the driving system's infotainment system. As a result, a large number of attacks are carried out in which the acoustic classifiers are fooled into recognizing an event when in reality there is no event. This can cause significant damage to a large number of users/customers. Another case is that an inconspicuous interference signal, which humans can recognize only as a faint noise, is played, whereby events are feigned to the acoustic classifier or the detection of events that are actually happening is prevented.

Such an attack is dangerous because the human ear cannot detect the interference signals. As a result, the occupants would not notice the ongoing attack, or would notice it only after the driving system had already initiated reactive measures, such as braking for a supposedly falling branch or playing children. The human ear cannot detect the attack because either the volume of the interference signal is too low or the interference is only applied to certain frequencies which are masked for the human ear by neighboring louder frequencies.

The invention includes untargeted and targeted attacks. In an untargeted attack, the attacker's goal is to introduce a perturbation to get the acoustic classifier to predict a class other than the correct one. It does not matter which class is predicted instead of the correct class, in contrast to a targeted attack, where the attacker wants to ensure that a specific target class is predicted instead of the correct class.

RASES prevents this attack in that the acoustic classifier learns, through the method according to the invention, to be robust against targeted or naturally occurring disturbances and to carry out the classification of the actual, real acoustic event correctly. By reducing the audibility of the interference, an inconspicuous interference signal that is only recognizable to humans as a faint noise is simulated.

The attacker has to calculate the interference depending on the other acoustic signals in the target environment, for example residential area, edge of the forest, interior or busy street. For this purpose, exemplary signals can be assumed which reflect the real situation as well as possible, and the generation of the interference signal can be carried out for several of these signals, for example 1000 to 100,000 exemplary signals. This allows the attacker to ensure that the calculated interference actually deceives the acoustic classifier, regardless of any other acoustic signals.

If the attacker has knowledge about the system to be attacked, gradient-based methods can be used to optimize the interference signal depending on the classification of the system. One such method is, for example, the projected gradient descent method, abbreviated PGDM, in which a step in the positive direction of the gradient of a cost function of the acoustic classifier, also called loss function, with respect to the input data is carried out repeatedly. Corresponding attack methods are disclosed in Section 2.2 of https://arxiv.org/pdf/1611.01236.pdf.
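The repeated gradient step of such a method can be sketched as follows. This is a minimal illustration under stated assumptions: a toy logistic classifier with hypothetical weights `w` and bias `b` stands in for the acoustic classifier, and the projection is a simple clipping of the disturbance to a maximum amplitude `eps`. None of these names or values come from the application.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def loss_and_grad(x, y, w, b):
    """Binary cross-entropy of a toy logistic classifier and its gradient w.r.t. the input x."""
    p = sigmoid(w @ x + b)
    loss = -(y * np.log(p + 1e-12) + (1.0 - y) * np.log(1.0 - p + 1e-12))
    grad_x = (p - y) * w
    return loss, grad_x


def pgd_attack(x, y, w, b, eps=0.1, step=0.02, n_steps=20):
    """Untargeted attack: repeatedly increase the loss, then project back onto the bound."""
    d = np.zeros_like(x)
    for _ in range(n_steps):
        _, g = loss_and_grad(x + d, y, w, b)
        d = d + step * np.sign(g)    # step in the positive direction of the gradient
        d = np.clip(d, -eps, eps)    # projection: keep every entry of d within [-eps, eps]
    return d


# Hypothetical toy classifier and input (assumptions for illustration only).
w = np.array([1.5, -2.0, 0.5, 1.0])
b = 0.0
x = np.array([0.2, -0.1, 0.3, 0.1])
y = 1.0  # correct class of x

d = pgd_attack(x, y, w, b)
clean_loss, _ = loss_and_grad(x, y, w, b)
adv_loss, _ = loss_and_grad(x + d, y, w, b)
```

In this toy setting the loss on `x + d` is larger than the loss on `x`, while the disturbance stays within the chosen bound.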

If the attacker has no information about the acoustic classifier used, it is initially not possible to use gradient-based methods because the necessary gradients cannot be calculated. In order to still be able to use these methods, the attacker can try to obtain information about the acoustic classifier used. On the one hand, there is the possibility of breaking the encryption of the locally stored parameters, for example checkpoints of an artificial neural network, including weights and structure, and thereby gaining the required information. Alternatively, an attacker can train a system that is as identical as possible, preferably on similar training data. This substitute system can then be used to calculate an interference signal. Since it has been found that such interference signals can largely be transferred between artificial intelligences, this interference signal can also be used to deceive the acoustic classifier that is actually being attacked. Techniques also exist to ensure that a transferable interference signal is found. For example, several substitute models can be trained on different data and incorporated into the loss function used, in order to calculate a uniform interference signal for all models.

Another possibility is model stealing attacks, which have the purpose of obtaining information about an artificial intelligence. In order to carry these out, an attacker only needs to be able to change the input data of the acoustic classifier, for example by playing a test signal, and then to observe the output values of the acoustic classifier. By cleverly combining different input values and testing how the acoustic classifier reacts to them, also called queries, such attacks can collect information about how the acoustic classifier works and how it can be deceived. For example, model stealing is disclosed in https://arxiv.org/pdf/1802.05351.pdf.

Types of attack are also known that are pure black-box attacks without gradient information and that also do not replicate/retrain the system locally. Instead, clever decisions are made based on the current value of the loss function as to how the current disturbance must be changed in order to fool the artificial intelligence, see https://arxiv.org/pdf/1712.04248.pdf. The method according to the invention also achieves and/or increases robustness against this type of attack.

RASES prevents any of these attacks and ensures the correct functionality of the acoustic classifier even though such interference signals are present and an attack is attempted. According to the invention, this is achieved in that, during the training, disturbances are obtained as a function of the first input signals for deception detection, avoidance and/or protection and/or for improving a recognition and/or classification performance of the acoustic classifier, and these disturbances are also trained, wherein an audibility of the disturbances is reduced iteratively or successively.

In order to be able to implement the selected attack method, certain hyperparameters are determined as well as possible, for example the initial maximum strength of the interference signal or the target sequence. Depending on these parameters, there are various changes in the robustness and accuracy of the resulting acoustic classifier after training is complete.

Deception and/or attacks with the aim of attacking the outward-facing acoustic sensors have the following effects, for example:

• non-recognition and/or incorrect localization of noise sources to be recognized,

• simulation of noise sources to be recognized that do not exist in reality,

• causing the sensors to confuse the noise sources to be recognized.

These effects counteract safe operation of the driving system. RASES makes the acoustic classifier robust against these deceptions by expanding the training of the acoustic classifier with these disturbances.

Sources of noise related to the exterior include:

• other road users, e.g.
  o other vehicles,
  o people,
  o children playing,
  o ambulances on duty,
  o animals/deer crossing,

• situations,
  o accident in the surroundings,
  o falling trees, falling branches,

• emergency calls/warning calls by people,
  o "Help",

• weather noise,
  o wet road,
  o snow on road,
  o hail,
  o strong wind,
  o forest fire,

• damage noises on one's own or other vehicles,
  o rattling on the car,
  o squeaking,
  o grinding,

• control commands to the vehicle,
  o opening of trunk, doors,
  o identification of the driver.

Deception and/or attacks with the attack target of the inward-facing acoustic sensors have the following effect, for example:

• non-detection and/or incorrect localization of noise sources to be detected,

• simulation of noise sources to be recognized that do not exist in reality,

• causing the sensors to confuse the noise sources to be recognized.

These effects also counteract safe operation of the driving system. RASES makes the acoustic classifier robust against these deceptions by expanding the training of the acoustic classifier with these disturbances. RASES thus makes an acoustic classifier for the exterior and interior robust.

Interior noise sources include:

• occupants with the following attributes,
  o state of mind, stress, health, alcohol, drugs,
  o position, orientation,
  o identification,

• situations,
  o interaction of occupants, e.g. argument, party,

• damage noises on one's own vehicle,
  o rattling on the car,
  o squeaking,
  o grinding,
  o fire noises,

• interaction with vehicle functions,
  o control commands to the vehicle,
    ■ switching systems on and off,
    ■ route selection,
    ■ music choice,
    ■ call dialing,
    ■ inquiries,
  o request for help,
  o warning call,
  o satisfaction/dissatisfaction with a function,

• influencing acoustically connected systems such as cell phones in the interior or those that are being called.

The attacks also include a target other than the ego driving system, for example a system that is connected to the driving system in some way, for example cloud storage, similar to the introduction of computer viruses, trojans, worms.

Augmenting the training of the acoustic classifier with these perturbations further increases the fundamental ability of the acoustic classifier to generalize, since robustness against "accidental manipulation" and deliberate attacks corresponds in some ways exactly to the ability to generalize. This increases the recognition and/or classification performance.

By inputting combinations of the first and second input data, the latter resulting from an addition of the first input data and the disturbances, into the acoustic classifier and machine learning the combinations, the acoustic classifier learns to defend itself against the attacks described above. The resulting combinations represent extended or augmented training data for the acoustic classifier.

The machine learning of these combinations is a so-called adversarial training, i.e. the augmentation of the first input data with interference signals which an attacker would use to fool the acoustic classifier. The disturbances are recalculated during the training for each input signal and are always adapted to the current parameters of the acoustic classifier. The interference signals are added to the original data, but the ground truth class is not changed. As a result, the acoustic classifier learns to be robust against the disturbances shown and still to classify this data correctly. Adversarial training is disclosed in https://arxiv.org/pdf/1706.06083.pdf.
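An adversarial training loop of this kind can be sketched as follows. This is a minimal illustration under stated assumptions: synthetic feature vectors stand in for acoustic data, a logistic classifier stands in for the acoustic classifier, and the disturbance is a single signed gradient step of strength `eps`. All names and values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for acoustic features: two classes centered at -1 and +1.
n, dim = 200, 8
y = rng.integers(0, 2, size=n).astype(float)
X = rng.normal(0.0, 1.0, size=(n, dim)) + (2.0 * y[:, None] - 1.0)

w = np.zeros(dim)
b = 0.0
eps, lr = 0.3, 0.1  # disturbance strength and learning rate (assumed values)


def predict_proba(X, w, b):
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))


def perturb(X, y, w, b, eps):
    """Disturbance for the current parameters: one signed step that increases the loss."""
    p = predict_proba(X, w, b)
    grad_x = (p - y)[:, None] * w[None, :]  # gradient of the loss w.r.t. the input
    return X + eps * np.sign(grad_x)


for epoch in range(50):
    # The disturbances are recalculated for the current parameters (w, b).
    X_adv = perturb(X, y, w, b, eps)
    # Combination of original and disturbed data; the ground truth y is unchanged.
    X_mix = np.vstack([X, X_adv])
    y_mix = np.concatenate([y, y])
    p_mix = predict_proba(X_mix, w, b)
    w -= lr * (X_mix.T @ (p_mix - y_mix)) / len(y_mix)
    b -= lr * np.mean(p_mix - y_mix)

# Evaluate on clean data and on data disturbed against the final parameters.
clean_acc = np.mean((predict_proba(X, w, b) > 0.5) == (y > 0.5))
X_test_adv = perturb(X, y, w, b, eps)
adv_acc = np.mean((predict_proba(X_test_adv, w, b) > 0.5) == (y > 0.5))
```

The essential points are that the disturbance is recomputed in every pass from the current parameters and that the labels of the disturbed samples remain those of the original samples.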

How the interference signals are actually calculated is irrelevant; one attack method, several attack methods or very different combinations of attack methods can be used. In addition, perturbed data does not have to be used exclusively. Any combination of original and perturbed signals can be used in each batch during training. This may be necessary to optimize the tradeoff between general accuracy and robustness to disturbances. Typically, a significant improvement in one leads to a deterioration in the other.

Batches are groups of input data of equal size. The training can be carried out per batch. When all batches have gone through the artificial intelligence once, an epoch is complete. An epoch denotes a complete run through of all input data. The number of training epochs and batches is a parameter for training the artificial intelligence. For example, each batch consists of 50% original and 50% corrupted data. However, other distributions are also conceivable, e.g.: 20% original, 40% attack method 1, for example gradient-based, 40% attack method 2, for example model stealing.
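The composition of a batch from original and disturbed data can be sketched as follows, using the example distribution of 20% original, 40% attack method 1 and 40% attack method 2. The function `compose_batch` and the two placeholder attack functions are hypothetical; the placeholders merely shift the signal and do not implement real attack methods.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

Signal = List[float]
Sample = Tuple[Signal, int]  # (signal, ground-truth class)


def compose_batch(
    batch: List[Sample],
    shares: Dict[str, float],
    attacks: Dict[str, Callable[[Signal], Signal]],
) -> List[Tuple[Signal, int, str]]:
    """Split one batch into groups and disturb each group with its attack method."""
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1")
    names = list(shares)
    out = []
    start = 0
    for i, name in enumerate(names):
        # The last group takes the remainder so that every sample is used exactly once.
        is_last = i == len(names) - 1
        end = len(batch) if is_last else start + round(shares[name] * len(batch))
        disturb = attacks.get(name)  # no entry means: keep the original signal
        for signal, label in batch[start:end]:
            new_signal = disturb(signal) if disturb else list(signal)
            out.append((new_signal, label, name))  # the label is never changed
        start = end
    return out


# Hypothetical placeholders for real attack methods (illustration only).
attacks = {
    "gradient_based": lambda s: [v + 0.01 for v in s],
    "model_stealing": lambda s: [v - 0.01 for v in s],
}
shares = {"original": 0.2, "gradient_based": 0.4, "model_stealing": 0.4}
batch = [([0.0, 0.1 * i], i % 3) for i in range(10)]
mixed = compose_batch(batch, shares, attacks)
counts = Counter(name for _, _, name in mixed)
```

For a batch of 10 samples this yields 2 original samples and 4 samples for each of the two attack methods.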

According to one aspect of the invention, the adversarial training is used conceptually with further augmentation strategies, for example with spectrogram augmentation, see https://arxiv.org/pdf/1904.08779.pdf. Furthermore, the original signal can also be overlaid with further realistic noise signals in order to be able to reflect a real scenario even better and thus further increase the accuracy of the acoustic classifier under non-optimal conditions.
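A spectrogram augmentation of this kind can be sketched as the masking of one frequency band and one time span in a time-frequency representation. This is a simplified illustration; the function `mask_spectrogram` and its parameters are assumptions, and other parts of the cited augmentation strategy, such as time warping, are not shown.

```python
import numpy as np


def mask_spectrogram(spec, max_freq_width, max_time_width, rng):
    """Zero one random frequency band and one random time span of a spectrogram."""
    out = spec.copy()
    n_freq, n_time = out.shape
    # Frequency mask: a band of random width at a random position.
    f_width = int(rng.integers(0, max_freq_width + 1))
    f_start = int(rng.integers(0, n_freq - f_width + 1))
    out[f_start:f_start + f_width, :] = 0.0
    # Time mask: a span of random width at a random position.
    t_width = int(rng.integers(0, max_time_width + 1))
    t_start = int(rng.integers(0, n_time - t_width + 1))
    out[:, t_start:t_start + t_width] = 0.0
    return out


rng = np.random.default_rng(1)
spec = np.ones((40, 100))  # toy spectrogram: 40 frequency bins, 100 time frames
augmented = mask_spectrogram(spec, max_freq_width=8, max_time_width=20, rng=rng)
```

The augmented spectrogram keeps its shape and its ground truth class; only the masked regions are blanked out.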

In one embodiment of the method, in order to obtain the interference and/or to reduce the audibility of the interference, a loss function is minimized while complying with the condition that the interference is smaller than a predetermined interference. The loss function, also called the combined loss function, includes as a first part the disturbances and as a second part a loss function of the acoustic classifier extended with the disturbances. The extended loss function is minimized when the acoustic classifier outputs the classification intended by an interferer.

In principle, PGDM can also be used for an attack in the audio sector. However, it turns out that this method does not work well with the increased non-linearities that are caused by pre-processing and the possible massive use of recurrent layers in the acoustic classifier. It is therefore often not possible, particularly in the case of long sequences, for example speech recognition, to find a suitable disturbance which is inaudible to a human being.

An alternative, but more complex, approach relies on a proprietary optimization approach to ensure that a perturbation is found that is inaudible to a human and still fools the acoustic classifier. Formally, this can be expressed as:

$$\underset{d}{\text{minimize}} \quad \mathrm{dB}_x(d) = \mathrm{dB}(d) - \mathrm{dB}(x) \qquad \text{subject to} \quad \hat{y} = f(x + d; Q) = t$$

In these formulas, x means: vector with raw, first input data, d: generic disturbance, $\hat{y}$: class predicted by the acoustic classifier, t: target class of the attacker, Q: parameters.

The target class is the class that the attacker will ensure to be predicted by the acoustic classifier instead of the correct class.

In this formulation, the main goal is to minimize the difference between the magnitude of the interference and the magnitude of the first input data, so that the interference is not audible to a human when it is added to the input data. As a necessary condition, it is introduced that the acoustic classifier must be successfully deceived and the targeted class, or sequence of acoustic units, is predicted. However, this optimization problem is very difficult to solve with methods based on normal gradients, since according to one aspect of the invention the classification function f(·) is represented by an artificial neural network which is very strongly non-linear.
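The difference between the magnitude of the interference and the magnitude of the input data can be illustrated with a small calculation. The application does not define the measure dB itself; the sketch below assumes the peak level in decibels, 20·log10 of the largest absolute sample, which is one common choice. The function names are hypothetical.

```python
import math
from typing import Sequence


def db(signal: Sequence[float]) -> float:
    """Peak level in decibels (assumed definition: 20 * log10 of the largest absolute sample)."""
    peak = max(abs(v) for v in signal)
    if peak == 0.0:
        raise ValueError("the level of an all-zero signal is undefined")
    return 20.0 * math.log10(peak)


def relative_db(x: Sequence[float], d: Sequence[float]) -> float:
    """Level of the disturbance d relative to the input x: dB(d) - dB(x)."""
    return db(d) - db(x)


x = [0.5, -1.0, 0.25]     # toy input signal with peak 1.0
d = [0.005, -0.01, 0.0]   # toy disturbance with peak 0.01
level = relative_db(x, d)  # about -40 dB: the peak of d is 1/100 of the peak of x
```

The more negative this value, the quieter the disturbance is relative to the input signal.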

In order to circumvent this problem, the optimization problem is reformulated according to the invention and a combined loss function is introduced:

$$\underset{d}{\text{minimize}} \quad \underbrace{\lVert d \rVert_2^2}_{\text{imperceptible}} + \underbrace{a \cdot L(x + d, t; Q)}_{\text{adversarial}} \qquad \text{subject to} \quad \mathrm{dB}_x(d) < e$$

In these formulas, L: loss function, a: tradeoff parameter, e: maximum allowed disturbance.

This concrete variant of calculating disturbances for acoustic signals is disclosed in https://arxiv.org/abs/1801.01944.

The first part of the combined loss function causes a disturbance d with the lowest possible strength to be found and the second part causes the disturbance found to also successfully disturb the acoustic classifier. Successful disruption is ensured by minimizing the value of the acoustic classifier's loss function L(·), so that it tends to zero. The parameter a makes it possible to set the tradeoff between successful disruption and imperceptibility and can therefore be adapted to the given circumstances and to the objective. The necessary condition provides an additional constraint to ensure that the interference is evenly distributed across the input signal and does not have a very high outlier in some regions that would be heard by humans, even though the first term of the loss function, which is the squared ℓ₂-norm of the perturbation, is small. According to one aspect of the invention, further terms are added which specify, for example, that the interference should be added mainly at frequencies that are not audible to a human.

According to the invention, this optimization problem is solved with gradient descent. Therefore, the combined loss function is minimized until a perturbation d is found that successfully perturbs the acoustic classifier and causes it to predict the target class t. For this purpose, a higher value is initially used for the maximum strength e with which the disturbance can be heard by a human. As soon as this initial disturbance has been found, the attacker's maximum allowed strength e is reduced and the optimization continued. This process continues iteratively until a predetermined number of iterations has been completed. Consequently, during the optimization, the audibility is reduced more and more, but the deceptive character of the disturbance remains, so that the acoustic classifier is still correctly deceived.
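The iterative procedure can be sketched as follows. This is a minimal illustration under stated assumptions: a toy logistic classifier with hypothetical weights `w` and bias `b` stands in for the acoustic classifier, and the condition on the maximum strength e is approximated by limiting every entry of d to the range [-e, e]. The step size, the shrink factor and the iteration count are assumed values.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def find_quiet_disturbance(x, t, w, b, a=5.0, e_start=0.5, shrink=0.8,
                           lr=0.05, n_iters=200):
    """Gradient descent on ||d||^2 + a * L(x + d, t) with a shrinking bound e."""
    d = np.zeros_like(x)
    e = e_start
    best_d, best_e = None, None
    for _ in range(n_iters):
        p = sigmoid(w @ (x + d) + b)
        grad = 2.0 * d + a * (p - t) * w    # gradient of the combined loss w.r.t. d
        d = np.clip(d - lr * grad, -e, e)   # descent step, then enforce the bound e
        predicted = float(sigmoid(w @ (x + d) + b) > 0.5)
        if predicted == t:
            best_d, best_e = d.copy(), e    # remember the quietest success so far
            e *= shrink                     # reduce the allowed strength and continue
            d = np.clip(d, -e, e)
    return best_d, best_e


# Hypothetical toy classifier and input (assumptions for illustration only).
w = np.array([1.5, -2.0, 0.5, 1.0])
b = 0.0
x = np.array([0.2, -0.1, 0.3, 0.1])  # classified as class 1 without disturbance
t = 0.0                               # target class of the attacker

best_d, best_e = find_quiet_disturbance(x, t, w, b)
fooled = float(sigmoid(w @ (x + best_d) + b) > 0.5) == t
```

The loop keeps the last disturbance that still produced the target class, so the returned disturbance corresponds to the smallest bound for which the deception succeeded within the given number of iterations.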

With this, the disturbances during the adversarial training are calculated with the method according to this embodiment and different hyperparameter settings.

In a further embodiment of the method, the first input data includes raw data from the driving system acoustic sensor, filtered raw data and/or a representation of the raw data in a time-frequency range.

In principle, raw acoustic signals can be used as input data for the acoustic classifier without pre-processing, but this currently results in lower classification accuracies.

According to one aspect of the invention, the raw data are filtered with low-pass or band-pass filters in order to specifically suppress or amplify noises depending on the situation.

The representation of the raw data in the time-frequency domain is based, for example, on pre-processing the raw data with a short-time Fourier transformation, whereby different window types (Hann, Blackman) with different parameters (window width, hop distance) are used. The result is a time-frequency image in which the energy is displayed in different frequencies over time. If more than one driving system acoustic sensor is evaluated, there is a signal for each sensor which is transformed independently. In this case, therefore, there are several time-frequency images, analogous to an RGB image in which three color channels are present. From this representation, further weighted characteristics or features, divided into frequency bins, are extracted, which typically use the Mel frequency scale, so that at the end Mel Frequency Cepstral Coefficients (MFCC) or Mel Frequency Filter Banks (FBank) are used, see https://towardsdatascience.com/getting-to-know-the-mel-spectrogram-31bca3e2d9d0. Different settings can be used for this, for example min/max frequency, number of frequency bins.
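The short-time Fourier transformation with a selectable window type and one time-frequency image per sensor can be sketched as follows. This is a simplified illustration using NumPy only; the function `stft_power`, the sampling rate and the test tone are assumptions, and the subsequent Mel weighting is not shown.

```python
import numpy as np


def stft_power(signal, win_len=256, hop=128, window="hann"):
    """Power spectrogram of a 1-D signal, shape (frequency bins, time frames)."""
    win = np.hanning(win_len) if window == "hann" else np.blackman(win_len)
    n_frames = 1 + (len(signal) - win_len) // hop
    # Cut the signal into overlapping frames and apply the window to each frame.
    frames = np.stack([signal[i * hop:i * hop + win_len] * win
                       for i in range(n_frames)])
    spectrum = np.fft.rfft(frames, axis=1)
    return (np.abs(spectrum) ** 2).T


sr = 8000                             # assumed sampling rate in Hz
time = np.arange(sr) / sr             # one second of samples
tone = np.sin(2 * np.pi * 1000 * time)  # 1 kHz test tone

# Two sensors yield two independently transformed time-frequency images,
# stacked like the color channels of an image.
channels = np.stack([stft_power(tone), stft_power(0.5 * tone)])
peak_bin = int(np.argmax(channels[0].mean(axis=1)))
peak_hz = peak_bin * sr / 256         # center frequency of the strongest bin
```

With these settings the strongest frequency bin of the first channel lies at the frequency of the test tone.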

In addition, the pre-processing can contain noise reduction in order to improve the signal quality of the acoustic signals. For example, in the case of speech recognition, it is desirable to filter out unimportant background noise, such as engine noise or tire friction. If several sensors are used, mechanisms can be used which exploit the different propagation times of acoustic waves to the individual sensors, for example beamforming or source separation. These methods can themselves be based on artificial intelligence. It is also possible to remove noise from the time signals, for example using a denoising autoencoder or Wiener filter, before these signals are transformed into the time-frequency domain. There are also algorithmic, statistical methods that weight the time-frequency features and try to assign a low weight to features with low speech energy.

Raw acoustic signals can differ significantly even though they reflect the same context, such as noise or speech. For example, the current emotional state of a speaker leads to differently emphasized signals. The pre-processing first generates features that have a higher invariance to such different signals of the same basic event.

According to one aspect of the invention, the method according to the invention is extended such that the attacker no longer adds the interference signal to the original input data. Instead, the interference signal is added to a representation in the time-frequency domain. It is also possible to add the interference signal to any other representation after the individual steps in the pre-processing.

In reality, an attacker cannot attack these features, because the attacker has no access to the inside of the acoustic classifier. For the defender, however, this is a good way to further improve the robustness of the system by simulating the effects of a real attack on an acoustic signal more effectively, which makes training more efficient.

In a further embodiment of the method, the first input data includes a representation of raw data from the driving system acoustic sensor in a time-frequency range. Masking adds the interference at low-energy frequencies.

The masking restricts the features that the used attacker is allowed to attack during training. As a result, the attacker can only add the interference signal to a subset of all available features during training. According to the invention, the masking is used to prevent attacks on relevant features with high speech energy during training. As a result, during training, the attacker can only add the interference signal to features that receive little information about the existing speech energy.

This better reflects a real attack. Since the aim is an attack that is inaudible to humans, a real attacker will calculate the interference signal in such a way that it mainly affects frequencies that previously did not have high energy. This is necessary so that a human hears no difference, since at these frequencies a relatively small amount of interference is already sufficient to mask the influence of the original signal. If, instead, the attacker adds the interference to frequencies that already have high energy, the existing energy must be drowned out. This requires significantly stronger interference, so that the resulting interference signal is clearly audible and may even be stronger than the original signal. Certain parts of the original signal may then no longer be understandable. This is not a desired behavior for a real attacker, so the interference signal must be added at low-energy frequencies. By simulating such an attack, which attacks only low-energy frequencies, during training, the acoustic classifier can be improved more efficiently and effectively against general real-world attacks than with normal adversarial training that attacks the raw speech signal. The acoustic classifier is thus specifically trained to utilize frequencies with high energy and to be more robust against interference at less important frequencies.

The masking can be transferred analogously to noise detection, in that only features that are not relevant to the respective acoustic event may be disturbed by the attacker. As a result, the acoustic classifier will learn during training not to use the disturbed features and to rely on the remaining features. Since these are particularly relevant and meaningful with regard to the acoustic events that are present, for example the speech, the robustness increases further because the acoustic classifier learns to make its decision mainly on the basis of these features.
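The restriction of the attacker to low-energy features can be sketched as follows. This is an illustration under assumptions, not the implementation of the invention: the function names are hypothetical, and the relevance mask is derived here by comparing each feature with a random noise image scaled to the same mean power as the features (SNR = 0 dB), which is only one possible way of obtaining such a mask.

```python
import random


def relevance_mask(features, rng):
    """Mark a feature as relevant (True) when its magnitude exceeds the noise.

    The noise image has the same shape as the features and is scaled so that
    its mean power equals the mean power of the features, i.e. SNR = 0 dB.
    """
    flat = [v for row in features for v in row]
    feature_power = sum(v * v for v in flat) / len(flat)
    noise = [[rng.gauss(0.0, 1.0) for _ in row] for row in features]
    noise_flat = [v for row in noise for v in row]
    noise_power = sum(v * v for v in noise_flat) / len(noise_flat)
    scale = (feature_power / noise_power) ** 0.5
    return [
        [abs(f) > abs(n * scale) for f, n in zip(f_row, n_row)]
        for f_row, n_row in zip(features, noise)
    ]


def add_masked_interference(features, interference, mask):
    """Add interference only where the mask marks a feature as not relevant."""
    return [
        [f if relevant else f + s for f, s, relevant in zip(f_row, s_row, m_row)]
        for f_row, s_row, m_row in zip(features, interference, mask)
    ]


# Hypothetical feature image with two high-energy entries.
features = [[10.0, 0.0, 0.0], [0.0, 8.0, 0.0]]
interference = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
mask = relevance_mask(features, random.Random(0))
attacked = add_masked_interference(features, interference, mask)
```

Features with zero energy can never exceed the noise image and are therefore always open to the attacker, whereas features marked as relevant pass through unchanged.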

Various methods are possible to generate the masks that express the relevance of the features. For example, a mask is generated by first generating an image of the same size as the features, using white noise with an SNR = 0 dB to randomize the values. The absolute value of each feature is then individually compared to the absolute value of the noise image at the same location. If the feature's value is greater than that of the random noise, that feature is considered relevant. With this simple method, plausible relevance masks can be determined quickly, with which the attacker is then restricted during training.

In a further refinement of the classification system, the acoustic sensor is arranged in the interior of the driving system when used, and the acoustic classifier is robust against disturbing noises from

• Occupants including state of mind, stress, health, alcohol, drugs, position, orientation, identification,

• situations involving occupant interaction,

• Noise from damage to the driving system itself, including rattling, squeaking, grinding, fire noise,

• Interaction with functions of the driving system including control commands to the driving system including switching systems on and off, route selection, music selection, dialing calls, inquiries and

• Influencing of acoustically connected systems including mobile phones in the interior.

In a further refinement of the classification system, the acoustic sensor is arranged outside of the driving system when it is used, and the acoustic classifier is robust against disturbing noises from

• Other road users, including other driving systems, people, children playing, emergency vehicles in action, animals/wild animals,

• Situations including accident in the surroundings, falling trees, falling branches,

• Emergency calls/warning calls by people,

• weather noises including wet road, snow on road, hail, strong wind, forest fire,

• Damage noises on the driving system itself or on another driving system, including rattling, squeaking, grinding,

• Control commands to the driving system including opening of trunk, doors, identification of the driver.

In one embodiment of the classification system according to the invention, the acoustic classifier includes an artificial neural network for noise/speech recognition. The artificial neural network includes layers of convolutional networks, recurrent layers, fully connected layers and/or an encoder-decoder structure.

Convolutional networks include filter layers, also called kernels, to minimize dimensions of respective input data, and discretization layers, for example maxpooling kernels, to further reduce dimensions of respective input data. Using these layers, new features are extracted from the input data. Contextual sequence information is evaluated by means of recurrent layers, comprising GRU, BGRU, LSTM and BLSTM. Finally, fully connected layers can be used to output the final probabilities per event class. An encoder-decoder structure defines an encoded context/summary vector. An encoder-decoder structure is advantageous for speech recognition. Batch normalization or sequence normalization layers are used as additional components to speed up training and increase generalization.
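How a filter layer and a max-pooling layer each reduce the dimension of their input can be illustrated in one dimension. This is a minimal sketch: the function names, the feature row and the kernel values are hypothetical, and a trained network would learn its kernel values rather than use fixed ones.

```python
def conv1d(signal, kernel):
    """Valid cross-correlation, the operation of a convolutional filter layer."""
    width = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(width))
        for i in range(len(signal) - width + 1)
    ]


def max_pool(values, size):
    """Keep the maximum of each non-overlapping block of the given size."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]


# Hypothetical single-channel feature row and an edge-detecting kernel.
features = [0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]
edges = conv1d(features, [-1.0, 0.0, 1.0])  # 9 inputs become 7 outputs
pooled = max_pool(edges, 2)                 # 7 inputs become 3 outputs
```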

RASES is independent of the specific network architecture and of the existing hyperparameters, such as regularization, batch size, number of epochs, activations, classes, further data augmentation and/or dropout, and of optimization settings, such as loss function, optimizer and LR schedule.

In summary, an attack on an acoustic classifier is prevented or at least made more difficult by the invention. The acoustic classifier can therefore not be deceived by an attacker and also works correctly when there is an interference signal which is actually intended to deceive the acoustic classifier.

In addition, the invention increases the generalizability and thus the recognition rates under any interference. This improves the robustness against natural disturbances, such as street noise or conversations. This is particularly relevant as acoustic classifiers operate under widely varying environments and high robustness against unknown noise types/sounds is required. The improvements are made possible by the fact that RASES teaches the acoustic classifier to rely on features that are representative of the relevant acoustic energy in the input data. As a result, the acoustic classifier focuses on features that are meaningful and extracts information from important features. Noisy features are used less, making the acoustic classifier less sensitive to various perturbations, natural and adversarial.

A further advantage of the invention is that the increase in robustness is carried out by a synthetic augmentation of the training data. It is not necessary to record new data in reality, which depict all possible interference signals. On the one hand, this is hardly possible and, on the other hand, it requires greater effort to record as representative a quantity of noise signals as possible.

Furthermore, RASES can be extended to include regression models, which are used, for example, for localization/distance estimation. It is possible that an attacker can also fool such artificial intelligences. A simple increase in robustness is possible with the help of RASES, since in this case the original data set can also be augmented with specially generated interference signals. The RASES concept can therefore be transferred to all acoustic artificial intelligences that are learned using training data.

The invention is illustrated in the following exemplary embodiments. The figures show:

FIG. 1 shows a schematic representation of a normal training course of an artificial intelligence,

FIG. 2 shows an exemplary embodiment of the method according to the invention,

FIG. 3 shows an exemplary embodiment of preprocessed raw data,

FIG. 4 shows an exemplary embodiment of a mask,

FIG. 5 shows an embodiment of an acoustic classifier for speech recognition and

FIG. 6 shows a schematic representation of exemplary access points of an attacker.

In the figures, the same reference symbols denote the same or functionally similar reference parts. For the sake of clarity, only the relevant reference parts are highlighted in the individual figures.

FIG. 1 visualizes a normal course of training. The existing training data is shown to an artificial intelligence, for example an artificial neural network, and the loss function is minimized. This process is performed iteratively over multiple epochs of the training data. As a result, the artificial intelligence learns to correctly classify the existing data.

In the case of the adversarial training shown in FIG. 2, the original training data are augmented. This is done by an attacker who specifically calculates an interference signal S that leads to the current acoustic classifier AK being deceived. An iterative attack is used for this. The optimization-based method according to the invention is used to attack the acoustic classifier AK. This introduces a combined loss function, which expresses how well the current interference signal S deceives the acoustic classifier and how audible this interference signal S is to humans. This combined loss function is then minimized using gradient descent. Typically, the focus is first on finding a valid interference signal S, even if it is clearly audible to a human. In the further course of the optimization process, the strength of this interference signal S is then reduced, resulting in a valid interference signal that is not recognizable to humans. The original data is expanded with the resulting interference signals. The resulting augmented training data is any combination of original and attacked/perturbed data. On this data, a normal training iteration is then performed to minimize the loss function and thereby train the acoustic classifier robustly.
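The minimization of a combined loss function by gradient descent can be sketched with a toy example. Everything specific here is an assumption made for illustration: a linear logistic classifier stands in for the acoustic classifier AK, the squared norm of S stands in for its audibility, the weight `strength_weight` between the two loss terms is fixed rather than adapted during the optimization, and all names and values are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, x):
    """Probability of class 1 for a toy linear classifier (stand-in for AK)."""
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)


def craft_interference(weights, bias, x, target, strength_weight=0.05,
                       step=0.1, iterations=500):
    """Minimise CE(target | x + s) + strength_weight * ||s||^2 over s."""
    s = [0.0] * len(x)
    for _ in range(iterations):
        p = predict(weights, bias, [v + d for v, d in zip(x, s)])
        # Gradient of the cross-entropy term w.r.t. s is (p - target) * w;
        # gradient of the strength term is 2 * strength_weight * s.
        grad = [(p - target) * w + 2.0 * strength_weight * d
                for w, d in zip(weights, s)]
        s = [d - step * g for d, g in zip(s, grad)]
    return s


def norm(v):
    return math.sqrt(sum(d * d for d in v))


# Hypothetical classifier and an input that is classified as class 0.
weights, bias = [2.0, -1.0, 0.5], 0.0
x = [-1.0, 1.0, 0.0]
s = craft_interference(weights, bias, x, target=1.0)
p_clean = predict(weights, bias, x)
p_attacked = predict(weights, bias, [v + d for v, d in zip(x, s)])

# A larger weight on the strength term yields a weaker interference signal.
s_quiet = craft_interference(weights, bias, x, target=1.0, strength_weight=0.5)
```

In this toy setting the weaker interference `s_quiet` no longer pushes the classifier across its decision boundary, which illustrates the trade-off between deceiving the classifier and keeping the interference small.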

Other methods for generating the interference signal are also within the scope of the invention.

The individual process steps are as follows:

• V1: Provision of first input signals by means of a driving system acoustic sensor for the acoustic classifier AK,

• V2: Obtaining disturbances S as a function of the first input signals for deception detection, avoidance and/or protection and/or for improving a recognition and/or classification performance of the acoustic classifier AK, the audibility of the disturbances being reduced,

• V3: obtaining second input data from an addition of the first input data and the disturbances,

• V4: Entering combinations of the first and the second input data into the acoustic classifier AK and

• V5: machine learning of the combinations.
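Steps V1 to V5 can be sketched as a small training loop. This is an illustration under assumptions rather than the claimed method: a two-dimensional logistic regression stands in for the acoustic classifier AK, and the disturbance is computed with a single bounded gradient-sign step, which is a simpler technique swapped in for the optimization-based attack described above. All names and values are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, b, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def sign(v):
    return (v > 0) - (v < 0)


def perturb(w, b, x, y, epsilon):
    """V2/V3: bounded disturbance that increases the loss, added to x."""
    p = predict(w, b, x)
    # Gradient of the cross-entropy loss w.r.t. the input is (p - y) * w.
    return [xi + epsilon * sign((p - y) * wi) for xi, wi in zip(x, w)]


def adversarial_training(data, epsilon=0.5, lr=0.1, epochs=200):
    w, b = [0.0] * len(data[0][0]), 0.0
    for _ in range(epochs):
        # V4: combine first (original) and second (perturbed) input data.
        batch = data + [(perturb(w, b, x, y, epsilon), y) for x, y in data]
        # V5: one gradient-descent step on the mean cross-entropy loss.
        grad_w, grad_b = [0.0] * len(w), 0.0
        for x, y in batch:
            err = predict(w, b, x) - y
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        w = [wi - lr * g / len(batch) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(batch)
    return w, b


# V1: hypothetical two-dimensional features with labels 0/1.
data = [([2.0, 2.0], 1), ([3.0, 1.5], 1), ([-2.0, -2.0], 0), ([-3.0, -1.5], 0)]
w, b = adversarial_training(data)
```

After training, the toy classifier assigns the correct class both to the original inputs and to inputs perturbed within the same bound `epsilon`.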

FIG. 3 shows an exemplary transformation in the time-frequency domain of the sentence: "The seven units to be offered for sale have a work force of about twenty thousand." FIG. 3 shows an exemplary representation of FBank features. With Fourier transformation, a signal in the time domain is broken down into its frequencies. The acoustic events are separated into time frames and a Fourier transform is applied to each time frame. The frequency axis is then displayed logarithmically and the amplitudes in decibels. A spectrogram results. In order to obtain a Mel spectrogram as shown in Fig. 3, the frequency scale f of the spectrogram is transformed to Mel scale m according to, for example,
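A widely used form of this mapping, stated here as an assumption because several variants of the Mel scale exist, with the frequency f in hertz, is:

```latex
m = 2595 \, \log_{10}\!\left(1 + \frac{f}{700}\right)
```

Under this form, a frequency of 1000 Hz corresponds to approximately 1000 mel.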

FIG. 4 shows a masking according to the invention of the mel spectrogram from FIG. 3, the data from FIG. 3 having been compared with a noise image.

FIG. 5 shows the structure of a system for speech recognition. The time signal x is pre-processed so that a time-frequency representation F results. This is used as input data for an acoustic model. This model is trained in a data-driven manner and is represented by deep artificial neural networks, called DNN, or a mix of DNN and Hidden Markov Models. It outputs a sequence of probabilities of acoustic units comprising letters, phonemes and parts of words, which is combined to form the resulting words and the word sequence being sought. The network architecture of the acoustic model includes layers of a convolutional network, fully connected layers and recurrent layers. Only the number of output classes is typically significantly larger, for example 80-2000, in order to cover all relevant acoustic units. Special loss functions, such as Connectionist Temporal Classification, see https://www.cs.toronto.edu/~graves/icml_2006.pdf, are also used.

The composition is performed using a decoder, which searches for the most probable sequence through the sequence of probability vectors of the acoustic units. A beam search decoder is often used with various options, for example with regard to beam width and/or weighting. Furthermore, additional a priori information about the formalisms of the processed language can be used. This includes a lexicon that contains legal words and a language model that expresses grammatical dependencies, including probabilities of the next word depending on the previous one. The language model can be represented by its own artificial intelligence or by simple probability tables and manually formed decision rules.
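A beam search over a sequence of probability vectors, optionally weighted by a simple probability table as the language model, can be sketched as follows. This is a simplified illustration: the function name, the two-unit alphabet and the table values are hypothetical, the probability floor for unseen pairs is an assumption, and the merging of repeated units and blank symbols required for Connectionist Temporal Classification is deliberately omitted.

```python
import math


def beam_search(prob_seq, beam_width=2, lm=None, lm_weight=1.0):
    """Return the most probable unit sequence and its log score.

    prob_seq: list of dicts mapping acoustic unit -> probability per time step.
    lm: optional dict mapping (previous_unit, unit) -> probability
        (a simple probability table; unseen pairs get a small floor).
    """
    beams = [((), 0.0)]  # (sequence so far, accumulated log score)
    for step in prob_seq:
        candidates = []
        for seq, score in beams:
            for unit, p in step.items():
                if p <= 0.0:
                    continue  # impossible unit, cannot extend the beam
                new_score = score + math.log(p)
                if lm is not None and seq:
                    pair_p = lm.get((seq[-1], unit), 1e-6)
                    new_score += lm_weight * math.log(pair_p)
                candidates.append((seq + (unit,), new_score))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]


# Hypothetical acoustic probabilities for two time steps and two units.
prob_seq = [{"a": 0.6, "b": 0.4}, {"a": 0.55, "b": 0.45}]
lm = {("a", "a"): 0.1, ("a", "b"): 0.9, ("b", "a"): 0.5, ("b", "b"): 0.5}

best_acoustic, _ = beam_search(prob_seq)
best_with_lm, score_with_lm = beam_search(prob_seq, lm=lm)
```

Without the language model the decoder follows the acoustic probabilities alone, whereas the probability table shifts the result towards the sequence it considers more plausible.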

The invention can be applied not only to systems that use this structure, but to all speech recognizers/noise recognizers that are learned from data. Consequently, RASES also applies in this case independently of the various hyperparameters of the learned artificial intelligence.

At access point A in FIG. 6, the attacker attacks before the raw data are pre-processed. According to the invention, this is simulated in that the interference signal S is added to the original input data. At access point B, the attacker attacks after the pre-processing; for example, the interference signal is added to a representation in the time-frequency domain.

According to the invention, during training the attacker can also add the interference at any point within the pre-processing, i.e. between Abs and FBANK, for example.

Reference signs

V1-V5 method steps

AK acoustic classifier

S disturbance

x time signal

F time-frequency representation

A, B access points of an attacker

Claims

1. Computer-implemented method for machine learning of a robustness of an acoustic classifier (AK), wherein a driving system is automatically controlled depending on classifications and/or localizations of the acoustic classifier (AK), the method comprising the steps of

• Provision of first input signals by means of a driving system acoustic sensor for the acoustic classifier (AK) (V1),

• Obtaining disturbances (S) as a function of the first input signals for deception detection, avoidance and/or protection and/or for improving a recognition and/or classification performance of the acoustic classifier (AK), the audibility of the disturbances being reduced (V2),

• obtaining second input data from an addition of the first input data and the disturbances (V3),

• inputting combinations of the first and the second input data into the acoustic classifier (AK) (V4) and

• machine learning of the combinations (V5), whereby the acoustic classifier (AK) learns to classify and/or localize acoustic events and thereby becomes robust against the disturbances.

2. The method as claimed in claim 1, wherein, in order to obtain the interference and/or reduce the audibility of the interference, a loss function is minimized subject to the condition that the interference is smaller than a specified interference, the loss function comprising the interference as a first part and, as a second part, a loss function of the acoustic classifier (AK) expanded with the disturbances, wherein the expanded loss function is minimized by a classification of the acoustic classifier (AK) intended by a disturber.

3. The method according to claim 1 or 2, wherein the first input data comprises raw data from the driving system acoustic sensor, filtered raw data and/or a representation of the raw data in a time-frequency range.

4. The method according to any one of the preceding claims, wherein the first input data include a representation of raw data of the driving system acoustic sensor in a time-frequency range and the disturbances are added to frequencies with low energy by masking.

5. Computer program for machine learning of a robustness of an acoustic classifier (AK) comprising program instructions that cause a computer to execute a method according to any one of claims 1 to 4 when the program runs on the computer.

6. Acoustic classification system for automated driving systems for classifying and/or localizing acoustic events in the exterior and/or interior of the driving system, comprising an acoustic sensor and an acoustic classifier (AK), wherein the acoustic classifier (AK) has learned, according to a method of the preceding claims, to classify and/or localize acoustic events robustly against disturbances.

7. Classification system according to claim 6, wherein, when in use, the acoustic sensor is arranged in the interior of the driving system and the acoustic classifier is robust against disturbing noises from

• situations involving occupant interaction,

8. Classification system according to claim 6, wherein, when in use, the acoustic sensor is arranged in the exterior of the driving system and the acoustic classifier is robust against disturbing noises from

• Emergency calls/warning calls by people,

9. Classification system according to one of claims 6 to 8, wherein the acoustic classifier (AK) comprises an artificial neural network for noise/speech recognition and the artificial neural network comprises layers of convolutional networks, recurrent layers, fully connected layers and/or an encoder-decoder structure.

10. Driving system that can be operated in an automated manner, comprising an acoustic classification system according to one of claims 6 to 9, a control unit for automated driving and actuators for longitudinal and/or lateral guidance of the driving system, wherein the control unit determines regulation and/or control signals depending on classifications and/or localizations of acoustic events by the acoustic classification system and makes these available to the actuators, and wherein disturbances in the form of signals from a loudspeaker arranged outside the driving system, a carrier signal from a loudspeaker arranged inside the driving system and/or signals from driving system parts that produce noise are added to the first input data of the acoustic classifier (AK).