CN117746877A - Voice noise reduction method and vehicle

Info

Publication number: CN117746877A
Application number: CN202311765892.XA
Authority: CN
Inventors: 张新会; 王珏华
Original assignee: Great Wall Motor Co Ltd
Current assignee: Great Wall Motor Co Ltd
Priority date: 2023-12-20
Filing date: 2023-12-20
Publication date: 2024-03-22

Abstract

The application provides a voice noise reduction method and a vehicle. The method comprises: acquiring the voice to be recognized that a user utters while the vehicle is running; recognizing the voice to be recognized with a pre-trained non-stationary noise recognition model to obtain the non-stationary noise; and removing the non-stationary noise from the voice to be recognized to obtain voice with the non-stationary noise removed. Because the non-stationary noise recognition model identifies and outputs the non-stationary noise contained in the voice to be recognized, that noise can be determined accurately and then removed. The resulting voice contains fewer interfering components, which improves the accuracy of voice recognition and the user's driving experience.

Description

Voice noise reduction method and vehicle

Technical Field

The application relates to the technical field of voice noise reduction, in particular to a voice noise reduction method and a vehicle.

Background

In the vehicle cabin, speech recognition is widely used to control the vehicle's functions and systems. However, many users find that a voice command gets no response, or that the recognized instruction does not match the instruction the vehicle owner intended. One likely cause is that the user's voice contains considerable noise, which degrades recognition. Many vehicle models already use physical noise reduction, sound source localization and acoustic models in the cabin, but these methods can only reduce stationary noise and cannot accurately reduce the non-stationary noise generated while the vehicle is running.

Disclosure of Invention

In view of the foregoing, an object of the present application is to provide a voice noise reduction method and a vehicle, so as to solve the problem of noise reduction of unsteady noise generated during driving of the vehicle.

Based on the above object, the present application provides a voice noise reduction method, including:

acquiring voice to be recognized sent by a user in the running process of the vehicle;

recognizing the voice to be recognized by using a non-stationary noise recognition model to obtain non-stationary noise, wherein the non-stationary noise recognition model is trained in advance on non-stationary noise data of a vehicle in different driving states and on human voice data collected by the vehicle under noise-free conditions;

and removing the non-stationary noise from the voice to be recognized to obtain voice with the non-stationary noise removed.

Optionally, the recognizing the speech to be recognized by using an unsteady noise recognition model to obtain unsteady noise includes:

determining the number of frames of the voice to be recognized;

framing the voice to be recognized according to the frame number;

and recognizing the voice to be recognized after framing by using the unsteady noise recognition model to obtain the unsteady noise.

Optionally, the determining the number of frames of the voice to be recognized includes:

calculating the number of frames of the voice to be recognized from the total frame length of the voice to be recognized and the preset number of frames contained in each unit of frame length.

Optionally, the recognizing the speech to be recognized after framing by using the unsteady noise recognition model to obtain the unsteady noise includes:

inputting each frame of the framed voice to be recognized into the non-stationary noise recognition model, which recognizes each frame, removes all sounds other than the non-stationary noise in that frame, and outputs the first non-stationary noise of that frame;

and integrating all the first non-stationary noise to obtain the non-stationary noise.

Optionally, the training method of the unsteady state noise identification model includes:

constructing an initial model;

acquiring a noise-free voice data set and a non-stationary noise data set, wherein the noise-free voice data set comprises human voice data collected by the vehicle under noise-free conditions, and the non-stationary noise data set comprises non-stationary noise collected by the vehicle while driving at different vehicle speeds in each of several window-opening states;

mixing the noiseless voice data set and the unsteady noise data set to obtain a mixed voice data set;

and carrying out iterative training on the initial model according to the unsteady noise data set and the mixed voice data set to obtain the unsteady noise recognition model.

Optionally, acquiring the non-stationary noise dataset comprises:

when the vehicle window state is in a half-open state, acquiring tire noise data, wind noise data and engine noise data at each vehicle speed in a preset vehicle speed range as a first noise data set;

when the vehicle window state is in a full-open state, acquiring tire noise data, wind noise data and engine noise data at each vehicle speed in a preset vehicle speed range as a second noise data set;

The first noise data set and the second noise data set are integrated into the non-stationary noise data set.

Optionally, before training the initial model according to the non-stationary noise data set and the mixed speech data set, further comprising:

intercepting the mixed voice data set and the non-stationary noise data set, respectively, according to a preset intercept frame length and a preset intercept frame shift.

Optionally, performing iterative training on the initial model according to the non-stationary noise data set and the mixed voice data set to obtain the non-stationary noise recognition model, including:

inputting the mixed voice data set into the initial model, and outputting estimated noise through the initial model;

constructing a loss function of the initial model based on the estimated noise and the unsteady noise;

and minimizing the loss function to update the model parameters of the initial model to obtain the unsteady state noise identification model.

Based on the same inventive concept, the present application further provides a voice noise reduction device, including:

the acquisition module is configured to acquire voice to be recognized, which is sent by a user in the running process of the vehicle;

the voice recognition module is configured to recognize the voice to be recognized by using a non-stationary noise recognition model to obtain non-stationary noise, wherein the non-stationary noise recognition model is trained in advance on non-stationary noise data of a vehicle in different driving states and on human voice data collected by the vehicle under noise-free conditions;

and the execution module is configured to remove the unsteady state noise from the voice to be recognized to obtain voice with the unsteady state noise removed.

Based on the same inventive concept, the application also provides a vehicle comprising a controller for executing the voice noise reduction method described above.

As can be seen from the above, the voice noise reduction method and vehicle provided by the present application use the pre-trained non-stationary noise recognition model to recognize, and output, the non-stationary noise in the voice to be recognized. Removing that non-stationary noise removes most of the interfering sounds contained in the voice to be recognized and yields voice with the non-stationary noise removed. Because the model determines the non-stationary noise accurately, the resulting voice contains fewer interfering components, which improves the accuracy of voice recognition and the user's driving experience.

Drawings

In order to illustrate the technical solutions of the present application or of the related art more clearly, the drawings used in the description of the embodiments or the related art are briefly introduced below. The drawings described below show only some embodiments of the present application; a person of ordinary skill in the art can obtain other drawings from them without inventive effort.

Fig. 1 is a schematic flow chart of a voice noise reduction method according to an embodiment of the present application;

Fig. 2 is a flowchart of a non-stationary noise recognition model training method according to an embodiment of the present application;

Fig. 3 is a schematic structural diagram of a voice noise reduction device according to an embodiment of the present application;

Fig. 4 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application.

Detailed Description

For the purposes of making the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail below with reference to the accompanying drawings.

It should be noted that, unless otherwise defined, technical or scientific terms used in the embodiments of the present application have the ordinary meaning understood by a person of ordinary skill in the art to which the present application belongs. The terms "first", "second" and the like do not denote any order, quantity or importance; they merely distinguish one element from another. "Comprising", "comprises" and similar words mean that the element or item preceding the word covers the elements or items listed after the word and their equivalents, without excluding other elements or items. "Connected", "coupled" and similar terms are not limited to physical or mechanical connections and may include electrical connections, whether direct or indirect. "Upper", "lower", "left", "right" and the like indicate only relative positional relationships, which may change accordingly when the absolute position of the described object changes.

As described in the background, when users in the vehicle cabin control vehicle functions by voice, a command may get no response, or the recognized instruction may not match the instruction the vehicle owner intended. The first reason is that the owner's accent falls outside what the speech recognition technology covers, for example when a dialect is used. The second reason is that tire noise, wind noise and engine noise are relatively loud when the vehicle travels at high speed, which degrades recognition. To improve in-cabin recognition accuracy, the prior art generally adopts physical noise reduction, sound source localization, acoustic models and similar techniques. However, these techniques can only cope with stationary noise and cannot quickly denoise non-stationary noise such as tire noise, wind noise and engine noise. The non-stationary noise inside the vehicle changes with the open or closed state of the windows and with the vehicle speed. Since existing techniques can only denoise stationary noise, how to quickly denoise this variable non-stationary noise is a problem to be solved.

In order to solve the above problems, referring to fig. 1, the present application provides a voice noise reduction method, which is applied to a vehicle controller, and includes the following steps:

Step 102, acquiring the voice to be recognized uttered by a user while the vehicle is running.

In this step, the voice to be recognized may be speech that the driver addresses to the vehicle while driving in order to make a vehicle function operate; it may contain both the driver's voice and the non-stationary noise in the vehicle. For example, the driver may say "turn on the air conditioner". The vehicle controller collects the driver's speech together with the non-stationary noise in the vehicle and transmits the collected voice to the non-stationary noise recognition model. The model may be deployed in the cloud: the vehicle controller transmits the voice to be recognized to the cloud, the cloud-side model recognizes it and feeds the recognition result back to the vehicle controller, and the vehicle controller carries out the next processing step. Alternatively, the model may be deployed in the vehicle: the vehicle controller transmits the voice to be recognized to the in-vehicle model, the model recognizes it and feeds the result back to the vehicle controller, and the vehicle controller carries out the next processing step.

Step 104, recognizing the voice to be recognized by using the trained non-stationary noise recognition model to obtain non-stationary noise, wherein the non-stationary noise recognition model is trained in advance on non-stationary noise data of the vehicle in different driving states and on human voice data collected by the vehicle under noise-free conditions.

In this step, the different driving states may refer to the different window-opening states of the vehicle while driving and the different vehicle speeds in each window-opening state. The initial model is trained with a non-stationary noise data set (which may include tire noise, wind noise and engine noise recorded with the windows open) and a noise-free voice data set (which may include human speech recorded without noise) to obtain the pre-trained non-stationary noise recognition model. Because the model has been trained on this data, recognizing the voice to be recognized with it yields a more accurate estimate of the non-stationary noise.

Step 106, removing the non-stationary noise from the voice to be recognized to obtain voice with the non-stationary noise removed.

In this step, the vehicle controller obtains the non-stationary noise output by the non-stationary noise recognition model and removes it from the voice to be recognized. Because the noise removed is the noise the model identified in that same voice, the denoised voice contains less interference, which makes it easier for the vehicle controller to control other vehicle functions according to the denoised voice.

Through steps 102-106, the method uses the pre-trained non-stationary noise recognition model to recognize and output the non-stationary noise in the voice to be recognized. Removing that noise removes most of the interfering sounds contained in the voice to be recognized and yields voice with the non-stationary noise removed. The resulting voice contains fewer interfering components, which improves the accuracy of voice recognition and the user's driving experience.

Further, the non-stationary noise may be removed from the voice to be recognized by spectral subtraction. The spectral subtraction process may include: determining the spectrum of the voice to be recognized from the voice to be recognized; determining the non-stationary noise spectrum from the non-stationary noise; subtracting the non-stationary noise spectrum from the spectrum of the voice to be recognized to obtain the spectrum of the voice with the non-stationary noise removed; and applying an inverse Fourier transform to that spectrum to obtain the voice with the non-stationary noise removed. Specifically, a Fourier transform of the voice to be recognized gives its spectrum Y(ω), and a Fourier transform of the non-stationary noise gives the noise spectrum N(ω), so the spectrum of the voice with the non-stationary noise removed is X(ω) = Y(ω) - N(ω). To convert X(ω) back into voice, the inverse Fourier transform is applied according to the amplitude spectrum and the phase of X(ω), which finally yields the denoised voice. The Fourier transform and inverse Fourier transform themselves are known in the art and are not described in detail here.
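The spectral difference X(ω) = Y(ω) - N(ω) can be illustrated with a minimal Python sketch. It assumes that the noisy voice and the noise estimate are equal-length lists of samples, and it uses a naive discrete Fourier transform for brevity; the function names `dft`, `idft` and `spectral_subtract` are illustrative only and do not come from the application.

```python
import cmath

def dft(signal):
    """Naive discrete Fourier transform, O(n^2); adequate for a short illustration."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Inverse transform; returns the real part of each reconstructed sample."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n).real
            for t in range(n)]

def spectral_subtract(noisy, noise):
    """Compute X = Y - N bin by bin, then transform X back to the time domain."""
    if len(noisy) != len(noise):
        raise ValueError("noisy voice and noise estimate must have the same length")
    Y = dft(noisy)   # spectrum of the voice to be recognized
    N = dft(noise)   # spectrum of the estimated non-stationary noise
    X = [y - n for y, n in zip(Y, N)]
    return idft(X)

# Made-up example data: a clean tone plus a known noise sequence.
clean = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
noise = [0.3, -0.2, 0.1, 0.4, -0.3, 0.2, -0.1, 0.0]
noisy = [c + n for c, n in zip(clean, noise)]
denoised = spectral_subtract(noisy, noise)
```

When the noise estimate is exact, as in this made-up example, `denoised` matches `clean` up to floating-point error; with a model-estimated noise signal the result is only as good as the estimate.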

In some embodiments, the obtaining the speech to be recognized includes:

acquiring initial mixed voice;

intercepting the initial mixed voice according to a preset intercepting frame length and a preset intercepting frame shift to obtain the voice to be recognized.

Specifically, the initial mixed voice may be speech that the driver addresses to the vehicle while driving in order to make a vehicle function operate, for example "turn on the air conditioner and set the temperature to 25°". The preset intercept frame length may be 2 s and the preset intercept frame shift may be 1 s. The first segment is then the 0 s-2 s portion of the utterance. The frame shift is the overlap between two adjacent segments, so the second segment starts at the end time of the first segment moved back by one frame shift and covers the 1 s-3 s portion of the utterance. The initial mixed voice is intercepted into successive segments in this way, turning a long utterance into short segments that the non-stationary noise recognition model can recognize quickly.
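The interception rule can be sketched in Python as follows. The sketch follows the description above in treating the frame shift as the overlap between adjacent segments; the function name `intercept_segments` and the choice to drop a trailing remainder shorter than one frame length are assumptions, not details from the application.

```python
def intercept_segments(total_s, frame_len_s=2.0, frame_shift_s=1.0):
    """Return (start, end) times in seconds of the intercepted segments."""
    if frame_len_s <= 0 or not 0 <= frame_shift_s < frame_len_s:
        raise ValueError("need frame_len_s > 0 and 0 <= frame_shift_s < frame_len_s")
    segments = []
    start = 0.0
    while start + frame_len_s <= total_s:
        end = start + frame_len_s
        segments.append((start, end))
        # The next segment starts one overlap (frame shift) before this one ends.
        start = end - frame_shift_s
    return segments

segments = intercept_segments(5.0)
# → [(0.0, 2.0), (1.0, 3.0), (2.0, 4.0), (3.0, 5.0)]
```

With a 2 s frame length and a 1 s overlap, each segment begins 1 s after the previous one, which reproduces the 0 s-2 s and 1 s-3 s segments of the example.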

In some embodiments, the recognizing the speech to be recognized by using the trained non-stationary noise recognition model to obtain non-stationary noise includes:

determining the number of frames of the voice to be recognized;

framing the voice to be recognized according to the frame number;

and recognizing the framed voice to be recognized by using the non-stationary noise recognition model to obtain the non-stationary noise.

Specifically, when the vehicle controller uses the non-stationary noise recognition model, it first determines the number of frames of the voice to be recognized, divides the voice into frames accordingly, and inputs the frames into the model so that the model recognizes the voice frame by frame. For each input frame, the model identifies the types of sound present, namely non-stationary noise and other sounds (such as the human voice or other man-made noise), and thereby determines all sound types in the frame; for example, a frame may contain both non-stationary noise and a human voice. The model then rejects the human voice, retains only the non-stationary noise, and outputs it.

In some embodiments, the determining the number of frames of the speech to be recognized includes:

calculating the number of frames of the voice to be recognized from the total frame length of the voice to be recognized and the preset number of frames contained in each unit of frame length.

Specifically, the voice to be recognized consists of the several voice segments intercepted from the initial mixed voice, and its total frame length is the total length covered by those segments, which follows from the frame length of each segment and the frame shift between segments. For example, the voice to be recognized may comprise four segments covering 0 s-2 s, 1 s-3 s, 2 s-4 s and 3 s-5 s, giving a total frame length of 5 s. With a preset 5 frames per unit of frame length (that is, per second), the number of frames of the voice to be recognized is 5 × 5 = 25. Computing from the frame length and frame shift in this way gives the number of frames accurately.
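The frame-count arithmetic and the subsequent framing can be sketched in Python. The sample data (100 samples standing for 5 s of voice) and the function names are hypothetical, and the equal-size split is only one plausible reading of "framing according to the frame number".

```python
def count_frames(total_seconds, frames_per_second):
    """Number of frames = total frame length x preset frames per unit length."""
    return total_seconds * frames_per_second

def split_into_frames(samples, frame_count):
    """Split samples into frame_count equal frames; leftover samples are dropped."""
    frame_size = len(samples) // frame_count
    if frame_size == 0:
        raise ValueError("not enough samples for the requested number of frames")
    return [samples[i * frame_size:(i + 1) * frame_size] for i in range(frame_count)]

frame_count = count_frames(5, 5)   # 5 s total, 5 frames per second
samples = list(range(100))         # hypothetical stand-in for 5 s of audio
frames = split_into_frames(samples, frame_count)
```

Here `frame_count` is 25 and each frame holds 4 of the 100 placeholder samples; a real recording would have far more samples per frame.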

In some embodiments, the recognizing the speech to be recognized after framing by using the unsteady noise recognition model to obtain the unsteady noise includes:

inputting each frame of the framed voice to be recognized into the non-stationary noise recognition model, which recognizes each frame, removes all sounds other than the non-stationary noise in that frame, and outputs the first non-stationary noise of that frame;

and integrating all the first non-stationary noise to obtain the non-stationary noise.

Specifically, the vehicle controller inputs the framed voice to be recognized into the non-stationary noise recognition model one frame at a time, in frame order. For each input frame, the model recognizes the frame, removes all sounds other than the non-stationary noise, and outputs the first non-stationary noise of that frame. All frames are processed in this way, and the first non-stationary noise of all frames is then integrated to obtain the non-stationary noise. Recognizing the non-stationary noise frame by frame allows it to be removed from the voice to be recognized more accurately. The first non-stationary noise may be integrated in the order in which the frames were input. Alternatively, each frame may carry a sequence mark that is copied to its output first non-stationary noise; all the first non-stationary noise is then integrated in mark order to obtain ordered non-stationary noise, so that the noise can be removed from the voice to be recognized mark by mark, which improves accuracy. Removing the noise in this order also speeds up removal, so that the voice with the non-stationary noise removed is obtained quickly.
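The mark-based integration can be sketched in Python as below. The `placeholder_model` is a hypothetical stand-in for the trained recognition model (it simply returns the integer part of each sample divided by 10), and the function names are illustrative assumptions.

```python
def recognize_frames(frames, model):
    """Run the model on every frame and tag each output with the frame's sequence mark."""
    return [(mark, model(frame)) for mark, frame in enumerate(frames)]

def integrate_noise(tagged_noise):
    """Concatenate the per-frame noise in sequence-mark order."""
    noise = []
    for _, frame_noise in sorted(tagged_noise, key=lambda item: item[0]):
        noise.extend(frame_noise)
    return noise

def placeholder_model(frame):
    # Hypothetical stand-in for the trained model, used only to produce output.
    return [sample // 10 for sample in frame]

frames = [[10, 20], [30, 40], [50, 60]]
tagged = recognize_frames(frames, placeholder_model)
# Reverse the outputs to simulate results arriving out of order.
noise = integrate_noise(list(reversed(tagged)))
```

Because each output carries its frame's mark, the integrated noise comes out in frame order even when the per-frame results are collected out of order.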

In some embodiments, as shown in fig. 2, the training method of the non-stationary noise recognition model includes the following steps:

Step 202, constructing an initial model;

Step 204, acquiring a noise-free voice data set and a non-stationary noise data set, wherein the noise-free voice data set comprises human voice data collected by the vehicle under noise-free conditions, and the non-stationary noise data set comprises non-stationary noise collected by the vehicle while driving at different vehicle speeds in each of several window-opening states;

Step 206, mixing the noise-free voice data set and the non-stationary noise data set to obtain a mixed voice data set;

Step 208, iteratively training the initial model according to the non-stationary noise data set and the mixed voice data set to obtain the non-stationary noise recognition model.

Specifically, the initial model may be any one of models such as SqueezeNet, MobileNet, ShuffleNet and GhostNet. During the running of the vehicle, the vehicle controller acquires multiple pieces of noise-free voice (for example, human voice data collected by the vehicle under noise-free conditions) to form the noise-free voice data set, and acquires several kinds of non-stationary noise to form the non-stationary noise data set. The non-stationary noise may include the wind noise, tire noise and engine noise collected inside the vehicle while it is driven in different window-opening states and at different vehicle speeds in each state. The two data sets are mixed to obtain the mixed voice data set; several kinds of non-stationary noise may be superimposed during mixing to make the mixed voice richer. The initial model is then iteratively trained with the non-stationary noise data set and the mixed voice data set to obtain the non-stationary noise recognition model. Through steps 202-208, the initial model is trained on a mixed voice data set rich in noise, so that the trained model (the pre-trained non-stationary noise recognition model) has stronger recognition capability and can recognize the non-stationary noise in voice more accurately.

In some embodiments, the mixing the noiseless speech data set and the unsteady state noise data set to obtain a mixed speech data set includes:

mixing the even-numbered frames of the noise-free voice data set with the even-numbered frames of the non-stationary noise data set to obtain even mixed voice data;

mixing the odd-numbered frames of the noise-free voice data set with the odd-numbered frames of the non-stationary noise data set to obtain odd mixed voice data;

and integrating the even mixed voice data and the odd mixed voice data in sequence to obtain the mixed voice data set.

Specifically, the vehicle controller frames all the noise-free voice in the noise-free voice data set to obtain multiple frames of noise-free voice data with sequence marks, and frames the non-stationary noise in the non-stationary noise data set to obtain multiple frames of non-stationary noise with sequence marks. The even-numbered frames of noise-free voice data are mixed with the even-numbered frames of non-stationary noise data to obtain multiple frames of even mixed voice data, and the odd-numbered frames are mixed likewise to obtain multiple frames of odd mixed voice data. The even and odd mixed voice data are then sorted by sequence mark to obtain the mixed voice data set. In this way all the noise and the noise-free voice are thoroughly mixed, which improves the recognition capability of the non-stationary noise recognition model and lets it quickly recognize the non-stationary noise in the voice to be recognized.
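The even/odd mixing and re-ordering can be sketched in Python. The application does not specify the mixing operation, so sample-wise addition is assumed here; the function name `mix_by_parity` and the toy frame values are likewise illustrative.

```python
def mix_by_parity(speech_frames, noise_frames):
    """Mix even frames with even frames, odd with odd, then restore sequence order."""
    if len(speech_frames) != len(noise_frames):
        raise ValueError("both data sets must contain the same number of frames")

    def mix(i):
        # Assumed mixing rule: add the noise frame to the voice frame sample by sample.
        return [s + n for s, n in zip(speech_frames[i], noise_frames[i])]

    even_mixed = {i: mix(i) for i in range(0, len(speech_frames), 2)}
    odd_mixed = {i: mix(i) for i in range(1, len(speech_frames), 2)}
    merged = {**even_mixed, **odd_mixed}
    # Sort by sequence mark so the mixed data set keeps the original frame order.
    return [merged[i] for i in sorted(merged)]

speech = [[1, 1], [2, 2], [3, 3], [4, 4]]
noise = [[10, 10], [20, 20], [30, 30], [40, 40]]
mixed = mix_by_parity(speech, noise)
# → [[11, 11], [22, 22], [33, 33], [44, 44]]
```

Whether frame numbering starts at 0 or 1 does not affect the pairing, since frames are always mixed with the noise frame carrying the same sequence mark.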

In some embodiments, acquiring the non-stationary noise data set comprises:

when the vehicle window is in a half-open state, acquiring tire noise data, wind noise data and engine noise data at each vehicle speed in a preset vehicle speed range as a first noise data set;

when the vehicle window is in a fully open state, acquiring tire noise data, wind noise data and engine noise data at each vehicle speed in the preset vehicle speed range as a second noise data set;

and integrating the first noise data set and the second noise data set into the non-stationary noise data set.

Specifically, the non-stationary noise may include tire noise, wind noise and engine noise. It changes with the vehicle speed and with the open or closed state of the windows: for example, when the window is half open, the tire noise, wind noise and engine noise entering the vehicle are weaker, and when the window is fully open they are stronger. The intensity of the non-stationary noise entering the vehicle also differs with vehicle speed. Therefore, the non-stationary noise data set may be acquired by recording the in-vehicle noise in each window state while driving. When the window is half open, tire noise data, wind noise data and engine noise data are acquired at each vehicle speed (for example 60 km/h, 70 km/h, 80 km/h, 90 km/h, 100 km/h, 110 km/h and 120 km/h) within a preset vehicle speed range (60 km/h to 120 km/h) as the first noise data set. When the window is fully open, the same three kinds of noise data are acquired at the same vehicle speeds within the same range as the second noise data set. The first and second noise data sets are integrated into one data set to obtain the non-stationary noise data set. Because the intensity of the non-stationary noise differs across window states and vehicle speeds, collecting it under each combination gives the initial model more non-stationary noise samples to learn from, so that the trained non-stationary noise recognition model can identify a wider variety of non-stationary noise and its processing capability is improved.
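The recording conditions described above can be enumerated with a short Python sketch. The labels such as `half_open` and the tuple layout are hypothetical; only the window states, noise types and example speeds come from the description.

```python
from itertools import product

SPEEDS_KMH = range(60, 121, 10)           # 60, 70, ..., 120 km/h (example speeds)
NOISE_TYPES = ("tire", "wind", "engine")  # the three noise types named in the text

def recording_conditions(window_state):
    """List every (window state, speed, noise type) combination to record."""
    return [(window_state, speed, noise)
            for speed, noise in product(SPEEDS_KMH, NOISE_TYPES)]

first_noise_set = recording_conditions("half_open")    # first noise data set
second_noise_set = recording_conditions("fully_open")  # second noise data set
noise_data_set = first_noise_set + second_noise_set    # integrated data set
```

With seven example speeds and three noise types, each window state contributes 21 recording conditions, for 42 in the integrated set.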

In some embodiments, before training the initial model from the noise-free speech data set, the non-stationary noise data set, and the mixed speech data set, further comprising:

Specifically, before the initial model is trained, the mixed voice data set and the unsteady noise data set can be intercepted into segments, so that short voice segments are used when the initial model is trained with these data sets. This reduces the amount of data processed in a single training pass, and the learning capacity of the initial model is strengthened through multiple training passes. The same interception method is applied to the data in each data set: the preset intercepting frame length is 2 s, and the preset frame shift is 1 s (namely, the overlapped part between two adjacent voice segments). For example, the first voice segment intercepted from the mixed voice data is 0 s to 2 s of the mixed voice data, the second voice segment is 1 s to 3 s, and the third voice segment is 2 s to 4 s. The mixed voice data set and the unsteady noise data set are both intercepted in this manner according to the preset intercepting frame length and the preset frame shift.
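The interception with a 2 s frame length and a 1 s frame shift can be sketched as follows. This is an illustrative sketch rather than the implementation of the application; the function name `cut_segments` and the choice to drop a trailing remainder shorter than one frame are assumptions.

```python
def cut_segments(samples, sample_rate, frame_len_s=2.0, frame_shift_s=1.0):
    """Cut a sample sequence into overlapping segments.

    A trailing remainder shorter than one frame is dropped (an assumption;
    the description does not say how such a remainder is handled).
    """
    frame_len = int(round(frame_len_s * sample_rate))
    frame_shift = int(round(frame_shift_s * sample_rate))
    if frame_len <= 0 or frame_shift <= 0:
        raise ValueError("frame length and frame shift must be positive")
    last_start = len(samples) - frame_len
    return [
        samples[start:start + frame_len]
        for start in range(0, last_start + 1, frame_shift)
    ]
```

With 4 s of audio this yields three segments covering 0 s to 2 s, 1 s to 3 s and 2 s to 4 s, which matches the example in the description.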

In some embodiments, performing iterative training on the initial model according to the non-stationary noise data set and the mixed voice data set to obtain the non-stationary noise recognition model, including:

Specifically, the vehicle controller sequentially inputs the multiple mixed voice segments in the mixed voice data set into the initial model, and the initial model outputs the estimated noise of each mixed voice segment. A loss function of the initial model is constructed according to the estimated noise and the unsteady noise corresponding to the mixed voice; the loss function may be, for example, a cross entropy loss function or a mean square error loss function. The estimated noise of each mixed voice segment is compared with the unsteady noise corresponding to that segment, and the parameters of the model are adjusted continuously according to the comparison result until the loss function is minimized, so as to obtain the trained unsteady noise model. Because the parameters are adjusted continuously on the basis of this comparison, the initial model keeps converging during training, and the trained unsteady noise model recognizes unsteady noise more accurately.
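The iterative adjustment described above can be illustrated with a deliberately small example. The description does not specify the structure of the initial model, so the sketch below assumes a toy model with a single scalar gain `w` (estimated noise = `w` times the mixed signal) and a mean square error loss minimized by gradient descent; `mse`, `train_gain`, `lr` and `epochs` are hypothetical names.

```python
def mse(estimate, target):
    """Mean square error between the estimated noise and the reference noise."""
    return sum((e - t) ** 2 for e, t in zip(estimate, target)) / len(target)


def train_gain(mixed_segments, noise_segments, lr=0.1, epochs=200):
    """Fit a scalar gain w by gradient descent so that w * mixed approximates noise.

    The one-parameter model is a stand-in for the unspecified initial model.
    """
    w = 0.0
    for _ in range(epochs):
        for mixed, noise in zip(mixed_segments, noise_segments):
            # Gradient of the MSE loss with respect to w for this segment.
            grad = sum(2.0 * (w * m - t) * m for m, t in zip(mixed, noise)) / len(mixed)
            w -= lr * grad  # adjust the parameter to reduce the loss
    return w
```

A real model would have many parameters and would normally be trained with a machine learning framework, but the loop has the same shape: estimate, compare against the reference noise, and update.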

It should be noted that, the embodiments of the present application may be further described in the following manner:

In the automobile cabin, when a user controls vehicle functions by voice, the non-stationary noise in the vehicle makes voice recognition inaccurate, so the voice must be subjected to noise reduction. Common noise reduction methods cannot remove the non-stationary noise in the vehicle, so a model capable of recognizing non-stationary noise needs to be constructed in advance and used to reduce the noise in the voice. The non-stationary noise model needs to be trained with a large amount of non-stationary noise data, so that it can learn from more such data and recognize non-stationary noise more accurately. The training process of the non-stationary noise model is as follows:

1. During running of the vehicle, when the window is in the half-open state, the vehicle controller acquires tire noise data, wind noise data and engine noise data at respective vehicle speeds (for example, 60 km/h, 70 km/h, 80 km/h, 90 km/h, 100 km/h, 110 km/h and 120 km/h) within a preset vehicle speed range (60 km/h to 120 km/h) as a first noise data set. When the window is in the fully open state, tire noise data, wind noise data and engine noise data at the same vehicle speeds within the preset vehicle speed range are acquired as a second noise data set. The first noise data set and the second noise data set are integrated into one data set to obtain the unsteady noise data set.

2. Multiple pieces of voice data uttered by a person in the vehicle under noise-free conditions are collected as a noiseless voice data set.

3. The noiseless voice data set and the unsteady noise data set are mixed to obtain a mixed voice data set.

4. The mixed voice data set and the unsteady noise data set are each intercepted according to the preset intercepting frame length and the preset frame shift.

5. The multiple mixed voice segments in the mixed voice data set are sequentially input into the initial model, the initial model outputs the estimated noise of each mixed voice segment, a loss function of the initial model is constructed according to the estimated noise and the unsteady noise corresponding to the mixed voice, and the loss function is minimized to update the model parameters of the initial model, so as to obtain the unsteady noise recognition model.
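Steps 2 and 3 above, which produce the mixed voice data set, can be sketched as follows. The description does not state how the mixing is carried out; the sketch assumes simple sample-wise addition, with the noise repeated or truncated to the length of the voice. The names `align_noise` and `build_mixed_dataset` are hypothetical.

```python
def align_noise(noise, length):
    """Repeat or truncate a noise recording to the given number of samples."""
    if not noise:
        raise ValueError("noise recording must not be empty")
    return [noise[i % len(noise)] for i in range(length)]


def build_mixed_dataset(speech_set, noise_set):
    """Mix every noiseless utterance with every noise recording.

    Returns (mixed, noise) pairs, so that each mixed voice keeps the
    unsteady noise it contains as the training target.
    """
    dataset = []
    for speech in speech_set:
        for noise in noise_set:
            aligned = align_noise(noise, len(speech))
            mixed = [s + n for s, n in zip(speech, aligned)]  # additive mixing (assumed)
            dataset.append((mixed, aligned))
    return dataset
```

Keeping the aligned noise next to each mixed signal is what allows step 5 to compare the estimated noise with the unsteady noise corresponding to that mixed voice.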

The trained unsteady state noise reduction model is utilized to reduce noise of the voice to be recognized, and the specific noise reduction process is as follows:

1. The initial mixed voice uttered by the user while the vehicle is running is acquired; if the initial mixed voice is long, it is intercepted according to the preset intercepting frame length and the preset frame shift to obtain the voice to be recognized.

2. The number of frames of the voice to be recognized is calculated according to the frame length of each voice section in the voice to be recognized and the frame shift between voice sections.

3. The voice to be recognized is framed according to the number of frames.

4. Each frame of the framed voice to be recognized is input into the unsteady noise recognition model; the model recognizes each frame, removes all sounds other than the unsteady noise in that frame, and outputs the first unsteady noise in each frame. All the first unsteady noise is integrated to obtain the unsteady noise.

5. The spectrum of the voice to be recognized is determined from the voice to be recognized, and the unsteady noise spectrum is determined from the unsteady noise. A difference operation is performed on the spectrum of the voice to be recognized and the unsteady noise spectrum to obtain a voice spectrum with the unsteady noise removed, and an inverse Fourier transform is performed on that spectrum to obtain the voice with the unsteady noise removed.
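Steps 2, 3 and 5 of the noise reduction process can be sketched as follows. The description gives neither the frame-count formula nor the form of the difference operation, so two assumptions are made here: the number of frames is computed as 1 + floor((N - frame length) / frame shift), and the difference is taken on magnitude spectra while the phase of the voice to be recognized is kept (a common form of spectral subtraction). All function names are hypothetical, and the naive DFT is only suitable for short frames.

```python
import cmath
import math


def frame_count(n_samples, frame_len, frame_shift):
    """Number of whole frames that fit into n_samples (assumed formula)."""
    if n_samples < frame_len:
        return 0
    return 1 + (n_samples - frame_len) // frame_shift


def split_frames(samples, frame_len, frame_shift):
    """Frame the signal according to the computed number of frames."""
    n = frame_count(len(samples), frame_len, frame_shift)
    return [samples[i * frame_shift:i * frame_shift + frame_len] for i in range(n)]


def dft(x):
    """Naive discrete Fourier transform, O(N^2)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spec):
    """Naive inverse DFT; returns the real part of each sample."""
    n = len(spec)
    return [(sum(spec[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
            for t in range(n)]


def spectral_subtract(mixed_frame, noise_frame):
    """Subtract the noise magnitude spectrum from the mixed spectrum, then invert."""
    if len(mixed_frame) != len(noise_frame):
        raise ValueError("frames must have the same length")
    cleaned = []
    for m, v in zip(dft(mixed_frame), dft(noise_frame)):
        magnitude = max(abs(m) - abs(v), 0.0)  # clamp negative magnitudes to zero
        cleaned.append(cmath.rect(magnitude, cmath.phase(m)))  # keep the mixed phase
    return idft(cleaned)
```

In this sketch, `noise_frame` stands for the first unsteady noise that the recognition model would output for the corresponding frame in step 4. A production implementation would typically use an FFT routine instead of the naive transform.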

Specifically, the controller may be a whole vehicle controller, and is configured to receive and process a voice sent by a driver, so that other functional modules execute corresponding functions according to a processing result of the controller.

It should be noted that, the method of the embodiments of the present application may be performed by a single device, for example, a computer or a server. The method of the embodiment can also be applied to a distributed scene, and is completed by mutually matching a plurality of devices. In the case of such a distributed scenario, one of the devices may perform only one or more steps of the methods of embodiments of the present application, and the devices may interact with each other to complete the methods.

It should be noted that some embodiments of the present application are described above. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments described above and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

Based on the same inventive concept, the application also provides a voice noise reduction device corresponding to the method of any embodiment.

Referring to fig. 3, the voice noise reduction device includes:

an acquisition module 302 configured to acquire a voice to be recognized;

the voice recognition module 304 is configured to recognize the voice to be recognized by using an unsteady noise recognition model to obtain unsteady noise;

and the execution module 306 is configured to remove the unsteady state noise from the voice to be recognized, so as to obtain voice with the unsteady state noise removed.
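The division into three modules can be illustrated with a small structural sketch. The class name `VoiceNoiseReductionDevice`, its field names and the callable signatures are hypothetical; the description states only what each module is configured to do.

```python
from dataclasses import dataclass
from typing import Callable, List

Signal = List[float]


@dataclass
class VoiceNoiseReductionDevice:
    """Three pluggable modules mirroring modules 302, 304 and 306."""
    acquire: Callable[[], Signal]                      # acquisition module 302
    recognize_noise: Callable[[Signal], Signal]        # voice recognition module 304
    remove_noise: Callable[[Signal, Signal], Signal]   # execution module 306

    def run(self) -> Signal:
        speech = self.acquire()                  # voice to be recognized
        noise = self.recognize_noise(speech)     # unsteady noise from the model
        return self.remove_noise(speech, noise)  # voice with unsteady noise removed
```

Passing the modules in as callables reflects the statement that their functions may be implemented in the same or in different pieces of software and/or hardware.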

For convenience of description, the above device is described as being functionally divided into various modules. Of course, when the present application is implemented, the functions of the modules may be implemented in one or more pieces of software and/or hardware.

The device of the foregoing embodiment is configured to implement a voice noise reduction method according to any one of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which is not described herein.

Based on the same inventive concept, the application also provides an electronic device corresponding to the method of any embodiment, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes the voice noise reduction method of any embodiment when executing the program.

Fig. 4 shows a more specific hardware architecture of an electronic device according to this embodiment, where the device may include: a processor 1010, a memory 1020, an input/output interface 1030, a communication interface 1040, and a bus 1050. Wherein processor 1010, memory 1020, input/output interface 1030, and communication interface 1040 implement communication connections therebetween within the device via a bus 1050.

The processor 1010 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an application specific integrated circuit (ASIC), or one or more integrated circuits, etc., for executing relevant programs to implement the technical solutions provided in the embodiments of the present disclosure.

The memory 1020 may be implemented in the form of ROM (Read Only Memory), RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 1020 may store an operating system and other application programs, and when the embodiments of the present specification are implemented in software or firmware, the associated program code is stored in the memory 1020 and executed by the processor 1010.

The input/output interface 1030 is used to connect with an input/output module for inputting and outputting information. The input/output module may be configured as a component in a device (not shown) or may be external to the device to provide corresponding functionality. Wherein the input devices may include a keyboard, mouse, touch screen, microphone, various types of sensors, etc., and the output devices may include a display, speaker, vibrator, indicator lights, etc.

The communication interface 1040 is used to connect communication modules (not shown) to enable communication between the present device and other devices. The communication module may communicate in a wired manner (such as USB or a network cable) or in a wireless manner (such as a mobile network, Wi-Fi or Bluetooth).

Bus 1050 includes a path for transferring information between components of the device (e.g., processor 1010, memory 1020, input/output interface 1030, and communication interface 1040).

It should be noted that although the above-described device only shows processor 1010, memory 1020, input/output interface 1030, communication interface 1040, and bus 1050, in an implementation, the device may include other components necessary to achieve proper operation. Furthermore, it will be understood by those skilled in the art that the above-described apparatus may include only the components necessary to implement the embodiments of the present description, and not all the components shown in the drawings.

The electronic device of the foregoing embodiment is configured to implement a voice noise reduction method according to any one of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which is not described herein.

Based on the same inventive concept, corresponding to any of the above embodiments, the present application further provides a non-transitory computer readable storage medium storing computer instructions for causing the computer to perform a method of voice noise reduction as described in any of the above embodiments.

The computer readable media of the present embodiments include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.

The storage medium of the foregoing embodiments stores computer instructions for causing the computer to perform a voice noise reduction method as described in any one of the foregoing embodiments, and has the advantages of the corresponding method embodiments, which are not described herein.

It will be appreciated that before using the technical solutions of the various embodiments in the disclosure, the user may be informed of the type of personal information involved, the range of use, the use scenario, etc. in an appropriate manner, and obtain the authorization of the user.

For example, in response to receiving an active request from a user, prompt information is sent to the user to explicitly inform the user that the operation requested will require the user's personal information to be obtained and used. The user can then decide, according to the prompt information, whether to provide personal information to the software or hardware, such as the electronic device, application program, server or storage medium, that executes the operations of the technical solution.

As an alternative but non-limiting implementation, in response to receiving an active request from a user, the manner in which the prompt information is sent to the user may be, for example, a popup, in which the prompt information may be presented in a text manner. In addition, a selection control for the user to select to provide personal information to the electronic device in a 'consent' or 'disagreement' manner can be carried in the popup window.

It will be appreciated that the above-described notification and user authorization process is merely illustrative, and not limiting of the implementations of the present disclosure, and that other ways of satisfying relevant legal regulations may be applied to the implementations of the present disclosure.

Those of ordinary skill in the art will appreciate that: the discussion of any of the embodiments above is merely exemplary and is not intended to suggest that the scope of the application (including the claims) is limited to these examples; the technical features of the above embodiments or in the different embodiments may also be combined within the idea of the present application, the steps may be implemented in any order, and there are many other variations of the different aspects of the embodiments of the present application as described above, which are not provided in detail for the sake of brevity.

Additionally, well-known power/ground connections to Integrated Circuit (IC) chips and other components may or may not be shown within the provided figures, in order to simplify the illustration and discussion, and so as not to obscure the embodiments of the present application. Furthermore, the devices may be shown in block diagram form in order to avoid obscuring the embodiments of the present application, and this also takes into account the fact that specifics with respect to implementation of such block diagram devices are highly dependent upon the platform on which the embodiments of the present application are to be implemented (i.e., such specifics should be well within purview of one skilled in the art). Where specific details (e.g., circuits) are set forth in order to describe example embodiments of the application, it should be apparent to one skilled in the art that embodiments of the application can be practiced without, or with variation of, these specific details. Accordingly, the description is to be regarded as illustrative in nature and not as restrictive.

While the present application has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of those embodiments will be apparent to those skilled in the art in light of the foregoing description. For example, other memory architectures (e.g., dynamic RAM (DRAM)) may use the embodiments discussed.

The present embodiments are intended to embrace all such alternatives, modifications and variances which fall within the broad scope of the appended claims. Accordingly, any omissions, modifications, equivalents, improvements and/or the like which are within the spirit and principles of the embodiments are intended to be included within the scope of the present application.

Claims

1. A method of voice noise reduction, comprising:

2. The method of claim 1, wherein the recognizing the speech to be recognized using the trained non-stationary noise recognition model to obtain non-stationary noise comprises:

determining the number of frames of the voice to be recognized;

framing the voice to be recognized according to the frame number;

3. The method of claim 2, wherein the determining the number of frames of the speech to be recognized comprises:

4. The method according to claim 2, wherein the recognizing the speech to be recognized after framing by using the non-stationary noise recognition model to obtain the non-stationary noise includes:

5. The method of claim 1, wherein the training method of the non-stationary noise recognition model comprises:

constructing an initial model;

6. The method of claim 5, wherein acquiring an unsteady state noise data set comprises:

7. The method of claim 5, further comprising, prior to training the initial model from the non-stationary noise dataset and the mixed speech dataset:

8. The method of claim 5, wherein iteratively training the initial model based on the non-stationary noise data set and the mixed speech data set to obtain the non-stationary noise recognition model comprises:

9. A speech noise reduction device, comprising:

10. A vehicle comprising a controller to perform a speech recognition method as claimed in claims 1-8.