CN115641861A - Vehicle-mounted voice enhancement method and device, storage medium and equipment

Info

Publication number
CN115641861A
Authority
CN
China
Prior art keywords: vehicle, information, target, voice information, target voice
Prior art date
Legal status
Pending
Application number
CN202211254927.9A
Other languages
Chinese (zh)
Inventor
黄远芳
胡郁
Current Assignee
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Application filed by iFlytek Co Ltd
Priority to CN202211254927.9A
Publication of CN115641861A


Abstract

The application discloses a vehicle-mounted voice enhancement method, apparatus, storage medium and device. The method first acquires vehicle-mounted auxiliary information of a target vehicle together with the target voice information of the vehicle-mounted users in each sound zone of the target vehicle, then enhances the target voice information using the vehicle-mounted auxiliary information to obtain enhanced target voice information, and finally performs preset operation processing on the vehicle-mounted user and/or the target vehicle according to the enhanced target voice information to obtain a processing result. Because the voice of the vehicle-mounted user in each sound zone is enhanced with the help of the vehicle-mounted auxiliary information before subsequent preset operations such as device wake-up and user localization and recognition are carried out on the enhanced voice, the wake-up, localization and recognition effects can be improved, improving the voice interaction experience of users while the target vehicle is being driven.

Description

Vehicle-mounted voice enhancement method and device, storage medium and equipment
Technical Field
The present application relates to the field of speech processing technologies, and in particular, to a method, an apparatus, a storage medium, and a device for vehicle-mounted speech enhancement.
Background
With the improvement of living standards and the rapid development of the social economy, the utilization rate of automobiles has gradually increased; more and more automobiles have entered people's lives and bring great convenience to all aspects of daily life. Meanwhile, voice interaction systems have also become widespread in smart cars.
At present, the sound zones of a vehicle are generally divided according to seat position; for example, a four-seat vehicle is divided into four sound zones corresponding to the main driving seat, the auxiliary driving seat, the main rear seat and the auxiliary rear seat. The speaker audio of each sound zone is obtained through directional beams or a voice separation model and sent to the back end for wake-up, the wake-up results are compared to obtain the position of the waking speaker, and the subsequent recognition of the target speaker is then completed. However, in low signal-to-noise-ratio scenes, such as high-speed driving with an open window or interference among multiple speakers, the wake-up rate drops, and positioning errors can occur even after a successful wake-up, so the voice interaction experience of the vehicle-mounted user is poor.
Disclosure of Invention
The main purpose of the embodiments of the present application is to provide a vehicle-mounted voice enhancement method, apparatus, storage medium and device that can enhance the speaker's voice according to vehicle-mounted auxiliary information, thereby improving the wake-up, positioning and recognition effects and the voice interaction experience of users in the driving state.
The embodiment of the application provides a vehicle-mounted voice enhancement method, which comprises the following steps:
acquiring vehicle-mounted auxiliary information of a target vehicle and acquiring target voice information of vehicle-mounted users in each sound zone on the target vehicle;
enhancing the target voice information by using the vehicle-mounted auxiliary information to obtain enhanced target voice information;
and carrying out preset operation processing on the vehicle-mounted user and/or the target vehicle according to the enhanced target voice information to obtain a processing result.
In one possible implementation, the in-vehicle assistance information of the target vehicle includes seat information of the target vehicle; the enhancing the target voice information by using the vehicle-mounted auxiliary information to obtain the enhanced target voice information includes:
performing Fourier transform on the target voice information to obtain converted target voice information;
constructing a combined vector by using the seat information of the target vehicle and the converted target voice information, inputting the combined vector to a pre-constructed voice enhancement model, and predicting to obtain the weight of the target voice information of each sound zone on the target vehicle;
and multiplying the weight and the corresponding target voice information respectively, and performing inverse Fourier transform on the obtained calculation result to obtain the enhanced target voice information.
In one possible implementation, the constructing a combined vector using the seat information of the target vehicle and the converted target speech information includes:
splicing the vector corresponding to the seat information of the target vehicle and the vector corresponding to the converted target voice information to obtain a spliced vector serving as a combined vector; or, constructing a combined vector by using the seat information of the target vehicle and the converted target voice information in a gating mode.
In one possible implementation manner, the vehicle-mounted auxiliary information of the target vehicle includes vehicle speed information and window state information of the target vehicle; the enhancing the target voice information by using the vehicle-mounted auxiliary information to obtain the enhanced target voice information includes:
carrying out Fourier transform on the target voice information to obtain converted target voice information;
constructing a combined vector by using the speed information and window state information of the target vehicle and the converted target voice information, inputting the combined vector to a pre-constructed voice enhancement model, and predicting to obtain the weight of the target voice information of each sound zone on the target vehicle;
and respectively multiplying the weight and the corresponding target voice information, and performing inverse Fourier transform on the obtained calculation result to obtain the enhanced target voice information.
In a possible implementation manner, the constructing a combination vector by using the vehicle speed information and the window state information of the target vehicle and the converted target voice information includes:
splicing the vector corresponding to the speed information of the target vehicle, the vector corresponding to the window state information and the vector corresponding to the converted target voice information to obtain a spliced vector serving as a combined vector; or, constructing a combined vector by using the vehicle speed information and the vehicle window state information of the target vehicle and the converted target voice information in a gating mode.
In one possible implementation, the speech enhancement model includes at least one of a convolutional neural network (CNN), a recurrent neural network (RNN), or a real-valued or complex-valued network.
In a possible implementation manner, the inputting the combination vector into a pre-constructed speech enhancement model, and predicting a weight of the target speech information of each sound zone on the target vehicle includes:
and inputting the combined vector to a pre-constructed voice enhancement model, and calculating the ratio of the frequency domain signal of the target voice information acquired by each sound zone to the frequency domain signal of the noisy audio acquired by a preset sound zone to be used as the weight of the target voice information of each sound zone.
In a possible implementation manner, the performing, according to the enhanced target voice information, preset operation processing on the vehicle-mounted user and/or the target vehicle to obtain a processing result includes:
and awakening the preset device of the target vehicle according to the enhanced target voice information, and positioning and identifying the vehicle-mounted user sending the awakening voice to obtain a processing result.
The embodiment of the present application further provides a vehicle-mounted speech enhancement apparatus, including:
the system comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring vehicle-mounted auxiliary information of a target vehicle and acquiring target voice information of vehicle-mounted users in each sound zone on the target vehicle;
the enhancement unit is used for utilizing the vehicle-mounted auxiliary information to carry out enhancement processing on the target voice information to obtain enhanced target voice information;
and the processing unit is used for carrying out preset operation processing on the vehicle-mounted user and/or the target vehicle according to the enhanced target voice information to obtain a processing result.
In one possible implementation, the on-board assistance information of the target vehicle includes seat information of the target vehicle; the enhancement unit includes:
the first transformation subunit is used for carrying out Fourier transformation on the target voice information to obtain the converted target voice information;
the first prediction subunit is used for constructing a combined vector by utilizing the seat information of the target vehicle and the converted target voice information, inputting the combined vector into a pre-constructed voice enhancement model, and predicting to obtain the weight of the target voice information of each sound zone on the target vehicle;
and the first calculating subunit is used for respectively multiplying the weights and the corresponding target voice information, and performing inverse Fourier transform on the obtained calculation result to obtain the enhanced target voice information.
In a possible implementation manner, the first prediction subunit is specifically configured to:
splicing the vector corresponding to the seat information of the target vehicle and the vector corresponding to the converted target voice information to obtain a spliced vector serving as a combined vector; or, constructing a combined vector by using the seat information of the target vehicle and the converted target voice information in a gating mode.
In one possible implementation manner, the vehicle-mounted auxiliary information of the target vehicle comprises vehicle speed information and window state information of the target vehicle; the enhancement unit includes:
the second transformation subunit is used for carrying out Fourier transformation on the target voice information to obtain the converted target voice information;
the second prediction subunit is used for constructing a combined vector by utilizing the speed information and window state information of the target vehicle and the converted target voice information, inputting the combined vector to a pre-constructed voice enhancement model, and predicting to obtain the weight of the target voice information of each sound zone on the target vehicle;
and the second calculating subunit is used for respectively multiplying the weights and the corresponding target voice information, and performing inverse Fourier transform on the obtained calculation result to obtain the enhanced target voice information.
In a possible implementation manner, the second prediction subunit is specifically configured to:
splicing the vector corresponding to the speed information of the target vehicle, the vector corresponding to the window state information and the vector corresponding to the converted target voice information to obtain a spliced vector serving as a combined vector; or, constructing a combined vector by using the speed information and window state information of the target vehicle and the converted target voice information in a gating mode.
In one possible implementation, the speech enhancement model includes at least one of a convolutional neural network (CNN), a recurrent neural network (RNN), or a real-valued or complex-valued network.
In a possible implementation manner, the first prediction subunit or the second prediction subunit is specifically configured to:
and inputting the combined vector to a pre-constructed voice enhancement model, and calculating the ratio of the frequency domain signal of the target voice information acquired by each sound zone to the frequency domain signal of the noisy audio acquired by a preset sound zone to be used as the weight of the target voice information of each sound zone.
In a possible implementation manner, the processing unit is specifically configured to:
and awakening the preset device of the target vehicle according to the enhanced target voice information, and positioning and identifying the vehicle-mounted user sending the awakening voice to obtain a processing result.
The embodiment of the present application further provides a vehicle-mounted speech enhancement device, including: a processor, a memory, a system bus;
the processor and the memory are connected through the system bus;
the memory is configured to store one or more programs, the one or more programs including instructions, which when executed by the processor, cause the processor to perform any one implementation of the above-described in-vehicle speech enhancement method.
An embodiment of the present application further provides a computer-readable storage medium, where instructions are stored in the computer-readable storage medium, and when the instructions are run on a terminal device, the terminal device is enabled to execute any implementation manner of the above vehicle-mounted speech enhancement method.
The embodiment of the present application further provides a computer program product, and when the computer program product runs on a terminal device, the terminal device executes any implementation manner of the above vehicle-mounted voice enhancement method.
According to the vehicle-mounted voice enhancement method, apparatus, storage medium and device described above, vehicle-mounted auxiliary information of a target vehicle is first acquired together with the target voice information of the vehicle-mounted users in each sound zone on the target vehicle; the target voice information is then enhanced using the vehicle-mounted auxiliary information to obtain enhanced target voice information; and preset operation processing is then performed on the vehicle-mounted user and/or the target vehicle according to the enhanced target voice information to obtain a processing result. In this way, the voice of the vehicle-mounted user in each sound zone is enhanced according to the vehicle-mounted auxiliary information, and the enhanced voice is then used for subsequent preset operations such as device wake-up and user localization and recognition, so the wake-up, localization and recognition effects can be improved, improving the voice interaction experience of users while the target vehicle is being driven.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic view of a four-seat vehicle including a main driving seat, an auxiliary driving seat, a main rear seat and an auxiliary rear seat divided into four sound zones according to an embodiment of the present application;
fig. 2 is a schematic flowchart of a vehicle-mounted speech enhancement method according to an embodiment of the present application;
fig. 3 is a schematic composition diagram of a vehicle-mounted speech enhancement device according to an embodiment of the present application.
Detailed Description
With the continuous development of science and technology, voice interaction systems have become widespread in smart cars. To achieve better information interaction between the multiple speakers in a vehicle and the voice device, the sound zones of the vehicle are generally divided according to seat position; for example, a four-seat vehicle is divided into four sound zones corresponding to the main driving seat, the auxiliary driving seat, the main rear seat and the auxiliary rear seat, as shown in fig. 1. The speaker audio of each sound zone is obtained through directional beams or a voice separation model for back-end wake-up and the like, and the wake-up results are then compared to obtain the position of the waking speaker in the vehicle, completing the subsequent recognition of the target speaker. However, when the vehicle is in a low signal-to-noise-ratio scene, such as high-speed driving or driving with an open window, or when multiple speakers in the vehicle interfere with each other, the wake-up rate drops, and even after a successful wake-up the waking speaker may be mislocated, so the voice interaction experience of the vehicle-mounted user is poor.
In order to overcome these drawbacks, the present application provides a vehicle-mounted voice enhancement method: first, vehicle-mounted auxiliary information of a target vehicle is acquired together with the target voice information of the vehicle-mounted users in each sound zone on the target vehicle; the target voice information is then enhanced using the vehicle-mounted auxiliary information to obtain enhanced target voice information; and preset operation processing is then performed on the vehicle-mounted user and/or the target vehicle according to the enhanced target voice information to obtain a processing result. In this way, the voice of the vehicle-mounted user in each sound zone is enhanced according to the vehicle-mounted auxiliary information, and the enhanced voice is then used for subsequent preset operations such as device wake-up and user localization and recognition, so the wake-up, localization and recognition effects can be improved, improving the voice interaction experience of users while the target vehicle is being driven.
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
First embodiment
Referring to fig. 2, a schematic flow chart of a vehicle-mounted speech enhancement method provided in this embodiment is shown, where the method includes the following steps:
s201: the method comprises the steps of obtaining vehicle-mounted auxiliary information of a target vehicle and obtaining target voice information of vehicle-mounted users in each sound zone on the target vehicle.
In this embodiment, any vehicle requiring vehicle-mounted voice enhancement is defined as the target vehicle, and the voice information uttered by the vehicle-mounted users in each sound zone on the target vehicle is defined as the target voice information to be enhanced. It should be noted that this embodiment does not limit the language of the target voice information; for example, the target voice may be composed of Chinese or of English. Likewise, this embodiment does not limit the length of the target voice; for example, it may be a sentence or a phrase.
In practical applications, compared with interaction scenes such as the home, the vehicle-mounted voice interaction scene has the characteristics that the user positions are fixed (e.g., the main driving seat, the auxiliary driving seat, etc.) and the number of users is related to the vehicle type (e.g., four-seat or seven-seat vehicles). During driving, the noise mainly stems from wind noise, tire noise and the like generated in the driving state; different vehicle speeds produce different noise levels, and the noise level is also related to the window opening/closing state. For example, with the windows closed, the vehicle achieves a certain degree of passive noise reduction.
Therefore, when the target voice information of the vehicle-mounted user is enhanced, the vehicle-mounted auxiliary information of the target vehicle is fully considered; this auxiliary information includes, but is not limited to, the seat information, the vehicle speed information and the window state information of the target vehicle. After the vehicle-mounted auxiliary information of the target vehicle is acquired, it can be converted into a corresponding characterization vector using an existing or future vector conversion method. The specific format of the characterization vector can be set according to the actual situation and is not limited by this embodiment; for example, it may be a 4-dimensional vector. The characterization vector is used to execute the subsequent step S202.
For example, taking a target vehicle including four sound zones as shown in fig. 1, the vehicle-mounted seats include a main driver seat, an auxiliary driver seat, a main rear seat and an auxiliary rear seat, and a microphone may be mounted in advance at a roof position corresponding to each seat for collecting voice information uttered by the vehicle-mounted users in each sound zone as the target voice information of the vehicle-mounted users in the sound zone.
It should be further noted that, the present application does not limit the specific content included in the vehicle-mounted auxiliary information of the target vehicle and the acquiring manner thereof, and a corresponding acquiring manner may be selected according to the content of the vehicle-mounted auxiliary information, for example, when the vehicle-mounted auxiliary information of the target vehicle includes seat information of the target vehicle, the seat information of the target vehicle may be acquired by monitoring a wearing state of a seat belt, detecting a seat state by using a pressure sensor, acquiring human activity information of a corresponding position by using a human infrared detection sensor or a vehicle-mounted camera, and the like.
S202: and enhancing the target voice information by utilizing the vehicle-mounted auxiliary information to obtain the enhanced target voice information.
In this embodiment, after the vehicle-mounted auxiliary information of the target vehicle is acquired in step S201 and the target voice information of the vehicle-mounted user in each sound zone on the target vehicle is acquired, in order to improve the voice interaction experience of the vehicle-mounted user, the vehicle-mounted auxiliary information may be further utilized to perform enhancement processing on the target voice information, so as to obtain enhanced target voice information, which is used to perform subsequent step S203.
Specifically, in an alternative implementation, the vehicle-mounted auxiliary information of the target vehicle may include the seat information of the target vehicle, and the implementation of step S202 may include: first, performing a Fourier transform on the target voice information to obtain the converted target voice information; then constructing a combined vector from the seat information of the target vehicle and the converted target voice information, inputting the combined vector into a pre-constructed voice enhancement model, and predicting the weights of the target voice information of each sound zone on the target vehicle; and then multiplying each weight with the target voice information of the corresponding sound zone and performing an inverse Fourier transform on the results to obtain the enhanced target voice information, which is used to execute the subsequent step S203.
In this implementation, the manner of acquiring the seat information of the target vehicle is not limited. After the seat information is acquired, it can be converted into a characterization vector whose dimension matches the sound zone division of the target vehicle; for example, the seat information of a four-seat target vehicle can be converted into a 4-dimensional characterization vector, and that of a seven-seat target vehicle into a 7-dimensional characterization vector. The value of each dimension correspondingly characterizes the seat information of the respective sound zone (i.e., whether a vehicle-mounted user is present).
For example: taking the target vehicle with the four sound zones shown in fig. 1 as an example, suppose that, by monitoring the seat-belt wearing states, detecting the seat states with pressure sensors, or acquiring human activity information at the corresponding positions with human infrared detection sensors or the vehicle-mounted camera, the vehicle-mounted users of the target vehicle are determined to be in the main driving seat and the auxiliary driving seat; the seat information of the target vehicle can then be converted into the corresponding 4-dimensional characterization vector: Information_position = [1, 1, 0, 0]. Similarly, when the vehicle-mounted users are determined to be in the main driving seat and the main rear seat, the seat information of the target vehicle can be converted into: Information_position = [1, 0, 1, 0].
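For a concrete picture of this encoding, a minimal Python sketch follows; the zone names and the helper function are illustrative assumptions, not details taken from the patent:

```python
# Hypothetical helper: encode which sound zones are occupied as the
# 4-dimensional characterization vector Information_position above.
import numpy as np

ZONES = ["main_driver", "aux_driver", "main_rear", "aux_rear"]

def seat_vector(occupied):
    """1.0 marks a sound zone with a detected vehicle-mounted user."""
    return np.array([1.0 if z in occupied else 0.0 for z in ZONES])

print(seat_vector({"main_driver", "aux_driver"}))  # [1. 1. 0. 0.]
print(seat_vector({"main_driver", "main_rear"}))   # [1. 0. 1. 0.]
```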
On this basis, in order to improve the voice interaction experience of the vehicle-mounted user, after the target voice information of the vehicle-mounted users in each sound zone on the target vehicle is acquired, a Fourier transform can be performed on the target voice information to obtain the converted frequency-domain target voice information.
By way of example: still taking the target vehicle with the four sound zones shown in fig. 1 as an example, assume that one microphone is mounted at the roof position corresponding to each of the four sound zones (the main driving seat, the auxiliary driving seat, the main rear seat and the auxiliary rear seat), defined as the 1st, 2nd, 3rd and 4th microphones respectively; the collected noisy target voice information of the vehicle-mounted user in each sound zone can then be defined as:
yi = si1 + si2 + si3 + si4 + ni (1)
the yi represents target voice information of the vehicle-mounted user, which is acquired by an ith microphone of an ith sound zone; si1, si2, si3 and si4 respectively represent signals of reaching the ith microphone by sounds emitted by four vehicle-mounted users positioned in four sound ranges of a main driving seat, an auxiliary driving seat, a main rear seat and an auxiliary rear seat; ni represents the signal where the noise reaches the ith microphone.
For example, y1 = s11 + s12 + s13 + s14 + n1 represents the target voice information of the vehicle-mounted user collected by the microphone of the sound zone of the main driving seat; by analogy, y2 = s21 + s22 + s23 + s24 + n2, y3 = s31 + s32 + s33 + s34 + n3 and y4 = s41 + s42 + s43 + s44 + n4 represent the target voice information collected by the microphones of the sound zones of the auxiliary driving seat, the main rear seat and the auxiliary rear seat, respectively.
Then, the voice information expressed by the above formula is subjected to a Fourier transform to obtain the converted frequency-domain target voice information, as shown in the following formula (2):
Yi = Si1 + Si2 + Si3 + Si4 + Ni (2)
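To make this transform step concrete, the short sketch below computes the per-zone spectra Y1..Y4 with SciPy; the sampling rate and frame length are assumptions, since the patent does not specify them:

```python
# Illustrative sketch: Fourier-transform the noisy microphone signals
# y1..y4 of formula (1) into the frequency-domain Y1..Y4 of formula (2).
from scipy.signal import stft

FS = 16000       # assumed sampling rate
NPERSEG = 512    # assumed frame length

def to_frequency_domain(mic_signals):
    """mic_signals: four 1-D time-domain arrays -> four STFT matrices."""
    return [stft(y, fs=FS, nperseg=NPERSEG)[2] for y in mic_signals]
```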
further, the 4-dimensional feature vector corresponding to the seat information of the target vehicle and the converted target speech information (i.e., yi) in the frequency domain may be vector-fused by using an existing or future vector combination manner to construct a combined vector, which is defined as feature. Specifically, a vector corresponding to the seat information of the target vehicle and a vector corresponding to the converted target speech information may be spliced, and the obtained spliced vector as a combined vector is: feature = [ Y1, Y2, Y3, Y4, information _ position ]; or, the model can be guided to better perform voice separation by utilizing the seat auxiliary information while utilizing the amplitude difference and the phase difference between microphones by acting on an input layer or a middle layer of the voice enhancement model in a gating mode, so that the vector combination of the seat information of the target vehicle and the converted target voice information is realized.
The obtained combined vector (feature) is then input into the pre-constructed voice enhancement model, and the ratio of the frequency-domain signal of the target voice information collected in each sound zone to the frequency-domain signal of the noisy audio collected in a preset sound zone is calculated as the weight of the target voice information of each sound zone. The weights of the target voice information of the four sound zones on the target vehicle are thus predicted; they can be defined as masks and expressed as [F1, F2, F3, F4], where F1, F2, F3 and F4 respectively represent the weights of the target voice information of the four sound zones (i.e., the sound zones of the main driving seat, the auxiliary driving seat, the main rear seat and the auxiliary rear seat).
The preset sound zone can be set according to actual conditions (such as the arrangement position of a microphone on an actual target vehicle), which is not limited in the embodiment of the present application, that is, the preset sound zone can be any one of the sound zones where the main driving seat, the auxiliary driving seat, the main rear seat and the auxiliary rear seat are respectively located.
Specifically, when there is only one user, taking a user in the main driving seat as an example and setting the preset sound zone to the sound zone of the main driving seat, then s12 = s13 = s14 = 0, so F1 = 1 and F2 = F3 = F4 = 0; that is, the combined vector (feature) input to the model is [Y1, Y2, Y3, Y4, 1, 0, 0, 0], and the model output vector can be approximated as [1, 0, 0, 0]. When there are two users, taking users in the main driving seat and the main rear seat as an example and still setting the preset sound zone to the sound zone of the main driving seat, then s12 = s14 = 0 and F2 = F4 = 0; that is, the combined vector (feature) input to the model is [Y1, Y2, Y3, Y4, 1, 0, 1, 0], and the model output vector may be [F1, 0, F3, 0].
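On labeled training data, where the clean per-zone signals are available, this ratio-style weight can be written down directly; in the sketch below, the small epsilon and the clipping are assumptions added for numerical stability:

```python
# Weight of each sound zone: |S_i| over the noisy spectrum |Y_ref| of
# the preset sound zone; this is the target the model learns to predict.
import numpy as np

def ratio_weights(clean_specs, Y_ref, eps=1e-8):
    """clean_specs: clean spectra S1..S4; Y_ref: preset-zone noisy STFT."""
    return [np.clip(np.abs(S) / (np.abs(Y_ref) + eps), 0.0, 1.0)
            for S in clean_specs]
```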
Further, the weights [F1, F2, F3, F4] of the target voice information of the four sound zones on the target vehicle are multiplied with the corresponding converted frequency-domain target voice information [Y1, Y2, Y3, Y4], and the obtained results are inverse-Fourier-transformed to obtain the enhanced target voice information, defined as y1', y2', y3' and y4', respectively.
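Applying the predicted weights is then a per-bin multiplication followed by the inverse transform; the sketch below mirrors the STFT parameters assumed earlier:

```python
# Multiply each mask F_i with the corresponding spectrum Y_i and invert
# to obtain the enhanced time-domain signals y1'..y4'.
from scipy.signal import istft

FS = 16000       # assumed, matching the STFT sketch above
NPERSEG = 512

def apply_masks(zone_specs, masks):
    """zone_specs: STFTs Y1..Y4; masks: predicted weights F1..F4."""
    return [istft(F * Y, fs=FS, nperseg=NPERSEG)[1]
            for F, Y in zip(masks, zone_specs)]
```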
In another alternative implementation, the vehicle-mounted auxiliary information of the target vehicle may further include the vehicle speed information and the window state information of the target vehicle, and the implementation of step S202 may include: first, performing a Fourier transform on the target voice information to obtain the converted target voice information; then constructing a combined vector from the vehicle speed information and window state information of the target vehicle and the converted target voice information, inputting the combined vector into a pre-constructed voice enhancement model, and predicting the weights of the target voice information of each sound zone on the target vehicle; and then multiplying each weight with the target voice information of the corresponding sound zone and performing an inverse Fourier transform on the results to obtain the enhanced target voice information, which is used to execute the subsequent step S203.
In this implementation, the manner of acquiring the vehicle speed information and window state information of the target vehicle is not limited; after they are acquired, each can be converted into a characterization vector of the corresponding dimension. For example, the vehicle speed information of the target vehicle can be converted into a 4-dimensional characterization vector whose dimensions correspond to the static state (0 km/h), the low-speed state (0-40 km/h), the medium-speed state (40-80 km/h) and the high-speed state (above 80 km/h); the window state information can likewise be converted into a 4-dimensional characterization vector whose dimensions correspond to the window opening states of the respective sound zones (i.e., whether the window corresponding to each sound zone is open).
By way of example: still taking the target vehicle with the four sound zones shown in fig. 1 as an example, when the vehicle speed information is acquired from the vehicle-mounted speedometer and the target vehicle is determined to be stationary, the vehicle speed information of the target vehicle can be converted into the 4-dimensional characterization vector: Information_speed = [1, 0, 0, 0]. Similarly, when the vehicle speed information is acquired from the vehicle-mounted speedometer and the target vehicle is determined to be in a high-speed state, the vehicle speed information can be converted into: Information_speed = [0, 0, 0, 1].
When the window state information is obtained by detecting the window opening states and it is determined that only the main driving window of the target vehicle is open, the window state information of the target vehicle can be converted into the corresponding 4-dimensional characterization vector: Information_window = [1, 0, 0, 0]. Similarly, when it is determined that the windows of both the main driving seat and the auxiliary driving seat are open, the window state information can be converted into: Information_window = [1, 1, 0, 0].
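These two encodings can be sketched in the same style as the seat vector; the band thresholds follow the text above, while the function names are illustrative:

```python
# Hypothetical encoders for Information_speed and Information_window.
import numpy as np

def speed_vector(speed_kmh):
    """One-hot over the static / low / medium / high speed bands."""
    bands = [speed_kmh == 0, 0 < speed_kmh <= 40,
             40 < speed_kmh <= 80, speed_kmh > 80]
    return np.array(bands, dtype=float)     # 100 km/h -> [0. 0. 0. 1.]

def window_vector(open_windows):
    """1.0 marks a sound zone whose window is open."""
    zones = ["main_driver", "aux_driver", "main_rear", "aux_rear"]
    return np.array([1.0 if z in open_windows else 0.0 for z in zones])
```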
On this basis, in order to improve the voice interaction experience of the vehicle-mounted user, after the target voice information of the vehicle-mounted users in each sound zone on the target vehicle is acquired, the target voice information can still be Fourier-transformed using the above formulas (1) and (2) to obtain the converted frequency-domain target voice information, which is not repeated here. The following description again takes the target vehicle with the four sound zones shown in fig. 1 as an example.
Further, the 4-dimensional characterization vectors corresponding to the vehicle speed information and the window state information of the target vehicle and the converted frequency-domain target voice information (i.e., Yi) can be vector-fused using an existing or future vector combination method to construct a combined vector, defined as feature. Specifically, the vector corresponding to the vehicle speed information of the target vehicle, the vector corresponding to the window state information and the vector corresponding to the converted target voice information can be spliced, and the resulting spliced vector is used as the combined vector: feature = [Y1, Y2, Y3, Y4, Information_speed, Information_window]. Alternatively, the vehicle speed and window state information can act on the input layer or an intermediate layer of the voice enhancement model in a gating mode, so that the model exploits the amplitude and phase differences between the microphones while being guided by this auxiliary information to better separate the voices; this likewise realizes the vector combination of the vehicle speed and window state information of the target vehicle with the converted target voice information.
The obtained combined vector (feature) is then input into the pre-constructed voice enhancement model, and the ratio between the frequency-domain signal of the target voice information collected in each sound zone and the frequency-domain signal of the noisy audio collected in a preset sound zone is calculated as the weight of the target voice information of each sound zone. The weights of the target voice information of the four sound zones on the target vehicle are thus predicted; they can be defined as masks and expressed as [F1, F2, F3, F4], where F1, F2, F3 and F4 respectively represent the weights of the target voice information of the four sound zones (i.e., the sound zones of the main driving seat, the auxiliary driving seat, the main rear seat and the auxiliary rear seat).
It should be noted that the preset sound zone can also be set according to the actual situation (e.g., the layout of the microphones on the actual target vehicle), which is not limited in this embodiment of the application; however, a sound zone whose window is not open is usually selected as the preset sound zone, i.e., the sound zone with the best signal-to-noise ratio, so as to reduce the influence of noise on the voice enhancement result.
For example, when only the main driving window is open, according to the actual microphone layout, the noise energy at microphones close to the main driving seat is stronger than at microphones far away from it; in this case, a sound zone far from the main driving seat can be selected as the preset sound zone, so as to obtain the signal-to-noise-ratio benefit brought by the microphone layout.
Alternatively, an auxiliary microphone can be installed at the center armrest between the left and right sound zones; when the signal-to-noise ratio of the microphone close to an open window is too low, the voice information collected by the auxiliary microphone for that sound zone is added to the combined vector input to the model.
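One way to automate this choice is to estimate a rough SNR per microphone and take the best zone as the preset sound zone; the noise-floor estimate from a low percentile of frame energy below is an assumption, not a detail given in the patent:

```python
# Rough sketch: pick the preset (reference) sound zone as the microphone
# with the highest estimated signal-to-noise ratio.
import numpy as np

def pick_reference_zone(zone_specs):
    """zone_specs: STFT matrices of the four zone microphones."""
    snrs = []
    for Y in zone_specs:
        frame_energy = np.mean(np.abs(Y) ** 2, axis=0)   # per-frame energy
        noise_floor = np.percentile(frame_energy, 10) + 1e-8
        snrs.append(frame_energy.mean() / noise_floor)
    return int(np.argmax(snrs))
```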
Further, the weights [F1, F2, F3, F4] of the target voice information of the four sound zones on the target vehicle are multiplied with the corresponding converted frequency-domain target voice information [Y1, Y2, Y3, Y4], and the obtained results are inverse-Fourier-transformed to obtain the enhanced target voice information, defined as y1', y2', y3' and y4', respectively.
It should be noted that, the voice enhancement process is described with the target vehicle as a four-seater, but the present application does not limit the specific composition of the vehicle, for example, the target vehicle may also be a seven-seater or another vehicle type, and for the voice enhancement process of another vehicle type, the vehicle-mounted voice enhancement process of the four-seater may be referred to for implementation, and details are not repeated here.
Next, this embodiment will describe a construction process of the speech enhancement model mentioned in the above steps, wherein an optional implementation manner is that the construction process of the speech enhancement model specifically may include: the method comprises the steps of firstly obtaining sample vehicle-mounted auxiliary information of a sample vehicle, obtaining sample voice information of vehicle-mounted users in each sound zone on the sample vehicle, and then training an initial voice enhancement model by utilizing the sample vehicle-mounted auxiliary information, the sample voice information and a target loss function to obtain a voice enhancement model.
Specifically, in this embodiment, constructing the speech enhancement model requires a large amount of preparatory work. First, a large amount of vehicle-mounted auxiliary information (including sample seat information, sample vehicle speed information and sample window state information) and voice information of the vehicle-mounted users in each sound zone need to be collected as sample vehicle-mounted auxiliary information and sample voice information to construct the model training data. For example, the seat information of a large number of sample vehicles in different situations can be collected in advance, such as the different numbers of users (one user or multiple users) on a four-seat sample vehicle and the corresponding seat information; similarly, the vehicle speed information (covering the static, low-speed, medium-speed and high-speed states) and window state information (i.e., the opening states of the windows of different seats) of a large number of sample vehicles in different situations need to be collected in advance. In these situations, and with the permission of the vehicle-mounted users on the sample vehicle, the voice information uttered by the users in each sound zone together constitutes the model training data, and the weight recognition results corresponding to the sample voice information in these situations are manually annotated. Then, the initial speech enhancement model can be trained according to the sample vehicle-mounted auxiliary information, the sample voice information, the annotated weight recognition results and the target loss function (the specific function is not limited and can be set according to the actual situation and empirical values) to generate the speech enhancement model.
In an optional implementation, the initial speech enhancement model may be (but is not limited to) at least one of a convolutional neural network (CNN), a recurrent neural network (RNN), or a real-valued or complex-valued network.
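As one concrete possibility, a small PyTorch model combining a CNN front end with a recurrent layer could predict the per-zone, per-frequency weights; every layer size below is an illustrative assumption, since the patent fixes no architecture:

```python
# Hypothetical speech enhancement model: CNN + GRU over the spliced
# features, emitting one weight per sound zone and frequency bin.
import torch
import torch.nn as nn

class ZoneMaskNet(nn.Module):
    def __init__(self, in_dim, n_zones=4, n_bins=257, hidden=256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(in_dim, hidden, kernel_size=3, padding=1),
            nn.ReLU())
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_zones * n_bins)
        self.n_zones, self.n_bins = n_zones, n_bins

    def forward(self, x):                    # x: (batch, time, in_dim)
        h = self.cnn(x.transpose(1, 2)).transpose(1, 2)
        h, _ = self.rnn(h)
        m = torch.sigmoid(self.head(h))      # weights in [0, 1]
        return m.view(x.size(0), -1, self.n_zones, self.n_bins)
```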
Specifically, during model training, sample vehicle-mounted auxiliary information (such as sample seat information) and sample voice information under the target vehicle operating condition are extracted from the training data in turn. The sample voice information is Fourier-transformed into frequency-domain sample voice information; the sample vehicle-mounted auxiliary information and the converted frequency-domain sample voice information are each converted into the corresponding characterization vectors, the fused characterization vectors serve as the model input, and the corresponding weight recognition results serve as the output. Multiple rounds of training are performed: in each round, the predicted weight recognition results are compared with the corresponding manual annotations, and the model parameters are updated according to the difference between them until a preset condition is met, for example, until the value of the target loss function is small and essentially unchanged. The parameter updates then stop, the training of the speech enhancement model is complete, and the trained speech enhancement model is generated.
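A minimal training-loop sketch for this procedure follows, reusing the hypothetical ZoneMaskNet above; plain mean-squared error stands in for the unspecified target loss function:

```python
# Train the mask predictor against the manually annotated weights.
import torch

def train(model, loader, epochs=10, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()             # stand-in target loss
    model.train()
    for _ in range(epochs):
        for features, labeled_masks in loader:
            loss = loss_fn(model(features), labeled_masks)
            opt.zero_grad()
            loss.backward()
            opt.step()
```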
On the basis, after the voice enhancement model is generated according to the sample vehicle-mounted auxiliary information and the sample voice information, the generated voice enhancement model can be further verified by utilizing the verification vehicle-mounted auxiliary information and the verification voice information. The specific verification process may include the following steps (1) to (3):
step (1): and acquiring verification vehicle-mounted auxiliary information of the verification vehicle and verification voice information of vehicle-mounted users in each sound zone on the verification vehicle.
In this embodiment, in order to verify the speech enhancement model, the verification vehicle-mounted auxiliary information of a verification vehicle (including verification seat information, verification vehicle speed information and verification window state information) and the verification voice information of the vehicle-mounted users in each sound zone on the verification vehicle are first acquired. For example, the different numbers of users (one user or multiple users) on a four-seat verification vehicle and the corresponding seat information can be obtained; similarly, the vehicle speed information (covering the static, low-speed, medium-speed and high-speed states) and window state information (i.e., the opening states of the windows of different seats) of the verification vehicle in a large number of different situations can be collected in advance as the verification vehicle-mounted auxiliary information. With the permission of the vehicle-mounted users on the verification vehicle, 1000 pieces of voice data uttered by different users in each sound zone are collected as the verification voice information. Here, the verification vehicle-mounted auxiliary information and verification voice information refer to vehicle-mounted auxiliary information and voice information that can be used to verify the speech enhancement model; after the weight annotation corresponding to each piece of verification voice information is acquired, the subsequent step (2) can be executed.
Step (2): inputting a verification combined vector, constructed from the verification vehicle-mounted auxiliary information and the converted frequency-domain verification voice information, into the speech enhancement model to obtain the weight prediction result corresponding to the verification voice information.
After the verification vehicle-mounted auxiliary information and verification voice information are obtained in step (1), the verification voice information can be Fourier-transformed into frequency-domain verification voice information; the verification vehicle-mounted auxiliary information and the converted frequency-domain verification voice information are each converted into the corresponding characterization vectors, and the fused characterization vectors are input into the speech enhancement model to obtain the weight prediction result corresponding to the verification voice information, which is used to execute the subsequent step (3).
Step (3): when the weight prediction result of the verification voice information is inconsistent with the weight annotation corresponding to the verification voice information, re-using the verification voice information and the verification vehicle-mounted auxiliary information as sample voice information and sample vehicle-mounted auxiliary information, respectively, to update the speech enhancement model.
After the weight prediction result of the verification voice information is obtained in step (2), if it is inconsistent with the true weight recognition result corresponding to the verification voice information (such as the manually annotated weight), the verification voice information and the verification vehicle-mounted auxiliary information can be re-used as sample voice information and sample vehicle-mounted auxiliary information, respectively, to update the parameters of the speech enhancement model.
Through the above process, the speech enhancement model can be effectively verified using the verification vehicle-mounted auxiliary information and verification voice information; when the weight prediction result of the verification voice information is inconsistent with the corresponding true weight recognition result (such as the manually annotated weight), the speech enhancement model can be adjusted and updated in time, which improves the prediction precision and accuracy of the speech enhancement model.
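Steps (1)-(3) can be condensed into the following sketch; the mismatch tolerance is an assumption, as the text above only speaks of "inconsistent" results:

```python
# Verify the model and collect mismatching samples for re-training.
import torch

def verify(model, val_loader, tol=0.1):
    retrain = []
    model.eval()
    with torch.no_grad():
        for features, labeled_masks in val_loader:
            pred = model(features)
            if (pred - labeled_masks).abs().max() > tol:  # inconsistent
                retrain.append((features, labeled_masks))
    return retrain   # re-used as sample data to update the model
```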
S203: and carrying out preset operation processing on the vehicle-mounted user and/or the target vehicle according to the enhanced target voice information to obtain a processing result.
In this embodiment, after the enhanced target voice information is obtained in step S202, a preset device of the target vehicle can be awakened according to the enhanced target voice information (e.g., the vehicle-mounted air conditioner or the vehicle-mounted audio player), and the preset operation processing of locating and recognizing the vehicle-mounted user who uttered the wake-up voice (e.g., "too hot, please turn on the air conditioner") can be performed. For example, the wake-up results and the energy of the enhanced voice in each sound zone can be compared, the user in the sound zone with the largest energy can be located as the target waker, and voice recognition can then be carried out to obtain the recognition result for the target speaker. In this way, the wake-up, localization and recognition effects can be improved, further improving the voice interaction experience of users in the driving state.
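The energy comparison described here reduces to a per-zone argmax over the enhanced signals; a minimal sketch:

```python
# Locate the waking user as the sound zone with the most enhanced energy.
import numpy as np

def locate_wake_user(enhanced_signals):
    """enhanced_signals: enhanced time-domain signals y1'..y4'."""
    energies = [float(np.sum(y ** 2)) for y in enhanced_signals]
    return int(np.argmax(energies))   # index of the wake-up user's zone
```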
Thus, by executing the above steps S201-S203, sound-zone-level voice enhancement in the vehicle can be realized by combining the vehicle-mounted seat information, vehicle speed information and window information, so that the subsequent wake-up and localization effects can be greatly improved, further improving the voice interaction experience of users in the driving state.
In summary, in the vehicle-mounted voice enhancement method provided in this embodiment, vehicle-mounted auxiliary information of a target vehicle is first acquired together with the target voice information of the vehicle-mounted users in each sound zone on the target vehicle; the target voice information is then enhanced using the vehicle-mounted auxiliary information to obtain enhanced target voice information; and preset operation processing is then performed on the vehicle-mounted user and/or the target vehicle according to the enhanced target voice information to obtain a processing result. In this way, the voice of the vehicle-mounted user in each sound zone is enhanced according to the vehicle-mounted auxiliary information before the enhanced voice is used for subsequent preset operations such as device wake-up and user localization and recognition, so the wake-up, localization and recognition effects can be improved, improving the voice interaction experience of users while the target vehicle is being driven.
Second embodiment
In this embodiment, a vehicle-mounted speech enhancement apparatus will be described, and please refer to the above method embodiment for related contents.
Referring to fig. 3, a schematic composition diagram of a vehicle-mounted speech enhancement apparatus provided in this embodiment is shown, where the apparatus 300 includes:
an obtaining unit 301, configured to obtain vehicle-mounted auxiliary information of a target vehicle, and obtain target voice information of a vehicle-mounted user in each sound zone on the target vehicle;
an enhancing unit 302, configured to perform enhancement processing on the target voice information by using the vehicle-mounted auxiliary information, so as to obtain enhanced target voice information;
and the processing unit 303 is configured to perform preset operation processing on the vehicle-mounted user and/or the target vehicle according to the enhanced target voice information to obtain a processing result.
In one implementation manner of the embodiment, the vehicle-mounted auxiliary information of the target vehicle includes seat information of the target vehicle; the enhancement unit 302 includes:
the first conversion subunit is used for performing Fourier conversion on the target voice information to obtain converted target voice information;
the first prediction subunit is used for constructing a combined vector by utilizing the seat information of the target vehicle and the converted target voice information, inputting the combined vector into a pre-constructed voice enhancement model, and predicting to obtain the weight of the target voice information of each sound zone on the target vehicle;
and the first calculating subunit is used for respectively multiplying the weights and the corresponding target voice information, and performing inverse Fourier transform on the obtained calculation result to obtain the enhanced target voice information.
In an implementation manner of this embodiment, the first prediction subunit is specifically configured to:
splicing the vector corresponding to the seat information of the target vehicle and the vector corresponding to the converted target voice information to obtain a spliced vector serving as a combined vector; or, constructing a combined vector by using the seat information of the target vehicle and the converted target voice information in a gating mode.
In one implementation manner of the embodiment, the vehicle-mounted auxiliary information of the target vehicle comprises vehicle speed information and window state information of the target vehicle; the enhancement unit 302 includes:
the second conversion subunit is used for performing Fourier transform on the target voice information to obtain converted target voice information;
the second prediction subunit is used for constructing a combined vector by utilizing the speed information and window state information of the target vehicle and the converted target voice information, inputting the combined vector to a pre-constructed voice enhancement model, and predicting to obtain the weight of the target voice information of each sound zone on the target vehicle;
and the second calculating subunit is configured to perform multiplication calculation on the weights and the corresponding target voice information respectively, and perform inverse fourier transform on an obtained calculation result to obtain the enhanced target voice information.
In an implementation manner of this embodiment, the second prediction subunit is specifically configured to:
splicing the vector corresponding to the speed information of the target vehicle, the vector corresponding to the vehicle window state information and the vector corresponding to the converted target voice information to obtain a spliced vector serving as a combined vector; or, constructing a combined vector by using the speed information and window state information of the target vehicle and the converted target voice information in a gating mode.
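Since this variant differs from the seat-information pipeline only in the auxiliary part of the combined vector, it can reuse the enhance() sketch given earlier; speed_vec and window_vec are assumed encodings of the vehicle speed and the per-window open/closed states, which the embodiment leaves open:

import numpy as np

def enhance_with_speed_and_windows(zone_signals, speed_vec, window_vec,
                                   enhance_model):
    # Splice the two auxiliary vectors and run the same pipeline as before.
    aux_vector = np.concatenate([speed_vec, window_vec])
    return enhance(zone_signals, aux_vector, enhance_model)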
In one implementation of this embodiment, the speech enhancement model includes at least one of a convolutional neural network (CNN), a recurrent neural network (RNN), and a real-valued or complex-valued network.
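A minimal real-valued CNN+RNN model consistent with this list might look as follows; the layer sizes, the two-argument interface, and the sigmoid output are illustrative assumptions (PyTorch is used purely as convenient notation):

import torch
import torch.nn as nn

class SpeechEnhancementModel(nn.Module):
    def __init__(self, num_zones, num_bins, aux_dim):
        super().__init__()
        self.num_zones, self.num_bins = num_zones, num_bins
        # CNN front end over the per-zone magnitude spectra.
        self.conv = nn.Conv1d(num_zones, 16, kernel_size=3, padding=1)
        # RNN over (here) a single frame; a real system would run it
        # across successive frames.
        self.rnn = nn.GRU(16 * num_bins + aux_dim, 128, batch_first=True)
        self.out = nn.Linear(128, num_zones * num_bins)

    def forward(self, spectra, aux):
        # spectra: (batch, num_zones, num_bins) magnitude spectra;
        # aux: (batch, aux_dim) encoded vehicle-mounted auxiliary information.
        h = torch.relu(self.conv(spectra)).flatten(1)
        h = torch.cat([h, aux], dim=1).unsqueeze(1)  # treat as one time step
        h, _ = self.rnn(h)
        # A sigmoid bounds the weights in (0, 1); since the ratio defined
        # below can exceed 1, an unbounded output head is equally plausible.
        w = torch.sigmoid(self.out(h.squeeze(1)))
        return w.view(-1, self.num_zones, self.num_bins)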
In an implementation manner of this embodiment, the first prediction subunit or the second prediction subunit is specifically configured to:
inputting the combined vector into a pre-constructed voice enhancement model, and calculating the ratio of the frequency-domain signal of the target voice information acquired by each sound zone to the frequency-domain signal of the noisy audio acquired by a preset sound zone, so as to serve as the weight of the target voice information of each sound zone.
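As a worked illustration of this ratio, the following sketch computes such per-zone weights (for example, as training targets for the model); taking magnitudes and adding a small epsilon are assumptions to keep the ratio well defined:

import numpy as np

def ratio_weights(zone_spectra, preset_zone_noisy_spectrum, eps=1e-8):
    # Weight for each sound zone: the ratio of that zone's target-voice
    # frequency-domain signal to the noisy frequency-domain signal
    # acquired in the preset sound zone.
    ref = np.abs(preset_zone_noisy_spectrum) + eps
    return [np.abs(s) / ref for s in zone_spectra]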
In an implementation manner of this embodiment, the processing unit 303 is specifically configured to:
awakening the preset device of the target vehicle according to the enhanced target voice information, and positioning and identifying the vehicle-mounted user who issued the awakening voice, so as to obtain a processing result.
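One hedged sketch of such a preset operation: score each sound zone's enhanced voice with an assumed wake-word detector and take the best-scoring zone as the position of the user who issued the awakening voice; the detector wake_score and the threshold are assumptions, as the embodiment does not fix them:

def awaken_and_position(enhanced_zones, wake_score, threshold=0.5):
    # wake_score: assumed function mapping enhanced audio to a wake-word
    # confidence in [0, 1].
    scores = [wake_score(sig) for sig in enhanced_zones]
    best = max(range(len(scores)), key=lambda k: scores[k])
    if scores[best] >= threshold:
        return best   # index of the sound zone that issued the wake-up voice
    return None       # no awakening detected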
Further, an embodiment of the present application further provides a vehicle-mounted speech enhancement device, including: a processor, a memory, a system bus;
the processor and the memory are connected through the system bus;
the memory is used for storing one or more programs, and the one or more programs comprise instructions which, when executed by the processor, cause the processor to execute any implementation method of the vehicle-mounted voice enhancement method.
Further, an embodiment of the present application further provides a computer-readable storage medium, where instructions are stored in the computer-readable storage medium, and when the instructions are executed on a terminal device, the terminal device is caused to execute any implementation method of the above vehicle-mounted voice enhancement method.
Further, an embodiment of the present application further provides a computer program product, which, when running on a terminal device, enables the terminal device to execute any one implementation method of the above vehicle-mounted voice enhancement method.
From the above description of the embodiments, it is clear to those skilled in the art that all or part of the steps in the method of the above embodiments may be implemented by software plus a necessary general hardware platform. Based on such understanding, the technical solution of the present application may be essentially or partially implemented in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network communication device such as a media gateway, etc.) to execute the method according to the embodiments or some parts of the embodiments of the present application.
It should be noted that, in the present specification, each embodiment is described in a progressive manner, and each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other. The device disclosed in the embodiment corresponds to the method disclosed in the embodiment, so that the description is simple, and the relevant points can be referred to the description of the method part.
It is further noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional like elements in a process, method, article, or apparatus that comprises the element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (11)

1. A vehicle-mounted speech enhancement method, comprising:
acquiring vehicle-mounted auxiliary information of a target vehicle and acquiring target voice information of vehicle-mounted users in each sound zone on the target vehicle;
enhancing the target voice information by utilizing the vehicle-mounted auxiliary information to obtain enhanced target voice information;
and carrying out preset operation processing on the vehicle-mounted user and/or the target vehicle according to the enhanced target voice information to obtain a processing result.
2. The method of claim 1, wherein the in-vehicle assistance information of the target vehicle includes seat information of the target vehicle; the enhancing the target voice information by using the vehicle-mounted auxiliary information to obtain the enhanced target voice information includes:
performing Fourier transform on the target voice information to obtain converted target voice information;
constructing a combined vector by using the seat information of the target vehicle and the converted target voice information, inputting the combined vector to a pre-constructed voice enhancement model, and predicting to obtain the weight of the target voice information of each sound zone on the target vehicle;
and respectively multiplying the weight and the corresponding target voice information, and performing inverse Fourier transform on the obtained calculation result to obtain the enhanced target voice information.
3. The method of claim 2, wherein the constructing a combined vector using seat information of the target vehicle and the converted target speech information comprises:
splicing the vector corresponding to the seat information of the target vehicle and the vector corresponding to the converted target voice information to obtain a spliced vector serving as a combined vector; or, constructing a combined vector by using the seat information of the target vehicle and the converted target voice information in a gating mode.
4. The method according to claim 1, wherein the on-board assistance information of the target vehicle includes vehicle speed information and window state information of the target vehicle; the enhancing the target voice information by using the vehicle-mounted auxiliary information to obtain the enhanced target voice information includes:
performing Fourier transform on the target voice information to obtain converted target voice information;
constructing a combined vector by using the speed information and window state information of the target vehicle and the converted target voice information, inputting the combined vector to a pre-constructed voice enhancement model, and predicting to obtain the weight of the target voice information of each sound zone on the target vehicle;
and multiplying the weight and the corresponding target voice information respectively, and performing inverse Fourier transform on the obtained calculation result to obtain the enhanced target voice information.
5. The method of claim 4, wherein the constructing a combined vector using the vehicle speed information and window state information of the target vehicle and the converted target speech information comprises:
splicing the vector corresponding to the speed information of the target vehicle, the vector corresponding to the vehicle window state information and the vector corresponding to the converted target voice information to obtain a spliced vector serving as a combined vector; or, constructing a combined vector by using the speed information and window state information of the target vehicle and the converted target voice information in a gating mode.
6. The method of any of claims 2 to 5, wherein the speech enhancement model comprises at least one of a Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN), and a real-valued or complex-valued network.
7. The method according to any one of claims 2 to 5, wherein the inputting of the combined vector into a pre-constructed speech enhancement model and predicting the weight of the target voice information of each sound zone on the target vehicle comprises:
and inputting the combined vector to a pre-constructed voice enhancement model, and calculating the ratio of the frequency domain signal of the target voice information acquired by each sound zone to the frequency domain signal of the noisy audio acquired by a preset sound zone to be used as the weight of the target voice information of each sound zone.
8. The method according to claim 1, wherein the performing preset operation processing on the vehicle-mounted user and/or the target vehicle according to the enhanced target voice information to obtain a processing result comprises:
and awakening the preset device of the target vehicle according to the enhanced target voice information, and positioning and identifying the vehicle-mounted user sending the awakening voice to obtain a processing result.
9. An in-vehicle speech enhancement device, comprising:
the system comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring vehicle-mounted auxiliary information of a target vehicle and acquiring target voice information of vehicle-mounted users in each sound zone on the target vehicle;
the enhancement unit is used for enhancing the target voice information by utilizing the vehicle-mounted auxiliary information to obtain enhanced target voice information;
and the processing unit is used for carrying out preset operation processing on the vehicle-mounted user and/or the target vehicle according to the enhanced target voice information to obtain a processing result.
10. An in-vehicle speech enhancement device, comprising: a processor, a memory, a system bus;
the processor and the memory are connected through the system bus;
the memory is to store one or more programs, the one or more programs comprising instructions, which when executed by the processor, cause the processor to perform the method of any of claims 1-8.
11. A computer-readable storage medium having stored therein instructions that, when executed on a terminal device, cause the terminal device to perform the method of any one of claims 1-8.
CN202211254927.9A 2022-10-13 2022-10-13 Vehicle-mounted voice enhancement method and device, storage medium and equipment Pending CN115641861A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211254927.9A CN115641861A (en) 2022-10-13 2022-10-13 Vehicle-mounted voice enhancement method and device, storage medium and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211254927.9A CN115641861A (en) 2022-10-13 2022-10-13 Vehicle-mounted voice enhancement method and device, storage medium and equipment

Publications (1)

Publication Number Publication Date
CN115641861A true CN115641861A (en) 2023-01-24

Family

ID=84945638

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211254927.9A Pending CN115641861A (en) 2022-10-13 2022-10-13 Vehicle-mounted voice enhancement method and device, storage medium and equipment

Country Status (1)

Country Link
CN (1) CN115641861A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111402877A (en) * 2020-03-17 2020-07-10 北京百度网讯科技有限公司 Noise reduction method, device, equipment and medium based on vehicle-mounted multi-sound zone

Similar Documents

Publication Publication Date Title
US11003414B2 (en) Acoustic control system, apparatus and method
US10373609B2 (en) Voice recognition method and apparatus
JP6677796B2 (en) Speaker verification method, apparatus, and system
US10224053B2 (en) Audio signal quality enhancement based on quantitative SNR analysis and adaptive Wiener filtering
CN110880321B (en) Intelligent braking method, device, equipment and storage medium based on voice
US20190311713A1 (en) System and method to fulfill a speech request
CN111583906B (en) Role recognition method, device and terminal for voice session
CN112397065A (en) Voice interaction method and device, computer readable storage medium and electronic equipment
WO2021169742A1 (en) Method and device for predicting operating state of transportation means, and terminal and storage medium
CN110942766A (en) Audio event detection method, system, mobile terminal and storage medium
US20170125038A1 (en) Transfer function to generate lombard speech from neutral speech
CN111312292A (en) Emotion recognition method and device based on voice, electronic equipment and storage medium
CN112992191B (en) Voice endpoint detection method and device, electronic equipment and readable storage medium
CN115641861A (en) Vehicle-mounted voice enhancement method and device, storage medium and equipment
CN113129867A (en) Training method of voice recognition model, voice recognition method, device and equipment
CN112466276A (en) Speech synthesis system training method and device and readable storage medium
Darekar et al. A hybrid meta-heuristic ensemble based classification technique speech emotion recognition
CN112927688B (en) Voice interaction method and system for vehicle
CN113889091A (en) Voice recognition method and device, computer readable storage medium and electronic equipment
Loh et al. Speech recognition interactive system for vehicle
Krishnamurthy et al. Car noise verification and applications
CN111862946B (en) Order processing method and device, electronic equipment and storage medium
KR101619257B1 (en) System and method for controlling sensibility of driver
Ashhad et al. Improved vehicle sub-type classification for acoustic traffic monitoring
CN114512133A (en) Sound object recognition method, sound object recognition device, server and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination