CN115410561B - Voice recognition method, device, medium and equipment based on vehicle-mounted multimode interaction - Google Patents

Voice recognition method, device, medium and equipment based on vehicle-mounted multimode interaction

Info

Publication number
CN115410561B
CN115410561B (application CN202211359138.1A)
Authority
CN
China
Prior art keywords
feature vector
calculation formula
rate
fusion
vehicle
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211359138.1A
Other languages
Chinese (zh)
Other versions
CN115410561A (en)
Inventor
王增喜
于波
王赟芝
方琳
潘霞
张苏林
宗岩
焦莉莉
韩瑞龙
秦川琪
张莹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sinotruk Data Co ltd
Original Assignee
Sinotruk Data Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sinotruk Data Co ltd filed Critical Sinotruk Data Co ltd
Priority to CN202211359138.1A priority Critical patent/CN115410561B/en
Publication of CN115410561A publication Critical patent/CN115410561A/en
Application granted granted Critical
Publication of CN115410561B publication Critical patent/CN115410561B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 - Movements or behaviour, e.g. gesture recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 - Movements or behaviour, e.g. gesture recognition
    • G06V40/28 - Recognition of hand or arm movements, e.g. recognition of deaf sign language

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Traffic Control Systems (AREA)

Abstract

The invention relates to the field of data processing, and discloses a voice recognition method, a device, a medium and equipment based on vehicle-mounted multimode interaction, which comprise the following steps: acquiring in-vehicle voice data, and extracting voice feature vectors from the in-vehicle voice data; extracting facial feature vectors, lip feature vectors and gesture feature vectors; acquiring vehicle state data, and extracting a vehicle state feature vector from the vehicle state data; determining harmonic coefficients corresponding to the facial feature vector, the lip feature vector, the gesture feature vector and the vehicle state feature vector respectively; performing multi-mode fusion on the facial feature vector, the lip feature vector, the gesture feature vector and the vehicle state feature vector to obtain a first fusion feature vector; performing fusion processing on the first fusion feature vector and the voice feature vector to obtain a second fusion feature vector; and inputting the second fusion feature vector into the voice recognition model to obtain a voice recognition result. The embodiment of the invention can improve the accuracy of vehicle-mounted voice recognition.

Description

Voice recognition method, device, medium and equipment based on vehicle-mounted multimode interaction
Technical Field
The invention relates to the technical field of data processing, and in particular to a voice recognition method, device, medium and equipment based on vehicle-mounted multimode interaction.
Background
With the rise of Internet-of-Vehicles and artificial intelligence technologies, more and more functions are carried on the in-vehicle head unit. The ever-growing functions and increasingly complex interfaces compete for the driver's attention during driving. At the current 'human-machine co-driving' stage, voice interaction technology helps the driver reduce dependence on manual operation of in-vehicle devices and increases driving safety. With the rise of intelligent networking and intelligent cockpit technologies, China has become the largest automobile consumption market, and the experience of human-vehicle interaction scenarios has increasingly become a field in which many manufacturers seek to distinguish themselves. At present, the voice interaction function, as an iconic representation of cockpit intelligence, is combined with various in-vehicle applications and has become a core function of cockpit ecosystem construction: for example, controlling navigation, music, the radio, the air conditioner, the windows and the sunroof by voice, and querying weather, stocks, flights and the like by voice.
Early vehicle-mounted speech recognition was achieved by local recognition on a vehicle-mounted chip; it could recognize only a few fixed words, with very low accuracy. Around 2013, with the popularization of deep neural networks and the application of Internet technology, voice technology expanded to combine with the cloud: the recognition range grew wider and wider, wake-up and barge-in were gradually realized, a large number of in-driving operations became voice-controllable, and dialect recognition and fuzzy-word recognition became more and more accurate. Voice recognition is now indispensable in almost all new vehicles, and voice recognition developers, together with related chip and software-algorithm enterprises, continue to improve its application capability. However, the accuracy of speech recognition still needs to be improved.
Disclosure of Invention
In order to solve at least one of the above technical problems, the invention provides a voice recognition method, device, medium and equipment based on vehicle-mounted multimode interaction.
According to a first aspect, an embodiment of the present invention provides a speech recognition method based on vehicle-mounted multimode interaction, including:
acquiring in-vehicle voice data, and extracting voice feature vectors from the in-vehicle voice data; acquiring face data, lip data and gesture data of people in the vehicle, extracting a face feature vector from the face data, extracting a lip feature vector from the lip data, and extracting a gesture feature vector from the gesture data; acquiring vehicle state data, and extracting a vehicle state feature vector from the vehicle state data;
determining a harmonic coefficient corresponding to each of the facial feature vector, the lip feature vector, the gesture feature vector and the vehicle state feature vector;
according to each harmonic coefficient, performing multi-mode fusion on the face feature vector, the lip feature vector, the gesture feature vector and the vehicle state feature vector to obtain a first fusion feature vector;
performing fusion processing on the first fusion feature vector and the voice feature vector to obtain a second fusion feature vector;
and inputting the second fusion feature vector into a pre-trained voice recognition model to obtain a corresponding voice recognition result.
According to a second aspect, an embodiment of the present invention provides a speech recognition apparatus based on vehicle-mounted multimode interaction, including:
the vector forming module is used for acquiring in-vehicle voice data and extracting voice characteristic vectors from the in-vehicle voice data; acquiring face data, lip data and gesture data of people in the vehicle, extracting face feature vectors from the face data, extracting lip feature vectors from the lip data, and extracting gesture feature vectors from the gesture data; acquiring vehicle state data, and extracting a vehicle state feature vector from the vehicle state data;
a coefficient determination module for determining a harmonic coefficient corresponding to each of the facial feature vector, the lip feature vector, the gesture feature vector, and the vehicle state feature vector;
the first fusion module is used for carrying out multi-mode fusion on the face feature vector, the lip feature vector, the gesture feature vector and the vehicle state feature vector according to each harmonic coefficient to obtain a first fusion feature vector;
the second fusion module is used for carrying out fusion processing on the first fusion feature vector and the voice feature vector to obtain a second fusion feature vector;
and the voice recognition module is used for inputting the second fusion feature vector into a pre-trained voice recognition model to obtain a corresponding voice recognition result.
According to a third aspect, an embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored, which, when executed in a computer, causes the computer to perform the method provided in the first aspect.
According to a fourth aspect, an embodiment of the present invention provides a computing device, including a memory and a processor, where the memory stores executable codes, and the processor executes the executable codes to implement the method provided in the first aspect.
The embodiment of the invention has the following technical effects: acquiring in-vehicle voice data, and extracting voice feature vectors from the in-vehicle voice data; acquiring face data, lip data and gesture data of people in the vehicle, extracting face feature vectors from the face data, extracting lip feature vectors from the lip data, and extracting gesture feature vectors from the gesture data; acquiring vehicle state data, and extracting a vehicle state feature vector from the vehicle state data; determining a harmonic coefficient corresponding to each of the facial feature vector, the lip feature vector, the gesture feature vector, and the vehicle state feature vector; according to each harmonic coefficient, performing multi-mode fusion on the face feature vector, the lip feature vector, the gesture feature vector and the vehicle state feature vector to obtain a first fusion feature vector; performing fusion processing on the first fusion feature vector and the voice feature vector to obtain a second fusion feature vector; and inputting the second fusion feature vector into a pre-trained voice recognition model to obtain a corresponding voice recognition result. Therefore, the vehicle-mounted voice recognition method and device are based on the multi-mode interaction modes such as voice, visual facial expressions, lip movements, gestures and the internal state of the vehicle, vehicle-mounted voice recognition is completed through the multi-mode information fusion technology, and therefore the accuracy of vehicle-mounted voice recognition is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a flow chart illustrating a speech recognition method based on vehicle-mounted multi-modal interaction according to an embodiment of the present invention;
FIG. 2 is a schematic layout of a camera and microphone within a vehicle interior in accordance with one embodiment of the present invention;
FIG. 3 is a flow chart of a speech recognition method based on vehicle-mounted multi-mode interaction according to an embodiment of the invention;
FIG. 4 is a flow diagram illustrating the determination of the execution of a voice command in accordance with one embodiment of the present invention;
fig. 5 is a block diagram of a voice recognition device based on vehicle-mounted multi-mode interaction according to an embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In a first aspect, an embodiment of the present invention provides a speech recognition method based on vehicle-mounted multimode interaction. The method can be executed by any computing device and, with reference to FIGS. 1-3, comprises the following steps S110-S150:
S110, acquiring in-vehicle voice data, and extracting a voice feature vector from the in-vehicle voice data; acquiring face data, lip data and gesture data of people in the vehicle, extracting a facial feature vector from the face data, a lip feature vector from the lip data, and a gesture feature vector from the gesture data; acquiring vehicle state data, and extracting a vehicle state feature vector from the vehicle state data;
In a particular scenario, two cameras and one microphone may be mounted in the vehicle. The layout of the cameras and the microphone inside the vehicle is shown in fig. 2.

Camera 1 captures images or videos of the people in the vehicle, so the face data, lip data and gesture data can conveniently be acquired from them. The driver's speech content is identified from the lip data, emotion is recognized from the face data (including the eyes, mouth and the like), and gesture actions are recognized from the gesture data.

Camera 2 has a 360-degree shooting angle, so it can capture images of the vehicle interior before a person speaks in real time; the interior state data, i.e. the vehicle state data, can then be acquired from these images, which aids voice recognition. Camera 2 can also capture images of the vehicle interior after the person has spoken and the vehicle has executed the voice command, so that changes of the in-vehicle state, for example of the windows, sunroof or screen, can be learned; this helps judge the accuracy of voice recognition and confirm its execution by the head unit. When the in-vehicle screen information is incomplete, log information, namely the main log of ADB (Android Debug Bridge), can be read through the vehicle-mounted USB interface, and the log information and the screen information are combined to jointly judge the execution of the recognized command.

The microphone performs voice acquisition, collecting the speech uttered by the occupants, including speech content and voiceprint; the in-vehicle voice data is thus acquired from the microphone.
It is understood that the vehicle state data in S110 actually refers to the state data of the vehicle interior before the corresponding operation is performed according to the recognized voice command; however, since issuing the voice command and recognizing it take relatively little time, the data may be regarded as the state of the vehicle interior just before the utterance.
It can be understood that after the in-vehicle voice data, face data, lip data, gesture data and vehicle state data are acquired, each is subjected to extraction processing, such as convolution, to obtain the corresponding feature vector.
For example, after feature extraction, the obtained feature vectors are respectively:
lip feature vector A = {A_1, A_2, …, A_i, …, A_a}, with dimension a = 32;
facial feature vector B = {B_1, B_2, …, B_i, …, B_b}, with dimension b = 96;
gesture feature vector C = {C_1, C_2, …, C_i, …, C_c}, with dimension c = 64;
vehicle state feature vector D = {D_1, D_2, …, D_i, …, D_d}, with dimension d = 64;
speech feature vector E = {E_1, E_2, …, E_i, …, E_e}, with dimension e = 128.
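These shapes can be mirrored in a minimal Python sketch; the random vectors below are merely hypothetical stand-ins for the outputs of the extraction step, which the text does not specify beyond "convolution processing":

import numpy as np

# Hypothetical stand-ins for the extraction outputs; only the dimensions
# come from the text above.
rng = np.random.default_rng(0)
A = rng.standard_normal(32)    # lip feature vector,           a = 32
B = rng.standard_normal(96)    # facial feature vector,        b = 96
C = rng.standard_normal(64)    # gesture feature vector,       c = 64
D = rng.standard_normal(64)    # vehicle state feature vector, d = 64
E = rng.standard_normal(128)   # speech feature vector,        e = 128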
S120, determining corresponding harmonic coefficients of the facial feature vector, the lip feature vector, the gesture feature vector and the vehicle state feature vector;
It can be understood that multi-modal features generally differ in their properties and in the way and structure in which they represent information; fusing the feature vectors directly, without processing, often fails to achieve the desired effect, and information extracted by the same extraction method partly overlaps across features. If these feature vectors are simply put together without selection or strategy, the goal of introducing multi-modal features is not achieved and satisfactory results are not obtained. Therefore, the multi-modal features need dimension reduction, so that features of similar types are truly fused and projected onto a lower dimension.
In one embodiment, the harmonic coefficients may be calculated using a preset equation system of eight equations. (The equations appear only as images in the original publication.) In the formulas, A_i is the i-th element in the lip feature vector, B_i is the i-th element in the facial feature vector, C_i is the i-th element in the gesture feature vector, and D_i is the i-th element in the vehicle state feature vector; a is the number of elements in the lip feature vector, b is the number of elements in the facial feature vector, c is the number of elements in the gesture feature vector, and d is the number of elements in the vehicle state feature vector; α is the harmonic coefficient of the lip feature vector, β is the harmonic coefficient of the facial feature vector, γ is the harmonic coefficient of the gesture feature vector, and δ is the harmonic coefficient of the vehicle state feature vector.
The first and second equations align the feature vectors, and the third through sixth equations normalize all of them; the four harmonic coefficients can then be calculated from the system. For example, the calculated α², β², γ² and δ² are 1/8, 3/8, 1/4 and 1/4, respectively.
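The equation system itself is published only as images, but the worked example is consistent with the squared coefficients being proportional to the dimensions of the four vectors (32/256 = 1/8, 96/256 = 3/8, 64/256 = 1/4, 64/256 = 1/4). A minimal Python sketch under that assumption, which is ours rather than the patent's:

import numpy as np

def harmonic_coefficients(a: int, b: int, c: int, d: int):
    # Assumption: squared coefficients proportional to each vector's
    # dimension, which reproduces the example 1/8, 3/8, 1/4, 1/4.
    total = a + b + c + d
    return np.sqrt(np.array([a, b, c, d]) / total)

alpha, beta, gamma, delta = harmonic_coefficients(32, 96, 64, 64)
print(alpha**2, beta**2, gamma**2, delta**2)  # 0.125 0.375 0.25 0.25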
S130, according to each harmonic coefficient, performing multi-mode fusion on the facial feature vector, the lip feature vector, the gesture feature vector and the vehicle state feature vector to obtain a first fusion feature vector;
In one embodiment, S130 may specifically include: multiplying the facial feature vector, the lip feature vector, the gesture feature vector and the vehicle state feature vector by their corresponding harmonic coefficients respectively, and splicing the resulting vectors into one vector to obtain the first fusion feature vector.
For example, the first fusion feature vector W is the concatenation W = {αA_1, …, αA_a, βB_1, …, βB_b, γC_1, …, γC_c, δD_1, …, δD_d}.
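A minimal Python sketch of this weighted splice (the function name is ours):

import numpy as np

def first_fusion(A, B, C, D, alpha, beta, gamma, delta):
    # S130: weight each modality vector by its harmonic coefficient,
    # then splice the four weighted vectors into one.
    return np.concatenate([alpha * A, beta * B, gamma * C, delta * D])

# W = first_fusion(A, B, C, D, alpha, beta, gamma, delta)  # 256-dimensional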
S140, carrying out fusion processing on the first fusion feature vector and the voice feature vector to obtain a second fusion feature vector;
In an embodiment, performing the feature fusion processing on the first fusion feature vector and the voice feature vector to obtain the second fusion feature vector may include S141 to S143:
S141, calculating a corresponding inter-class dispersion vector according to a plurality of preset voice recognition samples, calculating an inter-class dispersion matrix according to the inter-class dispersion vector, and selecting the elements of the first f rows and first f columns from the inter-class dispersion matrix to form a transformation matrix, wherein f is a positive integer greater than 1;
In one embodiment, the inter-class dispersion vector may be calculated using a ninth calculation formula (the formula appears only as an equation image in the original publication), in which μ is the inter-class dispersion vector, m is the number of preset voice recognition samples, c is the category of the preset voice recognition samples, and x_i is any one of the plurality of preset voice recognition samples.
The category refers to the voice function that a preset voice recognition sample serves, such as navigation, vehicle control, playing music, or communication. A preset voice recognition sample comprises a preset voice feature vector and the corresponding voice recognition result.
In one embodiment, the inter-class scatter matrix may be calculated using a tenth calculation formula, which builds the matrix from the inter-class dispersion vector and its transpose; T denotes transposition, R is the inter-class scatter matrix, and μ is the inter-class dispersion vector.
S142, calculating a diagonal matrix according to the inter-class scatter vectors and the transformation matrix, and calculating a dimension reduction transfer matrix according to the diagonal matrix, the inter-class scatter vectors and the transformation matrix;
It will be appreciated that the calculated diagonal matrix has f rows and f columns.
In one embodiment, the diagonal matrix may be calculated using an eleventh calculation formula (formula image not reproduced), in which Λ is the diagonal matrix, M is the transformation matrix, μ is the inter-class dispersion vector, and T denotes transposition.
In one embodiment, the dimension-reduction transfer matrix may be calculated using a twelfth calculation formula (formula image not reproduced), in which P is the dimension-reduction transfer matrix, μ is the inter-class dispersion vector, M is the transformation matrix, and Λ is the diagonal matrix.
S143, according to the dimension-reduction transfer matrix, performing dimension-reduction processing on the first fusion feature vector and the voice feature vector respectively, and splicing the two reduced vectors into one vector to obtain the second fusion feature vector.
In one embodiment, the first fusion feature vector may be reduced using a thirteenth calculation formula that applies the dimension-reduction transfer matrix to it (formula image not reproduced), where W is the first fusion feature vector, P is the dimension-reduction transfer matrix, and W' is the first fusion feature vector after dimension reduction.
In one embodiment, the voice feature vector may likewise be reduced using a fourteenth calculation formula (formula image not reproduced), where P is the dimension-reduction transfer matrix, E is the voice feature vector, and E' is the voice feature vector after dimension reduction.
For example, the second fusion feature vector is obtained by the splicing formula X = {E', W'}, where X is the second fusion feature vector, E' is the voice feature vector after dimension reduction, and W' is the first fusion feature vector after dimension reduction.
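Since the ninth through twelfth formulas are published only as images, the Python sketch below substitutes standard linear-algebra choices (per-class means, an outer-product scatter, an eigen-decomposition) and introduces separate transfer matrices P_w and P_e for the two vectors so the shapes work out; none of these specifics should be read as the patented math:

import numpy as np

def reduction_matrix(samples: np.ndarray, labels: np.ndarray, f: int) -> np.ndarray:
    # S141: inter-class dispersion vector from the per-class sample means,
    # then an outer-product scatter matrix (both forms are assumptions).
    class_means = np.stack([samples[labels == k].mean(axis=0)
                            for k in np.unique(labels)])
    mu = class_means.mean(axis=0)     # inter-class dispersion vector
    R = np.outer(mu, mu)              # inter-class scatter matrix
    M = R[:f, :f]                     # transformation matrix (first f rows/columns)
    # S142: diagonal matrix of leading eigenvalues; the eigenvectors act as
    # the dimension-reduction transfer matrix. M and Lam are kept only to
    # name the quantities the text mentions.
    eigvals, eigvecs = np.linalg.eigh(R)
    Lam = np.diag(eigvals[-f:])       # diagonal matrix, f x f
    return eigvecs[:, -f:]            # P, one column per kept dimension

def second_fusion(W: np.ndarray, E: np.ndarray,
                  P_w: np.ndarray, P_e: np.ndarray) -> np.ndarray:
    # S143: reduce both vectors through their transfer matrices, then splice.
    return np.concatenate([P_e.T @ E, P_w.T @ W])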
S150, inputting the second fusion feature vector into a pre-trained voice recognition model to obtain a corresponding voice recognition result;
The speech recognition model is trained in advance, and the training samples adopted in the training process include a plurality of second fusion feature vectors and a speech recognition result labeled for each second fusion feature vector.

The input of the trained speech recognition model is a second fusion feature vector, and its output is the speech recognition result, which is a character sequence.
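The patent fixes only the model's interface (a second fusion feature vector in, a character sequence out), so any concrete model here would be a guess; a hypothetical linear stand-in for the inference step:

import numpy as np

def recognize(fused_vector: np.ndarray, weight: np.ndarray,
              bias: np.ndarray, commands: list) -> str:
    # Hypothetical stand-in for the pre-trained recognition model.
    logits = weight @ fused_vector + bias     # score each candidate output
    return commands[int(np.argmax(logits))]   # e.g. "open the sunroof"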
Aiming at the problems of low voice recognition rate, false awakening, false recognition and the like of the existing vehicle-mounted voice interaction, the embodiment of the invention is based on multi-mode interaction modes such as voice, visual facial expression, lip movement, gestures and vehicle interior states, and completes vehicle-mounted voice recognition through a multi-mode information fusion technology, so that the accuracy rate of vehicle-mounted voice recognition is improved.
In one embodiment, the method provided in the embodiment of the present invention may further include:
s160, after the vehicle executes the voice command each time, acquiring state change data of the vehicle, and determining whether the voice recognition result is correct or not according to the state change data;
Specifically, referring to fig. 4, camera 2 can capture the in-vehicle state and the head-unit screen after the voice instruction is executed, and the change of the in-vehicle state can be obtained by comparing it against the in-vehicle situation before execution. The main ADB log is read through the vehicle-mounted USB interface to obtain the log entries related to voice execution; the in-vehicle state change and the log information jointly determine the execution of the voice instruction, which improves the accuracy of the voice execution-rate statistics.
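A hedged sketch of the log half of this check ("adb logcat -d" dumps the buffered Android log); the keyword match is an illustrative stand-in for the joint state-change/log judgment:

import subprocess

def voice_command_logged(expected_keyword: str) -> bool:
    # Read the ADB main log over the vehicle-mounted USB interface and look
    # for an entry tied to the executed voice command.
    log = subprocess.run(["adb", "logcat", "-d"],
                         capture_output=True, text=True).stdout
    return expected_keyword in log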
S170, after every preset time period, calculating the sentence recognition success rate, wake-up rate, interaction recognition rate, average wake-up response time and function recognition rate corresponding to voice recognition within that preset time period;
It can be understood that, to verify the voice recognition performance, the embodiment of the invention evaluates performance indexes based on parameters such as the sentence recognition success rate, the wake-up rate, the interaction recognition rate, the average wake-up response time and the function recognition rate.
In one embodiment, the sentence recognition success rate may be calculated using a first calculation formula: a = number of recognition successes of continuous speech / total number of recognitions, where a is the sentence recognition success rate.
The voice recognition model is part of the vehicle-mounted terminal intelligent voice interaction system and is responsible for voice recognition. The system supports command-word recognition and continuous speech recognition, and the sentence recognition success rate evaluates the correct recognition of continuous speech.
In one embodiment, the wake-up rate includes a successful wake-up rate and a false wake-up rate. The successful wake-up rate may be calculated using a second calculation formula: b1 = number of successful wake-ups / total number of recognitions, where b1 is the successful wake-up rate; the false wake-up rate may be calculated using a third calculation formula: b2 = number of false wake-ups / total number of recognitions, where b2 is the false wake-up rate.
The vehicle-mounted terminal intelligent voice interaction system equipped with the voice recognition model supports command-word wake-up services, including user-defined wake-up command words and multiple wake-up command words. The successful wake-up rate evaluates the system's response to wake-up operations, and the false wake-up rate evaluates how often the system is falsely awakened per unit time.
In one embodiment, the interaction recognition rate includes an interaction success rate and a misoperation rate. The interaction success rate may be calculated using a fourth calculation formula: c1 = number of successful interactions / total number of recognitions, where c1 is the interaction success rate; the misoperation rate may be calculated using a fifth calculation formula: c2 = number of interaction failures / total number of recognitions, where c2 is the misoperation rate.
The vehicle-mounted terminal intelligent voice interaction system equipped with the voice recognition model supports the control instructions of the vehicle-mounted terminal and comprehensively covers semantic intent understanding of everyday interaction behaviors. The interaction success rate evaluates the system's correct response to a voice interaction task; interaction tasks include voice recognition, voice wake-up, voice interruption and voice synthesis. If the system completes a voice interaction task within the set number of interaction rounds, the voice interaction succeeds; the interaction success rate and the misoperation rate serve as evaluation indexes.
In one embodiment, the average wake-up response time may be calculated using a sixth calculation formula: g = (t_1 + t_2 + … + t_X) / X, where g is the average wake-up response time, t_i is the response time of the i-th successful wake-up, and X is the total number of successful wake-ups.
For a voice interaction task, the average wake-up response time evaluates the response speed of the vehicle-mounted terminal intelligent voice interaction system. The first wake-up response time (T1) is the difference between the moment the alert tone is first given and the end of the first command input.
In one embodiment, the function recognition rate corresponding to each function may be calculated using a seventh calculation formula: h_i = number of successful recognitions of the i-th function / total number of recognitions, where h_i is the function recognition rate corresponding to the i-th function.

The functions are the voice-command functions, including navigation, music, telephone, radio, air conditioner, vehicle-control devices, information query, chat interaction and the like.
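The first through seventh formulas reduce to simple ratios over counters accumulated during the preset time period; a Python sketch with assumed field names:

def evaluation_metrics(stats: dict):
    # Counters accumulated over the preset time period; field names assumed.
    total = stats["total_recognitions"]
    a  = stats["sentence_successes"] / total       # first: sentence recognition success rate
    b1 = stats["successful_wakeups"] / total       # second: successful wake-up rate
    b2 = stats["false_wakeups"] / total            # third: false wake-up rate
    c1 = stats["interaction_successes"] / total    # fourth: interaction success rate
    c2 = stats["interaction_failures"] / total     # fifth: misoperation rate
    times = stats["wakeup_response_times"]
    g = sum(times) / len(times)                    # sixth: average wake-up response time
    h = [s / total for s in stats["function_successes"]]  # seventh: per-function rates
    return a, b1, b2, c1, c2, g, h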
S180, calculating a corresponding recognition performance index according to the sentence recognition success rate, the wake-up rate, the interaction recognition rate, the average wake-up response time and the function recognition rate.
In one embodiment, the recognition performance index may be calculated using an eighth calculation formula. (Its three expressions appear only as equation images in the original publication.) In the formula, Y is the recognition performance index, a is the sentence recognition success rate, b1 is the successful wake-up rate, b2 is the false wake-up rate, c1 is the interaction success rate, c2 is the misoperation rate, g is the average wake-up response time, h_i is the function recognition rate corresponding to the i-th function, and k_i is 100 or 0: if the i-th function is present, k_i is 100; otherwise k_i is 0.

S1, S2, S3, S4 and S5 are empirically set values of 100, 90, 80, 70 and 60, respectively.
It can be understood that, in the embodiment of the present invention, the speech recognition performance is quantitatively evaluated through the recognition performance index, so that the quality of the recognition is known, and an evaluation report is automatically generated. The speech recognition model can then be further optimized according to the evaluated performance, further improving its recognition accuracy.
In a second aspect, an embodiment of the present invention provides a speech recognition apparatus based on vehicle-mounted multimode interaction, and referring to fig. 5, the apparatus 100 includes:
the vector forming module 110 is configured to obtain in-vehicle voice data and extract a voice feature vector from the in-vehicle voice data; acquiring face data, lip data and gesture data of people in the vehicle, extracting face feature vectors from the face data, extracting lip feature vectors from the lip data, and extracting gesture feature vectors from the gesture data; acquiring vehicle state data, and extracting a vehicle state feature vector from the vehicle state data;
a coefficient determining module 120, configured to determine a harmonic coefficient corresponding to each of the facial feature vector, the lip feature vector, the gesture feature vector, and the vehicle state feature vector;
the first fusion module 130 is configured to perform multi-mode fusion on the facial feature vector, the lip feature vector, the gesture feature vector, and the vehicle state feature vector according to each harmonic coefficient to obtain a first fusion feature vector;
a second fusion module 140, configured to perform fusion processing on the first fusion feature vector and the voice feature vector to obtain a second fusion feature vector;
and the speech recognition module 150 is configured to input the second fusion feature vector into a pre-trained speech recognition model to obtain a corresponding speech recognition result.
In one embodiment, the coefficient determination module 120 is specifically configured to calculate each harmonic coefficient using the preset equation system described in the first aspect. (The eight equations appear only as images in the original publication.) In the formulas, A_i is the i-th element in the lip feature vector, B_i is the i-th element in the facial feature vector, C_i is the i-th element in the gesture feature vector, and D_i is the i-th element in the vehicle state feature vector; a, b, c and d are the numbers of elements in the lip, facial, gesture and vehicle state feature vectors, respectively; α, β, γ and δ are the harmonic coefficients of the lip, facial, gesture and vehicle state feature vectors, respectively.
In one embodiment, the first fusion module 130 is specifically configured to: and multiplying the facial feature vector, the lip feature vector, the gesture feature vector and the vehicle state feature vector by corresponding harmonic coefficients respectively, and splicing the vectors obtained after multiplication into one vector to obtain the first fusion feature vector.
In one embodiment, the apparatus 100 may further comprise:
the result judging module is used for acquiring the state change data of the vehicle after the vehicle executes the voice command each time, and determining whether the voice recognition result is correct or not according to the state change data;
the first calculation module is used for calculating, after every preset time period, the sentence recognition success rate, wake-up rate, interaction recognition rate, average wake-up response time and function recognition rate corresponding to voice recognition within that preset time period;

and the second calculation module is used for calculating a corresponding recognition performance index according to the sentence recognition success rate, the wake-up rate, the interaction recognition rate, the average wake-up response time and the function recognition rate.
In an embodiment, the first calculation module is specifically configured to calculate the sentence recognition success rate using the first calculation formula: a = number of recognition successes of continuous speech / total number of recognitions, a being the sentence recognition success rate; and/or, the wake-up rate includes a successful wake-up rate and a false wake-up rate, and the first calculation module is specifically configured to calculate the successful wake-up rate using the second calculation formula: b1 = number of successful wake-ups / total number of recognitions, b1 being the successful wake-up rate; and the false wake-up rate using the third calculation formula: b2 = number of false wake-ups / total number of recognitions, b2 being the false wake-up rate.
In an embodiment, the interaction recognition rate includes an interaction success rate and a misoperation rate, and the first calculation module is specifically configured to calculate the interaction success rate using the fourth calculation formula: c1 = number of successful interactions / total number of recognitions, c1 being the interaction success rate; and the misoperation rate using the fifth calculation formula: c2 = number of interaction failures / total number of recognitions, c2 being the misoperation rate. And/or, the first calculation module is specifically configured to calculate the average wake-up response time using the sixth calculation formula: g = (t_1 + t_2 + … + t_X) / X, where g is the average wake-up response time, t_i is the response time of the i-th successful wake-up, and X is the total number of successful wake-ups. And/or, the first calculation module is specifically configured to calculate the function recognition rate corresponding to each function using the seventh calculation formula: h_i = number of successful recognitions of the i-th function / total number of recognitions, h_i being the function recognition rate corresponding to the i-th function.
In an embodiment, the second calculation module is specifically configured to calculate the recognition performance index using the eighth calculation formula. (Its three expressions appear only as equation images in the original publication.) In the formula, Y is the recognition performance index, a is the sentence recognition success rate, b1 is the successful wake-up rate, b2 is the false wake-up rate, c1 is the interaction success rate, c2 is the misoperation rate, g is the average wake-up response time, h_i is the function recognition rate corresponding to the i-th function, and k_i is 100 or 0: if the i-th function is present, k_i is 100; otherwise k_i is 0.
In one embodiment, the second fusion module specifically includes:
the first calculation unit is used for calculating a corresponding inter-class dispersion vector according to a plurality of preset voice recognition samples, calculating an inter-class dispersion matrix according to the inter-class dispersion vector, and selecting the elements of the first f rows and first f columns from the inter-class dispersion matrix to form a transformation matrix, wherein f is a positive integer greater than 1;
a second calculation unit, configured to calculate a diagonal matrix according to the inter-class dispersion vector and the transformation matrix, and calculate a dimension reduction transfer matrix according to the diagonal matrix, the inter-class dispersion vector, and the transformation matrix;
and the third calculation unit is used for respectively performing dimensionality reduction processing on the first fusion feature vector and the voice feature vector according to the dimensionality reduction transfer matrix, and splicing the two vectors subjected to dimensionality reduction processing into one vector to obtain the second fusion feature vector.
In one embodiment, the first calculation unit is specifically configured to calculate the inter-class dispersion vector using the ninth calculation formula (formula image not reproduced), in which μ is the inter-class dispersion vector, m is the number of preset speech recognition samples, c is the category of the preset speech recognition samples, and x_i is any one of the plurality of preset speech recognition samples.
In an embodiment, the first calculation unit is specifically configured to calculate the inter-class scatter matrix using the tenth calculation formula, which builds the matrix from the inter-class dispersion vector and its transpose; T denotes transposition, R is the inter-class scatter matrix, and μ is the inter-class dispersion vector.
In an embodiment, the second calculation unit is specifically configured to calculate the diagonal matrix using the eleventh calculation formula (formula image not reproduced), in which Λ is the diagonal matrix, M is the transformation matrix, μ is the inter-class dispersion vector, and T denotes transposition.
In an embodiment, the second calculation unit is specifically configured to calculate the dimension-reduction transfer matrix using the twelfth calculation formula (formula image not reproduced), in which P is the dimension-reduction transfer matrix, μ is the inter-class dispersion vector, M is the transformation matrix, and Λ is the diagonal matrix.
In an embodiment, the third calculation unit is specifically configured to reduce the first fusion feature vector using the thirteenth calculation formula (formula image not reproduced), where W is the first fusion feature vector, P is the dimension-reduction transfer matrix, and W' is the first fusion feature vector after dimension reduction.
In an embodiment, the third calculation unit is specifically configured to reduce the voice feature vector using the fourteenth calculation formula (formula image not reproduced), where P is the dimension-reduction transfer matrix, E is the voice feature vector, and E' is the voice feature vector after dimension reduction.
It is understood that the apparatus provided by the second aspect corresponds to the method provided by the first aspect, and the explanation, the description, the examples, the embodiments and the like of the related contents in the second aspect can refer to the corresponding parts in the first aspect.
In a third aspect, an embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed in a computer, the computer is caused to execute the method provided in the first aspect.
In a fourth aspect, an embodiment of the present invention provides a computing device, including a memory and a processor, where the memory stores executable codes, and the processor executes the executable codes to implement the method provided in the first aspect.
It is to be understood that the terminology used herein is for the purpose of describing particular embodiments only, and is not intended to limit the scope of the present application. As used in the specification and claims of this application, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly dictates otherwise. The terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional like elements in a process, method, or apparatus comprising that element.
It is further noted that the terms "center," "upper," "lower," "left," "right," "vertical," "horizontal," "inner," "outer," and the like are used in the orientation or positional relationship indicated in the drawings for convenience in describing the invention and for simplicity in description, and do not indicate or imply that the referenced devices or elements must have a particular orientation, be constructed and operated in a particular orientation, and thus should not be construed as limiting the invention. Unless expressly stated or limited otherwise, the terms "mounted," "connected," "coupled," and the like are to be construed broadly and encompass, for example, both fixed and removable coupling or integral coupling; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood in specific cases to those skilled in the art.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the technical solutions of the embodiments of the present invention.

Claims (9)

1. A speech recognition method based on vehicle-mounted multimode interaction is characterized by comprising the following steps:
acquiring in-vehicle voice data, and extracting voice feature vectors from the in-vehicle voice data; acquiring face data, lip data and gesture data of people in the vehicle, extracting a face feature vector from the face data, extracting a lip feature vector from the lip data, and extracting a gesture feature vector from the gesture data; acquiring vehicle state data, and extracting a vehicle state feature vector from the vehicle state data;
determining a harmonic coefficient corresponding to each of the facial feature vector, the lip feature vector, the gesture feature vector, and the vehicle state feature vector;
according to each harmonic coefficient, performing multi-mode fusion on the face feature vector, the lip feature vector, the gesture feature vector and the vehicle state feature vector to obtain a first fusion feature vector;
performing fusion processing on the first fusion feature vector and the voice feature vector to obtain a second fusion feature vector;
inputting the second fusion feature vector into a pre-trained voice recognition model to obtain a corresponding voice recognition result;
the feature fusion processing of the first fusion feature vector and the voice feature vector to obtain a second fusion feature vector comprises:
calculating corresponding inter-class dispersion vectors according to a plurality of preset voice recognition samples, calculating an inter-class dispersion matrix according to the inter-class dispersion vectors, and selecting each element of the first f rows and the first f columns from the inter-class dispersion matrix to form a transformation matrix; wherein f is a positive integer greater than 1;
calculating a diagonal matrix according to the inter-class dispersion vector and the transformation matrix, and calculating a dimension reduction transfer matrix according to the diagonal matrix, the inter-class dispersion vector and the transformation matrix;
and respectively carrying out dimensionality reduction processing on the first fusion feature vector and the voice feature vector according to the dimensionality reduction transfer matrix, and splicing the two vectors subjected to dimensionality reduction processing into one vector to obtain the second fusion feature vector.
2. The method of claim 1, wherein each harmonic coefficient is calculated using a predetermined set of equations, the predetermined set of equations comprising the following equations:
(The eight equations of the predetermined set appear only as equation images in the original publication.) In the formulas, A_i is the i-th element in the lip feature vector, B_i is the i-th element in the facial feature vector, C_i is the i-th element in the gesture feature vector, and D_i is the i-th element in the vehicle state feature vector; a is the number of elements in the lip feature vector, b is the number of elements in the facial feature vector, c is the number of elements in the gesture feature vector, and d is the number of elements in the vehicle state feature vector; α is the harmonic coefficient of the lip feature vector, β is the harmonic coefficient of the facial feature vector, γ is the harmonic coefficient of the gesture feature vector, and δ is the harmonic coefficient of the vehicle state feature vector.
3. The method of claim 1, wherein the multimodal fusing the facial feature vector, the lip feature vector, the gesture feature vector, and the vehicle state feature vector to obtain a first fused feature vector comprises:
and multiplying the facial feature vector, the lip feature vector, the gesture feature vector and the vehicle state feature vector by corresponding harmonic coefficients respectively, and splicing the vectors obtained after multiplication into one vector to obtain the first fusion feature vector.
4. The method of claim 1, further comprising:
after the vehicle executes the voice command each time, acquiring state change data of the vehicle, and determining whether the voice recognition result is correct or not according to the state change data;
after every preset time period, calculating the sentence recognition success rate, wake-up rate, interaction recognition rate, average wake-up response time and function recognition rate corresponding to voice recognition within that preset time period;

and calculating a corresponding recognition performance index according to the sentence recognition success rate, the wake-up rate, the interaction recognition rate, the average wake-up response time and the function recognition rate.
5. The method of claim 4,
calculating the sentence recognition success rate by adopting a first calculation formula: a = number of recognition successes of continuous speech / total number of recognitions, a being the sentence recognition success rate; and/or,
the wake-up rate comprises a successful wake-up rate and a false wake-up rate, and the successful wake-up rate is calculated by adopting a second calculation formula: b1 = number of successful wake-ups / total number of recognitions, b1 being the successful wake-up rate; the false wake-up rate is calculated by adopting a third calculation formula: b2 = number of false wake-ups / total number of recognitions, b2 being the false wake-up rate; and/or,
the interaction identification rate comprises an interaction success rate and a misoperation rate, and the interaction success rate is calculated by adopting a fourth calculation formula, wherein the fourth calculation formula is as follows: c1= successful interaction times/total identification times, c1 being the interaction success rate; calculating the misoperation rate by adopting a fifth calculation formula, wherein the fifth calculation formula is as follows: c2= number of interactive failures/total number of identification times, c2 is the misoperation rate; calculating the wake-up average response time by using a sixth calculation formula, wherein the sixth calculation formula is as follows:
Figure 572938DEST_PATH_IMAGE017
wherein g is the wake-up average response time,
Figure 402353DEST_PATH_IMAGE018
the response time of the ith successful awakening is X, and the X is the total number of successful awakening; and/or the presence of a gas in the gas,
calculating the function identification rate corresponding to each function by adopting a seventh calculation formula, wherein the seventh calculation formula is as follows:
Figure 727156DEST_PATH_IMAGE019
= number of successful identifications for ith function/total number of identifications,
Figure 34640DEST_PATH_IMAGE020
the function identification rate corresponding to the ith function; and/or the presence of a gas in the atmosphere,
calculating the recognition performance index by using an eighth calculation formula, wherein the eighth calculation formula comprises:
Figure 394077DEST_PATH_IMAGE021
Figure 659974DEST_PATH_IMAGE022
Figure 737651DEST_PATH_IMAGE023
wherein Y is the recognition performance index, a is the sentence recognition success rate, b1 is the successful awakening rate, b2 is the false awakening rate, c1 is the interaction success rate, c2 is the false operation rate, g is the awakening average response time,
Figure 377055DEST_PATH_IMAGE024
the function recognition rate corresponding to the ith function,
Figure 856578DEST_PATH_IMAGE025
is 100 or 0, if the ith function is present
Figure 293376DEST_PATH_IMAGE026
Is 100, otherwise
Figure 858349DEST_PATH_IMAGE027
Is 0.
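The first through seventh formulas reduce to simple counter ratios and one average. A sketch, assuming a hypothetical dictionary of counters accumulated over one preset time period; the eighth formula that combines these into the index Y is image-only and is therefore not reproduced:

```python
import numpy as np

def claim5_metrics(log):
    # `log` is a hypothetical dict of counters for one evaluation window.
    n = log["total_recognitions"]
    a  = log["sentence_successes"]    / n  # first formula
    b1 = log["successful_wakeups"]    / n  # second formula
    b2 = log["false_wakeups"]         / n  # third formula
    c1 = log["interaction_successes"] / n  # fourth formula
    c2 = log["interaction_failures"]  / n  # fifth formula
    g  = float(np.mean(log["wakeup_response_times"]))  # sixth formula
    e  = [s / n for s in log["function_successes"]]    # seventh formula, per function
    return {"a": a, "b1": b1, "b2": b2, "c1": c1, "c2": c2, "g": g, "e": e}
```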
6. The method of claim 1, wherein:
the inter-class dispersion vector is calculated using a ninth calculation formula (reproduced only as an image in the source publication), in which $\mu$ is the inter-class dispersion vector, m is the number of predetermined speech recognition samples, c is the number of categories of the predetermined speech recognition samples, and x is any one of the predetermined speech recognition samples;
or, the inter-class dispersion matrix is calculated using a tenth calculation formula:

$$R = \mu \mu^{\mathrm{T}}$$

where R is the inter-class dispersion matrix, $\mu$ is the inter-class dispersion vector, and the superscript T denotes transposition;
or, the diagonal matrix is calculated using an eleventh calculation formula (reproduced only as an image in the source publication), in which D is the diagonal matrix, M is the transformation matrix, $\mu$ is the inter-class dispersion vector, and T denotes transposition;
or, the dimension-reduction transfer matrix is calculated using a twelfth calculation formula (reproduced only as an image in the source publication), in which P is the dimension-reduction transfer matrix, $\mu$ is the inter-class dispersion vector, M is the transformation matrix, and D is the diagonal matrix;
or, the first fusion feature vector is reduced in dimensionality using a thirteenth calculation formula:

$$\widetilde{W} = P^{\mathrm{T}} W$$

where W is the first fusion feature vector, P is the dimension-reduction transfer matrix, and $\widetilde{W}$ is the first fusion feature vector after dimensionality reduction;
or, the speech feature vector is reduced in dimensionality using a fourteenth calculation formula:

$$\widetilde{E} = P^{\mathrm{T}} E$$

where P is the dimension-reduction transfer matrix, E is the speech feature vector, and $\widetilde{E}$ is the speech feature vector after dimensionality reduction.
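Of the claim-6 formulas, the tenth, thirteenth and fourteenth admit a standard linear-algebra reading. A sketch under the assumption that the dispersion matrix is the rank-one outer product and that dimension reduction is the usual transposed projection; the ninth, eleventh and twelfth formulas are image-only, so the construction of P itself is not shown:

```python
import numpy as np

def interclass_dispersion_matrix(mu):
    # Tenth formula (assumed reading): R = mu mu^T, the outer product of
    # the inter-class dispersion vector with itself.
    mu = np.asarray(mu, dtype=float).reshape(-1, 1)
    return mu @ mu.T

def project(P, v):
    # Thirteenth/fourteenth formulas (assumed reading): v_reduced = P^T v,
    # applied to the first fusion feature vector W and the speech
    # feature vector E respectively.
    return np.asarray(P).T @ np.asarray(v, dtype=float)
```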
7. A speech recognition device based on vehicle-mounted multimode interaction is characterized by comprising:
the vector forming module is used for acquiring in-vehicle voice data and extracting a speech feature vector from the in-vehicle voice data; acquiring facial data, lip data and gesture data of people in the vehicle, extracting a facial feature vector from the facial data, a lip feature vector from the lip data, and a gesture feature vector from the gesture data; and acquiring vehicle state data and extracting a vehicle state feature vector from the vehicle state data;
a coefficient determination module for determining a harmonic coefficient corresponding to each of the facial feature vector, the lip feature vector, the gesture feature vector, and the vehicle state feature vector;
the first fusion module is used for performing multi-mode fusion on the face feature vector, the lip feature vector, the gesture feature vector and the vehicle state feature vector according to each harmonic coefficient to obtain a first fusion feature vector;
the second fusion module is used for carrying out fusion processing on the first fusion feature vector and the voice feature vector to obtain a second fusion feature vector;
the voice recognition module is used for inputting the second fusion feature vector into a pre-trained voice recognition model to obtain a corresponding voice recognition result;
the second fusion module is specifically configured to:
calculate the corresponding inter-class dispersion vector from a plurality of predetermined speech recognition samples, calculate the inter-class dispersion matrix from the inter-class dispersion vector, and select the elements in the first f rows and first f columns of the inter-class dispersion matrix to form a transformation matrix, where f is a positive integer greater than 1;
calculate a diagonal matrix from the inter-class dispersion vector and the transformation matrix, and calculate a dimension-reduction transfer matrix from the diagonal matrix, the inter-class dispersion vector and the transformation matrix; and
reduce the dimensionality of the first fusion feature vector and of the speech feature vector according to the dimension-reduction transfer matrix, and concatenate the two reduced vectors into a single vector to obtain the second fusion feature vector.
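Putting the second fusion module's three steps together, a sketch that reuses the projection convention assumed above; because the eleventh and twelfth formulas are image-only, the derivation of the transfer matrix is stubbed out as a caller-supplied `build_transfer_matrix` function, which is an assumption rather than the claimed computation:

```python
import numpy as np

def second_fusion(W, E, R, f, build_transfer_matrix):
    # Step 1: take the first f rows and first f columns of the
    # inter-class dispersion matrix R as the transformation matrix M.
    M = np.asarray(R, dtype=float)[:f, :f]
    # Step 2: derive the dimension-reduction transfer matrix P from M.
    # The claimed eleventh/twelfth formulas are image-only, so this
    # derivation is delegated to the caller; it must return an ndarray.
    P = build_transfer_matrix(M)
    # Step 3: reduce both vectors with P and concatenate the results
    # into the second fusion feature vector.
    W_red = P.T @ np.asarray(W, dtype=float)
    E_red = P.T @ np.asarray(E, dtype=float)
    return np.concatenate([W_red, E_red])
```

Any stand-in for `build_transfer_matrix` must return a matrix whose row dimension matches the vectors being projected; as written, the sketch assumes W and E share the input dimensionality of P.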
8. A computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of any one of claims 1 to 6.
9. A computing device comprising a memory storing executable code and a processor which, when executing the executable code, implements the method of any one of claims 1 to 6.
CN202211359138.1A 2022-11-02 2022-11-02 Voice recognition method, device, medium and equipment based on vehicle-mounted multimode interaction Active CN115410561B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211359138.1A CN115410561B (en) 2022-11-02 2022-11-02 Voice recognition method, device, medium and equipment based on vehicle-mounted multimode interaction


Publications (2)

Publication Number Publication Date
CN115410561A CN115410561A (en) 2022-11-29
CN115410561B true CN115410561B (en) 2023-02-17

Family

ID=84169289

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211359138.1A Active CN115410561B (en) 2022-11-02 2022-11-02 Voice recognition method, device, medium and equipment based on vehicle-mounted multimode interaction

Country Status (1)

Country Link
CN (1) CN115410561B (en)


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104408440A (en) * 2014-12-10 2015-03-11 重庆邮电大学 Identification method for human facial expression based on two-step dimensionality reduction and parallel feature fusion
CN106127156A (en) * 2016-06-27 2016-11-16 上海元趣信息技术有限公司 Robot interactive method based on vocal print and recognition of face
CN107239749A (en) * 2017-05-17 2017-10-10 广西科技大学鹿山学院 A kind of face spatial pattern recognition method
WO2022033556A1 (en) * 2020-08-14 2022-02-17 华为技术有限公司 Electronic device and speech recognition method therefor, and medium
CN114141230A (en) * 2020-08-14 2022-03-04 华为终端有限公司 Electronic device, and voice recognition method and medium thereof
CN112053690A (en) * 2020-09-22 2020-12-08 湖南大学 Cross-modal multi-feature fusion audio and video voice recognition method and system
CN113591659A (en) * 2021-07-23 2021-11-02 重庆长安汽车股份有限公司 Gesture control intention recognition method and system based on multi-modal input
CN113947127A (en) * 2021-09-15 2022-01-18 复旦大学 Multi-mode emotion recognition method and system for accompanying robot

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Weighted adaptive face recognition based on class matrix and feature fusion; Yang Xin et al.; Journal of Image and Graphics; May 15, 2008 (No. 05); pp. 111-117 *

Also Published As

Publication number Publication date
CN115410561A (en) 2022-11-29

Similar Documents

Publication Publication Date Title
CN110288979B (en) Voice recognition method and device
CN106683680B (en) Speaker recognition method and device, computer equipment and computer readable medium
CN110797010A (en) Question-answer scoring method, device, equipment and storage medium based on artificial intelligence
CN110990543A (en) Intelligent conversation generation method and device, computer equipment and computer storage medium
CN109920414A (en) Nan-machine interrogation's method, apparatus, equipment and storage medium
CN112468659B (en) Quality evaluation method, device, equipment and storage medium applied to telephone customer service
JP7213943B2 (en) Audio processing method, device, device and storage medium for in-vehicle equipment
CN113221580B (en) Semantic rejection method, semantic rejection device, vehicle and medium
CN111401259B (en) Model training method, system, computer readable medium and electronic device
CN110875039A (en) Speech recognition method and apparatus
CN113129867B (en) Training method of voice recognition model, voice recognition method, device and equipment
CN112562723B (en) Pronunciation accuracy determination method and device, storage medium and electronic equipment
CN112992191B (en) Voice endpoint detection method and device, electronic equipment and readable storage medium
CN110287981B (en) Significance detection method and system based on biological heuristic characterization learning
CN115640200A (en) Method and device for evaluating dialog system, electronic equipment and storage medium
CN115512696A (en) Simulation training method and vehicle
CN113128284A (en) Multi-mode emotion recognition method and device
CN114882522A (en) Behavior attribute recognition method and device based on multi-mode fusion and storage medium
CN111340004A (en) Vehicle image recognition method and related device
CN115410561B (en) Voice recognition method, device, medium and equipment based on vehicle-mounted multimode interaction
CN114595692A (en) Emotion recognition method, system and terminal equipment
CN117407507A (en) Event processing method, device, equipment and medium based on large language model
CN116721449A (en) Training method of video recognition model, video recognition method, device and equipment
CN116541507A (en) Visual question-answering method and system based on dynamic semantic graph neural network
CN115985317A (en) Information processing method, information processing apparatus, vehicle, and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant