CN115410561B - Voice recognition method, device, medium and equipment based on vehicle-mounted multimode interaction - Google Patents

Voice recognition method, device, medium and equipment based on vehicle-mounted multimode interaction

Info

Publication number
CN115410561B
CN115410561B (application CN202211359138.1A)
Authority
CN
China
Prior art keywords
feature vector
calculation formula
rate
fusion
vehicle
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211359138.1A
Other languages
Chinese (zh)
Other versions
CN115410561A (en)
Inventor
王增喜
于波
王赟芝
方琳
潘霞
张苏林
宗岩
焦莉莉
韩瑞龙
秦川琪
张莹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sinotruk Data Co ltd
Original Assignee
Sinotruk Data Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sinotruk Data Co ltd filed Critical Sinotruk Data Co ltd
Priority to CN202211359138.1A priority Critical patent/CN115410561B/en
Publication of CN115410561A publication Critical patent/CN115410561A/en
Application granted granted Critical
Publication of CN115410561B publication Critical patent/CN115410561B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 - Movements or behaviour, e.g. gesture recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 - Movements or behaviour, e.g. gesture recognition
    • G06V40/28 - Recognition of hand or arm movements, e.g. recognition of deaf sign language

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Traffic Control Systems (AREA)

Abstract

The invention relates to the field of data processing, and discloses a voice recognition method, a device, a medium and equipment based on vehicle-mounted multimode interaction, which comprise the following steps: acquiring in-vehicle voice data, and extracting voice feature vectors from the in-vehicle voice data; extracting facial feature vectors, lip feature vectors and gesture feature vectors; acquiring vehicle state data, and extracting a vehicle state feature vector from the vehicle state data; determining harmonic coefficients corresponding to the facial feature vector, the lip feature vector, the gesture feature vector and the vehicle state feature vector respectively; performing multi-mode fusion on the facial feature vector, the lip feature vector, the gesture feature vector and the vehicle state feature vector to obtain a first fusion feature vector; performing fusion processing on the first fusion feature vector and the voice feature vector to obtain a second fusion feature vector; and inputting the second fusion feature vector into the voice recognition model to obtain a voice recognition result. The embodiment of the invention can improve the accuracy of vehicle-mounted voice recognition.

Description

Voice recognition method, device, medium and equipment based on vehicle-mounted multimode interaction
Technical Field
The invention relates to the technical field of data processing, and in particular to a voice recognition method, device, medium and equipment based on vehicle-mounted multimode interaction.
Background
With the rise of Internet-of-Vehicles and artificial intelligence technologies, more and more functions are carried on the in-vehicle head unit. The ever-growing functions and increasingly complex interfaces compete for the driver's attention during driving. At the current 'human-machine co-driving' stage, voice interaction technology helps the driver reduce dependence on manual operation of in-vehicle devices and increases driving safety. With the rise of intelligent networking and intelligent cockpit technologies, China has become the largest automobile consumption market, and the experience of human-vehicle interaction scenarios has increasingly become a field in which many manufacturers seek to distinguish themselves. At present, the voice interaction function, as an iconic representation of cockpit intelligence, is combined with various in-vehicle applications and has become a core function of cockpit ecosystem construction: for example, controlling navigation, music, the radio, the air conditioner, the windows and the sunroof by voice, and querying weather, stocks, flights and the like by voice.
Early vehicle-mounted speech recognition was achieved by local recognition on a vehicle-mounted chip; it could recognize only a few fixed words, with very low accuracy. Around 2013, with the popularization of deep neural networks and the application of Internet technology, voice technology expanded to combine with the cloud: the recognition range grew wider and wider, wake-up and barge-in were gradually realized, a large number of in-driving operations became voice-controllable, and dialect recognition and fuzzy-word recognition became more and more accurate. Voice recognition is now indispensable in almost all new vehicles, and voice recognition developers, together with related chip and software-algorithm enterprises, continue to improve its application capability. However, the accuracy of speech recognition still needs to be improved.
Disclosure of Invention
In order to solve at least one of the above technical problems, the invention provides a voice recognition method, device, medium and equipment based on vehicle-mounted multimode interaction.
According to a first aspect, an embodiment of the present invention provides a speech recognition method based on vehicle-mounted multimode interaction, including:
acquiring in-vehicle voice data, and extracting voice feature vectors from the in-vehicle voice data; acquiring face data, lip data and gesture data of people in the vehicle, extracting a face feature vector from the face data, extracting a lip feature vector from the lip data, and extracting a gesture feature vector from the gesture data; acquiring vehicle state data, and extracting a vehicle state feature vector from the vehicle state data;
determining a harmonic coefficient corresponding to each of the facial feature vector, the lip feature vector, the gesture feature vector and the vehicle state feature vector;
according to each harmonic coefficient, performing multi-mode fusion on the face feature vector, the lip feature vector, the gesture feature vector and the vehicle state feature vector to obtain a first fusion feature vector;
performing fusion processing on the first fusion feature vector and the voice feature vector to obtain a second fusion feature vector;
and inputting the second fusion feature vector into a pre-trained voice recognition model to obtain a corresponding voice recognition result.
According to a second aspect, an embodiment of the present invention provides a speech recognition apparatus based on vehicle-mounted multimode interaction, including:
the vector forming module is used for acquiring in-vehicle voice data and extracting voice characteristic vectors from the in-vehicle voice data; acquiring face data, lip data and gesture data of people in the vehicle, extracting face feature vectors from the face data, extracting lip feature vectors from the lip data, and extracting gesture feature vectors from the gesture data; acquiring vehicle state data, and extracting a vehicle state feature vector from the vehicle state data;
a coefficient determination module for determining a harmonic coefficient corresponding to each of the facial feature vector, the lip feature vector, the gesture feature vector, and the vehicle state feature vector;
the first fusion module is used for carrying out multi-mode fusion on the face feature vector, the lip feature vector, the gesture feature vector and the vehicle state feature vector according to each harmonic coefficient to obtain a first fusion feature vector;
the second fusion module is used for carrying out fusion processing on the first fusion feature vector and the voice feature vector to obtain a second fusion feature vector;
and the voice recognition module is used for inputting the second fusion feature vector into a pre-trained voice recognition model to obtain a corresponding voice recognition result.
According to a third aspect, an embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored, which, when executed in a computer, causes the computer to perform the method provided in the first aspect.
According to a fourth aspect, an embodiment of the present invention provides a computing device, including a memory and a processor, where the memory stores executable codes, and the processor executes the executable codes to implement the method provided in the first aspect.
The embodiment of the invention has the following technical effects: acquiring in-vehicle voice data, and extracting voice feature vectors from the in-vehicle voice data; acquiring face data, lip data and gesture data of people in the vehicle, extracting face feature vectors from the face data, extracting lip feature vectors from the lip data, and extracting gesture feature vectors from the gesture data; acquiring vehicle state data, and extracting a vehicle state feature vector from the vehicle state data; determining a harmonic coefficient corresponding to each of the facial feature vector, the lip feature vector, the gesture feature vector, and the vehicle state feature vector; according to each harmonic coefficient, performing multi-mode fusion on the face feature vector, the lip feature vector, the gesture feature vector and the vehicle state feature vector to obtain a first fusion feature vector; performing fusion processing on the first fusion feature vector and the voice feature vector to obtain a second fusion feature vector; and inputting the second fusion feature vector into a pre-trained voice recognition model to obtain a corresponding voice recognition result. Therefore, the vehicle-mounted voice recognition method and device are based on the multi-mode interaction modes such as voice, visual facial expressions, lip movements, gestures and the internal state of the vehicle, vehicle-mounted voice recognition is completed through the multi-mode information fusion technology, and therefore the accuracy of vehicle-mounted voice recognition is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a flow chart illustrating a speech recognition method based on vehicle-mounted multi-modal interaction according to an embodiment of the present invention;
FIG. 2 is a schematic layout of a camera and microphone within a vehicle interior in accordance with one embodiment of the present invention;
FIG. 3 is a flow chart of a speech recognition method based on vehicle-mounted multi-mode interaction according to an embodiment of the invention;
FIG. 4 is a flow diagram illustrating the determination of the execution of a voice command in accordance with one embodiment of the present invention;
fig. 5 is a block diagram of a voice recognition device based on vehicle-mounted multi-mode interaction according to an embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In a first aspect, an embodiment of the present invention provides a speech recognition method based on vehicle-mounted multimode interaction. The method can be executed by any computing device and, with reference to FIGS. 1-3, comprises the following steps S110-S150:
S110, acquiring in-vehicle voice data, and extracting a voice feature vector from the in-vehicle voice data; acquiring face data, lip data and gesture data of people in the vehicle, extracting a facial feature vector from the face data, a lip feature vector from the lip data, and a gesture feature vector from the gesture data; acquiring vehicle state data, and extracting a vehicle state feature vector from the vehicle state data;
In a particular scenario, two cameras and one microphone may be mounted in the vehicle. The layout of the cameras and the microphone inside the vehicle is shown in fig. 2.

Camera 1 captures images or videos of the people in the vehicle, so the face data, lip data and gesture data can conveniently be acquired from them. The driver's speech content is identified from the lip data, emotion is recognized from the face data (including the eyes, mouth and the like), and gesture actions are recognized from the gesture data.

Camera 2 has a 360-degree shooting angle, so it can capture images of the vehicle interior before a person speaks in real time; the interior state data, i.e. the vehicle state data, can then be acquired from these images, which aids voice recognition. Camera 2 can also capture images of the vehicle interior after the person has spoken and the vehicle has executed the voice command, so that changes of the in-vehicle state, for example of the windows, sunroof or screen, can be learned; this helps judge the accuracy of voice recognition and confirm its execution by the head unit. When the in-vehicle screen information is incomplete, log information, namely the main log of ADB (Android Debug Bridge), can be read through the vehicle-mounted USB interface, and the log information and the screen information are combined to jointly judge the execution of the recognized command.

The microphone performs voice acquisition, collecting the speech uttered by the occupants, including speech content and voiceprint; the in-vehicle voice data is thus acquired from the microphone.
It is understood that the vehicle state data in S110 actually refers to the state data of the vehicle interior before the corresponding operation is performed according to the recognized voice command; however, since issuing the voice command and recognizing it take relatively little time, the data may be regarded as the state of the vehicle interior just before the utterance.
It can be understood that after the in-vehicle voice data, face data, lip data, gesture data and vehicle state data are acquired, each is subjected to extraction processing, such as convolution, to obtain the corresponding feature vector.
For example, after feature extraction, the obtained feature vectors are respectively:
lip feature vector A = {A_1, A_2, …, A_i, …, A_a}, with dimension a = 32;
facial feature vector B = {B_1, B_2, …, B_i, …, B_b}, with dimension b = 96;
gesture feature vector C = {C_1, C_2, …, C_i, …, C_c}, with dimension c = 64;
vehicle state feature vector D = {D_1, D_2, …, D_i, …, D_d}, with dimension d = 64;
speech feature vector E = {E_1, E_2, …, E_i, …, E_e}, with dimension e = 128.
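These shapes can be mirrored in a minimal Python sketch; the random vectors below are merely hypothetical stand-ins for the outputs of the extraction step, which the text does not specify beyond "convolution processing":

import numpy as np

# Hypothetical stand-ins for the extraction outputs; only the dimensions
# come from the text above.
rng = np.random.default_rng(0)
A = rng.standard_normal(32)    # lip feature vector,           a = 32
B = rng.standard_normal(96)    # facial feature vector,        b = 96
C = rng.standard_normal(64)    # gesture feature vector,       c = 64
D = rng.standard_normal(64)    # vehicle state feature vector, d = 64
E = rng.standard_normal(128)   # speech feature vector,        e = 128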
S120, determining corresponding harmonic coefficients of the facial feature vector, the lip feature vector, the gesture feature vector and the vehicle state feature vector;
It can be understood that multi-modal features generally differ in their properties and in the way and structure in which they represent information; fusing the feature vectors directly, without processing, often fails to achieve the desired effect, and information extracted by the same extraction method partly overlaps across features. If these feature vectors are simply put together without selection or strategy, the goal of introducing multi-modal features is not achieved and satisfactory results are not obtained. Therefore, the multi-modal features need dimension reduction, so that features of similar types are truly fused and projected onto a lower dimension.
In one embodiment, the harmonic coefficients may be calculated using a preset equation system of eight equations. (The equations appear only as images in the original publication.) In the formulas, A_i is the i-th element in the lip feature vector, B_i is the i-th element in the facial feature vector, C_i is the i-th element in the gesture feature vector, and D_i is the i-th element in the vehicle state feature vector; a is the number of elements in the lip feature vector, b is the number of elements in the facial feature vector, c is the number of elements in the gesture feature vector, and d is the number of elements in the vehicle state feature vector; α is the harmonic coefficient of the lip feature vector, β is the harmonic coefficient of the facial feature vector, γ is the harmonic coefficient of the gesture feature vector, and δ is the harmonic coefficient of the vehicle state feature vector.
The first and second equations align the feature vectors, and the third through sixth equations normalize all of them; the four harmonic coefficients can then be calculated from the system. For example, the calculated α², β², γ² and δ² are 1/8, 3/8, 1/4 and 1/4, respectively.
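The equation system itself is published only as images, but the worked example is consistent with the squared coefficients being proportional to the dimensions of the four vectors (32/256 = 1/8, 96/256 = 3/8, 64/256 = 1/4, 64/256 = 1/4). A minimal Python sketch under that assumption, which is ours rather than the patent's:

import numpy as np

def harmonic_coefficients(a: int, b: int, c: int, d: int):
    # Assumption: squared coefficients proportional to each vector's
    # dimension, which reproduces the example 1/8, 3/8, 1/4, 1/4.
    total = a + b + c + d
    return np.sqrt(np.array([a, b, c, d]) / total)

alpha, beta, gamma, delta = harmonic_coefficients(32, 96, 64, 64)
print(alpha**2, beta**2, gamma**2, delta**2)  # 0.125 0.375 0.25 0.25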
S130, according to each harmonic coefficient, performing multi-mode fusion on the facial feature vector, the lip feature vector, the gesture feature vector and the vehicle state feature vector to obtain a first fusion feature vector;
In one embodiment, S130 may specifically include: multiplying the facial feature vector, the lip feature vector, the gesture feature vector and the vehicle state feature vector by their corresponding harmonic coefficients respectively, and splicing the resulting vectors into one vector to obtain the first fusion feature vector.
For example, the first fusion feature vector W is the concatenation W = {αA_1, …, αA_a, βB_1, …, βB_b, γC_1, …, γC_c, δD_1, …, δD_d}.
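A minimal Python sketch of this weighted splice (the function name is ours):

import numpy as np

def first_fusion(A, B, C, D, alpha, beta, gamma, delta):
    # S130: weight each modality vector by its harmonic coefficient,
    # then splice the four weighted vectors into one.
    return np.concatenate([alpha * A, beta * B, gamma * C, delta * D])

# W = first_fusion(A, B, C, D, alpha, beta, gamma, delta)  # 256-dimensional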
S140, carrying out fusion processing on the first fusion feature vector and the voice feature vector to obtain a second fusion feature vector;
In an embodiment, performing the feature fusion processing on the first fusion feature vector and the voice feature vector to obtain the second fusion feature vector may include S141 to S143:
S141, calculating a corresponding inter-class dispersion vector according to a plurality of preset voice recognition samples, calculating an inter-class dispersion matrix according to the inter-class dispersion vector, and selecting the elements of the first f rows and first f columns from the inter-class dispersion matrix to form a transformation matrix, wherein f is a positive integer greater than 1;
In one embodiment, the inter-class dispersion vector may be calculated using a ninth calculation formula (the formula appears only as an equation image in the original publication), in which μ is the inter-class dispersion vector, m is the number of preset voice recognition samples, c is the category of the preset voice recognition samples, and x_i is any one of the plurality of preset voice recognition samples.
The category refers to the voice function that a preset voice recognition sample serves, such as navigation, vehicle control, playing music, or communication. A preset voice recognition sample comprises a preset voice feature vector and the corresponding voice recognition result.
In one embodiment, the inter-class scatter matrix may be calculated using a tenth calculation formula, which builds the matrix from the inter-class dispersion vector and its transpose; T denotes transposition, R is the inter-class scatter matrix, and μ is the inter-class dispersion vector.
S142, calculating a diagonal matrix according to the inter-class scatter vectors and the transformation matrix, and calculating a dimension reduction transfer matrix according to the diagonal matrix, the inter-class scatter vectors and the transformation matrix;
It will be appreciated that the calculated diagonal matrix has f rows and f columns.
In one embodiment, the diagonal matrix may be calculated using an eleventh calculation formula (formula image not reproduced), in which Λ is the diagonal matrix, M is the transformation matrix, μ is the inter-class dispersion vector, and T denotes transposition.
In one embodiment, the dimension-reduction transfer matrix may be calculated using a twelfth calculation formula (formula image not reproduced), in which P is the dimension-reduction transfer matrix, μ is the inter-class dispersion vector, M is the transformation matrix, and Λ is the diagonal matrix.
S143, according to the dimension-reduction transfer matrix, performing dimension-reduction processing on the first fusion feature vector and the voice feature vector respectively, and splicing the two reduced vectors into one vector to obtain the second fusion feature vector.
In one embodiment, the first fusion feature vector may be reduced using a thirteenth calculation formula that applies the dimension-reduction transfer matrix to it (formula image not reproduced), where W is the first fusion feature vector, P is the dimension-reduction transfer matrix, and W' is the first fusion feature vector after dimension reduction.
In one embodiment, the voice feature vector may likewise be reduced using a fourteenth calculation formula (formula image not reproduced), where P is the dimension-reduction transfer matrix, E is the voice feature vector, and E' is the voice feature vector after dimension reduction.
For example, the second fusion feature vector is obtained by the splicing formula X = {E', W'}, where X is the second fusion feature vector, E' is the voice feature vector after dimension reduction, and W' is the first fusion feature vector after dimension reduction.
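Since the ninth through twelfth formulas are published only as images, the Python sketch below substitutes standard linear-algebra choices (per-class means, an outer-product scatter, an eigen-decomposition) and introduces separate transfer matrices P_w and P_e for the two vectors so the shapes work out; none of these specifics should be read as the patented math:

import numpy as np

def reduction_matrix(samples: np.ndarray, labels: np.ndarray, f: int) -> np.ndarray:
    # S141: inter-class dispersion vector from the per-class sample means,
    # then an outer-product scatter matrix (both forms are assumptions).
    class_means = np.stack([samples[labels == k].mean(axis=0)
                            for k in np.unique(labels)])
    mu = class_means.mean(axis=0)     # inter-class dispersion vector
    R = np.outer(mu, mu)              # inter-class scatter matrix
    M = R[:f, :f]                     # transformation matrix (first f rows/columns)
    # S142: diagonal matrix of leading eigenvalues; the eigenvectors act as
    # the dimension-reduction transfer matrix. M and Lam are kept only to
    # name the quantities the text mentions.
    eigvals, eigvecs = np.linalg.eigh(R)
    Lam = np.diag(eigvals[-f:])       # diagonal matrix, f x f
    return eigvecs[:, -f:]            # P, one column per kept dimension

def second_fusion(W: np.ndarray, E: np.ndarray,
                  P_w: np.ndarray, P_e: np.ndarray) -> np.ndarray:
    # S143: reduce both vectors through their transfer matrices, then splice.
    return np.concatenate([P_e.T @ E, P_w.T @ W])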
S150, inputting the second fusion feature vector into a pre-trained voice recognition model to obtain a corresponding voice recognition result;
The speech recognition model is trained in advance, and the training samples adopted in the training process include a plurality of second fusion feature vectors and a speech recognition result labeled for each second fusion feature vector.

The input of the trained speech recognition model is a second fusion feature vector, and its output is the speech recognition result, which is a character sequence.
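The patent fixes only the model's interface (a second fusion feature vector in, a character sequence out), so any concrete model here would be a guess; a hypothetical linear stand-in for the inference step:

import numpy as np

def recognize(fused_vector: np.ndarray, weight: np.ndarray,
              bias: np.ndarray, commands: list) -> str:
    # Hypothetical stand-in for the pre-trained recognition model.
    logits = weight @ fused_vector + bias     # score each candidate output
    return commands[int(np.argmax(logits))]   # e.g. "open the sunroof"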
Aiming at the problems of low voice recognition rate, false awakening, false recognition and the like of the existing vehicle-mounted voice interaction, the embodiment of the invention is based on multi-mode interaction modes such as voice, visual facial expression, lip movement, gestures and vehicle interior states, and completes vehicle-mounted voice recognition through a multi-mode information fusion technology, so that the accuracy rate of vehicle-mounted voice recognition is improved.
In one embodiment, the method provided in the embodiment of the present invention may further include:
s160, after the vehicle executes the voice command each time, acquiring state change data of the vehicle, and determining whether the voice recognition result is correct or not according to the state change data;
Specifically, referring to fig. 4, camera 2 can capture the in-vehicle state and the head-unit screen after the voice instruction is executed, and the change of the in-vehicle state can be obtained by comparing it against the in-vehicle situation before execution. The main ADB log is read through the vehicle-mounted USB interface to obtain the log entries related to voice execution; the in-vehicle state change and the log information jointly determine the execution of the voice instruction, which improves the accuracy of the voice execution-rate statistics.
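A hedged sketch of the log half of this check ("adb logcat -d" dumps the buffered Android log); the keyword match is an illustrative stand-in for the joint state-change/log judgment:

import subprocess

def voice_command_logged(expected_keyword: str) -> bool:
    # Read the ADB main log over the vehicle-mounted USB interface and look
    # for an entry tied to the executed voice command.
    log = subprocess.run(["adb", "logcat", "-d"],
                         capture_output=True, text=True).stdout
    return expected_keyword in log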
S170, after every preset time period, calculating the sentence recognition success rate, wake-up rate, interaction recognition rate, average wake-up response time and function recognition rate corresponding to voice recognition within that preset time period;
It can be understood that, to verify the voice recognition performance, the embodiment of the invention evaluates performance indexes based on parameters such as the sentence recognition success rate, the wake-up rate, the interaction recognition rate, the average wake-up response time and the function recognition rate.
In one embodiment, the sentence recognition success rate may be calculated using a first calculation formula: a = number of recognition successes of continuous speech / total number of recognitions, where a is the sentence recognition success rate.
The voice recognition model is part of the vehicle-mounted terminal intelligent voice interaction system and is responsible for voice recognition. The system supports command-word recognition and continuous speech recognition, and the sentence recognition success rate evaluates the correct recognition of continuous speech.
In one embodiment, the wake-up rate includes a successful wake-up rate and a false wake-up rate. The successful wake-up rate may be calculated using a second calculation formula: b1 = number of successful wake-ups / total number of recognitions, where b1 is the successful wake-up rate; the false wake-up rate may be calculated using a third calculation formula: b2 = number of false wake-ups / total number of recognitions, where b2 is the false wake-up rate.
The vehicle-mounted terminal intelligent voice interaction system equipped with the voice recognition model supports command-word wake-up services, including user-defined wake-up command words and multiple wake-up command words. The successful wake-up rate evaluates the system's response to wake-up operations, and the false wake-up rate evaluates how often the system is falsely awakened per unit time.
In one embodiment, the interaction recognition rate includes an interaction success rate and a misoperation rate. The interaction success rate may be calculated using a fourth calculation formula: c1 = number of successful interactions / total number of recognitions, where c1 is the interaction success rate; the misoperation rate may be calculated using a fifth calculation formula: c2 = number of interaction failures / total number of recognitions, where c2 is the misoperation rate.
The vehicle-mounted terminal intelligent voice interaction system equipped with the voice recognition model supports the control instructions of the vehicle-mounted terminal and comprehensively covers semantic intent understanding of everyday interaction behaviors. The interaction success rate evaluates the system's correct response to a voice interaction task; interaction tasks include voice recognition, voice wake-up, voice interruption and voice synthesis. If the system completes a voice interaction task within the set number of interaction rounds, the voice interaction succeeds; the interaction success rate and the misoperation rate serve as evaluation indexes.
In one embodiment, the average wake-up response time may be calculated using a sixth calculation formula: g = (t_1 + t_2 + … + t_X) / X, where g is the average wake-up response time, t_i is the response time of the i-th successful wake-up, and X is the total number of successful wake-ups.
For a voice interaction task, the average wake-up response time evaluates the response speed of the vehicle-mounted terminal intelligent voice interaction system. The first wake-up response time (T1) is the difference between the moment the alert tone is first given and the end of the first command input.
In one embodiment, the function recognition rate corresponding to each function may be calculated using a seventh calculation formula: h_i = number of successful recognitions of the i-th function / total number of recognitions, where h_i is the function recognition rate corresponding to the i-th function.

The functions are the voice-command functions, including navigation, music, telephone, radio, air conditioner, vehicle-control devices, information query, chat interaction and the like.
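The first through seventh formulas reduce to simple ratios over counters accumulated during the preset time period; a Python sketch with assumed field names:

def evaluation_metrics(stats: dict):
    # Counters accumulated over the preset time period; field names assumed.
    total = stats["total_recognitions"]
    a  = stats["sentence_successes"] / total       # first: sentence recognition success rate
    b1 = stats["successful_wakeups"] / total       # second: successful wake-up rate
    b2 = stats["false_wakeups"] / total            # third: false wake-up rate
    c1 = stats["interaction_successes"] / total    # fourth: interaction success rate
    c2 = stats["interaction_failures"] / total     # fifth: misoperation rate
    times = stats["wakeup_response_times"]
    g = sum(times) / len(times)                    # sixth: average wake-up response time
    h = [s / total for s in stats["function_successes"]]  # seventh: per-function rates
    return a, b1, b2, c1, c2, g, h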
S180, calculating a corresponding recognition performance index according to the sentence recognition success rate, the wake-up rate, the interaction recognition rate, the average wake-up response time and the function recognition rate.
In one embodiment, the recognition performance index may be calculated using an eighth calculation formula. (Its three expressions appear only as equation images in the original publication.) In the formula, Y is the recognition performance index, a is the sentence recognition success rate, b1 is the successful wake-up rate, b2 is the false wake-up rate, c1 is the interaction success rate, c2 is the misoperation rate, g is the average wake-up response time, h_i is the function recognition rate corresponding to the i-th function, and k_i is 100 or 0: if the i-th function is present, k_i is 100; otherwise k_i is 0.

S1, S2, S3, S4 and S5 are empirically set values of 100, 90, 80, 70 and 60, respectively.
It can be understood that, in the embodiment of the present invention, the speech recognition performance is quantitatively evaluated through the recognition performance index, so that the quality of the recognition is known, and an evaluation report is automatically generated. The speech recognition model can then be further optimized according to the evaluated performance, further improving its recognition accuracy.
In a second aspect, an embodiment of the present invention provides a speech recognition apparatus based on vehicle-mounted multimode interaction, and referring to fig. 5, the apparatus 100 includes:
the vector forming module 110 is configured to obtain in-vehicle voice data and extract a voice feature vector from the in-vehicle voice data; acquiring face data, lip data and gesture data of people in the vehicle, extracting face feature vectors from the face data, extracting lip feature vectors from the lip data, and extracting gesture feature vectors from the gesture data; acquiring vehicle state data, and extracting a vehicle state feature vector from the vehicle state data;
a coefficient determining module 120, configured to determine a harmonic coefficient corresponding to each of the facial feature vector, the lip feature vector, the gesture feature vector, and the vehicle state feature vector;
the first fusion module 130 is configured to perform multi-mode fusion on the facial feature vector, the lip feature vector, the gesture feature vector, and the vehicle state feature vector according to each harmonic coefficient to obtain a first fusion feature vector;
a second fusion module 140, configured to perform fusion processing on the first fusion feature vector and the voice feature vector to obtain a second fusion feature vector;
and the speech recognition module 150 is configured to input the second fusion feature vector into a pre-trained speech recognition model to obtain a corresponding speech recognition result.
In one embodiment, the coefficient determination module 120 is specifically configured to calculate each harmonic coefficient using the preset equation system described in the first aspect. (The eight equations appear only as images in the original publication.) In the formulas, A_i is the i-th element in the lip feature vector, B_i is the i-th element in the facial feature vector, C_i is the i-th element in the gesture feature vector, and D_i is the i-th element in the vehicle state feature vector; a, b, c and d are the numbers of elements in the lip, facial, gesture and vehicle state feature vectors, respectively; α, β, γ and δ are the harmonic coefficients of the lip, facial, gesture and vehicle state feature vectors, respectively.
In one embodiment, the first fusion module 130 is specifically configured to: and multiplying the facial feature vector, the lip feature vector, the gesture feature vector and the vehicle state feature vector by corresponding harmonic coefficients respectively, and splicing the vectors obtained after multiplication into one vector to obtain the first fusion feature vector.
In one embodiment, the apparatus 100 may further comprise:
the result judging module is used for acquiring the state change data of the vehicle after the vehicle executes the voice command each time, and determining whether the voice recognition result is correct or not according to the state change data;
the first calculation module is used for calculating, after every preset time period, the sentence recognition success rate, wake-up rate, interaction recognition rate, average wake-up response time and function recognition rate corresponding to voice recognition within that preset time period;

and the second calculation module is used for calculating a corresponding recognition performance index according to the sentence recognition success rate, the wake-up rate, the interaction recognition rate, the average wake-up response time and the function recognition rate.
In an embodiment, the first calculation module is specifically configured to calculate the sentence recognition success rate using the first calculation formula: a = number of recognition successes of continuous speech / total number of recognitions, a being the sentence recognition success rate; and/or, the wake-up rate includes a successful wake-up rate and a false wake-up rate, and the first calculation module is specifically configured to calculate the successful wake-up rate using the second calculation formula: b1 = number of successful wake-ups / total number of recognitions, b1 being the successful wake-up rate; and the false wake-up rate using the third calculation formula: b2 = number of false wake-ups / total number of recognitions, b2 being the false wake-up rate.
In an embodiment, the interaction recognition rate includes an interaction success rate and a misoperation rate, and the first calculation module is specifically configured to calculate the interaction success rate using the fourth calculation formula: c1 = number of successful interactions / total number of recognitions, c1 being the interaction success rate; and the misoperation rate using the fifth calculation formula: c2 = number of interaction failures / total number of recognitions, c2 being the misoperation rate. And/or, the first calculation module is specifically configured to calculate the average wake-up response time using the sixth calculation formula: g = (t_1 + t_2 + … + t_X) / X, where g is the average wake-up response time, t_i is the response time of the i-th successful wake-up, and X is the total number of successful wake-ups. And/or, the first calculation module is specifically configured to calculate the function recognition rate corresponding to each function using the seventh calculation formula: h_i = number of successful recognitions of the i-th function / total number of recognitions, h_i being the function recognition rate corresponding to the i-th function.
In an embodiment, the second calculation module is specifically configured to calculate the recognition performance index using the eighth calculation formula. (Its three expressions appear only as equation images in the original publication.) In the formula, Y is the recognition performance index, a is the sentence recognition success rate, b1 is the successful wake-up rate, b2 is the false wake-up rate, c1 is the interaction success rate, c2 is the misoperation rate, g is the average wake-up response time, h_i is the function recognition rate corresponding to the i-th function, and k_i is 100 or 0: if the i-th function is present, k_i is 100; otherwise k_i is 0.
In one embodiment, the second fusion module specifically includes:
the first calculation unit is used for calculating a corresponding inter-class dispersion vector according to a plurality of preset voice recognition samples, calculating an inter-class dispersion matrix according to the inter-class dispersion vector, and selecting the elements of the first f rows and first f columns from the inter-class dispersion matrix to form a transformation matrix, wherein f is a positive integer greater than 1;
a second calculation unit, configured to calculate a diagonal matrix according to the inter-class dispersion vector and the transformation matrix, and calculate a dimension reduction transfer matrix according to the diagonal matrix, the inter-class dispersion vector, and the transformation matrix;
and the third calculation unit is used for respectively performing dimensionality reduction processing on the first fusion feature vector and the voice feature vector according to the dimensionality reduction transfer matrix, and splicing the two vectors subjected to dimensionality reduction processing into one vector to obtain the second fusion feature vector.
In one embodiment, the first calculation unit is specifically configured to calculate the inter-class dispersion vector using the ninth calculation formula (formula image not reproduced), in which μ is the inter-class dispersion vector, m is the number of preset speech recognition samples, c is the category of the preset speech recognition samples, and x_i is any one of the plurality of preset speech recognition samples.
In an embodiment, the first calculation unit is specifically configured to calculate the inter-class scatter matrix using the tenth calculation formula, which builds the matrix from the inter-class dispersion vector and its transpose; T denotes transposition, R is the inter-class scatter matrix, and μ is the inter-class dispersion vector.
In an embodiment, the second calculation unit is specifically configured to calculate the diagonal matrix using the eleventh calculation formula (formula image not reproduced), in which Λ is the diagonal matrix, M is the transformation matrix, μ is the inter-class dispersion vector, and T denotes transposition.
In an embodiment, the second calculation unit is specifically configured to calculate the dimension-reduction transfer matrix using the twelfth calculation formula (formula image not reproduced), in which P is the dimension-reduction transfer matrix, μ is the inter-class dispersion vector, M is the transformation matrix, and Λ is the diagonal matrix.
In an embodiment, the third calculation unit is specifically configured to reduce the first fusion feature vector using the thirteenth calculation formula (formula image not reproduced), where W is the first fusion feature vector, P is the dimension-reduction transfer matrix, and W' is the first fusion feature vector after dimension reduction.
In an embodiment, the third calculation unit is specifically configured to reduce the voice feature vector using the fourteenth calculation formula (formula image not reproduced), where P is the dimension-reduction transfer matrix, E is the voice feature vector, and E' is the voice feature vector after dimension reduction.
It is understood that the apparatus provided by the second aspect corresponds to the method provided by the first aspect, and the explanation, the description, the examples, the embodiments and the like of the related contents in the second aspect can refer to the corresponding parts in the first aspect.
In a third aspect, an embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed in a computer, the computer is caused to execute the method provided in the first aspect.
In a fourth aspect, an embodiment of the present invention provides a computing device, including a memory and a processor, where the memory stores executable codes, and the processor executes the executable codes to implement the method provided in the first aspect.
It is to be understood that the terminology used herein is for the purpose of describing particular embodiments only, and is not intended to limit the scope of the present application. As used in the specification and claims of this application, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly dictates otherwise. The terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional like elements in a process, method, or apparatus comprising that element.
It is further noted that the terms "center," "upper," "lower," "left," "right," "vertical," "horizontal," "inner," "outer," and the like are used in the orientation or positional relationship indicated in the drawings for convenience in describing the invention and for simplicity in description, and do not indicate or imply that the referenced devices or elements must have a particular orientation, be constructed and operated in a particular orientation, and thus should not be construed as limiting the invention. Unless expressly stated or limited otherwise, the terms "mounted," "connected," "coupled," and the like are to be construed broadly and encompass, for example, both fixed and removable coupling or integral coupling; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood in specific cases to those skilled in the art.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the technical solutions of the embodiments of the present invention.

Claims (9)

1. A speech recognition method based on vehicle-mounted multimode interaction is characterized by comprising the following steps:
acquiring in-vehicle voice data, and extracting voice feature vectors from the in-vehicle voice data; acquiring face data, lip data and gesture data of people in the vehicle, extracting a face feature vector from the face data, extracting a lip feature vector from the lip data, and extracting a gesture feature vector from the gesture data; acquiring vehicle state data, and extracting a vehicle state feature vector from the vehicle state data;
determining a harmonic coefficient corresponding to each of the facial feature vector, the lip feature vector, the gesture feature vector, and the vehicle state feature vector;
according to each harmonic coefficient, performing multi-mode fusion on the face feature vector, the lip feature vector, the gesture feature vector and the vehicle state feature vector to obtain a first fusion feature vector;
performing fusion processing on the first fusion feature vector and the voice feature vector to obtain a second fusion feature vector;
inputting the second fusion feature vector into a pre-trained voice recognition model to obtain a corresponding voice recognition result;
the feature fusion processing of the first fusion feature vector and the voice feature vector to obtain a second fusion feature vector comprises:
calculating corresponding inter-class dispersion vectors according to a plurality of preset voice recognition samples, calculating an inter-class dispersion matrix according to the inter-class dispersion vectors, and selecting each element of the first f rows and the first f columns from the inter-class dispersion matrix to form a transformation matrix; wherein f is a positive integer greater than 1;
calculating a diagonal matrix according to the inter-class dispersion vector and the transformation matrix, and calculating a dimension reduction transfer matrix according to the diagonal matrix, the inter-class dispersion vector and the transformation matrix;
and respectively carrying out dimensionality reduction processing on the first fusion feature vector and the voice feature vector according to the dimensionality reduction transfer matrix, and splicing the two vectors subjected to dimensionality reduction processing into one vector to obtain the second fusion feature vector.
2. The method of claim 1, wherein each harmonic coefficient is calculated using a predetermined set of equations, the predetermined set of equations comprising the following equations:
(The eight equations of the predetermined set appear only as equation images in the original publication.) In the formulas, A_i is the i-th element in the lip feature vector, B_i is the i-th element in the facial feature vector, C_i is the i-th element in the gesture feature vector, and D_i is the i-th element in the vehicle state feature vector; a is the number of elements in the lip feature vector, b is the number of elements in the facial feature vector, c is the number of elements in the gesture feature vector, and d is the number of elements in the vehicle state feature vector; α is the harmonic coefficient of the lip feature vector, β is the harmonic coefficient of the facial feature vector, γ is the harmonic coefficient of the gesture feature vector, and δ is the harmonic coefficient of the vehicle state feature vector.
3. The method of claim 1, wherein the multimodal fusing the facial feature vector, the lip feature vector, the gesture feature vector, and the vehicle state feature vector to obtain a first fused feature vector comprises:
and multiplying the facial feature vector, the lip feature vector, the gesture feature vector and the vehicle state feature vector by corresponding harmonic coefficients respectively, and splicing the vectors obtained after multiplication into one vector to obtain the first fusion feature vector.
4. The method of claim 1, further comprising:
after the vehicle executes the voice command each time, acquiring state change data of the vehicle, and determining whether the voice recognition result is correct or not according to the state change data;
after every preset time period, calculating the sentence recognition success rate, wake-up rate, interaction recognition rate, average wake-up response time and function recognition rate corresponding to voice recognition within that preset time period;

and calculating a corresponding recognition performance index according to the sentence recognition success rate, the wake-up rate, the interaction recognition rate, the average wake-up response time and the function recognition rate.
5. The method of claim 4,
calculating the sentence recognition success rate by adopting a first calculation formula: a = number of recognition successes of continuous speech / total number of recognitions, a being the sentence recognition success rate; and/or,
the wake-up rate comprises a successful wake-up rate and a false wake-up rate, and the successful wake-up rate is calculated by adopting a second calculation formula: b1 = number of successful wake-ups / total number of recognitions, b1 being the successful wake-up rate; the false wake-up rate is calculated by adopting a third calculation formula: b2 = number of false wake-ups / total number of recognitions, b2 being the false wake-up rate; and/or,
the interaction identification rate comprises an interaction success rate and a misoperation rate, and the interaction success rate is calculated by adopting a fourth calculation formula, wherein the fourth calculation formula is as follows: c1= successful interaction times/total identification times, c1 being the interaction success rate; calculating the misoperation rate by adopting a fifth calculation formula, wherein the fifth calculation formula is as follows: c2= number of interactive failures/total number of identification times, c2 is the misoperation rate; calculating the wake-up average response time by using a sixth calculation formula, wherein the sixth calculation formula is as follows:
Figure 572938DEST_PATH_IMAGE017
wherein g is the wake-up average response time,
Figure 402353DEST_PATH_IMAGE018
the response time of the ith successful awakening is X, and the X is the total number of successful awakening; and/or the presence of a gas in the gas,
calculating the function identification rate corresponding to each function by adopting a seventh calculation formula, wherein the seventh calculation formula is as follows:
Figure 727156DEST_PATH_IMAGE019
= number of successful identifications for ith function/total number of identifications,
Figure 34640DEST_PATH_IMAGE020
the function identification rate corresponding to the ith function; and/or the presence of a gas in the atmosphere,
calculating the recognition performance index by using an eighth calculation formula, wherein the eighth calculation formula comprises:
Figure 394077DEST_PATH_IMAGE021
Figure 659974DEST_PATH_IMAGE022
Figure 737651DEST_PATH_IMAGE023
wherein Y is the recognition performance index, a is the sentence recognition success rate, b1 is the successful awakening rate, b2 is the false awakening rate, c1 is the interaction success rate, c2 is the false operation rate, g is the awakening average response time,
Figure 377055DEST_PATH_IMAGE024
the function recognition rate corresponding to the ith function,
Figure 856578DEST_PATH_IMAGE025
is 100 or 0, if the ith function is present
Figure 293376DEST_PATH_IMAGE026
Is 100, otherwise
Figure 858349DEST_PATH_IMAGE027
Is 0.
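The first through seventh formulas reduce to simple counter ratios and one average. A sketch, assuming a hypothetical dictionary of counters accumulated over one preset time period; the eighth formula that combines these into the index Y is image-only and is therefore not reproduced:

```python
import numpy as np

def claim5_metrics(log):
    # `log` is a hypothetical dict of counters for one evaluation window.
    n = log["total_recognitions"]
    a  = log["sentence_successes"]    / n  # first formula
    b1 = log["successful_wakeups"]    / n  # second formula
    b2 = log["false_wakeups"]         / n  # third formula
    c1 = log["interaction_successes"] / n  # fourth formula
    c2 = log["interaction_failures"]  / n  # fifth formula
    g  = float(np.mean(log["wakeup_response_times"]))  # sixth formula
    e  = [s / n for s in log["function_successes"]]    # seventh formula, per function
    return {"a": a, "b1": b1, "b2": b2, "c1": c1, "c2": c2, "g": g, "e": e}
```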
6. The method of claim 1, wherein:
the inter-class dispersion vector is calculated using a ninth calculation formula (reproduced only as an image in the source publication), in which $\mu$ is the inter-class dispersion vector, m is the number of predetermined speech recognition samples, c is the number of categories of the predetermined speech recognition samples, and x is any one of the predetermined speech recognition samples;
or, the inter-class dispersion matrix is calculated using a tenth calculation formula:

$$R = \mu \mu^{\mathrm{T}}$$

where R is the inter-class dispersion matrix, $\mu$ is the inter-class dispersion vector, and the superscript T denotes transposition;
or, the diagonal matrix is calculated using an eleventh calculation formula (reproduced only as an image in the source publication), in which D is the diagonal matrix, M is the transformation matrix, $\mu$ is the inter-class dispersion vector, and T denotes transposition;
or, the dimension-reduction transfer matrix is calculated using a twelfth calculation formula (reproduced only as an image in the source publication), in which P is the dimension-reduction transfer matrix, $\mu$ is the inter-class dispersion vector, M is the transformation matrix, and D is the diagonal matrix;
or, the first fusion feature vector is reduced in dimensionality using a thirteenth calculation formula:

$$\widetilde{W} = P^{\mathrm{T}} W$$

where W is the first fusion feature vector, P is the dimension-reduction transfer matrix, and $\widetilde{W}$ is the first fusion feature vector after dimensionality reduction;
or, the speech feature vector is reduced in dimensionality using a fourteenth calculation formula:

$$\widetilde{E} = P^{\mathrm{T}} E$$

where P is the dimension-reduction transfer matrix, E is the speech feature vector, and $\widetilde{E}$ is the speech feature vector after dimensionality reduction.
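Of the claim-6 formulas, the tenth, thirteenth and fourteenth admit a standard linear-algebra reading. A sketch under the assumption that the dispersion matrix is the rank-one outer product and that dimension reduction is the usual transposed projection; the ninth, eleventh and twelfth formulas are image-only, so the construction of P itself is not shown:

```python
import numpy as np

def interclass_dispersion_matrix(mu):
    # Tenth formula (assumed reading): R = mu mu^T, the outer product of
    # the inter-class dispersion vector with itself.
    mu = np.asarray(mu, dtype=float).reshape(-1, 1)
    return mu @ mu.T

def project(P, v):
    # Thirteenth/fourteenth formulas (assumed reading): v_reduced = P^T v,
    # applied to the first fusion feature vector W and the speech
    # feature vector E respectively.
    return np.asarray(P).T @ np.asarray(v, dtype=float)
```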
7. A speech recognition device based on vehicle-mounted multimode interaction is characterized by comprising:
the vector forming module is used for acquiring in-vehicle voice data and extracting a speech feature vector from the in-vehicle voice data; acquiring facial data, lip data and gesture data of people in the vehicle, extracting a facial feature vector from the facial data, a lip feature vector from the lip data, and a gesture feature vector from the gesture data; and acquiring vehicle state data and extracting a vehicle state feature vector from the vehicle state data;
a coefficient determination module for determining a harmonic coefficient corresponding to each of the facial feature vector, the lip feature vector, the gesture feature vector, and the vehicle state feature vector;
the first fusion module is used for performing multi-mode fusion on the face feature vector, the lip feature vector, the gesture feature vector and the vehicle state feature vector according to each harmonic coefficient to obtain a first fusion feature vector;
the second fusion module is used for carrying out fusion processing on the first fusion feature vector and the voice feature vector to obtain a second fusion feature vector;
the voice recognition module is used for inputting the second fusion feature vector into a pre-trained voice recognition model to obtain a corresponding voice recognition result;
the second fusion module is specifically configured to:
calculate the corresponding inter-class dispersion vector from a plurality of predetermined speech recognition samples, calculate the inter-class dispersion matrix from the inter-class dispersion vector, and select the elements in the first f rows and first f columns of the inter-class dispersion matrix to form a transformation matrix, where f is a positive integer greater than 1;
calculate a diagonal matrix from the inter-class dispersion vector and the transformation matrix, and calculate a dimension-reduction transfer matrix from the diagonal matrix, the inter-class dispersion vector and the transformation matrix; and
reduce the dimensionality of the first fusion feature vector and of the speech feature vector according to the dimension-reduction transfer matrix, and concatenate the two reduced vectors into a single vector to obtain the second fusion feature vector.
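Putting the second fusion module's three steps together, a sketch that reuses the projection convention assumed above; because the eleventh and twelfth formulas are image-only, the derivation of the transfer matrix is stubbed out as a caller-supplied `build_transfer_matrix` function, which is an assumption rather than the claimed computation:

```python
import numpy as np

def second_fusion(W, E, R, f, build_transfer_matrix):
    # Step 1: take the first f rows and first f columns of the
    # inter-class dispersion matrix R as the transformation matrix M.
    M = np.asarray(R, dtype=float)[:f, :f]
    # Step 2: derive the dimension-reduction transfer matrix P from M.
    # The claimed eleventh/twelfth formulas are image-only, so this
    # derivation is delegated to the caller; it must return an ndarray.
    P = build_transfer_matrix(M)
    # Step 3: reduce both vectors with P and concatenate the results
    # into the second fusion feature vector.
    W_red = P.T @ np.asarray(W, dtype=float)
    E_red = P.T @ np.asarray(E, dtype=float)
    return np.concatenate([W_red, E_red])
```

Any stand-in for `build_transfer_matrix` must return a matrix whose row dimension matches the vectors being projected; as written, the sketch assumes W and E share the input dimensionality of P.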
8. A computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of any one of claims 1 to 6.
9. A computing device comprising a memory storing executable code and a processor which, when executing the executable code, implements the method of any one of claims 1 to 6.
CN202211359138.1A 2022-11-02 2022-11-02 Voice recognition method, device, medium and equipment based on vehicle-mounted multimode interaction Active CN115410561B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211359138.1A CN115410561B (en) 2022-11-02 2022-11-02 Voice recognition method, device, medium and equipment based on vehicle-mounted multimode interaction


Publications (2)

Publication Number Publication Date
CN115410561A CN115410561A (en) 2022-11-29
CN115410561B true CN115410561B (en) 2023-02-17

Family

ID=84169289

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211359138.1A Active CN115410561B (en) 2022-11-02 2022-11-02 Voice recognition method, device, medium and equipment based on vehicle-mounted multimode interaction

Country Status (1)

Country Link
CN (1) CN115410561B (en)


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104408440A (en) * 2014-12-10 2015-03-11 重庆邮电大学 Identification method for human facial expression based on two-step dimensionality reduction and parallel feature fusion
CN106127156A (en) * 2016-06-27 2016-11-16 上海元趣信息技术有限公司 Robot interactive method based on vocal print and recognition of face
CN107239749A (en) * 2017-05-17 2017-10-10 广西科技大学鹿山学院 A kind of face spatial pattern recognition method
WO2022033556A1 (en) * 2020-08-14 2022-02-17 华为技术有限公司 Electronic device and speech recognition method therefor, and medium
CN114141230A (en) * 2020-08-14 2022-03-04 华为终端有限公司 Electronic device, and voice recognition method and medium thereof
CN112053690A (en) * 2020-09-22 2020-12-08 湖南大学 Cross-modal multi-feature fusion audio and video voice recognition method and system
CN113591659A (en) * 2021-07-23 2021-11-02 重庆长安汽车股份有限公司 Gesture control intention recognition method and system based on multi-modal input
CN113947127A (en) * 2021-09-15 2022-01-18 复旦大学 Multi-mode emotion recognition method and system for accompanying robot

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Weighted adaptive face recognition based on class matrix and feature fusion; Yang Xin et al.; Journal of Image and Graphics; May 15, 2008 (No. 05); pp. 111-117 *

Also Published As

Publication number Publication date
CN115410561A (en) 2022-11-29

Similar Documents

Publication Publication Date Title
CN110288979B (en) Voice recognition method and device
CN106683680B (en) Speaker recognition method and device, computer equipment and computer readable medium
CN110797010A (en) Question-answer scoring method, device, equipment and storage medium based on artificial intelligence
CN110990543A (en) Intelligent conversation generation method and device, computer equipment and computer storage medium
CN109920414A (en) Nan-machine interrogation's method, apparatus, equipment and storage medium
CN112468659B (en) Quality evaluation method, device, equipment and storage medium applied to telephone customer service
JP7213943B2 (en) Audio processing method, device, device and storage medium for in-vehicle equipment
CN113221580B (en) Semantic rejection method, semantic rejection device, vehicle and medium
CN111401259B (en) Model training method, system, computer readable medium and electronic device
CN110875039A (en) Speech recognition method and apparatus
CN113129867B (en) Training method of voice recognition model, voice recognition method, device and equipment
CN112562723B (en) Pronunciation accuracy determination method and device, storage medium and electronic equipment
CN112992191B (en) Voice endpoint detection method and device, electronic equipment and readable storage medium
CN110287981B (en) Significance detection method and system based on biological heuristic characterization learning
CN115640200A (en) Method and device for evaluating dialog system, electronic equipment and storage medium
CN115512696A (en) Simulation training method and vehicle
CN113128284A (en) Multi-mode emotion recognition method and device
CN114882522A (en) Behavior attribute recognition method and device based on multi-mode fusion and storage medium
CN111340004A (en) Vehicle image recognition method and related device
CN115410561B (en) Voice recognition method, device, medium and equipment based on vehicle-mounted multimode interaction
CN114595692A (en) Emotion recognition method, system and terminal equipment
CN117407507A (en) Event processing method, device, equipment and medium based on large language model
CN116721449A (en) Training method of video recognition model, video recognition method, device and equipment
CN116541507A (en) Visual question-answering method and system based on dynamic semantic graph neural network
CN115985317A (en) Information processing method, information processing apparatus, vehicle, and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant