CN112017633B - Speech recognition method, device, storage medium and electronic equipment - Google Patents

Speech recognition method, device, storage medium and electronic equipment

Info

Publication number
CN112017633B
CN112017633B
Authority
CN
China
Prior art keywords
voice
frame
information
video
phoneme
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010950236.7A
Other languages
Chinese (zh)
Other versions
CN112017633A (en)
Inventor
宫一尘
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Horizon Information Technology Co Ltd
Original Assignee
Beijing Horizon Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Horizon Information Technology Co Ltd filed Critical Beijing Horizon Information Technology Co Ltd
Priority to CN202010950236.7A priority Critical patent/CN112017633B/en
Publication of CN112017633A publication Critical patent/CN112017633A/en
Application granted granted Critical
Publication of CN112017633B publication Critical patent/CN112017633B/en

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/005 - Language recognition
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

A speech recognition method, apparatus, storage medium, and electronic device are provided. A voice frame at the current time point and a video frame at the current time point are acquired, and video feature information is extracted from the video frame, the video feature information representing the voice action of the current user corresponding to the voice frame. A recognition result of the voice frame is then determined based on the voice frame and the video feature information; that is, the voice frame is recognized by combining the information of the voice frame with the voice action of the user corresponding to the voice frame. This improves the accuracy of the recognition result of the voice frame, and because each frame of voice is recognized as it arrives, the responsiveness of voice interaction is improved, the user's voice is recognized in real time, and the experience of voice interaction is improved.

Description

Speech recognition method, device, storage medium and electronic equipment
Technical Field
The present application relates to the field of speech technologies, and in particular, to a speech recognition method and apparatus, a storage medium, and an electronic device.
Background
Currently, in signal processing systems such as multi-modal speech recognition systems, the signal to be processed is usually processed only after it has been completely received. For example, when the signal to be processed is an audio signal, a multi-modal speech recognition system usually performs speech recognition only after a whole segment of the audio signal has been recorded. Speech recognition performed in this way cannot meet real-time requirements.
Therefore, how to improve the real-time performance of signal processing is a problem to be solved.
Disclosure of Invention
The present application has been made to solve the above-mentioned technical problems. The embodiment of the application provides a voice recognition method, a voice recognition device, a storage medium and electronic equipment, which are used for recognizing user voices in real time and improving the experience effect of voice interaction.
According to an aspect of the present application, there is provided a voice recognition method including: acquiring a voice frame of a current time point; acquiring a video frame of the current time point; extracting video characteristic information in the video frame; the video characteristic information is used for representing the voice action of the current user corresponding to the voice frame; and determining a recognition result of the voice frame based on the voice frame and the video feature information.
According to an aspect of the present application, there is provided a voice recognition apparatus including: the voice acquisition module is used for acquiring a voice frame at the current time point; the video acquisition module is used for acquiring the video frame of the current time point; the video feature extraction module is used for extracting video feature information in the video frames; the video characteristic information is used for representing the voice action of the current user corresponding to the voice frame; and the determining module is used for determining the recognition result of the voice frame based on the voice frame and the video characteristic information.
According to an aspect of the present application, there is provided a computer-readable storage medium storing a computer program for executing any one of the above-described speech recognition methods.
According to an aspect of the present application, there is provided an electronic device including: a processor; a memory for storing the processor-executable instructions; the processor is configured to perform any one of the above-described speech recognition methods.
According to the speech recognition method, apparatus, storage medium, and electronic device of the present application, the voice frame and the video frame at the current time point are acquired, the video feature information in the video frame is extracted, and the recognition result of the voice frame is determined by combining the video feature information with the voice frame; that is, the voice frame is recognized by combining the information of the voice frame with the voice action of the user corresponding to the voice frame. This improves the accuracy of the recognition result of the voice frame. Because each frame of voice is recognized, the responsiveness of voice interaction is improved, the user's voice is recognized in real time, and the experience of voice interaction is improved.
Drawings
The above and other objects, features and advantages of the present application will become more apparent from the following detailed description of embodiments of the present application with reference to the accompanying drawings. The accompanying drawings are included to provide a further understanding of embodiments of the application, are incorporated in and constitute a part of this specification, and serve, together with the embodiments, to explain the application; they do not constitute a limitation of the application. In the drawings, like reference numerals generally refer to like parts or steps.
Fig. 1 is a flowchart of a speech recognition method according to an exemplary embodiment of the present application.
Fig. 2 is a flowchart of a method for determining a recognition result of a speech frame according to an exemplary embodiment of the present application.
Fig. 3 is a flowchart of a method for calculating the probability of phoneme information according to an exemplary embodiment of the present application.
Fig. 4 is a flowchart of a voice recognition method according to another exemplary embodiment of the present application.
Fig. 5 is a flowchart of a voice recognition method according to another exemplary embodiment of the present application.
Fig. 6 is a schematic structural diagram of a voice recognition apparatus according to an exemplary embodiment of the present application.
Fig. 7 is a schematic structural diagram of a voice recognition apparatus according to another exemplary embodiment of the present application.
Fig. 8 is a block diagram of an electronic device according to an exemplary embodiment of the present application.
Detailed Description
Hereinafter, exemplary embodiments according to the present application will be described in detail with reference to the accompanying drawings. It should be apparent that the described embodiments are only some embodiments of the present application and not all embodiments of the present application, and it should be understood that the present application is not limited by the example embodiments described herein.
Summary of the application
In conventional signal processing systems, a signal to be processed is usually received in full and then processed. For example, in a multi-modal speech recognition system, after the user's complete speech has been obtained, the complete speech is generally recognized as a whole. Because users often hesitate while entering speech, the input speech sometimes contains long redundant segments. The longer the input speech, the longer the multi-modal speech recognition system needs for recognition, which makes it difficult to meet real-time requirements; longer speech also increases the difficulty of recognition, and may even lead to recognition failure or misrecognition.
To solve the above problems, the embodiments of the present application provide a speech recognition method and apparatus. A voice frame and a video frame at the current time point are acquired, video feature information is extracted from the video frame, and the recognition result of the voice frame is determined by combining the video feature information with the voice frame; that is, the voice frame is recognized by combining the information of the voice frame with the voice action of the user corresponding to the voice frame. The cooperation of the video frame and the voice frame improves the accuracy of the recognition result, and recognition is performed for each frame of voice, which improves the responsiveness of voice interaction, allows the user's voice to be recognized in real time, and improves the experience of voice interaction.
Exemplary method
Fig. 1 is a flowchart of a speech recognition method according to an exemplary embodiment of the present application. As shown in fig. 1, the voice recognition method includes the steps of:
step 110: and acquiring a voice frame at the current time point.
In an embodiment, the current time point may be any time point during reception of the voice to be recognized, that is, any moment while the user is inputting the voice to be recognized. In another embodiment, a voice frame is one frame of voice data; the voice frame at the current time point is the frame of voice data that begins at the current time point during input of the voice to be recognized. The voice to be recognized in the present disclosure may thus be divided into multiple frames of voice data corresponding to multiple moments, that is, into voice frames at a plurality of time points.
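By way of illustration only, the following sketch shows one common way such per-time-point voice frames can be obtained from a captured waveform; the function name and the 25 ms window / 10 ms hop are assumptions for the example and are not specified in the application.

```python
import numpy as np

def split_into_frames(waveform, sample_rate, frame_ms=25, hop_ms=10):
    """Split a 1-D waveform into per-time-point voice frames.

    frame_ms / hop_ms are illustrative values (25 ms windows, 10 ms hop);
    the application does not prescribe a particular frame length.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    n_frames = 1 + max(0, (len(waveform) - frame_len) // hop_len)
    frames = np.stack([waveform[i * hop_len:i * hop_len + frame_len]
                       for i in range(n_frames)])
    # start time (in seconds) of each frame, i.e. the "current time point"
    timestamps = np.arange(n_frames) * hop_len / sample_rate
    return frames, timestamps
```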
Step 120: and acquiring a video frame at the current time point.
In an embodiment, the video frame at the current time point may be a frame of video data beginning at the current time point in the process of inputting the voice to be recognized by the user. The video data corresponding to the voice to be recognized (i.e., the video data during the process of inputting the voice to be recognized by the user) in the present disclosure may be divided into video frames at a plurality of time points.
Step 130: extracting video characteristic information in a video frame; the video characteristic information is used for representing the voice action of the current user corresponding to the voice frame.
In an embodiment, the video feature information may characterize the voice action performed by the user while inputting the voice to be recognized, and may include a limb action (such as a hand-waving action), a head action (such as nodding or shaking the head), a lip action (such as a mouth-shape change), and the like. The video feature information in a given video frame may accordingly be the action state at the current time point, such as the mouth shape.
In one embodiment, the video feature information may also include optical flow field features. The optical flow field features represent pixel-level displacement information between two adjacent video frames. Considering that the current user may move while inputting the voice to be recognized (for example, face displacement), the optical flow field features allow the voice action of the current user to be obtained accurately. In an embodiment, the optical flow field feature of each video frame may be calculated from the current frame and the previous or next frame; specifically, the current frame and its previous or next frame may be input into a trained neural network model that outputs the optical flow field feature of the current frame.
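As an illustration only, the sketch below computes a dense optical flow field between two adjacent frames with a classical Farneback estimator; the application computes the flow with a trained neural network, so this classical method is merely a stand-in, and the parameter values are common defaults rather than values from the application.

```python
import cv2

def optical_flow_feature(prev_frame, curr_frame):
    """Dense optical-flow feature between two adjacent video frames.

    prev_frame / curr_frame: H x W x 3 BGR images (numpy arrays).
    Returns an H x W x 2 array of per-pixel (dx, dy) displacements.
    """
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    curr_gray = cv2.cvtColor(curr_frame, cv2.COLOR_BGR2GRAY)
    # Positional arguments: pyr_scale, levels, winsize, iterations,
    # poly_n, poly_sigma, flags
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    return flow
```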
Step 140: and determining a recognition result of the voice frame based on the voice frame and the video characteristic information.
Since a user's speech is accompanied by specific voice actions, such as mouth-shape changes and limb actions, part of the voice content can be inferred directly from those actions, for example by recognizing the content from mouth-shape changes (i.e., lip reading). The content inferred from the voice action is then fused (e.g., spliced or weighted) with the voice content corresponding to the voice frame to determine the recognition result of the voice frame. In an embodiment, after the voice frame is acquired, the corresponding video feature information (such as the mouth shape) may be used for verification, excluding voice contents that are implausible for the voice frame and thereby improving the accuracy of its recognition result.
In the embodiment of the disclosure, the recognition result of the voice frame is determined based on both the voice frame and the video feature information in the corresponding video frame; that is, the voice frame is recognized by combining the voice frame with the corresponding voice action. Because the video feature information contains action information related to the voice information (that is, the action information that accompanies the user's speech, such as mouth-shape changes and limb actions), the accuracy of the recognition result can be improved. Moreover, by recognizing the voice frame by frame, the voice can be recognized in real time while the user is still speaking, improving the real-time performance of speech recognition.
Fig. 2 is a flowchart of a method for determining a recognition result of a speech frame according to an exemplary embodiment of the present application. As shown in fig. 2, step 140 may include the following sub-steps:
step 141: the speech frame is parsed into at least one phoneme information.
A phoneme is the smallest phonetic unit, divided according to the natural attributes of speech and analyzed according to the pronunciation actions within a syllable: one pronunciation action forms one phoneme. Each frame of voice contains at least one piece of phoneme information. By parsing the voice frame into these smallest phonetic units (i.e., phonemes), the voice frame can be processed by processing its phoneme information.
Step 142: based on at least one of the phoneme information and the video feature information, a probability of each of the phoneme information is obtained.
Each user's pronunciation is different, and external noise and deviations in voice acquisition also have an influence, so the acquired voice information is not necessarily exactly the voice input by the user. Since each piece of phoneme information corresponds to a voice action, the probabilities of all the phoneme information contained in a voice frame can be obtained by combining the phoneme information with the corresponding voice action (i.e., the video feature information); that is, the probability of each phoneme contained in a frame of voice is obtained from the phoneme information in that frame together with the video feature information corresponding to that frame.
Step 143: according to the probability of each phoneme information, calculating the probability of a plurality of voice results; wherein the plurality of speech results are combined from some or all of the at least one phoneme information.
Step 142 yields the probability of each piece of phoneme information that may be present in the current voice frame. Because each speech result is composed of one or more pieces of phoneme information, the probability of each speech result can be obtained by weighting the probabilities of its constituent phoneme information.
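A minimal illustrative sketch of this combination step follows; the equal default weights, the dictionary-based phoneme inventory, and the example candidates are assumptions made for the example, not requirements of the application.

```python
import numpy as np

def speech_result_probability(phoneme_probs, result_phonemes, weights=None):
    """Score one candidate speech result from the probabilities of its phonemes.

    phoneme_probs: dict mapping phoneme -> probability in the current frame.
    result_phonemes: phoneme sequence that makes up the candidate result.
    The weighted average is one plausible reading of "weighting the
    probabilities of the constituent phoneme information".
    """
    probs = np.array([phoneme_probs.get(p, 0.0) for p in result_phonemes])
    if weights is None:
        weights = np.full(len(probs), 1.0 / len(probs))  # equal weights
    return float(np.dot(weights, probs))

# Usage with two hypothetical candidates built from the same phoneme set
phoneme_probs = {"n": 0.8, "i": 0.7, "h": 0.3, "ao": 0.6}
candidates = {"ni": ["n", "i"], "hao": ["h", "ao"]}
scores = {word: speech_result_probability(phoneme_probs, phones)
          for word, phones in candidates.items()}
```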
Step 144: and when the probability of one voice result in the plurality of voice results meets the preset condition, taking the voice result as a recognition result.
A preset condition is set, and when the probability of one voice result among the plurality of voice results obtained in step 143 meets the preset condition, that is, the probability of the voice result obtained by integrating the voice information of the voice frame and the voice action in the corresponding video frame meets the preset condition, the voice result can be output as a recognition result.
In an embodiment, the preset condition may include: the probability of the speech result is greater than a preset probability threshold; and/or the probability of the speech result is the maximum of the probabilities of the plurality of speech results. When the probability of one of the speech results obtained in step 143 is greater than the preset probability threshold, or is the maximum among the probabilities of the plurality of speech results, that speech result may be selected and output as the recognition result. It should be understood that several speech results may satisfy the preset condition in step 144. In that case, these speech results may all be stored as recognition results of the current voice frame; after the recognition results of all voice frames are obtained, they are combined and the recognition result of the current voice frame is determined from the overall semantics. Alternatively, only the speech result whose probability is the largest and exceeds the probabilities of the other speech results by more than a preset margin may be selected, as long as the selected speech result accurately expresses the recognition result of the current voice frame.
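The following sketch illustrates one possible form of this selection; the threshold of 0.6 and the margin of 0.1 are hypothetical values chosen only for the example.

```python
def pick_recognition_result(result_probs, prob_threshold=0.6, margin=0.1):
    """Select the speech result(s) that satisfy the preset condition.

    result_probs: dict mapping candidate speech result -> probability.
    prob_threshold / margin are illustrative; the application only requires
    "greater than a preset probability threshold" and/or "the maximum of
    the probabilities of the plurality of speech results".
    """
    best = max(result_probs, key=result_probs.get)
    best_p = result_probs[best]
    runners_up = [p for r, p in result_probs.items() if r != best]
    # Accept the best result alone when it clears the threshold and is well
    # separated from the runner-up; otherwise keep every candidate above the
    # threshold for later disambiguation with the full-utterance semantics.
    if best_p > prob_threshold and (not runners_up or best_p - max(runners_up) > margin):
        return [best]
    return [r for r, p in result_probs.items() if p > prob_threshold] or [best]
```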
Fig. 3 is a flowchart of a method for calculating the probability of phoneme information according to an exemplary embodiment of the present application. As shown in fig. 3, step 142 may include the following sub-steps:
Step 1421: and determining first similarity between single phoneme information in at least one phoneme information and corresponding standard phoneme voices, and obtaining a plurality of first similarity.
The standard phoneme speech is obtained from a standard language (such as standard Mandarin Chinese or standard English), and the speech of different standard phonemes differs to some extent. Although each user's pronunciation may vary, the phoneme information of the uttered speech still has a relatively high similarity to the corresponding standard phoneme speech. By comparing a single piece of phoneme information with the standard phoneme speeches to obtain the first similarities, the possible results for that piece of phoneme information can be narrowed down to a small number of candidates, which helps ensure recognition accuracy.
Step 1422: and determining second similarity between the video characteristic information and the standard video characteristic information of each phoneme to obtain a plurality of second similarity.
The standard video feature information of a phoneme is the standard voice action set according to the pronunciation action of that standard phoneme, and the standard video feature information of different phonemes differs to some extent. Although each user's voice action (i.e., video feature information) may vary, the video feature information produced while uttering a phoneme still has a relatively high similarity to the corresponding standard video feature information. By comparing the video feature information of the current voice frame with the standard video feature information of each phoneme to obtain the second similarities, the possible results for a single piece of phoneme information can likewise be narrowed down to a small number of candidates, which helps ensure recognition accuracy.
Step 1423: and weighting the first similarity and the second similarity of each phoneme to obtain the probability of each phoneme information.
Steps 1421 and 1422 yield, for each phoneme, a first similarity between the phoneme information and the standard phoneme speech and a second similarity between the video feature information and the standard video feature information of that phoneme. The first similarity and the second similarity corresponding to each phoneme are then weighted; that is, the similarity of the voice information to the standard phoneme speech and the similarity of the video information to the standard video information are combined to obtain the probability of each piece of phoneme information, so that the probability that the current voice frame contains a given piece of phoneme information is derived from both the phoneme and the video dimensions. In an embodiment, the weights of the first similarity and the second similarity may be equal or unequal. For example, when both similarities are greater than their respective preset similarities, the weights may be equal; when one of them is smaller than its preset similarity, its recognition result is less reliable, so its weight may be set smaller than that of the other similarity, preventing the less accurate modality from having an excessive influence on the final recognition result.
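The sketch below gives one possible realization of this weighting; cosine similarity, the 0.5/0.5 default weights, and the 0.3 reliability threshold are assumptions for illustration, not values taken from the application.

```python
import numpy as np

def phoneme_probability(phoneme_feat, video_feat,
                        std_phoneme_feat, std_video_feat,
                        w_audio=0.5, w_video=0.5, reliability=0.3):
    """Probability of one phoneme from the audio and video similarities."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    s1 = cosine(phoneme_feat, std_phoneme_feat)  # first similarity (audio)
    s2 = cosine(video_feat, std_video_feat)      # second similarity (video)
    # Down-weight a modality whose similarity falls below a preset level,
    # as suggested in the description above.
    if s1 < reliability:
        w_audio, w_video = 0.3, 0.7
    elif s2 < reliability:
        w_audio, w_video = 0.7, 0.3
    return w_audio * s1 + w_video * s2
```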
In an embodiment, the specific implementation of step 140 may further include: and weighting the single phoneme information in the at least one phoneme information and the corresponding video characteristic information to obtain weighted single phoneme information, and then calculating the similarity between the weighted single phoneme information and the corresponding standard phoneme voice to obtain the probability of each phoneme information.
In an embodiment, the specific implementation of step 140 may further include: and inputting the voice frame and the video characteristic information into the first neural network model to obtain a recognition result of the voice frame. The specific implementation mode can be as follows: inputting the voice frame and the video characteristic information into a first neural network model, wherein the first neural network model performs any one of fusion modes such as splicing, average summation, attention weighted summation and the like on the voice information and the video characteristic information of the voice frame so as to obtain a distribution result of phoneme probability in each voice frame, and then acquiring a recognition result with higher probability as a recognition result of the current voice frame based on beam search, thereby improving efficiency and saving memory space.
Fig. 4 is a flowchart of a voice recognition method according to another exemplary embodiment of the present application. As shown in fig. 4, before step 110, the above-mentioned voice recognition method may further include:
step 150: and judging whether the current user performs voice action.
In the environment where speech recognition takes place there are, more or less, other ambient sounds or interfering voices from other users, and acquiring every sound would obviously lead to many useless operations. Therefore, the user's image and actions are acquired in real time, and whether the user is performing voice input is judged from those actions: image information in front of the interactive device is acquired by an image acquisition device such as a camera, and when the acquired image information contains the user's head, the head action or lip action of the current user can be further examined to judge whether the current user is performing a voice action.
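Purely as an illustration, the sketch below checks for a face and for motion in the lower part of the face box as a crude proxy for lip movement; the detector, the lower-third heuristic, and the motion threshold are all assumptions, not the judgment method of the application.

```python
import cv2
import numpy as np

# Haar face detector bundled with OpenCV, used here only as a stand-in for
# checking that the captured image contains the user's head.
_face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def is_user_speaking(prev_frame, curr_frame, motion_threshold=8.0):
    """Rough judgment that a user is present and the lower face is moving."""
    gray_prev = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    gray_curr = cv2.cvtColor(curr_frame, cv2.COLOR_BGR2GRAY)
    faces = _face_detector.detectMultiScale(gray_curr, 1.1, 5)
    if len(faces) == 0:
        return False                      # no head in view: stay idle
    x, y, w, h = faces[0]
    mouth_prev = gray_prev[y + 2 * h // 3:y + h, x:x + w]
    mouth_curr = gray_curr[y + 2 * h // 3:y + h, x:x + w]
    motion = float(np.mean(cv2.absdiff(mouth_prev, mouth_curr)))
    return motion > motion_threshold      # lip motion suggests speaking
```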
Step 160: and when the judgment result is that the current user is performing voice action, acquiring a voice frame of the current time point.
If the judgment result is that the current user is performing a voice action, the voice acquisition device and the video acquisition device are activated to acquire the current user's voice frames and video frames; that is, the system enters the voice acquisition and speech recognition state. In an embodiment, step 150 may also determine whether the current user's voice action has ended. When it is determined that the voice action has ended, the speech recognition result for the period from the start of the voice action to its end may be output, and the stored voice frames and video frames, together with the phoneme information and probabilities generated during recognition, may be cleared, freeing storage space for the next round of speech recognition.
In an embodiment, the implementation of step 130 may include: inputting the video frame into a second neural network model to obtain the video feature information; or obtaining the video feature information through at least one of a SIFT algorithm, a SURF algorithm, and an ORB algorithm. Extracting the video feature information in the video frame means extracting the voice action of the corresponding voice frame from the video frame; feature extraction may be performed on an image region at a specific position in the video frame (i.e., the picture) by the second neural network model, or the video feature information may be extracted by at least one of the SIFT, SURF, and ORB algorithms. It should be understood that other methods may be selected to extract the video features in the video frame as required.
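For illustration, the sketch below extracts ORB features with OpenCV; ORB is shown because it ships with the default OpenCV build (SURF requires the non-free contrib module), and the feature count is an assumed value.

```python
import cv2

def extract_orb_features(video_frame, n_features=200):
    """Extract ORB keypoints and descriptors from one video frame.

    video_frame: H x W x 3 BGR image; descriptors have shape (n_keypoints, 32).
    """
    gray = cv2.cvtColor(video_frame, cv2.COLOR_BGR2GRAY)
    orb = cv2.ORB_create(nfeatures=n_features)
    keypoints, descriptors = orb.detectAndCompute(gray, None)
    return keypoints, descriptors
```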
In one embodiment, the video feature information may include lip region video information of the current user. Because the user's voice action is related only to specific regions, such as the lips or the face, only the face or lip region video information of the current user needs to be acquired. This avoids computing over the largely irrelevant data in the video frame, saves computation and storage, and improves computation efficiency and response speed.
Fig. 5 is a flowchart of a voice recognition method according to another exemplary embodiment of the present application. As shown in fig. 5, after step 110, the above recognition method may further include:
Step 170: and preprocessing the voice frame to obtain a preprocessed voice frame.
Because a lot of noise may exist in the interactive environment, the embodiment of the application may preprocess the voice frame in order to improve the recognition accuracy of the user's voice: the noise level is reduced and the signal-to-noise ratio of the speech is improved, providing a more accurate initial voice frame for subsequent speech recognition. In one embodiment, the preprocessing may include: performing a short-time Fourier transform on the voice frame to obtain spectral feature information. The short-time Fourier transform applies a time-frequency localized window function under the assumption that the signal is stationary (pseudo-stationary) within a short time interval; by moving the window function, the voice frame is analyzed as a sequence of stationary signals over finite time widths, thereby reducing the influence of clutter noise.
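As a sketch only, the short-time Fourier transform of one voice frame could be computed as follows; the window and hop lengths are common choices, not values specified in the application.

```python
import numpy as np
from scipy.signal import stft

def frame_spectrogram(speech_frame, sample_rate=16000, win_ms=25, hop_ms=10):
    """Short-time Fourier transform of one voice frame (1-D float array)."""
    nperseg = int(sample_rate * win_ms / 1000)
    noverlap = nperseg - int(sample_rate * hop_ms / 1000)
    _, _, spec = stft(speech_frame, fs=sample_rate, window="hann",
                      nperseg=nperseg, noverlap=noverlap)
    return np.abs(spec)   # magnitude spectrum as the spectral feature information
```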
In an embodiment, after the spectral feature information is obtained, the recognition method may further include: extracting MFCC features and/or FBank features from the spectral feature information. The MFCC and/or FBank features approximate the frequency response characteristics of the human ear, thereby improving speech recognition.
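For illustration, such features could be computed with librosa as sketched below; the feature dimensions (13 MFCCs, 40 Mel bands) are conventional defaults rather than values given in the application.

```python
import numpy as np
import librosa

def mfcc_and_fbank(speech_frame, sample_rate=16000, n_mfcc=13, n_mels=40):
    """MFCC and log-Mel filterbank (FBank) features for one voice frame."""
    mfcc = librosa.feature.mfcc(y=speech_frame, sr=sample_rate, n_mfcc=n_mfcc)
    mel = librosa.feature.melspectrogram(y=speech_frame, sr=sample_rate, n_mels=n_mels)
    fbank = np.log(mel + 1e-6)            # log-Mel energies, i.e. FBank features
    return mfcc, fbank
```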
Exemplary apparatus
Fig. 6 is a schematic structural diagram of a voice recognition apparatus according to an exemplary embodiment of the present application. As shown in fig. 6, the voice recognition apparatus 60 includes: a voice acquisition module 61, configured to acquire a voice frame at a current time point; a video acquisition module 62, configured to acquire a video frame at a current time point; a video feature extraction module 63, configured to extract video feature information in a video frame; the video characteristic information is used for representing the voice action of the voice frame corresponding to the current user; the determining module 64 is configured to determine a recognition result of the voice frame based on the voice frame and the video feature information.
According to the speech recognition apparatus provided in the embodiment of the application, the voice acquisition module 61 and the video acquisition module 62 acquire the voice frame and the video frame at the current time point, the video feature extraction module 63 extracts the video feature information in the video frame, and the determining module 64 determines the recognition result of the voice frame by combining the video feature information with the voice frame; that is, the voice frame is recognized by combining the information of the voice frame with the voice action of the user corresponding to the voice frame. This improves the accuracy of the recognition result of the voice frame. Because each frame of voice is recognized, the responsiveness of voice interaction is improved, the user's voice is recognized in real time, and the experience of voice interaction is improved.
Fig. 7 is a schematic structural diagram of a voice recognition apparatus according to another exemplary embodiment of the present application. As shown in fig. 7, the determining module 64 may include: a parsing unit 641 for parsing the voice frame into at least one phoneme information; a phoneme probability obtaining unit 642 for obtaining a probability of each phoneme information based on at least one phoneme information and video feature information; a speech probability obtaining unit 643, configured to calculate probabilities of a plurality of speech results according to the probabilities of each phoneme information; wherein the plurality of speech results are combined from some or all of the at least one phoneme information; the recognition result obtaining unit 644 is configured to, when a probability that one of the plurality of voice results exists satisfies a preset condition, take the voice result as a recognition result.
In an embodiment, the phoneme probability obtaining unit 642 may be further configured to: determine a first similarity between each single piece of phoneme information in the at least one piece of phoneme information and the corresponding standard phoneme speech, obtaining a plurality of first similarities; determine a second similarity between the video feature information and the standard video feature information of each phoneme, obtaining a plurality of second similarities; and weight the first similarity and the second similarity of each phoneme to obtain the probability of each piece of phoneme information.
In an embodiment, the phoneme probability obtaining unit 642 may be further configured to: weight each single piece of phoneme information in the at least one piece of phoneme information with the corresponding video feature information to obtain weighted single phoneme information, and then calculate the similarity between the weighted single phoneme information and the corresponding standard phoneme speech to obtain the probability of each piece of phoneme information.
In an embodiment, the determining module 64 may be further configured to: input the voice frame and the video feature information into a first neural network model to obtain the recognition result of the voice frame. A specific implementation may be as follows: the voice frame and the video feature information are input into the first neural network model, which fuses the voice information of the voice frame with the video feature information by any one of splicing, average summation, attention-weighted summation, or the like, so as to obtain the distribution of phoneme probabilities for each frame of voice; a beam search then takes the higher-probability result as the recognition result of the current voice frame.
In one embodiment, as shown in fig. 7, the voice recognition apparatus 60 may further include: a judging module 65, configured to judge whether the current user performs a voice action; the voice acquisition module 61 is further configured to: and when the judgment result is that the current user is performing voice action, acquiring a voice frame of the current time point.
In an embodiment, the video feature extraction module 63 may be further configured to: inputting the video frames into a second neural network model to obtain video characteristic information; or obtaining the video characteristic information through at least one of a SIFT algorithm, a SURF algorithm and an ORB algorithm. In one embodiment, the video feature information may include lip region video information of the current user.
In one embodiment, as shown in fig. 7, the voice recognition apparatus 60 may further include: the preprocessing module 66 is configured to preprocess the voice frame to obtain a preprocessed voice frame. In a further embodiment, the preprocessing may include: and carrying out short-time Fourier transform on the voice frame to obtain the frequency spectrum characteristic information. In an embodiment, after obtaining the spectral feature information, the speech recognition device 60 may be further configured to: MFCC features and/or FBank features in the spectral feature information are extracted.
Exemplary electronic device
Next, an electronic device according to an embodiment of the present application is described with reference to fig. 8. The electronic device may be either or both of the first device and the second device, or a stand-alone device independent thereof, which may communicate with the first device and the second device to receive the acquired input signals therefrom.
Fig. 8 illustrates a block diagram of an electronic device according to an embodiment of the application.
As shown in fig. 8, the electronic device 10 includes one or more processors 11 and a memory 12.
The processor 11 may be a Central Processing Unit (CPU) or other form of processing unit having data processing and/or instruction execution capabilities, and may control other components in the electronic device 10 to perform desired functions.
Memory 12 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, random access memory (RAM) and/or cache memory (cache). The non-volatile memory may include, for example, read-only memory (ROM), a hard disk, flash memory, and the like. One or more computer program instructions may be stored on the computer-readable storage medium and executed by the processor 11 to implement the speech recognition methods of the various embodiments of the present application described above and/or other desired functions. Various contents such as an input signal, a signal component, and a noise component may also be stored in the computer-readable storage medium.
In one example, the electronic device 10 may further include: an input device 13 and an output device 14, which are interconnected by a bus system and/or other forms of connection mechanisms (not shown).
For example, when the electronic device is a first device or a second device, the input means 13 may be a microphone or an array of microphones for capturing an input signal of a sound source. When the electronic device is a stand-alone device, the input means 13 may be a communication network connector for receiving the acquired input signals from the first device and the second device.
In addition, the input device 13 may also include, for example, a keyboard, a mouse, and the like.
The output device 14 may output various information to the outside, including the determined distance information, direction information, and the like. The output device 14 may include, for example, a display, speakers, a printer, and a communication network and remote output devices connected thereto, etc.
Of course, for simplicity, only some of the components of the electronic device 10 that are relevant to the present application are shown in fig. 8; components such as buses and input/output interfaces are omitted. In addition, the electronic device 10 may include any other suitable components depending on the particular application.
Exemplary computer program product and computer readable storage Medium
In addition to the methods and apparatus described above, embodiments of the application may also be a computer program product comprising computer program instructions which, when executed by a processor, cause the processor to perform the steps in a speech recognition method according to the various embodiments of the application described in the "exemplary methods" section of this specification.
The computer program product may include program code for performing the operations of embodiments of the present application, written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present application may also be a computer-readable storage medium, on which computer program instructions are stored, which, when being executed by a processor, cause the processor to perform the steps in a speech recognition method according to various embodiments of the present application described in the "exemplary methods" section above in this specification.
The computer readable storage medium may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may include, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium would include the following: an electrical connection having one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The basic principles of the present application have been described above in connection with specific embodiments, but it should be noted that the advantages, benefits, effects, etc. mentioned in the present application are merely examples and not intended to be limiting, and these advantages, benefits, effects, etc. are not to be construed as necessarily possessed by the various embodiments of the application. Furthermore, the specific details disclosed herein are for purposes of illustration and understanding only, and are not intended to be limiting, as the application is not necessarily limited to practice with the above described specific details.
The block diagrams of the devices, apparatuses, equipment, and systems referred to in the present application are only illustrative examples and are not intended to require or imply that the connections, arrangements, and configurations must be made in the manner shown in the block diagrams. As will be appreciated by one of skill in the art, these devices, apparatuses, equipment, and systems may be connected, arranged, or configured in any manner. Words such as "including," "comprising," "having," and the like are open words that mean "including but not limited to" and are used interchangeably therewith. The terms "or" and "and" as used herein refer to, and are used interchangeably with, the term "and/or", unless the context clearly indicates otherwise. The term "such as" as used herein refers to, and is used interchangeably with, the phrase "such as, but not limited to".
It is also noted that in the apparatus, devices and methods of the present application, the components or steps may be disassembled and/or assembled. Such decomposition and/or recombination should be considered as equivalent aspects of the present application.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present application. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the application. Thus, the present application is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, this description is not intended to limit embodiments of the application to the form disclosed herein. Although a number of example aspects and embodiments have been discussed above, a person of ordinary skill in the art will recognize certain variations, modifications, alterations, additions, and subcombinations thereof.

Claims (12)

1. A method of speech recognition, comprising:
acquiring a voice frame of a current time point;
acquiring a video frame of the current time point;
extracting video characteristic information in the video frame; the video characteristic information is used for representing the voice action of the current user corresponding to the voice frame; and
determining a recognition result of the voice frame based on the voice frame and the video characteristic information;
wherein, based on the voice frame and the video feature information, determining the recognition result of the voice frame includes:
parsing the speech frame into at least one phoneme information;
based on the at least one piece of phoneme information and the video characteristic information, obtaining the probability of each piece of phoneme information;
according to the probability of each phoneme information, calculating to obtain probabilities of a plurality of voice results; wherein the plurality of speech results are combined from some or all of the at least one phoneme information; and
when the probability of one voice result in the plurality of voice results meets a preset condition, taking the voice result as a recognition result;
wherein the obtaining the probability of each phoneme information based on the at least one phoneme information and the video feature information includes:
determining first similarity between single phoneme information in the at least one phoneme information and corresponding standard phoneme voices to obtain a plurality of first similarities;
determining second similarity between the video characteristic information and standard video characteristic information of each phoneme to obtain a plurality of second similarities; and
weighting the first similarity and the second similarity of each phoneme to obtain the probability of each phoneme information.
2. The recognition method according to claim 1, wherein the preset condition includes: the probability of the voice result is larger than a preset probability threshold; and/or the probability of the speech result is the maximum of the probabilities of the plurality of speech results.
3. The recognition method according to claim 1 or 2, wherein the determining a recognition result of the speech frame based on the speech frame and the video feature information includes:
And inputting the voice frame and the video characteristic information into a first neural network model to obtain a recognition result of the voice frame.
4. The recognition method according to claim 1, wherein, before the acquiring the speech frame at the current point in time, further comprising:
judging whether the current user performs voice action or not;
And when the judgment result is that the current user is performing voice action, acquiring the voice frame of the current time point.
5. The recognition method of claim 1, wherein the extracting video feature information in the video frame comprises:
inputting the video frame into a second neural network model to obtain the video characteristic information; or alternatively
The video feature information is obtained through at least one of a SIFT algorithm, a SURF algorithm and an ORB algorithm.
6. The recognition method of claim 1, wherein the video feature information comprises lip region video information of the current user.
7. The recognition method according to claim 1, wherein after the acquisition of the speech frame at the current point in time, further comprising:
and preprocessing the voice frame to obtain a preprocessed voice frame.
8. The recognition method of claim 7, wherein the preprocessing comprises:
And carrying out short-time Fourier transform on the voice frame to obtain spectrum characteristic information.
9. The recognition method according to claim 8, wherein after the obtaining of the spectral feature information, further comprising:
MFCC features and/or FBank features in the spectral feature information are extracted.
10. A speech recognition apparatus comprising:
The voice acquisition module is used for acquiring a voice frame at the current time point;
The video acquisition module is used for acquiring the video frame of the current time point;
The video feature extraction module is used for extracting video feature information in the video frames; the video characteristic information is used for representing the voice action of the current user corresponding to the voice frame; and
The determining module is used for determining a recognition result of the voice frame based on the voice frame and the video characteristic information;
wherein, based on the voice frame and the video feature information, determining the recognition result of the voice frame includes:
parsing the speech frame into at least one phoneme information;
based on the at least one piece of phoneme information and the video characteristic information, obtaining the probability of each piece of phoneme information;
according to the probability of each phoneme information, calculating to obtain probabilities of a plurality of voice results; wherein the plurality of speech results are combined from some or all of the at least one phoneme information; and
when the probability of one voice result in the plurality of voice results meets a preset condition, taking the voice result as a recognition result;
wherein the obtaining the probability of each phoneme information based on the at least one phoneme information and the video feature information includes:
determining first similarity between single phoneme information in the at least one phoneme information and corresponding standard phoneme voices to obtain a plurality of first similarities;
determining second similarity between the video characteristic information and standard video characteristic information of each phoneme to obtain a plurality of second similarities; and
weighting the first similarity and the second similarity of each phoneme to obtain the probability of each phoneme information.
11. A computer readable storage medium storing a computer program for performing the speech recognition method of any one of the preceding claims 1-9.
12. An electronic device, the electronic device comprising:
A processor;
a memory for storing the processor-executable instructions;
The processor being configured to perform the speech recognition method of any one of the preceding claims 1-9.
CN202010950236.7A 2020-09-10 2020-09-10 Speech recognition method, device, storage medium and electronic equipment Active CN112017633B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010950236.7A CN112017633B (en) 2020-09-10 2020-09-10 Speech recognition method, device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010950236.7A CN112017633B (en) 2020-09-10 2020-09-10 Speech recognition method, device, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN112017633A CN112017633A (en) 2020-12-01
CN112017633B (en) 2024-04-26

Family

ID=73521770

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010950236.7A Active CN112017633B (en) 2020-09-10 2020-09-10 Speech recognition method, device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN112017633B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113488043B (en) * 2021-06-30 2023-03-24 上海商汤临港智能科技有限公司 Passenger speaking detection method and device, electronic equipment and storage medium
CN113706669B (en) * 2021-08-12 2022-09-27 北京百度网讯科技有限公司 Animation synthesis method and device, electronic equipment and storage medium
CN113963092B (en) * 2021-11-30 2024-05-03 网易(杭州)网络有限公司 Audio and video fitting associated computing method, device, medium and equipment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6947890B1 (en) * 1999-05-28 2005-09-20 Tetsuro Kitazoe Acoustic speech recognition method and system using stereo vision neural networks with competition and cooperation
US9548048B1 (en) * 2015-06-19 2017-01-17 Amazon Technologies, Inc. On-the-fly speech learning and computer model generation using audio-visual synchronization
CN108389573A (en) * 2018-02-09 2018-08-10 北京易真学思教育科技有限公司 Language Identification and device, training method and device, medium, terminal
CN108665891A (en) * 2017-03-28 2018-10-16 卡西欧计算机株式会社 Sound detection device, sound detection method and recording medium
WO2019219968A1 (en) * 2018-05-18 2019-11-21 Deepmind Technologies Limited Visual speech recognition by phoneme prediction
CN110991329A (en) * 2019-11-29 2020-04-10 上海商汤智能科技有限公司 Semantic analysis method and device, electronic equipment and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6594629B1 (en) * 1999-08-06 2003-07-15 International Business Machines Corporation Methods and apparatus for audio-visual speech detection and recognition
US9892745B2 (en) * 2013-08-23 2018-02-13 At&T Intellectual Property I, L.P. Augmented multi-tier classifier for multi-modal voice activity detection
US10109277B2 (en) * 2015-04-27 2018-10-23 Nuance Communications, Inc. Methods and apparatus for speech recognition using visual information
US9697833B2 (en) * 2015-08-25 2017-07-04 Nuance Communications, Inc. Audio-visual speech recognition with scattering operators

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6947890B1 (en) * 1999-05-28 2005-09-20 Tetsuro Kitazoe Acoustic speech recognition method and system using stereo vision neural networks with competition and cooperation
US9548048B1 (en) * 2015-06-19 2017-01-17 Amazon Technologies, Inc. On-the-fly speech learning and computer model generation using audio-visual synchronization
CN108665891A (en) * 2017-03-28 2018-10-16 卡西欧计算机株式会社 Sound detection device, sound detection method and recording medium
CN108389573A (en) * 2018-02-09 2018-08-10 北京易真学思教育科技有限公司 Language Identification and device, training method and device, medium, terminal
WO2019219968A1 (en) * 2018-05-18 2019-11-21 Deepmind Technologies Limited Visual speech recognition by phoneme prediction
CN110991329A (en) * 2019-11-29 2020-04-10 上海商汤智能科技有限公司 Semantic analysis method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN112017633A (en) 2020-12-01

Similar Documents

Publication Publication Date Title
JP6938784B2 (en) Object identification method and its computer equipment and computer equipment readable storage medium
CN109741732B (en) Named entity recognition method, named entity recognition device, equipment and medium
US10127927B2 (en) Emotional speech processing
KR102413692B1 (en) Apparatus and method for caculating acoustic score for speech recognition, speech recognition apparatus and method, and electronic device
CN112017633B (en) Speech recognition method, device, storage medium and electronic equipment
CN110517689B (en) Voice data processing method, device and storage medium
US20150325240A1 (en) Method and system for speech input
CN109686383B (en) Voice analysis method, device and storage medium
Mariooryad et al. Compensating for speaker or lexical variabilities in speech for emotion recognition
CN110706690A (en) Speech recognition method and device
US20220115002A1 (en) Speech recognition method, speech recognition device, and electronic equipment
CN113707125B (en) Training method and device for multi-language speech synthesis model
CN112102850A (en) Processing method, device and medium for emotion recognition and electronic equipment
WO2020125038A1 (en) Voice control method and device
JP6875819B2 (en) Acoustic model input data normalization device and method, and voice recognition device
CN114121006A (en) Image output method, device, equipment and storage medium of virtual character
Karpov An automatic multimodal speech recognition system with audio and video information
CN114038457A (en) Method, electronic device, storage medium, and program for voice wakeup
CN110827853A (en) Voice feature information extraction method, terminal and readable storage medium
CN110853669B (en) Audio identification method, device and equipment
CN109074809B (en) Information processing apparatus, information processing method, and computer-readable storage medium
JP7291099B2 (en) Speech recognition method and device
CN113593523A (en) Speech detection method and device based on artificial intelligence and electronic equipment
JP2008233782A (en) Pattern matching device, program, and method
CN113744371B (en) Method, device, terminal and storage medium for generating face animation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant