CN113838460A - Video voice recognition method, device, equipment and storage medium

Video voice recognition method, device, equipment and storage medium

Info

Publication number
CN113838460A
CN113838460A (application CN202011617331.1A)
Authority
CN
China
Prior art keywords
text
image
video
voice recognition
audio sub-segment
Prior art date
Legal status
Pending
Application number
CN202011617331.1A
Other languages
Chinese (zh)
Inventor
付立
Current Assignee
Jingdong Technology Holding Co Ltd
Original Assignee
Jingdong Technology Holding Co Ltd
Priority date
Filing date
Publication date
Application filed by Jingdong Technology Holding Co Ltd filed Critical Jingdong Technology Holding Co Ltd
Priority to CN202011617331.1A
Publication of CN113838460A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/04: Segmentation; Word boundary detection
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L25/57: Speech or voice analysis techniques specially adapted for processing of video signals
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78: Detection of presence or absence of voice signals
    • G10L25/87: Detection of discrete points within a voice signal

Abstract

The application provides a video speech recognition method, apparatus, device, and storage medium, relating to the technical field of speech recognition. The method includes: processing a video to obtain a plurality of audio sub-segments and an image frame sequence corresponding to each audio sub-segment; performing text recognition on the image frame sequence to obtain a plurality of text results, and processing the text results to obtain a plurality of keywords; processing each audio sub-segment through a speech recognition model to obtain a plurality of candidate speech recognition results; and determining a target text recognition result of each audio sub-segment according to the candidate speech recognition results and the keywords, and obtaining a speech recognition result of the video according to the target text recognition result of each audio sub-segment. In this way, text recognized in the images of the video assists the speech recognition, improving the accuracy of video speech recognition.

Description

Video voice recognition method, device, equipment and storage medium
Technical Field
The present application relates to the field of speech recognition technologies, and in particular, to a video speech recognition method, apparatus, device, and storage medium.
Background
At present, video data has the advantages of strong entertainment value, rich content, high user stickiness, and the like, and has become one of the main forms of internet traffic in recent years. Video Automatic Speech Recognition (ASR), which recognizes the spoken content in a video as corresponding text, plays an important role in fields such as video content review and video recommendation.
However, in an actual video scene, the speech data to be recognized may be affected by various complex interference factors such as accents, background music, and noise, which severely degrade the performance of video speech recognition.
In the related art, audio data from a large number of video scenes and the corresponding labels are usually obtained by manual annotation, and the labeled data are then used to optimize and train a model. However, compared with labeling data such as images and text, labeling audio data is costly, because each recording must be listened to at least once by a human annotator.
Disclosure of Invention
The present application is directed to solving, at least to some extent, one of the technical problems in the related art.
The application provides a video speech recognition method, apparatus, device, and storage medium, which improve the accuracy of video speech recognition by using text recognized in the images of a video to assist the speech recognition, thereby addressing the technical problems of inaccurate video speech recognition and high cost in the prior art.
An embodiment of a first aspect of the present application provides a video speech recognition method, including:
processing a video to obtain a plurality of audio sub-segments and an image frame sequence corresponding to each audio sub-segment;
performing text recognition on the image frame sequence to obtain a plurality of text results, and processing the text results to obtain a plurality of keywords;
processing each audio sub-segment through a voice recognition model to obtain a plurality of candidate voice recognition results;
and determining a target text recognition result of each audio sub-segment according to the candidate voice recognition results and the keywords, and acquiring a voice recognition result of the video according to the target text recognition result of each audio sub-segment.
According to the video speech recognition method of the embodiment of the application, a video is processed to obtain a plurality of audio sub-segments and an image frame sequence corresponding to each audio sub-segment; text recognition is performed on the image frame sequence to obtain a plurality of text results, and the text results are processed to obtain a plurality of keywords; each audio sub-segment is processed through a speech recognition model to obtain a plurality of candidate speech recognition results; and a target text recognition result of each audio sub-segment is determined according to the candidate speech recognition results and the keywords, and a speech recognition result of the video is obtained according to the target text recognition results. In this way, text recognized in the images of the video assists the speech recognition, improving the accuracy of video speech recognition.
The embodiment of the second aspect of the present application provides a video speech recognition apparatus, including:
the first acquisition module is used for processing a video, acquiring a plurality of audio sub-segments and an image frame sequence corresponding to each audio sub-segment;
the identification module is used for carrying out text identification on the image frame sequence to obtain a plurality of text results;
the second acquisition module is used for processing the text results to acquire a plurality of keywords;
the processing module is used for processing each audio sub-segment through a voice recognition model to obtain a plurality of candidate voice recognition results;
a determining module, configured to determine a target text recognition result of each audio sub-segment according to the candidate speech recognition results and the keywords;
and the third acquisition module is used for acquiring the voice recognition result of the video according to the target text recognition result of each audio sub-segment.
The video speech recognition apparatus of the embodiment of the application processes a video to obtain a plurality of audio sub-segments and an image frame sequence corresponding to each audio sub-segment; performs text recognition on the image frame sequence to obtain a plurality of text results and processes them to obtain a plurality of keywords; processes each audio sub-segment through a speech recognition model to obtain a plurality of candidate speech recognition results; and determines a target text recognition result of each audio sub-segment according to the candidate speech recognition results and the keywords, obtaining the speech recognition result of the video from the target text recognition results. In this way, text recognized in the images of the video assists the speech recognition, improving the accuracy of video speech recognition.
An embodiment of a third aspect of the present application provides a server, including: the device comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the program to realize the video voice recognition method as set forth in the embodiment of the first aspect of the present application.
An embodiment of a fourth aspect of the present application provides a non-transitory computer-readable storage medium, which stores a computer program, and when the computer program is executed by a processor, the computer program implements the video speech recognition method as set forth in the embodiment of the first aspect of the present application.
An embodiment of a fifth aspect of the present application provides a computer program product; when the instructions in the computer program product are executed by a processor, the video speech recognition method provided in the embodiment of the first aspect of the present application is performed.
Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
fig. 1 is a schematic flowchart of a video speech recognition method according to an embodiment of the present application;
fig. 2 is a schematic flowchart of a video speech recognition method according to a second embodiment of the present application;
FIG. 3 is an exemplary diagram of audio silence detection in an embodiment of the present application;
FIG. 4 is an exemplary diagram of text recognition by the OCR algorithm of an embodiment of the present application;
fig. 5 is a schematic structural diagram of a video speech recognition apparatus according to an eighth embodiment of the present application;
fig. 6 is a schematic structural diagram of a video speech recognition apparatus according to a ninth embodiment of the present application;
FIG. 7 illustrates a block diagram of an exemplary server suitable for use in implementing embodiments of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary and intended to be used for explaining the present application and should not be construed as limiting the present application.
In practical applications, automatic speech recognition of video, which recognizes the spoken content in a video as corresponding text, is common in fields such as video content review and video recommendation. However, audio data from a large number of video scenes and the corresponding labels are usually obtained by manual annotation, and the labeled data are then used to optimize and train a model; the labeling cost is therefore high, and video recognition remains difficult and inefficient.
In order to solve the above problems, the present application provides a video speech recognition method, which processes a video to obtain a plurality of audio sub-segments and an image frame sequence corresponding to each audio sub-segment; performing text recognition on the image frame sequence to obtain a plurality of text results, and processing the plurality of text results to obtain a plurality of keywords; processing each audio sub-segment through a voice recognition model to obtain a plurality of candidate voice recognition results; and determining a target text recognition result of each audio sub-segment according to the candidate voice recognition results and the keywords, and acquiring a voice recognition result of the video according to the target text recognition result of each audio sub-segment.
Therefore, the video voice recognition is assisted through the text recognition result in the image of the video, and the accuracy of the video voice recognition is improved.
A video speech recognition method, apparatus, device, and storage medium according to embodiments of the present application are described below with reference to the accompanying drawings.
Fig. 1 is a flowchart illustrating a video speech recognition method according to an embodiment of the present application.
The embodiment of the present application is described by taking as an example that the video speech recognition method is configured in a video speech recognition apparatus; the apparatus can be applied to any device, so that the device can perform a video speech recognition function.
As shown in fig. 1, the video speech recognition method may include the following steps:
step 101, processing a video, and obtaining a plurality of audio sub-segments and an image frame sequence corresponding to each audio sub-segment.
In the embodiment of the application, the video is the video to be subjected to speech recognition, and it may be acquired according to the application scene, for example a video uploaded by a terminal or a video shot in real time; the specific source is selected according to the application scene.
In the embodiment of the present application, it is understood that the video may provide continuous audio data over a period of time or audio data at intervals; the video can be processed in a corresponding manner, selected according to the scene, to obtain a plurality of audio sub-segments and the image frame sequence corresponding to each audio sub-segment, for example as follows:
In a first example, the video is processed to obtain audio data and image data, and silence detection is performed on the audio data to obtain a plurality of audio sub-segments; the image frame sequence corresponding to each audio sub-segment is then acquired from the image data.
In a second example, the video is processed to obtain audio data and image data, and the audio data is cut according to a preset time period to obtain a plurality of audio sub-segments; the image frame sequence corresponding to each audio sub-segment is then acquired from the image data.
In the embodiment of the present application, the image frame sequence corresponding to each audio sub-segment may be understood as a set of multiple images corresponding to one piece of audio.
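As a minimal sketch of this preprocessing step, assuming the ffmpeg command-line tool is available (the file names, 16 kHz audio rate, and 1 fps frame rate below are illustrative choices, not part of the original disclosure):

```python
import os
import subprocess

def split_video(video_path: str) -> None:
    """Demultiplex a video into 16 kHz mono audio and 1 fps image frames.

    Sketch only: assumes ffmpeg is installed; paths and rates are illustrative.
    """
    os.makedirs("frames", exist_ok=True)
    # Extract the audio track: no video (-vn), mono, 16 kHz PCM.
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vn", "-ac", "1",
         "-ar", "16000", "audio.wav"],
        check=True,
    )
    # Sample the image stream at 1 frame per second.
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vf", "fps=1",
         "frames/frame_%05d.png"],
        check=True,
    )

split_video("input.mp4")
```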
Step 102, performing text recognition on the image frame sequence to obtain a plurality of text results, and processing the plurality of text results to obtain a plurality of keywords.
In this embodiment of the present application, performing text recognition on the image frame sequence to obtain a plurality of text results means performing text recognition on each image in the image frame sequence to obtain the corresponding text result. There are many ways to do so, for example:
in a first example, optical character recognition is performed on each image frame in a series of image frames to obtain a plurality of text results.
In the second example, each frame image in the image frame series is processed through a trained neural network to obtain a plurality of text results; the trained neural network has the capability of performing text recognition on the image through the training sample.
In the embodiment of the present application, there are various ways to process a plurality of text results and obtain a plurality of keywords, which are described as follows:
in the first example, each text result is subjected to word segmentation processing to obtain a plurality of words, and the plurality of words are filtered to obtain a plurality of keywords.
In a second example, each text result is matched against a text template to obtain a plurality of keywords, wherein the text template is preset.
And 103, processing each audio sub-segment through the voice recognition model to obtain a plurality of candidate voice recognition results.
And 104, determining a target text recognition result of each audio sub-segment according to the candidate voice recognition results and the keywords, and acquiring a voice recognition result of the video according to the target text recognition result of each audio sub-segment.
In the embodiment of the application, the speech recognition model is a model trained in advance on samples and capable of recognizing speech. Each audio sub-segment is input into the speech recognition model for processing, and a plurality of candidate speech recognition results can be obtained; for example, inputting audio sub-segment A into the speech recognition model yields acoustically similar candidate recognition results such as "hello,", "number of you" and "hello lama".
Further, there are various ways to determine the target text recognition result for each audio sub-segment based on the plurality of candidate speech recognition results and the plurality of keywords, as illustrated below.
As an example, the recognition probability of each candidate speech recognition text is obtained, the statistical probability and a first coefficient of each candidate speech recognition text are obtained, and the contribution probability and a second coefficient of the plurality of keywords corresponding to each candidate speech recognition text are obtained. A calculation is then performed on the recognition probability, the statistical probability, the first coefficient, the contribution probability, and the second coefficient of each candidate speech recognition text to obtain the accuracy of each candidate speech recognition text, and the target text recognition result of each audio sub-segment is determined from the plurality of candidate speech recognition results according to that accuracy.
As another example, the accuracy and corresponding coefficient of each candidate speech recognition text and the coefficients of the plurality of keywords are obtained, a weighted summation is performed, and the target text recognition result of each audio sub-segment is determined from the plurality of candidate speech recognition results according to the weighted summation result.
Further, the target text recognition results of the plurality of audio sub-segments are combined and spliced in order to obtain the speech recognition result of the video.
According to the video speech recognition method of the embodiment of the application, a video is processed to obtain a plurality of audio sub-segments and an image frame sequence corresponding to each audio sub-segment; text recognition is performed on the image frame sequence to obtain a plurality of text results, and the text results are processed to obtain a plurality of keywords; each audio sub-segment is processed through a speech recognition model to obtain a plurality of candidate speech recognition results; and a target text recognition result of each audio sub-segment is determined according to the candidate speech recognition results and the keywords, and a speech recognition result of the video is obtained according to the target text recognition results. In this way, text recognized in the images of the video assists the speech recognition, improving the accuracy of video speech recognition.
Fig. 2 is a flowchart illustrating a video speech recognition method according to a second embodiment of the present application.
As shown in fig. 2, the video speech recognition method may include the steps of:
step 201, processing the video, obtaining audio data and image data, performing silence detection on the audio data, obtaining a plurality of audio sub-segments, and obtaining an image frame sequence corresponding to each audio sub-segment from the image data.
In this embodiment of the present application, a video may be sampled, for example, the video may be sampled according to an image sampling frequency to obtain image data, and the video may be sampled according to an audio signal sampling frequency to obtain audio data.
For example, given a piece of video data V = {P, S}: P = {I_1, I_2, ..., I_N} is the image data in the video data, where I_n is the n-th frame image, n ∈ [1, N], and the image sampling frequency is f_P; S = {x_1, x_2, ..., x_M} is the audio data in the video data, where x_m is the m-th audio sample, m ∈ [1, M], and the audio signal sampling frequency is f_S. In this application, for example, the image sampling frequency is f_P = 30 Hz and the audio signal sampling frequency is f_S = 16 kHz.
Further, silence detection is performed on the audio data to obtain a plurality of audio sub-segments. It can be understood that the audio of a video usually contains many silent or non-speech portions; to improve the accuracy and efficiency of speech recognition, the application uses a silence detection algorithm to detect and delete the silent portions of the audio data, cuts a long piece of audio into a plurality of sub-segments, and records the start and end time of each speech segment.
For example, as shown in fig. 3, factors such as background music may prevent the silence detection algorithm from fully truncating the audio, producing overly long sub-segments; to prevent this, a fixed duration threshold T may be set, and audio with a duration greater than the threshold T is forcibly divided into a plurality of segments of duration T. In the present application, for example, the fixed duration threshold is T = 10 s.
Specifically, for a given piece of audio data, let X_1, X_2, ..., X_K be the sequence of audio sub-segments obtained after silence detection, where each audio sub-segment X_k has start time ts_k and end time te_k, k ∈ [1, K]. In the present application, for example, WebRTC (Web Real-Time Communication) is used for audio silence detection.
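A minimal sketch of such frame-level silence detection using the py-webrtcvad bindings (the 30 ms frame length, aggressiveness mode, and the forced-split threshold T are illustrative assumptions; no smoothing or padding is applied):

```python
import webrtcvad

def detect_speech_segments(pcm: bytes, sample_rate: int = 16000,
                           frame_ms: int = 30, max_len_s: float = 10.0):
    """Return (start, end) times in seconds of voiced sub-segments.

    `pcm` is 16-bit mono PCM audio; sketch only.
    """
    vad = webrtcvad.Vad(2)  # aggressiveness 0-3; 2 is a middle ground
    bytes_per_frame = int(sample_rate * frame_ms / 1000) * 2
    segments, start = [], None
    n_frames = len(pcm) // bytes_per_frame
    for i in range(n_frames):
        frame = pcm[i * bytes_per_frame:(i + 1) * bytes_per_frame]
        t = i * frame_ms / 1000.0
        if vad.is_speech(frame, sample_rate):
            if start is None:
                start = t
            # Force-split sub-segments longer than the threshold T.
            if t + frame_ms / 1000.0 - start >= max_len_s:
                segments.append((start, t + frame_ms / 1000.0))
                start = None
        elif start is not None:
            segments.append((start, t))
            start = None
    if start is not None:
        segments.append((start, n_frames * frame_ms / 1000.0))
    return segments
```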
In the embodiment of the present application, there are many ways to obtain the image frame sequence corresponding to each audio sub-segment from the image data. As one scene example, the start time and end time corresponding to each audio sub-segment are acquired; a starting frame image is determined according to the product of the start time and the image sampling frequency, and an ending frame image is determined according to the product of the end time and the image sampling frequency; interval frame images are then determined from the image data according to the starting frame image and the ending frame image, and images are extracted from the interval frame images according to a preset frequency to obtain the image frame sequence corresponding to each audio sub-segment.
Specifically, since the sampling frequencies of the image data and the audio data are not identical and the actual captions in the image stream change at a low frequency, for each audio sub-segment X_k the corresponding image frame sequence Y_k is extracted from the image data. The start and end frame numbers of the image frame sequence are ⌈ts_k · f_P⌉ and ⌊te_k · f_P⌋ respectively, where ⌈·⌉ and ⌊·⌋ denote upper (ceiling) and lower (floor) rounding. In this application, for example, the sampling frequency of the image frame sequence Y_k is set to 1 Hz; that is, between frame ⌈ts_k · f_P⌉ and frame ⌊te_k · f_P⌋ of the overall image data, one image is sampled every f_P frames to form the image frame sequence Y_k corresponding to each audio sub-segment X_k.
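As a short illustration of this frame-selection arithmetic (plain Python; the ceiling/floor convention follows the reconstruction above):

```python
import math

def frame_range(ts_k: float, te_k: float, f_p: int = 30):
    """Map an audio sub-segment [ts_k, te_k] (seconds) to image frame indices."""
    start_frame = math.ceil(ts_k * f_p)   # upper rounding of the start
    end_frame = math.floor(te_k * f_p)    # lower rounding of the end
    # Sample one image every f_p frames, i.e. at 1 Hz for f_p = 30 Hz.
    return list(range(start_frame, end_frame + 1, f_p))

# Example: a sub-segment from 3.2 s to 9.7 s at f_P = 30 Hz
print(frame_range(3.2, 9.7))  # [96, 126, 156, 186, 216, 246, 276]
```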
Step 202, performing optical character recognition on each frame image in the image frame sequence to obtain a plurality of text results, performing word segmentation on each text result to obtain a plurality of words, and filtering the words to obtain a plurality of keywords.
In the embodiment of the present application, the text in the images is recognized by an Optical Character Recognition (OCR) algorithm. The method adopts a general neural-network-based OCR algorithm to recognize the text in each frame image of the image frame sequence Y_k, obtaining a plurality of text results L_1, L_2, ..., L_R, as shown in fig. 4. In the present application, a conventional CRNN (Convolutional Recurrent Neural Network) is used as the OCR algorithm to recognize text in the images.
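The disclosure uses a CRNN for OCR; as an illustrative stand-in only, an off-the-shelf engine such as pytesseract can demonstrate the same step (the language pack and frame directory are assumptions):

```python
import glob
import pytesseract
from PIL import Image

def recognize_frames(frame_dir: str) -> list[str]:
    """Run OCR over every sampled frame image, returning one text result each."""
    results = []
    for path in sorted(glob.glob(f"{frame_dir}/*.png")):
        # 'chi_sim' assumes simplified-Chinese subtitles; swap as needed.
        text = pytesseract.image_to_string(Image.open(path), lang="chi_sim")
        results.append(text.strip())
    return results
```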
Further, word segmentation is performed on each text result to obtain the word segmentation list W_1, W_2, ..., W_R corresponding to each text result. In the present application, for example, if the text result L_1 obtained by the OCR algorithm from the image is "Hello, here is the subtitle of the video.", then after word segmentation the word segmentation list W_1 is {"hello", ",", "here", "is", "video", "subtitle", "."}.
In the method, it is considered that punctuation marks, single characters, digits, letters, and similar content that may exist in the word segmentation list can hardly assist speech recognition well, so punctuation marks and single characters can be removed from the list. In addition, OCR results are not always completely correct, and a character may be replaced with a special character, so single-character results will appear after word segmentation; removing them effectively reduces the negative influence of incomplete OCR information on video speech recognition.
For example, "you are good, here is the subtitle of the video. For example, in some videos, the subtitle may replace "view" with "s", the recognition result is "good, here, the subtitle at s frequency", and after the word segmentation, the word segmentation list { "good", "", "here", "is", "s", "frequency", "subtitle", "etc." can be obtained. After the single words are removed, the s frequency is not used for voice recognition, so that misleading information caused by wrong words is avoided.
Step 203, processing each audio sub-segment through a speech recognition model, obtaining a plurality of candidate speech recognition results, obtaining a recognition probability of each candidate speech recognition text, and obtaining a statistical probability and a first coefficient of each candidate speech recognition text.
And 204, acquiring the contribution probability and the second coefficient of the plurality of keywords corresponding to each candidate voice recognition text, and calculating according to the recognition probability, the statistical probability, the first coefficient, the contribution probability and the second coefficient of each candidate voice recognition text to acquire the accuracy of each candidate voice recognition text.
Step 205, determining a target text recognition result of each audio sub-segment from a plurality of candidate speech recognition results according to the accuracy of each candidate speech recognition text, and obtaining a speech recognition result of the video according to the target text recognition result of each audio sub-segment.
In the embodiment of the application, the recognition probability of each candidate speech recognition text is the probability, given by the speech recognition model, that the candidate text was correctly recognized. The statistical probability of each candidate text reflects how natural the text is, that is, how likely its expression conforms to linguistic convention, and the corresponding first coefficient can be set according to the application scene. The contribution probability of the plurality of keywords corresponding to each candidate text reflects how much the keywords support that recognition result, and the corresponding second coefficient can likewise be set according to the application scene.
In the embodiment of the present application, speech recognition refers to recognizing the corresponding text result from a segment of audio signal. The speech recognition model in the embodiment of the present application mainly includes two parts: an acoustic model f_A and a language model f_L. The acoustic model mainly extracts features from the audio signal, and the language model mainly makes the speech recognition model output a relatively natural language text result. In the speech recognition process, the audio features obtained by the acoustic model are therefore combined with the language model by decoding in the solution space, and an optimal output text sequence is searched out, as follows:
b* = argmax_{b ∈ B} [ f_A(b|X_k) + α · f_L(b) + β · f_H(b, W_1, W_2, ..., W_R) ]
where b is a candidate speech recognition text obtained by speech recognition, B is the text result solution space of speech recognition, f_A(b|X_k) is the recognition probability given by the acoustic model, f_L(b) is the statistical probability of the text b, f_H(b, W_1, W_2, ..., W_R) is the contribution probability of the keywords in the text b, and the first coefficient α and the second coefficient β configure the weights of the language model and of the keywords, respectively.
When subtitles do not exist in the video, the processing can be directly performed through a voice recognition model.
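A compact sketch of this keyword-assisted decoding, cast as N-best rescoring (the scores, α and β values, and the keyword-hit function standing in for f_H are all illustrative assumptions, not the patent's exact definitions):

```python
def keyword_hit_score(candidate: str, keywords: list[str]) -> float:
    """Stand-in for f_H: fraction of OCR keywords appearing in the candidate."""
    if not keywords:
        return 0.0
    return sum(kw in candidate for kw in keywords) / len(keywords)

def rescore(nbest: list[tuple[str, float, float]],
            keywords: list[str],
            alpha: float = 0.5, beta: float = 1.0) -> str:
    """Pick the target text from an N-best list.

    Each entry is (text, acoustic_score f_A, language_score f_L);
    the scores would come from an existing ASR decoder.
    """
    return max(
        nbest,
        key=lambda e: e[1] + alpha * e[2] + beta * keyword_hit_score(e[0], keywords),
    )[0]

# Toy usage with made-up log-scores for three homophone-like candidates
nbest = [("你好，视频字幕", -12.1, -3.0),
         ("拟好，试频字幕", -12.0, -4.2),
         ("你好，事品字母", -12.3, -4.8)]
print(rescore(nbest, ["视频", "字幕"]))  # -> "你好，视频字幕"
```

When the OCR keyword list is empty (no subtitles), the β term vanishes and the choice reduces to the ordinary acoustic-plus-language-model decision, matching the note above.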
The video speech recognition method of the embodiment of the application processes a video to obtain audio data and image data, performs silence detection on the audio data to obtain a plurality of audio sub-segments, and acquires the image frame sequence corresponding to each audio sub-segment from the image data. Optical character recognition is performed on each frame image in the image frame sequence to obtain a plurality of text results; word segmentation is performed on each text result to obtain a plurality of words, and the words are filtered to obtain a plurality of keywords. Each audio sub-segment is processed through a speech recognition model to obtain a plurality of candidate speech recognition results; the recognition probability, statistical probability, and first coefficient of each candidate speech recognition text are obtained, together with the contribution probability and second coefficient of the keywords corresponding to each candidate text, and a calculation over these quantities yields the accuracy of each candidate text. The target text recognition result of each audio sub-segment is then determined from the candidate results according to this accuracy, and the speech recognition result of the video is obtained from the target text recognition results. In this way, for the problems of high difficulty and high labeling cost in video speech recognition, text information in the images is used as an aid to improve the accuracy of speech recognition in video scenes, and retraining and re-optimization of the speech recognition model are avoided: on the basis of an existing speech recognition model only, text information extracted from the video by text recognition serves as an aid, finally improving the accuracy of speech recognition for subtitled videos efficiently and at low cost.
In order to implement the above embodiments, the present application further provides a video speech recognition apparatus.
Fig. 5 is a schematic structural diagram of a video speech recognition apparatus according to a fifth embodiment of the present application.
As shown in fig. 5, the video speech recognition apparatus 500 may include: a first obtaining module 510, a recognition module 520, a second obtaining module 530, a processing module 540, a determination module 550, and a third obtaining module 560.
The first obtaining module 510 is configured to process a video, obtain a plurality of audio sub-segments, and obtain an image frame sequence corresponding to each audio sub-segment.
The recognition module 520 is configured to perform text recognition on the image frame sequence to obtain a plurality of text results.
The second obtaining module 530 is configured to process the text results to obtain a plurality of keywords.
The processing module 540 is configured to process each audio sub-segment through the speech recognition model to obtain a plurality of candidate speech recognition results.
A determining module 550, configured to determine a target text recognition result of each audio sub-segment according to the plurality of candidate speech recognition results and the plurality of keywords.
And a third obtaining module 560, configured to obtain a voice recognition result of the video according to the target text recognition result of each audio sub-segment.
Further, in a possible implementation manner of the embodiment of the present application, referring to fig. 6, on the basis of the embodiment shown in fig. 5, the first obtaining module 510 includes: a processing unit 511, a detection unit 512 and an acquisition unit 513.
The processing unit 511 is configured to process the video and obtain audio data and image data.
A detecting unit 512, configured to perform silence detection on the audio data to obtain multiple audio sub-segments.
An obtaining unit 513 is configured to obtain, from the image data, an image frame sequence corresponding to each of the audio sub-segments.
Further, in a possible implementation manner of the embodiment of the present application, the obtaining unit 513 is specifically configured to: acquire the start time and end time corresponding to each audio sub-segment; determine a starting frame image according to the product of the start time and the image sampling frequency, and determine an ending frame image according to the product of the end time and the image sampling frequency; and determine interval frame images from the image data according to the starting frame image and the ending frame image, and extract images from the interval frame images according to a preset frequency to obtain the image frame sequence corresponding to each audio sub-segment.
Further, in a possible implementation manner of the embodiment of the present application, the identifying module 520 is specifically configured to: perform optical character recognition on each frame image in the image frame sequence to obtain a plurality of text results.
Further, in a possible implementation manner of the embodiment of the present application, the second obtaining module 530 is specifically configured to: performing word segmentation processing on each text result to obtain a plurality of words; and filtering the plurality of participles to obtain the plurality of keywords.
Further, in a possible implementation manner of the embodiment of the present application, the determining module 550 is specifically configured to: acquiring the recognition probability of each candidate voice recognition text; acquiring the statistical probability and a first coefficient of each candidate voice recognition text; acquiring the contribution probability and a second coefficient of the plurality of keywords corresponding to each candidate voice recognition text; calculating according to the recognition probability, the statistical probability, the first coefficient, the contribution probability and the second coefficient of each candidate voice recognition text to obtain the accuracy of each candidate voice recognition text; and determining the target text recognition result of each audio sub-segment from the plurality of candidate speech recognition results according to the accuracy of each candidate speech recognition text.
It should be noted that the foregoing explanation on the embodiment of the video speech recognition method is also applicable to the video speech recognition apparatus of the embodiment, and details are not repeated here.
The video speech recognition apparatus of the embodiment of the application processes a video to obtain a plurality of audio sub-segments and an image frame sequence corresponding to each audio sub-segment; performs text recognition on the image frame sequence to obtain a plurality of text results and processes them to obtain a plurality of keywords; processes each audio sub-segment through a speech recognition model to obtain a plurality of candidate speech recognition results; and determines a target text recognition result of each audio sub-segment according to the candidate speech recognition results and the keywords, obtaining the speech recognition result of the video from the target text recognition results. In this way, text recognized in the images of the video assists the speech recognition, improving the accuracy of video speech recognition.
In order to implement the foregoing embodiments, the present application further provides a server, including a memory, a processor, and a computer program stored on the memory and executable on the processor; when the processor executes the program, the video speech recognition method proposed in the foregoing embodiments of the present application is implemented.
In order to implement the foregoing embodiments, the present application also proposes a non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, implements the video speech recognition method as proposed by the foregoing embodiments of the present application.
In order to implement the foregoing embodiments, the present application also provides a computer program product; when the instructions in the computer program product are executed by a processor, the video speech recognition method as set forth in the foregoing embodiments of the present application is performed.
FIG. 7 illustrates a block diagram of an exemplary server suitable for use in implementing embodiments of the present application. The server 12 shown in fig. 7 is only an example, and should not bring any limitation to the function and the scope of use of the embodiments of the present application.
As shown in FIG. 7, the server 12 is in the form of a general purpose computing device. The components of the server 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 that couples various system components including the system memory 28 and the processing unit 16.
Bus 18 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. These architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, enhanced ISA bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus.
The server 12 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by server 12 and includes both volatile and nonvolatile media, removable and non-removable media.
Memory 28 may include computer system readable media in the form of volatile Memory, such as Random Access Memory (RAM) 30 and/or cache Memory 32. The server 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 7, and commonly referred to as a "hard drive"). Although not shown in FIG. 7, a disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a Compact disk Read Only Memory (CD-ROM), a Digital versatile disk Read Only Memory (DVD-ROM), or other optical media) may be provided. In these cases, each drive may be connected to bus 18 by one or more data media interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the application.
A program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which examples or some combination thereof may comprise an implementation of a network environment. Program modules 42 generally perform the functions and/or methodologies of the embodiments described herein.
The server 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), with one or more devices that enable a user to interact with the server 12, and/or with any devices (e.g., network card, modem, etc.) that enable the server 12 to communicate with one or more other computing devices. Such communication may be through an input/output (I/O) interface 22. Moreover, the server 12 may also communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public Network such as the Internet) via the Network adapter 20. As shown in FIG. 7, the network adapter 20 communicates with the other modules of the server 12 via the bus 18. It should be understood that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the server 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
The processing unit 16 executes various functional applications and data processing, such as implementing the video speech recognition method mentioned in the foregoing embodiments, by running a program stored in the system memory 28.
In the description herein, reference to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Furthermore, the terms "first", "second" and "first" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing steps of a custom logic function or process, and alternate implementations are included within the scope of the preferred embodiment of the present application in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present application.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc. Although embodiments of the present application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present application, and that variations, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the present application.

Claims (15)

1. A video speech recognition method, comprising:
processing a video to obtain a plurality of audio sub-segments and an image frame sequence corresponding to each audio sub-segment;
performing text recognition on the image frame sequence to obtain a plurality of text results, and processing the text results to obtain a plurality of keywords;
processing each audio sub-segment through a voice recognition model to obtain a plurality of candidate voice recognition results;
and determining a target text recognition result of each audio sub-segment according to the candidate voice recognition results and the keywords, and acquiring a voice recognition result of the video according to the target text recognition result of each audio sub-segment.
2. The method of claim 1, wherein said processing video to obtain a plurality of audio sub-segments and a sequence of image frames corresponding to each of said audio sub-segments comprises:
processing the video to obtain audio data and image data;
performing silence detection on the audio data to obtain a plurality of audio sub-segments;
acquiring an image frame sequence corresponding to each audio sub-segment from the image data.
3. The method of claim 2, wherein said obtaining from said image data a sequence of image frames corresponding to each of said audio sub-segments comprises:
acquiring the corresponding start time and end time of each audio sub-segment;
determining a starting frame image according to the product of the starting time and the image sampling frequency, and determining an ending frame image according to the product of the ending time and the image sampling frequency;
and determining an interval frame image from the image data according to the starting frame image and the ending frame image, and extracting the image from the interval frame image according to a preset frequency to obtain an image frame sequence corresponding to each audio sub-segment.
4. The method of claim 1, wherein the text recognition of the sequence of image frames to obtain a plurality of text results comprises:
and carrying out optical character recognition on each frame image in the image frame sequence to obtain a plurality of text results.
5. The method of claim 1, wherein said processing the plurality of textual results to obtain a plurality of keywords comprises:
performing word segmentation processing on each text result to obtain a plurality of words;
and filtering the plurality of participles to obtain the plurality of keywords.
6. The method of claim 1, wherein said determining a target text recognition result for each of said audio sub-segments based on said plurality of candidate speech recognition results and said plurality of keywords comprises:
acquiring the recognition probability of each candidate voice recognition text;
acquiring the statistical probability and a first coefficient of each candidate voice recognition text;
acquiring the contribution probability and a second coefficient of the plurality of keywords corresponding to each candidate voice recognition text;
calculating according to the recognition probability, the statistical probability, the first coefficient, the contribution probability and the second coefficient of each candidate voice recognition text to obtain the accuracy of each candidate voice recognition text;
and determining the target text recognition result of each audio sub-segment from the plurality of candidate speech recognition results according to the accuracy of each candidate speech recognition text.
7. A video speech recognition apparatus, comprising:
the first acquisition module is used for processing a video, acquiring a plurality of audio sub-segments and an image frame sequence corresponding to each audio sub-segment;
the identification module is used for carrying out text identification on the image frame sequence to obtain a plurality of text results;
the second acquisition module is used for processing the text results to acquire a plurality of keywords;
the processing module is used for processing each audio sub-segment through a voice recognition model to obtain a plurality of candidate voice recognition results;
a determining module, configured to determine a target text recognition result of each audio sub-segment according to the candidate speech recognition results and the keywords;
and the third acquisition module is used for acquiring the voice recognition result of the video according to the target text recognition result of each audio sub-segment.
8. The apparatus of claim 7, wherein the first obtaining module comprises:
the processing unit is used for processing the video to acquire audio data and image data;
the detection unit is used for carrying out silence detection on the audio data to obtain a plurality of audio sub-segments;
an obtaining unit configured to obtain an image frame sequence corresponding to each of the audio sub-segments from the image data.
9. The apparatus of claim 8, wherein the obtaining unit is specifically configured to:
acquiring the corresponding start time and end time of each audio sub-segment;
determining a starting frame image according to the product of the starting time and the image sampling frequency, and determining an ending frame image according to the product of the ending time and the image sampling frequency;
and determining an interval frame image from the image data according to the starting frame image and the ending frame image, and extracting the image from the interval frame image according to a preset frequency to obtain an image frame sequence corresponding to each audio sub-segment.
10. The apparatus of claim 7, wherein the recognition module is specifically configured to:
perform optical character recognition on each frame image in the image frame sequence to obtain the plurality of text results.
11. The apparatus of claim 7, wherein the second acquiring module is specifically configured to:
perform word segmentation on each text result to obtain a plurality of segmented words;
and filter the plurality of segmented words to obtain the plurality of keywords.
12. The apparatus of claim 7, wherein the determining module is specifically configured to:
acquire a recognition probability of each candidate speech recognition text;
acquire a statistical probability and a first coefficient of each candidate speech recognition text;
acquire a contribution probability and a second coefficient of the plurality of keywords corresponding to each candidate speech recognition text;
calculate an accuracy of each candidate speech recognition text according to the recognition probability, the statistical probability, the first coefficient, the contribution probability and the second coefficient of that candidate speech recognition text;
and determine the target text recognition result of each audio sub-segment from the plurality of candidate speech recognition results according to the accuracy of each candidate speech recognition text.
13. A server, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the video speech recognition method according to any one of claims 1 to 6.
14. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the video speech recognition method according to any one of claims 1 to 6.
15. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the video speech recognition method according to any one of claims 1 to 6.
CN202011617331.1A 2020-12-31 2020-12-31 Video voice recognition method, device, equipment and storage medium Pending CN113838460A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011617331.1A CN113838460A (en) 2020-12-31 2020-12-31 Video voice recognition method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113838460A true CN113838460A (en) 2021-12-24

Family

ID=78962480

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011617331.1A Pending CN113838460A (en) 2020-12-31 2020-12-31 Video voice recognition method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113838460A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108833973A (en) * 2018-06-28 2018-11-16 腾讯科技(深圳)有限公司 Video feature extraction method and apparatus, and computer device
CN110267061A (en) * 2019-04-30 2019-09-20 新华智云科技有限公司 News video splitting method and system
CN111356022A (en) * 2020-04-18 2020-06-30 徐琼琼 Video file processing method based on voice recognition
CN111968646A (en) * 2020-08-25 2020-11-20 腾讯科技(深圳)有限公司 Voice recognition method and device
CN111968649A (en) * 2020-08-27 2020-11-20 腾讯科技(深圳)有限公司 Subtitle correction method, subtitle display method, device, equipment and medium
CN111814770A (en) * 2020-09-04 2020-10-23 中山大学深圳研究院 Content keyword extraction method of news video, terminal device and medium

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114398517A (en) * 2021-12-31 2022-04-26 北京达佳互联信息技术有限公司 Video data acquisition method and device
CN114363531A (en) * 2022-01-14 2022-04-15 中国平安人寿保险股份有限公司 H5-based case comment video generation method, device, equipment and medium
CN114363531B (en) * 2022-01-14 2023-08-01 中国平安人寿保险股份有限公司 H5-based text description video generation method, device, equipment and medium
CN115022732A (en) * 2022-05-25 2022-09-06 阿里巴巴(中国)有限公司 Video generation method, device, equipment and medium
CN115022732B (en) * 2022-05-25 2023-11-03 阿里巴巴(中国)有限公司 Video generation method, device, equipment and medium
CN115438223A (en) * 2022-09-01 2022-12-06 抖音视界有限公司 Video processing method and device, electronic equipment and storage medium
CN116631447A (en) * 2023-07-24 2023-08-22 科大讯飞股份有限公司 Noise extraction method, device, equipment and readable storage medium
CN116631447B (en) * 2023-07-24 2023-12-01 科大讯飞股份有限公司 Noise extraction method, device, equipment and readable storage medium

Similar Documents

Publication Publication Date Title
CN113838460A (en) Video voice recognition method, device, equipment and storage medium
CN107305541B (en) Method and device for segmenting speech recognition text
CN108829894B (en) Spoken word recognition and semantic recognition method and device
CN108962227B (en) Voice starting point and end point detection method and device, computer equipment and storage medium
CN107945792B (en) Voice processing method and device
CN109887497B (en) Modeling method, device and equipment for speech recognition
US6434520B1 (en) System and method for indexing and querying audio archives
CN110033760B (en) Modeling method, device and equipment for speech recognition
CN109754783B (en) Method and apparatus for determining boundaries of audio sentences
CN110232340B (en) Method and device for establishing video classification model and video classification
US20180047387A1 (en) System and method for generating accurate speech transcription from natural speech audio signals
US9626575B2 (en) Visual liveness detection
CN111341305A (en) Audio data labeling method, device and system
CN109102824B (en) Voice error correction method and device based on man-machine interaction
CN113450774A (en) Training data acquisition method and device
US20240064383A1 (en) Method and Apparatus for Generating Video Corpus, and Related Device
CN111951825A (en) Pronunciation evaluation method, medium, device and computing equipment
CN112818680A (en) Corpus processing method and device, electronic equipment and computer-readable storage medium
Yang et al. An automated analysis and indexing framework for lecture video portal
CN116312552A (en) Video speaker journaling method and system
CN109858005B (en) Method, device, equipment and storage medium for updating document based on voice recognition
CN114357206A (en) Education video color subtitle generation method and system based on semantic analysis
CN109635151A (en) Establish the method, apparatus and computer equipment of audio retrieval index
JP3664499B2 (en) Voice information processing method and apparatus
CN114203180A (en) Conference summary generation method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination