WO2021047103A1 - Voice recognition method and device - Google Patents


Info

Publication number: WO2021047103A1
Authority: WIPO (PCT)
Prior art keywords: audio data, sub, recognition result, speech recognition, time
Application number: PCT/CN2019/127672
Other languages: French (fr), Chinese (zh)
Inventors: 汪俊, 闫博群, 李索恒, 张志齐, 郑达
Original assignee: 上海依图信息技术有限公司
Application filed by 上海依图信息技术有限公司
Publication of WO2021047103A1

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 — Speech recognition
    • G10L15/06 — Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/26 — Speech to text systems
    • G10L15/28 — Constructional details of speech recognition systems

Definitions

  • the embodiments of the present invention relate to the field of information technology, and in particular, to a voice recognition method and device.
  • Voice recognition technology is a technology that enables machines to convert voice information into corresponding text or commands through the process of recognition and understanding.
  • In speech recognition, the result at the current moment must be determined from both the speech information at the current moment and the context information at the current moment.
  • In the prior art, the calculation time of the speech information at the current moment does not match the calculation time of the context information; as a result, the output of the speech recognition result lags, which cannot meet real-time requirements.
  • the embodiment of the present invention provides a voice recognition method and device, which can match the calculation time of the voice information at the current moment with the calculation time of the context information, and meet the real-time requirements.
  • an embodiment of the present invention provides a speech recognition method, the method is applied to a speech recognition system, the speech recognition system includes at least a first speech recognition model and a second speech recognition model, the first speech recognition model has n processing modules, each module has an input terminal and a corresponding output terminal, the second speech recognition model has n processing modules, each module has an input terminal and a corresponding output terminal, the method includes:
  • Acquiring audio data to be recognized, where the audio data to be recognized is composed of sub-audio data at n moments, where n is greater than or equal to 1;
  • For the sub-audio data at the i-th moment, inputting the sub-audio data into the i-th processing module of the first speech recognition model and the i-th processing module of the second speech recognition model to obtain a first recognition result and a second recognition result respectively, where the first recognition result is determined based on the sub-audio data from the 1st moment to the i-th moment in the audio to be recognized, and the second recognition result is determined based on the sub-audio data from the i-th moment to the n-th moment in the audio to be recognized; each processing module in the first speech recognition model corresponds to the sub-audio data at one moment, and each processing module in the second speech recognition model corresponds to the sub-audio data at one moment; the calculation time of the first speech recognition model matches the calculation time of the second speech recognition model, and the calculation dimension of the first speech recognition model is greater than the calculation dimension of the second speech recognition model; i is determined according to the calculation dimension of the first speech recognition model and the calculation dimension of the second speech recognition model, and i belongs to n;
  • the text recognition result of the sub-audio data at the i-th moment is determined according to the first recognition result and the second recognition result.
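As a rough sketch of the steps above, the following Python fragment shows how per-moment results from a forward model and a backward model could be merged. All function names and the toy stand-in models are hypothetical illustrations, not the patent's actual models:

```python
# Hypothetical sketch of the claimed per-moment combination. The first
# model's module i sees moments 1..i; the second model's module i sees
# moments i..n; the text result at moment i merges the two.
def recognize(sub_audio, first_model, second_model, combine):
    first_results = first_model(sub_audio)    # one result per moment, forward
    second_results = second_model(sub_audio)  # one result per moment, backward
    return [combine(f, s) for f, s in zip(first_results, second_results)]

# Toy stand-in models so the sketch runs end to end.
fwd = lambda xs: [sum(xs[:i + 1]) for i in range(len(xs))]   # uses moments 1..i
bwd = lambda xs: [sum(xs[i:]) for i in range(len(xs))]       # uses moments i..n
merge = lambda f, s: 0.5 * f + 0.5 * s

print(recognize([1.0, 2.0, 3.0], fwd, bwd, merge))  # [3.5, 4.0, 4.5]
```

The point of the structure is that both models can be run over the same chunk list, so no per-moment result has to wait on the other direction finishing first.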
  • In a possible implementation, for the sub-audio data at the i-th moment, inputting the sub-audio data into the i-th processing module of the first speech recognition model and the i-th processing module of the second speech recognition model to obtain the first recognition result and the second recognition result respectively includes:
  • matching the calculation time of the first speech recognition model with the calculation time of the second speech recognition model, where the matching includes:
  • the difference between the time at which the first speech recognition model calculates the first recognition result and the time at which the second speech recognition model calculates the second recognition result being smaller than a preset threshold.
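This matching condition can be read as a simple predicate on the two models' calculation times. The helper below is an illustrative sketch only; the timings and threshold are made-up values:

```python
def calculation_times_match(t_first, t_second, threshold):
    """True when the difference between the two models' calculation
    times is smaller than the preset threshold."""
    return abs(t_first - t_second) < threshold

# Hypothetical timings: 0.98 s vs 1.00 s with a 0.05 s threshold match.
print(calculation_times_match(0.98, 1.00, 0.05))  # True
```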
  • the method further includes:
  • for the sub-audio data at the (i+1)-th moment, inputting the sub-audio data into the (i+1)-th processing module of the first speech recognition model to obtain a first recognition result, and obtaining a second recognition result, where the first recognition result is determined based on the sub-audio data from the 1st moment to the (i+1)-th moment in the audio to be recognized, and the second recognition result is determined in the process of determining the text recognition result of the sub-audio data at the i-th moment;
  • the text recognition result of the sub-audio data at the (i+1)-th moment is determined according to the first recognition result and the second recognition result.
  • determining the recognition result of the sub-audio data according to the first recognition result and the second recognition result includes:
  • determining the recognition result of the sub-audio data according to the weight of the first recognition result and the weight of the second recognition result.
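One possible reading of this weighted combination, with hypothetical scores and weight values, is:

```python
def weighted_result(first_result, second_result, w_first, w_second):
    """Combine the two models' per-moment scores by their weights.
    The weights (and scores) here are illustrative values only."""
    return (w_first * first_result + w_second * second_result) / (w_first + w_second)

print(weighted_result(0.8, 0.6, w_first=0.7, w_second=0.3))  # ~0.74
```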
  • the embodiment of the present invention also provides a voice recognition device, the device is applied to a voice recognition system, the voice recognition system at least includes a first voice recognition model and a second voice recognition model, the first voice recognition model has n processing modules, each module has an input terminal and a corresponding output terminal, the second speech recognition model has n processing modules, each module has an input terminal and a corresponding output terminal, and the device includes:
  • the acquiring unit is configured to acquire audio data to be recognized, where the audio data to be recognized is composed of sub-audio data at n moments, where n is greater than or equal to 1;
  • the calculation unit is configured to, for the sub-audio data at the i-th moment, input the sub-audio data into the i-th processing module of the first speech recognition model and the i-th processing module of the second speech recognition model to obtain a first recognition result and a second recognition result respectively,
  • where the first recognition result is determined based on the sub-audio data from the 1st moment to the i-th moment in the audio to be recognized, and the second recognition result is determined based on the sub-audio data from the i-th moment to the n-th moment in the audio to be recognized; each processing module in the first speech recognition model corresponds to the sub-audio data at one moment, and each processing module in the second speech recognition model corresponds to the sub-audio data at one moment; the calculation time of the first speech recognition model matches the calculation time of the second speech recognition model, the calculation dimension of the first speech recognition model is greater than the calculation dimension of the second speech recognition model, and i is determined according to the calculation dimension of the first speech recognition model and the calculation dimension of the second speech recognition model, where i belongs to n;
  • the result determining unit is configured to determine the text recognition result of the sub audio data at the i-th moment according to the first recognition result and the second recognition result.
  • the calculation unit is specifically configured to:
  • match the calculation time of the first speech recognition model with the calculation time of the second speech recognition model, where the matching includes:
  • the difference between the time at which the first speech recognition model calculates the first recognition result and the time at which the second speech recognition model calculates the second recognition result being smaller than a preset threshold.
  • the calculation unit is further configured to:
  • for the sub-audio data at the (i+1)-th moment, input the sub-audio data into the (i+1)-th processing module of the first speech recognition model to obtain a first recognition result, and obtain a second recognition result, where the first recognition result is determined based on the sub-audio data from the 1st moment to the (i+1)-th moment in the audio to be recognized, and the second recognition result is determined in the process of determining the text recognition result of the sub-audio data at the i-th moment;
  • the result determining unit is further configured to:
  • determine the text recognition result of the sub-audio data at the (i+1)-th moment according to the first recognition result and the second recognition result.
  • the result determining unit is specifically configured to:
  • determine the recognition result of the sub-audio data according to the weight of the first recognition result and the weight of the second recognition result.
  • an embodiment of the present invention also provides a computer device, including a memory, a processor, and a computer program stored in the memory and capable of running on the processor.
  • the processor implements the steps of any one of the above speech recognition methods when the program is executed.
  • an embodiment of the present invention also provides a computer-readable storage medium that stores a computer program executable by a computer device.
  • when the program runs on the computer device, the computer device executes the steps of any one of the above speech recognition methods.
  • An embodiment of the present invention provides a computer program product, the computer program product includes a computer program stored on a non-transitory computer-readable storage medium, the computer program includes program instructions, and when the program instructions are executed by a computer, the computer is caused to execute any one of the above speech recognition methods.
  • the first recognition result of the first speech recognition model is determined based on the sub-audio data from the 1st moment to the i-th moment, so the first speech recognition model can be considered to process the output result at the current moment.
  • the second recognition result is determined based on the sub-audio data from the i-th moment to the n-th moment in the audio to be recognized, so the second speech recognition model can be considered to process context information. Because in the embodiment of the present invention the calculation dimension of the first speech recognition model is greater than that of the second speech recognition model, by the time the first speech recognition model has calculated the sub-audio data at the i-th moment, the second speech recognition model has also calculated the sub-audio data at the i-th moment. In this way, the calculation times of the first recognition result and the second recognition result are matched, there is no need to wait for one calculation result after the other is available, and the real-time performance of speech recognition is improved.
  • FIG. 1 is an application scenario architecture diagram provided by an embodiment of the present invention.
  • FIG. 2 is an architecture diagram of a speech recognition system provided by an embodiment of the present invention.
  • FIG. 3 is a schematic flowchart of a voice recognition method provided by an embodiment of the present invention.
  • FIG. 4 is a schematic diagram of an application scenario of a voice recognition method provided by an embodiment of the present invention.
  • FIG. 5 is a schematic structural diagram of a speech recognition device provided by an embodiment of the present invention.
  • FIG. 6 is a schematic structural diagram of a computer device provided by an embodiment of the present invention.
  • Speech recognition technology allows machines to convert speech signals into corresponding text or commands through the process of recognition and understanding. Through speech signal processing and pattern recognition, machines can automatically recognize and understand human spoken language.
  • Speech recognition technology is an interdisciplinary subject that involves a wide range of subjects. It is closely related to subjects such as acoustics, phonetics, linguistics, information theory, pattern recognition theory, and neurobiology. Speech recognition technology usually uses three methods: template matching method, random model method and probabilistic syntax analysis method. Deep learning methods and machine learning methods are also commonly used.
  • Machine learning is a multi-field interdisciplinary subject, involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other subjects. It specializes in studying how computers simulate or realize human learning behaviors in order to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve their own performance.
  • HMM: Hidden Markov Model
  • Deep learning learns the internal laws and representation levels of sample data. The information obtained in the learning process is of great help to the interpretation of data such as text, images, and sounds. Its ultimate goal is to enable machines to analyze and learn like humans, and to recognize data such as text, images, and sounds.
  • Deep learning is a complex machine learning algorithm that has achieved results in speech and image recognition far beyond previous related technologies.
  • Deep learning has achieved many results in search technology, data mining, machine learning, machine translation, natural language processing, multimedia learning, speech, recommendation and personalization technology, and other related fields. Deep learning enables machines to imitate human activities such as audiovisual and thinking, and solves many complex pattern recognition problems, which has made great progress in artificial intelligence-related technologies.
  • a neural network model in a deep learning method can be used for speech recognition; one example is the bidirectional recurrent neural network (BRNN).
  • In a BRNN, each training sequence is presented forward and backward to two separate recurrent neural networks (RNNs), and both are connected to the same output layer.
  • RNN: Recurrent Neural Network
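A minimal pure-Python sketch of this bidirectional structure might look like the following. The fixed weights are toy values, not a trained model:

```python
import math

# Toy bidirectional recurrent pass: one scan runs forward over the
# frames, the other runs backward, and the per-moment states of both
# are paired as input to the output layer. Weights are illustrative.
W, U = 0.5, 0.25  # input weight and recurrent weight (toy values)

def scan(frames):
    h, states = 0.0, []
    for x in frames:
        h = math.tanh(W * x + U * h)
        states.append(h)
    return states

def brnn(frames):
    fwd = scan(frames)              # state at moment i reflects frames 1..i
    bwd = scan(frames[::-1])[::-1]  # state at moment i reflects frames i..n
    return list(zip(fwd, bwd))      # joint input to the output layer

states = brnn([1.0, -2.0, 0.5])
print(len(states))  # 3, one (forward, backward) pair per moment
```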
  • Take the case where the audio data includes sub-audio data at n moments as an example.
  • the sub-audio data is input to the i-th processing module in the first speech recognition model and the i-th processing module in the second speech recognition model, which obtain the first recognition result and the second recognition result respectively.
  • the first recognition result is determined based on the sub-audio data from the first moment to the i-th moment in the audio to be recognized, and the second recognition result is based on The sub-audio data from the i-th moment to the n-th moment in the audio to be recognized is determined.
  • the applicant of this application conceived a voice recognition method in which the calculation dimension of the first speech recognition model is greater than the calculation dimension of the second speech recognition model, so that the calculation time of the first speech recognition model matches the calculation time of the second speech recognition model, which can effectively improve the real-time performance of speech recognition.
  • the voice recognition method in the embodiment of the present application can be applied to the application scenario shown in FIG. 1, and the application scenario includes the terminal device 101 and the voice server 102.
  • the terminal device 101 and the voice server 102 are connected through a wireless or wired network.
  • the terminal device 101 includes but is not limited to smart devices such as smart speakers, smart watches, and smart home appliances; smart robots; AI customer service and bank credit card reminder phone systems; and electronic devices with voice interaction functions such as smart phones, mobile computers, and tablet computers.
  • the voice server 102 may provide related voice services, such as voice recognition, voice synthesis, and other services.
  • the voice server 102 may be a server, a server cluster composed of several servers, or a cloud computing center.
  • the user 10 interacts with the terminal device 101, and the terminal device 101 sends the voice data input by the user 10 to the voice server 102.
  • the voice server 102 performs voice recognition processing and semantic analysis processing on the voice data sent by the terminal device 101, determines the corresponding voice recognition text according to the semantic analysis result, and sends the voice recognition text to the terminal device 101; the terminal device 101 displays the text or executes the instructions corresponding to it.
  • an embodiment of the present application provides a voice recognition method.
  • the process of the method can be executed by a voice recognition device.
  • the method is applied to a voice recognition system, and the voice recognition system at least includes a first voice recognition model and a second voice recognition model; the first voice recognition model has n processing modules, each module has an input terminal and a corresponding output terminal, and the second voice recognition model has n processing modules, each module has an input terminal and a corresponding output terminal.
  • a voice recognition system is first introduced by way of example.
  • the voice recognition system includes a first voice recognition model and a second voice recognition model.
  • the first speech recognition model and the second speech recognition model each have n processing modules, and each processing module has an input terminal and an output terminal.
  • the sub-audio data is input into the corresponding processing module of the first speech recognition model, and the sub-audio data is also input into the corresponding processing module of the second speech recognition model for processing.
  • the voice recognition method in the embodiment of the present invention includes:
  • Step S301: Acquire audio data to be recognized, where the audio data to be recognized is composed of sub-audio data at n moments, where n is greater than or equal to 1.
  • the audio data to be recognized is composed of sub-audio data at n moments.
  • For example, the audio data to be recognized is a 20-second piece of audio data, and the 20-second audio data can be divided into 20 moments; that is, the audio data in every 1-second interval is regarded as one piece of sub-audio data. Each piece of sub-audio data has a time sequence, so the audio data to be recognized corresponds to 20 pieces of sub-audio data in order.
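The chunking in this example could be sketched as follows; the sample rate and chunk length are illustrative assumptions, not values from the patent:

```python
def split_into_sub_audio(samples, sample_rate, seconds_per_chunk=1):
    """Split raw samples into ordered per-moment sub-audio chunks."""
    step = sample_rate * seconds_per_chunk
    return [samples[i:i + step] for i in range(0, len(samples), step)]

# 20 s of silent 16 kHz audio -> 20 ordered 1-second sub-audio chunks.
samples = [0.0] * (20 * 16000)
chunks = split_into_sub_audio(samples, sample_rate=16000)
print(len(chunks), len(chunks[0]))  # 20 16000
```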
  • Step S302: For the sub-audio data at the i-th moment, input the sub-audio data into the i-th processing module of the first speech recognition model and the i-th processing module of the second speech recognition model to obtain a first recognition result and a second recognition result respectively, where the first recognition result is determined based on the sub-audio data from the 1st moment to the i-th moment in the audio to be recognized, and the second recognition result is determined based on the sub-audio data from the i-th moment to the n-th moment in the audio to be recognized; each processing module in the first speech recognition model corresponds to the sub-audio data at one moment, and each processing module in the second speech recognition model corresponds to the sub-audio data at one moment; the calculation time of the first speech recognition model matches the calculation time of the second speech recognition model, the calculation dimension of the first speech recognition model is greater than the calculation dimension of the second speech recognition model, and i is determined according to the calculation dimension of the first speech recognition model and the calculation dimension of the second speech recognition model.
  • the sub-audio data at each moment is input into the corresponding processing module of the first voice recognition model and the corresponding processing module of the second voice recognition model, and the corresponding results are obtained respectively.
  • the processing direction of the first speech recognition model is opposite to that of the second speech recognition model.
  • For example, the first speech recognition model and the second speech recognition model each have 10 processing modules, and the audio data to be recognized is 10 seconds of voice data, so every second of voice data is input to the corresponding processing module of each model; the input of each model's second processing module is the voice data of the 2nd second.
  • the second processing module of the first speech recognition model determines the processing result for the 2nd second of voice data according to the processing result for the 1st second of voice data and the 2nd second of voice data itself.
  • the second processing module of the second voice recognition model determines its processing result for the 2nd second of voice data according to the third processing module's result for the 3rd second of voice data and the 2nd second of voice data itself; the third processing module of the second voice recognition model likewise depends on the fourth processing module's result for the 4th second and the 3rd second of voice data, and so on. The processing result of the second processing module of the second speech recognition model is therefore determined based on the processing results of the tenth through third processing modules of the second speech recognition model and the 2nd second of voice data.
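The backward dependency chain described above can be sketched as a fold from the last module to the first. The `step` function here is a toy stand-in, not the patent's actual processing module:

```python
# Fold from the last module to the first: module i's result depends on
# module (i+1)'s result and moment i's own data, so module 1's result
# reflects every moment from 1 to n.
def second_model_results(chunks, step):
    results, nxt = [None] * len(chunks), None
    for i in reversed(range(len(chunks))):   # module n, n-1, ..., 1
        nxt = step(chunks[i], nxt)
        results[i] = nxt
    return results

# Toy step: count how many moments of context each module has seen.
step = lambda data, nxt: 1 + (nxt or 0)
print(second_model_results(list(range(10)), step))
# [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
```

With 10 one-second chunks, module 1 ends up with context from all 10 seconds, while module 10 has seen only the last second, mirroring the text's example.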
  • the calculation dimension of the first speech recognition model is greater than the calculation dimension of the second speech recognition model; that is, the calculation time of the first speech recognition model is longer, and the calculation time of the second speech recognition model is shorter.
  • when the i-th processing module of the first speech recognition model has calculated its output result, the n-th through i-th processing modules of the second speech recognition model have also calculated their output results, so the recognition result of the audio data to be recognized can be determined in real time.
  • if the difference between the time when the first speech recognition model calculates the first recognition result and the time when the second speech recognition model calculates the second recognition result is less than the preset threshold, it can be considered that the recognition result of the audio data to be recognized can be determined in real time.
  • For example, the text information corresponding to the audio data to be recognized is "I and you are good friends"; each character of the original phrase corresponds to the sub-audio data of one moment ("I", "and", "you", "are", "good", and the two characters of "friend").
  • the sub-audio data at the (i+1)-th moment is input to the (i+1)-th processing module of the first speech recognition model to obtain the first recognition result, and the second recognition result is obtained; the first recognition result is determined based on the sub-audio data from the 1st moment to the (i+1)-th moment in the audio to be recognized, and the second recognition result was determined in the process of determining the text recognition result of the sub-audio data at the i-th moment; the text recognition result of the sub-audio data at the (i+1)-th moment is determined according to the first recognition result and the second recognition result.
  • the second speech recognition model has already obtained the recognition results from the n-th moment to the i-th moment, so the overall recognition result can be determined as soon as the recognition result of the first speech recognition model is available.
  • In some embodiments, the calculation dimensions of the processing modules in the first speech recognition model differ: the calculation dimensions of the (i+1)-th through n-th processing modules are smaller than those of the first through i-th processing modules. Reducing the calculation dimension of these later processing modules can speed up the calculation of the first speech recognition model and improve real-time performance.
  • the calculation dimension of the first speech recognition model and the calculation dimension of the second speech recognition model can be understood as the parameter quantity of each model, or as the size of the calculation matrix with which each model participates in the calculation.
  • If the calculation dimension refers to the parameter quantity, the parameter quantity of the first speech recognition model is greater than the parameter quantity of the second speech recognition model; for example, the parameter quantity of the first speech recognition model is 1000 and the parameter quantity of the second speech recognition model is 500.
  • If the calculation dimension refers to the size of the calculation matrix, for example, the calculation matrix of the first speech recognition model is 1000*1000 and the calculation matrix of the second speech recognition model is 500*500, so the calculation dimension of the first speech recognition model is greater than that of the second speech recognition model.
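Under the matrix-size interpretation, the gap between the two example dimensions can be made concrete with a rough multiply count. This is schematic only, counting one multiplication per matrix entry in a matrix-vector product:

```python
# One matrix-vector product costs roughly dim*dim multiplications, so a
# 1000x1000 calculation matrix does 4x the work of a 500x500 one.
def matvec_mults(dim):
    return dim * dim

first_model = matvec_mults(1000)    # 1,000,000 multiplications
second_model = matvec_mults(500)    # 250,000 multiplications
print(first_model // second_model)  # 4
```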
  • Step S303: Determine the text recognition result of the sub-audio data at the i-th moment according to the first recognition result and the second recognition result.
  • the text recognition result of the sub-audio data is determined according to the weight of the first recognition result and the weight of the second recognition result; the two weights can be the same or different, and can be set according to recognition accuracy requirements or scenario requirements.
  • For example, the voice recognition method is applied to a conference scenario; in the conference scenario, the speech of the participants needs to be recorded and displayed on a screen.
  • the BRNN model is used for speech recognition.
  • the BRNN model includes two recognition models, namely a first recognition model and a second recognition model.
  • the first recognition model includes N processing modules, and the second recognition model includes N processing modules; the speech content of the participants is determined by the processing results of each processing module of the first recognition model and the processing results of each processing module of the second recognition model.
  • the first recognition model in the BRNN processes in the order: first processing module, second processing module, third processing module, ..., N-th processing module; the second recognition model in the BRNN processes in the order: N-th processing module, (N-1)-th processing module, ..., first processing module.
  • the calculation dimension of the first recognition model is greater than the calculation dimension of the second recognition model.
  • the speech content of each participant is collected through the microphone of the audio collection device, and then the speech content is input into the BRNN model to obtain the recognition result, and the recognition result is displayed on the display screen.
  • an embodiment of the present invention provides a voice recognition device 500; the device 500 is applied to a voice recognition system, and the voice recognition system includes at least a first voice recognition model and a second voice recognition model; the first speech recognition model has n processing modules, each module has an input terminal and a corresponding output terminal, and the second speech recognition model has n processing modules, each module has an input terminal and a corresponding output terminal. The device 500 includes:
  • the acquiring unit 501 is configured to acquire audio data to be recognized, where the audio data to be recognized is composed of sub-audio data at n moments, where n is greater than or equal to 1;
  • the calculation unit 502 is configured to, for the sub-audio data at the i-th moment, input the sub-audio data into the i-th processing module of the first speech recognition model and the i-th processing module of the second speech recognition model to obtain a first recognition result and a second recognition result respectively,
  • where the first recognition result is determined based on the sub-audio data from the 1st moment to the i-th moment in the audio to be recognized, and the second recognition result is determined based on the sub-audio data from the i-th moment to the n-th moment in the audio to be recognized; each processing module in the first speech recognition model corresponds to the sub-audio data at one moment, and each processing module in the second speech recognition model corresponds to the sub-audio data at one moment; the calculation time of the first speech recognition model matches the calculation time of the second speech recognition model, the calculation dimension of the first speech recognition model is greater than the calculation dimension of the second speech recognition model, and i is determined according to the calculation dimension of the first speech recognition model and the calculation dimension of the second speech recognition model, where i belongs to n;
  • the result determining unit 503 is configured to determine the text recognition result of the sub audio data at the i-th moment according to the first recognition result and the second recognition result.
  • the calculation unit 502 is specifically configured to:
  • match the calculation time of the first speech recognition model with the calculation time of the second speech recognition model, where the matching includes:
  • the difference between the time at which the first speech recognition model calculates the first recognition result and the time at which the second speech recognition model calculates the second recognition result being smaller than a preset threshold.
  • the calculation unit 502 is further configured to:
  • for the sub-audio data at the (i+1)-th moment, input the sub-audio data into the (i+1)-th processing module of the first speech recognition model to obtain a first recognition result, and obtain a second recognition result, where the first recognition result is determined based on the sub-audio data from the 1st moment to the (i+1)-th moment in the audio to be recognized, and the second recognition result is determined in the process of determining the text recognition result of the sub-audio data at the i-th moment;
  • the result determining unit is further configured to:
  • determine the text recognition result of the sub-audio data at the (i+1)-th moment according to the first recognition result and the second recognition result.
  • the result determining unit 503 is specifically configured to:
  • determine the recognition result of the sub-audio data according to the weight of the first recognition result and the weight of the second recognition result.
  • an embodiment of the present application provides a computer device. As shown in FIG. 6, it includes at least one processor 601 and a memory 602 connected to the at least one processor.
  • the embodiment of the present application does not limit the specific connection medium between the processor 601 and the memory 602; the bus connection between the processor 601 and the memory 602 in FIG. 6 is taken as an example.
  • the bus can be divided into an address bus, a data bus, a control bus, and so on.
  • the memory 602 stores instructions that can be executed by at least one processor 601, and the at least one processor 601 can execute the steps included in the aforementioned voice recognition method by executing the instructions stored in the memory 602.
  • the processor 601 is the control center of the computer device; it can use various interfaces and lines to connect the various parts of the terminal device, and obtain the client address by running or executing the instructions stored in the memory 602 and calling the data stored in the memory 602.
  • the processor 601 may include one or more processing units, and the processor 601 may integrate an application processor and a modem processor.
  • the application processor mainly handles the operating system, user interface, and application programs.
  • the modem processor mainly handles wireless communication. It can be understood that the foregoing modem processor may alternatively not be integrated into the processor 601.
  • in some embodiments, the processor 601 and the memory 602 may be implemented on the same chip; in other embodiments, they may be implemented on separate chips.
  • the processor 601 may be a general-purpose processor, such as a central processing unit (CPU), a digital signal processor, an application-specific integrated circuit (ASIC), a field programmable gate array or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and can implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of the present application.
  • the general-purpose processor may be a microprocessor or any conventional processor or the like. The steps of the method disclosed in the embodiments of the present application may be directly embodied as being executed and completed by a hardware processor, or executed and completed by a combination of hardware and software modules in the processor.
  • the memory 602, as a non-volatile computer-readable storage medium, can be used to store non-volatile software programs, non-volatile computer-executable programs, and modules.
  • the memory 602 may include at least one type of storage medium, such as a flash memory, a hard disk, a multimedia card, a card-type memory, a random access memory (RAM), a static random access memory (SRAM), a programmable read-only memory (PROM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a magnetic memory, a magnetic disk, or an optical disc.
  • the memory 602 may also be any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto.
  • the memory 602 in the embodiments of the present application may also be a circuit or any other device capable of realizing a storage function, for storing program instructions and/or data.
  • the embodiments of the present application provide a computer-readable storage medium that stores a computer program executable by a computer device.
  • when the program runs on the computer device, the computer device executes the steps of the aforementioned voice recognition method.
  • the embodiment of the present invention also provides a computer program product. The computer program product includes a computer program stored on a non-transitory computer-readable storage medium; the computer program includes program instructions that, when executed by a computer, cause the computer to execute any of the methods described above.
  • this application can be provided as a method, a system, or a computer program product. Therefore, this application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, this application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
  • these computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing equipment to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device.
  • the instruction device implements the functions specified in one or more processes in the flowchart and/or one or more blocks in the block diagram.
  • these computer program instructions may also be loaded onto a computer or other programmable data processing equipment, so that a series of operational steps are executed on the computer or other programmable equipment to produce computer-implemented processing, and the instructions executed on the computer or other programmable equipment provide steps for implementing the functions specified in one or more processes in the flowchart and/or one or more blocks in the block diagram.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

A voice recognition method and device, relating to the field of information technology. Said method comprises: acquiring audio data to be recognized, the audio data to be recognized consisting of sub-audio data at n moments, n being greater than or equal to 1 (301); for the sub-audio data at an ith moment, inputting the sub-audio data into an ith processing module in a first voice recognition model and an ith processing module in a second voice recognition model, so as to obtain a first recognition result and a second recognition result respectively, a computing time of the first voice recognition model matching a computing time of the second voice recognition model, and the computing dimension of the first voice recognition model being greater than the computing dimension of the second voice recognition model (302); and determining a text recognition result of the sub-audio data at the ith moment according to the first recognition result and the second recognition result (303). The present invention improves the real-time performance of voice recognition.

Description

Voice recognition method and device
Cross-reference to related applications
This application claims the priority of the Chinese patent application filed with the Chinese Patent Office on September 12, 2019, with application number 201910865885.4 and the application name "A voice recognition method and device", the entire content of which is incorporated into this application by reference.
Technical field
The embodiments of the present invention relate to the field of information technology, and in particular, to a voice recognition method and device.
Background
With the development of communication technology and the popularization of smart terminals, various network communication tools have become one of the main tools of public communication. Because voice information is convenient to produce and transmit, voice has become the main form of information carried by these tools. Using such tools also involves converting voice information into text; this process is voice recognition technology.
Voice recognition technology enables machines to convert voice information into corresponding text or commands through processes of recognition and understanding. When deep learning methods are used for voice recognition, the recognition result must be determined from both the voice information at the current moment and the context information of the current moment. However, because the computation time for the voice information at the current moment does not match the computation time for the context information, the output of voice recognition results in the prior art lags and cannot meet real-time requirements.
Summary of the invention
The embodiments of the present invention provide a voice recognition method and device, which can match the computation time of the voice information at the current moment with the computation time of the context information, meeting real-time requirements.
In one aspect, an embodiment of the present invention provides a voice recognition method. The method is applied to a voice recognition system that includes at least a first speech recognition model and a second speech recognition model. The first speech recognition model has n processing modules, each with one input terminal and a corresponding output terminal, and the second speech recognition model has n processing modules, each with one input terminal and a corresponding output terminal. The method includes:
acquiring audio data to be recognized, where the audio data to be recognized is composed of sub-audio data at n moments, n being greater than or equal to 1;
for the sub-audio data at the i-th moment, inputting the sub-audio data into the i-th processing module of the first speech recognition model and the i-th processing module of the second speech recognition model to obtain a first recognition result and a second recognition result respectively, where the first recognition result is determined from the sub-audio data from the 1st moment to the i-th moment of the audio to be recognized, and the second recognition result is determined from the sub-audio data from the i-th moment to the n-th moment of the audio to be recognized; each processing module of the first speech recognition model corresponds to the sub-audio data of one moment, as does each processing module of the second speech recognition model; the computation time of the first speech recognition model matches the computation time of the second speech recognition model, and the computation dimension of the first speech recognition model is greater than the computation dimension of the second speech recognition model; i is determined according to the computation dimension of the first speech recognition model and the computation dimension of the second speech recognition model, and i belongs to n;
determining the text recognition result of the sub-audio data at the i-th moment according to the first recognition result and the second recognition result.
Optionally, for the sub-audio data at the i-th moment, inputting the sub-audio data into the i-th processing module of the first speech recognition model and the i-th processing module of the second speech recognition model to obtain the first recognition result and the second recognition result respectively includes:
inputting the sub-audio data at the 1st moment into the 1st processing module of the first speech recognition model to obtain the first recognition result of the sub-audio data at the 1st moment; using the first recognition result of the sub-audio data at the 1st moment and the sub-audio data at the 2nd moment as the input of the 2nd processing module of the first speech recognition model to obtain the first recognition result of the sub-audio data at the 2nd moment; using the first recognition result of the sub-audio data at the 2nd moment and the sub-audio data at the 3rd moment as the input of the 3rd processing module of the first speech recognition model to obtain the first recognition result of the sub-audio data at the 3rd moment; and so on, until the first recognition result of the sub-audio data at the i-th moment is obtained;
inputting the sub-audio data at the n-th moment into the n-th processing module of the second speech recognition model to obtain the second recognition result of the sub-audio data at the n-th moment; using the second recognition result of the sub-audio data at the n-th moment and the sub-audio data at the (n-1)-th moment as the input of the (n-1)-th processing module of the second speech recognition model to obtain the second recognition result of the sub-audio data at the (n-1)-th moment; using the second recognition result of the sub-audio data at the (n-1)-th moment and the sub-audio data at the (n-2)-th moment as the input of the (n-2)-th processing module of the second speech recognition model to obtain the second recognition result of the sub-audio data at the (n-2)-th moment; and so on, until the second recognition result of the sub-audio data at the i-th moment is obtained.
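The chained forward and backward passes just described can be sketched as follows. This is a minimal illustration only: the function names are assumptions, and each module's processing is stubbed out as string concatenation so the data flow is visible; it is not the patent's actual implementation.

```python
# Minimal sketch of the two chained passes (illustrative only).

def forward_pass(frames, i):
    """First model: module t feeds module t+1, so the result at step i
    depends on the sub-audio frames from moment 1 to moment i."""
    state = ""
    for t in range(i):                # frames[0] .. frames[i-1]
        state = state + frames[t]
    return state

def backward_pass(frames, i):
    """Second model: module t feeds module t-1, so the result at step i
    depends on the sub-audio frames from moment i to moment n."""
    state = ""
    for t in range(len(frames) - 1, i - 2, -1):  # frames[n-1] .. frames[i-1]
        state = frames[t] + state
    return state

frames = ["a", "b", "c", "d", "e"]    # n = 5 sub-audio frames
print(forward_pass(frames, 3))        # → abc  (moments 1..3)
print(backward_pass(frames, 3))       # → cde  (moments 3..5)
```

Together, the two results give step i both its past context (the first recognition result) and its future context (the second recognition result), mirroring the two chains above.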
Optionally, the computation time of the first speech recognition model matching the computation time of the second speech recognition model includes:
the difference between the time at which the first speech recognition model computes the first recognition result and the time at which the second speech recognition model computes the second recognition result being smaller than a preset threshold.
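The threshold condition can be expressed directly. The function name, the default threshold, and the timing values below are invented for illustration; the patent only specifies that the difference must be below a preset threshold.

```python
# Hedged sketch of the "matched computation time" test: the two models'
# times to produce their results count as matched when the difference is
# below a preset threshold (all values here are illustrative).

def times_match(t_first, t_second, threshold=0.05):
    """True when |t_first - t_second| is smaller than the preset threshold."""
    return abs(t_first - t_second) < threshold

print(times_match(0.120, 0.145))  # difference 0.025 s < 0.05 s → True
```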
Optionally, the method further includes:
for the sub-audio data at the (i+1)-th moment, inputting the sub-audio data into the (i+1)-th processing module of the first speech recognition model to obtain the first recognition result, and acquiring the second recognition result, where the first recognition result is determined from the sub-audio data from the 1st moment to the (i+1)-th moment of the audio to be recognized, and the second recognition result was determined during the process of determining the text recognition result of the sub-audio data at the i-th moment;
determining the text recognition result of the sub-audio data at the (i+1)-th moment according to the first recognition result and the second recognition result.
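One way to read this step is that the backward pass has already produced the second recognition result for moment i+1 by the time moment i is finished, so only the forward module runs anew. The cache-based sketch below is an assumption about how that reuse could look, not the patent's stated implementation; all names are hypothetical.

```python
# Sketch (assumed) of the streaming step: the backward pass over the whole
# utterance fills a cache, so at step i+1 only the forward model runs a new
# module and the second recognition result is a simple cache lookup.

backward_cache = {}   # moment -> second recognition result

def backward_pass_all(frames):
    """Run the second model once, caching its result for every moment."""
    state = ""
    for t in range(len(frames) - 1, -1, -1):
        state = frames[t] + state      # module t+1 feeds module t
        backward_cache[t + 1] = state  # second result for moment t+1

def step(frames, i, fwd_state):
    """Forward module i plus a cache hit for the backward result."""
    fwd_state = fwd_state + frames[i - 1]   # first model, module i
    return fwd_state, backward_cache[i]     # no recomputation needed

frames = ["a", "b", "c"]
backward_pass_all(frames)
state = ""
for i in range(1, len(frames) + 1):
    state, second = step(frames, i, state)
print(state, second)  # → abc c
```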
Optionally, determining the recognition result of the sub-audio data according to the first recognition result and the second recognition result includes:
determining the recognition result of the sub-audio data according to the weight of the first recognition result and the weight of the second recognition result.
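The weighted determination above can be sketched as a per-token blend of the two models' scores. The weights, the token vocabulary, and the score values here are invented for illustration; the patent does not specify how the weights are chosen.

```python
# Illustrative sketch of the weighted combination: scores from the first
# (past-context) and second (future-context) models are blended by fixed
# weights and the best-scoring token becomes the text result for the frame.

def fuse_results(first_probs, second_probs, w_first=0.6, w_second=0.4):
    """Weight and sum the two models' per-token scores, then pick the best."""
    fused = {tok: w_first * p + w_second * second_probs[tok]
             for tok, p in first_probs.items()}
    return max(fused, key=fused.get)

first = {"cat": 0.7, "cap": 0.2, "car": 0.1}   # first recognition result
second = {"cat": 0.4, "cap": 0.5, "car": 0.1}  # second recognition result
print(fuse_results(first, second))  # → cat
```

With different weights the outcome can flip toward the future-context model, which is the point of keeping both weights tunable.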
In one aspect, an embodiment of the present invention further provides a voice recognition device. The device is applied to a voice recognition system that includes at least a first speech recognition model and a second speech recognition model. The first speech recognition model has n processing modules, each with one input terminal and a corresponding output terminal, and the second speech recognition model has n processing modules, each with one input terminal and a corresponding output terminal. The device includes:
an acquiring unit, configured to acquire audio data to be recognized, where the audio data to be recognized is composed of sub-audio data at n moments, n being greater than or equal to 1;
a calculation unit, configured to, for the sub-audio data at the i-th moment, input the sub-audio data into the i-th processing module of the first speech recognition model and the i-th processing module of the second speech recognition model to obtain a first recognition result and a second recognition result respectively, where the first recognition result is determined from the sub-audio data from the 1st moment to the i-th moment of the audio to be recognized, and the second recognition result is determined from the sub-audio data from the i-th moment to the n-th moment of the audio to be recognized; each processing module of the first speech recognition model corresponds to the sub-audio data of one moment, as does each processing module of the second speech recognition model; the computation time of the first speech recognition model matches the computation time of the second speech recognition model, and the computation dimension of the first speech recognition model is greater than the computation dimension of the second speech recognition model; i is determined according to the computation dimension of the first speech recognition model and the computation dimension of the second speech recognition model, and i belongs to n;
a result determining unit, configured to determine the text recognition result of the sub-audio data at the i-th moment according to the first recognition result and the second recognition result.
Optionally, the calculation unit is specifically configured to:
input the sub-audio data at the 1st moment into the 1st processing module of the first speech recognition model to obtain the first recognition result of the sub-audio data at the 1st moment; use the first recognition result of the sub-audio data at the 1st moment and the sub-audio data at the 2nd moment as the input of the 2nd processing module of the first speech recognition model to obtain the first recognition result of the sub-audio data at the 2nd moment; use the first recognition result of the sub-audio data at the 2nd moment and the sub-audio data at the 3rd moment as the input of the 3rd processing module of the first speech recognition model to obtain the first recognition result of the sub-audio data at the 3rd moment; and so on, until the first recognition result of the sub-audio data at the i-th moment is obtained;
input the sub-audio data at the n-th moment into the n-th processing module of the second speech recognition model to obtain the second recognition result of the sub-audio data at the n-th moment; use the second recognition result of the sub-audio data at the n-th moment and the sub-audio data at the (n-1)-th moment as the input of the (n-1)-th processing module of the second speech recognition model to obtain the second recognition result of the sub-audio data at the (n-1)-th moment; use the second recognition result of the sub-audio data at the (n-1)-th moment and the sub-audio data at the (n-2)-th moment as the input of the (n-2)-th processing module of the second speech recognition model to obtain the second recognition result of the sub-audio data at the (n-2)-th moment; and so on, until the second recognition result of the sub-audio data at the i-th moment is obtained.
Optionally, the computation time of the first speech recognition model matching the computation time of the second speech recognition model includes:
the difference between the time at which the first speech recognition model computes the first recognition result and the time at which the second speech recognition model computes the second recognition result being smaller than a preset threshold.
Optionally, the calculation unit is further configured to:
for the sub-audio data at the (i+1)-th moment, input the sub-audio data into the (i+1)-th processing module of the first speech recognition model to obtain the first recognition result, and acquire the second recognition result, where the first recognition result is determined from the sub-audio data from the 1st moment to the (i+1)-th moment of the audio to be recognized, and the second recognition result was determined during the process of determining the text recognition result of the sub-audio data at the i-th moment;
The result determining unit is further configured to:
determine the text recognition result of the sub-audio data at the (i+1)-th moment according to the first recognition result and the second recognition result.
Optionally, the result determining unit is specifically configured to:
determine the recognition result of the sub-audio data according to the weight of the first recognition result and the weight of the second recognition result.
In one aspect, an embodiment of the present invention further provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of any of the above voice recognition methods when executing the program.
In one aspect, an embodiment of the present invention further provides a computer-readable storage medium storing a computer program executable by a computer device; when the program runs on the computer device, the computer device is caused to execute the steps of any of the above voice recognition methods.
An embodiment of the present invention provides a computer program product. The computer program product includes a computer program stored on a non-transitory computer-readable storage medium; the computer program includes program instructions that, when executed by a computer, cause the computer to execute any of the above voice recognition methods.
In the embodiments of the present invention, the first recognition result of the first speech recognition model is determined from the sub-audio data from the 1st moment to the i-th moment, so the first speech recognition model can be regarded as processing the output of the current moment; the second recognition result is determined from the sub-audio data from the i-th moment to the n-th moment of the audio to be recognized, so the second speech recognition model can be regarded as processing context information. Because the computation dimension of the first speech recognition model is greater than that of the second speech recognition model, by the time the first speech recognition model has computed up to the sub-audio data at the i-th moment, the second speech recognition model has also computed down to the sub-audio data at the i-th moment. In this way, the computation times of the first recognition result and the second recognition result are matched, there is no need to wait for one result after the other has been computed, and the real-time performance of voice recognition is improved.
Brief description of the drawings
In order to explain the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative labor.
Figure 1 is an application scenario architecture diagram provided by an embodiment of the present invention;
Figure 2 is an architecture diagram of a speech recognition system provided by an embodiment of the present invention;
Figure 3 is a schematic flowchart of a voice recognition method provided by an embodiment of the present invention;
Figure 4 is a schematic diagram of an application scenario of a voice recognition method provided by an embodiment of the present invention;
Figure 5 is a schematic structural diagram of a voice recognition device provided by an embodiment of the present invention;
Figure 6 is a schematic structural diagram of a computer device provided by an embodiment of the present invention.
Detailed description
In order to make the objectives, technical solutions, and beneficial effects of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present invention, not to limit it.
To facilitate the understanding of the embodiments of the present invention, a few concepts are briefly introduced below:
Voice recognition technology lets machines convert voice signals into corresponding text or commands through processes of recognition and understanding, using voice signal processing and pattern recognition so that machines automatically recognize and understand human spoken language. Voice recognition is a broad interdisciplinary field, closely related to acoustics, phonetics, linguistics, information theory, pattern recognition theory, and neurobiology. Three methods are commonly used in voice recognition: template matching, stochastic models, and probabilistic grammar analysis; deep learning and machine learning methods are also commonly used.
Machine learning is a multi-field interdisciplinary subject, involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other subjects. It specializes in how computers simulate or realize human learning behavior in order to acquire new knowledge or skills, and reorganize existing knowledge structures to continuously improve their own performance. As an example, an HMM (Hidden Markov Model) may be used for voice recognition.
Deep learning learns the internal laws and representation levels of sample data; the information obtained in this learning process greatly helps the interpretation of data such as text, images, and sound. Its ultimate goal is to give machines human-like analytical learning ability, able to recognize data such as text, images, and sound. Deep learning is a complex machine learning approach whose results in speech and image recognition far exceed previous related technologies.
Deep learning has achieved many results in search technology, data mining, machine learning, machine translation, natural language processing, multimedia learning, speech, recommendation and personalization technology, and other related fields. Deep learning enables machines to imitate human activities such as seeing, hearing, and thinking, solving many complex pattern recognition problems and bringing great progress to artificial-intelligence-related technologies. As an example, a neural network model from deep learning can be used for voice recognition.
BRNN (bidirectional recurrent neural network) is a deep learning method in which each training sequence is processed both forward and backward by two separate recurrent neural networks (RNNs), both of which are connected to the same output layer. This structure provides the output layer with complete past and future context information for every point in the input sequence.
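The bidirectional recurrence described above can be sketched as follows. This is a minimal illustrative toy, not the model of the embodiments: it uses one-dimensional tanh cells with fixed weights (`w_x`, `w_h`) and a simple sum as the "output layer", all of which are assumptions chosen only to show how each output combines full past context (forward pass) with full future context (backward pass).

```python
import math

def rnn_pass(xs, w_x, w_h, reverse=False):
    """A toy one-dimensional tanh RNN, run forward or backward over the sequence."""
    order = range(len(xs) - 1, -1, -1) if reverse else range(len(xs))
    h, hs = 0.0, [0.0] * len(xs)
    for t in order:
        h = math.tanh(w_x * xs[t] + w_h * h)
        hs[t] = h
    return hs

def brnn_outputs(xs):
    """Each output sees the full past (forward pass) and full future (backward pass)."""
    fwd = rnn_pass(xs, w_x=0.5, w_h=0.8, reverse=False)
    bwd = rnn_pass(xs, w_x=0.5, w_h=0.8, reverse=True)
    return [f + b for f, b in zip(fwd, bwd)]  # "output layer": here a simple sum

outs = brnn_outputs([1.0, -1.0, 0.5, 0.2])
print(len(outs))  # one output per input time step
```

Note that the output at the first time step already depends on the last input, which is exactly why a naive BRNN cannot emit results until the backward pass has consumed the whole sequence.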
In practice, the applicant of this application found that in speech recognition there is usually context information, but the data processed when handling context differs from the data processed in real time. Take audio data to be recognized that consists of sub-audio data at n times as an example. For the sub-audio data at the i-th time, the sub-audio data is input into the i-th processing module of the first speech recognition model and the i-th processing module of the second speech recognition model, yielding a first recognition result and a second recognition result respectively. The first recognition result is determined from the sub-audio data from the 1st time to the i-th time of the audio to be recognized, and the second recognition result is determined from the sub-audio data from the i-th time to the n-th time. When i is among the first few times of n, the computation time of the first speech recognition model is short while that of the second speech recognition model is long, so the first model has already determined its result while the second model has not, which cannot meet the real-time requirement. Likewise, when i is among the last few times of n, the computation time of the second model is short while that of the first model is long, so the second model has already determined its result while the first model has not, which again cannot meet the real-time requirement.
Based on the above shortcomings of the prior art, the applicant of this application conceived a speech recognition method in which the computation dimension of the first speech recognition model is greater than that of the second speech recognition model, so that the computation time of the first speech recognition model matches the computation time of the second speech recognition model, effectively improving the real-time performance of speech recognition.
The speech recognition method in the embodiments of the present application can be applied to the application scenario shown in FIG. 1, which includes a terminal device 101 and a voice server 102. The terminal device 101 and the voice server 102 are connected through a wireless or wired network. The terminal device 101 includes, but is not limited to, smart devices such as smart speakers, smart watches, and smart home appliances; smart robots, AI customer service, and bank credit-card reminder telephone systems; and electronic devices with voice interaction functions such as smartphones, mobile computers, and tablet computers. The voice server 102 may provide related voice services, such as speech recognition and speech synthesis; it may be a single server, a server cluster composed of several servers, or a cloud computing center.
In one possible application scenario, the user 10 interacts with the terminal device 101, and the terminal device 101 sends the voice data input by the user 10 to the voice server 102. The voice server 102 performs speech recognition and semantic parsing on the voice data sent by the terminal device 101, determines the corresponding recognized text according to the semantic parsing result, and sends the recognized text to the terminal device 101, which displays it or executes the instruction corresponding to the recognized text.
It should be noted that the architecture diagram in the embodiments of the present application is intended to illustrate the technical solutions of the embodiments more clearly and does not constitute a limitation on them; for other application-scenario architectures and business applications, the technical solutions provided in the embodiments of the present application are equally applicable to similar problems.
Based on the application scenario shown in FIG. 1, an embodiment of the present application provides a speech recognition method whose flow may be executed by a speech recognition apparatus. The method is applied to a speech recognition system that includes at least a first speech recognition model and a second speech recognition model. The first speech recognition model has n processing modules, each with an input and a corresponding output, and the second speech recognition model likewise has n processing modules, each with an input and a corresponding output. To explain the speech recognition method in the embodiments of the present invention, a speech recognition system is first introduced by way of example. As shown in FIG. 2, the system includes the first speech recognition model and the second speech recognition model, each with n processing modules, each module having an input and an output. For each piece of sub-audio data, the sub-audio data is input into the corresponding processing module of the first speech recognition model and into the corresponding processing module of the second speech recognition model for processing.
The speech recognition method in the embodiment of the present invention, as shown in FIG. 3, includes:
Step S301: Acquire audio data to be recognized, the audio data to be recognized being composed of sub-audio data at n times, where n is greater than or equal to 1.
Specifically, in the embodiment of the present invention, the audio data to be recognized is composed of sub-audio data at n times. For example, if the audio data to be recognized is a 20-second segment of audio, it can be divided into 20 times, with the audio data of each 1-second interval serving as one piece of sub-audio data. The pieces of sub-audio data are in temporal order, so the audio data to be recognized corresponds to 20 sequentially ordered pieces of sub-audio data.
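The splitting described above can be sketched as follows. The function name, the 16 kHz sample rate, and the one-second chunk length are illustrative assumptions; the embodiment only requires that the audio be divided into sub-audio data in time order.

```python
def split_into_sub_audio(samples, sample_rate, chunk_seconds=1.0):
    """Split raw samples into fixed-length sub-audio chunks, preserving time order."""
    chunk = int(sample_rate * chunk_seconds)
    return [samples[i:i + chunk] for i in range(0, len(samples), chunk)]

# 20 seconds of (dummy) 16 kHz audio -> n = 20 one-second sub-audio chunks.
audio = [0.0] * (16000 * 20)
chunks = split_into_sub_audio(audio, 16000)
print(len(chunks), len(chunks[0]))  # 20 16000
```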
Step S302: For the sub-audio data at the i-th time, input the sub-audio data into the i-th processing module of the first speech recognition model and the i-th processing module of the second speech recognition model to obtain a first recognition result and a second recognition result respectively. The first recognition result is determined from the sub-audio data from the 1st time to the i-th time of the audio to be recognized, and the second recognition result is determined from the sub-audio data from the i-th time to the n-th time. Each processing module of the first speech recognition model corresponds to the sub-audio data of one time, and each processing module of the second speech recognition model corresponds to the sub-audio data of one time. The computation time of the first speech recognition model matches the computation time of the second speech recognition model, the computation dimension of the first speech recognition model is greater than that of the second speech recognition model, and i is determined according to the computation dimension of the first speech recognition model and the computation dimension of the second speech recognition model, i belonging to n.
Specifically, in the embodiment of the present invention, when recognizing the audio data to be recognized, the sub-audio data of each time is input into the corresponding processing module of the first speech recognition model and the corresponding processing module of the second speech recognition model, and the corresponding results are obtained respectively.
The processing direction of the first speech recognition model is opposite to that of the second speech recognition model. Take i as the 2nd time as an example: the first speech recognition model and the second speech recognition model each have 10 processing modules, and the audio data to be recognized is 10 s of voice data, so the voice data of each second is input into the corresponding processing module.
For the second processing module of the first speech recognition model and the second processing module of the second speech recognition model, the input data of both modules is the voice data of the 2nd second. The second processing module of the first speech recognition model determines the processing result for the 2nd second from the processing result for the 1st second together with the voice data of the 2nd second. The second processing module of the second speech recognition model determines the processing result for the 2nd second from the third processing module's result for the 3rd second together with the voice data of the 2nd second; the third processing module in turn obtains its result from the voice data of the 3rd second and the fourth processing module's result for the 4th second, and so on. Thus the processing result of the second processing module of the second speech recognition model is determined from the results of the tenth through third processing modules of the second speech recognition model together with the voice data of the 2nd second.
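The two opposite module chains described above can be sketched as follows. The `step` function is a deliberately trivial placeholder (summation) standing in for a processing module; only the direction of propagation, where module i of the forward chain depends on module i-1 and module i of the backward chain depends on module i+1, reflects the text.

```python
def forward_results(chunks, step):
    """Module i of the first model uses module i-1's result plus chunk i."""
    results, prev = [], None
    for x in chunks:
        prev = step(prev, x)
        results.append(prev)
    return results

def backward_results(chunks, step):
    """Module i of the second model uses module i+1's result plus chunk i."""
    results, prev = [None] * len(chunks), None
    for i in range(len(chunks) - 1, -1, -1):
        prev = step(prev, chunks[i])
        results[i] = prev
    return results

step = lambda prev, x: (prev or 0) + x  # placeholder "processing"
chunks = [1, 2, 3, 4]
print(forward_results(chunks, step))   # [1, 3, 6, 10]
print(backward_results(chunks, step))  # [10, 9, 7, 4]
```

At position i = 1 (the 2nd time), the forward result 3 depends only on chunks 1-2, while the backward result 9 depends on chunks 2-4, mirroring how the two models partition past and future context.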
In the embodiment of the present invention, in order for the output of the first speech recognition model and the output of the second speech recognition model to be determined at the same time, so that the overall output can be determined in real time, the computation dimension of the first speech recognition model is greater than that of the second speech recognition model. That is, the computation time of the first speech recognition model is longer and that of the second speech recognition model is shorter; at the i-th time, when the i-th processing module of the first speech recognition model has computed its output, the n-th through i-th processing modules of the second speech recognition model have also computed their outputs, so the recognition result of the audio data to be recognized can be determined in real time.
In an optional embodiment, if the difference between the time at which the first speech recognition model obtains the first recognition result and the time at which the second speech recognition model obtains the second recognition result is less than a preset threshold, the recognition result of the audio data to be recognized can be considered to be determined in real time.
That is, in the embodiment of the present invention, there may be a small time difference between the time at which the first speech recognition model obtains the first recognition result and the time at which the second speech recognition model obtains the second recognition result, without affecting the real-time performance of the recognition result.
In an optional embodiment, in order to output the recognition result as soon as possible, the earlier the time i, the better; for example, i may be the first time or the second time, so that after the audio data to be recognized is input, part of its recognition result can be output quickly.
Exemplarily, in the embodiment of the present invention, the text information corresponding to the audio data to be recognized is "我和你是好朋友" ("you and I are good friends"), where each of the characters "我", "和", "你", "是", "好", "朋", and "友" corresponds to the sub-audio data of one time.
The sub-audio data of each time is input into the respective processing modules of the first speech recognition model and the respective processing modules of the second speech recognition model. By the time the first processing module of the first speech recognition model has parsed "我", the other processing modules of the second speech recognition model have already processed "友", "朋", "好", "是", "你", "和", and "我", so the recognition result "我" can be displayed directly; then, after the second processing module of the first speech recognition model parses "和", the recognition result "和" can also be displayed quickly, thereby achieving real-time display of the recognition results.
In an optional embodiment, for the sub-audio data at the (i+1)-th time, the sub-audio data is input into the (i+1)-th processing module of the first speech recognition model to obtain a first recognition result, and a second recognition result is acquired; the first recognition result is determined from the sub-audio data from the 1st time to the (i+1)-th time of the audio to be recognized, and the second recognition result was already determined during the determination of the text recognition result of the sub-audio data at the i-th time. The text recognition result of the sub-audio data at the (i+1)-th time is then determined from the first recognition result and the second recognition result.
That is, once the first speech recognition model and the second speech recognition model are matched at the i-th time, the second speech recognition model has already obtained all the recognition results from the n-th time to the i-th time, so the overall recognition result can be determined by simply waiting for the remaining recognition results of the first speech recognition model.
In an optional embodiment, the computation dimensions of the processing modules in the first speech recognition model differ: the computation dimension of the (i+1)-th through n-th processing modules is smaller than that of the 1st through i-th processing modules, which speeds up the computation of the first speech recognition model and improves real-time performance.
In the embodiment of the present invention, the computation dimensions of the first and second speech recognition models may be understood as each model's parameter count, or as the size of the computation matrices each model uses. Exemplarily, if the computation dimension refers to the parameter count, then the parameter count of the first speech recognition model is greater than that of the second speech recognition model; for example, the first speech recognition model has 1000 parameters and the second speech recognition model has 500.
In another optional embodiment, the computation dimension of the first speech recognition model is a 1000×1000 matrix and that of the second speech recognition model is a 500×500 matrix, so the computation dimension of the first speech recognition model is greater than that of the second speech recognition model.
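A rough illustration of why the larger computation dimension implies longer per-module computation time, assuming (as a simplification not stated in the source) that each module's cost is dominated by one square matrix-matrix product:

```python
def matmul_flops(dim):
    """Multiply-accumulate operation count for a dim x dim by dim x dim product."""
    return 2 * dim ** 3

# Per-module cost ratio between the two models' computation dimensions.
ratio = matmul_flops(1000) / matmul_flops(500)
print(ratio)  # 8.0 -> the 1000x1000 module costs about 8x the 500x500 module
```

Under this simplification, choosing i so that i large-dimension modules of the first model finish in about the same time as n-i+1 small-dimension modules of the second model is what makes the two models' computation times match.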
Step S303: Determine the text recognition result of the sub-audio data at the i-th time according to the first recognition result and the second recognition result.
In the embodiment of the present invention, after the first speech recognition model has determined the first recognition result and the second speech recognition model has determined the second recognition result, the recognition result of the sub-audio data is determined according to the weight of the first recognition result and the weight of the second recognition result. The weights may be the same or different, and may be set according to the required recognition accuracy or the scenario.
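The weighted combination described above can be sketched as follows. Representing each recognition result as per-label scores is an assumption for illustration (the source does not specify the result format), and the example labels and scores are invented:

```python
def combine_results(first_scores, second_scores, w_first=0.5, w_second=0.5):
    """Pick the label whose weighted combination of the two models' scores is highest."""
    combined = {
        label: w_first * first_scores[label] + w_second * second_scores[label]
        for label in first_scores
    }
    return max(combined, key=combined.get)

first = {"和": 0.7, "河": 0.3}   # hypothetical forward-context model scores
second = {"和": 0.4, "河": 0.6}  # hypothetical backward-context model scores
print(combine_results(first, second))            # equal weights -> "和"
print(combine_results(first, second, 0.2, 0.8))  # weights favor second model -> "河"
```

Shifting the weights toward one model changes which result dominates, which is why the text says the weights may be tuned to the accuracy requirements or the scenario.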
To better explain the embodiments of the present application, a speech recognition method provided by the embodiments is described below in conjunction with a specific implementation scenario, as shown in FIG. 4. In the embodiment of the present invention, the speech recognition method is applied to a conference scenario, in which the speech of the participants needs to be recorded and displayed on a screen.
In the embodiment of the invention, a BRNN model is used for speech recognition. The BRNN model includes two recognition models, namely a first recognition model and a second recognition model, each including N processing modules; the speech content of the participants is determined from the processing results of the processing modules of the first recognition model together with those of the second recognition model. In the embodiment of the present invention, the first recognition model in the BRNN processes in the order 1st processing module, 2nd processing module, 3rd processing module, ..., N-th processing module, while the second recognition model processes in the order N-th processing module, (N-1)-th processing module, ..., 1st processing module. The computation dimension of the first recognition model is greater than that of the second recognition model.
In the embodiment of the present invention, the speech of each participant is collected through a microphone serving as the audio capture device, the speech content is then input into the BRNN model to obtain the recognition result, and the recognition result is displayed on the display screen.
Based on the above embodiments, and referring to FIG. 5, an embodiment of the present invention provides a speech recognition apparatus 500. The apparatus 500 is applied to a speech recognition system that includes at least a first speech recognition model and a second speech recognition model; the first speech recognition model has n processing modules, each with an input and a corresponding output, and the second speech recognition model has n processing modules, each with an input and a corresponding output. The apparatus 500 includes:
an acquisition unit 501, configured to acquire audio data to be recognized, the audio data to be recognized being composed of sub-audio data at n times, where n is greater than or equal to 1;
a computation unit 502, configured to, for the sub-audio data at the i-th time, input the sub-audio data into the i-th processing module of the first speech recognition model and the i-th processing module of the second speech recognition model to obtain a first recognition result and a second recognition result respectively, the first recognition result being determined from the sub-audio data from the 1st time to the i-th time of the audio to be recognized, and the second recognition result being determined from the sub-audio data from the i-th time to the n-th time; each processing module of the first speech recognition model corresponding to the sub-audio data of one time, and each processing module of the second speech recognition model corresponding to the sub-audio data of one time; the computation time of the first speech recognition model matching the computation time of the second speech recognition model, the computation dimension of the first speech recognition model being greater than that of the second speech recognition model, and i being determined according to the computation dimensions of the first and second speech recognition models, i belonging to n; and
a result determination unit 503, configured to determine the text recognition result of the sub-audio data at the i-th time according to the first recognition result and the second recognition result.
Optionally, the computation unit 502 is specifically configured to:
input the sub-audio data of the 1st time into the 1st processing module of the first speech recognition model to obtain the first recognition result of the sub-audio data of the 1st time; use the first recognition result of the sub-audio data of the 1st time and the sub-audio data of the 2nd time as input data of the 2nd processing module of the first speech recognition model to obtain the first recognition result of the sub-audio data of the 2nd time; use the first recognition result of the sub-audio data of the 2nd time and the sub-audio data of the 3rd time as input data of the 3rd processing module of the first speech recognition model to obtain the first recognition result of the sub-audio data of the 3rd time; and so on, to obtain the first recognition result of the sub-audio data of the i-th time; and
input the sub-audio data of the n-th time into the n-th processing module of the second speech recognition model to obtain the second recognition result of the sub-audio data of the n-th time; use the second recognition result of the sub-audio data of the n-th time and the sub-audio data of the (n-1)-th time as input data of the (n-1)-th processing module of the second speech recognition model to obtain the second recognition result of the sub-audio data of the (n-1)-th time; use the second recognition result of the sub-audio data of the (n-1)-th time and the sub-audio data of the (n-2)-th time as input data of the (n-2)-th processing module of the second speech recognition model to obtain the second recognition result of the sub-audio data of the (n-2)-th time; and so on, to obtain the second recognition result of the sub-audio data of the i-th time.
Optionally, the matching of the computation time of the first speech recognition model with the computation time of the second speech recognition model includes:
the difference between the time at which the first speech recognition model obtains the first recognition result and the time at which the second speech recognition model obtains the second recognition result being less than a preset threshold.
Optionally, the computation unit 502 is further configured to:
for the sub-audio data at the (i+1)-th time, input the sub-audio data into the (i+1)-th processing module of the first speech recognition model to obtain a first recognition result, and acquire a second recognition result, the first recognition result being determined from the sub-audio data from the 1st time to the (i+1)-th time of the audio to be recognized, and the second recognition result having been determined during the determination of the text recognition result of the sub-audio data at the i-th time.
The result determination unit is further configured to:
determine the text recognition result of the sub-audio data at the (i+1)-th time according to the first recognition result and the second recognition result.
Optionally, the result determination unit 503 is specifically configured to:
determine the recognition result of the sub-audio data according to the weight of the first recognition result and the weight of the second recognition result.
Based on the same technical concept, an embodiment of the present application provides a computer device, as shown in FIG. 6, including at least one processor 601 and a memory 602 connected to the at least one processor. The embodiment of the present application does not limit the specific connection medium between the processor 601 and the memory 602; in FIG. 6, a bus connection between the processor 601 and the memory 602 is taken as an example. The bus may be divided into an address bus, a data bus, a control bus, and so on.
In the embodiment of the present application, the memory 602 stores instructions executable by the at least one processor 601, and by executing the instructions stored in the memory 602, the at least one processor 601 can perform the steps included in the aforementioned speech recognition method.
The processor 601 is the control center of the computer device and can use various interfaces and lines to connect the various parts of the terminal device; by running or executing the instructions stored in the memory 602 and calling the data stored in the memory 602, it obtains the client address. Optionally, the processor 601 may include one or more processing units, and may integrate an application processor, which mainly handles the operating system, user interface, and application programs, and a modem processor, which mainly handles wireless communication. It can be understood that the modem processor may also not be integrated into the processor 601. In some embodiments, the processor 601 and the memory 602 may be implemented on the same chip; in other embodiments, they may be implemented on separate chips.
处理器601可以是通用处理器，例如中央处理器（CPU）、数字信号处理器、专用集成电路（Application Specific Integrated Circuit，ASIC）、现场可编程门阵列或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件，可以实现或者执行本申请实施例中公开的各方法、步骤及逻辑框图。通用处理器可以是微处理器或者任何常规的处理器等。结合本申请实施例所公开的方法的步骤可以直接体现为硬件处理器执行完成，或者用处理器中的硬件及软件模块组合执行完成。The processor 601 may be a general-purpose processor, such as a central processing unit (CPU), a digital signal processor, an application-specific integrated circuit (ASIC), a field programmable gate array or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of the present application. The general-purpose processor may be a microprocessor or any conventional processor. The steps of the methods disclosed in the embodiments of the present application may be directly embodied as being executed and completed by a hardware processor, or executed and completed by a combination of hardware and software modules in the processor.
存储器602作为一种非易失性计算机可读存储介质，可用于存储非易失性软件程序、非易失性计算机可执行程序以及模块。存储器602可以包括至少一种类型的存储介质，例如可以包括闪存、硬盘、多媒体卡、卡型存储器、随机访问存储器（Random Access Memory，RAM）、静态随机访问存储器（Static Random Access Memory，SRAM）、可编程只读存储器（Programmable Read Only Memory，PROM）、只读存储器（Read Only Memory，ROM）、带电可擦除可编程只读存储器（Electrically Erasable Programmable Read-Only Memory，EEPROM）、磁性存储器、磁盘、光盘等等。存储器602是能够用于携带或存储具有指令或数据结构形式的期望的程序代码并能够由计算机存取的任何其他介质，但不限于此。本申请实施例中的存储器602还可以是电路或者其它任意能够实现存储功能的装置，用于存储程序指令和/或数据。As a non-volatile computer-readable storage medium, the memory 602 can be used to store non-volatile software programs, non-volatile computer-executable programs, and modules. The memory 602 may include at least one type of storage medium, for example, flash memory, a hard disk, a multimedia card, a card-type memory, random access memory (RAM), static random access memory (SRAM), programmable read-only memory (PROM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), magnetic memory, a magnetic disk, an optical disc, and so on. The memory 602 is, but is not limited to, any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory 602 in the embodiments of the present application may also be a circuit or any other device capable of realizing a storage function, for storing program instructions and/or data.
基于相同的技术构思，本申请实施例提供了一种计算机可读存储介质，其存储有可由计算机设备执行的计算机程序，当所述程序在计算机设备上运行时，使得所述计算机设备执行语音识别方法的步骤。Based on the same technical concept, an embodiment of the present application provides a computer-readable storage medium that stores a computer program executable by a computer device. When the program runs on the computer device, the computer device is caused to execute the steps of the voice recognition method.
本发明实施例还提供一种计算机程序产品，所述计算机程序产品包括存储在非暂态计算机可读存储介质上的计算机程序，所述计算机程序包括程序指令，当所述程序指令被计算机执行时，使所述计算机执行上述任一所述方法。An embodiment of the present invention also provides a computer program product. The computer program product includes a computer program stored on a non-transitory computer-readable storage medium, and the computer program includes program instructions that, when executed by a computer, cause the computer to execute any of the foregoing methods.
本领域内的技术人员应明白，本申请的实施例可提供为方法、系统、或计算机程序产品。因此，本申请可采用完全硬件实施例、完全软件实施例、或结合软件和硬件方面的实施例的形式。而且，本申请可采用在一个或多个其中包含有计算机可用程序代码的计算机可用存储介质（包括但不限于磁盘存储器、CD-ROM、光学存储器等）上实施的计算机程序产品的形式。Those skilled in the art should understand that the embodiments of the present application may be provided as a method, a system, or a computer program product. Therefore, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, and the like) containing computer-usable program code.
本申请是参照根据本申请的方法、设备（系统）、和计算机程序产品的流程图和/或方框图来描述的。应理解可由计算机程序指令实现流程图和/或方框图中的每一流程和/或方框、以及流程图和/或方框图中的流程和/或方框的结合。可提供这些计算机程序指令到通用计算机、专用计算机、嵌入式处理机或其他可编程数据处理设备的处理器以产生一个机器，使得通过计算机或其他可编程数据处理设备的处理器执行的指令产生用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的装置。The present application is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to the embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
这些计算机程序指令也可存储在能引导计算机或其他可编程数据处理设备以特定方式工作的计算机可读存储器中，使得存储在该计算机可读存储器中的指令产生包括指令装置的制造品，该指令装置实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能。These computer program instructions may also be stored in a computer-readable memory that can direct a computer or another programmable data processing device to work in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus, and the instruction apparatus implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
这些计算机程序指令也可装载到计算机或其他可编程数据处理设备上，使得在计算机或其他可编程设备上执行一系列操作步骤以产生计算机实现的处理，从而在计算机或其他可编程设备上执行的指令提供用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的步骤。These computer program instructions may also be loaded onto a computer or another programmable data processing device, such that a series of operation steps are executed on the computer or other programmable device to produce computer-implemented processing, and the instructions executed on the computer or other programmable device thereby provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
显然，本领域的技术人员可以对本申请进行各种改动和变型而不脱离本申请的精神和范围。这样，倘若本申请的这些修改和变型属于本申请权利要求及其等同技术的范围之内，则本申请也意图包含这些改动和变型在内。Obviously, those skilled in the art can make various changes and modifications to the present application without departing from the spirit and scope of the present application. Thus, if these modifications and variations of the present application fall within the scope of the claims of the present application and their technical equivalents, the present application is also intended to encompass these modifications and variations.

Claims (13)

  1. 一种语音识别方法，其特征在于，所述方法应用于语音识别系统，所述语音识别系统至少包括第一语音识别模型以及第二语音识别模型，所述第一语音识别模型具有n个处理模块，每个模块具有一个输入端以及对应的输出端，所述第二语音识别模型具有n个处理模块，每个模块具有一个输入端以及对应的输出端，所述方法包括：A speech recognition method, characterized in that the method is applied to a speech recognition system, the speech recognition system comprising at least a first speech recognition model and a second speech recognition model, wherein the first speech recognition model has n processing modules, each with one input terminal and a corresponding output terminal, and the second speech recognition model has n processing modules, each with one input terminal and a corresponding output terminal, the method comprising:
    获取待识别音频数据，所述待识别音频数据由n个时刻的子音频数据构成，其中n大于等于1；acquiring audio data to be recognized, where the audio data to be recognized is composed of sub-audio data at n time instants, where n is greater than or equal to 1;
    针对第i个时刻的子音频数据，将所述子音频数据输入至第一语音识别模型中的第i个处理模块以及第二语音识别模型中的第i个处理模块，分别得到第一识别结果以及第二识别结果，所述第一识别结果是根据所述待识别音频中第1个时刻到第i个时刻的子音频数据确定的，所述第二识别结果是根据所述待识别音频中第i个时刻到第n个时刻的子音频数据确定的，所述第一语音识别模型中的每个处理模块对应一个时刻的子音频数据，所述第二语音识别模型中的每个处理模块对应一个时刻的子音频数据，所述第一语音识别模型的计算时间与所述第二语音识别模型的计算时间匹配，所述第一语音识别模型的计算维度大于所述第二语音识别模型的计算维度，i是根据所述第一语音识别模型的计算维度与所述第二语音识别模型的计算维度确定的，i属于n；for the sub-audio data at the i-th time instant, inputting the sub-audio data into the i-th processing module in the first speech recognition model and the i-th processing module in the second speech recognition model to obtain a first recognition result and a second recognition result respectively, where the first recognition result is determined based on the sub-audio data from the 1st time instant to the i-th time instant in the audio to be recognized, and the second recognition result is determined based on the sub-audio data from the i-th time instant to the n-th time instant in the audio to be recognized; each processing module in the first speech recognition model corresponds to the sub-audio data of one time instant, and each processing module in the second speech recognition model corresponds to the sub-audio data of one time instant; the computation time of the first speech recognition model matches the computation time of the second speech recognition model; the computation dimension of the first speech recognition model is greater than the computation dimension of the second speech recognition model; i is determined according to the computation dimension of the first speech recognition model and the computation dimension of the second speech recognition model, and i belongs to n;
    根据所述第一识别结果以及所述第二识别结果确定所述第i个时刻的子音频数据的文本识别结果。determining the text recognition result of the sub-audio data at the i-th time instant according to the first recognition result and the second recognition result.
  2. 根据权利要求1所述的方法，其特征在于，针对第i个时刻的子音频数据，将所述子音频数据输入至第一语音识别模型中的第i个处理模块以及第二语音识别模型中的第i个处理模块，分别得到第一识别结果以及第二识别结果，包括：The method according to claim 1, characterized in that, for the sub-audio data at the i-th time instant, inputting the sub-audio data into the i-th processing module in the first speech recognition model and the i-th processing module in the second speech recognition model to obtain the first recognition result and the second recognition result respectively comprises:
    将第1时刻的子音频数据输入至所述第一语音识别模型中的第1个处理模块，得到第1时刻的子音频数据的第一识别结果，将所述第1时刻的子音频数据的第一识别结果以及第2时刻的子音频数据作为所述第一语音识别模型中的第2个处理模块的输入数据，得到第2时刻的子音频数据的第一识别结果，将所述第2时刻的子音频数据的第一识别结果以及第3时刻的子音频数据作为所述第一语音识别模型中的第3个处理模块的输入数据，得到第3时刻的子音频数据的第一识别结果，以此类推得到第i时刻的子音频数据的第一识别结果；inputting the sub-audio data at the 1st time instant into the 1st processing module in the first speech recognition model to obtain the first recognition result of the sub-audio data at the 1st time instant; using the first recognition result of the sub-audio data at the 1st time instant and the sub-audio data at the 2nd time instant as the input data of the 2nd processing module in the first speech recognition model to obtain the first recognition result of the sub-audio data at the 2nd time instant; using the first recognition result of the sub-audio data at the 2nd time instant and the sub-audio data at the 3rd time instant as the input data of the 3rd processing module in the first speech recognition model to obtain the first recognition result of the sub-audio data at the 3rd time instant; and so on, to obtain the first recognition result of the sub-audio data at the i-th time instant;
    将第n时刻的子音频数据输入至所述第二语音识别模型中的第n个处理模块，得到第n时刻的子音频数据的第二识别结果，将所述第n时刻的子音频数据的第二识别结果以及第n-1时刻的子音频数据作为所述第二语音识别模型中的第n-1个处理模块的输入数据，得到第n-1时刻的子音频数据的第二识别结果，将所述第n-1时刻的子音频数据的第二识别结果以及第n-2时刻的子音频数据作为所述第二语音识别模型中的第n-2个处理模块的输入数据，得到第n-2时刻的子音频数据的第二识别结果，以此类推得到第i时刻的子音频数据的第二识别结果。inputting the sub-audio data at the n-th time instant into the n-th processing module in the second speech recognition model to obtain the second recognition result of the sub-audio data at the n-th time instant; using the second recognition result of the sub-audio data at the n-th time instant and the sub-audio data at the (n-1)-th time instant as the input data of the (n-1)-th processing module in the second speech recognition model to obtain the second recognition result of the sub-audio data at the (n-1)-th time instant; using the second recognition result of the sub-audio data at the (n-1)-th time instant and the sub-audio data at the (n-2)-th time instant as the input data of the (n-2)-th processing module in the second speech recognition model to obtain the second recognition result of the sub-audio data at the (n-2)-th time instant; and so on, to obtain the second recognition result of the sub-audio data at the i-th time instant.
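The forward and backward recurrences recited above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the patented implementation: `forward_step`, `backward_step`, and `fuse` are hypothetical placeholders for the processing modules of the first and second speech recognition models and for the result combination, whose internals the claims do not specify.

```python
# Hypothetical sketch of the claimed scheme: the first model's i-th module
# sees sub-audio 1..i (left-to-right), the second model's i-th module sees
# sub-audio i..n (right-to-left), and the two results are fused per time instant.

def recognize(sub_audio, forward_step, backward_step, fuse):
    n = len(sub_audio)
    # First model: a left-to-right recurrence over the sub-audio chunks.
    fwd, state = [], None
    for chunk in sub_audio:
        state = forward_step(state, chunk)
        fwd.append(state)
    # Second model: a right-to-left recurrence over the same chunks.
    bwd, state = [None] * n, None
    for i in range(n - 1, -1, -1):
        state = backward_step(state, sub_audio[i])
        bwd[i] = state
    # Fuse the first and second recognition results at each time instant.
    return [fuse(f, b) for f, b in zip(fwd, bwd)]
```

With toy string-concatenating step functions, the i-th output pairs the prefix 1..i seen by the first model with the suffix i..n seen by the second model, matching the claim language.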
  3. 根据权利要求2所述的方法，其特征在于，所述第一语音识别模型的计算时间与所述第二语音识别模型的计算时间匹配，包括：The method according to claim 2, characterized in that the computation time of the first speech recognition model matching the computation time of the second speech recognition model comprises:
    所述第一语音识别模型计算得到第一识别结果的时间与所述第二语音识别模型计算得到第二识别结果的时间之间的差值小于预设阈值。the difference between the time at which the first speech recognition model computes the first recognition result and the time at which the second speech recognition model computes the second recognition result is smaller than a preset threshold.
  4. 根据权利要求1所述的方法,其特征在于,所述方法还包括:The method according to claim 1, wherein the method further comprises:
    在针对第i+1个时刻的子音频数据，将所述子音频数据输入至第一语音识别模型中的第i+1个处理模块得到第一识别结果，并获取第二识别结果，所述第一识别结果是根据所述待识别音频中第1个时刻到第i+1个时刻的子音频数据确定的，所述第二识别结果是在确定第i个时刻的子音频数据的文本识别结果的过程中确定的；for the sub-audio data at the (i+1)-th time instant, inputting the sub-audio data into the (i+1)-th processing module in the first speech recognition model to obtain the first recognition result, and acquiring the second recognition result, where the first recognition result is determined based on the sub-audio data from the 1st time instant to the (i+1)-th time instant in the audio to be recognized, and the second recognition result is determined in the process of determining the text recognition result of the sub-audio data at the i-th time instant;
    根据所述第一识别结果以及所述第二识别结果确定所述第i+1个时刻的子音频数据的文本识别结果。determining the text recognition result of the sub-audio data at the (i+1)-th time instant according to the first recognition result and the second recognition result.
  5. 根据权利要求1或4所述的方法，其特征在于，所述根据所述第一识别结果以及所述第二识别结果确定子音频数据的识别结果，包括：The method according to claim 1 or 4, characterized in that determining the recognition result of the sub-audio data according to the first recognition result and the second recognition result comprises:
    根据所述第一识别结果的权重以及所述第二识别结果的权重确定子音频数据的识别结果。determining the recognition result of the sub-audio data according to the weight of the first recognition result and the weight of the second recognition result.
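The weighted combination recited in claim 5 can be sketched as follows, under stated assumptions: the claims specify neither the form of the recognition results nor how the weights are chosen, so here each result is assumed to be a mapping from candidate labels to scores, and the weights are assumed to be fixed scalars.

```python
# Hypothetical sketch of the weighted fusion in claim 5. Each recognition
# result is assumed to be a dict of label -> score, and w_first / w_second
# are assumed fixed scalar weights summing to 1.

def fuse_results(first_result, second_result, w_first=0.6, w_second=0.4):
    labels = set(first_result) | set(second_result)
    # Weighted sum of the two models' scores for every candidate label.
    combined = {
        label: w_first * first_result.get(label, 0.0)
               + w_second * second_result.get(label, 0.0)
        for label in labels
    }
    # The text recognition result for this time instant is the top-scoring label.
    return max(combined, key=combined.get)
```

In practice the weights might reflect the relative reliability of the larger forward model versus the lighter backward model, but that choice is not fixed by the claims.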
  6. 一种语音识别装置，其特征在于，所述装置应用于语音识别系统，所述语音识别系统至少包括第一语音识别模型以及第二语音识别模型，所述第一语音识别模型具有n个处理模块，每个模块具有一个输入端以及对应的输出端，所述第二语音识别模型具有n个处理模块，每个模块具有一个输入端以及对应的输出端，所述装置包括：A speech recognition device, characterized in that the device is applied to a speech recognition system, the speech recognition system comprising at least a first speech recognition model and a second speech recognition model, wherein the first speech recognition model has n processing modules, each with one input terminal and a corresponding output terminal, and the second speech recognition model has n processing modules, each with one input terminal and a corresponding output terminal, the device comprising:
    获取单元，用于获取待识别音频数据，所述待识别音频数据由n个时刻的子音频数据构成，其中n大于等于1；an acquiring unit, configured to acquire audio data to be recognized, where the audio data to be recognized is composed of sub-audio data at n time instants, where n is greater than or equal to 1;
    计算单元，用于针对第i个时刻的子音频数据，将所述子音频数据输入至第一语音识别模型中的第i个处理模块以及第二语音识别模型中的第i个处理模块，分别得到第一识别结果以及第二识别结果，所述第一识别结果是根据所述待识别音频中第1个时刻到第i个时刻的子音频数据确定的，所述第二识别结果是根据所述待识别音频中第i个时刻到第n个时刻的子音频数据确定的，所述第一语音识别模型中的每个处理模块对应一个时刻的子音频数据，所述第二语音识别模型中的每个处理模块对应一个时刻的子音频数据，所述第一语音识别模型的计算时间与所述第二语音识别模型的计算时间匹配，所述第一语音识别模型的计算维度大于所述第二语音识别模型的计算维度，i是根据所述第一语音识别模型的计算维度与所述第二语音识别模型的计算维度确定的，i属于n；a calculation unit, configured to: for the sub-audio data at the i-th time instant, input the sub-audio data into the i-th processing module in the first speech recognition model and the i-th processing module in the second speech recognition model to obtain a first recognition result and a second recognition result respectively, where the first recognition result is determined based on the sub-audio data from the 1st time instant to the i-th time instant in the audio to be recognized, and the second recognition result is determined based on the sub-audio data from the i-th time instant to the n-th time instant in the audio to be recognized; each processing module in the first speech recognition model corresponds to the sub-audio data of one time instant, and each processing module in the second speech recognition model corresponds to the sub-audio data of one time instant; the computation time of the first speech recognition model matches the computation time of the second speech recognition model; the computation dimension of the first speech recognition model is greater than the computation dimension of the second speech recognition model; i is determined according to the computation dimension of the first speech recognition model and the computation dimension of the second speech recognition model, and i belongs to n;
    结果确定单元，用于根据所述第一识别结果以及所述第二识别结果确定所述第i个时刻的子音频数据的文本识别结果。a result determining unit, configured to determine the text recognition result of the sub-audio data at the i-th time instant according to the first recognition result and the second recognition result.
  7. 根据权利要求6所述的装置,其特征在于,所述计算单元具体用于:The device according to claim 6, wherein the calculation unit is specifically configured to:
    将第1时刻的子音频数据输入至所述第一语音识别模型中的第1个处理模块，得到第1时刻的子音频数据的第一识别结果，将所述第1时刻的子音频数据的第一识别结果以及第2时刻的子音频数据作为所述第一语音识别模型中的第2个处理模块的输入数据，得到第2时刻的子音频数据的第一识别结果，将所述第2时刻的子音频数据的第一识别结果以及第3时刻的子音频数据作为所述第一语音识别模型中的第3个处理模块的输入数据，得到第3时刻的子音频数据的第一识别结果，以此类推得到第i时刻的子音频数据的第一识别结果；input the sub-audio data at the 1st time instant into the 1st processing module in the first speech recognition model to obtain the first recognition result of the sub-audio data at the 1st time instant; use the first recognition result of the sub-audio data at the 1st time instant and the sub-audio data at the 2nd time instant as the input data of the 2nd processing module in the first speech recognition model to obtain the first recognition result of the sub-audio data at the 2nd time instant; use the first recognition result of the sub-audio data at the 2nd time instant and the sub-audio data at the 3rd time instant as the input data of the 3rd processing module in the first speech recognition model to obtain the first recognition result of the sub-audio data at the 3rd time instant; and so on, to obtain the first recognition result of the sub-audio data at the i-th time instant;
    将第n时刻的子音频数据输入至所述第二语音识别模型中的第n个处理模块，得到第n时刻的子音频数据的第二识别结果，将所述第n时刻的子音频数据的第二识别结果以及第n-1时刻的子音频数据作为所述第二语音识别模型中的第n-1个处理模块的输入数据，得到第n-1时刻的子音频数据的第二识别结果，将所述第n-1时刻的子音频数据的第二识别结果以及第n-2时刻的子音频数据作为所述第二语音识别模型中的第n-2个处理模块的输入数据，得到第n-2时刻的子音频数据的第二识别结果，以此类推得到第i时刻的子音频数据的第二识别结果。input the sub-audio data at the n-th time instant into the n-th processing module in the second speech recognition model to obtain the second recognition result of the sub-audio data at the n-th time instant; use the second recognition result of the sub-audio data at the n-th time instant and the sub-audio data at the (n-1)-th time instant as the input data of the (n-1)-th processing module in the second speech recognition model to obtain the second recognition result of the sub-audio data at the (n-1)-th time instant; use the second recognition result of the sub-audio data at the (n-1)-th time instant and the sub-audio data at the (n-2)-th time instant as the input data of the (n-2)-th processing module in the second speech recognition model to obtain the second recognition result of the sub-audio data at the (n-2)-th time instant; and so on, to obtain the second recognition result of the sub-audio data at the i-th time instant.
  8. 根据权利要求7所述的装置，其特征在于，所述第一语音识别模型的计算时间与所述第二语音识别模型的计算时间匹配，包括：The device according to claim 7, characterized in that the computation time of the first speech recognition model matching the computation time of the second speech recognition model comprises:
    所述第一语音识别模型计算得到第一识别结果的时间与所述第二语音识别模型计算得到第二识别结果的时间之间的差值小于预设阈值。the difference between the time at which the first speech recognition model computes the first recognition result and the time at which the second speech recognition model computes the second recognition result is smaller than a preset threshold.
  9. 根据权利要求6所述的装置,其特征在于,所述计算单元还用于:The device according to claim 6, wherein the calculation unit is further configured to:
    在针对第i+1个时刻的子音频数据，将所述子音频数据输入至第一语音识别模型中的第i+1个处理模块得到第一识别结果，并获取第二识别结果，所述第一识别结果是根据所述待识别音频中第1个时刻到第i+1个时刻的子音频数据确定的，所述第二识别结果是在确定第i个时刻的子音频数据的文本识别结果的过程中确定的；for the sub-audio data at the (i+1)-th time instant, input the sub-audio data into the (i+1)-th processing module in the first speech recognition model to obtain the first recognition result, and acquire the second recognition result, where the first recognition result is determined based on the sub-audio data from the 1st time instant to the (i+1)-th time instant in the audio to be recognized, and the second recognition result is determined in the process of determining the text recognition result of the sub-audio data at the i-th time instant;
    所述结果确定单元还用于：the result determining unit is further configured to:
    根据所述第一识别结果以及所述第二识别结果确定所述第i+1个时刻的子音频数据的文本识别结果。determine the text recognition result of the sub-audio data at the (i+1)-th time instant according to the first recognition result and the second recognition result.
  10. 根据权利要求6或者9所述的装置，其特征在于，所述结果确定单元具体用于：The device according to claim 6 or 9, characterized in that the result determining unit is specifically configured to:
    根据所述第一识别结果的权重以及所述第二识别结果的权重确定子音频数据的识别结果。determine the recognition result of the sub-audio data according to the weight of the first recognition result and the weight of the second recognition result.
  11. 一种计算机设备，包括存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序，其特征在于，所述处理器执行所述程序时实现权利要求1～5任一权利要求所述方法的步骤。A computer device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the program, implements the steps of the method according to any one of claims 1 to 5.
  12. 一种计算机可读存储介质，其特征在于，其存储有可由计算机设备执行的计算机程序，当所述程序在计算机设备上运行时，使得所述计算机设备执行权利要求1～5任一所述方法的步骤。A computer-readable storage medium, characterized in that it stores a computer program executable by a computer device, and when the program runs on the computer device, the computer device is caused to execute the steps of the method according to any one of claims 1 to 5.
  13. 一种计算机程序产品，其特征在于，所述计算机程序产品包括存储在非暂态计算机可读存储介质上的计算机程序，所述计算机程序包括程序指令，当所述程序指令被计算机执行时，使所述计算机执行权利要求1～5任一所述方法。A computer program product, characterized in that the computer program product comprises a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions that, when executed by a computer, cause the computer to execute the method according to any one of claims 1 to 5.
PCT/CN2019/127672 2019-09-12 2019-12-23 Voice recognition method and device WO2021047103A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910865885.4A CN110610697B (en) 2019-09-12 2019-09-12 Voice recognition method and device
CN201910865885.4 2019-09-12

Publications (1)

Publication Number Publication Date
WO2021047103A1 true WO2021047103A1 (en) 2021-03-18

Family

ID=68892748

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/127672 WO2021047103A1 (en) 2019-09-12 2019-12-23 Voice recognition method and device

Country Status (2)

Country Link
CN (1) CN110610697B (en)
WO (1) WO2021047103A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113408539A (en) * 2020-11-26 2021-09-17 腾讯科技(深圳)有限公司 Data identification method and device, electronic equipment and storage medium
CN115512693A (en) * 2021-06-23 2022-12-23 中移(杭州)信息技术有限公司 Audio recognition method, acoustic model training method, device and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106128462A (en) * 2016-06-21 2016-11-16 东莞酷派软件技术有限公司 Audio recognition method and system
US20180025721A1 (en) * 2016-07-22 2018-01-25 Google Inc. Automatic speech recognition using multi-dimensional models
CN107871496A (en) * 2016-09-23 2018-04-03 北京眼神科技有限公司 Audio recognition method and device
CN108694940A (en) * 2017-04-10 2018-10-23 北京猎户星空科技有限公司 A kind of audio recognition method, device and electronic equipment
CN109243461A (en) * 2018-09-21 2019-01-18 百度在线网络技术(北京)有限公司 Audio recognition method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN110610697A (en) 2019-12-24
CN110610697B (en) 2020-07-31


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 19945150; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 19945150; Country of ref document: EP; Kind code of ref document: A1)