CN112102815B - Speech recognition method, speech recognition device, computer equipment and storage medium - Google Patents

Speech recognition method, speech recognition device, computer equipment and storage medium

Info

Publication number
CN112102815B
CN112102815B (application CN202011265557.XA)
Authority
CN
China
Prior art keywords
language model
decoding
scores
voice data
recognized
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011265557.XA
Other languages
Chinese (zh)
Other versions
CN112102815A (en)
Inventor
袁丁
王晓明
刘云峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Zhuiyi Technology Co Ltd
Original Assignee
Shenzhen Zhuiyi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Zhuiyi Technology Co Ltd
Priority to CN202011265557.XA
Publication of CN112102815A
Application granted
Publication of CN112102815B
Status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/08 Speech classification or search
    • G10L15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/142 Hidden Markov Models [HMMs]
    • G10L15/144 Training of HMMs
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L15/26 Speech to text systems
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/008 Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
    • G10L2015/025 Phonemes, fenemes or fenones being the recognition units

Abstract

The application relates to a speech recognition method, a speech recognition device, computer equipment and a storage medium. The method comprises the following steps: acquiring voice data to be recognized and a service identifier corresponding to the voice data; extracting acoustic feature information from the voice data; decoding the acoustic feature information through a decoding model to obtain a word lattice corresponding to the voice data, the word lattice comprising candidate decoding paths and each candidate decoding path comprising a plurality of node connection arcs; calling the service language model corresponding to the service identifier and adjusting, through the service language model, the language model score on each node connection arc in the word lattice to obtain a target language model score for each arc; calculating the decoding score of each candidate decoding path from the target language model scores and the acoustic scores on its node connection arcs; determining a target decoding path according to the decoding scores; and generating a speech recognition result according to the target decoding path. The method saves labor cost.

Description

Speech recognition method, speech recognition device, computer equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a speech recognition method, apparatus, computer device, and storage medium.
Background
With the development of artificial intelligence, speech recognition technology is applied ever more widely, and users demand higher recognition accuracy. For certain specific service scenarios, the traditional way to improve accuracy is to perform speech recognition with a language model obtained by mixed-interpolation training on general text data together with text data for the specific service scenario.
However, in the traditional method the recognition accuracy depends on the interpolation weight between the general text data and the scenario-specific text data, and tuning this weight demands considerable expertise from staff, which incurs high labor cost.
Disclosure of Invention
In view of the above, it is desirable to provide a speech recognition method, apparatus, computer device and storage medium capable of saving labor cost.
A method of speech recognition, the method comprising:
acquiring voice data to be recognized and a service identifier corresponding to the voice data to be recognized;
extracting acoustic characteristic information of the voice data to be recognized;
inputting the acoustic characteristic information into a decoding model, and decoding the acoustic characteristic information through the decoding model to obtain a word lattice corresponding to the voice data to be recognized, wherein the word lattice comprises a candidate decoding path, and the candidate decoding path comprises a plurality of node connection arcs;
calling a corresponding service language model according to the service identifier, and adjusting the language model score corresponding to each node connection arc in the word lattice through the service language model to obtain a target language model score corresponding to each node connection arc in the word lattice;
calculating the decoding scores of the candidate decoding paths according to the target language model scores and the acoustic scores corresponding to the node connection arcs;
determining a target decoding path in the candidate decoding paths according to the decoding scores;
and generating a voice recognition result corresponding to the voice data to be recognized according to the target decoding path.
In one embodiment, the adjusting, by the service language model, the language model score corresponding to each node connection arc in the word lattice to obtain the target language model score corresponding to each node connection arc in the word lattice includes:
acquiring a plurality of node connection arcs of candidate decoding paths in the word lattice;
calculating a language model score corresponding to each node connection arc through the service language model;
and adjusting the language model score corresponding to the corresponding node connection arc in the word lattice according to the calculated language model score to obtain a target language model score corresponding to each node connection arc in the word lattice.
In one embodiment, the adjusting, according to the calculated language model score, the language model score corresponding to the node connection arc in the word lattice, and the obtaining of the target language model score corresponding to each node connection arc in the word lattice includes:
subtracting the language model scores corresponding to the node connection arcs from the word lattice;
and adding target language model scores corresponding to the corresponding node connection arcs in the word lattice according to the calculated language model scores.
In one embodiment, calculating the decoding scores of the candidate decoding paths according to the target language model scores and the acoustic scores corresponding to the node connecting arcs comprises:
and adding the target language model scores and the acoustic scores corresponding to the node connection arcs in the candidate decoding path to obtain the decoding scores of the candidate decoding path.
In one embodiment, before the acquiring the voice data to be recognized, the method further includes:
acquiring service text data;
preprocessing the service text data to obtain preprocessed service text data;
and training a preset language model according to the preprocessed business text data until a preset condition is met, stopping model training, and taking the trained preset language model as a business language model.
In one embodiment, the acquiring the voice data to be recognized includes:
acquiring initial voice data;
carrying out endpoint detection on the initial voice data to obtain a human voice endpoint and an environmental noise endpoint in the initial voice data;
and extracting voice data to be recognized from the initial voice data according to the human voice endpoint and the environmental noise endpoint.
A speech recognition apparatus, the apparatus comprising:
the system comprises an acquisition module, a recognition module and a processing module, wherein the acquisition module is used for acquiring voice data to be recognized and a service identifier corresponding to the voice data to be recognized;
the extraction module is used for extracting the acoustic characteristic information of the voice data to be recognized;
the decoding module is used for inputting the acoustic characteristic information into a decoding model, decoding the acoustic characteristic information through the decoding model to obtain a word lattice corresponding to the voice data to be recognized, wherein the word lattice comprises a candidate decoding path, and the candidate decoding path comprises a plurality of node connecting arcs;
the adjusting module is used for calling a corresponding service language model according to the service identifier, and adjusting the language model scores corresponding to the node connection arcs in the word lattice through the service language model to obtain the target language model scores corresponding to the node connection arcs in the word lattice;
the calculation module is used for calculating the decoding scores of the candidate decoding paths according to the target language model scores and the acoustic scores corresponding to the node connection arcs;
a determining module for determining a target decoding path among the candidate decoding paths according to the decoding scores;
and the generating module is used for generating a voice recognition result corresponding to the voice data to be recognized according to the target decoding path.
In one embodiment, the adjusting module is further configured to obtain a plurality of node connection arcs of the candidate decoding path in the word lattice; calculating a language model score corresponding to each node connection arc through the service language model; and adjusting the language model score corresponding to the corresponding node connection arc in the word lattice according to the calculated language model score to obtain a target language model score corresponding to each node connection arc in the word lattice.
In one embodiment, the adjusting module is further configured to subtract the language model score corresponding to each node connecting arc from the word lattice; and adding target language model scores corresponding to the corresponding node connection arcs in the word lattice according to the calculated language model scores.
In one embodiment, the calculation module is further configured to add the target language model score and the acoustic score corresponding to each node connection arc in the candidate decoding path to obtain the decoding score of the candidate decoding path.
A computer device comprising a memory and a processor, the memory storing a computer program operable on the processor, the processor implementing the steps in the various method embodiments described above when executing the computer program.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the respective method embodiment described above.
According to the speech recognition method, the speech recognition device, the computer equipment and the storage medium, acoustic feature information is extracted from the voice data to be recognized and decoded by the decoding model to obtain the corresponding word lattice; the service language model is then obtained through the service identifier corresponding to the voice data, the language model score on each node connection arc in the word lattice is adjusted through the service language model to obtain the target language model scores, and the decoding score of each candidate decoding path is calculated from the target language model scores and the acoustic scores on its node connection arcs. Because the decoding model already contains the general language model, once the word lattice has been decoded it suffices to adjust the language model score on each node connection arc according to the language model for the specific service scenario, i.e. the service language model; this raises the probability scores of scenario-specific text and vocabulary and makes the decoding scores of the candidate decoding paths more accurate. The target decoding path is then determined according to the decoding scores and the speech recognition result is generated from it, which effectively improves the accuracy of the recognized text in the specific service scenario. Compared with the traditional approach of mixed-interpolation training on general text data and scenario-specific text data, where the interpolation weight between the two must be calculated, no weight needs to be computed here, which lowers the expertise required of staff and saves labor cost. The service language model also requires far less training data than mixed-interpolation training, which shortens model training time and saves computing resources. In addition, the service language model can be a language model for any service scenario, which broadens the range of scenarios in which the speech recognition can be used.
Drawings
FIG. 1 is a diagram of an exemplary implementation of a speech recognition method;
FIG. 2 is a flow diagram illustrating a speech recognition method in one embodiment;
FIG. 3 is a diagram illustrating the structure of a word lattice in one embodiment;
FIG. 4 is a flowchart illustrating a step of adjusting, in an embodiment, a language model score corresponding to each node-connected arc in a word lattice by a business language model to obtain a target language model score corresponding to each node-connected arc in the word lattice;
FIG. 5 is a flowchart illustrating a step of adjusting a language model score corresponding to a connection arc of a corresponding node in a word lattice according to a calculated language model score to obtain a target language model score corresponding to a connection arc of each node in the word lattice in one embodiment;
FIG. 6 is a flowchart illustrating the steps of obtaining speech data to be recognized according to one embodiment;
FIG. 7 is a block diagram showing the structure of a speech recognition apparatus according to an embodiment;
FIG. 8 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The speech recognition method can be applied to computer equipment, which may be a terminal or a server. It should be understood that the speech recognition method provided by the present application can be applied to a terminal, to a server, or to a system comprising both, implemented through interaction between the terminal and the server.
The speech recognition method provided by the application can be applied to the application environment shown in fig. 1. Wherein the terminal 102 and the server 104 communicate via a network. The terminal 102 uploads the collected voice data to be recognized to the server 104, the server 104 extracts acoustic feature information of the voice data to be recognized after acquiring a service identifier corresponding to the voice data to be recognized, the acoustic feature information is input to a decoding model, the acoustic feature information is decoded through the decoding model, and a word lattice corresponding to the voice data to be recognized is obtained, wherein the word lattice comprises a candidate decoding path, and the candidate decoding path comprises a plurality of node connection arcs. The server 104 calls a corresponding service language model according to the service identifier, and adjusts the language model score corresponding to each node connection arc in the word lattice through the service language model to obtain the target language model score corresponding to each node connection arc in the word lattice. The server 104 calculates the decoding scores of the candidate decoding paths according to the target language model scores and the acoustic scores corresponding to the node connection arcs. The server 104 determines a target decoding path in the candidate decoding paths according to the decoding scores, and generates a voice recognition result corresponding to the voice data to be recognized according to the target decoding path. The terminal 102 may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices. The server 104 may be implemented as a stand-alone server or as a server cluster comprised of multiple servers.
For example, the speech recognition method provided by the application can be applied to various services needing speech recognition, such as banking services, insurance services, tax services, speech navigation and the like.
In an embodiment, as shown in fig. 2, a speech recognition method is provided, which is described by taking an example that the method is applied to a computer device, where the computer device may specifically be a terminal or a server in fig. 1, and includes the following steps:
step 202, acquiring voice data to be recognized and a service identifier corresponding to the voice data to be recognized.
The voice data to be recognized is voice data on which speech recognition needs to be performed, for example voice data from any speech recognition scenario, such as a loan consultation service. Speech recognition converts voice data into text data. The voice data to be recognized may be a continuous piece of speech, e.g. a passage or a sentence. The service identifier uniquely identifies the service scenario in which the service is currently being handled; the scenario may be a concrete one within service types such as banking, insurance, tax or voice navigation, for example a loan consultation service in banking. The voice data to be recognized may be collected in real time while the service is being handled: the computer device may, for instance, obtain in real time the voice data a user enters through a banking application and treat it as the voice data to be recognized, the data carrying a service identifier. In one embodiment, the computer device may instead, after receiving a speech recognition request, parse the request to obtain the voice data and the service identifier it carries, treating the voice data as the voice data to be recognized and the identifier as its corresponding service identifier.
In one embodiment, the speech data to be recognized may be obtained by performing denoising processing on the acquired initial speech data. By denoising the initial voice data, noise interference can be avoided, the effectiveness of the voice data to be recognized is improved, and the accuracy of voice recognition is improved.
Further, the computer device may segment the voice data to be recognized into a plurality of speech frames. Specifically, the computer device may preset a frame length, for example 25 ms per frame, and divide the voice data to be recognized into speech frames of that length, so that speech recognition is performed frame by frame. Because speech is only short-time stationary, dividing the voice data into speech frames ensures that each frame is a stable piece of voice data, which improves the accuracy of speech recognition.
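To make the framing concrete, the following is a minimal sketch, assuming 16 kHz mono audio in a NumPy array; the 10 ms frame shift is a typical ASR default and an assumption here, since the patent only fixes the 25 ms frame length.

```python
import numpy as np

def split_into_frames(samples: np.ndarray, sample_rate: int = 16000,
                      frame_ms: float = 25.0, shift_ms: float = 10.0) -> np.ndarray:
    """Divide a waveform into overlapping short-time frames."""
    frame_len = int(sample_rate * frame_ms / 1000)   # 400 samples at 16 kHz
    shift_len = int(sample_rate * shift_ms / 1000)   # 160 samples at 16 kHz
    n_frames = max(0, 1 + (len(samples) - frame_len) // shift_len)
    if n_frames == 0:
        return np.empty((0, frame_len))
    return np.stack([samples[i * shift_len : i * shift_len + frame_len]
                     for i in range(n_frames)])

frames = split_into_frames(np.zeros(16000))
print(frames.shape)  # (98, 400): 98 frames of 25 ms from 1 s of audio
```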
And step 204, extracting acoustic feature information of the voice data to be recognized.
And step 206, inputting the acoustic characteristic information into a decoding model, and decoding the acoustic characteristic information through the decoding model to obtain a word lattice corresponding to the voice data to be recognized, wherein the word lattice comprises a candidate decoding path, and the candidate decoding path comprises a plurality of node connection arcs.
The acoustic feature information may be at least one of Mel-Frequency Cepstral Coefficient (MFCC) features, Linear Prediction Cepstral Coefficient (LPCC) features and logarithmic energy features, or a new feature formed by computing first- and second-order differences of the static features and splicing them together. The acoustic feature information may take the form of feature vectors. The computer device may sort the acoustic feature information in temporal order, i.e. the order of the speech frames to which the acoustic feature vectors correspond, to obtain the feature vector sequence corresponding to the acoustic feature information.
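As an illustration of the feature extraction just described, the sketch below computes static MFCCs and splices on their first- and second-order differences; librosa is an assumption, since the patent does not name a feature-extraction library.

```python
import librosa
import numpy as np

def extract_acoustic_features(wav_path: str) -> np.ndarray:
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # static features
    d1 = librosa.feature.delta(mfcc, order=1)           # first-order differences
    d2 = librosa.feature.delta(mfcc, order=2)           # second-order differences
    # Splice static and dynamic features, then transpose to (time, dim):
    # the time-ordered feature vector sequence fed to the decoding model.
    return np.concatenate([mfcc, d1, d2], axis=0).T
```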
The computer device inputs the feature vector sequence corresponding to the acoustic feature information into the decoding model, which decodes it. The decoding model may be pre-trained and stored on the computer device. The decoding model, which may also be called a decoding graph, represents the entire possible language space. It fuses several knowledge sources using WFSTs (Weighted Finite-State Transducers) and may include an acoustic model, a language model, a phoneme context-dependency model and a pronunciation dictionary. The acoustic model may be denoted by H and the language model by G; here the language model is a general language model, i.e. one applicable to a wide range of situations. The phoneme context-dependency model may be denoted by C and the pronunciation dictionary by L.
For example, when the decoding model is composed of H, C, L and G, it may be called HCLG. Further, the computer device may use the OpenFst tool to compose the acoustic model, the general language model, the phoneme context-dependency model and the pronunciation dictionary into the decoding model (HCLG). The acoustic model performs acoustic recognition on the feature vector sequence and outputs a score for each speech recognition unit. A speech recognition unit is the unit at which recognition operates and may be a phoneme, a word, and so on. Phonemes are the smallest speech recognition units, divided according to the natural properties of speech and analyzed according to the articulation; the pronunciation of a word may comprise one or more phonemes: in Mandarin, for example, "ā" has a single phoneme, "ài" has two phonemes, and "dài" has three phonemes. The score of a speech recognition unit is the probability that the acoustic model outputs that unit when recognizing the feature vector sequence. The acoustic model may be based on HMMs (Hidden Markov Models), e.g. a GMM-HMM (Gaussian Mixture Model-Hidden Markov Model) or a DNN-HMM (Deep Neural Network-Hidden Markov Model).
The language model may be a general language model, a knowledge representation of language structure such as grammar and common word collocations. The general language model is used to compute the probability that a sequence of speech recognition units, e.g. a word string composed of such units, occurs in a speech signal. For example, it may be an N-gram language model (N-gram LM), which assumes that the occurrence of the Nth word depends only on the preceding N-1 words. The general language model expresses the likelihood of connections between words through language model scores: for a 2-gram model, the score attached to the previous word expresses the likelihood that it is followed by the next word, and these likelihoods can be obtained by counting over a corpus, e.g. as the ratio of the number of times the two words appear together in the corpus to the number of times the previous word appears alone. The pronunciation dictionary gives the phoneme sequence corresponding to the pronunciation of each word. The phoneme context-dependency model, also called a triphone model, takes a triphone as input and outputs a monophone, capturing the continuity between phonemes: because the pronunciation of a phoneme can be affected by the preceding and following phonemes, it is context-dependent, and feeding the current phoneme together with its preceding and following phonemes into the model removes their influence on the current phoneme.
Decoding the feature vector sequence corresponding to the acoustic feature information through the decoding model means searching the decoding model for the target decoding path most likely to correspond to the voice data to be recognized, from which the computer device can output the speech recognition result. During decoding the model predicts all possible recognition states at each moment, a recognition state being a word recognized at that moment, which yields multiple candidate decoding paths and hence a word lattice. A word lattice, also simply called a lattice, contains multiple candidate decoding paths, each passing through state nodes from a start state node to an end state node; a state node may represent the end time point of a word, and two adjacent state nodes are connected by a node connection arc. Each candidate decoding path represents one possible text recognition result for the voice data to be recognized. The structure of a word lattice is illustrated in fig. 3, where circles are state nodes and the arrows between adjacent state nodes are node connection arcs. The arcs carry the words A1, A2, B1, B2, B3, C1, C2 and C3, together with information such as the language model score, the acoustic score and the time points of each word. The acoustic score is the probability that the acoustic model outputs the speech recognition unit when recognizing the feature vector sequence; the language model score is the probability of the previous word transitioning to the next word, and may be predicted with the general language model. The computer device may compute a path score for each candidate decoding path from the acoustic and language model scores on its node connection arcs and select the target decoding path according to these path scores.
The acoustic score may be predicted by the acoustic model and expressed as a probability. When the acoustic model uses whole words as its modeling unit, i.e. the speech recognition unit is a word, a node connection arc represents one word, and its acoustic score is simply the probability the acoustic model outputs for that word; e.g. if the model outputs probability 0.2 for a word, the acoustic score of the corresponding arc may be 0.2. When the acoustic model uses phonemes as its modeling unit, one word may correspond to several phonemes, so the acoustic score of the arc for a word may be obtained from the acoustic scores the acoustic model outputs for the word's phonemes, for example as their sum: if the phonemes of a word are a and b, with output probabilities 0.1 and 0.2, the acoustic score of the word's arc may be 0.3. The language model score may be predicted with the general language model; it expresses the likelihood of connections between words, from which the probability of a word string can be derived. For example, if the node connection arc between state node 0 and state node 1 represents word A and the arc between state node 1 and state node 2 represents word B, the language model score on the arc between state nodes 0 and 1 represents the probability that the word after A is B.
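The following is a minimal sketch of the word-lattice structure described above, loosely modeled on fig. 3: state nodes joined by node connection arcs that each carry a word, an acoustic score and a language model score. The words and scores are illustrative assumptions, not values from the patent.

```python
from dataclasses import dataclass

@dataclass
class Arc:
    src: int              # start state node
    dst: int              # end state node (end time point of the word)
    word: str
    acoustic_score: float
    lm_score: float       # general-language-model score assigned at decode time

@dataclass
class Lattice:
    start: int
    final: int
    arcs: list

    def arcs_from(self, node: int) -> list:
        return [a for a in self.arcs if a.src == node]

# A toy lattice with two words per position, so each combination of arcs
# from node 0 to node 3 is one candidate decoding path.
lattice = Lattice(start=0, final=3, arcs=[
    Arc(0, 1, "A1", 0.5, 0.3), Arc(0, 1, "A2", 0.4, 0.2),
    Arc(1, 2, "B1", 0.4, 0.5), Arc(1, 2, "B2", 0.3, 0.3),
    Arc(2, 3, "C1", 0.2, 0.2), Arc(2, 3, "C2", 0.3, 0.4),
])
```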
And step 208, calling a corresponding service language model according to the service identification, and adjusting the language model score corresponding to each node connection arc in the word lattice through the service language model to obtain the target language model score corresponding to each node connection arc in the word lattice.
The computer device stores the service language model in advance. A service language model is a language model for a specific service scenario; the scenario may fall under any service type, such as banking, insurance, tax or voice navigation, and the model can be obtained by training on the text data for the specific service identifier. For example, the service language model may be an N-gram language model (N-gram LM), which assumes that the occurrence of the Nth word depends only on the preceding N-1 words.
When speech recognition is performed on the voice data to be recognized, decoding is constrained by the decoding model; since the decoding model contains the general language model, the recognition states obtained on the word lattice depend on the output of the general language model, and that output in turn affects the accuracy of the recognition result. In a specific service scenario, decoding the voice data with a decoding model built only from the general language model may give inaccurate results, so the computer device adjusts the language model score on each node connection arc in the word lattice through the service language model to increase the accuracy of the recognition result. Specifically, the computer device can replace the language model scores along each candidate decoding path in the word lattice with the corresponding scores computed by the service language model, which improves decoding accuracy for the voice data to be recognized and thus effectively improves recognition accuracy.
And step 210, calculating the decoding scores of the candidate decoding paths according to the scores of the target language model and the acoustic scores corresponding to the node connection arcs.
The word lattice comprises multiple candidate decoding paths, each containing several node connection arcs; an arc represents a word together with its acoustic score, language model score and the like. After the language model scores on the node connection arcs have been adjusted through the service language model, the adjusted target language model scores can be stored on the arcs, so the computer device can extract the acoustic score and target language model score from each node connection arc on a candidate decoding path and compute the path's decoding score from the acoustic scores and target language model scores of all arcs on that path, for example by adding them.
Step 212, determining a target decoding path among the candidate decoding paths according to the decoding scores.
And 214, generating a voice recognition result corresponding to the voice data to be recognized according to the target decoding path.
After computing the decoding scores of the candidate decoding paths in the word lattice, the computer device may take the candidate decoding path with the largest decoding score as the target decoding path, or select a candidate decoding path whose decoding score exceeds a threshold.
The computer device can then backtrack along the target decoding path and order the words it passes through by state-node order to obtain the corresponding speech recognition result. For example, if the target decoding path determined by the computer device is A-B-C and the words corresponding to A, B and C are "I", "want" and "installments", the resulting speech recognition result is "I want installments" (as in "I want to pay in installments").
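A sketch of target-path selection and backtracking under the maximum-score rule just described; the toy lattice mirrors the "I want installments" example, and its per-arc scores (acoustic + target language model score) are illustrative.

```python
# (src_node, dst_node, word, acoustic_score + target_lm_score)
arcs = [(0, 1, "I", 0.8), (1, 2, "want", 0.9),
        (2, 3, "installments", 1.0), (2, 3, "analysis", 0.7)]

def best_path(arcs, start=0, final=3):
    # best[node] = (highest decoding score reaching node, words along that path)
    best = {start: (0.0, [])}
    for src, dst, word, score in sorted(arcs):  # arcs ordered by source node
        if src in best:
            cand_score = best[src][0] + score
            if dst not in best or cand_score > best[dst][0]:
                best[dst] = (cand_score, best[src][1] + [word])
    return best[final]

score, words = best_path(arcs)
print(round(score, 2), " ".join(words))  # 2.7 I want installments
```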
In this embodiment, acoustic feature information is extracted from the voice data to be recognized and decoded by the decoding model to obtain the corresponding word lattice; the service language model is then obtained through the service identifier corresponding to the voice data, the language model score on each node connection arc in the word lattice is adjusted through the service language model to obtain the target language model scores, and the decoding scores of the candidate decoding paths are calculated from the target language model scores and the acoustic scores on the node connection arcs. Because the decoding model already contains the general language model, once the word lattice has been decoded it suffices to adjust the language model score on each node connection arc according to the language model for the specific service scenario, i.e. the service language model; this raises the probability scores of scenario-specific text and vocabulary and makes the decoding scores of the candidate decoding paths more accurate. The target decoding path is determined according to the decoding scores and the speech recognition result is generated from it, which effectively improves the accuracy of the recognized text in the specific service scenario. Compared with the traditional approach of mixed-interpolation training on general text data and scenario-specific text data, where the interpolation weight between the two must be calculated, no weight needs to be computed here, which lowers the expertise required of staff and saves labor cost. The service language model also requires far less training data than mixed-interpolation training, which shortens model training time and saves computing resources. In addition, the service language model can be a language model for any service scenario, which broadens the range of scenarios in which the speech recognition can be used.
In an embodiment, as shown in fig. 4, the step of adjusting, by the service language model, the language model score corresponding to each node connection arc in the word lattice to obtain the target language model score corresponding to each node connection arc in the word lattice includes:
step 402, obtaining a plurality of node connection arcs of candidate decoding paths in the word lattice.
And step 404, calculating a language model score corresponding to each node connection arc through the service language model.
And step 406, adjusting the language model score corresponding to the corresponding node connecting arc in the word lattice according to the calculated language model score to obtain a target language model score corresponding to each node connecting arc in the word lattice.
After the computer device has decoded the acoustic feature information corresponding to the voice data to be recognized through the decoding model, it can adjust the language model scores of the resulting word lattice according to the service language model corresponding to the service identifier. Specifically, the word lattice contains multiple candidate decoding paths, each comprising several node connection arcs, and the computer device first obtains the node connection arcs of each candidate decoding path. Since one arc corresponds to one word, each decoding path corresponds to a word string, and the service language model can be used to compute the probability of the word string for each candidate decoding path, the string comprising the path's words and their order. The computer device therefore inputs the word on each node connection arc into the service language model and computes the corresponding language model score, i.e. the language model score of each arc. It can then adjust the language model score on the corresponding arc in the word lattice according to the computed score to obtain the target language model score for each node connection arc.
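A sketch of step 404, assuming the service language model is a 2-gram model queried once per node connection arc; the bigram table and its probabilities are illustrative, standing in for a model trained on business text.

```python
import math

# P(next word | previous word), as estimated from business text data
bigram = {
    ("<s>", "I"): 0.30,
    ("I", "want"): 0.50,
    ("want", "installments"): 0.80,  # boosted by loan-business text
    ("want", "analysis"): 0.40,
}

def arc_lm_score(prev_word: str, word: str) -> float:
    # Language model scores on arcs are commonly kept as log-probabilities;
    # unseen bigrams fall back to a small floor probability.
    return math.log(bigram.get((prev_word, word), 1e-6))

print(arc_lm_score("want", "installments") > arc_lm_score("want", "analysis"))  # True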
In this embodiment, because the language model scores on the node connection arcs come from decoding with a model built from the general language model, recognition accuracy for vocabulary in a specific service scenario is low. Computing the language model score of each arc with the service language model and adjusting the corresponding score in the word lattice accordingly yields the target language model score of each node connection arc. Since the service language model is a language model for the specific service scenario, it can raise the language model scores of words in that scenario, improving the recognition accuracy of the corresponding words and thus effectively improving speech recognition accuracy for the scenario.
In an embodiment, as shown in fig. 5, adjusting the language model score corresponding to the node-connected arc in the word lattice according to the calculated language model score, and obtaining the target language model score corresponding to each node-connected arc in the word lattice includes:
and 502, subtracting the language model scores corresponding to the node connection arcs from the word lattice.
And step 504, adding target language model scores corresponding to the corresponding node connecting arcs in the word lattice according to the calculated language model scores.
When adjusting the language model score on the corresponding node connection arc in the word lattice according to the computed score, the computer device may delete the language model score on each arc and then add the score computed for that arc by the service language model, thereby completing the adjustment. In this embodiment, because the service language model assigns higher scores to words in the specific service scenario, deleting and re-adding the scores on the corresponding node connection arcs adjusts the word lattice quickly, so the language model scores of scenario-specific words are raised quickly and the accuracy of speech recognition improves.
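A minimal sketch of the subtract-then-add adjustment of steps 502 and 504: the general-language-model score is deleted from each node connection arc and the service-language-model score is written in its place. The dict-based arcs and the inline scoring function are simplifying assumptions.

```python
def rescore_arcs(arcs, business_lm_score):
    for arc in arcs:
        arc.pop("lm_score")  # step 502: subtract the general LM score
        # step 504: add the target LM score computed by the business LM
        arc["lm_score"] = business_lm_score(arc["prev_word"], arc["word"])
    return arcs

arcs = [{"prev_word": "want", "word": "installments", "lm_score": 0.2},
        {"prev_word": "want", "word": "analysis", "lm_score": 0.3}]
rescored = rescore_arcs(arcs, lambda prev, w: 0.8 if w == "installments" else 0.4)
print(rescored)  # "installments" now outscores "analysis"
```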
In one embodiment, calculating the decoding scores of the candidate decoding paths according to the target language model scores and the acoustic scores corresponding to the node connecting arcs comprises: and adding the target language model scores and the acoustic scores corresponding to the node connecting arcs in the candidate decoding path to obtain the decoding scores of the candidate decoding path.
The target language model score is obtained by deleting the language model score on each node connection arc in the word lattice and adding the score computed for that arc by the service language model. The computer device may add the target language model score and the acoustic score of each node connection arc on a candidate decoding path to obtain the path's decoding score; alternatively, it may accumulate the target language model scores of all arcs on the path, accumulate their acoustic scores, and add the two totals to obtain the decoding score of the candidate decoding path.
For example, in an installment-payment scenario, suppose the decoded word lattice contains two candidate decoding paths, I (0.5+0.3) want (0.4+0.5) installments (0.2+0.2) and I (0.5+0.3) want (0.4+0.5) analysis (0.3+0.4), where the parentheses give the acoustic score + language model score on each node connection arc. The computer device obtains the word strings the two paths pass through: word string 1, "I" "want" "installments", and word string 2, "I" "want" "analysis". Under the service language model, in word string 1 the score of "I" is 0.3, of "want" 0.5 and of "installments" 0.8, while in word string 2 the score of "I" is 0.3, of "want" 0.5 and of "analysis" 0.4. The computer device then rescores the candidate decoding paths: the decoding score of I (0.5+0.3) want (0.4+0.5) installments (0.2+0.8) becomes 2.7 while that of I (0.5+0.3) want (0.4+0.5) analysis (0.3+0.4) is 2.4, so the speech recognition result is "I want installments".
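A quick check of the example above: both summation orders from the earlier embodiment, summing acoustic + target LM score per arc, or accumulating the two kinds of score separately and then adding the totals, give the same decoding scores.

```python
# (acoustic score, target language model score) per node connection arc
path_installments = [(0.5, 0.3), (0.4, 0.5), (0.2, 0.8)]
path_analysis     = [(0.5, 0.3), (0.4, 0.5), (0.3, 0.4)]

def per_arc_sum(path):
    return sum(a + l for a, l in path)

def accumulate_then_add(path):
    return sum(a for a, _ in path) + sum(l for _, l in path)

for path, expected in [(path_installments, 2.7), (path_analysis, 2.4)]:
    assert abs(per_arc_sum(path) - expected) < 1e-9
    assert abs(per_arc_sum(path) - accumulate_then_add(path)) < 1e-9
```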
In this embodiment, the decoding scores of the candidate decoding paths can be calculated quickly and accurately by adding the target language model scores and the acoustic scores corresponding to the node connection arcs in the candidate decoding paths.
In one embodiment, the computer device may further prune the candidate decoding paths while computing their decoding scores. Specifically, the computer device obtains the current state node, which may represent the end time of a word, determines the candidate decoding paths that reach this node, and obtains their current decoding scores; there may be several such paths and hence several current decoding scores, each being the sum of the acoustic and language model scores of the node connection arcs traversed so far. The computer device then determines the path with the current maximum decoding score, i.e. the largest decoding score among the candidate paths at the current state node, selects the candidate paths whose current decoding score is smaller than the current maximum, and stops computing the decoding scores of the selected paths. Such paths cannot be the optimal path from the start state node to the current state node, so removing them and keeping the path with the current maximum decoding score reduces the number of candidate decoding paths and improves decoding efficiency.
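A sketch of the pruning rule just described: among the partial paths that have reached the same state node, only those at the current maximum decoding score are kept and score computation stops for the rest. The data layout is a simplifying assumption.

```python
def prune_at_node(partial_paths):
    """partial_paths: list of (decoding_score_so_far, words) for candidate
    decoding paths that have all reached the current state node."""
    current_max = max(score for score, _ in partial_paths)
    # Paths scoring below the current maximum cannot be the optimal path
    # from the start state node to this node, so they are dropped.
    return [(s, w) for s, w in partial_paths if s >= current_max]

print(prune_at_node([(2.1, ["I", "want"]), (1.7, ["I", "wonder"])]))
# [(2.1, ['I', 'want'])]
```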
In one embodiment, before acquiring the voice data to be recognized, the method further includes: acquiring service text data; preprocessing the service text data to obtain preprocessed service text data; and training the preset language model according to the preprocessed business text data until a preset condition is met, stopping model training, and taking the trained preset language model as a business language model.
Service text data is text data for a specific service scenario, e.g. a loan consultation service in banking. The computer device preprocesses the service text data, for instance by arranging it into a text file with one sentence per line. The preset language model is a scenario model constructed in advance. The computer device trains the preset language model on the preprocessed service text data until a preset condition is met, for example the loss value of the loss function no longer decreasing, then stops training and takes the trained preset language model as the service language model. For example, the computer device may train the model with an n-gram tool, inputting the preprocessed service text data and setting the recognition length; a recognition length of 4 indicates that at most 4 words are recognized at a time.
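The patent only names an n-gram tool and a recognition length, so the following count-based bigram trainer is a stand-in sketch, without the smoothing a production tool would apply; the corpus file name is hypothetical.

```python
from collections import Counter

def train_bigram_lm(corpus_path: str) -> dict:
    """Train a 2-gram business language model from a preprocessed text
    file containing one sentence per line."""
    unigrams, bigrams = Counter(), Counter()
    with open(corpus_path, encoding="utf-8") as f:
        for line in f:
            words = ["<s>"] + line.split() + ["</s>"]
            unigrams.update(words[:-1])          # counts of "previous word"
            bigrams.update(zip(words, words[1:]))
    # P(next | prev) = count(prev, next) / count(prev)
    return {pair: c / unigrams[pair[0]] for pair, c in bigrams.items()}

# model = train_bigram_lm("loan_service_text.txt")  # hypothetical file
```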
In the embodiment, the business language model is obtained through business text data training, and compared with the traditional mode in which the mixed interpolation training is performed by using the general text data and the text data corresponding to the specific business scene, the required training data amount is less, so that the model training time can be reduced, and the calculation resources can be saved.
In an embodiment, as shown in fig. 6, the step of acquiring the voice data to be recognized specifically includes:
step 602, initial voice data is obtained.
And step 604, performing endpoint detection on the initial voice data to obtain a human voice endpoint and an environmental noise endpoint in the initial voice data.
And 606, extracting the voice data to be recognized from the initial voice data according to the human voice endpoint and the environmental noise endpoint.
The voice data to be recognized may be extracted by performing endpoint detection on the initial voice data, the real-time voice stream acquired by the computer device. Endpoint detection, also called Voice Activity Detection (VAD), identifies the human-voice segments and the environmental-noise segments in a voice stream so that they can be separated; a human-voice segment is a portion of the real-time voice stream in which the speaker is talking, while an environmental-noise segment is a portion in which the target user is silent.
Specifically, the computer device performs endpoint detection on the initial voice data, processing and analyzing it to identify the endpoints of the human-voice segments and of the various environmental-noise segments. Further, the computer device may divide the initial voice data at preset time intervals and predict, through a neural network, the speech probability of the segment within each interval, for example dividing a segment every 10 ms and predicting the speech probability of each 10 ms segment. The speech probability indicates how likely the segment is to contain the user's speech: if it exceeds a probability threshold, the user is considered to have started speaking, and if it stays at or below the threshold for longer than a preset duration, the user is considered to have finished speaking. From the endpoints of the human-voice and environmental-noise segments, the human-voice segments in the initial voice data can be separated out and used as the voice data to be recognized.
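A sketch of this endpoint-detection logic, assuming one speech probability per 10 ms segment (standing in for the neural network's output); the 0.5 threshold and 300 ms silence duration are illustrative presets.

```python
def find_speech_span(speech_probs, threshold=0.5, min_silence_segments=30):
    """Return (start, end) segment indices of the detected human-voice
    span, or None; 30 segments of 10 ms = 300 ms of sustained silence."""
    start, end, silence = None, None, 0
    for i, p in enumerate(speech_probs):
        if p > threshold:            # user is (still) speaking
            if start is None:
                start = i            # user considered to start speaking here
            end, silence = i, 0
        elif start is not None:
            silence += 1
            if silence >= min_silence_segments:
                break                # user considered to have finished
    return (start, end + 1) if start is not None else None

probs = [0.1] * 10 + [0.9] * 50 + [0.1] * 40  # 1 s stream, speech in the middle
print(find_speech_span(probs))  # (10, 60)
```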
In this embodiment, performing endpoint detection on the initial voice data makes it possible to identify the beginning and end of the user's speech as well as the various environmental noises, and to separate out the voice data to be recognized. This avoids the situation in traditional speech recognition where low-energy speech at the beginning or end of an utterance is ignored and the voice information it contains is lost, and thus improves the accuracy of speech recognition.
It should be understood that, although the steps in the flowcharts of fig. 2 and figs. 4 to 6 are shown in the order indicated by the arrows, they are not necessarily performed in that order; unless explicitly stated otherwise, there is no strict ordering restriction, and the steps may be performed in other orders. Moreover, at least some of the steps in fig. 2 and figs. 4 to 6 may comprise multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times, and these sub-steps or stages are not necessarily executed sequentially but may alternate with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 7, there is provided a speech recognition apparatus including: an obtaining module 702, an extracting module 704, a decoding module 706, an adjusting module 708, a calculating module 710, a determining module 712, and a generating module 714, wherein:
an obtaining module 702, configured to obtain voice data to be recognized and a service identifier corresponding to the voice data to be recognized.
And the extracting module 704 is configured to extract acoustic feature information of the voice data to be recognized.
The decoding module 706 is configured to input the acoustic feature information into a decoding model and decode the acoustic feature information through the decoding model to obtain a word lattice corresponding to the voice data to be recognized, where the word lattice includes candidate decoding paths and each candidate decoding path includes a plurality of node connection arcs.
The adjusting module 708 is configured to invoke a corresponding service language model according to the service identifier, and adjust the language model score corresponding to each node connection arc in the word lattice through the service language model to obtain a target language model score corresponding to each node connection arc in the word lattice.
And the calculating module 710 is configured to calculate the decoding scores of the candidate decoding paths according to the target language model scores and the acoustic scores corresponding to the node connection arcs.
A determining module 712, configured to determine a target decoding path among the candidate decoding paths according to the decoding scores.
And the generating module 714 is configured to generate a speech recognition result corresponding to the speech data to be recognized according to the target decoding path.
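Taken together, the modules form a pipeline from raw audio to a recognition result. The sketch below shows one plausible wiring of that pipeline; every callable and parameter name here is an assumption made for illustration, not the patent's actual interface.

```python
from typing import Any, Callable, Dict

def recognize(
    audio: Any,
    service_id: str,
    extract_features: Callable[[Any], Any],     # extracting module (704)
    decode_to_lattice: Callable[[Any], Any],    # decoding module (706): word lattice
    rescore: Callable[[Any, Any], Any],         # adjusting module (708)
    select_target_path: Callable[[Any], Any],   # calculating + determining modules (710, 712)
    path_to_text: Callable[[Any], str],         # generating module (714)
    service_lms: Dict[str, Any],                # service identifier -> service language model
) -> str:
    features = extract_features(audio)                    # acoustic feature information
    lattice = decode_to_lattice(features)                 # candidate decoding paths
    lattice = rescore(lattice, service_lms[service_id])   # target language model scores
    return path_to_text(select_target_path(lattice))      # speech recognition result
```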
In one embodiment, the adjusting module 708 is further configured to obtain a plurality of node connection arcs of the candidate decoding paths in the word lattice; calculate a language model score corresponding to each node connection arc through the service language model; and adjust the language model score corresponding to the corresponding node connection arc in the word lattice according to the calculated language model scores to obtain the target language model score corresponding to each node connection arc in the word lattice.
In one embodiment, the adjusting module 708 is further configured to subtract the language model score corresponding to each node connection arc from the word lattice, and to add the target language model score corresponding to each node connection arc to the word lattice according to the calculated language model scores.
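A minimal sketch of this subtract-then-add adjustment is given below. The `Arc` record and the per-word `service_lm` scoring interface are simplifying assumptions; a real n-gram service language model would score each arc's word in the context of the preceding words along its path.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Arc:
    src: int               # source node of the node connection arc
    dst: int               # destination node
    word: str
    acoustic_score: float  # acoustic-model log-score on this arc
    lm_score: float        # language-model log-score currently on this arc

def rescore_lattice(arcs: List[Arc],
                    service_lm: Callable[[str], float]) -> None:
    """Replace each arc's general-LM score with the service-LM score."""
    for arc in arcs:
        general_score = arc.lm_score          # score placed by the general language model
        target_score = service_lm(arc.word)   # target language model score
        # subtract the general score from the lattice, then add the target score
        arc.lm_score += target_score - general_score
```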
In one embodiment, the calculating module 710 is further configured to add the target language model score and the acoustic score corresponding to each node connection arc in a candidate decoding path to obtain the decoding score of that candidate decoding path.
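Continuing the sketch above (and reusing its `Arc` record), the decoding score of a candidate path is then the sum of its per-arc scores, and the target decoding path is the highest-scoring candidate. In practice the best path is found by dynamic programming over the lattice rather than by enumerating paths; the enumeration here only illustrates the scoring rule.

```python
from typing import List

def decoding_score(path: List[Arc]) -> float:
    # decoding score = target language model score + acoustic score, summed over arcs
    return sum(arc.lm_score + arc.acoustic_score for arc in path)

def select_target_path(candidates: List[List[Arc]]) -> List[Arc]:
    return max(candidates, key=decoding_score)
```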
In one embodiment, the above apparatus further comprises a training module configured to acquire service text data; preprocess the service text data to obtain preprocessed service text data; and train a preset language model on the preprocessed service text data until a preset condition is met, stop the model training, and take the trained preset language model as the service language model.
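The patent leaves the form of the language model open. Purely as one illustration, the sketch below trains a toy add-one-smoothed bigram model on preprocessed service text; the preprocessing shown (lowercasing, punctuation stripping, whitespace tokenization) is likewise only an example of the preprocessing step.

```python
import math
import re
from collections import Counter
from typing import List

def preprocess(service_texts: List[str]) -> List[List[str]]:
    """Toy preprocessing: lowercase, strip punctuation, tokenize on whitespace."""
    return [re.sub(r"[^\w\s]", " ", t).lower().split() for t in service_texts]

class BigramServiceLM:
    """Add-one-smoothed bigram model standing in for the service language model."""
    def __init__(self, sentences: List[List[str]]):
        self.unigrams: Counter = Counter()
        self.bigrams: Counter = Counter()
        for sent in sentences:
            tokens = ["<s>"] + sent + ["</s>"]
            self.unigrams.update(tokens)
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.vocab_size = len(self.unigrams)

    def log_prob(self, prev: str, word: str) -> float:
        num = self.bigrams[(prev, word)] + 1                 # add-one smoothing
        den = self.unigrams[prev] + self.vocab_size
        return math.log(num / den)

# Usage: train on service-scene text, then score words during lattice rescoring.
lm = BigramServiceLM(preprocess(["balance inquiry please", "check my balance"]))
print(lm.log_prob("my", "balance"))
```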
In one embodiment, the obtaining module 702 is further configured to acquire initial voice data; carry out endpoint detection on the initial voice data to obtain the human voice endpoints and the environmental noise endpoints in the initial voice data; and extract the voice data to be recognized from the initial voice data according to the human voice endpoints and the environmental noise endpoints.
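As a small companion to the earlier endpoint-detection sketch, extracting the voice data to be recognized can then be as simple as slicing out the detected human voice segments; the frame-list representation is again an assumption.

```python
from typing import List, Tuple

def extract_voice_segments(
    frames: List[bytes],
    endpoints: List[Tuple[int, int]],   # (start, end) indices from detect_endpoints
) -> List[bytes]:
    # keep the frames inside each human voice segment and discard the
    # environmental-noise frames that lie outside the detected endpoints
    return [b"".join(frames[start:end]) for start, end in endpoints]
```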
For the specific limitations of the speech recognition apparatus, reference may be made to the limitations of the speech recognition method above, which are not repeated here. Each module in the above speech recognition apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. Each module may be embedded in hardware form in, or independent of, a processor in the computer device, or may be stored in software form in a memory of the computer device, so that the processor can invoke and execute the operations corresponding to each module.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 8. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing data of a speech recognition method. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a speech recognition method.
Those skilled in the art will appreciate that the architecture shown in fig. 8 is merely a block diagram of part of the structure related to the present disclosure and does not limit the computer devices to which the present disclosure applies; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory storing a computer program and a processor that implements the steps of the method embodiments described above when executing the computer program.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the respective embodiments described above.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments may be implemented by a computer program instructing the relevant hardware; the computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the above method embodiments. Any reference to memory, storage, a database, or another medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described, but any such combination should be considered within the scope of this specification as long as it contains no contradiction.
The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that those skilled in the art can make several variations and modifications without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A method of speech recognition, the method comprising:
acquiring voice data to be recognized and a service identifier corresponding to the voice data to be recognized;
dividing the voice data to be recognized into a plurality of voice frames according to a preset frame length;
extracting acoustic characteristic information of the plurality of voice frames;
inputting the acoustic characteristic information into a decoding model, and decoding the acoustic characteristic information through the decoding model to obtain a word lattice corresponding to the voice data to be recognized, wherein the word lattice comprises a candidate decoding path, and the candidate decoding path comprises a plurality of node connection arcs; the decoding model comprises a general language model, and the general language model is used for predicting the language model score corresponding to each node connection arc in the word lattice;
calling a corresponding service language model according to the service identifier, and adjusting the language model score corresponding to each node connection arc in the word lattice through the service language model to obtain a target language model score corresponding to each node connection arc in the word lattice, which comprises the following steps: acquiring a plurality of node connection arcs of candidate decoding paths in the word lattice; calculating a language model score corresponding to each node connection arc through the service language model; subtracting the language model scores corresponding to the node connection arcs from the word lattice; adding target language model scores corresponding to the corresponding node connection arcs in the word lattice according to the calculated language model scores; the service language model is obtained by training in advance on text data of the service scene corresponding to the service identifier;
calculating the decoding scores of the candidate decoding paths according to the target language model scores and the acoustic scores corresponding to the node connection arcs;
determining a target decoding path in the candidate decoding paths according to the decoding scores;
generating a voice recognition result corresponding to the voice data to be recognized according to the target decoding path;
in the process of calculating the decoding scores of the candidate decoding paths, acquiring a current state node, determining the candidate decoding paths reaching the current state node, acquiring the current decoding scores corresponding to the determined candidate decoding paths, selecting the candidate decoding paths whose current decoding scores are smaller than the current maximum decoding score, and stopping calculating the decoding scores of the selected candidate decoding paths.
2. The method of claim 1, wherein the obtaining voice data to be recognized comprises:
acquiring initial voice data;
and denoising the initial voice data to obtain voice data to be recognized.
3. The method of claim 1, wherein the calculating the decoding scores of the candidate decoding paths according to the target language model scores and the acoustic scores corresponding to the node connection arcs comprises:
accumulating the target language model scores corresponding to each node connection arc in each candidate decoding path, accumulating the acoustic scores corresponding to each node connection arc in each candidate decoding path, and adding the accumulated target language model scores and the accumulated acoustic scores to obtain the decoding score of the candidate decoding path.
4. The method of claim 1, wherein calculating the decoding scores of the candidate decoding paths based on the target language model scores and the acoustic scores corresponding to the node connection arcs comprises:
and adding the target language model scores and the acoustic scores corresponding to the node connection arcs in the candidate decoding path to obtain the decoding scores of the candidate decoding path.
5. The method according to any one of claims 1 to 4, wherein before the acquiring of the voice data to be recognized, the method further comprises:
acquiring service text data;
preprocessing the service text data to obtain preprocessed service text data;
and training a preset language model on the preprocessed service text data until a preset condition is met, stopping the model training, and taking the trained preset language model as the service language model.
6. The method according to any one of claims 1 to 4, wherein the acquiring voice data to be recognized comprises:
acquiring initial voice data;
carrying out endpoint detection on the initial voice data to obtain a human voice endpoint and an environmental noise endpoint in the initial voice data;
and extracting voice data to be recognized from the initial voice data according to the human voice endpoint and the environmental noise endpoint.
7. A speech recognition apparatus, characterized in that the apparatus comprises:
the obtaining module is used for acquiring voice data to be recognized and a service identifier corresponding to the voice data to be recognized;
the extraction module is used for extracting the acoustic characteristic information of the voice data to be recognized;
the decoding module is used for inputting the acoustic characteristic information into a decoding model and decoding the acoustic characteristic information through the decoding model to obtain a word lattice corresponding to the voice data to be recognized, wherein the word lattice comprises a candidate decoding path, and the candidate decoding path comprises a plurality of node connection arcs; the decoding model comprises a general language model, and the general language model is used for predicting the language model score corresponding to each node connection arc in the word lattice;
the adjusting module is used for calling a corresponding service language model according to the service identifier and adjusting the language model score corresponding to each node connection arc in the word lattice through the service language model to obtain the target language model score corresponding to each node connection arc in the word lattice, which comprises: acquiring a plurality of node connection arcs of candidate decoding paths in the word lattice; calculating a language model score corresponding to each node connection arc through the service language model; subtracting the language model scores corresponding to the node connection arcs from the word lattice; adding target language model scores corresponding to the corresponding node connection arcs in the word lattice according to the calculated language model scores; the service language model is obtained by training in advance on text data of the service scene corresponding to the service identifier;
the calculation module is used for calculating the decoding scores of the candidate decoding paths according to the target language model scores and the acoustic scores corresponding to the node connection arcs;
a determining module for determining a target decoding path among the candidate decoding paths according to the decoding scores;
the generating module is used for generating a voice recognition result corresponding to the voice data to be recognized according to the target decoding path;
the device is also used for dividing the voice data to be recognized into a plurality of voice frames according to a preset frame length; extracting acoustic characteristic information of the plurality of voice frames; and in the process of calculating the decoding scores of the candidate decoding paths, acquiring a current state node, determining the candidate decoding paths reaching the current state node, acquiring the current decoding scores corresponding to the determined candidate decoding paths, selecting the candidate decoding paths whose current decoding scores are smaller than the current maximum decoding score, and stopping calculating the decoding scores of the selected candidate decoding paths.
8. The apparatus according to claim 7, wherein the calculation module is further configured to add the target language model score and the acoustic score corresponding to each node connection arc in a candidate decoding path to obtain the decoding score of the candidate decoding path.
9. A computer device comprising a memory and a processor, the memory storing a computer program operable on the processor, wherein the processor implements the steps of the method of any one of claims 1 to 6 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.
CN202011265557.XA 2020-11-13 2020-11-13 Speech recognition method, speech recognition device, computer equipment and storage medium Active CN112102815B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011265557.XA CN112102815B (en) 2020-11-13 2020-11-13 Speech recognition method, speech recognition device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112102815A CN112102815A (en) 2020-12-18
CN112102815B (en) 2021-07-13

Family

ID=73784494

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102074231A (en) * 2010-12-30 2011-05-25 万音达有限公司 Voice recognition method and system
CN106328147A (en) * 2016-08-31 2017-01-11 中国科学技术大学 Speech recognition method and device
US10049656B1 (en) * 2013-09-20 2018-08-14 Amazon Technologies, Inc. Generation of predictive natural language processing models
CN108415898A (en) * 2018-01-19 2018-08-17 苏州思必驰信息科技有限公司 The word figure of deep learning language model beats again a point method and system
CN109427330A (en) * 2017-09-05 2019-03-05 中国科学院声学研究所 A kind of audio recognition method and system regular based on statistical language model score
CN111292728A (en) * 2018-11-21 2020-06-16 三星电子株式会社 Speech recognition method and apparatus
CN111508497A (en) * 2019-01-30 2020-08-07 北京猎户星空科技有限公司 Voice recognition method and device, electronic equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018118442A1 (en) * 2016-12-21 2018-06-28 Google Llc Acoustic-to-word neural network speech recognizer

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
N-best rescoring algorithm based on LSTM RNNLM; Li Hua et al.; Journal of Information Engineering University; 2017-08-31; Vol. 18, No. 4; pp. 419-424 and figs. 1-4 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant