CN112802461B - Speech recognition method and device, server and computer readable storage medium - Google Patents

Speech recognition method and device, server and computer readable storage medium Download PDF

Info

Publication number
CN112802461B
CN112802461B (application CN202011607655.7A)
Authority
CN
China
Prior art keywords
acoustic
decoding network
decoding
sub
voice recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011607655.7A
Other languages
Chinese (zh)
Other versions
CN112802461A (en)
Inventor
周维聪
袁丁
赵金昊
刘云峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Zhuiyi Technology Co Ltd
Original Assignee
Shenzhen Zhuiyi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Zhuiyi Technology Co Ltd filed Critical Shenzhen Zhuiyi Technology Co Ltd
Priority to CN202011607655.7A priority Critical patent/CN112802461B/en
Publication of CN112802461A publication Critical patent/CN112802461A/en
Application granted granted Critical
Publication of CN112802461B publication Critical patent/CN112802461B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/08 Speech classification or search
    • G10L 15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L 15/142 Hidden Markov Models [HMMs]
    • G10L 15/16 Speech classification or search using artificial neural networks
    • G10L 15/26 Speech to text systems
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 characterised by the type of extracted parameters
    • G10L 25/24 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

The application relates to a speech recognition method and apparatus, a server, and a computer-readable storage medium. The method comprises: extracting acoustic features from speech data to be processed, inputting the extracted acoustic features into an acoustic model, and calculating an acoustic model score for the acoustic features; and decoding the acoustic features and their acoustic model scores with a main decoding network and a sub-decoding network to obtain a speech recognition result. The method does not retrain a decoding network for the scene to be recognized; instead, the target named entities in that scene are trained to obtain a sub-decoding network, and the main decoding network and the sub-decoding network are then used together for decoding. The target named entities in the scene to be recognized can therefore be decoded accurately by the sub-decoding network, and because no decoding network is retrained for the scene, training time is greatly shortened and speech recognition efficiency is improved.

Description

Speech recognition method and device, server and computer readable storage medium
Technical Field
The present application relates to the field of natural language processing technology, and in particular, to a method and apparatus for speech recognition, a server, and a computer readable storage medium.
Background
With the continuous development of artificial intelligence and natural language processing technologies, speech recognition technology has also developed rapidly. Speech recognition can automatically convert an audio signal into corresponding text or commands. Conventional speech recognition technology works well in common, everyday recognition scenarios and achieves good recognition results.
However, when applied to a specialized scenario, conventional speech recognition performs poorly because such scenarios contain a large amount of specialized vocabulary. If a decoding network is retrained specifically for the specialized scenario, the workload of retraining the decoding graph is clearly large, the training time is long, and the solution cannot be deployed quickly.
Disclosure of Invention
The embodiments of the application provide a speech recognition method and apparatus, a server, and a computer-readable storage medium, which can reduce the workload of retraining a decoding graph, greatly shorten training time, and improve speech recognition efficiency when performing speech recognition for a specific application scenario.
A method of speech recognition, comprising:
extracting acoustic features from speech data to be processed;
inputting the extracted acoustic features into an acoustic model, and calculating acoustic model scores of the acoustic features;
decoding the acoustic features and the acoustic model scores of the acoustic features by using a main decoding network and a sub-decoding network to obtain a speech recognition result; the main decoding network is a decoding graph obtained by training an original text training corpus, and the sub-decoding network is a decoding graph obtained by training target named entities in a scene to be recognized.
A speech recognition device, the device comprising:
the acoustic feature extraction module is used for extracting acoustic features of the voice data to be processed;
an acoustic model score calculation module for inputting the extracted acoustic features into an acoustic model and calculating acoustic model scores of the acoustic features;
the decoding module is used for decoding the acoustic features and the acoustic model scores of the acoustic features by using a main decoding network and a sub-decoding network to obtain a speech recognition result; the main decoding network is a decoding graph obtained by training an original text training corpus, and the sub-decoding network is a decoding graph obtained by training target named entities in a scene to be recognized.
A server comprising a memory and a processor, the memory having stored therein a computer program which, when executed by the processor, causes the processor to perform the steps of the method as above.
A computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of the method as above.
The speech recognition method and apparatus, the server, and the computer-readable storage medium extract acoustic features from the speech data to be processed, input the extracted acoustic features into an acoustic model, and calculate an acoustic model score for the acoustic features. The acoustic features and their acoustic model scores are then decoded with a main decoding network and a sub-decoding network to obtain a speech recognition result. The sub-decoding network is a decoding network obtained by training the target named entities in the scene to be recognized. The method does not retrain a decoding network for the scene to be recognized; instead, the target named entities in that scene are trained to obtain a sub-decoding network, and the main decoding network and the sub-decoding network are then used together to decode the acoustic features and their acoustic model scores. The target named entities in the scene to be recognized can therefore be decoded accurately by the sub-decoding network, and because no decoding network is retrained for the scene, training time is greatly shortened and speech recognition efficiency is improved.
Drawings
In order to more clearly illustrate the embodiments of the application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is an application scenario diagram of a speech recognition method in one embodiment;
FIG. 2 is a flow chart of a method of speech recognition in one embodiment;
FIG. 3 is a flow diagram of a primary decoding network generation method in one embodiment;
FIG. 4 is a schematic diagram of a portion of a primary decoding network in one embodiment;
FIG. 5 is a flow diagram of a method of generating a sub-decoding network in one embodiment;
FIG. 6 is a schematic diagram of a speech recognition lattice in one embodiment;
FIG. 7 is a flow chart of a method for decoding with a main decoding network and a sub-decoding network to obtain a speech recognition lattice in one embodiment;
FIG. 8 is a schematic diagram of decoding with a main decoding network and a sub-decoding network in one embodiment;
FIG. 9 is a flow diagram of a method for obtaining a speech recognition result based on a speech recognition lattice in one embodiment;
FIG. 10 is a block diagram of a speech recognition device in one embodiment;
FIG. 11 is a block diagram showing a voice recognition apparatus according to another embodiment;
FIG. 12 is a schematic diagram of an internal structure of a server in one embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
It will be understood that the terms first, second, etc. as used herein may be used to describe various elements, but these elements are not limited by these terms. These terms are only used to distinguish one element from another element.
FIG. 1 is an application scenario diagram of a speech recognition method in one embodiment. As shown in fig. 1, the application environment includes a terminal 120 and a server 140 connected through a network. Using the speech recognition method of the application, the server 140 extracts acoustic features from the speech data to be processed; inputs the extracted acoustic features into an acoustic model and calculates acoustic model scores of the acoustic features; and decodes the acoustic features and their acoustic model scores with a main decoding network and a sub-decoding network to obtain a speech recognition result. The main decoding network is a decoding graph obtained by training the original text training corpus, and the sub-decoding network is a decoding graph obtained by training the target named entities in the scene to be recognized. The terminal 120 may be any terminal device such as a mobile phone, a tablet computer, a PDA (Personal Digital Assistant), a vehicle-mounted computer, a wearable device, or a smart home device.
Fig. 2 is a flowchart of a speech recognition method in one embodiment. As shown in fig. 2, a speech recognition method is provided. The method is applied to a server and includes steps 220 to 260.
step 220, extracting acoustic features of the voice data to be processed.
The speech data refers to an acquired audio signal, for example an audio signal acquired in a voice input scenario, an intelligent chat scenario, or a speech translation scenario. Acoustic features are extracted from the speech data to be processed. The specific process of acoustic feature extraction may be: converting the acquired one-dimensional audio signal into a set of high-dimensional vectors through a feature extraction algorithm. The resulting high-dimensional vectors are the acoustic features; common acoustic features include MFCC, Fbank and i-vector features, which the application does not limit. Fbank (FilterBank) is a front-end processing algorithm that processes audio in a manner similar to the human ear in order to improve speech recognition performance. The general procedure for obtaining the Fbank features of a speech signal is: pre-emphasis, framing, windowing, short-time Fourier transform (STFT), mel filtering, mean removal, and so on. MFCC features can then be obtained by applying a discrete cosine transform (DCT) to the Fbank features.
MFCC (mel-frequency cepstral coefficients) features were proposed based on the auditory characteristics of the human ear: the mel frequency has a nonlinear correspondence with the Hz frequency. Mel-frequency cepstral coefficients are spectral features calculated by exploiting this nonlinear correspondence between the mel frequency and the Hz frequency. MFCC is mainly used for feature extraction from speech data and for reducing the operational dimensionality. For example, if one frame contains 512-dimensional (sample point) data, the most important 40 dimensions can be extracted after MFCC, thereby achieving dimensionality reduction. The i-vector is a feature vector describing each speaker.
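As a concrete illustration of the feature-extraction pipeline described above (pre-emphasis, framing, windowing, STFT, mel filtering, and a DCT for MFCC), the sketch below uses the librosa library; the sample rate, frame length, hop size and filter counts are illustrative assumptions rather than values prescribed by the application.

```python
import numpy as np
import librosa

def extract_features(wav_path, n_mels=40, n_mfcc=40):
    """Sketch of Fbank / MFCC extraction; all parameter values are assumptions."""
    y, sr = librosa.load(wav_path, sr=16000)        # one-dimensional audio signal
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])      # pre-emphasis

    # framing + windowing + STFT + mel filtering -> log mel filterbank (Fbank)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=512, hop_length=160, win_length=400, n_mels=n_mels)
    fbank = librosa.power_to_db(mel)                # [n_mels, T]

    # DCT of the log mel spectrum -> MFCC (also reduces the dimensionality)
    mfcc = librosa.feature.mfcc(S=fbank, n_mfcc=n_mfcc)  # [n_mfcc, T]

    fbank = fbank - fbank.mean(axis=1, keepdims=True)    # mean removal
    return fbank.T, mfcc.T                          # frames as rows: [T, dim]
```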
Step 240, inputting the extracted acoustic features into an acoustic model, and calculating an acoustic model score of the acoustic features.
Specifically, the acoustic model may include a neural network model and a hidden Markov model. The neural network model provides acoustic modeling units to the hidden Markov model, and the granularity of an acoustic modeling unit may be a word, a syllable, a phoneme, or a state. The hidden Markov model then determines the phoneme sequence from the acoustic modeling units provided by the neural network model. A state is the mathematical characterization of the state of a Markov process. The acoustic model is trained in advance on an audio training corpus.
Inputting the extracted acoustic features into the acoustic model yields the acoustic model score of the acoustic features. The acoustic model score can be regarded as a score calculated from the probability of each phoneme occurring given each acoustic feature.
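For illustration only, the following sketch shows how frame-level acoustic model scores (the log-probability of each phoneme for each frame of acoustic features) could be computed; a single linear layer with a softmax stands in for the neural-network/HMM acoustic model of the application, and all dimensions and weights are assumed toy values.

```python
import numpy as np

def log_softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def acoustic_scores(features, w, b):
    """features: [T, D] acoustic features; w: [D, P]; b: [P] for P phoneme units.
    Returns [T, P] log-probabilities used as acoustic model scores."""
    logits = features @ w + b
    return log_softmax(logits)

# toy usage: 100 frames of 40-dim features, 50 phoneme units (assumed sizes)
rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 40))
scores = acoustic_scores(feats, rng.normal(size=(40, 50)), np.zeros(50))
```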
Step 260, decoding the acoustic features and the acoustic model scores of the acoustic features by using a main decoding network and a sub-decoding network to obtain a speech recognition result; the main decoding network is a decoding graph obtained by training the original text training corpus, and the sub-decoding network is a decoding graph obtained by training the target named entities in the scene to be recognized.
The decoding network is used to find the optimal decoding path for a given phoneme sequence, from which the speech recognition result is obtained. In the embodiment of the application, the decoding network includes a main decoding network and a sub-decoding network: the main decoding network is a decoding graph obtained by training the original text training corpus, and the sub-decoding network is a decoding graph obtained by training the target named entities in the scene to be recognized. In this way, the phoneme sequences that do not belong to a named entity are decoded by the main decoding network, and the phoneme sequences of the target named entities are decoded by the sub-decoding network. The main decoding network and the sub-decoding network therefore jointly decode the acoustic features and their acoustic model scores to obtain the speech recognition result.
The target named entities in the scene to be recognized include named entities manually specified for that scene whose speech recognition error rate exceeds a preset error-rate threshold, as well as specialized vocabulary in the scene; the application is not limited in this respect.
In the embodiment of the application, to address the technical problem of low speech recognition accuracy in a specific application scenario, a speech recognition method is provided: acoustic features are extracted from the speech data to be processed, the extracted acoustic features are input into an acoustic model, and the acoustic model score of the acoustic features is calculated. The acoustic features and their acoustic model scores are then decoded with a main decoding network and a sub-decoding network to obtain a speech recognition result. The method does not retrain a decoding network for the scene to be recognized; instead, the target named entities in that scene are trained to obtain a sub-decoding network, and the main decoding network and the sub-decoding network are then used to decode the acoustic features and their acoustic model scores. The target named entities in the scene can therefore be decoded accurately by the sub-decoding network, and because no decoding network is retrained for the scene, training time is greatly shortened and speech recognition efficiency is improved.
In one embodiment, as shown in fig. 3, the generation process of the primary decoding network includes:
and 320, hollowing out named entities in the original text training corpus to obtain the target text training corpus.
The original text training corpus is obtained by obtaining the original text training corpus from a corpus database. The original text training corpus approximately comprises 700-1000 ten thousand text corpora. The language materials which are actually appeared in the actual use of the language are stored in the language database, and the language database is a basic resource for bearing language knowledge by taking an electronic computer as a carrier. In general, the real corpus needs to be processed (e.g., analyzed and processed) to be a useful resource. Named entities (names), as their names mean, the name of a person, the name of an organization, the name of a place, and all other entities identified by the name, and more broadly, the entities include numbers, dates, currencies, addresses, and the like.
Because named entities in different application scenarios have large differences, and these named entities may not be the corpora contained in the original text training corpora. Therefore, in order to improve the accuracy of voice recognition under specific application scenarios, the named entities in the original text training corpus can be hollowed out first, and the hollowed-out positions are represented by hollow nodes. Thus, the training corpus which does not contain the named entity is left, and the training corpus which does not contain the named entity forms the target text training corpus.
Step 340, training the target text training corpus to obtain a language model.
After the target text training corpus is obtained, it is trained to obtain a language model. The language model may be trained with a recurrent neural network and is then also called an RNNLM (recurrent neural network based language model). Besides the word currently input, a recurrent-network language model can take into account many previously input words, and can calculate the probability of the next word from the long text formed by those previous words, so it has a better "memory effect". For example, after "my mood", the word "good" may occur and the word "bad" may occur; which of these words occurs depends, through this memory effect, on the previously seen "my" and "mood".
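The following is a minimal sketch of a recurrent-network language model of the kind described above, written with PyTorch; the vocabulary size and layer dimensions are assumptions, and the model simply returns the log-probability distribution of the next word given the words seen so far.

```python
import torch
import torch.nn as nn

class RNNLM(nn.Module):
    """Toy recurrent language model: P(next word | previous words)."""
    def __init__(self, vocab_size=10000, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, word_ids):                 # word_ids: [batch, seq_len]
        h, _ = self.rnn(self.embed(word_ids))    # hidden state carries the history
        return torch.log_softmax(self.proj(h), dim=-1)  # [batch, seq_len, vocab]

# usage: the distribution at position t predicts the word after word_ids[:, t]
lm = RNNLM()
ids = torch.randint(0, 10000, (1, 5))
log_probs = lm(ids)
```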
Step 360, training the speech training corpus corresponding to the original text training corpus to obtain an acoustic model.
Next, the speech training corpus corresponding to the original text training corpus is obtained; there is a correspondence between the speech training corpus and each corpus in the original text training corpus. That is, every text corpus in the original text training corpus has a corresponding speech corpus in the speech database.
The acoustic model uses a deep neural network to model the mapping between acoustic utterances and basic acoustic units (typically phonemes). A phoneme is the smallest phonetic unit divided according to the natural properties of speech. The acoustic model receives input acoustic features and outputs the phoneme sequence corresponding to those features. Acoustic features are extracted from the speech corpora in the speech database, and the acoustic model is trained on the extracted acoustic features.
Step 380, combining the language model and the acoustic model to obtain a main decoding network, wherein the main decoding network comprises empty nodes, and the empty nodes correspond to the empty positions in the target text training corpus.
Since the named entities in the original text training corpus were hollowed out to obtain the target text training corpus, the language model is trained on the target text training corpus, and the acoustic model is trained on the speech training corpus corresponding to the original text training corpus. The language model and the acoustic model are then combined to obtain the main decoding network. The main decoding network contains nodes corresponding to the training corpus other than the named entities; each named entity is represented by an empty node, and the empty nodes correspond to the hollowed-out positions in the target text training corpus, that is, the positions where named entities originally appeared.
Fig. 4 is a schematic diagram of a portion of a main decoding network in one embodiment. A word sequence consists of nodes and jump edges, where the nodes include a start node, intermediate nodes and a termination node. As shown in fig. 4, node 1 is the start node, nodes 2, 3, 4 and 5 are intermediate nodes, and node 6 is the termination node. Jump edges connect the start node to the termination node, and each jump edge carries word information and phoneme information. The jump edge between node 1 and node 2 carries the word information "你好" (hello) and the phoneme information n. The jump edges between nodes 2 and 3, between nodes 3 and 4, and between nodes 4 and 5 carry blank word information and the phoneme information i, h and ao, respectively. Node 5 is an empty node obtained by hollowing out a named entity, so the jump edge between node 5 and node 6 carries neither word information nor phoneme information.
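For illustration, the word-sequence structure of Fig. 4 (nodes connected by jump edges carrying word and phoneme information, with an empty node left by the hollowed-out named entity) can be represented by a small graph structure such as the sketch below; the class names and node numbering are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set

@dataclass
class JumpEdge:
    dst: int
    word: Optional[str] = None      # word information on the edge (None = blank)
    phoneme: Optional[str] = None   # phoneme information on the edge

@dataclass
class DecodingGraph:
    start: int
    finals: Set[int]
    edges: Dict[int, List[JumpEdge]] = field(default_factory=dict)

    def add_edge(self, src, dst, word=None, phoneme=None):
        self.edges.setdefault(src, []).append(JumpEdge(dst, word, phoneme))

# Fragment of the main decoding network in Fig. 4: "你好" spans the phonemes
# n-i-h-ao on the edges 1→2→3→4→5; node 5 is the empty node left by the
# hollowed-out named entity, so the edge 5→6 carries no word or phoneme info.
main_net = DecodingGraph(start=1, finals={6})
main_net.add_edge(1, 2, word="你好", phoneme="n")
main_net.add_edge(2, 3, phoneme="i")
main_net.add_edge(3, 4, phoneme="h")
main_net.add_edge(4, 5, phoneme="ao")
main_net.add_edge(5, 6)   # empty node: a sub-decoding network is attached here
```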
In the embodiment of the application, the named entities in the original text training corpus are hollowed out to obtain the target text training corpus, the target text training corpus is trained to obtain the language model, and the speech training corpus corresponding to the original text training corpus is trained to obtain the acoustic model. The language model and the acoustic model are combined to obtain the main decoding network, which contains empty nodes corresponding to the hollowed-out positions in the target text training corpus. The main decoding network obtained after hollowing out the named entities can be combined with a sub-decoding network trained on the target named entities of any specific scene, so it is suitable for speech recognition in any specific scene and improves the accuracy and efficiency of speech recognition in that scene.
In one embodiment, combining the language model with the acoustic model results in a primary decoding network comprising:
and combining the language model and the acoustic model by adopting a composition algorithm to obtain a main decoding network.
In the embodiment of the application, the output label on a certain transition of the first WFST is equal to the input label on a certain transition of the second WFST through a composition algorithm, and then the label and weight on the transitions are respectively operated. The specific implementation code of the composition algorithm is not described in detail in the present application.
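The following is a didactic sketch of composition for epsilon-free WFSTs given as plain transition lists: a transition of the first WFST pairs with a transition of the second WFST whenever the output label of the former equals the input label of the latter, and their weights (negative log probabilities here) are added. It is not the implementation used by the application or by toolkits such as OpenFst.

```python
from collections import defaultdict

def compose(fst_a, fst_b):
    """Compose two epsilon-free WFSTs given as lists of transitions
    (src, ilabel, olabel, weight, dst); weights are -log probs and add."""
    arcs_a, arcs_b = defaultdict(list), defaultdict(list)
    for t in fst_a["arcs"]:
        arcs_a[t[0]].append(t)
    for t in fst_b["arcs"]:
        arcs_b[t[0]].append(t)

    start = (fst_a["start"], fst_b["start"])
    result = {"start": start, "finals": set(), "arcs": []}
    stack, seen = [start], {start}
    while stack:
        qa, qb = stack.pop()
        if qa in fst_a["finals"] and qb in fst_b["finals"]:
            result["finals"].add((qa, qb))
        for (_, il_a, ol_a, w_a, na) in arcs_a[qa]:
            for (_, il_b, ol_b, w_b, nb) in arcs_b[qb]:
                if ol_a == il_b:                  # output of A matches input of B
                    nxt = (na, nb)
                    result["arcs"].append(((qa, qb), il_a, ol_b, w_a + w_b, nxt))
                    if nxt not in seen:
                        seen.add(nxt)
                        stack.append(nxt)
    return result

# toy usage: a lexicon-like FST composed with a grammar-like FST
L = {"start": 0, "finals": {1}, "arcs": [(0, "n", "你好", 1.2, 1)]}
G = {"start": 0, "finals": {1}, "arcs": [(0, "你好", "你好", 0.5, 1)]}
LG = compose(L, G)   # one arc mapping input "n" to output "你好" with weight 1.7
```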
In one embodiment, as shown in fig. 5, the generation process of the sub-decoding network includes:
in step 520, the target named entity in the scene to be identified is collected to form the target named entity text.
The target named entities in the scene to be recognized include named entities manually specified for that scene whose speech recognition error rate exceeds a preset error-rate threshold. In addition, specialized vocabulary in the scene can be used as target named entity text; for example, in a medical scene, specialized terms such as doctor, patient, blood pressure, heartbeat and CT (computed tomography) can be used as target named entity text, and in an e-sports scene, terms such as "eating chicken" and "MVP" can be used as target named entity text. Obviously, the target named entities in different application scenes differ greatly.
And respectively acquiring target named entities under different application scenes to form target named entity texts under each application scene.
Step 540, assigning a language model score to the target named entity text.
The language model may be trained with a recurrent neural network and is then also called an RNNLM (recurrent neural network based language model). Besides the word currently input, a recurrent-network language model can take into account many previously input words and can calculate the probability of the next word from the long text formed by those previous words, so it has a better "memory effect". For example, after "my mood", the word "good" may occur and the word "bad" may occur; which word occurs depends, through this memory effect, on the previously seen "my" and "mood".
The recognition accuracy of target named entities can be improved by manually assigning a language model score to the target named entity text, generally a relatively high score. Here, a higher language model score may be a score exceeding a preset score threshold; for example, the preset score threshold may be set to 0.9.
And step 560, combining the target named entity text endowed with the language model score with the acoustic model to obtain a sub-decoding network.
And combining the target named entity text endowed with the language model score with the acoustic model by adopting a composition algorithm to obtain the sub-decoding network.
In the embodiment of the application, a target named entity in a scene to be identified is collected to form a target named entity text, and a language model score is given to the target named entity text. And combining the target named entity text endowed with the language model score with the acoustic model to obtain the sub-decoding network. The sub-decoding network can be inserted into the empty node of the main decoding network, so that the main decoding network and the sub-decoding network can accurately and completely recognize the voice data.
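A minimal sketch of the preparation step described in this embodiment: target named entities collected for a scene are paired with assumed pronunciations and given a manually assigned language model score above an assumed threshold of 0.9, producing the scored entity text that would then be combined with the acoustic model to form the sub-decoding network. The entity list, lexicon and score values are illustrative assumptions.

```python
# Target named entities collected for a medical scene (illustrative list)
target_entities = ["血压", "心跳", "CT"]

# Assumed pronunciation lexicon: entity -> phoneme sequence
lexicon = {
    "血压": ["x", "ue", "y", "a"],
    "心跳": ["x", "in", "t", "iao"],
    "CT": ["c", "t"],
}

BOOSTED_LM_SCORE = 0.95   # manually assigned, above the assumed 0.9 threshold

scored_entity_text = [
    {"word": w, "phonemes": lexicon[w], "lm_score": BOOSTED_LM_SCORE}
    for w in target_entities
]
# scored_entity_text would then be combined with the acoustic model (for example
# via a composition operation) to obtain the sub-decoding network.
```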
In one embodiment, the decoding of acoustic features and acoustic model scores of acoustic features to obtain speech recognition results using a main decoding network and a sub decoding network includes:
decoding the acoustic features and the acoustic model scores of the acoustic features by adopting a main decoding network and a sub decoding network to obtain a voice recognition grid lattice;
and obtaining a voice recognition result based on the voice recognition grid lattice.
Specifically, a main decoding network and a sub decoding network are adopted to decode acoustic features and acoustic model scores of the acoustic features, and a voice recognition grid lattice is obtained. The method comprises the steps of adopting a main decoding network to decode acoustic features except a target named entity, and adopting a sub decoding network to decode acoustic features corresponding to the target named entity to obtain a voice recognition grid lattice.
The speech recognition lattice includes a plurality of candidate word sequences. A candidate word sequence consists of a plurality of words and a plurality of paths. The lattice is essentially a directed acyclic graph: each node on the graph represents the ending time point of a word, and each jump edge represents a possible word together with the acoustic model score and the language model score of that word. When the speech recognition result is represented, each node stores the recognition result at the current position, including information such as the acoustic probability and the language probability.
FIG. 6 is a schematic diagram of a speech recognition lattice in one embodiment. Starting from the leftmost start node and following different arcs to the termination node yields different word sequences, and combining the probabilities stored on the arcs gives the probability (score) that the input speech corresponds to a particular text. For example, as shown in fig. 6, "hello Shenzhen", "hello Beijing" and "hello background" can each be regarded as one path of the speech recognition result, i.e. they are all word sequences, and these word sequences form the speech recognition lattice. Each path in the graph corresponds to a probability, and the score of each path can be calculated from that probability.
Obviously, the resulting speech recognition lattice is huge, so it can be pruned. One pruning method is to run forward and backward scoring over the lattice, calculate the posterior probability of each jump edge, and delete jump edges with low posterior probability. After pruning, the speech recognition lattice is simplified while the important information in it is still preserved.
After pruning, a preset number of word sequences with the highest scores are extracted from the word sequences in the speech recognition lattice, and the target word sequence is selected from these candidate word sequences as the speech recognition result. The number of target word sequences is generally one.
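The pruning step described above (forward and backward scoring over the lattice, computing the posterior probability of each jump edge and deleting edges with low posterior) can be sketched as follows for a small acyclic lattice; the arc format and the pruning threshold are illustrative assumptions.

```python
import math
from collections import defaultdict

def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def prune_lattice(nodes, arcs, start, end, threshold=0.01):
    """nodes: topologically ordered node ids; arcs: (src, dst, word, log_score).
    Keeps only the arcs whose posterior probability is at least `threshold`."""
    out_arcs, in_arcs = defaultdict(list), defaultdict(list)
    for a in arcs:
        out_arcs[a[0]].append(a)
        in_arcs[a[1]].append(a)

    alpha = {start: 0.0}                       # forward log scores
    for n in nodes:
        inc = [alpha[s] + w for (s, _, _, w) in in_arcs[n] if s in alpha]
        if inc:
            alpha[n] = logsumexp(inc)

    beta = {end: 0.0}                          # backward log scores
    for n in reversed(nodes):
        out = [beta[d] + w for (_, d, _, w) in out_arcs[n] if d in beta]
        if out:
            beta[n] = logsumexp(out)

    log_total = alpha[end]
    kept = []
    for (s, d, word, w) in arcs:
        posterior = math.exp(alpha[s] + w + beta[d] - log_total)
        if posterior >= threshold:             # delete low-posterior jump edges
            kept.append((s, d, word, w))
    return kept

# toy lattice with two competing arcs for the second word (assumed log scores);
# the low-posterior "Shen Zhen" arc is pruned away
nodes = [0, 1, 2]
arcs = [(0, 1, "hello", -1.0), (1, 2, "Shenzhen", -0.5), (1, 2, "Shen Zhen", -4.0)]
print(prune_lattice(nodes, arcs, start=0, end=2, threshold=0.05))
```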
In the embodiment of the application, the acoustic features except the target named entity are decoded by adopting a main decoding network, and the acoustic features corresponding to the target named entity are decoded by adopting a sub decoding network to obtain the voice recognition grid lattice. The obtained speech recognition lattice includes a plurality of alternative word sequences, so that the speech recognition lattice is pruned first, then the pruned speech recognition lattice is screened, and finally the target word sequence is screened out as a speech recognition result. Aiming at the target named entity in the scene to be identified, the target named entity can be accurately decoded based on the sub-decoding network. And pruning the voice recognition grid lattice obtained after the decoding of the main decoding network and the sub decoding network, and screening out a target word sequence as a voice recognition result. Therefore, the voice recognition efficiency and accuracy are improved.
In one embodiment, the speech recognition lattice includes a plurality of word sequences including nodes and a skip edge, the skip edge carrying word information of acoustic features;
as shown in fig. 7, decoding acoustic features and acoustic model scores of the acoustic features using a main decoding network and a sub decoding network to obtain a speech recognition lattice, including:
step 720, sequentially obtaining word sequences corresponding to acoustic features in the voice data from the main decoding network.
A schematic diagram of the decoding process in one embodiment is shown in connection with fig. 8. And extracting acoustic characteristics of the voice data to be processed. The extracted acoustic features are input into an acoustic model, and acoustic model scores of the acoustic features are calculated. The acoustic features here comprise phonemes, although the application is not limited in this regard. For example, a segment of audio signal is received, and acoustic features sequentially extracted from the audio signal are n, i, h, ao, sh, en, zh, en, and word sequences corresponding to the eight phonemes are sequentially acquired from the main decoding network.
As shown in fig. 8, in the process of sequentially acquiring word sequences from the main decoding network, the word sequences corresponding to the eight phonemes are obtained as follows. Node 1 is the start node, nodes 2, 3, 4 and 5 are intermediate nodes, and node 6 is the termination node. Jump edges connect the start node to the termination node, and each jump edge carries word information and phoneme information. The jump edge between node 1 and node 2 carries the word information "你好" (hello) and the phoneme information n. The jump edges between nodes 2 and 3, between nodes 3 and 4, and between nodes 4 and 5 carry blank word information and the phoneme information i, h and ao, respectively. Node 5 is an empty node obtained by hollowing out a named entity, so the jump edge between node 5 and node 6 carries neither word information nor phoneme information.
And step 740, if the word information on the jump edge of the middle node of the word sequence is null, calling a sub-decoding network, and acquiring the word sequence corresponding to the next acoustic feature in the audio signal from the sub-decoding network.
In the process of sequentially acquiring the word sequences corresponding to the acoustic features from the main decoding network, if the word information on a jump edge of an intermediate node of the word sequence is null, this indicates that the named entity at that intermediate node has been hollowed out. Therefore, when the word information on the jump edge of the intermediate node is null, the sub-decoding network is called, and the word sequence corresponding to the next acoustic feature in the audio signal is acquired from the sub-decoding network.
As shown in fig. 8, the jump edge out of node 5 carries neither word information nor phoneme information, i.e. the word information on that jump edge is null. At this point the sub-decoding network is called, and the word sequence corresponding to the next acoustic feature in the audio signal is acquired from it: the next acoustic feature sh is obtained from the sub-decoding network, followed in turn by the word sequences corresponding to en, zh and en.
Step 760: upon reaching the termination node of the word sequence in the sub-decoding network, returning to the main decoding network, and continuing to acquire the word sequence corresponding to the next acoustic feature in the audio signal from the main decoding network, until the word sequences of all acoustic features in the audio signal to be processed have been acquired in turn.
Reaching a termination node of a word sequence in the sub-decoding network means that a word sequence has been completely recognized. For example, as can be seen from fig. 8, the word sequence corresponding to the phonemes sh, en, zh and en is completely recognized in the sub-decoding network as "Shenzhen", which is the word sequence at the termination node reached. Of course, the phonemes sh, en, zh, en may also be recognized in the sub-decoding network as several alternative word sequences such as "magic needle" or "Shen Zhen". At this point, decoding returns to the termination node 6 of the main decoding network, and the word sequence corresponding to the next acoustic feature in the audio signal continues to be acquired from the main decoding network until the word sequences of all acoustic features in the audio signal to be processed have been acquired. For example, if the audio of "hello Shenzhen" is followed by the audio of "I like you", the next acoustic feature wo is acquired from the main decoding network, followed in turn by the word sequences corresponding to x, i, h, u, an, n, i, until the word sequences of all acoustic features in the audio signal to be processed have been acquired.
Step 780, sequentially connecting word sequences of all acoustic features to form a speech recognition grid lattice.
The word sequences of all acoustic features in the speech data are connected in sequence to form the speech recognition lattice. For example, connecting the word sequences of the speech data in fig. 8 in sequence yields several candidate word sequences such as "hello Shenzhen", "hello magic needle" and "hello Shen Zhen". These candidate word sequences constitute the speech recognition lattice.
In the embodiment of the application, the word sequences corresponding to the acoustic features in the speech data are acquired in sequence from the main decoding network. If the word information on a jump edge of an intermediate node of the word sequence is null, the sub-decoding network is called and the word sequence corresponding to the next acoustic feature in the audio signal is acquired from it; upon reaching the termination node of the word sequence in the sub-decoding network, decoding returns to the main decoding network, which continues to supply the word sequences for the following acoustic features until the word sequences of all acoustic features in the audio signal to be processed have been acquired in turn. Finally, the word sequences of all acoustic features are connected in sequence to form the speech recognition lattice.
In the decoding process by adopting the main decoding network, if the word information of the skip edge of the middle node of the word sequence is null, the sub decoding network is called to decode. Finally, based on the alternative word sequence obtained by decoding through the main decoding network and the sub decoding network, the voice recognition grid lattice is obtained. The accurate identification of the target named entity is realized through the sub-decoding network.
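A heavily simplified sketch of the switching behaviour described in this embodiment: phonemes are consumed along the jump edges of the main decoding network, and when a jump edge out of a node carries neither word nor phoneme information (an empty node), decoding enters the sub-decoding network, follows it to its termination node, and then returns to the main decoding network. Real decoding keeps many competing hypotheses and builds a lattice; this sketch follows a single path, and the toy networks are assumptions.

```python
# Each network: {node: [(dst, word, phoneme)]}; an edge carrying neither word
# nor phoneme information marks the empty node left by a hollowed-out entity.
main_net = {
    1: [(2, "你好", "n")], 2: [(3, None, "i")], 3: [(4, None, "h")],
    4: [(5, None, "ao")], 5: [(6, None, None)], 6: [],
}
sub_net = {                                     # trained on the entity "深圳"
    10: [(11, "深圳", "sh")], 11: [(12, None, "en")],
    12: [(13, None, "zh")], 13: [(14, None, "en")], 14: [],
}

def decode(phonemes, main_net, sub_net, main_start=1, sub_start=10):
    words, node, i = [], main_start, 0
    while i < len(phonemes):
        edges = main_net[node]
        if edges and edges[0][2] is None:       # empty node: call the sub-network
            ret_node = edges[0][0]              # node to return to afterwards
            sub_node = sub_start
            while sub_net[sub_node]:            # follow the sub-network to its end
                dst, word, ph = sub_net[sub_node][0]
                if ph != phonemes[i]:
                    raise ValueError("no match in sub-decoding network")
                if word:
                    words.append(word)
                sub_node, i = dst, i + 1
            node = ret_node                     # back to the main decoding network
        else:
            dst, word, ph = edges[0]
            if ph != phonemes[i]:
                raise ValueError("no match in main decoding network")
            if word:
                words.append(word)
            node, i = dst, i + 1
    return words

print(decode(["n", "i", "h", "ao", "sh", "en", "zh", "en"], main_net, sub_net))
# -> ['你好', '深圳']
```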
In one embodiment, as shown in FIG. 9, the jump edge also carries a language model score for the acoustic feature; obtaining a voice recognition result based on the voice recognition grid lattice comprises the following steps:
step 920, obtaining a total score of each word sequence in the speech recognition grid lattice based on the acoustic model score of the acoustic feature and the language model score of the acoustic feature;
step 940, obtaining the word sequence with the highest total score of the word sequence as the target word sequence;
step 960, obtaining word information of the target word sequence, and taking the word information in the target word sequence as a voice recognition result.
The speech recognition lattice includes a plurality of candidate word sequences. A word sequence consists of words and paths; the lattice is essentially a directed acyclic graph in which each node represents the ending time point of a word and each jump edge represents a possible word together with the acoustic model score and the language model score of that word.
For each word sequence, the acoustic model score and the language model score on each of its jump edges are summed to obtain the total score of that word sequence in the speech recognition lattice. The word sequence with the highest total score is taken as the target word sequence, and the word information of the target word sequence is used as the speech recognition result.
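A small sketch of this selection step under assumed scores: every jump edge carries an acoustic model score and a language model score, the scores along each path of the lattice are summed, and the word sequence on the highest-scoring path is returned as the recognition result.

```python
def best_word_sequence(lattice, start, end):
    """lattice: {node: [(dst, word, am_score, lm_score)]} (log-domain scores).
    Returns (total_score, word_sequence) of the highest-scoring path."""
    best = (float("-inf"), [])
    stack = [(start, 0.0, [])]
    while stack:
        node, score, words = stack.pop()
        if node == end:
            if score > best[0]:
                best = (score, words)
            continue
        for dst, word, am, lm in lattice.get(node, []):
            stack.append((dst, score + am + lm, words + ([word] if word else [])))
    return best

# toy lattice: "hello" followed by either "Shenzhen" or "Shen Zhen" (assumed scores)
lattice = {
    0: [(1, "hello", -3.0, -1.0)],
    1: [(2, "Shenzhen", -5.0, -0.5), (2, "Shen Zhen", -4.8, -2.5)],
}
score, words = best_word_sequence(lattice, start=0, end=2)
print(words)   # -> ['hello', 'Shenzhen']
```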
Referring to fig. 8, if the total score of the word sequence of "hello shenzhen" is highest, word information of the word sequence, i.e. "hello shenzhen", is used as the speech recognition result.
In the embodiment of the application, for each word sequence, the acoustic model score and the language model score on each jump edge are summed to obtain the total score of each word sequence in the voice recognition grid lattice. And obtaining the word sequence with the highest total score from the word sequence as a target word sequence. And taking the word information of the target word sequence as a voice recognition result. Thus, the target word sequence can be accurately screened out through the acoustic model score and the language model score.
In one embodiment, the scene to be identified comprises at least one of a medical scene, an image processing scene, an electronic contest scene.
For a medical scene, the target named entities are mostly specialized vocabulary, for example doctor, patient, blood pressure, heartbeat, CT (computed tomography), and the like. For an image processing scene, the target named entities include resolution, chromatic aberration, backlight, and the like. For an e-sports scene, the target named entities include terms such as "eating chicken" and "MVP". Obviously, the target named entities in different application scenes differ greatly.
In the embodiment of the application, aiming at different application scenes including but not limited to a medical scene, an image processing scene and an electronic contest scene, target named entity under different application scenes are respectively acquired to form target named entity texts under each application scene. Then, a language model score is assigned to the target named entity text. And finally, combining the target named entity text endowed with the language model score with the acoustic model to obtain the sub-decoding network. Therefore, the target named entities in different application scenes can be subjected to targeted voice recognition, so that the accuracy of the finally obtained voice recognition result is improved.
In one embodiment, as shown in FIG. 10, a speech recognition apparatus 1000 includes:
an acoustic feature extraction module 1020, configured to perform acoustic feature extraction on the voice data to be processed;
an acoustic model score calculation module 1040 for inputting the extracted acoustic features into an acoustic model, calculating acoustic model scores for the acoustic features;
the decoding module 1060 is configured to decode the acoustic feature and the acoustic model score of the acoustic feature to obtain a speech recognition result by using the main decoding network and the sub decoding network; the main decoding network is a decoding diagram obtained by training an original text training corpus, and the sub decoding diagram is a decoding diagram obtained by training a target named entity in a scene to be identified.
In one embodiment, as shown in fig. 11, there is further provided a voice recognition apparatus 1000, further comprising: the main decoding network generation module 1070 is used for hollowing out named entities in the original text training corpus to obtain a target text training corpus; training the target text training corpus to obtain a language model; training a voice training corpus corresponding to the original text training corpus to obtain an acoustic model; and combining the language model and the acoustic model to obtain a main decoding network, wherein the main decoding network comprises empty nodes, and the empty nodes correspond to the hollowed-out positions in the target text training corpus.
In one embodiment, the primary decoding network generation module 1070 is further configured to combine the language model with the acoustic model using a composition algorithm to obtain the primary decoding network.
In one embodiment, as shown in fig. 11, there is further provided a voice recognition apparatus 1000, further comprising: a sub-decoding network generating module 1080 for collecting target named entities in the scene to be identified to form a target named entity text; assigning a language model score to the target named entity text; and combining the target named entity text endowed with the language model score with the acoustic model to obtain the sub-decoding network.
In one embodiment, the decoding module 1060 further comprises:
the voice recognition grid generation unit is used for decoding the acoustic features and the acoustic model scores of the acoustic features by adopting a main decoding network and a sub decoding network to obtain a voice recognition grid lattice;
and the voice recognition result determining unit is used for obtaining a voice recognition result based on the voice recognition grid lattice.
In one embodiment, the speech recognition lattice includes a plurality of word sequences, each word sequence including nodes and jump edges, the jump edges carrying word information of the acoustic features. The voice recognition grid generating unit is further configured to sequentially acquire, from the main decoding network, the word sequences corresponding to the acoustic features in the speech data; if the word information on a jump edge of an intermediate node of a word sequence is null, call the sub-decoding network and acquire the word sequence corresponding to the next acoustic feature in the audio signal from the sub-decoding network; upon reaching the termination node of the word sequence in the sub-decoding network, return to the main decoding network and continue to acquire the word sequence corresponding to the next acoustic feature in the audio signal from the main decoding network, until the word sequences of all acoustic features in the audio signal to be processed have been acquired in turn; and connect the word sequences of all acoustic features in sequence to form the speech recognition lattice.
In one embodiment, the jump edge also carries a language model score for the acoustic feature; the voice recognition result determining unit is further used for obtaining the total score of each word sequence in the voice recognition grid lattice based on the acoustic model score of the acoustic feature and the language model score of the acoustic feature; acquiring a word sequence with the highest total score of the word sequence as a target word sequence; word information of the target word sequence is obtained, and the word information in the target word sequence is used as a voice recognition result.
In one embodiment, the scene to be identified comprises at least one of a medical scene, an image processing scene, an electronic contest scene.
The division of the modules in the above-described voice recognition device is merely for illustration, and in other embodiments, the voice recognition device may be divided into different modules as needed to perform all or part of the functions of the above-described voice recognition device.
FIG. 12 is a schematic diagram of the internal structure of a server in one embodiment. As shown in fig. 12, the server includes a processor and a memory connected through a system bus. The processor provides computing and control capabilities to support the operation of the entire server. The memory may include a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The computer program can be executed by the processor to implement the speech recognition method provided in the various embodiments of the application. The internal memory provides a cached running environment for the operating system and the computer program in the non-volatile storage medium. The server may be a mobile phone, a tablet computer, a personal digital assistant, a wearable device, or the like.
The implementation of each module in the voice recognition apparatus provided in the embodiment of the present application may be in the form of a computer program. The computer program may run on a terminal or a server. Program modules of the computer program may be stored in the memory of the terminal or server. Which when executed by a processor, performs the steps of the method described in the embodiments of the application.
The embodiment of the application also provides a computer readable storage medium. One or more non-transitory computer-readable storage media containing computer-executable instructions that, when executed by one or more processors, cause the processors to perform the steps of a speech recognition method.
A computer program product comprising instructions which, when run on a computer, cause the computer to perform a speech recognition method.
Any reference to memory, storage, database, or other medium used by embodiments of the application may include non-volatile and/or volatile memory. Suitable nonvolatile memory can include Read Only Memory (ROM), programmable ROM (PROM), electrically Programmable ROM (EPROM), electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM), which acts as external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous Link DRAM (SLDRAM), memory bus direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM).
The foregoing examples illustrate only a few embodiments of the application, which are described in detail and are not to be construed as limiting the scope of the application. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.

Claims (10)

1. A method of speech recognition, comprising:
extracting acoustic characteristics of voice data to be processed;
inputting the extracted acoustic features into an acoustic model, and calculating acoustic model scores of the acoustic features;
decoding the acoustic features and the acoustic model scores of the acoustic features by adopting a main decoding network and a sub decoding network to obtain a voice recognition result; the main decoding network is a decoding graph obtained by training an original text training corpus, and the sub decoding network is a decoding graph obtained by training a target named entity in a scene to be identified;
the generation process of the main decoding network comprises the following steps:
hollowing out named entities in the original text training corpus to obtain a target text training corpus;
training the target text training corpus to obtain a language model;
training the voice training corpus corresponding to the original text training corpus to obtain an acoustic model;
and combining the language model with the acoustic model to obtain a main decoding network, wherein the main decoding network comprises empty nodes, and the empty nodes correspond to the hollowed positions in the target text training corpus.
2. The method of claim 1, wherein said combining the language model with the acoustic model results in a primary decoding network, comprising:
and combining the language model and the acoustic model by adopting a composition algorithm to obtain a main decoding network.
3. The method of claim 1, wherein the generating of the sub-decoding network comprises:
acquiring a target named entity in a scene to be identified to form a target named entity text;
assigning a language model score to the target named entity text;
and combining the target named entity text endowed with the language model score with the acoustic model to obtain a sub-decoding network.
4. The method of claim 1, wherein decoding the acoustic features and the acoustic model scores of the acoustic features using a main decoding network and a sub decoding network to obtain a speech recognition result comprises:
decoding the acoustic features and the acoustic model scores of the acoustic features by adopting the main decoding network and the sub decoding network to obtain a voice recognition grid lattice;
and obtaining a voice recognition result based on the voice recognition grid lattice.
5. The method of claim 4, wherein the speech recognition lattice comprises a plurality of word sequences, each word sequence comprising nodes and jump edges, the jump edges carrying word information of the acoustic features;
and wherein decoding the acoustic features and the acoustic model scores of the acoustic features by using the main decoding network and the sub decoding network to obtain the speech recognition lattice comprises:
sequentially acquiring, from the main decoding network, word sequences corresponding to the acoustic features in the speech data;
if the word information on a jump edge of an intermediate node of a word sequence is empty, invoking the sub decoding network and acquiring the word sequence corresponding to the next acoustic feature in the speech data from the sub decoding network;
upon reaching a termination node of the word sequence in the sub decoding network, returning to the main decoding network and continuing to acquire the word sequence corresponding to the next acoustic feature in the speech data from the main decoding network, until the word sequences of all acoustic features in the speech data to be processed have been acquired in sequence; and
sequentially connecting the word sequences of all the acoustic features to form the speech recognition lattice.
6. The method of claim 5, wherein the jump edges further carry language model scores of the acoustic features, and obtaining the speech recognition result based on the speech recognition lattice comprises:
acquiring a total score of each word sequence in the speech recognition lattice based on the acoustic model scores of the acoustic features and the language model scores of the acoustic features;
selecting the word sequence with the highest total score as a target word sequence; and
acquiring the word information of the target word sequence, and taking the word information in the target word sequence as the speech recognition result.
7. The method of claim 3, wherein the scenario to be recognized comprises at least one of a medical scenario, an image processing scenario, and an e-sports scenario.
8. A speech recognition device, comprising:
an acoustic feature extraction module, configured to extract acoustic features of speech data to be processed;
an acoustic model score calculation module, configured to input the extracted acoustic features into an acoustic model and calculate acoustic model scores of the acoustic features;
a decoding module, configured to decode the acoustic features and the acoustic model scores of the acoustic features by using a main decoding network and a sub decoding network to obtain a speech recognition result, wherein the main decoding network is a decoding graph obtained by training on an original text training corpus, and the sub decoding network is a decoding graph obtained by training on target named entities in a scenario to be recognized; and
a main decoding network generation module, configured to hollow out the named entities in the original text training corpus to obtain a target text training corpus, train a language model on the target text training corpus, train an acoustic model on the speech training corpus corresponding to the original text training corpus, and combine the language model with the acoustic model to obtain the main decoding network, wherein the main decoding network comprises empty nodes, and the empty nodes correspond to the hollowed-out positions in the target text training corpus.
9. A server comprising a memory and a processor, the memory having stored therein a computer program which, when executed by the processor, causes the processor to perform the steps of the speech recognition method according to any one of claims 1 to 7.
10. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the speech recognition method according to any one of claims 1 to 7.
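Illustrative sketch (not part of the claims): for the feature extraction and acoustic model scoring steps of claims 1 and 8, the following Python sketch shows one conceivable front end. The use of librosa MFCC features and a diagonal-Gaussian stand-in acoustic model are assumptions made purely for illustration; the patent does not specify the feature type or the acoustic model family.

```python
import numpy as np
import librosa  # assumed front end; any MFCC/filterbank extractor could be substituted

def extract_acoustic_features(wav_path, sample_rate=16000, n_mfcc=13):
    """Extract an (n_frames, n_mfcc) MFCC matrix from an utterance on disk."""
    audio, sr = librosa.load(wav_path, sr=sample_rate)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T  # transpose so that rows correspond to frames

def acoustic_model_scores(features, unit_means, unit_vars):
    """Stand-in acoustic model: per-frame log-likelihood of each acoustic unit
    under a diagonal Gaussian. Returns an (n_frames, n_units) score matrix."""
    scores = []
    for mean, var in zip(unit_means, unit_vars):
        diff = features - mean
        log_likelihood = -0.5 * np.sum(diff ** 2 / var + np.log(2 * np.pi * var), axis=1)
        scores.append(log_likelihood)
    return np.stack(scores, axis=1)
```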
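Illustrative sketch (not part of the claims): the "hollowing out" of named entities described in claim 1 can be pictured with the short sketch below. The placeholder token <ENTITY>, the example sentences, and the plain string replacement are assumptions for illustration only; the claimed method merely requires that entity positions in the corpus become empty slots.

```python
# Replace named-entity mentions in the original training corpus with a placeholder
# token, producing the "hollowed-out" target text training corpus.
def hollow_out_corpus(sentences, named_entities, placeholder="<ENTITY>"):
    """Return a copy of the corpus with every named-entity mention replaced."""
    hollowed = []
    for sentence in sentences:
        for entity in named_entities:
            sentence = sentence.replace(entity, placeholder)
        hollowed.append(sentence)
    return hollowed

original_corpus = [
    "please transfer the report to doctor zhang wei",
    "schedule a follow up with zhang wei tomorrow",
]
entities = ["zhang wei"]

target_corpus = hollow_out_corpus(original_corpus, entities)
# ['please transfer the report to doctor <ENTITY>',
#  'schedule a follow up with <ENTITY> tomorrow']
print(target_corpus)
```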
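Illustrative sketch (not part of the claims): for the sub decoding network of claim 3, one conceivable way to "assign a language model score to the target named entity text" is a uniform score over the entity list, as sketched below. The uniform choice and the example entities are assumptions, not something the claim prescribes.

```python
import math

def score_entity_texts(entity_texts):
    """Map each target named-entity text to a log-probability under a uniform model."""
    uniform_logprob = -math.log(len(entity_texts))
    return {text: uniform_logprob for text in entity_texts}

entity_scores = score_entity_texts(["aspirin", "ibuprofen", "amoxicillin"])
print(entity_scores)  # each entity receives log(1/3)
```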
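Illustrative sketch (not part of the claims): the control flow of claims 4 and 5 — take word sequences from the main decoding network, detour into the sub decoding network when a jump edge carries empty word information, and return to the main network at the sub network's termination node — is illustrated by the toy walker below. The set-based "networks", the None-style empty-edge stand-in, and the example vocabulary are assumptions; a production decoder would traverse WFST decoding graphs rather than Python sets.

```python
MAIN_VOCAB = {"please", "connect", "me", "to", "now"}   # words covered by the main network
SUB_VOCAB = {"zhang", "wei", "li", "hua"}               # target named entities of the scenario

def decode(word_hypotheses):
    """Label each hypothesized word with the network that produced it."""
    lattice_path, network = [], "main"
    for word in word_hypotheses:
        if network == "main" and word not in MAIN_VOCAB:
            network = "sub"    # empty jump edge reached: switch to the sub network
        if network == "sub" and word not in SUB_VOCAB:
            network = "main"   # sub-network termination node: return to the main network
        lattice_path.append((word, network))
    return lattice_path

print(decode(["please", "connect", "me", "to", "zhang", "wei", "now"]))
# [('please', 'main'), ..., ('zhang', 'sub'), ('wei', 'sub'), ('now', 'main')]
```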
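Illustrative sketch (not part of the claims): claim 6 ranks the word sequences of the lattice by the sum of their acoustic model scores and language model scores and emits the words of the best-scoring sequence. A minimal sketch under an assumed lattice representation (a list of paths, each a list of (word, acoustic_score, language_model_score) tuples) follows; the scores shown are made up for illustration.

```python
def best_word_sequence(lattice):
    """lattice: list of paths, each a list of (word, acoustic_score, lm_score) tuples."""
    def total_score(path):
        return sum(am + lm for _, am, lm in path)  # total score = acoustic + language model
    best_path = max(lattice, key=total_score)
    return [word for word, _, _ in best_path]

toy_lattice = [
    [("call", -1.2, -0.5), ("zhang", -2.0, -1.0), ("wei", -1.8, -1.0)],
    [("tall", -2.5, -2.2), ("zhang", -2.0, -1.0), ("way", -3.1, -2.4)],
]
print(best_word_sequence(toy_lattice))  # ['call', 'zhang', 'wei']
```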
CN202011607655.7A 2020-12-30 2020-12-30 Speech recognition method and device, server and computer readable storage medium Active CN112802461B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011607655.7A CN112802461B (en) 2020-12-30 2020-12-30 Speech recognition method and device, server and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN112802461A (en) 2021-05-14
CN112802461B (en) 2023-10-24

Family

ID=75804372

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011607655.7A Active CN112802461B (en) 2020-12-30 2020-12-30 Speech recognition method and device, server and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN112802461B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113327587A (en) * 2021-06-02 2021-08-31 云知声(上海)智能科技有限公司 Method and device for voice recognition in specific scene, electronic equipment and storage medium
CN113327597B (en) * 2021-06-23 2023-08-22 网易(杭州)网络有限公司 Speech recognition method, medium, device and computing equipment
CN114898754B (en) * 2022-07-07 2022-09-30 北京百度网讯科技有限公司 Decoding image generation method, decoding image generation device, speech recognition method, speech recognition device, electronic device and storage medium

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1499484A (en) * 2002-11-06 2004-05-26 北京天朗语音科技有限公司 Recognition system of Chinese continuous speech
EP1981020A1 (en) * 2007-04-12 2008-10-15 France Télécom Method and system for automatic speech recognition adapted for detecting utterances out of context
CN106294460A (en) * 2015-05-29 2017-01-04 中国科学院声学研究所 A kind of Chinese speech keyword retrieval method based on word and word Hybrid language model
CN108281137A (en) * 2017-01-03 2018-07-13 中国科学院声学研究所 A kind of universal phonetic under whole tone element frame wakes up recognition methods and system
CN108615525A (en) * 2016-12-09 2018-10-02 中国移动通信有限公司研究院 A kind of audio recognition method and device
CN108711422A (en) * 2018-05-14 2018-10-26 腾讯科技(深圳)有限公司 Audio recognition method, device, computer readable storage medium and computer equipment
CN110634474A (en) * 2019-09-24 2019-12-31 腾讯科技(深圳)有限公司 Speech recognition method and device based on artificial intelligence
CN110942763A (en) * 2018-09-20 2020-03-31 阿里巴巴集团控股有限公司 Voice recognition method and device
CN111128183A (en) * 2019-12-19 2020-05-08 北京搜狗科技发展有限公司 Speech recognition method, apparatus and medium
CN111916058A (en) * 2020-06-24 2020-11-10 西安交通大学 Voice recognition method and system based on incremental word graph re-scoring
CN112102815A (en) * 2020-11-13 2020-12-18 深圳追一科技有限公司 Speech recognition method, speech recognition device, computer equipment and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9477753B2 (en) * 2013-03-12 2016-10-25 International Business Machines Corporation Classifier-based system combination for spoken term detection
WO2018039045A1 (en) * 2016-08-24 2018-03-01 Knowles Electronics, Llc Methods and systems for keyword detection using keyword repetitions

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant