CN113066480B - Voice recognition method and device, electronic equipment and storage medium


Info

Publication number
CN113066480B
Authority
CN
China
Prior art keywords
acoustic
language model
language
node
decoding
Legal status
Active
Application number
CN202110328573.7A
Other languages
Chinese (zh)
Other versions
CN113066480A (en)
Inventor
李俊博 (Li Junbo)
Current Assignee
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202110328573.7A
Publication of CN113066480A
Application granted
Publication of CN113066480B

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G10L 15/08 Speech classification or search
    • G10L 15/18 Speech classification or search using natural language modelling
    • G10L 15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L 15/26 Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure relates to a speech recognition method, apparatus, electronic device and storage medium. The method includes: acquiring acoustic features of speech data to be recognized, and processing the acoustic features into an acoustic representation through an acoustic model; searching, in a search graph formed by linking subgraphs corresponding to a plurality of language models, for a plurality of decoding paths corresponding to the acoustic representation; determining a target decoding path from the plurality of decoding paths, acquiring target text data obtained by decoding the acoustic representation based on the target decoding path, and determining the target text data as the recognition result of the speech data to be recognized.

Description

Voice recognition method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of speech recognition technologies, and in particular, to a speech recognition method and apparatus, an electronic device, and a storage medium.
Background
With the development of speech recognition technology, the content to be recognized has become increasingly rich and the business scenarios to which speech recognition is applied have become increasingly complex, so a single language model can hardly support speech recognition in such complex application scenarios.
To improve the speech recognition effect in complex application scenarios, the related art adopts the following schemes: adding corpora from the fields related to the application scenario into the training of a base language model to obtain a unified language model; or interpolating, on the same language model, the corpora of the related fields according to the specific application scenario to obtain a new language model; HCLG composition is then performed based on the obtained language model, and decoding yields the speech recognition result.
However, compared with a language model trained on the corpus of a single field, the speech recognition methods in the related art weaken the recognition effect in each field, so the speech recognition results are not accurate enough.
For the problem in the related art that a single language model can hardly support the speech recognition requirements of complex application scenarios, no effective solution has been proposed so far.
Disclosure of Invention
The present disclosure provides a speech recognition method, apparatus, electronic device and storage medium, so as to at least solve the problem in the related art that a single language model is difficult to support the speech recognition requirement in a complex application scenario. The technical scheme of the disclosure is as follows:
According to a first aspect of an embodiment of the present disclosure, there is provided a speech recognition method, including: acquiring acoustic features of voice data to be recognized, and processing the acoustic features into an acoustic representation through an acoustic model, wherein the acoustic representation represents the probability that the acoustic features belong to a target acoustic symbol sequence; searching, in a search graph formed by linking subgraphs corresponding to a plurality of language models, for a plurality of decoding paths corresponding to the acoustic representation, wherein the plurality of language models and their link relations are determined by the service scene to which the voice data to be recognized belongs, and the subgraph corresponding to each language model is formed by linking acoustic symbol nodes; and determining a target decoding path from the plurality of decoding paths, acquiring target text data obtained by decoding the acoustic representation based on the target decoding path, and determining the target text data as the recognition result of the voice data to be recognized.
Optionally, before searching for the plurality of decoding paths corresponding to the acoustic representation in the search graph formed by linking the subgraphs corresponding to the plurality of language models, the method further includes: determining a plurality of language models to be used when voice recognition is carried out in the service scene, and the association relations of the plurality of language models, wherein an association relation represents a sequential (front-to-back) relation or a parallel relation of the plurality of language models as applied in the voice recognition process; constructing a subgraph corresponding to each language model; determining the link relations among the subgraphs corresponding to the plurality of language models according to the association relations; and linking the subgraphs corresponding to the plurality of language models according to the link relations to obtain the search graph.
Optionally, the language models to be used include at least one of: a first language model obtained by training a first corpus associated with the scene type of the service scene; a second language model obtained by training a second corpus associated with the field type to which the business scene belongs; a third language model obtained by training the dialogue scripts used in the service scene; a fourth language model obtained by training personalized information materials of the objects associated with the service scene; and a base language model.
Optionally, constructing the subgraph corresponding to each language model includes: acquiring a word list of a language model; determining an acoustic symbol corresponding to each word in a word list according to a pronunciation dictionary of the acoustic model to obtain a plurality of acoustic symbols; establishing acoustic symbol nodes corresponding to a plurality of acoustic symbols and word nodes corresponding to words formed by acoustic symbol sequences, wherein the same acoustic symbol corresponds to the same acoustic symbol node; and linking a plurality of acoustic symbol nodes according to the jumping relations among the plurality of acoustic symbols, and linking a plurality of word nodes according to the jumping relations among the plurality of word nodes to obtain the subgraph corresponding to the language model.
Optionally, the step of linking the subgraphs corresponding to the plurality of language models according to the link relations to obtain the search graph includes: constructing a starting node of the search graph; constructing at least one group of head and tail nodes of the recognition paths according to the voice recognition paths in the service scene, wherein each group of head and tail nodes includes a sentence head node and a sentence tail node; and establishing a link between the starting node and each sentence head node, and linking a plurality of subgraphs between the at least one group of head and tail nodes according to the link relations between the subgraphs corresponding to the plurality of language models to obtain the search graph, wherein at least one subgraph is linked between the sentence head node and the sentence tail node of each group of head and tail nodes.
Optionally, determining the target decoding path from the plurality of decoding paths comprises: in each decoding path, respectively calculating a first weight between a sentence head node and a subgraph linked with the sentence head node, calculating a second weight between two adjacent subgraphs, calculating a third weight between a sentence tail node and a subgraph linked with the sentence tail node, and determining a voice recognition weight of the decoding path based on the probability that the acoustic feature belongs to the target acoustic symbol sequence, the first weight, the second weight and the third weight; and determining the decoding path with the highest voice recognition weight in the plurality of decoding paths, and determining the decoding path with the highest voice recognition weight as a target decoding path.
Optionally, the multiple language models include a fourth language model, and after the subgraphs corresponding to the multiple language models are linked according to the link relationship to obtain the search graph, the method further includes: and under the condition that the object associated with the service scene changes, training a language model by using the personalized information material of the changed object, and updating the fourth language model according to the trained language model.
According to a second aspect of the embodiments of the present disclosure, there is provided a speech recognition apparatus, including an obtaining unit configured to obtain an acoustic feature of speech data to be recognized, and process the acoustic feature into an acoustic representation through an acoustic model, where the acoustic representation represents a probability that the acoustic feature belongs to a target acoustic symbol sequence; the search unit is configured to search a plurality of decoding paths corresponding to the acoustic representation in a search graph formed by linking subgraphs corresponding to a plurality of language models, wherein the plurality of language models and the link relation are determined by service scenes to which the voice data to be recognized belongs, and the subgraph corresponding to each language model is formed by linking acoustic symbol nodes; the first determining unit is configured to determine a target decoding path from the plurality of decoding paths, acquire target text data obtained by decoding the acoustic representation based on the target decoding path, and determine the target text data as a recognition result of the voice data to be recognized.
According to a third aspect of the embodiments of the present disclosure, there is provided an electronic device of a speech recognition method, including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to execute the instructions to implement the speech recognition method of any of the above.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium having instructions stored thereon which, when executed by a processor of an electronic device of an information processing method, enable the electronic device of the information processing method to perform any one of the above-described speech recognition methods.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program product adapted to perform a program for initializing a speech recognition method according to any one of the above when executed on a data processing device.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
the method comprises the steps of obtaining acoustic features of voice data to be recognized, and processing the acoustic features into an acoustic representation through an acoustic model, wherein the acoustic representation represents the probability that the acoustic features belong to a target acoustic symbol sequence; searching, in a search graph formed by linking subgraphs corresponding to a plurality of language models, for a plurality of decoding paths corresponding to the acoustic representation, wherein the plurality of language models and their link relations are determined by the service scene to which the voice data to be recognized belongs, and the subgraph corresponding to each language model is formed by linking acoustic symbol nodes; and determining a target decoding path from the plurality of decoding paths, obtaining target text data by decoding the acoustic representation based on the target decoding path, and determining the target text data as the recognition result of the voice data to be recognized. In this way, the target decoding path can be determined from a search graph formed by linking the subgraphs corresponding to a plurality of language models, and the voice data to be recognized can be decoded based on the target decoding path, which achieves the technical effect of improving the accuracy of voice recognition in complex application scenarios and solves the problem in the related art that a single language model can hardly support the voice recognition requirements of complex application scenarios.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
Fig. 1 is a schematic diagram illustrating an application scenario of a speech recognition method according to an exemplary embodiment.
FIG. 2 is a flow diagram illustrating a method of speech recognition according to an example embodiment.
FIG. 3 is a schematic diagram illustrating a method of speech recognition according to an example embodiment.
FIG. 4 is a diagram illustrating the construction of a search graph in a speech recognition method according to an exemplary embodiment.
Fig. 5 is a first schematic diagram illustrating construction of a subgraph corresponding to a language model in a speech recognition method according to an exemplary embodiment.
Fig. 6 is a second schematic diagram illustrating construction of a subgraph corresponding to a language model in a speech recognition method according to an exemplary embodiment.
FIG. 7 is a block diagram illustrating a speech recognition apparatus according to an example embodiment.
Fig. 8 is a block diagram illustrating a terminal according to an example embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
The speech recognition method according to the first embodiment may be applied to an application scenario shown in fig. 1, where fig. 1 is an application scenario diagram of information processing in an embodiment, and the application scenario may include a first client 110 and a server 120, where the server 120 may be in communication connection with the first client 110 through a network.
When speech recognition is performed on the first client 110, the first client 110 receives the speech data to be recognized and triggers a first target request to the server 120. In response to the first target request, the server 120 acquires the acoustic features of the speech data to be recognized and processes the acoustic features into an acoustic representation through an acoustic model; searches, in a search graph formed by linking the subgraphs corresponding to a plurality of language models, for a plurality of decoding paths corresponding to the acoustic representation, and determines a target decoding path from the plurality of decoding paths; and acquires the target text data obtained by decoding the acoustic representation based on the target decoding path and returns the target text data. The first client 110 displays the target text data, that is, the recognition result of the speech data to be recognized. This achieves the technical effect of improving the accuracy of speech recognition in complex application scenarios, and solves the problem in the related art that a single language model can hardly support the speech recognition requirements of complex application scenarios.
Fig. 2 is a flow chart illustrating a speech recognition method according to an exemplary embodiment, where the speech recognition method is used in a server, as shown in fig. 2, and includes the following steps.
In step S201, acoustic features of the speech data to be recognized are acquired, and the acoustic features are processed into an acoustic representation by an acoustic model, wherein the acoustic representation characterizes the probability that the acoustic features belong to a target acoustic symbol sequence.
Specifically, the speech data to be recognized is speech data that is to be recognized as text data, and may be speech data produced by a user while performing a task in a business scenario.
For example, in the business scenario of selling goods through live streaming, the anchor broadcasts speech data to promote the goods; to help viewers of the live stream clearly understand what the anchor says, the speech data needs to be recognized as text data and displayed in the live-streaming interface.
It should be noted that, as shown in fig. 3, the process of speech recognition is a process of obtaining acoustic features of speech data to be recognized, and decoding the acoustic features to obtain corresponding text data, where the decoding process needs to be implemented based on an acoustic model, a language model, and a composition of the language model.
Since the voice data to be recognized is a sound wave, in order to determine the semantics of the voice data to be recognized by the language model, the acoustic features of the sound wave need to be extracted first. Specifically, the sound wave may be framed to obtain multiple frames of sound waves, and the waveform of each frame is converted into a multidimensional vector, that is, the speech data to be recognized is converted into a matrix formed by multiple frames of multidimensional vectors, so as to extract acoustic features.
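As a rough illustration only (not the specific feature extraction used by the disclosure), the framing and per-frame vectorization described above can be sketched in Python as follows; the frame length, frame shift and feature dimension are arbitrary example values, and the waveform is assumed to be a NumPy array:

    import numpy as np

    def extract_acoustic_features(waveform, sample_rate=16000,
                                  frame_ms=25, shift_ms=10, dim=40):
        """Split a 1-D waveform into overlapping frames and turn each frame into a
        fixed-size feature vector (a crude stand-in for MFCC/filterbank features)."""
        frame_len = int(sample_rate * frame_ms / 1000)
        shift = int(sample_rate * shift_ms / 1000)
        frames = [waveform[i:i + frame_len]
                  for i in range(0, len(waveform) - frame_len + 1, shift)]
        features = []
        for frame in frames:
            spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
            # crude log-energy binning as a placeholder for real filterbanks
            bins = np.array_split(spectrum, dim)
            features.append(np.log([b.sum() + 1e-10 for b in bins]))
        return np.stack(features)   # matrix: one multidimensional vector per frame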
Further, after the acoustic features of the voice data to be recognized have been extracted, the acoustic features are converted into an acoustic representation through an acoustic model. Specifically, each frame of acoustic features can be recognized as a corresponding state; the states are then combined into phonemes, yielding a plurality of phonemes that form a phoneme sequence. Each phoneme corresponds to the probability of being recognized as that phoneme, and the phoneme sequence likewise corresponds to the probability of being recognized as that sequence; the acoustic model outputs the probability that the voice data to be recognized belongs to a target phoneme sequence. In addition, the phonemes can be combined into the pronunciations of words to obtain a plurality of words, and the plurality of words form a word sequence; each word corresponds to the probability of being recognized as that word, the word sequence likewise corresponds to the probability of being recognized as that sequence, and the acoustic model outputs the probability that the speech data to be recognized belongs to a target word sequence.
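A minimal sketch, under the assumption of a frame-synchronous acoustic model that outputs per-frame posteriors, of how the probability that the acoustic features belong to a target acoustic symbol (phoneme) sequence could be accumulated; this is one possible formulation, not the patent's acoustic model:

    import numpy as np

    def sequence_log_probability(frame_posteriors, alignment):
        """frame_posteriors: [num_frames, num_symbols] per-frame probabilities from
        an acoustic model; alignment: the symbol index assigned to each frame.
        Returns the log probability of that symbol sequence (sum of per-frame logs)."""
        log_prob = 0.0
        for t, symbol in enumerate(alignment):
            log_prob += np.log(frame_posteriors[t, symbol] + 1e-12)
        return log_prob

    # toy usage: 3 frames, 4 acoustic symbols, aligned to symbols 1, 1, 2
    posteriors = np.array([[0.1, 0.7, 0.1, 0.1],
                           [0.2, 0.6, 0.1, 0.1],
                           [0.1, 0.1, 0.8, 0.0]])
    print(sequence_log_probability(posteriors, [1, 1, 2]))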
In step S202, a plurality of decoding paths corresponding to the acoustic representation are obtained by searching in a search graph formed by linking subgraphs corresponding to a plurality of language models, where the plurality of language models and the link relationship are determined by a service scene to which the voice data to be recognized belongs, and the subgraph corresponding to each language model is formed by linking acoustic symbol nodes.
Specifically, the search graph is formed by linking the subgraphs of a plurality of language models related to the business scenario. When audio data is input, each frame of data triggers a jump on the search graph, and the nodes along the jump paths form a plurality of decoding paths.
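The frame-by-frame jumping over the linked search graph can be pictured with the simplified token-passing sketch below; the Token structure, graph representation and pruning strategy are assumptions made for illustration and are not the actual decoder:

    from dataclasses import dataclass, field

    @dataclass
    class Token:
        node: str                      # current node in the search graph
        score: float = 0.0             # accumulated acoustic (+ language) score
        history: list = field(default_factory=list)   # nodes visited so far

    def expand_tokens(tokens, graph, frame_scores, beam=10):
        """One decoding frame: every live token jumps along the outgoing edges of
        its node; the jumped-to nodes of all tokens form the decoding paths."""
        new_tokens = []
        for tok in tokens:
            for nxt in graph.get(tok.node, []):
                new_tokens.append(Token(node=nxt,
                                        score=tok.score + frame_scores.get(nxt, -1e9),
                                        history=tok.history + [nxt]))
        # keep only the best few partial paths (beam pruning)
        return sorted(new_tokens, key=lambda t: t.score, reverse=True)[:beam]

    # toy usage on a tiny graph
    graph = {"<s>": ["jin"], "jin": ["tian", "se"]}
    tokens = expand_tokens([Token(node="<s>")], graph, {"jin": 0.1})
    tokens = expand_tokens(tokens, graph, {"tian": 0.2, "se": 0.4})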
In step S203, a target decoding path is determined from the plurality of decoding paths, target text data obtained by decoding the acoustic representation based on the target decoding path is acquired, and the target text data is determined as a recognition result of the speech data to be recognized.
It should be noted that, in the decoding process, the path with the highest probability for a given utterance, that is, the target decoding path, is determined from the plurality of decoding paths in the search graph.
Specifically, an acoustic score and a language score may be calculated for each decoding path, and the target decoding path may be selected according to the total of the acoustic score and the language score after decoding; for example, the decoding path with the highest total score may be determined as the target decoding path.
It should be noted that, in the speech recognition methods of the related art, a language model is first trained initially, the language model obtained by the initial training is then interpolated with the corpora of other language models to obtain a new language model, and the weight of each language model within the new language model is adjusted, so the recognition effect of each language model is weakened during decoding. The speech recognition method in the embodiments of the present disclosure can decode on a plurality of language models simultaneously, obtain the corresponding language score in the subgraph corresponding to each language model, and determine the target decoding path according to the total of the acoustic score and the language scores after decoding, thereby improving the accuracy of speech recognition.
In order to improve the accuracy of speech recognition in a specific business scenario, the search graph used for decoding needs to be constructed according to that business scenario. Optionally, in the speech recognition method shown in the embodiments of the present disclosure, before the plurality of decoding paths corresponding to the acoustic representation are obtained by searching the search graph formed by linking the subgraphs corresponding to the plurality of language models, the method further includes: determining a plurality of language models to be used when speech recognition is carried out in the business scenario and the association relations of the plurality of language models, wherein an association relation represents a sequential (front-to-back) relation or a parallel relation of the plurality of language models as applied in the speech recognition process; constructing a subgraph corresponding to each language model; determining the link relations among the subgraphs corresponding to the plurality of language models according to the association relations; and linking the subgraphs corresponding to the plurality of language models according to the link relations to obtain the search graph.
It should be noted that, as the business scenarios served by speech recognition become more and more complex, a single language model can hardly meet the accuracy requirements of speech recognition, so a plurality of language models to be used need to be determined.
In the embodiments of the present disclosure, the plurality of language models used in the business scenario are determined, and the subgraphs corresponding to these language models are linked according to the association relations among them; the resulting search graph improves the accuracy of recognizing the speech data to be recognized in that business scenario.
Optionally, in the speech recognition method shown in the embodiments of the present disclosure, the language models to be used include at least one of: a first language model obtained by training a first corpus associated with the scene type of the business scenario; a second language model obtained by training a second corpus associated with the field type to which the business scenario belongs; a third language model obtained by training the dialogue scripts used in the business scenario; a fourth language model obtained by training personalized information materials of the objects associated with the business scenario; and a base language model.
It should be noted that, when speech recognition is performed in a business scenario, a language model for recognizing information related to the business scenario and a language model for recognizing information related to the field to which the business scenario belongs are both required. In some scenarios, for a voice robot with an online question-answering function, a corresponding language model can be trained on the dialogue scripts used for question answering. In addition, different users have different personalized information, so to improve the accuracy of speech recognition, the corresponding language model can be trained on a user's personalized information corpus.
For example, suppose the business scenario is selling books through live streaming.
A first language model is obtained by training the corpora associated with the e-commerce live-streaming scenario, and a second language model is obtained by training the corpora associated with book selling. Since users commonly ask about the selling price of a book, its publisher and other frequent questions during the live stream, a third language model can be obtained by training on these frequent questions and their answer corpora. In addition, a fourth language model can be obtained by training information such as the contact lists and shipping addresses corresponding to the IDs of the users watching the live stream. The first language model, the second language model, the third language model and the fourth language model are determined as the language models to be used in the live book-selling scenario.
Furthermore, after the plurality of language models to be used are determined, the association relation of each language model also needs to be determined, so that the search graph required for decoding can be constructed from the subgraphs corresponding to the plurality of language models.
In an optional implementation, still taking live book selling as an example, if the language models to be used include a first language model obtained by training the corpora associated with the live-streaming selling scenario and a plurality of second language models obtained by training second corpora associated with books of different categories, then the association relation between the first language model and each second language model may be a sequential (front-to-back) relation, and the association relation among the plurality of second language models may be a parallel relation.
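Purely as an illustration (the model names below are invented for this sketch, not identifiers used by the method), the language models and association relations of the book-selling example could be written down as plain data before the subgraphs are linked:

    # first language model and several second language models for book categories
    language_models = ["lm_live_selling",        # live-streaming selling scenario corpus
                       "lm_books_fiction",       # second corpora: book categories
                       "lm_books_textbook",
                       "lm_books_children"]

    # association relations drive how the subgraphs will later be linked:
    # "sequential" = applied one after the other on a path, "parallel" = alternatives
    associations = [("lm_live_selling", "lm_books_fiction", "sequential"),
                    ("lm_live_selling", "lm_books_textbook", "sequential"),
                    ("lm_live_selling", "lm_books_children", "sequential"),
                    ("lm_books_fiction", "lm_books_textbook", "parallel"),
                    ("lm_books_textbook", "lm_books_children", "parallel")]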
Optionally, in the speech recognition method shown in the embodiments of the present disclosure, linking the subgraphs corresponding to the plurality of language models according to the link relations to obtain the search graph includes: constructing a starting node of the search graph; constructing at least one group of head and tail nodes of the recognition paths according to the speech recognition paths in the business scenario, wherein each group of head and tail nodes includes a sentence head node and a sentence tail node; and establishing a link between the starting node and each sentence head node, and linking a plurality of subgraphs between the at least one group of head and tail nodes according to the link relations between the subgraphs corresponding to the plurality of language models to obtain the search graph, wherein at least one subgraph is linked between the sentence head node and the sentence tail node of each group of head and tail nodes.
For example, as shown in fig. 4, the search graph is composed of subgraphs corresponding to 5 language models.
When the search graph is constructed, 4 speech recognition paths are determined according to the link relations between the language models and their corresponding subgraphs, of which 3 speech recognition paths all contain language model 1 and can share one group of head and tail nodes, while the remaining speech recognition path corresponds to its own group of head and tail nodes.
Specifically, a starting node is established and linked with two sentence head nodes (i.e., the <s> nodes in fig. 4). The first sentence head node is then linked with the subgraph of language model 1; the subgraph of language model 1 is linked with the subgraph of language model 2, the subgraph of language model 3 and the subgraph of language model 4 respectively; and the subgraphs of language model 2, language model 3 and language model 4 are each linked with the first sentence tail node (i.e., a </s> node in fig. 4), so that 4 speech recognition paths are obtained.
Meanwhile, the second sentence head node (i.e., the other <s> node in fig. 4) is linked with the subgraph of language model 5, and the subgraph of language model 5 is linked with the second sentence tail node (i.e., the other </s> node in fig. 4), forming the 5th speech recognition path; the search graph is thus formed from these 5 speech recognition paths.
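A rough sketch of assembling such a search graph from already built subgraphs; the subgraph names, the adjacency-dictionary representation and the per-path head/tail nodes (which are not merged here as they are in fig. 4) are assumptions for illustration:

    def build_search_graph(paths):
        """paths: list of lists of subgraph names, one list per speech recognition
        path, e.g. [["lm1", "lm2"], ["lm1", "lm3"], ["lm5"]]. Returns an adjacency
        dict linking a start node, <s>/</s> nodes and the subgraphs along each path."""
        graph = {"start": set()}
        for i, path in enumerate(paths):
            head, tail = f"<s>_{i}", f"</s>_{i}"
            graph["start"].add(head)
            prev = head
            for sub in path:
                graph.setdefault(prev, set()).add(sub)
                prev = sub
            graph.setdefault(prev, set()).add(tail)
        return graph

    # an illustrative set of paths (head/tail nodes are not shared between paths here)
    print(build_search_graph([["lm1", "lm2"], ["lm1", "lm3"], ["lm5"]]))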
The construction of the search graph is described below according to a specific embodiment, and the composition procedure is as follows:
1 ## step1: define slot
2 zhibo_$_st <s> zhibo_lm_5.fst zhibo.dict 00
3 zhibo_$_ed </s> zhibo_lm_5.fst zhibo.dict 00
4 $zhibo zhibo.dict zhibo_lm_5.fst zhibo.dict 10
5
6 zhibo_chezai_$_st <s> zhibo_chezai_lm_5.fst zhibo_chezai.dict 00
7 zhibo_chezai_$_ed </s> zhibo_chezai_lm_5.fst zhibo_chezai.dict 00
8 $zhibo_chezai zhibo_chezai.dict zhibo_chezai_lm_5.fst zhibo_chezai.dict 10
9
10 zhibo_dianshang_$_st <s> zhibo_dianshang_lm_5.fst zhibo_dianshang.dict 00
11 zhibo_dianshang_$_ed </s> zhibo_dianshang_lm_5.fst zhibo_dianshang.dict 00
12 $zhibo_dianshang zhibo_dianshang.dict zhibo_dianshang_lm_5.fst zhibo_dianshang.dict 10
13
14 class_$_st <s> class_lm_5.fst class.dict 00
15 class_$_ed </s> class_lm_5.fst class.dict 00
16 $class class.dict class_lm_5.fst class.dict 10
17
18 txt_$_st <s> txt_lm_5.fst txt.dict 00
19 txt_$_ed </s> txt_lm_5.fst txt.dict 00
20 $txt txt.dict txt_lm_5.fst txt.dict 10
21
22 song_$_st <s> song_lm_5.fst song.dict 00
23 song_$_ed </s> song_lm_5.fst song.dict 00
24 $song song.dict song_lm_5.fst song.dict 10
25
26 ## step2: define sentence start slot
27 zhibo_$_st
28 zhibo_chezai_$_st
29 zhibo_dianshang_$_st
30 class_$_st
31
32 ## step3: define slot link
33 zhibo_$_st $zhibo zhibo_$_ed
34 zhibo_chezai_$_st $zhibo_chezai zhibo_chezai_$_ed
35 zhibo_dianshang_$_st $zhibo_dianshang zhibo_dianshang_$_ed
36 class_$_st $class class_$_ed
37 class_$_st $class $txt class_$_ed
38 class_$_st $class $song class_$_ed
It should be noted that, during composition, a syntax tree is first defined according to an analysis of the specific application scenario to define the slot-to-slot relations (each node or language model occupies one slot); then the small graph corresponding to each language model is constructed separately according to the word list and pronunciation dictionary of the slot corresponding to that language model; and finally the slot connections are established according to the defined slot relations.
Specifically, the syntax tree is defined in three steps according to the above format.
The first step: lines 2-24 define the slots in the syntax tree. The first column of each line is the slot name: slots ending in _$_st are sentence head <s> slots, slots ending in _$_ed are sentence tail </s> slots, and the remaining slots, whose names begin with $, are the slots of the respective language models. Lines 2-4 define the slots related to the base language model: line 2 is the sentence head <s> slot, line 3 is the sentence tail </s> slot, and line 4 is the slot corresponding to the base language model and its dictionary. Similarly, lines 6-8 are the in-vehicle live-streaming slots and lines 10-12 are the e-commerce live-streaming slots; lines 14-16 are the class-model slots used to customize dialogue scripts, for example the anchor saying "give XX a like" or "play the song XX"; lines 18-20 are the address-book slot and lines 22-24 are the song slot.
The second step: a starting slot in the syntax tree is defined at lines 27-30, representing a slot into which a start node can enter.
The third step: lines 33-38 define which slots in the syntax tree can be connected. Line 33 adds support for the live-streaming language model, line 34 adds support for the in-vehicle live-streaming vertical-domain model, and line 35 adds support for the e-commerce live-streaming vertical-domain language model; line 36 adds support for the class model, which is mainly used to customize various dialogue scripts; line 37 adds the address-book slot model after the class model, for example to better recognize "call XX", and so on.
Further, after the syntax tree is defined, the vertical-domain language models, class models and slot models are trained for their corresponding slots. Specifically, the composition program reads the prepared language models, dictionaries and pronunciation dictionary, builds the subgraph corresponding to each model, adds the slot links defined above, and builds the search graph from the subgraphs.
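A minimal sketch of a driver that reads a slot-grammar file in the three-step format above and collects the slot definitions, sentence-start slots and slot links; the parsing details and return types are assumptions for illustration, not the actual composition program:

    def read_grammar(path):
        """Parse the three sections of the grammar file: slot definitions,
        sentence-start slots, and slot links."""
        slots, start_slots, links = {}, [], []
        section = None
        for raw in open(path, encoding="utf-8"):
            line = raw.strip()
            if not line:
                continue
            if line.startswith("##"):
                section = line           # remember which step we are in
                continue
            fields = line.split()[1:]    # drop the leading listing line number
            if "step1" in (section or ""):
                slots.setdefault(fields[0], []).append(fields[1:])  # name -> fst/dict info
            elif "step2" in (section or ""):
                start_slots.append(fields[0])
            else:
                links.append(fields)     # e.g. [head_slot, $lm_slot, ..., tail_slot]
        return slots, start_slots, links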
In addition, in order to support personalized language models, recognition of personalized items such as person names and place names is needed in specific fields; for example, for the slots in lines 14-16, a personalized slot model can be trained in real time from the specific dialogue scripts and then inserted into the composition program.
Furthermore, during the loading of the speech recognition data, the original address-book and song models in the composition file can be replaced by language models trained in real time, which achieves personalized customization and further improves the recognition rate.
It should be noted that the composition in the related art is severely coupled: composition must be redone every time a language model is updated, only a single language model can be supported, the effect in each vertical domain is determined by that single language model, training a language model covering many vertical domains is difficult, and model errors are hard to repair. In the composition method of the embodiments of the present disclosure, the composition program is completely decoupled from the language models, and each vertical domain customizes its own language model and decoding. Training the language models is therefore simpler than before: only the base language model and the vertical-domain language models need to be trained separately, decoding can proceed on them simultaneously, and the optimal path is finally selected, which makes updating and iterating the whole search graph much simpler.
It should be noted that, when constructing the search graph, besides determining the subgraphs of the respective language models and the link relationships between the head and tail nodes, it is also necessary to construct the subgraphs of the respective language models in advance.
Optionally, in the speech recognition method illustrated in the embodiment of the present disclosure, constructing the subgraph corresponding to each language model includes: acquiring a word list of a language model; determining an acoustic symbol corresponding to each word in a word list according to a pronunciation dictionary of the acoustic model to obtain a plurality of acoustic symbols; establishing acoustic symbol nodes corresponding to a plurality of acoustic symbols and word nodes corresponding to words formed by acoustic symbol sequences, wherein the same acoustic symbol corresponds to the same acoustic symbol node; and linking a plurality of acoustic symbol nodes according to the jumping relations among the plurality of acoustic symbols, and linking a plurality of word nodes according to the jumping relations among the plurality of word nodes to obtain a subgraph corresponding to the language model.
It should be noted that each language model has a corresponding word list, and the word list includes at least one word. According to the pronunciation dictionary of the acoustic model, the acoustic symbol corresponding to each character in a word can be determined, and the corresponding acoustic symbol node is established when the subgraph is constructed; when a word consists of several characters, the acoustic symbol sequence corresponding to the word can be determined from the pronunciation dictionary, and a word node corresponding to that acoustic symbol sequence is established when the subgraph is constructed.
Before composition, the embodiments of the present disclosure do not constrain the jump probabilities among the acoustic symbols: each acoustic symbol can jump to any other acoustic symbol to form an acoustic symbol sequence, and identical acoustic symbols are merged into the same node, so the link relations among the acoustic symbol nodes are established according to these jump relations. Likewise, the jump probabilities among the acoustic symbol sequences are not constrained, each acoustic symbol sequence can jump to any other, and the link relations among the word nodes are established according to these jump relations, yielding the subgraph corresponding to the language model.
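A simplified sketch of building one language model's subgraph from its word list and a pronunciation dictionary, merging identical acoustic symbols into one node and allowing arbitrary jumps between words; the node and edge representations (and the pinyin used in the usage example, which anticipates the fig. 5 and fig. 6 discussion below) are assumptions for illustration:

    def build_subgraph(word_list, pron_dict):
        """word_list: words of one language model; pron_dict: word -> list of acoustic
        symbols (e.g. {"today": ["jin", "tian"]}). Returns (edges, word_nodes): edges
        carry [symbol & next node id]-style information, word output nodes carry
        [word & next node id]."""
        edges, word_nodes = [], {}
        symbol_node = {}                          # identical acoustic symbols share one node
        next_id = 0
        for word in word_list:
            prev = None
            for sym in pron_dict[word]:
                if sym not in symbol_node:        # merge repeated acoustic symbols
                    symbol_node[sym] = next_id
                    next_id += 1
                node = symbol_node[sym]
                if prev is not None:
                    edges.append((prev, sym, node))
                prev = node
            word_nodes[word] = next_id            # word output node after the last symbol
            edges.append((prev, word, next_id))
            next_id += 1
        # arbitrary jumps between words: every word node links to every word's first symbol
        for node_from in word_nodes.values():
            for w_to in word_list:
                edges.append((node_from, w_to, symbol_node[pron_dict[w_to][0]]))
        return edges, word_nodes

    edges, words = build_subgraph(["today", "gold", "Beijing"],
                                  {"today": ["jin", "tian"],
                                   "gold": ["jin", "se"],
                                   "Beijing": ["bei", "jing"]})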
For example, as shown in fig. 5, the word list corresponding to language model 1 includes the word "today". By querying the pronunciation dictionary, "today" is found to correspond to the acoustic symbol sequence "jin tian". When the subgraph is constructed, an acoustic symbol node corresponding to the pronunciation jin is created and an edge is drawn from it carrying the information [pronunciation jin & next node ID]; the edge for tian is then added, and when the pronunciation is complete a word output node is reached, which carries the information [today & next node ID].
As shown in fig. 5, the word list corresponding to language model 1 also includes "gold", which the pronunciation dictionary maps to the acoustic symbol sequence "jin se". During composition, because the first pronunciation jin of "gold" is the same as that of "today", the existing acoustic symbol node for the pronunciation jin can be reused: an edge carrying [pronunciation jin & next node ID] is drawn from it, the edge for se is then added, and when the pronunciation is complete a word output node is reached carrying [gold & next node ID]. As shown in fig. 6, the word list corresponding to language model 1 further includes "Beijing", which the pronunciation dictionary maps to the acoustic symbol sequence "bei jing".
Further, after the acoustic symbol nodes and word nodes are obtained, the link relations between word nodes are constructed according to the jump relations between words, and words can jump to one another arbitrarily; as shown in fig. 6, jumps are possible from "Beijing" to "today", from "Beijing" to "gold", from "today" to "Beijing", and from "gold" to "Beijing".
It should be noted that, in the related art, the language model probability is added to the edges during composition, so composition is slow, the generated graph file is large, and loading it for decoding takes a long time. In the composition method of the embodiments of the present disclosure, the composition program is completely decoupled from the language model: the language model probability is not stored in the composition file, the language model is queried during decoding, and the time complexity of hot-word queries can be reduced by techniques such as hashing.
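The decode-time language model query that replaces edge-stored probabilities can be as simple as a hash-table lookup over n-grams; a toy sketch assuming a bigram table with made-up scores:

    # hypothetical bigram log-probability table kept outside the search graph
    bigram_logprob = {("<s>", "today"): -1.0,
                      ("today", "Beijing"): -2.0,
                      ("<s>", "gold"): -4.0}

    def language_score(prev_word, word, table=bigram_logprob, backoff=-6.0):
        """Queried only when a word output node is reached during decoding;
        hashing makes the hot-word lookup O(1) on average."""
        return table.get((prev_word, word), backoff)

    print(language_score("<s>", "today"))      # -1.0
    print(language_score("today", "Beijing"))  # -2.0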
In the decoding process, that is, the process of determining the most suitable decoding path in the search graph, optionally, in the speech recognition method shown in the embodiment of the present disclosure, determining the target decoding path from the plurality of decoding paths includes: in each decoding path, respectively calculating a first weight between a sentence head node and a subgraph linked with the sentence head node, calculating a second weight between two adjacent subgraphs, calculating a third weight between a sentence tail node and a subgraph linked with the sentence tail node, and determining a voice recognition weight of the decoding path based on the probability that the acoustic feature belongs to the target acoustic symbol sequence, the first weight, the second weight and the third weight; and determining the decoding path with the highest voice recognition weight in the plurality of decoding paths, and determining the decoding path with the highest voice recognition weight as a target decoding path.
It should be noted that, because words can jump to one another arbitrarily and full connection between words is allowed, the edges of the search graph obtained in this way only carry the information from the current node to the next node; the language model probability is not added to the edges and is instead calculated during decoding.
Specifically, when decoding, the probability that the acoustic features belong to the target acoustic symbol sequence is already known; the first weight may be determined based on the number of head words in the subgraph linked to the sentence head node, that is, the number of distinct first acoustic symbols of the different words, and the third weight may be determined based on the number of word nodes in the subgraph linked to the sentence tail node.
When the second weight is calculated, within-word expansion adds the acoustic probability to the edges before a word node is reached; when a word node is reached, between-word expansion queries the language model corresponding to the subgraph and adds the language probability to the edge.
For example, when "jin" is found in the search graph with an acoustic score of 0.1, the score is added to the edge; no word node has been reached yet, so there is no language score. When "tian" is then found with an acoustic score of 0.2, a word node is reached and the word is output with a language score of -1.0, so the acoustic-plus-language score of "today" is 0.1 + 0.2 - 1.0 = -0.7. As another example, if "jin" is found with an acoustic score of 0.1 and "se" is found with an acoustic score of 0.4, a word is produced; if its language score is also -1.0, the acoustic-plus-language score of "gold" is 0.1 + 0.4 - 1.0 = -0.5.
Further, suppose the language score of outputting the word "today" is -1.0, that of "Beijing" is -2.0, and that of "gold" is -4.0; then "Beijing today" scores -3.0 and "gold today" scores -5.0. Because the score of "Beijing today" is higher, "Beijing today" is selected in language model 1, and its score is determined as the second weight corresponding to language model 1.
Further, when one decoding path passes through the subgraphs of several language models, the second weights corresponding to the subgraphs are added. For example, when decoding "call Zhang San on the phone", the spoken-command part belongs to the class (dialogue-script) language model while the contact name "Zhang San" belongs to the address-book language model, so the second weights of the two subgraphs are summed.
After the first weight, the second weight, and the third weight are obtained in this way, the speech recognition weight of the decoding path is determined based on the probability that the acoustic feature belongs to the target acoustic symbol sequence, the first weight, the second weight, and the third weight, and the decoding path with the highest speech recognition weight among the plurality of decoding paths is determined as the target decoding path.
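Putting the pieces together, the speech recognition weight of one decoding path can be sketched as the sum of the acoustic log-probability and the first, second and third weights; the function below and its numbers (which reuse the "today"/"Beijing" toy scores above) are illustrative assumptions only:

    def path_weight(acoustic_logprob, first_weight, second_weights, third_weight):
        """second_weights: one value per language-model subgraph on the path;
        they are summed when the path crosses several subgraphs."""
        return acoustic_logprob + first_weight + sum(second_weights) + third_weight

    candidates = {
        "Beijing today": path_weight(0.7, -0.5, [-3.0], -0.5),
        "gold today":    path_weight(0.7, -0.5, [-5.0], -0.5),
    }
    target = max(candidates, key=candidates.get)   # path with the highest weight
    print(target, candidates[target])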
In order to improve the recognition rate of the personalized voice data for the user, optionally, in the voice recognition method shown in the embodiment of the present disclosure, the multiple language models include a fourth language model, and after the subgraphs corresponding to the multiple language models are linked according to the link relationship to obtain the search graph, the method further includes: and under the condition that the object associated with the service scene changes, training a language model by using the personalized information material of the changed object, and updating the fourth language model according to the trained language model.
In an optional implementation, the personalized information of the object associated with the business scenario is the contact-list information of a preset object "Zhang San", and the search graph includes an address-book language model trained on Zhang San's contact-list information. For the user corresponding to each account in the business scenario, an address-book language model can be trained in real time on that user's contact-list information and used to replace the original address-book language model in the search graph file, further improving the accuracy of recognizing the user's personalized information.
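A hedged sketch of this fourth-language-model update flow: when the associated object changes, a small personalized model is retrained from that object's information material and swapped into the loaded search graph; the trainer and the replacement hook below are assumed interfaces, not the actual implementation:

    def train_personalized_lm(corpus_lines):
        """Stand-in trainer: count unigrams from the user's own material
        (contact names, shipping addresses, ...) as a tiny language model."""
        counts = {}
        for line in corpus_lines:
            for word in line.split():
                counts[word] = counts.get(word, 0) + 1
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()}

    def update_fourth_lm(search_graph, new_object_corpus):
        """Replace the personalized (fourth) language model attached to the
        search graph without re-running the whole composition."""
        search_graph["lm_user_personalized"] = train_personalized_lm(new_object_corpus)
        return search_graph

    graph = {"lm_user_personalized": None}
    graph = update_fourth_lm(graph, ["Zhang San 138xxxx", "Li Si 139xxxx"])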
FIG. 7 is a block diagram illustrating a speech recognition device according to an example embodiment. Referring to fig. 7, the apparatus includes an acquisition unit 71, a search unit 72, and a first determination unit 73.
An obtaining unit 71 configured to obtain an acoustic feature of the voice data to be recognized, and process the acoustic feature into an acoustic representation through an acoustic model, wherein the acoustic representation represents a probability that the acoustic feature belongs to the target acoustic symbol sequence.
The searching unit 72 is configured to search a plurality of decoding paths corresponding to the acoustic representation in a search graph formed by linking subgraphs corresponding to a plurality of language models, wherein the plurality of language models and the link relation are determined by service scenes to which the voice data to be recognized belongs, and the subgraph corresponding to each language model is formed by linking acoustic symbol nodes.
A first determining unit 73 configured to determine a target decoding path from the plurality of decoding paths, acquire target text data obtained by decoding the acoustic representation based on the target decoding path, and determine the target text data as a recognition result of the speech data to be recognized.
Optionally, in the speech recognition apparatus shown in an embodiment of the present disclosure, the apparatus further includes: a second determining unit configured to determine, before the plurality of decoding paths corresponding to the acoustic representation are searched for in the search graph formed by linking the subgraphs corresponding to the plurality of language models, a plurality of language models to be used in speech recognition in a business scenario and the association relations of the plurality of language models, wherein an association relation represents a sequential (front-to-back) relation or a parallel relation of the plurality of language models applied in the speech recognition process; a construction unit configured to construct a subgraph corresponding to each language model; a third determining unit configured to determine the link relations between the subgraphs corresponding to the plurality of language models according to the association relations; and a linking unit configured to link the subgraphs corresponding to the plurality of language models according to the link relations to obtain the search graph.
Alternatively, in the speech recognition apparatus shown in the embodiment of the present disclosure, the language models to be used include at least one of: a first language model obtained by training a first corpus associated with the scene type of the business scenario; a second language model obtained by training a second corpus associated with the field type to which the business scenario belongs; a third language model obtained by training the dialogue scripts used in the business scenario; a fourth language model obtained by training personalized information materials of the objects associated with the business scenario; and a base language model.
Optionally, in a speech recognition apparatus shown in an embodiment of the present disclosure, the construction unit includes: an acquisition module configured to acquire a vocabulary of a language model; the first determining module is configured to determine an acoustic symbol corresponding to each word in a word list according to a pronunciation dictionary of the acoustic model, so as to obtain a plurality of acoustic symbols; the building module is configured to build acoustic symbol nodes corresponding to a plurality of acoustic symbols and word nodes corresponding to words formed by acoustic symbol sequences, wherein the same acoustic symbol corresponds to the same acoustic symbol node; and the linking module is configured to link the plurality of acoustic symbol nodes according to the jumping relations among the plurality of acoustic symbols, and link the plurality of word nodes according to the jumping relations among the plurality of word nodes to obtain a subgraph corresponding to the language model.
Optionally, in a speech recognition apparatus shown in an embodiment of the present disclosure, the linking module includes: a first construction submodule configured to construct a start node of the search graph; the second construction submodule is configured to construct at least one group of head and tail nodes of the recognition path according to the voice recognition path under the service scene, wherein each group of head and tail nodes comprises a sentence head node and a sentence tail node; and the link sub-module is configured to establish a link between the start node and each sentence start node, and link a plurality of subgraphs between at least one group of start and end nodes according to the link relation between the subgraphs corresponding to the plurality of language models to obtain a search graph, wherein at least one subgraph is linked between the sentence start node and the sentence end node of each group of start and end nodes.
Alternatively, in the speech recognition apparatus shown in the embodiment of the present disclosure, the first determination unit 73 includes: a computation module configured to calculate, in each decoding path, a first weight between the sentence head node and the subgraph linked with the sentence head node, a second weight between two adjacent subgraphs, and a third weight between the sentence tail node and the subgraph linked with the sentence tail node, and to determine the speech recognition weight of the decoding path based on the probability that the acoustic features belong to the target acoustic symbol sequence, the first weight, the second weight and the third weight; and a second determining module configured to determine the decoding path with the highest speech recognition weight among the plurality of decoding paths and determine it as the target decoding path.
Optionally, in a speech recognition apparatus shown in an embodiment of the present disclosure, a fourth language model is included in the plurality of language models, and the apparatus further includes: and the updating unit is configured to link the subgraphs corresponding to the plurality of language models according to the link relation to obtain a search graph, train the language models by the personalized information materials of the changed objects under the condition that the objects related to the service scenes are changed, and update the fourth language model according to the trained language models.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
In an exemplary embodiment, there is also provided an electronic device of a speech recognition method, including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to execute the instructions to implement the speech recognition method of any of the above.
In an exemplary embodiment, there is also provided a computer-readable storage medium having instructions which, when executed by a processor of an electronic device, enable the electronic device to perform the speech recognition method of any one of the above. Alternatively, the computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
In an exemplary embodiment, a computer program product is also provided, which, when executed on a data processing device, causes the data processing device to perform the speech recognition method of any of the above. The data processing device may be a terminal, which may be any one of a group of computer terminals. Optionally, in this embodiment of the present disclosure, the terminal may also be a terminal device such as a mobile terminal.
Optionally, in this embodiment of the present disclosure, the terminal may be located in at least one network device of a plurality of network devices of a computer network.
Optionally, Fig. 8 is a block diagram illustrating a structure of a terminal according to an exemplary embodiment. As shown in Fig. 8, the terminal may include: one or more processors 81 (only one is shown) and a memory 83 for storing processor-executable instructions, wherein the processor is configured to execute the instructions to implement the speech recognition method of any of the above.
The memory may be configured to store software programs and modules, such as program instructions/modules corresponding to the speech recognition method and apparatus in the embodiments of the present disclosure, and the processor executes various functional applications and data processing by running the software programs and modules stored in the memory, thereby implementing the speech recognition method. The memory may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory may further include memory remotely located relative to the processor, and such remote memory may be connected to the computer terminal over a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
It should be understood by those skilled in the art that the structure shown in Fig. 8 is only illustrative, and the computer terminal may also be a terminal device such as a smartphone (e.g., an Android phone, an iOS phone, etc.), a tablet computer, a palmtop computer, a Mobile Internet Device (MID), a PAD, or the like. Fig. 8 does not limit the structure of the electronic device; for example, the terminal 10 may also include more or fewer components (e.g., network interfaces, display devices, etc.) than shown in Fig. 8, or have a different configuration from that shown in Fig. 8.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements that have been described above and shown in the drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (14)

1. A speech recognition method, comprising:
acquiring acoustic features of voice data to be recognized, and processing the acoustic features into acoustic representations through an acoustic model, wherein the acoustic representations represent the probability that the acoustic features belong to a target acoustic symbol sequence;
searching a plurality of decoding paths corresponding to the acoustic representation in a search graph formed by linking subgraphs corresponding to a plurality of language models, wherein the plurality of language models and the link relation are determined by the service scene to which the voice data to be recognized belongs, and the subgraph corresponding to each language model is formed by linking acoustic symbol nodes;
determining a target decoding path from the plurality of decoding paths, acquiring target text data obtained by decoding the acoustic representation based on the target decoding path, and determining the target text data as a recognition result of the voice data to be recognized;
before searching a plurality of decoding paths corresponding to the acoustic representation in a search graph formed by linking subgraphs corresponding to a plurality of language models, the method further comprises the following steps:
determining the plurality of language models to be used when voice recognition is carried out in the service scene and the incidence relation of the plurality of language models, wherein the incidence relation is used for representing the front-back relation or the parallel relation of the plurality of language models applied in the voice recognition process;
constructing a subgraph corresponding to each language model;
determining the link relation between subgraphs corresponding to the plurality of language models according to the incidence relation;
and linking the subgraphs corresponding to the plurality of language models according to the link relation to obtain the search graph.
2. The speech recognition method of claim 1, wherein the language models to be used comprise at least one of:
a first language model obtained by training a first corpus associated with the scene type to which the service scene belongs;
a second language model obtained by training a second corpus associated with the field type to which the service scene belongs;
a third language model obtained by training the speech terminology material under the service scene;
a fourth language model obtained by training the personalized information material of the object associated with the service scene;
a base language model.
3. The speech recognition method of claim 1, wherein constructing the subgraph corresponding to each of the language models comprises:
acquiring a word list of the language model;
determining an acoustic symbol corresponding to each word in the word list according to the pronunciation dictionary of the acoustic model to obtain a plurality of acoustic symbols;
establishing acoustic symbol nodes corresponding to the plurality of acoustic symbols and word nodes corresponding to words formed by acoustic symbol sequences, wherein the same acoustic symbol corresponds to the same acoustic symbol node;
and linking a plurality of acoustic symbol nodes according to the jump relation among the plurality of acoustic symbols, and linking a plurality of word nodes according to the jump relation among the plurality of word nodes to obtain a subgraph corresponding to the language model.
4. The speech recognition method of claim 1, wherein linking the subgraphs corresponding to the plurality of language models according to the linking relationship to obtain the search graph comprises:
constructing a starting node of the search graph;
constructing at least one group of head and tail nodes of the recognition path according to the voice recognition path under the service scene, wherein each group of head and tail nodes comprises a sentence head node and a sentence tail node;
and establishing a link between the starting node and each sentence head node, and linking a plurality of subgraphs between at least one group of head and tail nodes according to the link relation between the subgraphs corresponding to the plurality of language models to obtain the search graph, wherein at least one subgraph is linked between the sentence head node and the sentence tail node of each group of head and tail nodes.
5. The speech recognition method of claim 4, wherein determining a target decoding path from the plurality of decoding paths comprises:
in each decoding path, respectively calculating a first weight between the sentence head node and the subgraph linked with the sentence head node, calculating a second weight between two adjacent subgraphs, calculating a third weight between the sentence tail node and the subgraph linked with the sentence tail node, and determining a speech recognition weight of the decoding path based on the probability that the acoustic feature belongs to the target acoustic symbol sequence, the first weight, the second weight and the third weight;
and determining a decoding path with the highest voice recognition weight in the plurality of decoding paths, and determining the decoding path with the highest voice recognition weight as the target decoding path.
6. The speech recognition method according to claim 2, wherein the plurality of language models includes the fourth language model, and after the subgraphs corresponding to the plurality of language models are linked according to the link relationship to obtain the search graph, the method further comprises:
and under the condition that the object associated with the service scene changes, training a language model by using the personalized information material of the changed object, and updating the fourth language model according to the trained language model.
7. A speech recognition apparatus, comprising:
the voice recognition method comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is configured to acquire acoustic features of voice data to be recognized and process the acoustic features into acoustic representations through an acoustic model, and the acoustic representations represent the probability that the acoustic features belong to a target acoustic symbol sequence;
the searching unit is configured to search a plurality of decoding paths corresponding to the acoustic representation in a search graph formed by linking subgraphs corresponding to a plurality of language models, wherein the plurality of language models and the link relation are determined by the service scene to which the voice data to be recognized belongs, and the subgraph corresponding to each language model is formed by linking acoustic symbol nodes;
a first determining unit configured to determine a target decoding path from the plurality of decoding paths, acquire target text data obtained by decoding the acoustic representation based on the target decoding path, and determine the target text data as a recognition result of the voice data to be recognized;
wherein the apparatus further comprises:
the second determining unit is configured to determine, before obtaining a plurality of decoding paths corresponding to the acoustic representation through searching in a search graph formed by linking subgraphs corresponding to a plurality of language models, the plurality of language models to be used when performing speech recognition in the service scene and an incidence relation of the plurality of language models, wherein the incidence relation is used for representing a front-back relation or a parallel relation of the plurality of language models applied in a speech recognition process;
a construction unit configured to construct a subgraph corresponding to each of the language models;
a third determining unit configured to determine the link relation between the subgraphs corresponding to the plurality of language models according to the incidence relation;
and the linking unit is configured to link the subgraphs corresponding to the plurality of language models according to the linking relation to obtain the search graph.
8. The speech recognition apparatus of claim 7, wherein the language models to be used comprise at least one of:
a first language model obtained by training a first corpus associated with the scene type to which the service scene belongs;
a second language model obtained by training a second corpus associated with the field type to which the service scene belongs;
a third language model obtained by training the speech terminology material under the service scene;
a fourth language model obtained by training the personalized information material of the object associated with the service scene;
a base language model.
9. The speech recognition apparatus of claim 7, wherein the construction unit comprises:
an obtaining module configured to obtain a vocabulary of the language model;
a first determining module configured to determine an acoustic symbol corresponding to each word in the vocabulary according to a pronunciation dictionary of the acoustic model, so as to obtain a plurality of acoustic symbols;
the building module is configured to build acoustic symbol nodes corresponding to the plurality of acoustic symbols and word nodes corresponding to words formed by acoustic symbol sequences, wherein the same acoustic symbol corresponds to the same acoustic symbol node;
the linkage module is configured to link the plurality of acoustic symbol nodes according to the jump relation among the plurality of acoustic symbols, and link the plurality of word nodes according to the jump relation among the plurality of word nodes to obtain a subgraph corresponding to the language model.
10. The speech recognition apparatus of claim 9, wherein the linking module comprises:
a first construction submodule configured to construct a start node of the search graph;
the second construction submodule is configured to construct at least one group of head and tail nodes of the recognition path according to the voice recognition path under the service scene, wherein each group of head and tail nodes comprises a sentence head node and a sentence tail node;
and the link sub-module is configured to establish a link between the start node and each sentence head node, and link a plurality of subgraphs between the at least one group of head and tail nodes according to the link relation between the subgraphs corresponding to the plurality of language models to obtain the search graph, wherein at least one subgraph is linked between the sentence head node and the sentence tail node of each group of head and tail nodes.
11. The speech recognition apparatus of claim 10, wherein the first determining unit comprises:
a calculation module configured to calculate, in each of the decoding paths, a first weight between the sentence head node and a subgraph linked to the sentence head node, calculate a second weight between two adjacent subgraphs, calculate a third weight between the sentence tail node and a subgraph linked to the sentence tail node, and determine a speech recognition weight of the decoding path based on the probability that the acoustic feature belongs to the target acoustic symbol sequence, the first weight, the second weight, and the third weight;
a second determining module configured to determine a decoding path with the highest speech recognition weight among the plurality of decoding paths, and determine the decoding path with the highest speech recognition weight as the target decoding path.
12. The speech recognition apparatus of claim 8, wherein the apparatus further comprises:
and the updating unit is configured to, where the plurality of language models include the fourth language model and after the subgraphs corresponding to the plurality of language models are linked according to the link relation to obtain the search graph, train a language model with the personalized information material of the changed object in a case where the object associated with the service scene changes, and update the fourth language model according to the trained language model.
13. An electronic device of a speech recognition method, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the speech recognition method of any of claims 1 to 6.
14. A computer-readable storage medium, in which instructions, when executed by a processor of an electronic device of a speech recognition method, enable the electronic device of the speech recognition method to perform the speech recognition method according to any one of claims 1 to 6.
CN202110328573.7A 2021-03-26 2021-03-26 Voice recognition method and device, electronic equipment and storage medium Active CN113066480B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110328573.7A CN113066480B (en) 2021-03-26 2021-03-26 Voice recognition method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110328573.7A CN113066480B (en) 2021-03-26 2021-03-26 Voice recognition method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113066480A CN113066480A (en) 2021-07-02
CN113066480B true CN113066480B (en) 2023-02-17

Family

ID=76563917

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110328573.7A Active CN113066480B (en) 2021-03-26 2021-03-26 Voice recognition method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113066480B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113903342B (en) * 2021-10-29 2022-09-13 镁佳(北京)科技有限公司 Voice recognition error correction method and device
CN114998881B (en) * 2022-05-27 2023-11-07 北京百度网讯科技有限公司 Training method of deep learning model, text recognition method, device and equipment

Citations (3)

Publication number Priority date Publication date Assignee Title
US5805772A (en) * 1994-12-30 1998-09-08 Lucent Technologies Inc. Systems, methods and articles of manufacture for performing high resolution N-best string hypothesization
CN105513589A (en) * 2015-12-18 2016-04-20 百度在线网络技术(北京)有限公司 Speech recognition method and speech recognition device
CN112102815A (en) * 2020-11-13 2020-12-18 深圳追一科技有限公司 Speech recognition method, speech recognition device, computer equipment and storage medium

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
EP4318463A3 (en) * 2009-12-23 2024-02-28 Google LLC Multi-modal input on an electronic device

Patent Citations (3)

Publication number Priority date Publication date Assignee Title
US5805772A (en) * 1994-12-30 1998-09-08 Lucent Technologies Inc. Systems, methods and articles of manufacture for performing high resolution N-best string hypothesization
CN105513589A (en) * 2015-12-18 2016-04-20 百度在线网络技术(北京)有限公司 Speech recognition method and speech recognition device
CN112102815A (en) * 2020-11-13 2020-12-18 深圳追一科技有限公司 Speech recognition method, speech recognition device, computer equipment and storage medium

Non-Patent Citations (1)

Title
A stochastic segment model decoding algorithm incorporating phoneme-string edit distance; Chao Hao; Computer Engineering and Applications (《计算机工程与应用》); 2015-03-15 (Issue 06); pp. 208-211 *

Also Published As

Publication number Publication date
CN113066480A (en) 2021-07-02

Similar Documents

Publication Publication Date Title
CN110148427B (en) Audio processing method, device, system, storage medium, terminal and server
CN111090727B (en) Language conversion processing method and device and dialect voice interaction system
CN113066480B (en) Voice recognition method and device, electronic equipment and storage medium
US8972265B1 (en) Multiple voices in audio content
US8086457B2 (en) System and method for client voice building
CN109979450B (en) Information processing method and device and electronic equipment
CN105190614A (en) Search results using intonation nuances
CN115329206A (en) Voice outbound processing method and related device
CN105845133A (en) Voice signal processing method and apparatus
CN113539244B (en) End-to-end speech recognition model training method, speech recognition method and related device
CN113113024A (en) Voice recognition method and device, electronic equipment and storage medium
JP2022554149A (en) Text information processing method and apparatus
CN110600004A (en) Voice synthesis playing method and device and storage medium
CN111402864A (en) Voice processing method and electronic equipment
CN111507114A (en) Reverse translation-based spoken language text enhancement method and system
CN113342948A (en) Intelligent question and answer method and device
US20230244878A1 (en) Extracting conversational relationships based on speaker prediction and trigger word prediction
CN107910005A (en) The target service localization method and device of interaction text
CN111427444A (en) Control method and device of intelligent device
CN111966803B (en) Dialogue simulation method and device, storage medium and electronic equipment
CN112397053B (en) Voice recognition method and device, electronic equipment and readable storage medium
KR20190070682A (en) System and method for constructing and providing lecture contents
CN111090704A (en) Self-service learning system of language spoken language based on block chain technology
CN117351944B (en) Speech recognition method, device, equipment and readable storage medium
CN111161706A (en) Interaction method, device, equipment and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant