WO2020119432A1 - Speech recognition method, apparatus, device, and storage medium - Google Patents
- Publication number: WO2020119432A1 (application PCT/CN2019/120558)
- Authority: WIPO (PCT)
- Prior art keywords: edge, state diagram, weight, language, model
Classifications
- G10L15/063—Training (creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice)
- G10L15/183—Speech classification or search using natural language modelling with context dependencies, e.g. language models
- G10L15/02—Feature extraction for speech recognition; selection of recognition unit
- G10L15/10—Speech classification or search using distance or distortion measures between unknown speech and reference templates
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/0631—Creating reference templates; clustering
- G10L2015/0635—Training updating or merging of old and new templates; mean values; weighting

(All classes fall under G—Physics; G10—Musical instruments; acoustics; G10L—Speech analysis, recognition, and processing.)
Definitions
- This application relates to computer technology, and in particular to a speech recognition method, apparatus, device, and storage medium.
- Speech recognition technology converts human speech into corresponding text or codes, and is widely used in smart homes, real-time transcription, and other fields.
- During recognition, the decoder searches for the best word sequence in a search space composed of knowledge sources such as acoustic models, dictionaries, and language models, based on the speech spoken by the user, and combines the resulting words into the text corresponding to the speech, i.e., the recognition result.
- The language recognition model used in speech recognition is usually obtained by pruning a large language model, and provides the decoder with word search paths at the language layer.
- Because the pruned language model contains less data and lacks information, it can improve the speed of speech recognition but leads to a decrease in accuracy.
- Embodiments of the present application provide a speech recognition method, apparatus, device, and storage medium, intended to improve the accuracy of speech recognition.
- An embodiment of the present application provides a speech recognition method, including:
- the speech recognition model includes the language recognition model
- a sequence of multiple elements corresponding to the speech to be recognized is determined as the speech recognition result.
- the method may include:
- the first state diagram is a state diagram of a keyword language model
- the second state diagram is a state diagram of a large language model
- the speech recognition model includes the language recognition model
- a target path is selected from the word sequence paths to obtain a speech recognition result.
- the extracting the reference edge in the first state diagram includes:
- the obtaining the reference edge according to the preset traversal depth and the starting node includes:
- the recursive edge is determined as the second reference edge.
- finding an edge in the second state diagram with the same label as the reference edge, as a keyword edge, includes:
- the updating of the weight of the keyword edge according to the weight of the reference edge includes:
- the method further includes:
- the reference edge is mapped to the second state diagram to obtain a keyword edge.
- the method further includes:
- the edge with the same label as a word in the preset vocabulary is selected as the keyword start edge
- the updated weight of the keyword start edge in the second state diagram is configured as the excitation weight of the corresponding edge in the language recognition model.
- filtering out the edge with the same label as a word in the preset vocabulary as the keyword start edge includes:
- the method further includes:
- a weighted finite-state transducer of the keyword language model is constructed, and the state diagram indicated by it is obtained as the first state diagram.
- the method further includes:
- a weighted finite-state transducer of the large language model is constructed, and the state diagram indicated by it is obtained as the second state diagram.
- the method further includes:
- the probability of the relationship between at least one pair of elements in the language recognition model is adjusted using the probability of the relationship between the at least one pair of elements in the text segment;
- the voice recognition model includes the language recognition model
- a sequence of multiple elements corresponding to the speech to be recognized is determined as a speech recognition result.
- An embodiment of the present application also provides a voice recognition device, including:
- An adjustment module configured to adjust the probability of the relationship between at least one pair of elements in the language recognition model according to the probability of the relationship between at least one pair of elements in the text segment;
- the speech recognition module is configured to input the speech to be recognized into a preset speech recognition model that includes the language recognition model, and to determine, according to the probabilities of the relationships between elements in the language recognition model, the sequence of multiple elements corresponding to the speech to be recognized as the speech recognition result.
- the voice recognition device may include:
- a loading unit configured to load a preset first state diagram and a second state diagram, the first state diagram is a state diagram of a keyword language model, and the second state diagram is a state diagram of a large language model;
- the keyword unit is used to extract a reference edge in the first state diagram, and find an edge with the same label as the reference edge in the second state diagram as a keyword edge;
- the updating unit is used to obtain the weight of the reference edge, and update the weight of the keyword edge according to the weight of the reference edge;
- an excitation unit configured to configure the updated weights of the keyword edges in the second state diagram as the excitation weights of the corresponding edges in the language recognition model, the language recognition model being the language model obtained by pruning the large language model;
- a recognition unit configured to input the speech to be recognized into a preset speech recognition model to obtain a word sequence path output by the speech recognition model, the speech recognition model including the language recognition model;
- the result unit is used to select a target path from the word sequence paths according to the excitation weights of the edges in the language recognition model to obtain a speech recognition result.
- An embodiment of the present application further provides a speech recognition device, including a memory, a processor, and a speech recognition program stored in the memory and executable on the processor, the steps of the speech recognition method being implemented when the program is executed by the processor.
- the device further includes a voice collection device, and the voice collection device is used to collect voice to be recognized in real time.
- An embodiment of the present application further provides a storage medium, the storage medium stores a plurality of instructions, and the instructions are suitable for the processor to load to execute the steps of any of the voice recognition methods provided by the embodiments of the present application.
- a preset first state diagram and a second state diagram are loaded; the first state diagram is a state diagram of a keyword language model, and the second state diagram is a state diagram of a large language model. A reference edge is extracted from the first state diagram, and an edge in the second state diagram with the same label as the reference edge is found as a keyword edge. The weight of the reference edge is obtained, and the weight of the keyword edge is updated according to it.
- The updated weight of the keyword edge in the second state diagram is configured as the excitation weight of the corresponding edge in the language recognition model, where the language recognition model is the language model obtained by pruning the large language model. The speech to be recognized is input into a preset speech recognition model, which includes the language recognition model, to obtain the word sequence paths output by the model; a target path is then selected from these paths according to the excitation weights of the edges in the language recognition model, yielding the speech recognition result. Since the corpus of the keyword language model is much smaller than that of the large language model, the weight of a keyword edge in the first state diagram is greater than the weight of the same edge in the second state diagram.
- This scheme uses the weights of keyword edges in the first state diagram to boost the weights of the same edges in the second state diagram, thereby exciting the keyword edges in the speech recognition model and increasing the weight of paths containing the keywords, so that such a path is more likely to be selected as the recognition result.
- The solution thus increases the probability that keywords appear in the speech recognition result, improving the accuracy of the result while preserving recognition speed.
- The solution also applies to various topic scenarios: the keywords of each scenario can be used to improve the accuracy of the recognition results.
- FIG. 1a is a schematic diagram of a scenario of an information interaction system according to an embodiment of the present application.
- FIG. 1b is a schematic flowchart of a speech recognition method according to an embodiment of the present application.
- FIG. 1c is a schematic flowchart of a speech recognition method according to an embodiment of the present application.
- FIG. 1d is a schematic flowchart of a speech recognition method according to an embodiment of the present application.
- FIG. 2 is a schematic flowchart of another speech recognition method according to an embodiment of the present application.
- FIG. 3a is a schematic diagram of a first state diagram according to an embodiment of the present application.
- FIG. 3b is a schematic diagram of a second state diagram according to an embodiment of the present application.
- FIG. 3c is a schematic diagram of another second state diagram according to an embodiment of the present application.
- FIG. 3d is a schematic flowchart of a speech recognition method according to an embodiment of the present application.
- FIG. 4a is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present application.
- FIG. 4b is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present application.
- FIG. 4c is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present application.
- FIG. 4d is a schematic structural diagram of another speech recognition apparatus according to an embodiment of the present application.
- FIG. 5a is a schematic structural diagram of a speech recognition device according to an embodiment of the present application.
- FIG. 5b is a schematic structural diagram of a speech recognition device according to an embodiment of the present application.
- Embodiments of the present application provide a voice recognition method, device, device, and storage medium.
- An embodiment of the present application provides an information interaction system.
- the system includes the voice recognition device provided in any embodiment of the present application, and other devices such as a server and a terminal.
- the voice recognition device may be integrated in devices such as servers or terminals.
- the terminal may be a mobile terminal, a personal computer (PC), or another device.
- an embodiment of the present application provides an information interaction system, including a server and a terminal.
- the voice recognition device may be integrated in the server.
- the voice recognition device may also be integrated in the terminal.
- the voice recognition device may execute the voice recognition method of each embodiment.
- FIG. 1b is a schematic flowchart of a voice recognition method provided by an embodiment of the present application. As shown in FIG. 1b, the method may include the following steps.
- Step 11: Adjust the probability of the relationship between at least one pair of elements in the language recognition model according to the probability of the relationship between that pair of elements in a text segment.
- A text segment refers to a piece of text that has a specific meaning as a whole.
- The text segment usually includes multiple morphemes, and may be, for example, a term, a phrase, or a textual expression.
- The text segment used in Step 11 is a piece of text whose recognition rate needs to be improved in speech recognition, and is also referred to below as a keyword.
- Step 12 input the speech to be recognized into a preset speech recognition model, and the speech recognition model includes the language recognition model.
- Step 13 Determine the sequence of multiple elements corresponding to the speech to be recognized as the speech recognition result according to the probability of the relationship between the elements in the language recognition model.
- For a given text segment, the relationship between a pair of its elements is necessarily closer than the relationship between the same elements in the basic corpus on which the language recognition model is based. Therefore, using the probability of the relationship between the elements in the given text segment to adjust the probability of the relationship between those elements in the language recognition model can improve the model's recognition rate for the text segment during speech recognition, thereby improving the speech recognition rate of the text segment.
- the probability of the relationship between at least one pair of elements in the text segment can be obtained through natural language processing technology.
- the probability can be obtained by establishing a language model of the text segment.
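As a minimal sketch of this adjustment (assuming a toy dict-based language model keyed by element pairs, not the patent's actual data structures), the base model's log-probability for a pair can be raised whenever the text-segment model scores that pair higher:

```python
import math

def boost_pair_probs(base_lm, segment_lm):
    """Replace the base model's log-probability for any element pair
    that the text-segment model scores higher (toy dict layout)."""
    for pair, seg_logp in segment_lm.items():
        if seg_logp > base_lm.get(pair, float("-inf")):
            base_lm[pair] = seg_logp
    return base_lm

# Hypothetical base-10 log probabilities for two word pairs.
base = {("turn", "on"): math.log10(0.02), ("turn", "off"): math.log10(0.01)}
segment = {("turn", "on"): math.log10(0.3)}
boosted = boost_pair_probs(base, segment)
```

Pairs that do not occur in the text segment keep their original probabilities, so only the targeted relationships become more likely during decoding.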
- the speech recognition method may be as shown in FIG. 1c, including the following steps.
- Step 21 Use the weight of an edge representing the relationship between a pair of elements in the first state diagram corresponding to the text segment to adjust the weight of the edge corresponding to the edge in the preset second state diagram.
- the first state diagram is a state diagram of the language model of the text segment
- the second state diagram is a state diagram of the basic language model.
- The first state diagram is the weighted directed state diagram of the language model corresponding to the text segment (hereinafter the key language model, or keyword language model). It records each node and the directed connections between nodes, describing the possible states of keyword objects in the key language model and the state transition paths.
- The keyword object refers to a language element in the text segment.
- A node is a state of a keyword object.
- Nodes connected in order form directed edges, and connected edges form a keyword transfer path.
- Each such path is a word sequence path of the keyword, comprising the keyword objects and their output order.
- The key language model may be a language model constructed from the preset text fragments, for example an n-gram model such as a third-order tri-gram.
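As an illustration of the tri-gram training mentioned above (a minimal count-based stand-in, with hypothetical `<s>`/`</s>` padding symbols rather than any specific toolkit), trigram statistics over the text fragments can be gathered like this:

```python
from collections import Counter

def train_trigram_counts(sentences):
    """Count trigrams over tokenized sentences with <s>/</s> padding -
    a minimal stand-in for training the tri-gram key language model."""
    tri = Counter()
    for toks in sentences:
        padded = ["<s>", "<s>"] + list(toks) + ["</s>"]
        for i in range(len(padded) - 2):
            tri[tuple(padded[i:i + 3])] += 1
    return tri

counts = train_trigram_counts([["open", "the", "garage", "door"]])
```

In a real system these counts would be smoothed and normalized into probabilities; here they only illustrate what a third-order model records.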
- The second state diagram may be the directed state diagram of a preset basic language model (also called the large language model). It records each node and the directed connections between nodes, describing the possible states of word objects in the basic language model and the state transition paths.
- The basic language model can be a large-scale, unpruned language model with rich corpus information.
- A node is a state of a word object; nodes connected in order form directed edges, and connected edges form a word transfer path.
- Each such path is a word sequence path, comprising the word objects and their output order.
- Each edge has a corresponding label and weight.
- The label includes an input label and an output label.
- The input label and the output label are the same, namely the word object; the weight characterizes the probability of the edge appearing on the transfer path, and can be the probability value itself or a value computed from it.
- edge weights with the same label in the first state diagram and the second state diagram may be different.
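The edge structure described above (label pair plus a log-domain weight) can be sketched as a small data type; the base-10 log weight follows the example given later in the text, and the field names are illustrative assumptions:

```python
import math
from dataclasses import dataclass

@dataclass
class Edge:
    src: int        # start node (a state of the word object)
    dst: int        # end node
    in_label: str   # input label; equal to the output label here
    out_label: str
    weight: float   # log of the edge's probability on the transfer path

def edge_from_prob(src, dst, label, prob):
    """Build an edge whose weight is the base-10 log of its probability."""
    return Edge(src, dst, label, label, math.log10(prob))

e = edge_from_prob(0, 1, "door", 0.1)
```

Because the keyword model is trained on a much smaller corpus, the same labeled edge typically carries a higher (less negative) log weight there than in the large model.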
- In some embodiments, the adjustment step may include: extracting an edge in the first state diagram as a reference edge, and searching the second state diagram for an edge with the same label as the reference edge, as the target edge; then obtaining the weight of the reference edge and updating the weight of the target edge according to it.
- In other embodiments, the adjustment step may include: adding an edge corresponding to the reference edge to the second state diagram as a target edge, and setting the weight of the target edge according to the weight of the reference edge.
- Step 22 Configure the weight of at least one edge in the modified second state diagram as the excitation weight of the corresponding edge in the language recognition model.
- the language recognition model is a language model after pruning the basic language model.
- Step 23: Input the speech to be recognized into a preset speech recognition model to obtain the word sequence paths output by the speech recognition model.
- The speech recognition model includes the language recognition model.
- Step 24: Select a target path from the word sequence paths according to the excitation weights of the edges in the language recognition model, to obtain the speech recognition result.
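The path selection step can be pictured as follows (a toy sketch: a path is assumed to be a list of `(label, weight)` pairs, and `excitation` maps a boosted label to its excitation weight, which substitutes for the pruned-model weight):

```python
def select_target_path(paths, excitation):
    """Score each word-sequence path as the sum of its edge weights,
    substituting the excitation weight for any boosted edge; return
    the best-scoring path."""
    def score(path):
        return sum(excitation.get(label, weight) for label, weight in path)
    return max(paths, key=score)

# Without excitation, "turn of" (-3.5) beats "turn on" (-5.0); the boosted
# weight for "on" flips the decision toward the keyword path.
paths = [[("turn", -2.0), ("of", -1.5)], [("turn", -2.0), ("on", -3.0)]]
best = select_target_path(paths, {"on": -0.5})
```

This shows why raising the weight of keyword edges raises the total score of any path containing them, making the keyword path more likely to become the recognition result.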
- the key language model is as described above.
- the method for adjusting the weight of the edge in the second state diagram in step 21 may be to adjust the weight of the corresponding edge already in the second state diagram.
- the speech recognition method may be as shown in FIG. 1d, and includes the following steps.
- the first state diagram is a state diagram of a language model corresponding to a text segment
- the second state diagram is a state diagram of a basic language model.
- the speech recognition device may train the language model according to the text segment to obtain the first state diagram.
- the text segment may be a relevant corpus in the field of the speech to be recognized, which can be flexibly configured according to specific needs.
- When the speech recognition device is deployed in a server, the text segment may be one that needs enhanced recognition and is input by the user at a terminal, which then sends it to the server; alternatively, the user may enter or select the text segment directly at the server.
- the voice recognition device may also obtain text fragments from a designated (local or remote) storage location.
- The speech recognition device may obtain a preset text segment and train a key language model based on it, then construct a weighted finite-state transducer of the key language model and obtain the state diagram indicated by that transducer as the first state diagram.
- A weighted finite-state transducer is referred to as a WFST in this embodiment.
- A WFST can recognize the entire path of a word from its initial state to its end state, where each state of the word can be understood as a node.
- the nodes are connected in order to form directed edges, and the edges have corresponding labels and weights.
- the label includes an input label and an output label, and the input label and the output label are the same.
- the weight represents the probability of an edge appearing in the entire path.
- the weight can be a probability value or can be calculated according to the probability value.
- the probability of the entire path can be calculated according to the weight or probability of each edge in the path.
- The speech recognition device uses the text fragments as a training corpus and trains a tri-gram on them to obtain the key language model. The device then constructs the weighted finite-state transducer of the key language model, acquires each node in the key-language-model WFST and the connections between nodes to obtain the state diagram indicated by that WFST, and uses this state diagram as the first state diagram.
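A toy stand-in for deriving such a state diagram (not a real WFST toolkit: each n-gram history is treated as a node, and each word as a labeled, weighted outgoing edge) might look like this:

```python
def build_state_graph(ngram_logprobs):
    """Build a weighted directed state graph from n-gram log-probabilities:
    each history tuple is a node, each word a labeled outgoing edge (a toy
    stand-in for the state diagram indicated by the WFST)."""
    graph = {}  # node -> list of (label, next_node, weight)
    for ngram, logp in ngram_logprobs.items():
        history, word, nxt = ngram[:-1], ngram[-1], ngram[1:]
        graph.setdefault(history, []).append((word, nxt, logp))
    return graph

# Hypothetical bigram log probabilities from a tiny corpus.
g = build_state_graph({("<s>", "open"): -0.3, ("open", "the"): -0.2})
```

The same construction applied to the keyword model and the large model yields the first and second state diagrams whose edges are later matched by label.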
- The speech recognition device may acquire a preset general corpus and train the basic language model on it, then construct a weighted finite-state transducer of the basic language model and obtain the state diagram indicated by that transducer as the second state diagram.
- The general corpus may be a commonly used large-scale corpus.
- The speech recognition device inputs the general corpus into a preset language model, such as a second-order bi-gram, for training, and obtains the basic language model. The device then constructs the weighted finite-state transducer of the basic language model, acquires each node in the basic-language-model WFST and the connections between nodes to obtain the state diagram indicated by that WFST, and uses this state diagram as the second state diagram.
- The weight of an edge in the key-language-model WFST is greater than the weight of the same edge in the basic-language-model WFST; accordingly, the weight of an edge in the first state diagram is greater than its weight in the language recognition model.
- Before or during speech recognition, the speech recognition device loads the first state diagram and the second state diagram.
- the reference edge refers to the edge in the first state diagram.
- the edge related to the preset keyword may be selected as the reference edge.
- all the edges in the first state diagram may be used as reference edges, respectively, and subsequent steps may be performed.
- each edge has a corresponding label and weight.
- the label includes an input label and an output label.
- The input label and the output label are the same, namely the keyword object; the weight characterizes the probability of the edge appearing on the transfer path.
- The weight can be the probability value itself or a value computed from it. For example, for any edge in the first state diagram, the logarithm (base 10 or base e) of the edge's probability may be taken, and the resulting log value used as the edge's weight.
- An edge whose prefix path and label both match those of the reference edge is the target edge corresponding to that reference edge.
- the speech recognition device first extracts the reference edge from the first state diagram.
- the starting node of the first state diagram may be obtained, and the reference edge may be obtained according to a preset traversal depth and starting node.
- The step of obtaining the starting node of the first state diagram and obtaining the reference edge according to the preset traversal depth and the starting node may include: determining the output edges of the starting node as first reference edges; within the preset recursion depth, recursing on each first reference edge to obtain its recursive edges; and, if the output label of a recursive edge is not a preset symbol, determining that recursive edge as a second reference edge.
- the starting node can be flexibly configured as needed.
- For example, in the first state diagram, the first node is the initial node, the second node is the second-order state node, and the third node is the first-order state node; the third node therefore serves as the starting node of the first state diagram.
- the recursion depth can be configured according to the order of the language model.
- the speech recognition device obtains the order of the key language model as the recursive depth.
- the speech recognition device configures the recursive depth as 3.
- The speech recognition model uses the output edges of the starting node as first reference edges and searches for the same edges in the second state diagram.
- The speech recognition model then continues to search the first state diagram for further edges that can serve as reference edges. Specifically, for any first reference edge, the model recurses on it within the preset recursion depth to obtain its recursive edges; if the output label of a recursive edge is not a preset symbol, that recursive edge is determined as a second reference edge.
- The preset symbols are a preset sentence-end symbol and a back-off symbol.
- With a recursion depth of 3, the recursive edges of a first reference edge are the output edges of its end node together with the output edges of those edges, spanning four nodes in total.
- After obtaining a recursive edge, the speech recognition model checks whether its output label is a preset symbol. If the output label is neither the sentence-end symbol nor the back-off symbol, the recursive edge is determined as a second reference edge, and the same edge must be found in the second state diagram. Otherwise the recursive edge is a non-reference edge, and no matching edge needs to be found.
- In other words, the speech recognition device takes the output edges of each first reference edge, and the output edges of those edges, whose labels are not preset symbols, as second reference edges; each second reference edge is then used to update the weight of the matching second target edge in the second state diagram.
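The depth-limited extraction of reference edges, with the sentence-end and back-off labels filtered out, can be sketched as follows (the graph layout and the reserved-symbol spellings `</s>` and `<backoff>` are assumptions for illustration):

```python
RESERVED = {"</s>", "<backoff>"}  # hypothetical sentence-end / back-off marks

def collect_reference_edges(graph, start, depth):
    """Depth-limited recursion from the start node: every traversed edge
    whose output label is not a reserved symbol becomes a reference edge."""
    refs = []

    def recurse(node, remaining):
        if remaining == 0:
            return
        for label, nxt, weight in graph.get(node, []):
            if label in RESERVED:
                continue  # sentence-end / back-off edges are not reference edges
            refs.append((node, label, nxt, weight))
            recurse(nxt, remaining - 1)

    recurse(start, depth)
    return refs

# Node 0 is the starting node; the "</s>" edge is skipped.
toy = {0: [("open", 1, -0.3), ("</s>", 9, 0.0)], 1: [("door", 2, -0.2)]}
refs = collect_reference_edges(toy, start=0, depth=3)
```

With depth 3 (matching a tri-gram's order), the first level yields the first reference edges and the deeper levels the second reference edges.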
- the voice recognition device traverses in the second state diagram to find the same target edge as the reference edge.
- the step "find the same edge as the reference edge label in the second state diagram as the target edge” may include: in the second state diagram, find the same edge as the first reference edge label as the first target edge; Among the recursive edges of the first target edge, the second target edge is obtained as the same edge as the second reference edge label.
- the voice recognition device searches for the same edge as the first reference edge label in the second state diagram.
- the same label may refer to the same input label and/or the same output label.
- the input label and the output label of any given edge in the state diagram are the same, so the voice recognition device may search for the edge with the same input label as the first reference edge, or the edge with the same output label as the first reference edge, or the edge with both the same input label and the same output label.
- the voice recognition device determines the same edge as the first reference edge label as the first target edge that is the same as the first reference edge.
- the voice recognition device searches for the edge with the same label as the second reference edge among the recursive edges of the first target edge according to the preset recursive depth to obtain the second target edge.
- the same label may refer to the same input label and/or the same output label.
- in this way, the voice recognition device finds the first target edge identical to each first reference edge, and the second target edge identical to each second reference edge.
- the weight of the reference edge is recorded in the first state diagram, and the initial weight of the target edge is recorded in the second state diagram.
- the voice recognition device may use the weight of the reference edge to replace the weight of the same target edge to update the weight of the target edge.
- the step "updating the weight of the target edge according to the weight of the reference edge" may include: obtaining a preset interpolation parameter and the initial weight of the target edge; calculating the target weight of the target edge according to the weight of the reference edge, the interpolation parameter, and the initial weight of the target edge; and using the target weight to replace the initial weight of the target edge in the second state diagram.
- the preset interpolation parameter can be flexibly configured according to actual needs.
- the voice recognition device obtains, from the second state diagram, the initial weight of the target edge that is identical to the reference edge. Then, the voice recognition device may calculate the target weight of the target edge from the following quantities:
- w new is the target weight of the target edge
- w old is the initial weight of the target edge
- w k is the weight of the reference edge
- lambda is the interpolation coefficient
- the voice recognition device uses the target weight of the target edge to replace the initial weight of the target edge in the second state diagram.
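- the formula itself is not reproduced in this text; one reading consistent with the numeric examples in the embodiments (for instance, a reference weight of 0 and an initial weight of -12.7 yielding a target weight of -2.3) is a log-domain interpolation. The sketch below is that assumption, not the patent's definitive formula:

```python
import math

def target_weight(w_k, w_old, lam=0.1):
    """Assumed interpolation: combine the reference edge's probability
    exp(w_k) and the target edge's initial probability exp(w_old) with
    interpolation coefficient lam, returning the log-domain target
    weight. lam = 0.1 is chosen only to match the worked numbers."""
    return math.log(lam * math.exp(w_k) + (1.0 - lam) * math.exp(w_old))
```

- with w_k = 0 and w_old = -12.7 this gives roughly -2.30, and with w_k = -1.0986 and w_old = -17.38 roughly -3.40, matching the target weights quoted in the embodiment.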
- the voice recognition device updates the weight of each target edge that is identical to a reference edge.
- the language recognition model is a language model obtained by pruning the basic language model.
- the speech recognition device can prune the basic language model to obtain the language recognition model. For example, entropy-based pruning or rank-based pruning may be used to remove unimportant branch paths from the large language model, so that the pruned language recognition model remains as similar as possible to the basic language model before pruning, thereby compressing the model's data volume while limiting the impact on path probabilities.
- the voice recognition device configures the updated weight of the target edge in the second state diagram as the excitation weight of the corresponding edge in the language recognition model; this can also be understood as configuring it as the excitation weight of the identical edge in the language recognition model. Since the language recognition model is obtained by pruning the basic language model, every edge in the language recognition model also exists in the state diagram of the basic language model. In the language recognition model, the excitation weight of an edge takes priority over its initial weight.
- the voice recognition device establishes the mapping relationship between the target edge in the second state diagram and the corresponding edge in the language recognition model, and then configures the target weight of the target edge as the excitation weight of the corresponding edge in the language recognition model.
- the weights of the edges in the language recognition model need not be modified; the excitation weights can be used to calculate the score of the word sequence path.
- the text fragments that need to be enhanced may differ between scenarios. Different key language models can therefore be trained, and the excitation weights of the corresponding edges in the language recognition model can be configured according to the resulting first state diagram without affecting the other edges in the language recognition model.
- the mapping relationship of the current excitation weights can be cancelled according to a cancellation instruction input by the user or upon a scenario switch, and the enhanced text-fragment weights can be cleared, removing the influence of the current text fragments on the language recognition model. This makes it easy to reconfigure the excitation weights of the language recognition model according to the requirements of the next scenario and improve the accuracy of speech recognition.
- the mapping relationship, rather than direct assignment, is used to configure the excitation weights, which improves the versatility of the language recognition model and the speech recognition model.
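- one way to picture this mapping-based configuration (an illustrative sketch; the class and method names are assumptions, not the patent's implementation):

```python
class ExcitationTable:
    """Holds excitation weights keyed by edge label, separate from the
    language recognition model's own weights."""

    def __init__(self):
        self._excitation = {}  # edge label -> excitation weight

    def configure(self, label, weight):
        self._excitation[label] = weight

    def cancel_all(self):
        """Clear the enhanced text-fragment weights, e.g. on a user
        cancellation instruction or a scenario switch."""
        self._excitation.clear()

    def weight(self, label, initial_weight):
        """The excitation weight takes priority over the edge's
        initial weight when one is configured."""
        return self._excitation.get(label, initial_weight)
```

- because the model's own weights are never overwritten, cancelling the mapping immediately restores the model for the next scenario.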
- This solution has strong applicability and can be applied to a variety of scenarios. Enhancing text fragments for one scenario will not affect subsequent use in other scenarios, which reduces maintenance costs. In different speech recognition scenarios or modes, it can effectively improve the accuracy of speech recognition and avoid cross-effects.
- the speech recognition model includes a language recognition model.
- the voice recognition device can acquire the voice to be recognized.
- the voice to be recognized may be the voice collected by the terminal.
- the terminal may collect the voice to be recognized in real time and may provide it to the server.
- the voice to be recognized may be voice data read from a local or remote storage device.
- step 105 can be performed at the same time as step 101. While enhancing the weight of text fragments in the language recognition model, speech recognition is performed to achieve online speech recognition. Of course, step 105 can also be executed after step 104, using a language recognition model whose text segment weights have been enhanced to perform word path screening to achieve offline speech recognition.
- the preset speech recognition model may be the HCLG model.
- H is a WFST constructed from the HMM (Hidden Markov Model), which can map HMM state numbers to triphones.
- C is a context WFST constructed by expanding a monophone into a triphone.
- L is a WFST constructed by a pronunciation dictionary, which can convert the input phonemes into words.
- G is a WFST constructed by a language recognition model, used to represent the probability relationship of the context of words.
- the voice recognition device inputs the to-be-recognized voice into the voice recognition model and, after steps such as phoneme recognition and phoneme-to-word conversion, inputs the word elements into the language recognition model WFST to obtain each word sequence path output by the language recognition model WFST, and then calculates the score of each word sequence path.
- a word sequence path is composed of its edges in the hidden Markov model WFST, the context WFST, the pronunciation dictionary WFST, and the language recognition model WFST.
- the speech recognition device can calculate the score of each word sequence path.
- the score of each word sequence path is calculated according to the weights of the edges in that path.
- the speech recognition device acquires each edge in the path; one path includes its edges in the hidden Markov model WFST, the context WFST, the pronunciation dictionary WFST, and the language recognition model WFST.
- the speech recognition device obtains the weight of each edge of the word sequence path in the hidden Markov model WFST, the context WFST, the pronunciation dictionary WFST, and the language recognition model WFST. In addition, the voice recognition device detects whether any edge of the word sequence path in the language recognition model WFST has an excitation weight.
- the voice recognition device calculates the score of the word sequence path by summing or multiplying the weights of the edges in the word sequence path.
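- the scoring step can be sketched as follows (log-domain summation is shown; the tuple encoding of edges and the "G" tag for language-recognition-model edges are illustrative assumptions):

```python
def score_path(path_edges, excitation):
    """Sum the log-domain weights of all edges in a word sequence path;
    for edges of the language recognition model WFST ("G"), a configured
    excitation weight replaces the edge's initial weight.
    path_edges: list of (source, label, weight) tuples, where source is
    "H", "C", "L", or "G"."""
    score = 0.0
    for source, label, weight in path_edges:
        if source == "G" and label in excitation:
            weight = excitation[label]  # excitation weight has priority
        score += weight
    return score

def best_path(paths, excitation):
    """Return the word sequence path with the highest score."""
    return max(paths, key=lambda p: score_path(p, excitation))
```

- note how a path through an excited text-fragment edge can overtake a path that would otherwise score higher.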
- the speech recognition device combines the word sequences according to the word sequence path with the highest score to obtain the text corresponding to the speech to be recognized, that is, the recognition result.
- the first state diagram is a state diagram of a key language model
- the second state diagram is a state diagram of a large language model
- the language recognition model is the language model obtained after pruning the large language model. The speech to be recognized is input into the preset speech recognition model to obtain the word sequence paths output by the speech recognition model; the speech recognition model includes the language recognition model. According to the excitation weights of the edges in the language recognition model, the target path is selected from the word sequence paths to obtain the speech recognition result. Since the corpus of the key language model is much smaller than the corpus of the large language model, the weight of a text-fragment edge in the first state diagram is greater than the weight of the identical target edge in the second state diagram.
- This scheme uses the weights of the reference edges in the first state diagram to enhance the weights of the identical target edges in the second state diagram, thereby exciting the weights of those edges in the speech recognition model. This increases the weights of the edges in paths containing the text fragments, and thus the probability that a path containing a text fragment is selected as the recognition result.
- the solution improves the probability of text fragments appearing in the speech recognition results, improving the accuracy of the results while ensuring the speed of speech recognition.
- the solution is also applicable to various subject scenarios; the text fragments of each subject scenario can be used to improve the accuracy of the speech recognition result.
- the method for adjusting the weight of the edge in the second state diagram in step 21 may be to add a corresponding edge in the second state diagram and set its weight.
- This method can be used separately from the method shown in FIG. 1c, or simultaneously.
- FIG. 2 provides a voice recognition method according to an embodiment of the present application, and may include the following steps.
- the first state diagram is a state diagram of a key language model
- the second state diagram is a state diagram of a basic language model.
- step 101 For a specific implementation manner, reference may be made to the description of step 101 in the foregoing voice recognition method embodiment, and details are not described herein again.
- step 102 For a specific implementation manner, reference may be made to the description of step 102 in the foregoing voice recognition method embodiment, and details are not described herein again.
- the reference edge is mapped to the second state diagram to obtain the target edge.
- if the server does not find an edge with the same label as the first reference edge in the second state diagram, it queries the sequence number of the starting node of the first reference edge in the first state diagram, finds the node with the corresponding sequence number in the second state diagram, and, using that node as the starting node, creates a virtual edge identical to the first reference edge as the first target edge, thereby mapping the first reference edge.
- similarly, if the server does not find an edge with the same label as the second reference edge among the recursive edges of the first target edge, it uses the end node of the first target edge as the starting node to create a virtual edge with the same label as the second reference edge, thereby mapping the second reference edge.
- the initial weights of the first target edge and the second target edge obtained by the mapping may be preset values.
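- a minimal sketch of this mapping step (the graph representation is an assumption, and node sequence numbers are simplified to coincide between the two state diagrams; in the method above the node is looked up by its sequence number):

```python
def map_reference_edge(second_graph, ref_edge, preset_weight=0.0):
    """If the second state diagram has no edge with the reference edge's
    label at the starting node, add a virtual edge with the same label
    as the target edge and give it a preset initial weight."""
    start = ref_edge["start"]
    for edge in second_graph.get(start, []):
        if edge["label"] == ref_edge["label"]:
            return edge  # an identical edge already exists: the target edge
    virtual = {"start": start, "end": ref_edge["end"],
               "label": ref_edge["label"], "weight": preset_weight}
    second_graph.setdefault(start, []).append(virtual)
    return virtual
```

- the returned edge (existing or virtual) is the target edge whose weight is then updated from the reference edge's weight.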
- step 103 For a specific implementation manner, reference may be made to the description of step 103 in the foregoing voice recognition method embodiment, and details are not described herein again.
- step 104 For a specific implementation, reference may be made to the description of step 104 in the foregoing voice recognition method embodiment, and details are not described herein again.
- the first word obtained after the text segment is segmented is recorded in the preset vocabulary.
- step 206 may further include: performing word segmentation processing on the text fragment, and configuring the first word obtained by the segmentation into a preset vocabulary.
- the server performs word segmentation processing on the text fragments respectively, and configures the first word obtained by word segmentation of each text fragment into a vocabulary.
- the server selects, in the second state diagram, the edges whose labels are the same as the words in the preset vocabulary as the starting edges.
- the server can calculate the target weight of the starting edge using the following formula:
- w new is the target weight of the starting edge
- w old is the initial weight of the starting edge
- l is the preset scale factor
- the server replaces the initial weight of the starting edge with its target weight, thereby updating the weight of the starting edge.
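- the exact starting-edge formula is not reproduced in this text; since edge weights are log probabilities (negative values), one plausible form, shown purely as an assumption, scales the weight toward zero so the edge's probability rises:

```python
def starting_edge_weight(w_old, l=0.2):
    """Assumed update: w_new = l * w_old with a preset scale factor
    0 < l < 1, which moves a negative log-probability weight toward 0,
    i.e. enhances the starting edge. Both the form and l = 0.2 are
    illustrative guesses."""
    return l * w_old
```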
- the server enhances the weight of the starting edge in the second state diagram.
- after obtaining the starting edge and its updated weight, the server looks for the edge with the same label as the starting edge in the language recognition model and establishes a mapping relationship. The target weight of the starting edge is then configured as the excitation weight of the corresponding edge in the language recognition model.
- the speech recognition model includes a language recognition model.
- step 105 For a specific implementation manner, reference may be made to the description of step 105 in the foregoing voice recognition method embodiment, and details are not described herein again.
- when traversing or searching a word sequence path in the speech recognition model, if the edge with a given label is not found, the edge with that label can be looked up among the target edges obtained by mapping in the large language model, used as an edge of the word sequence, and its target weight obtained to calculate the score of the word sequence path.
- step 106 For a specific implementation manner, reference may be made to the description of step 106 in the foregoing voice recognition method embodiment, and details are not described herein again.
- this application uses the weight of the text-fragment path in the key language model to enhance the weight of the text-fragment path in the language recognition model, improving the probability that the text fragment appears in the recognition result and thus the accuracy of the speech recognition result.
- the target edge is added to the second state diagram by means of mapping, so that during speech recognition the mapped target edge can be used to improve the probability that the text fragment appears in the recognition result.
- context enhancement is realized, thereby increasing the probability that the text segment is found during language recognition, that is, the probability that the word sequence enters the path of the text segment.
- the present embodiment improves the accuracy of voice recognition from various aspects.
- the speech recognition device will be specifically integrated in a decoder for description.
- the decoder acquires the to-be-recognized speech collected by the speech collection device in real time, and performs online speech recognition.
- the decoder inputs the to-be-recognized speech into the speech recognition model and, after steps such as phoneme recognition and phoneme-to-word conversion, inputs the word elements into the language recognition model.
- the server loads the first state diagram and the second state diagram, so as to strengthen the weight of the target edge.
- the decoder obtains preset text fragments and trains the key language model according to the text fragments; constructs a weighted finite state converter of the key language model, and obtains the state diagram indicated by the weighted finite state converter of the key language model as the first state diagram.
- a third-order tri-gram key language model is used as an example for description.
- the first state diagram obtained by the decoder can refer to FIG. 3a, where node 2 is the second-order state and node 3 is the starting node of the first state diagram; the nodes are connected by connecting lines, and each connection becomes an edge.
- the arrow direction of the edge indicates the connection relationship. It can also be understood as the path direction.
- the input label, output label and weight of the edge are recorded in sequence.
- the weight of the edge is the logarithm of the probability.
- the preset end-of-sentence symbol may be the symbol "</s>"
- the preset back-off symbol may be the symbol "#phi".
- the decoder obtains the preset general corpus and trains the large language model according to the general corpus; constructs the weighted finite state converter of the large language model, and obtains the state diagram indicated by the weighted finite state converter of the large language model as the second state diagram.
- a second-order bi-gram large language model is taken as an example for description.
- the second state diagram obtained by the decoder can refer to FIG. 3b, where node 2 is the second-order state and node 3 is the starting node of the second state diagram; the nodes are connected by connecting lines to become edges, the arrow direction of an edge indicates the connection relationship (which can also be understood as the path direction), and the input label, output label, and weight of the edge are recorded on the edge in sequence.
- the logarithmic value of the probability of the edge is used as an example for illustration.
- the preset end-of-sentence symbol may be the symbol "</s>"
- the preset back-off symbol may be the symbol "#phi".
- the weight of the target edge in the second state diagram is enhanced.
- the decoder extracts the reference edge in the first state diagram, and finds the same edge as the reference edge label in the second state diagram as the target edge; obtains the weight of the reference edge, and updates the weight of the target edge according to the weight of the reference edge.
- the first state diagram and the second state diagram go down the same path from node 2 at the same time.
- the edge from node 3 to node 8 is the first reference edge 3-8, and its label is "Zhang Jun".
- in the second state diagram, the edge from node 3 to node 9 also has the label "Zhang Jun", so the first target edge 3-9 with the same label as reference edge 3-8 is obtained.
- the target weight of the first target edge 3-9 is -2.3, which is enhanced relative to -16.8, that is, the probability of edge 3-9 is improved.
- the decoder performs recursion on the first reference edge in the first state diagram. Since the key language model is a third-order model, the recursion depth is 3, yielding the second reference edge 8-9 with the label "Qi". The decoder then finds, among the outgoing edges of the first target edge 3-9, the edge 9-10 labeled "Qi" as the second target edge. According to the weight 0 of the second reference edge 8-9 and the initial weight -12.7 of the second target edge 9-10, the decoder calculates the target weight -2.3 of the second target edge 9-10, which enhances its weight. Since the output labels of the two outgoing edges of node 9 in the first state diagram are the back-off symbol and the end-of-sentence symbol respectively, they cannot be used as reference edges to enhance the weights of edges in the second state diagram.
- the decoder skips the edge 3-5 labeled with the back-off symbol in the first state diagram and the second state diagram, recurses, and at node 5 of the first state diagram obtains the reference edges 5-6 and 5-7.
- the second target edge with the same label "Zhang Jun" as the first reference edge 5-6 is 5-7, and the second target edge with the same label "Qi" as the first reference edge 5-7 is 5-8.
- the target weight of the second target edge 5-7 is calculated as -3.4; the weight of the reference edge 5-7 is -1.0986 and the initial weight of the second target edge 5-8 is -17.38, so the target weight of the second target edge 5-8 is calculated as -3.4.
- the decoder finds the reference edge 6-9 at node 6 in the first state diagram and the identical target edge 7-10 in the second state diagram according to the recursion depth. According to the weight 0 of reference edge 6-9 and the initial weight -12.7 of target edge 7-10, the decoder calculates that the target weight of target edge 7-10 is -2.3.
- the update of the target edge weights is thus achieved.
- the weights of the edges related to the text fragments in the second state diagram are improved.
- the weights of the corresponding edges in the language recognition model obtained by pruning the large language model are also improved; the probability that these words appear during decoding will be much larger than before.
- the decoder configures the weight of each target edge as the excitation weight of each corresponding edge in the language recognition model.
- take the first state diagram as FIG. 3a and the second state diagram as FIG. 3c as an example.
- the decoder uses the high-order edges in the first state diagram to associate the sequence numbers of some nodes in the second state diagram with the sequence numbers of the corresponding nodes in the first state diagram to perform edge mapping. Therefore, during decoding, if no edge whose input label is a specific word is found in the language recognition model, the word sequence path score is improved through the mapping relationship.
- the decoder adds, in the second state diagram, a virtual edge identical to the second reference edge 8-9 of the first state diagram as the second target edge identical to the second reference edge 8-9, maps it, and updates the weight of the second target edge to achieve weight enhancement.
- if the decoder cannot find the path (Zhang Jun, Qi) in the language recognition model, then in the second state diagram the weight of the path (Zhang Jun, Qi) is determined according to the mapped virtual edge.
- this embodiment can increase the recall rate of the text segment to more than 85% without affecting the normal recognition result, which satisfies most scene requirements.
- this embodiment enhances the probability that, once the first word of a text fragment is recognized, recognition proceeds into the text fragment.
- the decoder performs word segmentation processing on the text fragments and configures the first word obtained by segmentation into a preset vocabulary. Then, in the second state diagram, the edges with the same labels as the words in the preset vocabulary are selected as the starting edges; the target weight of a starting edge is obtained according to the preset scale factor and the initial weight of the starting edge, and the weight of the starting edge is updated; finally, the updated weight of the starting edge in the second state diagram is configured as the excitation weight of the corresponding edge in the language recognition model.
- the decoder inputs the word elements into the WFST constructed by the language recognition model, and obtains each word sequence path output by the language recognition model WFST. Then, the decoder calculates the score of each word sequence path according to the weight of each side of the word sequence path in the language recognition model, and outputs the word sequence path with the highest score as the recognition result.
- the user can quickly configure text fragments for scenarios such as conferences, enhance the occurrence probability of the text fragments in the recognition result, and improve the accuracy of voice recognition.
- This embodiment shortens the operation flow, saves a lot of time, and does not affect the real-time rate of the decoder, and has the advantage of low latency.
- the steps of the language recognition method may be respectively executed by a plurality of physical devices to implement the method together, and the above-mentioned language recognition apparatus may be realized by a plurality of physical devices together.
- the plurality of physical devices may be multiple servers, some of which mainly provide voice recognition services to users, and others provide customized voice recognition models for these servers.
- the plurality of physical devices may be terminal devices and servers.
- Terminal devices provide users with voice recognition services
- servers provide user-defined voice recognition models for these terminal devices.
- a voice recognition method in each embodiment may be as shown in FIG. 4b.
- the method may be performed by a computing device, such as a server, terminal device, and so on.
- the method may include the following steps.
- Step 31 the text segment is provided to the second computing device.
- the computing device may receive one or more text segments input or selected by the user through the user interface, such as terms of art and proper nouns, and then provide the text segments to the second computing device, so that the second computing device provides a "customized" speech recognition model according to the text segments.
- Step 32 Acquire a language recognition model provided by the second computing device.
- the probability of the relationship between at least one pair of elements in the language recognition model is adjusted using the probability of the relationship between the at least one pair of elements in the text segment.
- the second computing device may perform the relevant steps of adjusting the language recognition model in the above methods, such as steps 11, 21-22, 101-104, and 201-208, and provide the obtained language recognition model to the computing device that provided the text segment.
- Step 33 Input the voice to be recognized into a preset voice recognition model, where the voice recognition model includes the language recognition model.
- Step 34 Determine the sequence of multiple elements corresponding to the speech to be recognized according to the probability of the relationship between the elements in the language recognition model, as a speech recognition result.
- An embodiment of the present application also provides a voice recognition device.
- 4a is a schematic structural diagram of a voice recognition device according to an embodiment of the present application. As shown in FIG. 4a, the voice recognition device may include an adjustment module 41 and a voice recognition module 42.
- the adjustment module 41 may adjust the probability of the relationship between the at least one pair of elements in the language recognition model according to the probability of the relationship between the at least one pair of elements in the text segment.
- the speech recognition module 42 may input the speech to be recognized into a preset speech recognition model, the speech recognition model includes the language recognition model; according to the probability of the relationship between the elements in the language recognition model, determine the correspondence of the speech to be recognized The sequence of multiple elements as a result of speech recognition.
- the voice recognition device may be integrated in a network device such as a server. In some embodiments, the voice recognition device may be integrated in the terminal device. In other embodiments, the voice recognition apparatus may be implemented by components distributed in a plurality of physical devices. For example, the adjustment module 41 may be implemented by a first computing device, and the voice recognition module 42 may be implemented by a second computing device.
- the computing device may be any device with computing capability, such as a server or a terminal.
- the adjustment module 41 may include: a language model adjustment unit 411 and an excitation unit 404.
- the language model adjusting unit 411 may use the weight of an edge representing the relationship between a pair of elements in the first state diagram corresponding to the text segment to adjust the weight of the corresponding edge in the preset second state diagram, where the first state diagram is a state diagram of the language model of the text segment and the second state diagram is a state diagram of the basic language model.
- the excitation unit 404 may configure the weight of at least one edge in the modified second state diagram as the excitation weight of the corresponding edge in the language recognition model; the language recognition model is the language model obtained after pruning the basic language model.
- the voice recognition module 42 may include a recognition unit 405 and a result unit 406.
- the recognition unit 405 may input the speech to be recognized into a preset speech recognition model to obtain a word sequence path output by the speech recognition model, and the speech recognition model includes the language recognition model;
- the result unit 406 can select a target path from the word sequence paths according to the excitation weights of the edges in the language recognition model to obtain a speech recognition result.
- the language model adjustment unit 411 may include an update unit for finding, in the second state diagram, the edge with the same label as the edge as the target edge, and increasing the weight of the target edge according to the weight of the edge.
- the language model adjustment unit 411 may include a mapping unit for adding an edge corresponding to the edge in the second state diagram as a target edge, and setting the weight of the target edge according to the weight of the edge.
- the voice recognition device may include a loading unit 401, a keyword unit 402, an update unit 403, an excitation unit 404, a recognition unit 405, and a result unit 406.
- the loading unit 401 is configured to load a preset first state diagram and a second state diagram.
- the first state diagram is a state diagram of a key language model
- the second state diagram is a state diagram of a large language model.
- the first state diagram is the directed state diagram of the key language model, which records the directed connection relationships between nodes to describe the possible states of the text fragments in the key language model and the state transition paths.
- the key language model may be a language model constructed from preset text fragments, for example an n-gram language model, such as the third-order tri-gram.
- the second state diagram is a weighted directed state diagram of the large language model.
- the large language model can be a large-scale language model with rich corpus information and without pruning.
- edge weights with the same label in the first state diagram and the second state diagram may be different.
- the loading unit 401 may be specifically configured to: obtain a predetermined text segment, and train a key language model according to the text segment; construct a weighted finite-state transducer of the key language model, and obtain the state diagram indicated by the weighted finite-state transducer of the key language model as the first state diagram.
- the preset text segment may be related corpus in the field where the voice to be recognized is located, which can be flexibly configured according to specific needs. There can be one or more preset text fragments.
- the weighted finite-state transducer (Weighted Finite-State Transducer) may be referred to as WFST in this embodiment.
- WFST can recognize the entire path from the initial state to the end state of the word, and the state of the word can be understood as a node.
- the nodes are connected in order to form directed edges, and the edges have corresponding labels and weights.
- the label includes an input label and an output label, and the input label and the output label are the same.
- the weight represents the probability of an edge appearing in the entire path.
- the weight can be a probability value or can be calculated according to the probability value.
- the probability of the entire path can be calculated according to the weight or probability of each edge in the path.
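A minimal sketch of the weight/probability relationship described above (assuming, as is common in WFST toolkits, that edge weights are stored as negative log-probabilities; the edges and numbers are purely illustrative):

```python
import math

# Hypothetical path: a list of (input_label, output_label, weight) edges,
# where each weight is the negative log-probability of taking that edge.
path = [
    ("ni", "ni", 0.22),
    ("hao", "hao", 0.11),
    ("ma", "ma", 0.69),
]

# In the negative-log domain, the weight of the whole path is the sum of
# its edge weights, and the path probability is exp(-total_weight).
total_weight = sum(weight for _, _, weight in path)
path_probability = math.exp(-total_weight)

print(round(total_weight, 2))      # 1.02
print(round(path_probability, 3))  # 0.361
```

Summing in the log domain avoids the numerical underflow that multiplying many small probabilities would cause on long paths.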
- the loading unit 401 uses the text fragments as training corpus and inputs them into a tri-gram for training to obtain a key language model. Then, the loading unit 401 constructs a weighted finite-state transducer of the key language model. Thus, the loading unit 401 can acquire each node in the key language model WFST and the connection relationships between the nodes to obtain the state diagram indicated by the key language model WFST, and use the state diagram indicated by the key language model WFST as the first state diagram.
- the loading unit 401 may be specifically configured to: obtain a preset general corpus, and train a large language model based on the general corpus; construct a weighted finite-state transducer of the large language model, and obtain the state diagram indicated by the weighted finite-state transducer of the large language model as the second state diagram.
- the general corpus may be a large-scale corpus commonly used by people.
- the loading unit 401 inputs the general corpus into a preset language model, such as a second-order bi-gram (binary language model), for training, and obtains a large language model. Then, the loading unit 401 constructs a weighted finite-state transducer of the large language model. Thus, the loading unit 401 can acquire each node in the large language model WFST and the connection relationships between the nodes to obtain the state diagram indicated by the large language model WFST, and use the state diagram indicated by the large language model WFST as the second state diagram.
- the weight of the same edge in the key language model WFST is greater than its weight in the large language model WFST; thus, the weight of the same edge in the first state diagram is greater than its weight in the language recognition model.
- the loading unit 401 loads the first state diagram and the second state diagram at the same time.
- the keyword unit 402 is used to extract the reference edge in the first state diagram, and find the edge with the same label as the reference edge in the second state diagram as the target edge.
- an edge whose prefix path is the same and whose label is the same as the reference edge is regarded as the same edge, i.e., the target edge corresponding to the reference edge.
- the keyword unit 402 first extracts the reference edge from the first state diagram. For example, the starting node of the first state diagram may be obtained, and the reference edge may be obtained according to a preset traversal depth and starting node.
- the keyword unit 402 may be specifically configured to: determine the output edge of the starting node as the first reference edge; within a preset recursion depth, recurse on the first reference edge to obtain the recursive edges of the first reference edge; and, if the output label of a recursive edge is not a preset symbol, determine the recursive edge as a second reference edge.
- the starting node can be flexibly configured as needed.
- for example, if the first node in the first state diagram is the start node of the graph, the second node is a second-order state node, and the third node is a first-order node, then the third node in the first state diagram is used as its starting node.
- the recursion depth can be configured according to the order of the language model.
- the keyword unit 402 acquires the order of the key language model as the recursive depth.
- the speech recognition device configures the recursive depth as 3.
- the keyword unit 402 uses the output edge of the starting node as the first reference edge to find the same edge in the second state diagram.
- the keyword unit 402 continues to search for edges that can serve as reference edges in the first state diagram. Specifically, taking any first reference edge as an example, the keyword unit 402 recurses on the first reference edge within a preset recursion depth to obtain the recursive edges of the first reference edge; if the output label of a recursive edge is not a preset symbol, the recursive edge is determined as a second reference edge.
- the preset symbols are a preset end-of-sentence symbol and a back-off symbol.
- within the recursion depth of order 3, the keyword unit 402 uses the output edges of the end node of the first reference edge, and the output edges of those output edges, as the recursive edges; such a path contains 4 nodes in total.
- the keyword unit 402 detects whether the output label of each recursive edge is a preset symbol. If the output label of the recursive edge is not the preset end-of-sentence symbol or back-off symbol, the recursive edge is determined as a second reference edge, and the same edge needs to be found in the second state diagram. If the output label of the recursive edge is a preset end-of-sentence symbol or back-off symbol, the recursive edge is determined as a non-reference edge, and there is no need to find the same edge in the second state diagram.
- the keyword unit 402 obtains the output edges of the first reference edge, and the output edges of those output edges whose labels are not the preset symbols, as the second reference edges,
- the second reference edge may be used to update the weight of the same second target edge in the second state diagram.
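The recursive extraction of reference edges described above can be sketched roughly as follows (a simplified illustration; the adjacency-map representation, the `</s>` end-of-sentence label, and the `#0` back-off label are assumptions rather than the patent's actual data structures):

```python
# Adjacency map: node -> list of (end_node, label) edges. "</s>" marks
# end-of-sentence and "#0" marks back-off; both are preset symbols.
graph = {
    0: [(1, "weather")],
    1: [(2, "today"), (3, "</s>")],
    2: [(4, "rain"), (5, "#0")],
}

SKIP = {"</s>", "#0"}  # preset symbols: end-of-sentence and back-off

def collect_reference_edges(graph, start, depth):
    """Recursively collect edges within `depth` hops of `start`,
    skipping edges whose label is a preset symbol."""
    edges = []
    def recurse(node, remaining):
        if remaining == 0:
            return
        for end, label in graph.get(node, []):
            if label in SKIP:
                continue  # non-reference edge: no lookup in the second diagram
            edges.append((node, end, label))
            recurse(end, remaining - 1)
    recurse(start, depth)
    return edges

print(collect_reference_edges(graph, 0, 3))
# [(0, 1, 'weather'), (1, 2, 'today'), (2, 4, 'rain')]
```

With a recursion depth equal to the model order (3 here), the first edge out of the starting node is the first reference edge and the deeper qualifying edges are the second reference edges.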
- the keyword unit 402 traverses in the second state diagram to find the same target edge as the reference edge.
- the keyword unit 402 may be specifically configured to: in the second state diagram, find the edge with the same label as the first reference edge as the first target edge; and, among the recursive edges of the first target edge, find the edge with the same label as the second reference edge to obtain the second target edge.
- the keyword unit 402 searches for the same edge as the first reference edge label in the second state diagram.
- the same label may refer to the same input label and/or the same output label.
- the input label and the output label of the same edge in the state diagram are identical, so the keyword unit 402 may search for the edge with the same input label as the first reference edge, or the edge with the same output label as the first reference edge, or the edge with both the same input label and the same output label as the first reference edge.
- the keyword unit 402 determines the edge with the same label as the first reference edge as the first target edge corresponding to the first reference edge.
- the keyword unit 402 searches for the edge with the same label as the second reference edge among the recursive edges of the first target edge according to the preset recursive depth to obtain the second target edge.
- the same label may refer to the same input label and/or the same output label.
- the keyword unit 402 finds the first target edge that is the same as each first reference edge, and the second target edge that is the same as each second reference edge.
- the updating unit 403 is used to obtain the weight of the reference edge, and update the weight of the target edge according to the weight of the reference edge.
- the weight of the reference edge is described in the first state diagram, and the initial weight of the target edge is described in the second state diagram.
- the updating unit 403 can use the weight of the reference edge to replace the weight of the same target edge to achieve the update of the weight of the target edge.
- the update unit 403 may be specifically configured to: obtain a preset interpolation parameter and the initial weight of the target edge; calculate the target weight of the target edge according to the weight of the reference edge, the interpolation parameter, and the initial weight of the target edge; and use the target weight to replace the initial weight of the target edge in the second state diagram.
- the preset interpolation parameters can be flexibly configured according to actual needs.
- the update unit 403 acquires the initial weight of the target edge that is the same as the reference edge according to the second state diagram. Then, the updating unit 403 may calculate the target weight of the target edge according to the following formula:
- w_new = λ·w_k + (1 − λ)·w_old
- where w_new is the target weight of the target edge, w_old is the initial weight of the target edge, w_k is the weight of the reference edge, and λ is the interpolation coefficient.
- the update unit 403 uses the target weight of the target edge to replace the initial weight of the target edge in the second state diagram.
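The interpolated weight update can be sketched as follows (assuming the linear-interpolation form implied by the variable definitions; the edge keys and the value of `lam` are illustrative):

```python
def interpolate_weight(w_old, w_k, lam):
    """Target weight of a target edge: linear interpolation between its
    initial weight in the second state diagram (w_old) and the weight of
    the matching reference edge in the first state diagram (w_k)."""
    return (1 - lam) * w_old + lam * w_k

# Second state diagram, reduced to: edge label -> initial weight.
second_diagram = {"weather": 0.1, "today": 0.3}
# Reference edges extracted from the first (key language model) diagram.
reference = {"weather": 0.8}

lam = 0.5  # interpolation coefficient
for label, w_k in reference.items():
    if label in second_diagram:
        second_diagram[label] = interpolate_weight(second_diagram[label], w_k, lam)

print(second_diagram["weather"])  # 0.45 — boosted toward the reference weight
```

Edges without a matching reference edge (here "today") keep their initial weights.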
- the update unit 403 updates the weights of the target edges that are the same as the reference edges.
- the incentive unit 404 is configured to configure the updated weight of the target edge in the second state diagram as the incentive weight of the corresponding edge in the language recognition model, and the language recognition model is a language model after pruning the large language model.
- the language recognition model is a language model obtained by pruning the large language model.
- the incentive unit 404 may prune the large language model to obtain a language recognition model.
- the incentive unit 404 configures the updated weight of the target edge in the second state diagram as the incentive weight of the corresponding edge in the language recognition model, which can also be understood as configuring it as the incentive weight of the same edge in the language recognition model. Since the language recognition model is obtained by pruning the large language model, all edges in the language recognition model exist in the state diagram of the large language model. In the language recognition model, the incentive weight of an edge has a higher priority than its initial weight.
- the incentive unit 404 establishes the mapping relationship between the target edge in the second state diagram and the corresponding edge in the language recognition model, and then configures the target weight of the target edge as the incentive weight of the corresponding edge in the language recognition model.
- the recognition unit 405 is configured to input the speech to be recognized into a preset speech recognition model to obtain a word sequence path output by the speech recognition model.
- the speech recognition model includes a language recognition model.
- the recognition unit 405 can run simultaneously with the loading unit 401, and while enhancing the weight of text fragments in the language recognition model, perform speech recognition to achieve online speech recognition.
- the recognition unit 405 may also start running after the incentive unit 404 has finished, and use the language recognition model whose text-segment weights have been enhanced to perform word sequence path selection, to achieve offline speech recognition.
- the preset speech recognition model may be the HCLG model.
- H is a WFST constructed from an HMM (Hidden Markov Model), which can map HMM state numbers to triphones.
- C is a context WFST constructed by expanding a monophone into a triphone.
- L is a WFST constructed by a pronunciation dictionary, which can convert the input phonemes into words.
- G is a WFST constructed by a language recognition model, used to represent the probability relationship of the context of words.
- the recognition unit 405 inputs the speech to be recognized into the speech recognition model, and after steps such as phoneme recognition and phoneme-to-word conversion, inputs the word elements into the language recognition model WFST to obtain each word sequence path output by the language recognition model WFST.
- a word sequence path is composed of its edges in the hidden Markov model WFST, the context WFST, the pronunciation dictionary WFST, and the language recognition model WFST.
- the result unit 406 is used to select a target path in the word sequence path according to the excitation weights of the edges in the language recognition model to obtain a speech recognition result.
- the result unit 406 calculates the score of each word sequence path.
- the score of each word sequence path is calculated according to the weights of the edges in that path.
- the result unit 406 obtains each edge in its path, and one path includes its edges in hidden Markov model WFST, context WFST, pronunciation dictionary WFST, and language recognition model WFST.
- the result unit 406 obtains the weight of each side of the word sequence path in the hidden Markov model WFST, the context WFST, the pronunciation dictionary WFST, and the language recognition model WFST. Furthermore, it is detected whether the word sequence path has an incentive weight on the edge of the language recognition model WFST.
- the result unit 406 calculates the score of the word sequence path by summing or multiplying the weights of the edges in the word sequence path.
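Scoring word sequence paths with excitation weights taking priority over initial weights can be sketched as follows (a simplified log-domain illustration; the edge keys, weights, and additive scoring are assumptions):

```python
# Language-recognition-model edges on a word sequence path, with their
# initial weights (log-domain: larger, i.e. less negative, is better).
initial = {"weather": -2.3, "today": -1.6, "rain": -2.0}
# Excitation weights configured for boosted (text-fragment) edges; these
# take priority over the initial weights when present.
excitation = {"rain": -0.7}

def path_score(labels):
    """Sum edge weights, preferring an edge's excitation weight if set."""
    return sum(excitation.get(label, initial[label]) for label in labels)

boosted = path_score(["weather", "today", "rain"])
plain = initial["weather"] + initial["today"] + initial["rain"]
print(boosted > plain)  # True: the excited path outscores its un-excited form
```

Because excited edges contribute larger weights, paths that pass through text-fragment edges rank higher and are more likely to be selected as the recognition result.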
- the result unit 406 combines the word sequences according to the word sequence path with the highest score to obtain the text corresponding to the speech to be recognized, that is, the recognition result.
- the loading unit 401 of the embodiment of the present application loads the preset first state diagram and second state diagram.
- the first state diagram is a state diagram of a key language model
- the second state diagram is a state diagram of a large language model
- the keyword unit 402 extracts the reference edge in the first state diagram, and finds the edge with the same label as the reference edge in the second state diagram as the target edge
- the update unit 403 obtains the weight of the reference edge, and updates the weight of the target edge according to the weight of the reference edge
- the incentive unit 404 configures the updated weight of the target edge in the second state diagram as the incentive weight of the corresponding edge in the language recognition model
- the language recognition model is the language model obtained by pruning the large language model
- the recognition unit 405 inputs the speech to be recognized into a preset speech recognition model to obtain the word sequence paths output by the speech recognition model, the speech recognition model including the language recognition model
- the result unit 406 selects the target path from the word sequence paths according to the excitation weights in the language recognition model to obtain the speech recognition result. Since the corpus of the key language model is much smaller than the corpus of the large language model, the edge weight of the text segment in the first state diagram is greater than the weight of the same target edge in the second state diagram.
- this scheme uses the weights of the reference edges in the first state diagram to enhance the weights of the same target edges in the second state diagram, and thereby boosts the weights of the target edges in the speech recognition model, so that during speech recognition the weights of edges on paths containing text fragments in the language recognition model are increased, which increases the probability that a path containing the text segment is selected as the recognition result.
- the solution thus increases the probability of text fragments appearing in the speech recognition results, and improves the accuracy of the speech recognition results while ensuring the speed of speech recognition.
- the solution is also applicable to various theme scenes, and the text fragments of each theme scene can be used to improve the accuracy of the speech recognition result.
- the voice recognition apparatus may further include a mapping unit 407, a context unit 408, and a collection unit 409.
- the mapping unit 407 is configured to map the reference edge to the second state diagram to obtain the target edge if no edge with the same label as the reference edge is found in the second state diagram.
- the mapping unit 407 queries the serial number of the starting node of the first reference edge in the first state diagram, then finds the node corresponding to that serial number in the second state diagram, and uses this node as the starting node to create a virtual edge identical to the first reference edge as the first target edge, so as to realize the mapping of the first reference edge.
- if the mapping unit 407 does not find an edge with the same label as the second reference edge among the recursive edges of the first target edge, it uses the end node of the first target edge as the starting node to create a virtual edge with the same label as the second reference edge as the second target edge, so as to realize the mapping of the second reference edge.
- the initial weights of the first target edge and the second target edge obtained by the mapping may be preset values.
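The mapping of an unmatched reference edge into the second state diagram can be sketched as follows (hypothetical structures; the preset default weight and the node numbering are assumptions):

```python
DEFAULT_WEIGHT = 0.5  # preset initial weight for mapped (virtual) edges

# Second state diagram: node -> list of [end_node, label, weight] edges.
second_diagram = {7: [[8, "today", 0.3]]}

def map_reference_edge(diagram, start_node, label, next_node_id):
    """If no edge with the reference edge's label leaves start_node,
    create a virtual edge with that label as the target edge."""
    for edge in diagram.get(start_node, []):
        if edge[1] == label:
            return edge  # matching edge exists: no mapping needed
    edge = [next_node_id, label, DEFAULT_WEIGHT]
    diagram.setdefault(start_node, []).append(edge)
    return edge

edge = map_reference_edge(second_diagram, 7, "hailstorm", next_node_id=9)
print(edge)  # [9, 'hailstorm', 0.5] — virtual target edge with preset weight
```

The mapped edge then participates in weight updating and excitation exactly like a target edge that was found by label matching.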
- the context unit 408 is used to select the edge in the second state diagram with the same label as a word in the preset vocabulary as the starting edge; obtain the initial weight of the starting edge, and update the weight of the starting edge according to the preset scale factor and the initial weight of the starting edge; and configure the updated weight of the starting edge in the second state diagram as the incentive weight of the corresponding edge in the language recognition model.
- the first word obtained after the text segment is segmented is recorded in the preset vocabulary.
- the context unit 408 may specifically be used to perform word segmentation processing on text fragments, and configure the first word obtained by word segmentation into a preset word list.
- the context unit 408 separately performs word segmentation processing on the text fragments, and configures the first word obtained by word segmentation of each text fragment into a vocabulary.
- the context unit 408 selects the edge whose label is the same as a word in the preset vocabulary as the starting edge.
- the context unit 408 may use the following formula to calculate the target weight of the starting edge:
- w_new = l·w_old
- where w_new is the target weight of the starting edge, w_old is the initial weight of the starting edge, and l is the preset scale factor.
- the context unit 408 replaces its initial weight with the target weight of the starting edge to implement the update of the starting edge weight.
- the context unit 408 enhances the weight of the starting edge in the second state diagram.
- after obtaining the starting edge and its updated weight, the context unit 408 searches the language recognition model for the edge with the same label as the starting edge and establishes a mapping relationship; in turn, the target weight of the starting edge is configured as the incentive weight of the corresponding edge in the language recognition model.
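The context-enhancement steps above (building the first-word vocabulary and boosting starting edges) can be sketched as follows (illustrative only; the fragments, scale factor, and diagram representation are assumptions):

```python
SCALE = 2.0  # preset scale factor l in w_new = l * w_old

# First word of each segmented text fragment goes into the vocabulary.
fragments = [["beijing", "weather"], ["shanghai", "stock"]]
vocabulary = {words[0] for words in fragments}

# Second state diagram, reduced to: edge label -> weight.
second_diagram = {"beijing": 0.2, "weather": 0.4, "shanghai": 0.1}

# Boost every starting edge: an edge whose label is in the vocabulary.
for label in second_diagram:
    if label in vocabulary:
        second_diagram[label] *= SCALE  # w_new = l * w_old

print(second_diagram["beijing"], second_diagram["shanghai"])  # 0.4 0.2
```

Boosting only the first word of each fragment makes the decoder more likely to enter a fragment's path at all, after which the interpolated edge weights carry it through the rest of the fragment.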
- the collection unit 409 is used for collecting voice to be recognized in real time.
- the collection unit 409 collects the voice to be recognized in real time, and performs online voice recognition.
- this application uses the weight of the text fragment path in the key language model to enhance the weight of the text fragment path in the language recognition model, improve the probability of the text fragment appearing in the recognition result, and improve the accuracy of the speech recognition result.
- the target edge is added to the second state diagram by means of mapping, so that in speech recognition the mapped target edge can be used to increase the probability that the text segment appears in the recognition result.
- context enhancement is realized, thereby increasing the probability that the text segment is found during language recognition, that is, the probability that the word sequence enters the path of the text segment.
- the present embodiment improves the accuracy of voice recognition from various aspects.
- FIG. 5a shows a schematic structural diagram of the voice recognition device involved in the embodiment of the present application, specifically speaking:
- the voice recognition device may include a processor 501 with one or more processing cores, a memory 502 with one or more computer-readable storage media, a power supply 503, an input unit 504, and other components.
- a processor 501 with one or more processing cores
- a memory 502 with one or more computer-readable storage media
- a power supply 503
- the processor 501 is the control center of the voice recognition device, and uses various interfaces and lines to connect the various parts of the entire voice recognition device. By running or executing the software programs and/or modules stored in the memory 502, and calling the data stored in the memory 502, it performs the various functions of the voice recognition device and processes data, so as to monitor the voice recognition device as a whole.
- the processor 501 may include one or more processing cores; preferably, the processor 501 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interface, and application programs, and the modem processor mainly handles wireless communication. It can be understood that the foregoing modem processor may not be integrated into the processor 501.
- the memory 502 may be used to store software programs and modules.
- the processor 501 executes various functional applications and data processing by running the software programs and modules stored in the memory 502.
- the memory 502 may mainly include a program storage area and a data storage area, where the program storage area may store the operating system and at least one application program required by a function (such as a voice recognition function), and the data storage area may store data created according to the use of the device.
- the memory 502 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. Accordingly, the memory 502 may further include a memory controller to provide the processor 501 with access to the memory 502.
- the voice recognition device may further include an input unit 504, which may be used to receive inputted numeric or character information. The user can use the input unit 504 to input text fragments.
- the voice recognition device may further include a display unit and the like, which will not be repeated here.
- the processor 501 in the voice recognition device loads the executable files corresponding to the processes of one or more application programs into the memory 502 according to the following instructions, and the processor 501 runs the application programs stored in the memory 502, thereby implementing various functions, as follows:
- load a preset first state diagram and a second state diagram, where the first state diagram is the state diagram of the key language model and the second state diagram is the state diagram of the large language model; extract the reference edge in the first state diagram, and find the edge with the same label as the reference edge in the second state diagram as the target edge; obtain the weight of the reference edge, and update the weight of the target edge according to the weight of the reference edge; configure the updated weight of the target edge in the second state diagram as the excitation weight of the corresponding edge in the language recognition model, where the language recognition model is the language model obtained by pruning the large language model; input the speech to be recognized into a preset speech recognition model to obtain the word sequence path output by the speech recognition model, where the speech recognition model includes the language recognition model; and select the target path from the word sequence paths according to the excitation weights of the edges in the language recognition model to obtain the speech recognition result.
- the processor 501 runs the application program stored in the memory 502, and can also implement the following functions:
- the reference edge is mapped to the second state diagram to obtain the target edge.
- the processor 501 runs the application program stored in the memory 502, and can also implement the following functions:
- in the second state diagram, select the edge with the same label as a word in the preset vocabulary as the starting edge; obtain the initial weight of the starting edge, and update the weight of the starting edge according to the preset scale factor and the initial weight of the starting edge; and configure the updated weight of the starting edge in the second state diagram as the excitation weight of the corresponding edge in the language recognition model.
- the voice recognition device may further include a voice collection device 505, such as a microphone, for collecting voice to be recognized in real time.
- a voice collection device 505 such as a microphone
- an embodiment of the present application provides a storage medium in which multiple instructions are stored, and the instruction can be loaded by a processor to perform steps in any of the speech recognition methods provided in the embodiments of the present application.
- the instruction can perform the following steps:
- load a preset first state diagram and a second state diagram, where the first state diagram is the state diagram of the key language model and the second state diagram is the state diagram of the large language model; extract the reference edge in the first state diagram, and find the edge with the same label as the reference edge in the second state diagram as the target edge; obtain the weight of the reference edge, and update the weight of the target edge according to the weight of the reference edge; configure the updated weight of the target edge in the second state diagram as the excitation weight of the corresponding edge in the language recognition model, where the language recognition model is the language model obtained by pruning the large language model; input the speech to be recognized into a preset speech recognition model to obtain the word sequence path output by the speech recognition model, where the speech recognition model includes the language recognition model; and select the target path from the word sequence paths according to the excitation weights of the edges in the language recognition model to obtain the speech recognition result.
- the instruction can also perform the following steps:
- the reference edge is mapped to the second state diagram to obtain the target edge.
- the instruction can also perform the following steps:
- in the second state diagram, select the edge with the same label as a word in the preset vocabulary as the starting edge; obtain the initial weight of the starting edge, and update the weight of the starting edge according to the preset scale factor and the initial weight of the starting edge; and configure the updated weight of the starting edge in the second state diagram as the excitation weight of the corresponding edge in the language recognition model.
- the storage medium may include: read only memory (ROM, Read Only Memory), random access memory (RAM, Random Access Memory), magnetic disk or optical disk, etc.
Abstract
The embodiments of the present application disclose a speech recognition method, apparatus, device, and storage medium. In the embodiments of the present application, the probability of the relationship between at least one pair of elements in a language recognition model is adjusted according to the probability of the relationship between the at least one pair of elements in a text fragment; speech to be recognized is input into a preset speech recognition model, the speech recognition model including the language recognition model; and, according to the probabilities of the relationships between elements in the language recognition model, a sequence of multiple elements corresponding to the speech to be recognized is determined as the speech recognition result. This solution increases the probability that text fragments appear in the speech recognition result, and improves the accuracy of the speech recognition result while ensuring the speed of speech recognition.
Description
This application claims priority to Chinese patent application No. 201811508402.7, filed with the China National Intellectual Property Administration on December 11, 2018 and entitled "Speech recognition method, apparatus, device and storage medium", the entire contents of which are incorporated herein by reference.
This application relates to computer technology, and in particular to a speech recognition method, apparatus, device, and storage medium.
Background
Speech recognition technology can convert human speech into corresponding characters or codes, and is widely used in fields such as smart homes and real-time speech transcription. Based on the speech a person utters, a decoder searches for the best word sequence in a search space composed of knowledge sources such as an acoustic model, a dictionary, and a language model; combining the resulting word sequence yields the text description corresponding to the speech, i.e., the recognition result.
At present, the language recognition model used in speech recognition is usually obtained by pruning a large language model, and provides the decoder with word search paths at the language layer. The pruned language model has a small amount of data and relatively scarce information; although it can appropriately increase the speed of speech recognition, it reduces accuracy.
Summary
Embodiments of the present application provide a speech recognition method, apparatus, device, and storage medium, aiming to improve the accuracy of speech recognition.
An embodiment of the present application provides a speech recognition method, including:
adjusting, according to the probability of the relationship between at least one pair of elements in a text fragment, the probability of the relationship between the at least one pair of elements in a language recognition model;
inputting speech to be recognized into a preset speech recognition model, the speech recognition model including the language recognition model;
determining, according to the probabilities of the relationships between elements in the language recognition model, a sequence of multiple elements corresponding to the speech to be recognized, as the speech recognition result.
In some embodiments, the method may include:
loading a preset first state diagram and a second state diagram, the first state diagram being the state diagram of a keyword language model, and the second state diagram being the state diagram of a large language model;
extracting a reference edge in the first state diagram, and finding, in the second state diagram, the edge with the same label as the reference edge, as a keyword edge;
obtaining the weight of the reference edge, and updating the weight of the keyword edge according to the weight of the reference edge;
configuring the updated weight of the keyword edge in the second state diagram as the excitation weight of the corresponding edge in a language recognition model, the language recognition model being the language model obtained by pruning the large language model;
inputting speech to be recognized into a preset speech recognition model to obtain the word sequence paths output by the speech recognition model, the speech recognition model including the language recognition model;
selecting a target path from the word sequence paths according to the excitation weights of the edges in the language recognition model, to obtain a speech recognition result.
In some embodiments, extracting the reference edge in the first state diagram includes:
obtaining the starting node of the first state diagram, and determining the reference edge according to a preset traversal depth and the starting node.
In some embodiments, obtaining the reference edge according to the preset traversal depth and the starting node includes:
determining the output edge of the starting node as a first reference edge;
recursing on the first reference edge within a preset recursion depth to obtain the recursive edges of the first reference edge;
if the output label of a recursive edge is not a preset symbol, determining the recursive edge as a second reference edge.
在一些实施例中,在所述第二状态图中查找与所述基准边标签相同的边,作为关键词边,包括:
在所述第二状态图中，查找与所述第一基准边标签相同的边，作为第一关键词边；
在所述第一关键词边的递归边中,查找与所述第二基准边标签相同的边,作为第二关键词边。
在一些实施例中,所述根据基准边的权重更新所述关键词边的权重,包括:
获取预设的插值参数及所述关键词边的初始权重;
根据所述基准边的权重、插值参数和关键词边的初始权重,计算得到关键词边的目标权重;
使用所述目标权重,替换所述第二状态图中所述关键词边的初始权重。
在一些实施例中,所述方法还包括:
若在所述第二状态图中未找到与所述基准边标签相同的边,则将所述基准边映射到所述第二状态图中,得到关键词边。
在一些实施例中,所述方法还包括:
在所述第二状态图中,筛选出标签与预设词表中的词相同的边,作为关键词起始边;
获取所述关键词起始边的初始权重,根据预设的比例系数和所述关键词起始边的初始权重,更新所述关键词起始边的权重;
将所述第二状态图中关键词起始边更新后的权重,配置为语言识别模型中对应边的激励权重。
在一些实施例中,所述在所述第二状态图中,筛选出标签与预设词表中的词相同的边,作为关键词起始边之前,包括:
对所述关键词进行分词处理,将分词得到的第一个词配置到预设的词表中。
在一些实施例中,所述方法还包括:
获取预设的关键词,根据所述关键词训练关键词语言模型;
构建所述关键词语言模型的加权有限状态转换器,获取所述关键词语言模型加权有限状态转换器指示的状态图为第一状态图。
在一些实施例中,所述方法还包括:
获取预设的通用语料,根据所述通用语料训练大语言模型;
构建所述大语言模型的加权有限状态转换器,获取所述大语言模型加权有限状态转换器指示的状态图为第二状态图。
在一些实施例中,所述方法还包括:
实时采集待识别语音。
本申请实施例的另一种语音识别方法可以包括:
将文本片段提供给第二计算设备;
获取所述第二计算设备提供的语言识别模型,所述语言识别模型中至少一对元素间关系的概率利用所述文本片段中所述至少一对元素间关系的概率进行了调整;
将待识别语音输入预设的语音识别模型,所述语音识别模型包括所述语言识别模型;
根据所述语言识别模型中各元素间关系的概率,确定所述待识别语音对应的多个元素的序列,作为语音识别结果。
本申请实施例还提供一种语音识别装置,包括:
调整模块,用于根据文本片段中至少一对元素间关系的概率,调整语言识别模型中所述至少一对元素间关系的概率;
语音识别模块,用于将待识别语音输入预设的语音识别模型,所述语音识别模型包括所述语言识别模型;根据所述语言识别模型中各元素间关系的概率,确定所述待识别语音对应的多个元素的序列,作为语音识别结果。
一些实施例中,该语音识别装置可以包括:
加载单元,用于加载预设的第一状态图和第二状态图,所述第一状态图为关键词语言模型的状态图,所述第二状态图为大语言模型的状态图;
关键词单元,用于在所述第一状态图中提取基准边,在所述第二状态图中查找与所述基准边标签相同的边,作为关键词边;
更新单元,用于获取所述基准边的权重,根据基准边的权重更新所述关键词边的权重;
激励单元,用于将所述第二状态图中关键词边更新后的权重,配置为语言识别模型中对应边的激励权重,所述语言识别模型为所述大语言模型剪枝后的语言模型;
识别单元,用于将待识别语音输入预设语音识别模型,得到所述语音识别模型输出的词序列路径,所述语音识别模型包括所述语言识别模型;
结果单元,用于根据所述语言识别模型中边的激励权重,在所述词序列路径中选出目标路径,得到语音识别结果。
本申请实施例还提供一种语音识别设备,包括:存储器、处理器及存储在所述存储器上并可在所述处理器上运行的语音识别程序,所述语音识别程序被所述处理器执行时实现本申请实施例所提供的任一语音识别方法的步骤。
在一些实施例中,所述设备还包括语音采集装置,所述语音采集装置用于实时采集待识别语音。
本申请实施例还提供一种存储介质,所述存储介质存储有多条指令,所述指令适于处理器进行加载,以执行本申请实施例所提供的任一语音识别方法的步骤。
本申请实施例加载预设的第一状态图和第二状态图，第一状态图为关键词语言模型的状态图，第二状态图为大语言模型的状态图；在第一状态图中提取基准边，在第二状态图中查找与基准边标签相同的边，作为关键词边；获取基准边的权重，根据基准边的权重更新关键词边的权重；将第二状态图中关键词边更新后的权重，配置为语言识别模型中对应边的激励权重，语言识别模型为大语言模型剪枝后的语言模型；将待识别语音输入预设语音识别模型，得到语音识别模型输出的词序列路径，语音识别模型包括语言识别模型；根据语言识别模型中边的激励权重，在词序列路径中选出目标路径，得到语音识别结果。由于关键词语言模型的语料远小于大语言模型的语料，因此，第一状态图中关键词的边权重大于第二状态图中同一关键词边的权重。该方案使用第一状态图关键词边的权重，增强第二状态图中同一关键词边的权重，进而激励语音识别模型中关键词边的权重，从而在语音识别时，提高语言识别模型中包含关键词的路径中边的权重，进而提高包含关键词的路径作为识别结果的概率。由此，该方案提高了语音识别结果中关键词出现的概率，在保障语音识别速度的同时，提升了语音识别结果的准确性。并且，该方案还适用于各种主题场景，可以利用各主题场景的关键词来提高语音识别结果的准确性。
附图简要说明
为了更清楚地说明本申请实施例中的技术方案,下面将对实施例描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的一些实施例,对于本领域技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1a是本申请实施例的信息交互系统的场景示意图;
图1b是本申请实施例的语音识别方法的流程示意图;
图1c是本申请实施例的语音识别方法的流程示意图;
图1d是本申请实施例的语音识别方法的流程示意图;
图2是本申请实施例的另一语音识别方法的流程示意图;
图3a是本申请实施例的第一状态图示意图;
图3b是本申请实施例的第二状态图示意图;
图3c是本申请实施例的另一第二状态图示意图;
图3d是本申请实施例的语音识别方法的流程示意图;
图4a是本申请实施例的语音识别装置的结构示意图;
图4b是本申请实施例的语音识别装置的结构示意图;
图4c是本申请实施例的语音识别装置的结构示意图;
图4d是本申请实施例的另一语音识别装置的结构示意图;
图5a是本申请实施例的语音识别设备的结构示意图;
图5b是本申请实施例的语音识别设备的结构示意图。
实施本申请的方式
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域技术人员在没有作出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
本申请实施例提供一种语音识别方法、装置、设备及存储介质。
本申请实施例提供一种信息交互系统，该系统包括本申请任一实施例提供的语音识别装置，以及例如服务器、终端等其它设备。该语音识别装置可以集成在服务器或终端等设备中。终端可以是移动终端或个人计算机(PC, Personal Computer)等设备。
参考图1a,本申请实施例提供一种信息交互系统,包括服务器和终端。一些实施例中,语音识别装置可以集成在该服务器中。一些实施例中,语音识别装置也可以集成在终端中。语音识别装置可以执行各实施例的语音识别方法。图1b是本申请实施例提供的语音识别方法的流程示意图。如图1b所示,该方法可以包括以下步骤。
步骤11,根据一文本片段中至少一对元素间关系的概率,调整语言识别模型中所述至少一对元素间关系的概率。
本文中,文本片段(textual segment)是指作为整体具有特定含义的一段文字。文本片段通常包括复数个语素,可以是,例如术语、词组、文本表达(textual expression),等。步骤11中使用的文本片段是指需要在语音识别中提高识别率的一段文字,后文也称为关键词(key phrase)。
步骤12,将待识别语音输入预设的语音识别模型,所述语音识别模型包括所述语言识别模型。
步骤13,根据所述语言识别模型中各元素间关系的概率,确定所述待识别语音对应的多个元素的序列,作为语音识别结果。
由于在给定文本片段的情况下，其中各元素间的关系必然比这些元素在语言识别模型所基于的基本语料库中的关系更密切，因此，使用这些元素间关系在给定文本片段中的概率，去调整语言识别模型中这些元素间关系的概率，可以在语音识别时提高语言识别模型对该文本片段的识别率，进而提高该文本片段的语音识别率。
在步骤11中,文本片段中至少一对元素间关系的概率可以通过自然语言处理技术获得。例如,可以通过建立该文本片段的语言模型来获得该概率,此时,语音识别方法可以如图1c所示,包括以下步骤。
步骤21,利用所述文本片段对应的第一状态图中表示一对元素间关系的一条边的权重调整预设的第二状态图中与所述边对应的边的权重。
其中,所述第一状态图为所述文本片段的语言模型的状态图,所述第二状态图为基础语言模型的状态图。
其中,第一状态图即为文本片段对应的语言模型(下文简称关键语言模型,或关键词语言模型)的加权有向状态图,其中记载了各个节点和节点之间的有向连接关系,以描述关键语言模型中关键词对象的可能状态以及状态的转移路径。其中,关键词对象是指文本片段中的语言元素。节点即为关键词对象的状态,节点根据次序连接形成有向边,边连接形成关键词的转移路径,每条路径即为关键词的词序列路径,包含了关键词对象及关键词对象的输出顺序。
关键语言模型可以是根据预设文本片段构建的语言模型，例如n-gram(n元语言模型)。本实施例中，以n为3，关键语言模型为三阶的tri-gram(三元语言模型)为例进行说明，也即关键语言模型中第3个词的出现只与前2个词相关，与其他任何词不相关。
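为便于理解三元模型“第3个词的出现只与前2个词相关”的条件概率，下面给出一个示意性的Python片段（并非本申请的实现，语料与函数名均为说明用的假设），以最大似然估计演示：

```python
from collections import defaultdict

def train_trigram(corpus):
    # 统计二元/三元计数，用最大似然估计 P(w3 | w1, w2)
    bigram, trigram = defaultdict(int), defaultdict(int)
    for sent in corpus:
        toks = ["<s>", "<s>"] + sent + ["</s>"]
        for i in range(2, len(toks)):
            bigram[(toks[i - 2], toks[i - 1])] += 1
            trigram[(toks[i - 2], toks[i - 1], toks[i])] += 1

    def prob(w1, w2, w3):
        denom = bigram[(w1, w2)]
        return trigram[(w1, w2, w3)] / denom if denom else 0.0

    return prob

# 以文本片段“张俊 岐”为训练语料：第3个词只依赖前2个词
prob = train_trigram([["张俊", "岐"]])
assert prob("<s>", "张俊", "岐") == 1.0
```

该片段仅演示条件概率的定义，实际实现中通常还会配合平滑与回退策略。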
第二状态图可以是预设的基础语言模型（也称为大语言模型）的有向状态图，记载了各个节点和节点的有向连接关系，以描述基础语言模型中词对象的可能状态以及状态的转移路径。基础语言模型可以为语料信息丰富且未经过剪枝的大规模语言模型。其中，节点即为词对象的状态，节点根据次序连接形成有向边，边连接形成词的转移路径，每条路径即为词的词序列路径，包含了词对象及词对象的输出顺序。第二状态图中，每条边有对应的标签和权重。其中，标签包括输入标签和输出标签，输入标签和输出标签相同，即为词对象；权重表征了边出现在转移路径中的概率，权重可以是概率值，也可以根据概率值计算得到。
由于语言模型的不同,第一状态图和第二状态图中标签相同的边权重可能不同。
一些实施例中，该调整步骤可以包括：在所述第一状态图中提取所述边作为基准边，在所述第二状态图中查找与所述基准边标签相同的边，作为目标边；获取所述基准边的权重，根据基准边的权重更新所述目标边的权重。该实施例的具体方法将在下面结合图1d进行说明。
一些实施例中,该调整步骤可以包括:在所述第二状态图中增加与所述边对应的边,作为目标边;根据所述边的权重设置所述目标边的权重。该实施例的具体方法将在下面结合图2进行说明。
步骤22,将修改后的所述第二状态图中至少一条边的权重,配置为所述语言识别模型中对应边的激励权重。
所述语言识别模型为所述基础语言模型剪枝后的语言模型。
步骤S23,将待识别语音输入预设语音识别模型,得到所述语音识别模型输出的词序列路径。
其中,所述语音识别模型包括所述语言识别模型。
步骤S24,根据所述语言识别模型中边的激励权重,在所述词序列路径中选出目标路径,得到语音识别结果。
关键语言模型如前文所述,步骤21中调整第二状态图中边的权重的方法可以是对第二状态图中已有的相应的边的权重进行调整,此时,语音识别方法可以如图1d所示,包括以下步骤。
101、加载预设的第一状态图和第二状态图,第一状态图为文本片段对应的语言模型的状态图,第二状态图为基础语言模型的状态图。
一些实施例中，语音识别装置可以根据文本片段训练语言模型，获取第一状态图。其中，文本片段可以是待识别语音所在领域的相关语料，具体可根据需要灵活配置。文本片段可以有一个或多个。一些实施例中，当语音识别装置设置在服务器中时，文本片段可以是用户使用终端输入或选择的需要增强的文本片段，并由终端将用户输入的文本片段发送给服务器；或者是由用户直接在服务器中输入或选择的文本片段。另一些实施例中，语音识别装置也可以从指定的（本地或远端的）存储位置获取文本片段。
一些实施例中,语音识别装置可以获取预设的文本片段,根据文本片段训练关键语言模型;构建关键语言模型的加权有限状态转换器,获取关键语言模型加权有限状态转换器指示的状态图为第一状态图。
加权有限状态转换器为Weighted Finite-State Transducers,本实施例中可简称为WFST。WFST能够识别从词的初始状态到结束状态的整条路径,词的状态可以理解为节点。而节点根据次序连接形成有向边,边有对应的标签和权重。其中,标签包括输入标签和输出标签,输入标签和输出标签相同。权重表征了边出现在整条路径中的概率,权重可以是概率值,也可以根据概率值计算得到。整条路径的概率可以根据路径中各个边的权重或概率计算得到。
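上述“边的权重可根据概率值计算得到、整条路径的概率可根据各边权重计算”的关系，可用如下示意性Python片段说明（类名、节点与概率取值均为说明用的假设，并非WFST库的实际接口）：

```python
import math

class Edge:
    # 一条边：起点、终点、标签（输入标签与输出标签相同）、权重
    def __init__(self, src, dst, label, prob):
        self.src, self.dst, self.label = src, dst, label
        self.weight = math.log(prob)  # 权重取概率的对数值

def path_log_prob(edges):
    # 整条路径的概率（对数域）由路径中各边的权重累加得到
    return sum(e.weight for e in edges)

# 一条由两条边组成的词序列路径
path = [Edge(3, 8, "张俊", 0.5), Edge(8, 9, "岐", 0.5)]
assert abs(path_log_prob(path) - math.log(0.25)) < 1e-9
```

在对数域中，路径概率的乘积转化为权重的加和，这也是后文按边权重加和计算路径得分的依据。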
语音识别装置将文本片段作为训练语料,输入tri-gram进行训练,得到关键语言模型。然后,语音识别装置构建关键语言模型的加权有限状态转换器。由此,语音识别装置可以获取关键语言模型WFST中的各个节点,及节点之间的连接关系,得到关键语言模型WFST指示的状态图,将关键语言模型WFST指示的状态图作为第一状态图。
在一些实施例中,语音识别装置可以获取预设的通用语料,根据通用语料训练基础语言模型;构建基础语言模型的加权有限状态转换器,获取基础语言模型加权有限状态转换器指示的状态图为第二状态图。其中,通用语料可以是常用的大规模语料。
语音识别装置将通用语料输入预设的语言模型，例如二阶的bi-gram(二元语言模型)，进行训练，得到基础语言模型。然后，语音识别装置构建基础语言模型的加权有限状态转换器。由此，语音识别装置可以获取基础语言模型WFST中的各个节点，及节点之间的连接关系，得到基础语言模型WFST指示的状态图，将基础语言模型WFST指示的状态图作为第二状态图。
由于关键语言模型WFST中的文本片段数量远小于基础语言模型WFST中的语料数量,因此,相同的边在关键语言模型WFST中的权重,大于其在基础语言模型WFST中的权重,由此,相同的边在第一状态图中的权重大于其在语言识别模型中的权重。
在进行语音识别前,或是在进行语音识别的过程中,语音识别装置同时加载第一状态图和第二状态图。
102、在第一状态图中提取基准边,在第二状态图中查找与基准边标签相同的边,作为目标边。
基准边是指第一状态图中的边。一些实施例中,可以选择输出标签与预设关键词有关的边作为基准边。另一些实施例中,也可以将第一状态图中所有的边分别作为基准边,并执行后续步骤。
第一状态图中,每条边有对应的标签和权重。其中,标签包括输入标签和输出标签,输入标签和输出标签相同,即为关键词对象;权重表征了边出现在转移路径中的概率,权重可以是概率值,也可以根据概率值计算得到。以第一状态图中任意一条边为例,以10为底或以e为底,对该边的概率取对数(log)值,将计算得到的对数值作为该边的权重。
其中,若基准边包括前缀路径,则前缀路径相同,且标签相同的边即为与基准边相同的目标边。
语音识别装置首先从第一状态图中提取出基准边,例如,可以获取第一状态图的起始节点,根据预设的遍历深度和起始节点获取基准边。
在一些实施例中,步骤“获取第一状态图的起始节点,根据预设的遍历深度和起始节点获取基准边”可以包括:将起始节点的输出边确定为第一基准边;在预设的递归深度内,对第一基准边进行递归,获取第一基准边的递归边;若递归边的输出标签不是预设符号,则将递归边确定为第二基准边。
其中，起始节点可以根据需要灵活配置。例如，本实施例中，第一状态图中的第一个节点为开始节点，第二个节点为二阶状态节点，第三个节点为一阶节点，因此，可以将第一状态图的第三个节点作为其起始节点。
递归深度可根据语言模型的阶数配置。例如,语音识别装置获取关键语言模型的阶数,作为递归深度。本实施例中,以关键语言模型的阶数为三阶举例,则语音识别装置将递归深度配置为3。
语音识别模型将起始节点的输出边作为第一基准边,以在第二状态图中查找相同的边。
然后,语音识别模型根据递归深度,继续查找第一状态图中可作为基准边的边。具体地,以任一第一基准边为例,语音识别模型将在预设的递归深度内,对第一基准边进行递归,获取第一基准边的递归边;若递归边的输出标签不是预设符号,则将递归边确定为第二基准边。
其中,预设符号为预设的语句结束符号和回退符号。
例如,递归深度为3,则语音识别模型将第一基准边终点节点的输出边,以及该输出边的输出边,作为3阶内的递归边,共包含4个节点。
在得到递归边后,语音识别模型检测递归边的输出标签,是否为预设符号。若递归边的输出标签不是预设的语句结束符号或回退符号,则将该递归边确定为第二基准边,需要在第二状态图中查找与其相同的边。若递归边的输出标签是预设的语句结束符号或回退符号,则将该递归边确定为非基准边,不需要在第二状态图中查找与其相同的边。
需要说明的是，以起始节点的任一输出边为例，若该输出边的输出标签为预设的回退符号，则忽略该输出边，将其作为不需要增强权重的第一基准边，不对第二状态图中与其相同的第一目标边做权重更新。然后，语音识别装置获取该第一基准边的输出边，将其中输出标签不是预设符号的边作为第二基准边，该第二基准边可以用来对第二状态图中与其相同的第二目标边做权重更新。
在得到基准边后，语音识别装置在第二状态图中遍历，查找与基准边相同的目标边。
例如，步骤“在第二状态图中查找与基准边标签相同的边，作为目标边”可以包括：在第二状态图中，查找与第一基准边标签相同的边，作为第一目标边；在第一目标边的递归边中，查找与第二基准边标签相同的边，作为第二目标边。
以任一第一基准边为例，语音识别装置在第二状态图中，查找与第一基准边标签相同的边。其中，标签相同可以指输入标签相同和/或输出标签相同。由于本实施例中，状态图中同一条边的输入标签和输出标签相同，因此，语音识别装置可以是查找与第一基准边的输入标签相同的边，或是查找与第一基准边的输出标签相同的边，或是查找与第一基准边输入标签相同且输出标签相同的边。
语音识别装置将与第一基准边标签相同的边,确定为与第一基准边相同的第一目标边。
然后，语音识别装置根据预设的递归深度，在该第一目标边的递归边中，查找与第二基准边标签相同的边，得到第二目标边。其中，标签相同可以指输入标签相同和/或输出标签相同。
由此,语音识别装置分别找到与各第一基准边相同的第一目标边,以及与各第二基准边相同的第二目标边。
103、获取基准边的权重,根据基准边的权重更新目标边的权重。
其中,第一状态图中记载了基准边的权重,第二状态图中记载了目标边的初始权重。
以任一基准边为例,语音识别装置可以使用基准边的权重,替换与其相同的目标边的权重,实现对目标边权重的更新。
在一些实施例中,步骤“根据基准边的权重更新目标边的权重”可以包括:获取预设的插值参数及目标边的初始权重;根据基准边的权重、插值参数和目标边的初始权重,计算得到目标边的目标权重;使用目标权重,替换第二状态图中目标边的初始权重。
其中,预设的插值参数可根据实际需要灵活配置。
语音识别装置根据第二状态图,获取与基准边相同的目标边的初始权重。然后,语音识别装置可根据如下公式,计算目标边的目标权重。
w_new = log((1-lambda)×e^(w_old) + lambda×e^(w_k))；
其中，w_new为目标边的目标权重，w_old为目标边的初始权重，w_k为基准边的权重，lambda为插值系数。
然后,语音识别装置使用目标边的目标权重,替换掉第二状态图中该目标边的初始权重。
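上述插值更新可用如下示意性Python片段说明（假设权重为概率的自然对数值，插值系数lambda取0.1，均为举例，并非本申请限定的取值）：

```python
import math

def boost_weight(w_old, w_k, lam=0.1):
    # 对数域插值：w_new = log((1-lambda)·e^(w_old) + lambda·e^(w_k))
    # w_old 为目标边初始权重，w_k 为基准边权重，lam 为插值系数（取值为假设）
    return math.log((1 - lam) * math.exp(w_old) + lam * math.exp(w_k))

# 基准边权重越大（概率越高），目标边更新后的权重越得到增强
assert abs(boost_weight(-16.8, 0.0) - (-2.3)) < 0.01
```

当基准边概率远高于目标边初始概率时，更新后的权重主要由基准边一项决定，目标边因此被显著增强。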
若有多个基准边,则语音识别装置分别更新与各基准边相同的目标边的权重。
104、将第二状态图中目标边更新后的权重,配置为语言识别模型中对应边的激励权重,语言识别模型为基础语言模型剪枝后的语言模型。
其中，语言识别模型是对基础语言模型进行剪枝得到的语言模型。语音识别装置可以对基础语言模型进行剪枝处理，得到语言识别模型。例如，使用entropy-based(基于熵)的剪枝或是rank-based(基于秩)的剪枝，剪掉大语言模型中不重要的分支路径，使剪枝后的语言识别模型与剪枝前的基础语言模型最大相似化，从而在压缩模型数据量的同时，降低对路径概率的影响。
第二状态图中目标边的权重更新后,语音识别装置将第二状态图中目标边更新后的权重,配置为语言识别模型中对应边的激励权重,也可理解为配置为语言识别模型中相同边的激励权重。由于语言识别模型是经由对基础语言模型剪枝得到的,因此,语言识别模型中的各边均存在于基础语言模型的状态图中。语言识别模型中,边的激励权重优先级高于其初始权重。
例如,语音识别装置建立第二状态图中目标边和语言识别模型中对应边的映射关系,进而将目标边的目标权重配置为语言识别模型中对应边的激励权重。
本实施例不需要对语言识别模型中边的权重进行修改,即可使用激励权重来计算词序列路径的得分。
由于在不同的应用场景中,需要增强的文本片段可能不同,因此,可以训练不同的关键语言模型,根据得到的第一状态图来配置语言识别模型中对应边的激励权重,而不会影响到语言识别模型中的其他边。在完成语音识别后,可根据用户输入的解除指令或是切换的应用场景,来解除当前激励权重的映射关系,清除增强的文本片段权重,进而去除当前文本片段对语言识别模型的影响,以便于根据下一场景需求重新配置语言识别模型的激励权重,提高语音识别的准确性。
由此,本实施例使用映射关系配置激励权重,替代直接赋值的方式,提高了语言识别模型和语音识别模型的通用性。本方案适用性强,可以应用于多种场景,不会因为文本片段增强而影响到后续在其他场景的使用,降低了维护成本。不同的语音识别场景或模式,均能够有效提高语音识别的准确性,避免了交叉影响。
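上述“以映射关系配置激励权重、可随场景解除”的思路，可用如下示意性Python片段说明（类名、边的表示与权重取值均为说明用的假设）：

```python
class PrunedLanguageModel:
    # 语言识别模型：初始权重保持不变，激励权重以映射（字典）方式另行配置
    def __init__(self, base_weights):
        self.base = dict(base_weights)  # 剪枝后模型中各边的初始权重
        self.excitation = {}            # 激励权重映射，优先级高于初始权重

    def set_excitation(self, edge, weight):
        self.excitation[edge] = weight

    def clear_excitation(self):
        # 切换场景时解除映射，清除增强的文本片段权重
        self.excitation.clear()

    def weight(self, edge):
        return self.excitation.get(edge, self.base[edge])

g = PrunedLanguageModel({("张俊", "岐"): -12.7})
g.set_excitation(("张俊", "岐"), -2.3)
assert g.weight(("张俊", "岐")) == -2.3   # 激励权重代替初始权重
g.clear_excitation()
assert g.weight(("张俊", "岐")) == -12.7  # 解除映射后恢复原状
```

由于不修改初始权重本身，清空映射即可完全去除当前文本片段对模型的影响。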
105、将待识别语音输入预设语音识别模型,得到语音识别模型输出的词序列路径,语音识别模型包括语言识别模型。
语音识别装置可以获取待识别的语音。一些实施例中,待识别的语音可以是终端采集的语音,例如,终端可以实时采集待识别语音,并可以提供给服务器。另一些实施例中,待识别的语音可以为从某个本地或远程存储设备读取的语音数据。
需要说明的是,步骤105可以同步骤101同时执行,在增强语言识别模型中文本片段权重的同时,进行语音识别,实现在线语音识别。当然,步骤105也可以在步骤104之后执行,使用文本片段权重已被增强的语言识别模型,进行词路径的筛选,实现离线语音识别。
预设的语音识别模型可以是HCLG模型。其中,H是HMM(Hidden Markov Model,隐马尔可夫模型)构建的WFST,可以把HMM的状态号映射为triphone(三音素)。C是单音素(monophone)扩展成三音素(triphone)所构建的上下文WFST。L是发音词典构建的WFST,可以把输入的音素转换成词。G是语言识别模型构建的WFST,用来表示词的上下文的概率关系。
语音识别装置将待识别语音输入语音识别模型，经过音素识别、音素到词的转换等步骤后，将词元输入语言识别模型WFST，得到语言识别模型WFST输出的各词序列路径，进而计算各词序列路径的得分。
需要说明的是,词序列路径由其在隐马尔可夫模型WFST、上下文WFST、发音词典WFST和语言识别模型WFST中的各边组成。
106、根据语言识别模型中边的激励权重,在词序列路径中选出目标路径,得到语音识别结果。
语音识别装置可以计算各词序列路径的得分。
具体地，各词序列的得分，是根据各词序列路径中各边的权重计算得到。
以任一词序列为例,语音识别装置获取其路径中的各条边,一条路径包括其在隐马尔可夫模型WFST、上下文WFST、发音词典WFST和语言识别模型WFST中的各边。
然后,语音识别装置获取词序列路径在隐马尔可夫模型WFST、上下文WFST、发音词典WFST和语言识别模型WFST中各边的权重。并且,语音识别装置检测该词序列路径在语言识别模型WFST中的边是否有激励权重。
以该词序列路径在语言识别模型WFST中的任一条边举例说明,若该边有激励权重,则该激励权重代替该边的初始权重,来计算路径的得分;若该边没有激励权重,则使用该边的初始权重,来计算路径的得分。
由此,语音识别装置根据词序列路径中各边的权重,通过加和或乘积等方式,计算得到该词序列路径的得分。
然后,语音识别装置根据得分最高的词序列路径,组合词序列,得到待识别语音对应的文本,也即识别结果。
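上述“有激励权重则代替初始权重参与路径得分计算”的规则，可用如下示意性Python片段说明（边的标识与权重取值均为假设，得分以各边权重加和为例）：

```python
def path_score(edge_ids, base_weights, excitation_weights):
    # 对路径中的每条边：有激励权重则用激励权重，否则用该边的初始权重；
    # 路径得分取各边权重之和（对数域）
    return sum(excitation_weights.get(e, base_weights[e]) for e in edge_ids)

base = {"e1": -5.0, "e2": -12.7}
excited = {"e2": -2.3}  # 仅 e2 被配置了激励权重

assert abs(path_score(["e1", "e2"], base, excited) - (-7.3)) < 1e-9
assert abs(path_score(["e1", "e2"], base, {}) - (-17.7)) < 1e-9
```

可见同一条路径在激励后的得分明显提高，包含文本片段的路径更容易被选为目标路径。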
由上可知，本申请实施例加载预设的第一状态图和第二状态图，第一状态图为关键语言模型的状态图，第二状态图为大语言模型的状态图；在第一状态图中提取基准边，在第二状态图中查找与基准边标签相同的边，作为目标边；获取基准边的权重，根据基准边的权重更新目标边的权重；将第二状态图中目标边更新后的权重，配置为语言识别模型中对应边的激励权重，语言识别模型为大语言模型剪枝后的语言模型；将待识别语音输入预设语音识别模型，得到语音识别模型输出的词序列路径，语音识别模型包括语言识别模型；根据语言识别模型中边的激励权重，在词序列路径中选出目标路径，得到语音识别结果。由于关键语言模型的语料远小于大语言模型的语料，因此，第一状态图中文本片段的边权重大于第二状态图中同一目标边的权重。该方案使用第一状态图目标边的权重，增强第二状态图中同一目标边的权重，进而激励语音识别模型中目标边的权重，从而在语音识别时，提高语言识别模型中包含文本片段的路径中边的权重，进而提高包含文本片段的路径作为识别结果的概率。由此，该方案提高了语音识别结果中文本片段出现的概率，在保障语音识别速度的同时，提升了语音识别结果的准确性。并且，该方案还适用于各种主题场景，可以利用各主题场景的文本片段来提高语音识别结果的准确性。
如前文所述,步骤21中调整第二状态图中边的权重的方法可以是在第二状态图中增加相应的边并设置其权重。该方法可以与图1c所示的方法分别独立使用,或者同时使用。同时使用时,图2为本申请实施例提供一种语音识别方法,可以包括以下步骤。
201、加载预设的第一状态图和第二状态图,第一状态图为关键语言模型的状态图,第二状态图为基础语言模型的状态图。
具体实施方式可参照上述语音识别方法实施例中步骤101的描述,在此不再赘述。
202、在第一状态图中提取基准边,在第二状态图中查找与基准边标签相同的边,作为目标边。
具体实施方式可参照上述语音识别方法实施例中步骤102的描述,在此不再赘述。
203、若在第二状态图中未找到与基准边标签相同的边，则将基准边映射到第二状态图中，得到目标边。
例如,若服务器在第二状态图中,未找到与第一基准边标签相同的边,则查询第一基准边在第一状态图中的起始节点的序号,然后,在第二状态图中找到该序号对应的节点,以该节点为起始节点建立与第一基准边相同的虚拟边,作为第一目标边,实现第一基准边的映射。
若服务器在第一目标边的递归边中,未找到与第二基准边标签相同的边,则将第一目标边的终点节点作为起始节点,建立与第二基准边标签相同的虚拟边,作为第二目标边,实现第二基准边的映射。
需要说明的是,映射得到的第一目标边和第二目标边的初始权重可以是预设值。
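上述“未找到相同边时建立虚拟边”的映射逻辑，可用如下示意性Python片段说明（状态图的字典表示与预设初始权重均为说明用的假设）：

```python
def find_or_map_edge(graph, src, label, default_weight=-20.0):
    # 在第二状态图中查找起点为 src、标签为 label 的边；
    # 未找到则以预设初始权重建立虚拟边（default_weight 取值为假设）
    key = (src, label)
    if key not in graph:
        graph[key] = default_weight
    return graph[key]

graph = {(3, "张俊"): -16.8}
assert find_or_map_edge(graph, 3, "张俊") == -16.8  # 已有的边直接命中
assert find_or_map_edge(graph, 9, "岐") == -20.0    # 未找到时映射出虚拟边
assert (9, "岐") in graph
```

映射出的虚拟边随后同样可按基准边的权重进行更新，避免依靠回退路径而削弱增强效果。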
204、获取基准边的权重,根据基准边的权重更新目标边的权重。
具体实施方式可参照上述语音识别方法实施例中步骤103的描述,在此不再赘述。
205、将第二状态图中目标边更新后的权重,配置为语言识别模型中对应边的激励权重,语言识别模型为大语言模型剪枝后的语言模型。
具体实施方式可参照上述语音识别方法实施例中步骤104的描述,在此不再赘述。
206、在第二状态图中,筛选出标签与预设词表中的词相同的边,作为起始边。
其中,预设词表中记录了文本片段被分词后得到的第一个词。
例如,在步骤206之前,还可以包括:对文本片段进行分词处理,将分词得到的第一个词配置到预设的词表中。
预设的文本片段可以有一个或多个,服务器对文本片段分别进行分词处理,并将各个文本片段分词得到的第一个词配置到词表中。
在进行语音识别时，为了提高进入文本片段路径的概率，服务器在第二状态图中，筛选出标签与预设词表中的词相同的边，作为起始边。
207、获取起始边的初始权重,根据预设的比例系数和起始边的初始权重,更新起始边的权重。
例如,服务器可以使用如下公式计算起始边的目标权重:
w_new = w_old × (1 - l)；
其中，w_new为起始边的目标权重，w_old为起始边的初始权重，l为预设的比例系数。
然后,服务器使用起始边的目标权重替换其初始权重,实现对起始边权重的更新。
由此,服务器增强了第二状态图中起始边的权重。
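上述按比例系数缩放起始边权重的计算，可用如下示意性Python片段说明（比例系数l的取值为举例假设）：

```python
def boost_start_edge(w_old, l=0.5):
    # w_new = w_old × (1 - l)；比例系数 l 的取值为举例假设
    # 权重为负的对数值，乘以 (1 - l) 后更接近 0，对应的概率被提高
    return w_old * (1 - l)

assert boost_start_edge(-10.0) == -5.0
```

由于权重是概率的对数值（负数），缩放后权重绝对值变小，即进入该起始边的概率被增强。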
208、将第二状态图中起始边更新后的权重,配置为语言识别模型中对应边的激励权重。
在得到起始边及其更新后的权重后,服务器在语言识别模型中查找与该起始边标签相同的边,并建立映射关系,进而,将关键起始词边的目标权重,配置为语言识别模型中对应边的激励权重。
209、将待识别语音输入预设语音识别模型,得到语音识别模型输出的词序列路径,语音识别模型包括语言识别模型。
具体实施方式可参照上述语音识别方法实施例中步骤105的描述,在此不再赘述。
其中，需要说明的是，在语音识别模型中遍历或查找词序列路径时，若未找到标签为特定词的边，则可在大语言模型中通过映射得到的目标边里，查找标签为特定词的边，作为词序列中的边，并获取该目标边的目标权重，以计算词序列路径的得分。
210、根据语言识别模型中边的激励权重,在词序列路径中选出目标路径,得到语音识别结果。
具体实施方式可参照上述语音识别方法实施例中步骤106的描述,在此不再赘述。
由上可知，本申请使用关键语言模型中文本片段路径的权重，增强语言识别模型中文本片段路径的权重，提高文本片段在识别结果中出现的概率，提升语音识别结果的准确性。在第二状态图中未找到与基准边标签相同的边时，采用映射的方式在第二状态图中添加目标边，从而在语音识别时，能够采用该映射的目标边，提高该文本片段在识别结果中出现的概率。并且，通过增强起始边的权重，实现了上下文增强，从而在语言识别时，提高了文本片段被找到的概率，也即词序列进入文本片段路径的概率。由此，本实施例从多个方面提高了语音识别的准确性。
根据前面实施例所描述的方法,以下将举例作进一步详细说明。
例如,参照图3a和图3b,在本实施例中,将以该语音识别装置具体集成在解码器中进行说明。
(一)实时采集待识别语音。
解码器获取语音采集装置实时采集得到的待识别语音,进行在线语音识别。
(二)将待识别语音输入语音识别模型。
解码器将待识别语音输入语音识别模型，通过音素识别、音素到词的转换等步骤后，将词元输入语言识别模型。
(三)获取第一状态图和第二状态图,并加载。
在将词元输入语言识别模型之前，或同时，解码器加载第一状态图和第二状态图，从而对目标边的权重进行加强。
例如,解码器获取预设的文本片段,根据文本片段训练关键语言模型;构建关键语言模型的加权有限状态转换器,获取关键语言模型加权有限状态转换器指示的状态图为第一状态图。本实施例中,以关键语言模型为三阶的tri-gram为例进行说明。
以文本片段为“张俊岐”为例,解码器得到的第一状态图可参照图3a,其中,节点2为二阶状态;节点3为第一状态图的起始节点;节点之间通过连接线连接,成为边,边的箭头方向指示了连接关系,也可理解为路径方向,边上依次记载了边的输入标签、输出标签和权重,本实施例中以边的权重为其概率的对数值进行举例说明。其中,预设的语句结束符号可以是符号“</s>”,预设的回退符号可以是符号“#phi”。
同时，解码器获取预设的通用语料，根据通用语料训练大语言模型；构建大语言模型的加权有限状态转换器，获取大语言模型加权有限状态转换器指示的状态图为第二状态图。本实施例中，以大语言模型为二阶的bi-gram为例进行说明。
解码器得到的第二状态图可参照图3b，其中，节点2为二阶状态；节点3为第二状态图的起始节点；节点之间通过连接线连接，成为边，边的箭头方向指示了连接关系，也可理解为路径方向，边上依次记载了边的输入标签、输出标签和权重，本实施例中以边的权重为其概率的对数值进行举例说明。其中，预设的语句结束符号可以是符号“</s>”，预设的回退符号可以是符号“#phi”。
(四)依据第一状态图,增强第二状态图中目标边的权重。
解码器在第一状态图中提取基准边,在第二状态图中查找与基准边标签相同的边,作为目标边;获取基准边的权重,根据基准边的权重更新目标边的权重。
例如，第一状态图和第二状态图同时从节点2沿着相同的路径往下走。第一状态图中，节点3至节点8的边，作为第一基准边3-8，标签为“张俊”，第二状态图中节点3至节点9的边，标签也为“张俊”，因此，得到与3-8标签相同的第一目标边3-9。然后，获取第一基准边的权重0，以及第一目标边的初始权重-16.8，根据上述实施例中记载的公式：log(e^(-16.8)×0.9+e^0×0.1)=-2.3，计算得到第一目标边3-9的目标权重为-2.3，相对于-16.8得到了增强，也即边3-9的概率得到了提高。
然后,解码器对第一状态图中的第一基准边进行递归,由于关键语言模型为三阶模型,因此,递归深度为3,得到第二基准边8-9,标签为“岐”。并且,解码器在第一目标边3-9的输出边中,找到标签为“岐”的边9-10,作为第二目标边。解码器根据第二基准边8-9的权重0,以及第二目标边9-10的初始权重-12.7,计算得到第二目标边9-10的目标权重-2.3,增强了第二目标边的权重。由于第一状态图中,节点9两条边的输出标签分别为回退符号和语句结束符号,因此,不能作为基准边来增强第二状态图中边的权重。
同时，解码器忽略第一状态图和第二状态图中标签为回退符号的边3-5，对其进行递归，在第一状态图的节点5，获取第二基准边5-6和5-7。第二状态图中与第二基准边5-6标签“张俊”相同的第二目标边为5-7，与第二基准边5-7标签“岐”相同的第二目标边为5-8。由此，根据第二基准边5-6的权重-1.0986，和第二目标边5-7的初始权重-18.5，计算可得第二目标边5-7的目标权重为-3.4；根据第二基准边5-7的权重-1.0986，和第二目标边5-8的初始权重-17.38，计算可得第二目标边5-8的目标权重为-3.4。
并且，解码器根据递归深度，在第一状态图中的节点6找到第二基准边6-9，以及第二状态图中与其相同的第二目标边7-10。解码器根据第二基准边6-9的权重0，和第二目标边7-10的初始权重-12.7，计算可得第二目标边7-10的目标权重为-2.3。
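按上文实施例的插值公式（假设权重为概率的自然对数、插值系数lambda取0.1，与本段算例一致），本段中的各目标权重可用如下Python片段复算：

```python
import math

def boost(w_old, w_k, lam=0.1):
    # w_new = log((1-lambda)·e^(w_old) + lambda·e^(w_k))，lambda 取 0.1（与算例一致的假设）
    return math.log((1 - lam) * math.exp(w_old) + lam * math.exp(w_k))

# 复算正文各目标权重（自然对数，保留一位小数）
assert round(boost(-16.8, 0.0), 1) == -2.3       # 目标边 3-9
assert round(boost(-18.5, -1.0986), 1) == -3.4   # 第二目标边 5-7
assert round(boost(-17.38, -1.0986), 1) == -3.4  # 第二目标边 5-8
assert round(boost(-12.7, 0.0), 1) == -2.3       # 第二目标边 7-10
```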
由此,实现了目标边权重的更新。第二状态图中,与第一状态图中文本片段相关的边权重均得到提高,对应的,由大语言模型剪枝得到的语言识别模型中对应边的权重也得到提升,解码时出现这些词的概率就会比之前大上很多。
然后,解码器将各目标边的权重,分别对应配置为语言识别模型中各对应的边的激励权重。
(五)目标边的映射。
参照图3c,以第一状态图为图3a为例,第二状态图为图3c为例。
第一状态图路径由节点3-8-9构成的路径(张俊,岐),无法在第二状态图中找到。若要在第二状态图中找到(张俊,岐),则需要在节点9通过回退的方式,来读入“岐”,降低文本片段增强效果。
为此，解码器利用第一状态图中高阶的边，将第二状态图中部分节点的序号和第一状态图中部分节点的序号关联起来，进行边的映射。从而在解码器解码的过程中，若在语言识别模型中找不到输入标签为特定词的时候，通过映射关系，提高词序列路径得分。
例如，解码器在第二状态图的节点9，添加与第一状态图中第二基准边8-9相同的虚拟边，作为与第二基准边8-9相同的第二目标边，实现边的映射，并更新该第二目标边的权重，实现权重增强。
由此,在进行解码时,解码器如果在语言识别模型中找不到路径(张俊,岐),则在第二状态图中,根据映射的虚拟边,确定路径(张俊,岐)的权重。
(六)文本片段上下文增强。
通过第二状态图中目标边的权重增强和映射,本实施例可以在几乎不影响正常识别结果的前提下,将文本片段的召回率提升到85%以上,满足了绝大多数的场景需求。
由于一个文本片段大部分情况下是被分割成多个粒度更小的词，来进行识别和语言模型训练的。因此，可以通过提升这些文本片段内部的小粒度词的权重，来提高文本片段的召回率。尤其是在用户没有配置文本片段的上下文语料时，在语音识别过程中进入到文本片段被分出来的第一个词的节点上就会比较困难。为此，本实施例增强了由文本片段的上文词的节点，进入到文本片段被分割出的第一个词的概率。
具体地,解码器对文本片段进行分词处理,将分词得到的第一个词配置到预设的词表中。然后,在第二状态图中,筛选出标签与预设词表中的词相同的边,作为起始边;获取起始边的初始权重,根据预设的比例系数和起始边的初始权重,更新起始边的权重;将第二状态图中起始边更新后的权重,配置为语言识别模型中对应边的激励权重。
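上述“对文本片段分词并将第一个词配置到词表”的步骤，可用如下示意性Python片段说明（其中的分词函数为说明用的假设，真实系统中应替换为实际分词器）：

```python
def build_first_word_vocab(phrases, segment):
    # 对每个文本片段分词，将分词得到的第一个词配置到词表中
    vocab = set()
    for phrase in phrases:
        words = segment(phrase)
        if words:
            vocab.add(words[0])
    return vocab

# 假设的分词函数，仅用于演示
segment = lambda p: {"张俊岐": ["张俊", "岐"]}.get(p, [p])
assert build_first_word_vocab(["张俊岐"], segment) == {"张俊"}
```

得到词表后，即可在第二状态图中筛选标签落在词表内的边作为起始边，并按前述比例系数增强其权重。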
(七)获取语音识别模型输出的词序列路径,计算词序列路径得分,得到识别结果。
解码器将词元输入到语言识别模型构建的WFST,获取语言识别模型WFST输出的各个词序列路径。然后,解码器根据词序列路径在语言识别模型中的各边权重,计算各个词序列路径的得分,将得分最高的词序列路径作为识别结果输出。
由上可知，用户可在本申请实施例中，快速配置会议等场景的文本片段，增强文本片段在识别结果中的出现概率，提高了语音识别的准确性。本实施例缩短了操作流程，节约了大量的时间，并且，对解码器的实时率没有影响，具有低时延的优点。
一些实施例中,语言识别方法的步骤可以分别由复数个物理设备执行,共同实现该方法,上述语言识别装置可以由复数个物理设备共同实现。
例如,复数个物理设备可以是多个服务器,其中的一些服务器主要向用户提供语音识别服务,另一些服务器为这些服务器提供用户定制的语音识别模型。
又例如,复数个物理设备可以是终端设备和服务器。终端设备为用户提供语音识别服务,服务器为这些终端设备提供用户定制的语音识别模型。
此时,各实施例的一种语音识别方法可以如图4b所示。该方法可以由一计算设备执行,例如服务器、终端设备,等。如图4b所示,该方法可以包括以下步骤。
步骤31,将文本片段提供给第二计算设备。
计算设备可以通过用户接口接收用户输入或选择的一个或复数个文本片段,例如术语、专有名词,等,再将文本片段提供给第二计算设备,使第二计算设备根据文本片段提供“定制的”语音识别模型。
步骤32,获取所述第二计算设备提供的语言识别模型,所述语言识别模型中至少一对元素间关系的概率利用所述文本片段中所述至少一对元素间关系的概率进行了调整。
第二计算设备可以执行上述方法中的调整语言识别模型的相关步骤,例如步骤11、21-22、101-104、201-208,等,并将得到的语音识别模型提供给提供文本片段的计算设备。
步骤33,将待识别语音输入预设的语音识别模型,所述语音识别模型包括所述语言识别模型。
步骤34,根据所述语言识别模型中各元素间关系的概率,确定所述待识别语音对应的多个元素的序列,作为语音识别结果。
本申请实施例还提供一种语音识别装置。图4a是本申请实施例的一种语音识别装置的结构示意图。如图4a所示,该语音识别装置可以包括调整模块41和语音识别模块42。
调整模块41可以根据文本片段中至少一对元素间关系的概率,调整语言识别模型中所述至少一对元素间关系的概率。
语音识别模块42可以将待识别语音输入预设的语音识别模型,所述语音识别模型包括所述语言识别模型;根据所述语言识别模型中各元素间关系的概率,确定所述待识别语音对应的多个元素的序列,作为语音识别结果。
一些实施例中,语音识别装置可以集成在网络设备如服务器等设备中。一些实施例中,语音识别装置可以集成在终端设备中。另一些实施例中,语音识别装置可以由分布在复数个物理设备中的组件共同实现。例如,调整模块41可以由第一计算设备实现,语音识别模块42可以由第二计算设备实现。计算设备可以是服务器、终端等任意有计算能力的设备。
图4b是本申请实施例的一种语音识别装置的结构示意图。如图4b所示,调整模块41可以包括:语言模型调整单元411和激励单元404。
语言模型调整单元411可以利用所述文本片段对应的第一状态图中表示一对元素间关系的一条边的权重调整预设的第二状态图中与所述边对应的边的权重,所述第一状态图为所述文本片段的语言模型的状态图,所述第二状态图为基础语言模型的状态图。
激励单元404可以将修改后的所述第二状态图中至少一条边的权重,配置为语言识别模型中对应边的激励权重,所述语言识别模型为所述基础语言模型剪枝后的语言模型。
此时,语音识别模块42可以包括:识别单元405和结果单元406。
识别单元405可以将待识别语音输入预设语音识别模型,得到所述语音识别模型输出的词序列路径,所述语音识别模型包括所述语言识别模型;
结果单元406可以根据所述语言识别模型中边的激励权重,在所述词序列路径中选出目标路径,得到语音识别结果。
一些实施例中,语言模型调整单元411可以包括更新单元,用于在所述第二状态图中查找与所述边标签相同的边,作为目标边;根据所述边的权重增加所述目标边的权重。
一些实施例中,语言模型调整单元411可以包括映射单元,用于在所述第二状态图中增加与所述边对应的边,作为目标边;根据所述边的权重设置所述目标边的权重。
图4c是本申请实施例的一种语音识别装置的结构示意图。如图4c所示,该语音识别装置可以包括加载单元401、关键词单元402、更新单元403、激励单元404、识别单元405和结果单元406。
(一)加载单元401;
加载单元401,用于加载预设的第一状态图和第二状态图,第一状态图为关键语言模型的状态图,第二状态图为大语言模型的状态图。
其中,第一状态图即为关键语言模型的有向状态图,记载了各个节点和节点之间的有向连接关系,以描述关键语言模型中文本片段对象的可能状态以及状态的转移路径。
关键语言模型可以是根据预设文本片段构建的语言模型，例如n-gram(n元语言模型)。本实施例中，以n为3，关键语言模型为三阶的tri-gram(三元语言模型)为例进行说明，也即关键语言模型中第3个词的出现只与前2个词相关，与其他任何词不相关。
第二状态图为大语言模型的加权有向状态图。大语言模型可以为语料信息丰富且未经过剪枝的大规模语言模型。
由于语言模型的不同,第一状态图和第二状态图中标签相同的边权重可能不同。
在一些实施例中,加载单元401具体可以用于:获取预设的文本片段,根据文本片段训练关键语言模型;构建关键语言模型的加权有限状态转换器,获取关键语言模型加权有限状态转换器指示的状态图为第一 状态图。
其中,预设的文本片段可以是待识别语音所在领域的相关语料,具体可根据需要灵活配置。预设的文本片段可以有一个或多个。
加权有限状态转换器为Weighted Finite-State Transducers,本实施例中可简称为WFST。WFST能够识别从词的初始状态到结束状态的整条路径,词的状态可以理解为节点。而节点根据次序连接形成有向边,边有对应的标签和权重。其中,标签包括输入标签和输出标签,输入标签和输出标签相同。权重表征了边出现在整条路径中的概率,权重可以是概率值,也可以根据概率值计算得到。整条路径的概率可以根据路径中各个边的权重或概率计算得到。
加载单元401将文本片段作为训练语料,输入tri-gram进行训练,得到关键语言模型。然后,加载单元401构建关键语言模型的加权有限状态转换器。由此,加载单元401可以获取关键语言模型WFST中的各个节点,及节点之间的连接关系,得到关键语言模型WFST指示的状态图,将关键语言模型WFST指示的状态图作为第一状态图。
在一些实施例中,加载单元401具体可以用于:获取预设的通用语料,根据通用语料训练大语言模型;构建大语言模型的加权有限状态转换器,获取大语言模型加权有限状态转换器指示的状态图为第二状态图。
其中,通用语料可以是人们常用的大规模语料。
加载单元401将通用语料输入预设的语言模型，例如二阶的bi-gram(二元语言模型)，进行训练，得到大语言模型。然后，加载单元401构建大语言模型的加权有限状态转换器。由此，加载单元401可以获取大语言模型WFST中的各个节点，及节点之间的连接关系，得到大语言模型WFST指示的状态图，将大语言模型WFST指示的状态图作为第二状态图。
由于关键语言模型WFST中的文本片段数量远小于大语言模型WFST中的语料数量，因此，相同的边在关键语言模型WFST中的权重，大于其在大语言模型WFST中的权重，由此，相同的边在第一状态图中的权重大于其在语言识别模型中的权重。
在进行语音识别前,或是在进行语音识别的过程中,加载单元401同时加载第一状态图和第二状态图。
(二)关键词单元402;
关键词单元402,用于在第一状态图中提取基准边,在第二状态图中查找与基准边标签相同的边,作为目标边。
其中,若基准边包括前缀路径,则前缀路径相同,且标签相同的边即为与基准边相同的目标边。
关键词单元402首先从第一状态图中提取出基准边,例如,可以获取第一状态图的起始节点,根据预设的遍历深度和起始节点获取基准边。
在一些实施例中,关键词单元402具体可以用于:将起始节点的输出边确定为第一基准边;在预设的递归深度内,对第一基准边进行递归,获取第一基准边的递归边;若递归边的输出标签不是预设符号,则将递归边确定为第二基准边。
其中,起始节点可以根据需要灵活配置。例如,本实施例中,第一状态图中的第一个节点为开始节点,第二个节点为二阶状态节点,第三个节点为一阶节点,因此,可以将第一状态图的第三个节点作为其起始节点。
递归深度可根据语言模型的阶数配置。例如,关键词单元402获取关键语言模型的阶数,作为递归深度。本实施例中,以关键语言模型的阶数为三阶举例,则语音识别装置将递归深度配置为3。
关键词单元402将起始节点的输出边作为第一基准边,以在第二状态图中查找相同的边。
然后,关键词单元402根据递归深度,继续查找第一状态图中可作为基准边的边。具体地,以任一第一基准边为例,关键词单元402在预设的递归深度内,对第一基准边进行递归,获取第一基准边的递归边;若递归边的输出标签不是预设符号,则将递归边确定为第二基准边。
其中,预设符号为预设的语句结束符号和回退符号。
例如,递归深度为3,则关键词单元402将第一基准边终点节点的输出边,以及该输出边的输出边,作为3阶内的递归边,共包含4个节点。
在得到递归边后,关键词单元402检测递归边的输出标签,是否为预设符号。若递归边的输出标签不是预设的语句结束符号或回退符号,则将该递归边确定为第二基准边,需要在第二状态图中查找与其相同的边。若递归边的输出标签是预设的语句结束符号或回退符号,则将该递归边确定为非基准边,不需要在第二状态图中查找与其相同的边。
需要说明的是，以起始节点的任一输出边为例，若该输出边的输出标签为预设的回退符号，则忽略该输出边，将其作为不需要增强权重的第一基准边，不对第二状态图中与其相同的第一目标边做权重更新。然后，关键词单元402获取该第一基准边的输出边，将其中输出标签不是预设符号的边作为第二基准边，该第二基准边可以用来对第二状态图中与其相同的第二目标边做权重更新。
在得到基准边后,关键词单元402在第二状态图中遍历,查找与基准边相同的目标边。
例如，关键词单元402具体可以用于：在第二状态图中，查找与第一基准边标签相同的边，作为第一目标边；在第一目标边的递归边中，查找与第二基准边标签相同的边，作为第二目标边。
以任一第一基准边为例，关键词单元402在第二状态图中，查找与第一基准边标签相同的边。其中，标签相同可以指输入标签相同和/或输出标签相同。由于本实施例中，状态图中同一条边的输入标签和输出标签相同，因此，关键词单元402可以是查找与第一基准边的输入标签相同的边，或是查找与第一基准边的输出标签相同的边，或是查找与第一基准边输入标签相同且输出标签相同的边。
关键词单元402将与第一基准边标签相同的边,确定为与第一基准边相同的第一目标边。
然后，关键词单元402根据预设的递归深度，在该第一目标边的递归边中，查找与第二基准边标签相同的边，得到第二目标边。其中，标签相同可以指输入标签相同和/或输出标签相同。
由此,关键词单元402分别找到与各第一基准边相同的第一目标边,以及与各第二基准边相同的第二目标边。
(三)更新单元403;
更新单元403,用于获取基准边的权重,根据基准边的权重更新目标边的权重。
其中,第一状态图中记载了基准边的权重,第二状态图中记载了目标边的初始权重。
以任一基准边为例,更新单元403可以使用基准边的权重,替换与其相同的目标边的权重,实现对目标边权重的更新。
在一些实施例中,更新单元403具体可以用于:获取预设的插值参数及目标边的初始权重;根据基准边的权重、插值参数和目标边的初始权重,计算得到目标边的目标权重;使用目标权重,替换第二状态图中目标边的初始权重。
其中,预设的插值参数可根据实际需要灵活配置。
更新单元403根据第二状态图,获取与基准边相同的目标边的初始权重。然后,更新单元403可根据如下公式,计算目标边的目标权重。
w_new = log((1-lambda)×e^(w_old) + lambda×e^(w_k))；
其中，w_new为目标边的目标权重，w_old为目标边的初始权重，w_k为基准边的权重，lambda为插值系数。
然后,更新单元403使用目标边的目标权重,替换掉第二状态图中该目标边的初始权重。
若有多个基准边,则更新单元403分别更新与各基准边相同的目标边的权重。
(四)激励单元404;
激励单元404，用于将第二状态图中目标边更新后的权重，配置为语言识别模型中对应边的激励权重，语言识别模型为大语言模型剪枝后的语言模型。
其中,语言识别模型是对大语言模型进行剪枝得到的语言模型。激励单元404可以对大语言模型进行剪枝处理,得到语言识别模型。
第二状态图中目标边的权重更新后,激励单元404将第二状态图中目标边更新后的权重,配置为语言识别模型中对应边的激励权重,也可理解为配置为语言识别模型中相同边的激励权重。由于语言识别模型是经由对大语言模型剪枝得到的,因此,语言识别模型中的各边均存在于大语言模型的状态图中。语言识别模型中,边的激励权重优先级高于其初始权重。
例如,激励单元404建立第二状态图中目标边和语言识别模型中对应边的映射关系,进而将目标边的目标权重配置为语言识别模型中对应边的激励权重。
(五)识别单元405;
识别单元405,用于将待识别语音输入预设语音识别模型,得到语音识别模型输出的词序列路径,语音识别模型包括语言识别模型。
需要说明的是,识别单元405可以同加载单元401同时运行,在增强语言识别模型中文本片段权重的同时,进行语音识别,实现在线语音识别。当然,识别单元405也可以在激励单元404运行结束后开始运行,使用文本片段权重已被增强的语言识别模型,进行词路径的筛选,实现离线语音识别。
预设的语音识别模型可以是HCLG模型。其中,H是HMM(Hidden Markov Model,隐马尔可夫模型)构建的WFST,可以把HMM的状态号映射为triphone(三音素)。C是单音素(monophone)扩展成三音素(triphone)所构建的上下文WFST。L是发音词典构建的WFST,可以把输入的音素转换成词。G是语言识别模型构建的WFST,用来表示词的上下文的概率关系。
识别单元405将待识别语音输入语音识别模型，经过音素识别、音素到词的转换等步骤后，将词元输入语言识别模型WFST，得到语言识别模型WFST输出的各词序列路径。
需要说明的是,词序列路径由其在隐马尔可夫模型WFST、上下文WFST、发音词典WFST和语言识别模型WFST中的各边组成。
(六)结果单元406;
结果单元406,用于根据语言识别模型中边的激励权重,在词序列路径中选出目标路径,得到语音识别结果。
结果单元406计算各词序列路径的得分。
具体地，各词序列的得分，是根据各词序列路径中各边的权重计算得到。
以任一词序列为例,结果单元406获取其路径中的各条边,一条路径包括其在隐马尔可夫模型WFST、上下文WFST、发音词典WFST和语言识别模型WFST中的各边。
然后,结果单元406获取词序列路径在隐马尔可夫模型WFST、上下文WFST、发音词典WFST和语言识别模型WFST中各边的权重。并且,检测该词序列路径在语言识别模型WFST中的边是否有激励权重。
以该词序列路径在语言识别模型WFST中的任一条边举例说明,若该边有激励权重,则该激励权重代替该边的初始权重,来计算路径的得分;若该边没有激励权重,则使用该边的初始权重,来计算路径的得分。
由此,结果单元406根据词序列路径中各边的权重,通过加和或乘积等方式,计算得到该词序列路径的得分。
然后,结果单元406根据得分最高的词序列路径,组合词序列,得到待识别语音对应的文本,也即识别结果。
由上可知，本申请实施例加载单元401加载预设的第一状态图和第二状态图，第一状态图为关键语言模型的状态图，第二状态图为大语言模型的状态图；关键词单元402在第一状态图中提取基准边，在第二状态图中查找与基准边标签相同的边，作为目标边；更新单元403获取基准边的权重，根据基准边的权重更新目标边的权重；激励单元404将第二状态图中目标边更新后的权重，配置为语言识别模型中对应边的激励权重，语言识别模型为大语言模型剪枝后的语言模型；识别单元405将待识别语音输入预设语音识别模型，得到语音识别模型输出的词序列路径，语音识别模型包括语言识别模型；结果单元406根据语言识别模型中边的激励权重，在词序列路径中选出目标路径，得到语音识别结果。由于关键语言模型的语料远小于大语言模型的语料，因此，第一状态图中文本片段的边权重大于第二状态图中同一目标边的权重。该方案使用第一状态图目标边的权重，增强第二状态图中同一目标边的权重，进而激励语音识别模型中目标边的权重，从而在语音识别时，提高语言识别模型中包含文本片段的路径中边的权重，进而提高包含文本片段的路径作为识别结果的概率。由此，该方案提高了语音识别结果中文本片段出现的概率，在保障语音识别速度的同时，提升了语音识别结果的准确性。并且，该方案还适用于各种主题场景，可以利用各主题场景的文本片段来提高语音识别结果的准确性。
一些实施例中,参照图4d,该语音识别装置还可以包括映射单元407、上下文单元408和采集单元409。
(七)映射单元407;
映射单元407,用于若在第二状态图中未找到与基准边标签相同的边,则将基准边映射到第二状态图中,得到目标边。
例如,若关键词单元402在第二状态图中,未找到与第一基准边标签相同的边,则映射单元407查询第一基准边在第一状态图中的起始节点的序号,然后,在第二状态图中找到该序号对应的节点,以该节点为起始节点建立与第一基准边相同的虚拟边,作为第一目标边,实现第一基准边的映射。
若映射单元407在第一目标边的递归边中,未找到与第二基准边标签相同的边,则将第一目标边的终点节点作为起始节点,建立与第二基准边标签相同的虚拟边,作为第二目标边,实现第二基准边的映射。
需要说明的是,映射得到的第一目标边和第二目标边的初始权重可以是预设值。
(八)上下文单元408;
上下文单元408,用于在第二状态图中,筛选出标签与预设词表中的词相同的边,作为起始边;获取起始边的初始权重,根据预设的比例系数和起始边的初始权重,更新起始边的权重;将第二状态图中起始边更新后的权重,配置为语言识别模型中对应边的激励权重。
其中,预设词表中记录了文本片段被分词后得到的第一个词。
例如,上下文单元408具体还可以用于:对文本片段进行分词处理,将分词得到的第一个词配置到预设的词表中。
预设的文本片段可以有一个或多个,上下文单元408对文本片段分别进行分词处理,并将各个文本片段分词得到的第一个词配置到词表中。
在进行语音识别时，为了提高进入文本片段路径的概率，上下文单元408在第二状态图中，筛选出标签与预设词表中的词相同的边，作为起始边。
例如,上下文单元408可以使用如下公式计算起始边的目标权重:
w_new = w_old × (1 - l)；
其中，w_new为起始边的目标权重，w_old为起始边的初始权重，l为预设的比例系数。
然后,上下文单元408使用起始边的目标权重替换其初始权重,实现对起始边权重的更新。
由此,上下文单元408增强了第二状态图中起始边的权重。
在得到起始边及其更新后的权重后，上下文单元408在语言识别模型中查找与该起始边标签相同的边，并建立映射关系，进而，将起始边的目标权重，配置为语言识别模型中对应边的激励权重。
(九)采集单元409。
采集单元409,用于实时采集待识别语音。
采集单元409实时采集得到待识别语音,进行在线语音识别。
由上可知，本申请使用关键语言模型中文本片段路径的权重，增强语言识别模型中文本片段路径的权重，提高文本片段在识别结果中出现的概率，提升语音识别结果的准确性。在第二状态图中未找到与基准边标签相同的边时，采用映射的方式在第二状态图中添加目标边，从而在语音识别时，能够采用该映射的目标边，提高该文本片段在识别结果中出现的概率。并且，通过增强起始边的权重，实现了上下文增强，从而在语言识别时，提高了文本片段被找到的概率，也即词序列进入文本片段路径的概率。由此，本实施例从多个方面提高了语音识别的准确性。
本申请实施例还提供一种语音识别设备,如图5a所示,其示出了本申请实施例所涉及的语音识别设备的结构示意图,具体来讲:
该语音识别设备可以包括一个或者一个以上处理核心的处理器501、一个或一个以上计算机可读存储介质的存储器502、电源503和输入单元504等部件。本领域技术人员可以理解,图5a中示出的语音识别设备结构并不构成对语音识别设备的限定,可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件布置。其中:
处理器501是该语音识别设备的控制中心,利用各种接口和线路连接整个语音识别设备的各个部分,通过运行或执行存储在存储器502内的软件程序和/或模块,以及调用存储在存储器502内的数据,执行语音识别设备的各种功能和处理数据,从而对语音识别设备进行整体监控。可选的,处理器501可包括一个或多个处理核心;优选的,处理器501可集成应用处理器和调制解调处理器,其中,应用处理器主要处理操作系统、用户界面和应用程序等,调制解调处理器主要处理无线通信。可以理解的是,上述调制解调处理器也可以不集成到处理器501中。
存储器502可用于存储软件程序以及模块，处理器501通过运行存储在存储器502的软件程序以及模块，从而执行各种功能应用以及数据处理。存储器502可主要包括存储程序区和存储数据区，其中，存储程序区可存储操作系统、至少一个功能所需的应用程序（比如语音识别功能等）等；存储数据区可存储根据语音识别设备的使用所创建的数据等。此外，存储器502可以包括高速随机存取存储器，还可以包括非易失性存储器，例如至少一个磁盘存储器件、闪存器件、或其他非易失性固态存储器件。相应地，存储器502还可以包括存储器控制器，以提供处理器501对存储器502的访问。
该语音识别设备还可包括输入单元504,该输入单元504可用于接收输入的数字或字符信息。用户可以使用输入单元504输入文本片段。
尽管未示出,语音识别设备还可以包括显示单元等,在此不再赘述。具体在本实施例中,语音识别设备中的处理器501会按照如下的指令,将一个或一个以上的应用程序的进程对应的可执行文件加载到存储器502中,并由处理器501来运行存储在存储器502中的应用程序,从而实现各种功能,如下:
加载预设的第一状态图和第二状态图,第一状态图为关键语言模型的状态图,第二状态图为大语言模型的状态图;在第一状态图中提取基准边,在第二状态图中查找与基准边标签相同的边,作为目标边;获取基准边的权重,根据基准边的权重更新目标边的权重;将第二状态图中目标边更新后的权重,配置为语言识别模型中对应边的激励权重,语言识别模型为大语言模型剪枝后的语言模型;将待识别语音输入预设语音识别模型,得到语音识别模型输出的词序列路径,语音识别模型包括语言识别模型;根据语言识别模型中边的激励权重,在词序列路径中选出目标路径,得到语音识别结果。
处理器501运行存储在存储器502中的应用程序,还可以实现如下功能:
获取第一状态图的起始节点,根据预设的遍历深度和起始节点确定基准边。
处理器501运行存储在存储器502中的应用程序,还可以实现如下功能:
将起始节点的输出边确定为第一基准边;在预设的递归深度内,对第一基准边进行递归,获取第一基准边的递归边;若递归边的输出标签不是预设符号,则将递归边确定为第二基准边。
处理器501运行存储在存储器502中的应用程序,还可以实现如下功能:
在第二状态图中,查找与第一基准边标签相同的边,作为第一目标边;在第一目标边的递归边中,查找与第二基准边标签相同的边,作为第二目标边。
处理器501运行存储在存储器502中的应用程序,还可以实现如下功能:
获取预设的插值参数及目标边的初始权重;根据基准边的权重、插值参数和目标边的初始权重,计算得到目标边的目标权重;使用目标权重,替换第二状态图中目标边的初始权重。
处理器501运行存储在存储器502中的应用程序,还可以实现如下功能:
若在第二状态图中未找到与基准边标签相同的边,则将基准边映射到第二状态图中,得到目标边。
处理器501运行存储在存储器502中的应用程序,还可以实现如下功能:
在第二状态图中,筛选出标签与预设词表中的词相同的边,作为起始边;获取起始边的初始权重,根据预设的比例系数和起始边的初始权重,更新起始边的权重;将第二状态图中起始边更新后的权重,配置为语言识别模型中对应边的激励权重。
处理器501运行存储在存储器502中的应用程序,还可以实现如下功能:
对文本片段进行分词处理,将分词得到的第一个词配置到预设的词表中。
处理器501运行存储在存储器502中的应用程序,还可以实现如下功能:
获取预设的文本片段，根据文本片段训练关键语言模型；构建关键语言模型的加权有限状态转换器，获取关键语言模型加权有限状态转换器指示的状态图为第一状态图。
处理器501运行存储在存储器502中的应用程序,还可以实现如下功能:
获取预设的通用语料,根据通用语料训练大语言模型;构建大语言模型的加权有限状态转换器,获取大语言模型加权有限状态转换器指示的状态图为第二状态图。
此外,参照图5b,该语音识别设备还可以包括语音采集装置505,例如麦克风等,用于实时采集待识别语音。
处理器501运行存储在存储器502中的应用程序,还可以实现如下功能:
实时采集待识别语音。
以上各个操作的具体实施可参见前面的实施例,在此不再赘述。
本领域普通技术人员可以理解,上述实施例的各种方法中的全部或部分步骤可以通过指令来完成,或通过指令控制相关的硬件来完成,该指令可以存储于一计算机可读存储介质中,并由处理器进行加载和执行。
为此,本申请实施例提供一种存储介质,其中存储有多条指令,该指令能够被处理器进行加载,以执行本申请实施例所提供的任一种语音识别方法中的步骤。例如,该指令可以执行如下步骤:
加载预设的第一状态图和第二状态图,第一状态图为关键语言模型的状态图,第二状态图为大语言模型的状态图;在第一状态图中提取基准边,在第二状态图中查找与基准边标签相同的边,作为目标边;获取基准边的权重,根据基准边的权重更新目标边的权重;将第二状态图中目标边更新后的权重,配置为语言识别模型中对应边的激励权重,语言识别模型为大语言模型剪枝后的语言模型;将待识别语音输入预设语音识别模型,得到语音识别模型输出的词序列路径,语音识别模型包括语言识别模型;根据语言识别模型中边的激励权重,在词序列路径中选出目标路径,得到语音识别结果。
该指令还可以执行如下步骤:
获取第一状态图的起始节点,根据预设的遍历深度和起始节点确定基准边。
该指令还可以执行如下步骤:
将起始节点的输出边确定为第一基准边;在预设的递归深度内,对第一基准边进行递归,获取第一基准边的递归边;若递归边的输出标签不是预设符号,则将递归边确定为第二基准边。
该指令还可以执行如下步骤:
在第二状态图中,查找与第一基准边标签相同的边,作为第一目标边;在第一目标边的递归边中,查找与第二基准边标签相同的边,作为第二目标边。
该指令还可以执行如下步骤:
获取预设的插值参数及目标边的初始权重;根据基准边的权重、插值参数和目标边的初始权重,计算得到目标边的目标权重;使用目标权重,替换第二状态图中目标边的初始权重。
该指令还可以执行如下步骤:
若在第二状态图中未找到与基准边标签相同的边,则将基准边映射到第二状态图中,得到目标边。
该指令还可以执行如下步骤:
在第二状态图中，筛选出标签与预设词表中的词相同的边，作为起始边；获取起始边的初始权重，根据预设的比例系数和起始边的初始权重，更新起始边的权重；将第二状态图中起始边更新后的权重，配置为语言识别模型中对应边的激励权重。
该指令还可以执行如下步骤:
对文本片段进行分词处理,将分词得到的第一个词配置到预设的词表中。
该指令还可以执行如下步骤:
获取预设的文本片段，根据文本片段训练关键语言模型；构建关键语言模型的加权有限状态转换器，获取关键语言模型加权有限状态转换器指示的状态图为第一状态图。
该指令还可以执行如下步骤:
获取预设的通用语料,根据通用语料训练大语言模型;构建大语言模型的加权有限状态转换器,获取大语言模型加权有限状态转换器指示的状态图为第二状态图。
该指令还可以执行如下步骤:
实时采集待识别语音。
以上各个操作的具体实施可参见前面的实施例,在此不再赘述。
其中,该存储介质可以包括:只读存储器(ROM,Read Only Memory)、随机存取记忆体(RAM,Random Access Memory)、磁盘或光盘等。
由于该存储介质中所存储的指令,可以执行本申请实施例所提供的任一种语音识别方法中的步骤,因此,可以实现本申请实施例所提供的任一种语音识别方法所能实现的有益效果,详见前面的实施例,在此不再赘述。
Claims (19)
- 一种语音识别方法,由计算设备执行,包括:根据一文本片段中至少一对元素间关系的概率,调整语言识别模型中所述至少一对元素间关系的概率;将待识别语音输入预设的语音识别模型,所述语音识别模型包括所述语言识别模型;根据所述语言识别模型中各元素间关系的概率,确定所述待识别语音对应的多个元素的序列,作为语音识别结果。
- 如权利要求1所述的方法,其中,所述根据文本片段中至少一对元素间关系的概率,调整语言识别模型中所述至少一对元素间关系的概率,包括:利用所述文本片段对应的第一状态图中表示一对元素间关系的一条边的权重调整预设的第二状态图中与所述边对应的边的权重,所述第一状态图为所述文本片段的语言模型的状态图,所述第二状态图为基础语言模型的状态图;将修改后的所述第二状态图中至少一条边的权重,配置为所述语言识别模型中对应边的激励权重,所述语言识别模型为所述基础语言模型剪枝后的语言模型;其中,所述根据所述语言识别模型中各元素间关系的概率,确定所述待识别语音对应的多个元素的序列,作为语音识别结果,包括:将待识别语音输入预设语音识别模型,得到所述语音识别模型输出的词序列路径,所述语音识别模型包括所述语言识别模型;根据所述语言识别模型中边的激励权重,在所述词序列路径中选出目标路径,得到语音识别结果。
- 如权利要求1所述的方法，其中，所述利用文本片段对应的第一状态图中的一条边及其权重，修改预设的第二状态图，包括：在所述第一状态图中提取所述边作为基准边，在所述第二状态图中查找与所述基准边标签相同的边，作为目标边；获取所述基准边的权重，根据基准边的权重更新所述目标边的权重。
- 如权利要求1所述的方法,其中,所述利用文本片段对应的第一状态图中的一条边及其权重,修改预设的第二状态图,包括:在所述第二状态图中增加与所述边对应的边,作为目标边;根据所述边的权重设置所述目标边的权重。
- 如权利要求3所述的方法,其中,所述在所述第一状态图中提取基准边,包括:获取所述第一状态图的起始节点,根据预设的遍历深度和所述起始节点确定基准边。
- 如权利要求5所述的方法,其中,所述根据预设的遍历深度和所述起始节点获取基准边,包括:将所述起始节点的输出边确定为第一基准边;在预设的递归深度内,对所述第一基准边进行递归,获取所述第一基准边的递归边;若所述递归边的输出标签不是预设符号,则将所述递归边确定为第二基准边。
- 如权利要求6所述的方法,其中,在所述第二状态图中查找与所述基准边标签相同的边,作为目标边,包括:在所述第二状态图中,查找与所述第一基准边标签相同的边,作为第一目标边;在所述第一目标边的递归边中,查找与所述第二基准边标签相同的边,作为第二目标边。
- 如权利要求1所述的方法,其中,所述根据基准边的权重更新所述目标边的权重,包括:获取预设的插值参数及所述目标边的初始权重;根据所述基准边的权重、插值参数和目标边的初始权重,计算得到目标边的目标权重;使用所述目标权重,替换所述第二状态图中所述目标边的初始权重。
- 如权利要求1所述的方法,进一步包括:在所述第二状态图中,筛选出标签与预设词表中的词相同的边,作为起始边;获取所述起始边的初始权重,根据预设的比例系数和所述起始边的初始权重,更新所述起始边的权重;将所述第二状态图中起始边更新后的权重,配置为语言识别模型中对应边的激励权重。
- 如权利要求9所述的方法,其中,所述在所述第二状态图中,筛选出标签与预设词表中的词相同的边,作为起始边之前,包括:对所述文本片段进行分词处理,将分词得到的第一个词配置到预设的词表中。
- 如权利要求1-10中任一项所述的方法,进一步包括:获取预设的文本片段,根据所述文本片段训练所述文本片段的语言模型;构建所述文本片段的语言模型的加权有限状态转换器,获取所述关键语言模型加权有限状态转换器指示的状态图为第一状态图。
- 如权利要求1-10中任一项所述的方法,进一步包括:获取预设的通用语料,根据所述通用语料训练基础语言模型;构建所述基础语言模型的加权有限状态转换器,获取所述基础语言模型加权有限状态转换器指示的状态图为第二状态图。
- 一种语音识别方法,由计算设备执行,包括:将文本片段提供给第二计算设备;获取所述第二计算设备提供的语言识别模型,所述语言识别模型中至少一对元素间关系的概率利用所述文本片段中所述至少一对元素间关系的概率进行了调整;将待识别语音输入预设的语音识别模型,所述语音识别模型包括所述语言识别模型;根据所述语言识别模型中各元素间关系的概率,确定所述待识别语音对应的多个元素的序列,作为语音识别结果。
- 一种语音识别装置,包括:调整模块,用于根据文本片段中至少一对元素间关系的概率,调整语言识别模型中所述至少一对元素间关系的概率;语音识别模块,用于将待识别语音输入预设的语音识别模型,所述语音识别模型包括所述语言识别模型;根据所述语言识别模型中各元素间关系的概率,确定所述待识别语音对应的多个元素的序列,作为语音识别结果。
- 如权利要求14所述的装置，其中，所述调整模块包括：调整单元，用于利用所述文本片段对应的第一状态图中表示一对元素间关系的一条边的权重调整预设的第二状态图中与所述边对应的边的权重，所述第一状态图为所述文本片段的语言模型的状态图，所述第二状态图为基础语言模型的状态图；激励单元，用于将修改后的所述第二状态图中至少一条边的权重，配置为语言识别模型中对应边的激励权重，所述语言识别模型为所述基础语言模型剪枝后的语言模型；其中，所述语音识别模块包括：识别单元，用于将待识别语音输入预设语音识别模型，得到所述语音识别模型输出的词序列路径，所述语音识别模型包括所述语言识别模型；结果单元，用于根据所述语言识别模型中边的激励权重，在所述词序列路径中选出目标路径，得到语音识别结果。
- 如权利要求15所述的装置，所述调整单元包括：更新单元，用于在所述第二状态图中查找与所述边标签相同的边，作为目标边；根据所述边的权重增加所述目标边的权重。
- 如权利要求15所述的装置,所述调整单元包括:映射单元,用于在所述第二状态图中增加与所述边对应的边,作为目标边;根据所述边的权重设置所述目标边的权重。
- 一种语音识别设备,其特征在于,包括:存储器、处理器及存储在所述存储器上并可在所述处理器上运行的语音识别程序,所述语音识别程序被所述处理器执行时实现如权利要求1至13中任一项所述的方法的步骤。
- 一种存储介质,其特征在于,所述存储介质存储有多条指令,所述指令适于处理器进行加载,以执行权利要求1至13中任一项所述的语音识别方法中的步骤。
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US17/192,316 US12125473B2 (en) | 2018-12-11 | 2021-03-04 | Speech recognition method, apparatus, and device, and storage medium |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811508402.7A CN110176230B (zh) | 2018-12-11 | 2018-12-11 | 一种语音识别方法、装置、设备和存储介质 |
CN201811508402.7 | 2018-12-11 |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US17/192,316 Continuation US12125473B2 (en) | 2018-12-11 | 2021-03-04 | Speech recognition method, apparatus, and device, and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020119432A1 true WO2020119432A1 (zh) | 2020-06-18 |
Family
ID=67689294
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2019/120558 WO2020119432A1 (zh) | 2018-12-11 | 2019-11-25 | 一种语音识别方法、装置、设备和存储介质 |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN110176230B (zh) |
WO (1) | WO2020119432A1 (zh) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118014011A (zh) * | 2024-04-07 | 2024-05-10 | 蚂蚁科技集团股份有限公司 | 大语言模型训练及训练数据构建方法、装置、设备、介质 |
Families Citing this family (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110176230B (zh) * | 2018-12-11 | 2021-10-08 | 腾讯科技(深圳)有限公司 | 一种语音识别方法、装置、设备和存储介质 |
CN110705282A (zh) * | 2019-09-04 | 2020-01-17 | 东软集团股份有限公司 | 关键词提取方法、装置、存储介质及电子设备 |
CN111933119B (zh) * | 2020-08-18 | 2022-04-05 | 北京字节跳动网络技术有限公司 | 用于生成语音识别网络的方法、装置、电子设备和介质 |
CN111968648B (zh) * | 2020-08-27 | 2021-12-24 | 北京字节跳动网络技术有限公司 | 语音识别方法、装置、可读介质及电子设备 |
CN112634904B (zh) * | 2020-12-22 | 2024-09-20 | 北京有竹居网络技术有限公司 | 热词识别方法、装置、介质和电子设备 |
CN112820280A (zh) * | 2020-12-30 | 2021-05-18 | 北京声智科技有限公司 | 规则语言模型的生成方法及装置 |
CN112802476B (zh) * | 2020-12-30 | 2023-10-24 | 深圳追一科技有限公司 | 语音识别方法和装置、服务器、计算机可读存储介质 |
CN113763938B (zh) * | 2021-10-27 | 2024-06-07 | 杭州网易智企科技有限公司 | 语音识别方法、介质、装置和计算设备 |
CN114360528B (zh) * | 2022-01-05 | 2024-02-06 | 腾讯科技(深圳)有限公司 | 语音识别方法、装置、计算机设备及存储介质 |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130185073A1 (en) * | 2005-12-08 | 2013-07-18 | Nuance Communications Austria Gmbh | Speech recognition system with huge vocabulary |
CN105869629A (zh) * | 2016-03-30 | 2016-08-17 | 乐视控股(北京)有限公司 | 语音识别方法及装置 |
CN108711422A (zh) * | 2018-05-14 | 2018-10-26 | 腾讯科技(深圳)有限公司 | 语音识别方法、装置、计算机可读存储介质和计算机设备 |
CN108735201A (zh) * | 2018-06-29 | 2018-11-02 | 广州视源电子科技股份有限公司 | 连续语音识别方法、装置、设备和存储介质 |
CN110176230A (zh) * | 2018-12-11 | 2019-08-27 | 腾讯科技(深圳)有限公司 | 一种语音识别方法、装置、设备和存储介质 |
Family Cites Families (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7756708B2 (en) * | 2006-04-03 | 2010-07-13 | Google Inc. | Automatic language model update |
JP5088701B2 (ja) * | 2006-05-31 | 2012-12-05 | 日本電気株式会社 | 言語モデル学習システム、言語モデル学習方法、および言語モデル学習用プログラム |
US9043205B2 (en) * | 2012-06-21 | 2015-05-26 | Google Inc. | Dynamic language model |
KR102305584B1 (ko) * | 2015-01-19 | 2021-09-27 | 삼성전자주식회사 | 언어 모델 학습 방법 및 장치, 언어 인식 방법 및 장치 |
US10325590B2 (en) * | 2015-06-26 | 2019-06-18 | Intel Corporation | Language model modification for local speech recognition systems using remote sources |
US20170018268A1 (en) * | 2015-07-14 | 2017-01-19 | Nuance Communications, Inc. | Systems and methods for updating a language model based on user input |
CN106683677B (zh) * | 2015-11-06 | 2021-11-12 | 阿里巴巴集团控股有限公司 | 语音识别方法及装置 |
CN107146604B (zh) * | 2017-04-27 | 2020-07-03 | 北京捷通华声科技股份有限公司 | 一种语言模型优化方法及装置 |
CN107665705B (zh) * | 2017-09-20 | 2020-04-21 | 平安科技(深圳)有限公司 | 语音关键词识别方法、装置、设备及计算机可读存储介质 |
- 2018
  - 2018-12-11: CN CN201811508402.7A patent/CN110176230B/zh active Active
- 2019
  - 2019-11-25: WO PCT/CN2019/120558 patent/WO2020119432A1/zh active Application Filing
Also Published As
Publication number | Publication date |
---|---|
CN110176230A (zh) | 2019-08-27 |
CN110176230B (zh) | 2021-10-08 |
US20210193121A1 (en) | 2021-06-24 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 19896527; Country of ref document: EP; Kind code of ref document: A1 |
NENP | Non-entry into the national phase | Ref country code: DE |
122 | Ep: pct application non-entry in european phase | Ref document number: 19896527; Country of ref document: EP; Kind code of ref document: A1 |