WO2023070803A1 - Speech recognition method, apparatus, device, and storage medium - Google Patents
Speech recognition method, apparatus, device, and storage medium
- Publication number: WO2023070803A1 (PCT/CN2021/133434)
- Authority: WIPO (PCT)
- Prior art keywords: speech recognition, recognition result, vertical, speech, keyword
Classifications
- G10L15/083 — Speech recognition; Speech classification or search; Recognition networks
- G10L15/02 — Speech recognition; Feature extraction for speech recognition; Selection of recognition unit
- G10L15/26 — Speech recognition; Speech to text systems
- G10L25/24 — Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
- G10L15/193 — Speech classification or search using natural language modelling; Grammatical context; Formal grammars, e.g. finite state automata, context free grammars or word networks
- G10L2015/025 — Feature extraction for speech recognition; Phonemes, fenemes or fenones being the recognition units
Definitions
- the present application relates to the technical field of speech recognition, in particular to a speech recognition method, device, equipment and storage medium.
- the most effective speech recognition solution at present is to use neural network technology to learn from massive data and obtain a speech recognition model, which works very well in general scenarios and can achieve a very good recognition effect.
- the embodiment of the present application proposes a speech recognition method, apparatus, device and storage medium, which can accurately recognize the speech to be recognized, especially speech in a specific scene involving vertical keywords, and in particular can accurately recognize the vertical keywords in the speech.
- a speech recognition method, characterized by comprising:
- a speech recognition decoding network is constructed, wherein the sentence pattern decoding network is constructed at least by performing sentence pattern induction processing on the text corpus of the scene to which the speech to be recognized belongs;
- another speech recognition method, characterized by comprising:
- the speech recognition decoding network is obtained based on a set of vertical keywords and a sentence pattern decoding network under the scene to which the speech to be recognized belongs;
- a final speech recognition result is determined from at least the excited first speech recognition result and the second speech recognition result.
- a speech recognition apparatus, characterized by comprising:
- an acoustic recognition unit, configured to obtain the acoustic state sequence of the speech to be recognized;
- a network construction unit, configured to construct a speech recognition decoding network based on a vertical keyword set and a sentence pattern decoding network in the scene to which the speech to be recognized belongs, wherein the sentence pattern decoding network is constructed at least by performing sentence pattern induction processing on the text corpus in the scene;
- a decoding processing unit, configured to use the speech recognition decoding network to decode the acoustic state sequence to obtain a speech recognition result.
- another speech recognition apparatus, comprising:
- an acoustic recognition unit, configured to obtain the acoustic state sequence of the speech to be recognized;
- a multi-dimensional decoding unit, configured to use a speech recognition decoding network to decode the acoustic state sequence to obtain a first speech recognition result, and to use a general speech recognition model to decode the acoustic state sequence to obtain a second speech recognition result;
- the speech recognition decoding network is constructed based on a set of vertical keywords and a sentence pattern decoding network under the scene to which the speech to be recognized belongs;
- an acoustic excitation unit, configured to perform acoustic score excitation on the first speech recognition result;
- a decision processing unit, configured to determine a final speech recognition result at least from the excited first speech recognition result and the second speech recognition result.
- a speech recognition device comprising:
- the memory is connected to the processor and is used for storing a program;
- the processor is configured to implement the above speech recognition method by running the program stored in the memory.
- a storage medium on which a computer program is stored, and when the computer program is run by a processor, the above speech recognition method is realized.
- the speech recognition method proposed in this application can build a speech recognition decoding network based on the set of vertical keywords in the scene to which the speech to be recognized belongs and the pre-built sentence pattern decoding network in that scene. The speech recognition decoding network then contains the various speech sentence patterns in the scene to which the speech to be recognized belongs as well as the various vertical keywords in that scene, so it can decode speech composed of any sentence pattern and any vertical keyword in the scene to which the speech to be recognized belongs. Therefore, by constructing the above speech recognition decoding network, the speech to be recognized can be accurately recognized, especially speech in a specific scene involving vertical keywords, and in particular the vertical keywords in the speech can be accurately recognized.
- FIG. 1 is a schematic flow chart of a speech recognition method provided by an embodiment of the present application.
- FIG. 2 is a schematic diagram of a word-level sentence pattern decoding network provided by an embodiment of the present application.
- FIG. 3 is a schematic flow chart of another speech recognition method provided by an embodiment of the present application.
- FIG. 4 is a schematic flow chart of another speech recognition method provided by an embodiment of the present application.
- FIG. 5 is a schematic flow chart of another speech recognition method provided by an embodiment of the present application.
- FIG. 6 is a schematic flow chart of another speech recognition method provided by an embodiment of the present application.
- FIG. 7 is a schematic diagram of a text sentence network provided by an embodiment of the present application.
- FIG. 8 is a schematic diagram of a pronunciation-level sentence pattern decoding network provided by an embodiment of the present application.
- FIG. 9 is a schematic diagram of a word-level personal name network provided by an embodiment of the present application.
- FIG. 10 is a schematic diagram of the pronunciation-level personal name network corresponding to FIG. 9 provided by an embodiment of the present application.
- FIG. 11 is a processing flow chart of using the second speech recognition result to correct the first speech recognition result provided by an embodiment of the present application.
- FIG. 12 is a processing flow chart of determining the final speech recognition result from the first speech recognition result and the second speech recognition result provided by an embodiment of the present application.
- FIG. 13 is a schematic diagram of the state network of a speech recognition result provided by an embodiment of the present application.
- FIG. 14 is a schematic diagram of the state network after path extension is performed on the speech recognition result shown in FIG. 13.
- FIG. 15 is a schematic structural diagram of a speech recognition apparatus provided by an embodiment of the present application.
- FIG. 16 is a schematic structural diagram of another speech recognition apparatus provided by an embodiment of the present application.
- FIG. 17 is a schematic structural diagram of a speech recognition device provided by an embodiment of the present application.
- the technical solutions of the embodiments of the present application are applicable to speech recognition application scenarios.
- with this method, the speech content can be recognized more accurately, especially in specific business scenarios involving vertical keywords; in particular, the vertical keywords in the speech can be accurately recognized, improving the speech recognition effect as a whole.
- the above vertical keywords generally refer to different keywords belonging to the same type; for example, person names, place names and application names constitute different categories of vertical keywords: the contact names in the user's address book form the person-name keywords, the different place names in the region where the user is located form the place-name keywords, and the names of the various applications installed on the user terminal form the application-name keywords.
- the above-mentioned business scenarios involving vertical keywords refer to business scenarios whose interactive voice contains vertical keywords, such as voice dialing and voice navigation. In these scenarios the user must say the name of the contact to be called or the place name to navigate to; for example, the user may say "call XX" or "navigate to YY", where "XX" may be any name in the user's mobile phone address book and "YY" may be a place name in the user's region. It can be seen that the speech in these business scenarios contains vertical keywords (such as person names and place names), so these business scenarios are business scenarios involving vertical keywords.
- compared with ordinary text keywords, vertical keywords change frequently, are unpredictable and can be user-defined, and their proportion in the massive speech recognition training corpus is extremely low, which makes conventional speech recognition solutions that train speech recognition models on such corpus often incapable of handling speech recognition services involving vertical keywords.
- the occurrence rate of personal names is very low, so even in the massive training corpus, personal names are very rare, which makes the model unable to fully learn the characteristics of personal names through massive corpus.
- the names of people belong to the user-defined text content, which is inexhaustible and unpredictable. It is unrealistic to completely generate all the names of people artificially.
- moreover, the contact names stored by the user in the address book may not be standardized names but may be nicknames, code names and the like, and the user may modify, add or delete contacts in the address book at any time; as a result, the names in the address books of different users are highly diverse, and it is impossible to make the speech recognition model learn all the characteristics of names in a unified way.
- the conventional technical solution of training a speech recognition model through massive corpus and using the speech recognition model to realize the speech recognition function is not fully competent for speech recognition tasks in business scenarios involving vertical keywords, especially for vertical keywords in speech. Such keywords are often not successfully identified, seriously affecting user experience.
- the embodiment of the present application proposes a voice recognition method, which can improve the voice recognition effect, especially the voice recognition effect in business scenarios involving vertical keywords.
- the embodiment of the present application proposes a speech recognition method, as shown in FIG. 1, the method includes:
- the above-mentioned speech to be recognized is specifically the speech data in a business scenario involving vertical keywords.
- that is, the speech to be recognized includes the spoken content of vertical keywords.
- the audio features can be Mel-frequency cepstral coefficient (MFCC) features, or any other type of audio feature.
- after the audio features of the speech to be recognized are obtained, the audio features are input into the acoustic model for acoustic recognition, and the acoustic state posterior score of each frame of audio is obtained, that is, the acoustic state sequence is obtained.
- the acoustic model is mainly a neural network structure, which identifies the acoustic state corresponding to each frame of audio and its posterior score through forward calculation.
- the above-mentioned acoustic state corresponding to the audio frame is specifically a pronunciation unit corresponding to the audio frame, such as a phoneme or a phoneme sequence corresponding to the audio frame.
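- as an illustration of the acoustic recognition described above, the following sketch frames an utterance, extracts per-frame features, and runs an acoustic-model forward pass to obtain a per-frame sequence of pronunciation-unit posteriors. This is not the patent's actual implementation; extract_mfcc and AcousticModel are hypothetical placeholders.

```python
import numpy as np

def extract_mfcc(frame: np.ndarray) -> np.ndarray:
    """Placeholder: return an MFCC feature vector (e.g. 13 coefficients) for one audio frame."""
    return np.zeros(13)

class AcousticModel:
    """Placeholder neural acoustic model: audio features -> posterior scores over pronunciation units."""
    def __init__(self, phonemes):
        self.phonemes = phonemes

    def forward(self, feats: np.ndarray) -> np.ndarray:
        logits = np.random.randn(len(self.phonemes))         # stand-in for a real forward pass
        probs = np.exp(logits)
        return probs / probs.sum()                           # posterior over pronunciation units

def acoustic_state_sequence(audio: np.ndarray, frame_len: int = 400, hop: int = 160):
    """Return [(pronunciation unit, posterior score), ...], one entry per audio frame."""
    model = AcousticModel(phonemes=["sil", "ay", "g", "ih", "v"])
    states = []
    for start in range(0, len(audio) - frame_len + 1, hop):  # framing the waveform
        feats = extract_mfcc(audio[start:start + frame_len])
        post = model.forward(feats)
        best = int(np.argmax(post))
        states.append((model.phonemes[best], float(post[best])))
    return states

print(acoustic_state_sequence(np.zeros(16000))[:3])          # first three frames of a 1 s utterance
```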
- the conventional speech recognition technology scheme is an acoustic model + language model architecture: first, the speech to be recognized is recognized acoustically through the acoustic model to realize the mapping from speech features to a phoneme sequence; then, the phoneme sequence is recognized through the language model to realize the mapping from phonemes to text.
- the acoustic state sequence of the speech to be recognized obtained by the above acoustic recognition will be input into the language model for decoding, so as to determine the text content corresponding to the speech to be recognized.
- the language model is a model that is trained based on a large amount of training corpus and can realize the mapping of phonemes to text.
- the embodiment of the present application, however, does not use the above-mentioned language model trained on massive corpus to decode the acoustic state sequence, but uses a decoding network constructed in real time to decode, as detailed below.
- S102 Construct a speech recognition decoding network based on the vertical keyword set and the sentence pattern decoding network in the scene to which the speech to be recognized belongs.
- the embodiment of the present application constructs a speech recognition decoding network in real time when recognizing speech in vertical keyword business scenarios, decodes the acoustic state sequence of the speech to be recognized with it, and obtains the speech recognition result.
- the above-mentioned speech recognition decoding network is constructed from a set of vertical keywords in the scene where the speech to be recognized belongs, and a pre-built sentence pattern decoding network in the scene where the speech to be recognized belongs.
- the sentence pattern decoding network in the scene where the speech to be recognized belongs to is constructed by at least sentence pattern induction processing on the text corpus in the scene where the speech to be recognized belongs;
- the scene to which the speech to be recognized belongs specifically refers to the business scene to which the speech to be recognized belongs. For example, if the speech to be recognized is "I want to give XX a call", then the speech to be recognized belongs to the voice of the call service, so the scene of the speech to be recognized is the call scene; for another example, suppose the speech to be recognized is "Navigate to XX", the speech to be recognized belongs to the navigation service, so the scene to which the speech to be recognized belongs is the navigation scene.
- the inventors of the present application have found through research that, in business scenarios involving vertical keywords, a considerable part of the user's voice follows fixed sentence patterns. For example, in the phone call scenario, the user's common sentence pattern is usually "I want to give XX a call" or "send a message to XX for me"; in the voice navigation scenario, the user's common sentence pattern is usually "go to XX (place name)" or "navigate to XX (place name)".
- that is, in these business scenarios, the sentence patterns of the user's voice are regular, or can be exhaustively enumerated.
- therefore, a decoding network containing these sentence patterns can be constructed, which is named the sentence pattern decoding network in the embodiment of this application.
- the sentence pattern decoding network constructed based on the above method can contain the sentence pattern information corresponding to the scene.
- that is, the sentence pattern decoding network can contain any sentence pattern in the scene.
- in the embodiment of the present application, the sentence pattern decoding network is constructed by performing sentence pattern induction and grammar slot definition processing on the text corpus in the scene to which the speech to be recognized belongs.
- specifically, the text slots in the text sentence patterns are divided into ordinary grammar slots and replacement grammar slots, wherein the text slots where the non-vertical keywords in the text sentence patterns are located are defined as ordinary grammar slots, and the text slots where the vertical keywords in the text sentence patterns are located are defined as replacement grammar slots.
- based on this processing, a sentence pattern decoding network such as the one shown in Figure 2 can be obtained.
- the sentence pattern decoding network is composed of nodes and directed arcs connecting nodes, wherein the directed arcs correspond to ordinary grammar slots and replacement grammar slots, and the directed arcs have label information for recording the text content in the slots.
- each entry of an ordinary grammar slot is segmented into words and connected in series through nodes and directed arcs. The directed arc between two nodes is marked with word information, where the left and right sides of the colon represent the input and output information respectively; here the input and output information are set to be the same. Multiple words obtained by segmenting a single entry are connected in series, different entries of the same grammar slot are connected in parallel, the replacement grammar slot is represented by the placeholder "#placeholder#" and is not expanded, and the nodes are numbered in order, where words with the same start node ID share a start node and words with the same end node ID share an end node.
- Figure 2 illustrates a relatively simple word-level sentence pattern decoding network diagram of the address book.
- the ordinary grammar slot before the replacement grammar slot contains three entries: "I want to give", "send a message to" and "give a call".
- the ordinary grammar slot after the replacement grammar slot contains three entries: "for me”, "a call” and "a call with her number”.
- the connection between node 10 and node 18 indicates that you can go directly from node 10 to the end node, and the " ⁇ /s>" on the arc represents silence.
- the aforementioned set of vertical keywords in the business scenario to which the voice to be recognized belongs refers to a set composed of all vertical keywords in the business scenario to which the voice to be recognized belongs.
- for example, assuming that the speech to be recognized is speech in the phone call scenario, the set of vertical keywords in the business scenario to which the speech to be recognized belongs can specifically be a collection of person names composed of the names in the user's address book; assuming that the speech to be recognized is speech in the voice navigation scenario, the set of vertical keywords in the business scenario to which the speech to be recognized belongs can specifically be a set of place names composed of the place names in the region where the user is located.
- the speech recognition decoding network can be obtained by adding the vertical keywords in the vertical keyword set under the business scenario to which the speech to be recognized belongs to the replacement grammar slot of the sentence pattern decoding network. It can be seen that this decoding network not only contains all the speech sentence patterns in the business scenario to which the speech to be recognized belongs, but also includes all the vertical keywords in that scene; therefore the speech recognition decoding network can recognize the speech sentence patterns in the business scenario to which the speech to be recognized belongs and the vertical keywords in the speech, that is, it can recognize the speech in this business scenario.
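- a minimal sketch of this construction follows (an assumed data layout, not the patent's actual network format): each sentence pattern is a sequence of slots, the replacement slot holds the "#placeholder#" token, and it is filled at recognition time with the vertical keyword set uploaded for this request.

```python
from itertools import product

# ordinary slot / replacement slot / ordinary slot
sentence_patterns = [
    ["I want to give", "#placeholder#", "a call"],
    ["send a message to", "#placeholder#", "for me"],
]

def build_decoding_paths(patterns, vertical_keywords):
    """Expand every sentence pattern with every vertical keyword -> all decodable sentences."""
    paths = []
    for pattern, kw in product(patterns, vertical_keywords):
        paths.append(" ".join(kw if tok == "#placeholder#" else tok for tok in pattern))
    return paths

address_book = ["John", "Peter"]  # vertical keyword set for this recognition request
print(build_decoding_paths(sentence_patterns, address_book))
# ['I want to give John a call', 'I want to give Peter a call',
#  'send a message to John for me', 'send a message to Peter for me']
```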
- when constructing the speech recognition decoding network in the embodiment of the present application, it is specifically constructed on the server: the set of vertical keywords under the business scenario to which the speech to be recognized belongs is transmitted to the cloud server, so that the cloud server can build the speech recognition decoding network based on that set of vertical keywords and the pre-built sentence pattern decoding network.
- for example, for speech in the phone call scenario, the mobile terminal transmits the local address book (that is, the collection of person-name keywords) to the cloud server, and the cloud server combines it with the sentence pattern decoding network of the phone call scene to build a speech recognition decoding network. The speech recognition decoding network then contains the various sentence patterns for making a call as well as the names in the address book used for this call; using this decoding network, the voice of the user calling any member in the current address book can be recognized.
- if the speech recognition decoding network is constructed locally at the user terminal and is not constructed in real time but pre-constructed and repeatedly invoked, then, because the computing resources of terminal devices are relatively limited, network construction is slow and network decoding speed is restricted; moreover, a decoding network that is not constructed in real time cannot be updated in time when the set of vertical keywords is updated, which affects the speech recognition effect.
- in the embodiment of the present application, the speech recognition decoding network is constructed on the cloud server, and during the speech recognition process the set of vertical keywords is imported in real time by performing step S102 and the speech recognition decoding network is constructed; therefore it can be guaranteed that the constructed speech recognition decoding network contains the latest set of vertical keywords, that is, the set of vertical keywords required for this recognition, so that the vertical keywords can be accurately recognized.
- the speech recognition decoding network will have stronger decoding performance.
- the cloud server can build a suitable speech recognition decoding network for the speech to be recognized this time, and decode the speech to be recognized this time.
- the speech recognition and decoding network constructed above includes sentence patterns in the business scenario to which the speech to be recognized belongs, and a set of vertical keywords in the business scenario to which the speech to be recognized belongs. Then, the speech content of the speech to be recognized can be recognized by using the speech recognition decoding network.
- for example, the terminal's local address book (used as the set of vertical keywords) is combined with a pre-built sentence pattern decoding network corresponding to the call business scenario to construct a speech recognition decoding network containing the names in the local address book.
- in the speech recognition decoding network, there are multiple sentence pattern paths, identical or different, composed of different vertical keywords.
- when an acoustic state sequence matches the pronunciation of one or several sentence pattern paths in the speech recognition decoding network, it can be determined that the text content of the acoustic state sequence is the text content of those sentence pattern paths. Therefore, the finally decoded speech recognition result may be the text of one or several paths in the speech recognition decoding network, that is, there may be one or more final speech recognition results.
- for example, the speech recognition decoding network contains the sentence pattern "I want to give XX a call" and the name "John"; at the same time, other sentence patterns and other names are also included in the speech recognition decoding network.
- when the user says "I want to give John a call", the acoustic state sequence of the speech is matched with each path in the speech recognition decoding network, and it can be determined that the acoustic state sequence matches the pronunciation of the path "I want to give John a call"; thus the speech recognition result "I want to give John a call" is obtained, that is, the recognition of the user's voice is realized.
- the speech recognition method proposed in the embodiment of the present application can build a speech recognition decoding network based on the vertical keyword set in the business scenario to which the speech to be recognized belongs and the pre-built sentence pattern decoding network in that business scenario. The speech recognition decoding network then includes the various speech sentence patterns as well as the various vertical keywords under the business scenario to which the speech to be recognized belongs, and can therefore recognize speech composed of any sentence pattern and any vertical keyword in that business scenario. By constructing the above speech recognition decoding network, the speech to be recognized can be accurately recognized, especially speech in a specific scene involving vertical keywords, and in particular the vertical keywords in the speech can be accurately recognized.
- the general speech recognition model is also used to decode the acoustic state sequence of the speech to be recognized.
- the result obtained by decoding the acoustic state sequence of the speech to be recognized using the above speech recognition decoding network is named the first speech recognition result, and the result obtained by decoding the acoustic state sequence of the speech to be recognized using the above general speech recognition model is named the second speech recognition result.
- step S301 is executed to obtain the acoustic state sequence of the speech to be recognized.
- then, step S302 and step S303 are executed to construct a speech recognition decoding network and to use the speech recognition decoding network to decode the acoustic state sequence, obtaining a first speech recognition result; and step S304 is executed, using a general speech recognition model to decode the acoustic state sequence to obtain a second speech recognition result.
- there may be one or more of the first speech recognition results and second speech recognition results mentioned above.
- for example, up to 5 speech recognition results output by each model are retained to participate in determining the final speech recognition result.
- the above-mentioned general speech recognition model is a conventional speech recognition model obtained through massive corpus training, which recognizes the text content corresponding to the speech by learning the characteristics of the speech, rather than having standardized sentence patterns like the above speech recognition decoding network. Therefore, the sentence patterns that the general speech recognition model can recognize are more flexible; using the general speech recognition model to decode the acoustic state sequence of the speech to be recognized can recognize the content of the speech to be recognized more flexibly, without being limited by the sentence pattern of the speech to be recognized.
- when the speech to be recognized does not follow a sentence pattern in the above speech recognition decoding network, it cannot be correctly decoded by the speech recognition decoding network, or the first speech recognition result obtained is inaccurate; however, because the general speech recognition model is also applied, the speech to be recognized can still be recognized and decoded to obtain a second speech recognition result.
- step S305 is executed to determine a final speech recognition result at least from the first speech recognition result and the second speech recognition result.
- after the first speech recognition result and the second speech recognition result are obtained, according to the acoustic scores of the first speech recognition result and the second speech recognition result, one or more of them are selected as the final speech recognition result through acoustic score PK (a head-to-head comparison of acoustic scores).
- the above-mentioned acoustic scores of the first speech recognition result and the second speech recognition result refer to the score of the entire decoding result, determined according to the decoding score of each acoustic state sequence element when the acoustic state sequence of the speech to be recognized is decoded.
- the sum of the decoding scores of each acoustic state sequence element can be used as the score of the entire decoding result.
- the decoding score of an acoustic state sequence element refers to the probability score of an acoustic state sequence element (such as a phoneme or phoneme unit) being decoded into a certain text, so the score of the entire decoding result is the entire acoustic state sequence being decoded into a certain text probability score.
- the acoustic score of a speech recognition result can therefore be used to characterize the accuracy of the speech recognition result.
- since the acoustic score reflects the accuracy of each recognition result, through acoustic score PK, that is, by comparing the acoustic scores, one or more speech recognition results with the highest scores are selected from these recognition results as the final speech recognition result.
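- the acoustic score PK described above can be sketched as follows, under the assumption that each decoder returns the recognized text together with the per-element decoding scores, and that the whole-result score is their sum:

```python
def acoustic_score(element_scores):
    # score of the entire decoding result = sum of the decoding scores of its elements
    return sum(element_scores)

def acoustic_score_pk(results, top_n=1):
    """results: list of (text, element_scores) produced by the different decoders."""
    ranked = sorted(results, key=lambda r: acoustic_score(r[1]), reverse=True)
    return [text for text, _ in ranked[:top_n]]

first_result = ("I want to give John a call", [-1.2, -0.8, -0.5, -0.9, -0.7, -0.4])
second_result = ("I want to give Joan a call", [-1.2, -0.8, -0.5, -1.4, -0.7, -0.4])
print(acoustic_score_pk([first_result, second_result]))  # ['I want to give John a call']
```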
- Steps S301 to S303 in the method embodiment shown in FIG. 3 respectively correspond to steps S101 to S103 in the method embodiment shown in FIG. 1 .
- FIG. 4 shows a schematic flowchart of another speech recognition method proposed by the embodiment of the present application.
- the speech recognition method proposed in the embodiment of the present application uses the constructed speech recognition decoding network and the general speech recognition model to decode the acoustic state sequence of the speech to be recognized.
- step S405 is also performed to decode the acoustic state sequence through the pre-trained scene customization model to obtain the third speech recognition result.
- after obtaining the first speech recognition result, the second speech recognition result and the third speech recognition result respectively, step S406 is executed to determine the final speech recognition result from the first speech recognition result, the second speech recognition result and the third speech recognition result.
- the aforementioned scenario customization model refers to a speech recognition model obtained by conducting speech recognition training on the speech in the scene to which the speech to be recognized belongs.
- the scene customization model has the same model architecture as the above-mentioned general speech recognition model.
- the difference from the general speech recognition model is that the scene customization model is not obtained by training with a large amount of general corpus, but is trained with the corpus of the scene to which the speech to be recognized belongs. Therefore, compared with the general speech recognition model, the scene customization model has higher sensitivity and a higher recognition rate for the speech in the business scene to which the speech to be recognized belongs.
- the scene customization model can more accurately recognize speech in a specific business scene, without being limited to predetermined sentence patterns like the speech recognition decoding network mentioned above.
- a scene customization model is added so that the three models decode the acoustic state sequence of the speech to be recognized separately, which allows the speech to be recognized to be recognized more comprehensively and in depth in various ways.
- for the speech recognition results output by the three models, referring to the introduction of step S305 above as an example, the acoustic scores of the first speech recognition result, the second speech recognition result and the third speech recognition result are compared, and one or more speech recognition results with the highest or higher acoustic scores are selected as the final speech recognition result.
- Steps S401-S404 in the method embodiment shown in FIG. 4 respectively correspond to steps S301-S304 in the method embodiment shown in FIG. 3.
- the main idea of the above-mentioned speech recognition method based on multi-model decoding is to perform decoding through multiple models, and then select the final recognition result from multiple recognition results through the acoustic score PK.
- however, when the acoustic scores of the recognition results output by different models are close, the sentence patterns of the speech recognition results are often basically the same and differ only at the positions of the vertical keywords. Since the first speech recognition result contains more accurate vertical keyword information, if the first speech recognition result loses the PK, the vertical keyword may be recognized inaccurately. Therefore, when the scores of the recognition results output by each model are similar, the recognition result containing accurate vertical keywords should win.
- with the original acoustic scores alone, however, the first speech recognition result cannot be guaranteed to win.
- therefore, the score of the slot where the vertical keyword in the first speech recognition result is located can first be boosted by a certain proportion, that is, acoustic score excitation is performed on the first speech recognition result, so that when the sentence patterns output by different models are the same, the first speech recognition result can win.
- the embodiment of the present application proposes another speech recognition method, as shown in FIG. 5 , the method includes:
- S502 Use a speech recognition decoding network to decode the acoustic state sequence to obtain a first speech recognition result. And, S503, using a general speech recognition model to decode the acoustic state sequence to obtain a second speech recognition result; the speech recognition decoding network is based on the vertical keyword set and sentence pattern decoding in the scene to which the speech to be recognized belongs The network is constructed.
- the above speech recognition decoding network is constructed based on the set of vertical keywords in the scene to which the speech to be recognized belongs and the sentence pattern decoding network obtained by performing sentence pattern induction and grammar slot definition processing in advance on the text corpus of the scene to which the speech to be recognized belongs.
- for the specific content of the speech recognition decoding network, refer to the introduction of the above embodiments; for the construction process of the network, refer to the specific introduction of the following embodiments.
- in the speech recognition method proposed in the embodiment of the present application, before the acoustic score PK of the first speech recognition result and the second speech recognition result is performed to determine the final speech recognition result, acoustic score excitation is first performed on the first speech recognition result.
- the acoustic score excitation is performed on the first speech recognition result by exciting the acoustic score of the slot where the vertical keyword in the first speech recognition result is located, that is, the acoustic score of the slot where the vertical keyword in the first speech recognition result is located is scaled according to an incentive coefficient.
- the specific value of the incentive coefficient is determined by the business scenario and the actual situation of the speech recognition result.
- for the specific content of the acoustic score excitation, please refer to the introduction of the acoustic score excitation embodiment below.
- after the excitation, the acoustic score of the first speech recognition result can be higher than that of the second speech recognition result, so that the first speech recognition result wins the acoustic score PK; thus, when the scores of the first speech recognition result and the second speech recognition result are similar, it is guaranteed that the vertical keyword in the final speech recognition result is the more accurate recognition result.
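- a minimal sketch of the acoustic score excitation follows, assuming the first speech recognition result exposes per-slot acoustic scores and marks which slot holds the vertical keyword; the incentive coefficient value is illustrative only.

```python
def excite_first_result(slot_scores, keyword_slot_index, incentive_coef=1.2):
    """Boost the vertical-keyword slot by the incentive coefficient and return the total score.

    slot_scores: per-slot acoustic scores of the first speech recognition result (positive here
    for simplicity); incentive_coef: illustrative value, chosen per business scenario.
    """
    excited = list(slot_scores)
    excited[keyword_slot_index] *= incentive_coef
    return sum(excited)

# "I want to give | John | a call": slot 1 is the replacement (vertical keyword) slot.
slots = [3.1, 2.0, 1.4]
print(sum(slots), excite_first_result(slots, keyword_slot_index=1))  # 6.5 -> 6.9
```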
- the speech recognition method proposed in the embodiment of the present application can build a speech recognition decoding network based on the vertical keyword set in the business scenario to which the speech to be recognized belongs and the pre-built sentence pattern decoding network in the business scenario.
- the speech recognition decoding network can decode speech composed of any sentence pattern and any vertical keyword in the business scenario to which the speech to be recognized belongs. Therefore, based on the above speech recognition decoding network, the speech to be recognized can be accurately recognized, especially speech in a specific scene involving vertical keywords, and in particular the vertical keywords in the speech can be accurately recognized.
- the speech recognition method proposed in the embodiment of the present application not only uses the speech recognition decoding network to perform decoding and recognition, but also uses a general speech recognition model to perform decoding and recognition.
- the general speech recognition model has higher sentence flexibility than the above-mentioned speech recognition decoding network. Using multiple models to decode the acoustic state sequence of the speech to be recognized separately can make speech recognition of the speech to be recognized more comprehensive and in-depth in a variety of ways.
- the embodiment of the present application performs acoustic score excitation on the speech recognition result output by the speech recognition and decoding network.
- the above speech recognition decoding network has higher recognition accuracy for vertical keywords. Therefore, based on the above acoustic score excitation processing, when the speech recognition result output by the speech recognition decoding network has a sentence pattern consistent with that output by the general speech recognition model and the scores are close, the speech recognition result output by the speech recognition decoding network can win, thereby ensuring that the vertical keyword in the final speech recognition result is recognized correctly.
- the embodiment of the present application also proposes another speech recognition method. Compared with the speech recognition method shown in FIG. 5, this method adds a scene customization model for decoding the acoustic state sequence of the speech to be recognized.
- step S604 is further performed to decode the acoustic state sequence through the pre-trained scene customization model to obtain the third speech recognition result.
- the above-mentioned scene customization model is obtained by performing speech recognition training on the speech in the scene to which the speech to be recognized belongs.
- the functions of the above-mentioned scene customization model and the beneficial effects brought by the addition of the scene customization model can refer to the content of the above-mentioned embodiment corresponding to the speech recognition method shown in FIG. 4 , which will not be repeated here.
- in step S606, the final speech recognition result is determined from the excited first speech recognition result, the second speech recognition result and the third speech recognition result.
- specifically, by performing acoustic score PK on the excited first speech recognition result, the second speech recognition result and the third speech recognition result, one or more speech recognition results with the highest acoustic scores can be selected as the final speech recognition result.
- For the specific processing process refer to the introduction of the corresponding content in the above-mentioned embodiments.
- for steps S601-S603 and S605 in FIG. 6, please refer to the specific processing content of the corresponding steps in the above embodiments, which will not be repeated here.
- the speech recognition methods mentioned above are based entirely on the acoustic scores of the speech recognition results when PK and decision-making are finally performed on multiple speech recognition results, which completely ignores the influence of the language model on the recognition effect, especially for the results output by the above speech recognition decoding network or scene customization model.
- This simple and direct PK strategy will greatly affect the recognition effect, and in severe cases, it will cause false triggering problems and affect user experience.
- the embodiment of the present application therefore proposes that, on the basis of the acoustic score PK, language model excitation is performed on the speech recognition results, so that language model information is incorporated into the speech recognition results, and finally the final speech recognition result is selected through language score PK.
- the first speech recognition result, the second speech recognition result and the third speech recognition result are respectively obtained through the speech recognition decoding network, the general speech recognition model, and the scene customization model.
- after acoustic score excitation is performed on the first speech recognition result, determining the final speech recognition result from the excited first speech recognition result, the second speech recognition result and the third speech recognition result can be executed as follows:
- first, a candidate speech recognition result is determined from the excited first speech recognition result and the second speech recognition result.
- the processing in this step is the same as the acoustic score PK introduced above.
- by performing acoustic score PK on the excited first speech recognition result and the second speech recognition result, one or more speech recognition results with the highest acoustic scores are selected as candidate speech recognition results.
- language model excitation is performed on the candidate speech recognition result and the third speech recognition result respectively.
- the above-mentioned language model excitation refers to matching the speech recognition result with the vertical keywords in the scene to which the speech to be recognized belongs; if the matching succeeds, path extension is performed on the speech recognition result, and then the language model re-scores the extended speech recognition result, completing the language model excitation of the speech recognition result.
- the specific processing process of the language model excitation will be specially introduced in the following embodiments.
- finally, similar to the acoustic score PK, language score PK is performed on the candidate speech recognition results after language model excitation and the third speech recognition result, and one or more speech recognition results with the highest language scores are selected as the final speech recognition result.
- for the specific processing process of the language score PK, refer to the processing process of the acoustic score PK introduced in the above embodiments, which will not be described in detail here.
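- the overall decision flow (acoustic score PK, then language model excitation, then language score PK) can be sketched as below; lm_score and keyword_hit are hypothetical stubs standing in for the real language-model re-scoring and path extension.

```python
def lm_score(text):
    return 0.1 * len(text.split())                  # stub standing in for a real language-model score

def keyword_hit(text, vertical_keywords):
    return any(kw in text for kw in vertical_keywords)  # stub for keyword matching / path extension

def decide(first, second, third, vertical_keywords, lm_boost=2.0):
    # Step 1: acoustic score PK between the (already excited) first result and the second result.
    candidate = max([first, second], key=lambda r: r["acoustic"])
    # Step 2: language model excitation on the candidate result and the third result.
    scored = []
    for res in (candidate, third):
        boost = lm_boost if keyword_hit(res["text"], vertical_keywords) else 0.0
        scored.append((res["text"], lm_score(res["text"]) + boost))
    # Step 3: language score PK selects the final speech recognition result.
    return max(scored, key=lambda t: t[1])[0]

first = {"text": "I want to give John a call", "acoustic": 6.9}   # after acoustic score excitation
second = {"text": "I want to give Joan a call", "acoustic": 6.4}
third = {"text": "give Joan a call", "acoustic": 6.0}
print(decide(first, second, third, vertical_keywords=["John", "Peter"]))
# -> 'I want to give John a call'
```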
- in another optional speech recognition result decision-making method, referring to the introduction of the above embodiments, after the first speech recognition result, the second speech recognition result and the third speech recognition result are obtained, language model excitation is performed on the first speech recognition result, the second speech recognition result and the third speech recognition result respectively; then, according to the language scores of the excited first, second and third speech recognition results, the final speech recognition result is determined from the first, second and third speech recognition results through language score PK.
- the present application introduces the construction process of the sentence pattern decoding network used to construct the speech recognition decoding network in the above-mentioned embodiments of the speech recognition method.
- the construction process of the sentence pattern decoding network described below is only an exemplary and preferred implementation plan.
- the sentence pattern decoding network under the business scenario of the voice to be recognized described in the above-mentioned embodiments can be constructed by executing the following steps A1-A3:
- A1. Construct a text sentence network by performing sentence pattern induction and grammatical slot definition processing on the corpus data in the scene to which the voice to be recognized belongs.
- the corpus data in the business scenario to which the speech to be recognized belongs is the annotated voice data collected from the actual business scenario, for example, the data annotated as corpus of the scene of making calls or sending text messages. Alternatively, it can also be artificially expanded based on experience to obtain corpus data that conforms to grammatical logic and the business scenario. For example, "I want to give John a call" and "send a message to Peter for me" are two corpus data use cases. Because the subsequent sentence pattern induction and grammar slot definition are based directly on the corpus, the corpus collected at this stage should have a high degree of sentence pattern coverage, but there is no requirement for the coverage of vertical keywords.
- the sentence patterns of the user's voice are usually regular, or exhaustive.
- the sentence network corresponding to the business scenario can be obtained, which is named the text sentence network in the embodiment of this application.
- the text slots in the text sentences are divided into ordinary grammar slots and replacement grammar slots, wherein the text slots corresponding to non-vertical keywords are defined as ordinary grammar slots, and the text slots corresponding to vertical keywords are defined as is defined as a replacement syntax slot.
- in the ordinary grammar slot, the non-vertical-keyword content of the text sentence pattern is stored, and in the replacement grammar slot, a placeholder corresponding to the vertical keyword is stored.
- the number of ordinary grammar slots can be one or more, and each vertical keyword text slot corresponds to a replacement grammar slot.
- the text sentence network constructed in the above manner is composed of network nodes and directed arcs connecting the nodes, and the text sentence network is defined based on ABNF (Augmented Backus-Naur Form). Specifically, as shown in Figure 7, there is label information on the directed arcs of the text sentence network, and the label information is the placeholder of the replacement grammar slot corresponding to the directed arc, or the text of the ordinary grammar slot corresponding to the directed arc.
- Figure 7 is a text sentence network defined according to the collected corpus data in the scenario of calling or texting.
- the directed arcs with the " ⁇ xxxx>" tag are called ordinary grammar slots, which contain at least one entry, and the collection of all entries must be completed in the stage of text sentence network definition.
- Figure 7 includes two common grammar slots, namely ⁇ phone> and ⁇ sth>, where ⁇ phone> corresponds to the content of the text entry in front of the name of the address book in the scene corpus use case, such as "I want to give” and "send a message” are the two entries of the syntax slot ⁇ phone>; ⁇ sth> indicates the content of the text entry behind the name of the address book in the use case, such as “a call” and “for me” are the two entries of the syntax slot ⁇ sth> entry.
- the directed arc with the label "xxx" is called a replacement grammar slot, which means that at the sentence pattern definition stage it does not need to carry an actual entry but only needs to be matched with a "#placeholder#" placeholder.
- the actual tokens are dynamically passed in when building the speech recognition decoding network.
- the "name" in Figure 7 is a replacement grammar slot, and the subsequent dynamically created vertical keyword network will be inserted into the replacement grammar slot to form a complete speech recognition decoding network.
- the last type of directed arc with a "-" label is called a virtual arc.
- the virtual arc refers to a directed arc that carries no grammar slot or entry information, indicating that the corresponding path is optional; a virtual arc always accompanies a corresponding grammar slot.
- all grammar slots in the text sentence network can be ID marked, defined as the slot_id field of the grammar slot, and globally unique identifiers can be set.
- the sentence patterns defined by the sentence network, the grammar slots, and the entries of the ordinary grammar slots together constitute the text sentence network.
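- the elements above can be represented, for example, with the following sketch (an assumed in-memory representation, not the ABNF file format itself): ordinary grammar slots carry their entries, the replacement grammar slot carries only the "#placeholder#" token, and every slot has a globally unique slot_id.

```python
from dataclasses import dataclass, field

@dataclass
class GrammarSlot:
    slot_id: int                      # globally unique identifier of the grammar slot
    name: str                         # e.g. "<phone>", "name", "<sth>"
    is_replacement: bool = False      # True for the vertical-keyword (replacement) slot
    entries: list = field(default_factory=list)

    def tokens(self):
        return ["#placeholder#"] if self.is_replacement else self.entries

@dataclass
class TextSentenceNetwork:
    slots: list                       # slots in order, defining one sentence pattern

call_pattern = TextSentenceNetwork(slots=[
    GrammarSlot(0, "<phone>", entries=["I want to give", "send a message to", "give a call"]),
    GrammarSlot(1, "name", is_replacement=True),
    GrammarSlot(2, "<sth>", entries=["for me", "a call", "a call with her number"]),
])
print([s.tokens() for s in call_pattern.slots])
```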
- the word-level sentence pattern decoding network includes several nodes and directed arcs between the nodes.
- each entry in the common grammar slot is segmented to obtain each word corresponding to each entry.
- each word corresponding to the same entry is used to expand word nodes, that is, the word segmentation results of the same entry are connected through nodes and directed arcs to obtain the word string corresponding to the entry.
- the directed arc between two nodes is marked with the word information obtained by word segmentation, where the left and right sides of the colon represent the input and output information respectively, and the input and output information are set to be the same here.
- the grammar slot ⁇ phone> contains three entries: "I want to give”, “send a message to” and "give a call”.
- the grammatical slot ⁇ sth> contains three entries: “for me”, “a call” and "a call with her number”.
- the connection between node 10 and node 18 indicates that you can go directly from node 10 to the end node, and the " ⁇ /s>" on the arc represents silence.
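- step A2 can be sketched as follows (an assumed node/arc layout): each entry of an ordinary grammar slot is segmented into words, the words of one entry are chained in series through newly numbered nodes, and different entries of the same slot are connected in parallel between the slot's shared start and end nodes.

```python
def expand_slot(entries, start_node, end_node, next_node_id, slot_id):
    """Return word-level arcs (left_node, right_node, "word:word", slot_id) for one grammar slot."""
    arcs = []
    for entry in entries:
        words = entry.split()                                 # word segmentation (whitespace here)
        left = start_node
        for i, word in enumerate(words):
            if i == len(words) - 1:
                right = end_node                              # last word ends at the shared end node
            else:
                right, next_node_id = next_node_id, next_node_id + 1
            arcs.append((left, right, f"{word}:{word}", slot_id))  # input and output are identical
            left = right
    return arcs, next_node_id

arcs, _ = expand_slot(["I want to give", "send a message to"],
                      start_node=0, end_node=10, next_node_id=1, slot_id=0)
for arc in arcs:
    print(arc)
```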
- the decoding network obtained in this way can be used as the sentence pattern decoding network in the scene to which the speech to be recognized belongs.
- each word in the common grammar slot of the word-level sentence decoding network is replaced with the corresponding pronunciation.
- the corresponding relationship between existing words and pronunciations can be queried through the pronunciation dictionary, so as to determine the pronunciation corresponding to each word marked on the directed arc in the common grammar slot of the word-level sentence pattern decoding network. On this basis, use the pronunciation corresponding to the word to replace the word marked on the directed arc.
- each pronunciation in the word-level sentence pattern decoding network is divided into pronunciation units, and each pronunciation unit corresponding to the pronunciation is used to expand the pronunciation node to obtain the pronunciation level sentence pattern decoding network.
- for each pronunciation on the directed arcs of the word-level sentence pattern decoding network, its pronunciation units are determined and the pronunciation is divided accordingly.
- This application exemplarily divides a pronunciation into a phoneme sequence. For example, the word "I" is pronounced as the single phoneme "ay", and the word "give" is pronounced as the phoneme string "g ih v".
- the pronunciation nodes are expanded and connected in series according to the arrangement order and quantity of the pronunciation units.
- each phoneme of the same pronunciation is connected in sequence through nodes and directed arcs to obtain a phoneme string corresponding to the pronunciation.
- the phoneme string corresponding to the pronunciation is used to replace the pronunciation, and the word-level sentence decoding network is extended to the pronunciation-level sentence decoding network.
- the replacement grammar slot is still not expanded.
- the pronunciation-level sentence pattern decoding network is used as the sentence pattern decoding network in the business scenario to which the voice to be recognized belongs.
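- a minimal sketch of replacing a word arc with a chain of phoneme arcs is shown below; the toy pronunciation dictionary is an assumption for illustration, not the patent's lexicon.
```python
# Illustrative sketch: replace one word arc with a chain of phoneme arcs.
PRON_DICT = {"I": ["ay"], "give": ["g", "ih", "v"]}   # stand-in pronunciation dictionary

def expand_word_arc_to_phones(word, left_node, right_node, next_free_node):
    phones = PRON_DICT[word]             # look up the word's phoneme sequence
    arcs, prev = [], left_node
    for i, p in enumerate(phones):
        if i == len(phones) - 1:
            nxt = right_node
        else:
            nxt = next_free_node
            next_free_node += 1
        arcs.append((prev, nxt, p))      # the arc now carries a pronunciation unit
        prev = nxt
    return arcs, next_free_node
```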
- Figure 8 shows a schematic diagram of a simple pronunciation-level sentence decoding network.
- the nodes in the sentence pattern decoding network are numbered in sequence; pronunciation units with the same start node identifier share one start node, and pronunciation units with the same end node identifier share one end node.
- a single node in the network includes 3 attribute fields: the id number, the number of incoming arcs, and the number of outgoing arcs, which constitute a node storage triplet. The incoming arcs of a node are the directed arcs pointing to the node, and the outgoing arcs are the directed arcs leaving the node.
- a single directed arc in the network includes 4 attribute fields: the left node number, the right node number, the pronunciation information on the arc, and the identifier slot_id of the grammar slot to which it belongs. At the same time, the total number of nodes and the total number of arcs in the network are recorded.
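- the node triplet and the four arc fields just described can be captured directly; the Python layout below is only an illustration of that storage scheme.
```python
from dataclasses import dataclass, field

@dataclass
class Node:                 # node storage triplet described above
    node_id: int
    num_in_arcs: int = 0
    num_out_arcs: int = 0

@dataclass
class Arc:                  # the four arc attribute fields described above
    left_node: int
    right_node: int
    pronunciation: str      # pronunciation information carried on the arc
    slot_id: int            # grammar slot the arc belongs to

@dataclass
class DecodingNetwork:      # the network also records total node and arc counts
    nodes: list = field(default_factory=list)
    arcs: list = field(default_factory=list)

    @property
    def num_nodes(self): return len(self.nodes)
    @property
    def num_arcs(self): return len(self.arcs)
```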
- in the network, the left node of the ordinary grammar slot <phone> is node 0 and its right node is node 10; the left node of the replacement grammar slot name is node 10 and its right node is node 11; and the left node of the ordinary grammar slot <sth> is node 11 and its right node is node 16.
- arcs in the obtained pronunciation-level sentence pattern decoding network that satisfy the merging conditions can be merged and optimized, and redundant nodes can be deleted to reduce the complexity of the network.
- the specific method is the same as general decoding network optimization methods and is not detailed here.
- the decoding network obtained after completing the above steps is the sentence decoding network, which can be loaded into the cloud speech recognition service as a global resource.
- since the real address book information has not yet been recorded in the replacement grammar slot, the network does not yet have actual decoding capability.
- a speech recognition decoding network that is actually used to decode the acoustic state sequence of the speech to be recognized can be constructed.
- the embodiment of the present application will further give an example introduction to the construction process of the speech recognition decoding network.
- the embodiment of the present application constructs a speech recognition decoding network by performing the following steps B1-B3:
- the sentence pattern decoding network under the business scenario to which the voice to be recognized belongs can be constructed in advance according to the above-mentioned embodiments.
- the sentence pattern decoding network can be directly called.
- alternatively, when step B1 is executed, the sentence pattern decoding network can be constructed in real time.
- the set of vertical keywords in the business scenario to which the voice to be recognized belongs refers to a set composed of all vertical keywords in the business scenario to which the voice to be recognized belongs.
- for example, if the speech to be recognized is speech in a call scenario, the set of vertical keywords in the business scenario to which the speech belongs can specifically be a name set composed of the names in the user's address book; if the speech to be recognized is speech in a voice navigation scenario, the set of vertical keywords can specifically be a place-name set composed of the place names in the region where the user is located.
- below, the address book is used as the set of vertical keywords, and the construction of a person-name network in a call scenario is taken as an example to introduce the specific implementation of building a vertical keyword network.
- a word-level vertical keyword network is constructed based on each vertical keyword in the vertical keyword set under the business scenario to which the speech to be recognized belongs.
- for example, to construct a word-level person-name network, word segmentation is performed on each name in the address book to obtain the words contained in the name, and the words are connected in series through nodes and directed arcs to obtain the word string corresponding to the name; the word strings corresponding to different names are then connected in parallel, thereby constructing a word-level person-name network.
- each word in the word-level vertical keyword network is replaced with the corresponding pronunciation, and the pronunciation node is expanded according to the pronunciation corresponding to the word to obtain the pronunciation-level vertical keyword network.
- for example, to construct the pronunciation-level person-name network, for each word in the word-level person-name network, its pronunciation and the phonemes contained in that pronunciation are determined, the phonemes are connected through nodes and directed arcs to form a phoneme string, and the word is then replaced with the phoneme string corresponding to its pronunciation, thereby obtaining the pronunciation-level person-name network.
- a pronunciation-level vertical keyword network is obtained, which is the final vertical keyword network constructed.
- the node whose number of incoming arcs is 0 is the start node of the network, and the node whose number of outgoing arcs is 0 is the end node of the network.
- Figure 10 shows the pronunciation-level personal name network obtained after pronunciation replacement of the word-level personal name network shown in Figure 9 .
- node 0 is the start node of the network
- node 8 is the end node of the network.
- the finally constructed sentence pattern decoding network and vertical keyword network are composed of nodes and directed arcs connecting nodes.
- the pronunciation information of the text in the slot is stored on the directed arc corresponding to the ordinary grammar slot of the sentence pattern decoding network
- the placeholder is stored on the directed arc corresponding to the replacement grammar slot of the sentence pattern decoding network
- and the pronunciation information of the text in the corresponding slot is stored on the directed arcs of each grammar slot of the vertical keyword network.
- the left and right nodes of the vertical keyword network and the replacement grammar slots of the sentence pattern decoding network are respectively connected through directed arcs, that is, the vertical keyword network is used to replace the replacement grammar slots in the sentence pattern decoding network.
- specifically, the left node of the replacement grammar slot and the right node of each outgoing arc of the start node of the vertical keyword network are connected by directed arcs, and each connecting directed arc stores the pronunciation information on the corresponding outgoing arc; the left node of each incoming arc of the end node of the vertical keyword network and the right node of the replacement grammar slot are connected by directed arcs, and each connecting directed arc stores the pronunciation information on the corresponding incoming arc, so as to construct the speech recognition decoding network.
- in the embodiment of the present application, a unique identifier of each keyword is stored on the first arc and the last arc of that keyword in the vertical keyword network, and the unique identifiers can, for example, be set as the hash codes of the keywords.
- the directed arc between nodes (0, 1) and the directed arc between nodes (7, 8) respectively store the hash code corresponding to the person name "Jack Alen".
- this embodiment of the present application also sets a keyword information set that has already entered the network.
- the unique identifier of each keyword that has been inserted into the sentence pattern decoding network is stored in correspondence with the left and right node numbers, in the sentence pattern decoding network, of the directed arc where that unique identifier is located.
- the above keyword information set that has entered the network can adopt a HashMap storage structure in key:value form, where the key is the hash code corresponding to the vertical keyword and the value is the set of node number pairs of the directed arcs where that hash code is located. The initial HashMap is empty; throughout the entire recognition service process, the HashMap uniquely stores the hash codes and node number pairs of all dynamically passed-in vertical keyword entries, but records neither user IDs nor the mapping relationship between user IDs and vertical keyword sets.
- the setting of the above-mentioned keyword information set that has entered the network can facilitate the identification of vertical keyword information that has been inserted into the sentence pattern decoding network, that is, it can clarify the vertical keyword information that already exists in the speech recognition decoding network.
- it is possible to determine whether a vertical keyword has already been inserted by querying the keyword information set that has entered the network; when it is determined that the vertical keyword to be inserted already exists in the speech recognition decoding network, the insertion of that vertical keyword can be skipped and the insertion of the remaining vertical keywords can continue.
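- a minimal dict-based sketch of this keyword information set is shown below; using Python's built-in hash() is only a stand-in for the real hash code, and no user IDs are stored, consistent with the description above.
```python
# hash code of the keyword -> set of (left_node, right_node) pairs of the arcs
# in the sentence pattern decoding network that carry that hash code
inserted_keywords: dict[int, set[tuple[int, int]]] = {}

def already_inserted(keyword: str) -> bool:
    return hash(keyword) in inserted_keywords

def record_insertion(keyword: str, left_node: int, right_node: int) -> None:
    inserted_keywords.setdefault(hash(keyword), set()).add((left_node, right_node))
```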
- the right node of each traversed outgoing arc is connected to the left node of the replacement grammar slot through a directed arc; the pronunciation information on the traversed outgoing arc is stored on that directed arc, and the number of incoming arcs or outgoing arcs of the nodes at both ends of the directed arc is updated.
- specifically, each outgoing arc of the start node in the person-name network is traversed; for each traversed outgoing arc, the name hash code on the arc is obtained and compared with all hash codes in the keyword information set that has entered the network. If the hash code matches any hash code in the set, the name corresponding to that hash code has already been inserted into the sentence pattern decoding network; the arc is therefore skipped, and the hash code of the next outgoing arc is examined.
- if the hash code of the traversed outgoing arc does not match any hash code in the keyword information set that has entered the network, the name corresponding to that hash code has not yet been inserted into the sentence pattern decoding network.
- in that case, the right node of the traversed outgoing arc is connected to the left node of the replacement grammar slot of the sentence pattern decoding network through a directed arc, the pronunciation information on the traversed outgoing arc is stored on the connecting directed arc, and the number of incoming or outgoing arcs of the nodes at both ends of the directed arc is updated.
- the above process realizes the connection of the right node of each arc out of the starting node of the vertical keyword network and the left node of the replacement grammar slot of the sentence pattern decoding network.
- similarly, the left node of each traversed incoming arc is connected to the right node of the replacement grammar slot through a directed arc, and the pronunciation information on the traversed incoming arc is stored on that directed arc.
- specifically, each incoming arc of the end node in the person-name network is traversed, and for each traversed incoming arc, the name hash code on the arc is obtained to determine whether the corresponding name has already been inserted into the sentence pattern decoding network.
- the hash code is compared with all hash codes in the keyword information set that has entered the network; if it matches any of them, the name corresponding to that hash code has already been inserted into the sentence pattern decoding network, so the incoming arc is skipped and the hash code of the next incoming arc is examined.
- otherwise, the left node of the traversed incoming arc is connected to the right node of the replacement grammar slot of the sentence pattern decoding network through a directed arc, the pronunciation information on the incoming arc is stored on the connecting directed arc, and the number of incoming or outgoing arcs of the nodes at both ends of the directed arc is updated.
- the above process realizes the connection between the left node of each incoming arc of the end node of the vertical keyword network and the right node of the replacement grammar slot of the sentence pattern decoding network.
- the execution order of the above start-node and end-node insertion operations of the vertical keyword network can be arranged flexibly; for example, the insertion operation can be performed on the start node of the vertical keyword network first, or on the end node first, or on both at the same time.
- in addition, each time a keyword is inserted, the embodiment of the present application stores the unique identifier of the keyword, together with the left and right node numbers in the sentence pattern decoding network of the directed arc where that unique identifier is located, in the keyword information set that has entered the network.
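- the traversal and connection logic described above can be sketched as follows; the tuple shapes and the fact that the recorded node pairs are the newly added connecting arcs are simplifying assumptions, and arc-count bookkeeping is omitted.
```python
def insert_keyword_network(out_arcs, in_arcs, slot_left, slot_right,
                           new_arcs, inserted_keywords):
    """out_arcs / in_arcs: (other_node, pronunciation, keyword_hash) tuples for the
    keyword network's start-node outgoing arcs and end-node incoming arcs."""
    for right_node, pron, kw_hash in out_arcs:          # start-node side
        if kw_hash in inserted_keywords:                 # keyword already inserted: skip
            continue
        new_arcs.append((slot_left, right_node, pron))   # connect and copy pronunciation
    for left_node, pron, kw_hash in in_arcs:             # end-node side
        if kw_hash in inserted_keywords:
            continue
        new_arcs.append((left_node, slot_right, pron))
        inserted_keywords.setdefault(kw_hash, set()).add((left_node, slot_right))
```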
- each user can upload a vertical keyword set to the cloud server, and the cloud server can build a large-scale or even ultra-large-scale speech recognition decoding network, so as to meet the calling needs of various users.
- the setting of the network keyword information set can improve the efficiency of inserting vertical keywords into the speech recognition decoding network, and at the same time can facilitate the selection of a specific decoding path according to the speech recognition needs of the current user.
- the speech recognition decoding network constructed according to the above scheme includes not only the address book entry information dynamically passed in during the current session, but also the address book information of other historical sessions, all of which is uniformly stored in the network keyword information set in the form of hash codes.
- the path of the speech recognition decoding network needs to be updated, so that the decoding path is limited to the range of the current incoming address book.
- the specific implementation method is:
- taking the scenario in which the user makes a call by voice as an example:
- each hash code in the keyword information set that has entered the network is traversed; if the traversed hash code belongs to a name hash code in the address book passed in this time, no processing is performed; if it does not belong to the address book passed in this time, the left and right node numbers corresponding to that hash code are determined by querying the keyword information set, and the directed arc between those left and right nodes is disconnected.
- in the speech recognition decoding network that actually participates in decoding, only the decoding paths corresponding to the address book passed in this time remain connected, so the network can only decode speech recognition results for calling a name in the current address book, which is in line with user expectations.
- the decoding path can be limited to the range of vertical keywords set introduced this time, which is beneficial to narrowing the path search range, improving decoding efficiency and reducing decoding errors.
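- a sketch of this path restriction is given below; the arc representation follows the earlier sketches, and the use of hash() as the hash code is again a stand-in.
```python
def restrict_to_current_contacts(inserted_keywords, current_contacts, arcs):
    """inserted_keywords: hash -> set of (left, right) node pairs; current_contacts:
    the names passed in for this session; arcs: list of (left, right, pronunciation)."""
    current_hashes = {hash(name) for name in current_contacts}
    to_drop = set()
    for kw_hash, node_pairs in inserted_keywords.items():
        if kw_hash not in current_hashes:     # keyword not in this session's address book
            to_drop.update(node_pairs)
    # disconnect the directed arcs between the recorded node pairs
    arcs[:] = [a for a in arcs if (a[0], a[1]) not in to_drop]
```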
- the speech recognition decoding network constructed based on the set of vertical keywords and the sentence pattern decoding network is the main network for realizing speech recognition involving vertical keywords.
- the inventors of the present application found in research that the speech recognition decoding network has two deficiencies, one is the problem of false triggering of vertical keywords, and the other is the problem of insufficient network coverage.
- the false triggering problem is the most common and the most difficult problem in speech recognition based on a fixed-sentence-pattern speech recognition decoding network. False triggering means that the actual content of a piece of audio does not follow any sentence pattern in the fixed-sentence-pattern speech recognition decoding network, yet the result from that network nevertheless wins in the end. False triggering of vertical keywords means that the true content contains no vertical keyword, or the vertical keyword is not in the vertical keyword set passed in this time, yet the result of the speech recognition decoding network wins and a wrong vertical keyword is given. For example, in a phone call scenario, the true content contains no name, or the name is not in the address book passed in this time, yet the result output by the speech recognition decoding network wins and a wrong name is given.
- False triggers of vertical keywords generally fall into the following four types: (1) false triggers where the vertical keyword and the true result have the same pronunciation; (2) false triggers where the vertical keyword and the true result have similar pronunciations; (3) false triggers introduced by the excitation strategy even though the pronunciation of the vertical keyword differs greatly from the true result; (4) false triggers where the true result contains no vertical keyword, but the speech recognition decoding network recognizes one.
- the root cause can be attributed to insufficient sentence pattern coverage in the speech recognition decoding network.
- the embodiment of the present application proposes a corresponding solution to the problem of insufficient sentence pattern coverage of the speech recognition decoding network. For other false triggering situations, it will be solved and optimized through other solutions in subsequent embodiments.
- the speech recognition decoding network is built based on the sentence pattern decoding network, that is, based on the sentence pattern.
- the advantage of this network construction method is that it can accurately match the speech sentence pattern.
- the disadvantage is that a sentence pattern not present in the network cannot be recognized, and the corpus used to build the network often cannot cover all sentence patterns in all scenarios, so the speech recognition decoding network introduces the problem that its sentence pattern coverage is not high enough.
- in contrast, the general speech recognition model is trained on massive data; its sentence patterns are extensible and very rich.
- the result errors of the general-purpose speech recognition model are mainly vertical keyword recognition errors, chiefly because the general-purpose model incorporates language scores learned during corpus training, and the corpus often fits the scores of out-of-set vertical keywords poorly.
- although the vertical keywords recognized by the general speech recognition model are wrong, the sentence pattern is correct, mainly because sentence patterns can be well fitted by the training data.
- in summary, the sentence pattern information in the recognition results of the general speech recognition model is more reliable, while the vertical keyword information in the recognition results of the speech recognition decoding network is more reliable. Based on this idea, the present application proposes a sentence pattern correction scheme based on the general speech recognition model to solve the problem of low sentence pattern coverage of the speech recognition decoding network.
- the specific solution is: on the premise that the acoustic state sequence of the speech to be recognized has been decoded by the speech recognition decoding network constructed according to the technical idea of the present application and by the general speech recognition model to obtain the first speech recognition result and the second speech recognition result respectively, the first speech recognition result is corrected according to the second speech recognition result.
- the content in the first speech recognition result is divided into vertical keyword content and non-vertical keyword content.
- the content in the second speech recognition result is divided into reference text content and non-reference text content, where the reference text content refers to the text content in the second speech recognition result that matches the non-vertical keyword content in the first speech recognition result; specifically, it may be the character string most similar to the non-vertical keyword content in the first speech recognition result, or the character string whose similarity to it is greater than a set threshold.
- the above correction of the first speech recognition result according to the second speech recognition result uses the reference text content in the second speech recognition result to correct the non-vertical keyword content in the first speech recognition result, thereby obtaining the corrected first speech recognition result.
- the embodiment of the present application matches the first speech recognition result with the second speech recognition result based on the edit distance algorithm, and determines the reference text content from the second speech recognition result.
- an edit distance matrix between the first speech recognition result and the second speech recognition result is determined according to an edit distance algorithm.
- the edit distance matrix includes edit distances between each character in the first speech recognition result and each character in the second speech recognition result.
- the non-vertical keyword content in the first speech recognition result may be divided into the character string before the vertical keyword and/or the character string after the vertical keyword.
- the corresponding reference text content can also be determined through the above method.
- either the target text content in the second speech recognition result or the non-vertical keyword content in the first speech recognition result is determined as the corrected non-vertical keyword content.
- the target text content in the second speech recognition result refers to the text content corresponding to the position of the non-vertical keyword content in the first speech recognition result in the second speech recognition result.
- the position of the target text content in the second speech recognition result in the second speech recognition result is the same as the position of the non-vertical keyword content in the first speech recognition result.
- for example, if the non-vertical keyword content in the first speech recognition result is the text content before the vertical keyword in the first speech recognition result, the target text content in the second speech recognition result is specifically the text content before the position in the second speech recognition result onto which the position of the vertical keyword in the first speech recognition result is mapped.
- the mapping between the positions of vertical keywords in the first speech recognition result and the second speech recognition result can be realized based on the above-mentioned edit distance matrix.
- the process of determining the corrected non-person name content will be introduced.
- an example is taken in which the person's name in the first speech recognition result is in the middle of the sentence of the first speech recognition result, and the emphasis is on the correction process of the character string before the person's name in the first speech recognition result.
- the corresponding correction processing can also be carried out for the character string after the person's name.
- the rule for deciding whether the target text content in the second speech recognition result or the non-vertical keyword content in the first speech recognition result is determined as the corrected non-vertical keyword content is as follows:
- if the second speech recognition result has more characters than the non-vertical keyword content in the first speech recognition result, and the difference in the number of characters between the two does not exceed a set threshold, the target text content in the second speech recognition result is determined as the corrected non-vertical keyword content;
- if the second speech recognition result has fewer characters than the non-vertical keyword content in the first speech recognition result, and/or the difference in the number of characters between the two exceeds the set threshold, the non-vertical keyword content in the first speech recognition result is determined as the corrected non-vertical keyword content.
- specifically, the local maximum subsequence between the character string before the person's name and the second speech recognition result is calculated, and this local maximum subsequence is the reference text content determined from the second speech recognition result.
- the target text content in the second speech recognition result is determined as the corrected character string before the person's name.
- here the target text content in the second speech recognition result is specifically the text content preceding the position in the second speech recognition result onto which the position of the person's name in the first speech recognition result is mapped according to the edit distance matrix between the first and second speech recognition results.
- if the number of characters of the second speech recognition result is not more than the number of characters of the character string before the person's name, the character string before the person's name in the first speech recognition result is kept unchanged, that is, it is determined as the corrected character string before the person's name.
- if the number of characters in the second speech recognition result is more than the number of characters in the character string before the person's name, it is further judged whether the difference in the number of characters between the two exceeds the set threshold, for example whether the number of excess characters exceeds 20% of the number of characters in the character string before the person's name.
- if the threshold is exceeded, the character string before the person's name in the first speech recognition result remains unchanged, that is, it is determined as the corrected character string before the person's name.
- if the threshold is not exceeded, the target text content in the second speech recognition result is determined as the corrected character string before the person's name.
- the above introduction can likewise be referred to when determining the corrected character string after the person's name.
- the corrected non-vertical keyword content and the vertical keyword content are combined according to the positional relationship between the original non-vertical keyword content and the vertical keyword content, and the combination result is the corrected first speech recognition result.
- for example, the corrected character string before the person's name, the person's name, and the corrected character string after the person's name are concatenated in order to obtain the corrected first speech recognition result.
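- a sketch of the character-count rule and the reassembly just described is shown below; the 20% threshold follows the example in the text, and the position mapping via the edit-distance matrix is abstracted into the `target_text` argument as a simplification.
```python
def corrected_prefix(prefix_in_first: str, target_text: str, ratio: float = 0.2) -> str:
    extra = len(target_text) - len(prefix_in_first)
    if extra <= 0:                              # target not longer: keep the original string
        return prefix_in_first
    if extra > ratio * len(prefix_in_first):    # longer by more than the threshold: keep original
        return prefix_in_first
    return target_text                          # otherwise adopt the target text content

def assemble(prefix: str, name: str, suffix: str) -> str:
    return prefix + name + suffix               # concatenate in order to form the corrected result
```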
- Correcting the first speech recognition result according to the above-mentioned processing process can make the output result of the speech recognition decoding network not only contain more accurate vertical keyword information, but also can integrate the accurate sentence pattern information recognized by the general speech recognition model , so that the recognition results of the speech recognition decoding network are more accurate and the sentence pattern coverage is higher, and at the same time, the problem of recognition false triggering caused by the low sentence pattern coverage of the speech recognition decoding network can be solved.
- the result of the speech recognition decoding network will be excited, including acoustic excitation and language excitation.
- if the excitation is not appropriate, it may cause insufficient excitation, or cause false triggers due to excessive excitation, resulting in the second and third types of vertical keyword false triggers mentioned above, namely: (2) false triggers due to the similar pronunciation of the vertical keyword and the true result; (3) false triggers, introduced by the excitation strategy, where the pronunciation of the vertical keyword differs greatly from the true result.
- for this reason, the embodiment of the present application studies the above excitation scheme and proposes an optimized excitation scheme.
- the embodiment of the present application first proposes that, when the speech recognition decoding network and the general speech recognition model are used to decode the acoustic state sequence of the speech to be recognized, the first speech recognition result output by the speech recognition decoding network and the second speech recognition result output by the general speech recognition model are obtained.
- the first speech recognition result and the second speech recognition result are used to calculate the character edit distance according to the edit distance algorithm, so as to determine the matching degree between the two.
- the confidence degree of the first speech recognition result is further determined based on the matching degree of the first speech recognition result and the second speech recognition result.
- the matching degree threshold is a value, computed over multiple test sets, that contributes most to recognition accuracy.
- if the matching degree between the first speech recognition result and the second speech recognition result is not greater than the set matching degree threshold, the vertical keyword in the first speech recognition result and the sentence pattern of the second speech recognition result, that is, the content of the second speech recognition result other than the content corresponding to the vertical keyword, are used to construct a miniature decoding network. It can be seen that this miniature decoding network has only one decoding path.
- the miniature decoding network is then used to decode the above acoustic state sequence of the speech to be recognized again to obtain a decoding result, and this decoding result is taken as the new first speech recognition result. Then, the per-frame acoustic scores of the updated first speech recognition result obtained during decoding are accumulated or weighted and summed, and the resulting value is used as the finally determined confidence of the first speech recognition result.
- when the confidence of the first speech recognition result is greater than the preset confidence threshold, step D2 is performed: according to the acoustic score of the first speech recognition result and the acoustic score of the second speech recognition result, the final speech recognition result is selected from the first speech recognition result and the second speech recognition result.
- the above-mentioned confidence threshold refers to a score threshold determined through experiments that enables the first speech recognition result to win when performing score competition with other speech recognition results.
- if the confidence of the first speech recognition result is greater than the preset confidence threshold, it means that the first speech recognition result, by virtue of its own confidence, will not easily be eliminated in the acoustic score competition (PK) with other speech recognition results. Therefore, at this time, the acoustic score PK can be performed directly between the acoustic score of the first speech recognition result and the acoustic score of the second speech recognition result, and the one or more speech recognition results with the highest acoustic score can be selected as the final speech recognition result.
- otherwise, step D3 is performed: the acoustic score of the first speech recognition result is excited, and according to the acoustic score of the excited first speech recognition result and the acoustic score of the second speech recognition result, the final speech recognition result is selected from the first speech recognition result and the second speech recognition result.
- if the confidence of the first speech recognition result is not greater than the preset confidence threshold, it indicates that the confidence of the first speech recognition result is low, that is, its acoustic score is low.
- in that case, during the acoustic score PK it will be eliminated by other speech recognition results, so that the more accurate vertical keyword information contained in the first speech recognition result is lost from the finally selected speech recognition result, which may cause recognition errors, in particular misrecognition of vertical keywords.
- therefore, the embodiment of the present application performs acoustic score excitation on the first speech recognition result, specifically exciting the acoustic score of the slot where the vertical keyword in the first speech recognition result is located, so that this acoustic score is increased by a certain proportion; the acoustic score of the first speech recognition result is thereby increased, ensuring that it will not easily be eliminated in the subsequent acoustic score PK with other speech recognition results.
- then the acoustic score of the excited first speech recognition result and the acoustic score of the second speech recognition result can be used to perform an acoustic score PK between the two, and the one or more speech recognition results with the highest acoustic score are selected as the final speech recognition result.
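- the overall D1-D3 flow just described can be sketched at a high level as follows; the `redecode` and `excite` callables stand in for the miniature-network re-decoding and the acoustic score excitation, and are hypothetical helpers, not functions named in the patent.
```python
from dataclasses import dataclass

@dataclass
class Result:
    text: str
    acoustic_score: float
    confidence: float

def choose_result(first: Result, second: Result,
                  match_degree: float, match_threshold: float,
                  confidence_threshold: float, redecode, excite) -> Result:
    if match_degree <= match_threshold:              # D1: low match with the general model
        first = redecode(first, second)              # rebuild single-path mini network, re-decode,
                                                     # confidence = accumulated frame acoustic scores
    if first.confidence > confidence_threshold:      # D2: direct acoustic score PK
        return max(first, second, key=lambda r: r.acoustic_score)
    first = excite(first)                            # D3: excite the keyword slot, then PK
    return max(first, second, key=lambda r: r.acoustic_score)
```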
- acoustic score excitation of the first speech recognition result means exciting the acoustic score of the slot where the vertical keyword in the first speech recognition result is located; specifically, the acoustic score of that slot is multiplied by an excitation coefficient, and the acoustic score of the first speech recognition result is then recalculated on the basis of the excited vertical keyword acoustic score.
- the acoustic score excitation of the first speech recognition result can be realized as follows:
- the acoustic excitation coefficient determines the strength of the acoustic score excitation applied to the first speech recognition result. If the excitation coefficient is too large, it will cause excessive excitation and lead to the recognition false trigger problem described above; if it is too small, the purpose of the excitation will not be achieved, and the first speech recognition result may be eliminated by other speech recognition results in the score PK.
- the determination of the acoustic excitation coefficient is the key to solving the problem of false triggering of vertical keywords mentioned in the above embodiments and the problem of losing important vertical keyword information during acoustic PK.
- when determining the acoustic excitation coefficient, it should be determined at least according to the vertical keyword content and the non-vertical keyword content in the first speech recognition result, and the actual business scenario together with empirical parameters determined for that scenario should also be taken into account.
- exemplarily, the acoustic excitation coefficient may be calculated and determined according to the acoustic score excitation prior coefficient in the business scenario to which the speech to be recognized belongs, the number of characters and the number of phonemes of the vertical keyword in the first speech recognition result, and the total number of characters and the total number of phonemes of the first speech recognition result.
- the acoustic excitation coefficient RC is calculated according to the following formula:
- ⁇ is the scene prior coefficient.
- the prior coefficient ⁇ can be dynamically set in each recognition session according to the requirements of the recognition system. For example, based on natural language processing (NLP) technology, the upper-level system can predict the user's behavioral intention through the context of user interaction, and adjust the coefficient in real time and dynamically to meet the requirements of various scenarios.
- NLP natural language processing
- the design of the acoustic excitation coefficient also fully considers the number of words (word count in slot, SlotWC) and the number of phonemes (phoneme count in slot, SlotPC) in the slot where the vertical keyword is located (that is, the number of characters and the number of phonemes of the vertical keyword), as well as the number of words in the sentence (word count in sentence, SentWC) and the number of phonemes in the sentence (phoneme count in sentence, SentPC) (that is, the total number of characters and the total number of phonemes of the first speech recognition result); influence weights are set for the character count and the phoneme count respectively, such that the sum of the two influence weights is 1.
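- the formula for RC announced above is not reproduced in this text; purely as an illustration, one hypothetical form consistent with the quantities just listed (scene prior coefficient α, slot and sentence word/phoneme counts, and influence weights summing to 1) could be written as follows, but this is an assumption and not the patent's actual formula.
```latex
RC = \alpha \left( \beta \cdot \frac{\mathrm{SentWC}}{\mathrm{SlotWC}}
     + (1-\beta) \cdot \frac{\mathrm{SentPC}}{\mathrm{SlotPC}} \right),
\qquad 0 \le \beta \le 1
```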
- the acoustic score excitation only excites the acoustic score of the slot where the vertical keyword is located, eliminating the interference of irrelevant context and thereby avoiding the problem of recognition false triggering caused by excessive excitation.
- exemplarily, the score confidence of the vertical keyword content in the first speech recognition result may first be calculated based on the number of phonemes and the acoustic score of the vertical keyword content in the first speech recognition result, together with the number of phonemes and the acoustic score of the non-vertical keyword content; the acoustic excitation coefficient is then determined according to this score confidence.
- the state sequence score corresponding to the misrecognized word is relatively low, which is why the correct recognition result often has the highest score.
- the average score of the acoustic sequence is lower for misrecognized results.
- the vertical keyword score confidence scheme separates the vertical keywords and the non-vertical keywords in the first speech recognition result output by the speech recognition decoding network, and calculates their total acoustic scores and phoneme counts respectively.
- the score confidence S c of the above-mentioned vertical keywords can be calculated by the following formula:
- the acoustic score of the vertical keywords in the first speech recognition result is denoted as S_p
- the number of effective acoustic phonemes occupied by the vertical keywords is denoted as N_p
- the total acoustic score of the non-vertical keywords is denoted as S_a
- the number of effective acoustic phonemes occupied by the slots of the non-vertical keywords is denoted as N_a.
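- the formula for the score confidence S_c announced above is likewise not reproduced in this text; given the per-phoneme-average reasoning in the preceding paragraphs, one hypothetical form (an assumption, not the patent's formula) would compare the average per-phoneme acoustic score of the vertical keyword with that of the non-vertical content:
```latex
S_c = \frac{S_p / N_p}{S_a / N_a}
```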
- based on this score confidence, an acoustic excitation coefficient for exciting the vertical keyword may be determined.
- the confidence score of the vertical keyword content in the first speech recognition result is compared with a preset confidence threshold.
- the confidence threshold is determined according to the probability of false triggering of the recognition caused by the acoustic excitation coefficient.
- if the score confidence of the vertical keyword content in the first speech recognition result is greater than the confidence threshold, it can be considered that the vertical keyword is likely to cause false triggers.
- in that case, the acoustic excitation coefficient of the vertical keyword should be lowered; when the score confidence of the vertical keyword content in the first speech recognition result is not greater than the confidence threshold, it can be considered that the vertical keyword is easily eliminated in the PK, and the acoustic excitation coefficient should accordingly be adjusted upward.
- the score confidence of the vertical keyword content in the first speech recognition result is jointly used to determine the acoustic excitation coefficient.
- the embodiment of the present application analyzes the recognition results of multiple test sets, counts the influence of the size of the acoustic excitation coefficient on the recognition effect and recognition false trigger, and determines the relationship between the acoustic excitation coefficient and the recognition effect and recognition false trigger.
- a value that is more balanced between the improvement of the recognition rate and the reduction of false triggers is selected.
- the principle of selection is to ensure that the number of false triggers is much smaller than the number of improved recognition effects.
- exemplarily, the acoustic excitation coefficient selected in the embodiment of the present application is one for which the number of recognition false triggers it causes is one percent of the number of cases in which it improves the recognition effect.
- finally, the acoustic score of the slot where the vertical keyword content in the first speech recognition result is located is multiplied by the acoustic excitation coefficient determined in the above steps to obtain the updated acoustic score of the vertical keyword content.
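- a minimal sketch of this update is given below; summing the excited slot score with the other scores to rebuild the sentence-level acoustic score is a simplifying assumption.
```python
def apply_acoustic_excitation(slot_score: float, other_scores: list[float],
                              rc: float) -> float:
    excited_slot_score = slot_score * rc           # excite only the keyword slot's score
    return excited_slot_score + sum(other_scores)  # recompute the total acoustic score
```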
- the language model excitation of the third speech recognition result is taken as an example to introduce the specific processing content of language model excitation.
- the specific language model excitation process is not limited by the object being excited, and the language model excitation scheme can also be applied to the excitation of other speech recognition results; for example, the language model excitation scheme described in the following embodiments is also applicable to performing language model excitation on a candidate speech recognition result selected from the first speech recognition result and the second speech recognition result.
- the language model excitation is to recalculate the score of the speech recognition result through the language model, so that the score of the speech recognition result carries the language component.
- the language model excitation mechanism is mainly realized through two aspects: one is the clustering class language model; the other is a strategy, based on matching the vertical keywords with the pronunciation sequence of the speech recognition result, that extends the paths of the speech recognition result and, based on the extended paths together with the above clustering class language model, determines the language score of the speech recognition result.
- the clustering class model is introduced.
- in specific speech recognition business scenarios involving vertical keywords, such as making phone calls, sending text messages, querying the weather, and navigation, the vertical keywords in each scenario can be limited to a finite range through enumeration or user-provided lists, and the context in which the vertical keywords appear usually follows specific sentence patterns.
- in addition to using a general training corpus, the clustering class language model also performs special processing for such specific sentence patterns or expressions.
- the clustering class language model defines a class for each specific scene, and each class is marked and distinguished by a special word (class) serving as the category label corresponding to that scene. After all category labels are defined, vertical keywords such as person names, city names, and audio/video titles in the training corpus are replaced with the corresponding category labels to form the target corpus, which is added to the original training corpus and then used to train the above clustering class language model.
- this processing makes the special word class represent the probability of a whole class of words, so the N-gram language model probability where the special word class appears in the clustering class model will be significantly higher than the probability of any specific vertical keyword itself.
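- a sketch of this corpus preprocessing is shown below; the label token and the keyword-to-label mapping are illustrative assumptions.
```python
def build_target_corpus(sentences, keyword_to_label):
    """Replace vertical keywords in training sentences with their category label."""
    target = []
    for sent in sentences:
        for kw, label in keyword_to_label.items():
            sent = sent.replace(kw, label)      # e.g. "Zhang San" -> "person's name"
        target.append(sent)
    return target

# usage: build_target_corpus(["Call Zhang San"], {"Zhang San": "person's name"})
# the target corpus is then added to the original corpus for language model training
```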
- the embodiment of the present application performs path extension on the third voice recognition result according to the vertical category keyword set under the business scenario to which the voice to be recognized belongs and the category label corresponding to the business scenario.
- exemplarily, the vertical keywords in the third speech recognition result are first compared with the vertical keywords in the vertical keyword set under the business scenario to which the speech to be recognized belongs.
- if a vertical keyword in the third speech recognition result matches any vertical keyword in the vertical keyword set under the business scenario to which the speech to be recognized belongs, a new path is extended between the left and right nodes of the slot where that vertical keyword is located, and the category label corresponding to the business scenario to which the speech to be recognized belongs is stored on the new path.
- FIG. 13 is a schematic diagram of a state network (lattice) of the speech recognition result "<s>Call Zhang San</s>" in a call service scenario.
- in the state network shown in Figure 13, a new path is extended between the left and right nodes of the slot where "Zhang San" is located.
- the new path shares the start node and end node with "Zhang San” in the original state network, and the category label "class” corresponding to the current business scenario is marked on the new path.
- the "class” can be specifically "person's name”, and the state network after path expansion is shown in FIG. 14 .
- then, the language model score of the third speech recognition result and the language model score of the extended path of the third speech recognition result are respectively determined according to the recognition results of the clustering class language model on the training corpus and the category label corresponding to the business scenario to which the speech to be recognized belongs.
- the recognition result of the training corpus by the clustering language model includes the N-gram language model probability of each word in the recognition result, and the probability is the language score of the word.
- the candidate speech recognition result may be the output of the speech recognition decoding network (that is, any one or more of the first speech recognition results) or the output of the general speech recognition model (that is, one or more of the second speech recognition results); therefore, a clustering class language model of the same type as the source of the candidate speech recognition result should be selected for re-scoring.
- the model structures of the above different types of clustering class language models are the same; the difference lies in their training corpora.
- the clustering class language model of the same type as the general speech recognition model is trained on a massive corpus and is assumed to be named model A, while the clustering class language model of the same type as the scene customization model is trained on a scene corpus and is assumed to be named model B; since model A and model B are trained on different types of training corpora, they belong to different types of clustering class language models.
- the clustering class language model of the same type as the speech recognition decoding network is also trained on a scene corpus and is assumed to be named model C; since model B and model C are trained on the same type of training corpus, they belong to the same type of clustering class language models.
- Table 1 shows the calculation method for re-scoring the two paths in Fig. 14.
- in this way, the language model score scoreA of the third speech recognition result and the language model score scoreB of the extended path of the third speech recognition result can be respectively determined.
- the language model score of the third speech recognition result and the language model score of its extended path are used to determine the language score of the third speech recognition result after language model excitation.
- specifically, the language model score scoreA of the third speech recognition result and the language model score scoreB of the extended path of the third speech recognition result are fused according to a certain ratio, with the two fusion coefficients summing to 1, to obtain the language score of the third speech recognition result after language model excitation.
- the language score Score after the language model excitation of the third speech recognition result can be calculated by the following formula:
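- the formula itself is not reproduced in this text; based on the stated fusion with coefficients summing to 1 and the empirical coefficient λ described next, it takes the following form (which of the two scores receives λ is an assumption here):
```latex
\mathrm{Score} = \lambda \cdot \mathrm{scoreA} + (1 - \lambda) \cdot \mathrm{scoreB}
```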
- ⁇ is an empirical coefficient, and its value is determined through testing. Specifically, it is determined with the goal of obtaining the correct language score and then selecting the correct speech recognition result from numerous speech recognition results based on the language score PK.
- the above embodiments have introduced the specific implementation schemes of acoustic score incentives and language model incentives.
- the problem of false triggering of vertical keywords has been fully considered.
- the problem of false triggering caused by the same pronunciation of the vertical keyword and the real result cannot be solved by the above-mentioned scheme of controlling the excitation coefficient.
- the speech recognition decoding network constructed in the embodiment of the present application is a sentence network relying on an acoustic model, which does not contain language information, so the situation that the vertical keywords and the real results have the same pronunciation cannot be essentially solved.
- the embodiment of the present application presents the results to the user in the form of multiple candidates.
- since the general speech recognition model and the speech recognition decoding network share an acoustic model, when their output results have the same pronunciation, their acoustic scores must be the same. Therefore, when the acoustic score of the first speech recognition result output by the speech recognition decoding network equals the acoustic score of the second speech recognition result output by the general speech recognition model, the first speech recognition result and the second speech recognition result are jointly used as the final speech recognition result, that is, they are output simultaneously and the user selects the correct speech recognition result.
- the output sequence when the first speech recognition result and the second speech recognition result are simultaneously output can be flexibly adjusted, and it is preferable to output in the order that the first speech recognition result comes first and the second speech recognition result follows.
- the above idea of outputting speech recognition results in multiple candidate forms is also applicable to scoring PK of speech recognition results of more models.
- for example, when the output results of the speech recognition decoding network, the general speech recognition model, and the scene customization model are compared by score to decide the final speech recognition result, if there are multiple different speech recognition results with the same score, these equally scored speech recognition results can be output at the same time and the user selects the correct one.
- the embodiment of the present application also proposes a speech recognition device, as shown in FIG. 15 , the speech recognition device includes:
- the acoustic recognition unit 001 is used to obtain the acoustic state sequence of the speech to be recognized;
- the network construction unit 002 is configured to construct a speech recognition decoding network based on the set of vertical keywords and the sentence pattern decoding network under the scene to which the speech to be recognized belongs, wherein the sentence pattern decoding network is constructed at least by performing sentence pattern induction processing on the text corpus in the scene to which the speech to be recognized belongs;
- the decoding processing unit 003 is configured to use the speech recognition decoding network to decode the acoustic state sequence to obtain a speech recognition result.
- optionally, constructing the speech recognition decoding network based on the vertical keyword set and the sentence pattern decoding network under the business scene to which the speech to be recognized belongs includes:
- the cloud server constructs the speech recognition decoding network based on the set of vertical keywords and the sentence pattern decoding network under the scene to which the speech to be recognized belongs.
- the speech recognition result is used as the first speech recognition result
- the decoding processing unit 003 is also used for:
- a final speech recognition result is determined from at least the first speech recognition result and the second speech recognition result.
- the decoding processing unit 003 is further configured to:
- the scene customization model is obtained by performing speech recognition training on the speech in the scene to which the speech to be recognized belongs;
- the determining the final speech recognition result at least from the first speech recognition result and the second speech recognition result includes:
- a final speech recognition result is determined from the first speech recognition result, the second speech recognition result and the third speech recognition result.
- determining a final speech recognition result from the first speech recognition result, the second speech recognition result, and the third speech recognition result includes:
- according to the first speech recognition result, the second speech recognition result and the third speech recognition result, the final speech recognition result is determined from the first speech recognition result, the second speech recognition result and the third speech recognition result.
- determining a final speech recognition result from the first speech recognition result, the second speech recognition result, and the third speech recognition result includes:
- according to the language score of the candidate speech recognition result after language model excitation and the language score of the third speech recognition result after language model excitation, the final speech recognition result is determined from the candidate speech recognition result and the third speech recognition result.
- Another embodiment of the present application also proposes another speech recognition device; as shown in FIG. 16, the device includes:
- the acoustic recognition unit 011 is used to obtain the acoustic state sequence of the speech to be recognized;
- a multi-dimensional decoding unit 012 configured to decode the acoustic state sequence using a speech recognition decoding network to obtain a first speech recognition result, and to decode the acoustic state sequence using a general speech recognition model to obtain a second speech recognition result;
- the speech recognition decoding network is constructed based on the set of vertical keywords and the sentence pattern decoding network under the scene where the speech to be recognized belongs to;
- an acoustic excitation unit 013 configured to perform acoustic score excitation on the first speech recognition result;
- the decision processing unit 014 is configured to determine a final speech recognition result from at least the excited first speech recognition result and the second speech recognition result.
- the multi-dimensional decoding unit 012 is also configured to: decode the acoustic state sequence through a pre-trained scene customization model to obtain a third speech recognition result;
- the scene customization model is obtained by performing speech recognition training on speech in the scene to which the speech to be recognized belongs;
- the determining the final speech recognition result at least from the excited first speech recognition result and the second speech recognition result includes:
- the determining the final speech recognition result from the excited first speech recognition result, the second speech recognition result and the third speech recognition result includes:
- a candidate speech recognition result is determined from the first speech recognition result and the second speech recognition result according to the acoustic score of the first speech recognition result after acoustic score excitation and the acoustic score of the second speech recognition result; language model excitation is performed on the candidate speech recognition result and the third speech recognition result respectively; according to the language score of the candidate speech recognition result after language model excitation and the language score of the third speech recognition result after language model excitation, the final speech recognition result is determined from the candidate speech recognition result and the third speech recognition result.
- the sentence pattern decoding network in the scene to which the speech to be recognized belongs is constructed through the following processing:
- a text sentence pattern network is constructed by performing sentence pattern induction and grammar slot definition processing on the corpus data in the scene to which the speech to be recognized belongs; wherein the text sentence pattern network includes ordinary grammar slots corresponding to non-vertical keywords and replacement grammar slots corresponding to vertical keywords, and placeholders corresponding to the vertical keywords are stored in the replacement grammar slots;
- word segmentation is performed on the entries in the ordinary grammar slots of the text sentence pattern network, and word nodes are expanded according to the segmentation results to obtain a word-level sentence pattern decoding network;
- each word in the ordinary grammar slots of the word-level sentence pattern decoding network is replaced with its corresponding pronunciation, and pronunciation nodes are expanded according to the pronunciations corresponding to the words to obtain a pronunciation-level sentence pattern decoding network, which is used as the sentence pattern decoding network in the scene to which the speech to be recognized belongs.
- performing word segmentation on the entries in the ordinary grammar slots of the text sentence pattern network and expanding word nodes according to the word segmentation results to obtain a word-level sentence pattern decoding network includes:
- word segmentation is performed on each entry in the ordinary grammar slots of the text sentence pattern network to obtain the word string corresponding to each entry; the word strings corresponding to the entries of the same ordinary grammar slot are connected in parallel to obtain the word-level sentence pattern decoding network.
- replacing each word in the ordinary grammar slots of the word-level sentence pattern decoding network with the corresponding pronunciation and expanding pronunciation nodes according to the pronunciations corresponding to the words to obtain the pronunciation-level sentence pattern decoding network includes:
- each word in the ordinary grammar slots of the word-level sentence pattern decoding network is replaced with its corresponding pronunciation;
- each pronunciation in the word-level sentence pattern decoding network is divided into pronunciation units, and pronunciation nodes are expanded with the pronunciation units corresponding to each pronunciation to obtain the pronunciation-level sentence pattern decoding network.
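- For illustration, a toy sketch of the slot expansion described above is given below; the whitespace tokenizer, the tiny pronunciation dictionary and the list-of-branches representation are assumptions of this sketch (a production decoder would typically use WFST-style node and arc structures):

```python
# Toy expansion of one ordinary grammar slot into parallel pronunciation-level
# branches: word segmentation -> word nodes -> pronunciation-unit nodes.

PRON_DICT = {
    "play": ["p", "l", "ey"],
    "some": ["s", "ah", "m"],
    "music": ["m", "y", "uw", "z", "ih", "k"],
}

def expand_ordinary_slot(entries):
    """entries: text entries of one ordinary grammar slot.
    Returns one branch per entry; branches of the same slot are parallel."""
    branches = []
    for entry in entries:
        words = entry.split()                          # word segmentation (toy: whitespace)
        units = []
        for word in words:                             # replace each word with its pronunciation
            units.extend(PRON_DICT.get(word, [word]))  # expand into pronunciation units
        branches.append(units)
    return branches

print(expand_ordinary_slot(["play some music", "play music"]))
```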
- a speech recognition decoding network is constructed based on the vertical keyword set and the sentence pattern decoding network in the scene to which the speech to be recognized belongs, including: acquiring the pre-constructed sentence pattern decoding network in the scene to which the speech to be recognized belongs; constructing a vertical keyword network based on the vertical keywords in the vertical keyword set in that scene; and inserting the vertical keyword network into the sentence pattern decoding network to obtain the speech recognition decoding network.
- the construction of a vertical keyword network based on the vertical keywords in the vertical keyword set under the scene where the speech to be recognized belongs includes:
- a word-level vertical keyword network is constructed based on each vertical keyword in the vertical keyword set in the scene to which the speech to be recognized belongs;
- Each word in the word-level vertical keyword network is replaced with the corresponding pronunciation, and the pronunciation node is expanded according to the pronunciation corresponding to the word, so as to obtain the pronunciation-level vertical keyword network.
- both the vertical keyword network and the sentence pattern decoding network are composed of nodes and directed arcs connecting the nodes, and pronunciation information or placeholders are stored on the directed arcs between the nodes;
- inserting the vertical keyword network into the sentence pattern decoding network to obtain a speech recognition decoding network includes:
- the speech recognition decoding network is constructed by connecting the vertical keyword network with the left and right nodes of the replacement grammar slot of the sentence pattern decoding network through directed arcs.
- the left and right nodes of the vertical keyword network and the replacement grammar slot of the sentence pattern decoding network are respectively connected through directed arcs to construct a speech recognition decoding network, including:
- the unique identifier corresponding to the keyword is stored on the first arc and the last arc of each keyword in the vertical keyword network;
- connecting the vertical keyword network with the left and right nodes of the replacement grammar slot respectively through directed arcs to construct a speech recognition decoding network includes:
- the right node of each traversed outgoing arc is connected to the left node of the replacement grammar slot through a directed arc, and the pronunciation information on the traversed arc is stored on that directed arc;
- the left node of each traversed incoming arc is connected to the right node of the replacement grammar slot through a directed arc, and the pronunciation information on the traversed arc is stored on that directed arc.
- the pronunciation information on the incoming arc is not inserted into the sentence pattern decoding network.
- the right node of each outgoing arc of the start node of the vertical keyword network is connected with the left node of the replacement grammar slot through a directed arc, and the left node of each incoming arc of the end node of the vertical keyword network is connected with the right node of the replacement grammar slot through a directed arc, to construct a speech recognition decoding network, which also includes:
- when a keyword in the vertical keyword network is inserted into the sentence pattern decoding network, the unique identifier of that keyword, together with the left and right node numbers, in the sentence pattern decoding network, of the directed arcs on which the unique identifier is located, is correspondingly stored in the in-network keyword information set.
- the right node of each outgoing arc of the start node of the vertical keyword network is connected with the left node of the replacement grammar slot through a directed arc, and the left node of each incoming arc of the end node of the vertical keyword network is connected with the right node of the replacement grammar slot through a directed arc, to construct a speech recognition decoding network, which also includes:
- each unique identifier in the in-network keyword information set is traversed; if a traversed unique identifier is not the unique identifier of any keyword in the vertical keyword set in the scene to which the speech to be recognized belongs, the directed arc between the left and right node numbers corresponding to that unique identifier is disconnected.
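- The following rough sketch illustrates inserting a keyword between the left and right nodes of a replacement grammar slot and later disconnecting keywords that have left the current vertical keyword set; the node and arc bookkeeping and the registry structure are inventions of this sketch:

```python
import itertools

_node_ids = itertools.count(1000)        # toy generator of internal node numbers

class Network:
    def __init__(self):
        self.arcs = []                   # list of (left_node, right_node, label)

    def add_arc(self, left, right, label):
        self.arcs.append((left, right, label))

def insert_keyword(net, slot_left, slot_right, phones, keyword_id, registry):
    """Insert one keyword (a sequence of pronunciation units) between the left
    and right nodes of the replacement grammar slot."""
    prev = slot_left
    arc_nodes = []
    for i, phone in enumerate(phones):
        nxt = slot_right if i == len(phones) - 1 else next(_node_ids)
        # The unique identifier is carried on the first and last arcs only.
        ident = keyword_id if i in (0, len(phones) - 1) else None
        net.add_arc(prev, nxt, (phone, ident))
        arc_nodes.append((prev, nxt))
        prev = nxt
    registry[keyword_id] = arc_nodes     # in-network keyword information set

def prune_stale_keywords(net, registry, current_keyword_ids):
    """Disconnect arcs of keywords no longer in the vertical keyword set."""
    for kid in list(registry):
        if kid not in current_keyword_ids:
            stale = set(registry.pop(kid))
            net.arcs = [a for a in net.arcs if (a[0], a[1]) not in stale]
```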
- the above speech recognition device further includes:
- the result modification unit is configured to modify the first speech recognition result according to the second speech recognition result.
- modifying the first speech recognition result according to the second speech recognition result includes:
- the reference text content is the text content in the second speech recognition result that matches the non-vertical keyword content in the first speech recognition result.
- the non-vertical keyword content in the first speech recognition result is corrected by using the reference text content in the second speech recognition result to obtain the corrected first speech recognition result, including:
- the modified first speech recognition result is obtained by combining the modified non-vertical keyword content and the vertical keyword content.
- determining from the second speech recognition result the text content corresponding to the non-vertical keyword content in the first speech recognition result, as a reference text content includes:
- an edit distance matrix between the first speech recognition result and the second speech recognition result is determined according to an edit distance algorithm; according to the edit distance matrix and the non-vertical keyword content in the first speech recognition result, the text content corresponding to that non-vertical keyword content is determined from the second speech recognition result and used as the reference text content.
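- A sketch of the edit-distance computation used for the alignment is shown below; character-level alignment is an assumption of this sketch, and a backtrace over the matrix can then map the span of the non-vertical keyword content in the first result onto the corresponding span (the reference text content) in the second result:

```python
def edit_distance_matrix(a, b):
    """Classic Levenshtein dynamic-programming matrix between strings a and b."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d
```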
- determining the modified non-vertical keyword content according to the reference text content in the second speech recognition result and the non-vertical keyword content in the first speech recognition result includes:
- the target text content in the second speech recognition result or the non-vertical keyword content in the first speech recognition result is determined as the modified non-vertical keyword content;
- the target text content in the second speech recognition result refers to the text content in the second speech recognition result corresponding to the position of the non-vertical keyword content in the first speech recognition result.
- determining the target text content in the second speech recognition result or the non-vertical keyword content in the first speech recognition result as the modified non-vertical keyword content includes:
- if the reference text content in the second speech recognition result is the same as the non-vertical keyword content in the first speech recognition result, the target text content in the second speech recognition result is determined as the modified non-vertical keyword content;
- if the second speech recognition result has more characters than the non-vertical keyword content in the first speech recognition result, and the difference in the number of characters between the two does not exceed a set threshold, the target text content in the second speech recognition result is determined as the modified non-vertical keyword content;
- if the second speech recognition result has fewer characters than the non-vertical keyword content in the first speech recognition result, and/or the difference in the number of characters between the two exceeds the set threshold, the non-vertical keyword content in the first speech recognition result is determined as the modified non-vertical keyword content.
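- The replacement rule above can be sketched roughly as follows; the threshold value is an assumed configuration parameter, not specified by the text:

```python
def corrected_non_vertical(second_text, target_text, reference_text,
                           non_vertical, threshold=2):
    """second_text/target_text/reference_text come from the second result;
    non_vertical is the non-vertical keyword content of the first result."""
    if reference_text == non_vertical:
        return target_text
    if (len(second_text) > len(non_vertical)
            and len(second_text) - len(non_vertical) <= threshold):
        return target_text
    return non_vertical
```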
- the determining the final speech recognition result at least from the first speech recognition result and the second speech recognition result includes:
- the confidence of the first speech recognition result is determined based on the degree of matching between the first speech recognition result and the second speech recognition result, and whether the confidence is greater than a preset confidence threshold is determined;
- when the confidence of the first speech recognition result is greater than the preset confidence threshold, the final speech recognition result is selected from the first speech recognition result and the second speech recognition result according to their acoustic scores;
- when the confidence of the first speech recognition result is not greater than the preset confidence threshold, acoustic score excitation is performed on the first speech recognition result, and the final speech recognition result is selected from the first speech recognition result and the second speech recognition result according to the acoustic score of the excited first speech recognition result and the acoustic score of the second speech recognition result.
- the determining the confidence level of the first speech recognition result based on the matching degree between the first speech recognition result and the second speech recognition result includes:
- when the acoustic scores of the first speech recognition result and the second speech recognition result are the same, the first speech recognition result and the second speech recognition result are jointly used as the final speech recognition result.
- performing acoustic score excitation on the first speech recognition result includes:
- an acoustic excitation coefficient is determined at least according to the vertical keyword content and the non-vertical keyword content in the first speech recognition result, and the acoustic score of the vertical keyword content in the first speech recognition result is updated with the acoustic excitation coefficient;
- the acoustic score of the first speech recognition result is recalculated according to the updated acoustic score of the vertical keyword content in the first speech recognition result and the acoustic score of the non-vertical keyword content in the first speech recognition result.
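- As a simple sketch, the recalculation could look like the following; the additive combination of segment scores is an assumption of the sketch (the text only states that the total score is recomputed from the updated vertical-keyword score and the non-vertical-keyword score):

```python
def excite_acoustic_score(segments, coefficient):
    """segments: list of (acoustic_score, is_vertical_keyword) pairs that cover
    the first recognition result; only vertical-keyword segments are boosted."""
    total = 0.0
    for score, is_vertical in segments:
        total += score * coefficient if is_vertical else score
    return total
```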
- the determining the acoustic excitation coefficient at least according to the vertical keyword content and the non-vertical keyword content in the first speech recognition result includes:
- the acoustic excitation coefficient is calculated and determined according to the vertical keyword content and the non-vertical keyword content in the first speech recognition result together with a prior coefficient.
- the determining the acoustic excitation coefficient at least according to the vertical keyword content and the non-vertical keyword content in the first speech recognition result includes:
- according to the number of phonemes and the acoustic score of the vertical keyword content in the first speech recognition result, and the number of phonemes and the acoustic score of the non-vertical keyword content in the first speech recognition result, the score confidence of the vertical keyword content in the first speech recognition result is determined;
- the acoustic excitation coefficient is determined at least according to the score confidence of the vertical keyword content in the first speech recognition result.
- the determining the acoustic excitation coefficient at least according to the score confidence of the vertical keyword content in the first speech recognition result includes:
- the acoustic excitation coefficient is determined according to the score confidence of the vertical keyword content in the first speech recognition result and the predetermined relationship between the acoustic excitation coefficient and the recognition effect and recognition false triggering.
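- Purely as an illustration, one way such a mapping might be realized is sketched below; the confidence formula, the thresholds and the coefficient values are all invented here, since the text only states that the mapping is predetermined from recognition-effect and false-trigger statistics:

```python
def acoustic_excitation_coefficient(vertical_phones, vertical_score,
                                    non_vertical_phones, non_vertical_score):
    # Per-phoneme average scores give a rough score confidence for the keyword.
    v_avg = vertical_score / max(vertical_phones, 1)
    n_avg = non_vertical_score / max(non_vertical_phones, 1)
    confidence = v_avg / (n_avg + 1e-6)
    if confidence > 0.9:      # keyword already scores well: mild excitation
        return 1.05
    if confidence > 0.6:
        return 1.2
    return 1.4                # weak keyword score: strongest excitation
```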
- performing language model excitation on the third speech recognition result includes:
- path extension is performed on the third speech recognition result according to the vertical keyword set in the scene to which the speech to be recognized belongs and the category label corresponding to that scene;
- the category label is determined by clustering speech recognition scenes;
- according to the recognition results of the training corpus by the clustering language model corresponding to the category label, the language model scores of the third speech recognition result and of the extension paths of the third speech recognition result are respectively determined; wherein, the clustering language model is obtained by performing speech recognition training on a target corpus in which the vertical keywords are all replaced with the category label;
- the language score of the third speech recognition result after language model excitation is determined according to the language model score of the third speech recognition result and the language model scores of the extension paths of the third speech recognition result.
- performing path extension on the third speech recognition result according to the vertical category keyword set in the scene to which the speech to be recognized belongs and the category label corresponding to the scene includes:
- the vertical keywords in the third speech recognition result are compared with the vertical keywords in the vertical keyword set in the scene to which the speech to be recognized belongs; if a vertical keyword in the third speech recognition result matches any vertical keyword in the set, a new path is extended between the left and right nodes of the slot where that vertical keyword is located in the third speech recognition result, and the category label corresponding to the scene to which the speech to be recognized belongs is stored on the new path.
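- A minimal sketch of this path extension is given below; the token-level representation and the "<CONTACT>" label are hypothetical and only serve to show how a parallel path carrying the category label can be added for scoring by the clustering language model:

```python
def extend_paths(third_result_tokens, vertical_keywords, category_label):
    """third_result_tokens: tokens of the third recognition result.
    Returns the original token sequence plus one extension path per matched
    vertical keyword, with the keyword replaced by the scene's category label."""
    paths = [list(third_result_tokens)]
    for i, token in enumerate(third_result_tokens):
        if token in vertical_keywords:
            extended = list(third_result_tokens)
            extended[i] = category_label     # the new path stores the category label
            paths.append(extended)
    return paths

# e.g. extend_paths(["call", "Zhang San"], {"Zhang San"}, "<CONTACT>")
# -> [["call", "Zhang San"], ["call", "<CONTACT>"]]
```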
- for the specific working content of each unit in the above embodiments of the speech recognition devices, please refer to the processing content of the corresponding steps of the above speech recognition method, which will not be repeated here.
- Another embodiment of the present application also proposes a speech recognition device; as shown in FIG. 17, the device includes:
- the memory 200 is connected to the processor 210 for storing programs
- the processor 210 is configured to execute the program stored in the memory 200 to implement the voice recognition method disclosed in any of the above embodiments.
- the above speech recognition device may further include: a bus, a communication interface 220 , an input device 230 and an output device 240 .
- the processor 210, the memory 200, the communication interface 220, the input device 230 and the output device 240 are connected to each other through the bus, wherein:
- a bus may include a pathway that carries information between various components of a computer system.
- the processor 210 can be a general-purpose processor, such as a general-purpose central processing unit (CPU) or a microprocessor, or can be an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling execution of the program of the present invention; it can also be a digital signal processor (DSP), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
- the processor 210 may include a main processor, and may also include a baseband chip, a modem, and the like.
- the program for executing the technical solution of the present invention is stored in the memory 200, and an operating system and other key services may also be stored.
- the program may include program code, and the program code includes computer operation instructions.
- the memory 200 may include a read-only memory (ROM) or other types of static storage devices capable of storing static information and instructions, a random access memory (RAM) or other types of dynamic storage devices capable of storing information and instructions, a disk storage, a flash memory, and the like.
- the input device 230 may include a device for receiving data and information input by a user, such as a keyboard, a mouse, a camera, a scanner, a light pen, a voice input device, a touch screen, a pedometer or a gravity sensor, and the like.
- Output devices 240 may include devices that allow information to be output to a user, such as a display screen, printer, speakers, and the like.
- Communication interface 220 may include the use of any transceiver or the like to communicate with other devices or communication networks, such as Ethernet, radio access network (RAN), wireless local area network (WLAN), and the like.
- the processor 210 executes the programs stored in the memory 200 and calls other devices, which can be used to implement each step of the speech recognition method provided by the embodiments of the present application.
- Another embodiment of the present application also provides a storage medium on which a computer program is stored;
- when the computer program is run by a processor, each step of the speech recognition method provided in any of the above embodiments is implemented.
- for the specific working content of each part of the above speech recognition device, and the specific processing content of the above computer program on the storage medium when it is run by the processor, reference may be made to the embodiments of the above speech recognition method, which will not be repeated here.
- each embodiment in this specification is described in a progressive manner, each embodiment focuses on the difference from other embodiments, and the same and similar parts of each embodiment can be referred to each other.
- for the device embodiments, since they are basically similar to the method embodiments, the description is relatively simple, and for related parts, please refer to the description of the method embodiments.
- modules and submodules in the devices and terminals in the various embodiments of the present application can be combined, divided and deleted according to actual needs.
- the disclosed terminal, device and method may be implemented in other ways.
- the terminal embodiments described above are only illustrative.
- the division of modules or sub-modules is only a logical function division; in actual implementation, there may be other division methods, for example, multiple sub-modules or modules can be combined or integrated into another module, or some features can be ignored or not implemented.
- the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or modules may be in electrical, mechanical or other forms.
- modules or sub-modules described as separate components may or may not be physically separated, and components presented as modules or sub-modules may or may not be physical modules or sub-modules, that is, they may be located in one place or distributed over multiple network modules or sub-modules; part or all of the modules or sub-modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
- each functional module or sub-module in each embodiment of the present application may be integrated into one processing module, or each module or sub-module may exist separately physically, or two or more modules or sub-modules may be integrated into one module.
- the above-mentioned integrated modules or sub-modules can be implemented in the form of hardware or in the form of software function modules or sub-modules.
- the steps of the methods or algorithms described in conjunction with the embodiments disclosed herein may be directly implemented by hardware, software units executed by a processor, or a combination of both.
- the software unit can be placed in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- Telephonic Communication Services (AREA)
- Machine Translation (AREA)
Abstract
Description
Claims (28)
- A speech recognition method, comprising: obtaining an acoustic state sequence of speech to be recognized; constructing a speech recognition decoding network based on a vertical keyword set and a sentence pattern decoding network in the scene to which the speech to be recognized belongs, wherein the sentence pattern decoding network is constructed at least by performing sentence pattern induction processing on a text corpus in the scene to which the speech to be recognized belongs; and decoding the acoustic state sequence using the speech recognition decoding network to obtain a speech recognition result.
- The method according to claim 1, wherein constructing a speech recognition decoding network based on the vertical keyword set and the sentence pattern decoding network in the scene to which the speech to be recognized belongs comprises: transmitting the vertical keyword set in the scene to which the speech to be recognized belongs to a cloud server, so that the cloud server constructs the speech recognition decoding network based on the vertical keyword set and the sentence pattern decoding network in the scene to which the speech to be recognized belongs.
- The method according to claim 1, wherein the speech recognition result is used as a first speech recognition result; the method further comprises: decoding the acoustic state sequence using a general speech recognition model to obtain a second speech recognition result; and determining a final speech recognition result at least from the first speech recognition result and the second speech recognition result.
- The method according to claim 3, further comprising: decoding the acoustic state sequence through a pre-trained scene customization model to obtain a third speech recognition result, wherein the scene customization model is obtained by performing speech recognition training on speech in the scene to which the speech to be recognized belongs; and determining a final speech recognition result at least from the first speech recognition result and the second speech recognition result comprises: determining the final speech recognition result from the first speech recognition result, the second speech recognition result and the third speech recognition result.
- The method according to claim 4, wherein determining the final speech recognition result from the first speech recognition result, the second speech recognition result and the third speech recognition result comprises: performing language model excitation on the first speech recognition result, the second speech recognition result and the third speech recognition result respectively; and determining the final speech recognition result from the first speech recognition result, the second speech recognition result and the third speech recognition result according to the language scores of the excited first, second and third speech recognition results.
- The method according to claim 4, wherein determining the final speech recognition result from the first speech recognition result, the second speech recognition result and the third speech recognition result comprises: performing acoustic score excitation on the first speech recognition result, and performing language model excitation on the third speech recognition result; determining a candidate speech recognition result from the first speech recognition result and the second speech recognition result according to the acoustic score of the first speech recognition result after acoustic score excitation and the acoustic score of the second speech recognition result; performing language model excitation on the candidate speech recognition result; and determining the final speech recognition result from the candidate speech recognition result and the third speech recognition result according to the language score of the candidate speech recognition result after language model excitation and the language score of the third speech recognition result after language model excitation.
- A speech recognition method, comprising: obtaining an acoustic state sequence of speech to be recognized; decoding the acoustic state sequence using a speech recognition decoding network to obtain a first speech recognition result, and decoding the acoustic state sequence using a general speech recognition model to obtain a second speech recognition result, wherein the speech recognition decoding network is constructed based on a vertical keyword set and a sentence pattern decoding network in the scene to which the speech to be recognized belongs; performing acoustic score excitation on the first speech recognition result; and determining a final speech recognition result at least from the excited first speech recognition result and the second speech recognition result.
- The method according to claim 7, further comprising: decoding the acoustic state sequence through a pre-trained scene customization model to obtain a third speech recognition result, wherein the scene customization model is obtained by performing speech recognition training on speech in the scene to which the speech to be recognized belongs; and determining a final speech recognition result at least from the excited first speech recognition result and the second speech recognition result comprises: determining the final speech recognition result from the excited first speech recognition result, the second speech recognition result and the third speech recognition result.
- The method according to claim 8, wherein determining the final speech recognition result from the excited first speech recognition result, the second speech recognition result and the third speech recognition result comprises: determining a candidate speech recognition result from the first speech recognition result and the second speech recognition result according to the acoustic score of the first speech recognition result after acoustic score excitation and the acoustic score of the second speech recognition result; performing language model excitation on the candidate speech recognition result and the third speech recognition result respectively; and determining the final speech recognition result from the candidate speech recognition result and the third speech recognition result according to the language score of the candidate speech recognition result after language model excitation and the language score of the third speech recognition result after language model excitation.
- The method according to any one of claims 1 to 9, wherein the sentence pattern decoding network in the scene to which the speech to be recognized belongs is constructed through the following processing: constructing a text sentence pattern network by performing sentence pattern induction and grammar slot definition processing on corpus data in the scene to which the speech to be recognized belongs, wherein the text sentence pattern network includes ordinary grammar slots corresponding to non-vertical keywords and replacement grammar slots corresponding to vertical keywords, and placeholders corresponding to the vertical keywords are stored in the replacement grammar slots; performing word segmentation on entries in the ordinary grammar slots of the text sentence pattern network and expanding word nodes according to the segmentation results to obtain a word-level sentence pattern decoding network; and replacing each word in the ordinary grammar slots of the word-level sentence pattern decoding network with its corresponding pronunciation and expanding pronunciation nodes according to the pronunciations corresponding to the words to obtain a pronunciation-level sentence pattern decoding network, which is used as the sentence pattern decoding network in the scene to which the speech to be recognized belongs.
- The method according to any one of claims 1 to 9, wherein constructing a speech recognition decoding network based on the vertical keyword set and the sentence pattern decoding network in the scene to which the speech to be recognized belongs comprises: acquiring the pre-constructed sentence pattern decoding network in the scene to which the speech to be recognized belongs; constructing a vertical keyword network based on the vertical keywords in the vertical keyword set in the scene to which the speech to be recognized belongs; and inserting the vertical keyword network into the sentence pattern decoding network to obtain the speech recognition decoding network.
- The method according to claim 11, wherein constructing a vertical keyword network based on the vertical keywords in the vertical keyword set in the scene to which the speech to be recognized belongs comprises: constructing a word-level vertical keyword network based on each vertical keyword in the vertical keyword set in the scene to which the speech to be recognized belongs; and replacing each word in the word-level vertical keyword network with its corresponding pronunciation and expanding pronunciation nodes according to the pronunciations corresponding to the words to obtain a pronunciation-level vertical keyword network.
- The method according to claim 11, wherein both the vertical keyword network and the sentence pattern decoding network are composed of nodes and directed arcs connecting the nodes, and pronunciation information or placeholders are stored on the directed arcs between the nodes; and inserting the vertical keyword network into the sentence pattern decoding network to obtain the speech recognition decoding network comprises: connecting the vertical keyword network with the left and right nodes of the replacement grammar slot of the sentence pattern decoding network respectively through directed arcs, so as to construct the speech recognition decoding network.
- The method according to claim 13, wherein a unique identifier corresponding to each keyword is stored on the first arc and the last arc of that keyword in the vertical keyword network; when a keyword in the vertical keyword network is inserted into the sentence pattern decoding network, the unique identifier of the keyword and the left and right node numbers, in the sentence pattern decoding network, of the directed arcs on which the unique identifier is located are correspondingly stored in an in-network keyword information set; wherein the in-network keyword information set correspondingly stores the unique identifiers of keywords that have been inserted into the sentence pattern decoding network and the left and right node numbers, in the sentence pattern decoding network, of the directed arcs on which the unique identifiers are located.
- The method according to claim 14, further comprising: traversing each unique identifier in the in-network keyword information set; and if a traversed unique identifier is not the unique identifier of any keyword in the vertical keyword set in the scene to which the speech to be recognized belongs, disconnecting the directed arc between the left and right node numbers corresponding to that unique identifier.
- The method according to any one of claims 3 to 9, further comprising: correcting non-vertical keyword content in the first speech recognition result by using reference text content in the second speech recognition result to obtain a modified first speech recognition result; wherein the reference text content is text content in the second speech recognition result that matches the non-vertical keyword content in the first speech recognition result.
- The method according to claim 16, wherein correcting the non-vertical keyword content in the first speech recognition result by using the reference text content in the second speech recognition result to obtain the modified first speech recognition result comprises: determining vertical keyword content and non-vertical keyword content from the first speech recognition result, and determining, from the second speech recognition result, text content corresponding to the non-vertical keyword content in the first speech recognition result as the reference text content; determining modified non-vertical keyword content according to the reference text content in the second speech recognition result and the non-vertical keyword content in the first speech recognition result; and combining the modified non-vertical keyword content and the vertical keyword content to obtain the modified first speech recognition result.
- The method according to claim 17, wherein determining, from the second speech recognition result, the text content corresponding to the non-vertical keyword content in the first speech recognition result as the reference text content comprises: determining an edit distance matrix between the first speech recognition result and the second speech recognition result according to an edit distance algorithm; and determining, from the second speech recognition result, the text content corresponding to the non-vertical keyword content in the first speech recognition result as the reference text content according to the edit distance matrix and the non-vertical keyword content in the first speech recognition result.
- The method according to claim 17, wherein determining the modified non-vertical keyword content according to the reference text content in the second speech recognition result and the non-vertical keyword content in the first speech recognition result comprises: determining whether the reference text content in the second speech recognition result is the same as the non-vertical keyword content in the first speech recognition result; if they are the same, determining the target text content in the second speech recognition result as the modified non-vertical keyword content; if they are different, determining whether the second speech recognition result has more characters than the non-vertical keyword content in the first speech recognition result and whether the difference in the number of characters between the two does not exceed a set threshold; if the second speech recognition result has more characters than the non-vertical keyword content in the first speech recognition result and the difference in the number of characters between the two does not exceed the set threshold, determining the target text content in the second speech recognition result as the modified non-vertical keyword content; if the second speech recognition result has fewer characters than the non-vertical keyword content in the first speech recognition result and/or the difference in the number of characters between the two exceeds the set threshold, determining the non-vertical keyword content in the first speech recognition result as the modified non-vertical keyword content; wherein the target text content in the second speech recognition result refers to the text content in the second speech recognition result corresponding to the position of the non-vertical keyword content in the first speech recognition result.
- The method according to claim 3, wherein determining a final speech recognition result at least from the first speech recognition result and the second speech recognition result comprises: determining whether the confidence of the first speech recognition result is greater than a preset confidence threshold; when the confidence of the first speech recognition result is greater than the preset confidence threshold, selecting the final speech recognition result from the first speech recognition result and the second speech recognition result according to the acoustic score of the first speech recognition result and the acoustic score of the second speech recognition result; and when the confidence of the first speech recognition result is not greater than the preset confidence threshold, performing acoustic score excitation on the first speech recognition result, and selecting the final speech recognition result from the first speech recognition result and the second speech recognition result according to the acoustic score of the excited first speech recognition result and the acoustic score of the second speech recognition result.
- The method according to claim 7 or 20, wherein when the acoustic scores of the first speech recognition result and the second speech recognition result are the same, the first speech recognition result and the second speech recognition result are jointly used as the final speech recognition result.
- The method according to claim 6, 4 or 20, wherein performing acoustic score excitation on the first speech recognition result comprises: determining an acoustic excitation coefficient at least according to the vertical keyword content and the non-vertical keyword content in the first speech recognition result; updating the acoustic score of the vertical keyword content in the first speech recognition result by using the acoustic excitation coefficient; and recalculating the acoustic score of the first speech recognition result according to the updated acoustic score of the vertical keyword content in the first speech recognition result and the acoustic score of the non-vertical keyword content in the first speech recognition result.
- The method according to claim 5, 6 or 9, wherein performing language model excitation on the third speech recognition result comprises: performing path extension on the third speech recognition result according to the vertical keyword set in the scene to which the speech to be recognized belongs and a category label corresponding to that scene, the category label being determined by clustering speech recognition scenes; determining, according to the recognition results of a training corpus by a clustering language model corresponding to the category label, the language model scores of the third speech recognition result and of the extension paths of the third speech recognition result respectively, wherein the clustering language model is obtained by performing speech recognition training on a target corpus in which the vertical keywords are all replaced with the category label; and determining the language score of the third speech recognition result after language model excitation according to the language model score of the third speech recognition result and the language model scores of the extension paths of the third speech recognition result.
- The method according to claim 23, wherein performing path extension on the third speech recognition result according to the vertical keyword set in the scene to which the speech to be recognized belongs and the category label corresponding to that scene comprises: comparing the vertical keywords in the third speech recognition result with the vertical keywords in the vertical keyword set in the scene to which the speech to be recognized belongs; and if a vertical keyword in the third speech recognition result matches any vertical keyword in the vertical keyword set, extending a new path between the left and right nodes of the slot where that vertical keyword is located in the third speech recognition result, and storing the category label corresponding to the scene to which the speech to be recognized belongs on the new path.
- A speech recognition device, comprising: an acoustic recognition unit configured to obtain an acoustic state sequence of speech to be recognized; a network construction unit configured to construct a speech recognition decoding network based on a vertical keyword set and a sentence pattern decoding network in the scene to which the speech to be recognized belongs, wherein the sentence pattern decoding network is constructed at least by performing sentence pattern induction processing on a text corpus in the scene to which the speech to be recognized belongs; and a decoding processing unit configured to decode the acoustic state sequence using the speech recognition decoding network to obtain a speech recognition result.
- A speech recognition device, comprising: an acoustic recognition unit configured to obtain an acoustic state sequence of speech to be recognized; a multi-dimensional decoding unit configured to decode the acoustic state sequence using a speech recognition decoding network to obtain a first speech recognition result, and to decode the acoustic state sequence using a general speech recognition model to obtain a second speech recognition result, wherein the speech recognition decoding network is constructed based on a vertical keyword set and a sentence pattern decoding network in the scene to which the speech to be recognized belongs; an acoustic excitation unit configured to perform acoustic score excitation on the first speech recognition result; and a decision processing unit configured to determine a final speech recognition result at least from the excited first speech recognition result and the second speech recognition result.
- A speech recognition device, comprising a memory and a processor; the memory is connected to the processor and is used for storing a program; and the processor is configured to implement the speech recognition method according to any one of claims 1 to 24 by running the program stored in the memory.
- A storage medium, wherein a computer program is stored on the storage medium, and when the computer program is run by a processor, the speech recognition method according to any one of claims 1 to 24 is implemented.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2024525244A JP2024537481A (ja) | 2021-10-29 | 2021-11-26 | 音声認識方法、装置、設備及び記憶媒体 |
EP21962147.1A EP4425484A1 (en) | 2021-10-29 | 2021-11-26 | Speech recognition method and apparatus, device, and storage medium |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111274880.8A CN113920999A (zh) | 2021-10-29 | 2021-10-29 | 语音识别方法、装置、设备及存储介质 |
CN202111274880.8 | 2021-10-29 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2023070803A1 true WO2023070803A1 (zh) | 2023-05-04 |
Family
ID=79243888
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2021/133434 WO2023070803A1 (zh) | 2021-10-29 | 2021-11-26 | 语音识别方法、装置、设备及存储介质 |
Country Status (4)
Country | Link |
---|---|
EP (1) | EP4425484A1 (zh) |
JP (1) | JP2024537481A (zh) |
CN (1) | CN113920999A (zh) |
WO (1) | WO2023070803A1 (zh) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117496972A (zh) * | 2023-12-29 | 2024-02-02 | 广州小鹏汽车科技有限公司 | 一种音频识别方法、音频识别装置、车辆和计算机设备 |
CN117558270A (zh) * | 2024-01-11 | 2024-02-13 | 腾讯科技(深圳)有限公司 | 语音识别方法、装置、关键词检测模型的训练方法和装置 |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115472165A (zh) * | 2022-07-07 | 2022-12-13 | 脸萌有限公司 | 用于语音识别的方法、装置、设备和存储介质 |
WO2024188235A1 (zh) * | 2023-03-13 | 2024-09-19 | 北京罗克维尔斯科技有限公司 | 语音识别方法、装置、电子设备、存储介质及车辆 |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150340034A1 (en) * | 2014-05-22 | 2015-11-26 | Google Inc. | Recognizing speech using neural networks |
CN105845133A (zh) * | 2016-03-30 | 2016-08-10 | 乐视控股(北京)有限公司 | 语音信号处理方法及装置 |
CN107808662A (zh) * | 2016-09-07 | 2018-03-16 | 阿里巴巴集团控股有限公司 | 更新语音识别用的语法规则库的方法及装置 |
CN113515945A (zh) * | 2021-04-26 | 2021-10-19 | 科大讯飞股份有限公司 | 一种获取文本信息的方法、装置、设备及存储介质 |
-
2021
- 2021-10-29 CN CN202111274880.8A patent/CN113920999A/zh active Pending
- 2021-11-26 EP EP21962147.1A patent/EP4425484A1/en active Pending
- 2021-11-26 WO PCT/CN2021/133434 patent/WO2023070803A1/zh active Application Filing
- 2021-11-26 JP JP2024525244A patent/JP2024537481A/ja active Pending
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150340034A1 (en) * | 2014-05-22 | 2015-11-26 | Google Inc. | Recognizing speech using neural networks |
CN105845133A (zh) * | 2016-03-30 | 2016-08-10 | 乐视控股(北京)有限公司 | 语音信号处理方法及装置 |
CN107808662A (zh) * | 2016-09-07 | 2018-03-16 | 阿里巴巴集团控股有限公司 | 更新语音识别用的语法规则库的方法及装置 |
CN113515945A (zh) * | 2021-04-26 | 2021-10-19 | 科大讯飞股份有限公司 | 一种获取文本信息的方法、装置、设备及存储介质 |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117496972A (zh) * | 2023-12-29 | 2024-02-02 | 广州小鹏汽车科技有限公司 | 一种音频识别方法、音频识别装置、车辆和计算机设备 |
CN117496972B (zh) * | 2023-12-29 | 2024-04-16 | 广州小鹏汽车科技有限公司 | 一种音频识别方法、音频识别装置、车辆和计算机设备 |
CN117558270A (zh) * | 2024-01-11 | 2024-02-13 | 腾讯科技(深圳)有限公司 | 语音识别方法、装置、关键词检测模型的训练方法和装置 |
CN117558270B (zh) * | 2024-01-11 | 2024-04-02 | 腾讯科技(深圳)有限公司 | 语音识别方法、装置、关键词检测模型的训练方法和装置 |
Also Published As
Publication number | Publication date |
---|---|
EP4425484A1 (en) | 2024-09-04 |
CN113920999A (zh) | 2022-01-11 |
JP2024537481A (ja) | 2024-10-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2023070803A1 (zh) | 语音识别方法、装置、设备及存储介质 | |
KR102648306B1 (ko) | 음성 인식 오류 정정 방법, 관련 디바이스들, 및 판독 가능 저장 매체 | |
CN108899013B (zh) | 语音搜索方法、装置和语音识别系统 | |
TWI508057B (zh) | 語音辨識系統以及方法 | |
WO2020001458A1 (zh) | 语音识别方法、装置及系统 | |
US11093110B1 (en) | Messaging feedback mechanism | |
US10152298B1 (en) | Confidence estimation based on frequency | |
US10366690B1 (en) | Speech recognition entity resolution | |
WO2014101826A1 (zh) | 一种提高语音识别准确率的方法及系统 | |
JP2005084681A (ja) | 意味的言語モデル化および信頼性測定のための方法およびシステム | |
CN107578771A (zh) | 语音识别方法及装置、存储介质、电子设备 | |
WO2014117645A1 (zh) | 信息的识别方法和装置 | |
US11532301B1 (en) | Natural language processing | |
US10714087B2 (en) | Speech control for complex commands | |
CN116226338A (zh) | 基于检索和生成融合的多轮对话系统及方法 | |
US20220161131A1 (en) | Systems and devices for controlling network applications | |
US11626107B1 (en) | Natural language processing | |
CN108538292A (zh) | 一种语音识别方法、装置、设备及可读存储介质 | |
WO2023050541A1 (zh) | 音素提取方法、语音识别方法、装置、设备及存储介质 | |
CN113724698B (zh) | 语音识别模型的训练方法、装置、设备及存储介质 | |
CN103474063B (zh) | 语音辨识系统以及方法 | |
CN115831117A (zh) | 实体识别方法、装置、计算机设备和存储介质 | |
KR20130073643A (ko) | 개인화된 발음열을 이용한 그룹 매핑 데이터 생성 서버, 음성 인식 서버 및 방법 | |
CN113539241A (zh) | 语音识别校正方法及其相应的装置、设备、介质 | |
CN116052657B (zh) | 语音识别的字符纠错方法和装置 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 21962147 Country of ref document: EP Kind code of ref document: A1 |
|
ENP | Entry into the national phase |
Ref document number: 2024525244 Country of ref document: JP Kind code of ref document: A |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2021962147 Country of ref document: EP |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
ENP | Entry into the national phase |
Ref document number: 2021962147 Country of ref document: EP Effective date: 20240529 |