WO2023070803A1 - Speech recognition method, apparatus, device and storage medium - Google Patents

Speech recognition method, apparatus, device and storage medium

Info

Publication number
WO2023070803A1
Authority
WO
WIPO (PCT)
Prior art keywords
speech recognition
recognition result
vertical
speech
keyword
Prior art date
Application number
PCT/CN2021/133434
Other languages
English (en)
French (fr)
Inventor
李永超
朱晓斐
王众
方昕
Original Assignee
科大讯飞股份有限公司
Priority date
Filing date
Publication date
Application filed by 科大讯飞股份有限公司
Priority to JP2024525244A (JP2024537481A)
Priority to EP21962147.1A (EP4425484A1)
Publication of WO2023070803A1

Classifications

    • G PHYSICS
      • G10 MUSICAL INSTRUMENTS; ACOUSTICS
        • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
          • G10L 15/00 Speech recognition
            • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
              • G10L 2015/025 Phonemes, fenemes or fenones being the recognition units
            • G10L 15/08 Speech classification or search
              • G10L 15/083 Recognition networks
              • G10L 15/18 using natural language modelling
                • G10L 15/183 using context dependencies, e.g. language models
                  • G10L 15/19 Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules
                    • G10L 15/193 Formal grammars, e.g. finite state automata, context free grammars or word networks
            • G10L 15/26 Speech to text systems
          • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
            • G10L 25/03 characterised by the type of extracted parameters
              • G10L 25/24 the extracted parameters being the cepstrum

Definitions

  • the present application relates to the technical field of speech recognition, in particular to a speech recognition method, device, equipment and storage medium.
  • At present, the most effective speech recognition solution is to train a speech recognition model on massive data using neural network technology; in general scenarios, such a model can achieve a very good recognition effect.
  • The embodiments of the present application propose a speech recognition method, apparatus, device and storage medium, which can accurately recognize the speech to be recognized, especially speech in specific scenarios involving vertical keywords, and in particular can accurately recognize the vertical keywords in the speech.
  • A speech recognition method, comprising:
  • constructing a speech recognition decoding network based on a vertical keyword set and a sentence pattern decoding network in the scene to which the speech to be recognized belongs, wherein the sentence pattern decoding network is constructed at least by performing sentence pattern induction processing on text corpus of the scene to which the speech to be recognized belongs;
  • A speech recognition method, comprising:
  • the speech recognition decoding network is obtained based on a set of vertical keywords and a sentence pattern decoding network in the scene to which the speech to be recognized belongs;
  • a final speech recognition result is determined at least from the excited first speech recognition result and the second speech recognition result.
  • A speech recognition device, comprising:
  • the acoustic recognition unit is used to obtain the acoustic state sequence of the speech to be recognized
  • a network construction unit configured to construct a speech recognition decoding network based on a vertical keyword set and a sentence pattern decoding network in the scene to which the speech to be recognized belongs, wherein the sentence pattern decoding network is constructed at least by performing sentence pattern induction processing on the text corpus of the scene;
  • the decoding processing unit is configured to use the speech recognition decoding network to decode the acoustic state sequence to obtain a speech recognition result.
  • a speech recognition device comprising:
  • the acoustic recognition unit is used to obtain the acoustic state sequence of the speech to be recognized
  • a multi-dimensional decoding unit configured to use a speech recognition decoding network to decode the acoustic state sequence to obtain a first speech recognition result, and to use a general speech recognition model to decode the acoustic state sequence to obtain a second speech recognition result;
  • the speech recognition decoding network is constructed based on a set of vertical keywords and a sentence pattern decoding network under the scene to which the speech to be recognized belongs;
  • an acoustic excitation unit configured to perform acoustic score excitation on the first speech recognition result
  • the decision processing unit is configured to determine a final speech recognition result at least from the excited first speech recognition result and the second speech recognition result.
  • a speech recognition device comprising:
  • the memory is connected to the processor for storing programs
  • the processor is configured to implement the above speech recognition method by running the program stored in the memory.
  • a storage medium on which a computer program is stored, and when the computer program is run by a processor, the above speech recognition method is realized.
  • The speech recognition method proposed in this application can build a speech recognition decoding network based on the set of vertical keywords in the scene to which the speech to be recognized belongs and the pre-built sentence pattern decoding network of that scene. The speech recognition decoding network therefore contains the various speech sentence patterns of the scene to which the speech to be recognized belongs as well as the various vertical keywords of that scene, so it can decode speech composed of any sentence pattern and any vertical keyword in the scene. Therefore, by constructing the above speech recognition decoding network, the speech to be recognized can be accurately recognized; in particular, speech in specific scenarios involving vertical keywords, and especially the vertical keywords in the speech, can be accurately recognized.
  • Fig. 1 is a schematic flow chart of a speech recognition method provided by an embodiment of the present application
  • FIG. 2 is a schematic diagram of a word-level sentence pattern decoding network provided by an embodiment of the present application
  • Fig. 3 is a schematic flow chart of another speech recognition method provided by the embodiment of the present application.
  • Fig. 4 is a schematic flow chart of another speech recognition method provided by the embodiment of the present application.
  • Fig. 5 is a schematic flow chart of another speech recognition method provided by the embodiment of the present application.
  • FIG. 6 is a schematic flow chart of another speech recognition method provided by the embodiment of the present application.
  • Fig. 7 is a schematic diagram of a text sentence network provided by the embodiment of the present application.
  • FIG. 8 is a schematic diagram of a pronunciation-level sentence pattern decoding network provided by an embodiment of the present application.
  • FIG. 9 is a schematic diagram of a word-level personal name network provided by an embodiment of the present application.
  • FIG. 10 is a schematic diagram of the pronunciation-level personal name network corresponding to FIG. 9 provided by the embodiment of the present application.
  • Fig. 11 is the process flow diagram of utilizing the second speech recognition result to correct the first speech recognition result provided by the embodiment of the present application;
  • Fig. 12 is a flow chart of processing for determining the final speech recognition result from the first speech recognition result and the second speech recognition result provided by the embodiment of the present application;
  • Fig. 13 is a schematic diagram of the state network of the speech recognition result provided by the embodiment of the present application.
  • Fig. 14 is a schematic diagram of the state network after path extension is performed on the speech recognition result shown in Fig. 13;
  • Fig. 15 is a schematic structural diagram of a speech recognition device provided by an embodiment of the present application.
  • FIG. 16 is a schematic structural diagram of another speech recognition device provided by an embodiment of the present application.
  • Fig. 17 is a schematic structural diagram of a speech recognition device provided by an embodiment of the present application.
  • the technical solutions of the embodiments of the present application are applicable to speech recognition application scenarios.
  • By applying the technical solutions of the embodiments of the present application, speech content can be recognized more accurately, especially in specific business scenarios involving vertical keywords, and in particular the vertical keywords in the speech can be accurately recognized, improving the overall speech recognition effect.
  • The above vertical keywords generally refer to different keywords belonging to the same type; for example, person names, place names and application names each constitute a type of vertical keyword. The contact names in the user's address book form the person-name keywords, the different place names in the region where the user is located form the place-name keywords, and the names of the applications installed on the user terminal form the application-name keywords.
  • The above business scenarios involving vertical keywords refer to business scenarios whose interactive speech contains vertical keywords, such as voice dialing and voice navigation. In these scenarios the user must say the name of the contact to be called or the place name to navigate to; for example, the user may say "call XX" or "navigate to YY", where "XX" may be any name in the user's mobile phone address book and "YY" may be any place name in the user's region. It can be seen that the speech in these business scenarios contains vertical keywords (such as person names and place names), so these business scenarios are business scenarios involving vertical keywords.
  • Compared with ordinary text keywords, vertical keywords change frequently, are unpredictable, and can be user-defined, and their proportion in the massive speech recognition training corpus is extremely low. As a result, conventional speech recognition solutions that train a speech recognition model on such corpus are often unable to handle speech recognition services involving vertical keywords.
  • the occurrence rate of personal names is very low, so even in the massive training corpus, personal names are very rare, which makes the model unable to fully learn the characteristics of personal names through massive corpus.
  • the names of people belong to the user-defined text content, which is inexhaustible and unpredictable. It is unrealistic to completely generate all the names of people artificially.
  • Moreover, the contact names a user stores in the address book may not be standardized names but nicknames, code names, and the like, and the user may modify, add or delete contacts in the address book at any time. As a result, the names in different users' address books are highly diverse, and it is impossible for a speech recognition model to learn all the characteristics of names in a unified way.
  • Therefore, the conventional technical solution of training a speech recognition model on massive corpus and using it to realize speech recognition is not fully competent for speech recognition tasks in business scenarios involving vertical keywords; in particular, the vertical keywords in the speech are often not successfully recognized, seriously affecting the user experience.
  • the embodiment of the present application proposes a voice recognition method, which can improve the voice recognition effect, especially the voice recognition effect in business scenarios involving vertical keywords.
  • the embodiment of the present application proposes a speech recognition method, as shown in FIG. 1, the method includes:
  • The above speech to be recognized is specifically speech data in a business scenario involving vertical keywords; that is, the speech to be recognized contains the spoken content of vertical keywords.
  • the audio features can be Mel frequency cepstral coefficient MFCC features, or other audio features of any type.
  • After the audio features of the speech to be recognized are obtained, the audio features are input into the acoustic model for acoustic recognition, and the acoustic state posterior score of each audio frame is obtained, that is, the acoustic state sequence is obtained.
  • the acoustic model is mainly a neural network structure, which identifies the acoustic state corresponding to each frame of audio and its posterior score through forward calculation.
  • the above-mentioned acoustic state corresponding to the audio frame is specifically a pronunciation unit corresponding to the audio frame, such as a phoneme or a phoneme sequence corresponding to the audio frame.
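  • As a rough illustration of this front end only (not the patent's implementation), the sketch below extracts MFCC features frame by frame and runs a stand-in "acoustic model" that outputs a posterior distribution over a toy phoneme inventory for each frame; the phoneme set, the random projection standing in for the neural network, and all shapes are assumptions.

```python
import numpy as np
import librosa  # assumed available for MFCC extraction

PHONES = ["sil", "ay", "w", "aa", "n", "t", "g", "ih", "v"]  # toy phoneme inventory (assumption)

def acoustic_state_sequence(wav_path: str) -> np.ndarray:
    """Return per-frame posterior scores over phoneme states (the 'acoustic state sequence')."""
    signal, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13).T      # (frames, 13) MFCC features
    # Stand-in for the neural acoustic model's forward pass: random projection + softmax.
    rng = np.random.default_rng(0)
    logits = mfcc @ rng.normal(size=(13, len(PHONES)))             # (frames, phones)
    post = np.exp(logits - logits.max(axis=1, keepdims=True))
    return post / post.sum(axis=1, keepdims=True)                  # post[t, p]: score of PHONES[p] at frame t
```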
  • The conventional speech recognition technical scheme uses an acoustic model plus language model architecture: first, the speech to be recognized is acoustically recognized by the acoustic model to realize the mapping from speech features to a phoneme sequence; then, the phoneme sequence is recognized by the language model to realize the mapping from phonemes to text.
  • the acoustic state sequence of the speech to be recognized obtained by the above acoustic recognition will be input into the language model for decoding, so as to determine the text content corresponding to the speech to be recognized.
  • the language model is a model that is trained based on a large amount of training corpus and can realize the mapping of phonemes to text.
  • However, the embodiment of the present application does not use the above language model trained on massive corpus to decode the acoustic state sequence, but instead uses a decoding network constructed in real time for decoding, as detailed below.
  • S102 Construct a speech recognition decoding network based on the vertical keyword set and the sentence pattern decoding network in the scene to which the speech to be recognized belongs.
  • the embodiment of the present application constructs a speech recognition and decoding network in real time when recognizing speech in vertical keyword business scenarios to decode the acoustic state sequence of the speech to be recognized, and obtains Speech recognition results.
  • the above-mentioned speech recognition decoding network is constructed from a set of vertical keywords in the scene where the speech to be recognized belongs, and a pre-built sentence pattern decoding network in the scene where the speech to be recognized belongs.
  • the sentence pattern decoding network in the scene where the speech to be recognized belongs to is constructed by at least sentence pattern induction processing on the text corpus in the scene where the speech to be recognized belongs;
  • the scene to which the speech to be recognized belongs specifically refers to the business scene to which the speech to be recognized belongs. For example, if the speech to be recognized is "I want to give XX a call", then the speech to be recognized belongs to the voice of the call service, so the scene of the speech to be recognized is the call scene; for another example, suppose the speech to be recognized is "Navigate to XX", the speech to be recognized belongs to the navigation service, so the scene to which the speech to be recognized belongs is the navigation scene.
  • The inventors of the present application have found through research that, in business scenarios involving vertical keywords, a considerable part of the user's speech follows fixed sentence patterns. For example, in the voice dialing scenario a common sentence pattern is "I want to give XX a call" or "send a message to XX for me", and in the voice navigation scenario a common sentence pattern is "go to XX (place name)" or "navigate to XX (place name)".
  • That is, the sentence patterns of the user's speech are regular and can be exhaustively enumerated.
  • Based on this, a decoding network covering these sentence patterns can be built, which is named the sentence pattern decoding network in the embodiment of this application.
  • The sentence pattern decoding network constructed in this way contains the sentence pattern information corresponding to the scene, that is, it can cover any sentence pattern appearing in the scene.
  • In the embodiment of the present application, the sentence pattern decoding network is constructed by performing sentence pattern induction and grammar slot definition processing on the text corpus of the scene to which the speech to be recognized belongs.
  • The text slots in the text sentences are divided into ordinary grammar slots and replacement grammar slots: the text slots where the non-vertical keywords in the text sentences are located are defined as ordinary grammar slots, and the text slots where the vertical keywords are located are defined as replacement grammar slots.
  • In this way, the sentence pattern decoding network shown in Figure 2 can be obtained.
  • the sentence pattern decoding network is composed of nodes and directed arcs connecting nodes, wherein the directed arcs correspond to ordinary grammar slots and replacement grammar slots, and the directed arcs have label information for recording the text content in the slots.
  • Each entry of an ordinary grammar slot is segmented into words, which are connected in series through nodes and directed arcs. The directed arc between two nodes is labeled with the word information; the left and right sides of the colon in the label represent the input and output information respectively, and here the input and output information are set to be the same. The multiple words obtained by segmenting a single entry are connected in series, and different entries of the same grammar slot are connected in parallel. The replacement grammar slot is represented by the placeholder "#placeholder#" and is not expanded. The nodes are numbered in order, where words with the same start-node ID share a start node and words with the same end-node ID share an end node.
  • Figure 2 illustrates a relatively simple word-level sentence pattern decoding network diagram of the address book.
  • the ordinary grammar slot before the replacement grammar slot contains three entries: "I want to give", "send a message to" and "give a call".
  • the ordinary grammar slot after the replacement grammar slot contains three entries: "for me”, "a call” and "a call with her number”.
  • the connection between node 10 and node 18 indicates that you can go directly from node 10 to the end node, and the "</s>" on the arc represents silence.
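  • A minimal, hypothetical encoding of such a word-level sentence pattern decoding network (loosely mirroring the Figure 2 example; node IDs and entries are illustrative, not the patent's exact layout): each arc carries an "input:output" word label, the words of one entry are chained in series, entries of the same ordinary grammar slot run in parallel, and the replacement grammar slot stays as a single "#placeholder#" arc.

```python
# Each arc: (start_node, end_node, "input:output" label, grammar slot). Node IDs are illustrative.
word_level_network = {
    "start": 0,
    "end": 11,
    "arcs": [
        # ordinary grammar slot <phone>: words of one entry in series, entries in parallel
        (0, 1, "I:I", "<phone>"), (1, 2, "want:want", "<phone>"),
        (2, 3, "to:to", "<phone>"), (3, 5, "give:give", "<phone>"),
        (0, 4, "send:send", "<phone>"), (4, 6, "a:a", "<phone>"),
        (6, 7, "message:message", "<phone>"), (7, 5, "to:to", "<phone>"),
        # replacement grammar slot "name": left unexpanded as a placeholder
        (5, 8, "#placeholder#:#placeholder#", "name"),
        # ordinary grammar slot <sth>
        (8, 9, "a:a", "<sth>"), (9, 11, "call:call", "<sth>"),
        (8, 10, "for:for", "<sth>"), (10, 11, "me:me", "<sth>"),
        # skip arc straight to the end node, labeled "</s>" as in the figure description
        (8, 11, "</s>:</s>", None),
    ],
}
```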
  • the aforementioned set of vertical keywords in the business scenario to which the voice to be recognized belongs refers to a set composed of all vertical keywords in the business scenario to which the voice to be recognized belongs.
  • For example, if the speech to be recognized is speech in a voice dialing scenario, the set of vertical keywords in the business scenario to which it belongs can specifically be the collection of person names in the user's address book; if the speech to be recognized is speech in a voice navigation scenario, the set of vertical keywords can specifically be the set of place names in the region where the user is located.
  • The speech recognition decoding network can be obtained by adding the vertical keywords of the vertical keyword set in the business scenario to which the speech to be recognized belongs into the replacement grammar slot of the sentence pattern decoding network. The resulting decoding network thus contains both all the speech sentence patterns and all the vertical keywords of that business scenario, so it can recognize the sentence patterns of the speech in that scenario as well as the vertical keywords in the speech, that is, it can recognize the speech of this business scenario.
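  • To illustrate this step (a simplified sketch, not the patent's construction routine), the function below replaces the "#placeholder#" arc of the replacement grammar slot with one parallel arc per vertical keyword, using the word_level_network dict from the previous sketch; in a real build each keyword would itself be expanded into words and pronunciation units as described later, and the contact names here are made up.

```python
def splice_keywords(network: dict, slot: str, keywords: list[str]) -> dict:
    """Replace the placeholder arc of `slot` with parallel arcs, one per vertical keyword."""
    placeholder = next(a for a in network["arcs"] if a[3] == slot)
    left, right = placeholder[0], placeholder[1]
    arcs = [a for a in network["arcs"] if a[3] != slot]
    for kw in keywords:                              # e.g. the names in the terminal's address book
        arcs.append((left, right, f"{kw}:{kw}", slot))
    return {**network, "arcs": arcs}

# Hypothetical usage: build this request's decoding network from two made-up contacts.
decoding_network = splice_keywords(word_level_network, "name", ["John", "Peter"])
```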
  • When constructing the speech recognition decoding network in the embodiment of the present application, it is specifically constructed on the server side: the set of vertical keywords in the business scenario to which the speech to be recognized belongs is transferred to the cloud server, so that the cloud server can build the speech recognition decoding network based on this set of vertical keywords and the pre-built sentence pattern decoding network. For example, in a phone call scenario, the mobile terminal transmits the local address book (that is, the set of person-name keywords) to the cloud server, and the cloud server combines the address book with the sentence pattern decoding network of the phone call scene to build a speech recognition decoding network. The speech recognition decoding network then contains the various sentence patterns for making a call as well as the names in the address book used for this call, and with this decoding network it is possible to recognize the user's speech for calling any member of the current address book.
  • If, instead, the speech recognition decoding network is constructed locally at the user terminal and is not constructed in real time but pre-built and repeatedly invoked, then, due to the relatively limited computing resources of terminal devices, network construction is slow and decoding speed is limited; moreover, a decoding network that is not built in real time cannot be updated in time when the set of vertical keywords is updated, which affects the speech recognition effect.
  • In the embodiment of the present application, the speech recognition decoding network is constructed on the cloud server, and during speech recognition the set of vertical keywords is imported in real time by performing step S102 before the speech recognition decoding network is constructed. This guarantees that the constructed speech recognition decoding network contains the latest set of vertical keywords, that is, the set of vertical keywords required for this recognition, so that the vertical keywords can be accurately recognized.
  • In addition, benefiting from the computing resources of the cloud server, the speech recognition decoding network will have stronger decoding performance.
  • In this way, the cloud server can build a suitable speech recognition decoding network for the current speech to be recognized and decode this speech to be recognized.
  • the speech recognition and decoding network constructed above includes sentence patterns in the business scenario to which the speech to be recognized belongs, and a set of vertical keywords in the business scenario to which the speech to be recognized belongs. Then, the speech content of the speech to be recognized can be recognized by using the speech recognition decoding network.
  • For example, in a phone call scenario, the cloud server can obtain the terminal's local address book (the address book serves as the set of vertical keywords) and the pre-built sentence pattern decoding network corresponding to the call business scenario, and construct a speech recognition decoding network containing the names of the local address book.
  • In the speech recognition decoding network there are multiple sentence-pattern paths, identical or different, composed of different vertical keywords. If an acoustic state sequence matches the pronunciation of one or several sentence-pattern paths in the speech recognition decoding network, it can be determined that the text content of the acoustic state sequence is the text content of those sentence-pattern paths. Therefore, the finally decoded speech recognition result may be the text of one or several paths in the speech recognition decoding network, that is, there may be one or more final speech recognition results.
  • For example, the speech recognition decoding network contains the sentence pattern "I want to give XX a call" and the name "John", along with other sentence patterns and other names. The acoustic state sequence of the user's speech is matched against each path in the speech recognition decoding network; when it is determined that the acoustic state sequence matches the pronunciation of the path "I want to give John a call", the speech recognition result "I want to give John a call" is obtained, that is, the user's speech is recognized.
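  • The toy sketch below shows one way such matching could be scored (a strong simplification, not the patent's decoder): every candidate sentence-pattern path is assumed to have been pre-expanded into a phoneme index sequence, and a small forced-alignment dynamic program scores how well the per-frame phoneme posteriors fit that sequence; the best-scoring path's text is taken as the recognition result.

```python
import numpy as np

def path_score(posteriors: np.ndarray, phones: list[int]) -> float:
    """Best log-score of aligning the frames to `phones` in order, each phone covering >= 1 frame."""
    logp = np.log(posteriors + 1e-10)
    T, N = posteriors.shape[0], len(phones)
    dp = np.full((T, N), -np.inf)
    dp[0, 0] = logp[0, phones[0]]
    for t in range(1, T):
        for i in range(N):
            best_prev = max(dp[t - 1, i], dp[t - 1, i - 1] if i > 0 else -np.inf)
            dp[t, i] = best_prev + logp[t, phones[i]]
    return dp[T - 1, N - 1]

def best_path(posteriors: np.ndarray, candidate_paths: dict[str, list[int]]):
    """candidate_paths maps a path's text to its phoneme index sequence (assumed precomputed)."""
    return max(candidate_paths.items(), key=lambda kv: path_score(posteriors, kv[1]))
```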
  • the speech recognition method proposed in the embodiment of the present application can build a speech recognition decoding network based on the vertical keyword set in the business scenario to which the speech to be recognized belongs and the pre-built sentence pattern decoding network in the business scenario. Then, in the speech recognition and decoding network, it includes various speech sentence patterns under the business scene to which the speech to be recognized belongs, and also includes various vertical keywords under the business scene to which the speech to be recognized belongs.
  • Therefore, the speech recognition decoding network can decode and recognize speech composed of any sentence pattern and any vertical keyword in the business scenario to which the speech to be recognized belongs. By constructing the above speech recognition decoding network, the speech to be recognized can be accurately recognized; in particular, speech in specific scenarios involving vertical keywords, and especially the vertical keywords in the speech, can be accurately recognized.
  • the general speech recognition model is also used to decode the acoustic state sequence of the speech to be recognized.
  • The result obtained by decoding the acoustic state sequence of the speech to be recognized with the above speech recognition decoding network is named the first speech recognition result, and the result obtained by decoding the acoustic state sequence of the speech to be recognized with the above general speech recognition model is named the second speech recognition result.
  • step S301 is executed to obtain the acoustic state sequence of the speech to be recognized
  • step S302 and step S303 are respectively executed to construct a speech recognition decoding network, and use the speech recognition decoding network to decode the acoustic state sequence , to obtain a first speech recognition result; and, perform step S304, using a general speech recognition model to decode the acoustic state sequence, to obtain a second speech recognition result.
  • There may be one or more first speech recognition results and one or more second speech recognition results as mentioned above.
  • For example, at most 5 of the speech recognition results output by each model are retained to participate in determining the final speech recognition result.
  • The above general speech recognition model is a conventional speech recognition model obtained through massive corpus training; it recognizes the text content corresponding to the speech by learning the characteristics of speech, rather than being restricted to standardized sentence patterns like the above speech recognition decoding network. Therefore, the sentence patterns that the general speech recognition model can recognize are more flexible. Using the general speech recognition model to decode the acoustic state sequence of the speech to be recognized makes it possible to recognize the content of the speech more flexibly, without being limited by its sentence pattern.
  • When the speech to be recognized does not follow any sentence pattern in the above speech recognition decoding network, it cannot be correctly decoded by that network, or the first speech recognition result obtained is inaccurate; however, thanks to the general speech recognition model, the speech to be recognized can still be recognized and decoded to obtain a second speech recognition result.
  • step S305 is executed to determine a final speech recognition result at least from the first speech recognition result and the second speech recognition result.
  • After the first speech recognition result and the second speech recognition result are obtained, one or more of them are selected as the final speech recognition result according to their acoustic scores, through an acoustic score PK (i.e., a head-to-head comparison of acoustic scores).
  • the above-mentioned acoustic scores of the first speech recognition result and the second speech recognition result refer to the scores of the entire decoding result determined according to the decoding scores of each acoustic state sequence element when decoding the acoustic state sequence of the speech to be recognized .
  • the sum of the decoding scores of each acoustic state sequence element can be used as the score of the entire decoding result.
  • the decoding score of an acoustic state sequence element refers to the probability score of an acoustic state sequence element (such as a phoneme or phoneme unit) being decoded into a certain text, so the score of the entire decoding result is the entire acoustic state sequence being decoded into a certain text probability score.
  • The acoustic score of a speech recognition result can be used to characterize the accuracy of that result. Since the acoustic scores of the first speech recognition result and the second speech recognition result reflect the accuracy of each recognition result, the acoustic score PK, that is, comparing the acoustic scores, selects from these recognition results the one or more speech recognition results with the highest scores as the final speech recognition result.
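  • A minimal sketch of this acoustic score PK (scores are assumed to be "higher is better" totals over the decoded elements; the numbers in the example are invented):

```python
def acoustic_score_pk(results: list[tuple[str, float]], top_n: int = 1) -> list[tuple[str, float]]:
    """Keep the top_n hypotheses by total acoustic score; results are (text, acoustic_score) pairs."""
    return sorted(results, key=lambda r: r[1], reverse=True)[:top_n]

final = acoustic_score_pk([("I want to give John a call", -41.2),
                           ("I want to give Joan a call", -42.8)])
```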
  • Steps S301 to S303 in the method embodiment shown in FIG. 3 respectively correspond to steps S101 to S103 in the method embodiment shown in FIG. 1 .
  • FIG. 4 shows a schematic flowchart of another speech recognition method proposed by the embodiment of the present application.
  • the speech recognition method proposed in the embodiment of the present application uses the constructed speech recognition decoding network and the general speech recognition model to decode the acoustic state sequence of the speech to be recognized.
  • step S405 is also performed to decode the acoustic state sequence through the pre-trained scene customization model to obtain the third speech recognition result.
  • After the first speech recognition result, the second speech recognition result and the third speech recognition result are respectively obtained, step S406 is executed to determine the final speech recognition result from the first, second and third speech recognition results.
  • the aforementioned scenario customization model refers to a speech recognition model obtained by conducting speech recognition training on the speech in the scene to which the speech to be recognized belongs.
  • the scene customization model has the same model architecture as the above-mentioned general speech recognition model.
  • The difference from the general speech recognition model is that the scene customization model is not trained on a large amount of general corpus, but on corpus of the scene to which the speech to be recognized belongs. Therefore, compared with the general speech recognition model, the scene customization model is more sensitive to speech in that business scene and has a higher recognition rate for it.
  • the scene customization model can more accurately recognize speech in a specific business scene, without being limited to predetermined sentence patterns like the speech recognition decoding network mentioned above.
  • In this embodiment a scene customization model is added, so that three models each decode the acoustic state sequence of the speech to be recognized, which allows the speech to be recognized more comprehensively and thoroughly in multiple ways.
  • For the speech recognition results output by the three models, referring to the introduction of step S305 above, the acoustic scores of the first, second and third speech recognition results are compared, and the one or more results with the highest acoustic scores are selected as the final speech recognition result.
  • Steps S401-S404 in the method embodiment shown in FIG. 4 respectively correspond to steps S301-S304 in the method embodiment shown in FIG. 3.
  • the main idea of the above-mentioned speech recognition method based on multi-model decoding is to perform decoding through multiple models, and then select the final recognition result from multiple recognition results through the acoustic score PK.
  • In some cases, the sentence patterns of the speech recognition results output by the different models are basically the same, and they differ only at the positions of the vertical keywords. Since the first speech recognition result contains more accurate vertical keyword information, if the first speech recognition result loses the acoustic score PK, the vertical keyword may be recognized inaccurately. Therefore, when the scores of the recognition results output by the models are similar, the recognition result containing the accurate vertical keyword should win; however, based on the acoustic scores alone, the first speech recognition result cannot be guaranteed to win. To address this, the acoustic score of the slot where the vertical keyword in the first speech recognition result is located can first be boosted by a certain proportion, that is, acoustic score excitation is applied to the first speech recognition result, so that when the sentence patterns output by the different models are the same, the first speech recognition result can win.
  • the embodiment of the present application proposes another speech recognition method, as shown in FIG. 5 , the method includes:
  • S502 Use a speech recognition decoding network to decode the acoustic state sequence to obtain a first speech recognition result. And, S503, using a general speech recognition model to decode the acoustic state sequence to obtain a second speech recognition result; the speech recognition decoding network is based on the vertical keyword set and sentence pattern decoding in the scene to which the speech to be recognized belongs The network is constructed.
  • The above speech recognition decoding network is constructed from the set of vertical keywords in the scene to which the speech to be recognized belongs and the sentence pattern decoding network obtained in advance by performing sentence pattern induction and grammar slot definition on the text corpus of that scene.
  • For the specific content of the speech recognition decoding network, refer to the introduction in the above embodiments; for the construction process of the network, refer to the detailed introduction in the following embodiments.
  • In the speech recognition method proposed in this embodiment, before the acoustic score PK between the first speech recognition result and the second speech recognition result is performed to determine the final speech recognition result, acoustic score excitation is first applied to the first speech recognition result.
  • The acoustic score excitation of the first speech recognition result specifically means exciting the acoustic score of the slot where the vertical keyword in the first speech recognition result is located, that is, scaling the acoustic score of that slot according to an excitation coefficient.
  • The specific value of the excitation coefficient is determined by the business scenario and the actual situation of the speech recognition results.
  • For the specific content of the acoustic score excitation, please refer to the introduction of the acoustic score excitation in the embodiments below.
  • Through this excitation, the acoustic score of the first speech recognition result can be made higher than that of the second speech recognition result, so that the first speech recognition result wins the acoustic score PK. Thus, when the scores of the first and second speech recognition results are close, it is guaranteed that the vertical keyword in the final speech recognition result is the more accurately recognized one.
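  • A minimal sketch of this acoustic score excitation (the per-slot score breakdown, the "higher is better" convention and the excitation coefficient value are assumptions):

```python
def excite_acoustic_score(slot_scores: dict[str, float], keyword_slots: set[str],
                          coeff: float = 1.2) -> float:
    """Scale the acoustic score of each slot holding a vertical keyword by an excitation
    coefficient, then return the total excited acoustic score of the first recognition result."""
    return sum(score * coeff if slot in keyword_slots else score
               for slot, score in slot_scores.items())

# "name" is the replacement grammar slot that holds the recognized vertical keyword.
excited_score = excite_acoustic_score({"<phone>": 12.4, "name": 6.1, "<sth>": 7.8}, {"name"})
```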
  • the speech recognition method proposed in the embodiment of the present application can build a speech recognition decoding network based on the vertical keyword set in the business scenario to which the speech to be recognized belongs and the pre-built sentence pattern decoding network in the business scenario.
  • the speech recognition and decoding network can decode speech composed of any sentence pattern and any vertical keyword in the business scenario to which the speech to be recognized belongs. Therefore, based on the above-mentioned speech recognition decoding network, the speech to be recognized can be accurately recognized, especially the speech in a specific scene involving vertical keywords can be accurately recognized, especially the vertical keywords in the speech can be accurately recognized.
  • the speech recognition method proposed in the embodiment of the present application not only uses the speech recognition decoding network to perform decoding and recognition, but also uses a general speech recognition model to perform decoding and recognition.
  • the general speech recognition model has higher sentence flexibility than the above-mentioned speech recognition decoding network. Using multiple models to decode the acoustic state sequence of the speech to be recognized separately can make speech recognition of the speech to be recognized more comprehensive and in-depth in a variety of ways.
  • the embodiment of the present application performs acoustic score excitation on the speech recognition result output by the speech recognition and decoding network.
  • The above speech recognition decoding network has higher recognition accuracy for vertical keywords. Therefore, based on the above acoustic score excitation, when the speech recognition result output by the speech recognition decoding network and the result output by the general speech recognition model have the same sentence pattern, the result output by the speech recognition decoding network can win, thereby ensuring that the vertical keyword in the final speech recognition result is correctly recognized.
  • the embodiment of the present application also proposes another speech recognition method. Compared with the speech recognition method shown in FIG. 5, this method adds a scene customization model for decoding the acoustic state sequence of the speech to be recognized. .
  • step S604 is further performed to decode the acoustic state sequence through the pre-trained scene customization model to obtain the third speech recognition result.
  • the above-mentioned scene customization model is obtained by performing speech recognition training on the speech in the scene to which the speech to be recognized belongs.
  • the functions of the above-mentioned scene customization model and the beneficial effects brought by the addition of the scene customization model can refer to the content of the above-mentioned embodiment corresponding to the speech recognition method shown in FIG. 4 , which will not be repeated here.
  • step S606 the final speech recognition result is determined from the excited first speech recognition result, the second speech recognition result and the third speech recognition result.
  • By performing the acoustic score PK on the first speech recognition result, the second speech recognition result and the third speech recognition result, one or more speech recognition results with the highest acoustic scores can be selected as the final speech recognition result.
  • For the specific processing process refer to the introduction of the corresponding content in the above-mentioned embodiments.
  • For steps S601-S603 and S605 in FIG. 6, please refer to the specific processing of the corresponding steps in the above embodiments, which will not be repeated here.
  • The speech recognition methods described above rely entirely on the acoustic scores of the speech recognition results when the final PK and decision are performed among the multiple speech recognition results, which ignores the influence of the language model on the recognition effect, especially for the above speech recognition decoding network and the scene customization model.
  • This simple and direct PK strategy will greatly affect the recognition effect, and in severe cases, it will cause false triggering problems and affect user experience.
  • Therefore, the embodiment of the present application proposes that, on the basis of the acoustic score PK, language model excitation is performed on the speech recognition results so that language model information is incorporated into them, and the final speech recognition result is then selected through a language score PK.
  • the first speech recognition result, the second speech recognition result and the third speech recognition result are respectively obtained through the speech recognition decoding network, the general speech recognition model, and the scene customization model.
  • After the acoustic score excitation is performed on the first speech recognition result, determining the final speech recognition result from the excited first speech recognition result, the second speech recognition result and the third speech recognition result may be performed as follows:
  • First, a candidate speech recognition result is determined from the first speech recognition result and the second speech recognition result.
  • the processing in this step is the same as the acoustic score PK introduced above.
  • That is, by performing the acoustic score PK between the excited first speech recognition result and the second speech recognition result, the one or more speech recognition results with the highest acoustic scores are selected as the candidate speech recognition results.
  • language model excitation is performed on the candidate speech recognition result and the third speech recognition result respectively.
  • The above language model excitation refers to matching the speech recognition result against the vertical keywords in the scene to which the speech to be recognized belongs; if the matching succeeds, path extension is performed on the speech recognition result, the language model then re-scores the extended speech recognition result, and the language model excitation of the speech recognition result is completed.
  • the specific processing process of the language model excitation will be specially introduced in the following embodiments.
  • Finally, referring to the acoustic score PK, a language score PK is performed between the candidate speech recognition results after language model excitation and the third speech recognition result, and the one or more speech recognition results with the highest language scores are selected as the final speech recognition result.
  • For the specific processing of the language score PK, refer to the processing of the acoustic score PK introduced in the above embodiments, which will not be described in detail here.
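  • The sketch below is a much-simplified stand-in for this decision step: the path-extension re-scoring is replaced by a flat bonus for hypotheses that contain a vertical keyword, and a trivial function stands in for the language model; the bonus value and the stand-in LM are assumptions.

```python
def language_score(text: str, lm_logprob, keywords: set[str], bonus: float = 2.0) -> float:
    """Language score of a hypothesis; matching a vertical keyword earns an excitation bonus
    (a stand-in for the path-extension and re-scoring described in the text)."""
    score = lm_logprob(text)
    if any(kw in text for kw in keywords):
        score += bonus
    return score

def language_score_pk(hypotheses: list[str], lm_logprob, keywords: set[str], top_n: int = 1):
    return sorted(hypotheses, reverse=True,
                  key=lambda h: language_score(h, lm_logprob, keywords))[:top_n]

# Toy usage with a stand-in LM that simply prefers shorter hypotheses.
winner = language_score_pk(["I want to give John a call", "I want to give Joan her call"],
                           lm_logprob=lambda t: -len(t.split()), keywords={"John", "Peter"})
```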
  • As another optional decision method for the speech recognition result, referring to the introduction of the above embodiments, after the first, second and third speech recognition results are obtained, language model excitation is performed on each of them; then, according to the language scores of the excited first, second and third speech recognition results, the final speech recognition result is determined from them through the language score PK.
  • the present application introduces the construction process of the sentence pattern decoding network used to construct the speech recognition decoding network in the above-mentioned embodiments of the speech recognition method.
  • the construction process of the sentence pattern decoding network described below is only an exemplary and preferred implementation plan.
  • the sentence pattern decoding network under the business scenario of the voice to be recognized described in the above-mentioned embodiments can be constructed by executing the following steps A1-A3:
  • A1. Construct a text sentence network by performing sentence pattern induction and grammatical slot definition processing on the corpus data in the scene to which the voice to be recognized belongs.
  • The corpus data of the business scenario to which the speech to be recognized belongs is speech annotation data collected from the actual business scenario, for example data annotated in the scenario of making calls or sending text messages. Alternatively, it can also be manually expanded based on experience to obtain corpus data that conforms to grammatical logic and the business scenario. For example, "I want to give John a call" and "send a message to Peter for me" are two corpus use cases. Because the subsequent sentence pattern induction and grammar slot definition are based directly on this corpus, the corpus collected at this stage can have a high degree of sentence-pattern coverage, but there is no requirement on the coverage of vertical keywords.
  • As noted above, the sentence patterns of the user's speech are usually regular and can be exhaustively enumerated. By performing sentence pattern induction and grammar slot definition on the corpus data, the sentence network corresponding to the business scenario can be obtained, which is named the text sentence network in the embodiment of this application.
  • The text slots in the text sentences are divided into ordinary grammar slots and replacement grammar slots, wherein the text slots corresponding to non-vertical keywords are defined as ordinary grammar slots and the text slots corresponding to vertical keywords are defined as replacement grammar slots.
  • In an ordinary grammar slot, the non-vertical-keyword content of the text sentence is stored; in a replacement grammar slot, a placeholder corresponding to the vertical keyword is stored.
  • the number of ordinary grammar slots can be one or more, and each vertical keyword text slot corresponds to a replacement grammar slot.
  • The elements of the text sentence network constructed in this way include network nodes and the directed arcs connecting them, and the text sentence network is defined based on ABNF (Augmented Backus-Naur Form). Specifically, as shown in Figure 7, each directed arc of the text sentence network carries label information, which is either the placeholder of the replacement grammar slot corresponding to that arc or the text of the ordinary grammar slot corresponding to that arc.
  • Figure 7 is a text sentence network defined according to the collected corpus data in the scenario of calling or texting.
  • The directed arcs with a "<xxxx>" label are called ordinary grammar slots; each contains at least one entry, and the collection of all entries must be completed at the text sentence network definition stage.
  • Figure 7 includes two ordinary grammar slots, namely <phone> and <sth>. <phone> corresponds to the text entries that appear before the address-book name in the scene corpus use cases; for example, "I want to give" and "send a message" are two entries of the grammar slot <phone>. <sth> corresponds to the text entries that appear after the address-book name; for example, "a call" and "for me" are two entries of the grammar slot <sth>.
  • A directed arc with the label "xxx" is called a replacement grammar slot, which means that at the sentence pattern definition stage no actual entry needs to be attached; the slot only needs to be associated with a "#placeholder#" placeholder.
  • the actual tokens are dynamically passed in when building the speech recognition decoding network.
  • the "name" in Figure 7 is a replacement grammar slot, and the subsequent dynamically created vertical keyword network will be inserted into the replacement grammar slot to form a complete speech recognition decoding network.
  • the last type of directed arc with a "-" label is called a virtual arc.
  • A virtual arc is a directed arc that carries no grammar slot or entry information, indicating that the corresponding path is optional; each virtual arc has a corresponding grammar slot.
  • All grammar slots in the text sentence network can be marked with an ID, defined as the slot_id field of the grammar slot and set as a globally unique identifier.
  • The sentence patterns defined by the sentence network, the grammar slots, and the entries of the ordinary grammar slots together constitute the text sentence network.
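  • One possible, hypothetical encoding of such a Figure 7-style text sentence network as plain data (the rule string, the slot_id values, the entries and the "optional" flag standing in for the virtual arc are all assumptions):

```python
# ABNF-flavoured sentence rule plus slot tables; "name" is the replacement grammar slot.
SENTENCE_RULE = "<phone> name <sth>"

GRAMMAR_SLOTS = {
    "<phone>": {"slot_id": 1, "type": "ordinary",
                "entries": ["I want to give", "send a message to"]},
    "name":    {"slot_id": 2, "type": "replacement",
                "entries": ["#placeholder#"]},                      # real names are passed in at build time
    "<sth>":   {"slot_id": 3, "type": "ordinary", "optional": True,  # optional => virtual "-" arc
                "entries": ["a call", "for me"]},
}
```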
  • the word-level sentence pattern decoding network includes several nodes and directed arcs between the nodes.
  • each entry in the common grammar slot is segmented to obtain each word corresponding to each entry.
  • Each word corresponding to the same entry is used to expand word nodes; that is, the word segmentation results of the same entry are connected through nodes and directed arcs to obtain the word string corresponding to the entry. The directed arc between two nodes is marked with the word information obtained by word segmentation; the left and right sides of the colon in the label represent the input and output information respectively, and here the input and output information are set to be the same.
  • the grammar slot <phone> contains three entries: "I want to give", "send a message to" and "give a call".
  • the grammatical slot <sth> contains three entries: "for me", "a call" and "a call with her number".
  • the connection between node 10 and node 18 indicates that you can go directly from node 10 to the end node, and the "</s>" on the arc represents silence.
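  • A sketch of this expansion step (assuming whitespace word segmentation, simple incremental node numbering and no node sharing or merging), reusing the GRAMMAR_SLOTS table from the earlier sketch:

```python
def build_word_level_network(slot_order: list[str], grammar_slots: dict) -> dict:
    """Words of one entry in series, entries of a slot in parallel, replacement slots kept
    as a single placeholder arc; node numbering is incremental and purely illustrative."""
    arcs, left, next_node = [], 0, 1
    for slot in slot_order:
        info = grammar_slots[slot]
        right = next_node                      # right boundary node of this slot
        next_node += 1
        for entry in info["entries"]:
            words = entry.split() if info["type"] == "ordinary" else [entry]
            prev = left
            for i, word in enumerate(words):
                dst = right if i == len(words) - 1 else next_node
                if dst != right:
                    next_node += 1
                arcs.append((prev, dst, f"{word}:{word}", info["slot_id"]))
                prev = dst
        if info.get("optional"):               # assumed meaning of the virtual "-" arc
            arcs.append((left, right, "-:-", info["slot_id"]))
        left = right
    return {"start": 0, "end": left, "arcs": arcs}

word_net = build_word_level_network(["<phone>", "name", "<sth>"], GRAMMAR_SLOTS)
```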
  • the decoding network is used as a sentence decoding network in the scene where the speech to be recognized belongs to.
  • each word in the common grammar slot of the word-level sentence decoding network is replaced with the corresponding pronunciation.
  • the corresponding relationship between existing words and pronunciations can be queried through the pronunciation dictionary, so as to determine the pronunciation corresponding to each word marked on the directed arc in the common grammar slot of the word-level sentence pattern decoding network. On this basis, use the pronunciation corresponding to the word to replace the word marked on the directed arc.
  • each pronunciation in the word-level sentence pattern decoding network is divided into pronunciation units, and each pronunciation unit corresponding to the pronunciation is used to expand the pronunciation node to obtain the pronunciation level sentence pattern decoding network.
  • For each pronunciation on the directed arcs of the word-level sentence pattern decoding network, its pronunciation units are determined and the pronunciation is divided into these units. This application exemplarily divides a pronunciation into a phoneme sequence; for example, the word "I" is pronounced as the single phoneme "ay", and the word "give" is pronounced as the phoneme string "g ih v".
  • the pronunciation nodes are expanded and connected in series according to the arrangement order and quantity of the pronunciation units.
  • each phoneme of the same pronunciation is connected in sequence through nodes and directed arcs to obtain a phoneme string corresponding to the pronunciation.
  • the phoneme string corresponding to the pronunciation is used to replace the pronunciation, and the word-level sentence decoding network is extended to the pronunciation-level sentence decoding network.
  • the replacement grammar slot is still not expanded.
  • the pronunciation-level sentence pattern decoding network is used as the sentence pattern decoding network in the business scenario to which the voice to be recognized belongs.
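  • A sketch of this pronunciation-level expansion (the small ARPAbet-style dictionary is illustrative; placeholder and virtual arcs, and any word missing from the dictionary, are simply left unexpanded):

```python
PRON_DICT = {"I": ["ay"], "want": ["w", "aa", "n", "t"], "to": ["t", "uw"],
             "give": ["g", "ih", "v"], "a": ["ah"], "call": ["k", "ao", "l"]}

def expand_to_pronunciation_level(network: dict) -> dict:
    """Replace each word arc with a chain of phoneme arcs looked up in the pronunciation dictionary."""
    arcs = []
    next_node = max(max(a[0], a[1]) for a in network["arcs"]) + 1
    for left, right, label, slot_id in network["arcs"]:
        word = label.split(":")[0]
        if word not in PRON_DICT:                       # e.g. "#placeholder#" or the "-" virtual arc
            arcs.append((left, right, label, slot_id))
            continue
        phones, prev = PRON_DICT[word], left
        for i, ph in enumerate(phones):
            dst = right if i == len(phones) - 1 else next_node
            if dst != right:
                next_node += 1
            arcs.append((prev, dst, f"{ph}:{ph}", slot_id))
            prev = dst
    return {**network, "arcs": arcs}
```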
  • Figure 8 shows a schematic diagram of a simple pronunciation-level sentence decoding network.
  • The nodes in the sentence pattern decoding network are numbered in sequence; pronunciation units with the same start-node ID share one start node, and pronunciation units with the same end-node ID share one end node.
  • A single node in the network includes a total of 3 attribute fields: the id number, the number of incoming arcs, and the number of outgoing arcs, which constitute the node storage triple. An incoming arc of a node is a directed arc pointing to the node, and an outgoing arc is a directed arc issued from the node.
  • a single directed arc in the network includes a total of 4 attribute fields including the left node number, right node number, pronunciation information on the arc, and the grammar slot identifier slot_id to which it belongs. At the same time, the total number of nodes and the total number of arcs in the network are recorded.
  • the left node position of the ordinary grammar slot <phone> in the network is 0 and its right node position is 10; the left node of the replacement grammar slot <name> is 10 and its right node is 11; the left node of the ordinary grammar slot <sth> is 11 and its right node is 16.
  • the qualified arcs in the obtained pronunciation-level sentence pattern decoding network can be merged and optimized, and redundant nodes can be deleted to reduce the complexity of the network.
  • the specific method is the same as the general decoding network optimization method and will not be detailed here.
  • the decoding network obtained after completing the above steps is the sentence decoding network, which can be loaded into the cloud speech recognition service as a global resource.
  • since the real address book information has not yet been recorded in the replacement grammar slot, the network does not yet have the actual decoding capability.
  • a speech recognition decoding network that is actually used to decode the acoustic state sequence of the speech to be recognized can be constructed.
  • the embodiment of the present application further gives an exemplary introduction to the construction process of the speech recognition decoding network.
  • the embodiment of the present application constructs a speech recognition decoding network by performing the following steps B1-B3:
  • the sentence pattern decoding network under the business scenario to which the voice to be recognized belongs can be constructed in advance according to the above-mentioned embodiments.
  • the sentence pattern decoding network can be directly called.
  • alternatively, the sentence pattern decoding network may be constructed in real time when step B1 is executed.
  • the set of vertical keywords in the business scenario to which the voice to be recognized belongs refers to a set composed of all vertical keywords in the business scenario to which the voice to be recognized belongs.
  • for example, in a phone call scenario, the set of vertical keywords in the business scenario to which the speech to be recognized belongs can specifically be a name set composed of the names in the user's address book; assuming that the speech to be recognized is speech in a voice navigation scenario, the set of vertical keywords can specifically be a place name set composed of the place names in the region where the user is located.
  • in the following, the address book is used as the set of vertical keywords, and the construction of a person name network in a call scenario is taken as an example to introduce the specific implementation process of building a vertical keyword network.
  • a word-level vertical keyword network is constructed based on each vertical keyword in the vertical keyword set under the business scenario to which the speech to be recognized belongs.
  • for example, to obtain a word-level person name network, word segmentation is performed on each name in the address book to obtain the words contained in the name, and the words are connected in series through nodes and directed arcs to obtain the word string corresponding to the name; the word strings corresponding to different names are connected in parallel, and a word-level person name network is thus constructed.
  • each word in the word-level vertical keyword network is replaced with the corresponding pronunciation, and the pronunciation node is expanded according to the pronunciation corresponding to the word to obtain the pronunciation-level vertical keyword network.
  • for example, to obtain the pronunciation-level person name network, for each word in the word-level person name network its pronunciation and the phonemes contained in the pronunciation are determined, the phonemes are connected through nodes and directed arcs to form a phoneme string, and the word is then replaced with the phoneme string corresponding to its pronunciation, yielding the pronunciation-level person name network.
  • a pronunciation-level vertical keyword network is obtained, which is the final vertical keyword network constructed.
  • the node whose number of incoming arcs is 0 is the start node of the network,
  • and the node whose number of outgoing arcs is 0 is the end node of the network.
  • Figure 10 shows the pronunciation-level personal name network obtained after pronunciation replacement of the word-level personal name network shown in Figure 9 .
  • node 0 is the start node of the network
  • node 8 is the end node of the network.
  • the finally constructed sentence pattern decoding network and vertical keyword network are composed of nodes and directed arcs connecting nodes.
  • the pronunciation information of the text in the slot is stored on the directed arcs of the ordinary grammar slots of the sentence pattern decoding network,
  • the placeholder is stored on the directed arc of the replacement grammar slot of the sentence pattern decoding network,
  • and the pronunciation information of the text in the slot is stored on the directed arcs of each grammar slot of the vertical keyword network.
  • the left and right nodes of the vertical keyword network and the replacement grammar slots of the sentence pattern decoding network are respectively connected through directed arcs, that is, the vertical keyword network is used to replace the replacement grammar slots in the sentence pattern decoding network.
  • specifically, the right node of each outgoing arc of the start node of the vertical keyword network is connected to the left node of the replacement grammar slot through directed arcs, and each connecting directed arc stores the pronunciation information on the corresponding outgoing arc;
  • the left node of each incoming arc of the end node of the vertical keyword network is connected to the right node of the replacement grammar slot through directed arcs, and each connecting directed arc stores the pronunciation information on the corresponding incoming arc, so that a speech recognition decoding network is constructed.
  • in the embodiment of the present application, the first arc and the last arc of each keyword in the vertical keyword network respectively store the unique identifier corresponding to the keyword, and the unique identifier can, for example, be set as the hash code of the keyword.
  • the directed arc between nodes (0, 1) and the directed arc between nodes (7, 8) respectively store the hash code corresponding to the person name "Jack Alen".
  • this embodiment of the present application also sets a keyword information set that has already entered the network.
  • in this set, the unique identifier of each keyword that has been inserted into the sentence pattern decoding network and the left and right node numbers, in the sentence pattern decoding network, of the directed arc where the unique identifier is located are stored correspondingly.
  • the above-mentioned network keyword information set can adopt a HashMap storage structure of key:value form, where the key is the hash code corresponding to the vertical keyword and the value is the set of node number pairs of the directed arcs where the hash code is located. The initial HashMap is empty; during the entire recognition service process, the HashMap uniquely saves the hash codes and node number pairs of all dynamically incoming vertical keyword entries, but does not record the user ID or the mapping relationship between user IDs and vertical keyword sets.
  • the setting of the above-mentioned keyword information set that has entered the network can facilitate the identification of vertical keyword information that has been inserted into the sentence pattern decoding network, that is, it can clarify the vertical keyword information that already exists in the speech recognition decoding network.
  • therefore, whether a vertical keyword has already been inserted can be determined by querying the network keyword information set; when it is determined that the vertical keyword to be inserted already exists in the speech recognition decoding network, the insertion of the current vertical keyword can be cancelled, and the insertion of the other vertical keywords continues.
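  • a minimal sketch of such a network keyword information set and the duplicate check, assuming an MD5 hash as the keyword's unique identifier (the real hashing scheme is not specified here):

```python
# Sketch of the network keyword information set described above:
# key = hash code of a vertical keyword, value = set of (left_node, right_node)
# pairs of the arcs that carry that hash code. The hashing scheme is an assumption.
import hashlib

def keyword_hash(keyword: str) -> str:
    return hashlib.md5(keyword.encode("utf-8")).hexdigest()

in_network = {}   # hash code -> set of node-number pairs; starts empty

def already_inserted(keyword: str) -> bool:
    """True if the keyword was inserted into the decoding network earlier."""
    return keyword_hash(keyword) in in_network

def record_insertion(keyword: str, node_pairs) -> None:
    """Remember where the keyword's first and last arcs sit in the network."""
    in_network.setdefault(keyword_hash(keyword), set()).update(node_pairs)

# Usage: skip keywords that are already in the network, insert the rest.
for name in ["Jack Alen", "Jack Alen", "Mary"]:
    if already_inserted(name):
        continue                               # cancel the duplicate insertion
    record_insertion(name, {(0, 1), (7, 8)})   # node pairs produced by the real insertion step
```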
  • the right node of the traversed outgoing arc is connected to the left node of the replacement grammar slot through a directed arc, the pronunciation information on the traversed outgoing arc is stored on this directed arc, and the numbers of incoming or outgoing arcs of the nodes at both ends of the directed arc are updated.
  • specifically, each outgoing arc of the start node of the person name network is traversed; for each traversed outgoing arc, the person name hash code on the arc is obtained and compared with all hash codes in the network keyword information set. If the hash code matches any hash code in the network keyword information set, the person name corresponding to the hash code has already been inserted into the sentence pattern decoding network; in this case the arc is skipped and the hash code judgment proceeds to the next outgoing arc.
  • if the hash code of the traversed outgoing arc does not match any hash code in the network keyword information set, the person name corresponding to the hash code has not been inserted into the sentence pattern decoding network.
  • in that case, the right node of the traversed outgoing arc is connected to the left node of the replacement grammar slot of the sentence pattern decoding network through a directed arc, the pronunciation information on the traversed outgoing arc is stored on the connecting directed arc, and the numbers of incoming or outgoing arcs of the nodes at both ends of the directed arc are updated.
  • the above process realizes the connection of the right node of each arc out of the starting node of the vertical keyword network and the left node of the replacement grammar slot of the sentence pattern decoding network.
  • the left node of the traversed incoming arc is connected to the right node of the replacement grammar slot through a directed arc, the pronunciation information on the traversed incoming arc is stored on this directed arc, and the arc counts of the nodes at both ends are updated, provided that the corresponding person name has not yet been inserted into the sentence pattern decoding network.
  • specifically, the hash code on each traversed incoming arc is compared with all hash codes in the network keyword information set. If the hash code matches any hash code in the network keyword information set, the person name corresponding to the hash code has already been inserted into the sentence pattern decoding network; in this case the incoming arc is skipped and the hash code judgment proceeds to the next incoming arc.
  • otherwise, the left node of the traversed incoming arc is connected to the right node of the replacement grammar slot of the sentence pattern decoding network through a directed arc, the pronunciation information on the incoming arc is stored on the connecting directed arc, and the numbers of incoming or outgoing arcs of the nodes at both ends of the directed arc are updated.
  • the above process realizes the connection between the left node of each incoming arc of the end node of the vertical keyword network and the right node of the replacement grammar slot of the sentence pattern decoding network.
  • the execution order of the above-described insertion at the start node and at the end node of the vertical keyword network can be flexibly arranged; for example, the insertion operation can be performed first for the start node of the vertical keyword network, or first for the end node of the vertical keyword network, or for both in parallel.
  • during the insertion, the embodiment of the present application stores the unique identifier of the keyword and the left and right node numbers, in the sentence pattern decoding network, of the directed arc where the unique identifier is located correspondingly into the network keyword information set.
  • each user can upload a vertical keyword set to the cloud server, and the cloud server can build a large-scale or very large-scale speech recognition decoding network, so as to meet the calling needs of various users.
  • the setting of the network keyword information set can improve the efficiency of inserting vertical keywords into the speech recognition decoding network, and at the same time can facilitate the selection of a specific decoding path according to the speech recognition needs of the current user.
  • the speech recognition decoding network constructed according to the above scheme not only includes the address book entry information dynamically imported in the current session, but also includes the address book information of other historical sessions, all of which is stored uniformly in the network keyword information set in the form of hash codes.
  • the path of the speech recognition decoding network needs to be updated, so that the decoding path is limited to the range of the current incoming address book.
  • the specific implementation method is:
  • taking the scenario where the user makes a phone call by voice as an example:
  • each hash code in the network keyword information set is traversed; if the traversed hash code belongs to a name hash code in the currently incoming address book, it is left untouched; if it is not a hash code of a name in the currently incoming address book, the left and right node numbers corresponding to the hash code are determined by querying the network keyword information set, and the arc between these left and right nodes is disconnected.
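  • a minimal sketch of this per-session path update, assuming arcs are stored as (left, right, label) tuples and keyword hashes are MD5 codes (both assumptions for illustration):

```python
# Sketch of the per-session path update described above: arcs whose keyword hash
# codes do not belong to the address book uploaded in this session are disconnected,
# so decoding is restricted to the current contacts. The data layout is illustrative.
import hashlib

def khash(name: str) -> str:
    return hashlib.md5(name.encode("utf-8")).hexdigest()

def restrict_paths(arcs, in_network, current_contacts):
    """arcs: list of (left, right, label); in_network: hash -> set of (left, right) pairs."""
    active = {khash(n) for n in current_contacts}
    broken = set()
    for h, pairs in in_network.items():
        if h not in active:           # keyword not in this session's address book
            broken.update(pairs)      # mark its arcs for disconnection
    return [a for a in arcs if (a[0], a[1]) not in broken]

# Example: only "Mary" was uploaded this session, so "Jack Alen"'s arcs are dropped.
arcs = [(0, 1, "jh"), (7, 8, "n"), (0, 2, "m")]
in_network = {khash("Jack Alen"): {(0, 1), (7, 8)}, khash("Mary"): {(0, 2)}}
print(restrict_paths(arcs, in_network, ["Mary"]))   # [(0, 2, 'm')]
```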
  • in the speech recognition decoding network that actually participates in decoding, only the decoding paths where the currently incoming address book is located remain connected, so it can only decode speech recognition results that call a name in the current address book, which is also in line with user expectations.
  • the decoding path can be limited to the range of vertical keywords set introduced this time, which is beneficial to narrowing the path search range, improving decoding efficiency and reducing decoding errors.
  • the speech recognition decoding network constructed based on the set of vertical keywords and the sentence pattern decoding network is the main network for realizing speech recognition involving vertical keywords.
  • the inventors of the present application found in research that the speech recognition decoding network has two deficiencies, one is the problem of false triggering of vertical keywords, and the other is the problem of insufficient network coverage.
  • The false triggering problem is the most common problem in speech recognition based on a fixed-sentence-pattern speech recognition decoding network, and it is also the most difficult to solve. False triggering means that the actual content of a piece of audio does not follow any sentence pattern in the fixed-sentence-pattern speech recognition decoding network, yet the final winning result is nevertheless a result from that network. False triggering of vertical keywords means that the actual content contains no vertical keyword, or the vertical keyword is not in the vertical keyword set passed in this time, yet the result of the speech recognition decoding network still wins and a wrong vertical keyword is given. For example, in a phone call scenario, there is no name in the actual content, or the name is not in the address book passed in this time, yet the result output by the speech recognition decoding network wins and a wrong name is given.
  • False triggers of vertical keywords generally fall into the following four types: (1) false triggers where the vertical keyword and the real content have the same pronunciation; (2) false triggers where the vertical keyword and the real content have similar pronunciations; (3) false triggers, introduced by the excitation strategy, where the pronunciation of the vertical keyword differs greatly from the real content; (4) false triggers where there is no vertical keyword in the real content, but the speech recognition decoding network still recognizes one.
  • the root cause can be attributed to insufficient sentence pattern coverage in the speech recognition decoding network.
  • the embodiment of the present application proposes a corresponding solution to the problem of insufficient sentence pattern coverage of the speech recognition decoding network. For other false triggering situations, it will be solved and optimized through other solutions in subsequent embodiments.
  • the speech recognition decoding network is built based on the sentence pattern decoding network, that is, based on the sentence pattern.
  • the advantage of this network construction method is that it can accurately match the speech sentence pattern.
  • the disadvantage is that a sentence pattern not present in the network cannot be recognized, and the corpus used to build the network often cannot cover all sentence patterns in all scenarios, so the speech recognition decoding network introduces a problem, namely that the sentence pattern coverage is not high enough.
  • the general speech recognition model is trained based on massive data, and the sentence patterns are scalable and the sentence patterns are very rich.
  • the result errors of the general speech recognition model are mainly vertical keyword recognition errors, mainly because the general speech recognition model incorporates language scores during training and the training corpus often cannot fit the scores of out-of-vocabulary vertical keywords well.
  • although the vertical keywords recognized by the general speech recognition model are wrong, the sentence pattern is correct, mainly because the sentence pattern can be fitted by the training data.
  • in summary, the sentence pattern information of the recognition results of the general speech recognition model is more reliable, while the vertical keyword information of the recognition results of the speech recognition decoding network is more reliable. Based on this idea, the present case proposes using the sentence patterns of the general speech recognition model to solve the problem of low sentence pattern coverage of the speech recognition decoding network.
  • the specific solution is that, on the premise that the acoustic state sequence of the speech to be recognized has been decoded into the first speech recognition result and the second speech recognition result by the speech recognition decoding network constructed according to the technical idea of the present application and by the general speech recognition model respectively, the first speech recognition result is corrected according to the second speech recognition result.
  • the content in the first speech recognition result is divided into vertical keyword content and non-vertical keyword content.
  • the content in the second speech recognition result is divided into reference text content and non-reference text content, where the reference text content refers to the text content in the second speech recognition result that matches the non-vertical keyword content in the first speech recognition result; it may specifically be the character string most similar to the non-vertical keyword content in the first speech recognition result, or the character string whose similarity to it is greater than a set threshold.
  • the above-mentioned correction of the first speech recognition result according to the second speech recognition result means using the reference text content in the second speech recognition result to correct the non-vertical keyword content in the first speech recognition result, so as to obtain the corrected first speech recognition result.
  • the embodiment of the present application matches the first speech recognition result with the second speech recognition result based on the edit distance algorithm, and determines the reference text content from the second speech recognition result.
  • an edit distance matrix between the first speech recognition result and the second speech recognition result is determined according to an edit distance algorithm.
  • the edit distance matrix includes edit distances between each character in the first speech recognition result and each character in the second speech recognition result.
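  • a minimal sketch of computing such a character-level edit distance matrix (standard dynamic programming; the example strings are illustrative):

```python
# Minimal character-level edit distance matrix between two recognition results,
# as used above to align the first and second speech recognition results.

def edit_distance_matrix(a: str, b: str):
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d

matrix = edit_distance_matrix("call jack alen now", "please call jack allen now")
print(matrix[-1][-1])   # overall edit distance between the two results
```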
  • the non-vertical keyword content in the first speech recognition result may be divided into the character string before the vertical keyword and/or the character string after the vertical keyword.
  • the corresponding reference text content can also be determined through the above method.
  • either the target text content in the second speech recognition result or the non-vertical keyword content in the first speech recognition result is determined as the corrected non-vertical keyword content.
  • the target text content in the second speech recognition result refers to the text content in the second speech recognition result that corresponds to the position of the non-vertical keyword content in the first speech recognition result.
  • the position of the target text content in the second speech recognition result in the second speech recognition result is the same as the position of the non-vertical keyword content in the first speech recognition result.
  • for example, when the non-vertical keyword content in the first speech recognition result is the text content before the vertical keyword in the first speech recognition result,
  • the target text content in the second speech recognition result is specifically the text content before the position, in the second speech recognition result, to which the position of the vertical keyword in the first speech recognition result is mapped.
  • the mapping between the positions of vertical keywords in the first speech recognition result and the second speech recognition result can be realized based on the above-mentioned edit distance matrix.
  • the process of determining the corrected non-person name content will be introduced.
  • an example is taken in which the person's name in the first speech recognition result is in the middle of the sentence of the first speech recognition result, and the emphasis is on the correction process of the character string before the person's name in the first speech recognition result.
  • the corresponding correction processing can also be carried out for the character string after the person's name.
  • if the second speech recognition result has more characters than the non-vertical keyword content in the first speech recognition result and the difference in the number of characters between the two does not exceed a set threshold, the target text content in the second speech recognition result is determined as the corrected non-vertical keyword content;
  • if the second speech recognition result has fewer characters than the non-vertical keyword content in the first speech recognition result, and/or the difference in the number of characters between the two exceeds the set threshold, the non-vertical keyword content in the first speech recognition result is determined as the corrected non-vertical keyword content.
  • the local maximum subsequence between the character string before the person's name and the second speech recognition result is calculated, and this local maximum subsequence is the reference text content determined from the second speech recognition result.
  • the target text content in the second speech recognition result is determined as the corrected character string before the person's name.
  • the target text content in the second speech recognition result is specifically the text content before the position, in the second speech recognition result, to which the position of the person's name in the first speech recognition result is mapped according to the edit distance matrix between the first speech recognition result and the second speech recognition result.
  • if the number of characters of the second speech recognition result is not more than the number of characters of the character string before the person's name, the character string before the person's name in the first speech recognition result is kept unchanged, that is, it is determined as the corrected character string before the person's name.
  • if the number of characters in the second speech recognition result is more than the number of characters in the character string before the person's name, it is further judged whether the difference in the number of characters between the two exceeds the set threshold, for example whether the number of excess characters exceeds 20% of the number of characters in the character string before the person's name.
  • if it exceeds the threshold, the character string in front of the person's name in the first speech recognition result remains unchanged, that is, it is determined as the corrected character string in front of the person's name;
  • otherwise, the target text content in the second speech recognition result is determined as the corrected character string before the person's name.
  • the corrected character string after the person's name can also be determined with reference to the above introduction.
  • the modified non-vertical keyword content and the vertical keyword content are combined according to the positional relationship between the original non-vertical keyword content and the vertical keyword content, and the combination result obtained is the modified The first speech recognition result of .
  • the corrected character string before the person's name, the person's name, and the corrected character string after the person's name are sequentially concatenated and combined to obtain the corrected first speech recognition result.
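  • a minimal sketch of this correction rule, assuming a 20% character-count threshold as in the example above; the function names and the simple string handling are illustrative assumptions:

```python
# Sketch of the correction rule described above for the string before the keyword:
# prefer the second result's target text when it is longer but not by more than the
# set threshold (20% here, following the example above); otherwise keep the first
# result's own string.

def corrected_prefix(first_prefix: str, second_target: str, ratio: float = 0.2) -> str:
    """Pick the corrected string before the keyword per the rule described above."""
    extra = len(second_target) - len(first_prefix)
    if extra <= 0:
        return first_prefix            # second result's target text is not longer: keep original
    if extra > ratio * len(first_prefix):
        return first_prefix            # too many extra characters: keep original
    return second_target               # longer but within the threshold: adopt the target text

def corrected_result(prefix: str, keyword: str, suffix: str) -> str:
    # recombine in the original order: corrected prefix + vertical keyword + corrected suffix
    return prefix + keyword + suffix

print(corrected_prefix("give call ", "give a call ", 0.2))        # "give a call " (within 20%)
print(corrected_result("give a call ", "Jack Alen", " please"))   # "give a call Jack Alen please"
```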
  • Correcting the first speech recognition result according to the above-mentioned processing process can make the output result of the speech recognition decoding network not only contain more accurate vertical keyword information, but also can integrate the accurate sentence pattern information recognized by the general speech recognition model , so that the recognition results of the speech recognition decoding network are more accurate and the sentence pattern coverage is higher, and at the same time, the problem of recognition false triggering caused by the low sentence pattern coverage of the speech recognition decoding network can be solved.
  • the result of the speech recognition decoding network will be excited, including acoustic excitation and language excitation.
  • if the excitation is not appropriate, it may cause insufficient excitation, or cause false triggers due to excessive excitation, resulting in the second and third types of vertical keyword false triggers mentioned above, namely: (2) false triggers due to the similar pronunciation of the vertical keyword and the real content; (3) false triggers, introduced by the excitation strategy, in which the pronunciation of the vertical keyword differs greatly from the real content.
  • the embodiment of the present application therefore studies the above-mentioned excitation scheme and proposes an optimized excitation scheme.
  • the embodiment of the present application first proposes that, when the speech recognition decoding network and the general speech recognition model are used to decode the acoustic state sequence of the speech to be recognized, the character edit distance between the first speech recognition result output by the speech recognition decoding network and the second speech recognition result output by the general speech recognition model is calculated according to the edit distance algorithm, so as to determine the matching degree between the two.
  • the confidence degree of the first speech recognition result is further determined based on the matching degree of the first speech recognition result and the second speech recognition result.
  • the matching degree threshold is a value, calculated over multiple test sets, that contributes most to the recognition rate.
  • if the matching degree between the first speech recognition result and the second speech recognition result is not greater than the set matching degree threshold, the vertical keyword in the first speech recognition result and the sentence pattern of the second speech recognition result, that is, the content in the second speech recognition result other than the content corresponding to the vertical keyword, are used to construct a miniature decoding network. It can be seen that the miniature decoding network has only one decoding path.
  • the miniature decoding network is then used to decode the above-mentioned acoustic state sequence of the speech to be recognized again to obtain a decoding result, and this decoding result is used as a new first speech recognition result. Then, using the per-frame acoustic scores of the updated first speech recognition result during decoding, the per-frame acoustic scores are accumulated or weighted and used as the finally determined confidence of the first speech recognition result.
  • step D2 is performed: according to the acoustic score of the first speech recognition result and the acoustic score of the second speech recognition result, the final speech recognition result is selected from the first speech recognition result and the second speech recognition result.
  • the above-mentioned confidence threshold refers to a score threshold determined through experiments that enables the first speech recognition result to win when performing score competition with other speech recognition results.
  • if the confidence of the first speech recognition result is greater than the preset confidence threshold, the first speech recognition result will not easily be eliminated (PKed out) in the acoustic score competition with other speech recognition results by virtue of its own confidence. Therefore, at this time the acoustic score comparison can be performed directly according to the acoustic score of the first speech recognition result and the acoustic score of the second speech recognition result, and the one or more speech recognition results with the highest acoustic score can be selected as the final speech recognition result.
  • step D3 is performed to stimulate the acoustic score of the first speech recognition result, and according to the acoustic score of the stimulated first speech recognition result, score and the acoustic score of the second speech recognition result, and select the final speech recognition result from the first speech recognition result and the second speech recognition result.
  • if the confidence of the first speech recognition result is not greater than the preset confidence threshold, it indicates that the confidence of the first speech recognition result is low, that is, its acoustic score is low.
  • in that case, if the acoustic scores are compared directly, the first speech recognition result will be eliminated by other speech recognition results, so that the more accurate vertical keyword information contained in the first speech recognition result is lost in the finally selected speech recognition result, which may cause recognition errors, especially misrecognition of vertical keywords.
  • therefore, the embodiment of the present application performs acoustic score excitation on the first speech recognition result; specifically, the acoustic score of the slot where the vertical keyword in the first speech recognition result is located is excited so that it is increased by a certain proportion, thereby increasing the acoustic score of the first speech recognition result and ensuring that the first speech recognition result will not easily be eliminated in the subsequent acoustic score comparison with other speech recognition results.
  • the acoustic score of the first speech recognition result after excitation and the acoustic score of the second speech recognition result can then be compared, and the one or more speech recognition results with the highest acoustic score can be selected as the final speech recognition result.
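  • a condensed sketch of the D2/D3 decision flow, with placeholder scores, threshold and excitation function (all illustrative; the tiny-network re-decode of step D1 is not shown):

```python
# Condensed sketch of the decision flow described above: if the confidence of the
# first result clears the threshold, compare acoustic scores directly (D2); otherwise
# excite the first result's score before the comparison (D3). Values are placeholders.

def pick_final_result(first, second, confidence, conf_threshold, excite):
    """first/second: (text, acoustic_score) pairs; excite: function that boosts the first score."""
    if confidence <= conf_threshold:
        first = (first[0], excite(first[1]))   # step D3: excite the first result before comparing
    # score comparison: the result with the higher acoustic score wins
    return first if first[1] >= second[1] else second

final = pick_final_result(
    ("call Jack Alen", -310.0), ("call Jack Allen", -305.0),
    confidence=0.42, conf_threshold=0.5,
    excite=lambda s: s * 0.95,   # illustrative boost, assuming negative log-domain scores
)
print(final[0])   # "call Jack Alen"
```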
  • Performing acoustic score excitation on the first speech recognition result, that is, exciting the acoustic score of the slot where the vertical keyword in the first speech recognition result is located, specifically means multiplying the acoustic score of that slot by an excitation coefficient and then recalculating the acoustic score of the first speech recognition result on the basis of the excited vertical keyword acoustic score.
  • the acoustic score excitation for the first speech recognition result can be realized as follows:
  • the acoustic excitation coefficient determines the strength of the acoustic score excitation for the first speech recognition result. If the excitation coefficient is too large, it causes excessive excitation and the recognition false triggering problem described above; if the excitation coefficient is too small, the excitation purpose is not achieved, and the first speech recognition result may be eliminated by other speech recognition results in the score comparison.
  • the determination of the acoustic excitation coefficient is the key to solving the problem of false triggering of vertical keywords mentioned in the above embodiments and the problem of losing important vertical keyword information during acoustic PK.
  • when determining the acoustic excitation coefficient, it should be determined at least according to the vertical keyword content and the non-vertical keyword content in the first speech recognition result, and should also be combined with empirical parameters determined for the actual business scenario.
  • for example, the acoustic excitation coefficient may be calculated according to the acoustic score excitation prior coefficient in the business scenario to which the speech to be recognized belongs, the number of characters and the number of phonemes of the vertical keyword in the first speech recognition result, and the total number of characters and the total number of phonemes of the first speech recognition result.
  • the acoustic excitation coefficient RC is calculated according to the following formula:
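  • a minimal sketch of one possible form of this formula, assuming RC combines the scene prior coefficient α with the slot-to-sentence character and phoneme ratios weighted by λ and 1−λ (the symbols and the exact combination are reconstructions, not the original expression):

```latex
% One plausible form of the acoustic excitation coefficient, combining the scene
% prior coefficient \alpha with the slot-to-sentence character and phoneme ratios
% weighted by \lambda and (1-\lambda). A hedged reconstruction, not the exact formula.
RC = \alpha \left( \lambda \cdot \frac{\mathrm{SlotWC}}{\mathrm{SentWC}}
                 + (1 - \lambda) \cdot \frac{\mathrm{SlotPC}}{\mathrm{SentPC}} \right)
```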
  • where α is the scene prior coefficient.
  • the prior coefficient α can be dynamically set in each recognition session according to the requirements of the recognition system. For example, based on natural language processing (NLP) technology, an upper-level system can predict the user's behavioral intention through the context of user interaction and adjust the coefficient dynamically in real time to meet the requirements of various scenarios.
  • the design of the acoustic excitation coefficient also fully considers the number of words (word count in slot, SlotWC) and the number of phonemes (phoneme count in slot, SlotPC) in the slot where the vertical keyword is located (that is, the number of characters and the number of phonemes of the vertical keyword), as well as the number of words in the sentence (word count in sentence, SentWC) and the number of phonemes in the sentence (phoneme count in sentence, SentPC) (that is, the total number of characters and the total number of phonemes of the first speech recognition result), and sets an influence weight λ for the number of characters and the number of phonemes respectively, so that the influence weights of characters and phonemes sum to 1.
  • the acoustic score incentive only stimulates the acoustic score of the slot where the vertical keyword is located, eliminating the interference of irrelevant context, thereby avoiding the problem of false triggering of recognition caused by excessive incentives.
  • specifically, the score confidence of the vertical keyword content in the first speech recognition result may first be calculated based on the number of phonemes and the acoustic score of the vertical keyword content in the first speech recognition result together with the number of phonemes and the acoustic score of the non-vertical keyword content; then, the acoustic excitation coefficient is determined according to the score confidence of the vertical keyword content in the first speech recognition result.
  • the state sequence score corresponding to the misrecognized word is relatively low, which is why the correct recognition result often has the highest score.
  • the average score of the acoustic sequence is lower for misrecognized results.
  • the vertical keyword score confidence scheme separates the vertical keywords and the non-vertical keywords in the first speech recognition result output by the speech recognition decoding network, and calculates the total acoustic score and the number of effective acoustic phonemes for the vertical keywords and for the non-vertical keywords respectively.
  • the score confidence S c of the above-mentioned vertical keywords can be calculated by the following formula:
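  • one plausible form of this calculation, comparing the per-phoneme acoustic score of the keyword slot with that of the rest of the sentence (a hedged reconstruction from the quantities defined below, not necessarily the exact expression):

```latex
% Plausible form of the vertical-keyword score confidence: the per-phoneme acoustic
% score of the keyword slot relative to that of the non-keyword content.
S_c = \frac{S_p / N_p}{S_a / N_a}
```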
  • the acoustic score of the vertical keywords in the first speech recognition result is recorded as S_p,
  • the number of effective acoustic phonemes occupied by the vertical keywords is recorded as N_p,
  • the total acoustic score of the non-vertical keywords is recorded as S_a,
  • and the number of effective acoustic phonemes occupied by the slots of the non-vertical keywords is recorded as N_a.
  • on this basis, an acoustic excitation coefficient for exciting the vertical keyword may be determined.
  • the confidence score of the vertical keyword content in the first speech recognition result is compared with a preset confidence threshold.
  • the confidence threshold is determined according to the probability of false triggering of the recognition caused by the acoustic excitation coefficient.
  • if the score confidence of the vertical keyword content in the first speech recognition result is greater than the confidence threshold, it can be considered that the vertical keyword is likely to cause false triggering, and the acoustic excitation coefficient should be lowered;
  • if the score confidence of the vertical keyword content in the first speech recognition result is not greater than the confidence threshold, it can be considered that the vertical keyword is easily eliminated in the score comparison, and the acoustic excitation coefficient should be raised.
  • the score confidence of the vertical keyword content in the first speech recognition result is jointly used to determine the acoustic excitation coefficient.
  • the embodiment of the present application analyzes the recognition results of multiple test sets, counts the influence of the size of the acoustic excitation coefficient on the recognition effect and recognition false trigger, and determines the relationship between the acoustic excitation coefficient and the recognition effect and recognition false trigger.
  • a value that is more balanced between the improvement of the recognition rate and the reduction of false triggers is selected.
  • the principle of selection is to ensure that the number of false triggers is much smaller than the number of improved recognition effects.
  • the acoustic excitation coefficient selected in the embodiment of the present application is an excitation coefficient for which the number of recognition false triggers it causes is one percent of the number of cases in which it improves the recognition effect.
  • the acoustic score of the slot where the vertical keyword content in the first speech recognition result is located is multiplied by the acoustic excitation coefficient determined in the above steps to obtain the updated acoustic score of the vertical keyword content.
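  • a minimal sketch of this update step; the negative log-domain score convention and the example values are assumptions for illustration:

```python
# Sketch of applying the excitation determined above: multiply the acoustic score of
# the vertical-keyword slot by the excitation coefficient RC and recompute the overall
# acoustic score of the first speech recognition result from the updated slot score.

def excited_result_score(slot_score: float, non_slot_score: float, rc: float) -> float:
    updated_slot_score = slot_score * rc       # excited acoustic score of the keyword slot
    return updated_slot_score + non_slot_score # recomputed score of the whole result

# Assuming negative log-domain scores, a coefficient below 1 raises the score.
print(excited_result_score(slot_score=-120.0, non_slot_score=-180.0, rc=0.9))  # -288.0
```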
  • in the following, the language model excitation of the third speech recognition result is taken as an example to introduce the specific processing content of the language model excitation.
  • the specific language model excitation process is not limited by the excitation object, and the language model excitation scheme can also be applied to the excitation of other speech recognition results; for example, the language model excitation scheme described in the following embodiments is also applicable to performing language model excitation on the candidate speech recognition result selected from the first speech recognition result and the second speech recognition result.
  • the language model excitation is to recalculate the score of the speech recognition result through the language model, so that the score of the speech recognition result carries the language component.
  • the language model excitation mechanism is mainly realized through two aspects: one is the clustering class language model, and the other is a strategy, based on matching the vertical keywords with the pronunciation sequence of the speech recognition result, that extends the path of the speech recognition result and, based on the extended path together with the above-mentioned clustering class language model, determines the language score of the speech recognition result.
  • the clustering class model is introduced.
  • in specific speech recognition business scenarios involving vertical keywords, such as making phone calls, sending text messages, querying the weather, and navigation, the vertical keywords in each scenario can be limited to a finite range through enumeration or user-provided methods, and the context in which vertical keywords appear usually follows a specific sentence pattern.
  • the clustering class language model In addition to using general training corpus, the clustering class language model also performs special processing for such specific sentence patterns or sayings.
  • the clustering class language model defines a class for each specific scene, and each class is marked and distinguished by a special word (class) serving as the category label corresponding to the scene. After all category labels are defined, vertical keywords such as person names, city names, and audio/video names in the training corpus are replaced with the corresponding category labels to form the target corpus, which is added to the original training corpus and then used for training the above-mentioned clustering class language model.
  • This processing method makes the special word class represent the probability of a class of words, so the probability of the N-gram language model where the special word class is located in the clustering model will be significantly higher than the probability of the specific vertical class keyword itself.
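  • a minimal sketch of this corpus preparation, with an illustrative label set and replacement rule (the real category labels and corpus are not shown here):

```python
# Sketch of the corpus preparation described above: vertical keywords in the training
# sentences are replaced by the class label of their scene before the clustering class
# language model is trained. Labels and sentences below are illustrative assumptions.

CLASS_LABELS = {"person_name": ["Zhang San", "Jack Alen"],
                "city_name": ["Beijing", "Hefei"]}

def to_class_corpus(sentence: str) -> str:
    for label, keywords in CLASS_LABELS.items():
        for kw in keywords:
            sentence = sentence.replace(kw, f"<{label}>")
    return sentence

print(to_class_corpus("Call Zhang San"))          # Call <person_name>
print(to_class_corpus("Navigate to Beijing"))     # Navigate to <city_name>
```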
  • the embodiment of the present application performs path extension on the third voice recognition result according to the vertical category keyword set under the business scenario to which the voice to be recognized belongs and the category label corresponding to the business scenario.
  • Exemplarily, the vertical keyword in the third speech recognition result is first compared with the vertical keywords in the vertical keyword set under the business scenario to which the speech to be recognized belongs.
  • if the vertical keyword in the third speech recognition result matches any vertical keyword in the vertical keyword set under the business scenario to which the speech to be recognized belongs, a new path is extended between the left and right nodes of the slot where the vertical keyword in the third speech recognition result is located, and the category label corresponding to the business scenario to which the speech to be recognized belongs is stored on the new path.
  • FIG. 13 is a schematic diagram of the state network (lattice) of the speech recognition result "<s> Call Zhang San </s>" in a call service scenario.
  • in the state network shown in Figure 13, a new path is extended between the left and right nodes of the slot where "Zhang San" is located.
  • the new path shares the start node and end node with "Zhang San” in the original state network, and the category label "class” corresponding to the current business scenario is marked on the new path.
  • the "class” can be specifically "person's name”, and the state network after path expansion is shown in FIG. 14 .
  • according to the recognition result of the training corpus by the clustering class language model corresponding to the category label of the business scenario to which the speech to be recognized belongs, the language model scores of the third speech recognition result and of the extended path of the third speech recognition result are respectively determined.
  • the recognition result of the training corpus by the clustering language model includes the N-gram language model probability of each word in the recognition result, and the probability is the language score of the word.
  • the candidate speech recognition result may be the output of the speech recognition decoding network (that is, any one or more of the first speech recognition results) or the output of the general speech recognition model (that is, one or more of the second speech recognition results); therefore, a clustering class language model of the same type as the source of the candidate speech recognition result should be selected for re-scoring.
  • the model structures of the above-mentioned different types of clustering class language models are the same; the difference is that their training corpora are different.
  • the clustering class language model of the same type as the general speech recognition model is obtained based on massive corpus training and is assumed to be named model A; the clustering class language model of the same type as the scene customization model is obtained based on scene corpus training and is assumed to be named model B; since model A and model B are trained on different types of training corpora, they belong to different types of clustering class language models.
  • the clustering language model of the same type as the speech recognition decoding network is also trained based on scene corpus, and it is assumed to be named model C; since model B and model C are trained based on the same type of training corpus, both Belong to the same type of clustering language models.
  • Table 1 shows the calculation method of re-checking the two paths in Fig. 14 .
  • the third speech recognition result and the language model scores scoreA and scoreB of the extension path of the third speech recognition result can be respectively determined.
  • the language model score of the extended path of the third speech recognition result is used to determine the activated language score of the language model of the third speech recognition result.
  • the language model score scoreA of the third speech recognition result and the language model score scoreB of the extended path of the third speech recognition result are fused according to a certain ratio, with the two fusion coefficients summing to 1, to obtain the language score of the third speech recognition result after language model excitation.
  • the language score Score after the language model excitation of the third speech recognition result can be calculated by the following formula:
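  • written out under the description above (two scores fused with coefficients summing to 1), one plausible form, with β as a stand-in symbol for the empirical coefficient, is:

```latex
% Fusion of the two language model scores with coefficients summing to 1, following
% the description above; \beta stands for the empirical coefficient mentioned in the
% text (a reconstruction, not necessarily the exact formula).
\mathrm{Score} = \beta \cdot \mathrm{scoreA} + (1 - \beta) \cdot \mathrm{scoreB}
```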
  • where the fusion coefficient is an empirical coefficient whose value is determined through testing; specifically, it is determined with the goal of obtaining a correct language score so that the correct speech recognition result can then be selected from numerous speech recognition results based on the language score comparison.
  • the above embodiments have introduced the specific implementation schemes of acoustic score incentives and language model incentives.
  • the problem of false triggering of vertical keywords has been fully considered.
  • the problem of false triggering caused by the same pronunciation of the vertical keyword and the real result cannot be solved by the above-mentioned scheme of controlling the excitation coefficient.
  • the speech recognition decoding network constructed in the embodiment of the present application is a sentence network relying on an acoustic model, which does not contain language information, so the situation that the vertical keywords and the real results have the same pronunciation cannot be essentially solved.
  • the embodiment of the present application presents the results to the user in the form of multiple candidates.
  • since the general speech recognition model and the speech recognition decoding network share an acoustic model, when their output results have the same pronunciation their acoustic scores must be the same. Therefore, when the acoustic scores of the first speech recognition result output by the speech recognition decoding network and the second speech recognition result output by the general speech recognition model are the same, the first speech recognition result and the second speech recognition result are jointly used as the final speech recognition result and are output simultaneously, and the user selects the correct speech recognition result.
  • the output sequence when the first speech recognition result and the second speech recognition result are simultaneously output can be flexibly adjusted, and it is preferable to output in the order that the first speech recognition result comes first and the second speech recognition result follows.
  • the above idea of outputting speech recognition results in multiple candidate forms is also applicable to scoring PK of speech recognition results of more models.
  • if the output results of the speech recognition decoding network, the general speech recognition model, and the scene customization model are compared by score to decide the final speech recognition result and there are multiple different speech recognition results with the same score, these speech recognition results with the same score can be output at the same time, and the user selects the correct one.
  • the embodiment of the present application also proposes a speech recognition device, as shown in FIG. 15 , the speech recognition device includes:
  • the acoustic recognition unit 001 is used to obtain the acoustic state sequence of the speech to be recognized;
  • the network construction unit 002 is configured to construct a speech recognition decoding network based on the set of vertical keywords and the sentence pattern decoding network in the scenario to which the speech to be recognized belongs, wherein the sentence pattern decoding network is constructed at least by performing sentence pattern induction processing on the text corpus in the scenario to which the speech to be recognized belongs;
  • the decoding processing unit 003 is configured to use the speech recognition decoding network to decode the acoustic state sequence to obtain a speech recognition result.
  • the above-mentioned construction of a speech recognition decoding network based on the vertical keyword set and the sentence pattern decoding network under the business scenario to which the speech to be recognized belongs includes:
  • the cloud server constructing a speech recognition decoding network based on the set of vertical keywords and the sentence pattern decoding network under the scenario to which the speech to be recognized belongs.
  • the speech recognition result is used as the first speech recognition result
  • the decoding processing unit 003 is also used for:
  • a final speech recognition result is determined from at least the first speech recognition result and the second speech recognition result.
  • the decoding processing unit 003 is further configured to:
  • the scene customization model is obtained by performing speech recognition training on the speech in the scene to which the speech to be recognized belongs;
  • the determining the final speech recognition result at least from the first speech recognition result and the second speech recognition result includes:
  • a final speech recognition result is determined from the first speech recognition result, the second speech recognition result and the third speech recognition result.
  • determining a final speech recognition result from the first speech recognition result, the second speech recognition result, and the third speech recognition result includes:
  • according to the first speech recognition result, the second speech recognition result and the third speech recognition result, the final speech recognition result is determined from the first speech recognition result, the second speech recognition result and the third speech recognition result.
  • determining a final speech recognition result from the first speech recognition result, the second speech recognition result, and the third speech recognition result includes:
  • according to the language score of the candidate speech recognition result after language model excitation and the language score of the third speech recognition result after language model excitation, the final speech recognition result is determined from the candidate speech recognition result and the third speech recognition result.
  • Another embodiment of the present application also proposes another speech recognition device, as shown in FIG. 16 , the device includes:
  • the acoustic recognition unit 011 is used to obtain the acoustic state sequence of the speech to be recognized;
  • a multi-dimensional decoding unit 012 configured to decode the acoustic state sequence using a speech recognition decoding network to obtain a first speech recognition result, and to decode the acoustic state sequence using a general speech recognition model to obtain a second speech recognition result ;
  • the speech recognition decoding network is constructed based on the set of vertical keywords and the sentence pattern decoding network under the scene where the speech to be recognized belongs to;
  • an acoustic excitation unit 013 configured to perform acoustic score excitation on the first speech recognition result
  • the decision processing unit 014 is configured to determine a final speech recognition result from at least the excited first speech recognition result and the second speech recognition result.
  • the multi-dimensional decoding unit 012 is also configured to:
  • the scene customization model is obtained by performing speech recognition training on the speech in the scene to which the speech to be recognized belongs;
  • the determining the final speech recognition result at least from the excited first speech recognition result and the second speech recognition result includes:
  • the determining the final speech recognition result from the excited first speech recognition result, the second speech recognition result and the third speech recognition result includes:
  • according to the language score of the candidate speech recognition result after language model excitation and the language score of the third speech recognition result after language model excitation, the final speech recognition result is determined from the candidate speech recognition result and the third speech recognition result.
  • the sentence pattern decoding network in the scene to which the speech to be recognized belongs is constructed through the following processing:
  • a text sentence network is constructed, wherein the text sentence network includes ordinary grammar slots corresponding to non-vertical keywords and a replacement grammar slot corresponding to the vertical keyword, and the placeholder corresponding to the vertical keyword is stored in the replacement grammar slot;
  • each word in the ordinary grammar slot of the word-level sentence pattern decoding network is replaced with the corresponding pronunciation, and pronunciation node expansion is performed according to the pronunciation corresponding to the word, so as to obtain a pronunciation-level sentence pattern decoding network, and the pronunciation-level sentence pattern decoding network is used as the sentence pattern decoding network in the scene to which the speech to be recognized belongs.
  • word segmentation is performed on the entry in the ordinary grammar slot of the text sentence network and word node expansion is performed according to the word segmentation result to obtain a word-level sentence decoding network, including:
  • the word strings corresponding to each entry corresponding to the same general grammar slot are connected in parallel to obtain a word-level sentence pattern decoding network.
  • the replacing each word in the ordinary grammar slot of the word-level sentence pattern decoding network with the corresponding pronunciation and performing pronunciation node expansion according to the pronunciation corresponding to the word to obtain the pronunciation-level sentence pattern decoding network includes:
  • each word in the ordinary grammar slot of the word-level sentence pattern decoding network is replaced with the corresponding pronunciation;
  • each pronunciation in the word-level sentence pattern decoding network is divided into pronunciation units, and pronunciation node expansion is performed using the pronunciation units corresponding to the pronunciation, so as to obtain the pronunciation-level sentence pattern decoding network.
  • a speech recognition decoding network is constructed based on the set of vertical keywords and the sentence pattern decoding network under the scene to which the speech to be recognized belongs, including:
  • the construction of a vertical keyword network based on the vertical keywords in the vertical keyword set under the scene where the speech to be recognized belongs includes:
  • a word-level vertical keyword network is constructed
  • Each word in the word-level vertical keyword network is replaced with the corresponding pronunciation, and the pronunciation node is expanded according to the pronunciation corresponding to the word, so as to obtain the pronunciation-level vertical keyword network.
  • both the vertical keyword network and the sentence pattern decoding network are composed of nodes and directed arcs connecting the nodes, and pronunciation information or placeholders are stored on the directed arcs between nodes;
  • Insert the vertical keyword network into the sentence pattern decoding network to obtain a speech recognition decoding network including:
  • the speech recognition decoding network is constructed by connecting the vertical keyword network with the left and right nodes of the replacement grammar slot of the sentence pattern decoding network through directed arcs.
  • the connecting the vertical keyword network with the left and right nodes of the replacement grammar slot of the sentence pattern decoding network respectively through directed arcs to construct a speech recognition decoding network includes:
  • the unique identifier corresponding to the keyword is stored on the first arc and the last arc of each keyword in the vertical keyword network;
  • each outgoing arc of the start node of the vertical keyword network is traversed, and for each traversed outgoing arc, whether the keyword corresponding to the unique identifier has been inserted into the sentence pattern decoding network is determined according to the unique identifier on the arc and a networked keyword information set; if the keyword has not been inserted into the sentence pattern decoding network:
  • the right node of the traversed outgoing arc is connected to the left node of the replacement grammar slot through a directed arc, and the pronunciation information on the traversed outgoing arc is stored on the directed arc;
  • the left node of the traversed incoming arc is connected to the right node of the replacement grammar slot through a directed arc, and the pronunciation information on the traversed incoming arc is stored on the directed arc;
  • if the keyword corresponding to the unique identifier has already been inserted into the sentence pattern decoding network, the pronunciation information on the arc is not inserted into the sentence pattern decoding network again.
  • the connecting the right node of each outgoing arc of the start node of the vertical keyword network with the left node of the replacement grammar slot through a directed arc, and
  • the connecting the left node of each incoming arc of the end node of the vertical keyword network with the right node of the replacement grammar slot through a directed arc, to construct a speech recognition decoding network, further includes:
  • when a keyword in the vertical keyword network is inserted into the sentence pattern decoding network, the unique identifier of the keyword and the left and right node numbers of the directed arc on which the unique identifier is located in the sentence pattern decoding network are correspondingly stored in the networked keyword information set;
  • the connecting the right node of each outgoing arc of the start node of the vertical keyword network with the left node of the replacement grammar slot through a directed arc, and
  • the connecting the left node of each incoming arc of the end node of the vertical keyword network with the right node of the replacement grammar slot through a directed arc, to construct a speech recognition decoding network, further includes:
  • each unique identifier in the networked keyword information set is traversed; if a traversed unique identifier is not the unique identifier of any keyword in the vertical keyword set in the scene to which the speech to be recognized belongs, the directed arc between the left and right node numbers corresponding to the unique identifier is disconnected.
  • the above speech recognition device further includes:
  • the result modification unit is configured to modify the first speech recognition result according to the second speech recognition result.
  • modifying the first speech recognition result according to the second speech recognition result includes:
  • the reference text content is the text content in the second speech recognition result that matches the non-vertical keyword content in the first speech recognition result.
  • the correcting the non-vertical keyword content in the first speech recognition result by using the reference text content in the second speech recognition result to obtain the corrected first speech recognition result includes:
  • the modified first speech recognition result is obtained by combining the modified non-vertical keyword content and the vertical keyword content.
  • determining from the second speech recognition result the text content corresponding to the non-vertical keyword content in the first speech recognition result, as a reference text content includes:
  • according to the edit distance matrix between the first speech recognition result and the second speech recognition result, and the non-vertical keyword content in the first speech recognition result, the text content corresponding to the non-vertical keyword content in the first speech recognition result is determined from the second speech recognition result as the reference text content.
  • the determining the modified non-vertical keyword content according to the reference text content in the second speech recognition result and the non-vertical keyword content in the first speech recognition result includes:
  • the target text content in the second speech recognition result or the non-vertical keyword content in the first speech recognition result is determined as the modified non-vertical keyword content;
  • the target text content in the second speech recognition result refers to the text content corresponding to the position of the non-vertical keyword content in the first speech recognition result in the second speech recognition result .
  • the determining the target text content in the second speech recognition result or the non-vertical keyword content in the first speech recognition result as the modified non-vertical keyword content includes:
  • if the reference text content in the second speech recognition result is the same as the non-vertical keyword content in the first speech recognition result, the target text content in the second speech recognition result is determined as the modified non-vertical keyword content;
  • if the second speech recognition result has more characters than the non-vertical keyword content in the first speech recognition result, and the difference in the number of characters between the two does not exceed a set threshold, the target text content in the second speech recognition result is determined as the modified non-vertical keyword content;
  • if the second speech recognition result has fewer characters than the non-vertical keyword content in the first speech recognition result, and/or the difference in the number of characters between the two exceeds the set threshold, the non-vertical keyword content in the first speech recognition result is determined as the modified non-vertical keyword content.
  • the determining the final speech recognition result at least from the first speech recognition result and the second speech recognition result includes:
  • the confidence of the first speech recognition result is determined based on the degree of matching between the first speech recognition result and the second speech recognition result;
  • if the confidence of the first speech recognition result is greater than a preset confidence threshold, the final speech recognition result is selected from the first speech recognition result and the second speech recognition result according to the acoustic score of the first speech recognition result and the acoustic score of the second speech recognition result;
  • if the confidence of the first speech recognition result is not greater than the preset confidence threshold, acoustic score excitation is performed on the first speech recognition result, and the final speech recognition result is selected from the first speech recognition result and the second speech recognition result according to the acoustic score of the excited first speech recognition result and the acoustic score of the second speech recognition result.
  • the determining the confidence level of the first speech recognition result based on the matching degree between the first speech recognition result and the second speech recognition result includes:
  • the first speech recognition result and the second speech recognition result are jointly used as the final speech recognition results.
  • performing acoustic score incentives on the first speech recognition result includes:
  • the acoustic score of the first speech recognition result is recalculated and determined according to the updated acoustic score of the vertical keyword content in the first speech recognition result and the acoustic score of the non-vertical keyword content in the first speech recognition result.
  • the determining the acoustic excitation coefficient at least according to the vertical keyword content and the non-vertical keyword content in the first speech recognition result includes:
  • the acoustic excitation coefficient is calculated and determined according to the acoustic score excitation prior coefficient in the scene to which the speech to be recognized belongs, the number of characters and the number of phonemes of the vertical keyword content in the first speech recognition result, and the total number of characters and the total number of phonemes of the first speech recognition result.
  • the determining the acoustic excitation coefficient at least according to the vertical keyword content and the non-vertical keyword content in the first speech recognition result includes:
  • the score confidence of the vertical keyword content in the first speech recognition result is determined according to the number of phonemes and the acoustic score of the vertical keyword content in the first speech recognition result, and the number of phonemes and the acoustic score of the non-vertical keyword content in the first speech recognition result;
  • the acoustic excitation coefficient is determined at least according to the score confidence of the vertical keyword content in the first speech recognition result.
  • the determining the acoustic excitation coefficient at least according to the score confidence of the vertical keyword content in the first speech recognition result includes:
  • the acoustic excitation coefficient is determined according to the score confidence of the vertical keyword content in the first speech recognition result, and the relationship between the predetermined acoustic excitation coefficient and the recognition effect and recognition false trigger.
  • performing language model excitation on the third speech recognition result includes:
  • the path extension is performed on the third speech recognition result;
  • the category label is determined by clustering speech recognition scenes;
  • the language model score of the third speech recognition result and the language model score of the extension path of the third speech recognition result are respectively determined according to the recognition result of the training corpus by the clustering language model corresponding to the category label; wherein the clustering language model is obtained by performing speech recognition training on a target corpus, and the vertical keywords in the target corpus are all replaced with the category labels;
  • the language score of the third speech recognition result after language model excitation is determined according to the language model score of the third speech recognition result and the language model score of the extension path of the third speech recognition result.
  • performing path extension on the third speech recognition result according to the vertical category keyword set in the scene to which the speech to be recognized belongs and the category label corresponding to the scene includes:
  • a new path is extended between the left and right nodes of the slot where the vertical keyword in the third speech recognition result is located, and the category label corresponding to the scene to which the speech to be recognized belongs is stored on the new path.
  • for the specific working content of each unit in the above embodiments of each speech recognition device, reference may be made to the processing content of the corresponding steps of the above speech recognition method, which will not be repeated here.
  • another embodiment of the present application further provides a speech recognition device; as shown in FIG. 17, the device includes:
  • the memory 200 is connected to the processor 210 and is configured to store a program;
  • the processor 210 is configured to execute the program stored in the memory 200 to implement the voice recognition method disclosed in any of the above embodiments.
  • the above speech recognition device may further include: a bus, a communication interface 220 , an input device 230 and an output device 240 .
  • the processor 210, the memory 200, the communication interface 220, the input device 230 and the output device 240 are connected to each other through the bus, wherein:
  • a bus may include a pathway that carries information between various components of a computer system.
  • the processor 210 may be a general-purpose processor, such as a general-purpose central processing unit (CPU) or a microprocessor, or may be an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the execution of the program of the technical solution of the present application; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • the processor 210 may include a main processor, and may also include a baseband chip, a modem, and the like.
  • the program for executing the technical solution of the present invention is stored in the memory 200, and an operating system and other key services may also be stored.
  • the program may include program code, and the program code includes computer operation instructions.
  • the memory 200 may include a read-only memory (ROM) or other types of static storage devices capable of storing static information and instructions, a random access memory (RAM) or other types of dynamic storage devices capable of storing information and instructions, a disk storage, a flash memory, and the like.
  • the input device 230 may include a device for receiving data and information input by a user, such as a keyboard, a mouse, a camera, a scanner, a light pen, a voice input device, a touch screen, a pedometer or a gravity sensor, and the like.
  • Output devices 240 may include devices that allow information to be output to a user, such as a display screen, printer, speakers, and the like.
  • the communication interface 220 may use any transceiver-type apparatus to communicate with other devices or communication networks, such as an Ethernet, a radio access network (RAN), a wireless local area network (WLAN), and the like.
  • the processor 210 executes the program stored in the memory 200 and invokes other devices, which can be used to implement the steps of the speech recognition method provided by the embodiments of the present application.
  • Another embodiment of the present application also provides a storage medium, on which a computer program is stored.
  • when the computer program is run by a processor, each step of the speech recognition method provided in any of the above embodiments is implemented.
  • for the specific working content of each part of the above speech recognition device, and the specific processing content of the above computer program on the storage medium when it is run by the processor, reference may be made to the embodiments of the above speech recognition method, which will not be repeated here.
  • each embodiment in this specification is described in a progressive manner, each embodiment focuses on the difference from other embodiments, and the same and similar parts of each embodiment can be referred to each other.
  • the description is relatively simple, and for related parts, please refer to part of the description of the method embodiments.
  • modules and submodules in the devices and terminals in the various embodiments of the present application can be combined, divided and deleted according to actual needs.
  • the disclosed terminal, device and method may be implemented in other ways.
  • the terminal embodiments described above are only illustrative.
  • the division of modules or sub-modules is only a logical function division, and there may be other division manners in actual implementation; for example, multiple sub-modules or modules may be combined or integrated into another module, or some features may be ignored or not implemented.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or modules may be in electrical, mechanical or other forms.
  • modules or sub-modules described as separate components may or may not be physically separated, and components presented as modules or sub-modules may or may not be physical modules or sub-modules, that is, they may be located in one place or distributed over multiple network modules or sub-modules; some or all of the modules or sub-modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional module or sub-module in the embodiments of the present application may be integrated into one processing module, or each module or sub-module may exist physically separately, or two or more modules or sub-modules may be integrated into one module.
  • the above-mentioned integrated modules or sub-modules can be implemented in the form of hardware or in the form of software function modules or sub-modules.
  • the steps of the methods or algorithms described in conjunction with the embodiments disclosed herein may be directly implemented by hardware, software units executed by a processor, or a combination of both.
  • the software unit may reside in a random access memory (RAM), an internal memory, a read-only memory (ROM), an electrically programmable ROM, an electrically erasable programmable ROM, a register, a hard disk, a removable disk, a CD-ROM, or any other known form of storage medium.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Telephonic Communication Services (AREA)
  • Machine Translation (AREA)

Abstract

本申请提出一种语音识别方法、装置、设备及存储介质,该方法包括:获取待识别语音的声学状态序列;基于所述待识别语音所属场景下的垂类关键字集合及句式解码网络,构建语音识别解码网络,其中,所述句式解码网络通过对所述待识别语音所属场景下的文本语料进行句式归纳和语法槽定义处理构建得到;利用所述语音识别解码网络对所述声学状态序列进行解码,得到语音识别结果。通过构建上述的语音识别解码网络,并用于语音识别,能够准确识别待识别语音,尤其是能够准确识别涉及垂类关键字的特定场景下的语音,特别是能准确识别语音中的垂类关键字。

Description

语音识别方法、装置、设备及存储介质
本申请要求于2021年10月29日提交中国专利局、申请号为202111274880.8、发明名称为“语音识别方法、装置、设备及存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及语音识别技术领域,具体涉及一种语音识别方法、装置、设备及存储介质。
背景技术
随着移动互联网、人工智能等技术的快速发展,人机交互场景已经大量出现在人们的日常生活生产过程中,而语音识别作为人机交互的重要接口,其应用越来越广泛。
当前语音识别最有效的方案就是使用神经网络技术对海量数据进行学习,得到语音识别模型,该模型在通用场景中识别效果非常好。理论上在数据充足,尽可能覆盖任何词汇的情况下,可以达到非常好的识别效果。
但是,在涉及垂类关键字的语音识别场景下,比如拨打手机联系人电话、给手机联系人发信息、查询城市天气情况、导航定位等场景下,现有的语音识别效果非常差,通常无法准确识别用户语音,尤其是对于用户语音中的人名、地名等垂类关键字,往往无法识别成功。
发明内容
基于上述技术现状,本申请实施例提出一种语音识别方法、装置、设备及存储介质,能够准确识别待识别语音,尤其是能够准确识别涉及垂类关键字的特定场景下的语音,特别是能准确识别语音中的垂类关键字。
一种语音识别方法,其特征在于,包括:
获取待识别语音的声学状态序列;
基于所述待识别语音所属场景下的垂类关键字集合及句式解码网络,构建语音识别解码网络,其中,所述句式解码网络至少通过对所述待识别语音所属场景下的文本语料进行句式归纳处理构建得到;
利用所述语音识别解码网络对所述声学状态序列进行解码,得到语音识别结果。
一种语音识别方法,其特征在于,包括:
获取待识别语音的声学状态序列;
利用语音识别解码网络对所述声学状态序列进行解码,得到第一语音识别结果,以及,利用通用语音识别模型对所述声学状态序列进行解码,得到第二语音识别结果;所述语音识别解码网络基于所述待识别语音所属场景下的垂类关键字集合及句式解码网络构建得到;
对所述第一语音识别结果进行声学得分激励;
至少从激励后的第一语音识别结果以及所述第二语音识别结果中,确定出最终的语音识别结果。
一种语音识别装置,其特征在于,包括:
声学识别单元,用于获取待识别语音的声学状态序列;
网络构建单元,用于基于所述待识别语音所属场景下的垂类关键字集合及句式解码网络,构建语音识别解码网络,其中,所述句式解码网络至少通过对所述待识别语音所属场景下的文本语料进行句式归纳处理构建得到;
解码处理单元,用于利用所述语音识别解码网络对所述声学状态序列进行解码,得到语音识别结果。
一种语音识别装置,包括:
声学识别单元,用于获取待识别语音的声学状态序列;
多维解码单元,用于利用语音识别解码网络对所述声学状态序列进行解码,得到第一语音识别结果,以及,利用通用语音识别模型对所述声学状态序列进行解码,得到第二语音识别结果;所述语音识别解码网络基于所述待识别语音所属场景下的垂类关键字集合及句式解码网络构建得到;
声学激励单元,用于对所述第一语音识别结果进行声学得分激励;
决策处理单元,用于至少从激励后的第一语音识别结果以及所述第二语音识别结果中,确定出最终的语音识别结果。
一种语音识别设备,包括:
存储器和处理器;
所述存储器与所述处理器连接,用于存储程序;
所述处理器,用于通过运行所述存储器中存储的程序,实现上述的语音识别方法。
一种存储介质,所述存储介质上存储有计算机程序,所述计算机程序被处理器运行时,实现上述的语音识别方法。
本申请提出的语音识别方法,能够基于待识别语音所属场景下的垂类关键字集合以及预先构建的该场景下的句式解码网络,构建语音识别解码网络。则在该语音识别解码网络中,包含待识别语音所属场景下的各种语音句式信息,同时包含待识别语音所属场景下的各种垂类关键字,利用该语音识别解码网络能够解码待识别语音所属场景下的任意句式、任意垂类关键字构成的语音。因此,通过构建上述的语音识别解码网络,能够准确识别待识别语音,尤其是能够准确识别涉及垂类关键字的特定场景下的语音,特别是能准确识别语音中的垂类关键字。
附图说明
为了更清楚地说明本申请实施例或现有技术中的技术方案,下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据提供的附图获得其他的附图。
图1是本申请实施例提供的一种语音识别方法的流程示意图;
图2是本申请实施例提供的一种词级句式解码网络的示意图;
图3是本申请实施例提供的另一种语音识别方法的流程示意图;
图4是本申请实施例提供的又一种语音识别方法的流程示意图;
图5是本申请实施例提供的又一种语音识别方法的流程示意图;
图6是本申请实施例提供的又一种语音识别方法的流程示意图;
图7是本申请实施例提供的一种文本句式网络的示意图;
图8是本申请实施例提供的一种发音级句式解码网络的示意图;
图9是本申请实施例提供的词级人名网络的示意图;
图10是本申请实施例提供的与图9所对应的发音级人名网络的示意图;
图11是本申请实施例提供的利用第二语音识别结果对第一语音识别结果 进行修正的处理流程图;
图12是本申请实施例提供的从第一语音识别结果和第二语音识别结果中确定出最终的语音识别结果的处理流程图;
图13是本申请实施例提供的语音识别结果的状态网络的示意图;
图14是对图13所示的语音识别结果进行路径扩展后的状态网络的示意图;
图15是本申请实施例提供的一种语音识别装置的结构示意图;
图16是本申请实施例提供的另一种语音识别装置的结构示意图;
图17是本申请实施例提供的一种语音识别设备的结构示意图。
具体实施方式
本申请实施例技术方案适用于语音识别应用场景,采用本申请实施例技术方案,能够更加准确地识别语音内容,尤其是在涉及垂类关键字的特定业务场景下,能够更加准确地识别语音内容,尤其是能够准确识别语音中的垂类关键字,整体提升语音识别效果。
上述的垂类关键字,泛指属于同一类型的不同关键字,比如人名、地名、应用名称等分别构成不同的垂类关键字,具体例如,用户通讯录中的各个不同的人名组成人名垂类关键字,用户所处地域的各个不同的地名则组成地名垂类关键字,用户终端上所安装的各个应用的名称则组成应用名称垂类关键字。
上述的涉及垂类关键字的业务场景,是指在相应的交互语音中包含垂类关键字的业务场景,比如语音拨打电话、语音导航等业务场景,由于用户必须要说出打电话的人名或者导航的地名,比如,用户可能会说“给XX打电话”、“导航去YY”,其中的“XX”可能为用户手机通讯录中的任意一个人名,“YY”可能是用户所在地域的某个地名。可见,这些业务场景下的语音中包含垂类关键字(如人名、地名),所以这些业务场景为涉及垂类关键字的业务场景。
随着人工智能及智能终端的全面普及,人机交互场景越来越普遍,而语音识别则是人机交互的重要接口。比如,在智能终端上,很多厂商都在终端操作系统中内置了语音助手,使得用户可以通过语音操控终端,比如,用户可以通过语音给通讯里联系人打电话、发短信,或者通过语音查询城市天气,或者通过语音开启或关闭终端应用程序等。这些交互场景相对于普通的语音识别业务场景,属于特定业务场景,在这些场景下的语音多数是涉及垂类关键字(比如 通讯录人名、地名、终端应用程序名称)的语音。
垂类关键字相对于普通的文本关键字,具有变化频繁、不可预见、用户可自定义的特点,并且垂类关键字在海量的语音识别训练语料中的占比极低,使得常规的通过海量语料训练语音识别模型的语音识别方案,往往无法胜任涉及垂类关键字的语音识别业务。
比如，人名相对于常规的文本语料来说，其出现率是很低的，因此即便在海量训练语料中，人名也是很罕见的，这就导致模型无法通过海量语料充分学习到人名特征。而且，人名属于用户自定义的文本内容，其具有不可穷举、不可预见的特点，完全通过人工生成所有的人名，是不现实的。再者，用户对通讯录里的联系人名称的存储，可能并不是规范的人名，可能是昵称、代号、外号等，甚至用户可能随时修改、增删通讯录联系人，这就使得不同用户的通讯录人名具有高度多样性，根本无法通过统一的方式使得语音识别模型学习到所有的人名特征。
因此,常规的通过海量语料训练语音识别模型,利用该语音识别模型实现语音识别功能的技术方案,并不能完全胜任涉及垂类关键字的业务场景下的语音识别任务,尤其是对于语音中的垂类关键字,往往无法成功识别,严重影响用户体验。
鉴于上述技术现状,本申请实施例提出一种语音识别方法,该方法能够提高语音识别效果,尤其是能够提高涉及垂类关键字的业务场景下的语音识别效果。
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
本申请实施例提出一种语音识别方法,参见图1所示,该方法包括:
S101、获取待识别语音的声学状态序列。
具体的,基于上文介绍的本申请实施例技术方案所适用的应用场景介绍,上述的待识别语音,具体是涉及垂类关键字的业务场景下的语音数据,在该待识别语音中,包含垂类关键字的语音内容。
对上述的待识别语音进行端点检测、加窗分帧、特征提取等处理,获取其 音频特征,该音频特征可以是梅尔频率倒谱系数MFCC特征,或者是其他任意类型的音频特征。
在获取到待识别语音的音频特征后,将该音频特征输入声学模型,进行声学识别,得到每帧音频的声学状态后验得分,即得到声学状态序列。该声学模型主要为神经网络结构,其通过前向计算识别每帧音频对应的声学状态及其后验得分。上述的音频帧对应的声学状态,具体是音频帧对应的发音单元,例如音频帧对应的音素或音素序列。
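为便于理解上述从音频特征到声学状态序列的处理流程，下面给出一段最简化的示意性代码（仅为理解用的草图：其中librosa的MFCC提取为常见做法，“声学模型”用随机参数的占位计算代替，并非本申请实际采用的模型结构）：

```python
import numpy as np
import librosa  # 假设使用librosa提取MFCC特征，仅作示意

def extract_mfcc(wav_path, n_mfcc=13):
    # 读取音频并提取逐帧MFCC特征，返回形状为(帧数, n_mfcc)的矩阵
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T

def acoustic_posteriors(features, num_states=100):
    # 占位的“声学模型”：对每帧特征输出各声学状态（如音素）的后验得分
    # 实际系统中应替换为训练好的神经网络的前向计算
    rng = np.random.default_rng(0)
    w = rng.normal(size=(features.shape[1], num_states))
    logits = features @ w
    post = np.exp(logits - logits.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)
    return post  # 形状(帧数, 状态数)，即逐帧声学状态后验得分

def acoustic_state_sequence(post):
    # 取每帧后验得分最高的状态及其得分，构成声学状态序列
    states = post.argmax(axis=1)
    scores = post.max(axis=1)
    return list(zip(states.tolist(), scores.tolist()))
```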
常规的语音识别技术方案是声学模型+语言模型的架构,即,先通过声学模型对待识别语音进行声学识别,实现语音特征向音素序列的映射;然后,通过语言模型对音素序列进行识别,实现音素向文本的映射。
按照常规的语音识别方案,上述声学识别得到的待识别语音的声学状态序列,将被输入语言模型进行解码,从而确定与待识别语音对应的文本内容。该语言模型是基于海量的训练语料训练得到的能够实现音素向文本的映射的模型。
与常规的语音识别方案不同,本申请实施例不利用上述的基于海量语料训练的语音模型进行声学状态序列解码,而是利用实时构建的解码网络进行解码,具体可见下文内容。
S102、基于所述待识别语音所属场景下的垂类关键字集合及句式解码网络,构建语音识别解码网络。
具体的,与常规的语言模型不同,本申请实施例在对垂类关键字业务场景下的语音进行识别时,实时地构建语音识别解码网络,用于对待识别语音的声学状态序列进行解码,得到语音识别结果。
上述的语音识别解码网络,由待识别语音所属场景下的垂类关键字集合,以及预先构建的该待识别语音所属场景下的句式解码网络构建得到。
其中,待识别语音所属场景下的句式解码网络至少通过对待识别语音所属场景下的文本语料进行句式归纳处理构建得到;
待识别语音所属场景,具体是指待识别语音所属的业务场景。比如,假设待识别语音为“I want to give XX a call”,则该待识别语音是属于打电话业务的语音,因此该待识别语音所属场景为打电话场景;又如,假设待识别语音为“导航去XX”,则该待识别语音是属于导航业务的语音,因此该待识别语音所属 场景为导航场景。
本申请发明人经过研究发现,在涉及垂类关键字的业务场景下,用户语音的句式有相当大的部分是固定的句式,比如在打电话或发短信场景下,用户常用句式通常为“I want to give XX a call”或者“send a message to XX for me”;在语音导航场景下,用户常用句式通常是“去XX(地名)”或者“导航到XX(地名)”。
所以,在某种特定的涉及垂类关键字的业务场景下,用户语音的句式是有规律的,或者说是可穷举的,通过对这些句式进行归纳,可以得到与该场景对应的句式网络,本申请实施例将其命名为句式解码网络。可以理解,基于上述方式构建的句式解码网络能够包含对应该场景的句式信息,当对某种场景下的所有句式的文本语料进行归纳得到句式解码网络时,该句式解码网络可以包含该场景下的任意句式。
作为优选的实施方式,本申请实施例通过对待识别语音所属场景下的文本语料进行句式归纳和语法槽定义处理构建得到。
上述的对句式中的语法槽进行定义,具体是确定句式中的文本槽在语法上的类型。在本申请实施例中,将文本句中的文本槽划分为普通语法槽和替换语法槽,其中,文本句中的非垂类关键字所在的文本槽被定义为普通语法槽,而文本句中的垂类关键字所在的文本槽被定义为替换语法槽。
作为一种简单示例,比如在打电话或发短信场景下,对“I want to give XX a call”或者“send a message to XX for me”这些文本语料进行句式归纳和语法槽定义,可以得到如图2所示的句式解码网络。该句式解码网络由节点和连接节点的有向弧组成,其中,有向弧对应普通语法槽和替换语法槽,有向弧上带有标签信息,用于记载槽内文本内容。具体的,将普通语法槽词条进行分词,并通过节点和有向弧串联,两个节点间的有向弧上标注有单词信息,冒号左右分别代表输入和输出信息,此处设定输入输出信息相同,单个词条分词后的多个单词进行串联,同一语法槽的不同词条进行并联,替换语法槽使用占位符“#placeholder#”进行占位,不执行扩展,节点按顺序进行编号,其中具有相同起始节点标识的单词共用一个起始节点,具有相同结束节点标识的单词共用一个结束节点。图2示例了一个比较简单的通讯录词级句式解码网络示意图,替换语法槽之前的普通语法槽包含了“I want to give”、“send a message to”和 “give a call”3个词条,替换语法槽之后的普通语法槽包含了“for me”、“a call”及“a call with her number”3个词条。节点10与节点18之间的连接说明可以直接从节点10走到结束节点,弧上的“</s>”代表静音。
上述的句式解码网络的具体构建过程,可参见后文实施例的详细介绍。
上述的待识别语音所属业务场景下的垂类关键字集合,是指由待识别语音所属业务场景下的所有垂类关键字组成的集合。比如,假设待识别语音为语音打电话或语音发短信场景下的语音,则待识别语音所属业务场景下的垂类关键字集合,具体可以是由用户通讯录人名构成的人名集合;假设待识别语音为语音导航场景下的语音,则待识别语音所属业务场景下的垂类关键字集合,具体可以是用户所处地域内的各个地名构成的地名集合。
将待识别语音所属业务场景下的垂类关键字集合中的垂类关键字,添加至句式解码网络的替换语法槽中,即可得到语音识别解码网络。可见,该解码网络中,不仅包含了待识别语音所属业务场景下的所有语音句式,而且包含了该场景下的所有垂类关键词,则,该语音识别解码网络能够识别待识别语音所属业务场景下的语音句式,并且能够识别语音中的垂类关键字,也就是能够识别该业务场景下的语音。
上述的语音识别解码网络的具体构建过程,将在后续实施例中详细介绍。
需要说明的是,作为一种优选的实施方式,本申请实施例在构建语音识别解码网络时,具体是在服务器构建,即,将待识别语音所属业务场景下的垂类关键字集合传入云端服务器,使云端服务器基于该待识别语音所属业务场景下的垂类关键字集合及预先构建的句式解码网络,构建语音识别解码网络。
例如,当用户对手机终端说出打电话语音指令“I want to give XX a call”时,手机终端将本地通讯录(即人名垂类关键字集合)传入云端服务器,云端服务器根据通讯录人名以及打电话场景下的句式解码网络,构建语音识别解码网络。则,在该语音识别解码网络中,包含打电话的各种句式,同时包含本次打电话的通讯录人名,利用该解码网络,能够识别用户向当前通讯录中的任意成员打电话的语音。
在常规技术方案中,语音识别解码网络是在用户终端本地构建的,并且不是实时构建的,而是预先构建后反复调用的。由于终端设备计算资源相对较低,导致网络构建速度慢,网络解码速度受限。而且,非实时构建的解码网络在垂 类关键字集合更新时无法及时更新,影响语音识别效果。
而本申请实施例则是在云端服务器构建语音识别解码网络,并且是在语音识别过程中,通过执行步骤S102实时传入垂类关键字集合,并且构建语音识别解码网络,因此能够保证构建的语音识别解码网络中包含最新的垂类关键字集合,也就是包含本次识别所需的垂类关键字集合,从而能够准确识别垂类关键字。同时,基于云端服务器的强大运算能力,该语音识别解码网络将具备更强的解码性能。
而且,基于云端服务器的集中处理能力,不需要为每个终端分别构建语音识别解码网络,只需要云端服务器构建即可。对于与云端服务器连接的任意终端,只要终端将待识别语音信息,以及待识别语音所属业务场景的垂类关键字集合(比如终端本地存储的通讯录)传入云端服务器,云端服务器即可针对本次待识别语音构建合适的语音识别解码网络,并对本次待识别语音进行解码。
S103、利用所述语音识别解码网络对所述声学状态序列进行解码,得到语音识别结果。
通过上文介绍可见,上述构建的语音识别解码网络中,包含待识别语音所属业务场景下的句式,以及包含待识别语音所属业务场景下的垂类关键字集合。则,利用该语音识别解码网络可以识别待识别语音的语音内容。
例如,假设用户终端采集到用户所说的给终端通讯录中的某人打电话的语音信号,则为了识别该语音,按照上述步骤S102的介绍,利用终端本地的通讯录(该通讯录即作为垂类关键字集合)以及预先构建的与打电话业务场景对应的句式解码网络,构建包含本地通讯录人名的语音识别解码网络,具体可参见图2所示的语音识别解码网络架构。在该语音识别解码网络中,不仅包含了所有的打电话语音句式,而且包含了本地通讯录中的全部人名,则理论上,用户通过语音控制终端向终端通讯录中的任意人员打电话时,用户所说的语句均在该语音识别解码网络中有所包含。
示例性的,在上述的语音识别解码网络中,包含多个由不同垂类关键字组成的、相同或不同的句式路径。当某一声学状态序列与该语音识别解码网络中的某一个或某几个句式路径的发音匹配时,即可确定该声学状态序列的文本内容为该句式路径的文本内容。因此,最终解码得到的语音识别结果,可能是该语音识别解码网络中的某一个或某几个路径的文本,即,最终得到的语音识别 结果可能是一个,也可能是多个。
比如,假设终端采集到的用户语音为“I want to give John a call”,则当对该语音进行声学识别得到其声学状态序列后,利用终端通讯录构建语音识别解码网络,则该语音识别解码网络中包含“I want to give XX a call”这种句式,而且包含“John”这一人名,同时,在该语音识别解码网络中还包括其他句式和其他人名。基于该语音识别解码网络,将该语音的声学状态序列与该语音识别解码网络中的各个路径进行发音匹配,可以确定该声学状态序列与“I want to give John a call”这一路径的发音相匹配,则可以得到语音识别结果“I want to give John a call”,即实现对用户语音的识别。
通过上述介绍可见,本申请实施例提出的语音识别方法,能够基于待识别语音所属业务场景下的垂类关键字集合以及预先构建的该业务场景下的句式解码网络,构建语音识别解码网络。则在该语音识别解码网络中,包含待识别语音所属业务场景下的各种语音句式,同时包含待识别语音所属业务场景下的各种垂类关键字,利用该语音识别解码网络能够解码待识别语音所属业务场景下的任意句式、任意垂类关键字构成的语音。因此,通过构建上述的语音识别解码网络,能够准确识别待识别语音,尤其是能够准确识别涉及垂类关键字的特定场景下的语音,特别是能准确识别语音中的垂类关键字。
作为一种优选的实施方式,在利用上述的语音识别解码网络对待识别语音的声学状态序列进行解码的同时,还利用通用语音识别模型对该待识别语音的声学状态序列进行解码。
为了便于区分,将利用上述的语音识别解码网络对待识别语音的声学状态序列进行解码得到的结果命名为第一语音识别结果,将利用上述的通用语音识别模型对待识别语音的声学状态序列进行解码得到的结果命名为第二语音识别结果。
参见图3所示,当执行步骤S301、获取待识别语音的声学状态序列后,分别执行步骤S302和步骤S303,构建得到语音识别解码网络,并利用该语音识别解码网络对该声学状态序列进行解码,得到第一语音识别结果;以及,执行步骤S304、利用通用语音识别模型对所述声学状态序列进行解码,得到第二语音 识别结果。
上述的第一语音识别结果和第二语音识别结果分别可以为一个或多个。为了保证语音识别效果,各模型输出的语音识别结果最多保留5个参与最终语音识别结果的确定。
其中,上述的通用语音识别模型,即为常规的通过海量语料训练得到的语音识别模型,其通过学习语音的特征而识别语音对应的文本内容,而并非像上述的语音识别解码网络那样具有规范的句式。因此,该通用语音识别模型所能识别的句式更加灵活。利用该通用语音识别模型对待识别语音的声学状态序列进行解码,能够更加灵活地识别待识别语音内容,而不受待识别语音句式限制。
当待识别语音不是上述的语音识别解码网络中的某种句式时,其不能被该语音识别解码网络正确解码,或者得到的第一语音识别结果不准确,但是由于该通用语音识别模型的应用,依然能够对该待识别语音进行识别解码,得到第二语音识别结果。
当得到第一语音识别结果和第二语音识别结果后,执行步骤S305、至少从所述第一语音识别结果和所述第二语音识别结果中,确定出最终的语音识别结果。
作为一种示例性的实施方式,当得到第一语音识别结果和第二语音识别结果后,根据第一语音识别结果和第二语音识别结果的声学得分,通过声学得分PK,从第一语音识别结果和第二语音识别结果中选出得到最高的一个或多个,作为最终的语音识别结果。
其中,上述的第一语音识别结果和第二语音识别结果的声学得分,是指在对待识别语音的声学状态序列进行解码时,根据各个声学状态序列元素的解码得分而确定的整个解码结果的得分。例如,对各个声学状态序列元素的解码得分进行求和,即可作为整个解码结果的得分。声学状态序列元素的解码得分,是指声学状态序列元素(例如音素或音素单元)被解码为某一文本的概率得分,因此整个解码结果的得分,即为整个声学状态序列被解码为某一文本的概率得分。语音识别结果的声学得分,体现了语音被识别为该语音识别结果的得分,该得分能够用于表征语音识别结果的准确度。
因此,根据一个或多个第一语音识别结果的声学得分,以及一个或多个第二语音识别结果的声学得分,能够体现出各识别结果的准确度,通过声学得分 PK,也就是通过声学得分对比,从这些识别结果中选出得分最高的一个或多个语音识别结果,即可作为最终的语音识别结果。
图3所示的方法实施例中的步骤S301~S303分别对应图1所示的方法实施例中的步骤S101~S103,其具体内容请参见对应图1的方法实施例介绍。
更进一步的,图4示出了本申请实施例提出的另一种语音识别方法的流程示意图。
与图3所述的语音识别方法不同的是,参见图4所示,本申请实施例提出的语音识别方法在利用构建的语音识别解码网络和通用语音识别模型对待识别语音的声学状态序列进行解码得到第一语音识别结果和第二语音识别结果之外,还执行步骤S405、通过预先训练的场景定制模型,对所述声学状态序列进行解码得到第三语音识别结果。
在分别得到第一语音识别结果、第二语音识别结果和第三语音识别结果后,执行步骤S406、从所述第一语音识别结果、所述第二语音识别结果和所述第三语音识别结果中,确定出最终的语音识别结果。
上述的场景定制模型,是指通过对待识别语音所属场景下的语音进行语音识别训练得到的语音识别模型。该场景定制模型与上述的通用语音识别模型为相同的模型架构,与通用语音识别模型所不同的是,该场景定制模型并非利用海量的通用语料训练得到,而是利用待识别语音所属场景下的语料训练得到,因此,该场景定制模型相对于通用语音识别模型来说,其对待识别语音所属业务场景下的语音具有更高的敏感度和更高的识别率。该场景定制模型能够相对于通用语音识别模型更加准确地识别特定业务场景下的语音,而又不必像上述的语音识别解码网络那样局限于预定的句式。
因此,在上述的语音识别解码网络以及上述的通用语音识别模型的基础上,再加入场景定制模型,使三种模型分别对待识别语音的声学状态序列进行解码,能够通过多种方式更加全面、深入地对待识别语音进行语音识别。
对于三种模型分别输出的语音识别结果,可以示例性地参照上述步骤S305的介绍,通过对第一语音识别结果、第二语音识别结果和第三语音识别结果的声学得分进行对比,从中选出声学得分最高或较高的一个或多个语音识别结果,作为最终的语音识别结果。
图4所示的方法实施例中的步骤S401~S404分别对应图3所示的方法实施例中的步骤S301~S304,其具体内容请参见图3对应的方法实施例的内容,此处不再赘述。
上述的基于多模型解码的语音识别方法的主要思想是通过多模型进行解码,然后再通过声学得分PK从多个识别结果中选出最终的识别结果。在实际应用中发现,当上述的语音识别解码网络输出的第一语音识别结果和通用语音识别模型输出的第二语音识别结果,或者场景定制模型输出的第三语音识别结果的声学得分比较相近时,该第一语音识别结果经常被第二语音识别结果或第三语音识别结果PK掉。
而事实上,三种模型输出的语音识别结果的得分相近的情况下,各语音识别结果的句式基本一致,只是在垂类关键字位置会有区别。而第一语音识别结果中包含更准确的垂类关键字信息,如果第一语音识别结果被PK掉,可能造成对垂类关键字识别不准确的后果。因此,当各个模型输出的识别结果的得分相近时,应当使包含准确的垂类关键字的识别结果胜出。
但是按照上述各实施例的介绍,并不能使第一语音识别结果胜出。
针对上述情况,在进行语音识别结果PK时,可以先对第一语音识别结果中的垂类关键字所在槽位的得分进行激励,使其提高一定比例,也就是对第一语音识别结果进行声学得分激励,这样使得当不同模型输出的句式相同时,第一语音识别结果能够胜出。
上述介绍只是概括地提出了声学得分激励的思想和必要,对于具体的声学得分激励处理,可参见后文实施例介绍。
在上述的声学得分激励的思想下,本申请实施例提出另一种语音识别方法,参见图5所示,该方法包括:
S501、获取待识别语音的声学状态序列。
S502、利用语音识别解码网络对所述声学状态序列进行解码,得到第一语音识别结果。以及,S503、利用通用语音识别模型对所述声学状态序列进行解码,得到第二语音识别结果;所述语音识别解码网络基于所述待识别语音所属场景下的垂类关键字集合及句式解码网络构建得到。
S504、对所述第一语音识别结果进行声学得分激励。
S505、至少从激励后的第一语音识别结果以及所述第二语音识别结果中,确定出最终的语音识别结果。
具体的,上述的步骤S501、S502、S503、S505的具体处理内容,均可以参见图1-图4对应的语音识别方法实施例中的相应内容,此处不再重复。
其中,上述的语音识别解码网络,基于待识别语音所属场景下的垂类关键字集合,以及预先通过对待识别语音所属场景下的文本语料进行句式归纳和语法槽定义处理得到的句式解码网络,构建得到。该语音识别解码网络的具体内容,可以参见上述实施例介绍,该网络的构建过程,可以参见下文实施例的具体介绍。
与上述实施例介绍的语音识别方法所不同的是,本申请实施例提出的语音识别方法,在进行第一语音识别结果和第二语音识别结果的声学得分PK,从而确定最终的语音识别结果之前,本申请实施例先对第一语音识别结果进行声学得分激励。
具体的,对第一语音识别结果进行声学得分激励,具体是对第一语音识别结果中的垂类关键字所在槽的声学得分进行激励,即对第一语音识别结果中的垂类关键字所在槽的声学得分按照激励系数进行比例缩放,该激励系数的具体取值则由业务场景以及语音识别结果的实际情况而确定,具体的声学得分激励内容,可参见后文专门介绍声学得分激励的实施例的介绍。
通过上述的声学得分激励处理,当第一语音识别结果与第二语音识别结果的得分相近时,也就是句式相同时,可以使第一语音识别结果的声学得分高于第二语音识别结果,从而使第一语音识别结果在声学得分PK中胜出,由此能够使得当第一语音识别结果与第二语音识别结果的得分相近时,保证最终得到的语音识别结果中的垂类关键字是相对更加准确的识别结果。
可见,本申请实施例提出的语音识别方法,能够基于待识别语音所属业务场景下的垂类关键字集合以及预先构建的该业务场景下的句式解码网络,构建语音识别解码网络。利用该语音识别解码网络能够解码待识别语音所属业务场景下的任意句式、任意垂类关键字构成的语音。因此,基于上述的语音识别解码网络,能够准确识别待识别语音,尤其是能够准确识别涉及垂类关键字的特定场景下的语音,特别是能准确识别语音中的垂类关键字。
同时,本申请实施例提出的语音识别方法,在利用上述语音识别解码网络进行解码识别的同时,还利用通用语音识别模型进行解码识别。通用语音识别模型比上述语音识别解码网络具有更高的句式灵活性,使用多模型分别对待识别语音的声学状态序列进行解码,能够通过多种方式更加全面、深入地对待识别语音进行语音识别。
另外,在多模型解码识别的情况下,本申请实施例对语音识别解码网络输出的语音识别结果进行声学得分激励。由于上述的语音识别解码网络相对于通用语音识别模型,对垂类关键字的识别准确度更高,因此,基于上述的声学得分激励处理,使得当语音识别解码网络输出的语音识别结果与通用语音识别模型输出的语音识别结果的得分相近时,使语音识别解码网络输出的语音识别结果能够胜出,从而保证最终的语音识别结果中的垂类关键字识别正确。
作为一种优选的实施方式,本申请实施例还提出另一种语音识别方法,该方法相对于图5所示的语音识别方法,增加场景定制模型,用于对待识别语音的声学状态序列进行解码。
参见图6所示,在执行步骤S602、利用语音识别解码网络对所述声学状态序列进行解码,得到第一语音识别结果;以及执行步骤S603、利用通用语音识别模型对所述声学状态序列进行解码,得到第二语音识别结果之外,还执行步骤S604、通过预先训练的场景定制模型,对所述声学状态序列进行解码得到第三语音识别结果。
其中,上述的场景定制模型通过对待识别语音所属场景下的语音进行语音识别训练得到。
具体的,上述的场景定制模型的功能,以及该场景定制模型的加入所带来的有益效果,均可以参见上述的对应图4所示的语音识别方法的实施例内容,此处不再重复。
最后,通过执行步骤S606、从激励后的第一语音识别结果、所述第二语音识别结果和所述第三语音识别结果中,确定出最终的语音识别结果。
示例性的,可以按照上述实施例介绍的,通过对第一语音识别结果、第二语音识别结果和第三语音识别结果进行声学得分PK,可以从中选出声学得分较高或最高的多个或一个语音识别结果,作为最终的语音识别结果。具体的处 理过程可参见上述实施例中的相应内容的介绍。
图6中的步骤S601~S603、S605的具体内容,请参见上述实施例中的相应步骤的具体处理内容,此处不再重复。
上述的各语音识别方法,在最终对多个语音识别结果进行PK、决策时,完全是根据语音识别结果的声学得分进行的,这其中完全忽略了语言模型对识别效果的影响,尤其是在上述的语音识别解码网络或场景定制模型中。这种简单直接的PK策略会极大地影响识别效果,严重时会引起误触发问题,影响用户体验。
对此,本申请实施例提出,在声学得分PK的基础上,对语音识别结果进行语言模型激励,使语音识别结果中融入语言模型信息,最终,通过语言得分PK选出最终的语音识别结果。
作为一种可选的语音识别结果决策选出方式,参见上述实施例介绍,通过语音识别解码网络、通用语音识别模型、场景定制模型分别得到第一语音识别结果、第二语音识别结果和第三语音识别结果,以及对第一语音识别结果进行声学得分激励后,从激励后的第一语音识别结果、所述第二语音识别结果和所述第三语音识别结果中,确定出最终的语音识别结果时,可以按照如下处理执行:
首先,根据声学得分激励后的第一语音识别结果的声学得分,以及所述第二语音识别结果的声学得分,从所述第一语音识别结果和所述第二语音识别结果中确定出候选语音识别结果。
具体的,本步骤处理,与上文所介绍的声学得分PK相同,通过将声学得分激励后的第一语音识别结果和第二语音识别结果进行声学得分PK,选出声学得分最高的一个或多个语音识别结果,作为候选语音识别结果。
然后,对所述候选语音识别结果以及所述第三语音识别结果分别进行语言模型激励。
具体的,上述的语言模型激励,具体是指将语音识别结果与待识别语音所属场景下的垂类关键字进行匹配,如果匹配成功,则对语音识别结果进行路径扩展,然后基于聚类的语言模型对扩展后的语音识别结果进行重新打分,完成语音识别结果在语言模型上的激励,具体的语言模型激励的处理过程,将在后 文实施例中特别介绍。
最后,根据语言模型激励后的所述候选语音识别结果的语言得分,以及语言模型激励后的所述第三语音识别结果的语言得分,从所述候选语音识别结果和所述第三语音识别结果中确定出最终的语音识别结果。
具体的,参照上述的声学得分PK的策略,对语言模型激励后的候选语音识别结果和第三语音识别结果,进行语言得分PK,从中选出语言得分最高的一个或多个语音识别结果,作为最终的语音识别结果。具体的语言得分PK处理过程,可参照上述实施例介绍的声学得分PK的处理过程,此处不再详述。
上述的声学得分激励、语言模型激励、候选语音识别结果的选出等步骤的执行顺序,可以在不影响整体功能实现的情况下灵活调整。
例如,作为一种可选的语音识别结果决策选出方式,参见上述实施例介绍,通过语音识别解码网络、通用语音识别模型、场景定制模型分别得到第一语音识别结果、第二语音识别结果和第三语音识别结果后,分别对所述第一语音识别结果、所述第二语音识别结果和所述第三语音识别结果进行语言模型激励;然后,根据语言模型激励后的第一语音识别结果、第二语音识别结果和第三语音识别结果的语言得分,通过语言得分PK,从第一语音识别结果、第二语音识别结果和第三语音识别结果中确定出最终的语音识别结果。
下面,将以不同的实施例分别对上述的各实施例中介绍的各语音识别方法中的处理步骤进行详细介绍。应当理解的是,由于上述的各种不同的语音识别方法相互之间存在交叉或相同的处理步骤,因此,下文各实施例介绍的处理步骤的具体实施方式,分别适用于上述的各实施例介绍的语音识别方法的相应或相关处理步骤。
首先,本申请对于上述的各个语音识别方法实施例中的用于构建语音识别解码网络的句式解码网络的构建过程,进行介绍。以下介绍的句式解码网络的构建过程,只是示例性的较优选的实现方案,在实际应用本申请实施例技术方案时,也可以参照本实施例中所体现的句式解码网络的功能,采用其他方式构建得到。
上述各实施例中所记载的待识别语音所属业务场景下的句式解码网络,可 以通过执行如下步骤A1-A3构建得到:
A1、通过对所述待识别语音所属场景下的语料数据进行句式归纳和语法槽定义处理,构建文本句式网络。
待识别语音所属业务场景下的语料数据,是从实际业务场景中搜集的语音的标注数据,例如在用户语音打电话或语音发短信场景下,搜集用户打电话或发短信的指令语音,并对其进行标注,作为打电话或发短信场景下的语料数据。或者,也可以结合经验进行人工扩展,得到符合语法逻辑并且符合业务场景的语料数据。如,“I want to give John a call”和“send a message to Peter for me”分别是两个语料数据用例。因为后续直接根据语料进行句式归纳和语法槽定义,该阶段力求搜集的语料能具备较高的覆盖度,但对垂类关键字的覆盖度无要求。
如上文所述,在某种特定的涉及垂类关键字的业务场景下,用户语音的句式通常是有规律的,或者说是可穷举的,通过对这些句式进行归纳以及对句式中的语法槽进行分类和定义,可以得到与该业务场景对应的句式网络,本申请实施例将其命名为文本句式网络。
上述的对句式中的语法槽进行定义,具体是确定句式中的文本槽在语法上的类型。在本申请实施例中,将文本句中的文本槽划分为普通语法槽和替换语法槽,其中,对应非垂类关键字的文本槽被定义为普通语法槽,对应垂类关键字的文本槽被定义为替换语法槽。在普通语法槽中,存储文本句中的非垂类关键字内容,在替换语法槽中存储与垂类关键字对应的占位符。按照垂类关键字文本槽在文本中的位置的不同,普通语法槽的数量可以为一个或多个,每个垂类关键字文本槽分别对应一个替换语法槽。
按照上述方式构建的文本句式网络包含的元素有网络节点、连接节点的有向弧,该文本句式网络基于ABNF(增强型巴科斯范式)定义。具体可参见图7所示,在文本句式网络的有向弧上带有标签信息,该标签信息对应有向弧所对应的替换语法槽的占位符或者对应有向弧所对应的普通语法槽的文本。
图7是根据搜集到的打电话或发短信场景下的语料数据定义的文本句式网络。其中带有“<xxxx>”标签的有向弧称为普通语法槽,普通语法槽至少包含一个词条,而且在文本句式网络定义阶段就要完成全部词条的搜集。图7中包括两个普通语法槽,分别为<phone>和<sth>,其中<phone>对应场景语料 用例中通讯录人名前面的文本词条内容,如“I want to give”和“send a message”就是<phone>语法槽的两个词条;<sth>表示用例中通讯录人名后面的文本词条内容,如“a call”和“for me”就是<sth>语法槽的两个词条。带有“xxx”标签的有向弧称为替换语法槽,所述替换语法槽是指句式定义阶段不需要附带实际词条而是只需搭配一个“#placehollder#”占位符即可,实际的词条是在构建语音识别解码网络时动态传入的。图7中的“name”就是一个替换语法槽,后续动态创建的垂类关键字网络将被插入该替换语法槽中以构成完整的语音识别解码网络。最后一种带有“-”标签的有向弧称为虚弧,所述虚弧是指不具备语法槽和词条信息的有向弧,表示路径可选,虚弧必须有对应的语法槽与其对应,用来表示后续的网络构建过程中要同时创建可以跳过该语法槽的解码路径,如图7中节点2和节点4之间以及节点2与节点3之间的虚弧对应的语法槽是<sth>语法槽,表示识别结果可以不走该语法槽,图7的文本句式网络定义了两种句式,分别是<phone>+name和<phone>+name+<sth>,通过该网络得到的语音识别结果必然符合这两种句式之一,比如用例“give a call to John”就只包含普通语法槽<phone>=“give a call to”和替换语法槽name=“John”两个槽。由于真正使用时普通语法槽的词条数据较多,为了解释方便,后续实施例中只使用少量词条来描述相关概念。
进一步地,还可以将文本句式网络中的所有语法槽进行ID标号,定义为该语法槽的slot_id字段,设置全局唯一性标识。
最终,句式网络定义的句式、语法槽以及普通语法槽的词条共同构成了文本句式网络。
A2、对所述文本句式网络的普通语法槽中的词条进行分词并按照分词结果进行单词节点扩展,得到词级句式解码网络。
具体的,基于上述构建的文本句式网络中的语法槽信息,解析普通语法槽词条,对普通语法槽中的词条进行分词并按照分词结果进行单词节点扩展,构建词级句式解码网络。
所述词级句式解码网络包含若干个节点及节点之间的有向弧。在构建词级句式解码网络时,首先对普通语法槽中的每一个词条分别进行分词,得到每个词条对应的各个单词。
然后,利用对应同一词条的各个单词进行单词节点扩展,即,将同一词条 的分词结果通过节点和有向弧串联,得到与词条对应的单词串。两个节点间的有向弧上标注有分词得到的单词信息,其中,冒号左右分别代表输入和输出信息,此处设定输入输出信息相同。
最后,参照上述节点扩展方式,将单个词条分词后的多个单词进行串联,同一普通语法槽的不同词条的对应的单词串进行并联,替换语法槽使用占位符“#placeholder#”进行占位,不执行扩展,将网络节点按顺序进行编号,其中具有相同起始节点标识的单词共用一个起始节点,具有相同结束节点标识的单词共用一个结束节点,即可得到词级句式解码网络。
如图2示例的一个比较简单的通讯录词级句式解码网络示意图,语法槽<phone>包含了“I want to give”、“send a message to”和“give a call”3个词条,语法槽<sth>包含了“for me”、“a call”及“a call with her number”3个词条。节点10与节点18之间的连接说明可以直接从节点10走到结束节点,弧上的“</s>”代表静音。
A3、将所述词级句式解码网络的普通语法槽中的各个单词替换为对应的发音,并按照单词对应的发音进行发音节点扩展,得到发音级句式解码网络,所述发音级句式解码网络作为所述待识别语音所属场景下的句式解码网络。
具体的,首先,将所述词级句式解码网络的普通语法槽中的各个单词,分别替换为对应的发音。
示例性的,可以通过发音词典查询已有的单词与发音之间的对应关系,从而确定词级句式解码网络的普通语法槽中的每一条有向弧上标注的单词对应的发音。在此基础上,利用单词对应的发音,替换有向弧上标注的单词。
然后,对所述词级句式解码网络中的每个发音,分别进行发音单元划分,并利用发音对应的各个发音单元进行发音节点扩展,得到发音级句式解码网络。
即,对于词级句式解码网络的每条有向弧上的发音,分别确定其发音单元,并进行发音单元划分,本申请示例性地将发音划分为音素序列。比如单词“I”的发音为单音素“ay”,单词“give”的发音为音素串“g ih v”。
在此基础上,按照发音单元的排列顺序和数量对发音进行发音节点扩展串联。示例性的,与上述的单词节点扩展相同的,将同一发音的各个音素按照顺序通过节点和有向弧串联,得到与发音对应的音素串。然后利用发音对应的音 素串替换发音,将词级句式解码网络扩展为发音级句式解码网络。在发音节点扩展过程中,替换语法槽仍不进行扩展。该发音级句式解码网络即作为待识别语音所属业务场景下的句式解码网络。
图8示出了一种简单的发音级句式解码网络的示意图。
另外,为了方便后续的在计算机中的网络结构化存储、子网络动态插入更新以及解码遍历,句式解码网络中的节点按顺序进行编号,且其中具有相同起始节点标识的发音单元共用一个起始节点,具有相同结束节点标识的发音单元共用一个结束节点。网络中的单个节点一共包括id编号、入弧数量和出弧数量3个属性字段,构成节点存储三元组,所述节点入弧是表示指向节点的有向弧,出弧是从节点发出的有向弧。网络中的单条有向弧一共包括左节点编号、右节点编号、弧上发音信息以及所属语法槽标识slot_id共4个属性字段,同时记录网络总的节点个数和弧的总数目。
进一步的,还可以记录语法槽在句式解码网络中的左右节点位置信息,如图8所示,普通语法槽<phone>在网络中的左节点位置为0,右节点位置为10,替换语法槽name的左节点位置为10,右节点为11,普通语法槽<sth>的左节点为11,右节点为16。
更进一步的,还可以将得到的发音级句式解码网络中符合条件的弧进行合并优化,并删除多余节点,以降低网络的复杂度,具体采用的方法与一般解码网络优化方法相同,不再详述。
完成以上步骤后得到的解码网络即是句式解码网络,可作为全局资源加载到云端语音识别服务中,但是由于替换语法槽还未记录真正通讯录信息,所以还不具备实际解码能力。
基于上述构建得到的句式解码网络,结合待识别语音所属业务场景下的垂类关键字集合,能够构建得到真正用于对待识别语音的声学状态序列进行解码的语音识别解码网络。
本申请实施例将进一步对语音识别解码网络的构建过程进行示例介绍。
作为一种实施方式,本申请实施例通过执行如下步骤B1-B3,构建得到语音识别解码网络:
B1、获取预先构建的所述待识别语音所属场景下的句式解码网络。
具体的,可以预先按照上文实施例介绍构建得到待识别语音所属业务场景下的句式解码网络,此时,直接调用该句式解码网络即可。或者,可以参照上述实施例介绍,在执行步骤B1时,实时构建句式解码网络。
B2、基于待识别语音所属场景下的垂类关键字集合中的垂类关键字,构建垂类关键字网络。
具体的,待识别语音所属业务场景下的垂类关键字集合,是指由待识别语音所属业务场景下的所有垂类关键字组成的集合。比如,假设待识别语音为语音打电话或语音发短信场景下的语音,则待识别语音所属业务场景下的垂类关键字集合,具体可以是由用户通讯录人名构成的人名集合;假设待识别语音为语音导航场景下的语音,则待识别语音所属业务场景下的垂类关键字集合,具体可以是用户所处地域内的各个地名构成的地名集合。
本申请实施例以通讯录作为垂类关键字集合，以构建通讯录人名网络为例，对构建垂类关键字网络的具体实现过程进行介绍。

构建垂类关键字网络,具体可参照上述的构建句式解码网络的过程,不同的是,在构建的垂类关键字网络中不再包含语法槽信息,默认所有的语法槽都属于替换语法槽。
首先,基于待识别语音所属业务场景下的垂类关键字集合中的各个垂类关键字,构建词级垂类关键字网络。
例如,对于通讯录中的每个人名分别进行分词,得到人名中包含的各个单词,通过节点和有向弧将各个单词进行串联,得到人名对应的单词串;不同的人名对应的单词串进行并联,即构建得到词级人名网络。
如图9为“Jack Alen”、“Tom”和“Peter”3个联系人词条构建的词级人名网络。
然后,将所述词级垂类关键字网络中的各个单词替换为对应的发音,并按照单词对应的发音进行发音节点扩展,得到发音级垂类关键字网络。
例如,对于词级人名网络中的各个单词,分别确定其发音,以及确定发音所包含的音素,将各个音素通过节点和有向弧连接,构成音素串,然后利用发音对应的音素串替换发音,即得到发音级人名网络。
上述的词级垂类关键字网络的构建,以及发音替换的处理,可参见上述实施例介绍的构建句式解码网络时的相应处理内容。
通过上述介绍,得到发音级垂类关键字网络,该网络即为最终构建得到的垂类关键字网络。在该网络中,入弧数为0的节点为网络开始节点,出弧数为0的节点为网络结束节点。
如图10所示为对图9所示的词级人名网络进行发音替换后得到的发音级人名网络。在该网络中,0号节点为网络的开始节点,8号节点为网络的结束节点。
B3、将所述垂类关键字网络插入所述句式解码网络,得到语音识别解码网络。
通过上述介绍可见,通过构建词级网络、对词级网络进行发音扩展等处理,最终构建得到的句式解码网络和垂类关键字网络,都是由节点和连接节点的有向弧构成的,在节点间的有向弧上存储发音信息或占位符。具体是,在对应句式解码网络的普通语法槽的有向弧上存储槽内文本的发音信息,在对应句式解码网络的替换语法槽的有向弧上存储占位符,在对应垂类关键字网络的各个语法槽的有向弧上存储槽内文本的发音信息。
在此基础上,通过有向弧将垂类关键字网络与句式解码网络的替换语法槽的左右节点分别连接,即,利用垂类关键字网络替换句式解码网络中的替换语法槽所在的有向弧,即可构建得到语音识别解码网络。
在将垂类关键字网络与句式解码网络的替换语法槽的左右节点分别连接时,为了保证连接后的每对相邻节点间的有向弧上都存储有效的发音信息,本申请实施例通过将垂类关键字网络的开始节点的每条出弧的右节点与替换语法槽的左节点通过有向弧连接,连接的每条有向弧上分别存储垂类关键字网络的开始节点的每条出弧上的发音信息,以及,将垂类关键字网络的结束节点的每条入弧的左节点与替换语法槽的右节点通过有向弧连接,连接的每条有向弧上分别存储垂类关键字网络的结束节点的每条入弧上的发音信息,从而构建得到语音识别解码网络。
作为一种更加优选的实施方式,为了能够提高垂类关键字网络插入句式解码网络的效率,本申请实施例在垂类关键字网络中的每个关键字的第一条弧和最后一条弧上分别存储与该关键字对应的唯一标识,该唯一标识可以示例性地被设置为该关键字的哈希码。例如图10所述的发音级人名网络中,节点(0,1)之间的有向弧,以及节点(7,8)之间的有向弧上分别存储与人名“Jack Alen” 对应的哈希码。
相应的,本申请实施例还设置已入网关键字信息集合,在该已入网关键字信息集合中,将已插入句式解码网络的关键字的唯一标识,以及该唯一标识所在的有向弧在该句式解码网络中的左右节点编号,对应存储。
示例性的,上述的已入网关键字信息集合,可以采用key:value结构的HashMap存储结构,key为上述的垂类关键字对应的哈希码,value为该哈希码所在有向弧的节点编号对集合,初始时HashMap为空,整个识别服务过程中,HashMap唯一性的保存了所有动态传入的垂类关键字词条的哈希码和节点编号对,但是并不记录用户ID以及用户ID与垂类关键字集合的映射关系。
上述的已入网关键字信息集合的设置,可以便于明确已插入句式解码网络的垂类关键字信息,也就是能够明确在语音识别解码网络中已经存在的垂类关键字信息。这样,当每次执行垂类关键字网络插入时,可以通过查询该已入网关键字信息集合,确定垂类关键字是否已插入,当确定要插入的垂类关键字已经存在于语音识别解码网络中,则可以取消对当前垂类关键字的插入,继续执行对其他垂类关键字的插入操作。
基于上述思想,在将垂类关键字网络插入句式解码网络时,遍历垂类关键字网络的开始节点的每条出弧,对于遍历到的每一条出弧,根据该出弧上的唯一标识以及已入网关键字信息集合,确定该唯一标识对应的关键字是否已插入句式解码网络。
如果该唯一标识对应的关键字未插入句式解码网络,则将遍历到的该出弧的右节点与所述替换语法槽的左节点通过有向弧连接,在该有向弧上存储遍历到的该出弧上的发音信息,并且更新该有向弧两端节点的入弧或出弧数量。
示例性的,在将人名网络插入句式解码网络时,遍历人名网络中的开始节点的每条出弧,对于遍历到的每一条出弧,获取该出弧上的人名哈希码,通过将该哈希码与已入网关键字信息集合中的所有哈希码进行比对,如果该哈希码与已入网关键字信息集合中的任一哈希码相匹配,则说明该哈希码对应的人名已经插入句式解码网络,此时跳过该出弧,执行对下一条出弧的哈希码判断。
如果遍历到的出弧的哈希码与已入网关键字信息集合中的所有哈希码均不匹配,则说明该哈希码对应的人名未插入句式解码网络,此时,将遍历到的该出弧的右节点与句式解码网络的替换语法槽的左节点通过有向弧连接,并且 在连接的有向弧上存储遍历到的该出弧上的发音信息,并且更新该有向弧两端节点的入弧或出弧数量。
上述过程实现了垂类关键字网络的开始节点的每条出弧的右节点与句式解码网络的替换语法槽的左节点的连接。
相应的,对于垂类关键字网络的结束节点的每条入弧,依次进行遍历,对于遍历到的每一条入弧,根据该入弧上的唯一标识以及已入网关键字信息集合,确定该唯一标识对应的关键字是否已插入句式解码网络;
如果该唯一标识对应的关键字未插入句式解码网络,则将遍历到的该入弧的左节点与所述替换语法槽的右节点通过有向弧连接,在该有向弧上存储遍历到的该入弧上的发音信息。
示例性的,在将人名网络插入句式解码网络时,遍历人名网络中的结束节点的每条入弧,对于遍历到的每一条入弧,获取该入弧上的人名哈希码,通过将该哈希码与已入网关键字信息集合中的所有哈希码进行比对,如果该哈希码与已入网关键字信息集合中的任一哈希码相匹配,则说明该哈希码对应的人名已经插入句式解码网络,此时跳过该入弧,执行对下一条入弧的哈希码判断。
如果遍历到的入弧的哈希码与已入网关键字信息集合中的所有哈希码均不匹配,则说明该哈希码对应的人名未插入句式解码网络,此时,将遍历到的该入弧的左节点与句式解码网络的替换语法槽的右节点通过有向弧连接,并且在连接的有向弧上存储遍历到的该入弧上的发音信息,并且更新该有向弧两端节点的入弧或出弧数量。
上述过程实现了垂类关键字网络的结束节点的每条入弧的左节点与句式解码网络的替换语法槽的右节点的连接。
通过上述处理,对于垂类关键字网络中的每一个关键字,都将其插入了句式解码网络。在具体实施时,上述介绍的对垂类关键字网络的开始节点和结束节点的插入的执行顺序,可以灵活安排,例如可以先对垂类关键字网络的开始节点执行插入操作,也可以先对垂类关键字网络的结束节点执行插入操作,或者同时执行。
进一步的,当垂类关键字网络中的关键字被插入句式解码网络时,也就是该关键字所在的网络路径的第一条弧的右节点与句式解码网络的替换语法槽的左节点连接,以及该关键字所在网络路径的最后一条弧的左节点与句式解码 网络的替换语法槽的右节点连接完成后,本申请实施例将该关键字的唯一标识,以及该唯一标识所在的有向弧在该句式解码网络中的左右节点编号,对应存储至已入网关键字信息集合中。
例如,对于图10所述的发音级人名网络,当人名“Jack Alen”所在的网络路径的节点1与图8所示的句式解码网络的替换语法槽的左节点10连接,以及“Jack Alen”所在的网络路径的节点7与图8所示的句式解码网络的替换语法槽的右节点11连接后,以人名“Jack Alen”的哈希码为key,以该哈希码所在的有向弧的左右节点编号为value,存储至已入网关键字信息集合中。
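将垂类关键字网络插入替换语法槽并登记已入网关键字信息的过程，可用如下简化代码示意（仅为草图：用MD5哈希模拟关键字唯一标识，网络以弧列表近似表示，省略了入弧/出弧数量更新等细节）：

```python
import hashlib

def keyword_hash(name):
    # 用关键字的哈希码作为其唯一标识（示意）
    return hashlib.md5(name.encode("utf-8")).hexdigest()

def insert_keyword(net, slot_left, slot_right, kw_arcs, kw_id, inserted):
    """net: 句式解码网络(含arcs弧列表); slot_left/slot_right: 替换语法槽的左右节点;
    kw_arcs: 该关键字在垂类关键字网络中的弧序列 [(左节点, 右节点, 发音), ...];
    kw_id: 关键字唯一标识; inserted: 已入网关键字信息集合 {唯一标识: (左节点, 右节点)}"""
    if kw_id in inserted:          # 已插入过则跳过，避免重复入网
        return
    first, last = kw_arcs[0], kw_arcs[-1]
    # 开始节点出弧的右节点与替换语法槽左节点连接，连接弧上存该出弧的发音信息
    net["arcs"].append((slot_left, first[1], first[2]))
    # 结束节点入弧的左节点与替换语法槽右节点连接，连接弧上存该入弧的发音信息
    net["arcs"].append((last[0], slot_right, last[2]))
    # 关键字中间的弧原样保留
    net["arcs"].extend(kw_arcs[1:-1])
    # 记录唯一标识与其所在有向弧的左右节点编号
    inserted[kw_id] = (slot_left, first[1])

net = {"arcs": []}
inserted = {}
jack = [(100, 101, "jh"), (101, 102, "ae"), (102, 103, "k")]  # “Jack”的音素弧（示意）
insert_keyword(net, slot_left=10, slot_right=11, kw_arcs=jack,
               kw_id=keyword_hash("Jack"), inserted=inserted)
```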
由于本申请实施例是在云端服务器上构建语音识别解码网络,因此,各个用户均可以向云端服务器上传垂类关键字集合,而云端服务器可以基于各用户上传的垂类关键字集合,构建大规模或超大规模的语音识别解码网络,以便能够满足各种用户的调用需求。
而已入网关键字信息集合的设置,可以提高垂类关键字插入语音识别解码网络的效率,同时可以便于针对当前用户的语音识别需求,选择特定的解码路径。
例如,在用户语音拨打电话的业务场景下,按照上述方案构建的语音识别解码网络中,不但包含当前会话动态传入的通讯录词条信息,也包括其他历史会话的通讯录信息,并统一被以哈希码的形式保存在已入网关键字信息集合中。
当对当前会话进行识别时,理应在当前传入的通讯录范围内进行识别,以便能够准确识别用户是想与通讯录中的哪个联系人打电话。此时,需要对语音识别解码网络的路径进行更新,使得解码路径被限定在当前传入的通讯录范围内。
具体实现方式为:
遍历已入网关键字信息集合中的各个唯一标识;如果遍历到的唯一标识不是所述待识别语音所属业务场景下的垂类关键字集合中的任意关键字的唯一标识,则将该唯一标识对应的左右节点编号之间的有向弧断开。
仍以上述的用户语音拨打电话的业务场景为例,为了解码用户语音,利用用户通讯录以及句式解码网络构建语音识别解码网络。当完成当前通讯录插入语音识别解码网络后,遍历已入网关键字信息集合中的各个哈希码,如果遍历 到的哈希码属于本次传入的通讯录中的人名哈希码,则不做处理;如果不是本次传入的通讯录中的人名哈希码,则通过查询已入网关键字信息集合,确定该哈希码对应的左右节点编号,并将该左右节点编号之间的有向弧断开。则,真正参与解码的语音识别解码网络,其实只有当前传入的通讯录所在的解码路径是导通的,因此其只能解码到与当前通讯录中的任一人名打电话的语音识别结果,这也符合用户预期。
可见,经过上述处理,可以使得对当前语音进行解码时,解码路径被限制在本次传入的垂类关键字集合范围内,有利于缩小路径搜索范围、提高解码效率和降低解码误差。
综合上述的各语音识别方法的实施例介绍可知,基于垂类关键字集合以及句式解码网络构建的语音识别解码网络,是实现涉及垂类关键字的语音识别的主要网络。
本申请的发明人在研究中发现,该语音识别解码网络存在两方面的不足,一方面是垂类关键字误触发的问题,另一方面是网络覆盖度不够的问题。
误触发问题是基于固定句式的语音识别解码网络的语音识别中最常见的问题,也是最难解决的问题。误触发是指一段音频的实际内容并不是固定句式的语音识别解码网络中的句式,但是最终结果却是固定句式的语音识别解码网络中的结果。垂类关键字误触发是指实际结果中没有垂类关键字或者垂类关键字不在本次传入的垂类关键字集合中,但是最终结果是语音识别解码网络的结果胜出,给出一个错误的垂类关键字。例如,在打电话场景中,实际结果中没有人名或者人名不在本次传入的通讯录中,但是最终结果是语音识别解码网络输出的结果胜出,给出了一个错误的人名。
垂类关键字误触发一般分为以下4种:(1)垂类关键字和真实结果发音相同的误触发;(2)垂类关键字和真实结果发音相近的误触发;(3)垂类关键字和真实结果发音差距较大由于激励策略引入的误触发;(4)真实结果中没有垂类关键字,但是语音识别解码网络识别出了垂类关键字的误触发。
其中,关于上述的第四种误触发情况,其根本原因可以归结为是由于语音识别解码网络中的句式覆盖度不够导致的。
结合该垂类关键字误触发情况,本申请实施例对语音识别解码网络的句式 覆盖度不够的问题,提出了相应解决方案。对于其他的误触发情况,将在后续实施例中通过其他方案予以解决和优化。
通过前文论述可知,语音识别解码网络是基于句式解码网络构建的,也就是基于句式构建的,这种构建网络方式的优点就是能够精准匹配出语音句式,缺点就是如果句式不在网络中就没有办法识别出来,而构建网络所用的语料,往往不能覆盖所有场景下的所有句式,所以语音识别解码网络就会引入一个问题,那就是句式覆盖度不够高的问题。
而通用语音识别模型是基于海量数据训练出来的,而且句式具有扩展性,句式非常丰富。在特定业务场景下,通用语音识别模型的结果错误主要都是垂类关键字识别错误,主要是因为通用语音识别模型包含了语料训练时候的语言分,而语料中往往并不能很好的拟合出垂类关键字的得分。虽然通用语音识别模型识别的垂类关键字是错误的,但是其句式是正确的,主要是因为句式可以通过训练数据拟合出来。
总结就是,通用语音识别模型的识别结果的句式信息更加可靠,而语音识别解码网络的识别结果的垂类关键字信息更加可靠。基于这个思路,本案提出基于通用语音识别模型的句式来解决语音识别解码网络句式覆盖度不高的问题。
具体方案是,在分别利用按照本申请技术思想构建得到的语音识别解码网络以及通用语音识别模型,对待识别语音的声学状态序列进行解码得到第一语音识别结果和第二语音识别结果的前提下,根据第二语音识别结果,对第一语音识别结果进行修正。
结合前文论述可知,相对来说,第一语音识别结果中的垂类关键字内容是更加准确的,而第二语音识别结果中的句式信息更加准确。因此,当利用第二语音识别结果对第一语音识别结果进行修正时,应当是针对第一语音识别结果中的非垂类关键字内容进行修正,使其句式更加准确。
因此,本申请实施例将第一语音识别结果中的内容划分为垂类关键字内容和非垂类关键字内容,相应的,按照第二语音识别结果与第一语音识别结果的内容对应关系,将第二语音识别结果中的内容划分为参考文本内容和非参考文本内容,其中,第二语音识别结果中的参考文本内容,是指第二语音识别结果中的、与第一语音识别结果中的非垂类关键字内容相匹配的文本内容。
与第一语音识别结果中的非垂类关键字内容相匹配的文本内容,具体可以是与第一语音识别结果中的非垂类关键字内容的字符串最相似,或者相似度大于设定阈值的字符串内容。
基于上述的文本内容划分,上述的根据第二语音识别结果对第一语音识别结果进行修正,具体是利用第二语音识别结果中的参考文本内容,对第一语音识别结果中的非垂类关键字内容进行修正,得到修正后的第一语音识别结果。
上述的修正过程,具体可通过执行如下步骤C1-C3实现:
C1、从所述第一语音识别结果中确定出垂类关键字内容和非垂类关键字内容,以及,从所述第二语音识别结果中确定出与所述第一语音识别结果中的非垂类关键字内容对应的文本内容,作为参考文本内容。
作为一种优选的实施方式,本申请实施例基于编辑距离算法,将第一语音识别结果和第二语音识别结果进行匹配,从第二语音识别结果中确定出参考文本内容。
具体而言,首先,根据编辑距离算法确定第一语音识别结果与第二语音识别结果之间的编辑距离矩阵。在该编辑距离矩阵中,包含第一语音识别结果中的各个字符与第二语音识别结果中的各个字符之间的编辑距离。
然后,根据该编辑距离矩阵,以及第一语音识别结果中的非垂类关键字内容,从第二语音识别结果中确定出与第一语音识别结果中的非垂类关键字内容对应的文本内容,作为参考文本内容。
按照第一语音识别结果中的垂类关键字的位置不同,第一语音识别结果中的非垂类关键字内容,可能分为垂类关键字之前的字符串,和/或,垂类关键字之后的字符串。
以垂类关键字之前的字符串为例,根据该部分字符串以及上述的编辑距离矩阵,从第二语音识别结果中选出与该部分字符串的编辑距离最小的字符串,即为与该部分字符串对应的参考文本内容。同理,对于垂类关键字之后的字符串,也可以通过上述方法确定与其对应的参考文本内容。
C2、根据所述第二语音识别结果中的参考文本内容,以及所述第一语音识别结果中的非垂类关键字内容,确定修正后的非垂类关键字内容。
作为一种可选的实施方式,本申请实施例根据第二语音识别结果中的参考文本内容与第一语音识别结果中的非垂类关键字内容的字符差异,将第二语音 识别结果中的目标文本内容或第一语音识别结果中的非垂类关键字内容,确定为修正后的非垂类关键字内容;
其中,第二语音识别结果中的目标文本内容,是指第二语音识别结果中的、与第一语音识别结果中的非垂类关键字内容的位置相对应的文本内容。
具体的,第二语音识别结果中的目标文本内容在第二语音识别结果中的位置,与第一语音识别结果中的非垂类关键字内容在第一语音识别结果中的位置相同。
例如,假设第一语音识别结果中的非垂类关键字内容为位于第一语音识别结果中的垂类关键字之前的文本内容,则第二语音识别结果中的目标文本内容,具体是第一语音识别结果中的垂类关键字的位置映射到第二语音识别结果中的位置之前的文本内容。其中,第一语音识别结果中的垂类关键字的位置相第二语音识别结果的映射,可以基于上述的编辑距离矩阵实现。
下面,以打电话场景为例,介绍修正后的非人名内容的确定过程。在下文介绍中,以第一语音识别结果中的人名处于第一语音识别结果的句中位置为例进行说明,重点说明对于第一语音识别结果中的人名前字符串的修正处理过程。参照下文介绍,对于人名后字符串,也可以进行相应的修正处理。
当根据第一语音识别结果与第二语音识别结果之间的编辑距离矩阵,以及第一语音识别结果中的非垂类关键字内容,从第二语音识别结果中确定出参考文本内容后,将第二语音识别结果中的参考文本内容与第一语音识别结果中的非垂类关键字内容进行比对,确定第二语音识别结果中的参考文本内容与第一语音识别结果中的非垂类关键字内容是否相同。
如果相同,则将所述第二语音识别结果中的目标文本内容,确定为修正后的非垂类关键字内容;
如果不同,则确定所述第二语音识别结果是否比所述第一语音识别结果中的非垂类关键字内容的字符数量多,并且两者的字符数量差异不超过设定阈值;
如果所述第二语音识别结果比所述第一语音识别结果中的非垂类关键字内容的字符数量多,并且两者的字符数量差异不超过设定阈值,则将所述第二语音识别结果中的目标文本内容,确定为修正后的非垂类关键字内容;
如果所述第二语音识别结果比所述第一语音识别结果中的非垂类关键字 内容的字符数量少,和/或两者的字符数量差异超过设定阈值,则将所述第一语音识别结果中的非垂类关键字内容,确定为修正后的非垂类关键字内容。
具体而言,参见图11所示的修正处理过程,根据第一语音识别结果与第二语音识别结果之间的编辑距离矩阵,以及第一语音识别结果中的人名前字符串,计算人名前字符串与第二语音识别结果的局部最大子序列,该局部最大子序列即为从第二语音识别结果中确定出的参考文本内容。
然后,判断该最大子序列是否为人名前字符串,也就是判断该参考文本内容是否与人名前字符串相同。
如果相同,则将第二语音识别结果中的目标文本内容确定为修正后的人名前字符串。此时,第二语音识别结果中的目标文本内容,具体是第一语音识别结果中的人名的位置,根据第一语音识别结果和第二语音识别结果之间的编辑距离矩阵,映射到第二语音识别结果中的位置之前的文本内容。
如果不同,则判断该第二语音识别结果是否比该人名前字符串的字符数量多;
如果该第二语音识别结果的字符数量不比人名前字符串的字符数量多,则保持第一语音识别结果中的人名前字符串不变,也就是将第一语音识别结果中的人名前字符串确定为修正后的人名前字符串。
如果第二语音识别结果的字符数量比人名前字符串的字符数量多,则进一步判断两者的字符串数量差异是否超过设定阈值,具体是判断第二语音识别结果的字符数量比人名前字符串的字符数量多出的数量是否超过人名前字符串字符数量的20%。
如果超过该设定阈值,则保持第一语音识别结果中的人名前字符串不变,也就是将第一语音识别结果中的人名前字符串确定为修正后的人名前字符串。
如果不超过该设定阈值,则将第二语音识别结果中的目标文本内容确定为修正后的人名前字符串。
同理,对于第一语音识别结果中的人名后字符串,也可以参照上述介绍,确定修正后的人名后字符串。
C3、利用所述修正后的非垂类关键字内容,以及所述垂类关键字内容,组合得到修正后的第一语音识别结果。
具体的,将修正后的非垂类关键字内容,以及垂类关键字内容,按照原非 垂类关键字内容与垂类关键字内容的位置关系进行组合,得到的组合结果,即作为修正后的第一语音识别结果。
例如图11所示,将修正后的人名前字符串、人名字符串以及修正后的人名后字符串依次进行拼接组合,即得到修正后的第一语音识别结果。
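上述对人名前/后字符串的修正判断及最终拼接，可用如下示意代码概括（仅为草图，20%阈值与图11一致，其余命名均为本示例假设）：

```python
def revise_segment(first_segment, reference, second_result, target_in_second):
    """first_segment: 第一语音识别结果中的人名前(或人名后)字符串
    reference: 由编辑距离矩阵从第二语音识别结果中确定的参考文本内容
    second_result: 完整的第二语音识别结果
    target_in_second: 第二语音识别结果中与first_segment位置对应的目标文本内容"""
    if reference == first_segment:
        return target_in_second              # 参考文本与原字符串相同，取目标文本内容
    extra = len(second_result) - len(first_segment)
    if extra > 0 and extra <= 0.2 * len(first_segment):
        return target_in_second              # 第二结果字符略多且差异不超过20%，取目标文本内容
    return first_segment                     # 否则保持第一语音识别结果中的字符串不变

def combine(pre_revised, name, post_revised):
    # 将修正后的人名前字符串、人名字符串、修正后的人名后字符串依次拼接，得到修正后的第一语音识别结果
    return pre_revised + name + post_revised
```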
按照上述的处理过程对第一语音识别结果进行修正,可以使得语音识别解码网络的输出结果不仅包含更准确的垂类关键字信息,而且能够融合通用语音识别模型所识别到的准确的句式信息,从而使得语音识别解码网络的识别结果更加准确、句式覆盖度更高,同时能够解决由于语音识别解码网络的句式覆盖度低造成的识别误触发问题。
在上文实施例中介绍到,当有多个模型分别对待识别语音的声学状态序列进行解码时,为了使语音识别解码网络的结果能够在最终的PK中胜出,会对语音识别解码网络的结果进行激励,包括声学激励和语言激励。在激励过程中,如果激励不当,或者会造成激励不到位,或者会导致过度激励造成误触发,造成如上文所述的第二种和第三种垂类关键字误触发情况,即:(2)垂类关键字和真实结果发音相近的误触发;(3)垂类关键字和真实结果发音差距较大由于激励策略引入的误触发。
为了改善上述问题,以便能够通过合理的激励,既保证语音识别解码网络的结果能够不被其他模型的结果PK掉,又能够尽量避免由于过度激励造成语音识别解码网络的识别误触发,本申请实施例对上述的激励方案进行研究,提出了优选的激励方案。
本申请实施例首先提出,当利用语音识别解码网络以及通用语音识别模型对待识别语音的声学状态序列进行解码时,在从语音识别解码网络输出的第一语音识别结果,以及通用语音识别模型输出的第二语音识别结果中确定最终的语音识别结果时,可以参见图12所示的流程,按照如下步骤D1-D3的处理,确定最终的语音识别结果:
D1、通过将所述第一语音识别结果和所述第二语音识别结果进行对比,确定所述第一语音识别结果和所述第二语音识别结果的匹配度,并基于所述第一语音识别结果和所述第二语音识别结果的匹配度确定所述第一语音识别结果的置信度。
具体的,将第一语音识别结果和第二语音识别结果按照编辑距离算法计算字符编辑距离,从而确定两者的匹配度。
在确定第一语音识别结果和第二语音识别结果的匹配度后,以第一语音识别结果和第二语音识别结果的匹配度为基础,进一步确定第一语音识别结果的置信度。
示例性的,在确定第一语音识别结果的置信度时,首先判断第一语音识别结果和第二语音识别结果的匹配度是否大于设定的匹配度阈值。该匹配度阈值是由多个测试集合统计出的一个对识别贡献率最大的值。
如果大于设定的匹配度阈值,则根据所述第一语音识别结果的各个帧的声学得分,计算确定所述第一语音识别结果的置信度。即,将第一语音识别结果的各帧的声学得分进行累加或加权累加,累加结果作为第一语音识别结果的置信度。
如果不大于设定的匹配度阈值,则利用所述第一语音识别结果中的垂类关键字内容以及所述第二语音识别结果构建得到解码网络,利用该解码网络重新对所述声学状态序列进行解码,并利用解码结果更新所述第一语音识别结果;根据更新后的第一语音识别结果的各个帧的声学得分,计算确定第一语音识别结果的置信度。
具体的,如果第一语音识别结果和第二语音识别结果的匹配度不大于设定的匹配度阈值,则利用第一语音识别结果中的垂类关键字,以及第二语音识别结果的句式,也就是第二语音识别结果中的除垂类关键字对应内容之外的内容,构建得到一个微型的解码网络。可知,该微型的解码网络,只有一条解码路径。
利用该微型的解码网络重新对上述的待识别语音的声学状态序列进行解码,得到解码结果,并利用该解码结果作为新的第一语音识别结果。然后,利用更新后的第一语音识别结果的各个帧在解码时的声学得分,通过对各帧的声学得分进行累加或加权累加,作为最终确定的第一语音识别结果的置信度。
当第一语音识别结果的置信度大于预设的置信度阈值时,执行步骤D2、根据所述第一语音识别结果的声学得分和所述第二语音识别结果的声学得分,从所述第一语音识别结果和所述第二语音识别结果中选出最终的语音识别结果。
具体的,上述的置信度阈值,是指通过试验确定的,能够使第一语音识别结果在与其他语音识别结果进行得分PK时能够胜出的得分阈值。
第一语音识别结果的置信度大于预设的置信度阈值,则说明第一语音识别结果凭借其自身的置信度,能够在与其他语音识别结果的声学得分PK中,不被轻易PK掉。因此,此时可以直接根据第一语音识别结果的声学得分,以及第二语音识别结果的声学得分,对两者进行声学得分PK,从中选出一个或多个声学得分最高的语音识别结果,作为最终的语音识别结果。
当所述第一语音识别结果的置信度不大于预设的置信度阈值时,执行步骤D3、对所述第一语音识别结果进行声学得分激励,并根据激励后的第一语音识别结果的声学得分以及所述第二语音识别结果的声学得分,从所述第一语音识别结果和所述第二语音识别结果中选出最终的语音识别结果。
具体的,如果第一语音识别结果的置信度不大于预设的置信度阈值,则说明该第一语音识别结果的置信度较低,也就是其声学得分较低,在与其他语音识别结果进行声学得分PK时,会被其他语音识别结果PK掉,从而使最终选出的语音识别结果中丢失第一语音识别结果中所包含的更准确的垂类关键字信息,可能造成识别错误,尤其是可能造成垂类关键字识别错误。
此时,为了保证第一语音识别结果在后续的声学得分PK中不被轻易PK掉,本申请实施例对第一语音识别结果进行声学得分激励,具体是对第一语音识别结果中的垂类关键字所在槽位的声学得分进行激励,使第一语音识别结果中的垂类关键字所在槽位的声学得分提高一定比例,从而使第一语音识别结果的声学得分提高,以保证第一语音识别结果在后续与其他语音识别结果进行声学得分PK时不被轻易PK掉。
在对第一语音识别结果进行声学得分激励后,即可根据激励后的第一语音识别结果的声学得分以及第二语音识别结果的声学得分,对两者进行声学得分PK,从中选出一个或多个声学得分最高的语音识别结果,作为最终的语音识别结果。
下面,对于上述各实施例中涉及到的对第一语音识别结果进行声学得分激励的具体实现方案进行介绍。
对第一语音识别结果进行声学得分激励,也就是对第一语音识别结果中的 垂类关键字所在槽位的声学得分进行激励,具体是将第一语音识别结果中的垂类关键字所在槽位的声学得分乘以一个激励系数,然后再以激励后的垂类关键字声学得分,为基础,重新计算确定第一语音识别结果的声学得分。
具体而言,通过执行如下E1-E3的处理,可以实现对第一语音识别结果的声学得分激励:
E1、至少根据所述第一语音识别结果中的垂类关键字内容和非垂类关键字内容,确定声学激励系数。
声学激励系数,决定了对第一语音识别结果进行声学得分激励的强度,如果激励系数过大,则会造成过度激励,造成如上文所述的识别误触发问题;如果激励系数过小,则达不到激励目的,可能造成第一语音识别结果在得分PK中被其他语音识别结果PK掉。
所以,声学激励系数的确定,是解决上述实施例中所提到的垂类关键字误触发问题,以及解决声学PK时丢失重要的垂类关键字信息问题的关键。
本申请实施例设定,在确定声学激励系数时,要至少根据第一语音识别结果中的垂类关键字内容和非垂类关键字内容而确定,另外还应当结合实际业务场景、实际业务场景下的经验参数等进行确定。
作为一种可选的实施方式,可以根据待识别语音所属业务场景下的声学得分激励先验系数、所述第一语音识别结果中的垂类关键字的字符数量、音素数量,以及所述第一语音识别结果的全部字符数量、全部音素数量,计算确定声学激励系数。
具体的,在该实施方式中,按照如下公式计算得到声学激励系数RC:
RC=1.0+α*{β*(SlotWC/SentWC)+(1-β)*(SlotPC/SentPC)}
其中,α为场景先验系数,在本申请实施例中,即为待识别语音所属业务场景下的激励先验系数,该系数的值的正负分别代表着正向激励和反向激励两种情况。先验系数α作为一个开放参数,可以根据识别系统的需求,在每次识别会话时进行动态的设置。例如,基于自然语言处理(NLP)的技术,上层系统可以通过用户交互的上下文来预测用户的行为意图,并对该系数进行实时动态的调节,达到满足各类场景下的要求。
此外,由于涉及垂类关键字的业务场景下的垂类关键字存在词个数以及词发音音素长度不一致的情况,为了提升声学激励系数的泛化能力以及在不同槽 的适应能力,避免出现激励不合理的情况,声学激励系数的设计还充分考虑了垂类关键字所在槽位的词的个数(word count in slot,SlotWC)和音素个数(phoneme count in slot,SlotPC)(即垂类关键字的字符数量和音素数量),以及考虑了句子中词的个数(word count in sentence,SentWC)和音素个数(phoneme count in sentence,SentPC)(即第一语音识别结果的全部字符数量和全部音素数量)之间的比例关系,并且为字符数量和音素数量分别设置影响权重β,使字符和音素的影响权重之和为1。这种方式限制了声学激励系数的范围,且实现在不同长度句子和关键字下激励的自适应性。在特定场景下,如果计算得到的激励系数RC大于1.0,则会使得识别结果更偏向于语音识别解码网络输出的第一语音识别结果,反之则更倾向于其他模型的输出结果。此外,声学得分激励只针对垂类关键字所在槽位的声学得分进行激励,排除了无关上下文的干扰,从而能够避免激励过大而引起的识别误触发问题。
作为另一种可选的实施方式,可以先根据第一语音识别结果中的垂类关键字内容的音素数量和声学得分,以及第一语音数据结果中的非垂类关键字内容的音素数量和声学得分,计算得到第一语音识别结果中的垂类关键字内容的得分置信度;然后,根据第一语音识别结果中的垂类关键字内容的得分置信度,确定声学激励系数。
具体的,通过对语音识别结果的声学状态序列进行分析可以发现,识别错误的词语对应的状态序列得分比较低,这也是为何识别正确的结果往往得分是最高的原因。总结来说就是,识别错误的结果的声学序列的平均得分比较低。
在涉及垂类关键字的业务场景中,虽然垂类关键字被识别错误,但是整体句式的识别还是正常的。也就是说,整个语音识别结果中,垂类关键字的局部识别效果比整句的非垂类关键字部分的识别效果差。本案基于这个思路提出了通过垂类关键字内容的得分置信度来解决垂类关键字和正式结果发音相近的误触发问题。
垂类关键字得分置信度方案就是将语音识别解码网络输出的第一语音识别结果中的垂类关键字和非垂类关键字分开,分别计算出垂类关键字的声学总得分和垂类关键字占用的有效声学建模因子个数比值,以及非垂类关键字部分的声学总得分和非垂类关键字部分占用的有效声学建模因子个数比值,然后将二者进行相除就得到垂类关键字的得分置信度值。
上述的垂类关键字的得分置信度S_c可以通过如下公式计算得到：
S_c = (S_p / N_p) / (S_a / N_a)
其中，第一语音识别结果中的垂类关键字的声学得分记为S_p，垂类关键字所在槽位所占的有效声学音素个数记为N_p，非垂类关键字的总声学得分记为S_a，非垂类关键字所在槽位所占的有效声学音素个数记为N_a。
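得分置信度S_c的计算可示意如下（数值均为假设，仅作演示）：

```python
def score_confidence(s_p, n_p, s_a, n_a):
    # S_c = (S_p / N_p) / (S_a / N_a)：垂类关键字与非垂类关键字部分的平均声学得分之比
    return (s_p / n_p) / (s_a / n_a)

# 假设垂类关键字总声学得分4.2、占6个有效声学音素；
# 非垂类关键字总声学得分12.0、占15个有效声学音素
print(round(score_confidence(4.2, 6, 12.0, 15), 3))  # (0.7)/(0.8) = 0.875
```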
在得到第一语音识别结果中的垂类关键字内容的得分置信度后,基于该得分置信度,可以确定用于对该垂类关键字进行激励的声学激励系数。
作为一种示例性的实现方式,将第一语音识别结果中的垂类关键字内容的得分置信度与预先设置的置信度阈值进行比对。该置信度阈值,根据声学激励系数造成识别误触发的概率而确定。当第一语音识别结果中的垂类关键字内容的得分置信度大于该置信度阈值时,可以认为该垂类关键字容易引起误触发,在确定对该垂类关键字的声学激励系数时,应当下调声学激励系数;当第一语音识别结果中的垂类关键字内容的得分置信度不大于该置信度阈值时,可以认为该垂类关键字容易被PK掉,在确定对该垂类关键字的声学激励系数时,应当上调声学激励系数。
进一步的,在根据第一语音识别结果中的垂类关键字内容的得分置信度确定声学激励系数时,还可以将第一语音识别结果中的垂类关键字内容的得分置信度,以及预先确定的声学激励系数与识别效果和识别误触发之间的关系相结合,共同用于确定声学激励系数。
具体的,本申请实施例分析多个测试集合的识别结果,统计声学激励系数的大小对识别效果、识别误触发的影响,确定声学激励系数与识别效果和识别误触发之间的关系。
基于声学激励系数与识别效果和识别误触发之间的关系,在确定声学激励系数的具体数值时,选择一个对识别率提升和降低误触发之间比较平衡的数值。选取的原则要保证误触发数量远小于识别效果提升的数量,本申请实施例选取的声学激励系数,是导致识别误触发的数量是促进识别效果提升的数量的百分之一的激励系数。
E2、利用所述声学激励系数,对所述第一语音识别结果中的垂类关键字内容的声学得分进行更新。
具体的，利用第一语音识别结果中的垂类关键字内容所在槽位的声学得分，乘以上述步骤确定的声学激励系数，得到更新后的垂类关键字内容的声学得分。
E3、根据更新后的所述第一语音识别结果中的垂类关键字内容的声学得分,以及所述第一语音识别结果中的非垂类关键字内容的声学得分,重新计算确定所述第一语音识别结果的声学得分。
具体的,利用更新后的垂类关键字内容的声学得分,替换第一语音识别结果中的垂类关键字内容的声学得分,然后重新对第一语音识别结果的各个字符的声学得分进行求和或加权求和,得到更新后的第一语音识别结果的声学得分。
下面,对于上述各实施例中涉及到的语言模型激励的具体实现方案进行介绍。在下文实施例中,以对第三语音识别结果进行语言模型激励为例,介绍语言模型激励的具体处理内容。应当理解的是,具体的语言模型激励的处理过程,不受激励对象的限制,该语言模型激励方案也可以适用于对其他语音识别结果的激励,例如,下文实施例介绍的语言模型激励的实现方案,同样适用于对从第一语音识别结果和第二语音识别结果中选出的候选语音识别结果进行语言模型激励。
语言模型激励,即为通过语言模型重新对语音识别结果的得分进行计算,从而使语音识别结果的得分中携带语言成分。
语言模型激励机制主要通过两方面来实现:其一为聚类class语言模型,其二为基于垂类关键字与语音识别结果发音序列匹配的策略,对语音识别结果的路径进行扩展,并基于扩展路径以及上述的聚类语言模型,确定语音识别结果的语言得分。
首先对聚类class模型进行介绍。在打电话、发短信、查询天气、导航等特定的涉及垂类关键字的语音识别业务场景下,通过枚举或者用户提供的方式,可以将各场景下的垂类关键字限制在有限的范围内,而垂类关键字所在的上下文通常也会以特定的句式出现。
聚类class语言模型除了采用通用的训练语料外,还针对这类特定句式或者说法进行了特殊的处理。聚类语言模型会为所有特定场景都分别定义一种类, 并且每类场景都会采用一个特殊词(class)来进行标记和区分,作为与该场景对应的类别标签。在定义完所有类别标签后,训练语料中的人名、城市名、音视频名等垂类关键字都会被替换为对应的类别标签,形成目标语料,这些目标语料会被添加到原始训练语料中,再次用于对上述的聚类语言模型进行语音识别训练。这种处理方式使得特殊词class代表一类词的概率,因此聚类模型中特殊词class所在的N-gram语言模型概率会显著高于具体的垂类关键字本身的概率。
在上述的类别标签的基础上,本申请实施例根据待识别语音所属业务场景下的垂类关键字集合以及与该业务场景对应的类别标签,对第三语音识别结果进行路径扩展。
示例性的,首先,将第三语音识别结果中的垂类关键字,与待识别语音所属业务场景下的垂类关键字集合中的垂类关键字分别进行比对。
如前文所述,在打电话、查询天气,导航等包含垂类关键字的特定语音识别业务场景下,垂类关键字都会被圈定在有限的范围内。通过枚举、用户提供等方式,可以将对应业务场景下的所有垂类关键字作为一种静态资源进行使用。利用发音词典资源,分别生成待识别语音所属业务场景下的垂类关键字和第三语音识别结果的发音串信息,通过将第三语音识别结果中的垂类关键字的发音信息与待识别语音所属业务场景下的垂类关键字集合中的垂类关键字的发音信息分别进行对比,以判断第三语音识别结果中的垂类关键字是否与待识别语音所属业务场景下的垂类关键字集合中的任意垂类关键字相匹配。
如果第三语音识别结果中的垂类关键字与待识别语音所属业务场景下的垂类关键字集合中的任意垂类关键字相匹配,则在第三语音识别结果中的垂类关键字所在槽位的左右节点之间扩展新路径,并在该新路径上存储与待识别语音所属业务场景对应的类别标签。
示例性的,图13所示,为打电话业务场景下的语音识别结果“<s>给张三打电话</s>”的状态网络(lattice)示意图。
对于该语音识别结果中的人名“张三”,当通过发音匹配确认其与用户上传的通讯录中的“张三”相匹配时,在图13所示的状态网络中的“张三”所在槽位的左右节点之间扩展新路径,该新路径与原状态网络中的“张三”共享起始节点和结束节点,并且在该新路径上标注当前业务场景对应的类别标签 “class”,具体的,该“class”可以具体为“人名”,路径扩展后的状态网络如图14所示。
按照上述处理,在完成对第三语音识别结果的路径扩展后,根据与待识别语音所属业务场景对应的类别标签所对应的聚类语言模型对训练语料的识别结果,分别确定第三语音识别结果以及第三语音识别结果的扩展路径的语言模型得分。
具体的,聚类语言模型对训练语料的识别结果中,包含了识别结果中的每个词的N-gram语言模型概率,该概率即为词的语言得分。
在完成对第三语音识别结果的路径扩展后,选择与第三语音识别结果对应的聚类语言模型对第三语音识别结果以及第三语音识别结果的扩展路径进行重新查分,分别确定第三语音识别结果以及第三语音识别结果的扩展路径的语言模型得分。
在本申请实施例中,由于通用语音识别模型和场景定制模型是基于不同语料进行训练,因而这两类模型分别对应不同的聚类class语言模型,而语音识别解码网络与场景定制模型同属于领域相关模型,所以这两者共享相同的聚类class语言模型。因此,在对语音识别结果进行重新查分时,会根据得出语音识别结果的模型来适配不同的聚类语言模型,用于对语音识别结果进行重新查分。
尤其是,当上述的语言模型激励方案用于对从第一语音识别结果和第二语音识别结果中选出的候选语音识别结果进行语言模型激励时,由于候选语音识别可能是语音识别解码网络输出的结果(即第一语音识别结果中的任意一条或多条),也可能是通用语音识别模型输出的结果(即第二语音识别结果中的一条或多条),因此,应当根据候选语音识别结果的来源,选择与其来源相同类型的聚类语言模型,进行重新查分。
上述的不同类型的聚类语言模型的模型结构是相同的,不同的是其训练语料不同。比如,与通用语音识别模型相同类型的聚类语言模型,是基于海量语料训练得到的,假设将其命名为模型A;而与场景定制模型相同类型的聚类语言模型,是基于场景语料训练得到的,假设将其命名为模型B;由于模型A和模型B是基于不同类型的训练语料训练得到的,因此两者属于不同类型的聚类语言模型。而与语音识别解码网络相同类型的聚类语言模型,也是基于场景语 料训练得到的,假设将其命名为模型C;由于模型B和模型C是基于相同类型的训练语料训练得到的,因此两者属于相同类型的聚类语言模型。
表1示出了对图14中的两条路径进行重新查分的计算方式。
表1（原文为图片，此处无法完整还原）：按照上述查分方式，对图14中的原路径“<s>给张三打电话</s>”查分得到语言模型得分scoreA，对以类别标签“人名”（class）替换人名后的扩展路径查分得到语言模型得分scoreB。
按照上述介绍,结合表1所示,能够分别确定第三语音识别结果以及第三语音识别结果的扩展路径的语言模型得分scoreA和scoreB。
最后,根据第三语音识别结果的语言模型得分,以第三语音识别结果的扩展路径的语言模型得分,确定第三语音识别结果的语言模型激励后的语言得分。
具体的,将第三语音识别结果的语言模型得分scoreA,以及第三语音识别结果的扩展路径的语言模型得分scoreB按照一定比例融合,且两者的融合系数之和为1,得到第三语音识别结果的语言模型激励后的语言得分。
例如,可以通过如下公式计算第三语音识别结果的语言模型激励后的语言得分Score:
Score=γ*scoreA+(1–γ)*scoreB
其中,γ为经验系数,其取值通过测试确定,具体是以能够得到正确的语言得分,进而基于语言得分PK能够从众多的语音识别结果中选出正确的语音识别结果为目标而确定。
以上实施例介绍了声学得分激励和语言模型激励的具体实现方案,在上述实现方案中,尤其是在声学得分激励实现方案中,充分考虑了垂类关键字误触发问题,通过合理设置激励系数,能够解决由于激励导致的在垂类关键字和真实结果发音相近时引发的误触发,以及在垂类关键字和真实结果发音差距较大 时引入的误触发问题。
对于上文实施例中提到的,由于垂类关键字和真实结果发音相同而引起的误触发问题,则无法通过上述的控制激励系数的方案得以解决。原因是本申请实施例构建的语音识别解码网络是一个依赖声学模型的句式网络,其不包含语言信息,所以对于垂类关键字和真实结果发音相同的情况是不能本质上解决的。为了减少此类误触发的影响,本申请实施例采用多候选的形式将结果展示给用户。
具体而言,由于通用语音识别模型和语音识别解码网络共用一个声学模型,所以当它们的输出结果发音相同时,它们的声学得分一定是相同的。因此,当语音识别解码网络输出的第一语音识别结果和通用语音识别模型输出的第二语音识别结果的声学得分相同时,将第一语音识别结果和第二语音识别结果共同作为最终的语音识别结果,也就是第一语音识别结果和第二语音识别结果同时输出,由用户从中选择正确的语音识别结果。其中,第一语音识别结果和第二语音识别结果同时输出时的输出顺序可以灵活调整,优选为采用第一语音识别结果在前、第二语音识别结果在后的顺序输出。
需要说明的是,上述的多候选形式输出语音识别结果的思想,同样适用于更多模型的语音识别结果的得分PK。例如上述实施例介绍,当将语音识别解码网络、通用语音识别模型、场景定制模型的输出结果进行得分PK决策最终的语音识别结果时,如果有多个不同的语音识别结果的得分相同,则可以将这些得分相同的语音识别结果同时输出,由用户从中选择正确的语音识别结果。
至此,本申请上述各实施例分别对所提出的各语音识别方法的处理过程,尤其是各种语音识别方法中的各个典型处理步骤进行了详细介绍。应当注意的是,为了使说明书简洁,各种语音识别方法中的相同或相应处理步骤的具体实施方式,均可以相互参见,本申请实施例不再一一列举和说明。各语音识别方法中的处理步骤可以相互借鉴、组合,从而形成不超出本申请保护范围的技术方案。
此外,与上述的语音识别方法相对应的,本申请实施例还提出一种语音识别装置,参见图15所示,该语音识别装置,包括:
声学识别单元001,用于获取待识别语音的声学状态序列;
网络构建单元002,用于基于所述待识别语音所属场景下的垂类关键字集合及句式解码网络,构建语音识别解码网络,其中,所述句式解码网络至少通过对所述待识别语音所属场景下的文本语料进行句式归纳处理构建得到;
解码处理单元003,用于利用所述语音识别解码网络对所述声学状态序列进行解码,得到语音识别结果。
作为一种可选的实施方式,上述的基于所述待识别语音所属业务场景下的垂类关键字集合及句式解码网络,构建语音识别解码网络,包括:
将所述待识别语音所属场景下的垂类关键字集合传入云端服务器,以使所述云端服务器基于所述待识别语音所属场景下的垂类关键字集合及句式解码网络,构建语音识别解码网络。
作为一种可选的实施方式,所述语音识别结果作为第一语音识别结果;
所述解码处理单元003还用于:
利用通用语音识别模型对所述声学状态序列进行解码,得到第二语音识别结果;
至少从所述第一语音识别结果和所述第二语音识别结果中,确定出最终的语音识别结果。
作为一种可选的实施方式,所述解码处理单元003还用于:
通过预先训练的场景定制模型,对所述声学状态序列进行解码得到第三语音识别结果;其中,所述场景定制模型,通过对所述待识别语音所属场景下的语音进行语音识别训练得到;
所述至少从所述第一语音识别结果和所述第二语音识别结果中,确定出最终的语音识别结果,包括:
从所述第一语音识别结果、所述第二语音识别结果和所述第三语音识别结果中,确定出最终的语音识别结果。
作为一种可选的实施方式,从所述第一语音识别结果、所述第二语音识别结果和所述第三语音识别结果中,确定出最终的语音识别结果,包括:
分别对所述第一语音识别结果、所述第二语音识别结果和所述第三语音识别结果进行语言模型激励;
根据激励后的第一语音识别结果、第二语音识别结果和第三语音识别结果的语言得分，从所述第一语音识别结果、所述第二语音识别结果和所述第三语音识别结果中确定出最终的语音识别结果。
作为一种可选的实施方式,从所述第一语音识别结果、所述第二语音识别结果和所述第三语音识别结果中,确定出最终的语音识别结果,包括:
对所述第一语音识别结果进行声学得分激励,以及,对所述第三语音识别结果进行语言模型激励;
根据声学得分激励后的第一语音识别结果的声学得分,以及所述第二语音识别结果的声学得分,从所述第一语音识别结果和所述第二语音识别结果中确定出候选语音识别结果;
对所述候选语音识别结果进行语言模型激励;
根据语言模型激励后的所述候选语音识别结果的语言得分,以及语言模型激励后的所述第三语音识别结果的语言得分,从所述候选语音识别结果和所述第三语音识别结果中确定出最终的语音识别结果。
本申请另一实施例还提出另一种语音识别装置,参见图16所示,该装置包括:
声学识别单元011,用于获取待识别语音的声学状态序列;
多维解码单元012,用于利用语音识别解码网络对所述声学状态序列进行解码,得到第一语音识别结果,以及,利用通用语音识别模型对所述声学状态序列进行解码,得到第二语音识别结果;所述语音识别解码网络基于所述待识别语音所属场景下的垂类关键字集合及句式解码网络构建得到;
声学激励单元013,用于对所述第一语音识别结果进行声学得分激励;
决策处理单元014,用于至少从激励后的第一语音识别结果以及所述第二语音识别结果中,确定出最终的语音识别结果。
作为一种可选的实施方式,所述多维解码单元012还用于:
通过预先训练的场景定制模型,对所述声学状态序列进行解码得到第三语音识别结果;其中,所述场景定制模型,通过对所述待识别语音所属场景下的语音进行语音识别训练得到;
所述至少从激励后的第一语音识别结果以及所述第二语音识别结果中,确定出最终的语音识别结果,包括:
从激励后的第一语音识别结果、所述第二语音识别结果和所述第三语音识别结果中，确定出最终的语音识别结果。
作为一种可选的实施方式,所述从激励后的第一语音识别结果、所述第二语音识别结果和所述第三语音识别结果中,确定出最终的语音识别结果,包括:
根据声学得分激励后的第一语音识别结果的声学得分,以及所述第二语音识别结果的声学得分,从所述第一语音识别结果和所述第二语音识别结果中确定出候选语音识别结果;
对所述候选语音识别结果以及所述第三语音识别结果分别进行语言模型激励;
根据语言模型激励后的所述候选语音识别结果的语言得分,以及语言模型激励后的所述第三语音识别结果的语言得分,从所述候选语音识别结果和所述第三语音识别结果中确定出最终的语音识别结果。
作为一种可选的实施方式,所述待识别语音所属场景下的句式解码网络通过如下处理构建得到:
通过对所述待识别语音所属场景下的语料数据进行句式归纳和语法槽定义处理,构建文本句式网络;其中,所述文本句式网络中包括对应非垂类关键字的普通语法槽和对应垂类关键字的替换语法槽,所述替换语法槽中存储与垂类关键字对应的占位符;
对所述文本句式网络的普通语法槽中的词条进行分词并按照分词结果进行单词节点扩展,得到词级句式解码网络;
将所述词级句式解码网络的普通语法槽中的各个单词替换为对应的发音,并按照单词对应的发音进行发音节点扩展,得到发音级句式解码网络,所述发音级句式解码网络作为所述待识别语音所属场景下的句式解码网络。
作为一种可选的实施方式,对所述文本句式网络的普通语法槽中的词条进行分词并按照分词结果进行单词节点扩展,得到词级句式解码网络,包括:
对所述文本句式网络中的普通语法槽中的每一个词条,分别进行分词,得到每个词条对应的各个单词;
利用对应同一词条的各个单词进行单词节点扩展,得到对应该词条的单词串;
将对应同一普通语法槽的各个词条对应的单词串进行并联,得到词级句式解码网络。
作为一种可选的实施方式,将所述词级句式解码网络的普通语法槽中的各个单词替换为对应的发音,并按照单词对应的发音进行发音节点扩展,得到发音级句式解码网络,包括:
将所述词级句式解码网络的普通语法槽中的各个单词,分别替换为对应的发音;
对所述词级句式解码网络中的每个发音,分别进行发音单元划分,并利用发音对应的各个发音单元进行发音节点扩展,得到发音级句式解码网络。
作为一种可选的实施方式,基于所述待识别语音所属场景下的垂类关键字集合及句式解码网络,构建语音识别解码网络,包括:
获取预先构建的所述待识别语音所属场景下的句式解码网络;
基于待识别语音所属场景下的垂类关键字集合中的垂类关键字,构建垂类关键字网络;
将所述垂类关键字网络插入所述句式解码网络,得到语音识别解码网络。
作为一种可选的实施方式,所述基于待识别语音所属场景下的垂类关键字集合中的垂类关键字,构建垂类关键字网络,包括:
基于待识别语音所属场景下的垂类关键字集合中的各个垂类关键字,构建词级垂类关键字网络;
将所述词级垂类关键字网络中的各个单词替换为对应的发音,并按照单词对应的发音进行发音节点扩展,得到发音级垂类关键字网络。
作为一种可选的实施方式,所述垂类关键字网络和所述句式解码网络均由节点和连接节点的有向弧构成,在节点间的有向弧上存储发音信息或占位符;
将所述垂类关键字网络插入所述句式解码网络,得到语音识别解码网络,包括:
通过有向弧将所述垂类关键字网络与所述句式解码网络的替换语法槽的左右节点分别连接,构建得到语音识别解码网络。
作为一种可选的实施方式,所述通过有向弧将所述垂类关键字网络与所述句式解码网络的替换语法槽的左右节点分别连接,构建得到语音识别解码网络,包括:
通过将所述垂类关键字网络的开始节点的每条出弧的右节点与所述替换语法槽的左节点通过有向弧连接，以及，将所述垂类关键字网络的结束节点的每条入弧的左节点与所述替换语法槽的右节点通过有向弧连接，构建得到语音识别解码网络。
作为一种可选的实施方式,所述垂类关键字网络中的每个关键字的第一条弧和最后一条弧上分别存储与该关键字对应的唯一标识;
所述通过将所述垂类关键字网络的开始节点的每条出弧的右节点与所述替换语法槽的左节点通过有向弧连接,以及,将所述垂类关键字网络的结束节点的每条入弧的左节点与所述替换语法槽的右节点通过有向弧连接,构建得到语音识别解码网络,包括:
遍历所述垂类关键字网络的开始节点的每条出弧,对于遍历到的每一条出弧,根据该出弧上的唯一标识以及已入网关键字信息集合,确定该唯一标识对应的关键字是否已插入句式解码网络;其中,所述已入网关键字信息集合中,对应存储已经插入句式解码网络的关键字的唯一标识,以及该唯一标识所在的有向弧在该句式解码网络中的左右节点编号;
如果该唯一标识对应的关键字未插入句式解码网络,则将遍历到的该出弧的右节点与所述替换语法槽的左节点通过有向弧连接,在该有向弧上存储遍历到的该出弧上的发音信息;
以及,
遍历所述垂类关键字网络的结束节点的每条入弧,对于遍历到的每一条入弧,根据该入弧上的唯一标识以及已入网关键字信息集合,确定该唯一标识对应的关键字是否已插入句式解码网络;
如果该唯一标识对应的关键字未插入句式解码网络,则将遍历到的该入弧的左节点与所述替换语法槽的右节点通过有向弧连接,在该有向弧上存储遍历到的该入弧上的发音信息。
作为一种可选的实施方式,所述通过将所述垂类关键字网络的开始节点的每条出弧的右节点与所述替换语法槽的左节点通过有向弧连接,以及,将所述垂类关键字网络的结束节点的每条入弧的左节点与所述替换语法槽的右节点通过有向弧连接,构建得到语音识别解码网络,还包括:
当所述垂类关键字网络中的关键字被插入所述句式解码网络时,将该关键字的唯一标识,以及该唯一标识所在的有向弧在该句式解码网络中的左右节点编号,对应存储至所述已入网关键字信息集合中。
作为一种可选的实施方式,所述通过将所述垂类关键字网络的开始节点的每条出弧的右节点与所述替换语法槽的左节点通过有向弧连接,以及,将所述垂类关键字网络的结束节点的每条入弧的左节点与所述替换语法槽的右节点通过有向弧连接,构建得到语音识别解码网络,还包括:
遍历所述已入网关键字信息集合中的各个唯一标识;
如果遍历到的唯一标识不是所述待识别语音所属场景下的垂类关键字集合中的任意关键字的唯一标识,则将该唯一标识对应的左右节点编号之间的有向弧断开。
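上述"将关键字网络插入替换语法槽并维护已入网关键字信息集合、再对失效关键字剪枝"的过程，可以用下面这段高度简化的Python示意代码来理解。代码中弧的字段命名（uid、pron、from、to等）和网络的数据结构都是为说明流程而假设的，并非本申请方案限定的实现方式：

```python
def insert_keyword(decode_arcs, slot_left, slot_right, first_arc, last_arc, inserted):
    """first_arc/last_arc为关键字网络中该关键字的第一条弧和最后一条弧，
    其上存储发音信息与该关键字的唯一标识uid；inserted为已入网关键字信息集合。"""
    uid = first_arc["uid"]
    if uid in inserted:                       # 已插入句式解码网络的关键字不重复插入
        return
    # 开始节点出弧的右节点与替换语法槽左节点相连，新弧上存储该出弧的发音信息
    decode_arcs.append({"from": slot_left, "to": first_arc["right_node"],
                        "pron": first_arc["pron"], "uid": uid})
    # 结束节点入弧的左节点与替换语法槽右节点相连，新弧上存储该入弧的发音信息
    decode_arcs.append({"from": last_arc["left_node"], "to": slot_right,
                        "pron": last_arc["pron"], "uid": uid})
    inserted[uid] = (slot_left, slot_right)   # 记录唯一标识及其左右节点编号

def prune_stale_keywords(decode_arcs, inserted, current_uids):
    """遍历已入网关键字信息集合，断开不在当前垂类关键字集合中的关键字对应的有向弧。"""
    for uid in list(inserted):
        if uid not in current_uids:
            decode_arcs[:] = [arc for arc in decode_arcs if arc.get("uid") != uid]
            del inserted[uid]
```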
作为一种可选的实施方式,上述的语音识别装置还包括:
结果修正单元,用于根据所述第二语音识别结果,对所述第一语音识别结果进行修正。
作为一种可选的实施方式,根据所述第二语音识别结果,对所述第一语音识别结果进行修正,包括:
利用所述第二语音识别结果中的参考文本内容,对所述第一语音识别结果中的非垂类关键字内容进行修正,得到修正后的第一语音识别结果;
其中,所述参考文本内容,是所述第二语音识别结果中的、与所述第一语音识别结果中的非垂类关键字内容相匹配的文本内容。
作为一种可选的实施方式,利用所述第二语音识别结果中的参考文本内容,对所述第一语音识别结果中的非垂类关键字内容进行修正,得到修正后的第一语音识别结果,包括:
从所述第一语音识别结果中确定出垂类关键字内容和非垂类关键字内容,以及,从所述第二语音识别结果中确定出与所述第一语音识别结果中的非垂类关键字内容对应的文本内容,作为参考文本内容;
根据所述第二语音识别结果中的参考文本内容,以及所述第一语音识别结果中的非垂类关键字内容,确定修正后的非垂类关键字内容;
利用所述修正后的非垂类关键字内容,以及所述垂类关键字内容,组合得到修正后的第一语音识别结果。
作为一种可选的实施方式,从所述第二语音识别结果中确定出与所述第一语音识别结果中的非垂类关键字内容对应的文本内容,作为参考文本内容,包括:
根据编辑距离算法确定所述第一语音识别结果与所述第二语音识别结果之间的编辑距离矩阵;
根据所述编辑距离矩阵,以及所述第一语音识别结果中的非垂类关键字内容,从所述第二语音识别结果中确定出与所述第一语音识别结果中的非垂类关键字内容对应的文本内容,作为参考文本内容。
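其中的编辑距离矩阵可以按标准的动态规划方式计算，下面给出一段示意性的Python实现（按字符粒度对齐，仅作参考，不代表本申请方案限定的具体算法）：

```python
def edit_distance_matrix(a, b):
    """返回 (len(a)+1) x (len(b)+1) 的编辑距离矩阵。"""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # 删除
                          d[i][j - 1] + 1,        # 插入
                          d[i - 1][j - 1] + cost) # 替换
    return d

matrix = edit_distance_matrix("给张三打个电话", "给章三打电话")
# 基于该矩阵回溯对齐路径，即可从第二语音识别结果中定位与第一结果非垂类关键字内容对应的参考文本
```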
作为一种可选的实施方式,所述根据所述第二语音识别结果中的参考文本内容,以及所述第一语音识别结果中的非垂类关键字内容,确定修正后的非垂类关键字内容,包括:
根据所述第二语音识别结果中的参考文本内容与所述第一语音识别结果中的非垂类关键字内容的字符差异,将所述第二语音识别结果中的目标文本内容或所述第一语音识别结果中的非垂类关键字内容,确定为修正后的非垂类关键字内容;
其中,所述第二语音识别结果中的目标文本内容,是指所述第二语音识别结果中的、与所述第一语音识别结果中的非垂类关键字内容的位置相对应的文本内容。
作为一种可选的实施方式,根据所述第二语音识别结果中的参考文本内容与所述第一语音识别结果中的非垂类关键字内容的字符差异,将所述第二语音识别结果中的目标文本内容或所述第一语音识别结果中的非垂类关键字内容,确定为修正后的非垂类关键字内容,包括:
将所述第二语音识别结果中的参考文本内容与所述第一语音识别结果中的非垂类关键字内容进行比对,确定所述第二语音识别结果中的参考文本内容与所述第一语音识别结果中的非垂类关键字内容是否相同;
如果相同,则将所述第二语音识别结果中的目标文本内容,确定为修正后的非垂类关键字内容;
如果不同,则确定所述第二语音识别结果是否比所述第一语音识别结果中的非垂类关键字内容的字符数量多,并且两者的字符数量差异不超过设定阈值;
如果所述第二语音识别结果比所述第一语音识别结果中的非垂类关键字内容的字符数量多,并且两者的字符数量差异不超过设定阈值,则将所述第二语音识别结果中的目标文本内容,确定为修正后的非垂类关键字内容;
如果所述第二语音识别结果比所述第一语音识别结果中的非垂类关键字内容的字符数量少,和/或两者的字符数量差异超过设定阈值,则将所述第一语音识别结果中的非垂类关键字内容,确定为修正后的非垂类关键字内容。
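上述基于字符数量差异的修正决策，可以概括为如下示意性的Python代码，其中阈值threshold的取值仅为举例假设：

```python
def choose_corrected_text(ref_text, first_non_kw, second_result, target_text, threshold=2):
    """ref_text: 第二结果中的参考文本内容；target_text: 第二结果中与第一结果
    非垂类关键字内容位置相对应的目标文本内容；返回修正后的非垂类关键字内容。"""
    if ref_text == first_non_kw:
        return target_text
    diff = len(second_result) - len(first_non_kw)
    if 0 < diff <= threshold:       # 第二结果字符更多且差异不超过设定阈值
        return target_text
    return first_non_kw             # 否则保留第一结果中的非垂类关键字内容
```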
作为一种可选的实施方式,所述至少从所述第一语音识别结果和所述第二语音识别结果中,确定出最终的语音识别结果,包括:
通过将所述第一语音识别结果和所述第二语音识别结果进行对比,确定所述第一语音识别结果和所述第二语音识别结果的匹配度,并基于所述第一语音识别结果和所述第二语音识别结果的匹配度确定所述第一语音识别结果的置信度;
当所述第一语音识别结果的置信度大于预设的置信度阈值时,根据所述第一语音识别结果的声学得分和所述第二语音识别结果的声学得分,从所述第一语音识别结果和所述第二语音识别结果中选出最终的语音识别结果;
当所述第一语音识别结果的置信度不大于预设的置信度阈值时,对所述第一语音识别结果进行声学得分激励,并根据激励后的第一语音识别结果的声学得分以及所述第二语音识别结果的声学得分,从所述第一语音识别结果和所述第二语音识别结果中选出最终的语音识别结果。
作为一种可选的实施方式,所述基于所述第一语音识别结果和所述第二语音识别结果的匹配度确定所述第一语音识别结果的置信度,包括:
判断所述第一语音识别结果和所述第二语音识别结果的匹配度是否大于设定的匹配度阈值;
如果大于设定的匹配度阈值,则根据所述第一语音识别结果的各个帧的声学得分,计算确定所述第一语音识别结果的置信度;
如果不大于设定的匹配度阈值,则利用所述第一语音识别结果中的垂类关键字内容以及所述第二语音识别结果构建得到解码网络,利用该解码网络重新对所述声学状态序列进行解码,并利用解码结果更新所述第一语音识别结果;
根据更新后的第一语音识别结果的各个帧的声学得分,计算确定第一语音识别结果的置信度。
作为一种可选的实施方式,当所述第一语音识别结果和所述第二语音识别结果的声学得分相同时,将所述第一语音识别结果和所述第二语音识别结果共同作为最终的语音识别结果。
作为一种可选的实施方式,对所述第一语音识别结果进行声学得分激励,包括:
至少根据所述第一语音识别结果中的垂类关键字内容和非垂类关键字内容,确定声学激励系数;
利用所述声学激励系数,对所述第一语音识别结果中的垂类关键字内容的声学得分进行更新;
根据更新后的所述第一语音识别结果中的垂类关键字内容的声学得分,以及所述第一语音识别结果中的非垂类关键字内容的声学得分,重新计算确定所述第一语音识别结果的声学得分。
作为一种可选的实施方式,所述至少根据所述第一语音识别结果中的垂类关键字内容和非垂类关键字内容,确定声学激励系数,包括:
根据待识别语音所属场景下的声学得分激励先验系数、所述第一语音识别结果中的垂类关键字的字符数量、音素数量,以及所述第一语音识别结果的全部字符数量、全部音素数量,计算确定声学激励系数。
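该实施方式给出了计算声学激励系数所依据的各项因素，但未限定具体的组合公式。下面的Python代码仅给出一种按字符占比与音素占比对先验系数加权的示例性假设写法，用于说明这些量如何共同参与计算，并非本申请方案规定的公式：

```python
def acoustic_excitation_coef(prior, kw_chars, kw_phones, total_chars, total_phones):
    """prior为场景下的声学得分激励先验系数，其余参数为垂类关键字及整条结果的字符/音素数量。"""
    char_ratio = kw_chars / total_chars      # 垂类关键字字符占比
    phone_ratio = kw_phones / total_phones   # 垂类关键字音素占比
    return 1.0 + prior * (char_ratio + phone_ratio) / 2.0

coef = acoustic_excitation_coef(prior=0.2, kw_chars=2, kw_phones=6,
                                total_chars=6, total_phones=16)
```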
作为一种可选的实施方式,所述至少根据所述第一语音识别结果中的垂类关键字内容和非垂类关键字内容,确定声学激励系数,包括:
根据所述第一语音识别结果中的垂类关键字内容的音素数量和声学得分，以及所述第一语音识别结果中的非垂类关键字内容的音素数量和声学得分，计算得到所述第一语音识别结果中的垂类关键字内容的得分置信度；
至少根据所述第一语音识别结果中的垂类关键字内容的得分置信度,确定声学激励系数。
作为一种可选的实施方式,所述至少根据所述第一语音识别结果中的垂类关键字内容的得分置信度,确定声学激励系数,包括:
根据所述第一语音识别结果中的垂类关键字内容的得分置信度,以及预先确定的声学激励系数与识别效果和识别误触发之间的关系,确定声学激励系数。
作为一种可选的实施方式,对所述第三语音识别结果进行语言模型激励,包括:
根据待识别语音所属场景下的垂类关键字集合以及与该场景对应的类别标签，对所述第三语音识别结果进行路径扩展；所述类别标签，通过对语音识别场景进行聚类而确定；
根据与所述类别标签对应的聚类语言模型对训练语料的识别结果,分别确定所述第三语音识别结果以及所述第三语音识别结果的扩展路径的语言模型得分;其中,所述聚类语言模型通过对目标语料进行语音识别训练得到,所述目标语料中的垂类关键字均被替换为所述类别标签;
根据所述第三语音识别结果的语言模型得分,以及所述第三语音识别结果的扩展路径的语言模型得分,确定所述第三语音识别结果的语言模型激励后的语言得分。
作为一种可选的实施方式,所述根据待识别语音所属场景下的垂类关键字集合以及与该场景对应的类别标签,对所述第三语音识别结果进行路径扩展,包括:
将所述第三语音识别结果中的垂类关键字,与待识别语音所属场景下的垂类关键字集合中的垂类关键字分别进行比对;
如果所述第三语音识别结果中的垂类关键字与所述垂类关键字集合中的任意垂类关键字相匹配,则在所述第三语音识别结果中的垂类关键字所在槽位的左右节点之间扩展新路径,并在该新路径上存储与待识别语音所属场景对应的类别标签。
具体的,上述的各语音识别装置的实施例中的各个单元的具体工作内容,请参见上述的语音识别方法的相应步骤的处理内容,此处不再重复。
本申请另一实施例还提出一种语音识别设备,参见图17所示,该设备包括:
存储器200和处理器210;
其中,所述存储器200与所述处理器210连接,用于存储程序;
所述处理器210,用于通过运行所述存储器200中存储的程序,实现上述任一实施例公开的语音识别方法。
具体的,上述语音识别设备还可以包括:总线、通信接口220、输入设备230和输出设备240。
处理器210、存储器200、通信接口220、输入设备230和输出设备240通过总线相互连接。其中:
总线可包括一通路,在计算机系统各个部件之间传送信息。
处理器210可以是通用处理器，例如通用中央处理器（CPU）、微处理器等，也可以是专用集成电路（application-specific integrated circuit，ASIC），或一个或多个用于控制本发明方案程序执行的集成电路；还可以是数字信号处理器（DSP）、现场可编程门阵列（FPGA）或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。
处理器210可包括主处理器,还可包括基带芯片、调制解调器等。
存储器200中保存有执行本发明技术方案的程序,还可以保存有操作系统和其他关键业务。具体地,程序可以包括程序代码,程序代码包括计算机操作指令。更具体的,存储器200可以包括只读存储器(read-only memory,ROM)、可存储静态信息和指令的其他类型的静态存储设备、随机存取存储器(random access memory,RAM)、可存储信息和指令的其他类型的动态存储设备、磁盘存储器、flash等等。
输入设备230可包括接收用户输入的数据和信息的装置,例如键盘、鼠标、摄像头、扫描仪、光笔、语音输入装置、触摸屏、计步器或重力感应器等。
输出设备240可包括允许输出信息给用户的装置,例如显示屏、打印机、扬声器等。
通信接口220可包括使用任何收发器一类的装置,以便与其他设备或通信网络通信,如以太网,无线接入网(RAN),无线局域网(WLAN)等。
处理器210执行存储器200中所存放的程序，以及调用其他设备，可用于实现本申请实施例所提供的语音识别方法的各个步骤。
本申请另一实施例还提供了一种存储介质,该存储介质上存储有计算机程序,该计算机程序被处理器运行时,实现上述任一实施例提供的语音识别方法的各个步骤。
具体的,上述的语音识别设备的各个部分的具体工作内容,以及上述的存储介质上的计算机程序被处理器运行时的具体处理内容,均可以参见上述的语音识别方法的各个实施例的内容,此处不再赘述。
对于前述的各方法实施例,为了简单描述,故将其都表述为一系列的动作组合,但是本领域技术人员应该知悉,本申请并不受所描述的动作顺序的限制,因为依据本申请,某些步骤可以采用其他顺序或者同时进行。其次,本领域技术人员也应该知悉,说明书中所描述的实施例均属于优选实施例,所涉及的动作和模块并不一定是本申请所必须的。
本说明书中的各个实施例均采用递进的方式描述,每个实施例重点说明的都是与其他实施例的不同之处,各个实施例之间相同相似的部分互相参见即可。对于装置类实施例而言,由于其与方法实施例基本相似,所以描述的比较简单,相关之处参见方法实施例的部分说明即可。
本申请各实施例方法中的步骤可以根据实际需要进行顺序调整、合并和删减,各实施例中记载的技术特征可以进行替换或者组合。
本申请各实施例的装置及终端中的模块和子模块可以根据实际需要进行合并、划分和删减。
本申请所提供的几个实施例中,应该理解到,所揭露的终端,装置和方法,可以通过其它的方式实现。例如,以上所描述的终端实施例仅仅是示意性的,例如,模块或子模块的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个子模块或模块可以结合或者可以集成到另一个模块,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或模块的间接耦合或通信连接,可以是电性,机械或其它的形式。
作为分离部件说明的模块或子模块可以是或者也可以不是物理上分开的,作为模块或子模块的部件可以是或者也可以不是物理模块或子模块,即可以位于一个地方,或者也可以分布到多个网络模块或子模块上。可以根据实际的需要选择其中的部分或者全部模块或子模块来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能模块或子模块可以集成在一个处理模块中,也可以是各个模块或子模块单独物理存在,也可以两个或两个以上模块或子模块集成在一个模块中。上述集成的模块或子模块既可以采用硬件的形式实现,也可以采用软件功能模块或子模块的形式实现。
专业人员还可以进一步意识到，结合本文中所公开的实施例描述的各示例的单元及算法步骤，能够以电子硬件、计算机软件或者二者的结合来实现，为了清楚地说明硬件和软件的可互换性，在上述说明中已经按照功能一般性地描述了各示例的组成及步骤。这些功能究竟以硬件还是软件方式来执行，取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能，但是这种实现不应认为超出本申请的范围。
结合本文中所公开的实施例描述的方法或算法的步骤可以直接用硬件、处理器执行的软件单元,或者二者的结合来实施。软件单元可以置于随机存储器(RAM)、内存、只读存储器(ROM)、电可编程ROM、电可擦除可编程ROM、寄存器、硬盘、可移动磁盘、CD-ROM、或技术领域内所公知的任意其它形式的存储介质中。
最后,还需要说明的是,在本文中,诸如第一和第二等之类的关系术语仅仅用来将一个实体或者操作与另一个实体或操作区分开来,而不一定要求或者暗示这些实体或操作之间存在任何这种实际的关系或者顺序。而且,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、物品或者设备不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、物品或者设备所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括所述要素的过程、方法、物品或者设备中还存在另外的相同要素。
对所公开的实施例的上述说明,使本领域专业技术人员能够实现或使用本申请。对这些实施例的多种修改对本领域的专业技术人员来说将是显而易见的,本文中所定义的一般原理可以在不脱离本申请的精神或范围的情况下,在其它实施例中实现。因此,本申请将不会被限制于本文所示的这些实施例,而是要符合与本文所公开的原理和新颖特点相一致的最宽的范围。

Claims (28)

  1. 一种语音识别方法,其特征在于,包括:
    获取待识别语音的声学状态序列;
    基于所述待识别语音所属场景下的垂类关键字集合及句式解码网络,构建语音识别解码网络,其中,所述句式解码网络至少通过对所述待识别语音所属场景下的文本语料进行句式归纳处理构建得到;
    利用所述语音识别解码网络对所述声学状态序列进行解码,得到语音识别结果。
  2. 根据权利要求1所述的方法,其特征在于,基于所述待识别语音所属场景下的垂类关键字集合及句式解码网络,构建语音识别解码网络,包括:
    将所述待识别语音所属场景下的垂类关键字集合传入云端服务器,以使所述云端服务器基于所述待识别语音所属场景下的垂类关键字集合及句式解码网络,构建语音识别解码网络。
  3. 根据权利要求1所述的方法,其特征在于,所述语音识别结果作为第一语音识别结果;
    所述方法还包括:
    利用通用语音识别模型对所述声学状态序列进行解码,得到第二语音识别结果;
    至少从所述第一语音识别结果和所述第二语音识别结果中,确定出最终的语音识别结果。
  4. 根据权利要求3所述的方法,其特征在于,所述方法还包括:
    通过预先训练的场景定制模型,对所述声学状态序列进行解码得到第三语音识别结果;其中,所述场景定制模型,通过对所述待识别语音所属场景下的语音进行语音识别训练得到;
    所述至少从所述第一语音识别结果和所述第二语音识别结果中,确定出最终的语音识别结果,包括:
    从所述第一语音识别结果、所述第二语音识别结果和所述第三语音识别结果中,确定出最终的语音识别结果。
  5. 根据权利要求4所述的方法，其特征在于，从所述第一语音识别结果、所述第二语音识别结果和所述第三语音识别结果中，确定出最终的语音识别结果，包括：
    分别对所述第一语音识别结果、所述第二语音识别结果和所述第三语音识别结果进行语言模型激励;
    根据激励后的第一语音识别结果、第二语音识别结果和第三语音识别结果的语言得分,从所述第一语音识别结果、所述第二语音识别结果和所述第三语音识别结果中确定出最终的语音识别结果。
  6. 根据权利要求4所述的方法,其特征在于,从所述第一语音识别结果、所述第二语音识别结果和所述第三语音识别结果中,确定出最终的语音识别结果,包括:
    对所述第一语音识别结果进行声学得分激励,以及,对所述第三语音识别结果进行语言模型激励;
    根据声学得分激励后的第一语音识别结果的声学得分,以及所述第二语音识别结果的声学得分,从所述第一语音识别结果和所述第二语音识别结果中确定出候选语音识别结果;
    对所述候选语音识别结果进行语言模型激励;
    根据语言模型激励后的所述候选语音识别结果的语言得分,以及语言模型激励后的所述第三语音识别结果的语言得分,从所述候选语音识别结果和所述第三语音识别结果中确定出最终的语音识别结果。
  7. 一种语音识别方法,其特征在于,包括:
    获取待识别语音的声学状态序列;
    利用语音识别解码网络对所述声学状态序列进行解码,得到第一语音识别结果,以及,利用通用语音识别模型对所述声学状态序列进行解码,得到第二语音识别结果;所述语音识别解码网络基于所述待识别语音所属场景下的垂类关键字集合及句式解码网络构建得到;
    对所述第一语音识别结果进行声学得分激励;
    至少从激励后的第一语音识别结果以及所述第二语音识别结果中,确定出最终的语音识别结果。
  8. 根据权利要求7所述的方法,其特征在于,所述方法还包括:
    通过预先训练的场景定制模型，对所述声学状态序列进行解码得到第三语音识别结果；其中，所述场景定制模型，通过对所述待识别语音所属场景下的语音进行语音识别训练得到；
    所述至少从激励后的第一语音识别结果以及所述第二语音识别结果中,确定出最终的语音识别结果,包括:
    从激励后的第一语音识别结果、所述第二语音识别结果和所述第三语音识别结果中,确定出最终的语音识别结果。
  9. 根据权利要求8所述的方法,其特征在于,所述从激励后的第一语音识别结果、所述第二语音识别结果和所述第三语音识别结果中,确定出最终的语音识别结果,包括:
    根据声学得分激励后的第一语音识别结果的声学得分,以及所述第二语音识别结果的声学得分,从所述第一语音识别结果和所述第二语音识别结果中确定出候选语音识别结果;
    对所述候选语音识别结果以及所述第三语音识别结果分别进行语言模型激励;
    根据语言模型激励后的所述候选语音识别结果的语言得分,以及语言模型激励后的所述第三语音识别结果的语言得分,从所述候选语音识别结果和所述第三语音识别结果中确定出最终的语音识别结果。
  10. 根据权利要求1至9中任意一项所述的方法,其特征在于,所述待识别语音所属场景下的句式解码网络通过如下处理构建得到:
    通过对所述待识别语音所属场景下的语料数据进行句式归纳和语法槽定义处理,构建文本句式网络;其中,所述文本句式网络中包括对应非垂类关键字的普通语法槽和对应垂类关键字的替换语法槽,所述替换语法槽中存储与垂类关键字对应的占位符;
    对所述文本句式网络的普通语法槽中的词条进行分词并按照分词结果进行单词节点扩展,得到词级句式解码网络;
    将所述词级句式解码网络的普通语法槽中的各个单词替换为对应的发音,并按照单词对应的发音进行发音节点扩展,得到发音级句式解码网络,所述发音级句式解码网络作为所述待识别语音所属场景下的句式解码网络。
  11. 根据权利要求1至9中任意一项所述的方法,其特征在于,基于所述待识别语音所属场景下的垂类关键字集合及句式解码网络,构建语音识别解码网络,包括:
    获取预先构建的所述待识别语音所属场景下的句式解码网络;
    基于待识别语音所属场景下的垂类关键字集合中的垂类关键字,构建垂类关键字网络;
    将所述垂类关键字网络插入所述句式解码网络,得到语音识别解码网络。
  12. 根据权利要求11所述的方法,其特征在于,所述基于待识别语音所属场景下的垂类关键字集合中的垂类关键字,构建垂类关键字网络,包括:
    基于待识别语音所属场景下的垂类关键字集合中的各个垂类关键字,构建词级垂类关键字网络;
    将所述词级垂类关键字网络中的各个单词替换为对应的发音,并按照单词对应的发音进行发音节点扩展,得到发音级垂类关键字网络。
  13. 根据权利要求11所述的方法,其特征在于,所述垂类关键字网络和所述句式解码网络均由节点和连接节点的有向弧构成,在节点间的有向弧上存储发音信息或占位符;
    将所述垂类关键字网络插入所述句式解码网络,得到语音识别解码网络,包括:
    通过有向弧将所述垂类关键字网络与所述句式解码网络的替换语法槽的左右节点分别连接,构建得到语音识别解码网络。
  14. 根据权利要求13所述的方法,其特征在于,所述垂类关键字网络中的每个关键字的第一条弧和最后一条弧上分别存储与该关键字对应的唯一标识;
    当所述垂类关键字网络中的关键字被插入所述句式解码网络时，将该关键字的唯一标识，以及该唯一标识所在的有向弧在该句式解码网络中的左右节点编号，对应存储至已入网关键字信息集合中；其中，所述已入网关键字信息集合中，对应存储已经插入句式解码网络的关键字的唯一标识，以及该唯一标识所在的有向弧在该句式解码网络中的左右节点编号。
  15. 根据权利要求14所述的方法,其特征在于,还包括:
    遍历所述已入网关键字信息集合中的各个唯一标识;
    如果遍历到的唯一标识不是所述待识别语音所属场景下的垂类关键字集合中的任意关键字的唯一标识,则将该唯一标识对应的左右节点编号之间的有向弧断开。
  16. 根据权利要求3至9中任意一项所述的方法,其特征在于,所述方法还包括:
    利用所述第二语音识别结果中的参考文本内容,对所述第一语音识别结果中的非垂类关键字内容进行修正,得到修正后的第一语音识别结果;
    其中,所述参考文本内容,是所述第二语音识别结果中的、与所述第一语音识别结果中的非垂类关键字内容相匹配的文本内容。
  17. 根据权利要求16所述的方法,其特征在于,利用所述第二语音识别结果中的参考文本内容,对所述第一语音识别结果中的非垂类关键字内容进行修正,得到修正后的第一语音识别结果,包括:
    从所述第一语音识别结果中确定出垂类关键字内容和非垂类关键字内容,以及,从所述第二语音识别结果中确定出与所述第一语音识别结果中的非垂类关键字内容对应的文本内容,作为参考文本内容;
    根据所述第二语音识别结果中的参考文本内容,以及所述第一语音识别结果中的非垂类关键字内容,确定修正后的非垂类关键字内容;
    利用所述修正后的非垂类关键字内容,以及所述垂类关键字内容,组合得到修正后的第一语音识别结果。
  18. 根据权利要求17所述的方法,其特征在于,从所述第二语音识别结果中确定出与所述第一语音识别结果中的非垂类关键字内容对应的文本内容,作为参考文本内容,包括:
    根据编辑距离算法确定所述第一语音识别结果与所述第二语音识别结果之间的编辑距离矩阵;
    根据所述编辑距离矩阵,以及所述第一语音识别结果中的非垂类关键字内容,从所述第二语音识别结果中确定出与所述第一语音识别结果中的非垂类关键字内容对应的文本内容,作为参考文本内容。
  19. 根据权利要求17所述的方法,其特征在于,所述根据所述第二语音识别结果中的参考文本内容,以及所述第一语音识别结果中的非垂类关键字内容,确定修正后的非垂类关键字内容,包括:
    确定所述第二语音识别结果中的参考文本内容与所述第一语音识别结果中的非垂类关键字内容是否相同;
    如果相同，则将所述第二语音识别结果中的目标文本内容，确定为修正后的非垂类关键字内容；
    如果不同,则确定所述第二语音识别结果是否比所述第一语音识别结果中的非垂类关键字内容的字符数量多,并且两者的字符数量差异不超过设定阈值;
    如果所述第二语音识别结果比所述第一语音识别结果中的非垂类关键字内容的字符数量多,并且两者的字符数量差异不超过设定阈值,则将所述第二语音识别结果中的目标文本内容,确定为修正后的非垂类关键字内容;
    如果所述第二语音识别结果比所述第一语音识别结果中的非垂类关键字内容的字符数量少,和/或两者的字符数量差异超过设定阈值,则将所述第一语音识别结果中的非垂类关键字内容,确定为修正后的非垂类关键字内容;
    其中,所述第二语音识别结果中的目标文本内容,是指所述第二语音识别结果中的、与所述第一语音识别结果中的非垂类关键字内容的位置相对应的文本内容。
  20. 根据权利要求3所述的方法,其特征在于,所述至少从所述第一语音识别结果和所述第二语音识别结果中,确定出最终的语音识别结果,包括:
    确定所述第一语音识别结果的置信度是否大于预设的置信度阈值;
    当所述第一语音识别结果的置信度大于预设的置信度阈值时,根据所述第一语音识别结果的声学得分和所述第二语音识别结果的声学得分,从所述第一语音识别结果和所述第二语音识别结果中选出最终的语音识别结果;
    当所述第一语音识别结果的置信度不大于预设的置信度阈值时,对所述第一语音识别结果进行声学得分激励,并根据激励后的第一语音识别结果的声学得分以及所述第二语音识别结果的声学得分,从所述第一语音识别结果和所述第二语音识别结果中选出最终的语音识别结果。
  21. 根据权利要求7或20所述的方法,其特征在于,当所述第一语音识别结果和所述第二语音识别结果的声学得分相同时,将所述第一语音识别结果和所述第二语音识别结果共同作为最终的语音识别结果。
  22. 根据权利要求6或4或20所述的方法,其特征在于,对所述第一语音识别结果进行声学得分激励,包括:
    至少根据所述第一语音识别结果中的垂类关键字内容和非垂类关键字内容,确定声学激励系数;
    利用所述声学激励系数,对所述第一语音识别结果中的垂类关键字内容的声学得分进行更新;
    根据更新后的所述第一语音识别结果中的垂类关键字内容的声学得分,以及所述第一语音识别结果中的非垂类关键字内容的声学得分,重新计算确定所述第一语音识别结果的声学得分。
  23. 根据权利要求5或6或9所述的方法,其特征在于,对所述第三语音识别结果进行语言模型激励,包括:
    根据待识别语音所属场景下的垂类关键字集合以及与该场景对应的类别标签，对所述第三语音识别结果进行路径扩展；所述类别标签，通过对语音识别场景进行聚类而确定；
    根据与所述类别标签对应的聚类语言模型对训练语料的识别结果,分别确定所述第三语音识别结果以及所述第三语音识别结果的扩展路径的语言模型得分;其中,所述聚类语言模型通过对目标语料进行语音识别训练得到,所述目标语料中的垂类关键字均被替换为所述类别标签;
    根据所述第三语音识别结果的语言模型得分,以及所述第三语音识别结果的扩展路径的语言模型得分,确定所述第三语音识别结果的语言模型激励后的语言得分。
  24. 根据权利要求23所述的方法,其特征在于,所述根据待识别语音所属场景下的垂类关键字集合以及与该场景对应的类别标签,对所述第三语音识别结果进行路径扩展,包括:
    将所述第三语音识别结果中的垂类关键字,与待识别语音所属场景下的垂类关键字集合中的垂类关键字分别进行比对;
    如果所述第三语音识别结果中的垂类关键字与所述垂类关键字集合中的任意垂类关键字相匹配,则在所述第三语音识别结果中的垂类关键字所在槽位的左右节点之间扩展新路径,并在该新路径上存储与待识别语音所属场景对应的类别标签。
  25. 一种语音识别装置,其特征在于,包括:
    声学识别单元,用于获取待识别语音的声学状态序列;
    网络构建单元，用于基于所述待识别语音所属场景下的垂类关键字集合及句式解码网络，构建语音识别解码网络，其中，所述句式解码网络至少通过对所述待识别语音所属场景下的文本语料进行句式归纳处理构建得到；
    解码处理单元,用于利用所述语音识别解码网络对所述声学状态序列进行解码,得到语音识别结果。
  26. 一种语音识别装置,其特征在于,包括:
    声学识别单元,用于获取待识别语音的声学状态序列;
    多维解码单元,用于利用语音识别解码网络对所述声学状态序列进行解码,得到第一语音识别结果,以及,利用通用语音识别模型对所述声学状态序列进行解码,得到第二语音识别结果;所述语音识别解码网络基于所述待识别语音所属场景下的垂类关键字集合及句式解码网络构建得到;
    声学激励单元,用于对所述第一语音识别结果进行声学得分激励;
    决策处理单元,用于至少从激励后的第一语音识别结果以及所述第二语音识别结果中,确定出最终的语音识别结果。
  27. 一种语音识别设备,其特征在于,包括:
    存储器和处理器;
    所述存储器与所述处理器连接,用于存储程序;
    所述处理器,用于通过运行所述存储器中存储的程序,实现如权利要求1至24中任意一项所述的语音识别方法。
  28. 一种存储介质,其特征在于,所述存储介质上存储有计算机程序,所述计算机程序被处理器运行时,实现如权利要求1至24中任意一项所述的语音识别方法。
PCT/CN2021/133434 2021-10-29 2021-11-26 语音识别方法、装置、设备及存储介质 WO2023070803A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP2024525244A JP2024537481A (ja) 2021-10-29 2021-11-26 音声認識方法、装置、設備及び記憶媒体
EP21962147.1A EP4425484A1 (en) 2021-10-29 2021-11-26 Speech recognition method and apparatus, device, and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111274880.8A CN113920999A (zh) 2021-10-29 2021-10-29 语音识别方法、装置、设备及存储介质
CN202111274880.8 2021-10-29

Publications (1)

Publication Number Publication Date
WO2023070803A1 true WO2023070803A1 (zh) 2023-05-04

Family

ID=79243888

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/133434 WO2023070803A1 (zh) 2021-10-29 2021-11-26 语音识别方法、装置、设备及存储介质

Country Status (4)

Country Link
EP (1) EP4425484A1 (zh)
JP (1) JP2024537481A (zh)
CN (1) CN113920999A (zh)
WO (1) WO2023070803A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117496972A (zh) * 2023-12-29 2024-02-02 广州小鹏汽车科技有限公司 一种音频识别方法、音频识别装置、车辆和计算机设备
CN117558270A (zh) * 2024-01-11 2024-02-13 腾讯科技(深圳)有限公司 语音识别方法、装置、关键词检测模型的训练方法和装置

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115472165A (zh) * 2022-07-07 2022-12-13 脸萌有限公司 用于语音识别的方法、装置、设备和存储介质
WO2024188235A1 (zh) * 2023-03-13 2024-09-19 北京罗克维尔斯科技有限公司 语音识别方法、装置、电子设备、存储介质及车辆

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150340034A1 (en) * 2014-05-22 2015-11-26 Google Inc. Recognizing speech using neural networks
CN105845133A (zh) * 2016-03-30 2016-08-10 乐视控股(北京)有限公司 语音信号处理方法及装置
CN107808662A (zh) * 2016-09-07 2018-03-16 阿里巴巴集团控股有限公司 更新语音识别用的语法规则库的方法及装置
CN113515945A (zh) * 2021-04-26 2021-10-19 科大讯飞股份有限公司 一种获取文本信息的方法、装置、设备及存储介质

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150340034A1 (en) * 2014-05-22 2015-11-26 Google Inc. Recognizing speech using neural networks
CN105845133A (zh) * 2016-03-30 2016-08-10 乐视控股(北京)有限公司 语音信号处理方法及装置
CN107808662A (zh) * 2016-09-07 2018-03-16 阿里巴巴集团控股有限公司 更新语音识别用的语法规则库的方法及装置
CN113515945A (zh) * 2021-04-26 2021-10-19 科大讯飞股份有限公司 一种获取文本信息的方法、装置、设备及存储介质

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117496972A (zh) * 2023-12-29 2024-02-02 广州小鹏汽车科技有限公司 一种音频识别方法、音频识别装置、车辆和计算机设备
CN117496972B (zh) * 2023-12-29 2024-04-16 广州小鹏汽车科技有限公司 一种音频识别方法、音频识别装置、车辆和计算机设备
CN117558270A (zh) * 2024-01-11 2024-02-13 腾讯科技(深圳)有限公司 语音识别方法、装置、关键词检测模型的训练方法和装置
CN117558270B (zh) * 2024-01-11 2024-04-02 腾讯科技(深圳)有限公司 语音识别方法、装置、关键词检测模型的训练方法和装置

Also Published As

Publication number Publication date
EP4425484A1 (en) 2024-09-04
CN113920999A (zh) 2022-01-11
JP2024537481A (ja) 2024-10-10

Similar Documents

Publication Publication Date Title
WO2023070803A1 (zh) 语音识别方法、装置、设备及存储介质
KR102648306B1 (ko) 음성 인식 오류 정정 방법, 관련 디바이스들, 및 판독 가능 저장 매체
CN108899013B (zh) 语音搜索方法、装置和语音识别系统
TWI508057B (zh) 語音辨識系統以及方法
WO2020001458A1 (zh) 语音识别方法、装置及系统
US11093110B1 (en) Messaging feedback mechanism
US10152298B1 (en) Confidence estimation based on frequency
US10366690B1 (en) Speech recognition entity resolution
WO2014101826A1 (zh) 一种提高语音识别准确率的方法及系统
JP2005084681A (ja) 意味的言語モデル化および信頼性測定のための方法およびシステム
CN107578771A (zh) 语音识别方法及装置、存储介质、电子设备
WO2014117645A1 (zh) 信息的识别方法和装置
US11532301B1 (en) Natural language processing
US10714087B2 (en) Speech control for complex commands
CN116226338A (zh) 基于检索和生成融合的多轮对话系统及方法
US20220161131A1 (en) Systems and devices for controlling network applications
US11626107B1 (en) Natural language processing
CN108538292A (zh) 一种语音识别方法、装置、设备及可读存储介质
WO2023050541A1 (zh) 音素提取方法、语音识别方法、装置、设备及存储介质
CN113724698B (zh) 语音识别模型的训练方法、装置、设备及存储介质
CN103474063B (zh) 语音辨识系统以及方法
CN115831117A (zh) 实体识别方法、装置、计算机设备和存储介质
KR20130073643A (ko) 개인화된 발음열을 이용한 그룹 매핑 데이터 생성 서버, 음성 인식 서버 및 방법
CN113539241A (zh) 语音识别校正方法及其相应的装置、设备、介质
CN116052657B (zh) 语音识别的字符纠错方法和装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21962147

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2024525244

Country of ref document: JP

Kind code of ref document: A

WWE Wipo information: entry into national phase

Ref document number: 2021962147

Country of ref document: EP

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2021962147

Country of ref document: EP

Effective date: 20240529