WO2020238341A1 - Method, apparatus, device, and computer-readable storage medium for speech recognition - Google Patents

Method, apparatus, device, and computer-readable storage medium for speech recognition

Info

Publication number
WO2020238341A1
Authority
WO
WIPO (PCT)
Prior art keywords
language model
target language
dynamic target
intention
vocabulary
Prior art date
Application number
PCT/CN2020/079522
Other languages
English (en)
French (fr)
Inventor
聂为然
翁富良
黄佑佳
于海
胡束芒
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Priority to JP2021570241A (JP7343087B2)
Priority to EP20814489.9A (EP3965101A4)
Publication of WO2020238341A1
Priority to US17/539,005 (US20220093087A1)

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/04: Segmentation; Word boundary detection
    • G10L 15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063: Training
    • G10L 15/065: Adaptation
    • G10L 15/07: Adaptation to the speaker
    • G10L 15/08: Speech classification or search
    • G10L 15/18: Speech classification or search using natural language modelling
    • G10L 15/1822: Parsing for meaning understanding
    • G10L 15/183: Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L 15/187: Phonemic context, e.g. pronunciation rules, phonotactical constraints or phoneme n-grams
    • G10L 15/19: Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules
    • G10L 15/193: Formal grammars, e.g. finite state automata, context free grammars or word networks
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/26: Speech to text systems
    • G10L 2015/088: Word spotting

Definitions

  • This application relates to the field of artificial intelligence technology, and in particular to methods, apparatuses, devices, and computer-readable storage media for speech recognition.
  • Related technologies provide a speech recognition method. After calling a language model to recognize a voice command and understand the user's intent, the method on the one hand asks the user a question, and on the other hand adjusts the language model according to the question, for example by integrating a vocabulary set related to the question into the language model so that the adjusted language model can recognize the words in the vocabulary set.
  • In this way, the adjusted language model can recognize the user's reply voice and meet user needs.
  • However, users may also utter irrelevant speech when communicating with third parties.
  • For example, in a typical multi-user, multi-scenario setting, when a user interacts with an on-board module in a car or an electric vehicle, irrelevant speech is likely to include interruptions between the user and other users, or speech inserted by other users.
  • The voice recognition system of the vehicle-mounted module will also recognize and understand this irrelevant speech as voice commands or reply voices, which makes the provided services deviate from user needs and results in a poor user experience.
  • the embodiments of the present application provide a voice recognition method, device, equipment, and computer-readable storage medium to overcome the problems of poor recognition effect and poor user experience in related technologies.
  • this application provides a voice recognition method, including:
  • A method for speech recognition includes: obtaining or generating a dynamic target language model according to reply information to a first intention. The dynamic target language model includes a front-end part and a core part: the core part is used to determine possible descriptions related to the reply information, and the front-end part is used to determine descriptions of confirmatory information about the reply information. A voice signal is then acquired and parsed to generate keywords, after which the dynamic target language model can be called to determine a second intention and service content, where the front-end part of the dynamic target language model parses out the second intention according to the keywords, and the core part parses out the service content according to the keywords.
  • the first intention includes the intention obtained by parsing the voice signal of the user after the voice dialogue between the user and the vehicle-mounted module starts.
  • The reply information to the first intention includes one or more reply messages that the vehicle-mounted module returns to the user for the first intention; according to this reply information, the vehicle-mounted module obtains a dynamic target language model including the front-end part and the core part.
  • After the vehicle-mounted module returns one or more reply messages to the user, it will acquire the voice signal again.
  • The voice signal acquired again by the vehicle-mounted module may include the voice signal of the dialogue between the user and the vehicle-mounted module, that is, the voice signal related to the reply message, as well as irrelevant voice signals from dialogue between the user and other users.
  • the vehicle-mounted module parses the acquired voice signal to generate keywords, calls the dynamic target language model, and parses the part of the vocabulary related to the reply message from the generated keywords.
  • the dynamic target language model includes a front-end part and a core part.
  • the front-end part is used to determine the user's description of the confirmation information of the reply message.
  • the confirmation information can include confirmation, correction, and cancellation.
  • The user's second intention can be obtained by parsing the keywords through the front-end part. For example, if the reply to the first intention contains one item and the confirmation information obtained by parsing the keywords includes a confirmation such as "right, yes", it can be confirmed that the user's second intention is the intention indicated by that reply message.
  • the core part is used to determine the possible descriptions related to the reply information.
  • The vocabulary used by the user to describe the reply information can be parsed from the keywords, so that the service content is obtained based on that vocabulary and provided to the user.
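As an illustrative sketch of the division of labor between the front-end part and the core part described above (the confirmation vocabulary, function names, and reply items here are invented for illustration and are not part of the application):

```python
# Toy sketch of the front-end/core split: the front-end part classifies
# keywords as confirmation/correction/cancellation, while the core part
# matches keywords against descriptions of the reply items.

CONFIRM_WORDS = {"yes", "right", "correct"}
CORRECT_WORDS = {"no", "not", "wrong"}
CANCEL_WORDS = {"cancel", "forget"}

def front_end_parse(keywords):
    """Front-end part: map keywords to confirmation / correction / cancellation."""
    words = set(keywords)
    if words & CANCEL_WORDS:
        return "cancel"
    if words & CORRECT_WORDS:
        return "correct"
    if words & CONFIRM_WORDS:
        return "confirm"
    return None

def core_parse(keywords, reply_items):
    """Core part: pick out the reply items that some keyword describes."""
    return [item for item in reply_items
            if any(k in item.lower() for k in keywords)]

keywords = ["yes", "sichuan"]
reply_items = ["Sichuan Restaurant A", "Cantonese Restaurant B"]
second_intention = front_end_parse(keywords)         # -> "confirm"
service_content = core_parse(keywords, reply_items)  # -> ["Sichuan Restaurant A"]
```

Keywords that match neither the confirmation vocabulary nor a reply item, such as speech inserted by other users, simply fall through both parsers and are ignored.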
  • The service indicated by the service content may be provided by a third-party cloud service, by the vehicle-mounted module, by the vehicle-mounted terminal, or by the vehicle enterprise.
  • the vehicle-mounted terminal may be other terminals on the vehicle other than the vehicle-mounted module, such as a vehicle-mounted display screen, a vehicle-mounted air conditioner, and a vehicle-mounted speaker.
  • Since the front-end part and the core part of the dynamic target language model are obtained based on the reply information, the second intention obtained through the front-end part and the service content obtained through the core part are all related to the first intention, while voice signals unrelated to the first intention are ignored. Therefore, the embodiment of the present application achieves a better voice recognition effect, avoids deviation of the provided service from user requirements caused by interference from irrelevant voice signals, and improves the user experience.
  • the dynamic target language model further includes a tailing part for confirming whether there is an additional intention
  • The method further includes: invoking the dynamic target language model to determine the additional intention, where the tailing part of the dynamic target language model parses out the additional intention according to the keywords.
  • The tailing part includes tailing marker words. Invoking the dynamic target language model to determine the additional intention includes: the tailing part parses out a reference tailing marker word and the time point at which it occurs according to the keywords; based on the reference tailing marker word, and in combination with the first intention and the second intention, the dynamic target language model is updated to obtain an updated target language model; and the updated target language model is invoked to parse out the additional intention according to the keywords and the time point of the reference tailing marker word.
  • Before the voice signal is acquired, the method further includes: buffering a historical voice signal. Parsing the voice signal to generate keywords includes: parsing the voice signal and generating the keywords after performing context detection using the historical voice signal.
  • Context detection through historical voice signals can make the recognized keywords more suitable for the current scene, thereby further improving the accuracy of voice recognition.
  • the method further includes: confirming the second intention, and obtaining the confirmed second intention.
  • Confirming the second intention and obtaining the confirmed second intention includes: sending confirmation information about the second intention to the user, obtaining the second intention fed back by the user, and using the second intention fed back by the user as the confirmed second intention.
  • the second intention is made more accurate, so as to provide more accurate service content.
  • Obtaining the dynamic target language model according to the reply information to the first intention includes: converting the reply information of the first intention into a reference format to obtain reply information in the reference format, and obtaining or generating the dynamic target language model according to the reply information in the reference format. Since different suppliers may provide reply information in different formats, converting the reply information into a reference format unifies its format and facilitates its reception. For different application fields, the reply information is converted into different reference formats, so that reply messages within the same application field share the same format.
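As a minimal sketch of such normalization (the supplier names, field names, and input shapes below are assumptions for illustration; the application does not specify a concrete schema):

```python
# Hypothetical normalization of supplier-specific reply information into one
# reference format; all field names here are illustrative assumptions.

def to_reference_format(reply, supplier):
    """Convert a supplier-specific reply record into the reference format."""
    if supplier == "supplier_a":      # e.g. {"poi": ..., "addr": ...}
        return {"name": reply["poi"], "address": reply["addr"]}
    if supplier == "supplier_b":      # e.g. {"title": ..., "location": ...}
        return {"name": reply["title"], "address": reply["location"]}
    raise ValueError(f"unknown supplier: {supplier}")

a = to_reference_format({"poi": "Sichuan Restaurant A", "addr": "1 Main St"},
                        "supplier_a")
b = to_reference_format({"title": "Sichuan Restaurant A", "location": "1 Main St"},
                        "supplier_b")
assert a == b  # both suppliers now share the reference format
```

Once replies share one schema per application field, the downstream model-building steps can consume them uniformly regardless of which supplier produced them.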
  • Obtaining or generating the dynamic target language model according to the reply information in the reference format includes: converting a trained language model into a weighted finite-state transducer (WFST) model and using the WFST model as the dynamic target language model, where the trained language model is obtained by training on the reply information in the reference format and a reference vocabulary.
  • the reference vocabulary includes, but is not limited to, the category name corresponding to the vocabulary in the reply message in the reference format, and referential expression words.
  • Obtaining or generating the dynamic target language model according to the reply information in the reference format includes: converting a trained language model into a weighted finite-state transducer (WFST) model and using the WFST model as the first language model, where the trained language model is obtained by training on the reply information in the reference format whose length is not less than a reference length; acquiring a second language model according to the reply information in the reference format whose length is less than the reference length, and a third language model according to the reference vocabulary; and merging the first language model, the second language model, and the third language model to obtain a total language model, which is used as the dynamic target language model.
  • Obtaining or generating the dynamic target language model according to the reply information in the reference format includes: obtaining a word confusion network based on the reply information in the reference format whose length is not less than the reference length, where each word in the word confusion network has a transition probability; calculating a penalty weight for each word, transforming the word confusion network into a weighted finite-state transducer (WFST) model according to the penalty weights, and using the WFST model as the first language model; acquiring the second language model according to the reply information in the reference format whose length is less than the reference length, and the third language model according to the reference vocabulary; and merging the first language model, the second language model, and the third language model to obtain a total language model, which is used as the dynamic target language model.
  • the calculating the penalty weight of each vocabulary includes: for any vocabulary, using a negative logarithm of the transition probability of the vocabulary as the penalty weight.
  • The transition probability of a word indicates how frequently the word occurs within its category. The more frequently the word occurs in the category, the greater its transition probability and the smaller the negative logarithm of that probability, i.e. the penalty weight. The penalty weight is thus inversely related to the frequency of occurrence, so the target language model can better parse the words that appear frequently in the category.
  • the calculating the penalty weight of each vocabulary includes: for any vocabulary, using the logarithm of the number of items of the reply information in the reference format of the vocabulary as the penalty weight.
  • Words with stronger discriminative power, that is, words contained in a smaller number of reply messages in the reference format, are given smaller penalty weights, so that the target language model can better parse these discriminative words.
  • Calculating the penalty weight of each word may include: for any word, using the logarithm of the number of occurrences of the word in the reply information in the reference format as the penalty weight. Words with stronger discriminative power, that is, words with fewer occurrences, receive smaller penalty weights, so that the dynamic target language model can better parse these discriminative words.
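The three penalty-weight schemes above can be sketched numerically as follows (a toy illustration under the assumption of natural logarithms; the function names and sample values are not from the application):

```python
import math

# Scheme 1: penalty = -log(transition probability of the word in its category).
def penalty_from_transition_prob(p):
    return -math.log(p)

# Scheme 2: penalty = log(number of reference-format reply items containing the word).
def penalty_from_item_count(n_items_containing_word):
    return math.log(n_items_containing_word)

# Scheme 3: penalty = log(number of occurrences of the word in the reply information).
def penalty_from_occurrences(n_occurrences):
    return math.log(n_occurrences)

# Frequent words within a category get small penalties under scheme 1...
assert penalty_from_transition_prob(0.9) < penalty_from_transition_prob(0.1)
# ...and discriminative words (appearing in fewer items) get small
# penalties under schemes 2 and 3.
assert penalty_from_item_count(2) < penalty_from_item_count(20)
assert penalty_from_occurrences(1) < penalty_from_occurrences(10)
```

In all three schemes a smaller penalty weight makes the corresponding WFST arc cheaper, biasing decoding toward the words the scheme favors.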
  • A device for speech recognition includes: a first acquisition module, configured to acquire or generate a dynamic target language model according to reply information to the first intention, the dynamic target language model including a front-end part and a core part, where the core part is used to determine possible descriptions related to the reply information and the front-end part is used to determine descriptions of the confirmatory information of the reply information; a second acquisition module, configured to acquire the voice signal and parse the voice signal to generate keywords; and a first determining module, configured to call the dynamic target language model to determine the second intention and service content, where the front-end part of the dynamic target language model parses out the second intention according to the keywords and the core part parses out the service content according to the keywords.
  • the dynamic target language model further includes a tailing part for confirming whether there is an additional intent
  • The device further includes: a second determining module, configured to call the dynamic target language model to determine the additional intention, where the tailing part of the dynamic target language model parses out the additional intention according to the keywords.
  • the tailing part includes tailing marker words
  • The second determining module is configured to: parse the keywords through the tailing part to obtain a reference tailing marker word and the time point at which it occurs; update the dynamic target language model based on the reference tailing marker word, in combination with the first intention and the second intention, to obtain an updated target language model; and call the updated target language model to parse out the additional intention according to the keywords and the time point of the reference tailing marker word.
  • the device further includes: a cache module, configured to cache historical voice signals; the second acquisition module, configured to parse the voice signals, and generate the keywords after context detection using the historical voice signals .
  • the device further includes: a confirmation module, configured to confirm the second intention, and obtain the confirmed second intention.
  • the confirmation module is configured to send confirmation information of the second intention to the user, obtain the second intention fed back by the user, and use the second intention fed back by the user as the confirmed second intention.
  • the first obtaining module is configured to convert the reply information of the first intention into a reference format to obtain reply information in the reference format, and obtain or generate the dynamic target language according to the reply information in the reference format model.
  • The first acquisition module is configured to convert a trained language model into a weighted finite-state transducer (WFST) model and use the WFST model as the dynamic target language model, where the trained language model is obtained by training on the reply information in the reference format and the reference vocabulary.
  • The first acquisition module is configured to convert a trained language model into a weighted finite-state transducer (WFST) model and use the WFST model as the first language model, where the trained language model is obtained by training on the reply information in the reference format whose length is not less than the reference length; acquire the second language model according to the reply information in the reference format whose length is less than the reference length, and the third language model according to the reference vocabulary; and merge the first language model, the second language model, and the third language model to obtain an overall language model, which is used as the dynamic target language model.
  • The first acquisition module includes: a first acquisition unit, configured to acquire a word confusion network based on the reply information in the reference format whose length is not less than the reference length, where each word in the word confusion network has a transition probability; a calculation unit, configured to calculate the penalty weight of each word, transform the word confusion network into a weighted finite-state transducer (WFST) model according to the penalty weights, and use the WFST model as the first language model; a second acquisition unit, configured to acquire the second language model according to the reply information in the reference format whose length is less than the reference length, and the third language model according to the reference vocabulary; and a merging unit, configured to merge the first language model, the second language model, and the third language model to obtain a total language model, which is used as the dynamic target language model.
  • the calculation unit is configured to, for any vocabulary, use the negative logarithm of the transition probability of the vocabulary as the penalty weight.
  • the calculation unit is configured to use, for any vocabulary, a logarithmic value of the number of reply messages in a reference format containing the vocabulary as the penalty weight.
  • the calculation unit is configured to use, for any word, a logarithmic value of the number of occurrences of the word in the reply message in the reference format as the penalty weight.
  • A device for speech recognition includes: a memory and a processor. The memory stores at least one instruction, and the at least one instruction is loaded and executed by the processor to implement the method in the first aspect, or in any possible implementation of the first aspect, of the embodiments of the present application.
  • There are one or more processors and one or more memories.
  • the memory may be integrated with the processor, or the memory and the processor may be provided separately.
  • The memory can be a non-transitory memory, such as a read-only memory (ROM); it can be integrated with the processor on the same chip or provided on separate chips. The embodiment of the present application does not limit the type of the memory or the arrangement of the memory and the processor.
  • A computer-readable storage medium stores a program or instructions that are loaded by a processor to execute any of the above voice recognition methods.
  • A computer program product is also provided, comprising computer program code; when the computer program code is run by a computer, the computer executes any of the above voice recognition methods.
  • A chip includes a processor configured to call from a memory, and execute, the instructions stored in that memory, so that a communication device installed with the chip executes any of the above voice recognition methods.
  • Another chip includes: an input interface, an output interface, a processor, and a memory, connected by an internal connection path. The processor is configured to execute the code in the memory; when the code is executed, the processor performs any of the above voice recognition methods.
  • The embodiment of the application obtains or generates a dynamic target language model including a front-end part and a core part according to the reply information of the first intention. After the voice signal is parsed to obtain keywords, the dynamic target language model is called to parse the keywords and obtain the second intention and service content. Since the dynamic target language model is obtained based on the reply information of the first intention, the second intention and service content obtained through its analysis are all related to the first intention. Therefore, the embodiment of the present application can ignore voices that are unrelated to the first intention, that is, it can recognize discontinuous multi-intent speech, avoiding deviation of the provided service content from user needs; the recognition effect is good, and the user experience is improved.
  • FIG. 1 is a schematic diagram of an implementation environment provided by an embodiment of the application.
  • FIG. 2 is a block diagram of a method for implementing speech recognition provided by an embodiment of the application.
  • FIG. 3 is a flowchart of a voice recognition method provided by an embodiment of the application.
  • FIG. 4 is a schematic structural diagram of a language model provided by an embodiment of the application.
  • FIG. 5 is a schematic diagram of a speech recognition process provided by an embodiment of the application.
  • FIG. 6 is a schematic structural diagram of a language model provided by an embodiment of the application.
  • FIG. 7 is a schematic structural diagram of a language model provided by an embodiment of the application.
  • FIG. 8 is a schematic structural diagram of a word confusion network provided by an embodiment of the application.
  • FIG. 9 is a schematic structural diagram of a speech recognition device provided by an embodiment of the application.
  • Related technologies provide a speech recognition method. After calling a language model to recognize a voice command and understand the user's intent, the method on the one hand asks the user a question, and on the other hand adjusts the language model according to the question, for example by integrating a vocabulary set related to the question into the language model so that the adjusted language model can recognize the words in the vocabulary set.
  • In this way, the adjusted language model can recognize the user's reply voice and meet user needs.
  • the user's voice is often more flexible.
  • the user and the car module may have the following voice dialogue:
  • The voice recognition system of the on-board module will ask a question based on the voice command, and then integrate the vocabulary "Sichuan Restaurant A" involved in the question into the language model to obtain an adjusted language model. Later, if the user utters the reply voice "Yes, it is Sichuan Restaurant A", the adjusted language model can recognize the reply voice.
  • the user first uttered an irrelevant voice communicating with other users in the car, so the adjusted language model will recognize the irrelevant voice as a reply voice, which leads to misunderstanding. It can be seen that the speech recognition method provided by the related technology has poor recognition effect and poor user experience.
  • the embodiment of the present application provides a method for speech recognition, which can be applied in the implementation environment as shown in FIG. 1.
  • the audio equipment includes a microphone array and a speaker
  • the memory stores programs or instructions of a module for voice recognition.
• The audio equipment, memory, and CPU communicate through a data bus (D-Bus), so that the CPU calls the microphone array to collect the voice signal from the user, runs the programs or instructions of each module stored in the memory based on the collected voice signal, and calls the speaker to send a voice signal to the user according to the running result.
  • the CPU may also access the cloud service through a gateway to obtain data returned by the cloud service.
  • the CPU can also access the controller area network bus (CAN-Bus) through the gateway to read and control the status of other devices.
• The programs or instructions of the modules for voice recognition stored in the memory include those of the ring voice buffer module, the AM module, the SL module, the Dynamic LM module, the SLU module, the DM module, and the NCM process; they are executed by the CPU in FIG. 1 to implement voice recognition.
  • the process of voice recognition will be described:
• Front-end speech module: used to distinguish the voice signal sent by the user from non-speech signals such as road noise and music, and also to denoise and enhance the user's voice signal to improve the accuracy of subsequent recognition and understanding.
• Ring voice buffer module: used to buffer the voice signal processed by the front-end speech module so that the stored voice signal can be recognized and understood multiple times.
• The ring voice buffer has a reference time length. When the time length of the buffered voice signal exceeds the reference time length, the voice signal with the longest storage time is overwritten by the new voice signal.
• Acoustic model (AM) module: used to obtain the voice signal stored in the ring voice buffer module and convert the voice signal into a phoneme sequence.
• Selective listening (SL) module: used to call the dynamic language model (Dynamic LM) module, convert the phoneme sequence output by the AM module into keywords, and send the keywords to the spoken language understanding (SLU) module.
• SLU module: used to extract intentions and semantic slots from keywords to understand the first intention, second intention, and additional intentions indicated by the user's voice signal.
• Dialogue manager (DM) module: used to request reply information from the cloud service according to the first intention.
• Application manager (APP Manager) module: used to convert the reply information returned by the cloud service into reply information in a reference format.
• Dialogue manager (DM) module: also used to start non-continuous multi-intent (NCM) processes in related fields according to the reply information in the reference format returned by the APP Manager module, and to control the response generator (RG) module to generate reply content and play it back as voice. It is also used to send instructions to the APP Manager module according to the second intention and the additional intention, to control the application or terminal device to execute the service content and the additional intention.
• Application manager (APP Manager) module: also used for word segmentation and proper-noun labeling of reply messages, and to manage applications and terminal devices according to instructions sent by the DM module, controlling applications or terminal devices to execute the service content and additional intentions.
  • an embodiment of the present application provides a voice recognition method. As shown in Figure 3, the method includes:
• Step 201: Acquire or generate a dynamic target language model according to the reply information to the first intention.
• The dynamic target language model includes a front-end part and a core part. The core part is used to determine possible descriptions related to the reply information, and the front-end part is used to determine the description of the confirmatory information of the reply.
  • the first intention refers to the intention obtained by analyzing the user's voice command signal after the voice dialogue between the user and the system starts.
  • the user's voice command signal is the voice of "Help me find a nearby Sichuan restaurant" issued by the user.
  • Analyzing the voice command signal includes: calling an acoustic model to convert the voice command signal into a sequence of phonemes.
• A phoneme is the smallest phonetic unit of a language. For example, in Chinese, a phoneme refers to an initial or a final.
  • the language model is called to convert the phoneme sequence into a text sequence, and the text sequence is a voice command.
  • the language model refers to a language model that has been trained according to the training set, and an appropriate language model can be called according to the application domain of speech recognition.
  • the text sequence can be parsed to obtain the first intention.
• The first intention includes an intention and semantic slots. A semantic slot refers to a word with a clear definition or concept in a character sequence. Still taking the above voice dialogue as an example, if the text sequence is "Help me find a nearby Sichuan restaurant", the parsed intention is "navigation", the semantic slots are "nearby" and "Sichuan restaurant", and the first intention is "Navigate to the nearby Sichuan restaurant".
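As an illustration of the intent and semantic-slot parsing described above, the following sketch uses a toy keyword-matching rule; the rule tables and function names are hypothetical, and a real SLU module would rely on a trained model rather than string matching.

```python
# Minimal sketch of intent and semantic-slot extraction, assuming a
# toy rule-based parser (hypothetical tables, not the patent's SLU).
INTENT_RULES = {
    "navigation": ["find", "go to", "navigate"],
}
SLOT_VOCAB = {
    "distance": ["nearby"],
    "poi_type": ["Sichuan restaurant"],
}

def parse(text_sequence):
    """Return (intent, slots) extracted from a recognized text sequence."""
    text = text_sequence.lower()
    # the first intent whose cue words appear in the text
    intent = next((name for name, cues in INTENT_RULES.items()
                   if any(cue in text for cue in cues)), None)
    # every slot whose vocabulary appears in the text
    slots = {slot: value for slot, values in SLOT_VOCAB.items()
             for value in values if value.lower() in text}
    return intent, slots

intent, slots = parse("Help me find a nearby Sichuan restaurant")
# intent == "navigation"; slots cover "nearby" and "Sichuan restaurant"
```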
  • the response information of the first intention can be obtained according to the obtained first intention, and the content of the response information of the first intention meets the requirements of the semantic slot.
  • the first intention can be sent to the cloud service, so as to obtain the reply information returned by the cloud service.
  • a plurality of mapping relationships between intentions and reply information may also be stored in the memory, and the reply information corresponding to the first intention can be searched for according to the mapping relationship, so as to realize the acquisition of the reply information.
  • the reply information can be one or more items, and each reply information is a text string. Moreover, if there are multiple reply messages, multiple reply messages can be used as options for the user to choose from. Still taking the above voice dialogue as an example, the reply message can be one item of "Sichuan Restaurant A” or multiple items such as “Sichuan Restaurant A", “Sichuan Restaurant B” and “Sichuan Restaurant C”. In this embodiment, the number of items of the reply message is not limited.
  • the dynamic target language model can be acquired or generated based on the acquired response information of the first intention.
  • the dynamic target language model includes a front-end part and a core part.
  • the front-end part is used to determine the description of the confirmation information of the reply message.
  • the confirmation information includes but is not limited to confirmation, correction or cancellation.
• the confirmation information can include words such as "yes" and "right"
  • the correction information can include “not right” and “wrong”
  • the cancellation information can include "forget it” and "no need.”
  • the core part is used to determine possible descriptions related to the reply information, for example, the user directly repeats the reply information or the user selectively repeats the reply information.
  • the process of obtaining or generating a dynamic target language model based on the response information will be described in detail later, and will not be repeated here.
• After the dynamic target language model is acquired or generated, the voice signal can then be received.
• Step 202: Acquire a voice signal and parse the voice signal to generate keywords.
• After the vehicle-mounted module obtains the reply information to the first intention, in addition to obtaining or generating a dynamic target language model according to that reply information, it also sends the reply information to the user in order to obtain the voice signal.
  • the voice signal may include the voice signal of the user's dialogue with the vehicle-mounted module, that is, the voice signal of the reply message for the first intention, or may include the irrelevant voice signal of the user's dialogue with other users.
• For example, the voice signal of the user's dialogue with the on-board module is "Yes, it's Sichuan Restaurant A", and the irrelevant voice signal of the user's dialogue with other users is "This time period happens to be noon, will there be a problem with parking?".
• In addition to the voice signals of the user actively talking with other users, irrelevant voice signals can also include the voice signals of other users actively talking with the user, that is, the voice signals of other users interjecting. This embodiment does not limit the irrelevant voice signals.
• After the vehicle-mounted module obtains the voice signal, it can parse the voice signal to generate keywords.
• In some embodiments, before acquiring the voice signal, the method further includes: buffering the historical voice signal. Then, parsing the voice signal to generate keywords includes: parsing the voice signal, and generating keywords after context detection using the historical voice signal.
  • the historical voice signal is the voice signal of the past time.
  • the voice command signal "Help me find a nearby Sichuan restaurant" used to obtain the first intention can be used as the historical voice signal.
  • the historical voice signal can be buffered through a ring buffer.
• The ring buffer has a reference time length. If the buffered historical voice signal is longer than the reference time length, the historical voice signal with the longest buffer time is overwritten by the new voice signal. If the historical voice signal is needed, it is simply read from the ring buffer.
  • this embodiment does not limit the manner of buffering historical voice signals, and other methods may be selected to realize the buffering of historical voice according to needs.
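The ring-buffer behavior described above (oldest voice signal overwritten once the reference time length is exceeded) can be sketched as follows; the class and parameter names are illustrative, and a real implementation would store audio frames rather than strings.

```python
from collections import deque

class RingVoiceBuffer:
    """Sketch of a ring buffer for voice frames: once the buffered
    duration exceeds the reference time length, the oldest frames
    are overwritten by new ones (illustrative names)."""
    def __init__(self, reference_seconds, frame_seconds=1.0):
        # deque with maxlen drops the oldest element automatically
        self.frames = deque(maxlen=int(reference_seconds / frame_seconds))

    def push(self, frame):
        self.frames.append(frame)

    def read_all(self):
        # reading does not consume the buffer, so the stored signal
        # can be recognized and understood multiple times
        return list(self.frames)

buf = RingVoiceBuffer(reference_seconds=3)
for f in ["f1", "f2", "f3", "f4"]:
    buf.push(f)
# the oldest frame "f1" has been overwritten by "f4"
```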
  • the vehicle-mounted module can still call the appropriate acoustic model and language model according to the application field of the speech recognition, and analyze the speech signal through the acoustic model and the language model to obtain the initial keywords.
• The initial keywords generated by analyzing the voice signal of the dialogue between the user and the on-board module are related to the first intention, while the initial keywords generated by parsing the irrelevant voice signal of the dialogue between the user and other users have nothing to do with the first intention. Therefore, the historical voice signal needs to be used for context detection, so that the keywords generated from the initial keywords are related only to the first intention; that is, initial keywords that are not related to the first intention are ignored.
• The manner of using the historical voice signal for context detection may include: detecting, among the initial keywords, keywords related to the historical voice signal, so as to use the keywords related to the text sequence corresponding to the historical voice signal as the generated keywords. For example, consider the voice signal "This time is exactly noon, will there be a problem with parking? Yes, it is Sichuan Restaurant A".
• The initial keywords obtained include "noon", "parking", "Yes, that is" and "Sichuan Restaurant A".
• The keywords related to the historical voice signal "Help me find a nearby Sichuan restaurant" are "Yes, that is" and "Sichuan Restaurant A". Therefore, "noon" and "parking" can be ignored, and only "Yes, that is" and "Sichuan Restaurant A" are used as the generated keywords.
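A minimal sketch of this keyword filtering, assuming the set of keywords considered related to the historical voice signal has already been derived (how that set is derived is not specified here):

```python
# Sketch of context detection: keep only the initial keywords that
# are related to the buffered historical voice signal. Relatedness is
# reduced to membership in a precomputed set (an assumption).
def context_filter(initial_keywords, history_related):
    related = set(history_related)
    return [kw for kw in initial_keywords if kw in related]

history_related = {"Yes, that is", "Sichuan Restaurant A"}
initial = ["noon", "parking", "Yes, that is", "Sichuan Restaurant A"]
keywords = context_filter(initial, history_related)
# "noon" and "parking" are ignored as unrelated to the first intention
```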
• This embodiment does not limit the method of context detection using historical voice signals. Regardless of which method is used to generate keywords, once the keywords are generated, the dynamic target language model can be triggered to parse them to determine the second intention and the service content; see step 203 for details.
• Step 203: Invoke the dynamic target language model to determine the second intention and service content.
  • the front-end part of the dynamic target language model parses the second intention based on keywords, and the core part of the dynamic target language model parses the service content based on keywords.
• The dynamic target language model includes a front-end part and a core part. Since the dynamic target language model is obtained based on the reply information to the first intention, the second intention and service content determined by it are both related to the first intention. The front-end part is used to determine the description of the confirmatory information of the reply message; therefore, the confirmatory information in the keywords can be obtained by parsing the keywords with the front-end part, and the user's second intention can then be obtained from that confirmatory information.
  • the reply message to the first intention is "Do you want to go to Sichuan Restaurant A”
• If the keywords obtained by parsing are "Yes, that is" and "Sichuan Restaurant A", then by parsing "Yes, that is" through the front-end part, the user's second intention can be obtained as "Go to Sichuan Restaurant A".
• The keyword "Sichuan Restaurant A" is parsed by the core part and, combined with the current car navigation scene, the service content "Navigate to Sichuan Restaurant A" is obtained.
• If the reply message to the first intention includes only one item,
• the user's second intention can be determined through the front-end part alone.
• If the reply information to the first intention includes two or more items,
• both the front-end part and the core part need to be used to determine the user's second intention. For example, if the reply message to the first intention is "Which of the following do you want to choose? The first item is Sichuan Restaurant A, and the second item is Sichuan Restaurant B", and the keywords obtained by analysis are still "Yes, that is" and "Sichuan Restaurant A", the front-end part can still parse out the confirmatory information "Yes, that is" in the keywords.
• If the confirmatory information obtained by the front-end part parsing the keywords includes confirmation information, such as "Yes, that is" in the above voice dialogue,
• the keywords can be further parsed through the core part to obtain the service content.
• If the confirmatory information obtained by the front-end part includes correction information or cancellation information, such as "no" or "not right", it means that the user does not approve of the reply message and may not respond to it, so there is no need to parse through the core part to obtain the service content. Instead, other reply information is obtained again, and a new dynamic target language model is obtained based on that other reply information, so that speech recognition is completed through the new dynamic target language model.
• In addition, invoking the dynamic target language model can also yield the confidence of the second intention and the service content, the mute signal segments in the voice signal, and other information, where the confidence is used to indicate the accuracy of the second intention and the service content.
  • the provision of the service indicated by the service content can be triggered.
  • the service content in the voice dialogue is "Navigate to Sichuan Restaurant A”
  • the execution service content includes calling the navigation device to navigate the user from the current location (ie, the location where the voice dialogue occurs) to the location of "Sichuan Restaurant A”.
  • the method provided in this embodiment further includes: confirming the second intention, and obtaining the confirmed second intention; and executing the confirmed second intention.
• The second intention and service content determined by the dynamic target language model may still be inconsistent with the first intention. Therefore, before executing the service content, the second intention is confirmed to ensure that it is consistent with the first intention. After the confirmed second intention is obtained, the confirmed second intention is executed.
  • the second intention is consistent with the first intention, including but not limited to: the second intention corresponds to the reply information of the first intention (for example, the second intention "Go to Sichuan Restaurant A” and the reply information of the first intention "Sichuan Restaurant A” correspond). Or, the second intention satisfies the condition restriction included in the first intention (for example, the second intention "Go to Sichuan Restaurant A” satisfies the distance condition restriction "near" included in the first intention).
  • the method of obtaining the confirmed second intention includes: sending confirmation information of the second intention to the user, obtaining the second intention fed back by the user, and using the second intention fed back by the user as the confirmed second intention Second intention.
  • the confidence level of the second intention and the service content can be obtained through the dynamic target language model. Therefore, this embodiment can send different confirmation information to the user for different confidence levels to realize the confirmation of the second intention.
  • the second intent as "Go to Sichuan Restaurant A” as an example
• If the confidence level is higher than the threshold, it means that the second intention is more credible, so indirect confirmation can be used.
• For example, the voice "You have selected Sichuan Restaurant A", which assumes the second intention is correct by default, is sent to the user as the confirmation information of the second intention, to obtain the second intention fed back by the user.
• If the confidence level is not higher than the threshold, it means that the credibility of the second intention is low, so direct confirmation is used. For example, a voice of "Are you sure to choose Sichuan Restaurant A" is sent to the user.
• The confirmation information sent in the above indirect and direct confirmation methods is voice confirmation information. If the second intention fed back by the user cannot be obtained through voice confirmation information, other forms of confirmation information, such as text confirmation information, can be used to confirm the second intention with the user.
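The threshold-based choice between indirect and direct confirmation described above can be sketched as follows; the threshold value and prompt strings are illustrative assumptions, not the patent's actual dialogue wording.

```python
# Sketch of choosing a confirmation strategy by confidence, assuming
# a single fixed threshold (an illustrative value).
def confirmation_prompt(second_intent, confidence, threshold=0.8):
    if confidence > threshold:
        # indirect confirmation: assume the intent is correct by default
        return f"You have selected {second_intent}"
    # direct confirmation: explicitly ask the user
    return f"Are you sure to choose {second_intent}?"

high = confirmation_prompt("Sichuan Restaurant A", 0.95)
low = confirmation_prompt("Sichuan Restaurant A", 0.40)
```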
• For example, the terminal displays the reply messages for the first intention to the user, so that the user can select any reply message through the terminal; the intention indicated by the reply message selected by the user is taken as the confirmed second intention, and the confirmed second intention is executed to complete speech recognition.
• In some embodiments, the dynamic target language model further includes a tail part, which is used to confirm whether there is an additional intention. Therefore, the method provided in this embodiment further includes: invoking the dynamic target language model to determine additional intentions, wherein the tail part of the dynamic target language model parses the additional intentions according to the keywords, thereby identifying each intention in the above multi-intent dialogue.
  • the additional intent is obtained by parsing keywords through the tail part.
  • the schematic diagram of the front part, the core part and the tail part can be seen in Fig. 4.
  • out of vocabulary (OOV) represents vocabulary outside the dictionary, and the dictionary is used to obtain words based on phoneme sequences.
• eps stands for an epsilon (skip) edge and is used to indicate optional parts.
• The tail part includes tail marker words,
• where the tail marker words include but are not limited to words such as "re", "also", and "by the way".
• The tail marker word in the above multi-intent dialogue is "re". Since the user's description of tail marker words is often relatively fixed, a set of multiple tail marker words can be used as the corpus to train a language model, and the trained language model can be used as the tail part. Invoking the dynamic target language model to determine the additional intention, where the tail part parses the additional intention according to the keywords, therefore includes: the tail part parses out the reference tail marker word from the keywords, together with the time point at which the reference tail marker word is located; based on the reference tail marker word and combining the first intention and the second intention, the dynamic target language model is updated to obtain an updated target language model; and the updated target language model is invoked to parse the additional intention according to the keywords after the time point of the reference tail marker word.
• The reference tail marker word is one of the set of multiple tail marker words used as the corpus. If there is no reference tail marker word, it means that there is no additional intention, and the service indicated by the above service content can be provided directly; if there is a reference tail marker word, it means that there is an additional intention, and the time point at which the reference tail marker word is located is also obtained.
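A sketch of detecting the reference tail marker word and its position, so that only the keywords after it are handed to the updated target language model; the marker set and the keyword tokenization are simplified assumptions.

```python
# Sketch of tail-marker detection over a keyword sequence; the marker
# set is illustrative (translated stand-ins for the corpus words).
TAIL_MARKERS = {"re", "also", "by the way"}

def find_tail_marker(keywords):
    """Return (marker, index) of the first tail marker, or (None, -1)."""
    for i, kw in enumerate(keywords):
        if kw in TAIL_MARKERS:
            return kw, i
    return None, -1

kws = ["Yes, that is", "Sichuan Restaurant A", "also", "find a parking space"]
marker, idx = find_tail_marker(kws)
# keywords after the marker are parsed by the updated target language model
remainder = kws[idx + 1:] if marker else []
```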
  • the language model is also called according to the first intention and the second intention.
  • the language model can be the language model of the domain of the first intention and the second intention.
  • the domain of the intent is "navigation"
• the language model of the navigation domain can be obtained to replace the dynamic target language model, so as to obtain the updated target language model.
  • the updated target language model is called to parse the keywords after the time point where the reference tag word is located, so as to obtain the additional intention of the user.
  • the reference ending mark word is "re”.
• The voice signal before the time point of "re" is "This time is exactly noon, will there be a problem with parking? Yes, it is Sichuan Restaurant A".
• The keywords included in this voice signal have already been parsed by the front-end part and the core part of the dynamic target language model. Therefore, the updated target language model can be invoked to parse the keywords included in the voice signal after the time point of "re", that is, the keywords included in "find a parking space for me", so as to obtain the user's additional intention.
• This embodiment also provides another method for updating the target language model: after obtaining the language model according to the first intention and the second intention, the combination of that language model and the tail part is used as the updated target language model. Referring to FIG. 5, the updated target language model can then iteratively detect whether there are more additional intentions after parsing one additional intention, which increases the number of intentions that can be identified.
• If the additional intention exists, after the additional intention is obtained through analysis by the updated target language model, the second intention is executed in the following manner.
• The method includes: if the additional intention exists, executing the service content and the additional intention. After the service content is obtained, it is not executed immediately; instead, the tail part is used to confirm whether there is an additional intention in the voice signal. If there is, the additional intention is obtained, and finally the service content and the additional intention are executed. If it is confirmed through the tail part that there is no additional intention in the voice signal, the service content obtained is executed.
  • executing the service content and the additional intention includes: executing the service content and the additional intention in combination, or executing the service content and the additional intention sequentially. For example, if the service content is "Navigate to Sichuan Restaurant A" and the additional intention is "Play a song”, the additional intention can be executed during the execution of the service content, that is, the service content and the additional intention are combined. If the service content is "Navigate to Sichuan Restaurant A” and the additional intention is "Find a parking space", the service content and the additional intention need to be executed in sequence. In addition, different service contents and additional intentions can be executed by different execution entities.
  • the vehicle-mounted terminal may be other terminals on the vehicle other than the vehicle-mounted module, such as a vehicle-mounted display screen, a vehicle-mounted air conditioner, and a vehicle-mounted speaker.
  • it can also be executed jointly by more than two execution entities of a third-party cloud service, an in-vehicle module, an in-vehicle terminal, and a car company, which is not limited in the embodiment of the present application.
• In some embodiments, obtaining or generating a dynamic target language model according to the reply information to the first intention includes: converting the reply information to the first intention into a reference format to obtain reply information in the reference format, and obtaining or generating the dynamic target language model according to the reply information in the reference format.
  • the dynamic target language model includes at least a front-end part and a core part, and can also include a tail part.
• The front-end part is used to determine the description of the confirmatory information of the reply message. Similar to the tail part, since the user's description of the confirmatory information of the reply message is relatively fixed, a collection of confirmatory information for confirmation, correction, or cancellation can be used as a corpus to train a language model, and the trained language model is used as the front-end part, so that the front-end part has the ability to parse keywords to obtain confirmation, correction, or cancellation information.
• As for the core part, it needs to be obtained according to the reply information in the reference format described above.
  • the reply information may be provided by multiple suppliers. Since different suppliers may provide reply information in different formats, the reply information needs to be converted into a reference format, so that the reply information format is unified to facilitate the reception of the reply information.
• In different application fields, the reply message can be converted into different reference formats, so that reply messages in the same application field have the same format.
  • the reply message is often an address, so the address can be unified into the format of country (or region), province (or state), city, district, road, and house number.
• In fields where the reply information is often related to a point of interest (POI), the reply information can be unified into the format of category name, address, contact number, and user evaluation.
• The category name can be hotel, restaurant, shopping mall, museum, concert hall, cinema, stadium, hospital, pharmacy, and so on.
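A sketch of converting supplier-specific reply information into the reference POI format (category name, address, contact number, user evaluation) described above; the supplier field names used here are hypothetical.

```python
# Sketch of unifying reply information into a reference POI format.
# The input keys ("type", "addr", "tel", "score") are hypothetical
# supplier-specific field names, not a real provider's schema.
REFERENCE_FIELDS = ("category", "address", "phone", "rating")

def to_reference_format(raw, field_map):
    """Map a supplier-specific reply dict onto the reference POI fields."""
    return {ref: raw.get(src, "") for ref, src in zip(REFERENCE_FIELDS, field_map)}

raw_reply = {"type": "Restaurant", "addr": "12 Example Rd",
             "tel": "010-0000", "score": "4.5"}
poi = to_reference_format(raw_reply, ("type", "addr", "tel", "score"))
```

Because every supplier's reply is mapped onto the same four fields, downstream modules can consume the reply information without knowing which supplier produced it.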
• In some embodiments, the reply information can be segmented and labeled to facilitate the conversion into the reference format.
  • Word segmentation labeling refers to decomposing a text string into vocabulary. If the decomposed vocabulary includes proper nouns, proper nouns can also be labeled. Both word segmentation and labeling can be implemented by artificial intelligence algorithms.
  • artificial intelligence algorithms include but are not limited to conditional random field (CRF), long short term memory (LSTM) network, and hidden Markov model (hidden Markov model, HMM).
  • the dynamic target language model is further obtained or generated according to the reply information in the reference format.
• There are three ways to obtain the dynamic target language model according to the reply information in the reference format:
• The first acquisition method: convert the trained language model into a weighted finite state transducer model, and use the weighted finite state transducer model as the dynamic target language model, where the trained language model is obtained by training with the reply information in the reference format and the reference vocabulary.
  • the reference vocabulary includes, but is not limited to, the category name corresponding to the vocabulary in the reply message in the reference format, and referential expression words.
  • the vocabulary in the response message in the reference format can be obtained through word segmentation and other methods, so as to further obtain the category name corresponding to the vocabulary.
  • the category name of "Sichuan Restaurant A” is "Restaurant”.
• Referential expression words are used to refer to any reply message in the reference format. For example, when there are multiple reply messages in the reference format, the referential expression words include "the first item", "the middle one", "the second from last", "the last item", and so on.
• The trained language model is the initial language model trained with the reference-format reply information and the reference vocabulary as the corpus.
  • the initial language model may adopt the N-gram model, and the schematic diagram of the N-gram model can be seen in FIG. 6.
• The N-gram model assumes that the occurrence probability of a word is related only to the N-1 words before it, and has nothing to do with other words. For example, when N is 3, the N-gram model is a 3rd-order model, and the occurrence probability of a word is related to the two words before it; that is, the probability that the i-th word X_i appears is P(X_i | X_{i-2}, X_{i-1}).
  • the N-gram model can count the probability of a word appearing after another word, that is, the probability of two words appearing next to each other.
  • the N-gram model is trained through the corpus, and the trained N-gram model is obtained.
  • the trained N-gram model has counted the probability of adjacent occurrence of each word contained in the corpus.
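The adjacency counting described above can be sketched for a bigram (2nd-order) model; a 3rd-order model would condition on two preceding words instead of one. The corpus here is illustrative.

```python
from collections import Counter

# Sketch of training a bigram model by counting adjacent word pairs
# in a corpus of tokenized sentences (illustrative data).
def train_bigram(corpus_sentences):
    pair_counts, word_counts = Counter(), Counter()
    for sentence in corpus_sentences:
        for prev, cur in zip(sentence, sentence[1:]):
            pair_counts[(prev, cur)] += 1
            word_counts[prev] += 1
    # P(cur | prev) = count(prev, cur) / count(prev)
    return {pair: n / word_counts[pair[0]] for pair, n in pair_counts.items()}

probs = train_bigram([["go", "to", "restaurant"], ["go", "home"]])
# "go" is followed by "to" and "home" once each, so both get 0.5
```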
• In this embodiment, the trained language model can be converted into a weighted finite state transducer (WFST) model.
• The WFST can convert the input phoneme sequence into words based on the dictionary, obtain the weight of each pair of adjacent words based on the adjacency probabilities counted by the trained language model, and output core information according to the weights.
  • the core information can be regarded as a word sequence, so the appearance probability of the core information is the product of the weights of all words contained in the word sequence.
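The product-of-weights scoring described above can be sketched as follows, with the accumulation done in log space for numerical stability; the weight table is an illustrative stand-in for the adjacency probabilities a WFST would derive from the trained language model.

```python
import math

# Sketch of scoring core information (a word sequence) as the product
# of the adjacency weights of its words; weights here are illustrative.
def sequence_score(words, adjacency_weight):
    log_score = 0.0
    for prev, cur in zip(words, words[1:]):
        # sum of logs avoids underflow for long sequences
        log_score += math.log(adjacency_weight[(prev, cur)])
    return math.exp(log_score)

weights = {("the", "restaurant"): 0.5, ("restaurant", "nearby"): 0.4}
score = sequence_score(["the", "restaurant", "nearby"], weights)
# score is 0.5 * 0.4 = 0.2 up to floating-point rounding
```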
  • the analysis range of the trained language model can be expanded through conversion.
• For example, the trained language model can obtain the vocabulary and reference vocabulary in the reply message by parsing keywords, while the converted WFST can parse not only the vocabulary in the reply message and the reference vocabulary, but also pairwise or three-way combinations of the vocabulary in the reply message, the category names corresponding to the vocabulary, and the referential expression words. For example, the WFST can parse the combination of a referential expression word and a category name, such as "the restaurant in the middle".
  • the WFST is the core part of the dynamic target language model.
  • the WFST and the front-end part can be used as a dynamic target language model.
• The second acquisition method: convert the trained language model into a weighted finite-state transducer model and use it as the first language model, where the trained language model is obtained by training on reply information in the reference format whose length is not less than the reference length; obtain the second language model according to the reply information in the reference format whose length is less than the reference length; obtain the third language model according to the reference vocabulary; merge the first language model, the second language model, and the third language model to obtain an overall language model, and use the overall language model as the dynamic target language model.
• for the reference vocabulary, refer to the description in the first acquisition method; details are not repeated here.
• here, reply messages whose length is less than the reference length and the reference vocabulary are not used as corpus; only reply messages whose length is not less than the reference length are used as the corpus.
• the trained language model is therefore the initial language model trained on reply messages whose length is not less than the reference length, and the initial language model can still be an N-gram model.
• for example, the reference length is 2, that is, two words.
• the reason is that the N-gram model uses a back-off algorithm.
• the back-off algorithm means that, for a word sequence that has not appeared in the corpus, the occurrence probability of a lower-order word sequence is used as the occurrence probability of that word sequence, so that the N-gram model can output a result for any input phoneme sequence. For example, when the word sequence (X_{i−2}, X_{i−1}, X_i) is absent from the corpus of a 3rd-order model, the model has no statistics for the probability P(X_i | X_{i−2}, X_{i−1}), so it backs off and uses the lower-order probability P(X_i | X_{i−1}) instead.
• the purpose of the trained language model is to determine the possible descriptions related to the reply message, and for reply messages of different lengths the user usually sends out different voice signals to repeat them, so as to confirm or select a reply message.
• for reply messages whose length is less than the reference length, users often repeat the entire reply message instead of repeating some words in it.
• if reply messages whose length is less than the reference length were used as corpus to train an N-gram model that includes the back-off algorithm, the trained language model would count some word sequences with very low occurrence probability, which would degrade the parsing effect of the trained language model.
• the reference length can be set based on the scene or experience, and can also be adjusted during the speech recognition process, which is not limited in the embodiment of the present application.
• for example, "Oriental Pearl" can be a reply message with a length of 1. If "Oriental Pearl" were used as corpus, the trained language model would admit word sequences such as "Dongming" and "Fangzhu", which obviously have a low probability of occurrence. Therefore, in this embodiment, a second language model that does not use the back-off algorithm is obtained from the reply information whose length is less than the reference length, and the second language model parses, in the keywords, only entire reply messages whose length is less than the reference length.
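A minimal sketch of such a second language model without back-off (hypothetical replies; a real implementation would be a WFST that accepts only complete short replies):

```python
def build_whole_phrase_model(short_replies):
    """Second-language-model sketch: accepts only an entire short reply
    (length < reference length), never sub-sequences, so there are no
    back-off artifacts such as splitting "Oriental Pearl" into fragments."""
    accepted = {tuple(r) for r in short_replies}
    return lambda words: tuple(words) in accepted

m = build_whole_phrase_model([["Oriental", "Pearl"]])
print(m(["Oriental", "Pearl"]))  # True: the whole reply matches
print(m(["Pearl"]))              # False: partial repeats are rejected
```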
• since the user's expression mode is relatively fixed, and the number of combinations of the category names corresponding to the vocabulary and the referential expression words is relatively limited, the category names corresponding to the vocabulary, the referential expression words, and their combinations can be used as corpus to train a third language model that does not use the back-off algorithm.
• the first language model can parse the entire reply message in the keywords, or combinations of the words contained in the entire reply message. For example, in a car navigation scenario, taking the reference length 2 as an example, "No. 1, D Avenue, C District, B City, A Province" is a reply message whose length is greater than the reference length. The user may repeat word sequences such as "B City" or "No. 1, D Avenue", so the keywords included in the voice signal repeated by the user can be parsed using the first language model, which uses the back-off algorithm.
• the first language model, the second language model, and the third language model are merged to obtain the overall language model, which is the core part of the dynamic target language model.
• the overall language model together with the front-end part constitutes the dynamic target language model.
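The merge can be sketched as a union of the three component models (the component models below are hypothetical toy stand-ins; a real merge would union WFSTs and combine their weights):

```python
def merge_models(*models):
    """Overall-language-model sketch: a word sequence is parseable if any of
    the first/second/third component models accepts it."""
    return lambda words: any(m(words) for m in models)

# Toy stand-ins for the three component models (hypothetical vocabularies).
first  = lambda ws: set(ws) <= {"B", "City", "D", "Avenue", "No.1"}  # long replies, back-off
second = lambda ws: tuple(ws) == ("Oriental", "Pearl")               # whole short replies
third  = lambda ws: tuple(ws) == ("the", "first", "one")             # referential expressions

total = merge_models(first, second, third)
print(total(["Oriental", "Pearl"]))    # True (second model)
print(total(["the", "first", "one"]))  # True (third model)
```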
• The third acquisition method: obtain a word confusion network based on the reply information in the reference format whose length is not less than the reference length, where each word in the word confusion network has a transition probability; calculate the penalty weight of each word, and convert the word confusion network into a weighted finite-state transducer model according to the penalty weight of each word, using the weighted finite-state transducer model as the first language model; obtain the second language model according to the reply information in the reference format whose length is less than the reference length; obtain the third language model according to the reference vocabulary; merge the first language model, the second language model, and the third language model to obtain the overall language model, and use the overall language model as the dynamic target language model.
• for the description of the reference vocabulary, refer to the first acquisition method; likewise, the second language model is acquired according to the reply information whose length is less than the reference length, and the third language model is acquired according to the reference vocabulary, as described above and not repeated here. Next, the process of obtaining the first language model is explained:
• the method of obtaining a word confusion network includes: aligning words of the same category in each reply message whose length is not less than the reference length, and taking the number of categories plus one as the number of states in the word confusion network. Adjacent states are then connected by arcs, each arc carrying a word and that word's transition probability, where the transition probability indicates the frequency of the word within its category, and the transition probabilities on all arcs between two adjacent states sum to 1.
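The construction can be sketched as follows (simplifying assumption: words of the same category occupy the same position in equal-length replies; the reply data is hypothetical):

```python
from collections import Counter

def build_confusion_network(replies):
    """Build per-category arc distributions for a word confusion network.

    Each slot (category) maps each word to its transition probability, and
    the probabilities on all arcs between two adjacent states sum to 1.
    """
    n_slots = len(replies[0])
    network = []
    for i in range(n_slots):
        counts = Counter(r[i] for r in replies)
        total = sum(counts.values())
        network.append({w: c / total for w, c in counts.items()})
    return network

# Hypothetical reply messages of length >= the reference length.
replies = [["A", "Street"], ["A", "Avenue"], ["B", "Avenue"], ["A", "Road"]]
net = build_confusion_network(replies)
print(net[0])                 # {'A': 0.75, 'B': 0.25}
print(sum(net[1].values()))   # 1.0: arc probabilities between two states sum to 1
```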
  • the penalty weight of each vocabulary is calculated, and the word confusion network is converted to WFST according to the penalty weight to obtain the first language model.
• the first language model calculates the penalty weights of the multiple word sequences that the phoneme sequence of the speech signal may correspond to; the penalty weight of a word sequence equals the product of the penalty weights of the words it contains, and the word sequence with the smallest penalty weight is output.
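The selection rule above can be sketched by exhaustive search (fine for small networks; hypothetical per-slot penalty weights — a real WFST would use a shortest-path algorithm):

```python
from itertools import product as cartesian

def best_sequence(network_penalties):
    """Pick the word sequence with the smallest total penalty, where the
    sequence penalty is the product of its per-word penalty weights."""
    best, best_w = None, float("inf")
    for combo in cartesian(*[p.items() for p in network_penalties]):
        words = [w for w, _ in combo]
        weight = 1.0
        for _, pen in combo:
            weight *= pen
        if weight < best_w:
            best, best_w = words, weight
    return best, best_w

# Hypothetical per-slot penalty weights (smaller = preferred).
penalties = [{"A": 0.3, "B": 1.4}, {"Avenue": 0.7, "Street": 1.4}]
words, w = best_sequence(penalties)
print(words)  # the lowest-penalty sequence, ['A', 'Avenue']
```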
  • the methods for calculating the penalty weight of each word include but are not limited to the following three:
• The first calculation method: for any word, the negative logarithm of its transition probability is used as the penalty weight.
• since the transition probability of a word indicates the frequency of occurrence of the word within its category, the higher the frequency, the smaller the penalty weight; that is, the penalty weight is inversely proportional to the frequency of appearance, so that the target language model can better parse words that appear frequently in the category.
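The first calculation method is a one-liner:

```python
import math

def penalty_from_transition(p):
    """First calculation method: penalty = -log(transition probability).
    Frequent words in a category (larger p) get smaller penalties."""
    return -math.log(p)

print(penalty_from_transition(0.5) > penalty_from_transition(0.9))  # True
print(penalty_from_transition(1.0))  # 0.0: a certain word carries no penalty
```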
• The second calculation method: for any word, the logarithm of the number of reply messages in the reference format that contain the word is used as the penalty weight.
• the discriminative strength of a word is defined by its inverse presence frequency (IPF), according to the following formula:
• IPF(T_Fi) = log(N / n)
• where T_Fi represents a word belonging to category F_i, N is the total number of reply messages in the reference format, and n is the number of reply messages in the reference format that contain T_Fi. It can be seen that the more reply messages in the reference format contain the word, the smaller the IPF value and the weaker the discrimination of the word.
• since a skip edge is in effect present in every reply message, the IPF of a skip edge can be expressed as IPF(skip) = log(N / N) = 0.
• the above IPF(skip) can also be rewritten, by adjusting the count inside the logarithm, so that the value of the IPF of the skip edge is not always equal to zero.
• the penalty weight Penalty(T_Fi) of a word can then be defined as Penalty(T_Fi) = log N − IPF(T_Fi), which equals log n; the penalty weight of a word is thus the logarithm of the number of reply messages in the reference format that contain it.
• Penalty(skip) can be defined analogously as Penalty(skip) = log N − IPF(skip).
• in this way, words with strong discrimination, that is, words contained in a small number of reply messages in the reference format, are given a small penalty weight, so that the target language model can better parse these discriminative words.
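Under the reading above — IPF(T_Fi) = log(N/n) and Penalty = log N − IPF = log n (a reconstruction from the surrounding text; the patent's exact rewritten skip-edge formula is not reproduced here) — the second method can be sketched as:

```python
import math

def ipf(n_total, n_containing):
    """Inverse presence frequency: IPF = log(N / n). The more reply messages
    contain the word, the smaller the IPF and the weaker its discrimination."""
    return math.log(n_total / n_containing)

def penalty(n_total, n_containing):
    """Second calculation method: Penalty = log N - IPF = log n."""
    return math.log(n_total) - ipf(n_total, n_containing)

# A word appearing in 2 of 8 replies is more discriminative (smaller penalty)
# than one appearing in all 8 replies.
print(penalty(8, 2) < penalty(8, 8))  # True
```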
• The third calculation method: for any word, the logarithm of the number of occurrences of the word across the reply messages in the reference format is used as the penalty weight.
• the third calculation method can still use the IPF formula above to define the discrimination of words, except that N now represents the total number of words contained in the reply messages in the reference format, and n represents the number of occurrences of the word T_Fi in those reply messages.
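The third method then counts occurrences rather than containing messages (hypothetical reply data; mirrors the log-count penalty above):

```python
import math

def penalty_by_occurrences(reply_messages, word):
    """Third calculation method: penalty = log of the total number of
    occurrences of the word across all reply messages."""
    n = sum(msg.count(word) for msg in reply_messages)
    return math.log(n) if n > 0 else float("inf")  # unseen words: no finite penalty

replies = [["A", "Avenue"], ["A", "Street"], ["B", "Avenue"]]
# "Avenue" occurs twice, so its penalty is log(2); rarer words get smaller penalties.
print(penalty_by_occurrences(replies, "Avenue"))
```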
• the first language model, the second language model, and the third language model can be merged to obtain the overall language model, which is the core part of the dynamic target language model.
• the overall language model and the front-end part (or the overall language model, the front-end part, and the tail part) can be used as the dynamic target language model.
• the embodiment of the application obtains or generates a dynamic target language model including a front-end part and a core part according to the reply information of the first intention, parses the voice signal to obtain keywords, and then calls the dynamic target language model to parse the keywords to obtain the second intention and the service content. Since the dynamic target language model is obtained based on the reply information of the first intention, the second intention and the service content obtained through the parsing of the dynamic target language model are all related to the first intention. Therefore, the embodiment of the present application is able to ignore speech unrelated to the first intention, prevents the provided service content from deviating from the user's needs, achieves a good recognition effect, and improves the user's experience.
  • the embodiment of the present application also judges whether the voice signal has multiple intentions through the tail part in the dynamic target language model, so as to provide the service indicated by each intention of the user, thereby further improving the user experience.
  • an embodiment of the present application also provides a voice recognition device, which includes:
  • the first acquisition module 901 is used to acquire or generate a dynamic target language model according to the reply information to the first intention.
  • the dynamic target language model includes a front-end part and a core part.
• the core part is used to determine possible descriptions related to the reply information, and the front-end part is used to determine the description of the confirmatory information for the reply message;
  • the second acquisition module 902 is configured to acquire a voice signal, analyze the voice signal to generate keywords;
  • the first determination module 903 is used to call the dynamic target language model to determine the second intent and service content.
• the front-end part of the dynamic target language model parses out the second intention based on the keywords, and the core part of the dynamic target language model parses out the service content based on the keywords.
  • the dynamic target language model further includes a tailing part, which is used to confirm whether there is an additional intention
  • the device further includes:
  • the second determining module is used to call the dynamic target language model to determine the additional intent, and the tail part of the dynamic target language model parses the additional intent according to the keywords.
  • the tail part includes tail marker words
• the second determination module is configured such that the tail part parses out the reference tail tag word according to the keywords, together with the time point at which the reference tail tag word is located; based on the reference tail tag word, combined with the first intention and the second intention, the dynamic target language model is updated to obtain an updated target language model; the updated target language model is called, and the additional intention is parsed according to the keywords and the time point of the reference tail tag word.
  • the device further includes:
• a cache module, used to cache historical voice signals;
  • the second acquisition module 902 is configured to analyze the voice signal, and use the historical voice signal to perform context detection to generate keywords.
  • the device further includes: a confirmation module, configured to confirm the second intention, and obtain the confirmed second intention.
  • the confirmation module is configured to send confirmation information of the second intention to the user, obtain the second intention fed back by the user, and use the second intention fed back by the user as the confirmed second intention.
  • the first obtaining module 901 is configured to convert the reply information of the first intention into a reference format to obtain reply information in the reference format, and obtain or generate a dynamic target language model according to the reply information in the reference format.
  • the first acquisition module is used to transform the trained language model into a weighted finite state transition automaton model, and use the weighted finite state transition automaton model as a dynamic target language model, wherein the trained language model is defined by a reference format Response information and reference vocabulary training.
• the first acquisition module 901 is configured to convert the trained language model into a weighted finite-state transducer model and use it as the first language model, where the trained language model is obtained by training on the reply information in the reference format whose length is not less than the reference length; obtain the second language model according to the reply information in the reference format whose length is less than the reference length, and obtain the third language model according to the reference vocabulary; merge the first language model, the second language model, and the third language model to obtain the overall language model, and use the overall language model as the dynamic target language model.
  • the first obtaining module 901 includes:
  • the first obtaining unit is configured to obtain a word confusion network based on the reply information in a reference format with a length not less than a reference length, and each word in the word confusion network has a transition probability;
  • the calculation unit is used to calculate the penalty weight of each word, convert the word confusion network into a weighted finite state transition automaton model according to the penalty weight of each word, and use the weighted finite state transition automaton model as the first language model;
  • the second obtaining unit is configured to obtain the second language model according to the reply information in the reference format whose length is less than the reference length, and obtain the third language model according to the reference vocabulary;
  • the merging unit is used to merge the first language model, the second language model, and the third language model to obtain the overall language model, and the overall language model is used as the dynamic target language model.
  • the calculation unit is configured to use the negative logarithm of the transition probability of the word as the penalty weight for any word.
  • the calculation unit is configured to use, for any vocabulary, the logarithm of the number of reply messages in the reference format containing the vocabulary as the penalty weight.
  • the calculation unit is configured to use the logarithm of the number of occurrences of the word in the reply message in the reference format as the penalty weight for any word.
• the embodiment of the application obtains or generates a dynamic target language model including a front-end part and a core part according to the reply information of the first intention, parses the voice signal to obtain keywords, and then calls the dynamic target language model to parse the keywords to obtain the second intention and the service content. Since the dynamic target language model is obtained based on the reply information of the first intention, the second intention and the service content obtained through the parsing of the dynamic target language model are all related to the first intention. Therefore, the embodiment of the present application is able to ignore speech unrelated to the first intention, prevents the provided service content from deviating from the user's needs, achieves a good recognition effect, and improves the user's experience.
  • the embodiment of the present application also judges whether the voice signal has multiple intentions through the tail part in the dynamic target language model, so as to provide the service indicated by each intention of the user, thereby further improving the user experience.
• An embodiment of the present application also provides a voice recognition device, which includes a memory and a processor; the memory stores at least one instruction, and the at least one instruction is loaded and executed by the processor to implement the voice recognition method provided by the embodiment of the present application.
  • the method includes: obtaining or generating a dynamic target language model according to the reply information to the first intention.
  • the dynamic target language model includes a front-end part and a core part.
• the core part is used to determine possible descriptions related to the reply information, and the front-end part is used to determine the description of the confirmatory information of the reply message; obtain the voice signal and parse the voice signal to generate keywords; call the dynamic target language model to determine the second intention and the service content, where the front-end part of the dynamic target language model parses out the second intention according to the keywords, and the core part of the dynamic target language model parses out the service content based on the keywords.
  • the dynamic target language model further includes a tailing part, the tailing part is used to confirm whether there is an additional intent, and the method further includes: calling the dynamic target language model to determine the additional intent, and the tailing part of the dynamic target language model parses out the additional intent based on keywords .
• the tail part includes tail tag words; calling the dynamic target language model to determine the additional intention, where the tail part of the dynamic target language model parses out the additional intention based on the keywords, includes: the tail part parses out the reference tail tag word according to the keywords, together with the time point at which the reference tail tag word is located; based on the reference tail tag word, combined with the first intention and the second intention, the dynamic target language model is updated to obtain an updated target language model; the updated target language model is called, and the additional intention is parsed according to the keywords and the time point of the reference tail tag word.
  • the method before acquiring the voice signal, further includes: buffering the historical voice signal; parsing the voice signal to generate keywords, including: parsing the voice signal, and using the historical voice signal for context detection to generate keywords.
  • the method further includes: confirming the second intention, and obtaining the confirmed second intention.
  • confirming the second intention and obtaining the confirmed second intention includes: sending confirmation information of the second intention to the user, obtaining the second intention fed back by the user, and using the second intention fed back by the user as the confirmed second intention intention.
  • obtaining or generating a dynamic target language model according to the reply information to the first intention includes: converting the reply information of the first intention into a reference format to obtain reply information in the reference format, and obtaining or generating according to the reply information in the reference format Dynamic target language model.
  • obtaining or generating the dynamic target language model according to the reply information in the reference format includes: converting the trained language model into a weighted finite state transition automata model, and using the weighted finite state transition automaton model as the dynamic target language model,
  • the trained language model is obtained from the response information in the reference format and reference vocabulary training.
  • obtaining or generating the dynamic target language model according to the response information in the reference format includes: converting the trained language model into a weighted finite state transition automaton model, and using the weighted finite state transition automaton model as the first language model,
  • the trained language model is obtained by training the response information in the reference format with a length not less than the reference length; the second language model is obtained according to the response information in the reference format with the length less than the reference length, and the third language model is obtained according to the reference vocabulary; merge
  • the first language model, the second language model, and the third language model are used to obtain the overall language model, and the overall language model is used as the dynamic target language model.
  • obtaining or generating a dynamic target language model according to the reply information in the reference format including: obtaining a word confusion network based on the reply information in the reference format with a length not less than the reference length, and each word in the word confusion network has a transition probability; Calculate the penalty weight of each vocabulary, transform the word confusion network into a weighted finite state transition automaton model according to the penalty weight of each vocabulary, and use the weighted finite state transition automaton model as the first language model; Obtain the second language model according to the response information in the reference format, and obtain the third language model according to the reference vocabulary; merge the first language model, the second language model and the third language model to obtain the overall language model, and use the overall language model as the dynamic target language model .
  • calculating the penalty weight of each vocabulary includes: for any vocabulary, taking the negative logarithm of the transition probability of the vocabulary as the penalty weight.
  • calculating the penalty weight for each vocabulary includes: for any vocabulary, the logarithm of the number of items of the reply information in the reference format containing the vocabulary is used as the penalty weight.
  • calculating the penalty weight of each vocabulary includes: for any vocabulary, the logarithm of the number of occurrences of the vocabulary in the reply message in the reference format is used as the penalty weight.
  • the embodiment of the present application also provides a computer-readable storage medium in which at least one instruction is stored.
• the instruction is loaded and executed by a processor to implement the voice recognition method provided by the embodiment of the present application, the method including: obtaining or generating a dynamic target language model based on the reply information to the first intention.
• the dynamic target language model includes a front-end part and a core part; the core part is used to determine possible descriptions related to the reply information, and the front-end part is used to determine the description of the confirmatory information of the reply message.
  • the dynamic target language model further includes a tailing part, the tailing part is used to confirm whether there is an additional intent, and the method further includes: calling the dynamic target language model to determine the additional intent, and the tailing part of the dynamic target language model parses out the additional intent based on keywords .
• the tail part includes tail tag words; calling the dynamic target language model to determine the additional intention, where the tail part of the dynamic target language model parses out the additional intention based on the keywords, includes: the tail part parses out the reference tail tag word according to the keywords, together with the time point at which the reference tail tag word is located; based on the reference tail tag word, combined with the first intention and the second intention, the dynamic target language model is updated to obtain an updated target language model; the updated target language model is called, and the additional intention is parsed according to the keywords and the time point of the reference tail tag word.
  • the method before acquiring the voice signal, further includes: buffering the historical voice signal; parsing the voice signal to generate keywords, including: parsing the voice signal, and using the historical voice signal for context detection to generate keywords.
  • the method further includes: confirming the second intention, and obtaining the confirmed second intention.
  • confirming the second intention and obtaining the confirmed second intention includes: sending confirmation information of the second intention to the user, obtaining the second intention fed back by the user, and using the second intention fed back by the user as the confirmed second intention intention.
• obtaining or generating a dynamic target language model according to the reply information to the first intention includes: converting the reply information of the first intention into a reference format to obtain reply information in the reference format, and obtaining or generating the dynamic target language model according to the reply information in the reference format.
  • obtaining or generating the dynamic target language model according to the reply information in the reference format includes: converting the trained language model into a weighted finite state transition automata model, and using the weighted finite state transition automaton model as the dynamic target language model,
  • the trained language model is obtained from the response information in the reference format and reference vocabulary training.
  • obtaining or generating the dynamic target language model according to the response information in the reference format includes: converting the trained language model into a weighted finite state transition automaton model, and using the weighted finite state transition automaton model as the first language model,
  • the trained language model is obtained by training the response information in the reference format with a length not less than the reference length; the second language model is obtained according to the response information in the reference format with the length less than the reference length, and the third language model is obtained according to the reference vocabulary; merge
  • the first language model, the second language model, and the third language model are used to obtain the overall language model, and the overall language model is used as the dynamic target language model.
  • obtaining or generating a dynamic target language model according to the reply information in the reference format including: obtaining a word confusion network based on the reply information in the reference format with a length not less than the reference length, and each word in the word confusion network has a transition probability; Calculate the penalty weight of each vocabulary, transform the word confusion network into a weighted finite state transition automaton model according to the penalty weight of each vocabulary, and use the weighted finite state transition automaton model as the first language model; Obtain the second language model according to the response information in the reference format, and obtain the third language model according to the reference vocabulary; merge the first language model, the second language model and the third language model to obtain the overall language model, and use the overall language model as the dynamic target language model .
  • computing the penalty weight of each word includes: for any word, using the negative logarithm of the word's transition probability as its penalty weight.
  • computing the penalty weight of each word includes: for any word, using the logarithm of the number of items of reply information in the reference format containing the word as its penalty weight.
  • computing the penalty weight of each word includes: for any word, using the logarithm of the number of occurrences of the word in the reply information in the reference format as its penalty weight.
  • An embodiment of the present application further provides a chip, including a processor configured to call and run instructions stored in a memory, so that a communication device on which the chip is installed performs any one of the above speech recognition methods.
  • the embodiment of the present application also provides another chip, including: an input interface, an output interface, a processor, and a memory.
  • the input interface, the output interface, the processor, and the memory are connected by an internal connection path.
  • the processor is configured to execute the code in the memory, and when the code is executed, the processor is configured to execute any one of the aforementioned voice recognition methods.
  • there are one or more processors and one or more memories.
  • the memory may be integrated with the processor, or the memory and the processor may be provided separately.
  • the memory and the processor may be integrated on the same chip, or they may be separately arranged on different chips.
  • the embodiment of the present application does not limit the type of the memory and the arrangement of the memory and the processor.
  • the processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.
  • the general-purpose processor may be a microprocessor or any conventional processor. It is worth noting that the processor may be a processor supporting the advanced RISC machines (ARM) architecture.
  • the foregoing memory may include a read-only memory and a random access memory, and provide instructions and data to the processor.
  • the memory may also include non-volatile random access memory.
  • the memory can also store device type information.
  • the memory may be volatile memory or non-volatile memory, or may include both volatile and non-volatile memory.
  • the non-volatile memory may be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory.
  • the volatile memory may be random access memory (RAM), which is used as an external cache. By way of example but not limitation, many forms of RAM are available:
  • static random access memory (SRAM)
  • dynamic random access memory (DRAM)
  • synchronous dynamic random access memory (SDRAM)
  • double data rate synchronous dynamic random access memory (DDR SDRAM)
  • enhanced synchronous dynamic random access memory (ESDRAM)
  • synchlink dynamic random access memory (SLDRAM)
  • direct rambus random access memory (DR RAM)
  • the embodiments of the present application provide a computer program.
  • the processor or the computer can execute the corresponding steps and/or processes in the foregoing method embodiments.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired (for example, coaxial cable, optical fiber, or digital subscriber line) or wireless (for example, infrared, radio, or microwave) manner.
  • the computer-readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server or a data center integrated with one or more available media.
  • the usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, and a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk).

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

Provided are a speech recognition method, apparatus, and device, and a computer-readable storage medium, belonging to the field of artificial intelligence and particularly suited to human-machine interaction in automobiles or electric vehicles. The method includes: obtaining or generating a dynamic target language model according to reply information for a first intention, the dynamic target language model including a front-end part and a core part (201); acquiring a speech signal and parsing the speech signal to generate keywords (202); and invoking the dynamic target language model to determine a second intention and service content, where the front-end part of the dynamic target language model parses the second intention from the keywords, and the core part of the dynamic target language model parses the service content from the keywords (203). The speech recognition method can ignore irrelevant speech, recognize non-continuous multi-intention speech, and prevent the provided service content from deviating from the user's needs, achieving a good recognition effect.

Description

Speech recognition method, apparatus, and device, and computer-readable storage medium
This application claims priority to Chinese Patent Application No. 201910470966.4, filed on May 31, 2019 and entitled "Speech recognition method, apparatus, and device, and computer-readable storage medium", which is incorporated herein by reference in its entirety.
Technical Field
This application relates to the field of artificial intelligence, and in particular to a speech recognition method, apparatus, and device, and a computer-readable storage medium.
Background
With the development of artificial intelligence, AI systems are widely used in daily life, and speech recognition systems are one example. In use, a user issues a voice command to the speech recognition system, which must recognize the command, understand it, and ask the user a question based on it. The system then recognizes the user's reply to that question, understands the reply, and provides the service indicated by the reply, thereby meeting the user's needs. How speech recognition is performed is therefore key to meeting user needs.
The related art provides a speech recognition method in which, after a language model is invoked to recognize a voice command and the command is understood, a question is issued to the user on the one hand, and on the other hand the language model is adjusted according to the question, for example by integrating a vocabulary set related to the question into the language model so that the adjusted model can recognize the words in that set. When the user replies using words from the vocabulary set, the adjusted language model can recognize the reply and meet the user's needs.
The inventors found that the related art has at least the following problems:
Besides voice commands and replies, the user may also produce irrelevant speech addressed to third parties. For example, in a typical multi-user, multi-scenario setting such as voice interaction between a user and an in-vehicle module in a car or electric vehicle, the irrelevant speech may include remarks the user makes to other users or interjections from other users. The in-vehicle module's speech recognition system recognizes and interprets this irrelevant speech as commands or replies, so the provided service deviates from the user's needs and the user experience is poor.
Summary
Embodiments of this application provide a speech recognition method, apparatus, and device, and a computer-readable storage medium to overcome the poor recognition quality and poor user experience of the related art.
In one aspect, a speech recognition method is provided, including: obtaining or generating a dynamic target language model according to reply information for a first intention, where the dynamic target language model includes a front-end part and a core part, the core part is used to determine possible descriptions related to the reply information, and the front-end part is used to determine descriptions of confirmatory information about the reply information; acquiring a speech signal and parsing it to generate keywords; and invoking the dynamic target language model to determine a second intention and service content, where the front-end part of the dynamic target language model parses the second intention from the keywords, and the core part of the dynamic target language model parses the service content from the keywords.
Taking a voice dialogue between a user and an in-vehicle module as an example, the first intention is the intention obtained by parsing the user's speech signal after the voice dialogue begins. The reply information for the first intention includes one or more items of reply information that the in-vehicle module returns to the user for the first intention; the in-vehicle module obtains a dynamic target language model including a front-end part and a core part according to this reply information. After returning the reply information to the user, the in-vehicle module acquires a speech signal again. It should be noted that this newly acquired speech signal may include speech directed at the in-vehicle module, i.e., speech related to the reply information, as well as irrelevant speech between the user and other users.
The in-vehicle module then parses the acquired speech signal to generate keywords and invokes the dynamic target language model to extract, from the generated keywords, the words related to the reply information. The dynamic target language model includes a front-end part and a core part. The front-end part determines descriptions of confirmatory information about the reply information; confirmatory information may include confirmation, correction, and cancellation. Parsing the keywords with the front-end part yields the user's second intention. For example, if there is a single item of reply information and the confirmatory information found by the front-end part includes the confirmation "yes, that's right", the second intention can be confirmed to be the intention indicated by that reply information.
The core part determines possible descriptions related to the reply information; through the core part, the words the user used to describe the reply information are extracted from the keywords, and the service content is obtained from these words so that the service indicated by the service content can be provided to the user. In this embodiment, the service indicated by the service content may be provided by a third-party cloud service, the in-vehicle module, an in-vehicle terminal, or the vehicle manufacturer. The in-vehicle terminal may be a terminal in the vehicle other than the in-vehicle module, such as an in-vehicle display, in-vehicle air conditioner, or in-vehicle loudspeaker. The service may also be provided jointly by two or more of the third-party cloud service, the in-vehicle module, the in-vehicle terminal, and the vehicle manufacturer. It should be noted that, because both the front-end part and the core part of the dynamic target language model are obtained from the reply information, the second intention obtained via the front-end part and the service content obtained via the core part are both related to the first intention, while speech signals unrelated to the first intention are ignored. The embodiments of this application therefore achieve a good recognition effect, avoid services deviating from user needs due to interference from irrelevant speech, and improve the user experience.
Optionally, the dynamic target language model further includes a tail part used to confirm whether an additional intention exists, and the method further includes: invoking the dynamic target language model to determine the additional intention, where the tail part of the dynamic target language model parses the additional intention from the keywords.
Optionally, the tail part includes tail marker words. Invoking the dynamic target language model to determine the additional intention includes: the tail part parses a reference tail marker word and its time point from the keywords; the dynamic target language model is updated based on the reference tail marker word together with the first intention and the second intention to obtain an updated target language model; and the updated target language model is invoked to parse the additional intention from the keywords according to the time point of the reference tail marker word. Parsing out the additional intention allows a more accurate service to be provided.
Optionally, before acquiring the speech signal, the method further includes: buffering historical speech signals. Parsing the speech signal to generate keywords then includes: parsing the speech signal and generating the keywords after performing context detection using the historical speech signals. Context detection with historical speech makes the recognized keywords better fit the current scenario and further improves recognition accuracy.
Optionally, after invoking the dynamic target language model to determine the second intention and service content, the method further includes: confirming the second intention to obtain a confirmed second intention.
Optionally, confirming the second intention to obtain a confirmed second intention includes: sending confirmation information for the second intention to the user, obtaining the second intention fed back by the user, and using it as the confirmed second intention. Confirming the second intention makes it more accurate, so more accurate service content can be provided.
Optionally, obtaining the dynamic target language model according to the reply information for the first intention includes: converting the reply information for the first intention into a reference format to obtain reply information in the reference format, and obtaining or generating the dynamic target language model from the reply information in the reference format. Because different providers may supply reply information in different formats, converting it into a reference format unifies the format and facilitates its reception. For different application domains, the reply information is converted into different reference formats, so that reply information within one application domain shares the same format.
Optionally, obtaining or generating the dynamic target language model from the reply information in the reference format includes: converting a trained language model into a weighted finite-state transducer (WFST) model and using the WFST model as the dynamic target language model, where the trained language model is obtained by training on the reply information in the reference format and on reference vocabulary. The reference vocabulary includes, but is not limited to, the class names corresponding to words in the reply information in the reference format, and referential expressions.
Optionally, obtaining or generating the dynamic target language model from the reply information in the reference format includes: converting a trained language model into a WFST model and using it as a first language model, where the trained language model is obtained by training on reply information in the reference format whose length is not less than a reference length; obtaining a second language model from reply information in the reference format whose length is less than the reference length, and a third language model from the reference vocabulary; and merging the first, second, and third language models into an overall language model, which is used as the dynamic target language model.
Optionally, obtaining or generating the dynamic target language model from the reply information in the reference format includes: obtaining a word confusion network based on reply information in the reference format whose length is not less than the reference length, where each word in the network has a transition probability; computing a penalty weight for each word, converting the word confusion network into a WFST model according to the penalty weights, and using the WFST model as the first language model; obtaining the second language model from reply information in the reference format whose length is less than the reference length, and the third language model from the reference vocabulary; and merging the first, second, and third language models into an overall language model, which is used as the dynamic target language model.
Optionally, computing the penalty weight of each word includes: for any word, using the negative logarithm of the word's transition probability as its penalty weight. A word's transition probability indicates how frequently the word occurs within its class: the higher the frequency, the larger the transition probability and the smaller its negative logarithm. The penalty weight is thus inversely related to the frequency, so the target language model can better parse words that occur frequently within their class.
Optionally, computing the penalty weight of each word includes: for any word, using the logarithm of the number of items of reply information in the reference format containing the word as its penalty weight. Highly discriminative words, i.e., words contained in few items of reply information in the reference format, receive smaller penalty weights so that the target language model can better parse these highly discriminative words.
Optionally, computing the penalty weight of each word includes: for any word, using the logarithm of the number of occurrences of the word in the reply information in the reference format as its penalty weight. The fewer the occurrences of a word, i.e., the more discriminative it is, the smaller its penalty weight, so the dynamic target language model can better parse discriminative words.
In one aspect, a speech recognition apparatus is provided, including: a first obtaining module, configured to obtain or generate a dynamic target language model according to reply information for a first intention, where the dynamic target language model includes a front-end part and a core part, the core part is used to determine possible descriptions related to the reply information, and the front-end part is used to determine descriptions of confirmatory information about the reply information; a second obtaining module, configured to acquire a speech signal and parse it to generate keywords; and a first determining module, configured to invoke the dynamic target language model to determine a second intention and service content, where the front-end part parses the second intention from the keywords and the core part parses the service content from the keywords.
Optionally, the dynamic target language model further includes a tail part used to confirm whether an additional intention exists, and the apparatus further includes: a second determining module, configured to invoke the dynamic target language model to determine the additional intention, where the tail part parses the additional intention from the keywords.
Optionally, the tail part includes tail marker words, and the second determining module is configured to: have the tail part parse a reference tail marker word and its time point from the keywords; update the dynamic target language model based on the reference tail marker word together with the first and second intentions to obtain an updated target language model; and invoke the updated target language model to parse the additional intention from the keywords according to the time point of the reference tail marker word.
Optionally, the apparatus further includes a buffering module, configured to buffer historical speech signals; and the second obtaining module is configured to parse the speech signal and generate the keywords after performing context detection using the historical speech signals.
Optionally, the apparatus further includes a confirmation module, configured to confirm the second intention and obtain a confirmed second intention.
Optionally, the confirmation module is configured to send confirmation information for the second intention to the user, obtain the second intention fed back by the user, and use it as the confirmed second intention.
Optionally, the first obtaining module is configured to convert the reply information for the first intention into a reference format and obtain or generate the dynamic target language model from the reply information in the reference format.
Optionally, the first obtaining module is configured to convert a trained language model into a WFST model and use it as the dynamic target language model, where the trained language model is obtained by training on the reply information in the reference format and on the reference vocabulary.
Optionally, the first obtaining module is configured to: convert a trained language model into a WFST model and use it as a first language model, where the trained language model is obtained by training on reply information in the reference format whose length is not less than the reference length; obtain a second language model from reply information in the reference format whose length is less than the reference length, and a third language model from the reference vocabulary; and merge the first, second, and third language models into an overall language model used as the dynamic target language model.
Optionally, the first obtaining module includes: a first obtaining unit, configured to obtain a word confusion network based on reply information in the reference format whose length is not less than the reference length, where each word in the network has a transition probability; a computing unit, configured to compute a penalty weight for each word, convert the word confusion network into a WFST model according to the penalty weights, and use the WFST model as the first language model; a second obtaining unit, configured to obtain the second language model from reply information in the reference format whose length is less than the reference length, and the third language model from the reference vocabulary; and a merging unit, configured to merge the first, second, and third language models into an overall language model used as the dynamic target language model.
Optionally, the computing unit is configured to use, for any word, the negative logarithm of the word's transition probability as its penalty weight.
Optionally, the computing unit is configured to use, for any word, the logarithm of the number of items of reply information in the reference format containing the word as its penalty weight.
Optionally, the computing unit is configured to use, for any word, the logarithm of the number of occurrences of the word in the reply information in the reference format as its penalty weight.
In one aspect, a speech recognition device is provided, including a memory and a processor, where the memory stores at least one instruction that is loaded and executed by the processor to implement the method in the first aspect of the embodiments of this application or any possible implementation thereof.
Optionally, there are one or more processors and one or more memories.
Optionally, the memory may be integrated with the processor, or the memory and the processor may be provided separately.
In a specific implementation, the memory may be a non-transitory memory, for example a read-only memory (ROM); it may be integrated on the same chip as the processor or arranged on a separate chip. The embodiments of this application do not limit the type of the memory or the arrangement of the memory and the processor.
In another aspect, a computer-readable storage medium is provided, storing a program or instructions, where the instructions are loaded by a processor to perform any one of the above speech recognition methods.
A computer program (product) is also provided, including computer program code that, when run by a computer, causes the computer to perform any one of the above speech recognition methods.
A chip is also provided, including a processor configured to call and run instructions stored in a memory, so that a communication device on which the chip is installed performs any one of the above speech recognition methods.
Another chip is also provided, including an input interface, an output interface, a processor, and a memory connected through an internal connection path, where the processor is configured to execute code in the memory and, when the code is executed, the processor is configured to perform any one of the above speech recognition methods.
The technical solutions provided in the embodiments of this application bring at least the following beneficial effects:
The embodiments of this application obtain or generate a dynamic target language model including a front-end part and a core part according to the reply information for the first intention, parse the speech signal into keywords, and then invoke the dynamic target language model to parse the keywords into a second intention and service content. Because the dynamic target language model is obtained from the reply information for the first intention, both the second intention and the service content parsed by it are related to the first intention. The embodiments thus ignore speech unrelated to the first intention, i.e., can recognize non-continuous multi-intention speech, prevent the provided service content from deviating from user needs, achieve a good recognition effect, and improve the user experience.
Brief Description of Drawings
FIG. 1 is a schematic diagram of an implementation environment according to an embodiment of this application;
FIG. 2 is a block diagram of modules implementing the speech recognition method according to an embodiment of this application;
FIG. 3 is a flowchart of a speech recognition method according to an embodiment of this application;
FIG. 4 is a schematic structural diagram of a language model according to an embodiment of this application;
FIG. 5 is a schematic flowchart of speech recognition according to an embodiment of this application;
FIG. 6 is a schematic structural diagram of a language model according to an embodiment of this application;
FIG. 7 is a schematic structural diagram of a language model according to an embodiment of this application;
FIG. 8 is a schematic structural diagram of a word confusion network according to an embodiment of this application;
FIG. 9 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of this application.
Detailed Description
To make the objectives, technical solutions, and advantages of this application clearer, the implementations of this application are further described in detail below with reference to the accompanying drawings.
With the development of artificial intelligence, AI systems are widely used in daily life, and speech recognition systems are one example. In use, a user issues a voice command to the speech recognition system, which must recognize the command, understand it, and ask the user a question based on it. The system then recognizes the user's reply to that question, understands the reply, and provides the service indicated by the reply, thereby meeting the user's needs. How speech recognition is performed is therefore key to meeting user needs.
The related art provides a speech recognition method in which, after a language model is invoked to recognize a voice command and the command is understood, a question is issued to the user on the one hand, and on the other hand the language model is adjusted according to the question, for example by integrating a vocabulary set related to the question into the language model so that the adjusted model can recognize the words in that set. When the user replies using words from the vocabulary set, the adjusted language model can recognize the reply and meet the user's needs.
However, user speech is often quite flexible. For example, in a multi-user in-vehicle navigation scenario, the following voice dialogue may occur between a user and the in-vehicle module:
User (to the in-vehicle module): Find me a Sichuan restaurant nearby.
In-vehicle module (to the user): Do you want to go to Sichuan Restaurant A?
User (to other users in the car): It's noon right now, will parking be a problem? (To the in-vehicle module): Yes, Sichuan Restaurant A.
With the method of the related art, after the speech recognition system of the in-vehicle module asks a question based on the voice command, it can integrate the word "Sichuan Restaurant A" from the question into the language model to obtain an adjusted language model. If the user then replies "Yes, Sichuan Restaurant A" using that word, the adjusted model can recognize the reply. In the dialogue above, however, the user first produced irrelevant speech addressed to other users in the car, and the adjusted language model would also recognize this irrelevant speech as a reply, leading to misunderstanding. The recognition quality of the related art is therefore poor, as is the user experience.
An embodiment of this application provides a speech recognition method that can be applied in the implementation environment shown in FIG. 1, which includes an audio device, a memory, and a central processing unit (CPU). The audio device includes a microphone array and a speaker, and the memory stores programs or instructions of the modules used for speech recognition. The audio device, the memory, and the CPU are communicatively connected through a data bus (D-Bus), so that the CPU can invoke the microphone array to collect the user's speech signal, run the programs or instructions of the modules stored in the memory based on the collected signal, and invoke the speaker to output a speech signal to the user according to the result.
In addition, referring to FIG. 1, the CPU can access cloud services through a gateway to obtain data returned by the cloud services. The CPU can also access the controller area network bus (CAN-Bus) through the gateway to read and control the state of other devices.
Optionally, in the implementation environment of FIG. 1, the module programs or instructions stored in the memory include the circular speech buffer module, the AM module, the SL module, the Dynamic LM module, the SLU module, the DM module, and the NCM procedure shown in FIG. 2; the CPU in FIG. 1 runs these programs or instructions to implement speech recognition. The speech recognition procedure is described next with reference to the module functions shown in FIG. 2:
Front-end speech module: distinguishes the user's speech signal from non-speech signals such as road noise and music, and performs denoising, enhancement, and similar processing on the user's speech signal to improve the accuracy of subsequent recognition and understanding.
Circular buffer module: buffers the speech signal processed by the front-end speech module so that the stored signal can be recognized and understood multiple times. The circular buffer has a reference time length; when the buffered speech exceeds that length, the oldest speech signal is overwritten by new speech.
Acoustic model (AM) module: reads the speech signal stored in the circular buffer and converts it into a phoneme sequence.
Selective listening (SL) module: invokes the dynamic language model (Dynamic LM) module to convert the phoneme sequence output by the AM module into keywords, and sends the keywords to the spoken language understanding (SLU) module.
SLU module: extracts intentions and semantic slots from the keywords to understand the first intention, second intention, and additional intention indicated by the user's speech signal.
Dialogue manager (DM) module: requests reply information from the cloud service according to the first intention.
Application manager (APP Manager) module: converts the reply information returned by the cloud service into reply information in the reference format.
The DM module is further configured to start the non-continuous multi-intent (NCM) procedure of the relevant domain according to the reference-format reply information returned by the APP Manager module, to control the response generator (RG) module to generate reply content and play it back as speech, and to send instructions to the APP Manager module according to the second intention and the additional intention, so as to control applications or terminal devices to execute the service content and the additional intention.
The APP Manager module is further configured to perform word segmentation annotation and proper-noun annotation on the reply information, and to manage applications and terminal devices according to the instructions sent by the DM module, so as to control them to execute the service content and the additional intention.
Based on the implementation environment shown in FIG. 1 and referring to FIG. 3, an embodiment of this application provides a speech recognition method. As shown in FIG. 3, the method includes:
Step 201: Obtain or generate a dynamic target language model according to reply information for a first intention, where the dynamic target language model includes a front-end part and a core part, the core part is used to determine possible descriptions related to the reply information, and the front-end part is used to determine descriptions of confirmatory information about the reply information.
The first intention is the intention obtained by parsing the user's voice-command signal after the voice dialogue between the user and the system begins. In the dialogue above, the voice-command signal is the user's utterance "Find me a Sichuan restaurant nearby". Parsing the voice-command signal includes: invoking an acoustic model to convert the signal into a phoneme sequence, a phoneme being the smallest phonetic unit of a language (in Chinese, for example, an initial or a final); and then invoking a language model to convert the phoneme sequence into a text sequence, which is the voice command. The language model here has been trained on a training set, and a suitable model can be selected according to the domain in which speech recognition is applied.
After the text sequence is obtained, it can be parsed to obtain the first intention, which includes an intention and semantic slots; a semantic slot is a word in the text sequence with a clear definition or concept. For the text sequence "Find me a Sichuan restaurant nearby", the parsed intention is "navigation" and the semantic slots are "nearby" and "Sichuan restaurant", giving the first intention "navigate to a nearby Sichuan restaurant". Reply information satisfying the semantic slots can then be obtained for the first intention, either by sending the first intention to a cloud service and receiving the returned reply information, or by storing mappings between intentions and reply information in the memory and looking up the reply information corresponding to the first intention.
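The intent-plus-slot parsing described above can be sketched minimally as follows. This is a deliberately tiny rule-based stand-in for illustration only; the intent keyword "find", the slot lexicon, and the function name are all assumptions, and a real system would use the SLU module's trained model instead:

```python
def parse_intent(text_tokens, slot_lexicon):
    """Pick an intent from a trigger word and fill semantic slots
    from a lexicon of known slot values (rule-based sketch)."""
    intent = "navigation" if "find" in text_tokens else "unknown"
    # Semantic slots are tokens with a clear, predefined meaning.
    slots = [t for t in text_tokens if t in slot_lexicon]
    return {"intent": intent, "slots": slots}

result = parse_intent(
    ["find", "nearby", "Sichuan restaurant"],
    {"nearby", "Sichuan restaurant"},
)
```

Here `result` pairs the intent "navigation" with the slots "nearby" and "Sichuan restaurant", mirroring the first intention "navigate to a nearby Sichuan restaurant" in the example above.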
It should be noted that, whichever way the reply information is obtained, it may comprise one or more items, each being a text string. If there are multiple items, all of them can serve as candidates for the user to choose from. In the dialogue above, the reply information may be a single item "Sichuan Restaurant A" or multiple items such as "Sichuan Restaurant A", "Sichuan Restaurant B", and "Sichuan Restaurant C". This embodiment does not limit the number of items.
A dynamic target language model including a front-end part and a core part can then be obtained or generated based on the reply information for the first intention. The front-end part determines descriptions of confirmatory information about the reply information, including but not limited to confirmation, correction, and cancellation. For example, confirmations may include "yes" and "that's right", corrections may include "no" and "that's wrong", and cancellations may include "never mind" and "no need". The core part determines possible descriptions related to the reply information, for example the user repeating the reply information verbatim or selectively.
The process of obtaining or generating the dynamic target language model from the reply information is described in detail later. Whatever that process is, once the dynamic target language model is obtained or generated, a speech signal can be received.
Step 202: Acquire a speech signal and parse it to generate keywords.
After obtaining the reply information for the first intention, besides obtaining or generating the dynamic target language model, the in-vehicle module also sends the reply information to the user and then acquires a speech signal. The signal may include speech addressed to the in-vehicle module, i.e., speech responding to the reply information for the first intention, as well as irrelevant speech between the user and other users. In the dialogue above, the speech addressed to the in-vehicle module is "Yes, Sichuan Restaurant A", while the irrelevant speech is "It's noon right now, will parking be a problem". The irrelevant speech may include speech the user initiates toward other users as well as speech other users initiate toward the user, i.e., interjections; this embodiment does not limit the irrelevant speech.
After acquiring the speech signal, the in-vehicle module parses it to generate keywords. Optionally, in this embodiment, before acquiring the speech signal, the method further includes: buffering historical speech signals. Parsing the speech signal to generate keywords then includes: parsing the speech signal and generating the keywords after performing context detection using the historical speech signals.
A historical speech signal is a speech signal from the past; in the dialogue above, the voice-command signal "Find me a Sichuan restaurant nearby" used to obtain the first intention can serve as one. In this embodiment, historical speech can be buffered in a circular buffer with a reference time length: if the buffered history exceeds that length, the oldest signal is overwritten by new speech, and historical speech is read from the circular buffer when needed. This embodiment does not limit the buffering method; other methods may be chosen as needed.
Further, for parsing, the in-vehicle module can again invoke an acoustic model and a language model suited to the application domain to parse the speech signal into initial keywords. Because the speech addressed to the in-vehicle module responds to the reply information for the first intention, the initial keywords generated from it are related to the first intention, whereas those generated from the irrelevant speech are not. Context detection using the historical speech signals is therefore needed so that the keywords generated from the initial keywords relate only to the first intention, i.e., initial keywords unrelated to the first intention are ignored.
Context detection using historical speech may include: detecting, among the initial keywords, those related to the historical speech signals, and using the keywords related to the text sequence corresponding to the historical speech as the generated keywords. For example, parsing the signal "It's noon right now, will parking be a problem? Yes, Sichuan Restaurant A" yields the initial keywords "noon", "parking", "yes, it's", and "Sichuan Restaurant A". Among these, the keywords related to the historical speech "Find me a Sichuan restaurant nearby" are "yes, it's" and "Sichuan Restaurant A"; "noon" and "parking" can therefore be ignored, and only "yes, it's" and "Sichuan Restaurant A" are taken as the generated keywords.
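The context-detection step above can be sketched as follows. The relatedness test here (a shared token with the history text, or membership in a small set of confirmation phrases) is a deliberately crude assumption for illustration; the source does not fix the detection algorithm, and a real system would use the dynamic target language model itself:

```python
# Hypothetical confirmation phrases; in practice the front-end part
# of the dynamic target language model covers these.
CONFIRMATIONS = {"yes, it's", "that's right"}

def filter_keywords(initial_keywords, history_text):
    """Keep only the initial keywords related to the buffered
    historical speech, dropping keywords from irrelevant speech."""
    related = []
    for kw in initial_keywords:
        if kw in CONFIRMATIONS or any(tok in history_text for tok in kw.split()):
            related.append(kw)
    return related

kws = ["noon", "parking", "yes, it's", "Sichuan Restaurant A"]
hist = "Find me a Sichuan restaurant nearby"
kept = filter_keywords(kws, hist)
```

With the example from the text, `kept` retains "yes, it's" and "Sichuan Restaurant A" while "noon" and "parking" are ignored.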
This embodiment does not limit how context detection with historical speech is performed. However the keywords are generated, once they exist, the dynamic target language model can be invoked to parse them and determine the second intention and service content; see step 203.
Step 203: Invoke the dynamic target language model to determine the second intention and service content, where the front-end part of the dynamic target language model parses the second intention from the keywords, and the core part of the dynamic target language model parses the service content from the keywords.
As described in step 201, the dynamic target language model includes a front-end part and a core part. Because it is obtained from the reply information for the first intention, the second intention and service content it determines are both related to the first intention. The front-end part determines descriptions of confirmatory information about the reply information, so parsing the keywords with it yields the confirmatory information among them, from which the user's second intention is obtained. In the dialogue above, the reply information for the first intention is "Do you want to go to Sichuan Restaurant A" and the parsed keywords are "yes, it's" and "Sichuan Restaurant A"; the front-end part extracts "yes, it's", giving the user's second intention "go to Sichuan Restaurant A". The core part additionally extracts "Sichuan Restaurant A" from the keywords and, combined with the current in-vehicle navigation scenario, the service content "navigate to Sichuan Restaurant A" is obtained.
It can be seen that when the reply information for the first intention includes only one item, the front-end part alone can determine the user's second intention. When it includes two or more items, both the front-end part and the core part are needed. For example, if the reply information is "Which of the following would you like? Item 1: Sichuan Restaurant A; Item 2: Sichuan Restaurant B" and the parsed keywords are again "yes, it's" and "Sichuan Restaurant A", the front-end part still extracts the confirmation "yes, it's", but that alone cannot determine whether the second intention is Sichuan Restaurant A or Sichuan Restaurant B. The core part must also extract "Sichuan Restaurant A" to finally determine the second intention "go to Sichuan Restaurant A" and the service content "navigate to Sichuan Restaurant A".
It should be noted that if the confirmatory information found by the front-end part includes a confirmation, such as "yes, it's" in the dialogue above, the core part is further used to parse the keywords into the service content. If the confirmatory information includes a correction or cancellation, such as "no" or "that's wrong", the user does not accept the reply information and may not have responded to it; in that case, instead of parsing service content with the core part, other reply information is obtained and a new dynamic target language model is obtained from it to complete the recognition.
Besides the second intention and service content, invoking the dynamic target language model can also yield information such as the confidence of the second intention and service content and silent segments in the speech signal, where the confidence indicates how accurate the second intention and service content are.
After the second intention and service content are obtained, provision of the service indicated by the service content can be triggered. For the service content "navigate to Sichuan Restaurant A" in the dialogue above, execution includes invoking the navigation device to guide the user from the current position (where the dialogue occurred) to the location of Sichuan Restaurant A.
In an optional implementation, before the service content is executed, the method further includes: confirming the second intention to obtain a confirmed second intention, and executing the confirmed second intention. In this implementation, although the dynamic target language model is obtained or generated from the reply information for the first intention, the second intention and service content it determines may still be inconsistent with the first intention. The second intention is therefore confirmed before the service content is executed, to ensure consistency with the first intention, and the confirmed second intention is then executed.
Consistency between the second and first intentions includes, but is not limited to: the second intention corresponds to the reply information for the first intention (e.g., the second intention "go to Sichuan Restaurant A" corresponds to the reply "Sichuan Restaurant A"), or the second intention satisfies a condition included in the first intention (e.g., "go to Sichuan Restaurant A" satisfies the distance condition "nearby").
Optionally, confirming the second intention to obtain a confirmed second intention includes: sending confirmation information for the second intention to the user, obtaining the second intention fed back by the user, and using it as the confirmed second intention.
As described above, the dynamic target language model yields a confidence for the second intention and service content, so this embodiment can send different confirmation information for different confidences. Taking the second intention "go to Sichuan Restaurant A" as an example: if the confidence exceeds a threshold, the second intention is fairly reliable and indirect confirmation can be used, e.g., sending the utterance "You have selected Sichuan Restaurant A", which presumes the second intention is correct, to obtain the user's feedback. If the confidence does not exceed the threshold, the reliability is low and direct confirmation is used, e.g., sending the utterance "Do you confirm selecting Sichuan Restaurant A?".
The confirmation information sent in both the indirect and direct modes is speech. If the user's feedback still cannot be obtained via speech confirmation, other forms such as text confirmation can be chosen. Optionally, the reply information for the first intention is displayed to the user through a terminal so that the user can select any item; the intention indicated by the selected item is taken as the confirmed second intention and executed, completing the recognition.
Next, the dialogue above is extended into the following more complex dialogue:
User (to the in-vehicle module): Find me a Sichuan restaurant nearby.
In-vehicle module (to the user): Do you want to go to Sichuan Restaurant A?
User (to other users in the car): It's noon right now, will parking be a problem? (To the in-vehicle module): Yes, Sichuan Restaurant A, and also find me a parking space.
It can be seen that in this complex dialogue, after saying "Yes, Sichuan Restaurant A", the user also expresses the additional intention "and also find me a parking space", making it a multi-intention dialogue.
For this, in an optional implementation, the dynamic target language model further includes a tail part used to confirm whether an additional intention exists. The method provided in this embodiment therefore further includes: invoking the dynamic target language model to determine the additional intention, where the tail part parses the additional intention from the keywords, so that every intention in a multi-intention dialogue such as the above is recognized.
In this implementation, besides obtaining the second intention via the front-end part and the service content via the core part, the additional intention is obtained by parsing the keywords with the tail part. A schematic of the front-end, core, and tail parts is shown in FIG. 4, in which out-of-vocabulary (OOV) denotes words outside the dictionary (the dictionary is used to obtain words from phoneme sequences), and eps denotes a skip edge indicating an optional part.
Optionally, the tail part includes tail marker words, including but not limited to words such as "also", "and", and "by the way"; in the multi-intention dialogue above, the tail marker is "also". Because users describe tail markers in fairly fixed ways, a set of tail marker words can be used as a corpus to train a language model, and the trained model serves as the tail part. Invoking the dynamic target language model to determine the additional intention then includes: the tail part parses a reference tail marker word and its time point from the keywords; the dynamic target language model is updated based on the reference tail marker word together with the first and second intentions, yielding an updated target language model; and the updated target language model is invoked to parse the additional intention from the keywords according to the time point of the reference tail marker word.
The reference tail marker word is one of the set of tail markers used as the corpus. If no reference tail marker exists, there is no additional intention and the service indicated by the service content can be provided directly; if one exists, an additional intention exists, and the tail part also obtains the time point of the reference tail marker word.
If a reference tail marker exists, a language model is further invoked according to the first and second intentions; it may be the language model of the domain of the first and second intentions. In the multi-intention dialogue above, both intentions belong to "navigation", so a navigation-domain language model can replace the dynamic target language model, giving the updated target language model.
The updated target language model is then invoked to parse the keywords after the time point of the reference tail marker word, obtaining the user's additional intention. In the multi-intention dialogue above, the reference tail marker is "also". The keywords in the speech before it, "It's noon right now, will parking be a problem? Yes, Sichuan Restaurant A", have already been parsed by the front-end and core parts of the dynamic target language model. The updated target language model therefore parses the keywords in the speech after it, i.e., in "find me a parking space", to obtain the user's additional intention.
It should be noted that this embodiment also provides another way to update the target language model: after the language model is obtained according to the first and second intentions, the merged model of that language model and the tail part is used as the updated target language model. As shown in FIG. 5, the updated model can then, after parsing one additional intention, iteratively check whether more additional intentions exist, increasing the number of recognizable intentions.
If an additional intention exists, after it is parsed with the updated target language model, the second intention is executed as follows: the service content and the additional intention are executed. That is, after the service content is obtained it is not executed immediately; the tail part first confirms whether an additional intention exists in the speech signal. If one exists, it is obtained, and then the service content and the additional intention are executed. Only if the tail information confirms that no additional intention exists is the obtained service content executed alone.
Further, executing the service content and the additional intention includes executing them together or in sequence. For example, if the service content is "navigate to Sichuan Restaurant A" and the additional intention is "play a song", the additional intention can be executed while the service content is executed, i.e., they are executed together. If the service content is "navigate to Sichuan Restaurant A" and the additional intention is "find a parking space", they must be executed in sequence. Different service content and additional intentions may also be executed by different entities: a third-party cloud service, the in-vehicle module, an in-vehicle terminal, or the vehicle manufacturer, where the in-vehicle terminal may be a terminal in the vehicle other than the in-vehicle module, such as an in-vehicle display, air conditioner, or loudspeaker; or jointly by two or more of these. The embodiments of this application do not limit this.
Next, the process of obtaining or generating the dynamic target language model according to the reply information for the first intention, mentioned in step 201, is described in detail. Optionally, it includes: converting the reply information for the first intention into a reference format to obtain reply information in the reference format, and obtaining or generating the dynamic target language model from the reply information in the reference format.
As described above, the dynamic target language model includes at least a front-end part and a core part, and may also include a tail part. The front-end part determines descriptions of confirmatory information about the reply information. Like the tail part, because users describe confirmatory information in fairly fixed ways, a set of confirmatory expressions for confirmation, correction, or cancellation can be used as a corpus to train a language model, and the trained model serves as the front-end part, giving it the ability to parse confirmation, correction, and cancellation information from keywords. The core part, in contrast, must be obtained from the reference-format reply information described above.
The reply information may be provided by multiple providers, and because different providers may supply different formats, it must be converted into a reference format to unify the format and facilitate reception. For different application domains, the reply information can be converted into different reference formats so that formats within one domain are the same. For example, in in-vehicle navigation the reply information is usually an address, which can be unified into a country (or region), province (or state), city, district, road, and street-number format. As another example, in the point of interest (POI) domain, the reply information usually relates to POIs and can be unified into a class name, address, telephone number, and user-rating format, where class names may include hotel, restaurant, mall, museum, concert hall, cinema, stadium, hospital, and pharmacy.
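The per-domain normalization described above can be sketched as follows. The provider field names (`category`, `addr`, `tel`, `score`) and the exact reference keys are assumptions for illustration; the source only fixes which fields each domain's reference format should contain:

```python
def to_reference_format(raw, domain):
    """Map a provider-specific reply item into the unified
    reference format of the given application domain."""
    if domain == "poi":
        return {
            "class_name": raw.get("category", ""),
            "address": raw.get("addr") or raw.get("address", ""),
            "phone": raw.get("tel", ""),
            "rating": raw.get("score", ""),
        }
    if domain == "navigation":
        keys = ["country", "province", "city", "district", "road", "number"]
        return {k: raw.get(k, "") for k in keys}
    raise ValueError("unknown domain: " + domain)

poi = to_reference_format({"category": "restaurant", "addr": "No. 1, D Avenue"}, "poi")
nav = to_reference_format({"city": "B City"}, "navigation")
```

Unifying the keys this way lets the downstream model-building steps treat reply items from any provider identically.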
In addition, before the reply information is converted into the reference format, it can be segmented and annotated to facilitate the conversion. Word segmentation annotation decomposes a text string into words; if the resulting words include proper nouns, the proper nouns can also be annotated. Both segmentation and annotation can be implemented with artificial intelligence algorithms, including but not limited to conditional random fields (CRF), long short-term memory (LSTM) networks, and hidden Markov models (HMM).
In this embodiment, after the reference-format reply information is obtained, the dynamic target language model is further obtained or generated from it. Optionally, there are the following three ways of obtaining the target language model from the reference-format reply information:
First way: convert a trained language model into a weighted finite-state transducer model and use the weighted finite-state transducer model as the dynamic target language model, where the trained language model is obtained by training on the reference-format reply information and on reference vocabulary.
The reference vocabulary includes, but is not limited to, the class names corresponding to words in the reference-format reply information, and referential expressions. The words in the reference-format reply information can be obtained through segmentation and annotation, and their class names obtained in turn; for example, the class name of "Sichuan Restaurant A" is "restaurant". A referential expression refers to any one item of reference-format reply information; for example, when there are multiple items, referential expressions include "the first one", "the middle one", "the second-to-last one", "the last one", and so on.
The trained language model is an initial language model trained with the reference-format reply information and the reference vocabulary as the corpus. Optionally, the initial model may be an N-gram model; see FIG. 6 for a schematic. An N-gram model assumes that a word's occurrence probability is related only to the words preceding it and not to other words. For example, when N is 3, the N-gram model is a third-order model, and a word's occurrence probability depends on the 2 preceding words, i.e., the occurrence probability of the i-th word X_i is P(X_i | X_{i-1}, X_{i-2}). The N-gram model can thus count the probability that one word follows another, i.e., the probability that two words occur adjacently. Training the N-gram model on the corpus yields a trained N-gram model that has counted the adjacency probabilities of the words in the corpus.
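The trigram counting just described can be sketched as a minimal maximum-likelihood estimator (no smoothing; the sentence markers `<s>`/`</s>` and function names are illustrative conventions, not from the source):

```python
from collections import defaultdict

def train_trigram(corpus):
    """Estimate P(w_i | w_{i-2}, w_{i-1}) by counting trigrams
    and their history bigrams over tokenized sentences."""
    tri = defaultdict(int)
    bi = defaultdict(int)
    for sent in corpus:
        toks = ["<s>", "<s>"] + sent + ["</s>"]
        for i in range(2, len(toks)):
            tri[(toks[i - 2], toks[i - 1], toks[i])] += 1
            bi[(toks[i - 2], toks[i - 1])] += 1

    def prob(w, h2, h1):
        # Maximum-likelihood estimate; 0.0 for unseen histories.
        if bi[(h2, h1)] == 0:
            return 0.0
        return tri[(h2, h1, w)] / bi[(h2, h1)]

    return prob

prob = train_trigram([["go", "to", "restaurant", "A"],
                      ["go", "to", "restaurant", "B"]])
```

With this toy corpus, "restaurant" always follows "go to", while "A" follows "to restaurant" in half the sentences.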
Further, the trained language model can be converted into a weighted finite-state transducer (WFST). Based on a dictionary, the WFST converts the input phoneme sequence into words and, based on the adjacency probabilities counted by the trained language model, obtains the adjacency weights of the words and outputs the core information according to these weights. The core information can be regarded as a word sequence, so its occurrence probability is the product of the adjacency weights of all the words it contains.
In addition, the conversion enlarges the parsing range of the trained language model: the trained model can parse keywords into words from the reply information and the reference vocabulary, while the converted WFST can additionally parse combinations of two or three of the words from the reply information, their corresponding class names, and referential expressions; for example, the WFST can parse "the middle restaurant", a combination of a referential expression and a class name.
It can be seen that this WFST is the core part of the dynamic target language model. The WFST together with the front-end part (or the WFST, the front-end part, and the tail part) can then be used as the dynamic target language model.
Second way: convert a trained language model into a WFST model and use the WFST model as the first language model, where the trained language model is obtained by training on reference-format reply information whose length is not less than the reference length; obtain the second language model from reference-format reply information whose length is less than the reference length; obtain the third language model from the reference vocabulary; and merge the first, second, and third language models into an overall language model, which is used as the dynamic target language model.
For the reference vocabulary, see the description of the first way. Unlike the first way, the second way does not use reply information shorter than the reference length or the reference vocabulary as the corpus; only reply information whose length is not less than the reference length is used, and the trained language model is the initial model (which may still be an N-gram model) trained on that corpus. In an optional implementation, the reference length is 2, i.e., two words.
The reason is that the N-gram model uses a back-off algorithm: for a word sequence that never appeared in the corpus, the occurrence probability of a lower-order word sequence is used as its occurrence probability, so that the N-gram model can output a result for any input phoneme sequence. For example, if the corpus of a third-order model contains no word sequence (X_{i-2}, X_{i-1}, X_i), the model has not counted the occurrence probability P(X_i | X_{i-1}, X_{i-2}) of the word X_i. If the user uses the sequence (X_{i-2}, X_{i-1}, X_i), P(X_i | X_{i-1}, X_{i-2}) is estimated from the lower-order (second-order) P(X_i | X_{i-1}), so that (X_{i-2}, X_{i-1}, X_i) can still be parsed.
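The back-off estimate just described can be sketched as follows. The source does not fix the exact back-off scheme, so this uses a stupid-backoff-style constant `alpha` as an assumption; schemes such as Katz back-off compute discounted weights instead:

```python
def backoff_prob(w, h2, h1, tri_prob, bi_prob, alpha=0.4):
    """Trigram probability with back-off: if the trigram
    (h2, h1, w) was never seen, fall back to the bigram
    probability P(w | h1), scaled by a constant alpha."""
    p3 = tri_prob.get((h2, h1, w), 0.0)
    if p3 > 0.0:
        return p3
    return alpha * bi_prob.get((h1, w), 0.0)

tri = {("go", "to", "restaurant"): 1.0}   # seen trigrams
bi = {("to", "A"): 0.5}                   # seen bigrams
```

Here a seen trigram returns its own probability, while the unseen trigram ("go", "to", "A") backs off to `alpha * P("A" | "to")`.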
The trained language model is meant to determine possible descriptions related to the reply information, and users usually repeat reply information of different lengths with different speech to confirm or select it. For reply information shorter than the reference length, users usually repeat the whole item rather than some of its words. If such short reply information were used to train an N-gram model with back-off, the trained model would count some word sequences with low occurrence probability, degrading its parsing. The reference length can be set based on the scenario or experience and can also be adjusted during recognition; the embodiments of this application do not limit it.
For example, in the in-vehicle navigation scenario, "Oriental Pearl" can serve as a reply item of length 1. If "Oriental Pearl" were used as corpus, the trained language model would provide fragmentary word sequences formed from parts of it, which clearly have low occurrence probability. In this embodiment, a second language model without back-off is therefore obtained from reply information shorter than the reference length; the second language model parses only whole reply items shorter than the reference length among the keywords.
In addition, for the reference vocabulary comprising the class names corresponding to words and referential expressions, users' expressions are fairly fixed and the number of combinations of class names and referential expressions is limited, so the class names, the referential expressions, and their combinations can be used as corpus to train a third language model without back-off.
For reply information whose length is not less than the reference length, users often pick some words of the whole item to repeat, so such reply information is used as corpus to train the N-gram model; the trained language model is then converted into a WFST, giving a first language model with back-off that can parse, among the keywords, whole reply items or combinations of the words they contain. For example, in the in-vehicle navigation scenario with reference length 2, "No. 1, D Avenue, C District, B City, A Province" is an item longer than the reference length, and the user may repeat word sequences such as "B City" or "No. 1, D Avenue"; the first language model with back-off can then parse the keywords of such repeated speech.
After the first, second, and third language models are obtained, as shown in FIG. 7, they are merged into an overall language model, which is the core part of the dynamic target language model. The overall language model together with the front-end part (or the overall language model, the front-end part, and the tail part) is the dynamic target language model.
Third way: obtain a word confusion network based on reference-format reply information whose length is not less than the reference length, where each word in the network has a transition probability; compute a penalty weight for each word, convert the word confusion network into a WFST model according to the penalty weights, and use the WFST model as the first language model; obtain the second language model from reference-format reply information shorter than the reference length; obtain the third language model from the reference vocabulary; and merge the first, second, and third language models into an overall language model, which is used as the dynamic target language model.
For the reference vocabulary, see the first way; for obtaining the second language model from short reply information and the third language model from the reference vocabulary, see the second way. The process of obtaining the first language model is described next:
The method of obtaining the word confusion network includes: performing word alignment on words of the same category across the items of reply information whose length is not less than the reference length, and taking the number of categories plus one as the number of states in the confusion network. States are then connected by arcs; each arc carries one word and that word's transition probability, which indicates the word's occurrence frequency within its category, and the transition probabilities on all arcs between two adjacent states sum to 1.
Further, considering that when repeating a long reply item users often skip around among the words of the whole item, a skip edge is added between every two states of the confusion network so that the keywords of such word-skipping speech can be parsed. In the confusion network shown in FIG. 8, eps denotes a skip edge and F_i distinguishes the categories.
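The confusion-network construction above can be sketched as follows, assuming the reply items arrive already word-aligned (column j holds the words of category F_j); the `<eps>` placeholder stands for the skip edge, whose weight is assigned later from the penalty-weight rules:

```python
def build_confusion_network(aligned_items):
    """For each pair of adjacent states, collect the category's words
    with their transition probabilities (occurrence frequency within
    the category) plus an eps skip edge."""
    n_cols = len(aligned_items[0])
    arcs = []
    for j in range(n_cols):
        column = [item[j] for item in aligned_items]
        counts = {}
        for w in column:
            counts[w] = counts.get(w, 0) + 1
        total = len(column)
        slot = {w: c / total for w, c in counts.items()}
        slot["<eps>"] = 0.0  # skip edge; penalty weight computed separately
        arcs.append(slot)
    return arcs

items = [["A Province", "B City", "C District"],
         ["A Province", "B City", "E District"]]
arcs = build_confusion_network(items)
```

With these two aligned items, "A Province" has transition probability 1.0 in its category, while "C District" and "E District" each have 0.5.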
Then the penalty weight of each word is computed, and the confusion network is converted into a WFST according to the penalty weights, giving the first language model. It should be noted that when the first language model parses keywords, it computes the penalty weights of the candidate word sequences that the phoneme sequence of the speech signal may correspond to; a word sequence's penalty weight equals the product of the penalty weights of the words it contains, and the word sequence with the smallest penalty weight is output. Optionally, the penalty weight of each word can be computed in at least the following three ways:
First way: for any word, the negative logarithm of the word's transition probability is used as the penalty weight.
As described above, a word's transition probability indicates its occurrence frequency within its category: the higher the frequency, the larger the transition probability and the smaller its negative logarithm. The penalty weight is thus inversely related to the frequency, so the target language model can better parse words that occur frequently within their category.
Second way: for any word, the logarithm of the number of items of reference-format reply information containing the word is used as the penalty weight.
When repeating a chosen item among several long reply items, users tend to pick words that clearly distinguish it from the other items, i.e., highly discriminative words. For example, when repeating the former of "No. 1, D Avenue, C District, B City, A Province" and "No. 2, F Road, E District, B City, A Province", users usually do not choose "A Province" or "B City", which appear in both items, but rather "C District" or "No. 1, D Avenue", which appear only in the former.
In this embodiment, a word's discriminative strength is defined by the following formula:
IPF(T_{F_i}) = \log\frac{N}{n}
The inverse presence frequency (IPF) indicates a word's discriminative strength; the larger the IPF value, the more discriminative the word. T_{F_i} denotes a word in category F_i, N is the total number of items of reference-format reply information, and n is the number of items of reference-format reply information containing the word T_{F_i}. It can be seen that the more items contain a word, the smaller its IPF value and the weaker its discrimination.
When skip edges are considered, the total number of items of reference-format reply information changes from N to (N+1), and IPF(T_{F_i}) is updated to be expressed by the following formula:
IPF(T_{F_i}) = \log\frac{N+1}{n}
In addition, assuming that skip edges are not discriminative, i.e., a skip edge appears in every item of reference-format reply information, the IPF of a skip edge can be expressed as:
IPF(\mathrm{skip}) = \log\frac{N+1}{N+1} = 0
In this embodiment, IPF(skip) can also be rewritten to avoid its value being identically 0. The rewritten IPF(skip) is expressed by the following formula:
IPF(\mathrm{skip}) = \log\frac{N+1}{N}
Further, based on IPF(T_{F_i}), the penalty weight Penalty(T_{F_i}) of a word can be defined by the following formula, so that a word's penalty weight is the logarithm of the number of items of reference-format reply information containing it:
\mathrm{Penalty}(T_{F_i}) = \log(N+1) - IPF(T_{F_i}) = \log n
Correspondingly, the penalty weight Penalty(skip) of a skip edge can be defined as:
\mathrm{Penalty}(\mathrm{skip}) = \log(N+1) - IPF(\mathrm{skip}) = \log N
It can be seen that in this computation, highly discriminative words, i.e., words contained in few items of reference-format reply information, receive smaller penalty weights so that the target language model can better parse these highly discriminative words.
Third way: for any word, the logarithm of the number of occurrences of the word across the items of reference-format reply information is used as the penalty weight.
The third way can still define a word's discriminative strength by the following formula:
IPF(T_{F_i}) = \log\frac{N}{n}
Unlike the second way, however, N here denotes the total number of words contained in the items of reference-format reply information, and n denotes the number of occurrences of the word T_{F_i} in those items. Then, based on the formulas of the second way, the penalty weight Penalty(T_{F_i}) of the word T_{F_i} can be defined as follows, so that a word's penalty weight is the logarithm of its number of occurrences in the items of reference-format reply information:
\mathrm{Penalty}(T_{F_i}) = \log(n)
It can be seen that for more discriminative words, i.e., words with fewer occurrences, the penalty weight is smaller, so that the dynamic target language model can better parse discriminative words.
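The three penalty-weight rules can be sketched together as follows. The reply items are hypothetical examples; a real implementation would attach these weights to WFST arcs rather than operate on plain lists:

```python
import math

# Hypothetical reply items in a reference format, one word list per item.
replies = [
    ["A Province", "B City", "C District", "No. 1, D Avenue"],
    ["A Province", "B City", "E District", "No. 2, F Road"],
]

def penalty_from_transition(p):
    """Way 1: negative log of the word's transition probability."""
    return -math.log(p)

def penalty_from_item_count(word, items):
    """Way 2: log of the number of reply items containing the word."""
    return math.log(sum(1 for it in items if word in it))

def penalty_from_occurrences(word, items):
    """Way 3: log of the word's total occurrence count across items."""
    return math.log(sum(it.count(word) for it in items))
```

In all three ways, more frequent (less discriminative) words get larger penalties, so the minimum-penalty word sequence favors frequent in-category words (way 1) or discriminative words (ways 2 and 3).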
Whichever computation is used to obtain the first language model, once it is obtained, the first, second, and third language models are merged into an overall language model, which is the core part of the dynamic target language model. The overall language model together with the front-end part (or the overall language model, the front-end part, and the tail part) can then be used as the dynamic target language model.
In summary, the embodiments of this application obtain or generate a dynamic target language model including a front-end part and a core part according to the reply information for the first intention, parse the speech signal into keywords, and then invoke the dynamic target language model to parse the keywords into a second intention and service content. Because the dynamic target language model is obtained from the reply information for the first intention, both the parsed second intention and service content are related to the first intention. The embodiments thus ignore speech unrelated to the first intention, prevent the provided service content from deviating from user needs, achieve a good recognition effect, and improve the user experience.
In addition, the embodiments use the tail part of the dynamic target language model to judge whether the speech signal contains multiple intentions, so that the service indicated by each of the user's intentions can be provided, further improving the user experience.
As shown in FIG. 9, an embodiment of this application further provides a speech recognition apparatus, including:
a first obtaining module 901, configured to obtain or generate a dynamic target language model according to reply information for a first intention, where the dynamic target language model includes a front-end part and a core part, the core part is used to determine possible descriptions related to the reply information, and the front-end part is used to determine descriptions of confirmatory information about the reply information;
a second obtaining module 902, configured to acquire a speech signal and parse it to generate keywords; and
a first determining module 903, configured to invoke the dynamic target language model to determine a second intention and service content, where the front-end part of the dynamic target language model parses the second intention from the keywords, and the core part parses the service content from the keywords.
Optionally, the dynamic target language model further includes a tail part used to confirm whether an additional intention exists, and the apparatus further includes:
a second determining module, configured to invoke the dynamic target language model to determine the additional intention, where the tail part parses the additional intention from the keywords.
Optionally, the tail part includes tail marker words;
the second determining module is configured to: have the tail part parse a reference tail marker word and its time point from the keywords; update the dynamic target language model based on the reference tail marker word together with the first and second intentions to obtain an updated target language model; and invoke the updated target language model to parse the additional intention from the keywords according to the time point of the reference tail marker word.
Optionally, the apparatus further includes:
a buffering module, configured to buffer historical speech signals;
and the second obtaining module 902 is configured to parse the speech signal and generate the keywords after performing context detection using the historical speech signals.
Optionally, the apparatus further includes a confirmation module, configured to confirm the second intention and obtain a confirmed second intention.
Optionally, the confirmation module is configured to send confirmation information for the second intention to the user, obtain the second intention fed back by the user, and use it as the confirmed second intention.
Optionally, the first obtaining module 901 is configured to convert the reply information for the first intention into a reference format and obtain or generate the dynamic target language model from the reply information in the reference format.
Optionally, the first obtaining module is configured to convert a trained language model into a WFST model and use it as the dynamic target language model, where the trained language model is obtained by training on the reference-format reply information and on the reference vocabulary.
Optionally, the first obtaining module 901 is configured to: convert a trained language model into a WFST model and use it as a first language model, where the trained language model is obtained by training on reference-format reply information whose length is not less than the reference length; obtain a second language model from reference-format reply information whose length is less than the reference length, and a third language model from the reference vocabulary; and merge the first, second, and third language models into an overall language model used as the dynamic target language model.
Optionally, the first obtaining module 901 includes:
a first obtaining unit, configured to obtain a word confusion network based on reference-format reply information whose length is not less than the reference length, where each word in the network has a transition probability;
a computing unit, configured to compute a penalty weight for each word, convert the confusion network into a WFST model according to the penalty weights, and use the WFST model as the first language model;
a second obtaining unit, configured to obtain the second language model from reference-format reply information whose length is less than the reference length, and the third language model from the reference vocabulary; and
a merging unit, configured to merge the first, second, and third language models into an overall language model used as the dynamic target language model.
Optionally, the computing unit is configured to use, for any word, the negative logarithm of the word's transition probability as the penalty weight.
Optionally, the computing unit is configured to use, for any word, the logarithm of the number of items of reference-format reply information containing the word as the penalty weight.
Optionally, the computing unit is configured to use, for any word, the logarithm of the number of occurrences of the word in the reference-format reply information as the penalty weight.
In summary, the embodiments of this application obtain or generate a dynamic target language model including a front-end part and a core part according to the reply information for the first intention, parse the speech signal into keywords, and then invoke the dynamic target language model to parse the keywords into a second intention and service content. Because the dynamic target language model is obtained from the reply information for the first intention, both the parsed second intention and service content are related to the first intention. The embodiments thus ignore speech unrelated to the first intention, prevent the provided service content from deviating from user needs, achieve a good recognition effect, and improve the user experience.
In addition, the embodiments use the tail part of the dynamic target language model to judge whether the speech signal contains multiple intentions, so that the service indicated by each of the user's intentions can be provided, further improving the user experience.
It should be understood that the division into the functional modules above is only an example for the apparatus of FIG. 9. In practice, the functions may be allocated to different functional modules as needed, i.e., the internal structure of the device is divided into different functional modules to complete all or some of the functions described above. In addition, the apparatus provided in the above embodiment shares the same concept as the method embodiments; its specific implementation is detailed in the method embodiments and is not repeated here.
An embodiment of this application further provides a speech recognition device, including a memory and a processor; the memory stores at least one instruction that is loaded and executed by the processor to implement the speech recognition method provided in the embodiments of this application, which includes: obtaining or generating a dynamic target language model according to reply information for a first intention, where the dynamic target language model includes a front-end part and a core part, the core part is used to determine possible descriptions related to the reply information, and the front-end part is used to determine descriptions of confirmatory information about the reply information; acquiring a speech signal and parsing it to generate keywords; and invoking the dynamic target language model to determine a second intention and service content, where the front-end part parses the second intention from the keywords and the core part parses the service content from the keywords.
Optionally, the dynamic target language model further includes a tail part used to confirm whether an additional intention exists, and the method further includes: invoking the dynamic target language model to determine the additional intention, where the tail part parses the additional intention from the keywords.
Optionally, the tail part includes tail marker words; invoking the dynamic target language model to determine the additional intention includes: the tail part parses a reference tail marker word and its time point from the keywords; the dynamic target language model is updated based on the reference tail marker word together with the first and second intentions to obtain an updated target language model; and the updated target language model is invoked to parse the additional intention from the keywords according to the time point of the reference tail marker word.
Optionally, before acquiring the speech signal, the method further includes: buffering historical speech signals; parsing the speech signal to generate keywords includes: parsing the speech signal and generating the keywords after performing context detection using the historical speech signals.
Optionally, after invoking the dynamic target language model to determine the second intention and service content, the method further includes: confirming the second intention to obtain a confirmed second intention.
Optionally, confirming the second intention to obtain a confirmed second intention includes: sending confirmation information for the second intention to the user, obtaining the second intention fed back by the user, and using it as the confirmed second intention.
Optionally, obtaining or generating the dynamic target language model according to the reply information for the first intention includes: converting the reply information into a reference format to obtain reply information in the reference format, and obtaining or generating the dynamic target language model from the reference-format reply information.
Optionally, obtaining or generating the dynamic target language model from the reference-format reply information includes: converting a trained language model into a WFST model and using it as the dynamic target language model, where the trained language model is obtained by training on the reference-format reply information and the reference vocabulary.
Optionally, obtaining or generating the dynamic target language model from the reference-format reply information includes: converting a trained language model into a WFST model and using it as a first language model, where the trained language model is obtained by training on reference-format reply information whose length is not less than the reference length; obtaining a second language model from reference-format reply information whose length is less than the reference length, and a third language model from the reference vocabulary; and merging the first, second, and third language models into an overall language model used as the dynamic target language model.
Optionally, obtaining or generating the dynamic target language model from the reference-format reply information includes: obtaining a word confusion network based on reference-format reply information whose length is not less than the reference length, where each word in the network has a transition probability; computing a penalty weight for each word, converting the confusion network into a WFST model according to the penalty weights, and using the WFST model as the first language model; obtaining the second language model from reference-format reply information whose length is less than the reference length, and the third language model from the reference vocabulary; and merging the first, second, and third language models into an overall language model used as the dynamic target language model.
Optionally, computing the penalty weight of each word includes: for any word, using the negative logarithm of the word's transition probability as the penalty weight.
Optionally, computing the penalty weight of each word includes: for any word, using the logarithm of the number of items of reference-format reply information containing the word as the penalty weight.
Optionally, computing the penalty weight of each word includes: for any word, using the logarithm of the number of occurrences of the word in the reference-format reply information as the penalty weight.
An embodiment of this application further provides a computer-readable storage medium storing at least one instruction, where the instruction is loaded and executed by a processor to implement the speech recognition method provided in the embodiments of this application, which includes: obtaining or generating a dynamic target language model according to reply information for a first intention, where the dynamic target language model includes a front-end part and a core part, the core part is used to determine possible descriptions related to the reply information, and the front-end part is used to determine descriptions of confirmatory information about the reply information; acquiring a speech signal and parsing it to generate keywords; and invoking the dynamic target language model to determine a second intention and service content, where the front-end part parses the second intention from the keywords and the core part parses the service content from the keywords.
可选地,动态目标语言模型还包括接尾部分,接尾部分用于确认是否存在附加意图,方法还包括:调用动态目标语言模型确定附加意图,动态目标语言模型的接尾部分根据关键词解析出附加意图。
可选地,接尾部分包括接尾标志词;调用动态目标语言模型确定附加意图,动态目标语言模型的接尾部分根据关键词解析出附加意图,包括:接尾部分根据关键词解析出参考接尾标志词,以及参考接尾标志词所在的时间点;基于参考接尾标志词,结合第一意图和第二意图,更新动态目标语言模型,得到更新后的目标语言模型;调用更新后的目标语言模型,根据关键词和参考接尾标志词所在的时间点解析出附加意图。
可选地,获取语音信号之前,方法还包括:缓存历史语音信号;解析语音信号生成关键词,包括:解析语音信号,利用历史语音信号进行上下文检测后生成关键词。
可选地,调用动态目标语言模型确定第二意图和服务内容之后,方法还包括:确认第二意图,得到确认后的第二意图。
可选地,确认第二意图,得到确认后的第二意图,包括:向用户发送第二意图的确认信息,获取用户反馈的第二意图,将用户反馈的第二意图作为确认后的第二意图。
可选地,根据对第一意图的回复信息获取或生成动态目标语言模型,包括:将第一意图的回复信息转换为参考格式,得到参考格式的回复信息,根据参考格式的回复信息获取或生成目标语言模型。
Optionally, the obtaining or generating the dynamic target language model based on the reply information in the reference format includes: converting a trained language model into a weighted finite-state transducer model, and using the weighted finite-state transducer model as the dynamic target language model, where the trained language model is obtained through training on the reply information in the reference format and reference vocabulary.
Optionally, the obtaining or generating the dynamic target language model based on the reply information in the reference format includes: converting a trained language model into a weighted finite-state transducer model and using the weighted finite-state transducer model as a first language model, where the trained language model is obtained through training on reply information in the reference format whose length is not less than a reference length; obtaining a second language model based on reply information in the reference format whose length is less than the reference length, and obtaining a third language model based on reference vocabulary; and merging the first language model, the second language model, and the third language model to obtain an overall language model, and using the overall language model as the dynamic target language model.
Optionally, the obtaining or generating the dynamic target language model based on the reply information in the reference format includes: obtaining a word confusion network based on reply information in the reference format whose length is not less than a reference length, where each word in the word confusion network has a transition probability; computing a penalty weight for each word, converting the word confusion network into a weighted finite-state transducer model according to the penalty weight of each word, and using the weighted finite-state transducer model as a first language model; obtaining a second language model based on reply information in the reference format whose length is less than the reference length, and obtaining a third language model based on reference vocabulary; and merging the first language model, the second language model, and the third language model to obtain an overall language model, and using the overall language model as the dynamic target language model.
Optionally, the computing a penalty weight for each word includes: for any word, using the negative logarithm of the transition probability of the word as the penalty weight.
Optionally, the computing a penalty weight for each word includes: for any word, using the logarithm of the number of reply-information items in the reference format that contain the word as the penalty weight.
Optionally, the computing a penalty weight for each word includes: for any word, using the logarithm of the number of occurrences of the word in the reply information in the reference format as the penalty weight.
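As a rough illustration of the confusion-network-to-automaton step described above, the sketch below represents a word confusion network as a list of slots, each slot mapping candidate words to their transition probabilities, and emits weighted arcs between consecutive states, with each arc weight set to the negative-log penalty weight of its word. The data layout and names are assumptions made for illustration; a production system would typically build the automaton with a WFST toolkit rather than this toy arc list.

```python
import math

def confusion_network_to_wfst(slots):
    """Convert a word confusion network into a flat list of weighted arcs.

    slots: list of dicts, each mapping a candidate word to its transition
    probability within that slot of the network.
    Returns arcs as (src_state, dst_state, word, weight) tuples, where the
    weight is the penalty weight -log(p) of the word (variant 1 above).
    """
    arcs = []
    for i, slot in enumerate(slots):
        for word, prob in slot.items():
            # State i is the slot's entry state, state i + 1 its exit state.
            arcs.append((i, i + 1, word, -math.log(prob)))
    return arcs

# Two slots: the recognizer hesitates between "turn"/"turned",
# then between "left"/"right".
network = [{"turn": 0.9, "turned": 0.1}, {"left": 0.6, "right": 0.4}]
for arc in confusion_network_to_wfst(network):
    print(arc)
```

A lower-weight path through such an automaton corresponds to a more probable word sequence, which is what allows the merged overall model to score candidate recognitions.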
An embodiment of this application further provides a chip, including a processor. The processor is configured to invoke and run instructions stored in a memory, so that a communication device on which the chip is installed performs any one of the foregoing speech recognition methods.
An embodiment of this application further provides another chip, including an input interface, an output interface, a processor, and a memory. The input interface, the output interface, the processor, and the memory are connected through an internal connection path. The processor is configured to execute code in the memory, and when the code is executed, the processor performs any one of the foregoing speech recognition methods.
Optionally, there are one or more processors and one or more memories.
Optionally, the memory may be integrated with the processor, or the memory and the processor may be arranged separately.
In a specific implementation, the memory may be integrated with the processor on the same chip, or the two may be arranged on different chips. The type of the memory and the arrangement of the memory and the processor are not limited in the embodiments of this application.
It should be understood that the processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, any conventional processor, or the like. It is worth noting that the processor may be a processor supporting the advanced RISC machines (ARM) architecture.
Further, in an optional embodiment, the memory may include a read-only memory and a random access memory, and provide instructions and data to the processor. The memory may further include a non-volatile random access memory. For example, the memory may also store information about the device type.
The memory may be a volatile memory or a non-volatile memory, or may include both a volatile memory and a non-volatile memory. The non-volatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which is used as an external cache. By way of example but not limitation, many forms of RAM are available, for example, a static RAM (SRAM), a dynamic random access memory (DRAM), a synchronous DRAM (SDRAM), a double data rate SDRAM (DDR SDRAM), an enhanced SDRAM (ESDRAM), a synchlink DRAM (SLDRAM), and a direct rambus RAM (DR RAM).
An embodiment of this application further provides a computer program. When the computer program is executed by a computer, a processor or the computer may be caused to perform the corresponding steps and/or procedures in the foregoing method embodiments.
All or some of the foregoing embodiments may be implemented by software, hardware, firmware, or any combination thereof. When software is used for implementation, all or some of the embodiments may be implemented in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions according to this application are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired (for example, coaxial cable, optical fiber, or digital subscriber line) or wireless (for example, infrared, radio, or microwave) manner. The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device such as a server or a data center that integrates one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), a semiconductor medium (for example, a solid-state disk), or the like.
The foregoing descriptions are merely embodiments of this application and are not intended to limit this application. Any modification, equivalent replacement, or improvement made within the spirit and principles of this application shall fall within the protection scope of this application.

Claims (44)

  1. A speech recognition method, wherein the method comprises:
    obtaining or generating a dynamic target language model based on reply information for a first intention, wherein the dynamic target language model comprises a front-end part and a core part, the core part is used to determine possible descriptions related to the reply information, and the front-end part is used to determine descriptions of confirmation information for the reply information;
    obtaining a speech signal, and parsing the speech signal to generate keywords;
    invoking the dynamic target language model to determine a second intention and service content, wherein the front-end part of the dynamic target language model parses out the second intention from the keywords, and the core part of the dynamic target language model parses out the service content from the keywords.
  2. The method according to claim 1, wherein the dynamic target language model further comprises a tail part, the tail part is used to confirm whether an additional intention exists, and the method further comprises:
    invoking the dynamic target language model to determine the additional intention, wherein the tail part of the dynamic target language model parses out the additional intention from the keywords.
  3. The method according to claim 2, wherein the tail part comprises tail marker words;
    the invoking the dynamic target language model to determine the additional intention, wherein the tail part of the dynamic target language model parses out the additional intention from the keywords, comprises:
    parsing out, by the tail part, a reference tail marker word and the time point at which the reference tail marker word occurs from the keywords;
    updating the dynamic target language model based on the reference tail marker word in combination with the first intention and the second intention, to obtain an updated target language model;
    invoking the updated target language model to parse out the additional intention based on the keywords and the time point at which the reference tail marker word occurs.
  4. The method according to any one of claims 1 to 3, wherein before the obtaining a speech signal, the method further comprises:
    buffering historical speech signals;
    the parsing the speech signal to generate keywords comprises:
    parsing the speech signal, and generating the keywords after performing context detection with the historical speech signals.
  5. The method according to any one of claims 1 to 4, wherein the obtaining or generating a dynamic target language model based on reply information for a first intention comprises:
    converting the reply information for the first intention into a reference format to obtain reply information in the reference format, and obtaining or generating the dynamic target language model based on the reply information in the reference format.
  6. The method according to claim 5, wherein the obtaining or generating the dynamic target language model based on the reply information in the reference format comprises:
    converting a trained language model into a weighted finite-state transducer model, and using the weighted finite-state transducer model as the dynamic target language model, wherein the trained language model is obtained through training on the reply information in the reference format and reference vocabulary.
  7. The method according to claim 5, wherein the obtaining or generating the dynamic target language model based on the reply information in the reference format comprises:
    converting a trained language model into a weighted finite-state transducer model, and using the weighted finite-state transducer model as a first language model, wherein the trained language model is obtained through training on reply information in the reference format whose length is not less than a reference length;
    obtaining a second language model based on reply information in the reference format whose length is less than the reference length, and obtaining a third language model based on reference vocabulary;
    merging the first language model, the second language model, and the third language model to obtain an overall language model, and using the overall language model as the dynamic target language model.
  8. The method according to claim 5, wherein the obtaining or generating the dynamic target language model based on the reply information in the reference format comprises:
    obtaining a word confusion network based on reply information in the reference format whose length is not less than a reference length, wherein each word in the word confusion network has a transition probability;
    computing a penalty weight for each word, converting the word confusion network into a weighted finite-state transducer model according to the penalty weight of each word, and using the weighted finite-state transducer model as a first language model;
    obtaining a second language model based on reply information in the reference format whose length is less than the reference length, and obtaining a third language model based on reference vocabulary;
    merging the first language model, the second language model, and the third language model to obtain an overall language model, and using the overall language model as the dynamic target language model.
  9. The method according to claim 8, wherein the computing a penalty weight for each word comprises:
    for any word, using the negative logarithm of the transition probability of the word as the penalty weight.
  10. The method according to claim 8, wherein the computing a penalty weight for each word comprises:
    for any word, using the logarithm of the number of reply-information items in the reference format that contain the word as the penalty weight.
  11. The method according to claim 8, wherein the computing a penalty weight for each word comprises:
    for any word, using the logarithm of the number of occurrences of the word in the reply information in the reference format as the penalty weight.
  12. A speech recognition apparatus, wherein the apparatus comprises:
    a first obtaining module, configured to obtain or generate a dynamic target language model based on reply information for a first intention, wherein the dynamic target language model comprises a front-end part and a core part, the core part is used to determine possible descriptions related to the reply information, and the front-end part is used to determine descriptions of confirmation information for the reply information;
    a second obtaining module, configured to obtain a speech signal, and parse the speech signal to generate keywords;
    a first determining module, configured to invoke the dynamic target language model to determine a second intention and service content, wherein the front-end part of the dynamic target language model parses out the second intention from the keywords, and the core part of the dynamic target language model parses out the service content from the keywords.
  13. The apparatus according to claim 12, wherein the dynamic target language model further comprises a tail part, the tail part is used to confirm whether an additional intention exists, and the apparatus further comprises:
    a second determining module, configured to invoke the dynamic target language model to determine the additional intention, wherein the tail part of the dynamic target language model parses out the additional intention from the keywords.
  14. The apparatus according to claim 13, wherein the tail part comprises tail marker words;
    the second determining module is configured to: parse out, by the tail part, a reference tail marker word and the time point at which the reference tail marker word occurs from the keywords; update the dynamic target language model based on the reference tail marker word in combination with the first intention and the second intention, to obtain an updated target language model; and invoke the updated target language model to parse out the additional intention based on the keywords and the time point at which the reference tail marker word occurs.
  15. The apparatus according to any one of claims 12 to 14, wherein the apparatus further comprises:
    a buffering module, configured to buffer historical speech signals;
    the second obtaining module is configured to parse the speech signal, and generate the keywords after performing context detection with the historical speech signals.
  16. The apparatus according to any one of claims 12 to 15, wherein the first obtaining module is configured to convert the reply information for the first intention into a reference format to obtain reply information in the reference format, and obtain or generate the dynamic target language model based on the reply information in the reference format.
  17. The apparatus according to claim 16, wherein the first obtaining module is configured to convert a trained language model into a weighted finite-state transducer model, and use the weighted finite-state transducer model as the dynamic target language model, wherein the trained language model is obtained through training on the reply information in the reference format and reference vocabulary.
  18. The apparatus according to claim 16, wherein the first obtaining module is configured to: convert a trained language model into a weighted finite-state transducer model, and use the weighted finite-state transducer model as a first language model, wherein the trained language model is obtained through training on reply information in the reference format whose length is not less than a reference length; obtain a second language model based on reply information in the reference format whose length is less than the reference length, and obtain a third language model based on reference vocabulary; and merge the first language model, the second language model, and the third language model to obtain an overall language model, and use the overall language model as the dynamic target language model.
  19. The apparatus according to claim 16, wherein the first obtaining module comprises:
    a first obtaining unit, configured to obtain a word confusion network based on reply information in the reference format whose length is not less than a reference length, wherein each word in the word confusion network has a transition probability;
    a computing unit, configured to compute a penalty weight for each word, convert the word confusion network into a weighted finite-state transducer model according to the penalty weight of each word, and use the weighted finite-state transducer model as a first language model;
    a second obtaining unit, configured to obtain a second language model based on reply information in the reference format whose length is less than the reference length, and obtain a third language model based on reference vocabulary;
    a merging unit, configured to merge the first language model, the second language model, and the third language model to obtain an overall language model, and use the overall language model as the dynamic target language model.
  20. The apparatus according to claim 19, wherein the computing unit is configured to, for any word, use the negative logarithm of the transition probability of the word as the penalty weight.
  21. The apparatus according to claim 19, wherein the computing unit is configured to, for any word, use the logarithm of the number of reply-information items in the reference format that contain the word as the penalty weight.
  22. The apparatus according to claim 19, wherein the computing unit is configured to, for any word, use the logarithm of the number of occurrences of the word in the reply information in the reference format as the penalty weight.
  23. A speech recognition device, wherein the device comprises a memory and a processor; the memory stores at least one instruction, and the at least one instruction is loaded and executed by the processor to implement a speech recognition method, the method comprising:
    obtaining or generating a dynamic target language model based on reply information for a first intention, wherein the dynamic target language model comprises a front-end part and a core part, the core part is used to determine possible descriptions related to the reply information, and the front-end part is used to determine descriptions of confirmation information for the reply information;
    obtaining a speech signal, and parsing the speech signal to generate keywords;
    invoking the dynamic target language model to determine a second intention and service content, wherein the front-end part of the dynamic target language model parses out the second intention from the keywords, and the core part of the dynamic target language model parses out the service content from the keywords.
  24. The speech recognition device according to claim 23, wherein the dynamic target language model further comprises a tail part, the tail part is used to confirm whether an additional intention exists, and the method further comprises:
    invoking the dynamic target language model to determine the additional intention, wherein the tail part of the dynamic target language model parses out the additional intention from the keywords.
  25. The speech recognition device according to claim 24, wherein the tail part comprises tail marker words;
    the invoking the dynamic target language model to determine the additional intention, wherein the tail part of the dynamic target language model parses out the additional intention from the keywords, comprises:
    parsing out, by the tail part, a reference tail marker word and the time point at which the reference tail marker word occurs from the keywords;
    updating the dynamic target language model based on the reference tail marker word in combination with the first intention and the second intention, to obtain an updated target language model;
    invoking the updated target language model to parse out the additional intention based on the keywords and the time point at which the reference tail marker word occurs.
  26. The speech recognition device according to any one of claims 23 to 25, wherein before the obtaining a speech signal, the method further comprises:
    buffering historical speech signals;
    the parsing the speech signal to generate keywords comprises:
    parsing the speech signal, and generating the keywords after performing context detection with the historical speech signals.
  27. The speech recognition device according to any one of claims 23 to 26, wherein the obtaining or generating a dynamic target language model based on reply information for a first intention comprises:
    converting the reply information for the first intention into a reference format to obtain reply information in the reference format, and obtaining or generating the dynamic target language model based on the reply information in the reference format.
  28. The speech recognition device according to claim 27, wherein the obtaining or generating the dynamic target language model based on the reply information in the reference format comprises:
    converting a trained language model into a weighted finite-state transducer model, and using the weighted finite-state transducer model as the dynamic target language model, wherein the trained language model is obtained through training on the reply information in the reference format and reference vocabulary.
  29. The speech recognition device according to claim 27, wherein the obtaining or generating the dynamic target language model based on the reply information in the reference format comprises:
    converting a trained language model into a weighted finite-state transducer model, and using the weighted finite-state transducer model as a first language model, wherein the trained language model is obtained through training on reply information in the reference format whose length is not less than a reference length;
    obtaining a second language model based on reply information in the reference format whose length is less than the reference length, and obtaining a third language model based on reference vocabulary;
    merging the first language model, the second language model, and the third language model to obtain an overall language model, and using the overall language model as the dynamic target language model.
  30. The speech recognition device according to claim 27, wherein the obtaining or generating the dynamic target language model based on the reply information in the reference format comprises:
    obtaining a word confusion network based on reply information in the reference format whose length is not less than a reference length, wherein each word in the word confusion network has a transition probability;
    computing a penalty weight for each word, converting the word confusion network into a weighted finite-state transducer model according to the penalty weight of each word, and using the weighted finite-state transducer model as a first language model;
    obtaining a second language model based on reply information in the reference format whose length is less than the reference length, and obtaining a third language model based on reference vocabulary;
    merging the first language model, the second language model, and the third language model to obtain an overall language model, and using the overall language model as the dynamic target language model.
  31. The speech recognition device according to claim 30, wherein the computing a penalty weight for each word comprises:
    for any word, using the negative logarithm of the transition probability of the word as the penalty weight.
  32. The speech recognition device according to claim 30, wherein the computing a penalty weight for each word comprises:
    for any word, using the logarithm of the number of reply-information items in the reference format that contain the word as the penalty weight.
  33. The speech recognition device according to claim 30, wherein the computing a penalty weight for each word comprises:
    for any word, using the logarithm of the number of occurrences of the word in the reply information in the reference format as the penalty weight.
  34. A computer-readable storage medium, wherein the computer-readable storage medium stores at least one instruction, and the instruction is loaded and executed by a processor to implement a speech recognition method, the method comprising:
    obtaining or generating a dynamic target language model based on reply information for a first intention, wherein the dynamic target language model comprises a front-end part and a core part, the core part is used to determine possible descriptions related to the reply information, and the front-end part is used to determine descriptions of confirmation information for the reply information;
    obtaining a speech signal, and parsing the speech signal to generate keywords;
    invoking the dynamic target language model to determine a second intention and service content, wherein the front-end part of the dynamic target language model parses out the second intention from the keywords, and the core part of the dynamic target language model parses out the service content from the keywords.
  35. The computer-readable storage medium according to claim 34, wherein the dynamic target language model further comprises a tail part, the tail part is used to confirm whether an additional intention exists, and the method further comprises:
    invoking the dynamic target language model to determine the additional intention, wherein the tail part of the dynamic target language model parses out the additional intention from the keywords.
  36. The computer-readable storage medium according to claim 35, wherein the tail part comprises tail marker words;
    the invoking the dynamic target language model to determine the additional intention, wherein the tail part of the dynamic target language model parses out the additional intention from the keywords, comprises:
    parsing out, by the tail part, a reference tail marker word and the time point at which the reference tail marker word occurs from the keywords;
    updating the dynamic target language model based on the reference tail marker word in combination with the first intention and the second intention, to obtain an updated target language model;
    invoking the updated target language model to parse out the additional intention based on the keywords and the time point at which the reference tail marker word occurs.
  37. The computer-readable storage medium according to any one of claims 34 to 36, wherein before the obtaining a speech signal, the method further comprises:
    buffering historical speech signals;
    the parsing the speech signal to generate keywords comprises:
    parsing the speech signal, and generating the keywords after performing context detection with the historical speech signals.
  38. The computer-readable storage medium according to any one of claims 34 to 37, wherein the obtaining or generating a dynamic target language model based on reply information for a first intention comprises:
    converting the reply information for the first intention into a reference format to obtain reply information in the reference format, and obtaining or generating the dynamic target language model based on the reply information in the reference format.
  39. The computer-readable storage medium according to claim 38, wherein the obtaining or generating the dynamic target language model based on the reply information in the reference format comprises:
    converting a trained language model into a weighted finite-state transducer model, and using the weighted finite-state transducer model as the dynamic target language model, wherein the trained language model is obtained through training on the reply information in the reference format and reference vocabulary.
  40. The computer-readable storage medium according to claim 38, wherein the obtaining or generating the dynamic target language model based on the reply information in the reference format comprises:
    converting a trained language model into a weighted finite-state transducer model, and using the weighted finite-state transducer model as a first language model, wherein the trained language model is obtained through training on reply information in the reference format whose length is not less than a reference length;
    obtaining a second language model based on reply information in the reference format whose length is less than the reference length, and obtaining a third language model based on reference vocabulary;
    merging the first language model, the second language model, and the third language model to obtain an overall language model, and using the overall language model as the dynamic target language model.
  41. The computer-readable storage medium according to claim 38, wherein the obtaining or generating the dynamic target language model based on the reply information in the reference format comprises:
    obtaining a word confusion network based on reply information in the reference format whose length is not less than a reference length, wherein each word in the word confusion network has a transition probability;
    computing a penalty weight for each word, converting the word confusion network into a weighted finite-state transducer model according to the penalty weight of each word, and using the weighted finite-state transducer model as a first language model;
    obtaining a second language model based on reply information in the reference format whose length is less than the reference length, and obtaining a third language model based on reference vocabulary;
    merging the first language model, the second language model, and the third language model to obtain an overall language model, and using the overall language model as the dynamic target language model.
  42. The computer-readable storage medium according to claim 41, wherein the computing a penalty weight for each word comprises:
    for any word, using the negative logarithm of the transition probability of the word as the penalty weight.
  43. The computer-readable storage medium according to claim 41, wherein the computing a penalty weight for each word comprises:
    for any word, using the logarithm of the number of reply-information items in the reference format that contain the word as the penalty weight.
  44. The computer-readable storage medium according to claim 41, wherein the computing a penalty weight for each word comprises:
    for any word, using the logarithm of the number of occurrences of the word in the reply information in the reference format as the penalty weight.
PCT/CN2020/079522 2019-05-31 2020-03-16 Speech recognition method, apparatus and device, and computer-readable storage medium WO2020238341A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
JP2021570241A JP7343087B2 (ja) 2019-05-31 2020-03-16 Speech recognition method, apparatus, and device, and computer-readable storage medium
EP20814489.9A EP3965101A4 (en) 2019-05-31 2020-03-16 Speech recognition method, apparatus and device, and computer-readable storage medium
US17/539,005 US20220093087A1 (en) 2019-05-31 2021-11-30 Speech recognition method, apparatus, and device, and computer-readable storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910470966.4 2019-05-31
CN201910470966.4A CN112017642B (zh) 2019-05-31 Speech recognition method, apparatus and device, and computer-readable storage medium

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/539,005 Continuation US20220093087A1 (en) 2019-05-31 2021-11-30 Speech recognition method, apparatus, and device, and computer-readable storage medium

Publications (1)

Publication Number Publication Date
WO2020238341A1 true WO2020238341A1 (zh) 2020-12-03

Family

ID=73501103

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/079522 WO2020238341A1 (zh) 2019-05-31 2020-03-16 Speech recognition method, apparatus and device, and computer-readable storage medium

Country Status (5)

Country Link
US (1) US20220093087A1 (zh)
EP (1) EP3965101A4 (zh)
JP (1) JP7343087B2 (zh)
CN (1) CN112017642B (zh)
WO (1) WO2020238341A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112331210B (zh) * 2021-01-05 2021-05-18 太极计算机股份有限公司 Speech recognition apparatus
US11984125B2 (en) * 2021-04-23 2024-05-14 Cisco Technology, Inc. Speech recognition using on-the-fly-constrained language model per utterance
CN117112065A (zh) * 2023-08-30 2023-11-24 北京百度网讯科技有限公司 Large model plug-in invoking method, apparatus, device and medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101656799A (zh) * 2008-08-20 2010-02-24 阿鲁策株式会社 Automatic conversation system and conversation scenario editing apparatus
US8990085B2 (en) * 2009-09-30 2015-03-24 At&T Intellectual Property I, L.P. System and method for handling repeat queries due to wrong ASR output by modifying an acoustic, a language and a semantic model
CN105529030A (zh) * 2015-12-29 2016-04-27 百度在线网络技术(北京)有限公司 Speech recognition processing method and apparatus
CN105590626A (zh) * 2015-12-29 2016-05-18 百度在线网络技术(北京)有限公司 Continuous speech human-machine interaction method and system
CN105632495A (zh) * 2015-12-30 2016-06-01 百度在线网络技术(北京)有限公司 Speech recognition method and apparatus
CN106486120A (zh) * 2016-10-21 2017-03-08 上海智臻智能网络科技股份有限公司 Interactive voice response method and response system
CN109616108A (zh) * 2018-11-29 2019-04-12 北京羽扇智信息科技有限公司 Multi-turn dialogue interaction processing method and apparatus, electronic device and storage medium

Family Cites Families (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5384892A (en) * 1992-12-31 1995-01-24 Apple Computer, Inc. Dynamic language model for speech recognition
US6754626B2 (en) 2001-03-01 2004-06-22 International Business Machines Corporation Creating a hierarchical tree of language models for a dialog system based on prompt and dialog context
US7328155B2 (en) * 2002-09-25 2008-02-05 Toyota Infotechnology Center Co., Ltd. Method and system for speech recognition using grammar weighted based upon location information
US20040148170A1 (en) * 2003-01-23 2004-07-29 Alejandro Acero Statistical classifiers for spoken language understanding and command/control scenarios
JP3991914B2 (ja) 2003-05-08 2007-10-17 日産自動車株式会社 Speech recognition apparatus for moving body
US7228278B2 (en) 2004-07-06 2007-06-05 Voxify, Inc. Multi-slot dialog systems and methods
JP2006023345A (ja) 2004-07-06 2006-01-26 Alpine Electronics Inc Method and apparatus for automatic capture of television images
JP4846336B2 (ja) * 2005-10-21 2011-12-28 株式会社ユニバーサルエンターテインメント Conversation control device
KR20100012051A (ko) * 2010-01-12 2010-02-04 주식회사 다날 Star voice message listening system
US8938391B2 (en) * 2011-06-12 2015-01-20 Microsoft Corporation Dynamically adding personalization features to language models for voice search
US9082403B2 (en) * 2011-12-15 2015-07-14 Microsoft Technology Licensing, Llc Spoken utterance classification training for a speech recognition system
JP6280342B2 (ja) 2013-10-22 2018-02-14 株式会社Nttドコモ Function execution instruction system and function execution instruction method
US9286892B2 (en) 2014-04-01 2016-03-15 Google Inc. Language modeling in speech recognition
US10460720B2 (en) * 2015-01-03 2019-10-29 Microsoft Technology Licensing, Llc. Generation of language understanding systems and methods
US10832664B2 (en) * 2016-08-19 2020-11-10 Google Llc Automated speech recognition using language models that selectively use domain-specific model components
US10217458B2 (en) * 2016-09-23 2019-02-26 Intel Corporation Technologies for improved keyword spotting
CN106448670B (zh) * 2016-10-21 2019-11-19 竹间智能科技(上海)有限公司 Automatic reply dialogue system based on deep learning and reinforcement learning
CN107240394A (zh) * 2017-06-14 2017-10-10 北京策腾教育科技有限公司 Dynamic adaptive speech analysis method and system for human-machine spoken-language examination
KR20190004495A (ko) * 2017-07-04 2019-01-14 삼성에스디에스 주식회사 Method, apparatus and system for task processing using a chatbot
US10083006B1 (en) * 2017-09-12 2018-09-25 Google Llc Intercom-style communication using multiple computing devices
CN108735215A (zh) * 2018-06-07 2018-11-02 爱驰汽车有限公司 In-vehicle voice interaction system, method, device and storage medium
CN109003611B (zh) * 2018-09-29 2022-05-27 阿波罗智联(北京)科技有限公司 Method, apparatus, device and medium for vehicle voice control
US11004449B2 (en) * 2018-11-29 2021-05-11 International Business Machines Corporation Vocal utterance based item inventory actions
US10997968B2 * 2019-04-30 2021-05-04 Microsoft Technology Licensing, LLC Using dialog context to improve language understanding


Also Published As

Publication number Publication date
US20220093087A1 (en) 2022-03-24
EP3965101A4 (en) 2022-06-29
JP7343087B2 (ja) 2023-09-12
CN112017642A (zh) 2020-12-01
JP2022534242A (ja) 2022-07-28
EP3965101A1 (en) 2022-03-09
CN112017642B (zh) 2024-04-26

Similar Documents

Publication Publication Date Title
US11887604B1 (en) Speech interface device with caching component
US20220156039A1 (en) Voice Control of Computing Devices
US9905228B2 (en) System and method of performing automatic speech recognition using local private data
US10884701B2 (en) Voice enabling applications
US7689420B2 (en) Personalizing a context-free grammar using a dictation language model
WO2020238341A1 (zh) 语音识别的方法、装置、设备及计算机可读存储介质
US11676585B1 (en) Hybrid decoding using hardware and software for automatic speech recognition systems
US10917758B1 (en) Voice-based messaging
US20070239453A1 (en) Augmenting context-free grammars with back-off grammars for processing out-of-grammar utterances
US10685647B2 (en) Speech recognition method and device
US10838954B1 (en) Identifying user content
JP2008097003A (ja) 自動音声認識システムに対する適応コンテキスト
CN110956955B (zh) 一种语音交互的方法和装置
JP2018040904A (ja) 音声認識装置および音声認識方法
US10866948B2 (en) Address book management apparatus using speech recognition, vehicle, system and method thereof
US20210118435A1 (en) Automatic Synchronization for an Offline Virtual Assistant
CN112863496B (zh) 一种语音端点检测方法以及装置
US11211056B1 (en) Natural language understanding model generation
US11277304B1 (en) Wireless data protocol
JP2022121386A (ja) テキストベースの話者変更検出を活用した話者ダイアライゼーション補正方法およびシステム
US11893996B1 (en) Supplemental content output
US11935533B1 (en) Content-related actions based on context
CN116110396B (zh) 语音交互方法、服务器和计算机可读存储介质
US11907676B1 (en) Processing orchestration for systems including distributed components
US10304454B2 (en) Persistent training and pronunciation improvements through radio broadcast

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20814489

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2021570241

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2020814489

Country of ref document: EP

Effective date: 20211202