WO2022193892A1 - Voice interaction method and apparatus, computer-readable storage medium, and electronic device - Google Patents

Voice interaction method and apparatus, computer-readable storage medium, and electronic device

Info

Publication number
WO2022193892A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio signal
result
speech recognition
recognition result
recognition
Prior art date
Application number
PCT/CN2022/076422
Other languages
English (en)
French (fr)
Inventor
田川
潘复平
牛建伟
余凯
Original Assignee
深圳地平线机器人科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳地平线机器人科技有限公司
Priority to JP2022558093A (patent publication JP2023520861A)
Priority to US18/247,441 (patent publication US20240005917A1)
Publication of WO2022193892A1

Classifications

    • All classifications fall under section G (Physics), class G10 (Musical instruments; acoustics), subclass G10L (Speech analysis techniques or speech synthesis; speech recognition; speech or voice processing techniques; speech or audio coding or decoding), mainly within G10L15/00 (Speech recognition) and G10L21/00 (Speech or voice signal processing techniques to produce another audible or non-audible signal).
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/14: Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/183: Speech classification or search using natural language modelling, using context dependencies, e.g. language models
    • G10L15/1815: Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
    • G10L15/1822: Parsing for meaning understanding
    • G10L15/30: Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G10L15/34: Adaptation of a single recogniser for parallel processing, e.g. by use of multiple processors or cloud computing
    • G10L21/0272: Voice signal separating
    • G10L2015/223: Execution procedure of a spoken command

Definitions

  • the present disclosure relates to the field of computer technology, and in particular, to a voice interaction method, an apparatus, a computer-readable storage medium, and an electronic device.
  • Intelligent voice interaction technology can be applied to various devices such as automobiles, robots, household appliances, central control systems, access control systems, and ATM machines.
  • the voice interaction system usually only receives one voice signal, and the voice signal is processed to give feedback to the user.
  • the voice interaction system is developing towards a more efficient, intelligent and personalized direction.
  • Embodiments of the present disclosure provide a voice interaction method, apparatus, computer-readable storage medium, and electronic device.
  • An embodiment of the present disclosure provides a voice interaction method, the method including: acquiring at least one audio signal; recognizing the at least one audio signal by using a preset speech recognition model, and obtaining a first type of recognition result through the speech recognition model; determining, from a cache, stored recognition data about the at least one audio signal; generating a second type of recognition result based on the stored recognition data; processing the first type of recognition result and the second type of recognition result by using the speech recognition model to obtain sentence recognition results respectively corresponding to the at least one audio signal; performing semantic parsing on the sentence recognition results to obtain at least one parsing result; and generating, based on the at least one parsing result, an instruction for controlling a voice interaction device to perform a corresponding function.
  • According to another aspect, a voice interaction apparatus includes: an acquisition module for acquiring at least one audio signal; a recognition module for recognizing the at least one audio signal by using a preset speech recognition model and obtaining a first type of recognition result through the speech recognition model; a determination module for determining, from a cache, stored recognition data about the at least one audio signal; a first generation module for generating a second type of recognition result based on the stored recognition data; a processing module for processing the first type of recognition result and the second type of recognition result by using the speech recognition model to obtain sentence recognition results respectively corresponding to the at least one audio signal; a parsing module for performing semantic parsing on each sentence recognition result to obtain at least one parsing result; and a second generation module for generating, based on the at least one parsing result, an instruction for controlling a voice interaction device to perform a corresponding function.
  • According to another aspect, a computer-readable storage medium is provided, in which a computer program is stored, the computer program being used to execute the above voice interaction method.
  • According to another aspect, an electronic device includes: a processor; and a memory for storing instructions executable by the processor; the processor is configured to read the executable instructions from the memory and execute the instructions to implement the above voice interaction method.
  • At least one audio signal is recognized by using a preset speech recognition model; during the recognition process, stored recognition data is extracted from the cache to generate part of the recognition results, while the other part of the recognition results is generated by the speech recognition model. This effectively reuses the stored recognition data, removes the need for the speech recognition model to process the full amount of data, and improves the efficiency of processing the at least one audio signal, which helps to meet the requirements of low resource consumption and low processing delay in multi-channel voice interaction scenarios.
  • FIG. 1 is a system diagram to which the present disclosure is applied.
  • FIG. 2 is a schematic flowchart of a voice interaction method provided by an exemplary embodiment of the present disclosure.
  • FIG. 3 is a schematic flowchart of a voice interaction method provided by another exemplary embodiment of the present disclosure.
  • FIG. 4 is a schematic flowchart of a voice interaction method provided by another exemplary embodiment of the present disclosure.
  • FIG. 5 is a schematic flowchart of a voice interaction method provided by another exemplary embodiment of the present disclosure.
  • FIG. 6 is a schematic flowchart of a voice interaction method provided by another exemplary embodiment of the present disclosure.
  • FIG. 7 is a schematic diagram of an application scenario of the voice interaction method according to an embodiment of the present disclosure.
  • FIG. 8 is a schematic structural diagram of a voice interaction apparatus provided by an exemplary embodiment of the present disclosure.
  • FIG. 9 is a schematic structural diagram of a voice interaction apparatus provided by another exemplary embodiment of the present disclosure.
  • FIG. 10 is a structural diagram of an electronic device provided by an exemplary embodiment of the present disclosure.
  • a plurality may refer to two or more, and “at least one” may refer to one, two or more.
  • the term "and/or" in the present disclosure is only an association relationship to describe associated objects, indicating that there can be three kinds of relationships, for example, A and/or B, it can mean that A exists alone, and A and B exist at the same time , there are three cases of B alone.
  • the character "/" in the present disclosure generally indicates that the related objects are an "or" relationship.
  • Embodiments of the present disclosure can be applied to electronic devices such as terminal devices, computer systems, servers, etc., which can operate with numerous other general-purpose or special-purpose computing system environments or configurations.
  • Examples of well-known terminal devices, computing systems, environments and/or configurations suitable for use with electronic devices such as terminal devices, computer systems, and servers include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, handheld or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, network personal computers, minicomputer systems, mainframe computer systems, and distributed cloud computing environments including any of the above.
  • Electronic devices such as terminal devices, computer systems, servers, etc., may be described in the general context of computer system-executable instructions, such as program modules, being executed by the computer system.
  • program modules may include routines, programs, object programs, components, logic, data structures, etc. that perform particular tasks or implement particular abstract data types.
  • Computer systems/servers may be implemented in distributed cloud computing environments where tasks are performed by remote processing devices that are linked through a communications network.
  • program modules may be located on local or remote computing system storage media including storage devices.
  • Current voice interaction technology can usually process only one voice signal at a time and cannot process multiple voice signals simultaneously, so it cannot meet the needs of multi-user, personalized speech recognition; the technical solution of the present disclosure therefore applies voice interaction technology to multi-channel speech recognition scenarios.
  • At present, a speech recognition model needs to process the full amount of data of a speech signal, which results in low speech recognition efficiency and large interaction delay; especially in multi-channel speech recognition scenarios, this cannot meet users' requirements for an efficient, personalized voice interaction system.
  • FIG. 1 shows an exemplary system architecture 100 of a voice interaction method or a voice interaction apparatus to which embodiments of the present disclosure may be applied.
  • the system architecture 100 may include a terminal device 101 , a network 102 and a server 103 .
  • the network 102 is a medium for providing a communication link between the terminal device 101 and the server 103 .
  • the network 102 includes, but is not limited to, various connection types, such as wired, wireless communication links, or fiber optic cables, and the like.
  • the user can use the terminal device 101 to interact with the server 103 through the network 102 to receive or send messages and the like.
  • Various communication client applications may be installed on the terminal device 101, such as speech recognition applications, multimedia applications, search applications, web browser applications, shopping applications, instant messaging tools, and the like.
  • the terminal device 101 may be an electronic device, including but not limited to mobile terminals such as a vehicle-mounted terminal, a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (Personal Digital Assistant), a PAD (tablet computer), and a PMP (Portable Media Player), as well as fixed terminals such as a digital TV, a desktop computer, and smart home appliances.
  • the server 103 may be a device that can provide various service functions, such as a background speech recognition server that recognizes the audio signal uploaded by the terminal device 101 .
  • the background voice recognition server can process the received audio to obtain an instruction for controlling the voice interaction device, and feed the instruction back to the terminal device 101 .
  • the voice interaction method provided by the embodiments of the present disclosure may be executed by the server 103 or by the terminal device 101 .
  • Correspondingly, the voice interaction apparatus may be provided in the server 103 or in the terminal device 101.
  • The numbers of terminal devices 101, networks 102, and servers 103 in FIG. 1 are merely examples; any number of terminal devices, networks, and/or servers may be configured according to implementation requirements, which is not limited in this application.
  • In addition, when the audio signal does not need to be acquired remotely, the above system architecture may include only a server or a terminal device and omit the network 102; for example, when the terminal device 101 and the server 103 are connected in a wired manner, the network 102 may be omitted.
  • FIG. 2 is a schematic flowchart of a voice interaction method provided by an exemplary embodiment of the present disclosure.
  • The method of this embodiment can be applied to an electronic device (such as the terminal device 101 or the server 103 shown in FIG. 1). As shown in FIG. 2, the method includes the following steps:
  • Step 201 acquiring at least one audio signal.
  • the electronic device can acquire at least one audio signal locally or remotely.
  • the above-mentioned at least one audio signal may be a speech signal of at least one passenger in the vehicle collected by at least one microphone installed in the vehicle.
  • Step 202 Recognize at least one audio signal by using a preset speech recognition model, and obtain the first type of recognition result by using the speech recognition model.
  • the electronic device can use a preset speech recognition model to recognize at least one audio signal, and in the recognition process, the first type of recognition result is obtained by using the preset speech recognition model.
  • the preset speech recognition model may be a model obtained by pre-training with a large number of speech signal samples.
  • the preset speech recognition model is used to recognize at least one input audio signal to obtain at least one sentence recognition result.
  • the preset speech recognition model may include multiple sub-models, for example, including an acoustic sub-model, a language sub-model, a decoding network sub-model, and the like.
  • the acoustic sub-model is used to divide the audio signal into syllables;
  • the language sub-model is used to convert each syllable into a word;
  • the decoding network sub-model is used to select an optimal combination from multiple words to obtain a sentence.
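  • As a toy illustration of this three-stage decomposition (not the disclosed model; the frame, syllable, and word tables below are invented), the sub-models can be chained as follows:

```python
# Toy three-stage pipeline mirroring the acoustic / language / decoding split
# described above. The tables and scores are invented for illustration only.

ACOUSTIC = {"frame_1": ("da", 0.9), "frame_2": ("kai", 0.8),
            "frame_3": ("kong", 0.7), "frame_4": ("tiao", 0.9)}
LEXICON = {("da", "kai"): "turn on", ("kong", "tiao"): "air conditioner"}

def acoustic_submodel(frames):
    """Split the signal into syllables, each with a first probability score."""
    return [ACOUSTIC[f] for f in frames]

def language_submodel(syllables):
    """Convert pairs of syllables into candidate words."""
    pairs = zip(syllables[::2], syllables[1::2])
    return [LEXICON[(a[0], b[0])] for a, b in pairs]

def decoder(words):
    """Stand-in for the decoding network: keep the words in order."""
    return " ".join(words)

frames = ["frame_1", "frame_2", "frame_3", "frame_4"]
print(decoder(language_submodel(acoustic_submodel(frames))))
# -> "turn on air conditioner"
```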
  • In the process of recognizing the at least one audio signal by using the preset speech recognition model in step 202, the electronic device usually first searches the cache for recognition data corresponding to the current processing stage; if no corresponding recognition data exists in the cache, step 202 is executed to obtain the recognition data, and the obtained recognition data is used as the first type of recognition result.
  • Step 203 Determine the stored recognition data about the at least one audio signal from the cache.
  • In this embodiment, the electronic device may determine the stored recognition data about the at least one audio signal from the cache.
  • the electronic device usually first searches the cache to see if there is recognition data corresponding to the current processing stage, and if so, extracts the recognition data.
  • Step 204 based on the stored identification data, generate a second type of identification result.
  • the electronic device may generate a second type of identification result based on the stored identification data extracted in step 203 above.
  • As an example, the stored recognition data may be used as the second type of recognition result, or the stored recognition data may be processed to obtain the second type of recognition result, where the processing includes scaling the recognition data by a certain proportion, normalization, and the like.
  • first-type recognition results and second-type recognition results are usually intermediate results obtained during the processing of the speech recognition model, such as probability scores of syllables, probability scores of words, and the like.
  • Step 205 using the speech recognition model to process the first type of recognition result and the second type of recognition result, to obtain sentence recognition results corresponding to at least one audio signal respectively.
  • the electronic device may use a speech recognition model to process the first type of recognition result and the second type of recognition result to obtain sentence recognition results corresponding to at least one audio signal respectively.
  • the above-mentioned first-type recognition results and second-type recognition results may include the probability score of each syllable and the probability score of each word obtained after the audio signal is recognized.
  • For an audio signal, the speech recognition model can use a path search algorithm (such as the Viterbi algorithm) to determine an optimal path from the plurality of recognized words corresponding to the audio signal, and obtain a sentence as a sentence recognition result according to the optimal path.
  • One channel of audio signal may correspond to one sentence recognition result, and multiple channels of audio signals correspondingly yield multiple sentence recognition results.
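  • The path search mentioned above can be pictured with a small, self-contained Viterbi pass over a toy word lattice; the lattice and scores below are invented for illustration, and the disclosure itself only states that an algorithm such as Viterbi may be used.

```python
# Minimal Viterbi search over a toy lattice of word candidates per position.
# Scores are illustrative log-probabilities, not values from the disclosure.

lattice = [                        # candidate words at each sentence position
    {"turn on": -0.1, "turn off": -1.2},
    {"air conditioner": -0.2, "window": -0.9},
]
transition = {("turn on", "air conditioner"): -0.1,
              ("turn on", "window"): -1.0,
              ("turn off", "air conditioner"): -1.5,
              ("turn off", "window"): -0.4}

def viterbi(lattice, transition):
    best = {w: (s, [w]) for w, s in lattice[0].items()}
    for step in lattice[1:]:
        new_best = {}
        for word, emit in step.items():
            prev_word, (prev_score, path) = max(
                best.items(),
                key=lambda kv: kv[1][0] + transition[(kv[0], word)])
            new_best[word] = (prev_score + transition[(prev_word, word)] + emit,
                              path + [word])
        best = new_best
    return max(best.values(), key=lambda sp: sp[0])[1]

print(" ".join(viterbi(lattice, transition)))  # -> "turn on air conditioner"
```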
  • Step 206 Perform semantic parsing on each sentence recognition result to obtain at least one parsing result.
  • the electronic device may perform semantic parsing on each of the at least one sentence recognition result to obtain at least one parsing result.
  • each analysis result in the above at least one analysis result corresponds to an audio signal.
  • Methods such as a rule engine or a neural network engine may be adopted for semantic parsing of the sentence recognition results.
  • Step 207 based on the at least one analysis result, generate an instruction for controlling the voice interaction device to perform a corresponding function.
  • the electronic device may generate an instruction for controlling the voice interaction device to perform a corresponding function based on at least one analysis result.
  • the above-mentioned voice interaction device may be the above-mentioned electronic device for executing the voice interaction method of the present disclosure, or may be an electronic device that is communicatively connected to the above-mentioned electronic device.
  • As an example, when the voice interaction device is a vehicle-mounted air conditioner and the parsing result is "domain = vehicle control, intent = air conditioner temperature setting, slot = <temperature value = 25>", an instruction for controlling the air conditioner to be set to a preset temperature of 25°C can be generated.
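  • To make the mapping from such a parsing result to an instruction concrete, the following sketch is purely illustrative: the parse-result fields follow the example above, while the instruction format and function name are invented here rather than taken from the disclosure.

```python
# Toy mapping from a structured parsing result to a device instruction.
# The field names mirror the example in the text; the instruction format
# itself is an invented placeholder.

parse_result = {"domain": "vehicle control",
                "intent": "air conditioner temperature setting",
                "slots": {"temperature": 25}}

def to_instruction(parse_result):
    if parse_result["intent"] == "air conditioner temperature setting":
        return {"device": "air_conditioner",
                "action": "set_temperature",
                "value": parse_result["slots"]["temperature"]}
    raise ValueError("unsupported intent")

print(to_instruction(parse_result))
# -> {'device': 'air_conditioner', 'action': 'set_temperature', 'value': 25}
```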
  • In the method provided by this embodiment of the present disclosure, at least one audio signal is recognized by using a preset speech recognition model; during the recognition process, stored recognition data is extracted from the cache to generate part of the recognition results, while the other part of the recognition results is generated by the speech recognition model. This effectively reuses the stored recognition data and removes the need for the speech recognition model to process the full amount of data, thereby improving the efficiency of processing the at least one audio signal and helping to meet the requirements of low resource consumption and low processing delay on electronic devices in multi-channel voice interaction scenarios.
  • the above-mentioned electronic device may also store the recognition data obtained by the preset speech recognition model in the recognition process into the cache. Specifically, when the recognition data corresponding to a certain recognition step does not exist in the above cache, the speech recognition model is required to execute the recognition step, and the obtained recognition data is stored in the cache, thereby facilitating subsequent reuse of the recognition data.
  • In this way, the recognition data can be reused and the recognition data in the cache is continuously updated; using more stored recognition data during model recognition further improves the efficiency of speech recognition.
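  • A minimal sketch of this check-then-compute-then-store pattern (the data structure and function names are illustrative assumptions, not part of the disclosure):

```python
# Sketch of the check-then-compute-then-store pattern described above.
# compute_score stands in for any recognition step of the model.

cache = {}

def compute_score(word):
    print(f"model computes score for {word!r}")   # expensive model step
    return 0.5                                    # illustrative value

def cached_score(word):
    if word in cache:                 # second-type result: reuse stored data
        return cache[word]
    score = compute_score(word)       # first-type result: run the model
    cache[word] = score               # store for later reuse
    return score

cached_score("air conditioner")       # computed and stored
cached_score("air conditioner")       # served from the cache, no model call
```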
  • the specific execution process of the above step 201 is as follows:
  • the initial audio signal collected by the audio collection device is received.
  • the number of the audio collection devices may be one or more, which are used to collect at least one channel of the initial audio signal.
  • the above-mentioned initial audio signal may be a signal obtained by collecting the voice of at least one user by an audio collecting device.
  • multiple audio collection devices are configured, and each audio collection device is installed around each seat in the vehicle. Each audio collection device is used to collect the voices of passengers on the corresponding seats.
  • In this case, the collected audio signal usually contains a mixture of the voice signals of multiple users.
  • Then, sound source separation processing is performed on the initial audio signal to obtain the at least one channel of audio signal.
  • The sound source separation may use existing techniques, for example a blind source separation (Blind Source Separation, BSS) algorithm, to separate the voice signals of multiple users, so that each obtained audio signal corresponds to one user.
  • In an in-vehicle voice interaction scenario, the sound source separation processing can also associate each obtained audio signal with the corresponding audio collection device; since each audio collection device is installed near the corresponding seat, each obtained audio signal can be associated with the corresponding seat.
  • The voice signals of multiple users are thus separated through sound source separation and placed in one-to-one correspondence with the different audio collection devices.
  • This implementation can separate the voices of multiple users by separating the sound source of the initial audio signal, so that each subsequent voice recognition result corresponds to the corresponding user, thereby improving the accuracy of voice interaction of multiple users.
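  • For illustration only, one off-the-shelf route to such separation is independent component analysis. The sketch below assumes NumPy and scikit-learn are available, uses synthetic waveforms as stand-ins for voices, and maps each separated channel to the microphone (seat) whose mixture it correlates with most strongly; real systems may use other BSS algorithms.

```python
# Toy blind source separation with FastICA (assumes numpy and scikit-learn).
# Purely illustrative: recovers one channel per "speaker" from mixed signals.

import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 1, 8000)
speaker_a = np.sin(2 * np.pi * 5 * t)            # stand-in for the driver's voice
speaker_b = np.sign(np.sin(2 * np.pi * 3 * t))   # stand-in for a passenger's voice
sources = np.c_[speaker_a, speaker_b]            # shape (8000, 2)

mixing = np.array([[1.0, 0.5],                   # each microphone hears a mix
                   [0.4, 1.0]])
mixed = sources @ mixing.T                       # two microphone channels

separated = FastICA(n_components=2, random_state=0).fit_transform(mixed)

# Associate each separated channel with the microphone / seat whose mixed
# signal it correlates with most strongly.
corr = np.abs(np.corrcoef(separated.T, mixed.T))[:2, 2:]
print(corr.argmax(axis=1))                       # seat index per separated channel
```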
  • step 202 may include the following sub-steps:
  • Step 2021 Determine the speech recognition instance corresponding to each audio signal.
  • the speech recognition instance can be constructed by code, each speech recognition instance corresponds to a channel of audio signal, and each speech recognition instance is used to recognize a channel of audio signal corresponding to it.
  • Step 2022 executing each of the determined speech recognition instances in parallel.
  • each speech recognition instance may be implemented in a multi-threaded manner; or, each speech recognition instance may also be executed by different CPUs, so as to implement parallel execution.
  • Step 2023 Recognize the corresponding audio signal by using the preset speech recognition model through each speech recognition instance.
  • each speech recognition instance can call the above-mentioned preset speech recognition model in parallel and separately to recognize the corresponding speech signal, thereby realizing the parallel recognition of the audio signal.
  • In a specific implementation, the preset speech recognition model may first be loaded into memory, and each speech recognition instance shares the preset speech recognition model. It should be noted that when the speech recognition instances are used to recognize the audio signals, the above cache may also be shared, which improves the recognition efficiency of each speech recognition instance.
  • This implementation realizes simultaneous recognition of the speech of multiple users: each speech recognition instance shares the same speech recognition model to recognize its speech signal and shares the same cache to store and retrieve recognition data, so that speech recognition is performed on the at least one audio signal in parallel while the resources required for recognition are shared, improving speech recognition efficiency in multi-user voice interaction scenarios.
  • In addition, because the recognition data is stored, the stored recognition data can be called directly in subsequent recognition without repeated recognition, which greatly saves computing and memory resources.
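  • A minimal sketch of this sharing pattern, with a thread pool, one model object loaded once, and one lock-protected cache; the toy model and scores are invented and only illustrate the structure described above.

```python
# Sketch: several recognition "instances" run in parallel, all sharing one
# loaded model object and one cache. The model here is a trivial stand-in.

from concurrent.futures import ThreadPoolExecutor
from threading import Lock

class ToyModel:
    def score(self, word):
        return len(word) / 10.0           # stand-in for real model computation

model = ToyModel()                         # loaded into memory once
cache, cache_lock = {}, Lock()

def recognition_instance(channel_id, words):
    result = []
    for w in words:
        with cache_lock:                   # the cache is shared by all instances
            if w not in cache:
                cache[w] = model.score(w)
            result.append((w, cache[w]))
    return channel_id, result

channels = {"driver": ["turn", "on", "air", "conditioner"],
            "co-driver": ["close", "the", "window"]}
with ThreadPoolExecutor(max_workers=len(channels)) as pool:
    futures = [pool.submit(recognition_instance, cid, ws)
               for cid, ws in channels.items()]
    for f in futures:
        print(f.result())
```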
  • step 206 may include the following sub-steps:
  • Step 2061 Determine the semantic parsing instance corresponding to each sentence recognition result obtained.
  • the semantic parsing instance can be constructed by code, each semantic parsing instance corresponds to a sentence recognition result of a channel of audio signal, and the semantic parsing instance is used to perform structured parsing on the sentence recognition result.
  • Step 2062 executing each of the determined semantic parsing instances in parallel.
  • each semantic parsing instance may be implemented in a multi-thread manner; or, each semantic parsing instance may be executed by different CPUs, thereby implementing parallel execution.
  • Step 2063 Perform semantic analysis on the corresponding sentence recognition result through each semantic analysis instance.
  • each semantic parsing instance can call the preset rule engine, neural network engine and other modules in parallel to realize parallel parsing of the sentence recognition result.
  • this implementation realizes the simultaneous recognition and parsing of the voices of multiple users, thereby constructing multiple chains that can perform voice interaction at the same time. Moreover, each semantic parsing instance shares a set of semantic resources, which also improves the speech recognition efficiency in multi-user speech interaction scenarios.
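  • The same pattern can be sketched for parsing: several parsing instances run in parallel while sharing one set of semantic resources (here an invented rule table, used only to illustrate the structure):

```python
# Sketch: parallel semantic parsing instances sharing one rule table.

from concurrent.futures import ThreadPoolExecutor

RULES = {"turn on the air conditioner": {"domain": "vehicle control",
                                          "intent": "ac_on"},
         "close the window": {"domain": "vehicle control",
                              "intent": "window_close"}}

def parse(sentence):                       # one semantic parsing instance
    return RULES.get(sentence, {"intent": "unknown"})

sentences = ["turn on the air conditioner", "close the window"]
with ThreadPoolExecutor() as pool:
    print(list(pool.map(parse, sentences)))
```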
  • step 202 may include the following steps:
  • Step 2024 using the acoustic sub-model included in the speech recognition model to determine the syllable set corresponding to at least one audio signal and the first probability score corresponding to the syllables in the syllable set respectively.
  • the acoustic sub-model is used for syllable division of the input audio signal.
  • the acoustic sub-model includes, but is not limited to, a Hidden Markov Model (HMM, Hidden Markov Model), a Gaussian Mixture Model (GMM, Gaussian Mixture Model), and the like.
  • the first probability score is used to characterize the probability that the syllable is correctly divided.
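  • As a toy illustration of a first probability score (invented numbers; the real acoustic sub-model would be an HMM, GMM, or neural network), per-syllable scores can be obtained by normalizing frame-level evidence, for example with a softmax:

```python
# Toy first probability scores: softmax-normalize invented frame-level
# evidence for candidate syllables. Purely illustrative numbers.

import math

frame_evidence = {"kong": 2.1, "kang": 0.3, "keng": -0.5}   # one audio frame

def softmax(scores):
    z = sum(math.exp(v) for v in scores.values())
    return {k: math.exp(v) / z for k, v in scores.items()}

first_probability_scores = softmax(frame_evidence)
print(max(first_probability_scores, key=first_probability_scores.get))  # "kong"
```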
  • Step 2025 using the language sub-model included in the speech recognition model to determine the word set corresponding to at least one audio signal.
  • the language sub-model is used to determine the word set according to the above-mentioned syllable set.
  • the language sub-model may include, but is not limited to, an n-gram language model, a neural network language model, and the like.
  • Step 2026 for a word in the word set, determine whether there is a second probability score corresponding to the word in the cache.
  • the second probability score is used to represent the probability of the recognized word appearing. For example, calculating the probability that "air conditioner” appears after “turn on” is the second probability score corresponding to the word "air conditioner".
  • When the probability score of a certain word needs to be determined, the electronic device first searches the cache for a second probability score of the word; if it does not exist, the language sub-model is used to calculate the second probability score of the word.
  • In other words, the cache is used to store second probability scores previously generated by the language sub-model, so that when such a score is needed again it can be obtained directly from the cache instead of being recomputed.
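  • One compact way to picture this dedicated cache is memoization keyed on a word and its context; `bigram_score` below is an invented stand-in for the language sub-model, not an API from the disclosure.

```python
# Sketch: cache dedicated to second probability scores from the language
# sub-model. bigram_score is an invented stand-in for the real sub-model.

from functools import lru_cache

TOY_BIGRAMS = {("turn on", "air conditioner"): 0.62,
               ("turn on", "window"): 0.07}

@lru_cache(maxsize=None)                  # the cache of second probability scores
def bigram_score(previous_word, word):
    print(f"language sub-model scores ({previous_word!r}, {word!r})")
    return TOY_BIGRAMS.get((previous_word, word), 0.01)

bigram_score("turn on", "air conditioner")   # computed by the sub-model
bigram_score("turn on", "air conditioner")   # served from the cache
```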
  • Step 2027 Determine the first type of recognition result based on the first probability score and the second probability score determined by the language sub-model.
  • each of the first probability scores and each of the second probability scores may be determined as the first type of recognition result.
  • the method provided by the corresponding embodiment of FIG. 5 above uses a cache specially for storing the second probability score generated by the language sub-model, so that the role of the cache is more targeted.
  • the cache is applied to the process of large data processing volume and frequent data access, giving full play to the role of using the cache to save computing resources, reducing redundant data in the cache, and improving the efficiency of speech recognition.
  • step 203 may further include the following steps:
  • Step 2031 for a word in the word set, determine whether there is a second probability score corresponding to the word in the cache.
  • If so, the second probability score in the cache is used as the second probability score of the word that the language sub-model would otherwise determine.
  • Step 2032 based on the first probability score and the second probability score determined from the cache, determine a second type of identification result.
  • each of the first probability scores and the second probability scores determined in the cache may be determined as the second type of identification results.
  • the above-mentioned step 205 may be performed as follows:
  • the target path of the word set is determined in the decoding network included in the speech recognition model.
  • the decoding network is a network constructed based on the above-mentioned word set. Based on this network, an optimal path of a word combination can be found in the network according to the first probability score and the second probability score, and this path is the target path.
  • a sentence composed of words corresponding to the target path may be determined as the sentence recognition result.
  • This implementation uses the first probability score, the second probability score calculated by the language sub-model, and the second probability score extracted from the cache to find the target path in the decoding network, and generate the sentence recognition result. Making full use of the stored second probability score in the cache improves the efficiency of generating the sentence recognition result.
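  • Complementing the Viterbi sketch given earlier, the combination of the two kinds of scores can be pictured as summing them along each candidate path in a small decoding network and keeping the best path; all values below are invented.

```python
# Sketch: rank candidate paths in a toy decoding network by combining first
# (acoustic) and second (language) probability scores. Values are invented.

import itertools, math

acoustic = {"turn on": 0.9, "turn off": 0.4,           # first probability scores
            "air conditioner": 0.8, "window": 0.5}
language = {("turn on", "air conditioner"): 0.62,      # second probability scores
            ("turn on", "window"): 0.07,
            ("turn off", "air conditioner"): 0.05,
            ("turn off", "window"): 0.30}

def path_score(path):
    score = sum(math.log(acoustic[w]) for w in path)
    score += sum(math.log(language[(a, b)]) for a, b in zip(path, path[1:]))
    return score

candidates = itertools.product(["turn on", "turn off"],
                               ["air conditioner", "window"])
print(max(candidates, key=path_score))   # -> ('turn on', 'air conditioner')
```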
  • FIG. 7 a schematic diagram of an application scenario of the voice interaction method of this embodiment is shown.
  • the voice interaction method is applied to the vehicle-mounted voice interaction system.
  • Each of the multiple audio signals corresponds to one interaction chain, including the main driver interaction chain 701, the co-driver interaction chain 702, and other interaction chains 703.
  • The main driver interaction chain 701 is used for the driver to interact with the in-vehicle voice interaction system.
  • The co-driver interaction chain 702 is used for the passenger in the co-driver seat to interact with the in-vehicle voice interaction system.
  • The other interaction chains 703 are used for passengers in other seats to interact with the in-vehicle voice interaction system.
  • the decoding resource 704 includes a speech recognition model 7041 and a cache 7042
  • the semantic resource 705 includes a rule engine 7051 and a neural network engine 7052, wherein the rule engine 7051 is used for parsing the sentence recognition result.
  • In the main driver interaction chain 701, the electronic device generates a speech recognition instance A for the main driver's voice signal, and in the co-driver interaction chain 702 it generates a speech recognition instance B for the co-driver's voice signal.
  • The speech recognition instances share one set of decoding resources 704 and execute in parallel, obtaining sentence recognition result C and sentence recognition result D.
  • The electronic device then constructs a semantic parsing instance E and a semantic parsing instance F, which share one set of semantic resources 705 and respectively parse sentence recognition result C and sentence recognition result D, obtaining structured parsing result G and parsing result H.
  • the electronic device generates an instruction I, an instruction J, etc. based on the analysis result G and the analysis result H.
  • the instruction I is used to turn on the air conditioner; the instruction J is used to close the window.
  • the in-vehicle voice interaction device executes the corresponding function K and function H based on the instruction I and the instruction J.
  • The execution process of the other interaction chains 703 is similar to that of the main driver interaction chain 701 and the co-driver interaction chain 702, and details are not repeated here.
  • FIG. 8 is a schematic structural diagram of a voice interaction apparatus provided by an exemplary embodiment of the present disclosure. This embodiment can be applied to an electronic device. As shown in FIG. 8 , the voice interaction device includes: an acquisition module 801, an identification module 802, a determination module 803, a first generation module 804, a processing module 805, an analysis module 806, and a second generation module Module 807.
  • the acquisition module 801 is used to acquire at least one audio signal;
  • the recognition module 802 is used to recognize at least one audio signal by using a preset speech recognition model, and obtain the first type of recognition result through the speech recognition model;
  • the determination module 803, used to determine the stored identification data about at least one audio signal from the cache;
  • the first generation module 804 is used to generate a second type of identification result based on the stored identification data;
  • the processing module 805 is used to process the first type of recognition result and the second type of recognition result by using the speech recognition model to obtain sentence recognition results respectively corresponding to the at least one audio signal;
  • the parsing module 806 is used to perform semantic analysis on each sentence recognition result respectively, and obtain at least one analysis result;
  • the second generating module 807 is configured to generate an instruction for controlling the voice interaction device to perform a corresponding function based on the at least one parsing result.
  • the acquisition module 801 may acquire at least one audio signal locally or remotely.
  • the above-mentioned at least one audio signal may be a speech signal of at least one passenger in the vehicle collected by at least one microphone installed in the vehicle.
  • the recognition module 802 can use a preset speech recognition model to recognize at least one audio signal, and obtain the first type of recognition result through the speech recognition model.
  • the speech recognition model may be a model obtained by pre-training with a large number of speech signal samples.
  • the speech recognition model is used to recognize the input audio signal and obtain a sentence recognition result.
  • A speech recognition model may include multiple sub-models, for example an acoustic sub-model (for syllable division of the audio signal), a language sub-model (for converting syllables into words), and a decoding network (for selecting the best combination from multiple words to obtain a sentence).
  • During recognition, the recognition module 802 usually first searches the cache for recognition data corresponding to the current processing stage; if no corresponding recognition data exists in the cache, the recognition step is executed and the obtained recognition data is used as the first type of recognition result.
  • The determining module 803 may determine the stored recognition data about the at least one audio signal from the cache. During recognition by the above speech recognition model, the determining module 803 usually first searches the cache for recognition data corresponding to the current processing stage and, if such data exists, extracts it.
  • the first generating module 804 may generate a second type of identification result based on the above-mentioned extracted and stored identification data.
  • As an example, the stored recognition data may be used as the second type of recognition result, or the stored recognition data may be processed to some extent (for example, scaled by a certain proportion, normalized, etc.) to obtain the second type of recognition result.
  • first-type recognition results and second-type recognition results are usually intermediate results obtained during the processing of the speech recognition model, such as probability scores of syllables, probability scores of words, and the like.
  • the processing module 805 can use a speech recognition model to process the first type of recognition result and the second type of recognition result to obtain sentence recognition results corresponding to at least one audio signal respectively.
  • the speech recognition model needs to further process the first type of recognition result and the second type of recognition result.
  • the first type of recognition result and the second type of recognition result may include the probability score of each syllable and the probability score of each word obtained after the audio signal is recognized.
  • For an audio signal, the speech recognition model may use a path search algorithm (for example, the Viterbi algorithm) to determine an optimal path from the plurality of recognized words corresponding to the audio signal, thereby obtaining a sentence as the sentence recognition result.
  • the parsing module 806 may perform semantic parsing on each sentence recognition result to obtain at least one parsing result.
  • each analysis result in the above at least one analysis result corresponds to an audio signal.
  • The parsing results may be structured data. For example, when the sentence recognition result is "set the air conditioner temperature to 25 degrees", the corresponding parsing result is "domain = vehicle control, intent = air conditioner temperature setting, slot = <temperature value = 25>".
  • Existing methods, for example rule engines or neural network engines, may be used for parsing the sentence.
  • the second generation module 807 may generate an instruction for controlling the voice interaction device to perform a corresponding function based on at least one analysis result.
  • The voice interaction device may be the electronic device in which the above voice interaction apparatus is provided, or may be an electronic device communicatively connected to that electronic device.
  • For example, when the voice interaction device is a vehicle-mounted air conditioner and the parsing result is "domain = vehicle control, intent = air conditioner temperature setting, slot = <temperature value = 25>", an instruction for setting the air conditioner to 25°C can be generated.
  • FIG. 9 is a schematic structural diagram of a voice interaction apparatus provided by another exemplary embodiment of the present disclosure.
  • the apparatus further includes: a storage module 808, configured to store the recognition data obtained by the speech recognition model in the recognition process into a cache.
  • the acquisition module 801 includes: a receiving unit 8011 for receiving an initial audio signal collected by an audio collection device; a processing unit 8012 for performing sound source separation processing on the initial audio signal to obtain at least one channel of audio Signal.
  • the recognition module 802 includes: a first determination unit 8021, configured to determine the speech recognition instances corresponding to at least one audio signal respectively; and a first execution unit 8022, configured to execute the determined speech recognition instances in parallel ; Recognition unit 8023 is used to recognize the corresponding audio signal by using the speech recognition model respectively through each speech recognition instance.
  • the parsing module 806 includes: a second determining unit 8061, configured to determine the semantic parsing instances corresponding to each of the obtained statement recognition results; a second execution unit 8062, configured to execute each of the determined Semantic parsing instance; the parsing unit 8063 is configured to perform semantic parsing on the corresponding sentence recognition result through each semantic parsing instance.
  • In some optional implementations, the recognition module 802 includes: a third determination unit 8024, configured to use the acoustic sub-model included in the speech recognition model to determine the syllable set corresponding to the at least one audio signal and the first probability scores respectively corresponding to the syllables in the syllable set; a determination unit configured to use the language sub-model included in the speech recognition model to determine the word set corresponding to the at least one audio signal; a determination unit configured to, for a word in the word set, determine whether there is a second probability score corresponding to the word in the cache and, if not, use the language sub-model to determine the second probability score corresponding to the word; and a sixth determining unit 8027, configured to determine the first type of recognition result based on the first probability scores and the second probability scores determined by the language sub-model.
  • In some optional implementations, the determining module 803 includes: a seventh determining unit 8031, configured to determine, for a word in the word set, whether there is a second probability score corresponding to the word in the cache, and if so, to use the second probability score in the cache as the second probability score of the word determined by the language sub-model; and an eighth determining unit 8032, configured to determine the second type of recognition result based on the first probability scores and the second probability scores determined from the cache.
  • In some optional implementations, the processing module 805 includes: a ninth determination unit 8051, configured to determine the target path of the word set in the decoding network included in the speech recognition model according to the first probability scores and the second probability scores respectively included in the first type of recognition result and the second type of recognition result; and a generating unit 8052, configured to generate, based on the target path, the sentence recognition results respectively corresponding to the at least one audio signal.
  • In the apparatus provided by this embodiment of the present disclosure, the voice interaction apparatus recognizes at least one audio signal by using a preset speech recognition model; during the recognition process, stored recognition data is extracted from the cache to generate part of the recognition results, while the other part of the recognition results is generated by the speech recognition model. This effectively reuses the stored recognition data without requiring the speech recognition model to process the full amount of data, which improves the efficiency of processing the at least one audio signal and helps to meet the requirements of low resource consumption and low processing delay in multi-channel voice interaction scenarios.
  • The electronic device may be either or both of the terminal device 101 and the server 103 shown in FIG. 1, or a stand-alone device independent of them; the stand-alone device may communicate with the terminal device 101 and the server 103 to receive the collected input signals from them.
  • FIG. 10 illustrates a block diagram of an electronic device according to an embodiment of the present disclosure.
  • the electronic device 1000 includes at least one processor 1001 and at least one memory 1002 .
  • Any one of the at least one processor 1001 may be a central processing unit (CPU) or another form of processing unit with data processing capability and/or instruction execution capability, and may control other components of the electronic device 1000 to perform desired functions.
  • Memory 1002 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory.
  • Volatile memory may include, for example, random access memory (Random Access Memory, RAM) and/or cache memory (cache).
  • the non-volatile memory may include, for example, a read-only memory (Read-Only Memory, ROM), a hard disk, a flash memory, and the like.
  • One or more computer program instructions may be stored on the computer-readable storage medium, and the processor 1001 may execute the program instructions to implement the voice interaction method and/or other desired functions of the various embodiments of the present disclosure above.
  • Various contents such as identification data can also be stored in the computer-readable storage medium.
  • the electronic device 1000 may also include an input device 1003 and an output device 1004 interconnected by a bus system and/or other form of connection mechanism (not shown).
  • the input device 1003 may be a device such as a microphone for inputting audio signals.
  • the input device 1003 may be a communication network connector for receiving input audio signals from the terminal device 101 and the server 103 .
  • the output device 1004 can output various information to the outside, including instructions for the voice interaction device to perform corresponding functions, and the like.
  • the output devices 1004 may include, for example, displays, speakers, printers, and communication networks and their connected remote output devices, among others.
  • the electronic device 1000 may also include any other appropriate components according to specific applications.
  • In addition to the above methods and devices, embodiments of the present disclosure may also be a computer program product comprising computer program instructions that, when executed by a processor, cause the processor to perform the steps of the voice interaction method according to various embodiments of the present disclosure described in the "Exemplary Method" section of this specification.
  • The computer program product may include program code for performing the operations of embodiments of the present disclosure, written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
  • The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on a remote computing device or server.
  • Embodiments of the present disclosure may also be a computer-readable storage medium having computer program instructions stored thereon that, when executed by a processor, cause the processor to perform the steps of the voice interaction method according to various embodiments of the present disclosure described in the "Exemplary Method" section of this specification.
  • the computer-readable storage medium may employ any combination of one or more readable media.
  • the readable medium may be a readable signal medium or a readable storage medium.
  • the readable storage medium may include, for example, but not limited to, electrical, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses or devices, or a combination of any of the above. More specific examples (non-exhaustive list) of readable storage media include: electrical connections with one or more wires, portable disks, hard disks, random access memory (RAM), read only memory (ROM), erasable programmable read only memory (EPROM or flash memory), optical fiber, portable compact disk read only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.
  • the methods and apparatus of the present disclosure may be implemented in many ways.
  • the methods and apparatus of the present disclosure may be implemented in software, hardware, firmware, or any combination of software, hardware, and firmware.
  • the above-described order of steps for the method is for illustration only, and the steps of the method of the present disclosure are not limited to the order specifically described above unless specifically stated otherwise.
  • the present disclosure can also be implemented as programs recorded in a recording medium, the programs including machine-readable instructions for implementing methods according to the present disclosure.
  • the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.
  • each component or each step may be decomposed and/or recombined. These disaggregations and/or recombinations should be considered equivalents of the present disclosure.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Artificial Intelligence (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)

Abstract

Disclosed are a voice interaction method and apparatus, a computer-readable storage medium, and an electronic device. The method includes: acquiring at least one audio signal; recognizing the at least one audio signal by using a preset speech recognition model to obtain a first type of recognition result; determining stored recognition data from a cache; generating a second type of recognition result based on the stored recognition data; processing the first type of recognition result and the second type of recognition result by using the speech recognition model to obtain at least one sentence recognition result corresponding to the at least one audio signal; performing semantic parsing on the sentence recognition results to obtain at least one parsing result; and generating, based on the at least one parsing result, an instruction for controlling a voice interaction device to perform a corresponding function. The embodiments of the present disclosure improve the efficiency of processing at least one audio signal and help to meet the requirements of low resource consumption and low processing delay even in multi-channel voice interaction scenarios.

Description

Voice interaction method and apparatus, computer-readable storage medium, and electronic device
This application claims priority to the Chinese patent application filed with the China National Intellectual Property Administration on March 16, 2021, with application number 202110279812.4 and entitled "Voice interaction method and apparatus, computer-readable storage medium, and electronic device", the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to the field of computer technology, and in particular to a voice interaction method and apparatus, a computer-readable storage medium, and an electronic device.
Background
With the continuous progress of artificial intelligence technology, human-computer interaction has also made considerable progress. Intelligent voice interaction technology can be applied to various devices such as automobiles, robots, household appliances, central control systems, access control systems, and ATM machines.
For example, in an in-vehicle voice interaction scenario, the voice interaction system usually receives only one voice signal, which is processed before feedback is given to the user. With the development of artificial intelligence technology, voice interaction systems are developing in a more efficient, intelligent, and personalized direction.
Summary
Embodiments of the present disclosure provide a voice interaction method and apparatus, a computer-readable storage medium, and an electronic device.
An embodiment of the present disclosure provides a voice interaction method, including: acquiring at least one audio signal; recognizing the at least one audio signal by using a preset speech recognition model, and obtaining a first type of recognition result through the speech recognition model; determining, from a cache, stored recognition data about the at least one audio signal; generating a second type of recognition result based on the stored recognition data; processing the first type of recognition result and the second type of recognition result by using the speech recognition model to obtain sentence recognition results respectively corresponding to the at least one audio signal; performing semantic parsing on the sentence recognition results to obtain at least one parsing result; and generating, based on the at least one parsing result, an instruction for controlling a voice interaction device to perform a corresponding function.
According to another aspect of the embodiments of the present disclosure, a voice interaction apparatus is provided, including: an acquisition module for acquiring at least one audio signal; a recognition module for recognizing the at least one audio signal by using a preset speech recognition model and obtaining a first type of recognition result through the speech recognition model; a determination module for determining, from a cache, stored recognition data about the at least one audio signal; a first generation module for generating a second type of recognition result based on the stored recognition data; a processing module for processing the first type of recognition result and the second type of recognition result by using the speech recognition model to obtain sentence recognition results respectively corresponding to the at least one audio signal; a parsing module for performing semantic parsing on each sentence recognition result to obtain at least one parsing result; and a second generation module for generating, based on the at least one parsing result, an instruction for controlling a voice interaction device to perform a corresponding function.
According to another aspect of the embodiments of the present disclosure, a computer-readable storage medium is provided, the computer-readable storage medium storing a computer program for executing the above voice interaction method.
According to another aspect of the embodiments of the present disclosure, an electronic device is provided, including: a processor; and a memory for storing instructions executable by the processor; the processor is configured to read the executable instructions from the memory and execute the instructions to implement the above voice interaction method.
Based on the voice interaction method and apparatus, computer-readable storage medium, and electronic device provided by the above embodiments of the present disclosure, at least one audio signal is recognized by using a preset speech recognition model; during the recognition process, stored recognition data is extracted from the cache to generate part of the recognition results, while the other part of the recognition results is generated by the speech recognition model. This effectively reuses the stored recognition data, removes the need for the speech recognition model to process the full amount of data, improves the efficiency of processing the at least one audio signal, and helps to meet the requirements of low resource consumption and low processing delay even in multi-channel voice interaction scenarios.
The technical solution of the present disclosure is described in further detail below through the accompanying drawings and embodiments.
Brief Description of the Drawings
The above and other objects, features, and advantages of the present disclosure will become more apparent through a more detailed description of the embodiments of the present disclosure in conjunction with the accompanying drawings. The drawings are used to provide a further understanding of the embodiments of the present disclosure and constitute a part of the specification; together with the embodiments of the present disclosure, they are used to explain the present disclosure and do not constitute a limitation of the present disclosure. In the drawings, the same reference numerals generally represent the same components or steps.
FIG. 1 is a system diagram to which the present disclosure is applicable.
FIG. 2 is a schematic flowchart of a voice interaction method provided by an exemplary embodiment of the present disclosure.
FIG. 3 is a schematic flowchart of a voice interaction method provided by another exemplary embodiment of the present disclosure.
FIG. 4 is a schematic flowchart of a voice interaction method provided by another exemplary embodiment of the present disclosure.
FIG. 5 is a schematic flowchart of a voice interaction method provided by another exemplary embodiment of the present disclosure.
FIG. 6 is a schematic flowchart of a voice interaction method provided by another exemplary embodiment of the present disclosure.
FIG. 7 is a schematic diagram of an application scenario of the voice interaction method according to an embodiment of the present disclosure.
FIG. 8 is a schematic structural diagram of a voice interaction apparatus provided by an exemplary embodiment of the present disclosure.
FIG. 9 is a schematic structural diagram of a voice interaction apparatus provided by another exemplary embodiment of the present disclosure.
FIG. 10 is a structural diagram of an electronic device provided by an exemplary embodiment of the present disclosure.
Detailed Description
Exemplary embodiments according to the present disclosure will be described in detail below with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present disclosure rather than all of them, and it should be understood that the present disclosure is not limited by the exemplary embodiments described here.
It should be noted that, unless otherwise specified, the relative arrangement of components and steps, numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present disclosure.
Those skilled in the art can understand that terms such as "first" and "second" in the embodiments of the present disclosure are only used to distinguish different steps, devices, modules, or the like, and represent neither any specific technical meaning nor a necessary logical order between them.
It should also be understood that, in the embodiments of the present disclosure, "a plurality" may refer to two or more, and "at least one" may refer to one, two, or more.
It should also be understood that any component, data, or structure mentioned in the embodiments of the present disclosure may generally be understood as one or more, unless explicitly limited or the context suggests otherwise.
In addition, the term "and/or" in the present disclosure merely describes an association relationship between associated objects, indicating that three relationships may exist; for example, "A and/or B" may mean that A exists alone, that A and B exist at the same time, or that B exists alone. In addition, the character "/" in the present disclosure generally indicates that the associated objects before and after it are in an "or" relationship.
It should also be understood that the description of the embodiments in the present disclosure emphasizes the differences between the embodiments; for the same or similar parts, the embodiments may refer to one another, and for brevity these parts are not repeated one by one.
Meanwhile, it should be understood that, for the convenience of description, the dimensions of the various parts shown in the drawings are not drawn according to actual proportional relationships.
The following description of at least one exemplary embodiment is merely illustrative and is in no way intended to limit the present disclosure or its application or use.
Techniques, methods, and devices known to those of ordinary skill in the relevant fields may not be discussed in detail, but where appropriate, such techniques, methods, and devices should be regarded as part of the specification.
It should be noted that similar reference numerals and letters denote similar items in the following drawings; therefore, once an item is defined in one drawing, it does not need to be discussed further in subsequent drawings.
The embodiments of the present disclosure can be applied to electronic devices such as terminal devices, computer systems, and servers, which can operate with numerous other general-purpose or special-purpose computing system environments or configurations. Examples of well-known terminal devices, computing systems, environments, and/or configurations suitable for use with electronic devices such as terminal devices, computer systems, and servers include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, handheld or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, network personal computers, minicomputer systems, mainframe computer systems, and distributed cloud computing environments including any of the above systems, and the like.
Electronic devices such as terminal devices, computer systems, and servers may be described in the general context of computer-system-executable instructions (such as program modules) executed by a computer system. Generally, program modules may include routines, programs, object programs, components, logic, data structures, and the like, which perform specific tasks or implement specific abstract data types. The computer system/server may be implemented in a distributed cloud computing environment in which tasks are performed by remote processing devices linked through a communication network. In a distributed cloud computing environment, program modules may be located on local or remote computing system storage media including storage devices.
Application Overview
Current voice interaction technology can usually process only one speech signal at a time and cannot process multiple speech signals simultaneously, and therefore cannot meet the needs of multi-user, personalized speech recognition. The technical solution of the present disclosure therefore applies voice interaction technology to multi-channel speech recognition scenarios.
At present, a speech recognition model needs to process the full amount of data of a speech signal, which results in low speech recognition efficiency and large interaction latency; in particular, in multi-channel speech recognition scenarios, the requirements of multiple users for an efficient, personalized voice interaction system cannot be met.
Exemplary System
Fig. 1 shows an exemplary system architecture 100 to which the voice interaction method or voice interaction apparatus of embodiments of the present disclosure can be applied.
As shown in Fig. 1, the system architecture 100 may include a terminal device 101, a network 102, and a server 103. The network 102 is a medium for providing a communication link between the terminal device 101 and the server 103, and includes, but is not limited to, various connection types such as wired or wireless communication links, or fiber-optic cables.
A user may use the terminal device 101 to interact with the server 103 via the network 102 to receive or send messages and the like. Various communication client applications may be installed on the terminal device 101, such as speech recognition applications, multimedia applications, search applications, web browser applications, shopping applications, and instant messaging tools.
The terminal device 101 may be an electronic device including, but not limited to, mobile terminals such as in-vehicle terminals, mobile phones, laptop computers, digital broadcast receivers, PDAs (Personal Digital Assistants), PADs (tablet computers), and PMPs (Portable Media Players), as well as fixed terminals such as digital TVs, desktop computers, and smart home appliances.
The server 103 may be a device that provides various service functions, for example a back-end speech recognition server that recognizes audio signals uploaded by the terminal device 101. The back-end speech recognition server may process the received audio to obtain an instruction for controlling a voice interaction device, and feed the instruction back to the terminal device 101.
It should be noted that the voice interaction method provided by the embodiments of the present disclosure may be executed by the server 103 or by the terminal device 101; correspondingly, the voice interaction apparatus may be provided in the server 103 or in the terminal device 101.
It should be understood that the numbers of terminal devices 101, networks 102, and servers 103 in Fig. 1 are merely examples. Any number of terminal devices, networks, and/or servers may be configured according to implementation needs, and this application does not limit this. In addition, when the audio signal does not need to be acquired remotely, the above system architecture may not include the network 102 and may include only the server or the terminal device. For example, when the terminal device 101 and the server 103 are connected in a wired manner, the network 102 may be omitted.
Exemplary Method
Fig. 2 is a schematic flowchart of a voice interaction method provided by an exemplary embodiment of the present disclosure. The method of this embodiment can be applied to an electronic device (such as the terminal device 101 or the server 103 shown in Fig. 1). As shown in Fig. 2, the method includes the following steps:
Step 201: acquire at least one audio signal.
In this embodiment, the electronic device may acquire at least one audio signal locally or remotely. As an example, when this embodiment is applied to an in-vehicle speech recognition scenario, the at least one audio signal may be the speech signals of at least one passenger in the vehicle collected by at least one microphone installed in the vehicle.
Step 202: recognize the at least one audio signal by using a preset speech recognition model, and obtain first-type recognition results through the speech recognition model.
In this embodiment, the electronic device may recognize the at least one audio signal by using the preset speech recognition model, and obtain the first-type recognition results through the preset speech recognition model during the recognition process. The preset speech recognition model may be a model trained in advance on a large number of speech signal samples, and is used to recognize the input at least one audio signal to obtain at least one sentence recognition result.
Generally, the preset speech recognition model may include a plurality of sub-models, for example an acoustic sub-model, a language sub-model, and a decoding-network sub-model. Further, the acoustic sub-model is used to divide an audio signal into syllables; the language sub-model is used to convert the syllables into words; and the decoding-network sub-model is used to select the optimal combination from a plurality of words to obtain a sentence.
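Purely as an illustration of this three-stage structure, and not as part of the claimed subject matter, the stages could be organized as in the following Python sketch; the class name SpeechRecognitionModel, the callables passed to it, and the toy data are assumptions made only for exposition.

    from dataclasses import dataclass

    @dataclass
    class RecognitionResult:
        sentence: str
        score: float

    class SpeechRecognitionModel:
        """Toy three-stage pipeline: acoustic -> language -> decoding network."""

        def __init__(self, acoustic_fn, language_fn, decoder_fn):
            self.acoustic_fn = acoustic_fn   # audio -> (syllables, first probability scores)
            self.language_fn = language_fn   # syllables -> candidate words
            self.decoder_fn = decoder_fn     # (words, scores) -> (sentence, score)

        def recognize(self, audio):
            syllables, first_scores = self.acoustic_fn(audio)
            words = self.language_fn(syllables)
            sentence, score = self.decoder_fn(words, first_scores)
            return RecognitionResult(sentence, score)

    # Toy stand-ins so the sketch runs end to end:
    model = SpeechRecognitionModel(
        acoustic_fn=lambda audio: (["da", "kai", "kong", "tiao"], [0.9, 0.8, 0.9, 0.7]),
        language_fn=lambda syllables: ["打开", "空调"],
        decoder_fn=lambda words, scores: (" ".join(words), sum(scores)),
    )
    print(model.recognize(b"fake-pcm-bytes"))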
In the process of recognizing the at least one audio signal by using the preset speech recognition model in step 202, the electronic device usually first checks the cache for recognition data corresponding to the current processing stage. If no corresponding recognition data exists in the cache, step 202 is executed to obtain recognition data, and the obtained recognition data is taken as a first-type recognition result.
Step 203: determine, from the cache, stored recognition data about the at least one audio signal.
In this embodiment, the electronic device may determine the stored recognition data about the at least one audio signal from the cache. Generally, during recognition by the speech recognition model, the electronic device first checks the cache for recognition data corresponding to the current processing stage, and extracts that data if it exists.
Step 204: generate second-type recognition results based on the stored recognition data.
In this embodiment, the electronic device may generate the second-type recognition results based on the stored recognition data extracted in step 203. As an example, the stored recognition data may be used directly as the second-type recognition results, or the second-type recognition results may be obtained after certain processing of the stored recognition data, where the processing includes proportional scaling, normalization, and the like.
It should be noted that the first-type recognition results and the second-type recognition results are usually intermediate results obtained during processing by the speech recognition model, such as probability scores of syllables and probability scores of words.
Step 205: process the first-type recognition results and the second-type recognition results by using the speech recognition model to obtain sentence recognition results respectively corresponding to the at least one audio signal.
In this embodiment, the electronic device may use the speech recognition model to process the first-type recognition results and the second-type recognition results to obtain the sentence recognition results respectively corresponding to the at least one audio signal. Generally, since both the first-type recognition results and the second-type recognition results are intermediate results produced by the speech recognition model, the speech recognition model needs to process them further.
As an example, the first-type recognition results and the second-type recognition results may include the probability score of each syllable and the probability score of each word obtained by recognizing the audio signal. For one audio signal, the speech recognition model may use a path search algorithm (for example, the Viterbi algorithm) to determine an optimal path among the multiple words recognized from the audio signal, and obtain a sentence from the optimal path as a sentence recognition result. One audio signal may correspond to one sentence recognition result, so multiple audio signals correspond to multiple sentence recognition results.
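Only to illustrate what such a path search might look like, and not as the decoding network of the present disclosure, the following sketch runs a Viterbi-style dynamic program over an invented two-step word lattice; the lattice contents, the transition_score table, and all scores are assumptions for this example.

    import math

    # Each time step holds candidate words with their acoustic (first-type) scores;
    # transition_score stands in for the language-model (second-type) score.
    lattice = [
        [("打开", 0.9), ("大开", 0.4)],
        [("空调", 0.8), ("空跳", 0.3)],
    ]

    def transition_score(prev_word, word):
        bigrams = {("打开", "空调"): 0.7}       # toy bigram table
        return bigrams.get((prev_word, word), 0.05)

    def best_path(lattice):
        # Keep, for each candidate word, the best log-score of a path ending in it.
        prev = {w: (math.log(s), [w]) for w, s in lattice[0]}
        for step in lattice[1:]:
            cur = {}
            for word, acoustic in step:
                best = max(
                    (score + math.log(acoustic) + math.log(transition_score(pw, word)), path)
                    for pw, (score, path) in prev.items()
                )
                cur[word] = (best[0], best[1] + [word])
            prev = cur
        score, path = max(prev.values())
        return path, score

    print(best_path(lattice))   # (['打开', '空调'], best log-score)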
Step 206: perform semantic parsing on each sentence recognition result to obtain at least one parsing result.
In this embodiment, the electronic device may perform semantic parsing on each of the at least one sentence recognition result to obtain at least one parsing result, where each of the at least one parsing result corresponds to one audio signal. The at least one parsing result may be structured data. For example, when the sentence recognition result is "set the air conditioner temperature to 25 degrees", the corresponding parsing result is "domain = vehicle control, intent = air conditioner temperature setting, slot = <temperature value = 25>".
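As an illustration of such structured output only (the regular expression, field names, and the parse function below are assumptions made for this example and are not the rule engine of the present disclosure), a minimal rule-based parser might be sketched as:

    import re
    from typing import Optional

    def parse(sentence: str) -> Optional[dict]:
        # Toy rule: match "空调温度设为<N>度" and emit a domain/intent/slot structure.
        m = re.search(r"空调温度设为(\d+)度", sentence)
        if m:
            return {
                "domain": "车控",
                "intent": "空调温度设置",
                "slots": {"温度值": int(m.group(1))},
            }
        return None

    print(parse("空调温度设为25度"))
    # {'domain': '车控', 'intent': '空调温度设置', 'slots': {'温度值': 25}}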
It should be noted that methods such as a rule engine or a neural network engine may be used to perform semantic parsing on the sentence recognition results.
Step 207: generate, based on the at least one parsing result, an instruction for controlling a voice interaction device to execute a corresponding function.
In this embodiment, the electronic device may generate, based on the at least one parsing result, an instruction for controlling a voice interaction device to execute a corresponding function. The voice interaction device may be the electronic device that executes the voice interaction method of the present disclosure, or an electronic device communicatively connected to that electronic device. As an example, when the voice interaction device is an in-vehicle air conditioner, if the parsing result is "domain = vehicle control, intent = air conditioner temperature setting, slot = <temperature value = 25>", an instruction may be generated for setting the in-vehicle air conditioner to a preset temperature, the preset temperature being 25°C.
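Continuing the same purely illustrative assumptions, the mapping from a parsing result to a device instruction might be sketched as a simple dispatch; the instruction payload format shown here is invented for the example and is not prescribed by the present disclosure.

    def to_instruction(parse_result: dict) -> dict:
        # Toy dispatch: route by (domain, intent) to a device command payload.
        if parse_result["domain"] == "车控" and parse_result["intent"] == "空调温度设置":
            return {
                "device": "air_conditioner",
                "action": "set_temperature",
                "value_celsius": parse_result["slots"]["温度值"],
            }
        raise ValueError("unsupported parse result")

    print(to_instruction({
        "domain": "车控",
        "intent": "空调温度设置",
        "slots": {"温度值": 25},
    }))
    # {'device': 'air_conditioner', 'action': 'set_temperature', 'value_celsius': 25}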
The method provided by the embodiments of the present disclosure recognizes at least one audio signal by using a preset speech recognition model; during recognition, stored recognition data is extracted from the cache to generate part of the recognition results, while the remaining recognition results are generated by the speech recognition model. The stored recognition data is thus effectively reused, the speech recognition model does not need to process the full amount of data, and the efficiency of processing the at least one audio signal is improved, which helps meet the requirements of low resource consumption and low processing latency of the electronic device in multi-channel voice interaction scenarios.
In some optional implementations, the electronic device may also store, in the cache, the recognition data obtained by the preset speech recognition model during recognition. Specifically, when the cache does not contain the recognition data corresponding to a recognition step, the speech recognition model needs to execute that recognition step, and the obtained recognition data is stored in the cache so that it can be reused later.
By storing the recognition data obtained by the speech recognition model during recognition into the cache, this implementation enables reuse of recognition data and keeps the recognition data in the cache continuously updated. In addition, using more of the stored recognition data during model recognition further improves the efficiency of speech recognition.
In some optional implementations, step 201 is specifically executed as follows:
First, an initial audio signal collected by an audio collection device is received.
There may be one or more audio collection devices, which are used to collect at least one initial audio signal. The initial audio signal may be a signal obtained by the audio collection device collecting the speech of at least one user. For example, a plurality of audio collection devices are configured, each installed around a seat in the vehicle and used to collect the speech of the passenger in that seat; the audio signal collected in this case usually contains the mixed speech signals of multiple users.
Second, sound source separation is performed on the initial audio signal to obtain the at least one audio signal.
The sound source separation may use existing techniques, for example a Blind Source Separation (BSS) algorithm, to separate the speech signals of multiple users, so that each obtained audio signal corresponds to one user. In an in-vehicle voice interaction scenario, the sound source separation may also associate each obtained audio signal with the corresponding audio collection device; since each audio collection device is installed near a corresponding seat, each obtained audio signal can be associated with that seat. Separating the speech signals of multiple users through sound source separation and establishing a one-to-one correspondence with the different audio collection devices can be implemented with existing methods, which are not described in detail in this embodiment.
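Only as one possible illustration of such a separation step (independent component analysis via scikit-learn's FastICA is used here merely as a stand-in; the present disclosure does not prescribe any particular blind source separation algorithm, and the synthetic signals are invented for the example):

    import numpy as np
    from sklearn.decomposition import FastICA

    # Toy mixture: two microphones recording two simultaneous "speakers".
    t = np.linspace(0, 1, 16000)
    sources = np.stack([np.sin(2 * np.pi * 200 * t),           # speaker 1
                        np.sign(np.sin(2 * np.pi * 120 * t))], # speaker 2
                       axis=1)
    mixing = np.array([[1.0, 0.6], [0.4, 1.0]])                # room/microphone mixing
    mixed = sources @ mixing.T                                  # shape (samples, mics)

    # Separate the mixture back into per-user channels.
    ica = FastICA(n_components=2, random_state=0)
    separated = ica.fit_transform(mixed)    # shape (16000, 2): one channel per user
    print(separated.shape)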
By performing sound source separation on the initial audio signal, this implementation can separate the speech of multiple users, so that each subsequent speech recognition result corresponds to the respective user, thereby improving the accuracy of voice interaction for multiple users.
In some optional implementations, as shown in Fig. 3, step 202 may include the following sub-steps:
Step 2021: determine the speech recognition instance corresponding to each audio signal.
A speech recognition instance may be constructed in code; each speech recognition instance corresponds to one audio signal and is used to recognize the audio signal corresponding to it.
Step 2022: execute the determined speech recognition instances in parallel.
As an example, the speech recognition instances may be executed in parallel by means of multithreading; alternatively, the speech recognition instances may be executed by different CPUs, thereby achieving parallel execution.
Step 2023: recognize the corresponding audio signal through each speech recognition instance by using the preset speech recognition model.
Specifically, the speech recognition instances may call the preset speech recognition model in parallel and separately to recognize the corresponding speech signals, thereby achieving parallel recognition of the audio signals. Generally, when recognizing the at least one audio signal, the preset speech recognition model may first be loaded into memory, and the speech recognition instances share the preset speech recognition model. It should be noted that when the speech recognition instances recognize the audio signals, they may share the above cache, thereby improving the recognition efficiency of each speech recognition instance.
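As a rough sketch of this sharing arrangement only (the thread-based parallelism, the ToyModel stand-in, and the dictionary cache are assumptions made for illustration, not the implementation of the present disclosure), the instances might be wired up as follows:

    from concurrent.futures import ThreadPoolExecutor
    from threading import Lock

    class ToyModel:
        """Stand-in for the shared, preloaded speech recognition model."""
        def recognize(self, audio):
            return f"sentence for {audio!r}"

    shared_model = ToyModel()   # loaded into memory once, shared by all instances
    shared_cache = {}           # recognition data reused across instances
    cache_lock = Lock()

    def recognition_instance(channel_id, audio):
        # One instance per audio channel; all instances share the model and the cache.
        with cache_lock:
            if audio in shared_cache:
                return channel_id, shared_cache[audio]
        result = shared_model.recognize(audio)
        with cache_lock:
            shared_cache[audio] = result
        return channel_id, result

    channels = {"driver": b"pcm-1", "front_passenger": b"pcm-2"}
    with ThreadPoolExecutor(max_workers=len(channels)) as pool:
        results = list(pool.map(lambda kv: recognition_instance(*kv), channels.items()))
    print(results)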
By constructing a speech recognition instance corresponding to each audio signal and executing the speech recognition instances in parallel, this implementation enables simultaneous recognition of the speech of multiple users. At the same time, the speech recognition instances share one speech recognition model to recognize the speech signals and share the same cache to store and retrieve recognition data, which realizes parallel speech recognition of the at least one audio signal and sharing of the resources required for recognition, and improves the efficiency of speech recognition in multi-user voice interaction scenarios. Since the shared cache stores recognition data that has already been obtained, the stored recognition data can be retrieved directly during subsequent recognition without repeated recognition, which greatly saves memory resources.
In some optional implementations, as shown in Fig. 4, step 206 may include the following sub-steps:
Step 2061: determine the semantic parsing instance corresponding to each obtained sentence recognition result.
A semantic parsing instance may be constructed in code; each semantic parsing instance corresponds to one sentence recognition result of one audio signal and is used to perform structured parsing on that sentence recognition result.
Step 2062: execute the determined semantic parsing instances in parallel.
As an example, the semantic parsing instances may be executed in parallel by means of multithreading; alternatively, the semantic parsing instances may be executed by different CPUs, thereby achieving parallel execution.
Step 2063: perform semantic parsing on the corresponding sentence recognition result through each semantic parsing instance.
Specifically, the semantic parsing instances may call preconfigured modules such as a rule engine or a neural network engine in parallel, so as to parse the sentence recognition results in parallel.
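As with recognition, a thread pool is one possible way to run the parsing instances side by side; the sketch below reuses the illustrative parse function defined in the earlier example and, like it, is only an assumption about how this could be wired up.

    from concurrent.futures import ThreadPoolExecutor

    sentence_results = ["空调温度设为25度", "空调温度设为22度"]

    # One parsing instance per sentence result, all sharing the same semantic resources.
    with ThreadPoolExecutor(max_workers=len(sentence_results)) as pool:
        parse_results = list(pool.map(parse, sentence_results))
    print(parse_results)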
By constructing a semantic parsing instance corresponding to each sentence recognition result and executing the semantic parsing instances in parallel, this implementation enables simultaneous recognition and parsing of the speech of multiple users, thereby building multiple links that can carry out voice interaction at the same time; moreover, since the semantic parsing instances share one set of semantic resources, the efficiency of speech recognition in multi-user voice interaction scenarios is further improved.
Further referring to Fig. 5, a schematic flowchart of yet another embodiment of the voice interaction method is shown. As shown in Fig. 5, on the basis of the embodiment shown in Fig. 2, step 202 may include the following steps:
Step 2024: determine, by using the acoustic sub-model included in the speech recognition model, the syllable sets respectively corresponding to the at least one audio signal and the first probability scores respectively corresponding to the syllables in the syllable sets.
The acoustic sub-model is used to divide the input audio signal into syllables. As examples, the acoustic sub-model includes, but is not limited to, a Hidden Markov Model (HMM), a Gaussian Mixture Model (GMM), and the like. The first probability score is used to characterize the probability that a syllable is correctly divided.
Step 2025: determine, by using the language sub-model included in the speech recognition model, the word sets respectively corresponding to the at least one audio signal.
The language sub-model is used to determine the word set from the syllable set. As examples, the language sub-model may include, but is not limited to, an n-gram language model, a neural network language model, and the like.
Step 2026: for a word in the word set, determine whether a second probability score corresponding to the word exists in the cache.
If it does not exist, the second probability score corresponding to the word is determined by using the language sub-model. The second probability score is used to characterize the probability of occurrence of the recognized word. For example, the probability that "空调" (air conditioner) appears after "打开" (turn on) is the second probability score corresponding to the word "空调".
When the probability score of a word needs to be determined, the electronic device first checks the cache for the second probability score of the current word; if it does not exist, the language sub-model is used to compute the second probability score of the word. In this embodiment, since the language sub-model involves a large amount of data processing, the cache is used to store the second probability scores generated by the language sub-model in advance in order to save processing overhead, so that the second probability scores can be obtained directly from the cache when they are used.
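Only for illustration, the cache-before-compute behaviour described above might look like the following sketch; the bigram_probability function and the dictionary cache are stand-ins invented for this example, not the language sub-model or cache implementation of the present disclosure.

    second_score_cache = {}

    def bigram_probability(prev_word, word):
        # Stand-in for an expensive language sub-model query.
        toy_table = {("打开", "空调"): 0.7}
        return toy_table.get((prev_word, word), 0.01)

    def second_score(prev_word, word):
        key = (prev_word, word)
        if key in second_score_cache:                   # cache hit: reuse stored score
            return second_score_cache[key]
        score = bigram_probability(prev_word, word)     # cache miss: compute with the model
        second_score_cache[key] = score                 # store for later reuse
        return score

    print(second_score("打开", "空调"))   # computed, then cached
    print(second_score("打开", "空调"))   # served directly from the cache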
Step 2027: determine the first-type recognition results based on the first probability scores and the second probability scores determined by the language sub-model.
As an example, the first probability scores and the second probability scores may be determined as the first-type recognition results.
Since the language sub-model involves a large amount of data processing, the method provided by the embodiment corresponding to Fig. 5 uses the cache specifically to store the second probability scores generated by the language sub-model, which makes the use of the cache more targeted: the cache is applied to the process with a large amount of data processing and frequent data access, giving full play to the role of the cache in saving computing resources, reducing redundant data in the cache, and improving the efficiency of speech recognition.
Further referring to Fig. 6, a schematic flowchart of yet another embodiment of the voice interaction method is shown. As shown in Fig. 6, on the basis of the embodiment shown in Fig. 5, step 203 may further include the following steps:
Step 2031: for a word in the word set, determine whether a second probability score corresponding to the word exists in the cache.
If it exists, the second probability score in the cache is determined as the second probability score of the word.
As an example, when the probability that "空调" appears after "打开" needs to be calculated, the cache is first checked, before calculation, for a previously computed and stored second probability score corresponding to "空调". If it has been stored, it is taken directly from the cache, thereby avoiding repeated calculation; if it has not been stored, it cannot be taken directly from the cache and needs to be calculated anew.
Step 2032: determine the second-type recognition results based on the first probability scores and the second probability scores determined from the cache.
As an example, the first probability scores and the second probability scores determined from the cache may be determined as the second-type recognition results.
In the method provided by the embodiment corresponding to Fig. 6, when determining the second probability score corresponding to a word, the second probability score is first looked up in the cache, and the found second probability score is determined as the second probability score of the word. This reduces the amount of computation of the language sub-model in a more targeted manner, reduces the memory resources occupied by the recognition process of the language sub-model, and further improves the efficiency of speech recognition.
In some optional implementations, based on the embodiment corresponding to Fig. 5 or Fig. 6, step 205 may be executed as follows:
First, a target path of the word set is determined in the decoding network included in the speech recognition model according to the first probability scores and the second probability scores respectively included in the first-type recognition results and the second-type recognition results.
The decoding network is a network constructed based on the word set; based on this network, an optimal path of word combinations can be found according to the first probability scores and the second probability scores, and this path is the target path.
It should be noted that determining the optimal path according to the probability scores of syllables and the probability scores of words is an existing technique and is not described in detail here.
Then, the sentence recognition results respectively corresponding to the at least one audio signal are generated based on the target path.
Specifically, the sentence composed of the words corresponding to the target path may be determined as a sentence recognition result.
By using the first probability scores, the second probability scores computed by the language sub-model, and the second probability scores extracted from the cache to find the target path in the decoding network and generate the sentence recognition results, this implementation makes full use of the second probability scores already stored in the cache during decoding and improves the efficiency of generating the sentence recognition results.
Referring to Fig. 7, a schematic diagram of an application scenario of the voice interaction method of this embodiment is shown. In the application scenario of Fig. 7, the voice interaction method is applied to an in-vehicle voice interaction system.
As shown in Fig. 7, each of the multiple audio signals corresponds to one interaction chain, including a driver interaction chain 701, a front-passenger interaction chain 702, and other interaction chains 703. The driver interaction chain 701 is used for the driver to interact with the in-vehicle voice interaction system; the front-passenger interaction chain 702 is used for the passenger in the front passenger seat to interact with the in-vehicle voice interaction system; and the other interaction chains 703 are used for passengers in other seats to interact with the in-vehicle voice interaction system.
In addition, the decoding resources 704 include a speech recognition model 7041 and a cache 7042, and the semantic resources 705 include a rule engine 7051 and a neural network engine 7052, where the rule engine 7051 is used to parse sentence recognition results. As can be seen from Fig. 7, the electronic device generates speech recognition instance A for the driver's speech signal in the driver interaction chain 701 and speech recognition instance B for the front passenger's speech; the speech recognition instances share one set of decoding resources 704 and are executed in parallel, obtaining sentence recognition result C and sentence recognition result D.
The electronic device then constructs semantic instance E and semantic instance F, which share one set of semantic resources and parse sentence recognition result C and sentence recognition result D respectively, obtaining structured parsing result G and parsing result H.
The electronic device then generates instruction I, instruction J, and the like based on parsing result G and parsing result H; for example, instruction I is used to turn on the air conditioner and instruction J is used to close the window. The in-vehicle voice interaction device executes the corresponding functions based on instruction I and instruction J. Similarly, the execution process of the other interaction chains 703 is similar to that of the driver interaction chain 701 and the front-passenger interaction chain 702, and is not repeated here.
Exemplary Apparatus
Fig. 8 is a schematic structural diagram of a voice interaction apparatus provided by an exemplary embodiment of the present disclosure. This embodiment can be applied to an electronic device. As shown in Fig. 8, the voice interaction apparatus includes: an acquisition module 801, a recognition module 802, a determination module 803, a first generation module 804, a processing module 805, a parsing module 806, and a second generation module 807.
The acquisition module 801 is configured to acquire at least one audio signal; the recognition module 802 is configured to recognize the at least one audio signal by using a preset speech recognition model and obtain first-type recognition results through the speech recognition model; the determination module 803 is configured to determine, from a cache, stored recognition data about the at least one audio signal; the first generation module 804 is configured to generate second-type recognition results based on the stored recognition data; the processing module 805 is configured to process the first-type recognition results and the second-type recognition results by using the speech recognition model to obtain sentence recognition results respectively corresponding to the at least one audio signal; the parsing module 806 is configured to perform semantic parsing on each sentence recognition result to obtain at least one parsing result; and the second generation module 807 is configured to generate, based on the at least one parsing result, an instruction for controlling a voice interaction device to execute a corresponding function.
In this embodiment, the acquisition module 801 may acquire at least one audio signal locally or remotely. As an example, when this embodiment is applied to an in-vehicle speech recognition scenario, the at least one audio signal may be the speech signals of at least one passenger in the vehicle collected by at least one microphone installed in the vehicle.
In this embodiment, the recognition module 802 may recognize the at least one audio signal by using the preset speech recognition model and obtain the first-type recognition results through the speech recognition model. The speech recognition model may be a model trained in advance on a large number of speech signal samples, and is used to recognize an input audio signal to obtain a sentence recognition result.
Generally, the speech recognition model may include a plurality of sub-models, for example an acoustic sub-model (used to divide the audio signal into syllables), a language sub-model (used to convert the syllables into words), and a decoding network (used to select the optimal combination from a plurality of words to obtain a sentence).
During recognition by the speech recognition model, the recognition module 802 usually first checks the cache for the recognition data corresponding to the current processing stage; if no corresponding recognition data exists in the cache, the speech recognition model is used for recognition, and the obtained recognition data is taken as a first-type recognition result.
In this embodiment, the determination module 803 may determine the stored recognition data about the at least one audio signal from the cache. Generally, during recognition by the speech recognition model, the determination module 803 first checks the cache for the recognition data corresponding to the current processing stage; if corresponding recognition data exists in the cache, that data is extracted.
In this embodiment, the first generation module 804 may generate the second-type recognition results based on the extracted stored recognition data. As an example, the stored recognition data may be used as the second-type recognition results, or the second-type recognition results may be obtained after certain processing of the stored recognition data (for example, proportional scaling, normalization, and the like).
It should be noted that the first-type recognition results and the second-type recognition results are usually intermediate results obtained during processing by the speech recognition model, such as probability scores of syllables and probability scores of words.
In this embodiment, the processing module 805 may use the speech recognition model to process the first-type recognition results and the second-type recognition results to obtain the sentence recognition results respectively corresponding to the at least one audio signal. Generally, since the first-type recognition results and the second-type recognition results are intermediate results produced by the speech recognition model, the speech recognition model needs to process them further. As an example, the first-type recognition results and the second-type recognition results may include the probability score of each syllable and the probability score of each word obtained by recognizing the audio signal; for one audio signal, the speech recognition model may use a path search algorithm (for example, the Viterbi algorithm) to determine an optimal path among the multiple words recognized from the audio signal, thereby obtaining a sentence as the sentence recognition result.
In this embodiment, the parsing module 806 may perform semantic parsing on each sentence recognition result to obtain at least one parsing result, where each of the at least one parsing result corresponds to one audio signal. Generally, the parsing result may be structured data. For example, if the sentence recognition result is "set the air conditioner temperature to 25 degrees", the parsing result is "domain = vehicle control, intent = air conditioner temperature setting, slot = <temperature value = 25>".
It should be noted that existing techniques, such as a rule engine or a neural network engine, may be used to parse the sentences.
In this embodiment, the second generation module 807 may generate, based on the at least one parsing result, an instruction for controlling a voice interaction device to execute a corresponding function. The voice interaction device may be the electronic device provided with the above voice interaction apparatus, or an electronic device communicatively connected to that electronic device. As an example, when the voice interaction device is an in-vehicle air conditioner, if the parsing result is "domain = vehicle control, intent = air conditioner temperature setting, slot = <temperature value = 25>", an instruction may be generated for setting the in-vehicle air conditioner to 25 degrees Celsius.
Referring to Fig. 9, Fig. 9 is a schematic structural diagram of a voice interaction apparatus provided by another exemplary embodiment of the present disclosure.
In some optional implementations, the apparatus further includes a storage module 808 configured to store, in the cache, the recognition data obtained by the speech recognition model during recognition.
In some optional implementations, the acquisition module 801 includes: a receiving unit 8011 configured to receive an initial audio signal collected by an audio collection device; and a processing unit 8012 configured to perform sound source separation on the initial audio signal to obtain the at least one audio signal.
In some optional implementations, the recognition module 802 includes: a first determination unit 8021 configured to determine the speech recognition instances respectively corresponding to the at least one audio signal; a first execution unit 8022 configured to execute the determined speech recognition instances in parallel; and a recognition unit 8023 configured to recognize the corresponding audio signal through each speech recognition instance by using the speech recognition model.
In some optional implementations, the parsing module 806 includes: a second determination unit 8061 configured to determine the semantic parsing instances respectively corresponding to the obtained sentence recognition results; a second execution unit 8062 configured to execute the determined semantic parsing instances in parallel; and a parsing unit 8063 configured to perform semantic parsing on the corresponding sentence recognition result through each semantic parsing instance.
In some optional implementations, the recognition module 802 includes: a third determination unit 8024 configured to determine, by using the acoustic sub-model included in the speech recognition model, the syllable sets respectively corresponding to the at least one audio signal and the first probability scores respectively corresponding to the syllables in the syllable sets; a fourth determination unit 8025 configured to determine, by using the language sub-model included in the speech recognition model, the word sets respectively corresponding to the at least one audio signal; a fifth determination unit 8026 configured to determine, for a word in the word set, whether a second probability score corresponding to the word exists in the cache, and if not, determine the second probability score corresponding to the word by using the language sub-model; and a sixth determination unit 8027 configured to determine the first-type recognition results based on the first probability scores and the second probability scores determined by the language sub-model.
In some optional implementations, the determination module 803 includes: a seventh determination unit 8031 configured to determine, for a word in the word set, whether a second probability score corresponding to the word exists in the cache, and if so, determine the second probability score in the cache as the second probability score of the word; and an eighth determination unit 8032 configured to determine the second-type recognition results based on the first probability scores and the second probability scores determined from the cache.
In some optional implementations, the processing module 805 includes: a ninth determination unit 8051 configured to determine, in the decoding network included in the speech recognition model, a target path of the word set according to the first probability scores and the second probability scores respectively included in the first-type recognition results and the second-type recognition results; and a generation unit 8052 configured to generate, based on the target path, the sentence recognition results respectively corresponding to the at least one audio signal.
The voice interaction apparatus provided by the above embodiments of the present disclosure recognizes at least one audio signal by using a preset speech recognition model; during recognition, stored recognition data is extracted from the cache to generate part of the recognition results, while the remaining recognition results are generated by the speech recognition model. The stored recognition data is thus effectively reused, the speech recognition model does not need to process the full amount of data, the efficiency of processing the at least one audio signal is improved, and the requirements of low resource consumption and low processing latency can still be met in multi-channel voice interaction scenarios.
Exemplary Electronic Device
An electronic device according to an embodiment of the present disclosure is described below with reference to Fig. 10. The electronic device may be either or both of the terminal device 101 and the server 103 shown in Fig. 1, or a stand-alone device independent of them; the stand-alone device may communicate with the terminal device 101 and the server 103 to receive the collected input signals from them.
Fig. 10 illustrates a block diagram of an electronic device according to an embodiment of the present disclosure.
As shown in Fig. 10, the electronic device 1000 includes at least one processor 1001 and at least one memory 1002.
Any one of the at least one processor 1001 may be a central processing unit (CPU) or another form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device 1000 to perform desired functions.
The memory 1002 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, random access memory (RAM) and/or a cache. The non-volatile memory may include, for example, read-only memory (ROM), a hard disk, a flash memory, and the like. One or more computer program instructions may be stored on the computer-readable storage medium, and the processor 1001 may run the program instructions to implement the voice interaction methods of the various embodiments of the present disclosure described above and/or other desired functions. Various contents such as recognition data may also be stored in the computer-readable storage medium.
In one example, the electronic device 1000 may further include an input device 1003 and an output device 1004, which are interconnected by a bus system and/or other forms of connection mechanisms (not shown).
For example, when the electronic device is a terminal device 101 or a server 103, the input device 1003 may be a device such as a microphone for inputting audio signals. When the electronic device is a stand-alone device, the input device 1003 may be a communication network connector for receiving the input audio signals from the terminal device 101 and the server 103.
The output device 1004 may output various information to the outside, including instructions for the voice interaction device to execute corresponding functions. The output device 1004 may include, for example, a display, a speaker, a printer, a communication network and remote output devices connected thereto, and the like.
Of course, for simplicity, only some of the components in the electronic device 1000 related to the present disclosure are shown in Fig. 10, and components such as buses and input/output interfaces are omitted. In addition, the electronic device 1000 may include any other appropriate components depending on the specific application.
Exemplary Computer Program Product and Computer-Readable Storage Medium
In addition to the above methods and devices, embodiments of the present disclosure may also be a computer program product, which includes computer program instructions that, when run by a processor, cause the processor to execute the steps of the voice interaction methods according to the various embodiments of the present disclosure described in the "Exemplary Method" section of this specification.
The computer program product may write program code for performing the operations of the embodiments of the present disclosure in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on a remote computing device or server.
Furthermore, embodiments of the present disclosure may also be a computer-readable storage medium having computer program instructions stored thereon that, when run by a processor, cause the processor to execute the steps of the voice interaction methods according to the various embodiments of the present disclosure described in the "Exemplary Method" section of this specification.
The computer-readable storage medium may use any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may include, but is not limited to, for example, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection with one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination thereof.
The basic principles of the present disclosure have been described above in conjunction with specific embodiments. However, it should be pointed out that the advantages, strengths, effects, and the like mentioned in the present disclosure are merely examples and not limitations, and these advantages, strengths, effects, and the like cannot be considered necessary for every embodiment of the present disclosure. In addition, the specific details disclosed above are only for the purpose of illustration and ease of understanding, and not for limitation; the above details do not limit the present disclosure to being implemented only by using those specific details.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may refer to one another. As for the system embodiment, since it basically corresponds to the method embodiment, its description is relatively brief, and the relevant parts may refer to the description of the method embodiment.
The block diagrams of the devices, apparatuses, equipment, and systems involved in the present disclosure are merely illustrative examples and are not intended to require or imply that connections, arrangements, and configurations must be made in the manner shown in the block diagrams. As those skilled in the art will recognize, these devices, apparatuses, equipment, and systems may be connected, arranged, and configured in any manner. Words such as "include", "comprise", and "have" are open-ended and mean "including but not limited to", and may be used interchangeably therewith. The words "or" and "and" as used herein mean the word "and/or" and may be used interchangeably therewith, unless the context clearly indicates otherwise. The word "such as" as used herein means the phrase "such as but not limited to" and may be used interchangeably therewith.
The methods and apparatuses of the present disclosure may be implemented in many ways, for example by software, hardware, firmware, or any combination of software, hardware, and firmware. The above order of the steps of the methods is for illustration only, and the steps of the methods of the present disclosure are not limited to the order specifically described above, unless otherwise specifically stated. Furthermore, in some embodiments, the present disclosure may also be implemented as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present disclosure. Thus, the present disclosure also covers the recording medium storing the programs for executing the methods according to the present disclosure.
It should also be pointed out that, in the apparatuses, devices, and methods of the present disclosure, the components or steps may be decomposed and/or recombined. Such decompositions and/or recombinations shall be regarded as equivalent solutions of the present disclosure.
The above description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these aspects will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other aspects without departing from the scope of the present disclosure. Therefore, the present disclosure is not intended to be limited to the aspects shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, this description is not intended to limit the embodiments of the present disclosure to the forms disclosed herein. Although a number of example aspects and embodiments have been discussed above, those skilled in the art will recognize certain variations, modifications, changes, additions, and sub-combinations thereof.

Claims (10)

  1. A voice interaction method, comprising:
    acquiring at least one audio signal;
    recognizing the at least one audio signal by using a preset speech recognition model, and obtaining first-type recognition results through the speech recognition model;
    determining, from a cache, stored recognition data about the at least one audio signal;
    generating second-type recognition results based on the stored recognition data;
    processing the first-type recognition results and the second-type recognition results by using the speech recognition model to obtain at least one sentence recognition result corresponding to the at least one audio signal;
    performing semantic parsing on the sentence recognition results to obtain at least one parsing result; and
    generating, based on the at least one parsing result, an instruction for controlling a voice interaction device to execute a corresponding function.
  2. The method according to claim 1, wherein the acquiring at least one audio signal comprises:
    receiving an initial audio signal collected by an audio collection device; and
    performing sound source separation on the initial audio signal to obtain the at least one audio signal.
  3. The method according to claim 1, wherein the recognizing the at least one audio signal by using a preset speech recognition model and obtaining first-type recognition results through the speech recognition model comprises:
    determining speech recognition instances respectively corresponding to the at least one audio signal;
    executing the determined speech recognition instances in parallel; and
    recognizing the corresponding audio signal through each speech recognition instance by using the speech recognition model.
  4. The method according to claim 3, wherein the performing semantic parsing on the sentence recognition results to obtain at least one parsing result comprises:
    determining semantic parsing instances respectively corresponding to the obtained sentence recognition results;
    executing the determined semantic parsing instances in parallel; and
    performing semantic parsing on the corresponding sentence recognition result through each semantic parsing instance to obtain the at least one parsing result.
  5. The method according to claim 1, wherein the recognizing the at least one audio signal by using a preset speech recognition model and obtaining first-type recognition results through the speech recognition model comprises:
    determining, by using an acoustic sub-model included in the speech recognition model, syllable sets respectively corresponding to the at least one audio signal and first probability scores corresponding to the syllables in the syllable sets;
    determining, by using a language sub-model included in the speech recognition model, word sets respectively corresponding to the at least one audio signal;
    for a word in the word set, determining whether a second probability score corresponding to the word exists in the cache;
    if not, determining the second probability score corresponding to the word by using the language sub-model; and
    determining the first-type recognition results based on the first probability scores and the second probability scores determined by the language sub-model.
  6. The method according to claim 5, wherein the determining, from a cache, stored recognition data about the at least one audio signal comprises:
    for a word in the word set, determining whether a second probability score corresponding to the word exists in the cache;
    if so, determining the second probability score in the cache as the second probability score of the word; and
    determining the second-type recognition results based on the first probability scores and the second probability scores determined from the cache.
  7. The method according to claim 6, wherein the processing the first-type recognition results and the second-type recognition results by using the speech recognition model to obtain at least one sentence recognition result corresponding to the at least one audio signal comprises:
    determining, in a decoding network included in the speech recognition model, a target path of the word set according to the first probability scores and the second probability scores respectively included in the first-type recognition results and the second-type recognition results; and
    generating, based on the target path, the at least one sentence recognition result corresponding to the at least one audio signal.
  8. A voice interaction apparatus, comprising:
    an acquisition module configured to acquire at least one audio signal;
    a recognition module configured to recognize the at least one audio signal by using a preset speech recognition model and obtain first-type recognition results through the speech recognition model;
    a determination module configured to determine, from a cache, stored recognition data about the at least one audio signal;
    a first generation module configured to generate second-type recognition results based on the stored recognition data;
    a processing module configured to process the first-type recognition results and the second-type recognition results by using the speech recognition model to obtain at least one sentence recognition result corresponding to the at least one audio signal;
    a parsing module configured to perform semantic parsing on the sentence recognition results to obtain at least one parsing result; and
    a second generation module configured to generate, based on the at least one parsing result, an instruction for controlling a voice interaction device to execute a corresponding function.
  9. A computer-readable storage medium storing computer program instructions,
    wherein the method according to any one of claims 1 to 7 is implemented when the computer program instructions are executed.
  10. An electronic device, comprising:
    a processor; and
    a memory for storing instructions executable by the processor;
    wherein the processor is configured to read the executable instructions from the memory and execute the instructions to implement the method according to any one of claims 1 to 7.
PCT/CN2022/076422 2021-03-16 2022-02-16 语音交互方法、装置、计算机可读存储介质及电子设备 WO2022193892A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP2022558093A JP2023520861A (ja) 2021-03-16 2022-02-16 音声対話方法、装置、コンピュータ可読記憶媒体及び電子機器
US18/247,441 US20240005917A1 (en) 2021-03-16 2022-02-16 Speech interaction method ,and apparatus, computer readable storage medium, and electronic device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110279812.4A CN113066489A (zh) 2021-03-16 2021-03-16 语音交互方法、装置、计算机可读存储介质及电子设备
CN202110279812.4 2021-03-16

Publications (1)

Publication Number Publication Date
WO2022193892A1 true WO2022193892A1 (zh) 2022-09-22

Family

ID=76560535

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/076422 WO2022193892A1 (zh) 2021-03-16 2022-02-16 语音交互方法、装置、计算机可读存储介质及电子设备

Country Status (4)

Country Link
US (1) US20240005917A1 (zh)
JP (1) JP2023520861A (zh)
CN (1) CN113066489A (zh)
WO (1) WO2022193892A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113066489A (zh) * 2021-03-16 2021-07-02 深圳地平线机器人科技有限公司 语音交互方法、装置、计算机可读存储介质及电子设备

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2000038175A1 (en) * 1998-12-21 2000-06-29 Koninklijke Philips Electronics N.V. Language model based on the speech recognition history
CN1455387A (zh) * 2002-11-15 2003-11-12 中国科学院声学研究所 一种语音识别系统中的快速解码方法
US20160358606A1 (en) * 2015-06-06 2016-12-08 Apple Inc. Multi-Microphone Speech Recognition Systems and Related Techniques
US20170236518A1 (en) * 2016-02-16 2017-08-17 Carnegie Mellon University, A Pennsylvania Non-Profit Corporation System and Method for Multi-User GPU-Accelerated Speech Recognition Engine for Client-Server Architectures
EP3425628A1 (en) * 2017-07-05 2019-01-09 Panasonic Intellectual Property Management Co., Ltd. Voice recognition method, recording medium, voice recognition device, and robot
CN109215630A (zh) * 2018-11-14 2019-01-15 北京羽扇智信息科技有限公司 实时语音识别方法、装置、设备及存储介质
CN109727603A (zh) * 2018-12-03 2019-05-07 百度在线网络技术(北京)有限公司 语音处理方法、装置、用户设备及存储介质
CN112071310A (zh) * 2019-06-11 2020-12-11 北京地平线机器人技术研发有限公司 语音识别方法和装置、电子设备和存储介质
CN113066489A (zh) * 2021-03-16 2021-07-02 深圳地平线机器人科技有限公司 语音交互方法、装置、计算机可读存储介质及电子设备

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108573706B (zh) * 2017-03-10 2021-06-08 北京搜狗科技发展有限公司 一种语音识别方法、装置及设备
CN108510990A (zh) * 2018-07-04 2018-09-07 百度在线网络技术(北京)有限公司 语音识别方法、装置、用户设备及存储介质
CN110534095B (zh) * 2019-08-22 2020-10-23 百度在线网络技术(北京)有限公司 语音识别方法、装置、设备以及计算机可读存储介质
CN110415697A (zh) * 2019-08-29 2019-11-05 的卢技术有限公司 一种基于深度学习的车载语音控制方法及其系统
CN110473531B (zh) * 2019-09-05 2021-11-09 腾讯科技(深圳)有限公司 语音识别方法、装置、电子设备、系统及存储介质
CN110661927B (zh) * 2019-09-18 2022-08-26 平安科技(深圳)有限公司 语音交互方法、装置、计算机设备及存储介质

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2000038175A1 (en) * 1998-12-21 2000-06-29 Koninklijke Philips Electronics N.V. Language model based on the speech recognition history
CN1455387A (zh) * 2002-11-15 2003-11-12 中国科学院声学研究所 一种语音识别系统中的快速解码方法
US20160358606A1 (en) * 2015-06-06 2016-12-08 Apple Inc. Multi-Microphone Speech Recognition Systems and Related Techniques
US20170236518A1 (en) * 2016-02-16 2017-08-17 Carnegie Mellon University, A Pennsylvania Non-Profit Corporation System and Method for Multi-User GPU-Accelerated Speech Recognition Engine for Client-Server Architectures
EP3425628A1 (en) * 2017-07-05 2019-01-09 Panasonic Intellectual Property Management Co., Ltd. Voice recognition method, recording medium, voice recognition device, and robot
CN109215630A (zh) * 2018-11-14 2019-01-15 北京羽扇智信息科技有限公司 实时语音识别方法、装置、设备及存储介质
CN109727603A (zh) * 2018-12-03 2019-05-07 百度在线网络技术(北京)有限公司 语音处理方法、装置、用户设备及存储介质
CN112071310A (zh) * 2019-06-11 2020-12-11 北京地平线机器人技术研发有限公司 语音识别方法和装置、电子设备和存储介质
CN113066489A (zh) * 2021-03-16 2021-07-02 深圳地平线机器人科技有限公司 语音交互方法、装置、计算机可读存储介质及电子设备

Also Published As

Publication number Publication date
CN113066489A (zh) 2021-07-02
JP2023520861A (ja) 2023-05-22
US20240005917A1 (en) 2024-01-04

Similar Documents

Publication Publication Date Title
US11132172B1 (en) Low latency audio data pipeline
WO2021093449A1 (zh) 基于人工智能的唤醒词检测方法、装置、设备及介质
WO2022105861A1 (zh) 用于识别语音的方法、装置、电子设备和介质
CN114830228A (zh) 与设备关联的账户
WO2020238209A1 (zh) 音频处理的方法、系统及相关设备
US11687526B1 (en) Identifying user content
US11574637B1 (en) Spoken language understanding models
US20200219384A1 (en) Methods and systems for ambient system control
CN114038457B (zh) 用于语音唤醒的方法、电子设备、存储介质和程序
CN113674742B (zh) 人机交互方法、装置、设备以及存储介质
CN111916053B (zh) 语音生成方法、装置、设备和计算机可读介质
US11532301B1 (en) Natural language processing
WO2021098318A1 (zh) 应答方法、终端及存储介质
US20240013784A1 (en) Speaker recognition adaptation
WO2022193892A1 (zh) 语音交互方法、装置、计算机可读存储介质及电子设备
CN113889091A (zh) 语音识别方法、装置、计算机可读存储介质及电子设备
US11626107B1 (en) Natural language processing
CN116075888A (zh) 用于减少云服务中的延迟的系统和方法
CN113611316A (zh) 人机交互方法、装置、设备以及存储介质
CN115699170A (zh) 文本回声消除
CN109887490A (zh) 用于识别语音的方法和装置
US11830476B1 (en) Learned condition text-to-speech synthesis
CN115132195A (zh) 语音唤醒方法、装置、设备、存储介质及程序产品
CN112487180B (zh) 文本分类方法和装置、计算机可读存储介质和电子设备
WO2021051565A1 (zh) 基于机器学习的语义解析方法、装置、电子设备及计算机非易失性可读存储介质

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2022558093

Country of ref document: JP

Kind code of ref document: A

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22770247

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 18247441

Country of ref document: US

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 25.01.2024)