WO2022193892A1 - Voice interaction method and apparatus, computer-readable storage medium, and electronic device - Google Patents
Voice interaction method and apparatus, computer-readable storage medium, and electronic device
- Publication number
- WO2022193892A1 (PCT/CN2022/076422)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- audio signal
- result
- speech recognition
- recognition result
- recognition
- Prior art date
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/14—Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/1815—Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/1822—Parsing for meaning understanding
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/28—Constructional details of speech recognition systems
- G10L15/30—Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/28—Constructional details of speech recognition systems
- G10L15/34—Adaptation of a single recogniser for parallel processing, e.g. by use of multiple processors or cloud computing
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/183—Speech classification or search using natural language modelling using context dependencies, e.g. language models
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/223—Execution procedure of a spoken command
Definitions
- the present disclosure relates to the field of computer technology, and in particular, to a voice interaction method, an apparatus, a computer-readable storage medium, and an electronic device.
- Intelligent voice interaction technology can be applied to various devices such as automobiles, robots, household appliances, central control systems, access control systems, and ATMs.
- A conventional voice interaction system usually receives only one voice signal at a time, and that signal is processed to give feedback to the user.
- Voice interaction systems are developing in a more efficient, intelligent, and personalized direction.
- Embodiments of the present disclosure provide a voice interaction method, apparatus, computer-readable storage medium, and electronic device.
- An embodiment of the present disclosure provides a voice interaction method, the method including: acquiring at least one audio signal; recognizing the at least one audio signal by using a preset speech recognition model, and obtaining a first type of recognition result through the speech recognition model; determining, from a cache, stored recognition data about the at least one audio signal; generating a second type of recognition result based on the stored recognition data; processing the first type of recognition result and the second type of recognition result by using the speech recognition model to obtain a sentence recognition result corresponding to the at least one audio signal; performing semantic parsing on the sentence recognition result to obtain at least one parsing result; and generating, based on the at least one parsing result, an instruction for controlling a voice interaction device to perform a corresponding function.
- A voice interaction apparatus includes: an acquisition module for acquiring at least one audio signal; a recognition module for recognizing the at least one audio signal by using a preset speech recognition model and obtaining a first type of recognition result through the speech recognition model; a determination module for determining, from a cache, stored recognition data about the at least one audio signal; a first generation module for generating a second type of recognition result based on the stored recognition data; a processing module for processing the first type of recognition result and the second type of recognition result by using the speech recognition model to obtain sentence recognition results respectively corresponding to the at least one audio signal; a parsing module for performing semantic parsing on each sentence recognition result to obtain at least one parsing result; and a second generation module configured to generate, based on the at least one parsing result, an instruction for controlling a voice interaction device to perform a corresponding function.
- A computer-readable storage medium stores a computer program, and the computer program is used to execute the above voice interaction method.
- An electronic device includes: a processor; and a memory for storing instructions executable by the processor; the processor is configured to read the executable instructions from the memory and execute the instructions to implement the above voice interaction method.
- At least one audio signal is recognized by using a preset speech recognition model, and during recognition the stored recognition data is extracted from the cache to generate part of the recognition results, while the other part of the recognition results is generated by the speech recognition model. The stored recognition data is thereby effectively reused, the speech recognition model does not need to process the full amount of data, and the processing efficiency for the at least one audio signal is improved, which helps meet the requirements of low resource consumption and low processing delay in multi-channel voice interaction scenarios.
- FIG. 1 is a system diagram to which the present disclosure is applied.
- FIG. 2 is a schematic flowchart of a voice interaction method provided by an exemplary embodiment of the present disclosure.
- FIG. 3 is a schematic flowchart of a voice interaction method provided by another exemplary embodiment of the present disclosure.
- FIG. 4 is a schematic flowchart of a voice interaction method provided by another exemplary embodiment of the present disclosure.
- FIG. 5 is a schematic flowchart of a voice interaction method provided by another exemplary embodiment of the present disclosure.
- FIG. 6 is a schematic flowchart of a voice interaction method provided by another exemplary embodiment of the present disclosure.
- FIG. 7 is a schematic diagram of an application scenario of the voice interaction method according to an embodiment of the present disclosure.
- FIG. 8 is a schematic structural diagram of a voice interaction apparatus provided by an exemplary embodiment of the present disclosure.
- FIG. 9 is a schematic structural diagram of a voice interaction apparatus provided by another exemplary embodiment of the present disclosure.
- FIG. 10 is a structural diagram of an electronic device provided by an exemplary embodiment of the present disclosure.
- In the present disclosure, "a plurality" may refer to two or more, and "at least one" may refer to one, two, or more.
- The term "and/or" in the present disclosure merely describes an association between objects and indicates that three relationships are possible; for example, "A and/or B" can mean: A exists alone, A and B exist at the same time, or B exists alone.
- The character "/" in the present disclosure generally indicates that the associated objects are in an "or" relationship.
- Embodiments of the present disclosure can be applied to electronic devices such as terminal devices, computer systems, servers, etc., which can operate with numerous other general-purpose or special-purpose computing system environments or configurations.
- Examples of well-known terminal devices, computing systems, environments, and/or configurations suitable for use with electronic devices such as terminal devices, computer systems, and servers include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, handheld or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, network personal computers, minicomputer systems, mainframe computer systems, and distributed cloud computing environments including any of the foregoing.
- Electronic devices such as terminal devices, computer systems, servers, etc., may be described in the general context of computer system-executable instructions, such as program modules, being executed by the computer system.
- program modules may include routines, programs, object programs, components, logic, data structures, etc. that perform particular tasks or implement particular abstract data types.
- Computer systems/servers may be implemented in distributed cloud computing environments where tasks are performed by remote processing devices that are linked through a communications network.
- program modules may be located on local or remote computing system storage media including storage devices.
- Current voice interaction technology can usually process only one voice signal at a time and cannot process multiple voice signals simultaneously, so it cannot meet the needs of multi-user, personalized voice recognition. The technical solution of the present disclosure therefore applies voice interaction technology to multi-channel speech recognition scenarios.
- In addition, a conventional speech recognition model needs to process the full amount of speech signal data, resulting in low recognition efficiency and large interaction delay; especially in multi-channel speech recognition scenarios, this cannot meet the requirement that multiple users use the voice interaction system efficiently and in a personalized manner.
- FIG. 1 shows an exemplary system architecture 100 of a voice interaction method or a voice interaction apparatus to which embodiments of the present disclosure may be applied.
- the system architecture 100 may include a terminal device 101 , a network 102 and a server 103 .
- the network 102 is a medium for providing a communication link between the terminal device 101 and the server 103 .
- The network 102 may include various connection types, such as wired or wireless communication links, or fiber-optic cables.
- the user can use the terminal device 101 to interact with the server 103 through the network 102 to receive or send messages and the like.
- Various communication client applications may be installed on the terminal device 101, such as speech recognition applications, multimedia applications, search applications, web browser applications, shopping applications, instant messaging tools, and the like.
- The terminal device 101 may be a mobile electronic device, including but not limited to a vehicle-mounted terminal, a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (Personal Digital Assistant), a PAD (tablet computer), or a PMP (Portable Media Player), as well as a fixed terminal such as a digital TV, a desktop computer, or a smart home appliance.
- the server 103 may be a device that can provide various service functions, such as a background speech recognition server that recognizes the audio signal uploaded by the terminal device 101 .
- the background voice recognition server can process the received audio to obtain an instruction for controlling the voice interaction device, and feed the instruction back to the terminal device 101 .
- the voice interaction method provided by the embodiments of the present disclosure may be executed by the server 103 or by the terminal device 101 .
- The voice interaction apparatus may be provided in the server 103 or in the terminal device 101.
- The numbers of terminal devices 101, networks 102, and servers 103 in FIG. 1 are merely illustrative. Any number of terminal devices, networks, and/or servers may be provided according to implementation requirements, which is not limited in this application.
- the above-mentioned system architecture may not include the network 102, but only include a server or a terminal device. For example, when the terminal device 101 and the server 103 are connected in a wired manner, the network 102 may be omitted.
- FIG. 2 is a schematic flowchart of a voice interaction method provided by an exemplary embodiment of the present disclosure.
- The method of this embodiment can be applied to an electronic device (such as the terminal device 101 or the server 103 shown in FIG. 1). As shown in FIG. 2, the method includes the following steps:
- Step 201: Acquire at least one audio signal.
- the electronic device can acquire at least one audio signal locally or remotely.
- the above-mentioned at least one audio signal may be a speech signal of at least one passenger in the vehicle collected by at least one microphone installed in the vehicle.
- Step 202: Recognize the at least one audio signal by using a preset speech recognition model, and obtain a first type of recognition result through the speech recognition model.
- the electronic device can use a preset speech recognition model to recognize at least one audio signal, and in the recognition process, the first type of recognition result is obtained by using the preset speech recognition model.
- the preset speech recognition model may be a model obtained by pre-training with a large number of speech signal samples.
- the preset speech recognition model is used to recognize at least one input audio signal to obtain at least one sentence recognition result.
- the preset speech recognition model may include multiple sub-models, for example, including an acoustic sub-model, a language sub-model, a decoding network sub-model, and the like.
- the acoustic sub-model is used to divide the audio signal into syllables;
- the language sub-model is used to convert each syllable into a word;
- the decoding network sub-model is used to select an optimal combination from multiple words to obtain a sentence.
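- To make the division of labor concrete, the following is a minimal Python sketch of such a three-stage pipeline. All class and method names here are illustrative assumptions for the example, not the patent's implementation.

```python
# Minimal sketch of the three-stage pipeline (hypothetical names):
# acoustic sub-model -> language sub-model -> decoding network.
from dataclasses import dataclass
from typing import List

@dataclass
class Syllable:
    label: str
    score: float  # first probability score: confidence that the division is correct

class SpeechRecognitionModel:
    def __init__(self, acoustic_model, language_model, decoder):
        self.acoustic_model = acoustic_model  # divides the audio signal into syllables
        self.language_model = language_model  # converts syllables into candidate words
        self.decoder = decoder                # selects the best word combination

    def recognize(self, audio: bytes) -> str:
        syllables: List[Syllable] = self.acoustic_model.split(audio)
        words = self.language_model.to_words(syllables)
        return self.decoder.best_sentence(words)  # the sentence recognition result
```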
- In the process of recognizing the at least one audio signal by using the preset speech recognition model in step 202, the electronic device usually first searches the cache for recognition data corresponding to the current processing stage. If no corresponding recognition data exists in the cache, step 202 is executed to obtain the recognition data, and this recognition data is used as the first type of recognition result.
- Step 203: Determine stored recognition data about the at least one audio signal from the cache.
- The electronic device may determine the stored recognition data about the at least one audio signal from the cache.
- During recognition by the speech recognition model, the electronic device usually first searches the cache for recognition data corresponding to the current processing stage and, if present, extracts that recognition data.
- Step 204: Generate a second type of recognition result based on the stored recognition data.
- The electronic device may generate the second type of recognition result based on the stored recognition data extracted in step 203.
- The stored recognition data may be used directly as the second type of recognition result, or it may first be subjected to certain processing (for example, scaling or normalizing the data) to obtain the second type of recognition result.
- The first-type and second-type recognition results are usually intermediate results produced during processing by the speech recognition model, such as probability scores of syllables and probability scores of words.
- Step 205: Use the speech recognition model to process the first type of recognition result and the second type of recognition result, to obtain sentence recognition results respectively corresponding to the at least one audio signal.
- the electronic device may use a speech recognition model to process the first type of recognition result and the second type of recognition result to obtain sentence recognition results corresponding to at least one audio signal respectively.
- the above-mentioned first-type recognition results and second-type recognition results may include the probability score of each syllable and the probability score of each word obtained after the audio signal is recognized.
- The speech recognition model can use a path search algorithm (such as the Viterbi algorithm) to determine an optimal path among the plurality of recognized words corresponding to the audio signal, and obtain a sentence as the sentence recognition result according to the optimal path.
- one channel of audio signal may correspond to one sentence recognition result
- multiple channels of audio signals may correspond to multiple channels of sentence recognition results.
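- As an illustration of the path search described above, the following is a minimal Viterbi-style sketch over a word lattice. The lattice layout and the `lm_score` function are assumptions made for the example, not the patent's decoder.

```python
def viterbi_best_path(lattice, lm_score):
    """Find the highest-scoring word sequence through a lattice.

    lattice: list of time steps; each step is a list of (word, acoustic_score)
             candidates, where acoustic_score is a log-probability.
    lm_score: function (prev_word, word) -> log-probability of that bigram.
    Returns the best word sequence as a list of words.
    """
    # best[word] = (cumulative log score, path ending in that word)
    best = {w: (s, [w]) for w, s in lattice[0]}
    for step in lattice[1:]:
        new_best = {}
        for word, ac_score in step:
            # Extend every surviving hypothesis and keep only the best one.
            new_best[word] = max(
                (score + lm_score(prev, word) + ac_score, path + [word])
                for prev, (score, path) in best.items()
            )
        best = new_best
    _, path = max(best.values())
    return path
```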
- Step 206: Perform semantic parsing on each sentence recognition result to obtain at least one parsing result.
- the electronic device may perform semantic parsing on each of the at least one sentence recognition result to obtain at least one parsing result.
- each analysis result in the above at least one analysis result corresponds to an audio signal.
- Semantic parsing of the sentence recognition result may be performed by various methods, for example a rule engine, a neural network engine, or other approaches.
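- For illustration, a minimal rule-engine sketch is shown below; the patterns and slot names are hypothetical examples, and a production system might fall back to a neural network engine when no rule matches.

```python
import re

# Each rule maps a sentence pattern to a structured intent (illustrative only).
RULES = [
    (re.compile(r"set the air conditioner to (\d+) degrees"),
     lambda m: {"device": "air_conditioner", "action": "set_temperature",
                "value": int(m.group(1))}),
    (re.compile(r"close the window"),
     lambda m: {"device": "window", "action": "close"}),
]

def parse(sentence: str):
    """Return a structured parsing result for one sentence recognition result."""
    for pattern, build in RULES:
        match = pattern.search(sentence.lower())
        if match:
            return build(match)
    return None  # no rule matched; hand off to e.g. a neural network engine

# e.g. parse("Set the air conditioner to 25 degrees")
# -> {'device': 'air_conditioner', 'action': 'set_temperature', 'value': 25}
```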
- Step 207: Generate, based on the at least one parsing result, an instruction for controlling the voice interaction device to perform a corresponding function.
- the electronic device may generate an instruction for controlling the voice interaction device to perform a corresponding function based on at least one analysis result.
- the above-mentioned voice interaction device may be the above-mentioned electronic device for executing the voice interaction method of the present disclosure, or may be an electronic device that is communicatively connected to the above-mentioned electronic device.
- For example, if the voice interaction device is a car air conditioner, the generated instruction may be an instruction for setting the air conditioner to a certain preset temperature, the preset temperature being, for example, 25°C.
- In the method provided by the above embodiment, at least one audio signal is recognized by using a preset speech recognition model; during recognition, stored recognition data is extracted from the cache to generate part of the recognition results, and the other part of the recognition results is generated by the speech recognition model. This effectively reuses the stored recognition data and avoids having the speech recognition model process the full amount of data, thereby improving the efficiency of processing the at least one audio signal and helping to meet the requirements of low resource consumption and low processing delay in multi-channel voice interaction scenarios.
- The above-mentioned electronic device may also store the recognition data obtained by the preset speech recognition model during recognition into the cache. Specifically, when recognition data corresponding to a certain recognition step does not exist in the cache, the speech recognition model executes that step, and the obtained recognition data is stored in the cache, facilitating subsequent reuse of the recognition data.
- In this way, reuse of the recognition data is realized, and the recognition data in the cache is continuously updated.
- Using more stored recognition data during model recognition further improves the efficiency of speech recognition.
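- The check-cache / compute-on-miss / store behavior described above can be sketched as follows; the key scheme and the compute callback are illustrative assumptions, not the patent's data layout.

```python
class RecognitionCache:
    """Sketch of reusing intermediate recognition data across recognitions."""

    def __init__(self):
        self._store = {}

    def get_or_compute(self, key, compute):
        """Reuse stored recognition data if present; otherwise run the
        model step, store its result, and return it."""
        if key in self._store:
            return self._store[key]   # second-type result: reused from the cache
        result = compute()            # first-type result: computed by the model
        self._store[key] = result     # stored for subsequent reuse
        return result
```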
- the specific execution process of the above step 201 is as follows:
- the initial audio signal collected by the audio collection device is received.
- the number of the audio collection devices may be one or more, which are used to collect at least one channel of the initial audio signal.
- the above-mentioned initial audio signal may be a signal obtained by collecting the voice of at least one user by an audio collecting device.
- multiple audio collection devices are configured, and each audio collection device is installed around each seat in the vehicle. Each audio collection device is used to collect the voices of passengers on the corresponding seats.
- The collected audio signals usually contain the mixed voice signals of multiple users; sound source separation processing is therefore performed on the initial audio signal to obtain the at least one audio signal.
- The sound source separation processing can adopt the prior art, for example a Blind Source Separation (BSS) algorithm, to separate the voice signals of multiple users, so that each obtained audio signal corresponds to one user.
- The sound source separation processing can also associate each obtained audio signal with the corresponding audio collection device. Since each audio collection device is installed near a corresponding seat, each obtained audio signal can thus be associated with the corresponding seat.
- the voice signals of multiple users are separated through the sound source separation technology, and a one-to-one correspondence is established with different audio collection devices.
- This implementation can separate the voices of multiple users by separating the sound source of the initial audio signal, so that each subsequent voice recognition result corresponds to the corresponding user, thereby improving the accuracy of voice interaction of multiple users.
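- As one hedged illustration of sound source separation, the sketch below uses independent component analysis via scikit-learn's FastICA. This time-domain treatment is a simplified stand-in; real in-vehicle systems often use frequency-domain BSS methods instead, and the function name is an assumption for the example.

```python
import numpy as np
from sklearn.decomposition import FastICA

def separate_sources(mixed: np.ndarray, n_speakers: int) -> np.ndarray:
    """Separate speaker signals from multi-microphone recordings.

    mixed: array of shape (n_samples, n_microphones) holding the initial
           audio signals; each column is one microphone channel.
    Returns an array of shape (n_samples, n_speakers), one column per
    estimated speaker signal.
    """
    ica = FastICA(n_components=n_speakers, random_state=0)
    return ica.fit_transform(mixed)
```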
- step 202 may include the following sub-steps:
- Step 2021: Determine the speech recognition instance corresponding to each audio signal.
- the speech recognition instance can be constructed by code, each speech recognition instance corresponds to a channel of audio signal, and each speech recognition instance is used to recognize a channel of audio signal corresponding to it.
- Step 2022: Execute each of the determined speech recognition instances in parallel.
- each speech recognition instance may be implemented in a multi-threaded manner; or, each speech recognition instance may also be executed by different CPUs, so as to implement parallel execution.
- Step 2023: Recognize the corresponding audio signal through each speech recognition instance by using the preset speech recognition model.
- each speech recognition instance can call the above-mentioned preset speech recognition model in parallel and separately to recognize the corresponding speech signal, thereby realizing the parallel recognition of the audio signal.
- A preset speech recognition model may be loaded into memory first, and each speech recognition instance shares the preset speech recognition model. It should be noted that when each speech recognition instance is used to recognize an audio signal, the above-mentioned cache may also be shared, thereby improving the recognition efficiency of each speech recognition instance.
- this implementation can realize simultaneous recognition of the speech of multiple users.
- each speech recognition instance shares a speech recognition model to recognize the speech signal.
- The instances share the same cache to store and retrieve recognition data, realize speech recognition for the at least one audio signal in parallel, share the resources required for recognition, and improve the efficiency of speech recognition in multi-user voice interaction scenarios.
- Because the recognition data is stored, it can be retrieved directly in subsequent recognition without repeated computation, which greatly saves memory resources.
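- A minimal sketch of these parallel recognition instances follows, using a thread pool. The `recognize(audio, cache=...)` entry point is an assumption carried over from the earlier pipeline sketch, and a real shared cache would also need a lock or a concurrent map, which the sketch elides.

```python
from concurrent.futures import ThreadPoolExecutor

def recognize_all(model, cache, audio_channels):
    """Run one speech recognition instance per audio channel in parallel.

    All instances share the single preloaded model and the single cache.
    """
    def run_instance(audio):
        # Each instance calls the shared model; intermediate recognition
        # data is read from and written to the shared cache.
        return model.recognize(audio, cache=cache)

    with ThreadPoolExecutor(max_workers=len(audio_channels)) as pool:
        # One parallel task per channel; results keep channel order.
        return list(pool.map(run_instance, audio_channels))
```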
- step 206 may include the following sub-steps:
- Step 2061: Determine the semantic parsing instances respectively corresponding to the obtained sentence recognition results.
- the semantic parsing instance can be constructed by code, each semantic parsing instance corresponds to a sentence recognition result of a channel of audio signal, and the semantic parsing instance is used to perform structured parsing on the sentence recognition result.
- Step 2062: Execute each of the determined semantic parsing instances in parallel.
- each semantic parsing instance may be implemented in a multi-thread manner; or, each semantic parsing instance may be executed by different CPUs, thereby implementing parallel execution.
- Step 2063: Perform semantic parsing on the corresponding sentence recognition result through each semantic parsing instance.
- each semantic parsing instance can call the preset rule engine, neural network engine and other modules in parallel to realize parallel parsing of the sentence recognition result.
- this implementation realizes the simultaneous recognition and parsing of the voices of multiple users, thereby constructing multiple chains that can perform voice interaction at the same time. Moreover, each semantic parsing instance shares a set of semantic resources, which also improves the speech recognition efficiency in multi-user speech interaction scenarios.
- step 202 may include the following steps:
- Step 2024: Use the acoustic sub-model included in the speech recognition model to determine the syllable set corresponding to the at least one audio signal and the first probability scores corresponding to the syllables in the syllable set.
- the acoustic sub-model is used for syllable division of the input audio signal.
- the acoustic sub-model includes, but is not limited to, a Hidden Markov Model (HMM, Hidden Markov Model), a Gaussian Mixture Model (GMM, Gaussian Mixture Model), and the like.
- the first probability score is used to characterize the probability that the syllable is correctly divided.
- Step 2025: Use the language sub-model included in the speech recognition model to determine the word set corresponding to the at least one audio signal.
- the language sub-model is used to determine the word set according to the above-mentioned syllable set.
- the language sub-model may include, but is not limited to, an n-gram language model, a neural network language model, and the like.
- Step 2026: For a word in the word set, determine whether a second probability score corresponding to the word exists in the cache; if not, use the language sub-model to determine the second probability score corresponding to the word.
- The second probability score is used to represent the probability of the recognized word appearing. For example, the probability that "air conditioner" appears after "turn on" is the second probability score corresponding to the word "air conditioner".
- When the probability score of a certain word needs to be determined, the electronic device first searches the cache for the word's second probability score and, when it does not exist, uses the language sub-model to calculate the second probability score of the word.
- In this implementation, the cache is used to pre-store second probability scores generated by the language sub-model, so that a stored second probability score can be obtained directly from the cache when needed.
- Step 2027: Determine the first type of recognition result based on the first probability scores and the second probability score determined by the language sub-model.
- each of the first probability scores and each of the second probability scores may be determined as the first type of recognition result.
- The method provided by the embodiment corresponding to FIG. 5 uses a cache dedicated to storing the second probability scores generated by the language sub-model, making the role of the cache more targeted.
- The cache is applied where the data processing volume is large and data access is frequent, giving full play to its ability to save computing resources, reducing redundant data in the cache, and improving the efficiency of speech recognition.
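- Steps 2026 and 2031 can be illustrated with a cached bigram score lookup; the add-one-smoothed bigram model here is an assumed stand-in for the language sub-model, not the patent's model.

```python
import math

def second_probability_score(prev_word, word, cache, bigram_counts, unigram_counts):
    """Return the (log) second probability score for `word` given `prev_word`.

    Mirrors steps 2026/2031: reuse the cached score if present; otherwise
    compute it with an (assumed) add-one-smoothed bigram language sub-model
    and store it for later reuse.
    """
    key = (prev_word, word)
    if key in cache:                      # step 2031: hit, reuse the stored score
        return cache[key]
    vocab_size = len(unigram_counts)      # step 2026: miss, compute and store
    prob = (bigram_counts.get(key, 0) + 1) / (unigram_counts.get(prev_word, 0) + vocab_size)
    cache[key] = math.log(prob)
    return cache[key]
```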
- step 203 may further include the following steps:
- Step 2031: For a word in the word set, determine whether a second probability score corresponding to the word exists in the cache.
- If so, the second probability score in the cache is determined as the second probability score of the word.
- Step 2032: Determine the second type of recognition result based on the first probability scores and the second probability score determined from the cache.
- Each of the first probability scores and the second probability scores determined from the cache may be determined as the second type of recognition result.
- the above-mentioned step 205 may be performed as follows:
- According to the first probability scores and the second probability scores respectively included in the first-type and second-type recognition results, the target path of the word set is determined in the decoding network included in the speech recognition model.
- the decoding network is a network constructed based on the above-mentioned word set. Based on this network, an optimal path of a word combination can be found in the network according to the first probability score and the second probability score, and this path is the target path.
- a sentence composed of words corresponding to the target path may be determined as the sentence recognition result.
- This implementation uses the first probability scores, the second probability scores calculated by the language sub-model, and the second probability scores extracted from the cache to find the target path in the decoding network and generate the sentence recognition result. Making full use of the second probability scores stored in the cache improves the efficiency of generating the sentence recognition result.
- FIG. 7 shows a schematic diagram of an application scenario of the voice interaction method of this embodiment.
- the voice interaction method is applied to the vehicle-mounted voice interaction system.
- Each of the multiple audio signals corresponds to one interaction chain, including: the main driver interaction chain 701, the co-driver interaction chain 702, and other interaction chains 703.
- the main driver interaction chain 701 is used for the driver to interact with the in-vehicle voice interaction system
- the co-driver interaction chain 702 is used for the passengers in the co-driver position to interact with the in-vehicle voice interaction system
- The other interaction chains 703 are used for passengers in other seats to interact with the in-vehicle voice interaction system.
- the decoding resource 704 includes a speech recognition model 7041 and a cache 7042
- the semantic resource 705 includes a rule engine 7051 and a neural network engine 7052, wherein the rule engine 7051 is used for parsing the sentence recognition result.
- In the main driver interaction chain 701, the electronic device generates a speech recognition instance A for the main driver's voice signal; in the co-driver interaction chain 702, it generates a speech recognition instance B for the co-driver's voice.
- The speech recognition instances share a set of decoding resources 704 and execute in parallel, obtaining sentence recognition result C and sentence recognition result D.
- The electronic device then constructs a semantic parsing instance E and a semantic parsing instance F, which share a set of semantic resources and respectively parse sentence recognition result C and sentence recognition result D to obtain a structured parsing result G and parsing result H.
- the electronic device generates an instruction I, an instruction J, etc. based on the analysis result G and the analysis result H.
- the instruction I is used to turn on the air conditioner; the instruction J is used to close the window.
- the in-vehicle voice interaction device executes the corresponding function K and function H based on the instruction I and the instruction J.
- the execution process of other interaction chains 703 is similar to the above-mentioned main driver interaction chain 701 and auxiliary driver interaction chain 702 , and details are not repeated here.
- FIG. 8 is a schematic structural diagram of a voice interaction apparatus provided by an exemplary embodiment of the present disclosure. This embodiment can be applied to an electronic device. As shown in FIG. 8, the voice interaction apparatus includes: an acquisition module 801, a recognition module 802, a determination module 803, a first generation module 804, a processing module 805, a parsing module 806, and a second generation module 807.
- the acquisition module 801 is used to acquire at least one audio signal;
- the recognition module 802 is used to recognize at least one audio signal by using a preset speech recognition model, and obtain the first type of recognition result through the speech recognition model;
- the determination module 803 is used to determine, from the cache, stored recognition data about the at least one audio signal;
- the first generation module 804 is used to generate a second type of recognition result based on the stored recognition data;
- the processing module 805 is used to process the first type of recognition result and the second type of recognition result by using the speech recognition model, to obtain sentence recognition results corresponding to the at least one audio signal;
- the parsing module 806 is used to perform semantic parsing on each sentence recognition result to obtain at least one parsing result;
- the second generating module 807 is configured to generate an instruction for controlling the voice interaction device to perform a corresponding function based on the at least one parsing result.
- the acquisition module 801 may acquire at least one audio signal locally or remotely.
- the above-mentioned at least one audio signal may be a speech signal of at least one passenger in the vehicle collected by at least one microphone installed in the vehicle.
- the recognition module 802 can use a preset speech recognition model to recognize at least one audio signal, and obtain the first type of recognition result through the speech recognition model.
- the speech recognition model may be a model obtained by pre-training with a large number of speech signal samples.
- the speech recognition model is used to recognize the input audio signal and obtain a sentence recognition result.
- A speech recognition model may include multiple sub-models, for example an acoustic sub-model (for dividing audio signals into syllables), a language sub-model (for converting individual syllables into words), and a decoding network (for selecting the best combination from multiple words to obtain a sentence).
- During recognition, the recognition module 802 usually first searches the cache for recognition data corresponding to the current processing stage.
- If no corresponding recognition data exists in the cache, the recognition data is obtained through the speech recognition model and is used as the first type of recognition result.
- The determination module 803 may determine stored recognition data about the at least one audio signal from the cache. During recognition by the speech recognition model, the determination module 803 usually first searches the cache for recognition data corresponding to the current processing stage and extracts it if present.
- The first generation module 804 may generate a second type of recognition result based on the extracted stored recognition data.
- The stored recognition data may be used directly as the second type of recognition result, or it may first be processed to a certain extent (for example, scaled or normalized) to obtain the second type of recognition result.
- first-type recognition results and second-type recognition results are usually intermediate results obtained during the processing of the speech recognition model, such as probability scores of syllables, probability scores of words, and the like.
- the processing module 805 can use a speech recognition model to process the first type of recognition result and the second type of recognition result to obtain sentence recognition results corresponding to at least one audio signal respectively.
- the speech recognition model needs to further process the first type of recognition result and the second type of recognition result.
- the first type of recognition result and the second type of recognition result may include the probability score of each syllable and the probability score of each word obtained after the audio signal is recognized.
- The speech recognition model may use a path search algorithm (for example, the Viterbi algorithm) to determine an optimal path from the plurality of recognized words corresponding to the audio signal, thereby obtaining a sentence as the sentence recognition result.
- the parsing module 806 may perform semantic parsing on each sentence recognition result to obtain at least one parsing result.
- each analysis result in the above at least one analysis result corresponds to an audio signal.
- the parsing results can be structured data.
- For example, the sentence recognition result may be "the air conditioner temperature is set to 25 degrees".
- The method for parsing the sentence may adopt the prior art, for example rule engines or neural network engines.
- the second generation module 807 may generate an instruction for controlling the voice interaction device to perform a corresponding function based on at least one analysis result.
- The above-mentioned voice interaction device may be the electronic device provided with the above voice interaction apparatus, or may be an electronic device communicatively connected to that electronic device.
- For example, if the voice interaction device is a car air conditioner, the second generation module 807 may generate an instruction for setting it to a certain preset temperature.
- FIG. 9 is a schematic structural diagram of a voice interaction apparatus provided by another exemplary embodiment of the present disclosure.
- the apparatus further includes: a storage module 808, configured to store the recognition data obtained by the speech recognition model in the recognition process into a cache.
- the acquisition module 801 includes: a receiving unit 8011 for receiving an initial audio signal collected by an audio collection device; and a processing unit 8012 for performing sound source separation processing on the initial audio signal to obtain the at least one audio signal.
- the recognition module 802 includes: a first determination unit 8021, configured to determine the speech recognition instances respectively corresponding to the at least one audio signal; a first execution unit 8022, configured to execute the determined speech recognition instances in parallel; and a recognition unit 8023, configured to recognize the corresponding audio signal through each speech recognition instance by using the speech recognition model.
- the parsing module 806 includes: a second determination unit 8061, configured to determine the semantic parsing instances respectively corresponding to the obtained sentence recognition results; a second execution unit 8062, configured to execute each of the determined semantic parsing instances in parallel; and a parsing unit 8063, configured to perform semantic parsing on the corresponding sentence recognition result through each semantic parsing instance.
- the recognition module 802 includes: a third determination unit 8024, configured to determine, by using the acoustic sub-model included in the speech recognition model, the syllable set corresponding to the at least one audio signal and the first probability scores corresponding to the syllables in the syllable set; a fourth determination unit 8025, configured to determine, by using the language sub-model included in the speech recognition model, the word set corresponding to the at least one audio signal; a fifth determination unit 8026, configured to determine, for a word in the word set, whether a second probability score corresponding to the word exists in the cache, and if not, to determine the word's second probability score by using the language sub-model; and a sixth determination unit 8027, configured to determine the first type of recognition result based on the first probability scores and the second probability score determined by the language sub-model.
- the determination module 803 includes: a seventh determination unit 8031, configured to determine, for a word in the word set, whether a second probability score corresponding to the word exists in the cache, and if so, to determine the second probability score in the cache as the second probability score of the word; and an eighth determination unit 8032, configured to determine the second type of recognition result based on the first probability scores and the second probability score determined from the cache.
- the processing module 805 includes: a ninth determination unit 8051, configured to determine the target path of the word set in the decoding network included in the speech recognition model, according to the first probability scores and the second probability scores respectively included in the first type of recognition result and the second type of recognition result;
- the generating unit 8052 is configured to generate sentence recognition results corresponding to at least one audio signal respectively based on the target path.
- The voice interaction apparatus recognizes at least one audio signal by using a preset speech recognition model; during recognition, stored recognition data is extracted from the cache to generate part of the recognition results, and the other part of the recognition results is generated by the speech recognition model. This effectively reuses the stored recognition data, avoids having the speech recognition model process the full amount of data, and improves the processing efficiency for the at least one audio signal, which helps meet the requirements of low resource consumption and low processing delay in multi-channel voice interaction scenarios.
- The electronic device may be either or both of the terminal device 101 and the server 103 shown in FIG. 1, or a stand-alone device independent of them; the stand-alone device may communicate with the terminal device 101 and the server 103 to receive the collected input signals from them.
- FIG. 10 illustrates a block diagram of an electronic device according to an embodiment of the present disclosure.
- the electronic device 1000 includes at least one processor 1001 and at least one memory 1002 .
- Any one of the at least one processor 1001 may be a central processing unit (CPU) or another form of processing unit with data processing capability and/or instruction execution capability, and may control other components of the electronic device 1000 to perform desired functions.
- Memory 1002 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory.
- Volatile memory may include, for example, random access memory (Random Access Memory, RAM) and/or cache memory (cache).
- the non-volatile memory may include, for example, a read-only memory (Read-Only Memory, ROM), a hard disk, a flash memory, and the like.
- One or more computer program instructions may be stored on the computer-readable storage medium, and the processor 1001 may execute the program instructions to implement the voice interaction method and/or other desired functions of the various embodiments of the present disclosure above.
- Various contents such as identification data can also be stored in the computer-readable storage medium.
- the electronic device 1000 may also include an input device 1003 and an output device 1004 interconnected by a bus system and/or other form of connection mechanism (not shown).
- the input device 1003 may be a device such as a microphone for inputting audio signals.
- the input device 1003 may be a communication network connector for receiving input audio signals from the terminal device 101 and the server 103 .
- the output device 1004 can output various information to the outside, including instructions for the voice interaction device to perform corresponding functions, and the like.
- the output devices 1004 may include, for example, displays, speakers, printers, and communication networks and their connected remote output devices, among others.
- the electronic device 1000 may also include any other appropriate components according to specific applications.
- Embodiments of the present disclosure may also be computer program products comprising computer program instructions that, when executed by a processor, cause the processor to perform the steps in the voice interaction method according to various embodiments of the present disclosure described in the "Exemplary Method" section above in this specification.
- The computer program product may include program code for performing the operations of the embodiments of the present disclosure, written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
- The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on a remote computing device or server.
- Embodiments of the present disclosure may also be computer-readable storage media having computer program instructions stored thereon that, when executed by a processor, cause the processor to perform the steps in the voice interaction method according to various embodiments of the present disclosure described in the "Exemplary Method" section above in this specification.
- the computer-readable storage medium may employ any combination of one or more readable media.
- the readable medium may be a readable signal medium or a readable storage medium.
- the readable storage medium may include, for example, but not limited to, electrical, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses or devices, or a combination of any of the above. More specific examples (non-exhaustive list) of readable storage media include: electrical connections with one or more wires, portable disks, hard disks, random access memory (RAM), read only memory (ROM), erasable programmable read only memory (EPROM or flash memory), optical fiber, portable compact disk read only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.
- the methods and apparatus of the present disclosure may be implemented in many ways.
- the methods and apparatus of the present disclosure may be implemented in software, hardware, firmware, or any combination of software, hardware, and firmware.
- the above-described order of steps for the method is for illustration only, and the steps of the method of the present disclosure are not limited to the order specifically described above unless specifically stated otherwise.
- the present disclosure can also be implemented as programs recorded in a recording medium, the programs including machine-readable instructions for implementing methods according to the present disclosure.
- the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.
- Each component or each step may be decomposed and/or recombined. These decompositions and/or recombinations should be regarded as equivalent solutions of the present disclosure.
Claims (10)
- A voice interaction method, comprising: acquiring at least one audio signal; recognizing the at least one audio signal by using a preset speech recognition model, and obtaining a first type of recognition result through the speech recognition model; determining, from a cache, stored recognition data about the at least one audio signal; generating a second type of recognition result based on the stored recognition data; processing, by using the speech recognition model, the first type of recognition result and the second type of recognition result to obtain at least one sentence recognition result corresponding to the at least one audio signal; performing semantic parsing on the sentence recognition result to obtain at least one parsing result; and generating, based on the at least one parsing result, an instruction for controlling a voice interaction device to perform a corresponding function.
- The method according to claim 1, wherein the acquiring at least one audio signal comprises: receiving an initial audio signal collected by an audio collection device; and performing sound source separation processing on the initial audio signal to obtain the at least one audio signal.
- The method according to claim 1, wherein the recognizing the at least one audio signal by using a preset speech recognition model and obtaining a first type of recognition result through the speech recognition model comprises: determining speech recognition instances respectively corresponding to the at least one audio signal; executing the determined speech recognition instances in parallel; and recognizing, through each speech recognition instance, the corresponding audio signal by using the speech recognition model.
- The method according to claim 3, wherein the performing semantic parsing on the sentence recognition result to obtain at least one parsing result comprises: determining semantic parsing instances respectively corresponding to the obtained sentence recognition results; executing the determined semantic parsing instances in parallel; and performing, through each semantic parsing instance, semantic parsing on the corresponding sentence recognition result to obtain the at least one parsing result.
- The method according to claim 1, wherein the recognizing the at least one audio signal by using a preset speech recognition model and obtaining a first type of recognition result through the speech recognition model comprises: determining, by using an acoustic sub-model included in the speech recognition model, a syllable set respectively corresponding to the at least one audio signal and first probability scores corresponding to the syllables in the syllable set; determining, by using a language sub-model included in the speech recognition model, a word set respectively corresponding to the at least one audio signal; determining, for a word in the word set, whether a second probability score corresponding to the word exists in the cache; if not, determining the second probability score corresponding to the word by using the language sub-model; and determining the first type of recognition result based on the first probability scores and the second probability score determined by the language sub-model.
- The method according to claim 5, wherein the determining, from a cache, stored recognition data about the at least one audio signal comprises: determining, for a word in the word set, whether a second probability score corresponding to the word exists in the cache; if so, determining the second probability score in the cache as the second probability score of the word; and determining the second type of recognition result based on the first probability scores and the second probability score determined from the cache.
- The method according to claim 6, wherein the processing, by using the speech recognition model, the first type of recognition result and the second type of recognition result to obtain at least one sentence recognition result corresponding to the at least one audio signal comprises: determining a target path of the word set in a decoding network included in the speech recognition model according to the first probability scores and the second probability scores respectively included in the first type of recognition result and the second type of recognition result; and generating, based on the target path, the at least one sentence recognition result corresponding to the at least one audio signal.
- A voice interaction apparatus, comprising: an acquisition module configured to acquire at least one audio signal; a recognition module configured to recognize the at least one audio signal by using a preset speech recognition model and obtain a first type of recognition result through the speech recognition model; a determination module configured to determine, from a cache, stored recognition data about the at least one audio signal; a first generation module configured to generate a second type of recognition result based on the stored recognition data; a processing module configured to process, by using the speech recognition model, the first type of recognition result and the second type of recognition result to obtain at least one sentence recognition result corresponding to the at least one audio signal; a parsing module configured to perform semantic parsing on the sentence recognition result to obtain at least one parsing result; and a second generation module configured to generate, based on the at least one parsing result, an instruction for controlling a voice interaction device to perform a corresponding function.
- A computer-readable storage medium, wherein the storage medium stores computer program instructions which, when executed, implement the method according to any one of claims 1 to 7.
- An electronic device, comprising: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to read the executable instructions from the memory and execute the instructions to implement the method according to any one of claims 1 to 7.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2022558093A JP2023520861A (ja) | 2021-03-16 | 2022-02-16 | Voice interaction method, apparatus, computer-readable storage medium, and electronic device
US18/247,441 US20240005917A1 (en) | 2021-03-16 | 2022-02-16 | Speech interaction method, and apparatus, computer readable storage medium, and electronic device
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110279812.4A CN113066489A (zh) | 2021-03-16 | 2021-03-16 | Voice interaction method and apparatus, computer-readable storage medium, and electronic device
CN202110279812.4 | 2021-03-16 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022193892A1 true WO2022193892A1 (zh) | 2022-09-22 |
Family
ID=76560535
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2022/076422 WO2022193892A1 (zh) | Voice interaction method and apparatus, computer-readable storage medium, and electronic device | 2021-03-16 | 2022-02-16 |
Country Status (4)
Country | Link |
---|---|
US (1) | US20240005917A1 (zh) |
JP (1) | JP2023520861A (zh) |
CN (1) | CN113066489A (zh) |
WO (1) | WO2022193892A1 (zh) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113066489A (zh) * | 2021-03-16 | 2021-07-02 | Shenzhen Horizon Robotics Technology Co., Ltd. | Voice interaction method and apparatus, computer-readable storage medium, and electronic device |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2000038175A1 (en) * | 1998-12-21 | 2000-06-29 | Koninklijke Philips Electronics N.V. | Language model based on the speech recognition history |
CN1455387A (zh) * | 2002-11-15 | 2003-11-12 | 中国科学院声学研究所 | 一种语音识别系统中的快速解码方法 |
US20160358606A1 (en) * | 2015-06-06 | 2016-12-08 | Apple Inc. | Multi-Microphone Speech Recognition Systems and Related Techniques |
US20170236518A1 (en) * | 2016-02-16 | 2017-08-17 | Carnegie Mellon University, A Pennsylvania Non-Profit Corporation | System and Method for Multi-User GPU-Accelerated Speech Recognition Engine for Client-Server Architectures |
EP3425628A1 (en) * | 2017-07-05 | 2019-01-09 | Panasonic Intellectual Property Management Co., Ltd. | Voice recognition method, recording medium, voice recognition device, and robot |
CN109215630A (zh) * | 2018-11-14 | 2019-01-15 | 北京羽扇智信息科技有限公司 | 实时语音识别方法、装置、设备及存储介质 |
CN109727603A (zh) * | 2018-12-03 | 2019-05-07 | 百度在线网络技术(北京)有限公司 | 语音处理方法、装置、用户设备及存储介质 |
CN112071310A (zh) * | 2019-06-11 | 2020-12-11 | 北京地平线机器人技术研发有限公司 | 语音识别方法和装置、电子设备和存储介质 |
CN113066489A (zh) * | 2021-03-16 | 2021-07-02 | 深圳地平线机器人科技有限公司 | 语音交互方法、装置、计算机可读存储介质及电子设备 |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108573706B (zh) * | 2017-03-10 | 2021-06-08 | 北京搜狗科技发展有限公司 | 一种语音识别方法、装置及设备 |
CN108510990A (zh) * | 2018-07-04 | 2018-09-07 | 百度在线网络技术(北京)有限公司 | 语音识别方法、装置、用户设备及存储介质 |
CN110534095B (zh) * | 2019-08-22 | 2020-10-23 | 百度在线网络技术(北京)有限公司 | 语音识别方法、装置、设备以及计算机可读存储介质 |
CN110415697A (zh) * | 2019-08-29 | 2019-11-05 | 的卢技术有限公司 | 一种基于深度学习的车载语音控制方法及其系统 |
CN110473531B (zh) * | 2019-09-05 | 2021-11-09 | 腾讯科技(深圳)有限公司 | 语音识别方法、装置、电子设备、系统及存储介质 |
CN110661927B (zh) * | 2019-09-18 | 2022-08-26 | 平安科技(深圳)有限公司 | 语音交互方法、装置、计算机设备及存储介质 |
-
2021
- 2021-03-16 CN CN202110279812.4A patent/CN113066489A/zh active Pending
-
2022
- 2022-02-16 JP JP2022558093A patent/JP2023520861A/ja active Pending
- 2022-02-16 WO PCT/CN2022/076422 patent/WO2022193892A1/zh active Application Filing
- 2022-02-16 US US18/247,441 patent/US20240005917A1/en active Pending
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2000038175A1 (en) * | 1998-12-21 | 2000-06-29 | Koninklijke Philips Electronics N.V. | Language model based on the speech recognition history |
CN1455387A (zh) * | 2002-11-15 | 2003-11-12 | 中国科学院声学研究所 | 一种语音识别系统中的快速解码方法 |
US20160358606A1 (en) * | 2015-06-06 | 2016-12-08 | Apple Inc. | Multi-Microphone Speech Recognition Systems and Related Techniques |
US20170236518A1 (en) * | 2016-02-16 | 2017-08-17 | Carnegie Mellon University, A Pennsylvania Non-Profit Corporation | System and Method for Multi-User GPU-Accelerated Speech Recognition Engine for Client-Server Architectures |
EP3425628A1 (en) * | 2017-07-05 | 2019-01-09 | Panasonic Intellectual Property Management Co., Ltd. | Voice recognition method, recording medium, voice recognition device, and robot |
CN109215630A (zh) * | 2018-11-14 | 2019-01-15 | 北京羽扇智信息科技有限公司 | 实时语音识别方法、装置、设备及存储介质 |
CN109727603A (zh) * | 2018-12-03 | 2019-05-07 | 百度在线网络技术(北京)有限公司 | 语音处理方法、装置、用户设备及存储介质 |
CN112071310A (zh) * | 2019-06-11 | 2020-12-11 | 北京地平线机器人技术研发有限公司 | 语音识别方法和装置、电子设备和存储介质 |
CN113066489A (zh) * | 2021-03-16 | 2021-07-02 | 深圳地平线机器人科技有限公司 | 语音交互方法、装置、计算机可读存储介质及电子设备 |
Also Published As
Publication number | Publication date |
---|---|
CN113066489A (zh) | 2021-07-02 |
JP2023520861A (ja) | 2023-05-22 |
US20240005917A1 (en) | 2024-01-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
- US11132172B1 | Low latency audio data pipeline | |
- WO2021093449A1 | Artificial intelligence-based wake-up word detection method, apparatus, device, and medium | |
- WO2022105861A1 | Method and apparatus for recognizing speech, electronic device, and medium | |
- CN114830228A | Account associated with a device | |
- WO2020238209A1 | Audio processing method and system, and related device | |
- US11687526B1 | Identifying user content | |
- US11574637B1 | Spoken language understanding models | |
- US20200219384A1 | Methods and systems for ambient system control | |
- CN114038457B | Method, electronic device, storage medium, and program for voice wake-up | |
- CN113674742B | Human-computer interaction method, apparatus, device, and storage medium | |
- CN111916053B | Speech generation method, apparatus, device, and computer-readable medium | |
- US11532301B1 | Natural language processing | |
- WO2021098318A1 | Response method, terminal, and storage medium | |
- US20240013784A1 | Speaker recognition adaptation | |
- WO2022193892A1 | Voice interaction method and apparatus, computer-readable storage medium, and electronic device | |
- CN113889091A | Speech recognition method and apparatus, computer-readable storage medium, and electronic device | |
- US11626107B1 | Natural language processing | |
- CN116075888A | System and method for reducing latency in cloud services | |
- CN113611316A | Human-computer interaction method, apparatus, device, and storage medium | |
- CN115699170A | Text echo cancellation | |
- CN109887490A | Method and apparatus for recognizing speech | |
- US11830476B1 | Learned condition text-to-speech synthesis | |
- CN115132195A | Voice wake-up method, apparatus, device, storage medium, and program product | |
- CN112487180B | Text classification method and apparatus, computer-readable storage medium, and electronic device | |
- WO2021051565A1 | Machine learning-based semantic parsing method and apparatus, electronic device, and computer non-volatile readable storage medium | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
ENP | Entry into the national phase |
Ref document number: 2022558093 Country of ref document: JP Kind code of ref document: A |
|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 22770247 Country of ref document: EP Kind code of ref document: A1 |
|
WWE | Wipo information: entry into national phase |
Ref document number: 18247441 Country of ref document: US |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 25.01.2024) |