WO2022206602A1 - Voice wake-up method, device, storage medium and system - Google Patents

Voice wake-up method, device, storage medium and system

Info

Publication number
WO2022206602A1
Authority
WO
WIPO (PCT)
Prior art keywords
wake
data
separation
level
module
Application number
PCT/CN2022/083055
Other languages
English (en)
French (fr)
Inventor
肖龙帅
甄一楠
李文洁
彭超
杨占磊
Original Assignee
华为技术有限公司
Application filed by 华为技术有限公司
Priority to EP22778784.3A, published as EP4310838A1
Publication of WO2022206602A1
Priority to US18/474,968, published as US20240029736A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/28 Constructional details of speech recognition systems
    • G10L15/32 Multiple recognisers used in sequence or in parallel; Score combination systems therefor, e.g. voting systems
    • G10L2015/025 Phonemes, fenemes or fenones being the recognition units
    • G10L2015/088 Word spotting
    • G10L2015/223 Execution procedure of a spoken command

Definitions

  • The present application relates to the technical field of terminals, and in particular to a voice wake-up method, device, storage medium and system.
  • As the beginning of voice interaction, voice wake-up is widely used in different electronic devices, such as smart speakers and smart TVs.
  • When there is an electronic device supporting voice wake-up in the space where the user is located, after the user utters the wake-up voice, the woken-up electronic device responds to the speaker's request and interacts with the user.
  • In the related art, multi-condition training can be performed on the wake-up module in the electronic device and the trained wake-up module used for voice wake-up; alternatively, microphone array processing technology can be used for voice wake-up; or voice wake-up can be performed using traditional sound source separation techniques.
  • Accordingly, a voice wake-up method, device, storage medium and system are proposed.
  • The embodiments of the present application design a two-level separation and wake-up scheme: pre-wake judgment is performed through the first-level separation and wake-up scheme in the first-level scenario, and wake-up confirmation is performed again in the second-level scenario after the pre-wake succeeds, which ensures a high wake-up rate while reducing the false wake-up rate, so as to obtain a better voice wake-up effect.
  • In a first aspect, an embodiment of the present application provides a voice wake-up method, the method including:
  • acquiring original first microphone data;
  • performing first-level processing according to the first microphone data to obtain first wake-up data, the first-level processing including first-level separation processing and first-level wake-up processing based on a neural network model;
  • when the first wake-up data indicates that the pre-wake is successful, performing second-level processing according to the first microphone data to obtain second wake-up data, the second-level processing including second-level separation processing and second-level wake-up processing based on a neural network model;
  • determining a wake-up result according to the second wake-up data.
  • In the embodiments of the present application, a two-level separation and wake-up scheme is designed. In the first-level scenario, first-level separation processing and first-level wake-up processing are performed on the original first microphone data to obtain the first wake-up data. This first-level separation and wake-up scheme keeps the wake-up rate as high as possible, but it also brings a higher false wake-up rate. Therefore, when the first wake-up data indicates that the pre-wake is successful, second-level separation processing and second-level wake-up processing are performed on the first microphone data in the second-level scenario, that is, wake-up confirmation is performed on the first microphone data again. This achieves better separation performance and ensures a high wake-up rate while reducing the false wake-up rate, so as to obtain a better voice wake-up effect.
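  • To make the control flow of the two-level scheme concrete, the following Python sketch strings the stages together; the function signatures and the two threshold values are illustrative assumptions rather than interfaces defined by this application.

```python
from typing import Any, Callable

def two_stage_wakeup(
    mic_data: Any,
    preprocess: Callable[[Any], Any],        # raw microphone data -> multi-channel feature data
    first_separation: Callable[..., Any],    # features -> first separation data
    first_wakeup: Callable[..., float],      # (features, first_sep) -> confidence in [0, 1]
    second_separation: Callable[..., Any],   # (features, first_sep) -> second separation data
    second_wakeup: Callable[..., float],     # (features, first_sep, second_sep) -> confidence
    pre_wake_threshold: float = 0.5,         # hypothetical first-level threshold
    wake_threshold: float = 0.8,             # hypothetical second-level threshold
) -> bool:
    """Return True if the device should finally wake up."""
    features = preprocess(mic_data)

    # First-level (streaming) scenario: pre-wake judgment with a permissive
    # threshold, keeping the wake-up rate as high as possible.
    first_sep = first_separation(features)
    if first_wakeup(features, first_sep) < pre_wake_threshold:
        return False  # pre-wake failed; the second level is never run

    # Second-level (offline) scenario: wake-up confirmation on the same first
    # microphone data, reducing false wake-ups.
    second_sep = second_separation(features, first_sep)
    return second_wakeup(features, first_sep, second_sep) >= wake_threshold
```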
  • Performing the first-level processing according to the first microphone data to obtain the first wake-up data includes:
  • preprocessing the first microphone data to obtain multi-channel feature data, and calling, according to the multi-channel feature data, the pre-trained first-level separation module to output the first separation data, the first-level separation module being used to perform the first-level separation processing;
  • calling, according to the multi-channel feature data and the first separation data, the pre-trained first-level wake-up module to output the first wake-up data, the first-level wake-up module being used to perform the first-level wake-up processing.
  • In this way, the first microphone data is preprocessed to obtain multi-channel feature data, the first-level separation module is called according to the multi-channel feature data to output the first separation data, and the first-level wake-up module is then called according to the multi-channel feature data and the first separation data to output the first wake-up data. This implements the first-level separation processing and first-level wake-up processing of the first microphone data in the first-level scenario, so as to ensure that the pre-wake wake-up rate is as high as possible.
  • When the first wake-up data indicates that the pre-wake is successful, performing the second-level processing according to the first microphone data to obtain the second wake-up data includes:
  • calling, according to the multi-channel feature data and the first separation data, the pre-trained second-level separation module to output the second separation data, the second-level separation module being used to perform the second-level separation processing;
  • calling, according to the multi-channel feature data, the first separation data and the second separation data, the pre-trained second-level wake-up module to output the second wake-up data, the second-level wake-up module being used to perform the second-level wake-up processing.
  • In this way, when the first wake-up data indicates that the pre-wake is successful, the second-level separation module is called according to the multi-channel feature data and the first separation data to output the second separation data, and the second-level wake-up module is then called to perform the second-level wake-up processing, that is, wake-up confirmation is performed on the first microphone data again. This ensures a high wake-up rate while reducing the false wake-up rate, further improving the voice wake-up effect.
  • The first-level separation processing is streaming sound source separation processing, and the first-level wake-up processing is streaming sound source wake-up processing; and/or,
  • the second-level separation processing is offline sound source separation processing, and the second-level wake-up processing is offline sound source wake-up processing;
  • the first-level scenario is a first-level streaming scenario, and the second-level scenario is a second-level offline scenario. Since the first-level separation and wake-up scheme is designed in a streaming manner, some separation performance is generally sacrificed to keep the wake-up rate as high as possible, which also brings a higher false wake-up rate. Therefore, when the first wake-up data indicates that the pre-wake is successful, second-level offline separation processing and second-level wake-up processing are performed on the first microphone data in the second-level offline scenario, which achieves better separation performance and ensures a high wake-up rate while reducing the false wake-up rate, further improving the voice wake-up effect.
  • the first-level wake-up module includes a wake-up model in the form of multiple-input single-output or multiple-input multiple-output; and/or,
  • the second-level wake-up module includes a wake-up model in the form of multiple-input single-output or multiple-input-multiple-output.
  • In this way, the first-level wake-up module and/or the second-level wake-up module are multi-input wake-up modules. Compared with the single-input wake-up module in the related art, this not only saves computation, but also avoids the significant increase and waste of computation caused by repeatedly calling the wake-up model; moreover, the wake-up performance is greatly improved because the correlation among the input parameters is better utilized.
  • The first-level separation module and/or the second-level separation module adopts a dual-path conformer (dpconformer) network structure.
  • In this way, a dual-path conformer network structure is provided. The dual-path design avoids the increased computation caused by using the conformer directly, and, owing to the strong modeling capability of the conformer network, the separation effect of the separation module (that is, the first-level separation module and/or the second-level separation module) is significantly improved.
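  • As a rough illustration of the dual-path idea (not the exact dpconformer structure claimed here), the following PyTorch sketch splits a long sequence into chunks and processes it alternately with an intra-chunk block and an inter-chunk block; the concrete conformer blocks are left abstract and are assumed to preserve the (batch, sequence, feature) shape.

```python
import torch
from torch import nn

class DualPathWrapper(nn.Module):
    """Dual-path processing: local modeling within chunks, global modeling across chunks."""

    def __init__(self, intra_block: nn.Module, inter_block: nn.Module, chunk_size: int):
        super().__init__()
        self.intra_block = intra_block   # e.g. a conformer block applied inside each chunk
        self.inter_block = inter_block   # e.g. a conformer block applied across chunks
        self.chunk_size = chunk_size

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, feature); pad time so it divides evenly into chunks.
        b, t, f = x.shape
        pad = (-t) % self.chunk_size
        x = nn.functional.pad(x, (0, 0, 0, pad))
        n_chunks = x.shape[1] // self.chunk_size

        # Intra-chunk path: treat every chunk as a short sequence.
        x = x.reshape(b * n_chunks, self.chunk_size, f)
        x = self.intra_block(x)

        # Inter-chunk path: treat corresponding frames of all chunks as a sequence.
        x = x.reshape(b, n_chunks, self.chunk_size, f).transpose(1, 2)
        x = x.reshape(b * self.chunk_size, n_chunks, f)
        x = self.inter_block(x)

        # Restore the (batch, time, feature) layout and drop the padding.
        x = x.reshape(b, self.chunk_size, n_chunks, f).transpose(1, 2)
        return x.reshape(b, n_chunks * self.chunk_size, f)[:, :t]
```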
  • the first-stage separation module and/or the second-stage separation module is a separation module for performing at least one task, the at least one task including a sound source separation task alone, or the sound source separation task and other tasks;
  • the other tasks include at least one of a sound source localization task, a specific person extraction task, a specific direction extraction task, and a specific person confirmation task.
  • In this way, a multi-task design solution combining the sound source separation task with other tasks is provided. For example, the other tasks include at least one of a sound source localization task, a specific-person extraction task, a specific-direction extraction task, and a specific-person confirmation task, so the sound source separation result can be associated with other information and provided to the downstream task or the next-level wake-up module, which improves the output effect of the separation module (that is, the first-level separation module and/or the second-level separation module).
  • The first-level wake-up module and/or the second-level wake-up module is a wake-up module for performing at least one task, and the at least one task includes the wake-up task alone, or includes the wake-up task and other tasks;
  • the other tasks include at least one of a sound source localization task, a specific person extraction task, a specific direction extraction task, and a specific person confirmation task.
  • In this way, a multi-task design solution combining the sound source wake-up task with other tasks is provided. For example, the other tasks include at least one of a sound source localization task, a specific-person extraction task, a specific-direction extraction task, and a specific-person confirmation task, so the sound source wake-up result can be associated with other information and provided to downstream tasks, which improves the output effect of the wake-up module (that is, the first-level wake-up module and/or the second-level wake-up module).
  • In one example, the other task is the sound source localization task. In this case, the wake-up module can provide more accurate orientation information together with the sound source wake-up result; compared with the direct spatial multi-fixed-beam solution in the related art, a more accurate orientation estimation effect is guaranteed.
  • The first-level separation module includes a first-level multi-feature fusion model and a first-level separation model; the calling, according to the multi-channel feature data, of the pre-trained first-level separation module to output the first separation data includes:
  • inputting the multi-channel feature data into the first-level multi-feature fusion model to output first single-channel feature data, and inputting the first single-channel feature data into the first-level separation model to output the first separation data.
  • In this way, a fusion mechanism for multi-channel feature data is provided, which avoids the manual selection of feature data in the related art; the first-level multi-feature fusion model automatically learns the relationships between feature channels and the contribution of each feature to the final separation effect, further ensuring the separation effect of the first-level separation model.
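  • The fusion idea can be illustrated with the simplified PyTorch sketch below, which learns a weight for every feature channel instead of selecting features by hand; the actual multi-feature fusion model is described later as a conformer feature fusion model, so the per-channel attention used here is only an assumed stand-in.

```python
import torch
from torch import nn

class MultiFeatureFusion(nn.Module):
    """Fuse several feature channels into one by learned, data-dependent weights."""

    def __init__(self, feature_dim: int):
        super().__init__()
        self.score = nn.Linear(feature_dim, 1)           # one relevance score per channel and frame
        self.proj = nn.Linear(feature_dim, feature_dim)  # projection of the fused feature

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, channels, time, feature_dim) -> fused: (batch, time, feature_dim)
        weights = torch.softmax(self.score(feats), dim=1)  # learned contribution of each channel
        fused = (weights * feats).sum(dim=1)               # weighted sum over feature channels
        return self.proj(fused)
```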
  • The second-level separation module includes a second-level multi-feature fusion model and a second-level separation model; the calling, according to the multi-channel feature data and the first separation data, of the pre-trained second-level separation module to output the second separation data includes:
  • inputting the multi-channel feature data and the first separation data into the second-level multi-feature fusion model to output second single-channel feature data, and inputting the second single-channel feature data into the second-level separation model to output the second separation data.
  • In this way, a fusion mechanism for multi-channel feature data is provided, which avoids the manual selection of feature data in the related art; the second-level multi-feature fusion model automatically learns the relationships between feature channels and the contribution of each feature to the final separation effect, further ensuring the separation effect of the second-level separation model.
  • The first-level wake-up module includes a first wake-up model in the form of multiple input and single output; the calling, according to the multi-channel feature data and the first separation data, of the pre-trained first-level wake-up module to output the first wake-up data includes:
  • inputting the multi-channel feature data and the first separation data into the first wake-up model to output the first wake-up data, where the first wake-up data includes a first confidence level, and the first confidence level is used to indicate the probability that the original first microphone data includes a preset wake-up word.
  • In this way, a first wake-up model in the form of multiple input and single output is provided. Since the first wake-up model takes multiple inputs, the significant increase and waste of computation caused by repeatedly calling the wake-up model in the related art is avoided, computing resources are saved, and the processing efficiency of the first wake-up model is improved; moreover, because the correlation among the input parameters is better utilized, the wake-up performance of the first wake-up model is greatly improved.
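  • A minimal sketch of a multiple-input single-output wake-up model is given below; the GRU encoder, the mean pooling over time and the sigmoid output are assumptions made for illustration, not the claimed model structure.

```python
import torch
from torch import nn

class MISOWakeupModel(nn.Module):
    """Take the multi-channel feature data and the first separation data jointly
    and emit a single wake-word confidence in [0, 1]."""

    def __init__(self, input_dim: int, hidden_dim: int = 128):
        # input_dim: size of the concatenated per-frame input (features + separation data)
        super().__init__()
        self.encoder = nn.GRU(input_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, features: torch.Tensor, separation: torch.Tensor) -> torch.Tensor:
        # features, separation: (batch, time, dim); concatenate all inputs per frame.
        x = torch.cat([features, separation], dim=-1)
        h, _ = self.encoder(x)
        # Pool over time and map to the first confidence level (wake-word probability).
        return torch.sigmoid(self.head(h.mean(dim=1))).squeeze(-1)
```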
  • The first-level wake-up module includes a first wake-up model in the form of multiple input and multiple output and a first post-processing module; the calling, according to the multi-channel feature data and the first separation data, of the pre-trained first-level wake-up module to output the first wake-up data includes:
  • inputting the multi-channel feature data and the first separation data into the first wake-up model to output phoneme sequence information corresponding to each of a plurality of sound source data, and inputting the phoneme sequence information corresponding to each of the plurality of sound source data into the first post-processing module to output the first wake-up data, where the first wake-up data includes a second confidence level corresponding to each of the plurality of sound source data.
  • In this way, a first wake-up model in the form of multiple input and multiple output is provided. On the one hand, since the first wake-up model takes multiple inputs, the significant increase and waste of computation caused by repeatedly calling the wake-up model in the related art is avoided, computing resources are saved, and the processing efficiency of the first wake-up model is improved; on the other hand, since the first wake-up model has multiple outputs, the phoneme sequence information corresponding to each of multiple sound source data can be output simultaneously, which avoids the low wake-up rate caused by the mutual influence of different sound source data and further ensures the subsequent wake-up rate.
  • The second-level wake-up module includes a second wake-up model in the form of multiple input and single output; the calling, according to the multi-channel feature data, the first separation data and the second separation data, of the pre-trained second-level wake-up module to output the second wake-up data includes:
  • inputting the multi-channel feature data, the first separation data and the second separation data into the second wake-up model to output the second wake-up data, where the second wake-up data includes a third confidence level, and the third confidence level is used to indicate the probability that the original first microphone data includes a preset wake-up word.
  • In this way, a second wake-up model in the form of multiple input and single output is provided. Since the second wake-up model takes multiple inputs, the significant increase and waste of computation caused by repeatedly calling the wake-up model in the related art is avoided, computing resources are saved, and the processing efficiency of the second wake-up model is improved; moreover, because the correlation among the input parameters is better utilized, the wake-up performance of the second wake-up model is greatly improved.
  • The second-level wake-up module includes a second wake-up model in the form of multiple input and multiple output and a second post-processing module; the calling, according to the multi-channel feature data, the first separation data and the second separation data, of the pre-trained second-level wake-up module to output the second wake-up data includes:
  • inputting the multi-channel feature data, the first separation data and the second separation data into the second wake-up model to output phoneme sequence information corresponding to each of a plurality of sound source data;
  • inputting the phoneme sequence information corresponding to each of the plurality of sound source data into the second post-processing module to output the second wake-up data, where the second wake-up data includes confidence levels respectively corresponding to the plurality of sound source data.
  • In this way, a second wake-up model in the form of multiple input and multiple output is provided. On the one hand, since the second wake-up model takes multiple inputs, the significant increase and waste of computation caused by repeatedly calling the wake-up model in the related art is avoided, computing resources are saved, and the processing efficiency of the second wake-up model is improved; on the other hand, since the second wake-up model has multiple outputs, the phoneme sequence information corresponding to each of multiple sound source data can be output simultaneously, which avoids the low wake-up rate caused by the mutual influence of different sound source data and further ensures the subsequent wake-up rate.
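  • The post-processing step of a multiple-input multiple-output wake-up model can be illustrated as follows: for each separated sound source, the phoneme posterior sequence is scored against the phoneme sequence of the preset wake-up word. The monotonic best-path scoring below is a simplified assumption; a practical system would decode on a decoding graph as described later in this description.

```python
import numpy as np

def wakeword_confidence(phoneme_posteriors: np.ndarray, wakeword_phonemes: list[int]) -> float:
    """Score one sound source: phoneme_posteriors has shape (frames, phoneme classes)."""
    num_frames = phoneme_posteriors.shape[0]
    num_targets = len(wakeword_phonemes)
    # dp[i][t]: best log-score of matching the first i+1 wake-word phonemes,
    # with the i-th match placed at or before frame t (monotonic in time).
    dp = np.full((num_targets, num_frames), -np.inf)
    dp[0] = np.maximum.accumulate(np.log(phoneme_posteriors[:, wakeword_phonemes[0]] + 1e-10))
    for i in range(1, num_targets):
        emit = np.log(phoneme_posteriors[:, wakeword_phonemes[i]] + 1e-10)
        prev = np.concatenate(([-np.inf], dp[i - 1, :-1]))  # previous phoneme must end earlier
        dp[i] = np.maximum.accumulate(prev + emit)
    # Confidence: per-phoneme average probability along the best monotonic path.
    return float(np.exp(dp[-1, -1] / num_targets))

def select_wake_source(per_source_posteriors: list, wakeword_phonemes: list[int]):
    """Score every separated sound source and return (best source index, best confidence)."""
    scores = [wakeword_confidence(p, wakeword_phonemes) for p in per_source_posteriors]
    best = int(np.argmax(scores))
    return best, scores[best]
```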
  • an embodiment of the present application provides a voice wake-up device, the device includes: an acquisition module, a first-level processing module, a second-level processing module, and a determination module;
  • the acquisition module is used to acquire the original first microphone data
  • the first-level processing module is configured to perform first-level processing according to the first microphone data to obtain first wake-up data, where the first-level processing includes first-level separation processing and first-level wake-up processing based on a neural network model;
  • the second-level processing module is configured to perform second-level processing according to the first microphone data to obtain second wake-up data when the first wake-up data indicates that the pre-wake is successful, where the second-level processing includes second-level separation processing and second-level wake-up processing based on a neural network model;
  • the determining module is configured to determine a wake-up result according to the second wake-up data.
  • the apparatus further includes a preprocessing module, and the first-level processing module further includes a first-level separation module and a first-level wake-up module;
  • the preprocessing module is configured to preprocess the first microphone data to obtain the multi-channel feature data;
  • the first-stage separation module is configured to perform the first-stage separation processing according to the multi-channel feature data, and output the first separation data;
  • the first-level wake-up module is configured to perform the first-level wake-up processing according to the multi-channel characteristic data and the first separation data, and output the first wake-up data.
  • the second-level processing module further includes a second-level separation module and a second-level wake-up module;
  • the second-level separation module is configured to perform the second-level separation processing according to the multi-channel feature data and the first separation data when the first wake-up data indicates that the pre-wake is successful, and output the second separation data;
  • the second-level wake-up module is configured to perform the second-level wake-up processing according to the multi-channel feature data, the first separation data and the second separation data, and output the second wake-up data.
  • the first-stage separation processing is a streaming sound source separation processing
  • the first-stage wake-up processing is a streaming sound source wake-up processing
  • the second-stage separation processing is offline sound source separation processing
  • the second-stage wake-up processing is offline sound source wake-up processing
  • the first-level wake-up module includes a wake-up model in the form of multiple-input single-output or multiple-input multiple-output; and/or,
  • the second-level wake-up module includes a wake-up model in the form of multiple-input single-output or multiple-input-multiple-output.
  • The first-level separation module and/or the second-level separation module adopts a dual-path conformer network structure.
  • The first-level separation module and/or the second-level separation module is a separation module for performing at least one task, the at least one task including the sound source separation task alone, or the sound source separation task and other tasks;
  • the other tasks include at least one of a sound source localization task, a specific person extraction task, a specific direction extraction task, and a specific person confirmation task.
  • The first-level wake-up module and/or the second-level wake-up module is a wake-up module for performing at least one task, and the at least one task includes the wake-up task alone, or includes the wake-up task and other tasks;
  • the other tasks include at least one of a sound source localization task, a specific person extraction task, a specific direction extraction task, and a specific person confirmation task.
  • The first-level separation module includes a first-level multi-feature fusion model and a first-level separation model; the first-level separation module is also used for:
  • inputting the multi-channel feature data into the first-level multi-feature fusion model to output the first single-channel feature data, and inputting the first single-channel feature data into the first-level separation model to output the first separation data.
  • The second-level separation module includes a second-level multi-feature fusion model and a second-level separation model; the second-level separation module is also used for: inputting the multi-channel feature data and the first separation data into the second-level multi-feature fusion model to output the second single-channel feature data, and inputting the second single-channel feature data into the second-level separation model to output the second separation data.
  • The first-level wake-up module includes a first wake-up model in the form of multiple input and single output, and the first-level wake-up module is also used for:
  • inputting the multi-channel feature data and the first separation data into the first wake-up model to output the first wake-up data, where the first wake-up data includes a first confidence level, and the first confidence level is used to indicate the probability that the original first microphone data includes a preset wake-up word.
  • The first-level wake-up module includes a first wake-up model in the form of multiple input and multiple output and a first post-processing module, and the first-level wake-up module is also used for: inputting the multi-channel feature data and the first separation data into the first wake-up model to output phoneme sequence information corresponding to each of a plurality of sound source data, and inputting the phoneme sequence information corresponding to each of the plurality of sound source data into the first post-processing module to output the first wake-up data, where the first wake-up data includes a second confidence level corresponding to each of the plurality of sound source data.
  • The second-level wake-up module includes a second wake-up model in the form of multiple input and single output, and the second-level wake-up module is also used for:
  • inputting the multi-channel feature data, the first separation data and the second separation data into the second wake-up model to output the second wake-up data, where the second wake-up data includes a third confidence level, and the third confidence level is used to indicate the probability that the original first microphone data includes a preset wake-up word.
  • The second-level wake-up module includes a second wake-up model in the form of multiple input and multiple output and a second post-processing module, and the second-level wake-up module is also used for:
  • inputting the multi-channel feature data, the first separation data and the second separation data into the second wake-up model to output phoneme sequence information corresponding to each of a plurality of sound source data, and inputting the phoneme sequence information corresponding to each of the plurality of sound source data into the second post-processing module to output the second wake-up data, where the second wake-up data includes confidence levels respectively corresponding to the plurality of sound source data.
  • An embodiment of the present application provides an electronic device, and the electronic device includes: a processor; and a memory for storing processor-executable instructions; wherein the processor is configured to implement, when executing the instructions, the voice wake-up method provided by the first aspect or any possible implementation of the first aspect.
  • Embodiments of the present application provide a non-volatile computer-readable storage medium on which computer program instructions are stored; when the computer program instructions are executed by a processor, the voice wake-up method provided by the first aspect or any possible implementation of the first aspect is implemented.
  • Embodiments of the present application provide a computer program product comprising computer-readable code, or a non-volatile computer-readable storage medium carrying computer-readable code; when the computer-readable code runs in an electronic device, the processor in the electronic device executes the voice wake-up method provided by the first aspect or any possible implementation of the first aspect.
  • the embodiments of the present application provide a voice wake-up system, where the voice wake-up system is configured to execute the voice wake-up method provided by the first aspect or any one of the possible implementations of the first aspect.
  • FIG. 1 is a schematic diagram showing the correlation between the wake-up rate of an electronic device and the sound source distance in the related art.
  • FIG. 2 shows a schematic structural diagram of an electronic device provided by an exemplary embodiment of the present application.
  • FIG. 3 shows a flowchart of a voice wake-up method provided by an exemplary embodiment of the present application.
  • FIG. 4 shows a schematic diagram of the principle of a voice wake-up method provided by an exemplary embodiment of the present application.
  • FIG. 5 shows a schematic structural diagram of a dpconformer network provided by an exemplary embodiment of the present application.
  • FIG. 6 shows a schematic diagram of the principle of a two-stage separation solution provided by an exemplary embodiment of the present application.
  • FIG. 7 to FIG. 14 show schematic diagrams of several possible implementations of the first-stage separation solution provided by exemplary embodiments of the present application.
  • FIG. 15 shows a schematic diagram of the principle of a two-stage wake-up solution provided by an exemplary embodiment of the present application.
  • FIG. 16 to FIG. 19 show schematic diagrams of several possible implementations of the first-level wake-up solution provided by exemplary embodiments of the present application.
  • FIG. 20 to FIG. 23 are schematic diagrams showing the principle of a voice wake-up method in a single-microphone scenario provided by an exemplary embodiment of the present application.
  • FIG. 24 to FIG. 28 show schematic diagrams of the voice wake-up method in a multi-microphone scenario provided by an exemplary embodiment of the present application.
  • FIG. 29 shows a flowchart of a voice wake-up method provided by another exemplary embodiment of the present application.
  • FIG. 30 shows a block diagram of a voice wake-up device provided by an exemplary embodiment of the present application.
  • Voice interaction technology is a relatively important technology in electronic devices, including smart phones, speakers, TVs, robots, tablet devices, in-vehicle devices and other devices.
  • The voice wake-up function is one of the key functions of voice interaction technology. Through a specific wake-up word or command word (such as "Xiaoyi Xiaoyi"), an electronic device in a non-voice-interaction state (such as a sleep state or another state) is activated, enabling voice functions such as voice recognition, voice search, dialogue and voice navigation. This not only keeps voice interaction available at any time, but also avoids the power consumption caused by keeping the electronic device in the voice interaction state for a long time and the risk of user privacy data being monitored.
  • The voice wake-up function is expected to meet users' needs, that is, to achieve a wake-up rate of more than 95%.
  • However, the acoustic environment of actual use scenarios is often complex. When the user is far away from the electronic device to be woken up (for example, 3-5 meters) and there is background noise (such as TV sound, other speech, background music, reverberation, and echo), the wake-up rate drops sharply.
  • As shown in FIG. 1, the wake-up rate of the electronic device decreases as the sound source distance increases, where the sound source distance is the distance between the user and the electronic device.
  • For example, the wake-up rate is 80% when the sound source distance is 0.5 meters, 65% at 1 meter, 30% at 3 meters, and 10% at 5 meters. Such a low wake-up rate results in a poor voice wake-up effect for the electronic device.
  • In addition, in the presence of background noise, recognition of the human voice is relatively poor. Especially in the case of multi-sound-source interference (such as interference from other speakers, background music interference, or residual echo interference in echo scenarios), strong sound source interference, or far-field echo scenarios, the wake-up rate is even lower and more false wake-ups occur.
  • In the embodiments of the present application, a two-level separation and wake-up scheme is designed: the first-level separation and wake-up scheme performs the pre-wake judgment in the first-level streaming scenario, which keeps the wake-up rate as high as possible but leads to more false wake-ups; therefore, after the pre-wake succeeds, offline wake-up confirmation is performed in the second-level offline scenario, ensuring a high wake-up rate while reducing the false wake-up rate, so as to obtain a better voice wake-up effect.
  • Offline sound source wake-up processing: refers to performing sound source wake-up processing on audio content after the complete audio content has been obtained.
  • the offline sound source wake-up processing includes offline separation processing and offline wake-up processing.
  • Streaming sound source wake-up processing (also called online sound source wake-up processing): refers to acquiring an audio segment in real time or at preset time intervals and performing sound source wake-up processing on the audio segment.
  • the streaming sound source wake-up processing includes streaming separation processing and streaming wake-up processing.
  • An audio segment is a number of consecutive samples collected in real time or at every preset time interval; for example, the preset time interval is 16 milliseconds, which is not limited in this embodiment of the present application.
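  • As a minimal sketch of acquiring such audio segments for streaming processing (assuming a 16 kHz sample rate together with the 16 millisecond interval mentioned above):

```python
def stream_frames(samples, sample_rate_hz: int = 16000, frame_ms: int = 16):
    """Yield consecutive audio segments for streaming (online) sound source processing."""
    frame_len = sample_rate_hz * frame_ms // 1000  # e.g. 256 samples per 16 ms segment
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        yield samples[start:start + frame_len]
```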
  • Multi-sound-source separation technology: refers to the technology of separating a received single-microphone or multi-microphone voice signal into multiple sound source data.
  • the plurality of sound source data includes sound source data of the target object and sound source data of the interference sound source.
  • the multi-sound source separation technology is used to separate the sound source data of the target object from the sound source data of the interfering sound source, so as to make a better wake-up judgment.
  • Wake-up technology, also known as keyword spotting (KWS): is used to determine whether the sound source data under test contains a preset wake-up word.
  • the wake-up word can be set by default or set by the user.
  • the default setting of the fixed wake-up word is "Xiaoyi Xiaoyi", which cannot be changed by the user, and the wake-up scheme design often relies on specific training sample data.
  • Alternatively, a user may manually set a personalized wake-up word; no matter what personalized wake-up word the user sets, a high wake-up rate is expected, without frequent model self-learning on the electronic device side.
  • The modeling methods of wake-up technology include, but are not limited to, the following two possible implementations: the first is to model the wake-up module on whole words, for example with the fixed wake-up word as the output target of the wake-up module; the second is to model the wake-up module on phonemes, as in general speech recognition, for phoneme recognition, for example supporting a fixed wake-up word or a user-defined wake-up word by automatically constructing the corresponding personalized decoding graph, and finally determining the user's wake-up intention from the output of the wake-up module followed by decoding on the graph.
  • When the fixed wake-up word modeling solution is adopted, in a multi-sound-source interference scenario the wake-up module is expected to produce a single channel of output data, which indicates whether to wake up, that is, whether the fixed wake-up word is present.
  • When the user-defined wake-up word (phoneme modeling) solution is adopted, the wake-up module outputs for the multiple sound source data are each meaningful and need to be decoded separately on the decoding graph, so as to finally determine whether the custom wake-up word is present.
  • That is, for the whole-word modeling scheme, the wake-up module is a multi-input single-output model; for the phoneme modeling scheme, the wake-up module is a multi-input multi-output model.
  • the multiple output data respectively correspond to the phoneme posterior probability sequences of multiple sound source data.
  • FIG. 2 shows a schematic structural diagram of an electronic device provided by an exemplary embodiment of the present application.
  • the electronic device may be a terminal, and the terminal includes a mobile terminal or a fixed terminal.
  • the electronic device may be a mobile phone, a speaker, a television, a robot, a tablet device, a car device, a headset, smart glasses, a smart watch, a laptop computer, a desktop computer, and the like.
  • Alternatively, the electronic device may be a server, where the server can be a single server, a server cluster composed of several servers, or a cloud computing service center.
  • the electronic device 200 may include one or more of the following components: a processing component 202, a memory 204, a power supply component 206, a multimedia component 208, an audio component 210, an input/output (I/O) interface 212, a sensor component 214 , and the communication component 216 .
  • the processing component 202 generally controls the overall operation of the electronic device 200, such as operations associated with display, phone calls, data communications, camera operations, and recording operations.
  • the processing component 202 may include one or more processors 220 to execute instructions to complete all or part of the steps of the voice wake-up method provided by the embodiments of the present application.
  • processing component 202 may include one or more modules that facilitate interaction between processing component 202 and other components.
  • processing component 202 may include a multimedia module to facilitate interaction between multimedia component 208 and processing component 202.
  • the memory 204 is configured to store various types of data to support operation at the electronic device 200 . Examples of such data include instructions for any application or method operating on the electronic device 200, contact data, phonebook data, messages, pictures, multimedia content, and the like.
  • Memory 204 may be implemented by any type of volatile or non-volatile storage device or combination thereof, such as static random access memory (SRAM), electrically erasable programmable read only memory (EEPROM), erasable Programmable Read Only Memory (EPROM), Programmable Read Only Memory (PROM), Read Only Memory (ROM), Magnetic Memory, Flash Memory, Magnetic or Optical Disk.
  • Power supply assembly 206 provides power to various components of electronic device 200 .
  • Power supply components 206 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power to electronic device 200 .
  • the multimedia component 208 includes a screen that provides an output interface between the electronic device 200 and the user.
  • the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user.
  • the touch panel includes one or more touch sensors to sense touch, swipe, and gestures on the touch panel. The touch sensor may not only sense the boundaries of a touch or swipe action, but also detect the duration and pressure associated with the touch or swipe action.
  • the multimedia component 208 includes a front-facing camera and/or a rear-facing camera.
  • the front camera and/or the rear camera may receive external multimedia data.
  • Each of the front and rear cameras can be a fixed optical lens system or have focal length and optical zoom capability.
  • the electronic device 200 collects video information through a camera (a front camera and/or a rear camera).
  • Audio component 210 is configured to output and/or input audio signals.
  • audio component 210 includes a microphone (MIC) that is configured to receive external audio signals when electronic device 200 is in operating modes, such as call mode, recording mode, and voice recognition mode.
  • the received audio signal may be further stored in memory 204 or transmitted via communication component 216 .
  • the electronic device 200 collects raw first microphone data through a microphone.
  • the audio component 210 also includes a speaker for outputting audio signals.
  • the I/O interface 212 provides an interface between the processing component 202 and a peripheral interface module, which may be a keyboard, a click wheel, a button, or the like. These buttons may include, but are not limited to: home button, volume buttons, start button, and lock button.
  • Sensor assembly 214 includes one or more sensors for providing status assessment of various aspects of electronic device 200 .
  • For example, the sensor assembly 214 can detect the open/closed state of the electronic device 200 and the relative positioning of components, such as the display and keypad of the electronic device 200; the sensor assembly 214 can also detect a change in position of the electronic device 200 or of a component of the electronic device 200, the presence or absence of user contact with the electronic device 200, the orientation or acceleration/deceleration of the electronic device 200, and a change in the temperature of the electronic device 200.
  • Sensor assembly 214 may include a proximity sensor configured to detect the presence of nearby objects in the absence of any physical contact.
  • Sensor assembly 214 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications.
  • the sensor assembly 214 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
  • Communication component 216 is configured to facilitate wired or wireless communication between electronic device 200 and other devices.
  • Electronic device 200 may access wireless networks based on communication standards, such as WiFi, 2G or 3G, or a combination thereof.
  • the communication component 216 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel.
  • the communication component 216 also includes a near field communication (NFC) module to facilitate short-range communication.
  • the NFC module may be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology and other technologies.
  • In an exemplary embodiment, the electronic device 200 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors or other electronic components, and is used to execute the voice wake-up method provided by the embodiments of the present application.
  • In an exemplary embodiment, a non-volatile computer-readable storage medium is also provided, such as the memory 204 including computer program instructions, and the computer program instructions can be executed by the processor 220 of the electronic device 200 to complete the voice wake-up method provided by the embodiments of the present application.
  • FIG. 3 shows a flowchart of a voice wake-up method provided by an exemplary embodiment of the present application. This embodiment is exemplified by using the method in the electronic device shown in FIG. 2 . The method includes the following steps.
  • Step 301 Obtain original first microphone data.
  • the electronic device acquires the microphone output signal through a single microphone or multiple microphones, and uses the microphone output signal as the original first microphone data.
  • The first microphone data includes sound source data of the target object and sound source data of an interfering sound source, where the interfering sound source includes at least one of the speech of objects other than the target object, background music, and ambient noise.
  • Step 302 Preprocess the first microphone data to obtain multi-channel feature data.
  • the electronic device preprocesses the data of the first microphone to obtain multi-channel feature data.
  • The preprocessing includes at least one of acoustic echo cancellation (AEC), dereverberation, voice activity detection (VAD), automatic gain control (AGC), and beam filtering.
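  • A minimal sketch of such a configurable preprocessing chain is shown below; the individual stages (AEC, dereverberation, VAD, AGC, beam filtering) are passed in as callables, and the toy gain-normalization stage is only a hypothetical placeholder.

```python
from typing import Callable, Iterable
import numpy as np

def preprocess(mic_data: np.ndarray, stages: Iterable[Callable[[np.ndarray], np.ndarray]]) -> np.ndarray:
    """Apply a configurable sequence of preprocessing stages to the raw microphone data."""
    out = mic_data
    for stage in stages:
        out = stage(out)
    return out

def normalize_gain(x: np.ndarray, target_rms: float = 0.1) -> np.ndarray:
    """Trivial AGC-like placeholder stage: scale the signal to a target RMS level."""
    rms = float(np.sqrt(np.mean(x ** 2)))
    return x if rms == 0.0 else x * (target_rms / rms)
```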
  • Optionally, the multi-channel feature data is multiple sets of multi-channel features. For example, the multi-channel feature data includes at least one of multi-channel time-domain signal data, multi-channel spectrum data, multiple sets of inter-channel phase difference (IPD) data, multi-directional feature data, and multi-beam feature data.
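  • As an illustration, the following NumPy sketch computes two of the features listed above, namely per-channel magnitude spectra and inter-channel phase differences (IPD); the frame length, FFT size and Hann window are assumptions, and directional or beam features are omitted for brevity.

```python
import numpy as np

def multichannel_features(mic_signals: np.ndarray, n_fft: int = 512, hop: int = 256):
    """mic_signals: (num_mics, num_samples) -> (magnitude spectra, IPD versus microphone 0)."""
    window = np.hanning(n_fft)
    specs = []
    for channel in mic_signals:
        frames = [channel[s:s + n_fft] * window
                  for s in range(0, len(channel) - n_fft + 1, hop)]
        specs.append(np.fft.rfft(np.stack(frames), axis=-1))   # (frames, bins) per microphone
    specs = np.stack(specs)                                     # (num_mics, frames, bins)

    magnitude = np.abs(specs)
    # IPD between microphone pairs, here taking microphone 0 as the reference.
    ipd = np.angle(specs[1:] * np.conj(specs[0:1]))
    return magnitude, ipd
```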
  • Step 303 Perform first-stage separation processing according to the multi-channel feature data to obtain first separation data.
  • the first-stage separation processing may also be referred to as the first-stage neural network separation processing, and the first-stage separation processing is a separation processing based on a neural network model, that is, the first-stage separation processing includes calling the neural network model to perform sound source separation processing.
  • the electronic device invokes the output of the pre-trained first-level separation module to obtain the first separation data.
  • the first-stage separation module is used to perform the first-stage separation processing, and the first-stage separation processing is the streaming sound source separation processing.
  • the first-stage separation module adopts a dpconformer network structure.
  • The manner in which the electronic device calls the pre-trained first-level separation module to output the first separation data includes, but is not limited to, the following two possible implementations:
  • In the first implementation, the first-level separation module includes a first-level separation model: the electronic device splices the multi-channel features and inputs the spliced multi-channel feature data into the first-level separation model to output the first separation data.
  • In the second implementation, the first-level separation module includes a first-level multi-feature fusion model and a first-level separation model: the electronic device inputs the multi-channel feature data into the first-level multi-feature fusion model to output first single-channel feature data, and then inputs the first single-channel feature data into the first-level separation model to output the first separation data.
  • the first-level multi-feature fusion model is a conformer feature fusion model.
  • the first-level separation model adopts a streaming network structure.
  • the first-level separation model adopts a dpconformer network structure.
  • the first-level separation model is a neural network model, that is, the first-level separation model is a model obtained by training a neural network.
  • Optionally, the first-level separation model adopts any one of the following network structures: deep neural network (DNN), long short-term memory (LSTM), convolutional neural network (CNN), fully convolutional time-domain audio separation network (Conv-TasNet), or dual-path recurrent neural network (DPRNN).
  • the first-level separation model may also adopt other network structures suitable for streaming scenarios, which are not limited in this embodiment of the present application.
  • the separation task design of the first-stage separation module can be a single-task design of the streaming sound source separation task, or a multi-task design of the streaming sound source separation task and other tasks.
  • For example, the other tasks include at least one of a sound source localization task, a specific-person extraction task, a specific-direction extraction task, and a specific-person confirmation task.
  • the first-stage separation module is configured to blindly separate multiple sound source data, and the first separated data includes the separated multiple sound source data.
  • the first-stage separation module is configured to extract the sound source data of the target object from the plurality of sound source data, and the first separation data includes the extracted sound source data of the target object.
  • the first-stage separation module is configured to extract the sound source data of the target object from the plurality of sound source data based on the video information, and the first separation data includes the extracted sound source data of the target object.
  • the video information includes visual data of the target object.
  • the first-stage separation module is configured to extract at least one sound source data in the target direction from the plurality of sound source data, and the first separation data includes at least one sound source data in the target direction.
  • the cost function in the first-stage separation module is a function designed based on the Permutation Invariant Training (PIT) criterion.
  • Optionally, during training, the electronic device sorts the multiple sample sound source data according to the start times of their speech segments, calculates the loss value of the cost function according to the sorted sample sound source data, and performs training based on the calculated loss value.
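  • The following NumPy sketch illustrates a PIT-style loss and the start-time sorting described above; the mean-squared-error distance and the helper names are assumptions made for illustration.

```python
import itertools
import numpy as np

def pit_mse_loss(estimates: np.ndarray, references: np.ndarray) -> float:
    """Permutation invariant training style loss: evaluate the MSE for every pairing of
    estimated and reference sources (shape (num_sources, num_samples)) and keep the best."""
    num_sources = references.shape[0]
    best = np.inf
    for perm in itertools.permutations(range(num_sources)):
        loss = np.mean((estimates[list(perm)] - references) ** 2)
        best = min(best, loss)
    return float(best)

def sort_by_onset(references: np.ndarray, onsets: list) -> np.ndarray:
    """Order reference sources by the start time of their speech segments, so the loss
    can alternatively be computed against a fixed, onset-sorted order."""
    return references[np.argsort(onsets)]
```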
  • the multiple sound source data are separated and obtained by the first-level separation module.
  • In one implementation, the multiple sound source data are directly input into the next-level processing model, that is, the first-level wake-up module.
  • In another implementation, statistical information of the multiple sound source data is calculated and input into a beamforming model to output beamforming data, and the beamforming data is input into the next-level processing model, that is, the first-level wake-up module.
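  • As one possible interpretation of feeding per-source statistics into a beamforming model, the sketch below derives MVDR weights from target and interference spatial covariance matrices; MVDR is an assumed, standard choice, not a design specified by this application.

```python
import numpy as np

def mvdr_weights(target_cov: np.ndarray, noise_cov: np.ndarray) -> np.ndarray:
    """Compute MVDR beamforming weights for one frequency bin from (mics, mics) covariances."""
    # Steering vector approximated by the principal eigenvector of the target covariance.
    _, eigvecs = np.linalg.eigh(target_cov)
    steering = eigvecs[:, -1]
    numerator = np.linalg.solve(noise_cov, steering)   # R_noise^{-1} d
    return numerator / (steering.conj() @ numerator)   # w = R^{-1} d / (d^H R^{-1} d)
```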
  • Step 304 Perform first-level wake-up processing according to the multi-channel feature data and the first separation data to obtain first wake-up data.
  • the electronic device calls the pre-trained first-level wake-up module to output the first wake-up data according to the multi-channel feature data and the first separation data.
  • the first-level wake-up module is used to perform the first-level wake-up processing, and the first-level wake-up processing is a streaming sound source wake-up processing.
  • the electronic device inputs the multi-channel feature data and the first separated data into the first-level wake-up module and outputs the first wake-up data.
  • In one implementation, the wake-up scheme is a multiple-input single-output streaming wake-up scheme (MISO-KWS): the first-level wake-up module is modeled on the fixed wake-up word, and the first-level wake-up module is a multiple-input single-output wake-up model whose input parameters include the multi-channel feature data and the first separation data, and whose output parameter includes the first confidence level.
  • the first confidence level is used to indicate the probability that the original first microphone data includes the preset wake-up word.
  • the first confidence level is a multi-dimensional vector
  • the value of each dimension in the multi-dimensional vector is a probability value between 0 and 1.
  • In another implementation, the wake-up scheme is a multiple-input multiple-output streaming wake-up scheme (MIMO-KWS): the first-level wake-up module is modeled on phonemes, and the first-level wake-up module includes a multiple-input multiple-output wake-up model and a first post-processing module (such as a decoder). The input parameters of the first-level wake-up module include the multi-channel feature data and the first separation data, and the output parameters of the wake-up model include the phoneme sequence information corresponding to each of the multiple sound source data.
  • the phoneme sequence information corresponding to the sound source data is used to indicate the probability distribution of multiple phonemes in the sound source data, that is, the phoneme sequence information includes respective probability values corresponding to the multiple phonemes.
  • the output parameter of the first-level wake-up module (that is, the output parameter of the first post-processing module) includes a second confidence level corresponding to each of the plurality of sound source data, and the second confidence level is used to indicate the sound source data and the preset wake-up word. The acoustic feature similarity between them.
  • the preset wake-up word is a fixed wake-up word set by default, or a wake-up word set by a user. This is not limited in the embodiments of the present application.
  • the first-level wake-up module adopts a streaming network structure.
  • the first-level wake-up module adopts a streaming dpconformer network structure.
  • Optionally, the first-level wake-up module adopts any network structure among DNN, LSTM, and CNN. It should be noted that the first-level wake-up module may also adopt other network structures suitable for streaming scenarios, and the network structure of the first-level wake-up module may be analogous to the network structure of the first-level separation module, which is not limited in this embodiment of the present application.
  • the wake-up task design of the first-level wake-up module may be a single-task design of the wake-up task, or a multi-task design of the wake-up task and other tasks.
  • the other tasks include an orientation estimation task and/or a sound source object recognition task.
  • the first wake-up data includes a first confidence level, and the first confidence level is used to indicate a probability that the original first microphone data includes a preset wake-up word.
  • the first wake-up data includes a second confidence level corresponding to each of the plurality of sound source data, and the second confidence level is used to indicate an acoustic feature similarity between the sound source data and a preset wake-up word.
  • the first wake-up data further includes orientation information corresponding to the wake-up event and/or object information of the wake-up object, where the object information is used to indicate the object identity of the sound source data.
  • Step 305 Determine, according to the first wake-up data, whether the pre-wake is successful.
  • the electronic device sets the first threshold value of the first-level wake-up module.
  • the first threshold value is a threshold value that allows the electronic device to be successfully pre-wake-up.
  • the first wake-up data includes a first confidence level
  • the first confidence level is used to indicate a probability that the original first microphone data includes a preset wake-up word.
  • Alternatively, the first wake-up data includes a second confidence level corresponding to each of the plurality of sound source data, and the second confidence level is used to indicate the acoustic feature similarity between the sound source data and the preset wake-up word.
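  • A minimal sketch of this pre-wake judgment is shown below, assuming that the pre-wake succeeds when the first confidence level, or any of the per-source second confidence levels, reaches the first threshold value; this "any source" rule is an assumption of the sketch.

```python
def pre_wake_decision(confidences, first_threshold: float = 0.5) -> bool:
    """confidences: either a single first confidence level (multiple-input single-output
    case) or an iterable of second confidence levels, one per separated sound source
    (multiple-input multiple-output case)."""
    values = [confidences] if isinstance(confidences, (int, float)) else list(confidences)
    return any(value >= first_threshold for value in values)
```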
  • Step 306 Perform second-stage separation processing according to the multi-channel feature data and the first separation data to obtain second separation data.
  • the second-stage separation processing may also be referred to as the second-stage neural network separation processing, and the second-stage separation processing is a separation processing based on a neural network model, that is, the second-stage separation processing includes calling the neural network model to perform sound source separation processing.
  • the electronic device invokes the pre-trained second-level separation module to output the second separation data.
  • the second-stage separation module is used to perform second-stage separation processing, and the second-stage separation processing is offline sound source separation processing.
  • the first wake-up data also includes orientation information corresponding to the wake-up word
  • the electronic device calls the second-level separation module to output the second separation data according to the multi-channel feature data, the first separation data, and the orientation information corresponding to the wake-up word.
  • the electronic device invokes the pre-trained second-level separation module to output the second separation data according to the multi-channel feature data and the first separation data.
  • the second-stage separation module adopts a dpconformer network structure.
  • the electronic device calls the pre-trained second-level separation module to output the second separation data, including but not limited to the following two possible implementations:
  • the second-stage separation module includes a second-stage separation model
• the electronic device splices the multi-channel feature data and the first separation data, and inputs the spliced data into the second-stage separation model to output the second separation data.
• the second-stage separation module includes a second-stage multi-feature fusion model and a second-stage separation model; the electronic device inputs the multi-channel feature data and the first separation data into the second-stage multi-feature fusion model to output the second single-channel feature data, and inputs the second single-channel feature data into the second-stage separation model to output the second separation data.
  • the second-level multi-feature fusion model is a conformer feature fusion model.
  • the second-level separation model is a neural network model, that is, the second-level separation model is a model obtained by training a neural network.
  • the second-level separation model adopts a dpconformer network structure.
• the second-level separation model adopts any one of the following network structures: Deep Neural Networks (DNN), Long Short-Term Memory (LSTM), Convolutional Neural Networks (CNN), fully convolutional time-domain audio separation network (Conv-TasNet), or Recurrent Neural Network (RNN).
  • the second-level separation model may also adopt other network structures suitable for offline scenarios, which are not limited in this embodiment of the present application.
  • the separation task design of the second-stage separation module can be a single-task design of an offline sound source separation task, or a multi-task design of an offline sound source separation task and other tasks.
• the other tasks relate to the multiple sound source data.
  • the second-stage separation module is configured to blindly separate multiple sound source data, and the second separated data includes the separated multiple sound source data.
  • the second-stage separation module is configured to extract the sound source data of the target object from the plurality of sound source data, and the second separation data includes the extracted sound source data of the target object.
  • the second-stage separation module is configured to extract the sound source data of the target object from the plurality of sound source data based on the video information, and the second separation data includes the extracted sound source data of the target object.
  • the second-stage separation module is configured to extract at least one sound source data in the target direction from the plurality of sound source data, and the second separation data includes at least one sound source data in the target direction.
  • Step 307 Perform second-level wake-up processing according to the multi-channel feature data, the first separation data and the second separation data to obtain second wake-up data.
  • the electronic device calls the pre-trained second-level wake-up module to output the second wake-up data according to the multi-channel feature data, the first separation data and the second separation data.
  • the second-level wake-up module is used to perform second-level wake-up processing, and the second-level wake-up processing is offline sound source wake-up processing.
  • the first wake-up data also includes orientation information corresponding to the wake-up word
  • the electronic device calls the second-level wake-up module to output according to the multi-channel feature data, the first separation data, the second separation data, and the orientation information corresponding to the wake-up word. Second wake-up data.
  • the electronic device inputs the multi-channel feature data, the first separation data and the second separation data into the second-level wake-up module and outputs the second wake-up data.
• the second-level wake-up module is modeled by a fixed wake-up word, and the second-level wake-up module is a wake-up model in the form of multiple input and single output, that is, the wake-up scheme is a multi-input single-output streaming wake-up scheme (MISO-KWS).
• the second-level wake-up module is modeled by phonemes, and the second-level wake-up module includes a wake-up model in the form of multiple input and multiple output and a second post-processing module (such as a decoder), that is, the wake-up scheme is a multi-input multi-output streaming wake-up scheme (MIMO-KWS).
  • the second-level wake-up module adopts a dpconformer network structure.
  • the second-level wake-up module adopts any network structure among DNN, LSTM, and CNN. It should be noted that the second-level wake-up module may also adopt other network structures suitable for offline scenarios.
• the network structure of the second-level wake-up module may be analogous to the network structure of the second-level separation module, which is not limited in this embodiment of the present application.
  • the wake-up task design of the second-level wake-up module may be a single-task design of the wake-up task, or a multi-task design of the wake-up task and other tasks.
• the other tasks include an orientation estimation task and/or a sound source object recognition task.
  • the second wake-up data includes a third confidence level, where the third confidence level is used to indicate a probability that the original first microphone data includes a preset wake-up word.
  • the second wake-up data includes a fourth confidence level corresponding to each of the plurality of sound source data, and the fourth confidence level of the sound source data is used to indicate an acoustic feature similarity between the sound source data and a preset wake-up word.
  • the following description only takes as an example that the second wake-up data includes a third confidence level, and the third confidence level is used to indicate the probability that the original first microphone data includes a preset wake-up word.
  • the second wake-up data further includes orientation information corresponding to the wake-up event and/or object information of the wake-up object.
  • Step 308 Determine the wake-up result according to the second wake-up data.
  • the electronic device determines a wake-up result according to the second wake-up data, where the wake-up result includes one of a wake-up success or a wake-up failure.
  • the electronic device sets a second threshold value of the second-level wake-up module.
  • the second threshold value is a threshold value that allows the electronic device to be woken up successfully.
  • the second threshold value is greater than the first threshold value.
  • the second wake-up data includes a third confidence level
  • the third confidence level is used to indicate a probability that the original first microphone data includes a preset wake-up word.
  • the electronic device determines that the wake-up result is successful wake-up.
  • the electronic device determines that the wake-up result is a wake-up failure, and ends the process.
• the second wake-up data includes a fourth confidence level corresponding to each of the plurality of sound source data, and the fourth confidence level of the sound source data is used to indicate the acoustic feature similarity between the sound source data and the preset wake-up word.
  • the electronic device determines that the wake-up result is successful wake-up.
  • the electronic device determines that the wake-up result is a wake-up failure, and ends the process.
  • the electronic device when the second wake-up data indicates that the wake-up is successful, the electronic device outputs a wake-up success identifier; or, outputs the wake-up success identifier and other information.
  • the wake-up success identifier is used to indicate that the wake-up is successful, and other information includes orientation information corresponding to the wake-up event and object information of the wake-up object.
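• As an illustration of the Step 308 decision, the following Python sketch applies the second threshold value to the second wake-up data; the concrete threshold values are assumptions, with the only stated constraint being that the second threshold value is greater than the first threshold value.

```python
# Minimal sketch of the Step 308 wake-up decision; the concrete threshold
# values are illustrative assumptions (the only constraint stated above is
# SECOND_THRESHOLD > FIRST_THRESHOLD).

FIRST_THRESHOLD = 0.5
SECOND_THRESHOLD = 0.8

def final_wake(second_wake_data: dict, second_threshold: float = SECOND_THRESHOLD) -> bool:
    """Return True (wake-up success) if the second wake-up data passes the second threshold."""
    if "third_confidence" in second_wake_data:
        return second_wake_data["third_confidence"] > second_threshold
    fourth_confidences = second_wake_data.get("fourth_confidences", [])
    return any(c > second_threshold for c in fourth_confidences)

result = final_wake({"third_confidence": 0.91})
print("wake-up success" if result else "wake-up failure")
```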
• the embodiment of the present application designs a two-level wake-up processing module: after the first-level wake-up is successful, the more complex second-level wake-up performs an offline wake-up confirmation.
• the separation module is also designed in two stages. The first-level separation scheme is streaming and needs to run all the time, so the first-level separation module needs a causal streaming design. A streaming design generally loses some separation performance, so after the first-level wake-up is successful, a second-level separation scheme can be performed on the output data.
  • the second-level wake-up scheme can use an offline design scheme.
  • the data that has been output by the first stage can also be used for the second stage separation scheme, and finally a better separation performance is obtained, thereby better supporting the effect of the second stage wake-up.
• the electronic device includes a first-stage separation module 41 (including a first-stage separation model), a first-stage wake-up module 42, a second-stage separation module 43 (including a second-stage separation model), and a second-stage wake-up module 44.
• the electronic device inputs the original first microphone data to the preprocessing module for preprocessing (such as acoustic echo cancellation, de-reverberation and beam filtering) to obtain multi-channel feature data; inputs the multi-channel feature data to the first-stage separation module 41 to perform first-stage separation processing to obtain the first separation data; and inputs the multi-channel feature data and the first separation data to the first-stage wake-up module 42 to perform first-stage wake-up processing to obtain the first wake-up data.
  • the electronic device determines whether to pre-wake according to the first wake-up data.
• the multi-channel feature data and the first separation data are input to the second-stage separation module 43 for second-stage separation processing to obtain the second separation data; the multi-channel feature data, the first separation data and the second separation data are input to the second-level wake-up module 44 for second-level wake-up processing to obtain the second wake-up data.
  • the electronic device determines whether the wake-up is successful according to the second wake-up data.
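• The control flow of the two-stage scheme described above can be summarized by the following Python sketch; every module object and helper name in it is a hypothetical placeholder rather than the patent's actual interface.

```python
# Control-flow sketch of the two-stage separation / two-stage wake-up pipeline
# of FIG. 4. All module objects (preprocess, sep1, kws1, sep2, kws2) and the
# pre_wake / final_wake helpers are hypothetical placeholders.

def wake_pipeline(raw_mic_data, preprocess, sep1, kws1, sep2, kws2,
                  pre_wake, final_wake):
    multi_channel = preprocess(raw_mic_data)            # AEC / de-reverb / beam filtering
    first_sep = sep1(multi_channel)                     # first-level (streaming) separation
    first_wake = kws1(multi_channel, first_sep)         # first-level wake-up

    if not pre_wake(first_wake):                        # Step 305: pre-wake decision
        return {"wake": False, "stage": 1}

    second_sep = sep2(multi_channel, first_sep)         # second-level (offline) separation
    second_wake = kws2(multi_channel, first_sep, second_sep)  # second-level wake-up
    return {"wake": final_wake(second_wake), "stage": 2}      # Step 308: final decision
```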
  • the voice wake-up method provided by the embodiments of the present application is mainly optimized and designed from the perspectives of the multi-sound source separation technology and the wake-up technology, and can largely solve the above-mentioned technical problems.
  • the multi-sound source separation technology and the wake-up technology involved in the embodiments of the present application are respectively introduced.
• Before introducing the multi-sound source separation technology and the wake-up technology, the dpconformer network structure is introduced first.
  • the schematic diagram of the structure of the dpconformer network is shown in Figure 5.
  • the dpconformer network includes an encoding layer, a separation layer and a decoding layer.
• Coding layer: the dpconformer network receives single-channel feature data, and obtains intermediate feature data through a one-dimensional convolution (1-D Conv) layer.
  • the intermediate feature data is a two-dimensional matrix.
• X = ReLU(x*W); where x is the time-domain single-channel feature data and W is the weight coefficient of the encoding transform, and x undergoes a one-dimensional convolution with a fixed convolution kernel size and convolution stride via W. The encoded intermediate feature data satisfies X ∈ R^(N*I), where N is the encoding dimension and I is the total number of time-domain frames, so the intermediate feature data X is an N*I two-dimensional matrix.
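• The encoding layer can be illustrated with the following NumPy sketch of X = ReLU(x*W); the kernel size, stride and encoding dimension N used here are illustrative assumptions.

```python
import numpy as np

# Sketch of the dpconformer encoding layer: a 1-D convolution with fixed
# kernel size and stride, followed by ReLU, mapping the time-domain signal x
# to an N x I intermediate feature matrix. Kernel size / stride / N are
# illustrative assumptions.

def encode(x: np.ndarray, W: np.ndarray, stride: int) -> np.ndarray:
    """x: (T,) time-domain samples; W: (N, kernel) encoder weights -> (N, I)."""
    N, kernel = W.shape
    I = 1 + (len(x) - kernel) // stride          # number of time-domain frames
    frames = np.stack([x[i * stride: i * stride + kernel] for i in range(I)], axis=1)
    X = W @ frames                               # (N, I): the 1-D convolution x * W
    return np.maximum(X, 0.0)                    # ReLU

x = np.random.randn(16000)                       # e.g. 1 s of 16 kHz audio
W = np.random.randn(64, 32) * 0.1                # N = 64, kernel = 32
X = encode(x, W, stride=16)
print(X.shape)                                   # (64, I)
```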
  • the separation layer includes a data cutting module, a conformer layer within a block, and a conformer layer between blocks.
  • the input parameters of the data cutting module are intermediate feature data, and the output parameters are three-dimensional tensors. That is, the data cutting module is used to represent the intermediate feature data as a three-dimensional tensor according to the data frame segmentation method, corresponding to intra-block features, inter-block features and feature dimensions respectively.
• the N*I-dimensional two-dimensional matrix is equally divided into an N*K*P-dimensional three-dimensional tensor, where N is the feature dimension, K is the number of blocks, P is the length of each block, and adjacent blocks overlap by P/2.
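• The data cutting step can be illustrated with the following NumPy sketch, which segments an N*I matrix into an N*K*P tensor with an overlap of P/2 between adjacent blocks; the zero-padding behaviour is an illustrative assumption.

```python
import numpy as np

# Sketch of the data cutting module: the N x I intermediate feature matrix is
# segmented into an N x K x P tensor of overlapping blocks with hop P/2
# (50% overlap). Zero padding of the last block is an illustrative assumption.

def segment(X: np.ndarray, P: int) -> np.ndarray:
    """X: (N, I) -> (N, K, P) with P/2 overlap between adjacent blocks."""
    N, I = X.shape
    hop = P // 2
    K = int(np.ceil(max(I - P, 0) / hop)) + 1             # number of blocks
    padded = np.zeros((N, (K - 1) * hop + P))
    padded[:, :I] = X
    blocks = np.stack([padded[:, k * hop: k * hop + P] for k in range(K)], axis=1)
    return blocks                                          # (N, K, P)

X = np.random.randn(64, 999)
T = segment(X, P=100)
print(T.shape)                                             # (64, K, 100)
```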
  • the input parameter of the conformer layer in the block is the three-dimensional tensor output by the data cutting module, and the output parameter is the first intermediate parameter.
  • the conformer layer includes at least one of a linear layer, a multi-head self-attention layer (MultiHead Self-Attention, MHSA), and a convolutional layer.
• b denotes the current (b-th) dpconformer submodule; there are B dpconformer submodules in total, each including an intra-block conformer layer and an inter-block conformer layer, where B is a positive integer.
  • the input parameter of the conformer layer between blocks is the first intermediate parameter output by the conformer layer in the block, and the output parameter is the second intermediate parameter.
• the inter-block conformer layer calculates attention over all features of the whole sentence in the offline scene, while in the streaming scene, in order to control the delay, a mask mechanism is used to calculate attention only over the current block and previous moments, so as to ensure causality.
  • the block corresponding to the current moment is t
• the inter-block conformer computation for the current block t depends only on the blocks from historical moments up to block t, and is independent of block t+1.
  • the following formula is used to calculate the conformer between each block:
• the calculation passes through B layers of intra-block and inter-block conformer layers, that is, the intra-block conformer layer and the inter-block conformer layer are repeated B times.
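• The mask mechanism of the streaming inter-block conformer layer can be illustrated with the following NumPy sketch of masked attention, in which block t attends only to blocks up to t; the single-head simplification is an assumption made for clarity.

```python
import numpy as np

# Sketch of the masked inter-block attention used in the streaming case:
# block t may only attend to blocks 0..t (the history), never to block t+1,
# which keeps the computation causal. Single-head attention is an
# illustrative simplification.

def causal_interblock_attention(Q, K, V):
    """Q, K, V: (num_blocks, d). Returns (num_blocks, d)."""
    num_blocks, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                       # (num_blocks, num_blocks)
    mask = np.triu(np.ones((num_blocks, num_blocks), dtype=bool), k=1)
    scores[mask] = -np.inf                              # forbid attention to future blocks
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

Q = K = V = np.random.randn(19, 64)
out = causal_interblock_attention(Q, K, V)
print(out.shape)                                        # (19, 64)
```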
• Decoding layer: the three-dimensional N*K*P tensor passes through a 2-D Conv layer and is converted into C two-dimensional N*I matrices, corresponding to the masking matrices M of the C sound sources, where C is the preset number of sound sources to be separated.
  • the separation result is finally obtained, that is, the separated multiple sound source data.
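• The decoding step can be illustrated with the following NumPy sketch, in which each of the C masks is applied to the encoded mixture and a transposed 1-D convolution with overlap-add reconstructs one time-domain source per mask; the decoder weights and hop size are illustrative assumptions.

```python
import numpy as np

# Sketch of the decoding step: each of the C estimated N x I masks is applied
# element-wise to the encoded mixture X, and a transposed 1-D convolution with
# overlap-add reconstructs one time-domain source per mask. The decoder
# weights and hop size are illustrative assumptions.

def decode(X, masks, W_dec, stride):
    """X: (N, I); masks: (C, N, I); W_dec: (N, kernel) -> list of C waveforms."""
    N, I = X.shape
    kernel = W_dec.shape[1]
    sources = []
    for M in masks:
        frames = W_dec.T @ (M * X)                    # (kernel, I) decoded frames
        y = np.zeros((I - 1) * stride + kernel)
        for i in range(I):                            # overlap-add reconstruction
            y[i * stride: i * stride + kernel] += frames[:, i]
        sources.append(y)
    return sources

X = np.random.randn(64, 999)
masks = np.random.rand(2, 64, 999)                    # C = 2 sound sources
outs = decode(X, masks, W_dec=np.random.randn(64, 32), stride=16)
print(len(outs), outs[0].shape)                       # 2 (16000,)
```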
• the multi-sound source separation scheme provided in the embodiment of the present application is a two-stage separation scheme. Taking the case where the multi-feature fusion model and the separation model in the two-stage separation scheme both adopt the dpconformer network structure provided in FIG. 5 as an example, the two-stage separation scheme is shown in FIG. 6.
  • the first-stage streaming separation module includes a conformer feature fusion model 61 and a dpconformer separation model 62
  • the second-stage offline separation module includes a conformer feature fusion model 63 and a dpconformer separation model 64 .
  • the first-stage flow separation module may be the above-mentioned first-stage separation module 41
• the second-stage offline separation module may be the above-mentioned second-stage separation module 43.
  • the electronic device inputs the multi-channel feature data into the conformer feature fusion model 61 for output to obtain single-channel feature data; inputs the single-channel feature data into the dpconformer separation model 62 for output to obtain the first separation data.
  • the multi-channel feature data and the first separation data are input into the conformer feature fusion model 63 for output to obtain single-channel feature data; the single-channel feature data is input into the dpconformer separation model 64 for output to obtain the second separation data.
  • the first-stage separation scheme includes blind separation technology, and the first-stage separation scheme includes but is not limited to the following aspects, as shown in Figure 7:
  • the multi-channel feature data includes multiple sets of multi-channel feature data.
• the multi-channel feature data includes at least one set of multi-channel feature data among: the original time-domain data of the multiple microphones, the output data of fixed beams corresponding to a plurality of preset directions, and the directional feature (Directional Features) data of each preset direction.
  • the feature input part includes three sets of multi-channel feature data, ie, multi-channel feature data 1, multi-channel feature data 2, and multi-channel feature data 3. This embodiment of the present application does not limit the number of groups of multi-channel feature data.
• Conformer feature fusion model 71: used to fuse multiple sets of multi-channel feature data into single-channel feature data. First, for each group of multi-channel feature data, the first attention feature data between the channels in the group is calculated based on a conformer layer; then, the first attention feature data of all groups of channels passes uniformly through another conformer layer, that is, the full-channel attention layer 72, to obtain the second attention feature data of each group, which then passes through a pooling layer or a projection layer to obtain a single-channel intermediate feature representation, that is, the single-channel feature data.
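• The fusion idea (intra-group channel attention, then full-channel attention, then pooling down to a single channel) can be illustrated with the following NumPy sketch; the single-head attention and mean pooling are illustrative simplifications.

```python
import numpy as np

# Sketch of the conformer feature fusion idea: self-attention over the
# channels inside each feature group, then full-channel attention over all
# groups, then mean pooling down to a single-channel representation.
# Single-head attention and mean pooling are illustrative simplifications.

def self_attention(X):
    """X: (channels, d) -> (channels, d)."""
    d = X.shape[1]
    scores = X @ X.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ X

def fuse(groups):
    """groups: list of (channels_g, d) arrays -> (d,) single-channel feature."""
    # first attention feature data: attention within each group of channels
    intra = [self_attention(g) for g in groups]
    # second attention feature data: full-channel attention over all channels
    all_channels = np.concatenate(intra, axis=0)
    inter = self_attention(all_channels)
    return inter.mean(axis=0)                      # pooling to a single channel

groups = [np.random.randn(4, 64), np.random.randn(6, 64), np.random.randn(4, 64)]
print(fuse(groups).shape)                          # (64,)
```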
• (3) dpconformer separation model 73: the fused multi-channel feature data, that is, the single-channel feature data, is input into the dpconformer separation model, and M estimated sound source data are output, where M is a positive integer.
  • the M estimated sound source data include sound source data 1 , sound source data 2 , sound source data 3 and sound source data 4 . This embodiment of the present application does not limit this.
• Cost function design: during training, there is a permutation confusion problem between the output multiple sound source data and the corresponding labels of the multiple sound source data, so the permutation invariant training criterion (Permutation Invariant Training, PIT) is used, that is, all possible label orderings corresponding to the multiple sound source data are determined, the loss values corresponding to the multiple label orderings are calculated according to the label orderings and the output parameters, and the gradient calculation is performed according to the label ordering with the smallest loss value.
  • the prior information of multiple sound source data can also be used to set a fixed sorting order, so as to avoid the problem of high computational complexity of loss value due to the increase of the number of sound source data.
  • the prior information of the sound source data includes the start time of the sound source data, and the plurality of sound source data are arranged in order from early to late start time.
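• The PIT criterion can be illustrated with the following NumPy sketch, which evaluates the loss under every permutation of the estimated sources and keeps the minimum; the MSE loss and toy shapes are illustrative assumptions, and with prior information (such as start-time ordering) the permutation search can be replaced by a fixed order.

```python
import numpy as np
from itertools import permutations

# Sketch of permutation invariant training (PIT): compute the loss for every
# possible assignment between estimated sources and reference labels and keep
# the minimum. The MSE loss and toy shapes are illustrative assumptions.

def pit_loss(estimates: np.ndarray, references: np.ndarray) -> float:
    """estimates, references: (C, T) -> minimum loss over all C! permutations."""
    C = estimates.shape[0]
    losses = []
    for perm in permutations(range(C)):
        losses.append(np.mean((estimates[list(perm)] - references) ** 2))
    return min(losses)

est = np.random.randn(3, 16000)
ref = est[[2, 0, 1]]                  # references given in a different order
print(pit_loss(est, ref))             # ~0.0: PIT finds the matching permutation
```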
  • the first-stage separation solution includes a specific person extraction technology, which is another main technical solution in a multi-sound source interference scenario.
  • the first-stage separation scheme includes but is not limited to the following aspects, as shown in Figure 8:
  • Feature input part including multi-channel feature data and registered voice data. Different from the first-level separation scheme provided in FIG. 7 , in the specific person extraction scenario, the target object needs to be registered, and the registered voice data of the target object is input as additional feature data.
  • the feature input part includes multi-channel feature data 1, multi-channel feature data 2 and registered voice data. This embodiment of the present application does not limit the number of groups of multi-channel feature data.
• Conformer feature fusion model 81: used to fuse multiple sets of multi-channel feature data and registered voice data into single-channel feature data. First, for each group of multi-channel feature data, the first attention feature data between the channels within the group is calculated based on a conformer layer; then, the first attention feature data of all groups of channels and the speaker representation feature data of the target object pass uniformly through the full-channel attention layer 82, which is used to calculate the correlation between the speaker representation feature data of the target object and the other multi-channel feature data, and the fused output yields the single-channel features.
• the registered voice data of the target object is input into the speaker representation model to output the embedding representation of the target object, that is, the speaker representation feature data, where the speaker representation model is pre-trained and is obtained by standard speaker recognition training methods.
  • the speaker representation feature data of the target object is pre-stored in the electronic device in the form of a vector.
• (3) dpconformer separation model 83: the single-channel feature data is input into the dpconformer separation model 83, and the sound source data of the target object is output. That is, the output parameter of the dpconformer separation model 83 is a single output parameter, and the expected output is the sound source data of the target object. For example, the sound source data of the target object is sound source data 1.
• Cost function design: reference may be made to the introduction of the above cost function by analogy, and details are not repeated here.
  • the first-level separation solution includes a visual data-assisted specific person extraction technology, and the first-level separation solution includes but is not limited to the following aspects, as shown in FIG. 9 :
  • Feature input part including multi-channel feature data and target person visual data.
  • electronic devices such as televisions, mobile phones, robots, and in-vehicle equipment are equipped with cameras. These electronic devices can obtain visual data of the target object through the camera, that is, the visual data of the target person.
  • the target person visual data can be used to assist in the specific person extraction task.
  • the feature input part includes multi-channel feature data 1, multi-channel feature data 2 and target person visual data. This embodiment of the present application does not limit the number of groups of multi-channel feature data.
• Conformer feature fusion model 91: used to fuse multiple sets of multi-channel feature data and visual data into single-channel feature data. First, for each group of multi-channel feature data, the first attention feature data between the channels in the group is calculated based on a conformer layer; then, the first attention feature data of all groups of channels and the visual representation feature data of the target object pass uniformly through the full-channel attention layer 92, which is used to calculate the correlation between the visual representation feature data of the target object and the other multi-channel feature data, and the fused output yields the single-channel features.
  • the electronic device invokes the pre-trained visual classification model according to the target person's visual data to output the vector representation of the target object, that is, the visual representation feature data.
  • the visual classification model includes a lip language recognition model
  • the target person's visual data includes lip activity visual data. This embodiment of the present application does not limit this.
  • the output parameter of the dpconformer separation model 83 is a single output parameter
  • the expected output parameter is the sound source data of the target object.
  • the sound source data of the target object is sound source data 1 .
• Cost function design: reference may be made to the introduction of the above cost function by analogy, and details are not repeated here.
  • the first-stage separation solution includes a specific direction extraction technology
  • the specific direction extraction technology is a technology for extracting sound source data in a preset target direction in a multi-sound source interference scenario.
  • the first-stage separation scheme includes but is not limited to the following aspects, as shown in Figure 10:
  • Feature input part including multi-channel feature data and target direction data.
  • target direction data is input as additional feature data.
  • the feature input part includes multi-channel feature data 1, multi-channel feature data 2, multi-channel feature data 3 and target direction data. This embodiment of the present application does not limit the number of groups of multi-channel feature data.
• Conformer feature fusion model 101: used to fuse multiple sets of multi-channel feature data and target direction data into single-channel feature data. First, for each group of multi-channel feature data, the first attention feature data between the channels in the group is calculated based on a conformer layer; then, the first attention feature data of all groups of channels and the direction feature data of the target direction data pass uniformly through the full-channel attention layer 102, which is used to calculate the correlation between the direction feature data of the target direction data and the other multi-channel feature data, and the fused output yields the single-channel features.
  • the direction feature data of the target direction data is calculated according to the target direction data and the microphone position information of the microphone array.
  • the direction feature data of the target direction data is pre-stored in the electronic device.
  • the output parameter of the dpconformer separation model 103 is a single output parameter or multiple output parameters
  • the expected output parameter is at least one sound source data in the target direction.
  • the at least one sound source data of the target direction includes sound source data 1 and sound source data 2 .
• Cost function design: reference may be made to the introduction of the above cost function by analogy, and details are not repeated here.
• the above first-stage separation schemes may be implemented in combination of any two, any three, or all of them, which is not limited in the embodiments of the present application.
  • the first-stage separation scheme includes the technology of blind separation and multi-sound source localization for multi-task design.
  • the first-stage separation scheme includes but is not limited to the following aspects, as shown in Figure 11:
  • Feature input part including multi-channel feature data.
  • the conformer feature fusion model 111 (including the full-channel attention layer 112 ): used to fuse multiple sets of multi-channel feature data into single-channel feature data.
  • the sound source separation result includes the separated m sound source data
  • the position estimation result includes the position information corresponding to each of the m sound source data.
  • the output parameters include sound source data 1 and sound source data 2 , and orientation information of sound source data 1 and orientation information of sound source data 2 .
  • the sound source separation layer 114 and the direction estimation layer 115 can be set as separate modules outside the dpconformer separation model 113 , that is, the sound source separation layer 114 and the direction estimation layer 115 are set at the output end of the dpconformer separation model 113 .
  • the ith azimuth information output by the direction estimation layer 115 is the azimuth information of the ith sound source data separated by the sound source separation layer 114 , and i is a positive integer.
  • the orientation information is an orientation label, in the form of a one-hot vector.
  • the cost functions of the separation task and the direction estimation task both adopt the PIT criterion.
  • the first-stage separation solution includes a multi-task design technology for specific person extraction and specific person orientation estimation.
  • the first-stage separation scheme includes but is not limited to the following aspects, as shown in Figure 12:
  • Feature input part including multi-channel feature data and registered voice data.
  • the conformer feature fusion model 121 (including the full-channel attention layer 122 ): used to fuse multiple sets of multi-channel feature data and registered voice data into single-channel feature data.
• the dpconformer separation model 123, the specific person extraction layer 124 and the specific person orientation estimation layer 125: the single-channel feature data is input into the dpconformer separation model 123 to output intermediate parameters; the intermediate parameters are input into the specific person extraction layer 124 to output the sound source data of the target object, and the intermediate parameters are input into the specific person orientation estimation layer 125 to output the orientation information of the sound source data of the target object.
  • the output parameters include sound source data 1 of the target object and orientation information of the sound source data 1 .
  • the orientation information is an orientation label, in the form of a one-hot vector.
• the speaker representation feature data and the other multi-channel feature data are used, through the dpconformer network structure, to predict the orientation label designed in the form of a one-hot vector, and the cross-entropy (CE) cost function is adopted for training.
• the technology of multi-task design for specific person extraction and specific person orientation estimation is to share the multi-channel feature data, the registered voice data, the conformer feature fusion model 121 and the dpconformer separation model 123 between the two tasks, to set the specific person extraction layer 124 and the specific person orientation estimation layer 125 at the output of the dpconformer separation model 123, and to perform multi-task training with a weighted combination of the cost functions of the separation task and the orientation estimation task.
• the first-level separation scheme includes a technology of multi-task design of blind separation and multi-speaker recognition, which separates a plurality of sound source data from the microphone data and identifies the object information corresponding to each of the plurality of sound source data, where the object information is used to indicate the object identity of the sound source data.
  • the electronic device stores correspondences between multiple sample sound source data and multiple object information.
  • the first-stage separation scheme includes but is not limited to the following aspects, as shown in Figure 13:
  • Feature input part including multi-channel feature data.
  • the conformer feature fusion model 131 (including the full-channel attention layer 132 ): used to fuse multiple sets of multi-channel feature data into single-channel feature data.
  • the sound source separation result includes the separated m pieces of sound source data
  • the object recognition result includes the object information corresponding to each of the m pieces of sound source data.
  • the output parameters include sound source data 1 and sound source data 2 , and object information of sound source data 1 and object information of sound source data 2 .
  • the separation task and the object recognition task share the multi-channel feature data, the conformer feature fusion model 131 and the dpconformer separation model 133 , and the sound source separation layer 134 and the object recognition layer 135 are set at the output of the dpconformer separation model 133 .
  • the sound source separation layer 134 separates a plurality of sound source data.
  • the object recognition layer 135 performs segment-level feature fusion to obtain a segment-level multi-object representation.
• the object representation of each segment outputs the object identity represented by that segment, and the corresponding object information is a one-hot vector used to indicate the object identity.
• the dimension of the one-hot vector is the number of objects; in the one-hot vector corresponding to a sound source data, the position corresponding to that sound source data's object is 1, which is used to indicate the speaking order of that object among the multiple objects, and the other positions are 0.
  • the i-th object information output by the object identification layer 135 is the object information of the i-th sound source data separated by the sound source separation layer 134, and i is a positive integer.
  • the cost functions of the separation task and the object recognition task both adopt the PIT criterion.
  • the first-stage separation scheme includes the technology of specific person extraction and specific person confirmation for multi-task design.
  • the specific person extraction task is to use the registered voice data of the target object to extract the sound source data of the target object from the microphone data.
  • the specific person confirmation task is to confirm whether the extracted sound source data is the same as the registered voice data of the target object, or whether the object corresponding to the extracted sound source data contains the target object.
• the multi-task design technique determines the object recognition result of the sound source data while extracting the sound source data of the target object; this task is likewise an offline design.
  • the first-stage separation scheme includes but is not limited to the following aspects, as shown in Figure 14:
  • Feature input part including multi-channel feature data and registered voice data.
  • the conformer feature fusion model 141 (including the full-channel attention layer 142 ): used to fuse multiple sets of multi-channel feature data and registered speech data into single-channel feature data.
  • the object recognition result includes the probability that the object corresponding to the output sound source data is the target object.
  • the output parameters include the sound source data 1 of the target object and the object recognition result of the sound source data 1 .
  • the specific person extraction and specific person confirmation tasks share multi-channel feature data, the conformer feature fusion model 141 and the dpconformer separation model 143 , and the specific person extraction layer 144 and the specific person confirmation layer 145 are set at the output of the dpconformer separation model 143 .
  • the wake-up scheme involved in the embodiments of the present application is a two-stage wake-up scheme.
• the first-level wake-up module and the second-level wake-up module in the two-stage wake-up scheme are both multi-input wake-up model structures, such as any one of the DNN, LSTM, CNN, Transformer, or Conformer network structures.
  • the wake-up model structure can also adopt other network structures.
  • the first-level wake-up module and the second-level wake-up module in the two-stage wake-up scheme use the dpconformer network structure provided in Figure 5 as an example.
  • the two-stage wake-up scheme is shown in FIG. 15 .
• the electronic device inputs the multi-channel feature data and the first separation data into the dpconformer wake-up module 151 to output the first wake-up data; when the first wake-up data indicates that the pre-wake-up is successful, the multi-channel feature data, the first separation data and the second separation data are input into the dpconformer wake-up module 152 to output the second wake-up data; and the wake-up result is determined according to the second wake-up data.
  • the first-level wake-up solution provided by the embodiment of the present application includes a wake-up technology for multi-input single-output whole-word modeling, and the first-level wake-up module is a multi-input single-output whole-word modeling wake-up module, As shown in Figure 16, including but not limited to the following aspects:
  • Feature input part including multiple sets of multi-channel feature data.
  • the multiple sets of multi-channel feature data include multi-channel feature data obtained by preprocessing the first microphone data and first separation data obtained by performing a first-stage separation process.
  • the conformer feature fusion model 161 (including the full-channel attention layer 162 ): used to fuse multiple sets of multi-channel feature data into single-channel feature data.
• dpconformer separation model 163: the single-channel feature data is input into the dpconformer separation model 163 to output the first confidence level, where the first confidence level is used to indicate the probability that the original first microphone data includes a preset wake-up word.
  • the preset wake-up word is a fixed wake-up word set by default.
  • the preset wake-up words include N wake-up words
  • the first confidence level output by the dpconformer separation model 163 is an N+1-dimensional vector
  • the N dimensions of the N+1-dimensional vector correspond to N wake-up words respectively
• the remaining dimension corresponds to the category of not belonging to any of the N wake-up words.
  • the value of each dimension in the N+1-dimensional vector is a probability value between 0 and 1, and the probability value is used to indicate the awakening probability of the awakening word at the corresponding position.
• the output parameter of the dpconformer separation model 163 is a single output parameter; the number of modeling units is the number of wake-up words plus one, and the extra unit is a garbage unit used to represent speech other than the wake-up words.
  • the output parameter of the dpconformer separation model 163 is the first confidence level.
  • the two preset wake-up words are preset wake-up word 1 and preset wake-up word 2
  • the probability value of each modeling unit is one of the first value, the second value, and the third value.
• when the value is the first value, it is used to indicate that the sound source data does not include a preset wake-up word; when the value is the second value, it is used to indicate that the sound source data includes the preset wake-up word 1; and when the value is the third value, it is used to indicate that the sound source data includes the preset wake-up word 2.
  • the default wake-up word 1 is "Xiaoyi Xiaoyi”
  • the default wake-up word 2 is "Hello Xiaoyi”
  • the first value is 0, the second value is 1, and the third value is 2.
  • This embodiment of the present application does not limit this.
• the first-level wake-up module is calculated in real time: for the currently input multiple sets of multi-channel feature data, the first-level wake-up module determines in real time whether a fixed wake-up word is included, and when the output first confidence level is greater than the first threshold value, it is determined that the pre-wake-up is successful. When the electronic device determines that the pre-wake-up is successful, the complete wake-up word information has been received at this time, and the current time is determined as the wake-up time, which provides a time point reference for the second-level separation module and the second-level wake-up module, and starts the second-stage offline separation module.
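• The whole-word decision over the (N+1)-dimensional output can be illustrated with the following Python sketch; the wake-up words and the threshold value are illustrative assumptions.

```python
import numpy as np

# Sketch of the multi-input single-output whole-word wake-up decision: the
# model outputs an (N+1)-dimensional probability vector (N wake-up words plus
# one garbage unit); pre-wake succeeds when the probability of any wake-up
# word exceeds the first threshold. Wake words and threshold are illustrative.

WAKE_WORDS = ["xiaoyi xiaoyi", "hello xiaoyi"]      # N = 2; index N is the garbage unit
FIRST_THRESHOLD = 0.5

def whole_word_decision(probs: np.ndarray):
    """probs: (N+1,) probabilities summing to 1 -> (pre_wake, detected word or None)."""
    word_probs = probs[:len(WAKE_WORDS)]
    best = int(np.argmax(word_probs))
    if word_probs[best] > FIRST_THRESHOLD:
        return True, WAKE_WORDS[best]
    return False, None

print(whole_word_decision(np.array([0.05, 0.80, 0.15])))   # (True, 'hello xiaoyi')
print(whole_word_decision(np.array([0.10, 0.20, 0.70])))   # (False, None)
```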
• the wake-up solution provided by the embodiment of the present application includes a wake-up technology of multiple-input multiple-output phoneme modeling, and the first-level wake-up module is a multiple-input multiple-output phoneme modeling wake-up module, as shown in FIG. 17, including but not limited to the following aspects:
  • Feature input part including multiple sets of multi-channel feature data.
  • the multiple sets of multi-channel feature data include multi-channel feature data obtained by preprocessing the first microphone data and first separation data obtained by performing a first-stage separation process.
  • the conformer feature fusion model 171 (including the full-channel attention layer 172 ): used to fuse multiple sets of multi-channel feature data into single-channel feature data.
• dpconformer separation model 173: the single-channel feature data is input into the dpconformer separation model 173 to output a phoneme set, where the phoneme set includes the phoneme sequence information corresponding to each of the multiple sound source data; optionally, the phoneme sequence information is the posterior probability of the phoneme sequence, which is the product of the posterior probability values of the individual phonemes corresponding to the sound source data.
  • the output parameters of the dpconformer separation model 173 include phoneme sequence information 1 of sound source data 1 and phoneme sequence information 2 of sound source data 2.
• the output parameter of the dpconformer separation model 173 is the phoneme sequence information corresponding to each of the multiple sound source data, and the multiple pieces of phoneme sequence information are respectively input into the decoder to finally output the second confidence levels corresponding to the multiple pieces of phoneme sequence information.
  • the phoneme sequence information corresponding to the sound source data is used to indicate the probability distribution of multiple phonemes in the sound source data, that is, the phoneme sequence information includes respective probability values corresponding to the multiple phonemes.
• the decoder is called once to obtain the second confidence level corresponding to the phoneme sequence information, where the second confidence level is used to indicate the acoustic feature similarity between the sound source data and the preset wake-up word.
• the decoder part does not participate in the model calculation; because the model cannot determine which separated sound source data is the preset wake-up word, it needs to calculate the phoneme sequence information corresponding to each of the multiple sound source data.
  • the modeling unit is a phoneme
  • a phoneme is a representation form of a basic phonetic unit.
• the corresponding phoneme sequence can be "x i ao y i x i ao y i", where adjacent phonemes are separated by spaces.
  • the phoneme sequence 1 corresponding to the sound source data 1 is "x i ao y i x i ao y i”
  • the voice content corresponding to the sound source data 2 can be "how is the weather”
• the corresponding phoneme sequence 2 is "t ian qi zen m o y ang".
• the output parameters of the dpconformer separation model 173 include two pieces of phoneme sequence information, namely the phoneme sequence information of the phoneme sequence 1 "x i ao y i x i ao y i" corresponding to sound source data 1, and the phoneme sequence information of the phoneme sequence 2 corresponding to sound source data 2.
• one piece of phoneme sequence information may be the probability distribution over the phonemes corresponding to sound source data 1, and the other piece may be the probability distribution over the phonemes corresponding to sound source data 2. For example, if the size of the phoneme set is 100, the two pieces of phoneme sequence information are each 100-dimensional vectors, the values of the vectors are in the range greater than or equal to 0 and less than or equal to 1, and the sum of the 100 dimensions is 1.
• for example, the two pieces of phoneme sequence information are each 100-dimensional vectors, where the probability value at the "x" position is the highest in the first piece of phoneme sequence information, and the probability value at the "t" position is the highest in the second piece of phoneme sequence information.
• after the two pieces of phoneme sequence information are determined, the output probability of the preset wake-up word's phoneme sequence "x i ao y i x i ao y i" is calculated for each piece of phoneme sequence information and the geometric mean is taken, so as to obtain the second confidence levels respectively corresponding to the two pieces of phoneme sequence information; when any second confidence level is greater than the first threshold value, it is determined that the pre-wake-up is successful.
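• The decoder step for phoneme modeling can be illustrated with the following NumPy sketch, which takes the geometric mean of the posteriors of the preset wake-up word's phoneme sequence as the second confidence level; the phoneme inventory, the one-frame-per-phoneme alignment and the threshold are illustrative simplifications of a real decoder.

```python
import numpy as np

# Sketch of the decoder step for multi-input multi-output phoneme modelling:
# for each separated sound source the model emits per-frame phoneme posteriors;
# the second confidence level is taken here as the geometric mean of the
# posteriors of the preset wake-up word's phoneme sequence. The phoneme
# inventory, alignment and threshold are illustrative simplifications.

PHONEMES = ["x", "i", "ao", "y", "t", "ian", "q", "zen", "m", "o", "ang"]
WAKE_PHONEMES = ["x", "i", "ao", "y", "i", "x", "i", "ao", "y", "i"]   # "xiaoyi xiaoyi"
FIRST_THRESHOLD = 0.5

def second_confidence(posteriors: np.ndarray) -> float:
    """posteriors: (frames, len(PHONEMES)), one row per wake-word phoneme position."""
    idx = [PHONEMES.index(p) for p in WAKE_PHONEMES]
    probs = posteriors[np.arange(len(idx)), idx]
    return float(np.exp(np.mean(np.log(probs + 1e-12))))   # geometric mean

# Toy posteriors for one separated source, peaked on the wake-word phonemes.
post = np.full((len(WAKE_PHONEMES), len(PHONEMES)), 0.01)
for t, p in enumerate(WAKE_PHONEMES):
    post[t, PHONEMES.index(p)] = 0.9
print(second_confidence(post) > FIRST_THRESHOLD)            # True -> pre-wake succeeds
```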
• the wake-up solution provided by the embodiment of the present application includes the technology of multi-task design of multi-input single-output whole-word modeling wake-up and direction estimation, and the first-level wake-up module is a multi-input single-output whole-word modeling wake-up module, as shown in Figure 18, including but not limited to the following aspects:
  • Feature input part including multiple sets of multi-channel feature data.
  • the multiple sets of multi-channel feature data include multi-channel feature data obtained by preprocessing the first microphone data and first separation data obtained by performing a first-stage separation process.
  • the conformer feature fusion model 181 (including the full-channel attention layer 182 ): used to fuse multiple sets of multi-channel feature data into single-channel feature data.
  • the wake-up information includes the respective first confidence levels corresponding to the separated multiple sound source data.
• the orientation information is in the form of a one-hot vector.
• the model calculates the probability of each wake-up event and of the garbage words, while the orientation estimation task only outputs the orientation information corresponding to the wake-up event; therefore, the orientation information is the output parameter of the orientation estimation task corresponding to a successful wake-up.
  • the wake word detection layer 184 and the orientation estimation layer 185 can be additional network modules arranged at the output of the dpconformer separation model 183, such as a layer of DNN or LSTM, followed by a linear layer and a Softmax layer of the corresponding dimension.
  • the output parameter (ie, wake-up information) of the wake-up word detection layer 184 is the detection probability of the wake-up word.
  • the output parameter (ie, position information) of the position estimation layer 185 is the probability distribution of position estimation vectors.
• the wake-up solution provided by the embodiment of the present application includes the technology of multi-task design of multi-input multi-output phoneme modeling wake-up and direction estimation, and the first-level wake-up module is a multi-input multi-output phoneme modeling wake-up module, as shown in Figure 19, including but not limited to the following aspects:
  • Feature input part including multiple sets of multi-channel feature data.
  • the multiple sets of multi-channel feature data include multi-channel feature data obtained by preprocessing the first microphone data and first separation data obtained by performing a first-stage separation process.
  • the conformer feature fusion model 191 (including the full-channel attention layer 192 ): used to fuse multiple sets of multi-channel feature data into single-channel feature data.
• (3) dpconformer separation model 193, multi-awakening phoneme sequence layer 194 and orientation estimation layer 195: the single-channel feature data is input into the dpconformer separation model 193 to output intermediate parameters; the intermediate parameters are input into the multi-awakening phoneme sequence layer 194 to output the wake-up information, and the intermediate parameters are input into the orientation estimation layer 195 to output the orientation estimation result.
• the wake-up information includes the phoneme sequence information corresponding to each of the multiple sound source data, and the orientation estimation result includes the orientation information corresponding to each piece of phoneme sequence information.
  • the phoneme sequence information is the posterior probability of the phoneme sequence
  • the posterior probability of the phoneme sequence is the product of the posterior probability values of the respective phonemes corresponding to the sound source data.
  • the output parameters include phoneme sequence information 1 of sound source data 1 , phoneme sequence information 2 of sound source data 2 , orientation information of phoneme sequence information 1 , and orientation information of phoneme sequence information 2 .
  • the multi-awakening phoneme sequence layer 194 and the orientation estimation layer 195 may be additional network modules, which are set at the output end of the dpconformer separation model 193 .
• the wake-up task and the orientation estimation task share the feature input part, the conformer feature fusion model 191 and the dpconformer separation model 193; the output parameters of the wake-up task include the phoneme sequence information corresponding to each of the multiple sound source data, and the output parameters of the orientation estimation task include the orientation information corresponding to each piece of phoneme sequence information. Finally, each piece of phoneme sequence information obtains the wake-up result, that is, the first confidence level, through the decoder.
• the above first-level wake-up schemes may be implemented in combination of any two, any three, or all of them, which is not limited in the embodiments of the present application.
  • the electronic device is a device with a single microphone
  • the voice wake-up method is a single-channel two-level separation and two-level wake-up method.
  • the method can be used in a near-field wake-up scenario of an electronic device.
  • the false wake-up rate can be reduced while ensuring that the wake-up function has a high wake-up rate.
  • the electronic device includes a first-level separation module 201 , a first-level wake-up module 202 , a second-level separation module 203 and a second-level wake-up module 204 .
• the electronic device collects the original first microphone data (such as background music, echo, speech 1, speech 2, speech K, and ambient noise) through a single microphone, inputs the first microphone data to the preprocessing module 205 for preprocessing to obtain multi-channel feature data, inputs the multi-channel feature data to the first-stage separation module 201 for first-stage separation processing to obtain the first separation data, and inputs the multi-channel feature data and the first separation data to the first-stage wake-up module 202 to perform first-level wake-up processing to obtain the first wake-up data.
• the electronic device determines whether to pre-wake according to the first wake-up data. If it is determined that the pre-wake-up is successful, the multi-channel feature data and the first separation data are input to the second-stage separation module 203 for second-stage separation processing to obtain the second separation data, and the multi-channel feature data, the first separation data and the second separation data are input to the second-level wake-up module 204 for second-level wake-up processing to obtain the second wake-up data. The electronic device determines whether the wake-up is successful according to the second wake-up data.
  • the preprocessing module includes an acoustic echo cancellation module.
  • the output parameters of the acoustic echo cancellation module are used as multi-channel feature data and input to the subsequent separation module and wake-up module.
  • the preprocessing module includes an acoustic echo cancellation module and a de-reverberation module.
  • the output parameters of the acoustic echo cancellation module are input to the de-reverberation module, and the output parameters of the de-reverberation module are used as multi-channel feature data to be input to the subsequent separation module and wake-up module.
  • the first-level wake-up module and the second-level wake-up module are both the above-mentioned multi-input single-output whole-word modeling wake-up modules.
  • the first-level wake-up module and the second-level wake-up module are both the above-mentioned multiple-input multiple-output phoneme modeling wake-up modules.
  • the two-level wake-up module needs to support the confirmation function of a specific person.
• the multiple sound source data output by the second-stage separation module 203 and the registered voice data of the target object are input to the speaker identification module (Speaker Identification, SID) 210, which is used to confirm whether the separated multiple sound source data include the registered voice data.
  • the speaker identification module 210 is used as a separate network module, which is different from the second-level wake-up module 204.
• when the speaker confirmation module 210 confirms that the separated multiple sound source data include the registered voice data, it is determined that the wake-up is successful; otherwise, the wake-up fails.
• when the speaker confirmation module 210 is integrated in the second-level wake-up module 204, the multiple sound source data output by the second-level separation module 203 and the registered voice data of the target object (that is, the registered voice) are input into the second-level wake-up module 204 (including the speaker confirmation module 210) to output the second wake-up data and the object confirmation result; when the second wake-up data indicates that the wake-up is successful and the object confirmation result indicates that the sound source data of the target object exists in the output sound source data, it is determined that the wake-up is successful; otherwise, the wake-up fails.
  • the object confirmation result is used to indicate whether there is sound source data of the target object in the output sound source data, that is, the object confirmation result is used to indicate whether the current wake-up event is caused by the target object.
  • the object confirmation result includes one of a first identification and a second identification; the first identification is used to indicate that the sound source data of the target object exists in the output sound source data, and the second identification is used to indicate that the sound source data of the target object does not exist in the output sound source data.
  • when the second wake-up data indicates that the wake-up is successful and the object confirmation result is the first identification, it is determined that the wake-up is successful; otherwise, the wake-up fails.
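As a small illustration of the combined decision above, the sketch below assumes the second wake-up data is a list of per-source confidences and the object confirmation result is a boolean flag; both representations and the threshold value are assumptions rather than details from the disclosure.

```python
def confirm_wakeup(second_confidences, object_confirmed, threshold=0.8):
    # Wake up only if some separated source scores above the wake-up threshold
    # AND the speaker confirmation found the target object's voice.
    wake = any(c > threshold for c in second_confidences)
    return wake and object_confirmed
```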
  • the first-level separation module 201 is replaced by a first-level specific person extraction module 231
  • the second-level separation module 203 is replaced by a second-level specific person extraction module 232.
  • the wake-up module 202 outputs the first wake-up data.
  • the multi-channel feature data, the first sound source data of the target object, and the registered voice data of the target object are input into the second-level specific person extraction module 232.
  • the second-level specific person extraction module 232 outputs the second sound source data of the target object, and the multi-channel feature data, the first sound source data of the target object, the second sound source data, and the registered voice data of the target object are then input into the second-level wake-up module 204 (including the speaker confirmation module 210).
  • the second wake-up data and the object confirmation result are output.
  • the electronic device is a device with multiple microphones
  • the voice wake-up method is a multi-channel two-level separation and two-level wake-up method.
  • the method can be used in an electronic device with multiple microphones, the electronic device being used to respond to a preset wake word.
  • the electronic device includes a first-level separation module 241 , a first-level wake-up module 242 , a second-level separation module 243 and a second-level wake-up module 244 .
  • the electronic device collects the original first microphone data (such as background music, echoes, voice 1 and voice 2 in the same direction, voice K, and ambient noise) through multiple microphones, and inputs the first microphone data to the preprocessing module 245 for preprocessing.
  • the preprocessing yields multi-channel feature data; the multi-channel feature data is input to the first-stage separation module 241 for first-stage separation processing to obtain the first separation data; the multi-channel feature data and the first separation data are input to the first-stage wake-up module 242, which performs first-level wake-up processing to obtain the first wake-up data.
  • the electronic device determines whether to pre-wake according to the first wake-up data.
  • the multi-channel feature data and the first separation data are input to the second-stage separation module 243 for second-stage separation processing to obtain the second separation data; the multi-channel feature data, the first separation data and the second separation data are then input to the second-level wake-up module 244 for second-level wake-up processing to obtain the second wake-up data.
  • the electronic device determines whether the wake-up is successful according to the second wake-up data.
  • the preprocessing module includes an acoustic echo cancellation module.
  • the preprocessing module includes an acoustic echo cancellation module and a de-reverberation module.
  • the preprocessing module includes an acoustic echo cancellation module, a de-reverberation module and a beam filtering module.
  • beam filtering in multiple directions is performed to obtain multiple sets of multi-channel feature data, including the beam-filtering output parameters for each direction, the de-reverberated multi-microphone data, and the inter-channel phase difference (IPD) of the scene; this feature data is input to the subsequent separation module and wake-up module.
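The sketch below illustrates, under simplifying assumptions, how two of the feature groups mentioned above (inter-channel phase differences and fixed beams in several look directions) could be computed from a multi-microphone STFT. The shapes, the reference-microphone convention and the precomputed steering vectors are hypothetical; the code only indicates the kind of multi-channel feature data meant here.

```python
import numpy as np

def ipd_features(stft_mics, ref_ch=0):
    """Inter-channel phase difference (IPD) relative to a reference microphone.
    stft_mics: complex STFT with shape (mics, frames, freq_bins)."""
    phase = np.angle(stft_mics)
    return phase - phase[ref_ch:ref_ch + 1]            # (mics, frames, bins)

def fixed_beams(stft_mics, steering_vectors):
    """Delay-and-sum style fixed beams for several look directions.
    steering_vectors: (directions, mics, freq_bins), assumed precomputed
    from the microphone geometry."""
    return np.einsum('dmf,mtf->dtf', np.conj(steering_vectors), stft_mics)
```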
  • the first-level wake-up module and the second-level wake-up module are both the above-mentioned multi-input single-output whole-word modeling wake-up modules.
  • the first-level wake-up module and the second-level wake-up module are both the above-mentioned multiple-input multiple-output phoneme modeling wake-up modules.
  • the separation task can be designed as a multi-task together with the localization task, and the wake-up task can likewise be designed as a multi-task together with the localization task.
  • the execution subject of the separation task is a directional feature extractor, and the directional feature extractor can be integrated in the separation module or the wake-up module, and finally outputs the separated multiple sound source data and the orientation information corresponding to each of the multiple sound source data.
  • the output parameters of the first-stage separation module include stream-separated multiple sound source data and their corresponding orientation information.
  • the output parameters of the first-stage separation module can be provided to the first-stage wake-up module, the second-stage separation module, and the second-stage wake-up module.
  • the multiple sound source data output by the first-stage separation module can also be provided to the acoustic event detection module to determine whether the current sound source data contains a specific acoustic event, to the speaker confirmation module, or to both.
  • the speaker confirmation module is used to determine the identity information corresponding to each current sound source data.
  • the multiple orientation information output by the first-stage separation module can be provided to the system interactive control module to display the respective orientations of the multiple sound source data in real time.
  • the output parameters of the first-level wake-up module include the stream-separated multiple sound source data, the orientation information corresponding to each of the multiple sound source data, and the object confirmation results, which can be used to determine whether the current wake-up event is caused by the target object and the orientation information corresponding to the wake-up event.
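A PyTorch-style sketch of such a multi-task output head is given below: one branch scores the wake word for each separated source and another classifies its azimuth into discrete bins (here 36 bins at 10-degree resolution, as mentioned elsewhere in the description). The layer sizes and the use of simple linear heads are assumptions for illustration, not the disclosed model.

```python
import torch
import torch.nn as nn

class WakeupWithDirectionHead(nn.Module):
    """Illustrative multi-task head: a wake-word score and an azimuth
    distribution per separated source. Dimensions are assumed values."""
    def __init__(self, feat_dim=256, num_directions=36):
        super().__init__()
        self.wake_head = nn.Linear(feat_dim, 1)              # wake-word score
        self.dir_head = nn.Linear(feat_dim, num_directions)  # azimuth bins

    def forward(self, source_embeddings):
        # source_embeddings: (batch, num_sources, feat_dim), pooled over time
        wake_logits = self.wake_head(source_embeddings).squeeze(-1)  # (B, S)
        dir_logits = self.dir_head(source_embeddings)                # (B, S, 36)
        return torch.sigmoid(wake_logits), dir_logits.softmax(dim=-1)
```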
  • the multiple pieces of azimuth information output by the first-level wake-up module can be provided to the back-end system to determine the main azimuth of the target object and to select the corresponding sound source data for speech recognition.
  • the output parameters of the second-stage separation module include offline-separated multiple sound source data, orientation information corresponding to multiple sound source data, and object confirmation results.
  • the output parameters of the second-stage separation module can be used for system debugging to determine the quality of the separation results.
  • the second-level offline wake-up can be multi-task designed with speaker recognition and orientation estimation; the effect of offline wake-up is better than that of real-time streaming wake-up.
  • the output parameters of the second-level wake-up module include offline-separated multiple sound source data, orientation information corresponding to each of the multiple sound source data, and object confirmation results.
  • Orientation information can be used as supplementary information for wake-up events for subsequent wake-up direction enhancement tasks and speech recognition.
  • a schematic diagram of the multi-task design of the second-level offline wake-up and wake-up orientation estimation is shown in FIG. 25; the second-level wake-up module 244 may adopt a wake-up model in the form of multiple input and multiple output, or multiple input and single output, and finally outputs the separated multiple sound source data and the orientation information corresponding to each of the multiple sound source data.
  • a schematic diagram of the multi-task design of the second-level offline wake-up and speaker confirmation is shown in FIG. 26.
  • the speaker confirmation module 261 is integrated in the second-level wake-up module.
  • the multiple sound source data output by the first-level separation module 241, the multiple sound source data output by the second-level separation module 243, and the registered voice data of the target object (that is, the registered voice) are input to the second-level wake-up module 244; the second wake-up data and the object confirmation result are output, and when the second wake-up data indicates that the wake-up is successful and the object confirmation result indicates that the sound source data of the target object exists in the output sound source data, it is determined that the wake-up is successful; otherwise, the wake-up fails.
  • this scenario also supports the combined use of neural network-based separation and traditional beam technology.
  • the first separation data and the second separation data can also be input to an adaptive beamforming module, such as a minimum variance distortionless response (MVDR) beam filter, and used to calculate the noise/interference covariance matrix, so as to obtain a better spatial interference suppression effect.
  • the beam-filtered output parameters of multiple sound source data can be used as new sound source data and input to the first-level wake-up module and/or the second-level wake-up module as additional feature data to enhance the wake-up effect.
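The items above describe feeding separation outputs into an adaptive (e.g., MVDR) beamformer. The following sketch assumes the interference covariance is estimated from the STFT of the separated interfering sources; the helper names, array shapes and the use of a pseudo-inverse are assumptions made only to show the usual MVDR computation.

```python
import numpy as np

def estimate_noise_cov(noise_stft):
    """Per-bin covariance of the separated interference.
    noise_stft: (mics, frames, bins) complex STFT."""
    x = np.transpose(noise_stft, (2, 0, 1))                  # (bins, mics, frames)
    return np.einsum('bmt,bnt->bmn', x, np.conj(x)) / x.shape[-1]

def mvdr_weights(noise_cov, steering):
    """MVDR weights per frequency bin: w = R_n^-1 d / (d^H R_n^-1 d).
    noise_cov: (bins, mics, mics); steering: (bins, mics)."""
    weights = []
    for f in range(noise_cov.shape[0]):
        rn_inv = np.linalg.pinv(noise_cov[f])
        d = steering[f]
        weights.append(rn_inv @ d / (np.conj(d) @ rn_inv @ d))
    return np.stack(weights)                                 # (bins, mics)
```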
  • the first separation data is input into the adaptive beamforming module 271 to obtain the first filtering data, and the multi-channel feature data, the first separation data and the first filtering data are input to the first-stage wake-up module 242 to obtain the first wake-up data.
  • the multi-channel feature data and the first separation data are input to the second-stage separation module 243 to obtain the second separation data.
  • the second separation data is input into the adaptive beamforming module 272 to obtain the second filtering data.
  • the multi-channel feature data, the first separation data, the second separation data and the second filtering data are input into the second-level wake-up module 244 to obtain the second wake-up data, and whether the wake-up is successful is determined according to the second wake-up data.
  • this scenario also supports a full neural network multi-sound wake-up scheme.
  • the original first microphone data and the calculated multi-channel feature data are input to the subsequent separation module and wake-up module.
  • the first-stage separation module and the second-stage separation module need to consider echo scenarios, so they need to receive echo reference signals to deal with the echo problem.
  • the voice wake-up method can be run on a chip equipped with dedicated neural network acceleration, such as a GPU or a Tensor Processing Unit (TPU), so as to obtain a better algorithm acceleration effect.
  • the original first microphone data, the calculated multi-channel feature data and the echo reference data are input to the first-stage separation module 241, which outputs the first separation data; the first microphone data, the multi-channel feature data and the first separation data are input into the first-level wake-up module 242, which outputs the first wake-up data.
  • the first microphone data, the multi-channel feature data, the first separation data and the echo reference signal are input into the second-stage separation module 243 to obtain the second separation data; the first microphone data, the multi-channel feature data, the first separation data and the second separation data are input into the second-level wake-up module 244 to output the second wake-up data, and whether the wake-up is successful is determined according to the second wake-up data.
  • on the one hand, the voice wake-up method provided by the embodiments of the present application builds on conformer self-attention modeling and provides a dual-path conformer network structure; by alternating the conformer layer computation within blocks and between blocks, it can model long sequences while avoiding the increase in computation caused by using the conformer directly, and the strong modeling ability of the conformer network significantly improves the separation effect.
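A PyTorch-style sketch of the dual-path (intra-block / inter-block) computation is shown below. A standard TransformerEncoderLayer stands in for a conformer layer, and the chunking parameters are illustrative; the sketch only conveys the alternating within-block and between-block processing described above, not the disclosed network.

```python
import torch
import torch.nn as nn

def segment(features, chunk_len):
    """Split a (batch, frames, dim) sequence into 50%-overlapping chunks."""
    hop = chunk_len // 2
    chunks = features.unfold(1, chunk_len, hop)        # (B, K, dim, chunk_len)
    return chunks.permute(0, 1, 3, 2).contiguous()     # (B, K, chunk_len, dim)

class DualPathBlock(nn.Module):
    """One dual-path block: attention within each chunk, then across chunks
    at the same intra-chunk position. Dimensions are assumed values."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.intra = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.inter = nn.TransformerEncoderLayer(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (batch B, num_chunks K, chunk_len P, dim N)
        b, k, p, n = x.shape
        x = self.intra(x.reshape(b * k, p, n)).reshape(b, k, p, n)   # within chunks
        x = x.transpose(1, 2)                                        # (B, P, K, N)
        x = self.inter(x.reshape(b * p, k, n)).reshape(b, p, k, n)   # across chunks
        return x.transpose(1, 2)                                     # (B, K, P, N)
```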
  • a conformer-based fusion mechanism for multiple sets of multi-channel feature data is provided.
  • for multiple sets of multi-channel features, the first attention feature data is first calculated within each group, and the second attention feature data is then calculated between groups, so that the model can better learn the contribution of each feature to the final separation effect, further guaranteeing the subsequent separation effect.
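The two-stage attention fusion can be sketched as follows, assuming every feature group has already been projected to a common embedding dimension; the pooling choices, dimensions and the use of nn.MultiheadAttention are illustrative assumptions rather than the disclosed fusion model.

```python
import torch
import torch.nn as nn

class TwoStageFeatureFusion(nn.Module):
    """First attention within each feature group, then attention across the
    group summaries, pooled down to a single-channel feature."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.intra = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.inter = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, groups):
        # groups: list of tensors, each (batch*frames, channels_in_group, dim)
        summaries = []
        for g in groups:
            att, _ = self.intra(g, g, g)           # first attention: within group
            summaries.append(att.mean(dim=1))      # pool channels -> (B*T, dim)
        stacked = torch.stack(summaries, dim=1)    # (B*T, num_groups, dim)
        fused, _ = self.inter(stacked, stacked, stacked)   # second: across groups
        return fused.mean(dim=1)                   # single-channel feature (B*T, dim)
```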
  • a two-stage separation scheme is provided, namely a streaming separation process for the first-stage wake-up and an offline separation process for the second-stage wake-up; since the second-stage separation module can additionally use the first separation data output by the first-stage separation module as an input parameter, the separation effect is further enhanced.
  • a wake-up module in the form of multiple inputs is provided; compared with the single-input wake-up module in the related art, it saves computation and avoids the significant increase in computation and the waste caused by repeatedly calling the wake-up model, and the wake-up performance is greatly improved because the correlation between the various input parameters is better exploited.
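A toy sketch of the multi-input idea: rather than invoking a single-input wake-up model once per stream, the available streams are concatenated and scored in one forward pass. Concatenation along the feature axis is only an assumption about how the inputs might be combined.

```python
import torch

def multi_input_wakeup(model, streams):
    """streams: list of (batch, frames, dim_i) tensors, e.g. multi-channel
    features plus the first/second separation data; `model` is any module
    accepting the concatenated tensor (an illustrative assumption)."""
    x = torch.cat(streams, dim=-1)    # (batch, frames, sum of feature dims)
    return model(x)                   # one call instead of len(streams) calls
```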
  • a multi-task design solution for the sound source wake-up task and other tasks is provided, where the other tasks include at least one of a sound source localization task, a specific person extraction task, a specific direction extraction task, and a specific person confirmation task.
  • Correlating the sound source wake-up result with other information and providing it to downstream tasks improves the output effect of the wake-up module (ie, the first-level wake-up module and/or the second-level wake-up module).
  • other tasks are sound source localization tasks, and the output wake-up data includes multiple sound source data and their corresponding orientation information, so that the wake-up module can provide more accurate orientation information while providing sound source wake-up results.
  • other tasks extract tasks for a specific person, and the output wake-up data includes the sound source data of the target object, so that the electronic device only responds to the wake-up of a specific person (ie, the target object), further reducing the false wake-up rate.
  • other tasks are specific direction extraction tasks, and the output wake-up data includes at least one sound source data in the target direction, so that the electronic device can only respond to wake-up in a specific direction (ie, the target direction), further reducing the false wake-up rate.
  • taking a robot as an example, the other tasks are a specific person extraction task and a sound source localization task, and the output wake-up data includes the sound source data of the target object and the orientation information of that sound source data, so that the robot responds only to the wake-up of a specific person (that is, the target object) and determines the orientation of that person when awakened; the robot can then adjust its orientation to face the specific person, which helps it better receive subsequent commands.
  • FIG. 29 shows a flowchart of a voice wake-up method provided by another exemplary embodiment of the present application. This embodiment is exemplified by using the method in the electronic device shown in FIG. 2 . The method includes the following steps.
  • Step 2901 Obtain original first microphone data.
  • Step 2902 Perform first-level processing according to the first microphone data to obtain first wake-up data, where the first-level processing includes first-level separation processing and first-level wake-up processing based on a neural network model.
  • Step 2903 when the first wake-up data indicates that the pre-wakeup is successful, perform second-level processing according to the first microphone data to obtain second-level wake-up data, and the second-level processing includes second-level separation processing and second-level wake-up processing based on the neural network model.
  • Step 2904 Determine the wake-up result according to the second wake-up data.
  • FIG. 30 shows a block diagram of a voice wake-up device provided by an exemplary embodiment of the present application.
  • the apparatus can be implemented as one or more chips through software, hardware or a combination of the two, or as a voice wake-up system, or as all or a part of the electronic device provided in FIG. 2 .
  • the apparatus may include: an acquisition module 3010, a first-level processing module 3020, a second-level processing module 3030, and a determination module 3040;
  • an acquisition module 3010 configured to acquire original first microphone data
  • a first-level processing module 3020 configured to perform first-level processing according to the first microphone data to obtain first wake-up data, where the first-level processing includes first-level separation processing and first-level wake-up processing based on a neural network model;
  • the second-level processing module 3030 is configured to perform second-level processing according to the first microphone data to obtain second-level wake-up data when the first wake-up data indicates that the pre-awakening is successful, and the second-level processing includes second-level separation processing based on a neural network model and second-level wake-up processing;
  • the determining module 3040 is configured to determine the wake-up result according to the second wake-up data.
  • the apparatus further includes a preprocessing module, and the first-level processing module 3020 further includes a first-level separation module and a first-level wake-up module;
  • a preprocessing module used for preprocessing the first microphone data to obtain multi-channel feature data
  • a first-stage separation module used for performing first-stage separation processing according to the multi-channel feature data, and outputting the first separation data
  • the first-level wake-up module is configured to perform the first-level wake-up processing according to the multi-channel characteristic data and the first separation data, and output the first wake-up data.
  • the second-level processing module 3030 further includes a second-level separation module and a second-level wake-up module;
  • the second-stage separation module is configured to perform a second-stage separation process according to the multi-channel feature data and the first separation data when the first wake-up data indicates that the pre-wake-up is successful, and output the second separation data;
  • the second-level wake-up module is configured to perform the second-level wake-up processing according to the multi-channel characteristic data, the first separation data and the second separation data, and output the second wake-up data.
  • the first-level separation processing is a streaming sound source separation processing
  • the first-level wake-up processing is a streaming sound source wake-up processing
  • the second-level separation processing is offline sound source separation processing
  • the second-level wake-up processing is offline sound source wake-up processing
  • the first-level wake-up module includes a wake-up model in the form of multiple-input single-output or multiple-input multiple-output; and/or,
  • the second-level wake-up module includes multiple-input single-output or multiple-input multiple-output wake-up models.
  • the first-stage separation module and/or the second-stage separation module adopts a dual-path conformer network structure.
  • the first-stage separation module and/or the second-stage separation module is a separation module for performing at least one task, and the at least one task includes a separate sound source separation task, or includes sound source separation tasks and other tasks;
  • the other tasks include at least one of a sound source localization task, a specific person extraction task, a specific direction extraction task, and a specific person confirmation task.
  • the first-level wake-up module and/or the second-level wake-up module is a wake-up module for executing at least one task, and the at least one task includes a single wake-up task, or includes a wake-up task and other tasks ;
  • the other tasks include at least one of a sound source localization task, a specific person extraction task, a specific direction extraction task, and a specific person confirmation task.
  • the first-stage separation module includes a first-stage multi-feature fusion model and a first-stage separation model; the first-stage separation module is further used for:
  • the multi-channel feature data is input into the first-stage multi-feature fusion model to obtain first single-channel feature data; the first single-channel feature data is input to the first-stage separation model to obtain the first separation data.
  • the second-level separation module includes a second-level multi-feature fusion model and a second-level separation model; the second-level separation module is further used for:
  • the multi-channel feature data and the first separation data are input into the second-stage multi-feature fusion model to obtain second single-channel feature data; the second single-channel feature data is input to the second-stage separation model to obtain the second separation data.
  • the first-level wake-up module includes a first-level wake-up model in the form of multiple-input single-output, and the first-level wake-up module is further configured to:
  • the multi-channel feature data and the first separation data are input into the first-level wake-up model and output to obtain first wake-up data.
  • the first wake-up data includes a first confidence level, and the first confidence level is used to indicate the probability that the original first microphone data includes the preset wake-up word.
  • the first-level wake-up module includes a multiple-input multiple-output first wake-up model and a first post-processing module, and the first-level wake-up module is further configured to:
  • the multi-channel feature data and the first separation data are input into the first wake-up model to obtain the phoneme sequence information corresponding to each of the plurality of sound source data, and the phoneme sequence information is input into the first post-processing module to obtain the first wake-up data; the first wake-up data includes second confidence levels corresponding to each of the plurality of sound source data, and the second confidence level is used to indicate the acoustic feature similarity between the sound source data and the preset wake-up word.
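For the multiple-input multiple-output (phoneme-modeling) case, the post-processing step can be sketched as below: the phoneme posteriors of each separated source are reduced to a confidence for the wake-word phoneme sequence. The greedy left-to-right search is a deliberately simplified stand-in for decoding with a personalized decoding graph, and all names and shapes are assumptions.

```python
import numpy as np

def wakeword_confidence(posteriors, wake_phoneme_ids):
    """posteriors: (frames, num_phonemes) for one separated source;
    wake_phoneme_ids: phoneme sequence of the (possibly user-defined) wake word."""
    t, scores = 0, []
    for pid in wake_phoneme_ids:
        if t >= len(posteriors):
            return 0.0
        best = t + int(np.argmax(posteriors[t:, pid]))   # best frame at/after t
        scores.append(posteriors[best, pid])
        t = best + 1
    return float(np.mean(scores))                        # second confidence

def per_source_confidences(all_posteriors, wake_phoneme_ids):
    """all_posteriors: list of (frames, num_phonemes) arrays, one per source."""
    return [wakeword_confidence(p, wake_phoneme_ids) for p in all_posteriors]
```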
  • the second-level wake-up module includes a second-level wake-up model in the form of multiple-input single-output, and the second-level wake-up module is further configured to:
  • the multi-channel feature data, the first separation data and the second separation data are input into the second-level wake-up model and output to obtain the second wake-up data.
  • the second wake-up data includes a third confidence level, and the third confidence level is used to indicate the probability that the original first microphone data includes the preset wake-up word.
  • the second-level wake-up module includes a multiple-input multiple-output second wake-up model and a second post-processing module, and the second-level wake-up module is further configured to: input the multi-channel feature data, the first separation data and the second separation data into the second wake-up model to obtain the phoneme sequence information corresponding to each of the plurality of sound source data; and input the phoneme sequence information into the second post-processing module to obtain the second wake-up data, where the second wake-up data includes fourth confidence levels corresponding to each of the plurality of sound source data, and the fourth confidence level is used to indicate the acoustic feature similarity between the sound source data and the preset wake-up word.
  • An embodiment of the present application provides an electronic device, the electronic device includes: a processor; a memory for storing instructions executable by the processor; wherein the processor is configured to implement the above method executed by the electronic device when executing the instructions.
  • Embodiments of the present application provide a computer program product, including computer-readable codes, or a non-volatile computer-readable storage medium carrying computer-readable codes, when the computer-readable codes are executed in a processor of an electronic device , the processor in the electronic device executes the above method executed by the electronic device.
  • An embodiment of the present application provides a voice wake-up system, where the voice wake-up system is used to execute the above method executed by an electronic device.
  • Embodiments of the present application provide a non-volatile computer-readable storage medium, on which computer program instructions are stored, and when the computer program instructions are executed by a processor, the foregoing method executed by an electronic device is implemented.
  • a computer-readable storage medium may be a tangible device that can hold and store instructions for use by the instruction execution device.
  • the computer-readable storage medium may be, for example, but not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
  • Computer-readable storage media include: portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disc read-only memory (CD-ROM), digital video discs (DVD), memory sticks, floppy disks, and mechanically encoded devices such as punch cards or raised structures in grooves on which instructions are stored, as well as any suitable combination of the foregoing.
  • Computer readable program instructions or code described herein may be downloaded to various computing/processing devices from a computer readable storage medium, or to an external computer or external storage device over a network such as the Internet, a local area network, a wide area network and/or a wireless network.
  • the network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers.
  • a network adapter card or network interface in each computing/processing device receives computer-readable program instructions from a network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in each computing/processing device .
  • the computer program instructions used to perform the operations of the present application may be assembly instructions, Instruction Set Architecture (ISA) instructions, machine instructions, machine-related instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, and conventional procedural programming languages such as the "C" language or similar programming languages.
  • the computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server implement.
  • the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (e.g., through the Internet using an Internet service provider).
  • electronic circuits, such as programmable logic circuits, Field-Programmable Gate Arrays (FPGA), or Programmable Logic Arrays (PLA), can be personalized by utilizing the state information of the computer-readable program instructions, and such electronic circuits can execute the computer-readable program instructions to implement various aspects of the present application.
  • These computer-readable program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine, so that the instructions, when executed by the processor of the computer or other programmable data processing apparatus, produce means for implementing the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
  • These computer-readable program instructions can also be stored in a computer-readable storage medium; these instructions cause a computer, programmable data processing apparatus and/or other equipment to operate in a specific manner, so that the computer-readable medium on which the instructions are stored comprises an article of manufacture that includes instructions implementing various aspects of the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
  • Computer-readable program instructions can also be loaded onto a computer, other programmable data processing apparatus, or other equipment, so that a series of operational steps are performed on the computer, other programmable data processing apparatus, or other equipment to produce a computer-implemented process, and the instructions executing on the computer, other programmable data processing apparatus, or other device thereby implement the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
  • each block in the flowcharts or block diagrams may represent a module, a segment, or a portion of instructions that comprises one or more executable instructions for implementing the specified logical function(s).
  • the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented in hardware (e.g., circuits or Application-Specific Integrated Circuits (ASICs)) that performs the corresponding functions or actions, or can be implemented by a combination of hardware and software, such as firmware.

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Measurement Of Velocity Or Position Using Acoustic Or Ultrasonic Waves (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Cable Transmission Systems, Equalization Of Radio And Reduction Of Echo (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

This application relates to the field of terminal technologies, and in particular to a voice wake-up method, apparatus, storage medium and system. The method includes: obtaining original first microphone data; performing first-level processing on the first microphone data to obtain first wake-up data, where the first-level processing includes first-level separation processing and first-level wake-up processing based on a neural network model; when the first wake-up data indicates that the pre-wake is successful, performing second-level processing on the first microphone data to obtain second wake-up data, where the second-level processing includes second-level separation processing and second-level wake-up processing based on a neural network model; and determining a wake-up result according to the second wake-up data. By designing a two-level separation and wake-up scheme, the embodiments of this application make a pre-wake decision with the first-level separation and wake-up scheme in the first-level scenario and, after the pre-wake succeeds, confirm the wake-up again in the second-level scenario, thereby reducing the false wake-up rate while maintaining a high wake-up rate.

Description

语音唤醒方法、装置、存储介质及系统
本申请要求于2021年03月31日提交中国专利局、申请号为202110348176.6、申请名称为“语音唤醒方法、装置、存储介质及系统”中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及终端技术领域,尤其涉及一种语音唤醒方法、装置、存储介质及系统。
背景技术
随着智能语音交互的兴起,越来越多的电子设备支持语音交互功能。其中,语音唤醒作为语音交互的开始,在不同的电子设备中应用广泛,例如智能音箱、智能电视等。当用户所处空间存在支持语音唤醒的电子设备,用户发出唤醒语音后,被唤醒的电子设备会响应说话人的请求,与用户进行交互。
相关技术中,为了提高电子设备的唤醒率,可以对电子设备中的唤醒模块进行多条件训练,并采用训练后的唤醒模块进行语音唤醒;或者,可以采用麦克风阵列处理技术进行语音唤醒;或者,可以采用传统的声源分离技术进行语音唤醒。
通过上述方法,在唤醒率上虽然已经有了一定的进度,但是在存在背景噪音的情况下,对人声识别就会比较差,特别是在多声源干扰或强声源干扰或远场回声场景时,唤醒率会更低,电子设备的语音唤醒效果较差。
发明内容
有鉴于此,提出了一种语音唤醒方法、装置、存储介质及系统。本申请实施例通过设计两级分离和唤醒方案,在第一级场景下通过第一级分离和唤醒方案进行预唤醒判断,在预唤醒成功后在第二级场景下再次进行唤醒确认,保证较高的唤醒率的同时降低误唤醒率,从而得到更好的语音唤醒效果。
第一方面,本申请实施例提供了一种语音唤醒方法,所述方法包括:
获取原始的第一麦克风数据;
根据所述第一麦克风数据进行第一级处理得到第一唤醒数据,所述第一级处理包括基于神经网络模型的第一级分离处理和第一级唤醒处理;
当所述第一唤醒数据指示预唤醒成功时根据所述第一麦克风数据进行第二级处理得到第二唤醒数据,所述第二级处理包括基于神经网络模型的第二级分离处理和第二级唤醒处理;
根据所述第二唤醒数据确定唤醒结果。
在该实现方式中,设计了两级分离和唤醒方案,在第一级场景下在对原始的第一麦克风数据进行第一级分离处理和第一级唤醒处理后得到第一唤醒数据,根据第一唤醒数据进行预唤醒判断,第一级分离和唤醒方案可以保证唤醒率尽量高,但也会带来 较高的误唤醒率,因此当第一唤醒数据指示预唤醒成功时,在第二级场景下再对第一麦克风数据进行第二级分离处理和第二级唤醒处理,即对第一麦克风数据再次进行唤醒确认,这样可以得到更好的分离性能,保证了较高的唤醒率的同时降低误唤醒率,从而得到更好的语音唤醒效果。
结合第一方面,在第一方面的一种可能的实现方式中,所述根据所述第一麦克风数据进行第一级处理得到第一唤醒数据,包括:
对所述第一麦克风数据进行预处理得到多通道特征数据;
根据所述多通道特征数据,调用预先训练完成的第一级分离模块输出得到第一分离数据,所述第一级分离模块用于进行所述第一级分离处理;
根据所述多通道特征数据和所述第一分离数据,调用预先训练完成的第一级唤醒模块输出得到所述第一唤醒数据,所述第一级唤醒模块用于进行所述第一级唤醒处理。
在该实现方式中,对第一麦克风数据进行预处理得到多通道特征数据,从而可以先根据多通道特征数据调用第一级分离模块输出得到第一分离数据,再根据多通道特征数据和第一分离数据调用第一级唤醒模块输出得到第一唤醒数据,实现在第一级场景下对第一麦克风数据的第一级分离处理和第一级唤醒处理,保证预唤醒的唤醒率尽量高。
结合第一方面的第一种可能的实现方式,在第一方面的第二种可能的实现方式中,所述当所述第一唤醒数据指示预唤醒成功时根据所述第一麦克风数据进行第二级处理得到第二唤醒数据,包括:
当所述第一唤醒数据指示预唤醒成功时,根据所述多通道特征数据和所述第一分离数据调用预先训练完成的第二级分离模块输出得到第二分离数据,所述第二级分离模块用于进行所述第二级分离处理;
根据所述多通道特征数据、所述第一分离数据和所述第二分离数据,调用预先训练完成的第二级唤醒模块输出得到所述第二唤醒数据,所述第二级唤醒模块用于进行所述第二级唤醒处理。
在该实现方式中,当第一唤醒数据指示预唤醒成功时,根据多通道特征数据和第一分离数据调用第二级分离模块输出得到第二分离数据,根据多通道特征数据、第一分离数据和第二分离数据调用第二级唤醒模块输出得到第二唤醒数据,实现在第二级场景下基于第一级分离模块输出的第一分离数据对第一麦克风数据的第二级分离处理和第二级唤醒处理,即对第一麦克风数据再次进行唤醒确认,保证了较高的唤醒率的同时降低误唤醒率,进一步提高了语音唤醒效果。
结合第一方面的第二种可能的实现方式,在第一方面的第三种可能的实现方式中,所述第一级分离处理为流式的声源分离处理,所述第一级唤醒处理为流式的声源唤醒处理;和/或,
所述第二级分离处理为离线的声源分离处理,所述第二级唤醒处理为离线的声源唤醒处理。
在该实现方式中,第一级场景为第一级流式场景,第二级场景为第二级离线场景,由于第一级分离和唤醒方案是流式设计的,一般会损失分离性能,保证唤醒率尽量高,但也会带来较高的误唤醒率,因此当第一唤醒数据指示预唤醒成功时,在第二级离线 场景下再对第一麦克风数据进行离线的第二级分离处理和第二级唤醒处理,这样可以得到更好的分离性能,保证了较高的唤醒率的同时降低误唤醒率,进一步提高了语音唤醒效果。
结合第一方面的第二种可能的实现方式或第三种可能的实现方式,在第一方面的第四种可能的实现方式中,
所述第一级唤醒模块包括多输入单输出形式或者多输入多输出形式的唤醒模型;和/或,
所述第二级唤醒模块包括多输入单输出形式或者多输入多输出形式的唤醒模型。
在该实现方式中,第一级唤醒模块和/或第二级唤醒模块为多输入的唤醒模块,与相关技术中单输入的唤醒模块相比,不但可以节省计算量,避免多次重复调用唤醒模型带来的计算量显著增加和浪费问题;而且,由于更好的利用各个输入参数的相关性,大大提高了唤醒性能。
结合第一方面的第二种可能的实现方式至第四种可能的实现方式中的任意一种可能的实现方式,在第一方面的第五种可能的实现方式中,所述第一级分离模块和/或所述第二级分离模块采用对偶路径的conformer(dual-path conformer,dpconformer)网络结构。
在该实现方式中,基于conformer的自注意力网络层建模技术,提供了对偶路径的conformer网络结构,通过设计块内和块间交替进行conformer层的计算,既能对长序列进行建模,又可以避免直接使用conformer带来的计算量增加问题,并且由于conformer网络较强的建模能力,可以显著提升分离模块(即第一级分离模块和/或第二级分离模块)的分离效果。
结合第一方面的第二种可能的实现方式至第五种可能的实现方式中的任意一种可能的实现方式,在第一方面的第六种可能的实现方式中,所述第一级分离模块和/或所述第二级分离模块为用于执行至少一个任务的分离模块,所述至少一个任务包括单独的声源分离任务,或者包括所述声源分离任务和其他任务;
其中,所述其他任务包括声源定位任务、特定人提取任务、特定方向提取任务、特定人确认任务中的至少一种。
在该实现方式中,提供了声源分离任务和其他任务的多任务设计方案,比如其他任务包括声源定位任务、特定人提取任务、特定方向提取任务、特定人确认任务中的至少一种,可以将声源分离结果与其他信息关联起来,提供给下游任务或者下级唤醒模块,提高了分离模块(即第一级分离模块和/或第二级分离模块)的输出效果。
结合第一方面的第二种可能的实现方式至第六种可能的实现方式中的任意一种可能的实现方式,在第一方面的第七种可能的实现方式中,所述第一级唤醒模块和/或所述第二级唤醒模块为用于执行至少一个任务的唤醒模块,所述至少一个任务包括单独的唤醒任务,或者包括所述唤醒任务和其他任务;
其中,所述其他任务包括声源定位任务、特定人提取任务、特定方向提取任务、特定人确认任务中的至少一种。
在该实现方式中,提供了声源唤醒任务和其他任务的多任务设计方案,比如其他任务包括声源定位任务、特定人提取任务、特定方向提取任务、特定人确认任务中的 至少一种,可以将声源唤醒结果与其他信息关联起来,提供给下游任务,提高了唤醒模块(即第一级唤醒模块和/或第二级唤醒模块)的输出效果。比如其他任务为声源定位任务,这样唤醒模块可以在提供声源唤醒结果的同时提供更准确的方位信息,与相关技术中直接做空间多固定波束的方案相比,保证了更准确的方位估计效果。
结合第一方面的第一种可能的实现方式至第七种可能的实现方式中的任意一种可能的实现方式,在第一方面的第八种可能的实现方式中,所述第一级分离模块包括第一级多特征融合模型和第一级分离模型;所述根据所述多通道特征数据,调用预先训练完成的第一级分离模块输出得到第一分离数据,包括:
将所述多通道特征数据输入至所述第一级多特征融合模型中输出得到第一单通道特征数据;
将所述第一单通道特征数据输入至所述第一级分离模型输出得到所述第一分离数据。
在该实现方式中,提供了多通道特征数据的融合机制,避免相关技术中人工选择特征数据,通过第一级多特征融合模型自动学习特征通道间的相互关系,以及各个特征对最终分离效果的贡献,进一步保证了第一级分离模型的分离效果。
结合第一方面的第二种可能的实现方式至第八种可能的实现方式中的任意一种可能的实现方式,在第一方面的第九种可能的实现方式中,所述第二级分离模块包括第二级多特征融合模型和第二级分离模型;所述根据所述多通道特征数据和所述第一分离数据调用预先训练完成的第二级分离模块输出得到第二分离数据,包括:
将所述多通道特征数据和所述第一分离数据输入至所述第二级多特征融合模型中输出得到第二单通道特征数据;
将所述第二单通道特征数据输入至所述第二级分离模型输出得到所述第二分离数据。
在该实现方式中,提供了多通道特征数据的融合机制,避免相关技术中人工选择特征数据,通过第二级多特征融合模型自动学习特征通道间的相互关系,以及各个特征对最终分离效果的贡献,进一步保证了第二级分离模型的分离效果。
结合第一方面的第一种可能的实现方式至第九种可能的实现方式中的任意一种可能的实现方式,在第一方面的第十种可能的实现方式中,所述第一级唤醒模块包括多输入单输出形式的第一唤醒模型,所述根据所述多通道特征数据和所述第一分离数据,调用预先训练完成的第一级唤醒模块输出得到所述第一唤醒数据,包括:
将所述多通道特征数据和所述第一分离数据输入至所述第一级唤醒模型中输出得到所述第一唤醒数据,所述第一唤醒数据包括第一置信度,所述第一置信度用于指示原始的所述第一麦克风数据中包括预设唤醒词的概率。
在该实现方式中,提供了多输入单输出形式的第一唤醒模型,由于第一唤醒模型是多输入形式的模型,避免相关技术中多次重复调用唤醒模型带来的计算量显著增加和浪费问题,节省了计算资源,提高了第一唤醒模型的处理效率;并且,由于更好的利用各个输入参数的相关性,大大提高了第一唤醒模型的唤醒性能。
结合第一方面的第一种可能的实现方式至第九种可能的实现方式中的任意一种可能的实现方式,在第一方面的第十一种可能的实现方式中,所述第一级唤醒模块包括 多输入多输出形式的第一唤醒模型和第一后处理模块,所述根据所述多通道特征数据和所述第一分离数据,调用预先训练完成的第一级唤醒模块输出得到所述第一唤醒数据,包括:
将所述多通道特征数据和所述第一分离数据输入至所述第一唤醒模型中,输出得到多个声源数据各自对应的音素序列信息;
将所述多个声源数据各自对应的音素序列信息输入至所述第一后处理模块中,输出得到所述第一唤醒数据,所述第一唤醒数据包括多个声源数据各自对应的第二置信度,所述第二置信度用于指示所述声源数据与预设唤醒词之间的声学特征相似度。
在该实现方式中,提供了多输入多输出形式的第一唤醒模型,一方面,由于第一唤醒模型是多输入形式的模型,避免相关技术中多次重复调用唤醒模型带来的计算量显著增加和浪费问题,节省了计算资源,提高了第一唤醒模型的处理效率;另一方面,由于第一唤醒模型是多输出形式的模型,可以同时输出多个声源数据各自对应的音素序列信息,从而避免各个声源数据间相互影响而导致唤醒率低的情况,进一步保证了后续的唤醒率。
结合第一方面的第二种可能的实现方式至第十一种可能的实现方式中的任意一种可能的实现方式,在第一方面的第十二种可能的实现方式中,所述第二级唤醒模块包括多输入单输出形式的第二唤醒模型,所述根据所述多通道特征数据、所述第一分离数据和所述第二分离数据,调用预先训练完成的第二级唤醒模块输出得到所述第二唤醒数据,包括:
将所述多通道特征数据、所述第一分离数据和所述第二分离数据输入至所述第二级唤醒模型中输出得到所述第二唤醒数据,所述第二唤醒数据包括第三置信度,所述第三置信度用于指示原始的所述第一麦克风数据中包括预设唤醒词的概率。
在该实现方式中,提供了多输入单输出形式的第二唤醒模型,由于第二唤醒模型是多输入形式的模型,避免相关技术中多次重复调用唤醒模型带来的计算量显著增加和浪费问题,节省了计算资源,提高了第二唤醒模型的处理效率;并且,由于更好的利用各个输入参数的相关性,大大提高了第二唤醒模型的唤醒性能。
结合第一方面的第二种可能的实现方式至第十一种可能的实现方式中的任意一种可能的实现方式,在第一方面的第十三种可能的实现方式中,所述第二级唤醒模块包括多输入多输出形式的第二唤醒模型和第二后处理模块,所述根据所述多通道特征数据、所述第一分离数据和所述第二分离数据,调用预先训练完成的第二级唤醒模块输出得到所述第二唤醒数据,包括:
将所述多通道特征数据、所述第一分离数据和所述第二分离数据输入至所述第二级唤醒模型中,输出得到多个声源数据各自对应的音素序列信息;
将所述多个声源数据各自对应的音素序列信息输入至所述第二后处理模块中,输出得到所述第二唤醒数据,所述第二唤醒数据包括多个声源数据各自对应的第四置信度,所述第四置信度用于指示所述声源数据与预设唤醒词之间的声学特征相似度。
在该实现方式中,提供了多输入多输出形式的第二唤醒模型,一方面,由于第二唤醒模型是多输入形式的模型,避免相关技术中多次重复调用唤醒模型带来的计算量显著增加和浪费问题,节省了计算资源,提高了第二唤醒模型的处理效率;另一方面, 由于第二唤醒模型是多输出形式的模型,可以同时输出多个声源数据各自对应的音素序列信息,从而避免各个声源数据间相互影响而导致唤醒率低的情况,进一步保证了后续的唤醒率。
第二方面,本申请实施例提供了一种语音唤醒装置,所述装置包括:获取模块、第一级处理模块、第二级处理模块和确定模块;
所述获取模块,用于获取原始的第一麦克风数据;
所述第一级处理模块,用于根据所述第一麦克风数据进行第一级处理得到第一唤醒数据,所述第一级处理包括基于神经网络模型的第一级分离处理和第一级唤醒处理;
所述第二级处理模块,用于当所述第一唤醒数据指示预唤醒成功时根据所述第一麦克风数据进行第二级处理得到第二唤醒数据,所述第二级处理包括基于神经网络模型的第二级分离处理和第二级唤醒处理;
所述确定模块,用于根据所述第二唤醒数据确定唤醒结果。
结合第二方面,在第二方面的一种可能的实现方式中,所述装置还包括预处理模块,所述第一级处理模块还包括第一级分离模块和第一级唤醒模块;
所述预处理模块,用于对所述第一麦克风数据进行预处理得到多通道特征数据;
所述第一级分离模块,用于根据所述多通道特征数据进行所述第一级分离处理,输出得到第一分离数据;
所述第一级唤醒模块,用于根据所述多通道特征数据和所述第一分离数据进行所述第一级唤醒处理,输出得到所述第一唤醒数据。
结合第二方面的第一种可能的实现方式,在第二方面的第二种可能的实现方式中,所述第二级处理模块还包括第二级分离模块和第二级唤醒模块;
所述第二级分离模块,用于当所述第一唤醒数据指示预唤醒成功时,根据所述多通道特征数据和所述第一分离数据进行所述第二级分离处理,输出得到第二分离数据;
所述第二级唤醒模块,用于根据所述多通道特征数据、所述第一分离数据和所述第二分离数据进行所述第二级唤醒处理,输出得到所述第二唤醒数据。
结合第二方面的第二种可能的实现方式,在第二方面的第三种可能的实现方式中,
所述第一级分离处理为流式的声源分离处理,所述第一级唤醒处理为流式的声源唤醒处理;和/或,
所述第二级分离处理为离线的声源分离处理,所述第二级唤醒处理为离线的声源唤醒处理。
结合第二方面的第二种可能的实现方式或第三种可能的实现方式,在第二方面的第四种可能的实现方式中,
所述第一级唤醒模块包括多输入单输出形式或者多输入多输出形式的唤醒模型;和/或,
所述第二级唤醒模块包括多输入单输出形式或者多输入多输出形式的唤醒模型。
结合第二方面的第二种可能的实现方式至第四种可能的实现方式中的任意一种可能的实现方式,在第二方面的第五种可能的实现方式中,所述第一级分离模块和/或所述第二级分离模块采用对偶路径的conformer网络结构。
结合第二方面的第二种可能的实现方式至第五种可能的实现方式中的任意一种可 能的实现方式,在第二方面的第六种可能的实现方式中,所述第一级分离模块和/或所述第二级分离模块为用于执行至少一个任务的分离模块,所述至少一个任务包括单独的声源分离任务,或者包括所述声源分离任务和其他任务;
其中,所述其他任务包括声源定位任务、特定人提取任务、特定方向提取任务、特定人确认任务中的至少一种。
结合第二方面的第二种可能的实现方式至第六种可能的实现方式中的任意一种可能的实现方式,在第二方面的第七种可能的实现方式中,所述第一级唤醒模块和/或所述第二级唤醒模块为用于执行至少一个任务的唤醒模块,所述至少一个任务包括单独的唤醒任务,或者包括所述唤醒任务和其他任务;
其中,所述其他任务包括声源定位任务、特定人提取任务、特定方向提取任务、特定人确认任务中的至少一种。
结合第二方面的第一种可能的实现方式至第七种可能的实现方式中的任意一种可能的实现方式,在第二方面的第八种可能的实现方式中,所述第一级分离模块包括第一级多特征融合模型和第一级分离模型;所述第一级分离模块,还用于:
将所述多通道特征数据输入至所述第一级多特征融合模型中输出得到第一单通道特征数据;
将所述第一单通道特征数据输入至所述第一级分离模型输出得到所述第一分离数据。
结合第二方面的第二种可能的实现方式至第八种可能的实现方式中的任意一种可能的实现方式,在第二方面的第九种可能的实现方式中,所述第二级分离模块包括第二级多特征融合模型和第二级分离模型;所述第二级分离模块,还用于:
将所述多通道特征数据和所述第一分离数据输入至所述第二级多特征融合模型中输出得到第二单通道特征数据;
将所述第二单通道特征数据输入至所述第二级分离模型输出得到所述第二分离数据。
结合第二方面的第一种可能的实现方式至第九种可能的实现方式中的任意一种可能的实现方式,在第二方面的第十种可能的实现方式中,所述第一级唤醒模块包括多输入单输出形式的第一唤醒模型,所述第一级唤醒模块,还用于:
将所述多通道特征数据和所述第一分离数据输入至所述第一级唤醒模型中输出得到所述第一唤醒数据,所述第一唤醒数据包括第一置信度,所述第一置信度用于指示原始的所述第一麦克风数据中包括预设唤醒词的概率。
结合第二方面的第一种可能的实现方式至第九种可能的实现方式中的任意一种可能的实现方式,在第二方面的第十一种可能的实现方式中,所述第一级唤醒模块包括多输入多输出形式的第一唤醒模型和第一后处理模块,所述第一级唤醒模块,还用于:
将所述多通道特征数据和所述第一分离数据输入至所述第一唤醒模型中,输出得到多个声源数据各自对应的音素序列信息;
将所述多个声源数据各自对应的音素序列信息输入至所述第一后处理模块中,输出得到所述第一唤醒数据,所述第一唤醒数据包括多个声源数据各自对应的第二置信度,所述第二置信度用于指示所述声源数据与预设唤醒词之间的声学特征相似度。
结合第二方面的第二种可能的实现方式至第十一种可能的实现方式中的任意一种可能的实现方式,在第二方面的第十二种可能的实现方式中,所述第二级唤醒模块包括多输入单输出形式的第二唤醒模型,所述第二级唤醒模块,还用于:
将所述多通道特征数据、所述第一分离数据和所述第二分离数据输入至所述第二级唤醒模型中输出得到所述第二唤醒数据,所述第二唤醒数据包括第三置信度,所述第三置信度用于指示原始的所述第一麦克风数据中包括预设唤醒词的概率。
结合第二方面的第二种可能的实现方式至第十一种可能的实现方式中的任意一种可能的实现方式,在第二方面的第十三种可能的实现方式中,所述第二级唤醒模块包括多输入多输出形式的第二唤醒模型和第二后处理模块,所述第二级唤醒模块,还用于:
将所述多通道特征数据、所述第一分离数据和所述第二分离数据输入至所述第二级唤醒模型中,输出得到多个声源数据各自对应的音素序列信息;
将所述多个声源数据各自对应的音素序列信息输入至所述第二后处理模块中,输出得到所述第二唤醒数据,所述第二唤醒数据包括多个声源数据各自对应的第四置信度,所述第四置信度用于指示所述声源数据与预设唤醒词之间的声学特征相似度。
第三方面,本申请实施例提供了一种电子设备,所述电子设备包括:
处理器;
用于存储处理器可执行指令的存储器;
其中,所述处理器被配置为执行所述指令时实现第一方面或第一方面中的任意一种可能的实现方式所提供的语音唤醒方法。
第四方面,本申请实施例提供了一种非易失性计算机可读存储介质,其上存储有计算机程序指令,所述计算机程序指令被处理器执行时实现第一方面或第一方面中的任意一种可能的实现方式所提供的语音唤醒方法。
第五方面,本申请的实施例提供了一种计算机程序产品,包括计算机可读代码,或者承载有计算机可读代码的非易失性计算机可读存储介质,当所述计算机可读代码在电子设备中运行时,所述电子设备中的处理器执行上述第一方面或者第一方面中的任意一种可能的实现方式所提供的语音唤醒方法。
第六方面,本申请的实施例提供了一种语音唤醒系统,该语音唤醒系统用于执行上述第一方面或者第一方面中的任意一种可能的实现方式所提供的语音唤醒方法。
附图说明
包含在说明书中并且构成说明书的一部分的附图与说明书一起示出了本申请的示例性实施例、特征和方面,并且用于解释本申请的原理。
图1示出相关技术中电子设备的唤醒率与声源距离的相关关系的示意图。
图2示出了本申请一个示例性实施例提供的电子设备的结构示意图。
图3示出了本申请一个示例性实施例提供的语音唤醒方法的流程图。
图4示出了本申请一个示例性实施例提供的语音唤醒方法的原理示意图。
图5示出了本申请一个示例性实施例提供的dpconformer网络的结构示意图。
图6示出了本申请一个示例性实施例提供的两阶段分离方案的原理示意图。
图7至图14示出了本申请示例性实施例提供的第一级分离方案的几种可能的实现方式的原理示意图。
图15示出了本申请一个示例性实施例提供的两阶段唤醒方案的原理示意图。
图16至图19示出了本申请示例性实施例提供的第一级唤醒方案的几种可能的实现方式的原理示意图。
图20至图23示出了本申请示例性实施例提供的单麦克风场景下语音唤醒方法的原理示意图。
图24至图28示出了本申请示例性实施例提供的多麦克风场景下语音唤醒方法的原理示意图。
图29示出了本申请另一个示例性实施例提供的语音唤醒方法的流程图。
图30示出了本申请一个示例性实施例提供的语音唤醒装置的框图。
具体实施方式
以下将参考附图详细说明本申请的各种示例性实施例、特征和方面。附图中相同的附图标记表示功能相同或相似的元件。尽管在附图中示出了实施例的各种方面,但是除非特别指出,不必按比例绘制附图。
在这里专用的词“示例性”意为“用作例子、实施例或说明性”。这里作为“示例性”所说明的任何实施例不必解释为优于或好于其它实施例。
另外,为了更好的说明本申请,在下文的具体实施方式中给出了众多的具体细节。本领域技术人员应当理解,没有某些具体细节,本申请同样可以实施。在一些实例中,对于本领域技术人员熟知的方法、手段、元件和电路未作详细描述,以便于凸显本申请的主旨。
语音交互技术是现在电子设备中较为重要的技术,电子设备包括智能手机、音箱、电视、机器人、平板设备、车载设备等设备。语音唤醒功能是语音交互技术的关键功能之一,通过特定的唤醒词或者命令词(比如“小艺小艺”),激活处于非语音交互状态(比如休眠状态或者其他状态)的电子设备,开启语音识别、语音搜索、对话、语音导航等其他语音功能,既满足语音交互技术的随时可用性,又避免电子设备长期处于语音交互状态带来的功耗问题或者用户隐私数据被监听的问题。
理想环境(比如安静且用户距离待唤醒的电子设备较近)下,语音唤醒功能达到满足用户使用的需求,即满足95%以上的唤醒率。但是,实际使用场景的声学环境往往比较复杂,用户距离待唤醒的电子设备较远(比如3-5米)并且存在背景噪音(比如电视声、说话声、背景音乐、混响、回声等)的情况下,唤醒率将急剧下降。如图1所示,电子设备的唤醒率随声源距离增加而下降,其中声源距离为用户与电子设备的距离。图1中,声源距离为0.5米时唤醒率为80%,声源距离为1米时唤醒率为65%,声源距离为3米时唤醒率为30%,声源距离为5米时唤醒率为10%,过低的唤醒率,导致电子设备的语音唤醒效果较差。
通过相关技术中提供的一些方法,在唤醒率上虽然已经有了一定的进度,但是在存在背景噪音的情况下,对人声识别就会比较差,特别是在多声源干扰(比如其他说 话人的干扰、背景音乐的干扰、回声场景的回声残余干扰等等)或强声源干扰或远场回声场景时,唤醒率会更低,且产生较高的误唤醒情况。
而本申请实施例通过设计两级分离和唤醒方案,在第一级流式场景下通过第一级分离和唤醒方案进行预唤醒判断,保证唤醒率尽量高,但会带来较高的误唤醒率,因此在预唤醒成功后在第二级离线场景下进行离线唤醒确认,保证较高的唤醒率的同时降低误唤醒率,从而得到更好的语音唤醒效果。
首先,对本申请实施例涉及的一些名词进行介绍。
1、离线的声源唤醒处理:是指在获取完整的音频内容后对该音频内容进行声源唤醒处理。离线的声源唤醒处理包括离线的分离处理和离线的唤醒处理。
2、流式的声源唤醒处理(也称在线的声源唤醒处理):是指实时或每隔预设时间间隔获取音频段并对该音频段进行声源唤醒处理。流式的声源唤醒处理包括流式的分离处理和流式的唤醒处理。
其中,音频段为实时或每隔预设时间段采集的连续数量的样本数据,比如,预设时间间隔为16毫秒。本申请实施例对此不加以限定。
3、多声源分离技术:是指将接收到的单麦克风或者多麦克风语音信号分离出多个声源数据的技术。其中,多个声源数据包括目标对象的声源数据和干扰声源的声源数据。多声源分离技术用于将目标对象的声源数据与干扰声源的声源数据进行分离,以便更好地进行唤醒判断。
4、唤醒技术又称为关键词检出技术(Key Word Spotting,KWS),用于判断待测试的声源数据中是否包含预设的唤醒词。其中,唤醒词可以是默认设置的,或者是用户自定义设置的。比如,默认设置的固定唤醒词为“小艺小艺”,用户不能更改,唤醒方案设计往往依赖特定的训练样本数据。又比如,用户手动设置个性化的唤醒词,无论用户设置什么样的个性化唤醒词,都期待有较高的唤醒率,同时不希望在电子设备侧进行频繁的模型自学习。可选的,唤醒技术的建模方式包括但不限于如下两种可能的实现方式:第一种为采用整词建立唤醒模块,比如固定唤醒词为唤醒模块的输出目标;第二种为基于通用语音识别中的音素表示建立用于音素识别的唤醒模块,比如支持固定唤醒词或者支持用户自定义唤醒词时自动构造对应的个性化解码图,最终依赖唤醒模块的输出再解码图确定用户的唤醒意图。
对于上述第一种可能的实现方式即采用固定唤醒词建模的方案,在多声源干扰场景下,唤醒模块希望有单路的输出数据,该输出数据用于指示是否唤醒,或者是否为固定的唤醒词。而对于上述第二种可能的实现方式即采用音素建模的方案,在多声源干扰场景下,多个声源数据的唤醒模块的输出是有意义的,需要分别进行解码图解码,以便最终确定是否为自定义的唤醒词。因此,在多声源干扰场景下,对于采用固定唤醒词建模的方案,唤醒模块为多输入单输出形式的模型;而对于采用音素建模的方案,唤醒模块为多输入多输出形式的模型,多个输出数据分别对应多个声源数据的音素后验概率序列。
请参考图2,其示出了本申请一个示例性实施例提供的电子设备的结构示意图。
该电子设备可以是终端,终端包括移动终端或者固定终端。比如电子设备可以是手机、音箱、电视、机器人、平板设备、车载设备、耳机、智能眼镜、智能手表、膝上型便携计算机和台式计算机等等。服务器可以是一台服务器,或者由若干台服务器组成的服务器集群,或者是一个云计算服务中心。
参照图2,电子设备200可以包括以下一个或多个组件:处理组件202,存储器204,电源组件206,多媒体组件208,音频组件210,输入/输出(I/O)的接口212,传感器组件214,以及通信组件216。
处理组件202通常控制电子设备200的整体操作,诸如与显示,电话呼叫,数据通信,相机操作和记录操作相关联的操作。处理组件202可以包括一个或多个处理器220来执行指令,以完成本申请实施例提供的语音唤醒方法的全部或部分步骤。此外,处理组件202可以包括一个或多个模块,便于处理组件202和其他组件之间的交互。例如,处理组件202可以包括多媒体模块,以方便多媒体组件208和处理组件202之间的交互。
存储器204被配置为存储各种类型的数据以支持在电子设备200的操作。这些数据的示例包括用于在电子设备200上操作的任何应用程序或方法的指令,联系人数据,电话簿数据,消息,图片,多媒体内容等。存储器204可以由任何类型的易失性或非易失性存储设备或者它们的组合实现,如静态随机存取存储器(SRAM),电可擦除可编程只读存储器(EEPROM),可擦除可编程只读存储器(EPROM),可编程只读存储器(PROM),只读存储器(ROM),磁存储器,快闪存储器,磁盘或光盘。
电源组件206为电子设备200的各种组件提供电力。电源组件206可以包括电源管理系统,一个或多个电源,及其他与为电子设备200生成、管理和分配电力相关联的组件。
多媒体组件208包括在所述电子设备200和用户之间的提供一个输出接口的屏幕。在一些实施例中,屏幕可以包括液晶显示器(LCD)和触摸面板(TP)。如果屏幕包括触摸面板,屏幕可以被实现为触摸屏,以接收来自用户的输入信号。触摸面板包括一个或多个触摸传感器以感测触摸、滑动和触摸面板上的手势。所述触摸传感器可以不仅感测触摸或滑动动作的边界,而且还检测与所述触摸或滑动操作相关的持续时间和压力。在一些实施例中,多媒体组件208包括一个前置摄像头和/或后置摄像头。当电子设备200处于操作模式,如拍摄模式或多媒体内容模式时,前置摄像头和/或后置摄像头可以接收外部的多媒体数据。每个前置摄像头和后置摄像头可以是一个固定的光学透镜系统或具有焦距和光学变焦能力。可选地,电子设备200通过摄像头(前置摄像头和/或后置摄像头)采集视频信息。
音频组件210被配置为输出和/或输入音频信号。例如,音频组件210包括一个麦克风(MIC),当电子设备200处于操作模式,如呼叫模式、记录模式和语音识别模式时,麦克风被配置为接收外部音频信号。所接收的音频信号可以被进一步存储在存储器204或经由通信组件216发送。可选地,电子设备200通过麦克风采集原始的第一麦克风数据。在一些实施例中,音频组件210还包括一个扬声器,用于输出音频信号。
I/O接口212为处理组件202和外围接口模块之间提供接口,上述外围接口模块 可以是键盘,点击轮,按钮等。这些按钮可包括但不限于:主页按钮、音量按钮、启动按钮和锁定按钮。
传感器组件214包括一个或多个传感器,用于为电子设备200提供各个方面的状态评估。例如,传感器组件214可以检测到电子设备200的打开/关闭状态,组件的相对定位,例如所述组件为电子设备200的显示器和小键盘,传感器组件214还可以检测电子设备200或电子设备200一个组件的位置改变,用户与电子设备200接触的存在或不存在,电子设备200方位或加速/减速和电子设备200的温度变化。传感器组件214可以包括接近传感器,被配置用来在没有任何的物理接触时检测附近物体的存在。传感器组件214还可以包括光传感器,如CMOS或CCD图像传感器,用于在成像应用中使用。在一些实施例中,该传感器组件214还可以包括加速度传感器,陀螺仪传感器,磁传感器,压力传感器或温度传感器。
通信组件216被配置为便于电子设备200和其他设备之间有线或无线方式的通信。电子设备200可以接入基于通信标准的无线网络,如WiFi,2G或3G,或它们的组合。在一个示例性实施例中,通信组件216经由广播信道接收来自外部广播管理系统的广播信号或广播相关信息。在一个示例性实施例中,所述通信组件216还包括近场通信(NFC)模块,以促进短程通信。例如,在NFC模块可基于射频识别(RFID)技术,红外数据协会(IrDA)技术,超宽带(UWB)技术,蓝牙(BT)技术和其他技术来实现。
在示例性实施例中,电子设备200可以被一个或多个应用专用集成电路(ASIC)、数字信号处理器(DSP)、数字信号处理设备(DSPD)、可编程逻辑器件(PLD)、现场可编程门阵列(FPGA)、控制器、微控制器、微处理器或其他电子元件实现,用于执行本申请实施例提供的语音唤醒方法。
在示例性实施例中,还提供了一种非易失性计算机可读存储介质,例如包括计算机程序指令的存储器204,上述计算机程序指令可由电子设备200的处理器220执行以完成本申请实施例提供的语音唤醒方法。
下面,采用几个示例性实施例对本申请实施例提供的语音唤醒方法进行介绍。
请参考图3,其示出了本申请一个示例性实施例提供的语音唤醒方法的流程图,本实施例以该方法用于图2所示的电子设备中来举例说明。该方法包括以下几个步骤。
步骤301,获取原始的第一麦克风数据。
电子设备通过单个麦克风或者多个麦克风获取麦克风输出信号,将麦克风输出信号作为原始的第一麦克风数据。
可选的,第一麦克风数据包括目标对象的声源数据和干扰声源的声源数据,干扰声源包括除目标对象以外的其它对象的说话声、背景音乐、环境噪声中的至少一种。
步骤302,对第一麦克风数据进行预处理得到多通道特征数据。
为了处理真实声学场景下可能遇到的声学回声、混响、信号幅度等问题,电子设备对第一麦克风数据进行预处理得到多通道特征数据。可选的,预处理包括声学回声抵消(Acoustic Echo Cancellation,AEC)、去混响(Dereverberation)、语音活动检测(Voice Activity Detection,VAD)、自动增益控制(Automatic Gain Control,AGC)、波束滤波 中的至少一种处理。
可选的,多通道特征为多组多通道特征,多通道特征数据包括多通道时域信号数据、多通道频谱数据、多组通道间相位差(Inter Phase Difference,IPD)数据、多方向特征数据、多波束特征数据中的至少一种数据。
步骤303,根据多通道特征数据进行第一级分离处理得到第一分离数据。
其中,第一级分离处理也可以称为第一级神经网络分离处理,第一级分离处理为基于神经网络模型的分离处理,即第一级分离处理包括调用神经网络模型进行声源分离处理。
可选的,电子设备根据多通道特征数据,调用预先训练完成的第一级分离模块输出得到第一分离数据。其中,第一级分离模块用于进行第一级分离处理,第一级分离处理为流式的声源分离处理。可选的,第一级分离模块采用dpconformer网络结构。
电子设备根据多通道特征数据,调用预先训练完成的第一级分离模块输出得到第一分离数据,包括但不限于如下两种可能的实现方式:
在一种可能的实现方式中,第一级分离模块包括第一级分离模型,电子设备将多通道特征进行拼接,将拼接后的多通道特征数据输入至第一级分离模型中输出得到第一分离数据。
在另一种可能的实现方式中,第一级分离模块包括第一级多特征融合模型和第一级分离模型,电子设备将多通道特征数据输入至第一级多特征融合模型中输出得到第一单通道特征数据;将第一单通道特征数据输入至第一级分离模型输出得到第一分离数据。为了方便说明,下面仅以第二种可能的实现方式为例进行介绍。本申请实施例对此不加以限定。
可选的,第一级多特征融合模型为conformer特征融合模型。
其中,第一级分离模型采用流式的网络结构。可选的,第一级分离模型采用dpconformer网络结构。
其中,第一级分离模型为神经网络模型,即第一级分离模型为采用神经网络训练得到的模型。可选的,第一级分离模型采用深度神经网络(Deep Neural Networks,DNN)、长短期记忆网络(Long Short-Term Memory,LSTM)、卷积神经网络(Convolutional Neural Networks,CNN)、全卷积时域音频分离网络(Conv-TasNet)、DPRNN中的任意一种网络结构。需要说明的是,第一级分离模型还可以采用其他适合流式场景的网络结构,本申请实施例对此不加以限定。
其中,第一级分离模块的分离任务设计可以是流式声源分离任务的单任务设计,也可以是流式声源分离任务和其他任务的多任务设计,可选的,其他任务包括多个声源各自对应的方位估计任务和/或多个声源各自对应的声源对象识别任务。
在一种可能的实现方式中,第一级分离模块用于对多个声源数据进行盲分离,第一分离数据包括分离的多个声源数据。
在另一种可能的实现方式中,第一级分离模块用于从多个声源数据中提取目标对象的声源数据,第一分离数据包括提取的目标对象的声源数据。
在另一种可能的实现方式中,第一级分离模块用于基于视频信息从多个声源数据中提取目标对象的声源数据,第一分离数据包括提取的目标对象的声源数据。比如, 视频信息包括目标对象的视觉数据。
在另一种可能的实现方式中,第一级分离模块用于从多个声源数据中提取目标方向的至少一个声源数据,第一分离数据包括目标方向的至少一个声源数据。
需要说明的是,分离任务设计的几种可能的实现方式的相关细节可参考下面实施例中的相关描述,在此先不介绍。
可选的,对于需要分离出多个声源数据的盲分离任务,第一级分离模块中的代价函数为基于置换不变训练(Permutation Invariant Traning,PIT)准则设计的函数。
可选的,在代价函数的训练过程中,电子设备将多个样本声源数据按照语音段起始时刻的先后顺序进行排序,根据排序后的多个样本声源数据计算代价函数的损失值。基于计算出的损失值,训练该代价函数。
可选的,在通过第一级分离模块分离得到多个声源数据后,将多个声源数据直接输入至下一级处理模型即第一级唤醒模块。
可选的,对于多麦克风的场景,在通过第一级分离模块分离得到多个声源数据后,计算多个声源数据的统计量信息,将统计量信息输入至波束形成模型中输出得到波束形成数据,将波束形成数据输入至下一级处理模型即第一级唤醒模块。
步骤304,根据多通道特征数据和第一分离数据进行第一级唤醒处理得到第一唤醒数据。
可选的,电子设备根据多通道特征数据和第一分离数据,调用预先训练完成的第一级唤醒模块输出得到第一唤醒数据。其中,第一级唤醒模块用于进行第一级唤醒处理,第一级唤醒处理为流式的声源唤醒处理。
需要说明的是,对多通道特征数据和第一分离数据的介绍可参考上述步骤中的相关描述,在此不再赘述。
可选的,电子设备将多通道特征数据和第一分离数据输入至第一级唤醒模块中输出得到第一唤醒数据。
可选的,唤醒方案为多输入单输出的流式唤醒方案(MISO-KWS),即第一级唤醒模块是采用固定唤醒词建模的,第一级唤醒模块为多输入单输出形式的唤醒模型,输入参数包括多通道特征数据和第一分离数据,输出参数包括第一置信度。其中,第一置信度用于指示原始的第一麦克风数据中包括预设唤醒词的概率。
可选的,第一置信度为多维向量,多维向量中的每个维度的值为0到1之间的概率值。
可选的,唤醒方案为多输入多输出的流式唤醒方案(MIMO-KWS),即第一级唤醒模块是采用音素建模的,第一级唤醒模块包括多输入多输出形式的唤醒模型和第一后处理模块(比如解码器),第一级唤醒模块的输入参数(也即唤醒模型的输入参数)包括多通道特征数据和第一分离数据,唤醒模型的输出参数包括多个声源数据各自对应的音素序列信息。其中,声源数据对应的音素序列信息用于指示该声源数据中多个音素的概率分布,即音素序列信息包括多个音素各自对应的概率值。第一级唤醒模块的输出参数(也即第一后处理模块的输出参数)包括多个声源数据各自对应的第二置信度,第二置信度用于指示该声源数据与预设唤醒词之间的声学特征相似度。
其中,预设唤醒词为默认设置的固定唤醒词,或者用户自定义设置的唤醒词。本 申请实施例对此不加以限定。
其中,第一级唤醒模块采用流式的网络结构。可选的,第一级唤醒模块采用流式的dpconformer网络结构。
可选的,第一级唤醒模块采用DNN、LSTM、CNN中的任意一种网络结构。需要说明的是,第一级唤醒模块还可以采用其他适合流式场景的网络结构,第一级唤醒模块的网络结构可以类比参考第一级分离模块的网络结构,本申请实施例对此不加以限定。
其中,第一级唤醒模块的唤醒任务设计可以是唤醒任务的单任务设计,也可以是唤醒任务和其他任务的多任务设计,可选的,其他任务包括方位估计任务和/或声源对象识别任务。
可选的,第一唤醒数据包括第一置信度,第一置信度用于指示原始的第一麦克风数据中包括预设唤醒词的概率。可选的,第一唤醒数据包括多个声源数据各自对应的第二置信度,第二置信度用于指示该声源数据与预设唤醒词之间的声学特征相似度。
可选的,第一唤醒数据还包括唤醒事件对应的方位信息和/或唤醒对象的对象信息,对象信息用于指示声源数据的对象身份。
步骤305,根据第一唤醒数据,判断是否预唤醒。
电子设备设置第一级唤醒模块的第一门限值。其中,第一门限值为允许电子设备被预唤醒成功的阈值。
在一种可能的实现方式中,第一唤醒数据包括第一置信度,第一置信度用于指示原始的第一麦克风数据中包括预设唤醒词的概率,当第一唤醒数据中的第一置信度大于第一门限值时,确定预唤醒成功即第一级流式唤醒成功,将缓存的多通道特征数据和第一分离数据输入至第二级分离模块,执行步骤306;当第一置信度小于或者等于第一门限值时,确定预唤醒失败即第一级流式唤醒失败,结束进程。
在另一种可能的实现方式中,第一唤醒数据包括多个声源数据各自对应的第二置信度,第二置信度用于指示该声源数据与预设唤醒词之间的声学特征相似度,当第一唤醒数据中存在任意一个第二置信度大于第一门限值时,确定预唤醒成功即第一级流式唤醒成功,将缓存的多通道特征数据和第一分离数据输入至第二级分离模块,执行步骤306;当第一唤醒数据中的各个第二置信度均小于或者等于第一门限值时,确定预唤醒失败即第一级流式唤醒失败,结束进程。
步骤306,根据多通道特征数据和第一分离数据进行第二级分离处理得到第二分离数据。
其中,第二级分离处理也可以称为第二级神经网络分离处理,第二级分离处理为基于神经网络模型的分离处理,即第二级分离处理包括调用神经网络模型进行声源分离处理。
可选的,电子设备根据多通道特征数据和第一分离数据,调用预先训练完成的第二级分离模块输出得到第二分离数据。其中,第二级分离模块用于进行第二级分离处理,第二级分离处理为离线的声源分离处理。
可选的,第一唤醒数据还包括唤醒词对应的方位信息,电子设备根据多通道特征数据、第一分离数据和唤醒词对应的方位信息,调用第二级分离模块输出得到第二分 离数据。
需要说明的是,对第一分离数据、多通道特征数据和第一唤醒数据的介绍可参考上述步骤中的相关描述,在此不再赘述。为了方便说明,下面仅以电子设备根据多通道特征数据和第一分离数据,调用预先训练完成的第二级分离模块输出得到第二分离数据为例进行说明。
可选的,第二级分离模块采用dpconformer网络结构。
电子设备根据多通道特征数据和第一分离数据,调用预先训练完成的第二级分离模块输出得到第二分离数据,包括但不限于如下两种可能的实现方式:
在一种可能的实现方式中,第二级分离模块包括第二级分离模型,电子设备将多通道特征和第一分离数据进行拼接,将拼接后的数据输入至第二级分离模型中输出得到第二分离数据。
在另一种可能的实现方式中,第二级分离模块包括第二级多特征融合模型和第二级分离模型,电子设备将多通道特征数据和第一分离数据输入至第二级多特征融合模型中输出得到第二单通道特征数据;将第二单通道特征数据输入至第二级分离模型输出得到第二分离数据。为了方便说明,下面仅以第二种可能的实现方式为例进行介绍。本申请实施例对此不加以限定。
可选的,第二级多特征融合模型为conformer特征融合模型。
其中,第二级分离模型为神经网络模型,即第二级分离模型为采用神经网络训练得到的模型。可选的,第二级分离模型采用dpconformer网络结构。或者,第二级分离模型采用深度神经网络(Deep Neural Networks,DNN)、长短期记忆网络(Long Short-Term Memory,LSTM)、卷积神经网络(Convolutional Neural Networks,CNN)、全卷积时域音频分离网络(Conv-TasNet)、循环神经网络(Recurrent Neural Network,RNN)中的任意一种网络结构。需要说明的是,第二级分离模型还可以采用其他适合离线场景的网络结构,本申请实施例对此不加以限定。
其中,第二级分离模块的分离任务设计可以是离线声源分离任务的单任务设计,也可以是离线声源分离任务和其他任务的多任务设计,可选的,其他任务包括多个声源各自对应的方位估计任务和/或多个声源各自对应的声源对象识别任务。
在一种可能的实现方式中,第二级分离模块用于对多个声源数据进行盲分离,第二分离数据包括分离的多个声源数据。
在另一种可能的实现方式中,第二级分离模块用于从多个声源数据中提取目标对象的声源数据,第二分离数据包括提取的目标对象的声源数据。
在另一种可能的实现方式中,第二级分离模块用于基于视频信息从多个声源数据中提取目标对象的声源数据,第二分离数据包括提取的目标对象的声源数据。
在另一种可能的实现方式中,第二级分离模块用于从多个声源数据中提取目标方向的至少一个声源数据,第二分离数据包括目标方向的至少一个声源数据。
需要说明的是,多通道特征的融合、网络结构的选择、分离任务设计、代价函数的使用、分离结果的使用可以类比参考第一级分离处理的相关描述,在此不再赘述。
步骤307,根据多通道特征数据、第一分离数据和第二分离数据进行第二级唤醒处理得到第二唤醒数据。
可选的,电子设备根据多通道特征数据、第一分离数据和第二分离数据,调用预先训练完成的第二级唤醒模块输出得到第二唤醒数据。其中,第二级唤醒模块用于进行第二级唤醒处理,第二级唤醒处理为离线的声源唤醒处理。
可选的,第一唤醒数据还包括唤醒词对应的方位信息,电子设备根据多通道特征数据、第一分离数据、第二分离数据和唤醒词对应的方位信息,调用第二级唤醒模块输出得到第二唤醒数据。
需要说明的是,对多通道特征数据、第一分离数据和第二分离数据的介绍可参考上述步骤中的相关描述,在此不再赘述。
可选的,电子设备将多通道特征数据、第一分离数据和第二分离数据输入至第二级唤醒模块中输出得到第二唤醒数据。
可选的,第二级唤醒模块是采用固定唤醒词建模的,第二级唤醒模块为多输入单输出形式的唤醒模型,即唤醒方案为多输入单输出的流式唤醒方案(MISO-KWS)。或者,第二级唤醒模块是采用音素建模的,第二级唤醒模块包括多输入多输出形式的唤醒模型和第二后处理模块(比如解码器),即唤醒方案为多输入多输出的流式唤醒方案(MIMO-KWS)。
可选的,第二级唤醒模块采用dpconformer网络结构。或者,第二级唤醒模块采用DNN、LSTM、CNN中的任意一种网络结构。需要说明的是,第二级唤醒模块还可以采用其他适合离线场景的网络结构,第二级唤醒模块的网络结构可以类比参考第二级分离模块的网络结构,本申请实施例对此不加以限定。
其中,第二级唤醒模块的唤醒任务设计可以是唤醒任务的单任务设计,也可以是唤醒任务和其他任务的多任务设计,可选的,其他任务包括方位估计任务和/或声源对象识别任务。
可选的,第二唤醒数据包括第三置信度,第三置信度用于指示原始的第一麦克风数据中包括预设唤醒词的概率。
可选的,第二唤醒数据包括多个声源数据各自对应的第四置信度,声源数据的第四置信度用于指示该声源数据与预设唤醒词之间的声学特征相似度。为了方便介绍,下面仅以第二唤醒数据包括第三置信度,第三置信度用于指示原始的第一麦克风数据中包括预设唤醒词的概率为例进行说明。
可选的,第二唤醒数据还包括唤醒事件对应的方位信息和/或唤醒对象的对象信息。
步骤308,根据第二唤醒数据,确定唤醒结果。
电子设备根据第二唤醒数据,确定唤醒结果,唤醒结果包括唤醒成功或者唤醒失败中的一种。
可选的,电子设备设置第二级唤醒模块的第二门限值。其中,第二门限值为允许电子设备被唤醒成功的阈值。示意性的,第二门限值大于第一门限值。
在一种可能的实现方式中,第二唤醒数据包括第三置信度,第三置信度用于指示原始的第一麦克风数据中包括预设唤醒词的概率。当第二唤醒数据中的第三置信度大于第二门限值时,电子设备确定唤醒结果为唤醒成功。当第三置信度小于或者等于第二门限值时,电子设备确定唤醒结果为唤醒失败,结束进程。
在另一种可能的实现方式中,第二唤醒数据包括多个声源数据各自对应的第四置 信度,声源数据的第四置信度用于指示该声源数据与预设唤醒词之间的声学特征相似度。当第二唤醒数据中存在任意一个第四置信度大于第二门限值时,电子设备确定唤醒结果为唤醒成功。当第二唤醒数据中的各个第四置信度均小于或者等于第二门限值时,电子设备确定唤醒结果为唤醒失败,结束进程。
可选的,当第二唤醒数据指示唤醒成功时,电子设备输出唤醒成功标识;或者,输出唤醒成功标识和其他信息。其中,该唤醒成功标识用于指示唤醒成功,其他信息包括唤醒事件对应的方位信息、唤醒对象的对象信息。
需要说明的是,在保证唤醒率的同时减少误唤醒情况,本申请实施例设计了两级唤醒处理模块,在第一级唤醒成功后,调用更为复杂的第二级唤醒模块,对第一级唤醒成功后的数据进行离线唤醒确认。为了更好的支持唤醒方案这样的两级测试,对分离模块也进行了两级设计,第一级分离方案是流式的,需要一直在运行,所以第一级分离模块需要进行因果流式设计。流式设计一般会损失分离性能,所以在第一级唤醒成功后,在输出的数据上可以进行第二级分离方案,由于是离线场景,第二级唤醒方案可以采用离线的设计方案,同时第一级已经输出的数据同样可以用于第二级分离方案,最终得到更好的分离性能,从而最终更好的支持二级唤醒的效果。
在一个示意性的例子中,如图4所示,该电子设备包括第一级分离模块41(包括第一级分离模型)、第一级唤醒模块42、第二级分离模块43(包括第二级分离模型)和第二级唤醒模块44。电子设备将原始的第一麦克风数据输入至预处理模块进行预处理(比如声学回声抵消、去混响和波束滤波处理),得到多通道特征数据;将多通道特征数据输入至第一级分离模块41进行第一级分离处理得到第一分离数据;将多通道特征数据和第一分离数据输入至第一级唤醒模块42进行第一级唤醒处理得到第一唤醒数据。电子设备根据第一唤醒数据,判断是否预唤醒。若判断出预唤醒成功,则将多通道特征数据和第一分离数据输入至第二级分离模块43进行第二级分离处理得到第二分离数据;将多通道特征数据、第一分离数据和第二分离数据输入至第二级唤醒模块44进行第二级唤醒处理得到第二唤醒数据。电子设备根据第二唤醒数据判断是否唤醒成功。
本申请实施例提供的语音唤醒方法主要从多声源分离技术和唤醒技术这两个角度进行优化设计,可以大幅度解决上述的技术问题。下面,分别对本申请实施例涉及的多声源分离技术和唤醒技术进行介绍。
在对多声源分离技术和唤醒技术进行介绍之前,先对dpconformer网络结构进行介绍。该dpconformer网络的结构示意图如图5所示。该dpconformer网络包括编码层、分离层和解码层。
1、编码层:该dpconformer网络接收单通道特征数据,经过一维卷积(1-D Conv)层得到中间特征数据,比如中间特征数据为二维矩阵。
可选的,对输入的单通道特征数据进行一维卷积运算,通过如下公式变换到输入时域数据的隐空间中:X=RELU(x*W);其中,x为时域的单通道特征数据,W为编码变换对应的权重系数,x通过W按固定的卷积核大小和卷积步长进行一维卷积运算,最终得到编码之后的中间特征数据满足X∈R N*I,其中N为编码的维度,I为时域 的总帧数,中间特征数据X为N*I维的二维矩阵。
2、分离层包括数据切割模块、块内的conformer层和块间的conformer层。
(1)、数据切割模块
数据切割模块的输入参数为中间特征数据,输出参数为三维张量。即数据切割模块用于按照数据分帧分段方式将中间特征数据表示为三维张量,分别对应块内特征、块间特征以及特征维度。
可选的,将N*I维的二维矩阵按块等分切割成N*K*P维的三维张量,其中N为特征维度,K为块的个数,P为块的长度,块之间重叠P/2。
(3)、块内的conformer层
块内的conformer层的输入参数为数据切割模块输出的三维张量,输出参数为第一中间参数。
可选的,conformer层包括线性层、多头自注意力层(MultiHead Self-Attention,MHSA)、卷积层中的至少一个。
可选的,将K个长度为P的块,通过如下公式进行块内的conformer计算:
Figure PCTCN2022083055-appb-000001
其中,b为当前所处的第b个dpconformer子模块,总共包括B个dpconformer子模块,每个dpconformer子模块包括一层块内的conformer层和一层块间的conformer层,B为正整数。
需要说明的是,流式场景和离线场景下,块内的conformer层的计算方式是相同的。
(4)、块间的conformer层
块间的conformer层的输入参数为块内的conformer层输出的第一中间参数,输出参数为第二中间参数。
可选的,离线场景下,在块内P的每一个相同维度上,通过如下公式进行各个块间的conformer计算:
Figure PCTCN2022083055-appb-000002
块间的conformer层在离线场景在整句所有特征上计算注意力,而流式场景下,为了控制时延,利用掩模(mask)机制,只计算当前块及以前时刻的注意力,保证因果性。
可选的,流式场景下,当前时刻对应的块为t,当前块t的块间的conformer计算只与历史时刻对应的块到当前块t存在关联关系,与块t+1无关,则通过如下公式进行各个块间的conformer计算:
Figure PCTCN2022083055-appb-000003
经过B层的块内以及块间的conformer层进行计算,即块内的conformer层与块间的conformer层重复计算B次。
然后,将经过2-D Conv层三维的N*K*P张量转换为C个N*I的二维矩阵,对应得到C个声源的掩蔽矩阵M,其中M是预设的待分离的声源个数。
3、解码层
根据各个声源的掩蔽矩阵M,与各个声源的隐空间表示通过一维卷积层,最终得 到分离结果,即分离的多个声源数据。
本申请实施例提供的多声源分离方案为两阶段分离方案,以两阶段分离方案中的多特征融合模型和分离模块均采用图5提供的dpconformer网络结构为例,该两阶段分离方案如图6所示。
第一级流式分离模块包括conformer特征融合模型61和dpconformer分离模型62,第二级离线分离模块包括conformer特征融合模型63和dpconformer分离模型64。其中,第一级流式分离模块可以为上述的第一级分离模块41,第二级离线分离模块可以为上述的第二级离线分离模块43。
电子设备将多通道特征数据输入至conformer特征融合模型61中输出得到单通道特征数据;将单通道特征数据输入至dpconformer分离模型62输出得到第一分离数据。当预唤醒成功时,将多通道特征数据和第一分离数据输入至conformer特征融合模型63中输出得到单通道特征数据;将单通道特征数据输入至dpconformer分离模型64输出得到第二分离数据。
需要说明的是,为了方便介绍,仅以两阶段分离方案中的第一级分离方案为例进行说明,第二级分离方案可类比参考,不再赘述。
在一种可能的实现方式中,第一级分离方案包括盲分离技术,第一级分离方案包括但不限于如下几个方面,如图7所示:
(1)、特征输入部分:包括多通道特征数据。在多麦克风场景下,多通道特征数据包括多组多通道特征数据,可选的,多通道特征数据包括多个麦克风的原始时域数据、对应的多通道的变换域数据、多组IPD数据、多个预先设定方向的固定波束的输出数据、各个预设方向的方向性特征(Directional Features)数据中的至少一组多通道特征数据。比如特征输入部分包括三组多通道特征数据,即多通道特征数据1、多通道特征数据2和多通道特征数据3。本申请实施例对多通道特征数据的组数不加以限定。
(2)、conformer特征融合模型71:用于将多组多通道特征数据融合为单通道特征数据。首先,每组多通道特征数据基于conformer层,计算组内的通道间的第一注意力特征数据;然后,每组通道间的第一注意力特征数据再统一经过另一个conformer层即全通道注意力层72,得到各组的第二注意力特征数据,再经过池化层(pooling layer)或者投影层得到单通道的中间特征表示即单通道特征数据。
(3)、dpconformer分离模型73:用于将融合后的多组多通道特征数据即单通道特征数据输入至dpconformer分离模型,输出得到M个估计的声源数据,M为正整数。比如M个估计的声源数据包括声源数据1、声源数据2、声源数据3和声源数据4。本申请实施例对此不加以限定。
(4)、代价函数设计:代价函数训练时,多个声源数据的输出和对应的多个声源数据的标注存在置换混淆问题,所以需要使用置换不变训练准则(Permutation Invariant Training,PIT),即确定多个声源数据对应的所有可能的标注顺序,根据多个标注顺序与代价函数的输出参数计算多个标注顺序各自对应的损失值,根据损失值最小的标 注顺序进行梯度计算。除了采用上述方法训练代价函数以外,还可以使用多个声源数据的先验信息设置固定的排序顺序,以避免声源数据的个数增大而导致损失值计算复杂度高的问题。声源数据的先验信息包括该声源数据的起始时刻,将多个声源数据按照起始时刻从早到晚的顺序依次排列。
在另一种可能的实现方式中,第一级分离方案包括特定人提取技术,特定人提取技术是多声源干扰场景下的另一主要技术方案。该第一级分离方案包括但不限于如下几个方面,如图8所示:
(1)、特征输入部分:包括多通道特征数据和注册语音数据。与图7提供的第一级分离方案不同的是,在特定人提取场景下需要目标对象进行注册,将目标对象的注册语音数据作为额外的特征数据进行输入。比如特征输入部分包括多通道特征数据1、多通道特征数据2和注册语音数据。本申请实施例对多通道特征数据的组数不加以限定。
(2)、conformer特征融合模型81:用于将多组多通道特征数据和注册语音数据融合为单通道特征数据。首先,每组多通道特征数据基于conformer层,计算组内的通道间的第一注意力特征数据;然后,每组通道间的第一注意力特征数据和目标对象的说话人表示特征数据再统一经过全通道注意力层82,全通道注意力层82用于计算目标对象的说话人表示特征数据与其他的多通道特征数据之间的相关性,并融合输出得到单通道特征。
可选的，将目标对象的注册语音数据输入至说话人表示模型中，输出得到目标对象的嵌入（embedding）表示即说话人表示特征数据，其中说话人表示模型是预先通过标准的说话人识别训练方法训练得到的。
可选的,将目标对象的说话人表示特征数据以向量形式预先存储在电子设备中。
(3)、dpconformer分离模型83:将单通道特征数据输入至dpconformer分离模型83,输出得到目标对象的声源数据。即该dpconformer分离模型83的输出参数为单输出参数,预期的输出参数为目标对象的声源数据。比如,目标对象的声源数据为声源数据1。
(4)、代价函数设计:可以类比参考上述代价函数的介绍,在此不再赘述。
在另一种可能的实现方式中,该第一级分离方案包括视觉数据辅助的特定人提取技术,该第一级分离方案包括但不限于如下几个方面,如图9所示:
(1)、特征输入部分：包括多通道特征数据和目标人视觉数据。在一些特定场景下，比如电视、手机、机器人、车载设备等电子设备装配有摄像头，这些电子设备可以通过摄像头获取目标对象的视觉数据即目标人视觉数据。在这些场景下，可以利用目标人视觉数据辅助进行特定人提取任务。比如特征输入部分包括多通道特征数据1、多通道特征数据2和目标人视觉数据。本申请实施例对多通道特征数据的组数不加以限定。
(2)、conformer特征融合模型91：用于将多组多通道特征数据和视觉数据融合为单通道特征数据。首先，每组多通道特征数据基于conformer层，计算组内的通道间的第一注意力特征数据；然后，每组通道间的第一注意力特征数据和目标对象的视觉表示特征数据再统一经过全通道注意力层92，全通道注意力层92用于计算目标对象的视觉表示特征数据与其他的多通道特征数据之间的相关性，并融合输出得到单通道特征。
可选的,电子设备根据目标人视觉数据调用预先训练好的视觉分类模型输出得到目标对象的向量表示即视觉表示特征数据。比如视觉分类模型包括唇语识别模型,目标人视觉数据包括唇部活动视觉数据。本申请实施例对此不加以限定。
(3)、dpconformer分离模型93：将单通道特征数据输入至dpconformer分离模型93，输出得到目标对象的声源数据。即该dpconformer分离模型93的输出参数为单输出参数，预期的输出参数为目标对象的声源数据。比如，目标对象的声源数据为声源数据1。
(4)、代价函数设计:可以类比参考上述代价函数的介绍,在此不再赘述。
在另一种可能的实现方式中,该第一级分离方案包括特定方向提取技术,特定方向提取技术是多声源干扰场景下提取预设的目标方向的声源数据的技术。该第一级分离方案包括但不限于如下几个方面,如图10所示:
(1)、特征输入部分:包括多通道特征数据和目标方向数据。类比参考图8提供的特定人提取技术,在该场景下,将目标方向数据作为额外的特征数据进行输入。比如特征输入部分包括多通道特征数据1、多通道特征数据2、多通道特征数据3和目标方向数据。本申请实施例对多通道特征数据的组数不加以限定。
(2)、conformer特征融合模型101:用于将多组多通道特征数据和目标方向数据融合为单通道特征数据。首先,每组多通道特征数据基于conformer层,计算组内的通道间的第一注意力特征数据;然后,每组通道间的第一注意力特征数据和目标方向数据的方向特征数据再统一经过全通道注意力层102,全通道注意力层102用于计算目标方向数据的方向特征数据与其他的多通道特征数据之间的相关性,并融合输出得到单通道特征。
可选的,根据目标方向数据和麦克风阵列的麦克位置信息,计算目标方向数据的方向特征数据。
可选的,将目标方向数据的方向特征数据预先存储在电子设备中。
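针对上述根据目标方向数据和麦克风阵列的麦克位置信息计算方向特征数据的步骤，下面给出一个示意性的Python（NumPy）代码草图，按常见做法以目标方向在各频点上的理论相位差作为方向特征；其中的采样率、FFT点数、麦克风坐标等均为假设的示例值，并非本申请的实际实现：

import numpy as np

def direction_feature(azimuth_deg, mic_pos, n_fft=512, fs=16000, c=343.0):
    # mic_pos: (M, 2)，各麦克风的平面坐标（米）；azimuth_deg: 目标方位角（度）
    theta = np.deg2rad(azimuth_deg)
    direction = np.array([np.cos(theta), np.sin(theta)])
    delays = mic_pos @ direction / c                         # 各麦克风的相对到达时延（秒）
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)               # 各频点对应的频率
    # 以第0个麦克风为参考，目标方向在各频点上的理论相位差，形状(M, n_fft//2+1)
    return -2.0 * np.pi * np.outer(delays - delays[0], freqs)

feat = direction_feature(60.0, np.array([[0.0, 0.0], [0.05, 0.0], [0.10, 0.0]]))
print(feat.shape)                                            # (3, 257)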
(3)、dpconformer分离模型103:将单通道特征数据输入至dpconformer分离模型103,输出得到目标方向的至少一个声源数据。即该dpconformer分离模型103的输出参数为单输出参数或者多输出参数,预期的输出参数为目标方向的至少一个声源数据。比如,目标方向的至少一个声源数据包括声源数据1和声源数据2。
(4)、代价函数设计:可以类比参考上述代价函数的介绍,在此不再赘述。
需要说明的是，上述第一级分离方案的几种可能的实现方式可以两两结合实施，或者其中任意三个结合实施，或者全部结合实施，本申请实施例对此不加以限定。
在另一种可能的实现方式中,该第一级分离方案包括盲分离与多声源定位进行多任务设计的技术。该第一级分离方案包括但不限于如下几个方面,如图11所示:
(1)、特征输入部分:包括多通道特征数据。
(2)、conformer特征融合模型111(包括全通道注意力层112):用于将多组多通道特征数据融合为单通道特征数据。
(3)、dpconformer分离模型113、声源分离层114和方向估计层115:将单通道特征数据输入至dpconformer分离模型113中输出得到中间参数,将中间参数输入至声源分离层114中输出得到声源分离结果,并将中间参数输入至方向估计层115中输出得到方位估计结果,声源分离结果包括分离的m个声源数据,方位估计结果包括m个声源数据各自对应的方位信息。比如,输出参数包括声源数据1和声源数据2,以及声源数据1的方位信息和声源数据2的方位信息。
其中,声源分离层114和方向估计层115可以作为单独的模块设置在dpconformer分离模型113外,即在dpconformer分离模型113的输出端设置声源分离层114以及方向估计层115。示意性的,方向估计层115输出的第i个方位信息是声源分离层114分离的第i个声源数据的方位信息,i为正整数。
可选的，方位信息为方位标签，采用one-hot向量形式。比如，多声源定位技术中将水平方位360度，以分辨率gamma=10度为例，等分为360/gamma=36份，即输出维度为36维，方位信息为36维的one-hot向量。
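下面给出一个将方位角量化为one-hot方位标签的示意性Python代码草图，对应上述按分辨率gamma=10度等分为36份的例子；函数名与变量名为为便于说明而假设的：

import numpy as np

def azimuth_to_onehot(azimuth_deg, gamma=10):
    # 将0~360度的水平方位按分辨率gamma度量化为360/gamma维的one-hot方位标签
    dims = 360 // gamma
    onehot = np.zeros(dims, dtype=np.float32)
    onehot[int(azimuth_deg % 360) // gamma] = 1.0
    return onehot

print(azimuth_to_onehot(123.0))      # 36维向量，仅下标12处为1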
(4)、代价函数设计
可选的,分离任务与方向估计任务的代价函数均采用PIT准则。
需要说明的是,上述几个方面的介绍可类比参考上述实施例中的相关描述,在此不再赘述。
在另一种可能的实现方式中,该第一级分离方案包括特定人提取和特定人方位估计进行多任务设计的技术。该第一级分离方案包括但不限于如下几个方面,如图12所示:
(1)、特征输入部分:包括多通道特征数据和注册语音数据。
(2)、conformer特征融合模型121(包括全通道注意力层122):用于将多组多通道特征数据和注册语音数据融合为单通道特征数据。
(3)、dpconformer分离模型123、特定人提取层124和特定人方位估计层125：将单通道特征数据输入至dpconformer分离模型123中输出得到中间参数，将中间参数输入至特定人提取层124中输出得到目标对象的声源数据，并将中间参数输入至特定人方位估计层125中输出得到目标对象的声源数据的方位信息。比如，输出参数包括目标对象的声源数据1和声源数据1的方位信息。可选的，方位信息为方位标签，采用one-hot向量形式。
在给定目标对象的注册语音数据后，利用说话人表示特征数据以及其他多通道特征数据，通过dpconformer网络结构，设计one-hot向量形式的方位标签，采用交叉熵（cross-entropy，CE）代价函数进行训练。特定人提取和特定人方位估计进行多任务设计的技术是：两个任务共享多通道特征数据、注册语音数据、conformer特征融合模型121和dpconformer分离模型123，在dpconformer分离模型123的输出端设置特定人提取层124和特定人方位估计层125，分别采用分离任务和方位估计任务的代价函数加权进行多任务训练。
(4)、代价函数设计
需要说明的是,上述几个方面的介绍可类比参考上述实施例中的相关描述,在此不再赘述。
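针对上述将分离任务与方位估计任务的代价函数加权进行多任务训练的方式，下面给出一个示意性的Python（PyTorch）代码草图；其中分离损失以均方误差为例，方位估计采用交叉熵（CE）代价函数，权重alpha为假设的示例值，并非本申请的实际实现：

import torch
import torch.nn.functional as F

def multitask_loss(est_source, ref_source, dir_logits, dir_index, alpha=0.7):
    # est_source/ref_source: 提取的与标注的目标对象声源数据
    # dir_logits: 特定人方位估计层的输出，形状(36,)；dir_index: one-hot方位标签对应的类别下标
    sep_loss = F.mse_loss(est_source, ref_source)                          # 分离任务代价
    dir_loss = F.cross_entropy(dir_logits.unsqueeze(0), dir_index.view(1))  # 方位估计任务代价
    return alpha * sep_loss + (1.0 - alpha) * dir_loss                     # 加权得到多任务训练损失

loss = multitask_loss(torch.randn(16000), torch.randn(16000),
                      torch.randn(36), torch.tensor(12))
print(loss.item())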
在另一种可能的实现方式中,该第一级分离方案包括盲分离与多说话人识别进行多任务设计的技术,盲分离与多说话人识别进行多任务设计的技术是从麦克风数据中分离出多个声源数据,并识别出多个声源数据各自对应的对象信息,对象信息用于指示该声源数据的对象身份。可选的,电子设备中存储有多个样本声源数据与多个对象信息之间的对应关系。该第一级分离方案包括但不限于如下几个方面,如图13所示:
(1)、特征输入部分:包括多通道特征数据。
(2)、conformer特征融合模型131(包括全通道注意力层132):用于将多组多通道特征数据融合为单通道特征数据。
(3)、dpconformer分离模型133、声源分离层134和对象识别层135：将单通道特征数据输入至dpconformer分离模型133中输出得到中间参数，将中间参数输入至声源分离层134中输出得到声源分离结果，并将中间参数输入至对象识别层135中输出得到对象识别结果，声源分离结果包括分离的m个声源数据，对象识别结果包括m个声源数据各自对应的对象信息。比如，输出参数包括声源数据1和声源数据2，以及声源数据1的对象信息和声源数据2的对象信息。
分离任务和对象识别任务共享多通道特征数据、conformer特征融合模型131和dpconformer分离模型133,在dpconformer分离模型133的输出端设置声源分离层134以及对象识别层135。声源分离层134分离出多个声源数据。对象识别层135在帧级特征计算完成后,进行段级特征融合,得到段级的多对象表示,每个段的对象表示输出该段表示的对象身份,对应的对象信息为one-hot向量,用于指示对象身份。可选的,one-hot向量的维数为对象个数,一个声源数据对应的one-hot向量中该声源数据对应的位置为1,用于指示该声源数据的对象在多个对象中的说话顺序,其他位置为0。
对象识别层135输出的第i个对象信息是声源分离层134分离的第i个声源数据的对象信息,i为正整数。
(4)、代价函数设计
可选的,分离任务与对象识别任务的代价函数均采用PIT准则。
需要说明的是,上述几个方面的介绍可类比参考上述实施例中的相关描述,在此不再赘述。
在另一种可能的实现方式中，该第一级分离方案包括特定人提取和特定人确认进行多任务设计的技术。特定人提取任务是利用目标对象的注册语音数据，从麦克风数据中提取出目标对象的声源数据。而在单独的特定人提取任务中，可能存在麦克风数据不包含目标对象的声源数据、但特定人提取任务仍会输出声源数据的情况，因此需要设置特定人确认任务，对提取的声源数据进行确认。特定人确认任务是确认提取出的声源数据与目标对象的注册语音数据是否相同，或者确认提取出的声源数据对应的对象中是否包含目标对象。特定人提取和特定人确认进行多任务设计的技术是在提取目标对象的声源数据的同时，确定该声源数据的对象识别结果。同样，该任务为离线设计。该第一级分离方案包括但不限于如下几个方面，如图14所示：
(1)、特征输入部分:包括多通道特征数据和注册语音数据。
(2)、conformer特征融合模型141(包括全通道注意力层142):用于将多组多通道特征数据和注册语音数据融合为单通道特征数据。
(3)、dpconformer分离模型143、特定人提取层144和特定人确认层145:将单通道特征数据输入至dpconformer分离模型143中输出得到中间参数,将中间参数输入至特定人提取层144中输出得到目标对象的声源数据,并将中间参数输入至特定人确认层145中输出得到该声源数据的对象识别结果,对象识别结果用于指示输出的声源数据与注册语音数据之间的声学特征相似度。可选的,对象识别结果包括输出的声源数据对应的对象为目标对象的概率。比如,输出参数包括目标对象的声源数据1和声源数据1的对象识别结果。
特定人提取和特定人确认任务共享多通道特征数据、conformer特征融合模型141和dpconformer分离模型143,在dpconformer分离模型143的输出端设置特定人提取层144和特定人确认层145。
(4)、代价函数设计
需要说明的是,上述几个方面的介绍可类比参考上述实施例中的相关描述,在此不再赘述。
本申请实施例涉及的唤醒方案为两阶段唤醒方案,两阶段唤醒方案中的第一级唤醒模块和第二级唤醒模块均为多输入的唤醒模型结构,比如唤醒模型结构为DNN、LSTM、CNN、Transformer、conformer中的任意一种网络结构。需要说明的是,唤醒模型结构还可以采用其他的网络结构,为了方便介绍仅以两阶段唤醒方案中的第一级唤醒模块和第二级唤醒模块均采用图5提供的dpconformer网络结构为例进行说明,该两阶段唤醒方案如图15所示。
电子设备将多通道特征数据和第一分离数据输入至dpconformer唤醒模块151中输出得到第一唤醒数据;当第一唤醒数据指示预唤醒成功时,将多通道特征数据、第一分离数据和第二分离数据输入至dpconformer唤醒模块152中输出得到第二唤醒数据;根据第二唤醒数据确定唤醒结果。
需要说明的是,为了方便介绍,仅以两阶段唤醒方案中的第一级唤醒方案为例进行说明,第二级唤醒方案可类比参考,不再赘述。
在一种可能的实现方式中,本申请实施例提供的第一级唤醒方案包括多输入单输出整词建模的唤醒技术,第一级唤醒模块为多输入单输出整词建模唤醒模块,如图16所示,包括但不限于如下几个方面:
(1)、特征输入部分:包括多组多通道特征数据。其中多组多通道特征数据包括对第一麦克风数据进行预处理得到的多通道特征数据和进行第一级分离处理得到的第一分离数据。
(2)、conformer特征融合模型161(包括全通道注意力层162):用于将多组多通道特征数据融合为单通道特征数据。
(3)、dpconformer分离模型163:将单通道特征数据输入至dpconformer分离模型163,输出得到第一置信度,第一置信度用于指示原始的第一麦克风数据中包括预设唤醒词的概率,该预设唤醒词为默认设置的固定唤醒词。
比如,预设唤醒词包括N个唤醒词,dpconformer分离模型163输出的第一置信度为N+1维向量,N+1维向量的N个维度分别对应N个唤醒词,另外一个维度对应不属于N个唤醒词的类别。N+1维向量中的每个维度的值为0到1之间的概率值,该概率值用于指示对应位置的唤醒词的唤醒概率。
(4)、代价函数设计
需要说明的是,上述几个方面的介绍可类比参考第一级分离方案中的相关描述,在此不再赘述。
在本实施例中,dpconformer分离模型163的输出参数为单输出参数,建模单元的个数为唤醒词个数加一,额外的一个单元为垃圾单元,该垃圾单元用于输出唤醒词以外的其他词的概率值,dpconformer分离模型163的输出参数为第一置信度。
可选的,两个预设唤醒词为预设唤醒词1和预设唤醒词2,每个建模单元的概率值为第一数值、第二数值和第三数值中的一种,当概率值为第一数值时用于指示该声源数据不包括预设唤醒词,当概率值为第二数值时用于指示该声源数据包括预设唤醒词1,当概率值为第三数值时用于指示声源数据包括预设唤醒词2。比如,预设唤醒词1为“小艺小艺”,预设唤醒词2为“你好小艺”,第一数值为0,第二数值为1,第三数值为2。本申请实施例对此不加以限定。
第一级唤醒模块是实时计算的，对于当前输入的多组多通道特征数据，第一级唤醒模块实时判断是否包括固定唤醒词。当输出的第一置信度大于第一门限值，则确定预唤醒成功。对于第一级唤醒模块，当电子设备确定预唤醒成功时，说明此时已经接收到完整的唤醒词信息，将当前时刻确定为唤醒时刻，用于给第二级分离模块和第二级唤醒模块提供时间点参考信息，并启动第二级离线分离模块。
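针对上述多输入单输出整词建模唤醒中根据第一置信度与第一门限值判断是否预唤醒的过程，下面给出一个示意性的Python（PyTorch）代码草图；其中唤醒词个数N、门限值等均为假设的示例值，并非本申请的实际实现：

import torch

def check_prewake(logits, threshold=0.8):
    # logits: (N+1,)，前N维对应N个预设唤醒词，最后一维为垃圾单元
    probs = torch.softmax(logits, dim=-1)
    conf, idx = probs[:-1].max(dim=-1)         # 第一置信度及其对应的唤醒词序号
    return bool(conf > threshold), idx.item(), conf.item()

woke, word_id, conf = check_prewake(torch.tensor([3.0, 0.2, 0.1]))   # N=2个唤醒词的示例输出
print(woke, word_id, conf)                     # 预唤醒成功时，当前时刻即可作为唤醒时刻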
在另一种可能的实现方式中,本申请实施例提供的唤醒方案包括多输入多输出音素建模的唤醒技术,第一级唤醒模块为多输入多输出音素建模唤醒模块,如图17所示,包括但不限于如下几个方面:
(1)、特征输入部分:包括多组多通道特征数据。其中多组多通道特征数据包括对第一麦克风数据进行预处理得到的多通道特征数据和进行第一级分离处理得到的第一分离数据。
(2)、conformer特征融合模型171(包括全通道注意力层172):用于将多组多通道特征数据融合为单通道特征数据。
(3)、dpconformer分离模型173：将单通道特征数据输入至dpconformer分离模型173，输出得到音素集，该音素集包括多个声源数据各自对应的音素序列信息，可选的，音素序列信息为音素序列后验概率，音素序列后验概率为该声源数据对应的各个音素的后验概率值的乘积。比如，dpconformer分离模型173的输出参数包括声源数据1的音素序列信息1和声源数据2的音素序列信息2。
(4)、代价函数设计
需要说明的是,上述几个方面的介绍可类比参考第一级分离方案中的相关描述,在此不再赘述。
对于多输入多输出音素建模唤醒模块,dpconformer分离模型173的输出参数为多个声源数据各自对应的音素序列信息,将多个音素序列信息分别输入至解码器中,最终输出得到多个音素序列信息各自对应的第二置信度。
其中，声源数据对应的音素序列信息用于指示该声源数据中多个音素的概率分布，即音素序列信息包括多个音素各自对应的概率值。对于多个音素序列信息中的每个音素序列信息，调用一次解码器得到该音素序列信息各自对应的第二置信度，第二置信度用于指示该声源数据与预设唤醒词之间的声学特征相似度。解码器部分不参与模型计算；由于模型无法判断哪个分离的声源数据对应预设唤醒词，因此需要计算得到多个声源数据各自对应的音素序列信息。
在本实施例中，建模单元为音素，音素是基本语音单元的表示形式。比如，对于唤醒词“小艺小艺”，对应的音素序列可以为“x i ao y i x i ao y i”，各个音素以空格表示。多声源干扰场景下，声源数据1对应的音素序列1是“x i ao y i x i ao y i”，而声源数据2对应的语音内容可以是“天气怎么样”，对应的音素序列2为“t i an q i z en m o y ang”。dpconformer分离模型173的输出参数包括两个音素序列信息，即声源数据1对应的音素序列1“x i ao y i x i ao y i”的概率值，和声源数据2对应的音素序列2“t i an q i z en m o y ang”的概率值。
对于第一级唤醒模块,以输出参数包括两个音素序列信息为例,一个音素序列信息可以为声源数据1所对应的各个音素的概率分布,另一个音素序列信息可以为声源数据2所对应的各个音素的概率分布。比如,音素集大小为100,则两个音素序列信息分别为100维向量,向量的取值位于大于或者等于0,且小于或者等于1的范围,并且100维的各个数值的和为1。比如,两个音素序列信息分别为100维向量,第一个音素序列信息中对应“x”位置的概率值最高,第二个音素序列信息中对应“t”位置的概率值最高。
在确定两个音素序列信息后，分别计算预设唤醒词的音素序列“x i ao y i x i ao y i”在这两个音素序列信息上的输出概率并进行几何平均，得到这两个音素序列信息各自对应的第二置信度。当任意一个第二置信度大于第一门限值时，则确定预唤醒成功。
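针对上述在多个音素序列信息上计算预设唤醒词音素序列的输出概率并取几何平均得到第二置信度的过程，下面给出一个简化的示意性Python（NumPy）代码草图；为便于说明，这里假设各音素与帧按时间均匀对齐（实际系统中由解码器完成对齐），并非本申请的实际实现：

import numpy as np

def phoneme_confidence(posteriors, wake_phoneme_ids):
    # posteriors: (帧数, 音素集大小)的音素后验概率；wake_phoneme_ids: 预设唤醒词的音素id序列
    frames = np.linspace(0, posteriors.shape[0] - 1, num=len(wake_phoneme_ids)).astype(int)
    probs = posteriors[frames, wake_phoneme_ids]              # 唤醒词各音素在对应帧上的后验概率
    return float(np.exp(np.mean(np.log(probs + 1e-10))))      # 几何平均得到第二置信度

def prewake(source_posteriors, wake_phoneme_ids, threshold=0.5):
    # source_posteriors: 多个声源数据各自对应的音素后验；任一第二置信度大于门限即预唤醒成功
    confs = [phoneme_confidence(p, wake_phoneme_ids) for p in source_posteriors]
    return max(confs) > threshold, confs

post = np.random.dirichlet(np.ones(100), size=80)             # 假设音素集大小为100、共80帧
print(prewake([post, post], wake_phoneme_ids=[3, 7, 15, 7, 3, 7, 15, 7]))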
在另一种可能的实现方式中,本申请实施例提供的唤醒方案包括多输入单输出整词建模的唤醒与方向估计进行多任务设计的技术,第一级唤醒模块为多输入单输出整词建模唤醒模块,如图18所示,包括但不限于如下几个方面:
(1)、特征输入部分:包括多组多通道特征数据。其中多组多通道特征数据包括对第一麦克风数据进行预处理得到的多通道特征数据和进行第一级分离处理得到的第一分离数据。
(2)、conformer特征融合模型181(包括全通道注意力层182):用于将多组多通道特征数据融合为单通道特征数据。
(3)、dpconformer分离模型183、唤醒词检测层184和方位估计层185:将单通道特征数据输入至dpconformer分离模型183中输出得到中间参数,将中间参数输入至唤醒词检测层184中输出得到唤醒信息,并将中间参数输入至方位估计层185中输出得到唤醒事件的方位信息,唤醒信息包括分离出的多个声源数据各自对应的第一置信度,比如方位信息采用one-hot向量形式。
对于唤醒任务,模型是为了计算各个唤醒事件以及垃圾词的概率,而方向估计任务只输出唤醒事件对应的方位信息。因此方位信息为唤醒成功对应的方向估计任务的输出参数。
其中,唤醒词检测层184和方位估计层185可以为额外的网络模块,设置在dpconformer分离模型183的输出端,比如一层的DNN或LSTM,并接着对应维度的线性层和Softmax层。对于唤醒任务,唤醒词检测层184的输出参数(即唤醒信息)为唤醒词的检测概率。对于方位估计任务,方位估计层185的输出参数(即方位信息)为方位估计向量的概率分布。
(4)、代价函数设计
需要说明的是,上述几个方面的介绍可类比参考第一级分离方案中的相关描述,在此不再赘述。
在另一种可能的实现方式中,本申请实施例提供的唤醒方案包括多输入多输出音素建模唤醒与方向估计进行多任务设计的技术,第一级唤醒模块为多输入多输出音素建模唤醒模块,如图19所示,包括但不限于如下几个方面:
(1)、特征输入部分:包括多组多通道特征数据。其中多组多通道特征数据包括对第一麦克风数据进行预处理得到的多通道特征数据和进行第一级分离处理得到的第一分离数据。
(2)、conformer特征融合模型191(包括全通道注意力层192):用于将多组多通道特征数据融合为单通道特征数据。
(3)、dpconformer分离模型193、多唤醒音素序列层194和方位估计层195:将单通道特征数据输入至dpconformer分离模型193中输出得到中间参数,将中间参数输入至多唤醒音素序列层194中输出得到唤醒信息,并将中间参数输入至方位估计层195中输出得到方位估计结果,唤醒信息包括多个声源数据各自对应的音素序列信息,方位估计结果包括多个音素序列信息各自对应的方位信息。可选的,音素序列信息为音素序列后验概率,音素序列后验概率为该声源数据对应的各个音素的后验概率值的乘积。比如,输出参数包括声源数据1的音素序列信息1、声源数据2的音素序列信息2、音素序列信息1的方位信息和音素序列信息2的方位信息。
其中,多唤醒音素序列层194和方位估计层195可以为额外的网络模块,设置在dpconformer分离模型193的输出端。
(4)、代价函数设计
需要说明的是,上述几个方面的介绍可类比参考第一级分离方案中的相关描述,在此不再赘述。
唤醒任务和方向估计任务共享特征输入部分、conformer特征融合模型191和dpconformer分离模型193，唤醒任务的输出参数包括多个声源数据各自对应的音素序列信息，方位估计任务的输出参数包括多个音素序列信息各自对应的方位信息。最终各个音素序列信息通过解码器得到唤醒结果即第一置信度。
需要说明的是，上述第一级唤醒方案的几种可能的实现方式可以两两结合实施，或者其中任意三个结合实施，或者全部结合实施，本申请实施例对此不加以限定。
下面,采用几个示意性的例子对本申请实施例提供的语音唤醒方法进行介绍。
在一个示意性的例子中,电子设备为具有单麦克风的设备,该语音唤醒方法为单通道的两级分离以及两级唤醒的方法。该方法可以使用在电子设备的近场唤醒场景下,当用户在嘈杂环境中使用电子设备的唤醒功能时,在保证唤醒功能具有较高的唤醒率的同时,降低误唤醒率。
如图20所示,该电子设备包括第一级分离模块201、第一级唤醒模块202、第二级分离模块203和第二级唤醒模块204。电子设备通过单麦克风采集原始的第一麦克风数据(比如背景音乐、回声、说话声1、说话声2、说话声K和环境噪声),将第一麦克风数据输入至预处理模块205进行预处理,得到多通道特征数据;将多通道特征数据输入至第一级分离模块201进行第一级分离处理得到第一分离数据;将多通道特征数据和第一分离数据输入至第一级唤醒模块202进行第一级唤醒处理得到第一唤醒数据。电子设备根据第一唤醒数据,判断是否预唤醒。若判断出预唤醒成功,则将多通道特征数据和第一分离数据输入至第二级分离模块203进行第二级分离处理得到第二分离数据;将多通道特征数据、第一分离数据和第二分离数据输入至第二级唤醒模块204进行第二级唤醒处理得到第二唤醒数据。电子设备根据第二唤醒数据判断是否唤醒成功。
基于图20提供的语音唤醒方法,部分步骤还可以被替换实现成为如下一些可能的实现方式。
可选的,预处理模块包括声学回声抵消模块。将声学回声抵消模块的输出参数作为多通道特征数据,输入至后续的分离模块和唤醒模块。
可选的,预处理模块包括声学回声抵消模块和去混响模块。将声学回声抵消模块的输出参数输入至去混响模块,将去混响模块的输出参数作为多通道特征数据,输入至后续的分离模块和唤醒模块。
可选的,第一级唤醒模块和第二级唤醒模块均为上述的多输入单输出整词建模唤醒模块。可选的,第一级唤醒模块和第二级唤醒模块均为上述的多输入多输出音素建模唤醒模块。
可选的，当该场景需要支持特定人唤醒的需求时，两级唤醒模块需要支持特定人确认功能。在一种可能的实现方式中，基于图20提供的例子，如图21所示，第二级分离模块203输出的多个声源数据和目标对象的注册语音数据（即注册说话声）输入至说话人确认模块（Speaker Identification，SID）210，用于确认分离出的多个声源数据是否包括注册语音数据，说话人确认模块210作为单独的网络模块，区别于第二级唤醒模块204。如果第二级唤醒模块204输出的第二唤醒数据指示唤醒成功，且说话人确认模块210确认分离出的多个声源数据中包括注册语音数据，则确定唤醒成功，否则唤醒失败。
在另一种可能的实现方式中,基于图20提供的例子,如图22所示,说话人确认模块210集成在第二级唤醒模块204中,将第一级分离模块201输出的多个声源数据、第二级分离模块203输出的多个声源数据和目标对象的注册语音数据(即注册说话声)输入至第二级唤醒模块204(包括说话人确认模块210)中,输出得到第二唤醒数据和对象确认结果,当第二唤醒数据指示唤醒成功且对象确认结果指示输出的声源数据中存在目标对象的声源数据时,确定唤醒成功,否则唤醒失败。
可选地,对象确认结果用于指示输出的声源数据中是否存在目标对象的声源数据,即对象确认结果用于指示当前的唤醒事件是否为目标对象所引起的。示意性的,对象确认结果包括第一标识和第二标识中的一种,第一标识用于指示输出的声源数据中存在目标对象的声源数据,第二标识用于指示输出的声源数据中不存在目标对象的声源数据。当第二唤醒数据指示唤醒成功且对象确认结果为第一标识时,确定唤醒成功,否则唤醒失败。
在另一种可能的实现方式中,基于图22提供的例子,如图23所示,第一级分离模块201被替换实现为第一级特定人提取模块231,第二级分离模块203被替换实现为第二级特定人提取模块232。将多通道特征数据和注册语音数据输入至第一级特定人提取模块231中输出得到目标对象的第一声源数据,将多通道特征数据和目标对象的第一声源数据输入至第一级唤醒模块202中输出得到第一唤醒数据,当第一唤醒数据指示预唤醒成功时,将多通道特征数据、目标对象的第一声源数据和目标对象的注册语音数据(即注册说话声)输入至第二级特定人提取模块232输出得到目标对象的第二声源数据,将多通道特征数据、目标对象的第一声源数据、第二声源数据、目标对象的注册语音数据输入至第二级唤醒模块204(包括说话人确认模块210)中,输出得到第二唤醒数据和对象确认结果,当第二唤醒数据指示唤醒成功且对象确认结果指示输出的声源数据中存在目标对象的声源数据时,确定唤醒成功,否则唤醒失败。
需要说明的是,在该场景下还可以支持特定人提取技术、视觉数据辅助的特定人提取技术、特定方向提取技术、盲分离与多声源定位进行多任务设计的技术、特定人提取和特定人方位估计进行多任务设计的技术、盲分离与多说话人识别进行多任务设计的技术、唤醒与方向估计进行多任务设计的技术等等。各个步骤的实现细节可参考上述实施例中的相关描述,在此不再赘述。
在另一个示意性的例子中,电子设备为具有多麦克风的设备,该语音唤醒方法为多通道的两级分离以及两级唤醒的方法。该方法可以使用在具有多麦克风的电子设备中,电子设备用于响应预设唤醒词。
如图24所示，该电子设备包括第一级分离模块241、第一级唤醒模块242、第二级分离模块243和第二级唤醒模块244。电子设备通过多麦克风采集原始的第一麦克风数据（比如背景音乐、回声、同向的说话声1和说话声2、说话声K以及环境噪声），将第一麦克风数据输入至预处理模块245进行预处理，得到多通道特征数据；将多通道特征数据输入至第一级分离模块241进行第一级分离处理得到第一分离数据；将多通道特征数据和第一分离数据输入至第一级唤醒模块242进行第一级唤醒处理得到第一唤醒数据。电子设备根据第一唤醒数据，判断是否预唤醒。若判断出预唤醒成功，则将多通道特征数据和第一分离数据输入至第二级分离模块243进行第二级分离处理得到第二分离数据；将多通道特征数据、第一分离数据和第二分离数据输入至第二级唤醒模块244进行第二级唤醒处理得到第二唤醒数据。电子设备根据第二唤醒数据判断是否唤醒成功。
基于图24提供的语音唤醒方法,部分步骤还可以被替换实现成为如下一些可能的实现方式。
可选的,预处理模块包括声学回声抵消模块。可选的,预处理模块包括声学回声抵消模块和去混响模块。
可选的,预处理模块包括声学回声抵消模块、去混响模块和波束滤波模块。将原始的第一麦克风数据进行回声抵消以及去混响处理后,进行多个方向的波束滤波,得到多路波束滤波输出参数、去混响后的多麦克数据和场景的IPD等多组多通道特征数据,输入至后续的分离模块和唤醒模块。
可选的,第一级唤醒模块和第二级唤醒模块均为上述的多输入单输出整词建模唤醒模块。可选的,第一级唤醒模块和第二级唤醒模块均为上述的多输入多输出音素建模唤醒模块。
可选地,在分离、唤醒和定位的多任务场景下,分离任务可以与定位任务进行多任务设计,唤醒任务也可以与定位任务进行多任务设计。可选地,分离任务的执行主体为方向特征提取器,方向特征提取器可以集成在分离模块或唤醒模块中,最终输出得到分离的多个声源数据和多个声源数据各自对应的方位信息。相关介绍可参考上述实施例中对包括定位任务的多任务设计的相关描述,在此不再赘述。
在多任务设计的需求场景下,包括但不限于如下几种可能的多任务设计方式:
1、第一级流式分离与方位估计的多任务设计。第一级分离模块的输出参数包括流式分离的多个声源数据和多个声源数据各自对应的方位信息，第一级分离模块的输出参数除了可以提供给第一级唤醒模块、第二级分离模块和第二级唤醒模块外，第一级分离模块输出的多个声源数据还可以提供给声学事件检测模块，用以判断当前的各个声源数据是否包含特定的声学事件，或者同时提供给说话人确认模块，用以判断当前的各个声源数据对应的身份信息。第一级分离模块输出的多个方位信息可以提供给系统交互控制模块，用以实时显示多个声源数据各自对应的方位。
2、第一级流式唤醒与说话人识别、方位估计的多任务设计。第一级唤醒模块的输出参数包括流式分离的多个声源数据、多个声源数据各自对应的方位信息和对象确认结果,可以用于判断当前的唤醒事件是否为目标对象引起,以及唤醒时刻对应的方位信息。第一级唤醒模块输出的多个方位信息,可以提供给后端系统,用于判断目标对象的主要方位,比如提供给波束形成模块,对该方位上的声源数据进行实时增强,将加强后的声源数据进行语音识别。
3、第二级离线分离与说话人识别、方位估计的多任务设计。离线场景下说话人识别与方位估计的结果更为准确,第二级分离模块的输出参数包括离线分离的多个声源数据、多个声源数据各自对应的方位信息和对象确认结果。第二级分离模块的输出参数可以用于系统调试,确定分离结果的质量。
4、第二级离线唤醒和说话人识别、方位估计的多任务设计:离线唤醒的效果优于实时流式唤醒的效果。第二级唤醒模块的输出参数包括离线分离的多个声源数据、多个声源数据各自对应的方位信息和对象确认结果。方位信息可以作为唤醒事件的补充信息,用于进行后续的唤醒方向增强任务,进行语音识别。
在一种可能的实现方式中,基于图24提供的例子,第二级离线唤醒与唤醒方位估计的多任务设计的示意图如图25所示,第二级唤醒模块244可以采用多输入多输出形式或多输入单输出形式的唤醒模型,最终输出得到分离的多个声源数据和多个声源数据各自对应的方位信息。
在另一种可能的实现方式中,基于图24提供的例子,第二级离线唤醒与说话人确认的多任务设计的示意图如图26所示,说话人确认模块261集成在第二级唤醒模块244中,将第一级分离模块241输出的多个声源数据、第二级分离模块243输出的多个声源数据和目标对象的注册语音数据(即注册说话声)输入至第二级唤醒模块244(包括说话人确认模块261)中,输出得到第二唤醒数据和对象确认结果,当第二唤醒数据指示唤醒成功且对象确认结果指示输出的声源数据中存在目标对象的声源数据时,确定唤醒成功,否则唤醒失败。
可选的，该场景还支持基于神经网络的分离与传统波束技术的结合使用。除了将第一分离数据输入至第一级唤醒模块，将第一分离数据和第二分离数据输入至第二级唤醒模块使用外，还可以将第一分离数据和第二分离数据输入至自适应波束形成模块，比如最小方差无失真响应波束滤波器（Minimum Variance Distortionless Response，MVDR），用于计算噪声干扰协方差矩阵，从而得到更好的空间干扰抑制效果。多个声源数据进行波束滤波后的输出参数可以作为新的声源数据，同时作为额外的特征数据输入至第一级唤醒模块和/或第二级唤醒模块，增强唤醒效果。
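针对上述将分离数据输入自适应波束形成模块、用分离出的干扰数据估计噪声干扰协方差矩阵并按MVDR方式滤波的思路，下面给出一个示意性的Python（NumPy）代码草图；其中导向矢量、STFT的获取方式等均为假设，并非本申请的实际实现：

import numpy as np

def mvdr_weights(noise_stft, steering):
    # noise_stft: (M, T, F)，各麦克风上干扰/噪声（例如由分离数据得到）的STFT
    # steering: (M, F)，目标方向的导向矢量
    M, T, F = noise_stft.shape
    w = np.zeros((M, F), dtype=complex)
    for f in range(F):
        Xf = noise_stft[:, :, f]
        Rn = Xf @ Xf.conj().T / T + 1e-6 * np.eye(M)           # 噪声干扰协方差矩阵（加对角加载）
        d = steering[:, f]
        Rn_inv_d = np.linalg.solve(Rn, d)
        w[:, f] = Rn_inv_d / (d.conj() @ Rn_inv_d)             # MVDR权重
    return w

def apply_beamformer(mix_stft, w):
    # mix_stft: (M, T, F)；输出(T, F)的波束滤波结果，可作为额外特征输入唤醒模块
    return np.einsum('mf,mtf->tf', w.conj(), mix_stft)

M, T, F = 4, 100, 257
w = mvdr_weights(np.random.randn(M, T, F) + 1j * np.random.randn(M, T, F),
                 np.ones((M, F), dtype=complex))
out = apply_beamformer(np.random.randn(M, T, F) + 1j * np.random.randn(M, T, F), w)
print(out.shape)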
在一种可能的实现方式中，基于图24提供的例子，如图27所示，将第一分离数据输入至自适应波束形成模块271中输出得到第一滤波数据，将多通道特征数据、第一分离数据和第一滤波数据输入至第一级唤醒模块242中输出得到第一唤醒数据，当第一唤醒数据指示预唤醒成功时，将多通道特征数据和第一分离数据输入至第二级分离模块243中输出得到第二分离数据，将第二分离数据输入至自适应波束形成模块272中输出得到第二滤波数据，将多通道特征数据、第一分离数据、第二分离数据、第二滤波数据输入至第二级唤醒模块244中，输出得到第二唤醒数据，根据第二唤醒数据确定是否唤醒成功。
可选的,该场景还支持全神经网络的多声源唤醒方案。不使用预处理模块,将原始的第一麦克风数据和计算出的多通道特征数据输入至后续的分离模块和唤醒模块。可选的,第一级分离模块和第二级分离模块需要考虑回声场景,所以需要接受回声的参考信号,用于处理回声问题。在该实现方式中,语音唤醒方法可以运行在装配有GPU或者张量处理单元(Tensor Processing Unit,TPU)等专用神经网络加速的芯片中,从而得到更好的算法加速效果。
在一种可能的实现方式中，基于图24提供的例子，如图28所示，不使用预处理模块245，将原始的第一麦克风数据、计算出的多通道特征数据和回声参考数据输入至第一级分离模块241，输出得到第一分离数据，将第一麦克风数据、多通道特征数据和第一分离数据输入至第一级唤醒模块242中输出得到第一唤醒数据，当第一唤醒数据指示预唤醒成功时，将第一麦克风数据、多通道特征数据、第一分离数据和回声参考信号输入至第二级分离模块243中输出得到第二分离数据，将第一麦克风数据、多通道特征数据、第一分离数据和第二分离数据输入至第二级唤醒模块244中，输出得到第二唤醒数据，根据第二唤醒数据确定是否唤醒成功。
需要说明的是,在该场景下还可以支持特定人提取技术、视觉数据辅助的特定人提取技术、特定方向提取技术、盲分离与多声源定位进行多任务设计的技术、特定人提取和特定人方位估计进行多任务设计的技术、盲分离与多说话人识别进行多任务设计的技术、唤醒与方向估计进行多任务设计的技术等等。各个步骤的实现细节可参考上述实施例中的相关描述,在此不再赘述。
综上所述,本申请实施例提供的语音唤醒方法,在一方面,基于conformer的自注意力网络层建模技术,提供了对偶路径的conformer网络结构,通过设计块内和块间交替进行conformer层的计算,既能对长序列进行建模,又可以避免直接使用conformer带来的计算量增加问题,并且由于conformer网络较强的建模能力,可以显著提升分离效果。
在另一方面,提供了conformer的多组多通道特征数据的融合机制。对于多组多通道特征先进行组内的第一注意力特征数据的计算,再进行组间的第二注意力特征数据的计算,让模型更好的学习到各个特征对最终分离效果的贡献,进一步保证了后续的分离效果。
在另一方面,提供了两阶段分离方案,即用于第一级唤醒的流式分离过程,以及用于第二级唤醒的离线分离过程,由于第二级分离模块可以额外采用第一级分离模块输出的第一分离数据作为输入参数,进一步加强分离效果。
在另一方面,提供了多输入形式的唤醒模块,与相关技术中单输入的唤醒模块相比,不但可以节省计算量,避免多次重复调用唤醒模型带来的计算量显著增加和浪费问题;而且,由于更好的利用各个输入参数的相关性,大大提高了唤醒性能。
在另一方面，提供了声源唤醒任务和其他任务的多任务设计方案，比如其他任务包括声源定位任务、特定人提取任务、特定方向提取任务、特定人确认任务中的至少一种，可以将声源唤醒结果与其他信息关联起来，提供给下游任务，提高了唤醒模块（即第一级唤醒模块和/或第二级唤醒模块）的输出效果。比如其他任务为声源定位任务，输出的唤醒数据包括多个声源数据和多个声源数据各自对应的方位信息，这样唤醒模块可以在提供声源唤醒结果的同时提供更准确的方位信息，与相关技术中直接做空间多固定波束的方案相比，保证了更准确的方位估计效果。又比如，其他任务为特定人提取任务，输出的唤醒数据包括目标对象的声源数据，从而使得电子设备只会响应特定人（即目标对象）的唤醒，进一步降低了误唤醒率。又比如，其他任务为特定方向提取任务，输出的唤醒数据包括目标方向的至少一个声源数据，从而使得电子设备只会响应特定方向（即目标方向）的唤醒，进一步降低了误唤醒率。又比如，以本申请实施例提供的语音唤醒方法的执行主体为机器人为例，其他任务为特定人提取任务和声源定位任务，输出的唤醒数据包括目标对象的声源数据和目标对象的声源数据的方位信息，使得机器人只会响应特定人（即目标对象）的唤醒，并且在被唤醒的同时确定出该特定人的方位，从而机器人可以调整自身朝向以面向特定人，保证后续更好地接受其发出的指令。
请参考图29,其示出了本申请另一个示例性实施例提供的语音唤醒方法的流程图,本实施例以该方法用于图2所示的电子设备中来举例说明。该方法包括以下几个步骤。
步骤2901,获取原始的第一麦克风数据。
步骤2902,根据第一麦克风数据进行第一级处理得到第一唤醒数据,第一级处理包括基于神经网络模型的第一级分离处理和第一级唤醒处理。
步骤2903,当第一唤醒数据指示预唤醒成功时根据第一麦克风数据进行第二级处理得到第二唤醒数据,第二级处理包括基于神经网络模型的第二级分离处理和第二级唤醒处理。
步骤2904,根据第二唤醒数据确定唤醒结果。
需要说明的是,本实施例中的各个步骤的相关介绍可参考上述方法实施例中的相关描述,在此不再赘述。
下述为本申请装置实施例,可以用于执行本申请方法实施例。对于本申请装置实施例中未披露的细节,请参照本申请方法实施例。
请参考图30,其示出了本申请一个示例性实施例提供的语音唤醒装置的框图。该装置可以通过软件、硬件或者两者的结合实现成为一个或多个芯片,或者实现成为语音唤醒系统,或者实现成为图2提供的电子设备的全部或者一部分。该装置可以包括:获取模块3010、第一级处理模块3020、第二级处理模块3030和确定模块3040;
获取模块3010,用于获取原始的第一麦克风数据;
第一级处理模块3020,用于根据第一麦克风数据进行第一级处理得到第一唤醒数据,第一级处理包括基于神经网络模型的第一级分离处理和第一级唤醒处理;
第二级处理模块3030,用于当第一唤醒数据指示预唤醒成功时根据第一麦克风数据进行第二级处理得到第二唤醒数据,第二级处理包括基于神经网络模型的第二级分离处理和第二级唤醒处理;
确定模块3040,用于根据第二唤醒数据确定唤醒结果。
在一种可能的实现方式中,该装置还包括预处理模块,第一级处理模块3020还包括第一级分离模块和第一级唤醒模块;
预处理模块,用于对第一麦克风数据进行预处理得到多通道特征数据;
第一级分离模块,用于根据多通道特征数据进行第一级分离处理,输出得到第一分离数据;
第一级唤醒模块,用于根据多通道特征数据和第一分离数据进行第一级唤醒处理,输出得到第一唤醒数据。
在另一种可能的实现方式中,第二级处理模块3030还包括第二级分离模块和第二级唤醒模块;
第二级分离模块,用于当第一唤醒数据指示预唤醒成功时,根据多通道特征数据和第一分离数据进行第二级分离处理,输出得到第二分离数据;
第二级唤醒模块,用于根据多通道特征数据、第一分离数据和第二分离数据进行第二级唤醒处理,输出得到第二唤醒数据。
在另一种可能的实现方式中,第一级分离处理为流式的声源分离处理,第一级唤醒处理为流式的声源唤醒处理;和/或,
第二级分离处理为离线的声源分离处理,第二级唤醒处理为离线的声源唤醒处理。
在另一种可能的实现方式中,
第一级唤醒模块包括多输入单输出形式或者多输入多输出形式的唤醒模型;和/或,
第二级唤醒模块包括多输入单输出形式或者多输入多输出形式的唤醒模型。
在另一种可能的实现方式中,第一级分离模块和/或第二级分离模块采用对偶路径的conformer网络结构。
在另一种可能的实现方式中,第一级分离模块和/或第二级分离模块为用于执行至少一个任务的分离模块,至少一个任务包括单独的声源分离任务,或者包括声源分离任务和其他任务;
其中,其他任务包括声源定位任务、特定人提取任务、特定方向提取任务、特定人确认任务中的至少一种。
在另一种可能的实现方式中,第一级唤醒模块和/或第二级唤醒模块为用于执行至少一个任务的唤醒模块,至少一个任务包括单独的唤醒任务,或者包括唤醒任务和其他任务;
其中,其他任务包括声源定位任务、特定人提取任务、特定方向提取任务、特定人确认任务中的至少一种。
在另一种可能的实现方式中,第一级分离模块包括第一级多特征融合模型和第一级分离模型;第一级分离模块,还用于:
将多通道特征数据输入至第一级多特征融合模型中输出得到第一单通道特征数据;
将第一单通道特征数据输入至第一级分离模型输出得到第一分离数据。
在另一种可能的实现方式中,第二级分离模块包括第二级多特征融合模型和第二级分离模型;第二级分离模块,还用于:
将多通道特征数据和第一分离数据输入至第二级多特征融合模型中输出得到第二单通道特征数据;
将第二单通道特征数据输入至第二级分离模型输出得到第二分离数据。
在另一种可能的实现方式中,第一级唤醒模块包括多输入单输出形式的第一唤醒模型,第一级唤醒模块,还用于:
将多通道特征数据和第一分离数据输入至第一级唤醒模型中输出得到第一唤醒数据,第一唤醒数据包括第一置信度,第一置信度用于指示原始的第一麦克风数据中包括预设唤醒词的概率。
在另一种可能的实现方式中,第一级唤醒模块包括多输入多输出形式的第一唤醒模型和第一后处理模块,第一级唤醒模块,还用于:
将多通道特征数据和第一分离数据输入至第一唤醒模型中,输出得到多个声源数据各自对应的音素序列信息;
将多个声源数据各自对应的音素序列信息输入至第一后处理模块中,输出得到第一唤醒数据,第一唤醒数据包括多个声源数据各自对应的第二置信度,第二置信度用于指示声源数据与预设唤醒词之间的声学特征相似度。
在另一种可能的实现方式中,第二级唤醒模块包括多输入单输出形式的第二唤醒模型,第二级唤醒模块,还用于:
将多通道特征数据、第一分离数据和第二分离数据输入至第二级唤醒模型中输出得到第二唤醒数据,第二唤醒数据包括第三置信度,第三置信度用于指示原始的第一麦克风数据中包括预设唤醒词的概率。
在另一种可能的实现方式中,第二级唤醒模块包括多输入多输出形式的第二唤醒模型和第二后处理模块,第二级唤醒模块,还用于:
将多通道特征数据、第一分离数据和第二分离数据输入至第二级唤醒模型中,输出得到多个声源数据各自对应的音素序列信息;
将多个声源数据各自对应的音素序列信息输入至第二后处理模块中,输出得到第二唤醒数据,第二唤醒数据包括多个声源数据各自对应的第四置信度,第四置信度用于指示声源数据与预设唤醒词之间的声学特征相似度。
需要说明的是,上述实施例提供的装置,在实现其功能时,仅以上述各功能模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块完成,即将设备的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。另外,上述实施例提供的装置与方法实施例属于同一构思,其具体实现过程详见方法实施例,这里不再赘述。
本申请实施例提供了一种电子设备,该电子设备包括:处理器;用于存储处理器可执行指令的存储器;其中,处理器被配置为执行指令时实现上述由电子设备执行的方法。
本申请实施例提供了一种计算机程序产品,包括计算机可读代码,或者承载有计算机可读代码的非易失性计算机可读存储介质,当计算机可读代码在电子设备的处理器中运行时,电子设备中的处理器执行上述由电子设备执行的方法。
本申请实施例提供了一种语音唤醒系统,该语音唤醒系统用于执行上述由电子设备执行的方法。
本申请实施例提供了一种非易失性计算机可读存储介质,其上存储有计算机程序指令,计算机程序指令被处理器执行时实现上述由电子设备执行的方法。
计算机可读存储介质可以是可以保持和存储由指令执行设备使用的指令的有形设备。计算机可读存储介质例如可以是――但不限于――电存储设备、磁存储设备、光存储设备、电磁存储设备、半导体存储设备或者上述的任意合适的组合。计算机可读存储介质的更具体的例子(非穷举的列表)包括:便携式计算机盘、硬盘、随机存取存储器(Random Access Memory,RAM)、只读存储器(Read Only Memory,ROM)、可擦式可编程只读存储器(Electrically Programmable Read-Only-Memory,EPROM或闪存)、静态随机存取存储器(Static Random-Access Memory,SRAM)、便携式压缩盘只读存储器(Compact Disc Read-Only Memory,CD-ROM)、数字多功能盘(Digital  Video Disc,DVD)、记忆棒、软盘、机械编码设备、例如其上存储有指令的打孔卡或凹槽内凸起结构、以及上述的任意合适的组合。
这里所描述的计算机可读程序指令或代码可以从计算机可读存储介质下载到各个计算/处理设备,或者通过网络、例如因特网、局域网、广域网和/或无线网下载到外部计算机或外部存储设备。网络可以包括铜传输电缆、光纤传输、无线传输、路由器、防火墙、交换机、网关计算机和/或边缘服务器。每个计算/处理设备中的网络适配卡或者网络接口从网络接收计算机可读程序指令,并转发该计算机可读程序指令,以供存储在各个计算/处理设备中的计算机可读存储介质中。
用于执行本申请操作的计算机程序指令可以是汇编指令、指令集架构(Instruction Set Architecture,ISA)指令、机器指令、机器相关指令、微代码、固件指令、状态设置数据、或者以一种或多种编程语言的任意组合编写的源代码或目标代码,所述编程语言包括面向对象的编程语言—诸如Smalltalk、C++等,以及常规的过程式编程语言—诸如“C”语言或类似的编程语言。计算机可读程序指令可以完全地在用户计算机上执行、部分地在用户计算机上执行、作为一个独立的软件包执行、部分在用户计算机上部分在远程计算机上执行、或者完全在远程计算机或服务器上执行。在涉及远程计算机的情形中,远程计算机可以通过任意种类的网络—包括局域网(Local Area Network,LAN)或广域网(Wide Area Network,WAN)—连接到用户计算机,或者,可以连接到外部计算机(例如利用因特网服务提供商来通过因特网连接)。在一些实施例中,通过利用计算机可读程序指令的状态信息来个性化定制电子电路,例如可编程逻辑电路、现场可编程门阵列(Field-Programmable Gate Array,FPGA)或可编程逻辑阵列(Programmable Logic Array,PLA),该电子电路可以执行计算机可读程序指令,从而实现本申请的各个方面。
这里参照根据本申请实施例的方法、装置(系统)和计算机程序产品的流程图和/或框图描述了本申请的各个方面。应当理解,流程图和/或框图的每个方框以及流程图和/或框图中各方框的组合,都可以由计算机可读程序指令实现。
这些计算机可读程序指令可以提供给通用计算机、专用计算机或其它可编程数据处理装置的处理器,从而生产出一种机器,使得这些指令在通过计算机或其它可编程数据处理装置的处理器执行时,产生了实现流程图和/或框图中的一个或多个方框中规定的功能/动作的装置。也可以把这些计算机可读程序指令存储在计算机可读存储介质中,这些指令使得计算机、可编程数据处理装置和/或其他设备以特定方式工作,从而,存储有指令的计算机可读介质则包括一个制造品,其包括实现流程图和/或框图中的一个或多个方框中规定的功能/动作的各个方面的指令。
也可以把计算机可读程序指令加载到计算机、其它可编程数据处理装置、或其它设备上,使得在计算机、其它可编程数据处理装置或其它设备上执行一系列操作步骤,以产生计算机实现的过程,从而使得在计算机、其它可编程数据处理装置、或其它设备上执行的指令实现流程图和/或框图中的一个或多个方框中规定的功能/动作。
附图中的流程图和框图显示了根据本申请的多个实施例的装置、系统、方法和计算机程序产品的可能实现的体系架构、功能和操作。在这点上，流程图或框图中的每个方框可以代表一个模块、程序段或指令的一部分，所述模块、程序段或指令的一部分包含一个或多个用于实现规定的逻辑功能的可执行指令。在有些作为替换的实现中，方框中所标注的功能也可以以不同于附图中所标注的顺序发生。例如，两个连续的方框实际上可以基本并行地执行，它们有时也可以按相反的顺序执行，这依所涉及的功能而定。
也要注意的是,框图和/或流程图中的每个方框、以及框图和/或流程图中的方框的组合,可以用执行相应的功能或动作的硬件(例如电路或ASIC(Application Specific Integrated Circuit,专用集成电路))来实现,或者可以用硬件和软件的组合,如固件等来实现。
尽管在此结合各实施例对本申请进行了描述,然而,在实施所要求保护的本申请过程中,本领域技术人员通过查看所述附图、公开内容、以及所附权利要求书,可理解并实现所述公开实施例的其它变化。在权利要求中,“包括”(comprising)一词不排除其他组成部分或步骤,“一”或“一个”不排除多个的情况。单个处理器或其它单元可以实现权利要求中列举的若干项功能。相互不同的从属权利要求中记载了某些措施,但这并不表示这些措施不能组合起来产生良好的效果。
以上已经描述了本申请的各实施例,上述说明是示例性的,并非穷尽性的,并且也不限于所披露的各实施例。在不偏离所说明的各实施例的范围和精神的情况下,对于本技术领域的普通技术人员来说许多修改和变更都是显而易见的。本文中所用术语的选择,旨在最好地解释各实施例的原理、实际应用或对市场中的技术的改进,或者使本技术领域的其它普通技术人员能理解本文披露的各实施例。

Claims (31)

  1. 一种语音唤醒方法,其特征在于,所述方法包括:
    获取原始的第一麦克风数据;
    根据所述第一麦克风数据进行第一级处理得到第一唤醒数据,所述第一级处理包括基于神经网络模型的第一级分离处理和第一级唤醒处理;
    当所述第一唤醒数据指示预唤醒成功时根据所述第一麦克风数据进行第二级处理得到第二唤醒数据,所述第二级处理包括基于神经网络模型的第二级分离处理和第二级唤醒处理;
    根据所述第二唤醒数据确定唤醒结果。
  2. 根据权利要求1所述的方法,其特征在于,所述根据所述第一麦克风数据进行第一级处理得到第一唤醒数据,包括:
    对所述第一麦克风数据进行预处理得到多通道特征数据;
    根据所述多通道特征数据,调用预先训练完成的第一级分离模块输出得到第一分离数据,所述第一级分离模块用于进行所述第一级分离处理;
    根据所述多通道特征数据和所述第一分离数据,调用预先训练完成的第一级唤醒模块输出得到所述第一唤醒数据,所述第一级唤醒模块用于进行所述第一级唤醒处理。
  3. 根据权利要求2所述的方法,其特征在于,所述当所述第一唤醒数据指示预唤醒成功时根据所述第一麦克风数据进行第二级处理得到第二唤醒数据,包括:
    当所述第一唤醒数据指示预唤醒成功时,根据所述多通道特征数据和所述第一分离数据调用预先训练完成的第二级分离模块输出得到第二分离数据,所述第二级分离模块用于进行所述第二级分离处理;
    根据所述多通道特征数据、所述第一分离数据和所述第二分离数据,调用预先训练完成的第二级唤醒模块输出得到所述第二唤醒数据,所述第二级唤醒模块用于进行所述第二级唤醒处理。
  4. 根据权利要求3所述的方法,其特征在于,
    所述第一级分离处理为流式的声源分离处理,所述第一级唤醒处理为流式的声源唤醒处理;和/或,
    所述第二级分离处理为离线的声源分离处理,所述第二级唤醒处理为离线的声源唤醒处理。
  5. 根据权利要求3或4所述的方法,其特征在于,
    所述第一级唤醒模块包括多输入单输出形式或者多输入多输出形式的唤醒模型;和/或,
    所述第二级唤醒模块包括多输入单输出形式或者多输入多输出形式的唤醒模型。
  6. 根据权利要求3至5任一所述的方法,其特征在于,所述第一级分离模块和/或所述第二级分离模块采用对偶路径的conformer网络结构。
  7. 根据权利要求3至6任一所述的方法,其特征在于,所述第一级分离模块和/或所述第二级分离模块为用于执行至少一个任务的分离模块,所述至少一个任务包括单独的声源分离任务,或者包括所述声源分离任务和其他任务;
    其中,所述其他任务包括声源定位任务、特定人提取任务、特定方向提取任务、特定人确认任务中的至少一种。
  8. 根据权利要求3至7任一所述的方法,其特征在于,所述第一级唤醒模块和/或所述第二级唤醒模块为用于执行至少一个任务的唤醒模块,所述至少一个任务包括单独的唤醒任务,或者包括所述唤醒任务和其他任务;
    其中,所述其他任务包括声源定位任务、特定人提取任务、特定方向提取任务、特定人确认任务中的至少一种。
  9. 根据权利要求2至8任一所述的方法,其特征在于,所述第一级分离模块包括第一级多特征融合模型和第一级分离模型;所述根据所述多通道特征数据,调用预先训练完成的第一级分离模块输出得到第一分离数据,包括:
    将所述多通道特征数据输入至所述第一级多特征融合模型中输出得到第一单通道特征数据;
    将所述第一单通道特征数据输入至所述第一级分离模型输出得到所述第一分离数据。
  10. 根据权利要求3至9任一所述的方法,其特征在于,所述第二级分离模块包括第二级多特征融合模型和第二级分离模型;所述根据所述多通道特征数据和所述第一分离数据调用预先训练完成的第二级分离模块输出得到第二分离数据,包括:
    将所述多通道特征数据和所述第一分离数据输入至所述第二级多特征融合模型中输出得到第二单通道特征数据;
    将所述第二单通道特征数据输入至所述第二级分离模型输出得到所述第二分离数据。
  11. 根据权利要求2至10任一所述的方法,其特征在于,所述第一级唤醒模块包括多输入单输出形式的第一唤醒模型,所述根据所述多通道特征数据和所述第一分离数据,调用预先训练完成的第一级唤醒模块输出得到所述第一唤醒数据,包括:
    将所述多通道特征数据和所述第一分离数据输入至所述第一级唤醒模型中输出得到所述第一唤醒数据,所述第一唤醒数据包括第一置信度,所述第一置信度用于指示原始的所述第一麦克风数据中包括预设唤醒词的概率。
  12. 根据权利要求2至10任一所述的方法,其特征在于,所述第一级唤醒模块包括多输入多输出形式的第一唤醒模型和第一后处理模块,所述根据所述多通道特征数据和所述第一分离数据,调用预先训练完成的第一级唤醒模块输出得到所述第一唤醒数据,包括:
    将所述多通道特征数据和所述第一分离数据输入至所述第一唤醒模型中,输出得到多个声源数据各自对应的音素序列信息;
    将所述多个声源数据各自对应的音素序列信息输入至所述第一后处理模块中,输出得到所述第一唤醒数据,所述第一唤醒数据包括多个声源数据各自对应的第二置信度,所述第二置信度用于指示所述声源数据与预设唤醒词之间的声学特征相似度。
  13. 根据权利要求3至12任一所述的方法,其特征在于,所述第二级唤醒模块包括多输入单输出形式的第二唤醒模型,所述根据所述多通道特征数据、所述第一分离数据和所述第二分离数据,调用预先训练完成的第二级唤醒模块输出得到所述第二唤醒数据,包括:
    将所述多通道特征数据、所述第一分离数据和所述第二分离数据输入至所述第二级唤醒模型中输出得到所述第二唤醒数据,所述第二唤醒数据包括第三置信度,所述第三置信度用于指示原始的所述第一麦克风数据中包括预设唤醒词的概率。
  14. 根据权利要求3至12任一所述的方法,其特征在于,所述第二级唤醒模块包括多输入多输出形式的第二唤醒模型和第二后处理模块,所述根据所述多通道特征数据、所述第一分离数据和所述第二分离数据,调用预先训练完成的第二级唤醒模块输出得到所述第二唤醒数据,包括:
    将所述多通道特征数据、所述第一分离数据和所述第二分离数据输入至所述第二级唤醒模型中,输出得到多个声源数据各自对应的音素序列信息;
    将所述多个声源数据各自对应的音素序列信息输入至所述第二后处理模块中,输出得到所述第二唤醒数据,所述第二唤醒数据包括多个声源数据各自对应的第四置信度,所述第四置信度用于指示所述声源数据与预设唤醒词之间的声学特征相似度。
  15. 一种语音唤醒装置,其特征在于,所述装置包括:获取模块、第一级处理模块、第二级处理模块和确定模块;
    所述获取模块,用于获取原始的第一麦克风数据;
    所述第一级处理模块,用于根据所述第一麦克风数据进行第一级处理得到第一唤醒数据,所述第一级处理包括基于神经网络模型的第一级分离处理和第一级唤醒处理;
    所述第二级处理模块,用于当所述第一唤醒数据指示预唤醒成功时根据所述第一麦克风数据进行第二级处理得到第二唤醒数据,所述第二级处理包括基于神经网络模型的第二级分离处理和第二级唤醒处理;
    所述确定模块,用于根据所述第二唤醒数据确定唤醒结果。
  16. 根据权利要求15所述的装置，其特征在于，所述装置还包括预处理模块，所述第一级处理模块还包括第一级分离模块和第一级唤醒模块；
    所述预处理模块,用于对所述第一麦克风数据进行预处理得到多通道特征数据;
    所述第一级分离模块,用于根据所述多通道特征数据进行所述第一级分离处理,输出得到第一分离数据;
    所述第一级唤醒模块,用于根据所述多通道特征数据和所述第一分离数据进行所述第一级唤醒处理,输出得到所述第一唤醒数据。
  17. 根据权利要求16所述的装置,其特征在于,所述第二级处理模块还包括第二级分离模块和第二级唤醒模块;
    所述第二级分离模块,用于当所述第一唤醒数据指示预唤醒成功时,根据所述多通道特征数据和所述第一分离数据进行所述第二级分离处理,输出得到第二分离数据;
    所述第二级唤醒模块,用于根据所述多通道特征数据、所述第一分离数据和所述第二分离数据进行所述第二级唤醒处理,输出得到所述第二唤醒数据。
  18. 根据权利要求17所述的装置,其特征在于,
    所述第一级分离处理为流式的声源分离处理,所述第一级唤醒处理为流式的声源唤醒处理;和/或,
    所述第二级分离处理为离线的声源分离处理,所述第二级唤醒处理为离线的声源唤醒处理。
  19. 根据权利要求17或18所述的装置,其特征在于,
    所述第一级唤醒模块包括多输入单输出形式或者多输入多输出形式的唤醒模型;和/或,
    所述第二级唤醒模块包括多输入单输出形式或者多输入多输出形式的唤醒模型。
  20. 根据权利要求17至19任一所述的装置,其特征在于,所述第一级分离模块和/或所述第二级分离模块采用对偶路径的conformer网络结构。
  21. 根据权利要求17至20任一所述的装置,其特征在于,所述第一级分离模块和/或所述第二级分离模块为用于执行至少一个任务的分离模块,所述至少一个任务包括单独的声源分离任务,或者包括所述声源分离任务和其他任务;
    其中,所述其他任务包括声源定位任务、特定人提取任务、特定方向提取任务、特定人确认任务中的至少一种。
  22. 根据权利要求17至21任一所述的装置,其特征在于,所述第一级唤醒模块和/或所述第二级唤醒模块为用于执行至少一个任务的唤醒模块,所述至少一个任务包括单独的唤醒任务,或者包括所述唤醒任务和其他任务;
    其中,所述其他任务包括声源定位任务、特定人提取任务、特定方向提取任务、特定人确认任务中的至少一种。
  23. 根据权利要求16至22任一所述的装置,其特征在于,所述第一级分离模块包括第一级多特征融合模型和第一级分离模型;所述第一级分离模块,还用于:
    将所述多通道特征数据输入至所述第一级多特征融合模型中输出得到第一单通道特征数据;
    将所述第一单通道特征数据输入至所述第一级分离模型输出得到所述第一分离数据。
  24. 根据权利要求17至23任一所述的装置,其特征在于,所述第二级分离模块包括第二级多特征融合模型和第二级分离模型;所述第二级分离模块,还用于:
    将所述多通道特征数据和所述第一分离数据输入至所述第二级多特征融合模型中输出得到第二单通道特征数据;
    将所述第二单通道特征数据输入至所述第二级分离模型输出得到所述第二分离数据。
  25. 根据权利要求16至24任一所述的装置,其特征在于,所述第一级唤醒模块包括多输入单输出形式的第一唤醒模型,所述第一级唤醒模块,还用于:
    将所述多通道特征数据和所述第一分离数据输入至所述第一级唤醒模型中输出得到所述第一唤醒数据,所述第一唤醒数据包括第一置信度,所述第一置信度用于指示原始的所述第一麦克风数据中包括预设唤醒词的概率。
  26. 根据权利要求16至24任一所述的装置,其特征在于,所述第一级唤醒模块包括多输入多输出形式的第一唤醒模型和第一后处理模块,所述第一级唤醒模块,还用于:
    将所述多通道特征数据和所述第一分离数据输入至所述第一唤醒模型中,输出得到多个声源数据各自对应的音素序列信息;
    将所述多个声源数据各自对应的音素序列信息输入至所述第一后处理模块中,输出得到所述第一唤醒数据,所述第一唤醒数据包括多个声源数据各自对应的第二置信度,所述第二置信度用于指示所述声源数据与预设唤醒词之间的声学特征相似度。
  27. 根据权利要求17至26任一所述的装置,其特征在于,所述第二级唤醒模块包括多输入单输出形式的第二唤醒模型,所述第二级唤醒模块,还用于:
    将所述多通道特征数据、所述第一分离数据和所述第二分离数据输入至所述第二级唤醒模型中输出得到所述第二唤醒数据,所述第二唤醒数据包括第三置信度,所述第三置信度用于指示原始的所述第一麦克风数据中包括预设唤醒词的概率。
  28. 根据权利要求17至26任一所述的装置,其特征在于,所述第二级唤醒模块包括多输入多输出形式的第二唤醒模型和第二后处理模块,所述第二级唤醒模块,还用于:
    将所述多通道特征数据、所述第一分离数据和所述第二分离数据输入至所述第二级唤醒模型中,输出得到多个声源数据各自对应的音素序列信息;
    将所述多个声源数据各自对应的音素序列信息输入至所述第二后处理模块中,输出得到所述第二唤醒数据,所述第二唤醒数据包括多个声源数据各自对应的第四置信度,所述第四置信度用于指示所述声源数据与预设唤醒词之间的声学特征相似度。
  29. 一种电子设备,其特征在于,所述电子设备包括:
    处理器;
    用于存储处理器可执行指令的存储器;
    其中,所述处理器被配置为执行所述指令时实现权利要求1-14任意一项所述的方法。
  30. 一种非易失性计算机可读存储介质,其上存储有计算机程序指令,其特征在于,所述计算机程序指令被处理器执行时实现权利要求1-14中任意一项所述的方法。
  31. 一种语音唤醒系统,其特征在于,所述语音唤醒系统用于执行权利要求1-14任意一项所述的方法。
PCT/CN2022/083055 2021-03-31 2022-03-25 语音唤醒方法、装置、存储介质及系统 WO2022206602A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP22778784.3A EP4310838A1 (en) 2021-03-31 2022-03-25 Speech wakeup method and apparatus, and storage medium and system
US18/474,968 US20240029736A1 (en) 2021-03-31 2023-09-26 Voice wakeup method and apparatus, storage medium, and system

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110348176.6 2021-03-31
CN202110348176.6A CN115148197A (zh) 2021-03-31 2021-03-31 语音唤醒方法、装置、存储介质及系统

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/474,968 Continuation US20240029736A1 (en) 2021-03-31 2023-09-26 Voice wakeup method and apparatus, storage medium, and system

Publications (1)

Publication Number Publication Date
WO2022206602A1 true WO2022206602A1 (zh) 2022-10-06

Family

ID=83404394

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/083055 WO2022206602A1 (zh) 2021-03-31 2022-03-25 语音唤醒方法、装置、存储介质及系统

Country Status (4)

Country Link
US (1) US20240029736A1 (zh)
EP (1) EP4310838A1 (zh)
CN (1) CN115148197A (zh)
WO (1) WO2022206602A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116168687B (zh) * 2023-04-24 2023-07-21 北京探境科技有限公司 一种语音数据处理方法、装置、计算机设备及存储介质
CN116168703B (zh) * 2023-04-24 2023-07-21 北京探境科技有限公司 一种语音识别方法、装置、系统、计算机设备及存储介质
CN116825108B (zh) * 2023-08-25 2023-12-08 深圳市友杰智新科技有限公司 语音命令词识别方法、装置、设备和介质

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150269945A1 (en) * 2014-03-24 2015-09-24 Thomas Jason Taylor Voice-key electronic commerce
CN105632486A (zh) * 2015-12-23 2016-06-01 北京奇虎科技有限公司 一种智能硬件的语音唤醒方法和装置
US20180167516A1 (en) * 2016-12-13 2018-06-14 Bullhead Innovations Ltd. Universal interface data gate and voice controlled room system
CN108198548A (zh) * 2018-01-25 2018-06-22 苏州奇梦者网络科技有限公司 一种语音唤醒方法及其系统
CN109360585A (zh) * 2018-12-19 2019-02-19 晶晨半导体(上海)股份有限公司 一种语音激活检测方法
US10304440B1 (en) * 2015-07-10 2019-05-28 Amazon Technologies, Inc. Keyword spotting using multi-task configuration
CN110211599A (zh) * 2019-06-03 2019-09-06 Oppo广东移动通信有限公司 应用唤醒方法、装置、存储介质及电子设备
CN110364143A (zh) * 2019-08-14 2019-10-22 腾讯科技(深圳)有限公司 语音唤醒方法、装置及其智能电子设备
CN111161714A (zh) * 2019-12-25 2020-05-15 联想(北京)有限公司 一种语音信息处理方法、电子设备及存储介质
CN111326146A (zh) * 2020-02-25 2020-06-23 北京声智科技有限公司 语音唤醒模板的获取方法、装置、电子设备及计算机可读存储介质
CN111755002A (zh) * 2020-06-19 2020-10-09 北京百度网讯科技有限公司 语音识别装置、电子设备和语音识别方法
CN112272819A (zh) * 2018-06-05 2021-01-26 三星电子株式会社 被动唤醒用户交互设备的方法和系统
CN112289311A (zh) * 2019-07-09 2021-01-29 北京声智科技有限公司 语音唤醒方法、装置、电子设备及存储介质
CN112489663A (zh) * 2020-11-09 2021-03-12 北京声智科技有限公司 一种语音唤醒方法、装置、介质和设备


Also Published As

Publication number Publication date
US20240029736A1 (en) 2024-01-25
CN115148197A (zh) 2022-10-04
EP4310838A1 (en) 2024-01-24

Similar Documents

Publication Publication Date Title
US10818296B2 (en) Method and system of robust speaker recognition activation
CN111699528B (zh) 电子装置及执行电子装置的功能的方法
US20240038218A1 (en) Speech model personalization via ambient context harvesting
WO2022206602A1 (zh) 语音唤醒方法、装置、存储介质及系统
WO2021135577A9 (zh) 音频信号处理方法、装置、电子设备及存储介质
US11687770B2 (en) Recurrent multimodal attention system based on expert gated networks
US10832671B2 (en) Method and system of audio false keyphrase rejection using speaker recognition
US10043521B2 (en) User defined key phrase detection by user dependent sequence modeling
US20220139393A1 (en) Driver interface with voice and gesture control
US20160284349A1 (en) Method and system of environment sensitive automatic speech recognition
Gogate et al. DNN driven speaker independent audio-visual mask estimation for speech separation
US11380326B2 (en) Method and apparatus for performing speech recognition with wake on voice (WoV)
WO2019213443A1 (en) Audio analytics for natural language processing
US10565862B2 (en) Methods and systems for ambient system control
Tao et al. End-to-end audiovisual speech activity detection with bimodal recurrent neural models
WO2021082941A1 (zh) 视频人物识别方法、装置、存储介质与电子设备
CN115312068B (zh) 语音控制方法、设备及存储介质
KR20180025634A (ko) 음성 인식 장치 및 방법
WO2022147692A1 (zh) 一种语音指令识别方法、电子设备以及非瞬态计算机可读存储介质
CN116312512A (zh) 面向多人场景的视听融合唤醒词识别方法及装置
CN116343765A (zh) 自动语境绑定领域特定话音识别的方法和系统
US11743588B1 (en) Object selection in computer vision
US11727926B1 (en) Systems and methods for noise reduction
CN113393834B (zh) 一种控制方法及装置
Ng et al. Small footprint multi-channel convmixer for keyword spotting with centroid based awareness

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22778784

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2022778784

Country of ref document: EP

ENP Entry into the national phase

Ref document number: 2022778784

Country of ref document: EP

Effective date: 20231016

NENP Non-entry into the national phase

Ref country code: DE