WO2022042635A1 - A wake-up recognition method, audio device, and audio device group - Google Patents

A wake-up recognition method, audio device, and audio device group

Info

Publication number
WO2022042635A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio device
wake
sound signal
angle
source
Prior art date
Application number
PCT/CN2021/114728
Other languages
English (en)
French (fr)
Inventor
Li Shuwei (李树为)
Sun Yuan (孙渊)
Qu Shen (屈伸)
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN202011556351.2A external-priority patent/CN114121024A/zh
Application filed by Huawei Technologies Co., Ltd.
Publication of WO2022042635A1

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 — Speaker identification or verification
    • G10L17/22 — Interactive procedures; Man-machine interfaces

Definitions

  • the present application relates to the technical field of terminals, and in particular, to a wake-up identification method, an audio device, and an audio device group.
  • combinations of two or more audio devices have gradually become a trend.
  • for example, a user plays music at home (such as in a living room) through a speaker combination. Compared with a single speaker, a combination of speakers can bring users a more immersive listening experience.
  • the sound information includes all sound information in the environment.
  • the sound information collected by the speaker combination includes not only the sound made by the user, but also the sound produced by the speaker combination itself. Because of this interference from other sound information, the speaker combination cannot accurately recognize the voice wake-up word or voice command issued by the user.
  • the purpose of the present application is to provide a wake-up identification method, an audio device and an audio device group, which help to improve the accuracy of wake-up identification.
  • in a first aspect, a wake-up identification method is provided.
  • the method is applicable to an audio device group that includes a first audio device and a second audio device; the first audio device and the second audio device can communicate, the first audio device includes a first microphone array, and the second audio device includes a second microphone array.
  • the method includes: the first audio device receives a first sound signal, where the first sound signal includes a sound signal sent by a wake-up source; the first audio device performs wake-up recognition on the first sound signal to obtain a first recognition result; the second audio device receives a second sound signal, where the second sound signal includes the sound signal sent by the wake-up source; the second audio device performs wake-up recognition on the second sound signal to obtain a second recognition result; the second audio device sends the second recognition result to the first audio device; and the first audio device determines whether to wake up the audio device group based on the first recognition result and the second recognition result.
  • in the wake-up recognition method provided by the present application, two audio devices cooperate to perform wake-up recognition, which improves the accuracy of wake-up recognition.
  • the first recognition result includes a first probability, the probability that the first sound signal includes wake-up information; the second recognition result includes a second probability, the probability that the second sound signal includes wake-up information. The first audio device determines whether to wake up the audio device group as follows: if the first probability is greater than a first threshold and the second probability is greater than a second threshold, and/or the average or weighted average of the first probability and the second probability is greater than a third threshold, the first audio device determines to wake up the audio device group. That is to say, the respective wake-up recognition results of the two audio devices are combined, which improves the accuracy of wake-up recognition.
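The decision rule above can be sketched as follows; the threshold and weight values are illustrative placeholders, not values from the application:

```python
def should_wake(p1: float, p2: float,
                t1: float = 0.8, t2: float = 0.8, t3: float = 0.7,
                w1: float = 0.5, w2: float = 0.5) -> bool:
    """Decide whether to wake the audio device group.

    p1: probability that the first device's signal contains wake-up information.
    p2: probability reported by the second device.
    t1/t2/t3 and w1/w2 are hypothetical thresholds and weights.
    """
    both_confident = p1 > t1 and p2 > t2
    weighted_avg = (w1 * p1 + w2 * p2) / (w1 + w2)
    return both_confident or weighted_avg > t3
```

Either condition alone suffices, matching the "and/or" phrasing of the design: two individually confident devices, or a fused score above the combined threshold.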
  • before the first audio device performs wake-up recognition on the first sound signal to obtain the first recognition result, the method further includes: the first audio device suppresses the sound signal from the second audio device in the first sound signal; performing wake-up recognition on the first sound signal then includes: the first audio device performs wake-up recognition on the suppressed first sound signal to obtain the first recognition result. That is to say, the first audio device can suppress sound from the direction of a non-wake-up source (such as the sound from the second audio device) to highlight the sound of the wake-up source and improve the accuracy of wake-up recognition.
  • before the second audio device performs wake-up recognition on the second sound signal to obtain the second recognition result, the method further includes: the second audio device suppresses the sound signal from the first audio device in the second sound signal; performing wake-up recognition on the second sound signal then includes: the second audio device performs wake-up recognition on the suppressed second sound signal to obtain the second recognition result. That is, the second audio device can suppress sound from the direction of a non-wake-up source (such as the sound from the first audio device) to highlight the sound of the wake-up source and improve the accuracy of wake-up recognition.
  • before the first audio device suppresses the sound signal from the second audio device in the first sound signal, the method further includes: the first audio device determines that the second audio device and the wake-up source are in different directions. Likewise, before the second audio device suppresses the sound signal from the first audio device in the second sound signal, the method further includes: the second audio device determines that the first audio device and the wake-up source are in different directions.
  • in other words, before the first audio device suppresses the sound from the second audio device, it determines whether the second audio device and the wake-up source are in the same direction, and suppresses the sound from the second audio device only if they are not. This avoids the case where, with the second audio device and the wake-up source in the same direction, suppressing the sound signal in the direction of the second audio device would also suppress the sound of the wake-up source.
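One way to realize this directional suppression is spatial null steering with a two-microphone sub-array: delay one channel so that sound from the known interferer direction lines up, then subtract. This is a minimal sketch under an assumed far-field plane-wave geometry, not the application's actual algorithm; real devices typically use adaptive beamforming over the full microphone array:

```python
import numpy as np

def null_steer(mic0: np.ndarray, mic1: np.ndarray,
               angle_rad: float, spacing_m: float,
               fs: int, c: float = 343.0) -> np.ndarray:
    """Cancel a far-field source arriving from angle_rad (a spatial null).

    A plane wave from angle_rad (measured from broadside) reaches mic1
    later than mic0 by spacing_m * sin(angle_rad) / c seconds. Delaying
    mic0 by that amount and subtracting removes that direction while
    leaving other directions only partially attenuated.
    """
    delay_s = spacing_m * np.sin(angle_rad) / c
    shift = int(round(delay_s * fs))        # delay in whole samples
    delayed = np.roll(mic0, shift)
    if shift > 0:
        delayed[:shift] = 0.0               # zero the wrapped-around samples
    elif shift < 0:
        delayed[shift:] = 0.0
    return mic1 - delayed
```

Rounding the delay to whole samples is a simplification; a production beamformer would use fractional-delay filtering.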
  • the first identification result further includes a first angle, and the first angle is the direction of the wake-up source relative to the first audio device;
  • the second identification result further includes a second angle, and the second angle is the direction of the wake-up source relative to the second audio device.
  • the method further includes: the first audio device determines the direction where the wake-up source is located based on the first angle and the second angle.
  • in the wake-up identification method provided by the present application, two audio devices cooperate to determine the wake-up direction, which yields a more accurate wake-up source direction.
  • the first audio device determines the direction of the wake-up source based on the first angle and the second angle as follows: when the first probability is greater than the second probability, the first angle is determined as the direction of the wake-up source; when the second probability is greater than the first probability, the second angle is determined as the direction of the wake-up source. That is to say, the two audio devices cooperate to determine a more accurate wake-up source direction.
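The selection rule can be written directly; breaking ties toward the first device is an assumption, since the text does not cover the case of equal probabilities:

```python
def pick_direction(p1: float, theta1: float,
                   p2: float, theta2: float) -> float:
    """Report the angle measured by whichever device was more confident
    that its signal contained wake-up information (ties go to device 1)."""
    return theta1 if p1 >= p2 else theta2
```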
  • the second recognition result further includes a second distance, where the second distance is the distance between the wake-up source and the second audio device. In this case, the first audio device determines the direction of the wake-up source based on the first angle and the second angle as follows: the first audio device predicts a third angle of the wake-up source relative to the first audio device using the first distance, the second angle, and the second distance, where the first distance is the distance between the first audio device and the second audio device; the first audio device then determines the direction of the wake-up source according to the third angle and the first angle. That is to say, the two audio devices cooperate to determine a more accurate wake-up source direction.
  • the third angle θ' satisfies a relation in which θ is the second angle, S_cs is the second distance, and D is the first distance.
  • the first audio device determines the direction in which the wake-up source is located according to the third angle and the first angle, including: determining an average or weighted average of the third angle and the first angle as the direction in which the wake-up source is located; or,
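Since the published text elides the exact relation for θ', the prediction can be sketched under an assumed planar geometry: the first audio device at the origin, the second at distance D along the baseline, and both angles measured from that baseline. All names below are illustrative:

```python
import math

def predict_third_angle(d: float, theta2: float, s_cs: float) -> float:
    """Predict theta', the wake-up source's angle at the first device.

    Assumed geometry: first device at the origin, second device at (d, 0);
    theta2 is measured at the second device from the baseline toward the
    first device, so the source sits at
    (d - s_cs * cos(theta2), s_cs * sin(theta2)).
    """
    x = d - s_cs * math.cos(theta2)
    y = s_cs * math.sin(theta2)
    return math.atan2(y, x)

def fuse_direction(theta1: float, theta_pred: float,
                   w1: float = 0.5, w2: float = 0.5) -> float:
    """Weighted average of the measured first angle and the predicted
    third angle, one of the fusion options described above."""
    return (w1 * theta1 + w2 * theta_pred) / (w1 + w2)
```

As a sanity check, an isosceles triangle (d equal to s_cs, θ at 60°) should give θ' of 60° at the first device.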
  • the method further includes: the first audio device receives a third sound signal, and the third sound signal includes a voice command; the first audio device recognizes the sound signal in the direction of the wake-up source in the third sound signal to obtain the voice command, and the voice command is used to control the first audio device. Because the two audio devices cooperate to determine the wake-up direction, the determined direction of the wake-up source is more accurate; therefore, voice command recognition can be performed on the direction of the wake-up source, which improves the accuracy of voice command recognition.
  • the method further includes: the first audio device suppresses the sound signal in the direction of the non-wake-up source in the third sound signal, where the direction of the non-wake-up source is any direction other than the direction of the wake-up source; recognizing the sound signal in the direction of the wake-up source in the third sound signal then includes: the first audio device recognizes the sound signal in the direction of the wake-up source in the suppressed third sound signal to obtain the voice command. That is, when recognizing the voice command, the first audio device can suppress sound from the direction of a non-wake-up source (such as the sound from the second audio device) to highlight the sound of the wake-up source and improve the accuracy of voice command recognition.
  • in a second aspect, a wake-up identification method is provided.
  • the method is applicable to a first audio device, and the method includes: the first audio device receives a first sound signal, where the first sound signal includes a sound signal sent by a wake-up source; the first audio device performs wake-up recognition on the first sound signal to obtain a first recognition result.
  • the first identification result includes a first probability, and the first probability is a probability that the first sound signal includes wake-up information;
  • the second identification result includes a second probability, and the second probability is the probability that the second sound signal includes wake-up information;
  • the first audio device determines whether to wake up the audio device group based on the first recognition result and the second recognition result as follows: if the first probability is greater than the first threshold and the second probability is greater than the second threshold, and/or the average or weighted average of the first probability and the second probability is greater than the third threshold, the first audio device determines to wake up the audio device group.
  • before the first audio device performs wake-up recognition on the first sound signal to obtain the first recognition result, the method further includes: the first audio device suppresses the sound signal from the second audio device in the first sound signal; performing wake-up recognition on the first sound signal then includes: the first audio device performs wake-up recognition on the suppressed first sound signal to obtain the first recognition result.
  • before the first audio device suppresses the sound signal from the second audio device in the first sound signal, the method further includes: the first audio device determines that the second audio device and the wake-up source are in different directions.
  • the first identification result further includes a first angle, and the first angle is the direction of the wake-up source relative to the first audio device;
  • the first audio device determines the direction in which the wake-up source is located based on the first angle and the second angle.
  • the first audio device determines the direction of the wake-up source based on the first angle and the second angle as follows: when the first probability is greater than the second probability, the first angle is determined as the direction of the wake-up source; when the second probability is greater than the first probability, the second angle is determined as the direction of the wake-up source.
  • the second recognition result further includes a second distance, where the second distance is the distance between the wake-up source and the second audio device. In this case, the first audio device determines the direction of the wake-up source based on the first angle and the second angle as follows: the first audio device predicts a third angle of the wake-up source relative to the first audio device using the first distance, the second angle, and the second distance, where the first distance is the distance between the first audio device and the second audio device; the first audio device then determines the direction of the wake-up source according to the third angle and the first angle.
  • the third angle θ' satisfies a relation in which θ is the second angle, S_cs is the second distance, and D is the first distance.
  • the first audio device determines the direction of the wake-up source according to the third angle and the first angle, including:
  • the method further includes:
  • the first audio device receives a third sound signal, and the third sound signal includes a voice command;
  • the first audio device recognizes the sound signal located in the direction of the wake-up source in the third sound signal, and obtains a voice command, where the voice command is used to control the first audio device.
  • the method further includes: the first audio device suppresses the sound signal in the direction of the non-wake-up source in the third sound signal, where the direction of the non-wake-up source is any direction other than the direction of the wake-up source; recognizing the sound signal in the direction of the wake-up source in the third sound signal to obtain the voice command then includes: the first audio device recognizes the sound signal in the direction of the wake-up source in the suppressed third sound signal to obtain the voice command.
  • in a third aspect, a wake-up identification method is also provided.
  • the method is applicable to a second audio device, and the method includes: the second audio device receives a second sound signal, where the second sound signal includes a sound signal sent by a wake-up source; the second audio device performs wake-up recognition on the second sound signal to obtain a second recognition result; the second audio device sends the second recognition result to the first audio device, so that the first audio device performs a wake-up judgment according to the first recognition result and the second recognition result, where the first recognition result is a result obtained by the first audio device performing wake-up recognition on the received first sound signal.
  • before the second audio device performs wake-up recognition on the second sound signal to obtain the second recognition result, the method further includes: the second audio device suppresses the sound signal from the first audio device in the second sound signal; performing wake-up recognition on the second sound signal then includes: the second audio device performs wake-up recognition on the suppressed second sound signal to obtain the second recognition result.
  • before the second audio device suppresses the sound signal from the first audio device in the second sound signal, the method further includes: the second audio device determines that the first audio device and the wake-up source are in different directions.
  • the second identification result further includes a second angle, where the second angle is the direction of the wake-up source relative to the second audio device, and/or the second identification result further includes a second distance, the second distance is the distance of the wake-up source relative to the second audio device.
  • before the second audio device sends the second recognition result to the first audio device, the method further includes: the second audio device receives a query request from the first audio device, where the query request is used to request the recognition result of the second audio device.
  • in a fourth aspect, a wake-up identification method is provided.
  • the method is applicable to a first audio device, and the method includes: the first audio device receives a first sound signal, where the first sound signal includes a sound signal sent by a wake-up source; the first audio device performs wake-up recognition on the first sound signal to obtain a first recognition result; the first audio device suppresses the sound from the second audio device in the first sound signal; the first audio device performs wake-up recognition on the suppressed first sound signal to obtain a third recognition result; and the first audio device determines whether to wake up the audio device group based on the first recognition result and the third recognition result.
  • before the first audio device suppresses the sound from the second audio device in the first sound signal, the method further includes: the first audio device determines the direction of the second audio device relative to the first audio device; suppressing the sound from the second audio device in the first sound signal then includes: the first audio device suppresses the sound located in that direction in the first sound signal.
  • the first recognition result includes a first probability, used to describe the probability that the first sound signal includes wake-up information; the third recognition result includes a third probability, used to describe the probability that the suppressed first sound signal includes wake-up information. The first audio device determines whether to wake up the audio device group based on the first recognition result and the third recognition result as follows: the first audio device determines to wake up the audio device group when the first probability is greater than the first threshold, and/or the third probability is greater than a fourth threshold, and/or the average or weighted average of the first probability and the third probability is greater than a fifth threshold.
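The single-device variant fuses the raw-signal score with the suppressed-signal score; the thresholds and weights below are illustrative placeholders:

```python
def should_wake_single(p1: float, p3: float,
                       t1: float = 0.8, t4: float = 0.8, t5: float = 0.7,
                       w1: float = 0.5, w3: float = 0.5) -> bool:
    """Wake decision from one device's two recognition passes.

    p1: wake-up probability on the raw first sound signal.
    p3: wake-up probability after suppressing the second device's direction.
    t1/t4/t5 and w1/w3 are hypothetical thresholds and weights.
    """
    weighted_avg = (w1 * p1 + w3 * p3) / (w1 + w3)
    return p1 > t1 or p3 > t4 or weighted_avg > t5
```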
  • a fifth aspect provides a wake-up source positioning method.
  • the method is applicable to an audio device group, the audio device group includes a first audio device and a second audio device, the first audio device includes a first microphone array, the second audio device includes a second microphone array, and the first audio device and the second audio device can communicate.
  • the method includes: the first audio device receives the first sound signal; the first audio device calculates and obtains a first angle according to the first sound signal, and the first angle is the direction of the wake-up source relative to the first audio device;
  • the second audio device receives the second sound signal; the second audio device calculates the second angle according to the second sound signal, where the second angle is the direction of the wake-up source relative to the second audio device; the second audio device sends the second angle to the first audio device; and the first audio device determines the direction of the wake-up source based on the first angle and the second angle.
  • in the wake-up identification method provided by the present application, two audio devices cooperate to determine the wake-up direction, which yields a more accurate wake-up source direction.
  • the first audio device determines the direction of the wake-up source based on the first angle and the second angle as follows: when the first probability is greater than the second probability, the first angle is determined as the direction of the wake-up source; when the second probability is greater than the first probability, the second angle is determined as the direction of the wake-up source. The first probability is the probability that the first sound signal includes wake-up information; the second probability is the probability that the second sound signal includes wake-up information.
  • the method further includes: the second audio device calculates a second distance according to the second sound signal, where the second distance is the distance between the wake-up source and the second audio device. The first audio device determines the direction of the wake-up source based on the first angle and the second angle as follows: the first audio device predicts a third angle of the wake-up source relative to the first audio device using the first distance, the second angle, and the second distance, where the first distance is the distance between the first audio device and the second audio device; the first audio device then determines the direction of the wake-up source according to the third angle and the first angle.
  • the third angle θ' satisfies a relation in which θ is the second angle, S_cs is the second distance, and D is the first distance.
  • the method further includes: the first audio device receives a third sound signal, and the third sound signal includes a voice command; the first audio device recognizes the sound signal in the direction of the wake-up source in the third sound signal to obtain the voice command, and the voice command is used to control the first audio device.
  • the method further includes: the first audio device suppresses the sound signal in the direction of the non-wake-up source in the third sound signal, where the direction of the non-wake-up source is any direction other than the direction of the wake-up source; recognizing the sound signal in the direction of the wake-up source in the third sound signal to obtain the voice command then includes: the first audio device recognizes the sound signal in the direction of the wake-up source in the suppressed third sound signal to obtain the voice command.
  • in a sixth aspect, a wake-up source positioning method is provided.
  • the method is applicable to a first audio device, the first audio device includes a first microphone array, the second audio device includes a second microphone array, and the first audio device and the second audio device can communicate.
  • the method includes: the first audio device receives a first sound signal; the first audio device calculates a first angle according to the first sound signal, where the first angle is the direction of the wake-up source relative to the first audio device; the first audio device receives, from the second audio device, a second angle calculated by the second audio device according to the received second sound signal, where the second angle is the direction of the wake-up source relative to the second audio device; and the first audio device determines the direction of the wake-up source based on the first angle and the second angle.
  • the first audio device determines the direction of the wake-up source based on the first angle and the second angle as follows: when the first probability is greater than the second probability, the first angle is determined as the direction of the wake-up source; when the second probability is greater than the first probability, the second angle is determined as the direction of the wake-up source. The first probability is the probability that the first sound signal includes wake-up information; the second probability is the probability that the second sound signal includes wake-up information.
  • the method further includes: the second audio device calculates a second distance according to the second sound signal, where the second distance is the distance between the wake-up source and the second audio device. The first audio device determines the direction of the wake-up source based on the first angle and the second angle as follows: the first audio device predicts a third angle of the wake-up source relative to the first audio device using the first distance, the second angle, and the second distance, where the first distance is the distance between the first audio device and the second audio device; the first audio device then determines the direction of the wake-up source according to the third angle and the first angle.
  • the third angle θ' satisfies a relation in which θ is the second angle, S_cs is the second distance, and D is the first distance.
  • the method further includes: the first audio device receives a third sound signal, and the third sound signal includes a voice command; the first audio device recognizes the sound signal in the direction of the wake-up source in the third sound signal to obtain the voice command, and the voice command is used to control the first audio device.
  • the method further includes: the first audio device suppresses the sound signal in the direction of the non-wake-up source in the third sound signal, where the direction of the non-wake-up source is any direction other than the direction of the wake-up source; recognizing the sound signal in the direction of the wake-up source in the third sound signal to obtain the voice command then includes: the first audio device recognizes the sound signal in the direction of the wake-up source in the suppressed third sound signal to obtain the voice command.
  • in a seventh aspect, a wake-up source positioning method is provided.
  • the method is applicable to a second audio device, the second audio device includes a second microphone array, and the second audio device is capable of communicating with the first audio device.
  • the method includes: the second audio device receives the second sound signal; the second audio device calculates a second angle according to the second sound signal, where the second angle is the direction of the wake-up source relative to the second audio device; the second audio device sends the second angle to the first audio device, so that the first audio device determines the direction of the wake-up source based on the first angle and the second angle.
  • before the second audio device sends the second angle to the first audio device, the method further includes: the second audio device receives a query request from the first audio device, where the query request is used to query the second audio device for the second angle.
  • the method further includes: the second audio device performs wake-up recognition on the second sound signal to obtain a second recognition result, the second recognition result includes a second probability, and the second probability is used to describe the probability that the second sound signal includes wake-up information; and/or the second recognition result further includes a second distance, where the second distance is the distance of the wake-up source relative to the second audio device.
  • in an eighth aspect, an audio device group is provided, comprising: a first audio device and a second audio device;
  • the first audio device includes: a processor, a memory, and a first microphone array, wherein the memory stores a computer program that includes instructions that, when executed by the processor, cause the first audio device to execute the steps of the first audio device in the method provided by the first aspect or the fifth aspect;
  • the second audio device includes: a processor, a memory, and a second microphone array, wherein the memory stores a computer program that includes instructions that, when executed by the processor, cause the second audio device to perform the steps of the second audio device in the method provided in the first aspect or the fifth aspect.
  • in a ninth aspect, a first audio device is provided, comprising: a processor, a memory, and a first microphone array, wherein the memory stores a computer program that includes instructions that, when executed by the processor, cause the first audio device to perform the method steps provided in the second aspect, the fourth aspect, or the sixth aspect.
  • in a tenth aspect, a first audio device is provided, comprising modules/units for performing any possible design method of the second aspect, the fourth aspect, or the sixth aspect; these modules/units may be implemented by hardware, or by hardware executing corresponding software.
  • in an eleventh aspect, a second audio device is provided, comprising: a processor, a memory, and a second microphone array, wherein the memory stores a computer program that includes instructions that, when executed by the processor, cause the second audio device to perform the method steps provided in the third aspect or the seventh aspect.
  • a twelfth aspect provides a second audio device, the audio device comprising modules/units for performing any possible design method of the third aspect or the seventh aspect; these modules/units may be implemented by hardware, or by hardware executing corresponding software.
  • a thirteenth aspect provides a chip, which is coupled to a memory in an electronic device and used to invoke a computer program stored in the memory and execute the method provided in any one of the above-mentioned first to seventh aspects.
  • "Coupled" in the embodiments means that two components are directly or indirectly coupled to each other.
  • a fourteenth aspect provides a computer-readable storage medium, the computer-readable storage medium comprising a computer program which, when run on an electronic device, causes the electronic device to perform the method provided in any one of the above first to seventh aspects.
  • a fifteenth aspect provides a computer program product comprising instructions that, when executed on a computer, cause the computer to perform the method provided by any one of the first to seventh aspects above.
  • FIG. 1 is a schematic flowchart of using an algorithm for voice recognition provided by an embodiment of the present application
  • FIG. 2 is a schematic diagram of a principle in which an audio device determines the direction in which another audio device is located according to an embodiment of the present application;
  • FIG. 3 is a schematic diagram of an application scenario provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of an example of an application scenario provided by an embodiment of the present application.
  • 5A is a schematic flowchart of a wake-up identification method provided by an embodiment of the present application.
  • 5B is a schematic flowchart of another wake-up identification method provided by an embodiment of the present application.
  • 5C is a schematic diagram of another example of an application scenario provided by an embodiment of the present application.
  • 5D is a schematic flowchart of another wake-up identification method provided by an embodiment of the present application.
  • FIG. 6A is a schematic flowchart of another wake-up identification method provided by an embodiment of the present application.
  • FIG. 6B , FIG. 6C and FIG. 6D are schematic diagrams of principles for calculating a third angle provided by an embodiment of the present application.
  • FIG. 7 is a schematic diagram of interaction between an audio device combination and a cloud provided by an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of an audio device according to an embodiment of the present application.
  • references in this specification to "one embodiment” or “some embodiments” and the like mean that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application.
  • appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," etc. in various places in this specification do not necessarily all refer to the same embodiment, but mean "one or more but not all embodiments" unless specifically emphasized otherwise.
  • the terms “including”, “including”, “having” and their variants mean “including but not limited to” unless specifically emphasized otherwise.
  • the term “connected” includes both direct and indirect connections unless otherwise specified. "First” and “second” are only for descriptive purposes, and cannot be understood as indicating or implying relative importance or implying the number of indicated technical features.
  • words such as "exemplarily" or "for example" are used to represent examples, illustrations, or descriptions. Any embodiment or design described in the embodiments of the present application as "exemplarily" or "such as" should not be construed as being preferred or advantageous over other embodiments or designs. Rather, the use of words such as "exemplarily" or "such as" is intended to present the related concepts in a specific manner.
  • An audio device is a device for playing sound signals.
  • the audio device may be a speaker, a mobile phone, a notebook computer, a TV, a smart bracelet, a watch, and the like.
  • the audio device may also be a logic device.
  • the so-called logic device can be understood as a logic unit/module that can play sound signals, and does not limit the type and performance of hardware devices. For example, it may be one or more logical units/modules in one or more hardware devices.
  • Audio devices can also be used for speech recognition.
  • the speech recognition process includes two processes: wake-up recognition and voice command recognition.
  • wake-up recognition can be understood as recognizing wake-up sentences, such as "Xiaoai Xiaoai";
  • voice command recognition can be understood as recognizing voice commands (or commands), such as playing song XXX, switching to the next song, and so on.
  • the module responsible for the wake-up recognition in the audio device is called the wake-up recognition module
  • the module responsible for the voice command recognition is called the voice command recognition module.
  • the wake-up recognition module and the voice command recognition module are logical functional divisions, and their corresponding physical devices may be the same or different.
  • the voice command recognition module does not need to be enabled all the time. For example, when the wake-up recognition module detects the wake-up sentence, the voice command recognition module is enabled, and the voice command recognition module performs voice command recognition.
  • An exemplary scenario is: a wake-up source (such as a user) sends out a sound signal that includes a wake-up sentence; the sound signal is collected by the wake-up recognition module in the audio device, and when the wake-up recognition module detects that the sound signal includes the wake-up sentence, the voice command recognition module is enabled to perform voice recognition.
  • Then, the wake-up source (such as the user) sends out a sound signal including a voice command; this sound signal is collected by the voice command recognition module, which recognizes the voice command in the sound signal and executes the corresponding operation in response to the voice command.
  • the algorithms involved in this application may include wake-up recognition algorithms and voice command recognition algorithms.
  • the wake-up identification algorithm is used for wake-up identification. For example, identifying whether the collected sound signal includes wake-up information (for example, Xiaoai Xiaoai); the voice command recognition algorithm is used to recognize the voice command. For example, identify whether the collected sound signal includes a voice command (eg, play song XXX).
  • FIG. 1 is a schematic flowchart of a voice command recognition algorithm. As shown in (a) of FIG. 1 , the flow of the voice command recognition algorithm includes steps 1 to 5 .
  • Step 1: A sound signal is received, for example, a sound signal from a wake-up source (such as a user).
  • Step 2: Feature extraction, which can be understood as extracting the recognizable components of the sound signal.
  • the features of each frame of the sound signal can be extracted.
  • the feature extraction may be extracted by using a mel-frequency cepstral coefficients (mel-frequency cepstral coefficients, MFCC) algorithm, which is not described in detail in this embodiment of the present application.
  • Step 3: Obtain phonemes based on the features.
  • the pronunciation of words is composed of phonemes.
  • initials, finals, etc. are generally used as phoneme sets.
  • This process may be implemented by an acoustic model such as a hidden Markov model (hidden markov model, HMM), which is not described in detail in this embodiment of the present application.
  • Step 4: Obtain words based on phonemes; for example, match words in a pronunciation dictionary by phonemes.
  • Step 5: Obtain sentences based on words. Assuming that the obtained sentence is "play song XXX", the audio device responds to the voice instruction and plays song XXX.
  • FIG. 1 is a schematic flowchart of a wake-up identification algorithm. As shown in (b) of FIG. 1 , the flow of the wake-up identification algorithm includes steps 1 to 7 . Wherein, steps 1 to 5 are the same as steps 1 to 5 in the voice command recognition flow, and are not repeated here. Steps 6 to 7 are described below.
  • Step 6: Compare the identified sentence with the preset sentence to obtain the similarity.
  • the preset sentence is a pre-set wake-up sentence.
  • Step 7: Output the similarity.
  • For example, the similarity may be 80%, 60%, etc.
  • the similarity can also be converted into a probability value. For example, if the similarity is 80%, the corresponding probability is 0.8; if the similarity is 20%, the corresponding probability is 0.2. In this case, step 7 can also output the probability value.
  • the wake-up recognition algorithm can calculate a similarity or probability value.
  • the similarity and probability values are collectively referred to as "confidence".
  • FIG. 1 exemplifies a voice command recognition algorithm, but this application is not limited thereto; other voice command recognition algorithms are also possible.
  • FIG. 1 illustrates a wake-up recognition algorithm, but other wake-up recognition algorithms are also possible.
  • Confidence refers to the similarity or probability value calculated by the wake-up recognition algorithm. Taking the confidence as a probability value as an example, it is used to indicate the probability that the sound signal includes wake-up information. For example, the probability value can be 0.1, 0.5, 0.9, etc.
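  • The similarity-to-confidence computation of steps 6 and 7 can be sketched as follows. The use of `difflib` string similarity is an assumption for illustration, since the text does not specify how the similarity is computed, and the function name is hypothetical.

```python
import difflib

# The preset wake-up sentence, taken from the example in this text.
WAKE_SENTENCE = "xiaoai xiaoai"

def wake_confidence(recognized: str) -> float:
    """Step 6: compare the identified sentence with the preset
    sentence to obtain a similarity; step 7: output it. The ratio
    is already in [0, 1], so 80% similarity maps to 0.8."""
    return difflib.SequenceMatcher(
        None, recognized.lower(), WAKE_SENTENCE).ratio()
```

A recognized sentence identical to the preset wake-up sentence yields a confidence of 1.0, while an unrelated sentence such as a song request yields a low value.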
  • the orientation referred to in this application includes the orientation of one audio device relative to another audio device in the audio device combination, or the orientation of the wakeup source relative to the audio device, and so on.
  • the direction may be described using an angle, for example, the direction of the wake-up source relative to the audio device may mean that the wake-up source is located at 30 degrees north west, 60 degrees north east, and so on.
  • the principle of determining the direction of the second audio device relative to the first audio device by the first audio device is described below by taking the microphone array positioning technology as an example.
  • a microphone array is provided in the first audio device.
  • the microphone array can be understood as a plurality of microphones distributed according to a specific rule (such as three rows and three columns, five rows and five columns, etc.).
  • the microphone 1 on the first audio device receives the sound wave 1 at time t1
  • the microphone 2 receives the sound wave 2 at time t2. Therefore, the time difference between sound wave 1 and sound wave 2 is t2-t1, as shown in FIG. 2.
  • t1 and t2 are known quantities
  • the sound propagation speed c is a known quantity
  • the distance D between microphone 1 and microphone 2 is known (the distance D can be stored by default at the factory). Therefore, as shown in FIG. 2, the included angle θ can be determined by the above known quantities t1, t2, c and D. For example, the included angle θ satisfies: cos θ = c·(t2−t1)/D, that is, θ = arccos(c·(t2−t1)/D).
  • In this way, the value of the included angle θ can be obtained, and the included angle θ is used to indicate the direction of the second audio device relative to the first audio device.
  • FIG. 2 is an example in which the microphone array includes two microphones. It will be appreciated that the number of microphones in the microphone array may be greater. For example, the microphone array is three rows and three columns, including 9 microphones, then multiple angles can be obtained. Any one of the plurality of included angles may be used as the direction of the second audio device relative to the first audio device, or the average value of the plurality of included angles may be used as the direction of the second audio device relative to the first audio device.
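  • The angle calculation above can be sketched as follows; the far-field arccos relation is the standard reading of the FIG. 2 geometry, while the function names and the 343 m/s sound speed are illustrative assumptions.

```python
import math

SPEED_OF_SOUND = 343.0  # c in m/s; assumed room-temperature value

def tdoa_angle(t1: float, t2: float, d: float) -> float:
    """Included angle (degrees) between the incoming wavefront and
    the two-microphone axis, from arrival times t1, t2 and spacing d,
    using cos(theta) = c*(t2 - t1)/d (far-field assumption)."""
    ratio = SPEED_OF_SOUND * (t2 - t1) / d
    ratio = max(-1.0, min(1.0, ratio))  # guard against measurement noise
    return math.degrees(math.acos(ratio))

def average_angle(pairs) -> float:
    """With a larger array (e.g. three rows and three columns),
    multiple pair-wise angles are obtained; the text allows using
    any one of them or their average."""
    angles = [tdoa_angle(t1, t2, d) for t1, t2, d in pairs]
    return sum(angles) / len(angles)
```

For example, a zero time difference means the wavefront arrives at both microphones simultaneously, giving a 90-degree included angle.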
  • the above takes the microphone array positioning technology as an example for introduction. It can be understood that, in addition to the above-mentioned microphone array positioning technology, other positioning techniques can also be used, such as a beam-directed (steered-beamformer) method, a high-resolution spectral analysis-based (high-resolution spectral analysis)-based Sound time difference (time-delay estimation, TDE) orientation method, etc., are not limited in the embodiments of the present application.
  • the first audio device determining the direction of the second audio device relative to the first audio device as an example. It can be understood that the principle of determining the direction of the first audio device relative to the second audio device by the second audio device is similar to the above-mentioned principle, and will not be repeated here.
  • the first audio device or the second audio device may also determine the direction of the wake-up source (such as the user) based on the principle shown in FIG. 2 .
  • the distance referred to in this application includes the distance between the first audio device and the second audio device in the audio device combination, or the distance between the wake-up source and the audio device.
  • the following description mainly takes the distance between the first audio device and the second audio device as an example.
  • In the distance calculation formula, X is the distance of the second audio device relative to the first audio device, λ is the wavelength of the sound wave, and L is the length of the microphone array (a pre-stored value).
  • If the microphone array consists of the two microphones in FIG. 2, L is equal to D; if the microphone array is a matrix, L is equal to the length of the matrix, for example, the distance between the centroid of the microphone in the first row, first column and the centroid of the microphone in the third row, first column.
  • the first audio device can also calculate the distance between the sound source and the first audio device based on the above distance calculation method.
  • the second audio device can also calculate the distance between the wake-up source and the second audio device based on the above distance calculation method.
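  • The wavelength-based distance formula itself is not reproduced in this text, so the following sketch uses plain geometric triangulation between two bearings as a stand-in; it is not the patent's formula, and all names are illustrative.

```python
import math

def triangulate_range(baseline: float, alpha_deg: float, beta_deg: float) -> float:
    """Distance from microphone (or device) 1 to the sound source by
    triangulation: alpha and beta are the interior angles that the
    source direction makes with the baseline at each end. Law of
    sines: range = baseline * sin(beta) / sin(pi - alpha - beta)."""
    alpha = math.radians(alpha_deg)
    beta = math.radians(beta_deg)
    gamma = math.pi - alpha - beta  # angle subtended at the source
    return baseline * math.sin(beta) / math.sin(gamma)
```

For instance, a source directly above end 1 of a unit baseline (90 degrees at end 1, 45 degrees at end 2) is at unit distance from end 1.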
  • the "plurality” in the embodiments of the present application refers to two or more (including two), and in view of this, the “plurality” can also be understood as “at least two” in the embodiments of the present application.
  • “At least one” can be understood as one or more, such as one, two or more. For example, including at least one means including one or more, and does not limit which ones are included. For example, if at least one of A, B, and C is included, then A, B, C, A and B, A and C, B and C, or A and B and C may be included. Similarly, the understanding of descriptions such as "at least one" is similar.
  • the embodiments of the present application refer to ordinal numbers such as "first” and "second" to distinguish multiple objects.
  • first audio device and the second audio device are only used to distinguish the two audio devices, and are not used to limit the order, timing, priority or importance of the two audio devices.
  • the two audio devices may be respectively arranged in different geographic locations. Each of the two audio devices can play a sound signal, and the two audio devices can communicate with each other to realize data transmission.
  • the two audio devices may be the same type of audio devices, for example, both audio devices are speakers; or both are mobile phones, both are tablet computers, etc.; or, the two audio devices may include different types of audio devices, such as The two audio devices are a combination of a speaker and a mobile phone; or, a combination of a mobile phone and a tablet, etc. Compared with playing music on a single audio device, when two audio devices play music at the same time, a better listening experience can be provided to the user.
  • the wake-up recognition process includes: after the main audio device receives the sound signal, identifying whether the sound signal includes a wake-up sentence; if so, waking up the voice command recognition module in the main audio device to perform voice command recognition.
  • the secondary audio device does not participate, and mainly relies on the primary audio device for wake-up recognition.
  • the primary audio device locates the orientation of the wake-up source (such as the user) relative to the primary audio device during wake-up recognition.
  • the voice command recognition process generally includes: after the main audio device determines the direction of the wake-up source, the voice command is recognized based on the direction. For example, assuming that the wake-up source is at angle 1, the main audio device receives the sound signal of the wake-up source, considers the sound at angle 1 in the sound signal to be the sound of the wake-up source, and performs voice command recognition on the sound at angle 1.
  • the secondary audio device is still not involved in this process.
  • the primary audio device detects the voice command, it executes the voice command, and notifies the secondary audio device to execute the voice command (eg, play a song, etc.), and the secondary audio device participates in use. To put it simply, only the primary audio device is relied on for wake-up recognition, wake-up source direction determination, and voice command recognition, and the secondary audio device does not participate in wake-up recognition, wake-up source direction determination, and voice command recognition.
  • FIG. 4 is an example of an application scenario provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of a living room in a home.
  • Two audio devices are arranged in the living room (for example, the audio devices on the left and right sides of the TV cabinet, such as the first audio device and the second audio device).
  • the first audio device is the primary audio device
  • the second audio device is the secondary audio device.
  • The following takes the scene (or environment) shown in FIG. 4 as an example, where there are only two audio devices and a wake-up source (e.g., a user) emits sound.
  • the sound signal received by the primary audio device includes not only the sound signal from the wake-up source (ie the user), but also the sound signal from the noise source (ie the secondary audio device). This will lead to the following problems:
  • Due to the interference of the noise source, the primary audio device cannot recognize the wake-up sentence issued by the wake-up source (such as the user).
  • the user sends a wake-up sentence for many times, but still cannot wake up the audio device, and the user experience is poor.
  • the main audio device cannot accurately determine the direction of the wake-up source. For example, suppose the user is actually at angle A, but the primary audio device recognizes that the user is at angle B due to the interference of the audio signal of the secondary audio device. Obviously, angle B is inaccurate.
  • Since it is recognized that the wake-up source is at angle B, in the subsequent voice command recognition process, the main audio device performs voice command recognition on the sound signal at angle B in the received sound signal. Obviously, the sound signal corresponding to angle B is not the sound signal sent by the user, so the accuracy of voice command recognition is low.
  • embodiments of the present application provide a wake-up identification method.
  • the method does not rely solely on the main audio device for wake-up recognition, direction determination of the wake-up source, and voice command recognition, but rather the cooperation among multiple audio devices to improve the accuracy of wake-up recognition and voice recognition.
  • This embodiment 1 introduces that multiple audio devices cooperate to perform wake-up identification, instead of relying solely on one audio device to wake up a device, so as to improve the accuracy of wake-up identification.
  • This embodiment 1 is introduced by taking the application scenario shown in FIG. 4 as an example, that is, there are only two audio devices in the scenario (or environment) and a wake-up source (e.g., the user) that sends out sound signals; for example, the two audio devices are playing music and the wake-up source emits the wake-up sentence.
  • both the first audio device and the second audio device can perform wake-up identification to obtain their respective identification results, and the first audio device integrates the two identification results for further wake-up judgment.
  • the flow of the first mode includes the following steps:
  • a first audio device receives a first sound signal.
  • the first sound signal includes a wake-up sentence sent by the wake-up source, and also includes a sound signal sent by the second audio device (the process is not shown in the figure).
  • the second audio device receives a second sound signal.
  • the second sound signal includes a wake-up sentence issued by the wake-up source, and also includes a sound signal issued by the first audio device (the process is not shown in the figure).
  • the execution order of S501 and S502 is not limited in this application.
  • the first audio device performs wake-up recognition on the first sound signal to obtain a first recognition result.
  • the first recognition result includes a first confidence level.
  • the first confidence level may be a first probability, calculated by the first audio device using a wake-up recognition algorithm, that the first sound signal contains the wake-up sentence.
  • the wake-up recognition algorithm please refer to the introduction in the previous term explanation part, and will not be repeated here.
  • the second audio device performs wake-up recognition on the second sound signal to obtain a second recognition result.
  • the second identification result includes a second confidence level.
  • the second confidence level may be a second probability, calculated by the second audio device using the wake-up recognition algorithm, that the second sound signal contains the wake-up sentence.
  • Optionally, the wake-up recognition algorithm may be enabled at an appropriate time to calculate the confidence level.
  • For example, the first audio device and/or the second audio device may monitor the receiving moments of the sound signal of the wake-up source in real time. When the time interval between receiving moments is less than a preset value, the wake-up source has been speaking continuously (possibly chatting); at this time, the monitoring state can be continued without starting the wake-up recognition algorithm. When the time interval is detected to be greater than the preset value, the wake-up recognition algorithm is started.
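  • The interval-based gating described above might be sketched as follows; the class name, the 2-second preset value, and the use of a monotonic clock are illustrative assumptions.

```python
import time

# Illustrative preset value: how long the wake-up source must pause
# before the wake-up recognition algorithm is started.
PRESET_INTERVAL = 2.0  # seconds (assumed)

class WakeupGate:
    """Monitor the receiving moments of the wake-up source's sound
    signal in real time; keep monitoring while it speaks continuously
    (possibly chatting) and only start the wake-up recognition
    algorithm once the gap exceeds the preset value."""

    def __init__(self):
        self.last_received = None

    def on_sound_signal(self, now=None):
        now = time.monotonic() if now is None else now
        prev, self.last_received = self.last_received, now
        if prev is None or (now - prev) < PRESET_INTERVAL:
            return "keep_monitoring"
        return "start_wakeup_recognition"
```

Successive signals 0.5 seconds apart keep the gate in the monitoring state; a gap of several seconds triggers wake-up recognition.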
  • the second audio device sends the second identification result to the first audio device.
  • Each time the second audio device calculates the second confidence level, it may actively send it to the first audio device; or, the first audio device may send query information for querying the second confidence level to the second audio device (for example, before S505), and after the second audio device receives the query information, it sends the second confidence level to the first audio device.
  • the second confidence level may also be pre-judged. For example, it is judged whether the second confidence level is greater than a preset threshold; if it is greater, the second confidence level is sent to the first audio device; otherwise, the second confidence level is not sent to the first audio device.
  • If the second confidence level is not greater than the preset threshold, the second audio device can determine that the received second sound signal does not include the wake-up sentence, so there is no need to send the second confidence level to the first audio device for further wake-up identification.
  • the second audio device may also directly send the second confidence level to the first audio device without pre-judging the second confidence level.
  • the first audio device determines whether to wake up based on the first identification result and the second identification result. If yes, execute S507, otherwise, it may not respond.
  • the first identification result includes a first confidence level
  • the second identification result includes a second confidence level
  • the first audio device determines whether to wake up based on the first identification result and the second identification result, specifically including: determining whether to wake up based on the first confidence level and the second confidence level. For example, when the first audio device judges that the following conditions are met, it determines that wake-up is required, where the conditions include at least one of the following:
  • the first confidence level is greater than the first threshold, and/or the second confidence level is greater than the second threshold.
  • the first threshold and the second threshold may be the same or different; for example, if the first confidence level is 0.95, the first threshold is 0.9, the second confidence level is 0.85, and the second threshold is 0.8, the first audio device wakes up.
  • the first threshold and the second threshold may be preset. Alternatively, the first threshold and the second threshold can be set or adjusted according to user needs.
  • the average of both the first confidence level and the second confidence level or the weighted average of both is greater than the third threshold.
  • the third threshold may be preset. Alternatively, the third threshold can be set or adjusted according to user needs.
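  • The wake-up decision of S506 can be sketched as follows; the threshold values and equal weights are illustrative assumptions, since the text says they may be preset or set/adjusted according to user needs.

```python
# Illustrative thresholds; the text allows these to be preset or
# adjusted according to user needs.
FIRST_THRESHOLD = 0.9   # for the first confidence level
SECOND_THRESHOLD = 0.8  # for the second confidence level
THIRD_THRESHOLD = 0.85  # for the (weighted) average

def should_wake(conf1: float, conf2: float,
                w1: float = 0.5, w2: float = 0.5) -> bool:
    """Determine whether to wake up based on the first and second
    confidence levels (S506). Waking when at least one of the two
    listed conditions holds is one reading of the text."""
    per_device = conf1 > FIRST_THRESHOLD and conf2 > SECOND_THRESHOLD
    averaged = (w1 * conf1 + w2 * conf2) > THIRD_THRESHOLD
    return per_device or averaged
```

With the example values from the text (first confidence 0.95 against threshold 0.9, second confidence 0.85 against threshold 0.8), the first condition holds and the device wakes up.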
  • the first audio device includes a wake-up recognition module and a voice command recognition module, and the voice command recognition module does not need to be in an enabled state all the time. Therefore, the first audio device determines whether to wake up based on the first recognition result and the second recognition result, which can be understood as determining whether to wake up the voice command recognition module in the first audio device based on the first recognition result and the second recognition result.
  • the first audio device sends a wake-up response.
  • For example, an audio response of "I'm here" may be issued to notify the user that the first audio device has been woken up.
  • S507 is not a required step. That is, in one embodiment, the process may only include S501-S506, but not S507.
  • the first audio device may also send a wake-up instruction to the second audio device to wake up the second audio device. For example, wake up the voice command recognition module in the second audio device.
  • the first audio device may not need to send a wake-up command to the second audio device; or, the second audio device may not have a voice command recognition module.
  • FIG. 5A is an example of the first audio device judging whether to wake up according to the first identification result and the second identification result ( S506 ). It can be understood that this step can also be performed by a second audio device. For example, the first audio device sends the first recognition result to the second audio device, and the second audio device determines whether to wake up based on the first recognition result and the second recognition result.
  • the first audio device uses the second identification result (including the second confidence level) calculated by the second audio device to determine whether to wake up, instead of relying solely on the first identification result (including the first confidence level) calculated by the first audio device itself, which improves the accuracy of wake-up recognition to a certain extent.
  • the two audio devices perform wake-up recognition respectively, and a total of two recognition results are obtained, and then the two recognition results are combined to determine whether to wake up.
  • the first audio device and/or the second audio device may also suppress the sound from the other party in the received sound signal before performing the wake-up identification.
  • the first audio device suppresses the sound from the second audio device in the received first sound signal
  • the second audio device suppresses the sound from the first audio device in the received second sound signal.
  • the suppressed sound signal highlights the sound of the wake-up source, and the wake-up recognition accuracy of the suppressed sound signal is further improved.
  • FIG. 5B is a schematic flowchart of the second method.
  • the process includes:
  • the first audio device receives a third sound signal from the second audio device.
  • the first audio device determines the direction of the second audio device relative to the first audio device according to the third sound signal.
  • the directions can be calculated in real time, for example, each time the two audio devices start playing music, or at regular time intervals; or, when the first audio device or the second audio device detects a change in position (for example, a sensor on the audio device detects the position change).
  • S602 may also be performed by the second audio device.
  • the second audio device calculates the direction of the first audio device relative to the second audio device, and sends the direction to the first audio device. The opposite direction of this direction is the direction of the second audio device relative to the first audio device.
  • the direction of the second audio device relative to the first audio device can be calculated in advance, that is, before the wake-up sentence (i.e., the first sound signal) is received from the wake-up source, the direction has already been calculated, so that this direction can be used in the subsequent process to suppress the sound from the second audio device.
  • S601 and S602 may not be executed.
  • the first audio device may also obtain the direction of the second audio device relative to the first audio device in other ways.
  • For example, the user manually inputs the direction; therefore, S601 and S602 are represented by dotted lines in the figure.
  • the first audio device receives the first sound signal.
  • the first sound signal includes the wake-up sentence issued by the wake-up source, and of course the sound from the second audio device.
  • the second audio device receives the second sound signal.
  • the second sound signal includes the wake-up sentence issued by the wake-up source, and of course also includes the sound from the first audio device.
  • the first audio device performs wake-up recognition on the first sound signal to obtain a first recognition result, where the first recognition result includes a first confidence level.
  • the second audio device performs wake-up recognition on the second sound signal to obtain a second recognition result, where the second recognition result includes a second confidence level.
  • the second audio device sends the second identification result to the first audio device.
  • S603 to S607 are the same as the implementation principles of S501 to S505 in FIG. 5A , and are not repeated here.
  • the first audio device suppresses the sound located in the direction in the first sound signal.
  • the first sound signal includes the sound signal of the wake-up source, and also includes the stronger sound signal from the noise source (ie, the second audio device).
  • the suppressed sound signal obtained by suppressing the sound located in that direction in the first sound signal still includes the sound signal of the wake-up source, and may also include a weaker sound signal from the noise source; compared with the original sound signal (that is, the first sound signal), it highlights the sound of the wake-up source.
  • for example, the first audio device may suppress sounds located at 30 degrees west of north in the first sound signal.
  • optionally, the first audio device may also determine an angle range based on the direction of the second audio device, and suppress the sound within this angle range in the first sound signal.
  • the suppression principle may be as follows: the first sound signal is a superposition of sounds from multiple directions.
  • for example, the first sound signal satisfies: A × (sound signal 1) + B × (other sound signals), where A is the first weight, B is the second weight, and A + B = 1; sound signal 1 is the sound signal located in the direction of the second audio device, and the other sound signals include the sound signal of the wake-up source.
  • the suppressed sound signal satisfies: C × (sound signal 1) + D × (other sound signals), where C is the third weight, D is the fourth weight, and C + D = 1; C is smaller than A, so the weight of the sound from the direction of the second audio device is reduced.
  • the foregoing suppression process may be performed in real time. For example, after determining the direction, the first audio device suppresses each received sound signal.
  • optionally, the first audio device may monitor the receiving moment of the sound signal of the wake-up source in real time. When the time interval between receiving moments is less than the preset value (for example, someone is chatting in the environment), the device may remain in the monitoring state without performing suppression; when the time interval is greater than the preset value, the device starts to suppress the collected sound signal.
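The weighted-superposition suppression described above can be sketched as follows. This is a minimal illustration, assuming simple per-sample mixing weights; the function name `suppress_direction` and the default values of `a` and `c` are assumptions for illustration, not from the source.

```python
def suppress_direction(sound_dir, sound_other, a=0.5, c=0.1):
    """Sketch of the weighted suppression described above (names assumed).

    The received first sound signal is modeled as a*sound_dir + b*sound_other,
    with a + b = 1, where sound_dir is the component arriving from the
    direction of the second audio device and sound_other contains the sound
    of the wake-up source. The suppressed signal uses weights c + d = 1 with
    c < a, so the component from the second audio device is weakened and the
    wake-up source is relatively highlighted.
    """
    b, d = 1.0 - a, 1.0 - c
    original = [a * x + b * y for x, y in zip(sound_dir, sound_other)]
    suppressed = [c * x + d * y for x, y in zip(sound_dir, sound_other)]
    return original, suppressed
```

With a = 0.5 and c = 0.1, the weight of the noise component drops from 0.5 to 0.1 while the weights in each mix still sum to 1.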
  • the first audio device performs wake-up recognition on the suppressed first sound signal to obtain a third recognition result, where the third recognition result includes a third confidence level.
  • the first audio device determines whether to wake up according to the first identification result, the second identification result and the third identification result.
  • the first recognition result includes a first confidence level
  • the second recognition result includes a second confidence level
  • the third recognition result includes a third confidence level
  • the first audio device judges, based on the first confidence level, the second confidence level, and the third confidence level, whether to wake up; if yes, S611 is executed, otherwise it may not respond.
  • for example, the first audio device determines that it needs to wake up when at least one of the following conditions is met:
  • the third confidence level is greater than the fourth threshold, and/or the first confidence level is greater than the first threshold, and/or the second confidence level is greater than the second threshold;
  • the average (or weighted average) of the first confidence level and the second confidence level is greater than the third threshold, and/or the average (or weighted average) of the third confidence level and the first confidence level is greater than the fifth threshold, and/or the average (or weighted average) of the third confidence level and the second confidence level is greater than the sixth threshold, and/or the average (or weighted average) of the third, first, and second confidence levels is greater than the seventh threshold.
  • the first to seventh thresholds may be preset, or set or adjusted according to user needs.
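The wake-up conditions in S610 can be expressed as a small decision function. This is a hedged sketch: the function name and the tuple of seven thresholds are assumptions for illustration, plain averages are used where the text also permits weighted averages, and the concrete threshold values are left to the caller, as the text leaves them open.

```python
def should_wake(conf1, conf2, conf3, thresholds):
    """Return True if any of the conditions listed for S610 holds.

    conf1..conf3 are the first to third confidence levels; thresholds is
    (t1, ..., t7), standing for the first to seventh thresholds in the text.
    """
    t1, t2, t3, t4, t5, t6, t7 = thresholds
    if conf3 > t4 or conf1 > t1 or conf2 > t2:           # single-confidence checks
        return True
    if (conf1 + conf2) / 2 > t3:                         # pairwise averages
        return True
    if (conf3 + conf1) / 2 > t5 or (conf3 + conf2) / 2 > t6:
        return True
    return (conf1 + conf2 + conf3) / 3 > t7              # three-way average
```

Any single satisfied condition suffices, matching the "and/or" combinations in the text.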
  • the first audio device sends a wake-up response.
  • S611 is not a necessary step; that is, in an embodiment, the process only includes S601-S610, and does not include S611.
  • S605 in FIG. 5B may not be executed, so it is represented by a dotted line in the figure. If S605 is not performed, that is, the first audio device does not need to perform wake-up recognition on the first sound signal to obtain the first recognition result, then in S610, it may be determined whether wake-up is required only based on the second recognition result and the third recognition result.
  • the second identification result includes a second confidence level
  • the third identification result includes a third confidence level.
  • for example, if the third confidence level is greater than the fourth threshold, and/or the second confidence level is greater than the second threshold, and/or the average or weighted average of the third confidence level and the second confidence level is greater than the sixth threshold, the device wakes up.
  • S606 and S607 in FIG. 5B may not be executed, so they are represented by dotted lines in the figure. If S606 and S607 are not executed, that is, the second audio device does not need to perform confidence calculation, and the requirements for the second audio device are low. In this case, if S605 is executed, then in S610, it can be determined whether it is necessary to wake up based on the first identification result and the third identification result.
  • the first recognition result includes a first confidence level, and the third recognition result includes a third confidence level.
  • for example, if the third confidence level is greater than the fourth threshold, and/or the first confidence level is greater than the first threshold, and/or the average or weighted average of the third confidence level and the first confidence level is greater than the fifth threshold, the device wakes up. If S605 is not executed, then in S610 it may be judged whether to wake up based only on the third recognition result; for example, if the third confidence level is greater than the fourth threshold, the device wakes up.
  • the embodiment of FIG. 5B is described by taking the first audio device suppressing the sound from the second audio device as an example. It can be understood that the second audio device can also suppress the sound from the first audio device, then perform wake-up recognition on the suppressed sound, and send the recognition result to the first audio device for comprehensive judgment.
  • the principle of suppressing the sound from the first audio device by the second audio device is the same as the principle of suppressing the sound from the second audio device by the first audio device, so it is not repeated here.
  • the first audio device may use the first mode or the second mode by default, or a switch button may be set on the first audio device, and the switch button may be used to switch between the first mode and the second mode.
  • in the first mode, the first audio device does not need to suppress the first sound signal.
  • in the second mode, the first audio device needs to suppress the sound signal in the direction of the second audio device in the received first sound signal.
  • however, if the wake-up source and the second audio device are in the same direction, suppressing the sound in that direction would also suppress the sound of the wake-up source.
  • in this case, a third mode can be used. Specifically, referring to FIG. 5D, which is a schematic flowchart of the third mode, the process includes:
  • the first audio device receives a third sound signal from the second audio device.
  • the first audio device determines the direction of the second audio device relative to the first audio device according to the third sound signal.
  • the first audio device receives the first sound signal.
  • the first sound signal includes the wake-up sentence issued by the wake-up source, and of course the sound from the second audio device.
  • the first audio device calculates the direction of the wake-up source relative to the first audio device according to the first sound signal.
  • the first audio device can calculate the direction of the wake-up source based on the sound of the wake-up source in the first sound signal; for the calculation process, refer to the terminology section above.
  • the first audio device determines whether the wake-up source and the second audio device are in the same direction.
  • the first audio device determines whether the wake-up source and the second audio device are in the same direction, and if so, S806 is performed, otherwise, S807 is performed.
  • if yes, the first audio device uses the first mode to perform wake-up recognition.
  • otherwise, the first audio device uses the first mode or the second mode to perform wake-up recognition.
  • the first mode does not suppress the sound from the direction of the second audio device, so when the wake-up source and the second audio device are in the same direction, the first mode can be used.
  • when they are in different directions, either the second mode or the first mode can be used.
  • the first audio device may use the first mode, the second mode, or the third mode by default, or a switch button is set on the first audio device, and switching among the three modes is realized by the switch button.
  • when the first audio device performs wake-up recognition, it does not rely solely on itself; instead, the first audio device and the second audio device cooperate to perform wake-up recognition, which improves the accuracy of wake-up recognition.
  • the first audio device may include multiple wake-up strategies.
  • the first wake-up strategy is the wake-up strategy in the prior art, that is, the first audio device performs wake-up recognition based only on the information detected by itself, without referring to the information of the second audio device.
  • the second wake-up strategy refers to the cooperative wake-up of the first audio device and the second audio device in the first embodiment.
  • the first audio device may provide a wake-up strategy switch button, and the switch between the first wake-up strategy and the second wake-up strategy can be realized through the wake-up strategy switch button. Under the first wake-up strategy, the accuracy of identifying the wake-up source by the first audio device is low.
  • for example, under the first wake-up strategy, when there is a lot of noise in the environment, the user sends a wake-up command, but the device cannot be woken up for a long time.
  • under the second wake-up strategy, the accuracy of identifying the wake-up source by the first audio device is high.
  • the first audio device and/or the second audio device may include a non-awake state (or may also be referred to as a sleep state), a pre-wake phase and a wake-up phase.
  • in the non-awake state, the voice command recognition module in the first audio device is in the off state, but the sound collection module (eg, the microphone) is enabled and can collect sound.
  • the pre-wake-up stage may be a stage before entering the wake-up stage, and may be considered as a stage in which wake-up is initially determined.
  • the wake-up stage may be a state in which the voice recognition module in the first audio device has been woken up, and can perform voice command recognition.
  • the three-stage handover process is described below.
  • when the first audio device is in the non-wake-up state, it may enter the pre-wake-up stage when it determines that certain conditions are met. For example, taking the process shown in FIG. 5A as an example: in the non-awake state, the first audio device receives the first sound signal and recognizes it to obtain a first recognition result, and the first recognition result includes the first confidence level. If the first confidence level is greater than the first threshold, the first audio device enters the pre-wake-up stage from the non-awake state.
  • the first audio device preliminarily determines that it needs to be woken up, but further judgment needs to be made in combination with the information of the second audio device, so the first audio device can first enter the pre-wake-up stage. After entering the pre-wake-up phase, the first audio device may do some preparations for entering the wake-up phase. For example, power on the voice command recognition module in preparation for starting the voice command recognition module. If it is determined that wake-up is required in combination with the information of the second audio device, the first audio device enters the wake-up stage from the pre-wake-up stage. For example, start the voice command recognition module. If it is determined not to wake up in combination with the information of the second audio device, the first audio device returns from the pre-wake-up stage to the non-wake-up stage. At this point, the power-on of the voice command recognition module can be stopped.
  • the second audio device receives the second sound signal, and recognizes the second sound signal to obtain the second recognition result.
  • the second recognition result includes a second confidence level.
  • if the second confidence level is greater than the second threshold, the second audio device enters the pre-wake-up stage from the non-awake state. That is, it is initially determined that the second audio device needs to be awakened, but a further judgment needs to be made in combination with the information of the first audio device, so the second audio device first enters the pre-wake-up stage. If it is further determined based on the information of the first audio device that wake-up is indeed required, the second audio device enters the wake-up stage from the pre-wake-up stage; otherwise, it exits the pre-wake-up stage.
  • the duration during which the first audio device or the second audio device remains in the pre-wake-up stage may be a first preset duration (which may be set by the user or set by default before the device leaves the factory). If it is not determined whether wake-up is required within the first preset duration, the pre-wake-up stage is exited.
  • the duration of the wake-up phase may be a second preset duration, and if no voice command is recognized within the second preset duration, the wake-up phase is exited.
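The handover among the three stages can be sketched as a small state machine. The state and function names below are illustrative; the threshold `t1` stands in for "the first confidence level is greater than the first threshold", and the boolean `peer_confirms` stands in for "the peer device's information confirms wake-up". The timeout behavior of the two preset durations is omitted for brevity.

```python
from enum import Enum, auto

class WakeState(Enum):
    SLEEP = auto()     # non-awake: microphone on, voice command recognition off
    PRE_WAKE = auto()  # own confidence passed the threshold; awaiting the peer
    AWAKE = auto()     # voice command recognition module started

def step(state, local_conf, peer_confirms, t1=0.8):
    """One transition of the three-stage handover described above."""
    if state is WakeState.SLEEP:
        # SLEEP -> PRE_WAKE when the device's own confidence exceeds t1
        return WakeState.PRE_WAKE if local_conf > t1 else WakeState.SLEEP
    if state is WakeState.PRE_WAKE:
        # PRE_WAKE -> AWAKE if the peer confirms; otherwise back to SLEEP
        return WakeState.AWAKE if peer_confirms else WakeState.SLEEP
    return state  # remain awake until the wake-up phase times out
```

During PRE_WAKE, the device can power on the voice command recognition module in preparation, as described above, so that entering AWAKE is fast.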
  • the above takes the case in which the first audio device and/or the second audio device includes three stages (the non-wake-up state, the pre-wake-up stage, and the wake-up stage) as an example. Optionally, only two stages may be included, for example, the non-wake-up state and the wake-up stage; this is not limited in the embodiments of the present application.
  • the first identification result includes the first confidence level
  • the second identification result includes the second confidence level
  • the first audio device can make a wake-up judgment according to the first confidence level and the second confidence level.
  • the first recognition result obtained by the first audio device may also include a first angle, where the first angle describes the direction of the wake-up source relative to the first audio device.
  • the second recognition result obtained by the second audio device may further include a second angle, where the second angle describes the direction of the wake-up source relative to the second audio device.
  • the first audio device can not only judge whether to wake up according to the first confidence level in the first recognition result and the second confidence level in the second recognition result, but can also determine the direction in which the wake-up source is located according to the first angle in the first recognition result and the second angle in the second recognition result.
  • two audio devices cooperate to determine the direction of the wake-up source.
  • this method can determine a more accurate direction of the wake-up source.
  • in this way, voice command recognition can be performed on the sound signal in the direction of the wake-up source, without recognizing voice commands in sound from all directions, which not only improves voice recognition efficiency but also improves the accuracy of voice command recognition.
  • the second embodiment continues to take the application scenario shown in FIG. 4 as an example, that is, only the two audio devices and the wake-up source emit sound, and the two audio devices are located on a horizontal line (the horizontal dotted line in FIG. 4).
  • FIG. 6A is a schematic flowchart of the wake-up identification method provided in the second embodiment.
  • the process includes:
  • the first audio device receives a third sound signal from the second audio device.
  • the first audio device determines the distance between the two audio devices according to the third sound signal (for convenience of description, the distance is referred to as the device distance), for example, the distance is D.
  • S901 and S902 may not be performed.
  • the first audio device may also use other methods for distance measurement, such as laser distance measurement, manual input of distance by the user, etc., so S901 and S902 are represented by dotted lines in the figure.
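As one concrete possibility for the distance measurement in S902 (the text leaves the method open, also mentioning laser ranging and manual input), the device distance could be derived from the third sound signal's time of flight. Both parameters below are assumptions for illustration: the second audio device would have to report its emission timestamp, and the two devices would need synchronized clocks.

```python
SPEED_OF_SOUND = 343.0  # m/s in air at roughly 20 degrees Celsius

def device_distance(emit_time_s, receive_time_s, speed=SPEED_OF_SOUND):
    """Hypothetical time-of-flight estimate of the device distance D.

    emit_time_s is when the second audio device emitted the third sound
    signal; receive_time_s is when the first audio device received it.
    """
    return (receive_time_s - emit_time_s) * speed
```

For example, a 10 ms flight time corresponds to roughly 3.43 m between the devices.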
  • the first audio device receives the first sound signal.
  • the first sound signal includes the wake-up sentence issued by the wake-up source, and of course the sound from the second audio device.
  • the second audio device receives the second sound signal.
  • the second sound signal includes the wake-up sentence issued by the wake-up source, and of course also includes the sound from the first audio device.
  • the first audio device performs wake-up recognition on the first sound signal to obtain a first recognition result.
  • the first recognition result includes a first confidence level and a first angle
  • the first confidence level is a first probability that the first sound signal includes a wake-up sentence
  • the first angle describes the orientation of the wake-up source relative to the first audio device.
  • the second audio device performs wake-up recognition on the second sound signal to obtain a second recognition result.
  • the second recognition result includes a second confidence level and a second angle
  • the second confidence level is a second probability that the second sound signal includes a wake-up sentence
  • the second angle describes the orientation of the wake-up source relative to the second audio device.
  • the second audio device sends the second identification result to the first audio device.
  • the first audio device determines whether to wake up according to the first confidence level and the second confidence level. If necessary, go to S909.
  • the first audio device determines the direction in which the wake-up source is located according to the first angle and the second angle.
  • for example, if the first confidence level is greater than the second confidence level, the first angle is determined as the direction of the wake-up source; if the second confidence level is greater than the first confidence level, the second angle is determined as the direction of the wake-up source.
  • optionally, the second recognition result further includes a second distance, where the second distance is the distance from the wake-up source to the second audio device. In this case, the device distance D, the second angle, and the second distance included in the second recognition result are first used to predict a third angle of the wake-up source relative to the first audio device; then, the final angle of the wake-up source is determined according to the predicted third angle and the first angle of the wake-up source relative to the first audio device that is actually detected (i.e., in S905).
  • the second angle and the second distance are known, and the device distance between the first audio device and the second audio device is also known, so a triangle can be constructed from the device distance D, the second angle, and the second distance (by the side-angle-side principle; this process does not need to use the first angle). Based on trigonometric relationships, the third angle in the triangle can be determined. Taking FIG. 6B as an example, the constructed triangle satisfies the following relationships:
  • h = S cs × sin(β), where β is the second angle, S cs is the second distance, and h is the distance from the wake-up source to the line connecting the centroids of the two audio devices;
  • P = S cs × cos(β), where P is the distance from the second audio device to the straight line on which h lies, and D is the distance (device distance) between the first audio device and the second audio device, so that D − P is the corresponding distance from the first audio device;
  • θ′ = arctan(h / (D − P)), where θ′ is the third angle.
  • FIG. 6B takes the case where the second angle β is an acute angle as an example.
  • the second angle β may also be a right angle or an obtuse angle.
  • in those cases, the third angle θ′ satisfies a corresponding trigonometric relationship, and the first audio device can select the appropriate calculation method according to the size of the second angle. For example, when the second angle is acute, the calculation method shown in FIG. 6B is used; when the second angle is obtuse, the calculation method shown in FIG. 6C is used; and when the second angle is a right angle, the calculation method shown in FIG. 6D is used.
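The triangle construction of FIGS. 6B to 6D, together with the Mode 1 correction described below, can be sketched numerically. Using `atan2` covers the acute, right, and obtuse cases of the second angle in one expression; this is an implementation convenience rather than the figure-by-figure case split in the text, and the function names are illustrative.

```python
import math

def third_angle_deg(D, beta_deg, s_cs):
    """Predict the third angle of the wake-up source relative to the first
    audio device from the device distance D, the second angle beta
    (degrees), and the second distance s_cs (symbols as in the text).
    """
    beta = math.radians(beta_deg)
    h = s_cs * math.sin(beta)  # distance from the wake-up source to the
                               # line connecting the two audio devices
    p = s_cs * math.cos(beta)  # signed offset of the foot of the
                               # perpendicular; <= 0 when beta >= 90 degrees
    return math.degrees(math.atan2(h, D - p))

def corrected_angle(first_angle_deg, predicted_third_angle_deg):
    """Mode 1 correction: the final angle of the wake-up source is the
    average of the detected first angle and the predicted third angle."""
    return (first_angle_deg + predicted_third_angle_deg) / 2.0
```

For instance, with D = 1, a second angle of 60 degrees, and a second distance of 1, the construction yields a third angle of 60 degrees (an equilateral configuration).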
  • the above describes the process in which the two audio devices cooperatively predict the third angle of the wake-up source relative to the first audio device.
  • next, the predicted third angle of the wake-up source relative to the first audio device and the actually detected first angle of the wake-up source relative to the first audio device (S905) can be used to determine the final angle of the wake-up source.
  • the first audio device corrects the first angle according to the third angle, and the corrected angle is the final angle of the wake-up source.
  • the correction angle can be determined in at least one of the following ways:
  • Mode 1: the average value of the third angle θ′ and the first angle θ is taken, and the average value is the corrected angle.
  • Mode 2: a weighted average of the third angle θ′ and the first angle θ is taken, and the weighted average is the corrected angle (this completion of Mode 2 follows the average/weighted-average pattern used elsewhere in the text).
  • optionally, the included angle θ can also be corrected, for example, using the above-mentioned Mode 1 or Mode 2.
  • the specific correction method may be the above-mentioned Mode 1 or Mode 2.
  • the first threshold and the second threshold may be preset values, and their values are not limited in the embodiments of the present application.
  • for example, the first threshold is 2 degrees, 3 degrees, 5 degrees, etc.
  • the second threshold is 2 degrees, 3 degrees, 5 degrees, etc.; the two may be equal or unequal.
  • the voice command can be recognized for the sound signal in the direction of the wake-up source, and it is not necessary to recognize the voice command of the sound signal in all directions, which improves the efficiency.
  • the recognition process of the voice command refer to S910 to S914 in FIG. 6A .
  • the first audio device receives a fourth sound signal.
  • the fourth sound signal includes a voice command.
  • the first audio device suppresses the sound located in the direction of the non-awakening source in the fourth sound signal.
  • the direction in which the non-wake-up source is located may refer to any direction other than the direction in which the wake-up source is located. Assuming that the direction of the wake-up source is 30 degrees west of north, sound at angles other than 30 degrees west of north in the fourth sound signal can be suppressed.
  • optionally, the first audio device may also determine an angle range based on the direction of the wake-up source (eg, 30 degrees west of north), and suppress the sound information in the fourth sound signal outside this angle range.
  • S911 is an optional step, and the process of the second embodiment may include S911 or may not include S911.
  • the first audio device performs voice instruction recognition on the sound in the direction of the wake-up source in the fourth sound signal.
  • for example, the first audio device can perform voice command recognition on the sound signal located at 30 degrees west of north in the fourth sound signal.
  • optionally, the first audio device may also determine an angle range based on the direction of the wake-up source (eg, 30 degrees west of north). For example, 30 degrees minus threshold value 1 is used as the minimum value min of the angle range, and 30 degrees plus threshold value 2 is used as the maximum value max of the angle range; the angle range is then the interval (min, max).
  • for example, if threshold value 1 is 5 degrees and threshold value 2 is 10 degrees, the angle range is the interval (25 degrees west of north, 40 degrees west of north).
  • the first audio device recognizes the voice command for the sound in the fourth sound signal located within the angle range. In this way, the first audio device does not need to perform voice command recognition on sounds from all angles, thereby saving power consumption.
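The angle window of S912 can be sketched as a simple filter. The detection format (angle, audio chunk) and the function names below are assumptions; 5 and 10 degrees are the example threshold values from the text.

```python
def in_wake_range(angle_deg, source_deg, below=5.0, above=10.0):
    """True if a sound's direction of arrival falls inside the window
    (source - below, source + above) around the wake-up source."""
    return source_deg - below < angle_deg < source_deg + above

def sounds_to_recognize(detections, source_deg):
    """Keep only the audio chunks whose direction lies in the window, so
    voice command recognition is not run on sound from all angles."""
    return [chunk for ang, chunk in detections
            if in_wake_range(ang, source_deg)]
```

With a wake-up source at 30 degrees, the window is (25, 40) degrees, so sound arriving from 50 degrees would be skipped, saving the power of recognizing it.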
  • the first audio device executes the voice command.
  • the voice command is to switch to the next song
  • the first audio device switches to the next song.
  • the first audio device sends a voice command to the second audio device to control the second audio device to execute the voice command.
  • the first audio device sends a pre-wake event to the second audio device, and the pre-wake event is the second recognition result.
  • optionally, the first audio device reports the first recognition result and the second recognition result to the cloud, and the cloud determines whether to wake up according to the first recognition result and the second recognition result. If wake-up is needed, a wake-up command is sent to the first audio device to wake it up, and then the first audio device sends a wake-up command to the second audio device. In this way, the computational burden on the first audio device can be relieved.
  • the cloud may also send prompt information to the user's mobile terminal to prompt the user whether to confirm that the first audio device is to be woken up.
  • if a confirmation operation is detected on the mobile terminal, a confirmation instruction is sent to the cloud to confirm waking up the first audio device.
  • after the cloud receives the confirmation instruction, it sends a wake-up command to the first audio device, so as to avoid false wake-up.
  • the first audio device may be the primary audio device and the second audio device may be the secondary audio device; alternatively, the second audio device may be the primary audio device and the first audio device may be the secondary audio device.
  • the main audio device can be determined in at least one of the following ways:
  • the primary audio device may be one or more preset audio devices (for example, set by default before leaving the factory), and the remaining audio devices among the N audio devices are secondary audio devices.
  • the primary audio device may be user-specified.
  • for example, each of the N audio devices detects an input operation on its display screen, and the input operation is used to select whether that audio device is a primary audio device or a secondary audio device.
  • the primary audio device may be the highest performing audio device among the N audio devices.
  • the audio device with the strongest performance may include the audio device with the strongest processor performance, for example, the processor with the fastest computing speed and the like.
  • the primary audio device and the secondary audio device may have the same or different structures.
  • the primary audio device has a display screen, while the secondary audio device does not have a display screen.
  • the primary audio device and the secondary audio device may be responsible for different functions.
  • the primary audio device plays the sound information of the left channel of the audio file
  • the secondary audio device plays the sound information of the right channel of the audio file.
  • the above description takes N = 2 as an example. It can be understood that more audio devices may be included in the audio device combination; when N is another value, a similar principle can be adopted, and details are not repeated here.
  • the two audio devices cooperate to perform wake-up recognition, which improves the accuracy of wake-up recognition.
  • the two audio devices cooperate to determine the direction of the wake-up source, which improves the accuracy of the determined direction of the wake-up source.
  • voice command recognition can be performed on the sound signal located in the direction of the wake-up source to improve the accuracy of voice command recognition.
  • the terminal device may include a hardware structure and/or software modules, and implement the above functions in the form of a hardware structure, a software module, or a hardware structure plus a software module. Whether one of the above functions is performed in the form of a hardware structure, a software module, or a hardware structure plus a software module depends on the specific application and design constraints of the technical solution.
  • the audio device is, for example, an electronic device such as a speaker, a mobile phone, a tablet computer, a notebook computer, and a desktop computer.
  • the audio device may include: one or more processors 801; a microphone 802, a speaker 803, and a memory 804, wherein the memory 804 includes one or more computer programs 805.
  • the various devices described above may be connected by one or more communication buses 806 .
  • the microphone 802 can be a microphone array, which is used to receive sound signals, and can also be used to determine the direction, distance, etc. (see the above-mentioned part of terminology).
  • the speaker 803 is used to play sound signals.
  • the one or more computer programs 805 are stored in the above-mentioned memory 804 and configured to be executed by the one or more processors 801, and the one or more computer programs 805 include instructions that can be used to execute the various steps in FIGS. 5A to 6A and the corresponding embodiments.
  • the audio device shown in FIG. 8 may be the first audio device or the second audio device above.
  • the audio device shown in FIG. 8 is the first audio device, it can be used to perform the above steps related to the first audio device.
  • the audio device shown in FIG. 8 is the second audio device, it can be used to perform the relevant steps of the second audio device above.
  • the above-mentioned embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof.
  • when software is used, they may be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, all or part of the processes or functions described in the embodiments of the present application are generated.
  • the computer may be a general purpose computer, special purpose computer, computer network, or other programmable device.
  • the computer instructions may be stored in or transmitted from one computer-readable storage medium to another computer-readable storage medium, for example, the computer instructions may be downloaded from a website site, computer, server, or data center Transmission to another website site, computer, server, or data center is by wire (eg, coaxial cable, fiber optic, digital subscriber line (DSL)) or wireless (eg, infrared, wireless, microwave, etc.).
  • the computer-readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server, data center, etc. that includes an integration of one or more available media.
  • the usable media may be magnetic media (eg, floppy disks, hard disks, magnetic tapes), optical media (eg, DVD), or semiconductor media (eg, Solid State Disk (SSD)), and the like.
  • Computer-readable media includes both computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another.
  • a storage medium can be any available medium that can be accessed by a general purpose or special purpose computer.

Abstract

The present application relates to a wake-up recognition method, an audio device, and an audio device group. The audio device group includes a first audio device and a second audio device, both of which include a microphone array. The first audio device receives a first sound signal, which includes a sound signal emitted by a wake-up source; the first audio device performs wake-up recognition on the first sound signal to obtain a first recognition result. The second audio device receives a second sound signal, which includes the sound signal emitted by the wake-up source; the second audio device performs wake-up recognition on the second sound signal to obtain a second recognition result, and sends the second recognition result to the first audio device. The first audio device determines whether to wake up based on the first recognition result and the second recognition result. In this way, the first audio device can combine the wake-up recognition results of the two audio devices to make a wake-up judgment, improving the accuracy of wake-up recognition.

Description

一种唤醒识别方法、音频装置以及音频装置组
相关申请的交叉引用
本申请要求在2020年08月31日提交中国专利局、申请号为202010893958.3、申请名称为“一种唤醒识别方法与音频装置”的中国专利申请的优先权,其全部内容通过引用结合在本申请中;本申请要求在2020年12月23日提交中国专利局、申请号为202011556351.2、申请名称为“一种唤醒识别方法、音频装置以及音频装置组”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及终端技术领域,尤其涉及一种唤醒识别方法、音频装置以及音频装置组。
背景技术
为了提升音频播放效果,两个以上的音频装置的组合逐渐成为趋势。以音箱组合为例,用户在家里(如客厅)内使用音箱组合播放音乐。相对于单个音箱播放音乐而言,音箱组合播放音乐可以给用户带来更极致的听觉体验。
然而,在使用音箱组合的过程中,存在一些体验较差的场景:比如,音箱组合播放音乐的过程中,环境中声音较大,此时若用户发出语音唤醒词或语音指令,音箱组合采集到的声音信息包括环境中所有的声音信息。比如,音箱组合采集到的声音信息不仅包括用户发出的声音,还包括音箱组合产生的声音。这样,由于其他的声音信息的干扰,用户发出的语音唤醒词或语音指令无法准确地被音箱组合识别。
发明内容
本申请的目的在于提供了一种唤醒识别方法、音频装置以及音频装置组,有助于提升唤醒识别的准确性。
第一方面,提供一种唤醒识别方法。该方法适用于音频装置组,音频装置组包括第一音频装置和第二音频装置,第一音频装置与第二音频装置能够通信,第一音频装置包括第一麦克风阵列,第二音频装置包括第二麦克风阵列。该方法包括:第一音频装置接收第一声音信号,第一声音信号中包括唤醒源发出的声音信号;第一音频装置对第一声音信号进行唤醒识别得到第一识别结果;第二音频装置接收第二声音信号,第二声音信号包括唤醒源发出的声音信号;第二音频装置对第二声音信号进行唤醒识别,得到第二识别结果;第二音频装置向第一音频装置发送第二识别结果;第一音频装置基于第一识别结果和第二识别结果,确定是否唤醒音频装置组。本申请提供的唤醒识别方法是两个音频装置协作进行唤醒识别,提升唤醒识别的准确性。
在一种可能的设计中,第一识别结果包括第一概率,第一概率为第一声音信号包括唤醒信息的概率;第二识别结果包括第二概率,第二概率为第二声音信号包括唤醒信息的概率;第一音频装置基于第一识别结果和第二识别结果,确定是否唤醒音频装置组,包括:若第一概率大于第一阈值、第二概率大于第二阈值;和/或,第一概率和第二概率二者的平 均值或加权平均值大于第三阈值,确定唤醒音频装置组。也就是说,综合两个音频装置各自的唤醒识别结果进行唤醒识别,提升唤醒识别的准确性。
在一种可能的设计中,第一音频装置对第一声音信号进行唤醒识别得到第一识别结果之前,该方法还包括:第一音频装置对第一声音信号中来自第二音频装置的声音信号进行抑制;第一音频装置对第一声音信号进行唤醒识别得到第一识别结果,包括:第一音频装置对抑制后的第一声音信号进行唤醒识别得到第一识别结果。也就是说,第一音频装置可以对非唤醒源所在方向的声音(如来自第二音频装置的声音)进行抑制,突出唤醒源的声音,提升唤醒识别的准确性。
在一种可能的设计中,第二音频装置对第二声音信号进行唤醒识别得到第二识别结果之前,该方法还包括:第二音频装置对第二声音信号中来自第一音频装置的声音信号进行抑制;第二音频装置对第二声音信号进行唤醒识别得到第二识别结果,包括:第二音频装置对抑制后的第二声音信号进行唤醒识别得到第二识别结果。也就是说,第二音频装置可以对非唤醒源所在方向的声音(如来自第一音频装置的声音)进行抑制,突出唤醒源的声音,提升唤醒识别的准确性。
在一种可能的设计中,第一音频装置对第一声音信号中来自第二音频装置的声音信号进行抑制之前,该方法还包括:第一音频装置确定第二音频装置和唤醒源处于不同方向;第二音频装置对第二声音信号中来自第一音频装置的声音信号进行抑制之前,该方法还包括:第二音频装置确定第一音频装置和唤醒源处于不同方向。也就是说,第一音频装置抑制来自第二音频装置的声音之前,判断第二音频装置和唤醒源是否在同一方向,如果不是,则抑制来自第二音频装置的声音,避免第二音频装置和唤醒源处于同一方向的情况下,在抑制第二音频装置所在方向的声音信号时将唤醒源的声音一并抑制了。
在一种可能的设计中,第一识别结果还包括第一角度,第一角度为唤醒源相对于第一音频装置的方向;第二识别结果还包括第二角度,第二角度为唤醒源相对于第二音频装置的方向;该方法还包括:第一音频装置基于第一角度和第二角度,确定唤醒源所在方向。本申请提供的唤醒识别方法是两个音频装置协作进行唤醒所在方向的确定,能够确定较为准确的唤醒源所在方向。
在一种可能的设计中,第一音频装置基于第一角度和第二角度,确定唤醒源所在方向,包括:当第一概率大于第二概率时,确定第一角度为唤醒源所在方向;当第二概率大于第一概率时,确定第二角度为唤醒源所在方向。也就是说,两个音频装置协作进行唤醒所在方向的确定,能够确定较为准确的唤醒源所在方向。
在一种可能的设计中,第二识别结果还包括第二距离,第二距离为唤醒源相对于第二音频装置的距离;第一音频装置基于第一角度和第二角度,确定唤醒源所在方向,包括:第一音频装置利用第一距离、第二角度和第二距离,预测唤醒源相对于第一音频装置的第三角度;其中,第一距离为第一音频装置和第二音频装置之间的距离;第一音频装置根据第三角度和第一角度,确定唤醒源所在方向。也就是说,两个音频装置协作进行唤醒所在方向的确定,能够确定较为准确的唤醒源所在方向。
在一种可能的设计中,第三角度满足:
α′=arctan((sinβ×S c-s)/(D-(cosβ×S c-s)))
其中,β是第二角度;S c-s为第二距离,D为第一距离,α′是第三角度。
在一种可能的设计中,第一音频装置根据第三角度和第一角度,确定唤醒源所在方向,包括:确定第三角度和第一角度的平均值或加权平均值为唤醒源所在方向;或者,
唤醒源所在方向满足:α±(ω i×Δ),ω i是预设值,Δ=|α-α′|,α是第一角度,所述α′是第三角度。
在一种可能的设计中,在第一音频装置基于第一识别结果和第二识别结果,确定唤醒所述音频装置组之后,该方法还包括:第一音频装置接收到第三声音信号,第三声音信号包括语音指令;第一音频装置对第三声音信号中位于唤醒源所在方向的声音信号进行识别得到语音指令,该语音指令用于控制第一音频装置。由于两个音频装置协作进行唤醒所在方向的确定,确定出的唤醒源所在方向较为准确,所以在进行语音指令识别时,可以对唤醒源所在方向进行语音指令的识别,提升语音指令识别的准确性。
在一种可能的设计中,在第一音频装置对第三声音信号中位于唤醒源所在方向的声音信号进行识别得到语音指令之前,该方法还包括:第一音频装置对第三声音信号中处于非唤醒源方向的声音信号进行抑制,非唤醒源所在方向为除了唤醒源所在方向之外的其它方向;第一音频装置对第三声音信号中位于唤醒源所在方向的声音信号进行识别得到语音指令,包括:第一音频装置对经过抑制后的第三声音信息中位于唤醒源所在方向的声音信号进行识别得到语音指令。也就是说,在语音指令识别时,第一音频装置可以对非唤醒源所在方向的声音(如来自第二音频装置的声音)进行抑制,突出唤醒源的声音,提升语音指令识别的准确性。
第二方面,提供一种唤醒识别方法。该方法适用于第一音频装置,该方法包括:第一音频装置接收第一声音信号,第一声音信号包括唤醒源发出的声音信号;第一音频装置对第一声音信号进行唤醒识别得到第一识别结果;第一音频装置接收来自第二音频装置的第二识别结果;第二识别结果是第二音频装置对接收的第二声音信号进行唤醒识别得到识别结果;第一音频装置基于第一识别结果和第二识别结果,确定是否唤醒所述音频装置组。
在一种可能的设计中,第一识别结果包括第一概率,第一概率为第一声音信号包括唤醒信息的概率;第二识别结果包括第二概率,第二概率为第二声音信号包括唤醒信息的概率;
第一音频装置基于第一识别结果和第二识别结果,确定是否唤醒音频装置组,包括:若第一概率大于第一阈值、第二概率大于第二阈值;和/或,第一概率和第二概率二者的平均值或加权平均值大于第三阈值,确定唤醒音频装置组。
在一种可能的设计中,第一音频装置对第一声音信号进行唤醒识别得到第一识别结果之前,该方法还包括:第一音频装置对第一声音信号中来自第二音频装置的声音信号进行抑制;第一音频装置对第一声音信号进行唤醒识别得到第一识别结果,包括:第一音频装置对抑制后的第一声音信号进行唤醒识别得到第一识别结果。
在一种可能的设计中,第一音频装置对第一声音信号中来自第二音频装置的声音信号进行抑制之前,该方法还包括:第一音频装置确定第二音频装置和唤醒源处于不同方向。
在一种可能的设计中,第一识别结果还包括第一角度,第一角度为唤醒源相对于第一音频装置的方向;第二识别结果还包括第二角度,第二角度为唤醒源相对于第二音频装置的方向;该方法还包括:
第一音频装置基于第一角度和第二角度,确定唤醒源所在方向。
在一种可能的设计中,第一音频装置基于第一角度和第二角度,确定唤醒源所在方向, 包括:当第一概率大于第二概率时,确定第一角度为唤醒源所在方向;当第二概率大于第一概率时,确定第二角度为唤醒源所在方向。
在一种可能的设计中,第二识别结果还包括第二距离,第二距离为唤醒源相对于第二音频装置的距离;第一音频装置基于第一角度和第二角度,确定唤醒源所在方向,包括:第一音频装置利用第一距离、第二角度和第二距离,预测唤醒源相对于第一音频装置的第三角度;其中,第一距离为第一音频装置和第二音频装置之间的距离;第一音频装置根据第三角度和第一角度,确定唤醒源所在方向。
在一种可能的设计中,第三角度满足:
α′=arctan((sinβ×S c-s)/(D-(cosβ×S c-s)))
其中,β是第二角度;S c-s为第二距离,D为第一距离,α′是第三角度。
在一种可能的设计中,第一音频装置根据第三角度和第一角度,确定唤醒源所在方向,包括:
确定第三角度和第一角度的平均值或加权平均值为唤醒源所在方向;
或者,
唤醒源所在方向满足:α±(ω i×Δ),ω i是预设值,Δ=|α-α′|,α是第一角度,α′是第三角度。
在一种可能的设计中,在第一音频装置基于第一识别结果和第二识别结果,确定唤醒音频装置组之后,该方法还包括:
第一音频装置接收到第三声音信号,第三声音信号包括语音指令;
第一音频装置对第三声音信号中位于唤醒源所在方向的声音信号进行识别,得到语音指令,所述语音指令用于控制第一音频装置。
在一种可能的设计中,在第一音频装置对第三声音信号中位于唤醒源所在方向的声音信号进行识别,得到语音指令之前,该方法还包括:第一音频装置对第三声音信号中处于非唤醒源所在方向的声音信号进行抑制,非唤醒源所在方向为除了唤醒源所在方向之外的其它方向;第一音频装置对第三声音信号中位于唤醒源所在方向的声音信号进行识别,得到语音指令,包括:第一音频装置对经过抑制后的第三声音信息中位于唤醒源所在方向的声音信号进行识别,得到语音指令。
第三方面,还提供一种唤醒识别方法。该方法适用于第二音频装置,该方法包括:第二音频装置接收第二声音信号,第二声音信号包括唤醒源发出的声音信号;第二音频装置对第二声音信号进行唤醒识别,得到第二识别结果;第二音频装置将第二识别结果发送给第一音频装置,以使第一音频装置根据第一识别结果和第二识别结果,进行唤醒判断,第一识别结果为第一音频装置对接收到的第一声音信号进行唤醒识别得到的识别结果。
在一种可能的设计中,第二音频装置对第二声音信号进行唤醒识别,得到第二识别结果之前,该方法还包括:第二音频装置对第二声音信号中来自第一音频装置的声音信号进行抑制;第二音频装置对第二声音信号进行唤醒识别,得到第二识别结果,包括:第二音频装置对抑制后的第二声音信号进行唤醒识别,得到第二识别结果。
在一种可能的设计中,第二音频装置对第二声音信号中来自第一音频装置的声音信号进行抑制之前,该方法还包括:第二音频装置确定第一音频装置和唤醒源处于不同方向。
在一种可能的设计中,第二识别结果还包括第二角度,第二角度为唤醒源相对于第二 音频装置的方向,和/或,第二识别结果还包括第二距离,第二距离为唤醒源相对于第二音频装置的距离。
在一种可能的设计中,第二音频装置将第二识别结果发送给第一音频装置之前,该方法还包括:第二音频装置接收来自第一音频装置的查询请求,查询请求用于请求查询第二音频装置的识别结果。
第四方面,提供一种唤醒识别方法。该方法适用于第一音频装置,该方法包括:第一音频装置接收第一声音信号,第一声音信号包括唤醒源发出的声音信号;第一音频装置对第一声音信号进行唤醒识别得到第一识别结果;第一音频装置对第一声音信号中来自第二音频装置的声音进行抑制;第一音频装置对经过抑制后的第一声音信号进行唤醒识别得到第三识别结果;第一音频装置基于第一识别结果和第三识别结果,确定是否唤醒音频装置组。
在一种可能的设计中,第一音频装置对第一声音信号中来自第二音频装置的声音进行抑制之前,还包括:第一音频装置确定第二音频装置相对于第一音频装置的方向;第一音频装置对第一声音信号中来自第二音频装置的声音进行抑制,包括:第一音频装置对第一声音信号中位于所述方向的声音进行抑制。
在一种可能的设计中,第一识别结果包括第一概率,第一概率用于描述第一声音信号中包括唤醒信息的概率;第三识别结果包括第三概率,第三概率用于描述经过抑制后的第一声音信号中包括唤醒信息的概率,第一音频装置基于第一识别结果和第三识别结果,确定是否唤醒音频装置组,包括:第一概率大于第一阈值,和/或,第三概率大于第四阈值,和/或,第一概率和第三概率的平均值或加权平均值大于第五阈值时,确定唤醒音频装置组。
第五方面,提供一种唤醒源定位方法。该方法适用于音频装置组,该音频装置组包括第一音频装置和第二音频装置,第一音频装置包括第一麦克风阵列,第二音频装置包括第二麦克风阵列,第一音频装置与第二音频装置能够通信。该方法包括:第一音频装置接收第一声音信号;第一音频装置根据第一声音信号计算得到第一角度,第一角度为唤醒源相对于第一音频装置的方向;
第二音频装置接收第二声音信号;第二音频装置根据第二声音信号计算得到第二角度,第二角度为唤醒源相对于第二音频装置的方向;第二音频装置向第一音频装置发送第二角度;第一音频装置基于第一角度和第二角度,确定唤醒源所在方向。本申请提供的唤醒识别方法是两个音频装置协作进行唤醒所在方向的确定,能够确定较为准确的唤醒源所在方向。
在一种可能的设计中,第一音频装置基于第一角度和第二角度,确定唤醒源所在方向,包括:当第一概率大于第二概率时,确定第一角度为唤醒源所在方向;当第二概率大于第一概率时,确定第二角度为唤醒源所在方向;其中,第一概率为第一声音信号中包括唤醒信息的概率;第二概率为第二声音信号中包括唤醒信息的概率。
在一种可能的设计中,在第一音频装置基于第一角度和第二角度,确定唤醒源所在方向之前,所述方法还包括:第二音频装置根据第二声音信号计算得到第二距离,第二距离为唤醒源相对于第二音频装置的距离;第一音频装置基于第一角度和第二角度,确定唤醒源所在方向,包括:第一音频装置利用第一距离、第二角度和第二距离,预测唤醒源相对于第一音频装置的第三角度;其中,第一距离为第一音频装置和第二音频装置之间的距离;第一音频装置根据第三角度和第一角度,确定唤醒源所在方向。
在一种可能的设计中,第三角度满足:
α′=arctan((sinβ×S c-s)/(D-(cosβ×S c-s)))
其中,β是第二角度;S c-s为第二距离,D为第一距离,α′是第三角度。
在一种可能的设计中,第一音频装置根据第三角度和第一角度,确定唤醒源所在方向,包括:确定第三角度和第一角度的平均值或加权平均值为唤醒源所在方向;或者,唤醒源所在方向满足:α±(ω i×Δ),ω i是预设值,Δ=|α-α′|,α是第一角度,α′是第三角度。
在一种可能的设计中,该方法还包括:第一音频装置接收到第三声音信号,第三声音信号包括语音指令;第一音频装置对第三声音信号中位于唤醒源所在方向的声音信号进行识别得到语音指令,语音指令用于控制第一音频装置。
在一种可能的设计中,在第一音频装置对第三声音信号中位于唤醒源所在方向的声音信号进行识别,得到语音指令之前,该方法还包括:第一音频装置对第三声音信号中处于非唤醒源所在方向的声音信号进行抑制,非唤醒源所在方向为除了唤醒源所在方向之外的其它方向;第一音频装置对第三声音信号中位于唤醒源所在方向的声音信号进行识别得到语音指令,包括:第一音频装置对经过抑制后的第三声音信息中位于唤醒源所在方向的声音信号进行识别得到语音指令。
第六方面,提供一种唤醒源定位方法。该方法适用于第一音频装置,第一音频装置包括第一麦克风阵列,第二音频装置包括第二麦克风阵列,第一音频装置与第二音频装置能够通信。该方法包括:第一音频装置接收第一声音信号;第一音频装置根据第一声音信号计算得到第一角度,第一角度为唤醒源相对于第一音频装置的方向;第一音频装置接收来自第二音频装置的第二角度,第二角度是第二音频装置根据接收的第二声音信号计算得到的,第二角度为唤醒源相对于第二音频装置的方向;第一音频装置基于第一角度和第二角度,确定唤醒源所在方向。
在一种可能的设计中,第一音频装置基于第一角度和第二角度,确定唤醒源所在方向,包括:当第一概率大于第二概率时,确定第一角度为唤醒源所在方向;当第二概率大于第一概率时,确定第二角度为唤醒源所在方向;其中,第一概率为第一声音信号包括唤醒信息的概率;第二概率为第二声音信号包括唤醒信息的概率。
在一种可能的设计中,在第一音频装置基于第一角度和第二角度,确定唤醒源所在方向之前,该方法还包括:第二音频装置根据第二声音信号计算得到第二距离,第二距离为唤醒源相对于第二音频装置的距离;第一音频装置基于第一角度和第二角度,确定唤醒源所在方向,包括:第一音频装置利用第一距离、第二角度和第二距离,预测唤醒源相对于第一音频装置的第三角度;其中,第一距离为第一音频装置和第二音频装置之间的距离;第一音频装置根据第三角度和第一角度,确定唤醒源所在方向。
在一种可能的设计中,第三角度满足:
α′=arctan((sinβ×S c-s)/(D-(cosβ×S c-s)))
其中,β是第二角度;S c-s为第二距离,D为第一距离,α′是第三角度。
在一种可能的设计中,第一音频装置根据第三角度和第一角度,确定唤醒源所在方向,包括:确定第三角度和第一角度的平均值或加权平均值为唤醒源所在方向;或者,唤醒源所在方向满足:α±(ω i×Δ),ω i是预设值,Δ=|α-α′|,α是第一角度,α′是第三角度。
在一种可能的设计中,该方法还包括:第一音频装置接收到第三声音信号,第三声音信号包括语音指令;第一音频装置对第三声音信号中位于唤醒源所在方向的声音信号进行识别,得到语音指令,语音指令用于控制第一音频装置。
在一种可能的设计中,在第一音频装置对第三声音信号中位于唤醒源所在方向的声音信号进行识别,得到语音指令之前,该方法还包括:第一音频装置对第三声音信号中处于非唤醒源所在方向的声音信号进行抑制,非唤醒源所在方向为除了唤醒源所在方向之外的其它方向;第一音频装置对第三声音信号中位于唤醒源所在方向的声音信号进行识别得到语音指令,包括:第一音频装置对经过抑制后的第三声音信息中位于唤醒源所在方向的声音信号进行识别得到语音指令。
第七方面,提供一种唤醒源定位方法。该方法适用于第二音频装置,第二音频装置包括第二麦克风阵列,第二音频装置与第一音频装置能够通信。该方法包括:第二音频装置接收第二声音信号;第二音频装置根据第二声音信号计算得到第二角度,第二角度为唤醒源相对于第二音频装置的方向;第二音频装置向第一音频装置发送第二角度,以使第一音频装置基于第一角度和第二角度,确定唤醒源所在方向。
在一种可能的设计中,第二音频装置向第一音频装置发送第二角度之前,还包括:第二音频装置接收来自第一音频装置的查询请求,查询请求用于查询第二音频装置的第二角度。
在一种可能的设计中,该方法还包括:第二音频装置对第二声音信号进行唤醒识别得到第二识别结果,第二识别结果包括第二概率,第二概率用于描述第二声音信号中包括唤醒信息的概率,和/或,第二识别结果还包括第二距离,第二距离为唤醒源相对于第二音频装置的距离。
第八方面,提供一种音频装置组,包括:第一音频装置和第二音频装置;
第一音频装置包括:处理器;存储器;第一麦克风阵列;其中,所述存储器存储有计算机程序,所述计算机程序包括指令,当该指令被所述处理器执行时,使得第一音频装置执行如上述第一方面或第五方面提供的方法中第一音频装置的步骤;
第二音频装置包括:处理器;存储器;第二麦克风阵列;其中,存储器存储有计算机程序,所述计算机程序包括指令,当所述指令被所述处理器执行时,使得第二音频装置执行如上述第一方面或第五方面提供的方法中第二音频装置的步骤。
第九方面,提供一种第一音频装置,包括:处理器;存储器;第一麦克风阵列,其中,所述存储器存储有计算机程序,所述计算机程序包括指令,当所述指令被所述处理器执行时,使得第一音频装置执行如上述第二方面或第四方面或第六方面提供的方法步骤。
第十方面,提供了一种第一音频装置,所述音频装置包括执行上述第二方面或者第四方面或者第六方面的任意一种可能的设计的方法的模块/单元;这些模块/单元可以通过硬件实现,也可以通过硬件执行相应的软件实现。
第十一方面,提供一种第二音频装置,包括:处理器;存储器;第二麦克风阵列;存储器存储有计算机程序,所述计算机程序包括指令,当所述指令被所述处理器执行时,使得第二音频装置执行如上述第三方面或第七方面提供的方法步骤。
第十二方面,提供了一种第二音频装置,所述音频装置包括执行上述第三方面或者第七方面的任意一种可能的设计的方法的模块/单元;这些模块/单元可以通过硬件实现,也可以通过硬件执行相应的软件实现。
第十三方面,提供一种芯片,所述芯片与电子设备中的存储器耦合,用于调用存储器 中存储的计算机程序并执行上述第一方面至第七方面中任一方面提供的方法,本申请实施例中“耦合”是指两个部件彼此直接或间接地结合。
第十四方面,提供一种计算机可读存储介质,所述计算机可读存储介质包括计算机程序,当计算机程序在电子设备上运行时,使得所述电子设备执行如上述第一方面至第七方面任一方面提供的方法。
第十五方面,提供一种计算机程序产品,包括指令,当所述指令在计算机上运行时,使得所述计算机执行如上述第一方面至第七方面中任一方面提供的方法。
上述第二方面至第十五方面的有益效果,请参见第一方面的有益效果,不重复赘述。
附图说明
图1为本申请实施例提供的利用算法进行声音识别的流程示意图;
图2为本申请实施例提供的一个音频装置确定另一个音频装置所处方向的原理示意图;
图3为本申请实施例提供的应用场景的示意图;
图4为本申请实施例提供的应用场景的一种示例的示意图;
图5A为本申请实施例提供的一种唤醒识别方法的流程示意图;
图5B为本申请实施例提供的另一种唤醒识别方法的流程示意图;
图5C为本申请实施例提供的应用场景的另一种示例的示意图;
图5D为本申请实施例提供的又一种唤醒识别方法的流程示意图;
图6A为本申请实施例提供的又一种唤醒识别方法的流程示意图;
图6B、图6C和图6D为本申请实施例提供的计算第三角度的原理示意图;
图7为本申请实施例提供的音频装置组合与云端的交互示意图;
图8为本申请实施例提供的一种音频装置的结构示意图。
具体实施方式
下面结合本申请实施例中的附图,对本申请实施例中的技术方案进行描述。其中,在本申请实施例的描述中,以下实施例中所使用的术语只是为了描述特定实施例的目的,而并非旨在作为对本申请的限制。如在本申请的说明书和所附权利要求书中所使用的那样,单数表达形式“一个”、“一种”、“所述”、“上述”、“该”和“这一”旨在也包括例如“一个或多个”这种表达形式,除非其上下文中明确地有相反指示。还应当理解,在本申请以下各实施例中,“至少一个”、“一个或多个”是指一个或两个以上(包含两个)。术语“和/或”,用于描述关联对象的关联关系,表示可以存在三种关系;例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B的情况,其中A、B可以是单数或者复数。字符“/”一般表示前后关联对象是一种“或”的关系。
在本说明书中描述的参考“一个实施例”或“一些实施例”等意味着在本申请的一个或多个实施例中包括结合该实施例描述的特定特征、结构或特点。由此,在本说明书中的不同之处出现的语句“在一个实施例中”、“在一些实施例中”、“在其他一些实施例中”、“在另外一些实施例中”等不是必然都参考相同的实施例,而是意味着“一个或多个但不是所有的实施例”,除非是以其他方式另外特别强调。术语“包括”、“包含”、“具有”及它们的变形都意味着“包括但不限于”,除非是以其他方式另外特别强调。术语“连接”包括直接连接和间接 连接,除非另外说明。“第一”、“第二”仅用于描述目的,而不能理解为指示或暗示相对重要性或者隐含指明所指示的技术特征的数量。
在本申请实施例中,“示例性地”或者“例如”等词用于表示作例子、例证或说明。本申请实施例中被描述为“示例性地”或者“例如”的任何实施例或设计方案不应被解释为比其它实施例或设计方案更优选或更具优势。确切而言,使用“示例性地”或者“例如”等词旨在以具体方式呈现相关概念。
在详细阐述本申请之前,首先对本申请相关的名词进行解释。
(1)音频装置
音频装置是用于播放声音信号的装置。音频装置可以为音箱、手机、笔记本电脑、电视机、智能手环、手表等。所述音频装置也可以是逻辑装置,所谓逻辑装置可以理解为只要能播放声音信号逻辑单元/模块,不对硬件设备的类型、性能等作限制。例如,可以是一个或多个硬件设备中的一个或多个逻辑单元/模块。
音频装置还可以用于语音识别。通常,语音识别过程包括两个过程:唤醒识别和语音指令识别。其中,唤醒识别可以理解为识别唤醒语句,比如小爱小爱;语音指令识别可以理解为用于识别语音指令(或命令),比如播放歌曲XXX、切换下一首歌曲等。其中,音频装置中负责唤醒识别的模块称为唤醒识别模块,负责语音指令识别的模块称为语音指令识别模块。唤醒识别模块和语音指令识别模块是逻辑功能上的划分,二者对应的物理器件可以相同或不相同。
为了节省功耗,语音指令识别模块可以无需一直处于使能状态。比如,当唤醒识别模块检测到唤醒语句时,使能语音指令识别模块,语音指令识别模块执行语音指令识别。示例性的场景为:唤醒源(比如用户)发出声音信号,该声音信号包括唤醒语句,该声音信号被音频装置中的唤醒识别模块采集到,唤醒识别模块检测到声音信号中包括唤醒语句时,使能语音指令识别模块以通过语音指令识别模块进行语音识别。比如,当唤醒源(比如用户)再次发出声音信号(其中包含语音指令),该声音信号被语音识别模块采集到,语音识别模块识别声音信号中的语音指令,并响应于该语音指令执行相应的操作。
(2)算法
本申请涉及的算法可以包括唤醒识别算法和语音指令识别算法。其中,唤醒识别算法用于进行唤醒识别。比如,识别采集的声音信号中是否包括唤醒信息(比如,小爱小爱);语音指令识别算法用于进行语音指令的识别。比如,识别采集的声音信号中是否包括语音指令(比如,播放歌曲XXX)。
比如,以语音指令识别算法为例,图1的(a)为一种语音指令识别算法的流程示意图。如图1的(a)所示,语音指令识别算法的流程包括步骤1至步骤5。
步骤1、接收到声音信号,例如接收到唤醒源(比如用户)发出的声音信号。
步骤2、特征提取,可以理解为将声音信号中具有辨识度的成分提取出来。为了提高准确性,可以对每一帧声音信号的特征进行提取。其中,特征提取可以使用梅尔频率倒谱系数(mel-frequency cepstral coefficients,MFCC)算法提取,本申请实施例不详细介绍。
步骤3、基于特征得到音素,单词的发音由音素组成,汉语一般用声母、韵母等作为音素集。这个过程可以通过声学模型比如隐马尔可夫模型(hidden markov model,HMM)实现,本申请实施例不作详细介绍。
步骤4、基于音素得到单词;比如通过音素在语音字典中匹配单词。
步骤5、基于单词得到语句。假设得到的语句是播放歌曲XXX,那么音频装置响应该语音指令,播放歌曲XXX。
又如,以唤醒识别算法为例。图1的(b)为一种唤醒识别算法的流程示意图。如图1的(b)所示,唤醒识别算法的流程包括步骤1至步骤7。其中,步骤1至步骤5与语音指令识别流程中的步骤1至步骤5相同,不再赘述。下面介绍步骤6至步骤7。
步骤6、识别出的句子与预设句子比较,得到相似度。其中,预设句子是事先设置好的唤醒语句。
步骤7、输出相似度。比如,相似度是80%、60%等。
或者,还可以将相似度转换为概率值。比如,相似度是80%,对应的概率是0.8;相似度是20%,对应的概率是0.2。这种情况下,步骤7也可以输出概率值。
也就是说,唤醒识别算法可以计算出相似度或概率值。为了方便描述,将相似度和概率值统称为“置信度”。
需要说明的是,图1的(a)举例了一种语音指令识别算法,但是本申请对此不限定。其它的语音指令设备算法也是可以的。同样,图1的(b)举例了一种唤醒识别算法,但其它的唤醒识别算法也是可以的。
(3)置信度
置信度是指利用唤醒识别算法计算出的相似度或概率值。以置信度是概率值为例,用于指示声音信号中包括唤醒信息的概率。比如,概率值可以是0.1、0.5、0.9等。
(4)方向
本申请涉及的方向包括音频装置组合中一个音频装置相对于另一个音频装置的方向,或者,唤醒源相对于音频装置的方向,等等。所述方向可以使用角度来描述,比如,唤醒源相对于音频装置的方向可以是指唤醒源处于音频装置的北偏西30度,北偏东60度等等。
下面以麦克风阵列定位技术为例,介绍第一音频装置确定第二音频装置相对于第一音频装置的方向的原理。
第一音频装置中设置麦克风阵列。其中,麦克风阵列可以理解为按照特定规则(比如三行三列,五行五列等)分布的多个麦克风。
以第二音频装置发出的声波是平行波为例,参见图2所示,第一音频装置上的麦克风1在t1时刻接收到声波1,麦克风2在t2时刻接收到声波2。因此,声波1与声波2的时间差为t2-t1,所述时间差可以参见图2所示。
因为第二音频装置发出的声波时平行波,平行声波到达垂直面(与声波垂直的平面)的时间应该是相同的,所述垂直面请参见图2所示。因此,平行波到达麦克风1和麦克风2的距离差r=(t2-t1)*c。其中,t1、t2是已知量,声音传播速度c是已知量,还有麦克风1与麦克风2之间的距离D(比如麦克风1的质心与麦克风2的质心之间的距离)是已知量(该距离D可以是出厂时默认存储下的)。因此,如图2所示,可以通过上述的已知量t1、t2、c和D确定夹角θ。比如,所述夹角θ满足如下公式:
sinθ=r/D=((t2-t1)×c)/D,即θ=arcsin(((t2-t1)×c)/D)
通过上面的公式,可以得到夹角θ的值,夹角θ用于指示第二音频装置相对于第一音频装置的方向。
图2是以麦克风阵列包括两个麦克风为例。可以理解的是,麦克风阵列中麦克风的数量可以更多。比如麦克风阵列是三行三列,包括9个麦克风,那么就可以得到多个夹角。所述多个夹角中的任意一个可作为第二音频装置相对于第一音频装置的方向,或所述多个夹角的平均值可作为第二音频装置相对于第一音频装置的方向。
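上述时延定位原理可以用如下示意代码表示(以下代码并非专利原文内容,仅为按图2几何关系给出的示意性实现;函数名、声速取值343米/秒等均为示例假设):

```python
import math

def doa_angle(t1, t2, d, c=343.0):
    """根据两个麦克风接收同一平行声波的时刻差估计夹角θ(弧度)。
    t1、t2为两个麦克风的接收时刻(秒), d为麦克风间距(米), c为声速(米/秒)。"""
    r = (t2 - t1) * c                    # 声程差 r = (t2-t1)*c
    ratio = max(-1.0, min(1.0, r / d))   # 数值保护, 防止浮点误差使|r/d|略大于1
    return math.asin(ratio)              # 由 sinθ = r/d 反解θ

# 示例: 间距0.1米, 构造一个对应30度方向的时延
theta = doa_angle(0.0, 0.1 * math.sin(math.radians(30.0)) / 343.0, 0.1)
```

该片段仅演示 sinθ=r/D 的几何关系;实际系统中还需通过互相关等方法先从采样信号中估计时延,这里不展开。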
上面以麦克风阵列定位技术为例进行介绍。可以理解的是,除去上述麦克风阵列定位技术之外,还可以使用其他的定位技术,比如波束指向(steered-beamformer)法,基于高分辨率谱分析(high-resolution spectral analysis)定向法,和基于声音时间差(time-delay estimation,TDE)定向法等等,本申请实施例不作限定。
需要说明的是,上文是以第一音频装置确定第二音频装置相对于第一音频装置的方向为例进行介绍。可以理解的是,第二音频装置确定第一音频装置相对于第二音频装置的方向的原理与上述原理类似,不再赘述。此外,第一音频装置或第二音频装置还可以基于如图2所示原理确定唤醒源(比如用户)所在方向。
(5)距离
本申请涉及的距离包括音频装置组合中第一音频装置与第二音频装置之间的距离,或者,唤醒源与音频装置之间的距离。
下面主要以第一音频装置与第二音频装置之间的距离为例介绍。
作为一种示例,所述距离的一种计算方式为:通过公式X=2L²/γ确定所述距离。其中,X为第二音频装置相对于第一音频装置的距离,γ为声波波长,L是麦克风阵列的长度(事先存储的值),假设麦克风阵列是图2中的两个麦克风,则L等于D;假设麦克风阵列中麦克风数量更多时,比如三行三列或三行四列的矩阵,则L等于矩阵的长,比如第一行第一列的麦克风的质心到第三行第一列的麦克风的质心之间的距离。
上述仅是举例了一种距离计算方式,其它的能够计算第一音频装置与第二音频装置之间距离的方式也是可以的。此外,第一音频装置还可以基于上述距离计算方式计算声源与第一音频装置的距离,当然,第二音频装置也可以基于上述距离计算方式计算唤醒源与第二音频装置的距离。
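文中的距离公式 X=2L²/γ 可以用如下示意代码表示(非专利原文内容,函数名与示例参数均为假设):

```python
def source_distance(wavelength, array_len):
    """按文中公式 X = 2L^2/γ 估算距离: γ为声波波长(米), L为麦克风阵列长度(米)。"""
    return 2.0 * array_len ** 2 / wavelength

# 示例: 阵列长1米、波长1米时, X = 2米
x = source_distance(1.0, 1.0)
```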
(6)本申请实施例中“多个”是指两个以上(包含两个),鉴于此,本申请实施例中也可以将“多个”理解为“至少两个”。“至少一个”,可理解为一个或多个,例如理解为一个、两个或更多个。例如,包括至少一个,是指包括一个或多个,而且不限制包括的是哪几个。例如,包括A、B和C中的至少一个,那么包括的可以是A、B、C,A和B,A和C,B和C,或A和B和C。同理,对于“至少一种”等描述的理解,也是类似的。“和/或”,描述关联对象的关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B这三种情况。另外,字符“/”,如无特殊说明,一般表示前后关联对象是一种“或”的关系。
除非有相反的说明,本申请实施例提及“第一”、“第二”等序数词用于对多个对象进行区分。比如,第一音频装置和第二音频装置,只是用来区分两个音频装置,不用于限定两个音频装置的顺序、时序、优先级或者重要程度。
图3为本申请实施例提供的应用场景的示意图。其中包括N个音频装置,N为大于或等于2的正整数。图3中相应简化,以N=2为例。其中,两个音频装置可以分别布局在不同的地理位置。两个音频装置中每个音频装置可以播放声音信号,而且两个音频装置之间可以通信实现数据传输。其中,两个音频装置可以是同一类型的音频装置,比如两个音频装置都是音箱;或者,都是手机、都是平板电脑等;或者,两个音频装置可以包括不同类 型的音频装置,比如两个音频装置是音箱和手机的组合;或者,是手机与平板电脑的组合等。相对于单个音频装置播放音乐,两个音频装置同时播放音乐时,可以为用户提供较好的听觉体验。
以图3所示的应用场景为例,假设第一音频装置是主要音频装置,第二音频装置是次要音频装置。一般来说,唤醒识别过程包括:主要音频装置接收到声音信号之后,识别该声音信号中是否包括唤醒语句;如果是,就唤醒主要音频装置中的语音指令识别模块进行语音指令识别。这个过程中,次要音频装置没有参与,主要依赖主要音频装置进行唤醒识别。主要音频装置在唤醒识别时,会定位唤醒源(比如用户)相对于主要音频装置的方向。因此,语音指令识别过程一般包括:主要音频装置确定唤醒源的方向后,基于该方向进行语音指令的识别。比如,假设确定出唤醒源在角度1,主要音频装置接收到唤醒源的声音信号,认为所述声音信号中位于角度1的声音是唤醒源的声音,对位于角度1的声音进行语音指令识别。这个过程中,次要音频装置仍然没有参与。主要音频装置检测到语音指令时,执行该语音指令,并通知次要音频装置执行该语音指令(比如播放歌曲等),次要音频装置才参与使用。简单来说,只依赖主要音频装置进行唤醒识别、唤醒源的方向确定及语音指令的识别,次要音频装置不参与唤醒识别、唤醒源的方向确定及语音指令的识别。
然而,存在一些可能的场景:比如,环境中噪声较大,主要音频装置采集的声音信号中包括较多噪声,导致唤醒识别准确性降低、唤醒源的方向确定不准确,进而导致语音指令识别准确性低。
下面介绍所述场景的一种示例。
图4为本申请实施例提供的一种应用场景的示例。
图4为家庭内的客厅的示意图。客厅内设置两个音频装置(如,电视柜左右两侧的音频装置,比如第一音频装置和第二音频装置)。假设第一音频装置是主要音频装置,第二音频装置是次要音频装置。为了方便描述,以图4所示的场景(或环境)中仅有两个音频装置以及唤醒源(如,用户)发出声音为例。比如,两个音频装置正在播放音乐,唤醒源发出唤醒语句。这种情况下,主要音频装置接收到的声音信号中不仅包括唤醒源(即用户)发出的声音信号,还包括噪声源(即次要音频装置)发出的声音信号。这样的话,会导致如下问题:
1、由于噪声的干扰,主要音频装置唤醒识别的准确性较低。比如,唤醒源(比如用户)发出唤醒语句,但是由于受到次要音频装置的声音的干扰,导致主要音频装置无法识别出唤醒源发出的唤醒语句。比如,会出现用户多次发出唤醒语句,但是仍然无法唤醒音频装置的情况,用户体验较差。
2、由于噪声的干扰,主要音频装置无法准确的确定唤醒源的方向。比如,假设用户实际位于角度A,但是由于受到次要音频装置的声音信号的干扰,主要音频装置识别出用户在角度B,显然角度B是不准确的。
3、由于识别出唤醒源在角度B,所以在后续的语音指令识别过程中,主要音频装置对接收的声音信号中处于角度B的声音信号进行语音指令识别。显然,角度B对应的声音信号不是用户发出的声音信号,所以导致语音指令的准确性低。
鉴于此,本申请实施例提供一种唤醒识别方法。该方法不是单纯依赖主要音频装置进行唤醒识别、唤醒源的方向确定以及语音指令的识别,而是多个音频装置之间协作配合,提升唤醒识别和语音识别的准确性。
为了更清楚地展示本申请提供的技术方案,下面分多个实施例对本申请提供的技术方案进行说明。
实施例一
本实施例一介绍多个音频装置协作进行唤醒识别,不是单纯依赖一个音频装置进行唤醒设备,提升唤醒识别的准确性。
本实施例一以图4所示的应用场景为例进行介绍,即场景(或环境)中仅有两个音频装置以及唤醒源(如,用户)发出声音信号为例,比如,两个音频装置正在播放音乐,唤醒源发出唤醒语句。
第一种方式
第一种方式中,第一音频装置和第二音频装置均可以进行唤醒识别,分别得到各自的识别结果,第一音频装置综合两个识别结果作进一步的唤醒判断。比如,参见图5A,第一种方式的流程包括如下步骤:
S501,第一音频装置接收到第一声音信号。第一声音信号中包括唤醒源发出的唤醒语句,还包括第二音频装置发出的声音信号(图中未示出该过程)。
S502,第二音频装置接收到第二声音信号。第二声音信号中包括唤醒源发出的唤醒语句,还包括第一音频装置发出的声音信号(图中未示出该过程)。
其中,S501和S502的执行顺序本申请不作限定。
S503,第一音频装置对第一声音信号进行唤醒识别,得到第一识别结果。所述第一识别结果中包括第一置信度。所述第一置信度可以是第一音频装置利用唤醒识别算法计算出第一声音信号中包含唤醒语句的第一概率。其中,唤醒识别算法请参见前面名词解释部分的介绍,不再赘述。
S504,第二音频装置对第二声音信号进行唤醒识别,得到第二识别结果。所述第二识别结果中包括第二置信度。所述第二置信度可以是第二音频装置利用唤醒识别算法计算出的第二声音信号中包括唤醒语句的第二概率。
可选的,第一音频装置和/或第二音频装置可以在每次接收到声音信号时,就开启唤醒识别算法进行置信度的计算。或者,为了节省功耗,第一音频装置和/或第二音频装置可以实时的监听唤醒源的声音信号的接收时刻,当所述接收时刻之间的时间间隔小于预设值时,说明环境中唤醒源在一直发出语音,可能是在聊天。此时,可以无需启动唤醒识别算法,继续保持监听状态。当监听到所述时间间隔大于预设值时,启动唤醒识别算法。
S505,第二音频装置将第二识别结果发送给第一音频装置。
可选的,第二音频装置每次计算出第二置信度时,可以主动发给第一音频装置;或者,第一音频装置还可以向第二音频装置发送用于查询第二置信度的查询信息(比如在S505之前),第二音频装置接收到查询信息之后,向第一音频装置发送第二置信度。或者,第二音频装置向第一音频装置发送第二置信度之前,还可以对第二置信度进行预判断。比如,判断第二置信度是否大于预设阈值;若大于,向第一音频装置发送第二置信度;否则,不向第一音频装置发送第二置信度。因为,当第二置信度很低时,说明第二音频装置可以确定接收的第二声音信号中不包括唤醒词,这样也就无需向第一音频装置发送第二置信度作进一步的唤醒识别。当然,第二音频装置也可以无需对第二置信度进行预判断,直接向第一音频装置发送第二置信度。
S506,第一音频装置基于第一识别结果和第二识别结果判断是否唤醒。如果是,执行 S507,否则,可以不响应。
第一识别结果包括第一置信度,第二识别结果包括第二置信度,第一音频装置基于第一识别结果和第二识别结果判断是否唤醒,具体包括:基于第一置信度和第二置信度判断是否唤醒。比如,第一音频装置判断满足下述条件时,确定需要唤醒,所述条件包括如下条件中的至少一种:
1,第一置信度大于第一阈值,和/或,第二置信度大于第二阈值。其中,第一阈值和第二阈值可以相同或不同;比如,第一置信度是0.95,第一阈值是0.9,第二置信度是0.85,第二阈值是0.8,则第一音频装置唤醒。第一阈值和第二阈值可以是预设的。或者,第一阈值和第二阈值可根据用户需要进行设置或调整。
2,第一置信度和第二置信度两者的平均值或两者的加权平均值大于第三阈值。其中,所述加权平均值可以表示为P=第一置信度*A+第二置信度*B,其中,P是加权平均值,A和B是权重,A+B=1,A和B的取值可以事先设置好。第三阈值可以是预设的。或者,第三阈值可根据用户需要进行设置或调整。
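上述两类唤醒判断条件可以用如下示意代码表示(非专利原文内容;阈值与权重A、B的取值均为示例假设):

```python
def should_wake(p1, p2, th1=0.9, th2=0.8, th3=0.85, a=0.6, b=0.4):
    """综合第一置信度p1与第二置信度p2判断是否唤醒。
    条件1: 两个置信度分别大于各自阈值; 条件2: 加权平均值 P = A*p1 + B*p2 大于第三阈值。"""
    cond1 = p1 > th1 and p2 > th2
    cond2 = (a * p1 + b * p2) > th3
    return cond1 or cond2

should_wake(0.95, 0.85)  # 对应文中示例: 0.95>0.9 且 0.85>0.8, 判定唤醒
```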
可以理解的是,第一音频装置中包括唤醒识别模块和语音指令识别模块,语音指令识别模块无需一直处于使能状态。因此,第一音频装置基于第一识别结果和第二识别结果判断是否需要唤醒,可以理解为基于第一识别结果和第二识别结果,判断是否要唤醒第一音频装置中的语音指令识别模块。
S507,第一音频装置发出唤醒应答。
比如,第一音频装置确定唤醒时,可以发出“我在”的音频应答,以通知用户以唤醒第一音频装置。
可选地,S507不是必需步骤。也就是说,在一种实施方式中,该流程可以只包括S501-S506,不包括S507。
可选的,第一音频装置还可以向第二音频装置发送唤醒指令,以唤醒第二音频装置。比如,唤醒第二音频装置中的语音指令识别模块。当然,如果后续无需第二装置执行语音指令的识别的情况下,第一音频装置可以无需向第二音频装置发送唤醒指令;或者,第二音频装置中可以不设置语音指令识别模块。
需要说明的是,图5A是以第一音频装置根据第一识别结果和第二识别结果判断是否需要唤醒(S506)为例。可以理解的是,还可以由第二音频装置来执行该步骤。比如,第一音频装置向第二音频装置发送第一识别结果,第二音频装置基于第一识别结果和第二识别结果判断是否唤醒。
因此,第一种方式中,第一音频装置利用第二音频装置计算出的第二识别结果(包括第二置信度)判断是否需要唤醒,而不是单纯依据第一音频装置自身计算出的第一识别结果(包括第一置信度),一定程度上提升了唤醒识别的准确性。
第二种方式
上面的第一种方式中,两个音频装置各自进行了唤醒识别,共得到两个识别结果,然后综合两个识别结果判断是否唤醒。相对于第一种方式而言,第二种方式中,第一音频装置和/或第二音频装置在进行唤醒识别之前,还可以将接收到的声音信号中来自对方的声音进行抑制。比如,第一音频装置对接收的第一声音信号中来自第二音频装置的声音进行抑制,第二音频装置对接收的第二声音信号中来自第一音频装置的声音进行抑制。抑制后的声音信号突出了唤醒源的声音,对抑制后的声音信号进行唤醒识别时,更加提升唤醒识别 的准确性。
具体地,请参见图5B,为第二种方式的流程示意图。所述流程包括:
S601,第一音频装置接收来自第二音频装置的第三声音信号。
S602,第一音频装置根据第三声音信号确定第二音频装置相对于第一音频装置的方向。
所述方向的计算过程请参见前文名词解释部分,不再赘述。为了提升准确性,可以实时地计算所述方向。比如,两个音频装置每次开始播放音乐时就计算,或者,每隔一段时间计算一次;或者,当第一音频装置或第二音频装置检测到位置发生变化(比如,音频装置上的传感器检测到位置变化)时计算。
可选的,S602也可以由第二音频装置来执行。比如,第二音频装置计算出第一音频装置相对于第二音频装置的方向,将该方向发送给第一音频装置。该方向的相反方向即第二音频装置相对于第一音频装置的方向。
也就是说,第二音频装置相对于第一音频装置的方向可以提前计算出,即在接收到唤醒源发出唤醒语句(即第一声音信号)之前,就已计算出所述方向,以便后面的流程中使用该方向对来自第二音频装置的声音进行抑制。
可选的,S601和S602可以不执行。比如,第一音频装置还可以通过其它方式获取第二音频装置相对于第一音频装置的方向。比如,用户手动输入等等,所以图中S601和S602使用虚线表示。
S603,第一音频装置接收到第一声音信号。第一声音信号包括唤醒源发出的唤醒语句,当然还包括来自第二音频装置的声音。
S604,第二音频装置接收到第二声音信号。第二声音信号包括唤醒源发出的唤醒语句,当然还包括来自第一音频装置的声音。
S605,第一音频装置对第一声音信号进行唤醒识别得到第一识别结果,所述第一识别结果包括第一置信度。
S606,第二音频装置对第二声音信号进行唤醒识别得到第二识别结果,所述第二识别结果包括第二置信度。
S607,第二音频装置将第二识别结果发送给第一音频装置。
S603至S607的实现原理与图5A中S501至S505的实现原理相同,不再赘述。
S608,第一音频装置对第一声音信号中位于所述方向的声音进行抑制。
简单来说,第一声音信号中包括唤醒源的声音信号,也包括较强的来自噪声源(即第二音频装置)的声音信号。对第一声音信号中位于所述方向的声音进行抑制得到的抑制后的声音信号中包括唤醒源的声音信号,可能还包括较弱的来自噪声源的声音信号,相对于原始声音信号(即第一声音信号)突出了唤醒源的声音。
可选的,假设所述方向为北偏西30度,第一音频装置可以将第一声音信号中位于北偏西30度的声音进行抑制。或者,第一音频装置也可以基于北偏西30度确定一个角度范围。比如,30度减去一个阈值1作为角度范围的最小值min,30度加上一个阈值2作为角度方位的最大值max,那么角度范围就是(min,max)的区间。比如,阈值1是5度,那么最小值min是30度-5度=25度;阈值2是10度,那么最大值max是30度+10度=40度;所以角度范围是区间(北偏西25度,北偏西40度)。第一音频装置将第一声音信号中位于这个角度范围内的声音进行抑制。
其中,抑制原理可以为:第一声音信号是来自多个方向的声音的叠加。比如,第一声 音信号满足:A*声音信号1+B*其它声音信号。其中,A是第一权重,B是第二权重;A+B=1。假设声音信号1是位于所述方向的声音信号;其它声音信号包括唤醒源的声音信号。抑制后的声音信号满足:C*声音信号1+D*其它声音信号。其中,C是第三权重,D是第四权重;C+D=1;C小于A。也就是说,抑制后的声音信号中来自第二音频装置的声音信号的权重降低,突出了唤醒源发出的声音信号。
可选的,上述抑制过程可以实时的进行。比如在确定所述方向之后,第一音频装置对每次接收到的声音信号都进行抑制。或者,为了节省功耗,第一音频装置可以实时的监听唤醒源的声音信号的接收时刻。当所述接收时刻之间的时间间隔小于预设值时,可以无需抑制(环境中有人聊天),继续保持监听状态。当所述时间间隔大于预设值时,开始对采集的声音信号进行抑制。
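上述以权重A、B和C、D描述的加权抑制原理可以用如下示意代码表示(非专利原文内容;假设声音信号已按方向分解,权重取值为示例假设):

```python
def suppress(noise_component, other_component, a=0.5, c=0.1):
    """抑制前信号 = A*噪声方向分量 + B*其它方向分量 (A+B=1);
    抑制后信号 = C*噪声方向分量 + D*其它方向分量 (C+D=1, 且C<A)。"""
    before = a * noise_component + (1.0 - a) * other_component
    after = c * noise_component + (1.0 - c) * other_component
    return before, after

# 示例: 噪声源方向分量幅度1.0, 唤醒源等其它方向分量幅度2.0
before, after = suppress(1.0, 2.0)
```

抑制后信号中噪声方向分量的权重由A降至C,唤醒源的声音相对被突出。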
S609,第一音频装置对抑制后的第一声音信号进行唤醒识别,得到第三识别结果,所述第三识别结果包括第三置信度。
需要说明的是,上述S603至S609的执行顺序,本申请不限定。
S610,第一音频装置根据第一识别结果、第二识别结果和第三识别结果判断是否唤醒。
第一识别结果包括第一置信度,第二识别结果包括第二置信度,第三识别结果包括第三置信度,第一音频装置基于第一置信度、第二置信度和第三置信度判断是否唤醒;如果是,执行S611;否则,可以不响应。比如,第一音频装置确定满足如下条件时,确定需要唤醒,所述条件包括如下至少一种:
1、第三置信度大于第四阈值,和/或,第一置信度大于第一阈值,和/或,第二置信度大于第二阈值。
2、第一置信度和第二置信度两者的平均值或两者的加权平均值大于第三阈值,和/或,第三置信度与第一置信度两者的平均值或两者的加权平均值大于第五阈值,和/或,第三置信度与第二置信度两者的平均值或两者的加权平均值大于第六阈值,和/或,第三置信度、第一置信度和第二置信度三者的平均值或三者的加权平均值大于第七阈值。
其中,第一阈值至第七阈值可以是预设的,或者,根据用户需要进行设置或调整。
S611,第一音频装置发出唤醒应答。
可选地,S611不是必需步骤;也就是说,在一种实施方式中,该流程只包括S601-S610,不包括S611。
需要说明的是,图5B中S605可以不执行,所以图中使用虚线表示。如果不执行S605,即第一音频装置不需要对第一声音信号进行唤醒识别得到第一识别结果,那么S610中可以仅基于第二识别结果和第三识别结果判断是否需要唤醒。第二识别结果包括第二置信度,第三识别结果包括第三置信度。比如,第三置信度大于第四阈值,和/或,第二置信度大于第二阈值,和/或,第三置信度和第二置信度二者的平均值或加权平均值大于第六阈值时,则唤醒。
需要说明的是,图5B中S606和S607可以不执行,所以图中使用虚线表示。如果不执行S606和S607,即不需要第二音频装置进行置信度的计算,对第二音频装置要求较低。这种情况下,如果S605执行,那么S610中可以基于第一识别结果和第三识别结果判断是否需要唤醒。第一识别结果包括第一置信度,第三识别结果包括第三置信度。比如,第三置信度大于第四阈值,和/或,第一置信度大于第一阈值,和/或,第三置信度和第一置信度二者的平均值或加权平均值大于第五阈值,则唤醒。如果S605不执行,那么S610中可 以仅基于第三识别结果判断是否需要唤醒。比如,第三置信度大于第四阈值,则唤醒。
图5B的实施例以第一音频装置抑制来自第二音频装置的声音为例进行介绍。可以理解的是,第二音频装置也可以抑制来自第一音频装置的声音,然后对抑制后的声音进行唤醒识别,将识别结果发送给第一音频装置进行综合判断。其中,第二音频装置抑制来自第一音频装置的声音的原理与第一音频装置抑制来自第二音频装置的声音的原理相同,所以不再赘述。
在一些实施例中,第一音频装置可以默认使用第一种方式或第二种方式,或者,第一音频装置上设置切换按钮,通过该切换按钮实现第一种方式和第二种方式之间的切换。
第三种方式
上面的第一种方式中,第一音频装置不需要对第一声音信号进行抑制;第二种方式中,第一音频装置需要对接收的第一声音信号中处于第二音频装置所在方向的声音信号进行抑制。考虑到存在一种可能的场景:对于第一音频装置而言,唤醒源和第二音频装置在同一方向。比如,如图5C所示,对于第一音频装置而言,第二音频装置和唤醒源在同一方向。这种场景下,如果对第二音频装置所在方向的声音进行抑制的话,会将唤醒源的声音一并抑制,所以为了避免将唤醒源的声音抑制,可以使用第三种方式。具体的,参见图5D,为第三种方式的流程示意图,该流程包括:
S801,第一音频装置接收到来自第二音频装置的第三声音信号。
S802,第一音频装置根据第三声音信号确定第二音频装置相对于第一音频装置的方向。
S803,第一音频装置接收到第一声音信号。第一声音信号包括唤醒源发出的唤醒语句,当然还包括来自第二音频装置的声音。
S804,第一音频装置根据第一声音信号计算唤醒源相对于第一音频装置的方向。
此处,由于第一声音信号中不仅包括唤醒源发出的声音,还包括第二音频装置的声音,所以第一音频装置可以基于第一声音信号中唤醒源的声音计算出唤醒源所在方向;其计算过程可以参见前面名词解释部分。
S805,第一音频装置判断唤醒源和第二音频装置是否在同一方向。
第一音频装置判断唤醒源和第二音频装置是否在同一方向,如果是,执行S806,否则,执行S807。
S806,第一音频装置使用第一种方式进行唤醒识别。
S807,第一音频装置使用第一种方式或第二种方式进行唤醒识别。
需要说明的是,第一种方式不需要对来自第二音频装置所在方向的声音进行抑制,所以当唤醒源与第二音频装置在同一方向时,可以使用第一种方式处理。当唤醒源和第二音频装置不在同一方向时,可以使用第二种方式或第一种方式处理。
可选的,第一音频装置可以默认使用第一种方式、第二种方式或第三种方式,或者,第一音频装置上设置切换按钮,通过该切换按钮实现三种方式之间的切换。
总结来说,实施例一中,第一音频装置进行唤醒识别时,不是单纯依靠第一音频装置自身进行唤醒识别,而是第一音频装置和第二音频装置协作进行唤醒识别,提升唤醒识别的准确性。
在一些实施例中,第一音频装置可以包括多种唤醒策略。比如,第一种唤醒策略和第二种唤醒策略。其中,第一种唤醒策略即现有技术的唤醒策略,即第一音频装置只基于自身检测到的信息进行唤醒识别,不参考第二音频装置的信息。第二种唤醒策略是指上述实 施例一中第一音频装置和第二音频装置协作唤醒。另外,第一音频装置可以提供一个唤醒策略切换按钮,通过该唤醒策略切换按钮实现第一种唤醒策略和第二种唤醒策略之间的切换。在第一种唤醒策略下,第一音频装置识别唤醒源的准确性较低。比如,在环境中噪声较大的情况下,用户发出唤醒指令,但是迟迟无法唤醒设备。在第二种唤醒策略下,第一音频装置识别唤醒源的准确性较高。
可选地,第一音频装置和/或第二音频装置可以包括非唤醒状态(或者也可以称为休眠状态)、预唤醒阶段和唤醒阶段。以第一音频装置为例,非唤醒状态下第一音频装置中语音指令识别模块处于关闭状态,但声音采集模块(如麦克风)处于使能状态,可以采集声音。预唤醒阶段可以是进入唤醒阶段之前的一个阶段,可以认为是初步确定需要唤醒的阶段。唤醒阶段可以是第一音频装置中的语音识别模块已被唤醒的状态,可进行语音指令识别。
下面介绍三个阶段的切换过程。
可选地,第一音频装置处于非唤醒状态时,确定满足一定的条件时可以进入预唤醒阶段。比如,以图5A所示的流程为例,在非唤醒状态下,第一音频装置接收到第一声音信号,对第一声音信号进行识别得到第一识别结果,第一识别结果包括第一置信度。若第一置信度大于第一阈值,第一音频装置从非唤醒状态进入预唤醒阶段。可以理解为,第一音频装置初步确定是需要唤醒的,但是还需要结合第二音频装置的信息作进一步的判断,所以第一音频装置可以先进入预唤醒阶段。在进入预唤醒阶段后,第一音频装置可以为进入唤醒阶段作一些准备工作。比如,为语音指令识别模块上电,以作好启动语音指令识别模块的准备。如果结合第二音频装置的信息确定需要唤醒,那么第一音频装置从预唤醒阶段进入唤醒阶段。比如,启动语音指令识别模块。如果结合第二音频装置的信息确定不唤醒,那么第一音频装置从预唤醒阶段退回非唤醒阶段。此时,可以停止对语音指令识别模块的上电。
同理,在非唤醒状态下,第二音频装置接收到第二声音信号,对第二声音信号进行识别得到第二识别结果。第二识别结果包括第二置信度。在确定第二置信度大于第二阈值时,第二音频装置从非唤醒状态进入预唤醒阶段。即第二音频装置初步确定是需要唤醒的,但是还需要结合第一音频装置的信息作进一步的判断,所以第二音频装置先进入预唤醒阶段。如果结合第一音频装置的信息进一步判断确实需要唤醒,第二音频装置从预唤醒阶段进入唤醒阶段;否则,第二音频装置退出预唤醒阶段。
可选地,第一音频装置或第二音频装置处于预唤醒阶段的时长可以是第一预设时长(如事先设置好的时长,可以是用户设置也可以是设备出厂之前默认设置的)。假设在第一预设时长内未判断出是否需要唤醒,则退出预唤醒阶段。唤醒阶段的时长可以是第二预设时长,假设第二预设时长内未识别到语音指令,则退出唤醒阶段。
上面以第一音频装置和/或第二音频装置包括三种阶段(非唤醒状态、预唤醒阶段和唤醒阶段)为例进行介绍的,可选的,也可以仅包括两种阶段,比如,非唤醒阶段和唤醒阶段,本申请实施例不作限定。
实施例二
实施例一中第一识别结果包括第一置信度,第二识别结果包括第二置信度,所以第一音频装置可以根据第一置信度和第二置信度作唤醒判断。本实施例二中,第一音频装置得到的第一识别结果中除了包括第一置信度之外,还可以包括第一角度,所述第一角度描述 唤醒源相对于第一音频装置的方向。第二音频装置识别得到的第二识别结果除了包括第二置信度之外,还可以包括第二角度,第二角度描述唤醒源相对于第二音频装置的方向。因此,本实施例二中,第一音频装置不仅可以根据第一识别结果中的第一置信度和第二识别结果中的第二置信度判断是否需要唤醒,还可以根据第一识别结果中的第一角度和第二识别结果中的第二角度确定唤醒源所在方向。简单来说,两个音频装置协作进行唤醒源所在方向的确定。相对于现有技术中没有第二音频装置的参与,由第一音频装置独自进行唤醒定位的方案而言,这种方式能够确定出较为准确的唤醒源所在方向。当确定出唤醒源所在方向之后,在语音识别阶段,可以对唤醒源所在方位的声音信号进行语音指令的识别,无需对所有方向的声音信息进行语音指令的识别,不仅可以提升语音识别效率,还可以提升语音指令识别的准确性。
本实施例二继续以图4所示的应用场景为例,即场景中仅有两个音频装置和唤醒源发出声音为例,且以两个音频装置位于水平线上(如图4的水平虚线)为例。
请参见图6A所示,为本实施例二提供的唤醒识别方法的流程示意图。所述流程包括:
S901,第一音频装置接收到来自第二音频装置的第三声音信号。
S902,第一音频装置根据第三声音信号确定两个音频装置之间的距离(为了方便描述,该距离称为设备间距),比如所述距离为D。
需要说明的是,S901和S902可以不执行,比如,第一音频装置还可以使用其他的方式测距,比如激光测距,用户手动输入距离等等,所以图中S901和S902使用虚线表示。
S903,第一音频装置接收到第一声音信号。第一声音信号包括唤醒源发出的唤醒语句,当然还包括来自第二音频装置的声音。
S904,第二音频装置接收到第二声音信号。第二声音信号包括唤醒源发出的唤醒语句,当然还包括来自第一音频装置的声音。
S905,第一音频装置对第一声音信号进行唤醒识别,得到第一识别结果。第一识别结果中包括第一置信度和第一角度,所述第一置信度为所述第一声音信号中包括唤醒语句的第一概率,所述第一角度描述唤醒源相对于第一音频装置的方向。其中,第一音频装置根据第一声音信号计算唤醒源相对于第一音频装置的第一角度的原理请参见前文名词解释部分。
S906,第二音频装置对第二声音信号进行唤醒识别,得到第二识别结果。第二识别结果中包括第二置信度和第二角度,所述第二置信度为所述第二声音信号中包括唤醒语句的第二概率,所述第二角度描述唤醒源相对于第二音频装置的方向。其中,第二音频装置根据第二声音信号计算唤醒源相对于第二音频装置的第二角度的原理请参见前文名词解释部分。
S907,第二音频装置向第一音频装置发送第二识别结果。
S908,第一音频装置根据第一置信度和第二置信度判断是否需要唤醒。如果需要,进入S909。
S909,第一音频装置根据第一角度和第二角度确定唤醒源所在方向。
第一种方式,如果第一置信度大于第二置信度时,确定第一角度为唤醒源所在方向;如果第二置信度大于第一置信度时,确定第二角度为唤醒源所在方向。
第二种方式,第二识别结果中还包括第二距离,所述第二距离为唤醒源到第二音频装置的距离。那么,先利用设备间距D、第二识别结果中包括的第二角度和第二距离,预测 唤醒源相对于第一音频装置的第三角度,然后根据预测出的唤醒源相对于第一音频装置的第三角度和实际检测出的唤醒源相对于第一音频装置的第一角度(即S905),确定唤醒源的最终角度。
下面介绍预测唤醒源相对于第一音频装置的第三角度的过程。
参见图6B所示,第二角度和第二距离已知,而且第一音频装置和第二音频装置之间的设备间距也是已知的,所以通过距离D、第二角度、第二距离(边角边原理)可以构建出一个三角形(该过程不需要使用第一角度)。基于三角函数关系,可确定该三角形中的第三角度。所述构建的三角形满足如下三角函数关系:
sinβ×Sc-s=h
D-(cosβ×Sc-s)=P
α′=arctan(h/P)
其中,β是第二角度;S c-s为第二距离,h是唤醒源到两个音频装置的质心连线的距离。D为第一音频装置与第二音频装置之间的距离(设备间距),P是D与第二音频装置到h所在直线的距离之差,α′是第三角度。
因此,通过上述公式可变形得到第三角度α′满足:
α′=arctan((sinβ×Sc-s)/(D-(cosβ×Sc-s)))
需要说明的是,图6B是以第二角度β是锐角为例,当然,第二角度β还可以是直角或钝角。
以钝角为例,请参见图6C,满足如下三角函数:
sin(180-β)×Sc-s=h
D+(cos(180-β)×Sc-s)=P
α′=arctan(h/P)
因此,通过上述公式可变形得到第三角度α′满足:
α′=arctan((sin(180-β)×Sc-s)/(D+(cos(180-β)×Sc-s)))
以第二角度β是直角为例,请参见图6D所示,第三角度α′满足:
α′=arctan(Sc-s/D)
因此,第一音频装置可以根据第二角度的大小,选择合适的计算方式。比如,当第二角度是锐角时,可以使用图6B所示的方式计算,当第二角度是钝角时,使用图6C所示的方式计算,当第二角度是直角时,使用图6D所示的角度计算。
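上述锐角、直角、钝角三种情形可以统一用如下示意代码实现(非专利原文内容;利用 sin(180°−β)=sinβ、cos(180°−β)=−cosβ,三种公式可合并为同一表达式,函数名为示例假设):

```python
import math

def predict_third_angle(beta_deg, s_cs, d):
    """根据第二角度β(度)、第二距离S_c-s和设备间距D预测第三角度α′(度)。
    β为锐角、直角或钝角时, 均可统一写成 α′ = arctan(sinβ·S / (D - cosβ·S))。"""
    beta = math.radians(beta_deg)
    h = math.sin(beta) * s_cs        # 唤醒源到两个音频装置质心连线的垂直距离
    p = d - math.cos(beta) * s_cs    # 第一音频装置沿连线方向到垂足的距离
    return math.degrees(math.atan2(h, p))

alpha_right = predict_third_angle(90.0, 1.0, 2.0)  # 直角情形, 等于 arctan(S/D)
```

β为钝角时 cosβ<0,分母自动变为 D+|cosβ|·S,与文中图6C的公式一致,因此无需按情形分支。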
以上介绍两个音频装置协作预测唤醒源相对于第一音频装置的第三角度的过程。
当预测出唤醒源相对于第一音频装置的第三角度之后,可以根据预测出的唤醒源相对于第一音频装置的第三角度和实际检测出的唤醒源相对于第一音频装置的第一角度(S905)确定唤醒源的最终角度。比如,第一音频装置根据第三角度对第一角度进行校正,校正后的角度为唤醒源的最终角度。其中,校正角度可通过如下至少一种方式确定:
方式1,取第三角度α′和第一角度α两者的平均值,该平均值即为校正后的角度。
方式2,确定第三角度α′和第一角度α之间的差的绝对值:Δ=|α-α′|;校正后的角度满足:α±(ω i×Δ),ω i可以是预设值。
除了上述方式1和方式2以外,在一些实施例中,第一音频装置可以确定第三角度α′和 第一角度α之间的差的绝对值:Δ=|α-α′|;根据Δ判断是否需要对第一角度α进行校正。比如,参见如下公式:
α modify=α,当Δ<θ 1或Δ>θ 2时;α modify=按方式1或方式2校正后的角度,当θ 1≤Δ≤θ 2时
当Δ较小(比如小于第一阈值θ 1)时,即第三角度α′和第一角度α之间的差的绝对值很小,此时可以无需对第一角度α校正,即校正后的α modify=α。当然,也可以对夹角α校正,比如使用上述方式1或方式2进行校正。
当Δ较大(比如大于第二阈值θ 2),即第三角度α′和第一角度α之间的差的绝对值较大,此时也可以无需对第一角度α校正,即校正后的α modify=α。当然,也可以对夹角α校正,比如使用上述方式1或方式2进行校正。
当Δ处于第一阈值θ 1到第二阈值θ 2的范围内时,需要对第一角度α校正,具体的校正方式可以是上述方式1或方式2。
其中,第一阈值θ 1和第二阈值θ 2可以为预设值,两者的取值,本申请实施例不作限定。比如,θ 1是2、3度、5度等;θ 2是2度、3度、5度等。两者可以相等,也可以不等。
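上述校正判断可以用如下示意代码表示(非专利原文内容;阈值θ1、θ2的取值为示例假设,校正方式这里采用方式1取平均值):

```python
def correct_angle(alpha, alpha_pred, theta1=3.0, theta2=15.0):
    """根据实测第一角度α与预测第三角度α′的偏差 Δ=|α-α′| 决定是否校正。
    Δ<θ1 或 Δ>θ2 时不校正, 直接采用α; 否则按方式1取两者平均值作为校正后的角度。"""
    delta = abs(alpha - alpha_pred)
    if delta < theta1 or delta > theta2:
        return alpha
    return (alpha + alpha_pred) / 2.0

correct_angle(30.0, 38.0)  # Δ=8 落在[θ1, θ2]内, 校正为两者平均34.0
```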
当唤醒源所在方向确定之后,在语音指令识别阶段,可以对处于所述唤醒源所在方向的声音信号进行语音指令的识别,无需对所有方向的声音信号均进行语音指令的识别,提升效率。具体地,语音指令的识别过程参见图6A中的S910至S914。
S910,第一音频装置接收到第四声音信号。第四声音信号中包括语音指令。
S911,第一音频装置对第四声音信号中位于非唤醒源所在方向的声音抑制。
非唤醒源所在方向可以是指除去所述唤醒源所在方向之外的其它方向。假设唤醒源所在方向是北偏西30度,可以对第四声音信号中除去北偏西30度角度之外的其它角度进行抑制。或者,第一音频装置也可以基于唤醒源所在方向(如北偏西30度)确定一个角度范围,将第四声音信号中处于该角度范围内的声音信息进行抑制。
需要说明的是,S911是可选步骤,本实施例二的流程可以包含S911,也可以不包含S911。
S912,第一音频装置对第四声音信号中唤醒源所在方向的声音进行语音指令识别。
假设唤醒源所在方向是北偏西30度,第一音频装置可以对第四声音信号中位于北偏西30度的声音信号进行语音指令识别。或者,第一音频装置也可以基于唤醒源所在方向(如,北偏西30度)确定,确定一个角度范围。比如,30度减去一个阈值1作为角度范围的最小值min,30度加上一个阈值2作为角度方位的最大值max。那么角度范围就是(min,max)的区间。比如,阈值1是5度,那么最小值min是30度-5度=25度;阈值2是10度,那么最大值max是30度+10度=40度,所以角度范围是区间(北偏西25度,北偏西40度)。第一音频装置对第四声音信号中位于这个角度范围内的声音进行语音指令的识别。这样,第一音频装置无需对所有角度的声音都进行语音指令识别,节省功耗。
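上述角度范围的确定可以用如下示意代码表示(非专利原文内容;阈值1、阈值2取文中示例值5度和10度):

```python
def command_angle_range(direction_deg, th_min=5.0, th_max=10.0):
    """基于唤醒源所在方向确定进行语音指令识别的角度范围(min, max)。"""
    return direction_deg - th_min, direction_deg + th_max

command_angle_range(30.0)  # 文中示例: 北偏西30度 → 区间(25度, 40度)
```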
其中,语音指令的识别过程请参见前文名词解释部分,不再赘述。
S913,第一音频装置执行语音指令。
比如,语音指令是切换下一首歌曲,那么第一音频装置切换到下一首歌曲。
S914,第一音频装置向第二音频装置发送语音指令,以控制第二音频装置执行语音指令。
上面的实施例是以第一音频装置和第二音频装置作为执行主体来介绍的。可选的,上面的实施例中的部分步骤还可以由云端执行。以图5A为例,S506可以由云端执行。比如,参见图7,第一音频装置向第二音频装置发送预唤醒事件,所述预唤醒事件即第二识别结果,第一音频装置将第一识别结果和第二识别结果上报给云端,由云端根据第一识别结果和第二识别结果判断是否需要唤醒。如果需要,则向第一音频装置发送唤醒指令,用以唤醒第一音频装置,然后第一音频装置向第二音频装置发送唤醒指令。这种方式可以减轻第一音频装置的计算任务。
可选的,云端向第一音频装置发送唤醒指令之前,还可以向用户的移动终端发送提示信息,以提示用户是否确认要唤醒第一音频装置。当移动终端上检测到确认操作时,向云端发送确认指令,用于确认唤醒第一音频装置。第一音频装置接收到确认指令之后,向第一音频装置发送唤醒指令,可以避免误唤醒。
在一些实施例中,第一音频装置可以是主要音频装置,第二音频装置是次要音频装置;或者,第二音频装置是主要音频装置,第一音频装置是次要音频装置。其中,主要音频装置可以通过如下至少一种方式确定:
主要音频装置可以是预先设置好的一个或多个音频装置,比如出厂之前默认设置好的;那么,N个音频装置中剩余的音频装置就是次要音频装置。
或者,主要音频装置可以是用户指定的。比如,N个音频装置中每个音频装置通过触摸显示屏检测到输入操作,该输入操作用于选择所述每个音频装置是主要音频装置还是次要音频装置。
或者,主要音频装置可以是N个音频装置中性能最强的音频装置。所述性能最强的音频装置可以包括处理器性能最强的音频装置,比如处理器运算速度最快等。
其中,主要音频装置和次要音频装置可以具有相同或不同的结构。比如,主要音频装置具有显示屏,而次要音频装置不具有显示屏。或者,主要音频装置和从音箱播放设备可以负责不同的功能。比如,主要音频装置播放音频文件的左声道声音信息,次要音频装置播放音频文件的右声道声音信息。
上述实施例是以N=2为例介绍的。可以理解的是,音频装置组合中还可以包括更多个音频装置。当N为其它数值时,可以采用类似的原理,此处不重复赘述。
因此,本申请实施例中N个音频装置之间协作配合,实现了如下有益效果:
1、两个音频装置协作进行唤醒识别,提升了唤醒识别的准确性。
2、两个音频装置协作进行唤醒源方向的确定,提升了确定的唤醒源所在方向的准确性。
3、由于确定的唤醒源所在方向准确,后续进行语音指令识别时,可对位于唤醒源所在方向的声音信号进行语音指令识别,提升语音指令识别的准确性。
上述本申请提供的实施例是从音频装置作为执行主体的角度,对本申请实施例提供的方法进行了介绍。为了实现上述本申请实施例提供的方法中的各功能,终端设备可以包括硬件结构和/或软件模块,以硬件结构、软件模块、或硬件结构加软件模块的形式来实现上述各功能。上述各功能中的某个功能以硬件结构、软件模块、还是硬件结构加软件模块的方式来执行,取决于技术方案的特定应用和设计约束条件。
如图8所示,本申请另外一些实施例公开了一种音频装置。该音频装置比如为音箱、 手机、平板电脑、笔记本电脑、台式电脑等电子设备。如图8所示,音频装置可以包括:一个或多个处理器801;麦克风802、扬声器803、存储器804,其中,存储器804中包括一个或多个计算机程序805。上述各器件可以通过一个或多个通信总线806连接。其中,麦克风802可以是麦克风阵列,用于接收声音信号,还可以用于确定方向、距离等(参见前文名词解释部分)。扬声器803用于播放声音信号。
其中,所述一个或多个计算机程序805被存储在上述存储器804中并被配置为被该一个或多个处理器801执行,该一个或多个计算机程序805包括指令,上述指令可以用于执行如图5A至图6A及相应实施例中的各个步骤。
图8所示的音频装置可以是上文中的第一音频装置或第二音频装置。当图8所示的音频装置是第一音频装置时,可以用于执行上文中第一音频装置的相关步骤。当图8所示的音频装置是第二音频装置时,可以用于执行上文中第二音频装置的相关步骤。
以上实施例中所用,根据上下文,术语“当…时”或“当…后”可以被解释为意思是“如果…”或“在…后”或“响应于确定…”或“响应于检测到…”。类似地,根据上下文,短语“在确定…时”或“如果检测到(所陈述的条件或事件)”可以被解释为意思是“如果确定…”或“响应于确定…”或“在检测到(所陈述的条件或事件)时”或“响应于检测到(所陈述的条件或事件)”。另外,在上述实施例中,使用诸如第一、第二之类的关系术语来区分一个实体和另一个实体,而并不限制这些实体之间的任何实际的关系和顺序。
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。所述计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行所述计算机程序指令时,全部或部分地产生按照本发明实施例所述的流程或功能。所述计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一个计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如同轴电缆、光纤、数字用户线(DSL))或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存取的任何可用介质或者是包含一个或多个可用介质集成的服务器、数据中心等数据存储设备。所述可用介质可以是磁性介质,(例如,软盘、硬盘、磁带)、光介质(例如,DVD)、或者半导体介质(例如固态硬盘Solid State Disk(SSD))等。
本领域技术人员应该可以意识到,在上述一个或多个示例中,本申请实施例所描述的功能可以用硬件、软件、固件或它们的任意组合来实现。当使用软件实现时,可以将这些功能存储在计算机可读介质中或者作为计算机可读介质上的一个或多个指令或代码进行传输。计算机可读介质包括计算机存储介质和通信介质,其中通信介质包括便于从一个地方向另一个地方传送计算机程序的任何介质。存储介质可以是通用或专用计算机能够存取的任何可用介质。
以上所述的具体实施方式,对本申请实施例的目的、技术方案和有益效果进行了进一步详细说明,所应理解的是,以上所述仅为本申请实施例的具体实施方式而已,并不用于限定本申请实施例的保护范围,凡在本申请实施例的技术方案的基础之上,所做的任何修改、等同替换、改进等,均应包括在本申请实施例的保护范围之内。本申请说明书的上述描述可以使得本领域技术人员能够利用或实现本申请实施例的内容,任何基于所公开内容的修改都应该被认为是本领域显而易见的,本申请实施例所描述的基本原则可以应用到其它变形中而不偏离本申请的发明本质和范围。因此,本申请实施例所公开的内容不仅仅局限于所描述的实施例和设计,还可以扩展到与本申请原则和所公开的新特征一致的最大范围。
尽管结合具体特征及其实施例对本申请进行了描述,显而易见的,在不脱离本申请实施例的精神和范围的情况下,可对其进行各种修改和组合。相应地,本说明书和附图仅仅是所附权利要求所界定的本申请的示例性说明,且视为已覆盖本申请范围内的任意和所有修改、变化、组合或等同物。显然,本领域的技术人员可以对本申请进行各种改动和变型而不脱离本申请的范围。这样,倘若本申请实施例的这些修改和变型属于本申请权利要求及其等同技术的范围之内,则本申请实施例也意图包括这些改动和变型在内。

Claims (19)

  1. 一种唤醒识别方法,其特征在于,所述方法适用于音频装置组,所述音频装置组包括第一音频装置和第二音频装置,所述第一音频装置包括第一麦克风阵列,所述第二音频装置包括第二麦克风阵列,所述方法包括:
    所述第一音频装置接收第一声音信号,所述第一声音信号包括唤醒源发出的声音信号;
    所述第一音频装置对所述第一声音信号进行唤醒识别,得到第一识别结果;
    所述第二音频装置接收第二声音信号,所述第二声音信号包括唤醒源发出的声音信号;
    所述第二音频装置对所述第二声音信号进行唤醒识别,得到第二识别结果;
    所述第二音频装置向所述第一音频装置发送所述第二识别结果;
    所述第一音频装置基于所述第一识别结果和所述第二识别结果,确定是否唤醒所述音频装置组。
  2. 如权利要求1所述的方法,其特征在于,所述第一识别结果包括第一概率,所述第一概率为所述第一声音信号包括唤醒信息的概率;所述第二识别结果包括第二概率,所述第二概率为所述第二声音信号包括唤醒信息的概率;
    所述第一音频装置基于所述第一识别结果和所述第二识别结果,确定是否唤醒所述音频装置组,包括:
    若所述第一概率大于第一阈值、所述第二概率大于第二阈值;和/或,第一概率和第二概率二者的平均值或加权平均值大于第三阈值,则唤醒所述音频装置组。
  3. 如权利要求1或2所述的方法,其特征在于,在所述第一音频装置对所述第一声音信号进行唤醒识别,得到第一识别结果之前,所述方法还包括:
    所述第一音频装置对所述第一声音信号中来自所述第二音频装置的声音信号进行抑制;
    所述第一音频装置对所述第一声音信号进行唤醒识别,得到第一识别结果,包括:
    所述第一音频装置对抑制后的所述第一声音信号进行唤醒识别,得到第一识别结果。
  4. 如权利要求1-3中任意一项所述的方法,其特征在于,所述第二音频装置对所述第二声音信号进行唤醒识别,得到第二识别结果之前,所述方法还包括:
    所述第二音频装置对所述第二声音信号中来自所述第一音频装置的声音信号进行抑制;
    所述第二音频装置对所述第二声音信号进行唤醒识别,得到第二识别结果,包括:
    所述第二音频装置对抑制后的所述第二声音信号进行唤醒识别,得到第二识别结果。
  5. 如权利要求3或4所述的方法,其特征在于,在所述第一音频装置对所述第一声音信号中来自所述第二音频装置的声音信号进行抑制之前,所述方法还包括:所述第一音频装置确定所述第二音频装置和所述唤醒源处于不同方向;
    在所述第二音频装置对所述第二声音信号中来自所述第一音频装置的声音信号进行抑制之前,所述方法还包括:所述第二音频装置确定所述第一音频装置和所述唤醒源处于不同方向。
  6. The method according to any one of claims 2-5, wherein the first recognition result further comprises a first angle, the first angle indicating a direction of the wake-up source relative to the first audio device; the second recognition result further comprises a second angle, the second angle indicating a direction of the wake-up source relative to the second audio device; and the method further comprises: determining, by the first audio device based on the first angle and the second angle, the direction in which the wake-up source is located.
  7. The method according to claim 6, wherein the determining, by the first audio device based on the first angle and the second angle, the direction in which the wake-up source is located comprises:
    when the first probability is greater than the second probability, determining that the first angle indicates the direction of the wake-up source relative to the first audio device; and
    when the second probability is greater than the first probability, determining that the second angle indicates the direction of the wake-up source relative to the second audio device.
  8. The method according to claim 6, wherein the second recognition result further comprises a second distance, the second distance being the distance of the wake-up source relative to the second audio device; and the determining, by the first audio device based on the first angle and the second angle, the direction in which the wake-up source is located comprises:
    predicting, by the first audio device according to a first distance, the second angle, and the second distance, a third angle of the wake-up source relative to the first audio device, wherein the first distance is the distance between the first audio device and the second audio device; and
    determining, by the first audio device according to the third angle and the first angle, the direction of the wake-up source relative to the first audio device.
  9. The method according to claim 8, wherein the third angle satisfies:
    Figure PCTCN2021114728-appb-100001
    wherein β is the second angle, S_c-s is the second distance, D is the first distance, and α′ is the third angle.
  10. The method according to claim 8 or 9, wherein the determining, by the first audio device according to the third angle and the first angle, the direction in which the wake-up source is located comprises:
    determining that an average or weighted average of the third angle and the first angle indicates the direction of the wake-up source relative to the first audio device;
    or,
    the angle of the wake-up source relative to the first audio device is α±(ω_i×Δ), wherein ω_i is a preset value, Δ=|α−α′|, α is the first angle, and α′ is the third angle.
  11. The method according to any one of claims 6-10, wherein after the first audio device determines, based on the first recognition result and the second recognition result, to wake up the audio device group, the method further comprises:
    receiving, by the first audio device, a third sound signal, wherein the third sound signal comprises a voice instruction; and
    recognizing, by the first audio device, the sound signal located in the direction of the wake-up source in the third sound signal, to obtain the voice instruction, wherein the voice instruction is used to control the first audio device.
  12. The method according to claim 11, wherein before the first audio device recognizes the sound signal located in the direction of the wake-up source in the third sound signal to obtain the voice instruction, the method further comprises:
    suppressing, by the first audio device, a sound signal located in a non-wake-up-source direction in the third sound signal, wherein the non-wake-up-source direction is a direction other than the direction in which the wake-up source is located; and
    the recognizing, by the first audio device, the sound signal located in the direction of the wake-up source in the third sound signal to obtain the voice instruction comprises:
    recognizing, by the first audio device, the sound signal located in the direction of the wake-up source in the suppressed third sound signal, to obtain the voice instruction.
  13. A wake-up recognition method, wherein the method is applicable to a first audio device, the first audio device comprises a first microphone array, and the method comprises:
    receiving, by the first audio device, a first sound signal, wherein the first sound signal comprises a sound signal emitted by a wake-up source;
    performing, by the first audio device, wake-up recognition on the first sound signal to obtain a first recognition result;
    receiving, by the first audio device, a second recognition result from a second audio device, wherein the second recognition result is a recognition result obtained by the second audio device by performing wake-up recognition on a received second sound signal; and
    determining, by the first audio device based on the first recognition result and the second recognition result, whether to wake up the first audio device.
  14. The method according to claim 13, wherein the first recognition result comprises a first probability, the first probability being the probability that the first sound signal comprises wake-up information; and the second recognition result comprises a second probability, the second probability being the probability that the second sound signal comprises wake-up information;
    the determining, by the first audio device based on the first recognition result and the second recognition result, whether to wake up the first audio device comprises:
    determining to wake up the first audio device if the first probability is greater than a first threshold and the second probability is greater than a second threshold, and/or if an average or weighted average of the first probability and the second probability is greater than a third threshold.
  15. The method according to claim 13 or 14, wherein before the first audio device performs wake-up recognition on the first sound signal to obtain the first recognition result, the method further comprises:
    suppressing, by the first audio device, the sound signal coming from the second audio device in the first sound signal; and
    the performing, by the first audio device, wake-up recognition on the first sound signal to obtain a first recognition result comprises:
    performing, by the first audio device, wake-up recognition on the suppressed first sound signal to obtain the first recognition result.
  16. The method according to claim 15, wherein before the first audio device suppresses the sound signal coming from the second audio device in the first sound signal, the method further comprises: determining, by the first audio device, that the second audio device and the wake-up source are located in different directions.
  17. An audio device group, comprising a first audio device and a second audio device, wherein:
    the first audio device comprises a processor, a memory, and a first microphone array, wherein the memory stores a computer program that, when executed by the processor, causes the first audio device to perform the steps performed by the first audio device in the method according to any one of claims 1-16; and
    the second audio device comprises a processor, a memory, and a second microphone array, wherein the memory stores a computer program that, when executed by the processor, causes the second audio device to perform the steps performed by the second audio device in the method according to any one of claims 1-16.
  18. A computer-readable storage medium, comprising a computer program that, when run on a computer, causes the computer to perform the method according to any one of claims 1-16.
  19. A computer program product, comprising a computer program that, when run on a computer, causes the computer to perform the method according to any one of claims 1-16.
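The decision and angle-fusion logic recited in claims 2, 8, and 10 can be sketched in code. This is a hypothetical illustration, not the patented implementation: the function names, thresholds, and weights are all assumptions, and because the claim 9 equation appears only as an image in the publication, `predict_third_angle` substitutes plain plane geometry as an illustrative stand-in.

```python
import math

def should_wake(p1, p2, t1=0.8, t2=0.8, t3=0.75, w=(0.5, 0.5)):
    """Claim 2 sketch: wake if both per-device wake-up probabilities clear
    their thresholds, and/or if their weighted average clears a third
    threshold. All threshold values here are assumed, not from the patent."""
    both_clear = p1 > t1 and p2 > t2
    weighted_avg = w[0] * p1 + w[1] * p2
    return both_clear or weighted_avg > t3

def predict_third_angle(beta_deg, s_cs, d):
    """Claim 8 sketch: predict the wake-up source's angle relative to the
    first device from the second device's angle beta, the source-to-second-
    device distance s_cs, and the inter-device distance d. Assumed layout:
    both devices on one axis, d apart; beta measured at the second device
    against that axis. The patent's exact formula is not reproduced here."""
    beta = math.radians(beta_deg)
    x = d - s_cs * math.cos(beta)  # along-axis offset of source from first device
    y = s_cs * math.sin(beta)      # off-axis offset of source
    return math.degrees(math.atan2(y, x))

def fuse_angles(alpha_deg, alpha_pred_deg):
    """Claim 10, first branch: the (weighted) average of the measured first
    angle and the predicted third angle indicates the source direction.
    The second branch, alpha +/- (w_i * |alpha - alpha'|), is an alternative."""
    return (alpha_deg + alpha_pred_deg) / 2.0
```

For example, with both devices reporting a wake-up probability of 0.9 the group wakes, while two 0.1 readings leave it asleep; and a source 1 m from the second device at 90° when the devices are 1 m apart is predicted at 45° from the first device under the assumed geometry.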
PCT/CN2021/114728 2020-08-31 2021-08-26 Wake-up recognition method, audio device, and audio device group WO2022042635A1 (zh)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202010893958 2020-08-31
CN202010893958.3 2020-08-31
CN202011556351.2A CN114121024A (zh) 2020-08-31 2020-12-23 Wake-up recognition method, audio device, and audio device group
CN202011556351.2 2020-12-23

Publications (1)

Publication Number Publication Date
WO2022042635A1 (zh)

Family

ID=80352722

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/114728 WO2022042635A1 (zh) 2020-08-31 2021-08-26 Wake-up recognition method, audio device, and audio device group

Country Status (1)

Country Link
WO (1) WO2022042635A1 (zh)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190026067A1 (en) * 2015-12-24 2019-01-24 Samsung Electronics Co., Ltd. Electronic device and method for controlling operation of electronic device
CN110660407A (zh) * 2019-11-29 2020-01-07 恒玄科技(北京)有限公司 一种音频处理方法及装置
CN110858488A (zh) * 2018-08-24 2020-03-03 阿里巴巴集团控股有限公司 语音活动检测方法、装置、设备及存储介质
CN111369988A (zh) * 2018-12-26 2020-07-03 华为终端有限公司 一种语音唤醒方法及电子设备


Similar Documents

Publication Publication Date Title
US11756563B1 (en) Multi-path calculations for device energy levels
US11289087B2 (en) Context-based device arbitration
US11138977B1 (en) Determining device groups
CN108351872B (zh) Method and system for responding to user voice
US10504511B2 (en) Customizable wake-up voice commands
CN108735209B (zh) Wake-up word binding method, smart device, and storage medium
US9484028B2 (en) Systems and methods for hands-free voice control and voice search
US11502859B2 (en) Method and apparatus for waking up via speech
US10811008B2 (en) Electronic apparatus for processing user utterance and server
JP2019117623A (ja) Voice interaction method, apparatus, device, and storage medium
CN110634507A (zh) Voice classification of audio for voice wake-up
CN114830228A (zh) Account associated with a device
CN109101517B (zh) Information processing method, information processing device, and medium
WO2021031308A1 (zh) Audio processing method and apparatus, and storage medium
US11393490B2 (en) Method, apparatus, device and computer-readable storage medium for voice interaction
US9508345B1 (en) Continuous voice sensing
CN114121024A (zh) Wake-up recognition method, audio device, and audio device group
US11437022B2 (en) Performing speaker change detection and speaker recognition on a trigger phrase
CN109032554A (zh) Audio processing method and electronic device
US11455997B2 (en) Device for processing user voice input
WO2022147692A1 (zh) Voice instruction recognition method, electronic device, and non-transitory computer-readable storage medium
WO2022042635A1 (zh) Wake-up recognition method, audio device, and audio device group
WO2023155607A1 (zh) Terminal device and voice wake-up method
US20210110838A1 (en) Acoustic aware voice user interface
CN114863936A (zh) Wake-up method and electronic device

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 21860469; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 21860469; Country of ref document: EP; Kind code of ref document: A1)