WO2020051841A1 - Human-machine speech interaction apparatus and method of operating the same

Human-machine speech interaction apparatus and method of operating the same

Info

Publication number
WO2020051841A1
Authority
WO
WIPO (PCT)
Prior art keywords
microphone
facing
audio signal
frequency bins
backward
Application number
PCT/CN2018/105518
Other languages
French (fr)
Inventor
Jinwei Feng
Xinguo LI
Original Assignee
Alibaba Group Holding Limited
Application filed by Alibaba Group Holding Limited
Priority to CN201880096234.7A (patent CN112654960A)
Priority to JP2021510940A (patent JP2021536692A)
Priority to PCT/CN2018/105518 (patent WO2020051841A1)
Publication of WO2020051841A1

Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04R - LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R3/00 - Circuits for transducers, loudspeakers or microphones
    • H04R3/005 - Circuits for transducers, loudspeakers or microphones for combining the signals of two or more microphones
    • H04R29/00 - Monitoring arrangements; Testing arrangements
    • H04R29/004 - Monitoring arrangements; Testing arrangements for microphones

Definitions

  • FIGS. 15A and 15B illustrate time-domain signals for a forward-facing cardioid microphone and a backward-facing cardioid microphone, respectively, when a talker speaks in front of the forward-facing microphone, consistent with some exemplary embodiments of the present disclosure. It can be seen that the speech signal power received by the forward-facing microphone is noticeably larger than that received by the backward-facing microphone.
  • FIG. 16A illustrates a prototype of a human-machine speech interaction apparatus having a circuit board 1900 and a microphone system having a forward-facing microphone 1710 and a backward-facing microphone 1720.
  • FIG. 16B illustrates an exemplary circuit diagram of a human-machine speech interaction apparatus.
  • a user 180 transmits sound waves 190 propagating toward the forward-facing microphone 1710 and the backward-facing microphone 1720.
  • the forward-facing microphone 1710 turns the sound waves into an electrical signal passing through an amplifier 1730, a variable resistor 1750, a filter 1770, and then a processor 1790 and a controller 1800.
  • the backward-facing microphone 1720 turns the sound waves into an electrical signal passing through an amplifier 1740, a variable resistor 1760, a filter 1780, and then to processor 1790 and controller 1800.
  • FIG. 17 illustrates a circuit diagram in a human-machine speech interaction apparatus, consistent with some embodiments of the present disclosure.
  • User 180 transmits sound waves 190 propagating toward a front-facing microphone 1710 and a backward-facing microphone 1720.
  • the front-facing microphone 1710 turns the sound waves into an electrical signal that passes through an amplifier 1730, a variable resistor 1750, a filter 1770, a processor 1790, and a controller 1800, and then reaches a user interface 1810 that adjusts the performance of the human-machine speech interaction apparatus based on the apparatus's response toward user 180.
  • likewise, the backward-facing microphone 1720 turns the sound waves into an electrical signal that passes through an amplifier 1740, a variable resistor 1760, a filter 1780, processor 1790, and controller 1800, and then reaches user interface 1810.
  • the computer readable medium may be a non-transitory computer readable storage medium.
  • Non-transitory computer readable media include, for example, a floppy disk, a flexible disk, a hard disk, a solid state drive, magnetic tape or any other magnetic data storage medium, a CD-ROM or any other optical data storage medium, any physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM or any other flash memory, an NVRAM, a cache, a register, any other memory chip or cartridge, and networked versions of the same.
  • a computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM, EEPROM, or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
  • a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
  • Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, IR, etc., or any suitable combination of the foregoing.
  • Computer program code for carrying out operations for example embodiments may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
  • the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
  • Example embodiments are described below with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
  • These computer program instructions may also be stored in a computer readable medium that can direct a hardware processor core of a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium form an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
  • the computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
  • each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function (s) .
  • the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • The word “exemplary” is used herein to mean serving as an example, instance, or illustration. Any aspect or design described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other aspects or designs. Rather, use of the word is intended to present concepts in a concrete fashion.
  • The term “or” is intended to mean an inclusive “or” rather than an exclusive “or”. That is, unless specified otherwise, or clear from context, “X employs A or B” is intended to mean any of the natural inclusive permutations. That is, if X employs A; X employs B; or X employs both A and B, then “X employs A or B” is satisfied under any of the foregoing instances.
  • The articles “a” and “an” as used in this application and the appended claims should generally be construed to mean “one or more” unless specified otherwise or clear from context to be directed to a singular form.
  • The use of figure numbers or figure reference labels in the claims is intended to identify one or more possible embodiments of the claimed subject matter to facilitate the interpretation of the claims. Such use is not to be construed as necessarily limiting the scope of those claims to the embodiments shown in the corresponding figures.

Landscapes

  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Otolaryngology (AREA)
  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

An apparatus including a forward-facing microphone configured to receive a first audio signal, a backward-facing microphone that is adjacent to the forward-facing microphone and configured to receive a second audio signal, and a controller comprising circuitry configured to compute an energy ratio of the first audio signal and the second audio signal and to wake up the apparatus for speech processing when the computed energy ratio satisfies a threshold condition.

Description

HUMAN-MACHINE SPEECH INTERACTION APPARATUS AND METHOD OF OPERATING THE SAME
FIELD
Apparatuses and methods consistent with the present disclosure relate generally to acoustics, and more particularly, to apparatuses that receive sounds from users and respond to the sounds.
BACKGROUND
An operation of a human-machine speech interaction apparatus relies on the response of the apparatus to words uttered by a human user. Conventional human-machine speech interaction apparatuses (e.g., those used in smart speakers) require the user to speak a wake-up word. Such a wake-up-word system, however, puts the burden on the user of always uttering the wake-up word to wake the apparatus before the apparatus will operate and provide a proper response. Because of this requirement, it is difficult for the user to have an experience resembling everyday human-to-human interaction.
Another approach to waking up a human-machine speech interaction apparatus is to use face detection technology, detecting the face of a user standing close to the apparatus. This approach allows the user to wake up the apparatus without uttering a wake-up word. However, this approach suffers from limitations; for example, the apparatus stays in wake-up mode as long as it detects the face of a person, even if the person has no intent to interact with the apparatus.
Another approach to waking up a human-machine speech interaction apparatus is to use an array of microphones, for example an array of eight microphones, to calculate the distance and pan angle of a user interacting with the apparatus. Only speech detected to be in the front near field may be used to wake up the apparatus. However, this approach also suffers from limitations; for example, the apparatus responds in unintended situations, such as when the user turns his/her face away from the apparatus to talk to a friend. In addition, operating the eight-microphone array for one user increases both computational and material cost.
SUMMARY
According to some embodiments of the present disclosure, there is provided an apparatus comprising: a forward-facing microphone configured to receive a first audio signal; a backward-facing microphone which is adjacent to the forward-facing microphone and configured to receive a second audio signal; and a controller comprising circuitry configured to compute an energy ratio of the first audio signal and the second audio signal, and to respond to a user when the computed energy ratio satisfies a threshold condition. In some embodiments, the apparatus may simply wake up for speech processing when the computed energy ratio satisfies the threshold condition, without interacting with the user.
In the apparatus, at least one of the forward-facing microphone and the backward-facing microphone may be a cardioid microphone, an omnidirectional microphone, or any other transducer that converts acoustic energy into an electrical signal. The frontal facet of the backward-facing microphone may be positioned adjacent to the rear facet of the forward-facing microphone.
In the apparatus, the controller may be further configured to: perform Fourier transform on the first and second audio signals, respectively; determine a first speech signal power of the first audio signal in each of a plurality of frequency bins and a second speech signal power of the second audio signal in each of the plurality of frequency bins, respectively; accumulate the first and second speech signal powers over time, respectively; perform frequency weighting on the first and second speech signal powers, respectively; and calculate a total audio energy of the first audio signal and a total audio energy of the second audio signal by adding the weighted first and second speech signal powers across the plurality of frequency bins, respectively.
In the apparatus, the forward-facing microphone may comprise a plurality of forward-facing cardioid microphones; and the backward-facing microphone may comprise a plurality of backward-facing cardioid microphones alternately arranged with the plurality of forward-facing cardioid microphones in a horizontal direction or in a vertical direction. The plurality of forward-facing cardioid microphones and the plurality of backward-facing cardioid microphones may be alternately arranged to form a matrix array.
The apparatus may further comprise a display configured to respond to the user by displaying a message. The apparatus may further comprise a slot configured to dispense an item purchased by the user.
According to some embodiments of the present disclosure, there is provided a method of operating an apparatus. The method includes: obtaining a first audio signal by a forward-facing microphone of the apparatus and a second audio signal by a backward-facing microphone of the apparatus; computing an energy ratio of the first audio signal and the second audio signal; and responding to a user when the computed energy ratio satisfies a threshold condition. An example of satisfying the threshold condition is that the computed energy ratio is greater than a predetermined threshold value. In some embodiments, the responding to a user may simply be waking up the apparatus for speech processing.
The method may further comprise: performing Fourier transform on the first audio signal and the second audio signal, respectively; determining a first speech signal power of the first audio signal in each of a plurality of frequency bins and a second speech signal power of the second audio signal in each of the plurality of frequency bins, respectively; accumulating the first speech signal power and the second speech signal power over time, respectively; performing frequency weighting on the first speech signal power and the second speech signal power, respectively; and adding the weighted first speech signal power and the weighted second speech signal power across the plurality of frequency bins to obtain first and second audio energies, respectively.
According to some embodiments of the present disclosure, there is provided a method of operating an apparatus. The method includes: receiving a first audio signal by a forward-facing microphone of the apparatus and a second audio signal by a backward-facing microphone of the apparatus; performing Fourier transform on the first and second audio signals; determining a first speech power of the first audio signal in each of a plurality of frequency bins and a second speech power of the second audio signal in each of the plurality of frequency bins; comparing the first and second speech powers in each of the plurality of frequency bins and determining a dominant microphone in each of the plurality of frequency bins; counting a first number of dominant frequency bins of the first audio signal and a second number of dominant frequency bins of the second audio signal; and comparing the first number to the second number and operating the apparatus to respond to a user when the first number is significantly greater than the second number. In some embodiments, the responding to a user may simply be waking up the apparatus for speech processing.
The subject matter below is taught by way of various specific exemplary embodiments explained in detail, and illustrated in the enclosed drawing figures.
BRIEF DESCRIPTION OF FIGURES
FIG. 1 is a schematic diagram illustrating an exemplary human-machine speech interaction apparatus and a user interacting with the human-machine speech interaction apparatus, consistent with some embodiments of the present disclosure.
FIG. 2 is a top plan view of FIG. 1, consistent with some embodiments of the present disclosure.
FIG. 3 is a top plan view of an exemplary human-machine speech interaction apparatus and a user interacting with the human-machine speech interaction apparatus, showing the polar responses of the cardioid microphones, consistent with some embodiments of the present disclosure.
FIG. 4 illustrates an example of energy ratio (ER) of the combination of cardioid microphones in FIG. 3, consistent with some embodiments of the present disclosure.
FIGS. 5, 6, 7, and 8 illustrate different orientations of a user with respect to an exemplary human-machine speech interaction apparatus, consistent with some embodiments of the present disclosure.
FIG. 9 shows energy ratio ER measured at different distances L (the distance between a user and the front panel of a human-machine speech interaction apparatus) and angles θ (the angle between the user’s sound direction and the perpendicular to the front panel of the human-machine speech interaction apparatus), consistent with some embodiments of the present disclosure.
FIG. 10 illustrates an exemplary horizontal array of cardioid microphones of a human-machine speech interaction apparatus, consistent with some embodiments of the present disclosure.
FIG. 11 illustrates an exemplary vertical array of cardioid microphones of a human-machine speech interaction apparatus, consistent with some embodiments of the present disclosure.
FIG. 12 illustrates an exemplary matrix array of cardioid microphones of a human-machine speech interaction apparatus, consistent with some embodiments of the present disclosure.
FIG. 13 is a flowchart indicating an exemplary method of operating a human-machine speech interaction apparatus, consistent with some embodiments of the present disclosure.
FIG. 14 is a flowchart indicating another exemplary method of operating a human-machine speech interaction apparatus, consistent with some embodiments of the present disclosure.
FIG. 15A illustrates an exemplary pressure-time diagram for a forward-facing cardioid microphone.
FIG. 15B illustrates an exemplary pressure-time diagram for a backward-facing cardioid microphone, consistent with some embodiments of the present disclosure.
FIG. 16A illustrates a prototype of an exemplary human-machine speech interaction apparatus having a circuit board and a microphone system, according to some embodiments of the present disclosure.
FIG. 16B illustrates a circuit diagram in the exemplary human-machine speech interaction apparatus of FIG. 16A, consistent with some embodiments of the present disclosure.
FIG. 17 illustrates an exemplary circuit diagram in a human-machine speech interaction apparatus, consistent with some embodiments of the present disclosure.
DETAILED DESCRIPTION
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. The following description refers to the accompanying drawings, in which the same numbers in different drawings represent the same or similar elements unless otherwise represented. The implementations set forth in the following description of exemplary embodiments do not represent all implementations consistent with the invention. Instead, they are merely examples of apparatuses and methods consistent with aspects related to the invention as recited in the appended claims. For example, although some embodiments are described in the context of utilizing cardioid microphones, the disclosure is not so limited. Other types of microphones can be similarly applied. Furthermore, other transducers that convert acoustic energy into an electrical signal can be used.
References are now made to FIG. 1, a schematic diagram illustrating an exemplary human-machine speech interaction apparatus and a user interacting with the human-machine speech interaction apparatus, consistent with exemplary embodiments of the present disclosure, while FIG. 2 illustrates a top plan view of FIG. 1. In some embodiments, the human-machine speech interaction apparatus may simply receive the user’s speech and process the speech accordingly, without interacting with the user. As shown in FIG. 1 and FIG. 2, a user 180 stands in front of a human-machine speech interaction apparatus 100. A distance L between the face of the user and the front panel of human-machine speech interaction apparatus 100 may be in a range of, for example, 0.5 m to 3 m, but the distance is not so limited. Distance L may be set to any range by adjusting the circuitry and the sensitivity of human-machine speech interaction apparatus 100. User 180 provides sound waves 190 to human-machine speech interaction apparatus 100 by speaking to the apparatus. Sound waves 190 propagate toward a left opening 120 and a right opening 130 of the front panel of human-machine speech interaction apparatus 100. A separation distance between the center of left opening 120 and the center of right opening 130 may be in a range of, for example, 3 mm to 10 mm, but the separation distance is not so limited.
Openings 120 and 130 accommodate microphone system 110, which is installed on the front panel of human-machine speech interaction apparatus 100. Microphone system 110 includes a forward-facing unidirectional microphone having two sound receiving portions: portion 140 facing user 180 and rear part 200. Microphone system 110 further includes a backward-facing unidirectional microphone having two sound receiving portions 150 and 210, of which sound receiving portion 210 faces user 180. It is noted that openings 120 and 130 should be made big enough that the sound waves can enter sound receiving portions 200 and 210. Rear parts 200 and 210 of microphone system 110 include electrical circuits and are connected to a controller 250 for processing the audio signals received by microphone system 110.
In some embodiments of the present disclosure, human-machine speech interaction apparatus 100 further includes a display 160. Display 160 may be a liquid crystal display, a light emitting diode array display, an organic light emitting diode display, a plasma display, a cathode ray tube display, a holographic display, a laser plasma display, or any combination thereof. Human-machine speech interaction apparatus 100 may further include a slot 170 configured to dispense items purchased by user 180, for example, a train ticket ordered by user 180 by speaking to human-machine speech interaction apparatus 100.
Display 160 functions to provide an instruction to user 180 for using human-machine speech interaction apparatus 100. The instruction may be a message displayed on display 160. For example, the message may be words telling user 180 to stand within a yellow line painted on the floor, or words instructing user 180 to speak to human-machine speech interaction apparatus 100 directly without allowing any obstacle between the mouth of user 180 and human-machine speech interaction apparatus 100. The message may also include, for example, “your voice message is received, we are processing your message,” or “your ordered ticket is ready, please pick it up from the slot.”
References are now made to FIG. 3, a top plan view of a human-machine speech interaction apparatus and a user interacting with the human-machine speech interaction apparatus, showing the polar responses of cardioid microphones, consistent with exemplary embodiments of the present disclosure. A unidirectional microphone may include a cardioid microphone, a subcardioid microphone, a supercardioid microphone, and a hypercardioid microphone. FIG. 3 shows that the microphones of microphone system 110 are cardioid microphones that have heart-shaped sound pickup patterns. The forward-facing microphone with a frontal sound receiving part 140 facing a user 180 has a pattern including a front portion 220 and a rear portion 230, and the backward-facing microphone with a rear sound-receiving part 210 facing user 180 has a pattern including a front portion 240 and a rear portion 250. When user 180 speaks and transmits sound waves 190 toward microphone system 110, controller 250 computes an overall energy ratio ER, which may take the form of a peak along the time axis as illustrated in FIG. 4. This is because both microphones receive the same background noise (thus energy ratio ER is about 1.0) when the speech is not active.
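The qualitative behavior just described can be illustrated numerically with ideal first-order cardioid patterns. The sketch below is not part of the original disclosure; the noise-floor constant is a hypothetical stand-in for the diffuse background noise that both microphones pick up equally, which is what holds the energy ratio near 1.0 when no frontal speech is active.

```python
import numpy as np

def cardioid_gain(theta):
    """Amplitude response of an ideal forward-facing cardioid;
    theta is the source angle in radians (0 = directly in front)."""
    return 0.5 * (1.0 + np.cos(theta))

def ideal_energy_ratio(theta, noise_floor=0.05):
    """Power ratio of the forward- and backward-facing cardioids for a
    source at angle theta. The backward capsule sees the same source at
    (pi - theta). noise_floor models diffuse noise common to both
    capsules; it keeps the ratio finite and near 1.0 for weak frontal
    speech. The 0.05 value is illustrative only."""
    front = cardioid_gain(theta) ** 2 + noise_floor
    back = cardioid_gain(np.pi - theta) ** 2 + noise_floor
    return front / back

for deg in (0, 45, 90, 180):
    print(f"theta = {deg:3d} deg -> ER = {ideal_energy_ratio(np.radians(deg)):.2f}")
```

As expected, the ratio is largest for a talker directly in front (theta = 0) and falls to about 1.0 for sound arriving from the side.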
In some embodiments, the microphones of microphone system 110 may be an end-fire array comprising two omnidirectional microphones (not shown). The two omnidirectional microphones may be modified by circuitry and an appropriate digital signal processing algorithm (not shown) to form two virtual cardioid microphones, one facing forward and the other facing backward. Furthermore, the microphones of microphone system 110 may be other types of transducers (not shown) that convert acoustic energy into an electrical signal.
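A minimal sketch of such delay-and-subtract (differential end-fire) processing follows. The capsule spacing and sample rate are assumed values, not taken from the disclosure, and a practical implementation would also equalize the high-frequency tilt that differential processing introduces.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s
SPACING = 0.0085        # assumed capsule spacing in m (consistent with the 3 mm to 10 mm opening separation above)
FS = 16000              # assumed sample rate in Hz

def fractional_delay(x, delay):
    """Delay signal x by a (possibly fractional) number of samples
    using linear interpolation; adequate for sub-sample delays."""
    n = np.arange(len(x))
    return np.interp(n - delay, n, x, left=0.0)

def virtual_cardioids(x_front, x_back):
    """Subtract the opposite capsule's signal, delayed by the acoustic
    travel time between capsules: this places a null at the rear
    (forward-facing cardioid) or at the front (backward-facing cardioid)."""
    tau = SPACING / SPEED_OF_SOUND * FS  # inter-capsule delay in samples
    forward = x_front - fractional_delay(x_back, tau)
    backward = x_back - fractional_delay(x_front, tau)
    return forward, backward
```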
References are now made to FIGS. 5-8 illustrating different orientations of a user with respect to a human-machine speech interaction apparatus, consistent with exemplary embodiments of the present disclosure. Considering the sensitivity of microphone system 110 at different orientations of user 180, FIGS. 5-8 show top plan views indicating situations in which user 180 speaks in different directions in front of human-machine speech interaction apparatus 100. FIG. 5 shows a situation in which user 180 speaks in a direction having an arbitrary angle θ with the perpendicular to the front panel of human-machine speech interaction apparatus 100. FIG. 6 shows a situation in which user 180 speaks in a direction parallel to the perpendicular to the front panel of human-machine speech interaction apparatus 100, that is, θ = 0°. FIG. 7 shows a situation in which user 180 speaks at a 90° angle to the perpendicular to the front panel of human-machine speech interaction apparatus 100, that is, θ = 90°. FIG. 8 shows a situation in which sound is provided by a person standing near user 180 and speaking in a direction that does not intersect the microphone system of human-machine speech interaction apparatus 100.
References are now made to FIG. 9 showing energy ratio ER values measured at different distances L (the distance between a user and the front panel of a human-machine speech interaction apparatus) and angles θ (the angle between the user’s sound direction and the perpendicular to the front panel of the human-machine speech interaction apparatus), consistent with some embodiments of the present disclosure. As shown in the table of FIG. 9, when the separation distance L between user 180 and human-machine speech interaction apparatus 100 is 0.5 m, and angle θ between the direction of sound waves provided by user 180 and the perpendicular to the front panel of human-machine speech interaction apparatus 100 is 0° (e.g., FIG. 6), the measured ER value is 10.91. When distance L is increased to 2 m, the measured ER value drops to 3.63. When distance L is further increased to 3 m, the measured ER value further decreases to 2.23. At a distance L of 0.5 m and an angle θ of 90° (e.g., FIG. 7), the measured ER value is 4.01. When distance L is increased to 2.0 m, the measured ER value is dramatically reduced to 1.89. When distance L is further increased to 3.0 m, the measured ER value is further reduced to 1.77.
When distance L is 0.5 m but the sound is provided from a side of user 180 (e.g., FIG. 8), the measured ER value is 1.07, similar to the case of the background signal. Therefore, as long as the people around user 180 speak sideways, microphone system 110 can recognize their sound waves as background speech, whose ER is much lower than the 10.91 ER value of user 180 facing the front panel of human-machine speech interaction apparatus 100. In this way, human-machine speech interaction apparatus 100 of the present disclosure responds only to the voice of user 180 and not to the people around user 180 talking sideways. In some embodiments, responding to the voice of the user may simply be waking up the human-machine speech interaction apparatus for speech processing, without interacting with the user.
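The figures above also suggest how a wake-up threshold might be chosen in practice. The short sketch below restates FIG. 9's frontal measurements and applies a threshold of 2.0; that threshold is purely a hypothetical choice sitting between the sideways value (1.07) and the weakest frontal value (2.23), not a value given in the disclosure.

```python
# Measured ER values from FIG. 9 for frontal speech (theta = 0)
frontal_er = {0.5: 10.91, 2.0: 3.63, 3.0: 2.23}  # distance L in m -> ER
sideways_er = 1.07                               # bystander at 0.5 m (FIG. 8)

WAKE_THRESHOLD = 2.0  # hypothetical value, not specified in the disclosure

for distance, er in frontal_er.items():
    print(f"L = {distance} m: ER = {er:5.2f} -> wake = {er > WAKE_THRESHOLD}")
print(f"sideways talker: ER = {sideways_er:5.2f} -> wake = {sideways_er > WAKE_THRESHOLD}")
```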
References are now made to FIG. 10 illustrating a horizontal array of cardioid microphones of a human-machine speech interaction apparatus, consistent with some exemplary embodiments of the present disclosure. As shown in FIG. 10, a linear array of microphone system 110 (C) has each of the backward-facing microphones, with the rear sound receiving portion 210 facing a user 180, alternately arranged with each of the forward-facing microphones, with the frontal sound receiving portion 140 facing user 180. In this way, user 180 interacting with human-machine speech interaction apparatus 100 is not required to stand in front of a particular area (e.g., the central area) of the front panel of human-machine speech interaction apparatus 100. User 180 can stand at any position in front of the front panel of human-machine speech interaction apparatus 100. The energy ratio is computed by finding a maximum energy ratio ER among all of the microphone pairs. In some exemplary embodiments of the present disclosure, the linear array of microphone system 110 (C) may cover the entire horizontal width or a portion of the horizontal width of the front panel of human-machine speech interaction apparatus 100.
References are now made to FIG. 11 illustrating a vertical array of cardioid microphones of a human-machine speech interaction apparatus, consistent with some exemplary embodiments of the present disclosure. As shown in FIG. 11, a vertical linear array of microphone system 110 (D) has each of the backward-facing microphones, with rear portion 210 facing a user 180, vertically and alternately arranged with each of the forward-facing microphones, with the frontal sound receiving portion 140 facing user 180. In this way, user 180 interacting with human-machine speech interaction apparatus 100 is not limited to a specific height. User 180 can be anyone from a child with a height of less than 1 m to an adult with a height of 2 m. In some embodiments of the present disclosure, the vertical linear array of microphone system 110 (D) may cover the entire height or a portion of the height of the front panel of human-machine speech interaction apparatus 100. Energy ratio ER is computed, for example, by finding a maximum ER among all of the microphone pairs.
References are now made to FIG. 12 illustrating a matrix array of cardioid microphones of a human-machine speech interaction apparatus, consistent with some exemplary embodiments of the present disclosure. The exemplary matrix as shown in FIG. 12 extends the horizontal linear array in FIG. 10 and the vertical linear array in FIG. 11 into a matrix of microphone system 110 (F) . In some embodiments of the present disclosure, the matrix of the microphone system 110 (F) may cover the entire front panel or a portion of the front panel of human-machine speech interaction apparatus 100 such that users having different heights can stand in any position in front of the matrix array. Energy ratio ER is computed, for example, by finding a maximum ER among all of the microphone pairs.
References are now made to FIG. 13 showing a flowchart indicating a method of operating a human-machine speech interaction apparatus, consistent with some exemplary embodiments of the present disclosure. In FIG. 13, steps S1501 to S1505 describe the steps of operating a forward-facing microphone. In step S1501, an audio frame is received by the forward-facing microphone. In step S1502, a short time Fourier transform is performed on the received audio frame. In step S1503, speech signal power is estimated in each frequency bin. In step S1504, the signal power is accumulated over a period of time and is frequency-weighted. In step S1505, an audio energy is obtained, for example, by summation of the frequency-weighted signal power across all frequency bins. Steps S1506 to S1510 describe the steps of operating a backward-facing microphone. In step S1506, an audio frame is received by the backward-facing microphone. In step S1507, a short time Fourier transform is performed on the received audio frame. In step S1508, speech signal power is estimated in each frequency bin. In step S1509, the signal power is accumulated over time and is frequency-weighted. In step S1510, an audio energy is obtained, for example, by summation of the frequency-weighted signal power across all frequency bins. In step S1511, a controller comprising circuitry computes an energy ratio ER using the obtained audio energy from the forward-facing microphone (from step S1505) and the obtained audio energy from the backward-facing microphone (from step S1510). In step S1512, the controller determines whether the energy ratio ER (from step S1511) satisfies a threshold condition, in this case, whether the energy ratio ER is greater than a predetermined threshold value. If the energy ratio ER satisfies the threshold condition (e.g., the energy ratio ER is greater than the predetermined threshold value), an automatic speech recognition (ASR) of the human-machine speech interaction apparatus is woken up and operated in step S1514. If the energy ratio ER does not satisfy the threshold condition (e.g., the energy ratio ER is not greater than the predetermined threshold value), the ASR of the human-machine speech interaction apparatus is not woken up and not operated in step S1513.
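The following Python sketch traces steps S1501 through S1514 for a batch of audio frames. The frame length, the speech-band weighting, and the wake-up threshold are assumptions made for illustration; the disclosure specifies the pipeline (STFT, per-bin power estimation, accumulation, frequency weighting, summation, ratio, threshold) but not these parameter values.

```python
import numpy as np

FRAME_LEN = 512  # samples per audio frame (assumed)
FS = 16000       # sampling rate in Hz (assumed)

def speech_band_weights(n_bins, fs=FS, lo=300.0, hi=3400.0):
    """Frequency weighting (steps S1504/S1509): emphasize a nominal
    speech band; the band edges are an illustrative choice."""
    freqs = np.fft.rfftfreq(2 * (n_bins - 1), d=1.0 / fs)
    return ((freqs >= lo) & (freqs <= hi)).astype(float)

def audio_energy(frames):
    """Steps S1501-S1505 (or S1506-S1510) for one microphone: STFT each
    frame, estimate signal power in each frequency bin, accumulate over
    the period, frequency-weight, and sum across all bins."""
    frames = np.atleast_2d(frames)                  # (n_frames, FRAME_LEN)
    window = np.hanning(frames.shape[1])
    spectra = np.fft.rfft(frames * window, axis=1)  # short time Fourier transform
    power = np.abs(spectra) ** 2                    # power per frequency bin
    accumulated = power.sum(axis=0)                 # accumulate over time
    weights = speech_band_weights(accumulated.size)
    return float((weights * accumulated).sum())     # sum across frequency bins

def wake_on_energy_ratio(front_frames, back_frames, threshold=4.0):
    """Steps S1511-S1514: compute ER from the two audio energies and wake
    the ASR when ER exceeds the (assumed) threshold."""
    er = audio_energy(front_frames) / max(audio_energy(back_frames), 1e-12)
    return er > threshold
```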
References are now made to FIG. 14 showing a flowchart indicating another method of operating a human-machine speech interaction apparatus, consistent with some exemplary embodiments of the present disclosure. In FIG. 14, steps S1601 to S1603 describe the steps of operating a forward-facing microphone. In step S1601, an audio frame is received by the forward-facing microphone. In step S1602, a short time Fourier transform is performed on the received audio frame. In step S1603, speech power estimation is carried out at each frequency bin of a plurality of frequency bins. Steps S1605 to S1607 describe the steps of operating a backward-facing microphone. In step S1605, an audio frame is received by the backward-facing microphone. In step S1606, a short time Fourier transform is performed on the received audio frame. In step S1607, speech power estimation is carried out at each frequency bin of the plurality of frequency bins. In step S1608, the two signals are compared to determine which microphone has the dominant speech power at each frequency bin of the plurality of frequency bins. In step S1609, a controller comprising circuitry counts the number N of dominant bins for the forward-facing microphone (from step S1608) and the number M of dominant bins for the backward-facing microphone (from step S1608), and compares N and M. In step S1610, based on the comparison of N and M, the controller determines whether a threshold condition is satisfied. For example, the threshold condition may be satisfied when N is significantly greater than M, i.e., N >> M. If N is significantly greater than M, an automatic speech recognition (ASR) of the human-machine speech interaction apparatus is woken up and operated in step S1612. If N is not significantly greater than M, the ASR of the human-machine speech interaction apparatus is not woken up and not operated in step S1614.
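A corresponding sketch of steps S1601 through S1612 is given below. The disclosure requires only that N be significantly greater than M (N >> M); the margin factor used here to operationalize "significantly" is an assumption.

```python
import numpy as np

def dominant_bin_counts(front_frame, back_frame):
    """Steps S1601-S1608: STFT each microphone's audio frame, estimate
    the speech power in each frequency bin, and decide per bin which
    microphone is dominant."""
    window = np.hanning(front_frame.size)
    p_front = np.abs(np.fft.rfft(front_frame * window)) ** 2
    p_back = np.abs(np.fft.rfft(back_frame * window)) ** 2
    n = int(np.sum(p_front > p_back))   # bins where the forward mic dominates
    m = int(np.sum(p_back >= p_front))  # bins where the backward mic dominates
    return n, m

def wake_on_dominant_bins(front_frame, back_frame, margin=3.0):
    """Steps S1609-S1612: wake the ASR when N is significantly greater
    than M, read here as N > margin * M (the margin is an assumption)."""
    n, m = dominant_bin_counts(front_frame, back_frame)
    return n > margin * max(m, 1)
```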
References are now made to FIG. 15A illustrating a time domain signal for a forward-facing cardioid microphone, and FIG. 15B illustrating a time domain signal for a backward-facing cardioid microphone, when a talker speaks in front of the forward-facing microphone, consistent with some exemplary embodiments of the present disclosure. It can be seen that the speech signal power received by the forward-facing microphone is noticeably greater than that received by the backward-facing microphone.
References are now made to FIG. 16A illustrating a prototype of a human-machine speech interaction apparatus with a circuit board 1900 and a microphone system including a forward-facing microphone 1710 and a backward-facing microphone 1720. FIG. 16B illustrates an exemplary circuit diagram of a human-machine speech interaction apparatus. A user 180 transmits sound waves 190 propagating toward the forward-facing microphone 1710 and the backward-facing microphone 1720. The forward-facing microphone 1710 turns the sound waves into an electrical signal that passes through an amplifier 1730, a variable resistor 1750, and a filter 1770, and then reaches a processor 1790 and a controller 1800. The backward-facing microphone 1720 turns the sound waves into an electrical signal that passes through an amplifier 1740, a variable resistor 1760, and a filter 1780, and then reaches processor 1790 and controller 1800.
References are now made to FIG. 17 illustrating a circuit diagram in a human-machine speech interaction apparatus, consistent with some embodiments of the present disclosure. User 180 transmits sound waves 190 propagating toward a front-facing microphone 1710 and a backward-facing microphone 1720. The front-facing microphone 1710 turns the sound waves into an electrical signal transmitted through an amplifier 1730, a variable resistor 1750, a filter 1770, a processor 1790, and a controller 1800, and then to a user interface 1810 that adjusts the performance of the human-machine speech interaction apparatus based on the response of the apparatus toward user 180. The backward-facing microphone 1720 similarly turns the sound waves into an electrical signal transmitted through an amplifier 1740, a variable resistor 1760, and a filter 1780 to processor 1790, controller 1800, and user interface 1810.
Any combination of one or more computer readable medium(s) may be utilized. The computer readable medium may be a non-transitory computer readable storage medium. Common forms of non-transitory computer readable media include, for example, a floppy disk, a flexible disk, a hard disk, a solid state drive, magnetic tape, or any other magnetic data storage medium, a CD-ROM, any other optical data storage medium, any physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM or any other flash memory, NVRAM, a cache, a register, any other memory chip or cartridge, and networked versions of the same. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM, EEPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, IR, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for example embodiments may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and  partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN) , or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider) .
Example embodiments are described below with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer readable medium that can direct a hardware processor core of a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium form an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a  computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the Figures illustrate examples of the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function (s) . It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
It is understood that the described embodiments are not mutually exclusive, and elements, components, materials, or steps described in connection with one example embodiment may be combined with, or eliminated from, other embodiments in suitable ways to accomplish desired design objectives.
Reference herein to “some embodiments” or “some exemplary embodiments” means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment. The appearances of the phrases “one embodiment,” “some embodiments,” or “some exemplary embodiments” in various places in the specification do not necessarily all refer to the same embodiment, nor are separate or alternative embodiments necessarily mutually exclusive of other embodiments.
It should be understood that the steps of the example methods set forth herein are not necessarily required to be performed in the order described, and the order of the steps of such methods should be understood to be merely exemplary. Likewise, additional steps may be included in such methods, and certain steps may be omitted or combined, in methods consistent with various embodiments.
As used in this application, the word “exemplary” means serving as an example, instance, or illustration. Any aspect or design described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other aspects or designs. Rather, use of the word is intended to present concepts in a concrete fashion.
Additionally, the term “or” is intended to mean an inclusive “or” rather than an exclusive “or” . That is, unless specified otherwise, or clear from context, “X employs A or B” is intended to mean any of the natural inclusive permutations. That is, if X employs A; X employs B; or X employs both A and B, then “X employs A or B” is satisfied under any of the foregoing instances. In addition, the articles “a” and “an” as used in this application and the appended claims should generally be construed to mean “one or more” unless specified otherwise or clear from context to be directed to a singular form.
Unless explicitly stated otherwise, each numerical value and range should be interpreted as being approximate, as if the word “about” or “approximately” preceded the value or range.
The use of figure numbers or figure reference labels in the claims is intended to identify one or more possible embodiments of the claimed subject matter to facilitate the interpretation of the claims. Such use is not to be construed as necessarily limiting the scope of those claims to the embodiments shown in the corresponding figures.
Although the elements in the following method claims, if any, are recited in a particular sequence with corresponding labeling, unless the claim recitations otherwise imply a particular sequence for implementing some or all of those elements, those elements are not necessarily intended to be limited to being implemented in that particular sequence.
It will be further understood that various changes in the details, materials, and arrangements of the parts which have been described and illustrated in order to explain the nature of described embodiments may be made by those skilled in the art without departing from the scope as expressed in the following claims.

Claims (34)

  1. An apparatus, comprising:
    a forward-facing microphone configured to receive a first audio signal;
    a backward-facing microphone that is adjacent to the forward-facing microphone and configured to receive a second audio signal; and
    a controller comprising circuitry configured to compute an energy ratio of the first audio signal and the second audio signal, and to wake up for a speech processing when the computed energy ratio satisfies a threshold condition.
  2. The apparatus of claim 1, wherein the satisfying a threshold condition comprises that the computed energy ratio is greater than a predetermined threshold value.
  3. The apparatus of any one of claims 1-2, wherein the forward-facing microphone and the backward-facing microphone are cardioid microphones.
  4. The apparatus of any one of claims 1-2, wherein at least one of the forward-facing microphone and the backward-facing microphone is an omnidirectional microphone.
  5. The apparatus of any one of claims 1-3, wherein the front of the backward-facing microphone is adjacent to the rear of the forward-facing microphone.
  6. The apparatus of any one of claims 1-5, wherein the controller is further configured to perform Fourier transform on the first and second audio signals, respectively.
  7. The apparatus of claim 6, wherein the controller is further configured to determine a first speech signal power of the first audio signal in each of a plurality of frequency bins and a second speech signal power of the second audio signal in each of the plurality of frequency bins, respectively.
  8. The apparatus of claim 7, wherein the controller is further configured to perform accumulation of the first and second speech signal powers, respectively.
  9. The apparatus of claim 8, wherein the controller is further configured to perform frequency weighting on the first and second speech signal powers, respectively.
  10. The apparatus of claim 9, wherein the controller is further configured to determine a total audio energy of the first audio signal and a total audio energy of the second audio signal by adding the weighted first and second speech signal powers across the plurality of frequency bins, respectively.
  11. The apparatus of claim 3, wherein:
    the forward-facing cardioid microphone comprises a plurality of forward-facing cardioid microphones; and
    the backward-facing cardioid microphone comprises a plurality of backward-facing cardioid microphones alternately arranged with the plurality of forward-facing cardioid microphones in a horizontal direction.
  12. The apparatus of claim 3, wherein:
    the forward-facing cardioid microphone comprises a plurality of forward-facing cardioid microphones; and
    the backward-facing cardioid microphone comprises a plurality of backward-facing cardioid microphones alternately arranged with the plurality of forward-facing cardioid microphones in a vertical direction.
  13. The apparatus of claim 3, wherein:
    the forward-facing cardioid microphone comprises a plurality of forward-facing cardioid microphones; and
    the backward-facing cardioid microphone comprises a plurality of backward-facing cardioid microphones alternately arranged with the plurality of forward-facing cardioid microphones to form a matrix array.
  14. The apparatus of any one of claims 1-13, further comprising a display configured to respond to the user by displaying a message.
  15. The apparatus of any one of claims 1-13, further comprising a slot configured to dispense an item purchased by the user.
  16. An apparatus, comprising:
    a forward-facing microphone configured to receive a first audio signal;
    a backward-facing microphone which is adjacent to the forward-facing microphone and configured to receive a second audio signal; and
    a controller comprising circuitry configured to:
    determine a first speech signal power of the first audio signal in each of a plurality of frequency bins and a second speech signal power of the second audio signal in each of the plurality of frequency bins;
    compare the first and second speech signal powers in each frequency bin and determine a dominant microphone in each of the plurality of frequency bins;
    determine a first number of dominant frequency bins in the forward-facing microphone and a second number of dominant frequency bins in the backward-facing microphone; and
    compare the first number and the second number, and wake up for a speech processing when the comparison of the first and second numbers of dominant frequency bins satisfies a threshold condition.
  17. The apparatus of claim 16, wherein the controller comprising circuitry is further configured to perform Fourier transform on the first and second audio signals.
  18. The apparatus of any one of claims 16-17, wherein the satisfying the threshold condition comprises that a computed difference in the first and second numbers of dominant frequency bins is greater than a predetermined threshold value.
  19. The apparatus of any one of claims 16-18, wherein at least one of the forward-facing microphone and the backward-facing microphone is a cardioid microphone.
  20. The apparatus of any one of claims 16-18, wherein at least one of the forward-facing microphone and the backward-facing microphone is an omnidirectional microphone.
  21. The apparatus of any one of claims 16-19, wherein the front of the backward-facing microphone is adjacent to the rear of the forward-facing microphone.
  22. A method of operating an apparatus, comprising:
    obtaining a first audio signal by a forward-facing microphone of the apparatus and a second audio signal by a backward-facing microphone of the apparatus;
    computing an energy ratio of the first audio signal and the second audio signal; and
    waking up for a speech processing when the computed energy ratio satisfies a threshold condition.
  23. The method of claim 22, wherein the satisfying a threshold condition comprises that the computed energy ratio is greater than a predetermined threshold value.
  24. The method of any one of claims 22-23, wherein at least one of the forward-facing microphone and the backward-facing microphone is a cardioid microphone.
  25. The method of any one of claims 22-24, further comprising:
    performing Fourier transform on the first audio signal and the second audio signal, respectively;
    determining a first speech signal power of the first audio signal in each of a plurality of frequency bins and a second speech signal power of the second audio signal in each of the plurality of frequency bins, respectively;
    accumulating the first speech signal power and the second speech signal power over time, respectively;
    performing frequency weighting on the first and second speech signal powers, respectively; and
    adding the weighted first speech signal power and the weighted second speech signal power across the plurality of frequency bins to obtain first and second audio energies, respectively.
  26. A method of operating an apparatus, comprising:
    receiving a first audio signal by a forward-facing microphone of the apparatus and a second audio signal by a backward-facing microphone of the apparatus;
    determining a first number of dominant frequency bins of the first audio signal and a second number of dominant frequency bins of the second audio signal;
    comparing the first number of dominant frequency bins of the first audio signal and the second number of dominant frequency bins of the second audio signal; and
    waking up for a speech processing when the comparison of the first and second numbers of dominant frequency bins satisfies a threshold condition.
  27. The method of claim 26, further comprising:
    performing Fourier transform on the first and second audio signals;
    determining a first speech power of the first audio signal in each of a plurality of frequency bins and a second speech power of the second audio signal in each of the plurality of frequency bins; and
    comparing the first and second speech powers in each of the plurality of frequency bins and determining a dominant microphone in each of the plurality of frequency bins.
  28. The method of any one of claims 26-27, wherein the satisfying a threshold condition comprises that a computed difference in the first and second numbers of dominant frequency bins is greater than a predetermined threshold value.
  29. The method of any one of claims 26-28, wherein at least one of the forward-facing microphone and the backward-facing microphone is a cardioid microphone.
  30. The method of any one of claims 26-29, wherein the waking up for a speech processing further comprises displaying a message to a user when the computed difference in the numbers of dominant frequency bins is greater than a predetermined threshold value.
  31. The method of any one of claims 26-30, wherein the waking up for a speech processing further comprises outputting an item to a user when the computed difference in the numbers of dominant frequency bins is greater than a predetermined threshold value.
  32. A non-transitory computer readable storage medium storing a program, the program causing a computer to perform:
    obtaining a first audio signal by a forward-facing microphone of an apparatus and a second audio signal by a backward-facing microphone of the apparatus;
    computing an energy ratio of the first audio signal and the second audio signal; and
    waking up for a speech processing when the computed energy ratio satisfies a threshold condition.
  33. A non-transitory computer readable storage medium storing a program, the program causing a computer to perform:
    receiving a first audio signal by a forward-facing microphone of an apparatus and a second audio signal by a backward-facing microphone of the apparatus;
    determining a first number of dominant frequency bins of the first audio signal and a second number of dominant frequency bins of the second audio signal;
    computing a difference between the first and second numbers of dominant frequency bins; and
    waking up for a speech processing when the computed difference in the numbers of dominant frequency bins satisfies a threshold condition.
  34. The non-transitory computer readable storage medium storing a program of claim 33, wherein the program further causes the computer to perform:
    performing Fourier transform on the first and second audio signals;
    determining a first speech power of the first audio signal in each of a plurality of frequency bins and a second speech power of the second audio signal in each of the plurality of frequency bins; and
    comparing the first and second speech powers in each of the plurality of frequency bins and determining a dominant microphone in each of the plurality of frequency bins.
PCT/CN2018/105518 2018-09-13 2018-09-13 Human-machine speech interaction apparatus and method of operating the same WO2020051841A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN201880096234.7A CN112654960A (en) 2018-09-13 2018-09-13 Man-machine voice interaction device and operation method thereof
JP2021510940A JP2021536692A (en) 2018-09-13 2018-09-13 Human machine voice dialogue device and its operation method
PCT/CN2018/105518 WO2020051841A1 (en) 2018-09-13 2018-09-13 Human-machine speech interaction apparatus and method of operating the same

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/105518 WO2020051841A1 (en) 2018-09-13 2018-09-13 Human-machine speech interaction apparatus and method of operating the same

Publications (1)

Publication Number Publication Date
WO2020051841A1 true WO2020051841A1 (en) 2020-03-19

Family

ID=69776905

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/105518 WO2020051841A1 (en) 2018-09-13 2018-09-13 Human-machine speech interaction apparatus and method of operating the same

Country Status (3)

Country Link
JP (1) JP2021536692A (en)
CN (1) CN112654960A (en)
WO (1) WO2020051841A1 (en)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2321978A4 (en) * 2008-08-29 2013-01-23 Dev Audio Pty Ltd A microphone array system and method for sound acquisition
US10127919B2 (en) * 2014-11-12 2018-11-13 Cirrus Logic, Inc. Determining noise and sound power level differences between primary and reference channels
KR102444061B1 (en) * 2015-11-02 2022-09-16 삼성전자주식회사 Electronic device and method for recognizing voice of speech
CN106405499A (en) * 2016-09-08 2017-02-15 南京阿凡达机器人科技有限公司 Method for robot to position sound source
CN107577449B (en) * 2017-09-04 2023-06-23 百度在线网络技术(北京)有限公司 Wake-up voice pickup method, device, equipment and storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110058683A1 (en) * 2009-09-04 2011-03-10 Glenn Kosteva Method & apparatus for selecting a microphone in a microphone array
CN102254563A (en) * 2010-05-19 2011-11-23 上海聪维声学技术有限公司 Wind noise suppression method used for dual-microphone digital hearing-aid
CN102969003A (en) * 2012-11-15 2013-03-13 东莞宇龙通信科技有限公司 Image pickup sound extracting method and device
CN108073381A (en) * 2016-11-15 2018-05-25 腾讯科技(深圳)有限公司 A kind of object control method, apparatus and terminal device
CN106653044A (en) * 2017-02-28 2017-05-10 浙江诺尔康神经电子科技股份有限公司 Dual-microphone noise reduction system and method for tracing noise source and target sound source
CN107274907A (en) * 2017-07-03 2017-10-20 北京小鱼在家科技有限公司 The method and apparatus that directive property pickup is realized in dual microphone equipment

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113571053A (en) * 2020-04-28 2021-10-29 华为技术有限公司 Voice wake-up method and device
WO2021218600A1 (en) * 2020-04-28 2021-11-04 华为技术有限公司 Voice wake-up method and device

Also Published As

Publication number Publication date
JP2021536692A (en) 2021-12-27
CN112654960A (en) 2021-04-13

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18933041

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2021510940

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18933041

Country of ref document: EP

Kind code of ref document: A1