CN110164443A - Speech processing method and apparatus for an electronic device, and electronic device - Google Patents
- Publication number
- CN110164443A CN110164443A CN201910584198.5A CN201910584198A CN110164443A CN 110164443 A CN110164443 A CN 110164443A CN 201910584198 A CN201910584198 A CN 201910584198A CN 110164443 A CN110164443 A CN 110164443A
- Authority
- CN
- China
- Prior art keywords
- electronic equipment
- user
- speech data
- relative position
- position information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/223—Execution procedure of a spoken command
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/226—Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics
- G10L2015/227—Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics of the speaker; Human-factor methodology
Abstract
The present disclosure provides a speech processing method for an electronic device, a speech processing apparatus, and an electronic device. The speech processing method includes: receiving first voice data from a user at a first time; waking up the electronic device in response to the first voice data satisfying a wake-up condition; receiving, through a voice receiving device at a second time, second voice data from the user, the second voice data instructing the electronic device to perform a related operation; in response to the time span between the second time and the first time satisfying a first specific time length, determining relative position information between the user's face and the electronic device based on the second voice data; and, in response to the relative position information satisfying a specific condition, controlling the electronic device to perform the related operation based on the second voice data.
Description
Technical field
The present disclosure relates to a speech processing method for an electronic device, a speech processing apparatus, and an electronic device.
Background
With the rapid development of electronic technology, all kinds of electronic devices have gradually become part of our work and daily life, and a user can control an electronic device by voice. In the related art, however, after the user first wakes up the device with a wake-up word, any voice interaction occurring a while later again requires the wake-up word before the interaction can continue. Requiring the wake-up word before every voice interaction makes the user experience poor.
Summary of the invention
One aspect of the present disclosure provides a speech processing method for an electronic device, including: receiving first voice data from a user at a first time; waking up the electronic device in response to the first voice data satisfying a wake-up condition; receiving, through a voice receiving device at a second time, second voice data from the user, the second voice data instructing the electronic device to perform a related operation; determining, in response to the time span between the second time and the first time satisfying a first specific time length, relative position information between the user's face and the electronic device based on the second voice data; and controlling, in response to the relative position information satisfying a specific condition, the electronic device to perform the related operation based on the second voice data.
Optionally, the method further includes: controlling the electronic device to perform the related operation based on the second voice data in response to the time span between the second time and the first time satisfying a second specific time length, where the second specific time length is shorter than the first specific time length.
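The interplay of the two specific time lengths can be sketched as a small decision routine. The threshold values, function name, and the `face_toward_device` flag below are illustrative assumptions for exposition, not details given in the disclosure:

```python
def should_execute(t_wake, t_command, face_toward_device,
                   t_short=20.0, t_long=30.0):
    """Decide whether a command received at t_command should run without a
    fresh wake word, given the wake-up time t_wake (both in seconds).

    t_short and t_long stand in for the second and first specific time
    lengths of the disclosure (t_short < t_long).
    """
    elapsed = t_command - t_wake
    if elapsed <= t_short:
        # Shortly after wake-up: execute directly, no orientation check.
        return True
    if elapsed <= t_long:
        # Longer gap: execute only if the user's face is toward the device.
        return face_toward_device
    # Beyond the long window: a new wake word would be required.
    return False
```

Here `face_toward_device` stands in for the relative-position check described above.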
Optionally, the voice receiving device includes multiple voice receivers, and determining the relative position information between the user's face and the electronic device based on the second voice data includes: processing the second voice data to obtain a speech waveform and an audio time delay of the second voice data, where the audio time delay characterizes the time differences with which the multiple voice receivers receive the second voice data; and determining the relative position information between the user's face and the electronic device based on the speech waveform and the audio time delay.
Optionally, determining the relative position information based on the speech waveform and the audio time delay includes: determining whether the type of the speech waveform matches a specific type, and, in response to the type matching the specific type, determining the relative position information between the user's face and the electronic device based on the audio time delay.
Optionally, the multiple voice receivers include a first voice receiver and a second voice receiver separated by a specific distance, and determining the relative position information based on the audio time delay includes: determining a third time at which the first voice receiver receives the second voice data; determining a fourth time at which the second voice receiver receives the second voice data; determining a first delay difference of the audio time delay based on the third time and the fourth time; and determining the relative position information between the user's face and the electronic device based on the first delay difference and the specific distance.
Optionally, the method further includes processing the second voice data to obtain the audio energy of the second voice data. Determining the relative position information based on the audio time delay then includes: in response to the audio energy being greater than a specific energy threshold, determining a target position of the user relative to the electronic device based on the audio energy, and determining the relative position information between the user's face and the electronic device based on the target position and the audio time delay.
Optionally, determining the target position of the user relative to the electronic device based on the audio energy includes: determining a first audio energy and a second audio energy, where the first audio energy characterizes the user being located in the frontal region of the electronic device and the second audio energy characterizes the user being located in a side region of the electronic device; processing the first audio energy and the second audio energy to obtain a processing result; and determining the target position of the user relative to the electronic device based on the processing result.
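A minimal illustration of comparing the two audio energies to classify the user's region. The names, threshold value, and comparison rule are assumptions, since the disclosure only says the two energies are processed to obtain a result:

```python
def estimate_region(front_energy, side_energy, energy_threshold=0.01):
    """Classify whether the user is in the frontal or side region of the
    device by comparing two energy measurements.

    front_energy characterizes a source in the frontal region and
    side_energy a source in the side region; both names and the threshold
    are illustrative.
    """
    total = front_energy + side_energy
    if total <= energy_threshold:
        # Too quiet: fall back to delay-only estimation (see below).
        return "unknown"
    return "front" if front_energy >= side_energy else "side"
```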
Optionally, the multiple voice receivers include multiple groups of voice receivers, and the method further includes: in response to the audio energy being less than or equal to the specific energy threshold, determining a second delay difference of the audio time delay, and determining the relative position information between the user's face and the electronic device based on the second delay difference and the position information of the multiple groups of voice receivers.
Another aspect of the present disclosure provides a speech processing method for an electronic device, including: collecting voice data from a user through multiple voice collection devices, the voice data instructing the electronic device to perform a related operation; processing the voice data to obtain a speech waveform and an audio time delay of the voice data, where the audio time delay characterizes the time differences with which the multiple voice collection devices receive the voice data; determining relative position information between the user's face and the electronic device based on the speech waveform and the audio time delay; and controlling, in response to the relative position information satisfying a specific condition, the electronic device to perform the related operation based on the voice data.
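The pipeline of this aspect (waveform check, delay estimation, bearing, condition test) can be sketched end to end. The brute-force cross-correlation and the 45-degree condition below are illustrative stand-ins for the unspecified delay-estimation method and "specific condition":

```python
import math

def process_voice(samples_per_mic, sample_rate, mic_distance,
                  is_speech_waveform, angle_limit_deg=45.0):
    """Sketch of the second-aspect method: check the waveform type,
    estimate the inter-microphone delay by cross-correlation, convert it
    to a bearing, and decide whether to execute the command.

    samples_per_mic is a pair of equal-length sample lists; the waveform
    predicate and the angle limit are assumptions.
    """
    a, b = samples_per_mic
    if not is_speech_waveform(a):
        return False  # waveform type does not match the specific type
    # Delay via brute-force cross-correlation: lag of maximum correlation.
    n = len(a)
    best_lag, best_corr = 0, float("-inf")
    max_lag = int(mic_distance / 343.0 * sample_rate) + 1
    for lag in range(-max_lag, max_lag + 1):
        corr = sum(a[i] * b[i - lag]
                   for i in range(max(0, lag), min(n, n + lag)))
        if corr > best_corr:
            best_corr, best_lag = corr, lag
    delay = best_lag / sample_rate
    ratio = max(-1.0, min(1.0, 343.0 * delay / mic_distance))
    angle = math.degrees(math.asin(ratio))
    # Execute only when the estimated bearing satisfies the condition.
    return abs(angle) <= angle_limit_deg
```

For example, an impulse arriving two samples earlier at the second microphone yields a small bearing that passes the 45-degree check, while a five-sample lead yields a near-90-degree bearing that fails it.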
Optionally, determining the relative position information between the user's face and the electronic device based on the voice data includes: processing the voice data to obtain the speech waveform and the audio time delay of the voice data, where the audio time delay characterizes the time differences with which the multiple voice collection devices receive the voice data; and determining the relative position information based on the speech waveform and the audio time delay.
Optionally, determining the relative position information based on the speech waveform and the audio time delay includes: determining whether the type of the speech waveform matches a specific type, and, in response to the type matching the specific type, determining the relative position information between the user's face and the electronic device based on the audio time delay.
Optionally, the multiple voice collection devices include a first voice receiver and a second voice receiver separated by a specific distance, and determining the relative position information based on the audio time delay includes: determining a third time at which the first voice receiver receives the voice data; determining a fourth time at which the second voice receiver receives the voice data; determining a first delay difference of the audio time delay based on the third time and the fourth time; and determining the relative position information between the user's face and the electronic device based on the first delay difference and the specific distance.
Optionally, the method further includes processing the voice data to obtain the audio energy of the voice data. Determining the relative position information based on the audio time delay then includes: in response to the audio energy being greater than a specific energy threshold, determining a target position of the user relative to the electronic device based on the audio energy, and determining the relative position information based on the target position and the audio time delay.
Optionally, determining the target position of the user relative to the electronic device based on the audio energy includes: determining a first audio energy and a second audio energy, where the first audio energy characterizes the user being located in the frontal region of the electronic device and the second audio energy characterizes the user being located in a side region of the electronic device; processing the first audio energy and the second audio energy to obtain a processing result; and determining the target position based on the processing result.
Optionally, the multiple voice collection devices include multiple groups of voice receivers, and the method further includes: in response to the audio energy being less than or equal to the specific energy threshold, determining a second delay difference of the audio time delay, and determining the relative position information between the user's face and the electronic device based on the second delay difference and the position information of the multiple groups of voice receivers.
Another aspect of the present disclosure provides a speech processing apparatus, including a first receiving module, a wake-up module, a second receiving module, a first determining module, and a first control module. The first receiving module receives first voice data from a user at a first time; the wake-up module wakes up the electronic device in response to the first voice data satisfying a wake-up condition; the second receiving module receives, through a voice receiving device at a second time, second voice data from the user, the second voice data instructing the electronic device to perform a related operation; the first determining module determines relative position information between the user's face and the electronic device based on the second voice data in response to the time span between the second time and the first time satisfying a first specific time length; and the first control module controls the electronic device to perform the related operation based on the second voice data in response to the relative position information satisfying a specific condition.
Another aspect of the present disclosure provides an electronic device including a processor and a memory, the memory storing executable instructions that, when executed by the processor, cause the processor to carry out the method described above.
Another aspect of the present disclosure provides a non-volatile readable storage medium storing computer-executable instructions that, when executed, implement the method described above.
Another aspect of the present disclosure provides a computer program including computer-executable instructions that, when executed, implement the method described above.
Brief description of the drawings
For a more complete understanding of the present disclosure and its advantages, reference is now made to the following description taken in conjunction with the accompanying drawings, in which:
Fig. 1 schematically shows an application scenario of the speech processing method and apparatus for an electronic device according to an embodiment of the present disclosure;
Fig. 2 schematically shows a flowchart of a speech processing method for an electronic device according to a first embodiment of the present disclosure;
Fig. 3 schematically shows a flowchart of a speech processing method for an electronic device according to a second embodiment of the present disclosure;
Fig. 4 schematically shows a diagram of the voice receivers included in an electronic device according to an embodiment of the present disclosure;
Figs. 5-6 schematically show diagrams of speech waveforms received by an electronic device according to an embodiment of the present disclosure;
Fig. 7 schematically shows a diagram of determining relative position information based on an audio time delay according to an embodiment of the present disclosure;
Fig. 8 schematically shows a flowchart of a speech processing method for an electronic device according to a third embodiment of the present disclosure;
Fig. 9 schematically shows a diagram of determining the target position of a user relative to an electronic device according to an embodiment of the present disclosure;
Fig. 10 schematically shows a flowchart of a speech processing method for an electronic device according to a fourth embodiment of the present disclosure;
Figs. 11-12 schematically show diagrams of determining a relative position through multiple groups of voice receivers according to an embodiment of the present disclosure;
Fig. 13 schematically shows a block diagram of an electronic device according to an embodiment of the present disclosure;
Fig. 14 schematically shows a block diagram of a speech processing apparatus according to the first embodiment of the present disclosure;
Fig. 15 schematically shows a block diagram of a speech processing apparatus according to the second embodiment of the present disclosure;
Fig. 16 schematically shows a block diagram of a speech processing apparatus according to the third embodiment of the present disclosure;
Fig. 17 schematically shows a block diagram of a speech processing apparatus according to the fourth embodiment of the present disclosure; and
Fig. 18 schematically shows a block diagram of a computer system for implementing speech processing according to an embodiment of the present disclosure.
Detailed description of embodiments
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood, however, that these descriptions are merely exemplary and are not intended to limit the scope of the present disclosure. In the following detailed description, numerous specific details are set forth to facilitate a thorough understanding of the embodiments. It will be apparent, however, that one or more embodiments may be practiced without these specific details. Moreover, descriptions of well-known structures and techniques are omitted in order to avoid unnecessarily obscuring the concepts of the present disclosure.
The terms used herein are for the purpose of describing particular embodiments only and are not intended to limit the present disclosure. The terms "include", "comprise", and the like indicate the presence of the stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
Unless otherwise defined, all terms used herein (including technical and scientific terms) have the meanings commonly understood by those skilled in the art. Terms used herein should be interpreted as having meanings consistent with the context of this specification, rather than in an idealized or overly rigid manner.
Where an expression such as "at least one of A, B, and C" is used, it should generally be interpreted in the sense commonly understood by those skilled in the art (for example, "a system having at least one of A, B, and C" includes, but is not limited to, systems having A alone, B alone, C alone, A and B, A and C, B and C, and/or A, B, and C). The same applies to expressions such as "at least one of A, B, or C".
Some block diagrams and/or flowcharts are shown in the drawings. It should be understood that some of the blocks, or combinations thereof, can be implemented by computer program instructions. These instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, or another programmable control device, so that, when executed by the processor, they create means for implementing the functions/operations illustrated in the block diagrams and/or flowcharts.
Accordingly, the techniques of the present disclosure may be implemented in hardware and/or in software (including firmware, microcode, and the like). In addition, the techniques of the present disclosure may take the form of a computer program product on a computer-readable medium storing instructions, for use by or in connection with an instruction execution system. In the context of the present disclosure, a computer-readable medium may be any medium that can contain, store, communicate, propagate, or transport the instructions. For example, the computer-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a propagation medium. Specific examples of the computer-readable medium include: magnetic storage devices, such as magnetic tape or hard disks (HDD); optical storage devices, such as optical discs (CD-ROM); memories, such as random access memory (RAM) or flash memory; and/or wired/wireless communication links.
An embodiment of the present disclosure provides a speech processing method for an electronic device, including: receiving first voice data from a user at a first time; waking up the electronic device in response to the first voice data satisfying a wake-up condition; receiving, through a voice receiving device at a second time, second voice data from the user, the second voice data instructing the electronic device to perform a related operation; determining relative position information between the user's face and the electronic device based on the second voice data in response to the time span between the second time and the first time satisfying a first specific time length; and controlling the electronic device to perform the related operation based on the second voice data in response to the relative position information satisfying a specific condition.
Fig. 1 schematically shows an application scenario of the speech processing method and apparatus for an electronic device according to an embodiment of the present disclosure. It should be noted that Fig. 1 is only an example of a scenario to which the embodiments of the present disclosure may be applied, provided to help those skilled in the art understand the technical content of the present disclosure; it does not mean that the embodiments cannot be used in other devices, systems, environments, or scenarios.
As shown in Fig. 1, the application scenario 100 may include, for example, a user 110 and an electronic device 120.
According to an embodiment of the present disclosure, the electronic device 120 may be a smart device with the capability to receive and process speech, such as a computer, a smartphone, or a smart speaker.
For example, the user 110 can interact with the electronic device 120 by voice in order to control it to perform related operations. The user 110 can wake up the electronic device 120 with a wake-up word; after the device is awakened, the user 110 can continue to control it to perform related operations with voice commands. For instance, the user 110 may utter the wake-up word "Hi, XX". After receiving the user's speech, the electronic device 120 judges whether it is the wake-up word and, if so, responds to the wake-up word and wakes up. After the electronic device 120 is awake, the user 110 may, for example, issue the voice command "please open the XXX application", and upon receiving this command the electronic device 120 can respond to it by opening the related application.
The speech processing method for an electronic device according to exemplary embodiments of the present disclosure is described below with reference to Figs. 2-12 in conjunction with the application scenario of Fig. 1. It should be noted that the above application scenario is shown merely to facilitate understanding of the spirit and principles of the present disclosure, and the embodiments are not limited in this respect; rather, they can be applied to any applicable scenario.
Fig. 2 schematically shows a flowchart of a speech processing method for an electronic device according to a first embodiment of the present disclosure.
As shown in Fig. 2, the method includes operations S210-S250.
In operation S210, first voice data is received from a user at a first time.
According to an embodiment of the present disclosure, the user can control the electronic device by voice. For example, when the electronic device is in a sleep or power-off state, the user can wake it up with the corresponding wake-up word. After the electronic device receives the user's first voice data at the first time, it can further judge whether the first voice data is the wake-up word.
In operation S220, the electronic device is awakened in response to the first voice data satisfying a wake-up condition.
For example, the first voice data satisfying the wake-up condition includes the first voice data being the wake-up word. After judging that the first voice data is the wake-up word, the electronic device can respond to it and wake up, ready to subsequently perform the related operations the user indicates.
In operation S230, second voice data is received from the user through a voice receiving device at a second time, the second voice data instructing the electronic device to perform a related operation.
According to an embodiment of the present disclosure, the voice receiving device may be, for example, a microphone or a microphone array in the electronic device. After the electronic device is awakened, when the user needs to further control it to perform a related operation, the user can utter the second voice data, so that the electronic device receives it and performs the related operation in response. The electronic device receives the second voice data at the second time, which is after the first time.
In operation S240, in response to the time span between the second time and the first time satisfying a first specific time length, relative position information between the user's face and the electronic device is determined based on the second voice data.
For example, after receiving the user's second voice data, the electronic device can further evaluate the time span between the second time and the first time. When that time span is less than or equal to the first specific time length, the electronic device determines the relative position information between the user's face and the electronic device based on the second voice data. The first specific time length may be, for example, 30 seconds or 1 minute; it is understood that the first specific time length can be set according to actual application requirements.
The relative position information determined from the second voice data can indicate, for example, whether the user's face is oriented toward the electronic device or facing away from it. Further, the relative angle between the orientation of the user's face and the electronic device can also be computed from the second voice data. This relative angle indicates whether the user was facing the electronic device when uttering the second voice data, which helps establish whether the user intended to control the electronic device with the second voice data.
In operation S250, in response to the relative position information satisfying a specific condition, the electronic device is controlled to perform the related operation based on the second voice data.
According to an embodiment of the present disclosure, the relative position information satisfying the specific condition includes, for example, the user's face being oriented toward the electronic device. Alternatively, it may include the relative angle between the orientation of the user's face and the electronic device satisfying a specific angle; for example, when the electronic device includes a display unit, the relative angle between the orientation of the user's face and the display unit satisfies the specific angle.
When the relative position information satisfies the specific condition, the electronic device can perform the related operation directly in response to the second voice data, without being awakened again by the wake-up word. In other words, by determining the relative position information between the orientation of the user's face and the electronic device, the embodiment of the present disclosure allows the user, within a period after waking up the electronic device with the wake-up word (the period from the first time to the second time, during which, for example, the user has had no other voice interaction with the device), to control the electronic device by voice directly, without uttering the wake-up word again. This avoids a cumbersome interaction flow and improves the interactive experience between the user and the electronic device.
Fig. 3 schematically shows a flowchart of a speech processing method for an electronic device according to a second embodiment of the present disclosure.
As shown in Fig. 3, the method includes operations S210-S250 and S310, where operations S210-S250 are the same as or similar to those described above with reference to Fig. 2 and are not repeated here.
In operation S310, in response to the time span between the second time and the first time satisfying a second specific time length, the electronic device is controlled to perform the related operation based on the second voice data, where the second specific time length is shorter than the first specific time length.
According to an embodiment of the present disclosure, if the time span between the second time and the first time is less than or equal to the second specific time length, the electronic device can be directly controlled to respond to the second voice data and perform the related operation, without going on to determine the relative position information between the user's face and the electronic device. The second specific time length is shorter than the first specific time length: for example, when the first specific time length is 30 seconds, the second specific time length may be 20 seconds; when the first specific time length is 1 minute, the second specific time length may be 40 seconds; and so on.
In the embodiment of the present disclosure, the size of the time span between the second moment and the first moment can, for example, characterize the probability that the second speech data issued by the user at the second moment is intended to control the electronic equipment to execute the relevant operation. For example, the smaller the time span between the second moment and the first moment, the sooner the user issues the second speech data after waking up the electronic equipment, indicating a larger probability (a first probability) that the user wants to control the electronic equipment through the second speech data. In this case, the electronic equipment can execute the relevant operation directly in response to the second speech data, without being woken up again by the wake-up word. Conversely, the larger the time span between the second moment and the first moment, the later the user issues the second speech data after waking up the electronic equipment, indicating a smaller probability (a second probability) that the user wants to control the electronic equipment through the second speech data; that is, the second probability is less than the first probability. In this case, the electronic equipment can further judge whether the user faces the electronic equipment. When the user faces the electronic equipment, it indicates that the user wants to control the electronic equipment through the second speech data, and the electronic equipment can respond to the second speech data and execute the relevant operation, without being woken up again by the wake-up word.
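The time-gating logic described above can be sketched as follows. The window lengths are the example values given in the text, and the face-orientation check (operations S240~S250) is abstracted as a callable; the function and variable names are illustrative, not part of the disclosure.

```python
# Illustrative values; the text gives 30 s / 20 s (or 1 min / 40 s) as examples.
FIRST_WINDOW_S = 30.0   # first specific time length
SECOND_WINDOW_S = 20.0  # second specific time length (< FIRST_WINDOW_S)

def should_respond(first_moment, second_moment, user_faces_device):
    """Decide whether the equipment executes the command without a new wake-up word.

    first_moment: when the wake-up word was received (seconds).
    second_moment: when the second speech data was received (seconds).
    user_faces_device: callable performing the face-orientation check
    (operations S240~S250); only invoked when the elapsed time requires it.
    """
    elapsed = second_moment - first_moment
    if elapsed <= SECOND_WINDOW_S:
        # Operation S310: short gap, respond directly, skip the orientation check.
        return True
    if elapsed <= FIRST_WINDOW_S:
        # Longer gap: respond only if the user's face is towards the equipment.
        return bool(user_faces_device())
    # Outside both windows: a new wake-up word is required.
    return False
```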
With reference to the following Fig. 4 to Fig. 12: the embodiments described in Fig. 4 to Fig. 9 are suitable for a scene in which the distance between the user and the electronic equipment is relatively close, and the embodiments described in Fig. 10 to Fig. 12 are suitable for a scene in which the distance between the user and the electronic equipment is relatively far. For example, a relatively close distance between the user and the electronic equipment includes a distance within 1 meter, and a relatively far distance includes a distance of more than 1 meter.
Firstly, with reference to Fig. 4 to Fig. 9, it is described, for example, how to determine the relative position information of the user's face and the electronic equipment in the scene in which the distance between the user and the electronic equipment is relatively close.
Fig. 4 diagrammatically illustrates a schematic diagram of the pronunciation receivers included in the electronic equipment according to the embodiment of the present disclosure.
As shown in Fig. 4, the electronic equipment includes, for example, multiple pronunciation receivers. Fig. 4 diagrammatically illustrates two pronunciation receivers, for example, microphone M1 and microphone M2.
According to the embodiment of the present disclosure, in operation S240 described in Fig. 2, determining the relative position information of the user's face and the electronic equipment based on the second speech data may, for example, include the following steps (1)~(2).
(1) Process the second speech data to obtain a speech waveform and an audio time delay of the second speech data, wherein the audio time delay characterizes the time difference at which the multiple pronunciation receivers receive the second speech data.
For example, after the electronic equipment receives the second speech data, it can process the second speech data to obtain the speech waveform and the audio time delay. The speech waveform may include, for example, a plane waveform, a curved-surface waveform, or another irregular waveform. Since microphone M1 and microphone M2 receive the second speech data at different moments, the time difference at which microphone M1 and microphone M2 receive the second speech data is the audio time delay.
(2) Based on the speech waveform and the audio time delay, determine the relative position information of the user's face and the electronic equipment.
For example, the speech waveform can characterize whether the user's face is towards the electronic equipment or back to the electronic equipment when the user issues the second speech data. Also, the audio time delay with which microphone M1 and microphone M2 receive the second speech data can indicate the relative position information between the user and microphone M1 and microphone M2. Therefore, the relative position information of the user's face and the electronic equipment can be determined according to the speech waveform and the audio time delay; the detailed process is described in the following Fig. 5 to Fig. 6.
Fig. 5 and Fig. 6 diagrammatically illustrate schematic diagrams of speech waveforms received by the electronic equipment according to the embodiment of the present disclosure.
According to the embodiment of the present disclosure, after the second speech data is processed to obtain the speech waveform, it is first determined whether the type of the speech waveform meets a specific type; then, in response to the type of the speech waveform meeting the specific type, the relative position information of the user's face and the electronic equipment is determined based on the audio time delay. Determining whether the type of the speech waveform meets the specific type includes, for example, determining whether the speech waveform is a plane waveform.
As shown in Fig. 5, after the electronic equipment receives the second speech data, if it is judged that the type of the speech waveform of the second speech data is a plane waveform, it can be preliminarily determined that the user is at least not back to the electronic equipment when issuing the second speech data, and the relative position information of the user's face and the electronic equipment is further judged through the audio time delay, so that the electronic equipment can determine, based on the relative position information, whether to respond to the second speech data and execute the relevant operation.
As shown in Fig. 6, after the electronic equipment receives the second speech data, if it is judged that the type of the speech waveform of the second speech data is not a plane waveform (for example, a curved-surface waveform or another irregular waveform), it can be preliminarily determined that the user is back to the electronic equipment when issuing the second speech data; this is because the user's face obstructs the transmission of the second speech data, so that the speech waveform is not a plane waveform. Therefore, when the type of the speech waveform of the second speech data is not a plane waveform, it is preliminarily judged that the user is back to the electronic equipment when issuing the second speech data, and the electronic equipment can then refrain from responding to the second speech data and does not carry out the subsequent judgment of the relative position information.
Fig. 7 diagrammatically illustrates a schematic diagram of determining the relative position information based on the audio time delay according to the embodiment of the present disclosure.
As shown in Fig. 7, the multiple pronunciation receivers include a first pronunciation receiver (microphone M1) and a second pronunciation receiver (microphone M2), and the distance between the first pronunciation receiver and the second pronunciation receiver is a specific distance D1. In the embodiment of the present disclosure, since each of the multiple pronunciation receivers is at a different distance from the user, the different pronunciation receivers may receive the second speech data at different moments. Thereby, the relative position information between the user's face and the electronic equipment can be determined according to the audio time delay with which the multiple pronunciation receivers receive the second speech data.
According to the embodiment of the present disclosure, determining the relative position information of the user's face and the electronic equipment based on the audio time delay includes the following steps (1)~(4).
(1) Determine a third moment at which the first pronunciation receiver receives the second speech data.
(2) Determine a fourth moment at which the second pronunciation receiver receives the second speech data.
For example, when the electronic equipment receives the second speech data, the third moment at which microphone M1 receives the second speech data and the fourth moment at which microphone M2 receives the second speech data are determined.
(3) Based on the third moment and the fourth moment, determine a first delay difference of the audio time delay.
For example, when the third moment is greater than the fourth moment, it indicates that the second speech data arrives first at microphone M2; when the third moment is less than the fourth moment, it indicates that the second speech data arrives first at microphone M1 (the case in which the second speech data arrives first at microphone M1 is shown in Fig. 7). The difference between the third moment and the fourth moment is the first delay difference.
(4) Based on the first delay difference and the specific distance D1, determine the relative position information of the user's face and the electronic equipment.
Taking the situation shown in Fig. 7 as an example, the second speech data arrives first at microphone M1, and the first delay difference is therefore negative. According to the absolute value of the first delay difference and the propagation speed of the voice (for example, the speed of sound), the distance D2 in the figure can be obtained. The distance D2 can characterize the difference (in absolute value) between a first distance and a second distance, wherein the first distance can represent the distance between the plane where the user's face is located and microphone M1, and the second distance can represent the distance between the plane where the user's face is located and microphone M2.
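As a sketch of how the first delay difference might be estimated in practice, the two microphone signals can be cross-correlated and the lag of the correlation peak taken as the arrival-time difference. This is an assumed estimation method; the disclosure itself does not prescribe a particular algorithm, and the function name is illustrative.

```python
import numpy as np

def first_delay_difference(sig_m1, sig_m2, sample_rate):
    """Estimate (third moment - fourth moment) in seconds.

    sig_m1, sig_m2: equal-length sample sequences from microphones M1 and M2.
    The result is negative when the second speech data reaches microphone M1
    first, matching the sign convention of the text.
    """
    # Full cross-correlation; the lag at the peak is the sample delay.
    corr = np.correlate(np.asarray(sig_m1, dtype=float),
                        np.asarray(sig_m2, dtype=float), mode="full")
    lag = int(np.argmax(corr)) - (len(sig_m2) - 1)
    return lag / sample_rate
```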
In the embodiment of the present disclosure, an angle R can be obtained based on the distance D2 and the specific distance D1, for example, from D2 = D1 × cosR; since D1 and D2 are known, the angle R can be calculated. The angle R can be used, for example, to indicate the relative position information of the user's face and the electronic equipment; for example, the angle R is the angle between the orientation N of the user's face and the plane where the display unit of the electronic equipment is located, wherein the embodiment of the present disclosure assumes that the display unit of the electronic equipment is perpendicular to the ground.
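The relation D2 = D1 × cosR above can be turned into a small helper, with D2 obtained from the first delay difference and the propagation speed of the voice. The speed-of-sound constant and the clamping of D2 to at most D1 are assumptions added for numerical robustness, not stated in the disclosure.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed propagation speed of the voice

def face_angle(first_delay_s, mic_distance):
    """Angle R (radians) from D2 = D1 * cos(R), as in Fig. 7.

    first_delay_s: first delay difference in seconds (sign is irrelevant here).
    mic_distance: specific distance D1 between the two microphones, in metres.
    """
    d2 = min(abs(first_delay_s) * SPEED_OF_SOUND, mic_distance)  # clamp D2 <= D1
    return math.acos(d2 / mic_distance)
```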
As shown in Fig. 7, since the distance between the user and the electronic equipment is relatively close, after the relative position information (the angle R) of the user's face and the electronic equipment is determined based on the first delay difference (the distance D2) and the specific distance D1, the user may be at the position A or the position B shown in Fig. 7, etc. Therefore, it is necessary to further judge the target position of the user, for example, to judge whether the user is at the position A or the position B. The detailed process is described below with reference to Fig. 8 to Fig. 9.
Fig. 8 diagrammatically illustrates a flowchart of a speech processing method for electronic equipment according to the third embodiment of the present disclosure.
As shown in Fig. 8, the method includes operations S210~S250 and S810. Operations S210~S250 are the same as or similar to the operations described above with reference to Fig. 2, and are not repeated here.
In operation S810, the second speech data is processed to obtain the audio energy of the second speech data.
According to the embodiment of the present disclosure, the audio energy of the second speech data can, for example, indicate the distance between the user and the electronic equipment: the larger the audio energy of the second speech data, the smaller the distance between the user and the electronic equipment; the smaller the audio energy, the larger the distance between the user and the electronic equipment.
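As an illustrative sketch of operation S810 (the disclosure fixes neither a particular energy measure nor a threshold value), the audio energy could be computed as the mean-square value of the samples and compared against the particular energy threshold; both the measure and the threshold below are assumptions.

```python
import numpy as np

ENERGY_THRESHOLD = 1e-3  # particular energy threshold (illustrative value)

def audio_energy(samples):
    """Mean-square energy of the second speech data (one possible measure)."""
    x = np.asarray(samples, dtype=float)
    return float(np.mean(x ** 2))

def is_near_field(samples):
    """True when the energy exceeds the threshold, i.e. the user is close
    and the near-field procedure of Fig. 8 to Fig. 9 applies."""
    return audio_energy(samples) > ENERGY_THRESHOLD
```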
According to the embodiment of the present disclosure, determining the relative position information of the user's face and the electronic equipment based on the audio time delay includes the following steps (1)~(2).
(1) In response to the audio energy being greater than a particular energy threshold, determine the target position of the user relative to the electronic equipment based on the audio energy.
For example, when the audio energy is greater than the particular energy threshold, it indicates that the distance between the user and the electronic equipment is relatively small, and the target position of the user relative to the electronic equipment can be further determined (for example, as shown in Fig. 7, determining whether the target position is the position A or the position B). The process of determining the target position according to the audio energy is described below with reference to Fig. 9.
If the audio energy is less than or equal to the particular energy threshold, it indicates that the distance between the user and the electronic equipment is relatively large, and the relative position information between the user and the electronic equipment is then determined in the manner described in the following Fig. 10 to Fig. 12.
(2) Based on the target position and the audio time delay, determine the relative position information of the user's face and the electronic equipment.
After the target position of the user relative to the electronic equipment is determined, the relative position information between the user's face and the electronic equipment can be determined more adequately according to the target position and the audio time delay (the relative position information including, for example, the position A and the angle R in Fig. 7).
Fig. 9 diagrammatically illustrates a schematic diagram of determining the target position of the user relative to the electronic equipment according to the embodiment of the present disclosure.
As shown in Fig. 9, determining the target position of the user relative to the electronic equipment based on the audio energy includes the following steps (1)~(3).
(1) Determine a first audio energy and a second audio energy, wherein the first audio energy is used to characterize that the user is located in the front region of the electronic equipment, and the second audio energy is used to characterize that the user is located in a side region of the electronic equipment.
According to the embodiment of the present disclosure, the target position of the user can be located, for example, through microphone array technology. Specifically, the first audio energy and the second audio energy of the second speech data can be determined, for example, through two different modes of beamforming: the first audio energy is obtained through a cardioid mode and can be used to indicate that the user is located in the front region of the electronic equipment, and the second audio energy is obtained through a dipole mode and can be used to indicate that the user is located in a side region of the electronic equipment. As shown in Fig. 9, the front region is, for example, the region E, and the side regions are, for example, the regions F (the left and right sides in Fig. 9 are the regions F).
(2) Process the first audio energy and the second audio energy to obtain a processing result.
(3) Based on the processing result, determine the target position of the user relative to the electronic equipment.
For example, the first audio energy and the second audio energy are superimposed to obtain the processing result, and the processing result can indicate the target position of the user.
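A rough two-microphone sketch of the cardioid/dipole comparison follows. It assumes a first-order differential array: the dipole output (difference of the two signals) nulls sound arriving from the front broadside, while the cardioid output (one signal minus a delayed copy of the other) nulls sound arriving from one endfire direction. Comparing the two beam energies then gives a crude front/side decision; a real implementation would steer beams to both sides and calibrate the delay, so this is only an assumed illustration.

```python
import numpy as np

def dipole_energy(x1, x2):
    """Dipole (figure-eight) beam energy: the difference of the two
    microphone signals, which nulls a source directly in front (where
    both microphones receive the sound simultaneously)."""
    y = np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float)
    return float(np.mean(y ** 2))

def cardioid_energy(x1, x2, delay_samples):
    """Cardioid beam energy: x1 minus a delayed copy of x2, which nulls
    sound arriving from the endfire direction that reaches M2 first."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    y = x1[delay_samples:] - x2[:len(x2) - delay_samples]
    return float(np.mean(y ** 2))

def front_or_side(x1, x2, delay_samples):
    """Superimpose (compare) the two beam energies to classify the region."""
    if cardioid_energy(x1, x2, delay_samples) >= dipole_energy(x1, x2):
        return "front"
    return "side"
```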
As shown in Fig. 9, when the obtained target position of the user is A, it indicates that the user is in the front region; according to the relative position information of the user (including the target position A and the angle R), it can then be determined that the relative position information between the orientation of the user's face and the electronic equipment meets the specified conditions, and the electronic equipment is controlled to respond to the second speech data. If the target position of the user is B, it also indicates that the user is in the front region, but according to the relative position information of the user (including the target position B and the angle R), it can be determined that the relative position information between the orientation of the user's face and the electronic equipment does not meet the specified conditions (the user is not towards the electronic equipment), and the electronic equipment does not respond to the second speech data.
Similarly, if the target position of the user is C, it indicates that the user is in a side region; according to the relative position information of the user (including the target position C and the angle R), it can be determined that the relative position information between the orientation of the user's face and the electronic equipment meets the specified conditions, and the electronic equipment is controlled to respond to the second speech data. If the target position of the user is D, it also indicates that the user is in a side region, but according to the relative position information of the user (including the target position D and the angle R), it can be determined that the relative position information between the orientation of the user's face and the electronic equipment does not meet the specified conditions (the user is not towards the electronic equipment), and the electronic equipment does not respond to the second speech data.
According to the embodiment of the present disclosure, in the scene in which the distance between the user and the electronic equipment is relatively close, the target position of the user relative to the electronic equipment is determined, and the relative position information of the user's face and the electronic equipment is determined based on the target position and the audio time delay, so that the electronic equipment can be controlled, depending on the relative position information, to respond to the user's second speech data and directly execute the relevant operation, without being woken up again by the wake-up word. This avoids a cumbersome interactive process between the user and the electronic equipment and improves the interactive experience.
In addition, with reference to Fig. 10 to Fig. 12, it is described, for example, how to determine the relative position information of the user's face and the electronic equipment in the scene in which the distance between the user and the electronic equipment is relatively far.
Figure 10 diagrammatically illustrates a flowchart of a speech processing method for electronic equipment according to the fourth embodiment of the present disclosure.
As shown in Fig. 10, the method includes operations S210~S250 and S1010~S1020. Operations S210~S250 are the same as or similar to the operations described above with reference to Fig. 2, and are not repeated here.
Figure 11 and Figure 12 diagrammatically illustrate schematic diagrams of determining the relative position through multiple groups of pronunciation receivers according to the embodiment of the present disclosure.
As shown in Fig. 11 and Fig. 12, the multiple pronunciation receivers include multiple groups of pronunciation receivers, for example three groups, and each group may include, for example, two microphones. Since the distance between the user and the electronic equipment is relatively far, the relative position information between the user and the electronic equipment can be determined relatively accurately through the multiple groups of pronunciation receivers.
With reference to Fig. 10, Fig. 11 and Fig. 12, in operation S1010, in response to the audio energy being less than or equal to the particular energy threshold, a second delay difference of the audio time delay is determined.
For example, the second speech data is first received through the multiple groups of pronunciation receivers, and it is judged whether the type of the speech waveform of the second speech data is the plane waveform type. If so, it can be further judged whether the audio energy of the second speech data is less than the particular energy threshold. If so, it indicates that the distance between the user and the electronic equipment is relatively far, and the second delay difference with which the multiple groups of pronunciation receivers receive the second speech data can be further determined.
In operation S1020, based on the second delay difference and the location information of the multiple groups of pronunciation receivers, the relative position information of the user's face and the electronic equipment is determined.
As shown in Fig. 12, since the distance between the user and the electronic equipment is relatively far, a change in the user's position has little influence on whether the user's face is towards the electronic equipment. For example, after the user moves from the position A to the position B, the user's face can be considered to remain towards the electronic equipment (unlike the short-distance case of Fig. 7, in which the user faces the electronic equipment at the position A but not at the position B). Therefore, when the distance between the user and the electronic equipment is relatively far, since the change of the user's target position has little influence on whether the user's face is towards the electronic equipment, the relative position information of the user's face and the electronic equipment can be determined through the second delay difference and the location information of the multiple groups of pronunciation receivers.
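Under a plane-wave (far-field) assumption, the delay differences of operation S1010 and the location information of the receivers could, for example, be combined by least squares to recover the arrival direction of the second speech data. This sketch is an assumed implementation, not the algorithm stated in the disclosure; the coordinate layout and function names are illustrative.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed propagation speed of the voice

def far_field_direction(mic_positions, delays_to_ref):
    """Least-squares plane-wave direction from multiple receivers.

    mic_positions: (n, 2) coordinates in metres; the first entry is the
    reference microphone. delays_to_ref: arrival time of each microphone
    minus that of the reference, in seconds. Under the far-field model,
    (r_i - r_0) . u = -c * delay_i, which is solved for the direction u.
    Returns the unit vector pointing from the array towards the user.
    """
    p = np.asarray(mic_positions, dtype=float)
    d = np.asarray(delays_to_ref, dtype=float)
    a = p[1:] - p[0]              # baselines relative to the reference mic
    b = -SPEED_OF_SOUND * d[1:]   # path-length differences
    u, *_ = np.linalg.lstsq(a, b, rcond=None)
    n = np.linalg.norm(u)
    return u / n if n > 0 else u
```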
According to the embodiment of the present disclosure, in the scene in which the distance between the user and the electronic equipment is relatively far, the relative position information of the user's face and the electronic equipment can be determined based on the audio time delay, so that the electronic equipment can be controlled, depending on the relative position information, to respond to the user's second speech data and directly execute the relevant operation, without being woken up again by the wake-up word. This avoids a cumbersome interactive process between the user and the electronic equipment and improves the interactive experience.
According to the embodiment of the present disclosure, in addition to being determined according to the second speech data, the relative position information between the user's face and the electronic equipment can also be determined through other sensors. For example, data about the user's face can be obtained by means of a radar, a TOF (Time Of Flight) ranging sensor, or an infrared scanner, so as to determine the relative position information between the user's face and the electronic equipment.
Figure 13 diagrammatically illustrates a block diagram of electronic equipment according to the embodiment of the present disclosure.
As shown in Fig. 13, the electronic equipment 1300 of the embodiment of the present disclosure includes a processor 1310 and a memory 1320. The memory 1320 is used for storing executable instructions which, when executed by the processor 1310, cause the processor 1310 to execute the speech processing methods shown in Fig. 2 to Fig. 12, which are not repeated here.
Figure 14 diagrammatically illustrates a block diagram of a voice processing apparatus according to the first embodiment of the present disclosure.
As shown in Fig. 14, the voice processing apparatus 1400 includes a first receiving module 1410, a wake-up module 1420, a second receiving module 1430, a first determining module 1440 and a first control module 1450.
The first receiving module 1410 can be used to receive, at the first moment, the first voice data of the user. According to the embodiment of the present disclosure, the first receiving module 1410 can, for example, execute the operation S210 described above with reference to Fig. 2, which is not repeated here.
The wake-up module 1420 can be used to wake up the electronic equipment in response to the first voice data meeting the wake-up condition. According to the embodiment of the present disclosure, the wake-up module 1420 can, for example, execute the operation S220 described above with reference to Fig. 2, which is not repeated here.
The second receiving module 1430 can be used to receive, at the second moment, the second voice data of the user through the pronunciation receiver, the second speech data being used to instruct the electronic equipment to execute the relevant operation. According to the embodiment of the present disclosure, the second receiving module 1430 can, for example, execute the operation S230 described above with reference to Fig. 2, which is not repeated here.
The first determining module 1440 can be used to determine the relative position information of the user's face and the electronic equipment based on the second speech data, in response to the time span between the second moment and the first moment meeting the first specific time length.
According to the embodiment of the present disclosure, the pronunciation receiver includes multiple pronunciation receivers, and determining the relative position information of the user's face and the electronic equipment based on the second speech data comprises: processing the second speech data to obtain the speech waveform and the audio time delay of the second speech data, wherein the audio time delay characterizes the time difference at which the multiple pronunciation receivers receive the second speech data; and determining the relative position information of the user's face and the electronic equipment based on the speech waveform and the audio time delay.
According to the embodiment of the present disclosure, determining the relative position information of the user's face and the electronic equipment based on the speech waveform and the audio time delay comprises: determining whether the type of the speech waveform meets the specific type; and, in response to the type of the speech waveform meeting the specific type, determining the relative position information of the user's face and the electronic equipment based on the audio time delay.
According to the embodiment of the present disclosure, the multiple pronunciation receivers include the first pronunciation receiver and the second pronunciation receiver, and the distance between the first pronunciation receiver and the second pronunciation receiver is the specific distance. Determining the relative position information of the user's face and the electronic equipment based on the audio time delay comprises: determining the third moment at which the first pronunciation receiver receives the second speech data; determining the fourth moment at which the second pronunciation receiver receives the second speech data; determining the first delay difference of the audio time delay based on the third moment and the fourth moment; and determining the relative position information of the user's face and the electronic equipment based on the first delay difference and the specific distance.
According to the embodiment of the present disclosure, the first determining module 1440 can, for example, execute the operation S240 described above with reference to Fig. 2, which is not repeated here.
The first control module 1450 can be used to control the electronic equipment to execute the relevant operation based on the second speech data, in response to the relative position information meeting the specified conditions. According to the embodiment of the present disclosure, the first control module 1450 can, for example, execute the operation S250 described above with reference to Fig. 2, which is not repeated here.
Figure 15 diagrammatically illustrates a block diagram of a voice processing apparatus according to the second embodiment of the present disclosure.
As shown in Fig. 15, the voice processing apparatus 1500 includes a first receiving module 1410, a wake-up module 1420, a second receiving module 1430, a first determining module 1440, a first control module 1450 and a second control module 1510. The first receiving module 1410, the wake-up module 1420, the second receiving module 1430, the first determining module 1440 and the first control module 1450 are the same as or similar to the modules described above with reference to Fig. 14, and are not repeated here.
The second control module 1510 can be used to control the electronic equipment to execute the relevant operation based on the second speech data, in response to the time span between the second moment and the first moment meeting the second specific time length, wherein the second specific time length is less than the first specific time length. According to the embodiment of the present disclosure, the second control module 1510 can, for example, execute the operation S310 described above with reference to Fig. 3, which is not repeated here.
Figure 16 diagrammatically illustrates a block diagram of a voice processing apparatus according to the third embodiment of the present disclosure.
As shown in Fig. 16, the voice processing apparatus 1600 includes a first receiving module 1410, a wake-up module 1420, a second receiving module 1430, a first determining module 1440, a first control module 1450 and a processing module 1610. The first receiving module 1410, the wake-up module 1420, the second receiving module 1430, the first determining module 1440 and the first control module 1450 are the same as or similar to the modules described above with reference to Fig. 14, and are not repeated here.
The processing module 1610 can be used to process the second speech data to obtain the audio energy of the second speech data.
According to the embodiment of the present disclosure, determining the relative position information of the user's face and the electronic equipment based on the audio time delay comprises: in response to the audio energy being greater than the particular energy threshold, determining the target position of the user relative to the electronic equipment based on the audio energy; and determining the relative position information of the user's face and the electronic equipment based on the target position and the audio time delay.
According to the embodiment of the present disclosure, determining the target position of the user relative to the electronic equipment based on the audio energy comprises: determining the first audio energy and the second audio energy, wherein the first audio energy is used to characterize that the user is located in the front region of the electronic equipment and the second audio energy is used to characterize that the user is located in a side region of the electronic equipment; processing the first audio energy and the second audio energy to obtain the processing result; and determining the target position of the user relative to the electronic equipment based on the processing result.
According to the embodiment of the present disclosure, the processing module 1610 can, for example, execute the operation S810 described above with reference to Fig. 8, which is not repeated here.
Figure 17 diagrammatically illustrates a block diagram of a voice processing apparatus according to the fourth embodiment of the present disclosure.
As shown in Fig. 17, the voice processing apparatus 1700 includes a first receiving module 1410, a wake-up module 1420, a second receiving module 1430, a first determining module 1440, a first control module 1450, a second determining module 1710 and a third determining module 1720. The first receiving module 1410, the wake-up module 1420, the second receiving module 1430, the first determining module 1440 and the first control module 1450 are the same as or similar to the modules described above with reference to Fig. 14, and are not repeated here.
Second determining module 1710 can be used for being less than or equal to particular energy threshold value in response to audio power, determine audio
Second delay inequality of time delay.According to the embodiment of the present disclosure, the second determining module 1710 can for example be executed retouches above with reference to Figure 10
The operation S1010 stated, details are not described herein.
The third determining module 1720 may be used to determine the relative position information of the user's face and the electronic device based on the second delay difference and the position information of multiple groups of sound receivers. According to an embodiment of the present disclosure, the third determining module 1720 may, for example, perform the operation S1020 described above with reference to Figure 10, which is not repeated here.
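For a pair of sound receivers with known spacing, a delay difference such as the one determined by these modules can be converted into a direction-of-arrival estimate. The following sketch assumes a far-field plane-wave model and a single two-receiver group; the constant, function, and parameter names are illustrative and not taken from the disclosure.

```python
import math

SPEED_OF_SOUND = 343.0  # metres per second in air at roughly 20 degrees C

def arrival_angle_from_delay(delay_s, receiver_spacing_m):
    """Estimate the direction of arrival, in degrees relative to the
    broadside of a two-receiver pair, from the delay difference (in
    seconds) between the receivers. Far-field plane-wave assumption."""
    # Path-length difference implied by the delay.
    path_diff = SPEED_OF_SOUND * delay_s
    # Clamp to the physically valid range before taking the arcsine,
    # since noise can push the ratio slightly outside [-1, 1].
    ratio = max(-1.0, min(1.0, path_diff / receiver_spacing_m))
    return math.degrees(math.asin(ratio))
```

A delay of zero maps to broadside (the speaker directly in front of the pair), while a delay of plus or minus spacing divided by the speed of sound maps to plus or minus 90 degrees, i.e. the speaker lying along the receiver axis.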
Any number of the modules, submodules, units, and subunits according to embodiments of the present disclosure, or at least part of the functions of any number of them, may be implemented in one module. Any one or more of the modules, submodules, units, and subunits according to embodiments of the present disclosure may be split into multiple modules for implementation. Any one or more of the modules, submodules, units, and subunits according to embodiments of the present disclosure may be implemented at least partially as a hardware circuit, such as a field-programmable gate array (FPGA), a programmable logic array (PLA), a system on chip, a system on a substrate, a system in a package, or an application-specific integrated circuit (ASIC), or may be implemented by hardware or firmware in any other reasonable way of integrating or packaging a circuit, or implemented in any one of the three implementation manners of software, hardware, and firmware, or in an appropriate combination of any of them. Alternatively, one or more of the modules, submodules, units, and subunits according to embodiments of the present disclosure may be implemented at least partially as a computer program module which, when run, may perform the corresponding function.
For example, any number of the first receiving module 1410, wake-up module 1420, second receiving module 1430, first determining module 1440, first control module 1450, second control module 1510, processing module 1610, second determining module 1710, and third determining module 1720 may be combined and implemented in one module, or any one of them may be split into multiple modules. Alternatively, at least part of the functions of one or more of these modules may be combined with at least part of the functions of other modules and implemented in one module. According to embodiments of the present disclosure, at least one of the obtaining module 610, the first control module 620, the storage module 710, the second control module 720, and the third control module 810 may be implemented at least partially as a hardware circuit, such as a field-programmable gate array (FPGA), a programmable logic array (PLA), a system on chip, a system on a substrate, a system in a package, or an application-specific integrated circuit (ASIC), or may be implemented by hardware or firmware in any other reasonable way of integrating or packaging a circuit, or implemented in any one of the three implementation manners of software, hardware, and firmware, or in an appropriate combination of any of them. Alternatively, at least one of the first receiving module 1410, wake-up module 1420, second receiving module 1430, first determining module 1440, first control module 1450, second control module 1510, processing module 1610, second determining module 1710, and third determining module 1720 may be implemented at least partially as a computer program module which, when run, may perform the corresponding function.
Figure 18 schematically illustrates a block diagram of a computer system for implementing speech processing according to an embodiment of the present disclosure. The computer system shown in Figure 18 is only an example and should not impose any limitation on the functions or scope of use of the embodiments of the present disclosure.
As shown in Figure 18, the computer system 1800 for implementing speech processing includes a processor 1801 and a computer-readable storage medium 1802. The system 1800 may perform the method according to the embodiments of the present disclosure.
Specifically, the processor 1801 may include, for example, a general-purpose microprocessor, an instruction-set processor and/or a related chipset, and/or a special-purpose microprocessor (for example, an application-specific integrated circuit (ASIC)), and so on. The processor 1801 may also include onboard memory for caching purposes. The processor 1801 may be a single processing unit or multiple processing units for performing the different actions of the method flow according to the embodiments of the present disclosure.
The computer-readable storage medium 1802 may be, for example, any medium that can contain, store, communicate, propagate, or transport instructions. For example, a readable storage medium may include, but is not limited to, an electric, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, device, or propagation medium. Specific examples of readable storage media include: magnetic storage devices, such as magnetic tape or hard disks (HDD); optical storage devices, such as compact discs (CD-ROM); memories, such as random access memory (RAM) or flash memory; and/or wired/wireless communication links.
The computer-readable storage medium 1802 may include a computer program 1803, which may include code/computer-executable instructions that, when executed by the processor 1801, cause the processor 1801 to perform the method according to the embodiments of the present disclosure or any variation thereof.
The computer program 1803 may be configured to have, for example, computer program code including computer program modules. For example, in an exemplary embodiment, the code in the computer program 1803 may include one or more program modules, for example including module 1803A, module 1803B, and so on. It should be noted that the division and number of the modules are not fixed; those skilled in the art may use suitable program modules or combinations of program modules according to the actual situation, and when these program module combinations are executed by the processor 1801, the processor 1801 performs the method according to the embodiments of the present disclosure or any variation thereof.
According to embodiments of the present disclosure, at least one of the first receiving module 1410, wake-up module 1420, second receiving module 1430, first determining module 1440, first control module 1450, second control module 1510, processing module 1610, second determining module 1710, and third determining module 1720 may be implemented as a computer program module described with reference to Figure 18, which, when executed by the processor 1801, may implement the corresponding operations described above.
The present disclosure also provides a computer-readable medium, which may be included in the device/apparatus/system described in the above embodiments, or may exist alone without being assembled into that device/apparatus/system. The above computer-readable medium carries one or more programs which, when executed, implement the speech processing method described above.
According to embodiments of the present disclosure, the computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example, but is not limited to, an electric, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program, and the program may be used by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal may take various forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium; such a medium may send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium may be transmitted by any suitable medium, including but not limited to: wireless, wired, optical cable, radio-frequency signals, and the like, or any suitable combination of the above.
The flowcharts and block diagrams in the drawings illustrate the possible architectures, functions, and operations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each box in a flowchart or block diagram may represent a module, program segment, or part of code, and the above module, program segment, or part of code contains one or more executable instructions for implementing the specified logical function. It should also be noted that in some alternative implementations, the functions marked in the boxes may occur in an order different from that marked in the drawings. For example, two boxes shown in succession may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should also be noted that each box in a block diagram or flowchart, and combinations of boxes in a block diagram or flowchart, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
Those skilled in the art will understand that the features described in the various embodiments and/or claims of the present disclosure may be combined and/or integrated in various ways, even if such combinations or integrations are not explicitly described in the present disclosure. In particular, without departing from the spirit or teaching of the present disclosure, the features described in the various embodiments and/or claims of the present disclosure may be combined and/or integrated in various ways. All such combinations and/or integrations fall within the scope of the present disclosure.
Although the present disclosure has been shown and described with reference to certain exemplary embodiments thereof, those skilled in the art should understand that various changes in form and detail may be made to the present disclosure without departing from the spirit and scope of the present disclosure as defined by the following claims and their equivalents. Therefore, the scope of the present disclosure should not be limited to the above embodiments, but should be determined not only by the appended claims but also by the equivalents of the appended claims.
Claims (10)
1. A speech processing method for an electronic device, comprising:
receiving first voice data of a user at a first moment;
waking up the electronic device in response to the first voice data meeting a wake-up condition;
receiving, at a second moment and through a sound receiver, second voice data of the user, the second voice data being used to instruct the electronic device to perform a relevant operation;
determining relative position information of the user's face and the electronic device based on the second voice data, in response to the time span between the second moment and the first moment meeting a first specific time length; and
controlling the electronic device to perform the relevant operation based on the second voice data, in response to the relative position information meeting a specific condition.
2. The method according to claim 1, further comprising:
controlling the electronic device to perform the relevant operation based on the second voice data, in response to the time span between the second moment and the first moment meeting a second specific time length, wherein the second specific time length is less than the first specific time length.
3. The method according to claim 1, wherein the sound receiver comprises multiple sound receivers, and determining the relative position information of the user's face and the electronic device based on the second voice data comprises:
processing the second voice data to obtain a speech waveform and an audio time delay of the second voice data, wherein the audio time delay characterizes the time difference with which the multiple sound receivers receive the second voice data; and
determining the relative position information of the user's face and the electronic device based on the speech waveform and the audio time delay.
4. The method according to claim 3, wherein determining the relative position information of the user's face and the electronic device based on the speech waveform and the audio time delay comprises:
determining whether the type of the speech waveform meets a specific type; and
determining the relative position information of the user's face and the electronic device based on the audio time delay, in response to the type of the speech waveform meeting the specific type.
5. The method according to claim 4, wherein the multiple sound receivers comprise a first sound receiver and a second sound receiver, the distance between the first sound receiver and the second sound receiver is a specific distance, and determining the relative position information of the user's face and the electronic device based on the audio time delay comprises:
determining a third moment at which the first sound receiver receives the second voice data;
determining a fourth moment at which the second sound receiver receives the second voice data;
determining a first delay difference of the audio time delay based on the third moment and the fourth moment; and
determining the relative position information of the user's face and the electronic device based on the first delay difference and the specific distance.
6. The method according to claim 4, further comprising: processing the second voice data to obtain an audio energy of the second voice data; wherein determining the relative position information of the user's face and the electronic device based on the audio time delay comprises:
determining a target position of the user relative to the electronic device based on the audio energy, in response to the audio energy being greater than a specific energy threshold; and
determining the relative position information of the user's face and the electronic device based on the target position and the audio time delay.
7. The method according to claim 6, wherein determining the target position of the user relative to the electronic device based on the audio energy comprises:
determining a first audio energy and a second audio energy, wherein the first audio energy is used to characterize the user being located in a front region of the electronic device, and the second audio energy is used to characterize the user being located in a side region of the electronic device;
processing the first audio energy and the second audio energy to obtain a processing result; and
determining the target position of the user relative to the electronic device based on the processing result.
8. The method according to claim 6, wherein the multiple sound receivers comprise multiple groups of sound receivers, and the method further comprises:
determining a second delay difference of the audio time delay, in response to the audio energy being less than or equal to the specific energy threshold; and
determining the relative position information of the user's face and the electronic device based on the second delay difference and the position information of the multiple groups of sound receivers.
9. A voice processing apparatus, comprising:
a first receiving module for receiving first voice data of a user at a first moment;
a wake-up module for waking up the electronic device in response to the first voice data meeting a wake-up condition;
a second receiving module for receiving, at a second moment and through a sound receiver, second voice data of the user, the second voice data being used to instruct the electronic device to perform a relevant operation;
a first determining module for determining relative position information of the user's face and the electronic device based on the second voice data, in response to the time span between the second moment and the first moment meeting a first specific time length; and
a first control module for controlling the electronic device to perform the relevant operation based on the second voice data, in response to the relative position information meeting a specific condition.
10. An electronic device, comprising:
a processor; and
a memory for storing executable instructions, wherein, when the instructions are executed by the processor, the processor is caused to perform the method according to any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910584198.5A CN110164443B (en) | 2019-06-28 | 2019-06-28 | Voice processing method and device for electronic equipment and electronic equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110164443A true CN110164443A (en) | 2019-08-23 |
CN110164443B CN110164443B (en) | 2021-09-14 |
Family
ID=67637146
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910584198.5A Active CN110164443B (en) | 2019-06-28 | 2019-06-28 | Voice processing method and device for electronic equipment and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110164443B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110575040A (en) * | 2019-09-09 | 2019-12-17 | 珠海格力电器股份有限公司 | Control method and control terminal of intelligent curtain and intelligent curtain control system |
CN110730115A (en) * | 2019-09-11 | 2020-01-24 | 北京小米移动软件有限公司 | Voice control method and device, terminal and storage medium |
CN113626778A (en) * | 2020-05-08 | 2021-11-09 | 百度在线网络技术(北京)有限公司 | Method, apparatus, electronic device, and computer storage medium for waking up device |
Citations (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060143017A1 (en) * | 2004-12-24 | 2006-06-29 | Kabushiki Kaisha Toshiba | Interactive robot, speech recognition method and computer program product |
JP5326934B2 (en) * | 2009-01-23 | 2013-10-30 | 株式会社Jvcケンウッド | Electronics |
US9378733B1 (en) * | 2012-12-19 | 2016-06-28 | Google Inc. | Keyword detection without decoding |
CN106531190A (en) * | 2016-10-12 | 2017-03-22 | 科大讯飞股份有限公司 | Speech quality evaluation method and device |
CN106653021A (en) * | 2016-12-27 | 2017-05-10 | 上海智臻智能网络科技股份有限公司 | Voice wake-up control method and device and terminal |
US20170206900A1 (en) * | 2016-01-20 | 2017-07-20 | Samsung Electronics Co., Ltd. | Electronic device and voice command processing method thereof |
CN107113498A (en) * | 2014-12-26 | 2017-08-29 | 爱信精机株式会社 | Sound processing apparatus |
CN107123421A (en) * | 2017-04-11 | 2017-09-01 | 广东美的制冷设备有限公司 | Sound control method, device and home appliance |
US20180081352A1 (en) * | 2016-09-22 | 2018-03-22 | International Business Machines Corporation | Real-time analysis of events for microphone delivery |
CN108369476A (en) * | 2015-12-11 | 2018-08-03 | 索尼公司 | Information processing equipment, information processing method and program |
CN108538298A (en) * | 2018-04-04 | 2018-09-14 | 科大讯飞股份有限公司 | voice awakening method and device |
EP3379844A1 (en) * | 2015-11-17 | 2018-09-26 | Sony Corporation | Information processing device, information processing method, and program |
CN108831474A (en) * | 2018-05-04 | 2018-11-16 | 广东美的制冷设备有限公司 | Speech recognition apparatus and its voice signal catching method, device and storage medium |
CN108962250A (en) * | 2018-09-26 | 2018-12-07 | 出门问问信息科技有限公司 | Audio recognition method, device and electronic equipment |
CN109710080A (en) * | 2019-01-25 | 2019-05-03 | 华为技术有限公司 | A kind of screen control and sound control method and electronic equipment |
CN109814718A (en) * | 2019-01-30 | 2019-05-28 | 天津大学 | A kind of multi-modal information acquisition system based on Kinect V2 |
Non-Patent Citations (2)
Title |
---|
XINGWEI SUN: "Effect of Steering Vector Estimation on MVDR Beamformer for Noisy Speech Recognition", 2018 IEEE 23rd International Conference on Digital Signal Processing (DSP) * |
LIU LONGMEI: "Research on Sound Source Localization for Robot Rescue at Disaster Scenes", China Master's Theses Full-text Database, Information Science and Technology * |
Also Published As
Publication number | Publication date |
---|---|
CN110164443B (en) | 2021-09-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP6455686B2 (en) | Distributed wireless speaker system | |
CN108877770B (en) | Method, device and system for testing intelligent voice equipment | |
CN110992974B (en) | Speech recognition method, apparatus, device and computer readable storage medium | |
US9854362B1 (en) | Networked speaker system with LED-based wireless communication and object detection | |
US10075791B2 (en) | Networked speaker system with LED-based wireless communication and room mapping | |
US10339913B2 (en) | Context-based cancellation and amplification of acoustical signals in acoustical environments | |
US20160284350A1 (en) | Controlling electronic device based on direction of speech | |
CN110164443A (en) | Method of speech processing, device and electronic equipment for electronic equipment | |
US11435429B2 (en) | Method and system of acoustic angle of arrival detection | |
CN109286875A (en) | For orienting method, apparatus, electronic equipment and the storage medium of pickup | |
US11574626B2 (en) | Method of controlling intelligent security device | |
US9826332B2 (en) | Centralized wireless speaker system | |
EP2945156A1 (en) | Audio signal recognition method and electronic device supporting the same | |
KR20220117282A (en) | Audio device auto-location | |
US9924286B1 (en) | Networked speaker system with LED-based wireless communication and personal identifier | |
CN104244055B (en) | Real-time interaction method within the scope of the multimedia equipment useful space | |
CN113053368A (en) | Speech enhancement method, electronic device, and storage medium | |
JP2018096961A (en) | Gesture recognition system, and gesture recognition method using the same | |
US11656837B2 (en) | Electronic device for controlling sound and operation method therefor | |
US20210225374A1 (en) | Method and system of environment-sensitive wake-on-voice initiation using ultrasound | |
US20230236318A1 (en) | PERFORMANCE OF A TIME OF FLIGHT (ToF) LASER RANGE FINDING SYSTEM USING ACOUSTIC-BASED DIRECTION OF ARRIVAL (DoA) | |
CN114566171A (en) | Voice awakening method and electronic equipment | |
US11398070B1 (en) | Boundary approximation utilizing radar | |
US10506192B2 (en) | Gesture-activated remote control | |
CN113411649B (en) | TV state detecting device and system using infrasound signal |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||