CN107146614B - Voice signal processing method and device and electronic equipment - Google Patents


Info

Publication number: CN107146614B
Application number: CN201710231244.4A
Authority: CN (China)
Prior art keywords: sound source, user, voice signal, type, voice
Legal status: Active (application granted)
Other languages: Chinese (zh)
Other versions: CN107146614A
Inventors: 李福祥, 李峥
Original and current assignee: Beijing Orion Star Technology Co Ltd
Priority: CN201710231244.4A

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 21/00: Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208: Noise filtering

Abstract

The embodiment of the invention discloses a voice signal processing method, a voice signal processing device, and an electronic device. The method comprises the following steps: receiving a voice signal while the electronic device is in a sleep state, and judging whether the interactive instruction corresponding to the received voice signal is a wake-up instruction; if so, switching from the sleep state to a working state, and locating the sound source position of the received voice signal as the user sound source position; continuing to receive voice signals, and performing noise suppression processing on those of the continuously received voice signals that originate from outside the user sound source position, so as to obtain a user voice signal; and responding to the interactive instruction corresponding to the user voice signal. Because the electronic device performs noise suppression processing on the continuously received voice signals from outside the user sound source position, the user voice signal obtained is the voice signal emitted by the user at the user sound source position, so a correct response can be made and user experience is improved.

Description

Voice signal processing method and device and electronic equipment
Technical Field
The present invention relates to the field of speech signal processing technologies, and in particular, to a speech signal processing method and apparatus, and an electronic device.
Background
Currently, more and more products with a voice interaction function are on the market, such as smart speakers, robots, and other electronic devices. After receiving a wake-up instruction, such an electronic device switches from the standby state to the working state, receives voice signals through a microphone array (i.e., picks up sound), and can then recognize and analyze the voice signals so as to respond to the interactive instructions they correspond to.
After receiving the wake-up instruction, an electronic device with a voice interaction function receives the voice signals emitted by each sound source in the surrounding environment through the microphone array, identifies the sound source direction corresponding to the loudest of those voice signals as the user sound source direction, and responds to the interactive instruction corresponding to that voice signal; that is, the loudest voice signal is regarded as the voice signal emitted by the user.
In general this approach processes voice signals well. However, if one or more sound-emitting objects louder than the user exist around the user, the electronic device identifies the sound source direction of the loudest received signal as the user sound source direction, recognizes and analyzes that loudest signal to obtain an interactive instruction, and consequently makes an incorrect response, resulting in poor user experience.
Disclosure of Invention
The embodiment of the invention discloses a voice signal processing method and device and electronic equipment, which are used for avoiding response errors and improving user experience. The technical scheme is as follows:
in a first aspect, an embodiment of the present invention provides a speech signal processing method, which is applied to an electronic device with a speech interaction function, where the method includes:
receiving a voice signal under the condition that the electronic equipment is in a sleep state, and judging whether an interactive instruction corresponding to the received voice signal is a wake-up instruction or not;
if so, switching from the sleep state to the working state, and positioning the sound source position of the received voice signal as the user sound source position;
continuously receiving voice signals, and performing noise suppression processing on those of the continuously received voice signals that originate from outside the user sound source position, so as to obtain a user voice signal;
and responding to the interactive instruction corresponding to the user voice signal.
Optionally, the step of performing noise suppression processing on the voice signals, among those continuously received, that originate from outside the user sound source position to obtain the user voice signal includes:
performing noise suppression processing on the continuously received voice signals originating from outside the user sound source position, and performing beam enhancement processing on the continuously received voice signals originating from the user sound source position, to obtain the user voice signal.
Optionally, the method further includes:
and indicating the user direction according to the user sound source direction.
Optionally, the method further includes:
judging whether an interactive instruction corresponding to the voice signal received from the user sound source direction is a sound source positioning mode conversion instruction or not;
if so, continuing to receive the voice signals, determining the sound source position corresponding to the voice signal with the maximum volume in the received voice signals as the user sound source position, determining the voice signal with the maximum volume in the received voice signals as the user voice signals, and responding to the interactive instruction corresponding to the user voice signals.
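The default, volume-based localization mode that this conversion instruction switches back to could be sketched as follows (a minimal illustration; the function name and data shapes are assumptions, not from the patent):

```python
def loudest_azimuth(signals):
    """Volume-based localization mode: take the azimuth of the loudest
    received voice signal as the user sound source azimuth.

    signals: list of (azimuth_degrees, volume) pairs, one per sound source.
    """
    azimuth, _volume = max(signals, key=lambda s: s[1])
    return azimuth

# The 90-degree source is loudest, so it is treated as the user.
print(loudest_azimuth([(30, 0.2), (90, 0.9), (270, 0.5)]))  # 90
```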
Optionally, the step of determining whether the interactive instruction corresponding to the received voice signal is a wake-up instruction includes:
judging whether the interactive instruction corresponding to each received voice signal is a wake-up instruction or not according to the following modes:
performing filtering processing on a target voice signal to filter out the part of the target voice signal whose frequency belongs to a preset frequency band, wherein the target voice signal is a received voice signal;
and judging whether the interactive instruction corresponding to the target voice signal after filtering is a wake-up instruction.
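A minimal sketch of the filtering step above, assuming the signal is represented as a magnitude spectrum; the band limits and names are hypothetical, since the patent does not specify the preset frequency band:

```python
def filter_preset_band(spectrum, band=(2000.0, 4000.0)):
    """Zero out spectral components whose frequency falls inside the preset
    frequency band; everything else passes through unchanged.

    spectrum: list of (frequency_hz, magnitude) pairs.
    band: illustrative placeholder limits for the preset frequency band.
    """
    low, high = band
    return [(f, 0.0 if low <= f <= high else m) for f, m in spectrum]

# The 3000 Hz component lies inside the preset band and is removed.
print(filter_preset_band([(500.0, 1.0), (3000.0, 0.8), (5000.0, 0.3)]))
```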
Optionally, the step of locating the sound source position of the received voice signal as the sound source position of the user includes:
positioning and recording the sound source position of the received voice signal as a second type sound source position;
and positioning a user sound source position according to a first type sound source position and a second type sound source position, wherein the first type sound source position is the sound source position of the received voice signal which is positioned and recorded under the condition that the electronic equipment is in a sleep state, and an interactive instruction corresponding to the voice signal is not a wake-up instruction.
Optionally, the step of locating the user sound source position according to the first type of sound source position and the second type of sound source position includes:
judging whether a sound source azimuth which does not belong to the first type sound source azimuth exists in the second type sound source azimuth;
if so, positioning the sound source position which does not belong to the first type sound source position in the second type sound source positions as a user sound source position.
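The localization in the two steps above is essentially a set difference between the two classes of recorded azimuths. A minimal sketch (names and degree values are illustrative assumptions):

```python
def locate_user_azimuths(first_type, second_type):
    """Return the second-type azimuths that do not appear among the
    first-type azimuths.

    first_type:  azimuths recorded while the device was asleep whose signals
                 were not wake-up instructions (i.e., noise sources).
    second_type: azimuths of the signals received with the wake-up instruction.
    """
    noise = set(first_type)
    return [az for az in second_type if az not in noise]

# A noise source at 90 degrees was already heard during sleep; the wake-up
# word additionally arrived from 30 degrees, so 30 is the user azimuth.
print(locate_user_azimuths([90, 270], [90, 30]))  # [30]
```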
Optionally, the step of locating, as a user sound source azimuth, a sound source azimuth that does not belong to the first-class sound source azimuth among the second-class sound source azimuths includes:
determining the number of sound source azimuths not belonging to the first type of sound source azimuths in the second type of sound source azimuths;
and when the determined number is larger than 1, determining the sound source position corresponding to the voice signal which does not belong to the preset frequency band as the user sound source position.
Optionally, the step of determining the sound source position corresponding to the voice signal not belonging to the preset frequency band as the user sound source position includes:
determining the number of sound source positions corresponding to the voice signals that do not belong to the preset frequency band;
and when the determined number is greater than 1, determining, as the user sound source position, the sound source position corresponding to the voice signal, among those not belonging to the preset frequency band, whose waveform similarity with a preset waveform is greater than a first preset value.
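The waveform-similarity test might, for instance, use a normalized correlation; this is only one illustrative reading of "similarity between the waveform and the preset waveform", with assumed names and threshold:

```python
def waveform_similarity(signal, template):
    """Normalized correlation in [-1, 1] between two equal-length waveforms."""
    n = len(signal)
    ms = sum(signal) / n
    mt = sum(template) / n
    num = sum((s - ms) * (t - mt) for s, t in zip(signal, template))
    den = (sum((s - ms) ** 2 for s in signal)
           * sum((t - mt) ** 2 for t in template)) ** 0.5
    return num / den if den else 0.0

def pick_user_azimuths(candidates, template, first_preset_value=0.8):
    """candidates: list of (azimuth, waveform). Keep the azimuths whose
    waveform similarity to the preset template exceeds the first preset
    value (0.8 is an illustrative default)."""
    return [az for az, wf in candidates
            if waveform_similarity(wf, template) > first_preset_value]
```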
Optionally, in a case that the sound source azimuths of the second type all belong to the sound source azimuths of the first type, the method further includes:
judging whether an energy difference value between a first voice signal and a second voice signal in the same sound source direction is larger than a second preset value, wherein the first voice signal is a voice signal received when the electronic equipment is in a sleep state, and the second voice signal is a voice signal received when the electronic equipment is in a working state;
and if so, determining the second type sound source position corresponding to the second voice signal as the user sound source position.
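The energy-difference test above could be sketched as follows (the energy measure and the preset value are assumptions for illustration):

```python
def energy(frame):
    """Sum of squared samples, a simple proxy for signal energy."""
    return sum(x * x for x in frame)

def azimuth_is_user(first_signal, second_signal, second_preset_value):
    """Same-azimuth test: the azimuth is taken as the user's if the
    working-state signal's energy exceeds the sleep-state signal's energy
    by more than the second preset value."""
    return energy(second_signal) - energy(first_signal) > second_preset_value

# A quiet hum during sleep, then loud speech from the same azimuth after
# wake-up: the energy jump marks it as the user sound source azimuth.
print(azimuth_is_user([0.1, -0.1, 0.1], [0.8, -0.7, 0.9], 0.5))  # True
```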
Optionally, in a case that the sound source azimuths of the second type all belong to the sound source azimuths of the first type, the method further includes:
and determining the sound source position corresponding to the voice signal with the similarity between the waveform and the preset waveform being greater than the first preset value in the second type of sound source position as the user sound source position.
Optionally, the step of locating, as a user sound source azimuth, a sound source azimuth that does not belong to the first-class sound source azimuth among the second-class sound source azimuths includes:
determining a sound source azimuth which does not belong to the first type sound source azimuth in the second type sound source azimuth as a target sound source azimuth;
and determining a target range [A, B] according to the target sound source azimuth, and determining a sound source azimuth within the target range as the user sound source azimuth, wherein A is the target sound source azimuth minus a first preset azimuth difference, and B is the target sound source azimuth plus a second preset azimuth difference.
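A minimal sketch of this target-range step; the 10-degree azimuth differences are assumed defaults, not values from the patent:

```python
def target_range(target_azimuth, first_diff, second_diff):
    """[A, B] with A = target azimuth - first preset azimuth difference
    and B = target azimuth + second preset azimuth difference."""
    return target_azimuth - first_diff, target_azimuth + second_diff

def in_user_range(azimuth, target_azimuth, first_diff=10.0, second_diff=10.0):
    """An azimuth inside the target range counts as the user sound source
    azimuth."""
    a, b = target_range(target_azimuth, first_diff, second_diff)
    return a <= azimuth <= b

print(in_user_range(35, 30))  # True: 35 lies within [20, 40]
print(in_user_range(55, 30))  # False
```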
In a second aspect, an embodiment of the present invention further provides a speech signal processing apparatus, which is applied to an electronic device with a speech interaction function, where the apparatus includes:
the wake-up instruction judging module is used for receiving the voice signal under the condition that the electronic equipment is in a sleep state and judging whether the interactive instruction corresponding to the received voice signal is a wake-up instruction or not;
the sound source positioning module is used for switching the sleep state to the working state under the condition that the interactive instruction corresponding to the received voice signal is a wake-up instruction, and positioning the sound source position of the received voice signal as the sound source position of the user;
the user voice signal obtaining module is used for continuously receiving voice signals and carrying out noise suppression processing on voice signals which are sourced from positions except the user sound source in the continuously received voice signals to obtain user voice signals;
and the first interactive instruction response module is used for responding the interactive instruction corresponding to the user voice signal.
Optionally, the user voice signal obtaining module includes:
and the user voice signal obtaining submodule is used for performing noise suppression processing on the continuously received voice signals originating from outside the user sound source direction, and performing beam enhancement processing on the continuously received voice signals originating from the user sound source direction, to obtain the user voice signal.
Optionally, the apparatus further comprises:
and the user direction indicating module is used for indicating the user direction according to the user sound source direction.
Optionally, the apparatus further comprises:
the conversion instruction judging module is used for judging whether an interactive instruction corresponding to the voice signal received from the user sound source direction is a sound source positioning mode conversion instruction or not;
and the second interactive instruction response module is used for, when the interactive instruction corresponding to the voice signal received from the user sound source position is a sound source localization mode conversion instruction, continuing to receive voice signals, determining the sound source position corresponding to the loudest of the received voice signals as the user sound source position, determining that loudest voice signal as the user voice signal, and responding to the interactive instruction corresponding to the user voice signal.
Optionally, the wake-up instruction determining module includes:
a signal filtering submodule and an instruction judging submodule;
the wake-up instruction judging module is specifically used for judging, through the signal filtering submodule and the instruction judging submodule, whether the interactive instruction corresponding to each received voice signal is a wake-up instruction;
the signal filtering submodule is used for performing filtering processing on a target voice signal to filter out the part of the target voice signal whose frequency belongs to the preset frequency band, wherein the target voice signal is a received voice signal;
and the instruction judgment submodule is used for judging whether the interactive instruction corresponding to the filtered target voice signal is a wake-up instruction or not.
Optionally, the sound source localization module includes:
the sound source positioning submodule is used for positioning and recording the sound source position of the received voice signal as a second type of sound source position;
and the user sound source position determining submodule is used for positioning the user sound source position according to a first class sound source position and a second class sound source position, wherein the first class sound source position is the sound source position of the received voice signal which is positioned and recorded under the condition that the electronic equipment is in a sleep state, and the interactive instruction corresponding to the voice signal is not a wake-up instruction.
Optionally, the sub-module for determining the orientation of the sound source of the user includes:
a judging unit configured to judge whether or not a sound source bearing not belonging to the first type sound source bearing exists in the second type sound source bearings;
and a user sound source azimuth determining unit configured to, when a sound source azimuth that does not belong to the first-type sound source azimuth exists in the second-type sound source azimuths, position a sound source azimuth that does not belong to the first-type sound source azimuth in the second-type sound source azimuths as a user sound source azimuth.
Optionally, the user sound source position determining unit includes:
a number determining subunit, configured to determine the number of sound source orientations that do not belong to the first type of sound source orientation in the second type of sound source orientations;
and the first orientation determining subunit is used for determining the sound source orientation corresponding to the voice signal which does not belong to the preset frequency band as the user sound source orientation when the determined number is greater than 1.
Optionally, the first orientation determining subunit is specifically configured to determine the number of sound source orientations corresponding to the voice signals that do not belong to the preset frequency band; and, when the determined number is greater than 1, to determine, as the user sound source orientation, the sound source orientation corresponding to the voice signal, among those not belonging to the preset frequency band, whose waveform similarity with the preset waveform is greater than the first preset value.
Optionally, the apparatus further comprises:
the energy difference judging module is used for judging, in the case that the second-type sound source directions all belong to the first-type sound source directions, whether the energy difference between a first voice signal and a second voice signal in the same sound source direction is greater than a second preset value, wherein the first voice signal is a voice signal received when the electronic device is in the sleep state, and the second voice signal is a voice signal received when the electronic device is in the working state; and, if so, for determining the second-type sound source direction corresponding to the second voice signal as the user sound source direction.
Optionally, the apparatus further comprises:
and the waveform comparison module is used for determining the sound source position corresponding to the voice signal with the similarity between the waveform and the preset waveform being greater than the first preset value in the second type of sound source position as the user sound source position.
Optionally, the sub-module for determining the orientation of the sound source of the user includes:
a target sound source bearing determining unit configured to determine, as a target sound source bearing, a sound source bearing that does not belong to the first-type sound source bearing among the second-type sound source bearings;
and the second azimuth determining unit is used for determining a target range [ A, B ] according to the target sound source azimuth, and determining the sound source azimuth in the target range as the user sound source azimuth, wherein A is the difference value between the target sound source azimuth and a first preset azimuth difference value, and B is the sum of the target sound source azimuth and a second preset azimuth difference value.
In a third aspect, an embodiment of the present invention further provides an electronic device, where the electronic device includes: the device comprises a shell, a processor, a memory, a circuit board and a power circuit, wherein the circuit board is arranged in a space enclosed by the shell, and the processor and the memory are arranged on the circuit board; a power supply circuit for supplying power to each circuit or device of the electronic apparatus; the memory is used for storing executable program codes; the processor executes a program corresponding to the executable program code by reading the executable program code stored in the memory, for executing the above-described voice signal processing method.
In the scheme provided by the embodiment of the invention, an electronic device with a voice interaction function receives a voice signal while in the sleep state and judges whether the interactive instruction corresponding to the received voice signal is a wake-up instruction. If so, it switches from the sleep state to the working state and locates the sound source direction of the received voice signal as the user sound source direction; it then continues to receive voice signals and performs noise suppression processing on those of them originating from outside the user sound source direction to obtain the user voice signal, thereby responding to the interactive instruction corresponding to the user voice signal. Because the electronic device determines the sound source direction corresponding to the wake-up instruction as the user sound source direction and suppresses the continuously received voice signals from all other directions, the user voice signal obtained is the voice signal emitted by the user at the user sound source direction, so a correct response can be made and user experience is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and that those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a speech signal processing method according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a speech signal processing apparatus according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In order to avoid response errors and improve user experience, embodiments of the present invention provide a method and an apparatus for processing a voice signal, and an electronic device.
First, a speech signal processing method according to an embodiment of the present invention will be described below.
It should be noted that the voice signal processing method provided by the embodiment of the present invention may be applied to electronic devices with a voice interaction function (hereinafter referred to as electronic devices), such as smart speakers and robots. Such an electronic device generally has a microphone array, or establishes a communication connection with a microphone array; this connection may be wired or wireless, where a wireless connection may be a Wi-Fi connection, a Bluetooth connection, or the like. The microphone array is used to receive voice signals.
As shown in fig. 1, a voice signal processing method is applied to an electronic device with a voice interaction function, and the method includes:
s101, receiving a voice signal under the condition that the electronic equipment is in a sleep state, judging whether an interactive instruction corresponding to the received voice signal is a wake-up instruction or not, and if so, executing S102;
In one respect, the states of an electronic device can be divided into a sleep state and a working state. When the electronic device is in the sleep state, it needs to be awakened by receiving a wake-up instruction before it switches to the working state. In addition, while in the sleep state the electronic device can still continuously receive the voice signals emitted by sound sources in the surrounding environment; that is, the microphone array remains in operation to some extent. At this point, the electronic device may receive a voice signal and determine whether a wake-up instruction has been received.
After receiving a segment of voice signal, the electronic device performs voice recognition on it and judges whether the corresponding interactive instruction is a wake-up instruction. Specifically, if the voice recognition result of the segment contains a preset wake-up word, the interactive instruction corresponding to that segment is a wake-up instruction. That is to say, after receiving a voice signal, the electronic device may perform voice recognition on it to obtain a recognition result, and then judge whether that recognition result contains the preset wake-up word.
It should be noted that, after receiving the voice signal, the electronic device may perform voice recognition locally to obtain the recognition result, or it may send the voice signal to a server; the server then performs voice recognition and returns the recognition result to the electronic device. Either way, the electronic device obtains the recognition result and can judge whether it contains the preset wake-up word.
For example, suppose the preset wake-up word is "xiao ya". If the voice recognition result corresponding to the voice signal received by the electronic device contains the words "xiao ya", the interactive instruction corresponding to that voice signal is a wake-up instruction. If the recognition result is some other sentence not containing "xiao ya", or the signal carries no semantics at all (for example, the sound emitted by an air conditioner), the corresponding interactive instruction is not a wake-up instruction.
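A minimal sketch of this wake-word check, assuming the recognition result arrives as plain text (the function name is hypothetical):

```python
WAKE_WORD = "xiao ya"  # the preset wake-up word from the example above

def is_wake_up_instruction(recognition_result: str) -> bool:
    """The interactive instruction is a wake-up instruction if and only if
    the speech recognition result contains the preset wake-up word."""
    return WAKE_WORD in recognition_result.lower()

print(is_wake_up_instruction("Xiao ya, what's the weather?"))  # True
print(is_wake_up_instruction("turn on the light"))             # False
```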
S102, switching from a sleep state to a working state, and positioning a sound source position of the received voice signal as a user sound source position;
when the electronic device judges that the interactive instruction corresponding to the received voice signal is a wake-up instruction, the electronic device sends the voice signal to wake up the electronic device so that the electronic device can perform voice interaction with the user to realize functions, and the electronic device needs to be switched from a sleep state to a working state.
Meanwhile, the electronic device may locate the sound source azimuth of the received voice signal and determine it as the user sound source azimuth. It should be noted that the sound source azimuth of a voice signal may be determined with a sound source localization method such as time-delay estimation, that is, from the times at which the voice signal reaches each microphone in the microphone array; this is not specifically limited or described herein.
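As a hedged illustration of time-delay estimation (not the patent's own implementation), a far-field two-microphone array maps the inter-microphone arrival delay to an angle via an arcsine:

```python
import math

SPEED_OF_SOUND = 343.0  # m/s at room temperature

def tdoa_azimuth(delay_s, mic_spacing_m):
    """Far-field direction of arrival for a two-microphone array.

    delay_s: arrival-time difference between the two microphones (seconds).
    The returned angle is measured from the array's broadside direction.
    """
    ratio = SPEED_OF_SOUND * delay_s / mic_spacing_m
    ratio = max(-1.0, min(1.0, ratio))  # clamp against measurement noise
    return math.degrees(math.asin(ratio))

# Zero delay means the source is broadside to the array.
print(tdoa_azimuth(0.0, 0.1))  # 0.0
```

Real arrays have more microphones and resolve a full azimuth, but the delay-to-angle relation above is the core of the method.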
It can be understood that, if the electronic device determines that the interactive instruction corresponding to the received voice signal is not the wake-up instruction, the electronic device does not transition to the working state, but continues to receive the voice signal in the sleep state, and continues to determine whether the interactive instruction corresponding to the received voice signal is the wake-up instruction.
S103, continuously receiving voice signals, and performing noise suppression processing on voice signals which are originated from the position except the user sound source in the continuously received voice signals to obtain user voice signals;
after the sound source position of the user is determined, the user generally continues to send out the voice signal, so that the electronic equipment can continue to receive the voice signal, and carry out noise suppression processing on the voice signal which is originated from the position except the sound source position of the user in the voice signal which is continuously received, so as to obtain the voice signal of the user.
It can be understood that, because there may be sound sources in other directions than the user sound source direction, that is, noise sound sources, and these noise sound sources may also emit voice signals, the electronic device may also receive the voice signals emitted by these noise sound sources, and in order to better receive the voice signals emitted by the user, that is, the voice signals originating from the user sound source direction, the electronic device may perform noise suppression processing on the voice signals originating from other directions than the user sound source direction, so as to weaken the energy originating from the voice signals originating from other directions than the user sound source direction, and further obtain the user voice signals.
The noise suppression processing may be any conventional noise suppression processing method, for example, an end point detection method, a noise separation method, a spectrum filtering method, etc., as long as the purpose of attenuating the energy of the speech signal is achieved, and is not particularly limited herein.
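As one hedged illustration (not the patent's method), direction-dependent attenuation can be sketched as a gain mask over per-direction signals; the tolerance and attenuation values are assumptions:

```python
def suppress_off_axis(frames, user_azimuth, tolerance_deg=15.0, attenuation=0.1):
    """frames: list of (estimated_azimuth, samples). Signals arriving from
    outside the user sound source azimuth (beyond the tolerance) are
    attenuated; signals from the user azimuth pass through unchanged."""
    out = []
    for az, samples in frames:
        gain = 1.0 if abs(az - user_azimuth) <= tolerance_deg else attenuation
        out.append((az, [gain * x for x in samples]))
    return out

# The 120-degree noise source is attenuated; the 30-degree user is kept.
print(suppress_off_axis([(30, [1.0, -1.0]), (120, [1.0])], 30))
```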
And S104, responding to the interactive instruction corresponding to the user voice signal.
After the user voice signal is obtained, the electronic device can respond to the interactive instruction corresponding to the user voice signal. The electronic device may respond to the interactive instruction in various forms such as voice playback; if the electronic device has a display screen, it may also respond to the interactive instruction through the display screen.
For example, if the interactive instruction corresponding to the user voice signal is to play a piece of music, the electronic device may acquire a locally stored music resource, or request the music resource from the server, and then play it. If the interactive instruction corresponding to the user voice signal is to inquire about the day's weather, the electronic device may request the weather information from the server and then inform the user of the weather conditions, for example by voice playback, thereby completing the response to the interactive instruction.
Therefore, in the solution provided in the embodiment of the present invention, when the electronic device with a voice interaction function is in the sleep state, it receives a voice signal and determines whether the interactive instruction corresponding to the received voice signal is the wake-up instruction. If so, the electronic device switches from the sleep state to the working state and locates the sound source direction of the received voice signal as the user sound source direction; it then continues to receive voice signals, performs noise suppression processing on voice signals originating from directions other than the user sound source direction among the continuously received voice signals to obtain the user voice signal, and responds to the interactive instruction corresponding to the user voice signal. Because the electronic device determines the sound source direction corresponding to the wake-up instruction as the user sound source direction and suppresses voice signals from other directions in the continuously received voice signals, the obtained user voice signal is the voice signal emitted by the user at the user sound source direction, so a correct response can be made and user experience is improved.
As an implementation manner of the embodiment of the present invention, the step of performing noise suppression processing on voice signals originating from directions other than the user sound source direction among the continuously received voice signals to obtain the user voice signal may include:
performing noise suppression processing on continuously received voice signals originating from directions other than the user sound source direction, and performing beam enhancement processing on continuously received voice signals originating from the user sound source direction, to obtain the user voice signal.
In order to make the obtained user voice signal stronger, so that the electronic device can respond more accurately to the corresponding interactive instruction, the electronic device may perform noise suppression processing on continuously received voice signals originating from directions other than the user sound source direction, and at the same time perform beam enhancement processing on continuously received voice signals originating from the user sound source direction, so as to increase the energy of the voice signals originating from the user sound source direction.
The beam enhancement processing may be any conventional beam enhancement method, for example, a speech extraction and separation method, a diagonal loading algorithm, an adaptive beamforming method, etc., as long as it increases the energy of the voice signal from the user sound source direction; it is not specifically limited herein.
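As a minimal illustration of beam enhancement (again not the patent's specified algorithm), a delay-and-sum beamformer aligns each microphone channel by its steering delay so the look-direction signal adds coherently; the two-microphone setup and integer sample delays are simplifying assumptions:

```python
import numpy as np

def delay_and_sum(mic_signals, steering_delays):
    """Delay-and-sum beamformer: undo each channel's steering delay
    (in samples) so signals from the look direction add coherently."""
    aligned = [np.roll(sig, -d) for sig, d in zip(mic_signals, steering_delays)]
    return np.mean(aligned, axis=0)

# An impulse from the look direction arrives 3 samples later at mic 2.
pulse = np.zeros(16)
pulse[5] = 1.0
mic1, mic2 = pulse, np.roll(pulse, 3)
enhanced = delay_and_sum([mic1, mic2], steering_delays=[0, 3])
```

Signals from other directions, whose delays do not match the steering delays, average toward zero, which is the energy-increasing effect the text describes.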
As an implementation manner of the embodiment of the present invention, the method may further include:
and indicating the user direction according to the user sound source direction.
In order to let the user see the currently determined user sound source direction, the electronic device may indicate the user direction according to the user sound source direction. In one embodiment, the electronic device may indicate the user direction by means of an indicator light; for example, if the user sound source direction is the 45-degree direction, the electronic device may illuminate the indicator light at the 45-degree direction. In another embodiment, if the electronic device has a display screen, it may also display the user sound source direction on the display screen, or display an indicator on the display screen. In another embodiment, if the electronic device is a robot or another electronic device with movable parts, it may also indicate the user direction by turning its head, swinging an arm, or the like.
As an implementation manner of the embodiment of the present invention, the method may further include:
judging whether an interactive instruction corresponding to the voice signal received from the user sound source direction is a sound source positioning mode conversion instruction or not; if so, continuing to receive the voice signals, determining the sound source position corresponding to the voice signal with the maximum volume in the received voice signals as the user sound source position, determining the voice signal with the maximum volume in the received voice signals as the user voice signals, and responding to the interactive instruction corresponding to the user voice signals.
Because the application scenario of the electronic device may change, and the electronic device may need to respond to interactive instructions issued by multiple users, the electronic device can, upon receiving a voice signal from the user sound source direction, judge whether the interactive instruction corresponding to that voice signal is the sound source localization mode conversion instruction. If so, the user has issued the sound source localization mode conversion instruction, indicating that the application scenario has changed, and the electronic device needs to respond to it by converting its sound source localization mode.
Specifically, if the interactive instruction corresponding to the voice signal is the sound source localization mode conversion instruction, the electronic device continues to receive voice signals and converts its sound source localization mode. The converted mode is: determine the sound source direction corresponding to the voice signal with the maximum volume among the received voice signals as the user sound source direction. It can be understood that the electronic device then determines the voice signal with the maximum volume among the received voice signals as the user voice signal, and responds to the corresponding interactive instruction. Thus, when multiple users at different directions issue interactive instructions, the electronic device can receive the interactive instruction issued by each user, instead of treating a fixed direction as the user sound source direction.
Of course, after the user sound source is located using the converted sound source localization mode, the electronic device may still perform noise suppression processing on continuously received voice signals originating from directions other than the user sound source direction, and may also perform beam enhancement processing on continuously received voice signals originating from the user sound source direction, so as to obtain the user voice signal.
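The converted localization mode (follow the loudest source) can be sketched as follows; representing volume by RMS amplitude is an assumption, since the patent does not fix how volume is measured:

```python
def rms(signal):
    """Root-mean-square amplitude, used here as a proxy for volume."""
    return (sum(x * x for x in signal) / len(signal)) ** 0.5

def loudest_bearing(signals_by_bearing):
    """In the converted mode, the bearing whose signal has the greatest
    volume is taken as the user sound source bearing."""
    return max(signals_by_bearing, key=lambda b: rms(signals_by_bearing[b]))

# Hypothetical per-bearing signal snippets (degrees -> samples).
signals = {0: [0.1, -0.1, 0.1], 90: [0.8, -0.7, 0.9], 180: [0.2, 0.1, -0.2]}
```

Here `loudest_bearing(signals)` picks the 90-degree bearing, matching the behavior described: whichever of several users currently speaks loudest is followed, rather than a fixed direction.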
Since the process of the electronic device determining whether the interactive instruction corresponding to each received voice signal is the wake-up instruction is the same, as an implementation manner of the embodiment of the present invention, the step of determining whether the interactive instruction corresponding to the received voice signal is the wake-up instruction includes:
judging whether the interactive instruction corresponding to each received voice signal is a wake-up instruction or not according to the following modes:
carrying out filtering processing on a target voice signal, and filtering out a voice signal of which the frequency belongs to a preset frequency band in the target voice signal, wherein the target voice signal is as follows: a received voice signal;
and judging whether the interactive instruction corresponding to the target voice signal after filtering is a wake-up instruction.
It is understood that the electronic device may be located in an environment with multiple sound sources and may therefore receive voice signals from sound sources in the surrounding environment. For example, in a home environment, the electronic device may receive voice signals from home appliances such as a television and a refrigerator, or voice signals from outside a window. The electronic device may filter each received voice signal in order to remove such noise signals and more accurately locate the user sound source direction.
In particular, the frequency range of the sound emitted by a person is typically 100 Hz to 20000 Hz. Therefore, in order to effectively remove the adverse effect, on locating the user sound source direction, of voice signals that do not fall within the frequency range of the voice signal emitted by the user, the electronic device may, before judging whether the interactive instruction corresponding to the received voice signal is the wake-up instruction, filter the target voice signal to remove the components whose frequency belongs to a preset frequency band, and then judge whether the interactive instruction corresponding to the filtered target voice signal is the wake-up instruction. Here, the target voice signal refers to a voice signal received by the electronic device in the sleep state.
The preset frequency band may be one or more frequency bands that do not belong to the frequency range of the sound emitted by a person. It may be a low frequency band, for example 0-100 Hz, or a high frequency band, for example above 20000 Hz, and is not specifically limited herein.
Voice signals whose frequencies belong to the preset frequency band often exist in the usage environment of the electronic device. For example, the frequency of the signal emitted by a bass (subwoofer) speaker is generally a few dozen hertz, which clearly does not belong to the frequency range of the voice signal emitted by a person. Such signals can be removed by the filtering processing described above, which reduces the workload of subsequently locating the user sound source direction and at the same time makes the localization of the user sound source more accurate.
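As an illustrative sketch of this pre-filtering step (FFT masking is an assumption; any band-stop filter would serve), components inside the preset band can be zeroed before wake-word detection:

```python
import numpy as np

def filter_preset_band(signal, fs, band):
    """Remove spectral components inside the preset band (e.g. 0-100 Hz
    subwoofer energy) before wake-word detection."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    spectrum[(freqs >= band[0]) & (freqs <= band[1])] = 0.0
    return np.fft.irfft(spectrum, n=len(signal))

# One second at 8 kHz: a 50 Hz hum mixed with a 1 kHz voice-band tone.
fs = 8000
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 1000 * t)
hum = 0.5 * np.sin(2 * np.pi * 50 * t)
filtered = filter_preset_band(tone + hum, fs, band=(0.0, 100.0))
```

After filtering, only the voice-band tone remains, so the hum no longer disturbs wake-word judgment or subsequent localization.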
As an implementation manner of the embodiment of the present invention, for a case where there are multiple sound sources in an environment where the electronic device is located, the step of locating a sound source bearing of the received voice signal as a sound source bearing of the user includes:
positioning and recording the sound source position of the received voice signal as a second type sound source position;
and positioning a user sound source position according to a first type sound source position and a second type sound source position, wherein the first type sound source position is the sound source position of the received voice signal which is positioned and recorded under the condition that the electronic equipment is in a sleep state, and an interactive instruction corresponding to the voice signal is not a wake-up instruction.
When the electronic device is in the sleep state and judges that the interactive instruction corresponding to a received voice signal is not the wake-up instruction, it can locate and record the sound source direction of that voice signal.
At this time, since the electronic device is in the sleep state and the interactive instruction corresponding to the received voice signal is not the wake-up instruction, the voice signal can be understood as having been emitted by a noise sound source rather than by the user, and it does not trigger further processing. The electronic device can therefore record the sound source directions of such voice signals as first type sound source directions, that is, as directions of noise sound sources, and continue to receive voice signals.
When the electronic equipment judges that the interactive instruction corresponding to the received voice signal is a wake-up instruction, the electronic equipment can locate the sound source position of the currently received voice signal and record the sound source position as a second type of sound source position.
After recording the first type sound source directions and the second type sound source direction, the electronic device can locate the user sound source direction according to them. Note that the voice signals received in the sleep state may vary: over time, some sound sources may stop emitting voice signals, while sound sources that previously emitted none may begin to.
For example, while the electronic device is in the sleep state, a television and an air conditioner may be emitting voice signals; after a period of time the television may be turned off, so the first type sound source direction corresponding to the television no longer exists, and later a computer may be turned on and start playing music, so the sound source direction corresponding to the computer appears among the first type sound source directions. As another example, while the electronic device is in the sleep state, a person may emit a voice signal at some point whose corresponding interactive instruction is not the wake-up instruction; the electronic device does not switch to the working state, but records the person's direction among the first type sound source directions at that point, and after a period of time the person may stop speaking. The first type sound source directions may therefore change over time.
Because first type sound source directions recorded long before the moment the electronic device switches from the sleep state to the working state may differ considerably from the second type sound source direction, the user sound source direction can be located more simply and accurately by using the second type sound source direction together with only the first type sound source directions recorded within a preset time period before that moment. The preset time period may be determined by a person skilled in the art according to factors such as the usage scenario of the electronic device; for example, it may be 2 seconds, 3 seconds, or 5 seconds, and is not limited herein.
In one embodiment, the manner of locating the user's sound source bearing according to the first type of sound source bearing and the second type of sound source bearing may be: judging whether a sound source azimuth which does not belong to the first type sound source azimuth exists in the second type sound source azimuth; if so, positioning the sound source azimuth which does not belong to the first type sound source azimuth in the second type sound source azimuth as the user sound source azimuth.
It is understood that if there is a sound source bearing in the second type of sound source bearing that does not belong to the first type of sound source bearing, then the sound source bearing in the second type of sound source bearing that does not belong to the first type of sound source bearing is: when the electronic device is switched from the sleep state to the working state, the sound source position which is positioned and does not belong to the first type of sound source position can be determined as the sound source position of the voice signal which is sent by the user and has the corresponding interactive instruction as the awakening instruction, and the sound source position is the sound source position of the user.
For example, suppose 3 first type sound source directions were recorded within the preset time period before the electronic device switched from the sleep state to the working state: the 0-degree, 30-degree and 90-degree directions; and when the electronic device switched to the working state, 4 second type sound source directions were recorded: the 0-degree, 30-degree, 60-degree and 90-degree directions. Obviously, the 60-degree sound source direction is one that newly appeared when the electronic device switched from the sleep state to the working state, and at that moment the electronic device had just received the voice signal whose corresponding interactive instruction is the wake-up instruction. It can therefore be determined that the 60-degree direction is the sound source direction of the voice signal emitted by the user whose corresponding interactive instruction is the wake-up instruction, that is, the user sound source direction.
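The localization rule in this example reduces to a set difference between the two classes of recorded bearings; a minimal sketch:

```python
def new_bearings(first_type, second_type):
    """Bearings located at wake-up (second type) that were not already
    noise bearings recorded during sleep (first type)."""
    return [b for b in second_type if b not in set(first_type)]

first_type = [0, 30, 90]        # noise bearings recorded while asleep
second_type = [0, 30, 60, 90]   # bearings located at wake-up
user_bearing = new_bearings(first_type, second_type)  # the 60-degree bearing
```

In practice the comparison would tolerate small angular deviations rather than require exact equality; exact matching here is a simplification.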
As an implementation manner of the embodiment of the present invention, the step of locating, as a user sound source azimuth, a sound source azimuth that does not belong to the first-type sound source azimuth in the second-type sound source azimuth may include:
determining the number of sound source azimuths not belonging to the first type of sound source azimuths in the second type of sound source azimuths; and when the determined number is larger than 1, determining the sound source position corresponding to the voice signal which does not belong to the preset frequency band as the sound source position of the user.
In some cases, while the electronic device receives the voice signal whose corresponding interactive instruction is the wake-up instruction, there may be another or more other sound sources whose sound source orientations do not belong to the first type of sound source orientation, and these other sound sources also emit voice signals, so the electronic device will also receive these voice signals. For example, when the user sends a voice signal that the corresponding interactive instruction is a wake-up instruction, the bass sound equipment is turned on to send a voice signal, and then the electronic equipment receives the voice signal sent by the user and the voice signal sent by the bass sound equipment.
In this case, in order to accurately locate the user sound source bearing, the electronic device may first determine the number of sound source bearings that do not belong to the first type of sound source bearing in the second type of sound source bearing, and if the determined number is greater than 1, which indicates that the number of sound source bearings that do not belong to the first type of sound source bearing in the second type of sound source bearing is plural at this time, the electronic device may determine the sound source bearing corresponding to the speech signal that does not belong to the preset frequency band as the user sound source bearing.
For example, suppose that while the user emits the voice signal whose corresponding interactive instruction is the wake-up instruction, a bass speaker is turned on and emits a signal, so the electronic device receives both. The electronic device can determine that the number of second type sound source directions not belonging to the first type sound source directions is 2, which is greater than 1, and can then determine the sound source direction corresponding to the voice signal not belonging to the preset frequency band as the user sound source direction. Because the frequency of the signal emitted by the bass speaker falls within a fixed low-frequency range, setting the preset frequency band to that range allows the direction of the bass speaker to be excluded accurately, so the electronic device can determine the user sound source direction correctly.
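One way to realize "the bearing whose signal does not belong to the preset frequency band", sketched under the simplifying assumption that each candidate signal is dominated by a single frequency:

```python
import numpy as np

def bearings_outside_band(candidates, fs, preset_band):
    """Keep candidate bearings whose signal's dominant frequency lies
    outside the preset band (e.g. the subwoofer's low-frequency range)."""
    kept = []
    for bearing, sig in candidates.items():
        freqs = np.fft.rfftfreq(len(sig), d=1.0 / fs)
        dominant = freqs[np.argmax(np.abs(np.fft.rfft(sig)))]
        if not (preset_band[0] <= dominant <= preset_band[1]):
            kept.append(bearing)
    return kept

fs = 8000
t = np.arange(800) / fs
candidates = {
    60: np.sin(2 * np.pi * 300 * t),   # user speech at 300 Hz
    120: np.sin(2 * np.pi * 40 * t),   # bass speaker at 40 Hz
}
```

With the preset band set to 0-100 Hz, the 120-degree (bass) bearing is rejected and the 60-degree bearing survives as the user sound source direction.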
As an implementation manner of the embodiment of the present invention, the step of determining the sound source direction corresponding to a voice signal that does not belong to the preset frequency band as the user sound source direction may include:
determining the number of sound source directions corresponding to the voice signals that do not belong to the preset frequency band; and when the determined number is greater than 1, determining, among those voice signals, the sound source direction corresponding to the voice signal whose waveform similarity with a preset waveform is greater than a first preset value as the user sound source direction.
Since the number of sound source orientations corresponding to the voice signals not belonging to the preset frequency band may also be greater than 1 in some cases, that is, there may be a plurality of sound source orientations corresponding to the voice signals not belonging to the preset frequency band, at this time, in order to accurately determine the user sound source orientation, the electronic device may further determine the user sound source orientation through waveform comparison of the voice signals.
It can be understood that the sound source direction of the user is the sound source direction corresponding to the wake-up instruction sent by the user, and the preset waveform may be the waveform of the voice signal corresponding to the wake-up word, so that the waveform with the similarity to the preset waveform greater than the first preset value is obviously the waveform with the high similarity to the waveform of the voice signal corresponding to the wake-up word, which means that the interaction instruction corresponding to the voice signal is likely to be the wake-up instruction, and the sound source direction of the voice signal is also the sound source direction of the user. The first preset value may be set by a person skilled in the art according to factors such as waveform characteristics of a voice signal emitted by a sound source existing in a usage scene of the electronic device, and is not specifically limited herein.
For example, suppose that while the user emits the voice signal whose corresponding interactive instruction is the wake-up instruction, another person also speaks, and the frequency of neither signal belongs to the preset frequency band. The electronic device determines that the number of sound source directions corresponding to voice signals not belonging to the preset frequency band is greater than 1. It can then compare the waveforms of these voice signals with the waveform corresponding to the preset wake-up word, and determine the sound source direction of the voice signal whose similarity is greater than the first preset value as the user sound source direction. In this way, waveform comparison of voice signals allows the user sound source direction to be determined more accurately.
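The patent does not specify the similarity measure; peak normalized cross-correlation is one plausible choice, sketched here against a synthetic wake-word template:

```python
import numpy as np

def waveform_similarity(signal, template):
    """Peak normalized cross-correlation between a received waveform and
    the preset wake-word waveform; 1.0 means an exact (shifted) match."""
    s = (signal - signal.mean()) / (signal.std() + 1e-12)
    p = (template - template.mean()) / (template.std() + 1e-12)
    return float(np.max(np.correlate(s, p, mode="full")) / len(template))

rng = np.random.default_rng(1)
# Stand-in for the wake-word waveform; real templates come from recordings.
template = np.sin(2 * np.pi * 5 * np.arange(1000) / 1000)
noise = rng.standard_normal(1000)
```

A signal matching the template scores near 1.0, while unrelated noise scores near 0, so "similarity greater than the first preset value" maps to a threshold on this score.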
It should be noted that when the number of second type sound source directions not belonging to the first type sound source directions is determined to be greater than 1, it is also reasonable to first use the above waveform comparison to find the sound source directions of voice signals whose waveforms are highly similar to the preset waveform, and, if the number found is still greater than 1, to then determine the sound source direction corresponding to the voice signal not belonging to the preset frequency band as the user sound source direction.
As an implementation manner of the embodiment of the present invention, in a case that the sound source azimuths of the second type all belong to the sound source azimuths of the first type, the method may further include:
judging whether the energy difference value of the first voice signal and the second voice signal in the same sound source direction is larger than a second preset value or not; if so, determining a second type sound source position corresponding to the second voice signal as the user sound source position, wherein the first voice signal is a voice signal received when the electronic equipment is in a sleep state, and the second voice signal is a voice signal received when the electronic equipment is in a working state.
When the user emits the voice signal whose corresponding interactive instruction is the wake-up instruction, the user's direction may coincide with one of the first type sound source directions, in which case all second type sound source directions located by the electronic device belong to the first type sound source directions. In this case, the electronic device can distinguish the user by comparing the energy of the voice signals received in the two states. The energy of a voice signal may be represented by volume, frequency, waveform characteristics, and the like, and is not specifically limited herein.
For convenience of description, the first voice signal refers to a voice signal received when the electronic device is in a sleep state, and a corresponding sound source direction is a first type sound source direction, and the second voice signal refers to a voice signal received when the electronic device is in an operating state, and a corresponding sound source direction is a second type sound source direction. It should be further noted that the second preset value may be set by a person skilled in the art according to factors such as energy of a voice signal emitted by a sound source existing in a usage scenario of the electronic device, and is not specifically limited herein.
If the energy difference value between the first voice signal and the second voice signal in the same sound source direction is larger than a second preset value, the fact that the first voice signal and the second voice signal are probably not the voice signals sent by the same sound source is proved. For example, if the first voice signal and the second voice signal are both voice signals sent by a refrigerator, the energy difference between the first voice signal and the second voice signal is very small, and is not greater than the second preset value; if the first voice signal is a voice signal sent by the refrigerator and the second voice signal is a voice signal sent by the user, the energy difference between the first voice signal and the second voice signal is generally larger than a second preset value. Therefore, when the energy difference between the first voice signal and the second voice signal in the same sound source direction is greater than the second preset value, the electronic device can determine the second type of sound source direction corresponding to the second voice signal as the user sound source direction.
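A sketch of this energy-difference test; representing energy as RMS amplitude and the example threshold are assumptions not fixed by the patent:

```python
def rms_energy(signal):
    """Root-mean-square amplitude as a simple energy measure."""
    return (sum(x * x for x in signal) / len(signal)) ** 0.5

def bearing_is_user(sleep_signal, awake_signal, second_preset_value):
    """A bearing already known as a noise bearing is re-classified as the
    user bearing if its signal energy jumps by more than the threshold
    (the second preset value) once the device wakes up."""
    return abs(rms_energy(awake_signal) - rms_energy(sleep_signal)) > second_preset_value

# Hypothetical snippets: a steady fridge hum, and the same bearing once
# the user starts speaking over the hum.
fridge_hum = [0.05, -0.05, 0.05, -0.05]
hum_plus_speech = [0.9, -0.8, 0.85, -0.95]
```

The steady hum produces no energy jump between states, while the user speaking at the same bearing does, matching the refrigerator example above.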
As an implementation manner of the embodiment of the present invention, in a case that the sound source azimuths of the second type all belong to the sound source azimuths of the first type, the method may further include:
and determining the sound source position corresponding to the voice signal with the similarity between the waveform and the preset waveform being greater than the first preset value in the second type of sound source position as the user sound source position.
In the case that the second type sound source directions all belong to the first type sound source directions, the electronic device may also determine the user sound source direction by comparing voice signal waveforms. The specific implementation is similar to the waveform comparison manner described above and is not repeated here.
It should be noted that, if there are a plurality of second voice signals in the same sound source direction, where the energy difference between the first voice signal and the second voice signal is greater than the second preset value, the sound source direction of the user may also be determined by comparing the similarity between the waveforms of the plurality of second voice signals and the preset waveforms.
As an implementation manner of the embodiment of the present invention, the step of locating, as a user sound source azimuth, a sound source azimuth that does not belong to the first-type sound source azimuth in the second-type sound source azimuth may include:
determining a sound source azimuth which does not belong to the first type sound source azimuth in the second type sound source azimuth as a target sound source azimuth;
and determining a target range [A, B] according to the target sound source direction, and determining the sound source direction within the target range as the user sound source direction, wherein A is the target sound source direction minus a first preset direction difference, and B is the target sound source direction plus a second preset direction difference.
It can be understood that, while emitting the voice signal, the user may move within a small range, so the sound source direction of the user's voice signal changes accordingly. In order to receive the voice signal accurately in this case, the electronic device may determine the second type sound source direction not belonging to the first type sound source directions as the target sound source direction, then determine a target range [A, B] according to the target sound source direction, and determine the sound source direction within the target range as the user sound source direction.
Here, A may be the target sound source direction minus the first preset direction difference, and B the target sound source direction plus the second preset direction difference. The first and second preset direction differences may be equal or unequal, and their values may be set by a person skilled in the art according to the usage scenario of the electronic device and the user's mobility; for example, they may be 10 degrees, 15 degrees, 30 degrees, and the like, and are not specifically limited herein.
In one embodiment, the first and second preset direction differences may be equal: for example, if the target sound source direction is the 60-degree direction and both differences are 30 degrees, the electronic device determines the sound source direction within the range of (60 - 30 = 30) degrees to (60 + 30 = 90) degrees as the final user sound source direction. In another embodiment, they may be unequal: for example, if the target sound source direction is the 60-degree direction, the first preset direction difference is 10 degrees, and the second is 15 degrees, the electronic device determines the sound source direction within the range of (60 - 10 = 50) degrees to (60 + 15 = 75) degrees as the final user sound source direction, which is equally reasonable.
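The target-range computation above can be sketched directly (ignoring 0/360-degree wrap-around for simplicity):

```python
def user_bearing_range(target_bearing, first_diff, second_diff):
    """Target range [A, B]: A = target minus the first preset direction
    difference, B = target plus the second preset direction difference."""
    return (target_bearing - first_diff, target_bearing + second_diff)

def bearing_in_range(bearing, bearing_range):
    """A located bearing inside [A, B] is still treated as the user's."""
    return bearing_range[0] <= bearing <= bearing_range[1]
```

With a 60-degree target and equal 30-degree differences this yields the range (30, 90) from the first example; with differences of 10 and 15 degrees it yields (50, 75) from the second.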
Corresponding to the above method embodiment, an embodiment of the present invention further provides a speech signal processing apparatus, and the following describes a speech signal processing apparatus provided in an embodiment of the present invention.
As shown in fig. 2, a speech signal processing apparatus applied to an electronic device with a speech interaction function includes:
a wake-up instruction determining module 210, configured to receive a voice signal when the electronic device is in a sleep state, and determine whether an interactive instruction corresponding to the received voice signal is a wake-up instruction;
the sound source positioning module 220 is configured to, when the interactive instruction corresponding to the received voice signal is the wake-up instruction, switch from the sleep state to the working state and locate the sound source direction of the received voice signal as the user sound source direction;
a user voice signal obtaining module 230, configured to continue receiving voice signals, and to perform noise suppression processing on the voice signals, among the continuously received voice signals, that originate from directions other than the user sound source direction, so as to obtain the user voice signal;
and a first interactive instruction response module 240, configured to respond to an interactive instruction corresponding to the user voice signal.
Therefore, in the solution provided in the embodiment of the present invention, when the electronic device with a voice interaction function is in a sleep state, it receives a voice signal and determines whether the interactive instruction corresponding to the received voice signal is a wake-up instruction. If so, the electronic device switches from the sleep state to the working state and locates the sound source direction of the received voice signal as the user sound source direction. It then continues to receive voice signals and performs noise suppression processing on those that originate from directions other than the user sound source direction, obtaining the user voice signal, so that it can respond to the interactive instruction corresponding to the user voice signal. Because the electronic device determines the sound source direction corresponding to the wake-up instruction as the user sound source direction and suppresses the continuously received voice signals from all other directions, the obtained user voice signal is the voice signal emitted by the user in the user sound source direction; the device can therefore respond correctly, improving user experience.
As an implementation manner of the embodiment of the present invention, the user speech signal obtaining module 230 may include:
a user voice signal obtaining sub-module (not shown in fig. 2) configured to perform noise suppression processing on the voice signals, among the continuously received voice signals, that originate from directions other than the user sound source direction, and to perform beam enhancement processing on the voice signals that originate from the user sound source direction, so as to obtain the user voice signal.
By performing beam enhancement processing on the continuously received voice signals originating from the user sound source direction, the electronic device increases the energy of those signals. It can then take the beam-enhanced voice signals as the user voice signal, analyze and recognize the user voice signal more accurately, obtain the correct interactive instruction, and respond to that instruction correctly.
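The patent does not specify a particular beamforming algorithm. As one common illustration, a delay-and-sum beamformer steers a microphone array toward the user sound source direction, reinforcing signals arriving from that bearing while leaving other directions unenhanced. The function names, the far-field assumption, and the linear two-microphone geometry below are assumptions for the sketch:

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # meters per second in air

def delay_and_sum(mic_signals, mic_positions, azimuth_deg, fs):
    """Steer a linear microphone array toward azimuth_deg (delay-and-sum).

    mic_signals: (n_mics, n_samples) array of simultaneously sampled channels.
    mic_positions: per-microphone position along the array axis, in meters.
    Returns the beam-enhanced single-channel signal.
    """
    theta = np.deg2rad(azimuth_deg)
    out = np.zeros(mic_signals.shape[1])
    for sig, pos in zip(mic_signals, mic_positions):
        # Far-field time-of-arrival difference for a source at this azimuth.
        delay = pos * np.cos(theta) / SPEED_OF_SOUND
        shift = int(round(delay * fs))
        out += np.roll(sig, -shift)  # undo the inter-mic delay, then sum
    return out / len(mic_signals)
```

When the steering azimuth matches the true source bearing, the per-microphone copies add coherently and the output energy is maximal; steering elsewhere leaves them misaligned and partially cancelling.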
As an implementation manner of the embodiment of the present invention, the apparatus may further include:
a user direction indicating module (not shown in fig. 2) for indicating the user direction according to the user sound source direction.
The electronic device can indicate the user direction according to the user sound source direction, allowing the user to conveniently check the currently located user sound source direction.
As an implementation manner of the embodiment of the present invention, the apparatus may further include:
a conversion instruction determining module (not shown in fig. 2) configured to determine whether an interactive instruction corresponding to the voice signal received from the user sound source direction is a sound source positioning mode conversion instruction;
a second interactive instruction response module (not shown in fig. 2), configured to, when the interactive instruction corresponding to the voice signal received from the user sound source direction is a sound source positioning mode switching instruction, continue to receive voice signals, determine the sound source direction corresponding to the voice signal with the maximum volume among the received voice signals as the user sound source direction, determine that maximum-volume voice signal as the user voice signal, and respond to the interactive instruction corresponding to the user voice signal.
When the electronic device receives a voice signal from the user sound source direction, it can determine whether the interactive instruction corresponding to that voice signal is a sound source positioning mode switching instruction. If so, the user's issuing of the switching instruction indicates that the application scenario of the electronic device has changed; the electronic device can then respond to the switching instruction, respond to interactive instructions issued by multiple users, and respond to those instructions more accurately.
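In this positioning mode the device simply follows the loudest source. A minimal sketch of that selection rule (not part of the claimed method; the function name, the bearing-keyed dictionary, and the use of RMS energy as the volume measure are assumptions):

```python
import numpy as np

def select_loudest_source(signals_by_bearing):
    """Given {bearing_deg: samples} for each localized source, return the
    bearing and signal with the maximum volume (measured here as RMS)."""
    best_bearing = max(
        signals_by_bearing,
        key=lambda b: np.sqrt(np.mean(np.square(signals_by_bearing[b]))),
    )
    return best_bearing, signals_by_bearing[best_bearing]
```

The returned bearing becomes the new user sound source direction, and the returned signal is taken as the user voice signal to be recognized and responded to.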
As an implementation manner of the embodiment of the present invention, the wake-up instruction determining module 210 may include:
a signal filtering sub-module (not shown in FIG. 2) and an instruction judging sub-module (not shown in FIG. 2);
the wake-up instruction determining module 210 is specifically configured to determine, by the signal filtering submodule and the instruction determining submodule, whether the interactive instruction corresponding to each received voice signal is a wake-up instruction;
the signal filtering submodule is configured to perform filtering processing on a target voice signal, filtering out the components of the target voice signal whose frequencies belong to a preset frequency band, where the target voice signal is: a received voice signal;
and the instruction judgment submodule is used for judging whether the interactive instruction corresponding to the filtered target voice signal is a wake-up instruction or not.
Voice signals whose frequencies belong to a preset frequency band often exist in the usage environment of the electronic device. For example, the frequency of the sound emitted by bass audio equipment is generally a few tens of hertz, which clearly does not fall within the frequency range of human speech. Such signals can therefore be removed by the filtering processing described above, which reduces the workload of subsequently locating the user sound source direction and, at the same time, makes the localization of the user sound source more accurate.
As an implementation manner of the embodiment of the present invention, the sound source localization module 220 may include:
a sound source localization submodule (not shown in fig. 2) for localizing and recording a sound source bearing of the received voice signal as a second type of sound source bearing;
a user sound source position determining submodule (not shown in fig. 2) configured to position a user sound source position according to a first type of sound source position and the second type of sound source position, where the first type of sound source position is a sound source position of a received voice signal that is positioned and recorded when the electronic device is in a sleep state, and an interactive instruction corresponding to the voice signal is not a wake-up instruction.
When the electronic equipment is in an environment with a plurality of sound sources, the sound source position of the user can be accurately positioned through the first type sound source position and the second type sound source position.
As an implementation manner of the embodiment of the present invention, the sub-module for determining a user sound source location may include:
a determination unit (not shown in fig. 2) for determining whether there is a sound source bearing that does not belong to the first type of sound source bearing in the second type of sound source bearings;
a user sound source bearing determining unit (not shown in fig. 2) for, in a case where there is a sound source bearing that does not belong to the first type of sound source bearing among the second type of sound source bearings, positioning a sound source bearing that does not belong to the first type of sound source bearing among the second type of sound source bearings as a user sound source bearing.
A sound source direction in the second type of sound source directions that does not belong to the first type of sound source directions is, by definition, a direction that was located when the electronic device switched from the sleep state to the working state and that was not recorded during the sleep state. It can therefore be determined to be the sound source direction of the voice signal whose corresponding interactive instruction, issued by the user, was the wake-up instruction, so the user sound source direction can be accurately located.
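Assuming the located bearings are quantized (e.g. to whole degrees), the selection described above reduces to a tolerant set difference between the second-type and first-type bearings. The function name and tolerance parameter are assumptions added for illustration:

```python
def candidate_user_bearings(first_type, second_type, tolerance_deg=5.0):
    """Return bearings from second_type that match no bearing in first_type.

    first_type: bearings recorded while the device was asleep (background sources).
    second_type: bearings located when the wake-up instruction arrived.
    Two bearings within tolerance_deg of each other (modulo 360) are
    treated as the same source.
    """
    def same(a, b):
        diff = abs(a - b) % 360
        return min(diff, 360 - diff) <= tolerance_deg

    return [b for b in second_type
            if not any(same(b, a) for a in first_type)]
```

Any bearing the function returns was not present as a background source during sleep, so it is a candidate for the user sound source direction.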
As an implementation manner of the embodiment of the present invention, the user sound source direction determining unit may include:
a number determination subunit (not shown in fig. 2) for determining the number of sound source azimuths of the second type of sound source azimuths that do not belong to the first type of sound source azimuths;
a first direction determining subunit (not shown in fig. 2), configured to determine, when the determined number is greater than 1, the sound source direction corresponding to a voice signal not belonging to the preset frequency band as the user sound source direction.
Because the frequency of the sound emitted by noise-producing equipment such as bass audio equipment generally falls within a fixed frequency range, the preset frequency band is set to that fixed range. The electronic device can then determine the sound source direction corresponding to a voice signal not belonging to the preset frequency band as the user sound source direction, accurately excluding the sound source directions of voice signals that belong to the preset frequency band; the electronic device can thereby determine the user sound source direction accurately.
As an implementation manner of the embodiment of the present invention, the first direction determining subunit may be specifically configured to determine the number of sound source directions corresponding to the voice signals not belonging to the preset frequency band; and, when the determined number is greater than 1, to determine, among the voice signals not belonging to the preset frequency band, the sound source direction corresponding to the voice signal whose waveform similarity to a preset waveform is greater than a first preset value as the user sound source direction.
By judging the similarity between the waveform and the preset waveform among the voice signals not belonging to the preset frequency band, the user sound source direction can be accurately located even when there are multiple voice signals outside the preset frequency band.
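The patent does not fix a particular similarity measure. Normalized cross-correlation is one common choice for comparing a candidate waveform against a preset (e.g. wake-word template) waveform; the function names and the threshold value below are assumptions:

```python
import numpy as np

def waveform_similarity(candidate, template):
    """Peak normalized cross-correlation in [0, 1] between two waveforms."""
    c = np.asarray(candidate, dtype=float)
    t = np.asarray(template, dtype=float)
    c = c - c.mean()
    t = t - t.mean()
    denom = np.linalg.norm(c) * np.linalg.norm(t)
    if denom == 0.0:
        return 0.0
    corr = np.correlate(c, t, mode="full") / denom
    return float(np.max(np.abs(corr)))

def pick_user_bearings(candidates, template, first_preset_value=0.6):
    """Among {bearing: waveform}, return the bearings whose similarity to
    the preset waveform exceeds the first preset value."""
    return [b for b, w in candidates.items()
            if waveform_similarity(w, template) > first_preset_value]
```

A candidate that is a time-shifted copy of the template scores near 1, while unrelated noise scores near 0, so thresholding on the first preset value separates the user's voice from residual noise sources.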
As an implementation manner of the embodiment of the present invention, the apparatus may further include:
an energy difference value determining module (not shown in fig. 2) configured to determine, when the second type of sound source location belongs to the first type of sound source location, whether an energy difference value between a first voice signal and a second voice signal in the same sound source location is greater than a second preset value, where the first voice signal is a voice signal received when the electronic device is in a sleep state, and the second voice signal is a voice signal received when the electronic device is in a working state; and if so, determining the second type sound source position corresponding to the second voice signal as the user sound source position.
When the user emits the voice signal whose corresponding interactive instruction is the wake-up instruction, the user may be in the same direction as one of the first type of sound source directions, in which case the second type of sound source direction located by the electronic device will belong to the first type. In that case, if the energy difference between the first voice signal and the second voice signal in the same sound source direction is greater than the second preset value, the two signals are very likely not emitted by the same sound source. Therefore, when that energy difference is greater than the second preset value, the electronic device can determine the second type of sound source direction corresponding to the second voice signal as the user sound source direction.
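A sketch of that energy test follows (not part of the claimed method; the use of RMS energy expressed in decibels and the 6 dB threshold are illustrative assumptions):

```python
import numpy as np

def is_new_source(sleep_signal, awake_signal, second_preset_value_db=6.0):
    """True if the signal received while awake differs in energy from the
    sleep-state signal at the same bearing by more than the second preset
    value, suggesting a different (user) source at that bearing."""
    def energy_db(x):
        rms = np.sqrt(np.mean(np.square(np.asarray(x, dtype=float))))
        return 20.0 * np.log10(max(rms, 1e-12))  # guard against silence

    return abs(energy_db(awake_signal) - energy_db(sleep_signal)) > second_preset_value_db
```

If the test passes, the bearing is promoted from a background source to the user sound source direction even though it already appeared in the first type of sound source directions.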
As an implementation manner of the embodiment of the present invention, the apparatus may further include:
and a waveform comparing module (not shown in fig. 2) configured to determine, as the user sound source direction, a sound source direction corresponding to a voice signal in the second type of sound source direction, where a similarity between a waveform and a preset waveform is greater than a first preset value.
By judging the similarity between the waveform of the voice signal corresponding to the second type of sound source position and the preset waveform, the user sound source position can be accurately positioned under the condition that the second type of sound source position belongs to the first type of sound source position.
As an implementation manner of the embodiment of the present invention, the sub-module for determining a user sound source location may include:
a target sound source bearing determining unit (not shown in fig. 2) for determining a sound source bearing not belonging to the first type of sound source bearing among the second type of sound source bearings as a target sound source bearing;
and a second azimuth determining unit (not shown in fig. 2) configured to determine a target range [ a, B ] according to the target sound source azimuth, and determine a sound source azimuth in the target range as the user sound source azimuth, where a is a difference between the target sound source azimuth and a first preset azimuth difference, and B is a sum of the target sound source azimuth and a second preset azimuth difference.
During the process of emitting the voice signal, the user may move within a small range, so that the sound source direction of the user's voice signal changes accordingly; determining the user sound source direction from the target range [A, B] allows the voice signal to be received accurately in this case.
The embodiment of the invention also provides electronic equipment, and the electronic equipment provided by the embodiment of the invention is introduced below.
As shown in fig. 3, an electronic device includes:
the device comprises a housing 301, a processor 302, a memory 303, a circuit board 304 and a power supply circuit 305, where the circuit board 304 is arranged inside the space enclosed by the housing 301, and the processor 302 and the memory 303 are arranged on the circuit board 304; the power supply circuit 305 supplies power to each circuit or component of the electronic device; the memory 303 stores executable program code; and the processor 302 runs a program corresponding to the executable program code by reading the executable program code stored in the memory 303, so as to execute the voice signal processing method described in the above method embodiment.
In one implementation, the voice signal processing method may include:
receiving a voice signal under the condition that the electronic equipment is in a sleep state, and judging whether an interactive instruction corresponding to the received voice signal is a wake-up instruction or not;
if so, switching from the sleep state to the working state, and positioning the sound source position of the received voice signal as the user sound source position;
continuously receiving voice signals, and performing noise suppression processing on voice signals which are originated from the position except the user sound source in the continuously received voice signals to obtain user voice signals;
and responding to the interactive instruction corresponding to the user voice signal.
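The four steps above can be sketched as a small state machine (purely illustrative; the class and method names are assumptions, and wake-up detection, localization, and noise suppression are injected as callables rather than implemented):

```python
from enum import Enum, auto

class State(Enum):
    SLEEP = auto()
    WORKING = auto()

class VoiceProcessor:
    """Minimal sketch of the sleep-to-working wake-up flow."""

    def __init__(self, is_wake_up, locate_bearing, suppress_other_bearings):
        self.state = State.SLEEP
        self.user_bearing = None
        self.is_wake_up = is_wake_up            # signal -> bool
        self.locate_bearing = locate_bearing    # signal -> degrees
        self.suppress = suppress_other_bearings  # (signal, bearing) -> signal

    def on_voice_signal(self, signal):
        """Return the user voice signal once awake; None while asleep."""
        if self.state is State.SLEEP:
            if self.is_wake_up(signal):
                self.state = State.WORKING
                self.user_bearing = self.locate_bearing(signal)
            return None
        # Working state: keep only the component from the user bearing.
        return self.suppress(signal, self.user_bearing)
```

In the sleep state all non-wake-up signals are ignored; the wake-up signal both switches the state and fixes the user bearing, after which every subsequent signal is noise-suppressed toward that bearing before the interactive instruction is extracted.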
For other implementation manners of the above speech signal processing method, reference is made to the description of the foregoing method embodiment portion, and details are not repeated here.
For specific execution processes of the above steps and other implementation manners of the voice signal processing method by the processor 302 and further execution processes of the processor 302 by running the executable program code, reference may be made to the description of the embodiments shown in fig. 1 and fig. 2 in the embodiments of the present invention, and details are not repeated here.
It should be noted that the electronic device exists in various forms, including but not limited to:
(1) Mobile communication devices: characterized by mobile communication capability and primarily aimed at providing voice and data communication. Such terminals include smart phones (e.g., the iPhone), multimedia phones, feature phones, and low-end phones.
(2) Ultra-mobile personal computer devices: these belong to the category of personal computers, have computing and processing functions, and generally also support mobile Internet access. Such terminals include PDA, MID, and UMPC devices, such as the iPad.
(3) Portable entertainment devices: these can display and play multimedia content. They include audio and video players (e.g., the iPod), handheld game consoles, e-book readers, smart toys, and portable car navigation devices.
(4) Servers: devices providing computing services. A server comprises a processor, hard disk, memory, system bus, etc.; its architecture is similar to that of a general-purpose computer, but because it must provide highly reliable services, it has higher requirements for processing capacity, stability, reliability, security, scalability, manageability, and the like.
(5) And other electronic devices with data interaction functions.
It can be seen that, in the solution provided in the embodiment of the present invention, the processor of the electronic device runs a program corresponding to executable program code by reading the executable program code stored in the memory. The program receives a voice signal when the electronic device is in a sleep state and determines whether the interactive instruction corresponding to the received voice signal is a wake-up instruction. If so, it switches the device from the sleep state to the working state and locates the sound source direction of the received voice signal as the user sound source direction; it then continues to receive voice signals and performs noise suppression processing on those originating from directions other than the user sound source direction, obtaining the user voice signal, so as to respond to the interactive instruction corresponding to the user voice signal. Because the sound source direction corresponding to the wake-up instruction is determined as the user sound source direction and voice signals from all other directions are suppressed, the obtained user voice signal is the voice signal emitted by the user in the user sound source direction; a correct response can therefore be made, improving user experience.
For the embodiment of the electronic device, since it is basically similar to the embodiment of the method, the description is simple, and for the relevant points, reference may be made to the partial description of the embodiment of the method.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, as for the apparatus embodiment, since it is substantially similar to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (23)

1. A voice signal processing method is applied to an electronic device with a voice interaction function, and the method comprises the following steps:
receiving a voice signal under the condition that the electronic equipment is in a sleep state, and judging whether an interactive instruction corresponding to the received voice signal is a wake-up instruction or not;
if so, switching from the sleep state to the working state, positioning and recording the sound source position of the received voice signal as a second type of sound source position;
positioning a user sound source position according to whether a sound source position which does not belong to a first type of sound source position exists in the second type of sound source position, wherein the first type of sound source position is the sound source position of a received voice signal which is positioned and recorded under the condition that the electronic equipment is in a sleep state, and an interactive instruction corresponding to the voice signal is not a wake-up instruction;
continuously receiving voice signals, and performing noise suppression processing on voice signals which are originated from the position except the user sound source in the continuously received voice signals to obtain user voice signals;
and responding to the interactive instruction corresponding to the user voice signal.
2. The method according to claim 1, wherein the step of performing noise suppression processing on the voice signals which are continuously received and are derived from the voice signals other than the user's voice source direction to obtain the user's voice signals comprises:
and carrying out noise suppression processing on voice signals which are sourced from the positions except the user sound source direction in the voice signals which are continuously received, and carrying out beam enhancement processing on the voice signals which are sourced from the user sound source direction in the voice signals which are continuously received to obtain the user voice signals.
3. The method of claim 1 or 2, wherein the method further comprises:
and indicating the user direction according to the user sound source direction.
4. The method of claim 1 or 2, wherein the method further comprises:
judging whether an interactive instruction corresponding to the voice signal received from the user sound source direction is a sound source positioning mode conversion instruction or not;
if so, continuing to receive the voice signals, determining the sound source position corresponding to the voice signal with the maximum volume in the received voice signals as the user sound source position, determining the voice signal with the maximum volume in the received voice signals as the user voice signals, and responding to the interactive instruction corresponding to the user voice signals.
5. The method according to claim 1 or 2, wherein the step of determining whether the interactive command corresponding to the received voice signal is a wake-up command comprises:
judging whether the interactive instruction corresponding to each received voice signal is a wake-up instruction or not according to the following modes:
carrying out filtering processing on a target voice signal, and filtering out a voice signal of which the frequency belongs to a preset frequency band in the target voice signal, wherein the target voice signal is as follows: a received voice signal;
and judging whether the interactive instruction corresponding to the target voice signal after filtering is a wake-up instruction.
6. The method of claim 1, wherein said step of locating a user's sound source bearing based on whether there is a sound source bearing in said second type of sound source bearing that does not belong to said first type of sound source bearing comprises:
and if the sound source position which does not belong to the first type of sound source position exists, positioning the sound source position which does not belong to the first type of sound source position in the second type of sound source position as a user sound source position.
7. The method of claim 6, wherein said step of locating a sound source bearing of said second type of sound source bearing that does not belong to said first type of sound source bearing as a user sound source bearing comprises:
determining the number of sound source azimuths not belonging to the first type of sound source azimuths in the second type of sound source azimuths;
and when the determined number is larger than 1, determining the sound source position corresponding to the voice signal which does not belong to the preset frequency band as the user sound source position.
8. The method of claim 7, wherein the step of determining a sound source bearing corresponding to a speech signal not belonging to the preset frequency band as the user sound source bearing comprises:
determining the number of sound source orientations corresponding to the voice signals which do not belong to the preset frequency band;
and when the determined number is greater than 1, determining, among the voice signals not belonging to the preset frequency band, the sound source direction corresponding to the voice signal whose waveform similarity to the preset waveform is greater than a first preset value as the user sound source direction.
9. The method of claim 1, wherein said step of locating a user's sound source bearing based on whether there is a sound source bearing in said second type of sound source bearing that does not belong to said first type of sound source bearing comprises:
under the condition that the second type of sound source position belongs to the first type of sound source position, judging whether an energy difference value between a first voice signal and a second voice signal in the same sound source position is larger than a second preset value, wherein the first voice signal is a voice signal received when the electronic equipment is in a sleep state, and the second voice signal is a voice signal received when the electronic equipment is in a working state;
and if so, determining the second type sound source position corresponding to the second voice signal as the user sound source position.
10. The method of claim 1, wherein said step of locating a user's sound source bearing based on whether there is a sound source bearing in said second type of sound source bearing that does not belong to said first type of sound source bearing comprises:
and under the condition that the second type sound source orientations belong to the first type sound source orientations, determining the sound source orientation corresponding to the voice signal with the similarity between the waveform and the preset waveform being greater than a first preset value in the second type sound source orientations as the user sound source orientation.
11. The method of claim 6, wherein said step of locating a sound source bearing of said second type of sound source bearing that does not belong to said first type of sound source bearing as a user sound source bearing comprises:
determining a sound source azimuth which does not belong to the first type sound source azimuth in the second type sound source azimuth as a target sound source azimuth;
and determining a target range [ A, B ] according to the target sound source position, and determining the sound source position in the target range as the user sound source position, wherein A is the difference value between the target sound source position and a first preset position difference value, and B is the sum of the target sound source position and a second preset position difference value.
12. A speech signal processing apparatus, applied to an electronic device with a speech interaction function, the apparatus comprising:
the wake-up instruction judging module is used for receiving the voice signal under the condition that the electronic equipment is in a sleep state and judging whether the interactive instruction corresponding to the received voice signal is a wake-up instruction or not;
the sound source positioning module is used for switching the sleep state to the working state under the condition that the interactive instruction corresponding to the received voice signal is a wake-up instruction, and positioning the sound source position of the received voice signal as the sound source position of the user;
the user voice signal obtaining module is used for continuously receiving voice signals and carrying out noise suppression processing on voice signals which are sourced from positions except the user sound source in the continuously received voice signals to obtain user voice signals;
the first interactive instruction response module is used for responding to an interactive instruction corresponding to the user voice signal;
the sound source localization module includes:
the sound source positioning submodule is used for positioning and recording the sound source position of the received voice signal as a second type of sound source position;
and the user sound source position determining submodule is used for positioning the user sound source position according to whether a sound source position which does not belong to the first type sound source position exists in the second type sound source position, wherein the first type sound source position is the sound source position of the received voice signal which is positioned and recorded under the condition that the electronic equipment is in a sleep state, and the interactive instruction corresponding to the voice signal is not a wake-up instruction.
13. The apparatus of claim 12, wherein the user voice signal obtaining module comprises:
a user voice signal obtaining submodule, configured to perform noise suppression on the continuously received voice signals originating from directions other than the user sound source direction, and to perform beam enhancement on the continuously received voice signals originating from the user sound source direction, thereby obtaining the user voice signal.
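Claim 13's pairing of off-direction suppression with on-direction beam enhancement is commonly realized with microphone-array beamforming; the claims do not name a method, so the delay-and-sum scheme below is a swapped-in illustration under assumed geometry (uniform linear array, far-field source):

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # metres per second, in air at room temperature


def delay_and_sum(frames, mic_spacing, sample_rate, steer_deg):
    """Steer a uniform linear array toward `steer_deg` (degrees off broadside).

    `frames` has shape (num_mics, num_samples). Signals arriving from the
    steering direction are aligned and add coherently (beam enhancement);
    signals from other directions add incoherently and are attenuated
    (noise suppression).
    """
    num_mics, num_samples = frames.shape
    # Per-microphone arrival delay for a far-field source at steer_deg.
    delays = (np.arange(num_mics) * mic_spacing
              * np.sin(np.radians(steer_deg)) / SPEED_OF_SOUND)
    shifts = np.round(delays * sample_rate).astype(int)
    out = np.zeros(num_samples)
    for channel, shift in zip(frames, shifts):
        out += np.roll(channel, -shift)  # align each channel, then average
    return out / num_mics
```

For a source at broadside (0°) all channels are already aligned, so the beamformed output reproduces the source signal while uncorrelated noise on each channel is averaged down.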
14. The apparatus of claim 12 or 13, further comprising:
a user direction indicating module, configured to indicate the direction of the user according to the user sound source direction.
15. The apparatus of claim 12 or 13, further comprising:
a conversion instruction judging module, configured to judge whether an interactive instruction corresponding to a voice signal received from the user sound source direction is a sound source localization mode conversion instruction; and
a second interactive instruction response module, configured to, when the interactive instruction corresponding to the voice signal received from the user sound source direction is a sound source localization mode conversion instruction: continue receiving voice signals; determine the sound source direction corresponding to the voice signal with the maximum volume among the received voice signals as the user sound source direction; determine that voice signal as the user voice signal; and respond to the interactive instruction corresponding to the user voice signal.
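The loudest-source selection in claim 15 can be sketched as below. The claim says only "maximum volume"; approximating volume by RMS energy is an assumption of this illustration, as is the dict-of-directions data shape:

```python
def loudest_source(signals_by_direction):
    """Pick the sound source direction whose signal has the greatest volume.

    `signals_by_direction` maps a direction (degrees) to a list of samples.
    Volume is approximated by root-mean-square energy (an assumption; the
    claim does not fix a volume measure).
    """
    def rms(samples):
        return (sum(s * s for s in samples) / len(samples)) ** 0.5

    direction = max(signals_by_direction, key=lambda d: rms(signals_by_direction[d]))
    return direction, signals_by_direction[direction]
```

The selected direction then replaces the previously located user sound source direction, and the selected signal is treated as the user voice signal.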
16. The apparatus of claim 12 or 13, wherein the wake-up instruction judging module comprises:
a signal filtering submodule and an instruction judging submodule;
the wake-up instruction judging module being specifically configured to judge, through the signal filtering submodule and the instruction judging submodule, whether the interactive instruction corresponding to each received voice signal is a wake-up instruction;
the signal filtering submodule being configured to filter a target voice signal by removing components whose frequency belongs to a preset frequency band, wherein the target voice signal is a received voice signal; and
the instruction judging submodule being configured to judge whether the interactive instruction corresponding to the filtered target voice signal is a wake-up instruction.
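Removing components inside a preset frequency band, as the signal filtering submodule does, can be sketched with a real-FFT mask. The band edges below (around 50 Hz, e.g. mains hum) are purely illustrative; the claim only says "a preset frequency band":

```python
import numpy as np


def filter_preset_band(signal, sample_rate, band=(45.0, 55.0)):
    """Zero out frequency components whose frequency lies inside `band` (Hz).

    A simple FFT-mask band-stop filter; the (45, 55) Hz default is an
    illustrative assumption, e.g. for suppressing mains hum.
    """
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    spectrum[(freqs >= band[0]) & (freqs <= band[1])] = 0.0
    return np.fft.irfft(spectrum, n=len(signal))
```

A production filter would use an IIR/FIR band-stop design rather than a hard spectral mask, but the mask shows the claimed behaviour: energy inside the preset band is removed before wake-word judgment.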
17. The apparatus of claim 12, wherein the user sound source direction determining submodule comprises:
a user sound source direction determining unit, configured to, when a sound source direction not belonging to the first-type sound source directions exists among the second-type sound source directions, locate that sound source direction as the user sound source direction.
18. The apparatus of claim 17, wherein the user sound source direction determining unit comprises:
a number determining subunit, configured to determine the number of second-type sound source directions that do not belong to the first-type sound source directions; and
a first direction determining subunit, configured to, when the determined number is greater than 1, determine the sound source direction corresponding to a voice signal whose frequency does not belong to the preset frequency band as the user sound source direction.
19. The apparatus of claim 18, wherein
the first direction determining subunit is specifically configured to determine the number of sound source directions corresponding to voice signals whose frequency does not belong to the preset frequency band; and, when the determined number is greater than 1, to determine, among those voice signals, the sound source direction corresponding to a voice signal whose waveform has a similarity with a preset waveform greater than a first preset value as the user sound source direction.
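The waveform-similarity test used in claims 19 and 21 is not pinned to a particular measure in the claims; peak normalized cross-correlation against a preset speech-like template is one common choice, sketched here as an assumption:

```python
import numpy as np


def waveform_similarity(signal, template):
    """Score how similar a received waveform is to a preset template,
    as the peak normalized cross-correlation, roughly in [0, 1].
    (The choice of measure is an assumption; the claims only require
    'similarity between the waveform and a preset waveform'.)
    """
    s = (signal - signal.mean()) / (signal.std() + 1e-12)
    t = (template - template.mean()) / (template.std() + 1e-12)
    corr = np.correlate(s, t, mode="full") / len(t)
    return float(np.abs(corr).max())


def is_user_like(signal, template, first_preset_value=0.6):
    """Claim 19/21 style check: similarity exceeds the first preset value.
    The 0.6 threshold is illustrative, not from the patent."""
    return waveform_similarity(signal, template) > first_preset_value
```

A signal compared against itself scores close to 1, while uncorrelated noise against a speech-like template scores much lower, which is what lets the direction of a human speaker be singled out among several candidate directions.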
20. The apparatus of claim 12, wherein the user sound source direction determining submodule is further configured to, when all of the second-type sound source directions belong to the first-type sound source directions, judge whether the energy difference between a first voice signal and a second voice signal from the same sound source direction is greater than a second preset value, the first voice signal being a voice signal received while the electronic device is in the sleep state and the second voice signal being a voice signal received while the electronic device is in the working state; and, if so, to determine the second-type sound source direction corresponding to the second voice signal as the user sound source direction.
21. The apparatus of claim 12, wherein the user sound source direction determining submodule is further configured to, when all of the second-type sound source directions belong to the first-type sound source directions, determine, among the second-type sound source directions, the sound source direction corresponding to a voice signal whose waveform has a similarity with a preset waveform greater than a first preset value as the user sound source direction.
22. The apparatus of claim 17, wherein the user sound source direction determining unit comprises:
a target sound source direction determining unit, configured to determine a sound source direction that does not belong to the first-type sound source directions among the second-type sound source directions as a target sound source direction; and
a second direction determining unit, configured to determine a target range [A, B] according to the target sound source direction, and to determine a sound source direction within the target range as the user sound source direction, wherein A is the difference between the target sound source direction and a first preset direction difference value, and B is the sum of the target sound source direction and a second preset direction difference value.
23. An electronic device, characterized in that the electronic device comprises: a housing, a processor, a memory, a circuit board, and a power supply circuit, wherein the circuit board is arranged inside a space enclosed by the housing, and the processor and the memory are arranged on the circuit board; the power supply circuit is configured to supply power to each circuit or component of the electronic device; the memory is configured to store executable program code; and the processor runs a program corresponding to the executable program code by reading the executable program code stored in the memory, so as to perform the voice signal processing method of any one of claims 1 to 11.
CN201710231244.4A 2017-04-10 2017-04-10 Voice signal processing method and device and electronic equipment Active CN107146614B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710231244.4A CN107146614B (en) 2017-04-10 2017-04-10 Voice signal processing method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710231244.4A CN107146614B (en) 2017-04-10 2017-04-10 Voice signal processing method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN107146614A CN107146614A (en) 2017-09-08
CN107146614B true CN107146614B (en) 2020-11-06

Family

ID=59774608

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710231244.4A Active CN107146614B (en) 2017-04-10 2017-04-10 Voice signal processing method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN107146614B (en)

Families Citing this family (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107742522B (en) * 2017-10-23 2022-01-14 科大讯飞股份有限公司 Target voice obtaining method and device based on microphone array
CN107895578B (en) * 2017-11-15 2021-07-20 百度在线网络技术(北京)有限公司 Voice interaction method and device
CN109961781B (en) * 2017-12-22 2021-08-27 深圳市优必选科技有限公司 Robot-based voice information receiving method and system and terminal equipment
CN108231075A (en) * 2017-12-29 2018-06-29 北京视觉世界科技有限公司 Control method, device, equipment and the storage medium of cleaning equipment
CN108231081A (en) * 2017-12-29 2018-06-29 北京视觉世界科技有限公司 A kind of method of speech processing, device, electronic equipment and storage medium
CN108470568B (en) * 2018-01-22 2021-03-23 科大讯飞股份有限公司 Intelligent device control method and device, storage medium and electronic device
CN110164426B (en) * 2018-02-10 2021-10-26 佛山市顺德区美的电热电器制造有限公司 Voice control method and computer storage medium
CN108364648B (en) * 2018-02-11 2021-08-03 北京百度网讯科技有限公司 Method and device for acquiring audio information
CN108733420B (en) * 2018-03-21 2022-04-29 北京猎户星空科技有限公司 Awakening method and device of intelligent equipment, intelligent equipment and storage medium
CN108771491A (en) * 2018-05-24 2018-11-09 宁波国盛电器有限公司 A kind of sandwich unit
CN108986833A (en) * 2018-08-21 2018-12-11 广州市保伦电子有限公司 Sound pick-up method, system, electronic equipment and storage medium based on microphone array
CN109087650B (en) * 2018-10-24 2022-02-22 北京小米移动软件有限公司 Voice wake-up method and device
CN110033773B (en) * 2018-12-13 2021-09-14 蔚来(安徽)控股有限公司 Voice recognition method, device, system and equipment for vehicle and vehicle
CN111354336B (en) * 2018-12-20 2023-12-19 美的集团股份有限公司 Distributed voice interaction method, device, system and household appliance
CN109830232A (en) * 2019-01-11 2019-05-31 北京猎户星空科技有限公司 Man-machine interaction method, device and storage medium
CN109920443A (en) * 2019-03-22 2019-06-21 网易有道信息技术(北京)有限公司 A kind of speech processes machine
CN110265011B (en) * 2019-06-10 2020-10-23 龙马智芯(珠海横琴)科技有限公司 Electronic equipment interaction method and electronic equipment
CN110428722A (en) * 2019-08-07 2019-11-08 杭州任你说智能科技有限公司 A kind of intelligently globe and its working method based on intelligent sound interaction technique
CN112420063A (en) * 2019-08-21 2021-02-26 华为技术有限公司 Voice enhancement method and device
CN111128169A (en) * 2019-12-30 2020-05-08 云知声智能科技股份有限公司 Voice wake-up method and device
CN111326160A (en) * 2020-03-11 2020-06-23 南京奥拓电子科技有限公司 Speech recognition method, system and storage medium for correcting noise text
CN112309395A (en) * 2020-09-17 2021-02-02 广汽蔚来新能源汽车科技有限公司 Man-machine conversation method, device, robot, computer device and storage medium
CN115223548B (en) * 2021-06-29 2023-03-14 达闼机器人股份有限公司 Voice interaction method, voice interaction device and storage medium
CN114446300B (en) * 2022-02-17 2023-03-24 北京百度网讯科技有限公司 Multi-sound zone identification method, device, equipment and storage medium

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2342164B (en) * 1998-10-01 2003-02-26 Roke Manor Research Improvements in or relating to sensor systems
JP4195267B2 (en) * 2002-03-14 2008-12-10 インターナショナル・ビジネス・マシーンズ・コーポレーション Speech recognition apparatus, speech recognition method and program thereof
JP2003270034A (en) * 2002-03-15 2003-09-25 Nippon Telegr & Teleph Corp <Ntt> Sound information analyzing method, apparatus, program, and recording medium
US6970796B2 (en) * 2004-03-01 2005-11-29 Microsoft Corporation System and method for improving the precision of localization estimates
CN1727911A (en) * 2004-07-26 2006-02-01 松下电器产业株式会社 Acoustic control positioning system and method thereof
CN102103200B (en) * 2010-11-29 2012-12-05 清华大学 Acoustic source spatial positioning method for distributed asynchronous acoustic sensor
WO2015151130A1 (en) * 2014-03-31 2015-10-08 パナソニックIpマネジメント株式会社 Sound processing apparatus, sound processing system, and sound processing method
CN104934033A (en) * 2015-04-21 2015-09-23 深圳市锐曼智能装备有限公司 Control method of robot sound source positioning and awakening identification and control system of robot sound source positioning and awakening identification
CN106303187B (en) * 2015-05-11 2019-08-02 小米科技有限责任公司 Acquisition method, device and the terminal of voice messaging
CN106531179B (en) * 2015-09-10 2019-08-20 中国科学院声学研究所 A kind of multi-channel speech enhancement method of the selective attention based on semantic priori
CN106201424B (en) * 2016-07-08 2019-10-01 北京甘为乐博科技有限公司 A kind of information interacting method, device and electronic equipment

Also Published As

Publication number Publication date
CN107146614A (en) 2017-09-08

Similar Documents

Publication Publication Date Title
CN107146614B (en) Voice signal processing method and device and electronic equipment
CN107144819B (en) A kind of sound localization method, device and electronic equipment
CN106847298B (en) Pickup method and device based on diffuse type voice interaction
CN108470034B (en) A kind of smart machine service providing method and system
CN105190746B (en) Method and apparatus for detecting target keyword
EP2945045B1 (en) Electronic device and method of playing music in electronic device
CN108055490B (en) Video processing method and device, mobile terminal and storage medium
CN108962240A (en) A kind of sound control method and system based on earphone
KR20160100765A (en) Electronic apparatus and Method of operating voice recognition in the electronic apparatus
CN104581221A (en) Video live broadcasting method and device
CN108470571B (en) Audio detection method and device and storage medium
CN106095387B (en) A kind of the audio setting method and terminal of terminal
CN109616135B (en) Audio processing method, device and storage medium
US20160360332A1 (en) Electronic device and method for controlling input and output by electronic device
CN107993672B (en) Frequency band expanding method and device
CN109243488B (en) Audio detection method, device and storage medium
CN107229629B (en) Audio recognition method and device
CN106940997B (en) Method and device for sending voice signal to voice recognition system
KR20150103586A (en) Method for processing voice input and electronic device using the same
CN107146605B (en) Voice recognition method and device and electronic equipment
CN106250182B (en) File processing method and device and electronic equipment
CN109817241B (en) Audio processing method, device and storage medium
CN107680614B (en) Audio signal processing method, apparatus and storage medium
CN109844857A (en) Portable audio with speech capability
CN106384599B (en) A kind of method and apparatus of distorsion identification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant