CN112863511B

CN112863511B - Signal processing method, device and storage medium

Info

Publication number: CN112863511B
Application number: CN202110056746.4A
Authority: CN
Inventors: 李倩
Original assignee: Beijing Xiaomi Pinecone Electronic Co Ltd
Current assignee: Beijing Xiaomi Pinecone Electronic Co Ltd
Priority date: 2021-01-15
Filing date: 2021-01-15
Publication date: 2024-06-04
Anticipated expiration: 2041-01-15
Also published as: CN112863511A

Abstract

The present disclosure relates to a signal processing method, apparatus, and storage medium. The signal processing method is applied to a first electronic device and comprises: acquiring a plurality of collected first voice signals, wherein the plurality of first voice signals correspond to a plurality of user identities; determining whether a target voice signal representing a first user identity exists among the plurality of first voice signals; and if the target voice signal exists, executing a corresponding response to the target voice signal. In this way, each time a plurality of first voice signals are collected, it can be determined whether a target voice signal representing the first user identity is among them, so that the user with the first user identity is responded to preferentially and that user's experience is improved.

Description

Signal processing method, device and storage medium

Technical Field

The disclosure relates to the technical field of voice information processing, and in particular relates to a signal processing method, a signal processing device and a storage medium.

Background

At present, voice technology is used in more and more application scenarios, and the requirements placed on voice recognition in those scenarios keep increasing. With the current trend toward the interconnection of everything, homes contain more and more smart devices, possibly spread across different floors. Space or network delay can cause several devices to respond at once, and several people may wake up devices and issue commands at the same time.

Disclosure of Invention

The present disclosure aims to solve, at least to some extent, one of the technical problems in the related art. To this end, the present disclosure provides a signal processing method, apparatus, and storage medium.

According to a first aspect of an embodiment of the present disclosure, there is provided a signal processing method, applied to a first electronic device, including:

acquiring a plurality of collected first voice signals, wherein the plurality of first voice signals correspond to a plurality of user identities;

Determining whether a target voice signal representing a first user identity exists in a plurality of first voice signals;

and if the target voice signal exists, executing corresponding response to the target voice signal.

Optionally, the performing the corresponding response on the target voice signal if the target voice signal exists includes:

if the target voice signal exists, monitoring whether a second electronic device makes a voice response to the first voice signal, to obtain a monitoring result;

And executing corresponding response to the target voice signal according to the monitoring result.

Optionally, the executing a corresponding response to the target voice signal according to the monitoring result includes:

if the monitoring result is that the second electronic device has not made a voice response to the target voice signal, outputting a second voice signal for responding to the target audio signal, wherein the second voice signal contains information indicating the first user identity;

or,

if the monitoring result is that the second electronic device has made a voice response to the target voice signal, keeping silent and outputting prompt information in a form different from the second voice signal; the prompt information is used for prompting that the target voice signal has been received.

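Optionally, the performing a corresponding response to the target voice signal if the target voice signal exists includes:

if the target voice signal exists, determining, according to a target voiceprint model, the emotional state of the first user identity represented by the target voice signal;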

Performing a corresponding response to the target speech signal according to the emotional state;

The target voiceprint model is obtained through training according to historical voice signals corresponding to the first user identity in different emotion states.

Optionally, the target voiceprint model is created based on the steps of:

receiving a voiceprint registration request associated with the first user identity;

Outputting dynamically generated verification information;

And under the condition that the verification information is matched, establishing a target voiceprint model corresponding to the first user identity based on the voiceprint registration request.

Optionally, the determining, according to a target voiceprint model, an emotional state of the first user identity represented by the target voice signal if the target voice signal exists includes:

if the target voice signal exists, extracting target voice print characteristics of the target voice signal according to a target voice print model;

matching the target voiceprint features against voiceprint features in a preset voiceprint library, and determining the emotional state of the first user identity corresponding to the target voiceprint features; wherein the preset voiceprint library comprises voiceprint features of the speaking user corresponding to the first user identity in different emotional states.

Optionally, the performing a corresponding response to the target speech signal according to the emotional state includes:

determining response parameters for responding to the target voice signal according to the emotional state; wherein the response parameters include at least: response tone and response volume;

and under the response tone, executing corresponding response to the target voice signal according to the response volume.

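Optionally, the performing a corresponding response to the target voice signal if the target voice signal exists further includes:

if the target voice signal exists, generating a first information packet indicating a wake-up by the first user identity;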

sending the first information packet to a server or a second electronic device; the first information packet is used for notifying the second electronic device that the first electronic device is preparing to respond to the target voice signal.

Optionally, the method further comprises:

if the target voice signal does not exist, generating, according to the different first voice signals, different second information packets each indicating a wake-up by a non-first user identity; the different second information packets carry identification information for distinguishing the different first voice signals;

transmitting a plurality of different second information packets to a server;

Receiving different decision instructions generated by the server based on different second information packets;

And determining whether to respond to the first voice signal according to each decision instruction.

Optionally, the generating, according to the different first voice signals, a second packet indicating to wake up by a non-first user identity includes:

determining, according to the different first voice signals, the distance to the speaking user of each first voice signal;

Generating a second packet indicating a wake up by a non-first user identity based on the distance information and determination information characterizing the absence of the target speech signal from a plurality of first speech signals.

According to a second aspect of embodiments of the present disclosure, there is provided a signal processing apparatus applied to a first electronic device, including:

the acquisition module is used for acquiring a plurality of acquired first voice signals, wherein the plurality of first voice signals correspond to a plurality of user identities;

A determining module, configured to determine whether a target voice signal representing a first user identity exists in a plurality of first voice signals;

and the response module is used for executing corresponding response to the target voice signal if the target voice signal exists.

Optionally, the response module includes:

the monitoring module is used for monitoring, if the target voice signal exists, whether a second electronic device makes a voice response to the first voice signal, so as to obtain a monitoring result;

And the processing module is used for executing corresponding response to the target voice signal according to the monitoring result.

Optionally, the processing module includes:

The first processing sub-module is used for outputting a second voice signal for responding to the target audio signal if the monitoring result is that the second electronic equipment does not respond to the target voice signal, wherein the second voice signal contains information indicating the first user identity;

Or,

The second processing sub-module is used for keeping silence and outputting prompt information in a form different from the second voice signal if the monitoring result is that the second electronic equipment makes voice response to the target voice signal; the prompt message is used for prompting that the target voice signal is received.

Optionally, the response module includes:

The emotion determining module is used for determining an emotion state of the first user identity represented by the target voice signal according to a target voiceprint model if the target voice signal exists;

an emotion response sub-module for executing a corresponding response to the target speech signal according to the emotion state;

Optionally, the target voiceprint model is created based on the steps of:

Outputting dynamically generated verification information;

Optionally, the emotion determining module includes:

The extraction module is used for extracting target voiceprint characteristics of the target voice signal according to a target voiceprint model if the target voice signal exists;

The matching module is used for matching the target voiceprint features against voiceprint features in a preset voiceprint library and determining the emotional state of the first user identity corresponding to the target voiceprint features; wherein the preset voiceprint library comprises voiceprint features of the speaking user corresponding to the first user identity in different emotional states.

Optionally, the emotion response sub-module is further configured to:

Determining a response parameter for responding to the target voice signal according to the emotion state; wherein the response parameters include at least: response tone and response volume;

Optionally, the response module is further configured to:

Optionally, the apparatus further comprises:

The generating module is used for generating, if the target voice signal does not exist, different second information packets each indicating a wake-up by a non-first user identity according to the different first voice signals; the different second information packets carry identification information for distinguishing the different first voice signals;

The sending module is used for sending a plurality of different second information packets to a server;

The receiving module is used for receiving different decision instructions generated by the server based on different second information packets;

And the judging module is used for determining whether to respond to the first voice signal according to each decision instruction.

Optionally, the generating module is further configured to:

According to a third aspect of the embodiments of the present disclosure, there is provided a signal processing apparatus comprising:

a processor;

A memory for storing processor-executable instructions;

Wherein the processor is configured to: the method of any of the above first aspects is implemented when executing executable instructions stored in the memory.

According to a fourth aspect of embodiments of the present disclosure, there is provided a non-transitory computer-readable storage medium having stored therein computer-executable instructions which, when executed by a processor, implement the steps of the method provided in any of the above-mentioned first aspects.

The technical scheme provided by the embodiment of the disclosure can comprise the following beneficial effects:

When a plurality of first voice signals corresponding to a plurality of user identities are acquired, it is first determined whether a target voice signal representing the first user identity exists among them, and if so, the target voice signal is responded to. Therefore, when a plurality of users wake up the first electronic device at the same time, the first electronic device responds preferentially to the user with the first user identity, so that this user's request is handled in time. This optimizes the device's response logic for multiple input voices, avoids false or delayed responses, and improves the user experience.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.

Fig. 1 is a flowchart illustrating a signal processing method according to an exemplary embodiment.

Fig. 2 is a schematic diagram of an application scenario, according to an example embodiment.

Fig. 3 is a second flowchart illustrating a signal processing method according to an exemplary embodiment.

Fig. 4 is a third flowchart illustrating a signal processing method according to an exemplary embodiment.

Fig. 5 is a schematic diagram showing a structure of a signal processing apparatus according to an exemplary embodiment.

Fig. 6 is a block diagram of a signal processing apparatus according to an exemplary embodiment.

Detailed Description

Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.

An embodiment of the present disclosure provides a signal processing method. Fig. 1 is a flowchart of a signal processing method according to an exemplary embodiment. As shown in fig. 1, the signal processing method includes the following steps:

Step 101, acquiring a plurality of collected first voice signals, wherein the plurality of first voice signals correspond to a plurality of user identities;

step 102, determining whether a target voice signal representing the identity of a first user exists in a plurality of first voice signals;

step 103, if the target voice signal exists, executing corresponding response to the target voice signal.

It should be noted that the signal processing method may be applied to various types of first electronic devices, where the first electronic devices may be smart speakers, smart phones, tablet computers, wearable electronic devices, or the like. The first electronic equipment comprises a voice acquisition module, and can acquire voice signals in real time through the voice acquisition module when the equipment is in an on state.

Here, the user identity may be determined according to the registration information used to register the voice interaction function on the first electronic device. That is, the user identity comprises a first user identity and a non-first user identity, where the first user identity can be understood as the owner identity, and the owner identity can correspond to the main user of the first electronic device. For example, the owner may be the user who registered the voice interaction function on the first electronic device, and a non-owner is any user other than the owner. As another example, when user A's smart speaker is started for the first time, user A is the owner and all other users are non-owners. In addition, when the first electronic device has a plurality of voiceprint-registered users, the first user identity may instead characterize the user indicated by the user account bound to the first electronic device, or the user who uses the first electronic device most actively.

In other embodiments, the user identity may also be determined according to job level or age; i.e. the user identity comprises: adult or child. And when the first electronic device is applied to a different scene, the corresponding first user identity is different.

For example, when the first electronic device is used in an entertainment scenario, the first user identity is the child, that is, the child is responded to preferentially. When the first electronic device is used in a work scenario, the first user identity to be responded to preferentially can be determined from the location of the first electronic device; that is, the first user identity is the user to whom that location belongs. For example, for a first electronic device located in a group leader's office, the first user identity to be responded to preferentially is the group leader.

The present disclosure does not limit how user identities are divided. The signal processing method aims to determine, from a plurality of user identities, the first user identity to be responded to preferentially, and then to execute the response. The following description takes the owner, also referred to below as the host, as an example of the first user identity.

It should be noted that, to improve the host's experience with the electronic devices and to emphasize the host's identity, in some embodiments each electronic device has only one host, that is, only one registered user. Voice recognition then only requires a target voiceprint model trained on the host's historical voice signals, which simplifies voice training and reduces the workload.

Here, the first user identity represents the host in step 102, and the target voice signal is a voice signal sent by the host.

In the embodiment of the disclosure, if a plurality of voice signals are collected at the current moment and it is determined that they were uttered by a plurality of users, it must first be determined whether the host is among the speakers; if so, the target voice signal uttered by the host is responded to. Therefore, when a plurality of users wake up the first electronic device at the same time, the first electronic device responds preferentially to the user with the first user identity, so that this user's request is handled in time. This optimizes the device's response logic for multiple input voices, avoids false or delayed responses, and improves the user experience.

Here, whether the plurality of voice signals were uttered by a plurality of users can be determined with a target voiceprint model trained on the host's historical voice signals. Specifically, each acquired first voice signal is processed with the target voiceprint model; once the first voice signal belonging to the host has been identified, the other first voice signals can be determined to have been uttered by non-hosts.
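The host/non-host split described above can be sketched as follows. The disclosure does not say how the target voiceprint model compares signals, so this sketch assumes that each first voice signal has already been reduced to a fixed-length embedding and that cosine similarity against a stored host embedding, with a hypothetical threshold of 0.8, decides membership. The names `cosine_similarity`, `split_by_host`, and `threshold` are illustrative only.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def split_by_host(
    embeddings: List[Sequence[float]],
    host_embedding: Sequence[float],
    threshold: float = 0.8,  # assumed value, not taken from the disclosure
) -> Tuple[List[int], List[int]]:
    """Return (host_indices, non_host_indices) for the collected first voice signals."""
    host, non_host = [], []
    for index, embedding in enumerate(embeddings):
        if cosine_similarity(embedding, host_embedding) >= threshold:
            host.append(index)
        else:
            non_host.append(index)
    return host, non_host
```

Because only the host's voiceprint is modeled, every signal that fails the match is treated as a non-host signal without needing a model per speaker.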

It should be noted that, if only one voice signal is collected at the current moment, that voice signal is responded to directly, whether or not it belongs to the host. If the speaker is the host, responding directly improves the host's experience with the electronic device. If the speaker is not the host, the host is not speaking at that moment and has no request, so responding directly improves the experience of the other users.
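The selection rule in the preceding paragraphs (answer a lone signal directly; with several signals, prefer the host's) can be sketched as below. The `VoiceSignal` type and `select_signal_to_answer` function are hypothetical names, and the `is_host` flag is assumed to have been set by the voiceprint check. Returning `None` when several non-host signals arrive is this sketch's way of saying that the decision is deferred, as in the server-based handling described elsewhere in the disclosure.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class VoiceSignal:
    speaker_id: str
    is_host: bool  # assumed to come from the target voiceprint model
    text: str


def select_signal_to_answer(signals: List[VoiceSignal]) -> Optional[VoiceSignal]:
    """Pick the signal the device should answer locally, or None to defer."""
    if not signals:
        return None
    if len(signals) == 1:
        # A single signal is answered directly, host or not.
        return signals[0]
    for signal in signals:
        if signal.is_host:
            # Several speakers: the host's target voice signal takes priority.
            return signal
    # Several speakers and no host: defer the decision (for example to a server).
    return None
```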

Further, in the embodiment of the present disclosure, after determining that there is a target voice signal sent by the host, a corresponding response needs to be performed on the target voice signal.

In some embodiments, the performing a corresponding response to the target speech signal may be: a response that carries out the request in the target voice signal, that is, the first electronic device answers the host and fulfills the host's request; for example, it responds by outputting the voice signal "Host, I'm here; I'll execute right away".

In other embodiments, the performing the corresponding response to the target voice signal may be: keeping silent without a voice response and outputting another kind of response indicating that the target voice signal has been received, so as to give the host feedback. That is, the response indicates that the host's request was received but not processed; for example, a light signal is output to let the host know that the request was received.

In the embodiment of the disclosure, the response mode is determined according to the specific application scenario. For example, suppose other electronic devices that can interact with the user are in the same space as the first electronic device. If the first electronic device is the first to answer the host's target voice signal by outputting the voice signal "Host, I'm here; I'll execute right away", the other electronic devices can respond by outputting a light signal. If another electronic device has already answered the host's target voice signal with that voice signal, the first electronic device can respond by outputting a light signal. In this way, only one electronic device at a time answers the host and fulfills the request in the target voice signal, which reduces the "one call, many responses" problem caused by several electronic devices answering at once.

The specific manner of determining the response method is described in detail below.

In some embodiments, the performing the corresponding response on the target voice signal if the target voice signal exists in step 103 includes:

Step 1031, if the target voice signal exists, monitoring whether a second electronic device makes a voice response to the first voice signal, and obtaining a monitoring result;

step 1032, executing corresponding response to the target voice signal according to the monitoring result.

Fig. 2 is a schematic diagram of an application scenario in an embodiment of the disclosure. As shown in fig. 2, the first electronic device is located in a space that contains a plurality of electronic devices, for example a first electronic device 201 and a second electronic device 202. The second electronic device is any electronic device other than the first electronic device.

Each electronic device stores a monitoring program for monitoring whether a second electronic device has made a voice response to the target voice signal.

For each electronic device, when a plurality of first voice signals are acquired and it is determined that a target voice signal representing the first user identity exists among them, a corresponding response needs to be executed on the target voice signal. Which response is executed is determined according to the monitoring result.

Different monitoring results correspond to different responses.

The monitoring result is one of the following: no second electronic device has made a voice response to the target voice signal, or a second electronic device has made a voice response to the target voice signal.

The response is one of the following: outputting a second voice signal for responding to the target audio signal, or keeping silent and outputting prompt information in a form different from the second voice signal.

Specifically, for example, the second voice signal may be "Host, I'm here; I'll execute right away". The prompt information may be light information or animation information, that is, the host is prompted by displaying a light or an animation.

It should be noted that, in other embodiments, the prompt information may also be a sound other than the second voice signal, such as a beep. When another electronic device has already output "Host, I'm here; I'll execute right away" in response to the target voice signal, the first electronic device outputs a beep to inform the host that another electronic device is already responding and that the first electronic device will not respond at this time. This reduces the execution confusion caused by many devices answering one call.

It should be noted that a monitoring result in which no second electronic device has made a voice response to the target voice signal means that the first electronic device is the first device to respond to the host. A monitoring result in which a second electronic device has already made a voice response means that another electronic device has answered the host; to reduce the "one call, many responses" problem, the first electronic device can keep silent and output prompt information indicating that the target voice signal has been received, which gives the host feedback.

In some embodiments, fig. 3 is a second flowchart of a signal processing method according to an exemplary embodiment. As shown in fig. 3, step 1032 of executing a corresponding response to the target voice signal according to the monitoring result includes:

Step 301, if the monitoring result is that the second electronic device does not respond to the target voice signal, outputting a second voice signal for responding to the target audio signal, where the second voice signal includes information indicating the first user identity;

Or,

Step 302, if the monitoring result is that the second electronic device has made a voice response to the target voice signal, silence is maintained, and a prompt message is output in a form different from the second voice signal; the prompt message is used for prompting that the target voice signal is received.

Here, when the monitoring result is that no second electronic device has made a voice response to the target voice signal, the response mode is to output a second voice signal in response to the target audio signal.

When the monitoring result is that a second electronic device has made a voice response to the target voice signal, the response mode is to keep silent and output prompt information in a form different from the second voice signal, so as to give the host feedback.

It should be noted that the second voice signal may include information indicating the first user identity. When a plurality of users speak at once, adding this information to the response makes it explicit that the first electronic device is responding to the host rather than to a non-host, so each response is clearer and confusion is reduced.
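Steps 301 and 302 amount to a two-way branch on the monitoring result, which can be sketched as below. The enum, the `Response` type, the `respond_to_target` function, and the choice of a light as the prompt are all illustrative assumptions; the disclosure only requires that the prompt take a form different from the second voice signal.

```python
from dataclasses import dataclass
from enum import Enum


class MonitoringResult(Enum):
    NO_OTHER_DEVICE_RESPONDED = 1
    OTHER_DEVICE_RESPONDED = 2


@dataclass
class Response:
    speak: bool      # True if a second voice signal is output
    utterance: str   # content of the second voice signal, empty when silent
    prompt: str      # non-voice prompt such as "light", empty when speaking


def respond_to_target(result: MonitoringResult, identity: str = "Host") -> Response:
    """Choose the response to the target voice signal from the monitoring result."""
    if result is MonitoringResult.NO_OTHER_DEVICE_RESPONDED:
        # Step 301: answer by voice and name the first user identity in the reply.
        return Response(True, f"{identity}, I'm here; I'll execute right away.", "")
    # Step 302: stay silent, but acknowledge receipt in a different form.
    return Response(False, "", "light")
```

Putting the identity into the utterance is what lets the host tell that the reply is addressed to them rather than to another speaker.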

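In some embodiments, the performing, if the target voice signal exists in step 103, a corresponding response on the target voice signal may further include:

if the target voice signal exists, generating a first information packet indicating a wake-up by the first user identity;

sending the first information packet to a server or a second electronic device; the first information packet is used for notifying the second electronic device that the first electronic device is preparing to respond to the target voice signal.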

Here, when there are a plurality of electronic devices in the space, the server is configured to coordinate and determine a response device of the plurality of electronic devices when a plurality of first voice signals are collected at the same time.

The first information packet is used for indicating that the first electronic device is awakened by the host;

The first information packet carries the time at which the target voice signal was received. When there are a plurality of electronic devices in the space and a plurality of users speak at the same time, each electronic device that determines that the host is among the speakers can generate a first information packet, so there may be a plurality of first information packets. After the first information packets are sent to the other electronic devices, each electronic device holds the first information packets of all the other electronic devices and can compare the times they carry to determine which electronic device is to respond to the target voice signal.

Here, the target voice signal is not responded to before the times in the first information packets have been compared; the response is executed only after it has been determined which device is to respond. This reduces the number of occurrences of the "one call, many responses" or simultaneous wake-up problem.

As an example, assuming that there are 3 electronic devices in the space where 3 users are speaking simultaneously, each electronic device will collect 3 first voice signals. If it is determined that one of the 3 users corresponds to the host identity, the response is made to the host user, i.e., the user characterized by the first user identity. Specifically, each electronic device generates a first information packet after determining that the master exists, and sends the first information packet to the other 2 electronic devices. At this time, each electronic device receives 2 first information packets, and the information packets also carry the time when the corresponding electronic device receives the target voice signal, so that each electronic device can determine which device receives the target voice signal first by comparing the time when the corresponding electronic device receives the target voice signal with the time carried in the received 2 first information packets, and determine that the electronic device which receives the target voice signal first responds to the target voice signal.

Here, the times may be compared in either of two ways. Each electronic device may send its first information packet to the server, and the server compares the packets to determine which device received the target voice signal first. Alternatively, the first information packet may be sent directly to the second electronic device, which performs the comparison and outputs the result. The present disclosure is not limited in this regard.
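The time comparison in the three-device example can be sketched as follows, under stated assumptions. `FirstInfoPacket`, `choose_responder`, and `should_respond` are hypothetical names; the disclosure states only that the packet carries the receive time. Breaking ties by device identifier is an addition of this sketch so that every device reaches the same answer, and device clocks are assumed to be comparable.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class FirstInfoPacket:
    device_id: str
    received_at: float  # time the target voice signal was received, in seconds


def choose_responder(packets: Iterable[FirstInfoPacket]) -> str:
    """Return the id of the device that received the target voice signal first."""
    earliest = min(packets, key=lambda p: (p.received_at, p.device_id))
    return earliest.device_id


def should_respond(own: FirstInfoPacket, received: List[FirstInfoPacket]) -> bool:
    """Device-side check: compare the device's own packet with those it received."""
    return choose_responder([own, *received]) == own.device_id
```

The same `choose_responder` function could run on a server or on each device, matching the two comparison options described above.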

It should be noted that, in some embodiments, if the target voice signal exists, the generating a first packet indicating to wake up by the first user identity may further be:

If the target voice signal exists and no second electronic equipment is monitored to make voice response to the target voice signal, a first information packet indicating to wake up by the first user identity is generated.

That is, after receiving the first information packet, the second electronic device knows that another electronic device is already responding to the host, and which one, so it no longer outputs a second voice signal in response to the target audio signal. This reduces the number of occurrences of the "one call, many responses" or simultaneous wake-up problem.

In this embodiment, the first information packet may also be sent to the server and forwarded by the server to the second electronic device, so that the second electronic device does not respond to the target voice signal. The present disclosure is not limited in this regard.
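The conditional generation of the first information packet, and its effect on a receiving device, might look like the sketch below. The packet layout (a plain dictionary with `type`, `device_id` and `received_at` keys) and both function names are hypothetical; the disclosure does not specify a packet format.

```python
from typing import Optional


def make_first_packet(target_present: bool, peer_response_heard: bool,
                      device_id: str, received_at: float) -> Optional[dict]:
    """Build a first information packet only if the owner spoke and no other device is answering."""
    if not target_present or peer_response_heard:
        return None
    return {
        "type": "first_user_wakeup",  # assumed marker for "woken up by the first user identity"
        "device_id": device_id,
        "received_at": received_at,
    }


def should_stay_silent(packet: dict, own_device_id: str) -> bool:
    """A second device that receives another device's packet suppresses its own voice response."""
    return packet["device_id"] != own_device_id
```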

In some embodiments, the performing a corresponding response to the target speech signal if the target speech signal is present comprises:

In the embodiment of the disclosure, a target voiceprint model is obtained through training according to historical voice signals corresponding to different emotion states of an owner, and the emotion state of the owner is determined through the target voiceprint model.

The emotional state may include: happy, sad, calm, angry, etc. The pitch and speech rate may differ between emotional states, and a target voiceprint model may be trained based on these differences.

To further enhance the user's use experience, the response's tone, volume, or other parameters may be adjusted for the user's current emotional state while responding to the user, giving a diversified response experience.

As a specific example, when the current emotional state of the owner is determined to be "happy", the response parameters corresponding to "happy" are determined from the correspondence between emotional states and response parameters; these include a response tone of 60 Hz and a response volume of 10 dB. A voice signal is then output at a tone of 60 Hz and a volume of 10 dB.

In some embodiments, the timbre of the response may also be selected according to the emotional state. The timbre may be a male voice or a female voice. In other embodiments, the response voice signal may be determined from a combination of tone, volume and timbre based on the emotional state. For example, upon determining that the current emotional state of the owner is "sad", a gentle female voice may be output at a certain volume.

It should be noted that, in other embodiments, in order to further improve the user experience, corresponding response content may be given according to different emotional states to provide a diversified response experience.

The response content differs between emotional states, and may be content associated with the emotional state that encourages or accompanies the user. For example, when the owner's current emotional state is determined to be "happy", the response content may be "today is a good day"; when the owner's current emotional state is determined to be "sad", the response content may be "cheer up".

In some embodiments, the target voiceprint model is created based on the steps of:

Outputting dynamically generated verification information;

Here, since the target voiceprint model is trained based on the historical voice signals of the owner in different emotional states, the historical voice signals of the owner can be obtained in the registration stage.

In the embodiment of the disclosure, in order to increase the accuracy of registration, after a user touches a registration control on the first electronic device in the registration stage to generate a voiceprint registration request, the first electronic device outputs dynamically generated verification information based on the voiceprint registration request. When the user reads the verification information aloud, a verification voice signal is obtained and matched; after the matching succeeds, a target voiceprint model for the owner is established based on the voiceprint registration request.

Here, the dynamically generated verification information may be a dynamically generated number, for example, "2395". It may also be a dynamically generated word, such as "love", and has a certain randomness.

After the verification information is output, the registering user needs to read it aloud, and the verification voice signal produced during the reading can be acquired; the verification voice signal carries the verification information. The verification information in the verification voice signal is then matched against the output verification information. After the matching succeeds, the voice signal genuinely belonging to the owner is considered to have been collected, and a target voiceprint model for the owner can then be established based on the voiceprint registration request.
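The dynamic verification step can be illustrated with a short sketch. It assumes the verification information is a numeric code and that a transcript of the user's reading is already available from some speech recognizer, which is outside the scope of this sketch; the function names are hypothetical.

```python
import secrets


def generate_verification_code(length: int = 4) -> str:
    """Return a fresh random numeric code, e.g. four digits, for the user to read aloud."""
    return "".join(secrets.choice("0123456789") for _ in range(length))


def verify_reading(expected: str, transcript: str) -> bool:
    """Match the digits found in the recognized transcript against the code that was output."""
    spoken_digits = "".join(ch for ch in transcript if ch.isdigit())
    return spoken_digits == expected
```

Because the code is regenerated for every registration request, a previously recorded reading would not match a new code, which is the randomness the text refers to.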

In some embodiments, the determining, according to a target voiceprint model, an emotional state of the first user identity characterized by the target voice signal if the target voice signal is present includes:

In the embodiment of the disclosure, the target voiceprint model is established based on a preset neural network model; the target voiceprint feature of the target voice signal can be extracted with it, and the emotional state of the owner can be determined after the target voiceprint feature is matched against the voiceprint features in a preset voiceprint library.

Here, the preset voiceprint library includes voiceprint features of the owner in different emotional states; for example, the voiceprint feature of the owner when happy and the voiceprint feature of the owner when sad. The emotional state can then be determined by comparing voiceprint features.

In the matching, the output matching result can be characterized by similarity, and the emotional state corresponding to the similarity exceeding the similarity threshold value is determined as the emotional state of the owner corresponding to the target voice signal.
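A minimal sketch of the threshold-based matching is given below. The disclosure does not name a similarity measure; cosine similarity over plain feature vectors is used here purely as an assumption, as are the default threshold of 0.8 and the rule of picking the highest score when several states exceed the threshold.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_emotion(target_feature, library, threshold=0.8):
    """Return the emotional state whose stored feature exceeds the threshold, or None."""
    best_state, best_score = None, threshold
    for state, feature in library.items():
        score = cosine_similarity(target_feature, feature)
        if score > best_score:  # must exceed the threshold, and beat any earlier match
            best_state, best_score = state, score
    return best_state
```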

In some embodiments, the performing a corresponding response to the target speech signal according to the emotional state includes:

Here, in order to perform a corresponding response in accordance with the emotional state of the user, a correspondence between the emotional state and the response parameter may be preset before the response, and the response parameter for responding to the target voice signal may be determined according to the correspondence.

The response parameters include at least: response tone and response volume.

The response tone is used to indicate the frequency level of the output speech signal.

The response volume is used to indicate the magnitude of the output speech signal, i.e. the amplitude magnitude.

After the response parameters are obtained, the voice signal is output at the response tone and at a volume equal to the response volume.
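The preset correspondence between emotional state and response parameters can be pictured as a lookup table. In the sketch below only the "happy" entry (60 Hz, 10 dB) comes from the example given earlier in this description; the "sad" entry, the default values and all names are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResponseParams:
    tone_hz: float    # response tone: frequency level of the output voice signal
    volume_db: float  # response volume: amplitude of the output voice signal


RESPONSE_TABLE = {
    "happy": ResponseParams(tone_hz=60.0, volume_db=10.0),  # values from the description's example
    "sad": ResponseParams(tone_hz=50.0, volume_db=8.0),     # hypothetical values
}

DEFAULT_PARAMS = ResponseParams(tone_hz=55.0, volume_db=9.0)  # hypothetical fallback


def response_params_for(emotion: str) -> ResponseParams:
    """Look up the response parameters for an emotional state, falling back to a default."""
    return RESPONSE_TABLE.get(emotion, DEFAULT_PARAMS)
```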

It should be noted that, in other embodiments, the response parameters may also be response content; i.e., the response content differs between emotional states. The response content may be content associated with the emotional state that encourages or accompanies the user. For example, when the owner's current emotional state is determined to be "happy", the response content may be "today is a good day"; when the owner's current emotional state is determined to be "sad", the response content may be "cheer up".

Here, the response content may be output after a fixed response sentence. For example, if the fixed response sentence is "Owner, I am here, executing right away" and the owner's current emotional state is determined to be "happy", the response content may be "today is a good day", and the final output voice signal is "Owner, I am here, executing right away, today is a good day".

Thus, the device can respond in a manner matched to the user's emotional state, giving different response effects.
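Appending emotion-specific content to a fixed response sentence could be sketched as follows. The strings are illustrative English renderings of the examples in the text, and the constant and function names are assumptions.

```python
FIXED_RESPONSE = "Owner, I am here, executing right away."

# Illustrative content per emotional state; a real table would be preset by the device.
EMOTION_CONTENT = {
    "happy": "Today is a good day.",
    "sad": "Cheer up.",
}


def compose_response(emotion: str) -> str:
    """Return the fixed sentence, followed by emotion-specific content when one is defined."""
    extra = EMOTION_CONTENT.get(emotion)
    return FIXED_RESPONSE if extra is None else f"{FIXED_RESPONSE} {extra}"
```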

In some embodiments, Fig. 4 is a third flowchart of a signal processing method according to an exemplary embodiment. As shown in Fig. 4, the method further comprises:

Step 104, if the target voice signal does not exist, generating different second information packets indicating to wake up by the non-first user identity according to different first voice signals; the different second information packets carry identification information for distinguishing different first voice signals;

Step 105, sending a plurality of different second information packets to a server;

Step 106, receiving different decision instructions generated by the server based on different second information packets;

Step 107, determining whether to respond to the first voice signal according to each decision instruction.

When a plurality of users speak at the same moment and it is determined that none of them has the owner identity, corresponding second information packets are generated for the different first voice signals. For example, assuming that 3 users are speaking at the same time and that none of the 3 users is the owner, 3 second information packets are generated.

In the embodiment of the disclosure, each second information packet carries identification information, and different first voice signals are distinguished by different identification information. For example, the received first voice signal of user A is identified by the letter "A", that of user B by the letter "B", and that of user C by the letter "C". If none of users A, B and C is the owner, the 3 second information packets generated by the first electronic device are: an information packet carrying the letter "A", an information packet carrying the letter "B", and an information packet carrying the letter "C". In this way, the first voice signals can later be distinguished directly, conveniently and quickly, based on the identification.

After receiving the second information packets, the server generates different decision instructions based on the different second information packets; the decision instructions indicate the response devices corresponding to the different first voice signals.

For example, for the first voice signal uttered by user A, the server determines a response device based on the received second information packet; for the first voice signal uttered by user B, the server likewise determines a response device based on the received second information packet; and so on. In this way, a response device can be determined for the first voice signal of each user, and each response device can perform the corresponding response.
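The device side of steps 104 to 107 can be sketched as below. The packet layout and the form of the decision instructions (a mapping from signal identifier to the chosen device identifier) are assumptions made for illustration; the disclosure does not define either format.

```python
def build_second_packets(device_id, signal_ids):
    """Step 104: one second information packet per non-owner voice signal, tagged with its identifier."""
    return [
        {"type": "non_first_user_wakeup", "device_id": device_id, "signal_id": sid}
        for sid in signal_ids
    ]


def signals_to_answer(device_id, decisions):
    """Step 107: from the server's decisions, keep the signals this device was chosen to answer."""
    return [sid for sid, chosen in decisions.items() if chosen == device_id]
```

For instance, a device `"dev1"` that collected signals "A", "B" and "C" would send three packets (step 105) and, after receiving the decisions (step 106), answer only those signals for which the server selected `"dev1"`.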

In some embodiments, the generating, according to the different first voice signals, a second packet indicating to wake up by a non-first user identity includes:

Here, for the first voice signals of different users, the response device may be determined based on the proximity principle; i.e., whichever device is closest to the speaking user responds to that user.

Determining the distance to the speaking user according to the different first voice signals includes: determining the distance to the speaking user of each first voice signal based on the times at which the different first voice signals were received.

After the distance is obtained, it is combined, as a parameter, with the determination result that no target voice signal of the owner exists in the plurality of first voice signals, to generate the second information packet. Thus, when the second information packet is sent to the server, the server can determine, based on the received second information packets, the electronic device closest to the speaking user as the response device.
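The server-side proximity decision might be sketched as follows, assuming each second information packet carries a `signal_id`, a `device_id` and an already computed `distance`. These field names, and the tie-break on device identifier, are hypothetical; how the distance is derived from the reception time is left to the device and is not modelled here.

```python
def decide_responders(packets):
    """For each voice signal, choose the device that reported the smallest distance to the speaker."""
    best = {}
    for packet in packets:
        sid = packet["signal_id"]
        current = best.get(sid)
        candidate_key = (packet["distance"], packet["device_id"])
        # Assumed tie-break: equal distances are resolved by device id for a deterministic result.
        if current is None or candidate_key < (current["distance"], current["device_id"]):
            best[sid] = packet
    return {sid: packet["device_id"] for sid, packet in best.items()}
```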

In this way, when a plurality of first voice signals are obtained, the signal processing method provided by the embodiment of the disclosure determines whether a target voice signal representing the identity of the first user exists for the plurality of first voice signals, and if so, responds to the target voice signal. Therefore, when a plurality of users wake up the first electronic equipment at the same time, the first electronic equipment is controlled to respond to the user with the first user identity preferentially, so that the requirement of the user with the first user identity can be responded in time, the response logic of the equipment to a plurality of input voices is optimized, the false response or delayed response of the equipment is avoided, and the use experience of the user is improved.

The present disclosure further provides a signal processing apparatus, and fig. 5 is a schematic structural diagram of a signal processing apparatus according to an exemplary embodiment, and as shown in fig. 5, the signal processing apparatus 500 includes:

an obtaining module 501, configured to obtain a plurality of collected first voice signals, where the plurality of first voice signals correspond to a plurality of user identities;

a determining module 502, configured to determine whether a target voice signal representing a first user identity exists in a plurality of the first voice signals;

And a response module 503, configured to execute a corresponding response to the target voice signal if the target voice signal exists.

In some embodiments, the response module includes:

In some embodiments, the processing module comprises:

Or,

In some embodiments, the response module includes:

Outputting dynamically generated verification information;

In some embodiments, the emotion determination module includes:

In some embodiments, the emotional response sub-module is further to:

In some embodiments, the response module is further configured to:

In some embodiments, the apparatus further comprises:

In some embodiments, the generating module is further configured to:

The specific manner in which the various modules of the apparatus in the above embodiments perform their operations has been described in detail in connection with the method embodiments, and will not be described in detail herein.

Fig. 6 is a block diagram illustrating a signal processing apparatus 1800, according to an exemplary embodiment. For example, apparatus 1800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, or the like.

Referring to fig. 6, apparatus 1800 may include one or more of the following components: a processing component 1802, a memory 1804, a power component 1806, a multimedia component 1808, an audio component 1810, an input/output (I/O) interface 1812, a sensor component 1814, and a communication component 1816.

The processing component 1802 generally controls overall operation of the device 1800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 1802 may include one or more processors 1820 to execute instructions to perform all or part of the steps of the methods described above. In addition, the processing component 1802 may also include one or more modules that facilitate interactions between the processing component 1802 and other components. For example, the processing component 1802 may include a multimedia module to facilitate interaction between the multimedia component 1808 and the processing component 1802.

The memory 1804 is configured to store various types of data to support operations at the apparatus 1800. Examples of such data include instructions for any application or method operating on the device 1800, contact data, phonebook data, messages, images, video, and the like. The memory 1804 may be implemented by any type or combination of volatile or nonvolatile memory devices, such as Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk.

The power component 1806 provides power to the various components of the device 1800. The power component 1806 may include: a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the device 1800.

The multimedia component 1808 includes a screen that provides an output interface between the device 1800 and the user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may sense not only the boundary of a touch or slide action, but also the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 1808 includes a front-facing camera and/or a rear-facing camera. The front camera and/or the rear camera may receive external multimedia data when the device 1800 is in an operational mode, such as a shooting mode or a video mode. Each front camera and/or rear camera may be a fixed optical lens system or have focal length and optical zoom capabilities.

The audio component 1810 is configured to output and/or input audio signals. For example, the audio component 1810 includes a Microphone (MIC) configured to receive external audio signals when the device 1800 is in an operational mode, such as a call mode, a recording mode, and a speech recognition mode. The received audio signals may be further stored in the memory 1804 or transmitted via the communication component 1816. In some embodiments, audio component 1810 also includes a speaker for outputting audio signals.

The I/O interface 1812 provides an interface between the processing component 1802 and a peripheral interface module, which may be a keyboard, click wheel, buttons, or the like. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.

The sensor component 1814 includes one or more sensors for providing status assessments of various aspects of the apparatus 1800. For example, the sensor component 1814 may detect the on/off state of the device 1800 and the relative positioning of components, such as the display and keypad of the device 1800. The sensor component 1814 may also detect a change in position of the device 1800 or of one of its components, the presence or absence of user contact with the device 1800, the orientation or acceleration/deceleration of the device 1800, and a change in temperature of the device 1800. The sensor component 1814 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 1814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 1814 may also include an acceleration sensor, a gyroscopic sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.

The communication component 1816 is configured to facilitate wired or wireless communication between the apparatus 1800 and other devices. The device 1800 may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In one exemplary embodiment, the communication component 1816 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 1816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra-Wideband (UWB) technology, Bluetooth (BT) technology, or other technologies.

In an exemplary embodiment, the apparatus 1800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements for executing the methods described above.

In an exemplary embodiment, a non-transitory computer-readable storage medium is also provided, such as the memory 1804, including instructions executable by the processor 1820 of the apparatus 1800 to perform the above-described methods. For example, the non-transitory computer-readable storage medium may be ROM, Random Access Memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, etc.

A non-transitory computer-readable storage medium stores instructions which, when executed by a processor, enable the above-described method to be performed.

Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure that follow its general principles and include such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with the true scope and spirit of the disclosure being indicated by the following claims.

It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims

1. A signal processing method, applied to a first electronic device, comprising:

Performing a corresponding response to the target speech signal if the target speech signal is present, the performing a corresponding response to the target speech signal if the target speech signal is present comprising:

If the monitoring result is that the second electronic device does not make a voice response to the target voice signal, outputting a second voice signal for responding to the target voice signal, wherein the second voice signal contains information indicating the first user identity;

if the monitoring result is that the second electronic device makes a voice response to the target voice signal, remaining silent and outputting prompt information in a form different from the second voice signal; the prompt information comprises light information or animation information and is used for prompting that the target voice signal has been received.

2. The method of claim 1, wherein the performing a corresponding response to the target speech signal if the target speech signal is present comprises:

The emotional state is used for executing corresponding response to the target voice signal;

3. The method of claim 2, wherein the target voiceprint model is created based on the steps of:

Outputting dynamically generated verification information;

4. The method of claim 2, wherein the determining the emotional state of the first user identity characterized by the target speech signal according to a target voiceprint model if the target speech signal is present comprises:

5. The method according to claim 2, wherein the method further comprises:

Determining a response parameter for responding to the target voice signal according to the emotion state; wherein the response parameters include at least: response tone and response volume; the response tone and the response volume are used for outputting the second voice signal.

6. The method of claim 1, wherein the performing a corresponding response to the target speech signal if the target speech signal is present comprises:

7. The method according to claim 1, wherein the method further comprises:

transmitting a plurality of different second information packets to a server;

8. The method of claim 7, wherein generating a second packet indicative of waking up from a non-first user identity based on the different first voice signals comprises:

9. A signal processing apparatus, characterized by being applied to a first electronic device, comprising:

a response module, configured to execute a corresponding response to the target voice signal if the target voice signal exists;

A processing module, comprising:

The first processing sub-module is used for outputting a second voice signal for responding to the target voice signal if the monitoring result is that the second electronic equipment does not respond to the target voice signal, wherein the second voice signal contains information indicating the first user identity;

The second processing sub-module is used for keeping silence and outputting prompt information in a form different from the second voice signal if the monitoring result is that the second electronic equipment makes voice response to the target voice signal; the prompt information comprises lamplight information or animation information and is used for prompting that the target voice signal is received.

10. The apparatus of claim 9, wherein the response module comprises:

11. The apparatus of claim 10, wherein the target voiceprint model is created based on the steps of:

Outputting dynamically generated verification information;

12. The apparatus of claim 10, wherein the emotion determination module comprises:

13. The apparatus of claim 10, wherein the emotional response sub-module is further configured to:

14. The apparatus of claim 9, wherein the response module is further configured to:

15. The apparatus of claim 9, wherein the apparatus further comprises:

16. The apparatus of claim 15, wherein the generating module is further configured to:

17. A signal processing apparatus, comprising:

A processor and a memory for storing executable instructions capable of executing on the processor, wherein:

The processor is configured to execute the executable instructions and, when executing them, to perform the steps of the method provided in any one of the preceding claims 1 to 8.

18. A non-transitory computer readable storage medium having stored therein computer executable instructions which when executed by a processor implement the steps of the method provided in any one of the preceding claims 1 to 8.