CN113674759A - Fixed sound source identification method and device - Google Patents


Publication number
CN113674759A
CN113674759A (application number CN202011399173.7A)
Authority
CN
China
Prior art keywords
sound, sound signal, fixed, attribute information, signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011399173.7A
Other languages
Chinese (zh)
Inventor
李晓建
胡伟湘
王保辉
李伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to PCT/CN2021/092948 priority Critical patent/WO2021228059A1/en
Publication of CN113674759A publication Critical patent/CN113674759A/en
Legal status: Pending

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78: Detection of presence or absence of voice signals
    • G10L25/84: Detection of presence or absence of voice signals for discriminating voice from noise
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G10L21/0216: Noise filtering characterised by the method used for estimating noise
    • G10L2021/02161: Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166: Microphone arrays; Beamforming

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

An embodiment of the present application discloses a fixed sound source identification method and device. The method includes: an electronic device acquires a first audio stream in a first time period, the first audio stream including at least a first sound signal; the electronic device separates the first sound signal from the first audio stream; the electronic device determines first attribute information of the first sound signal; the electronic device judges whether the first attribute information matches attribute information of a fixed sound source in a fixed sound source library; and when the first attribute information matches the attribute information of a fixed sound source in the fixed sound source library, the first sound signal is determined to be a sound signal emitted by that fixed sound source. The electronic device matches the first attribute information of the first sound signal against the fixed sound source library; if the first attribute information matches the attribute information of a fixed sound source in the library, the first sound signal is a sound signal emitted by that fixed sound source, so the electronic device can accurately identify fixed sound sources present in its environment.

Description

Fixed sound source identification method and device
The present application claims priority to Chinese patent application No. 202010404799.6, entitled "A fixed sound source identification method and apparatus", filed with the China National Intellectual Property Administration on May 14, 2020.
Technical Field
The application relates to the field of artificial intelligence, in particular to a fixed sound source identification method and device.
Background
With the progress of technology, intelligent voice recognition functions have become widely available on electronic devices. For example, electronic devices such as smart phones, smart speakers, smart televisions, and smart robots are all provided with an intelligent voice recognition function. Currently, when using such electronic devices, a user needs to issue voice commands in a quiet environment so that the electronic device can perform the corresponding operation according to the voice command issued by the user.
If a noise source exists in the environment where the user is located, the electronic device receives the noise emitted by the noise source at the same time as the voice command input by the user. The voice command is therefore interfered with by the noise, the electronic device has difficulty correctly recognizing the real intention behind the voice command, and the accuracy with which the electronic device recognizes speech is reduced.
Therefore, how to identify the noise sources in the environment around the electronic device, so as to prevent the electronic device from being interfered with by environmental noise, is a technical problem that needs to be solved.
Disclosure of Invention
The embodiment of the application provides a fixed sound source identification method and device, which are used for identifying a fixed sound source in the surrounding environment of electronic equipment.
In a first aspect, an embodiment of the present application provides a fixed sound source identification method, where the method is applied in an electronic device, and the method includes: the electronic equipment acquires a first audio stream in a first time period, wherein the first audio stream at least comprises a first sound signal; the electronic device separating the first sound signal in the first audio stream; the electronic equipment determines first attribute information of the first sound signal; the electronic equipment judges whether the first attribute information is matched with attribute information of a fixed sound source in a fixed sound source library, wherein the fixed sound source library comprises attribute information corresponding to one or more fixed sound sources, and the fixed sound sources are sound sources which are located at the same position and emit a known sound type; and when the first attribute information is matched with the attribute information of the fixed sound source in the fixed sound source library, determining the first sound signal as the sound signal emitted by the fixed sound source.
In the first aspect, the electronic device can match the first attribute information of the first sound signal in the first audio stream with a pre-generated fixed sound source library, and if the first attribute information matches the attribute information of the fixed sound source in the fixed sound source library, it indicates that the first sound signal is a sound signal emitted by the fixed sound source, so that the electronic device can accurately identify the fixed sound source existing in the environment.
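The first-aspect flow — separate a sound signal, extract its attribute information, and match it against a fixed sound source library — can be sketched as follows. The attribute schema (a bearing in degrees, a sound-type label, a coarse time of day) and the 10-degree matching tolerance are illustrative assumptions, not details specified by the patent.

```python
from dataclasses import dataclass

@dataclass
class AttributeInfo:
    position_deg: float   # sound production position relative to the device
    sound_type: str       # e.g. "range_hood" (labels are assumptions)
    hour: int             # coarse sound production time

def matches_fixed_source(attr, library, pos_tol_deg=10.0):
    """Return the matching fixed-source entry, or None.

    A signal is treated as coming from a fixed sound source when its
    position and sound type agree with a stored entry; the tolerance
    is an illustrative assumption.
    """
    for entry in library:
        if (abs(attr.position_deg - entry.position_deg) <= pos_tol_deg
                and attr.sound_type == entry.sound_type):
            return entry
    return None

# Hypothetical library mirroring the scenario of fig. 1 and fig. 2.
library = [
    AttributeInfo(140.0, "range_hood", 18),
    AttributeInfo(60.0, "air_conditioner", 18),
]

signal_a = AttributeInfo(141.5, "range_hood", 18)
signal_c = AttributeInfo(20.0, "speech", 18)
print(matches_fixed_source(signal_a, library) is not None)  # fixed source: mask it
print(matches_fixed_source(signal_c, library) is not None)  # user speech: recognize it
```

A matched signal would then be masked before speech recognition, while unmatched signals are passed on as candidate voice commands.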
In one possible implementation manner of the first aspect, an electronic device includes a microphone array, and the electronic device acquires a first audio stream in a first time period, and includes: the electronic device generates a first audio stream using the microphone array to capture sound in an environment in which the electronic device is located for a first time period.
In one possible implementation manner of the first aspect, the first attribute information includes an utterance position, a sound type, and an utterance time of the first sound signal.
In one possible implementation manner of the first aspect, the determining, by the electronic device, first attribute information of the first sound signal includes: the electronic equipment determines the sound production position of the first sound signal by using the microphone array; the electronic equipment determines the sound type of the first sound signal according to the sound characteristics of the first sound signal; the electronic device determines an utterance time of the first sound signal.
In one possible implementation manner of the first aspect, the first attribute information includes an utterance position, a sound content, and an utterance time of the first sound signal.
In one possible implementation manner of the first aspect, the determining, by the electronic device, first attribute information of the first sound signal includes: the electronic equipment determines the sound production position of the first sound signal by using the microphone array; the electronic equipment determines the sound content of the first sound signal according to the sound characteristics of the first sound signal; the electronic device determines an utterance time of the first sound signal.
In one possible implementation manner of the first aspect, the first attribute information includes an utterance position, a sound type, a sound content, and an utterance time of the first sound signal.
In one possible implementation manner of the first aspect, the determining, by the electronic device, first attribute information of the first sound signal includes: the electronic equipment determines the sound production position of the first sound signal by using the microphone array; the electronic equipment determines the sound type of the first sound signal according to the sound characteristics of the first sound signal; the electronic equipment determines the sound content of the first sound signal according to the sound characteristics of the first sound signal; the electronic device determines an utterance time of the first sound signal.
In one possible implementation manner of the first aspect, the determining, by the electronic device, a sound type of the first sound signal according to a sound feature of the first sound signal includes: the electronic equipment determines whether a sound type corresponding to the sound characteristic of the first sound signal exists in a sound event library, wherein the sound event library comprises one or more sound types; when the sound type corresponding to the sound feature of the first sound signal exists in the sound event library, determining the sound type corresponding to the sound feature of the first sound signal as the sound type of the first sound signal; when the sound type corresponding to the sound feature of the first sound signal does not exist in the sound event library, the electronic equipment sends a first network request to the external server, receives a first response request sent by the external server, wherein the first network request comprises the sound feature of the first sound signal, and the first response request comprises the sound type corresponding to the sound feature of the first sound signal; or, when the sound type corresponding to the sound feature of the first sound signal does not exist in the sound event library, the electronic device determines whether the number of times the sound feature of the first sound signal appears at the first position is larger than a first threshold, the first position is the sound production position of the first sound signal, and if the number of times the sound feature of the first sound signal appears at the first position is larger than the first threshold, the sound type of the first sound signal is determined to be the known sound type.
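The three-branch decision in this implementation (a sound event library hit; otherwise a query to the external server; otherwise counting occurrences at the first position against the first threshold) can be sketched as follows. The class and parameter names are illustrative assumptions; in particular, `query_server` merely stands in for the first network request / first response exchange with the external server.

```python
from collections import defaultdict

class SoundTypeResolver:
    """Illustrative sketch of the sound-type decision described above."""

    def __init__(self, event_library, first_threshold=3, query_server=None):
        self.event_library = event_library       # sound feature key -> sound type
        self.first_threshold = first_threshold
        self.query_server = query_server         # hypothetical stand-in callable
        self.position_counts = defaultdict(int)  # (feature, position) occurrences

    def resolve(self, feature_key, position):
        # Branch 1: the sound event library already knows this feature.
        if feature_key in self.event_library:
            return self.event_library[feature_key]
        # Branch 2: ask the external server, if one is reachable.
        if self.query_server is not None:
            return self.query_server(feature_key)
        # Branch 3: count occurrences at this position; once the count
        # exceeds the first threshold, treat the type as known.
        self.position_counts[(feature_key, position)] += 1
        if self.position_counts[(feature_key, position)] > self.first_threshold:
            return "known"
        return None

resolver = SoundTypeResolver({"feat_hood": "range_hood"}, first_threshold=2)
print(resolver.resolve("feat_hood", 140))  # library hit
for _ in range(3):
    last = resolver.resolve("feat_new", 140)
print(last)                                # promoted after exceeding the threshold
```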
In one possible implementation manner of the first aspect, the electronic device can obtain the sound type corresponding to the sound feature of the first sound signal either from the sound event library or from the external server.
In a possible implementation manner of the first aspect, the method further includes: the electronic equipment acquires a second audio stream in a second time period, wherein the second audio stream at least comprises a second sound signal; the electronic equipment determines second attribute information of the second sound signal; the electronic equipment judges whether the second attribute information exists in a fixed sound source library, the fixed sound source library comprises attribute information corresponding to one or more fixed sound sources, and the fixed sound sources are sound sources which are located at the same position and emit a known sound type; and when the second attribute information does not exist in the fixed sound source library, storing the second attribute information into the fixed sound source library.
The electronic equipment can establish a fixed sound source library and can continuously update the content in the fixed sound source library.
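An illustrative sketch of this library-building flow, assuming the second attribute information is represented as a plain dict (the keys are assumptions, not the patent's data model):

```python
def update_fixed_source_library(library, attr):
    """Store `attr` in the fixed sound source library if no equivalent
    entry exists yet; return True when the library was extended."""
    if attr not in library:
        library.append(attr)
        return True
    return False

lib = []
entry = {"position_deg": 140, "sound_type": "range_hood", "hour": 18}
print(update_fixed_source_library(lib, entry))        # new entry stored
print(update_fixed_source_library(lib, dict(entry)))  # duplicate ignored
```

Running this update on every acquired audio stream keeps the library current as sources appear in the environment.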
In a second aspect, embodiments of the present application provide an electronic device, including a memory and a processor connected to the memory, the memory being configured to store instructions; the processor is configured to execute the instructions to cause the electronic device to: acquire a first audio stream in a first time period, wherein the first audio stream at least comprises a first sound signal; separate the first sound signal in the first audio stream; determine first attribute information of the first sound signal; judge whether the first attribute information is matched with attribute information of a fixed sound source in a fixed sound source library, wherein the fixed sound source library comprises attribute information corresponding to one or more fixed sound sources, and the fixed sound sources are sound sources which are located at the same position and emit a known sound type; and when the first attribute information is matched with the attribute information of the fixed sound source in the fixed sound source library, determine the first sound signal as the sound signal emitted by the fixed sound source.
In one possible implementation of the second aspect, the electronic device comprises a microphone array; and a processor, specifically configured to capture sound in an environment in which the electronic device is located within a first time period using the microphone array to generate a first audio stream.
In one possible implementation manner of the second aspect, the first attribute information includes an utterance position, a sound type, and an utterance time of the first sound signal.
In a possible implementation form of the second aspect, the processor is specifically configured to determine a sound emission location of the first sound signal using the microphone array; determining a sound type of the first sound signal according to the sound characteristics of the first sound signal; the sound emission time of the first sound signal is determined.
In one possible implementation manner of the second aspect, the first attribute information includes an utterance position, a sound content, and an utterance time of the first sound signal.
In a possible implementation form of the second aspect, the processor is specifically configured to determine a sound emission location of the first sound signal using the microphone array; determining the sound content of the first sound signal according to the sound characteristics of the first sound signal; the sound emission time of the first sound signal is determined.
In one possible implementation manner of the second aspect, the first attribute information includes an utterance position, a sound type, a sound content, and an utterance time of the first sound signal.
In a possible implementation form of the second aspect, the processor is specifically configured to determine a sound emission location of the first sound signal using the microphone array; determining a sound type of the first sound signal according to the sound characteristics of the first sound signal; determining the sound content of the first sound signal according to the sound characteristics of the first sound signal; the sound emission time of the first sound signal is determined.
In a possible implementation manner of the second aspect, the processor is specifically configured to determine whether a sound type corresponding to the sound feature of the first sound signal exists in a sound event library, where the sound event library includes one or more sound types; when the sound type corresponding to the sound feature of the first sound signal exists in the sound event library, determining the sound type corresponding to the sound feature of the first sound signal as the sound type of the first sound signal; when the sound type corresponding to the sound feature of the first sound signal does not exist in the sound event library, sending a first network request to an external server, and receiving a first response request sent by the external server, wherein the first network request comprises the sound feature of the first sound signal, and the first response request comprises the sound type corresponding to the sound feature of the first sound signal; or, when the sound type corresponding to the sound feature of the first sound signal does not exist in the sound event library, determining whether the number of times the sound feature of the first sound signal appears at the first position is larger than a first threshold, wherein the first position is the sound production position of the first sound signal, and if the number of times the sound feature of the first sound signal appears at the first position is larger than the first threshold, determining that the sound type of the first sound signal is the known sound type.
In a possible implementation manner of the second aspect, the processor is further configured to obtain a second audio stream in a second time period, where the second audio stream includes at least a second sound signal; determining second attribute information of the second sound signal; judging whether the second attribute information exists in a fixed sound source library, wherein the fixed sound source library comprises attribute information corresponding to one or more fixed sound sources, and the fixed sound sources are sound sources which are positioned at the same position and emit a known sound type; and when the second attribute information does not exist in the fixed sound source library, storing the second attribute information into the fixed sound source library.
In a third aspect, an embodiment of the present application provides an electronic device, including: the device comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring a first audio stream in a first time period, and the first audio stream at least comprises a first sound signal; a processing module for separating a first sound signal in a first audio stream; determining first attribute information of a first sound signal; judging whether the first attribute information is matched with attribute information of a fixed sound source in a fixed sound source library, wherein the fixed sound source library comprises attribute information corresponding to one or more fixed sound sources, and the fixed sound sources are sound sources which are positioned at the same position and emit a known sound type; and when the first attribute information is matched with the attribute information of the fixed sound source in the fixed sound source library, determining the first sound signal as the sound signal emitted by the fixed sound source.
Drawings
Fig. 1 is a schematic view of a scene provided in an embodiment of the present application;
FIG. 2 is a schematic view of the azimuth of the 3 sound sources of FIG. 1;
fig. 3 is a flowchart illustrating a fixed sound source identification method according to an embodiment of the present application;
fig. 4 is a schematic view of another scenario provided in the embodiment of the present application;
FIG. 5 is a schematic view of the azimuth of the 2 sound sources of FIG. 4;
fig. 6 is a flowchart illustrating another fixed sound source identification method according to an embodiment of the present application;
fig. 7 is a schematic view of another scenario provided by the embodiment of the present application;
fig. 8 is a flowchart illustrating a further fixed sound source identification method according to an embodiment of the present application;
fig. 9 is a schematic view of another scenario provided by the embodiment of the present application;
FIG. 10 is a schematic view of the azimuth of 1 sound source of FIG. 9;
fig. 11 is a schematic view of an electronic device according to an embodiment of the present application;
fig. 12 is a schematic view of another electronic device provided in the embodiment of the present application.
Detailed Description
Referring to fig. 1 and fig. 2, fig. 1 is a schematic view of a scene provided by an embodiment of the present application, and fig. 2 is a schematic view of directions of 3 sound sources in fig. 1. Fig. 1 is a schematic view of a scene showing a smart sound box 100, a range hood 200, an air conditioner 300, and a user 400. The smart sound box 100 shown in fig. 1 is capable of executing the fixed sound source identification method provided in the embodiment of the present application.
In one possible scenario, as shown in fig. 1 and 2, it is assumed that a user 400 has devices such as a smart sound box 100, a range hood 200, and an air conditioner 300 in a home, and the relative positions of the smart sound box 100, the range hood 200, and the air conditioner 300 are as shown in fig. 1 and 2. Assuming that user 400 wants to have smart sound box 100 play song ABC, user 400 may issue a voice command to smart sound box 100 to play song ABC, which is sound signal C in fig. 1 and 2.
Assuming that the range hood 200 is in an operating state while user 400 sends sound signal C to smart sound box 100, the range hood 200 emits noise that reaches smart sound box 100; the noise emitted by the range hood 200 is sound signal A in fig. 1 and fig. 2.
Assuming that air conditioner 300 is in an operating state during the process of sending sound signal C to smart sound box 100 by user 400, air conditioner 300 may send noise to smart sound box 100, where the noise sent by air conditioner 300 is sound signal B in fig. 1 and 2.
At this time, the smart sound box 100 may determine which received sound signals belong to fixed sound sources by using the fixed sound source identification method provided in the embodiment of the present application. After smart sound box 100 determines that sound signal A and sound signal B belong to fixed sound sources, smart sound box 100 may mask the received sound signal A and sound signal B and only recognize the voice command corresponding to the received sound signal C, thereby playing song ABC for the user.
In the example shown in fig. 1 and fig. 2, the smart sound box 100 can accurately identify a fixed sound source existing in an environment by using the fixed sound source identification method provided in the embodiment of the present application, and correctly identify a real intention corresponding to a voice command input by the user 400, so as to improve the accuracy of identifying the voice command by the smart sound box 100.
Referring to fig. 3, fig. 3 is a flowchart illustrating a fixed sound source identification method according to an embodiment of the present application. The fixed sound source recognition method shown in fig. 3 may be applied to electronic devices, and the electronic devices may be devices with an intelligent voice recognition function, such as a smart phone, a smart speaker, a smart television, and a smart robot. The method shown in fig. 3 includes the following steps S101 to S104.
S101, the electronic equipment obtains a first audio stream in a first time period, wherein the first audio stream at least comprises a first sound signal.
Wherein the first time period refers to a time period during which a user inputs a voice instruction to the electronic device.
When the electronic device includes a microphone array, the electronic device may generate a first audio stream using the microphone array to capture sound in an environment in which the electronic device is located for a first time period.
For example, referring to fig. 1 and fig. 2, it is assumed that the first audio stream includes sound signal A, sound signal B, and sound signal C, where the first sound signal is sound signal A.
After S101 and before S102, the electronic device further needs to separate the first sound signal from the first audio stream; the specific separation process includes steps a1 and a2:
step a1, the electronic device performs a preprocessing operation on the first audio stream to obtain a modified first audio stream.
The preprocessing operation includes centering, whitening, principal component analysis (PCA) dimensionality reduction, and temporal filtering. The purpose of the preprocessing operation is to reduce noise in the first audio stream.
Step a2, the electronic device performs independent component analysis (ICA) processing on the modified first audio stream to obtain the first sound signal.
Wherein the independent component analysis process is used to separate the first sound signal in the first audio stream, so that the subsequent steps can perform corresponding processing on the first sound signal.
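The centering, whitening, and PCA stages of step a1 can be illustrated with the following numpy sketch; an ICA algorithm such as FastICA would then operate on the whitened mixture. This is a minimal illustration under assumed array shapes, not the implementation claimed by the patent.

```python
import numpy as np

def center_and_whiten(x):
    """Center each channel and whiten the mixture so that channels are
    uncorrelated with unit variance -- the usual preprocessing before
    independent component analysis.

    x: array of shape (channels, samples), e.g. one row per microphone.
    """
    x = x - x.mean(axis=1, keepdims=True)           # centering
    cov = np.cov(x)                                 # channel covariance
    vals, vecs = np.linalg.eigh(cov)                # PCA of the covariance
    whitener = vecs @ np.diag(1.0 / np.sqrt(vals + 1e-12)) @ vecs.T
    return whitener @ x

rng = np.random.default_rng(0)
sources = rng.normal(size=(2, 4000))                  # two independent sources
mixed = np.array([[1.0, 0.6], [0.4, 1.0]]) @ sources  # simulated 2-mic mixture
white = center_and_whiten(mixed)
print(np.allclose(np.cov(white), np.eye(2), atol=1e-6))
```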
S102, the electronic equipment determines first attribute information of the first sound signal.
There are various implementations of determining the first attribute information of the first sound signal by the electronic device, and several specific implementations are described below.
In a first implementation manner, if the first attribute information includes an utterance position, a sound type, and an utterance time of the first sound signal, the electronic device determines the first attribute information of the first sound signal, and may include the following steps: the electronic equipment determines the sound production position of the first sound signal by using the microphone array; the electronic equipment determines the sound type of the first sound signal according to the sound characteristics of the first sound signal; the electronic device determines an utterance time of the first sound signal.
In the first implementation manner, the sound production position of the first sound signal refers to the position of the sound source corresponding to the first sound signal relative to the electronic device. For example, referring to fig. 1 and fig. 2, assuming that the first sound signal is sound signal A, the sound production position of sound signal A is the position, relative to smart sound box 100, of sound source 1 corresponding to sound signal A. If the position of smart sound box 100 is taken as the center point, sound source 1 is located at 140 degrees relative to smart sound box 100.
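One common way to turn microphone-array measurements into a sound production position is to estimate the inter-microphone time delay and map it to a bearing such as the 140 degrees in the example. The patent does not specify an algorithm; generalized cross-correlation with phase transform (GCC-PHAT) is used below purely as an illustrative choice.

```python
import numpy as np

def gcc_phat_delay(sig, ref, fs):
    """Estimate the delay (in seconds) of `sig` relative to `ref`
    using GCC-PHAT on a zero-padded FFT."""
    n = len(sig) + len(ref)
    spec = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    spec /= np.abs(spec) + 1e-12                   # phase transform weighting
    cc = np.fft.irfft(spec, n=n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs

fs = 16000
rng = np.random.default_rng(1)
ref = rng.normal(size=2048)
sig = np.concatenate([np.zeros(5), ref])[:2048]    # ref delayed by 5 samples
print(round(gcc_phat_delay(sig, ref, fs) * fs))    # recovered delay in samples
```

With two or more microphone pairs, the recovered delays constrain the direction of arrival of the source.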
In the first implementation, the sound features of the first sound signal include, but are not limited to, Mel-frequency cepstral coefficients (MFCC). The sound type of the first sound signal refers to the kind of sound emitted by the sound source corresponding to the first sound signal. For example, referring to fig. 1 and fig. 2, assuming that the first sound signal is sound signal A, the sound type of sound signal A is the sound of the range hood 200.
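For reference, MFCC extraction can be sketched in plain numpy as follows; the frame length, hop size, and filter counts are conventional defaults rather than values taken from the patent.

```python
import numpy as np

def mfcc(signal, fs, n_mels=26, n_ceps=13, frame_len=400, hop=160):
    """Minimal MFCC sketch: framing, power spectrum, triangular mel
    filterbank, log compression, and a DCT-II."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len) + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2

    # Triangular mel filterbank between 0 Hz and fs/2.
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = inv_mel(np.linspace(mel(0.0), mel(fs / 2.0), n_mels + 2))
    bins = np.floor((frame_len + 1) * edges / fs).astype(int)
    fbank = np.zeros((n_mels, frame_len // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, mid, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, mid):
            fbank[m - 1, k] = (k - lo) / max(mid - lo, 1)
        for k in range(mid, hi):
            fbank[m - 1, k] = (hi - k) / max(hi - mid, 1)
    log_energy = np.log(power @ fbank.T + 1e-10)

    # DCT-II decorrelates the filterbank energies into cepstral coefficients.
    j = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * j + 1) / (2.0 * n_mels)))
    return log_energy @ dct.T

fs = 16000
t = np.arange(fs) / fs
feats = mfcc(np.sin(2 * np.pi * 440.0 * t), fs)
print(feats.shape)   # (frames, cepstral coefficients)
```

In practice, per-frame MFCC vectors like these would be the "sound features" compared against the sound event library.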
In the first implementation manner, the sound production time of the first sound signal is the time at which the electronic device receives the first sound signal. For example, referring to fig. 1 and fig. 2, assuming that the first sound signal is sound signal A, the sound production time of sound signal A is 18:30 on April 10, 2020.
In a second implementation manner, if the first attribute information includes the sound emission position, the sound content, and the sound emission time of the first sound signal, the electronic device determines the first attribute information of the first sound signal, and may include the following steps: the electronic equipment determines the sound production position of the first sound signal by using the microphone array; the electronic equipment determines the sound content of the first sound signal according to the sound characteristics of the first sound signal; the electronic device determines an utterance time of the first sound signal.
In the second implementation, the sound content of the first sound signal is speech content. For example, referring to fig. 1 and fig. 2, assuming that the first sound signal is sound signal C, when user 400 says "play song ABC" to smart sound box 100, sound signal C is generated and propagates through the air, and the sound content of sound signal C is "play song ABC", the words spoken by user 400.
In a third implementation manner, if the first attribute information includes an utterance position, a sound type, a sound content, and an utterance time of the first sound signal, the electronic device determining the first attribute information of the first sound signal may include the following steps: the electronic equipment determines the sound production position of the first sound signal by using the microphone array; the electronic equipment determines the sound type of the first sound signal according to the sound characteristics of the first sound signal; the electronic equipment determines the sound content of the first sound signal according to the sound characteristics of the first sound signal; the electronic device determines an utterance time of the first sound signal.
Of course, implementations are not limited to the three described above; other types of information may also be added to the first attribute information.
When the contents contained in the first attribute information are different, different types of scenes can be identified.
For the first implementation, the first attribute information includes an utterance position, a sound type, and an utterance time of the first sound signal. The application scenario of the first implementation manner is as follows: the electronic device may identify sound sources that emit only one sound type.
For the second implementation, the first attribute information includes an utterance position, a sound content, and an utterance time of the first sound signal. The application scenario of the second implementation manner is as follows: the electronic device can identify a sound source from which the voice content originates.
For the third implementation, the first attribute information includes the utterance position, sound type, sound content, and utterance time of the first sound signal. The application scenario of the third implementation is as follows: the electronic device can identify not only sound sources that emit only one type of sound but also sound sources that emit speech content.
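The three implementations above can be sketched as a single routine that assembles an attribute-information record. This is a minimal illustration, not the patent's implementation: `estimate_doa`, `sound_type`, and `transcribe` are hypothetical helper names standing in for direction-of-arrival estimation by the microphone array, sound-type classification from sound features, and speech recognition.

```python
import time

def determine_attribute_info(signal, mic_array, classifier, asr=None):
    """Sketch of S102: build the attribute information of a sound signal.

    mic_array.estimate_doa, classifier.sound_type and asr.transcribe are
    hypothetical helpers; none of these names come from the patent.
    """
    info = {
        "utterance_position": mic_array.estimate_doa(signal),  # degrees
        "sound_type": classifier.sound_type(signal),
        "utterance_time": time.time(),
    }
    if asr is not None:
        # third implementation: also record the speech content
        info["sound_content"] = asr.transcribe(signal)
    return info
```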
S103, the electronic equipment judges whether the first attribute information is matched with the attribute information of the fixed sound source in the fixed sound source library.
The fixed sound source library comprises attribute information of one or more fixed sound sources, and the fixed sound sources are sound sources which are located at the same position and emit a known sound type.
For example, please refer to table 1, fig. 1 and fig. 2. Table 1 shows attribute information of a plurality of fixed sound sources in the fixed sound source library; the attribute information in table 1 is data generated by the smart sound box 100 according to history information. The generation process of table 1 is described in detail in the following embodiments.
[Table 1 image (BDA0002816454980000061) not reproduced. Per the example below, the entry numbered 1 records: utterance position 140 degrees; sound type, the sound of the range hood 200; utterance time around 18:30 on 10 April 2020.]
TABLE 1
In the examples of table 1, fig. 1, and fig. 2, assuming that the first sound signal is sound signal A, smart sound box 100 determines that the attribute information of sound signal A includes the utterance position, sound type, and utterance time of sound signal A. The utterance position of sound signal A is 140 degrees, the sound type of sound signal A is the sound of the range hood 200, and the utterance time of sound signal A is 18:30 on 10 April 2020.
After smart sound box 100 obtains the attribute information of sound signal A, smart sound box 100 may determine whether the attribute information of sound signal A matches the attribute information of a fixed sound source in the fixed sound source library of table 1. As can be seen from table 1, the attribute information of sound signal A matches the attribute information of the fixed sound source numbered 1 in table 1, which indicates that sound signal A is a sound signal emitted by a fixed sound source (sound source 1).
S104, when the first attribute information matches the attribute information of a fixed sound source in the fixed sound source library, the electronic device determines that the first sound signal is a sound signal emitted by the fixed sound source.
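The matching in S103 and S104 might be sketched as follows. The angle and time tolerances are illustrative assumptions; the text does not specify concrete matching thresholds.

```python
def matches_fixed_source(attr, fixed_source_library,
                         angle_tol_deg=10.0, time_tol_s=1800):
    """Sketch of S103/S104: compare attribute information against each
    entry in the fixed sound source library. Tolerances are assumptions."""
    for entry in fixed_source_library:
        same_position = abs(attr["utterance_position"]
                            - entry["utterance_position"]) <= angle_tol_deg
        same_type = attr["sound_type"] == entry["sound_type"]
        same_time = abs(attr["utterance_time"]
                        - entry["utterance_time"]) <= time_tol_s
        if same_position and same_type and same_time:
            return True  # the signal was emitted by a fixed sound source
    return False
```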
In the embodiment shown in fig. 3, the electronic device matches the first attribute information of the first sound signal in the first audio stream against a pre-generated fixed sound source library. If the first attribute information matches the attribute information of a fixed sound source in the fixed sound source library, this indicates that the first sound signal is a sound signal emitted by the fixed sound source, so the electronic device can accurately identify the fixed sound sources present in its environment.
Referring to fig. 4 and 5, fig. 4 is a schematic view of another scene provided in an embodiment of the present application, and fig. 5 is a schematic view of the directions of 2 sound sources in fig. 4. The scene diagram shown in fig. 4 illustrates the smart sound box 100, the user 400, and the smart tv 600. The smart sound box 100 shown in fig. 4 is capable of executing the fixed sound source identification method provided in the embodiment of the present application.
In one possible scenario, as shown in fig. 4 and 5, it is assumed that the user 400 has devices such as the smart sound box 100 and the smart tv 600 in the home, and the relative positions of the smart sound box 100 and the smart tv 600 are as shown in fig. 4 and 5. Assuming that user 400 wants to have smart sound box 100 play song ABC, user 400 may issue a voice command to smart sound box 100 to play song ABC, which is sound signal F in fig. 4 and 5.
Assume that the smart tv 600 is in an operating state while user 400 is uttering sound signal F to smart sound box 100. The smart tv 600 may then emit noise toward smart sound box 100; the noise emitted by the smart tv 600 is sound signal E in fig. 4 and fig. 5.
At this time, the smart sound box 100 may determine which received sound signal belongs to the fixed sound source by using the fixed sound source identification method shown in fig. 3. After smart sound box 100 determines that smart television 600 belongs to a fixed sound source, smart sound box 100 may mask received sound signal E and only recognize the voice command corresponding to received sound signal F, thereby playing song ABC for the user.
Specifically, assume that the first audio stream includes sound signal E and sound signal F, where the first sound signal is sound signal E. Smart sound box 100 determines that the attribute information of sound signal E includes the utterance position, sound content, and utterance time of sound signal E. Assume that the utterance position of sound signal E is 230 degrees, the sound content of sound signal E is "welcome to view DEF program", and the utterance time of sound signal E is 18:30.
With reference to table 2, fig. 4 and fig. 5, table 2 shows attribute information of a plurality of fixed sound sources in the fixed sound source library, and the attribute information of the fixed sound sources in table 2 is data generated by the smart sound box 100 by learning in advance based on history information.
[Table 2 image (BDA0002816454980000071) not reproduced. Per the example below, the entry numbered 1 corresponds to the smart tv 600 and records: utterance position 230 degrees; utterance time around 18:30.]
TABLE 2
In the examples of table 2, fig. 4, and fig. 5, after smart sound box 100 obtains the attribute information of sound signal E, smart sound box 100 determines whether the attribute information of sound signal E matches the attribute information of a fixed sound source in the fixed sound source library shown in table 2. As can be seen from table 2, the attribute information of sound signal E matches the attribute information of the fixed sound source numbered 1 in table 2, which indicates that sound signal E is a sound signal emitted by a fixed sound source (sound source 5).
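The masking behavior in this scenario can be sketched as a simple filter: signals attributed to fixed sound sources (such as the smart TV's sound signal E) are discarded, and only the remaining signals (the user's sound signal F) go on to voice-command recognition. `is_fixed_source` is a hypothetical predicate standing in for the attribute-matching step of fig. 3.

```python
def filter_out_fixed_sources(signals, fixed_source_library, is_fixed_source):
    """Keep only sound signals that are not attributed to a fixed sound
    source; is_fixed_source is a hypothetical matching predicate."""
    return [s for s in signals
            if not is_fixed_source(s, fixed_source_library)]
```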
Referring to fig. 6, fig. 6 is a flowchart illustrating another fixed sound source identification method according to an embodiment of the present application. The method shown in fig. 6 refines step S102 of fig. 3; specifically, it refines the step "the electronic device determines the sound type of the first sound signal according to the sound features of the first sound signal". The method shown in fig. 6 includes the following steps S201 to S203.
S201, the electronic device determines whether a sound type corresponding to the sound feature of the first sound signal exists in the sound event library. If yes, step S202 is performed; if not, step S203 is performed.
The sound event library comprises one or more sound types, and the sound types in the sound event library are preset.
For example, please refer to table 3, fig. 1 and fig. 2, where table 3 shows a correspondence table between sound characteristics and sound types in the sound event library.
Sound feature | Sound type
Sound feature X | Sound of the range hood 200
Sound feature Y | Sound of the air conditioner 300
TABLE 3
In the examples of table 3, fig. 1, and fig. 2, assuming that the first sound signal is sound signal A, smart sound box 100 determines sound feature X of sound signal A. Then, smart sound box 100 determines whether a sound type corresponding to sound feature X of sound signal A exists in the sound event library shown in table 3. As can be seen from table 3, the sound type corresponding to sound feature X of sound signal A is the sound of the range hood 200. Smart sound box 100 can therefore determine that the sound type of sound signal A is the sound of the range hood 200.
S202, determining the sound type corresponding to the sound feature of the first sound signal as the sound type of the first sound signal.
S203, the electronic device sends a first network request to the external server, and the electronic device receives a first response request sent by the external server.
The first network request comprises sound characteristics of the first sound signal, and the first response request comprises a sound type corresponding to the sound characteristics of the first sound signal.
For example, referring to fig. 1 and fig. 7, where fig. 7 is a schematic view of another scenario provided in an embodiment of the present application, the smart sound box 100 may be connected to an external server 1000 through the internet. Assume that no sound type corresponding to sound feature X of sound signal A exists in the sound event library of smart sound box 100. Smart sound box 100 may then send a first network request to the server 1000, where the first network request includes sound feature X of sound signal A. After the server 1000 receives the first network request, the server 1000 may query cloud storage and find that the sound type corresponding to sound feature X of sound signal A is the sound of the range hood 200. The server 1000 may then send a first response request to smart sound box 100, where the first response request indicates that the sound type corresponding to sound feature X of sound signal A is the sound of the range hood 200. After smart sound box 100 receives the first response request sent by the server 1000, smart sound box 100 knows that the sound type corresponding to sound feature X of sound signal A is the sound of the range hood 200.
In the embodiment shown in fig. 6, the electronic device can obtain the sound type corresponding to the sound feature of the first sound signal from a sound event library or from an external server.
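A minimal sketch of this lookup-then-server flow, assuming a hypothetical `query_server` callable that stands in for the first network request and first response exchange (the patent does not define the request format):

```python
def resolve_sound_type(feature, sound_event_library, query_server):
    """Sketch of S201-S203: look up the sound type in the local sound event
    library first; on a miss, ask the external server."""
    sound_type = sound_event_library.get(feature)
    if sound_type is None:
        sound_type = query_server(feature)  # S203: first network request
    return sound_type
```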
In the embodiment shown in fig. 6, step S204 may be further included (S204 is not shown in fig. 6), and S204 may replace S203 to form another implementation. S204 may include the following steps: the electronic device determines whether the number of times the sound feature of the first sound signal occurs at a first position is greater than a first threshold, the first position being the utterance position of the first sound signal; if the number of occurrences is greater than the first threshold, the electronic device determines the sound type of the first sound signal to be a known sound type. In addition, after determining that the sound type of the first sound signal is a known sound type, the electronic device may further store the sound feature of the first sound signal and the known sound type in the sound event library.
In S204, the first position is the utterance position of the sound source corresponding to the first sound signal relative to the electronic device, that is, the first position is the utterance position of the first sound signal. The first threshold is a predetermined number of occurrences; for example, the first threshold may be set to 3 in advance. A known sound type refers to a sound whose specific type has not been determined but which is known to come from a fixed sound source.
If the number of times the sound feature of the first sound signal occurs at the first position is greater than the first threshold, this indicates that the sound source corresponding to the first sound signal belongs to a fixed sound source; however, since no sound type corresponding to the sound feature of the first sound signal is stored in the sound event library, the electronic device may determine the sound type of the first sound signal as a known sound type.
For example, the electronic device may determine the sound type of the first sound signal as known sound type A. Although the electronic device cannot determine which specific sound type known sound type A belongs to, the electronic device can determine that known sound type A is a sound that is often received.
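S204 can be sketched as a small occurrence counter. The class name and the rounding of the position are illustrative assumptions; the threshold of 3 follows the example in the text.

```python
from collections import Counter

class RecurrenceTracker:
    """Sketch of S204: count how often a sound feature occurs at an
    utterance position; once the count exceeds the first threshold, the
    feature is treated as a 'known sound type' even though its concrete
    type is undetermined."""

    def __init__(self, first_threshold=3):
        self.first_threshold = first_threshold
        self.counts = Counter()

    def observe(self, sound_feature, position_deg):
        # bucket positions to whole degrees (an assumption of this sketch)
        key = (sound_feature, round(position_deg))
        self.counts[key] += 1
        return self.counts[key] > self.first_threshold
```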
Referring to fig. 8, fig. 8 is a flowchart illustrating a further fixed sound source identification method according to an embodiment of the present application. The method shown in fig. 8 includes the following steps S301 to S304.
S301, the electronic device acquires a second audio stream in a second time period, wherein the second audio stream at least comprises a second sound signal.
Here, the second time period refers to a time period in which the user does not input a voice instruction to the electronic device. When the user is not inputting a voice instruction, the electronic device can collect, in real time, sound signals emitted by fixed sound sources in the surrounding environment. The second audio stream is the audio stream generated by the electronic device collecting, using the microphone array, sound in its environment during the second time period.
For example, please refer to fig. 9 and fig. 10, where fig. 9 is a schematic diagram of another scene provided by an embodiment of the present application, and fig. 10 is a schematic diagram of the direction of the sound source in fig. 9. The scene shown in fig. 9 includes the smart sound box 100 and the water dispenser 500. The smart sound box 100 obtains a second audio stream within a period of time, where the second audio stream includes only sound signal D, and sound signal D is a sound signal emitted by the water dispenser 500.
S302, the electronic device determines second attribute information of the second sound signal.
The execution process of S302 in fig. 8 is the same as that of S102 in fig. 3, and please refer to the detailed description of S102 in fig. 3 for the details of S302 in fig. 8.
S303, the electronic equipment judges whether the second attribute information exists in the fixed sound source library.
The fixed sound source library comprises attribute information corresponding to one or more fixed sound sources, and a fixed sound source is a sound source that remains at the same position and emits a sound of a known sound type.
If the second attribute information exists in the fixed sound source library, it indicates that the attribute information of the sound source corresponding to the second sound signal has been stored in the fixed sound source library. If the second attribute information does not exist in the fixed sound source library, it indicates that the attribute information of the sound source corresponding to the second sound signal is not stored in the fixed sound source library, i.e. the sound source corresponding to the second sound signal is a new fixed sound source for the electronic device.
S304, when the second attribute information does not exist in the fixed sound source library, the electronic device stores the second attribute information in the fixed sound source library.
For example, referring to table 1, fig. 9 and fig. 10, assuming that the second sound signal is sound signal D, the smart sound box 100 determines that the attribute information of sound signal D includes the utterance position, sound type, and utterance time of sound signal D. The utterance position of sound signal D is 180 degrees, the sound type of sound signal D is the sound of the water dispenser 500, and the utterance time of sound signal D is from 18:10 to 18:13 on 10 April 2020.
After the smart sound box 100 acquires the attribute information of sound signal D, the smart sound box 100 may determine whether the attribute information of sound signal D exists in the fixed sound source library. As can be seen from table 1, the attribute information of sound signal D does not exist in the fixed sound source library, so the smart sound box 100 stores the attribute information of sound signal D in the fixed sound source library.
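The check-and-store behavior of S303 and S304 can be sketched as follows, representing the fixed sound source library as a list of attribute-information records (an assumed representation, not the patent's storage format):

```python
def update_fixed_source_library(fixed_source_library, attribute_info):
    """Sketch of S303/S304: store attribute information in the fixed sound
    source library only when it is not already present."""
    if attribute_info not in fixed_source_library:
        fixed_source_library.append(attribute_info)
    return fixed_source_library
```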
As shown in table 4, table 4 shows a state after the attribute information of the sound signal D is stored in the fixed sound source library shown in table 1.
[Table 4 image (BDA0002816454980000101) not reproduced. Table 4 is table 1 with an added entry for the water dispenser 500: utterance position 180 degrees; sound type, the sound of the water dispenser 500; utterance time from 18:10 to 18:13 on 10 April 2020.]
TABLE 4
In the embodiment shown in fig. 8, the electronic device is capable of establishing a fixed sound source library, and may also continuously update the content in the fixed sound source library. By the method shown in fig. 8, a fixed sound source library shown in table 1 or table 4 can be established.
Referring to fig. 11, fig. 11 is a schematic view of an electronic device according to an embodiment of the present disclosure. The electronic device shown in fig. 11 includes the following modules:
an obtaining module 11 is configured to obtain a first audio stream in a first time period, where the first audio stream at least includes a first sound signal.
A processing module 12 for separating a first sound signal in a first audio stream; determining first attribute information of the first sound signal; judging whether the first attribute information is matched with attribute information of a fixed sound source in a fixed sound source library, wherein the fixed sound source library comprises attribute information corresponding to one or more fixed sound sources, and the fixed sound sources are sound sources which are located at the same position and emit a known sound type; and when the first attribute information is matched with the attribute information of the fixed sound source in the fixed sound source library, determining that the first sound signal is the sound signal emitted by the fixed sound source.
For additional functions that can be realized by the obtaining module 11 and the processing module 12, and for more details of realizing the above functions, reference is made to the description of the foregoing method embodiments; details are not repeated here.
The apparatus embodiment depicted in fig. 11 is merely illustrative, and for example, a division of modules is merely a logical division, and an actual implementation may have another division, for example, multiple modules or components may be combined or integrated into another system, or some features may be omitted, or not implemented. The functional modules in the embodiments of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules are integrated into one module.
Referring to fig. 12, fig. 12 is a schematic view of another electronic device according to an embodiment of the present disclosure. The electronic device shown in fig. 12 includes a processor 21 and a memory 22.
In the embodiment shown in fig. 12, the processor 21 is configured to execute instructions stored in the memory 22 to cause the electronic device to perform the following operations: acquiring a first audio stream in a first time period, wherein the first audio stream at least comprises a first sound signal; separating a first sound signal in a first audio stream; determining first attribute information of the first sound signal; judging whether the first attribute information is matched with attribute information of a fixed sound source in a fixed sound source library, wherein the fixed sound source library comprises attribute information corresponding to one or more fixed sound sources, and the fixed sound sources are sound sources which are located at the same position and emit a known sound type; and when the first attribute information is matched with the attribute information of the fixed sound source in the fixed sound source library, determining that the first sound signal is the sound signal emitted by the fixed sound source.
The processor 21 is one or more CPUs. Optionally, the CPU is a single-core CPU or a multi-core CPU.
The memory 22 includes, but is not limited to, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), flash memory, and optical memory. The memory 22 stores the code of the operating system.
Optionally, the electronic device further includes a bus 23; the processor 21 and the memory 22 are connected to each other through the bus 23, or may be connected to each other in other manners.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
It will be apparent to those skilled in the art that various modifications and variations can be made in the present invention without departing from the scope of the invention. Thus, to the extent that such modifications and variations of the present application fall within the scope of the claims, it is intended that the present invention encompass such modifications and variations as well.

Claims (21)

1. A fixed sound source identification method is applied to electronic equipment, and comprises the following steps:
the electronic equipment acquires a first audio stream in a first time period, wherein the first audio stream at least comprises a first sound signal;
the electronic device separating the first sound signal in the first audio stream;
the electronic equipment determines first attribute information of the first sound signal;
the electronic equipment judges whether the first attribute information is matched with attribute information of a fixed sound source in a fixed sound source library, wherein the fixed sound source library comprises attribute information corresponding to one or more fixed sound sources, and the fixed sound sources are sound sources which are located at the same position and emit a known sound type;
and when the first attribute information is matched with the attribute information of the fixed sound source in the fixed sound source library, determining that the first sound signal is the sound signal emitted by the fixed sound source.
2. The fixed sound source identification method of claim 1, wherein the electronic device comprises a microphone array, and wherein the electronic device acquires a first audio stream over a first time period, comprising:
the electronic device generates the first audio stream by collecting sound in an environment in which the electronic device is located during the first time period using the microphone array.
3. The fixed sound source recognition method according to claim 2, wherein the first attribute information includes an utterance position, a sound type, and an utterance time of the first sound signal.
4. The fixed sound source recognition method of claim 3, wherein the electronic device determines first attribute information of the first sound signal, comprising:
the electronic device determining a sound production location of the first sound signal using the microphone array;
the electronic equipment determines the sound type of the first sound signal according to the sound characteristics of the first sound signal;
the electronic device determines an utterance time of the first sound signal.
5. The fixed sound source recognition method according to claim 2, wherein the first attribute information includes an utterance position, a sound content, and an utterance time of the first sound signal.
6. The fixed sound source recognition method of claim 5, wherein the electronic device determines first attribute information of the first sound signal, comprising:
the electronic device determining a sound production location of the first sound signal using the microphone array;
the electronic equipment determines the sound content of the first sound signal according to the sound characteristics of the first sound signal;
the electronic device determines an utterance time of the first sound signal.
7. The fixed sound source recognition method according to claim 2, wherein the first attribute information includes an utterance position, a sound type, a sound content, and an utterance time of the first sound signal.
8. The fixed sound source recognition method of claim 7, wherein the electronic device determines first attribute information of the first sound signal, comprising:
the electronic device determining a sound production location of the first sound signal using the microphone array;
the electronic equipment determines the sound type of the first sound signal according to the sound characteristics of the first sound signal;
the electronic equipment determines the sound content of the first sound signal according to the sound characteristics of the first sound signal;
the electronic device determines an utterance time of the first sound signal.
9. The fixed sound source recognition method of claim 4 or 8, wherein the electronic device determines the sound type of the first sound signal according to the sound feature of the first sound signal, comprising:
the electronic device determining whether a sound type corresponding to a sound feature of the first sound signal exists in a sound event library, the sound event library comprising one or more sound types;
determining a sound type corresponding to the sound feature of the first sound signal as the sound type of the first sound signal when the sound type corresponding to the sound feature of the first sound signal exists in the sound event library;
when the sound type corresponding to the sound feature of the first sound signal does not exist in the sound event library, the electronic equipment sends a first network request to an external server, receives a first response request sent by the external server, wherein the first network request comprises the sound feature of the first sound signal, and the first response request comprises the sound type corresponding to the sound feature of the first sound signal; or,
when the sound type corresponding to the sound feature of the first sound signal does not exist in the sound event library, the electronic equipment determines whether the number of times of occurrence of the sound feature of the first sound signal at a first position is larger than a first threshold value, wherein the first position is a sound production position of the first sound signal, and if the number of times of occurrence of the sound feature of the first sound signal at the first position is larger than the first threshold value, the sound type of the first sound signal is determined to be a known sound type.
10. The fixed sound source recognition method according to any one of claims 1 to 8, further comprising:
the electronic equipment acquires a second audio stream in a second time period, wherein the second audio stream at least comprises a second sound signal;
the electronic device determining second attribute information of the second sound signal;
the electronic equipment judges whether the second attribute information exists in a fixed sound source library, wherein the fixed sound source library comprises attribute information corresponding to one or more fixed sound sources, and the fixed sound sources are sound sources which are located at the same position and emit a known sound type;
and when the second attribute information does not exist in the fixed sound source library, storing the second attribute information into the fixed sound source library.
11. An electronic device comprising a memory and a processor coupled to the memory, the memory configured to store instructions;
the processor is configured to execute the instructions to cause the electronic device to:
acquiring a first audio stream in a first time period, wherein the first audio stream at least comprises a first sound signal; separating the first sound signal in the first audio stream; determining first attribute information of the first sound signal; judging whether the first attribute information is matched with attribute information of a fixed sound source in a fixed sound source library, wherein the fixed sound source library comprises attribute information corresponding to one or more fixed sound sources, and the fixed sound sources are sound sources which are located at the same position and emit a known sound type; and when the first attribute information is matched with the attribute information of the fixed sound source in the fixed sound source library, determining that the first sound signal is the sound signal emitted by the fixed sound source.
12. The electronic device of claim 11, wherein the electronic device comprises a microphone array;
the processor is specifically configured to capture, by the microphone array, sounds in an environment in which the electronic device is located within the first time period to generate the first audio stream.
13. The electronic device according to claim 12, wherein the first attribute information includes an utterance position, a sound type, and an utterance time of the first sound signal.
14. The electronic device of claim 13, wherein:
the processor is specifically configured to determine a sound production location of the first sound signal using the microphone array; determining a sound type of the first sound signal according to the sound characteristics of the first sound signal; determining a sound emission time of the first sound signal.
15. The electronic device according to claim 12, wherein the first attribute information includes an utterance position, a sound content, and an utterance time of the first sound signal.
16. The electronic device of claim 15, wherein:
the processor is specifically configured to determine a sound production location of the first sound signal using the microphone array; determining the sound content of the first sound signal according to the sound characteristics of the first sound signal; determining a sound emission time of the first sound signal.
17. The electronic device according to claim 12, wherein the first attribute information includes an utterance position, a sound type, a sound content, and an utterance time of the first sound signal.
18. The electronic device of claim 17, wherein:
the processor is specifically configured to determine a sound production location of the first sound signal using the microphone array; determining a sound type of the first sound signal according to the sound characteristics of the first sound signal; determining the sound content of the first sound signal according to the sound characteristics of the first sound signal; determining a sound emission time of the first sound signal.
19. The electronic device of claim 14 or 18, wherein:
the processor is specifically configured to determine whether a sound type corresponding to a sound feature of the first sound signal exists in a sound event library, where the sound event library includes one or more sound types; determining a sound type corresponding to the sound feature of the first sound signal as the sound type of the first sound signal when the sound type corresponding to the sound feature of the first sound signal exists in the sound event library; when the sound type corresponding to the sound feature of the first sound signal does not exist in the sound event library, sending a first network request to an external server, and receiving a first response request sent by the external server, wherein the first network request comprises the sound feature of the first sound signal, and the first response request comprises the sound type corresponding to the sound feature of the first sound signal; or, when the sound type corresponding to the sound feature of the first sound signal does not exist in the sound event library, determining whether the number of times of occurrence of the sound feature of the first sound signal at a first position is greater than a first threshold, where the first position is a sound production position of the first sound signal, and if the number of times of occurrence of the sound feature of the first sound signal at the first position is greater than the first threshold, determining that the sound type of the first sound signal is a known sound type.
20. The electronic device of any of claims 11-18, wherein:
the processor is further configured to obtain a second audio stream in a second time period, where the second audio stream includes at least a second sound signal; determining second attribute information of the second sound signal; judging whether the second attribute information exists in a fixed sound source library, wherein the fixed sound source library comprises attribute information corresponding to one or more fixed sound sources, and the fixed sound sources are sound sources which are located at the same position and emit a known sound type; and when the second attribute information does not exist in the fixed sound source library, storing the second attribute information into the fixed sound source library.
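The library-update flow of claim 20 reduces to a membership check followed by a conditional insert. A minimal sketch, assuming the attribute information is a hashable pair (position, sound type) and the fixed sound source library is a set; the real attribute set may be richer:

```python
def update_fixed_source_library(attribute_info, fixed_source_library):
    """Store second attribute information into the fixed sound source
    library only when it is not already present (claim 20).

    attribute_info: hashable, e.g. (position, sound_type) - a fixed sound
        source is a source at the same position emitting a known sound type.
    fixed_source_library: set of attribute-information entries.
    """
    if attribute_info not in fixed_source_library:
        fixed_source_library.add(attribute_info)
        return True   # newly stored
    return False      # already present, nothing to do
```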
21. An electronic device, comprising:
an obtaining module, configured to obtain a first audio stream in a first time period, where the first audio stream at least includes a first sound signal;
a processing module for separating the first sound signal in the first audio stream; determining first attribute information of the first sound signal; judging whether the first attribute information is matched with attribute information of a fixed sound source in a fixed sound source library, wherein the fixed sound source library comprises attribute information corresponding to one or more fixed sound sources, and the fixed sound sources are sound sources which are located at the same position and emit a known sound type; and when the first attribute information is matched with the attribute information of the fixed sound source in the fixed sound source library, determining that the first sound signal is the sound signal emitted by the fixed sound source.
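The matching step of claim 21 can be sketched as a lookup of the first attribute information against the fixed sound source library. The `AttributeInfo` field names below are illustrative, chosen to mirror the attributes listed in the claims, and exact-equality matching is an assumption; the patent does not specify the matching criterion.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AttributeInfo:
    """Attribute information of a sound signal (illustrative fields)."""
    position: tuple      # sound production position, e.g. from a mic array
    sound_type: str      # sound type derived from the sound features

def is_fixed_source(attribute_info, fixed_source_library):
    """Claim-21 matching step: the first sound signal is determined to be
    emitted by a fixed sound source when its attribute information matches
    an entry in the fixed sound source library."""
    return attribute_info in fixed_source_library
```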
CN202011399173.7A 2020-05-14 2020-12-04 Fixed sound source identification method and device Pending CN113674759A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/092948 WO2021228059A1 (en) 2020-05-14 2021-05-11 Fixed sound source recognition method and apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2020104047996 2020-05-14
CN202010404799 2020-05-14

Publications (1)

Publication Number Publication Date
CN113674759A true CN113674759A (en) 2021-11-19

Family

ID=78537936

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011399173.7A Pending CN113674759A (en) 2020-05-14 2020-12-04 Fixed sound source identification method and device

Country Status (1)

Country Link
CN (1) CN113674759A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination