WO2021228059A1

WO2021228059A1 - Fixed sound source recognition method and apparatus

Info

Publication number: WO2021228059A1
Application number: PCT/CN2021/092948
Authority: WO
Inventors: 李晓建; 胡伟湘; 王保辉; 李伟
Original assignee: 华为技术有限公司
Priority date: 2020-05-14
Filing date: 2021-05-11
Publication date: 2021-11-18

Abstract

A fixed sound source recognition method and device. The method comprises: an electronic device obtains a first audio stream within a first time period, the first audio stream comprising at least a first sound signal (S101); the electronic device separates the first sound signal from the first audio stream; the electronic device determines first attribute information of the first sound signal (S102); the electronic device determines whether the first attribute information matches attribute information of a fixed sound source in a fixed sound source library (S103); and if so, the electronic device determines that the first sound signal is a sound signal emitted by the fixed sound source (S104). Because the electronic device matches the first attribute information of the first sound signal against the fixed sound source library, and a match indicates that the first sound signal was emitted by a fixed sound source, the electronic device can accurately recognize a fixed sound source in its environment.

Description

Method and device for identifying fixed sound source

Technical field

This application relates to the field of artificial intelligence, and more specifically, to a method and device for identifying a fixed sound source.

Background

With the advancement of technology, intelligent voice recognition functions are widely used in electronic devices. For example, electronic devices such as smart phones, smart speakers, smart TVs, and smart robots are all equipped with smart voice recognition functions. At present, in the process of using this type of electronic device, the user needs to issue a voice command in a quiet environment, so that the electronic device can perform corresponding operations according to the voice command issued by the user.

If there is a noise source in the environment where the user is located, the electronic device will receive the noise from the noise source while receiving the voice command input by the user, so that the voice command input by the user is disturbed by the noise emitted by the noise source. It is difficult for the electronic device to correctly recognize the true intention corresponding to the voice command input by the user, which leads to a decrease in the accuracy of the electronic device in recognizing the voice.

Therefore, how to identify the noise source in the surrounding environment of the electronic device to prevent the electronic device from being interfered by the environmental noise has become a technical problem that needs to be solved urgently.

Summary of the invention

The embodiments of the present application provide a method and device for identifying a fixed sound source, so as to identify a fixed sound source in an environment around an electronic device.

In the first aspect, an embodiment of the present application provides a method for identifying a fixed sound source. The method is applied to an electronic device and includes: the electronic device acquires a first audio stream in a first time period, the first audio stream including at least a first sound signal; the electronic device separates the first sound signal from the first audio stream; the electronic device determines first attribute information of the first sound signal; the electronic device determines whether the first attribute information matches attribute information of a fixed sound source in a fixed sound source library, where the fixed sound source library includes attribute information corresponding to one or more fixed sound sources, and a fixed sound source is a sound source that is located at the same position and emits a known sound type; and when the first attribute information matches the attribute information of a fixed sound source in the fixed sound source library, the electronic device determines that the first sound signal is a sound signal emitted by the fixed sound source.

In the first aspect, the electronic device can match the first attribute information of the first sound signal in the first audio stream against the fixed sound source library generated in advance. If the first attribute information matches the attribute information of a fixed sound source in the fixed sound source library, the first sound signal is a sound signal emitted by a fixed sound source, so the electronic device can accurately identify the fixed sound source in the environment.

In a possible implementation of the first aspect, the electronic device includes a microphone array, and the electronic device acquiring the first audio stream in the first time period includes: the electronic device uses the microphone array to collect, in the first time period, sound in the environment where the electronic device is located, so as to generate the first audio stream.

In a possible implementation manner of the first aspect, the first attribute information includes a sounding position, a sound type, and a sounding time of the first sound signal.

In a possible implementation of the first aspect, the electronic device determining the first attribute information of the first sound signal includes: the electronic device uses a microphone array to determine the sounding position of the first sound signal; the electronic device determines the sound type of the first sound signal according to the sound feature of the first sound signal; the electronic device determines the sounding time of the first sound signal.

In a possible implementation manner of the first aspect, the first attribute information includes the sounding position, sound content, and sounding time of the first sound signal.

In a possible implementation of the first aspect, the electronic device determining the first attribute information of the first sound signal includes: the electronic device uses a microphone array to determine the sounding position of the first sound signal; the electronic device determines the sound content of the first sound signal according to the sound feature of the first sound signal; the electronic device determines the sounding time of the first sound signal.

In a possible implementation manner of the first aspect, the first attribute information includes the sounding position, sound type, sound content, and sounding time of the first sound signal.

In a possible implementation of the first aspect, the electronic device determining the first attribute information of the first sound signal includes: the electronic device uses a microphone array to determine the sounding position of the first sound signal; the electronic device determines the sound type of the first sound signal according to the sound feature of the first sound signal; the electronic device determines the sound content of the first sound signal according to the sound feature of the first sound signal; the electronic device determines the sounding time of the first sound signal.

In a possible implementation manner of the first aspect, the electronic device determining the sound type of the first sound signal according to the sound feature of the first sound signal includes: the electronic device determines whether a sound type corresponding to the sound feature of the first sound signal exists in a sound event library, where the sound event library includes one or more sound types; when a sound type corresponding to the sound feature of the first sound signal exists in the sound event library, the electronic device determines that sound type as the sound type of the first sound signal; when no sound type corresponding to the sound feature of the first sound signal exists in the sound event library, the electronic device sends a first network request to an external server and receives a first response request sent by the external server, where the first network request includes the sound feature of the first sound signal and the first response request includes the sound type corresponding to the sound feature of the first sound signal; or, when no sound type corresponding to the sound feature of the first sound signal exists in the sound event library, the electronic device determines whether the number of times the sound feature of the first sound signal has appeared at a first position is greater than a first threshold, where the first position is the sounding position of the first sound signal, and if the number of times is greater than the first threshold, the electronic device determines that the sound type of the first sound signal is a known sound type.

In a possible implementation manner of the first aspect, the electronic device can obtain the sound type corresponding to the sound feature of the first sound signal in a sound event library or an external server.
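The lookup order described above can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the function `resolve_sound_type`, the use of a plain string as a stand-in for a sound feature, the `query_server` callback standing in for the first network request, and the choice to increment the occurrence counter on each unresolved observation. The application does not specify any of these details.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical simplification: a sound feature is reduced to a hashable key.
# A real system would compare feature vectors (for example MFCCs) instead.
FeatureKey = str


def resolve_sound_type(
    feature: FeatureKey,
    position_deg: int,
    event_library: Dict[FeatureKey, str],
    occurrences: Dict[Tuple[FeatureKey, int], int],
    first_threshold: int,
    query_server: Optional[Callable[[FeatureKey], Optional[str]]] = None,
) -> Optional[str]:
    """Return a sound type for the feature, or None if it remains unknown."""
    # Case 1: the sound event library already holds a matching sound type.
    if feature in event_library:
        return event_library[feature]

    # Case 2 (first alternative): ask an external server for the sound type.
    if query_server is not None:
        return query_server(feature)

    # Case 3 (second alternative): count how often this feature has been
    # observed at this sounding position (assumed bookkeeping).
    key = (feature, position_deg)
    occurrences[key] = occurrences.get(key, 0) + 1
    if occurrences[key] > first_threshold:
        # Placeholder label: the source only says the type becomes "known".
        return "known:" + feature
    return None
```

In this sketch the two fallbacks are mutually exclusive per call, mirroring the "or" in the text; a device could equally try the server first and fall back to counting when the server has no answer.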

In a possible implementation of the first aspect, the method further includes: the electronic device acquires a second audio stream in a second time period, the second audio stream including at least a second sound signal; the electronic device determines second attribute information of the second sound signal; the electronic device determines whether the second attribute information exists in the fixed sound source library, where the fixed sound source library includes attribute information corresponding to one or more fixed sound sources, and a fixed sound source is a sound source that is located at the same position and emits a known sound type; and when the second attribute information does not exist in the fixed sound source library, the electronic device stores the second attribute information in the fixed sound source library.

Among them, the electronic device can establish a fixed sound source library, and can also continuously update the content in the fixed sound source library.
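As a rough illustration of how the library could be built up and updated, the following Python sketch stores an attribute record only when it is not already present. The record layout (position in degrees, sound type, sounding time as text), the function name `update_library`, and the use of exact equality as the "exists in the library" test are all assumptions; the application leaves these choices open.

```python
from typing import List, Tuple

# Hypothetical record layout:
# (sounding position in degrees, sound type, sounding time as "HH:MM").
Attribute = Tuple[int, str, str]


def update_library(library: List[Attribute], second_attribute: Attribute) -> bool:
    """Store the attribute information if it is not already in the library.

    Returns True when a new entry was stored, False when it already existed.
    """
    if second_attribute in library:  # assumed: exact-equality membership test
        return False
    library.append(second_attribute)
    return True
```

A practical system would probably tolerate small differences in position and time when deciding whether an entry already exists; exact equality is used here only to keep the sketch short.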

In the second aspect, an embodiment of the present application provides an electronic device, including a memory and a processor connected to the memory, where the memory is used for storing instructions and the processor is used for executing the instructions, so that the electronic device performs the following operations: acquire a first audio stream in a first time period, the first audio stream including at least a first sound signal; separate the first sound signal from the first audio stream; determine first attribute information of the first sound signal; determine whether the first attribute information matches attribute information of a fixed sound source in a fixed sound source library, where the fixed sound source library includes attribute information corresponding to one or more fixed sound sources, and a fixed sound source is a sound source that is located at the same position and emits a known sound type; and when the first attribute information matches the attribute information of a fixed sound source in the fixed sound source library, determine that the first sound signal is a sound signal emitted by the fixed sound source.

In a possible implementation of the second aspect, the electronic device includes a microphone array; the processor is specifically configured to use the microphone array to collect sounds in an environment in which the electronic device is located within the first time period to generate a first audio stream.

In a possible implementation manner of the second aspect, the first attribute information includes the sounding position, sound type, and sounding time of the first sound signal.

In a possible implementation manner of the second aspect, the processor is specifically configured to determine the sounding position of the first sound signal by using the microphone array; determine the sound type of the first sound signal according to the sound feature of the first sound signal; and determine the sounding time of the first sound signal.

In a possible implementation manner of the second aspect, the first attribute information includes the sounding position, sound content, and sounding time of the first sound signal.

In a possible implementation manner of the second aspect, the processor is specifically configured to determine the sounding position of the first sound signal by using the microphone array; determine the sound content of the first sound signal according to the sound feature of the first sound signal; and determine the sounding time of the first sound signal.

In a possible implementation manner of the second aspect, the first attribute information includes the sounding position, sound type, sound content, and sounding time of the first sound signal.

In a possible implementation of the second aspect, the processor is specifically configured to determine the sounding position of the first sound signal by using the microphone array; determine the sound type of the first sound signal according to the sound feature of the first sound signal; determine the sound content of the first sound signal according to the sound feature of the first sound signal; and determine the sounding time of the first sound signal.

In a possible implementation of the second aspect, the processor is specifically configured to determine whether a sound type corresponding to the sound feature of the first sound signal exists in a sound event library, where the sound event library includes one or more sound types; when a sound type corresponding to the sound feature of the first sound signal exists in the sound event library, determine that sound type as the sound type of the first sound signal; when no sound type corresponding to the sound feature of the first sound signal exists in the sound event library, send a first network request to an external server and receive a first response request sent by the external server, where the first network request includes the sound feature of the first sound signal and the first response request includes the sound type corresponding to the sound feature of the first sound signal; or, when no sound type corresponding to the sound feature of the first sound signal exists in the sound event library, determine whether the number of times the sound feature of the first sound signal has appeared at a first position is greater than a first threshold, where the first position is the sounding position of the first sound signal, and if the number of times is greater than the first threshold, determine that the sound type of the first sound signal is a known sound type.

In a possible implementation manner of the second aspect, the processor is further configured to acquire a second audio stream in a second time period, where the second audio stream includes at least a second sound signal; determine second attribute information of the second sound signal; determine whether the second attribute information exists in the fixed sound source library, where the fixed sound source library includes attribute information corresponding to one or more fixed sound sources, and a fixed sound source is a sound source that is located at the same position and emits a known sound type; and when the second attribute information does not exist in the fixed sound source library, store the second attribute information in the fixed sound source library.

In a third aspect, an embodiment of the present application provides an electronic device, including: an acquisition module, configured to acquire a first audio stream in a first time period, the first audio stream including at least a first sound signal; and a processing module, configured to separate the first sound signal from the first audio stream; determine first attribute information of the first sound signal; determine whether the first attribute information matches attribute information of a fixed sound source in a fixed sound source library, where the fixed sound source library includes attribute information corresponding to one or more fixed sound sources, and a fixed sound source is a sound source that is located at the same position and emits a known sound type; and when the first attribute information matches the attribute information of a fixed sound source in the fixed sound source library, determine that the first sound signal is a sound signal emitted by the fixed sound source.

Description of the drawings

FIG. 1 shows a schematic diagram of a scenario provided by an embodiment of this application;

Figure 2 shows a schematic diagram of the azimuths of the three sound sources in Figure 1;

FIG. 3 shows a flowchart of a method for identifying a fixed sound source according to an embodiment of this application;

FIG. 4 shows a schematic diagram of another scenario provided by an embodiment of this application;

Figure 5 shows a schematic diagram of the azimuths of the two sound sources in Figure 4;

FIG. 6 shows a flowchart of another method for identifying a fixed sound source provided by an embodiment of this application;

FIG. 7 shows a schematic diagram of another scenario provided by an embodiment of this application;

FIG. 8 shows a flowchart of another method for identifying a fixed sound source according to an embodiment of this application;

FIG. 9 shows a schematic diagram of another scenario provided by an embodiment of this application;

FIG. 10 shows a schematic diagram of the azimuth of one sound source in FIG. 9;

FIG. 11 shows a schematic diagram of an electronic device provided by an embodiment of this application;

FIG. 12 shows a schematic diagram of yet another electronic device provided by an embodiment of this application.

Detailed description

Please refer to FIG. 1 and FIG. 2. FIG. 1 shows a schematic diagram of a scene provided by an embodiment of the present application, and FIG. 2 shows a schematic diagram of the orientation of the three sound sources in FIG. 1. The scene diagram shown in FIG. 1 shows a smart speaker 100, a range hood 200, an air conditioner 300, and a user 400. Among them, the smart speaker 100 shown in FIG. 1 can execute the fixed sound source identification method provided by the embodiment of the present application.

As shown in FIG. 1 and FIG. 2, in a possible scenario, assume that the home of the user 400 contains the smart speaker 100, the range hood 200, and the air conditioner 300, whose relative positions are shown in FIG. 1 and FIG. 2. Assume that the user 400 wants the smart speaker 100 to play the song "ABC". The user 400 then issues a voice command to the smart speaker 100 to play the song "ABC", and this voice command is the sound signal C in FIG. 1 and FIG. 2.

Assume that the range hood 200 is operating while the user 400 sends the sound signal C to the smart speaker 100. The range hood 200 then emits noise toward the smart speaker 100, and this noise is the sound signal A in FIG. 1 and FIG. 2.

Assuming that during the process of the user 400 sending the sound signal C to the smart speaker 100, the air conditioner 300 is in the working state, the air conditioner 300 emits noise to the smart speaker 100, and the noise emitted by the air conditioner 300 is the sound signal B in FIGS. 1 and 2.

At this time, the smart speaker 100 can use the fixed sound source identification method provided in the embodiments of the present application to determine which of the received sound signals come from fixed sound sources. After the smart speaker 100 determines that the sound signal A and the sound signal B come from fixed sound sources, the smart speaker 100 can shield the received sound signal A and sound signal B and recognize only the voice command corresponding to the received sound signal C, thereby playing the song "ABC" for the user.

In the examples shown in FIG. 1 and FIG. 2, the smart speaker 100 can use the fixed sound source identification method provided by the embodiments of the present application to accurately identify the fixed sound sources in the environment and correctly recognize the true intention behind the voice command input by the user 400, which improves the accuracy of the smart speaker 100 in recognizing voice commands.

Please refer to FIG. 3, which shows a flowchart of a method for identifying a fixed sound source according to an embodiment of this application. The fixed sound source recognition method shown in FIG. 3 can be applied to electronic devices, and the electronic devices can be devices with smart voice recognition functions such as smart phones, smart speakers, smart TVs, and smart robots. The method shown in FIG. 3 includes the following steps S101 to S104.

S101. The electronic device acquires a first audio stream in a first time period, where the first audio stream includes at least a first sound signal.

Wherein, the first time period refers to the time period during which the user inputs a voice instruction to the electronic device.

When the electronic device includes a microphone array, the electronic device may use the microphone array to collect sounds in the environment where the electronic device is located in the first time period to generate the first audio stream.

For example, please refer to FIG. 1 and FIG. 2. Assume that the first audio stream includes a sound signal A, a sound signal B, and a sound signal C, where the first sound signal is the sound signal A.

After S101 and before S102, the electronic device also needs to separate the first sound signal from the first audio stream. The specific separation process includes step A1 and step A2:

Step A1. The electronic device performs a preprocessing operation on the first audio stream to obtain the corrected first audio stream.

Among them, the preprocessing operation includes variable centralization processing, whitening processing, principal component analysis dimensionality reduction processing and time filtering processing. The purpose of the preprocessing operation is to reduce the noise in the first audio stream.

Step A2. The electronic device performs independent component analysis (ICA) processing on the corrected first audio stream to obtain the first sound signal.

Wherein, the independent component analysis processing is used to separate the first sound signal from the first audio stream, so that the first sound signal can be processed correspondingly in subsequent steps.
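Two of the preprocessing operations named in step A1, centering and whitening, can be sketched for the simplest case of two microphone channels using only the Python standard library. This is an illustrative sketch under stated assumptions, not the applicant's implementation: the function names are hypothetical, the covariance matrix is assumed to be non-degenerate, and the PCA dimensionality reduction, temporal filtering, and the ICA step itself are omitted.

```python
import math
from typing import List, Sequence, Tuple


def center(x: Sequence[float]) -> List[float]:
    """Subtract the mean so the channel has zero mean."""
    mean = sum(x) / len(x)
    return [v - mean for v in x]


def whiten_two_channels(
    x1: Sequence[float], x2: Sequence[float]
) -> Tuple[List[float], List[float]]:
    """Center two channels and rescale them to unit variance, zero correlation."""
    a, b = center(x1), center(x2)
    n = len(a)
    # Entries of the 2x2 covariance matrix [[c11, c12], [c12, c22]].
    c11 = sum(v * v for v in a) / n
    c22 = sum(v * v for v in b) / n
    c12 = sum(u * v for u, v in zip(a, b)) / n
    # Rotation angle that diagonalises a symmetric 2x2 matrix.
    theta = 0.5 * math.atan2(2.0 * c12, c11 - c22)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    # Variances along the two principal axes (the eigenvalues).
    lam1 = c11 * cos_t ** 2 + 2.0 * c12 * sin_t * cos_t + c22 * sin_t ** 2
    lam2 = c11 * sin_t ** 2 - 2.0 * c12 * sin_t * cos_t + c22 * cos_t ** 2
    # Project onto the principal axes and divide by the standard deviation.
    z1 = [(cos_t * u + sin_t * v) / math.sqrt(lam1) for u, v in zip(a, b)]
    z2 = [(-sin_t * u + cos_t * v) / math.sqrt(lam2) for u, v in zip(a, b)]
    return z1, z2
```

After whitening, the remaining task for ICA is to find a rotation of the whitened channels that makes them statistically independent. In practice a numerical library would typically handle both steps for an arbitrary number of channels.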

S102. The electronic device determines first attribute information of the first sound signal.

Among them, there are multiple implementation manners for determining the first attribute information of the first sound signal by the electronic device, and several specific implementation manners are introduced below.

In the first implementation manner, if the first attribute information includes the sounding position, sound type, and sounding time of the first sound signal, then the electronic device determining the first attribute information of the first sound signal may include the following steps: the electronic device uses a microphone array to determine the sounding position of the first sound signal; the electronic device determines the sound type of the first sound signal according to the sound feature of the first sound signal; the electronic device determines the sounding time of the first sound signal.

In the first implementation manner, the sounding position of the first sound signal refers to the sounding position of the sound source corresponding to the first sound signal relative to the electronic device. For example, please refer to FIG. 1 and FIG. 2, assuming that the first sound signal is sound signal A, then the sounding position of sound signal A is the sounding position of sound source 1 corresponding to sound signal A relative to smart speaker 100. If it is considered that the position where the smart speaker 100 is located is the center point, then the sound source 1 corresponding to the sound signal A has a sound position of 140 degrees relative to the smart speaker 100. Similarly, the sounding position of the sound source 2 corresponding to the sound signal B relative to the smart speaker 100 is 45 degrees, and the sounding position of the sound source 3 corresponding to the sound signal C relative to the smart speaker 100 is 270 degrees.
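The application does not state how the microphone array derives the sounding position. One common approach, shown here purely as an assumed illustration, is to estimate the time difference of arrival between two microphones and convert it to an arrival angle under a far-field model, sin(theta) = c * tau / d. The function names, the brute-force correlation search, and the speed-of-sound constant are choices made for this sketch. A single microphone pair only resolves angles between -90 and 90 degrees, so the full 0 to 360 degree azimuths used in the example would need more microphones.

```python
import math
from typing import Sequence

SPEED_OF_SOUND_M_S = 343.0  # approximate speed of sound in air at room temperature


def estimate_delay_samples(
    ref: Sequence[float], other: Sequence[float], max_lag: int
) -> int:
    """Return the lag (in samples) at which `other` best aligns with `ref`.

    A positive lag means the sound reaches `other` later than `ref`.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, r in enumerate(ref):
            j = i + lag
            if 0 <= j < len(other):
                score += r * other[j]  # cross-correlation at this lag
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def angle_from_delay(delay_s: float, mic_spacing_m: float) -> float:
    """Far-field arrival angle in degrees, measured from the pair's broadside."""
    ratio = SPEED_OF_SOUND_M_S * delay_s / mic_spacing_m
    ratio = max(-1.0, min(1.0, ratio))  # clamp against measurement noise
    return math.degrees(math.asin(ratio))
```

For example, with a hypothetical 16 kHz sampling rate and 0.1 m microphone spacing, a delay of 3 samples corresponds to an arrival angle of roughly 40 degrees from broadside.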

In the first implementation manner, the sound feature of the first sound signal includes but is not limited to Mel frequency cepstrum coefficient (MFCC). The sound type of the first sound signal refers to the sound emitted by the sound source corresponding to the first sound signal. For example, please refer to FIG. 1 and FIG. 2, assuming that the first sound signal is the sound signal A, then the sound type of the sound signal A is the sound of the range hood 200.

In the first implementation manner, the sounding time of the first sound signal is the time when the electronic device receives the first sound signal. For example, please refer to Figures 1 and 2, assuming that the first sound signal is sound signal A, then the sounding time of sound signal A is 18:30 on April 10, 2020.

In the second implementation manner, if the first attribute information includes the sounding position, sound content, and sounding time of the first sound signal, then the electronic device determining the first attribute information of the first sound signal may include the following steps: the electronic device uses a microphone array to determine the sounding position of the first sound signal; the electronic device determines the sound content of the first sound signal according to the sound feature of the first sound signal; the electronic device determines the sounding time of the first sound signal.

In the second implementation manner, the sound content of the first sound signal is voice content. For example, referring to FIG. 1 and FIG. 2, assume that the first sound signal is the sound signal C. When the user 400 says "Play song ABC" to the smart speaker 100, the sound signal C is generated and propagates through the air. The sound content of the sound signal C is the phrase "Play song ABC" spoken by the user 400.

In a third implementation manner, if the first attribute information includes the sounding position, sound type, sound content, and sounding time of the first sound signal, then the electronic device determining the first attribute information of the first sound signal may include the following steps: the electronic device uses a microphone array to determine the sounding position of the first sound signal; the electronic device determines the sound type of the first sound signal according to the sound feature of the first sound signal; the electronic device determines the sound content of the first sound signal according to the sound feature of the first sound signal; the electronic device determines the sounding time of the first sound signal.

Of course, it is not limited to the above-mentioned three implementation manners, and other types of information can also be added to the first attribute information.

When the content contained in the first attribute information is different, different types of scenes can be identified.

For the first implementation manner, the first attribute information includes the sounding position, sound type, and sounding time of the first sound signal. The application scenario of the first implementation manner is: the electronic device can identify a sound source that only emits one type of sound.

For the second implementation manner, the first attribute information includes the sounding position, sound content, and sounding time of the first sound signal. The application scenario of the second implementation manner is: the electronic device can identify the sound source emitting the voice content.

For the third implementation manner, the first attribute information includes the sounding position, sound type, sound content, and sounding time of the first sound signal. The application scenario of the third implementation manner is: the electronic device can identify both a sound source that emits only one type of sound and a sound source that emits voice content.

S103. The electronic device determines whether the first attribute information matches the attribute information of the fixed sound source in the fixed sound source library.

Wherein, the fixed sound source library includes attribute information of one or more fixed sound sources, and the fixed sound sources are sound sources that are located at the same location and emit a known sound type.

For example, refer to Table 1, FIG. 1, and FIG. 2. Table 1 shows the attribute information of multiple fixed sound sources in the fixed sound source library; this attribute information is data that the smart speaker 100 has learned and generated in advance from historical information. The generation process of Table 1 is described in detail in the following embodiments.

Table 1

In the examples in Table 1, FIG. 1 and FIG. 2, assuming that the first sound signal is sound signal A, the smart speaker 100 determines that the attribute information of sound signal A includes the sounding position, sound type, and sounding time of sound signal A. Among them, the sounding position of the sound signal A is 140 degrees, the sound type of the sound signal A is the sound of the range hood 200, and the sounding time of the sound signal A is 18:30 on April 10, 2020.

After the smart speaker 100 obtains the attribute information of the sound signal A, the smart speaker 100 determines whether the attribute information of the sound signal A matches the attribute information of a fixed sound source in the fixed sound source library of Table 1. It can be seen from Table 1 that the attribute information of the sound signal A matches the attribute information of the fixed sound source numbered 1 in Table 1, indicating that the sound signal A, which corresponds to sound source 1, is a sound signal emitted by a fixed sound source.

S104: When the first attribute information matches the attribute information of the fixed sound source in the fixed sound source library, determine that the first sound signal is a sound signal emitted by the fixed sound source.
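Steps S103 and S104 can be sketched as a lookup over library records. The sketch below is only an assumed illustration: the `FixedSource` record layout, the angular tolerance, and the representation of the sounding time as a daily time window are hypothetical, because the application does not define what counts as a match and the contents of Table 1 are not reproduced here.

```python
from dataclasses import dataclass
from datetime import time
from typing import List, Optional


@dataclass(frozen=True)
class FixedSource:
    """One hypothetical library entry for a fixed sound source."""
    number: int
    position_deg: float
    sound_type: str
    start: time  # assumed: start of the usual sounding period
    end: time    # assumed: end of the usual sounding period


def angular_gap(a: float, b: float) -> float:
    """Smallest absolute difference between two azimuths, in degrees."""
    diff = abs(a - b) % 360.0
    return min(diff, 360.0 - diff)


def match_fixed_source(
    position_deg: float,
    sound_type: str,
    sounding_time: time,
    library: List[FixedSource],
    tolerance_deg: float = 10.0,  # assumed tolerance, not from the source
) -> Optional[FixedSource]:
    """Return the first library entry matching all three attributes, else None."""
    for entry in library:
        if (
            angular_gap(position_deg, entry.position_deg) <= tolerance_deg
            and sound_type == entry.sound_type
            and entry.start <= sounding_time <= entry.end
        ):
            return entry
    return None
```

With a made-up library entry at 140 degrees of type "range hood" active between 17:00 and 20:00, a signal observed at 140 degrees, classified as "range hood", and received at 18:30 would match, and the device would treat it as coming from a fixed sound source. A `None` result corresponds to the case where the signal, such as the user's voice command, is not shielded.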

In the embodiment shown in FIG. 3, the electronic device can match the first attribute information of the first sound signal in the first audio stream against the pre-generated fixed sound source library. If the first attribute information matches the attribute information of a fixed sound source in the fixed sound source library, the first sound signal is a sound signal emitted by a fixed sound source, so the electronic device can accurately identify the fixed sound source in the environment.

Please refer to FIG. 4 and FIG. 5. FIG. 4 shows another schematic diagram of a scene provided by an embodiment of the present application, and FIG. 5 shows a schematic diagram of the orientation of the two sound sources in FIG. 4. The scene diagram shown in FIG. 4 shows the smart speaker 100, the user 400, and the smart TV 600. Among them, the smart speaker 100 shown in FIG. 4 can execute the fixed sound source identification method provided by the embodiment of the present application.

As shown in FIG. 4 and FIG. 5, in a possible scenario, assume that the user 400 has devices such as smart speakers 100 and smart TV 600 at home, and the relative positions between smart speakers 100 and smart TV 600 are shown in FIGS. 4 and Shown in Figure 5. Assuming that the user 400 wants the smart speaker 100 to play the song "ABC", the user 400 will send a voice instruction to the smart speaker 100 to play the song "ABC", and the voice instruction is the sound signal F in FIGS. 4 and 5.

Suppose that while the user 400 is sending the sound signal F to the smart speaker 100, the smart TV 600 is in a working state. The smart TV 600 then emits noise toward the smart speaker 100, and this noise is the sound signal E in FIG. 4 and FIG. 5.

At this time, the smart speaker 100 can use the fixed sound source identification method shown in FIG. 3 to determine which of the received sound signals comes from a fixed sound source. After the smart speaker 100 determines that the smart TV 600 is a fixed sound source, it can shield the received sound signal E and recognize only the voice command carried by the received sound signal F, so as to play the song "ABC" for the user.

Specifically, assume that the first audio stream includes a sound signal E and a sound signal F, where the first sound signal is the sound signal E, the second sound signal is the sound signal F, and the sounding positions of the sound signals E and F are 230° and 320°, respectively. Taking the first sound signal as an example, the smart speaker 100 determines that the attribute information of the sound signal E includes the sounding position, sound content, and sounding time of the sound signal E. Assume that the sounding position of the sound signal E is 230 degrees, the sound content of the sound signal E is "Welcome to the DEF program", and the sounding time of the sound signal E is 18:30.

Referring to Table 2, FIG. 4 and FIG. 5, Table 2 shows the attribute information of multiple fixed sound sources in the fixed sound source library. This attribute information is data that was learned and generated in advance.

Table 2

In the examples in Table 2, FIG. 4, and FIG. 5, after the smart speaker 100 obtains the attribute information of the sound signal E, it determines whether that attribute information matches the attribute information of a fixed sound source in the fixed sound source library shown in Table 2. Table 2 shows that the attribute information of the sound signal E matches that of the fixed sound source numbered 1, which indicates that the sound signal E, corresponding to sound source 5, is a sound signal emitted by a fixed sound source.
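The scenario of FIG. 4 and FIG. 5 can be sketched in Python as follows. This is an illustration only. The library entry stands in for row 1 of Table 2, whose exact fields are not reproduced here, and the matching rule (identical sound content and a sounding position within an assumed tolerance) is hypothetical, as are the names `Signal`, `LibraryEntry`, `bearing_gap`, `is_from_fixed_source` and `signals_to_recognize`. The text of the voice command is also a placeholder.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Signal:
    name: str            # label used in the figures, e.g. "E" or "F"
    position_deg: float  # sounding position relative to the smart speaker
    content: str         # recognized sound content


@dataclass(frozen=True)
class LibraryEntry:
    position_deg: float
    content: str


def bearing_gap(a: float, b: float) -> float:
    """Absolute angular difference between two bearings, in degrees (0 to 180)."""
    return abs((a - b + 180.0) % 360.0 - 180.0)


def is_from_fixed_source(
    signal: Signal, library: Sequence[LibraryEntry], tolerance_deg: float = 10.0
) -> bool:
    """Assumed rule: same content and roughly the same sounding position."""
    return any(
        signal.content == entry.content
        and bearing_gap(signal.position_deg, entry.position_deg) <= tolerance_deg
        for entry in library
    )


def signals_to_recognize(
    signals: Sequence[Signal], library: Sequence[LibraryEntry]
) -> List[Signal]:
    """Shield signals attributed to fixed sound sources and keep the rest."""
    return [s for s in signals if not is_from_fixed_source(s, library)]


# Hypothetical stand-in for row 1 of Table 2.
library = [LibraryEntry(230.0, "Welcome to the DEF program")]
signal_e = Signal("E", 230.0, "Welcome to the DEF program")
signal_f = Signal("F", 320.0, 'play the song "ABC"')  # placeholder command text

kept = signals_to_recognize([signal_e, signal_f], library)
print([s.name for s in kept])  # prints ['F']
```

Only the sound signal F is passed on to speech recognition, which corresponds to the smart speaker shielding the sound signal E from the smart TV 600.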

Please refer to FIG. 6, which shows a flowchart of another method for identifying a fixed sound source according to an embodiment of the present application. The method shown in FIG. 6 refines S102 of FIG. 3, specifically the step in which "the electronic device determines the sound type of the first sound signal according to the sound feature of the first sound signal". The method shown in FIG. 6 includes the following steps S201 to S203.

S201: The electronic device determines whether there is a sound type corresponding to the sound feature of the first sound signal in the sound event library. If it exists, execute step S202; if it does not exist, execute step S203.

Among them, the sound event library includes one or more sound types, and the sound types in the sound event library are all preset.

For example, please refer to Table 3, Figure 1 and Figure 2. Table 3 shows the correspondence between sound features and sound types in the sound event library.

Sound feature	Sound type
Sound feature X	The sound of range hood 200
Sound feature Y	The sound of air conditioner 300
……	……

Table 3

In the examples in Table 3, FIG. 1 and FIG. 2, assuming that the first sound signal is the sound signal A, the smart speaker 100 determines the sound feature X of the sound signal A. Then, the smart speaker 100 determines whether the sound event library shown in Table 3 contains a sound type corresponding to the sound feature X. Table 3 shows that the sound type corresponding to the sound feature X is the sound of the range hood 200, so the smart speaker 100 determines that the sound type of the sound signal A is the sound of the range hood 200.

S202: Determine the sound type corresponding to the sound feature of the first sound signal as the sound type of the first sound signal.

S203. The electronic device sends a first network request to the external server, and the electronic device receives the first response request sent by the external server.

The first network request includes the sound feature of the first sound signal, and the first response request includes the sound type corresponding to the sound feature of the first sound signal.

For example, please refer to FIG. 1 and FIG. 7. FIG. 7 shows a schematic diagram of another scenario provided in an embodiment of the present application. The smart speaker 100 can connect to an external server 1000 through the Internet. Assuming that the sound event library of the smart speaker 100 contains no sound type corresponding to the sound feature X of the sound signal A, the smart speaker 100 sends a first network request to the server 1000, and the first network request includes the sound feature X of the sound signal A. After receiving the first network request, the server 1000 queries the cloud storage and finds that the sound type corresponding to the sound feature X of the sound signal A is the sound of the range hood 200. The server 1000 then sends a first response request to the smart speaker 100. The first response request indicates that the sound type corresponding to the sound feature X of the sound signal A is the sound of the range hood 200. After receiving the first response request sent by the server 1000, the smart speaker 100 learns that the sound type corresponding to the sound feature X of the sound signal A is the sound of the range hood 200.

In the embodiment shown in FIG. 6, the electronic device can obtain the sound type corresponding to the sound feature of the first sound signal either from the sound event library or from an external server.
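Steps S201 to S203 amount to a local lookup with a remote fallback, which can be sketched in Python as follows. The sketch assumes that a sound feature can be used as a dictionary key (here a plain string such as "X") and models the first network request and first response request as a callable `query_server`. Both are simplifications introduced for illustration, and the name `resolve_sound_type` is hypothetical.

```python
from typing import Callable, Dict, Optional


def resolve_sound_type(
    feature: str,
    sound_event_library: Dict[str, str],
    query_server: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Return the sound type for a sound feature, or None if it cannot be resolved."""
    # S201: check whether the sound event library has a type for this feature.
    if feature in sound_event_library:
        # S202: use the locally stored sound type.
        return sound_event_library[feature]
    # S203: send the feature to the external server and use the type it returns.
    return query_server(feature)


# Usage with a stand-in for the external server 1000.
local_library = {"Y": "The sound of air conditioner 300"}
cloud_storage = {"X": "The sound of range hood 200"}

print(resolve_sound_type("Y", local_library, cloud_storage.get))
print(resolve_sound_type("X", local_library, cloud_storage.get))
```

Passing the server query in as a callable keeps the lookup logic independent of any particular network protocol, which the text does not specify.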

The embodiment shown in FIG. 6 may further include step S204 (not shown in FIG. 6), and S204 may replace S203 to form another implementation. S204 may include the following steps: the electronic device determines whether the number of times the sound feature of the first sound signal appears at a first position is greater than a first threshold, where the first position is the sounding position of the first sound signal; if the number of times the sound feature of the first sound signal appears at the first position is greater than the first threshold, the electronic device determines that the sound type of the first sound signal is a known sound type. In addition, after determining that the sound type of the first sound signal is a known sound type, the electronic device may also store the sound feature of the first sound signal and the known sound type in the sound event library.

In S204, the first position is the sounding position, relative to the electronic device, of the sound source corresponding to the first sound signal; that is, the first position is the sounding position of the first sound signal. The first threshold is a preset number of times; for example, the first threshold may be set to 3 in advance. A known sound type is a sound type whose specific category cannot be determined but which belongs to a fixed sound source.

If the number of occurrences of the sound feature of the first sound signal at the first position is greater than the first threshold, the sound source corresponding to the first sound signal is a fixed sound source, but the sound type corresponding to the sound feature of the first sound signal is not stored in the sound event library. The electronic device can therefore determine the sound type of the first sound signal as a known sound type.

For example, the electronic device may determine the sound type of the first sound signal as the known sound type A. Although the electronic device is not sure which specific sound type the known sound type A is, it can determine that the known sound type A is a type of sound that is received frequently.
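The counting rule of S204 can be sketched in Python as follows. The sketch assumes that sound features are hashable keys and that sounding positions are quantized to whole degrees so that repeated observations can be counted; neither assumption comes from the text. The class name `OccurrenceTracker` and the label string are hypothetical.

```python
from collections import Counter
from typing import Dict, Optional, Tuple

KNOWN_SOUND_TYPE = "known sound type"  # placeholder label for an unnamed fixed-source type


class OccurrenceTracker:
    """Counts how often a sound feature is heard at a given sounding position."""

    def __init__(self, first_threshold: int = 3) -> None:
        self.first_threshold = first_threshold
        self._counts: Counter = Counter()

    def observe(
        self, feature: str, position_deg: int, sound_event_library: Dict[str, str]
    ) -> Optional[str]:
        """Record one occurrence and return the sound type once it is determined."""
        if feature in sound_event_library:
            return sound_event_library[feature]
        key: Tuple[str, int] = (feature, position_deg)
        self._counts[key] += 1
        # S204: strictly greater than the first threshold, as in the text.
        if self._counts[key] > self.first_threshold:
            # Store the feature with the known sound type for later lookups.
            sound_event_library[feature] = KNOWN_SOUND_TYPE
            return KNOWN_SOUND_TYPE
        return None
```

With a first threshold of 3, the first three observations of a feature at the same position return `None`, and the fourth returns the known sound type and records it in the sound event library.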

Please refer to FIG. 8, which shows a flowchart of another method for identifying a fixed sound source according to an embodiment of the present application. The method shown in FIG. 8 includes the following steps S301 to S304.

S301. The electronic device acquires a second audio stream in a second time period, where the second audio stream includes at least a second sound signal.

Wherein, the second time period refers to a time period during which the user does not input a voice instruction to the electronic device. When the user does not input a voice command to the electronic device, the electronic device will obtain the sound signal from the fixed sound source in the surrounding environment in real time. The second audio stream is an audio stream generated by the electronic device using the microphone array to collect sounds in the environment where the electronic device is located in the second time period when the user does not input a voice command to the electronic device.

For example, please refer to FIG. 9 and FIG. 10. FIG. 9 shows a schematic diagram of another scene provided by an embodiment of this application, and FIG. 10 shows a schematic diagram of the position of one sound source in FIG. 9. The scene schematic diagram shown in FIG. 9 shows the smart speaker 100 and the water dispenser 500. Wherein, the smart speaker 100 obtains the second audio stream within a period of time, the second audio stream only includes the sound signal D, and the sound signal D is the sound signal emitted by the water dispenser 500.

S302. The electronic device determines second attribute information of the second sound signal.

The execution process of S302 in FIG. 8 is the same as that of S102 in FIG. 3. For details of S302 in FIG. 8, please refer to the detailed description of S102 in FIG. 3.

S303. The electronic device judges whether the second attribute information exists in the fixed sound source library.

Wherein, the fixed sound source library includes attribute information corresponding to one or more fixed sound sources, and the fixed sound sources are sound sources that are located at the same location and emit a known sound type.

If the second attribute information exists in the fixed sound source library, the attribute information of the sound source corresponding to the second sound signal has already been stored in the fixed sound source library. If the second attribute information does not exist in the fixed sound source library, the attribute information of the sound source corresponding to the second sound signal has not been stored there; that is, the sound source corresponding to the second sound signal is a new fixed sound source for the electronic device.

S304: When the second attribute information does not exist in the fixed sound source library, store the second attribute information in the fixed sound source library.

For example, please refer to Table 1, FIG. 9 and FIG. 10. Assuming that the second sound signal is the sound signal D, the smart speaker 100 determines that the attribute information of the sound signal D includes the sounding position, the sound type, and the sounding time of the sound signal D. The sounding position of the sound signal D is 180 degrees, the sound type of the sound signal D is the sound of the water dispenser 500, and the sounding time of the sound signal D is from 18:10 to 18:13 on April 10, 2020.

After the smart speaker 100 obtains the attribute information of the sound signal D, the smart speaker 100 determines whether the attribute information of the sound signal D exists in the fixed sound source library. It can be known from Table 1 that the attribute information of the sound signal D does not exist in the fixed sound source library, so the smart speaker 100 stores the attribute information of the sound signal D in the fixed sound source library.

Please refer to Table 4, which is the state after the attribute information of the sound signal D is stored in the fixed sound source library shown in Table 1.

Table 4

In the embodiment shown in FIG. 8, the electronic device can establish a fixed sound source library, and can also continuously update the content in the fixed sound source library. Through the method shown in Figure 8, the fixed sound source library shown in Table 1 or Table 4 can be established.
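The library update of S303 and S304 can be sketched in Python as follows. The text does not define when attribute information "exists" in the library, so this sketch assumes that an entry exists if the library already holds a record with the same sounding position and sound type. The names `SourceRecord` and `update_library` and the time string format are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SourceRecord:
    position_deg: int   # sounding position relative to the electronic device
    sound_type: str     # e.g. "The sound of water dispenser 500"
    sounding_time: str  # free-form time range, kept as text in this sketch


def update_library(library: List[SourceRecord], record: SourceRecord) -> bool:
    """Store the record if it is not yet in the library; return True if it was added."""
    # S303: assumed existence test, same position and same sound type.
    for entry in library:
        if entry.position_deg == record.position_deg and entry.sound_type == record.sound_type:
            return False
    # S304: the source is new to the device, so store its attribute information.
    library.append(record)
    return True


# Usage with the sound signal D from FIG. 9 and FIG. 10.
fixed_library: List[SourceRecord] = []
signal_d = SourceRecord(180, "The sound of water dispenser 500", "2020-04-10 18:10 to 18:13")
print(update_library(fixed_library, signal_d))  # prints True
print(update_library(fixed_library, signal_d))  # prints False
```

Running the update every time the user is not issuing a voice command lets the library grow from empty and stay current, which matches the role FIG. 8 plays in building the library.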

Please refer to FIG. 11, which shows a schematic diagram of an electronic device provided by an embodiment of the present application. The electronic device shown in Figure 11 includes the following modules:

The acquiring module 11 is configured to acquire a first audio stream in a first time period, where the first audio stream includes at least a first sound signal.

The processing module 12 is configured to: separate the first sound signal from the first audio stream; determine the first attribute information of the first sound signal; determine whether the first attribute information matches the attribute information of a fixed sound source in the fixed sound source library, where the fixed sound source library includes attribute information corresponding to one or more fixed sound sources, and a fixed sound source is a sound source that is located in the same position and emits a known sound type; and, when the first attribute information matches the attribute information of the fixed sound source in the fixed sound source library, determine that the first sound signal is a sound signal emitted by the fixed sound source.

For the additional functions that can be implemented by the acquiring module 11 and the processing module 12, and for more details of implementing the above-mentioned functions, please refer to the descriptions in the previous method embodiments, and will not be repeated here.

The device embodiment described in FIG. 11 is only schematic. For example, the division of modules is only a logical function division, and there may be other division methods in actual implementation. For example, multiple modules or components can be combined or integrated into another system, or some features can be ignored or not implemented. The functional modules in the various embodiments of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module.
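One possible logical division into an acquiring module and a processing module can be sketched in Python as follows. This is a structural illustration under stated assumptions, not the device of FIG. 11: audio capture, signal separation and attribute extraction are injected as callables because the text does not specify them, sound signals are represented by plain string labels, and attribute information is reduced to a (position, sound type) tuple. All class and parameter names are hypothetical.

```python
from typing import Callable, List, Sequence, Set, Tuple

Attributes = Tuple[int, str]  # (sounding position in degrees, sound type), simplified


class AcquiringModule:
    """Obtains the first audio stream in the first time period."""

    def __init__(self, capture: Callable[[], Sequence[str]]) -> None:
        self._capture = capture

    def acquire(self) -> Sequence[str]:
        return self._capture()


class ProcessingModule:
    """Separates signals, determines their attributes and matches them to the library."""

    def __init__(
        self,
        separate: Callable[[Sequence[str]], Sequence[str]],
        describe: Callable[[str], Attributes],
        fixed_library: Set[Attributes],
    ) -> None:
        self._separate = separate
        self._describe = describe
        self._fixed_library = fixed_library

    def fixed_signals(self, audio_stream: Sequence[str]) -> List[str]:
        """Return the signals whose attribute information is in the fixed sound source library."""
        fixed = []
        for signal in self._separate(audio_stream):
            if self._describe(signal) in self._fixed_library:
                fixed.append(signal)
        return fixed


# Usage with stand-in callables and hypothetical attribute values.
attributes = {"A": (90, "range hood"), "B": (300, "speech")}
acquirer = AcquiringModule(lambda: ["A", "B"])
processor = ProcessingModule(list, attributes.__getitem__, {(90, "range hood")})
print(processor.fixed_signals(acquirer.acquire()))  # prints ['A']
```

Because the two classes interact only through the audio stream, they could equally be merged into one module or split further, in line with the statement that the division is only logical.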

Please refer to FIG. 12, which shows a schematic diagram of another electronic device provided by an embodiment of the present application. The electronic device shown in FIG. 12 includes a processor 21 and a memory 22.

In the embodiment shown in FIG. 12, the processor 21 is configured to execute instructions stored in the memory 22, so that the electronic device performs the following operations: obtain a first audio stream in a first time period, where the first audio stream includes at least a first sound signal; separate the first sound signal from the first audio stream; determine the first attribute information of the first sound signal; determine whether the first attribute information matches the attribute information of a fixed sound source in a fixed sound source library, where the fixed sound source library includes attribute information corresponding to one or more fixed sound sources, and a fixed sound source is a sound source that is located in the same position and emits a known sound type; and, when the first attribute information matches the attribute information of the fixed sound source in the fixed sound source library, determine that the first sound signal is a sound signal emitted by the fixed sound source.

The processor 21 is one or more CPUs. Optionally, the CPU is a single-core CPU or a multi-core CPU.

The memory 22 includes, but is not limited to, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), or optical memory. The code of the operating system is stored in the memory 22.

Optionally, the electronic device further includes a bus 23, and the above-mentioned processor 21 and the memory 22 are connected to each other through the bus 23, and may also be connected to each other in other ways.

The various embodiments in this specification are described in a progressive manner, and the same or similar parts between the various embodiments can be referred to each other, and each embodiment focuses on the differences from other embodiments. In particular, as for the system embodiment, since it is basically similar to the method embodiment, the description is relatively simple, and for related parts, please refer to the part of the description of the method embodiment.

Obviously, those skilled in the art can make various changes and modifications to the present invention without departing from the scope of the present invention. In this way, if these modifications and variations of the present application fall within the scope of the claims of the present invention, the present invention is also intended to include these modifications and variations.

Claims

A method for identifying a fixed sound source, wherein the method is applied to an electronic device, and the method includes:

Acquiring, by the electronic device, a first audio stream in a first time period, where the first audio stream includes at least a first sound signal;

The electronic device separates the first sound signal from the first audio stream;

Determining, by the electronic device, first attribute information of the first sound signal;

The electronic device determines whether the first attribute information matches the attribute information of a fixed sound source in a fixed sound source library, where the fixed sound source library includes attribute information corresponding to one or more fixed sound sources, and a fixed sound source is a sound source that is located at the same location and emits a known sound type;

When the first attribute information matches the attribute information of the fixed sound source in the fixed sound source library, it is determined that the first sound signal is the sound signal emitted by the fixed sound source.
The method for identifying a fixed sound source according to claim 1, wherein the electronic device comprises a microphone array, and the electronic device acquiring the first audio stream in the first time period comprises:

The electronic device uses the microphone array to collect sounds in the environment where the electronic device is located in the first time period to generate the first audio stream.
The method for identifying a fixed sound source according to claim 2, wherein the first attribute information includes a sounding position, a sound type, and a sounding time of the first sound signal.
The method for identifying a fixed sound source according to claim 3, wherein the electronic device determining the first attribute information of the first sound signal comprises:

The electronic device uses the microphone array to determine the sounding position of the first sound signal;

Determining, by the electronic device, the sound type of the first sound signal according to the sound feature of the first sound signal;

The electronic device determines the sounding time of the first sound signal.
The method for identifying a fixed sound source according to claim 2, wherein the first attribute information includes the sounding position, sound content, and sounding time of the first sound signal.
The method for identifying a fixed sound source according to claim 5, wherein the electronic device determining the first attribute information of the first sound signal comprises:

The electronic device uses the microphone array to determine the sounding position of the first sound signal;

Determining, by the electronic device, the sound content of the first sound signal according to the sound feature of the first sound signal;

The electronic device determines the sounding time of the first sound signal.
The method for identifying a fixed sound source according to claim 2, wherein the first attribute information includes the sounding position, sound type, sound content, and sounding time of the first sound signal.
The method for identifying a fixed sound source according to claim 7, wherein the electronic device determining the first attribute information of the first sound signal comprises:

The electronic device uses the microphone array to determine the sounding position of the first sound signal;

Determining, by the electronic device, the sound type of the first sound signal according to the sound feature of the first sound signal;

Determining, by the electronic device, the sound content of the first sound signal according to the sound feature of the first sound signal;

The electronic device determines the sounding time of the first sound signal.
The method for identifying a fixed sound source according to claim 4 or 8, wherein the electronic device determining the sound type of the first sound signal according to the sound characteristics of the first sound signal comprises:

Determining, by the electronic device, whether there is a sound type corresponding to the sound feature of the first sound signal in a sound event library, the sound event library including one or more sound types;

When there is a sound type corresponding to the sound feature of the first sound signal in the sound event library, determining the sound type corresponding to the sound feature of the first sound signal as the sound type of the first sound signal;

When there is no sound type corresponding to the sound feature of the first sound signal in the sound event library, the electronic device sends a first network request to an external server, and the electronic device receives a first response request sent by the external server, where the first network request includes the sound feature of the first sound signal, and the first response request includes the sound type corresponding to the sound feature of the first sound signal; or,

When there is no sound type corresponding to the sound feature of the first sound signal in the sound event library, the electronic device determines whether the number of times the sound feature of the first sound signal appears at a first position is greater than a first threshold, where the first position is the sounding position of the first sound signal, and if the number of times that the sound feature of the first sound signal appears at the first position is greater than the first threshold, determines that the sound type of the first sound signal is a known sound type.
The method for identifying a fixed sound source according to any one of claims 1 to 8, wherein the method further comprises:

Acquiring, by the electronic device, a second audio stream in a second time period, where the second audio stream includes at least a second sound signal;

Determining the second attribute information of the second sound signal by the electronic device;

The electronic device determines whether the second attribute information exists in a fixed sound source library, where the fixed sound source library includes attribute information corresponding to one or more fixed sound sources, and the fixed sound sources are sound sources that are located at the same location and emit a known sound type;

When the second attribute information does not exist in the fixed sound source library, the second attribute information is stored in the fixed sound source library.
An electronic device, characterized by comprising a memory and a processor connected to the memory, the memory being used for storing instructions;

The processor is configured to execute the instructions, so that the electronic device performs the following operations:

Acquire a first audio stream in a first time period, where the first audio stream includes at least a first sound signal; separate the first sound signal from the first audio stream; determine first attribute information of the first sound signal; determine whether the first attribute information matches the attribute information of a fixed sound source in a fixed sound source library, where the fixed sound source library includes attribute information corresponding to one or more fixed sound sources, and a fixed sound source is a sound source that is located at the same location and emits a known sound type; when the first attribute information matches the attribute information of the fixed sound source in the fixed sound source library, determine that the first sound signal is a sound signal emitted by the fixed sound source.
The electronic device according to claim 11, wherein the electronic device comprises a microphone array;

The processor is specifically configured to use the microphone array to collect sounds in the environment where the electronic device is located in the first time period to generate the first audio stream.
The electronic device according to claim 12, wherein the first attribute information includes a sounding position, a sound type, and a sounding time of the first sound signal.
The electronic device according to claim 13, wherein:

The processor is specifically configured to determine the sounding position of the first sound signal by using the microphone array; determine the sound type of the first sound signal according to the sound feature of the first sound signal; and determine the sounding time of the first sound signal.
The electronic device according to claim 12, wherein the first attribute information comprises a sounding position, sound content, and sounding time of the first sound signal.
The electronic device according to claim 15, wherein:

The processor is specifically configured to determine the sounding position of the first sound signal by using the microphone array; determine the sound content of the first sound signal according to the sound feature of the first sound signal; and determine the sounding time of the first sound signal.
The electronic device according to claim 12, wherein the first attribute information comprises a sounding position, sound type, sound content, and sounding time of the first sound signal.
The electronic device according to claim 17, wherein:

The processor is specifically configured to determine the sounding position of the first sound signal by using the microphone array; determine the sound type of the first sound signal according to the sound feature of the first sound signal; determine the sound content of the first sound signal according to the sound feature of the first sound signal; and determine the sounding time of the first sound signal.
The electronic device according to claim 14 or 18, wherein:

The processor is specifically configured to determine whether there is a sound type corresponding to the sound feature of the first sound signal in a sound event library, where the sound event library includes one or more sound types; when there is a sound type corresponding to the sound feature of the first sound signal in the sound event library, determine the sound type corresponding to the sound feature of the first sound signal as the sound type of the first sound signal; when there is no sound type corresponding to the sound feature of the first sound signal in the sound event library, send a first network request to an external server and receive a first response request sent by the external server, where the first network request includes the sound feature of the first sound signal, and the first response request includes the sound type corresponding to the sound feature of the first sound signal; or, when there is no sound type corresponding to the sound feature of the first sound signal in the sound event library, determine whether the number of times the sound feature of the first sound signal appears at a first position is greater than a first threshold, where the first position is the sounding position of the first sound signal, and if the number of occurrences of the sound feature of the first sound signal at the first position is greater than the first threshold, determine that the sound type of the first sound signal is a known sound type.
The electronic device according to any one of claims 11 to 18, characterized in that:

The processor is further configured to obtain a second audio stream in a second time period, where the second audio stream includes at least a second sound signal; determine second attribute information of the second sound signal; determine whether the second attribute information exists in a fixed sound source library, where the fixed sound source library includes attribute information corresponding to one or more fixed sound sources, and the fixed sound sources are sound sources that are located in the same position and emit a known sound type; and, when the second attribute information does not exist in the fixed sound source library, store the second attribute information in the fixed sound source library.
An electronic device, characterized in that it comprises:

An obtaining module, configured to obtain a first audio stream in a first time period, where the first audio stream includes at least a first sound signal;

The processing module is configured to separate the first sound signal from the first audio stream; determine the first attribute information of the first sound signal; determine whether the first attribute information matches the attribute information of a fixed sound source in the fixed sound source library, where the fixed sound source library includes attribute information corresponding to one or more fixed sound sources, and a fixed sound source is a sound source that is located at the same position and emits a known sound type; and, when the first attribute information matches the attribute information of the fixed sound source in the fixed sound source library, determine that the first sound signal is the sound signal emitted by the fixed sound source.