CN110797045A - Sound processing method, system, electronic device and computer readable medium - Google Patents


Info

Publication number: CN110797045A
Authority: CN (China)
Prior art keywords: sound, sound source, determining, tracked, audio frame
Legal status: Pending (assumed status; not a legal conclusion)
Application number: CN201810868993.2A
Other languages: Chinese (zh)
Inventor: 杨楠
Current assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd (the listed assignees may be inaccurate)
Original assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Filing: application CN201810868993.2A filed by Beijing Jingdong Century Trading Co Ltd and Beijing Jingdong Shangke Information Technology Co Ltd
Publication: CN110797045A

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/21: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being power information
    • G10L25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique

Abstract

The present disclosure provides a sound processing method, comprising: acquiring a plurality of audio frames; determining potential sound sources that generate the sound of a current audio frame; determining, from the potential sound sources, an instantaneous maximum energy sound source that generates the sound of the current audio frame; determining a plurality of tracked sound sources that generate the sound of the current audio frame from the potential sound sources that generate the sound of historical audio frames and the potential sound sources that generate the sound of the current audio frame; determining a plurality of first probabilities that the instantaneous maximum energy sound source matches the plurality of tracked sound sources, respectively; and determining, based on the plurality of first probabilities, a maximum energy sound source that generates the sound of the current audio frame. The present disclosure also provides a sound processing system, an electronic device, and a computer readable medium.

Description

Sound processing method, system, electronic device and computer readable medium
Technical Field
The present disclosure relates to the field of electronic technologies, and in particular, to a sound processing method, system, electronic device, and computer readable medium.
Background
A microphone array is an array of a plurality of microphones arranged in a designed topological structure. The signals it collects contain not only time-domain information but also spatial-domain information, so that sound sources can be localized, tracked, and separated from the collected signals.
In the course of implementing the disclosed concept, the inventors found at least the following problem in the prior art: existing microphone array technology cannot obtain further information about a sound source, so sound sources cannot be localized, tracked, separated, and so on in a targeted manner.
Disclosure of Invention
In view of the above, the present disclosure provides a sound processing method, system, electronic device and a computer readable medium.
One aspect of the present disclosure provides a method of sound processing, including: acquiring a plurality of audio frames; determining potential sound sources that generate the sound of a current audio frame; determining, from the potential sound sources, an instantaneous maximum energy sound source that generates the sound of the current audio frame; determining a plurality of tracked sound sources that generate the sound of the current audio frame based on the potential sound sources that generate the sound of historical audio frames and the potential sound sources that generate the sound of the current audio frame; determining a plurality of first probabilities that the instantaneous maximum energy sound source matches the plurality of tracked sound sources, respectively; and determining, based on the plurality of first probabilities, a maximum energy sound source that generates the sound of the current audio frame.
According to an embodiment of the present disclosure, determining, based on the plurality of first probabilities, the maximum energy sound source that generates the sound of the current audio frame includes: taking the tracked sound source corresponding to the largest of the plurality of first probabilities as a target sound source; determining the plurality of target sound sources corresponding to the historical audio frames and the current audio frame; for each of the plurality of target sound sources, summing the maximum first probabilities of that target sound source over the historical audio frames and the current audio frame and dividing the sum by the number of those frames, the result being the average of the maximum first probabilities corresponding to that target sound source; and determining the target sound source with the largest average as the maximum energy sound source that generates the sound of the current audio frame.
According to an embodiment of the present disclosure, computing this average distinguishes two cases. In a case where the number of the plurality of audio frames is less than a first preset number, for each of the plurality of target sound sources, the maximum first probabilities of that target sound source over all of the plurality of audio frames are summed and the sum is divided by the number of those frames. In a case where the number of the plurality of audio frames is not less than the first preset number, for each of the plurality of target sound sources, the maximum first probabilities of that target sound source over a first preset number of consecutive audio frames are summed and the sum is divided by the first preset number, where the current audio frame is the end frame of the first preset number of consecutive audio frames. In both cases the result is the average of the maximum first probabilities corresponding to the target sound source.
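The two cases above reduce to averaging over the most recent min(number of frames, first preset number) frames. A minimal sketch, with hypothetical helper names and an arbitrary window length:

```python
# Sketch of the windowed average described above. The helper names and the
# value of first_preset_number are illustrative assumptions; the patent does
# not fix a concrete window length.

def average_max_first_probability(max_first_probs, first_preset_number):
    """Average a target source's per-frame maximum first probabilities.

    max_first_probs: one value per audio frame (history + current), in order.
    """
    if len(max_first_probs) < first_preset_number:
        window = max_first_probs                         # all frames so far
    else:
        # the current audio frame is the end frame of the window
        window = max_first_probs[-first_preset_number:]
    return sum(window) / len(window)

def pick_max_energy_source(per_source_probs, first_preset_number):
    """The target source with the largest average is the maximum energy source."""
    return max(per_source_probs,
               key=lambda j: average_max_first_probability(
                   per_source_probs[j], first_preset_number))

probs = {0: [0.9, 0.8, 0.7, 0.95], 1: [0.2, 0.9, 0.9, 0.9]}
# Source 1 wins over the 3-frame window even though source 0 peaks higher.
print(pick_max_energy_source(probs, first_preset_number=3))
```

Averaging over a window rather than taking a single frame's best match smooths out transient peaks, which matches the stated aim of tracking the persistently dominant source.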
According to an embodiment of the present disclosure, the method further includes outputting sound information corresponding to the maximum energy sound source.
According to an embodiment of the present disclosure, determining the plurality of tracked sound sources that generate the sound of the current audio frame from the potential sound sources of the historical audio frames and those of the current audio frame includes: determining a tracked sound source in an observation period, where the observation period refers to the audio frames from the first occurrence of a potential sound source until that potential sound source is confirmed as a tracked sound source; determining a second probability that the tracked sound source in the observation period exists in the sound generating the plurality of audio frames; and confirming the tracked sound source in the observation period as a tracked sound source if the second probability is greater than a first threshold.
According to an embodiment of the present disclosure, determining the plurality of tracked sound sources that generate the sound of the current audio frame from the potential sound sources of the historical audio frames and those of the current audio frame further includes: deleting a tracked sound source of the plurality of tracked sound sources if, in a second preset number of consecutive audio frames, a third probability that the tracked sound source matches the potential sound sources is smaller than a second threshold.
Another aspect of the disclosure provides a sound processing system comprising: an acquisition module for acquiring a plurality of audio frames; a first determining module for determining potential sound sources producing the sound of the current audio frame; a second determining module for determining, from the potential sound sources, an instantaneous maximum energy sound source producing the sound of the current audio frame; a third determining module for determining a plurality of tracked sound sources generating the sound of the current audio frame based on the potential sound sources generating the sound of the historical audio frames and those generating the sound of the current audio frame; a fourth determining module for determining a plurality of first probabilities that the instantaneous maximum energy sound source matches the plurality of tracked sound sources, respectively; and a fifth determining module for determining, based on the plurality of first probabilities, the maximum energy sound source that produces the sound of the current audio frame.
According to an embodiment of the present disclosure, the fifth determining module includes: a first determining submodule for taking the tracked sound source corresponding to the largest of the plurality of first probabilities as a target sound source; a second determining submodule for determining the plurality of target sound sources corresponding to the historical audio frames and the current audio frame; a third determining submodule for, for each target sound source of the plurality of target sound sources, summing the maximum first probabilities of that target sound source over the historical audio frames and the current audio frame and dividing the sum by the number of those frames, the result being the average of the maximum first probabilities corresponding to that target sound source; and a fourth determining submodule for determining the target sound source corresponding to the largest average as the maximum energy sound source generating the sound of the current audio frame.
According to an embodiment of the present disclosure, the third determining submodule includes: a first determining subunit for, in a case where the number of the plurality of audio frames is less than a first preset number, summing for each of the plurality of target sound sources the maximum first probabilities of that target sound source over the plurality of audio frames and dividing the sum by the number of those frames, the result being the average of the maximum first probabilities corresponding to that target sound source; and a second determining subunit for, in a case where the number of the plurality of audio frames is not less than the first preset number, summing for each of the plurality of target sound sources the maximum first probabilities of that target sound source over a first preset number of consecutive audio frames and dividing the sum by the first preset number, the result being the average of the maximum first probabilities corresponding to that target sound source, wherein the current audio frame is the end frame of the first preset number of consecutive audio frames.
According to an embodiment of the present disclosure, the system further includes an output module for outputting sound information corresponding to the maximum energy sound source.
According to an embodiment of the present disclosure, the third determining module includes: a fifth determining submodule for determining a tracked sound source in an observation period, where the observation period refers to the audio frames from the first occurrence of a potential sound source until that potential sound source is confirmed as a tracked sound source; a sixth determining submodule for determining a second probability that the tracked sound source in the observation period exists in the sound generating the plurality of audio frames; and a seventh determining submodule for confirming the tracked sound source in the observation period as a tracked sound source if the second probability is greater than a first threshold.
According to an embodiment of the present disclosure, the third determining module includes a deleting submodule configured to delete a tracked sound source of the plurality of tracked sound sources if, in a second preset number of consecutive audio frames, a third probability that the tracked sound source matches the potential sound sources is smaller than a second threshold.
Another aspect of the disclosure provides an electronic device comprising one or more processors, storage means for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method of any of the above.
Another aspect of the disclosure provides a non-volatile storage medium storing computer-executable instructions for implementing the method as described above when executed.
Another aspect of the disclosure provides a computer program comprising computer executable instructions for implementing the method as described above when executed.
According to the embodiments of the disclosure, the problem that existing sound processing techniques such as sound source localization, tracking, and separation are not targeted can be at least partially solved, thereby achieving the technical effect of processing sound in a targeted manner so that the information in the sound can be acquired in a targeted manner.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent from the following description of embodiments of the present disclosure with reference to the accompanying drawings, in which:
fig. 1 schematically shows an application scenario of a sound processing method according to an embodiment of the present disclosure;
FIG. 2 schematically shows a flow chart of a sound processing method according to an embodiment of the present disclosure;
FIG. 3A schematically illustrates a flow chart for determining a plurality of tracked sound sources that produce sound for a current audio frame based on potential sound sources for a historical audio frame and potential sound sources that produce sound for the current audio frame, according to an embodiment of the disclosure;
FIG. 3B schematically shows potential sound sources generating the Kth audio frame and the (K+1)th audio frame according to an embodiment of the disclosure;
FIG. 3C schematically shows tracked sound sources for the Kth audio frame and the (K+1)th audio frame according to an embodiment of the disclosure;
FIG. 3D schematically illustrates a matching relationship of potential sound sources to tracked sound sources, in accordance with an embodiment of the present disclosure;
FIG. 3E schematically shows tracked sound sources for the Kth audio frame and the (K+1)th audio frame according to another embodiment of the present disclosure;
FIG. 4 schematically illustrates a flow chart for determining a maximum energy sound source that produces sound of the current audio frame based on the plurality of first probabilities, according to an embodiment of the present disclosure;
FIG. 5 schematically shows a flow chart of a sound processing method according to another embodiment of the present disclosure;
FIG. 6 schematically shows a block diagram of a sound processing system according to an embodiment of the present disclosure;
FIG. 7 schematically illustrates a block diagram of a fifth determination module according to an embodiment of the disclosure;
FIG. 8 schematically illustrates a block diagram of a third determination submodule according to an embodiment of the present disclosure;
FIG. 9 schematically shows a block diagram of a sound processing system according to another embodiment of the present disclosure;
FIG. 10 schematically illustrates a block diagram of a third determination module according to an embodiment of the present disclosure; and
FIG. 11 schematically shows a block diagram of an electronic device according to an embodiment of the disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is illustrative only and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It is noted that the terms used herein should be interpreted as having a meaning that is consistent with the context of this specification and should not be interpreted in an idealized or overly formal sense.
Where a convention analogous to "at least one of A, B and C, etc." is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B and C" would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together, etc.). Where a convention analogous to "at least one of A, B or C, etc." is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B or C" would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together, etc.). It will be further understood by those within the art that virtually any disjunctive word and/or phrase presenting two or more alternative terms, whether in the description, claims, or drawings, should be understood to contemplate the possibilities of including one of the terms, either of the terms, or both terms. For example, the phrase "A or B" should be understood to include the possibility of "A", "B", or "A and B".
An embodiment of the present disclosure provides a method of sound processing, including acquiring a plurality of audio frames, determining potential sound sources generating the sound of a current audio frame, determining from the potential sound sources an instantaneous maximum energy sound source generating the sound of the current audio frame, determining a plurality of tracked sound sources generating the sound of the current audio frame according to the potential sound sources generating the sound of historical audio frames and the potential sound sources generating the sound of the current audio frame, determining a plurality of first probabilities that the instantaneous maximum energy sound source matches the plurality of tracked sound sources, respectively, and determining a maximum energy sound source generating the sound of the current audio frame based on the plurality of first probabilities.
Fig. 1 schematically shows an application scenario of a sound processing method according to an embodiment of the present disclosure. It should be noted that fig. 1 is only an example of a scenario in which the embodiments of the present disclosure may be applied to help those skilled in the art understand the technical content of the present disclosure, but does not mean that the embodiments of the present disclosure may not be applied to other devices, systems, environments or scenarios.
As shown in fig. 1, an electronic device 110 is included in the application scenario, and a microphone array 111 is disposed on the electronic device 110. The electronic device 110 may, for example, identify sound information received by the microphone array 111 and may perform a corresponding function according to the identified sound information.
It is noted that the microphone array 111 is only a schematic representation and the present disclosure does not limit the shape, number, etc. of the microphone array. In addition, the microphone array 111 may be integrated with the electronic device 110, or may be a device structurally independent from the electronic device 110.
As shown in fig. 1, the user 120 sends voice information to the electronic device 110, and the microphone array 111 receives the voice of the user 120 and also receives sounds in the surrounding environment, such as the sound of the keyboard 130 being struck, the sound of the fan 140, and the like.
Using microphone array techniques, the sound sources generating the sound information can be localized, tracked, separated, and so on, but existing microphone array techniques cannot obtain further information about a sound source and therefore cannot localize, track, or separate sound sources in a targeted manner. For example, in the scenario shown in fig. 1, the microphone array 111 may separate the sound sources that produce sound, but cannot identify which sound source is the user 120 and which is the fan or the keyboard. Since the microphone array 111 cannot identify which sound source is the user 120, it cannot acquire the sound information emitted by the user 120 in a targeted manner, making it difficult for the electronic device 110 to accurately recognize the sound information of the user 120.
The present disclosure provides a sound processing method that can be applied to the microphone array 111 so that the microphone array 111 can identify the sound emitted by the user 120, allowing the electronic device 110 to accurately capture the user's request.
Fig. 2 schematically shows a flow chart of a sound processing method according to an embodiment of the present disclosure.
As shown in fig. 2, the method includes operations S210 to S260.
In operation S210, a plurality of audio frames are acquired.
In operation S220, potential sound sources generating sounds of the current audio frame are determined.
In operation S230, an instantaneous maximum energy sound source generating the sound of the current audio frame is determined from the potential sound sources.
In operation S240, a plurality of tracked sound sources generating sounds of the current audio frame are determined according to potential sound sources generating sounds of the history audio frame and the potential sound sources generating sounds of the current audio frame.
In operation S250, a plurality of first probabilities that the instantaneous maximum energy sound source matches the plurality of tracked sound sources, respectively, are determined.
In operation S260, a maximum energy sound source generating a sound of the current audio frame is determined based on the plurality of first probabilities.
The method can determine the maximum energy sound source in the current audio frame, so that the microphone array can be used for positioning, tracking, separating and the like of the maximum energy sound source in a targeted manner. For example, in the scenario shown in fig. 1, if the microphone array 111 applies the method, the microphone array 111 can determine that the largest energy sound source in the current audio is the user 120, so that the sound generated by the user 120 can be tracked, separated, and the like in a targeted manner.
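For a concrete sense of how operations S210 to S260 fit together, the following sketch walks a single audio frame through the pipeline. Everything here is illustrative: `find_potential_sources`, `match_probability`, and the per-direction energy model are toy stand-ins invented for this example, not the patent's beamformer or probability formulas.

```python
# Toy walkthrough of operations S220-S260 for one audio frame.
# All helper names and models are hypothetical stand-ins.

def find_potential_sources(frame):
    """S220: pick the 4 directions with the largest energy (toy beamformer)."""
    ranked = sorted(range(len(frame)), key=lambda d: frame[d], reverse=True)
    return ranked[:4]  # q = 0 is the instantaneous maximum energy direction

def match_probability(direction_a, direction_b, n_directions):
    """S250: toy closeness score standing in for the first probability."""
    diff = abs(direction_a - direction_b)
    diff = min(diff, n_directions - diff)  # wrap around the circle
    return 1.0 / (1.0 + diff)

def process_frame(frame, tracked_sources):
    """One pass of S220-S260.

    frame: per-direction beamformer energies (list of floats)
    tracked_sources: {source_id: direction}, maintained across frames (S240)
    """
    potential = find_potential_sources(frame)            # S220
    instantaneous_max = potential[0]                     # S230
    first_probs = {j: match_probability(instantaneous_max, d, len(frame))
                   for j, d in tracked_sources.items()}  # S250
    # S260: here simply the best-matching tracked source; the embodiment
    # averages these probabilities over recent frames before deciding.
    return max(first_probs, key=first_probs.get)

# 8 steering directions, energy peak at direction 2; tracked sources at 2 and 6.
energies = [0.1, 0.3, 2.0, 0.4, 0.1, 0.2, 0.9, 0.1]
print(process_frame(energies, {0: 2, 1: 6}))  # source 0 sits at the peak
```

The essential point the sketch preserves is that the maximum energy source is chosen among the *tracked* sources by matching them against the instantaneous peak, rather than taken directly from the raw beamformer output.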
According to an embodiment of the present disclosure, in operation S210, for example, a microphone array may acquire a plurality of audio frames.
According to an embodiment of the present disclosure, in operation S220, the potential sound sources generating the sound of the current audio frame are determined by, for example, steering a beamformer, calculating the probability that a sound source exists in each possible direction, and taking the directions with the highest probabilities as the directions of the potential sound sources. For example, the sound sources corresponding to the 4 largest probabilities are taken as the potential sound sources q = 0, 1, 2, 3, whose corresponding probabilities are given by equation (1).

[Equation (1) is rendered as an image in the original document and is not reproduced here.]

where q denotes a potential sound source, v = E0/ET, E0 is the instantaneous maximum energy determined by the beamformer, and ET is a value related to the number of microphones, the frame size, and the like. In this embodiment, q = 0 is the instantaneous maximum energy sound source.
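The selection of potential sources and the energy ratio v can be sketched as follows. The per-direction energies, the helper name `select_potential_sources`, and the value of ET are assumptions made for illustration; equation (1) itself appears only as an image in the original and is not reproduced.

```python
# Selecting potential sources from steered-beamformer output (a sketch;
# the actual probability formula of equation (1) is not reproduced here).

def select_potential_sources(direction_energies, E_T, n_sources=4):
    """Return the n_sources highest-energy directions as potential sources.

    direction_energies: beamformer output energy per steering direction
    E_T: normalisation constant related to microphone count and frame size
    """
    ranked = sorted(range(len(direction_energies)),
                    key=lambda d: direction_energies[d], reverse=True)
    potential = ranked[:n_sources]         # q = 0, 1, 2, 3
    E0 = direction_energies[potential[0]]  # instantaneous maximum energy
    v = E0 / E_T                           # energy ratio used in equation (1)
    return potential, v

dirs = [0.2, 1.8, 0.5, 0.9, 0.1, 1.1]
potential, v = select_potential_sources(dirs, E_T=2.0)
print(potential, v)  # q = 0 corresponds to direction 1, the energy peak
```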
According to an embodiment of the present disclosure, in operation S230, the instantaneous maximum energy sound source may be found, for example, while the potential sound sources are being determined with the beamformer.
According to an embodiment of the present disclosure, in operation S240, suppose for example that the microphone array has acquired 10 audio frames in total: the historical audio frames may be frames 1 through 9, in chronological order of acquisition, and the current audio frame is frame 10. According to an embodiment of the present disclosure, determining the plurality of tracked sound sources generating the sound of the current audio frame based on the potential sound sources of the historical audio frames and those of the current audio frame may follow the method shown in fig. 3A.
FIG. 3A schematically illustrates a flow chart for determining a plurality of tracked sound sources that produce sound for a current audio frame based on potential sound sources for a historical audio frame and potential sound sources that produce sound for the current audio frame, according to an embodiment of the disclosure.
As shown in fig. 3A, the method includes operations S241 to S243.
In operation S241, a tracked sound source in an observation period is determined, where the observation period refers to the audio frames from the first occurrence of a potential sound source until that potential sound source is confirmed as a tracked sound source.
In operation S242, a second probability that the tracked sound source in the observation period exists in the sound generating the plurality of audio frames is determined.
In operation S243, in a case where the second probability is greater than a first threshold, the tracked sound source in the observation period is confirmed as a tracked sound source.
The method adjusts the set of tracked sound sources in real time, which improves both its responsiveness and its accuracy.
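A rough sketch of the promotion decision in operations S241 to S243. The patent derives the second probability P(Ej|Ot) from equations (2) to (4); the simple Bayesian-style update below, with made-up likelihood values, merely stands in for that computation to show the thresholding logic.

```python
# Sketch of S241-S243: promoting a candidate after its observation period.
# The likelihood values and the update rule are illustrative assumptions,
# not the patent's equations (2)-(4).

def update_candidate(candidate, matched_this_frame,
                     p_match_if_real=0.8, p_match_if_spurious=0.3):
    """Accumulate evidence that a source observed during the observation
    period really exists (all parameter values are made up)."""
    l_real = p_match_if_real if matched_this_frame else 1 - p_match_if_real
    l_spur = p_match_if_spurious if matched_this_frame else 1 - p_match_if_spurious
    p = candidate["p_exists"]
    candidate["p_exists"] = p * l_real / (p * l_real + (1 - p) * l_spur)
    return candidate

def promote_if_confident(candidate, first_threshold=0.98):
    """S243: confirm the candidate as a tracked source once the second
    probability exceeds the first threshold (0.98 is the example value
    used later in the description)."""
    return candidate["p_exists"] > first_threshold

candidate = {"p_exists": 0.5}
for _ in range(5):                      # matched in five consecutive frames
    update_candidate(candidate, True)
print(promote_if_confident(candidate))  # enough evidence to promote
```

The observation period thus acts as a probationary window: a direction must match consistently across several frames before it is trusted as a tracked source.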
A schematic diagram of an implementation of determining a plurality of tracked sound sources generating sound of a current audio frame according to potential sound sources of a historical audio frame and potential sound sources generating sound of the current audio frame according to an embodiment of the present disclosure is described below with reference to fig. 3B to 3E.
Fig. 3B schematically shows the potential sound sources generating the Kth audio frame and the (K+1)th audio frame according to an embodiment of the disclosure.
Fig. 3C schematically shows the tracked sound sources corresponding to the Kth audio frame and the (K+1)th audio frame according to an embodiment of the present disclosure.
As shown in fig. 3B and 3C, the schematic diagrams include the potential sound sources 310 generating the Kth audio frame and the corresponding tracked sound sources 311, the potential sound sources 320 generating the (K+1)th audio frame and the corresponding tracked sound sources 321, and the potential sound sources 330 generating the current audio frame and the corresponding tracked sound sources 331.
The potential sound sources 310 include the potential sound sources q = 0 and q = 1, and the corresponding tracked sound sources 311 include the tracked sound sources j = 0 and j = 1. The next audio frame after the Kth audio frame is the (K+1)th audio frame, and the potential sound source q = 2 appears for the first time among the potential sound sources 320 generating the (K+1)th audio frame. The potential sound source q = 2 is therefore a tracked sound source in the observation period starting from the (K+1)th audio frame. As shown in fig. 3C, the tracked sound sources 321 include the tracked sound source j = 0, the tracked sound source j = 1, and the tracked sound source j = 2, the last being in the observation period.
Referring back to fig. 3A, according to the embodiment of the present disclosure, in operation S242, a second probability that the tracked sound source in the observation period is present in the sound generating the plurality of audio frames may be calculated by the methods of formulas (2) to (4).
[Formulas (2) to (4) are rendered as images in the source and are not reproduced here; consistent with the definitions below, the matching probability may be written as Pqj^t = Σf P(f | O^t) · δj,f(q).]
Wherein Ej represents the event that the tracked sound source j exists, O^t represents the potential sound sources determined up to the current audio frame, Pj^t represents the probability that the tracked sound source j matches some potential sound source in the current audio frame, Pqj^t represents the probability that the tracked sound source j corresponds to the potential sound source q, P(f | O^t) represents the probability that the matching relation function f occurs given the observed data O^t, δj,f(q) is the Kronecker delta function, and P0 represents the probability that a sound source is present but not observed, which may be an empirical value, for example 0.2.
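As an illustration of the matching probability above, the following sketch sums the weights P(f | O^t) of the candidate matching relation functions f that map potential source q to tracked source j, i.e. applies the Kronecker delta term by term. The function name, the candidate matchings, and the weights are assumptions for illustration, not values from the patent.

```python
# Hypothetical sketch: computing the matching probability
#   P_qj = sum over f of P(f | O^t) * delta(j, f(q))
# where each candidate matching f is a dict mapping potential-source index q
# to tracked-source index j, and f(q) = -1 marks a first-appearing potential
# source with no tracked counterpart (cf. fig. 3D).

def matching_probability(q, j, candidates):
    """Sum the weights of the candidate matchings f for which f(q) == j."""
    return sum(p_f for f, p_f in candidates if f.get(q) == j)

candidates = [
    ({0: 1, 1: 0, 2: -1}, 0.7),  # f as in fig. 3D, assumed weight P(f|O^t) = 0.7
    ({0: 0, 1: 1, 2: -1}, 0.3),  # an alternative matching, assumed weight 0.3
]

print(matching_probability(0, 1, candidates))  # 0.7: only the first f maps q=0 to j=1
print(matching_probability(1, 0, candidates))  # 0.7: only the first f maps q=1 to j=0
```

Both candidates map q = 2 to -1, so the probability that q = 2 is first-appearing sums to 1 across them.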
FIG. 3D is a schematic diagram that schematically illustrates a matching relationship between potential sound sources and tracked sound sources, in accordance with an embodiment of the present disclosure.
As shown in fig. 3D, the potential sound source q = 0 matches the tracked sound source j = 1, the potential sound source q = 1 matches the tracked sound source j = 0, and the potential sound source q = 2 appears for the first time. In the scenario shown in fig. 3D, the matching relation function f may be, for example, f({0, 1, 2}) = {1, 0, -1}.
Referring back to fig. 3A, according to an embodiment of the present disclosure, at operation S243, assuming for example that the first threshold is 0.98, if P(Ej | O^t) is greater than 0.98, the tracked sound source j in the observation period is determined to be a tracked sound source. For example, in figs. 3B and 3C, if, up to the current audio frame, the P(Ej | O^t) corresponding to the potential sound source q = 2 in the observation period is greater than 0.98, the potential sound source q = 2 is confirmed as a tracked sound source.
According to an embodiment of the present disclosure, determining the plurality of tracked sound sources generating the sound of the current audio frame according to the potential sound sources generating the sound of the historical audio frames and the potential sound sources generating the sound of the current audio frame includes: deleting a tracked sound source of the plurality of tracked sound sources if, in a second preset number of consecutive audio frames, the third probability that the tracked sound source matches a potential sound source is smaller than a second threshold. For example, in the scenario shown in figs. 3B and 3C, if, up to the current audio frame, the third probability that the tracked sound source j = 1 matches a potential sound source has been smaller than the second threshold for the second preset number of consecutive audio frames, the tracked sound source j = 1 is deleted.
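The promotion rule of operation S243 and the deletion rule above can be sketched together as bookkeeping over the tracked-source set. The structure, the second threshold, and the preset count M below are assumptions for illustration; only the 0.98 first threshold comes from the text.

```python
# Illustrative sketch (thresholds and data layout assumed): maintaining the
# tracked-source set. A source in its observation period is promoted once its
# second probability P(E_j|O^t) exceeds the first threshold; a tracked source
# is deleted once its match probability stays below the second threshold for
# M consecutive frames.

FIRST_THRESHOLD = 0.98   # example value from the text
SECOND_THRESHOLD = 0.5   # assumed second threshold
M = 3                    # assumed second preset number of consecutive frames

def update_tracked(tracked, observed, match_probs, low_counts):
    """tracked: set of source ids; observed: dict id -> P(E_j|O^t) for sources
    in their observation period; match_probs: dict id -> third probability for
    the current frame; low_counts: dict id -> consecutive low-probability frames."""
    for j, p in observed.items():
        if p > FIRST_THRESHOLD:
            tracked.add(j)                     # promote, e.g. q = 2 in fig. 3C
    for j in list(tracked):
        if match_probs.get(j, 0.0) < SECOND_THRESHOLD:
            low_counts[j] = low_counts.get(j, 0) + 1
            if low_counts[j] >= M:
                tracked.discard(j)             # delete, e.g. j = 1 in fig. 3E
        else:
            low_counts[j] = 0                  # reset on a good match
    return tracked
```

Promotion is immediate once the threshold is crossed, while deletion requires M consecutive weak frames, mirroring the asymmetry described above.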
Fig. 3E schematically illustrates a tracked sound source corresponding to the K-th audio frame and a tracked sound source corresponding to the K + 1-th audio frame according to another embodiment of the present disclosure.
As shown in fig. 3E, the tracked sound sources are updated to the tracked sound sources 332, and the tracked sound sources 332 include the tracked sound source j = 0 and the tracked sound source j = 2.
This method deletes in time the potential sound sources that cannot be the maximum energy sound source, reducing the amount of calculation.
Referring back to fig. 2, in operation S250, for example, the instantaneous maximum energy sound source is the potential sound source q = 0, and the tracked sound sources include j = 0, 1, 2. According to the disclosed embodiments, the first probabilities P0j that the instantaneous maximum energy sound source matches each tracked sound source j = 0, 1, 2 can be calculated separately; the calculation method is shown in formula (4).
In operation S260, a maximum energy sound source generating the sound of the current audio frame is determined based on the plurality of first probabilities. For example, in the scenario described in operation S250, if P00 is the largest of the first probabilities, the tracked sound source j = 0 is determined to be the maximum energy sound source.
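A minimal sketch of the selection in operation S260 follows; the numeric probabilities are assumed placeholders, not values from the patent.

```python
# Minimal sketch of operation S260 (values assumed): pick the tracked source
# with the largest first probability of matching the instantaneous maximum
# energy sound source q = 0.

first_probs = {0: 0.8, 1: 0.15, 2: 0.05}  # assumed P_0j per formula (4)
j_max = max(first_probs, key=first_probs.get)
print(j_max)  # 0: tracked sound source j = 0 is the maximum energy sound source
```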
Fig. 4 schematically shows a flow chart for determining a maximum energy sound source generating sound of the current audio frame based on the plurality of first probabilities according to an embodiment of the present disclosure.
As shown in fig. 4, the method includes operations S261 to S264.
In operation S261, a tracked sound source corresponding to a largest first probability among the plurality of first probabilities is taken as a target sound source.
In operation S262, the plurality of target sound sources corresponding to the history audio frame and the current audio frame are determined.
In operation S263, for each target sound source of the plurality of target sound sources, the maximum first probabilities of the target sound source in the sounds generating the historical audio frames and the current audio frame are summed, and the sum is divided by the number of the historical audio frames and the current audio frame; the result is the average of the maximum first probabilities corresponding to the target sound source.
In operation S264, the target sound source corresponding to the largest average value is determined as the maximum energy sound source generating the sound of the current audio frame.
This method ensures the accuracy of the determined maximum energy sound source and avoids frequent switching of the maximum energy sound source. For example, in the scenario shown in fig. 1, if there is a pause in the speech of the user 120 and the keyboard 130 happens to be tapped during the pause, the maximum energy sound source would otherwise be switched to the keyboard 130, causing inaccurate identification of the maximum energy sound source.
According to an embodiment of the present disclosure, in operation S261, for example, in the scenario described in operation S260, the tracked sound source j = 0 corresponding to the largest first probability P00 is taken as the target sound source.
According to the embodiment of the present disclosure, in operation S262, for example, a tracked sound source corresponding to the maximum first probability determined for each audio frame obtained by the microphone array may be used as the target sound source, and a tracked sound source corresponding to the maximum first probability determined for a part of the history audio frames obtained by the microphone array and the current audio frame may also be used as the target sound source.
According to an embodiment of the present disclosure, in operation S263, for example, the microphone array acquires 5 audio frames, which correspond to the target sound sources 0, 1, and 2. The correspondence is shown in table 1. In the scenario shown in table 1, according to the embodiment of the present disclosure, the average value of the maximum first probabilities corresponding to the target sound source 0 is (P1 + P2 + P4)/5, the average value corresponding to the target sound source 1 is P3/5, and the average value corresponding to the target sound source 2 is P5/5.
TABLE 1

Audio frame | Target sound source | Maximum first probability
1 | 0 | P1
2 | 0 | P2
3 | 1 | P3
4 | 0 | P4
5 | 2 | P5
According to an embodiment of the present disclosure, in operation S264, for example, in the scenario described in operation S263, if the average value (P1 + P2 + P4)/5 of the maximum first probabilities corresponding to the target sound source 0 is the largest, the target sound source 0 is determined to be the maximum energy sound source generating the sound of the current audio frame.
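Operations S261 to S264 under the Table 1 scenario can be sketched as follows; the numeric values standing in for P1 to P5 are assumed placeholders, not values from the patent.

```python
# Sketch of operations S261-S264 under the Table 1 scenario: sum each target
# source's maximum first probabilities, divide by the total frame count, and
# keep the source with the largest average.

frames = [  # (target sound source, maximum first probability) per audio frame
    (0, 0.90),  # P1
    (0, 0.80),  # P2
    (1, 0.70),  # P3
    (0, 0.85),  # P4
    (2, 0.60),  # P5
]

def max_energy_source(frames):
    sums = {}
    for src, p in frames:                     # S263: sum per target source
        sums[src] = sums.get(src, 0.0) + p
    n = len(frames)                           # divide by total frame count
    averages = {src: s / n for src, s in sums.items()}
    return max(averages, key=averages.get)    # S264: largest average wins

print(max_energy_source(frames))  # 0, since (0.90 + 0.80 + 0.85)/5 is the largest average
```

Note that a source absent from a frame simply contributes nothing to its sum, while the divisor stays the total frame count, exactly as in the (P1 + P2 + P4)/5 example.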
According to an embodiment of the present disclosure, summing, for each of the plurality of target sound sources, the maximum first probabilities of the target sound source in the sounds generating the historical audio frames and the current audio frame, dividing the sum by the number of the historical audio frames and the current audio frame, and taking the result as the average of the maximum first probabilities corresponding to the target sound source, includes two cases. In a case where the number of the plurality of audio frames is less than a first preset number, for each of the plurality of target sound sources, the maximum first probabilities of the target sound source in the sounds generating the plurality of audio frames are summed, and the sum is divided by the number of the plurality of audio frames; the result is the average of the maximum first probabilities corresponding to the target sound source. In a case where the number of the plurality of audio frames is not less than the first preset number, for each of the plurality of target sound sources, the maximum first probabilities of the target sound source in the sounds generating a first preset number of consecutive audio frames are summed, and the sum is divided by the first preset number; the result is the average of the maximum first probabilities corresponding to the target sound source, wherein the current audio frame is the end frame of the first preset number of consecutive audio frames. For example, according to an embodiment of the present disclosure, the average value of the maximum first probabilities corresponding to the target sound source may be calculated according to formula (5).
[Formula (5) is rendered as an image in the source; consistent with the definitions below, it may be written as Pi = Pin / fn when fn < N, and Pi = Pin / N when fn ≥ N.]
Wherein Pi represents the average value of the maximum first probabilities corresponding to the target sound source i, fn represents the number of audio frames acquired by the microphone array up to the current audio frame, N represents the first preset number, and Pin represents the sum of the maximum first probabilities of the i-th tracked sound source over the n frames being averaged. For example, if the number fn of audio frames acquired up to the current audio frame is 5 and the first preset number N is 10, then fn < N, and the average value of the maximum first probabilities corresponding to each target sound source may be calculated similarly to the method described in operation S263. For another example, if the number fn of audio frames acquired is 15 and the first preset number N is 10, then fn > N, and the average is calculated over the most recent N frames: taking the 6th audio frame as the frame corresponding to n = 0, the average values of the maximum first probabilities corresponding to the target sound sources are calculated over the 6th to 15th audio frames.
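The piecewise average just described can be sketched as a sliding-window computation; the probability values below are assumed placeholders.

```python
# Hedged sketch of the piecewise average of formula (5): with fewer than N
# frames acquired, average over all fn frames; otherwise average over the
# last N consecutive frames, with the current frame as the end frame.

def windowed_average(max_probs, N):
    """max_probs: one target source's maximum first probability per frame
    (0.0 for frames where it was not the target source), oldest first."""
    fn = len(max_probs)
    window = max_probs if fn < N else max_probs[-N:]
    return sum(window) / min(fn, N)

print(windowed_average([0.9, 0.8, 0.0, 0.85, 0.0], 10))  # fn=5 < N: (0.9+0.8+0.85)/5 ≈ 0.51
print(windowed_average([0.5] * 15, 10))                  # fn=15 >= N: last 10 frames, 0.5
```

The first call reproduces the fn < N case of the 5-frame example; the second discards frames 1 to 5 and averages frames 6 to 15, matching the fn > N case.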
Fig. 5 schematically shows a flow chart of a sound processing method according to another embodiment of the present disclosure.
As shown in fig. 5, the method further includes operation S510 based on the foregoing embodiment.
In operation S510, sound information corresponding to the maximum energy sound source is output. For example, in the scenario shown in fig. 1, upon determining that the maximum energy sound source is the user 120, the microphone array outputs to the electronic device 111 the sound information emitted by the user 120, without outputting the sound information emitted by the keyboard 130 and the fan 140.
This method enables the microphone array to output only the sound information corresponding to the maximum energy sound source, makes the output sound information targeted, and improves the utilization rate of the sound information output by the microphone array.
Fig. 6 schematically shows a block diagram of a sound processing system 600 according to an embodiment of the present disclosure.
As shown in fig. 6, the sound processing system 600 includes an acquisition module 610, a first determination module 620, a second determination module 630, a third determination module 640, a fourth determination module 650, and a fifth determination module 660.
The obtaining module 610, for example, performs operation S210 described above with reference to fig. 2, for obtaining a plurality of audio frames.
The first determining module 620, for example, performs operation S220 described above with reference to fig. 2, for determining potential sound sources that produce sound of the current audio frame.
The second determining module 630, for example performing operation S230 described above with reference to fig. 2, is used for determining, from the potential sound sources, the instantaneous maximum energy sound source that produces the sound of the current audio frame.
The third determining module 640, for example, performs operation S240 described above with reference to fig. 2, for determining a plurality of tracked sound sources generating sound of the current audio frame according to the potential sound sources generating sound of the historical audio frame and the potential sound sources generating sound of the current audio frame.
A fourth determining module 650, for example performing operation S250 described above with reference to fig. 2, is configured to determine a plurality of first probabilities that the instantaneous maximum energy sound source matches the plurality of tracked sound sources, respectively.
A fifth determining module 660, e.g., performing operation S260 described above with reference to fig. 2, is configured to determine, based on the plurality of first probabilities, a most energetic sound source that generates sound of the current audio frame.
Fig. 7 schematically illustrates a block diagram of a fifth determination module 660 according to an embodiment of the disclosure.
As shown in fig. 7, the fifth determination module 660 includes a first determination sub-module 661, a second determination sub-module 662, a third determination sub-module 663, and a fourth determination sub-module 664.
The first determining sub-module 661, for example, performs the operation S261 described above with reference to fig. 4, for regarding the tracked sound source corresponding to the largest first probability among the plurality of first probabilities as the target sound source.
The second determining submodule 662, for example, performs operation S262 described above with reference to fig. 4, for determining the plurality of target sound sources corresponding to the historical audio frame and the current audio frame.
The third determining sub-module 663, for example, performs operation S263 described above with reference to fig. 4, for each of the plurality of target sound sources, to sum the maximum first probabilities of the target sound source in the sounds that generated the history audio frame and the current audio frame, and the sum is divided by the number of the history audio frame and the current audio frame, and the result is taken as an average of the maximum first probabilities corresponding to the target sound source.
The fourth determining sub-module 664, for example performing operation S264 described above with reference to fig. 4, is configured to determine the target sound source corresponding to the largest average value as the maximum energy sound source generating the sound of the current audio frame.
Fig. 8 schematically illustrates a block diagram of the third determination sub-module 663 according to an embodiment of the present disclosure.
As shown in fig. 8, the third determination sub-module 663 includes the first determination sub-unit 810 and the second determination sub-unit 820.
A first determining subunit 810, configured to, for each of the plurality of target sound sources, sum a maximum first probability of the target sound source in the sound that generates the plurality of audio frames if the number of the plurality of audio frames is less than a first preset number, and divide the sum result by the number of the plurality of audio frames, where the result is an average of the maximum first probabilities corresponding to the target sound source.
A second determining subunit 820, configured to, for each target sound source in the plurality of target sound sources, sum a maximum first probability of the target sound source in the sound that produces a first preset number of consecutive audio frames, and divide the sum by the first preset number, where a current audio frame is an end frame of the first preset number of consecutive audio frames, and a result of the sum is an average of the maximum first probabilities corresponding to the target sound source, where the number of the plurality of audio frames is not less than the first preset number.
Fig. 9 schematically illustrates a block diagram of a sound processing system 900 according to another embodiment of the present disclosure.
As shown in fig. 9, the sound processing system 900 further includes an output module 910 based on the foregoing embodiments.
The output module 910, for example, performs operation S510 described above with reference to fig. 5, for outputting sound information corresponding to the maximum energy sound source.
Fig. 10 schematically illustrates a block diagram of the third determination module 640 according to an embodiment of the present disclosure.
As shown in fig. 10, the third determination module 640 includes a fifth determination sub-module 641, a sixth determination sub-module 642, and a seventh determination sub-module 643.
The fifth determining submodule 641 performs, for example, operation S241 described above with reference to fig. 3A, to determine a tracked sound source in an observation period, where the observation period refers to a plurality of audio frames from the first occurrence of a potential sound source to before the potential sound source is determined to be the tracked sound source.
A sixth determining submodule 642, for example performing operation S242 described above with reference to fig. 3A, is configured to determine a second probability that the tracked sound source during the observation period is present in the sound generating the plurality of audio frames.
The seventh determining sub-module 643, for example, executes the operation S243 described above with reference to fig. 3A, and is configured to determine that the tracked sound source in the observation period is the tracked sound source if the second probability is greater than the first threshold value.
According to an embodiment of the present disclosure, the third determining module 640 includes a deleting submodule, configured to delete a tracked sound source of the plurality of tracked sound sources if a third probability that the tracked sound source matches the potential sound source is smaller than a second threshold value in a second preset number of consecutive audio frames.
Any number of modules, sub-modules, units, sub-units, or at least part of the functionality of any number thereof according to embodiments of the present disclosure may be implemented in one module. Any one or more of the modules, sub-modules, units, and sub-units according to the embodiments of the present disclosure may be implemented by being split into a plurality of modules. Any one or more of the modules, sub-modules, units, sub-units according to embodiments of the present disclosure may be implemented at least in part as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or may be implemented in any other reasonable manner of hardware or firmware by integrating or packaging a circuit, or in any one of or a suitable combination of software, hardware, and firmware implementations. Alternatively, one or more of the modules, sub-modules, units, sub-units according to embodiments of the disclosure may be at least partially implemented as a computer program module, which when executed may perform the corresponding functions.
For example, any number of the obtaining module 610, the first determining module 620, the second determining module 630, the third determining module 640, the fourth determining module 650, the fifth determining module 660, the first determining sub-module 661, the second determining sub-module 662, the third determining sub-module 663, the fourth determining sub-module 664, the first determining sub-unit 810, the second determining sub-unit 820, the output module 910, the fifth determining sub-module 641, the sixth determining sub-module 642, the seventh determining sub-module 643, and the deleting module may be combined to be implemented in one module, or any one of them may be split into a plurality of modules. Alternatively, at least part of the functionality of one or more of these modules may be combined with at least part of the functionality of the other modules and implemented in one module. According to an embodiment of the present disclosure, at least one of the obtaining module 610, the first determining module 620, the second determining module 630, the third determining module 640, the fourth determining module 650, the fifth determining module 660, the first determining sub-module 661, the second determining sub-module 662, the third determining sub-module 663, the fourth determining sub-module 664, the first determining sub-unit 810, the second determining sub-unit 820, the output module 910, the fifth determining sub-module 641, the sixth determining sub-module 642, the seventh determining sub-module 643, and the deleting module may be implemented at least partially as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or any other reasonable way of integrating or packaging a circuit, or in any one of three implementations, software, hardware and firmware, or in any suitable 
combination of any of them. Alternatively, at least one of the obtaining module 610, the first determining module 620, the second determining module 630, the third determining module 640, the fourth determining module 650, the fifth determining module 660, the first determining sub-module 661, the second determining sub-module 662, the third determining sub-module 663, the fourth determining sub-module 664, the first determining sub-unit 810, the second determining sub-unit 820, the output module 910, the fifth determining sub-module 641, the sixth determining sub-module 642, the seventh determining sub-module 643 and the deleting module may be at least partially implemented as a computer program module, which may perform corresponding functions when being executed.
Fig. 11 schematically shows a block diagram of an electronic device adapted to implement the above described method according to an embodiment of the present disclosure. The electronic device shown in fig. 11 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 11, an electronic device 1100 according to an embodiment of the present disclosure includes a processor 1101, which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)1102 or a program loaded from a storage section 1108 into a Random Access Memory (RAM) 1103. The processor 1101 may comprise, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or associated chipset, and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), among others. The processor 1101 may also include on-board memory for caching purposes. The processor 1101 may comprise a single processing unit or a plurality of processing units for performing the different actions of the method flows according to the embodiments of the present disclosure.
In the RAM 1103, various programs and data necessary for the operation of the system 1100 are stored. The processor 1101, the ROM1102, and the RAM 1103 are connected to each other by a bus 1104. The processor 1101 performs various operations of the method flow according to the embodiments of the present disclosure by executing programs in the ROM1102 and/or the RAM 1103. It is noted that the programs may also be stored in one or more memories other than the ROM1102 and RAM 1103. The processor 1101 may also perform various operations of the method flows according to the embodiments of the present disclosure by executing programs stored in the one or more memories.
System 1100 may also include an input/output (I/O) interface 1105, which is also connected to bus 1104, according to an embodiment of the present disclosure. The system 1100 may also include one or more of the following components connected to the I/O interface 1105: an input portion 1106 including a keyboard, a mouse, and the like; an output portion 1107 including a signal output unit such as a cathode ray tube (CRT), a liquid crystal display (LCD), and a speaker; a storage section 1108 including a hard disk and the like; and a communication section 1109 including a network interface card such as a LAN card or a modem. The communication section 1109 performs communication processing via a network such as the internet. A drive 1110 is also connected to the I/O interface 1105 as needed. A removable medium 1111, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 1110 as necessary, so that a computer program read out therefrom is installed in the storage section 1108 as necessary.
According to embodiments of the present disclosure, method flows according to embodiments of the present disclosure may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 1109, and/or installed from the removable medium 1111. The computer program, when executed by the processor 1101, performs the above-described functions defined in the system of the embodiment of the present disclosure. The systems, devices, apparatuses, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the present disclosure.
The present disclosure also provides a computer-readable medium, which may be embodied in the apparatus/device/system described in the above embodiments; or may exist separately and not be assembled into the device/apparatus/system. The computer readable medium carries one or more programs which, when executed, implement the method according to an embodiment of the disclosure.
According to embodiments of the present disclosure, a computer readable medium may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer-readable signal medium may include a propagated data signal with computer-readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, optical fiber cable, radio frequency signals, etc., or any suitable combination of the foregoing.
For example, according to an embodiment of the present disclosure, a computer-readable medium may include the ROM1102 and/or the RAM 1103 and/or one or more memories other than the ROM1102 and the RAM 1103 described above.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Those skilled in the art will appreciate that various combinations and/or combinations of features recited in the various embodiments and/or claims of the present disclosure can be made, even if such combinations or combinations are not expressly recited in the present disclosure. In particular, various combinations and/or combinations of the features recited in the various embodiments and/or claims of the present disclosure may be made without departing from the spirit or teaching of the present disclosure. All such combinations and/or associations are within the scope of the present disclosure.
The embodiments of the present disclosure have been described above. However, these examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments are described separately above, this does not mean that the measures in the embodiments cannot be used in advantageous combination. The scope of the disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be devised by those skilled in the art without departing from the scope of the present disclosure, and such alternatives and modifications are intended to be within the scope of the present disclosure.

Claims (14)

1. A method of sound processing, comprising:
acquiring a plurality of audio frames;
determining potential sound sources that produce sound for the current audio frame;
determining an instantaneous maximum energy sound source from the potential sound sources that produces the sound of the current audio frame;
determining a plurality of tracked sound sources that produce the sound of the current audio frame based on the potential sound sources that produce the sounds of historical audio frames and the potential sound sources that produce the sound of the current audio frame;
determining a plurality of first probabilities that the instantaneous maximum energy sound source matches the plurality of tracked sound sources, respectively; and
determining, based on the plurality of first probabilities, a maximum energy sound source that produces the sound of the current audio frame.
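The per-frame pipeline of claim 1 can be sketched in Python as follows. The Gaussian-over-distance matching model and the dictionary layout (`pos`, `energy`) are illustrative assumptions; the claim does not fix a matching formula, only that a first probability is computed between the instantaneous maximum-energy source and each tracked source:

```python
import math

def first_probabilities(instant_max_source, tracked_sources, sigma=1.0):
    """For each tracked source, compute a matching (first) probability against
    the instantaneous maximum-energy source. The Gaussian-over-position-distance
    model here is an assumption for illustration only."""
    probs = []
    for src in tracked_sources:
        d2 = sum((a - b) ** 2 for a, b in zip(instant_max_source["pos"], src["pos"]))
        probs.append(math.exp(-d2 / (2.0 * sigma ** 2)))
    return probs

def process_frame(potential_sources, tracked_sources):
    """One frame of the claimed method: pick the instantaneous maximum-energy
    source among the potential sources, then match it against the tracked
    sources via the first probabilities; the tracked source with the largest
    first probability is the simplest reading of the final determining step."""
    instant_max = max(potential_sources, key=lambda s: s["energy"])
    probs = first_probabilities(instant_max, tracked_sources)
    best = tracked_sources[probs.index(max(probs))]
    return instant_max, probs, best
```

Claims 2 and 3 refine the final step by averaging first probabilities over frames rather than using a single frame.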
2. The method of claim 1, wherein determining, based on the plurality of first probabilities, the maximum energy sound source that produces the sound of the current audio frame comprises:
taking the tracked sound source corresponding to the largest first probability among the plurality of first probabilities as a target sound source;
determining a plurality of target sound sources corresponding to the historical audio frames and the current audio frame;
for each of the plurality of target sound sources, summing the maximum first probabilities of that target sound source over the sounds that produce the historical audio frames and the current audio frame, and dividing the result of the summing by the number of the historical audio frames and the current audio frame, the result being an average of the maximum first probabilities corresponding to that target sound source; and
determining the target sound source corresponding to the largest average value as the maximum energy sound source that produces the sound of the current audio frame.
3. The method of claim 2, wherein the summing, for each of the plurality of target sound sources, of the maximum first probabilities of the target sound source, and the dividing by the number of the historical audio frames and the current audio frame to obtain the average of the maximum first probabilities corresponding to the target sound source, comprises:
in the case that the number of the plurality of audio frames is less than a first preset number, for each of the plurality of target sound sources, summing the maximum first probabilities of the target sound source over the sounds that produce the plurality of audio frames, and dividing the sum by the number of the plurality of audio frames, the result being an average of the maximum first probabilities corresponding to the target sound source; and
in the case that the number of the plurality of audio frames is not less than the first preset number, for each of the plurality of target sound sources, summing the maximum first probabilities of the target sound source over the sounds that produce a first preset number of consecutive audio frames, and dividing the sum by the first preset number, the result being an average of the maximum first probabilities corresponding to the target sound source, wherein the current audio frame is the end frame of the first preset number of consecutive audio frames.
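The two-case averaging of claim 3 amounts to a sliding window whose length is capped at the first preset number. A minimal sketch, assuming `history` holds one maximum first probability per audio frame (oldest first, last entry being the current frame):

```python
def average_max_first_probability(history, first_preset_number):
    """Claim 3 averaging for one target sound source: average over all frames
    while fewer than the preset number have been seen, otherwise over the last
    `first_preset_number` consecutive frames ending at the current frame."""
    if len(history) < first_preset_number:
        window = history            # fewer frames than the preset number
    else:
        window = history[-first_preset_number:]  # window ends at current frame
    return sum(window) / len(window)
```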
4. The method of claim 1, further comprising:
outputting sound information corresponding to the maximum energy sound source.
5. The method of claim 1, wherein determining the plurality of tracked sound sources that produce sound for the current audio frame based on the potential sound sources that produce sound for the historical audio frame and the potential sound sources that produce sound for the current audio frame comprises:
determining a sound source tracked during an observation period, wherein the observation period refers to a plurality of audio frames from the first occurrence of a potential sound source until the potential sound source is determined to be a tracked sound source;
determining a second probability that the sound source tracked during the observation period is present in the sounds that produce the plurality of audio frames; and
determining the sound source tracked during the observation period as a tracked sound source when the second probability is greater than a first threshold.
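The promotion rule of claim 5 can be sketched as follows. Estimating the second probability as the candidate's appearance rate over the observation period is an assumption; the claim leaves the estimator open:

```python
def promote_if_tracked(appearances, first_threshold):
    """appearances: booleans, one per audio frame of the observation period,
    True where the candidate source appeared among the potential sources.
    Claim 5: promote the candidate to a tracked sound source when the second
    probability (here, its appearance rate) exceeds the first threshold."""
    second_probability = sum(appearances) / len(appearances)
    return second_probability > first_threshold, second_probability
```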
6. The method of claim 1, wherein determining the plurality of tracked sound sources that produce sound for the current audio frame based on the potential sound sources that produce sound for the historical audio frame and the potential sound sources that produce sound for the current audio frame comprises:
deleting a tracked sound source of the plurality of tracked sound sources if, over a second preset number of consecutive audio frames, a third probability that the tracked sound source matches the potential sound sources is less than a second threshold.
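The deletion rule of claim 6 is the counterpart of claim 5's promotion: a tracked source that stops matching any potential source is dropped. In the sketch below, requiring the third probability to stay below the threshold on every one of the last `second_preset_number` frames is one plausible reading of the claim; the data layout is assumed for illustration:

```python
def prune_tracked_sources(tracked, match_probs, second_preset_number, second_threshold):
    """tracked: list of source ids; match_probs: for each id, its third
    probabilities (match against the potential sources), one per frame,
    oldest first. A source is deleted when its last `second_preset_number`
    probabilities all fall below the second threshold."""
    survivors = []
    for src in tracked:
        probs = match_probs[src][-second_preset_number:]
        if len(probs) == second_preset_number and all(p < second_threshold for p in probs):
            continue  # consistently unmatched: delete this tracked source
        survivors.append(src)
    return survivors
```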
7. A sound processing system comprising:
an acquisition module for acquiring a plurality of audio frames;
a first determining module for determining potential sound sources that produce sound for a current audio frame;
a second determining module for determining an instantaneous maximum energy sound source from the potential sound sources that produces the sound of the current audio frame;
a third determining module for determining a plurality of tracked sound sources that produce the sound of the current audio frame based on the potential sound sources that produce the sounds of historical audio frames and the potential sound sources that produce the sound of the current audio frame;
a fourth determining module for determining a plurality of first probabilities that the instantaneous maximum energy sound source matches the plurality of tracked sound sources, respectively; and
a fifth determining module for determining a maximum energy sound source generating sound of the current audio frame based on the plurality of first probabilities.
8. The system of claim 7, wherein the fifth determination module comprises:
a first determining submodule configured to use a tracked sound source corresponding to a maximum first probability among the plurality of first probabilities as a target sound source;
a second determining submodule configured to determine the plurality of target sound sources corresponding to the historical audio frames and the current audio frame;
a third determining submodule configured to, for each of the plurality of target sound sources, sum the maximum first probabilities of the target sound source over the sounds that produce the historical audio frames and the current audio frame, and divide the result of the summing by the number of the historical audio frames and the current audio frame, the result being an average of the maximum first probabilities corresponding to the target sound source; and
a fourth determining submodule configured to determine the target sound source corresponding to the largest average value as the maximum energy sound source that produces the sound of the current audio frame.
9. The system of claim 8, wherein the third determination submodule comprises:
a first determining subunit configured to, in the case that the number of the plurality of audio frames is less than a first preset number, for each of the plurality of target sound sources, sum the maximum first probabilities of the target sound source over the sounds that produce the plurality of audio frames, and divide the sum by the number of the plurality of audio frames, the result being an average of the maximum first probabilities corresponding to the target sound source; and
a second determining subunit configured to, in the case that the number of the plurality of audio frames is not less than the first preset number, for each of the plurality of target sound sources, sum the maximum first probabilities of the target sound source over the sounds that produce a first preset number of consecutive audio frames, and divide the sum by the first preset number, the result being an average of the maximum first probabilities corresponding to the target sound source, wherein the current audio frame is the end frame of the first preset number of consecutive audio frames.
10. The system of claim 7, further comprising:
an output module for outputting sound information corresponding to the maximum energy sound source.
11. The system of claim 7, wherein the third determination module comprises:
a fifth determining submodule configured to determine a sound source tracked during an observation period, wherein the observation period refers to a plurality of audio frames from the first occurrence of a potential sound source until the potential sound source is determined to be a tracked sound source;
a sixth determining submodule configured to determine a second probability that the sound source tracked during the observation period is present in the sounds that produce the plurality of audio frames; and
a seventh determining submodule configured to determine the sound source tracked during the observation period as a tracked sound source if the second probability is greater than the first threshold.
12. The system of claim 7, wherein the third determination module comprises:
a deleting submodule configured to delete a tracked sound source of the plurality of tracked sound sources if, over a second preset number of consecutive audio frames, a third probability that the tracked sound source matches the potential sound sources is less than a second threshold.
13. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method of any of claims 1-6.
14. A computer readable medium having stored thereon executable instructions which, when executed by a processor, cause the processor to perform the method of any one of claims 1 to 6.
CN201810868993.2A 2018-08-01 2018-08-01 Sound processing method, system, electronic device and computer readable medium Pending CN110797045A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810868993.2A CN110797045A (en) 2018-08-01 2018-08-01 Sound processing method, system, electronic device and computer readable medium


Publications (1)

Publication Number Publication Date
CN110797045A true CN110797045A (en) 2020-02-14

Family

ID=69425603

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810868993.2A Pending CN110797045A (en) 2018-08-01 2018-08-01 Sound processing method, system, electronic device and computer readable medium

Country Status (1)

Country Link
CN (1) CN110797045A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060245601A1 (en) * 2005-04-27 2006-11-02 Francois Michaud Robust localization and tracking of simultaneously moving sound sources using beamforming and particle filtering
CN101762806A (en) * 2010-01-27 2010-06-30 华为终端有限公司 Sound source locating method and apparatus thereof
CN103076593A (en) * 2012-12-28 2013-05-01 中国科学院声学研究所 Sound source localization method and device
CN107976651A (en) * 2016-10-21 2018-05-01 杭州海康威视数字技术股份有限公司 A kind of sound localization method and device based on microphone array
CN108152788A (en) * 2017-12-22 2018-06-12 西安Tcl软件开发有限公司 Sound-source follow-up method, sound-source follow-up equipment and computer readable storage medium


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
JEAN-MARC VALIN, FRANÇOIS MICHAUD, JEAN ROUAT, DOMINIC LÉTOURNEAU: "Robust Sound Source Localization Using a Microphone Array on a Mobile Robot" *

Similar Documents

Publication Publication Date Title
EP2926482B1 (en) Method and apparatus for determining dominant sound source directions in a higher order ambisonics representation of a sound field
EP2954700B1 (en) Method and apparatus for determining directions of uncorrelated sound sources in a higher order ambisonics representation of a sound field
GB2554323A (en) Estimating computational resources for running data-mining services
US11270126B2 (en) Person tracking method, device, electronic device, and computer readable medium
KR20160095008A (en) Estimating a room impulse response for acoustic echo cancelling
US20200349918A1 (en) Information processing method and system, computer system and computer readable medium
US20150312663A1 (en) Source separation using a circular model
TW201503116A (en) Method for using voiceprint identification to operate voice recoginition and electronic device thereof
CN112992190B (en) Audio signal processing method and device, electronic equipment and storage medium
US10991363B2 (en) Priors adaptation for conservative training of acoustic model
CN110992975B (en) Voice signal processing method and device and terminal
CN110797045A (en) Sound processing method, system, electronic device and computer readable medium
US10540990B2 (en) Processing of speech signals
CN111624554B (en) Sound source positioning method and device
CN114495976A (en) Voice test method, device, system, electronic equipment and storage medium
WO2016051565A1 (en) Noise elimination device, noise elimination method, and position specification method
CN113674755B (en) Voice processing method, device, electronic equipment and medium
CN114646920A (en) Sound source positioning method, device, equipment and storage medium
JP2018191255A (en) Sound collecting device, method thereof, and program
CN113470686B (en) Voice enhancement method, device, equipment and storage medium
CN108416096B (en) Far-field speech data signal-to-noise ratio estimation method and device based on artificial intelligence
KR102196388B1 (en) Apparatus and method for estimating space using sound source and microphone
CN112634487B (en) Method and apparatus for outputting information
CN108182948B (en) Voice acquisition processing method and device capable of improving voice recognition rate
CN113362808A (en) Target direction voice extraction method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination