WO2022078249A1 - Sound source tracking method and apparatus, and device, system and storage medium - Google Patents

Sound source tracking method and apparatus, and device, system and storage medium

Info

Publication number
WO2022078249A1
Authority
WO
WIPO (PCT)
Prior art keywords
sound source
time frame
information
tracking
Prior art date
Application number
PCT/CN2021/122742
Other languages
French (fr)
Chinese (zh)
Inventor
黄伟隆
李威
冯津伟
Original Assignee
Alibaba Group Holding Limited (阿里巴巴集团控股有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Limited (阿里巴巴集团控股有限公司)
Publication of WO2022078249A1 publication Critical patent/WO2022078249A1/en

Classifications

    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01S RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S5/00 Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations
    • G01S5/18 Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations using ultrasonic, sonic, or infrasonic waves
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R1/00 Details of transducers, loudspeakers or microphones
    • H04R1/08 Mouthpieces; Microphones; Attachments therefor

Definitions

  • The present application relates to the technical field of data processing, and in particular to a sound source tracking method, apparatus, device, system, and storage medium.
  • Sound source tracking based on a microphone array has become a popular technology in the field of acoustic signal processing in recent years.
  • Conventional sound source tracking technology usually performs signal-level processing, such as filtering, taking extreme values, calculating the fundamental frequency, and calculating the azimuth angle relative to the microphone array, in order to track the sound source.
  • Various aspects of the present application provide a sound source tracking method, apparatus, device, system, and storage medium, so as to improve the accuracy of sound source tracking.
  • An embodiment of the present application provides a sound source tracking method, including:
  • An embodiment of the present application also provides a sound source tracking method, including:
  • Image recognition is performed on the image stream using an image recognition model, so as to perform sound source tracking within the target time period.
  • An embodiment of the present application also provides a sound source tracking apparatus, including:
  • an acquisition module, configured to acquire the acoustic signal stream collected by the microphone array in at least one time frame;
  • a calculation module, configured to perform sound source azimuth estimation based on the acoustic signal stream to obtain an information stream including sound source azimuth information under the at least one time frame;
  • a conversion module, configured to convert the information stream into visualization data describing the azimuth distribution state of the sound source; and
  • a tracking module, configured to perform sound source tracking according to the visualization data.
  • Embodiments of the present application also provide a computing device, including a memory and a processor;
  • the memory is used for storing one or more computer instructions; and
  • the processor, coupled to the memory, is used for executing the one or more computer instructions for:
  • An embodiment of the present application also provides another sound source tracking apparatus, including:
  • a determining module, configured to determine the azimuth information of the sound source respectively under at least one time frame within the target period;
  • a conversion module, configured to convert the azimuth information of the sound source in the at least one time frame into at least one set of image data describing the azimuth distribution state of the sound source, so as to form an image stream; and
  • a tracking module, used for performing image recognition on the image stream by using an image recognition model, so as to perform sound source tracking within the target period.
  • Embodiments of the present application also provide a computing device, including a memory and a processor;
  • the memory is used for storing one or more computer instructions; and
  • the processor, coupled to the memory, is used for executing the one or more computer instructions for:
  • Image recognition is performed on the image stream using an image recognition model for sound source tracking within the target time period.
  • Embodiments of the present application further provide a sound source tracking system, including: a microphone array and a computing device, where the microphone array is communicatively connected to the computing device;
  • the microphone array is used for collecting acoustic signals; and
  • the computing device is configured to: acquire the acoustic signal stream collected by the microphone array in at least one time frame; perform sound source azimuth estimation based on the acoustic signal stream to obtain an information stream containing sound source azimuth information under the at least one time frame; convert the information stream into visualization data describing the azimuth distribution state of the sound source; and track the sound source according to the visualization data.
  • Embodiments of the present application further provide a computer-readable storage medium storing computer instructions, which, when executed by one or more processors, cause the one or more processors to execute the aforementioned sound source tracking method.
  • In the embodiments of the present application, sound source azimuth estimation may be performed on the acoustic signal stream collected by the microphone array in at least one time frame, so as to determine the sound source azimuth information under the at least one time frame respectively; the information stream containing the sound source azimuth information is converted into visualization data describing the azimuth distribution state of the sound source, and sound source tracking is performed based on the visualization data.
  • In this way, the traditional approach of tracking the sound source at the acoustic signal processing level is subverted, and sound source tracking is instead performed at the visual analysis level.
  • The visualization data can accurately and comprehensively reflect the azimuth distribution state of the sound source, which ensures the accuracy and comprehensiveness of the basis of the visual analysis and avoids robustness problems; moreover, in the process of visual analysis, the analyzed field of view can cover more time frames, so noise within the field of view can be found and noise interference can be avoided. Accordingly, in the embodiments of the present application, the accuracy of sound source tracking can be effectively improved, as can the adaptability to various complex environments.
  • FIG. 1 is a schematic flowchart of a sound source tracking method provided by an exemplary embodiment of the present application
  • FIG. 2 is a schematic logical diagram of a sound source tracking solution provided by an exemplary embodiment of the present application
  • FIG. 3 is a schematic diagram of sound source azimuth information provided by an exemplary embodiment of the present application.
  • FIG. 4 is a schematic diagram of a heat map of azimuth distribution of a sound source according to an exemplary embodiment of the present application
  • FIG. 5 is a schematic structural diagram of a sound source tracking device according to an exemplary embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of a computing device according to another exemplary embodiment of the present application.
  • FIG. 7 is a flowchart of another sound source tracking method provided by an exemplary embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of another sound source tracking device provided by an exemplary embodiment of the present application.
  • FIG. 9 is a schematic structural diagram of another computing device provided by an exemplary embodiment of the present application.
  • FIG. 10 is a schematic structural diagram of a sound source tracking system provided by an exemplary embodiment of the present application.
  • In the embodiments of the present application, the information stream containing the sound source azimuth information can be converted into visualization data describing the azimuth distribution state of the sound source, and sound source tracking can be performed based on the visualization data. This subverts the traditional way of tracking the sound source at the acoustic signal processing level and instead works at the visual analysis level. Accordingly, in the embodiments of the present application, the accuracy of sound source tracking can be effectively improved, and the adaptability to various complex environments can be improved.
  • FIG. 1 is a schematic flowchart of a sound source tracking method provided by an exemplary embodiment of the present application.
  • FIG. 2 is a schematic logical diagram of a sound source tracking solution provided by an exemplary embodiment of the present application.
  • The sound source tracking method provided in this embodiment may be performed by a sound source tracking apparatus, which may be implemented as software or as a combination of software and hardware, and may be integrated in a computing device. As shown in FIG. 1, the method includes:
  • Step 100: Acquire an acoustic signal stream collected by the microphone array in at least one time frame;
  • Step 101: Perform sound source azimuth estimation based on the acoustic signal stream to obtain an information stream containing sound source azimuth information under the at least one time frame;
  • Step 102: Convert the information stream into visualization data describing the azimuth distribution state of the sound source;
  • Step 103: Track the sound source according to the visualization data.
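For orientation, steps 100-103 can be read as a single processing pipeline. The sketch below is a minimal, hypothetical Python rendering of that pipeline; the function name, the crude argmax "analysis", and the idea of passing in ready-made per-frame confidence vectors are all illustrative assumptions, not something this application prescribes (real confidences would come from the azimuth estimation of step 101, and the analysis of step 103 is described later as a machine learning model).

```python
import numpy as np

def sound_source_tracking(frames):
    """Steps 100-103 as one pipeline. `frames` is a list of per-frame
    azimuth confidence vectors, standing in for the step-101 estimates."""
    # Step 101: stack per-frame azimuth confidences -> information stream.
    info_stream = np.stack(frames)                      # [n_frames, n_azimuths]
    # Step 102: information stream -> visualization data (grayscale heat map).
    norm = info_stream - info_stream.min()
    heatmap = (255 * norm / (norm.max() + 1e-12)).astype(np.uint8)
    # Step 103: visual analysis; the brightest column per row is a crude
    # stand-in for the machine-learning analysis described below.
    return heatmap.argmax(axis=1)                       # azimuth index per frame
```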
  • The sound source tracking method provided in this embodiment can be applied to various scenarios, for example, a voice control scenario, an audio and video conference scenario, or other scenarios requiring sound source tracking; the application scenario is not limited in this embodiment.
  • Moreover, the sound source tracking method provided in this embodiment can be integrated into various scenario devices.
  • In a voice control scenario, the scenario device may be a smart speaker, an intelligent robot, etc.
  • In an audio and video conference scenario, the scenario devices may be various conference terminals, etc.
  • In this embodiment, a microphone array may be used to collect the acoustic signal stream.
  • The microphone array may be a group of arrays composed of multiple array elements; the number of array elements in the microphone array is not limited in this embodiment.
  • This embodiment also does not limit the arrangement of the microphone array: it may be an annular array, a linear array, a planar array, a stereoscopic array, and so on. In different application scenarios, the microphone array can be assembled in various types of scenario devices as required.
  • The signal acquisition process of the microphone array is usually a continuous process; therefore, in this embodiment, subsequent processing may be performed in the form of an acoustic signal stream.
  • At least one time frame may be selected within a single recognition period, and a single time frame may be used as a processing unit.
  • The length of a single recognition period is adapted to the recognition accuracy. For example, if the recognition accuracy is 1 s, that is, a sound source tracking result is produced every 1 s, the length of a single recognition period can be set to 1 s.
  • In this case, the acoustic signal stream formed by the acoustic signals in at least one time frame within 1 s can be acquired at a time as the processing object of the subsequent steps. In practical applications, under different application scenarios, at least one time frame can be selected on demand within the target period.
  • For example, at least one time frame may be selected within the target period by frame skipping or by sampling at a varying frame rate.
  • Alternatively, all time frames in the target period may be selected; this is not limited in this embodiment.
  • The frame length of the time frame may be configured according to actual requirements; for example, the frame length of a single time frame may be configured as 20 ms.
  • The number of time frames in the recognition period can also be set as required; for example, if the recognition period is 1 s, 3 time frames can be selected in the recognition period for sound source tracking over time.
  • However, the frame length of the time frame and the number of time frames in the recognition period are not limited to these examples.
  • In addition, the frame lengths of different time frames in the recognition period need not be exactly the same, which is not limited in this embodiment.
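As a concrete illustration of the frame selection described above, the snippet below slices a continuous multichannel recording into 20 ms frames and picks three of them per 1 s recognition period by frame skipping. This is a sketch under example parameters from the text; the helper name and the evenly-spaced skipping policy are assumptions.

```python
import numpy as np

def select_frames(signal, sr, frame_ms=20, period_s=1.0, frames_per_period=3):
    """signal: [n_mics, n_samples] multichannel recording.
    Returns one list of [n_mics, frame_len] frames per recognition period."""
    frame_len = int(sr * frame_ms / 1000)
    period_len = int(sr * period_s)
    periods = []
    for p0 in range(0, signal.shape[1] - period_len + 1, period_len):
        n_frames = period_len // frame_len
        # Frame skipping: evenly spaced frame indices within the period.
        picks = np.linspace(0, n_frames - 1, frames_per_period).astype(int)
        periods.append([signal[:, p0 + i * frame_len: p0 + (i + 1) * frame_len]
                        for i in picks])
    return periods
```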
  • The sound source tracking method provided in this embodiment can be applied to real-time sound source tracking scenarios as well as to offline sound source tracking scenarios; according to the recognition accuracy, sound source tracking is successively performed in each recognition period.
  • In this embodiment, each array element in the microphone array can be used to collect time-domain signals respectively.
  • Taking a microphone array with M array elements as an example, M channels of time-domain signal streams can be collected in at least one time frame as the acoustic signal stream acquired in step 100.
  • In step 101, sound source azimuth estimation may be performed based on the acoustic signal stream to obtain an information stream including sound source azimuth information under the at least one time frame.
  • In this embodiment, a sound source azimuth estimation technique may be used to perform signal processing on the acoustic signal stream, so as to determine the sound source azimuth information in at least one time frame respectively.
  • The sound source azimuth information is used to represent the azimuth data of the sound source under the time frame.
  • The azimuth data may be the confidence levels of the sound source at each azimuth.
  • That is, the sound source azimuth information may at least include the confidence levels of the sound source at each azimuth in the time frame.
  • The set of azimuths involved in the sound source azimuth information can be configured according to actual needs; for example, 360 azimuths, 120 azimuths, 60 azimuths, etc. can be configured for the entire circumference of the microphone array.
  • The coverage of the microphone array may also be less than a full circle; for example, a front face covering a range of 180° may be configured with 180 azimuths, etc., which is not limited in this embodiment.
  • FIG. 3 is a schematic diagram of sound source azimuth information provided by an exemplary embodiment of the present application.
  • The sound source azimuth information is visualized in FIG. 3, but it should be understood that FIG. 3 is only for the convenience of describing the sound source azimuth information, and this should not be construed as limiting the data form of the sound source azimuth information in this embodiment.
  • For example, the sound source azimuth information can be expressed as [1, 3, 5, 60, 70, 80, 90, 80, 70, ...].
  • Each number in [ ] can represent the confidence of the sound source in one of 360 azimuths, respectively.
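To make the data form concrete, here is a small, hypothetical confidence vector of the kind just described, and how a peak azimuth could be read from it; the values are invented for illustration, not taken from a real measurement.

```python
import numpy as np

# Hypothetical per-frame sound source azimuth information: one confidence
# value per azimuth (360 azimuths covering the full circle, as above).
confidences = np.zeros(360)
confidences[85:96] = [60, 70, 80, 90, 80, 70, 60, 50, 40, 30, 20]

peak_azimuth = int(np.argmax(confidences))  # -> 88, the most confident azimuth
```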
  • In step 101, the determination of sound source azimuth information is performed in units of time frames.
  • As mentioned above, what is acquired in step 100 is the acoustic signal stream collected by each of the M array elements under at least one time frame.
  • For each time frame, the acoustic signal collected under that time frame is used to estimate the sound source azimuth, so as to determine the sound source azimuth information under that time frame.
  • In an optional implementation, the time-domain signal streams collected by each array element can be converted into time-frequency domain signals respectively; a sound source azimuth estimation technique is then used to determine, from the time-frequency domain signals of each array element, the sound source azimuth information under the at least one time frame.
  • For ease of description, a target time frame in the at least one time frame is taken as an example, where the target time frame may be any one of the at least one time frame.
  • First, the time-domain signals collected by each array element under the target time frame can be converted into time-frequency domain signals.
  • For example, the time-domain signal may be decomposed into sub-bands to obtain time-frequency domain signals, and the sub-band decomposition process may be implemented based on the short-time Fourier transform and/or a filter bank, etc., which is not limited herein. Accordingly, the time-frequency domain signals corresponding to each array element under the target time frame can be obtained.
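A minimal sketch of this sub-band decomposition step, assuming the short-time Fourier transform route and SciPy's standard stft helper (the application itself does not prescribe a particular implementation, and the FFT size is an example value):

```python
import numpy as np
from scipy.signal import stft

def to_time_frequency(frame, sr, n_fft=320):
    """frame: [n_mics, n_samples] time-domain signals for one time frame.
    Returns X: [n_mics, n_bins, n_segments] complex time-frequency signals."""
    _, _, X = stft(frame, fs=sr, nperseg=n_fft, axis=-1)
    return X
```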
  • Then, the sound source azimuth can be estimated from the time-frequency domain signals corresponding to each array element under the target time frame, so as to output the sound source azimuth information under the target time frame.
  • Among them, sound source azimuth estimation techniques include but are not limited to steered response power with phase transform (SRP-PHAT), generalized cross-correlation with phase transform (GCC-PHAT), multiple signal classification (MUSIC), and so on.
  • The principle of sound source azimuth estimation may be as follows: according to the acoustic signals collected by different microphones in the microphone array at the same time, candidate azimuth ranges of the sound source are calculated separately, and the sound source azimuth is then estimated from the multiple azimuth ranges.
  • Of course, this is only exemplary, and the present embodiment is not limited thereto. This embodiment does not limit the sound source azimuth estimation technique used, and the processing procedures of the various sound source azimuth estimation techniques are not described in detail.
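As one concrete instance of the techniques named above, the sketch below computes a GCC-PHAT cross-correlation for a single microphone pair and converts the resulting time difference of arrival into an azimuth under a far-field, free-field assumption with known geometry. It is a simplified illustration of the family of estimators mentioned, not the estimator this application commits to.

```python
import numpy as np

def gcc_phat_azimuth(x1, x2, sr, mic_distance, c=343.0):
    """Estimate a source azimuth from one microphone pair via GCC-PHAT.
    x1, x2: time-domain signals of the two mics; mic_distance in meters."""
    n = len(x1) + len(x2)
    X1, X2 = np.fft.rfft(x1, n), np.fft.rfft(x2, n)
    cross = X1 * np.conj(X2)
    cross /= np.abs(cross) + 1e-12                  # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n)
    max_shift = max(1, int(sr * mic_distance / c))  # physically possible delays
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    tdoa = (np.argmax(np.abs(cc)) - max_shift) / sr
    # Far-field model: tdoa = mic_distance * cos(theta) / c.
    return np.degrees(np.arccos(np.clip(tdoa * c / mic_distance, -1.0, 1.0)))
```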
  • Through step 101, an information stream can be obtained, and the information stream includes the sound source azimuth information in the at least one time frame.
  • In step 102, the information stream can be converted into visualization data describing the azimuth distribution state of the sound source.
  • As mentioned above, the sound source azimuth information may include description information in dimensions such as time and azimuth; on this basis, the sound source azimuth information may be converted into description information of the azimuth distribution state.
  • For example, the azimuth information of the sound source may include the confidence that the sound source is located in different azimuths; the confidence may then be converted into an adapted display brightness, and the display brightness in different azimuths is used as the description information of the azimuth distribution state. It is worth noting that in the process of conversion into visualization data, nothing in the sound source azimuth information is lost; only the representation form of the sound source azimuth information is converted, which ensures that the visualization data in this embodiment is an accurate and comprehensive description of the azimuth distribution of the sound source.
  • In this embodiment, the visualization data may be a heat map of the azimuth distribution of the sound source.
  • In this case, the display brightness at each azimuth under each time frame can be obtained, so as to describe the azimuth distribution state from the three dimensions of time frame, azimuth, and display brightness.
  • Of course, the visualization data is not limited to this.
  • For example, the visualization data may also be a three-dimensional stereogram used to represent the azimuth information of the sound source in the at least one time frame.
  • For instance, the display curves of the sound source azimuth information in FIG. 3 may be arranged in time to obtain a three-dimensional stereogram.
  • In practice, the information stream can be converted into various forms of visualization data.
  • In the conversion process, the azimuth information of each sound source can be fully retained, so in step 102, the sound source tracking process can be switched from the acoustic signal processing level to the visualization processing level.
  • In step 103, sound source tracking may be performed according to the visualization data.
  • In this embodiment, the number of sound sources is not limited, and the number of sound sources may be one or more.
  • In this embodiment, the sound source in the at least one time frame can be tracked by performing visual analysis on the visualization data, thereby converting the acoustic signal processing problem into a visual analysis problem.
  • The visualization data can accurately and comprehensively reflect the azimuth distribution state of the sound source, which ensures the accuracy and comprehensiveness of the basis of the visual analysis and avoids robustness problems; moreover, in the process of visual analysis, the analyzed field of view can cover more time frames instead of being limited to a single time frame, so noise in the field of view can be found and noise interference can be avoided. This effectively avoids the shortcomings of poor robustness and insufficient generalization ability in the traditional acoustic signal processing process.
  • In the embodiments of the present application, the information stream including the sound source azimuth information can be converted into visualization data describing the azimuth distribution state of the sound source, and sound source tracking can be performed based on the visualization data.
  • In an optional implementation, the information stream may be converted into a heat map of the azimuth distribution of the sound source in the at least one time frame, as the basis for tracking the sound source in the at least one time frame.
  • The azimuth distribution heat map is used to describe the distribution heat of the sound source in different azimuths under the at least one time frame.
  • As mentioned above, the sound source azimuth information includes the confidence levels that the sound source is located at each azimuth.
  • On this basis, the display brightness corresponding to each azimuth can be determined under the at least one time frame respectively; according to the display brightness, a heat map of the azimuth distribution of the sound source in the at least one time frame can be generated, where different display brightness represents different distribution heat.
  • For example, the higher the confidence level, the brighter the display may be; however, this embodiment is not limited thereto, and the higher the confidence level, the lower the display brightness may instead be. In general, there is a monotonic correspondence between confidence and display brightness, so that confidence is accurately reflected through display brightness.
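A minimal sketch of this confidence-to-brightness conversion, assuming the default orientation described above (brighter pixel = higher confidence) and a simple linear normalization; both choices are illustrative:

```python
import numpy as np

def stream_to_heatmap(info_stream):
    """info_stream: [n_frames, n_azimuths] confidence values.
    Returns an 8-bit grayscale heat map: one row per time frame, one
    column per azimuth, display brightness proportional to confidence."""
    lo, hi = info_stream.min(), info_stream.max()
    norm = (info_stream - lo) / (hi - lo + 1e-12)
    return (norm * 255).astype(np.uint8)
```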
  • FIG. 4 is a schematic diagram of a heat map of azimuth distribution of a sound source according to an exemplary embodiment of the present application.
  • As shown in FIG. 4, the vertical axis of the heat map is the time frame, and the horizontal axis is the azimuth.
  • In FIG. 4, the number of time frames is 800, and the number of azimuths is configured to be 120, which together represent the entire circumferential space of the microphone array.
  • In an optional implementation, the image content corresponding to each of the at least one time frame may be determined according to the display brightness corresponding to each azimuth under that time frame; according to the time sequence of the at least one time frame, the image content corresponding to each time frame is then arranged in order to generate the azimuth distribution heat map.
  • In this way, the sound source azimuth information under each time frame can be converted into one horizontal row in the heat map.
  • In each row, the display brightness of the pixel corresponding to each azimuth is determined according to the corresponding confidence: a pixel with higher confidence is displayed brighter.
  • For example, the confidence level corresponding to the peak position in FIG. 3 is the highest, and when converted to the heat map, the display brightness at the azimuth corresponding to the peak position is the brightest.
  • The azimuth distribution heat map can also be generated azimuth by azimuth.
  • In that case, for a target azimuth, the confidence that the sound source exists in the target azimuth in different time frames can be obtained, so as to determine the display brightness corresponding to each time frame in the target azimuth and generate the image content corresponding to the target azimuth, where the target azimuth may be any one of the azimuths.
  • In this way, the image content corresponding to each azimuth can be obtained, so that the image contents corresponding to the azimuths can be arranged in sequence according to the azimuth order, to generate the azimuth distribution heat map.
  • This embodiment does not limit the manner of generating the azimuth distribution heat map.
  • Through the above process, the azimuth information of the sound source in the at least one time frame can be converted into a heat map of the azimuth distribution of the sound source. Moreover, in the conversion process, all the content of the sound source azimuth information is preserved, which provides an accurate analysis basis for the visual analysis process and thus ensures the accuracy of the tracking results.
  • On this basis, sound source tracking can be performed using a machine learning model and the visualization data.
  • That is, a machine learning model can be used to perform the visual analysis so as to track sound sources.
  • For different forms of visualization data, different types of machine learning models can be selected, and a model training method adapted to the data form can be used to improve the performance of the machine learning model.
  • In an optional implementation, image features in the azimuth distribution heat map can be extracted; based on the mapping relationship between image features and sound source attribute parameters and on the image features extracted from the azimuth distribution heat map, the target sound source attribute parameters under the at least one time frame can be determined for sound source tracking.
  • The mapping relationship between image features and sound source attribute parameters can be configured into the machine learning model through model training.
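One plausible shape for such a model is a small convolutional network that reads the heat map image and predicts sound source attribute parameters. The sketch below (PyTorch) is an assumption about how the mapping could be realized; the two output heads, the layer sizes, and the class name HeatmapTracker are all hypothetical, as the application does not fix a particular architecture.

```python
import torch
import torch.nn as nn

class HeatmapTracker(nn.Module):
    """Maps an azimuth-distribution heat map [B, 1, n_frames, n_azimuths]
    to sound source attribute parameters (two illustrative heads)."""
    def __init__(self, n_azimuths=120, max_sources=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((8, 8)), nn.Flatten(),
        )
        self.count_head = nn.Linear(32 * 8 * 8, max_sources + 1)  # 0..max sources
        self.azimuth_head = nn.Linear(32 * 8 * 8, n_azimuths)     # per-azimuth activity

    def forward(self, heatmap):
        f = self.features(heatmap)
        return self.count_head(f), torch.sigmoid(self.azimuth_head(f))
```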
  • An exemplary model training process may be as follows:
  • acquire sample heat maps corresponding to several sample time frame groups; label sound source attribute parameters for each sample heat map to obtain labeling information corresponding to each sample heat map; and input each sample heat map and its corresponding labeling information into the machine learning model, for the machine learning model to learn the mapping relationship between image features and sound source attribute parameters.
  • Among them, the number of time frames in each sample time frame group may be consistent with the number of the at least one time frame in which the acoustic signal acquisition is performed in step 100. That is, the processing units in the model training process and in the model use process can be kept consistent. In this way, in the model training process, labeling is performed with a sample time frame group as the unit, and in the model use process, the tracking result can be output in units of the same number of time frames.
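A compact training loop consistent with that description might look as follows; the label encoding (a source count plus a 0/1 per-azimuth activity mask) and the loss combination are assumptions layered on the hypothetical HeatmapTracker sketched above.

```python
import torch
import torch.nn as nn

def train(model, loader, epochs=10, lr=1e-3):
    """loader yields (heatmap, count_label, azimuth_label) per sample group:
    heatmap [B, 1, n_frames, n_azimuths], count_label [B] (long),
    azimuth_label [B, n_azimuths] (float 0/1 activity per azimuth)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    ce, bce = nn.CrossEntropyLoss(), nn.BCELoss()
    for _ in range(epochs):
        for heatmap, count_label, azimuth_label in loader:
            count_logits, azimuth_prob = model(heatmap)
            loss = ce(count_logits, count_label) + bce(azimuth_prob, azimuth_label)
            opt.zero_grad()
            loss.backward()
            opt.step()
```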
  • In this embodiment, the sound source attribute parameters include one or more of azimuth, quantity, sounding duration, and covered time frames. Accordingly, after visual analysis, the machine learning model can output information such as the number of sound sources in the at least one time frame, their azimuths, their sounding durations, and the time frames they cover, as the tracking result.
  • It should be understood that the step of converting the information stream into visualization data describing the azimuth distribution state of the sound source can be performed inside or outside the machine learning model.
  • In one implementation, the process of converting the information stream into visualization data describing the azimuth distribution state of the sound source is performed outside the machine learning model, and the visualization data is used as the input parameter of the machine learning model.
  • In this case, during model training, the information stream corresponding to each sample time frame group can be converted into a sample heat map in advance, to serve as the basis for model training.
  • In another implementation, during model use, the information stream can be input directly into the machine learning model, and within the machine learning model the information stream is converted into visualization data describing the azimuth distribution state of the sound source.
  • That is, a functional module that converts the information stream into visualization data describing the azimuth distribution state of the sound source can be configured in the machine learning model, so that the information stream itself can be used as the input parameter of the machine learning model.
  • Within the machine learning model, the information stream is first converted into visualization data describing the azimuth distribution state of the sound source, and the visual analysis is then carried out.
  • Accordingly, during model training, the sample information streams corresponding to each sample time frame group can be obtained, and sound source attribute parameters can be labeled for each sample information stream to obtain the labeling information corresponding to each sample information stream;
  • each sample information stream and its corresponding labeling information are then input into the machine learning model, so that the machine learning model converts each sample information stream into visualization data describing the azimuth distribution state of the sound source and learns the mapping relationship between image features and sound source attribute parameters.
  • Through model training, the machine learning model learns an accurate mapping relationship between image features and sound source attribute parameters. Therefore, the trained machine learning model can be used to visually analyze the visualization data and output the sound source attribute information under the at least one time frame, so as to track the one or more sound sources that emit sound, based on the sound source attribute information.
  • This sound source tracking method can eliminate various kinds of noise interference in the tracking process, does not need a separate operation of finding the starting point, and avoids the deficiencies of processing at the acoustic signal level. As a result, the accuracy of the tracking result can be effectively improved, and the adaptability to various complex environments can be improved.
  • It should be noted that the execution subject of each step of the method provided in the above-mentioned embodiments may be the same device, or the method may be executed by different devices.
  • For example, the execution subject of steps 101 to 103 may be device A; for another example, the execution subject of steps 101 and 102 may be device A, and the execution subject of step 103 may be device B; and so on.
  • FIG. 5 is a schematic structural diagram of a sound source tracking device according to an exemplary embodiment of the present application.
  • As shown in FIG. 5, the sound source tracking apparatus includes:
  • an acquisition module 50, configured to acquire the acoustic signal stream collected by the microphone array in at least one time frame;
  • a calculation module 51, configured to perform sound source azimuth estimation based on the acoustic signal stream to obtain an information stream including sound source azimuth information under the at least one time frame;
  • a conversion module 52, configured to convert the information stream into visualization data describing the azimuth distribution state of the sound source; and
  • a tracking module 53, configured to track the sound source according to the visualization data.
  • In an optional embodiment, when converting the information stream into visualization data describing the azimuth distribution state of the sound source, the conversion module 52 is configured to:
  • convert the information stream into an azimuth distribution heat map of the sound source under the at least one time frame, where the azimuth distribution heat map is used to describe the distribution heat of the sound source in different azimuths under the at least one time frame.
  • In an optional embodiment, the sound source azimuth information includes the confidence that the sound source is in each azimuth; when converting the information stream into a heat map of the azimuth distribution of the sound source in the at least one time frame, the conversion module 52 is configured to:
  • determine the display brightness corresponding to each azimuth under the at least one time frame respectively, where different display brightness represents different distribution heat; and
  • generate, according to the display brightness, a heat map of the azimuth distribution of the sound source under the at least one time frame.
  • In an optional embodiment, when generating the heat map of the azimuth distribution of the sound source under the at least one time frame according to the display brightness, the conversion module 52 is configured to:
  • respectively determine the image content corresponding to each of the at least one time frame; and
  • sequentially arrange the image contents corresponding to the at least one time frame to generate the azimuth distribution heat map.
  • In an optional embodiment, when tracking the sound source according to the visualization data, the tracking module 53 is configured to:
  • In an optional embodiment, when using the machine learning model and the visualization data to track the sound source, the tracking module 53 is configured to:
  • determine the target sound source attribute parameters under the at least one time frame, so as to perform sound source tracking.
  • In an optional embodiment, the sound source attribute parameters include one or more of azimuth, quantity, sounding duration, and covered time frames.
  • In an optional embodiment, the tracking module 53 is further configured to:
  • acquire sample heat maps corresponding to several sample time frame groups, where each sample heat map is used to describe the distribution heat of the sound source in different azimuths under the corresponding sample time frame group; and
  • input each sample heat map and its corresponding labeling information into the machine learning model, so that the machine learning model can learn the mapping relationship between image features and sound source attribute parameters.
  • In an optional embodiment, when the tracking module 53 converts the information stream into visualization data describing the azimuth distribution state of the sound source, it is configured to:
  • convert the information stream into visualization data describing the azimuth distribution state of the sound source.
  • In an optional embodiment, the tracking module 53 is further configured to:
  • In an optional embodiment, the acoustic signal stream includes a time-domain signal stream collected by each array element in the microphone array. When the calculation module 51 performs sound source azimuth estimation based on the acoustic signal stream to obtain an information stream including sound source azimuth information under the at least one time frame, it is configured to:
  • convert the time-domain signal streams collected by each array element into time-frequency domain signals respectively; and
  • determine the sound source azimuth information under the at least one time frame according to the time-frequency domain signals of each array element.
  • In an optional embodiment, the sound source azimuth estimation technique includes one or more of steered response power with phase transform (SRP-PHAT), generalized cross-correlation with phase transform (GCC-PHAT), and multiple signal classification (MUSIC).
  • FIG. 6 is a schematic structural diagram of a computing device according to another exemplary embodiment of the present application. As shown in FIG. 6, the computing device includes: a memory 60 and a processor 61.
  • The processor 61, coupled to the memory 60, executes a computer program in the memory 60 for:
  • In an optional embodiment, when the processor 61 converts the information stream into visualization data describing the azimuth distribution state of the sound source, it is configured to:
  • convert the information stream into an azimuth distribution heat map of the sound source under the at least one time frame, where the azimuth distribution heat map is used to describe the distribution heat of the sound source in different azimuths under the at least one time frame.
  • In an optional embodiment, the sound source azimuth information includes the confidence that the sound source is in each azimuth; when the processor 61 converts the information stream into a heat map of the azimuth distribution of the sound source in the at least one time frame, the processor 61 is configured to:
  • determine the display brightness corresponding to each azimuth under the at least one time frame respectively, where different display brightness represents different distribution heat; and
  • generate, according to the display brightness, a heat map of the azimuth distribution of the sound source under the at least one time frame.
  • In an optional embodiment, when the processor 61 generates the heat map of the azimuth distribution of the sound source under the at least one time frame according to the display brightness, the processor 61 is configured to:
  • respectively determine the image content corresponding to each of the at least one time frame; and
  • sequentially arrange the image contents corresponding to the at least one time frame to generate the azimuth distribution heat map.
  • In an optional embodiment, when the processor 61 performs sound source tracking according to the visualization data, the processor 61 is configured to:
  • In an optional embodiment, when the processor 61 uses the machine learning model and the visualization data to track the sound source, it is configured to:
  • determine the target sound source attribute parameters under the at least one time frame, so as to perform sound source tracking.
  • In an optional embodiment, the sound source attribute parameters include one or more of azimuth, quantity, sounding duration, and covered time frames.
  • In an optional embodiment, the processor 61 is further configured to:
  • acquire sample heat maps corresponding to several sample time frame groups, where each sample heat map is used to describe the distribution heat of the sound source in different azimuths under the corresponding sample time frame group; and
  • input each sample heat map and its corresponding labeling information into the machine learning model, so that the machine learning model can learn the mapping relationship between image features and sound source attribute parameters.
  • In an optional embodiment, when the processor 61 converts the information stream into visualization data describing the azimuth distribution state of the sound source, the processor 61 is configured to:
  • convert the information stream into visualization data describing the azimuth distribution state of the sound source.
  • In an optional embodiment, the processor 61 is further configured to:
  • In an optional embodiment, the acoustic signal stream includes a time-domain signal stream collected by each array element in the microphone array. When the processor 61 performs sound source azimuth estimation based on the acoustic signal stream to obtain an information stream including sound source azimuth information under the at least one time frame, it is configured to:
  • convert the time-domain signal streams collected by each array element into time-frequency domain signals respectively; and
  • determine the sound source azimuth information under the at least one time frame according to the time-frequency domain signals of each array element.
  • In an optional embodiment, the sound source azimuth estimation technique includes one or more of steered response power with phase transform (SRP-PHAT), generalized cross-correlation with phase transform (GCC-PHAT), and multiple signal classification (MUSIC).
  • In addition, the computing device further includes: a microphone array 62, a communication component 63, a power supply component 64, and other components. Only some components are schematically shown in FIG. 6; this does not mean that the computing device includes only the components shown in FIG. 6.
  • FIG. 7 is a flowchart of another sound source tracking method provided by an exemplary embodiment of the present application.
  • The sound source tracking method provided in this embodiment may be performed by a sound source tracking apparatus, which may be implemented as software or as a combination of software and hardware, and may be integrated in a computing device. As shown in FIG. 7, the method includes:
  • Step 700: Determine the sound source azimuth information respectively under at least one time frame within the target period;
  • Step 701: Convert the azimuth information of the sound source in the at least one time frame into at least one set of image data describing the azimuth distribution state of the sound source, so as to form an image stream;
  • Step 702: Perform image recognition on the image stream by using the image recognition model, so as to track the sound source within the target period.
  • Among them, step 700 may include: acquiring an acoustic signal stream collected by the microphone array in the at least one time frame; and performing sound source azimuth estimation based on the acoustic signal stream to obtain the sound source azimuth information under the at least one time frame. To save space, the specific process is not repeated here.
  • In this embodiment, the azimuth information of the sound source in the at least one time frame may be converted into at least one set of image data.
  • For example, the image data may be an azimuth distribution heat map; in step 701, at least one azimuth distribution heat map can be obtained to form an image stream, which is input into the image recognition model.
  • The sound source tracking solution provided in this embodiment can be applied to scenarios such as real-time tracking or offline tracking.
  • In some scenarios, the azimuth information of the sound source in the at least one time frame can be obtained at one time, and the sound source azimuth information in the at least one time frame can be grouped according to the recognition accuracy, so that the operation of converting the azimuth information into image data can be performed group by group.
  • In other scenarios, the target time frames within the current recognition period can be determined from the at least one time frame based on the preset recognition accuracy, and the operation of converting the azimuth information into image data can be performed successively in each recognition period; the resulting image data is then input into the image recognition model in step 702.
  • For example, the recognition accuracy may be 1 s, in which case each recognition period covers 1 s of the signal.
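A small sketch of this grouping step, assuming per-frame azimuth information arrives with timestamps and the recognition accuracy is 1 s (example values from the text; the data layout and helper name are illustrative):

```python
from collections import defaultdict

def group_by_recognition_period(frames, period_s=1.0):
    """frames: iterable of (timestamp_seconds, azimuth_confidence_vector).
    Returns {period_index: [confidence vectors in that period]}, so that each
    group can be converted into one set of image data (one heat map)."""
    groups = defaultdict(list)
    for t, conf in frames:
        groups[int(t // period_s)].append(conf)
    return dict(groups)
```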
  • In this embodiment, the image recognition model can be pre-trained.
  • The image recognition model may adopt a machine learning model, and the training process of the image recognition model may refer to the relevant description in the embodiment associated with FIG. 1.
  • FIG. 8 is a schematic structural diagram of another sound source tracking apparatus provided by an exemplary embodiment of the present application.
  • As shown in FIG. 8, the sound source tracking apparatus includes:
  • a determination module 80, configured to determine the sound source azimuth information respectively under at least one time frame within the target period;
  • a conversion module 81, configured to convert the azimuth information of the sound source in the at least one time frame into at least one set of image data describing the azimuth distribution state of the sound source, so as to form an image stream; and
  • a tracking module 82, configured to perform image recognition on the image stream by using the image recognition model, so as to perform sound source tracking within the target period.
  • In an optional embodiment, when the conversion module 81 converts the azimuth information of the sound source in the at least one time frame into at least one set of image data describing the azimuth distribution state of the sound source to form an image stream, it is configured to:
  • In an optional embodiment, the determination module 80 includes an acquisition module 83 and a calculation module 84;
  • the acquisition module 83 is configured to acquire the acoustic signal stream collected by the microphone array in the at least one time frame; and
  • the calculation module 84 is configured to perform sound source azimuth estimation based on the acoustic signal stream, so as to obtain the sound source azimuth information in the at least one time frame.
  • In an optional embodiment, when converting the sound source azimuth information under each target time frame into a group of image data describing the azimuth distribution state of the sound source in the current recognition period, the conversion module 81 is configured to:
  • convert the sound source azimuth information under each target time frame into an azimuth distribution heat map of the sound source under the at least one time frame, where the azimuth distribution heat map is used to describe the distribution heat of the sound source in different azimuths under the at least one time frame.
  • In an optional embodiment, the sound source azimuth information includes the confidence that the sound source is in each azimuth; when the conversion module 81 converts the sound source azimuth information under each target time frame into an azimuth distribution heat map of the sound source under the at least one time frame, it is configured to:
  • determine the display brightness corresponding to each azimuth under the at least one time frame respectively, where different display brightness represents different distribution heat; and
  • generate, according to the display brightness, a heat map of the azimuth distribution of the sound source under the at least one time frame.
  • In an optional embodiment, when the conversion module 81 generates the heat map of the azimuth distribution of the sound source in the at least one time frame according to the display brightness, it is configured to:
  • respectively determine the image content corresponding to each of the at least one time frame; and
  • sequentially arrange the image contents corresponding to the at least one time frame to generate the azimuth distribution heat map.
  • In an optional embodiment, when the tracking module 82 uses the image recognition model to perform image recognition on the image stream so as to track the sound source within the target period, it is configured to:
  • extract the image features in the azimuth distribution heat map; and
  • determine the target sound source attribute parameters under the at least one time frame, so as to perform sound source tracking.
  • In an optional embodiment, the sound source attribute parameters include one or more of azimuth, quantity, sounding duration, and covered time frames.
  • In an optional embodiment, the tracking module 82 is further configured to:
  • acquire sample heat maps corresponding to several sample time frame groups, where each sample heat map is used to describe the distribution heat of the sound source in different azimuths under the corresponding sample time frame group; and
  • input each sample heat map and its corresponding labeling information into the image recognition model, so that the image recognition model can learn the mapping relationship between image features and sound source attribute parameters.
  • In an optional embodiment, when the tracking module 82 converts the azimuth information of the sound source under each target time frame into a heat map of the azimuth distribution of the sound source under the at least one time frame, it is configured to:
  • convert the azimuth information of the sound source under each target time frame into a heat map of the azimuth distribution of the sound source under the at least one time frame.
  • In an optional embodiment, the tracking module 82 is further configured to:
  • In an optional embodiment, the acoustic signal stream includes the time-domain signal stream collected by each array element in the microphone array. When the calculation module 84 performs sound source azimuth estimation based on the acoustic signal stream to obtain the sound source azimuth information in the at least one time frame, it is configured to:
  • convert the time-domain signal streams collected by each array element into time-frequency domain signals respectively; and
  • determine the sound source azimuth information under the at least one time frame according to the time-frequency domain signals of each array element.
  • In an optional embodiment, the sound source azimuth estimation technique includes one or more of steered response power with phase transform (SRP-PHAT), generalized cross-correlation with phase transform (GCC-PHAT), and multiple signal classification (MUSIC).
  • FIG. 9 is a schematic structural diagram of another computing device provided by an exemplary embodiment of the present application.
  • As shown in FIG. 9, the computing device includes: a memory 90 and a processor 91.
  • The processor 91, coupled to the memory 90, executes a computer program in the memory 90 for:
  • Image recognition is performed on the image stream using an image recognition model for sound source tracking within the target time period.
  • In an optional embodiment, when the processor 91 converts the sound source azimuth information in the at least one time frame into at least one set of image data describing the azimuth distribution state of the sound source to form an image stream, it is configured to:
  • In an optional embodiment, when the processor 91 respectively determines the sound source azimuth information under the at least one time frame within the target period, it is configured to:
  • perform sound source azimuth estimation based on the acoustic signal stream to obtain the sound source azimuth information under the at least one time frame.
  • In an optional embodiment, when the processor 91 converts the sound source azimuth information under each target time frame into a set of image data describing the azimuth distribution state of the sound source in the current recognition period, it is configured to:
  • convert the sound source azimuth information under each target time frame into an azimuth distribution heat map of the sound source under the at least one time frame, where the azimuth distribution heat map is used to describe the distribution heat of the sound source in different azimuths under the at least one time frame.
  • In an optional embodiment, the sound source azimuth information includes the confidence that the sound source is in each azimuth; when the processor 91 converts the sound source azimuth information under each target time frame into an azimuth distribution heat map of the sound source under the at least one time frame, it is configured to:
  • determine the display brightness corresponding to each azimuth under the at least one time frame respectively, where different display brightness represents different distribution heat; and
  • generate, according to the display brightness, a heat map of the azimuth distribution of the sound source under the at least one time frame.
  • In an optional embodiment, when the processor 91 generates the heat map of the azimuth distribution of the sound source in the at least one time frame according to the display brightness, the processor 91 is configured to:
  • respectively determine the image content corresponding to each of the at least one time frame; and
  • sequentially arrange the image contents corresponding to the at least one time frame to generate the azimuth distribution heat map.
  • In an optional embodiment, when the processor 91 uses the image recognition model to perform image recognition on the image stream so as to track the sound source within the target period, it is configured to:
  • extract the image features in the azimuth distribution heat map; and
  • determine the target sound source attribute parameters under the at least one time frame, so as to perform sound source tracking.
  • In an optional embodiment, the sound source attribute parameters include one or more of azimuth, quantity, sounding duration, and covered time frames.
  • In an optional embodiment, the processor 91 is further configured to:
  • acquire sample heat maps corresponding to several sample time frame groups, where each sample heat map is used to describe the distribution heat of the sound source in different azimuths under the corresponding sample time frame group; and
  • input each sample heat map and its corresponding labeling information into the image recognition model, so that the image recognition model can learn the mapping relationship between image features and sound source attribute parameters.
  • In an optional embodiment, when the processor 91 converts the azimuth information of the sound source under each target time frame into a heat map of the azimuth distribution of the sound source under the at least one time frame, the processor 91 is configured to:
  • convert the azimuth information of the sound source under each target time frame into a heat map of the azimuth distribution of the sound source under the at least one time frame.
  • In an optional embodiment, the processor 91 is further configured to:
  • In an optional embodiment, the acoustic signal stream includes a time-domain signal stream collected by each array element in the microphone array. When the processor 91 performs sound source azimuth estimation based on the acoustic signal stream to obtain the sound source azimuth information in the at least one time frame, it is configured to:
  • convert the time-domain signal streams collected by each array element into time-frequency domain signals respectively; and
  • determine the sound source azimuth information under the at least one time frame according to the time-frequency domain signals of each array element.
  • In an optional embodiment, the sound source azimuth estimation technique includes one or more of steered response power with phase transform (SRP-PHAT), generalized cross-correlation with phase transform (GCC-PHAT), and multiple signal classification (MUSIC).
  • In addition, the computing device further includes: a microphone array 92, a communication component 93, a power supply component 94, and other components. Only some components are schematically shown in FIG. 9; this does not mean that the computing device includes only the components shown in FIG. 9.
  • Correspondingly, the embodiments of the present application further provide a computer-readable storage medium storing a computer program; when the computer program is executed, each step that can be executed by the computing device in the foregoing method embodiments can be implemented.
  • FIG. 10 is a schematic structural diagram of a sound source tracking system provided by an exemplary embodiment of the present application.
  • The sound source tracking system may include: a microphone array 10 and a computing device 20, where the microphone array 10 and the computing device 20 are communicatively connected.
  • The sound source tracking system provided in this embodiment can be applied to various scenarios, for example, a voice control scenario, an audio and video conference scenario, or other scenarios that require sound source tracking; the application scenario is not limited in this embodiment.
  • The sound source tracking system provided in this embodiment can be integrated and deployed in various scenarios.
  • In a voice control scenario, it can be deployed in smart speakers and intelligent robots.
  • In audio and video conference scenarios, it can be deployed in various conference terminals.
  • In this embodiment, the microphone array 10 can be used to collect acoustic signals.
  • The number and arrangement of the array elements of the microphone array 10 are not limited in this embodiment.
  • The memories in FIGS. 6 and 9 described above are used to store computer programs, and may be configured to store various other data to support operations on the computing platform. Examples of such data include instructions for any application or method operating on the computing platform, contact data, phonebook data, messages, pictures, videos, etc.
  • Memory can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as Static Random Access Memory (SRAM), Electrically Erasable Programmable Read Only Memory (EEPROM), Erasable Programmable Read Only Memory (EPROM), Programmable Read Only Memory (PROM), Read Only Memory (ROM), Magnetic Memory, Flash Memory, Magnetic or Optical Disk.
  • The communication components in FIG. 6 and FIG. 9 described above are configured to facilitate wired or wireless communication between the device where the communication component is located and other devices.
  • The device where the communication component is located can access a wireless network based on a communication standard, such as WiFi, a mobile communication network such as 2G, 3G, 4G/LTE or 5G, or a combination thereof.
  • In an exemplary embodiment, the communication component receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel.
  • In an exemplary embodiment, the communication component further includes a near field communication (NFC) module to facilitate short-range communication.
  • For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
  • the power supply component may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device in which the power supply component is located.
  • the embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
  • These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture comprising instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
  • a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
  • Memory may include non-persistent memory, random access memory (RAM), and/or non-volatile memory in computer-readable media, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
  • Computer-readable media include persistent and non-persistent, removable and non-removable media; information storage may be implemented by any method or technology.
  • Information may be computer readable instructions, data structures, modules of programs, or other data.
  • Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic tape cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
  • Computer-readable media do not include transitory computer-readable media, such as modulated data signals and carrier waves.

Abstract

A sound source tracking method and apparatus, and a device, a system and a storage medium. The method comprises: acquiring an acoustic signal stream collected by a microphone array in at least one time frame (100); performing sound source azimuth estimation on the basis of the acoustic signal stream, so as to obtain an information stream that includes sound source azimuth information for the at least one time frame (101); converting the information stream into visualization data that describes the azimuth distribution state of the sound source (102); and performing sound source tracking according to the visualization data (103). In the method, an information stream that includes sound source azimuth information is converted into visualization data that describes the azimuth distribution state of the sound source, and sound source tracking is performed on the basis of the visualization data. This effectively improves the accuracy of sound source tracking and the adaptability to various complex environments.

Description

A sound source tracking method, apparatus, device, system and storage medium
This application claims priority to Chinese Patent Application No. 202011086519.8, filed on October 12, 2020 and entitled "A sound source tracking method, apparatus, device, system and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a sound source tracking method, apparatus, device, system, and storage medium.
Background
Sound source tracking based on a microphone array has been a popular technique in the field of acoustic signal processing in recent years. At present, sound source tracking is usually performed through signal-level processing of the microphone array signals, such as filtering, taking extreme values, computing the fundamental frequency, and computing the azimuth angle.
However, such processing methods have poor robustness and insufficient generalization ability; in particular, in multi-source or noisy environments, the accuracy of sound source tracking is insufficient.
Summary of the Invention
Various aspects of the present application provide a sound source tracking method, apparatus, device, system, and storage medium, so as to improve the accuracy of sound source tracking.
An embodiment of the present application provides a sound source tracking method, including:
acquiring an acoustic signal stream collected by a microphone array in at least one time frame;
performing sound source azimuth estimation based on the acoustic signal stream, to obtain an information stream containing sound source azimuth information for the at least one time frame;
converting the information stream into visualization data describing the azimuth distribution state of the sound source; and
performing sound source tracking according to the visualization data.
An embodiment of the present application further provides a sound source tracking method, including:
determining sound source azimuth information separately for at least one time frame within a target period;
converting the sound source azimuth information for the at least one time frame into at least one set of image data describing the azimuth distribution state of the sound source, so as to form an image stream; and
performing image recognition on the image stream by using an image recognition model, so as to perform sound source tracking within the target period.
An embodiment of the present application further provides a sound source tracking apparatus, including:
an acquisition module, configured to acquire an acoustic signal stream collected by a microphone array in at least one time frame;
a calculation module, configured to perform sound source azimuth estimation based on the acoustic signal stream, to obtain an information stream containing sound source azimuth information for the at least one time frame;
a conversion module, configured to convert the information stream into visualization data describing the azimuth distribution state of the sound source; and
a tracking module, configured to perform sound source tracking according to the visualization data.
An embodiment of the present application further provides a computing device, including a memory and a processor;
the memory is configured to store one or more computer instructions; and
the processor is coupled to the memory and configured to execute the one or more computer instructions to:
acquire an acoustic signal stream collected by a microphone array in at least one time frame;
perform sound source azimuth estimation based on the acoustic signal stream, to obtain an information stream containing sound source azimuth information for the at least one time frame;
convert the information stream into visualization data describing the azimuth distribution state of the sound source; and
perform sound source tracking according to the visualization data.
An embodiment of the present application further provides a sound source tracking apparatus, including:
a determining module, configured to determine sound source azimuth information separately for at least one time frame within a target period;
a conversion module, configured to convert the sound source azimuth information for the at least one time frame into at least one set of image data describing the azimuth distribution state of the sound source, so as to form an image stream; and
a tracking module, configured to perform image recognition on the image stream by using an image recognition model, so as to perform sound source tracking within the target period.
An embodiment of the present application further provides a computing device, including a memory and a processor;
the memory is configured to store one or more computer instructions; and
the processor is coupled to the memory and configured to execute the one or more computer instructions to:
determine sound source azimuth information separately for at least one time frame within a target period;
convert the sound source azimuth information for the at least one time frame into at least one set of image data describing the azimuth distribution state of the sound source, so as to form an image stream; and
perform image recognition on the image stream by using an image recognition model, so as to perform sound source tracking within the target period.
An embodiment of the present application further provides a sound source tracking system, including a microphone array and a computing device, where the microphone array is communicatively connected to the computing device;
the microphone array is configured to collect acoustic signals; and
the computing device is configured to: acquire an acoustic signal stream collected by the microphone array in at least one time frame; perform sound source azimuth estimation based on the acoustic signal stream, to obtain an information stream containing sound source azimuth information for the at least one time frame; convert the information stream into visualization data describing the azimuth distribution state of the sound source; and perform sound source tracking according to the visualization data.
An embodiment of the present application further provides a computer-readable storage medium storing computer instructions which, when executed by one or more processors, cause the one or more processors to perform the foregoing sound source tracking method.
In the embodiments of the present application, sound source azimuth estimation may be performed on the acoustic signal stream collected by the microphone array in at least one time frame, so as to determine the sound source azimuth information for each of the at least one time frame; the information stream containing the sound source azimuth information is converted into visualization data describing the azimuth distribution state of the sound source, and sound source tracking is performed based on the visualization data. In this way, the embodiments of the present application abandon the traditional approach of tracking a sound source at the acoustic signal processing level and instead perform sound source tracking at the visual analysis level. Because the visualization data in these embodiments reflects the azimuth distribution state of the sound source accurately and comprehensively, the basis of the visual analysis is accurate and comprehensive, which avoids the robustness problem; moreover, during visual analysis, the field of view of the analysis can cover more time frames, so noise within the field of view can be identified and noise interference avoided. Accordingly, the embodiments of the present application can effectively improve the accuracy of sound source tracking and the adaptability to various complex environments.
Description of the Drawings
The drawings described herein are used to provide a further understanding of the present application and constitute a part of the present application. The exemplary embodiments of the present application and their descriptions are used to explain the present application and do not constitute an improper limitation of the present application. In the drawings:
FIG. 1 is a schematic flowchart of a sound source tracking method provided by an exemplary embodiment of the present application;
FIG. 2 is a schematic logical diagram of a sound source tracking solution provided by an exemplary embodiment of the present application;
FIG. 3 is a schematic diagram of sound source azimuth information provided by an exemplary embodiment of the present application;
FIG. 4 is a schematic diagram of an azimuth distribution heat map of a sound source provided by an exemplary embodiment of the present application;
FIG. 5 is a schematic structural diagram of a sound source tracking apparatus provided by an exemplary embodiment of the present application;
FIG. 6 is a schematic structural diagram of a computing device provided by another exemplary embodiment of the present application;
FIG. 7 is a flowchart of another sound source tracking method provided by an exemplary embodiment of the present application;
FIG. 8 is a schematic structural diagram of another sound source tracking apparatus provided by an exemplary embodiment of the present application;
FIG. 9 is a schematic structural diagram of another computing device provided by an exemplary embodiment of the present application;
FIG. 10 is a schematic structural diagram of a sound source tracking system provided by an exemplary embodiment of the present application.
Detailed Description
To make the objectives, technical solutions, and advantages of the present application clearer, the technical solutions of the present application are described clearly and completely below with reference to specific embodiments of the present application and the corresponding drawings. Obviously, the described embodiments are only a part, rather than all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort shall fall within the protection scope of the present application.
In view of technical problems of existing sound source tracking solutions, such as poor robustness and insufficient generalization ability, in some embodiments of the present application an information stream containing sound source azimuth information can be converted into visualization data describing the azimuth distribution state of the sound source, and sound source tracking is performed based on the visualization data. This replaces the traditional approach of performing sound source tracking at the acoustic signal processing level with sound source tracking at the visual analysis level. Accordingly, the embodiments of the present application can effectively improve the accuracy of sound source tracking and the adaptability to various complex environments.
The technical solutions provided by the embodiments of the present application are described in detail below with reference to the accompanying drawings.
FIG. 1 is a schematic flowchart of a sound source tracking method provided by an exemplary embodiment of the present application. FIG. 2 is a schematic logical diagram of a sound source tracking solution provided by an exemplary embodiment of the present application. The sound source tracking method provided in this embodiment may be performed by a sound source tracking apparatus, which may be implemented as software or as a combination of software and hardware, and which may be integrated in a computing device. As shown in FIG. 1, the method includes:
Step 100: acquire an acoustic signal stream collected by a microphone array in at least one time frame;
Step 101: perform sound source azimuth estimation based on the acoustic signal stream, to obtain an information stream containing sound source azimuth information for the at least one time frame;
Step 102: convert the information stream into visualization data describing the azimuth distribution state of the sound source;
Step 103: perform sound source tracking according to the visualization data.
The sound source tracking method provided in this embodiment can be applied in various scenarios, for example, voice control scenarios, audio and video conference scenarios, or other scenarios requiring sound source tracking; the application scenario is not limited in this embodiment. In different application scenarios, the sound source tracking method provided in this embodiment can be integrated in a wide variety of scenario devices; for example, in a voice control scenario, the scenario device may be a smart speaker, an intelligent robot, or the like, and in an audio and video conference scenario, the scenario device may be any of various conference terminals.
In this embodiment, in step 100, a microphone array may be used to collect the acoustic signal stream. The microphone array may be a group of arrays composed of multiple array elements; this embodiment does not limit the number of array elements in the microphone array. This embodiment also does not limit the arrangement of the microphone array: the microphone array may be a circular array, a linear array, a planar array, a three-dimensional array, or the like. In different application scenarios, the microphone array can be mounted in various types of scenario devices as required.
The signal acquisition process of a microphone array is usually continuous; therefore, in this embodiment, subsequent processing may be performed in the form of an acoustic signal stream.
In this embodiment, based on the recognition precision, at least one time frame may be selected within a single recognition period, and a single time frame is used as the processing unit. The length of a single recognition period is adapted to the recognition precision. For example, if the recognition precision is 1 s, that is, a sound source tracking result is produced every 1 s, then the length of a single recognition period can be set to 1 s, and in step 100 the acoustic signal stream formed by the acoustic signals in at least one time frame within that 1 s is acquired at a time as the processing object of the subsequent steps. In practical applications, in different application scenarios, the at least one time frame can be selected within the target period as required. For example, when the acoustic signal changes little, the at least one time frame may be selected within the target period by skipping frames or by sampling at a varying frame rate. Of course, in most cases, all time frames within the target period may be selected, which is not limited in this embodiment.
In this embodiment, the frame length of a time frame can be configured according to actual requirements; for example, the frame length of a single time frame can be configured as 20 ms. In addition, the number of time frames within a recognition period can also be set as required; for example, if the recognition period is 1 s, three time frames may be selected within it, and sound source tracking for the period is performed based on these three time frames. Of course, neither the frame length nor the number of time frames within a recognition period is limited to these values in this embodiment. Furthermore, the frame lengths of different time frames within a recognition period need not be identical, and none of this is limited in this embodiment.
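By way of illustration only, the following is a minimal sketch of this framing step, assuming non-overlapping frames of equal length and a 16 kHz sampling rate (both are assumptions of the example; as described above, the embodiment leaves the frame length and the frame selection policy configurable):

    import numpy as np

    def split_into_frames(signal, fs=16000, frame_ms=20.0):
        # Split one channel of the acquired stream into equal,
        # non-overlapping time frames; e.g. a 1 s recognition period at
        # 16 kHz with 20 ms frames yields 50 frames of 320 samples each.
        n = int(fs * frame_ms / 1000.0)
        usable = (len(signal) // n) * n
        return np.asarray(signal[:usable]).reshape(-1, n)  # (n_frames, n)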
On this basis, the sound source tracking method provided in this embodiment can be applied to real-time sound source tracking scenarios as well as offline sound source tracking scenarios, performing sound source tracking successively in each recognition period according to the recognition precision.
In practical applications, each array element in the microphone array can collect a time-domain signal. Taking a microphone array containing M array elements as an example, M channels of time-domain signal streams can be collected in the at least one time frame, forming the acoustic signal stream in step 100.
Referring to FIG. 1 and FIG. 2, in step 101, sound source azimuth estimation may be performed based on the acoustic signal stream, to obtain an information stream containing sound source azimuth information for the at least one time frame.
In this embodiment, a sound source azimuth estimation technique may be used to perform signal processing on the acoustic signal stream, so as to determine the sound source azimuth information for each of the at least one time frame. The sound source azimuth information characterizes the azimuth data of the sound source in a time frame. In this embodiment, the azimuth data may be the confidence that the sound source lies at each azimuth; thus, the sound source azimuth information may at least contain, for a time frame, the confidence that the sound source lies at each azimuth. The azimuths involved in the sound source azimuth information can be configured according to actual needs; for example, 360 azimuths, 120 azimuths, 60 azimuths, or the like can be configured for the full circumference of the microphone array. Of course, less than the full circumference may also be covered, for example, 180 azimuths for the frontal 180° range, which is not limited in this embodiment.
FIG. 3 is a schematic diagram of sound source azimuth information provided by an exemplary embodiment of the present application. The sound source azimuth information is visualized in FIG. 3, but it should be understood that FIG. 3 is only intended to facilitate the description of the sound source azimuth information and should not limit the data form of the sound source azimuth information in this embodiment. In practical applications, the sound source azimuth information may be [1, 3, 5, 60, 70, 80, 90, 80, 70, ..., 0] or any other data form that can be understood by the computing device; in this example, each number in the brackets represents the confidence that the sound source lies at one of 360 azimuths.
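As a small illustration of this data form (the values and the 360-azimuth grid are taken from the example above; the variable names are hypothetical):

    import numpy as np

    # Per-azimuth confidences for one time frame: entry k is the
    # confidence that the sound source lies at azimuth k (360 azimuths
    # covering the full circle).
    confidence = np.zeros(360)
    confidence[85:94] = [1, 3, 5, 60, 70, 80, 90, 80, 70]
    peak_azimuth = int(np.argmax(confidence))  # 91, the most likely azimuth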
In addition, in step 101, the sound source azimuth information is determined in units of time frames. Following the foregoing, what is acquired in step 100 is the acoustic signal stream collected by each of the M array elements in the at least one time frame; here, in step 101, sound source azimuth estimation may be performed, time frame by time frame, on the acoustic signals collected by the M array elements in that time frame, so as to determine the sound source azimuth information for that time frame.
In an optional implementation, the time-domain signal streams collected by the array elements can each be converted into time-frequency domain signals, and a sound source azimuth estimation technique is used to determine the sound source azimuth information for the at least one time frame from the time-frequency domain signals of the array elements.
In this implementation, take a target time frame among the at least one time frame as an example, where the target time frame may be any one of the at least one time frame. The time-domain signals collected by the array elements in the target time frame can be converted into time-frequency domain signals. For example, the time-domain signal may be decomposed into subbands to obtain the time-frequency domain signal; the subband decomposition may be implemented based on a short-time Fourier transform and/or a filter bank, among others, which is not limited here. Accordingly, the time-frequency domain signal corresponding to each array element in the target time frame can be obtained.
On this basis, sound source azimuth estimation can be performed on the time-frequency domain signals corresponding to the array elements in the target time frame, to output the sound source azimuth information for the target time frame. The sound source azimuth estimation techniques include, but are not limited to, the steered response power phase transform SRP-PHAT (Steered Response Power-Phase Transform), the generalized cross-correlation phase transform GCC-PHAT (Generalized Cross Correlation PHAse Transformation), and multiple signal classification MUSIC. The principle of sound source azimuth estimation may be as follows: according to the acoustic signals collected at the same time by different microphones in the microphone array, candidate azimuth ranges of the sound source are calculated separately, and the sound source azimuth is then estimated from the multiple azimuth ranges. Of course, this is merely exemplary, and this embodiment is not limited thereto. This embodiment does not limit the sound source azimuth estimation technique employed, and the processing details of the various techniques are not repeated here.
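As a rough sketch of this estimation step (not the embodiment's own implementation), the following computes a per-azimuth SRP-PHAT confidence for a single time frame from the short-time spectra of the array channels. The far-field plane-wave model, the planar array geometry, the uniform azimuth grid, and all names are assumptions of the example, and the sign convention for the inter-microphone delays depends on the chosen geometry:

    import numpy as np

    def srp_phat(frames, mic_pos, fs, n_az=360, c=343.0):
        # frames : (M, N) time-domain samples, one row per array element
        # mic_pos: (M, 2) microphone xy positions in metres
        # Returns an (n_az,) steered response power per candidate azimuth,
        # usable as the per-azimuth confidence for this time frame.
        M, N = frames.shape
        X = np.fft.rfft(frames, axis=1)                     # (M, F) spectra
        freqs = np.fft.rfftfreq(N, d=1.0 / fs)              # (F,)
        az = np.deg2rad(np.arange(n_az) * 360.0 / n_az)
        dirs = np.stack([np.cos(az), np.sin(az)], axis=1)   # unit vectors
        power = np.zeros(n_az)
        for i in range(M):
            for j in range(i + 1, M):
                cross = X[i] * np.conj(X[j])
                cross /= np.abs(cross) + 1e-12              # PHAT weighting
                # Expected inter-microphone delay per candidate azimuth
                tau = dirs @ (mic_pos[i] - mic_pos[j]) / c  # (n_az,)
                steer = np.exp(-2j * np.pi * np.outer(tau, freqs))
                power += np.real(steer @ cross)             # GCC-PHAT at tau
        return power

Applied frame by frame, the outputs of such a function, stacked in time order, would form the information stream of step 101.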
In addition, in this embodiment, other implementations may also be used to perform sound source azimuth estimation based on the acoustic signal stream; this embodiment is not limited to the implementation described above.
Accordingly, an information stream can be obtained that contains the sound source azimuth information for the at least one time frame.
Referring to FIG. 1 and FIG. 2, on this basis, in step 102, the information stream can be converted into visualization data describing the azimuth distribution state of the sound source.
In this embodiment, the sound source azimuth information may contain descriptive information in dimensions such as time and azimuth; on this basis, the sound source azimuth information can be converted into descriptive information of the azimuth distribution state. For example, if the sound source azimuth information contains the confidence that the sound source lies at each azimuth, the confidence can be converted into an adapted display brightness, and the display brightness at each azimuth then serves as the descriptive information of the azimuth distribution state. It is worth noting that nothing in the sound source azimuth information is lost during the conversion to visualization data; only the representation of the sound source azimuth information is changed, which ensures that the visualization data in this embodiment describes the azimuth distribution state of the sound source accurately and comprehensively.
The visualization data may be an azimuth distribution heat map of the sound source. Continuing the above example, after the confidences of the sound source at the various azimuths are converted into adapted display brightness values, the display brightness at each azimuth in each time frame is obtained, and the azimuth distribution heat map of the sound source is thereby determined from the three dimensions of time frame, azimuth, and display brightness.
Of course, in this embodiment the visualization data is not limited thereto; for example, the visualization data may also be a three-dimensional graph representing the sound source azimuth information for the at least one time frame. In practical applications, the display curves of the sound source azimuth information in FIG. 3 corresponding to the at least one time frame can be arranged in time order to obtain such a three-dimensional graph.
In this embodiment, the information stream can be converted into visualization data of various forms. During visualization, all of the sound source azimuth information can be fully retained, so that in step 102 the sound source tracking process shifts from the acoustic signal processing level to the visualization processing level.
In step 103, sound source tracking can then be performed according to the visualization data. In this embodiment, the number of sound sources is not limited; there may be one or more sound sources.
Accordingly, in this embodiment, the sound source within the at least one time frame can be tracked through visual analysis of the visualization data, thereby converting an acoustic signal processing problem into a visual analysis problem. Because the visualization data in this embodiment reflects the azimuth distribution state of the sound source accurately and comprehensively, the basis of the visual analysis is accurate and comprehensive, which avoids the robustness problem; moreover, during visual analysis, the field of view of the analysis can cover more time frames and is no longer limited to a single time frame, so noise within the field of view can be identified and noise interference avoided. This effectively avoids shortcomings of traditional acoustic signal processing, such as poor robustness and insufficient generalization ability.
Accordingly, in this embodiment, the information stream containing the sound source azimuth information can be converted into visualization data describing the azimuth distribution state of the sound source, and sound source tracking is performed based on the visualization data. This replaces the traditional approach of performing sound source tracking at the acoustic signal processing level with sound source tracking at the visual analysis level. Accordingly, the embodiments of the present application can effectively improve the accuracy of sound source tracking and the adaptability to various complex environments.
In the above or following embodiments, the information stream can be converted into an azimuth distribution heat map of the sound source for the at least one time frame, as the basis for tracking the sound source within the at least one time frame. The azimuth distribution heat map describes the distribution heat of the sound source at different azimuths in the at least one time frame.
In this embodiment, the sound source azimuth information contains the confidence that the sound source lies at each azimuth. On this basis, based on the correspondence between confidence and display brightness, the display brightness corresponding to each azimuth can be determined for each of the at least one time frame according to the confidence that the sound source lies at each azimuth in that time frame; the azimuth distribution heat map of the sound source for the at least one time frame is then generated from the display brightness values, with different display brightness representing different distribution heat.
In practical applications, the higher the confidence, the higher the corresponding display brightness may be, representing higher distribution heat. Of course, this embodiment is not limited thereto; a higher confidence could also correspond to a lower display brightness. Typically, however, the display brightness is proportional to the confidence, so that the display brightness accurately reflects the confidence.
FIG. 4 is a schematic diagram of an azimuth distribution heat map of a sound source provided by an exemplary embodiment of the present application. Referring to FIG. 4, the vertical axis of the heat map is the time frame and the horizontal axis is the azimuth. In FIG. 4, the number of time frames is 800, and the number of azimuths is configured as 120, representing the full circumferential space of the microphone array.
In an optional implementation, the image content corresponding to each of the at least one time frame can be determined from the display brightness corresponding to each azimuth in that time frame, and the image contents corresponding to the at least one time frame are then arranged in time order to generate the azimuth distribution heat map. Referring to FIG. 4, the sound source azimuth information for each time frame can be converted into one row of the heat map; for example, the sound source azimuth information of the 400th frame can be converted into the line y=400 in the heat map, and the display brightness of the pixels on this line corresponding to the azimuths is determined according to the corresponding confidences. Pixels with higher confidence are displayed brighter. For example, with reference to the schematic diagram of sound source azimuth information shown in FIG. 3, the peak position in FIG. 3 has the highest confidence; converted to the heat map, the display brightness at the azimuth corresponding to that peak is the brightest.
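A minimal sketch of this row-by-row construction follows, assuming the information stream has already been assembled into a (frames x azimuths) array of confidences and that brightness is simply proportional to confidence, scaled to 8-bit grey levels (the min-max normalisation is an assumption of the example):

    import numpy as np

    def information_stream_to_heatmap(info_stream):
        # info_stream: (T, n_az) array, one row of azimuth confidences
        # per time frame. Rows are stacked in time order, so the vertical
        # axis is the time frame and the horizontal axis the azimuth,
        # matching the layout of FIG. 4.
        img = np.asarray(info_stream, dtype=np.float64)
        img = img - img.min()
        img = img / (img.max() + 1e-12)        # normalise to [0, 1]
        return (img * 255.0).astype(np.uint8)  # higher confidence -> brighter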
Of course, in this embodiment, other implementations may also be used to generate the azimuth distribution heat map. For example, from the sound source azimuth information for the at least one time frame, the confidence that a sound source exists at a target azimuth in the different time frames can be obtained, so as to determine the display brightness corresponding to each time frame at the target azimuth and generate the image content corresponding to the target azimuth, where the target azimuth may be any one of the azimuths. In this way, the image content corresponding to each azimuth can be obtained, and the image contents corresponding to the azimuths can then be arranged in azimuth order to generate the azimuth distribution heat map. This embodiment does not limit the manner of generating the azimuth distribution heat map.
Accordingly, in this embodiment, the sound source azimuth information for the at least one time frame can be converted into an azimuth distribution heat map of the sound source. Moreover, the entire content of the sound source azimuth information is retained during the conversion, which provides an accurate analysis basis for the visual analysis process and thus ensures the accuracy of the tracking result.
In the above or following embodiments, sound source tracking can be performed by using a machine learning model together with the visualization data.
In this embodiment, whatever the form of the visualization data, a machine learning model can be used to perform visual analysis on it for sound source tracking. In practical applications, different types of machine learning models can be selected for different forms of visualization data, and a model training method adapted to the data form can be used to improve the performance of the machine learning model.
The following again takes the heat map as an example to describe the visual analysis process.
In this embodiment, in the machine learning model, the image features in the azimuth distribution heat map can be extracted, and the target sound source attribute parameters for the at least one time frame are determined based on the mapping relationship between image features and sound source attribute parameters and the image features extracted from the azimuth distribution heat map, so as to perform sound source tracking.
The mapping relationship between image features and sound source attribute parameters can be configured into the machine learning model through model training.
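By way of illustration, the following is one possible form of such a model: a small convolutional network that maps an azimuth distribution heat map to per-azimuth source activity. The architecture, the output head, and the choice of PyTorch are all assumptions of the sketch; the embodiment only requires a learned mapping from image features to sound source attribute parameters.

    import torch
    import torch.nn as nn

    class HeatmapTracker(nn.Module):
        # Maps a (batch, 1, T, n_az) heat map to per-azimuth source
        # activity in [0, 1]; peaks indicate tracked source azimuths.
        def __init__(self, n_az=120):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d((1, n_az // 2)),  # pool out time axis
            )
            self.head = nn.Linear(32 * (n_az // 2), n_az)

        def forward(self, heatmap):
            f = self.features(heatmap).flatten(1)
            return torch.sigmoid(self.head(f))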
An exemplary model training process may be as follows:
obtaining sample heat maps corresponding to several sample time frame groups; annotating each sample heat map with sound source attribute parameters, to obtain the annotation information corresponding to each sample heat map; and inputting each sample heat map and its corresponding annotation information into the machine learning model, so that the machine learning model learns the mapping relationship between image features and sound source attribute parameters.
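A correspondingly minimal training-loop sketch is shown below; the binary cross-entropy loss, the Adam optimiser, and the per-azimuth activity labels are assumptions chosen to match the illustrative model above, not requirements of the embodiment:

    import torch
    import torch.nn as nn

    def train_tracker(model, sample_heatmaps, labels, epochs=10, lr=1e-3):
        # sample_heatmaps: (B, 1, T, n_az) tensor of sample heat maps
        # labels: (B, n_az) tensor of annotated per-azimuth activity in [0, 1]
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        loss_fn = nn.BCELoss()
        model.train()
        for _ in range(epochs):
            opt.zero_grad()
            loss = loss_fn(model(sample_heatmaps), labels)
            loss.backward()
            opt.step()
        return model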
The number of time frames in a sample time frame group may be consistent with the number of the at least one time frame over which acoustic signals are collected in step 100. That is, the processing unit during model training and during model use can be kept consistent. In this way, during model training, annotation is performed per sample time frame group, and during model use the tracking result is output per group of at least one time frame of the same size.
In this embodiment, the sound source attribute parameters include one or more of azimuth, number, sounding duration, and covered time frames. Accordingly, after the visual analysis, the machine learning model can output information such as the number of sound sources, their azimuths, their sounding durations, and the time frames they cover within the at least one time frame as the tracking result.
Where a machine learning model is introduced, in this embodiment, the step of converting the information stream into visualization data describing the azimuth distribution state of the sound source can be performed either inside or outside the machine learning model.
In one possible implementation, during model use, the conversion of the information stream into visualization data describing the azimuth distribution state of the sound source can be performed outside the machine learning model, and the visualization data is used as the input parameter of the machine learning model. Correspondingly, during model training, the information streams corresponding to the sample time frame groups can be converted into sample heat maps in advance, as the basis for model training.
In another possible implementation, during model use, the information stream can be input into the machine learning model, and in the machine learning model the information stream is converted into visualization data describing the azimuth distribution state of the sound source.
In this implementation, a functional module that converts the information stream into visualization data describing the azimuth distribution state of the sound source can be configured in the machine learning model, so that the information stream can be used as the input parameter of the machine learning model. Upon receiving the information stream, the machine learning model converts it into visualization data describing the azimuth distribution state of the sound source and then performs the visual analysis.
Correspondingly, the model training process will differ slightly from that of the previous implementation. In the model training process of this implementation, sample information streams corresponding to several sample time frame groups can be obtained; each sample information stream is annotated with sound source attribute parameters, to obtain the annotation information corresponding to each sample information stream; and each sample information stream and its corresponding annotation information are input into the machine learning model, so that the machine learning model converts each sample information stream into visualization data describing the azimuth distribution state of the sound source and learns the mapping relationship between image features and sound source attribute parameters.
Accordingly, in this embodiment, after the machine learning model is trained with a sufficient amount of sample data, it learns an accurate mapping relationship between image features and sound source attribute parameters. The trained machine learning model can thus be used to perform visual analysis on the visualization data and output the sound source attribute information for the at least one time frame, so that one or more sounding sources are tracked based on the sound source attribute information. This sound source tracking method can exclude various noise interference during tracking, requires no separate operation of locating the sound onset point, and avoids the deficiencies of other acoustic-signal-processing approaches. It can therefore effectively improve the accuracy of the tracking result and the adaptability to various complex environments.
It should be noted that the steps of the methods provided in the foregoing embodiments may all be executed by the same device, or the methods may be executed by different devices. For example, steps 101 to 103 may be executed by device A; alternatively, steps 101 and 102 may be executed by device A and step 103 by device B; and so on.
In addition, some of the processes described in the foregoing embodiments and drawings contain multiple operations appearing in a specific order, but it should be clearly understood that these operations may be executed out of the order in which they appear herein or in parallel. Sequence numbers of the operations, such as 101 and 102, are merely used to distinguish different operations; the sequence numbers themselves do not represent any execution order. In addition, these processes may include more or fewer operations, and these operations may be executed sequentially or in parallel.
FIG. 5 is a schematic structural diagram of a sound source tracking apparatus provided by an exemplary embodiment of the present application. Referring to FIG. 5, the sound source tracking apparatus includes:
an acquisition module 50, configured to acquire an acoustic signal stream collected by a microphone array in at least one time frame;
a calculation module 51, configured to perform sound source azimuth estimation based on the acoustic signal stream, to obtain an information stream containing sound source azimuth information for the at least one time frame;
a conversion module 52, configured to convert the information stream into visualization data describing the azimuth distribution state of the sound source; and
a tracking module 53, configured to perform sound source tracking according to the visualization data.
In an optional embodiment, when converting the information stream into visualization data describing the azimuth distribution state of the sound source, the conversion module 52 is configured to:
convert the information stream into an azimuth distribution heat map of the sound source for the at least one time frame, where the azimuth distribution heat map describes the distribution heat of the sound source at different azimuths in the at least one time frame.
In an optional embodiment, the sound source azimuth information contains the confidence that the sound source lies at each azimuth; when converting the information stream into the azimuth distribution heat map of the sound source for the at least one time frame, the conversion module 52 is configured to:
based on the correspondence between confidence and display brightness, determine the display brightness corresponding to each azimuth for each of the at least one time frame according to the confidence that the sound source lies at each azimuth in that time frame, where different display brightness represents different distribution heat; and
generate the azimuth distribution heat map of the sound source for the at least one time frame according to the display brightness.
In an optional embodiment, when generating the azimuth distribution heat map of the sound source for the at least one time frame according to the display brightness, the conversion module 52 is configured to:
determine the image content corresponding to each of the at least one time frame according to the display brightness corresponding to each azimuth in that time frame; and
arrange the image contents corresponding to the at least one time frame in time order, to generate the azimuth distribution heat map.
In an optional embodiment, when performing sound source tracking according to the visualization data, the tracking module 53 is configured to:
perform sound source tracking by using a machine learning model and the visualization data.
In an optional embodiment, if the visualization data is an azimuth distribution heat map of the sound source for the at least one time frame, then when performing sound source tracking by using the machine learning model and the visualization data, the tracking module 53 is configured to:
in the machine learning model, extract the image features in the azimuth distribution heat map; and
determine the target sound source attribute parameters for the at least one time frame based on the mapping relationship between image features and sound source attribute parameters and the image features extracted from the azimuth distribution heat map, so as to perform sound source tracking.
In an optional embodiment, the sound source attribute parameters include one or more of azimuth, number, sounding duration, and covered time frames.
In an optional embodiment, the tracking module 53 is further configured to:
obtain sample heat maps corresponding to several sample time frame groups, where a sample heat map describes the distribution heat of the sound source at different azimuths in the sample time frames;
annotate each sample heat map with sound source attribute parameters, to obtain the annotation information corresponding to each sample heat map; and
input each sample heat map and its corresponding annotation information into the machine learning model, so that the machine learning model learns the mapping relationship between image features and sound source attribute parameters.
In an optional embodiment, when converting the information flow into visualization data describing the azimuth distribution state of the sound source, the tracking module 53 is configured to:
input the information flow into the machine learning model; and
convert, in the machine learning model, the information flow into visualization data describing the azimuth distribution state of the sound source.
In an optional embodiment, the tracking module 53 is further configured to:
acquire sample information flows corresponding to several sample time frame groups;
label each sample information flow with sound source attribute parameters to obtain labeling information corresponding to each sample information flow; and
input each sample information flow and its corresponding labeling information into the machine learning model, so that the machine learning model converts each sample information flow into visualization data describing the azimuth distribution state of the sound source and learns the mapping relationship between image features and sound source attribute parameters.
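For the variant above, in which the conversion happens inside the model, one possible arrangement is a fixed, non-learned rendering step that turns the information flow into image-like visual data before the learned feature extractor; treating the rendering as a simple clamp-and-reshape is an assumption of this sketch.

```python
import torch
import torch.nn as nn

class EndToEndTracker(nn.Module):
    """Toy variant: the information flow (per-frame azimuth confidences)
    is rendered into visual data inside the model, after which image
    features are extracted and mapped to attribute parameters."""
    def __init__(self, n_outputs=2):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 8)), nn.Flatten(),
            nn.Linear(8 * 4 * 8, n_outputs),
        )

    def forward(self, info_flow):
        # info_flow: (batch, n_frames, n_azimuths) confidences in [0, 1].
        # Rendering step: clamp to the brightness range and add a channel
        # axis, yielding the visual data consumed by the feature extractor.
        image = info_flow.clamp(0.0, 1.0).unsqueeze(1)
        return self.cnn(image)

model = EndToEndTracker()
print(model(torch.rand(2, 50, 360)).shape)  # torch.Size([2, 2])
```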
In an optional embodiment, the acoustic signal flow contains the time-domain signal flows collected by the array elements of the microphone array; when performing sound source azimuth estimation based on the acoustic signal flow to obtain the information flow containing the sound source azimuth information under the at least one time frame, the calculation module 51 is configured to:
convert the time-domain signal flows collected by the array elements into time-frequency domain signals respectively; and
determine, by using a sound source azimuth estimation technique, the sound source azimuth information under the at least one time frame according to the time-frequency domain signals of the array elements.
In an optional embodiment, the sound source azimuth estimation technique includes one or more of the steered response power phase transform technique SRP-PHAT, the generalized cross-correlation phase transform technique GCC-PHAT, or the multiple signal classification technique MUSIC.
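As a non-limiting numerical sketch of these two steps for a two-element array, the code below converts each element's time-domain stream into time-frequency domain frames with a short-time Fourier transform and applies GCC-PHAT to estimate the inter-element lag, from which an azimuth follows under a far-field assumption; the sampling rate, element spacing, frame parameters, and the synthetic two-sample delay are assumptions of the sketch.

```python
import numpy as np

def stft(x, frame_len=512, hop=256):
    """Convert a time-domain signal flow into time-frequency domain frames."""
    frames = [x[i:i + frame_len] * np.hanning(frame_len)
              for i in range(0, len(x) - frame_len + 1, hop)]
    return np.fft.rfft(np.asarray(frames), axis=1)

def gcc_phat_lag(X1, X2):
    """GCC-PHAT: whiten the cross-spectrum by its magnitude; the peak of
    its inverse transform is the lag of stream 2 relative to stream 1."""
    cross = np.conj(X1) * X2
    r = np.fft.irfft(cross / (np.abs(cross) + 1e-12), axis=1)
    n = r.shape[1]
    lags = np.argmax(np.abs(r), axis=1)
    return np.where(lags > n // 2, lags - n, lags)  # signed lag per frame

# Assumed setup: 16 kHz sampling, two array elements 5 cm apart, and a
# source whose wavefront reaches element 2 two samples after element 1.
fs, d, c = 16000, 0.05, 343.0
sig = np.random.default_rng(1).standard_normal(fs)
lag = np.median(gcc_phat_lag(stft(sig), stft(np.roll(sig, 2))))
tau = lag / fs                                      # inter-element delay, s
azimuth = np.degrees(np.arcsin(np.clip(tau * c / d, -1.0, 1.0)))
print(round(float(azimuth), 1))                     # approximately 59.0
```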
It is worth noting that, for the technical details of the above embodiments of the sound source tracking apparatus, reference may be made to the relevant descriptions in the foregoing embodiments of the sound source tracking method; to save space, they are not repeated here, but this should not be construed as a loss of the protection scope of the present application.
FIG. 6 is a schematic structural diagram of a computing device provided by another exemplary embodiment of the present application. As shown in FIG. 6, the computing device includes a memory 60 and a processor 61.
The processor 61, coupled to the memory 60, is configured to execute the computer program in the memory 60 to:
acquire the acoustic signal flow collected by the microphone array under at least one time frame;
perform sound source azimuth estimation based on the acoustic signal flow to obtain an information flow containing the sound source azimuth information under the at least one time frame;
convert the information flow into visualization data describing the azimuth distribution state of the sound source; and
perform sound source tracking according to the visualization data.
In an optional embodiment, when converting the information flow into visualization data describing the azimuth distribution state of the sound source, the processor 61 is configured to:
convert the information flow into an azimuth distribution heat map of the sound source under the at least one time frame, wherein the azimuth distribution heat map describes the distribution heat of the sound source in different azimuths under the at least one time frame.
In an optional embodiment, the sound source azimuth information contains the confidence that the sound source is located at each azimuth; when converting the information flow into the azimuth distribution heat map of the sound source under the at least one time frame, the processor 61 is configured to:
determine, based on the correspondence between confidence and display brightness and according to the confidence that the sound source is located at each azimuth under the at least one time frame, the display brightness corresponding to each azimuth under the at least one time frame, wherein different display brightnesses represent different distribution heats; and
generate the azimuth distribution heat map of the sound source under the at least one time frame according to the display brightness.
In an optional embodiment, when generating the azimuth distribution heat map of the sound source under the at least one time frame according to the display brightness, the processor 61 is configured to:
determine, according to the display brightness corresponding to each azimuth under the at least one time frame, the image content corresponding to each of the at least one time frame; and
arrange the image contents corresponding to the at least one time frame in sequence according to the chronological order of the at least one time frame, so as to generate the azimuth distribution heat map.
In an optional embodiment, when performing sound source tracking according to the visualization data, the processor 61 is configured to:
perform sound source tracking by using a machine learning model and the visualization data.
In an optional embodiment, if the visualization data is the azimuth distribution heat map of the sound source under the at least one time frame, the processor 61, when performing sound source tracking by using the machine learning model and the visualization data, is configured to:
extract, in the machine learning model, image features from the azimuth distribution heat map; and
determine, based on the mapping relationship between image features and sound source attribute parameters and the image features extracted from the azimuth distribution heat map, target sound source attribute parameters under the at least one time frame, so as to perform sound source tracking.
In an optional embodiment, the sound source attribute parameters include one or more of azimuth, quantity, sounding duration, and covered time frames.
In an optional embodiment, the processor 61 is further configured to:
acquire sample heat maps corresponding to several sample time frame groups, wherein a sample heat map describes the distribution heat of the sound source in different azimuths under the sample time frames;
label each sample heat map with sound source attribute parameters to obtain labeling information corresponding to each sample heat map; and
input each sample heat map and its corresponding labeling information into the machine learning model, so that the machine learning model learns the mapping relationship between image features and sound source attribute parameters.
In an optional embodiment, when converting the information flow into visualization data describing the azimuth distribution state of the sound source, the processor 61 is configured to:
input the information flow into the machine learning model; and
convert, in the machine learning model, the information flow into visualization data describing the azimuth distribution state of the sound source.
In an optional embodiment, the processor 61 is further configured to:
acquire sample information flows corresponding to several sample time frame groups;
label each sample information flow with sound source attribute parameters to obtain labeling information corresponding to each sample information flow; and
input each sample information flow and its corresponding labeling information into the machine learning model, so that the machine learning model converts each sample information flow into visualization data describing the azimuth distribution state of the sound source and learns the mapping relationship between image features and sound source attribute parameters.
In an optional embodiment, the acoustic signal flow contains the time-domain signal flows collected by the array elements of the microphone array; when performing sound source azimuth estimation based on the acoustic signal flow to obtain the information flow containing the sound source azimuth information under the at least one time frame, the processor 61 is configured to:
convert the time-domain signal flows collected by the array elements into time-frequency domain signals respectively; and
determine, by using a sound source azimuth estimation technique, the sound source azimuth information under the at least one time frame according to the time-frequency domain signals of the array elements.
In an optional embodiment, the sound source azimuth estimation technique includes one or more of the steered response power phase transform technique SRP-PHAT, the generalized cross-correlation phase transform technique GCC-PHAT, or the multiple signal classification technique MUSIC.
It is worth noting that, for the technical details of the above embodiments of the computing device, reference may be made to the relevant descriptions in the foregoing embodiments of the sound source tracking method; to save space, they are not repeated here, but this should not be construed as a loss of the protection scope of the present application.
Further, as shown in FIG. 6, the computing device further includes other components such as a microphone array 62, a communication component 63, and a power supply component 64. Only some components are schematically shown in FIG. 6, which does not mean that the computing device includes only the components shown in FIG. 6.
FIG. 7 is a flowchart of another sound source tracking method provided by an exemplary embodiment of the present application. The sound source tracking method provided in this embodiment may be performed by a sound source tracking apparatus; the sound source tracking apparatus may be implemented as software or as a combination of software and hardware, and may be integrated in a computing device. As shown in FIG. 7, the method includes:
Step 700: determine sound source azimuth information under at least one time frame within a target period, respectively;
Step 701: convert the sound source azimuth information under the at least one time frame into at least one set of image data describing the azimuth distribution state of the sound source, so as to form an image stream;
Step 702: perform image recognition on the image stream by using an image recognition model, so as to perform sound source tracking within the target period.
For step 700, reference may be made to the relevant description in the embodiment associated with FIG. 1. Step 700 may further include acquiring the acoustic signal flow collected by the microphone array under the at least one time frame, and performing sound source azimuth estimation based on the acoustic signal flow to obtain the sound source azimuth information under the at least one time frame. To save space, the specific process is not repeated here.
In step 701, the sound source azimuth information under the at least one time frame may be converted into at least one set of image data. The image data may be an azimuth distribution heat map; in that case, at least one azimuth distribution heat map is obtained in step 701 to form the image stream input to the image recognition model.
Based on step 701, the sound source tracking solution provided in this embodiment can be applied to scenarios such as real-time tracking and offline tracking. In an offline tracking scenario, the sound source azimuth information under the at least one time frame can be obtained all at once and grouped according to the recognition precision, so that the operation of converting the azimuth information into image data is performed group by group.
In an online tracking scenario, a target time frame within the current recognition period may be determined from the at least one time frame based on a preset recognition precision;
the sound source azimuth information under each target time frame is converted into a set of image data describing the azimuth distribution state of the sound source within the current recognition period; and
the time frames and image data within the next recognition period of the target period continue to be determined, until the image data corresponding to all recognition periods within the target period are generated.
In the online tracking scenario, as acoustic signals are continuously generated, the operation of converting the azimuth information into image data can be performed successively in each recognition period, and the resulting image data are successively input to the image recognition model in step 702.
For example, if the recognition precision is 1 s, one azimuth distribution heat map can be generated based on the sound source azimuth information of the N time frames within the current recognition period (1 s); the azimuth distribution heat map of the next recognition period (1 s) is then generated, and so on for each subsequent recognition period, with the heat maps provided to the image recognition model in a streaming manner, as sketched below.
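A minimal sketch of this streaming grouping is given below; the assumed frame rate, the one-second recognition period, and the recognize callback are placeholders for illustration and are not prescribed by the embodiments.

```python
import numpy as np

FRAME_RATE = 100       # assumed azimuth-information frames per second
RECOGNITION_PERIOD_S = 1.0
FRAMES_PER_MAP = int(FRAME_RATE * RECOGNITION_PERIOD_S)

def online_tracking(frame_stream, recognize):
    """Group streamed per-frame azimuth confidences into one heat map per
    recognition period and hand each map to the recognition model in turn."""
    buffer = []
    for confidences in frame_stream:       # one (n_azimuths,) vector per frame
        buffer.append(confidences)
        if len(buffer) == FRAMES_PER_MAP:  # current recognition period is full
            heatmap = (np.clip(np.stack(buffer), 0, 1) * 255).astype(np.uint8)
            recognize(heatmap)             # e.g. the image recognition model
            buffer.clear()                 # continue with the next period

# Example over three seconds of synthetic frames and a stub model.
rng = np.random.default_rng(2)
stream = (rng.random(360) for _ in range(3 * FRAMES_PER_MAP))
online_tracking(stream, lambda hm: print("heat map", hm.shape))
```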
In this embodiment, the image recognition model can be trained in advance. The image recognition model may be a machine learning model; for its training process, reference may be made to the relevant description in the embodiment associated with FIG. 1.
It is worth noting that the recognition precision remains consistent between the training phase and the application phase of the image recognition model.
For the technical details of the above embodiments of the sound source tracking method, reference may be made to the relevant descriptions in the embodiments of the sound source tracking method associated with FIG. 1; to save space, they are not repeated here, but this should not be construed as a loss of the protection scope of the present application.
FIG. 8 is a schematic structural diagram of another sound source tracking apparatus provided by an exemplary embodiment of the present application. Referring to FIG. 8, the sound source tracking apparatus includes:
a determination module 80, configured to determine sound source azimuth information under at least one time frame within a target period, respectively;
a conversion module 81, configured to convert the sound source azimuth information under the at least one time frame into at least one set of image data describing the azimuth distribution state of the sound source, so as to form an image stream; and
a tracking module 82, configured to perform image recognition on the image stream by using an image recognition model, so as to perform sound source tracking within the target period.
In an optional embodiment, when converting the sound source azimuth information under the at least one time frame into at least one set of image data describing the azimuth distribution state of the sound source to form the image stream, the conversion module 81 is configured to:
determine, based on a preset recognition precision, a target time frame within the current recognition period from the at least one time frame;
convert the sound source azimuth information under each target time frame into a set of image data describing the azimuth distribution state of the sound source within the current recognition period; and
continue to determine the time frames and image data within the next recognition period of the target period, until the image data corresponding to all recognition periods within the target period are generated, so as to form the image stream.
In an optional embodiment, the determination module 80 includes an acquisition module 83 and a calculation module 84;
the acquisition module 83 is configured to acquire the acoustic signal flow collected by the microphone array under the at least one time frame; and
the calculation module 84 is configured to perform sound source azimuth estimation based on the acoustic signal flow to obtain the sound source azimuth information under the at least one time frame.
In an optional embodiment, when converting the sound source azimuth information under each target time frame into a set of image data describing the azimuth distribution state of the sound source within the current recognition period, the conversion module 81 is configured to:
convert the sound source azimuth information under each target time frame into an azimuth distribution heat map of the sound source under the at least one time frame, wherein the azimuth distribution heat map describes the distribution heat of the sound source in different azimuths under the at least one time frame.
In an optional embodiment, the sound source azimuth information contains the confidence that the sound source is located at each azimuth; when converting the sound source azimuth information under each target time frame into the azimuth distribution heat map of the sound source under the at least one time frame, the conversion module 81 is configured to:
determine, based on the correspondence between confidence and display brightness and according to the confidence that the sound source is located at each azimuth under the at least one time frame, the display brightness corresponding to each azimuth under the at least one time frame, wherein different display brightnesses represent different distribution heats; and
generate the azimuth distribution heat map of the sound source under the at least one time frame according to the display brightness.
In an optional embodiment, when generating the azimuth distribution heat map of the sound source under the at least one time frame according to the display brightness, the conversion module 81 is configured to:
determine, according to the display brightness corresponding to each azimuth under the at least one time frame, the image content corresponding to each of the at least one time frame; and
arrange the image contents corresponding to the at least one time frame in sequence according to the chronological order of the at least one time frame, so as to generate the azimuth distribution heat map.
In an optional embodiment, if the image data is the azimuth distribution heat map of the sound source under the at least one time frame, the tracking module 82, when performing image recognition on the image stream by using the image recognition model to perform sound source tracking within the target period, is configured to:
extract, in the image recognition model, image features from the azimuth distribution heat map; and
determine, based on the mapping relationship between image features and sound source attribute parameters and the image features extracted from the azimuth distribution heat map, target sound source attribute parameters under the at least one time frame, so as to perform sound source tracking.
In an optional embodiment, the sound source attribute parameters include one or more of azimuth, quantity, sounding duration, and covered time frames.
In an optional embodiment, the tracking module 82 is further configured to:
acquire sample heat maps corresponding to several sample time frame groups, wherein a sample heat map describes the distribution heat of the sound source in different azimuths under the sample time frames;
label each sample heat map with sound source attribute parameters to obtain labeling information corresponding to each sample heat map; and
input each sample heat map and its corresponding labeling information into the image recognition model, so that the image recognition model learns the mapping relationship between image features and sound source attribute parameters.
In an optional embodiment, when converting the sound source azimuth information under each target time frame into the azimuth distribution heat map of the sound source under the at least one time frame, the tracking module 82 is configured to:
input the information flow into the image recognition model; and
convert, in the image recognition model, the sound source azimuth information under each target time frame into the azimuth distribution heat map of the sound source under the at least one time frame.
In an optional embodiment, the tracking module 82 is further configured to:
acquire sample information flows corresponding to several sample time frame groups;
label each sample information flow with sound source attribute parameters to obtain labeling information corresponding to each sample information flow; and
input each sample information flow and its corresponding labeling information into the image recognition model, so that the image recognition model converts each sample information flow into visualization data describing the azimuth distribution state of the sound source and learns the mapping relationship between image features and sound source attribute parameters.
In an optional embodiment, the acoustic signal flow contains the time-domain signal flows collected by the array elements of the microphone array; when performing sound source azimuth estimation based on the acoustic signal flow to obtain the sound source azimuth information under the at least one time frame, the calculation module 84 is configured to:
convert the time-domain signal flows collected by the array elements into time-frequency domain signals respectively; and
determine, by using a sound source azimuth estimation technique, the sound source azimuth information under the at least one time frame according to the time-frequency domain signals of the array elements.
In an optional embodiment, the sound source azimuth estimation technique includes one or more of the steered response power phase transform technique SRP-PHAT, the generalized cross-correlation phase transform technique GCC-PHAT, or the multiple signal classification technique MUSIC.
It is worth noting that, for the technical details of the above embodiments of the sound source tracking apparatus, reference may be made to the relevant descriptions in the embodiments of the sound source tracking method associated with FIG. 1 and FIG. 7; to save space, they are not repeated here, but this should not be construed as a loss of the protection scope of the present application.
FIG. 9 is a schematic structural diagram of another computing device provided by an exemplary embodiment of the present application. Referring to FIG. 9, the computing device includes a memory 90 and a processor 91.
The processor 91, coupled to the memory 90, is configured to execute the computer program in the memory 90 to:
determine sound source azimuth information under at least one time frame within a target period, respectively;
convert the sound source azimuth information under the at least one time frame into at least one set of image data describing the azimuth distribution state of the sound source, so as to form an image stream; and
perform image recognition on the image stream by using an image recognition model, so as to perform sound source tracking within the target period.
In an optional embodiment, when converting the sound source azimuth information under the at least one time frame into at least one set of image data describing the azimuth distribution state of the sound source to form the image stream, the processor 91 is configured to:
determine, based on a preset recognition precision, a target time frame within the current recognition period from the at least one time frame;
convert the sound source azimuth information under each target time frame into a set of image data describing the azimuth distribution state of the sound source within the current recognition period; and
continue to determine the time frames and image data within the next recognition period of the target period, until the image data corresponding to all recognition periods within the target period are generated, so as to form the image stream.
In an optional embodiment, when determining the sound source azimuth information under the at least one time frame within the target period, the processor 91 is configured to:
acquire the acoustic signal flow collected by the microphone array under the at least one time frame; and
perform sound source azimuth estimation based on the acoustic signal flow to obtain the sound source azimuth information under the at least one time frame.
In an optional embodiment, when converting the sound source azimuth information under each target time frame into a set of image data describing the azimuth distribution state of the sound source within the current recognition period, the processor 91 is configured to:
convert the sound source azimuth information under each target time frame into an azimuth distribution heat map of the sound source under the at least one time frame, wherein the azimuth distribution heat map describes the distribution heat of the sound source in different azimuths under the at least one time frame.
In an optional embodiment, the sound source azimuth information contains the confidence that the sound source is located at each azimuth; when converting the sound source azimuth information under each target time frame into the azimuth distribution heat map of the sound source under the at least one time frame, the processor 91 is configured to:
determine, based on the correspondence between confidence and display brightness and according to the confidence that the sound source is located at each azimuth under the at least one time frame, the display brightness corresponding to each azimuth under the at least one time frame, wherein different display brightnesses represent different distribution heats; and
generate the azimuth distribution heat map of the sound source under the at least one time frame according to the display brightness.
In an optional embodiment, when generating the azimuth distribution heat map of the sound source under the at least one time frame according to the display brightness, the processor 91 is configured to:
determine, according to the display brightness corresponding to each azimuth under the at least one time frame, the image content corresponding to each of the at least one time frame; and
arrange the image contents corresponding to the at least one time frame in sequence according to the chronological order of the at least one time frame, so as to generate the azimuth distribution heat map.
In an optional embodiment, if the image data is the azimuth distribution heat map of the sound source under the at least one time frame, the processor 91, when performing image recognition on the image stream by using the image recognition model to perform sound source tracking within the target period, is configured to:
extract, in the image recognition model, image features from the azimuth distribution heat map; and
determine, based on the mapping relationship between image features and sound source attribute parameters and the image features extracted from the azimuth distribution heat map, target sound source attribute parameters under the at least one time frame, so as to perform sound source tracking.
In an optional embodiment, the sound source attribute parameters include one or more of azimuth, quantity, sounding duration, and covered time frames.
In an optional embodiment, the processor 91 is further configured to:
acquire sample heat maps corresponding to several sample time frame groups, wherein a sample heat map describes the distribution heat of the sound source in different azimuths under the sample time frames;
label each sample heat map with sound source attribute parameters to obtain labeling information corresponding to each sample heat map; and
input each sample heat map and its corresponding labeling information into the image recognition model, so that the image recognition model learns the mapping relationship between image features and sound source attribute parameters.
In an optional embodiment, when converting the sound source azimuth information under each target time frame into the azimuth distribution heat map of the sound source under the at least one time frame, the processor 91 is configured to:
input the information flow into the image recognition model; and
convert, in the image recognition model, the sound source azimuth information under each target time frame into the azimuth distribution heat map of the sound source under the at least one time frame.
In an optional embodiment, the processor 91 is further configured to:
acquire sample information flows corresponding to several sample time frame groups;
label each sample information flow with sound source attribute parameters to obtain labeling information corresponding to each sample information flow; and
input each sample information flow and its corresponding labeling information into the image recognition model, so that the image recognition model converts each sample information flow into visualization data describing the azimuth distribution state of the sound source and learns the mapping relationship between image features and sound source attribute parameters.
In an optional embodiment, the acoustic signal flow contains the time-domain signal flows collected by the array elements of the microphone array; when performing sound source azimuth estimation based on the acoustic signal flow to obtain the sound source azimuth information under the at least one time frame, the processor 91 is configured to:
convert the time-domain signal flows collected by the array elements into time-frequency domain signals respectively; and
determine, by using a sound source azimuth estimation technique, the sound source azimuth information under the at least one time frame according to the time-frequency domain signals of the array elements.
In an optional embodiment, the sound source azimuth estimation technique includes one or more of the steered response power phase transform technique SRP-PHAT, the generalized cross-correlation phase transform technique GCC-PHAT, or the multiple signal classification technique MUSIC.
It is worth noting that, for the technical details of the above embodiments of the computing device, reference may be made to the relevant descriptions in the embodiments of the sound source tracking method associated with FIG. 1 and FIG. 7; to save space, they are not repeated here, but this should not be construed as a loss of the protection scope of the present application.
Further, as shown in FIG. 9, the computing device further includes other components such as a microphone array 92, a communication component 93, and a power supply component 94. Only some components are schematically shown in FIG. 9, which does not mean that the computing device includes only the components shown in FIG. 9.
Correspondingly, an embodiment of the present application further provides a computer-readable storage medium storing a computer program; when the computer program is executed, the steps that can be performed by the computing device in the foregoing method embodiments can be implemented.
FIG. 10 is a schematic structural diagram of a sound source tracking system provided by an exemplary embodiment of the present application. Referring to FIG. 10, the sound source tracking system may include a microphone array 10 and a computing device 20, which are communicatively connected.
The sound source tracking system provided in this embodiment can be applied to various scenarios, for example, voice control scenarios, audio and video conference scenarios, or other scenarios requiring sound source tracking; the application scenario is not limited in this embodiment. In different application scenarios, the sound source tracking system can be integrated and deployed in a wide variety of scenario devices; for example, in a voice control scenario it can be deployed in smart speakers or smart robots, and in an audio and video conference scenario it can be deployed in various conference terminals.
The microphone array 10 can be used to collect acoustic signals. In this embodiment, neither the number nor the arrangement of the array elements of the microphone array 10 is limited.
For the technical details of the computing device, reference may be made to the relevant descriptions in the embodiments associated with FIG. 6 and FIG. 9; to save space, they are not repeated here, but this should not be construed as a loss of the protection scope of the present application.
The memories in FIG. 6 and FIG. 9 above are configured to store computer programs and may be configured to store various other data to support operations on the computing platform. Examples of such data include instructions of any application program or method operating on the computing platform, contact data, phonebook data, messages, pictures, videos, and the like. The memory may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as a static random access memory (SRAM), an electrically erasable programmable read-only memory (EEPROM), an erasable programmable read-only memory (EPROM), a programmable read-only memory (PROM), a read-only memory (ROM), a magnetic memory, a flash memory, a magnetic disk, or an optical disc.
The communication components in FIG. 6 and FIG. 9 above are configured to facilitate wired or wireless communication between the device where the communication component is located and other devices. The device where the communication component is located can access a wireless network based on a communication standard, such as WiFi, a mobile communication network such as 2G, 3G, 4G/LTE or 5G, or a combination thereof. In an exemplary embodiment, the communication component receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component further includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
The power supply components in FIG. 6 and FIG. 9 above provide power for the various components of the device where the power supply component is located. The power supply component may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device where the power supply component is located.
Those skilled in the art should understand that the embodiments of the present application may be provided as a method, a system, or a computer program product. Therefore, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, a magnetic disk memory, a CD-ROM, an optical memory, and the like) containing computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of the method, the device (system), and the computer program product according to the embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or the other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or another programmable data processing device to work in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data processing device, so that a series of operation steps are performed on the computer or the other programmable device to produce computer-implemented processing, and the instructions executed on the computer or the other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
In a typical configuration, a computing device includes one or more processors (CPUs), an input/output interface, a network interface, and a memory.
The memory may include a non-persistent memory, a random access memory (RAM), and/or a non-volatile memory in computer-readable media, such as a read-only memory (ROM) or a flash memory (flash RAM). The memory is an example of a computer-readable medium.
Computer-readable media include persistent and non-persistent, removable and non-removable media, and information storage may be implemented by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, a phase-change memory (PRAM), a static random access memory (SRAM), a dynamic random access memory (DRAM), other types of random access memory (RAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a flash memory or other memory technology, a compact disc read-only memory (CD-ROM), a digital versatile disc (DVD) or other optical storage, a magnetic cassette, a magnetic tape or magnetic disk storage or other magnetic storage device, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media, such as modulated data signals and carrier waves.
It should also be noted that the terms "comprise", "include", or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the existence of other identical elements in the process, method, article, or device that includes the element.
The above descriptions are merely embodiments of the present application and are not intended to limit the present application. For those skilled in the art, various modifications and variations of the present application are possible. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present application shall fall within the protection scope of the present application.

Claims (20)

  1. A sound source tracking method, characterized by comprising:
    acquiring an acoustic signal flow collected by a microphone array under at least one time frame;
    performing sound source azimuth estimation based on the acoustic signal flow to obtain an information flow containing sound source azimuth information under the at least one time frame;
    converting the information flow into visualization data describing an azimuth distribution state of the sound source; and
    performing sound source tracking according to the visualization data.
  2. The method according to claim 1, characterized in that the converting the information flow into visualization data describing the azimuth distribution state of the sound source comprises:
    converting the information flow into an azimuth distribution heat map of the sound source under the at least one time frame, wherein the azimuth distribution heat map describes the distribution heat of the sound source in different azimuths under the at least one time frame.
  3. The method according to claim 2, characterized in that the sound source azimuth information contains the confidence that the sound source is located at each azimuth, and the converting the information flow into the azimuth distribution heat map of the sound source under the at least one time frame comprises:
    determining, based on the correspondence between confidence and display brightness and according to the confidence that the sound source is located at each azimuth under the at least one time frame, the display brightness corresponding to each azimuth under the at least one time frame, wherein different display brightnesses represent different distribution heats; and
    generating the azimuth distribution heat map of the sound source under the at least one time frame according to the display brightness.
  4. The method according to claim 3, characterized in that the generating the azimuth distribution heat map of the sound source under the at least one time frame according to the display brightness comprises:
    determining, according to the display brightness corresponding to each azimuth under the at least one time frame, the image content corresponding to each of the at least one time frame; and
    arranging the image contents corresponding to the at least one time frame in sequence according to the chronological order of the at least one time frame, so as to generate the azimuth distribution heat map.
  5. The method according to claim 1, characterized in that the performing sound source tracking according to the visualization data comprises:
    performing sound source tracking by using a machine learning model and the visualization data.
  6. The method according to claim 5, characterized in that, if the visualization data is the azimuth distribution heat map of the sound source under the at least one time frame, the performing sound source tracking by using a machine learning model and the visualization data comprises:
    extracting, in the machine learning model, image features from the azimuth distribution heat map; and
    determining, based on the mapping relationship between image features and sound source attribute parameters and the image features extracted from the azimuth distribution heat map, target sound source attribute parameters under the at least one time frame, so as to perform sound source tracking.
  7. The method according to claim 6, characterized in that the sound source attribute parameters include one or more of azimuth, quantity, sounding duration, and covered time frames.
  8. The method according to claim 6, characterized by further comprising:
    acquiring sample heat maps corresponding to several sample time frame groups, wherein a sample heat map describes the distribution heat of the sound source in different azimuths under the sample time frames;
    labeling each sample heat map with sound source attribute parameters to obtain labeling information corresponding to each sample heat map; and
    inputting each sample heat map and its corresponding labeling information into the machine learning model, so that the machine learning model learns the mapping relationship between the image features and sound source attribute parameters.
  9. The method according to claim 6, characterized by further comprising:
    acquiring sample information flows corresponding to several sample time frame groups;
    labeling each sample information flow with sound source attribute parameters to obtain labeling information corresponding to each sample information flow; and
    inputting each sample information flow and its corresponding labeling information into the machine learning model, so that the machine learning model converts each sample information flow into visualization data describing the azimuth distribution state of the sound source and learns the mapping relationship between the image features and sound source attribute parameters.
  10. The method according to claim 9, characterized in that the converting the information flow into visualization data describing the azimuth distribution state of the sound source comprises:
    inputting the information flow into the machine learning model; and
    converting, in the machine learning model, the information flow into visualization data describing the azimuth distribution state of the sound source.
  11. The method according to claim 1, characterized in that the acoustic signal flow contains the time-domain signal flows collected by the array elements of the microphone array, and the performing sound source azimuth estimation based on the acoustic signal flow to obtain the information flow containing the sound source azimuth information under the at least one time frame comprises:
    converting the time-domain signal flows collected by the array elements into time-frequency domain signals respectively; and
    determining, by using a sound source azimuth estimation technique, the sound source azimuth information under the at least one time frame according to the time-frequency domain signals of the array elements.
  12. The method according to claim 11, characterized in that the sound source azimuth estimation technique includes one or more of the steered response power phase transform technique SRP-PHAT, the generalized cross-correlation phase transform technique GCC-PHAT, or the multiple signal classification technique MUSIC.
  13. 一种声源追踪方法,其特征在于,包括:A sound source tracking method, comprising:
    在目标时段内的至少一个时间帧下,分别确定声源方位信息;Under at least one time frame within the target period, determine the sound source position information respectively;
    将所述至少一个时间帧下的声源方位信息,转换为描述声源的方位分布状态的至少一组图像数据,以形成图像流;Converting the sound source position information under the at least one time frame into at least one set of image data describing the position distribution state of the sound source to form an image stream;
    利用图像识别模型对所述图像流进行图像识别,以在所述目标时段内进行声源追踪。Image recognition is performed on the image stream using an image recognition model for sound source tracking within the target time period.
  14. The method according to claim 13, wherein the converting the sound source azimuth information under the at least one time frame into at least one set of image data describing the azimuth distribution state of the sound source so as to form an image stream comprises:
    determining, based on a preset recognition accuracy, target time frames within a current recognition period from the at least one time frame;
    converting the sound source azimuth information under each target time frame into a set of image data describing the azimuth distribution state of the sound source within the current recognition period;
    continuing to determine the time frames and image data within the next recognition period in the target time period, until image data corresponding to all recognition periods within the target time period have been generated, so as to form the image stream.
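Illustrative note (not part of the claims): one way to realize the grouping in claims 13 and 14 is to bucket per-frame azimuth estimates into fixed-length recognition periods and render each period as one azimuth-by-frame heat-map image of the stream handed to the image recognition model. The bin width and period length below are assumptions for illustration.

    # Sketch of claims 13-14: turn a stream of per-frame azimuth estimates
    # into an image stream, one heat-map image per recognition period.
    import numpy as np

    N_BINS = 72             # assumed: 360 degrees in 5-degree azimuth bins
    FRAMES_PER_PERIOD = 32  # assumed recognition-period length in time frames

    def azimuths_to_image_stream(azimuths_deg: list) -> np.ndarray:
        """azimuths_deg: one azimuth estimate (degrees) per time frame.
        Returns (n_periods, N_BINS, FRAMES_PER_PERIOD) one-hot heat maps."""
        n_periods = len(azimuths_deg) // FRAMES_PER_PERIOD
        stream = np.zeros((n_periods, N_BINS, FRAMES_PER_PERIOD),
                          dtype=np.float32)
        for p in range(n_periods):
            for f in range(FRAMES_PER_PERIOD):
                az = azimuths_deg[p * FRAMES_PER_PERIOD + f] % 360.0
                stream[p, int(az // (360.0 / N_BINS)), f] = 1.0
        return stream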
  15. A sound source tracking apparatus, comprising:
    an acquisition module, configured to acquire an acoustic signal stream collected by a microphone array under at least one time frame;
    a calculation module, configured to perform sound source azimuth estimation based on the acoustic signal stream, so as to obtain an information stream containing the sound source azimuth information under the at least one time frame;
    a conversion module, configured to convert the information stream into visual data describing the azimuth distribution state of the sound source; and
    a tracking module, configured to perform sound source tracking according to the visual data.
  16. A computing device, comprising a memory and a processor, wherein
    the memory is configured to store one or more computer instructions; and
    the processor is coupled to the memory and configured to execute the one or more computer instructions to:
    acquire an acoustic signal stream collected by a microphone array under at least one time frame;
    perform sound source azimuth estimation based on the acoustic signal stream, so as to obtain an information stream containing the sound source azimuth information under the at least one time frame;
    convert the information stream into visual data describing the azimuth distribution state of the sound source; and
    perform sound source tracking according to the visual data.
  17. A sound source tracking apparatus, comprising:
    a determination module, configured to determine sound source azimuth information respectively under at least one time frame within a target time period;
    a conversion module, configured to convert the sound source azimuth information under the at least one time frame into at least one set of image data describing the azimuth distribution state of the sound source, so as to form an image stream; and
    a tracking module, configured to perform image recognition on the image stream by using an image recognition model, so as to perform sound source tracking within the target time period.
  18. A computing device, comprising a memory and a processor, wherein
    the memory is configured to store one or more computer instructions; and
    the processor is coupled to the memory and configured to execute the one or more computer instructions to:
    determine sound source azimuth information respectively under at least one time frame within a target time period;
    convert the sound source azimuth information under the at least one time frame into at least one set of image data describing the azimuth distribution state of the sound source, so as to form an image stream; and
    perform image recognition on the image stream by using an image recognition model, so as to perform sound source tracking within the target time period.
  19. A sound source tracking system, comprising a microphone array and a computing device, the microphone array being communicatively connected to the computing device, wherein
    the microphone array is configured to collect acoustic signals; and
    the computing device is configured to: acquire an acoustic signal stream collected by the microphone array under at least one time frame; perform sound source azimuth estimation based on the acoustic signal stream, so as to obtain an information stream containing the sound source azimuth information under the at least one time frame; convert the information stream into visual data describing the azimuth distribution state of the sound source; and perform sound source tracking according to the visual data.
  20. A computer-readable storage medium storing computer instructions, wherein, when the computer instructions are executed by one or more processors, the one or more processors are caused to perform the sound source tracking method according to any one of claims 1 to 14.
PCT/CN2021/122742 2020-10-12 2021-10-09 Sound source tracking method and apparatus, and device, system and storage medium WO2022078249A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011086519.8A CN114355286A (en) 2020-10-12 2020-10-12 Sound source tracking method, device, equipment, system and storage medium
CN202011086519.8 2020-10-12

Publications (1)

Publication Number Publication Date
WO2022078249A1

Family

ID=81089773

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/122742 WO2022078249A1 (en) 2020-10-12 2021-10-09 Sound source tracking method and apparatus, and device, system and storage medium

Country Status (2)

Country Link
CN (1) CN114355286A (en)
WO (1) WO2022078249A1 (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120113122A1 (en) * 2010-11-09 2012-05-10 Denso Corporation Sound field visualization system
CN103167373A (en) * 2011-12-09 2013-06-19 现代自动车株式会社 Technique for localizing sound source
CN105073073A (en) * 2013-01-25 2015-11-18 胡海 Devices and methods for the visualization and localization of sound
CN110907778A (en) * 2019-12-12 2020-03-24 国网重庆市电力公司电力科学研究院 GIS equipment partial discharge ultrasonic positioning method, device, equipment and medium
CN111443330A (en) * 2020-05-15 2020-07-24 浙江讯飞智能科技有限公司 Acoustic imaging method, acoustic imaging device, acoustic imaging equipment and readable storage medium
WO2020166324A1 (en) * 2019-02-12 2020-08-20 ソニー株式会社 Information processing device and method, and program

Also Published As

Publication number Publication date
CN114355286A (en) 2022-04-15

Similar Documents

Publication Publication Date Title
JP7434137B2 (en) Speech recognition method, device, equipment and computer readable storage medium
JP7034339B2 (en) Audio signal processing system and how to convert the input audio signal
JP6526083B2 (en) System and method for source signal separation
JP6526049B2 (en) Method and system for improved measurements in source signal separation, entity and parameter estimation, and path propagation effect measurement and mitigation
JP2017516131A5 (en)
US20160302005A1 (en) Method for processing data for the estimation of mixing parameters of audio signals, mixing method, devices, and associated computers programs
CN110610698B (en) Voice labeling method and device
Bianco et al. Semi-supervised source localization in reverberant environments with deep generative modeling
Argentieri et al. Binaural systems in robotics
CN113466793A (en) Sound source positioning method and device based on microphone array and storage medium
Chen et al. Sound localization by self-supervised time delay estimation
CN110515034B (en) Acoustic signal azimuth angle measurement system and method
WO2022078249A1 (en) Sound source tracking method and apparatus, and device, system and storage medium
EP2932503A1 (en) An apparatus aligning audio signals in a shared audio scene
Fuentes et al. Urban sound & sight: Dataset and benchmark for audio-visual urban scene understanding
WO2019127437A1 (en) Map labeling method and apparatus, and cloud server, terminal and application program
CN113608167B (en) Sound source positioning method, device and equipment
Bergh et al. Multi-speaker voice activity detection using a camera-assisted microphone array
CN112311999A (en) Intelligent video sound box device and camera visual angle adjusting method thereof
Wu Digital media recording and broadcasting classroom using Internet intelligent image positioning and opinion monitoring in communication
Berghi et al. Audio inputs for active speaker detection and localization via microphone array
WO2022183968A1 (en) Audio signal processing method, devices, system, and storage medium
Zhao et al. Visually assisted self-supervised audio speaker localization and tracking
Wu et al. Multi-speaker DoA Estimation Using Audio and Visual Modality
CN115620201B (en) House model construction method, device, equipment and storage medium

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 21879294

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 EP: PCT application non-entry in European phase

Ref document number: 21879294

Country of ref document: EP

Kind code of ref document: A1