CN114355286A - Sound source tracking method, device, equipment, system and storage medium - Google Patents

Sound source tracking method, device, equipment, system and storage medium

Info

Publication number
CN114355286A
Authority
CN
China
Prior art keywords
sound source
time frame
information
azimuth
stream
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011086519.8A
Other languages
Chinese (zh)
Inventor
黄伟隆
李威
冯津伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN202011086519.8A priority Critical patent/CN114355286A/en
Priority to PCT/CN2021/122742 priority patent/WO2022078249A1/en
Publication of CN114355286A publication Critical patent/CN114355286A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01S RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S 5/00 Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations
    • G01S 5/18 Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations using ultrasonic, sonic, or infrasonic waves
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R 1/00 Details of transducers, loudspeakers or microphones
    • H04R 1/08 Mouthpieces; Microphones; Attachments therefor

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • General Physics & Mathematics (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Measurement Of Velocity Or Position Using Acoustic Or Ultrasonic Waves (AREA)

Abstract

The embodiments of the present application provide a sound source tracking method, device, equipment, system, and storage medium. The method comprises: acquiring an acoustic signal stream collected by a microphone array in at least one time frame; performing sound source azimuth estimation based on the acoustic signal stream to obtain an information stream containing sound source azimuth information for the at least one time frame; converting the information stream into visualization data describing the azimuth distribution state of the sound source; and performing sound source tracking according to the visualization data. In the embodiments of the present application, an information stream containing sound source azimuth information is converted into visualization data describing the azimuth distribution state of the sound source, and sound source tracking is performed based on that visualization data. This departs from the traditional approach of performing sound source tracking at the acoustic signal processing level and instead performs it at the visual analysis level. As a result, the embodiments of the present application can effectively improve the accuracy of sound source tracking and improve adaptability to various complex environments.

Description

Sound source tracking method, device, equipment, system and storage medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a sound source tracking method, device, apparatus, system, and storage medium.
Background
Sound source tracking based on microphone arrays has been a popular technique in the field of acoustic signal processing in recent years. Current sound source tracking techniques generally perform signal-level processing on the microphone array signals, such as filtering, extremum extraction, fundamental frequency calculation, and azimuth angle calculation, in order to track the sound source.
However, such processing methods have poor robustness and insufficient generalization capability; in particular, in multi-sound-source or noisy environments, the accuracy of sound source tracking is inadequate.
Disclosure of Invention
Aspects of the present disclosure provide a sound source tracking method, device, apparatus, system, and storage medium to improve the accuracy of sound source tracking.
The embodiment of the application provides a sound source tracking method, which comprises the following steps:
acquiring an acoustic signal stream acquired by a microphone array under at least one time frame;
performing a sound source bearing estimation based on the acoustic signal stream to obtain an information stream containing sound source bearing information for the at least one time frame;
converting the information flow into visual data describing the azimuth distribution state of the sound source;
and tracking the sound source according to the visual data.
The embodiment of the present application further provides a sound source tracking method, including:
respectively determining sound source azimuth information in at least one time frame in a target time period;
converting the sound source azimuth information at the at least one time frame into at least one set of image data describing an azimuth distribution state of the sound source to form an image stream;
and carrying out image recognition on the image stream by using an image recognition model so as to track the sound source in the target time interval.
The embodiment of the present application further provides a sound source tracking apparatus, including:
the acquisition module is used for acquiring an acoustic signal stream acquired by the microphone array under at least one time frame;
a calculation module for performing a sound source bearing estimation based on the acoustic signal stream to obtain an information stream comprising sound source bearing information for the at least one time frame;
the conversion module is used for converting the information flow into visual data describing the azimuth distribution state of the sound source;
and the tracking module is used for tracking the sound source according to the visual data.
The embodiment of the application also provides a computing device, which comprises a memory and a processor;
the memory is to store one or more computer instructions;
the processor is coupled with the memory for executing the one or more computer instructions for:
acquiring an acoustic signal stream acquired by a microphone array under at least one time frame;
performing a sound source bearing estimation based on the acoustic signal stream to obtain an information stream containing sound source bearing information for the at least one time frame;
converting the information flow into visual data describing the azimuth distribution state of the sound source;
and tracking the sound source according to the visual data.
The embodiment of the present application further provides a sound source tracking apparatus, including:
the determining module is used for respectively determining the sound source azimuth information in at least one time frame in the target time period;
a conversion module for converting the sound source azimuth information at the at least one time frame into at least one set of image data describing an azimuth distribution state of the sound source to form an image stream;
and the tracking module is used for carrying out image recognition on the image stream by using an image recognition model so as to track the sound source in the target time interval.
The embodiment of the application also provides a computing device, which comprises a memory and a processor;
the memory is to store one or more computer instructions;
the processor is coupled with the memory for executing the one or more computer instructions for:
respectively determining sound source azimuth information in at least one time frame in a target time period;
converting the sound source azimuth information at the at least one time frame into at least one set of image data describing an azimuth distribution state of the sound source to form an image stream;
and carrying out image recognition on the image stream by using an image recognition model so as to track the sound source in the target time interval.
The embodiment of the present application further provides a sound source tracking system, including: a microphone array and a computing device, the microphone array in communicative connection with the computing device;
the microphone array is used for acquiring acoustic signals;
the computing device is used for acquiring an acoustic signal stream acquired by the microphone array under at least one time frame; performing a sound source bearing estimation based on the acoustic signal stream to obtain an information stream containing sound source bearing information for the at least one time frame; converting the information flow into visual data describing the azimuth distribution state of the sound source; and tracking the sound source according to the visual data.
Embodiments of the present application also provide a computer-readable storage medium storing computer instructions that, when executed by one or more processors, cause the one or more processors to perform the aforementioned sound source tracking method.
In the embodiments of the present application, sound source azimuth estimation may be performed on the acoustic signal stream collected by a microphone array in at least one time frame to determine the sound source azimuth information for each of the at least one time frame, the information stream containing the sound source azimuth information may be converted into visualization data describing the azimuth distribution state of the sound source, and sound source tracking may be performed based on the visualization data. In this way, the embodiments of the present application depart from the traditional approach of performing sound source tracking at the acoustic signal processing level and instead perform it at the visual analysis level. The visualization data can accurately and comprehensively reflect the azimuth distribution state of the sound source, which ensures that the basis for visual analysis is accurate and comprehensive and avoids the robustness problem; moreover, during visual analysis the analyzed field of view can cover multiple time frames, so that noise within the field of view can be identified and noise interference avoided. Therefore, the embodiments of the present application can effectively improve the accuracy of sound source tracking and improve adaptability to various complex environments.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
fig. 1 is a schematic flow chart of a sound source tracking method according to an exemplary embodiment of the present application;
FIG. 2 is a logic diagram of an acoustic source tracking scheme provided by an exemplary embodiment of the present application;
FIG. 3 is a schematic diagram of sound source azimuth information provided by an exemplary embodiment of the present application;
FIG. 4 is a schematic diagram of an azimuthal distribution thermodynamic diagram of a sound source provided by an exemplary embodiment of the present application;
fig. 5 is a schematic structural diagram of a sound source tracking device according to an exemplary embodiment of the present application;
FIG. 6 is a schematic block diagram of a computing device according to yet another exemplary embodiment of the present application;
FIG. 7 is a flow chart of another sound source tracking method provided by an exemplary embodiment of the present application;
fig. 8 is a schematic structural diagram of another sound source tracking device according to an exemplary embodiment of the present application;
FIG. 9 is a schematic block diagram of another computing device provided in an exemplary embodiment of the present application;
fig. 10 is a schematic structural diagram of a sound source tracking system according to an exemplary embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following specific embodiments of the present application and the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In view of the technical problems of poor robustness and insufficient generalization capability in existing sound source tracking schemes, in some embodiments of the present application the information stream containing the sound source azimuth information is converted into visualization data describing the azimuth distribution state of the sound source, and sound source tracking is performed based on the visualization data. This departs from the traditional approach of performing sound source tracking at the acoustic signal processing level and instead performs it at the visual analysis level. Therefore, the embodiments of the present application can effectively improve the accuracy of sound source tracking and improve adaptability to various complex environments.
The technical solutions provided by the embodiments of the present application are described in detail below with reference to the accompanying drawings.
Fig. 1 is a schematic flowchart of a sound source tracking method according to an exemplary embodiment of the present application. Fig. 2 is a logic diagram of a sound source tracking scheme according to an exemplary embodiment of the present application. The sound source tracking method provided by the embodiment can be executed by a sound source tracking device, which can be implemented as software or as a combination of software and hardware, and can be integrally disposed in a computing device. As shown in fig. 1, the method includes:
step 100, acquiring an acoustic signal stream collected by a microphone array in at least one time frame;
step 101, performing sound source azimuth estimation based on the acoustic signal stream to obtain an information stream containing sound source azimuth information for the at least one time frame;
step 102, converting the information stream into visualization data describing the azimuth distribution state of the sound source;
step 103, performing sound source tracking according to the visualization data.
The sound source tracking method provided by this embodiment can be applied to various scenarios, such as voice control scenarios, audio/video conference scenarios, or other scenarios requiring sound source tracking; the application scenario is not limited in this embodiment. In different application scenarios, the sound source tracking method provided by this embodiment may be integrated into various scenario devices; for example, in a voice control scenario, the scenario device may be a smart speaker, an intelligent robot, or the like, and in an audio/video conference scenario, the scenario device may be various conference terminals, or the like.
In this embodiment, in step 100, a microphone array may be used to collect the acoustic signal stream. The microphone array may be a group array composed of a plurality of array elements, and the present embodiment does not limit the number of array elements in the microphone array. The present embodiment also does not limit the arrangement of the microphone array, and the microphone array may be a circular array, a linear array, a planar array, or a stereo array. In different application scenarios, the microphone array can be assembled in various types of scene devices as required.
The signal acquisition process of the microphone array is usually a continuous process, so that the subsequent processing can be performed in the form of an acoustic signal stream in this embodiment.
In this embodiment, at least one time frame may be selected within a single recognition period based on the recognition accuracy, with a single time frame used as the processing unit. The length of a single recognition period is adapted to the recognition accuracy; for example, if the recognition accuracy is 1 s, that is, a sound source tracking result is output once every 1 s, the length of a single recognition period can be set to 1 s, and in step 100 an acoustic signal stream formed by the acoustic signals of at least one time frame within that 1 s can be acquired at a time as the processing object of the subsequent steps. In practical applications, under different application scenarios, the at least one time frame can be selected as required within the target period. For example, when the acoustic signal changes little, at least one time frame may be selected in the target period by frame skipping or variable-frame-rate sampling. Of course, in most cases all time frames in the target period may be selected, which is not limited in this embodiment.
In this embodiment, the frame length of a time frame may be configured according to actual requirements; for example, the frame length of a single time frame may be configured to be 20 ms. In addition, the number of the at least one time frame in a recognition period can also be set as required; for example, if the recognition period is 1 s, 3 time frames can be selected within it, so that sound source tracking in the target period is performed based on those 3 time frames. Of course, the frame length of a time frame and the number of time frames in a recognition period are not limited in this embodiment. In addition, the frame lengths of different time frames in a recognition period need not be identical, which is not limited in this embodiment either.
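As a purely illustrative sketch of this framing (assuming Python with NumPy, a 16 kHz sample rate, 20 ms frames, a 1 s recognition period, and selection of 3 frames by uniform frame skipping; none of these values are prescribed by the application), one recognition period could be split and sampled as follows:

    import numpy as np

    def select_frames(signal, sample_rate=16000, frame_ms=20, period_s=1.0, num_selected=3):
        # Split one recognition period of the signal (samples, or samples x channels)
        # into 20 ms time frames and pick a subset of them by uniform frame skipping.
        frame_len = int(sample_rate * frame_ms / 1000)                  # 320 samples per frame
        frames_per_period = int(period_s * sample_rate) // frame_len    # 50 frames per 1 s period
        frames = [signal[i * frame_len:(i + 1) * frame_len] for i in range(frames_per_period)]
        picks = np.linspace(0, frames_per_period - 1, num_selected, dtype=int)
        return [frames[i] for i in picks]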
Therefore, the sound source tracking method provided by this embodiment can be applied to real-time sound source tracking scenarios as well as offline sound source tracking scenarios, with sound source tracking performed continuously in each recognition period according to the recognition accuracy.
In practical applications, each array element in the microphone array may be used to collect a time domain signal, and taking the example that the microphone array includes M array elements, M time domain signal streams may be collected in at least one time frame as the acoustic signal streams in step 100.
Referring to fig. 1 and 2, in step 101, sound source azimuth estimation may be performed based on the acoustic signal stream to obtain an information stream containing the sound source azimuth information for the at least one time frame.
In this embodiment, the acoustic signal stream may be subjected to signal processing using a sound source azimuth estimation technique to determine the sound source azimuth information for each of the at least one time frame. The sound source azimuth information is used to characterize the azimuth data of the sound source in that time frame. In this embodiment, the azimuth data may be the confidence level of the sound source in each azimuth; thus, the sound source azimuth information may include at least the confidence level of the sound source in each azimuth in the time frame. The azimuths involved in the sound source azimuth information may be configured as required; for example, 360, 120, or 60 azimuths may be configured to cover the whole circumference of the microphone array. Of course, the azimuths may also cover less than the full circumference, for example 180 azimuths covering the 180° range in front of the microphone array, which is not limited in this embodiment.
Fig. 3 is a schematic diagram of sound source azimuth information according to an exemplary embodiment of the present application. The sound source azimuth information is visualized in fig. 3, but it should be understood that fig. 3 is only for convenience of explanation and should not be construed as limiting the data format of the sound source azimuth information in this embodiment. In practice, the sound source azimuth information may take any data form understood by the computing device, for example [1, 3, 5, 60, 70, 80, 90, 80, 70, …, 0], where, in this example, each number in the array represents the confidence level of the sound source in one of 360 azimuths.
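As an illustration only, the following Python sketch shows such a hypothetical data form for a single time frame and how the most likely azimuth could be read from it; the 360-azimuth configuration and the numeric values are assumptions taken from the example above, not a prescribed format.

    import numpy as np

    # Hypothetical sound source azimuth information for one time frame:
    # one confidence value per azimuth (360 azimuths in this example).
    azimuth_info = np.zeros(360)
    azimuth_info[85:96] = [60, 70, 80, 90, 80, 70, 60, 50, 40, 30, 20]   # a peak near azimuth 88

    most_likely_azimuth = int(np.argmax(azimuth_info))    # 88 in this example
    peak_confidence = azimuth_info[most_likely_azimuth]   # 90 in this example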
In step 101, sound source azimuth information is determined in units of time frames. Following the acquisition in step 100 of the acoustic signal streams collected by the M array elements in the at least one time frame, sound source azimuth estimation may be performed, frame by frame, on the acoustic signals collected by the M array elements in each time frame, so as to determine the sound source azimuth information for that time frame.
In an optional implementation, the time-domain signal stream collected by each array element may be converted into a time-frequency domain signal, and a sound source azimuth estimation technique may then be used to determine the sound source azimuth information for the at least one time frame from the time-frequency domain signals of the array elements.
In this implementation, take a target time frame of the at least one time frame as an example, where the target time frame may be any one of the at least one time frame. The time-domain signals collected by each array element in the target time frame can be converted into time-frequency domain signals. For example, the time-domain signal may be subjected to sub-band decomposition to obtain the time-frequency domain signal; the sub-band decomposition may be implemented based on a short-time Fourier transform and/or a filter bank, among others, which is not limited herein. In this way, the time-frequency domain signal corresponding to each array element in the target time frame can be obtained.
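As a minimal, purely illustrative sketch of this sub-band decomposition (assuming Python with NumPy and SciPy, a 16 kHz sample rate, and a 256-point window, none of which are specified by the application), one time frame of an M-element array signal could be converted channel by channel as follows:

    import numpy as np
    from scipy.signal import stft

    def to_time_frequency(frame, sample_rate=16000, n_fft=256):
        # frame: one time frame of the array signal, shaped (samples x M array elements).
        # Returns the time-frequency domain signals of all M elements via a
        # short-time Fourier transform, shaped (M, frequency bins, sub-frames).
        channels = []
        for m in range(frame.shape[1]):
            _, _, spec = stft(frame[:, m], fs=sample_rate, nperseg=n_fft)
            channels.append(spec)
        return np.stack(channels)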
On this basis, sound source azimuth estimation can be performed on the time-frequency domain signals corresponding to the array elements in the target time frame to output the sound source azimuth information for the target time frame. Sound source azimuth estimation techniques include, but are not limited to, steered response power with phase transform (SRP-PHAT), generalized cross-correlation with phase transform (GCC-PHAT), and multiple signal classification (MUSIC). The principle of sound source azimuth estimation may be: according to the acoustic signals collected at the same time by different microphones in the microphone array, candidate azimuth ranges of the sound source are calculated separately, and the azimuth of the sound source is then estimated from the plurality of azimuth ranges. Of course, this is merely exemplary, and this embodiment is not limited thereto. This embodiment does not limit the sound source azimuth estimation technique, and the processing procedures of the various sound source azimuth estimation techniques are not described in detail.
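For illustration only, a minimal SRP-PHAT-style scoring sketch is given below. It assumes a planar (2-D) microphone geometry, far-field propagation, a 120-azimuth grid, and pairwise GCC-PHAT weighting; the function names, array shapes, and the sign convention of the time differences are assumptions for the sketch, not details taken from the application.

    import numpy as np

    SPEED_OF_SOUND = 343.0  # m/s

    def srp_phat_confidence(tf_signals, mic_xy, freqs, num_azimuths=120):
        # tf_signals: (M, F) complex spectra of the M array elements for one sub-frame
        # mic_xy:     (M, 2) microphone coordinates in metres
        # freqs:      (F,) frequency of each bin in Hz
        # Returns one confidence value per candidate azimuth (the sound source azimuth information).
        M, F = tf_signals.shape
        azimuths = np.linspace(0, 2 * np.pi, num_azimuths, endpoint=False)
        directions = np.stack([np.cos(azimuths), np.sin(azimuths)], axis=1)    # (A, 2) unit vectors
        confidence = np.zeros(num_azimuths)
        for i in range(M):
            for j in range(i + 1, M):
                # PHAT-weighted cross-spectrum of the microphone pair (i, j)
                cross = tf_signals[i] * np.conj(tf_signals[j])
                cross /= np.abs(cross) + 1e-12
                # Expected time difference of arrival for each candidate azimuth
                tdoa = directions @ (mic_xy[i] - mic_xy[j]) / SPEED_OF_SOUND    # (A,)
                # Steered response: align the pair and accumulate the real power
                steer = np.exp(2j * np.pi * np.outer(tdoa, freqs))              # (A, F)
                confidence += np.real(steer @ cross)
        return confidence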
In addition, in the present embodiment, other implementations may also be adopted for sound source direction estimation based on the acoustic signal stream, and the present embodiment is not limited to the above implementations.
Accordingly, an information stream is obtained comprising sound source bearing information for at least one time frame.
Referring to fig. 1 and 2, on the basis of this, in step 102, the information stream may be converted into visualization data describing the azimuthal distribution state of the sound source.
In this embodiment, the sound source azimuth information may include description information along dimensions such as time and azimuth, and on this basis it may be converted into description information of the azimuth distribution state. For example, the sound source azimuth information may include the confidence levels that the sound source is located in different azimuths, and these confidence levels may be converted into corresponding display brightness values, with the display brightness in the different azimuths serving as the description information of the azimuth distribution state. It should be noted that, during the conversion into visualization data, no content of the sound source azimuth information is lost; only its representation form changes, which ensures that the visualization data in this embodiment accurately and comprehensively describes the azimuth distribution state of the sound source.
Wherein the visualization data may be an azimuthal distribution thermodynamic diagram of the sound source. According to the above example, after the confidences of the sound source in different directions are converted into the adaptive display brightness, the display brightness in each direction under the time frame can be obtained, so that the direction distribution thermodynamic diagram of the sound source can be determined from three dimensions of the time frame, the direction and the display brightness.
Of course, in this embodiment the visualization data is not limited to this; for example, the visualization data may also be a three-dimensional graph representing the sound source azimuth information for the at least one time frame. In practical applications, such a three-dimensional graph may be obtained by arranging, in time order, the display curves of the sound source azimuth information (such as the curve in fig. 3) corresponding to the at least one time frame.
In this embodiment, the information stream may be converted into various forms of visualization data. During the visualization process, the azimuth information of each sound source can be fully preserved, so that, in step 102, the acoustic signal processing level can be switched to the visualization processing level during the sound source tracking process.
In step 103, sound source tracking may be performed based on the visualization data. In this embodiment, the number of the sound sources is not limited, and the number of the sound sources may be one or more.
Accordingly, in this embodiment the sound source within the at least one time frame can be tracked by performing visual analysis on the visualization data, so that the acoustic signal processing problem is converted into a visual analysis problem. The visualization data can accurately and comprehensively reflect the azimuth distribution state of the sound source, which ensures that the basis for visual analysis is accurate and comprehensive and avoids the robustness problem. Moreover, during visual analysis the analyzed field of view can cover multiple time frames rather than being limited to a single time frame, so that noise within the field of view can be identified and noise interference avoided. The defects of poor robustness and insufficient generalization capability in traditional acoustic signal processing can thus be effectively overcome.
Accordingly, in this embodiment the information stream containing the sound source azimuth information can be converted into visualization data describing the azimuth distribution state of the sound source, and sound source tracking can be performed based on the visualization data. This departs from the traditional approach of performing sound source tracking at the acoustic signal processing level and instead performs it at the visual analysis level. Therefore, the embodiments of the present application can effectively improve the accuracy of sound source tracking and improve adaptability to various complex environments.
In the above or below embodiments, the information stream may be converted into an azimuthal distribution thermodynamic diagram of the sound source at the at least one time frame as a basis for tracking the sound source within the at least one time frame. Wherein the azimuth distribution thermodynamic diagram is used to describe the distribution heat of the sound source in different azimuths at the at least one time frame.
In the present embodiment, the sound source azimuth information includes the confidence levels of the sound sources in the respective azimuths. On the basis, the display brightness corresponding to each direction can be respectively determined in at least one time frame according to the confidence degrees of the sound sources in each direction in at least one time frame based on the corresponding relation between the confidence degrees and the display brightness; generating an azimuth distribution thermodynamic diagram of the sound source in at least one time frame according to the display brightness; different display luminances characterize different heat of distribution.
In practical applications, the higher the confidence level, the higher the corresponding display brightness and the higher the distribution heat it represents. Of course, this embodiment is not limited to this; the display brightness could instead decrease as the confidence increases. Generally, however, the confidence level and the display brightness are kept in direct proportion so that the display brightness accurately represents the confidence level.
Fig. 4 is a schematic diagram of an azimuth distribution thermodynamic diagram of a sound source according to an exemplary embodiment of the present application. Referring to fig. 4, the vertical axis of the thermodynamic diagram is the time frame and the horizontal axis is the azimuth. In fig. 4, the number of time frames is 800, and 120 azimuths are configured to characterize the full circumferential space of the microphone array.
In an alternative implementation, the image content corresponding to each of the at least one time frame may be determined according to the display brightness corresponding to each azimuth in that time frame, and the image contents corresponding to the at least one time frame may then be arranged in chronological order to generate the azimuth distribution thermodynamic diagram. Referring to fig. 4, the sound source azimuth information of each time frame is converted into one horizontal row of the thermodynamic diagram; for example, the sound source azimuth information of frame 400 is converted into the line y = 400 in the thermodynamic diagram, and the display brightness of the pixel corresponding to each azimuth on that line is determined by the corresponding confidence level. The higher the confidence level, the brighter the corresponding pixel. For example, in the schematic diagram of sound source azimuth information shown in fig. 3, the confidence corresponding to the peak position is the highest, so when converted into the thermodynamic diagram, the display brightness at the azimuth corresponding to the peak is the brightest. A minimal sketch of this row-by-row construction is given below.
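The following Python sketch (an assumption-laden illustration, not the application's implementation) converts a list of per-frame confidence vectors into such a grayscale thermodynamic diagram, mapping higher confidence to brighter pixels; with 800 time frames and 120 azimuths, as in fig. 4, the result is an 800 x 120 image.

    import numpy as np

    def build_azimuth_heatmap(info_stream):
        # info_stream: per-time-frame confidence vectors (one value per azimuth).
        # Returns a (time frames x azimuths) uint8 image in which brighter pixels
        # indicate higher confidence, i.e. higher distribution heat.
        heatmap = np.asarray(info_stream, dtype=np.float32)   # row y = time frame y
        heatmap -= heatmap.min()
        heatmap /= heatmap.max() + 1e-12                      # normalise to [0, 1]
        return (heatmap * 255).astype(np.uint8)               # brightness 0..255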
Of course, in this embodiment an azimuth distribution thermodynamic diagram may also be generated in other ways. For example, from the sound source azimuth information of the at least one time frame, the confidence levels that a sound source exists in a target azimuth in the different time frames may be obtained, and the display brightness corresponding to each time frame in that target azimuth determined, so as to generate the image content corresponding to the target azimuth, where the target azimuth may be any one of the azimuths. In this way the image content corresponding to each azimuth can be obtained, and the image contents can be arranged in azimuth order to generate the azimuth distribution thermodynamic diagram. This embodiment does not limit the manner in which the azimuth distribution thermodynamic diagram is generated.
Accordingly, in this embodiment the sound source azimuth information for the at least one time frame can be converted into an azimuth distribution thermodynamic diagram of the sound source. All content of the sound source azimuth information is preserved during the conversion, which provides an accurate basis for the visual analysis process and ensures the accuracy of the tracking result.
In the above or below embodiments, sound source tracking may be performed using a machine learning model and visualization data.
In this embodiment, no matter what form of visual data is, the machine learning model can be used to perform visual analysis to track the sound source. In practical application, for visual data of different forms, machine learning models of different types can be selected, and a model training mode matched with the data form can be adopted to improve the performance of the machine learning model.
The visualization analysis process is described below by taking a thermodynamic diagram as an example.
In this embodiment, the machine learning model can extract image features from the azimuth distribution thermodynamic diagram, and the attribute parameters of the target sound source in the at least one time frame can then be determined, for sound source tracking, based on the mapping relation between image features and sound source attribute parameters together with the image features extracted from the azimuth distribution thermodynamic diagram.
The mapping relation between the image characteristics and the sound source attribute parameters can be configured into a machine learning model through model training.
An exemplary model training process may also be:
acquiring sample thermodynamic diagrams corresponding to a plurality of sample time frame groups; marking the sound source attribute parameters for each sample thermodynamic diagram to obtain marking information corresponding to each sample thermodynamic diagram; and inputting the thermodynamic diagrams of all samples and the corresponding labeling information into a machine learning model so as to enable the machine learning model to learn the mapping relation between the image characteristics and the sound source attribute parameters.
The number of time frames in a sample time frame group may be consistent with the number of the at least one time frame over which acoustic signal acquisition is performed in step 100. That is, the processing unit can be kept consistent between model training and model use. In this way, after the model is trained, the tracking result can be output in units of the same number of time frames.
In this embodiment, the sound source attribute parameters include one or more of azimuth, number, utterance duration, and covered time frames. Accordingly, after the visual analysis, the machine learning model may output information such as the number of sound sources in the at least one time frame, their azimuths, their utterance durations, and the time frames they cover as the tracking result.
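The application does not specify any particular model architecture. Purely as an illustration of how a mapping from thermodynamic diagram to sound source attribute parameters could look, the following PyTorch sketch defines a small convolutional model with two assumed output heads (per-azimuth presence scores and a source count) and a single training step on labelled sample thermodynamic diagrams; all layer sizes, head definitions, and loss choices are assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SourceTracker(nn.Module):
        # Illustrative model: maps a (batch x 1 x time frames x azimuths) azimuth
        # distribution thermodynamic diagram to per-azimuth presence scores and a
        # sound source count (both output heads are assumptions for this sketch).
        def __init__(self, num_azimuths=120):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d((4, 4)),
            )
            self.azimuth_head = nn.Linear(32 * 4 * 4, num_azimuths)  # presence per azimuth
            self.count_head = nn.Linear(32 * 4 * 4, 1)               # number of sound sources

        def forward(self, heatmap):
            x = self.features(heatmap).flatten(1)
            return self.azimuth_head(x), self.count_head(x)

    def train_step(model, optimizer, heatmaps, azimuth_labels, count_labels):
        # One training step on a batch of labelled sample thermodynamic diagrams.
        optimizer.zero_grad()
        azimuth_logits, count_pred = model(heatmaps)
        loss = (F.binary_cross_entropy_with_logits(azimuth_logits, azimuth_labels)
                + F.mse_loss(count_pred.squeeze(1), count_labels))
        loss.backward()
        optimizer.step()
        return loss.item()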
In the case of introducing the machine learning model, in the present embodiment, the step of converting the information flow into the visualized data describing the azimuth distribution state of the sound source may be performed in the machine learning model or may be performed outside the machine learning model.
In one possible implementation, the process of converting the information flow into the visualized data describing the azimuth distribution state of the sound source may be performed outside the machine learning model during the use of the model, and the visualized data may be used as the input parameters of the machine learning model. Accordingly, in the model training process, the information stream corresponding to the sample time frame group may be converted into a sample thermodynamic diagram in advance, and used as a basis for model training.
In another possible implementation, during model use, the information stream may be input into a machine learning model; in the machine learning model, the information flow is converted into visualized data describing the azimuthal distribution state of the sound source.
In this implementation, a functional module that converts the information flow into visual data describing the azimuth distribution state of the sound source may be configured in the machine learning model, so that the information flow may be used as an input parameter of the machine learning model. For the machine learning model, the information stream can be converted into the visual data describing the azimuth distribution state of the sound source under the condition of receiving the information stream, and then the visual analysis is carried out.
Accordingly, the model training process will differ slightly from the previous implementation. In this implementation, sample information streams corresponding to a plurality of sample time frame groups can be obtained; the sound source attribute parameters are labeled for each sample information stream to obtain the labeling information corresponding to each sample information stream; and each sample information stream and its corresponding labeling information are input into the machine learning model, so that the machine learning model converts each sample information stream into visualization data describing the azimuth distribution state of the sound source and learns the mapping relation between image features and sound source attribute parameters.
Accordingly, in this embodiment, after the machine learning model is trained with a sufficient amount of sample data, it learns an accurate mapping relation between image features and sound source attribute parameters. The trained machine learning model can then be used to perform visual analysis on the visualization data and output the sound source attribute information for the at least one time frame, so as to track the one or more sounding sources based on that information. This approach can eliminate various noise interferences during tracking, does not require a separate operation for finding the utterance starting point, and avoids the other drawbacks of processing at the acoustic signal level. Furthermore, the accuracy of the tracking result can be effectively improved, and adaptability to various complex environments can be improved.
It should be noted that the execution subjects of the steps of the methods provided in the above embodiments may be the same device, or different devices may be used as the execution subjects of the methods. For example, the execution subjects of steps 101 to 103 may be device a; for another example, the execution subject of steps 101 and 102 may be device a, and the execution subject of step 103 may be device B; and so on.
In addition, in some of the flows described in the above embodiments and the drawings, a plurality of operations are included in a specific order, but it should be clearly understood that the operations may be executed out of the order presented herein or in parallel, and the sequence numbers of the operations, such as 101, 102, etc., are merely used for distinguishing different operations, and the sequence numbers do not represent any execution order per se. Additionally, the flows may include more or fewer operations, and the operations may be performed sequentially or in parallel.
Fig. 5 is a schematic structural diagram of a sound source tracking device according to an exemplary embodiment of the present application. Referring to fig. 5, the sound source tracking apparatus includes:
an acquisition module 50 for acquiring an acoustic signal stream acquired by the microphone array at least one time frame;
a calculation module 51 for performing sound source bearing estimation based on the acoustic signal streams to obtain an information stream containing sound source bearing information for at least one time frame;
a conversion module 52 for converting the information flow into visualized data describing the azimuth distribution state of the sound source;
and a tracking module 53, configured to perform sound source tracking according to the visualization data.
In an alternative embodiment, the conversion module 52, when converting the information stream into the visualization data describing the azimuthal distribution state of the sound source, is configured to:
the information stream is converted into an azimuth distribution thermodynamic diagram of the sound source at the at least one time frame, the azimuth distribution thermodynamic diagram being used to describe the distribution heat of the sound source at different azimuths at the at least one time frame.
In an optional embodiment, the sound source position information includes confidence levels of sound sources in various positions; the conversion module 52, when converting the information stream into an azimuthal distribution thermodynamic diagram of the sound source at the at least one time frame, is configured to:
respectively determining the display brightness corresponding to each azimuth in at least one time frame according to the confidence coefficient of the sound source in each azimuth in at least one time frame based on the corresponding relation between the confidence coefficient and the display brightness, wherein the distribution heat represented by different display brightness is different;
and generating an azimuth distribution thermodynamic diagram of the sound source under at least one time frame according to the display brightness.
In an alternative embodiment, the conversion module 52, when generating the azimuth distribution thermodynamic diagram of the sound source for at least one time frame according to the display brightness, is configured to:
respectively determining image contents corresponding to at least one time frame according to the display brightness corresponding to each direction under the at least one time frame;
and sequentially arranging the image contents corresponding to the at least one time frame according to the time sequence among the at least one time frame so as to generate the azimuth distribution thermodynamic diagram.
In an alternative embodiment, the tracking module 53, when performing sound source tracking based on the visualization data, is configured to:
and tracking the sound source by using the machine learning model and the visual data.
In an alternative embodiment, if the visualized data is an orientation distribution thermodynamic diagram of the sound source in at least one time frame, the tracking module 53 is configured to, when performing sound source tracking by using the machine learning model and the visualized data:
extracting image features in the orientation distribution thermodynamic diagram in a machine learning model;
and determining the attribute parameters of the target sound source in at least one time frame for sound source tracking based on the mapping relation between the image features and the sound source attribute parameters and the image features extracted from the azimuth distribution thermodynamic diagram.
In an alternative embodiment, the sound source property parameters comprise one or more of azimuth, number, utterance duration and covered time frame.
In an alternative embodiment, the tracking module 53 is further configured to:
acquiring sample thermodynamic diagrams corresponding to a plurality of sample time frame groups respectively, wherein the sample thermodynamic diagrams are used for describing the distribution heat of a sound source in different directions under the sample time frames;
marking the sound source attribute parameters for each sample thermodynamic diagram to obtain marking information corresponding to each sample thermodynamic diagram;
and inputting the thermodynamic diagrams of all samples and the corresponding labeling information into a machine learning model so as to enable the machine learning model to learn the mapping relation between the image characteristics and the sound source attribute parameters.
In an alternative embodiment, the tracking module 53, when converting the information stream into visualization data describing the azimuthal distribution state of the sound source, is configured to:
inputting the information stream into a machine learning model;
in the machine learning model, the information flow is converted into visualized data describing the azimuthal distribution state of the sound source.
In an alternative embodiment, the tracking module 53 is further configured to:
acquiring sample information streams corresponding to the sample time frame groups;
marking sound source attribute parameters for each sample information stream to obtain marking information corresponding to each sample information stream;
and inputting each sample information stream and the corresponding labeling information into the machine learning model so that the machine learning model converts each sample information stream into visual data describing the azimuth distribution state of the sound source and learns the mapping relation between the image characteristics and the sound source attribute parameters.
In an alternative embodiment, the acoustic signal stream includes a time domain signal stream collected by each array element in the microphone array, and the computing module 51 is configured to, when performing sound source bearing estimation based on the acoustic signal stream to obtain an information stream including sound source bearing information in at least one time frame:
respectively converting the time domain signal streams collected by each array element into time-frequency domain signals;
and determining sound source azimuth information under at least one time frame according to the time-frequency domain signals under each array element by adopting a sound source azimuth estimation technology.
In an alternative embodiment, the sound source azimuth estimation technique includes one or more of steered response power with phase transform (SRP-PHAT), generalized cross-correlation with phase transform (GCC-PHAT), and multiple signal classification (MUSIC).
It should be noted that, for brevity, the technical details of the above sound source tracking apparatus embodiments may be found in the related descriptions of the sound source tracking method embodiments and are not repeated herein; this should not be construed as limiting the scope of the present application.
Fig. 6 is a schematic structural diagram of a computing device according to another exemplary embodiment of the present application. As shown in fig. 6, the computing device includes: a memory 60 and a processor 61.
A processor 61, coupled to the memory 60, for executing computer programs in the memory 60 for:
acquiring a stream of acoustic signals acquired by the microphone array 62 at least one time frame;
performing a sound source bearing estimation based on the acoustic signal stream to obtain an information stream comprising sound source bearing information for at least one time frame;
converting the information flow into visual data describing the azimuth distribution state of the sound source;
and carrying out sound source tracking according to the visual data.
In an alternative embodiment, the processor 61, when converting the information stream into visualization data describing the azimuthal distribution state of the sound source, is configured to:
the information stream is converted into an azimuth distribution thermodynamic diagram of the sound source at the at least one time frame, the azimuth distribution thermodynamic diagram being used to describe the distribution heat of the sound source at different azimuths at the at least one time frame.
In an optional embodiment, the sound source position information includes confidence levels of sound sources in various positions; the processor 61, when converting the information stream into an azimuthal distribution thermodynamic map of the sound source at least one time frame, is configured to:
respectively determining the display brightness corresponding to each azimuth in at least one time frame according to the confidence coefficient of the sound source in each azimuth in at least one time frame based on the corresponding relation between the confidence coefficient and the display brightness, wherein the distribution heat represented by different display brightness is different;
and generating an azimuth distribution thermodynamic diagram of the sound source under at least one time frame according to the display brightness.
In an alternative embodiment, the processor 61, when generating the azimuth distribution thermodynamic diagram of the sound source for at least one time frame according to the display brightness, is configured to:
respectively determining image contents corresponding to at least one time frame according to the display brightness corresponding to each direction under the at least one time frame;
and sequentially arranging the image contents corresponding to the at least one time frame according to the time sequence among the at least one time frame so as to generate the azimuth distribution thermodynamic diagram.
In an alternative embodiment, the processor 61, when performing sound source tracking based on the visualization data, is configured to:
and tracking the sound source by using the machine learning model and the visual data.
In an alternative embodiment, if the visualized data is an azimuthal distribution thermodynamic map of the sound source in at least one time frame, the processor 61, when performing sound source tracking using the machine learning model and the visualized data, is configured to:
extracting image features in the orientation distribution thermodynamic diagram in a machine learning model;
and determining the attribute parameters of the target sound source in at least one time frame for sound source tracking based on the mapping relation between the image features and the sound source attribute parameters and the image features extracted from the azimuth distribution thermodynamic diagram.
In an alternative embodiment, the sound source property parameters comprise one or more of azimuth, number, utterance duration and covered time frame.
In an alternative embodiment, the processor 61 is further configured to:
acquiring sample thermodynamic diagrams corresponding to a plurality of sample time frame groups respectively, wherein the sample thermodynamic diagrams are used for describing the distribution heat of a sound source in different directions under the sample time frames;
marking the sound source attribute parameters for each sample thermodynamic diagram to obtain marking information corresponding to each sample thermodynamic diagram;
and inputting the thermodynamic diagrams of all samples and the corresponding labeling information into a machine learning model so as to enable the machine learning model to learn the mapping relation between the image characteristics and the sound source attribute parameters.
In an alternative embodiment, the processor 61, when converting the information stream into visualization data describing the azimuthal distribution state of the sound source, is configured to:
inputting the information stream into a machine learning model;
in the machine learning model, the information flow is converted into visualized data describing the azimuthal distribution state of the sound source.
In an alternative embodiment, the processor 61 is further configured to:
acquiring sample information streams corresponding to the sample time frame groups;
marking sound source attribute parameters for each sample information stream to obtain marking information corresponding to each sample information stream;
and inputting each sample information stream and the corresponding labeling information into the machine learning model so that the machine learning model converts each sample information stream into visual data describing the azimuth distribution state of the sound source and learns the mapping relation between the image characteristics and the sound source attribute parameters.
In an alternative embodiment, the acoustic signal stream comprises a time domain signal stream collected by each array element in the microphone array, and the processor 61, when performing sound source bearing estimation based on the acoustic signal stream to obtain an information stream comprising sound source bearing information in at least one time frame, is configured to:
respectively converting the time domain signal streams collected by each array element into time-frequency domain signals;
and determining sound source azimuth information under at least one time frame according to the time-frequency domain signals under each array element by adopting a sound source azimuth estimation technology.
In an alternative embodiment, the sound source azimuth estimation technique includes one or more of steered response power with phase transform (SRP-PHAT), generalized cross-correlation with phase transform (GCC-PHAT), and multiple signal classification (MUSIC).
It should be noted that, for brevity, the technical details of the above computing device embodiments may be found in the related descriptions of the sound source tracking method embodiments and are not repeated herein; this should not be construed as limiting the scope of the present application.
Further, as shown in fig. 6, the computing device further includes: communication components 63, power components 64, and the like. Only some of the components are schematically shown in fig. 6, and the computing device is not meant to include only the components shown in fig. 6.
Fig. 7 is a flowchart of another sound source tracking method according to an exemplary embodiment of the present application. The sound source tracking method provided by the embodiment can be executed by a sound source tracking device, which can be implemented as software or as a combination of software and hardware, and can be integrally disposed in a computing device. As shown in fig. 7, the method includes:
step 700, respectively determining sound source azimuth information in at least one time frame in a target time period;
step 701, converting the sound source azimuth information in at least one time frame into at least one group of image data describing the azimuth distribution state of the sound source to form an image stream;
step 702, image recognition is performed on the image stream by using an image recognition model to perform sound source tracking in a target time period.
For step 700, reference may be made to the relevant description in the embodiment associated with fig. 1. Before step 701, the method may further include: acquiring an acoustic signal stream collected by the microphone array in at least one time frame; and performing sound source azimuth estimation based on the acoustic signal stream to obtain the sound source azimuth information for the at least one time frame. For brevity, the detailed process is not described again.
In step 701, the sound source bearing information at the at least one time frame may be converted into at least one set of image data. The image data may be an azimuth distribution thermodynamic diagram; in that case, at least one azimuth distribution thermodynamic diagram is obtained in step 701 to form the image stream input to the image recognition model.
Based on step 701, the sound source tracking scheme provided by this embodiment can be applied to scenarios such as real-time (online) tracking or offline tracking. In the offline tracking scenario, the sound source azimuth information at the at least one time frame may be obtained all at once, grouped according to the identification precision, and then converted into image data group by group.
In the online tracking scenario, target time frames located in the current identification period may be determined from the at least one time frame based on a preset identification precision;
the sound source azimuth information at each target time frame is converted into a group of image data describing the azimuth distribution state of the sound source within the current identification period;
and the time frames and image data of the next identification period within the target period are then determined in the same way, until the image data corresponding to all the identification periods in the target period have been generated.
In the online tracking scenario, the operation of converting the sound source azimuth information into image data may be performed successively for each identification period as the acoustic signal is continuously generated, and the resulting image data are subsequently input to the image recognition model in step 702.
For example, if the identification precision is 1 s, one azimuth distribution thermodynamic diagram may be generated based on the sound source azimuth information of the N time frames within the current identification period (1 s); the azimuth distribution thermodynamic diagram of the next identification period (1 s) is then generated, and so on for the subsequent identification periods, with each azimuth distribution thermodynamic diagram provided to the image recognition model in a streaming manner.
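A minimal sketch of this streaming arrangement, assuming an identification precision expressed as a fixed number of time frames per identification period, might look as follows; the frame count and azimuth grid are illustrative assumptions.

    import numpy as np

    FRAMES_PER_PERIOD = 50      # e.g. 1 s of 20 ms frames; stands in for the identification precision
    N_AZIMUTHS = 72             # 5-degree azimuth grid, assumed

    def heatmap_stream(confidence_frames):
        # group incoming per-frame azimuth confidences by identification period and
        # yield one azimuth-distribution heat map (azimuths x time frames) per period
        buffer = []
        for conf in confidence_frames:
            buffer.append(np.asarray(conf, dtype=np.float32))
            if len(buffer) == FRAMES_PER_PERIOD:
                yield np.stack(buffer, axis=1)
                buffer = []

    # usage: simulate three identification periods and stream the images one by one
    frames = (np.random.rand(N_AZIMUTHS) for _ in range(3 * FRAMES_PER_PERIOD))
    for i, image in enumerate(heatmap_stream(frames)):
        print(f"identification period {i}: heat map shape {image.shape}")   # (72, 50)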
In this embodiment, the image recognition model may be trained in advance. The image recognition model may adopt a machine learning model, and the training process of the image recognition model may refer to the related description in the associated embodiment of fig. 1.
It is worth noting that the identification precision is kept consistent between the training phase and the application phase of the image recognition model.
For the above technical details of this sound source tracking method embodiment, reference may be made to the related descriptions in the embodiment of the sound source tracking method shown in fig. 1; they are not repeated here for brevity, but this should not be construed as limiting the scope of the present application.
Fig. 8 is a schematic structural diagram of another sound source tracking device according to an exemplary embodiment of the present application. Referring to fig. 8, the sound source tracking apparatus includes:
a determining module 80, configured to determine the sound source bearing information in at least one time frame within the target time period respectively;
a conversion module 81 for converting the sound source azimuth information at least one time frame into at least one set of image data describing the azimuth distribution state of the sound source to form an image stream;
and a tracking module 82, configured to perform image recognition on the image stream by using an image recognition model to perform sound source tracking within the target time period.
In an alternative embodiment, the conversion module 81, when converting the sound source bearing information at the at least one time frame into at least one set of image data describing a bearing distribution state of the sound source to form the image stream, is configured to:
determining a target time frame located in the current identification period from at least one time frame based on a preset identification precision;
converting the sound source azimuth information in each target time frame into a group of image data describing the azimuth distribution state of the sound source in the current identification period;
and continuing to determine the time frame and the image data in the next identification period in the target period until the image data corresponding to all the identification periods in the target period is generated to form an image stream.
In an alternative embodiment, the determination module 80 includes an acquisition module 83 and a calculation module 84;
an acquisition module 83 configured to acquire an acoustic signal stream acquired by the microphone array at least one time frame;
a calculation module 84 for performing sound source bearing estimation based on the acoustic signal stream to obtain the sound source bearing information at the at least one time frame.
In an alternative embodiment, the conversion module 81, when converting the sound source bearing information at each target time frame into a set of image data describing the bearing distribution state of the sound source within the current identification period, is configured to:
the sound source azimuth information at each target time frame is converted into an azimuth distribution thermodynamic diagram of the sound source at the at least one time frame, the azimuth distribution thermodynamic diagram being used to describe the distribution heat of the sound source at different azimuths at the at least one time frame.
In an optional embodiment, the sound source position information includes confidence levels of sound sources in various positions; the conversion module 81, when converting the sound source bearing information at each target time frame into a bearing distribution thermodynamic diagram of the sound source at least one time frame, is configured to:
respectively determining the display brightness corresponding to each azimuth in at least one time frame according to the confidence coefficient of the sound source in each azimuth in at least one time frame based on the corresponding relation between the confidence coefficient and the display brightness, wherein the distribution heat represented by different display brightness is different;
and generating an azimuth distribution thermodynamic diagram of the sound source under at least one time frame according to the display brightness.
In an alternative embodiment, the conversion module 81, when generating the azimuth distribution thermodynamic diagram of the sound source for at least one time frame according to the display brightness, is configured to:
respectively determining image contents corresponding to at least one time frame according to the display brightness corresponding to each direction under the at least one time frame;
and sequentially arranging the image contents corresponding to the at least one time frame according to the time sequence among the at least one time frame so as to generate the azimuth distribution thermodynamic diagram.
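As a hedged illustration of the brightness mapping and chronological arrangement described above, the following sketch maps per-azimuth confidences to 8-bit display brightness and stacks one column of image content per time frame; the linear confidence-to-brightness relation, the 8-bit depth, and the azimuth grid size are assumptions.

    import numpy as np

    def confidence_to_brightness(conf, levels=256):
        # higher confidence -> higher display brightness; linear 8-bit mapping assumed
        return np.round(np.clip(conf, 0.0, 1.0) * (levels - 1)).astype(np.uint8)

    def azimuth_distribution_heatmap(confidences_per_frame):
        # one column of image content per time frame, columns arranged in chronological order
        columns = [confidence_to_brightness(c) for c in confidences_per_frame]
        return np.stack(columns, axis=1)          # shape: (n_azimuths, n_time_frames)

    # usage: 10 time frames over a 72-point azimuth grid
    frames = [np.random.rand(72) for _ in range(10)]
    heatmap = azimuth_distribution_heatmap(frames)   # brighter pixels mark hotter azimuths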
In an alternative embodiment, if the image data is an azimuthal distribution thermodynamic diagram of the sound source in at least one time frame, the tracking module 82, when performing image recognition on the image stream by using an image recognition model to perform sound source tracking in the target time period, is configured to:
extracting image features in an orientation distribution thermodynamic diagram in an image recognition model;
and determining the attribute parameters of the target sound source in at least one time frame for sound source tracking based on the mapping relation between the image features and the sound source attribute parameters and the image features extracted from the azimuth distribution thermodynamic diagram.
In an alternative embodiment, the sound source property parameters comprise one or more of azimuth, number, utterance duration and covered time frame.
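Purely as an illustrative sketch (not the patented model), a small convolutional network of the kind that could extract image features from an azimuth distribution thermodynamic diagram and map them to sound source attribute parameters is shown below; the architecture, the assumed input size, and the choice of source count and per-azimuth activity as outputs are assumptions for the example.

    import torch
    import torch.nn as nn

    class SoundSourceAttributeNet(nn.Module):
        # extracts image features from an azimuth-distribution heat map and maps them to
        # sound source attribute parameters (here: source count and per-azimuth activity)
        def __init__(self, n_azimuths=72, max_sources=4):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d((4, 4)),
            )
            self.count_head = nn.Linear(32 * 4 * 4, max_sources + 1)   # number of sources: 0..max
            self.azimuth_head = nn.Linear(32 * 4 * 4, n_azimuths)      # activity per azimuth

        def forward(self, heatmap):
            f = self.features(heatmap).flatten(1)      # image features
            return self.count_head(f), self.azimuth_head(f)

    # usage: one heat map, single channel, 72 azimuths x 50 time frames (assumed size)
    model = SoundSourceAttributeNet()
    count_logits, azimuth_logits = model(torch.rand(1, 1, 72, 50))

Utterance duration and the covered time frames, also listed as attribute parameters, could be derived in a similar way, for example from an additional per-time-frame activity output.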
In an alternative embodiment, the tracking module 82 is further configured to:
acquiring sample thermodynamic diagrams corresponding to a plurality of sample time frame groups respectively, wherein the sample thermodynamic diagrams are used for describing the distribution heat of a sound source in different directions under the sample time frames;
marking the sound source attribute parameters for each sample thermodynamic diagram to obtain marking information corresponding to each sample thermodynamic diagram;
and inputting each sample thermodynamic diagram and the corresponding labeling information thereof into the image recognition model so that the image recognition model learns the mapping relation between the image characteristics and the sound source attribute parameters.
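A minimal training sketch under stated assumptions (a small stand-in model, randomly generated sample thermodynamic diagrams, and marking information reduced to a single attribute, the source count) might look like this:

    import torch
    import torch.nn as nn

    N_AZIMUTHS, N_FRAMES, MAX_SOURCES = 72, 50, 4       # assumed sizes

    # minimal stand-in for the image recognition model (a fuller CNN is sketched above)
    model = nn.Sequential(
        nn.Flatten(),
        nn.Linear(N_AZIMUTHS * N_FRAMES, 128), nn.ReLU(),
        nn.Linear(128, MAX_SOURCES + 1),                 # predicted number of sound sources
    )
    optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    # hypothetical labelled data: one sample heat map per sample time frame group, with the
    # marking information reduced to a single attribute (the source count) for brevity
    sample_heatmaps = torch.rand(64, 1, N_AZIMUTHS, N_FRAMES)
    marking_info = torch.randint(0, MAX_SOURCES + 1, (64,))

    for epoch in range(5):
        optimiser.zero_grad()
        loss = loss_fn(model(sample_heatmaps), marking_info)
        loss.backward()
        optimiser.step()
        print(f"epoch {epoch}: loss {loss.item():.3f}")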
In an alternative embodiment, the tracking module 82, in converting the sound source location information at each target time frame to an orientation distribution thermodynamic diagram for the sound source at least one time frame, is configured to:
inputting the information stream into an image recognition model;
in the image recognition model, the sound source orientation information at each target time frame is converted into an orientation distribution thermodynamic diagram of the sound source at least one time frame.
In an alternative embodiment, the tracking module 82 is further configured to:
acquiring sample information streams corresponding to the sample time frame groups;
marking sound source attribute parameters for each sample information stream to obtain marking information corresponding to each sample information stream;
and inputting each sample information stream and the corresponding labeling information thereof into the image recognition model so that the image recognition model converts each sample information stream into visual data describing the azimuth distribution state of the sound source and learns the mapping relation between the image characteristics and the sound source attribute parameters.
In an alternative embodiment, the acoustic signal stream includes a time domain signal stream collected by each array element in the microphone array, and the calculation module 84, when performing sound source bearing estimation based on the acoustic signal stream to obtain sound source bearing information in at least one time frame, is configured to:
respectively converting the time domain signal streams collected by each array element into time-frequency domain signals;
and determining sound source azimuth information under at least one time frame according to the time-frequency domain signals under each array element by adopting a sound source azimuth estimation technology.
In an alternative embodiment, the sound source location estimation technique includes one or more of a steered beam response phase transform technique (SRP-PHAT), a generalized cross-correlation phase transform technique (GCC-PHAT), or a multiple signal classification technique (MUSIC).
It should be noted that, for the sake of brevity, the technical details of the sound source tracking apparatus embodiments described above may be found in the related descriptions of the sound source tracking method embodiments of fig. 1 and fig. 7 and are not repeated here; this should not be construed as limiting the scope of the present application.
Fig. 9 is a schematic structural diagram of another computing device provided in an exemplary embodiment of the present application, and referring to fig. 9, the computing device includes: a memory 90 and a processor 91.
A processor 91, coupled to the memory 90, for executing the computer program in the memory 90 for:
respectively determining sound source azimuth information in at least one time frame in a target time period;
converting sound source azimuth information at least one time frame into at least one set of image data describing an azimuth distribution state of a sound source to form an image stream;
and carrying out image recognition on the image stream by using an image recognition model so as to track the sound source in the target time interval.
In an alternative embodiment, the processor 91, when converting the sound source bearing information at the at least one time frame into at least one set of image data describing a bearing distribution state of the sound source to form the image stream, is configured to:
determining a target time frame located in the current identification period from at least one time frame based on a preset identification precision;
converting the sound source azimuth information in each target time frame into a group of image data describing the azimuth distribution state of the sound source in the current identification period;
and continuing to determine the time frame and the image data in the next identification period in the target period until the image data corresponding to all the identification periods in the target period is generated to form an image stream.
In an alternative embodiment, the processor 91, when determining the sound source bearing information at least one time frame within the target period, is configured to:
acquiring an acoustic signal stream acquired by a microphone array under at least one time frame;
and performing sound source bearing estimation based on the acoustic signal stream to obtain the sound source bearing information at the at least one time frame.
In an alternative embodiment, the processor 91, when converting the sound source bearing information at each target time frame into a set of image data describing the bearing distribution state of the sound source within the current identification period, is configured to:
the sound source azimuth information at each target time frame is converted into an azimuth distribution thermodynamic diagram of the sound source at the at least one time frame, the azimuth distribution thermodynamic diagram being used to describe the distribution heat of the sound source at different azimuths at the at least one time frame.
In an optional embodiment, the sound source position information includes confidence levels of sound sources in various positions; the processor 91, when converting the sound source bearing information at each target time frame into a bearing distribution thermodynamic diagram for the sound source at least one time frame, is configured to:
respectively determining the display brightness corresponding to each azimuth in at least one time frame according to the confidence coefficient of the sound source in each azimuth in at least one time frame based on the corresponding relation between the confidence coefficient and the display brightness, wherein the distribution heat represented by different display brightness is different;
and generating an azimuth distribution thermodynamic diagram of the sound source under at least one time frame according to the display brightness.
In an alternative embodiment, the processor 91, when generating the azimuth distribution thermodynamic diagram of the sound source for at least one time frame according to the display brightness, is configured to:
respectively determining image contents corresponding to at least one time frame according to the display brightness corresponding to each direction under the at least one time frame;
and sequentially arranging the image contents corresponding to the at least one time frame according to the time sequence among the at least one time frame so as to generate the azimuth distribution thermodynamic diagram.
In an alternative embodiment, if the image data is an azimuthal distribution thermodynamic diagram of the sound source in at least one time frame, the processor 91, when performing image recognition on the image stream by using an image recognition model to perform sound source tracking in the target time period, is configured to:
extracting image features in an orientation distribution thermodynamic diagram in an image recognition model;
and determining the attribute parameters of the target sound source in at least one time frame for sound source tracking based on the mapping relation between the image features and the sound source attribute parameters and the image features extracted from the azimuth distribution thermodynamic diagram.
In an alternative embodiment, the sound source property parameters comprise one or more of azimuth, number, utterance duration and covered time frame.
In an alternative embodiment, the processor 91 is further configured to:
acquiring sample thermodynamic diagrams corresponding to a plurality of sample time frame groups respectively, wherein the sample thermodynamic diagrams are used for describing the distribution heat of a sound source in different directions under the sample time frames;
marking the sound source attribute parameters for each sample thermodynamic diagram to obtain marking information corresponding to each sample thermodynamic diagram;
and inputting each sample thermodynamic diagram and the corresponding labeling information thereof into the image recognition model so that the image recognition model learns the mapping relation between the image characteristics and the sound source attribute parameters.
In an alternative embodiment, the processor 91, when converting the sound source bearing information at each target time frame into a bearing distribution thermodynamic diagram for the sound source at least one time frame, is configured to:
inputting the information stream into an image recognition model;
in the image recognition model, the sound source orientation information at each target time frame is converted into an orientation distribution thermodynamic diagram of the sound source at least one time frame.
In an alternative embodiment, the processor 91 is further configured to:
acquiring sample information streams corresponding to the sample time frame groups;
marking sound source attribute parameters for each sample information stream to obtain marking information corresponding to each sample information stream;
and inputting each sample information stream and the corresponding labeling information thereof into the image recognition model so that the image recognition model converts each sample information stream into visual data describing the azimuth distribution state of the sound source and learns the mapping relation between the image characteristics and the sound source attribute parameters.
In an alternative embodiment, the acoustic signal stream includes a time-domain signal stream collected by each array element in the microphone array, and the processor 91, when performing sound source bearing estimation based on the acoustic signal stream to obtain sound source bearing information in at least one time frame, is configured to:
respectively converting the time domain signal streams collected by each array element into time-frequency domain signals;
and determining sound source azimuth information under at least one time frame according to the time-frequency domain signals under each array element by adopting a sound source azimuth estimation technology.
In an alternative embodiment, the sound source location estimation technique includes one or more of a steered beam response phase transform technique (SRP-PHAT), a generalized cross-correlation phase transform technique (GCC-PHAT), or a multiple signal classification technique (MUSIC).
It should be noted that, for the sake of brevity, the technical details of this computing device embodiment may be found in the related descriptions of the sound source tracking method embodiments of fig. 1 and fig. 7 and are not repeated here; this should not be construed as limiting the scope of the present application.
Further, as shown in fig. 9, the computing device further includes: a communication component 93, a power supply component 94, and the like. Only some of the components are schematically shown in fig. 9, which does not mean that the computing device includes only the components shown in fig. 9.
Accordingly, the present application further provides a computer-readable storage medium storing a computer program which, when executed, can implement the steps that can be executed by the computing device in the foregoing method embodiments.
Fig. 10 is a schematic structural diagram of a sound source tracking system according to an exemplary embodiment of the present application. Referring to fig. 10, the sound source tracking system may include: microphone array 10 and computing device 20, microphone array 10 and computing device 20 are communicatively coupled.
The sound source tracking system provided by this embodiment can be applied to various scenarios, for example, a voice control scenario, an audio and video conference scenario, or other scenarios requiring sound source tracking; the application scenario is not limited in this embodiment. In different application scenarios, the sound source tracking system provided by this embodiment may be integrally deployed in various scenario devices: in a voice control scenario, for example, it may be deployed in an intelligent sound box or an intelligent robot, and in an audio and video conference scenario, it may be deployed in various conference terminals.
Wherein the microphone array 10 can be used for picking up acoustic signals. In this embodiment, the number and arrangement form of the array elements of the microphone array 10 are not limited.
For details of the technology involved in the computing device, reference may be made to the relevant descriptions in the embodiments associated with fig. 6 and fig. 9, which are not repeated here for brevity; this should not be construed as limiting the scope of the present application.
The memory of fig. 6 and 9 described above is used to store computer programs and may be configured to store various other data to support operations on the computing platform. Examples of such data include instructions for any application or method operating on the computing platform, contact data, phonebook data, messages, pictures, videos, and so forth. The memory may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The communication components of fig. 6 and 9 described above are configured to facilitate wired or wireless communication between the device in which the communication component is located and other devices. The device where the communication component is located can access a wireless network based on a communication standard, such as WiFi, 2G, 3G, 4G/LTE, 5G or other mobile communication networks, or a combination thereof. In an exemplary embodiment, the communication component receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
The power supply components of fig. 6 and 9 described above provide power to the various components of the device in which the power supply components are located. The power components may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device in which the power component is located.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer readable medium does not include a transitory computer readable medium such as a modulated data signal and a carrier wave.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a(n) …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (20)

1. A sound source tracking method, comprising:
acquiring an acoustic signal stream acquired by a microphone array under at least one time frame;
performing a sound source bearing estimation based on the acoustic signal stream to obtain an information stream containing sound source bearing information for the at least one time frame;
converting the information flow into visual data describing the azimuth distribution state of the sound source;
and tracking the sound source according to the visual data.
2. The method of claim 1, wherein said converting said stream of information into visualization data describing the azimuthal distribution of the sound source comprises:
converting the information stream into an azimuth distribution thermodynamic diagram of the sound source at the at least one time frame, the azimuth distribution thermodynamic diagram being used to describe a distribution heat of the sound source at different azimuths at the at least one time frame.
3. The method according to claim 2, wherein the sound source azimuth information includes a confidence level of the sound source at each azimuth; the converting the stream of information into an azimuthal distribution thermodynamic diagram of a sound source at the at least one time frame, comprising:
respectively determining display brightness corresponding to each azimuth in the at least one time frame according to the confidence coefficient of the sound source in each azimuth in the at least one time frame based on the corresponding relation between the confidence coefficient and the display brightness, wherein different display brightness represents different distribution heat degrees;
and generating an azimuth distribution thermodynamic diagram of the sound source at the at least one time frame according to the display brightness.
4. The method of claim 3, wherein generating an azimuthal distribution thermodynamic diagram of the sound source at the at least one time frame based on the display brightness comprises:
respectively determining image contents corresponding to the at least one time frame according to the display brightness corresponding to each direction under the at least one time frame;
and sequentially arranging the image contents corresponding to the at least one time frame according to the time sequence among the at least one time frame so as to generate the orientation distribution thermodynamic diagram.
5. The method of claim 1, wherein said performing sound source tracking based on said visualization data comprises:
and tracking the sound source by using a machine learning model and the visual data.
6. The method of claim 5, wherein if the visualization data is an azimuthal distribution thermodynamic map of the sound source at the at least one time frame, the performing sound source tracking using a machine learning model and the visualization data comprises:
extracting image features in the orientation distribution thermodynamic diagram in the machine learning model;
and determining the attribute parameters of the target sound source at the at least one time frame for sound source tracking based on the mapping relation between the image features and the sound source attribute parameters and the image features extracted from the azimuth distribution thermodynamic diagram.
7. The method of claim 6, wherein the sound source property parameters include one or more of azimuth, number, utterance duration, and covered time frame.
8. The method of claim 6, further comprising:
acquiring sample thermodynamic diagrams corresponding to a plurality of sample time frame groups respectively, wherein the sample thermodynamic diagrams are used for describing distribution heat of a sound source in different directions under the sample time frames;
marking the sound source attribute parameters for each sample thermodynamic diagram to obtain marking information corresponding to each sample thermodynamic diagram;
and inputting the sample thermodynamic diagrams and the corresponding labeling information thereof into the machine learning model so as to enable the machine learning model to learn the mapping relation between the image characteristics and the sound source attribute parameters.
9. The method of claim 6, further comprising:
acquiring sample information streams corresponding to the sample time frame groups;
marking sound source attribute parameters for each sample information stream to obtain marking information corresponding to each sample information stream;
and inputting the sample information streams and the corresponding labeling information into the machine learning model so that the machine learning model converts the sample information streams into visual data describing the azimuth distribution state of the sound source and learns the mapping relation between the image characteristics and the sound source attribute parameters.
10. The method of claim 9, wherein said converting said stream of information into visualization data describing the azimuthal distribution of the sound source comprises:
inputting the information stream into a machine learning model;
in the machine learning model, the information flow is converted into visual data describing the azimuthal distribution state of the sound source.
11. The method of claim 1, wherein the acoustic signal streams comprise time domain signal streams collected by each array element in the microphone array, and wherein performing sound source bearing estimation based on the acoustic signal streams to obtain an information stream comprising sound source bearing information at the at least one time frame comprises:
respectively converting the time domain signal streams collected by each array element into time-frequency domain signals;
and determining the sound source azimuth information under at least one time frame according to the time-frequency domain signals under each array element by adopting a sound source azimuth estimation technology.
12. The method of claim 11, wherein the sound source localization estimation technique comprises one or more of a steered beam response phase transform technique (SRP-PHAT), a generalized cross-correlation phase transform technique (GCC-PHAT), or a multiple signal classification technique (MUSIC).
13. A sound source tracking method, comprising:
respectively determining sound source azimuth information in at least one time frame in a target time period;
converting the sound source azimuth information at the at least one time frame into at least one set of image data describing an azimuth distribution state of the sound source to form an image stream;
and carrying out image recognition on the image stream by using an image recognition model so as to track the sound source in the target time interval.
14. The method of claim 13, wherein converting the sound source bearing information at the at least one time frame into at least one set of image data describing a bearing distribution state of a sound source to form an image stream comprises:
determining a target time frame located in the current identification period from the at least one time frame based on a preset identification precision;
converting the sound source azimuth information in each target time frame into a group of image data describing the azimuth distribution state of the sound source in the current identification period;
and continuing to determine the time frame and the image data in the next identification period in the target period until the image data corresponding to all the identification periods in the target period is generated to form the image stream.
15. An acoustic source tracking device, comprising:
the acquisition module is used for acquiring an acoustic signal stream acquired by the microphone array under at least one time frame;
a calculation module for performing a sound source bearing estimation based on the acoustic signal stream to obtain an information stream comprising sound source bearing information for the at least one time frame;
the conversion module is used for converting the information flow into visual data describing the azimuth distribution state of the sound source;
and the tracking module is used for tracking the sound source according to the visual data.
16. A computing device comprising a memory and a processor;
the memory is to store one or more computer instructions;
the processor is coupled with the memory for executing the one or more computer instructions for:
acquiring an acoustic signal stream acquired by a microphone array under at least one time frame;
performing a sound source bearing estimation based on the acoustic signal stream to obtain an information stream containing sound source bearing information for the at least one time frame;
converting the information flow into visual data describing the azimuth distribution state of the sound source;
and tracking the sound source according to the visual data.
17. An acoustic source tracking device, comprising:
the determining module is used for respectively determining the sound source azimuth information in at least one time frame in the target time period;
a conversion module for converting the sound source azimuth information at the at least one time frame into at least one set of image data describing an azimuth distribution state of the sound source to form an image stream;
and the tracking module is used for carrying out image recognition on the image stream by using an image recognition model so as to track the sound source in the target time interval.
18. A computing device comprising a memory and a processor;
the memory is to store one or more computer instructions;
the processor is coupled with the memory for executing the one or more computer instructions for:
respectively determining sound source azimuth information in at least one time frame in a target time period;
converting the sound source azimuth information at the at least one time frame into at least one set of image data describing an azimuth distribution state of the sound source to form an image stream;
and carrying out image recognition on the image stream by using an image recognition model so as to track the sound source in the target time interval.
19. A sound source tracking system, comprising: a microphone array and a computing device, the microphone array in communicative connection with the computing device;
the microphone array is used for acquiring acoustic signals;
the computing device is used for acquiring an acoustic signal stream acquired by the microphone array under at least one time frame; performing a sound source bearing estimation based on the acoustic signal stream to obtain an information stream containing sound source bearing information for the at least one time frame; converting the information flow into visual data describing the azimuth distribution state of the sound source; and tracking the sound source according to the visual data.
20. A computer-readable storage medium storing computer instructions, which when executed by one or more processors, cause the one or more processors to perform the sound source tracking method of any one of claims 1-14.