WO2022078249A1 - Sound source tracking method and apparatus, and device, system and storage medium - Google Patents

Sound source tracking method and apparatus, and device, system and storage medium

Info

Publication number
WO2022078249A1
Authority
WO
WIPO (PCT)
Prior art keywords
sound source
time frame
information
tracking
Prior art date
Application number
PCT/CN2021/122742
Other languages
French (fr)
Chinese (zh)
Inventor
黄伟隆
李威
冯津伟
Original Assignee
Alibaba Group Holding Limited (阿里巴巴集团控股有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Limited (阿里巴巴集团控股有限公司)
Publication of WO2022078249A1 publication Critical patent/WO2022078249A1/en

Classifications

    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01S RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S5/00 Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations
    • G01S5/18 Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations using ultrasonic, sonic, or infrasonic waves
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R1/00 Details of transducers, loudspeakers or microphones
    • H04R1/08 Mouthpieces; Microphones; Attachments therefor

Definitions

  • The present application relates to the technical field of data processing, and in particular to a sound source tracking method, apparatus, device, system, and storage medium.
  • Sound source tracking based on a microphone array has become a popular technology in the field of acoustic signal processing in recent years.
  • Conventional sound source tracking technology usually performs signal-level processing, such as filtering, taking extreme values, calculating the fundamental frequency, and calculating the azimuth angle relative to the microphone array, in order to track the sound source.
  • Various aspects of the present application provide a sound source tracking method, apparatus, device, system, and storage medium, so as to improve the accuracy of sound source tracking.
  • An embodiment of the present application provides a sound source tracking method, including:
  • An embodiment of the present application also provides a sound source tracking method, including:
  • Image recognition is performed on the image stream using an image recognition model, so as to perform sound source tracking within the target time period.
  • An embodiment of the present application also provides a sound source tracking apparatus, including:
  • an acquisition module, configured to acquire the acoustic signal stream collected by the microphone array in at least one time frame;
  • a calculation module, configured to perform sound source azimuth estimation based on the acoustic signal stream to obtain an information stream including sound source azimuth information under the at least one time frame;
  • a conversion module, configured to convert the information stream into visualization data describing the azimuth distribution state of the sound source; and
  • a tracking module, configured to perform sound source tracking according to the visualization data.
  • Embodiments of the present application also provide a computing device, including a memory and a processor;
  • the memory is used for storing one or more computer instructions; and
  • the processor, coupled to the memory, is used for executing the one or more computer instructions for:
  • An embodiment of the present application also provides another sound source tracking apparatus, including:
  • a determining module, configured to determine the azimuth information of the sound source respectively under at least one time frame within the target period;
  • a conversion module, configured to convert the azimuth information of the sound source in the at least one time frame into at least one set of image data describing the azimuth distribution state of the sound source, so as to form an image stream; and
  • a tracking module, used for performing image recognition on the image stream by using an image recognition model, so as to perform sound source tracking within the target period.
  • Embodiments of the present application also provide a computing device, including a memory and a processor;
  • the memory is used for storing one or more computer instructions; and
  • the processor, coupled to the memory, is used for executing the one or more computer instructions for:
  • Image recognition is performed on the image stream using an image recognition model for sound source tracking within the target time period.
  • Embodiments of the present application further provide a sound source tracking system, including: a microphone array and a computing device, where the microphone array is communicatively connected to the computing device;
  • the microphone array is used for collecting acoustic signals; and
  • the computing device is configured to: acquire the acoustic signal stream collected by the microphone array in at least one time frame; perform sound source azimuth estimation based on the acoustic signal stream to obtain an information stream containing sound source azimuth information under the at least one time frame; convert the information stream into visualization data describing the azimuth distribution state of the sound source; and track the sound source according to the visualization data.
  • Embodiments of the present application further provide a computer-readable storage medium storing computer instructions, which, when executed by one or more processors, cause the one or more processors to execute the aforementioned sound source tracking method.
  • In the embodiments of the present application, sound source azimuth estimation may be performed on the acoustic signal stream collected by the microphone array in at least one time frame, so as to determine the sound source azimuth information under the at least one time frame respectively; the information stream containing the sound source azimuth information is converted into visualization data describing the azimuth distribution state of the sound source, and sound source tracking is performed based on the visualization data.
  • In this way, the traditional approach of tracking the sound source at the acoustic signal processing level is subverted, and sound source tracking is instead performed at the visual analysis level.
  • The visualization data can accurately and comprehensively reflect the azimuth distribution state of the sound source, which ensures the accuracy and comprehensiveness of the basis of the visual analysis and avoids robustness problems; moreover, in the process of visual analysis, the analyzed field of view can cover more time frames, so noise within the field of view can be found and noise interference can be avoided. Accordingly, in the embodiments of the present application, the accuracy of sound source tracking can be effectively improved, as can the adaptability to various complex environments.
  • FIG. 1 is a schematic flowchart of a sound source tracking method provided by an exemplary embodiment of the present application
  • FIG. 2 is a schematic logical diagram of a sound source tracking solution provided by an exemplary embodiment of the present application
  • FIG. 3 is a schematic diagram of sound source azimuth information provided by an exemplary embodiment of the present application.
  • FIG. 4 is a schematic diagram of a heat map of azimuth distribution of a sound source according to an exemplary embodiment of the present application
  • FIG. 5 is a schematic structural diagram of a sound source tracking device according to an exemplary embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of a computing device according to another exemplary embodiment of the present application.
  • FIG. 7 is a flowchart of another sound source tracking method provided by an exemplary embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of another sound source tracking device provided by an exemplary embodiment of the present application.
  • FIG. 9 is a schematic structural diagram of another computing device provided by an exemplary embodiment of the present application.
  • FIG. 10 is a schematic structural diagram of a sound source tracking system provided by an exemplary embodiment of the present application.
  • In the embodiments of the present application, the information stream containing the sound source azimuth information can be converted into visualization data describing the azimuth distribution state of the sound source, and sound source tracking can be performed based on the visualization data. This subverts the traditional way of tracking the sound source at the acoustic signal processing level and instead works at the visual analysis level. Accordingly, in the embodiments of the present application, the accuracy of sound source tracking can be effectively improved, and the adaptability to various complex environments can be improved.
  • FIG. 1 is a schematic flowchart of a sound source tracking method provided by an exemplary embodiment of the present application.
  • FIG. 2 is a schematic logical diagram of a sound source tracking solution provided by an exemplary embodiment of the present application.
  • The sound source tracking method provided in this embodiment may be performed by a sound source tracking apparatus, which may be implemented as software or as a combination of software and hardware, and may be integrated in a computing device. As shown in FIG. 1, the method includes:
  • Step 100: Acquire an acoustic signal stream collected by the microphone array in at least one time frame;
  • Step 101: Perform sound source azimuth estimation based on the acoustic signal stream to obtain an information stream containing sound source azimuth information under the at least one time frame;
  • Step 102: Convert the information stream into visualization data describing the azimuth distribution state of the sound source;
  • Step 103: Track the sound source according to the visualization data.
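For orientation, steps 100-103 can be read as a single processing pipeline. The sketch below is a minimal, hypothetical Python rendering of that pipeline; the function name, the crude argmax "analysis", and the idea of passing in ready-made per-frame confidence vectors are all illustrative assumptions, not something this application prescribes (real confidences would come from the azimuth estimation of step 101, and the analysis of step 103 is described later as a machine learning model).

```python
import numpy as np

def sound_source_tracking(frames):
    """Steps 100-103 as one pipeline. `frames` is a list of per-frame
    azimuth confidence vectors, standing in for the step-101 estimates."""
    # Step 101: stack per-frame azimuth confidences -> information stream.
    info_stream = np.stack(frames)                      # [n_frames, n_azimuths]
    # Step 102: information stream -> visualization data (grayscale heat map).
    norm = info_stream - info_stream.min()
    heatmap = (255 * norm / (norm.max() + 1e-12)).astype(np.uint8)
    # Step 103: visual analysis; the brightest column per row is a crude
    # stand-in for the machine-learning analysis described below.
    return heatmap.argmax(axis=1)                       # azimuth index per frame
```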
  • The sound source tracking method provided in this embodiment can be applied to various scenarios, for example, a voice control scenario, an audio and video conference scenario, or other scenarios requiring sound source tracking; the application scenario is not limited in this embodiment.
  • Moreover, the sound source tracking method provided in this embodiment can be integrated into various scenario devices.
  • In a voice control scenario, the scenario device may be a smart speaker, an intelligent robot, etc.
  • In an audio and video conference scenario, the scenario devices may be various conference terminals, etc.
  • In this embodiment, a microphone array may be used to collect the acoustic signal stream.
  • The microphone array may be a group of arrays composed of multiple array elements; the number of array elements in the microphone array is not limited in this embodiment.
  • This embodiment also does not limit the arrangement of the microphone array: it may be an annular array, a linear array, a planar array, a stereoscopic array, and so on. In different application scenarios, the microphone array can be assembled in various types of scenario devices as required.
  • The signal acquisition process of the microphone array is usually a continuous process; therefore, in this embodiment, subsequent processing may be performed in the form of an acoustic signal stream.
  • At least one time frame may be selected within a single recognition period, and a single time frame may be used as a processing unit.
  • The length of a single recognition period is adapted to the recognition accuracy. For example, if the recognition accuracy is 1 s, that is, a sound source tracking result is produced every 1 s, the length of a single recognition period can be set to 1 s.
  • In this case, the acoustic signal stream formed by the acoustic signals in at least one time frame within 1 s can be acquired at a time as the processing object of the subsequent steps. In practical applications, under different application scenarios, at least one time frame can be selected on demand within the target period.
  • For example, at least one time frame may be selected within the target period by frame skipping or by sampling at a varying frame rate.
  • Alternatively, all time frames in the target period may be selected; this is not limited in this embodiment.
  • The frame length of the time frame may be configured according to actual requirements; for example, the frame length of a single time frame may be configured as 20 ms.
  • The number of time frames in the recognition period can also be set as required; for example, if the recognition period is 1 s, 3 time frames can be selected in the recognition period for sound source tracking over time.
  • However, the frame length of the time frame and the number of time frames in the recognition period are not limited to these examples.
  • In addition, the frame lengths of different time frames in the recognition period need not be exactly the same, which is not limited in this embodiment.
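As a concrete illustration of the frame selection described above, the snippet below slices a continuous multichannel recording into 20 ms frames and picks three of them per 1 s recognition period by frame skipping. This is a sketch under example parameters from the text; the helper name and the evenly-spaced skipping policy are assumptions.

```python
import numpy as np

def select_frames(signal, sr, frame_ms=20, period_s=1.0, frames_per_period=3):
    """signal: [n_mics, n_samples] multichannel recording.
    Returns one list of [n_mics, frame_len] frames per recognition period."""
    frame_len = int(sr * frame_ms / 1000)
    period_len = int(sr * period_s)
    periods = []
    for p0 in range(0, signal.shape[1] - period_len + 1, period_len):
        n_frames = period_len // frame_len
        # Frame skipping: evenly spaced frame indices within the period.
        picks = np.linspace(0, n_frames - 1, frames_per_period).astype(int)
        periods.append([signal[:, p0 + i * frame_len: p0 + (i + 1) * frame_len]
                        for i in picks])
    return periods
```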
  • The sound source tracking method provided in this embodiment can be applied to real-time sound source tracking scenarios as well as to offline sound source tracking scenarios; according to the recognition accuracy, sound source tracking is successively performed in each recognition period.
  • In this embodiment, each array element in the microphone array can be used to collect time-domain signals respectively.
  • Taking a microphone array with M array elements as an example, M channels of time-domain signal streams can be collected in at least one time frame as the acoustic signal stream acquired in step 100.
  • In step 101, sound source azimuth estimation may be performed based on the acoustic signal stream to obtain an information stream including sound source azimuth information under the at least one time frame.
  • In this embodiment, a sound source azimuth estimation technique may be used to perform signal processing on the acoustic signal stream, so as to determine the sound source azimuth information in at least one time frame respectively.
  • The sound source azimuth information is used to represent the azimuth data of the sound source under the time frame.
  • The azimuth data may be the confidence levels of the sound source at each azimuth.
  • That is, the sound source azimuth information may at least include the confidence levels of the sound source at each azimuth in the time frame.
  • The set of azimuths involved in the sound source azimuth information can be configured according to actual needs; for example, 360 azimuths, 120 azimuths, 60 azimuths, etc. can be configured for the entire circumference of the microphone array.
  • The coverage of the microphone array may also be less than a full circle; for example, a front face covering a range of 180° may be configured with 180 azimuths, etc., which is not limited in this embodiment.
  • FIG. 3 is a schematic diagram of sound source azimuth information provided by an exemplary embodiment of the present application.
  • The sound source azimuth information is visualized in FIG. 3, but it should be understood that FIG. 3 is only for the convenience of describing the sound source azimuth information, and this should not be construed as limiting the data form of the sound source azimuth information in this embodiment.
  • For example, the sound source azimuth information can be expressed as [1, 3, 5, 60, 70, 80, 90, 80, 70, ...].
  • Each number in [ ] can represent the confidence of the sound source in one of 360 azimuths, respectively.
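To make the data form concrete, here is a small, hypothetical confidence vector of the kind just described, and how a peak azimuth could be read from it; the values are invented for illustration, not taken from a real measurement.

```python
import numpy as np

# Hypothetical per-frame sound source azimuth information: one confidence
# value per azimuth (360 azimuths covering the full circle, as above).
confidences = np.zeros(360)
confidences[85:96] = [60, 70, 80, 90, 80, 70, 60, 50, 40, 30, 20]

peak_azimuth = int(np.argmax(confidences))  # -> 88, the most confident azimuth
```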
  • In step 101, the determination of sound source azimuth information is performed in units of time frames.
  • As mentioned above, what is acquired in step 100 is the acoustic signal stream collected by each of the M array elements under at least one time frame.
  • For each time frame, the acoustic signal collected under that time frame is used to estimate the sound source azimuth, so as to determine the sound source azimuth information under that time frame.
  • In an optional implementation, the time-domain signal streams collected by each array element can be converted into time-frequency domain signals respectively; a sound source azimuth estimation technique is then used to determine, from the time-frequency domain signals of each array element, the sound source azimuth information under the at least one time frame.
  • For ease of description, a target time frame in the at least one time frame is taken as an example, where the target time frame may be any one of the at least one time frame.
  • First, the time-domain signals collected by each array element under the target time frame can be converted into time-frequency domain signals.
  • For example, the time-domain signal may be decomposed into sub-bands to obtain time-frequency domain signals, and the sub-band decomposition process may be implemented based on the short-time Fourier transform and/or a filter bank, etc., which is not limited herein. Accordingly, the time-frequency domain signals corresponding to each array element under the target time frame can be obtained.
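A minimal sketch of this sub-band decomposition step, assuming the short-time Fourier transform route and SciPy's standard stft helper (the application itself does not prescribe a particular implementation, and the FFT size is an example value):

```python
import numpy as np
from scipy.signal import stft

def to_time_frequency(frame, sr, n_fft=320):
    """frame: [n_mics, n_samples] time-domain signals for one time frame.
    Returns X: [n_mics, n_bins, n_segments] complex time-frequency signals."""
    _, _, X = stft(frame, fs=sr, nperseg=n_fft, axis=-1)
    return X
```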
  • Then, the sound source azimuth can be estimated from the time-frequency domain signals corresponding to each array element under the target time frame, so as to output the sound source azimuth information under the target time frame.
  • Among them, sound source azimuth estimation techniques include but are not limited to steered response power with phase transform (SRP-PHAT), generalized cross-correlation with phase transform (GCC-PHAT), multiple signal classification (MUSIC), and so on.
  • The principle of sound source azimuth estimation may be as follows: according to the acoustic signals collected by different microphones in the microphone array at the same time, candidate azimuth ranges of the sound source are calculated separately, and the sound source azimuth is then estimated from the multiple azimuth ranges.
  • Of course, this is only exemplary, and the present embodiment is not limited thereto. This embodiment does not limit the sound source azimuth estimation technique used, and the processing procedures of the various sound source azimuth estimation techniques are not described in detail.
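As one concrete instance of the techniques named above, the sketch below computes a GCC-PHAT cross-correlation for a single microphone pair and converts the resulting time difference of arrival into an azimuth under a far-field, free-field assumption with known geometry. It is a simplified illustration of the family of estimators mentioned, not the estimator this application commits to.

```python
import numpy as np

def gcc_phat_azimuth(x1, x2, sr, mic_distance, c=343.0):
    """Estimate a source azimuth from one microphone pair via GCC-PHAT.
    x1, x2: time-domain signals of the two mics; mic_distance in meters."""
    n = len(x1) + len(x2)
    X1, X2 = np.fft.rfft(x1, n), np.fft.rfft(x2, n)
    cross = X1 * np.conj(X2)
    cross /= np.abs(cross) + 1e-12                  # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n)
    max_shift = max(1, int(sr * mic_distance / c))  # physically possible delays
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    tdoa = (np.argmax(np.abs(cc)) - max_shift) / sr
    # Far-field model: tdoa = mic_distance * cos(theta) / c.
    return np.degrees(np.arccos(np.clip(tdoa * c / mic_distance, -1.0, 1.0)))
```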
  • Through step 101, an information stream can be obtained, and the information stream includes the sound source azimuth information in the at least one time frame.
  • In step 102, the information stream can be converted into visualization data describing the azimuth distribution state of the sound source.
  • As mentioned above, the sound source azimuth information may include description information in dimensions such as time and azimuth; on this basis, the sound source azimuth information may be converted into description information of the azimuth distribution state.
  • For example, the azimuth information of the sound source may include the confidence that the sound source is located in different azimuths; the confidence may then be converted into an adapted display brightness, and the display brightness in different azimuths is used as the description information of the azimuth distribution state. It is worth noting that in the process of conversion into visualization data, nothing in the sound source azimuth information is lost; only the representation form of the sound source azimuth information is converted, which ensures that the visualization data in this embodiment is an accurate and comprehensive description of the azimuth distribution of the sound source.
  • In this embodiment, the visualization data may be a heat map of the azimuth distribution of the sound source.
  • In this case, the display brightness at each azimuth under each time frame can be obtained, so as to describe the azimuth distribution state from the three dimensions of time frame, azimuth, and display brightness.
  • Of course, the visualization data is not limited to this.
  • For example, the visualization data may also be a three-dimensional stereogram used to represent the azimuth information of the sound source in the at least one time frame.
  • For instance, the display curves of the sound source azimuth information in FIG. 3 may be arranged in time to obtain a three-dimensional stereogram.
  • In practice, the information stream can be converted into various forms of visualization data.
  • In the conversion process, the azimuth information of each sound source can be fully retained, so in step 102, the sound source tracking process can be switched from the acoustic signal processing level to the visualization processing level.
  • In step 103, sound source tracking may be performed according to the visualization data.
  • In this embodiment, the number of sound sources is not limited, and the number of sound sources may be one or more.
  • In this embodiment, the sound source in the at least one time frame can be tracked by performing visual analysis on the visualization data, thereby converting the acoustic signal processing problem into a visual analysis problem.
  • The visualization data can accurately and comprehensively reflect the azimuth distribution state of the sound source, which ensures the accuracy and comprehensiveness of the basis of the visual analysis and avoids robustness problems; moreover, in the process of visual analysis, the analyzed field of view can cover more time frames instead of being limited to a single time frame, so noise in the field of view can be found and noise interference can be avoided. This effectively avoids the shortcomings of poor robustness and insufficient generalization ability in the traditional acoustic signal processing process.
  • In the embodiments of the present application, the information stream including the sound source azimuth information can be converted into visualization data describing the azimuth distribution state of the sound source, and sound source tracking can be performed based on the visualization data.
  • In an optional implementation, the information stream may be converted into a heat map of the azimuth distribution of the sound source in the at least one time frame, as the basis for tracking the sound source in the at least one time frame.
  • The azimuth distribution heat map is used to describe the distribution heat of the sound source in different azimuths under the at least one time frame.
  • As mentioned above, the sound source azimuth information includes the confidence levels that the sound source is located at each azimuth.
  • On this basis, the display brightness corresponding to each azimuth can be determined under the at least one time frame respectively; according to the display brightness, a heat map of the azimuth distribution of the sound source in the at least one time frame can be generated, where different display brightness represents different distribution heat.
  • For example, the higher the confidence level, the brighter the display may be; however, this embodiment is not limited thereto, and the higher the confidence level, the lower the display brightness may instead be. In general, there is a monotonic correspondence between confidence and display brightness, so that confidence is accurately reflected through display brightness.
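A minimal sketch of this confidence-to-brightness conversion, assuming the default orientation described above (brighter pixel = higher confidence) and a simple linear normalization; both choices are illustrative:

```python
import numpy as np

def stream_to_heatmap(info_stream):
    """info_stream: [n_frames, n_azimuths] confidence values.
    Returns an 8-bit grayscale heat map: one row per time frame, one
    column per azimuth, display brightness proportional to confidence."""
    lo, hi = info_stream.min(), info_stream.max()
    norm = (info_stream - lo) / (hi - lo + 1e-12)
    return (norm * 255).astype(np.uint8)
```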
  • FIG. 4 is a schematic diagram of a heat map of azimuth distribution of a sound source according to an exemplary embodiment of the present application.
  • As shown in FIG. 4, the vertical axis of the heat map is the time frame, and the horizontal axis is the azimuth.
  • In FIG. 4, the number of time frames is 800, and the number of azimuths is configured to be 120, which together represent the entire circumferential space of the microphone array.
  • In an optional implementation, the image content corresponding to each of the at least one time frame may be determined according to the display brightness corresponding to each azimuth under that time frame; according to the time sequence of the at least one time frame, the image content corresponding to each time frame is then arranged in order to generate the azimuth distribution heat map.
  • In this way, the sound source azimuth information under each time frame can be converted into one horizontal row in the heat map.
  • In each row, the display brightness of the pixel corresponding to each azimuth is determined according to the corresponding confidence: a pixel with higher confidence is displayed brighter.
  • For example, the confidence level corresponding to the peak position in FIG. 3 is the highest, and when converted to the heat map, the display brightness at the azimuth corresponding to the peak position is the brightest.
  • The azimuth distribution heat map can also be generated azimuth by azimuth.
  • In that case, for a target azimuth, the confidence that the sound source exists in the target azimuth in different time frames can be obtained, so as to determine the display brightness corresponding to each time frame in the target azimuth and generate the image content corresponding to the target azimuth, where the target azimuth may be any one of the azimuths.
  • In this way, the image content corresponding to each azimuth can be obtained, so that the image contents corresponding to the azimuths can be arranged in sequence according to the azimuth order, to generate the azimuth distribution heat map.
  • This embodiment does not limit the manner of generating the azimuth distribution heat map.
  • Through the above process, the azimuth information of the sound source in the at least one time frame can be converted into a heat map of the azimuth distribution of the sound source. Moreover, in the conversion process, all the content of the sound source azimuth information is preserved, which provides an accurate analysis basis for the visual analysis process and thus ensures the accuracy of the tracking results.
  • On this basis, sound source tracking can be performed using a machine learning model and the visualization data.
  • That is, a machine learning model can be used to perform the visual analysis so as to track sound sources.
  • For different forms of visualization data, different types of machine learning models can be selected, and a model training method adapted to the data form can be used to improve the performance of the machine learning model.
  • In an optional implementation, image features in the azimuth distribution heat map can be extracted; based on the mapping relationship between image features and sound source attribute parameters and on the image features extracted from the azimuth distribution heat map, the target sound source attribute parameters under the at least one time frame can be determined for sound source tracking.
  • The mapping relationship between image features and sound source attribute parameters can be configured into the machine learning model through model training.
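One plausible shape for such a model is a small convolutional network that reads the heat map image and predicts sound source attribute parameters. The sketch below (PyTorch) is an assumption about how the mapping could be realized; the two output heads, the layer sizes, and the class name HeatmapTracker are all hypothetical, as the application does not fix a particular architecture.

```python
import torch
import torch.nn as nn

class HeatmapTracker(nn.Module):
    """Maps an azimuth-distribution heat map [B, 1, n_frames, n_azimuths]
    to sound source attribute parameters (two illustrative heads)."""
    def __init__(self, n_azimuths=120, max_sources=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((8, 8)), nn.Flatten(),
        )
        self.count_head = nn.Linear(32 * 8 * 8, max_sources + 1)  # 0..max sources
        self.azimuth_head = nn.Linear(32 * 8 * 8, n_azimuths)     # per-azimuth activity

    def forward(self, heatmap):
        f = self.features(heatmap)
        return self.count_head(f), torch.sigmoid(self.azimuth_head(f))
```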
  • An exemplary model training process may be as follows:
  • acquire sample heat maps corresponding to several sample time frame groups; label sound source attribute parameters for each sample heat map to obtain labeling information corresponding to each sample heat map; and input each sample heat map and its corresponding labeling information into the machine learning model, for the machine learning model to learn the mapping relationship between image features and sound source attribute parameters.
  • Among them, the number of time frames in each sample time frame group may be consistent with the number of the at least one time frame in which the acoustic signal acquisition is performed in step 100. That is, the processing units in the model training process and in the model use process can be kept consistent. In this way, in the model training process, labeling is performed with a sample time frame group as the unit, and in the model use process, the tracking result can be output in units of the same number of time frames.
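A compact training loop consistent with that description might look as follows; the label encoding (a source count plus a 0/1 per-azimuth activity mask) and the loss combination are assumptions layered on the hypothetical HeatmapTracker sketched above.

```python
import torch
import torch.nn as nn

def train(model, loader, epochs=10, lr=1e-3):
    """loader yields (heatmap, count_label, azimuth_label) per sample group:
    heatmap [B, 1, n_frames, n_azimuths], count_label [B] (long),
    azimuth_label [B, n_azimuths] (float 0/1 activity per azimuth)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    ce, bce = nn.CrossEntropyLoss(), nn.BCELoss()
    for _ in range(epochs):
        for heatmap, count_label, azimuth_label in loader:
            count_logits, azimuth_prob = model(heatmap)
            loss = ce(count_logits, count_label) + bce(azimuth_prob, azimuth_label)
            opt.zero_grad()
            loss.backward()
            opt.step()
```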
  • In this embodiment, the sound source attribute parameters include one or more of azimuth, quantity, sounding duration, and covered time frames. Accordingly, after visual analysis, the machine learning model can output information such as the number of sound sources in the at least one time frame, their azimuths, their sounding durations, and the time frames they cover, as the tracking result.
  • It should be understood that the step of converting the information stream into visualization data describing the azimuth distribution state of the sound source can be performed inside or outside the machine learning model.
  • In one implementation, the process of converting the information stream into visualization data describing the azimuth distribution state of the sound source is performed outside the machine learning model, and the visualization data is used as the input parameter of the machine learning model.
  • In this case, during model training, the information stream corresponding to each sample time frame group can be converted into a sample heat map in advance, to serve as the basis for model training.
  • In another implementation, during model use, the information stream can be input directly into the machine learning model, and within the machine learning model the information stream is converted into visualization data describing the azimuth distribution state of the sound source.
  • That is, a functional module that converts the information stream into visualization data describing the azimuth distribution state of the sound source can be configured in the machine learning model, so that the information stream itself can be used as the input parameter of the machine learning model.
  • Within the machine learning model, the information stream is first converted into visualization data describing the azimuth distribution state of the sound source, and the visual analysis is then carried out.
  • Accordingly, during model training, the sample information streams corresponding to each sample time frame group can be obtained, and sound source attribute parameters can be labeled for each sample information stream to obtain the labeling information corresponding to each sample information stream;
  • each sample information stream and its corresponding labeling information are then input into the machine learning model, so that the machine learning model converts each sample information stream into visualization data describing the azimuth distribution state of the sound source and learns the mapping relationship between image features and sound source attribute parameters.
  • Through model training, the machine learning model learns an accurate mapping relationship between image features and sound source attribute parameters. Therefore, the trained machine learning model can be used to visually analyze the visualization data and output the sound source attribute information under the at least one time frame, so as to track the one or more sound sources that emit sound, based on the sound source attribute information.
  • This sound source tracking method can eliminate various kinds of noise interference in the tracking process, does not need a separate operation of finding the starting point, and avoids the deficiencies of processing at the acoustic signal level. As a result, the accuracy of the tracking result can be effectively improved, and the adaptability to various complex environments can be improved.
  • It should be noted that the execution subject of each step of the method provided in the above-mentioned embodiments may be the same device, or the method may be executed by different devices.
  • For example, the execution subject of steps 101 to 103 may be device A; for another example, the execution subject of steps 101 and 102 may be device A, and the execution subject of step 103 may be device B; and so on.
  • FIG. 5 is a schematic structural diagram of a sound source tracking device according to an exemplary embodiment of the present application.
  • As shown in FIG. 5, the sound source tracking apparatus includes:
  • an acquisition module 50, configured to acquire the acoustic signal stream collected by the microphone array in at least one time frame;
  • a calculation module 51, configured to perform sound source azimuth estimation based on the acoustic signal stream to obtain an information stream including sound source azimuth information under the at least one time frame;
  • a conversion module 52, configured to convert the information stream into visualization data describing the azimuth distribution state of the sound source; and
  • a tracking module 53, configured to track the sound source according to the visualization data.
  • In an optional embodiment, when converting the information stream into visualization data describing the azimuth distribution state of the sound source, the conversion module 52 is configured to:
  • convert the information stream into an azimuth distribution heat map of the sound source under the at least one time frame, where the azimuth distribution heat map is used to describe the distribution heat of the sound source in different azimuths under the at least one time frame.
  • In an optional embodiment, the sound source azimuth information includes the confidence that the sound source is in each azimuth; when converting the information stream into a heat map of the azimuth distribution of the sound source in the at least one time frame, the conversion module 52 is configured to:
  • determine the display brightness corresponding to each azimuth under the at least one time frame respectively, where different display brightness represents different distribution heat; and
  • generate, according to the display brightness, a heat map of the azimuth distribution of the sound source under the at least one time frame.
  • In an optional embodiment, when generating the heat map of the azimuth distribution of the sound source under the at least one time frame according to the display brightness, the conversion module 52 is configured to:
  • respectively determine the image content corresponding to each of the at least one time frame; and
  • sequentially arrange the image contents corresponding to the at least one time frame to generate the azimuth distribution heat map.
  • In an optional embodiment, when tracking the sound source according to the visualization data, the tracking module 53 is configured to:
  • In an optional embodiment, when using the machine learning model and the visualization data to track the sound source, the tracking module 53 is configured to:
  • determine the target sound source attribute parameters under the at least one time frame, so as to perform sound source tracking.
  • In an optional embodiment, the sound source attribute parameters include one or more of azimuth, quantity, sounding duration, and covered time frames.
  • In an optional embodiment, the tracking module 53 is further configured to:
  • acquire sample heat maps corresponding to several sample time frame groups, where each sample heat map is used to describe the distribution heat of the sound source in different azimuths under the corresponding sample time frame group; and
  • input each sample heat map and its corresponding labeling information into the machine learning model, so that the machine learning model can learn the mapping relationship between image features and sound source attribute parameters.
  • In an optional embodiment, when the tracking module 53 converts the information stream into visualization data describing the azimuth distribution state of the sound source, it is configured to:
  • convert the information stream into visualization data describing the azimuth distribution state of the sound source.
  • In an optional embodiment, the tracking module 53 is further configured to:
  • In an optional embodiment, the acoustic signal stream includes a time-domain signal stream collected by each array element in the microphone array. When the calculation module 51 performs sound source azimuth estimation based on the acoustic signal stream to obtain an information stream including sound source azimuth information under the at least one time frame, it is configured to:
  • convert the time-domain signal streams collected by each array element into time-frequency domain signals respectively; and
  • determine the sound source azimuth information under the at least one time frame according to the time-frequency domain signals of each array element.
  • In an optional embodiment, the sound source azimuth estimation technique includes one or more of steered response power with phase transform (SRP-PHAT), generalized cross-correlation with phase transform (GCC-PHAT), and multiple signal classification (MUSIC).
  • FIG. 6 is a schematic structural diagram of a computing device according to another exemplary embodiment of the present application. As shown in FIG. 6, the computing device includes: a memory 60 and a processor 61.
  • The processor 61, coupled to the memory 60, executes a computer program in the memory 60 for:
  • In an optional embodiment, when the processor 61 converts the information stream into visualization data describing the azimuth distribution state of the sound source, it is configured to:
  • convert the information stream into an azimuth distribution heat map of the sound source under the at least one time frame, where the azimuth distribution heat map is used to describe the distribution heat of the sound source in different azimuths under the at least one time frame.
  • In an optional embodiment, the sound source azimuth information includes the confidence that the sound source is in each azimuth; when the processor 61 converts the information stream into a heat map of the azimuth distribution of the sound source in the at least one time frame, the processor 61 is configured to:
  • determine the display brightness corresponding to each azimuth under the at least one time frame respectively, where different display brightness represents different distribution heat; and
  • generate, according to the display brightness, a heat map of the azimuth distribution of the sound source under the at least one time frame.
  • In an optional embodiment, when the processor 61 generates the heat map of the azimuth distribution of the sound source under the at least one time frame according to the display brightness, the processor 61 is configured to:
  • respectively determine the image content corresponding to each of the at least one time frame; and
  • sequentially arrange the image contents corresponding to the at least one time frame to generate the azimuth distribution heat map.
  • In an optional embodiment, when the processor 61 performs sound source tracking according to the visualization data, the processor 61 is configured to:
  • In an optional embodiment, when the processor 61 uses the machine learning model and the visualization data to track the sound source, it is configured to:
  • determine the target sound source attribute parameters under the at least one time frame, so as to perform sound source tracking.
  • In an optional embodiment, the sound source attribute parameters include one or more of azimuth, quantity, sounding duration, and covered time frames.
  • In an optional embodiment, the processor 61 is further configured to:
  • acquire sample heat maps corresponding to several sample time frame groups, where each sample heat map is used to describe the distribution heat of the sound source in different azimuths under the corresponding sample time frame group; and
  • input each sample heat map and its corresponding labeling information into the machine learning model, so that the machine learning model can learn the mapping relationship between image features and sound source attribute parameters.
  • In an optional embodiment, when the processor 61 converts the information stream into visualization data describing the azimuth distribution state of the sound source, the processor 61 is configured to:
  • convert the information stream into visualization data describing the azimuth distribution state of the sound source.
  • In an optional embodiment, the processor 61 is further configured to:
  • In an optional embodiment, the acoustic signal stream includes a time-domain signal stream collected by each array element in the microphone array. When the processor 61 performs sound source azimuth estimation based on the acoustic signal stream to obtain an information stream including sound source azimuth information under the at least one time frame, it is configured to:
  • convert the time-domain signal streams collected by each array element into time-frequency domain signals respectively; and
  • determine the sound source azimuth information under the at least one time frame according to the time-frequency domain signals of each array element.
  • In an optional embodiment, the sound source azimuth estimation technique includes one or more of steered response power with phase transform (SRP-PHAT), generalized cross-correlation with phase transform (GCC-PHAT), and multiple signal classification (MUSIC).
  • In addition, the computing device further includes: a microphone array 62, a communication component 63, a power supply component 64, and other components. Only some components are schematically shown in FIG. 6; this does not mean that the computing device includes only the components shown in FIG. 6.
  • FIG. 7 is a flowchart of another sound source tracking method provided by an exemplary embodiment of the present application.
  • The sound source tracking method provided in this embodiment may be performed by a sound source tracking apparatus, which may be implemented as software or as a combination of software and hardware, and may be integrated in a computing device. As shown in FIG. 7, the method includes:
  • Step 700: Determine the sound source azimuth information respectively under at least one time frame within the target period;
  • Step 701: Convert the azimuth information of the sound source in the at least one time frame into at least one set of image data describing the azimuth distribution state of the sound source, so as to form an image stream;
  • Step 702: Perform image recognition on the image stream by using the image recognition model, so as to track the sound source within the target period.
  • Among them, step 700 may include: acquiring an acoustic signal stream collected by the microphone array in the at least one time frame; and performing sound source azimuth estimation based on the acoustic signal stream to obtain the sound source azimuth information under the at least one time frame. To save space, the specific process is not repeated here.
  • In this embodiment, the azimuth information of the sound source in the at least one time frame may be converted into at least one set of image data.
  • For example, the image data may be an azimuth distribution heat map; in step 701, at least one azimuth distribution heat map can be obtained to form an image stream, which is input into the image recognition model.
  • The sound source tracking solution provided in this embodiment can be applied to scenarios such as real-time tracking or offline tracking.
  • In some scenarios, the azimuth information of the sound source in the at least one time frame can be obtained at one time, and the sound source azimuth information in the at least one time frame can be grouped according to the recognition accuracy, so that the operation of converting the azimuth information into image data can be performed group by group.
  • In other scenarios, the target time frames within the current recognition period can be determined from the at least one time frame based on the preset recognition accuracy, and the operation of converting the azimuth information into image data can be performed successively in each recognition period; the resulting image data is then input into the image recognition model in step 702.
  • For example, the recognition accuracy may be 1 s, in which case each recognition period covers 1 s of the signal.
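A small sketch of this grouping step, assuming per-frame azimuth information arrives with timestamps and the recognition accuracy is 1 s (example values from the text; the data layout and helper name are illustrative):

```python
from collections import defaultdict

def group_by_recognition_period(frames, period_s=1.0):
    """frames: iterable of (timestamp_seconds, azimuth_confidence_vector).
    Returns {period_index: [confidence vectors in that period]}, so that each
    group can be converted into one set of image data (one heat map)."""
    groups = defaultdict(list)
    for t, conf in frames:
        groups[int(t // period_s)].append(conf)
    return dict(groups)
```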
  • In this embodiment, the image recognition model can be pre-trained.
  • The image recognition model may adopt a machine learning model, and the training process of the image recognition model may refer to the relevant description in the embodiment associated with FIG. 1.
  • FIG. 8 is a schematic structural diagram of another sound source tracking apparatus provided by an exemplary embodiment of the present application.
  • As shown in FIG. 8, the sound source tracking apparatus includes:
  • a determination module 80, configured to determine the sound source azimuth information respectively under at least one time frame within the target period;
  • a conversion module 81, configured to convert the azimuth information of the sound source in the at least one time frame into at least one set of image data describing the azimuth distribution state of the sound source, so as to form an image stream; and
  • a tracking module 82, configured to perform image recognition on the image stream by using the image recognition model, so as to perform sound source tracking within the target period.
  • In an optional embodiment, when the conversion module 81 converts the azimuth information of the sound source in the at least one time frame into at least one set of image data describing the azimuth distribution state of the sound source to form an image stream, it is configured to:
  • In an optional embodiment, the determination module 80 includes an acquisition module 83 and a calculation module 84;
  • the acquisition module 83 is configured to acquire the acoustic signal stream collected by the microphone array in the at least one time frame; and
  • the calculation module 84 is configured to perform sound source azimuth estimation based on the acoustic signal stream, so as to obtain the sound source azimuth information in the at least one time frame.
  • In an optional embodiment, when converting the sound source azimuth information under each target time frame into a group of image data describing the azimuth distribution state of the sound source in the current recognition period, the conversion module 81 is configured to:
  • convert the sound source azimuth information under each target time frame into an azimuth distribution heat map of the sound source under the at least one time frame, where the azimuth distribution heat map is used to describe the distribution heat of the sound source in different azimuths under the at least one time frame.
  • In an optional embodiment, the sound source azimuth information includes the confidence that the sound source is in each azimuth; when the conversion module 81 converts the sound source azimuth information under each target time frame into an azimuth distribution heat map of the sound source under the at least one time frame, it is configured to:
  • determine the display brightness corresponding to each azimuth under the at least one time frame respectively, where different display brightness represents different distribution heat; and
  • generate, according to the display brightness, a heat map of the azimuth distribution of the sound source under the at least one time frame.
  • In an optional embodiment, when the conversion module 81 generates the heat map of the azimuth distribution of the sound source in the at least one time frame according to the display brightness, it is configured to:
  • respectively determine the image content corresponding to each of the at least one time frame; and
  • sequentially arrange the image contents corresponding to the at least one time frame to generate the azimuth distribution heat map.
  • In an optional embodiment, when the tracking module 82 uses the image recognition model to perform image recognition on the image stream so as to track the sound source within the target period, it is configured to:
  • extract the image features in the azimuth distribution heat map; and
  • determine the target sound source attribute parameters under the at least one time frame, so as to perform sound source tracking.
  • In an optional embodiment, the sound source attribute parameters include one or more of azimuth, quantity, sounding duration, and covered time frames.
  • In an optional embodiment, the tracking module 82 is further configured to:
  • acquire sample heat maps corresponding to several sample time frame groups, where each sample heat map is used to describe the distribution heat of the sound source in different azimuths under the corresponding sample time frame group; and
  • input each sample heat map and its corresponding labeling information into the image recognition model, so that the image recognition model can learn the mapping relationship between image features and sound source attribute parameters.
  • In an optional embodiment, when the tracking module 82 converts the azimuth information of the sound source under each target time frame into a heat map of the azimuth distribution of the sound source under the at least one time frame, it is configured to:
  • convert the azimuth information of the sound source under each target time frame into a heat map of the azimuth distribution of the sound source under the at least one time frame.
  • In an optional embodiment, the tracking module 82 is further configured to:
  • In an optional embodiment, the acoustic signal stream includes the time-domain signal stream collected by each array element in the microphone array. When the calculation module 84 performs sound source azimuth estimation based on the acoustic signal stream to obtain the sound source azimuth information in the at least one time frame, it is configured to:
  • convert the time-domain signal streams collected by each array element into time-frequency domain signals respectively; and
  • determine the sound source azimuth information under the at least one time frame according to the time-frequency domain signals of each array element.
  • In an optional embodiment, the sound source azimuth estimation technique includes one or more of steered response power with phase transform (SRP-PHAT), generalized cross-correlation with phase transform (GCC-PHAT), and multiple signal classification (MUSIC).
  • FIG. 9 is a schematic structural diagram of another computing device provided by an exemplary embodiment of the present application.
  • As shown in FIG. 9, the computing device includes: a memory 90 and a processor 91.
  • The processor 91, coupled to the memory 90, executes a computer program in the memory 90 for:
  • Image recognition is performed on the image stream using an image recognition model for sound source tracking within the target time period.
  • In an optional embodiment, when the processor 91 converts the sound source azimuth information in the at least one time frame into at least one set of image data describing the azimuth distribution state of the sound source to form an image stream, it is configured to:
  • In an optional embodiment, when the processor 91 respectively determines the sound source azimuth information under the at least one time frame within the target period, it is configured to:
  • perform sound source azimuth estimation based on the acoustic signal stream to obtain the sound source azimuth information under the at least one time frame.
  • In an optional embodiment, when the processor 91 converts the sound source azimuth information under each target time frame into a set of image data describing the azimuth distribution state of the sound source in the current recognition period, it is configured to:
  • convert the sound source azimuth information under each target time frame into an azimuth distribution heat map of the sound source under the at least one time frame, where the azimuth distribution heat map is used to describe the distribution heat of the sound source in different azimuths under the at least one time frame.
  • In an optional embodiment, the sound source azimuth information includes the confidence that the sound source is in each azimuth; when the processor 91 converts the sound source azimuth information under each target time frame into an azimuth distribution heat map of the sound source under the at least one time frame, it is configured to:
  • determine the display brightness corresponding to each azimuth under the at least one time frame respectively, where different display brightness represents different distribution heat; and
  • generate, according to the display brightness, a heat map of the azimuth distribution of the sound source under the at least one time frame.
  • In an optional embodiment, when the processor 91 generates the heat map of the azimuth distribution of the sound source in the at least one time frame according to the display brightness, the processor 91 is configured to:
  • respectively determine the image content corresponding to each of the at least one time frame; and
  • sequentially arrange the image contents corresponding to the at least one time frame to generate the azimuth distribution heat map.
  • In an optional embodiment, when the processor 91 uses the image recognition model to perform image recognition on the image stream so as to track the sound source within the target period, it is configured to:
  • extract the image features in the azimuth distribution heat map; and
  • determine the target sound source attribute parameters under the at least one time frame, so as to perform sound source tracking.
  • In an optional embodiment, the sound source attribute parameters include one or more of azimuth, quantity, sounding duration, and covered time frames.
  • In an optional embodiment, the processor 91 is further configured to:
  • acquire sample heat maps corresponding to several sample time frame groups, where each sample heat map is used to describe the distribution heat of the sound source in different azimuths under the corresponding sample time frame group; and
  • input each sample heat map and its corresponding labeling information into the image recognition model, so that the image recognition model can learn the mapping relationship between image features and sound source attribute parameters.
  • In an optional embodiment, when the processor 91 converts the azimuth information of the sound source under each target time frame into a heat map of the azimuth distribution of the sound source under the at least one time frame, the processor 91 is configured to:
  • convert the azimuth information of the sound source under each target time frame into a heat map of the azimuth distribution of the sound source under the at least one time frame.
  • In an optional embodiment, the processor 91 is further configured to:
  • In an optional embodiment, the acoustic signal stream includes a time-domain signal stream collected by each array element in the microphone array. When the processor 91 performs sound source azimuth estimation based on the acoustic signal stream to obtain the sound source azimuth information in the at least one time frame, it is configured to:
  • convert the time-domain signal streams collected by each array element into time-frequency domain signals respectively; and
  • determine the sound source azimuth information under the at least one time frame according to the time-frequency domain signals of each array element.
  • In an optional embodiment, the sound source azimuth estimation technique includes one or more of steered response power with phase transform (SRP-PHAT), generalized cross-correlation with phase transform (GCC-PHAT), and multiple signal classification (MUSIC).
  • In addition, the computing device further includes: a microphone array 92, a communication component 93, a power supply component 94, and other components. Only some components are schematically shown in FIG. 9; this does not mean that the computing device includes only the components shown in FIG. 9.
  • Correspondingly, the embodiments of the present application further provide a computer-readable storage medium storing a computer program; when the computer program is executed, each step that can be executed by the computing device in the foregoing method embodiments can be implemented.
  • FIG. 10 is a schematic structural diagram of a sound source tracking system provided by an exemplary embodiment of the present application.
  • The sound source tracking system may include: a microphone array 10 and a computing device 20, where the microphone array 10 and the computing device 20 are communicatively connected.
  • The sound source tracking system provided in this embodiment can be applied to various scenarios, for example, a voice control scenario, an audio and video conference scenario, or other scenarios that require sound source tracking; the application scenario is not limited in this embodiment.
  • The sound source tracking system provided in this embodiment can be integrated and deployed in various scenarios.
  • In a voice control scenario, it can be deployed in smart speakers and intelligent robots.
  • In audio and video conference scenarios, it can be deployed in various conference terminals.
  • In this embodiment, the microphone array 10 can be used to collect acoustic signals.
  • The number and arrangement of the array elements of the microphone array 10 are not limited in this embodiment.
  • The memories in FIGS. 6 and 9 described above are used to store computer programs, and may be configured to store various other data to support operations on the computing platform. Examples of such data include instructions for any application or method operating on the computing platform, contact data, phonebook data, messages, pictures, videos, etc.
  • Memory can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as Static Random Access Memory (SRAM), Electrically Erasable Programmable Read Only Memory (EEPROM), Erasable Programmable Read Only Memory (EPROM), Programmable Read Only Memory (PROM), Read Only Memory (ROM), Magnetic Memory, Flash Memory, Magnetic or Optical Disk.
  • The communication components in FIG. 6 and FIG. 9 described above are configured to facilitate wired or wireless communication between the device where the communication component is located and other devices.
  • The device where the communication component is located can access a wireless network based on a communication standard, such as WiFi, a mobile communication network such as 2G, 3G, 4G/LTE or 5G, or a combination thereof.
  • In an exemplary embodiment, the communication component receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel.
  • In an exemplary embodiment, the communication component further includes a near field communication (NFC) module to facilitate short-range communication.
  • For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
  • the power supply component may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device in which the power supply component is located.
  • the embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
  • These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture comprising instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
  • a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
  • Memory may include non-persistent memory, random access memory (RAM), and/or non-volatile memory in computer-readable media, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
  • Computer-readable media include persistent and non-persistent, removable and non-removable media; information storage may be implemented by any method or technology.
  • Information may be computer readable instructions, data structures, modules of programs, or other data.
  • Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic tape cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
  • Computer-readable media do not include transitory computer-readable media, such as modulated data signals and carrier waves.

Abstract

A sound source tracking method and apparatus, and a device, a system and a storage medium. The method comprises: acquiring an acoustic signal stream collected by a microphone array in at least one time frame (100); performing sound source azimuth estimation on the basis of the acoustic signal stream, so as to obtain an information stream that includes sound source azimuth information for the at least one time frame (101); converting the information stream into visualization data that describes the azimuth distribution state of the sound source (102); and performing sound source tracking according to the visualization data (103). In the method, an information stream that includes sound source azimuth information is converted into visualization data that describes the azimuth distribution state of the sound source, and sound source tracking is performed on the basis of the visualization data. This effectively improves the accuracy of sound source tracking and the adaptability to various complex environments.

Description

A sound source tracking method, apparatus, device, system and storage medium
This application claims priority to Chinese Patent Application No. 202011086519.8, filed on October 12, 2020 and entitled "A sound source tracking method, apparatus, device, system and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a sound source tracking method, apparatus, device, system, and storage medium.
Background
Sound source tracking based on a microphone array has been a popular technique in the field of acoustic signal processing in recent years. At present, sound source tracking is usually performed through signal-level processing of the microphone array signals, such as filtering, taking extreme values, computing the fundamental frequency, and computing the azimuth angle.
However, such processing methods have poor robustness and insufficient generalization ability; in particular, in multi-source or noisy environments, the accuracy of sound source tracking is insufficient.
Summary of the Invention
Various aspects of the present application provide a sound source tracking method, apparatus, device, system, and storage medium, so as to improve the accuracy of sound source tracking.
An embodiment of the present application provides a sound source tracking method, including:
acquiring an acoustic signal stream collected by a microphone array in at least one time frame;
performing sound source azimuth estimation based on the acoustic signal stream, to obtain an information stream containing sound source azimuth information for the at least one time frame;
converting the information stream into visualization data describing the azimuth distribution state of the sound source; and
performing sound source tracking according to the visualization data.
An embodiment of the present application further provides a sound source tracking method, including:
determining sound source azimuth information separately for at least one time frame within a target period;
converting the sound source azimuth information for the at least one time frame into at least one set of image data describing the azimuth distribution state of the sound source, so as to form an image stream; and
performing image recognition on the image stream by using an image recognition model, so as to perform sound source tracking within the target period.
An embodiment of the present application further provides a sound source tracking apparatus, including:
an acquisition module, configured to acquire an acoustic signal stream collected by a microphone array in at least one time frame;
a calculation module, configured to perform sound source azimuth estimation based on the acoustic signal stream, to obtain an information stream containing sound source azimuth information for the at least one time frame;
a conversion module, configured to convert the information stream into visualization data describing the azimuth distribution state of the sound source; and
a tracking module, configured to perform sound source tracking according to the visualization data.
An embodiment of the present application further provides a computing device, including a memory and a processor;
the memory is configured to store one or more computer instructions; and
the processor is coupled to the memory and configured to execute the one or more computer instructions to:
acquire an acoustic signal stream collected by a microphone array in at least one time frame;
perform sound source azimuth estimation based on the acoustic signal stream, to obtain an information stream containing sound source azimuth information for the at least one time frame;
convert the information stream into visualization data describing the azimuth distribution state of the sound source; and
perform sound source tracking according to the visualization data.
An embodiment of the present application further provides a sound source tracking apparatus, including:
a determining module, configured to determine sound source azimuth information separately for at least one time frame within a target period;
a conversion module, configured to convert the sound source azimuth information for the at least one time frame into at least one set of image data describing the azimuth distribution state of the sound source, so as to form an image stream; and
a tracking module, configured to perform image recognition on the image stream by using an image recognition model, so as to perform sound source tracking within the target period.
An embodiment of the present application further provides a computing device, including a memory and a processor;
the memory is configured to store one or more computer instructions; and
the processor is coupled to the memory and configured to execute the one or more computer instructions to:
determine sound source azimuth information separately for at least one time frame within a target period;
convert the sound source azimuth information for the at least one time frame into at least one set of image data describing the azimuth distribution state of the sound source, so as to form an image stream; and
perform image recognition on the image stream by using an image recognition model, so as to perform sound source tracking within the target period.
An embodiment of the present application further provides a sound source tracking system, including a microphone array and a computing device, where the microphone array is communicatively connected to the computing device;
the microphone array is configured to collect acoustic signals; and
the computing device is configured to: acquire an acoustic signal stream collected by the microphone array in at least one time frame; perform sound source azimuth estimation based on the acoustic signal stream, to obtain an information stream containing sound source azimuth information for the at least one time frame; convert the information stream into visualization data describing the azimuth distribution state of the sound source; and perform sound source tracking according to the visualization data.
An embodiment of the present application further provides a computer-readable storage medium storing computer instructions which, when executed by one or more processors, cause the one or more processors to perform the foregoing sound source tracking method.
In the embodiments of the present application, sound source azimuth estimation may be performed on the acoustic signal stream collected by the microphone array in at least one time frame, so as to determine the sound source azimuth information for each of the at least one time frame; the information stream containing the sound source azimuth information is converted into visualization data describing the azimuth distribution state of the sound source, and sound source tracking is performed based on the visualization data. In this way, the embodiments of the present application abandon the traditional approach of tracking a sound source at the acoustic signal processing level and instead perform sound source tracking at the visual analysis level. Because the visualization data in these embodiments reflects the azimuth distribution state of the sound source accurately and comprehensively, the basis of the visual analysis is accurate and comprehensive, which avoids the robustness problem; moreover, during visual analysis, the field of view of the analysis can cover more time frames, so noise within the field of view can be identified and noise interference avoided. Accordingly, the embodiments of the present application can effectively improve the accuracy of sound source tracking and the adaptability to various complex environments.
Description of the Drawings
The drawings described herein are used to provide a further understanding of the present application and constitute a part of the present application. The exemplary embodiments of the present application and their descriptions are used to explain the present application and do not constitute an improper limitation of the present application. In the drawings:
FIG. 1 is a schematic flowchart of a sound source tracking method provided by an exemplary embodiment of the present application;
FIG. 2 is a schematic logical diagram of a sound source tracking solution provided by an exemplary embodiment of the present application;
FIG. 3 is a schematic diagram of sound source azimuth information provided by an exemplary embodiment of the present application;
FIG. 4 is a schematic diagram of an azimuth distribution heat map of a sound source provided by an exemplary embodiment of the present application;
FIG. 5 is a schematic structural diagram of a sound source tracking apparatus provided by an exemplary embodiment of the present application;
FIG. 6 is a schematic structural diagram of a computing device provided by another exemplary embodiment of the present application;
FIG. 7 is a flowchart of another sound source tracking method provided by an exemplary embodiment of the present application;
FIG. 8 is a schematic structural diagram of another sound source tracking apparatus provided by an exemplary embodiment of the present application;
FIG. 9 is a schematic structural diagram of another computing device provided by an exemplary embodiment of the present application;
FIG. 10 is a schematic structural diagram of a sound source tracking system provided by an exemplary embodiment of the present application.
Detailed Description
To make the objectives, technical solutions, and advantages of the present application clearer, the technical solutions of the present application are described clearly and completely below with reference to specific embodiments of the present application and the corresponding drawings. Obviously, the described embodiments are only a part, rather than all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort shall fall within the protection scope of the present application.
In view of technical problems of existing sound source tracking solutions, such as poor robustness and insufficient generalization ability, in some embodiments of the present application an information stream containing sound source azimuth information can be converted into visualization data describing the azimuth distribution state of the sound source, and sound source tracking is performed based on the visualization data. This replaces the traditional approach of performing sound source tracking at the acoustic signal processing level with sound source tracking at the visual analysis level. Accordingly, the embodiments of the present application can effectively improve the accuracy of sound source tracking and the adaptability to various complex environments.
The technical solutions provided by the embodiments of the present application are described in detail below with reference to the accompanying drawings.
FIG. 1 is a schematic flowchart of a sound source tracking method provided by an exemplary embodiment of the present application. FIG. 2 is a schematic logical diagram of a sound source tracking solution provided by an exemplary embodiment of the present application. The sound source tracking method provided in this embodiment may be performed by a sound source tracking apparatus, which may be implemented as software or as a combination of software and hardware, and which may be integrated in a computing device. As shown in FIG. 1, the method includes:
Step 100: acquire an acoustic signal stream collected by a microphone array in at least one time frame;
Step 101: perform sound source azimuth estimation based on the acoustic signal stream, to obtain an information stream containing sound source azimuth information for the at least one time frame;
Step 102: convert the information stream into visualization data describing the azimuth distribution state of the sound source;
Step 103: perform sound source tracking according to the visualization data.
The sound source tracking method provided in this embodiment can be applied in various scenarios, for example, voice control scenarios, audio and video conference scenarios, or other scenarios requiring sound source tracking; the application scenario is not limited in this embodiment. In different application scenarios, the sound source tracking method provided in this embodiment can be integrated in a wide variety of scenario devices; for example, in a voice control scenario, the scenario device may be a smart speaker, an intelligent robot, or the like, and in an audio and video conference scenario, the scenario device may be any of various conference terminals.
In this embodiment, in step 100, a microphone array may be used to collect the acoustic signal stream. The microphone array may be a group of arrays composed of multiple array elements; this embodiment does not limit the number of array elements in the microphone array. This embodiment also does not limit the arrangement of the microphone array: the microphone array may be a circular array, a linear array, a planar array, a three-dimensional array, or the like. In different application scenarios, the microphone array can be mounted in various types of scenario devices as required.
The signal acquisition process of a microphone array is usually continuous; therefore, in this embodiment, subsequent processing may be performed in the form of an acoustic signal stream.
In this embodiment, based on the recognition precision, at least one time frame may be selected within a single recognition period, and a single time frame is used as the processing unit. The length of a single recognition period is adapted to the recognition precision. For example, if the recognition precision is 1 s, that is, a sound source tracking result is produced every 1 s, then the length of a single recognition period can be set to 1 s, and in step 100 the acoustic signal stream formed by the acoustic signals in at least one time frame within that 1 s is acquired at a time as the processing object of the subsequent steps. In practical applications, in different application scenarios, the at least one time frame can be selected within the target period as required. For example, when the acoustic signal changes little, the at least one time frame may be selected within the target period by skipping frames or by sampling at a varying frame rate. Of course, in most cases, all time frames within the target period may be selected, which is not limited in this embodiment.
In this embodiment, the frame length of a time frame can be configured according to actual requirements; for example, the frame length of a single time frame can be configured as 20 ms. In addition, the number of time frames within a recognition period can also be set as required; for example, if the recognition period is 1 s, three time frames may be selected within it, and sound source tracking for the period is performed based on these three time frames. Of course, neither the frame length nor the number of time frames within a recognition period is limited to these values in this embodiment. Furthermore, the frame lengths of different time frames within a recognition period need not be identical, and none of this is limited in this embodiment.
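By way of illustration only, the following is a minimal sketch of this framing step, assuming non-overlapping frames of equal length and a 16 kHz sampling rate (both are assumptions of the example; as described above, the embodiment leaves the frame length and the frame selection policy configurable):

    import numpy as np

    def split_into_frames(signal, fs=16000, frame_ms=20.0):
        # Split one channel of the acquired stream into equal,
        # non-overlapping time frames; e.g. a 1 s recognition period at
        # 16 kHz with 20 ms frames yields 50 frames of 320 samples each.
        n = int(fs * frame_ms / 1000.0)
        usable = (len(signal) // n) * n
        return np.asarray(signal[:usable]).reshape(-1, n)  # (n_frames, n)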
On this basis, the sound source tracking method provided in this embodiment can be applied to real-time sound source tracking scenarios as well as offline sound source tracking scenarios, performing sound source tracking successively in each recognition period according to the recognition precision.
In practical applications, each array element in the microphone array can collect a time-domain signal. Taking a microphone array containing M array elements as an example, M channels of time-domain signal streams can be collected in the at least one time frame, forming the acoustic signal stream in step 100.
Referring to FIG. 1 and FIG. 2, in step 101, sound source azimuth estimation may be performed based on the acoustic signal stream, to obtain an information stream containing sound source azimuth information for the at least one time frame.
In this embodiment, a sound source azimuth estimation technique may be used to perform signal processing on the acoustic signal stream, so as to determine the sound source azimuth information for each of the at least one time frame. The sound source azimuth information characterizes the azimuth data of the sound source in a time frame. In this embodiment, the azimuth data may be the confidence that the sound source lies at each azimuth; thus, the sound source azimuth information may at least contain, for a time frame, the confidence that the sound source lies at each azimuth. The azimuths involved in the sound source azimuth information can be configured according to actual needs; for example, 360 azimuths, 120 azimuths, 60 azimuths, or the like can be configured for the full circumference of the microphone array. Of course, less than the full circumference may also be covered, for example, 180 azimuths for the frontal 180° range, which is not limited in this embodiment.
FIG. 3 is a schematic diagram of sound source azimuth information provided by an exemplary embodiment of the present application. The sound source azimuth information is visualized in FIG. 3, but it should be understood that FIG. 3 is only intended to facilitate the description of the sound source azimuth information and should not limit the data form of the sound source azimuth information in this embodiment. In practical applications, the sound source azimuth information may be [1, 3, 5, 60, 70, 80, 90, 80, 70, ..., 0] or any other data form that can be understood by the computing device; in this example, each number in the brackets represents the confidence that the sound source lies at one of 360 azimuths.
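As a small illustration of this data form (the values and the 360-azimuth grid are taken from the example above; the variable names are hypothetical):

    import numpy as np

    # Per-azimuth confidences for one time frame: entry k is the
    # confidence that the sound source lies at azimuth k (360 azimuths
    # covering the full circle).
    confidence = np.zeros(360)
    confidence[85:94] = [1, 3, 5, 60, 70, 80, 90, 80, 70]
    peak_azimuth = int(np.argmax(confidence))  # 91, the most likely azimuth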
In addition, in step 101, the sound source azimuth information is determined in units of time frames. Following the foregoing, what is acquired in step 100 is the acoustic signal stream collected by each of the M array elements in the at least one time frame; here, in step 101, sound source azimuth estimation may be performed, time frame by time frame, on the acoustic signals collected by the M array elements in that time frame, so as to determine the sound source azimuth information for that time frame.
In an optional implementation, the time-domain signal streams collected by the array elements can each be converted into time-frequency domain signals, and a sound source azimuth estimation technique is used to determine the sound source azimuth information for the at least one time frame from the time-frequency domain signals of the array elements.
In this implementation, take a target time frame among the at least one time frame as an example, where the target time frame may be any one of the at least one time frame. The time-domain signals collected by the array elements in the target time frame can be converted into time-frequency domain signals. For example, the time-domain signal may be decomposed into subbands to obtain the time-frequency domain signal; the subband decomposition may be implemented based on a short-time Fourier transform and/or a filter bank, among others, which is not limited here. Accordingly, the time-frequency domain signal corresponding to each array element in the target time frame can be obtained.
On this basis, sound source azimuth estimation can be performed on the time-frequency domain signals corresponding to the array elements in the target time frame, to output the sound source azimuth information for the target time frame. The sound source azimuth estimation techniques include, but are not limited to, the steered response power phase transform SRP-PHAT (Steered Response Power-Phase Transform), the generalized cross-correlation phase transform GCC-PHAT (Generalized Cross Correlation PHAse Transformation), and multiple signal classification MUSIC. The principle of sound source azimuth estimation may be as follows: according to the acoustic signals collected at the same time by different microphones in the microphone array, candidate azimuth ranges of the sound source are calculated separately, and the sound source azimuth is then estimated from the multiple azimuth ranges. Of course, this is merely exemplary, and this embodiment is not limited thereto. This embodiment does not limit the sound source azimuth estimation technique employed, and the processing details of the various techniques are not repeated here.
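As a rough sketch of this estimation step (not the embodiment's own implementation), the following computes a per-azimuth SRP-PHAT confidence for a single time frame from the short-time spectra of the array channels. The far-field plane-wave model, the planar array geometry, the uniform azimuth grid, and all names are assumptions of the example, and the sign convention for the inter-microphone delays depends on the chosen geometry:

    import numpy as np

    def srp_phat(frames, mic_pos, fs, n_az=360, c=343.0):
        # frames : (M, N) time-domain samples, one row per array element
        # mic_pos: (M, 2) microphone xy positions in metres
        # Returns an (n_az,) steered response power per candidate azimuth,
        # usable as the per-azimuth confidence for this time frame.
        M, N = frames.shape
        X = np.fft.rfft(frames, axis=1)                     # (M, F) spectra
        freqs = np.fft.rfftfreq(N, d=1.0 / fs)              # (F,)
        az = np.deg2rad(np.arange(n_az) * 360.0 / n_az)
        dirs = np.stack([np.cos(az), np.sin(az)], axis=1)   # unit vectors
        power = np.zeros(n_az)
        for i in range(M):
            for j in range(i + 1, M):
                cross = X[i] * np.conj(X[j])
                cross /= np.abs(cross) + 1e-12              # PHAT weighting
                # Expected inter-microphone delay per candidate azimuth
                tau = dirs @ (mic_pos[i] - mic_pos[j]) / c  # (n_az,)
                steer = np.exp(-2j * np.pi * np.outer(tau, freqs))
                power += np.real(steer @ cross)             # GCC-PHAT at tau
        return power

Applied frame by frame, the outputs of such a function, stacked in time order, would form the information stream of step 101.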
In addition, in this embodiment, other implementations may also be used to perform sound source azimuth estimation based on the acoustic signal stream; this embodiment is not limited to the implementation described above.
Accordingly, an information stream can be obtained that contains the sound source azimuth information for the at least one time frame.
Referring to FIG. 1 and FIG. 2, on this basis, in step 102, the information stream can be converted into visualization data describing the azimuth distribution state of the sound source.
In this embodiment, the sound source azimuth information may contain descriptive information in dimensions such as time and azimuth; on this basis, the sound source azimuth information can be converted into descriptive information of the azimuth distribution state. For example, if the sound source azimuth information contains the confidence that the sound source lies at each azimuth, the confidence can be converted into an adapted display brightness, and the display brightness at each azimuth then serves as the descriptive information of the azimuth distribution state. It is worth noting that nothing in the sound source azimuth information is lost during the conversion to visualization data; only the representation of the sound source azimuth information is changed, which ensures that the visualization data in this embodiment describes the azimuth distribution state of the sound source accurately and comprehensively.
The visualization data may be an azimuth distribution heat map of the sound source. Continuing the above example, after the confidences of the sound source at the various azimuths are converted into adapted display brightness values, the display brightness at each azimuth in each time frame is obtained, and the azimuth distribution heat map of the sound source is thereby determined from the three dimensions of time frame, azimuth, and display brightness.
Of course, in this embodiment the visualization data is not limited thereto; for example, the visualization data may also be a three-dimensional graph representing the sound source azimuth information for the at least one time frame. In practical applications, the display curves of the sound source azimuth information in FIG. 3 corresponding to the at least one time frame can be arranged in time order to obtain such a three-dimensional graph.
In this embodiment, the information stream can be converted into visualization data of various forms. During visualization, all of the sound source azimuth information can be fully retained, so that in step 102 the sound source tracking process shifts from the acoustic signal processing level to the visualization processing level.
In step 103, sound source tracking can then be performed according to the visualization data. In this embodiment, the number of sound sources is not limited; there may be one or more sound sources.
Accordingly, in this embodiment, the sound source within the at least one time frame can be tracked through visual analysis of the visualization data, thereby converting an acoustic signal processing problem into a visual analysis problem. Because the visualization data in this embodiment reflects the azimuth distribution state of the sound source accurately and comprehensively, the basis of the visual analysis is accurate and comprehensive, which avoids the robustness problem; moreover, during visual analysis, the field of view of the analysis can cover more time frames and is no longer limited to a single time frame, so noise within the field of view can be identified and noise interference avoided. This effectively avoids shortcomings of traditional acoustic signal processing, such as poor robustness and insufficient generalization ability.
Accordingly, in this embodiment, the information stream containing the sound source azimuth information can be converted into visualization data describing the azimuth distribution state of the sound source, and sound source tracking is performed based on the visualization data. This replaces the traditional approach of performing sound source tracking at the acoustic signal processing level with sound source tracking at the visual analysis level. Accordingly, the embodiments of the present application can effectively improve the accuracy of sound source tracking and the adaptability to various complex environments.
In the above or following embodiments, the information stream can be converted into an azimuth distribution heat map of the sound source for the at least one time frame, as the basis for tracking the sound source within the at least one time frame. The azimuth distribution heat map describes the distribution heat of the sound source at different azimuths in the at least one time frame.
In this embodiment, the sound source azimuth information contains the confidence that the sound source lies at each azimuth. On this basis, based on the correspondence between confidence and display brightness, the display brightness corresponding to each azimuth can be determined for each of the at least one time frame according to the confidence that the sound source lies at each azimuth in that time frame; the azimuth distribution heat map of the sound source for the at least one time frame is then generated from the display brightness values, with different display brightness representing different distribution heat.
In practical applications, the higher the confidence, the higher the corresponding display brightness may be, representing higher distribution heat. Of course, this embodiment is not limited thereto; a higher confidence could also correspond to a lower display brightness. Typically, however, the display brightness is proportional to the confidence, so that the display brightness accurately reflects the confidence.
FIG. 4 is a schematic diagram of an azimuth distribution heat map of a sound source provided by an exemplary embodiment of the present application. Referring to FIG. 4, the vertical axis of the heat map is the time frame and the horizontal axis is the azimuth. In FIG. 4, the number of time frames is 800, and the number of azimuths is configured as 120, representing the full circumferential space of the microphone array.
In an optional implementation, the image content corresponding to each of the at least one time frame can be determined from the display brightness corresponding to each azimuth in that time frame, and the image contents corresponding to the at least one time frame are then arranged in time order to generate the azimuth distribution heat map. Referring to FIG. 4, the sound source azimuth information for each time frame can be converted into one row of the heat map; for example, the sound source azimuth information of the 400th frame can be converted into the line y=400 in the heat map, and the display brightness of the pixels on this line corresponding to the azimuths is determined according to the corresponding confidences. Pixels with higher confidence are displayed brighter. For example, with reference to the schematic diagram of sound source azimuth information shown in FIG. 3, the peak position in FIG. 3 has the highest confidence; converted to the heat map, the display brightness at the azimuth corresponding to that peak is the brightest.
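A minimal sketch of this row-by-row construction follows, assuming the information stream has already been assembled into a (frames x azimuths) array of confidences and that brightness is simply proportional to confidence, scaled to 8-bit grey levels (the min-max normalisation is an assumption of the example):

    import numpy as np

    def information_stream_to_heatmap(info_stream):
        # info_stream: (T, n_az) array, one row of azimuth confidences
        # per time frame. Rows are stacked in time order, so the vertical
        # axis is the time frame and the horizontal axis the azimuth,
        # matching the layout of FIG. 4.
        img = np.asarray(info_stream, dtype=np.float64)
        img = img - img.min()
        img = img / (img.max() + 1e-12)        # normalise to [0, 1]
        return (img * 255.0).astype(np.uint8)  # higher confidence -> brighter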
Of course, in this embodiment, other implementations may also be used to generate the azimuth distribution heat map. For example, from the sound source azimuth information for the at least one time frame, the confidence that a sound source exists at a target azimuth in the different time frames can be obtained, so as to determine the display brightness corresponding to each time frame at the target azimuth and generate the image content corresponding to the target azimuth, where the target azimuth may be any one of the azimuths. In this way, the image content corresponding to each azimuth can be obtained, and the image contents corresponding to the azimuths can then be arranged in azimuth order to generate the azimuth distribution heat map. This embodiment does not limit the manner of generating the azimuth distribution heat map.
Accordingly, in this embodiment, the sound source azimuth information for the at least one time frame can be converted into an azimuth distribution heat map of the sound source. Moreover, the entire content of the sound source azimuth information is retained during the conversion, which provides an accurate analysis basis for the visual analysis process and thus ensures the accuracy of the tracking result.
In the above or following embodiments, sound source tracking can be performed by using a machine learning model together with the visualization data.
In this embodiment, whatever the form of the visualization data, a machine learning model can be used to perform visual analysis on it for sound source tracking. In practical applications, different types of machine learning models can be selected for different forms of visualization data, and a model training method adapted to the data form can be used to improve the performance of the machine learning model.
The following again takes the heat map as an example to describe the visual analysis process.
In this embodiment, in the machine learning model, the image features in the azimuth distribution heat map can be extracted, and the target sound source attribute parameters for the at least one time frame are determined based on the mapping relationship between image features and sound source attribute parameters and the image features extracted from the azimuth distribution heat map, so as to perform sound source tracking.
The mapping relationship between image features and sound source attribute parameters can be configured into the machine learning model through model training.
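By way of illustration, the following is one possible form of such a model: a small convolutional network that maps an azimuth distribution heat map to per-azimuth source activity. The architecture, the output head, and the choice of PyTorch are all assumptions of the sketch; the embodiment only requires a learned mapping from image features to sound source attribute parameters.

    import torch
    import torch.nn as nn

    class HeatmapTracker(nn.Module):
        # Maps a (batch, 1, T, n_az) heat map to per-azimuth source
        # activity in [0, 1]; peaks indicate tracked source azimuths.
        def __init__(self, n_az=120):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d((1, n_az // 2)),  # pool out time axis
            )
            self.head = nn.Linear(32 * (n_az // 2), n_az)

        def forward(self, heatmap):
            f = self.features(heatmap).flatten(1)
            return torch.sigmoid(self.head(f))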
An exemplary model training process may be as follows:
obtaining sample heat maps corresponding to several sample time frame groups; annotating each sample heat map with sound source attribute parameters, to obtain the annotation information corresponding to each sample heat map; and inputting each sample heat map and its corresponding annotation information into the machine learning model, so that the machine learning model learns the mapping relationship between image features and sound source attribute parameters.
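A correspondingly minimal training-loop sketch is shown below; the binary cross-entropy loss, the Adam optimiser, and the per-azimuth activity labels are assumptions chosen to match the illustrative model above, not requirements of the embodiment:

    import torch
    import torch.nn as nn

    def train_tracker(model, sample_heatmaps, labels, epochs=10, lr=1e-3):
        # sample_heatmaps: (B, 1, T, n_az) tensor of sample heat maps
        # labels: (B, n_az) tensor of annotated per-azimuth activity in [0, 1]
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        loss_fn = nn.BCELoss()
        model.train()
        for _ in range(epochs):
            opt.zero_grad()
            loss = loss_fn(model(sample_heatmaps), labels)
            loss.backward()
            opt.step()
        return model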
The number of time frames in a sample time frame group may be consistent with the number of the at least one time frame over which acoustic signals are collected in step 100. That is, the processing unit during model training and during model use can be kept consistent. In this way, during model training, annotation is performed per sample time frame group, and during model use the tracking result is output per group of at least one time frame of the same size.
In this embodiment, the sound source attribute parameters include one or more of azimuth, number, sounding duration, and covered time frames. Accordingly, after the visual analysis, the machine learning model can output information such as the number of sound sources, their azimuths, their sounding durations, and the time frames they cover within the at least one time frame as the tracking result.
Where a machine learning model is introduced, in this embodiment, the step of converting the information stream into visualization data describing the azimuth distribution state of the sound source can be performed either inside or outside the machine learning model.
In one possible implementation, during model use, the conversion of the information stream into visualization data describing the azimuth distribution state of the sound source can be performed outside the machine learning model, and the visualization data is used as the input parameter of the machine learning model. Correspondingly, during model training, the information streams corresponding to the sample time frame groups can be converted into sample heat maps in advance, as the basis for model training.
In another possible implementation, during model use, the information stream can be input into the machine learning model, and in the machine learning model the information stream is converted into visualization data describing the azimuth distribution state of the sound source.
In this implementation, a functional module that converts the information stream into visualization data describing the azimuth distribution state of the sound source can be configured in the machine learning model, so that the information stream can be used as the input parameter of the machine learning model. Upon receiving the information stream, the machine learning model converts it into visualization data describing the azimuth distribution state of the sound source and then performs the visual analysis.
Correspondingly, the model training process will differ slightly from that of the previous implementation. In the model training process of this implementation, sample information streams corresponding to several sample time frame groups can be obtained; each sample information stream is annotated with sound source attribute parameters, to obtain the annotation information corresponding to each sample information stream; and each sample information stream and its corresponding annotation information are input into the machine learning model, so that the machine learning model converts each sample information stream into visualization data describing the azimuth distribution state of the sound source and learns the mapping relationship between image features and sound source attribute parameters.
Accordingly, in this embodiment, after the machine learning model is trained with a sufficient amount of sample data, it learns an accurate mapping relationship between image features and sound source attribute parameters. The trained machine learning model can thus be used to perform visual analysis on the visualization data and output the sound source attribute information for the at least one time frame, so that one or more sounding sources are tracked based on the sound source attribute information. This sound source tracking method can exclude various noise interference during tracking, requires no separate operation of locating the sound onset point, and avoids the deficiencies of other acoustic-signal-processing approaches. It can therefore effectively improve the accuracy of the tracking result and the adaptability to various complex environments.
It should be noted that the steps of the methods provided in the foregoing embodiments may all be executed by the same device, or the methods may be executed by different devices. For example, steps 101 to 103 may be executed by device A; alternatively, steps 101 and 102 may be executed by device A and step 103 by device B; and so on.
In addition, some of the processes described in the foregoing embodiments and drawings contain multiple operations appearing in a specific order, but it should be clearly understood that these operations may be executed out of the order in which they appear herein or in parallel. Sequence numbers of the operations, such as 101 and 102, are merely used to distinguish different operations; the sequence numbers themselves do not represent any execution order. In addition, these processes may include more or fewer operations, and these operations may be executed sequentially or in parallel.
FIG. 5 is a schematic structural diagram of a sound source tracking apparatus provided by an exemplary embodiment of the present application. Referring to FIG. 5, the sound source tracking apparatus includes:
an acquisition module 50, configured to acquire an acoustic signal stream collected by a microphone array in at least one time frame;
a calculation module 51, configured to perform sound source azimuth estimation based on the acoustic signal stream, to obtain an information stream containing sound source azimuth information for the at least one time frame;
a conversion module 52, configured to convert the information stream into visualization data describing the azimuth distribution state of the sound source; and
a tracking module 53, configured to perform sound source tracking according to the visualization data.
In an optional embodiment, when converting the information stream into visualization data describing the azimuth distribution state of the sound source, the conversion module 52 is configured to:
convert the information stream into an azimuth distribution heat map of the sound source for the at least one time frame, where the azimuth distribution heat map describes the distribution heat of the sound source at different azimuths in the at least one time frame.
In an optional embodiment, the sound source azimuth information contains the confidence that the sound source lies at each azimuth; when converting the information stream into the azimuth distribution heat map of the sound source for the at least one time frame, the conversion module 52 is configured to:
based on the correspondence between confidence and display brightness, determine the display brightness corresponding to each azimuth for each of the at least one time frame according to the confidence that the sound source lies at each azimuth in that time frame, where different display brightness represents different distribution heat; and
generate the azimuth distribution heat map of the sound source for the at least one time frame according to the display brightness.
In an optional embodiment, when generating the azimuth distribution heat map of the sound source for the at least one time frame according to the display brightness, the conversion module 52 is configured to:
determine the image content corresponding to each of the at least one time frame according to the display brightness corresponding to each azimuth in that time frame; and
arrange the image contents corresponding to the at least one time frame in time order, to generate the azimuth distribution heat map.
In an optional embodiment, when performing sound source tracking according to the visualization data, the tracking module 53 is configured to:
perform sound source tracking by using a machine learning model and the visualization data.
In an optional embodiment, if the visualization data is an azimuth distribution heat map of the sound source for the at least one time frame, then when performing sound source tracking by using the machine learning model and the visualization data, the tracking module 53 is configured to:
in the machine learning model, extract the image features in the azimuth distribution heat map; and
determine the target sound source attribute parameters for the at least one time frame based on the mapping relationship between image features and sound source attribute parameters and the image features extracted from the azimuth distribution heat map, so as to perform sound source tracking.
In an optional embodiment, the sound source attribute parameters include one or more of azimuth, number, sounding duration, and covered time frames.
In an optional embodiment, the tracking module 53 is further configured to:
obtain sample heat maps corresponding to several sample time frame groups, where a sample heat map describes the distribution heat of the sound source at different azimuths in the sample time frames;
annotate each sample heat map with sound source attribute parameters, to obtain the annotation information corresponding to each sample heat map; and
input each sample heat map and its corresponding annotation information into the machine learning model, so that the machine learning model learns the mapping relationship between image features and sound source attribute parameters.
In an optional embodiment, when converting the information flow into visualization data describing the azimuth distribution state of the sound source, the tracking module 53 is configured to:
input the information flow into the machine learning model; and
convert, in the machine learning model, the information flow into visualization data describing the azimuth distribution state of the sound source.
In an optional embodiment, the tracking module 53 is further configured to:
acquire sample information flows corresponding to several sample time frame groups;
label each sample information flow with sound source attribute parameters to obtain labeling information corresponding to each sample information flow; and
input each sample information flow and its corresponding labeling information into the machine learning model, so that the machine learning model converts each sample information flow into visualization data describing the azimuth distribution state of the sound source and learns the mapping relationship between image features and sound source attribute parameters.
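For the variant above, in which the conversion happens inside the model, one possible arrangement is a fixed, non-learned rendering step that turns the information flow into image-like visual data before the learned feature extractor; treating the rendering as a simple clamp-and-reshape is an assumption of this sketch.

```python
import torch
import torch.nn as nn

class EndToEndTracker(nn.Module):
    """Toy variant: the information flow (per-frame azimuth confidences)
    is rendered into visual data inside the model, after which image
    features are extracted and mapped to attribute parameters."""
    def __init__(self, n_outputs=2):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 8)), nn.Flatten(),
            nn.Linear(8 * 4 * 8, n_outputs),
        )

    def forward(self, info_flow):
        # info_flow: (batch, n_frames, n_azimuths) confidences in [0, 1].
        # Rendering step: clamp to the brightness range and add a channel
        # axis, yielding the visual data consumed by the feature extractor.
        image = info_flow.clamp(0.0, 1.0).unsqueeze(1)
        return self.cnn(image)

model = EndToEndTracker()
print(model(torch.rand(2, 50, 360)).shape)  # torch.Size([2, 2])
```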
In an optional embodiment, the acoustic signal flow contains the time-domain signal flows collected by the array elements of the microphone array; when performing sound source azimuth estimation based on the acoustic signal flow to obtain the information flow containing the sound source azimuth information under the at least one time frame, the calculation module 51 is configured to:
convert the time-domain signal flows collected by the array elements into time-frequency domain signals respectively; and
determine, by using a sound source azimuth estimation technique, the sound source azimuth information under the at least one time frame according to the time-frequency domain signals of the array elements.
In an optional embodiment, the sound source azimuth estimation technique includes one or more of the steered response power phase transform technique SRP-PHAT, the generalized cross-correlation phase transform technique GCC-PHAT, or the multiple signal classification technique MUSIC.
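As a non-limiting numerical sketch of these two steps for a two-element array, the code below converts each element's time-domain stream into time-frequency domain frames with a short-time Fourier transform and applies GCC-PHAT to estimate the inter-element lag, from which an azimuth follows under a far-field assumption; the sampling rate, element spacing, frame parameters, and the synthetic two-sample delay are assumptions of the sketch.

```python
import numpy as np

def stft(x, frame_len=512, hop=256):
    """Convert a time-domain signal flow into time-frequency domain frames."""
    frames = [x[i:i + frame_len] * np.hanning(frame_len)
              for i in range(0, len(x) - frame_len + 1, hop)]
    return np.fft.rfft(np.asarray(frames), axis=1)

def gcc_phat_lag(X1, X2):
    """GCC-PHAT: whiten the cross-spectrum by its magnitude; the peak of
    its inverse transform is the lag of stream 2 relative to stream 1."""
    cross = np.conj(X1) * X2
    r = np.fft.irfft(cross / (np.abs(cross) + 1e-12), axis=1)
    n = r.shape[1]
    lags = np.argmax(np.abs(r), axis=1)
    return np.where(lags > n // 2, lags - n, lags)  # signed lag per frame

# Assumed setup: 16 kHz sampling, two array elements 5 cm apart, and a
# source whose wavefront reaches element 2 two samples after element 1.
fs, d, c = 16000, 0.05, 343.0
sig = np.random.default_rng(1).standard_normal(fs)
lag = np.median(gcc_phat_lag(stft(sig), stft(np.roll(sig, 2))))
tau = lag / fs                                      # inter-element delay, s
azimuth = np.degrees(np.arcsin(np.clip(tau * c / d, -1.0, 1.0)))
print(round(float(azimuth), 1))                     # approximately 59.0
```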
It is worth noting that, for the technical details of the above embodiments of the sound source tracking apparatus, reference may be made to the relevant descriptions in the foregoing embodiments of the sound source tracking method; to save space, they are not repeated here, but this should not be construed as a loss of the protection scope of the present application.
FIG. 6 is a schematic structural diagram of a computing device provided by another exemplary embodiment of the present application. As shown in FIG. 6, the computing device includes a memory 60 and a processor 61.
The processor 61, coupled to the memory 60, is configured to execute the computer program in the memory 60 to:
acquire the acoustic signal flow collected by the microphone array under at least one time frame;
perform sound source azimuth estimation based on the acoustic signal flow to obtain an information flow containing the sound source azimuth information under the at least one time frame;
convert the information flow into visualization data describing the azimuth distribution state of the sound source; and
perform sound source tracking according to the visualization data.
In an optional embodiment, when converting the information flow into visualization data describing the azimuth distribution state of the sound source, the processor 61 is configured to:
convert the information flow into an azimuth distribution heat map of the sound source under the at least one time frame, wherein the azimuth distribution heat map describes the distribution heat of the sound source in different azimuths under the at least one time frame.
In an optional embodiment, the sound source azimuth information contains the confidence that the sound source is located at each azimuth; when converting the information flow into the azimuth distribution heat map of the sound source under the at least one time frame, the processor 61 is configured to:
determine, based on the correspondence between confidence and display brightness and according to the confidence that the sound source is located at each azimuth under the at least one time frame, the display brightness corresponding to each azimuth under the at least one time frame, wherein different display brightnesses represent different distribution heats; and
generate the azimuth distribution heat map of the sound source under the at least one time frame according to the display brightness.
In an optional embodiment, when generating the azimuth distribution heat map of the sound source under the at least one time frame according to the display brightness, the processor 61 is configured to:
determine, according to the display brightness corresponding to each azimuth under the at least one time frame, the image content corresponding to each of the at least one time frame; and
arrange the image contents corresponding to the at least one time frame in sequence according to the chronological order of the at least one time frame, so as to generate the azimuth distribution heat map.
In an optional embodiment, when performing sound source tracking according to the visualization data, the processor 61 is configured to:
perform sound source tracking by using a machine learning model and the visualization data.
In an optional embodiment, if the visualization data is the azimuth distribution heat map of the sound source under the at least one time frame, the processor 61, when performing sound source tracking by using the machine learning model and the visualization data, is configured to:
extract, in the machine learning model, image features from the azimuth distribution heat map; and
determine, based on the mapping relationship between image features and sound source attribute parameters and the image features extracted from the azimuth distribution heat map, target sound source attribute parameters under the at least one time frame, so as to perform sound source tracking.
In an optional embodiment, the sound source attribute parameters include one or more of azimuth, quantity, sounding duration, and covered time frames.
In an optional embodiment, the processor 61 is further configured to:
acquire sample heat maps corresponding to several sample time frame groups, wherein a sample heat map describes the distribution heat of the sound source in different azimuths under the sample time frames;
label each sample heat map with sound source attribute parameters to obtain labeling information corresponding to each sample heat map; and
input each sample heat map and its corresponding labeling information into the machine learning model, so that the machine learning model learns the mapping relationship between image features and sound source attribute parameters.
In an optional embodiment, when converting the information flow into visualization data describing the azimuth distribution state of the sound source, the processor 61 is configured to:
input the information flow into the machine learning model; and
convert, in the machine learning model, the information flow into visualization data describing the azimuth distribution state of the sound source.
In an optional embodiment, the processor 61 is further configured to:
acquire sample information flows corresponding to several sample time frame groups;
label each sample information flow with sound source attribute parameters to obtain labeling information corresponding to each sample information flow; and
input each sample information flow and its corresponding labeling information into the machine learning model, so that the machine learning model converts each sample information flow into visualization data describing the azimuth distribution state of the sound source and learns the mapping relationship between image features and sound source attribute parameters.
In an optional embodiment, the acoustic signal flow contains the time-domain signal flows collected by the array elements of the microphone array; when performing sound source azimuth estimation based on the acoustic signal flow to obtain the information flow containing the sound source azimuth information under the at least one time frame, the processor 61 is configured to:
convert the time-domain signal flows collected by the array elements into time-frequency domain signals respectively; and
determine, by using a sound source azimuth estimation technique, the sound source azimuth information under the at least one time frame according to the time-frequency domain signals of the array elements.
In an optional embodiment, the sound source azimuth estimation technique includes one or more of the steered response power phase transform technique SRP-PHAT, the generalized cross-correlation phase transform technique GCC-PHAT, or the multiple signal classification technique MUSIC.
It is worth noting that, for the technical details of the above embodiments of the computing device, reference may be made to the relevant descriptions in the foregoing embodiments of the sound source tracking method; to save space, they are not repeated here, but this should not be construed as a loss of the protection scope of the present application.
Further, as shown in FIG. 6, the computing device further includes other components such as a microphone array 62, a communication component 63, and a power supply component 64. Only some components are schematically shown in FIG. 6, which does not mean that the computing device includes only the components shown in FIG. 6.
FIG. 7 is a flowchart of another sound source tracking method provided by an exemplary embodiment of the present application. The sound source tracking method provided in this embodiment may be performed by a sound source tracking apparatus; the sound source tracking apparatus may be implemented as software or as a combination of software and hardware, and may be integrated in a computing device. As shown in FIG. 7, the method includes:
Step 700: determine sound source azimuth information under at least one time frame within a target period, respectively;
Step 701: convert the sound source azimuth information under the at least one time frame into at least one set of image data describing the azimuth distribution state of the sound source, so as to form an image stream;
Step 702: perform image recognition on the image stream by using an image recognition model, so as to perform sound source tracking within the target period.
For step 700, reference may be made to the relevant description in the embodiment associated with FIG. 1. Step 700 may further include acquiring the acoustic signal flow collected by the microphone array under the at least one time frame, and performing sound source azimuth estimation based on the acoustic signal flow to obtain the sound source azimuth information under the at least one time frame. To save space, the specific process is not repeated here.
In step 701, the sound source azimuth information under the at least one time frame may be converted into at least one set of image data. The image data may be an azimuth distribution heat map; in that case, at least one azimuth distribution heat map is obtained in step 701 to form the image stream input to the image recognition model.
Based on step 701, the sound source tracking solution provided in this embodiment can be applied to scenarios such as real-time tracking and offline tracking. In an offline tracking scenario, the sound source azimuth information under the at least one time frame can be obtained all at once and grouped according to the recognition precision, so that the operation of converting the azimuth information into image data is performed group by group.
In an online tracking scenario, a target time frame within the current recognition period may be determined from the at least one time frame based on a preset recognition precision;
the sound source azimuth information under each target time frame is converted into a set of image data describing the azimuth distribution state of the sound source within the current recognition period; and
the time frames and image data within the next recognition period of the target period continue to be determined, until the image data corresponding to all recognition periods within the target period are generated.
In the online tracking scenario, as acoustic signals are continuously generated, the operation of converting the azimuth information into image data can be performed successively in each recognition period, and the resulting image data are successively input to the image recognition model in step 702.
For example, if the recognition precision is 1 s, one azimuth distribution heat map can be generated based on the sound source azimuth information of the N time frames within the current recognition period (1 s); the azimuth distribution heat map of the next recognition period (1 s) is then generated, and so on for each subsequent recognition period, with the heat maps provided to the image recognition model in a streaming manner, as sketched below.
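A minimal sketch of this streaming grouping is given below; the assumed frame rate, the one-second recognition period, and the recognize callback are placeholders for illustration and are not prescribed by the embodiments.

```python
import numpy as np

FRAME_RATE = 100       # assumed azimuth-information frames per second
RECOGNITION_PERIOD_S = 1.0
FRAMES_PER_MAP = int(FRAME_RATE * RECOGNITION_PERIOD_S)

def online_tracking(frame_stream, recognize):
    """Group streamed per-frame azimuth confidences into one heat map per
    recognition period and hand each map to the recognition model in turn."""
    buffer = []
    for confidences in frame_stream:       # one (n_azimuths,) vector per frame
        buffer.append(confidences)
        if len(buffer) == FRAMES_PER_MAP:  # current recognition period is full
            heatmap = (np.clip(np.stack(buffer), 0, 1) * 255).astype(np.uint8)
            recognize(heatmap)             # e.g. the image recognition model
            buffer.clear()                 # continue with the next period

# Example over three seconds of synthetic frames and a stub model.
rng = np.random.default_rng(2)
stream = (rng.random(360) for _ in range(3 * FRAMES_PER_MAP))
online_tracking(stream, lambda hm: print("heat map", hm.shape))
```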
In this embodiment, the image recognition model can be trained in advance. The image recognition model may be a machine learning model; for its training process, reference may be made to the relevant description in the embodiment associated with FIG. 1.
It is worth noting that the recognition precision remains consistent between the training phase and the application phase of the image recognition model.
For the technical details of the above embodiments of the sound source tracking method, reference may be made to the relevant descriptions in the embodiments of the sound source tracking method associated with FIG. 1; to save space, they are not repeated here, but this should not be construed as a loss of the protection scope of the present application.
FIG. 8 is a schematic structural diagram of another sound source tracking apparatus provided by an exemplary embodiment of the present application. Referring to FIG. 8, the sound source tracking apparatus includes:
a determination module 80, configured to determine sound source azimuth information under at least one time frame within a target period, respectively;
a conversion module 81, configured to convert the sound source azimuth information under the at least one time frame into at least one set of image data describing the azimuth distribution state of the sound source, so as to form an image stream; and
a tracking module 82, configured to perform image recognition on the image stream by using an image recognition model, so as to perform sound source tracking within the target period.
In an optional embodiment, when converting the sound source azimuth information under the at least one time frame into at least one set of image data describing the azimuth distribution state of the sound source to form the image stream, the conversion module 81 is configured to:
determine, based on a preset recognition precision, a target time frame within the current recognition period from the at least one time frame;
convert the sound source azimuth information under each target time frame into a set of image data describing the azimuth distribution state of the sound source within the current recognition period; and
continue to determine the time frames and image data within the next recognition period of the target period, until the image data corresponding to all recognition periods within the target period are generated, so as to form the image stream.
In an optional embodiment, the determination module 80 includes an acquisition module 83 and a calculation module 84;
the acquisition module 83 is configured to acquire the acoustic signal flow collected by the microphone array under the at least one time frame; and
the calculation module 84 is configured to perform sound source azimuth estimation based on the acoustic signal flow to obtain the sound source azimuth information under the at least one time frame.
In an optional embodiment, when converting the sound source azimuth information under each target time frame into a set of image data describing the azimuth distribution state of the sound source within the current recognition period, the conversion module 81 is configured to:
convert the sound source azimuth information under each target time frame into an azimuth distribution heat map of the sound source under the at least one time frame, wherein the azimuth distribution heat map describes the distribution heat of the sound source in different azimuths under the at least one time frame.
In an optional embodiment, the sound source azimuth information contains the confidence that the sound source is located at each azimuth; when converting the sound source azimuth information under each target time frame into the azimuth distribution heat map of the sound source under the at least one time frame, the conversion module 81 is configured to:
determine, based on the correspondence between confidence and display brightness and according to the confidence that the sound source is located at each azimuth under the at least one time frame, the display brightness corresponding to each azimuth under the at least one time frame, wherein different display brightnesses represent different distribution heats; and
generate the azimuth distribution heat map of the sound source under the at least one time frame according to the display brightness.
In an optional embodiment, when generating the azimuth distribution heat map of the sound source under the at least one time frame according to the display brightness, the conversion module 81 is configured to:
determine, according to the display brightness corresponding to each azimuth under the at least one time frame, the image content corresponding to each of the at least one time frame; and
arrange the image contents corresponding to the at least one time frame in sequence according to the chronological order of the at least one time frame, so as to generate the azimuth distribution heat map.
In an optional embodiment, if the image data is the azimuth distribution heat map of the sound source under the at least one time frame, the tracking module 82, when performing image recognition on the image stream by using the image recognition model to perform sound source tracking within the target period, is configured to:
extract, in the image recognition model, image features from the azimuth distribution heat map; and
determine, based on the mapping relationship between image features and sound source attribute parameters and the image features extracted from the azimuth distribution heat map, target sound source attribute parameters under the at least one time frame, so as to perform sound source tracking.
In an optional embodiment, the sound source attribute parameters include one or more of azimuth, quantity, sounding duration, and covered time frames.
In an optional embodiment, the tracking module 82 is further configured to:
acquire sample heat maps corresponding to several sample time frame groups, wherein a sample heat map describes the distribution heat of the sound source in different azimuths under the sample time frames;
label each sample heat map with sound source attribute parameters to obtain labeling information corresponding to each sample heat map; and
input each sample heat map and its corresponding labeling information into the image recognition model, so that the image recognition model learns the mapping relationship between image features and sound source attribute parameters.
In an optional embodiment, when converting the sound source azimuth information under each target time frame into the azimuth distribution heat map of the sound source under the at least one time frame, the tracking module 82 is configured to:
input the information flow into the image recognition model; and
convert, in the image recognition model, the sound source azimuth information under each target time frame into the azimuth distribution heat map of the sound source under the at least one time frame.
In an optional embodiment, the tracking module 82 is further configured to:
acquire sample information flows corresponding to several sample time frame groups;
label each sample information flow with sound source attribute parameters to obtain labeling information corresponding to each sample information flow; and
input each sample information flow and its corresponding labeling information into the image recognition model, so that the image recognition model converts each sample information flow into visualization data describing the azimuth distribution state of the sound source and learns the mapping relationship between image features and sound source attribute parameters.
In an optional embodiment, the acoustic signal flow contains the time-domain signal flows collected by the array elements of the microphone array; when performing sound source azimuth estimation based on the acoustic signal flow to obtain the sound source azimuth information under the at least one time frame, the calculation module 84 is configured to:
convert the time-domain signal flows collected by the array elements into time-frequency domain signals respectively; and
determine, by using a sound source azimuth estimation technique, the sound source azimuth information under the at least one time frame according to the time-frequency domain signals of the array elements.
In an optional embodiment, the sound source azimuth estimation technique includes one or more of the steered response power phase transform technique SRP-PHAT, the generalized cross-correlation phase transform technique GCC-PHAT, or the multiple signal classification technique MUSIC.
It is worth noting that, for the technical details of the above embodiments of the sound source tracking apparatus, reference may be made to the relevant descriptions in the embodiments of the sound source tracking method associated with FIG. 1 and FIG. 7; to save space, they are not repeated here, but this should not be construed as a loss of the protection scope of the present application.
FIG. 9 is a schematic structural diagram of another computing device provided by an exemplary embodiment of the present application. Referring to FIG. 9, the computing device includes a memory 90 and a processor 91.
The processor 91, coupled to the memory 90, is configured to execute the computer program in the memory 90 to:
determine sound source azimuth information under at least one time frame within a target period, respectively;
convert the sound source azimuth information under the at least one time frame into at least one set of image data describing the azimuth distribution state of the sound source, so as to form an image stream; and
perform image recognition on the image stream by using an image recognition model, so as to perform sound source tracking within the target period.
In an optional embodiment, when converting the sound source azimuth information under the at least one time frame into at least one set of image data describing the azimuth distribution state of the sound source to form the image stream, the processor 91 is configured to:
determine, based on a preset recognition precision, a target time frame within the current recognition period from the at least one time frame;
convert the sound source azimuth information under each target time frame into a set of image data describing the azimuth distribution state of the sound source within the current recognition period; and
continue to determine the time frames and image data within the next recognition period of the target period, until the image data corresponding to all recognition periods within the target period are generated, so as to form the image stream.
In an optional embodiment, when determining the sound source azimuth information under the at least one time frame within the target period, the processor 91 is configured to:
acquire the acoustic signal flow collected by the microphone array under the at least one time frame; and
perform sound source azimuth estimation based on the acoustic signal flow to obtain the sound source azimuth information under the at least one time frame.
In an optional embodiment, when converting the sound source azimuth information under each target time frame into a set of image data describing the azimuth distribution state of the sound source within the current recognition period, the processor 91 is configured to:
convert the sound source azimuth information under each target time frame into an azimuth distribution heat map of the sound source under the at least one time frame, wherein the azimuth distribution heat map describes the distribution heat of the sound source in different azimuths under the at least one time frame.
In an optional embodiment, the sound source azimuth information contains the confidence that the sound source is located at each azimuth; when converting the sound source azimuth information under each target time frame into the azimuth distribution heat map of the sound source under the at least one time frame, the processor 91 is configured to:
determine, based on the correspondence between confidence and display brightness and according to the confidence that the sound source is located at each azimuth under the at least one time frame, the display brightness corresponding to each azimuth under the at least one time frame, wherein different display brightnesses represent different distribution heats; and
generate the azimuth distribution heat map of the sound source under the at least one time frame according to the display brightness.
In an optional embodiment, when generating the azimuth distribution heat map of the sound source under the at least one time frame according to the display brightness, the processor 91 is configured to:
determine, according to the display brightness corresponding to each azimuth under the at least one time frame, the image content corresponding to each of the at least one time frame; and
arrange the image contents corresponding to the at least one time frame in sequence according to the chronological order of the at least one time frame, so as to generate the azimuth distribution heat map.
In an optional embodiment, if the image data is the azimuth distribution heat map of the sound source under the at least one time frame, the processor 91, when performing image recognition on the image stream by using the image recognition model to perform sound source tracking within the target period, is configured to:
extract, in the image recognition model, image features from the azimuth distribution heat map; and
determine, based on the mapping relationship between image features and sound source attribute parameters and the image features extracted from the azimuth distribution heat map, target sound source attribute parameters under the at least one time frame, so as to perform sound source tracking.
In an optional embodiment, the sound source attribute parameters include one or more of azimuth, quantity, sounding duration, and covered time frames.
In an optional embodiment, the processor 91 is further configured to:
acquire sample heat maps corresponding to several sample time frame groups, wherein a sample heat map describes the distribution heat of the sound source in different azimuths under the sample time frames;
label each sample heat map with sound source attribute parameters to obtain labeling information corresponding to each sample heat map; and
input each sample heat map and its corresponding labeling information into the image recognition model, so that the image recognition model learns the mapping relationship between image features and sound source attribute parameters.
In an optional embodiment, when converting the sound source azimuth information under each target time frame into the azimuth distribution heat map of the sound source under the at least one time frame, the processor 91 is configured to:
input the information flow into the image recognition model; and
convert, in the image recognition model, the sound source azimuth information under each target time frame into the azimuth distribution heat map of the sound source under the at least one time frame.
In an optional embodiment, the processor 91 is further configured to:
acquire sample information flows corresponding to several sample time frame groups;
label each sample information flow with sound source attribute parameters to obtain labeling information corresponding to each sample information flow; and
input each sample information flow and its corresponding labeling information into the image recognition model, so that the image recognition model converts each sample information flow into visualization data describing the azimuth distribution state of the sound source and learns the mapping relationship between image features and sound source attribute parameters.
In an optional embodiment, the acoustic signal flow contains the time-domain signal flows collected by the array elements of the microphone array; when performing sound source azimuth estimation based on the acoustic signal flow to obtain the sound source azimuth information under the at least one time frame, the processor 91 is configured to:
convert the time-domain signal flows collected by the array elements into time-frequency domain signals respectively; and
determine, by using a sound source azimuth estimation technique, the sound source azimuth information under the at least one time frame according to the time-frequency domain signals of the array elements.
In an optional embodiment, the sound source azimuth estimation technique includes one or more of the steered response power phase transform technique SRP-PHAT, the generalized cross-correlation phase transform technique GCC-PHAT, or the multiple signal classification technique MUSIC.
It is worth noting that, for the technical details of the above embodiments of the computing device, reference may be made to the relevant descriptions in the embodiments of the sound source tracking method associated with FIG. 1 and FIG. 7; to save space, they are not repeated here, but this should not be construed as a loss of the protection scope of the present application.
Further, as shown in FIG. 9, the computing device further includes other components such as a microphone array 92, a communication component 93, and a power supply component 94. Only some components are schematically shown in FIG. 9, which does not mean that the computing device includes only the components shown in FIG. 9.
Correspondingly, an embodiment of the present application further provides a computer-readable storage medium storing a computer program; when the computer program is executed, the steps that can be performed by the computing device in the foregoing method embodiments can be implemented.
FIG. 10 is a schematic structural diagram of a sound source tracking system provided by an exemplary embodiment of the present application. Referring to FIG. 10, the sound source tracking system may include a microphone array 10 and a computing device 20, which are communicatively connected.
The sound source tracking system provided in this embodiment can be applied to various scenarios, for example, voice control scenarios, audio and video conference scenarios, or other scenarios requiring sound source tracking; the application scenario is not limited in this embodiment. In different application scenarios, the sound source tracking system can be integrated and deployed in a wide variety of scenario devices; for example, in a voice control scenario it can be deployed in smart speakers or smart robots, and in an audio and video conference scenario it can be deployed in various conference terminals.
The microphone array 10 can be used to collect acoustic signals. In this embodiment, neither the number nor the arrangement of the array elements of the microphone array 10 is limited.
For the technical details of the computing device, reference may be made to the relevant descriptions in the embodiments associated with FIG. 6 and FIG. 9; to save space, they are not repeated here, but this should not be construed as a loss of the protection scope of the present application.
The memories in FIG. 6 and FIG. 9 above are configured to store computer programs and may be configured to store various other data to support operations on the computing platform. Examples of such data include instructions of any application program or method operating on the computing platform, contact data, phonebook data, messages, pictures, videos, and the like. The memory may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as a static random access memory (SRAM), an electrically erasable programmable read-only memory (EEPROM), an erasable programmable read-only memory (EPROM), a programmable read-only memory (PROM), a read-only memory (ROM), a magnetic memory, a flash memory, a magnetic disk, or an optical disc.
The communication components in FIG. 6 and FIG. 9 above are configured to facilitate wired or wireless communication between the device where the communication component is located and other devices. The device where the communication component is located can access a wireless network based on a communication standard, such as WiFi, a mobile communication network such as 2G, 3G, 4G/LTE or 5G, or a combination thereof. In an exemplary embodiment, the communication component receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component further includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
The power supply components in FIG. 6 and FIG. 9 above provide power for the various components of the device where the power supply component is located. The power supply component may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device where the power supply component is located.
Those skilled in the art should understand that the embodiments of the present application may be provided as a method, a system, or a computer program product. Therefore, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, a magnetic disk memory, a CD-ROM, an optical memory, and the like) containing computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of the method, the device (system), and the computer program product according to the embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or the other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or another programmable data processing device to work in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data processing device, so that a series of operation steps are performed on the computer or the other programmable device to produce computer-implemented processing, and the instructions executed on the computer or the other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
In a typical configuration, a computing device includes one or more processors (CPUs), an input/output interface, a network interface, and a memory.
The memory may include a non-persistent memory, a random access memory (RAM), and/or a non-volatile memory in computer-readable media, such as a read-only memory (ROM) or a flash memory (flash RAM). The memory is an example of a computer-readable medium.
Computer-readable media include persistent and non-persistent, removable and non-removable media, and information storage may be implemented by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, a phase-change memory (PRAM), a static random access memory (SRAM), a dynamic random access memory (DRAM), other types of random access memory (RAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a flash memory or other memory technology, a compact disc read-only memory (CD-ROM), a digital versatile disc (DVD) or other optical storage, a magnetic cassette, a magnetic tape or magnetic disk storage or other magnetic storage device, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media, such as modulated data signals and carrier waves.
It should also be noted that the terms "comprise", "include", or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the existence of other identical elements in the process, method, article, or device that includes the element.
The above descriptions are merely embodiments of the present application and are not intended to limit the present application. For those skilled in the art, various modifications and variations of the present application are possible. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present application shall fall within the protection scope of the present application.

Claims (20)

  1. A sound source tracking method, characterized by comprising:
    acquiring an acoustic signal flow collected by a microphone array under at least one time frame;
    performing sound source azimuth estimation based on the acoustic signal flow to obtain an information flow containing sound source azimuth information under the at least one time frame;
    converting the information flow into visualization data describing an azimuth distribution state of the sound source; and
    performing sound source tracking according to the visualization data.
  2. The method according to claim 1, characterized in that the converting the information flow into visualization data describing the azimuth distribution state of the sound source comprises:
    converting the information flow into an azimuth distribution heat map of the sound source under the at least one time frame, wherein the azimuth distribution heat map describes the distribution heat of the sound source in different azimuths under the at least one time frame.
  3. The method according to claim 2, characterized in that the sound source azimuth information contains the confidence that the sound source is located at each azimuth, and the converting the information flow into the azimuth distribution heat map of the sound source under the at least one time frame comprises:
    determining, based on the correspondence between confidence and display brightness and according to the confidence that the sound source is located at each azimuth under the at least one time frame, the display brightness corresponding to each azimuth under the at least one time frame, wherein different display brightnesses represent different distribution heats; and
    generating the azimuth distribution heat map of the sound source under the at least one time frame according to the display brightness.
  4. The method according to claim 3, characterized in that the generating the azimuth distribution heat map of the sound source under the at least one time frame according to the display brightness comprises:
    determining, according to the display brightness corresponding to each azimuth under the at least one time frame, the image content corresponding to each of the at least one time frame; and
    arranging the image contents corresponding to the at least one time frame in sequence according to the chronological order of the at least one time frame, so as to generate the azimuth distribution heat map.
  5. The method according to claim 1, characterized in that the performing sound source tracking according to the visualization data comprises:
    performing sound source tracking by using a machine learning model and the visualization data.
  6. The method according to claim 5, characterized in that, if the visualization data is the azimuth distribution heat map of the sound source under the at least one time frame, the performing sound source tracking by using a machine learning model and the visualization data comprises:
    extracting, in the machine learning model, image features from the azimuth distribution heat map; and
    determining, based on the mapping relationship between image features and sound source attribute parameters and the image features extracted from the azimuth distribution heat map, target sound source attribute parameters under the at least one time frame, so as to perform sound source tracking.
  7. The method according to claim 6, characterized in that the sound source attribute parameters include one or more of azimuth, quantity, sounding duration, and covered time frames.
  8. The method according to claim 6, characterized by further comprising:
    acquiring sample heat maps corresponding to several sample time frame groups, wherein a sample heat map describes the distribution heat of the sound source in different azimuths under the sample time frames;
    labeling each sample heat map with sound source attribute parameters to obtain labeling information corresponding to each sample heat map; and
    inputting each sample heat map and its corresponding labeling information into the machine learning model, so that the machine learning model learns the mapping relationship between the image features and sound source attribute parameters.
  9. The method according to claim 6, characterized by further comprising:
    acquiring sample information flows corresponding to several sample time frame groups;
    labeling each sample information flow with sound source attribute parameters to obtain labeling information corresponding to each sample information flow; and
    inputting each sample information flow and its corresponding labeling information into the machine learning model, so that the machine learning model converts each sample information flow into visualization data describing the azimuth distribution state of the sound source and learns the mapping relationship between the image features and sound source attribute parameters.
  10. The method according to claim 9, characterized in that the converting the information flow into visualization data describing the azimuth distribution state of the sound source comprises:
    inputting the information flow into the machine learning model; and
    converting, in the machine learning model, the information flow into visualization data describing the azimuth distribution state of the sound source.
  11. The method according to claim 1, characterized in that the acoustic signal flow contains the time-domain signal flows collected by the array elements of the microphone array, and the performing sound source azimuth estimation based on the acoustic signal flow to obtain the information flow containing the sound source azimuth information under the at least one time frame comprises:
    converting the time-domain signal flows collected by the array elements into time-frequency domain signals respectively; and
    determining, by using a sound source azimuth estimation technique, the sound source azimuth information under the at least one time frame according to the time-frequency domain signals of the array elements.
  12. The method according to claim 11, characterized in that the sound source azimuth estimation technique includes one or more of the steered response power phase transform technique SRP-PHAT, the generalized cross-correlation phase transform technique GCC-PHAT, or the multiple signal classification technique MUSIC.
  13. 一种声源追踪方法,其特征在于,包括:A sound source tracking method, comprising:
    在目标时段内的至少一个时间帧下,分别确定声源方位信息;Under at least one time frame within the target period, determine the sound source position information respectively;
    将所述至少一个时间帧下的声源方位信息,转换为描述声源的方位分布状态的至少一组图像数据,以形成图像流;Converting the sound source position information under the at least one time frame into at least one set of image data describing the position distribution state of the sound source to form an image stream;
    利用图像识别模型对所述图像流进行图像识别,以在所述目标时段内进行声源追踪。Image recognition is performed on the image stream using an image recognition model for sound source tracking within the target time period.
  14. The method according to claim 13, wherein the converting the sound source azimuth information under the at least one time frame into at least one set of image data describing the azimuth distribution state of the sound source so as to form an image stream comprises:
    determining, based on a preset recognition accuracy, target time frames within a current recognition period from the at least one time frame;
    converting the sound source azimuth information under each target time frame into a set of image data describing the azimuth distribution state of the sound source within the current recognition period;
    continuing to determine the time frames and image data within the next recognition period in the target time period, until image data corresponding to all recognition periods within the target time period have been generated, so as to form the image stream.
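Illustrative note (not part of the claims): one way to realize the grouping in claims 13 and 14 is to bucket per-frame azimuth estimates into fixed-length recognition periods and render each period as one azimuth-by-frame heat-map image of the stream handed to the image recognition model. The bin width and period length below are assumptions for illustration.

    # Sketch of claims 13-14: turn a stream of per-frame azimuth estimates
    # into an image stream, one heat-map image per recognition period.
    import numpy as np

    N_BINS = 72             # assumed: 360 degrees in 5-degree azimuth bins
    FRAMES_PER_PERIOD = 32  # assumed recognition-period length in time frames

    def azimuths_to_image_stream(azimuths_deg: list) -> np.ndarray:
        """azimuths_deg: one azimuth estimate (degrees) per time frame.
        Returns (n_periods, N_BINS, FRAMES_PER_PERIOD) one-hot heat maps."""
        n_periods = len(azimuths_deg) // FRAMES_PER_PERIOD
        stream = np.zeros((n_periods, N_BINS, FRAMES_PER_PERIOD),
                          dtype=np.float32)
        for p in range(n_periods):
            for f in range(FRAMES_PER_PERIOD):
                az = azimuths_deg[p * FRAMES_PER_PERIOD + f] % 360.0
                stream[p, int(az // (360.0 / N_BINS)), f] = 1.0
        return stream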
  15. A sound source tracking apparatus, comprising:
    an acquisition module, configured to acquire an acoustic signal stream collected by a microphone array under at least one time frame;
    a calculation module, configured to perform sound source azimuth estimation based on the acoustic signal stream, so as to obtain an information stream containing the sound source azimuth information under the at least one time frame;
    a conversion module, configured to convert the information stream into visual data describing the azimuth distribution state of the sound source; and
    a tracking module, configured to perform sound source tracking according to the visual data.
  16. A computing device, comprising a memory and a processor, wherein
    the memory is configured to store one or more computer instructions; and
    the processor is coupled to the memory and configured to execute the one or more computer instructions to:
    acquire an acoustic signal stream collected by a microphone array under at least one time frame;
    perform sound source azimuth estimation based on the acoustic signal stream, so as to obtain an information stream containing the sound source azimuth information under the at least one time frame;
    convert the information stream into visual data describing the azimuth distribution state of the sound source; and
    perform sound source tracking according to the visual data.
  17. A sound source tracking apparatus, comprising:
    a determination module, configured to determine sound source azimuth information respectively under at least one time frame within a target time period;
    a conversion module, configured to convert the sound source azimuth information under the at least one time frame into at least one set of image data describing the azimuth distribution state of the sound source, so as to form an image stream; and
    a tracking module, configured to perform image recognition on the image stream by using an image recognition model, so as to perform sound source tracking within the target time period.
  18. A computing device, comprising a memory and a processor, wherein
    the memory is configured to store one or more computer instructions; and
    the processor is coupled to the memory and configured to execute the one or more computer instructions to:
    determine sound source azimuth information respectively under at least one time frame within a target time period;
    convert the sound source azimuth information under the at least one time frame into at least one set of image data describing the azimuth distribution state of the sound source, so as to form an image stream; and
    perform image recognition on the image stream by using an image recognition model, so as to perform sound source tracking within the target time period.
  19. A sound source tracking system, comprising a microphone array and a computing device, the microphone array being communicatively connected to the computing device, wherein
    the microphone array is configured to collect acoustic signals; and
    the computing device is configured to: acquire an acoustic signal stream collected by the microphone array under at least one time frame; perform sound source azimuth estimation based on the acoustic signal stream, so as to obtain an information stream containing the sound source azimuth information under the at least one time frame; convert the information stream into visual data describing the azimuth distribution state of the sound source; and perform sound source tracking according to the visual data.
  20. A computer-readable storage medium storing computer instructions, wherein, when the computer instructions are executed by one or more processors, the one or more processors are caused to perform the sound source tracking method according to any one of claims 1 to 14.
PCT/CN2021/122742 2020-10-12 2021-10-09 Sound source tracking method and apparatus, and device, system and storage medium WO2022078249A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011086519.8A CN114355286A (en) 2020-10-12 2020-10-12 Sound source tracking method, device, equipment, system and storage medium
CN202011086519.8 2020-10-12

Publications (1)

Publication Number Publication Date
WO2022078249A1

Family

ID=81089773

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/122742 WO2022078249A1 (en) 2020-10-12 2021-10-09 Sound source tracking method and apparatus, and device, system and storage medium

Country Status (2)

Country Link
CN (1) CN114355286A (en)
WO (1) WO2022078249A1 (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120113122A1 (en) * 2010-11-09 2012-05-10 Denso Corporation Sound field visualization system
CN103167373A (en) * 2011-12-09 2013-06-19 现代自动车株式会社 Technique for localizing sound source
CN105073073A (en) * 2013-01-25 2015-11-18 胡海 Devices and methods for the visualization and localization of sound
CN110907778A (en) * 2019-12-12 2020-03-24 国网重庆市电力公司电力科学研究院 GIS equipment partial discharge ultrasonic positioning method, device, equipment and medium
CN111443330A (en) * 2020-05-15 2020-07-24 浙江讯飞智能科技有限公司 Acoustic imaging method, acoustic imaging device, acoustic imaging equipment and readable storage medium
WO2020166324A1 (en) * 2019-02-12 2020-08-20 ソニー株式会社 Information processing device and method, and program

Also Published As

Publication number Publication date
CN114355286A (en) 2022-04-15

Similar Documents

Publication Publication Date Title
JP7434137B2 (en) Speech recognition method, device, equipment and computer readable storage medium
JP7034339B2 (en) Audio signal processing system and how to convert the input audio signal
JP6526083B2 (en) System and method for source signal separation
JP6526049B2 (en) Method and system for improved measurements in source signal separation, entity and parameter estimation, and path propagation effect measurement and mitigation
JP2017516131A5 (en)
US20160302005A1 (en) Method for processing data for the estimation of mixing parameters of audio signals, mixing method, devices, and associated computers programs
CN110610698B (en) Voice labeling method and device
Bianco et al. Semi-supervised source localization in reverberant environments with deep generative modeling
Argentieri et al. Binaural systems in robotics
CN113466793A (en) Sound source positioning method and device based on microphone array and storage medium
Chen et al. Sound localization by self-supervised time delay estimation
CN110515034B (en) Acoustic signal azimuth angle measurement system and method
WO2022078249A1 (en) Sound source tracking method and apparatus, and device, system and storage medium
EP2932503A1 (en) An apparatus aligning audio signals in a shared audio scene
Fuentes et al. Urban sound & sight: Dataset and benchmark for audio-visual urban scene understanding
WO2019127437A1 (en) Map labeling method and apparatus, and cloud server, terminal and application program
CN113608167B (en) Sound source positioning method, device and equipment
Bergh et al. Multi-speaker voice activity detection using a camera-assisted microphone array
CN112311999A (en) Intelligent video sound box device and camera visual angle adjusting method thereof
Wu Digital media recording and broadcasting classroom using Internet intelligent image positioning and opinion monitoring in communication
Berghi et al. Audio inputs for active speaker detection and localization via microphone array
WO2022183968A1 (en) Audio signal processing method, devices, system, and storage medium
Zhao et al. Visually assisted self-supervised audio speaker localization and tracking
Wu et al. Multi-speaker DoA Estimation Using Audio and Visual Modality
CN115620201B (en) House model construction method, device, equipment and storage medium

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 21879294

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 EP: PCT application non-entry in European phase

Ref document number: 21879294

Country of ref document: EP

Kind code of ref document: A1