CN117054968B - Sound source positioning system and method based on linear array microphone


Info

Publication number
CN117054968B
Authority
CN
China
Prior art keywords
microphone
matrix
training
feature
topology
Prior art date
Legal status
Active
Application number
CN202311051156.8A
Other languages
Chinese (zh)
Other versions
CN117054968A (en)
Inventor
黄术
黄琪敏
魏祥
夏航剑
Current Assignee
Hangzhou Youhang Information Technology Co ltd
Original Assignee
Hangzhou Youhang Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Hangzhou Youhang Information Technology Co ltd
Priority to CN202311051156.8A
Publication of CN117054968A
Application granted
Publication of CN117054968B


Classifications

    • GPHYSICS
    • G01MEASURING; TESTING
    • G01SRADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S5/00Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations
    • G01S5/18Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations using ultrasonic, sonic, or infrasonic waves
    • G01S5/20Position of source determined by a plurality of spaced direction-finders

Abstract

The application discloses a sound source localization system and method based on a linear array microphone. The system monitors and analyzes sound signals in the environment in real time, automatically calculates the optimal array direction from the direction information of the sound in the environment, and sends adjustment instructions to the microphone array so that the array always faces the sound source. The sound source localization system based on the linear array microphone can thus better adapt to practical application requirements, optimizing the sound source localization effect.

Description

Sound source positioning system and method based on linear array microphone
Technical Field
The present application relates to the field of intelligent localization, and more particularly, to a sound source localization system based on linear array microphones and a method thereof.
Background
Sound source localization refers to determining the position and direction of a sound source by analyzing sound signals in an environment. It has wide application in many fields such as conference recording, speech recognition, smart home, etc. A linear array microphone is a commonly used microphone array configuration that arranges a plurality of microphones in a straight line. By utilizing the time delay differences and amplitude differences between microphones, sound source localization can be achieved.
However, conventional linear array microphone parameter and configuration adjustments are typically based on static environmental information and cannot adapt to dynamic changes in the environment in real time. For example, when the noise level, sound absorption characteristics, or sound source position in the environment changes, the system cannot make corresponding adjustments in time, resulting in a decrease in positioning accuracy.
Accordingly, an optimized linear array microphone based sound source localization system is desired.
Disclosure of Invention
The present application has been made in order to solve the above technical problems. The embodiments of the application provide a sound source localization system and method based on linear array microphones that monitor and analyze sound signals in the environment in real time, automatically calculate the optimal array direction from the direction information of the sound in the environment, and send adjustment instructions to the microphone array so that the array always faces the sound source. The sound source localization system based on the linear array microphone can thus better adapt to practical application requirements, optimizing the sound source localization effect.
According to one aspect of the present application, there is provided a sound source localization method based on a linear array microphone, including:
acquiring a plurality of ambient sound signals acquired by a linear array microphone;
performing feature analysis on the plurality of ambient sound signals to obtain global sound waveform features; and
based on the global sound waveform characteristics, direction coordinates of a recommended microphone array are determined.
According to another aspect of the present application, there is provided a linear array microphone-based sound source localization system, including:
the acquisition board consists of a microphone array which is linearly arranged and is used for acquiring sound signals of surrounding environment;
the main control board is communicatively connected to the acquisition board through a PIN wire connector, and is used for receiving and analyzing the sound signals acquired by the acquisition board and for controlling and adjusting the direction of the microphone array.
Compared with the prior art, the sound source localization system and method based on the linear array microphone monitor and analyze sound signals in the environment in real time, automatically calculate the optimal array direction from the direction information of the sound in the environment, and send adjustment instructions to the microphone array so that the array always faces the sound source. The system can thus better adapt to practical application requirements, optimizing the sound source localization effect.
Drawings
The foregoing and other objects, features and advantages of the present application will become more apparent from the following more particular description of embodiments of the present application, as illustrated in the accompanying drawings. The accompanying drawings are included to provide a further understanding of embodiments of the application and are incorporated in and constitute a part of this specification, illustrate the application and not constitute a limitation to the application. In the drawings, like reference numerals generally refer to like parts or steps.
Fig. 1 is a block diagram of a linear array microphone-based sound source localization system according to an embodiment of the present application;
fig. 2 is a flow chart of a method of linear array microphone based sound source localization in accordance with an embodiment of the present application;
fig. 3 is a system architecture diagram of a linear array microphone-based sound source localization method according to an embodiment of the present application;
fig. 4 is a flowchart of a training phase of a linear array microphone-based sound source localization method according to an embodiment of the present application;
fig. 5 is a flowchart of substep S2 of a linear array microphone based sound source localization method according to an embodiment of the present application;
fig. 6 is a flowchart of substep S22 of a linear array microphone based sound source localization method according to an embodiment of the present application.
Detailed Description
Hereinafter, example embodiments according to the present application will be described in detail with reference to the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application and not all of the embodiments of the present application, and it should be understood that the present application is not limited by the example embodiments described herein.
As used in this application and in the claims, the terms "a," "an," "the," and/or "said" are not specific to the singular but may include the plural, unless the context clearly dictates otherwise. In general, the terms "comprises" and "comprising" merely indicate the inclusion of the steps and elements that are explicitly identified; they do not constitute an exclusive list, and a method or apparatus may include other steps or elements.
Although the present application makes various references to certain modules in a system according to embodiments of the present application, any number of different modules may be used and run on a user terminal and/or server. The modules are merely illustrative, and different aspects of the systems and methods may use different modules.
Flowcharts are used in this application to describe the operations performed by systems according to embodiments of the present application. It should be understood that the preceding or following operations are not necessarily performed in order precisely. Rather, the various steps may be processed in reverse order or simultaneously, as desired. Also, other operations may be added to or removed from these processes.
The parameters and configuration adjustment of the traditional linear array microphone is usually based on static environment information, and cannot adapt to the dynamic change of the environment in real time. For example, when the noise level, sound absorption characteristics, or sound source position in the environment changes, the system cannot make corresponding adjustments in time, resulting in a decrease in positioning accuracy. Accordingly, an optimized linear array microphone based sound source localization system is desired.
In the technical scheme of the application, a sound source localization method based on a linear array microphone is provided. Fig. 2 is a flowchart of a method of linear array microphone based sound source localization in accordance with an embodiment of the present application. Fig. 3 is a system architecture diagram of a linear array microphone-based sound source localization method according to an embodiment of the present application. As shown in fig. 2 and 3, the sound source localization method based on the linear array microphone according to the embodiment of the present application includes the steps of: S1, acquiring a plurality of surrounding environment sound signals acquired by a linear array microphone; S2, performing feature analysis on the plurality of surrounding environment sound signals to obtain global sound waveform features; and S3, determining the direction coordinates of the recommended microphone array based on the global sound waveform features.
In particular, in step S1, a plurality of ambient sound signals collected by the linear array microphone are acquired. Accordingly, in one specific example of the present application, for a linear array comprising 8 microphones, it may be possible to collect 8 ambient sound signals simultaneously. By analysing these signals, it is possible to determine the location information of the sound sources and to distinguish between different sound sources, or to separate the mixed sound signals into individual sound sources, thereby optimizing the sound source localization effect. It should be noted that a linear array microphone is a microphone arrangement in which a plurality of microphones are placed together in a linear arrangement so as to collect signals of a plurality of sound sources at the same time. Its principle is based on the relative positional relationship between the propagation of sound waves in space and microphones. When sound waves propagate to the linear array microphone, there is a slight difference in arrival time of the sound waves at the different microphones due to the difference in distance between the microphones. This difference can be used to estimate the direction and position of the sound source.
Accordingly, in one possible implementation, a plurality of ambient sound signals acquired by the linear array microphone may be acquired by the following steps:
1. Obtain a set of linearly arranged microphones, typically in a uniformly spaced arrangement, ensuring that the position of each microphone is fixed and that no physical obstacle lies between the microphones and the sound sources to be collected.
2. Calibrate the microphones so that the gain and delay of each microphone are consistent. This can be accomplished by playing known sound sources and measuring the response of each microphone with a calibration tool (e.g., a sound card or a dedicated device).
3. Collect sound signals of the surrounding environment with the microphone array: connect the array to a computer or audio interface using appropriate hardware and software configurations, and set parameters such as sampling rate and bit depth.
4. Ensure that all microphones in the array acquire data at the same sampling rate with synchronized clocks; this may be achieved by hardware or software synchronization.
5. Process the collected sound signals. Digital signal processing (DSP) techniques may be used to filter out noise, reduce echo, and so on; beamforming techniques may also be applied to enhance sound from a particular direction.
6. Analyze the signal of each microphone channel. Useful information can be extracted using time-domain analysis, frequency-domain analysis, sound source localization, and similar techniques.
7. If the signals of multiple microphone channels need to be synchronized, time stamps or synchronization signals may be used to align the data.
8. Further process the analysis results as required, e.g., sound enhancement, sound source separation, or speech recognition.
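As an illustrative sketch of the delay-difference principle underlying the array (not the patent's claimed method), the time-delay of arrival between two microphone channels can be estimated from the peak of their cross-correlation. All names, the 16 kHz sampling rate, and the synthetic pulse signals below are assumptions made only so the example is self-contained:

```python
import numpy as np

def estimate_tdoa(sig_a, sig_b, fs):
    """Estimate the time-delay of arrival of channel A relative to
    channel B from the peak of their full cross-correlation."""
    corr = np.correlate(sig_a, sig_b, mode="full")
    lag = int(np.argmax(corr)) - (len(sig_b) - 1)  # lag in samples
    return lag / fs                                # delay in seconds

# Synthetic check: the same pulse reaches mic B 25 samples after mic A.
fs = 16_000
pulse = np.hanning(64)
sig_a = np.zeros(1024)
sig_a[100:164] = pulse
sig_b = np.zeros(1024)
sig_b[125:189] = pulse
tdoa = estimate_tdoa(sig_a, sig_b, fs)  # negative: A hears it first
```

With the sound speed and the known microphone spacing, such a delay converts directly into an angle-of-arrival estimate.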
In particular, in step S2, a feature analysis is performed on the plurality of ambient sound signals to obtain global sound waveform features. In particular, in one specific example of the present application, as shown in fig. 5, the S2 includes: s21, extracting waveform characteristics of the plurality of surrounding environment sound signals through a sound waveform characteristic extractor based on a two-dimensional convolution layer so as to obtain a plurality of surrounding environment sound waveform characteristic vectors; s22, carrying out microphone array topology association analysis on the linear array microphone to obtain a microphone array topology feature matrix; and S23, performing association coding of graph structures on the plurality of surrounding environment sound waveform feature vectors and the microphone array topology feature matrix to obtain a topology global sound waveform feature matrix as the global sound waveform feature.
Specifically, in S21, waveform features of the plurality of ambient sound signals are extracted by a sound waveform feature extractor based on a two-dimensional convolution layer, so as to obtain a plurality of ambient sound waveform feature vectors. Since each ambient sound signal is represented in the time domain as a waveform diagram, the two-dimensional-convolution-based sound waveform feature extractor, which excels at implicit feature extraction from images, is used to mine each ambient sound signal and extract the feature distribution information of the ambient sound waveform it contains, thereby obtaining the plurality of ambient sound waveform feature vectors. Specifically, each layer of the two-dimensional-convolution-based sound waveform feature extractor processes its input data in the forward pass as follows: convolve the input data to obtain a convolution feature map; perform feature-matrix-wise global mean pooling on the convolution feature map to obtain a pooled feature map; and apply a nonlinear activation to the pooled feature map to obtain an activation feature map. The output of the last layer of the extractor is the plurality of ambient sound waveform feature vectors, and the input of its first layer is the plurality of ambient sound signals.
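The per-layer forward pass just described (convolution, global mean pooling over each feature map, then a nonlinear activation) can be sketched in plain NumPy. This is a minimal illustration under assumed shapes and kernel sizes, not the patent's implementation:

```python
import numpy as np

def conv2d_valid(x, k):
    """Naive 'valid' 2-D convolution (cross-correlation, as in deep
    learning frameworks) of one single-channel input with one kernel."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def waveform_layer(x, kernels):
    """One layer as described: convolution, global mean pooling per
    feature map, ReLU activation -> one feature vector."""
    maps = [conv2d_valid(x, k) for k in kernels]
    pooled = np.array([m.mean() for m in maps])  # global mean pooling
    return np.maximum(pooled, 0.0)               # nonlinear activation

rng = np.random.default_rng(0)
waveform_img = rng.standard_normal((32, 32))       # stand-in waveform image
kernels = [rng.standard_normal((3, 3)) for _ in range(4)]
vec = waveform_layer(waveform_img, kernels)        # 4-dim feature vector
```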
It is noted that a two-dimensional convolution layer is a neural network layer commonly used in deep learning for processing two-dimensional data such as images; it performs feature extraction and representation learning on input data through convolution operations. The coding process of the two-dimensional convolution layer is as follows. Input and convolution kernels: the layer accepts a two-dimensional input tensor, typically of shape (width, height, channels), and contains a set of convolution kernels (also called filters), each a small two-dimensional weight matrix. Convolution operation: this is the core operation of the layer; each convolution kernel is multiplied element-wise with a local region of the input data, the products are summed to yield one output element, and sliding the kernel over the input produces the full output feature map. Padding and stride: padding (zero or otherwise) may be added around the input to control the size of the output feature map, and the sliding stride of the kernel can be specified to control its spatial resolution. Activation function: after the convolution, an activation function such as ReLU, Sigmoid, or tanh is typically applied to introduce nonlinearity and increase the expressive capacity of the network. Parameter sharing: the weights of each convolution kernel are shared, i.e., the same weights are used over the whole input, which greatly reduces the number of network parameters and improves the efficiency and generalization ability of the model. Pooling layer: after the convolution layer, a pooling layer is typically added to reduce the size of the feature map and extract more robust features; common operations are maximum pooling and average pooling, which output the maximum or average value within the pooling window, respectively. Multiple channels: the convolution layer can process input data with multiple channels (e.g., an RGB image with three channels); in the convolution operation, each kernel convolves with the corresponding channels of the input and generates an output feature map.
Specifically, in S22, a microphone array topology association analysis is performed on the linear array microphone to obtain a microphone array topology feature matrix. In particular, in one specific example of the present application, as shown in fig. 6, the S22 includes: s221, constructing a microphone array topology matrix of the linear array microphone, wherein the characteristic value of each position on the non-diagonal position in the microphone array topology matrix is the Euclidean distance between the two corresponding microphones; and S222, passing the microphone array topology matrix through a topology feature extractor based on a convolutional neural network model to obtain the microphone array topology feature matrix.
More specifically, the S221 constructs a microphone array topology matrix of the linear array microphone, where a characteristic value of each position on the non-diagonal position in the microphone array topology matrix is a euclidean distance between the two corresponding microphones. Because the linear array microphones are arranged according to a preset topological pattern, each microphone in the linear array microphones has a topological association relation in space. Thus, there is also implicit associated characteristic distribution information between the waveform characteristic information of the respective ambient sound signals. Based on this, in order to analyze the ambient sound signals collected by the linear microphone more fully and accurately, so as to improve the accuracy of sound localization, in the technical solution of the present application, a microphone array topology matrix of the linear array microphone is further constructed, where the feature value of each position on the non-diagonal position in the microphone array topology matrix is the euclidean distance between the two corresponding microphones. It is worth mentioning that euclidean distance is a measure of the distance between two points calculated in euclidean space. It is one of the most common and intuitive distance measurement methods, and is also widely used in data analysis and pattern recognition tasks in various fields. Euclidean distance is widely used in many fields, such as clustering and classification algorithms in machine learning, feature matching in image processing, distance measurement in spatial analysis, and the like. It provides an intuitive measure of similarity or variability between data points.
Accordingly, in one possible implementation, the microphone array topology matrix of the linear array microphone, in which the feature value at each off-diagonal position is the Euclidean distance between the two corresponding microphones, may be constructed by the following steps: determine the linear arrangement of the microphones, for example in left-to-right order; measure or calculate the Euclidean distance between each pair of microphones according to the actual setup or design requirements (for N microphones, there are N(N-1) off-diagonal positions, or N(N-1)/2 unique pairs by symmetry, whose values need to be determined); create an N x N matrix, where N is the number of microphones; set the elements on the diagonal to 0, since the distance between a microphone and itself is 0; fill the off-diagonal positions of the matrix with the corresponding distances, so that the element in the i-th row and j-th column is the Euclidean distance between the i-th and j-th microphones; once all off-diagonal elements are filled, the microphone array topology matrix is constructed.
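The construction above can be sketched in a few lines; the 8-microphone count matches the earlier example, while the 4 cm spacing is an assumption chosen purely for illustration:

```python
import numpy as np

def linear_array_topology(positions):
    """Build the N x N topology matrix: zeros on the diagonal,
    pairwise Euclidean distances on the off-diagonal positions."""
    p = np.asarray(positions, dtype=float).reshape(len(positions), -1)
    diff = p[:, None, :] - p[None, :, :]        # all pairwise offsets
    return np.sqrt((diff ** 2).sum(axis=-1))    # Euclidean distances

# 8 microphones spaced 4 cm apart along a line (illustrative spacing).
mics = [(0.04 * i, 0.0) for i in range(8)]
T = linear_array_topology(mics)
```

By construction the matrix is symmetric with a zero diagonal, matching the description in S221.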
More specifically, in S222, the microphone array topology matrix is passed through a topology feature extractor based on a convolutional neural network model to obtain the microphone array topology feature matrix. That is, the spatial topology association feature information about each microphone in the microphone array topology matrix is extracted in this way to obtain the microphone array topology feature matrix. Specifically, each layer of the convolutional-neural-network-based topology feature extractor processes its input data in the forward pass as follows: convolve the input data to obtain a convolution feature map; pool the convolution feature map along the channel dimension to obtain a pooled feature map; and apply a nonlinear activation to the pooled feature map to obtain an activation feature map. The output of the last layer of the extractor is the microphone array topology feature matrix, and the input of its first layer is the microphone array topology matrix.
Notably, a convolutional neural network (CNN) is a deep learning model mainly used to process data with a grid structure, such as images and speech. CNNs have enjoyed great success in the field of computer vision and excel at many tasks such as image classification, object detection, and image segmentation. The core idea of a CNN is to extract features of the input data through convolution and pooling layers and to perform classification or regression through fully connected layers. The general structure of a CNN is as follows. Convolution layer: the basic building block of a CNN; it extracts local features by convolving the input data with a set of learnable filters (also called convolution kernels), effectively capturing the spatial structure of the input, and typically contains multiple filters, each generating a corresponding feature map. Pooling layer: reduces the spatial dimensions of the feature map while preserving important features; common operations are maximum pooling and average pooling, which aggregate local regions of the feature map and provide a degree of translation and scale invariance. Activation function: introduces a nonlinear transformation that enables the CNN to learn more complex patterns and features; common choices include ReLU, Sigmoid, and tanh. Fully connected layer: connects the outputs of the preceding convolution and pooling layers and passes them as input to the output layer; it is typically used for the final classification or regression task. Dropout layer: Dropout is a regularization technique for reducing overfitting of the neural network.
During training, the Dropout layer randomly sets the output of a portion of neurons to zero, thereby reducing the dependency between neurons. The CNN gradually extracts advanced features of the data through the stacking of multiple convolution layers and pooling layers, and classifies or regresses through the fully connected layers. During training, the CNN uses a back-propagation algorithm to update parameters in the network to minimize the error between the predicted output and the real label.
It should be noted that, in other specific examples of the present application, the microphone array topology association analysis may be performed on the linear array microphone in other manners to obtain a microphone array topology feature matrix, for example: recording by using a linear array microphone to obtain a section of audio data containing a plurality of sound sources; the recording data is preprocessed, including noise removal, echo reduction, and the like. This helps to improve the accuracy of subsequent analysis; if there is a slight difference in time of the sound source signals in the recording data, it is necessary to align them. The data may be aligned using a time stamp or synchronization signal to ensure that each sound source signal corresponds in time to the same event; features are extracted from the aligned audio data. Common features include time domain features and frequency domain features. The time domain features may include amplitude, energy, zero-crossing rate, etc. The frequency domain features may include spectrograms, power spectral densities, and the like. These features may be calculated using signal processing techniques such as fourier transforms, short Time Fourier Transforms (STFT), etc.; correlation analysis is performed on the extracted features to determine the topological relationship between microphones. Common correlation analysis methods include cross-correlation and correlation matrix analysis. By calculating the correlation between the features, topology association information between microphones can be obtained; and constructing a topological feature matrix of the microphone array according to the result of the correlation analysis. The matrix describes the topological relation among microphones, and can be used for subsequent tasks such as sound source localization, sound source separation and the like; and analyzing and visualizing the topological feature matrix. 
The topological relationship between microphones can be observed using data analysis tools and graphs and their effect on sound source localization and separation can be further analyzed.
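The correlation-matrix analysis mentioned above can be sketched as follows. The synthetic "shared source plus per-channel noise" signals are an assumption made only so the example is self-contained and checkable:

```python
import numpy as np

def channel_correlation_matrix(signals):
    """Pearson correlation between every pair of microphone channels;
    large off-diagonal values indicate strongly coupled microphones."""
    return np.corrcoef(np.asarray(signals))

rng = np.random.default_rng(1)
base = rng.standard_normal(4096)               # shared sound source
noise = rng.standard_normal((4, 4096)) * 0.1   # independent channel noise
signals = base + noise                         # 4 highly correlated channels
R = channel_correlation_matrix(signals)        # 4 x 4 correlation matrix
```

In a real recording, thresholding or otherwise post-processing such a matrix would yield the topology association information between microphones.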
Specifically, in S23, performing association coding of a graph structure on the plurality of surrounding sound waveform feature vectors and the microphone array topology feature matrix to obtain a topology global sound waveform feature matrix as the global sound waveform feature. That is, the feature representation of each of the surrounding sound waveform feature vectors is taken as a node, the feature representation of the microphone array topology feature matrix is taken as a node-to-node edge feature representation, and the surrounding sound waveform feature matrix obtained by two-dimensionally arranging the plurality of surrounding sound waveform feature vectors and the microphone array topology feature matrix pass through a graph neural network model to obtain a topology global sound waveform feature matrix. Specifically, the graph neural network model performs graph structure data coding on the surrounding environment sound waveform feature matrix and the microphone array topology feature matrix through a learnable neural network parameter to obtain the topology global sound waveform feature matrix containing irregular microphone space topology association features and the various sound waveform feature information.
Notably, a graph neural network (GNN) is a deep learning model for processing graph-structured data. Unlike traditional neural networks, which mainly process vector or matrix data, a GNN can learn and infer over the nodes and edges of a graph, enabling analysis and prediction on graph-structured data. The principle of a GNN can be summarized in the following key steps. Graph representation: first, the graph-structured data is represented in a form a computer can handle; the connection relations of the graph are typically represented with an adjacency matrix or adjacency list, the former encoding the connections between nodes and the latter listing the neighbors of each node. Node representation learning: the core of a GNN is to learn a representation vector for each node that captures the node's own features and its relationships with its neighbors; this is achieved by iteratively updating the nodes' representation vectors. Information transfer: a GNN updates a node's representation vector by passing information through the graph; each node aggregates and combines its own representation vector with those of its neighboring nodes to obtain comprehensive neighborhood information, a process that can be implemented by graph convolution operations. Graph-level prediction: some tasks require a prediction for the entire graph rather than per node; to this end, the representation vectors of all nodes can be assembled into a graph-level representation vector via a graph pooling operation and then predicted through a fully connected layer or another method.
The key to a GNN is the message-passing process, which lets each node update its own representation vector through interactions with its neighbors and thereby fuse global and local information. This is what enables GNNs to perform effective feature learning and reasoning on graph-structured data.
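The graph-level prediction step described above can be sketched as follows; the mean-pooling readout and the single linear output layer are illustrative choices, not the application's specific design:

```python
import numpy as np

def graph_readout(H, W_out, b_out):
    """Graph-level prediction: mean-pool the node representation vectors
    into one graph-level vector, then apply a fully connected layer."""
    g = H.mean(axis=0)          # (D,) graph-level representation vector
    return W_out @ g + b_out    # task output, e.g. a regression target

rng = np.random.default_rng(2)
H = rng.standard_normal((4, 8))       # 4 node vectors after message passing
W_out = rng.standard_normal((2, 8))   # fully connected readout weights
y = graph_readout(H, W_out, np.zeros(2))
print(y.shape)                        # (2,)
```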
It should be noted that, in other specific examples of the present application, the plurality of ambient sound signals may also be analyzed in other ways to obtain global sound waveform features, for example: the collected sound signals are first preprocessed, which may include removing noise, reducing echo, and equalizing volume; such preprocessing helps improve the accuracy of the subsequent feature analysis. If the acquired sound signals differ slightly in time, they need to be aligned; timestamps or a synchronization signal may be used to align the data so that each sound signal corresponds in time to the same event. Features are then extracted from the aligned sound signals. Common sound features include time-domain and frequency-domain features: time-domain features describe how the sound signal varies over time, such as amplitude, energy, zero-crossing rate, and the time-domain envelope; frequency-domain features describe how the sound signal is distributed over frequency, such as spectrograms, power spectral densities, and spectral envelopes, and signal processing techniques such as the Fourier transform and the short-time Fourier transform (STFT) may be used to compute them. The features extracted from the individual sound signals are then aggregated, for example by computing statistics such as the mean, maximum, minimum, and variance of each feature. Global sound waveform features are extracted from the aggregated features, and may include global energy, global frequency distribution, and the duration of the sound. Finally, the extracted global sound waveform features are analyzed and visualized; data analysis tools and charts may be used to observe trends, changes, and associations among the sound features.
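A few of the features listed above (energy, zero-crossing rate, and a framed-FFT spectral centroid standing in for the full STFT pipeline) can be computed as in this sketch; the frame and hop sizes and the test tone are arbitrary illustrative choices:

```python
import numpy as np

def waveform_features(x, fs, frame=256, hop=128):
    """A few time- and frequency-domain features for one sound channel."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()                         # remove DC offset
    energy = float(np.sum(x ** 2))           # time-domain energy
    # Zero-crossing rate: fraction of adjacent samples changing sign.
    zcr = float(np.mean(np.abs(np.diff(np.sign(x))) > 0))
    # Magnitude short-time spectrum via Hann-windowed framed FFTs.
    win = np.hanning(frame)
    n_frames = 1 + (len(x) - frame) // hop
    frames = np.stack([x[i * hop:i * hop + frame] * win
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1))
    # Spectral centroid: magnitude-weighted mean frequency per frame,
    # averaged over frames.
    freqs = np.fft.rfftfreq(frame, d=1.0 / fs)
    centroid = float(np.mean(spec @ freqs / (spec.sum(axis=1) + 1e-12)))
    return {"energy": energy, "zcr": zcr, "spectral_centroid": centroid}

fs = 8000
t = np.arange(fs) / fs                        # one second of audio
feats = waveform_features(np.sin(2 * np.pi * 440 * t), fs)
print(feats["spectral_centroid"])             # near 440 for a pure tone
```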
In particular, in step S3, the direction coordinates of the recommended microphone array are determined based on the global sound waveform feature. In one specific example of the present application, S3 includes: passing the topology global sound waveform feature matrix through a decoder to obtain decoded values representing the direction coordinates of the recommended microphone array. That is, decoding regression is performed on the graph-structure association feature information between the topology association features of the microphone array and the waveform features of the individual sound signals to obtain the direction coordinates of the recommended microphone array. In this way, the optimal array direction can be computed automatically from the direction information of sounds in the environment, and an adjustment instruction can be sent to the microphone array so that the microphone array always faces the sound source, allowing the linear-array-microphone-based sound source localization system to better meet practical application requirements. Specifically, the topology global sound waveform feature matrix is decoded by regression using the decoder according to the following formula to obtain the decoded values representing the direction coordinates of the recommended microphone array: Y = W ⊗ X, wherein X represents the topology global sound waveform feature matrix, Y is the decoded value, W is the weight matrix, and ⊗ represents matrix multiplication.
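The decoding regression Y = W ⊗ X can be illustrated as a plain matrix-vector product over the flattened feature matrix; the shapes below (a 4×8 feature matrix decoded to two direction coordinates) are assumptions for illustration only:

```python
import numpy as np

def decode_direction(X, W):
    """Decoding regression Y = W (*) X: map the (flattened) topology
    global sound waveform feature matrix X to direction coordinates Y."""
    v = X.reshape(-1)   # expand the feature matrix into a column vector
    return W @ v        # W has shape (n_coords, N*D)

rng = np.random.default_rng(1)
X = rng.standard_normal((4, 8))    # topology global feature matrix
W = rng.standard_normal((2, 32))   # learnable decoder weight matrix
Y = decode_direction(X, W)         # e.g. (azimuth, elevation)
print(Y.shape)                     # (2,)
```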
Notably, decoding regression refers to mapping the representation vectors or features output by a neural network back to the original data domain, thereby enabling regression prediction on the original data. In a decoding regression task, the neural network typically implements the mapping from representation vectors to the original data by learning a decoder. Decoding regression can be applied in a variety of fields, such as image generation, speech synthesis, and sequence generation.
It should be noted that, in other specific examples of the present application, the direction coordinates of the recommended microphone array may also be determined from the global sound waveform feature in other ways, for example: first, sound data is collected, i.e., sounds in the environment are recorded using one or more microphones and a global sound waveform is acquired; features are then extracted from the global sound waveform, where various signal processing techniques and feature extraction algorithms, such as the short-time Fourier transform, cepstral coefficients, and power spectral densities, may be used to obtain the spectral features of the sound signal; the direction of the sound source is then estimated from the extracted spectral features. A common approach is to determine the direction of the sound source by beamforming: the individual microphone signals in the array are weighted and phase-adjusted so that sound from a particular direction is enhanced while sound from other directions is suppressed, and the direction coordinates of the sound source can be estimated from the beamforming result. Finally, the recommended microphone array direction coordinates are determined from the direction estimate; an appropriate microphone array orientation may be selected according to the direction coordinates of the sound source to maximize capture of the sound source.
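The beamforming alternative described above can be sketched as a frequency-domain delay-and-sum scan over candidate angles; the far-field plane-wave model, the 4-microphone 5 cm-pitch geometry, and the sampling rate are illustrative assumptions:

```python
import numpy as np

def delay_and_sum(signals, mic_x, fs, c=343.0, angles=None):
    """Scan a frequency-domain delay-and-sum beamformer over candidate
    angles and return the angle (degrees, 0 = broadside) of max power.

    signals : (M, T) array, one row per microphone
    mic_x   : (M,) microphone positions along the array axis (metres)
    """
    if angles is None:
        angles = np.linspace(-90.0, 90.0, 181)   # 1-degree grid
    T = signals.shape[1]
    freqs = np.fft.rfftfreq(T, d=1.0 / fs)
    spectra = np.fft.rfft(signals, axis=1)
    powers = []
    for a in angles:
        # Far-field plane-wave delay per microphone for steering angle a.
        tau = mic_x * np.sin(np.deg2rad(a)) / c
        steer = np.exp(2j * np.pi * freqs[None, :] * tau[:, None])
        beam = np.sum(spectra * steer, axis=0)   # align, then sum channels
        powers.append(np.sum(np.abs(beam) ** 2))
    return float(angles[int(np.argmax(powers))])

# Simulate a broadband source arriving from 30 degrees at a 4-mic,
# 5 cm pitch linear array, using exact fractional delays in frequency.
rng = np.random.default_rng(3)
fs, T, c = 8000, 2048, 343.0
mic_x = np.array([0.0, 0.05, 0.10, 0.15])
S = np.fft.rfft(rng.standard_normal(T))
freqs = np.fft.rfftfreq(T, d=1.0 / fs)
tau = mic_x * np.sin(np.deg2rad(30.0)) / c
signals = np.fft.irfft(S[None, :] * np.exp(-2j * np.pi * freqs * tau[:, None]), n=T)
est = delay_and_sum(signals, mic_x, fs)
print(est)   # 30.0
```

Steering to the true angle aligns the per-microphone phases at every frequency, so the coherent sum peaks exactly there; this is the "enhance one direction, suppress the others" behavior the paragraph describes.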
It will be appreciated that the two-dimensional-convolution-layer-based sound waveform feature extractor, the convolutional-neural-network-model-based topology feature extractor, and the decoder need to be trained before inference with the neural network models described above can be performed. That is, the linear-array-microphone-based sound source localization method of the present application further includes a training stage for training the two-dimensional-convolution-layer-based sound waveform feature extractor, the convolutional-neural-network-model-based topology feature extractor, and the decoder.
Fig. 4 is a flowchart of the training stage of the linear-array-microphone-based sound source localization method according to an embodiment of the present application. As shown in Fig. 4, the method includes a training stage comprising: S110, acquiring training data, wherein the training data comprises a plurality of training ambient sound signals acquired by the linear array microphone and true values of the direction coordinates of the recommended microphone array; S120, passing the plurality of training ambient sound signals through the two-dimensional-convolution-layer-based sound waveform feature extractor to obtain a plurality of training ambient sound waveform feature vectors; S130, constructing a training microphone array topology matrix of the linear array microphone, wherein the value at each off-diagonal position in the training microphone array topology matrix is the Euclidean distance between the two corresponding microphones; S140, passing the training microphone array topology matrix through the convolutional-neural-network-model-based topology feature extractor to obtain a training microphone array topology feature matrix; S150, passing the plurality of training ambient sound waveform feature vectors and the training microphone array topology feature matrix through the graph neural network model to obtain a training topology global sound waveform feature matrix; S160, passing the training topology global sound waveform feature matrix through the decoder to obtain a decoding loss function value; and S170, training the two-dimensional-convolution-layer-based sound waveform feature extractor, the convolutional-neural-network-model-based topology feature extractor, and the decoder based on the decoding loss function value by back propagation with gradient descent, wherein, in each training iteration, an external boundary constraint iteration based on a benchmark annotation is applied to the weight matrix of the decoder.
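Two pieces of the training stage admit compact sketches: the construction of the training microphone array topology matrix from Euclidean distances (S130), and a plain squared-error gradient step for the decoder weights, a simplification that omits the benchmark-annotation boundary constraint of S170; positions, learning rate, and shapes are illustrative assumptions:

```python
import numpy as np

def topology_matrix(mic_x):
    """Training microphone-array topology matrix (as in S130): the value
    at each off-diagonal position (i, j) is the Euclidean distance
    between microphones i and j of the linear array."""
    mic_x = np.asarray(mic_x, dtype=float)
    return np.abs(mic_x[:, None] - mic_x[None, :])

def decoder_sgd_step(W, x, y_true, lr=0.01):
    """One gradient-descent step for a linear decoder y = W @ x under a
    squared-error decoding loss (the external boundary constraint on the
    weight matrix is omitted in this simplification)."""
    err = W @ x - y_true
    return W - lr * np.outer(err, x)   # dL/dW for L = 0.5*||err||^2

A = topology_matrix([0.0, 0.05, 0.10, 0.15])   # 4 mics, 5 cm pitch
print(A[0, 3])   # 0.15 -- distance between the first and last microphone

rng = np.random.default_rng(4)
x = rng.standard_normal(16)                    # flattened feature matrix
y_true = np.array([0.3, -0.2])                 # true direction coordinates
W = np.zeros((2, 16))
for _ in range(200):                           # loss shrinks geometrically
    W = decoder_sgd_step(W, x, y_true)
```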
In particular, in the technical scheme of the present application, when the image semantic features of the signal waveforms of the plurality of ambient sound signals are extracted by the two-dimensional-convolution-layer-based sound waveform feature extractor, the locally associated image semantic feature distributions of the source waveform images provide the image-semantic fitting direction for the fitting of the decoder's weight matrix during training. However, when the plurality of ambient sound waveform feature vectors and the microphone array topology feature matrix are passed through the graph neural network model to obtain the topology global sound waveform feature matrix, the topological association imposed by the inter-microphone feature-distance topology causes the feature distribution of the resulting topology global sound waveform feature matrix to deviate, in its expression on the image semantic feature domain, from the locally spatial image-semantic association features of the source images. In the decoding scenario, this deviation introduces an inter-domain offset in the regression probability mapping of the topology global sound waveform feature matrix during the iteration of the decoder's weight matrix, which affects the accuracy of the training result obtained by passing the topology global sound waveform feature matrix through the decoder.
Based on the above, during training of the decoder, the applicant of the present application applies an external boundary constraint to the weight matrix, based on a benchmark annotation, on the topology global sound waveform feature vector obtained by expanding the topology global sound waveform feature matrix, which is specifically expressed as:
M1 and M2 are the weight matrices of the previous iteration and the current iteration, respectively, where, at the first iteration, M1 and M2 are set using different initialization strategies (e.g., M1 is set to the identity matrix and M2 to the mean diagonal matrix of the feature vector to be decoded), and Vc is the topology global sound waveform feature vector in column-vector form. Here, the iterative association representation of the topology global sound waveform feature vector Vc in the weight space is used as an external association boundary constraint on the weight-matrix iteration, so that, with the previous weight matrix serving as the benchmark annotation during iteration, the oriented mismatch of the weight matrix relative to the topology global sound waveform feature vector Vc in the weight-space iteration is reduced by using Vc as an anchor point. This compensates for the inter-domain offset of the regression probability mapping of Vc during iteration, strengthens the image-semantic fitting aggregation of the weight matrix with respect to Vc, and improves the accuracy of the decoded values obtained by the trained model from the topology global sound waveform feature matrix. In this way, the direction of the microphone array can be adjusted automatically based on the direction information of sounds in the environment so that the array always faces the sound source, allowing the sound source localization system to better meet practical application requirements and thereby optimizing the sound source localization effect.
In summary, the linear-array-microphone-based sound source localization method according to the embodiments of the present application has been described; it monitors and analyzes the sound signals in the environment in real time, automatically computes the optimal array direction from the direction information of the sounds in the environment, and sends an adjustment instruction to the microphone array so that the array always faces the sound source. The linear-array-microphone-based sound source localization system can thereby better meet practical application requirements, optimizing the sound source localization effect.
Further, a sound source localization system based on a linear array microphone is also provided.
Fig. 1 is a block diagram of the linear-array-microphone-based sound source localization system according to an embodiment of the present application. As shown in Fig. 1, a linear-array-microphone-based sound source localization system 300 according to an embodiment of the present application includes: an acquisition board 310, composed of a linearly arranged microphone array, for acquiring sound signals from the surrounding environment; and a main control board 320, communicatively connected to the acquisition board through a PIN wire connector, for receiving and analyzing the sound signals acquired by the acquisition board and for controlling and adjusting the direction of the microphone array.
As described above, the linear array microphone-based sound source localization system 300 according to the embodiment of the present application may be implemented in various wireless terminals, such as a server or the like having a linear array microphone-based sound source localization algorithm. In one possible implementation, the linear array microphone based sound source localization system 300 according to embodiments of the present application may be integrated into a wireless terminal as one software module and/or hardware module. For example, the linear array microphone based sound source localization system 300 may be a software module in the operating system of the wireless terminal or may be an application developed for the wireless terminal; of course, the linear array microphone based sound source localization system 300 could equally be one of many hardware modules of the wireless terminal.
Alternatively, in another example, the linear array microphone based sound source localization system 300 and the wireless terminal may be separate devices, and the linear array microphone based sound source localization system 300 may be connected to the wireless terminal through a wired and/or wireless network and exchange interactive information in an agreed data format.
The foregoing description of the embodiments of the present disclosure has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the improvement of technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (5)

1. A method for locating a sound source based on a linear array microphone, comprising:
acquiring a plurality of ambient sound signals acquired by a linear array microphone;
performing feature analysis on the plurality of ambient sound signals to obtain global sound waveform features; and
determining a direction coordinate of a recommended microphone array based on the global sound waveform feature;
wherein the method further comprises a training step for training a sound waveform feature extractor based on a two-dimensional convolution layer, a topology feature extractor based on a convolutional neural network model, and a decoder;
wherein the training step comprises:
acquiring training data, wherein the training data comprises a plurality of training ambient sound signals acquired by a linear array microphone and a true value of a direction coordinate of the recommended microphone array;
passing the plurality of training ambient sound signals through the two-dimensional convolution layer based sound waveform feature extractor to obtain a plurality of training ambient sound waveform feature vectors;
constructing a training microphone array topology matrix of the linear array microphone, wherein the value at each off-diagonal position in the training microphone array topology matrix is the Euclidean distance between the two corresponding microphones;
passing the training microphone array topology matrix through the topological feature extractor based on the convolutional neural network model to obtain a training microphone array topological feature matrix;
passing the plurality of training ambient sound waveform feature vectors and the training microphone array topology feature matrix through a graph neural network model to obtain a training topology global sound waveform feature matrix;
passing the training topology global sound waveform feature matrix through the decoder to obtain a decoding loss function value;
training the two-dimensional convolutional layer based sound waveform feature extractor, the convolutional neural network model based topology feature extractor, and the decoder based on the decoding loss function value by back propagation with gradient descent, wherein, in each iteration of the training, an external boundary constraint iteration based on a benchmark annotation is performed on the weight matrix of the decoder;
wherein, in each iteration of the training, the external boundary constraint iteration based on the benchmark annotation is performed on the weight matrix of the decoder according to the following optimization formula;
wherein, the optimization formula is:
wherein M1 and M2 are the weight matrices of the previous iteration and the current iteration, respectively, Vc is the training topology global sound waveform feature vector obtained by expanding the training topology global sound waveform feature matrix, Vc being in the form of a column vector, ⊗ represents matrix multiplication, ⊕ represents matrix addition, and M2′ represents the weight matrix of the decoder after the iteration.
2. The method of claim 1, wherein performing a feature analysis on the plurality of ambient sound signals to obtain global sound waveform features comprises:
extracting waveform characteristics of the plurality of surrounding environment sound signals through a sound waveform characteristic extractor based on a two-dimensional convolution layer so as to obtain a plurality of surrounding environment sound waveform characteristic vectors;
performing microphone array topology association analysis on the linear array microphone to obtain a microphone array topology feature matrix; and
and carrying out association coding of a graph structure on the plurality of surrounding environment sound waveform feature vectors and the microphone array topology feature matrix to obtain a topology global sound waveform feature matrix as the global sound waveform feature.
3. The method for linear array microphone based sound source localization of claim 2, wherein performing a microphone array topology correlation analysis on the linear array microphone to obtain a microphone array topology feature matrix comprises:
constructing a microphone array topology matrix of the linear array microphone, wherein the value at each off-diagonal position in the microphone array topology matrix is the Euclidean distance between the two corresponding microphones; and
and passing the microphone array topological matrix through a topological feature extractor based on a convolutional neural network model to obtain the microphone array topological feature matrix.
4. A method of linear array microphone based sound source localization as claimed in claim 3, wherein performing association coding of a graph structure on the plurality of ambient sound waveform feature vectors and the microphone array topology feature matrix to obtain a topology global sound waveform feature matrix as the global sound waveform feature comprises: passing the plurality of ambient sound waveform feature vectors and the microphone array topology feature matrix through a graph neural network model to obtain the topology global sound waveform feature matrix.
5. The method of claim 4, wherein determining the directional coordinates of the recommended microphone array based on the global sound waveform characteristics comprises: the topological global sound waveform feature matrix is passed through a decoder to obtain decoded values representing the directional coordinates of the recommended microphone array.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311051156.8A CN117054968B (en) 2023-08-19 2023-08-19 Sound source positioning system and method based on linear array microphone


Publications (2)

Publication Number Publication Date
CN117054968A CN117054968A (en) 2023-11-14
CN117054968B true CN117054968B (en) 2024-03-12

Family

ID=88664122


Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104093094A (en) * 2014-06-16 2014-10-08 华南理工大学 Indoor voice acquisition method and device based on self-adaptive rotating alignment
CN107202976A (en) * 2017-05-15 2017-09-26 大连理工大学 The distributed microphone array sound source localization system of low complex degree
CN109286875A (en) * 2018-09-29 2019-01-29 百度在线网络技术(北京)有限公司 For orienting method, apparatus, electronic equipment and the storage medium of pickup
CN112185406A (en) * 2020-09-18 2021-01-05 北京大米科技有限公司 Sound processing method, sound processing device, electronic equipment and readable storage medium
CN112463103A (en) * 2019-09-06 2021-03-09 北京声智科技有限公司 Sound pickup method, sound pickup device, electronic device and storage medium
CN113196291A (en) * 2019-01-23 2021-07-30 动态Ad有限责任公司 Automatic selection of data samples for annotation
CN114611795A (en) * 2022-03-14 2022-06-10 杭州清淮科技有限公司 Water level linkage-based water level prediction method and system
CN115144312A (en) * 2022-06-29 2022-10-04 杭州里莹网络科技有限公司 Indoor air fine particle measuring method and system based on Internet of things
CN115169463A (en) * 2022-07-11 2022-10-11 杭州里莹网络科技有限公司 Indoor air quality monitoring system based on Internet of things and monitoring method thereof
CN115690908A (en) * 2022-10-28 2023-02-03 中国科学院上海微系统与信息技术研究所 Three-dimensional gesture attitude estimation method based on topology perception
CN115963229A (en) * 2023-01-09 2023-04-14 吉安创成环保科技有限责任公司 Gas monitoring system and monitoring method thereof
CN115982573A (en) * 2023-03-20 2023-04-18 东莞市杰达机械有限公司 Multifunctional feeder and control method thereof

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112733937A (en) * 2021-01-11 2021-04-30 西安电子科技大学 Credible graph data node classification method, system, computer equipment and application




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant