CN117054968B - Sound source positioning system and method based on linear array microphone


Info

Publication number
CN117054968B
Authority
CN
China
Prior art keywords
microphone
matrix
training
feature
topology
Prior art date
Legal status
Active
Application number
CN202311051156.8A
Other languages
Chinese (zh)
Other versions
CN117054968A (en)
Inventor
黄术
黄琪敏
魏祥
夏航剑
Current Assignee
Hangzhou Youhang Information Technology Co ltd
Original Assignee
Hangzhou Youhang Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Hangzhou Youhang Information Technology Co ltd
Priority to CN202311051156.8A
Publication of CN117054968A
Application granted
Publication of CN117054968B


Classifications

    • GPHYSICS
    • G01MEASURING; TESTING
    • G01SRADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S5/00Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations
    • G01S5/18Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations using ultrasonic, sonic, or infrasonic waves
    • G01S5/20Position of source determined by a plurality of spaced direction-finders

Abstract

The application discloses a sound source localization system and method based on a linear array microphone. The system monitors and analyzes sound signals in the environment in real time, automatically calculates the optimal array direction from the direction information of the sound in the environment, and sends adjustment instructions to the microphone array so that the array always faces the sound source. The sound source localization system based on the linear array microphone can thus better adapt to practical application requirements, optimizing the sound source localization effect.

Description

Sound source positioning system and method based on linear array microphone
Technical Field
The present application relates to the field of intelligent localization, and more particularly, to a sound source localization system based on linear array microphones and a method thereof.
Background
Sound source localization refers to determining the position and direction of a sound source by analyzing sound signals in an environment. It has wide application in many fields such as conference recording, speech recognition, smart home, etc. A linear array microphone is a commonly used microphone array configuration that arranges a plurality of microphones in a straight line. By utilizing the time delay differences and amplitude differences between microphones, sound source localization can be achieved.
However, conventional linear array microphone parameter and configuration adjustments are typically based on static environmental information and cannot adapt to dynamic changes in the environment in real time. For example, when the noise level, sound absorption characteristics, or sound source position in the environment changes, the system cannot make corresponding adjustments in time, resulting in a decrease in positioning accuracy.
Accordingly, an optimized linear array microphone based sound source localization system is desired.
Disclosure of Invention
The present application has been made in order to solve the above technical problems. The embodiments of the application provide a sound source localization system and method based on linear array microphones that monitor and analyze sound signals in the environment in real time, automatically calculate the optimal array direction from the direction information of the sound in the environment, and send adjustment instructions to the microphone array so that the array always faces the sound source. The sound source localization system based on the linear array microphone can thus better adapt to practical application requirements, optimizing the sound source localization effect.
According to one aspect of the present application, there is provided a sound source localization method based on a linear array microphone, including:
acquiring a plurality of ambient sound signals acquired by a linear array microphone;
performing feature analysis on the plurality of ambient sound signals to obtain global sound waveform features; and
based on the global sound waveform characteristics, direction coordinates of a recommended microphone array are determined.
According to another aspect of the present application, there is provided a linear array microphone-based sound source localization system, including:
the acquisition board consists of a microphone array which is linearly arranged and is used for acquiring sound signals of surrounding environment;
the main control board is communicatively connected to the acquisition board through a PIN wire connector, and is used for receiving and analyzing the sound signals acquired by the acquisition board and for controlling and adjusting the direction of the microphone array.
Compared with the prior art, the sound source localization system and method based on the linear array microphone monitor and analyze sound signals in the environment in real time, automatically calculate the optimal array direction from the direction information of the sound in the environment, and send adjustment instructions to the microphone array so that the array always faces the sound source. The system can thus better adapt to practical application requirements, optimizing the sound source localization effect.
Drawings
The foregoing and other objects, features and advantages of the present application will become more apparent from the following more particular description of embodiments of the present application, as illustrated in the accompanying drawings. The accompanying drawings are included to provide a further understanding of embodiments of the application and are incorporated in and constitute a part of this specification, illustrate the application and not constitute a limitation to the application. In the drawings, like reference numerals generally refer to like parts or steps.
Fig. 1 is a block diagram of a linear array microphone-based sound source localization system according to an embodiment of the present application;
fig. 2 is a flow chart of a method of linear array microphone based sound source localization in accordance with an embodiment of the present application;
fig. 3 is a system architecture diagram of a linear array microphone-based sound source localization method according to an embodiment of the present application;
fig. 4 is a flowchart of a training phase of a linear array microphone-based sound source localization method according to an embodiment of the present application;
fig. 5 is a flowchart of substep S2 of a linear array microphone based sound source localization method according to an embodiment of the present application;
fig. 6 is a flowchart of substep S22 of a linear array microphone based sound source localization method according to an embodiment of the present application.
Detailed Description
Hereinafter, example embodiments according to the present application will be described in detail with reference to the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application and not all of the embodiments of the present application, and it should be understood that the present application is not limited by the example embodiments described herein.
As used in this application and in the claims, the terms "a," "an," "the," and/or "said" are not specific to the singular but may include the plural, unless the context clearly dictates otherwise. In general, the terms "comprises" and "comprising" merely indicate the inclusion of the steps and elements that are explicitly identified; they do not constitute an exclusive list, and a method or apparatus may include other steps or elements.
Although the present application makes various references to certain modules in a system according to embodiments of the present application, any number of different modules may be used and run on a user terminal and/or server. The modules are merely illustrative, and different aspects of the systems and methods may use different modules.
Flowcharts are used in this application to describe the operations performed by systems according to embodiments of the present application. It should be understood that the preceding or following operations are not necessarily performed in order precisely. Rather, the various steps may be processed in reverse order or simultaneously, as desired. Also, other operations may be added to or removed from these processes.
The parameters and configuration adjustment of the traditional linear array microphone is usually based on static environment information, and cannot adapt to the dynamic change of the environment in real time. For example, when the noise level, sound absorption characteristics, or sound source position in the environment changes, the system cannot make corresponding adjustments in time, resulting in a decrease in positioning accuracy. Accordingly, an optimized linear array microphone based sound source localization system is desired.
In the technical scheme of the application, a sound source localization method based on a linear array microphone is provided. Fig. 2 is a flowchart of a method of linear array microphone based sound source localization in accordance with an embodiment of the present application. Fig. 3 is a system architecture diagram of a linear array microphone-based sound source localization method according to an embodiment of the present application. As shown in fig. 2 and 3, the sound source localization method based on the linear array microphone according to the embodiment of the present application includes the steps of: S1, acquiring a plurality of surrounding environment sound signals acquired by a linear array microphone; S2, performing feature analysis on the plurality of surrounding environment sound signals to obtain global sound waveform features; and S3, determining the direction coordinates of the recommended microphone array based on the global sound waveform features.
In particular, in step S1, a plurality of ambient sound signals collected by the linear array microphone are acquired. Accordingly, in one specific example of the present application, for a linear array comprising 8 microphones, it may be possible to collect 8 ambient sound signals simultaneously. By analysing these signals, it is possible to determine the location information of the sound sources and to distinguish between different sound sources, or to separate the mixed sound signals into individual sound sources, thereby optimizing the sound source localization effect. It should be noted that a linear array microphone is a microphone arrangement in which a plurality of microphones are placed together in a linear arrangement so as to collect signals of a plurality of sound sources at the same time. Its principle is based on the relative positional relationship between the propagation of sound waves in space and microphones. When sound waves propagate to the linear array microphone, there is a slight difference in arrival time of the sound waves at the different microphones due to the difference in distance between the microphones. This difference can be used to estimate the direction and position of the sound source.
Accordingly, in one possible implementation, a plurality of ambient sound signals acquired by the linear array microphone may be acquired by the following steps:
1. Obtain a set of linearly arranged microphones, typically in a uniformly spaced arrangement, ensuring that the position of each microphone is fixed and that no physical obstacle lies between the microphones and the sound sources to be collected.
2. Calibrate the microphones so that the gain and delay of each microphone are consistent. This can be accomplished by playing known sound sources and measuring the response of each microphone with a calibration tool (e.g., a sound card or a dedicated device).
3. Collect sound signals of the surrounding environment with the microphone array: connect the array to a computer or audio interface using appropriate hardware and software configurations, and set parameters such as sampling rate and bit depth.
4. Ensure that all microphones in the array acquire data at the same sampling rate with synchronized clocks; this may be achieved by hardware or software synchronization.
5. Process the collected sound signals. Digital signal processing (DSP) techniques may be used to filter out noise, reduce echo, and so on; beamforming techniques may also be applied to enhance sound from a particular direction.
6. Analyze the signal of each microphone channel. Useful information can be extracted using time-domain analysis, frequency-domain analysis, sound source localization, and similar techniques.
7. If the signals of multiple microphone channels need to be synchronized, time stamps or synchronization signals may be used to align the data.
8. Further process the analysis results as required, e.g., sound enhancement, sound source separation, or speech recognition.
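As an illustrative sketch of the delay-difference principle underlying the array (not the patent's claimed method), the time-delay of arrival between two microphone channels can be estimated from the peak of their cross-correlation. All names, the 16 kHz sampling rate, and the synthetic pulse signals below are assumptions made only so the example is self-contained:

```python
import numpy as np

def estimate_tdoa(sig_a, sig_b, fs):
    """Estimate the time-delay of arrival of channel A relative to
    channel B from the peak of their full cross-correlation."""
    corr = np.correlate(sig_a, sig_b, mode="full")
    lag = int(np.argmax(corr)) - (len(sig_b) - 1)  # lag in samples
    return lag / fs                                # delay in seconds

# Synthetic check: the same pulse reaches mic B 25 samples after mic A.
fs = 16_000
pulse = np.hanning(64)
sig_a = np.zeros(1024)
sig_a[100:164] = pulse
sig_b = np.zeros(1024)
sig_b[125:189] = pulse
tdoa = estimate_tdoa(sig_a, sig_b, fs)  # negative: A hears it first
```

With the sound speed and the known microphone spacing, such a delay converts directly into an angle-of-arrival estimate.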
In particular, in step S2, a feature analysis is performed on the plurality of ambient sound signals to obtain global sound waveform features. In particular, in one specific example of the present application, as shown in fig. 5, the S2 includes: s21, extracting waveform characteristics of the plurality of surrounding environment sound signals through a sound waveform characteristic extractor based on a two-dimensional convolution layer so as to obtain a plurality of surrounding environment sound waveform characteristic vectors; s22, carrying out microphone array topology association analysis on the linear array microphone to obtain a microphone array topology feature matrix; and S23, performing association coding of graph structures on the plurality of surrounding environment sound waveform feature vectors and the microphone array topology feature matrix to obtain a topology global sound waveform feature matrix as the global sound waveform feature.
Specifically, in S21, waveform features of the plurality of ambient sound signals are extracted by a sound waveform feature extractor based on a two-dimensional convolution layer, so as to obtain a plurality of ambient sound waveform feature vectors. Since each ambient sound signal is represented in the time domain as a waveform diagram, the two-dimensional-convolution-based sound waveform feature extractor, which excels at implicit feature extraction from images, is used to mine each ambient sound signal and extract the feature distribution information of the ambient sound waveform it contains, thereby obtaining the plurality of ambient sound waveform feature vectors. Specifically, each layer of the two-dimensional-convolution-based sound waveform feature extractor processes its input data in the forward pass as follows: convolve the input data to obtain a convolution feature map; perform feature-matrix-wise global mean pooling on the convolution feature map to obtain a pooled feature map; and apply a nonlinear activation to the pooled feature map to obtain an activation feature map. The output of the last layer of the extractor is the plurality of ambient sound waveform feature vectors, and the input of its first layer is the plurality of ambient sound signals.
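The per-layer forward pass just described (convolution, global mean pooling over each feature map, then a nonlinear activation) can be sketched in plain NumPy. This is a minimal illustration under assumed shapes and kernel sizes, not the patent's implementation:

```python
import numpy as np

def conv2d_valid(x, k):
    """Naive 'valid' 2-D convolution (cross-correlation, as in deep
    learning frameworks) of one single-channel input with one kernel."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def waveform_layer(x, kernels):
    """One layer as described: convolution, global mean pooling per
    feature map, ReLU activation -> one feature vector."""
    maps = [conv2d_valid(x, k) for k in kernels]
    pooled = np.array([m.mean() for m in maps])  # global mean pooling
    return np.maximum(pooled, 0.0)               # nonlinear activation

rng = np.random.default_rng(0)
waveform_img = rng.standard_normal((32, 32))       # stand-in waveform image
kernels = [rng.standard_normal((3, 3)) for _ in range(4)]
vec = waveform_layer(waveform_img, kernels)        # 4-dim feature vector
```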
It is noted that a two-dimensional convolution layer is a neural network layer commonly used in deep learning for processing two-dimensional data such as images; it performs feature extraction and representation learning on input data through convolution operations. The coding process of the two-dimensional convolution layer is as follows. Input and convolution kernels: the layer accepts a two-dimensional input tensor, typically of shape (width, height, channels), and contains a set of convolution kernels (also called filters), each a small two-dimensional weight matrix. Convolution operation: this is the core operation of the layer; each convolution kernel is multiplied element-wise with a local region of the input data, the products are summed to yield one output element, and sliding the kernel over the input produces the full output feature map. Padding and stride: padding (zero or otherwise) may be added around the input to control the size of the output feature map, and the sliding stride of the kernel can be specified to control its spatial resolution. Activation function: after the convolution, an activation function such as ReLU, Sigmoid, or tanh is typically applied to introduce nonlinearity and increase the expressive capacity of the network. Parameter sharing: the weights of each convolution kernel are shared, i.e., the same weights are used over the whole input, which greatly reduces the number of network parameters and improves the efficiency and generalization ability of the model. Pooling layer: after the convolution layer, a pooling layer is typically added to reduce the size of the feature map and extract more robust features; common operations are maximum pooling and average pooling, which output the maximum or average value within the pooling window, respectively. Multiple channels: the convolution layer can process input data with multiple channels (e.g., an RGB image with three channels); in the convolution operation, each kernel convolves with the corresponding channels of the input and generates an output feature map.
Specifically, in S22, a microphone array topology association analysis is performed on the linear array microphone to obtain a microphone array topology feature matrix. In particular, in one specific example of the present application, as shown in fig. 6, the S22 includes: s221, constructing a microphone array topology matrix of the linear array microphone, wherein the characteristic value of each position on the non-diagonal position in the microphone array topology matrix is the Euclidean distance between the two corresponding microphones; and S222, passing the microphone array topology matrix through a topology feature extractor based on a convolutional neural network model to obtain the microphone array topology feature matrix.
More specifically, the S221 constructs a microphone array topology matrix of the linear array microphone, where a characteristic value of each position on the non-diagonal position in the microphone array topology matrix is a euclidean distance between the two corresponding microphones. Because the linear array microphones are arranged according to a preset topological pattern, each microphone in the linear array microphones has a topological association relation in space. Thus, there is also implicit associated characteristic distribution information between the waveform characteristic information of the respective ambient sound signals. Based on this, in order to analyze the ambient sound signals collected by the linear microphone more fully and accurately, so as to improve the accuracy of sound localization, in the technical solution of the present application, a microphone array topology matrix of the linear array microphone is further constructed, where the feature value of each position on the non-diagonal position in the microphone array topology matrix is the euclidean distance between the two corresponding microphones. It is worth mentioning that euclidean distance is a measure of the distance between two points calculated in euclidean space. It is one of the most common and intuitive distance measurement methods, and is also widely used in data analysis and pattern recognition tasks in various fields. Euclidean distance is widely used in many fields, such as clustering and classification algorithms in machine learning, feature matching in image processing, distance measurement in spatial analysis, and the like. It provides an intuitive measure of similarity or variability between data points.
Accordingly, in one possible implementation, the microphone array topology matrix of the linear array microphone, in which the feature value at each off-diagonal position is the Euclidean distance between the two corresponding microphones, may be constructed by the following steps: determine the linear arrangement of the microphones, for example in left-to-right order; measure or calculate the Euclidean distance between each pair of microphones according to the actual setup or design requirements (for N microphones, there are N(N-1) off-diagonal positions, or N(N-1)/2 unique pairs by symmetry, whose values need to be determined); create an N x N matrix, where N is the number of microphones; set the elements on the diagonal to 0, since the distance between a microphone and itself is 0; fill the off-diagonal positions of the matrix with the corresponding distances, so that the element in the i-th row and j-th column is the Euclidean distance between the i-th and j-th microphones; once all off-diagonal elements are filled, the microphone array topology matrix is constructed.
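The construction above can be sketched in a few lines; the 8-microphone count matches the earlier example, while the 4 cm spacing is an assumption chosen purely for illustration:

```python
import numpy as np

def linear_array_topology(positions):
    """Build the N x N topology matrix: zeros on the diagonal,
    pairwise Euclidean distances on the off-diagonal positions."""
    p = np.asarray(positions, dtype=float).reshape(len(positions), -1)
    diff = p[:, None, :] - p[None, :, :]        # all pairwise offsets
    return np.sqrt((diff ** 2).sum(axis=-1))    # Euclidean distances

# 8 microphones spaced 4 cm apart along a line (illustrative spacing).
mics = [(0.04 * i, 0.0) for i in range(8)]
T = linear_array_topology(mics)
```

By construction the matrix is symmetric with a zero diagonal, matching the description in S221.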
More specifically, in S222, the microphone array topology matrix is passed through a topology feature extractor based on a convolutional neural network model to obtain the microphone array topology feature matrix. That is, the spatial topology association feature information about each microphone in the microphone array topology matrix is extracted in this way to obtain the microphone array topology feature matrix. Specifically, each layer of the convolutional-neural-network-based topology feature extractor processes its input data in the forward pass as follows: convolve the input data to obtain a convolution feature map; pool the convolution feature map along the channel dimension to obtain a pooled feature map; and apply a nonlinear activation to the pooled feature map to obtain an activation feature map. The output of the last layer of the extractor is the microphone array topology feature matrix, and the input of its first layer is the microphone array topology matrix.
Notably, a convolutional neural network (CNN) is a deep learning model mainly used to process data with a grid structure, such as images and speech. CNNs have enjoyed great success in the field of computer vision and excel at many tasks such as image classification, object detection, and image segmentation. The core idea of a CNN is to extract features of the input data through convolution and pooling layers and to perform classification or regression through fully connected layers. The general structure of a CNN is as follows. Convolution layer: the basic building block of a CNN; it extracts local features by convolving the input data with a set of learnable filters (also called convolution kernels), effectively capturing the spatial structure of the input, and typically contains multiple filters, each generating a corresponding feature map. Pooling layer: reduces the spatial dimensions of the feature map while preserving important features; common operations are maximum pooling and average pooling, which aggregate local regions of the feature map and provide a degree of translation and scale invariance. Activation function: introduces a nonlinear transformation that enables the CNN to learn more complex patterns and features; common choices include ReLU, Sigmoid, and tanh. Fully connected layer: connects the outputs of the preceding convolution and pooling layers and passes them as input to the output layer; it is typically used for the final classification or regression task. Dropout layer: Dropout is a regularization technique for reducing overfitting of the neural network.
During training, the Dropout layer randomly sets the output of a portion of neurons to zero, thereby reducing the dependency between neurons. The CNN gradually extracts advanced features of the data through the stacking of multiple convolution layers and pooling layers, and classifies or regresses through the fully connected layers. During training, the CNN uses a back-propagation algorithm to update parameters in the network to minimize the error between the predicted output and the real label.
It should be noted that, in other specific examples of the present application, the microphone array topology association analysis may be performed on the linear array microphone in other manners to obtain a microphone array topology feature matrix, for example: recording by using a linear array microphone to obtain a section of audio data containing a plurality of sound sources; the recording data is preprocessed, including noise removal, echo reduction, and the like. This helps to improve the accuracy of subsequent analysis; if there is a slight difference in time of the sound source signals in the recording data, it is necessary to align them. The data may be aligned using a time stamp or synchronization signal to ensure that each sound source signal corresponds in time to the same event; features are extracted from the aligned audio data. Common features include time domain features and frequency domain features. The time domain features may include amplitude, energy, zero-crossing rate, etc. The frequency domain features may include spectrograms, power spectral densities, and the like. These features may be calculated using signal processing techniques such as fourier transforms, short Time Fourier Transforms (STFT), etc.; correlation analysis is performed on the extracted features to determine the topological relationship between microphones. Common correlation analysis methods include cross-correlation and correlation matrix analysis. By calculating the correlation between the features, topology association information between microphones can be obtained; and constructing a topological feature matrix of the microphone array according to the result of the correlation analysis. The matrix describes the topological relation among microphones, and can be used for subsequent tasks such as sound source localization, sound source separation and the like; and analyzing and visualizing the topological feature matrix. 
The topological relationship between microphones can be observed using data analysis tools and graphs and their effect on sound source localization and separation can be further analyzed.
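The correlation-matrix analysis mentioned above can be sketched as follows. The synthetic "shared source plus per-channel noise" signals are an assumption made only so the example is self-contained and checkable:

```python
import numpy as np

def channel_correlation_matrix(signals):
    """Pearson correlation between every pair of microphone channels;
    large off-diagonal values indicate strongly coupled microphones."""
    return np.corrcoef(np.asarray(signals))

rng = np.random.default_rng(1)
base = rng.standard_normal(4096)               # shared sound source
noise = rng.standard_normal((4, 4096)) * 0.1   # independent channel noise
signals = base + noise                         # 4 highly correlated channels
R = channel_correlation_matrix(signals)        # 4 x 4 correlation matrix
```

In a real recording, thresholding or otherwise post-processing such a matrix would yield the topology association information between microphones.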
Specifically, in S23, performing association coding of a graph structure on the plurality of surrounding sound waveform feature vectors and the microphone array topology feature matrix to obtain a topology global sound waveform feature matrix as the global sound waveform feature. That is, the feature representation of each of the surrounding sound waveform feature vectors is taken as a node, the feature representation of the microphone array topology feature matrix is taken as a node-to-node edge feature representation, and the surrounding sound waveform feature matrix obtained by two-dimensionally arranging the plurality of surrounding sound waveform feature vectors and the microphone array topology feature matrix pass through a graph neural network model to obtain a topology global sound waveform feature matrix. Specifically, the graph neural network model performs graph structure data coding on the surrounding environment sound waveform feature matrix and the microphone array topology feature matrix through a learnable neural network parameter to obtain the topology global sound waveform feature matrix containing irregular microphone space topology association features and the various sound waveform feature information.
Notably, a graph neural network (GNN) is a deep learning model for processing graph-structured data. Unlike traditional neural networks, which mainly process vector or matrix data, a GNN can learn and infer over the nodes and edges of a graph, enabling analysis and prediction on graph-structured data. The principle of a GNN can be summarized in the following key steps. Graph representation: first, the graph-structured data is represented in a form a computer can handle; the connection relations of the graph are typically represented with an adjacency matrix or adjacency list, the former encoding the connections between nodes and the latter listing the neighbors of each node. Node representation learning: the core of a GNN is to learn a representation vector for each node that captures the node's own features and its relationships with its neighbors; this is achieved by iteratively updating the nodes' representation vectors. Information transfer: a GNN updates a node's representation vector by passing information through the graph; each node aggregates and combines its own representation vector with those of its neighboring nodes to obtain comprehensive neighborhood information, a process that can be implemented by graph convolution operations. Graph-level prediction: some tasks require a prediction for the entire graph rather than per node; to this end, the representation vectors of all nodes can be assembled into a graph-level representation vector via a graph pooling operation and then predicted through a fully connected layer or another method.
The key to a GNN is the message-passing process, which lets each node update its own representation vector through interactions with its neighbors and thereby fuse global and local information. This is what enables GNNs to perform effective feature learning and reasoning on graph-structured data.
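The graph-level prediction step described above can be sketched as follows; the mean-pooling readout and the single linear output layer are illustrative choices, not the application's specific design:

```python
import numpy as np

def graph_readout(H, W_out, b_out):
    """Graph-level prediction: mean-pool the node representation vectors
    into one graph-level vector, then apply a fully connected layer."""
    g = H.mean(axis=0)          # (D,) graph-level representation vector
    return W_out @ g + b_out    # task output, e.g. a regression target

rng = np.random.default_rng(2)
H = rng.standard_normal((4, 8))       # 4 node vectors after message passing
W_out = rng.standard_normal((2, 8))   # fully connected readout weights
y = graph_readout(H, W_out, np.zeros(2))
print(y.shape)                        # (2,)
```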
It should be noted that, in other specific examples of the present application, the plurality of ambient sound signals may also be analyzed in other ways to obtain global sound waveform features, for example: the collected sound signals are first preprocessed, which may include removing noise, reducing echo, and equalizing volume; such preprocessing helps improve the accuracy of the subsequent feature analysis. If the acquired sound signals differ slightly in time, they need to be aligned; timestamps or a synchronization signal may be used to align the data so that each sound signal corresponds in time to the same event. Features are then extracted from the aligned sound signals. Common sound features include time-domain and frequency-domain features: time-domain features describe how the sound signal varies over time, such as amplitude, energy, zero-crossing rate, and the time-domain envelope; frequency-domain features describe how the sound signal is distributed over frequency, such as spectrograms, power spectral densities, and spectral envelopes, and signal processing techniques such as the Fourier transform and the short-time Fourier transform (STFT) may be used to compute them. The features extracted from the individual sound signals are then aggregated, for example by computing statistics such as the mean, maximum, minimum, and variance of each feature. Global sound waveform features are extracted from the aggregated features, and may include global energy, global frequency distribution, and the duration of the sound. Finally, the extracted global sound waveform features are analyzed and visualized; data analysis tools and charts may be used to observe trends, changes, and associations among the sound features.
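A few of the features listed above (energy, zero-crossing rate, and a framed-FFT spectral centroid standing in for the full STFT pipeline) can be computed as in this sketch; the frame and hop sizes and the test tone are arbitrary illustrative choices:

```python
import numpy as np

def waveform_features(x, fs, frame=256, hop=128):
    """A few time- and frequency-domain features for one sound channel."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()                         # remove DC offset
    energy = float(np.sum(x ** 2))           # time-domain energy
    # Zero-crossing rate: fraction of adjacent samples changing sign.
    zcr = float(np.mean(np.abs(np.diff(np.sign(x))) > 0))
    # Magnitude short-time spectrum via Hann-windowed framed FFTs.
    win = np.hanning(frame)
    n_frames = 1 + (len(x) - frame) // hop
    frames = np.stack([x[i * hop:i * hop + frame] * win
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1))
    # Spectral centroid: magnitude-weighted mean frequency per frame,
    # averaged over frames.
    freqs = np.fft.rfftfreq(frame, d=1.0 / fs)
    centroid = float(np.mean(spec @ freqs / (spec.sum(axis=1) + 1e-12)))
    return {"energy": energy, "zcr": zcr, "spectral_centroid": centroid}

fs = 8000
t = np.arange(fs) / fs                        # one second of audio
feats = waveform_features(np.sin(2 * np.pi * 440 * t), fs)
print(feats["spectral_centroid"])             # near 440 for a pure tone
```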
In particular, in step S3, the direction coordinates of the recommended microphone array are determined based on the global sound waveform feature. In one specific example of the present application, S3 includes: passing the topology global sound waveform feature matrix through a decoder to obtain decoded values representing the direction coordinates of the recommended microphone array. That is, decoding regression is performed on the graph-structure association feature information between the topology association features of the microphone array and the waveform features of the individual sound signals to obtain the direction coordinates of the recommended microphone array. In this way, the optimal array direction can be computed automatically from the direction information of sounds in the environment, and an adjustment instruction can be sent to the microphone array so that the microphone array always faces the sound source, allowing the linear-array-microphone-based sound source localization system to better meet practical application requirements. Specifically, the topology global sound waveform feature matrix is decoded by regression using the decoder according to the following formula to obtain the decoded values representing the direction coordinates of the recommended microphone array: Y = W ⊗ X, wherein X represents the topology global sound waveform feature matrix, Y is the decoded value, W is the weight matrix, and ⊗ represents matrix multiplication.
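The decoding regression Y = W ⊗ X can be illustrated as a plain matrix-vector product over the flattened feature matrix; the shapes below (a 4×8 feature matrix decoded to two direction coordinates) are assumptions for illustration only:

```python
import numpy as np

def decode_direction(X, W):
    """Decoding regression Y = W (*) X: map the (flattened) topology
    global sound waveform feature matrix X to direction coordinates Y."""
    v = X.reshape(-1)   # expand the feature matrix into a column vector
    return W @ v        # W has shape (n_coords, N*D)

rng = np.random.default_rng(1)
X = rng.standard_normal((4, 8))    # topology global feature matrix
W = rng.standard_normal((2, 32))   # learnable decoder weight matrix
Y = decode_direction(X, W)         # e.g. (azimuth, elevation)
print(Y.shape)                     # (2,)
```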
Notably, decoding regression refers to mapping the representation vectors or features output by a neural network back to the original data domain, thereby enabling regression prediction on the original data. In a decoding regression task, the neural network typically implements the mapping from representation vectors to the original data by learning a decoder. Decoding regression can be applied in a variety of fields, such as image generation, speech synthesis, and sequence generation.
It should be noted that, in other specific examples of the present application, the direction coordinates of the recommended microphone array may also be determined from the global sound waveform feature in other ways, for example: first, sound data is collected, i.e., sounds in the environment are recorded using one or more microphones and a global sound waveform is acquired; features are then extracted from the global sound waveform, where various signal processing techniques and feature extraction algorithms, such as the short-time Fourier transform, cepstral coefficients, and power spectral densities, may be used to obtain the spectral features of the sound signal; the direction of the sound source is then estimated from the extracted spectral features. A common approach is to determine the direction of the sound source by beamforming: the individual microphone signals in the array are weighted and phase-adjusted so that sound from a particular direction is enhanced while sound from other directions is suppressed, and the direction coordinates of the sound source can be estimated from the beamforming result. Finally, the recommended microphone array direction coordinates are determined from the direction estimate; an appropriate microphone array orientation may be selected according to the direction coordinates of the sound source to maximize capture of the sound source.
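The beamforming alternative described above can be sketched as a frequency-domain delay-and-sum scan over candidate angles; the far-field plane-wave model, the 4-microphone 5 cm-pitch geometry, and the sampling rate are illustrative assumptions:

```python
import numpy as np

def delay_and_sum(signals, mic_x, fs, c=343.0, angles=None):
    """Scan a frequency-domain delay-and-sum beamformer over candidate
    angles and return the angle (degrees, 0 = broadside) of max power.

    signals : (M, T) array, one row per microphone
    mic_x   : (M,) microphone positions along the array axis (metres)
    """
    if angles is None:
        angles = np.linspace(-90.0, 90.0, 181)   # 1-degree grid
    T = signals.shape[1]
    freqs = np.fft.rfftfreq(T, d=1.0 / fs)
    spectra = np.fft.rfft(signals, axis=1)
    powers = []
    for a in angles:
        # Far-field plane-wave delay per microphone for steering angle a.
        tau = mic_x * np.sin(np.deg2rad(a)) / c
        steer = np.exp(2j * np.pi * freqs[None, :] * tau[:, None])
        beam = np.sum(spectra * steer, axis=0)   # align, then sum channels
        powers.append(np.sum(np.abs(beam) ** 2))
    return float(angles[int(np.argmax(powers))])

# Simulate a broadband source arriving from 30 degrees at a 4-mic,
# 5 cm pitch linear array, using exact fractional delays in frequency.
rng = np.random.default_rng(3)
fs, T, c = 8000, 2048, 343.0
mic_x = np.array([0.0, 0.05, 0.10, 0.15])
S = np.fft.rfft(rng.standard_normal(T))
freqs = np.fft.rfftfreq(T, d=1.0 / fs)
tau = mic_x * np.sin(np.deg2rad(30.0)) / c
signals = np.fft.irfft(S[None, :] * np.exp(-2j * np.pi * freqs * tau[:, None]), n=T)
est = delay_and_sum(signals, mic_x, fs)
print(est)   # 30.0
```

Steering to the true angle aligns the per-microphone phases at every frequency, so the coherent sum peaks exactly there; this is the "enhance one direction, suppress the others" behavior the paragraph describes.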
It will be appreciated that the two-dimensional-convolution-layer-based sound waveform feature extractor, the convolutional-neural-network-model-based topology feature extractor, and the decoder need to be trained before inference with the neural network models described above can be performed. That is, the linear-array-microphone-based sound source localization method of the present application further includes a training stage for training the two-dimensional-convolution-layer-based sound waveform feature extractor, the convolutional-neural-network-model-based topology feature extractor, and the decoder.
Fig. 4 is a flowchart of the training stage of the linear-array-microphone-based sound source localization method according to an embodiment of the present application. As shown in Fig. 4, the method includes a training stage comprising: S110, acquiring training data, wherein the training data comprises a plurality of training ambient sound signals acquired by the linear array microphone and true values of the direction coordinates of the recommended microphone array; S120, passing the plurality of training ambient sound signals through the two-dimensional-convolution-layer-based sound waveform feature extractor to obtain a plurality of training ambient sound waveform feature vectors; S130, constructing a training microphone array topology matrix of the linear array microphone, wherein the value at each off-diagonal position in the training microphone array topology matrix is the Euclidean distance between the two corresponding microphones; S140, passing the training microphone array topology matrix through the convolutional-neural-network-model-based topology feature extractor to obtain a training microphone array topology feature matrix; S150, passing the plurality of training ambient sound waveform feature vectors and the training microphone array topology feature matrix through the graph neural network model to obtain a training topology global sound waveform feature matrix; S160, passing the training topology global sound waveform feature matrix through the decoder to obtain a decoding loss function value; and S170, training the two-dimensional-convolution-layer-based sound waveform feature extractor, the convolutional-neural-network-model-based topology feature extractor, and the decoder based on the decoding loss function value by back propagation with gradient descent, wherein, in each training iteration, an external boundary constraint iteration based on a benchmark annotation is applied to the weight matrix of the decoder.
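Two pieces of the training stage admit compact sketches: the construction of the training microphone array topology matrix from Euclidean distances (S130), and a plain squared-error gradient step for the decoder weights, a simplification that omits the benchmark-annotation boundary constraint of S170; positions, learning rate, and shapes are illustrative assumptions:

```python
import numpy as np

def topology_matrix(mic_x):
    """Training microphone-array topology matrix (as in S130): the value
    at each off-diagonal position (i, j) is the Euclidean distance
    between microphones i and j of the linear array."""
    mic_x = np.asarray(mic_x, dtype=float)
    return np.abs(mic_x[:, None] - mic_x[None, :])

def decoder_sgd_step(W, x, y_true, lr=0.01):
    """One gradient-descent step for a linear decoder y = W @ x under a
    squared-error decoding loss (the external boundary constraint on the
    weight matrix is omitted in this simplification)."""
    err = W @ x - y_true
    return W - lr * np.outer(err, x)   # dL/dW for L = 0.5*||err||^2

A = topology_matrix([0.0, 0.05, 0.10, 0.15])   # 4 mics, 5 cm pitch
print(A[0, 3])   # 0.15 -- distance between the first and last microphone

rng = np.random.default_rng(4)
x = rng.standard_normal(16)                    # flattened feature matrix
y_true = np.array([0.3, -0.2])                 # true direction coordinates
W = np.zeros((2, 16))
for _ in range(200):                           # loss shrinks geometrically
    W = decoder_sgd_step(W, x, y_true)
```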
In particular, in the technical scheme of the present application, when the image semantic features of the signal waveforms of the plurality of ambient sound signals are extracted by the two-dimensional-convolution-layer-based sound waveform feature extractor, the locally associated image semantic feature distributions of the source waveform images provide the image-semantic fitting direction for the fitting of the decoder's weight matrix during training. However, when the plurality of ambient sound waveform feature vectors and the microphone array topology feature matrix are passed through the graph neural network model to obtain the topology global sound waveform feature matrix, the topological association imposed by the inter-microphone feature-distance topology causes the feature distribution of the resulting topology global sound waveform feature matrix to deviate, in its expression on the image semantic feature domain, from the locally spatial image-semantic association features of the source images. In the decoding scenario, this deviation introduces an inter-domain offset in the regression probability mapping of the topology global sound waveform feature matrix during the iteration of the decoder's weight matrix, which affects the accuracy of the training result obtained by passing the topology global sound waveform feature matrix through the decoder.
Based on the above, during training of the decoder, the applicant of the present application applies an external boundary constraint to the weight matrix, based on a benchmark annotation, on the topology global sound waveform feature vector obtained by expanding the topology global sound waveform feature matrix, which is specifically expressed as:
M1 and M2 are the weight matrices of the previous iteration and the current iteration, respectively, where, at the first iteration, M1 and M2 are set using different initialization strategies (e.g., M1 is set to the identity matrix and M2 to the mean diagonal matrix of the feature vector to be decoded), and Vc is the topology global sound waveform feature vector in column-vector form. Here, the iterative association representation of the topology global sound waveform feature vector Vc in the weight space is used as an external association boundary constraint on the weight-matrix iteration, so that, with the previous weight matrix serving as the benchmark annotation during iteration, the oriented mismatch of the weight matrix relative to the topology global sound waveform feature vector Vc in the weight-space iteration is reduced by using Vc as an anchor point. This compensates for the inter-domain offset of the regression probability mapping of Vc during iteration, strengthens the image-semantic fitting aggregation of the weight matrix with respect to Vc, and improves the accuracy of the decoded values obtained by the trained model from the topology global sound waveform feature matrix. In this way, the direction of the microphone array can be adjusted automatically based on the direction information of sounds in the environment so that the array always faces the sound source, allowing the sound source localization system to better meet practical application requirements and thereby optimizing the sound source localization effect.
In summary, the linear-array-microphone-based sound source localization method according to the embodiments of the present application has been described; it monitors and analyzes the sound signals in the environment in real time, automatically computes the optimal array direction from the direction information of the sounds in the environment, and sends an adjustment instruction to the microphone array so that the array always faces the sound source. The linear-array-microphone-based sound source localization system can thereby better meet practical application requirements, optimizing the sound source localization effect.
Further, a sound source localization system based on a linear array microphone is also provided.
Fig. 1 is a block diagram of the linear-array-microphone-based sound source localization system according to an embodiment of the present application. As shown in Fig. 1, a linear-array-microphone-based sound source localization system 300 according to an embodiment of the present application includes: an acquisition board 310, composed of a linearly arranged microphone array, for acquiring sound signals from the surrounding environment; and a main control board 320, communicatively connected to the acquisition board through a PIN wire connector, for receiving and analyzing the sound signals acquired by the acquisition board and for controlling and adjusting the direction of the microphone array.
As described above, the linear array microphone-based sound source localization system 300 according to the embodiment of the present application may be implemented in various wireless terminals, such as a server or the like having a linear array microphone-based sound source localization algorithm. In one possible implementation, the linear array microphone based sound source localization system 300 according to embodiments of the present application may be integrated into a wireless terminal as one software module and/or hardware module. For example, the linear array microphone based sound source localization system 300 may be a software module in the operating system of the wireless terminal or may be an application developed for the wireless terminal; of course, the linear array microphone based sound source localization system 300 could equally be one of many hardware modules of the wireless terminal.
Alternatively, in another example, the linear array microphone based sound source localization system 300 and the wireless terminal may be separate devices, and the linear array microphone based sound source localization system 300 may be connected to the wireless terminal through a wired and/or wireless network and exchange interactive information in an agreed data format.
The foregoing description of the embodiments of the present disclosure has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the improvement of technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (5)

1. A method for locating a sound source based on a linear array microphone, comprising:
acquiring a plurality of ambient sound signals acquired by a linear array microphone;
performing feature analysis on the plurality of ambient sound signals to obtain global sound waveform features; and
determining a direction coordinate of a recommended microphone array based on the global sound waveform feature;
wherein the method further comprises a training step for training a sound waveform feature extractor based on a two-dimensional convolution layer, a topology feature extractor based on a convolutional neural network model, and a decoder;
wherein the training step comprises:
acquiring training data, wherein the training data comprises a plurality of training ambient sound signals acquired by a linear array microphone and a true value of a direction coordinate of the recommended microphone array;
passing the plurality of training ambient sound signals through the two-dimensional convolution layer based sound waveform feature extractor to obtain a plurality of training ambient sound waveform feature vectors;
constructing a training microphone array topology matrix of the linear array microphone, wherein the value at each off-diagonal position in the training microphone array topology matrix is the Euclidean distance between the two corresponding microphones;
passing the training microphone array topology matrix through the topological feature extractor based on the convolutional neural network model to obtain a training microphone array topological feature matrix;
passing the plurality of training ambient sound waveform feature vectors and the training microphone array topology feature matrix through a graph neural network model to obtain a training topology global sound waveform feature matrix;
passing the training topology global sound waveform feature matrix through the decoder to obtain a decoding loss function value;
training the two-dimensional convolutional layer based sound waveform feature extractor, the convolutional neural network model based topology feature extractor, and the decoder based on the decoding loss function value by back propagation with gradient descent, wherein, in each iteration of the training, an external boundary constraint iteration based on a benchmark annotation is performed on the weight matrix of the decoder;
wherein, in each iteration of the training, the external boundary constraint iteration based on the benchmark annotation is performed on the weight matrix of the decoder according to the following optimization formula;
wherein, the optimization formula is:
wherein M1 and M2 are the weight matrices of the previous iteration and the current iteration, respectively, Vc is the training topology global sound waveform feature vector obtained by expanding the training topology global sound waveform feature matrix, Vc being in the form of a column vector, ⊗ represents matrix multiplication, ⊕ represents matrix addition, and M2′ represents the weight matrix of the decoder after the iteration.
2. The method of claim 1, wherein performing a feature analysis on the plurality of ambient sound signals to obtain global sound waveform features comprises:
extracting waveform characteristics of the plurality of surrounding environment sound signals through a sound waveform characteristic extractor based on a two-dimensional convolution layer so as to obtain a plurality of surrounding environment sound waveform characteristic vectors;
performing microphone array topology association analysis on the linear array microphone to obtain a microphone array topology feature matrix; and
and carrying out association coding of a graph structure on the plurality of surrounding environment sound waveform feature vectors and the microphone array topology feature matrix to obtain a topology global sound waveform feature matrix as the global sound waveform feature.
3. The method for linear array microphone based sound source localization of claim 2, wherein performing a microphone array topology correlation analysis on the linear array microphone to obtain a microphone array topology feature matrix comprises:
constructing a microphone array topology matrix of the linear array microphone, wherein the value at each off-diagonal position in the microphone array topology matrix is the Euclidean distance between the two corresponding microphones; and
and passing the microphone array topological matrix through a topological feature extractor based on a convolutional neural network model to obtain the microphone array topological feature matrix.
4. A method of linear array microphone based sound source localization as claimed in claim 3, wherein performing association coding of a graph structure on the plurality of ambient sound waveform feature vectors and the microphone array topology feature matrix to obtain a topology global sound waveform feature matrix as the global sound waveform feature comprises: passing the plurality of ambient sound waveform feature vectors and the microphone array topology feature matrix through a graph neural network model to obtain the topology global sound waveform feature matrix.
5. The method of claim 4, wherein determining the directional coordinates of the recommended microphone array based on the global sound waveform characteristics comprises: the topological global sound waveform feature matrix is passed through a decoder to obtain decoded values representing the directional coordinates of the recommended microphone array.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311051156.8A CN117054968B (en) 2023-08-19 2023-08-19 Sound source positioning system and method based on linear array microphone


Publications (2)

Publication Number Publication Date
CN117054968A CN117054968A (en) 2023-11-14
CN117054968B true CN117054968B (en) 2024-03-12

Family

ID=88664122


Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104093094A (en) * 2014-06-16 2014-10-08 华南理工大学 Indoor voice acquisition method and device based on self-adaptive rotating alignment
CN107202976A (en) * 2017-05-15 2017-09-26 大连理工大学 The distributed microphone array sound source localization system of low complex degree
CN109286875A (en) * 2018-09-29 2019-01-29 百度在线网络技术(北京)有限公司 For orienting method, apparatus, electronic equipment and the storage medium of pickup
CN112185406A (en) * 2020-09-18 2021-01-05 北京大米科技有限公司 Sound processing method, sound processing device, electronic equipment and readable storage medium
CN112463103A (en) * 2019-09-06 2021-03-09 北京声智科技有限公司 Sound pickup method, sound pickup device, electronic device and storage medium
CN113196291A (en) * 2019-01-23 2021-07-30 动态Ad有限责任公司 Automatic selection of data samples for annotation
CN114611795A (en) * 2022-03-14 2022-06-10 杭州清淮科技有限公司 Water level linkage-based water level prediction method and system
CN115144312A (en) * 2022-06-29 2022-10-04 杭州里莹网络科技有限公司 Indoor air fine particle measuring method and system based on Internet of things
CN115169463A (en) * 2022-07-11 2022-10-11 杭州里莹网络科技有限公司 Indoor air quality monitoring system based on Internet of things and monitoring method thereof
CN115690908A (en) * 2022-10-28 2023-02-03 中国科学院上海微系统与信息技术研究所 Three-dimensional gesture attitude estimation method based on topology perception
CN115963229A (en) * 2023-01-09 2023-04-14 吉安创成环保科技有限责任公司 Gas monitoring system and monitoring method thereof
CN115982573A (en) * 2023-03-20 2023-04-18 东莞市杰达机械有限公司 Multifunctional feeder and control method thereof

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112733937A (en) * 2021-01-11 2021-04-30 西安电子科技大学 Credible graph data node classification method, system, computer equipment and application




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant