CN117031399A - Multi-agent cooperative sound source positioning method, equipment and storage medium - Google Patents

Multi-agent cooperative sound source positioning method, equipment and storage medium

Info

Publication number
CN117031399A
CN117031399A (application CN202311305182.9A)
Authority
CN
China
Prior art keywords
sound source
position estimation
agent
agents
strategy
Prior art date
Legal status
Granted
Application number
CN202311305182.9A
Other languages
Chinese (zh)
Other versions
CN117031399B (en)
Inventor
吕少卿
俞鸣园
王克彦
曹亚曦
孙俊伟
费敏健
Current Assignee
Zhejiang Huachuang Video Signal Technology Co Ltd
Original Assignee
Zhejiang Huachuang Video Signal Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Zhejiang Huachuang Video Signal Technology Co Ltd
Priority to CN202311305182.9A
Publication of CN117031399A
Application granted
Publication of CN117031399B
Status: Active


Classifications

    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01S RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S5/00 Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations
    • G01S5/18 Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations using ultrasonic, sonic, or infrasonic waves
    • G01S5/22 Position of source determined by co-ordinating a plurality of position lines defined by path-difference measurements

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Stereophonic System (AREA)
  • Measurement Of Velocity Or Position Using Acoustic Or Ultrasonic Waves (AREA)

Abstract

The application discloses a multi-agent collaborative sound source localization method, device, and storage medium. The method comprises: collecting audio data of a sound source in the current sound source environment, and performing sound source position estimation on the audio data using a sound source localization strategy to obtain a position estimation result; during the sound source position estimation process, extracting specified parameters related to that process to obtain shared parameters; sending the shared parameters to other agents and receiving the shared parameters sent by the other agents; and optimizing and updating the sound source localization strategy using the shared parameters sent by the other agents, so that the next round of sound source position estimation is performed with the updated strategy. Through parameter sharing among multiple agents, a single agent captures more information, which improves its localization performance and prevents the complexity and dynamic changes of the sound source environment from degrading localization accuracy.

Description

Multi-agent cooperative sound source positioning method, equipment and storage medium
Technical Field
The present application relates to the field of sound source localization technology, and in particular to a multi-agent collaborative sound source localization method, device, and storage medium.
Background
In general, vision is the primary way people acquire information, but in many cases it cannot provide complete information about a target and has certain limitations compared with hearing; for example, image-based methods are easily affected by factors such as occlusion, illumination, and posture changes. In contrast to the limited field of view of vision, the auditory system is omnidirectional and is not restricted by angle or position, which well compensates for the deficiencies of visual information.
Sound source localization technology is therefore receiving increasing attention; sound sources need to be localized in numerous application scenarios such as conferences, teaching, communication, and machinery servicing. However, conventional approaches often perform poorly in complex, dynamic sound source environments.
Disclosure of Invention
To solve the above problems, the present application provides at least a multi-agent collaborative sound source localization method, device, and storage medium.
A first aspect of the present application provides a multi-agent collaborative sound source localization method, applied to any agent in an agent set, where each agent in the set corresponds to a sound source localization strategy. The method includes: collecting audio data of a sound source in the current sound source environment, and performing sound source position estimation on the audio data using the sound source localization strategy to obtain a position estimation result; during the sound source position estimation process, extracting specified parameters related to that process to obtain shared parameters; sending the shared parameters to other agents and receiving the shared parameters sent by the other agents; and optimizing and updating the sound source localization strategy using the shared parameters sent by the other agents, so that the next round of sound source position estimation is performed with the updated strategy.
In an embodiment, the shared parameters sent by the other agents include the position estimation results obtained by those agents, and the method further includes: taking the position estimation results sent by the other agents as reference estimation results; and fusing, for the same sound source, the position estimation result corresponding to the audio data with the reference estimation results to obtain a fused position result.
In an embodiment, optimizing and updating the sound source localization strategy using the shared parameters sent by other agents includes: determining the parameter types of the shared parameters; querying for an optimization strategy matching those parameter types; and optimizing and updating the sound source localization strategy with the matched optimization strategy.
In an embodiment, the method further includes: calculating a reward value corresponding to the position estimation result using a reward function; and optimizing and updating the sound source localization strategy according to the reward value.
In one embodiment, calculating the reward value for the position estimation result using a reward function includes: determining the sound source localization influencing factors of the sound source environment, i.e., the factors in the environment that affect the accuracy of the position estimation result; constructing a reward function based on those factors; and calculating the reward value corresponding to the position estimation result using the reward function.
In one embodiment, the sound source localization influencing factors include a position estimation error. Calculating the reward value corresponding to the position estimation result using the reward function includes: calculating the position estimation error between the position estimation result and the actual sound source position; and obtaining a position error threshold and deriving the reward value from the difference between the position estimation error and the position error threshold.
In one embodiment, determining the sound source localization influencing factors of the sound source environment includes: obtaining the environment type of the sound source environment and the service type of the sound source localization service; and querying for the sound source localization influencing factors corresponding to the current environment based on the environment type and the service type.
In one embodiment, the shared parameters sent by other agents contain time differences of arrival. Optimizing and updating the sound source localization strategy using these shared parameters includes: calculating the time difference of arrival of the audio data relative to the current agent, and extracting the time differences of arrival from the shared parameters sent by the other agents; integrating the shared time differences with the locally computed one and taking the integration result as the current state representation; computing the position estimation result corresponding to the current state representation according to the sound source localization strategy; calculating a reward value for the position estimation result using the reward function; and optimizing and updating the sound source localization strategy according to the reward value.
A second aspect of the present application provides a multi-agent collaborative sound source localization device, deployed on any agent in an agent set, where each agent in the set corresponds to a sound source localization strategy. The device comprises: a position estimation module for collecting audio data of a sound source in the current sound source environment and performing sound source position estimation on the audio data using the sound source localization strategy to obtain a position estimation result; a parameter extraction module for extracting, during the sound source position estimation process, specified parameters related to that process to obtain shared parameters; a parameter sharing and acquisition module for sending the shared parameters to other agents and receiving the shared parameters sent by the other agents; and a strategy optimization module for optimizing and updating the sound source localization strategy using the shared parameters sent by the other agents, so that the next round of sound source position estimation is performed with the updated strategy.
The third aspect of the present application provides an electronic device, including a memory and a processor, where the processor is configured to execute program instructions stored in the memory, so as to implement the multi-agent collaborative sound source localization method.
A fourth aspect of the present application provides a computer readable storage medium having stored thereon program instructions which, when executed by a processor, implement the multi-agent collaborative sound source localization method described above.
According to the above scheme, audio data of a sound source in the current sound source environment is collected, and sound source position estimation is performed on the audio data using the sound source localization strategy to obtain a position estimation result; during the estimation process, specified parameters related to it are extracted to obtain shared parameters; the shared parameters are sent to other agents, and shared parameters sent by the other agents are received; and the sound source localization strategy is optimized and updated using the shared parameters sent by the other agents, so that the next round of estimation uses the updated strategy. Parameter sharing among multiple agents thus lets a single agent capture more information, improving its localization performance and preventing the complexity and dynamic changes of the sound source environment from degrading localization accuracy.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
FIG. 1 is a schematic diagram of an implementation environment involved in a multi-agent collaborative sound source localization method according to an exemplary embodiment of the present application;
FIG. 2 is a flow chart of a multi-agent collaborative sound source localization method according to an exemplary embodiment of the present application;
FIG. 3 is a schematic diagram of the architecture of a sound source localization model according to an exemplary embodiment of the present application;
FIG. 4 is a schematic diagram illustrating the calculation of prize values in accordance with an exemplary embodiment of the present application;
FIG. 5 is a block diagram of a multi-agent collaborative sound source localization device shown in an exemplary embodiment of the present application;
FIG. 6 is a schematic diagram of an electronic device shown in an exemplary embodiment of the application;
FIG. 7 is a schematic diagram of a structure of a computer-readable storage medium according to an exemplary embodiment of the present application.
Detailed Description
The following describes embodiments of the present application in detail with reference to the drawings.
In the following description, for purposes of explanation and not limitation, specific details such as particular system architectures, interfaces, and techniques are set forth in order to provide a thorough understanding of the present application.
The term "and/or" is herein merely an association information describing an associated object, meaning that three relationships may exist, e.g., a and/or B may represent: a exists alone, A and B exist together, and B exists alone. In addition, the character "/" herein generally indicates that the front and rear associated objects are an "or" relationship. Further, "a plurality" herein means two or more than two. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of A, B, C, and may mean including any one or more elements selected from the group consisting of A, B and C.
The following describes a multi-agent collaborative sound source localization method provided by the embodiment of the application.
Referring to fig. 1, fig. 1 is a schematic diagram of an implementation environment related to a multi-agent collaborative sound source localization method according to the present application. As shown in fig. 1, the implementation environment includes a plurality of agents 110 and a server 120, and the server 120 is directly or indirectly connected to each agent 110 through wired or wireless communication.
The agent 110 is configured to collect audio data of a sound source in a sound source environment, and may be a microphone array including a plurality of microphones.
Each agent 110 is deployed with a sound source localization strategy, which is used to localize a sound source according to audio data of the sound source, so that each agent 110 can independently collect and process the audio data.
The server 120 may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), big data, and artificial intelligence platforms.
In some embodiments, as shown in fig. 1, each agent 110 may communicate shared parameters to each other through a server 120 to enable co-location and experience learning among multiple agents. For example, server 120 collects the shared parameters reported by agent 110-1 and broadcasts the shared parameters to agents 110-2 through 110-N.
In some embodiments, each agent 110 may also exchange shared parameters with the others directly. For example, agent 110-1 maintains a list of the communication addresses of agents 110-2 through 110-N and sends its shared parameters to them directly according to this list.
Referring to fig. 2, fig. 2 is a flowchart illustrating a multi-agent collaborative sound source localization method according to an exemplary embodiment of the present application. The multi-agent collaborative sound source localization method may be applied to the implementation environment shown in fig. 1 and specifically executed by any agent in the implementation environment, and for convenience of description, an agent as an execution subject will be referred to as a current agent. It should be understood that the method may be adapted to other exemplary implementation environments and be specifically executed by devices in other implementation environments, and the implementation environments to which the method is adapted are not limited by the present embodiment.
As shown in fig. 2, the multi-agent collaborative sound source localization method at least includes steps S210 to S240, which are described in detail as follows:
step S210: and acquiring audio data of a sound source in the current sound source environment, and estimating the sound source position of the audio data by utilizing a sound source positioning strategy to obtain a position estimation result.
The sound source environment is any place where sound propagates, such as a meeting room where a meeting is being held, an environment requiring search and rescue, or a park.
A set of mutually communicating agents is deployed in the sound source environment. The set includes a plurality of agents, each deployed with its own sound source localization strategy.
The sound source localization strategy is used to estimate the sound source position from the audio data. It may contain a pre-built neural network model or a pre-built sound source localization algorithm; the present application does not limit its specific implementation.
Each agent collects the audio data of the sound source in the current sound source environment and performs sound source position estimation on it using its own localization strategy to obtain a position estimation result.
The position estimation result describes the position information of the sound source in the sound source environment. If the audio data corresponds to multiple sound sources, the position estimation result contains position information for each of them.
Step S220: during the sound source position estimation process, extract the specified parameters related to that process to obtain shared parameters.
When multiple agents estimate the positions of sound sources in the environment, they share parameters so that each agent can learn from the experience of the others, improving learning efficiency and localization performance.
The specified parameter may be any relevant parameter in the sound source position estimation process, such as the audio data collected by the agent, the position estimation result, the localization accuracy of that result, or the strategy parameters of the sound source localization strategy.
In some embodiments, the specified parameters may be preset, for example, a technician presets the specified parameters that each agent needs to share in estimating the sound source position.
In some embodiments, the specified parameters may be determined flexibly according to the current situation, for example according to the number of optimization updates applied to the current agent's localization strategy, the localization accuracy of the position estimation result, or the current network state.
For example, the specified parameters to extract in the current round are determined from the number of optimization updates applied to the localization strategy: if the update count is greater than a count threshold, the strategy parameters of the localization strategy, the audio data collected by the current agent, and the position estimation result are taken as the specified parameters; otherwise, only the collected audio data and the position estimation result are taken as the specified parameters.
As a further illustration, the specified parameters may be determined from the localization accuracy of the position estimation result: if the accuracy is greater than an accuracy threshold, the strategy parameters of the localization strategy, the collected audio data, and the position estimation result are taken as the specified parameters; otherwise, only the collected audio data is taken as the specified parameter.
That is, when selecting which parameters to share, an agent chooses those expected to provide the greatest gain to other agents, improving the mutual optimization effect among agents.
The specified parameters determined above are then extracted during the sound source position estimation of the audio data to obtain the shared parameters.
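A minimal sketch of this selection logic follows (Python; the thresholds, field names, and the rule combining update count and accuracy are illustrative assumptions, not prescribed by the application):

import copy

def select_shared_params(agent_state, update_count, accuracy,
                         count_threshold=10, accuracy_threshold=0.8):
    # Always share the raw observation and the estimate; share strategy
    # parameters only once the local strategy is mature (many updates)
    # or demonstrably accurate, so that peers actually gain from them.
    shared = {
        "audio_data": agent_state["audio_data"],
        "position_estimate": agent_state["position_estimate"],
    }
    if update_count > count_threshold or accuracy > accuracy_threshold:
        shared["strategy_params"] = copy.deepcopy(agent_state["strategy_params"])
    return shared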
Step S230: send the shared parameters to other agents and receive the shared parameters sent by the other agents.
The shared parameters are exchanged among all the agents.
In some embodiments, a parameter sharing policy is obtained, and the shared parameters are sent to other agents according to it.
Illustratively, the parameter sharing policy defines the timing of parameter sharing. For example, parameters may be shared once after every sound source position estimation, or once after a preset number of estimation rounds has accumulated. The timing may be determined by the requirements of the sound source localization service, the current network state, the number of optimization updates applied by other agents, and so on; the application is not limited in this respect.
The parameter sharing policy may also define the manner of sharing. For example, a server may collect the shared parameters of all agents and broadcast them to every agent, or the agents may exchange parameters with each other directly.
The parameter sharing policy between each agent may be the same or different, which is not limited in the present application.
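A minimal sketch of one possible sharing policy (Python; the fixed interval, the server-relay transport, and the receive() hook on each agent are assumptions for illustration):

class SharingPolicy:
    # Share once every `interval` rounds of sound source position estimation.
    def __init__(self, interval=5):
        self.interval = interval
        self.rounds = 0

    def should_share(self):
        self.rounds += 1
        return self.rounds % self.interval == 0


class Server:
    # Central relay: collects one agent's shared parameters and broadcasts
    # them to every other agent (the server-mediated mode described above).
    def __init__(self, agents):
        self.agents = agents

    def broadcast(self, sender, shared_params):
        for agent in self.agents:
            if agent is not sender:
                agent.receive(shared_params)  # each agent exposes a receive() hook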
Step S240: optimize and update the sound source localization strategy using the shared parameters sent by other agents, so that the next round of sound source position estimation uses the updated strategy.
The optimization strategies used to update the localization strategy on the basis of the received shared parameters may be the same or different, and may be determined by the number of optimization updates applied to the current agent's localization strategy, the localization accuracy of the current agent's position estimation results, the parameter types of the received shared parameters, the localization strategy in use, and so on.
Illustratively, the optimization strategy used to update the sound source localization strategy is determined by the localization strategy the current agent employs.
For example, a pre-built neural network model may be deployed in the current agent, which performs sound source position estimation on the audio data to produce a position estimation result. The model's input is the audio data and its output is the position estimate of the corresponding sound source. If the shared parameters sent by other agents contain their position estimation results, the model can be optimized using the position estimation error between its own estimate and the actual sound source position, the position estimates shared by other agents, and so on.
As another example, a sound source localization algorithm may be deployed in the current agent, which performs sound source position estimation on the audio data to obtain a position estimation result. The algorithm may work as follows: compute the time difference of arrival (Time Difference Of Arrival, TDOA) of the audio data relative to the microphone array of the current agent, and infer the sound source position from the TDOA (for example by triangulation or maximum likelihood estimation) to obtain the position estimation result. Optimizing and updating this localization strategy with shared parameters can then mean using the audio data of other agents to improve the accuracy of the TDOA computation: a more accurate TDOA is computed by combining the audio data sent by other agents with the audio data collected by the current agent, and that TDOA is used for position estimation. This yields more accurate position estimates even in noisy conditions or in the presence of significant signal interference.
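A minimal sketch of the TDOA step (Python with NumPy; the cross-correlation estimator and the simple averaging of peers' TDOAs are assumptions standing in for whatever combination rule an implementation chooses):

import numpy as np

def estimate_tdoa(sig_a, sig_b, fs):
    # Relative delay (in seconds) between two channels, taken as the lag of
    # the cross-correlation peak; the sign follows numpy's convention.
    corr = np.correlate(sig_a, sig_b, mode="full")
    lag = np.argmax(corr) - (len(sig_b) - 1)
    return lag / fs

def refine_tdoa(local_tdoa, peer_tdoas):
    # Combine the local estimate with TDOAs shared by other agents.
    return float(np.mean([local_tdoa] + list(peer_tdoas)))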
Illustratively, the optimization strategy used to update the sound source localization strategy is determined by the parameter types of the received shared parameters.
For example, if the shared parameters sent by other agents contain the audio data they collected, optimizing and updating the localization strategy may proceed as follows: extract the audio features contained in the audio data shared by other agents and in the audio data collected by the current agent (including but not limited to audio intensity, audio frequency, and time difference of arrival), and compute a more accurate position estimation result by combining the two sets of features. Information from different collection sources is thereby captured, and the correlation and complementarity among these sources help the current agent localize the sound source more accurately.
As another example, if the shared parameters contain the strategy parameters of the localization strategies of other agents, the current agent can fuse its own strategy parameters with the received ones to obtain the updated localization strategy. Sharing better strategy parameters in this way improves the localization performance of the whole agent set.
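A minimal sketch of such parameter fusion (Python with NumPy; the uniform peer mean and the mixing coefficient are assumptions):

import numpy as np

def fuse_strategy_params(local_params, peer_params_list, mix=0.5):
    # Blend the local strategy parameter vector with the mean of the
    # parameter vectors shared by peers; mix controls how much is adopted.
    peer_mean = np.mean(np.asarray(peer_params_list), axis=0)
    return (1.0 - mix) * np.asarray(local_params) + mix * peer_mean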
Illustratively, the optimization strategy used to update the sound source localization strategy is determined by the localization accuracy of the current agent's position estimation results.
For example, suppose the shared parameters sent by other agents contain the sound source movement information they detected (such as movement direction and speed) together with the strategy parameters of their localization strategies. The current agent first obtains the localization accuracy of its own position estimation result. If the accuracy is below a preset accuracy threshold, it fuses its own strategy parameters with the received ones to obtain the updated localization strategy. Otherwise, it adds analysis of the sound source movement information to its localization strategy, for example predicting the sound source position in advance from the movement information and using that prediction to correct the position estimation result obtained from the audio data.
In some embodiments, besides optimizing and updating the localization strategy, the shared parameters sent by other agents can also be used to refine the position estimation result of the current round, yielding a more accurate result.
Specifically, the shared parameters include the position estimation results obtained by the other agents, and for the same sound source, the position estimation results sent by the other agents are fused with the current agent's own result to obtain a fused position result.
Fusion methods include, but are not limited to, weighted averaging, median filtering, and confidence-based selection among the position estimation results.
For example, the current agent receives the position estimation results sent by other agents and treats them as reference estimation results; in the current round it therefore has its own estimate plus the references. It may average its own estimate with each reference to obtain the fused position result, or compute a weighted average in which each result's weight is derived from its localization accuracy. The weights can also be adjusted dynamically based on feedback about the other agents: for example, if one agent keeps providing inaccurate position estimates, the weight of its results in the fusion is gradually reduced.
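A minimal sketch of this weighted fusion with feedback-adjusted weights (Python with NumPy; the multiplicative decay rule is an assumption):

import numpy as np

class EstimateFuser:
    def __init__(self, n_agents):
        # One weight per agent (including the current agent itself).
        self.weights = np.ones(n_agents)

    def fuse(self, estimates):
        # estimates: array of shape (n_agents, 2) holding (x, y) estimates.
        w = self.weights / self.weights.sum()
        return w @ np.asarray(estimates)

    def penalize(self, agent_idx, factor=0.9):
        # Gradually down-weight an agent that keeps providing poor estimates.
        self.weights[agent_idx] *= factor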
During sound source localization, parameter sharing among multiple agents thus lets a single agent capture more information, improving its localization performance and preventing the complexity and dynamic changes of the sound source environment from degrading localization accuracy.
Next, a detailed description will be given of a multi-agent collaborative sound source localization method according to the present application, taking a specific embodiment as an example.
The current agent deploys a sound source localization model which, as shown in fig. 3, consists of a policy network (Actor) and a value network (Critic). The policy network's input is the current state, such as the audio signal currently collected by the agent's microphone array, and its output is an action, namely a position estimation result. The value network's input is the current state and the action output by the policy network, i.e., the collected audio signal and the position estimation result, and its output is an action value that evaluates the localization accuracy of that position estimate.
The policy network and the value network may be implemented based on neural networks such as fully connected neural networks (Fully Connected Neural Network, FCNN), convolutional neural networks (Convolutional Neural Networks, CNN), or recurrent neural networks (Recurrent Neural Network, RNN), which are not limited in this regard.
The position estimation process of a single agent is illustrated with this sound source localization model:
the method comprises the steps that a current intelligent agent collects audio data of a sound source in a sound source environment, a position estimation result obtained by the current intelligent agent in the previous round is obtained, the arrival time difference of the audio data relative to a microphone array of the current intelligent agent is calculated according to the position estimation result obtained by the current intelligent agent in the previous round and the current collected audio data, a strategy network selects actions according to the arrival time difference, the position estimation result is obtained, a reward value fed back by the sound source environment is obtained, and the reward value can be used for reflecting positioning accuracy of position estimation result evaluation. Meanwhile, the value network generates an action value according to the action selected by the strategy network, and the positioning accuracy of the position estimation result is evaluated through the action value, so that the strategy network optimally updates the network parameters according to the action value, and the value network optimally updates the network parameters according to the rewarding value.
For example, a temporal difference error (Temporal Difference Error, TD Error) is computed from the reward value, the value network's parameters are updated to reduce the TD error, and the policy network is updated with a policy gradient method based on the updated value network.
The reward value drives the learning of the sound source localization model and is calculated with a reward function.
The above process is iterated until a preset condition is reached, at which point the current agent's localization strategy is considered to meet the performance requirements. The preset condition may be that the number of iterations reaches a preset count or that the improvement of the localization model becomes small; the application is not limited in this respect.
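A minimal sketch of one such update step (PyTorch; the network sizes, learning rates, deterministic actor, and scalar-tensor reward are assumptions, as the application does not fix the network architecture):

import torch
import torch.nn as nn

state_dim, action_dim = 8, 2   # e.g., TDOA features in, (x, y) estimate out
actor = nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU(),
                      nn.Linear(32, action_dim))
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 32), nn.ReLU(),
                       nn.Linear(32, 1))
opt_actor = torch.optim.Adam(actor.parameters(), lr=1e-3)
opt_critic = torch.optim.Adam(critic.parameters(), lr=1e-3)

def update_step(state, next_state, reward, gamma=0.99):
    # Critic: minimize the squared TD error between Q(s, a) and the TD target.
    action = actor(state).detach()
    q = critic(torch.cat([state, action]))
    with torch.no_grad():
        next_q = critic(torch.cat([next_state, actor(next_state)]))
        target = reward + gamma * next_q
    td_error = target - q
    opt_critic.zero_grad()
    (td_error ** 2).mean().backward()
    opt_critic.step()
    # Actor: ascend the critic's valuation of the actor's own action.
    opt_actor.zero_grad()
    actor_loss = -critic(torch.cat([state, actor(state)])).mean()
    actor_loss.backward()
    opt_actor.step()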
Furthermore, the parameters shared by other agents can be used to collaboratively optimize and update the sound source localization model:
each agent is deployed with the sound source localization model shown in fig. 3, and the shared parameters shared by other agents include the arrival time difference of the sound source that it has calculated most recently.
The current agent obtains and parses the shared parameters to recover the TDOAs shared by the other agents, integrates them with its own TDOA, and takes the integration result as the current state representation. The localization model selects an action from this state representation, the environment feeds back a reward value via the reward function, and the model is updated with that reward value.
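A minimal sketch of building that state representation (Python with NumPy; the fixed peer count and zero-padding are assumptions to keep the state a constant size):

import numpy as np

def build_state(own_tdoas, peer_tdoas, n_peers):
    # Concatenate the agent's own TDOA features with the TDOAs shared by
    # peers, zero-padding if some peers have not reported this round.
    peers = (list(peer_tdoas) + [0.0] * n_peers)[:n_peers]
    return np.concatenate([np.asarray(own_tdoas, dtype=float),
                           np.asarray(peers, dtype=float)])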
The Actor-Critic reinforcement learning architecture handles continuous state and action spaces well, matching the characteristics of the sound source localization problem. By learning and optimizing its localization strategy through interaction with the environment, an agent strengthens the system's self-learning capability and improves its adaptability to complex, dynamic sound source environments.
In other embodiments, the parameters shared by other agents may include further items, such as the latest action output by an agent and the reward value corresponding to that action. The localization model can be optimized with these parameters, and other components of the localization strategy can be updated as well, for example the reward function used to compute reward values; the application is not limited in this respect.
For example, feedback from other agents may be used to adjust the reward function: if multiple agents all produce large position estimation errors at a certain location in the environment, position estimates at that location can be given a large negative reward, promoting updates to the localization strategy.
As the above embodiments show, the construction of the reward function affects how well the sound source localization model is optimized and updated; its construction is illustrated below.
In some embodiments, a sound source localization influencing factor of a current sound source environment is determined; constructing a reward function based on the sound source localization influencing factors; and calculating a reward value corresponding to the position estimation result by using the reward function.
The sound source positioning influencing factors refer to factors influencing the accuracy of the position estimation result in the sound source environment.
For example, sound source localization influencing factors include, but are not limited to, position estimation errors, correlations between individual sound sources, sound volumes of the sound sources, sound quality of the sound sources, change speeds of the sound sources, ambient noise, and the like.
Specifically, the environment type of the sound source environment and the service type of the sound source localization service are obtained, and the sound source localization influencing factors corresponding to the current environment are queried based on the environment type and service type.
For example, the environment type to which the sound source environment belongs may be classified into indoor, outdoor, stage, street, and the like.
The sound source localization service is the specific service to which the position estimation result is applied, for example locating singers in a concert scene, locating living beings in a search and rescue scene, or locating the presenter in a conference room.
Depending on the environment type and service type, the influencing factors to consider when computing the reward value for the current round of position estimation can be determined flexibly. This lets an agent construct reward functions flexibly in complex environments, optimize its localization strategy effectively, and adapt better to complex, dynamic sound source environments.
Next, the multi-agent cooperative sound source localization method of the present application will be described by taking an example in which the sound source localization influencing factor includes a position estimation error.
For example, referring to fig. 4, which is a schematic diagram illustrating the calculation of a reward value according to an exemplary embodiment of the present application: assume the sound source environment is a 10 x 10 two-dimensional area containing one movable sound source and two agents, A and B.
The application scenario illustrated in fig. 4 is described below:
in some embodiments, the step of calculating the prize value may comprise: calculating a position estimation error between the position estimation result and the actual sound source position; and acquiring a position error threshold value, and acquiring a reward value corresponding to the position estimation result based on the difference value between the position estimation error and the position error threshold value.
For each round of position estimation, the positions of the sound source, agent A, and agent B are set randomly; for example, in the current round the sound source is at (5, 7), agent A at (2, 3), and agent B at (8, 9).
First, the current state of each agent is defined; it consists of the agent's current observation (the collected audio data). Alternatively, agents A and B each compute the time difference of arrival of the sound source's audio data and use it as their current state.
Then an action is determined from the current state: agents A and B each estimate the sound source position from their TDOA, obtaining a position estimation result (the action); for example, agent A's estimate is (4, 6) and agent B's is (6, 8).
Next, the position estimation error is computed for each action, i.e., the distance between the position estimation result and the actual sound source position: agent A's error is sqrt((5-4)^2 + (7-6)^2) = sqrt(2), and agent B's error is sqrt((5-6)^2 + (7-8)^2) = sqrt(2).
Finally, the position estimation error is converted into a reward value based on its difference from the position error threshold. For example, a positive error threshold and a negative error threshold are obtained (they may be equal or different). If the position estimation error is below the positive error threshold, the agent receives a positive reward, and the smaller the error, the larger the reward; conversely, if the error exceeds the negative error threshold, the agent receives a negative reward, and the larger the error, the larger the magnitude of that negative reward. Each branch is further anchored to the position error threshold, and the reward for the position estimation result is obtained from the difference between the position estimation error and that threshold.
For example, assume the positive error threshold is 1.5; both agent A and agent B then fall on the positive-reward branch, because their position estimation errors are below it. With the position error threshold set to 1, the reward value on this branch is 1 minus the position estimation error, so the reward value of both agent A and agent B is 1 - sqrt(2).
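A minimal sketch reproducing this numerical example (Python; the branch structure follows the description above, and, as in the text, the positive branch can still yield a value below zero once the error exceeds the position error threshold):

import math

def reward_from_error(error, pos_error_threshold=1.0,
                      positive_threshold=1.5, negative_threshold=1.5):
    if error < positive_threshold:
        # Positive-reward branch: smaller errors earn larger rewards.
        return pos_error_threshold - error
    # Negative-reward branch: larger errors earn larger penalties.
    return -(error - negative_threshold)

error_a = math.dist((5, 7), (4, 6))        # agent A: sqrt(2)
error_b = math.dist((5, 7), (6, 8))        # agent B: sqrt(2)
print(reward_from_error(error_a))          # 1 - sqrt(2), approx. -0.414
print(reward_from_error(error_b))          # 1 - sqrt(2), approx. -0.414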
In other embodiments, the agents can move. Agents A and B obtain the position estimation result from the TDOA of the audio data, determine movement parameters from it (such as direction and distance), and execute the move. Moving changes the relative position between agent and sound source, so the TDOA computed in the next round changes as well. After a move, the environment gives the agent a reward value that reflects the quality of that move: if the reward is based on the position estimation error, an agent that ends up closer to the sound source may receive a positive reward, while one that moves farther away may receive a negative reward.
Note that the above reward calculation is only illustrative. In practice, the reward function can be constructed flexibly according to the environment type of the sound source environment or the service type of the localization service. Several example reward functions follow:
example 1: the sound source environment is a concert venue, the sound source environment contains a plurality of sound sources, and the rewarding function can consider the position estimation error of each sound source, the importance (or priority) of each sound source, the relevance among the sound sources and the like. For example, each sound source is assigned a weight value according to the importance of the sound source, and the larger the weight value is, the more important the sound source is. And, the reward function contains a correlation capture parameter, the correlation capture parameter is used for reflecting whether the intelligent agent can accurately capture the correlation between each sound source, if the intelligent agent can locate all the sound sources at the same time, the value of the correlation capture parameter is increased; otherwise, if the agent locates only a portion of the sound sources, the value of the correlation capture parameter is reduced.
A reward function built from the position estimation errors, the importance (or priority) of each source, and the correlations among the sources may be defined as:
r=w11*e1-w12*e2+w13*cogn
where e1 and e2 are the average and maximum position estimation errors over the sound sources, w11 and w12 are the weights of the two error terms, cogn is the correlation-capture parameter, w13 is its weight, and r is the calculated reward value.
Iteratively optimizing the localization strategy with this reward function effectively guides the agent to localize all the sound sources simultaneously.
Example 2: The sound source environment is a concert scene with multiple sound sources. The reward function may consider the volume of each source, its sound quality, audience feedback, and ambient noise. For example, a louder source attracts more attention from listeners, so it may be given a larger weight; e.g., each source's volume weight can be set to the ratio of its volume to the total volume. A source with good sound quality gives the audience a better listening experience, so it can be weighted more heavily; e.g., the quality weight can be set by the source's spectral characteristics, with richer spectra weighted higher. Audience feedback is an important basis for judging whether a source is captured and reproduced accurately, so weights can be adjusted according to it, with well-received sources weighted higher. Ambient noise affects capture and reproduction, so weights can be adjusted accordingly; e.g., the weight of a source located where ambient noise is strong is reduced.
A reward function built from the volume, sound quality, audience feedback, and ambient noise may be defined as:
r=w21*v+w22*q+w23*f-w24*n
where v denotes the volume of the sound source, q its sound quality, f the audience feedback, and n the ambient noise; w21, w22, w23, and w24 are the respective weights of these terms, and r is the calculated reward value.
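A minimal sketch of this reward (Python; the weight values and the assumption that every input is normalized to [0, 1] are illustrative):

def concert_reward(v, q, f, n, w21=0.3, w22=0.3, w23=0.3, w24=0.1):
    # r = w21*v + w22*q + w23*f - w24*n, with all inputs normalized to [0, 1].
    return w21 * v + w22 * q + w23 * f - w24 * n

print(concert_reward(v=0.8, q=0.7, f=0.9, n=0.2))   # 0.7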
This reward function guides the agent to pay more attention to sources with good volume and sound quality and favorable audience feedback, improving localization in complex scenes.
Example 3: The sound source environment is a search and rescue scene in which the agent must localize possible life signals. The reward function may consider the distance between the sound source and the agent, the source's signal strength, and the cost the agent incurs searching for it. For example, the closer the source, the larger the reward, and the farther, the smaller; the stronger the signal (the more pronounced the life sign), the larger the reward, and the weaker, the smaller. The search cost is what it takes to find the source, such as energy and time; to encourage the agent to find sources at minimal cost, the cost should contribute negatively to the reward.
A reward function built from the distance between source and agent, the signal strength, and the search cost may be defined as:
r=w31*(1/(d+epsilon))+w32*log(s+epsilon)-w33*sqrt(cost)
where d is the distance between the sound source and the agent, and the small positive number epsilon prevents division by zero when the distance is zero; s is the signal strength of the source, whose contribution is modeled with a logarithm because in real search and rescue tasks signal strength tends to follow a logarithmic scale, i.e., a small change in strength may indicate a large change in vital signs; cost is the agent's cost of searching for the source, and taking its square root makes the penalty grow sublinearly, so the reward remains sensitive to differences between small costs while damping differences between already-large costs. w31, w32, and w33 are the weights of the corresponding terms.
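A minimal sketch of this reward (Python; the weight values and epsilon are illustrative assumptions):

import math

def rescue_reward(d, s, cost, w31=1.0, w32=1.0, w33=0.5, epsilon=1e-6):
    # Distance term rewards proximity, the log term rewards signal strength,
    # and the square-root term penalizes search cost, as described above.
    return (w31 * (1.0 / (d + epsilon))
            + w32 * math.log(s + epsilon)
            - w33 * math.sqrt(cost))

print(rescue_reward(d=2.0, s=5.0, cost=4.0))   # approx. 0.5 + 1.609 - 1.0 = 1.109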
This reward function guides the agent to favor sources with strong signals, short distances, and low search costs, improving localization in complex scenes.
Example 4: The sound source environment is an open outdoor space, such as a park, containing multiple sound sources (people talking, music playing, traffic noise, etc.) whose positions and intensities may change over time. The reward function may consider the intensity of each source, its distance from the agent, and its speed of change. For example, for each sound source i, importance grows with its sound intensity Hi, which can be estimated from the amplitude of the signal received by the agent, so intensity is represented by a term proportional to Hi; closer sources matter more, represented by a term proportional to the reciprocal of the distance Di; and a source whose position or intensity changes very rapidly must be localized faster, represented by a term proportional to its speed of change Vi.
A reward function constructed from the intensity of the sound source, the distance between the sound source and the agent, the change speed of the sound source, and the like may be defined, for sound source i, as:
Ri = w41*Hi - w42/Di + w43*Vi
where Hi is the sound intensity of sound source i, Di is the distance between sound source i and the agent, Vi is the change speed of sound source i, w41, w42, w43 are the weights of the corresponding terms, and Ri is the reward value of sound source i.
The total reward value is r = ΣRi; that is, the reward values of all sound sources are summed to obtain the total reward for each round.
Through this reward function, the agent can be guided to pay more attention to sound sources that are loud, close, and rapidly changing, thereby improving sound source localization in complex scenes.
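A minimal sketch of the per-source summation of Example 4 follows, assuming arbitrary weights and a toy list of (Hi, Di, Vi) triples:

```python
def source_reward(H, D, V, w41=1.0, w42=1.0, w43=0.5):
    """Ri = w41*Hi - w42/Di + w43*Vi for one source (D assumed non-zero)."""
    return w41 * H - w42 / D + w43 * V

def total_reward(sources):
    """r = sum(Ri) over all sources; each source is an (H, D, V) triple."""
    return sum(source_reward(H, D, V) for H, D, V in sources)

# Hypothetical park scene: (intensity, distance, change speed) per source
sources = [
    (0.8, 5.0, 0.1),   # people talking
    (0.6, 12.0, 0.0),  # music playing
    (0.9, 3.0, 0.7),   # passing traffic
]
print(total_reward(sources))
```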
Example 5: The sound source environment is a city street containing multiple sound sources (e.g., vehicles, pedestrians, sound reflected by buildings). The reward function may consider the intensity of the sound source, its moving speed, its frequency characteristics, the position of the sound source relative to the agent's microphone array, and so on. For example, the intensity of a sound source directly affects how likely the agent is to capture it, so louder sources may be given higher weight; a fast-moving sound source is harder for the agent to locate accurately, so it may be given lower weight; different sound sources have different frequency characteristics, by which they can be identified; and the position of the sound source relative to the microphone array likewise affects how likely it is to be captured.
A reward function constructed from the intensity of the sound source, its moving speed, its frequency characteristics, the relative position of the sound source and the agent's microphone array, and the like may be defined as:
r = scale * log(1+I) * (1/(1+V)) * F * D
where scale is a constant adjusting the overall scale of the reward function, I is the intensity of the sound source, V is its moving speed, F is the output of a function evaluating the frequency characteristics of the sound source, D describes the positional relationship between the sound source and the agent's microphone array, and r is the calculated reward value.
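A sketch of this multiplicative form, assuming the F and D scores are produced elsewhere and passed in as plain numbers, with arbitrary example values:

```python
import math

def street_reward(I, V, F, D, scale=1.0):
    """r = scale * log(1+I) * (1/(1+V)) * F * D (Example 5).

    I -- intensity of the sound source
    V -- moving speed of the source (fast movers are down-weighted)
    F -- score from some frequency-characteristic evaluator (caller-supplied)
    D -- score for the source/microphone-array geometry (caller-supplied)
    """
    return scale * math.log(1.0 + I) * (1.0 / (1.0 + V)) * F * D

# A loud, slow-moving source with favourable geometry
print(street_reward(I=4.0, V=0.5, F=0.9, D=0.8))
```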
Example 6: The sound source environment is an indoor conference scene in which several speakers (sound sources) communicate with one another. The reward function may consider the speech intelligibility of each sound source, the handling of overlapping speech, background noise, and so on. For example, the speech intelligibility of each speaker reflects how well the agent captures that speaker's speech; letting di denote the speech intelligibility of the i-th speaker, a function f(d) measures it, for example the logarithmic function f(d) = log(d). In an indoor conference, several speakers often talk at the same time, so the agent must be able to handle overlapping speech; letting oi denote the proportion of the i-th speaker's speech that overlaps with other speakers' speech, a function g(o) measures the handling of overlap, for example the exponential function g(o) = exp(-o). Indoors, background noise degrades speech capture; letting ni denote the degree of background-noise impact on the i-th speaker's speech, a function h(n) measures it, which may be the linear function h(n) = -n.
The reward function aggregated over all sound sources can then be defined as:
r = ∑[w51*f(di) + w52*g(oi) + w53*h(ni)]
where w51, w52, and w53 are the weights of the corresponding terms and r is the calculated reward value.
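The per-speaker composition of Example 6 could be sketched as follows, using the example choices f(d) = log(d), g(o) = exp(-o), and h(n) = -n from the text; the weights and input values are arbitrary assumptions:

```python
import math

def speaker_term(d, o, n, w51=1.0, w52=1.0, w53=1.0):
    """Per-speaker term of Example 6.

    d -- speech intelligibility of the speaker (must be positive for log)
    o -- proportion of this speaker's speech overlapping other speakers
    n -- degree of background-noise impact on this speaker's speech
    """
    f = math.log(d)      # intelligibility on a log scale
    g = math.exp(-o)     # overlap handling decays exponentially with overlap
    h = -n               # background noise enters as a linear penalty
    return w51 * f + w52 * g + w53 * h

def conference_reward(speakers):
    """r = sum of the weighted terms; speakers are (d, o, n) triples."""
    return sum(speaker_term(d, o, n) for d, o, n in speakers)

print(conference_reward([(2.0, 0.3, 0.1), (1.5, 0.6, 0.2)]))
```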
Example 7: The sound source environment is an indoor conference scene in which several speakers (sound sources) communicate with one another. The reward function may consider each speaker's speaking time, the importance of the speech content, the speaker's position, and so on. For example, for any speaker, the reward accumulates from the moment the speaker starts speaking and stops accumulating when the speaker finishes; important speech content can be given higher weight, e.g., by running speech recognition on the collected audio, extracting keywords from the recognition result, and determining the importance of the corresponding speech content from the extracted keywords; and speakers close to the microphone array may be given higher weight.
A reward function constructed from each speaker's speaking time, the importance of the speech content, the speaker's position, and the like may be defined as:
r = α*T_reward + β*C_reward + γ*P_reward
where T_reward = t_end - t_start, with t_end the time the speaker stops speaking and t_start the time the speaker starts; C_reward = Σ_i (w_i * k_i), where w_i is the weight of the i-th keyword and k_i is the number of times the i-th keyword appears in the speech content; P_reward = 1/d, where d is the distance of the speaker from the microphone array; α, β, γ are the weights of the corresponding terms; and r is the calculated reward value.
Through this reward function, the agent can be guided to accurately capture the speech of different speakers and to weight different speech content appropriately, enabling automated sound source localization in a conference.
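A sketch of Example 7's reward, assuming keyword extraction has already produced per-keyword counts and weights; the function and variable names here are hypothetical:

```python
def meeting_reward(t_start, t_end, keyword_counts, keyword_weights, d,
                   alpha=0.1, beta=1.0, gamma=0.5):
    """r = alpha*T_reward + beta*C_reward + gamma*P_reward (Example 7).

    t_start, t_end  -- speaking start and end times of the speaker
    keyword_counts  -- mapping: keyword -> occurrences in the speech content
    keyword_weights -- mapping: keyword -> importance weight
    d               -- distance of the speaker from the microphone array
    """
    t_reward = t_end - t_start
    c_reward = sum(keyword_weights[k] * c for k, c in keyword_counts.items())
    p_reward = 1.0 / d
    return alpha * t_reward + beta * c_reward + gamma * p_reward

counts = {"deadline": 2, "budget": 1}      # from a hypothetical keyword extractor
weights = {"deadline": 0.8, "budget": 0.5}
print(meeting_reward(0.0, 30.0, counts, weights, d=2.0))
```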
Example 8: The sound source environment contains multiple conversation groups. The reward function may consider the number of people in each group, the volume of the conversation, the importance of the conversation content, environmental noise, the agent's capability limits, and so on. For example, the number of people in a group can affect how important the group's sound is; in general, the more people, the more important the group may be. The louder the conversation, the more important it may be. The importance of the conversation content can be judged by analyzing it with speech recognition and natural language processing techniques. Environmental noise affects the clarity of the sound and thus the localization. Finally, the agent's processing capability is limited; for example, it may only be able to process a certain number of sounds at the same time.
A reward function constructed from the number of people in each conversation group, the volume of the conversation, the importance of the conversation content, the environmental noise, the agent's capability limits, and the like may be defined as:
r = a * f(M) * g(V) * h(I) * j(E) * k(C)
where a is a constant adjusting the overall reward scale; f(M) describes the effect of the number of people in each conversation group, e.g., f(M) = log(M+1), with M the number of people; g(V) describes the effect of the conversation volume, e.g., g(V) = V, with V the volume of the conversation; h(I) describes the effect of the importance of the conversation content, e.g., h(I) = I, with I that importance; j(E) describes the effect of environmental noise, e.g., j(E) = 1/(1+E), so that the louder the noise, the more the reward is attenuated; k(C) describes the effect of the agent's capability limit, e.g., k(C) = 1 - exp(-C), with C the load on the agent's capability, so that the effect gradually saturates once the number of sounds being processed exceeds a certain level; and r is the calculated reward value.
Through this reward function, the agent can be guided to locate the sounds of several conversation groups simultaneously while paying a different degree of attention to each group.
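Example 8's multiplicative reward with the example component functions from the text could be sketched as follows; the constant a, the default weights, and the inputs are arbitrary:

```python
import math

def group_reward(M, V, I, E, C, a=1.0):
    """r = a * f(M) * g(V) * h(I) * j(E) * k(C) for one conversation group."""
    f = math.log(M + 1)      # group size: gently increasing weight
    g = V                    # conversation volume, linear
    h = I                    # content importance, linear
    j = 1.0 / (1.0 + E)      # environmental noise attenuates the reward
    k = 1.0 - math.exp(-C)   # saturates as the agent's capability is loaded
    return a * f * g * h * j * k

# A mid-sized, fairly loud group discussing important content in light noise
print(group_reward(M=4, V=0.7, I=0.9, E=0.2, C=1.5))
```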
Example 9: The sound source environment is an indoor conference scene in which several speakers (sound sources) communicate with one another. The reward function may consider the voice activity of each sound source, the spatial separation between sound sources, noise, reverberation, and so on.
For example, the reward function may be defined as:
r = w61*A - w62*S - w63*R
where A is the voice activity, which can be computed from features such as the short-time energy or the short-time zero-crossing rate; the short-time energy is defined as E = (1/N) * Σ|s(n)|^2, with n running from 1 to N, N the window length, and s(n) the signal within the window. S is the spatial separation between sound sources, which can be computed, for example, with the generalized cross-correlation (Generalized Cross Correlation, GCC) method and is defined here as S = Σ(s1(n) - s2(n))^2, with n from 1 to N and s1(n), s2(n) the signals of two sound sources. R is the noise and reverberation, which can be computed from the short-time signal-to-noise ratio (SNR) or the reverberation time; the SNR is defined as SNR = 10 * log10(E_signal/E_noise), where E_signal is the short-time energy of the signal and E_noise the short-time energy of the noise. w61, w62, w63 are the weights of the corresponding terms, and r is the calculated reward value.
Through this reward function, the agent can be guided to lock onto the active talker accurately in a complex multi-talker scene, achieving effective sound source localization.
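A sketch of Example 9 using the definitions above (short-time energy, the squared-difference separation measure, and the SNR formula); the synthetic frames and weights are stand-ins, and the term signs follow the formula as given:

```python
import numpy as np

def short_time_energy(frame):
    """E = (1/N) * sum(|s(n)|^2) over a window of length N."""
    return np.mean(np.abs(frame) ** 2)

def spatial_separation(s1, s2):
    """S = sum((s1(n) - s2(n))^2), the separation measure given above."""
    return np.sum((s1 - s2) ** 2)

def snr_db(signal_frame, noise_frame):
    """SNR = 10 * log10(E_signal / E_noise)."""
    return 10.0 * np.log10(short_time_energy(signal_frame) /
                           short_time_energy(noise_frame))

def conference_reward(A, S, R, w61=1.0, w62=0.01, w63=0.05):
    """r = w61*A - w62*S - w63*R (Example 9)."""
    return w61 * A - w62 * S - w63 * R

rng = np.random.default_rng(0)
voice = rng.normal(0.0, 1.0, 512)            # stand-in for a speech frame
noise = rng.normal(0.0, 0.1, 512)            # stand-in for a noise frame
A = short_time_energy(voice)                 # voice activity via short-time energy
S = spatial_separation(voice, 0.5 * voice)   # two correlated "source" signals
R = snr_db(voice, noise)                     # noise/reverberation term via SNR
print(conference_reward(A, S, R))
```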
Example 10: The sound source environment is a stadium hosting a football match, containing, for example, the shouts of players S1, the referee's whistle S2, the sound of the ball being struck S3, and the cheering of spectators S4. Each sound source has its own importance and is also affected by distance and environmental noise. The importance of each sound source can be represented by a weight w, e.g., wi for the i-th sound source, and the reward function may consider the importance of each sound source, the distance of the sound source, the environmental noise, and so on.
For example, the reward function may be defined as:
r = ∑(wi*Si) + ∑(wi/di) - ∑(ni)
where Si denotes the i-th sound source, di is the distance from the i-th sound source to the microphone array, ni is the environmental noise affecting the i-th sound source, and r is the calculated reward value.
Through this reward function, the agent can be guided to accurately locate, and preferentially relay, the key dynamic sounds on the pitch, such as the players' shouts, the referee's whistle, and the sound of the ball being struck.
According to the multi-agent collaborative sound source localization method provided by the application, audio data of the sound sources in the current sound source environment are collected, and the sound source localization strategy is used to estimate the sound source position from the audio data, yielding a position estimation result. During position estimation, designated parameters related to the estimation process are extracted as shared parameters; these are sent to the other agents, and the shared parameters sent by the other agents are received in turn. The shared parameters from the other agents are then used to optimize and update the sound source localization strategy, so that the next round of position estimation uses the updated strategy. In this way parameters are shared among multiple agents, a single agent captures more information and its localization performance improves, and the accuracy of sound source localization is protected against the complexity and dynamic changes of the sound source environment.
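To make the round structure described above concrete, the following toy sketch runs one collaboration round; every class name, method body, and the bias-nudging update rule are illustrative assumptions, not the patented strategy or its optimization.

```python
class Agent:
    """Toy collaborating agent; every method body is an illustrative stand-in."""

    def __init__(self, name):
        self.name = name
        self.inbox = []    # shared parameters received from other agents
        self.bias = 0.0    # toy stand-in for the agent's localization strategy

    def estimate_position(self, audio):
        # Toy "strategy": average the samples, then apply the learned bias.
        return sum(audio) / len(audio) + self.bias

    def extract_shared_parameters(self, estimate):
        # Here the designated shared parameter is the position estimate itself.
        return {"agent": self.name, "estimate": estimate}

    def receive(self, shared):
        self.inbox.append(shared)

    def update_strategy(self):
        # Toy optimization: nudge the strategy toward the peers' mean estimate.
        if self.inbox:
            mean_peer = sum(p["estimate"] for p in self.inbox) / len(self.inbox)
            self.bias += 0.1 * (mean_peer - self.bias)
            self.inbox.clear()

def run_round(agents, audio):
    """One round: estimate, share, receive, then update every strategy."""
    estimates = {a.name: a.estimate_position(audio) for a in agents}
    for a in agents:
        shared = a.extract_shared_parameters(estimates[a.name])
        for peer in agents:
            if peer is not a:
                peer.receive(shared)
    for a in agents:
        a.update_strategy()
    return estimates

agents = [Agent("A1"), Agent("A2"), Agent("A3")]
print(run_round(agents, audio=[0.2, 0.4, 0.6]))
```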
FIG. 5 is a block diagram of a multi-agent collaborative sound source localization device shown in an exemplary embodiment of the present application; the device is deployed in any agent of an agent set, and each agent in the set corresponds to a sound source localization strategy. As shown in FIG. 5, the exemplary multi-agent collaborative sound source localization apparatus 500 includes: a position estimation module 510, a parameter extraction module 520, a parameter sharing and acquisition module 530, and a strategy optimization module 540. Specifically:
the position estimation module 510 is configured to collect audio data of a sound source in a current sound source environment, and perform sound source position estimation on the audio data by using a sound source localization strategy to obtain a position estimation result;
the parameter extraction module 520 is configured to extract specified parameters related to the sound source position estimation process in the sound source position estimation process, so as to obtain shared parameters;
the parameter sharing and obtaining module 530 is configured to send the shared parameter to other agents and receive the shared parameter sent by the other agents;
and the strategy optimization module 540 is configured to optimize and update the sound source localization strategy using the shared parameters sent by the other agents, so that the next round of sound source position estimation uses the optimized and updated strategy.
The above exemplary multi-agent collaborative sound source localization device enables parameter sharing among multiple agents, so that a single agent captures more information, improving its localization performance and preventing the complexity and dynamic changes of the sound source environment from degrading localization accuracy.
Referring to FIG. 6, FIG. 6 is a schematic structural diagram of an electronic device according to an embodiment of the application. The electronic device 600 comprises a memory 601 and a processor 602, the processor 602 being configured to execute program instructions stored in the memory 601 to implement the steps of any of the multi-agent collaborative sound source localization method embodiments described above. In one particular implementation scenario, the electronic device 600 may include, but is not limited to, mobile devices such as a notebook computer and a tablet computer; no limitation is imposed here.
In particular, the processor 602 is configured to control itself and the memory 601 to implement the steps of any of the multi-agent collaborative sound source localization method embodiments described above. The processor 602 may also be referred to as a central processing unit (Central Processing Unit, CPU). The processor 602 may be an integrated circuit chip with signal processing capabilities. The processor 602 may also be a general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. In addition, the processor 602 may be implemented jointly by multiple integrated circuit chips.
Referring to FIG. 7, FIG. 7 is a schematic structural diagram of an embodiment of a computer-readable storage medium according to the present application. The computer-readable storage medium 700 stores program instructions 710 executable by a processor, the program instructions 710 being for implementing the steps of any of the multi-agent collaborative sound source localization method embodiments described above.
In some embodiments, functions or modules included in an apparatus provided by the embodiments of the present disclosure may be used to perform a method described in the foregoing method embodiments, and specific implementations thereof may refer to descriptions of the foregoing method embodiments, which are not repeated herein for brevity.
The foregoing description of the various embodiments emphasizes the differences between them; for what is identical or similar, the embodiments may be referred to one another, and the details are not repeated here for brevity.
In the several embodiments provided by the present application, it should be understood that the disclosed method and apparatus may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative: the division into modules or units is merely a logical functional division, and in actual implementation there may be other manners of division; for example, units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the couplings or direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
In addition, each functional unit in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in hardware or as a software functional unit. If the integrated unit is implemented as a software functional unit and sold or used as a stand-alone product, it may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to execute all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, or an optical disk.

Claims (10)

1. The multi-agent collaborative sound source positioning method is characterized by being applied to any agent in an agent set, wherein each agent in the agent set corresponds to a sound source positioning strategy, and the method comprises the following steps:
collecting audio data of a sound source in a current sound source environment, and estimating the sound source position of the audio data by utilizing the sound source positioning strategy to obtain a position estimation result;
extracting designated parameters related to the sound source position estimation process in the sound source position estimation process to obtain shared parameters;
the sharing parameters are sent to other agents, and the sharing parameters sent by the other agents are received;
and optimizing and updating the sound source positioning strategy by utilizing the shared parameters sent by the other agents so as to perform the next round of sound source position estimation by utilizing the optimized and updated sound source positioning strategy.
2. The method of claim 1, wherein the shared parameter transmitted by the other agent comprises a location estimate obtained by each of the other agents, the method further comprising:
taking the position estimation results sent by the other agents as reference estimation results;
and fusing the position estimation result corresponding to the audio data with the reference estimation result for the same sound source to obtain a fused position result.
3. The method of claim 1, wherein optimizing the updating of the sound source localization strategy using the shared parameters transmitted by the other agents comprises:
determining the parameter types of the shared parameters sent by the other agents;
inquiring an optimization strategy matched with the parameter type;
and optimizing and updating the sound source positioning strategy by adopting the optimizing strategy.
4. A method according to any one of claims 1 to 3, further comprising:
calculating a reward value corresponding to the position estimation result by using a reward function;
and optimizing and updating the sound source positioning strategy according to the reward value.
5. The method of claim 4, wherein calculating the prize value for the position estimation result using the prize function comprises:
determining a sound source localization influencing factor of the sound source environment, wherein the sound source localization influencing factor refers to a factor influencing the accuracy of a position estimation result in the sound source environment;
constructing a reward function based on the sound source localization influencing factors;
and calculating a reward value corresponding to the position estimation result by using the reward function.
6. The method of claim 5, wherein the sound source localization influencing factor comprises a position estimation error; the calculating the reward value corresponding to the position estimation result by using the reward function includes:
calculating a position estimation error between the position estimation result and an actual sound source position;
and acquiring a position error threshold value, and acquiring a reward value corresponding to the position estimation result based on the difference value between the position estimation error and the position error threshold value.
7. The method of claim 5, wherein said determining a sound source localization influencing factor for said sound source environment comprises:
acquiring an environment type of the sound source environment and a service type of a sound source positioning service;
and inquiring to obtain a sound source positioning influence factor corresponding to the current sound source environment based on the environment type and the service type.
8. The method of claim 4, wherein the shared parameters transmitted by the other agents contain time differences of arrival; the optimizing and updating the sound source positioning strategy by using the shared parameters sent by the other agents comprises the following steps:
calculating the arrival time difference of the audio data relative to the current agent, and extracting the arrival time differences from the shared parameters sent by the other agents;
integrating the arrival time differences shared by the other agents with the arrival time difference obtained by the current agent, and taking the integration result as the current state representation;
calculating a position estimation result corresponding to the current state representation according to the sound source positioning strategy;
calculating a reward value corresponding to the position estimation result by using the reward function;
and optimizing and updating the sound source positioning strategy according to the reward value.
9. An electronic device comprising a memory and a processor for executing program instructions stored in the memory to implement the steps of the method according to any of claims 1-8.
10. A computer readable storage medium storing program instructions executable by a processor to perform the steps of the method according to any one of claims 1-8.

