CN116243244A - Sound source positioning method, device, equipment and computer readable storage medium - Google Patents

Sound source positioning method, device, equipment and computer readable storage medium

Info

Publication number
CN116243244A
Authority
CN
China
Prior art keywords
sound source
determining
human body
calibration
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211742577.0A
Other languages
Chinese (zh)
Inventor
段锦辉
察志富
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Yueer Innovation Technology Co ltd
Original Assignee
Shenzhen Yueer Innovation Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Yueer Innovation Technology Co ltd filed Critical Shenzhen Yueer Innovation Technology Co ltd
Priority to CN202211742577.0A priority Critical patent/CN116243244A/en
Publication of CN116243244A publication Critical patent/CN116243244A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G01MEASURING; TESTING
    • G01SRADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S5/00Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations
    • G01S5/18Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations using ultrasonic, sonic, or infrasonic waves
    • G01S5/22Position of source determined by co-ordinating a plurality of position lines defined by path-difference measurements
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Measurement Of Velocity Or Position Using Acoustic Or Ultrasonic Waves (AREA)

Abstract

The application discloses a sound source positioning method, a device, equipment and a computer readable storage medium, wherein the sound source positioning method comprises the following steps: acquiring scene images and audio signals in an environment, and detecting all human body images included in the scene images; determining the current position corresponding to each human body image and determining the sound source angle of the audio signal; and detecting target positions matched with the sound source angle in all the current positions, and outputting a human body image corresponding to the target positions. The accuracy of sound source localization is improved.

Description

Sound source positioning method, device, equipment and computer readable storage medium
Technical Field
The present disclosure relates to the field of positioning technologies, and in particular, to a method, an apparatus, a device, and a computer readable storage medium for positioning a sound source.
Background
Sound source localization is a technology for obtaining the sound source position through sound waves emitted by a sound source, and the sound source localization technology is beneficial to intelligent development of machines.
The traditional sound source localization technology utilizes a microphone array to receive sound source signals, and estimates the time delay difference of the received sound source signals among the microphones based on the correlation among the sound source signals received by the microphones; and then calculating the distance difference between the sound source and different microphones according to the time delay difference and the sound speed, establishing an equation set according to the distance difference and the distance between each microphone and the sound source, and solving to calculate the position coordinate of the sound source.
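As an illustration of the traditional pipeline described above, the following is a minimal sketch of the first two stages only: estimating the delay between two microphone signals from their cross-correlation, and converting that delay into a path-length difference. The function names, the speed of sound value, and the sample rate are assumptions made for this example and are not taken from the application.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s; assumed value for air at room temperature


def estimate_delay_samples(sig_a, sig_b):
    """Return the lag, in samples, by which sig_b trails sig_a."""
    corr = np.correlate(sig_b, sig_a, mode="full")
    # In "full" mode, index len(sig_a) - 1 corresponds to zero lag.
    return int(np.argmax(corr)) - (len(sig_a) - 1)


def path_difference_m(delay_samples, sample_rate_hz):
    """Convert a delay in samples into a distance difference in metres."""
    return delay_samples / sample_rate_hz * SPEED_OF_SOUND
```

With reverberation or other interference in the received signals, the correlation peak can shift, which is the source of the delay error discussed in the next paragraph.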
However, the signal received by a microphone often includes components other than the signal emitted by the sound source, for example reverberation. This causes a large error in the delay difference estimated from the correlation of the received signals, which in turn makes the sound source localization inaccurate.
Disclosure of Invention
The main objective of the present application is to provide a sound source localization method, a device and a computer readable storage medium, which aim to solve the technical problem of how to improve the sound source localization accuracy.
To achieve the above object, the present application provides a sound source localization method, including the steps of:
acquiring scene images and audio signals in an environment, and detecting all human body images included in the scene images;
determining the current position corresponding to each human body image and determining the sound source angle of the audio signal;
and detecting target positions matched with the sound source angle in all the current positions, and outputting a human body image corresponding to the target positions.
Optionally, the step of detecting target positions matched with the sound source angle in all the current positions includes:
acquiring a calibration position corresponding to the sound source angle from a first preset calibration relation, wherein the first preset calibration relation comprises the correspondence between different sound source angles and different calibration positions in the scene image;
traversing all the current positions in sequence, and detecting whether the traversed current positions are matched with the calibration positions;
and taking the traversed current position as a target position after successful matching.
Optionally, the step of detecting whether the traversed current position matches the calibration position includes:
determining the distance from the traversed human body image corresponding to the current position to a preset camera;
determining a target position range according to the calibration position and the distance, and detecting whether the traversed current position is in the target position range;
and after the traversed current position is within the target position range, determining that the traversed current position is successfully matched with the calibration position.
Optionally, the step of detecting whether the traversed current position is within the target position range includes:
determining a first horizontal coordinate of the traversed current position in a preset coordinate system, and detecting whether the coordinate value of the first horizontal coordinate is in the target position range;
and if the coordinate value of the first horizontal coordinate is in the target position range, determining that the traversed current position is in the target position range.
Optionally, the step of determining the target position range according to the calibration position and the distance includes:
acquiring a position error value corresponding to the distance from a second preset calibration relation, wherein the second preset calibration relation comprises the corresponding relation between different distances from the human body image to a preset camera and different position error values;
and determining a second horizontal coordinate of the calibration position in a preset coordinate system, and determining a target position range according to the second horizontal coordinate and the position error value.
Optionally, the step of determining the target position range according to the second horizontal coordinate and the position error value includes:
taking the sum of the coordinate value of the second horizontal coordinate and the position error value as an upper limit position, and taking the difference between the coordinate value of the second horizontal coordinate and the position error value as a lower limit position;
and taking the range between the upper limit position and the lower limit position as a target position range.
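The range-based matching in the optional steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the table `ERROR_TABLE` stands in for the second preset calibration relation, its distances and error values are invented for the example, and the nearest-entry lookup is one possible choice that the application does not specify.

```python
# Hypothetical second calibration relation:
# (distance from human body image to camera in metres, position error in pixels).
# The numbers are placeholders, not values from the application.
ERROR_TABLE = [(1.0, 120), (2.0, 80), (4.0, 50), (8.0, 30)]


def position_error(distance_m):
    """Return the error value of the calibrated distance nearest to distance_m."""
    return min(ERROR_TABLE, key=lambda entry: abs(entry[0] - distance_m))[1]


def target_range(calib_x, distance_m):
    """Lower and upper limit positions: second horizontal coordinate -/+ error."""
    err = position_error(distance_m)
    return calib_x - err, calib_x + err


def in_target_range(current_x, calib_x, distance_m):
    """Check whether the first horizontal coordinate lies in the target range."""
    lower, upper = target_range(calib_x, distance_m)
    return lower <= current_x <= upper
```

For example, with a calibration position at horizontal coordinate 960 and a person 2 m from the camera, the table above gives a range of 880 to 1040.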
Optionally, the step of determining the sound source angle of the audio signal includes:
collecting, from a plurality of microphones in a preset microphone array, time difference information of the audio signals, inputting the time difference information into a pre-trained sound source positioning model, and outputting a sound source angle.
In addition, to achieve the above object, the present application further provides a sound source positioning device including:
the acquisition module is used for acquiring scene images and audio signals in the environment and detecting all human body images included in the scene images;
the determining module is used for determining the current position corresponding to each human body image and determining the sound source angle of the audio signal;
and the output module is used for detecting target positions matched with the sound source angle in all the current positions and outputting a human body image corresponding to the target positions.
In addition, in order to achieve the above object, the present application further provides a sound source localization apparatus including: a memory, a processor, and a sound source localization program stored on the memory; the processor is configured to execute the sound source localization program to implement the steps of the above sound source localization method.
In addition, to achieve the above object, the present application also provides a computer-readable storage medium storing one or more programs, which are further executable by one or more processors for implementing the steps of the above sound source localization method.
According to the present application, scene images and audio signals are collected in the environment to be positioned, and human body detection is performed on the scene images to detect all human body images included in them. The current position of each human body image in the environment to be positioned is determined and combined with the sound source angle of the audio signal: the target position matched with the sound source angle is automatically found among all the current positions, and the human body image corresponding to the target position is output to a display device as the final positioning result, thereby realizing sound source localization. This avoids calculating the position coordinates of the sound source solely from the audio signals received by the microphone array, where interference signals contained in the received audio cause large errors in the calculated position and hence inaccurate localization, so the accuracy of sound source localization is improved. Because localization combines the audio signal with human body images, no face recognition is needed; only simple human body image detection is required, so the required computing power is low and the cost of using sound source localization is reduced.
Drawings
The implementation, functional features and advantages of the present application will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Fig. 1 is a schematic diagram of a terminal/device structure of a hardware running environment according to an embodiment of the present application;
FIG. 2 is a flow chart of a first embodiment of the sound source localization method of the present application;
FIG. 3 is an illustrative schematic of a positioning apparatus of the sound source positioning method of the present application;
FIG. 4 is an illustrative diagram of a sound source range map in the sound source localization method of the present application;
FIG. 5 is an illustrative diagram of a human body image and a sound source range map in the sound source localization method of the present application;
fig. 6 is a schematic view of a device module of the sound source positioning device of the present application.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
Referring to fig. 1, fig. 1 is a schematic structural diagram of a sound source positioning device in a hardware running environment according to an embodiment of the present application.
As shown in fig. 1, the sound source localization apparatus may include: a processor 1001, such as a central processing unit (CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used to enable connection and communication between these components. The user interface 1003 may include a display and an input unit such as a keyboard, and may optionally further include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wireless Fidelity (Wi-Fi) interface). The memory 1005 may be a high-speed random access memory (RAM) or a stable non-volatile memory (NVM), such as a disk memory. The memory 1005 may also optionally be a storage device separate from the processor 1001 described above.
It will be appreciated by those skilled in the art that the structure shown in fig. 1 is not limiting of the sound source localization device and may include more or fewer components than shown, or certain components in combination, or a different arrangement of components.
As shown in fig. 1, an operating system, a data storage module, a network communication module, a user interface module, and a sound source localization program may be included in the memory 1005 as one type of storage medium. In the sound source localization device shown in fig. 1, the network interface 1004 is mainly used for sound source localization with other devices; the user interface 1003 is mainly used for data interaction with a user; the processor 1001 and the memory 1005 in the sound source localization device of the present application may be disposed in the sound source localization device, and the sound source localization device calls the sound source localization program stored in the memory 1005 through the processor 1001 and executes the sound source localization method provided in the embodiment of the present application.
The present application proposes a sound source localization method, in a first embodiment of the sound source localization method of the present application, referring to fig. 2, the sound source localization method includes:
step S10, collecting scene images and audio signals in the environment, and detecting all human body images included in the scene images;
At present, one approach establishes a one-to-one mapping between array microphones and camera preset positions at the seats of an elliptical conference table. When the sound intensity at one of the array microphones is the strongest, the preset position corresponding to that microphone is looked up and the camera pan/tilt is driven to point at the microphone position, thereby realizing sound source localization. However, sound source localization achieved in this way is not flexible enough and can only serve fixed seats. For a different meeting room, the camera preset positions must be collected again and the correspondence between the preset positions and the microphones at each seat must be re-established, so the labor cost is high.
Another approach collects digital audio signals using multiple array microphones arranged vertically and horizontally, and applies a DOA (Direction Of Arrival) estimation algorithm to the digital signals to obtain the angle between the sound source position and the line of the array microphones. The audio algorithm is used in both the horizontal and vertical directions, and both directions are easily disturbed by noise sources. A noise reduction algorithm can be applied to the digital audio signals collected by the microphones, but noise reduction damages the sound source signal, so localization is inaccurate in both directions and the positioning accuracy is low.
A third approach obtains the position of the speaker by combining face recognition with voice recognition, using face recognition to correct inaccurate audio localization. However, the algorithms used are relatively complex. The face recognition algorithm must first perform face detection and then identify whose face it is. The speech recognition algorithm must not only calculate the angle of the sound source but also recognize the voiceprint. Both algorithms need powerful hardware; ordinary low-cost embedded hardware does not have enough computing power to support them, so the hardware cost is high. Face recognition and voice recognition also raise ethical and legal issues such as personal rights, and in a considerable number of scenarios such sound source localization devices are not suitable for use without the speaker's authorization. In addition, face detection in face recognition has poor robustness and low accuracy when picture quality is poor, so the positioning accuracy is low.
Therefore, to avoid the above defects and improve the accuracy of sound source localization, this embodiment detects the positions of human bodies in the image to correct the inaccuracy of localization based on the microphone array alone. Compared with face recognition, only human body images need to be detected and no face needs to be recognized, so the required computing power is lower and ordinary embedded hardware suffices; human body detection is also more robust than face detection and works at a greater distance, so it suits more scenes. Compared with voiceprint recognition, only the sound source angle needs to be detected and the voiceprint characteristics of the speaker need not be recognized, so the required computing power is again lower and ordinary embedded hardware suffices.
In this embodiment, a scene image and an audio signal in the environment to be positioned are collected first. Both can be collected by conventional technical means in the art: the scene image by a camera and the audio signal by a microphone array, where the microphone array consists of at least one microphone and the audio signal comprises a sound source signal and an interference signal. Because this embodiment corrects the positioning result of the microphone array using the positions of human bodies in the scene image, all human body images included in the scene image need to be detected with a human body detection algorithm. The human body detection algorithm may be any human body image detection algorithm commonly used by those skilled in the art, such as a detection algorithm based on operation or a detection algorithm based on machine learning.
In this embodiment, a positioning device including at least a microphone array and a camera is provided and placed in the environment to be positioned. The microphone array collects the audio signal and the sound source direction of the audio signal is determined, which gives a preliminary localization. The camera collects the scene image, and the human body images included in the scene image are detected. The positions of the human body images are then matched against the preliminary localization result of the microphone array to correct it and to determine the target user who emitted the sound source signal. For example, referring to fig. 3, the positioning device includes a camera C1 and a microphone array. The camera C1 has no pan/tilt, has a wide field of view (FOV) of 120 degrees, and can collect 4K images. The microphone array consists of two symmetrically arranged microphones m1 and m2 for sound pickup; the spacing between the two microphones may be chosen according to the actual positioning situation and is preferably 10 cm. The camera C1 collects a 4K panoramic image, which is reduced to obtain a D1 panoramic image, i.e., the scene image. Human body detection is performed on the D1 panoramic image using a human body detection algorithm, and each detected human body image is marked, for example as the D2 human body image.
Step S20, determining the current position corresponding to each human body image, and determining the sound source angle of the audio signal;
In this embodiment, after all the human body images included in the scene image have been determined, the position of each human body image in the scene image, i.e., the current position of each human body image, needs to be determined, because the human body images are used to correct the positioning result obtained from the audio signal alone. Optionally, each human body image may be treated as a rectangle, and the center of the rectangular frame of each human body image is taken as its current position. The current position of each human body image is also updated periodically to keep it accurate.
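The rectangle-center convention described above can be sketched in a few lines. The `Box` type and its field names are hypothetical and only serve this example; a real human body detector would supply its own bounding-box format.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Hypothetical bounding box of one detected human body image, in pixels."""
    left: float
    top: float
    right: float
    bottom: float


def current_position(box):
    """Return the center of the rectangular frame as (x, y)."""
    return ((box.left + box.right) / 2, (box.top + box.bottom) / 2)
```

A box spanning x from 100 to 300 and y from 50 to 450 has its current position at (200.0, 250.0).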
In addition, a DOA algorithm may be used to calculate the sound source angle of the audio signal to complete the preliminary localization. In one embodiment, the step of determining the sound source angle of the audio signal includes:
Step a, collecting, from a plurality of microphones in a preset microphone array, time difference information of the audio signals, inputting the time difference information into a pre-trained sound source positioning model, and outputting a sound source angle.
Illustratively, the time difference information of the audio signals is collected from the plurality of microphones in the preset microphone array, and the pre-trained sound source positioning model derives the sound source angle from it. The model is trained by measuring, for known sound source angles, the time differences with which the microphone array receives the audio signals, which yields a time difference matrix. After training, the time differences of an audio signal collected by the microphone array can be input into the model to determine the azimuth angle and pitch angle of the sound source in the microphone coordinate system, i.e., the sound source angle of the audio signal.
In another scenario, the microphone array first converts the audio signal into an electrical signal, which is denoised and amplified; the signals are collected synchronously over multiple channels, and the sound source angle of the audio signal is finally obtained in the host computer system based on the MUSIC algorithm.
In this embodiment, the time difference information of the audio signal is collected from the plurality of microphones in the preset microphone array, and the pre-trained sound source positioning model determines the sound source angle of the audio signal from the time difference information. This gives a preliminary localization of the sound source and provides the data basis for later correcting the preliminary result using the positions of the human body images. The positioning algorithm used inside the pre-trained sound source positioning model is not limited: the user can train the model for the actual application scene and select a DOA algorithm, the MUSIC algorithm, or another algorithm. The determination of the sound source angle is therefore not tied to a specific usage scene, which improves its robustness.
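Since the embodiment leaves the algorithm inside the sound source positioning model open, the sketch below uses a simple stand-in rather than the trained model: the textbook far-field relation for two microphones, sin(θ) = c·τ/d, where τ is the time difference, d the microphone spacing, and c the speed of sound. The function name and the speed of sound value are assumptions; the 10 cm default follows the preferred spacing mentioned earlier. It yields only an azimuth-like angle relative to the broadside of the microphone pair, not the azimuth and pitch pair described above.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value for air at room temperature


def angle_from_time_difference(tau_s, mic_spacing_m=0.10):
    """Far-field angle in degrees from the time difference between two microphones.

    0 degrees means the source is directly in front of the pair (broadside).
    """
    ratio = SPEED_OF_SOUND * tau_s / mic_spacing_m
    # Measurement noise can push the ratio slightly outside [-1, 1]; clamp it.
    ratio = max(-1.0, min(1.0, ratio))
    return math.degrees(math.asin(ratio))
```

A time difference of zero maps to 0 degrees, and the largest physically possible difference, d/c, maps to 90 degrees.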
And step S30, detecting target positions matched with the sound source angle in all the current positions, and outputting a human body image corresponding to the target positions.
In this embodiment, after the current position of each human body image in the scene image is determined, the current position that matches the sound source angle of the audio signal (i.e., the target position) needs to be found among all the current positions. The sound source is located jointly from the sound source angle of the audio signal and the human body images, much as a person uses both eyes and ears, which makes real-time sound source localization more accurate.
After the human body image corresponding to the target position has been determined by combining image and sound, that human body image is cropped from the scene image at a size suited to the human body and sent to a display module for display. The image shown on the display module then contains only the human body image of the person speaking and no other images, thereby realizing sound source localization.
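The cropping step can be illustrated with array slicing. This is a minimal sketch that assumes the scene image is a NumPy array of shape (height, width, channels) and that the bounding box is given as (left, top, right, bottom) in pixels; the optional `margin` parameter is a hypothetical addition for framing the person more loosely.

```python
import numpy as np


def crop_body(scene, box, margin=0):
    """Cut the human body image out of the scene image, clipped to its borders."""
    height, width = scene.shape[:2]
    left = max(0, box[0] - margin)
    top = max(0, box[1] - margin)
    right = min(width, box[2] + margin)
    bottom = min(height, box[3] + margin)
    # Rows are indexed by y and columns by x.
    return scene[top:bottom, left:right]
```

The returned array is a view on the scene image, so a caller that modifies the crop should copy it first.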
According to this embodiment, scene images and audio signals are collected in the environment to be positioned, and human body detection is performed on the scene images to detect all human body images included in them. The current position of each human body image in the environment to be positioned is determined and combined with the sound source angle of the audio signal: the target position matched with the sound source angle is automatically found among all the current positions, and the human body image corresponding to the target position is output to a display device as the final positioning result, thereby realizing sound source localization. This avoids calculating the position coordinates of the sound source solely from the audio signals received by the microphone array, where interference signals contained in the received audio cause large errors in the calculated position and hence inaccurate localization, so the accuracy of sound source localization is improved. Because localization combines the audio signal with human body images, neither face recognition nor recognition of the speaker's voiceprint characteristics is needed; only simple human body image detection is required, so the required computing power is low, the requirements on computer hardware are low, and the cost is reduced.
Further, based on the first embodiment of the present application, a second embodiment of the sound source positioning method of the present application is provided. This embodiment refines step S30 of the foregoing embodiment, namely detecting target positions matched with the sound source angle in all the current positions and outputting a human body image corresponding to the target positions, and includes:
Step b, acquiring a calibration position corresponding to the sound source angle from a first preset calibration relation, wherein the first preset calibration relation comprises the correspondence between different sound source angles and different calibration positions in the scene image;
in this embodiment, a positioning device including at least a camera and a microphone array is provided, and in an early stage of development of the positioning device, calibration needs to be performed once, and after calibration of the positioning device is completed, the positioning device is put into use for sound source positioning. The calibration comprises a first calibration relation calibration and a second calibration relation calibration, wherein the first calibration relation calibration is used for converting the sound source angle into a horizontal coordinate on a panoramic image (namely a scene image), namely a calibration position, and the second calibration relation calibration is used for determining an error of the sound source range, namely a position error value, through the distance from the human body image detected by a human body detection algorithm to the camera. The embodiment realizes sound source positioning based on the positioning equipment which finishes calibration.
In this embodiment, after the positioning device has completed the calibration of the first calibration relation and the second calibration relation, the horizontal coordinate in the scene image (i.e., the calibration position) corresponding to the sound source angle of the audio signal can be determined from the first calibration relation (i.e., the first preset calibration relation) stored in the positioning device. For example, the positioning device applies the DOA positioning algorithm to the audio signal picked up by the array microphones m1 and m2 to obtain the sound source angle a, and the angle a is converted into the horizontal coordinate Xa on the scene image according to the first preset calibration relation, so Xa is the calibration position of the sound source angle.
In another scenario, the first calibration relation may convert the sound source angle into a two-dimensional coordinate position in the scene image, in which case the subsequent position matching uses the two-dimensional coordinates, or it may convert the sound source angle into a three-dimensional coordinate position in the scene image. In short, the first calibration relation can be set by the user according to the actual situation, and which coordinate axis is used for position matching can also be chosen by the user according to the actual situation, which is not limited herein.
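One way to picture the first calibration relation is as a table of measured (sound source angle, horizontal coordinate) pairs that is interpolated at run time. The sketch below assumes exactly that; the table values are placeholders for a 1920-pixel-wide scene image and a 120-degree field of view, and linear interpolation is an assumption of this example rather than something the application prescribes.

```python
import numpy as np

# Hypothetical calibration table; angles must be in increasing order for np.interp.
CALIB_ANGLES_DEG = np.array([-60.0, -30.0, 0.0, 30.0, 60.0])
CALIB_X_PIXELS = np.array([0.0, 480.0, 960.0, 1440.0, 1920.0])


def calibration_position(angle_deg):
    """Map a sound source angle a to a horizontal coordinate Xa in the scene image."""
    # np.interp clamps angles outside the table to the first or last entry.
    return float(np.interp(angle_deg, CALIB_ANGLES_DEG, CALIB_X_PIXELS))
```

With this table, an angle of 15 degrees maps to the horizontal coordinate 1200.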
Step c, traversing all the current positions in sequence, and detecting whether the traversed current positions are matched with the calibration positions;
In this embodiment, the current position of each human body image and the calibration position of the sound source angle are each a single definite coordinate, so in most cases they do not coincide exactly. It is therefore necessary to determine which of the current positions matches the calibration position, that is, to match the current positions against the calibration position one by one. Optionally, each time a current position is selected from all the current positions (i.e., the traversed current position), whether it matches the calibration position is detected. If the matching succeeds, that current position is taken as the target position and the human body image corresponding to the target position is output to a display module, thereby achieving sound source positioning; if the matching fails, the traversal continues with the next current position.
Step d, taking the traversed current position as the target position after the matching succeeds.
In this embodiment, the sound source angle of the audio signal is converted into a calibration position in the scene image through the calibration relation set in the positioning device, and the current position of each human body image is then matched against the calibration position in sequence to determine the target position. In this way the calibration position of the sound source angle is corrected by the current position of a human body image, which improves the accuracy of sound source positioning. The conversion of the sound source angle into the calibration position and the matching against all current positions are completed automatically within the positioning device, so the required computing power is low, there are few restrictions on the scene, the range of usage scenarios is wide, and the sound source positioning is robust.
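The traversal in steps c and d can be sketched as a simple loop that stops at the first matching position. This is an illustrative outline only: the function name `find_target_position`, the `(x, y)` tuple representation, and the pluggable `is_match` predicate are assumptions, not part of the application.

```python
from typing import Callable, Optional, Sequence, Tuple

Position = Tuple[float, float]  # (x, y) of a detected human body image in the scene image


def find_target_position(
    current_positions: Sequence[Position],
    calibration_x: float,
    is_match: Callable[[Position, float], bool],
) -> Optional[Position]:
    """Return the first current position that matches the calibration position, or None."""
    for position in current_positions:         # step c: traverse the positions in sequence
        if is_match(position, calibration_x):  # step c: test the traversed position
            return position                    # step d: the matched position is the target
    return None                                # no detected human body matched


# Usage with a hypothetical fixed 20-pixel tolerance as the matching rule.
positions = [(100.0, 50.0), (640.0, 60.0)]
target = find_target_position(positions, 650.0, lambda p, xa: abs(p[0] - xa) <= 20.0)
```

Passing the matching rule as a parameter keeps the loop independent of how the target position range is computed in the later embodiments.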
In one embodiment, the step of detecting whether the current location traversed matches the calibration location includes:
Step e, determining the distance from the human body image corresponding to the traversed current position to a preset camera;
In this embodiment, all the current positions are matched against the calibration position in sequence. Because the human bodies corresponding to different current positions are at different distances from the camera in the positioning device, the sound source errors corresponding to the current positions also differ. Therefore, each time a current position is traversed, the distance in the environment to be positioned between the human body corresponding to that current position and the camera in the positioning device (i.e., the preset camera) is determined, and the sound source error is determined from that distance. Optionally, the distance from the human body to the preset camera in the actual environment (i.e., the environment to be positioned) can be determined from the area of the human body image; calculating the distance from the size of this area is simple and effective.
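The application states only that the distance is calculated from the area of the human body image, without giving a formula. One plausible sketch, assuming a pinhole camera model in which image area falls off with the square of distance, scales a single reference measurement. The function name `estimate_distance` and the reference values are hypothetical.

```python
import math


def estimate_distance(area_px, ref_area_px=40000.0, ref_distance_m=2.0):
    """Estimate camera-to-person distance from the pixel area of the human body image.

    Assumption (not from the application): under a pinhole model the image area
    scales as 1 / distance**2, so distance = ref_distance * sqrt(ref_area / area).
    The reference pair (40000 px^2 at 2 m) is an invented calibration measurement.
    """
    if area_px <= 0:
        raise ValueError("area must be positive")
    return ref_distance_m * math.sqrt(ref_area_px / area_px)
```

Under these assumed reference values, a human body image with one quarter of the reference area is estimated to be twice as far from the camera.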
Step f, determining a target position range according to the calibration position and the distance, and detecting whether the traversed current position is within the target position range;
Step g, when the traversed current position is within the target position range, determining that the traversed current position is successfully matched with the calibration position.
In this embodiment, each time a current position is traversed, it is matched once against the calibration position: the target position range of the calibration position is determined according to the distance, in the environment to be positioned, from the human body corresponding to the traversed current position to the preset camera, and whether the traversed current position is within the target position range is detected. If it is, the human body corresponding to that current position is the target user who emitted the sound source signal; if it is not, the matching continues. Different distances from the human body to the preset camera give different target position ranges. Matching is considered successful only when the traversed current position is within the target position range, and the successfully matched current position is taken as the target position, which ensures the validity and accuracy of the finally determined target position.
In one embodiment, the step of determining the target position range according to the calibration position and the distance includes:
Step h, acquiring a position error value corresponding to the distance from a second preset calibration relation, wherein the second preset calibration relation comprises the correspondence between different distances from the human body image to the preset camera and different position error values;
Step i, determining a second horizontal coordinate of the calibration position in a preset coordinate system, and determining the target position range according to the second horizontal coordinate and the position error value.
In this embodiment, the second calibration relation of the positioning device determines the error of the sound source range, i.e., the position error value, from the distance between the camera and the human body detected by the human body detection algorithm. After the distance in the actual scene from the human body corresponding to the traversed current position to the camera is determined, the error of the sound source range corresponding to that distance, i.e., the position error value, is obtained automatically from the second preset calibration relation in the positioning device. The calibration position is then converted into the preset coordinate system, its horizontal coordinate in that system (i.e., the second horizontal coordinate) is determined, and the target position range is determined according to the second horizontal coordinate and the position error value.
In this embodiment, the position error value corresponding to the distance is determined through the second preset calibration relation in the positioning device, the calibration position is converted into the preset coordinate system to determine its second horizontal coordinate, and the target position range is then determined from these two values.
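A second preset calibration relation of this kind could be stored as a banded lookup table from distance to position error value. The sketch below is one possible representation; the band edges, the error values, the direction in which the error changes with distance, and the name `position_error_for_distance` are all invented for illustration and are not taken from the application.

```python
from bisect import bisect_right

# Hypothetical band edges in meters and the position error value (in pixels) for each band.
# Bands: [0, 1) -> 120, [1, 2) -> 80, [2, 4) -> 50, 4 and beyond -> 30.
DISTANCE_EDGES_M = [1.0, 2.0, 4.0]
ERROR_VALUES_PX = [120.0, 80.0, 50.0, 30.0]  # one more entry than there are edges


def position_error_for_distance(distance_m):
    """Look up the position error value b for a camera-to-person distance."""
    if distance_m < 0:
        raise ValueError("distance must be non-negative")
    # bisect_right counts how many band edges are <= distance_m,
    # which is exactly the index of the band the distance falls into.
    return ERROR_VALUES_PX[bisect_right(DISTANCE_EDGES_M, distance_m)]
```

A table lookup keeps the run-time cost to a binary search, which is in line with the low computing power the embodiment aims for.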
In one embodiment, the step of determining the target position range according to the second horizontal coordinate and the position error value includes:
Step j, taking the sum of the coordinate value of the second horizontal coordinate and the position error value as an upper limit position, and taking the difference between the coordinate value of the second horizontal coordinate and the position error value as a lower limit position;
Step k, taking the range between the lower limit position and the upper limit position as the target position range.
In this embodiment, after the second horizontal coordinate of the calibration position in the preset coordinate system and the position error value corresponding to the traversed current position are determined, the sum of the coordinate value of the second horizontal coordinate and the position error value is taken as the upper limit position, and the difference between them is taken as the lower limit position, which determines the target position range. For example, referring to Fig. 4, consider a positioning device including a camera C1 and a microphone array composed of two symmetrical microphones m1 and m2 for sound pickup. A panoramic image D1 (i.e., the scene image) is acquired by camera C1. The sound source angle a is obtained by applying a DOA positioning algorithm to the audio data picked up by microphones m1 and m2, and angle a is converted into the coordinate Xa on panoramic image D1 through the first preset calibration relation. An error range b is obtained from the second preset calibration relation according to the distance, in the actual scene, from the human body corresponding to the traversed current position to camera C1. Finally, the sound source range image D3 within panoramic image D1 is obtained from the horizontal coordinate range [Xa-b, Xa+b].
In this embodiment, the target position range is obtained by taking the sum of the second horizontal coordinate and the position error value as the upper limit position and their difference as the lower limit position. The position error value is determined through the second preset calibration relation in the positioning device, and the second horizontal coordinate is the horizontal coordinate of the calibration position in the preset coordinate system. The target position range is thus derived along two chains: from the sound source angle of the audio signal, to the calibration position corresponding to that angle, to the second horizontal coordinate of the calibration position; and from the traversed current position, to the distance from the corresponding human body to the preset camera, to the position error value corresponding to that distance. This ensures the accuracy and validity of the target position range and provides a sound data basis for determining the target position.
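Steps j and k amount to building the interval [Xa-b, Xa+b]. A minimal sketch follows; the names `TargetRange` and `target_position_range` are hypothetical, and the rejection of negative error values is an added safeguard rather than something the application specifies.

```python
from typing import NamedTuple


class TargetRange(NamedTuple):
    lower: float  # lower limit position: second horizontal coordinate minus error value
    upper: float  # upper limit position: second horizontal coordinate plus error value


def target_position_range(second_horizontal_x, position_error):
    """Build the target position range [Xa - b, Xa + b] from Xa and the error value b."""
    if position_error < 0:
        raise ValueError("position error value must be non-negative")
    return TargetRange(
        lower=second_horizontal_x - position_error,
        upper=second_horizontal_x + position_error,
    )
```

For instance, with an assumed Xa of 640 and an assumed error value b of 50, the range spans 590 to 690 on the horizontal axis.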
In an embodiment, the step of detecting whether the traversed current position is within the target position range includes:
Step l, determining a first horizontal coordinate of the traversed current position in a preset coordinate system, and detecting whether the coordinate value of the first horizontal coordinate is within the target position range;
Step m, if the coordinate value of the first horizontal coordinate is within the target position range, determining that the traversed current position is within the target position range.
In this embodiment, the direction of the horizontal coordinate axis of the preset coordinate system is used as the matching direction, so only coordinate values in the horizontal direction are matched. The preset coordinate system may be any coordinate system set by the user. Preferably, in this embodiment, the lower left vertex of the scene image is taken as the origin, the horizontal direction as the X axis, and the vertical direction, along which gravitational acceleration acts, as the Y axis, whose coordinate is called the vertical coordinate. After the coordinate system is established, the traversed current position is converted into it and its horizontal coordinate along the X axis (i.e., the first horizontal coordinate) is determined. Whether the coordinate value of the first horizontal coordinate is within the target position range is then judged. If it is, the traversed current position is considered to be within the target position range; only the horizontal coordinate of the traversed current position is matched. If it is not, the traversed current position is considered to be outside the target position range, and another current position is traversed from all the current positions for the next round of matching.
In addition, to aid understanding of how the traversed current position is matched against the calibration position in this embodiment, refer to Fig. 5, which illustrates a positioning device including a camera C1 and a microphone array composed of two symmetrical microphones m1 and m2 for sound pickup. The panoramic image D1 (i.e., the scene image) is acquired through camera C1, and the coordinates (x0, y0) on panoramic image D1 of the human body image D2 corresponding to the traversed current position are obtained. The horizontal coordinate Xa of the calibration position in panoramic image D1 is obtained from microphones m1 and m2 through the DOA sound source positioning algorithm and the first preset calibration relation. The distance D between the human body at the traversed current position and the preset camera is calculated, and the sound source error range determined for distance D is b, so the sound source range image D3, i.e., the target position range, is [Xa-b, Xa+b]. Whether the traversed current position is successfully matched is determined by checking whether x0 lies within [Xa-b, Xa+b]. If x0 < Xa-b or x0 > Xa+b, the matching fails, meaning that human body D2 is not making the sound, and another current position is traversed from all the current positions for the next round of matching. If x0 >= Xa-b and x0 <= Xa+b, the matching succeeds; the traversed current position is then the target position, and the human body corresponding to it is the target user who is making the sound, that is, human body D2 is making the sound.
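The comparison in the Fig. 5 walkthrough can be illustrated with concrete numbers. All values below are invented for the example, including the labels and the per-person error values b; the function name `is_within_target_range` is likewise hypothetical. The check is inclusive at both ends, as in the conditions x0 >= Xa-b and x0 <= Xa+b above.

```python
def is_within_target_range(x0, xa, b):
    """Return True when Xa - b <= x0 <= Xa + b, i.e. the match succeeds."""
    return xa - b <= x0 <= xa + b


xa = 640.0  # assumed horizontal coordinate of the calibration position

# (label, first horizontal coordinate x0, error value b for that person's distance)
detections = [
    ("D2", 610.0, 50.0),     # range [590, 690]: inside, match succeeds
    ("edge", 590.0, 50.0),   # range [590, 690]: on the lower limit, match succeeds
    ("other", 700.0, 30.0),  # range [610, 670]: outside, match fails
]

results = {label: is_within_target_range(x0, xa, b) for label, x0, b in detections}
```

In the method itself the traversal stops at the first successful match; the example evaluates every detection only to show how each comparison resolves.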
In this embodiment, the first horizontal coordinate of the traversed current position in the preset coordinate system is matched against the target position range. If the coordinate value of the first horizontal coordinate is not within the target position range, another current position is traversed for matching; if it is within the range, the matching succeeds, the traversed current position is taken as the target position, and the human body image corresponding to the target position is output to a display device, completing the sound source positioning. Sound source positioning is thus accurate and effective, while the whole calculation and matching process is simple, requires little computing power, places low demands on hardware, is subject to few scene restrictions, applies to a wide range of usage scenarios, and gives robust results.
Once the matching succeeds, the accuracy of the determined target position is ensured. The whole matching process is completed automatically and only requires comparing coordinate values, so the required computing power is low and the matching is efficient, accurate, and fast.
In addition, the present application further provides a sound source positioning device, referring to the drawings, the sound source positioning device includes:
the acquisition module is used for acquiring scene images and audio signals in the environment and detecting all human body images included in the scene images;
the determining module is used for determining the current position corresponding to each human body image and determining the sound source angle of the audio signal;
and the output module is used for detecting target positions matched with the sound source angle in all the current positions and outputting a human body image corresponding to the target positions.
In addition, the present application also provides a sound source localization apparatus including: a memory, a processor, and a sound source localization program stored on the memory; the processor is configured to execute the sound source localization program to implement the steps of the embodiments of the sound source localization method described above.
The present application also provides a computer-readable storage medium storing one or more programs that are further executable by one or more processors for implementing the steps of the embodiments of the above-described sound source localization method.
The specific implementation manner of the storage medium is basically the same as that of each embodiment of the sound source positioning method, and is not repeated here.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The foregoing embodiment numbers of the present application are merely for describing, and do not represent advantages or disadvantages of the embodiments.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) as described above, including several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to perform the method described in the embodiments of the present application.
The foregoing description is only of the preferred embodiments of the present application, and is not intended to limit the scope of the claims, and all equivalent structures or equivalent processes using the descriptions and drawings of the present application, or direct or indirect application in other related technical fields are included in the scope of the claims of the present application.

Claims (10)

1. A sound source localization method, characterized in that the sound source localization method comprises:
acquiring scene images and audio signals in an environment, and detecting all human body images included in the scene images;
determining the current position corresponding to each human body image and determining the sound source angle of the audio signal;
and detecting target positions matched with the sound source angle in all the current positions, and outputting a human body image corresponding to the target positions.
2. The sound source localization method according to claim 1, wherein the step of detecting target positions which are matched with the sound source angles among all the current positions includes:
acquiring a calibration position corresponding to the sound source angle from a first preset calibration relation, wherein the first preset calibration relation comprises the correspondence between different sound source angles and different calibration positions in the scene image;
traversing all the current positions in sequence, and detecting whether the traversed current positions are matched with the calibration positions;
and taking the traversed current position as a target position after successful matching.
3. The sound source localization method according to claim 2, wherein the step of detecting whether the traversed current position is matched with the calibration position comprises:
determining the distance from the traversed human body image corresponding to the current position to a preset camera;
determining a target position range according to the calibration position and the distance, and detecting whether the traversed current position is in the target position range;
and after the traversed current position is within the target position range, determining that the traversed current position is successfully matched with the calibration position.
4. The sound source localization method according to claim 3, wherein the step of detecting whether the traversed current position is within the target position range comprises:
determining a first horizontal coordinate of the traversed current position in a preset coordinate system, and detecting whether the coordinate value of the first horizontal coordinate is in the target position range;
and if the coordinate value of the first horizontal coordinate is in the target position range, determining that the traversed current position is in the target position range.
5. The sound source localization method according to claim 3, wherein the step of determining a target position range according to the calibration position and the distance comprises:
acquiring a position error value corresponding to the distance from a second preset calibration relation, wherein the second preset calibration relation comprises the corresponding relation between different distances from the human body image to a preset camera and different position error values;
and determining a second horizontal coordinate of the calibration position in a preset coordinate system, and determining a target position range according to the second horizontal coordinate and the position error value.
6. The sound source localization method according to claim 5, wherein the step of determining a target position range according to the second horizontal coordinate and the position error value comprises:
taking the sum of the coordinate value of the second horizontal coordinate and the position error value as an upper limit position, and taking the difference between the coordinate value of the second horizontal coordinate and the position error value as a lower limit position;
and taking the range between the upper limit position and the lower limit position as a target position range.
7. The sound source localization method of claim 1, wherein the step of determining a sound source angle of the audio signal comprises:
collecting statistics on the time difference information with which a plurality of microphones in a preset microphone array acquire the audio signal, inputting the time difference information into a pre-trained sound source positioning model, and outputting the sound source angle.
8. A sound source localization device, the sound source localization device comprising:
the acquisition module is used for acquiring scene images and audio signals in the environment and detecting all human body images included in the scene images;
the determining module is used for determining the current position corresponding to each human body image and determining the sound source angle of the audio signal;
and the output module is used for detecting target positions matched with the sound source angle in all the current positions and outputting a human body image corresponding to the target positions.
9. A sound source localization device, the sound source localization device comprising: memory, a processor and a sound source localization program stored on the memory and executable on the processor, which when executed by the processor implements the steps of the sound source localization method according to any one of claims 1 to 7.
10. A computer-readable storage medium, on which a sound source localization program is stored, which when executed by a processor implements the steps of the sound source localization method according to any one of claims 1 to 7.
CN202211742577.0A 2022-12-29 2022-12-29 Sound source positioning method, device, equipment and computer readable storage medium Pending CN116243244A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211742577.0A CN116243244A (en) 2022-12-29 2022-12-29 Sound source positioning method, device, equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211742577.0A CN116243244A (en) 2022-12-29 2022-12-29 Sound source positioning method, device, equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN116243244A true CN116243244A (en) 2023-06-09

Family

ID=86625321

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211742577.0A Pending CN116243244A (en) 2022-12-29 2022-12-29 Sound source positioning method, device, equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN116243244A (en)

Similar Documents

Publication Publication Date Title
US11398235B2 (en) Methods, apparatuses, systems, devices, and computer-readable storage media for processing speech signals based on horizontal and pitch angles and distance of a sound source relative to a microphone array
CN107534725B (en) Voice signal processing method and device
US10424320B2 (en) Voice detection, apparatus, voice detection method, and non-transitory computer-readable storage medium
EP3546976B1 (en) Device control method, apparatus and system
US20240048932A1 (en) Personalized hrtfs via optical capture
Schillebeeckx et al. Biomimetic sonar: Binaural 3D localization using artificial bat pinnae
JP6467736B2 (en) Sound source position estimating apparatus, sound source position estimating method, and sound source position estimating program
US10582117B1 (en) Automatic camera control in a video conference system
CN111034222A (en) Sound collecting device, sound collecting method, and program
JP7194897B2 (en) Signal processing device and signal processing method
JP6977448B2 (en) Device control device, device control program, device control method, dialogue device, and communication system
CN108877787A (en) Audio recognition method, device, server and storage medium
CN112423191B (en) Video call device and audio gain method
CN107450882B (en) Method and device for adjusting sound loudness and storage medium
CN113064576A (en) Volume adjusting method and device, mobile equipment and storage medium
CN110188179B (en) Voice directional recognition interaction method, device, equipment and medium
CN111627456A (en) Noise elimination method, device, equipment and readable storage medium
KR20190016683A (en) Apparatus for automatic conference notetaking using mems microphone array
CN116243244A (en) Sound source positioning method, device, equipment and computer readable storage medium
JP2017108240A (en) Information processing apparatus and information processing method
CN112578338B (en) Sound source positioning method, device, equipment and storage medium
CN110730378A (en) Information processing method and system
US20230386165A1 (en) Information processing device, recording medium, and information processing method
Lindqvist et al. Real-time multiple audio beamforming system
CN118050682A (en) Sound source positioning method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination