WO2023010599A1 - Target trajectory calibration method based on video and audio, and computer device - Google Patents


Info

Publication number
WO2023010599A1
Authority
WO
WIPO (PCT)
Prior art keywords
sound sources
sound
distance
trajectory
microphone
Prior art date
Application number
PCT/CN2021/111895
Other languages
French (fr)
Chinese (zh)
Inventor
郑勇
张缤
戴志涛
Original Assignee
深圳市沃特沃德信息有限公司
Priority date
Filing date
Publication date
Application filed by 深圳市沃特沃德信息有限公司
Publication of WO2023010599A1


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00 Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/60 Control of cameras or camera modules
    • H04N23/66 Remote control of cameras or camera parts, e.g. by remote control devices
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01S RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S5/00 Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations
    • G01S5/18 Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations using ultrasonic, sonic, or infrasonic waves
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T17/05 Geographic models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/207 Analysis of motion for motion estimation over a hierarchy of resolutions
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/04 Segmentation; Word boundary detection
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00 Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/60 Control of cameras or camera modules
    • H04N23/695 Control of camera direction for changing a field of view, e.g. pan, tilt or based on tracking of objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30241 Trajectory

Definitions

  • The present application relates to the technical field of audio and video processing, and in particular to a video- and audio-based target trajectory calibration method and computer device.
  • For position calibration of a sound source contained in audio and video, existing methods require the sound source to appear within the visual range of the recording; that is, the camera device must capture the sound source. The user can then manually locate the sound source's position relative to the camera device in the captured footage, and the movement trajectory of the sound source is obtained through the camera device's shooting field of view. If the sound source lies outside the camera device's shooting field of view, then even if its sound is received, the user cannot determine the relative position between the sound source and the camera device, let alone determine the sound source's trajectory.
  • The main purpose of this application is to provide a video- and audio-based target trajectory calibration method and computer device, aiming to overcome the drawback that existing methods cannot determine a sound source's movement trajectory when the source is not within the camera device's shooting field of view.
  • The present application provides a target trajectory calibration method based on video and audio, where the video is collected by a camera device and the audio is collected by a microphone array composed of a plurality of sub-microphones deployed on the camera device. The target trajectory calibration method includes:
  • calculating the first distance between each of the sound sources and a reference sub-microphone;
  • obtaining the first relative positional relationship between each of the sound sources and the reference sub-microphone, where the reference sub-microphone is any one of the two first sub-microphones;
  • converting to obtain the second distance between each of the sound sources and the camera device;
  • constructing a second movement trajectory corresponding to each of the sound sources.
  • The present application also provides a computer device, including a memory and a processor, where a computer program is stored in the memory; when the processor executes the computer program, a video- and audio-based target trajectory calibration method is implemented, wherein:
  • the video is collected by a camera device
  • the audio is collected by a microphone array
  • the microphone array is composed of a plurality of sub-microphones
  • the microphone array is deployed on the camera device;
  • the video- and audio-based target trajectory calibration method includes:
  • calculating the first distance between each of the sound sources and a reference sub-microphone;
  • obtaining the first relative positional relationship between each of the sound sources and the reference sub-microphone, where the reference sub-microphone is any one of the two first sub-microphones;
  • converting to obtain the second distance between each of the sound sources and the camera device;
  • constructing a second movement trajectory corresponding to each of the sound sources.
  • The present application also provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, a video- and audio-based target trajectory calibration method is implemented, where the video is collected by a camera device, the audio is collected by a microphone array composed of a plurality of sub-microphones, and the microphone array is deployed on the camera device. The target trajectory calibration method includes the following steps:
  • calculating the first distance between each of the sound sources and a reference sub-microphone;
  • obtaining the first relative positional relationship between each of the sound sources and the reference sub-microphone, where the reference sub-microphone is any one of the two first sub-microphones;
  • converting to obtain the second distance between each of the sound sources and the camera device;
  • constructing a second movement trajectory corresponding to each of the sound sources.
  • A video- and audio-based target trajectory calibration method and computer device, wherein the video is collected by a camera device, the audio is collected by a microphone array composed of multiple sub-microphones, and the microphone array is deployed on the camera device.
  • the processing system collects video data through a camera device, and collects a plurality of audio data through a microphone array. Then perform VAD algorithm recognition on the sounds contained in each audio data respectively to obtain several sound sources contained in each audio data.
  • the processing system calculates the first distance between each sound source and the reference sub-microphone based on the difference in receiving time of the sound corresponding to the same sound source by the two first sub-microphones and the deployment position between the two first sub-microphones.
  • the reference sub-microphone is any one of the two first sub-microphones.
  • the processing system converts and obtains a second relative positional relationship between each sound source and the imaging device according to the deployment position of the reference sub-microphone on the imaging device and the first relative positional relationship corresponding to each sound source.
  • the processing system constructs the second movement trajectory corresponding to each sound source according to the acquisition time of the video data and each audio data, the first movement trajectory of the camera device, and each second relative positional relationship.
  • Since the camera device is equipped with the microphone array, the first relative positional relationship between each sound source and the reference sub-microphone can be calculated from the time differences with which each sub-microphone receives the sound of each source.
  • Even when a sound source is outside the shooting field of view, the deployment relationship between the reference sub-microphone and the camera device can be used to determine the positional relationship between the sound source and the camera device. The first trajectory of the camera device is then used as a position parameter to calibrate the second trajectory of each sound source.
  • Fig. 1 is a schematic diagram of the steps of the target trajectory marking method based on video and audio in an embodiment of the present application;
  • Fig. 2 is a schematic diagram of the distribution of the reference sub-microphone, the field of view center of the imaging device, and the sound source in an embodiment of the present application;
  • Fig. 3 is the overall structural block diagram of the target trajectory marking device based on video and audio in one embodiment of the present application;
  • Fig. 4 is a schematic block diagram of a computer device according to an embodiment of the present application.
  • An embodiment of the present application provides a video- and audio-based target trajectory calibration method, where the video is collected by a camera device and the audio is collected by a microphone array composed of a plurality of sub-microphones deployed on the camera device. The target trajectory calibration method includes:
  • S1 collecting video data by the camera device, and collecting a plurality of audio data by the microphone array;
  • a microphone array is deployed on the camera device, and the microphone array is composed of multiple sub-microphones.
  • the processing system collects video data through the camera device, and collects multiple audio data through the microphone array.
  • The processing system can be the local system of the camera device, which directly analyzes and processes the collected video data and audio data; alternatively, the collected data can be uploaded over a wireless connection (such as Wi-Fi or a 4G/5G network) to a cloud server, where the processing system analyzes and processes it.
  • The processing system performs VAD (Voice Activity Detection, also called voice endpoint detection) algorithm recognition on the sounds contained in each audio data stream, performs speech recognition on each sound, and obtains all sound sources contained in each audio data stream.
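  • The VAD step above can be sketched with a minimal energy-based detector. This is only an illustration of the idea: the patent does not specify an implementation, real detectors (such as WebRTC's) also use spectral features, and the function name and thresholds below are assumptions.

```python
import numpy as np

def simple_vad(signal, fs, frame_ms=20, energy_ratio=0.1):
    """Flag frames whose mean energy exceeds a fraction of the peak
    frame energy. A crude stand-in for the VAD algorithm; the 20 ms
    frame length and 0.1 energy ratio are illustrative choices."""
    frame_len = int(fs * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).mean(axis=1)          # per-frame mean energy
    return energy > energy_ratio * energy.max()  # True = voice activity
```

Frames flagged True would then be grouped into segments and passed to the source-identification stage.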
  • The processing system uses a TDOA (Time Difference of Arrival) positioning algorithm to calculate the first relative positional relationship between each sound source and the reference sub-microphone, based on the difference in the time at which the two first sub-microphones receive the sound of the same source and the deployment positions of the two first sub-microphones.
  • The first relative positional relationship includes a first distance and a first angle.
  • The first distance characterizes the straight-line distance between the sound source and the reference sub-microphone.
  • The first angle characterizes the angle of the line between the sound source and the reference sub-microphone relative to the horizontal plane.
  • The reference sub-microphone can be any one of the first sub-microphones; selecting a different first sub-microphone as the reference changes the corresponding first distance but not the first angle.
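  • The TDOA idea behind the first relative positional relationship can be sketched as follows. The cross-correlation delay estimate and the far-field bearing formula are standard techniques, but the function and parameter names are illustrative assumptions; the patent does not prescribe this implementation.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s in air at roughly 20 degrees C

def estimate_tdoa_and_bearing(sig_a, sig_b, fs, mic_spacing):
    """Estimate the arrival-time difference of one source between two
    first sub-microphones by cross-correlation, then derive the source
    bearing relative to the microphone pair (far-field assumption)."""
    corr = np.correlate(sig_a, sig_b, mode="full")
    lag = int(np.argmax(corr)) - (len(sig_b) - 1)  # positive: sig_a delayed
    tdoa = lag / fs                                # seconds
    path_diff = tdoa * SPEED_OF_SOUND              # extra metres travelled
    cos_bearing = np.clip(path_diff / mic_spacing, -1.0, 1.0)
    return tdoa, float(np.degrees(np.arccos(cos_bearing)))
```

With a second microphone pair (or a known source level), the bearing can be intersected into the first distance and first angle described above.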
  • The processing system calculates the complementary angle of the first angle, and obtains the straight-line distance between the reference sub-microphone and the center of the field of view of the camera device from the reference sub-microphone's deployment position on the camera device. It then computes the second distance between the camera device and the sound source from the complementary angle, the straight-line distance, and the first distance.
  • The processing system calculates the vertical distance between the reference sub-microphone and the sound source from the first angle and the first distance using the law of cosines. Then, from the vertical distance and the second distance, it again applies the law of cosines to obtain the second angle between the camera device and the sound source. Following this logic, the processing system calculates the second distance and second angle between each sound source and the camera device's field-of-view center, generating the second relative positional relationship corresponding to each sound source.
  • The processing system obtains the first motion track of the camera device from the GPS positioning module deployed on the camera device and, taking the first motion track as the position reference, constructs by calibration the second motion trajectory corresponding to each sound source according to the second relative positional relationship between each sound source and the camera device.
  • Since a microphone array is deployed on the camera device, the first relative positional relationship between each sound source and the reference sub-microphone can be calculated from the time differences with which each sub-microphone receives the sound of each source. Then, by means of the deployment relationship between the reference sub-microphone and the camera device, the second positional relationship of each sound source relative to the camera device is obtained through position conversion. Therefore, even if a sound source does not appear in the camera device's shooting field of view, as long as the microphone array can receive its sound, the positional relationship between the reference sub-microphone and the camera device can be used to determine the positional relationship between the sound source and the camera device. The acquisition times of the video data and each audio data stream are then taken as the time reference, and the first movement trajectory of the camera device is used as the position parameter, to calibrate the second movement trajectory of each sound source.
  • The first relative positional relationship includes a first distance and a first angle, and the step of converting to obtain the second relative positional relationship between each of the sound sources and the camera device according to the deployment position of the reference sub-microphone on the camera device and the first relative positional relationship corresponding to each of the sound sources includes:
  • S401 Calculate the complementary angle of the first angle;
  • S402 Call the straight-line distance between the reference sub-microphone and the camera device, and substitute the complementary angle of the first angle, the straight-line distance, and the first distance into the calculation formula a = √(b² + c² − 2bc·cos α) to obtain the second distance, where b is the first distance, c is the straight-line distance, α is the complementary angle of the first angle, and a is the second distance, representing the distance between the camera device and the sound source;
  • S403 Calculate the vertical distance between the reference sub-microphone and the sound source through the law of cosines according to the first angle and the first distance;
  • S404 Calculate the second angle between the camera device and the sound source from the second distance and the vertical distance using the law of cosines, where the vertical distance between the camera device and the sound source has the same value as the vertical distance between the reference sub-microphone and the sound source;
  • S405 According to the above rules, calculate the second distance and the second angle between each of the sound sources and the camera device, and generate each of the second relative positional relationships.
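  • The conversion steps above can be sketched as follows, using the law of cosines a = √(b² + c² − 2bc·cos α) together with the two right triangles described in the embodiment. The flat-geometry simplifications and all variable names are illustrative assumptions, not the patent's.

```python
import math

def second_relative_position(first_distance, first_angle_deg, ab_distance):
    """Convert the (first distance, first angle) measured at the reference
    sub-microphone B into the (second distance, second angle) relative to
    the field-of-view centre A of the camera device."""
    alpha = math.radians(90.0 - first_angle_deg)   # complementary angle, i.e. angle ABC
    # S402: law of cosines gives side AC, the second distance
    d2 = math.sqrt(first_distance**2 + ab_distance**2
                   - 2 * first_distance * ab_distance * math.cos(alpha))
    # S403: vertical distance BD in right triangle BCD
    bd = first_distance * math.cos(math.radians(first_angle_deg))
    # S404: second angle CAE in right triangle ACE, with AE equal to BD
    theta2 = math.degrees(math.acos(max(-1.0, min(1.0, bd / d2))))
    return d2, theta2
```

For example, a source 5 m from the reference sub-microphone at a 60 degree first angle, with the sub-microphone 3 m from the field-of-view centre, yields a second distance of about 2.83 m.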
  • the center of the field of view of the imaging device is point A
  • the reference sub-microphone is point B
  • the sound source is point C
  • a vertical line is drawn through the reference sub-microphone and the sound source respectively, intersecting at point D
  • the triangle BCD is a right triangle
  • ⁇ BDC is a right angle
  • ⁇ CBD is the first angle between the sound source and the reference sub-microphone
  • side BC is the first distance between the sound source and the reference sub-microphone.
  • ∠ABC is the complementary angle of ∠CBD (i.e., the complement of the first angle);
  • side AB is the straight-line distance between the reference sub-microphone and the field-of-view center of the camera device;
  • side AC is the second distance between the sound source and the field-of-view center of the camera device. Since the values of ∠ABC, side AB, and side BC are known, they are substituted into the calculation formula a = √(b² + c² − 2bc·cos α), from which the value of side AC can be calculated. Here b is the first distance (side BC), c is the straight-line distance (side AB), α is the complementary angle of the first angle (∠ABC), and a is the second distance (side AC), representing the distance between the camera device and the sound source in question.
  • the value of the side BD (that is, the vertical distance between the reference sub-microphone and the sound source) can be calculated through the formula of the law of cosines.
  • the triangle ACE is a right triangle
  • the length of the side AE is the same as that of the side BD
  • ∠CAE is the second included angle between the sound source and the camera device;
  • ∠CEA is a right angle.
  • the value of ⁇ CAE can be calculated by the formula of the law of cosines, and thus the first distance between the center of field of view of the camera equipment and the sound source can be obtained.
  • The processing system calculates the second distance and second angle between each sound source and the field-of-view center of the camera device according to the above rules, and generates the second relative positional relationship corresponding to each sound source from the second distance and the second angle.
  • the step of performing VAD algorithm recognition on the sounds contained in each of the audio data to obtain several sound sources includes:
  • S201 Perform VAD algorithm recognition on the sounds contained in each of the audio data streams, respectively, to obtain a number of human voice sound sources and other types of sound sources;
  • S202 Mark and number each of the human voice sound sources, and perform decibel value detection on each of the other types of sound sources: hide the first other-type sound sources whose decibel value is below the decibel threshold, and mark and number the second other-type sound sources whose decibel value is above the decibel threshold.
  • The processing system uses the VAD algorithm to perform speech recognition on the sounds contained in each audio data stream, thereby obtaining several sound sources and classifying each source as either a human voice source or another type of sound source (such as an animal sound source or an automobile sound source).
  • The processing system marks and numbers each human voice sound source to distinguish the sources from one another.
  • The processing system then screens the other types of sound sources to eliminate those of little practical value. Specifically, it recalls a preset decibel threshold and detects the decibel value of the sound emitted by each other-type sound source.
  • The processing system compares the decibel value of each other-type sound source against the decibel threshold, and hides or eliminates those whose decibel value is below the threshold (the first other-type sound sources).
  • The sounds corresponding to the first other-type sound sources are not processed further (for example, they receive no marker number and no corresponding motion trajectory is constructed).
  • The processing system marks and numbers the other-type sound sources whose decibel value is above the decibel threshold (the second other-type sound sources) to distinguish them from one another.
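  • The screening step can be sketched as follows. The RMS-to-decibel conversion and the 94 dB calibration offset are assumptions for illustration (a real device needs microphone calibration to report dB SPL), and the names and numbering scheme are likewise illustrative.

```python
import numpy as np

def screen_other_sources(other_sources, db_threshold=40.0):
    """Hide other-type sources below the decibel threshold; mark and
    number those above it. `other_sources` maps a source id to its
    waveform; the offset turning digital RMS into a decibel value
    is an assumed constant."""
    marked, hidden = {}, []
    number = 1
    for src_id, waveform in other_sources.items():
        rms = float(np.sqrt(np.mean(np.square(waveform))))
        db = 20.0 * np.log10(max(rms, 1e-12)) + 94.0  # assumed calibration
        if db < db_threshold:
            hidden.append(src_id)               # first other-type: hide
        else:
            marked[src_id] = f"other-{number}"  # second other-type: number
            number += 1
    return marked, hidden
```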
  • the step of marking and numbering the second other type of sound source whose decibel value is above the decibel threshold includes:
  • S2021 Input the sound corresponding to each of the second other types of sound sources into a pre-trained sound type recognition model for identification, and obtain the sound types corresponding to each of the second other types of sound sources;
  • a pre-trained sound type recognition model is built in the processing system.
  • The sound type recognition model is trained by deep learning on various types of sounds (such as cat meowing, dog barking, and the sound of a car driving); the deep-learning training method is the same as in the prior art and is not detailed here. After learning and training, the model can identify the corresponding type of each sound.
  • The processing system inputs the sounds corresponding to the second other-type sound sources into the pre-trained sound type recognition model for processing, and outputs the sound type of each (for example, the sound type of second other-type sound source A is cat meowing, and the sound type of second other-type sound source B is the sound of a driving car).
  • The processing system marks each second other-type sound source with its corresponding sound type as marking information, making it easy for the user to identify the source.
  • The step of constructing the second motion trajectory corresponding to each of the sound sources according to the acquisition times of the video data and each of the audio data, the first motion trajectory of the camera device, and each of the second relative positional relationships includes:
  • S501 Perform time synchronization based on the acquisition time of the video data and each of the audio data respectively, and locate the appearance time of each of the sound sources in the video data;
  • S502 Collect the first motion track of the camera device by GPS positioning and, using the first motion track as the position reference, construct the second motion track of each of the sound sources relative to the first motion track according to the appearance time and the second relative positional relationship corresponding to each of the sound sources.
  • The processing system takes the acquisition time of the video data as the benchmark and synchronizes the acquisition time of each audio data stream with it, thereby locating the appearance time of each sound source in the video data (the appearance time includes the start time, duration, and end time).
  • The camera device is equipped with a GPS positioning module, which the processing system uses to record the position of the camera device at each acquisition time while shooting the video data; the first motion trajectory is then formed from these positions.
  • Based on each second relative positional relationship, the second movement trajectory of each sound source relative to the first movement trajectory is constructed, thereby calibrating the movement trajectory of sound sources that are not within the camera device's shooting field of view.
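  • The calibration of steps S501 and S502 can be sketched in a flat local coordinate frame. A real implementation would project GPS fixes into local coordinates and account for the camera's heading; both are simplified away here, and the names and data shapes are illustrative.

```python
import math

def build_second_trajectory(first_trajectory, relative_positions):
    """Offset the camera's first trajectory by each source's second
    distance/angle at the matching (time-synchronized) timestamps.
    first_trajectory: {t: (x, y)} camera positions from GPS
    relative_positions: {t: (second_distance, second_angle_deg)}"""
    second_trajectory = {}
    for t, (dist, angle_deg) in relative_positions.items():
        if t not in first_trajectory:
            continue                      # no synchronized camera position
        cx, cy = first_trajectory[t]
        a = math.radians(angle_deg)
        second_trajectory[t] = (cx + dist * math.cos(a),
                                cy + dist * math.sin(a))
    return second_trajectory
```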
  • After the step of constructing the second motion trajectory corresponding to each of the sound sources according to the first motion trajectory of the camera device, the method further includes:
  • S7 Generate a trajectory distribution diagram according to the first movement trajectory, the corresponding information, and each of the second movement trajectories, and output the trajectory distribution diagram to a display interface.
  • After the processing system generates the second movement trajectories corresponding to each sound source from the first movement trajectory, in order to distinguish the second movement trajectories of the different sound sources, it draws each sound source's second movement trajectory with a line of a different color and records the correspondence between each color and each sound source to form the corresponding information.
  • For example, the track line of the second motion track of sound source A is red, and the track line of the second motion track of sound source B is yellow.
  • The processing system generates a trajectory distribution diagram from the first movement trajectory, the second movement trajectories, and the corresponding information (the corresponding information is recorded on the diagram as label information so that users can identify the sound source of each color), and outputs the diagram to the display interface, enabling the user to intuitively follow the changes of each second motion track.
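  • The color-to-source correspondence information can be sketched as a simple mapping. The palette and the cycling behaviour when sources outnumber colors are illustrative choices, not specified by the patent.

```python
def build_corresponding_information(source_ids):
    """Assign each sound source's second trajectory a distinct line
    colour and return the colour correspondence that is recorded as
    label information on the trajectory distribution diagram."""
    palette = ["red", "yellow", "blue", "green", "purple", "orange"]
    return {src: palette[i % len(palette)]
            for i, src in enumerate(source_ids)}
```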
  • the step of generating a trajectory distribution diagram according to the first movement trajectory, the corresponding information and each of the second movement trajectories, and outputting the trajectory distribution diagram to a display interface includes:
  • S701 Call a three-dimensional map, and mark the first movement track on the three-dimensional map;
  • The processing system retrieves the three-dimensional map of the camera device's shooting area (the map can be pre-stored in the processing system's database or downloaded from the network), and marks the first movement track on the three-dimensional map. Then, with the first motion trajectory as the position parameter, the second motion trajectory of each sound source is marked on the three-dimensional map according to its appearance time, and the color correspondence information and the appearance and end times of each second motion trajectory are added to the map, forming the trajectory distribution diagram as a whole. Finally, the processing system outputs the trajectory distribution diagram to the display interface, allowing the user to understand the changes of each second motion trajectory more clearly in three dimensions.
  • an embodiment of the present application also provides a target trajectory marking device based on video and audio, the video is collected by a camera device, the audio is collected by a microphone array, and the microphone array is composed of a plurality of sub-microphones , the microphone array is deployed on the camera equipment, and the target trajectory marking device includes:
  • a collection module 1 configured to collect video data through the camera device, and collect a plurality of audio data through the microphone array;
  • the recognition module 2 is used to perform VAD algorithm recognition on the sounds contained in each of the audio data respectively to obtain several sound sources;
  • a calculation module 3, configured to calculate the first relative positional relationship between each of the sound sources and a reference sub-microphone based on the difference in the receiving time of the sound corresponding to the same sound source by the two first sub-microphones and the deployment positions of the two first sub-microphones, where the reference sub-microphone is any one of the two first sub-microphones;
  • a conversion module 4, configured to convert to obtain the second relative positional relationship between each of the sound sources and the camera device according to the deployment position of the reference sub-microphone on the camera device and the first relative positional relationship corresponding to each of the sound sources;
  • a construction module 5, configured to construct the second motion trajectory corresponding to each of the sound sources according to the acquisition times of the video data and each of the audio data, the first motion trajectory of the camera device, and each of the second relative positional relationships.
  • the first relative positional relationship includes a first distance and a first angle
  • the conversion module 4 includes:
  • a first calculation unit configured to calculate a complementary angle of the first angle
  • a second calculation unit, configured to call the straight-line distance between the reference sub-microphone and the camera device, and substitute the complementary angle of the first angle, the straight-line distance, and the first distance into the calculation formula a = √(b² + c² − 2bc·cos α) to obtain the second distance, where b is the first distance, c is the straight-line distance, α is the complementary angle of the first angle, and a is the second distance, representing the distance between the camera device and the sound source;
  • a third calculation unit configured to calculate the vertical distance between the reference sub-microphone and the sound source through the cosine law formula according to the first angle and the first distance;
  • a fourth calculation unit configured to calculate a second angle between the imaging device and the sound source according to the second distance and the vertical distance through the law-of-cosines formula, wherein the vertical distance between the imaging device and the sound source is the same as the vertical distance between the reference sub-microphone and the sound source;
  • the generation unit is configured to calculate, according to the above rules, the second distance and the second angle corresponding to each of the sound sources and the imaging device, and to generate each of the second relative positional relationships.
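The conversion these units perform can be sketched numerically. The sketch below is ours, not the patent's code; all function names and numbers are illustrative assumptions, and where the patent attributes the vertical-distance step to the law of cosines, the equivalent right-triangle sine relation is used here for brevity.

```python
import math

def second_distance(b: float, c: float, beta_deg: float) -> float:
    """Law of cosines: a^2 = b^2 + c^2 - 2*b*c*cos(beta).

    b        -- first distance (reference sub-microphone to sound source)
    c        -- straight-line distance (reference sub-microphone to camera)
    beta_deg -- complement of the first angle, in degrees
    """
    beta = math.radians(beta_deg)
    return math.sqrt(b * b + c * c - 2.0 * b * c * math.cos(beta))

def second_angle(a: float, h: float) -> float:
    """Second angle (degrees) between the imaging device and the sound
    source, from the second distance a and the vertical distance h."""
    return math.degrees(math.asin(h / a))

# Illustrative numbers (not from the patent): first distance 5 m,
# mic-to-camera distance 0.2 m, first angle 30 deg (complement 60 deg).
a = second_distance(5.0, 0.2, 60.0)
h = 5.0 * math.sin(math.radians(30.0))  # vertical distance from the first angle
theta = second_angle(a, h)
```

Because the mic-to-camera distance c is small relative to the source distance, the second distance stays close to the first distance, which matches the intent of the conversion.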
  • the identification module 2 includes:
  • the recognition unit is used to perform VAD algorithm recognition on the sounds contained in each of the audio data respectively, to obtain a number of human voice sound sources and other types of sound sources;
  • the screening unit is used to mark and number each of the human voice sound sources, detect the decibel value of each of the other types of sound sources, hide each first other-type sound source whose decibel value is below the decibel threshold, and mark and number each second other-type sound source whose decibel value is above the decibel threshold.
  • the screening unit includes:
  • An identification subunit configured to input the sounds corresponding to each of the second other types of sound sources into a pre-trained sound type recognition model for identification, and obtain the sound types corresponding to each of the second other types of sound sources;
  • the marking subunit is configured to use the sound type as marking information to mark and number each of the second other-type sound sources.
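The screening logic can be sketched as follows. This is a minimal illustration with invented names and an arbitrary decibel reference and threshold; the patent itself only specifies that other-type sources below the stored decibel threshold are hidden and the rest are marked and numbered.

```python
import math

DECIBEL_THRESHOLD = 50.0   # illustrative; the patent stores the threshold in a database
DB_REFERENCE = 1e-5        # arbitrary amplitude reference for the dB scale

def rms_db(samples):
    """Approximate sound level in dB from a block of PCM samples."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20.0 * math.log10(max(rms, 1e-12) / DB_REFERENCE)

def screen_other_sources(sources):
    """sources: list of (source_id, samples) for the other-type sound sources.
    Hides sources below the threshold; marks and numbers the rest."""
    kept = [sid for sid, samples in sources if rms_db(samples) >= DECIBEL_THRESHOLD]
    return {sid: f"other-{i + 1}" for i, sid in enumerate(kept)}
```

A quiet source is simply omitted from the returned marking table, which is one straightforward reading of "hidden".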
  • the construction module 5 includes:
  • a positioning unit configured to perform time synchronization based on the acquisition time of the video data and each of the audio data, respectively, and locate the appearance time of each of the sound sources in the video data;
  • a construction unit configured to collect the first motion trajectory of the imaging device through GPS positioning and, using the first motion trajectory as a position reference, construct the second motion trajectory of each of the sound sources relative to the first motion trajectory according to the appearance time and the second relative positional relationship corresponding to each of the sound sources.
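The construction step can be sketched as follows, under two simplifying assumptions of ours: the GPS track has already been projected into a local planar frame, and each second relative positional relationship has been expressed as a planar offset from the camera. All names are illustrative.

```python
import bisect

def camera_position(track, t):
    """Linearly interpolate the camera's first motion trajectory.
    track: time-sorted list of (timestamp, x, y) in a local planar frame."""
    times = [p[0] for p in track]
    i = bisect.bisect_left(times, t)
    if i <= 0:
        return track[0][1], track[0][2]
    if i >= len(track):
        return track[-1][1], track[-1][2]
    (t0, x0, y0), (t1, x1, y1) = track[i - 1], track[i]
    w = (t - t0) / (t1 - t0)
    return x0 + w * (x1 - x0), y0 + w * (y1 - y0)

def source_trajectory(track, observations):
    """observations: list of (timestamp, dx, dy), the source's second relative
    position as an offset from the camera at each appearance time.
    Returns the source's second motion trajectory relative to the first."""
    result = []
    for t, dx, dy in observations:
        x, y = camera_position(track, t)
        result.append((t, x + dx, y + dy))
    return result
```

The time-synchronization step of the positioning unit corresponds here to looking up the camera position at each source's appearance time before applying the offset.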
  • the target trajectory calibration device also includes:
  • a recording module 6 configured to construct each of the second motion trajectories with lines of different colors, and record the corresponding relationship between each color and each of the sound sources to form corresponding information;
  • the generating module 7 is configured to generate a trajectory distribution diagram according to the first movement trajectory, the corresponding information and each of the second movement trajectories, and output the trajectory distribution diagram to a display interface.
  • the generating module 7 includes:
  • a marking unit configured to call a three-dimensional map, and mark the first movement track on the three-dimensional map
  • a forming unit configured to mark each of the second motion trajectories on the three-dimensional map using the first motion trajectory as a position reference, and to add, on the three-dimensional map, the corresponding information and the appearance time and end time of each of the second motion trajectories, to form the trajectory distribution diagram;
  • an output unit configured to output the trajectory distribution map to a display interface.
  • each module, unit, and subunit in the target trajectory marking device is used to perform corresponding steps in the above-mentioned video and audio-based target trajectory marking method, and its specific implementation process will not be described in detail here.
  • This embodiment provides a target trajectory marking device based on video and audio, wherein the video is collected by a camera device, the audio is collected by a microphone array, the microphone array is composed of multiple sub-microphones, and the microphone array is deployed on the camera device.
  • the processing system collects video data through a camera device, and collects a plurality of audio data through a microphone array. Then perform VAD algorithm recognition on the sounds contained in each audio data respectively to obtain several sound sources contained in each audio data.
  • the processing system calculates the first relative positional relationship between each sound source and the reference sub-microphone based on the difference between the times at which the two first sub-microphones receive the sound corresponding to the same sound source, and on the deployment positions of the two first sub-microphones.
  • the reference sub-microphone is any one of the two first sub-microphones.
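The patent does not spell out its TDOA variant. Under the common far-field assumption, the delay between the two first sub-microphones fixes the arrival angle of the sound, which can be sketched as follows (our illustration, not the patent's algorithm):

```python
import math

SPEED_OF_SOUND = 343.0  # m/s in air at roughly 20 degrees C

def arrival_angle(delta_t: float, mic_spacing: float) -> float:
    """Far-field TDOA bearing: angle (degrees) between the axis through the
    two first sub-microphones and the direction of the incoming sound,
    from the inter-microphone delay delta_t (seconds)."""
    x = (SPEED_OF_SOUND * delta_t) / mic_spacing
    x = max(-1.0, min(1.0, x))  # clamp against timing noise
    return math.degrees(math.acos(x))

# A source broadside to the pair arrives at both mics simultaneously:
broadside = arrival_angle(0.0, 0.1)        # about 90 degrees
on_axis = arrival_angle(0.1 / 343.0, 0.1)  # about 0 degrees
```

Recovering the first distance as well as the angle requires more than one microphone pair or a near-field model, which is why the patent uses the full array rather than a single pair.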
  • the processing system converts and obtains a second relative positional relationship between each sound source and the imaging device according to the deployment position of the reference sub-microphone on the imaging device and the first relative positional relationship corresponding to each sound source.
  • the processing system constructs the second movement trajectory corresponding to each sound source according to the acquisition time of the video data and each audio data, the first movement trajectory of the camera device, and each second relative positional relationship.
  • the camera device is equipped with a microphone array, and the first relative positional relationship between each sound source and the reference sub-microphone can be calculated from the time differences with which the sound from each sound source arrives at the individual sub-microphones.
  • the deployment position relationship between the reference sub-microphone and the camera device can then be used to determine the positional relationship between the sound source and the camera device. The first motion trajectory of the camera device is used as a position parameter to calibrate the second motion trajectory of each sound source.
  • an embodiment of the present application also provides a computer device, which may be a server, and its internal structure may be as shown in FIG. 4 .
  • the computer device includes a processor, a memory, a network interface and a database connected by a system bus, where the processor of the computer device is used to provide computing and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system, computer programs and databases.
  • the internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage medium.
  • the database of the computer equipment is used to store data such as decibel thresholds.
  • the network interface of the computer device is used to communicate with an external terminal via a network connection.
  • when the processor executes the computer program, the functions of the video- and audio-based target trajectory calibration method in any of the above-mentioned embodiments are realized, where the video is collected by an imaging device, the audio is collected by a microphone array composed of multiple sub-microphones, and the microphone array is deployed on the imaging device.
  • the above-mentioned processor carries out the steps of the above-mentioned video- and audio-based target trajectory calibration method:
  • S1: collecting video data through the camera device, and collecting a plurality of audio data through the microphone array;
  • An embodiment of the present application also provides a computer-readable storage medium.
  • the storage medium may be a non-volatile storage medium or a volatile storage medium on which a computer program is stored.
  • when the computer program is executed by a processor, a video- and audio-based target trajectory calibration method is implemented, where the video is collected by a camera device, the audio is collected by a microphone array composed of a plurality of sub-microphones, and the microphone array is deployed on the imaging device; the method is specifically:
  • S1: collecting video data through the camera device, and collecting a plurality of audio data through the microphone array;
  • Nonvolatile memory can include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory can include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Software Systems (AREA)
  • Remote Sensing (AREA)
  • Theoretical Computer Science (AREA)
  • Geometry (AREA)
  • Computer Graphics (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Measurement Of Velocity Or Position Using Acoustic Or Ultrasonic Waves (AREA)
  • Studio Devices (AREA)

Abstract

Provided by the present application are a target trajectory calibration method based on video and audio, and a computer device. A microphone array is deployed on a camera device, and the positional relationship between a sound source and the camera device is determined by means of the deployment position relationship between a reference sub-microphone and the camera device. Then, a second motion trajectory of each sound source is calibrated by using video data and the collection time of each item of audio data as a time reference and by using a first motion trajectory of the camera device as a position parameter.

Description

Target trajectory calibration method and computer device based on video and audio

Technical Field
The present application relates to the technical field of audio and video processing, and in particular to a video- and audio-based target trajectory calibration method and computer device.
Background Art
In existing position calibration of the sound sources contained in audio and video, the sound source must appear within the field of view of the audio/video, i.e. the camera device must capture the sound source, before the user can manually locate the sound source's position relative to the camera device in the captured audio and video and then derive the sound source's motion trajectory from the camera device's shooting field of view. If the sound source lies outside the shooting field of view of the camera device, then even if the sound emitted by the source is received, the user cannot determine the relative position between the sound source and the camera device, let alone further determine the motion trajectory of the sound source.
Technical Problem
The main purpose of the present application is to provide a video- and audio-based target trajectory calibration method and computer device, aiming to solve the drawback that the motion trajectory of a sound source cannot be determined when the sound source is not within the shooting field of view of the camera device.
Technical Solution
To achieve the above object, in a first aspect, the present application provides a target trajectory calibration method based on video and audio, wherein the video is collected by a camera device, the audio is collected by a microphone array, the microphone array is composed of a plurality of sub-microphones, and the microphone array is deployed on the camera device; the target trajectory calibration method includes:
collecting video data through the camera device, and collecting a plurality of audio data through the microphone array;
performing VAD algorithm recognition on the sound contained in each piece of audio data to obtain several sound sources;
calculating a first relative positional relationship between each of the sound sources and a reference sub-microphone based on the difference between the times at which two first sub-microphones receive the sound corresponding to the same sound source, and on the deployment positions of the two first sub-microphones, the reference sub-microphone being either one of the two first sub-microphones;
converting, according to the deployment position of the reference sub-microphone on the camera device and the first relative positional relationship corresponding to each of the sound sources, to obtain a second relative positional relationship between each of the sound sources and the camera device;
constructing a second motion trajectory corresponding to each of the sound sources according to the acquisition times of the video data and each of the audio data, the first motion trajectory of the camera device, and each of the second relative positional relationships.
In a second aspect, the present application also provides a computer device, including a memory and a processor, the memory storing a computer program, wherein, when the processor executes the computer program, a video- and audio-based target trajectory calibration method is implemented, the video being collected by a camera device, the audio being collected by a microphone array, the microphone array being composed of a plurality of sub-microphones, and the microphone array being deployed on the camera device;
wherein the video- and audio-based target trajectory calibration method includes:
collecting video data through the camera device, and collecting a plurality of audio data through the microphone array;

performing VAD algorithm recognition on the sound contained in each piece of audio data to obtain several sound sources;

calculating a first relative positional relationship between each of the sound sources and a reference sub-microphone based on the difference between the times at which two first sub-microphones receive the sound corresponding to the same sound source, and on the deployment positions of the two first sub-microphones, the reference sub-microphone being either one of the two first sub-microphones;

converting, according to the deployment position of the reference sub-microphone on the camera device and the first relative positional relationship corresponding to each of the sound sources, to obtain a second relative positional relationship between each of the sound sources and the camera device;

constructing a second motion trajectory corresponding to each of the sound sources according to the acquisition times of the video data and each of the audio data, the first motion trajectory of the camera device, and each of the second relative positional relationships.
In a third aspect, the present application also provides a computer-readable storage medium on which a computer program is stored, wherein, when the computer program is executed by a processor, a video- and audio-based target trajectory calibration method is implemented, the video being collected by a camera device, the audio being collected by a microphone array, the microphone array being composed of a plurality of sub-microphones, and the microphone array being deployed on the camera device; the video- and audio-based target trajectory calibration method includes the following steps:
collecting video data through the camera device, and collecting a plurality of audio data through the microphone array;

performing VAD algorithm recognition on the sound contained in each piece of audio data to obtain several sound sources;

calculating a first relative positional relationship between each of the sound sources and a reference sub-microphone based on the difference between the times at which two first sub-microphones receive the sound corresponding to the same sound source, and on the deployment positions of the two first sub-microphones, the reference sub-microphone being either one of the two first sub-microphones;

converting, according to the deployment position of the reference sub-microphone on the camera device and the first relative positional relationship corresponding to each of the sound sources, to obtain a second relative positional relationship between each of the sound sources and the camera device;

constructing a second motion trajectory corresponding to each of the sound sources according to the acquisition times of the video data and each of the audio data, the first motion trajectory of the camera device, and each of the second relative positional relationships.
Beneficial Effects
The present application provides a video- and audio-based target trajectory calibration method and computer device, wherein the video is collected by a camera device, the audio is collected by a microphone array, the microphone array is composed of multiple sub-microphones, and the microphone array is deployed on the camera device. In application, the processing system collects video data through the camera device and collects a plurality of audio data through the microphone array. It then performs VAD algorithm recognition on the sound contained in each piece of audio data to obtain the sound sources contained in each piece of audio data. The processing system calculates the first relative positional relationship between each sound source and the reference sub-microphone based on the difference between the times at which the two first sub-microphones receive the sound corresponding to the same sound source, and on the deployment positions of the two first sub-microphones, the reference sub-microphone being either one of the two first sub-microphones.

The processing system then converts, according to the deployment position of the reference sub-microphone on the camera device and the first relative positional relationship corresponding to each sound source, to obtain a second relative positional relationship between each sound source and the camera device. Finally, the processing system constructs the second motion trajectory corresponding to each sound source according to the acquisition times of the video data and each audio data, the first motion trajectory of the camera device, and each second relative positional relationship. In the present application, a microphone array is deployed on the camera device, and the first relative positional relationship between each sound source and the reference sub-microphone can be calculated from the time differences with which the sound from each sound source arrives at the individual sub-microphones. By means of the deployment position relationship between the reference sub-microphone and the camera device, the second positional relationship of each sound source relative to the camera device is then obtained through position conversion. Therefore, even if a sound source does not appear within the shooting field of view of the camera device, as long as its sound can be received by the microphone array, the positional relationship of the sound source relative to the camera device can be determined from the deployment position relationship between the reference sub-microphone and the camera device. The first motion trajectory of the camera device is then used as a position parameter to calibrate the second motion trajectory of each sound source.
Description of Drawings
Fig. 1 is a schematic diagram of the steps of the video- and audio-based target trajectory calibration method in an embodiment of the present application;

Fig. 2 is a schematic diagram of the distribution of the reference sub-microphone, the center of the field of view of the camera device, and a sound source in an embodiment of the present application;

Fig. 3 is an overall structural block diagram of the video- and audio-based target trajectory calibration device in an embodiment of the present application;

Fig. 4 is a schematic block diagram of the structure of a computer device according to an embodiment of the present application.
The realization of the objectives, functional features and advantages of the present application will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Best Mode for Carrying Out the Invention
In order to make the purpose, technical solution and advantages of the present application clearer, the present application is further described in detail below in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present application and are not intended to limit it.
Referring to Fig. 1, an embodiment of the present application provides a video- and audio-based target trajectory calibration method, wherein the video is collected by a camera device, the audio is collected by a microphone array, the microphone array is composed of a plurality of sub-microphones, and the microphone array is deployed on the camera device; the target trajectory calibration method includes:
S1: collecting video data through the camera device, and collecting a plurality of audio data through the microphone array;

S2: performing VAD algorithm recognition on the sound contained in each piece of audio data to obtain several sound sources;

S3: calculating a first relative positional relationship between each of the sound sources and a reference sub-microphone based on the difference between the times at which two first sub-microphones receive the sound corresponding to the same sound source, and on the deployment positions of the two first sub-microphones, the reference sub-microphone being either one of the two first sub-microphones;

S4: converting, according to the deployment position of the reference sub-microphone on the camera device and the first relative positional relationship corresponding to each of the sound sources, to obtain a second relative positional relationship between each of the sound sources and the camera device;

S5: constructing a second motion trajectory corresponding to each of the sound sources according to the acquisition times of the video data and each of the audio data, the first motion trajectory of the camera device, and each of the second relative positional relationships.
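Step S2 only names "VAD algorithm recognition". One common, minimal energy-based variant can be sketched as follows; this is our illustration, not the patent's specific algorithm, and the frame length and threshold are arbitrary assumptions.

```python
def energy_vad(samples, frame_len=160, threshold=0.01):
    """Mark frames whose mean energy exceeds the threshold as voice-active.
    Returns a list of (start_sample, end_sample) active segments."""
    segments, start = [], None
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        energy = sum(s * s for s in samples[i:i + frame_len]) / frame_len
        if energy >= threshold:
            if start is None:
                start = i  # segment opens on the first active frame
        elif start is not None:
            segments.append((start, i))  # segment closes on an inactive frame
            start = None
    if start is not None:
        segments.append((start, len(samples)))
    return segments
```

In the patent's pipeline, each detected active segment would then be attributed to a sound source; separating overlapping sources within a single channel requires more than this sketch provides.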
In this embodiment, a microphone array composed of multiple sub-microphones is deployed on the camera device. In application, the processing system collects video data through the camera device and a plurality of audio data through the microphone array. The processing system may be the local system of the camera device, directly parsing and processing the collected video and audio data; it may also be a cloud server, to which the video data collected by the camera device and the audio data collected by the microphone array are uploaded via wireless signals (for example WiFi or 4G/5G network signals) for parsing and processing. The processing system performs VAD (Voice Activity Detection) algorithm recognition on the sound contained in each piece of audio data, performing speech recognition on each sound to obtain all sound sources contained in each piece of audio data.

Based on the difference between the times at which the two first sub-microphones receive the sound corresponding to the same sound source, and on the deployment positions of the two first sub-microphones, the processing system calculates the first relative positional relationship between each sound source and the reference sub-microphone through a TDOA (Time Difference of Arrival) positioning algorithm. The first relative positional relationship includes a first distance and a first angle: the first distance characterizes the straight-line distance between the sound source and the reference sub-microphone, and the first angle characterizes the angle between the sound source and the reference sub-microphone relative to the horizontal plane. (Since the reference sub-microphone may be either of the first sub-microphones, choosing a different first sub-microphone as the reference yields a correspondingly different first distance while the first angle is the same; the calculation logic for the first distance is identical and is not detailed here.) The processing system calculates the complement of the first angle and, according to the deployment position of the reference sub-microphone on the camera device, obtains the straight-line distance between the reference sub-microphone and the center of the field of view of the camera device. It then performs the corresponding calculation from the complement of the first angle, the straight-line distance and the first distance to obtain the second distance between the camera device and the sound source. According to the first angle and the first distance, the processing system calculates the vertical distance between the reference sub-microphone and the sound source through the law-of-cosines formula.

Finally, according to this vertical distance and the second distance, the second angle between the camera device and the sound source is again calculated through the law-of-cosines formula. Following the above calculation logic, the processing system calculates the second distance and the second angle corresponding to each sound source and the center of the field of view of the device, thereby generating the second relative positional relationship corresponding to each sound source. The processing system obtains the first motion trajectory of the camera device from a GPS positioning module deployed on the camera device and, according to the second relative positional relationship between each sound source and the camera device, uses the first motion trajectory as a position reference to calibrate and construct the second motion trajectory corresponding to each sound source.
In this embodiment, a microphone array is deployed on the camera device, and the first relative positional relationship between each sound source and the reference sub-microphone can be calculated from the time differences with which the sound from each sound source arrives at the individual sub-microphones. By means of the deployment position relationship between the reference sub-microphone and the camera device, the second positional relationship of each sound source relative to the camera device is then obtained through position conversion. Therefore, even if a sound source does not appear within the shooting field of view of the camera device, as long as its sound can be received by the microphone array, the positional relationship of the sound source relative to the camera device can be determined from the deployment position relationship between the reference sub-microphone and the camera device. Then, with the acquisition times of the video data and each audio data as a time reference and the first motion trajectory of the camera device as a position parameter, the second motion trajectory of each sound source is calibrated.
Further, the first relative positional relationship includes a first distance and a first angle, and the step of converting, according to the deployment position of the reference sub-microphone on the camera device and the first relative positional relationship corresponding to each sound source, to obtain the second relative positional relationship between each sound source and the camera device includes:
S401: Calculate the complementary angle of the first angle;
S402: Retrieve the straight-line distance between the reference sub-microphone and the camera device, and substitute the complementary angle of the first angle, the straight-line distance, and the first distance into the calculation formula to obtain the second distance, where the calculation formula is:
a = √(b² + c² - 2bc·cos β)
where b is the first distance, c is the straight-line distance, β is the complementary angle of the first angle, and a is the second distance, representing the distance between the camera device and the sound source;
S403: According to the first angle and the first distance, calculate the vertical distance between the reference sub-microphone and the sound source by the law of cosines;
S404: According to the second distance and the vertical distance, calculate the second angle between the camera device and the sound source by the law of cosines, where the vertical distance between the camera device and the sound source has the same value as the vertical distance between the reference sub-microphone and the sound source;
S405: Calculate the second distance and the second angle corresponding to each sound source and the camera device according to the above rules, and generate each second relative positional relationship.
In this embodiment, as shown in Figure 2, assume that the center of the camera device's field of view is point A, the reference sub-microphone is point B, and the sound source is point C. Perpendicular lines are drawn through the reference sub-microphone and the sound source, intersecting at point D; triangle BCD is then a right triangle, ∠BDC is a right angle, ∠CBD is the first angle between the sound source and the reference sub-microphone, and side BC is the first distance between the sound source and the reference sub-microphone. In triangle ABC, ∠ABC is the complementary angle of ∠CBD (i.e., the first angle); side AB is the straight-line distance between the reference sub-microphone and the center of the camera device's field of view; and side AC is the second distance between the sound source and the center of the camera device's field of view. Since the values of ∠ABC, side AB, and side BC are known, substituting them into the calculation formula
a = √(b² + c² - 2bc·cos β)
yields the value of side AC, where b is the first distance (side BC), c is the straight-line distance (side AB), β is the complementary angle of the first angle (∠ABC), and a is the second distance (side AC), representing the distance between the camera device and the sound source. In right triangle BCD, since side BC and ∠CBD are known, the value of side BD (the vertical distance between the reference sub-microphone and the sound source) can be calculated by the law of cosines. A line segment is then drawn through the center of the camera device's field of view (point A) perpendicular to side CD, with foot E; triangle ACE is a right triangle, the length of side AE equals that of side BD, ∠CAE is the second angle between the sound source and the camera device, and ∠CEA is a right angle. In right triangle ACE, since the hypotenuse AC and the side AE adjacent to ∠CAE are known, the value of ∠CAE can be calculated by the law of cosines, yielding the second angle between the center of the camera device's field of view and the sound source. The processing system calculates the second distance and second angle between each sound source and the center of the camera device's field of view according to the above rules, and generates the second relative positional relationship corresponding to each sound source from the second distance and second angle.
Further, the step of performing VAD algorithm recognition on the sounds contained in each of the audio data to obtain several sound sources includes:
S201: Perform VAD algorithm recognition on the sounds contained in each of the audio data to obtain several human-voice sound sources and other types of sound sources;
S202: Mark and number each human-voice sound source, perform decibel detection on each of the other types of sound sources, hide the first other-type sound sources whose decibel value is below the decibel threshold, and mark and number the second other-type sound sources whose decibel value is above the decibel threshold.
In this embodiment, the processing system uses the VAD algorithm to recognize the sounds contained in each audio data, obtaining several sound sources, and distinguishes them into human-voice sound sources and other types of sound sources (such as animal sound sources or vehicle sound sources). The processing system marks and numbers each human-voice sound source to tell them apart. Meanwhile, to reduce the amount of subsequent sound source data processing and the complexity of trajectory construction, the processing system screens the other types of sound sources to eliminate those of little practical value. Specifically, the processing system retrieves a preset decibel threshold and detects the decibel value of the sound emitted by each other-type sound source. It compares each decibel value against the threshold, hides or eliminates the other-type sound sources whose decibel value is below the threshold (the first other-type sound sources) so that their sounds receive no further processing (such as subsequent marking and numbering or trajectory construction), and marks and numbers the other-type sound sources whose decibel value is above the threshold (the second other-type sound sources) to distinguish them from one another.
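The screening in step S202 can be sketched as follows. The record layout (`kind` and `db` fields) and the default threshold are assumptions for illustration, not values fixed by the patent:

```python
def screen_sources(sources, db_threshold=40.0):
    """Number every human-voice source; for other-type sources, hide those
    below the decibel threshold and number only those at or above it."""
    kept = []
    next_no = 1
    for src in sources:
        if src["kind"] != "voice" and src["db"] < db_threshold:
            continue  # first other-type source: hidden, no further processing
        kept.append({**src, "number": next_no})
        next_no += 1
    return kept
```

The hidden sources are simply skipped, so they never receive a number and never enter the later trajectory-construction stage.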
Further, the step of marking and numbering the second other-type sound sources whose decibel value is above the decibel threshold includes:
S2021: Input the sound corresponding to each second other-type sound source into a pre-trained sound type recognition model for recognition, obtaining the sound type corresponding to each second other-type sound source;
S2022: Using the sound type as marking information, mark and number each second other-type sound source.
In this embodiment, a pre-trained sound type recognition model is built into the processing system. The model is trained by deep learning using various types of sound (such as cat meows, dog barks, or the sound of moving cars) as training samples (the deep learning training method is the same as in the prior art and is not detailed here), so that it can recognize the type corresponding to each sound. In application, the processing system inputs the sound corresponding to each second other-type sound source into the pre-trained sound type recognition model for processing, and outputs the sound type corresponding to each (for example, the sound of second other-type sound source A corresponds to a cat meow, while that of second other-type sound source B corresponds to a moving car). When marking and numbering each second other-type sound source, the processing system attaches the corresponding sound type as marking information, making it convenient for the user to understand the specific details.
Further, the step of constructing, according to the acquisition times of the video data and each of the audio data, the first motion trajectory of the camera device, and each second relative positional relationship, the second motion trajectory corresponding to each sound source includes:
S501: Perform time synchronization based on the acquisition times of the video data and each of the audio data, and locate the appearance time of each sound source in the video data;
S502: Collect the first motion trajectory of the camera device by GPS positioning and, taking the first motion trajectory as a position reference, construct the second motion trajectory of each sound source relative to the first motion trajectory according to the appearance time and the second relative positional relationship corresponding to each sound source.
In this embodiment, since the video data is shot from beginning to end in actual application, while the sound corresponding to a given sound source may appear during shooting and disappear after a period of time, the processing system takes the acquisition time of the video data as the benchmark and synchronizes the acquisition time of each audio data with it, thereby locating the appearance time of each sound source in the video data (the appearance time includes the moment of appearance, the duration, and the moment of ending). A GPS positioning module is installed on the camera device; through it the processing system obtains the position of the camera device at each acquisition moment during shooting, and forms the first motion trajectory from these positions. On this basis, taking the first motion trajectory of the camera device as the position reference (specifically, the position corresponding to each acquisition moment), the second motion trajectory of each sound source relative to the first motion trajectory is constructed according to the appearance time of each sound source in the video data and its second relative positional relationship with the camera device, thereby calibrating the motion trajectories of sound sources outside the camera device's shooting field of view.
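Step S502 can be sketched as follows. This is a simplified planar sketch: the GPS-to-meters conversion uses a rough equirectangular scale, and the data layouts (timestamp-keyed dictionaries, distance/bearing pairs) are assumptions for illustration; a real system would use a proper map projection:

```python
import math

METERS_PER_DEG_LAT = 111_320.0  # rough equirectangular scale (assumption)

def build_second_trajectory(cam_track, rel_positions):
    """cam_track: {t: (lat, lon)} GPS fixes forming the first trajectory.
    rel_positions: {t: (distance_m, bearing_deg)} second relative position
    of one sound source at the moments it is audible.
    Returns a list of (t, lat, lon) points: the second trajectory anchored
    to the first trajectory as a position reference."""
    points = []
    for t in sorted(rel_positions):
        if t not in cam_track:
            continue  # source audible, but no camera fix at this moment
        lat, lon = cam_track[t]
        d, bearing = rel_positions[t]
        dlat = d * math.cos(math.radians(bearing)) / METERS_PER_DEG_LAT
        dlon = d * math.sin(math.radians(bearing)) / (
            METERS_PER_DEG_LAT * math.cos(math.radians(lat)))
        points.append((t, lat + dlat, lon + dlon))
    return points
```

Because the source trajectory is anchored to the camera's GPS fixes at each synchronized moment, the source's track only exists for the span of its appearance time, matching the appear/persist/disappear behavior described above.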
Further, after the step of constructing, according to the acquisition times of the video data and each of the audio data, the first motion trajectory of the camera device, and each second relative positional relationship, the second motion trajectory corresponding to each sound source, the method includes:
S6: Construct each second motion trajectory with lines of different colors, and record the correspondence between each color and each sound source to form correspondence information;
S7: Generate a trajectory distribution diagram according to the first motion trajectory, the correspondence information, and each second motion trajectory, and output the trajectory distribution diagram to a display interface.
In this embodiment, after generating the second motion trajectory corresponding to each sound source from the first motion trajectory, the processing system draws each second motion trajectory with a line of a different color in order to distinguish them, and records the correspondence between each color and each sound source to form correspondence information; for example, the trajectory line of sound source A's second motion trajectory is red, and that of sound source B is yellow. The processing system generates a trajectory distribution diagram from the first motion trajectory, the second motion trajectories, and the correspondence information (the correspondence information is recorded on the diagram as annotation, making it convenient for the user to look up the sound source of each color), and outputs the diagram to the display interface so that the user can intuitively observe the changes in each second motion trajectory.
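The color correspondence recording in step S6 can be sketched minimally as follows; the palette and the cycling behavior for more sources than colors are assumptions:

```python
PALETTE = ["red", "yellow", "green", "blue", "magenta", "cyan"]

def record_color_correspondence(source_ids):
    """Assign each sound source a distinct line color (cycling through the
    palette if sources outnumber colors) and return the color-to-source
    correspondence information to annotate on the trajectory diagram."""
    return {sid: PALETTE[i % len(PALETTE)] for i, sid in enumerate(source_ids)}
```

The returned mapping is exactly the correspondence information that the trajectory distribution diagram carries as annotation.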
Further, the step of generating a trajectory distribution diagram according to the first motion trajectory, the correspondence information, and each second motion trajectory, and outputting the trajectory distribution diagram to a display interface includes:
S701: Retrieve a three-dimensional map, and mark the first motion trajectory on the three-dimensional map;
S702: Taking the first motion trajectory as a position reference, mark each second motion trajectory on the three-dimensional map, and annotate the three-dimensional map with the correspondence information and the appearance and ending moments of each second motion trajectory, forming the trajectory distribution diagram;
S703: Output the trajectory distribution diagram to a display interface.
In this embodiment, the processing system retrieves a three-dimensional map of the camera device's shooting area (the map may be pre-stored in the processing system's database or downloaded from the network by the processing system) and marks the first motion trajectory on it. Then, taking the first motion trajectory as the position reference, it marks the second motion trajectory corresponding to each sound source on the three-dimensional map according to its appearance time, and annotates the map with the color correspondence information and the appearance and ending moments of each second motion trajectory, forming the trajectory distribution diagram as a whole. Finally, the processing system outputs the trajectory distribution diagram to the display interface, allowing the user to understand the changes in each second motion trajectory more clearly in three dimensions.
Referring to Figure 3, an embodiment of the present application further provides a target trajectory calibration apparatus based on video and audio, where the video is collected by a camera device, the audio is collected by a microphone array, the microphone array is composed of multiple sub-microphones, and the microphone array is deployed on the camera device. The target trajectory calibration apparatus includes:
acquisition module 1, configured to collect video data through the camera device and collect multiple audio data through the microphone array;
recognition module 2, configured to perform VAD algorithm recognition on the sounds contained in each of the audio data to obtain several sound sources;
calculation module 3, configured to calculate the first relative positional relationship between each sound source and a reference sub-microphone based on the difference between the times at which two first sub-microphones receive the sound corresponding to the same sound source and the deployment positions of the two first sub-microphones, the reference sub-microphone being either of the two first sub-microphones;
conversion module 4, configured to convert, according to the deployment position of the reference sub-microphone on the camera device and the first relative positional relationship corresponding to each sound source, to obtain the second relative positional relationship between each sound source and the camera device;
construction module 5, configured to construct the second motion trajectory corresponding to each sound source according to the acquisition times of the video data and each of the audio data, the first motion trajectory of the camera device, and each second relative positional relationship.
Further, the first relative positional relationship includes a first distance and a first angle, and the conversion module 4 includes:
a first calculation unit, configured to calculate the complementary angle of the first angle;
a second calculation unit, configured to retrieve the straight-line distance between the reference sub-microphone and the camera device, and substitute the complementary angle of the first angle, the straight-line distance, and the first distance into the calculation formula to obtain the second distance, where the calculation formula is:
a = √(b² + c² - 2bc·cos β)
where b is the first distance, c is the straight-line distance, β is the complementary angle of the first angle, and a is the second distance, representing the distance between the camera device and the sound source;
a third calculation unit, configured to calculate, according to the first angle and the first distance, the vertical distance between the reference sub-microphone and the sound source by the law of cosines;
a fourth calculation unit, configured to calculate, according to the second distance and the vertical distance, the second angle between the camera device and the sound source by the law of cosines, where the vertical distance between the camera device and the sound source has the same value as the vertical distance between the reference sub-microphone and the sound source;
a generation unit, configured to calculate the second distance and the second angle corresponding to each sound source and the camera device according to the above rules, and generate each second relative positional relationship.
Further, the recognition module 2 includes:
a recognition unit, configured to perform VAD algorithm recognition on the sounds contained in each of the audio data to obtain several human-voice sound sources and other types of sound sources;
a screening unit, configured to mark and number each human-voice sound source, perform decibel detection on each of the other types of sound sources, hide the first other-type sound sources whose decibel value is below the decibel threshold, and mark and number the second other-type sound sources whose decibel value is above the decibel threshold.
Further, the screening unit includes:
a recognition subunit, configured to input the sound corresponding to each second other-type sound source into a pre-trained sound type recognition model for recognition, obtaining the sound type corresponding to each second other-type sound source;
a marking subunit, configured to use the sound type as marking information to mark and number each second other-type sound source.
Further, the construction module 5 includes:
a positioning unit, configured to perform time synchronization based on the acquisition times of the video data and each of the audio data, and locate the appearance time of each sound source in the video data;
a construction unit, configured to collect the first motion trajectory of the camera device by GPS positioning and, taking the first motion trajectory as a position reference, construct the second motion trajectory of each sound source relative to the first motion trajectory according to the appearance time and the second relative positional relationship corresponding to each sound source.
Further, the target trajectory calibration apparatus further includes:
recording module 6, configured to construct each second motion trajectory with lines of different colors, and record the correspondence between each color and each sound source to form correspondence information;
generation module 7, configured to generate a trajectory distribution diagram according to the first motion trajectory, the correspondence information, and each second motion trajectory, and output the trajectory distribution diagram to a display interface.
Further, the generation module 7 includes:
a marking unit, configured to retrieve a three-dimensional map and mark the first motion trajectory on the three-dimensional map;
a forming unit, configured to, taking the first motion trajectory as a position reference, mark each second motion trajectory on the three-dimensional map, and annotate the three-dimensional map with the correspondence information and the appearance and ending moments of each second motion trajectory, forming the trajectory distribution diagram;
an output unit, configured to output the trajectory distribution diagram to a display interface.
In this embodiment, the modules, units, and subunits in the target trajectory calibration apparatus are used to correspondingly perform the steps of the above video- and audio-based target trajectory calibration method; their specific implementation is not detailed here.
This embodiment provides a target trajectory calibration apparatus based on video and audio, where the video is collected by a camera device, the audio is collected by a microphone array, the microphone array is composed of multiple sub-microphones, and the microphone array is deployed on the camera device. In application, the processing system collects video data through the camera device and multiple audio data through the microphone array. It then performs VAD algorithm recognition on the sounds contained in each audio data to obtain the several sound sources each audio data contains. Based on the difference between the times at which two first sub-microphones receive the sound corresponding to the same sound source, and on the deployment positions of the two first sub-microphones, the processing system calculates the first relative positional relationship between each sound source and a reference sub-microphone, the reference sub-microphone being either of the two first sub-microphones. According to the deployment position of the reference sub-microphone on the camera device and the first relative positional relationship corresponding to each sound source, the processing system converts to obtain the second relative positional relationship between each sound source and the camera device. Finally, the processing system constructs the second motion trajectory corresponding to each sound source according to the acquisition times of the video data and each audio data, the first motion trajectory of the camera device, and each second relative positional relationship. In this application, a microphone array is deployed on the camera device, and the first relative positional relationship between each sound source and the reference sub-microphone can be calculated from the time differences with which each sound source's sound reaches the individual sub-microphones. Then, using the deployment relationship between the reference sub-microphone and the camera device, the second positional relationship of each sound source relative to the camera device is obtained through position conversion. Therefore, even if a sound source does not appear within the camera device's shooting field of view, as long as its sound can be received by the microphone array, its position relative to the camera device can be determined through the deployment relationship between the reference sub-microphone and the camera device. The first motion trajectory of the camera device is then taken as the position reference to calibrate the second motion trajectory of each sound source.
Referring to Figure 4, an embodiment of the present application further provides a computer device, which may be a server, whose internal structure may be as shown in Figure 4. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device is used to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used to store data such as the decibel threshold. The network interface of the computer device is used to communicate with an external terminal through a network connection. When executed by the processor, the computer program implements the functions of the video- and audio-based target trajectory calibration method of any of the above embodiments, where the video is collected by a camera device, the audio is collected by a microphone array, the microphone array is composed of multiple sub-microphones, and the microphone array is deployed on the camera device.
上述处理器执行上述基于视频和音频的目标轨迹标定方法的步骤:Above-mentioned processor carries out the step of above-mentioned method based on target track marking of video and audio frequency:
S1:通过所述摄像设备采集视频数据,并通过所述麦克风阵列采集多个音频数据;S1: collecting video data by the camera device, and collecting a plurality of audio data by the microphone array;
S2:分别对各所述音频数据所包含的声音做VAD算法识别,得到若干个声源;S2: do VAD algorithm recognition to the sound contained in each described audio data respectively, obtain several sound sources;
S3:基于两个第一子麦克风对相同的所述声源对应的声音的接收时间之差,以及两个所述第一子麦克风之间的部署位置,计算得到各所述声源与基准子麦克风之间的第一相对位置关系,所述基准子麦克风为两个所述第一子麦克风中的任意一个;S3: Based on the difference between the receiving time of the sound corresponding to the same sound source of the two first sub-microphones, and the deployment position between the two first sub-microphones, calculate the relationship between each of the sound sources and the reference sub-microphone A first relative positional relationship between microphones, the reference sub-microphone being any one of the two first sub-microphones;
S4:根据所述基准子麦克风在所述摄像设备上的部署位置,以及各所述声源分别对应的所述第一相对位置关系,转换得到各所述声源与所述摄像设备之间的第二相对位置关系;S4: According to the deployment position of the reference sub-microphone on the imaging device, and the first relative positional relationship corresponding to each of the sound sources, convert and obtain the distance between each of the sound sources and the imaging device The second relative positional relationship;
S5:根据视频数据与各音频数据的采集时间,摄像设备的第一运动轨迹,以及各第二相对位置关系,构建各声源分别对应的第二运动轨迹。S5: According to the acquisition time of the video data and each audio data, the first motion trajectory of the camera device, and the second relative positional relationship, construct the second motion trajectory corresponding to each sound source.
An embodiment of the present application further provides a computer-readable storage medium, which may be a non-volatile storage medium or a volatile storage medium, on which a computer program is stored. When executed by a processor, the computer program implements the video- and audio-based target trajectory calibration method of any of the above embodiments, wherein the video is captured by an imaging device, the audio is captured by a microphone array, the microphone array is composed of a plurality of sub-microphones, and the microphone array is deployed on the imaging device. The method is specifically as follows:
S1: collecting video data via the imaging device, and collecting a plurality of audio data via the microphone array;
S2: performing VAD (voice activity detection) recognition on the sounds contained in each of the audio data, obtaining several sound sources;
S3: based on the difference between the times at which two first sub-microphones receive the sound corresponding to the same sound source, and the deployment positions of the two first sub-microphones, calculating a first relative positional relationship between each of the sound sources and a reference sub-microphone, the reference sub-microphone being either one of the two first sub-microphones;
S4: according to the deployment position of the reference sub-microphone on the imaging device and the first relative positional relationship corresponding to each of the sound sources, converting to obtain a second relative positional relationship between each of the sound sources and the imaging device;
S5: according to the acquisition times of the video data and each of the audio data, the first motion trajectory of the imaging device, and each of the second relative positional relationships, constructing a second motion trajectory corresponding to each of the sound sources.
Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware. The computer program may be stored in a non-volatile computer-readable storage medium, and when executed, may include the processes of the embodiments of the above methods. Any reference to memory, storage, a database, or other media provided in this application and used in the embodiments may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It should be noted that, in this document, the terms "comprise", "include", or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, apparatus, article, or method that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, apparatus, article, or method that includes that element.
The above are merely preferred embodiments of the present application and do not thereby limit its patent scope. Any equivalent structural or process transformation made using the contents of the specification and drawings of the present application, whether applied directly or indirectly in other related technical fields, is likewise included within the patent protection scope of the present application.

Claims (20)

  1. A video- and audio-based target trajectory calibration method, characterized in that the video is captured by an imaging device, the audio is captured by a microphone array, the microphone array is composed of a plurality of sub-microphones, and the microphone array is deployed on the imaging device; the target trajectory calibration method comprises:
    collecting video data via the imaging device, and collecting a plurality of audio data via the microphone array;
    performing VAD (voice activity detection) recognition on the sounds contained in each of the audio data, obtaining several sound sources;
    based on the difference between the times at which two first sub-microphones receive the sound corresponding to the same sound source, and the deployment positions of the two first sub-microphones, calculating a first relative positional relationship between each of the sound sources and a reference sub-microphone, the reference sub-microphone being either one of the two first sub-microphones;
    according to the deployment position of the reference sub-microphone on the imaging device and the first relative positional relationship corresponding to each of the sound sources, converting to obtain a second relative positional relationship between each of the sound sources and the imaging device;
    according to the acquisition times of the video data and each of the audio data, a first motion trajectory of the imaging device, and each of the second relative positional relationships, constructing a second motion trajectory corresponding to each of the sound sources.
  2. The video- and audio-based target trajectory calibration method according to claim 1, characterized in that the first relative positional relationship includes a first distance and a first angle, and the step of converting, according to the deployment position of the reference sub-microphone on the imaging device and the first relative positional relationship corresponding to each of the sound sources, to obtain the second relative positional relationship between each of the sound sources and the imaging device comprises:
    calculating the complement of the first angle;
    retrieving the straight-line distance between the reference sub-microphone and the imaging device, and substituting the complement of the first angle, the straight-line distance, and the first distance into a calculation formula to obtain a second distance, wherein the calculation formula is:
    a = √(b² + c² - 2·b·c·cos β)
    where b is the first distance, c is the straight-line distance, β is the complement of the first angle, and a is the second distance, representing the distance between the imaging device and the sound source;
    according to the first angle and the first distance, calculating the vertical distance between the reference sub-microphone and the sound source via the law-of-cosines formula;
    according to the second distance and the vertical distance, calculating a second angle between the imaging device and the sound source via the law-of-cosines formula, wherein the vertical distance between the imaging device and the sound source has the same value as the vertical distance between the reference sub-microphone and the sound source;
    calculating, according to the above rules, the second distance and the second angle respectively corresponding to each of the sound sources and the imaging device, and generating each of the second relative positional relationships.
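The conversion steps above can be sketched as follows. The geometric convention (the complement β of the first angle taken as the angle between side b, microphone-to-source, and side c, the microphone-camera baseline, with the second angle recovered from the shared perpendicular distance) is an assumption made for illustration, and the perpendicular distance is computed here with basic trigonometry rather than a literal law-of-cosines step.

```python
import math

def to_camera_frame(first_distance, first_angle_deg, mic_to_camera):
    """Convert a source position known relative to the reference sub-microphone
    (first distance b, first angle) into a second distance and second angle
    relative to the imaging device. beta, the complement of the first angle,
    is taken as the angle between side b (microphone-to-source) and side c
    (microphone-to-camera baseline), which is an assumed convention.
    """
    b = first_distance
    c = mic_to_camera
    beta = math.radians(90.0 - first_angle_deg)  # complement of the first angle
    # Second distance from the law of cosines: a^2 = b^2 + c^2 - 2bc cos(beta).
    a = math.sqrt(b * b + c * c - 2.0 * b * c * math.cos(beta))
    # The perpendicular ("vertical") distance from the source to the baseline
    # is the same seen from the microphone and from the camera.
    h = b * math.sin(beta)
    # Second angle at the camera, recovered from the shared perpendicular.
    second_angle_deg = math.degrees(math.asin(max(-1.0, min(1.0, h / a))))
    return a, second_angle_deg
```

With a zero baseline (c = 0) the two frames coincide: `to_camera_frame(2.0, 30.0, 0.0)` returns the original 2.0 m distance together with the 60-degree complement, which is a quick sanity check on the convention.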
  3. The video- and audio-based target trajectory calibration method according to claim 1, characterized in that the step of performing VAD recognition on the sounds contained in each of the audio data to obtain several sound sources comprises:
    performing VAD recognition on the sounds contained in each of the audio data, obtaining several human-voice sound sources and other-type sound sources;
    marking and numbering each of the human-voice sound sources, detecting the decibel value of each of the other-type sound sources, hiding the first other-type sound sources whose decibel value is below a decibel threshold, and marking and numbering the second other-type sound sources whose decibel value is above the decibel threshold.
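A minimal sketch of this labeling and filtering step follows. The source representation (a dict with a kind flag and raw samples), the RMS-to-dB conversion offset, and the 40 dB default threshold are all illustrative assumptions.

```python
import math

def label_sources(sources, db_threshold=40.0):
    """Mark human-voice sources, hide quiet other-type sources, and number the
    loud ones. Each source is a dict with 'kind' ('voice' or 'other') and
    'samples' (floats in [-1, 1]); this representation, the dB offset, and
    the threshold are illustrative assumptions.
    """
    labeled, hidden, counter = [], [], 0
    for src in sources:
        if src["kind"] == "voice":
            counter += 1
            labeled.append({**src, "tag": f"voice-{counter}"})
            continue
        rms = math.sqrt(sum(x * x for x in src["samples"]) / len(src["samples"]))
        db = 20.0 * math.log10(max(rms, 1e-12)) + 94.0  # assumed full-scale offset
        if db < db_threshold:
            hidden.append(src)  # below the decibel threshold: hide
        else:
            counter += 1
            labeled.append({**src, "tag": f"other-{counter}"})
    return labeled, hidden
```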
  4. The video- and audio-based target trajectory calibration method according to claim 3, characterized in that the step of marking and numbering the second other-type sound sources whose decibel value is above the decibel threshold comprises:
    inputting the sound corresponding to each of the second other-type sound sources into a pre-trained sound type recognition model for recognition, obtaining the sound type corresponding to each of the second other-type sound sources;
    using the sound type as marking information, marking and numbering each of the second other-type sound sources.
  5. The video- and audio-based target trajectory calibration method according to claim 1, characterized in that the step of constructing, according to the acquisition times of the video data and each of the audio data, the first motion trajectory of the imaging device, and each of the second relative positional relationships, the second motion trajectory corresponding to each of the sound sources comprises:
    performing time synchronization based on the acquisition moments of the video data and of each of the audio data, and locating the appearance time of each of the sound sources in the video data;
    collecting the first motion trajectory of the imaging device via GPS positioning, and, using the first motion trajectory as a position reference, constructing the second motion trajectory of each of the sound sources relative to the first motion trajectory according to the appearance time corresponding to each of the sound sources and each of the second relative positional relationships.
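Constructing a second motion trajectory against the GPS-derived first trajectory can be sketched as follows, assuming a shared clock, flat two-dimensional geometry, and nearest-sample (non-interpolated) alignment of camera positions; none of these simplifications are mandated by the claim.

```python
import bisect
import math

def build_second_trajectory(camera_track, observations):
    """Place one sound source's observations relative to the camera's
    GPS-derived first trajectory. camera_track is a time-sorted list of
    (t, x, y) samples; observations is a list of (t, distance, angle_deg)
    entries for the source relative to the camera. A shared clock, flat 2-D
    geometry, and nearest-sample alignment are illustrative assumptions.
    """
    times = [t for t, _, _ in camera_track]
    trajectory = []
    for t, dist, angle_deg in observations:
        # Pick the first camera sample at or after the observation time
        # (no interpolation in this sketch).
        i = min(bisect.bisect_left(times, t), len(times) - 1)
        _, cx, cy = camera_track[i]
        rad = math.radians(angle_deg)
        trajectory.append((t, cx + dist * math.cos(rad), cy + dist * math.sin(rad)))
    return trajectory
```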
  6. The video- and audio-based target trajectory calibration method according to claim 1, characterized in that, after the step of constructing, according to the acquisition times of the video data and each of the audio data, the first motion trajectory of the imaging device, and each of the second relative positional relationships, the second motion trajectory corresponding to each of the sound sources, the method comprises:
    constructing each of the second motion trajectories with lines of different colors, and recording the correspondence between each color and each of the sound sources to form correspondence information;
    generating a trajectory distribution map according to the first motion trajectory, the correspondence information, and each of the second motion trajectories, and outputting the trajectory distribution map to a display interface.
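The color-to-source correspondence information of this step can be sketched as a simple mapping; the palette and the cycling policy are assumptions made for illustration.

```python
def assign_colors(source_ids, palette=("red", "green", "blue", "orange", "purple")):
    """Record the color-to-source correspondence used to draw each second
    motion trajectory; the palette and the cycling policy are assumptions."""
    return {sid: palette[i % len(palette)] for i, sid in enumerate(source_ids)}
```

For example, `assign_colors(["voice-1", "other-2"])` maps `voice-1` to `red` and `other-2` to `green`; with more sources than palette entries the colors repeat cyclically.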
  7. The video- and audio-based target trajectory calibration method according to claim 6, characterized in that the step of generating the trajectory distribution map according to the first motion trajectory, the correspondence information, and each of the second motion trajectories, and outputting the trajectory distribution map to the display interface comprises:
    retrieving a three-dimensional map, and marking the first motion trajectory on the three-dimensional map;
    using the first motion trajectory as a position reference, marking each of the second motion trajectories on the three-dimensional map, and annotating the three-dimensional map with the correspondence information and the appearance and end moments of each of the second motion trajectories, forming the trajectory distribution map;
    outputting the trajectory distribution map to the display interface.
  8. A computer device, comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements a video- and audio-based target trajectory calibration method, wherein the video is captured by an imaging device, the audio is captured by a microphone array, the microphone array is composed of a plurality of sub-microphones, and the microphone array is deployed on the imaging device;
    wherein the video- and audio-based target trajectory calibration method comprises:
    collecting video data via the imaging device, and collecting a plurality of audio data via the microphone array;
    performing VAD (voice activity detection) recognition on the sounds contained in each of the audio data, obtaining several sound sources;
    based on the difference between the times at which two first sub-microphones receive the sound corresponding to the same sound source, and the deployment positions of the two first sub-microphones, calculating a first relative positional relationship between each of the sound sources and a reference sub-microphone, the reference sub-microphone being either one of the two first sub-microphones;
    according to the deployment position of the reference sub-microphone on the imaging device and the first relative positional relationship corresponding to each of the sound sources, converting to obtain a second relative positional relationship between each of the sound sources and the imaging device;
    according to the acquisition times of the video data and each of the audio data, a first motion trajectory of the imaging device, and each of the second relative positional relationships, constructing a second motion trajectory corresponding to each of the sound sources.
  9. The computer device according to claim 8, characterized in that the first relative positional relationship includes a first distance and a first angle, and the step of converting, according to the deployment position of the reference sub-microphone on the imaging device and the first relative positional relationship corresponding to each of the sound sources, to obtain the second relative positional relationship between each of the sound sources and the imaging device comprises:
    calculating the complement of the first angle;
    retrieving the straight-line distance between the reference sub-microphone and the imaging device, and substituting the complement of the first angle, the straight-line distance, and the first distance into a calculation formula to obtain a second distance, wherein the calculation formula is:
    a = √(b² + c² - 2·b·c·cos β)
    where b is the first distance, c is the straight-line distance, β is the complement of the first angle, and a is the second distance, representing the distance between the imaging device and the sound source;
    according to the first angle and the first distance, calculating the vertical distance between the reference sub-microphone and the sound source via the law-of-cosines formula;
    according to the second distance and the vertical distance, calculating a second angle between the imaging device and the sound source via the law-of-cosines formula, wherein the vertical distance between the imaging device and the sound source has the same value as the vertical distance between the reference sub-microphone and the sound source;
    calculating, according to the above rules, the second distance and the second angle respectively corresponding to each of the sound sources and the imaging device, and generating each of the second relative positional relationships.
  10. The computer device according to claim 8, characterized in that the step of performing VAD recognition on the sounds contained in each of the audio data to obtain several sound sources comprises:
    performing VAD recognition on the sounds contained in each of the audio data, obtaining several human-voice sound sources and other-type sound sources;
    marking and numbering each of the human-voice sound sources, detecting the decibel value of each of the other-type sound sources, hiding the first other-type sound sources whose decibel value is below a decibel threshold, and marking and numbering the second other-type sound sources whose decibel value is above the decibel threshold.
  11. The computer device according to claim 10, characterized in that the step of marking and numbering the second other-type sound sources whose decibel value is above the decibel threshold comprises:
    inputting the sound corresponding to each of the second other-type sound sources into a pre-trained sound type recognition model for recognition, obtaining the sound type corresponding to each of the second other-type sound sources;
    using the sound type as marking information, marking and numbering each of the second other-type sound sources.
  12. The computer device according to claim 8, characterized in that the step of constructing, according to the acquisition times of the video data and each of the audio data, the first motion trajectory of the imaging device, and each of the second relative positional relationships, the second motion trajectory corresponding to each of the sound sources comprises:
    performing time synchronization based on the acquisition moments of the video data and of each of the audio data, and locating the appearance time of each of the sound sources in the video data;
    collecting the first motion trajectory of the imaging device via GPS positioning, and, using the first motion trajectory as a position reference, constructing the second motion trajectory of each of the sound sources relative to the first motion trajectory according to the appearance time corresponding to each of the sound sources and each of the second relative positional relationships.
  13. The computer device according to claim 8, characterized in that, after the step of constructing, according to the acquisition times of the video data and each of the audio data, the first motion trajectory of the imaging device, and each of the second relative positional relationships, the second motion trajectory corresponding to each of the sound sources, the method comprises:
    constructing each of the second motion trajectories with lines of different colors, and recording the correspondence between each color and each of the sound sources to form correspondence information;
    generating a trajectory distribution map according to the first motion trajectory, the correspondence information, and each of the second motion trajectories, and outputting the trajectory distribution map to a display interface.
  14. The computer device according to claim 13, characterized in that the step of generating the trajectory distribution map according to the first motion trajectory, the correspondence information, and each of the second motion trajectories, and outputting the trajectory distribution map to the display interface comprises:
    retrieving a three-dimensional map, and marking the first motion trajectory on the three-dimensional map;
    using the first motion trajectory as a position reference, marking each of the second motion trajectories on the three-dimensional map, and annotating the three-dimensional map with the correspondence information and the appearance and end moments of each of the second motion trajectories, forming the trajectory distribution map;
    outputting the trajectory distribution map to the display interface.
  15. A computer-readable storage medium on which a computer program is stored, characterized in that, when the computer program is executed by a processor, it implements a video- and audio-based target trajectory calibration method, wherein the video is captured by an imaging device, the audio is captured by a microphone array, the microphone array is composed of a plurality of sub-microphones, and the microphone array is deployed on the imaging device; the video- and audio-based target trajectory calibration method comprises the following steps:
    collecting video data via the imaging device, and collecting a plurality of audio data via the microphone array;
    performing VAD (voice activity detection) recognition on the sounds contained in each of the audio data, obtaining several sound sources;
    based on the difference between the times at which two first sub-microphones receive the sound corresponding to the same sound source, and the deployment positions of the two first sub-microphones, calculating a first relative positional relationship between each of the sound sources and a reference sub-microphone, the reference sub-microphone being either one of the two first sub-microphones;
    according to the deployment position of the reference sub-microphone on the imaging device and the first relative positional relationship corresponding to each of the sound sources, converting to obtain a second relative positional relationship between each of the sound sources and the imaging device;
    according to the acquisition times of the video data and each of the audio data, a first motion trajectory of the imaging device, and each of the second relative positional relationships, constructing a second motion trajectory corresponding to each of the sound sources.
  16. The computer-readable storage medium according to claim 15, characterized in that the first relative positional relationship includes a first distance and a first angle, and the step of converting, according to the deployment position of the reference sub-microphone on the imaging device and the first relative positional relationship corresponding to each of the sound sources, to obtain the second relative positional relationship between each of the sound sources and the imaging device comprises:
    calculating the complement of the first angle;
    retrieving the straight-line distance between the reference sub-microphone and the imaging device, and substituting the complement of the first angle, the straight-line distance, and the first distance into a calculation formula to obtain a second distance, wherein the calculation formula is:
    a = √(b² + c² - 2·b·c·cos β)
    where b is the first distance, c is the straight-line distance, β is the complement of the first angle, and a is the second distance, representing the distance between the imaging device and the sound source;
    according to the first angle and the first distance, calculating the vertical distance between the reference sub-microphone and the sound source via the law-of-cosines formula;
    according to the second distance and the vertical distance, calculating a second angle between the imaging device and the sound source via the law-of-cosines formula, wherein the vertical distance between the imaging device and the sound source has the same value as the vertical distance between the reference sub-microphone and the sound source;
    calculating, according to the above rules, the second distance and the second angle respectively corresponding to each of the sound sources and the imaging device, and generating each of the second relative positional relationships.
  17. The computer-readable storage medium according to claim 15, characterized in that the step of performing VAD recognition on the sounds contained in each of the audio data to obtain several sound sources comprises:
    performing VAD recognition on the sounds contained in each of the audio data, obtaining several human-voice sound sources and other-type sound sources;
    marking and numbering each of the human-voice sound sources, detecting the decibel value of each of the other-type sound sources, hiding the first other-type sound sources whose decibel value is below a decibel threshold, and marking and numbering the second other-type sound sources whose decibel value is above the decibel threshold.
  18. The computer-readable storage medium according to claim 17, wherein the step of marking and numbering the second other-type sound sources whose decibel values are above the decibel threshold comprises:
    inputting the sounds respectively corresponding to the second other-type sound sources into a pre-trained sound-type recognition model for recognition, to obtain the sound types respectively corresponding to the second other-type sound sources;
    using the sound types as marking information to mark and number each of the second other-type sound sources.
  19. The computer-readable storage medium according to claim 15, wherein the step of constructing, according to the collection times of the video data and each of the audio data, the first motion trajectory of the imaging device, and each of the second relative positional relationships, the second motion trajectories respectively corresponding to the sound sources comprises:
    performing time synchronization based on the collection moments of the video data and each of the audio data respectively, and locating the appearance time of each of the sound sources in the video data;
    collecting the first motion trajectory of the imaging device by a GPS positioning method, and, using the first motion trajectory as a position reference, constructing the second motion trajectory of each sound source relative to the first motion trajectory according to the appearance time corresponding to each sound source and each of the second relative positional relationships.
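The construction in this claim can be sketched as polar offsets applied to the camera's GPS track at each synchronized appearance time. The flat local coordinate frame, the bearing convention, and the data layout are illustrative assumptions, not the patent's implementation.

```python
import math

def source_track(camera_track, appearances):
    """Build one sound source's trajectory relative to the camera's track.

    camera_track: dict mapping timestamp -> (x, y) camera position,
        i.e. the first motion trajectory sampled by GPS.
    appearances: list of (timestamp, distance, bearing_deg) for one
        source -- its second relative position at each appearance time.
    Returns the source's second motion trajectory as (timestamp, x, y).
    """
    track = []
    for t, dist, bearing in appearances:
        cx, cy = camera_track[t]            # camera position at that moment
        rad = math.radians(bearing)
        track.append((t, cx + dist * math.cos(rad),
                         cy + dist * math.sin(rad)))
    return track
```

Running this once per numbered sound source, with its own appearance times, yields the full set of second motion trajectories anchored to the first.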
  20. The computer-readable storage medium according to claim 15, wherein, after the step of constructing, according to the collection times of the video data and each of the audio data, the first motion trajectory of the imaging device, and each of the second relative positional relationships, the second motion trajectories respectively corresponding to the sound sources, the method comprises:
    drawing each of the second motion trajectories with lines of different colors, and recording the correspondence between each color and each of the sound sources to form correspondence information;
    generating a trajectory distribution map according to the first motion trajectory, the correspondence information and each of the second motion trajectories, and outputting the trajectory distribution map to a display interface.
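A minimal sketch of the color-assignment and map-assembly step above. The fixed palette, the extra 'camera' layer, and the returned data layout are illustrative assumptions; a real display interface would consume these layers however it sees fit.

```python
def build_distribution_map(camera_track, source_tracks,
                           palette=('red', 'green', 'blue', 'orange')):
    """Bundle the first motion trajectory and colored second trajectories.

    source_tracks: dict mapping source id -> list of (t, x, y) points.
    Returns (correspondence, layers): correspondence records which color
    denotes which sound source, as the claim requires; layers is an
    ordered list of (label, color, points) ready for rendering.
    """
    correspondence = {sid: palette[i % len(palette)]
                      for i, sid in enumerate(sorted(source_tracks))}
    layers = [('camera', 'black', camera_track)]
    layers += [(sid, correspondence[sid], trk)
               for sid, trk in sorted(source_tracks.items())]
    return correspondence, layers
```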
PCT/CN2021/111895 2021-08-04 2021-08-10 Target trajectory calibration method based on video and audio, and computer device WO2023010599A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110891951.2A CN113794830A (en) 2021-08-04 2021-08-04 Target track calibration method and device based on video and audio and computer equipment
CN202110891951.2 2021-08-04

Publications (1)

Publication Number Publication Date
WO2023010599A1 2023-02-09

Family

ID=79181397

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/111895 WO2023010599A1 (en) 2021-08-04 2021-08-10 Target trajectory calibration method based on video and audio, and computer device

Country Status (2)

Country Link
CN (1) CN113794830A (en)
WO (1) WO2023010599A1 (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101295015A (en) * 2007-04-23 2008-10-29 财团法人工业技术研究院 Sound source locating system and method
CN101567969A (en) * 2009-05-21 2009-10-28 上海交通大学 Intelligent video director method based on microphone array sound guidance
JP2010232888A (en) * 2009-03-26 2010-10-14 Ikegami Tsushinki Co Ltd Monitor device
CN109118610A (en) * 2018-08-17 2019-01-01 北京云鸟科技有限公司 A kind of track inspection method and device
CN111145736A (en) * 2019-12-09 2020-05-12 华为技术有限公司 Speech recognition method and related equipment
CN112261361A (en) * 2020-09-25 2021-01-22 江苏聆世科技有限公司 Microphone array and dome camera linked abnormal sound source monitoring method and system
CN112492207A (en) * 2020-11-30 2021-03-12 深圳卡多希科技有限公司 Method and device for controlling rotation of camera based on sound source positioning
CN112995566A (en) * 2019-12-17 2021-06-18 佛山市云米电器科技有限公司 Sound source positioning method based on display equipment, display equipment and storage medium

Family Cites Families (5)

Publication number Priority date Publication date Assignee Title
NO318096B1 (en) * 2003-05-08 2005-01-31 Tandberg Telecom As Audio source location and method
CN101009775A (en) * 2006-01-23 2007-08-01 株式会社理光 Imaging apparatus, position information recording method and computer programme product
CN100556151C (en) * 2006-12-30 2009-10-28 华为技术有限公司 A kind of video terminal and a kind of audio code stream processing method
JP2016109971A (en) * 2014-12-09 2016-06-20 キヤノン株式会社 Signal processing system and control method of signal processing system
CN107677992B (en) * 2017-09-30 2021-06-22 深圳市沃特沃德股份有限公司 Movement detection method and device and monitoring equipment

Cited By (2)

Publication number Priority date Publication date Assignee Title
CN116518989A (en) * 2023-07-05 2023-08-01 新唐信通(浙江)科技有限公司 Method for vehicle navigation based on sound and thermal imaging
CN116518989B (en) * 2023-07-05 2023-09-12 新唐信通(浙江)科技有限公司 Method for vehicle navigation based on sound and thermal imaging

Also Published As

Publication number Publication date
CN113794830A (en) 2021-12-14

Similar Documents

Publication Publication Date Title
Pérez‐Granados et al. Estimating bird density using passive acoustic monitoring: a review of methods and suggestions for further research
Liu et al. End-to-end trajectory transportation mode classification using Bi-LSTM recurrent neural network
US8676728B1 (en) Sound localization with artificial neural network
US9171548B2 (en) Methods and systems for speaker identity verification
CN108089152B (en) Equipment control method, device and system
CN105979442B (en) Noise suppressing method, device and movable equipment
US20180018970A1 (en) Neural network for recognition of signals in multiple sensory domains
KR102230667B1 (en) Method and apparatus for speaker diarisation based on audio-visual data
WO2020000697A1 (en) Behavior recognition method and apparatus, computer device, and storage medium
CN107949866A (en) Image processing apparatus, image processing system and image processing method
CN108229441A (en) A kind of classroom instruction automatic feedback system and feedback method based on image and speech analysis
EP0947161A2 (en) Measurement and validation of interaction and communication
WO2022179453A1 (en) Sound recording method and related device
WO2016119107A1 (en) Noise map drawing method and apparatus
JP2017207877A (en) Behavioral analysis device and program
WO2023010599A1 (en) Target trajectory calibration method based on video and audio, and computer device
KR101884446B1 (en) Speaker identification and speaker tracking method for Multilateral conference environment
CN113759938B (en) Unmanned vehicle path planning quality evaluation method and system
US20200333429A1 (en) Sonic pole position triangulation in a lighting system
Stattner et al. Acoustic scheme to count bird songs with wireless sensor networks
Kojima et al. HARK-Bird-Box: A portable real-time bird song scene analysis system
CN111273232B (en) Indoor abnormal condition judging method and system
Huetz et al. Bioacoustics approaches to locate and identify animals in terrestrial environments
Sturley et al. PANDI: a hybrid open source edge-based system for environmental and real-time passive acoustic monitoring-prototype design and development
CN109117765A (en) Video investigation device and method

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE