CN113794813A - Method and device for controlling sound and picture synchronization and computer storage medium - Google Patents

Method and device for controlling sound and picture synchronization and computer storage medium Download PDF

Info

Publication number
CN113794813A
Authority
CN
China
Prior art keywords
sound
picture
target
information
time stamp
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111352216.0A
Other languages
Chinese (zh)
Other versions
CN113794813B (en)
Inventor
肖兵
陈宇
黄昌松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhuhai Shixi Technology Co Ltd
Original Assignee
Zhuhai Shixi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhuhai Shixi Technology Co Ltd
Priority to CN202111352216.0A
Publication of CN113794813A
Application granted
Publication of CN113794813B
Legal status: Active

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00Details of television systems
    • H04N5/04Synchronising
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/04Time compression or expansion
    • G10L21/055Time compression or expansion for synchronising with other signals, e.g. video signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

The application discloses a method and a device for controlling sound and picture synchronization, and a computer storage medium, which are used for reducing the adverse effect of sound-picture asynchronism in input data on the final video image processing result. The method comprises the following steps: receiving sound source positioning information and target detection information in parallel, wherein the sound source positioning information comprises an audio time stamp, and the target detection information comprises a picture time stamp; configuring a pre-created sound and picture information unit according to the audio time stamp; determining a matched target sound and picture information unit according to the picture time stamp; and updating the target detection information to the target sound and picture information unit.

Description

Method and device for controlling sound and picture synchronization and computer storage medium
Technical Field
The present application relates to the field of video image processing technologies, and in particular, to a method and an apparatus for controlling audio and video synchronization, and a computer storage medium.
Background
As video conferencing has become widespread, some intelligent conference systems on the market can automatically provide a close-up picture of the speaker during a video conference, so that other participants can clearly see the speaker's facial expressions and body movements, which greatly improves the meeting experience.
Among these, how to determine the position of the speaker in the picture is particularly critical. The prior art includes array-microphone sound source localization, image recognition techniques (such as portrait detection, mouth opening/closing detection, standing-motion detection, and facial motion recognition), and schemes that combine the two.
By comparison, a scheme combining sound source localization and image recognition can consider audio and video information simultaneously, and is therefore more reliable. However, the sound source localization module and the target detection module are independent of each other and generally exhibit delays of different degrees; when the two modules run on different platforms, the delay difference is even larger. This affects the final video image output and significantly degrades the user experience.
Disclosure of Invention
The application provides a method, a device and a computer storage medium for controlling sound and picture synchronization, which are used for reducing the adverse effect of sound-picture asynchronism in input data on the final video image processing result.
The application provides a method for controlling sound and picture synchronization in a first aspect, which comprises the following steps:
receiving sound source positioning information and target detection information in parallel, wherein the sound source positioning information comprises an audio time stamp, and the target detection information comprises a picture time stamp;
configuring a pre-created sound and picture information unit according to the audio time stamp;
determining a matched target sound and picture information unit according to the picture time stamp;
and updating the target detection information to the target sound and picture information unit.
Optionally, after configuring the pre-created sound and picture information unit according to the audio time stamp, the method further includes:
storing the sound and picture information unit into a target queue;
the step of determining the matched target sound and picture information unit according to the picture time stamp comprises the following steps:
and determining a matched target sound and picture information unit from the target queue according to the picture time stamp.
Optionally, the determining a matched target sound and picture information unit from the target queue according to the picture time stamp includes:
traversing the target queue in reverse order, and calculating the sound-picture time difference of each sound-picture information unit in the target queue through a target formula according to the picture time stamp and the audio time stamps;
and determining the sound-picture information unit at the extreme point where the sound-picture time difference changes from decreasing to increasing as the target sound-picture information unit.
Optionally, the target formula is:
ΔT(i)=abs(T1-T2(i)+T0);
where ΔT(i) is the sound-picture time difference, i is the sound-picture information unit index, T1 denotes the picture time stamp in the target detection information, T2(i) denotes the audio time stamp of the i-th sound-picture information unit, and T0 is a preset sound-picture time offset.
Optionally, the sound source positioning information further includes a sound source position;
the updating the target detection information to the target sound and picture information unit comprises:
counting effective target detection results in the target detection information within a preset range of the sound source azimuth, wherein the effective target detection results are effective target bounding box sets or effective target numbers;
and updating the effective target detection result to the target sound and picture information unit.
Optionally, the determining a matched target sound and picture information unit from the target queue according to the picture time stamp includes:
determining the number of targets detected in the target detection information;
and if the number of targets is not 0, determining a matched target sound and picture information unit from the target queue according to the picture time stamp.
Optionally, the determining a matched target sound and picture information unit from the target queue according to the picture time stamp includes:
determining the number of sound and picture information units in the target queue;
and if the number of sound and picture information units in the target queue is not 0, determining a matched target sound and picture information unit from the target queue according to the picture time stamp.
Optionally, the storing the sound and picture information unit to a target queue includes:
judging whether the length of the target queue reaches a preset length or not;
if not, storing the sound and picture information unit to the tail of the target queue;
and if so, deleting the sound and picture information unit at the head of the target queue and storing the sound and picture information unit to the tail of the target queue.
Optionally, the audio timestamp is a sound source timestamp in the sound source positioning information;
or, alternatively,
the audio time stamp is a time stamp when the sound source localization information is received.
Optionally, the picture timestamp is a video frame acquisition timestamp;
or, alternatively,
the picture time stamp is a time stamp of the video frame before target detection is performed.
The second aspect of the present application provides an apparatus for controlling synchronization of sound and picture, comprising:
a receiving unit configured to receive sound source localization information and target detection information in parallel, where the sound source localization information includes an audio time stamp, and the target detection information includes a picture time stamp;
the configuration unit is used for configuring a pre-created sound and picture information unit according to the audio time stamp;
the matching unit is used for determining a matched target sound and picture information unit according to the picture time stamp;
and the updating unit is used for updating the target detection information to the target sound and picture information unit.
Optionally, the apparatus further comprises:
the storage unit is used for storing the sound and picture information unit into a target queue;
the matching unit is specifically configured to:
and determining a matched target sound and picture information unit from the target queue according to the picture time stamp.
Optionally, the matching unit is specifically configured to:
traversing the target queue in reverse order, and calculating the sound-picture time difference of each sound-picture information unit in the target queue through a target formula according to the picture time stamp and the audio time stamps;
and determining the sound-picture information unit at the extreme point where the sound-picture time difference changes from decreasing to increasing as the target sound-picture information unit.
Optionally, the target formula is:
ΔT(i)=abs(T1-T2(i)+T0);
where ΔT(i) is the sound-picture time difference, i is the sound-picture information unit index, T1 denotes the picture time stamp in the target detection information, T2(i) denotes the audio time stamp of the i-th sound-picture information unit, and T0 is a preset sound-picture time offset.
Optionally, the sound source positioning information further includes a sound source position;
the update unit is specifically configured to:
counting effective target detection results in the target detection information within a preset range of the sound source azimuth, wherein the effective target detection results are effective target bounding box sets or effective target numbers;
and updating the effective target detection result to the target sound and picture information unit.
Optionally, the apparatus further comprises:
a first determination unit configured to determine the number of targets detected in the target detection information;
the matching unit is specifically configured to:
and when the first determining unit determines that the number of targets is not 0, determining a matched target sound and picture information unit from the target queue according to the picture time stamp.
Optionally, the apparatus further comprises:
the second determining unit is used for determining the number of the sound and picture information units in the target queue;
the matching unit is specifically further configured to:
and when the second determining unit determines that the number of the sound and picture information units in the target queue is not 0, determining the matched target sound and picture information units from the target queue according to the picture time stamp.
Optionally, the storage unit is specifically configured to:
judging whether the length of the target queue reaches a preset length or not;
if not, storing the sound and picture information unit to the tail of the target queue;
and if so, deleting the sound and picture information unit at the head of the target queue and storing the sound and picture information unit to the tail of the target queue.
Optionally, the audio timestamp is a sound source timestamp in the sound source positioning information;
or, alternatively,
the audio time stamp is a time stamp when the sound source localization information is received.
Optionally, the picture timestamp is a video frame acquisition timestamp;
or, alternatively,
the picture time stamp is a time stamp of the video frame before target detection is performed.
A third aspect of the present application provides an apparatus for controlling synchronization of sound and picture, the apparatus comprising:
the device comprises a processor, a memory, an input and output unit and a bus;
the processor is connected with the memory, the input and output unit and the bus;
the memory stores a program, and the processor calls the program to execute the method for controlling sound and picture synchronization according to the first aspect or any optional implementation of the first aspect.
A fourth aspect of the present application provides a computer-readable storage medium having a program stored thereon, wherein when the program is executed on a computer, the method for controlling sound and picture synchronization according to the first aspect or any optional implementation of the first aspect is performed.
According to the technical scheme, the method has the following advantages:
the sound source positioning information and the target detection result are respectively processed, firstly, pre-created sound and picture information units are configured according to audio time stamps in the sound source positioning information, then, corresponding target sound and picture information units are matched according to picture time stamps in the received target detection result, and then, the target detection result is updated to the matched target sound and picture information units. The method has better tolerance on the delay of sound source positioning and target detection, and can greatly reduce the adverse effect of asynchronous sound and picture of input data on the final video picture processing result in practical application, thereby improving the user experience.
Drawings
In order to more clearly illustrate the technical solutions in the present application, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present application; other drawings can be obtained from them by those skilled in the art without creative effort.
FIG. 1 is a schematic flow chart illustrating an embodiment of a method for controlling audio-visual synchronization provided in the present application;
FIG. 2 is a schematic flow chart illustrating a method for controlling audio-visual synchronization according to another embodiment of the present disclosure;
FIG. 3 is a graph showing a relationship between a sound-picture information unit and a sound-picture time difference in the method for controlling sound-picture synchronization according to the present application;
FIG. 4 is another graph showing a relationship between a sound-picture information unit and a sound-picture time difference in the method for controlling sound-picture synchronization provided by the present application;
FIG. 5 is a schematic structural diagram illustrating an embodiment of an apparatus for controlling synchronization of sound and pictures provided in the present application;
FIG. 6 is a schematic structural diagram of another embodiment of the apparatus for controlling synchronization of sound and pictures provided in the present application;
fig. 7 is a schematic structural diagram of another embodiment of the apparatus for controlling synchronization of sound and picture provided by the present application.
Detailed Description
The application provides a method, a device and a computer storage medium for controlling sound and picture synchronization, which are used for reducing the adverse effect of sound-picture asynchronism in input data on the final video image processing result.
It should be noted that the method for controlling sound and picture synchronization provided by the present application may be applied to a terminal or to a server. The terminal may be a mobile terminal such as a smartphone, a tablet computer, a smart watch, or a portable computer, or a fixed terminal such as a desktop computer, a smart television, or a video-conference terminal. For convenience of explanation, the terminal is taken as the execution subject in the present application.
Referring to fig. 1, fig. 1 is a diagram illustrating an embodiment of a method for controlling audio-video synchronization provided by the present application, the method including:
101. receiving sound source positioning information and target detection information in parallel, wherein the sound source positioning information comprises an audio time stamp, and the target detection information comprises a picture time stamp;
at present, the sound source positioning technology is widely applied, in a multimedia video conference, the position and the direction angle of a sound source can be estimated through the sound source positioning technology, in addition, the speaker in a picture can be positioned through the image recognition technology based on target detection, and therefore sound and video information can be considered simultaneously through the scheme combining the sound source positioning technology and the image recognition technology, and the calculation result of a final close-up picture is more reliable.
In this embodiment, the terminal receives audio through a microphone device and derives sound source positioning information from it using sound source localization; in parallel, it receives video through a camera device and derives target detection information from it using image recognition. The purpose of parallel reception is to improve the processing efficiency of the terminal.
The sound source positioning information received by the terminal includes an audio time stamp, which may be a sound source time stamp or a time stamp of the received sound source positioning information, and the sound source positioning information further includes a sound source position, which may be a one-dimensional sound source angle or a two-dimensional or three-dimensional sound source position, and is not limited herein.
The target detection information received by the terminal includes a picture time stamp, which may be a capture time stamp of the video frame or a time stamp of the video frame before target detection; the target detection information further includes a set of target detection bounding boxes.
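As an illustrative, non-limiting sketch, the two parallel inputs of step 101 could be represented as follows. All field names and units are assumptions for illustration, not taken from the patent:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class SoundSourceInfo:
    """Sound source positioning information (step 101); field names are hypothetical."""
    audio_timestamp_ms: int  # sound-source timestamp, or receive-time timestamp
    azimuth_deg: float       # one-dimensional sound-source angle

@dataclass
class TargetDetectionInfo:
    """Target detection information (step 101); field names are hypothetical."""
    picture_timestamp_ms: int                        # capture or pre-detection timestamp
    bounding_boxes: List[Tuple[int, int, int, int]]  # (cx, cy, w, h) per detected target
```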
102. Configuring pre-created sound-picture information units according to audio time stamps
A plurality of sound-picture information units are pre-created on the terminal as carriers for storing a time stamp, a sound source azimuth, and a valid target detection result. Through a series of processing steps, the terminal writes the processed sound source positioning information and target detection information into these units, so that when a close-up picture needs to be calculated, the time stamp, the sound source azimuth, and the valid target detection result can be read from the units and the output video image can be calculated on that basis.
The terminal sets the time stamp in the sound-picture information unit to the audio time stamp of the received sound source positioning information; the audio time stamp may be the sound source time stamp in the sound source positioning information or the time stamp at which the sound source positioning information was received. In addition, the terminal sets the sound source azimuth stored in the sound-picture information unit to the sound source azimuth of the received sound source positioning information, and initializes the valid target detection result stored in the unit to an invalid value, in preparation for storing the target detection information received next.
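The configuration described above (step 102) might be sketched as follows, using a plain dictionary as the sound-picture information unit and `None` as the invalid value; both choices are assumptions for illustration:

```python
def configure_unit(audio_timestamp_ms: int, azimuth_deg: float) -> dict:
    # Set the unit's time stamp to the audio time stamp, record the
    # sound-source azimuth, and initialize the valid target detection
    # result to an invalid value (None) until step 104 fills it in.
    return {
        "timestamp_ms": audio_timestamp_ms,
        "azimuth_deg": azimuth_deg,
        "valid_result": None,
    }
```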
103. Determining a matched target audio-visual information unit according to the visual time stamp;
and the terminal matches the sound-picture information unit with the nearest time in the sound-picture information units according to the picture timestamp of the received target detection information, and determines the sound-picture information unit as the target sound-picture information unit.
104. And updating the target detection information to a target sound and picture information unit.
For the matched target sound-picture information unit, the terminal updates the target detection information into it. Because the audio time stamp and the picture time stamp have been aligned, the sound-picture time difference of the data stored in the target sound-picture information unit is reduced or eliminated; when the terminal needs to output a video image result, it can therefore read the calibrated sound-picture data from the unit and calculate the output video image on that basis.
In this embodiment, the sound source positioning information and the target detection information are processed separately: a pre-created sound-picture information unit is configured according to the audio time stamp in the sound source positioning information, a corresponding target sound-picture information unit is matched according to the picture time stamp in the received target detection information, and the target detection result is then updated into the matched target sound-picture information unit. The method tolerates the delays of sound source localization and target detection well, and in practical applications can greatly reduce the adverse effect of sound-picture asynchronism in the input data on the final video picture processing result, thereby improving the user experience.
Referring to fig. 2, fig. 2 is another embodiment of the method for controlling audio-video synchronization provided in the present application, where the method includes:
201. receiving sound source positioning information and target detection information in parallel, wherein the sound source positioning information comprises an audio time stamp, and the target detection information comprises a picture time stamp;
202. configuring a pre-created sound and picture information unit according to the audio time stamp;
in this embodiment, steps 201 to 202 are similar to steps 101 to 102 of the previous embodiment, and are not described again here.
It should be noted that the audio time stamp in the sound source positioning information acquired by the terminal may be the sound source time stamp, or the time stamp at which the terminal received the sound source positioning information. The former is closer to the true time stamp and has less theoretical deviation, but is often not readily available; the latter has more theoretical deviation but is easy to obtain. In this embodiment, the calculation of the sound-picture time difference includes sound-picture time compensation, so the deviation can be corrected according to actual conditions. Whichever of the two time stamps is selected, the method can therefore obtain a relatively accurate result.
Similarly, the picture time stamp in the target detection information acquired by the terminal may be the capture time stamp of the video frame, or the time stamp of the video frame before target detection. The former is closer to the real time and has less theoretical deviation, but is not always obtainable, so the latter is generally used in practice. As described above, the sound and picture synchronization method provided in this embodiment compensates well for the sound-picture time deviation, so the specific choice of time stamp has no significant influence on the final video processing result.
203. Storing the sound and picture information unit into a target queue;
the terminal stores the unit of sound and picture information in the form of a target queue.
In some embodiments, the length of the target queue is fixed and new sound-picture information units are added from the tail of the queue. Before adding, it is necessary to determine whether the length of the target queue has reached a set value (the preset length); if not, step a is performed, and if so, step b is performed.
a) And if the length of the target queue does not reach the set value, directly adding a new sound and picture information unit into the target queue from the tail of the target queue.
b) If the length of the target queue reaches the set value, a sound and picture information unit is deleted from the head of the target queue, and then a new sound and picture information unit is added into the target queue from the tail of the target queue.
204. Traversing the target queue in a reverse order, and calculating the sound and picture time differences of all sound and picture information units in the target queue through a target formula according to the picture time stamps and the audio time stamps;
in this embodiment, before determining the matching target sound-picture information unit, the target detection information and the status of the target queue need to be checked, which is described in detail below:
firstly, checking input target detection information:
the terminal detects the number of targets detected in the input target detection information, namely the number of detected faces and/or heads and/or bodies, and if the number of targets is 0, the terminal directly returns to receive new sound source positioning information and target detection information. And if the target number is not 0, matching the target sound and picture information units.
Secondly, checking a target queue:
and if the target queue has no sound and picture information unit, directly returning and receiving new sound source positioning information and target detection information. And if the sound and picture information units exist in the target queue, matching the target sound and picture information units.
It should be noted that the two checks above have no required order and can be performed simultaneously. After both checks pass, the terminal matches the sound-picture information unit closest in time according to the picture time stamp in the target detection information.
Specifically, the terminal traverses the sound-picture information units in the target queue in reverse order. For each sound-picture information unit, the terminal calculates the sound-picture time difference ΔT according to the target formula. If ΔT exceeds a preset maximum audio time interval, the terminal returns directly to receive new sound source positioning information and target detection information; otherwise, the terminal continues traversing until it finds the extreme point at which ΔT changes from decreasing to increasing, and the sound-picture information unit at that position is the matched target sound-picture information unit. The preset maximum audio time interval can be set according to the recording conditions of different devices.
Specifically, the target formula is as follows:
ΔT(i)=abs(T1-T2(i)+T0);
where ΔT(i) is the sound-picture time difference, i is the sound-picture information unit index, T1 denotes the picture time stamp in the target detection information, T2(i) denotes the audio time stamp of the i-th sound-picture information unit, and T0 is a preset sound-picture time offset.
Specifically, T0 is used to compensate for sound-picture asynchronism and can be set and adjusted according to the actual conditions of different devices: if the audio time stamp lags behind the picture time stamp, T0 is positive; otherwise, T0 is negative. When the delays of the audio time stamp and the picture time stamp are close, T0 approaches 0.
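The target formula can be written directly as a one-line function; parameter names and the millisecond unit are illustrative assumptions:

```python
def av_time_diff(t1_ms: int, t2_ms: int, t0_ms: int = 0) -> int:
    # Target formula: ΔT(i) = abs(T1 - T2(i) + T0)
    # t1_ms: picture time stamp T1 of the incoming target detection information
    # t2_ms: audio time stamp T2(i) stored in the i-th sound-picture unit
    # t0_ms: preset sound-picture time offset T0 (positive when the audio
    #        time stamp lags the picture time stamp, negative otherwise)
    return abs(t1_ms - t2_ms + t0_ms)
```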
205. Determining the sound-picture information unit at the extreme point where the sound-picture time difference changes from decreasing to increasing as the target sound-picture information unit;
the reason for determining the extreme point position with the sound-picture time difference from small to large as the position to be matched is that the sound-picture information units in the target queue are traversed in a reverse order, and the new sound-picture information units are added from the tail of the queue, and the corresponding audio time stamps are from new to old, namely, the earlier the audio time stamps are traversed, the earlier the audio time stamps are. In the case where Δ T does not exceed the preset maximum audio time interval, as shown in fig. 3 and 4, there are only two trends from right to left:
1) as shown in fig. 3, Δ T becomes smaller and then larger;
2) as shown in fig. 4, Δ T is increased stepwise.
Therefore, only at the extreme point (the lowest point) where the sound-picture time difference changes from decreasing to increasing are the actual times of the audio and the picture closest. That position is taken as the matching point of sound-picture synchronization, that is, the sound-picture information unit at that position is determined as the target sound-picture information unit.
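Steps 204–205 can be sketched as a single reverse traversal that stops at the first turning point of ΔT; the unit representation and the default 500 ms maximum audio time interval are illustrative assumptions, not values from the patent:

```python
def match_unit(queue, t1_ms, t0_ms=0, max_gap_ms=500):
    # Traverse the queue in reverse order (tail first, newest audio time
    # stamps first), computing ΔT(i) = abs(T1 - T2(i) + T0) for each unit.
    best, best_dt = None, None
    for unit in reversed(queue):
        dt = abs(t1_ms - unit["timestamp_ms"] + t0_ms)
        if dt > max_gap_ms:
            return None        # exceeds the preset maximum audio time interval
        if best_dt is not None and dt > best_dt:
            break              # ΔT changed from decreasing to increasing
        best, best_dt = unit, dt
    return best                # unit at the local minimum of ΔT, or None
```

In the fig. 4 case (ΔT increasing stepwise from the tail), the loop breaks on the second unit and returns the tail unit; in the fig. 3 case it keeps updating until ΔT turns upward.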
206. Counting effective target detection results in the target detection information within a preset range of the sound source azimuth;
the sound source positioning information received by the terminal includes a sound source position, and the sound source position may be a one-dimensional sound source angle or a two-dimensional or three-dimensional sound source position, which is not limited herein. For the matched target sound and picture information units, the terminal counts target detection results within a preset range of the sound source azimuth (sound source angle) of the terminal as effective target detection results.
In some specific embodiments, the valid target detection result in the sound and picture information unit stores a set of valid target bounding boxes, i.e. target bounding boxes within the azimuth range of the statistical sound source, and determines it as a valid target detection result. The object bounding box refers to a rectangular object detection frame generated in object detection, which is used for positioning the position of an object in an image and is generally determined by using the horizontal and vertical coordinates of the center point of the object in combination with the length and width of the bounding box.
In other specific embodiments, the valid target detection result in the sound and picture information unit stores the number of valid targets, that is, the number of targets detected in the azimuth range of the statistical sound source, and determines it as the valid target detection result. For example, the target detection task is to detect face information, the sound source direction is 60 ° direction in the video image, the preset range is 30 °, the terminal counts the target detection results (the number of detected faces) in the range of 30 ° to 90 ° in the video image, and if there are 2 target detection results in the range, the number of valid target detection results is 2. Preferably, the valid target detection result stored in the sound and picture information unit is the number of valid targets.
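Assuming each detection result carries an azimuth angle in degrees (a hypothetical field, not specified by the patent), the counting in step 206 can be sketched as:

```python
def count_valid_targets(detections, source_angle, preset_range=30.0):
    """Count the detections whose angle lies within +/- preset_range
    degrees of the sound source angle; results outside the window are
    not considered."""
    lo, hi = source_angle - preset_range, source_angle + preset_range
    return sum(1 for d in detections if lo <= d["angle"] <= hi)

# Faces detected at 25, 45 and 80 degrees with the sound source at 60 degrees:
faces = [{"angle": 25.0}, {"angle": 45.0}, {"angle": 80.0}]
print(count_valid_targets(faces, 60.0))  # -> 2 (the faces at 45 and 80 degrees)
```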
207. Updating the valid target detection result to the target sound-picture information unit.
The terminal updates only the valid target detection results to the target sound-picture information unit; target detection results outside the preset range are not considered, which further improves the accuracy of the close-up picture calculation.
In this embodiment, the sound source localization information and the target detection result are processed separately: a pre-created sound-picture information unit is configured according to the audio time stamp in the sound source localization information, the corresponding target sound-picture information unit is matched according to the picture time stamp in the received target detection result, and the target detection result is then updated to the matched target sound-picture information unit. Further, when determining the matched target sound-picture information unit, this embodiment matches by the sound-picture time difference: the sound-picture information unit at the extreme point position where the sound-picture time difference in the target queue changes from decreasing to increasing is determined as the target sound-picture information unit, and the valid target detection results within the preset range of the sound source bearing are then counted and updated to that unit, so that the subsequent video image calculation can read the sound-picture data from the target sound-picture information unit.
The method is well tolerant of the delays of sound source localization and target detection, and in practical applications it can greatly reduce the adverse effect that the sound-picture asynchronism of the input data has on the final video picture processing result, thereby improving the user experience.
Referring to fig. 5, fig. 5 is a diagram illustrating an embodiment of an apparatus for controlling audio-video synchronization provided in the present application, the apparatus including:
a receiving unit 501, configured to receive sound source positioning information and target detection information in parallel, where the sound source positioning information includes an audio time stamp, and the target detection information includes a picture time stamp;
a configuration unit 502, configured to configure a pre-created sound and picture information unit according to the audio time stamp;
a matching unit 503, configured to determine a matched target audio-visual information unit according to the screen timestamp;
an updating unit 504, configured to update the target detection information to the target sound-picture information unit.
In this embodiment, the receiving unit 501 receives the sound source localization information and the target detection result and processes them separately, the configuration unit 502 configures a pre-created sound-picture information unit according to the audio time stamp in the sound source localization information, the matching unit 503 matches the corresponding target sound-picture information unit according to the picture time stamp in the received target detection result, and the updating unit 504 updates the target detection result to the matched target sound-picture information unit. The apparatus is well tolerant of the delays of sound source localization and target detection, and in practical applications it can greatly reduce the adverse effect of the sound-picture asynchronism of the input data on the final video picture processing result, thereby improving the user experience.
Referring to fig. 6, fig. 6 is another embodiment of the apparatus for controlling audio-visual synchronization provided in the present application, and the apparatus includes:
a receiving unit 601, configured to receive sound source positioning information and target detection information in parallel, where the sound source positioning information includes an audio time stamp, and the target detection information includes a picture time stamp;
a configuration unit 602, configured to configure a pre-created sound and picture information unit according to the audio time stamp;
a matching unit 603 for determining a matched target sound-picture information unit according to the picture time stamp;
an updating unit 604 for updating the target detection information to the target sound-picture information unit.
Optionally, the apparatus further comprises:
a storage unit 605, configured to store the sound and picture information unit into the target queue;
the matching unit 603 is specifically configured to:
and determining a matched target sound-picture information unit from the target queue according to the picture time stamp.
Optionally, the matching unit 603 is specifically configured to:
traversing the target queue in a reverse order, and calculating the sound and picture time differences of all sound and picture information units in the target queue through a target formula according to the picture time stamps and the audio time stamps;
and determining the sound-picture information unit at the extreme point position with the sound-picture time difference from small to large as a target sound-picture information unit.
Optionally, the target formula is:
ΔT(i)=abs(T1-T2(i)+T0);
where Δ T is a sound-picture time difference, i is a sound-picture information unit number, T1 denotes a picture time stamp in the target detection information, T2(i) denotes an audio time stamp of the ith sound-picture information unit, and T0 is a preset sound-picture time offset.
Optionally, the sound source positioning information further includes a sound source position;
the updating unit 604 is specifically configured to:
counting effective target detection results in the target detection information within a preset range of the sound source azimuth, wherein the effective target detection results are effective target bounding box sets or effective target numbers;
and updating the effective target detection result to a target sound and picture information unit.
Optionally, the apparatus further comprises:
a first determining unit 606 for determining the number of targets detected in the target detection information;
the matching unit 603 is specifically configured to determine a matched target audio-visual information unit from the target queue according to the visual time stamp when the first determining unit determines that the target number is not 0.
Optionally, the apparatus further comprises:
a second determining unit 607 for determining the number of the sound and picture information units in the target queue;
the matching unit 603 is further specifically configured to determine a matched target sound-picture information unit from the target queue according to the picture time stamp when the second determining unit determines that the number of sound-picture information units in the target queue is not 0.
Optionally, the storage unit 605 is specifically configured to:
judging whether the length of the target queue reaches a preset length or not;
if not, storing the sound and picture information unit to the tail of the target queue;
and if so, deleting the sound and picture information unit at the head of the target queue and storing the sound and picture information unit to the tail of the target queue.
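This fixed-length policy (delete the head unit when the preset length is reached, then store at the tail) matches the behavior of a bounded deque; a Python sketch under that assumption, with a hypothetical preset length of 3:

```python
from collections import deque

PRESET_LENGTH = 3  # hypothetical preset queue length

# deque(maxlen=...) discards the head element automatically once the
# preset length is reached, so a single append at the tail implements
# the judge / delete-head / store-at-tail steps described above.
target_queue = deque(maxlen=PRESET_LENGTH)
for ts in (1, 2, 3, 4):
    target_queue.append({"audio_ts": ts})

print([u["audio_ts"] for u in target_queue])  # -> [2, 3, 4]
```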
Optionally, the audio timestamp is a sound source timestamp in the sound source positioning information;
or,
the audio time stamp is the time stamp when the sound source localization information is received.
Optionally, the picture timestamp is a video frame acquisition timestamp;
or,
the picture time stamp is the time stamp of the video frame before target detection is performed.
In the device of this embodiment, the functions of each unit correspond to the steps in the method embodiment shown in fig. 2, and are not described herein again.
Referring to fig. 7, fig. 7 is an embodiment of a device for controlling audio and video synchronization, where the device includes:
a processor 701, a memory 702, an input/output unit 703, a bus 704;
the processor 701 is connected with the memory 702, the input/output unit 703 and the bus 704;
the memory 702 holds a program that the processor 701 calls to perform any of the above methods of controlling the synchronization of sound and pictures.
The present application also relates to a computer-readable storage medium having a program stored thereon, wherein the program, when executed on a computer, causes the computer to perform any of the above-mentioned methods of controlling the synchronization of sound and picture.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed to by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a read-only memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and the like.

Claims (11)

1. A method for controlling sound-picture synchronization, the method comprising:
receiving sound source positioning information and target detection information in parallel, wherein the sound source positioning information comprises an audio time stamp, and the target detection information comprises a picture time stamp;
configuring a pre-created sound and picture information unit according to the audio time stamp;
storing the sound and picture information unit into a target queue;
traversing the target queue in a reverse order, and calculating sound and picture time differences of all sound and picture information units in the target queue through a target formula according to the picture time stamps and the audio time stamps;
determining the sound-picture information unit at the extreme point position with the sound-picture time difference from small to large as a target sound-picture information unit;
and updating the target detection information to the target sound and picture information unit.
2. The method of claim 1, wherein the target formula is:
ΔT(i)=abs(T1-T2(i)+T0);
where Δ T is a sound-picture time difference, i is a sound-picture information unit number, T1 denotes a picture time stamp in the target detection information, T2(i) denotes an audio time stamp of the ith sound-picture information unit, and T0 is a preset sound-picture time offset.
3. The method according to claim 1, wherein the sound source localization information further includes a sound source bearing;
the updating the target detection information to the target sound and picture information unit comprises:
counting effective target detection results in the target detection information within a preset range of the sound source azimuth, wherein the effective target detection results are effective target bounding box sets or effective target numbers;
and updating the effective target detection result to the target sound and picture information unit.
4. The method of claim 1, wherein the traversing the target queue in a reverse order, and calculating sound and picture time differences of all sound and picture information units in the target queue through a target formula according to the picture time stamps and the audio time stamps, comprises:
determining the number of targets detected in the target detection information;
and if the target number is not 0, traversing the target queue in a reverse order, and calculating the sound-picture time difference of all sound-picture information units in the target queue through a target formula according to the picture time stamp and the audio time stamp.
5. The method of claim 1, wherein the traversing the target queue in a reverse order, and calculating sound and picture time differences of all sound and picture information units in the target queue through a target formula according to the picture time stamps and the audio time stamps, comprises:
determining the number of sound and picture information units in the target queue;
and if the number of the sound and picture information units in the target queue is not 0, traversing the target queue in a reverse order, and calculating the sound and picture time difference of all the sound and picture information units in the target queue through a target formula according to the picture time stamp and the audio time stamp.
6. The method of claim 1, wherein the storing the sound and picture information unit into a target queue comprises:
judging whether the length of the target queue reaches a preset length or not;
if not, storing the sound and picture information unit to the tail of the target queue;
and if so, deleting the sound and picture information unit at the head of the target queue and storing the sound and picture information unit to the tail of the target queue.
7. The method according to any of claims 1 to 6, wherein the audio time stamp is a sound source time stamp in the sound source localization information;
or,
the audio time stamp is a time stamp when the sound source localization information is received.
8. The method according to any one of claims 1 to 6, wherein the picture time stamp is a video frame acquisition time stamp;
or,
the picture time stamp is a time stamp of the video frame before target detection is performed.
9. An apparatus for controlling synchronization of sound and picture, the apparatus comprising:
a receiving unit configured to receive sound source localization information and target detection information in parallel, where the sound source localization information includes an audio time stamp, and the target detection information includes a picture time stamp;
the configuration unit is used for configuring a pre-created sound and picture information unit according to the audio time stamp;
the storage unit is used for storing the sound and picture information unit into a target queue;
the calculation unit is used for traversing the target queue in a reverse order and calculating the sound-picture time difference of all sound-picture information units in the target queue through a target formula according to the picture time stamps and the audio time stamps;
the determining unit is used for determining the sound-picture information unit at the extreme point position with the sound-picture time difference from small to large as a target sound-picture information unit;
and the updating unit is used for updating the target detection information to the target sound and picture information unit.
10. An apparatus for controlling synchronization of sound and picture, the apparatus comprising:
the device comprises a processor, a memory, an input and output unit and a bus;
the processor is connected with the memory, the input and output unit and the bus;
the memory holds a program that the processor calls to perform the method of any one of claims 1 to 8.
11. A computer-readable storage medium having a program stored thereon, which when executed on a computer performs the method of any one of claims 1 to 8.
CN202111352216.0A 2021-11-16 2021-11-16 Method and device for controlling sound and picture synchronization and computer storage medium Active CN113794813B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111352216.0A CN113794813B (en) 2021-11-16 2021-11-16 Method and device for controlling sound and picture synchronization and computer storage medium

Publications (2)

Publication Number Publication Date
CN113794813A true CN113794813A (en) 2021-12-14
CN113794813B CN113794813B (en) 2022-02-11

Family

ID=78955416

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111352216.0A Active CN113794813B (en) 2021-11-16 2021-11-16 Method and device for controlling sound and picture synchronization and computer storage medium

Country Status (1)

Country Link
CN (1) CN113794813B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150037008A1 (en) * 2013-08-05 2015-02-05 Yamaha Corporation Video synchronization based on audio
CN104410894A (en) * 2014-11-19 2015-03-11 大唐移动通信设备有限公司 Audio-video synchronization method and device in wireless environment
US20150341528A1 (en) * 2014-05-22 2015-11-26 Wistron Corporation Method and server system for synchronization of audio/video media files
CN106231226A (en) * 2015-09-21 2016-12-14 零度智控(北京)智能科技有限公司 Audio-visual synthetic method, Apparatus and system
CN110519627A (en) * 2018-05-21 2019-11-29 视联动力信息技术股份有限公司 A kind of synchronous method and device of audio data
CN111669645A (en) * 2020-06-12 2020-09-15 腾讯科技(深圳)有限公司 Video playing method and device, electronic equipment and storage medium
CN112437336A (en) * 2020-11-19 2021-03-02 维沃移动通信有限公司 Audio and video playing method and device, electronic equipment and storage medium


Also Published As

Publication number Publication date
CN113794813B (en) 2022-02-11


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant