CN111050023A - Video detection method and device, terminal equipment and storage medium


Info

Publication number
CN111050023A
Authority
CN
China
Prior art keywords
video
audio
detected
frame
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911302567.3A
Other languages
Chinese (zh)
Inventor
徐易楠
陈泷翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Zhuiyi Technology Co Ltd
Original Assignee
Shenzhen Zhuiyi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Zhuiyi Technology Co Ltd filed Critical Shenzhen Zhuiyi Technology Co Ltd
Priority to CN201911302567.3A
Publication of CN111050023A
Legal status: Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 5/00 Details of television systems
    • H04N 5/04 Synchronising
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 5/00 Details of television systems
    • H04N 5/04 Synchronising
    • H04N 5/06 Generation of synchronising signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The application discloses a video detection method and apparatus, a terminal device, and a storage medium. The method comprises: acquiring audio data and video image data in a video to be detected; extracting audio features corresponding to the audio data and image features corresponding to the video image data; acquiring original audio features corresponding to the audio data and original image features corresponding to the video image data, the original audio features and original image features being the audio and image features extracted in advance, before the video to be detected is sent to the terminal device; comparing the audio features with the original audio features and the image features with the original image features; and determining, according to the first comparison result corresponding to the audio features and the second comparison result corresponding to the image features, a synchronization result of the audio data and the video image data in the video to be detected, the synchronization result being that the audio data and the video image data are either synchronized or not synchronized. The method and apparatus thus enable detection of whether a video's sound and picture are synchronized.

Description

Video detection method and device, terminal equipment and storage medium
Technical Field
The embodiment of the application relates to the technical field of human-computer interaction, in particular to a video detection method, a video detection device, terminal equipment and a storage medium.
Background
With the rapid development of Internet technology, online video viewing has become increasingly widespread. The audio data and video data of a video are usually encoded separately for transmission and then decoded for playback, during which the sound and picture can fall out of synchronization for various reasons, for example performance problems at the playback client such as differing decoders, network delay, or packet loss.
At present, whether the sound and picture of a video are synchronized is generally detected by matching sound and picture timestamps, but this method depends too heavily on the stability of timestamp calibration and delivery.
Disclosure of Invention
In view of the foregoing problems, embodiments of the present application provide a video detection method, an apparatus, a terminal device, and a storage medium, which can detect whether a sound and a picture are synchronous through audio and video features.
In a first aspect, an embodiment of the present application provides a video detection method, which is applied to a terminal device, and the method may include: acquiring audio data and video image data in a video to be detected; extracting audio features corresponding to the audio data, and extracting image features corresponding to the video image data; acquiring original audio features corresponding to the audio data and original image features corresponding to the video image data, wherein the original audio features are audio features which are extracted in advance before the video to be detected is sent to the terminal equipment, and the original image features are image features which are extracted in advance before the video to be detected is sent to the terminal equipment; respectively comparing the audio features with the original audio features, and comparing the image features with the original image features to obtain a first comparison result corresponding to the audio features and a second comparison result corresponding to the image features; and determining a synchronization result of the audio data and the video image data in the video to be detected according to the first comparison result and the second comparison result, wherein the synchronization result comprises synchronization or desynchronization of the audio data and the video image data.
Optionally, after determining a synchronization result of the audio data and the video image data in the video to be detected, the method further includes: when the synchronization result is detected to be asynchronous, acquiring lost time length corresponding to the video to be detected, wherein the lost time length comprises time length corresponding to an audio frame which is not matched with the original audio characteristics in the audio frame set and time length corresponding to an image frame which is not matched with the original image characteristics in the video image frame set; and synchronizing the audio data and the video image data in the video to be detected based on the loss duration.
Optionally, the synchronizing the audio data and the video image data in the video to be detected based on the loss duration may include: when the lost time length meets an error allowance condition, determining a lost frame corresponding to the lost time length, wherein the lost frame comprises at least one of an audio frame and a video image frame; determining a non-lost frame corresponding to the time node of the lost frame in the video to be detected, wherein the non-lost frame is a video image frame when the lost frame is an audio frame, and the non-lost frame is an audio frame when the lost frame is a video image frame; and deleting the non-lost frame from the video to be detected, wherein the audio data and the video image data in the video to be detected after the deletion are synchronized.
Optionally, the method may further comprise: when the lost time length does not meet the error allowance condition, supplementing the lost frame into the video to be detected through a machine learning model.
Optionally, when the lost frame is an audio frame, the supplementing the lost frame to the video to be detected through a machine learning model may include: acquiring a correction frame output by a first machine learning model according to text content corresponding to the audio data, wherein the correction frame is the newly acquired audio frame, and the first machine learning model is pre-trained to output an audio frame corresponding to the text content according to the text content; and generating a corrected video based on the corrected frame and the video to be detected, wherein the corrected video is the video to be detected after the corrected frame is supplemented.
Optionally, the supplementing the lost frame to the video to be detected through a machine learning model may include: acquiring a front-segment video adjacent to and before the video to be detected, and a rear-segment video adjacent to and after the video to be detected, wherein the front-segment video, the rear-segment video and the video to be detected are different segmented videos obtained by segmenting the same complete video by a preset duration; and acquiring a corrected video output by a second machine learning model according to the front and rear segmented videos and the video to be detected, wherein the corrected video is the video to be detected after the lost frame is supplemented, and the second machine learning model is trained in advance to output, according to the video to be detected and its adjacent front and rear segmented videos, the corrected video with the lost frame supplemented.
Optionally, the determining, according to the first comparison result and the second comparison result, a synchronization result of the audio data and the video image data in the video to be detected may include: when at least one of the first comparison result and the second comparison result indicates an inconsistency, determining that the audio data and the video image data in the video to be detected are not synchronized.
Optionally, the video to be detected includes a virtual robot, and the extracting the image features corresponding to the video image data may include: and extracting lip image characteristics of the virtual robot in the video image data.
In a second aspect, an embodiment of the present application provides a video detection apparatus, which is applied to a terminal device, and the video detection apparatus may include: the data acquisition module is used for acquiring audio data and video image data in the video to be detected; the characteristic extraction module is used for extracting audio characteristics corresponding to the audio data and extracting image characteristics corresponding to the video image data; the original characteristic acquisition module is used for acquiring original audio characteristics corresponding to the audio data and original image characteristics corresponding to the video image data, wherein the original audio characteristics are audio characteristics which are extracted in advance before the video to be detected is sent to the terminal equipment, and the original image characteristics are image characteristics which are extracted in advance before the video to be detected is sent to the terminal equipment; the feature comparison module is used for respectively comparing the audio features with the original audio features and comparing the image features with the original image features to obtain a first comparison result corresponding to the audio features and a second comparison result corresponding to the image features; and the result acquisition module is used for determining a synchronization result of the audio data and the video image data in the video to be detected according to the first comparison result and the second comparison result, wherein the synchronization result comprises synchronization or desynchronization of the audio data and the video image data.
Optionally, the original audio features and the original image features are audio features and image features corresponding to a specified time period, and the data obtaining module may include: the video decomposition unit is used for decomposing the video to be detected to obtain an audio frame sequence and a video image frame sequence; and the data selection unit is used for selecting the audio frame set corresponding to the specified time period from the audio frame sequence as the audio data in the video to be detected, and selecting the video image frame set corresponding to the specified time period from the video image frame sequence as the video image data in the video to be detected.
Optionally, the apparatus may further comprise: a duration obtaining module, configured to obtain a lost duration corresponding to the video to be detected when it is detected that the synchronization result is not synchronized, where the lost duration includes a duration corresponding to an audio frame in the audio frame set that does not match the original audio feature and a duration corresponding to an image frame in the video image frame set that does not match the original image feature; and the image and sound synchronization module is used for synchronizing the audio data and the video image data in the video to be detected based on the loss duration.
Optionally, the image-sound synchronization module may include: a lost frame determining unit, configured to determine a lost frame corresponding to the lost time length when the lost time length meets an error allowance condition, where the lost frame includes at least one of an audio frame and a video image frame; a non-lost frame determining unit, configured to determine a non-lost frame corresponding to the time node of the lost frame in the video to be detected, where the non-lost frame is a video image frame when the lost frame is an audio frame, and the non-lost frame is an audio frame when the lost frame is a video image frame; and a frame deleting unit, configured to delete the non-lost frame from the video to be detected, where the audio data and the video image data in the video to be detected after the deletion are synchronized.
Optionally, the apparatus may further comprise: a frame supplementing module, configured to supplement the lost frame into the video to be detected through a machine learning model when the lost time length does not meet the error allowance condition.
Optionally, when the lost frame is an audio frame, the frame complementing module may be specifically configured to: acquiring a correction frame output by a first machine learning model according to text content corresponding to the audio data, wherein the correction frame is the newly acquired audio frame, and the first machine learning model is pre-trained to output an audio frame corresponding to the text content according to the text content; and generating a corrected video based on the corrected frame and the video to be detected, wherein the corrected video is the video to be detected after the corrected frame is supplemented.
Optionally, the frame complementing module may also be specifically configured to: acquire a front-segment video adjacent to and before the video to be detected, and a rear-segment video adjacent to and after the video to be detected, wherein the front-segment video, the rear-segment video and the video to be detected are different segmented videos obtained by segmenting the same complete video by a preset duration; and acquire a corrected video output by a second machine learning model according to the front and rear segmented videos and the video to be detected, wherein the corrected video is the video to be detected after the lost frame is supplemented, and the second machine learning model is trained in advance to output, according to the video to be detected and its adjacent front and rear segmented videos, the corrected video with the lost frame supplemented.
Optionally, the result obtaining module may be specifically configured to: determine that the audio data and the video image data in the video to be detected are not synchronized when at least one of the first comparison result and the second comparison result indicates an inconsistency.
Optionally, the video to be detected includes a virtual robot, and the feature extraction module may be specifically configured to: and extracting lip image characteristics of the virtual robot in the video image data.
In a third aspect, an embodiment of the present application provides a terminal device, where the terminal device may include: a memory; one or more processors coupled with the memory; one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the one or more processors, the one or more application programs configured to perform the method of the first aspect as described above.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium having program code stored therein, where the program code is called by a processor to execute the method according to the first aspect.
The embodiments of the application provide a video detection method and apparatus, a terminal device, and a storage medium. Audio data and video image data in a video to be detected are acquired; audio features corresponding to the audio data and image features corresponding to the video image data are extracted; original audio features corresponding to the audio data and original image features corresponding to the video image data are acquired, the original audio features and original image features being the audio and image features extracted in advance before the video to be detected is sent to the terminal device; the audio features are compared with the original audio features, and the image features are compared with the original image features; and a synchronization result of the audio data and the video image data in the video to be detected is determined according to the first comparison result corresponding to the audio features and the second comparison result corresponding to the image features, the synchronization result being that the audio data and the video image data are either synchronized or not synchronized. In this way, whether the sound and picture are synchronized is detected through audio and image features, which avoids the influence of repeated timestamp modification while allowing accurate sound-picture synchronization detection, effectively guaranteeing the video playback effect.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments are briefly introduced below. It is apparent that the drawings in the following description show only some embodiments of the present application, not all of them. All other embodiments and drawings obtained by a person skilled in the art from the embodiments of the present application without inventive effort shall fall within the protection scope of the present application.
Fig. 1 shows a schematic diagram of an application environment suitable for the embodiment of the present application.
Fig. 2 is a flowchart illustrating a video detection method according to an embodiment of the present application.
Fig. 3 shows an interface schematic diagram provided in an embodiment of the present application.
Fig. 4 shows a flowchart of a video detection method according to another embodiment of the present application.
Fig. 5 shows a flowchart of a method of step S310 in a video detection method provided in an embodiment of the present application.
Fig. 6 shows a schematic diagram of a face key point provided in an embodiment of the present application.
Fig. 7 shows a flowchart of a method of step S370 in the video detection method provided in the embodiment of the present application.
Fig. 8 shows a flowchart of a method of step S374 in the video detection method provided in the embodiment of the present application.
Fig. 9 shows another method flowchart of step S374 in the video detection method provided in the embodiment of the present application.
Fig. 10 shows a schematic block flow diagram of a video detection method and a video synchronization method provided in an embodiment of the present application.
Fig. 11 shows a block diagram of a video detection apparatus according to an embodiment of the present application.
Fig. 12 shows a block diagram of a data obtaining module 910 in a video detection apparatus according to an embodiment of the present application.
Fig. 13 shows a block diagram of a video detection apparatus according to another embodiment of the present application.
Fig. 14 shows a block diagram of an image synchronization module 970 in the video detection apparatus according to the embodiment of the present application.
Fig. 15 shows a block diagram of a video detection apparatus according to another embodiment of the present application.
Fig. 16 shows a block diagram of a terminal device for executing a video detection method according to an embodiment of the present application.
Fig. 17 shows a block diagram of a computer-readable storage medium for executing a video detection method according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Existing methods for detecting whether the sound and picture are synchronized generally rely either on direct observation by the human eye, on the assumption that a difference within 200 ms cannot be perceived and the sound and picture are therefore considered synchronized, or on matching timestamps. However, detection by the human eye cannot catch sound-picture asynchrony caused by the network or the client (playback end), and timestamp-based detection depends on the stability of timestamp calibration and transmission.
It can be understood that if the audio and video timestamps are obtained at recording time, the difference between the timestamps at corresponding positions in the audio and video will be small. However, if the video is subjected to face beautification and re-encoding, or the audio is subjected to noise reduction, new timestamps may be stamped whose difference from the original source (recorded or synthesized) can be larger. Therefore, if the timestamps are obtained again before transmission and playback, the audio/video timestamps are no longer fixed or consistent with those of the video source (recorded or synthesized), and problems such as rollback and disorder may occur. In that case, timestamp-based sound-picture detection and correction is essentially ineffective.
In view of the above defects, after long-term research the inventors propose the video detection method and apparatus, terminal device, and storage medium of the embodiments of this application, which detect whether the sound and picture are synchronized through audio and image features, so that sound-picture detection is not limited by the playback end, the influence of repeated timestamp modification is avoided, and the video playback effect is effectively ensured.
In order to better understand a video detection method, an apparatus, a terminal device, and a storage medium provided in the embodiments of the present application, an application environment suitable for the embodiments of the present application is described below.
Referring to fig. 1, fig. 1 is a schematic diagram illustrating an application environment suitable for the embodiment of the present application. The video detection method provided by the embodiment of the present application can be applied to the polymorphic interaction system 100 shown in fig. 1. The polymorphic interaction system 100 includes a terminal device 101 and a server 102, the server 102 being communicatively coupled to the terminal device 101. The server 102 may be a conventional server or a cloud server, and is not limited herein.
The terminal device 101 may be various terminal devices having a display screen and supporting data input, including but not limited to a smart phone, a tablet computer, a laptop portable computer, a desktop computer, a wearable terminal device, and the like. Specifically, the data input may be based on a voice module provided on the terminal device 101 to input voice, a character input module to input characters, an image input module to input images, a video input module to input video, and the like, or may be based on a gesture recognition module provided on the terminal device 101, so that a user may implement an interaction manner such as gesture input.
In some embodiments, a client application may be installed on the terminal device 101, and the user may communicate with the server 102 through the client application (e.g., an APP, a WeChat applet, etc.); correspondingly, a server-side application is installed on the server 102. The user may register a user account with the server 102 through the client application and communicate with the server 102 based on that account. For example, the user logs into the user account in the client application and, based on that account, inputs text, voice, image, or video information through the client application. After receiving the information input by the user, the client application may send it to the server 102, so that the server 102 can receive, process, and store it; the server 102 may also return corresponding output information to the terminal device 101 according to the received information.
In some embodiments, the client application may be used to provide customer services to the user, such as broadcasting (playing) a video to the user, engaging in customer service communications with the user, and the like. As one approach, the client application may interact with the user based on a virtual robot. In particular, the client application may receive information input by a user and respond to the information based on the virtual robot. The virtual robot is a software program based on visual graphics, and the software program can present robot forms simulating biological behaviors or ideas to a user after being executed. The virtual robot may be a robot simulating a real person, such as a robot resembling a real person, which is created according to the shape of the user himself or the other person, or a robot having an animation effect, such as a robot having an animal shape or a cartoon character shape.
In some embodiments, after acquiring reply information corresponding to information input by the user, the terminal device 101 may display a virtual robot image corresponding to the reply information on a display screen of the terminal device 101 or other image output device connected thereto. As a mode, while the virtual robot image is played, the audio corresponding to the virtual robot image may be played through a speaker of the terminal device 101 or other audio output devices connected thereto, and a text or a graphic corresponding to the reply information may be displayed on a display screen of the terminal device 101, so that multi-state interaction with the user in multiple aspects of image, voice, text, and the like is realized.
In some embodiments, the means for processing the information input by the user may also be disposed on the terminal device 101, so that the terminal device 101 can interact with the user without relying on establishing communication with the server 102, and in this case, the polymorphic interaction system 100 may only include the terminal device 101.
The above application environments are only examples for facilitating understanding, and it is to be understood that the embodiments of the present application are not limited to the above application environments.
The following describes in detail a video detection method, an apparatus, a terminal device, and a storage medium provided in embodiments of the present application with specific embodiments.
Referring to fig. 2, fig. 2 is a schematic flowchart illustrating a video detection method according to an embodiment of the present application, where the video detection method according to the embodiment may be applied to a terminal device, and the terminal device may be the terminal device with a display screen or other image output apparatus. In a specific embodiment, the video detection method can be applied to the video detection apparatus 900 shown in fig. 13 and the terminal device 600 shown in fig. 14. As will be described in detail with respect to the flow shown in fig. 2, the video detection method may specifically include the following steps:
step S210: and acquiring audio data and video image data in the video to be detected.
In the embodiment of the application, after the terminal device acquires the video to be detected, audio and video decoding can be performed. After the audio and video are decoded, the terminal equipment can acquire audio data and video image data in the video to be detected so as to extract audio and video characteristics to perform sound and picture synchronous detection. The video to be detected may be a video that the terminal device needs to play, such as an AI (Artificial Intelligence) broadcast video, a live broadcast video, a network video, a real-time generated video, and the like, and the specific video type is not limited herein.
In some embodiments, the video to be processed may be a video stream containing a human face and audio. As one mode, when a user converses with the interactive robot of an application program through the terminal device, the video to be processed may be the reply video that the server feeds back to the terminal device in real time, according to the user's interaction intention, for playback. The reply video may include the image of the virtual robot and the reply audio of the virtual robot, so as to present to the user a customer-service robot image whose simulated appearance, voice, and behavior resemble those of a real person.
In some embodiments, the video to be detected may be a video composed of a piece of audio and a plurality of consecutive video images. The terminal device can decompose the video to be detected, and can obtain an audio stream (i.e. audio data) and a video image stream (i.e. video image data). The image stream may be an image sequence composed of decomposed multiple frames of video images. For example, a 1 minute length of video may be decomposed into 1200 video images and 1 minute of audio.
Step S220: and extracting audio features corresponding to the audio data, and extracting image features corresponding to the video image data.
The audio features extracted by the terminal device may be audio time-domain features, audio frequency-domain features, or both, which is not limited here. An audio time-domain feature may be a statistical characteristic of how the audio signal changes over time, such as the mean, variance, covariance, skewness, or peak value. An audio frequency-domain feature may be a periodic characteristic of the audio signal, obtained by converting the audio signal into a frequency-domain signal through a Fourier transform.
In some embodiments, the terminal device may extract the audio features in a variety of ways. As one method, Mel-frequency cepstral coefficients (MFCC) may be used, or chroma features (Chroma), the short-time average zero-crossing rate (ZCR), the spectral root-mean-square value, the spectral central moment, the spectral monotone value, the spectral bandwidth, spectral polynomial coefficients, or the like, or variations of these methods, which are not limited here. For example, features may also be extracted through a neural network using a machine learning model.
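As an illustration of this step, the sketch below extracts several of the features mentioned above (MFCC, chroma, zero-crossing rate, spectral centroid, spectral bandwidth) with the librosa library. The patent does not prescribe a particular library, so the library choice, sampling rate, and hop length here are assumptions made only for the example.

```python
# Illustrative sketch only: the patent does not mandate librosa or these parameters.
import librosa
import numpy as np

def extract_audio_features(wav_path, sr=16000, frame_length=1 / 30):
    """Extract simple time- and frequency-domain features per audio frame."""
    y, sr = librosa.load(wav_path, sr=sr)
    hop = int(sr * frame_length)  # hop roughly matching the video frame rate (assumption)

    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr, hop_length=hop)
    zcr = librosa.feature.zero_crossing_rate(y, hop_length=hop)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr, hop_length=hop)
    bandwidth = librosa.feature.spectral_bandwidth(y=y, sr=sr, hop_length=hop)

    # Stack everything into one (num_frames, feature_dim) matrix.
    return np.vstack([mfcc, chroma, zcr, centroid, bandwidth]).T
```

As the description notes below, the pre-transmission (original) features and the features recomputed at the terminal device would have to be produced by the same routine for the later comparison to be meaningful.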
In the embodiment of the application, the terminal device may also extract image features corresponding to the video image data to distinguish each frame of video image. The image features may be, but are not limited to, color features, texture features, contour features, region features, spatial relationship features, and the like in the video image. For example, it may also be a feature of key information in the video image.
As one mode, when a human face exists in the video image, the extracted image features may be key points of the human face or key points of a lip region. Specifically, the face detection can be performed on the video image through a face detection algorithm to determine the face position. And then estimating the region of the lip according to the structural features of the face, and further extracting feature points on the lip edge as key points of the lip region.
Step S230: acquiring original audio features corresponding to the audio data and original image features corresponding to the video image data, wherein the original audio features are audio features extracted in advance before the video to be detected is sent to the terminal equipment, and the original image features are image features extracted in advance before the video to be detected is sent to the terminal equipment.
In the embodiment of the application, the terminal device may obtain original audio features corresponding to the audio data and original image features corresponding to the video image data. The original audio features and the original image features may be the audio features and image features that were extracted from the video to be detected in advance, before the video to be detected was sent to the terminal device.
In some embodiments, before transmitting the video to be detected to the terminal device, the electronic device may first perform audio feature extraction and recording on audio data in the video to be detected, and simultaneously perform image feature extraction and recording on video image data in the video to be detected. The recorded audio features and image features are the original audio features and the original image features. The electronic device may be a server or other terminal devices, and is not limited herein.
The specific manner of extracting the original audio features and the original image features can be the same as described above and is not limited here. It should be noted, however, that to ensure the accuracy of the subsequent feature comparison, the way the audio features are extracted must be consistent with the way the original audio features were extracted, and the way the image features are extracted must be consistent with the way the original image features were extracted.
In some embodiments, the electronic device may add the recorded raw audio features and raw image features to the transmission process. Therefore, when the terminal equipment receives the video to be detected, the recorded original audio features and original image features can be received. As a way, the recorded original audio features and the original image features can be stored together in one recording file, which is independent of the video transmission, i.e., the object transmitted by the electronic device to the terminal device can include video images, audio, and recording files. In some example manners, the electronic device may transmit the recording file after the video image and the audio of the video to be detected are transmitted, so that the terminal device may receive the recorded original audio features and original image features after receiving the video to be detected.
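Purely as an illustration of such a recording file, the original features could be serialized into a small sidecar file that is transmitted separately from (for example, after) the video. The JSON layout and every field name below are assumptions, not a format defined by the patent.

```python
import json

# Hypothetical recording-file layout; all field names are assumptions.
record = {
    "video_id": "segment_0001",                      # which segmented video this belongs to
    "time_window_s": [0.0, 1.0],                     # the specified time period used for detection
    "frame_rate_fps": 30,                            # sampling rate used when extracting features
    "original_audio_features": [[12.1, -3.0, 0.8],   # one (truncated) feature vector per audio frame
                                [11.7, -2.6, 0.9]],
    "original_image_features": [[101, 220, 14.5],    # one (truncated) feature vector per image frame
                                [102, 221, 14.1]],
}

with open("segment_0001_features.json", "w", encoding="utf-8") as f:
    json.dump(record, f)
```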
Step S240: and comparing the audio features with the original audio features respectively, and comparing the image features with the original image features to obtain a first comparison result corresponding to the audio features and a second comparison result corresponding to the image features.
In this embodiment, the terminal device may compare the audio features with the original audio features, and compare the image features with the original image features, to obtain a first comparison result corresponding to the audio features and a second comparison result corresponding to the image features, so as to determine whether the features received by the terminal device are consistent with the original features.
In some embodiments, audio frames or image frames of the video to be detected may be lost on the way to the terminal device, causing the sound and picture to fall out of sync. The audio features can therefore be compared with the original audio features according to the playing time nodes of the audio in the video. When the audio features corresponding to a certain playing node are inconsistent with the original audio features corresponding to the same playing node before transmission, audio frame loss can be considered to have occurred, and it can be determined that the sound and picture of the video are not synchronized. As one way, the audio features of a designated frame of audio in the video to be detected may be compared with the original audio features corresponding to that designated frame; for example, the audio features of the 1st frame of audio in the video to be detected are compared with the recorded original audio features of the 1st frame of audio.
Similarly, the image features are compared with the original image features according to the playing time nodes of the video images in the video. When the image features corresponding to a certain playing node are inconsistent with the original image features corresponding to the same playing node before transmission, image frame loss can be considered to have occurred, and it can likewise be determined that the sound and picture are not synchronized. As one mode, the image features of a designated frame of video image in the video to be detected may be compared with the original image features corresponding to that designated frame; for example, the lip keypoint features of the 1st frame of video image in the video to be detected are compared with the recorded original lip keypoint features of the 1st frame of video image.
After comparing the audio features and image features with the original audio features and original image features, the terminal device obtains the comparison results. A comparison result may be a similarity value, or directly a conclusion of consistent or inconsistent, which is not limited here. In some embodiments, when the comparison result is a similarity, the terminal device may judge features whose similarity is greater than a similarity threshold as consistent, and features whose similarity is less than the threshold as inconsistent.
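A minimal sketch of this per-time-node comparison and the resulting synchronization decision is shown below, assuming cosine similarity with a hand-picked threshold. The patent leaves the similarity measure and the threshold open, so these are illustrative choices; the decision rule follows the "at least one mismatch means out of sync" logic described for step S250.

```python
import numpy as np

def compare_features(received, original, threshold=0.95):
    """Return a per-frame boolean list: True where the received feature
    matches the original feature at the same playing time node."""
    results = []
    for r, o in zip(received, original):
        r, o = np.asarray(r, dtype=float), np.asarray(o, dtype=float)
        sim = np.dot(r, o) / (np.linalg.norm(r) * np.linalg.norm(o) + 1e-9)
        results.append(sim >= threshold)
    return results

def is_synchronized(audio_matches, image_matches):
    # The video is judged out of sync if either comparison reports any mismatch.
    return all(audio_matches) and all(image_matches)
```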
Step S250: and determining a synchronization result of the audio data and the video image data in the video to be detected according to the first comparison result and the second comparison result, wherein the synchronization result comprises synchronization or desynchronization of the audio data and the video image data.
In the embodiment of the application, after the terminal device obtains the first comparison result and the second comparison result, whether audio features are inconsistent or not and whether image features are inconsistent or not can be determined according to the first comparison result and the second comparison result. When the audio characteristics are determined to be inconsistent according to the first comparison result, the terminal equipment can consider that the audio frame is lost, and can judge that the audio data and the video image data in the video to be detected are asynchronous. When the image characteristics are determined to be inconsistent according to the second comparison result, the terminal device can consider that the video image frame is lost, and can also judge that the audio data and the video image data in the video to be detected are asynchronous. When the two situations exist, the terminal device can also judge that the audio data and the video image data in the video to be detected are not synchronous. That is, when the audio feature and the image feature are consistent, the terminal device may determine that the audio data and the video image data in the video to be detected are synchronized.
In some embodiments, when there is an audio feature inconsistency, the terminal device may further determine, according to the first comparison result, a frame number of the audio feature inconsistency and a position where the audio frame is lost, and then the terminal device may perform sound-picture synchronization correction according to the frame number and the frame loss position. Similarly, when the image features are inconsistent, the terminal device can also determine the frame number with inconsistent image features and the position of the lost video image frame according to the second comparison result, and then the terminal device can perform sound and picture synchronous correction according to the frame number and the frame lost position.
The video detection method provided by this embodiment of the application acquires the audio data and video image data in the video to be detected; extracts the audio features corresponding to the audio data and the image features corresponding to the video image data; acquires the original audio features corresponding to the audio data and the original image features corresponding to the video image data, the original audio features and original image features being the audio and image features extracted in advance before the video to be detected is sent to the terminal device; compares the audio features with the original audio features and the image features with the original image features; and determines, according to the first comparison result corresponding to the audio features and the second comparison result corresponding to the image features, a synchronization result of the audio data and the video image data in the video to be detected, the synchronization result being that the audio data and the video image data are either synchronized or not synchronized. In this way, whether the sound and picture are synchronized is detected through audio and image features, which avoids the influence of repeated timestamp modification while allowing accurate sound-picture synchronization detection, effectively guaranteeing the video playback effect.
Referring to fig. 4, fig. 4 is a schematic flowchart illustrating a video detection method according to another embodiment of the present application. As will be described in detail with respect to the flow shown in fig. 4, the video detection method may specifically include the following steps:
step S310: and acquiring audio data and video image data in the video to be detected.
In some embodiments, before transmitting the video to be detected to the terminal device, the electronic device may segment the video to be detected to obtain a plurality of videos to be detected with shorter time duration. The electronic equipment can transmit a plurality of videos to be detected with short time length to the terminal equipment in sequence, and therefore risks caused by network packet loss are reduced. As one way, the video transmission process may be to transmit the recording file (containing the original audio feature and the original image feature) after the current video with a shorter duration is transmitted, and then transmit the video with the next shorter duration.
As one way, the video to be detected may be evenly divided along the time axis into video clips of a specified duration. The specified duration may be a preset default value, for example 10 s. Of course, the video may also be divided into clips of different durations, which is not limited here. The electronic device may segment the video to be detected in various ways; for example, as one way, the ffmpeg multimedia processing tool may be used to segment the video to be detected into time windows of the specified duration, as sketched below.
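The following sketch shows one way such fixed-length splitting could be done with ffmpeg's segment muxer. The 10-second window is the example value from the text, and invoking ffmpeg through Python is only one possible arrangement.

```python
import subprocess

def split_video(src: str, segment_seconds: int = 10) -> None:
    """Split `src` into consecutive clips of roughly `segment_seconds` each
    using ffmpeg's segment muxer with stream copy (no re-encoding)."""
    subprocess.run(
        [
            "ffmpeg", "-i", src,
            "-f", "segment",
            "-segment_time", str(segment_seconds),
            "-c", "copy",
            "segment_%03d.mp4",
        ],
        check=True,
    )
    # Note: with stream copy, cuts can only land on keyframes, so the actual
    # segment lengths are approximate; re-encoding would give exact lengths.
```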
When the terminal device receives such a shorter video to be detected, it can decode it, acquire the audio data and video image data in it, and perform sound-picture synchronization detection on it based on that audio data and video image data, so that each shorter video to be detected is only played (i.e., played segment by segment) once its sound and picture are synchronized. By segmenting the complete video and performing sound-picture synchronization detection on each segment, the playback end can detect sound-picture asynchrony in the video in time before each segment is played, and can locate where subsequent sound-picture synchronization correction is needed, thereby ensuring continuity of playback and improving the video playback effect.
In some embodiments, when the video encoding method is fixed and no packets are lost during transmission, the sound-picture offset of each segment of video may be uniform; that is, if the first second of a 10 s video is 200 ms out of sync, each subsequent second may also be 200 ms out of sync. Therefore, sound-picture synchronization detection can be performed on a single time period within the video to be detected. Specifically, when the original audio features and original image features extracted in advance before the video to be detected is sent to the terminal device are the audio features and image features corresponding to a specified time period, referring to fig. 5, step S310 may include:
step S311: and decomposing the video to be detected to obtain an audio frame sequence and a video image frame sequence.
Step S312: and selecting an audio frame set corresponding to the specified time period from the audio frame sequence as audio data in the video to be detected, and selecting a video image frame set corresponding to the specified time period from the video image frame sequence as video image data in the video to be detected.
In some embodiments, the terminal device decomposes the video to be detected by sampling video image frames and audio frames from it at a specified sampling rate, thereby obtaining an audio frame sequence and a video image frame sequence. The specified sampling rate may be a preset default value, for example 30 fps (i.e., 30 frames per second). Because the recorded original audio features and original image features are the audio features and image features corresponding to the specified time period, the terminal device may select the audio frame set corresponding to the specified time period from the audio frame sequence as the audio data used for sound-picture synchronization detection, and select the video image frame set corresponding to the specified time period from the video image frame sequence as the video image data used for sound-picture synchronization detection. The specified time period may be any segment of the video to be detected, and its length may be set as needed, for example 1 s. For example, a 10 s video to be detected may be sampled into video image frames and audio frames at 30 fps, and the video image frames and audio frames corresponding to the first 1 s are then selected, yielding 30 video image frames and 30 audio frames.
In other embodiments, the terminal device may instead first obtain the video segment corresponding to the specified time period from the video to be detected, and then sample video image frames and audio frames from that segment at the specified sampling rate, obtaining an audio frame sequence for the specified time period as the audio data used for sound-picture synchronization detection and a video image frame sequence for the specified time period as the video image data used for sound-picture synchronization detection. For example, the first 1 s of video is selected from a 10 s video to be detected and sampled into video image frames and audio frames at 30 fps, yielding 30 video image frames and 30 audio frames.
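The sampling-and-selection step might look like the following sketch, which decodes video frames with OpenCV and treats the audio as fixed-length chunks aligned with the video frame rate. The 30 fps rate and 1 s window are the example values from the text, and the assumption that the source frame rate already matches the sampling rate is made only to keep the example short.

```python
import cv2

def sample_detection_window(video_path, audio, audio_sr,
                            fps=30, start_s=0.0, window_s=1.0):
    """Return (video_frames, audio_frames) for the specified time period."""
    # Video image frames for the window, decoded with OpenCV.
    cap = cv2.VideoCapture(video_path)
    cap.set(cv2.CAP_PROP_POS_MSEC, start_s * 1000)
    video_frames = []
    for _ in range(int(window_s * fps)):
        ok, frame = cap.read()
        if not ok:
            break
        video_frames.append(frame)
    cap.release()

    # Audio "frames": split the window into chunks aligned with the video rate.
    samples_per_frame = int(audio_sr / fps)
    start = int(start_s * audio_sr)
    end = int((start_s + window_s) * audio_sr)
    audio_frames = [audio[i:i + samples_per_frame]
                    for i in range(start, end, samples_per_frame)]
    return video_frames, audio_frames
```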
It can be understood that before transmitting the video to be detected to the terminal device, the electronic device also needs to acquire audio data and video image data for detecting audio-video synchronization from the video to be detected, so as to extract the original audio features and the original image features. Therefore, in order to ensure the validity of the sound-picture synchronization detection, the method for the terminal device to acquire the audio data and the video image data for detecting the sound-picture synchronization needs to be synchronized with the method for the electronic device to acquire the audio data and the video image data for detecting the sound-picture synchronization.
Step S320: and extracting audio features corresponding to the audio data, and extracting image features corresponding to the video image data.
In some embodiments, when the audio data is a set of audio frames corresponding to the specified time period, the terminal device may obtain an audio feature corresponding to each audio frame. Similarly, when the video image data is a video image frame set corresponding to the specified time period, the terminal device may obtain an image feature corresponding to each video image frame.
In some embodiments, the video to be detected may be a broadcast video of a virtual robot; that is, the video to be detected may contain a virtual robot (such as the female figure shown in fig. 3). Because the lips of the virtual robot change continuously while it is broadcasting, the human eye perceives sound-picture asynchrony there more easily. The terminal device can therefore detect whether the sound and picture are synchronized through the lip features in the video, improving the broadcast effect. Specifically, the extracting of the image features corresponding to the video image data may include: extracting lip image features of the virtual robot in the video image data.
The lip image features may be position features of lip keypoints. As one way, through the 68-keypoint detection module in dlib, the terminal device may obtain and record the coordinates of the 20 keypoints numbered 48 to 68, which represent the lips, as the lip image features. Further, the lip opening distance, the relative lip thickness, and the like may be determined from the distances between lip keypoints and used to determine the lip image features. For example, referring to fig. 6, which shows the distribution of the 68 face keypoints detected in a video image, the terminal device may take the 3 point pairs (52,63), (58,67) and (49,55) to determine the relative thickness of the lips, and the 2 point pairs (62,68) and (64,66) to determine the lip opening distance.
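A sketch of this lip-feature extraction with dlib's 68-point landmark predictor is given below. dlib numbers the landmarks from 0 (mouth points 48-67), while the point pairs quoted for fig. 6 are 1-indexed, so the helper subtracts 1; the model file path is an assumption.

```python
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
# Path to the standard 68-landmark model file is an assumption.
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def lip_features(image):
    """Return lip keypoint coordinates plus the opening-distance and
    relative-thickness measures described for fig. 6 (1-indexed pairs)."""
    faces = detector(image, 1)
    if not faces:
        return None
    shape = predictor(image, faces[0])
    pts = np.array([(shape.part(i).x, shape.part(i).y) for i in range(68)])

    def dist(a, b):  # a, b are the 1-indexed point numbers from fig. 6
        return np.linalg.norm(pts[a - 1] - pts[b - 1])

    lip_points = pts[48:68]  # the 20 mouth landmarks (0-indexed 48..67)
    thickness = [dist(52, 63), dist(58, 67), dist(49, 55)]   # pairs the text ties to thickness
    opening = [dist(62, 68), dist(64, 66)]                   # pairs the text ties to opening
    return lip_points, thickness, opening
```

The same routine would need to be used both when recording the original image features and when re-extracting them at the terminal device, per the consistency requirement noted earlier.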
Step S330: acquiring original audio features corresponding to the audio data and original image features corresponding to the video image data, wherein the original audio features are audio features extracted in advance before the video to be detected is sent to the terminal equipment, and the original image features are image features extracted in advance before the video to be detected is sent to the terminal equipment.
Step S340: and comparing the audio features with the original audio features respectively, and comparing the image features with the original image features to obtain a first comparison result corresponding to the audio features and a second comparison result corresponding to the image features.
Step S350: and determining a synchronization result of the audio data and the video image data in the video to be detected according to the first comparison result and the second comparison result, wherein the synchronization result comprises synchronization or desynchronization of the audio data and the video image data.
In the embodiment of the present application, reference may be made to the related description in the foregoing embodiment for step S340 and step S350, and details are not repeated here.
In some embodiments, step S350 may include: when at least one of the first comparison result and the second comparison result indicates an inconsistency, determining that the audio data and the video image data in the video to be detected are not synchronized. That is, whether the audio features are inconsistent with the original audio features or the image features are inconsistent with the original image features, the terminal device can consider that frame loss has occurred and determine that the sound and picture are not synchronized.
Step S360: and when the synchronization result is detected to be asynchronous, acquiring lost time length corresponding to the video to be detected, wherein the lost time length comprises time length corresponding to an audio frame which is not matched with the original audio characteristics in the audio frame set and time length corresponding to an image frame which is not matched with the original image characteristics in the video image frame set.
Step S370: and synchronizing the audio data and the video image data in the video to be detected based on the loss duration.
In some embodiments, when the terminal device determines that the audio and the video images of the video to be detected are not synchronized, it can correct the sound and picture of the detected video to ensure the continuity of the video. As one mode, the terminal device may apply different sound-picture correction processing depending on the length of the loss duration. Specifically, the terminal device may obtain the loss duration corresponding to the video to be detected and synchronize the audio data and the video image data in the video to be detected based on that loss duration. The loss duration comprises the duration corresponding to the audio frames in the audio frame set that do not match the original audio features and the duration corresponding to the image frames in the video image frame set that do not match the original image features.
In some embodiments, the terminal device may mark an audio frame corresponding to the audio feature when detecting that the audio feature does not match the original audio feature, and mark a video image frame corresponding to the image feature when detecting that the image feature does not match the original image feature. The audio loss duration and the image loss duration can then be determined by counting the number of marked audio frames and the number of marked video image frames.
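A simplified sketch of this bookkeeping is shown below. It assumes the extracted features are aligned index-by-index with the original features and uses an illustrative tolerance-based comparison; the actual matching criterion and frame durations are not fixed by the description above.

```python
import numpy as np

def loss_durations(audio_feats, orig_audio_feats, image_feats, orig_image_feats,
                   audio_frame_sec, video_frame_sec, tol=1e-3):
    """Mark unmatched frames and convert the counts into loss durations (seconds)."""
    audio_lost = [i for i, (f, o) in enumerate(zip(audio_feats, orig_audio_feats))
                  if np.linalg.norm(np.asarray(f) - np.asarray(o)) > tol]
    image_lost = [i for i, (f, o) in enumerate(zip(image_feats, orig_image_feats))
                  if np.linalg.norm(np.asarray(f) - np.asarray(o)) > tol]
    return {
        "audio_lost_frames": audio_lost,
        "image_lost_frames": image_lost,
        "audio_loss_sec": len(audio_lost) * audio_frame_sec,
        "image_loss_sec": len(image_lost) * video_frame_sec,
    }
```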
In some embodiments, when the video loss duration is short, the impact on the video playing effect may be small. Therefore, the sound-picture synchronization can be realized by cutting out the parts which are not lost. Specifically, referring to fig. 7, step S370 may include:
step S371: and when the lost time length meets an error allowance condition, determining a lost frame corresponding to the lost time length, wherein the lost frame comprises at least one of an audio frame and a video image frame.
In some embodiments, the loss duration satisfying the error allowance condition may mean that the total of the audio loss duration and the image loss duration satisfies the error allowance condition. The error allowance condition may be the boundary at which human eyes perceive that the sound and the picture are out of sync, for example, a sound-picture asynchrony of 200 ms. At a video frame rate of 30 frames per second, this boundary corresponds to roughly 6 lost video frames.
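As a rough illustration of this check (the 200 ms boundary and the 30 fps frame rate come from the example above; the inclusive comparison at the boundary is an added assumption):

```python
# Illustrative threshold check for the error allowance condition.
PERCEPTION_THRESHOLD_SEC = 0.200  # example boundary of human perception
VIDEO_FPS = 30

def within_error_allowance(audio_loss_sec, image_loss_sec,
                           threshold=PERCEPTION_THRESHOLD_SEC):
    """Return True when the total loss duration stays within the perceptual boundary."""
    return (audio_loss_sec + image_loss_sec) <= threshold

# Example: 6 lost video frames at 30 fps sit exactly on the 200 ms boundary.
print(within_error_allowance(0.0, 6 / VIDEO_FPS))  # True under this inclusive check
```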
When the loss duration satisfies the error allowance condition, the terminal device may obtain the lost frames corresponding to the loss duration so as to determine the non-lost frames to be deleted. The lost frames may include at least one of audio frames and video image frames.
Step S372: determining the non-lost frame corresponding to the time node of the lost frame in the video to be detected, wherein the non-lost frame is a video image frame when the lost frame is an audio frame, and the non-lost frame is an audio frame when the lost frame is a video image frame.
Step S373: deleting the non-lost frames from the video to be detected, wherein the audio data and the video image data in the video to be detected after the deletion are synchronized.
In some embodiments, the video picture may play normally while the audio drops frames and stalls, or the audio may play normally while the video picture drops frames and stalls. That is, at the same time node, one of the video picture and the sound may be a normal frame while the other is a lost frame. Therefore, when the video loss duration is determined to be short, the normal frames can be cut off to keep the sound and the picture synchronized. Specifically, the terminal device may determine the non-lost frame corresponding to the time node of the lost frame in the video to be detected, and delete the non-lost frame from the video to be detected, so that the audio data and the video image data in the video to be detected after the deletion are synchronized. When the lost frame is an audio frame, the non-lost frame is a video image frame, and when the lost frame is a video image frame, the non-lost frame is an audio frame. Therefore, when the terminal device continues to play the video after deleting the corresponding non-lost frames, the continuity of the video can be ensured.
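A simplified sketch of this cutting step follows. It assumes frames in each stream have a uniform duration and can be addressed by index; real container-level editing (demuxing, re-timestamping, remuxing) would be more involved.

```python
def trim_counterpart_frames(audio_frames, video_frames, lost_audio_idx, lost_video_idx,
                            audio_frame_sec, video_frame_sec):
    """Delete the non-lost frames whose time nodes fall inside lost spans of the other stream."""
    lost_audio_spans = [(i * audio_frame_sec, (i + 1) * audio_frame_sec) for i in lost_audio_idx]
    lost_video_spans = [(i * video_frame_sec, (i + 1) * video_frame_sec) for i in lost_video_idx]

    def overlaps(t, spans):
        return any(start <= t < end for start, end in spans)

    # Drop video frames that coincide with lost audio, and audio frames that
    # coincide with lost video, so both streams stay aligned after playback.
    kept_video = [f for i, f in enumerate(video_frames)
                  if not overlaps(i * video_frame_sec, lost_audio_spans)]
    kept_audio = [f for i, f in enumerate(audio_frames)
                  if not overlaps(i * audio_frame_sec, lost_video_spans)]
    return kept_audio, kept_video
```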
In other embodiments, when the video loss duration is long, performing the clipping process may affect the consistency of video playing. Therefore, the lost audio and video can be repaired through the video repairing module so as to ensure the continuity of the video. Specifically, with continuing reference to fig. 7, step S370 may further include:
step S374: and when the lost time does not meet the error allowance condition, supplementing the lost frame to the video to be detected through a machine learning model.
In some embodiments, the lost frames may be analyzed and learned automatically through deep learning techniques. As one way, the terminal device may supplement the lost frames to the video to be detected through a machine learning model, which can automatically generate the lost frames through a neural network. The machine learning model may adopt an RNN (Recurrent Neural Network) model, a CNN (Convolutional Neural Network) model, a BLSTM (Bi-directional Long Short-Term Memory) model, a VAE (Variational Auto-Encoder) model, a BERT (Bidirectional Encoder Representations from Transformers) model, a Support Vector Machine (SVM), or other machine learning models, which are not limited herein. For example, the machine learning model may also be a variation or combination of the above models.
It is understood that the specific training method of the machine learning model may be an existing training method, which is not limited in the embodiments of the present application. For example, the structure, the training method and the objective function of the model may be further adapted according to actual requirements. As one way, the machine learning model may first be trained on a data set that does not distinguish a specific domain (for text) or a specific character object (for images), and then fine-tuned for the specific domain or character object, so as to quickly achieve a satisfactory effect.
In some embodiments, when the lost frame is an audio frame, the lost audio frame may be obtained by text patching. Specifically, referring to fig. 8, step S374 may include:
step S3741: and acquiring a correction frame output by a first machine learning model according to the text content corresponding to the audio data, wherein the correction frame is the newly acquired audio frame, and the first machine learning model is trained in advance so as to output the audio frame corresponding to the text content according to the text content.
In some embodiments, the terminal device may obtain text content corresponding to audio data in the video to be detected. As a mode, when a subtitle corresponding to an audio needs to be displayed in a video to be detected, text content corresponding to the audio has been stored in a database of the terminal device, so that the terminal device can directly read the text content. As another mode, when the video to be detected is a reply video containing a virtual robot generated in the process of communication between the user and the interactive robot, the reply text is generally acquired first, and then the corresponding reply audio is generated, so that the terminal device can directly acquire the text content. Of course, the text content may also be obtained from the electronic device, and is not limited herein.
After the terminal device obtains the text content corresponding to the audio data, the text content can be input into the first machine learning model to obtain the correction frame output by the first machine learning model. The correction frame is the re-acquired lost audio frame. The first machine learning model can be trained with machine learning methods on a large number of text samples and audio samples, so that speech can be synthesized from text. As one approach, the first machine learning model may be a speech synthesis (Text-To-Speech, TTS) model. The speech synthesis model may be a CNN (Convolutional Neural Network) model, which extracts features through convolution kernels and generates the audio corresponding to the text content by mapping each phoneme in the phoneme sequence of the text content to spectrum information and fundamental frequency information. In some embodiments, the speech synthesis model may also be an RNN model, such as WaveRNN.
Step S3742: and generating a corrected video based on the corrected frame and the video to be detected, wherein the corrected video is the video to be detected after the corrected frame is supplemented.
In some embodiments, after the terminal device obtains the correction frame, the correction frame and the video to be detected may be synthesized and repaired by using a machine learning model, so as to obtain the correction video. The corrected video is the video to be detected after the corrected frame is supplemented, so that the sound and picture synchronization is realized. For a specific machine learning model, reference may be made to the related description above, and details are not repeated here.
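A hypothetical sketch of this text-based patching path is given below. `TextToSpeechStub` and `patch_lost_audio` are illustrative placeholders standing in for the pre-trained first machine learning model and the splicing step; they are not the patent's actual model or any real library API.

```python
import numpy as np

class TextToSpeechStub:
    """Placeholder for the pre-trained first machine learning model (e.g. a CNN or WaveRNN TTS)."""
    def synthesize(self, text: str, sample_rate: int = 16000) -> np.ndarray:
        # A real model would return speech synthesized from `text`; here we return
        # silence of a rough duration (0.1 s per character) just to keep the sketch runnable.
        return np.zeros(int(0.1 * len(text) * sample_rate), dtype=np.float32)

def patch_lost_audio(tts, lost_text, audio_frames, lost_audio_idx, frame_len):
    """Fill the lost audio frame slots with frame-sized chunks of the synthesized correction audio."""
    corrected = tts.synthesize(lost_text)
    chunks = [corrected[i:i + frame_len] for i in range(0, len(corrected), frame_len)]
    for gap_idx, chunk in zip(lost_audio_idx, chunks):
        audio_frames[gap_idx] = chunk
    return audio_frames
```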
In other embodiments, when the video loss time is long, the repair can be performed by front and back videos. Specifically, referring to fig. 9, step S374 may also include:
step S3743: the method comprises the steps of obtaining a front-segment video adjacent to a video to be detected and before the video to be detected, and a rear-segment video adjacent to the video to be detected and after the video to be detected, wherein the front-segment video, the rear-segment video and the video to be detected are different segmented videos obtained by segmenting the same complete video for a preset time length.
Step S3744: and acquiring a corrected video output by a second machine learning model according to the front and rear segmented videos and the video to be detected, wherein the corrected video is the video to be detected after the lost frame is supplemented, and the second machine learning model is trained in advance so as to output the corrected video after the lost frame is supplemented according to the video to be detected and the adjacent front and rear segmented videos of the video to be detected.
In some embodiments, when the electronic device segments the complete video by the preset duration and then transmits it to the terminal device, the terminal device may acquire a plurality of videos to be detected of the preset duration. When the video loss duration is long, the terminal device can acquire the front-segment video that is adjacent to and before the video to be detected, and the rear-segment video that is adjacent to and after the video to be detected. In some embodiments, in order to ensure the playing continuity of each video segment, the playing of the first video segment may be started only after the third video segment has been acquired. Therefore, if the loss duration of the first segment is long, the repair can be carried out according to the second segment, and if the loss duration of the second segment is long, the repair can be carried out according to the first segment and the third segment.
In some embodiments, the terminal device may input the front and rear segmented videos together with the video to be detected into the second machine learning model to obtain the corrected video output by the second machine learning model. The corrected video is the video to be detected after the lost frames have been supplemented. The second machine learning model is trained in advance and can output, according to the video to be detected and its adjacent front and rear segmented videos, the corrected video with the lost frames supplemented. In this way, the completion of the lost frames is realized by taking the front and rear videos as the context of the video to be detected. The lost frames may be audio frames or video image frames.
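Another hypothetical sketch: the second machine learning model takes the segment to be repaired plus its adjacent segments as context. `SegmentRepairStub` is an illustrative placeholder for that pre-trained model, not a real API.

```python
class SegmentRepairStub:
    """Placeholder for the pre-trained second machine learning model."""
    def repair(self, prev_segment, current_segment, next_segment):
        # A real model would regenerate the missing audio/video frames from the
        # surrounding context; here the current segment is returned unchanged.
        return current_segment

def repair_with_context(model, segments, idx):
    """Repair segments[idx] using its adjacent front and rear segments, if they exist."""
    prev_seg = segments[idx - 1] if idx > 0 else None
    next_seg = segments[idx + 1] if idx + 1 < len(segments) else None
    return model.repair(prev_seg, segments[idx], next_seg)
```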
In some embodiments, after the audio and video calibration is performed on the video to be detected through the cutting module or the repairing module, the calibrated video to be detected can be played in a segmented manner, so that the video playing effect is ensured.
For example, referring to fig. 10, fig. 10 shows a schematic block flow diagram of the video detection method and the video synchronization method. Specifically, after the electronic device generates each segment of the AI broadcast video, the image features of each video segment can be extracted and recorded through the face key point detection module, and the audio features can be extracted and recorded by computing the Mel spectrum of each audio segment. When the electronic device sends each segment of the AI broadcast video to the playing end, the playing end can detect the image features and the audio features of each segment and compare the detection results with the recorded results to determine whether the sound and the picture are synchronized. When the results are consistent, the sound and the picture can be considered synchronized, and the playing end can directly play that segment of the AI broadcast video. When the results are inconsistent, the sound and the picture are considered to be out of sync, and the playing end can select the corresponding sound-picture calibration mode by judging whether the duration of the asynchrony exceeds the boundary perceivable by human eyes. When the boundary is exceeded, the lost frames are repaired through the repairing module to realize video repair. When the boundary is not exceeded, the normal frames corresponding to the lost frames are cut off through the cutting module to realize video repair. Finally, the video after sound-picture synchronization calibration is played in segments.
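A compact, non-authoritative sketch of this playback-side flow is shown below. Every callable passed in (feature extraction, comparison, loss measurement, cutting, repairing) is a hypothetical placeholder for one of the steps described above, and the 200 ms threshold is the example boundary.

```python
def detect_and_calibrate(segment, recorded_audio_feats, recorded_image_feats,
                         extract_features, compare_with_recorded, loss_duration,
                         clip_normal_frames, repair_lost_frames,
                         perception_threshold_sec=0.200):
    """Detect sound-picture sync for one segment and apply the matching calibration mode."""
    audio_feats, image_feats = extract_features(segment)
    if compare_with_recorded(audio_feats, recorded_audio_feats) and \
       compare_with_recorded(image_feats, recorded_image_feats):
        return segment  # detection matches the recorded features: play as-is
    lost_sec = loss_duration(segment, recorded_audio_feats, recorded_image_feats)
    if lost_sec <= perception_threshold_sec:
        return clip_normal_frames(segment)   # cutting module: drop the counterpart frames
    return repair_lost_frames(segment)       # repairing module: regenerate the lost frames
```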
The video detection method provided by the embodiments of the application acquires the audio data and the video image data of the video to be detected within the specified time period; extracts the audio features corresponding to the audio data of the specified time period and the image features corresponding to the video image data of the specified time period; compares the audio features with the original audio features and the image features with the original image features respectively; and determines the synchronization result of the audio data and the video image data in the video to be detected according to the first comparison result corresponding to the audio features and the second comparison result corresponding to the image features, wherein the synchronization result includes that the audio data and the video image data are synchronized or not synchronized. In addition, when the synchronization result is detected to be unsynchronized, cutting, text-based patching, front-and-rear video patching and other modes can be flexibly selected, depending on the video loss duration, to synchronize the audio data and the video image data in the video to be detected. Therefore, whether the sound and the picture are synchronized can be accurately detected through the audio and video features, and the sound and the picture can be calibrated in multiple ways, effectively ensuring the playing continuity of the video.
It can be understood that, in the above embodiment, each step may be performed locally by the terminal device, may also be performed in the server, and may also be performed by the terminal device and the server separately, and according to different actual application scenarios, tasks may be allocated according to requirements, so as to implement an optimized virtual robot customer service experience, which is not limited herein.
It should be understood that although the steps in the flow diagrams of figs. 2, 4, 5 and 7-10 are shown in sequence, these steps are not necessarily performed in that order; unless explicitly stated herein, there is no strict restriction on their execution order, and they may be performed in other orders. Moreover, at least some of the steps in figs. 2, 4, 5 and 7-10 may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments, and which are not necessarily performed in sequence but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
Referring to fig. 11, fig. 11 is a block diagram illustrating a video detection apparatus according to an embodiment of the present application. As will be explained below with respect to the block diagram of fig. 11, the video detection apparatus 900 includes: a data obtaining module 910, a feature extracting module 920, an original feature obtaining module 930, a feature comparing module 940, and a result obtaining module 950, wherein:
a data obtaining module 910, configured to obtain audio data and video image data in a video to be detected;
a feature extraction module 920, configured to extract an audio feature corresponding to the audio data, and extract an image feature corresponding to the video image data;
an original feature obtaining module 930, configured to obtain an original audio feature corresponding to the audio data and an original image feature corresponding to the video image data, where the original audio feature is an audio feature that is extracted in advance before the video to be detected is sent to the terminal device, and the original image feature is an image feature that is extracted in advance before the video to be detected is sent to the terminal device;
a feature comparison module 940, configured to compare the audio features with the original audio features, and compare the image features with the original image features to obtain a first comparison result corresponding to the audio features and a second comparison result corresponding to the image features;
a result obtaining module 950, configured to determine a synchronization result of the audio data and the video image data in the video to be detected according to the first comparison result and the second comparison result, where the synchronization result includes that the audio data and the video image data are synchronous or asynchronous.
In some embodiments, the original audio features and the original image features are audio features and image features corresponding to a specified time period, referring to fig. 12, the data obtaining module 910 may include: a video decomposition unit 911, configured to decompose the video to be detected to obtain an audio frame sequence and a video image frame sequence; a data selecting unit 912, configured to select an audio frame set corresponding to the specified time period from the audio frame sequence as the audio data in the video to be detected, and select a video image frame set corresponding to the specified time period from the video image frame sequence as the video image data in the video to be detected.
In some embodiments, referring to fig. 13, the video detection apparatus 900 may further include:
a duration obtaining module 960, configured to obtain a lost duration corresponding to the video to be detected when it is detected that the synchronization result is not synchronized, where the lost duration includes a duration corresponding to an audio frame in the audio frame set that does not match the original audio feature and a duration corresponding to an image frame in the video image frame set that does not match the original image feature;
and the sound-picture synchronization module 970 is configured to synchronize the audio data and the video image data in the video to be detected based on the loss duration.
In some embodiments, referring to fig. 14, the sound-picture synchronization module 970 may include: a lost frame determining unit 971, configured to determine the lost frame corresponding to the loss duration when the loss duration meets an error allowance condition, where the lost frame includes at least one of an audio frame and a video image frame; a non-lost frame determining unit 972, configured to determine the non-lost frame corresponding to the time node of the lost frame in the video to be detected, where the non-lost frame is a video image frame when the lost frame is an audio frame, and the non-lost frame is an audio frame when the lost frame is a video image frame; and a frame deleting unit 973, configured to delete the non-lost frame from the video to be detected, where the audio data and the video image data in the video to be detected after the deletion are synchronized.
In some embodiments, referring to fig. 15, the video detection apparatus 900 may further include: and a frame supplementing module 980, configured to supplement the lost frame to the video to be detected through a machine learning model when the lost duration does not meet the error allowance condition.
In some embodiments, when the lost frame is an audio frame, the frame complementing module 980 may be specifically configured to: acquiring a correction frame output by a first machine learning model according to text content corresponding to the audio data, wherein the correction frame is the newly acquired audio frame, and the first machine learning model is pre-trained to output an audio frame corresponding to the text content according to the text content; and generating a corrected video based on the corrected frame and the video to be detected, wherein the corrected video is the video to be detected after the corrected frame is supplemented.
In some embodiments, the frame complementing module 980 may also be specifically configured to: acquiring a front-segment video adjacent to the video to be detected and before the video to be detected, and a rear-segment video adjacent to the video to be detected and after the video to be detected, wherein the front-segment video, the rear-segment video and the video to be detected are different segmented videos obtained by segmenting the same complete video for a preset time; and acquiring a corrected video output by a second machine learning model according to the front and rear segmented videos and the video to be detected, wherein the corrected video is the video to be detected after the lost frame is supplemented, and the second machine learning model is trained in advance so as to output the corrected video after the lost frame is supplemented according to the video to be detected and the adjacent front and rear segmented videos of the video to be detected.
In some embodiments, the result obtaining module 950 may be specifically configured to: and when at least one comparison result in the first comparison result and the second comparison result is inconsistent in representation comparison, determining that the audio data and the video image data in the video to be detected are asynchronous.
In some embodiments, the video to be detected may include a virtual robot, and the feature extraction module 920 may be specifically configured to: and extracting lip image characteristics of the virtual robot in the video image data.
The video detection device provided in the embodiment of the present application is used to implement the corresponding video detection method in the foregoing method embodiment, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.
It can be clearly understood by those skilled in the art that the video detection apparatus provided in the embodiment of the present application can implement each process in the foregoing method embodiment, and for convenience and brevity of description, the specific working processes of the apparatus and the modules described above may refer to corresponding processes in the foregoing method embodiment, and are not described herein again.
In the several embodiments provided in the present application, the coupling or direct coupling or communication connection between the modules shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or modules may be in an electrical, mechanical or other form.
In addition, functional modules in the embodiments of the present application may be integrated into one processing module, or each of the modules may exist alone physically, or two or more modules are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode.
Referring to fig. 16, a block diagram of a terminal device 600 according to an embodiment of the present application is shown. The terminal device 600 may be a terminal device capable of running applications, such as a smart phone, a tablet computer, an e-book reader, or a physical robot. The terminal device 600 in the present application may include one or more of the following components: a processor 610, a memory 620, and one or more applications, where the one or more applications may be stored in the memory 620 and configured to be executed by the one or more processors 610, and the one or more applications are configured to perform the methods described in the foregoing method embodiments.
The processor 610 may include one or more processing cores. The processor 610 connects various parts within the entire terminal device 600 using various interfaces and lines, and performs the various functions of the terminal device 600 and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory 620 and invoking data stored in the memory 620. Optionally, the processor 610 may be implemented in hardware in at least one of the forms of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), and Programmable Logic Array (PLA). The processor 610 may integrate one or a combination of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a modem, and the like. The CPU mainly handles the operating system, the user interface, application programs, and the like; the GPU is responsible for rendering and drawing display content; and the modem is used to handle wireless communication. It is understood that the modem may also not be integrated into the processor 610 but implemented by a separate communication chip.
The memory 620 may include a Random Access Memory (RAM) or a Read-Only Memory (ROM). The memory 620 may be used to store instructions, programs, code sets, or instruction sets. The memory 620 may include a program storage area and a data storage area, where the program storage area may store instructions for implementing an operating system, instructions for implementing at least one function (such as a touch function, a sound playing function, an image playing function, etc.), instructions for implementing the foregoing method embodiments, and the like. The data storage area may also store data created by the terminal device 600 during use (such as a phone book, audio and video data, and chat log data), and the like.
Further, the terminal device 600 may further include a Display screen, and the Display screen may be a Liquid Crystal Display (LCD), an Organic Light-emitting diode (OLED), or the like. The display screen is used to display information entered by the user, information provided to the user, and various graphical user interfaces that may be composed of graphics, text, icons, numbers, video, and any combination thereof.
Those skilled in the art will appreciate that the structure shown in fig. 16 is a block diagram of only a portion of the structure associated with the present application, and does not constitute a limitation on the terminal device to which the present application is applied, and a particular terminal device may include more or less components than those shown in fig. 16, or combine certain components, or have a different arrangement of components.
Referring to fig. 17, a block diagram of a computer-readable storage medium according to an embodiment of the present application is shown. The computer-readable storage medium 1100 has stored therein a program code 1110, the program code 1110 being invokable by the processor for performing the method described in the above-described method embodiments.
The computer-readable storage medium 1100 may be a terminal memory such as a flash memory, an EEPROM (electrically erasable programmable read only memory), an EPROM, a hard disk, or a ROM. Optionally, the computer-readable storage medium 1100 includes a non-transitory computer-readable storage medium. The computer readable storage medium 1100 has storage space for program code 1110 for performing any of the method steps of the method described above. The program code can be read from or written to one or more computer program products. The program code 1110 may be compressed, for example, in a suitable form.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not necessarily depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (15)

1. A video detection method is applied to a terminal device, and comprises the following steps:
acquiring audio data and video image data in a video to be detected;
extracting audio features corresponding to the audio data, and extracting image features corresponding to the video image data;
acquiring original audio features corresponding to the audio data and original image features corresponding to the video image data, wherein the original audio features are audio features which are extracted in advance before the video to be detected is sent to the terminal equipment, and the original image features are image features which are extracted in advance before the video to be detected is sent to the terminal equipment;
respectively comparing the audio features with the original audio features, and comparing the image features with the original image features to obtain a first comparison result corresponding to the audio features and a second comparison result corresponding to the image features;
and determining a synchronization result of the audio data and the video image data in the video to be detected according to the first comparison result and the second comparison result, wherein the synchronization result comprises synchronization or desynchronization of the audio data and the video image data.
2. The method according to claim 1, wherein the original audio features and the original image features are audio features and image features corresponding to a specified time period, and the acquiring audio data and video image data in the video to be detected comprises:
decomposing the video to be detected to obtain an audio frame sequence and a video image frame sequence;
and selecting an audio frame set corresponding to the specified time period from the audio frame sequence as audio data in the video to be detected, and selecting a video image frame set corresponding to the specified time period from the video image frame sequence as video image data in the video to be detected.
3. The method according to claim 2, wherein after said determining a result of synchronization of said audio data and said video image data in said video to be detected, said method further comprises:
when the synchronization result is detected to be asynchronous, acquiring lost time length corresponding to the video to be detected, wherein the lost time length comprises time length corresponding to an audio frame which is not matched with the original audio characteristics in the audio frame set and time length corresponding to an image frame which is not matched with the original image characteristics in the video image frame set;
and synchronizing the audio data and the video image data in the video to be detected based on the loss duration.
4. The method according to claim 3, wherein the synchronizing the audio data and the video image data in the video to be detected based on the loss duration comprises:
when the lost time length meets an error allowance condition, determining a lost frame corresponding to the lost time length, wherein the lost frame comprises at least one of an audio frame and a video image frame;
determining a non-lost frame corresponding to the time node of the lost frame in the video to be detected, wherein the non-lost frame is a video image frame when the lost frame is an audio frame, and the non-lost frame is an audio frame when the lost frame is a video image frame;
and deleting the non-lost frames from the video to be detected, wherein the audio data and the video image data in the video to be detected after the deletion are synchronized.
5. The method of claim 4, further comprising:
and when the lost time does not meet the error allowance condition, supplementing the lost frame to the video to be detected through a machine learning model.
6. The method of claim 5, wherein the supplementing the lost frame to the video to be detected through the machine learning model when the lost frame is an audio frame comprises:
acquiring a correction frame output by a first machine learning model according to text content corresponding to the audio data, wherein the correction frame is the newly acquired audio frame, and the first machine learning model is pre-trained to output an audio frame corresponding to the text content according to the text content;
and generating a corrected video based on the corrected frame and the video to be detected, wherein the corrected video is the video to be detected after the corrected frame is supplemented.
7. The method of claim 5, wherein the supplementing the lost frame into the video to be detected by the machine learning model comprises:
acquiring a front-segment video adjacent to the video to be detected and before the video to be detected, and a rear-segment video adjacent to the video to be detected and after the video to be detected, wherein the front-segment video, the rear-segment video and the video to be detected are different segmented videos obtained by segmenting the same complete video for a preset time;
and acquiring a corrected video output by a second machine learning model according to the front and rear segmented videos and the video to be detected, wherein the corrected video is the video to be detected after the lost frame is supplemented, and the second machine learning model is trained in advance so as to output the corrected video after the lost frame is supplemented according to the video to be detected and the adjacent front and rear segmented videos of the video to be detected.
8. The method according to any one of claims 1 to 7, wherein the determining the synchronization result of the audio data and the video image data in the video to be detected according to the first comparison result and the second comparison result comprises:
and when at least one comparison result in the first comparison result and the second comparison result is inconsistent in representation comparison, determining that the audio data and the video image data in the video to be detected are asynchronous.
9. The method according to any one of claims 1 to 7, wherein the video to be detected includes a virtual robot, and the extracting image features corresponding to the video image data includes:
and extracting lip image characteristics of the virtual robot in the video image data.
10. A video detection device is applied to a terminal device, and the device comprises:
the data acquisition module is used for acquiring audio data and video image data in the video to be detected;
the characteristic extraction module is used for extracting audio characteristics corresponding to the audio data and extracting image characteristics corresponding to the video image data;
the original characteristic acquisition module is used for acquiring original audio characteristics corresponding to the audio data and original image characteristics corresponding to the video image data, wherein the original audio characteristics are audio characteristics which are extracted in advance before the video to be detected is sent to the terminal equipment, and the original image characteristics are image characteristics which are extracted in advance before the video to be detected is sent to the terminal equipment;
the feature comparison module is used for respectively comparing the audio features with the original audio features and comparing the image features with the original image features to obtain a first comparison result corresponding to the audio features and a second comparison result corresponding to the image features;
and the result acquisition module is used for determining a synchronization result of the audio data and the video image data in the video to be detected according to the first comparison result and the second comparison result, wherein the synchronization result comprises synchronization or desynchronization of the audio data and the video image data.
11. The apparatus of claim 10, wherein the raw audio features and the raw image features are audio features and image features corresponding to a specified time period, and wherein the data obtaining module comprises:
the video decomposition unit is used for decomposing the video to be detected to obtain an audio frame sequence and a video image frame sequence;
and the data selection unit is used for selecting the audio frame set corresponding to the specified time period from the audio frame sequence as the audio data in the video to be detected, and selecting the video image frame set corresponding to the specified time period from the video image frame sequence as the video image data in the video to be detected.
12. The apparatus of claim 11, further comprising:
a duration obtaining module, configured to obtain a lost duration corresponding to the video to be detected when it is detected that the synchronization result is not synchronized, where the lost duration includes a duration corresponding to an audio frame in the audio frame set that does not match the original audio feature and a duration corresponding to an image frame in the video image frame set that does not match the original image feature;
and the sound-picture synchronization module is used for synchronizing the audio data and the video image data in the video to be detected based on the loss duration.
13. The apparatus of claim 12, wherein the sound-picture synchronization module comprises:
a lost frame determining unit, configured to determine a lost frame corresponding to the lost time length when the lost time length meets an error allowance condition, where the lost frame includes at least one of an audio frame and a video image frame;
a non-lost frame determining unit, configured to determine a non-lost frame corresponding to a time node of the lost frame in the video to be detected, wherein the non-lost frame is a video image frame when the lost frame is an audio frame, and the non-lost frame is an audio frame when the lost frame is a video image frame;
and a frame deleting unit, configured to delete the non-lost frames from the video to be detected, wherein the audio data and the video image data in the video to be detected after the deletion are synchronized.
14. A terminal device, comprising:
one or more processors;
a memory;
one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors, the one or more applications configured to perform the method of any of claims 1-9.
15. A computer-readable storage medium, having stored thereon program code that can be invoked by a processor to perform the method according to any one of claims 1 to 9.
CN201911302567.3A 2019-12-17 2019-12-17 Video detection method and device, terminal equipment and storage medium Pending CN111050023A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911302567.3A CN111050023A (en) 2019-12-17 2019-12-17 Video detection method and device, terminal equipment and storage medium

Publications (1)

Publication Number Publication Date
CN111050023A true CN111050023A (en) 2020-04-21

Family

ID=70237056

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911302567.3A Pending CN111050023A (en) 2019-12-17 2019-12-17 Video detection method and device, terminal equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111050023A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105187910A (en) * 2015-09-12 2015-12-23 北京暴风科技股份有限公司 Method and system for automatically detecting video self-adaptive parameter
CN105898505A (en) * 2016-04-27 2016-08-24 北京小米移动软件有限公司 Method, device and system for testing audio and video synchronization in video instant messaging
CN105959723A (en) * 2016-05-16 2016-09-21 浙江大学 Lip-synch detection method based on combination of machine vision and voice signal processing
KR20180119243A (en) * 2017-04-25 2018-11-02 주식회사 님버스 Multimedia transmission apparatus having genlock function
US10341609B1 (en) * 2018-01-17 2019-07-02 Motorola Solutions, Inc. Group video synchronization
CN109729383A (en) * 2019-01-04 2019-05-07 深圳壹账通智能科技有限公司 Double record video quality detection methods, device, computer equipment and storage medium
CN110400251A (en) * 2019-06-13 2019-11-01 深圳追一科技有限公司 Method for processing video frequency, device, terminal device and storage medium

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112750427A (en) * 2020-07-31 2021-05-04 清华大学深圳国际研究生院 Image processing method, device and storage medium
CN112750427B (en) * 2020-07-31 2024-02-27 清华大学深圳国际研究生院 Image processing method, device and storage medium
CN112188259A (en) * 2020-09-29 2021-01-05 北京达佳互联信息技术有限公司 Method and device for audio and video synchronization test and correction and electronic equipment
CN112188259B (en) * 2020-09-29 2022-09-23 北京达佳互联信息技术有限公司 Method and device for audio and video synchronization test and correction and electronic equipment
CN112437336A (en) * 2020-11-19 2021-03-02 维沃移动通信有限公司 Audio and video playing method and device, electronic equipment and storage medium
CN112488013A (en) * 2020-12-04 2021-03-12 重庆邮电大学 Depth-forged video detection method and system based on time sequence inconsistency
WO2022161209A1 (en) * 2021-01-26 2022-08-04 腾讯科技(深圳)有限公司 Image frame loss detection method, device, storage medium, and computer program product
CN113343831A (en) * 2021-06-01 2021-09-03 北京字跳网络技术有限公司 Method and device for classifying speakers in video, electronic equipment and storage medium
CN113676599A (en) * 2021-08-20 2021-11-19 上海明略人工智能(集团)有限公司 Network call quality detection method, system, computer device and storage medium
CN113676599B (en) * 2021-08-20 2024-03-22 上海明略人工智能(集团)有限公司 Network call quality detection method, system, computer equipment and storage medium
CN114007064A (en) * 2021-11-01 2022-02-01 腾讯科技(深圳)有限公司 Special effect synchronous evaluation method, device, equipment, storage medium and program product
CN115550710A (en) * 2022-08-30 2022-12-30 海南视联通信技术有限公司 Data processing method and device, terminal equipment and storage medium

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200421)