CN116567351A - Video processing method, device, equipment and medium - Google Patents

Video processing method, device, equipment and medium

Info

Publication number
CN116567351A
CN116567351A (application CN202310823097.5A)
Authority
CN
China
Prior art keywords: video, frame, video frame, candidate, frames
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310823097.5A
Other languages
Chinese (zh)
Other versions
CN116567351B (en)
Inventor
张皓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202310823097.5A priority Critical patent/CN116567351B/en
Publication of CN116567351A publication Critical patent/CN116567351A/en
Application granted granted Critical
Publication of CN116567351B publication Critical patent/CN116567351B/en
Legal status: Active


Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40: Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43: Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44: Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/44008: Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • H04N21/44016: Processing of video elementary streams involving splicing one content stream with another content stream, e.g. for substituting a video clip
    • H04N21/45: Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
    • H04N21/466: Learning process for intelligent management, e.g. learning user preferences for recommending movies
    • H04N21/4662: Learning process for intelligent management characterized by learning algorithms
    • H04N21/4666: Learning process for intelligent management characterized by learning algorithms using neural networks, e.g. processing the feedback provided by the user
    • H04N21/4668: Learning process for intelligent management for recommending content, e.g. movies

Abstract

The embodiment of the application discloses a video processing method, device, equipment and medium. The method comprises the following steps: acquiring a first video and a second video; screening m target video frames from n second video frames; forming the m target video frames into a reference video segment according to the order of their playing times, and performing dubbing comparison between the first video and the reference video segment to obtain a comparison result. By adopting the embodiment of the application, the accuracy of dubbed-video identification can be improved.

Description

Video processing method, device, equipment and medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a video processing method, apparatus, device, and medium.
Background
With the development of internet technology, a large number of dubbed videos based on original videos (such as movies and television dramas) have appeared on the internet. For example, the character footage in an original video may be reused and new lines matched to the characters, or the characters in the original video may be re-dubbed while keeping the line content consistent before and after dubbing.
The emergence of dubbed videos brings new risks to video dissemination. For example, a dubbed video may infringe the copyright of the original video. If a dubbed video is published, it can create compliance risks for the audio-video platform; if the line content of the dubbed video contains discriminatory content or false information, economic and social order may be disturbed. As another example, a dubbed video may mislead viewers: if a viewer has not watched the original video, the content of the dubbed video may distort the viewer's evaluation of the original video and affect the viewing experience and understanding of the original video. Therefore, how to accurately recognize dubbed videos is important.
Disclosure of Invention
The embodiment of the application provides a video processing method, device, equipment and medium, which can effectively identify whether a first video is a dubbed video and improve the accuracy of dubbed-video identification.
In one aspect, an embodiment of the present application provides a video processing method, including:
acquiring a first video and a second video, wherein the first video is a video clip in the second video; the first video comprises m first video frames, and the second video comprises n second video frames; n and m are positive integers, and n is more than or equal to m;
screening m target video frames from n second video frames; one target video frame is matched with one first video frame, and the playing time of m target video frames in the second video is continuous;
forming m target video frames into a reference video segment according to the sequence of playing time, and carrying out dubbing comparison on the first video and the reference video segment to obtain a comparison result; the comparison result is used for indicating whether the first video is obtained by dubbing the reference video segment.
In another aspect, an embodiment of the present application provides a video processing apparatus, including:
the device comprises an acquisition unit and a processing unit, wherein the acquisition unit is used for acquiring a first video and a second video, and the first video is a video clip in the second video; the first video comprises m first video frames, and the second video comprises n second video frames; n and m are positive integers, and n is more than or equal to m;
The processing unit is used for screening m target video frames from the n second video frames; one target video frame is matched with one first video frame, and the playing time of m target video frames in the second video is continuous;
the processing unit is also used for forming m target video frames into a reference video segment according to the sequence of the playing time, and carrying out dubbing comparison on the first video and the reference video segment to obtain a comparison result; the comparison result is used for indicating whether the first video is obtained by dubbing the reference video segment.
In one implementation, the processing unit is configured to, when screening m target video frames from the n second video frames, specifically:
screening video for each first video frame from n second video frames to obtain m video groups corresponding to m first video frames; each video group comprises K candidate video frames, and the K candidate video frames contained in the video group are matched with the first video frames corresponding to the video group; k is an integer greater than 1;
determining a target video frame for each first video frame from m video groups based on time interval information of two candidate video frames in two adjacent video groups in the m video groups in the second video and similarity between each first video frame and each candidate video frame in the corresponding video group; the two candidate video frames belong to different ones of the two adjacent video groups.
In one implementation manner, the processing unit is configured to screen a video from n second video frames for each first video frame, and when obtaining m video groups corresponding to m first video frames, specifically is configured to:
performing feature extraction processing on each first video frame in the m first video frames to obtain an image representation of each first video frame; performing feature extraction processing on each second video frame in the n second video frames to obtain image representation of each second video frame; the image characterization is used for representing semantic information of the video frame;
performing a similarity operation on the image representation of each first video frame and the image representation of each second video frame to obtain n similarities corresponding to each first video frame; each similarity is used for indicating the degree of similarity between the first video frame and the corresponding second video frame;
taking the K second video frames whose similarity, among the n similarities corresponding to each first video frame, is greater than a similarity threshold as the K candidate video frames matched with the corresponding first video frame; the K candidate video frames corresponding to each first video frame form a video group.
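As a reading aid only, the following sketch shows how the candidate video groups described above could be assembled in practice. It assumes the image characterizations are already available as L2-normalized feature vectors (NumPy arrays); the function name, the value of K and the similarity threshold are illustrative assumptions, not taken from the application.

```python
import numpy as np

def build_candidate_groups(first_reprs, second_reprs, k=5, sim_threshold=0.8):
    """For each first video frame, keep up to K second video frames whose cosine
    similarity exceeds the threshold, forming one video group per first frame.

    first_reprs:  (m, d) array of image characterizations of the first video frames
    second_reprs: (n, d) array of image characterizations of the second video frames
    Returns a list of m groups; each group is a list of (second_frame_index, similarity).
    """
    # Cosine similarity reduces to a dot product for L2-normalized vectors.
    sims = first_reprs @ second_reprs.T                 # shape (m, n)
    groups = []
    for i in range(sims.shape[0]):
        top = np.argsort(-sims[i])[:k]                  # indices of the K most similar frames
        group = [(int(j), float(sims[i, j])) for j in top if sims[i, j] > sim_threshold]
        groups.append(group)
    return groups
```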
In one implementation, any one of the first video frames is denoted as an i+1th video frame, and a video group corresponding to the i+1th video frame is denoted as an i+1th video group, i=0, 1,2 …, m-1; the processing unit is configured to determine, for each first video frame, a target video frame from m video groups based on time interval information of two candidate video frames in two adjacent video groups in the m video groups in the second video and similarity between each first video frame and each candidate video frame in the corresponding video group, where the target video frame is specifically configured to:
When i=0, the similarity between K candidate video frames in the first video group in the m video groups and the first video frame in the first video is used as K matching scores of the first video frame; a matching score corresponds to a candidate video frame;
when 0 < i ≤ m-1, calculating the K matching scores of the (i+1)-th video frame based on the K matching scores of the i-th video frame, the time interval information, in the second video, of each candidate video frame in the i-th video group and each candidate video frame in the (i+1)-th video group, and the similarity between each candidate video frame in the (i+1)-th video group and the (i+1)-th video frame; the matching score of the (i+1)-th video frame is used to indicate: on the basis that target video frames have been determined for the i video frames before the (i+1)-th video frame, the degree of continuity in the time dimension between the candidate video frames in the i-th video group and the candidate video frames in the (i+1)-th video group, and the degree of matching in the image dimension between the (i+1)-th video frame and the candidate video frames in the (i+1)-th video group;
a target video frame is determined for each first video frame from the m video groups based on the K matching scores for each first video frame of the m first video frames.
In one implementation, the processing unit is configured to calculate the K matching scores of the (i+1)-th video frame based on the K matching scores of the i-th video frame, the time interval information, in the second video, of each candidate video frame in the i-th video group and each candidate video frame in the (i+1)-th video group, and the similarity between each candidate video frame in the (i+1)-th video group and the (i+1)-th video frame, and is specifically configured to:
Calculating K matching scores of each candidate video frame in the i+1 video group corresponding to the i+1 video frame based on K matching scores of the i video frame, time interval information of each candidate video frame in the i video group and each candidate video frame in the i+1 video group in the second video, and similarity between each candidate video frame in the i+1 video group and the i+1 video frame;
selecting the matching score with the largest value from K matching scores of each candidate video frame corresponding to the i+1th video frame in the i+1th video group;
and taking the K matching scores so selected as the K matching scores of the (i+1)-th video frame.
In one implementation, any candidate video frame in the (i+1)-th video group corresponding to the (i+1)-th video frame is denoted as the j-th candidate video frame, j=1, 2, …, K; any candidate video frame in the i-th video group is denoted as the k-th candidate video frame, k=1, 2, …, K; the processing unit is configured to calculate, based on the K matching scores of the i-th video frame, the time interval information, in the second video, of each candidate video frame in the i-th video group and each candidate video frame in the (i+1)-th video group, and the similarity between each candidate video frame in the (i+1)-th video group and the (i+1)-th video frame, the K matching scores of each candidate video frame in the (i+1)-th video group corresponding to the (i+1)-th video frame, and is specifically configured to:
acquiring the image characterization of the j-th candidate video frame and the image characterization of the k-th candidate video frame, and calculating, based on the image characterization of the j-th candidate video frame and the image characterization of the k-th candidate video frame, the time interval information (denoted here as Δt(k, j)) of the k-th candidate video frame and the j-th candidate video frame in the second video;
acquiring the similarity (denoted here as s(i+1, j)) between the (i+1)-th video frame and the j-th candidate video frame, and performing a weighted operation on the similarity s(i+1, j) based on the time interval information Δt(k, j) to obtain a weighted operation result; wherein the larger the time interval information Δt(k, j) is, the smaller the weight value applied to the similarity s(i+1, j) is;
acquiring the matching score (denoted here as f(i, k)) of the k-th candidate video frame corresponding to the i-th video frame, and summing the matching score f(i, k) with the weighted operation result to obtain the matching score of the j-th candidate video frame corresponding to the (i+1)-th video frame in the case where the i-th video frame is matched with the k-th candidate video frame.
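Purely as a reading aid, the recursion described in this and the preceding implementation can be written as a single formula. The symbols are introduced here and are not fixed by the application: f_t(j) is the matching score of the j-th candidate for the t-th first video frame, s_{t,j} is the similarity between the t-th first video frame and its j-th candidate, Δt_{k,j} is the time interval in the second video between the k-th candidate of the previous group and the j-th candidate of the current group, and w(·) is a weight that decreases as the interval grows (one possible choice being w(Δt) = e^{-λΔt}):

```latex
f_{1}(j) = s_{1,j}, \qquad
f_{t+1}(j) = \max_{k \in \{1,\dots,K\}} \left[ f_{t}(k) + w\!\left(\Delta t_{k,j}\right)\, s_{t+1,j} \right],
\quad t = 1, \dots, m-1 .
```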
In one implementation, the processing unit is configured to, when determining, for each first video frame from the m video groups, a target video frame based on K matching scores of each first video frame from the m first video frames, specifically configured to:
when i=m-1, determining the maximum matching score from K matching scores of the mth video frame in the first video, and taking a candidate video frame corresponding to the maximum matching score in the mth video group as a target video frame of the mth video frame;
when 0 ≤ i < m-1, determining, from the i-th video group, the candidate video frame that enables the (i+1)-th video frame to obtain the maximum matching score, and taking that candidate video frame as the target video frame of the (i+1)-th video frame.
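A minimal sketch of the forward scoring and backward selection described above is given below, purely for illustration. It assumes the candidate groups from the earlier step are available, uses an exponential decay with a hypothetical rate `lam` as the interval-dependent weight, and introduces its own variable names; none of these choices are prescribed by the application.

```python
import math

def select_target_frames(groups, timestamps, sims, lam=0.5):
    """Dynamic-programming selection of one target frame per first video frame.

    groups:     list of m lists; groups[i] holds candidate second-frame indices for frame i
    timestamps: dict mapping second-frame index -> playing time (seconds) in the second video
    sims:       dict mapping (i, second_frame_index) -> similarity with the i-th first frame
    Returns a list of m chosen second-frame indices with (approximately) continuous playing time.
    """
    m = len(groups)
    # score[i][j]: best matching score of candidate j for frame i; back[i][j]: best predecessor
    score = [{j: sims[(0, j)] for j in groups[0]}]
    back = [{}]
    for i in range(1, m):
        score.append({})
        back.append({})
        for j in groups[i]:
            best_k, best_val = None, -math.inf
            for k in groups[i - 1]:
                gap = abs(timestamps[j] - timestamps[k])
                val = score[i - 1][k] + math.exp(-lam * gap) * sims[(i, j)]
                if val > best_val:
                    best_k, best_val = k, val
            score[i][j] = best_val
            back[i][j] = best_k
    # Backtracking: start from the best-scoring candidate of the last frame.
    chosen = [max(score[m - 1], key=score[m - 1].get)]
    for i in range(m - 1, 0, -1):
        chosen.append(back[i][chosen[-1]])
    return list(reversed(chosen))
```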
In one implementation, the dubbing alignment includes text alignment; the processing unit is used for dubbing comparison of the first video and the reference video segment, and is specifically used for:
performing speech recognition processing on the first video to obtain text information of the first video; performing speech recognition processing on the reference video segment to obtain text information of the reference video segment;
and comparing the text information of the first video with the text information of the reference video fragment to obtain a comparison result.
In one implementation, the text information includes one or more characters; the processing unit is used for comparing the text information of the first video with the text information of the reference video fragment, and is particularly used for:
performing a plurality of editing operations on the one or more characters contained in the text information of the first video to obtain new edited text information, wherein the edited new text information is identical to the text information of the reference video segment;
Counting the number of editing operations to obtain an editing number result, and generating a comparison result based on the editing number result;
when the editing-times result indicates that the number of editing operations is greater than an operation threshold, the comparison result indicates that the first video is obtained by dubbing the reference video segment; when the editing-times result indicates that the number of editing operations is less than or equal to the operation threshold, the comparison result indicates that the first video is not obtained by dubbing the reference video segment.
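The counting of editing operations described above corresponds to an edit-distance style comparison. The sketch below illustrates one way such a count could be obtained (a standard Levenshtein distance over characters); the threshold value and function names are illustrative assumptions, not taken from the application.

```python
def edit_count(src: str, dst: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions
    needed to turn `src` (text of the first video) into `dst` (text of the reference segment)."""
    prev = list(range(len(dst) + 1))
    for i, cs in enumerate(src, start=1):
        curr = [i]
        for j, cd in enumerate(dst, start=1):
            cost = 0 if cs == cd else 1
            curr.append(min(prev[j] + 1,         # delete a character from src
                            curr[j - 1] + 1,     # insert a character into src
                            prev[j - 1] + cost)) # substitute (or keep) a character
        prev = curr
    return prev[-1]

def is_redubbed(src_text: str, ref_text: str, op_threshold: int = 10) -> bool:
    # More edits than the threshold suggests the line content was changed, i.e. re-dubbed.
    return edit_count(src_text, ref_text) > op_threshold
```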
In one implementation, the dubbing alignment includes audio alignment; the processing unit is used for dubbing comparison of the first video and the reference video segment, and is specifically used for:
performing speech recognition processing on the first video to obtain audio information of the first video; performing speech recognition processing on the reference video clips to obtain audio information of the reference video clips;
performing audio comparison on the audio information of the first video and the audio information of the reference video clip to obtain a comparison result;
when the audio information of the first video is different from the audio information of the reference video segment, the comparison result indicates that the first video is obtained by dubbing the reference video segment; when the audio information of the first video is the same as the audio information of the reference video segment, the comparison result indicates that the first video is not obtained by dubbing the reference video segment.
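The application does not prescribe a particular audio-comparison technique. As one hypothetical illustration only, the two audio tracks could be reduced to MFCC-based summaries and compared by cosine similarity, as sketched below; the `librosa` dependency, the threshold and the function name are assumptions of this sketch.

```python
import numpy as np
import librosa

def audio_matches(path_a: str, path_b: str, sim_threshold: float = 0.95) -> bool:
    """Return True if the two audio tracks are (approximately) the same,
    i.e. the first video was presumably NOT re-dubbed."""
    feats = []
    for path in (path_a, path_b):
        y, sr = librosa.load(path, sr=16000)            # decode and resample the audio track
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
        feats.append(mfcc.mean(axis=1))                 # crude fixed-length summary per track
    a, b = feats
    cos = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return cos >= sim_threshold
```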
In one implementation manner, the m first video frames are obtained by performing frame extraction processing on M video frames contained in the first video, where M is an integer greater than m; the n second video frames are obtained by performing frame extraction processing on N video frames contained in the second video, where N is an integer greater than n.
In one implementation, if the comparison result indicates that the first video is obtained by dubbing the reference video segment, the processing unit is further configured to:
displaying a first playing interface of a first video;
and displaying dubbing prompt information in the first playing interface, wherein the dubbing prompt information is used for prompting that the first video is obtained by re-dubbing.
In one implementation, if the comparison result indicates that the first video is obtained by dubbing the reference video segment, the processing unit is further configured to:
displaying a first playing interface of the first video, wherein a jump inlet is displayed in the first playing interface;
and playing the second video in response to the trigger operation for the jump entrance.
In one implementation, the processing unit is configured to, when playing the second video, specifically:
starting playing from a first second video frame in the second video;
or starting playing from a target second video frame in the second video; the target second video frame is a second video frame corresponding to the first video frame displayed in the first playing interface when the jump entrance in the first playing interface is triggered.
In one implementation manner, if the comparison result indicates that the first video is obtained by dubbing the reference video segment, and the second video to which the reference video segment belongs corresponds to a plurality of dubbed videos, wherein the plurality of dubbing videos include the first video, the method further includes:
displaying a first playing interface of a first video;
responding to the dubbing selection requirement in the first playing interface, and outputting the video identification of each dubbing video in the plurality of dubbing videos;
selecting a target video identifier from the video identifiers of the plurality of dubbing videos according to the identifier selection operation;
and playing the target dubbing video corresponding to the selected target video identifier.
In another aspect, embodiments of the present application provide a computer device, including:
a processor for loading and executing the computer program;
a computer readable storage medium having stored therein a computer program which, when executed by a processor, implements the video processing method described above.
In another aspect, embodiments of the present application provide a computer readable storage medium storing a computer program adapted to be loaded by a processor and to perform the above-described video processing method.
In another aspect, embodiments of the present application provide a computer program product comprising a computer program that, when executed by a processor, implements the video processing method described above.
In this embodiment of the present application, a first video to be identified and the second video to which it belongs may be acquired, where the first video includes m first video frames and the second video includes n second video frames. Then, m target video frames may be screened out from the n second video frames contained in the second video; each target video frame is matched with one of the m first video frames contained in the first video, where matching means that the target video frame and the first video frame are similar (for example, the semantic information contained in the two video frames is similar), that is, the target video frame and the first video frame may be the same video frame; moreover, the playing times of the m target video frames in the second video are continuous. Finally, dubbing comparison is performed between the first video to be identified and the reference video segment formed by the m target video frames, so as to obtain a comparison result indicating whether the first video is a dubbed segment of the reference video segment. As can be seen from the above solution, when selecting the m target video frames from the second video, the embodiment of the present application not only ensures that each target video frame is matched with one first video frame in the first video, guaranteeing the similarity between the target video frame and the first video frame, but also ensures that the playing times of the m target video frames in the second video are continuous, so that the video pictures of the reference video segment formed by the m target video frames remain continuous rather than skipping frames. In this way, each frame picture of the reference video segment extracted from the second video can be kept as consistent as possible with each frame picture of the first video; in other words, the reference video segment can be regarded as the video segment in the second video from which the first video was taken. This avoids the situation, in the subsequent dubbing comparison, where the comparison result is invalid because the reference video segment is not the video segment in the second video that corresponds to the first video, thereby ensuring the validity and accuracy of the dubbing comparison.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1a is a schematic architecture diagram of a video processing system according to an exemplary embodiment of the present application;
FIG. 1b is a schematic architecture diagram of another video processing system provided in an exemplary embodiment of the present application;
FIG. 2 is a schematic diagram of a system architecture of a video processing scheme provided in one exemplary embodiment of the present application;
FIG. 3 is a flow chart of a video processing method according to an exemplary embodiment of the present application;
FIG. 4 is a schematic diagram of an algorithm framework of a self-supervised learning algorithm provided by an exemplary embodiment of the present application;
FIG. 5 is a flowchart of selecting Top-K candidate video frames for a first video frame in a first video according to an exemplary embodiment of the present application;
FIG. 6 is a flow chart of calculating a matching score provided by an exemplary embodiment of the present application;
FIG. 7 is a flow chart of another video processing method provided in an exemplary embodiment of the present application;
FIG. 8a is a schematic diagram showing a first video playing in a first playing interface and showing a dubbing alert message according to an exemplary embodiment of the present application;
FIG. 8b is a schematic diagram of a display style of a dubbing alert message provided in an exemplary embodiment of the present application;
FIG. 8c is a schematic diagram of another display style of dubbing cue provided in one exemplary embodiment of the present application;
FIG. 8d is a schematic diagram of a display style of still another dubbing cue provided in an exemplary embodiment of the present application;
FIG. 8e is a schematic diagram of a display style of still another dubbing alert message provided in an exemplary embodiment of the present application;
FIG. 9 is a schematic diagram showing playing a first video in a first playing interface and displaying dubbing alert information according to another exemplary embodiment of the present application;
FIG. 10 is an interface diagram of a jump back to a second video through a jump-in portal provided in an exemplary embodiment of the present application;
FIG. 11 is a schematic diagram of an interface for displaying multiple dubbing videos according to an exemplary embodiment of the present application;
FIG. 12 is a schematic illustration of another interface for displaying multiple dubbing videos provided in one exemplary embodiment of the present application;
fig. 13 is a schematic structural view of a video processing apparatus according to an exemplary embodiment of the present application;
fig. 14 is a schematic structural diagram of a computer device according to an exemplary embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
In the embodiment of the application, a video processing scheme is provided, specifically a processing scheme that performs dubbing identification on a video to determine whether the video is a dubbed version of an original video. Wherein: (1) a video is composed of at least two video frames (or image frames) connected in sequence; that is, a video frame is the smallest or most basic unit of a video. When the video is played, the video frames are output continuously in the order of their playing times, and when more than 24 frames change per second, the human eye perceives a smooth and continuous visual effect according to the principle of persistence of vision. Persistence of vision refers to the following: when an object moves rapidly, for example when the video frames of a video are played at a speed exceeding 24 frames per second, after the image seen by the human eye (such as the scene contained in one video frame) disappears, the human eye can still retain the image for about 0.1-0.4 seconds. With the development and popularization of networks, videos are widely shot and spread by virtue of their intuitiveness, interactivity, rich information content and the like.
(2) The original video may be referred to as the original or the video feature for short, and refers to a video that has not been re-dubbed after it was produced; it may include, but is not limited to, movie and television dramas (e.g., movies, television dramas, or animation), short videos or advertisement videos (e.g., short promotional videos of a sponsor) shot by self-media (i.e., channels through which ordinary users publish information such as video over the internet), and so on. Conversely, a dubbed video refers to a new video obtained by re-dubbing the original video; a dubbed video is often obtained by dubbing a video clip in the original video and is typically of shorter duration, so a dubbed video may also be referred to as a short video in the embodiments of the present application.
The re-dubbing may be referred to simply as dubbing, and refers to the process of performing secondary dubbing on the original video to obtain a new dubbed video. Depending on the manner of secondary dubbing, the difference between the dubbed video obtained by secondary dubbing and the original video also differs; the differences here may include differences in the line dimension and differences in the voiceprint dimension. The line dimension refers to the dimension of the line content; when the line dimensions differ, the line content before and after dubbing is different. The voiceprint dimension refers to the dimension of a user's voiceprint, where a voiceprint is the sound-wave spectrum carrying the user's speech information and is a biological feature composed of characteristic dimensions such as wavelength, frequency and intensity; it has the characteristics of stability, measurability and uniqueness and can uniquely identify the sound characteristics of a user, that is, a voiceprint can be used to represent the identity of an object. When the voiceprint dimensions differ, the sound characteristics of the same line before and after dubbing are different, and the difference can be represented by different voiceprint parameters such as tone and volume. Briefly, the re-dubbing of an original video in the embodiments of the present application may include, but is not limited to: re-dubbing the line content of a certain character in the original video, so that the line content of the character differs before and after dubbing while the voiceprint is the same (for example, lines of the character in other original videos are clipped in as the current lines of the dubbed video, or a voiceprint simulator is used to simulate the character's speech, so that the line content before and after dubbing differs but the voiceprint is the same); or re-dubbing the voiceprint of a certain character in the original video, so that the line content of the character is the same before and after dubbing but the voiceprint differs (such as tone and/or timbre); or re-dubbing both the line content and the voiceprint of a certain character in the original video, so that both the line content and the voiceprint of the character differ before and after dubbing.
With the generation of a large amount of dubbing videos, the dubbing videos bring about problems while promoting the development of self-media. For example, dubbing video may violate the copyright rights of the original video; for another example, the dubbing video may bring compliance risks (such as legitimacy and compliance of the speech content contained in the dubbing video) to the audio/video platform (such as an application with audio and video playing functions); as another example, the content of the dubbing video may mislead the evaluation and understanding of the original video by the viewer, and affect the viewing experience of the viewer.
In order to accurately identify dubbed videos, effectively ensure their legality and compliance, and protect the viewing experience of the audience, the video processing scheme provided by the embodiment of the application supports the following: first, a first video and a second video (the original video mentioned above) are acquired, where the first video is the video to be identified and is a video clip in the second video; that is, it needs to be identified whether the first video is a dubbed video obtained by dubbing a certain video clip in the second video. Then, m target video frames may be screened out for the first video from the n second video frames contained in the second video, where n and m are positive integers and n ≥ m; one target video frame is matched with one of the m first video frames contained in the first video, where matching means similarity, and the playing times of the m target video frames in the second video are continuous. Finally, the reference video segment formed by the m target video frames according to playing time is compared with the first video for dubbing, so as to judge whether the first video is a dubbed video obtained by dubbing the reference video segment in the second video.
Therefore, when m target video frames are screened from the second video, the embodiment of the application not only needs to determine that one target video frame is matched with one first video frame in the first video, namely the semantic information of the target video frame is consistent with the semantic information of the first video frame, and the picture scene depicted by the target video frame is similar to the picture scene depicted by the first video frame in visual effect; furthermore, it is also ensured that the playing time of the m target video frames selected from the second video in the second video is continuous, i.e. that the video pictures of the reference video segment composed of the m target video frames remain continuous instead of frame skipping. Therefore, when the subsequent dubbing comparison is carried out, the problem that the comparison result of the dubbing comparison is invalid because the reference video segment is not the same as the first video in the second video can be avoided, namely the effectiveness and the accuracy of the dubbing comparison are ensured.
In order to facilitate understanding of the video processing scheme provided in the embodiments of the present application, the following describes a video processing scenario related to the embodiments of the present application with reference to the video processing system shown in fig. 1 a; as shown in fig. 1a, the video processing system includes a terminal 101 and a server 102, and the types and the number of the terminal 101 and the server 102 are not limited in the embodiment of the present application. The following describes a terminal 101 and a server 102 related to a video processing system, in which:
(1) The terminal 101 is a terminal held by a user (or creator) who publishes a first video on an audio-video platform (e.g., a short-video platform). The terminal 101 may include, but is not limited to: smart phones (such as smart phones running the Android system or smart phones running iOS), tablet computers, personal computers, portable personal computers, mobile internet devices (Mobile Internet Devices, MID for short), smart televisions, vehicle-mounted devices, headsets, smart home devices, and the like; the embodiment of the present application does not limit the type of the terminal 101, which is noted here. Furthermore, an audio-video platform with video publishing or playing functions may be deployed in the terminal 101; thus, in the process of using the terminal 101, the user can upload the first video to be published by opening the audio-video platform deployed in the terminal 101. The audio-video platform deployed in the terminal 101 may be an application program. An application program refers to a computer program that performs one or more specific tasks; in detail, by the way they run, applications may include, but are not limited to: (1) a client (which may also be referred to as an application client or client APP (APPlication)), which refers to an application installed and running in a terminal; (2) an installation-free application, i.e., an application that can be used without downloading and installation, such an application is also commonly referred to as an applet and typically runs as a sub-program in a client; (3) a web (world wide web) application opened through a browser; and so on. Furthermore, besides the several applications mentioned above, the audio-video platform may also be a plug-in supporting video publishing that is included in the above applications. For example, if the application program is the client mentioned above and the client has a social function, the audio-video platform may be a video plug-in included in the client with the social function; in this way, the user can also perform functions such as video publishing and/or playing during social interaction by using the client, without application jumping (such as jumping from the client to a separate video playing application).
(2) The server 102 is a server side for performing video auditing on the uploaded first video; the server side can be specifically a side provided by the audio/video platform, so that the audio/video platform can realize the auditing function of the first video to be released through the server side. That is, the server 102 is a background server corresponding to the terminal 101, and is used for interacting with the terminal 101, specifically, a server corresponding to an audio/video platform deployed in the terminal 101, and is used for interacting with an audio/video platform deployed in the terminal 101, so as to provide computing and application service support for the audio/video platform deployed in the terminal 101. The server 102 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or may be a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a content delivery network (Content Delivery Network, CDN), basic cloud computing services such as big data and an artificial intelligence platform.
The above-mentioned terminal 101 and server 102 may be directly or indirectly connected through wired or wireless communication, which is not limited herein.
The video processing scenario is described below in connection with the video processing system shown in fig. 1 a. In a specific implementation, when a user has a requirement to publish a first video, a video publishing interface provided by the audio-video platform can be opened and displayed through the terminal 101 held by the user, so that the user can upload the first video to be published in the video publishing interface; the terminal 101 (specifically, the audio-video platform deployed in the terminal 101) may then transmit the first video uploaded by the user to the server 102. In this way, the server 102 may screen a second video for the first video from a deployed original-video repository, where the first video is a video clip in the second video; then, the server 102 performs dubbing identification processing on the first video based on the second video, specifically screening out a target video frame for each first video frame in the first video from the second video, and performing dubbing comparison between the first video and a reference video segment formed by the plurality of target video frames to obtain a comparison result, where the comparison result indicates whether the first video is a dubbed video of the reference video segment. Next, after obtaining the comparison result of the dubbing comparison for the first video, the server 102 may also audit and operate on the first video, specifically auditing the first video to determine that, once published, it will not infringe copyright, carry compliance risks or mislead the audience. Finally, after determining that the first video passes the audit, the server 102 may push or distribute the first video to one or more viewers on the audio-video platform (which may include the creator who uploaded the first video) according to a certain video recommendation mechanism, thereby realizing the release or distribution of the dubbed "first video".
It should be understood that the system shown in fig. 1a mentioned above in the embodiment of the present application is intended to describe the technical solution of the embodiment of the present application more clearly, and does not constitute a limitation on the technical solution provided by the embodiment of the present application. As can be appreciated by those skilled in the art, with the evolution of the system architecture and the appearance of new service scenarios, the technical solution provided in the embodiments of the present application is also applicable to similar technical problems. For example, the foregoing describes a video processing scenario of the video processing scheme by taking as an example that the execution subject "computer device" of the embodiment of the present application includes a terminal and a server, that is, the terminal and the server jointly execute the video processing scheme provided in the embodiment of the present application; it should be understood that, in practical applications, the computer device may also be a terminal or a server alone, that is, the video processing scheme provided by the embodiments of the present application may be executed by the terminal or the server alone. For another example, a terminal 103 may be further included in the video processing system, as shown in fig. 1b, and the terminal 103 may be a terminal held by an auditor with video auditing rights. In this way, the server 102 may return the auditing result for the first video to the terminal 103, so that the auditing result is visually presented in the terminal 103 and checked again by the auditor, in order to ensure the legal compliance of the videos distributed by the audio-video platform and maintain the environment of the audio-video platform; alternatively, the auditor may also execute downstream services in combination with the auditing result (such as analyzing the video types on the platform according to the auditing result), and the downstream services may further be implemented by the audio-video platform.
It should be noted that fig. 1a and fig. 1b are only a simple description of the interaction flow when the terminal 101 and the server 102 jointly execute the video processing scheme. Further, a system architecture of the video processing scheme provided by the present solution is deployed in the server 102, so that after the server 102 receives the first video sent by the terminal 101, the architecture can be specifically adopted to implement dubbing recognition processing on the first video. Wherein, the system architecture schematic diagram of the present solution can be seen in fig. 2; as shown in fig. 2, the system architecture may mainly include the following modules: the system comprises a speech recognition module, an image representation learning module, an image representation matching module and a comparison module; wherein:
(1) the speech recognition module is mainly used for preprocessing speech in the first video; the speech recognition module can recognize speech content and/or voiceprint information of one or more characters (or actors) in the first video, and prepare for subsequent dubbing comparison. (2) The image representation learning module is mainly used for carrying out image representation on video frames in videos (such as a first video and a second video) to obtain feature vectors capable of representing semantic information of the video frames; that is, after the image characterization learning module performs image characterization learning on the video frames in the video, the video frames can be converted into feature vectors, and the feature vectors are summary of semantic information (such as scene information, element information and the like) expressed by the video frames. Notably, the image characterization learning module provided by the embodiment of the application is shared between the first video and the second video; that is, image characterization learning for a first video frame in a first video and image characterization learning for a second video frame in a second video are both using the image characterization learning module; the dubbing recognition efficiency can be improved to a certain extent.
(3) The image representation matching module is mainly used for finding m target video frames matched with m first video frames in the first video from n second video frames contained in the second video, and ensuring that the playing time of the m target video frames in the second video is continuous. That is, correspondence between the second video frame in the second video and the first video frame in the first video is achieved; in this way, the reference video segment composed of the m target video frames can be ensured to be the corresponding video segment of the first video in the second video, namely, comparability between the reference video segment and the first video is ensured, and further, the effectiveness of subsequent dubbing comparison is ensured. (4) The comparison module is mainly used for dubbing comparison of the first video and the reference video segment, and specifically adopts the speech recognition module to compare the speech recognition result of the first video with the speech recognition result of the recognized reference video segment so as to judge whether the first video is the dubbing segment obtained by dubbing the reference video segment; the dubbing here may include, without limitation, the aforementioned reassortment of only the speech content, or reassortment of only the speech voiceprint, or reassortment of both the speech content and the speech voiceprint.
The video processing scheme is implemented based on the modules in the system shown in fig. 2, and the general flow of the scheme can be summarized as follows: first, text information (or audio information) of the first video to be recognized is extracted based on the speech recognition module (also referred to as a short-video ASR (Automatic Speech Recognition) recognition module). Then, the image characterization learning module converts video frames into image characterizations through self-supervised learning, specifically the first video frames in the first video and the second video frames in the second video. Next, the image characterization matching module establishes the correspondence between the first video frames in the first video and the second video frames in the second video, specifically screening out, from the n second video frames contained in the second video, a target video frame matched with each of the m first video frames, while ensuring that the playing times of the m target video frames in the second video are continuous. Finally, the comparison module performs dubbing comparison between the first video and the reference video segment, and whether the first video has been secondarily dubbed on the basis of the reference video segment in the second video is judged according to the comparison result.
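Put together, the end-to-end flow just summarized can be sketched in Python as follows, purely for orientation. Every callable passed in stands for one of the modules described above and is assumed to be implemented elsewhere; the function and key names are inventions of this sketch, not of the application.

```python
def dubbing_check(first_video, second_video, modules):
    """High-level flow of the scheme. `modules` is a dict of callables standing in for
    the speech recognition, image characterization, matching and comparison modules."""
    asr = modules["asr_recognize"]
    extract = modules["extract_frames"]
    characterize = modules["image_characterize"]
    match = modules["match_frames"]
    compose = modules["compose_segment"]
    compare = modules["compare_dubbing"]

    first_text = asr(first_video)                                   # speech recognition module
    first_reprs = [characterize(f) for f in extract(first_video)]   # image characterization learning module
    second_reprs = [characterize(f) for f in extract(second_video)]
    target_frames = match(first_reprs, second_reprs)                # image characterization matching module
    reference_segment = compose(second_video, target_frames)        # m target frames -> reference video segment
    ref_text = asr(reference_segment)
    return compare(first_text, ref_text)                            # comparison module -> comparison result
```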
It should also be noted that (1) the first video may be obtained by clipping and splicing parts of video clips from a plurality of second videos; in this case, the first video may first be analyzed, mainly to identify the video splicing points in the first video, so that the first video can be divided into a plurality of sub-videos according to the splicing points, and dubbing comparison is then performed on each sub-video by adopting the embodiment of the present application. In this way, as long as any sub-video is a dubbed video, the first video can be determined to be a dubbed video and a prompt can be annotated when the first video is played, thereby realizing dubbing recognition for spliced videos.
(2) The foregoing is merely a brief introduction to each module and the overall flow of the system related to the video processing scheme provided in the embodiments of the present application; detailed descriptions will be given below in connection with specific implementations of each module. In addition, in the embodiments of the present application, the relevant data collection process should strictly comply with the requirements of relevant national laws and regulations: the individual should be informed of or consent to the collection of personal information (or there should be a legal basis for acquiring the information), and subsequent data use and processing should be carried out within the scope authorized by laws, regulations and the personal-information subject. For example, when the embodiments of the present application are applied to specific products or technologies, such as when a first video is published, the permission or consent of the uploader or creator of the first video needs to be obtained, and the collection, use and processing of relevant data (such as the collection and publication of bullet-screen comments posted by an object) need to comply with the relevant laws, regulations and standards of the relevant countries and regions.
Based on the video processing scheme described above, the embodiment of the present application proposes a more detailed video processing method, and the video processing method proposed by the embodiment of the present application will be described in detail below with reference to the accompanying drawings. FIG. 3 is a flow chart of a video processing method according to an exemplary embodiment of the present application; the video processing method may be performed by a computer device in the aforementioned system, such as the computer device being a terminal and/or a server; the video processing method may include, but is not limited to, steps S301-S303:
s301: and acquiring a first video and a second video.
The first video is a video to be processed received by the audio-video platform, for example, the first video is a short video uploaded by an creator. The second video is a video cached in a video database of the audio-video platform, and is an original video, namely a video which is not subjected to secondary dubbing; the video database is cached with abundant original video resources in advance, such as abundant film and television drama feature resources.
In a specific implementation, when an creator has a requirement of publishing a first video to an audio/video platform, the creator can upload the first video by using an audio/video platform logged in a computer device (i.e., a terminal) held by the creator, so that the terminal can send the first video to a computer device corresponding to the audio/video platform (i.e., a background server corresponding to the audio/video platform), and at the moment, the first video to be processed or identified can be obtained. Further, the background server corresponding to the audio-video platform can screen a second video matched with the first video from the video database based on the first video; the first video and the second video being matched here can be understood as: the first video is a video clip in the second video, for example, the first video belongs to a video clip included in the second video, for example, the total duration of the second video is 10 minutes, and the first video may be a video clip in a 5 th minute to 8 th minute in the total duration.
It is noted that the embodiments of the present application do not limit the filtering rule for filtering the second video from the video database based on the first video. Illustratively, the screening rules may include, but are not limited to: screening a video database with actor names of one or more actors in the first video as indexes; or, screening the video database by taking keywords of the lines in the first video as indexes; or, the scene or video element (such as the interface elements of mountain, sea, etc. contained in a certain video frame) contained in the first video is taken as an index to screen the video database; etc.
S302: and screening m target video frames from the n second video frames.
In order to be able to locate the first video from the second video, i.e. for the first video, it is necessary to locate it at a video position in the second video in order to be able to determine whether the first video is obtained by re-dubbing the video clip located in the second video. The embodiment of the application supports positioning the first video from the second video from the dimension of the video frame; specifically, each first video frame in the first video corresponds to each second video frame in the second video, and a video segment formed by the corresponding second video frames is used as a video position of the first video positioned from the second video. The method comprises the steps that a first video comprises m continuous first video frames, a second video comprises n continuous second video frames, n and m are positive integers, and n is larger than or equal to m; the continuity herein refers to the continuity of the playing time of each video frame, and when the playing time of each video frame is continuous, the video pictures or video scenarios presented when each video frame is played in sequence are continuous; then a target video frame that matches (or corresponds to) each of the m first video frames needs to be found from the n second video frames.
It is noted that, considering that the number of video frames contained in the first video and the second video tends to be large, performing the computation for every single frame is likely to reduce computation speed and efficiency. Based on this, the embodiment of the application supports frame extraction on the first video and the second video, so that the m first video frames are obtained from the first video by frame extraction and the n second video frames are obtained from the second video by frame extraction. In this way, the computation is performed on a smaller number of first video frames and second video frames, which to some extent ensures the efficiency of the subsequent computation. In a specific implementation, assuming that the first video contains M video frames in total, where M is an integer greater than m, the aforementioned m first video frames may be obtained by performing frame extraction on the M video frames contained in the first video; similarly, assuming that the second video contains N video frames in total, where N is an integer greater than n, the aforementioned n second video frames may be obtained by performing frame extraction on the N video frames contained in the second video. The frame extraction may, for example, extract video frames from the video at 5 FPS; FPS (Frames Per Second) refers to the number of frames transmitted or processed per second, so 5 FPS means extracting 5 video frames per second. It should be understood that the specific frame-extraction rate is not limited in the embodiment of the present application and may be 5 frames per second, 10 frames per second, or the like.
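As an illustration, a minimal frame-extraction sketch using OpenCV is given below; the 5 FPS rate and the file paths are example assumptions rather than values required by this embodiment:

```python
import cv2  # OpenCV, assumed available for this illustration


def extract_frames(video_path: str, target_fps: float = 5.0):
    """Extract frames from a video at roughly `target_fps` frames per second."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or target_fps
    step = max(int(round(native_fps / target_fps)), 1)  # keep every `step`-th frame
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            frames.append(frame)
        index += 1
    cap.release()
    return frames


# Example (hypothetical file names): m first video frames and n second video frames
# first_frames = extract_frames("first_video.mp4")    # m frames
# second_frames = extract_frames("second_video.mp4")  # n frames
```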
Based on the above description about the video frames, a specific implementation process of screening m target video frames matched with m first video frames from n second video frames included in the second video is given below; the specific implementation process may include, but is not limited to, steps (1) - (2), wherein:
(1) For each of the m first video frames, candidate video frames are screened from the n second video frames, obtaining m video groups corresponding to the m first video frames. Each video group contains K candidate video frames, where K is an integer greater than 1. One video group corresponds to one first video frame, and this correspondence means that the K candidate video frames contained in the video group match the first video frame corresponding to that group; matching here means that the candidate video frames and the first video frame are similar in the image dimension, in other words, each candidate video frame in the video group corresponding to a first video frame has a certain degree of similarity to that first video frame.
Specific implementations of screening m video groups from n second video frames may include, but are not limited to:
(1) in order to be able to locate a first video in a second video, embodiments of the present application support converting each video frame of the video into a feature vector, which is a generalized summary of semantic information of the video pictures described by the video frame; semantic information may include, but is not limited to: scene information described by the video frame, element information of each element contained in the video frame, and the like. The distance between the feature vectors of different video frames can be used to reflect the similarity of the corresponding different video frames in the image dimension, namely the picture similarity between the different video frames. In detail, an image representation learning algorithm can be adopted to perform feature extraction processing on each first video frame in m first video frames, so as to obtain an image representation of each first video frame; and carrying out feature extraction processing on each second video frame in the n second video frames by adopting an image representation learning algorithm to obtain an image representation of each second video frame, wherein the image representation is the feature vector.
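As a hedged illustration of turning frames into image representations, the sketch below uses a pretrained torchvision ResNet-50 purely as a stand-in for the image characterization learning module of this embodiment (which is not fixed to this backbone), and L2-normalizes the output so that dot products act as cosine similarities:

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# Stand-in encoder: pretrained ResNet-50 with its classification head removed.
_backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
_backbone.fc = torch.nn.Identity()
_backbone.eval()

_preprocess = T.Compose([
    T.ToPILImage(),
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])


@torch.no_grad()
def image_representation(frame) -> torch.Tensor:
    """Map one H x W x 3 uint8 video frame to a unit-length feature vector."""
    x = _preprocess(frame).unsqueeze(0)   # shape (1, 3, 224, 224)
    feat = _backbone(x).squeeze(0)        # shape (2048,)
    return feat / feat.norm()             # L2-normalize: dot product == cosine similarity
```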
The image characterization learning algorithm mentioned above may be provided by the image characterization learning module in the system shown in fig. 2; the image characterization learning algorithm employed in the embodiments of the present application is not limited and may include, for example, the self-supervised learning algorithm BYOL (Bootstrap Your Own Latent). Self-supervised learning algorithms aim to convert unlabeled data into useful information by learning its latent representation (Latent Representation), thereby improving the generalization ability of the model. The core idea of this self-supervised learning algorithm is to use two neural networks: an online network is trained to predict the features produced by a target network, and the predictions are used to update the online network's own parameters. The online network (which contains the prediction head) and the target network share the same structure, but the weights of the target network are obtained by a moving-average calculation and do not directly participate in gradient updates; in this way, the online network gradually improves its performance by learning the latent representation produced by the target network.
The algorithm framework of the self-supervised learning algorithm can be seen in fig. 4; as shown in fig. 4, the BYOL algorithm comprises two network structures: an online network (online) and a target network (target). Online network: its parameters are denoted by θ and comprise an encoder f_θ, a projector g_θ and a predictor q_θ. Target network: its parameters are denoted by ξ and comprise an encoder f_ξ and a projector g_ξ. The weights of the target network are an exponential moving average of the online network's weights; that is, the target network is updated by updating the online network and then applying a moving average. A moving average (also called a rolling average) is a common tool for analyzing time series in technical analysis; concretely, it averages the values of a parameter over a historical time window to obtain a mean value.
In the process of training the model architecture shown in fig. 4, an image dataset D containing abundant images is first acquired, where any image may be denoted x. Two different image augmentation distributions t and t' are then applied to the images in D (such as image denoising, feature extraction and/or graying), producing two augmented views of each image, denoted v and v' respectively; that is, writing the augmentation operations as t(·) and t'(·), v = t(x) and v' = t'(x). Further, for v in the online network, the encoder produces the representation y_θ = f_θ(v), the projector produces z_θ = g_θ(y_θ), and the predictor produces q_θ(z_θ); similarly, for v' in the target network, the encoder produces y'_ξ = f_ξ(v') and the projector produces z'_ξ = g_ξ(y'_ξ). Thus, for each input sample image during training, the online network generates the prediction vector q_θ(z_θ), which is compared against the true representation vector z'_ξ of the target network to calculate the loss function. The loss function consists of two parts: a cosine-similarity loss between the online network and the target network, and a mean-square-error loss between the online network and the target network. When the online network and the target network are optimized based on this loss, the weights of the online network are updated by gradient descent while the weights of the target network are updated by an exponential moving average, so that a model with better performance is obtained through training.
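For illustration only, a compact PyTorch sketch of this BYOL-style objective and moving-average update is given below; the layer sizes, the moving-average coefficient tau, and the helper names are assumptions made for the sketch, and the symmetric loss term obtained by swapping the two views is omitted for brevity:

```python
import copy
import torch
import torch.nn.functional as F


def mlp(in_dim: int, hidden: int = 4096, out_dim: int = 256) -> torch.nn.Module:
    return torch.nn.Sequential(
        torch.nn.Linear(in_dim, hidden), torch.nn.BatchNorm1d(hidden),
        torch.nn.ReLU(inplace=True), torch.nn.Linear(hidden, out_dim))


class BYOLSketch(torch.nn.Module):
    def __init__(self, encoder: torch.nn.Module, feat_dim: int, tau: float = 0.99):
        super().__init__()
        # Online network: encoder f_theta, projector g_theta, predictor q_theta.
        self.online_encoder = encoder
        self.online_projector = mlp(feat_dim)
        self.predictor = mlp(256, out_dim=256)
        # Target network: same structure, weights updated only by moving average.
        self.target_encoder = copy.deepcopy(encoder)
        self.target_projector = copy.deepcopy(self.online_projector)
        for p in list(self.target_encoder.parameters()) + list(self.target_projector.parameters()):
            p.requires_grad = False
        self.tau = tau  # moving-average coefficient (illustrative value)

    def loss(self, v: torch.Tensor, v_prime: torch.Tensor) -> torch.Tensor:
        # Online prediction q_theta(z_theta) vs. target projection z'_xi (stop-gradient).
        p = self.predictor(self.online_projector(self.online_encoder(v)))
        with torch.no_grad():
            z_prime = self.target_projector(self.target_encoder(v_prime))
        p = F.normalize(p, dim=-1)
        z_prime = F.normalize(z_prime, dim=-1)
        return (2 - 2 * (p * z_prime).sum(dim=-1)).mean()  # equals a cosine-similarity loss

    @torch.no_grad()
    def update_target(self) -> None:
        # Exponential moving average: xi <- tau * xi + (1 - tau) * theta.
        for online, target in [(self.online_encoder, self.target_encoder),
                               (self.online_projector, self.target_projector)]:
            for po, pt in zip(online.parameters(), target.parameters()):
                pt.mul_(self.tau).add_((1 - self.tau) * po)
```

In a training loop, `loss(...).backward()` updates only the online branch by gradient descent, after which `update_target()` refreshes the target branch, matching the description above.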
(2) A similarity operation is carried out between the image representation of each first video frame and the image representation of each second video frame, obtaining n similarities for each first video frame; each similarity indicates the degree of similarity, i.e. the picture similarity, between a first video frame in the first video and the corresponding second video frame in the second video. As described above, the distance between the feature vectors of different video frames can be used to reflect the similarity of those video frames in the image dimension, so the embodiment of the application measures the similarity between a first video frame and a second video frame by the distance between the image representation of the first video frame and the image representation of the second video frame.
Specifically, let q_i denote the image representation of the (i+1)-th first video frame in the first video, i = 0, 1, 2, …, m-1, and let v_{i,j} denote the image representation of the j-th of the Top-K candidate video frames returned for that frame by a nearest-neighbor search over the n second video frames, j = 1, 2, …, K. Taking the (i+1)-th first video frame and its j-th candidate video frame as an example, the similarity between them can be expressed as:

s(i, j) = q_i^T · v_{i,j}    (1)

where s(i, j) denotes the similarity between the (i+1)-th first video frame and its j-th candidate video frame, and q_i^T denotes the transpose of the image representation q_i.
(3) For each first video frame, the K second video frames whose similarity among the n similarities exceeds a similarity threshold are taken as the K candidate video frames matching that first video frame; the K candidate video frames corresponding to each first video frame constitute one of the aforementioned video groups. That is, in the embodiment of the present application, the n second video frames are sorted by their similarity to the first video frame in descending order, and the Top-K frames of that sequence are taken as the K candidate video frames matching the first video frame. Compared with keeping all candidate frames, selecting only the Top-K candidates not only improves the speed and efficiency of secondary-dubbing identification by reducing the number of candidate video frames, but also ensures that the selected candidates have a relatively high similarity to the first video frame, thereby improving the accuracy of secondary-dubbing identification.
The schematic flow of steps (1)–(3) can be seen in fig. 5; as shown in fig. 5, the similarity between each first video frame and the n second video frames contained in the second video may be calculated, and K matching candidate video frames are then selected for each first video frame on the principle of taking the Top-K candidates with the largest similarity. Notably, for the image representations of the first video frames and the second video frames, embodiments of the present application support using a nearest-neighbor search algorithm (e.g., Faiss (Facebook AI Similarity Search)) to find the Top-K nearest neighbors of each first video frame among the n second video frames, where K is a hyperparameter; that is, the above process of selecting K candidate video frames may be implemented with a nearest-neighbor search algorithm. Faiss is a library for efficient similarity search; it provides a series of algorithms and data structures for vector indexing and similarity search that allow efficient similarity search over large datasets. The core of Faiss is an indexing method based on vector quantization (Vector Quantization). Vector quantization divides a continuous vector space into discrete subspaces, allowing high-dimensional vectors to be mapped into a lower-dimensional space where similarity search is performed. Faiss uses a variety of vector quantization algorithms, which may include, but are not limited to, product quantization (Product Quantization, PQ), IVFADC, and so on.
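As an illustrative sketch of this Top-K search (not the embodiment's exact configuration), the following assumes L2-normalized image representations stored in a flat inner-product Faiss index, so the returned inner products behave as cosine similarities; K = 5 is just an example value of the hyperparameter:

```python
import faiss
import numpy as np


def top_k_candidates(first_reprs: np.ndarray, second_reprs: np.ndarray, k: int = 5):
    """first_reprs: (m, d) representations of the m first video frames.
    second_reprs: (n, d) representations of the n second video frames.
    Returns (m, k) similarities and (m, k) indices of candidate second video frames."""
    d = second_reprs.shape[1]
    index = faiss.IndexFlatIP(d)                      # exact inner-product search
    index.add(second_reprs.astype(np.float32))        # index the n second video frames
    sims, ids = index.search(first_reprs.astype(np.float32), k)
    return sims, ids                                  # ids[i] is the i-th frame's video group
```

With normalized vectors an exact flat index is sufficient for a sketch; the quantized index types mentioned above (PQ, IVFADC) would trade a little accuracy for lower memory use and faster search on large video databases.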
(2) A target video frame is determined for each first video frame from the m video groups, based on the time interval information, in the second video, of two candidate video frames belonging to two adjacent video groups among the m video groups, and on the similarity between each first video frame and each candidate video frame in its corresponding video group; the two candidate video frames here belong to different groups of the two adjacent video groups. For example, if two adjacent first video frames in the first video are first video frame i and first video frame i+1, first video frame i corresponds to video group i, first video frame i+1 corresponds to video group i+1, and both video group i and video group i+1 contain K candidate video frames, then two candidate video frames in adjacent video groups may refer to: any candidate video frame in video group i together with any candidate video frame in video group i+1.
Through the similarity operation between the first video frames and the second video frames, each first video frame in the first video can be put into correspondence with second video frames in the second video on the basis of the image representations; specifically, each first video frame corresponds to its matching Top-K second video frames (i.e., the candidate video frames mentioned above). Conventionally, the candidate video frame with the largest similarity value among the K candidate video frames is selected as the target video frame, in the second video, corresponding to the first video frame; as shown in fig. 5, for the first video frame whose image representation is q_1, the selected second video frame is the one whose image representation is v_11, and likewise, for the first video frame whose image representation is q_2, the selected second video frame is the one whose image representation is v_21, and so on. However, selecting the candidate video frame with the largest similarity as the target video frame does not necessarily yield the optimal solution; in other words, although the target video frame selected for each first video frame is the second video frame most similar to that frame, the playing times of these target video frames in the second video may, viewed as a whole, be far apart, and matches whose single-frame similarity is slightly smaller but which are actually better overall may be missed.
For example, suppose the playing time in the second video of candidate video frame 1 matched to first video frame 1 is the 10th second, that of candidate video frame 2 matched to first video frame 2 is the 15th second, and that of candidate video frame 3 matched to first video frame 3 is the 8th second. When candidate video frame 1, candidate video frame 2 and candidate video frame 3 are arranged into a video segment following the playing order of first video frame 1, first video frame 2 and first video frame 3, the playing order within that segment is candidate video frame 1 → candidate video frame 2 → candidate video frame 3, yet the playing times of these three candidate video frames in the second video are not continuous; as a result, a dubbing comparison between the segment composed of candidate video frame 1 → candidate video frame 2 → candidate video frame 3 and the first video is not valid.
In order to select an optimal target video frame from the K candidate video frames corresponding to each first video frame, the embodiment of the present application performs image representation matching based on a dynamic programming (Dynamic Programming) algorithm (i.e., the process of selecting the optimal target video frame from the K candidate video frames of a first video frame, which may be provided by the image representation matching module in the system shown in fig. 2). The "optimal" here is reflected in two requirements: the similarity between the target video frame and the first video frame is greater than the similarity threshold, and for two adjacent first video frames, the playing time of the latter target video frame is greater than that of the former target video frame. These two requirements ensure that the reference video segment formed by arranging the finally selected target video frames according to their playing times in the second video is a continuous segment of the second video, and that the reference video segment and the first video are comparable for dubbing purposes. The dynamic programming algorithm is a commonly used optimization algorithm for problems with overlapping sub-problems and optimal substructure; it usually works bottom-up, first solving the sub-problems and then progressively solving problems of larger scale until the global optimal solution is obtained. Its basic idea can be roughly described as: to solve a given problem, divide it into different sub-problems, solve each sub-problem, and then combine the solutions of the sub-problems to obtain the solution of the original problem.
Specifically, the process of selecting a matched target video frame for a first video frame from its K candidate video frames by means of the dynamic programming algorithm can be roughly divided into two parts: the first part computes matching scores in the forward direction, and the second part determines the target video frame for each first video frame in the reverse direction based on the matching scores. The two parts are described in detail below.
(1) First part: forward computation of the matching scores.
Each first video frame in the first video corresponds to K matching scores, and each matching score corresponds to one of the K candidate video frames of that first video frame. For convenience of explanation, suppose any first video frame is denoted the (i+1)-th video frame and its corresponding video group is denoted the (i+1)-th video group, i = 0, 1, 2, …, m-1. The matching score of the (i+1)-th video frame is then used to indicate: given that target video frames have already been determined for the i video frames preceding the (i+1)-th video frame, the degree of continuity in the time dimension between the candidate video frames in the i-th video group and those in the (i+1)-th video group, together with the degree of matching in the image dimension between the (i+1)-th video frame and the candidate video frames in the (i+1)-th video group. In the embodiment of the application, f(i+1, j) denotes the matching score obtained when the (i+1)-th video frame is matched with the j-th candidate video frame in the (i+1)-th video group; that is, f(i+1, j) denotes the total matching score from the first video frame to the (i+1)-th video frame of the first video, given that the first to i-th video frames have already been matched and the (i+1)-th video frame is matched with the j-th candidate video frame whose image representation is v_{i,j}. The higher the value of f(i+1, j), the more likely the j-th candidate video frame in the (i+1)-th video group is the optimal choice for the (i+1)-th video frame, i.e. the more likely the reference video segment containing that candidate video frame remains continuous and is the correct segment of the second video at which the first video is located.
Based on the above description of the correlation of the matching score, the following is a specific implementation procedure for forward calculating the matching score in conjunction with fig. 6:
When i = 0, the matching score of each candidate video frame in the first of the m video groups with respect to the first video frame of the first video is calculated. Considering that no video frame precedes the first video frame of the first video, the similarities between the K candidate video frames in the first video group and the first video frame can be used directly as the K matching scores of that frame. For example, the matching score between the j-th candidate video frame in the first video group and the first video frame may be expressed by the vector cosine similarity, that is:
f(1, j) = s(0, j) = q_0^T · v_{0,j}, j = 1, 2, …, K    (2)
When 0 < i ≤ m-1, not only the similarity between the (i+1)-th video frame and its candidate video frames is considered, but also the matching scores already obtained up to the i-th video frame, so that the optimal candidate video frame can be selected as the target video frame for the (i+1)-th video frame based on both aspects. In a specific implementation, the K matching scores of the (i+1)-th video frame may be calculated based on the K matching scores of the i-th video frame, the time interval information, in the second video, between each candidate video frame in the i-th video group and each candidate video frame in the (i+1)-th video group, and the similarity between each candidate video frame in the (i+1)-th video group and the (i+1)-th video frame. Then, for each candidate video frame in the (i+1)-th video group, the largest of its K provisional scores is selected; finally, the K scores thus selected are taken as the K matching scores of the (i+1)-th video frame.
The specific process of calculating the K matching scores of the (i+1)-th video frame may be described as follows. Take any candidate video frame in the (i+1)-th video group as the j-th candidate video frame, j = 1, 2, …, K, and any candidate video frame in the i-th video group as the k-th candidate video frame, k = 1, 2, …, K. The image representation v_{i,j} of the j-th candidate video frame and the image representation v_{i-1,k} of the k-th candidate video frame are obtained, and the time interval (i.e., time difference) Δt(k, j) between the playing times in the second video of the k-th candidate video frame and the j-th candidate video frame is calculated. Then, the similarity s(i, j) between the j-th candidate video frame and the (i+1)-th video frame is obtained, and this similarity is weighted according to the time interval Δt(k, j) to obtain a weighted result; the larger the time interval Δt(k, j), the smaller the weight w(Δt(k, j)) applied to the similarity, i.e., the weight is negatively correlated with the time interval. Finally, the matching score f(i, k) of the k-th candidate video frame with respect to the i-th video frame is obtained, and the weighted result is added to it, giving the matching score of the j-th candidate video frame with respect to the (i+1)-th video frame when paired with the k-th candidate video frame; the largest value over k is kept, as described above. The calculation of the K matching scores of the (i+1)-th video frame can therefore be expressed as:

f(i+1, j) = max_k [ f(i, k) + w(Δt(k, j)) · s(i, j) ]    (3)
the meaning of each parameter in the formula (3) can be referred to the above description, and will not be repeated here.
Further, for the m-th video frame (i.e., the last frame) of the m first video frames contained in the first video, the candidate index corresponding to the maximum matching score is taken, which is expressed as:
j* = argmax_j f(m, j)    (4)
(2) Second part: reverse determination of the target video frames.
Based on step (1), the matching scores of the candidate video frames contained in each of the m video groups have been determined; a target video frame can therefore be determined for each first video frame from the m video groups based on the K matching scores of each of the m first video frames. Embodiments of the present application support determining the target video frame of each first video frame by reverse derivation (backtracking) over the matching scores. In a specific implementation, when i = m-1, the maximum matching score is determined from the K matching scores of the m-th video frame (i.e., the last frame) of the first video, and the candidate video frame in the m-th video group corresponding to that maximum matching score is taken as the target video frame of the m-th video frame. When 0 ≤ i < m-1, the candidate video frame in the (i+1)-th video group that enables the (i+2)-th video frame to obtain its maximum matching score is determined, and that candidate video frame is taken as the target video frame of the (i+1)-th video frame.
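A runnable sketch of the forward scoring of formulas (2)–(3), the selection of formula (4), and the reverse backtracking is given below; the decay weight 1/(1+Δt) is only an illustrative choice of a weight that shrinks as the time interval grows, since the embodiment does not fix its exact form:

```python
import numpy as np


def match_target_frames(sims: np.ndarray, cand_times: np.ndarray) -> np.ndarray:
    """sims[i, j]: similarity s(i, j) between the (i+1)-th first video frame and its j-th candidate.
    cand_times[i, j]: playing time (seconds) of that candidate in the second video.
    Returns, for each first video frame, the index j of its selected target candidate."""
    m, K = sims.shape
    f = np.zeros((m, K))                 # f[i, j]: matching scores of formulas (2)/(3)
    back = np.zeros((m, K), dtype=int)   # predecessor choices for backtracking
    f[0] = sims[0]                       # formula (2): first frame, score = similarity
    for i in range(1, m):
        for j in range(K):
            dt = np.abs(cand_times[i, j] - cand_times[i - 1])  # time intervals to group i's candidates
            weights = 1.0 / (1.0 + dt)                         # assumed decay: larger gap -> smaller weight
            scores = f[i - 1] + weights * sims[i, j]           # formula (3) before taking the max
            back[i, j] = int(np.argmax(scores))
            f[i, j] = scores[back[i, j]]
    choice = np.zeros(m, dtype=int)
    choice[m - 1] = int(np.argmax(f[m - 1]))                   # formula (4): best candidate of last frame
    for i in range(m - 2, -1, -1):                             # reverse determination of target frames
        choice[i] = back[i + 1, choice[i + 1]]
    return choice
```

The returned indices select one candidate per video group; ordering the corresponding second video frames by their playing times yields the reference video segment used in the next step.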
In summary, in the embodiment of the present application, each first video frame in the first video can be put into correspondence with a second video frame in the second video. Through the matching process realized by the dynamic programming algorithm, a target video frame that matches in both the image dimension and the time dimension can be selected for each first video frame; that is, both the per-frame similarity from the nearest-neighbor search and the time interval information between matched candidate video frames are taken into account, so that the reference video segment formed by the selected target video frames corresponds as far as possible to a continuous segment of the second video. This improves the comparability and validity of the dubbing comparison between the first video and the reference video segment, and further improves the accuracy of the dubbing comparison of the first video.
S303: and forming m target video frames into a reference video segment according to the sequence of the playing time, and carrying out dubbing comparison on the first video and the reference video segment to obtain a comparison result.
Selecting a matched target video frame from n second video frames of the second video for each first video frame of the first video based on the steps, and combining the m target video frames according to the sequence of the playing time of the target video frames in the second video to obtain a continuous reference video segment; in this way, continuous reference video clips and the first video can be adopted for dubbing comparison, so as to judge whether the first video is obtained by dubbing the reference video clips in the second video.
The embodiment of the application supports dubbing comparison from two aspects of text and/or audio. That is, it may be determined whether the text content included in the first video is the same as the text content included in the reference video clip, so as to implement dubbing comparison; text content herein may refer to the speech content (or caption information) of one or more actors in the video. It may also be determined whether the audio characteristics of the actor in the first video (or the voiceprint characteristics, which may be used to uniquely identify the user making the sound) and the audio characteristics of the actor in the reference video clip are the same to achieve dubbing alignment.
The following describes the implementation procedure of the dubbing comparison in the two aspects, wherein:
In one implementation, the dubbing comparison includes text comparison. In this implementation, step 1: speech recognition processing is performed on the first video to obtain the text information of the first video, and speech recognition processing is performed on the reference video segment to obtain the text information of the reference video segment. The specific text recognition method adopted in the embodiment of the application is not limited; for example, text recognition can be implemented with a trained recognition model, by manual extraction of subtitle information, and so on.
Illustratively, the embodiment of the application supports recognizing the dialogue lines with a line recognition module, which extracts the text information in the first video and in the reference video segment. The line recognition module can be an ASR module, which recognizes the dialogue through an ASR algorithm. ASR is a technique for converting a speech signal into text; its basic flow may include front-end processing of the speech signal in the video, feature extraction, acoustic model training, decoding, and post-processing. Front-end processing of the speech signal includes pre-emphasis (emphasizing the high-frequency part of the speech signal and removing effects such as lip radiation, to increase the high-frequency resolution), framing (slicing the speech signal into multiple short frames based on its short-time stationarity), and windowing (the more finely the signal is framed, the greater the error with respect to the original signal; windowing makes the framed speech signal continuous and reduces this error), the aim being to convert the speech signal into a series of short-time frames. Feature extraction extracts a set of feature vectors from each frame to represent its information. Acoustic model training learns an acoustic model from the extracted feature vectors with a machine learning algorithm; the acoustic model can be used to calculate the probability that the speech signal contained in an input frame belongs to a certain state (for example, corresponds to a certain actor's voice). Decoding combines the acoustic model with a language model (which expresses the linguistic knowledge contained in natural language), using techniques such as the Viterbi algorithm, to obtain the most probable text output; specifically, from the state probabilities output by the acoustic model, the corresponding phoneme string (or phoneme sequence, the most basic unit of pronunciation) is computed with the Viterbi algorithm, and the phoneme string is then converted into text information by the language model. Post-processing applies operations such as pinyin-to-Chinese-character conversion and grammar correction to the decoding result, making the text information more accurate.
Further, embodiments of the present application may implement the ASR process described above using, but not limited to, the neural-network-based Connectionist Temporal Classification (CTC) algorithm to extract the text information of a video. The CTC procedure may include, but is not limited to, the following. First, the input sequence and output sequence of the algorithm are mapped to sequences of the same length: the input sequence is converted into a series of feature vectors using techniques such as framing and feature extraction, while the output sequence is converted into a series of characters with blanks inserted between the characters. The mapped sequences are then used to train a neural network model, which typically adopts a structure such as a recurrent neural network (Recurrent Neural Network, RNN) or a convolutional neural network (Convolutional Neural Networks, CNN) and learns the mapping between the input sequence and the output sequence. Next, the probability of the output sequence is calculated: at each time step (i.e., at each point of the sequence where input is received and output produced), the neural network outputs a probability distribution over the possible output characters, and the probability of the output sequence is computed with a dynamic programming algorithm. Finally, the loss function is calculated: the loss of the CTC algorithm is the negative log-likelihood of the output sequence, which is minimized by maximizing the probability of the output sequence. The input of the CTC algorithm is a sequence of acoustic features and the output is a text sequence; by maximizing the probability of the output sequence, the model can be trained end to end, avoiding hand-designed alignment features.
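As a hedged illustration of CTC training, the sketch below uses PyTorch's built-in CTC loss with a small LSTM acoustic model; the vocabulary size, feature dimension, and sequence lengths are placeholder assumptions, not values prescribed by this embodiment:

```python
import torch

vocab_size = 30                                  # e.g., characters plus the CTC blank at index 0
acoustic_model = torch.nn.LSTM(input_size=80, hidden_size=256, batch_first=True)
classifier = torch.nn.Linear(256, vocab_size)
ctc_loss = torch.nn.CTCLoss(blank=0)

features = torch.randn(4, 200, 80)               # 4 utterances, 200 frames of 80-dim acoustic features
targets = torch.randint(1, vocab_size, (4, 50))  # padded character targets (no blanks)
input_lengths = torch.full((4,), 200, dtype=torch.long)
target_lengths = torch.full((4,), 50, dtype=torch.long)

hidden, _ = acoustic_model(features)             # (batch, time, 256)
log_probs = classifier(hidden).log_softmax(-1)   # (batch, time, vocab)
# CTCLoss expects (time, batch, vocab); minimizing this negative log-likelihood
# is equivalent to maximizing the probability of the output sequence.
loss = ctc_loss(log_probs.transpose(0, 1), targets, input_lengths, target_lengths)
loss.backward()
```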
Step 2: the text information of the first video is compared with the text information of the reference video segment to obtain a comparison result. It should be appreciated that text can be compared in a variety of ways, including but not limited to: measuring the similarity of text semantic features with a deep learning model, computing an edit distance (Edit Distance), and so on. The semantic-feature approach is realized by training a deep learning model: sample text information is constructed to train the model, and the trained model is then used to compare texts. The edit distance, also called the Levenshtein distance, is an index measuring the similarity between two strings; it is defined as the minimum number of editing operations required to convert one string into the other, where the editing operations are insertion, deletion, and replacement. The text information of the first video is converted into the text information of the reference video segment through editing operations, and the text comparison is realized by counting the number of editing operations performed.
The text comparison between the text information of the first video and that of the reference video segment by means of the edit distance is described below. In a particular implementation, the text information includes one or more characters, which may include at least one of: Chinese characters, English characters (i.e., letters), digits, and punctuation marks (e.g., the comma ",", the period "." and brackets). First, a number of editing operations are performed on the characters contained in the text information of the first video to obtain new, edited text information that is identical to the text information of the reference video segment. Then, the number of editing operations is counted to obtain an edit-count result, and the comparison result is generated based on it: when the edit-count result indicates that the number of editing operations is greater than an operation threshold, the comparison result indicates that the first video is obtained by dubbing the reference video segment; when the edit-count result indicates that the number of editing operations is less than or equal to the operation threshold, the comparison result indicates that the first video is not obtained by dubbing the reference video segment.
For example, assume that the above operation threshold is 1, i.e., when the number of editing operations required to convert the text information of the first video into the text information of the reference video segment is greater than 1, it is determined that the first video is obtained by dubbing the reference video segment. For instance, if the text information of the first video is the string "abc" and that of the reference video segment is the string "acd", converting "abc" into "acd" requires replacing the character "b" in "abc" with "c" and replacing the character "c" in "abc" with "d", i.e., two replacement operations; the edit distance between the first video and the reference video segment is therefore 2, which is greater than the operation threshold 1, so it is determined that the first video is obtained by dubbing the reference video segment. As another example, if the text information of the first video is "abc" and that of the reference video segment is "abcc", converting "abc" into "abcc" requires inserting one character "c" into "abc", i.e., one insertion operation, so the edit distance is 1; since the edit distance 1 equals the operation threshold 1, it is determined that the first video is not obtained by dubbing the reference video segment. As yet another example, if the text information of the first video is "abc" and that of the reference video segment is "ac", converting "abc" into "ac" requires deleting the character "b" from "abc", i.e., one deletion operation, so the edit distance is 1; since the edit distance 1 equals the operation threshold 1, it is determined that the first video is not obtained by dubbing the reference video segment.
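A minimal sketch of this edit-distance comparison and threshold decision is given below; the function names and the default threshold of 1 simply mirror the example above and are not fixed by this embodiment:

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insert/delete/replace operations needed to turn a into b."""
    prev_row = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur_row = [i]
        for j, cb in enumerate(b, start=1):
            cur_row.append(min(prev_row[j] + 1,                # delete ca
                               cur_row[j - 1] + 1,             # insert cb
                               prev_row[j - 1] + (ca != cb)))  # replace (or match)
        prev_row = cur_row
    return prev_row[len(b)]


def is_dubbed(first_text: str, reference_text: str, threshold: int = 1) -> bool:
    """True when the edit distance exceeds the operation threshold, i.e. the dialogue differs."""
    return edit_distance(first_text, reference_text) > threshold


# Reproducing the worked examples: edit_distance("abc", "acd") == 2 -> dubbed;
# edit_distance("abc", "abcc") == 1 and edit_distance("abc", "ac") == 1 -> not dubbed.
```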
In other implementations, the dubbing comparison includes audio comparison. The audio comparison mainly checks whether the voiceprint features of the actors in the first video are consistent with the voiceprint features of the actors in the reference video segment. If they are consistent, the same lines in the first video and the reference video segment were spoken by the same dubbing person, and it is determined that the first video is not obtained by re-dubbing the reference video segment; if they are not consistent, the same lines were spoken by different dubbing persons, and it is determined that the first video is obtained by dubbing the reference video segment.
In the specific implementation, speech recognition processing is first performed on the first video to obtain the audio information of the first video, and on the reference video segment to obtain the audio information of the reference video segment; the algorithm used for this processing is not limited in the embodiments of the present application and includes, but is not limited to, Mel-frequency cepstrum coefficients (Mel Frequency Cepstrum Coefficient, MFCC), LogFBank, and the like. Then, the audio information of the first video is compared with the audio information of the reference video segment to obtain a comparison result: when the audio information of the first video differs from that of the reference video segment, the comparison result indicates that the first video is obtained by dubbing the reference video segment; when they are the same, the comparison result indicates that the first video is not obtained by dubbing the reference video segment.
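As a deliberately simplified illustration (librosa is assumed to be available), the sketch below summarizes each audio track by its mean MFCC vector and compares the summaries by cosine similarity; this is only a crude stand-in for a real voiceprint comparison, and the 0.9 threshold is an assumption made for the sketch:

```python
import librosa
import numpy as np


def mean_mfcc(audio_path: str, n_mfcc: int = 20) -> np.ndarray:
    """Load an audio track and summarize it as the mean of its MFCC frames."""
    y, sr = librosa.load(audio_path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # shape (n_mfcc, frames)
    return mfcc.mean(axis=1)


def same_voice(first_audio: str, reference_audio: str, threshold: float = 0.9) -> bool:
    """Rough check: True when the two tracks' MFCC summaries are highly similar."""
    a, b = mean_mfcc(first_audio), mean_mfcc(reference_audio)
    cos = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    # Same voice -> under this embodiment's logic, the first video is not a re-dubbed video.
    return cos >= threshold
```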
It should be noted that, in the embodiment of the present application, a single dubbing comparison may use both of the above aspects at the same time: only when the comparison results of both aspects indicate that the first video is not a dubbed version of the reference video segment is it determined that the first video is not a dubbed version of the reference video segment; otherwise, the first video is judged to be a dubbed version of the reference video segment. Of course, in some scenarios the dubbing mode may be predetermined; for example, if the audio-video platform only allows uploading dubbed videos whose audio has been re-recorded, only the audio-side dubbing comparison needs to be executed; similarly, if the platform only allows uploading dubbed videos whose text has been changed, only the text-side dubbing comparison needs to be executed.
In summary, in the embodiment of the present application, m target video frames are selected from the second video, so that not only each target video frame is matched with one first video frame in the first video to ensure similarity between the target video frame and the first video frame, but also the playing time of the m target video frames in the second video is continuous, so as to ensure that the video frames of the reference video segment formed by the m target video frames remain continuous rather than frame skipping. In this way, the reference video segment extracted from the second video can be ensured to be consistent with the first video as far as possible, that is to say, the reference video segment can be considered to be the same video segment extracted from the second video as the first video, so that the situation that the comparison result of the dubbing comparison is invalid because the reference video segment is not the same video segment as the first video in the second video when the dubbing comparison is carried out later can be avoided, namely, the effectiveness and the accuracy of the dubbing comparison are ensured.
FIG. 7 is a flow chart of a video processing method according to an exemplary embodiment of the present application; the video processing method may be performed by a computer device in the aforementioned system, such as a terminal and a server, or a terminal; the video processing method may include, but is not limited to, steps S701-S704:
s701: and acquiring a first video and a second video.
S702: and screening m target video frames from the n second video frames.
S703: and forming m target video frames into a reference video segment according to the sequence of the playing time, and carrying out dubbing comparison on the first video and the reference video segment to obtain a comparison result.
It should be noted that, the specific implementation process shown in steps S701 to S703 may be referred to the description of the specific implementation process shown in steps S301 to S303 in the embodiment shown in fig. 3, which is not repeated herein.
S704: if the comparison result indicates that the first video is obtained by dubbing the reference video clip, displaying a first playing interface of the first video, and displaying dubbing prompt information in the first playing interface when the first video is played in the first playing interface.
Based on the specific implementation process shown in the foregoing steps S701-S703, a comparison result obtained by dubbing comparison between the first video and the reference video segment may be obtained, where the comparison result indicates whether the first video is the dubbing video of the reference video segment in the second video. Furthermore, if the first video is to be put on shelf (i.e. successfully released), the audio-video platform also needs to audit the first video to ensure that the first video is a legal and compliant video, thereby effectively maintaining the health and safety of internet resources.
Furthermore, when the comparison result indicates that the first video is obtained by dubbing the reference video segment and the first video is successfully audited, the embodiment of the application further supports marking the first video when the first video is played in the terminal screen of the terminal, so that a viewer is prompted to indicate that the first video is the dubbed video through the marking, and the copyright of the original video (namely the second video) is effectively maintained. In a specific implementation, a first playing interface of a first video is displayed on a terminal, and dubbing prompt information is displayed on the first playing interface, wherein the dubbing prompt information is used for prompting that the first video is obtained by dubbing again. Wherein the first playing Interface is a User Interface (UI) for playing video. The first playing interface is provided by an audio-video platform; thus, when a user opens and uses the audio/video platform (such as a client with audio/video playing function, an applet or a web application), the terminal can output a service interface provided by the audio/video platform, and the service interface can be any interface in the audio/video platform. The service interface includes, but is not limited to: a channel interface (an interface including a plurality of audio and video channels, such as a variety channel, a series channel, and a movie channel), a personal interface (an interface including account information of an object account logged in an audio and video platform), or a first play interface, etc.
It should be noted that, in the case that the service interface is a channel interface or a personal interface, if the user has a requirement for playing the first video in the audio/video platform, the user may trigger to play the first video through the service interface, and at this time, may trigger to display the first playing interface of the first video in the terminal screen, so as to play the first video in the first playing interface, and display dubbing prompt information in the first playing interface. If the service interface is directly the first playing interface (for example, the audio-video platform is a short video platform, and the short video that can be quickly switched to play by performing a sliding operation (up-down sliding) on the terminal interface), the first video is directly played in the first playing interface, and the dubbing prompt information is displayed in the first playing interface.
An exemplary first video played in a first play interface and a schematic diagram showing dubbing cues may be seen in fig. 8a. As shown in fig. 8a, assuming that the audio/video platform is a short video platform, after a user opens the deployed short video platform through a terminal held by the user, a first playing interface 801 of a first video provided by the short video platform may be displayed, and the first video may be played in the first playing interface 801. In the process of playing the first video in the first playing interface 801, the dubbing prompt information 802 may be displayed at any display position in the first playing interface 801, so that the user may perceive that the first video is the dubbing video through the dubbing prompt information 802, thereby avoiding misleading to the user.
Note that (1) the embodiment of the present application does not limit the expression form or the form of the dubbing prompt information in the first playing interface; the dubbing alert message 802 shown in fig. 8a is a text type alert message with transparency of 0 and fixedly displayed at the upper right corner display position of the first playback interface. However, in practical applications, the transparency of the dubbing alert information in the first playing interface may be greater than 0, i.e., the dubbing alert information may be displayed in a certain transparency form, as shown in fig. 8 b. Alternatively, the dubbing alert information may be dynamically displayed in the first playing interface, where the dynamic display may include, but is not limited to: the scrolling display is cycled along the target direction (e.g., horizontal direction (as shown in fig. 8 c), diagonal direction, or vertical direction) or periodically appearing in the first playback interface (e.g., once every 1 second as shown in fig. 8 d). Or, the dubbing prompt information may also be displayed in the form of an icon or the like in the first playing interface, as shown in fig. 8 e; etc.
(2) In addition to the continuous display or the periodic display of the dubbing prompt information on the first playing interface during the playing process of the first video as shown in fig. 8a, the dubbing prompt information may be triggered to be displayed when the first playing interface for displaying the first video is triggered, and the display of the dubbing prompt information may be canceled after the first video starts to be played or the first video starts to be played for a period of time (such as 1 second or 2 seconds). The method for displaying the dubbing prompt information when the first playing interface is triggered to display can not only play a role in dubbing prompt for a user, but also avoid the problems of picture shielding and the like caused by displaying the dubbing prompt information in the first playing interface in the subsequent process of playing the first video, thereby improving the watching experience of the user. For example, a schematic diagram of triggering the display of the dubbing cue in the first playing interface and canceling the display of the dubbing cue after 1 second of the first video playing may be seen in fig. 9. As shown in fig. 9, when a candidate video 901 is played in a terminal screen, if a video switching operation (such as a sliding operation, a clicking operation, or a clicking operation for a first video) is detected, switching from the candidate video 901 to a first playing interface of the first video, and displaying dubbing prompt information 902 in the first playing interface when the first playing interface is triggered to be displayed; and the dubbing prompt message 902 is continuously displayed in the first playing interface until the playing duration of the first video reaches 2 seconds (i.e. the dubbing prompt message 902 is continuously displayed in the first playing interface for 2 seconds), and the display of the dubbing prompt message 902 is canceled. Of course, the duration of the continuous display of the dubbing cue in the first playback interface is not limited to 2 seconds, and 2 seconds in the above example is merely an example.
In addition, the embodiment of the application also supports providing jump links (or entries, controls, components, buttons, options and the like) for the user when the comparison result indicates that the first video frame is obtained by dubbing the reference video segment; through the jump-in port, the second video can be automatically jumped to be played; compared with the link that the first video is firstly exited and then the second video is triggered from the inlet for playing the second video, the link that the first video is switched to the second video is effectively shortened, and user experience is improved. In a specific implementation, a first playing interface of a first video is displayed, and a jump inlet is displayed in the first playing interface; and responding to the trigger operation for the jump entrance, indicating that the user wants to switch back from the dubbing video to the original video, jumping from the first video to the second video, and playing the second video. An exemplary interface diagram for jumping back to the second video through the jump-in port can be seen in fig. 10; as shown in fig. 10, a jump-in port 1001 is displayed in the first playing interface, and the jump-in port 1001 not only can realize fast switching from the first video to the second video, but also can prompt the user that the first video is an audio-video; when the user performs a trigger operation on the jump-in 1001, it is possible to switch from the first playing interface to the second playing interface 1002 and play the second video in the second playing interface 1002.
With continued reference to fig. 10, when a user makes a jump from a first playing interface for playing a first video to a second playing interface for playing a second video through a jump portal in the first playing interface, the embodiments of the present application support: starting playing from a first second video frame in the second video in the second playing interface; i.e. playing the second video from the beginning in the second play interface. Or starting playing from a target second video frame in the second video; the target second video frame is a second video frame corresponding to the first video frame displayed in the first playing interface when the jump entrance in the first playing interface is triggered; that is, when the jump-in is triggered, the jump-in can be automatically jumped to the position of the second video frame corresponding to the first video frame when the jump-in is triggered in the second video for playing. Therefore, the method and the device not only can realize quick switching from the first video to the second video, but also can keep the continuity of video playing, promote the continuity of video watching by a user, and effectively ensure the watching experience of the user.
In addition, in the case that the second video corresponds to a plurality of dubbing videos, the embodiment of the application also supports switching from the first video to any one of the plurality of dubbing videos. In the specific implementation, if the comparison result indicates that the first video is obtained by dubbing the reference video segment, and the second video to which the reference video segment belongs corresponds to a plurality of dubbed videos, wherein the plurality of dubbing videos comprise the first video; in this case, a first playing interface of the first video may be displayed, and if a dubbing selection requirement exists in the first playing interface in the process of playing the first video in the first playing interface, then responding to the dubbing selection requirement, and outputting a video identifier of each of the plurality of dubbing videos; and selecting a target video identifier from the video identifiers of the plurality of dubbing videos according to the identifier selection operation, and playing the target dubbing video corresponding to the selected target video identifier. The video identifier of the dubbing video can be used for uniquely identifying the dubbing video; the dubbing identifier of the dubbing video can be an account number of an creator creating the dubbing video, and the specific form of the video identifier of the dubbing video is not limited in the embodiment of the application. Therefore, the mode of switching from the first video to any video except the first video in the plurality of dubbing videos corresponding to the second video can improve man-machine interaction and interestingness in the video watching process.
For example, as shown in fig. 11, the video identifier of each of the plurality of dubbing videos corresponding to the second video is displayed directly in the first playing interface, including: video identifier 1 of dubbing video 1, video identifier 2 of dubbing video 2, video identifier 3 of dubbing video 3, and so on. In this way, the user may directly perform the identifier selection operation (such as a click on a video identifier) in the first playing interface, select the target video identifier (such as video identifier 2) from the video identifiers of the multiple dubbing videos, and jump from the first video to the target dubbing video (such as dubbing video 2) corresponding to the target video identifier. The embodiment of the application does not limit the display positions of the video identifiers of the plurality of dubbing videos in the first playing interface.
It should be appreciated that the implementation of triggering the display of the target video identifier from the first playing interface is not limited to the exemplary process shown in fig. 11. As shown in fig. 12, a selection entry 1201 may be displayed in the first playing interface; when the selection entry 1201 is triggered, indicating that the user needs to switch from the first video to another dubbing video corresponding to the second video, a video selection window 1202 is output. The video selection window 1202 displays the video identifier of each of the multiple dubbing videos corresponding to the second video, and the user may then perform an identifier selection operation in the video selection window 1202 to switch from the first video to the target dubbing video corresponding to the selected target video identifier.
Based on the two exemplary implementations of selecting a dubbing video shown in fig. 11 and fig. 12, it should be further noted that: (1) Similar to the foregoing description, after the jump from the first video to the target dubbing video, the target dubbing video may be played from the beginning, or played from the second video frame corresponding to the first video frame that was displayed in the first video when the target video identifier was selected; the embodiments of the present application are not limited in this regard. (2) The plurality of dubbing videos corresponding to the second video may all be obtained by dubbing the reference video segment in the second video, that is, they are dubbing videos of the reference video segment. Alternatively, the plurality of dubbing videos may be obtained by dubbing video segments of the second video other than the reference video segment, that is, they are not dubbing videos of the reference video segment but dubbing videos of other segments of the second video. Of course, the plurality of dubbing videos corresponding to the second video may also include both dubbing videos of the reference video segment and dubbing videos of other video segments of the second video; this is not limited herein.
In summary, on the one hand, in the embodiments of the present application, m target video frames are selected from the second video such that each target video frame matches one first video frame in the first video, ensuring similarity between the target video frames and the first video frames, and the playing times of the m target video frames in the second video are continuous, ensuring that the reference video segment formed by the m target video frames remains continuous rather than skipping frames. This ensures that the reference video segment extracted from the second video is as consistent with the first video as possible, that is, the reference video segment is the same video segment in the second video as the first video. It thus avoids the situation in the subsequent dubbing comparison where the comparison result is invalid because the reference video segment is not the same segment of the second video as the first video, thereby ensuring the validity and accuracy of the dubbing comparison. On the other hand, the embodiments of the present application also support quickly switching from the first video to the second video or to another dubbing video corresponding to the second video, which effectively shortens the steps of exiting the first video and triggering the display of the second video or another dubbing video, simplifies the video switching operation, makes watching videos more interesting for users, and improves the user's viewing experience.
The foregoing describes the methods of the embodiments of the present application in detail. To facilitate better implementation of the above solutions, the apparatus of the embodiments of the present application is provided below accordingly.
Fig. 13 is a schematic structural view of a video processing apparatus according to an exemplary embodiment of the present application; the video processing device may be used to perform some or all of the steps in the method embodiments shown in fig. 3 or fig. 7. Referring to fig. 13, the video processing apparatus includes the following units:
an acquiring unit 1301 configured to acquire a first video and a second video, where the first video is a video clip in the second video; the first video comprises m first video frames, and the second video comprises n second video frames; n and m are positive integers, and n is more than or equal to m;
a processing unit 1302, configured to screen out m target video frames from the n second video frames; one target video frame is matched with one first video frame, and the playing time of m target video frames in the second video is continuous;
the processing unit 1302 is further configured to compose m target video frames into a reference video segment according to the sequence of the playing time, and perform dubbing comparison on the first video and the reference video segment to obtain a comparison result; the comparison result is used for indicating whether the first video is obtained by dubbing the reference video segment.
In one implementation, the processing unit 1302 is configured to, when screening m target video frames from the n second video frames, specifically:
screening candidate video frames for each first video frame from the n second video frames to obtain m video groups corresponding to the m first video frames; each video group comprises K candidate video frames, and the K candidate video frames contained in a video group are matched with the first video frame corresponding to that video group; K is an integer greater than 1;
and determining a target video frame for each first video frame from m video groups based on the time interval information of two candidate video frames in the second video in two adjacent video groups in the m video groups and the similarity between each first video frame and each candidate video frame in the corresponding video group, wherein the two candidate video frames belong to different video groups in the two adjacent video groups.
In one implementation manner, when screening candidate video frames from the n second video frames for each first video frame to obtain the m video groups corresponding to the m first video frames, the processing unit 1302 is specifically configured to:
performing feature extraction processing on each first video frame in the m first video frames to obtain an image representation of each first video frame; performing feature extraction processing on each second video frame in the n second video frames to obtain image representation of each second video frame; the image characterization is used for representing semantic information of the video frame;
performing a similarity operation on the image representation of each first video frame and the image representation of each second video frame to obtain n similarities corresponding to each first video frame; each similarity is used for indicating the degree of similarity between the first video frame and the corresponding second video frame;
taking, for each first video frame, the K second video frames whose similarities among the n similarities corresponding to that first video frame are greater than a similarity threshold as the K candidate video frames matched with that first video frame; the K candidate video frames corresponding to each first video frame form one video group.
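Purely as an illustration of the screening step just described (not part of the patent disclosure), a minimal Python sketch might look as follows; the use of cosine similarity, the threshold value, and the function and parameter names are assumptions introduced here for clarity.

```python
import numpy as np

def screen_candidates(first_feats, second_feats, K, sim_threshold=0.8):
    """For each first video frame, pick up to K second video frames whose image
    representations are most similar (cosine similarity above a threshold).

    first_feats:  (m, d) array of image representations of the m first video frames
    second_feats: (n, d) array of image representations of the n second video frames
    Returns a list of m "video groups"; each group is a list of (frame_index, similarity).
    """
    # L2-normalise so that the dot product equals cosine similarity
    a = first_feats / np.linalg.norm(first_feats, axis=1, keepdims=True)
    b = second_feats / np.linalg.norm(second_feats, axis=1, keepdims=True)
    sims = a @ b.T                      # (m, n): n similarities per first video frame

    groups = []
    for i in range(sims.shape[0]):
        # keep only second frames whose similarity exceeds the threshold, best first
        candidates = [(int(j), float(sims[i, j])) for j in np.argsort(-sims[i])
                      if sims[i, j] > sim_threshold][:K]
        groups.append(candidates)       # the video group of the i-th first video frame
    return groups
```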
In one implementation, any one of the first video frames is denoted as the (i+1)th video frame, and the video group corresponding to the (i+1)th video frame is denoted as the (i+1)th video group, i = 0, 1, 2, …, m-1; when determining, for each first video frame, a target video frame from the m video groups based on the time interval information, in the second video, of two candidate video frames in two adjacent video groups among the m video groups and the similarity between each first video frame and each candidate video frame in the corresponding video group, the processing unit 1302 is specifically configured to:
when i=0, the similarity between K candidate video frames in the first video group in the m video groups and the first video frame in the first video is used as K matching scores of the first video frame; a matching score corresponds to a candidate video frame;
When 0 < i ≤ m-1, calculating K matching scores of the (i+1)th video frame based on the K matching scores of the i-th video frame, the time interval information, in the second video, of each candidate video frame in the i-th video group and each candidate video frame in the (i+1)th video group, and the similarity between each candidate video frame in the (i+1)th video group and the (i+1)th video frame; the matching score of the (i+1)th video frame is used to indicate: on the basis that the i video frames before the (i+1)th video frame have determined their target video frames, the degree of continuity in the time dimension between the candidate video frames in the i-th video group and the candidate video frames in the (i+1)th video group, and the degree of matching in the image dimension between the (i+1)th video frame and the candidate video frames in the (i+1)th video group;
a target video frame is determined for each first video frame from the m video groups based on the K matching scores for each first video frame of the m first video frames.
In one implementation, when calculating the K matching scores of the (i+1)th video frame based on the K matching scores of the i-th video frame, the time interval information, in the second video, of each candidate video frame in the i-th video group and each candidate video frame in the (i+1)th video group, and the similarity between each candidate video frame in the (i+1)th video group and the (i+1)th video frame, the processing unit 1302 is specifically configured to:
Calculating K matching scores of each candidate video frame in the i+1 video group corresponding to the i+1 video frame based on K matching scores of the i video frame, time interval information of each candidate video frame in the i video group and each candidate video frame in the i+1 video group in the second video, and similarity between each candidate video frame in the i+1 video group and the i+1 video frame;
selecting the matching score with the largest value from K matching scores of each candidate video frame corresponding to the i+1th video frame in the i+1th video group;
and taking the K selected matching scores as the K matching scores of the (i+1)th video frame.
In one implementation, any candidate video frame in the (i+1)th video group corresponding to the (i+1)th video frame is denoted as the j-th candidate video frame, j = 1, 2, …, K; any candidate video frame in the i-th video group is denoted as the k-th candidate video frame, k = 1, 2, …, K; when calculating, based on the K matching scores of the i-th video frame, the time interval information, in the second video, of each candidate video frame in the i-th video group and each candidate video frame in the (i+1)th video group, and the similarity between each candidate video frame in the (i+1)th video group and the (i+1)th video frame, the K matching scores of each candidate video frame in the (i+1)th video group corresponding to the (i+1)th video frame, the processing unit 1302 is specifically configured to:
acquiring the image representation of the j-th candidate video frame and the image representation of the k-th candidate video frame, and calculating, based on the image representation of the j-th candidate video frame and the image representation of the k-th candidate video frame, the time interval information Δt(k, j) of the k-th candidate video frame and the j-th candidate video frame in the second video;
obtaining the similarity s(i+1, j) between the (i+1)th video frame and the j-th candidate video frame, and performing a weighted operation on the similarity s(i+1, j) based on the time interval information Δt(k, j) to obtain a weighted operation result; wherein the larger the time interval information Δt(k, j), the smaller the weight value of the similarity s(i+1, j);
obtaining the matching score f(i, k) of the k-th candidate video frame corresponding to the i-th video frame, and summing the matching score f(i, k) and the weighted operation result to obtain the matching score f(i+1, j | k) of the j-th candidate video frame corresponding to the (i+1)th video frame when the i-th video frame matches the k-th candidate video frame.
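As a rough illustration of this score update (the exact formula is not fixed by the text above), the per-candidate computation could be sketched as below; the exponential decay used to turn the time interval into a weight is an assumed choice that merely satisfies the stated requirement that a larger interval yields a smaller weight, and all names are hypothetical.

```python
import math

def step_score(prev_score_k, delta_t_kj, sim_j, decay=0.1):
    """Matching score f(i+1, j | k): score of the j-th candidate for the (i+1)-th
    first video frame, given that the i-th first video frame matched candidate k.

    prev_score_k: f(i, k), matching score of the k-th candidate for the i-th frame
    delta_t_kj:   time interval between candidate k and candidate j in the second video
    sim_j:        s(i+1, j), similarity between the (i+1)-th first frame and candidate j
    """
    weight = math.exp(-decay * abs(delta_t_kj))   # larger interval -> smaller weight (assumption)
    return prev_score_k + weight * sim_j          # f(i+1, j | k)
```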
In one implementation, when determining, for each first video frame, a target video frame from the m video groups based on the K matching scores of each first video frame in the m first video frames, the processing unit 1302 is specifically configured to:
when i=m-1, determining the maximum matching score from K matching scores of the mth video frame in the first video, and taking a candidate video frame corresponding to the maximum matching score in the mth video group as a target video frame of the mth video frame;
When 0 ≤ i < m-1, determining, from the i-th video group, a candidate video frame that enables the (i+1)th video frame to obtain the maximum matching score, and taking that candidate video frame as the target video frame of the (i+1)th video frame.
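Chaining the initialisation, the recurrence and this backtracking step together, a self-contained sketch of the whole target-frame selection might look as follows; it assumes non-empty candidate groups such as those produced by the `screen_candidates` sketch above, and the decay weighting remains an illustrative assumption rather than the patented formula.

```python
import math

def select_target_frames(groups, decay=0.1):
    """Pick one target second-video frame per first video frame.

    groups: groups[i] is the video group of the i-th first frame -- a non-empty list
            of (second_frame_index, similarity_to_that_first_frame) pairs.
    Returns the list of m chosen second-video frame indices.
    """
    m = len(groups)
    # scores[i][j]: best matching score of candidate j for the i-th first frame
    # back[i][j]:   which candidate of group i-1 produced that best score
    scores = [[sim for _, sim in groups[0]]]
    back = [[None] * len(groups[0])]

    for i in range(1, m):
        row_scores, row_back = [], []
        for idx_j, sim_j in groups[i]:
            best, best_k = float("-inf"), None
            for k, (idx_k, _) in enumerate(groups[i - 1]):
                # larger time interval -> smaller weight on the similarity (assumed decay)
                weight = math.exp(-decay * abs(idx_j - idx_k))
                cand = scores[i - 1][k] + weight * sim_j
                if cand > best:
                    best, best_k = cand, k
            row_scores.append(best)
            row_back.append(best_k)
        scores.append(row_scores)
        back.append(row_back)

    # backtracking: best candidate of the last first frame, then follow the back pointers
    j = max(range(len(groups[-1])), key=lambda c: scores[-1][c])
    chosen = [groups[-1][j][0]]
    for i in range(m - 1, 0, -1):
        j = back[i][j]
        chosen.append(groups[i - 1][j][0])
    chosen.reverse()
    return chosen
```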
In one implementation, the dubbing comparison includes text comparison; when performing dubbing comparison on the first video and the reference video segment to obtain a comparison result, the processing unit 1302 is specifically configured to:
performing speech recognition processing on the first video to obtain text information of the first video; performing speech recognition processing on the reference video segment to obtain text information of the reference video segment;
and comparing the text information of the first video with the text information of the reference video fragment to obtain a comparison result.
In one implementation, the text information includes one or more characters; the processing unit 1302 is configured to perform text comparison on the text information of the first video and the text information of the reference video segment, and when obtaining a comparison result, is specifically configured to:
performing multiple editing operations on the one or more characters contained in the text information of the first video to obtain new edited text information, where the edited new text information is the same as the text information of the reference video segment;
Counting the number of editing operations to obtain an editing number result, and generating a comparison result based on the editing number result;
when the editing number result indicates that the number of editing operations is greater than an operation threshold, the comparison result indicates that the first video is obtained by dubbing the reference video segment; when the editing number result indicates that the number of editing operations is less than or equal to the operation threshold, the comparison result indicates that the first video is not obtained by dubbing the reference video segment.
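Assuming the speech recognition step has already produced the two text strings, a minimal sketch of the editing-operation count (the classic Levenshtein distance) and of the threshold decision described above could read as follows; the threshold value is an assumption for illustration only.

```python
def edit_count(text_a: str, text_b: str) -> int:
    """Minimum number of insert/delete/replace operations turning text_a into text_b."""
    dp = list(range(len(text_b) + 1))
    for i, ca in enumerate(text_a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(text_b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # delete ca
                                     dp[j - 1] + 1,      # insert cb
                                     prev + (ca != cb))  # replace (or keep) ca
    return dp[-1]

def is_redubbed(text_first: str, text_reference: str, op_threshold: int = 10) -> bool:
    """True means the texts differ enough that the first video is treated as re-dubbed
    (the threshold value is an illustrative assumption)."""
    return edit_count(text_first, text_reference) > op_threshold
```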
In one implementation, the dubbing comparison includes audio comparison; when performing dubbing comparison on the first video and the reference video segment to obtain a comparison result, the processing unit 1302 is specifically configured to:
performing speech recognition processing on the first video to obtain audio information of the first video; performing speech recognition processing on the reference video clips to obtain audio information of the reference video clips;
performing audio comparison on the audio information of the first video and the audio information of the reference video clip to obtain a comparison result;
when the audio information of the first video is different from the audio information of the reference video segment, the comparison result indicates that the first video is obtained by dubbing the reference video segment; when the audio information of the first video is the same as the audio information of the reference video segment, the comparison result indicates that the first video is not obtained by dubbing the reference video segment.
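Assuming the audio tracks of the first video and the reference video segment have already been extracted as sample arrays, one crude, purely illustrative way to realise the audio comparison is to compare coarse spectral fingerprints; the windowing, the fingerprint, and the tolerance below are assumptions, not the patented method.

```python
import numpy as np

def spectral_fingerprint(samples: np.ndarray, win: int = 4096) -> np.ndarray:
    """Dominant frequency bin per fixed-size window -- a very coarse audio summary."""
    n_win = len(samples) // win
    frames = samples[: n_win * win].reshape(n_win, win)
    spectra = np.abs(np.fft.rfft(frames, axis=1))
    return spectra.argmax(axis=1)

def audio_differs(samples_first, samples_ref, tolerance=0.2) -> bool:
    """True when the two tracks disagree on more than `tolerance` of the compared windows,
    which the comparison result would interpret as the first video being re-dubbed."""
    fp1, fp2 = spectral_fingerprint(samples_first), spectral_fingerprint(samples_ref)
    n = min(len(fp1), len(fp2))
    if n == 0:
        return True
    return float(np.mean(fp1[:n] != fp2[:n])) > tolerance
```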
In one implementation manner, the m first video frames are obtained by performing frame extraction processing on M video frames contained in the first video, where M is an integer greater than m; the n second video frames are obtained by performing frame extraction processing on N video frames contained in the second video, where N is an integer greater than n.
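The frame extraction mentioned here can be realised, for instance, by uniform sampling of frame indices; the sketch below is an illustrative assumption, since the text does not fix a particular sampling strategy.

```python
import numpy as np

def sample_frame_indices(total_frames: int, num_samples: int) -> list[int]:
    """Evenly spaced indices selecting num_samples frames out of total_frames
    (num_samples <= total_frames)."""
    return np.linspace(0, total_frames - 1, num_samples).round().astype(int).tolist()

# e.g. picking m = 8 first video frames out of M = 240 frames of the first video
print(sample_frame_indices(240, 8))   # [0, 34, 68, 102, 137, 171, 205, 239]
```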
In one implementation, if the comparison result indicates that the first video is obtained by dubbing the reference video segment, the processing unit 1302 is further configured to:
displaying a first playing interface of a first video;
and displaying dubbing prompt information in the first playing interface, wherein the dubbing prompt information is used for prompting that the first video is obtained by re-dubbing.
In one implementation, if the comparison result indicates that the first video is obtained by dubbing the reference video segment, the processing unit 1302 is further configured to:
displaying a first playing interface of the first video, wherein a jump entrance is displayed in the first playing interface;
and playing the second video in response to the trigger operation for the jump entrance.
In one implementation, the processing unit 1302 is configured to, when playing the second video, specifically:
starting playing from a first second video frame in the second video;
or starting playing from a target second video frame in the second video; the target second video frame is a second video frame corresponding to the first video frame displayed in the first playing interface when the jump entrance in the first playing interface is triggered.
In one implementation manner, if the comparison result indicates that the first video is obtained by dubbing the reference video segment, and the second video to which the reference video segment belongs corresponds to a plurality of dubbing videos, where the plurality of dubbing videos include the first video, the processing unit 1302 is further configured to:
displaying a first playing interface of a first video;
responding to the dubbing selection requirement in the first playing interface, and outputting the video identification of each dubbing video in the plurality of dubbing videos;
selecting a target video identifier from the video identifiers of the plurality of dubbing videos according to the identifier selection operation;
and playing the target dubbing video corresponding to the selected target video identifier.
According to one embodiment of the present application, the units in the video processing apparatus shown in fig. 13 may be separately or completely combined into one or several additional units, or one or more of these units may be further split into a plurality of units with smaller functions, which can achieve the same operations without affecting the technical effects of the embodiments of the present application. The above units are divided based on logical functions; in practical applications, the function of one unit may be implemented by a plurality of units, or the functions of a plurality of units may be implemented by one unit. In other embodiments of the present application, the video processing apparatus may also include other units, and in practical applications these functions may also be implemented with the assistance of other units and through the cooperation of multiple units. According to another embodiment of the present application, the video processing apparatus shown in fig. 13 may be constructed by running a computer program (including program code) capable of executing the steps involved in the methods shown in fig. 3 and fig. 7 on a general-purpose computing device, such as a computer including processing elements such as a central processing unit (CPU), a random access memory (RAM) and a read-only memory (ROM) as well as storage elements, thereby implementing the video processing method of the embodiments of the present application. The computer program may be recorded on, for example, a computer-readable recording medium, and loaded into and run in the above-described computing device through the computer-readable recording medium.
In this embodiment of the present application, m target video frames are selected from the second video, so that not only each target video frame is matched with one first video frame in the first video to ensure similarity between the target video frame and the first video frame, but also playing time of m target video frames in the second video is continuous, so as to ensure that video frames of a reference video segment formed by m target video frames remain continuous instead of frame skipping. In this way, each frame of picture of the reference video segment extracted from the second video can be ensured to be consistent with each frame of picture of the first video as far as possible, that is, the reference video segment can be considered to be the same video segment extracted from the second video as the first video, so that the situation that the comparison result of the dubbing comparison is invalid because the reference video segment is not the same video segment as the first video in the second video when the dubbing comparison is carried out later can be avoided, namely, the effectiveness and the accuracy of the dubbing comparison are ensured.
Fig. 14 is a schematic structural diagram of a computer device according to an exemplary embodiment of the present application. Referring to fig. 14, the computer device includes a processor 1401, a communication interface 1402, and a computer readable storage medium 1403. Wherein the processor 1401, the communication interface 1402, and the computer-readable storage medium 1403 may be connected by a bus or other means. Wherein the communication interface 1402 is used for receiving and transmitting data. The computer readable storage medium 1403 may be stored in a memory of a computer device, the computer readable storage medium 1403 for storing a computer program, and the processor 1401 for executing the computer program stored by the computer readable storage medium 1403. The processor 1401 (or CPU (Central Processing Unit, central processing unit)) is a computing core as well as a control core of a computer device, which is adapted to implement one or more computer programs, in particular to load and execute one or more computer programs for implementing the respective method flows or the respective functions.
The embodiments of the present application also provide a computer-readable storage medium (Memory), which is a Memory device in a computer device, for storing programs and data. It is understood that the computer readable storage medium herein may include both built-in storage media in a computer device and extended storage media supported by the computer device. The computer readable storage medium provides storage space that stores a processing system of a computer device. Also stored in this memory space are one or more computer programs adapted to be loaded and executed by the processor 1401. Note that the computer readable storage medium can be either a high-speed RAM memory or a non-volatile memory (non-volatile memory), such as at least one magnetic disk memory; alternatively, it may be at least one computer-readable storage medium located remotely from the aforementioned processor.
In one embodiment, the computer device may be the terminal or the server mentioned in the foregoing embodiments; the computer readable storage medium has one or more computer programs stored therein; the one or more computer programs stored in the computer readable storage medium are loaded and executed by the processor 1401 to implement the corresponding steps in the above-described embodiments of the video processing method; in a specific implementation, the one or more computer programs in the computer-readable storage medium are loaded by the processor 1401 and perform the corresponding steps of the embodiments of the present application; for these steps, reference may be made to the related descriptions of the foregoing embodiments, which are not repeated herein.
Based on the same inventive concept, the principle and beneficial effects of the computer device for solving the problems provided in the embodiments of the present application are similar to those of the video processing method in the embodiments of the present application, and may refer to the principle and beneficial effects of implementation of the method, which are not described herein for brevity.
The embodiment of the application also provides a computer program product, which comprises a computer program, and the computer program realizes the video processing method when being executed by a processor.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the above embodiments, it may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product comprises a computer program(s). The computer program performs the processes or functions described in the embodiments of the present application when the computer program is loaded and executed on a computer device. The computer device may be a general purpose computer, a special purpose computer, a computer network, or other programmable device. The computer program may be stored in or transmitted across a computer readable storage medium. The computer program may be transmitted from one website, computer device, server, or data center to another website, computer device, server, or data center by a wired (e.g., coaxial cable, fiber optic, digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.). Computer readable storage media can be any available media that can be accessed by a computer device or a data storage device such as a server, data center, or the like, that contains an integration of one or more available media. The usable medium may be a magnetic medium (e.g., a floppy Disk, a hard Disk, a magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a Solid State Disk (SSD)), or the like.
The foregoing is merely specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily think about changes or substitutions within the technical scope of the present application, and the changes and substitutions are intended to be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (18)

1. A video processing method, comprising:
acquiring a first video and a second video, wherein the first video is a video clip in the second video; the first video comprises m first video frames, and the second video comprises n second video frames; n and m are positive integers, and n is more than or equal to m;
screening m target video frames from the n second video frames; one target video frame is matched with one first video frame, and the playing time of the m target video frames in the second video is continuous;
forming a reference video segment from the m target video frames according to the sequence of playing time, and carrying out dubbing comparison on the first video and the reference video segment to obtain a comparison result; the comparison result is used for indicating whether the first video is obtained by dubbing the reference video segment.
2. The method of claim 1, wherein the screening m target video frames from the n second video frames comprises:
screening candidate video frames for each first video frame from the n second video frames to obtain m video groups corresponding to the m first video frames; each video group comprises K candidate video frames, and the K candidate video frames contained in a video group are matched with the first video frame corresponding to that video group; K is an integer greater than 1;
determining a target video frame for each first video frame from the m video groups based on time interval information of two candidate video frames in two adjacent video groups in the m video groups in the second video and similarity between each first video frame and each candidate video frame in the corresponding video group; the two candidate video frames belong to different ones of the two adjacent video groups.
3. The method of claim 2, wherein said screening candidate video frames from said n second video frames for each first video frame to obtain m video groups corresponding to said m first video frames comprises:
performing feature extraction processing on each first video frame in the m first video frames to obtain image characterization of each first video frame; performing feature extraction processing on each second video frame in the n second video frames to obtain image representation of each second video frame; the image representation is used for representing semantic information of the video frame;
performing a similarity operation on the image representation of each first video frame and the image representation of each second video frame to obtain n similarities corresponding to each first video frame; each similarity is used for indicating the degree of similarity between the first video frame and the corresponding second video frame;
taking, for each first video frame, the K second video frames whose similarities among the n similarities corresponding to that first video frame are greater than a similarity threshold as the K candidate video frames matched with that first video frame; and the K candidate video frames corresponding to each first video frame form one video group.
4. The method of claim 2, wherein any one of the first video frames is represented as an i+1th video frame, and a video group corresponding to the i+1th video frame is represented as an i+1th video group, i=0, 1,2 …, m-1; the determining, for each first video frame from the m video groups, a target video frame based on time interval information of two candidate video frames in the second video in two adjacent video groups in the m video groups and similarity between each first video frame and each candidate video frame in the corresponding video group, including:
When i=0, respectively using the similarity between K candidate video frames in the first video group in the m video groups and the first video frame in the first video as K matching scores of the first video frame; a matching score corresponds to a candidate video frame;
when 0 < i ≤ m-1, calculating K matching scores of the (i+1)th video frame based on the K matching scores of the i-th video frame, the time interval information, in the second video, of each candidate video frame in the i-th video group and each candidate video frame in the (i+1)th video group, and the similarity between each candidate video frame in the (i+1)th video group and the (i+1)th video frame; the matching score of the (i+1)th video frame is used to indicate: on the basis that the i video frames before the (i+1)th video frame have determined their target video frames, the degree of continuity in the time dimension between the candidate video frames in the i-th video group and the candidate video frames in the (i+1)th video group, and the degree of matching in the image dimension between the (i+1)th video frame and the candidate video frames in the (i+1)th video group;
a target video frame is determined for each of the m first video frames from the m video groups based on K matching scores for each of the m first video frames.
5. The method of claim 4, wherein calculating the K matching scores for the i+1th video frame based on the K matching scores for the i video frame, time interval information for each candidate video frame in the i video group and each candidate video frame in the i+1th video group in the second video, and the similarity between each candidate video frame in the i+1th video group and the i+1th video frame, comprises:
calculating K matching scores of each candidate video frame in the i+1th video group corresponding to the i+1th video frame based on K matching scores of the i video frame, time interval information of each candidate video frame in the i video group and each candidate video frame in the i+1th video group in a second video, and similarity between each candidate video frame in the i+1th video group and the i+1th video frame;
selecting the matching score with the largest value from K matching scores of each candidate video frame in the i+1th video group corresponding to the i+1th video frame;
and taking the K selected matching scores as K matching scores of the (i+1) th video frame.
6. The method of claim 5, wherein any candidate video frame in the (i+1)th video group corresponding to the (i+1)th video frame is denoted as the j-th candidate video frame, j = 1, 2, …, K; any candidate video frame in the i-th video group is denoted as the k-th candidate video frame, k = 1, 2, …, K; the calculating, based on the K matching scores of the i-th video frame, the time interval information, in the second video, of each candidate video frame in the i-th video group and each candidate video frame in the (i+1)th video group, and the similarity between each candidate video frame in the (i+1)th video group and the (i+1)th video frame, the K matching scores of each candidate video frame in the (i+1)th video group corresponding to the (i+1)th video frame comprises:
acquiring an image representation of the j-th candidate video frame and an image representation of the k-th candidate video frame, and calculating, based on the image representation of the j-th candidate video frame and the image representation of the k-th candidate video frame, the time interval information Δt(k, j) of the k-th candidate video frame and the j-th candidate video frame in the second video;
obtaining the similarity s(i+1, j) between the (i+1)th video frame and the j-th candidate video frame, and performing a weighted operation on the similarity s(i+1, j) based on the time interval information Δt(k, j) to obtain a weighted operation result; wherein the larger the time interval information Δt(k, j), the smaller the weight value of the similarity s(i+1, j);
obtaining the matching score f(i, k) of the k-th candidate video frame corresponding to the i-th video frame, and summing the matching score f(i, k) and the weighted operation result to obtain the matching score f(i+1, j | k) of the j-th candidate video frame corresponding to the (i+1)th video frame when the i-th video frame matches the k-th candidate video frame.
7. The method of claim 4, wherein the determining a target video frame for each of the m first video frames from the m video groups based on K matching scores for the each of the m first video frames comprises:
When i=m-1, determining the maximum matching score from K matching scores of the mth video frame in the first video, and taking a candidate video frame corresponding to the maximum matching score in the mth video group as a target video frame of the mth video frame;
and when 0 ≤ i < m-1, determining, from the i-th video group, a candidate video frame capable of enabling the (i+1)th video frame to obtain the maximum matching score, and taking the candidate video frame capable of enabling the (i+1)th video frame to obtain the maximum matching score as the target video frame of the (i+1)th video frame.
8. The method of claim 1, wherein the dubbing comparison comprises text comparison; and the performing dubbing comparison on the first video and the reference video segment to obtain a comparison result comprises:
performing speech recognition processing on the first video to obtain text information of the first video; performing speech recognition processing on the reference video segment to obtain text information of the reference video segment;
and comparing the text information of the first video with the text information of the reference video fragment to obtain a comparison result.
9. The method of claim 8, wherein the text information comprises one or more characters; the text comparison is performed on the text information of the first video and the text information of the reference video segment to obtain a comparison result, which comprises the following steps:
performing multiple editing operations on the one or more characters contained in the text information of the first video to obtain new edited text information, wherein the edited new text information is the same as the text information of the reference video segment;
counting the times of the editing operation to obtain an editing times result, and generating a comparison result based on the editing times result;
when the editing frequency result indicates that the frequency of editing operation is greater than an operation threshold, the comparison result indicates that the first video is obtained by dubbing the reference video segment; when the editing frequency result indicates that the frequency of editing operation is smaller than or equal to an operation threshold value, the comparison result indicates that the first video is not obtained by dubbing the reference video segment.
10. The method of claim 1, wherein the dubbing comparison comprises audio comparison; and the performing dubbing comparison on the first video and the reference video segment to obtain a comparison result comprises:
performing speech recognition processing on the first video to obtain audio information of the first video; performing speech recognition processing on the reference video segment to obtain audio information of the reference video segment;
Performing audio comparison on the audio information of the first video and the audio information of the reference video clip to obtain a comparison result;
when the audio information of the first video is different from the audio information of the reference video segment, the comparison result indicates that the first video is obtained by dubbing the reference video segment; when the audio information of the first video is the same as the audio information of the reference video segment, the comparison result indicates that the first video is not obtained by dubbing the reference video segment.
11. The method according to any one of claims 1 to 10, wherein the m first video frames are obtained by performing frame extraction processing on M video frames included in the first video, where M is an integer greater than m; the n second video frames are obtained by performing frame extraction processing on N video frames included in the second video, where N is an integer greater than n.
12. The method of claim 1, wherein if the comparison indicates that the first video is dubbed from the reference video segment, the method further comprises:
displaying a first playing interface of the first video;
And displaying dubbing prompt information in the first playing interface, wherein the dubbing prompt information is used for prompting that the first video is obtained by re-dubbing.
13. The method of claim 1, wherein if the comparison indicates that the first video is dubbed from the reference video segment, the method further comprises:
displaying a first playing interface of the first video, wherein a jump entrance is displayed in the first playing interface;
and responding to the trigger operation for the jump entrance, and playing the second video.
14. The method of claim 13, wherein the playing the second video comprises:
starting playing from a first second video frame in the second video;
or starting playing from a target second video frame in the second video; and the target second video frame is a second video frame corresponding to the first video frame displayed in the first playing interface when the jump entrance in the first playing interface is triggered.
15. The method of claim 1, wherein if the comparison result indicates that the first video is obtained by dubbing the reference video segment, and the second video to which the reference video segment belongs corresponds to a plurality of dubbed videos, where the plurality of dubbing videos includes the first video, the method further includes:
Displaying a first playing interface of the first video;
responding to the dubbing selection requirement in the first playing interface, and outputting the video identification of each dubbing video in the plurality of dubbing videos;
selecting a target video identifier from the video identifiers of the plurality of dubbing videos according to an identifier selection operation;
and playing the target dubbing video corresponding to the selected target video identifier.
16. A video processing apparatus, comprising:
an acquisition unit configured to acquire a first video and a second video, the first video being a video clip in the second video; the first video comprises m first video frames, and the second video comprises n second video frames; n and m are positive integers, and n is more than or equal to m;
the processing unit is used for screening m target video frames from the n second video frames; one target video frame is matched with one first video frame, and the playing time of the m target video frames in the second video is continuous;
the processing unit is further configured to compose the m target video frames into a reference video segment according to the sequence of playing time, and perform dubbing comparison on the first video and the reference video segment to obtain a comparison result; the comparison result is used for indicating whether the first video is obtained by dubbing the reference video segment.
17. A computer device, characterized in that,
a processor adapted to execute a computer program;
a computer readable storage medium having stored therein a computer program which, when executed by the processor, implements the video processing method according to any of claims 1-15.
18. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program adapted to be loaded by a processor and to perform the video processing method according to any of claims 1-15.
CN202310823097.5A 2023-07-06 2023-07-06 Video processing method, device, equipment and medium Active CN116567351B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310823097.5A CN116567351B (en) 2023-07-06 2023-07-06 Video processing method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN116567351A true CN116567351A (en) 2023-08-08
CN116567351B CN116567351B (en) 2023-09-12

Family

ID=87495057

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310823097.5A Active CN116567351B (en) 2023-07-06 2023-07-06 Video processing method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN116567351B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117132926A (en) * 2023-10-27 2023-11-28 腾讯科技(深圳)有限公司 Video processing method, related device, equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009150425A2 (en) * 2008-06-10 2009-12-17 Half Minute Media Ltd Automatic detection of repeating video sequences
CN110677718A (en) * 2019-09-27 2020-01-10 腾讯科技(深圳)有限公司 Video identification method and device
CN111949827A (en) * 2020-07-29 2020-11-17 深圳神目信息技术有限公司 Video plagiarism detection method, device, equipment and medium
CN113407781A (en) * 2021-06-18 2021-09-17 湖南快乐阳光互动娱乐传媒有限公司 Video searching method, system, server and client
CN115062186A (en) * 2022-08-05 2022-09-16 北京远鉴信息技术有限公司 Video content retrieval method, device, equipment and storage medium
CN115705705A (en) * 2021-08-12 2023-02-17 腾讯科技(北京)有限公司 Video identification method, device, server and storage medium based on machine learning

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117132926A (en) * 2023-10-27 2023-11-28 腾讯科技(深圳)有限公司 Video processing method, related device, equipment and storage medium
CN117132926B (en) * 2023-10-27 2024-02-09 腾讯科技(深圳)有限公司 Video processing method, related device, equipment and storage medium

Also Published As

Publication number Publication date
CN116567351B (en) 2023-09-12

Similar Documents

Publication Publication Date Title
US20170289619A1 (en) Method for positioning video, terminal apparatus and cloud server
CN113709561B (en) Video editing method, device, equipment and storage medium
CA2771379C (en) Estimating and displaying social interest in time-based media
US20190026367A1 (en) Navigating video scenes using cognitive insights
US20150365725A1 (en) Extract partition segments of personalized video channel
CN110582025A (en) Method and apparatus for processing video
CN112163122A (en) Method and device for determining label of target video, computing equipment and storage medium
CN111708915B (en) Content recommendation method and device, computer equipment and storage medium
KR20140108180A (en) systems and methods for accessing multi-media content
JP7394809B2 (en) Methods, devices, electronic devices, media and computer programs for processing video
CN111372141B (en) Expression image generation method and device and electronic equipment
CN112733654B (en) Method and device for splitting video
CN110557659A (en) Video recommendation method and device, server and storage medium
CN116567351B (en) Video processing method, device, equipment and medium
CN116166827B (en) Training of semantic tag extraction model and semantic tag extraction method and device
CN116166843B (en) Text video cross-modal retrieval method and device based on fine granularity perception
Sundaram Segmentation, structure detection and summarization of multimedia sequences
CN112073757A (en) Emotion fluctuation index acquisition method, emotion fluctuation index display method and multimedia content production method
CN112804580B (en) Video dotting method and device
CN114398517A (en) Video data acquisition method and device
CN114339391A (en) Video data processing method, video data processing device, computer equipment and storage medium
CN113762056A (en) Singing video recognition method, device, equipment and storage medium
Liu et al. Cost-effective Modality Selection for Video Popularity Prediction
CN115359409B (en) Video splitting method and device, computer equipment and storage medium
KR20200071996A (en) Language study method using user terminal and central server

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40091094

Country of ref document: HK