CN113705209A - Subtitle generating method and device, electronic equipment and storage medium - Google Patents

Subtitle generating method and device, electronic equipment and storage medium

Info

Publication number
CN113705209A
CN113705209A (application No. CN202110387022.8A)
Authority
CN
China
Prior art keywords
video
original
frame
target
title
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110387022.8A
Other languages
Chinese (zh)
Inventor
康战辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202110387022.8A priority Critical patent/CN113705209A/en
Publication of CN113705209A publication Critical patent/CN113705209A/en
Pending legal-status Critical Current

Classifications

    • G06F40/258 Heading extraction; Automatic titling; Numbering (under G06F40/00 Handling natural language data; G06F40/20 Natural language analysis)
    • G06F16/735 Filtering based on additional data, e.g. user or group profiles (under G06F16/70 Information retrieval of video data; G06F16/73 Querying)
    • G06F16/75 Clustering; Classification (under G06F16/70 Information retrieval of video data)
    • G06F16/7867 Retrieval characterised by using metadata generated manually, e.g. tags, keywords, comments, title and artist information (under G06F16/78 Retrieval characterised by using metadata)
    • G06F16/9535 Search customisation based on user profiles and personalisation (under G06F16/953 Querying, e.g. by the use of web search engines)
    • G06N20/00 Machine learning (under G06N Computing arrangements based on specific computational models)

Abstract

The application relates to the technical field of multimedia, and in particular to a subtitle generating method and device, an electronic device, and a storage medium. The method comprises: extracting a corresponding reference frame from each video clip contained in a target video; for each video clip, screening out of an original video set, based on the corresponding reference frame, a candidate video set whose similarity with that video clip reaches a similarity threshold; and performing title clustering on the candidate video set corresponding to each video clip to obtain a subtitle for each video clip, from which the subtitles of the target video are then generated.

Description

Subtitle generating method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of multimedia technologies, and in particular, to a method and an apparatus for generating subtitles, an electronic device, and a storage medium.
Background
With the popularization of mobile networks and intelligent terminals, the cost of producing short videos keeps falling, and the short videos uploaded to various new media platforms every day can number in the hundreds of thousands or even millions; a large proportion of them are secondary creations based on original videos. A video collection is a typical short video of this kind, formed by clipping and re-splicing popular or wonderful original videos that share the same theme.
In general, the title of a video collection is a "summary" title, such as "Home-style Dish Recipes Collection", "Various Funny Videos", or "Highlight Moments of Sports Stars". However, a "summary" title can only reflect the type of the video collection and cannot reflect the specific content it contains, so the user cannot quickly and accurately learn which video clips the collection includes from the "summary" title alone. For example, seeing the title "Home-style Dish Recipes Collection", the user can only tell that the video collection belongs to the food category, not which home-style dishes it contains; likewise, seeing the title "Highlight Moments of Sports Stars", the user can only tell that the video collection belongs to the sports category, not which sports stars appear in it.
Fig. 1 shows display interfaces of video collections with manually edited "summary" titles. As shown in (1) of Fig. 1, interface 11 shows information about the video account owner, such as the user name, account category, and account grade; interface 12 is the display interface of a video clip; and interface 13 shows the title of the video collection, such as "Home-style Dish Recipes Collection # 5 home-style dish methods # simple and easy to learn # nutritious and delicious", together with the forwarding count, comment count, playing duration, and so on. From the title displayed on interface 13 it is impossible to know which home-style dish clips the collection actually contains without watching it, and when a user searches the platform for a particular dish such as "chicken bouillon", even if the collection includes a clip showing how to make that dish, the platform's search system cannot learn this from the collection's title, so the collection cannot be pushed to the searching user. As shown in (2) of Fig. 1, interface 21 shows information about the video account owner, interface 22 is the display interface of a video clip, and interface 23 shows the title of the video collection, such as "all kinds of sports programs # have your idol # ao sports # national sports", together with the forwarding count, comment count, playing duration, and so on. From the title presented on interface 23 it is impossible to know which sports or which sports stars the collection includes; when a user searches for "basketball", even if the collection contains highlights of a basketball game, the search system cannot learn this from the title, so the collection cannot be pushed to the searching user.
As can be seen from Fig. 1, for simplicity of operation, a video collection is usually given only a "summary" title that reflects the general type of its video clips. Such a title carries little information: the user has to discover the playing content of each video clip through a series of operations such as dragging, fast forwarding, and watching at double speed, and because the "summary" title cannot reflect the playing content of the corresponding video clips, the new media platform cannot accurately push the target videos a user is interested in when the user searches for them.
Therefore, when browsing a video collection, a user often has to determine whether the content of interest is present by dragging, fast forwarding, and repeatedly searching different video segments, which greatly reduces browsing efficiency, lowers video recommendation accuracy, and easily wastes network traffic.
Disclosure of Invention
The embodiment of the application provides a subtitle generating method and device, electronic equipment and a storage medium, which are used for improving the browsing efficiency and recommendation accuracy of video clips and further saving network traffic.
According to a first aspect of embodiments of the present application, there is provided a subtitle generating method, the method including:
extracting corresponding reference frames from each video clip contained in the target video respectively;
based on each obtained reference frame, screening out of an original video set, for the corresponding video clip, a candidate video set whose similarity with that video clip reaches a similarity threshold;
respectively carrying out title clustering on the candidate video sets corresponding to the video clips to obtain subtitles corresponding to the video clips;
and generating the subtitle of the target video based on the respective corresponding subtitles of the video clips.
According to a second aspect of embodiments of the present application, there is provided a subtitle generating method including:
extracting corresponding reference frames from each video clip contained in the target video respectively;
based on each obtained reference frame, screening out of an original video set, for the corresponding video clip, a candidate video set whose similarity with that video clip reaches a similarity threshold;
and respectively carrying out title clustering on the candidate video sets corresponding to the video clips to obtain the subtitles corresponding to the video clips.
According to a third aspect of embodiments of the present application, there is provided a subtitle generating apparatus including:
the frame extraction module is used for respectively extracting corresponding reference frames from each video clip contained in the target video;
the screening module is used for screening out of an original video set, based on each obtained reference frame and for the corresponding video clip, a candidate video set whose similarity with that video clip reaches a similarity threshold;
a generating module, configured to perform title clustering on the candidate video sets corresponding to the video segments, respectively, to obtain subtitles corresponding to the video segments, and to generate the subtitle of the target video based on the subtitles corresponding to the video clips.
In an optional implementation manner, the screening module is specifically configured to:
for each reference frame, performing the following operations:
performing frame matching between one of the reference frames and each original video contained in the original video set, and determining the number of matching frames corresponding to each original video;
respectively determining the similarity of each original video and the video segment corresponding to one reference frame based on the matching frame number corresponding to each original video, the total frame number of each original video and the total frame number of the video segment corresponding to one reference frame;
and screening out a candidate video set of which the similarity of the video segment corresponding to the reference frame reaches a similarity threshold value from the original video set.
In an optional implementation manner, the screening module is specifically configured to:
extracting a first feature vector of the reference frame and extracting a second feature vector of each original frame contained in each original video respectively based on a preset first operator;
respectively determining a first frame matching degree between the reference frame and each original frame contained in each original video based on the obtained first feature vector and each second feature vector;
and respectively determining the frame number of the original frame of which the first frame matching degree meets a first preset condition as the matching frame number corresponding to the corresponding original video aiming at each original video.
In an optional implementation manner, the screening module is specifically configured to:
performing frequency domain transformation on the reference frame based on the first operator and a first set step length to obtain a first frequency domain value set of the reference frame, and performing frequency domain transformation on each original frame contained in each original video to obtain a second frequency domain value set corresponding to each original frame;
determining a first frequency domain value mean value corresponding to the first frequency domain value set, and respectively determining a second frequency domain value mean value corresponding to each second frequency domain value set;
and determining a first feature vector of the reference frame based on a comparison result of each frequency-domain value in the first frequency-domain value set and the first frequency-domain value mean, and determining a second feature vector corresponding to each original frame based on a comparison result of each frequency-domain value in each second frequency-domain value set and the corresponding second frequency-domain value mean, respectively.
In an optional embodiment, the screening module is further configured to:
extracting a third feature vector of the reference frame and a fourth feature vector of each original frame contained in each original video respectively based on a preset second operator, wherein the second operator is smaller than the first operator;
respectively determining a second frame matching degree between the reference frame and each original frame contained in each original video based on the obtained third feature vector and each fourth feature vector;
and deleting the original frames of which the second frame matching degree does not meet a second preset condition in each original video respectively.
In an optional implementation manner, the screening module is specifically configured to:
if the total frame number of one original video in each original video is less than the total frame number of the video clip corresponding to one reference frame, the similarity of the video clip corresponding to the original video and the reference frame is positively correlated with the matching frame number corresponding to the original video and negatively correlated with the total frame number of the original video;
if the total frame number of one original video in each original video is not less than the total frame number of the video clip corresponding to one reference frame, the similarity of the video clip corresponding to the original video and the reference frame is positively correlated with the matching frame number corresponding to the original video, and is negatively correlated with the total frame number of the video clip corresponding to the reference frame.
In an optional implementation manner, the generating module is specifically configured to:
for each video clip, respectively performing the following operations:
acquiring the title of each candidate video in the corresponding candidate video set aiming at one video clip in each video clip;
performing word segmentation processing on each obtained title respectively to obtain a word segmentation vector set corresponding to each title;
taking the word-vector mean of each word segmentation vector set as the title vector of the corresponding candidate video;
and performing title clustering on title vectors of all candidate videos in the candidate video set corresponding to the video segment to obtain a subtitle corresponding to the video segment.
In an optional implementation manner, the generating module is specifically configured to:
performing title clustering on the title vectors of the candidate videos to obtain at least one candidate title category;
determining a target headline category from the at least one candidate headline category based on the number of headline vectors associated with each of the at least one candidate headline category;
and determining a subtitle corresponding to the video clip based on the playing amount of the candidate video corresponding to each title vector associated with the target title category and the similarity between each title vector and the title vector of the target video.
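The title-clustering behavior described in the last two paragraphs can be sketched as follows. This is a minimal illustration assuming scikit-learn and pre-computed title vectors; KMeans, the number of clusters, the equal weighting of play count and title similarity, and all variable names are assumptions made for illustration, since the embodiment does not prescribe a specific clustering algorithm or scoring scheme.

```python
# Sketch: cluster candidate title vectors, take the largest cluster as the
# target title category, and pick a subtitle by play count and similarity
# to the target video's title vector. Weights and parameters are illustrative.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

def pick_subtitle(title_vectors, titles, play_counts, target_title_vector, n_clusters=3):
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(title_vectors)
    target_label = np.bincount(labels).argmax()          # largest candidate title category
    idx = np.where(labels == target_label)[0]
    sims = cosine_similarity(title_vectors[idx],
                             target_title_vector.reshape(1, -1)).ravel()
    plays = np.array([play_counts[i] for i in idx], dtype=float)
    score = 0.5 * sims + 0.5 * plays / (plays.max() + 1e-9)  # assumed weighting
    return titles[idx[score.argmax()]]

# Example usage with toy vectors (illustrative only):
# vecs = np.random.rand(6, 50); titles = [f"title {i}" for i in range(6)]
# print(pick_subtitle(vecs, titles, play_counts=[10, 5, 80, 3, 7, 2],
#                     target_title_vector=np.random.rand(50)))
```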
In an optional implementation manner, the frame extraction module is specifically configured to:
extracting corresponding reference frames from the video clips contained in the target video according to a set target frame extraction interval, wherein the target frame extraction interval is set according to the playing durations of the video clips contained in the target video; alternatively,
determining the number of target video segments corresponding to the target playing time length based on the target playing time length of the target video and the mapping relation between the preset playing time length and the number of the video segments; and determining a target frame extraction interval based on the target playing time length of the target video and the number of corresponding target video clips, and extracting corresponding reference frames from all the video clips contained in the target video respectively based on the target frame extraction interval, wherein the target frame extraction interval is positively correlated with the target playing time length and negatively correlated with the number of the target video clips corresponding to the target playing time length.
According to a fourth aspect of embodiments of the present application, there is provided a subtitle generating apparatus, the apparatus comprising:
the frame extraction module is used for respectively extracting corresponding reference frames from each video clip contained in the target video;
the screening module is used for screening out of an original video set, based on each obtained reference frame and for the corresponding video clip, a candidate video set whose similarity with that video clip reaches a similarity threshold;
and the generating module is used for respectively carrying out title clustering on the candidate video sets corresponding to the video segments to obtain the subtitles corresponding to the video segments.
According to a fifth aspect of embodiments of the present application, there is provided an electronic device, including a memory and a processor, where the memory stores thereon a computer program operable on the processor, and the computer program, when executed by the processor, causes the processor to implement a subtitle generating method in an embodiment of the present application.
According to a sixth aspect of the embodiments of the present application, there is provided a computer-readable storage medium having stored therein a computer program, which when executed by a processor, implements a subtitle generating method in the embodiments of the present application.
In the embodiment of the application, a corresponding reference frame is extracted from each video clip contained in a target video; then, based on each obtained reference frame, a candidate video set whose similarity with the corresponding video clip reaches a similarity threshold is screened out of an original video set for that video clip; title clustering is performed on the candidate video set corresponding to each video clip to obtain the subtitle of each video clip; and the subtitles of the target video are generated from the subtitles of the video clips. In this way, the subtitle of each video clip can be generated automatically according to the video content of that clip, and the subtitles of the target video are generated in turn, as illustrated by the sketch below.
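To make this flow concrete, here is a deliberately simplified, self-contained sketch. A "frame" is represented by a string, frame matching by equality, and title clustering by a simple majority vote over candidate titles; the video titles and frame labels are invented for illustration, and none of this stands for the actual implementation of the embodiment.

```python
# Toy end-to-end sketch of the flow: extract-and-match frames per clip,
# screen the candidate video set, and derive one subtitle per clip.
from collections import Counter

def clip_similarity(clip_frames, original_frames):
    clip_set = set(clip_frames)
    matching = sum(1 for frame in original_frames if frame in clip_set)
    return matching / min(len(clip_frames), len(original_frames))

def subtitle_for_clip(clip_frames, original_videos, threshold=0.8):
    # Screen the candidate video set for this clip, then reduce "title
    # clustering" to picking the most common candidate title.
    candidate_titles = [title for title, frames in original_videos
                        if clip_similarity(clip_frames, frames) >= threshold]
    return Counter(candidate_titles).most_common(1)[0][0] if candidate_titles else ""

def generate_subtitles(target_clips, original_videos):
    # One subtitle per video clip; together they form the target video's subtitles.
    return [subtitle_for_clip(frames, original_videos) for frames in target_clips]

# Toy data: two clips cut from two "original videos".
originals = [("Braised pork, step by step", ["a1", "a2", "a3", "a4"]),
             ("Kung-fu movie highlight", ["b1", "b2", "b3"])]
clips = [["a2", "a3", "a4"], ["b1", "b2"]]
print(generate_subtitles(clips, originals))
# ['Braised pork, step by step', 'Kung-fu movie highlight']
```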
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is an interface diagram of video highlights in the related art;
FIG. 2 is a schematic diagram of an implementation environment provided by an embodiment of the present application;
fig. 3a is a flowchart of a subtitle generating method according to an embodiment of the present application;
fig. 3b is a flowchart of a method for determining a candidate video set of video segments according to an embodiment of the present application;
fig. 3c is a flowchart of a method for determining a matching frame number of an original video according to an embodiment of the present application;
FIG. 3d is a flowchart of another method for determining a matching frame number of an original video according to an embodiment of the present disclosure;
fig. 3e is a flowchart of a method for determining a subtitle of a video segment according to an embodiment of the present application;
FIG. 3f is a flowchart of a detailed method for determining subtitles of a video segment according to an embodiment of the present application;
FIG. 4a is an interface diagram of a target video provided by an embodiment of the present application;
FIG. 4b is an interface diagram of a reference frame extracted according to a set target frame extraction interval according to an embodiment of the present application;
FIG. 4c is an interface diagram of a reference frame extracted according to a determined target frame extraction interval according to an embodiment of the present application;
fig. 4d is a schematic diagram of frame matching provided in the embodiment of the present application;
fig. 4e is a schematic diagram of a candidate video set corresponding to a video segment provided in the present application;
FIG. 4f is an interface diagram for displaying subtitles of a video clip according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a Word2vec model provided in an embodiment of the present application;
FIG. 6a is a subtitle interface diagram for displaying a target video according to an embodiment of the present application;
FIG. 6b is a schematic diagram of a complete subheading display process provided by an embodiment of the present application;
fig. 7a is a functional block diagram of a subtitle generating apparatus for a target video according to an embodiment of the present application;
fig. 7b is a functional block diagram of a subtitle generating apparatus for a video clip according to an embodiment of the present application;
fig. 8 is a block diagram of an electronic device provided in an embodiment of the present application;
fig. 9 is a hardware configuration diagram of a generation apparatus according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments, but not all embodiments, of the technical solutions of the present application. All other embodiments obtained by a person skilled in the art without any inventive step based on the embodiments described in the present application are within the scope of the protection of the present application.
It is noted that the terms "first", "second", and the like used herein are intended to distinguish between similar elements and not necessarily to describe a particular order or sequence. It is to be understood that data so described are interchangeable under appropriate circumstances, so that the embodiments of the application described herein can also be implemented in sequences other than those described or illustrated herein.
Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Hereinafter, some terms in the embodiments of the present disclosure are explained to facilitate understanding by those skilled in the art.
(1) In the embodiment of the present application, the term "terminal" may include a smart phone, a tablet computer, a wearable device, and the like.
(2) In the embodiment of the present application, the term "video collection" (also rendered as "video highlights") refers to a short video formed through secondary creation, such as clipping and re-splicing, of popular or wonderful original videos with the same theme. Video collections are played on various new media platforms, are suitable for watching on the move or during short leisure time, are pushed at high frequency, and have playing times ranging from a few seconds to a few minutes.
The topics (types) of video clips are diverse and include, but are not limited to, skill sharing, humor fun, fashion trends, social hotspots, street interviews, commonweal education, advertising creatives, and business customization. The same video collection contains video segments with the same theme.
(3) In the embodiment of the present application, the term "image perceptual algorithm" is a generic term for a class of algorithms that includes the average hash algorithm (aHash), the perceptual hash algorithm (pHash), and the difference hash algorithm (dHash). Such an algorithm generates a "fingerprint" string for each picture, and the fingerprint similarity between different pictures can then be compared.
(4) In the embodiment of the present application, the term "Word2vec" is short for Word to Vector. The Word2vec models, proposed by Mikolov et al., are a group of related models for generating word vectors; they are shallow, two-layer neural networks trained to reconstruct the linguistic contexts of words. The network is trained to guess words at adjacent positions from an input word, and under the bag-of-words assumption used in Word2vec the order of the words is unimportant. After training, the Word2vec model maps each word to a word vector that can represent word-to-word relationships (a small illustrative sketch is given after this list of terms).
(5) In the embodiment of the present application, the term "clustering algorithm" refers to a statistical analysis method for studying classification problems of samples or indexes, and is also an important algorithm in data mining.
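As a small illustration of the Word2vec term above, the following sketch assumes the gensim library is available; the tokenized titles and all parameter values are made-up examples, not part of the embodiment.

```python
import numpy as np
from gensim.models import Word2Vec

# Hypothetical tokenized candidate-video titles.
tokenized_titles = [
    ["home", "style", "braised", "pork", "recipe"],
    ["home", "style", "tofu", "recipe"],
    ["basketball", "game", "highlight", "moments"],
]

# Train a shallow Word2vec model that maps each word to a word vector.
model = Word2Vec(sentences=tokenized_titles, vector_size=50, window=3,
                 min_count=1, epochs=50)

# A title vector can then be taken as the mean of its word vectors,
# as in the subtitle generating method of this application.
def title_vector(tokens):
    return np.mean([model.wv[w] for w in tokens if w in model.wv], axis=0)

print(title_vector(["home", "style", "recipe"]).shape)   # (50,)
```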
Embodiments of the present application relate to Artificial Intelligence (AI) and Machine Learning techniques, and are designed based on Speech processing techniques (Speech Technology) and Machine Learning (ML) in the AI.
Artificial intelligence is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making. The artificial intelligence technology mainly comprises a computer vision technology, a voice processing technology, machine learning/deep learning and other directions.
The following briefly introduces the design concept of the embodiments of the present application:
In the embodiment of the application, a corresponding reference frame is extracted from each video clip contained in a target video; then, based on each obtained reference frame, a candidate video set whose similarity with the corresponding video clip reaches a similarity threshold is screened out of an original video set for that video clip; and title clustering is performed on the candidate video set corresponding to each video clip to obtain the subtitle of each video clip. A subtitle can fully present the playing content of the corresponding video clip. As a result, when a user searches for a target video, the server of the new media platform can accurately recommend the video clips the user is interested in based on search terms contained in the subtitles, which improves video recommendation accuracy; and after the target video is pushed to the terminal, the user can accurately click the video clip he or she intends to watch by consulting the subtitles, without a series of operations such as dragging, fast forwarding, and watching at double speed. Browsing efficiency of video clips is thereby effectively improved, and the consumption of network traffic is reduced.
It should be noted that the target video in the embodiment of the present application is a video collection, which includes one or more video clips.
The embodiments of the present application are described below with reference to the drawings of the specification, and it is to be understood that the embodiments described herein are merely for illustrating and explaining the present application and are not intended to limit the present application, and that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
FIG. 2 is a schematic diagram of an implementation environment provided by an embodiment of the present application; referring to fig. 2, the implementation environment includes at least: a terminal 201 and a server 202.
The terminal 201 may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, and the like. Optionally, the terminal 201 and the server 202 are connected directly or indirectly through wired or wireless communication, which is not limited in this application. A user sends a video search request to the new media platform server 202 through the terminal 201, the search request carrying an identifier of a target video the user is interested in. After receiving the search request, the server 202 processes it based on an original video set and returns the target video to the terminal 201; the terminal 201 then displays the target video to the user, and the displayed content includes the subtitle of each video segment contained in the target video.
The terminal 201 generally refers to one of a plurality of terminals, and the embodiment of the present application is illustrated by the terminal 201. Those skilled in the art will appreciate that the number of terminals described above may be greater or less. For example, the number of the terminals is only a few, or the number of the terminals is several tens or hundreds, or more, and the number and the type of the terminals are not limited in the embodiments of the present application.
The server 202 is an independent physical server, or a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server that provides basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, a cloud storage, a web service, cloud communication, a middleware service, a domain name service, a security service, a Content Delivery Network (CDN), a big data and artificial intelligence platform, and the like.
Referring to fig. 3a, in the embodiment of the present application, a specific flow of subtitle generation is as follows:
in step S301, the server extracts corresponding reference frames from each video segment included in the target video.
In a specific implementation, the playing time length of each video segment included in the target video is not less than the set time length and is relatively average, and accordingly, when step S301 is executed, the following two manners may be adopted, but are not limited to:
mode 1: and extracting corresponding reference frames from all the video clips contained in the target video respectively according to a set target frame extraction interval, wherein the target frame extraction interval is set according to the playing time length of all the video clips contained in the target video.
For example, the interface of the target video is shown in Fig. 4a, the title of the target video being "various fun highlights you wish!". The target video contains video clips 1, 2, and 3. The playing time of video clip 1 is 00:37, that of video clip 2 is 00:40, and that of video clip 3 is 00:41, so the proportion of each clip's playing time to the target playing time of the target video is relatively even. The target frame extraction interval is therefore set to 00:37 according to practical experience, so as to ensure that a corresponding reference frame can be extracted from every video clip. Based on the set target frame extraction interval, corresponding reference frames are extracted from video clips 1, 2, and 3 contained in the target video.
When the reference frame is extracted from the target video, starting from the playing time 00:00 seconds of the target video, the reference frame 1 is extracted from the video clip 1 at the target frame extraction interval 00:37 seconds, the extraction result is shown as (1) in fig. 4b, the reference frame 2 is extracted from the video clip 2, the extraction result is shown as (2) in fig. 4b, the reference frame 3 is extracted from the video clip 3, and the extraction result is shown as (3) in fig. 4 b.
It should be noted that, when the target frame extraction interval is set according to the playing durations of the video segments contained in the target video, it may be set according to the actual situation; the target frame extraction interval may be the minimum, maximum, median, or mean of the playing durations of the video segments, so as to ensure that as many video segments as possible are covered when frames are extracted at the set interval. In the above embodiment of the present application, the target frame extraction interval is the minimum of the playing durations of video segments 1, 2, and 3.
Mode 2: determining the number of target video segments corresponding to the target playing time length based on the target playing time length of the target video and the mapping relation between the preset playing time length and the number of the video segments; and determining a target frame extraction interval based on the target playing time length of the target video and the number of the corresponding target video segments, and extracting corresponding reference frames from all the video segments contained in the target video respectively based on the target frame extraction interval, wherein the target frame extraction interval is positively correlated with the target playing time length, and the number of the target video segments corresponding to the target playing time length is negatively correlated.
In a specific implementation, target videos marked with the "collection" identifier can be analyzed in advance: the number of video segments contained in target videos of different playing durations is counted, the average number of video segments per playing duration interval is obtained, and a dictionary is generated. The dictionary format can be "playing duration interval: number of video segments", where the number of video segments can be an average, a median, or the like; the dictionary therefore contains the preset mapping relation between playing duration and number of video segments. Optionally, the target frame extraction interval can be computed with the following formula:
target frame extraction interval = target playing duration / number of target video segments
And respectively extracting corresponding reference frames from the video clips contained in the target video based on the determined target frame extraction interval.
For example, suppose the video collections whose playing durations fall in the interval [01:00, 02:00] are a first, a second, and a third video collection containing 2, 3, and 4 video segments respectively; the average number of video segments for the playing duration interval [01:00, 02:00] is then (2 + 3 + 4) / 3 = 3. The target playing duration is 01:58; querying the generated dictionary shows that it belongs to the [01:00, 02:00] interval, so the number of video segments corresponding to the target video is 3, and the target frame extraction interval is 118 s / 3 ≈ 39 s, i.e. 00:39. Corresponding reference frames are then extracted from video clips 1, 2, and 3 contained in the target video based on the determined target frame extraction interval.
When the reference frame is extracted from the target video, starting from the playing time 00:00 seconds of the target video, the reference frame 1 is extracted from the video clip 1 at the target frame extraction interval 00:39 seconds, the extraction result is shown as (1) in fig. 4c, the reference frame 2 is extracted from the video clip 2, the extraction result is shown as (2) in fig. 4c, the reference frame 3 is extracted from the video clip 3, and the extraction result is shown as (3) in fig. 4 c.
It should be noted that, in the embodiment of the present application, the target frame extraction interval may also be determined by applying coefficients, set according to actual needs, to the target playing duration and the number of target video segments.
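As an illustration of the two ways of choosing the target frame extraction interval described above, here is a minimal sketch. Durations are in seconds, and the duration-interval dictionary is a made-up example of the preset mapping between playing duration and number of video segments, not data from the embodiment.

```python
# Mode 1: pick a statistic of the clip playing durations (the minimum here,
# as in the 37/40/41-second example above where 00:37 is chosen).
def interval_mode_1(clip_durations):
    return min(clip_durations)

# Hypothetical "playing duration interval -> number of video segments" dictionary.
DURATION_TO_SEGMENTS = {(0, 60): 2, (60, 120): 3, (120, 300): 5}

# Mode 2: look up the expected number of segments for the target playing
# duration, then divide the duration by that number.
def interval_mode_2(target_duration):
    for (low, high), n_segments in DURATION_TO_SEGMENTS.items():
        if low <= target_duration < high:
            return target_duration / n_segments
    return target_duration  # fall back: treat the video as a single segment

print(interval_mode_1([37, 40, 41]))   # 37
print(interval_mode_2(118))            # 118 / 3 ≈ 39.3 seconds, i.e. about 00:39
```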
In step S302, the server screens out of the original video set, based on each obtained reference frame and for the corresponding video clip, a candidate video set whose similarity with that video clip reaches the similarity threshold.
In step S302, the database of the server stores a large number of original videos of various types, which can serve as the sources of the video segments in the target video; for convenience of description, these original videos are hereinafter collectively referred to as the original video set. Because each video clip contained in the target video is formed by cutting and re-splicing original videos, the greater the similarity between the reference frame extracted from a video clip and the video frames of an original video, the greater the probability that the video clip comes from that original video. Therefore, for each video clip contained in the target video, a candidate video set whose similarity with the corresponding video clip reaches the similarity threshold is screened out of the original video set based on the extracted reference frame.
In a specific implementation, when step S302 is executed, an operation of filtering out a candidate video set needs to be executed for each reference frame, and the following description will be given by taking only one reference frame (hereinafter referred to as reference frame i) in each reference frame as an example, and refer to fig. 3 b:
step S3021, performing frame matching on the reference frame i in each reference frame with each original video included in the original video set, and determining the number of matching frames corresponding to each original video.
For example, for reference frame 1 extracted from video clip 1 contained in the target video, frame matching is performed between reference frame 1 and the first original video, the second original video, …, and the Z-th original video contained in the original video set. The matching result is shown in Fig. 4d: the number of matching frames between reference frame 1 and the first original video is X1, that between reference frame 1 and the second original video is Y1, …, and that between reference frame 1 and the Z-th original video is Z1.
It should be noted that the above description only takes reference frame 1 as the reference frame i; for the other reference frames, the numbers of matching frames with the original videos are determined in the same way, which is not repeated here.
In an alternative embodiment, in performing step S3021, a perceptual hash (pHash) algorithm may be used for frame matching.
The pHash algorithm is one of the image perceptual algorithms: it generates a "fingerprint" string for each picture and then compares the "fingerprints" of different pictures; the closer the comparison result, the more similar the two pictures. The algorithm can be used to implement a search-by-image function in a browser. Image perceptual algorithms similar to pHash also include the average hash algorithm (aHash) and the difference hash algorithm (dHash); the differences among the three algorithms are shown in Table 1.
TABLE 1 differences between three algorithms pHash, aHash and dHash
(Table 1 is reproduced as an image in the original publication; its contents are not available in this text.)
As can be seen from table 1, the accuracy of frame matching by the pHash algorithm is the highest.
The principle of the pHash algorithm is as follows:
the method has the advantages that the size of the picture is reduced, the details of the picture are removed, basic information such as the structure, the brightness and the like of the picture is reserved, and picture differences caused by different sizes and horizontal and vertical pixel proportions are abandoned. The reduced size may be set according to actual needs, for example, reduced to 8 × 8 (pixels), 16 × 16 (pixels), 32 × 32 (pixels), and the like. In the embodiment of the present application, one reference frame and one original frame are reduced to 32 × 32 (pixels).
And carrying out gray processing on the picture to simplify the color of the picture.
Based on a preset operator with the size of M × M, Discrete Cosine Transform (DCT) is performed on the reduced picture. In embodiments of the present application, M is less than 32, such as M equals 8 or 16. The DCT is a special Fourier transform, the picture is transformed from a pixel domain to a frequency domain, an operator matrix represents higher and higher frequency domain coefficients from the upper left corner to the lower right corner, and in order to reserve a low-frequency region of the upper left corner, other coefficients in the operator matrix are all 0 or the difference value between the other coefficients and 0 is smaller than a set threshold value.
The DCT means of the pictures after the DCT transformation are then calculated. For each picture, the following is performed: starting from the upper left corner of the picture, the preset operator is slid over it with a set step; each slide yields a DCT value, and the DCT mean of the transformed picture is determined from these DCT values. The DCT is defined as follows.
One-dimensional DCT:
F(u) = c(u) \sum_{i=0}^{N-1} f(i) \cos\left[\frac{(i + 0.5)\pi}{N} u\right]   (1)
c(u) = \sqrt{1/N} for u = 0, and c(u) = \sqrt{2/N} otherwise   (2)
where f(i) denotes the picture to be transformed, i indexes the pixel points, N denotes the number of pixel points of the picture, c(u) denotes the compensation coefficient of the DCT transform, u denotes the generalized frequency variable with u = 0, 1, 2, …, N-1, and F(u) denotes the coefficient after the DCT transform.
For a two-dimensional picture, a two-dimensional DCT is performed on the basis of the one-dimensional DCT:
F(u, v) = c(u)\, c(v) \sum_{i=0}^{N-1}\sum_{j=0}^{N-1} f(i, j) \cos\left[\frac{(i + 0.5)\pi}{N} u\right] \cos\left[\frac{(j + 0.5)\pi}{N} v\right]   (3)
with c(u) and c(v) defined as in equation (2)   (4)
where v denotes a second generalized frequency variable, v = 0, 1, 2, …, N-1. In the embodiment of the present application, the generalized frequency variables u and v can denote the horizontal and vertical coordinates of the pixel array of a two-dimensional picture, and c(u) and c(v) are the horizontal and vertical pixel compensation coefficients of the DCT transform, respectively.
Equation (3) can be rewritten in matrix form:
F = A f A^{T}, where A(u, i) = c(u) \cos\left[\frac{(i + 0.5)\pi}{N} u\right]   (5)
as can be seen from equation 5, the two-dimensional DCT transform is symmetric, and thus a picture can be restored by the inverse DCT transform.
For each picture, each DCT value is compared with the DCT mean of that picture: if the DCT value is greater than or equal to the DCT mean, it is recorded as 1, otherwise as 0. This yields a binary array for each picture, also called its feature vector.
Image matching is then performed based on the feature vectors of the pictures. In the embodiment of the present application, the matching degree of two pictures can be determined from the Hamming distance between their feature vectors; the smaller the Hamming distance, the higher the matching degree of the two pictures.
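As a concrete illustration of this kind of pHash matching, the following sketch assumes OpenCV and NumPy. It uses the common variant that applies the DCT to the whole reduced picture and keeps the low-frequency top-left block (sized here like the 16 × 16 first operator of the embodiment), rather than the sliding-operator formulation above; the file names, sizes, and threshold are illustrative.

```python
# Minimal pHash-style sketch: shrink to 32x32, grayscale, DCT, keep the
# low-frequency block, threshold against the mean, compare by Hamming distance.
import cv2
import numpy as np

def phash_vector(image_bgr: np.ndarray, reduced_size: int = 32,
                 operator_size: int = 16) -> np.ndarray:
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)        # grayscale
    small = cv2.resize(gray, (reduced_size, reduced_size))    # shrink picture
    dct = cv2.dct(np.float32(small))                          # frequency domain
    low = dct[:operator_size, :operator_size]                 # low-frequency block
    return (low >= low.mean()).flatten()                      # binary feature vector

def hamming_distance(vec_a: np.ndarray, vec_b: np.ndarray) -> int:
    return int(np.count_nonzero(vec_a != vec_b))              # smaller = more similar

# Usage (hypothetical file names): two frames match if their Hamming
# distance is below a chosen threshold Q.
# ref = phash_vector(cv2.imread("reference_frame.png"))
# org = phash_vector(cv2.imread("original_frame.png"))
# is_match = hamming_distance(ref, org) < 20
```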
It should be noted that, without affecting the essential content of the present application, the embodiment of the present application does not impose a limiting requirement on the frame matching algorithm. Besides image perceptual algorithms such as the pHash, aHash, and dHash algorithms, matching algorithms based on local features from image classification and object recognition, such as the Scale-Invariant Feature Transform (SIFT) algorithm, the Speeded-Up Robust Features (SURF) algorithm, and the Bag of Words (BOW) algorithm, may also be used.
Based on the above principle, for each extracted reference frame, the number of matching frames corresponding to the corresponding reference frame in each original video in the original video set can be respectively determined. The following describes a process of determining a matching frame number corresponding to the original video j when performing step S3021, taking the reference frame i and the original video j as an example, and the following steps may be adopted, but are not limited to, see fig. 3 c:
step S30211, based on a preset first operator, extracting a first feature vector of the reference frame i, and extracting a second feature vector of each original frame included in the original video j.
When step S30211 is executed, preprocessing is first performed on reference frame i and each original frame contained in original video j, including reducing the picture size, graying, and other operations. In the embodiment of the present application, the reduced picture size is 32 × 32 pixels. The size of the first operator is set to 16 × 16; a DCT transformation is performed on the 32 × 32 reference frame i based on the first operator to extract the first feature vector of reference frame i, and a DCT transformation is performed on each 32 × 32 original frame to extract the second feature vector of each original frame. As can be seen from the principle of the pHash algorithm, the first feature vector and each second feature vector are binary arrays.
In specific implementation, firstly, based on a first operator and a first set step length, performing frequency domain transformation on a reference frame i to obtain a first frequency domain value set of the reference frame i, and performing frequency domain transformation on each original frame contained in an original video j to respectively obtain a second frequency domain value set corresponding to each original frame; then, determining a first frequency domain value mean value corresponding to the first frequency domain value set, and respectively determining a second frequency domain value mean value corresponding to each second frequency domain value set; and finally, determining a first feature vector of the reference frame i based on a comparison result of each frequency domain value in the first frequency domain value set and the first frequency domain value mean value, and respectively determining a second feature vector corresponding to each original frame contained in the original video j based on a comparison result of each frequency domain value in the second frequency domain value set and the corresponding second frequency domain value mean value.
It should be noted that, based on the first frequency domain value set, a median of the first frequency domain value set may also be determined, and based on a comparison result between each frequency domain value in the first frequency domain value set and the median of the first frequency domain value, the first feature vector of the reference frame i is determined. Similarly, a second feature vector may also be determined.
In step S30212, a first frame matching degree between the reference frame i and each original frame included in the original video j is determined based on the obtained first feature vectors and each second feature vector.
In step S30212, image matching is performed based on the obtained first feature vectors and the second feature vectors, first hamming distances between the first feature vectors and the second feature vectors are calculated, and a first frame matching degree between the reference frame i and each original frame included in the original video j is determined based on the first hamming distances. Wherein, the smaller the first hamming distance is, the higher the first matching degree between the reference frame i and the original frame is.
Step S30213, for the original video j, determining the number of frames of the original frame whose first frame matching degree meets the first preset condition as the matching frame number corresponding to the original video j.
When step S30213 is executed, comparing the first frame matching degrees between the reference frame i and each original frame included in the original video j with the first hamming threshold Q, if the first matching degrees are smaller than the first hamming threshold Q, it indicates that the first preset condition is met, otherwise, it indicates that the first preset condition is not met, counting the number of original frames in each original frame included in the original video j, where the first matching degrees with the reference frame i meet the first preset condition, and determining the counted number of frames as the matching number of frames corresponding to the original video j.
In an alternative embodiment, in order to improve the matching degree between the reference frame i and each original frame included in the original video j, a coarse grain operator (also referred to as a second operator) may be first used to extract feature vectors of the reference frame i and each original frame, and a primary matching is performed based on the extracted feature vectors to remove a portion of the original frame; then, extracting feature vectors of the reference frame i and the rest of original frames by adopting a fine-grain operator (also called as a first operator), and performing matching again based on the extracted feature vectors, wherein the coarse-grain operator is smaller than the fine-grain operator, so that the matching degree of the reference frame i and each original frame is improved. Therefore, before performing step S30211, the following steps may also be included, see fig. 3 d:
step S30210_1, based on a preset second operator, extracting a third feature vector of the reference frame i, and extracting a fourth feature vector of each original frame included in the original video j, where the second operator is smaller than the first operator.
In executing step S30210_1, reference frame i and each original frame contained in original video j are first preprocessed, and the detailed description can be referred to in S30211. Setting the size of the second operator to be 8 × 8 and smaller than the size of the first operator 16 × 16, performing DCT (discrete cosine transformation) transformation on the reduced reference frame i based on the second operator, extracting a third feature vector of the reference frame i, and performing DCT transformation on each reduced original frame respectively to extract a fourth feature vector of each original frame.
In specific implementation, firstly, based on a second operator and a second set step length, performing frequency domain transformation on a reference frame i to obtain a third frequency domain value set of the reference frame i, and performing frequency domain transformation on each original frame contained in an original video j to respectively obtain a fourth frequency domain value set corresponding to each original frame; then, determining a third frequency domain value mean value corresponding to the third frequency domain value set, and respectively determining a fourth frequency domain value mean value corresponding to each fourth frequency domain value set; and finally, determining a third feature vector of the reference frame i based on a comparison result of each frequency domain value in the third frequency domain value set and a mean value of the third frequency domain values, and respectively determining a fourth feature vector corresponding to each original frame contained in the original video j based on a comparison result of each frequency domain value in the fourth frequency domain value set and a corresponding mean value of the fourth frequency domain values.
In step S30210_2, based on the obtained third feature vectors and the respective fourth feature vectors, second frame matching degrees between the reference frame i and the respective original frames included in the original video j are respectively determined.
In step S30210_2, image matching is performed based on the obtained third feature vectors and the fourth feature vectors, second hamming distances between the third feature vectors and the fourth feature vectors are respectively calculated, and a second frame matching degree between the reference frame i and each original frame included in the original video j is determined based on the second hamming distances.
In step S30210_3, in the original video j, the original frame with the second frame matching degree not meeting the second preset condition is deleted.
When step S30210_3 is executed, the second frame matching degree between reference frame i and each original frame contained in original video j is compared with a second Hamming threshold Q'. If the second matching degree is smaller than Q', the second preset condition is met; otherwise it is not met, and the corresponding original frame is deleted from original video j. The second Hamming threshold Q' is greater than the first Hamming threshold Q, so a part of the original frames can be coarsely screened out based on the second frame matching degree. Steps S30211 to S30213 are then performed on the remaining original frames, so that the original frames of original video j that have a higher matching degree with reference frame i can be determined, yielding an accurate matching frame number for original video j.
Based on the method, the number of the matching frames corresponding to each original video can be determined.
It should be noted that if, after steps S30211 to S30213 are executed, the matching degree between reference frame i and each original frame is determined to already satisfy the usage requirement, steps S30210_1 to S30210_3 may not be executed.
It should be noted that, in the embodiment of the present application, the frame matching degree is measured by the Hamming distance; the frame matching degree may also be measured by the Euclidean distance. In different scenes, if the frame matching degree is measured by a parameter other than a distance, a larger parameter value may indicate a higher frame matching degree; therefore, in different scenes, the first preset condition may be that the parameter value is greater than a set threshold, or that the parameter value is smaller than a set threshold. The same applies to the second preset condition, which will not be described again here.
Step S3022, based on the matching frame number corresponding to each original video, the total frame number corresponding to each original video, and the total frame number of the video segment corresponding to the reference frame i, respectively determining the similarity between each original video and the video segment corresponding to the reference frame i.
In a specific implementation of step S3022, an arbitrary original video (hereinafter referred to as original video j, j = 1, 2, …, Z) is taken as an example. If the total frame number of original video j is less than the total frame number of the video segment corresponding to reference frame i, the similarity between original video j and the video segment corresponding to reference frame i is positively correlated with the matching frame number corresponding to original video j and negatively correlated with the total frame number of original video j. If the total frame number of original video j is not less than the total frame number of the video segment corresponding to reference frame i, the similarity is positively correlated with the matching frame number corresponding to original video j and negatively correlated with the total frame number of the video segment corresponding to reference frame i. Equivalently, the following formula may be employed:
similarity = matching frame number / min(total frame number of the video segment, total frame number of the corresponding original video)
Since an original frame in an original video may be edited, spliced, altered, and so on multiple times when the target video is created, the total frame number of a video segment may be greater than that of an original video; therefore, when determining the similarity, the minimum of the total frame number of the video segment and the total frame number of the original video is used as the denominator.
For example, suppose the total frame numbers of the first original video, the second original video, …, and the Z-th original video are sum1, sum2, …, sumZ respectively, and the total frame number of video segment 1 corresponding to reference frame 1 is SUM1. If sum1 is less than SUM1, the similarity between the first original video and video segment 1 is P1 = X1/sum1; if sum2 is greater than SUM1, the similarity between the second original video and video segment 1 is P2 = Y1/SUM1; if sumZ is equal to SUM1, the similarity between the Z-th original video and video segment 1 is P3 = Z1/SUM1. Similarly, the similarities between video segment 2 (with total frame number SUM2) and the first, second, …, Z-th original videos can be determined as P4 = X2/SUM2, P5 = Y2/SUM2, and P6 = Z2/SUM2 respectively, and the similarities between video segment 3 (with total frame number SUM3) and the first, second, …, Z-th original videos are P7 = X3/SUM3, P8 = Y3/SUM3, and P9 = Z3/SUM3 respectively, where X, Y, and Z with subscripts denote the corresponding matching frame numbers.
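A minimal sketch of the similarity rule above follows; the function name and argument names are illustrative only.

```python
# Sketch of step S3022's similarity rule: matches / min(segment total, original total).
def segment_similarity(matching_frames: int,
                       segment_total_frames: int,
                       original_total_frames: int) -> float:
    # e.g., P1 = segment_similarity(X1, SUM1, sum1) in the example above.
    return matching_frames / min(segment_total_frames, original_total_frames)
```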
Step S3023, a candidate video set whose similarity with the video segment corresponding to reference frame i reaches the similarity threshold is screened out from the original video set.
In a specific implementation, the similarity threshold is set to P. When the similarity between an original video in the original video set and the video segment corresponding to reference frame i is greater than P, it indicates that the video segment corresponding to reference frame i is derived from that original video, and that original video is taken as a candidate video of the video segment corresponding to reference frame i. One video segment corresponds to at least one candidate video; in this way, the candidate video set of the video segment corresponding to reference frame i is screened out from the original video set.
For example, if P1 > P, the first original video is a candidate video of video segment 1; if P5 > P, the second original video is a candidate video of video segment 2; if P9 > P, the Z-th original video is a candidate video of video segment 3.
Taking reference frame i as reference frame 1 as an example, there are 3 original videos in the original video set whose similarity with video segment 1 corresponding to reference frame 1 is greater than P, so the candidate video set of video segment 1 includes candidate videos 1, 2, and 3. As shown in fig. 4e, each candidate video is published by a different video account, and each candidate video has a title reflecting its content, circled by a thick dotted line in fig. 4e.
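The screening in step S3023 can then be sketched as follows, assuming a mapping from original-video identifiers to their similarities with the segment; the value of the threshold P is an assumption.

```python
# Illustrative candidate screening for one video segment (step S3023).
def select_candidates(similarities: dict, p: float = 0.8) -> list:
    """similarities: {original_video_id: similarity with the segment}.
    Returns the ids of the original videos kept as candidate videos."""
    return [vid for vid, sim in similarities.items() if sim > p]
```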
Step S303, the server performs title clustering respectively for the candidate video sets corresponding to the video segments to obtain subtitles corresponding to the video segments.
In step S303, each video segment is derived from the candidate videos in its corresponding candidate video set, and each candidate video has its own title. Since the title of a candidate video can reflect the content of that candidate video, a subtitle for the corresponding video segment can be generated based on the titles of the candidate videos in the candidate video set, so that the user can learn the content of the video segment from the subtitle, and the new media platform can recommend videos the user is interested in based on the subtitle.
In a specific implementation of step S303, the operation of determining the subtitle of the corresponding video segment needs to be executed for each reference frame i. Any one of the video segments (hereinafter referred to as video segment i, i.e., the video segment corresponding to reference frame i) is taken as an example below, with reference to fig. 3e:
step S3031, for the video segment i, a title of each candidate video in the corresponding candidate video set is obtained.
In step S3031, the titles of the original videos are stored in advance in the database of the server. The title of each candidate video is acquired from the database based on the identification of the candidate video in the candidate video set corresponding to video segment i, such as the ID number of the candidate video. The title of each candidate video reflects the content of the corresponding candidate video, as shown in fig. 4e.
Taking the video clip i as the video clip 1 corresponding to the reference frame 1 as an example, the titles of the candidate videos 1,2 and 3 in the candidate video set of the video clip 1 are shown in table 2.
TABLE 2 Titles of the candidate videos of video clip 1

Candidate video    Title
1                  Wonderful skiing video # single-board skiing
2                  Zero-foundation skiing entry video # double-board skiing # simple and easy to learn
3                  Skiing video appreciation # double-board skiing # the technique is really awesome!!
Step S3032, performing word segmentation processing on each obtained title, and obtaining a word segmentation vector set corresponding to each title.
In step S3032, word segmentation may be performed on each title using an existing word segmenter, or using a word segmentation algorithm, including but not limited to dictionary-based word segmentation algorithms (such as the forward maximum matching method, the reverse maximum matching method, and the bidirectional matching method) and statistics-based machine learning algorithms (such as the Hidden Markov Model (HMM) and the Conditional Random Field (CRF) algorithm). A title can be divided into a plurality of segmented words, and each segmented word can be mapped to a word segmentation vector, so that one title corresponds to one word segmentation vector set.
In an alternative embodiment, in step S3032, a Word2vec model may be used to obtain the word segmentation vector set of each title. The significance of word segmentation vectors is that they convert natural language into vectors that a computer can understand; compared with the bag-of-words model and the term frequency-inverse document frequency (TF-IDF) model, word segmentation vectors can capture the context and semantics of a segmented word and measure the similarity between words, and they play an important role in many natural language processing fields such as text classification and sentiment analysis. The Word2vec model is introduced as follows:
The Word2vec model is a three-layer neural network consisting of an input layer, a hidden layer, and an output layer, as shown in fig. 5. The word "skiing" is input at the input layer and represented as a 10000-dimensional one-hot vector containing a single 1 with all other elements being 0. Each segmented word is represented by 200-dimensional features, so the hidden layer contains 200 neurons; the hidden layer has no activation function, i.e., the neurons are linear neurons, and the weight matrix of the hidden layer is 10000 × 200. The output layer has the same dimension as the input layer and outputs a probability for each of the 10000 vocabulary words, such as "wonderful", "video", "funny", …, "double-board"; a Softmax function is used, so the probabilities of the output words sum to 1.
It should be noted that the parameter values in the model of the above embodiment are only an example, and can be adjusted according to actual situations.
In a specific implementation of step S3032, based on the Word2vec model, the segmented words of each title in the candidate video set of video segment i are input into the Word2vec model to obtain the 200-dimensional word vector of each segmented word, thereby obtaining the word segmentation vector set corresponding to each title.
Step S3033, the obtained word vector mean of each participle vector set is used as the title vector of the corresponding candidate video.
In step S3033, for the word segmentation vector set corresponding to any title i, the dimension-wise mean of the word segmentation vectors in that set is determined, and the determined mean is used as the title vector of title i.
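Steps S3032–S3033 might be sketched as follows. The patent only requires a word segmenter and a Word2vec model; the choice of jieba and gensim, and training Word2vec on the candidate titles themselves rather than using a pre-trained model, are assumptions made to keep the example self-contained.

```python
# Hedged sketch of steps S3032-S3033; jieba and gensim are assumed tools.
import jieba
import numpy as np
from gensim.models import Word2Vec

def build_title_vectors(titles):
    """Segment each title, embed the words with Word2vec, and use the
    dimension-wise mean of the word vectors as the title vector."""
    tokenised = [jieba.lcut(t) for t in titles]               # step S3032: word segmentation
    # 200-dimensional vectors, matching the hidden-layer size described above.
    model = Word2Vec(tokenised, vector_size=200, min_count=1)
    title_vectors = []
    for tokens in tokenised:
        vecs = [model.wv[w] for w in tokens if w in model.wv]
        title_vectors.append(np.mean(vecs, axis=0))           # step S3033: mean pooling
    return np.stack(title_vectors)
```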
Step S3034, performing title clustering on the title vector of each candidate video in the candidate video set corresponding to the video segment i to obtain the subtitle corresponding to the video segment i.
In step S3034, each candidate video in the candidate video set corresponding to video segment i has its own title. Since different users may add different titles when forwarding the same video, the titles of the same candidate video may differ. Title clustering is therefore performed according to semantic similarity based on the title vectors of the candidate videos, to obtain the subtitle corresponding to video segment i.
Alternatively, the clustering algorithm may use a k-means (k-means) algorithm. The k-means algorithm is a typical unsupervised clustering algorithm, aiming to cluster the input data into k clusters (clusters). The training process of the k-means algorithm is as follows:
Firstly, k cluster centroids are randomly selected and denoted μ1, μ2, …, μk. Each class has a centroid representing the center point of the samples belonging to that class, where each μt ∈ R^n and R^n denotes the n-dimensional real vector space.
Then, the following process is repeated until convergence:
For each sample s, determine the class to which sample s belongs, using the following formula:

c(s) = argmin_t ‖x(s) − μt‖²   equation 6

where c(s) denotes the class, among the k classes, that is closest to sample s and takes a value in 1, 2, …, k; x(s) denotes the feature vector of sample s, which in this example is the title vector of a candidate video; μt denotes the centroid of class t; and argmin_t selects the class t that minimizes the distance between sample s and its centroid μt.
For each category t, the centroid point of the category t is recalculated, and the formula is as follows:
μt = ( Σ_{s=1..r} 1{c(s) = t} · x(s) ) / ( Σ_{s=1..r} 1{c(s) = t} )

where r represents the total number of samples and 1{c(s) = t} equals 1 when sample s is currently assigned to class t and 0 otherwise.
The training process of the k-means algorithm can be explained with a constellation analogy: essentially, all stars are grouped into k constellations. In the first step, k points (or k stars) in the universe are randomly selected as the centroids of the k constellations. In the second step, the distance from each star to each of the k centroids is calculated, and the constellation corresponding to the closest centroid is selected as c(s); thus, after the second step, every star has a constellation to which it belongs. In the third step, for each constellation, the mean of the coordinates of all stars in that constellation is computed to obtain a new centroid μt. The second and third steps are repeated until the centroids no longer change, or the difference between the centroids determined in two adjacent iterations is smaller than a set threshold, which indicates that the k-means algorithm has reached the convergence condition and the iteration ends. The value of k can be set according to the actual situation.
In a specific implementation, when step S3034 is executed, the following operations are executed for a candidate video set corresponding to the video segment i, see fig. 3 f:
step S30341, performing title clustering on the title vectors of the candidate videos to obtain at least one candidate title category.
When step S30341 is executed, the title vector of each candidate video is input to a k-means model, and the k-means model clusters the title vector of each candidate video to obtain k clusters, where each cluster represents a candidate title category, and k is an integer greater than or equal to 1.
Step S30342, a target title category is determined from the at least one candidate title category based on the number of title vectors associated with each candidate title category in the at least one candidate title category.
When step S30342 is executed, there are k candidate title categories; the titles corresponding to the title vectors associated with one candidate title category are semantically similar, and the number of title vectors associated with each candidate title category differs. Specifically, the candidate title category with the largest number of associated title vectors may be determined as the target title category of video segment i.
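Steps S30341–S30342 could be sketched with scikit-learn's KMeans as shown below; the library choice and the value of k are assumptions, and the largest cluster is selected as the target title category following the rule just described.

```python
# Illustrative clustering of title vectors and selection of the target category.
from collections import Counter

import numpy as np
from sklearn.cluster import KMeans

def target_title_category(title_vectors: np.ndarray, k: int = 3):
    """Cluster the title vectors into k candidate categories and return the
    label of the largest cluster together with the indices of its members."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(title_vectors)
    target_label, _ = Counter(labels).most_common(1)[0]
    member_idx = np.where(labels == target_label)[0]
    return target_label, member_idx
```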
Step S30343, determining a subtitle corresponding to one video segment based on the playing amount of the candidate video corresponding to each title vector associated with the target title category and the similarity between each title vector and the title vector of the target video.
In step S30343, at least one candidate video is selected from the candidate videos corresponding to the title vectors associated with the target title category, such that its playing amount is greater than a set playing threshold and the similarity between its title vector and the title vector of the target video falls within a set similarity interval. Optionally, in the embodiment of the present application, the similarity interval is [0.4, 0.6], so that, relative to the title of the target video, the title of the selected candidate video can provide more complementary information to reflect the content of video segment i. Based on the titles of the selected candidate videos, a score is determined for each selected title as a possible subtitle of video segment i, and the title of the candidate video with the highest score is determined as the subtitle of video segment i. The scoring formula is as follows:
score = log(playing amount of the candidate video) × ( 1 / cos(title vector of the candidate video, title vector of the target video) )
The playing amount of the candidate video is smoothed using the logarithm function log(), and the reciprocal of the cosine similarity between the title vector of the candidate video and the title vector of the target video means that the lower the similarity, the higher the score of the corresponding candidate video's title, and the more complementary information it can provide to reflect the content of video segment i.
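A hedged sketch of the selection in step S30343 follows, using the reconstruction of the scoring formula given above (log-smoothed playing amount multiplied by the reciprocal of the cosine similarity). The playing threshold, the [0.4, 0.6] interval bounds, and the candidate data layout are assumptions for this example.

```python
# Illustrative scoring and selection of the sub-title (step S30343).
import math
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def pick_subtitle(candidates: list, target_title_vec: np.ndarray,
                  min_plays: int = 10_000,
                  sim_low: float = 0.4, sim_high: float = 0.6):
    """candidates: list of dicts with keys 'title', 'title_vec', 'plays'."""
    best_title, best_score = None, -math.inf
    for c in candidates:
        sim = cosine(c["title_vec"], target_title_vec)
        if c["plays"] <= min_plays or not (sim_low <= sim <= sim_high):
            continue  # filter by playing amount and similarity interval
        score = math.log(c["plays"]) * (1.0 / sim)  # log-smoothed plays x 1/cosine similarity
        if score > best_score:
            best_title, best_score = c["title"], score
    return best_title
```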
For example, taking video clip i as video clip 1, the titles of candidate videos 1 and 2 satisfying the above play amount and similarity conditions are as follows:
candidate video 1: skiing # single board skiing # wonderful video
Candidate video 2: zero foundation of skiing # double-board skiing # entry video
If the title of candidate video 2 has the highest score, the subtitle of video segment 1 is "skiing # double-board skiing # entry video # zero foundation". As shown in fig. 4f, when video segment 1 is played, the subtitle of video segment 1 is displayed below the title of the target video and circled by a thick solid line. Alternatively, if the subtitle is long, the subtitle of video segment 1 may be displayed in a scrolling manner while video segment 1 is played. Based on the subtitle of video segment 1, the user knows that video segment 1 is a video about learning to ski, and when the user searches for "skiing funny video", the new media platform pushes the target video corresponding to video segment 1 to the terminal based on the subtitle of video segment 1.
It should be noted that, when step S30343 is executed, in addition to the cosine similarity between the title vector of the candidate video and the title vector of the target video, the score calculation may also consider whether the title of the candidate video contains entity information that does not appear in the target video, such as the names of people, places, and organizations.
In step S304, the server generates a subtitle of the target video based on the respective subtitles corresponding to the respective video clips.
In step S304, the subtitles of the video segments may be arranged according to the playing order of the video segments, so as to obtain the subtitles of the target video. The subtitle of the target video may be in the format { subtitle for video clip 1-subtitle for video clip 2- … }, and the subtitle of the target video is displayed when the target video is played.
For example, the target video includes video segment 1, video segment 2, and video segment 3; the subtitle of video segment 1 is "skiing # double-board skiing # entry video # zero foundation", the subtitle of video segment 2 is "falling moment # funny", and the subtitle of video segment 3 is "swimming # emoticon", so the subtitle of the target video is "skiing # double-board skiing # entry video # zero foundation - falling moment # funny - swimming # emoticon"; the display interface is shown in fig. 6a. It should be noted that, to match the interface size of the display device, the subtitle of the target video may be displayed in a scrolling manner when it is too long.
In step S304, the subtitle of the target video may also be generated in the form of key-value pairs according to the correspondence between each video segment and its subtitle. The format of the subtitle of the target video may be {[identifier of video segment 1: subtitle corresponding to video segment 1], [identifier of video segment 2: subtitle corresponding to video segment 2], …}; when the target video is played, the corresponding subtitle is displayed according to the identifier of the video segment being played.
For example, still using the above example, the subtitle of the target video is {[1: skiing # double-board skiing # entry video # zero foundation], [2: falling moment # funny], [3: swimming # emoticon]}; when the target video is displayed, as shown in fig. 4f, the subtitle "skiing # double-board skiing # entry video # zero foundation" corresponding to video segment 1 is displayed while video segment 1 is played.
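Both output formats described for step S304 are straightforward to assemble; the sketch below is illustrative, and the separator string and identifier handling are assumptions consistent with the examples above.

```python
# Illustrative assembly of the target-video sub-title (step S304).
def subtitle_by_play_order(segment_subtitles: list) -> str:
    """segment_subtitles: sub-titles listed in the play order of the segments."""
    return "-".join(segment_subtitles)

def subtitle_by_segment_id(segment_id_to_subtitle: dict) -> dict:
    """Key-value form {segment identifier: sub-title}; the player looks up the
    sub-title of whichever segment is currently being played."""
    return dict(segment_id_to_subtitle)
```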
Based on the above implementation, fig. 6b shows a complete subtitle display interface provided by the embodiment of the present application, taking video segment 1 corresponding to reference frame 1 as an example. The interface of the target video shown in (1) in fig. 6b only displays the title of the target video. (2) in fig. 6b shows an interface of video segment 1 included in the target video. Reference frame 1 is extracted from video segment 1, frame matching is performed between the extracted reference frame 1 and the video frames of each original video in the original video set, the candidate video set of video segment 1 is determined based on the matching frame numbers, the titles in the candidate video set are clustered, and candidate video 2 meeting the playing amount and similarity conditions is screened out; as shown in (3) in fig. 6b, the title of candidate video 2 is displayed on the interface and circled by a thick dotted line. Based on the title of candidate video 2, the subtitle of video segment 1 is generated; this subtitle reflects the detailed content of video segment 1 and is displayed below the title of the target video, circled by a thick solid line in (4) in fig. 6b. When a user searches for "skiing funny video", the new media platform can recommend the target video corresponding to video segment 1 to the user's terminal based on this subtitle, and after the terminal receives the recommended target video, the user can learn that video segment 1 is a skiing tutorial from the subtitle of video segment 1 without watching it.
It should be noted that (4) in fig. 6b is only an example of a subtitle display of a video clip, and may also be displayed as the interface shown in fig. 6 a.
Based on the same inventive concept, the embodiment of the present application provides a subtitle generating apparatus. As shown in fig. 7a, the subtitle generating apparatus for a target video may include:
a frame extraction module 701, configured to extract corresponding reference frames from each video segment included in the target video;
a screening module 702, configured to screen, based on each obtained reference frame, a candidate video set, for which a similarity with a corresponding video segment reaches a similarity threshold, from an original video set, for the corresponding video segment;
a generating module 703, configured to perform title clustering on the candidate video sets corresponding to the video segments, respectively, to obtain subtitles corresponding to the video segments, and generate subtitles of the target video based on the subtitles corresponding to the video segments.
Optionally, the screening module 702 is specifically configured to:
for each reference frame, the following operations are respectively performed:
performing frame matching on one reference frame in each reference frame and each original video contained in the original video set respectively, and determining the matching frame number corresponding to each original video respectively;
respectively determining the similarity of each original video and a video clip corresponding to a reference frame based on the matching frame number corresponding to each original video, the total frame number of each original video and the total frame number of the video clip corresponding to the reference frame;
and screening out a candidate video set of which the similarity of the video segments corresponding to one reference frame reaches a similarity threshold value from the original video set.
Optionally, the screening module 702 is specifically configured to:
extracting a first feature vector of a reference frame and respectively extracting a second feature vector of each original frame contained in each original video based on a preset first operator;
respectively determining a first frame matching degree between a reference frame and each original frame contained in each original video based on the obtained first characteristic vector and each second characteristic vector;
and respectively determining the frame number of the original frame of which the first frame matching degree meets a first preset condition as the matching frame number corresponding to the corresponding original video aiming at each original video.
Optionally, the screening module 702 is specifically configured to:
performing frequency domain transformation on a reference frame based on a first operator and a first set step length to obtain a first frequency domain value set of the reference frame, and performing frequency domain transformation on each original frame contained in each original video respectively to obtain a second frequency domain value set corresponding to each original frame respectively;
determining a first frequency domain value mean value corresponding to the first frequency domain value set, and respectively determining a second frequency domain value mean value corresponding to each second frequency domain value set;
and determining a first feature vector of the reference frame based on the comparison result of each frequency domain value in the first frequency domain value set and the first frequency domain value mean value, and respectively determining the second feature vectors corresponding to the original frames based on the comparison results of the frequency domain values in the second frequency domain value sets and the corresponding second frequency domain value mean values.
Optionally, the screening module 702 is further configured to:
extracting a third feature vector of a reference frame and a fourth feature vector of each original frame contained in each original video respectively based on a preset second operator, wherein the second operator is smaller than the first operator;
respectively determining a second frame matching degree between one reference frame and each original frame contained in each original video based on the obtained third feature vector and each fourth feature vector;
and respectively deleting the original frames of which the second frame matching degree does not meet the second preset condition in each original video.
Optionally, the screening module 702 is specifically configured to:
if the total frame number of an original video in each original video is less than the total frame number of a video clip corresponding to a reference frame, the similarity of the video clip corresponding to the original video and the reference frame is positively correlated with the matching frame number corresponding to the original video and negatively correlated with the total frame number of the original video;
if the total frame number of one original video in each original video is not less than the total frame number of the video clip corresponding to one reference frame, the similarity of the video clip corresponding to one original video and one reference frame is positively correlated with the matching frame number corresponding to one original video, and is negatively correlated with the total frame number of the video clip corresponding to one reference frame.
Optionally, the generating module 703 is specifically configured to:
for each video clip, the following operations are respectively performed:
aiming at one video clip in each video clip, acquiring the title of each candidate video in the corresponding candidate video set;
performing word segmentation processing on each obtained title respectively to obtain a word segmentation vector set corresponding to each title;
respectively taking the obtained word vector mean value of each word segmentation vector set as a corresponding title vector of the candidate video;
and performing title clustering on title vectors of all candidate videos in a candidate video set corresponding to one video clip to obtain a subtitle corresponding to one video clip.
Optionally, the generating module 703 is specifically configured to:
performing title clustering on title vectors of all candidate videos to obtain at least one candidate title category;
determining a target title category from the at least one candidate title category based on the number of title vectors respectively associated with each candidate title category in the at least one candidate title category;
and determining a subtitle corresponding to one video segment based on the playing amount of the candidate video corresponding to each title vector associated with the target title category and the similarity of each title vector and the title vector of the target video.
Optionally, the frame extraction module 701 is specifically configured to:
extracting corresponding reference frames from all video clips contained in a target video according to a set target frame extraction interval, wherein the target frame extraction interval is set according to the playing time length of all video clips contained in the target video; alternatively,
determining the number of target video segments corresponding to the target playing time length based on the target playing time length of the target video and the mapping relation between the preset playing time length and the number of the video segments; and determining a target frame extraction interval based on the target playing time length of the target video and the number of the corresponding target video segments, and extracting corresponding reference frames from all the video segments contained in the target video respectively based on the target frame extraction interval, wherein the target frame extraction interval is positively correlated with the target playing time length and negatively correlated with the number of target video segments corresponding to the target playing time length.
For convenience of description, the above parts are divided into modules according to functions and described separately. Of course, the functionality of the various modules may be implemented in the same one or more pieces of software or hardware when implementing the present application.
As will be appreciated by one skilled in the art, aspects of the present application may be embodied as a system, method or program product. Accordingly, various aspects of the present application may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.) or an embodiment combining hardware and software aspects that may all generally be referred to herein as a "circuit," module "or" system.
With regard to the apparatus in the above embodiment, the specific implementation manner of each module has been described in detail in the embodiment related to the method, and will not be elaborated herein.
Based on the same inventive concept, the embodiment of the present application provides a subtitle generating apparatus. As shown in fig. 7b, which is a schematic structural diagram of a subtitle generating apparatus for a video clip, the subtitle generating apparatus may include:
a frame extraction module 704, configured to extract corresponding reference frames from each video segment included in the target video;
the screening module 705 is configured to screen, based on the obtained reference frames, a candidate video set, of which the similarity to the corresponding video segment reaches a similarity threshold, from the original video set for the corresponding video segments respectively;
a generating module 706, configured to perform title clustering on the candidate video sets corresponding to the video segments respectively, so as to obtain subtitles corresponding to the video segments respectively.
With regard to the apparatus in the above embodiment, the specific implementation manner of each module is shown in fig. 7a, and will not be described in detail here.
Fig. 8 is a block diagram illustrating an electronic device 800 including:
a processor 801;
a memory 802 for storing instructions executable by the processor 801;
wherein the processor 801 is configured to execute instructions to implement the subtitle generating method in the embodiments of the present application, such as the steps shown in fig. 3a to 3 f.
Having described the subtitle generating method and apparatus of the present exemplary embodiment, next, a generating apparatus according to another exemplary embodiment of the present application is described.
In some possible embodiments, a generating device according to the present application may comprise at least one processing unit, and at least one storage unit. Wherein the storage unit stores program code which, when executed by the processing unit, causes the processing unit to execute the steps in the subtitle generating method described above in the embodiments of the present application. For example, the processing unit may perform the steps as shown in fig. 3a to 3 f.
The generation apparatus 900 according to this embodiment of the present application is described below with reference to fig. 9. The generation apparatus shown in fig. 9 is only an example, and should not bring any limitation to the functions and the range of use of the embodiments of the present application.
As shown in fig. 9, the generating means is represented in the form of a general purpose computing device. Components of the generating device may include, but are not limited to: the at least one processing unit 901, the at least one memory unit 902, and the bus 903 connecting the various system components (including the memory unit 902 and the processing unit 901).
Bus 903 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, a processor, or a local bus using any of a variety of bus architectures.
The storage unit 902 may include readable media in the form of volatile memory, such as Random Access Memory (RAM)921 and/or cache memory 922, and may further include Read Only Memory (ROM) 923.
Storage unit 902 may also include programs/utilities 925 having a set (at least one) of program modules 924, such program modules 924 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
The generating apparatus may also communicate with one or more external devices 904 (e.g., keyboard, pointing device, etc.), may also communicate with one or more devices that enable a user to interact with the generating apparatus, and/or any devices (e.g., router, modem, etc.) that enable the generating apparatus to communicate with one or more other computing devices. Such communication may occur via input/output (I/O) interfaces 905. Also, the generating device may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the internet) via network adapter 906. As shown, the network adapter 906 communicates with other modules for the generating device over the bus 903. It should be understood that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the generating device, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
Embodiments of the present application also provide a computer-readable medium having stored thereon a computer program that, when executed by a processor, performs the steps of the subtitle generating method described above.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (14)

1. A subtitle generating method, the method comprising:
extracting corresponding reference frames from each video clip contained in the target video respectively;
based on each obtained reference frame, respectively aiming at corresponding video clips, screening out a candidate video set of which the similarity with the corresponding video clip reaches a similarity threshold value from an original video set;
respectively carrying out title clustering on the candidate video sets corresponding to the video clips to obtain subtitles corresponding to the video clips;
and generating the subtitle of the target video based on the respective corresponding subtitles of the video clips.
2. The method according to claim 1, wherein in the process of screening out a candidate video set with a similarity reaching a similarity threshold from an original video set for a corresponding video segment based on each obtained reference frame, the following operations are respectively performed for each reference frame:
performing frame matching on one reference frame in each reference frame and each original video contained in the original video set respectively, and determining the matching frame number corresponding to each original video respectively;
respectively determining the similarity of each original video and the video segment corresponding to one reference frame based on the matching frame number corresponding to each original video, the total frame number of each original video and the total frame number of the video segment corresponding to one reference frame;
and screening out a candidate video set of which the similarity of the video segment corresponding to the reference frame reaches a similarity threshold value from the original video set.
3. The method of claim 2, wherein performing frame matching on one of the reference frames with each of the original videos included in the original video set, and determining a number of matching frames corresponding to each of the original videos respectively, comprises:
extracting a first feature vector of the reference frame and extracting a second feature vector of each original frame contained in each original video respectively based on a preset first operator;
respectively determining a first frame matching degree between the reference frame and each original frame contained in each original video based on the obtained first feature vector and each second feature vector;
and respectively determining the frame number of the original frame of which the first frame matching degree meets a first preset condition as the matching frame number corresponding to the corresponding original video aiming at each original video.
4. The method of claim 3, wherein extracting the first feature vector of the reference frame and the second feature vector of each original frame contained in each original video respectively based on a preset first operator comprises:
performing frequency domain transformation on the reference frame based on the first operator and a first set step length to obtain a first frequency domain value set of the reference frame, and performing frequency domain transformation on each original frame contained in each original video to obtain a second frequency domain value set corresponding to each original frame;
determining a first frequency domain value mean value corresponding to the first frequency domain value set, and respectively determining a second frequency domain value mean value corresponding to each second frequency domain value set;
and determining a first feature vector of the reference frame based on a comparison result of each frequency-domain value in the first frequency-domain value set and the first frequency-domain value mean, and determining a second feature vector corresponding to each original frame based on a comparison result of each frequency-domain value in each second frequency-domain value set and the corresponding second frequency-domain value mean, respectively.
5. The method as claimed in claim 3, wherein before extracting the first feature vector of the reference frame and the second feature vector of each original frame contained in each original video respectively based on a preset first operator, further comprising:
extracting a third feature vector of the reference frame and a fourth feature vector of each original frame contained in each original video respectively based on a preset second operator, wherein the second operator is smaller than the first operator;
respectively determining a second frame matching degree between the reference frame and each original frame contained in each original video based on the obtained third feature vector and each fourth feature vector;
and deleting the original frames of which the second frame matching degree does not meet a second preset condition in each original video respectively.
6. The method according to any one of claims 2-5, wherein determining the similarity of each original video to the video segment corresponding to the reference frame based on the matching frame number corresponding to each original video, the total frame number corresponding to each original video, and the total frame number of the video segment corresponding to the reference frame respectively comprises:
if the total frame number of one original video in each original video is less than the total frame number of the video clip corresponding to one reference frame, the similarity of the video clip corresponding to the original video and the reference frame is positively correlated with the matching frame number corresponding to the original video and negatively correlated with the total frame number of the original video;
if the total frame number of one original video in each original video is not less than the total frame number of the video clip corresponding to one reference frame, the similarity of the video clip corresponding to the original video and the reference frame is positively correlated with the matching frame number corresponding to the original video, and is negatively correlated with the total frame number of the video clip corresponding to the reference frame.
7. The method according to any one of claims 1 to 5, wherein in the process of performing title clustering on the candidate video sets respectively corresponding to the video segments to obtain the subtitles respectively corresponding to the video segments, the following operations are respectively performed on the video segments:
acquiring the title of each candidate video in the corresponding candidate video set aiming at one video clip in each video clip;
performing word segmentation processing on each obtained title respectively to obtain a word segmentation vector set corresponding to each title;
respectively taking the obtained word vector mean value of each word segmentation vector set as a corresponding title vector of the candidate video;
and performing title clustering on title vectors of all candidate videos in the candidate video set corresponding to the video segment to obtain a subtitle corresponding to the video segment.
8. The method of claim 7, wherein performing title clustering on title vectors of each candidate video in the candidate video set corresponding to the one video segment to obtain a subtitle corresponding to the one video segment comprises:
performing title clustering on the title vectors of the candidate videos to obtain at least one candidate title category;
determining a target headline category from the at least one candidate headline category based on the number of headline vectors associated with each of the at least one candidate headline category;
and determining a subtitle corresponding to the video clip based on the playing amount of the candidate video corresponding to each title vector associated with the target title category and the similarity between each title vector and the title vector of the target video.
9. The method of claim 7, wherein extracting the corresponding reference frame from each video segment contained in the target video comprises:
extracting corresponding reference frames from all video clips contained in the target video according to a set target frame extraction interval, wherein the target frame extraction interval is set according to the playing time length of all video clips contained in the target video; alternatively,
determining the number of target video segments corresponding to the target playing time length based on the target playing time length of the target video and the mapping relation between the preset playing time length and the number of the video segments; and determining a target frame extraction interval based on the target playing time length of the target video and the number of corresponding target video clips, and extracting corresponding reference frames from all the video clips contained in the target video respectively based on the target frame extraction interval, wherein the target frame extraction interval is positively correlated with the target playing time length and negatively correlated with the number of the target video clips corresponding to the target playing time length.
10. A subtitle generating method, the method comprising:
extracting corresponding reference frames from each video clip contained in the target video respectively;
based on each obtained reference frame, respectively aiming at corresponding video clips, screening out a candidate video set of which the similarity with the corresponding video clip reaches a similarity threshold value from an original video set;
and respectively carrying out title clustering on the candidate video sets corresponding to the video clips to obtain the subtitles corresponding to the video clips.
11. A subtitle generating apparatus, comprising:
the frame extraction module is used for respectively extracting corresponding reference frames from each video clip contained in the target video;
the screening module is used for screening out a candidate video set of which the similarity with the corresponding video clip reaches a similarity threshold value from an original video set respectively aiming at the corresponding video clip based on each obtained reference frame;
a generating module, configured to perform title clustering on the candidate video sets corresponding to the video segments, respectively, to obtain subtitles corresponding to the video segments; and generating the subtitle of the target video based on the respective corresponding subtitles of the video clips.
12. A subtitle generating apparatus, comprising:
the frame extraction module is used for respectively extracting corresponding reference frames from each video clip contained in the target video;
the screening module is used for screening out a candidate video set of which the similarity with the corresponding video clip reaches a similarity threshold value from an original video set respectively aiming at the corresponding video clip based on each obtained reference frame;
and the generating module is used for respectively carrying out title clustering on the candidate video sets corresponding to the video segments to obtain the subtitles corresponding to the video segments.
13. An electronic device comprising a memory and a processor, the memory having stored thereon a computer program operable on the processor, the computer program, when executed by the processor, implementing the method of any of claims 1-10.
14. A computer-readable storage medium having a computer program stored therein, the computer program characterized by: the computer program, when executed by a processor, implements the method of any of claims 1-10.
CN202110387022.8A 2021-04-12 2021-04-12 Subtitle generating method and device, electronic equipment and storage medium Pending CN113705209A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110387022.8A CN113705209A (en) 2021-04-12 2021-04-12 Subtitle generating method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110387022.8A CN113705209A (en) 2021-04-12 2021-04-12 Subtitle generating method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113705209A true CN113705209A (en) 2021-11-26

Family

ID=78647972

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110387022.8A Pending CN113705209A (en) 2021-04-12 2021-04-12 Subtitle generating method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113705209A (en)

Similar Documents

Publication Publication Date Title
CN110781347B (en) Video processing method, device and equipment and readable storage medium
CN112565825B (en) Video data processing method, device, equipment and medium
CN109117777B (en) Method and device for generating information
CN110163115B (en) Video processing method, device and computer readable storage medium
CN102549603B (en) Relevance-based image selection
CN110390033B (en) Training method and device for image classification model, electronic equipment and storage medium
CN111611436A (en) Label data processing method and device and computer readable storage medium
CN113395578B (en) Method, device, equipment and storage medium for extracting video theme text
CN109271542A (en) Cover determines method, apparatus, equipment and readable storage medium storing program for executing
CN112163122A (en) Method and device for determining label of target video, computing equipment and storage medium
CN114342353A (en) Method and system for video segmentation
CN112738556B (en) Video processing method and device
CN111209440A (en) Video playing method, device and storage medium
CN112015949A (en) Video generation method and device, storage medium and electronic equipment
CN113850162B (en) Video auditing method and device and electronic equipment
CN113010703B (en) Information recommendation method and device, electronic equipment and storage medium
CN111783712A (en) Video processing method, device, equipment and medium
CN113329261B (en) Video processing method and device
CN113704506A (en) Media content duplication eliminating method and related device
CN113806588A (en) Method and device for searching video
CN113590854B (en) Data processing method, data processing equipment and computer readable storage medium
CN116645624A (en) Video content understanding method and system, computer device, and storage medium
CN110162769B (en) Text theme output method and device, storage medium and electronic device
CN111259245A (en) Work pushing method and device and storage medium
CN112328833A (en) Label processing method and device and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination