CN112291589A - Video file structure detection method and device - Google Patents


Info

Publication number
CN112291589A
CN112291589A (application CN202011181785.9A)
Authority
CN
China
Prior art keywords
video
frame
head
video frame
tail
Legal status
Granted
Application number
CN202011181785.9A
Other languages
Chinese (zh)
Other versions
CN112291589B (en)
Inventor
Sun Xiangxue (孙祥学)
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202011181785.9A
Publication of CN112291589A
Application granted
Publication of CN112291589B
Legal status: Active

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs
    • H04N21/23418Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/44008Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/845Structuring of content, e.g. decomposing content into time segments
    • H04N21/8455Structuring of content, e.g. decomposing content into time segments involving pointers to the content, e.g. pointers to the I-frames of the video stream

Abstract

The application provides a method and an apparatus for detecting the structure of a video file, an electronic device, and a computer-readable storage medium, relating to artificial intelligence technology. The method includes: extracting candidate segments including endpoint video frames from a video file; performing character recognition on a plurality of video frames in the candidate segments to obtain a keyword recognition result of each video frame; extracting image features of each video frame through a machine learning model, and determining a segment prediction result of each video frame based on the image features of each video frame; and identifying boundary video frames from the plurality of video frames based on the keyword recognition results and the segment prediction results. Through the method and the apparatus, the head and the tail of the video file can be rapidly located.

Description

Video file structure detection method and device
Technical Field
The present application relates to artificial intelligence technologies, and in particular, to a method and an apparatus for detecting a structure of a video file, an electronic device, and a computer-readable storage medium.
Background
Artificial Intelligence (AI) is a comprehensive technique in computer science, and by studying the design principles and implementation methods of various intelligent machines, the machines have the functions of perception, reasoning and decision making.
Visual recognition is an important application of artificial intelligence technology. For example, the head (opening) and the tail (ending credits) of a video file can be determined through visual recognition, so as to provide intermediate data for applications based on the recognition result, such as skipping the head and the tail during playback.
The related art lacks a scheme that detects the structure of a video file to quickly locate its head and tail; it mainly relies on manual labeling of each part of the video file, which is inefficient.
Disclosure of Invention
The embodiments of the application provide a method and an apparatus for detecting the structure of a video file, an electronic device, and a computer-readable storage medium, which can quickly locate the head and the tail of a video file.
The technical scheme of the embodiment of the application is realized as follows:
the embodiment of the application provides a method for detecting a structure of a video file, which comprises the following steps:
extracting candidate segments including endpoint video frames from a video file;
performing character recognition on a plurality of video frames in the candidate segment to obtain a keyword recognition result of each video frame;
extracting image characteristics of each video frame through a machine learning model, and determining a segment prediction result of each video frame based on the image characteristics of each video frame;
and identifying a boundary video frame from the plurality of video frames based on the keyword identification result of each video frame and the segment prediction result of each video frame.
An embodiment of the application provides an apparatus for detecting the structure of a video file, including:
an extraction module for extracting candidate segments including endpoint video frames from a video file;
the first identification module is used for carrying out character identification on a plurality of video frames in the candidate segment to obtain a keyword identification result of each video frame;
the prediction module is used for extracting image characteristics of each video frame through a machine learning model and determining a segment prediction result of each video frame based on the image characteristics of each video frame;
and the second identification module is used for identifying a boundary video frame from the plurality of video frames based on the keyword identification result of each video frame and the segment prediction result of each video frame.
In the above scheme, the endpoint video frame includes a video head frame and a video end frame of the video file; the extraction module is further configured to:
when the duration of the video file is greater than the sum of the lengths of a slice head preset time period and a slice tail preset time period, extracting a slice head candidate segment which comprises the video head frame and has the length of the slice head preset time period from the video file, and extracting a slice tail candidate segment which comprises the video tail frame and has the length of the slice tail preset time period from the video file.
In the foregoing solution, the first identifying module is further configured to:
extracting a plurality of video frames from the candidate segments at fixed time intervals;
carrying out image preprocessing on each video frame to obtain a corresponding binary image;
carrying out segmentation processing on the binary image to obtain a character image containing a plurality of characters;
extracting character features of a plurality of characters in the character image, performing feature matching based on the character features, and taking keywords obtained by matching as a keyword recognition result of the video frame corresponding to the character image.
In the foregoing solution, the first identifying module is further configured to:
traversing a keyword feature library to match features in the keyword feature library with the character features, and taking a keyword corresponding to the feature with the highest matching degree as a keyword identification result of the video frame corresponding to the character image.
In the above scheme, the segment prediction result includes a segment head probability that the video frame belongs to a segment head and a segment tail probability that the video frame belongs to a segment tail; the prediction module is further configured to:
performing convolution processing on video frames in the head candidate segment and the tail candidate segment through the machine learning model to obtain corresponding image characteristics;
and classifying the image features to obtain the head probability of the video frame in the head candidate segment and the tail probability of the video frame in the tail candidate segment.
In the above scheme, the boundary video frames include the last frame of the slice header and the first frame of the slice trailer; the second identification module is further configured to:
selecting a timestamp of a video frame with the largest timestamp from the video frames of the head candidate segments with the keyword identification result as a first head timestamp, and selecting a timestamp of a video frame with the smallest timestamp from the video frames of the tail candidate segments with the keyword identification result as a first tail timestamp;
selecting a timestamp corresponding to the video frame with the highest head probability and exceeding a first probability threshold value from the video frames of the head candidate segment as a second head timestamp, and selecting a timestamp corresponding to the video frame with the highest tail probability and exceeding a second probability threshold value from the video frames of the tail candidate segment as a second tail timestamp;
taking the larger timestamp of the first head timestamp and the second head timestamp as a head timestamp, and taking the smaller timestamp of the first tail timestamp and the second tail timestamp as a tail timestamp;
and taking the video frame corresponding to the head timestamp as the last frame of the slice header, and taking the video frame corresponding to the tail timestamp as the first frame of the slice trailer.
In the foregoing solution, the second identifying module is further configured to:
and when the boundary video frames are not identified from the plurality of video frames based on the keyword recognition result of each video frame and the segment prediction result of each video frame, taking the video head frame as the last frame of the slice header, and taking the video end frame as the first frame of the slice trailer.
In the foregoing solution, the apparatus for detecting a structure of a video file further includes a determining module, configured to:
taking the segment formed by the video frames whose timestamps lie between the timestamp of the video head frame and the timestamp of the last frame of the slice header as the slice header;
and taking the segment formed by the video frames whose timestamps lie between the timestamp of the first frame of the slice trailer and the timestamp of the video end frame as the slice trailer.
In the above solution, the structure detecting device for video files further includes a training module, configured to:
adding a label to each video frame in a video file sample based on the slice header timestamp and the slice trailer timestamp of the video file sample, wherein the label types include feature, slice header, and slice trailer;
extracting image features of each video frame;
based on the image characteristics of each video frame, carrying out forward propagation in the machine learning model to obtain a segment prediction result of each video frame;
determining a type of each of the video frames based on a segment prediction result of each of the video frames;
and performing back propagation in the machine learning model based on the type of each video frame and the error of the label of each video frame so as to update the parameters of the machine learning model.
An embodiment of the present application provides an electronic device, including:
a memory for storing executable instructions;
and the processor is used for realizing the structure detection method of the video file provided by the embodiment of the application when the executable instructions stored in the memory are executed.
The embodiment of the present application provides a computer-readable storage medium, which stores executable instructions for causing a processor to execute the method for detecting a structure of a video file provided in the embodiment of the present application.
The embodiment of the application has the following beneficial effects:
performing character recognition on the video frames in the candidate segments of the video file, and determining whether the video frames include keywords according to the recognition results; determining the segment prediction result of each video frame through the machine learning model; by combining the two schemes, the head and the tail of the video file can be located quickly and accurately according to the recognition results and the segment prediction results.
Drawings
Fig. 1 is a schematic structural diagram of a detection system 10 provided in an embodiment of the present application;
fig. 2 is a schematic structural diagram of a server 200 provided in an embodiment of the present application;
fig. 3 is a schematic flowchart of structure detection of a video file according to an embodiment of the present application;
FIG. 4 is a schematic diagram of an advertisement page of a variety show provided in an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a machine learning model provided by an embodiment of the present application;
fig. 6 is an interaction flow diagram of a method for detecting a structure of a video file according to an embodiment of the present application;
fig. 7A is a schematic view of a video playing page provided in an embodiment of the present application;
fig. 7B is a schematic view of a page for locating a slice header according to an embodiment of the present application;
fig. 7C is a schematic view of a page for locating a slice trailer according to an embodiment of the present application;
fig. 8A is a schematic flow chart of slice header timestamp detection provided by an embodiment of the present application;
fig. 8B is a schematic flowchart of end-of-segment timestamp detection according to an embodiment of the present application.
Detailed Description
In order to make the objectives, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings. The described embodiments should not be considered as limiting the present application, and all other embodiments obtained by a person of ordinary skill in the art without creative effort shall fall within the protection scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
Where "first/second/third" appears in the following description, the terms are used merely to distinguish similar objects and do not denote a particular order or importance. It will be appreciated that "first/second/third" may be interchanged, where permissible, in a particular order or sequence, so that the embodiments of the application described herein can be practiced in an order other than that shown or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.
Before further detailed description of the embodiments of the present application, the terms and expressions referred to in the embodiments of the present application are explained as follows.
1) Streaming Media (Streaming Media) is an emerging network transmission technology for sequentially transmitting and playing a continuous time-based data stream of multimedia content such as video/audio in real time over the internet. Compared with the network playing mode of watching after downloading, the streaming media is typically characterized in that continuous audio and video information is compressed and then put on a streaming media server, and a user can watch while downloading without waiting for the whole file to be downloaded.
2) The streaming media server is a key platform for an operator to provide video services to users. The main functions of the streaming media server are to collect, cache, schedule, transmit and play streaming media contents. The method can transmit the video file to the client by a streaming protocol for the user to watch on line; and real-time video streams can be received from video acquisition software and compression software and then are live broadcast to the client side through a streaming protocol.
3) Endpoint video frames: the first video frame (video head frame) and the last video frame (video end frame) of a video file.
4) Boundary video frames: the last frame of the slice header and the first frame of the slice trailer of a video file, i.e., the frames at the boundaries between the slice header and the feature and between the feature and the slice trailer.
Generally, a video file consists of a slice header (opening), the feature, and a slice trailer (ending), and many users prefer to skip the slice header and the slice trailer when watching, so many video clients provide the option of skipping them. In the related art, skipping of the slice header and the slice trailer is implemented by manually viewing the video file, marking the start time point of the feature (hereinafter, the slice header timestamp) and the end time point of the feature (hereinafter, the slice trailer timestamp), and then skipping based on the marked time points. This is not only very inefficient but also labor intensive.
In order to solve the technical problem of low detection efficiency caused by manual labeling in the related art, embodiments of the present application provide a method and an apparatus for detecting a structure of a video file, an electronic device, and a computer-readable storage medium, which can quickly locate a beginning and an end of a video file.
The structure detection method for the video file provided by the embodiment of the application can be implemented by various electronic devices, for example, a terminal or a server alone. For example, after the terminal downloads the complete video file, the structure detection method of the video file described below may be performed based on the complete video file. The structure detection method of the video file can also be cooperatively implemented by the server and the terminal. For example, after receiving a head-to-tail determination operation of a user, a terminal receives a video data packet from a server in real time, decompresses the video data packet to obtain a video file, and then executes a video file structure detection method on the video file. Or after receiving the operation of determining the head and the tail of the target video file by the user, the terminal sends a head and tail determining request to the server, so that the server executes the structure detection method of the video file for the stored target video file, determines the head and the tail of the target video file, and sends a data packet of a video frame between a head timestamp and a tail timestamp of the target video file to the terminal in real time.
The electronic device for detecting the structure of the video file, which is provided by the embodiment of the application, may be various types of terminal devices or servers, where the server may be an independent physical server (such as a streaming media server), a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server that provides basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, middleware service, a domain name service, a security service, a CDN, and a big data and artificial intelligence platform; the terminal may be, but is not limited to, a smart phone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, a smart watch, and the like. The terminal and the server may be directly or indirectly connected through wired or wireless communication, which is not limited in this embodiment of the application.
Taking a server as an example, a server cluster may be deployed in the cloud to open an artificial intelligence cloud service (AIaaS, AI as a Service) to users. The AIaaS platform splits several types of common AI services and provides independent or packaged services in the cloud. This service mode is similar to an AI-themed mall: all users may access one or more of the artificial intelligence services provided by the AIaaS platform through application programming interfaces.
For example, one of the artificial intelligence cloud services may be a structure detection service for video files, that is, a server in the cloud encapsulates the program for detecting the structure of a video file provided in the embodiments of the present application. In response to a user's operation of determining the head and tail of a video, the terminal calls the structure detection service for video files among the cloud services; the server deployed in the cloud invokes the encapsulated program, performs character recognition on a plurality of video frames of the candidate segments in the video file to obtain keyword recognition results, predicts the probabilities that the video frames belong to the slice header and the slice trailer through a machine learning model, determines the slice header and the slice trailer of the video file according to the keyword recognition results and the predicted probabilities, and finally may send data packets of the video file with the slice header and the slice trailer removed to the terminal.
The following describes an example of implementing the structure detection method of a video file provided in the embodiment of the present application by cooperation of a server and a terminal. Referring to fig. 1, fig. 1 is a schematic diagram of an architecture of a detection system 10 provided in an embodiment of the present application. The terminal 400 is connected to the server 200 through a network 300, and the network 300 may be a wide area network or a local area network, or a combination of both.
In some embodiments, the terminal 400 sends a head-and-tail determination request to the server 200 in response to a user's head-and-tail determination operation for a target video file, where the request carries identification information of the target video file. The server 200 determines the target video file according to the head-and-tail determination request and the identification information of the target video file, performs character recognition on a plurality of video frames of the candidate segments in the target video file to obtain keyword recognition results, predicts the probabilities that the video frames belong to the slice header and the slice trailer through a machine learning model, determines the slice header and the slice trailer of the target video file according to the keyword recognition results and the predicted probabilities, and finally sends data packets of the video file with the slice header and the slice trailer removed to the terminal 400 in real time.
In some embodiments, taking the electronic device provided in the embodiment of the present application as an example of the terminal 400, the terminal 400 implements the structure detection method for the video file provided in the embodiment of the present application by running a computer program, where the computer program may be a native program or a software module in an operating system; may be a local (Native) Application (APP), i.e. a program that needs to be installed in an operating system to run, such as a video client; or may be a browser that displays a video playback page in the form of a web page. In general, the computer programs described above may be any form of application, module or plug-in.
The following description will be given taking the electronic device provided in the embodiment of the present application as the server 200 described above as an example. Referring to fig. 2, fig. 2 is a schematic structural diagram of a server 200 according to an embodiment of the present application, where the server 200 shown in fig. 2 includes: at least one processor 410, memory 440, at least one network interface 420. The various components in server 200 are coupled together by a bus system 430. It is understood that the bus system 430 is used to enable connected communication between these components. The bus system 430 includes a power bus, a control bus, and a status signal bus in addition to the data bus. For clarity of illustration, however, the various buses are labeled as bus system 430 in fig. 2.
The Processor 410 may be an integrated circuit chip having Signal processing capabilities, such as a general purpose Processor, a Digital Signal Processor (DSP), or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like, wherein the general purpose Processor may be a microprocessor or any conventional Processor, or the like.
The memory 440 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard disk drives, optical disk drives, and the like. Memory 440 optionally includes one or more storage devices physically located remote from processor 410.
Memory 440 includes volatile memory or nonvolatile memory, and can include both volatile and nonvolatile memory. The nonvolatile memory may be a Read Only Memory (ROM), and the volatile memory may be a Random Access Memory (RAM). The memory 440 described in embodiments herein is intended to comprise any suitable type of memory.
In some embodiments, memory 440 is capable of storing data to support various operations, examples of which include programs, modules, and data structures, or subsets or supersets thereof, as exemplified below.
An operating system 441 including system programs for handling various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and handling hardware-based tasks;
a network communication module 442 for communicating to other computing devices via one or more (wired or wireless) network interfaces 420, exemplary network interfaces 420 including: bluetooth, wireless compatibility authentication (WiFi), and Universal Serial Bus (USB), etc.;
in some embodiments, the structure detection apparatus for video files provided in the embodiments of the present application may be implemented in software, and fig. 2 shows a structure detection apparatus 443 for video files stored in the memory 440, which may be software in the form of programs and plug-ins, and includes the following software modules: the extraction module 4431, the first recognition module 4432, the prediction module 4433, the second recognition module 4434, the determination module 4435, and the training module 4436 are logical and thus may be arbitrarily combined or further separated depending on the functions implemented. The functions of the respective modules will be explained below.
The structure detection method for a video file provided by the embodiment of the present application will be described below with reference to the accompanying drawings, where an execution subject of the method may be a server (e.g., a streaming media server), and specifically, the server may be implemented by running the above various computer programs; of course, as will be understood from the following description, it is obvious that the structure detection method for video files provided in the embodiments of the present application may also be implemented by a terminal or by cooperation of a terminal and a server.
Referring to fig. 3, fig. 3 is a schematic flowchart of structure detection of a video file according to an embodiment of the present application, and will be described with reference to the steps shown in fig. 3.
In step 101, candidate segments including endpoint video frames are extracted from a video file.
In some embodiments, the type of video file may be a movie, a television series, a variety show, an animation, a documentary, and so forth. Different types of video files differ in structure. For example, a movie typically has a slice header and a slice trailer, while a variety show may have only a slice header or only a slice trailer, or no explicit header or trailer at all. The header and trailer of a movie typically include the title and the names of the persons involved in the production, such as the director and the producer, whereas the first frame and the last frame of a variety show are usually advertisement pages, such as the advertisement page of a variety show shown in fig. 4, which includes a product 401, a spokesperson 402, a product icon 403, and a product poster 404.
In some embodiments, extracting candidate segments from a video file that include endpoint video frames may be implemented as follows: when the duration of the video file is greater than the sum of the lengths of the first preset time period and the last preset time period, extracting a first candidate segment which comprises a video frame and has the length of the first preset time period from the video file, and extracting a last candidate segment which comprises a video frame and has the length of the last preset time period from the video file.
For example, both the slice header preset time period and the slice trailer preset time period are 5 minutes (min). When the duration of the video file is 15 min, which is greater than 10 min, the video frames of the first 5 min of the video file are extracted as the slice header candidate segment, and the video frames of the last 5 min are extracted as the slice trailer candidate segment.
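The following is a minimal sketch of this candidate-segment extraction, assuming OpenCV-based frame decoding, a 5-minute window at each end, and 1-second sampling; the function and constant names are illustrative and not taken from the patent.

```python
import cv2

HEAD_WINDOW_S = 5 * 60    # slice header preset time period (assumed 5 min)
TAIL_WINDOW_S = 5 * 60    # slice trailer preset time period (assumed 5 min)
SAMPLE_INTERVAL_S = 1     # extract one frame per second

def extract_candidate_segments(video_path):
    """Return (header_candidates, trailer_candidates) as lists of
    (timestamp_seconds, frame) pairs sampled at a fixed interval."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    duration = cap.get(cv2.CAP_PROP_FRAME_COUNT) / fps

    # Only split off candidate segments when the file is long enough.
    if duration <= HEAD_WINDOW_S + TAIL_WINDOW_S:
        return [], []

    def sample(start_s, end_s):
        frames = []
        for t in range(int(start_s), int(end_s), SAMPLE_INTERVAL_S):
            cap.set(cv2.CAP_PROP_POS_MSEC, t * 1000)
            ok, frame = cap.read()
            if ok:
                frames.append((t, frame))
        return frames

    header_candidates = sample(0, HEAD_WINDOW_S)                      # contains the video head frame
    trailer_candidates = sample(duration - TAIL_WINDOW_S, duration)   # contains the video end frame
    cap.release()
    return header_candidates, trailer_candidates
```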
In step 102, character recognition is performed on a plurality of video frames in the candidate segment to obtain a keyword recognition result of each video frame.
In some embodiments, the character recognition in step 102 may be optical character recognition. Character recognition is performed on a plurality of video frames in the candidate segment to obtain a keyword recognition result of each video frame, and the steps 1021 to 1024 can be implemented as follows.
In step 1021, a plurality of video frames are extracted from the candidate segment at regular time intervals.
For example, one video frame may be extracted every 1 second(s) in the candidate segment. In some possible examples, when the duration of the slice head preset time period and the duration of the slice tail preset time period are short, such as 5s or 10s, the video frames may also be extracted frame by frame.
In step 1022, image preprocessing is performed on each video frame to obtain a corresponding binarized image.
In some embodiments, image preprocessing includes graying, binarization, normalization, and smoothing. Graying filters out the interference information carried by a color video frame; binarization further separates the character portion from the background; normalization unifies the characters in the video frame to the same size for subsequent matching, and includes position normalization, size normalization, and stroke-thickness normalization; smoothing makes the edges of the characters smoother.
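A minimal preprocessing sketch with OpenCV is shown below; the use of Gaussian smoothing, Otsu thresholding, and a fixed normalized size are assumptions of this sketch rather than requirements of the embodiment.

```python
import cv2

def preprocess_frame(frame, norm_size=(960, 540)):
    """Graying, smoothing, binarization and size normalization of one video frame."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)       # graying: drop color interference
    gray = cv2.GaussianBlur(gray, (3, 3), 0)             # smoothing: soften character edges
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)  # binarization
    return cv2.resize(binary, norm_size)                  # size normalization
```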
In step 1023, the binarized image is segmented to obtain a character image including a plurality of characters.
In some embodiments, the binarized image is segmented into different parts by a connected component analysis method, and the attribute of each part, such as text, image, or table, is labeled. The text part is then subjected to paragraph segmentation, line segmentation, and character segmentation to obtain a plurality of characters.
In step 1024, the character features of the characters in the character image are extracted, feature matching is performed based on the character features, and the keywords obtained through matching are used as the keyword recognition result of the video frame corresponding to the character image.
In some embodiments, the extracted character features include statistical features and structural features, wherein the structural features may include edge features, penetration features, transformation features, mesh features, and the like.
In some embodiments, after the character features of the plurality of characters in the character image are extracted, a keyword feature library is traversed and the features in the keyword feature library are matched against the character features. The keyword feature library includes features of keywords that may appear in the slice header, such as: Episode 1, issuing authority, drama review number, chief director, director, and the like; and keywords that may appear in the slice trailer, such as: next-episode preview, leading actor, specially invited actor, friendship appearance, special vocal performance, and the like. The matching method may adopt relaxed comparison, Euclidean-space comparison, or other methods. After matching, the keyword corresponding to the feature with the highest matching degree is taken as the keyword recognition result of the video frame corresponding to the character image.
Thus, for a video file such as a movie or a television series, which usually includes specific keywords in its slice header and slice trailer, character recognition determines whether a video frame containing a keyword exists in the header candidate segment and the trailer candidate segment.
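As an illustration only, the sketch below substitutes an off-the-shelf OCR engine (pytesseract) for the feature-library matching described above and checks the recognized text against assumed keyword lists; the patent's own matcher and keyword feature library may differ.

```python
import pytesseract

# Assumed keyword lists; the actual keyword feature library may differ.
HEADER_KEYWORDS = ["第1集", "出品", "发行", "总导演", "导演"]
TRAILER_KEYWORDS = ["下集预告", "领衔主演", "特别出演", "友情出演", "特别鸣谢"]

def recognize_keyword(binary_image, keywords):
    """Return the first keyword found in the frame's recognized text, or None."""
    text = pytesseract.image_to_string(binary_image, lang="chi_sim")
    for kw in keywords:
        if kw in text:
            return kw
    return None
```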
In step 103, image features are extracted for each video frame by a machine learning model.
In some embodiments, the machine learning model may be a convolutional neural network model, a deep neural network model, or the like. As shown in fig. 5, fig. 5 is a schematic structural diagram of a machine learning model provided in the embodiment of the present application. The machine learning model includes an input layer, convolutional layers, pooling layers, a fully connected layer, and an output layer. In the input layer, the video frames in the header candidate segment and the trailer candidate segment are subjected to a zero-mean preprocessing operation so that different features in the video frames have the same scale; that is, the mean value is subtracted from each pixel in the video frame to obtain a pixel matrix. In the convolutional layers, convolution operations are performed on the pixel matrix, i.e., different image features in the pixel matrix are extracted by different convolution kernels. In the pooling layers, the image features are downsampled through a pooling window to achieve data dimensionality reduction. There may be multiple convolutional layers and pooling layers.
In step 104, a segment prediction result for each video frame is determined based on the image characteristics of each video frame.
In some embodiments, the segment prediction result includes a head probability that the video frame belongs to the slice header and a tail probability that the video frame belongs to the slice trailer. In the fully connected layer of the machine learning model, the multi-dimensional image features are converted into one-dimensional features. Finally, in the output layer, the one-dimensional features are classified through a sigmoid function to obtain the probabilities that the video frame belongs to the slice header, the slice trailer, and the feature, respectively; that is, the head probability of the video frames in the header candidate segment and the tail probability of the video frames in the trailer candidate segment can be obtained. Therefore, the video frames that may belong to the slice header and those that may belong to the slice trailer can be determined quickly and accurately through the machine learning model, which is particularly suitable for variety-show video files whose header and trailer contain no keywords.
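A minimal classifier sketch in PyTorch is given below; the layer sizes and the softmax over three classes (header / feature / trailer) are choices of this sketch, whereas the embodiment describes a sigmoid output.

```python
import torch
import torch.nn as nn

class SegmentClassifier(nn.Module):
    """Predicts per-frame probabilities for slice header / feature / slice trailer."""
    def __init__(self, num_classes=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d((8, 8)),
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)

    def forward(self, x):
        # x: (batch, 3, H, W) zero-mean video frames
        feats = self.features(x).flatten(1)                    # image features -> one-dimensional features
        return torch.softmax(self.classifier(feats), dim=1)    # header / feature / trailer probabilities
```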
In some embodiments, the training process for the machine learning model is as follows: adding a label to each video frame in the video file sample based on a head-of-film timestamp and a tail-of-film timestamp of the video file sample, wherein the types of the labels comprise a feature film, a head-of-film and a tail-of-film; extracting image characteristics of each video frame; forward propagation is carried out in a machine learning model based on the image characteristics of each video frame to obtain the head probability and the tail probability of each video frame; determining the type of each video frame based on the head probability and the tail probability of each video frame; back-propagation is performed in the machine learning model based on the type of each video frame and the error of the label of each video frame to update the parameters of the machine learning model.
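A corresponding training-loop sketch, reusing the SegmentClassifier sketched above and assuming cross-entropy loss and the Adam optimizer (neither is specified by the embodiment):

```python
import torch
import torch.nn as nn

def train(model, loader, epochs=10, lr=1e-3):
    """loader yields (frames, labels), with labels in {0: header, 1: feature, 2: trailer}
    derived from the header and trailer timestamps of each video file sample."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for frames, labels in loader:
            logits = model.classifier(model.features(frames).flatten(1))  # forward propagation (pre-softmax)
            loss = criterion(logits, labels)    # error between predicted type and label
            optimizer.zero_grad()
            loss.backward()                     # back propagation
            optimizer.step()                    # update the model parameters
```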
In step 105, a boundary video frame is identified from the plurality of video frames based on the keyword identification result of each video frame and the segment prediction result of each video frame.
In some embodiments, identifying the boundary video frame from the plurality of video frames based on the keyword identification result of each video frame and the segment prediction result of each video frame may be implemented through steps 1051 to 1054 as follows.
In step 1051, from the video frames of the header candidate segment that have a keyword recognition result, the timestamp of the video frame with the largest timestamp is selected as the first head timestamp; and from the video frames of the trailer candidate segment that have a keyword recognition result, the timestamp of the video frame with the smallest timestamp is selected as the first tail timestamp.
For example, optical character recognition determines that the video frame with the timestamp of 1 s in the header candidate segment includes the keyword "chief director", the video frame with the timestamp of 3 min includes the keyword "producer", and so on; the timestamps of these video frames are compared, and the largest timestamp (e.g., 3 min) is taken as the first head timestamp. Meanwhile, optical character recognition determines that the video frame with the timestamp of 15 min in the trailer candidate segment includes the keyword "next-episode preview", the video frame with the timestamp of 17 min includes the keyword "special thanks", and so on; the smallest timestamp (e.g., 15 min) among the timestamps of the video frames in which keywords appear in the trailer candidate segment is taken as the first tail timestamp.
In step 1052, the timestamp corresponding to the video frame with the highest head probability and exceeding the first probability threshold is selected as the second head timestamp from the video frames of the head candidate segment, and the timestamp corresponding to the video frame with the highest tail probability and exceeding the second probability threshold is selected as the second tail timestamp from the video frames of the tail candidate segment.
For example, the first probability threshold and the second probability threshold are both 0.8. The machine learning model predicts that, among the video frames of the header candidate segment, the video frame with the largest head probability is the one with the timestamp of 3 min, and its head probability of 0.85 is greater than the first probability threshold; therefore, the timestamp 3 min is taken as the second head timestamp. Among the video frames of the trailer candidate segment, the video frame with the largest tail probability is the one with the timestamp of 14 min 30 s, and its tail probability of 0.9 is greater than the second probability threshold; therefore, the timestamp 14 min 30 s is taken as the second tail timestamp.
In step 1053, the larger timestamp of the first and second head timestamps is taken as the head timestamp, and the smaller timestamp of the first and second tail timestamps is taken as the tail timestamp.
For example, since the first head timestamp (3 min) is the same as the second head timestamp (3 min), the head timestamp may be determined to be 3 min. Since the second tail timestamp (14 min 30 s) is smaller than the first tail timestamp (15 min), the second tail timestamp, 14 min 30 s, is taken as the tail timestamp.
In step 1054, the video frame corresponding to the head timestamp is taken as the last frame of the slice header, and the video frame corresponding to the tail timestamp is taken as the first frame of the slice trailer.
In some embodiments, the boundary video frames are not identified from the plurality of video frames based on the keyword recognition results and the segment prediction results; that is, no header or trailer keyword is recognized in the video frames of the header candidate segment and the trailer candidate segment, no head probability of a video frame in the header candidate segment exceeds the first probability threshold, and no tail probability of a video frame in the trailer candidate segment exceeds the second probability threshold. In this case, the video head frame is taken as the last frame of the slice header, and the video end frame is taken as the first frame of the slice trailer.
In some embodiments, after the boundary video frames are identified, the slice header and the slice trailer in the video file may be determined based on the endpoint video frames and the boundary video frames: the segment formed by the video frames whose timestamps lie between the timestamp of the video head frame and the timestamp of the last frame of the slice header is taken as the slice header; and the segment formed by the video frames whose timestamps lie between the timestamp of the first frame of the slice trailer and the timestamp of the video end frame is taken as the slice trailer.
For example, the timestamp of the video head frame is 0, the timestamp of the last frame of the slice header is 3 min, the timestamp of the first frame of the slice trailer is 15 min, and the timestamp of the video end frame is 19 min. The segment formed by the video frames with timestamps between 0 and 3 min is taken as the slice header, and the segment formed by the video frames with timestamps between 15 min and 19 min is taken as the slice trailer.
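A sketch of the fusion rule of steps 1051 to 1054, including the fallback to the endpoint video frames when no boundary frame is identified; the data layout (dicts with 'ts', 'keyword', and probability keys) is assumed purely for illustration.

```python
def locate_boundaries(header_frames, trailer_frames, duration, t1=0.8, t2=0.8):
    """header_frames / trailer_frames: lists of dicts with keys
    'ts', 'keyword' (str or None), and 'p_header' / 'p_trailer'."""
    # First head timestamp: largest timestamp among keyword hits in the header window.
    kw_head = [f["ts"] for f in header_frames if f["keyword"]]
    # Second head timestamp: frame with the highest head probability above T1.
    best_head = max(header_frames, key=lambda f: f["p_header"], default=None)
    cnn_head = best_head["ts"] if best_head and best_head["p_header"] > t1 else None
    candidates = [t for t in (max(kw_head, default=None), cnn_head) if t is not None]
    head_ts = max(candidates) if candidates else 0.0          # fallback: video head frame

    # First / second tail timestamps, mirrored with min instead of max.
    kw_tail = [f["ts"] for f in trailer_frames if f["keyword"]]
    best_tail = max(trailer_frames, key=lambda f: f["p_trailer"], default=None)
    cnn_tail = best_tail["ts"] if best_tail and best_tail["p_trailer"] > t2 else None
    candidates = [t for t in (min(kw_tail, default=None), cnn_tail) if t is not None]
    tail_ts = min(candidates) if candidates else duration     # fallback: video end frame

    # Slice header = [0, head_ts]; slice trailer = [tail_ts, duration]
    return head_ts, tail_ts
```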
In some embodiments, the recognition may be assisted by a speech classification model. Speech features of candidate segments (a header speech candidate segment and a trailer speech candidate segment) are extracted from a speech file synchronized with the video file, and the probability that each speech frame in the candidate segments belongs to the speech header or the speech trailer is predicted based on the extracted speech features. The timestamp corresponding to the speech frame with the highest probability exceeding the first speech probability threshold is selected from the speech frames of the header speech candidate segment as a third head timestamp, and the timestamp corresponding to the speech frame with the highest probability exceeding the second speech probability threshold is selected from the speech frames of the trailer speech candidate segment as a third tail timestamp. The largest among the first, second, and third head timestamps is taken as the head timestamp, and the smallest among the first, second, and third tail timestamps is taken as the tail timestamp.
Therefore, the character recognition, the prediction on video frames by the machine learning model, and the prediction on speech frames by the machine learning model in the embodiment of the application can supplement one another and jointly determine the most accurate head timestamp and tail timestamp. In some possible examples, there is no text in the boundary video frame, or the included text does not belong to any keyword, for example, in a video file corresponding to a variety show whose boundary video frames are advertisement pages; in this case, the video frames are predicted mainly by the machine learning model.
In some other possible examples, the boundary video frame has no text or the included text does not belong to any keyword, and the boundary video frame is not an advertisement page or another common page that could be used to distinguish the slice header/trailer from the feature, for example, in a video file corresponding to an interview-type program. In this case, the first and last speech frames of the feature can be determined from the host's speech (e.g., "[program name] officially starts", "[program name] ends"), i.e., the speech frames are predicted through the machine learning model. Because the speech file is synchronized with the video file, the boundary video frames of the video file can also be determined. Therefore, through the comprehensive use of the three methods, the head timestamp and the tail timestamp of various types of video files can be determined, which greatly improves the accuracy and efficiency of locating the head and the tail of a video file.
It can be seen that, in the embodiment of the application, character recognition is performed on the video frames in the candidate segments of the video file, and whether a video frame includes a keyword can be determined according to the recognition result; the segment prediction result of each video frame is determined through the machine learning model, so that the video frames that may belong to the slice header and the slice trailer can be determined quickly and accurately. By combining the two schemes, the head and the tail of the video file can be located quickly and accurately according to the recognition results and the segment prediction results, and the feature content with the head and the tail removed can further be provided to the user.
Referring to fig. 6, fig. 6 is an interaction flow diagram of a method for detecting a structure of a video file according to an embodiment of the present application. The following describes a process of implementing the structure detection method for a video file provided by the embodiment of the present application by cooperation of a terminal and a server, with reference to steps 201 to 206 in fig. 6.
Step 201, the terminal sends a head-to-tail determination request carrying identification information of a target video file to a server.
The identification information of the target video file can be in the forms of video links and the like, a large number of video files are stored in the server, and the video files can be located based on the video links. The head-to-tail determination request is used for requesting the server to return a data packet of the target video file with the head-to-tail removed.
Step 202, the server determines the target video file according to the head-and-tail determination request and the identification information of the target video file.
Step 203, the server performs character recognition on a plurality of video frames of the candidate segment in the target video file to obtain a keyword recognition result.
And step 204, the server predicts the probability that the video frames are respectively the head and the tail of the film through a machine learning model.
And step 205, the server determines the head and the tail of the target video file according to the keyword recognition result and the predicted probability.
In step 206, the server sends the data packet of the video file with the slice header and the slice trailer removed to the terminal.
It should be noted that the above steps have been described in detail in the foregoing, and are not described again here.
Therefore, in the embodiment of the application, the terminal offloads the character recognition of the video frames in the target video file and the prediction of the probabilities that the video frames belong to the head and the tail to the server, so that the head and the tail of the target video file can be accurately determined while the computing pressure on the terminal is reduced, and the terminal finally obtains the video file with the head and the tail removed.
Next, an exemplary application of the embodiment of the present application in a practical application scenario will be described.
Referring to fig. 7A, 7B, and 7C, fig. 7A is a schematic view of a video playing page provided in the embodiment of the present application, fig. 7B is a schematic view of locating a slice header provided in the embodiment of the present application, and fig. 7C is a schematic view of locating a slice trailer provided in the embodiment of the present application. The page shown in fig. 7A mainly includes two parts, a video file 701 and video operations 702, where the video operations 702 include intelligent identification, intelligent editing, and intelligent review. When the video playing page shown in fig. 7A is entered, or after the user has watched part of the slice header, the user may click on the head-and-tail control 703 under intelligent editing, so that the playing page appears as shown in fig. 7B. Below the head-and-tail control 703 of fig. 7B, the head timestamp 0:23.00 and the tail timestamp 15:40.04 are presented, and the video file 701 skips the slice header and plays the video starting from the head timestamp. Because the tail timestamp is 15:40.04, in fig. 7C, when playback reaches 15:40, playback stops.
The following describes how to determine the head time stamp and the end time stamp of a video file.
In the embodiment of the application, when it is detected that the duration of the video file is greater than 10 min, the head timestamp is detected only in the first 5 min of the video file, and the tail timestamp is detected only in the last 5 min. Video frames are extracted at fixed time intervals from the first 5 min and the last 5 min of video content, for example, one frame every 1 s.
Referring to fig. 8A, fig. 8A is a schematic flowchart of detecting a slice start timestamp provided in an embodiment of the present application, and the determining of the slice start timestamp will be described with reference to the steps shown in fig. 8A.
In step 801, the server parses the video frame.
The server stores a video file to be analyzed, and analyzes a plurality of video frames in the first 5min of the video file to acquire relevant information of the video frames, such as timestamps corresponding to the video frames.
In step 802, the server determines whether the timestamp of the video frame is greater than 5min, if so, performs step 807, and if not, performs step 803.
In step 803, the server performs character recognition on the video frame to obtain a recognition result.
The server performs character recognition on the currently parsed video frame and determines, based on the recognition result, whether a slice header keyword exists in the video frame. In some possible examples, the character recognition may be optical character recognition. The slice header keywords may include: Episode 1, issuing authority, press, publication, radio and television bureau (culture, radio and television bureau), distribution license number (record number), drama review number, chief director, director, and the like.
In step 804, the server predicts the head of slice probability of the video frame through a machine learning model.
The machine learning model for predicting the film head probability in step 804 is obtained by training a video file sample, the machine learning model may be a convolutional neural network model, a deep neural network model, or the like, and the video file sample may be a movie, a television show, a variety, a cartoon, a documentary, or the like. The slice-first probability is the probability that the current video frame is the slice-first.
It should be noted that step 804 may be executed simultaneously with step 803 or prior to step 803.
In step 805, the server determines whether a slice header keyword exists in the video frame or whether the slice header probability is greater than a first probability threshold T1, if yes, step 806 is executed, and if not, step 801 is executed.
The first probability threshold T1 is a preset value, and may be 0.5, 0.6, 0.7, 0.8, etc. When the slice header probability of the current video frame is greater than the first probability threshold T1, the current video frame is likely to belong to the slice header. Step 806 is performed when the video frame satisfies either of the two conditions of "there is a slice head keyword" and "the slice head probability is greater than the first probability threshold T1".
In step 806, the server records the timestamp of the current video frame and performs step 801.
After recording the time stamp, the server performs step 801, i.e. starts processing the next video frame.
In step 807, the server determines whether there is a video frame containing a slice header keyword or a slice header probability greater than T1 in the parsed video frame, if so, step 808 is executed, and if not, step 809 is executed.
When the timestamp of the next video frame parsed by the server is greater than 5 min, that frame is not processed. The server then determines whether any timestamp has been recorded, that is, whether any video frame parsed within the first 5 min contains a slice header keyword or has a slice header probability greater than T1.
In step 808, the server compares the timestamps of the video frames, and takes the largest timestamp as the head-of-slice timestamp.
The slice header timestamp is the timestamp corresponding to the last frame of the slice header.
In step 809, the server takes 0 as the head-of-slice timestamp.
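Steps 801 to 809 can be summarized in the following sketch, which records the timestamp of every sampled frame within the first 5 minutes that contains a slice header keyword or whose slice header probability exceeds T1, and then returns the largest recorded timestamp, or 0 when nothing was recorded; the helper functions passed in are assumptions.

    # Sketch: slice header timestamp = largest qualifying timestamp in the first 5 minutes.
    def detect_head_timestamp(samples, find_head_keyword, head_probability, t1=0.7):
        recorded = []
        for ts, frame in samples:
            if ts > 300.0:                                   # only the first 5 minutes are examined
                break
            if find_head_keyword(frame) or head_probability(frame) > t1:
                recorded.append(ts)                          # step 806: record this timestamp
        return max(recorded) if recorded else 0.0            # steps 808 / 809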
Referring to fig. 8B, fig. 8B is a schematic flowchart of detecting a trailer timestamp according to an embodiment of the present application, and the determining of the trailer timestamp will be described with reference to the steps shown in fig. 8B.
In step 901, the server parses the video frame.
After the server has finished parsing the video frames within the first 5 min of the video file, it jumps directly to the video frames within the last 5 min and skips the middle portion, which greatly speeds up character recognition.
In step 902, the server performs character recognition on the video frame to obtain a recognition result.
The server performs character recognition on the currently parsed video frame and determines, based on the recognition result, whether a slice trailer keyword exists in the video frame. In some possible examples, the slice trailer keywords may include: next-episode preview, lead actor, co-starring actor, specially invited actor, guest appearance, cast and crew list, next episode, special thanks, airs tomorrow, stay tuned, and the like.
In step 903, the server predicts the slice trailer probability of the video frame through a machine learning model.
The slice trailer probability is the probability that the current video frame belongs to the slice trailer. It should be noted that step 903 may be executed simultaneously with step 902 or before step 902.
In step 904, the server determines whether a slice trailer keyword exists in the video frame or whether the slice trailer probability is greater than a second probability threshold T2; if yes, step 905 is performed, and if not, step 906 is performed.
Because the slice trailer timestamp is detected frame by frame in ascending order of timestamps, when a slice trailer keyword is detected in the current video frame, or the slice trailer probability is greater than the second probability threshold T2, the timestamp of the current video frame is likely to be the slice trailer timestamp.
In step 905, the server outputs the timestamp of the current video frame as the slice trailer timestamp.
The slice trailer timestamp is the timestamp corresponding to the first frame of the slice trailer.
In step 906, the server determines whether the video file is finished, if so, performs step 907, and if not, performs step 901.
When the timestamp of the currently processed video frame is equal to the total duration of the video file, the server determines that the video file is finished. When no slice trailer keyword exists in the current video frame, the slice trailer probability is not greater than the second probability threshold T2, and the video file is not finished, the server executes step 901, that is, starts to process the next video frame.
In step 907, the server takes the total duration of the video file as the slice trailer timestamp.
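Steps 901 to 907 can likewise be summarized in the following sketch, which scans the frames of the last 5 minutes in ascending timestamp order and returns the first timestamp whose frame contains a slice trailer keyword or whose slice trailer probability exceeds T2, falling back to the total duration otherwise; the helper names are assumptions.

    # Sketch: slice trailer timestamp = first qualifying timestamp in the last 5 minutes.
    def detect_tail_timestamp(tail_samples, find_tail_keyword, tail_probability,
                              duration_s, t2=0.7):
        for ts, frame in sorted(tail_samples, key=lambda s: s[0]):
            if find_tail_keyword(frame) or tail_probability(frame) > t2:
                return ts                      # step 905: first hit is the trailer start
        return duration_s                      # step 907: no trailer detected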
After the slice header timestamp and the slice trailer timestamp of the video file are obtained, the server locates the slice header and the slice trailer of the video file according to the two timestamps, compresses the video file together with the slice header timestamp and the slice trailer timestamp, and sends the compressed video file, the slice header timestamp, and the slice trailer timestamp to the user terminal in the form of a data packet. The user terminal decompresses the received packet and starts playing from the slice header timestamp, thereby presenting the playing page shown in fig. 7B. When playback reaches the slice trailer timestamp, playback stops, as shown in fig. 7C.
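The packaging step can be illustrated with the following sketch; the packet layout (a length-prefixed JSON header followed by the gzip-compressed video data) is an assumption made for illustration and is not specified by the embodiment.

    # Sketch: bundle the compressed video and the two timestamps into one data packet.
    import gzip
    import json
    import struct

    def build_packet(video_bytes, head_ts, tail_ts):
        meta = json.dumps({"head_ts": head_ts, "tail_ts": tail_ts}).encode("utf-8")
        body = gzip.compress(video_bytes)
        # 4-byte big-endian metadata length, then the metadata, then the compressed video
        return struct.pack(">I", len(meta)) + meta + body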
Continuing with the exemplary structure of the structure detection apparatus 443 for video files provided in the embodiments of the present application implemented as software modules, in some embodiments, as shown in fig. 2, the software modules stored in the structure detection apparatus 443 for video files in the memory 440 may include: an extraction module 4431, configured to extract candidate segments including endpoint video frames from the video file; a first identification module 4432, configured to perform character recognition on a plurality of video frames in the candidate segments to obtain a keyword recognition result of each video frame; a prediction module 4433, configured to extract image features of each video frame through a machine learning model, and determine a segment prediction result of each video frame based on the image features of each video frame; and a second identification module 4434, configured to identify a boundary video frame from the plurality of video frames based on the keyword recognition result of each video frame and the segment prediction result of each video frame.
In some embodiments, the endpoint video frames include a video head frame and a video tail frame of the video file; the extraction module 4431 is further configured to: when the duration of the video file is greater than the sum of the lengths of a slice head preset time period and a slice tail preset time period, extract a slice head candidate segment which includes the video head frame and has the length of the slice head preset time period from the video file, and extract a slice tail candidate segment which includes the video tail frame and has the length of the slice tail preset time period from the video file.
In some embodiments, the first identification module 4432 is further configured to: extracting a plurality of video frames from the candidate segments at fixed time intervals; carrying out image preprocessing on each video frame to obtain a corresponding binary image; carrying out segmentation processing on the binary image to obtain a character image containing a plurality of characters; extracting character features of a plurality of characters in the character image, performing feature matching based on the character features, and taking keywords obtained by matching as a keyword recognition result of a video frame corresponding to the character image.
In some embodiments, the first identification module 4432 is further configured to: and traversing the keyword feature library to match the features in the keyword feature library with the character features, and taking the keyword corresponding to the feature with the highest matching degree as a keyword identification result of the video frame corresponding to the character image.
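As an illustrative sketch of this preprocessing pipeline (not part of the claims), binarization and character segmentation can be expressed with OpenCV as follows; the Otsu thresholding and the minimum region size are assumptions.

    # Sketch: grayscale -> binary image -> candidate character regions.
    import cv2

    def binarize_and_segment(frame_bgr):
        gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
        _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
        contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        chars = []
        for c in contours:
            x, y, w, h = cv2.boundingRect(c)
            if w > 5 and h > 5:                              # drop tiny noise regions
                chars.append(binary[y:y + h, x:x + w])       # one candidate character image
        return binary, chars

The character features of each candidate region would then be matched against the keyword feature library as described above.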
In some embodiments, the segment prediction result comprises a segment head probability that the video frame belongs to a segment head and a segment tail probability that the video frame belongs to a segment tail; the prediction module 4433 is further configured to: performing convolution processing on video frames in the head candidate segment and the tail candidate segment through a machine learning model to obtain corresponding image characteristics; and classifying the image characteristics to obtain the head probability of the video frame in the head candidate segment and the tail probability of the video frame in the tail candidate segment.
In some embodiments, the boundary video frame includes a slice header end frame and a slice trailer start frame; the second identification module 4434 is further configured to: select, from the video frames of the slice head candidate segment for which a keyword recognition result is obtained, the timestamp of the video frame with the largest timestamp as a first slice header timestamp, and select, from the video frames of the slice tail candidate segment for which a keyword recognition result is obtained, the timestamp of the video frame with the smallest timestamp as a first slice trailer timestamp; select, from the video frames of the slice head candidate segment, the timestamp corresponding to the video frame whose slice header probability is the highest and exceeds the first probability threshold as a second slice header timestamp, and select, from the video frames of the slice tail candidate segment, the timestamp corresponding to the video frame whose slice trailer probability is the highest and exceeds the second probability threshold as a second slice trailer timestamp; take the larger of the first slice header timestamp and the second slice header timestamp as the slice header timestamp, and take the smaller of the first slice trailer timestamp and the second slice trailer timestamp as the slice trailer timestamp; and take the video frame corresponding to the slice header timestamp as the slice header end frame, and the video frame corresponding to the slice trailer timestamp as the slice trailer start frame.
In some embodiments, the second identification module 4434 is further configured to: when no boundary video frame is identified from the plurality of video frames based on the keyword recognition result of each video frame and the segment prediction result of each video frame, take the video head frame as the slice header end frame and take the video tail frame as the slice trailer start frame.
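The fusion rule of the second identification module 4434, including a fallback in the spirit of the preceding paragraph, can be sketched as follows; timestamps are in seconds, None means a timestamp was not found, and the names are assumptions.

    # Sketch: fuse keyword-based and model-based timestamps into the final boundaries.
    def fuse_timestamps(first_head_ts, second_head_ts, first_tail_ts, second_tail_ts,
                        video_start=0.0, video_end=None):
        head_candidates = [t for t in (first_head_ts, second_head_ts) if t is not None]
        tail_candidates = [t for t in (first_tail_ts, second_tail_ts) if t is not None]
        head_ts = max(head_candidates) if head_candidates else video_start   # later bound wins
        tail_ts = min(tail_candidates) if tail_candidates else video_end     # earlier bound wins
        return head_ts, tail_ts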
In some embodiments, the structure detection apparatus for video files further includes a determining module 4435, configured to: take a segment formed by the video frames whose timestamps lie between the timestamp corresponding to the video head frame and the timestamp corresponding to the slice header end frame as the slice header; and take a segment formed by the video frames whose timestamps lie between the timestamp corresponding to the slice trailer start frame and the timestamp corresponding to the video tail frame as the slice trailer.
In some embodiments, the structure detection apparatus for video files further includes a training module 4436, configured to: add a label to each video frame in a video file sample based on the slice header timestamp and the slice trailer timestamp of the video file sample, where the labels include feature (main content), slice header, and slice trailer; extract image features of each video frame; perform forward propagation in the machine learning model based on the image features of each video frame to obtain a segment prediction result of each video frame; determine the type of each video frame based on the segment prediction result of each video frame; and perform back propagation in the machine learning model based on the error between the type of each video frame and the label of each video frame, so as to update the parameters of the machine learning model.
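A minimal training sketch for the training module 4436 is given below; it labels frames from the sample's slice header and slice trailer timestamps and updates the classifier by backpropagation. The data handling and the SegmentClassifier from the earlier sketch are assumptions.

    # Sketch: label frames from the sample's timestamps and run one training step.
    import torch
    import torch.nn as nn

    def label_frame(ts, head_ts, tail_ts):
        if ts <= head_ts:
            return 1                      # slice header
        if ts >= tail_ts:
            return 2                      # slice trailer
        return 0                          # feature (main content)

    def train_step(model, optimizer, frames, timestamps, head_ts, tail_ts):
        labels = torch.tensor([label_frame(t, head_ts, tail_ts) for t in timestamps])
        logits = model.head(model.features(frames).flatten(1))      # forward propagation
        loss = nn.functional.cross_entropy(logits, labels)          # error against the labels
        optimizer.zero_grad()
        loss.backward()                                             # back propagation
        optimizer.step()                                            # update model parameters
        return loss.item()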
The embodiment of the present application provides a storage medium storing executable instructions which, when executed by a processor, cause the processor to perform the method provided by the embodiments of the present application, for example, the structure detection method for a video file shown in fig. 3.
In some embodiments, the storage medium may be a memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disc, or CD-ROM, or may be various devices including one or any combination of the above memories.
In some embodiments, executable instructions may be written in any form of programming language (including compiled or interpreted languages), in the form of programs, software modules, scripts or code, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
By way of example, executable instructions may correspond, but do not necessarily have to correspond, to files in a file system, and may be stored in a portion of a file that holds other programs or data, for example, in one or more scripts in a HyperText Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, subprograms, or portions of code).
By way of example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices at one site or distributed across multiple sites and interconnected by a communication network.
In summary, in the embodiments of the present application, character recognition is performed on the video frames in the candidate segments of a video file, so that whether a video frame contains a keyword can be determined from the recognition result; and the segment prediction result of each video frame is determined through a machine learning model, so that the video frames likely to belong to the slice header and those likely to belong to the slice trailer can be determined quickly and accurately. By combining the two schemes, the slice header and the slice trailer of the video file can be located quickly and accurately according to the recognition results and the segment prediction results, and the feature content with the slice header and the slice trailer removed can further be provided to the user.
The above description is only an example of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present application are included in the protection scope of the present application.

Claims (10)

1. A method for detecting a structure of a video file, the method comprising:
extracting candidate segments including endpoint video frames from a video file;
performing character recognition on a plurality of video frames in the candidate segment to obtain a keyword recognition result of each video frame;
extracting image characteristics of each video frame through a machine learning model, and determining a segment prediction result of each video frame based on the image characteristics of each video frame;
and identifying a boundary video frame from the plurality of video frames based on the keyword identification result of each video frame and the segment prediction result of each video frame.
2. The method of claim 1, wherein the endpoint video frames comprise a video head frame and a video end frame of the video file; the extracting the candidate segment including the endpoint video frame from the video file comprises:
when the duration of the video file is greater than the sum of the lengths of a slice head preset time period and a slice tail preset time period, extracting a slice head candidate segment which comprises the video head frame and has the length of the slice head preset time period from the video file, and extracting a slice tail candidate segment which comprises the video tail frame and has the length of the slice tail preset time period from the video file.
3. The method according to claim 1, wherein said performing character recognition on a plurality of video frames in the candidate segment to obtain a keyword recognition result of each of the video frames comprises:
extracting a plurality of video frames from the candidate segments at fixed time intervals;
carrying out image preprocessing on each video frame to obtain a corresponding binary image;
carrying out segmentation processing on the binary image to obtain a character image containing a plurality of characters;
extracting character features of a plurality of characters in the character image, performing feature matching based on the character features, and taking keywords obtained by matching as a keyword recognition result of the video frame corresponding to the character image.
4. The method according to claim 3, wherein the performing feature matching based on the character features, and using the matched keyword as a result of keyword recognition of the video frame corresponding to the character image comprises:
traversing a keyword feature library to match features in the keyword feature library with the character features, and taking a keyword corresponding to the feature with the highest matching degree as a keyword identification result of the video frame corresponding to the character image.
5. The method according to claim 2, wherein the segment prediction result comprises a head-of-segment probability that the video frame belongs to a head-of-segment and a tail-of-segment probability that the video frame belongs to a tail-of-segment; the extracting, by a machine learning model, image features for each of the video frames and determining a segment prediction result for each of the video frames based on the image features of each of the video frames includes:
performing convolution processing on video frames in the head candidate segment and the tail candidate segment through the machine learning model to obtain corresponding image characteristics;
and classifying the image features to obtain the head probability of the video frame in the head candidate segment and the tail probability of the video frame in the tail candidate segment.
6. The method of claim 5, wherein the boundary video frame comprises a slice header end frame and a slice trailer start frame; the identifying a boundary video frame from the plurality of video frames based on the keyword identification result of each video frame and the segment prediction result of each video frame comprises:
selecting a timestamp of a video frame with the largest timestamp from the video frames of the head candidate segments with the keyword identification result as a first head timestamp, and selecting a timestamp of a video frame with the smallest timestamp from the video frames of the tail candidate segments with the keyword identification result as a first tail timestamp;
selecting a timestamp corresponding to the video frame with the highest head probability and exceeding a first probability threshold value from the video frames of the head candidate segment as a second head timestamp, and selecting a timestamp corresponding to the video frame with the highest tail probability and exceeding a second probability threshold value from the video frames of the tail candidate segment as a second tail timestamp;
taking the larger timestamp of the first head timestamp and the second head timestamp as a head timestamp, and taking the smaller timestamp of the first tail timestamp and the second tail timestamp as a tail timestamp;
and taking the video frame corresponding to the head timestamp as the slice header end frame, and taking the video frame corresponding to the tail timestamp as the slice trailer start frame.
7. The method of claim 6, further comprising:
and when the boundary video frame is not identified from the plurality of video frames based on the keyword identification result of each video frame and the segment prediction result of each video frame, taking the video head frame as the slice header end frame, and taking the video tail frame as the slice trailer start frame.
8. The method of claim 7, wherein after identifying a boundary video frame from the plurality of video frames based on the keyword identification result of each of the video frames and the segment prediction result of each of the video frames, the method further comprises:
taking a segment formed by the video frames whose timestamps lie between the timestamp corresponding to the video head frame and the timestamp corresponding to the slice header end frame as the slice header;
and taking a segment formed by the video frames whose timestamps lie between the timestamp corresponding to the slice trailer start frame and the timestamp corresponding to the video tail frame as the slice trailer.
9. The method of any of claims 1 to 8, wherein prior to said extracting candidate segments from the video file that include end-point video frames, the method further comprises:
adding a label to each video frame in a video file sample based on a head-of-slice timestamp and a tail-of-slice timestamp of the video file sample, wherein the label comprises feature (main content), slice header, and slice trailer;
extracting image features of each video frame;
based on the image characteristics of each video frame, carrying out forward propagation in the machine learning model to obtain a segment prediction result of each video frame;
determining a type of each of the video frames based on a segment prediction result of each of the video frames;
and performing back propagation in the machine learning model based on the type of each video frame and the error of the label of each video frame so as to update the parameters of the machine learning model.
10. An apparatus for detecting a structure of a video file, comprising:
an extraction module for extracting candidate segments including endpoint video frames from a video file;
the first identification module is used for carrying out character identification on a plurality of video frames in the candidate segment to obtain a keyword identification result of each video frame;
the prediction module is used for extracting image characteristics of each video frame through a machine learning model and determining a segment prediction result of each video frame based on the image characteristics of each video frame;
and the second identification module is used for identifying a boundary video frame from the plurality of video frames based on the keyword identification result of each video frame and the segment prediction result of each video frame.
CN202011181785.9A 2020-10-29 2020-10-29 Method and device for detecting structure of video file Active CN112291589B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011181785.9A CN112291589B (en) 2020-10-29 2020-10-29 Method and device for detecting structure of video file


Publications (2)

Publication Number Publication Date
CN112291589A true CN112291589A (en) 2021-01-29
CN112291589B CN112291589B (en) 2023-09-22

Family

ID=74353469

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011181785.9A Active CN112291589B (en) 2020-10-29 2020-10-29 Method and device for detecting structure of video file

Country Status (1)

Country Link
CN (1) CN112291589B (en)



Patent Citations (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030200548A1 (en) * 2001-12-27 2003-10-23 Paul Baran Method and apparatus for viewer control of digital TV program start time
CN102075695A (en) * 2010-12-30 2011-05-25 中国科学院自动化研究所 New generation intelligent cataloging system and method facing large amount of broadcast television programs
US9465996B1 (en) * 2015-09-15 2016-10-11 Echostar Technologies Llc Apparatus, systems and methods for control of media content event recording
US20180332336A1 (en) * 2015-09-15 2018-11-15 Echostar Technologies L.L.C. Apparatus, systems and methods for control of media content event recording
CN108769731A (en) * 2018-05-25 2018-11-06 北京奇艺世纪科技有限公司 The method, apparatus and electronic equipment of target video segment in a kind of detection video
CN108924586A (en) * 2018-06-20 2018-11-30 北京奇艺世纪科技有限公司 A kind of detection method of video frame, device and electronic equipment
WO2020015492A1 (en) * 2018-07-18 2020-01-23 腾讯科技(深圳)有限公司 Method and device for identifying key time point of video, computer apparatus and storage medium
CN108810620A (en) * 2018-07-18 2018-11-13 腾讯科技(深圳)有限公司 Identify method, computer equipment and the storage medium of the material time point in video
CN108965980A (en) * 2018-07-20 2018-12-07 腾讯科技(深圳)有限公司 Recommendation display methods, device, terminal and storage medium
CN110798752A (en) * 2018-08-03 2020-02-14 北京京东尚科信息技术有限公司 Method and system for generating video summary
WO2020143156A1 (en) * 2019-01-11 2020-07-16 平安科技(深圳)有限公司 Hotspot video annotation processing method and apparatus, computer device and storage medium
US20200302294A1 (en) * 2019-03-22 2020-09-24 Nec Laboratories America, Inc. Efficient and fine-grained video retrieval
CN110121118A (en) * 2019-06-17 2019-08-13 腾讯科技(深圳)有限公司 Video clip localization method, device, computer equipment and storage medium
US10706286B1 (en) * 2019-09-05 2020-07-07 Alphonso Inc. 3D convolutional neural networks for television advertisement detection
CN111083526A (en) * 2019-12-31 2020-04-28 广州酷狗计算机科技有限公司 Video transition method and device, computer equipment and storage medium
CN111147891A (en) * 2019-12-31 2020-05-12 杭州威佩网络科技有限公司 Method, device and equipment for acquiring information of object in video picture
CN111488487A (en) * 2020-03-20 2020-08-04 西南交通大学烟台新一代信息技术研究院 Advertisement detection method and detection system for all-media data
CN111372116A (en) * 2020-03-27 2020-07-03 咪咕文化科技有限公司 Video playing prompt information processing method and device, electronic equipment and storage medium
CN111479130A (en) * 2020-04-02 2020-07-31 腾讯科技(深圳)有限公司 Video positioning method and device, electronic equipment and storage medium
CN111432140A (en) * 2020-06-15 2020-07-17 成都索贝数码科技股份有限公司 Method for splitting television news into strips by using artificial neural network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
坤坤 (KUN KUN): "Playing with RMVB video clipping: an illustrated guide to the RealMedia Editor video editing software", Computer Fan (Popular Edition), No. 02 *
欧国斌 (OU Guobin), 张利 (ZHANG Li), 谢攀 (XIE Pan): "Method for extracting real-time caption information from video signals", Journal of Tsinghua University (Science and Technology), No. 07 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113347489A (en) * 2021-07-09 2021-09-03 北京百度网讯科技有限公司 Video clip detection method, device, equipment and storage medium
CN113347489B (en) * 2021-07-09 2022-11-18 北京百度网讯科技有限公司 Video clip detection method, device, equipment and storage medium
CN115988164A (en) * 2022-12-03 2023-04-18 北京视通科技有限公司 Conference room multimedia control method, system and computer equipment
CN117058596A (en) * 2023-10-11 2023-11-14 上海凯翔信息科技有限公司 Video processing system for acquiring title
CN117058596B (en) * 2023-10-11 2023-12-29 上海凯翔信息科技有限公司 Video processing system for acquiring title

Also Published As

Publication number Publication date
CN112291589B (en) 2023-09-22

Similar Documents

Publication Publication Date Title
EP3477506B1 (en) Video detection method, server and storage medium
CN112291589B (en) Method and device for detecting structure of video file
CN109582945B (en) Article generation method, article generation device and storage medium
US8972840B2 (en) Time ordered indexing of an information stream
US9940972B2 (en) Video to data
US20130346412A1 (en) System and method of detecting common patterns within unstructured data elements retrieved from big data sources
CN109862397B (en) Video analysis method, device, equipment and storage medium
CN114297439B (en) Short video tag determining method, system, device and storage medium
CN107943811B (en) Content publishing method and device
WO2023045635A1 (en) Multimedia file subtitle processing method and apparatus, electronic device, computer-readable storage medium, and computer program product
CN111368141A (en) Video tag expansion method and device, computer equipment and storage medium
CN113411674A (en) Video playing control method and device, electronic equipment and storage medium
CN113992970A (en) Video data processing method and device, electronic equipment and computer storage medium
CN111488813A (en) Video emotion marking method and device, electronic equipment and storage medium
CN114372172A (en) Method and device for generating video cover image, computer equipment and storage medium
CN112925905B (en) Method, device, electronic equipment and storage medium for extracting video subtitles
CN114286169A (en) Video generation method, device, terminal, server and storage medium
CN114845149A (en) Editing method of video clip, video recommendation method, device, equipment and medium
CN113407775A (en) Video searching method and device and electronic equipment
Nixon et al. Data-driven personalisation of television content: a survey
RU2530671C1 (en) Checking method of web pages for content in them of target audio and/or video (av) content of real time
CN117009577A (en) Video data processing method, device, equipment and readable storage medium
KR20220079029A (en) Method for providing automatic document-based multimedia content creation service
Gibbon et al. Automated content metadata extraction services based on MPEG standards
CN112101197A (en) Face information acquisition method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant