CN112291589B - Method and device for detecting structure of video file - Google Patents


Info

Publication number
CN112291589B
CN112291589B (application CN202011181785.9A)
Authority
CN
China
Prior art keywords
video
head
tail
video frame
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011181785.9A
Other languages
Chinese (zh)
Other versions
CN112291589A
Inventor
孙祥学
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology (Shenzhen) Co Ltd
Priority to CN202011181785.9A
Publication of CN112291589A
Application granted
Publication of CN112291589B
Legal status: Active

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234 Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs
    • H04N21/23418 Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/44008 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83 Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/845 Structuring of content, e.g. decomposing content into time segments
    • H04N21/8455 Structuring of content, e.g. decomposing content into time segments involving pointers to the content, e.g. pointers to the I-frames of the video stream

Abstract

The application provides a structure detection method and apparatus for a video file, an electronic device, and a computer-readable storage medium, relating to artificial intelligence technology. The method includes: extracting candidate segments including endpoint video frames from the video file; performing character recognition on a plurality of video frames in the candidate segments to obtain a keyword recognition result for each video frame; extracting image features from each video frame through a machine learning model and determining a segment prediction result for each video frame based on its image features; and identifying a demarcation video frame from the plurality of video frames based on the keyword recognition result and the segment prediction result of each video frame. With the method and apparatus, the head and tail of the video file can be located rapidly.

Description

Method and device for detecting structure of video file
Technical Field
The present application relates to artificial intelligence technology, and in particular, to a method and apparatus for detecting a structure of a video file, an electronic device, and a computer readable storage medium.
Background
Artificial Intelligence (AI) is a comprehensive technology of computer science that studies the design principles and implementation methods of various intelligent machines so that machines can perceive, reason, and make decisions.
Visual recognition is an important application of AI. For example, the head and tail of a video file can be determined by visual recognition, providing intermediate data for applications based on the head and tail recognition result, such as skipping the head and tail during playback.
In the related art, there is no scheme that detects the structure of a video file to rapidly locate its head and tail; each part of a video file is mainly marked manually, which is inefficient.
Disclosure of Invention
The embodiment of the application provides a method and a device for detecting the structure of a video file, electronic equipment and a computer readable storage medium, which can rapidly locate the head and tail of the video file.
The technical scheme of the embodiment of the application is realized as follows:
the embodiment of the application provides a structure detection method of a video file, which comprises the following steps:
extracting candidate segments including endpoint video frames from the video file;
character recognition is carried out on a plurality of video frames in the candidate segment, and a keyword recognition result of each video frame is obtained;
extracting image features from each video frame through a machine learning model, and determining a segment prediction result of each video frame based on the image features of each video frame;
And identifying a demarcation video frame from the plurality of video frames based on the keyword identification result of each video frame and the segment prediction result of each video frame.
The embodiment of the application provides a structure detection device of a video file, which comprises the following components:
an extraction module for extracting candidate segments including endpoint video frames from a video file;
the first recognition module is used for carrying out character recognition on a plurality of video frames in the candidate segment to obtain a keyword recognition result of each video frame;
a prediction module for extracting image features from each of the video frames through a machine learning model, and determining a segment prediction result of each of the video frames based on the image features of each of the video frames;
and the second identification module is used for identifying the demarcation video frames from the plurality of video frames based on the keyword identification result of each video frame and the segment prediction result of each video frame.
In the above scheme, the endpoint video frame includes a video first frame and a video last frame of the video file; the extraction module is further configured to:
and when the duration of the video file is longer than the sum of the length of the first preset time period and the length of the last preset time period, extracting a first candidate segment which comprises the first video frame and has the length of the first preset time period from the video file, and extracting a last candidate segment which comprises the last video frame and has the length of the last preset time period from the video file.
In the above scheme, the first identification module is further configured to:
extracting a plurality of video frames from the candidate segments at fixed time intervals;
performing image preprocessing on each video frame to obtain a corresponding binarized image;
dividing the binarized image to obtain a character image containing a plurality of characters;
extracting character features of a plurality of characters in the character image, performing feature matching based on the character features, and taking keywords obtained by matching as keyword recognition results of the video frames corresponding to the character image.
In the above scheme, the first identification module is further configured to:
traversing a keyword feature library to match the features in the keyword feature library with the character features, and taking a keyword corresponding to the feature with the highest matching degree as a keyword recognition result of the video frame corresponding to the character image.
In the above scheme, the segment prediction result includes a head probability that the video frame belongs to a head and a tail probability that the video frame belongs to a tail; the prediction module is further configured to:
carrying out convolution processing on video frames in the head candidate segment and the tail candidate segment through the machine learning model to obtain corresponding image features;
And classifying the image features to obtain the head probability of the video frame in the head candidate segment and the tail probability of the video frame in the tail candidate segment.
In the above scheme, the demarcation video frames include a head-to-tail frame and a tail-to-head frame; the second identification module is further configured to:
selecting the time stamp of the video frame with the largest time stamp from the video frames with the keyword recognition results of the head candidate fragments as a first head time stamp, and selecting the time stamp of the video frame with the smallest time stamp from the video frames with the keyword recognition results of the tail candidate fragments as a first tail time stamp;
selecting a timestamp corresponding to a video frame with the largest head probability and exceeding a first probability threshold from the video frames of the head candidate fragments as a second head timestamp, and selecting a timestamp corresponding to a video frame with the largest tail probability and exceeding a second probability threshold from the video frames of the tail candidate fragments as a second tail timestamp;
taking the larger timestamp of the first head timestamp and the second head timestamp as the head timestamp, and the smaller timestamp of the first tail timestamp and the second tail timestamp as the tail timestamp;
and taking the video frame corresponding to the head timestamp as the head-to-tail frame, and taking the video frame corresponding to the tail timestamp as the tail-to-head frame.
In the above scheme, the second identification module is further configured to:
when the demarcation video frame is not identified from the plurality of video frames based on the keyword identification result of each video frame and the segment prediction result of each video frame, the video head frame is taken as the head-to-tail frame, and the video tail frame is taken as the tail-to-head frame.
In the above scheme, the structure detection device of the video file further includes a determining module, configured to:
taking the segment formed by the video frames whose timestamps lie between the timestamp of the video first frame and the timestamp of the head-to-tail frame as the head;
and taking the segment formed by the video frames whose timestamps lie between the timestamp of the tail-to-head frame and the timestamp of the video last frame as the tail.
In the above scheme, the structure detection device of the video file further includes a training module, configured to:
adding a label to each video frame in a video file sample based on the head timestamp and the tail timestamp of the sample, wherein the label types include feature (main content), head, and tail;
Extracting image characteristics of each video frame;
based on the image characteristics of each video frame, forward propagation is carried out in the machine learning model, and a segment prediction result of each video frame is obtained;
determining the type of each video frame based on the segment prediction result of each video frame;
based on the type of each of the video frames and the errors of the labels of each of the video frames, back propagation is performed in the machine learning model to update parameters of the machine learning model.
An embodiment of the present application provides an electronic device, including:
a memory for storing executable instructions;
and the processor is used for realizing the structure detection method of the video file provided by the embodiment of the application when executing the executable instructions stored in the memory.
The embodiment of the application provides a computer readable storage medium which stores executable instructions for realizing the structure detection method of the video file provided by the embodiment of the application when being executed by a processor.
The embodiment of the application has the following beneficial effects:
character recognition is carried out on video frames in candidate fragments in the video file, and whether keywords are included in the video frames can be determined according to recognition results; the segment prediction result of the video frame is determined through a machine learning model, and the head and tail of the video file can be rapidly and accurately positioned according to the identification result and the segment prediction result through a mode of combining the two schemes.
Drawings
FIG. 1 is a schematic diagram of a detection system 10 according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a server 200 according to an embodiment of the present application;
FIG. 3 is a schematic flow chart of detecting the structure of a video file according to an embodiment of the present application;
FIG. 4 is a schematic diagram of an advertisement page of a variety show according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a machine learning model provided by an embodiment of the present application;
fig. 6 is an interactive flow diagram of a method for detecting a structure of a video file according to an embodiment of the present application;
fig. 7A is a schematic diagram of a video playing page according to an embodiment of the present application;
FIG. 7B is a schematic diagram of locating the head according to an embodiment of the present application;
FIG. 7C is a schematic diagram of locating the tail according to an embodiment of the present application;
FIG. 8A is a flowchart of head timestamp detection according to an embodiment of the present application;
fig. 8B is a flowchart of tail timestamp detection according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions, and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings. The described embodiments should not be construed as limiting the present application, and all other embodiments obtained by those skilled in the art without inventive effort fall within the scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is to be understood that "some embodiments" can be the same subset or different subsets of all possible embodiments and can be combined with one another without conflict.
Where a description such as "first/second" appears in this document, the following applies: the terms "first/second/third" merely distinguish similar objects and do not imply a specific ordering; where permitted, the specific order or precedence may be interchanged, so that the embodiments described herein can be practiced in an order other than that illustrated or described.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the application only and is not intended to be limiting of the application.
Before describing the embodiments of the present application in further detail, the terms involved in the embodiments are explained as follows.
1) Streaming Media (Streaming Media) is an emerging network transmission technology that sequentially transmits and plays continuous time-based data streams of multimedia content such as video/audio in real time over the internet. In contrast to the network play mode of downloading and viewing, streaming media is typically characterized by compressing continuous audio and video information and then placing the compressed information on a streaming media server, and the user views the information while downloading the information without waiting for the entire file to be downloaded.
2) Streaming media servers are key platforms for operators to provide video services to users. The streaming media server has the main functions of collecting, caching, scheduling and transmitting and playing streaming media contents. The method can transmit the video file to the client by a streaming protocol for the user to watch online; the real-time video stream can also be received from video acquisition software and compression software, and then the live broadcast is carried out on the client side by a streaming protocol.
3) Endpoint video frames: the video first frame and the video last frame of a video file.
4) Demarcation video frames: the head-to-tail frame (the last frame of the head) and the tail-to-head frame (the first frame of the tail) of a video file.
In general, a video file is composed of three parts: a head, a feature (the main content), and a tail. Many users prefer to skip the head and tail when watching, so many video clients offer the option to do so. In the related art, skipping the head and tail is achieved by manually viewing the video file and marking the time points at which the feature begins and ends (i.e., the head timestamp and tail timestamp hereinafter), and then skipping based on the marked time points. This is not only inefficient but also labor intensive.
In order to solve the technical problem of low detection efficiency caused by manual labeling in the related art, the embodiment of the application provides a method, a device, electronic equipment and a computer-readable storage medium for detecting the structure of a video file, which can rapidly locate the head and tail of the video file.
The method for detecting the structure of the video file provided by the embodiment of the application can be implemented by various electronic devices, for example, can be independently implemented by a terminal or a server. For example, after the terminal downloads the complete video file, a structure detection method of the video file described below may be performed based on the complete video file. The method for detecting the structure of the video file can also be cooperatively implemented by the server and the terminal. For example, after receiving the head-to-tail determination operation of the user, the terminal receives the video data packet from the server in real time and decompresses the video data packet to obtain a video file, and then executes the structure detection method of the video file. Or after receiving the head-to-tail determining operation of the user on the target video file, the terminal sends a head-to-tail determining request to the server so that the server executes a video file structure detection method on the stored target video file, determines the head and tail of the target video file, and sends a data packet of a video frame from the head timestamp to the tail timestamp of the target video file to the terminal in real time.
The electronic device for detecting the structure of the video file provided by the embodiment of the application can be various types of terminal devices or servers, wherein the servers can be independent physical servers (such as streaming media servers), can be server clusters or distributed systems formed by a plurality of physical servers, and can be cloud servers for providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, basic cloud computing services such as big data and artificial intelligent platforms; the terminal may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, etc. The terminal and the server may be directly or indirectly connected through wired or wireless communication, which is not limited in the embodiment of the present application.
Taking a server as an example, it may be a server cluster deployed in the cloud that opens an artificial intelligence cloud service (AIaaS, AI as a Service) to users. An AIaaS platform splits common AI services and provides them independently or in packaged form in the cloud; this service mode is similar to an AI-themed marketplace, and all users can access one or more of the artificial intelligence services provided by the AIaaS platform through an application programming interface.
For example, one of the artificial intelligence cloud services may be a structure detection service of a video file, that is, a cloud server encapsulates a program for detecting a structure of a video file provided by an embodiment of the present application. The terminal responds to the head-to-tail determining operation of the user, the structure detecting service of the video file in the cloud service is called, so that a server deployed at the cloud end calls a program for detecting the structure of the packaged video file, character recognition is carried out on a plurality of video frames of candidate fragments in the video file, a keyword recognition result is obtained, the probability that the plurality of video frames are the head-to-tail and are respectively the head-to-tail is predicted through a machine learning model, the head-to-tail of the video file is determined according to the keyword recognition result and the predicted probability, and finally, a data packet of the video file with the head-to-tail of the video file removed can be sent to the terminal.
The following describes an example of a method for detecting a structure of a video file provided by the embodiment of the present application by cooperatively implementing a server and a terminal. Referring to fig. 1, fig. 1 is a schematic architecture diagram of a detection system 10 according to an embodiment of the present application. The terminal 400 is connected to the server 200 through the network 300, and the network 300 may be a wide area network or a local area network, or a combination of both.
In some embodiments, the terminal 400 sends a slice head and slice tail determination request to the server 200 in response to a slice head and slice tail determination operation of a user for a target video file, where the request carries identification information of the target video file. The server 200 determines the target video file according to the head-to-tail determination request and the identification information of the target video file. Character recognition is carried out on a plurality of video frames of candidate fragments in the target video file, a keyword recognition result is obtained, the probability that the plurality of video frames are the head and the tail of the fragment respectively is predicted through a machine learning model, so that the head and the tail of the fragment of the target video file are determined according to the keyword recognition result and the predicted probability, and finally, the data packet of the video file with the head and the tail removed is sent to the terminal 400 in real time.
In some embodiments, taking the electronic device provided by the embodiment of the present application as the terminal 400, the terminal 400 implements the structure detection method of the video file provided by the embodiment of the present application by running a computer program, where the computer program may be a native program or a software module in an operating system; may be a Native Application (APP), i.e. a program that needs to be installed in an operating system to run, such as a video client; or a browser which displays the video play page in the form of a web page. In general, the computer programs described above may be any form of application, module or plug-in.
The following describes an example of the electronic device provided in the embodiment of the present application as the server 200 described above. Referring to fig. 2, fig. 2 is a schematic structural diagram of a server 200 according to an embodiment of the present application, and the server 200 shown in fig. 2 includes: at least one processor 410, a memory 440, at least one network interface 420. The various components in server 200 are coupled together by bus system 430. It is understood that bus system 430 is used to enable connected communications between these components. The bus system 430 includes a power bus, a control bus, and a status signal bus in addition to a data bus. But for clarity of illustration the various buses are labeled in fig. 2 as bus system 430.
The processor 410 may be an integrated circuit chip having signal processing capabilities, such as a general purpose processor (e.g., a microprocessor or any conventional processor), a digital signal processor (DSP), other programmable logic devices, discrete gate or transistor logic, discrete hardware components, or the like.
Memory 440 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard drives, optical drives, and the like. Memory 440 optionally includes one or more storage devices physically remote from processor 410.
Memory 440 includes volatile memory or nonvolatile memory, and may also include both volatile and nonvolatile memory. The non-volatile memory may be read-only memory (ROM) and the volatile memory may be random access memory (RAM). The memory 440 described in embodiments of the present application is intended to comprise any suitable type of memory.
In some embodiments, memory 440 is capable of storing data to support various operations, examples of which include programs, modules and data structures, or subsets or supersets thereof, as exemplified below.
An operating system 441 including system programs, e.g., a framework layer, a core library layer, a driver layer, etc., for handling various basic system services and performing hardware-related tasks, for implementing various basic services and handling hardware-based tasks;
network communication module 442 for reaching other computing devices via one or more (wired or wireless) network interfaces 420, exemplary network interfaces 420 include: bluetooth, wireless compatibility authentication (WiFi), and universal serial bus (USB, universal Serial Bus), etc.;
in some embodiments, the structure detecting device for a video file provided in the embodiments of the present application may be implemented in a software manner, and fig. 2 shows the structure detecting device 443 for a video file stored in the memory 440, which may be software in the form of a program, a plug-in, or the like, including the following software modules: the extraction module 4431, the first recognition module 4432, the prediction module 4433, the second recognition module 4434, the determination module 4435, the training module 4436 are logical, and thus may be arbitrarily combined or further split according to the implemented functions. The functions of the respective modules will be described hereinafter.
The method for detecting the structure of the video file according to the embodiment of the present application will be described below with reference to the accompanying drawings, where the main execution body of the method may be a server (such as a streaming media server), and specifically may be implemented by running the above various computer programs; of course, it will be apparent from the following understanding that the method for detecting the structure of the video file provided in the embodiment of the present application may be implemented by a terminal or by a terminal and a server in cooperation.
Referring to fig. 3, fig. 3 is a schematic flow chart of structure detection of a video file according to an embodiment of the present application, and will be described with reference to the steps shown in fig. 3.
In step 101, candidate segments including endpoint video frames are extracted from a video file.
In some embodiments, the type of the video file may be a movie, a TV drama, a variety show, a cartoon, a documentary, and the like. Different types of video files have different structures. For example, a movie or TV drama typically has both a head and a tail, while a variety show may have only a head or only a tail, or no explicit head or tail at all. The head and tail of a movie or TV drama typically list the roles and names of the people involved in production, such as the director, producer, etc. The head and tail of a variety show are usually advertisement pages; in the advertisement page of a variety show shown in fig. 4, the page includes a product 401, a spokesperson 402, a product icon 403, and a promotional slogan 404.
In some embodiments, extracting candidate segments including the endpoint video frames from the video file may be implemented as follows: when the duration of the video file is longer than the sum of the head preset period and the tail preset period, a head candidate segment that includes the video first frame and has the length of the head preset period is extracted from the video file, and a tail candidate segment that includes the video last frame and has the length of the tail preset period is extracted from the video file.
For example, both the head preset period and the tail preset period of the video file are 5 minutes (min). When the duration of the video file is 15 min, the video frames of the first 5 min are extracted as the head candidate segment, and the video frames of the last 5 min are extracted as the tail candidate segment.
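For illustration only, the candidate-segment selection above can be sketched as follows; the helper name and the 5-minute windows are assumptions drawn from the example, not fixed by the embodiment.

# Minimal sketch of candidate-segment selection; times are in seconds.
HEAD_WINDOW = 5 * 60  # head preset period
TAIL_WINDOW = 5 * 60  # tail preset period

def candidate_ranges(duration_s: float):
    """Return (head_range, tail_range), or None if the video is too short."""
    if duration_s <= HEAD_WINDOW + TAIL_WINDOW:
        return None  # the two windows would overlap; no candidate segments are extracted
    head_range = (0.0, float(HEAD_WINDOW))               # contains the video first frame
    tail_range = (duration_s - TAIL_WINDOW, duration_s)  # contains the video last frame
    return head_range, tail_range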
In step 102, character recognition is performed on a plurality of video frames in the candidate segment, so as to obtain a keyword recognition result of each video frame.
In some embodiments, the character recognition in step 102 may be optical character recognition. Character recognition is performed on a plurality of video frames in the candidate segment, so as to obtain a keyword recognition result of each video frame, which can be achieved through the following steps 1021 to 1024.
In step 1021, a plurality of video frames are extracted from the candidate segments at regular intervals.
For example, one video frame may be extracted every 1 second (s) from the candidate segment. In some possible examples, when the head preset period and the tail preset period are short, for example 5 s or 10 s, the video frames may instead be extracted frame by frame.
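A possible way to sample frames at a fixed interval is sketched below with OpenCV; the seek-based approach and the 1 s step are assumptions, not requirements of the embodiment.

# Sketch: sample one frame per step_s seconds from a time range of the video.
import cv2

def sample_frames(path, start_s, end_s, step_s=1.0):
    cap = cv2.VideoCapture(path)
    t = start_s
    while t < end_s:
        cap.set(cv2.CAP_PROP_POS_MSEC, t * 1000)  # seek to the target timestamp
        ok, frame = cap.read()
        if not ok:
            break
        yield t, frame                            # (timestamp in seconds, BGR image)
        t += step_s
    cap.release()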
In step 1022, image preprocessing is performed on each video frame to obtain a corresponding binarized image.
In some embodiments, image preprocessing includes graying, binarizing, normalizing, and smoothing. The graying can filter interference information carried by the color video frame; the binarization can further separate the text part and the background part; normalization is unifying the words in the video frames to the same size for subsequent matching, and includes position normalization, size normalization and stroke weight normalization; smoothing is to make the edges of the text smoother.
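The preprocessing chain can be sketched with OpenCV as follows; the library choice, threshold method, and target size are assumptions for illustration.

# Sketch of graying -> binarization -> size normalization -> smoothing.
import cv2

def preprocess(frame):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)                  # graying
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)  # binarization
    binary = cv2.resize(binary, (960, 540))                         # size normalization
    binary = cv2.medianBlur(binary, 3)                              # smoothing of character edges
    return binary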
In step 1023, the binarized image is subjected to segmentation processing, to obtain a character image containing a plurality of characters.
In some embodiments, segmenting the binarized image means dividing it into different parts by connected-component analysis or a similar method and labeling the attribute of each part, such as text, image, or table. The text part is then segmented into paragraphs, lines, and individual characters, thereby obtaining a plurality of characters.
In step 1024, character features of a plurality of characters in the character image are extracted, feature matching is performed based on the character features, and keywords obtained by matching are used as keyword recognition results of video frames corresponding to the character image.
In some embodiments, the extracted character features include statistical features and structural features, where the structural features may include edge features, penetration features, transformation features, grid features, and the like.
In some embodiments, after character features of the plurality of characters in the character image are extracted, a keyword feature library is traversed to match its features against the character features. The keyword feature library contains features of keywords that may appear in the head or the tail. Keywords that may appear in the head include, for example: "Episode 1", "issuing authority", "drama review number", "chief director", etc.; keywords that may appear in the tail include: "next episode preview", "leading actor", "special guest actor", "friendly appearance", "special thanks", etc. The matching method may be a relaxation comparison method, a Euclidean space comparison method, or the like. After matching, the keyword corresponding to the feature with the highest matching degree is used as the keyword recognition result of the video frame corresponding to the character image.
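As a simplified stand-in for the feature-library matching described above, the recognized text can be checked against keyword lists; the lists below are illustrative, not the embodiment's feature library.

# Sketch: match OCR output of a frame against head/tail keyword lists.
HEAD_KEYWORDS = ["第1集", "总导演", "出品", "发行许可证"]        # e.g. "Episode 1", "chief director"
TAIL_KEYWORDS = ["下集预告", "领衔主演", "特别鸣谢", "友情出演"]  # e.g. "next episode preview"

def keyword_hit(ocr_text, keywords):
    """Return the first keyword found in the recognized text, or None."""
    for kw in keywords:
        if kw in ocr_text:
            return kw
    return None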
Thus, for video files such as movie dramas that typically include specific keywords in the beginning and end of the episode, it may be determined by character recognition whether there are video frames containing keywords in the beginning candidate and end candidate thereof.
In step 103, image features are extracted for each video frame by a machine learning model.
In some embodiments, the machine learning model may be a convolutional neural network model, a deep neural network model, or the like. As shown in fig. 5, fig. 5 is a schematic structural diagram of a machine learning model according to an embodiment of the present application. The machine learning model includes an input layer, convolution layers, pooling layers, a fully connected layer, and an output layer. In the input layer, a zero-mean preprocessing operation is applied to the video frames in the head candidate segment and the tail candidate segment so that different features in the video frames have the same scale; that is, the mean is subtracted from each pixel in the video frame to obtain a pixel matrix. In the convolution layers, convolution operations are performed on the pixel matrix, that is, different image features are extracted through different convolution kernels. In the pooling layers, the image features are downsampled through a selection window, achieving dimensionality reduction of the data. The number of convolution layers and pooling layers may each be more than one.
In step 104, a segment prediction result for each video frame is determined based on the image characteristics of each video frame.
In some embodiments, the segment prediction result includes a head probability that the video frame belongs to the head and a tail probability that the video frame belongs to the tail. In the fully connected layer of the machine learning model, the multi-dimensional image features are converted into one-dimensional features. Finally, in the output layer, the one-dimensional features are classified through a sigmoid function to obtain the probabilities that a video frame belongs to the head, the tail, and the feature respectively; that is, the head probability of each video frame in the head candidate segment and the tail probability of each video frame in the tail candidate segment can be obtained. In this way, the video frames likely to belong to the head and those likely to belong to the tail can be determined rapidly and accurately through the machine learning model, which is particularly suitable for video files of the variety-show type, in which no keywords appear in the head and tail.
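A minimal PyTorch sketch of the Fig. 5 structure is given below; the layer sizes, input resolution, and three-way head/feature/tail output are assumptions for illustration, not the exact network of the embodiment.

# Sketch: input -> convolution -> pooling -> fully connected -> sigmoid output.
import torch.nn as nn

class SegmentClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 56 * 56, 3),  # scores for head / feature / tail
            nn.Sigmoid(),                # per-class probabilities, as in the output layer above
        )

    def forward(self, x):                # x: (N, 3, 224, 224), zero-mean as in the input layer
        return self.classifier(self.features(x))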
In some embodiments, the training process of the machine learning model is as follows: a label is added to each video frame in a video file sample based on the head timestamp and the tail timestamp of the sample, the label types being feature, head, and tail; image features of each video frame are extracted; based on the image features of each video frame, forward propagation is performed in the machine learning model to obtain the head probability and tail probability of each video frame; the type of each video frame is determined based on its head probability and tail probability; and based on the error between the predicted type and the label of each video frame, back propagation is performed in the machine learning model to update its parameters.
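A sketch of one training step following the description above; the optimizer, loss, and label encoding (0 = head, 1 = feature, 2 = tail) are assumptions.

# Sketch: forward propagation, error against the label, back propagation, parameter update.
import torch
import torch.nn as nn

model = SegmentClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.BCELoss()  # matches the sigmoid outputs of SegmentClassifier

def train_step(frames, labels):
    """frames: (N, 3, 224, 224) float tensor; labels: (N,) long tensor of 0/1/2."""
    optimizer.zero_grad()
    probs = model(frames)                              # forward propagation
    target = nn.functional.one_hot(labels, 3).float()  # label as a 3-way target
    loss = criterion(probs, target)                    # error between prediction and label
    loss.backward()                                    # back propagation
    optimizer.step()                                   # update model parameters
    return loss.item()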
In step 105, a demarcation video frame is identified from the plurality of video frames based on the keyword recognition result for each video frame and the segment prediction result for each video frame.
In some embodiments, identifying the demarcation video frame from the plurality of video frames based on the keyword recognition result of each video frame and the segment prediction result of each video frame may be accomplished by the following steps 1051 through 1054.
In step 1051, the timestamp of the video frame with the largest timestamp is selected, from the video frames of the head candidate segment that have keyword recognition results, as the first head timestamp; and the timestamp of the video frame with the smallest timestamp is selected, from the video frames of the tail candidate segment that have keyword recognition results, as the first tail timestamp.
For example, optical character recognition determines that, in the head candidate segment, the video frame with timestamp 1 s contains the keyword "chief director", the video frame with timestamp 3 min contains the keyword "sponsor", and so on; the timestamps of these video frames are compared, and the largest timestamp (e.g., 3 min) is taken as the first head timestamp. Meanwhile, in the tail candidate segment, the video frame with timestamp 15 min contains the keyword "next episode preview", the video frame with timestamp 17 min contains the keyword "special thanks", and so on; the smallest of these timestamps (e.g., 15 min) is taken as the first tail timestamp.
In step 1052, the timestamp corresponding to the video frame with the largest head probability and exceeding the first probability threshold is selected as the second head timestamp from the video frames of the head candidate segment, and the timestamp corresponding to the video frame with the largest tail probability and exceeding the second probability threshold is selected as the second tail timestamp from the video frames of the tail candidate segment.
For example, both the first probability threshold and the second probability threshold are 0.8. According to the machine learning model's prediction, among the video frames of the head candidate segment, the video frame with timestamp 3 min has the largest head probability, 0.85, which exceeds the first probability threshold; thus, the timestamp 3 min is taken as the second head timestamp. Among the video frames of the tail candidate segment, the video frame with timestamp 14 min 30 s has the largest tail probability, 0.9, which exceeds the second probability threshold; thus, the timestamp 14 min 30 s is taken as the second tail timestamp.
In step 1053, the larger of the first head timestamp and the second head timestamp is taken as the head timestamp, and the smaller of the first tail timestamp and the second tail timestamp is taken as the tail timestamp.
For example, because the first head time stamp 3min is the same as the second head time stamp 3min, the head time stamp may be determined to be 3min. Because the second end of chip timestamp 14min30s is less than the first end of chip timestamp 15min, the second end of chip timestamp 14min30s is taken as the end of chip timestamp.
In step 1054, the video frame corresponding to the head timestamp is taken as the head-to-tail frame, and the video frame corresponding to the tail timestamp is taken as the tail-to-head frame.
In some embodiments, no demarcation video frame is identified from the plurality of video frames based on the keyword recognition result and the segment prediction result of each video frame; that is, no head or tail keyword is recognized in the video frames of the head candidate segment or the tail candidate segment, the head probability of no video frame in the head candidate segment exceeds the first probability threshold, and the tail probability of no video frame in the tail candidate segment exceeds the second probability threshold. In this case, the video first frame is taken as the head-to-tail frame and the video last frame is taken as the tail-to-head frame.
In some embodiments, after the demarcation video frames are identified, the head and tail of the video file can be determined based on the endpoint video frames and the demarcation video frames: the segment formed by the video frames whose timestamps lie between the timestamp of the video first frame and the timestamp of the head-to-tail frame is taken as the head, and the segment formed by the video frames whose timestamps lie between the timestamp of the tail-to-head frame and the timestamp of the video last frame is taken as the tail.
For example, the timestamp of the video first frame is 0, the timestamp of the head-to-tail frame is 3 min, the timestamp of the tail-to-head frame is 15 min, and the timestamp of the video last frame is 19 min. The segment formed by the video frames with timestamps between 0 and 3 min is taken as the head, and the segment formed by the video frames with timestamps between 15 min and 19 min is taken as the tail.
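The fusion of keyword timestamps and model probabilities (steps 1051 to 1054), the fallback of the preceding paragraph, and the construction of the head and tail segments can be sketched as follows; the data layout and names are illustrative.

# Sketch: combine keyword hits and model probabilities into head/tail segments.
T1, T2 = 0.8, 0.8  # first / second probability thresholds, as in the example

def locate(head_frames, tail_frames, duration_s):
    """head_frames / tail_frames: lists of (timestamp_s, keyword_or_None, probability)."""
    # First head/tail timestamps, from keyword recognition.
    kw_head = [t for t, kw, _ in head_frames if kw]
    kw_tail = [t for t, kw, _ in tail_frames if kw]
    first_head = max(kw_head) if kw_head else None
    first_tail = min(kw_tail) if kw_tail else None

    # Second head/tail timestamps, from the machine learning model.
    best_head = max(head_frames, key=lambda f: f[2])
    best_tail = max(tail_frames, key=lambda f: f[2])
    second_head = best_head[0] if best_head[2] > T1 else None
    second_tail = best_tail[0] if best_tail[2] > T2 else None

    heads = [t for t in (first_head, second_head) if t is not None]
    tails = [t for t in (first_tail, second_tail) if t is not None]
    head_ts = max(heads) if heads else 0.0          # fallback: head ends at the video first frame
    tail_ts = min(tails) if tails else duration_s   # fallback: tail starts at the video last frame
    return (0.0, head_ts), (tail_ts, duration_s)    # head segment, tail segment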
In some embodiments, the auxiliary recognition may also be performed by a speech classification model in a machine learning model, the speech feature extraction may be performed by the speech classification model on candidate segments (a head-of-segment speech candidate segment and a tail-of-segment speech candidate segment) in a speech file synchronized with the video file, and the probability that each speech frame in the candidate segments belongs to the head-of-speech segment and the tail-of-speech segment may be predicted based on the extracted speech features. And selecting the time stamp corresponding to the voice frame with the highest probability and exceeding the first voice probability threshold from the voice frames of the head voice candidate fragments as a third head time stamp, and selecting the time stamp corresponding to the voice frame with the highest probability and exceeding the second voice probability threshold from the voice frames of the tail voice candidate fragments as a third tail time stamp. Taking the largest timestamp of the first head timestamp, the second head timestamp and the third head timestamp as the head timestamp, and taking the smallest timestamp of the first tail timestamp, the second tail timestamp and the third tail timestamp as the tail timestamp.
Therefore, the character recognition, the machine learning model prediction of the video frame and the machine learning model prediction of the voice frame in the embodiment of the application can be mutually complemented to jointly determine the most accurate head-of-chip timestamp and the tail-of-chip timestamp. In some possible examples, no text exists in the demarcation video frame, or the included text does not belong to keywords, for example, in a video file corresponding to the variety program, the demarcation video frame is an advertisement page, and then the video frame is predicted mainly according to the machine learning model.
In some possible examples, the demarcation video frame contains no text, or the text it contains does not belong to the keywords, and the demarcation video frame is not a page commonly used to separate the head/tail from the feature; for example, in a video file corresponding to an interview program, the boundary between the head/tail and the feature may be determined from what the host says (e.g., "... officially begins", "... comes to an end"), that is, the speech frames are predicted by the machine learning model. Because the speech file is synchronized with the video file, the head-to-tail frame and the tail-to-head frame of the video file can also be determined. Therefore, by using the three methods together, the head timestamp and tail timestamp of various types of video files can be determined, greatly improving the accuracy and efficiency of head and tail positioning.
It can be seen that, according to the embodiment of the application, by performing character recognition on the video frames in the candidate segments in the video file, whether keywords are included in the video frames can be determined according to the recognition result. The segment prediction result of the video frame is determined through the machine learning model, so that possible video frames belonging to the head of the segment and possible video frames belonging to the tail of the segment can be rapidly and accurately determined. By combining the two schemes, the head and the tail of the video file can be rapidly and accurately positioned according to the identification result and the segment prediction result, and positive content after the head and the tail of the video file are removed can be further provided for a user.
Referring to fig. 6, fig. 6 is an interaction flow diagram of a method for detecting a structure of a video file according to an embodiment of the present application. The following describes a procedure of cooperatively implementing the structure detection method for a video file provided by an embodiment of the present application by a terminal and a server in connection with steps 201 to 206 in fig. 6.
In step 201, the terminal sends a head-to-tail determination request carrying identification information of the target video file to the server.
The identification information of the target video file can be in the form of video links and the like, a large number of video files are stored in the server, and the video files can be positioned based on the video links. The head-to-tail determination request is used for requesting the server to return a data packet of the target video file from which the head-to-tail is removed.
Step 202, the server determines the target video file according to the head-to-tail determination request and the identification information of the target video file.
In step 203, the server performs character recognition on a plurality of video frames of the candidate segments in the target video file, so as to obtain a keyword recognition result.
In step 204, the server predicts probabilities that the plurality of video frames are the head and tail of the slice respectively through a machine learning model.
In step 205, the server determines the head and tail of the target video file according to the keyword recognition result and the predicted probability.
In step 206, the server sends the data packet of the video file with the head and tail removed to the terminal.
It should be noted that the foregoing steps have been described in detail in the foregoing, and are not repeated here.
Therefore, in the embodiment of the present application, the terminal delegates to the server the character recognition of video frames in the target video file and the prediction of the probabilities that video frames belong to the head and tail. This reduces the computing load on the terminal while the head and tail of the target video file are determined accurately, and the terminal finally obtains the video file with the head and tail removed.
In the following, an exemplary application of the embodiment of the present application in a practical application scenario will be described.
Referring to fig. 7A, fig. 7B, and fig. 7C, fig. 7A is a schematic view of a video playing page provided by an embodiment of the present application, fig. 7B is a schematic view of locating the head provided by an embodiment of the present application, and fig. 7C is a schematic view of locating the tail provided by an embodiment of the present application. The page shown in fig. 7A mainly includes two parts, a video file 701 and video operations 702, where the video operations 702 include intelligent identification, intelligent editing, and intelligent auditing. When entering the video playing page shown in fig. 7A, or after the user has viewed part of the head, the user may click on the head-and-tail control 703 under intelligent editing, so that the playing page shown in fig. 7B appears. Below the control 703 in fig. 7B, a head timestamp of 0:23.00 and a tail timestamp of 15:40.04 are presented, and the video file 701 skips the head, so playback starts from the head timestamp. Because the tail timestamp is 15:40.04, in fig. 7C, playback stops when the player reaches 15:40.
How to determine the head-of-chip and tail-of-chip timestamps of a video file is described below.
In the embodiment of the present application, when the duration of the video file to be detected is longer than 10 min, the head timestamp is detected only in the first 5 min of the video file, and the tail timestamp is detected only in the last 5 min. Video frames are extracted at fixed time intervals from the video content of the first 5 min and the last 5 min; for example, one frame may be extracted every 1 s.
Referring to fig. 8A, fig. 8A is a schematic flowchart of head timestamp detection provided in the embodiment of the present application; the determination of the head timestamp will be described with reference to the steps shown in fig. 8A.
In step 801, the server parses the video frame.
The method comprises the steps that a video file to be analyzed is stored in a server, and the server analyzes a plurality of video frames in the first 5min of the video file to obtain relevant information of the video frames, such as time stamps corresponding to the video frames.
In step 802, the server determines whether the timestamp of the video frame is greater than 5min, if so, performs step 807, and if not, performs step 803.
In step 803, the server performs character recognition on the video frame to obtain a recognition result.
The server performs character recognition on the currently parsed video frame and determines, based on the recognition result, whether a head keyword exists in the video frame. In some possible examples, the character recognition may be optical character recognition. The head keywords may include: "Episode 1", "issuing authority", "Bureau of Press, Publication, Radio and Television", "distribution license number (record number)", "drama review number", "chief director", "director", and the like.
In step 804, the server predicts a head probability of a video frame through a machine learning model.
The machine learning model for predicting the first probability in step 804 is obtained by training a video file sample, which may be a convolutional neural network model, a deep neural network model, or the like, and the video file sample may be a movie, a television show, a variety, a cartoon, a documentary, or the like. The slice header probability is the probability that the current video frame is the slice header.
It should be noted that, step 804 may be performed simultaneously with step 803 or performed prior to step 803.
In step 805, the server determines whether a head keyword exists in the video frame, or whether the head probability is greater than a first probability threshold T1, if yes, step 806 is performed, and if not, step 801 is performed.
The first probability threshold T1 is a preset value, such as 0.5, 0.6, 0.7, or 0.8. When the head probability of the current video frame is greater than the first probability threshold T1, the current video frame is likely to belong to the head. Step 806 is performed when the video frame satisfies either of the two conditions: a head keyword is present, or the head probability is greater than the first probability threshold T1.
In step 806, the server records the timestamp of the current video frame and performs step 801.
After recording the time stamp, the server performs step 801, i.e. starts processing the next video frame.
In step 807, the server determines whether there is a video frame containing a head keyword or having a head probability greater than T1 in the parsed video frame, if so, step 808 is performed, and if not, step 809 is performed.
When the timestamp of the next video frame analyzed by the server is greater than 5min, the next video frame is not processed. The server determines whether a recorded time stamp exists, namely whether a video frame containing a head keyword or having a head probability greater than T1 exists in the analyzed video frame of the first 5 min.
In step 808, the server compares the timestamps of the recorded video frames and takes the largest timestamp as the head timestamp.

The head timestamp is the timestamp corresponding to the last frame of the head.

In step 809, the server takes 0 as the head timestamp.
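Putting steps 801 to 809 together, the head timestamp decision can be sketched as follows; `head_samples` (timestamp, frame pairs from the first 5 minutes), `ocr_keywords`, `head_probability`, and the keyword set are hypothetical stand-ins for the character recognition and the machine learning model described above:

```python
# Hypothetical keyword set; the actual head keywords are listed in step 803.
HEAD_KEYWORDS = {"Episode 1", "distribution license number", "chief director", "director"}

def detect_head_timestamp(head_samples, ocr_keywords, head_probability, t1=0.7):
    """Steps 801-809 (sketch): return the largest timestamp among frames of the
    first 5 minutes that contain a head keyword or whose head probability
    exceeds t1; return 0 if no such frame exists."""
    recorded = []
    for ts, frame in head_samples:                  # frames limited to the first 5 min
        has_keyword = bool(ocr_keywords(frame) & HEAD_KEYWORDS)   # steps 803, 805
        probable_head = head_probability(frame) > t1              # steps 804, 805
        if has_keyword or probable_head:
            recorded.append(ts)                     # step 806: record the timestamp
    return max(recorded) if recorded else 0         # steps 808 / 809
```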
Referring to fig. 8B, fig. 8B is a schematic flow chart of tail timestamp detection according to an embodiment of the present application, and the determination of the tail timestamp will be described with reference to the steps shown in fig. 8B.
In step 901, the server parses the video frame.
After the server has parsed the video frames in the first 5 minutes of the video file, it processes the video frames in the last 5 minutes directly; because the video frames in the middle part are skipped, the speed of character recognition can be greatly increased.
In step 902, the server performs character recognition on the video frame to obtain a recognition result.
The server performs character recognition on the currently parsed video frame, and determines whether a tail keyword exists in the video frame based on the recognition result. In some possible examples, the tail keywords may include: next-episode preview, leading actors, starring, "The End", specially invited actors, guest appearances, cast and crew, this episode, special vocal performance, broadcast-time announcements, "please stay tuned", and the like.
In step 903, the server predicts the tail probability of the video frame through a machine learning model.

The tail probability is the probability that the current video frame belongs to the tail. It should be noted that step 903 may be performed simultaneously with step 902 or before step 902.

In step 904, the server determines whether a tail keyword exists in the video frame, or whether the tail probability is greater than a second probability threshold T2; if so, step 905 is performed, and if not, step 906 is performed.

Because the tail timestamp is detected frame by frame, when a tail keyword exists in the current video frame or the tail probability is detected to be greater than the second probability threshold T2, the timestamp of the current video frame is likely to be the tail timestamp.
In step 905, the server outputs the timestamp of the current video frame as the tail timestamp.

The tail timestamp is the timestamp corresponding to the first frame of the tail.
In step 906, the server determines whether the video file has ended; if so, step 907 is executed, and if not, step 901 is executed.

The server determines that the video file has ended when the timestamp of the currently processed video frame is equal to the total duration of the video file. When there is no tail keyword in the current video frame and the tail probability is not greater than the second probability threshold T2, the server executes step 901, i.e. starts to process the next video frame.
In step 907, the server takes the total duration of the video file as the tail timestamp.
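Analogously, steps 901 to 907 can be sketched as below, again with `ocr_keywords`, `tail_probability`, and the keyword set as hypothetical stand-ins; the first qualifying frame in the last 5 minutes yields the tail timestamp, and the total duration serves as the fallback:

```python
# Hypothetical keyword set; the actual tail keywords are listed in step 902.
TAIL_KEYWORDS = {"next episode preview", "cast and crew", "starring", "The End"}

def detect_tail_timestamp(tail_samples, duration, ocr_keywords, tail_probability, t2=0.7):
    """Steps 901-907 (sketch): return the timestamp of the first frame in the last
    5 minutes that contains a tail keyword or whose tail probability exceeds t2;
    fall back to the total duration if no such frame is found."""
    for ts, frame in tail_samples:                   # ascending timestamps, last 5 min
        has_keyword = bool(ocr_keywords(frame) & TAIL_KEYWORDS)   # steps 902, 904
        probable_tail = tail_probability(frame) > t2              # steps 903, 904
        if has_keyword or probable_tail:
            return ts                                # step 905: first hit is the tail timestamp
    return duration                                  # step 907: use the total duration
```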
After the head timestamp and the tail timestamp of the video file are obtained, the server locates the head and the tail of the video file according to the head timestamp and the tail timestamp, compresses the video file from which the head and the tail have been removed, and sends the compressed video file to the user terminal in the form of data packets. The user terminal decompresses the received data packets and starts playing from the head timestamp, thereby presenting the playing page shown in fig. 7B. When playback reaches the tail timestamp, playing is stopped, as shown in fig. 7C.
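As one possible way to realize the trimming described here (not necessarily the compression scheme of the embodiment), the server could cut the content between the two timestamps before delivery; this sketch assumes the ffmpeg command-line tool is installed:

```python
import subprocess

def trim_main_feature(src, dst, head_ts, tail_ts):
    """Keep only the content between head_ts and tail_ts (in seconds) by
    stream-copying with ffmpeg; a sketch, not the embodiment's exact pipeline."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-ss", str(head_ts), "-to", str(tail_ts),
         "-c", "copy", dst],
        check=True,
    )
```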
Continuing with the description of an exemplary structure of the structure detection device 443 of a video file provided by an embodiment of the present application when implemented as software modules, in some embodiments, as shown in fig. 2, the software modules stored in the structure detection device 443 of a video file in the memory 440 may include: an extraction module 4431 for extracting candidate segments including endpoint video frames from a video file; a first recognition module 4432 for performing character recognition on a plurality of video frames in the candidate segments to obtain a keyword recognition result of each video frame; a prediction module 4433 for extracting image features of each video frame through a machine learning model and determining a segment prediction result of each video frame based on the image features of each video frame; and a second recognition module 4434 for identifying a demarcation video frame from the plurality of video frames based on the keyword recognition result of each video frame and the segment prediction result of each video frame.
In some embodiments, the endpoint video frames include a video first frame and a video last frame of the video file; the extraction module 4431 is further configured to: when the duration of the video file is longer than the sum of the length of the first preset time period and the length of the last preset time period, extract a head candidate segment which comprises the video first frame and has the length of the first preset time period from the video file, and extract a tail candidate segment which comprises the video last frame and has the length of the last preset time period from the video file.
In some embodiments, the first recognition module 4432 is further configured to: extract a plurality of video frames from the candidate segments at fixed time intervals; perform image preprocessing on each video frame to obtain a corresponding binarized image; segment the binarized image to obtain a character image containing a plurality of characters; and extract character features of the plurality of characters in the character image, perform feature matching based on the character features, and take the keywords obtained by matching as the keyword recognition result of the video frame corresponding to the character image.
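The preprocessing and segmentation steps of module 4432 can be sketched with OpenCV as follows; the Otsu thresholding and the contour-based character segmentation are illustrative assumptions rather than the exact procedure of the embodiment:

```python
import cv2

def binarize_and_segment(frame):
    """Convert a BGR frame to a binarized image and cut it into candidate
    character images (a rough sketch of the preprocessing and segmentation)."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    chars = []
    for c in sorted(contours, key=lambda c: cv2.boundingRect(c)[0]):  # left to right
        x, y, w, h = cv2.boundingRect(c)
        if w > 3 and h > 8:                         # drop tiny noise regions
            chars.append(binary[y:y + h, x:x + w])
    return binary, chars
```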
In some embodiments, the first recognition module 4432 is further configured to: traverse the keyword feature library to match the features in the keyword feature library with the character features, and take the keyword corresponding to the feature with the highest matching degree as the keyword recognition result of the video frame corresponding to the character image.
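The traversal of the keyword feature library can likewise be sketched as a nearest-neighbour search over feature vectors; the cosine-similarity measure, the library layout, and the acceptance threshold below are assumptions chosen for illustration:

```python
import numpy as np

def match_keyword(char_features, keyword_library, threshold=0.8):
    """keyword_library: dict mapping keyword -> reference feature vector.
    Return the keyword whose reference feature is most similar to char_features
    (cosine similarity), or None if nothing clears the assumed threshold."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    best_kw, best_score = None, 0.0
    for kw, ref in keyword_library.items():          # traverse the feature library
        score = cos(char_features, ref)
        if score > best_score:
            best_kw, best_score = kw, score
    return best_kw if best_score > threshold else None
```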
In some embodiments, the segment prediction result includes a head probability that the video frame belongs to the head and a tail probability that the video frame belongs to the tail; the prediction module 4433 is further configured to: perform convolution processing on the video frames in the head candidate segment and the tail candidate segment through the machine learning model to obtain corresponding image features; and classify the image features to obtain the head probability of the video frames in the head candidate segment and the tail probability of the video frames in the tail candidate segment.
In some embodiments, the demarcation video frames include a head-to-tail frame and a tail-to-head frame; the second recognition module 4434 is further configured to: select, from the video frames of the head candidate segment that have keyword recognition results, the timestamp of the video frame with the largest timestamp as a first head timestamp, and select, from the video frames of the tail candidate segment that have keyword recognition results, the timestamp of the video frame with the smallest timestamp as a first tail timestamp; select, from the video frames of the head candidate segment, the timestamp corresponding to the video frame whose head probability is the largest and exceeds the first probability threshold as a second head timestamp, and select, from the video frames of the tail candidate segment, the timestamp corresponding to the video frame whose tail probability is the largest and exceeds the second probability threshold as a second tail timestamp; take the larger of the first head timestamp and the second head timestamp as the head timestamp, and the smaller of the first tail timestamp and the second tail timestamp as the tail timestamp; and take the video frame corresponding to the head timestamp as the head-to-tail frame and the video frame corresponding to the tail timestamp as the tail-to-head frame.
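The fusion rule of module 4434 can be written compactly; the sketch below assumes per-frame records of the form (timestamp, has_keyword, probability) for the head and tail candidate segments, with 0 and the total duration as fallbacks when no frame qualifies:

```python
def fuse_timestamps(head_records, tail_records, t1, t2, duration):
    """head_records / tail_records: lists of (timestamp, has_keyword, prob).
    Return (head_timestamp, tail_timestamp) under the larger/smaller rule."""
    kw_head = [ts for ts, kw, _ in head_records if kw]
    kw_tail = [ts for ts, kw, _ in tail_records if kw]
    first_head = max(kw_head) if kw_head else 0             # keyword-based head timestamp
    first_tail = min(kw_tail) if kw_tail else duration      # keyword-based tail timestamp

    prob_head = [(p, ts) for ts, _, p in head_records if p > t1]
    prob_tail = [(p, ts) for ts, _, p in tail_records if p > t2]
    second_head = max(prob_head)[1] if prob_head else 0         # model-based head timestamp
    second_tail = max(prob_tail)[1] if prob_tail else duration  # model-based tail timestamp

    return max(first_head, second_head), min(first_tail, second_tail)
```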
In some embodiments, the second recognition module 4434 is further configured to: when the demarcation video frame is not identified from the plurality of video frames based on the keyword recognition result of each video frame and the segment prediction result of each video frame, regard the video first frame as the head-to-tail frame and the video last frame as the tail-to-head frame.
In some embodiments, the structure detection apparatus of a video file further includes a determining module 4435 configured to: take a segment formed by the video frames whose timestamps lie between the timestamp corresponding to the video first frame and the timestamp corresponding to the head-to-tail frame as the head; and take a segment formed by the video frames whose timestamps lie between the timestamp corresponding to the tail-to-head frame and the timestamp corresponding to the video last frame as the tail.
In some embodiments, the structure detection apparatus of a video file further includes a training module 4436 configured to: add a label to each video frame in the video file sample based on the head timestamp and the tail timestamp of the video file sample, wherein the labels include main feature, head, and tail; extract the image features of each video frame; perform forward propagation in the machine learning model based on the image features of each video frame to obtain a segment prediction result of each video frame; determine a type of each video frame based on the segment prediction result of each video frame; and perform back propagation in the machine learning model based on the error between the type of each video frame and the label of each video frame, so as to update the parameters of the machine learning model.
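The procedure of training module 4436 corresponds to a standard supervised loop; in the following PyTorch sketch, the labelling helper, the data-loader format, and the optimizer settings are assumptions chosen for illustration:

```python
import torch
import torch.nn as nn

LABELS = {"head": 0, "main": 1, "tail": 2}

def label_frame(ts, head_ts, tail_ts):
    """Assign a label to a frame from the sample's head and tail timestamps."""
    if ts <= head_ts:
        return LABELS["head"]
    if ts >= tail_ts:
        return LABELS["tail"]
    return LABELS["main"]

def train(model, loader, epochs=5, lr=1e-3):
    """loader yields (frame_batch, label_batch): forward propagation, cross-entropy
    error against the labels, then back propagation to update the parameters."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for frames, labels in loader:
            opt.zero_grad()
            loss = loss_fn(model(frames), labels)    # forward pass + error
            loss.backward()                          # back propagation
            opt.step()                               # parameter update
    return model
```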
An embodiment of the present application provides a storage medium storing executable instructions which, when executed by a processor, cause the processor to perform a method provided by an embodiment of the present application, for example, the method for detecting the structure of a video file shown in fig. 3.
In some embodiments, the storage medium may be FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, an optical disc, or a CD-ROM; it may also be any of various devices including one or any combination of the above memories.
In some embodiments, the executable instructions may be in the form of programs, software modules, scripts, or code, written in any form of programming language (including compiled or interpreted languages, or declarative or procedural languages), and they may be deployed in any form, including as stand-alone programs or as modules, components, subroutines, or other units suitable for use in a computing environment.
As an example, the executable instructions may, but need not, correspond to files in a file system, and may be stored as part of a file that holds other programs or data, for example in one or more scripts in a Hyper Text Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
As an example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices located at one site or, alternatively, distributed across multiple sites and interconnected by a communication network.
In summary, the embodiment of the application performs character recognition on the video frames in the candidate segments of the video file, so that whether a video frame includes a keyword can be determined according to the recognition result; and the segment prediction result of each video frame is determined through the machine learning model, so that the video frames that may belong to the head and the video frames that may belong to the tail can be determined quickly and accurately. By combining the two, the head and the tail of the video file can be located quickly and accurately according to the recognition results and the segment prediction results, and the main-feature content with the head and the tail removed can then be provided to the user.
The foregoing is merely exemplary embodiments of the present application and is not intended to limit the scope of the present application. Any modification, equivalent replacement, improvement, etc. made within the spirit and scope of the present application are included in the protection scope of the present application.

Claims (11)

1. A method for detecting a structure of a video file, the method comprising:

extracting candidate segments including endpoint video frames from the video file;

performing optical character recognition on a plurality of video frames in the candidate segments to obtain a keyword recognition result of each video frame;

extracting image features from each video frame through a machine learning model, and predicting the probability that each video frame belongs to an endpoint segment based on the image features of each video frame;

taking the larger of a first head timestamp and a second head timestamp as a head timestamp, and the smaller of a first tail timestamp and a second tail timestamp as a tail timestamp; wherein the first head timestamp is the timestamp of the video frame with the largest timestamp among the video frames of a head candidate segment that have keyword recognition results, and the first tail timestamp is the timestamp of the video frame with the smallest timestamp among the video frames of a tail candidate segment that have keyword recognition results; the second head timestamp is the timestamp corresponding to the video frame whose head probability is the largest and exceeds a first probability threshold, and the second tail timestamp is the timestamp corresponding to the video frame whose tail probability is the largest and exceeds a second probability threshold;

taking the video frame corresponding to the head timestamp as a head-to-tail frame and taking the video frame corresponding to the tail timestamp as a tail-to-head frame; and

determining the endpoint segment in the video file based on the endpoint video frames and a demarcation video frame, wherein the demarcation video frame includes the head-to-tail frame and the tail-to-head frame.
2. The method of claim 1, wherein the endpoint video frames comprise a video first frame and a video last frame of the video file; the extracting candidate segments including the endpoint video frames from the video file includes:

when the duration of the video file is longer than the sum of the length of a first preset time period and the length of a last preset time period, extracting a head candidate segment which comprises the video first frame and has the length of the first preset time period from the video file, and extracting a tail candidate segment which comprises the video last frame and has the length of the last preset time period from the video file.
3. The method of claim 1, wherein performing optical character recognition on the plurality of video frames in the candidate segment to obtain a keyword recognition result of each video frame comprises:
Extracting a plurality of video frames from the candidate segments at fixed time intervals;
performing image preprocessing on each video frame to obtain a corresponding binarized image;
dividing the binarized image to obtain a character image containing a plurality of characters;
extracting character features of a plurality of characters in the character image, performing feature matching based on the character features, and taking keywords obtained by matching as keyword recognition results of the video frames corresponding to the character image.
4. The method according to claim 3, wherein the performing feature matching based on the character features, using keywords obtained by the matching as keyword recognition results of the video frames corresponding to the character images, includes:
traversing a keyword feature library to match the features in the keyword feature library with the character features, and taking a keyword corresponding to the feature with the highest matching degree as a keyword recognition result of the video frame corresponding to the character image.
5. The method of claim 2, wherein the endpoint segments comprise a head and a tail; the extracting image features of each video frame through a machine learning model, and predicting the probability that each video frame belongs to an endpoint segment based on the image features of each video frame, comprises:
Carrying out convolution processing on video frames in the head candidate segment and the tail candidate segment through the machine learning model to obtain corresponding image features;
and classifying the image features to obtain the head probability that the video frames in the head candidate segments belong to the head and the tail probability that the video frames in the tail candidate segments belong to the tail.
6. The method according to claim 1, wherein the method further comprises:

when the demarcation video frame is not identified from the plurality of video frames based on the keyword recognition result of each video frame and the probability that each video frame belongs to the endpoint segment, taking the video first frame as the head-to-tail frame, and taking the video last frame as the tail-to-head frame.
7. The method of claim 6, wherein the determining the endpoint segment in the video file based on the endpoint video frame and the demarcation video frame comprises:

taking a segment formed by the video frames whose timestamps lie between the timestamp corresponding to the video first frame and the timestamp corresponding to the head-to-tail frame as the head; and

taking a segment formed by the video frames whose timestamps lie between the timestamp corresponding to the tail-to-head frame and the timestamp corresponding to the video last frame as the tail.
8. The method of any of claims 1 to 7, wherein prior to said extracting candidate segments comprising the endpoint video frames from the video file, the method further comprises:

adding a label to each video frame in a video file sample based on a head timestamp and a tail timestamp of the video file sample, wherein the labels comprise main feature, head, and tail;
extracting image characteristics of each video frame;
based on the image characteristics of each video frame, forward propagation is carried out in the machine learning model, and the probability that each video frame belongs to the endpoint segment is obtained;
determining a type of each of the video frames based on a probability that each of the video frames belongs to the endpoint segment;
based on the type of each of the video frames and the errors of the labels of each of the video frames, back propagation is performed in the machine learning model to update parameters of the machine learning model.
9. A structure detecting apparatus for a video file, comprising:
An extraction module for extracting candidate segments including endpoint video frames from a video file;
the first recognition module is used for carrying out optical character recognition on a plurality of video frames in the candidate segment to obtain a keyword recognition result of each video frame;
a prediction module for extracting image features from each of the video frames through a machine learning model, and predicting a probability that each of the video frames belongs to an endpoint segment based on the image features of each of the video frames;
the second identification module is used for taking the larger of a first head timestamp and a second head timestamp as a head timestamp and the smaller of a first tail timestamp and a second tail timestamp as a tail timestamp; wherein the first head timestamp is the timestamp of the video frame with the largest timestamp among the video frames of a head candidate segment that have keyword recognition results, and the first tail timestamp is the timestamp of the video frame with the smallest timestamp among the video frames of a tail candidate segment that have keyword recognition results; the second head timestamp is the timestamp corresponding to the video frame whose head probability is the largest and exceeds a first probability threshold, and the second tail timestamp is the timestamp corresponding to the video frame whose tail probability is the largest and exceeds a second probability threshold;

taking the video frame corresponding to the head timestamp as a head-to-tail frame and taking the video frame corresponding to the tail timestamp as a tail-to-head frame;

and the determining module is used for determining the endpoint segment in the video file based on the endpoint video frames and a demarcation video frame, wherein the demarcation video frame comprises the head-to-tail frame and the tail-to-head frame.
10. An electronic device, the electronic device comprising:
a memory for storing executable instructions;
a processor for implementing the method for detecting the structure of a video file according to any one of claims 1 to 8 when executing the executable instructions stored in said memory.
11. A computer readable storage medium storing executable instructions which when executed by a processor implement the method of structure detection of a video file according to any one of claims 1 to 8.
CN202011181785.9A 2020-10-29 2020-10-29 Method and device for detecting structure of video file Active CN112291589B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011181785.9A CN112291589B (en) 2020-10-29 2020-10-29 Method and device for detecting structure of video file

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011181785.9A CN112291589B (en) 2020-10-29 2020-10-29 Method and device for detecting structure of video file

Publications (2)

Publication Number Publication Date
CN112291589A CN112291589A (en) 2021-01-29
CN112291589B true CN112291589B (en) 2023-09-22

Family

ID=74353469

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011181785.9A Active CN112291589B (en) 2020-10-29 2020-10-29 Method and device for detecting structure of video file

Country Status (1)

Country Link
CN (1) CN112291589B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113347489B (en) * 2021-07-09 2022-11-18 北京百度网讯科技有限公司 Video clip detection method, device, equipment and storage medium
CN115988164A (en) * 2022-12-03 2023-04-18 北京视通科技有限公司 Conference room multimedia control method, system and computer equipment
CN117058596B (en) * 2023-10-11 2023-12-29 上海凯翔信息科技有限公司 Video processing system for acquiring title

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030200548A1 (en) * 2001-12-27 2003-10-23 Paul Baran Method and apparatus for viewer control of digital TV program start time
US11568247B2 (en) * 2019-03-22 2023-01-31 Nec Corporation Efficient and fine-grained video retrieval

Patent Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102075695A (en) * 2010-12-30 2011-05-25 中国科学院自动化研究所 New generation intelligent cataloging system and method facing large amount of broadcast television programs
US9465996B1 (en) * 2015-09-15 2016-10-11 Echostar Technologies Llc Apparatus, systems and methods for control of media content event recording
CN108769731A (en) * 2018-05-25 2018-11-06 北京奇艺世纪科技有限公司 The method, apparatus and electronic equipment of target video segment in a kind of detection video
CN108924586A (en) * 2018-06-20 2018-11-30 北京奇艺世纪科技有限公司 A kind of detection method of video frame, device and electronic equipment
WO2020015492A1 (en) * 2018-07-18 2020-01-23 腾讯科技(深圳)有限公司 Method and device for identifying key time point of video, computer apparatus and storage medium
CN108810620A (en) * 2018-07-18 2018-11-13 腾讯科技(深圳)有限公司 Identify method, computer equipment and the storage medium of the material time point in video
CN108965980A (en) * 2018-07-20 2018-12-07 腾讯科技(深圳)有限公司 Recommendation display methods, device, terminal and storage medium
CN110798752A (en) * 2018-08-03 2020-02-14 北京京东尚科信息技术有限公司 Method and system for generating video summary
WO2020143156A1 (en) * 2019-01-11 2020-07-16 平安科技(深圳)有限公司 Hotspot video annotation processing method and apparatus, computer device and storage medium
CN110121118A (en) * 2019-06-17 2019-08-13 腾讯科技(深圳)有限公司 Video clip localization method, device, computer equipment and storage medium
US10706286B1 (en) * 2019-09-05 2020-07-07 Alphonso Inc. 3D convolutional neural networks for television advertisement detection
CN111083526A (en) * 2019-12-31 2020-04-28 广州酷狗计算机科技有限公司 Video transition method and device, computer equipment and storage medium
CN111147891A (en) * 2019-12-31 2020-05-12 杭州威佩网络科技有限公司 Method, device and equipment for acquiring information of object in video picture
CN111488487A (en) * 2020-03-20 2020-08-04 西南交通大学烟台新一代信息技术研究院 Advertisement detection method and detection system for all-media data
CN111372116A (en) * 2020-03-27 2020-07-03 咪咕文化科技有限公司 Video playing prompt information processing method and device, electronic equipment and storage medium
CN111479130A (en) * 2020-04-02 2020-07-31 腾讯科技(深圳)有限公司 Video positioning method and device, electronic equipment and storage medium
CN111432140A (en) * 2020-06-15 2020-07-17 成都索贝数码科技股份有限公司 Method for splitting television news into strips by using artificial neural network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Mastering RMVB video editing — an illustrated guide to the RealMedia Editor video editing software; Kunkun; Computer Fan (Popular Edition), No. 02; full text *
Method for extracting real-time caption information from video signals; Ou Guobin, Zhang Li, Xie Pan; Journal of Tsinghua University (Science and Technology), No. 07; full text *

Also Published As

Publication number Publication date
CN112291589A (en) 2021-01-29

Similar Documents

Publication Publication Date Title
CN112291589B (en) Method and device for detecting structure of video file
AU2016277657B2 (en) Methods and systems for identifying media assets
CN109582945B (en) Article generation method, article generation device and storage medium
US9710469B2 (en) Efficient data distribution to multiple devices
US8516119B2 (en) Systems and methods for determining attributes of media items accessed via a personal media broadcaster
CN103052953B (en) Messaging device, information processing method
EP2541963A2 (en) Method for identifying video segments and displaying contextually targeted content on a connected television
EP3323055A1 (en) Systems and methods for partitioning search indexes for improved efficiency in identifying media segments
CN102084361A (en) Media asset management
US9940972B2 (en) Video to data
CN105144141A (en) Systems and methods for addressing a media database using distance associative hashing
CN1393107A (en) Transcript triggers for video enhancement
CN111314732A (en) Method for determining video label, server and storage medium
CN109600625B (en) Program searching method, device, equipment and medium
CN111368141A (en) Video tag expansion method and device, computer equipment and storage medium
CN111263183A (en) Singing state identification method and singing state identification device
CN114372172A (en) Method and device for generating video cover image, computer equipment and storage medium
CN112822539A (en) Information display method, device, server and storage medium
CN113407775A (en) Video searching method and device and electronic equipment
CN111274449A (en) Video playing method and device, electronic equipment and storage medium
CN114782879B (en) Video identification method and device, computer equipment and storage medium
CN117009577A (en) Video data processing method, device, equipment and readable storage medium
JP4755122B2 (en) Image dictionary generation method, apparatus, and program
CN114245229A (en) Short video production method, device, equipment and storage medium
CN112101197A (en) Face information acquisition method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant