CN111479130A - Video positioning method and device, electronic equipment and storage medium - Google Patents

Video positioning method and device, electronic equipment and storage medium

Info

Publication number
CN111479130A
Authority
CN
China
Prior art keywords
video
candidate
segment
type
identification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010256464.4A
Other languages
Chinese (zh)
Other versions
CN111479130B (en)
Inventor
徐孩
梁健豪
车翔
管琰平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202010256464.4A priority Critical patent/CN111479130B/en
Publication of CN111479130A publication Critical patent/CN111479130A/en
Application granted granted Critical
Publication of CN111479130B publication Critical patent/CN111479130B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N21/23418Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44008Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/845Structuring of content, e.g. decomposing content into time segments
    • H04N21/8456Structuring of content, e.g. decomposing content into time segments by decomposing the content in the time domain, e.g. in time segments

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Television Signal Processing For Recording (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the application discloses a video positioning method, a video positioning device, electronic equipment and a storage medium, wherein the embodiment of the application can select candidate video clips from videos to be identified according to the duration of the videos to be identified; performing fragment type identification on each video frame in the candidate video fragments to obtain an identification result sequence consisting of identification results of each video frame; selecting a video frame from the candidate video clips as a candidate clip type boundary position; separating candidate sub-segments from the candidate video segments according to the boundary positions of the candidate segment types; acquiring statistical parameters of the types of the video frame segments in the candidate sub-segments according to the recognition result sequence, and determining a target sub-segment from the candidate sub-segments according to the statistical parameters and a preset statistical parameter threshold; and acquiring a candidate segment type boundary position corresponding to the target sub-segment as a segment type boundary position in the video to be identified. Therefore, the boundary positions of different types of fragments in the video can be determined efficiently and quickly.

Description

Video positioning method and device, electronic equipment and storage medium
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a video positioning method, a video positioning device, electronic equipment and a storage medium.
Background
In recent years, the content and types of videos have become increasingly rich. Video content is no longer limited to television series, movies and other productions made by professional teams; more and more users upload original short video content to video playing platforms. According to segment type, a video can be divided into several different types of segments, such as a slice header and a slice trailer. Many video playing platforms currently identify the boundary positions of the different types of segments in long videos, so as to provide functions such as skipping the slice header and slice trailer during playback. Most existing platforms determine these boundary positions by manual viewing, by obtaining them directly from the video provider, by computing the similarity between video segments and slice header/trailer samples, or by methods based on video complexity and the like. However, because users upload a large number of short videos, the number of content creators is large, and slice headers and trailers are produced in flexible and diverse forms, such methods cannot accurately and efficiently identify the boundary positions of the different types of segments in videos (especially short videos).
Disclosure of Invention
In view of this, embodiments of the present application provide a video positioning method, an apparatus, an electronic device, and a storage medium, which can accurately and efficiently identify boundary positions of segments of different types in a video.
In a first aspect, an embodiment of the present application provides a video positioning method, including:
acquiring a video to be identified;
selecting candidate video clips from the video to be identified according to the duration of the video to be identified;
performing segment type identification on each video frame in the candidate video segment to obtain an identification result sequence, wherein the identification result sequence is a sequence formed by identification results of each video frame in the candidate video segment;
selecting a video frame from the candidate video clips as a candidate clip type boundary position;
separating candidate sub-segments from the candidate video segments according to the boundary positions of the candidate segment types;
acquiring statistical parameters of the types of the video frame segments in the candidate sub-segments according to the identification result sequence, and determining a target sub-segment from the candidate sub-segments according to the statistical parameters and a preset statistical parameter threshold;
and acquiring a candidate segment type boundary position corresponding to the target sub-segment as a segment type boundary position in the video to be identified.
In an embodiment, the candidate video segments include a first candidate video segment and a second candidate video segment;
the performing segment type identification on each video frame in the candidate video segment to obtain an identification result sequence includes:
splicing the first candidate video clip and the second candidate video clip to obtain a spliced candidate video clip;
adopting an identification network in a preset neural network to identify the type of each video frame in the target candidate video clip to obtain the clip type identification result of each video frame;
and combining the segment type recognition results according to the first candidate video segment and the second candidate video segment to obtain a first recognition result sequence and a second recognition result sequence.
In an embodiment, the candidate video segments include a first candidate video segment and a second candidate video segment, and the predetermined neural network includes a first identification network and a second identification network;
the method for recognizing the segment type of each video frame in the candidate video segment by adopting the preset neural network to obtain a recognition result sequence comprises the following steps:
adopting a first identification network to identify the type of each video frame in the first candidate video clip to obtain the clip type identification result of each video frame;
combining the fragment type recognition results to obtain a first recognition result sequence;
adopting a second identification network to identify the type of each video frame in the second candidate video clip to obtain the clip type identification result of each video frame;
and combining the fragment type recognition results to obtain a second recognition result sequence.
In an embodiment, the performing segment type identification on each video frame to obtain a segment type identification result of each video frame includes:
extracting the characteristics of each video frame according to the convolutional network in the identification network to obtain the characteristic information of the video frame;
and carrying out full connection operation on the characteristic information according to a full connection network in the identification network to obtain a segment type identification result of the video frame.
In an embodiment, the selecting a video frame from the candidate video segment as a candidate segment type boundary position includes:
and selecting video frames meeting a preset tolerance threshold from the candidate video clips as candidate clip type boundary positions according to the identification result sequence and the preset tolerance threshold, wherein the tolerance threshold is the maximum number of video frames of which the clip type identification result is not the type of the target video clip in the target video clips.
In an embodiment, the selecting a video frame from the candidate video segment as a candidate segment type boundary position includes:
and taking the video frame in the candidate video segment as a candidate segment type boundary position.
In an embodiment, the obtaining the statistical parameter of the video frame segment type in the candidate sub-segments according to the recognition result sequence includes:
acquiring initial statistical parameters of the candidate sub-segments according to the recognition result sequence;
acquiring a position encouragement parameter according to the position of the candidate segment type boundary position in the candidate video segment;
and fusing the position encouragement parameters and the corresponding initial statistical parameters to obtain the statistical parameters corresponding to the candidate sub-segments.
In an embodiment, before the performing, by using a preset neural network, segment type identification on each video frame in the candidate video segment to obtain an identification result sequence, the method further includes:
acquiring a plurality of video clip samples, wherein the video clip samples comprise a plurality of video frames marked with real clip types;
identifying a segment type corresponding to a video frame in the video segment sample through a preset initial neural network;
determining a current prediction result according to the fragment type obtained by identification and the real fragment type;
constructing a loss function according to a preset adjusting coefficient, the real fragment type and probability information corresponding to the fragment type obtained by identification;
and adopting a loss function to converge the preset initial neural network until the current prediction result is correct, and obtaining the trained neural network.
In an embodiment, the constructing a loss function according to a preset adjustment coefficient, the real segment type, and probability information corresponding to the identified segment type includes:
constructing a loss weight parameter corresponding to the easily-identified sample according to the preset modulation coefficient, the real fragment type and probability information corresponding to the fragment type obtained by identification;
constructing an initial loss function according to the real fragment type and probability information corresponding to the fragment type obtained by identification;
and constructing the loss function according to the preset balance coefficient, the loss weight parameter and the initial loss function.
In a second aspect, an embodiment of the present application provides a video positioning apparatus, including:
the acquisition unit is used for acquiring a video to be identified;
the candidate unit is used for selecting candidate video clips from the video to be identified according to the duration of the video to be identified;
the identification unit is used for identifying the type of each video frame in the candidate video clip to obtain an identification result sequence, and the identification result sequence is a sequence formed by the identification results of each video frame in the candidate video clip;
the selecting unit is used for selecting a video frame from the candidate video clips as a candidate clip type boundary position;
the separation unit is used for separating candidate sub-segments from the candidate video segments according to the candidate segment type boundary positions;
the determining unit is used for acquiring the statistical parameters of the video frame fragment types in the candidate sub-fragments according to the identification result sequence and determining a target sub-fragment from the candidate sub-fragments according to the statistical parameters and a preset statistical parameter threshold;
and the positioning unit is used for acquiring the candidate segment type boundary position corresponding to the target sub-segment and taking the candidate segment type boundary position as the segment type boundary position in the video to be identified.
In a third aspect, an electronic device provided in an embodiment of the present application includes a processor and a memory, where the memory stores a plurality of instructions; the processor loads instructions from the memory to perform the steps in the video positioning method described above.
In a fourth aspect, a storage medium is provided in the embodiments of the present application, on which a computer program is stored; when the computer program runs on a computer, the computer is caused to execute the video positioning method provided in any embodiment of the present application.
According to the embodiment of the application, the video to be identified can be obtained, and then the candidate video clip is selected from the video to be identified according to the duration of the video to be identified; performing segment type identification on each video frame in the candidate video segment to obtain an identification result sequence, wherein the identification result sequence is a sequence formed by identification results of each video frame in the candidate video segment; selecting a video frame from the candidate video clips as a candidate clip type boundary position; separating candidate sub-segments from the candidate video segments according to the boundary positions of the candidate segment types; acquiring statistical parameters of the types of the video frame segments in the candidate sub-segments according to the identification result sequence, and determining a target sub-segment from the candidate sub-segments according to the statistical parameters and a preset statistical parameter threshold; and acquiring a candidate segment type boundary position corresponding to the target sub-segment as a segment type boundary position in the video to be identified.
The scheme finally determines the boundary position according to the statistical parameters of the types of the video frame fragments, can avoid the influence on the determination of the boundary position on the error identification result of individual video frames, enables the identification result to be more accurate, does not need manual intervention, and can realize full automation, thereby being capable of identifying the boundary positions of the fragments of different types in the video more accurately and efficiently.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a scene schematic diagram of a video positioning method provided in an embodiment of the present application;
fig. 2a is a flowchart of a video positioning method provided in an embodiment of the present application;
fig. 2b is another flowchart of a video positioning method provided in an embodiment of the present application;
fig. 3 is a schematic structural diagram of a video positioning apparatus according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an electronic device provided in an embodiment of the present application;
fig. 5a is a specific flowchart of a video positioning method according to an embodiment of the present application;
FIG. 5b is a schematic structural diagram of a neural network with a single model structure provided in an embodiment of the present application;
FIG. 5c is a schematic structural diagram of a neural network with a dual model structure provided in an embodiment of the present application;
fig. 5d is a schematic diagram of a video recognition result according to an embodiment of the present application;
fig. 5e is a schematic diagram of another video recognition result provided in the embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The embodiment of the application provides a video positioning method and device, electronic equipment and a storage medium. The video positioning apparatus may be integrated in an electronic device, and the electronic device may be a server or a terminal.
The video positioning method provided by the embodiment of the application relates to the computer vision technology and machine learning directions in the field of artificial intelligence, and can identify the segment type of each video frame (namely, each image forming the video) through a neural network obtained by machine learning training.
Among them, Artificial Intelligence (AI) refers to theories, methods, techniques and application systems that use a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making. Artificial intelligence technology is a comprehensive subject covering a wide range of fields, including both hardware-level and software-level technologies. Artificial intelligence software technology mainly includes computer vision technology, machine learning/deep learning directions and the like.
Machine Learning (ML) is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and so on. It specially studies how a computer simulates or implements human learning behavior to acquire new knowledge or skills, and reorganizes existing knowledge structures to continuously improve its own performance.
In the embodiment of the application, video positioning refers to the technology and process of finding the boundary positions of different types of segments in a video. In this application, a video may include three different types of segments, namely a slice header, a feature film and a slice trailer; or only a feature film; or a slice header and a feature film; or a feature film and a slice trailer. The acquired boundary positions can be applied to different scenes and electronic devices. For example, when playing a video, the terminal may skip the slice header and slice trailer according to the boundary positions; specifically, when starting to play the video, the terminal may jump directly to the position where the feature film starts, or skip the trailer and directly start playing the next video after the feature film finishes, so as to improve user experience. For another example, for video understanding tasks such as video classification, video motion segmentation and video stretching deformation identification, the server can skip the slice header and slice trailer in the video, so as to prevent them from interfering with video understanding. Because slice header and slice trailer content often has low relevance to the main body of the video, it does not help these video understanding tasks and may even have a negative impact.
For example, referring to fig. 1, first, the electronic device integrated with the video positioning apparatus acquires a video to be identified, and then selects a candidate video segment from the video to be identified according to the duration of the video to be identified; performing segment type identification on each video frame in the candidate video segment to obtain an identification result sequence, wherein the identification result sequence is a sequence formed by identification results of each video frame in the candidate video segment; selecting a video frame from the candidate video clips as a candidate clip type boundary position; separating candidate sub-segments from the candidate video segments according to the boundary positions of the candidate segment types; acquiring statistical parameters of the types of the video frame segments in the candidate sub-segments according to the identification result sequence, and determining a target sub-segment from the candidate sub-segments according to the statistical parameters and a preset statistical parameter threshold; and acquiring a candidate segment type boundary position corresponding to the target sub-segment as a segment type boundary position in the video to be identified.
The scheme finally determines the boundary position according to the statistical parameters of the types of the video frame fragments, can avoid the influence on the determination of the boundary position on the error identification result of individual video frames, enables the identification result to be more accurate, does not need manual intervention, and can realize full automation, thereby being capable of identifying the boundary positions of the fragments of different types in the video more accurately and efficiently.
The following are detailed below. It should be noted that the following description of the embodiments is not intended to limit the preferred order of the embodiments.
The embodiment will be described from the perspective of a video positioning apparatus, where the video positioning apparatus may be specifically integrated in an electronic device, and the electronic device may be a server or a terminal; the terminal may include a mobile phone, a tablet Computer, a notebook Computer, and a Personal Computer (PC).
As shown in fig. 2a and fig. 5a, the specific flow of the video positioning method may be as follows:
101. and acquiring a video to be identified.
The video to be identified is the video which needs to be identified at the current moment and the boundary position is determined.
The video positioning method can be applied to different scenes, the corresponding execution main bodies of the method are different, and the methods for acquiring the videos to be identified are different.
For example, when the video positioning method is applied to a video playing scene, an execution subject of the method may be represented as a terminal, and of course, may also be represented as a server. The terminal or the server can determine the video which needs to be played currently as the video to be identified based on the operation of the user on the terminal interface, trigger a corresponding video acquisition instruction based on the operation of the user, and acquire the video to be identified from the server or the local storage according to the video acquisition instruction.
For example, when the video positioning method is applied to a video understanding scene, an execution subject of the method is generally expressed as a server, and of course, may also be expressed as a terminal. The terminal or the server can determine a video needing video understanding as a video to be identified based on user operation, trigger a corresponding video acquisition instruction based on the user operation, and acquire the video to be identified from the server or a local storage according to the video acquisition instruction.
In some embodiments, in order to facilitate transmission of a video file, the obtained video to be identified may be a file that is encapsulated and compressed, and before the next step is performed, the obtained original video file needs to be decoded and decapsulated to obtain the video to be identified, and the specific steps may include:
decapsulating the obtained original video file to obtain an independent pure video stream and an independent pure audio stream;
and respectively decoding the pure video stream and the pure audio stream to obtain a video frame sequence and an audio frame sequence in the video to be identified.
The encapsulation format of the original video file is not limited; for example, widely used video encapsulation formats at present include MP4 (Moving Picture Experts Group 4), TS, MKV and the like. In one embodiment, these mainstream encapsulation formats may be decapsulated using decapsulation software. For example, they can be decapsulated by using FFmpeg (Fast Forward MPEG) or a third-party software tool to obtain a pure video stream and a pure audio stream. Next, decoding software, such as FFmpeg or third-party tool software, may be used to decode the pure video stream and the pure audio stream respectively, to obtain video frame data and audio frame data that can be processed.
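As an illustration of this decapsulation and decoding step, the following is a minimal sketch that shells out to the FFmpeg command-line tool; the file names, the raw-stream copy targets and the 1-frame-per-second sampling rate are assumptions made for illustration and are not specified by this application.

```python
import subprocess

def demux_and_decode(src: str, frames_dir: str) -> None:
    # Decapsulate: copy the elementary video and audio streams out of the container
    # (assumes an MP4 source whose streams can be stream-copied as-is).
    subprocess.run(["ffmpeg", "-y", "-i", src, "-an", "-c:v", "copy", "video_only.mp4"], check=True)
    subprocess.run(["ffmpeg", "-y", "-i", src, "-vn", "-c:a", "copy", "audio_only.m4a"], check=True)
    # Decode: sample decoded video frames as images for the later per-frame identification.
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-vf", "fps=1", f"{frames_dir}/frame_%06d.jpg"],
        check=True,
    )

demux_and_decode("input.mp4", "frames")
```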
102. And selecting candidate video clips from the video to be identified according to the duration of the video to be identified.
The candidate video segment refers to a video segment that may be a slice head or a slice tail, and the candidate video segment is represented by a sequence of a plurality of video frames.
Before selecting the candidate video clips, developers firstly determine the leader time threshold and the trailer time threshold corresponding to videos with different time lengths according to a large number of samples, and then select the candidate clips from the videos to be identified according to the leader time threshold and the trailer time threshold.
Wherein the candidate video segments may include a first candidate video segment and a second candidate video segment. According to the leader duration threshold, a first group of video frames are selected from the video to be identified to serve as a first candidate video segment, and according to the trailer duration threshold, a last group of video frames are selected from the video to be identified to serve as a second candidate video segment.
In an embodiment, a developer can divide a video into three video types, namely a long video, a short video and a small video according to the time length range of the video, determine the time length proportion range of a leader and a trailer occupying the video in different types of videos according to a large number of samples, determine the type of the video according to the time length of the video to be identified, then determine the time length proportion range according to the type of the video, and finally determine a leader time length threshold and a trailer time length threshold.
In an embodiment, a developer can divide a video into three video types, namely a long video, a short video and a small video according to a duration range of the video, and determine a leader duration threshold and a trailer duration threshold corresponding to different types of videos according to a large number of samples. Determining the type of the video according to the time length of the video to be identified, and then determining a leader time length threshold and a trailer time length threshold according to the type of the video.
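For illustration, the following sketch selects the first and second candidate video segments from a decoded frame sequence; the duration thresholds per video type are hypothetical values standing in for the thresholds that developers would determine from a large number of samples.

```python
def select_candidate_segments(frames, fps, total_duration):
    # Hypothetical leader/trailer duration thresholds chosen by video type.
    if total_duration >= 1800:            # assumed "long video"
        head_s, tail_s = 120.0, 120.0
    elif total_duration >= 300:           # assumed "short video"
        head_s, tail_s = 45.0, 45.0
    else:                                 # assumed "small video"
        head_s, tail_s = 15.0, 15.0
    head_n = min(len(frames), int(head_s * fps))
    tail_n = min(len(frames), int(tail_s * fps))
    first_candidate = frames[:head_n]                        # possible slice header region
    second_candidate = frames[-tail_n:] if tail_n else []    # possible slice trailer region
    return first_candidate, second_candidate
```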
103. And identifying the type of each video frame in the candidate video clips to obtain an identification result sequence.
And the identification result sequence is a sequence formed by the identification results of all the video frames in the candidate video clip.
The segment type identification result of the video frame at least comprises the segment type of the video frame, wherein the segment type can be represented as the probability that the video frame belongs to a certain segment type. In an embodiment, the segment type identification result of the video frame may further include a confidence that the video frame belongs to a certain segment type.
In an embodiment, referring to fig. 5b, the neural network includes an identification network, and the step "performing segment type identification on each video frame in the candidate video segment to obtain an identification result sequence" may specifically include:
splicing the first candidate video clip and the second candidate video clip to obtain a spliced candidate video clip;
adopting an identification network in a preset neural network to identify the segment type of each video frame in the target candidate video segment (namely the spliced candidate video segment) to obtain the segment type identification result of each video frame;
and combining the segment type recognition results according to the first candidate video segment and the second candidate video segment to obtain a first recognition result sequence and a second recognition result sequence.
In another embodiment, referring to fig. 5c, the neural network includes a first identification network and a second identification network, which respectively identify a first candidate video segment and a second candidate video segment, and the step "performing segment type identification on each video frame in the candidate video segments to obtain an identification result sequence" may specifically include:
adopting a first identification network to identify the type of each video frame in the first candidate video clip to obtain the clip type identification result of each video frame;
combining the fragment type recognition results to obtain a first recognition result sequence;
adopting a second identification network to identify the type of each video frame in the second candidate video clip to obtain the clip type identification result of each video frame;
and combining the fragment type recognition results to obtain a second recognition result sequence.
The recognition network can be constructed based on the existing ResNet model. ResNet is a basic feature extraction network in the computer vision field; it uses a plurality of network layers with parameters (referred to as convolutional layers in the following embodiments) to learn the residual representation between input and output, rather than directly trying to learn the mapping between input and output with parameter layers (i.e., network layers with parameters) as in a general CNN network (such as AlexNet/VGG, etc.).
The identification network may include a convolutional network for learning feature representation and a full-connection network for classification identification, and performs segment type identification on each video frame by using the identification network to obtain a segment type identification result of each video frame, which may specifically include the following steps:
extracting the characteristics of each video frame according to the convolutional network in the identification network to obtain the characteristic information of the video frame;
and carrying out full connection operation on the characteristic information according to a full connection network in the identification network to obtain a segment type identification result of the video frame.
The convolutional network in this embodiment may include five convolutional layers (Convolutional Layers), as follows:
The convolutional layers are mainly used for feature extraction on the input image (such as a training sample or a video frame to be identified). The size of each convolution kernel may be determined according to practical applications; for example, the convolution kernel sizes from the first convolutional layer to the fourth convolutional layer may be (7, 7), (5, 5), (3, 3) and (3, 3). Optionally, in order to reduce the computational complexity and improve the calculation efficiency, the convolution kernel sizes of all five convolutional layers may also be set to (3, 3). Optionally, in order to improve the expression capability of the model, a non-linear factor may also be added through an activation function; for example, the activation function may be ReLU (Rectified Linear Unit). In this embodiment, the feature information obtained after performing the convolution operation on a video frame is expressed as a feature map.
Alternatively, in order to further reduce the amount of calculation, a downsampling (or pooling) operation may be performed after the convolutional layers and before the fully connected layer. The downsampling operation is basically the same as the convolution operation, except that the downsampling kernel only takes the maximum value (max) or the average value (average) of the corresponding positions. In the embodiment of the present invention, the downsampling operation, specifically average pooling, can be considered to be performed after the fifth convolutional layer of the convolutional network. In one embodiment, because fully connected layer parameters are redundant, Global Average Pooling (GAP) may also be used instead of the fully connected layer to fuse the learned depth features.
It should be noted that, for convenience of description, in the embodiment of the present invention, the downsampling layers (also referred to as pooling layers) may be included in the full-connection network.
The Fully Connected network in this embodiment includes at least one Fully Connected layer (FC), as follows:
Fully connected layer: it maps the learned features to the sample label space and mainly plays the role of a "classifier" in the whole recognition network. Each node of the fully connected layer is connected with all nodes output by the previous layer (such as a downsampling layer); one node of the fully connected layer is called a neuron of the fully connected layer, and the number of neurons can be determined according to the requirements of practical applications. The fully connected layer performs a weighting operation on the feature information obtained by the convolution operation to obtain a score for each category. Similar to the convolutional layer, optionally, a non-linear factor may be added after the fully connected operation through an activation function; for example, a sigmoid or softmax activation function may be added.
In an embodiment, the fully connected network further includes a softmax layer arranged after the fully connected layer. The softmax layer can be understood as a normalization of the output results, mapping the scores obtained by the fully connected operation to probabilities. If a video frame has 3 possible segment types (slice header, slice trailer, feature film), the output of the softmax layer is a 3-dimensional vector. The first value in the vector is the probability that the current video frame belongs to the first segment type (e.g., the slice header), and the second value is the probability that the video frame belongs to the second segment type (e.g., the slice trailer). The elements of this 3-dimensional vector sum to 1. The input and output vectors of the softmax layer have the same dimension.
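As a rough illustration of such a recognition network (convolutional backbone for feature extraction, global average pooling, a fully connected layer and softmax), the following PyTorch sketch uses a ResNet backbone from torchvision; the backbone depth, the input resolution, the class names and the number of segment types are assumptions rather than values fixed by this application.

```python
import torch
import torch.nn as nn
from torchvision import models

class FrameSegmentClassifier(nn.Module):
    def __init__(self, num_classes: int = 3):   # assumed: slice header / feature film / slice trailer
        super().__init__()
        backbone = models.resnet18(weights=None)                          # assumed backbone depth
        self.features = nn.Sequential(*list(backbone.children())[:-1])    # conv layers + global average pooling
        self.fc = nn.Linear(backbone.fc.in_features, num_classes)         # fully connected "classifier"

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feat = self.features(x).flatten(1)       # feature information of each frame
        scores = self.fc(feat)                   # score for each segment type
        return torch.softmax(scores, dim=1)      # per-frame segment-type probabilities (sum to 1)

# usage: a batch of 4 sampled frames, assumed resized to 224x224
probs = FrameSegmentClassifier()(torch.randn(4, 3, 224, 224))
```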
Of course, referring to fig. 5b and 5c, it can be understood that the neural network structure may further include an input layer for inputting data and an output layer for outputting data, which are not described herein again.
Before the step "identifying the type of each video frame in the candidate video segment to obtain the identification result sequence", a large number of samples are required to train the initial neural network to obtain the trained neural network, wherein the training may include the following steps:
acquiring a plurality of video clip samples, wherein the video clip samples comprise a plurality of video frames marked with real clip types;
identifying a segment type corresponding to a video frame in the video segment sample through a preset initial neural network;
determining a current prediction result according to the fragment type obtained by identification and the real fragment type;
constructing a loss function according to a preset adjusting coefficient, the real fragment type and probability information corresponding to the fragment type obtained by identification;
and adopting a loss function to converge the preset initial neural network until the current prediction result is correct, and obtaining the trained neural network.
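A minimal training-loop sketch for the steps above is given below, assuming the hypothetical FrameSegmentClassifier from the earlier sketch and per-frame labels of the real segment type; the optimizer, learning rate and data loader are illustrative assumptions, and the focal-style loss constructed later in this section can be substituted for the standard cross entropy used here.

```python
import torch
import torch.nn.functional as F

def train(model, loader, epochs: int = 10, lr: float = 1e-4):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for frames, labels in loader:                # labels: real segment type index of each frame
            probs = model(frames)                    # predicted segment-type probabilities
            # negative log-likelihood of the real segment type (standard cross entropy)
            loss = F.nll_loss(torch.log(probs.clamp_min(1e-7)), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```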
Since the candidate video segment contains far fewer positive samples (slice header or slice trailer type video frames) than negative samples (feature film type video frames), the negative samples tend to account for a large portion of the total loss, and most of the loss is contributed by easily identified samples. During training, the optimization direction is then dominated by the easily identified samples, and the result learned by the neural network is easily biased.
The adjusting coefficient is used for reducing the influence degree of sample class imbalance on feature learning, and the optimizing direction is corrected by setting the adjusting coefficient, so that the problems of learning of samples difficult to identify and imbalance of positive and negative samples are solved.
In an embodiment, the adjusting coefficients include a modulation coefficient r for modulating weights of a sample difficult to identify and a sample easy to identify, and a balance coefficient a for balancing a ratio of positive and negative samples, and the step "construct a loss function according to a preset adjusting coefficient, the real segment type, and probability information corresponding to the segment type obtained by identification" may specifically include the following steps:
constructing a loss weight parameter corresponding to the easily-identified sample according to a preset modulation coefficient, the real fragment type and probability information corresponding to the fragment type obtained by identification;
constructing an initial loss function according to the real fragment type and probability information corresponding to the fragment type obtained by identification;
and constructing the loss function according to a preset balance coefficient, the loss weight parameter and the initial loss function.
The modulation coefficient r is a parameter for modulating weights of the hard-to-identify sample and the easy-to-identify sample, and the balance coefficient a is a parameter for balancing the proportion of the positive sample and the negative sample.
The initial loss function may be flexibly set according to the actual application requirement, for example, the initial loss function CE (p, y) may be calculated by using a standard cross entropy formula, as follows:
CE(p, y) = -log(p) if y = 1, and CE(p, y) = -log(1 - p) otherwise
Take a neural network model that includes two recognition networks as an example. Each recognition network only needs to learn and identify one candidate video segment, and its identification result includes two segment types: feature film and slice header (or slice trailer), so each recognition network is a binary classification network. Here p represents the predicted probability that the segment type of the sample belongs to class 1 (p ranges from 0 to 1), and y represents the real segment type labeled in the sample. When the real segment type is the first segment type (for example, slice header) and y is equal to 1, if the probability p that a certain sample x is predicted as 1 is 0.6, the loss is -log(0.6). Only binary classification is taken as an example here; the rest can be deduced by analogy. When the neural network only includes one identification network, that identification network needs to learn and identify the target candidate video segment (formed by splicing the first candidate video segment and the second candidate video segment), and the identification result includes three segment types: feature film, slice header and slice trailer.
If the predicted segment type t is consistent with the real segment type, the prediction is true and y can be set to 1; otherwise, if the predicted segment type t is inconsistent with the real segment type, the prediction is false and y can be set to 0. When y is 1, the confidence p_t of the predicted segment type t is the probability p of the segment type corresponding to the predicted segment type t; when y is not 1 (for example, 0), the confidence p_t is the difference between 1 and the segment type probability p corresponding to the predicted segment type t. Expressed as a formula:
p_t = p if y = 1, and p_t = 1 - p otherwise
where p is the output segment type prediction probability value. Training continues by reducing the error between the real segment type and the segment type prediction probability value, so that the weights are adjusted to appropriate values and the trained neural network model is obtained.
In this case, the formula for calculating the initial loss function can be written as:
CE(p, y) = CE(p_t) = -log(p_t)
For example, the loss function FL(p_t), which may be called the focal loss, can be calculated by the following formula:
FL(p_t) = -a_t · (1 - p_t)^r · log(p_t)
where p_t is the confidence of the predicted segment type t, a_t is a preset balance coefficient, and r is a preset modulation coefficient. The term (1 - p_t)^r can be called the loss weight parameter of the easily identified samples.
When the modulation coefficient is 0, the loss function is consistent with the initial loss function. The modulation coefficient is a number greater than 1 and can be used to control the weight of the easily identified samples in the loss function, so that training of the neural network pays more attention to the difficultly identified samples. In addition, since the candidate video segment contains far fewer positive samples (slice header or slice trailer type video frames) than negative samples (feature film type video frames), in order to give greater weight to the fewer samples, the preset balance coefficient can be set to a number between 0 and 1 to control the weight of the negative samples in the total loss.
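The following is a minimal sketch of such a focal-style loss FL(p_t) = -a_t · (1 - p_t)^r · log(p_t) for the binary case (slice header/trailer vs. feature film); the default values of the balance coefficient a and the modulation coefficient r are common choices assumed for illustration, not values given in this application.

```python
import torch

def focal_loss(p: torch.Tensor, y: torch.Tensor, a: float = 0.25, r: float = 2.0) -> torch.Tensor:
    # p: predicted probability that each frame is of the slice header/trailer type (class 1)
    # y: real segment type labels (1 = slice header/trailer, 0 = feature film)
    p_t = torch.where(y == 1, p, 1.0 - p)                      # confidence of the real segment type
    a_t = torch.where(y == 1, torch.full_like(p, a), torch.full_like(p, 1.0 - a))
    # FL(p_t) = -a_t * (1 - p_t)^r * log(p_t); (1 - p_t)^r down-weights easily identified samples
    return (-a_t * (1.0 - p_t) ** r * torch.log(p_t.clamp_min(1e-7))).mean()
```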
It is understood that the above method can also be used to train the recognition model, and two sets of neural networks are constructed by using the trained recognition model together with the input and output layers and the layer for combining the recognition results, and the structure is called a dual-model structure. Alternatively, a trained recognition model may be used to construct a set of neural networks together with input and output layers, a layer for stitching the first candidate video segment and the second candidate video segment, and a layer for combining recognition results, which is referred to as a single model structure.
104. And selecting a video frame meeting a preset tolerance threshold value from the candidate video clips as a candidate clip type boundary position according to the identification result sequence and the preset tolerance threshold value.
And the tolerance threshold is the maximum number of video frames of which the segment type identification result is not the type of the target video segment in the target video segment.
The target video segment type is the non-feature type corresponding to the candidate video segment: when the candidate video segment is the beginning part of the video, the target video segment type is the slice header type, and when the candidate video segment is the ending part of the video, the target video segment type is the slice trailer type.
For example, the preset tolerance threshold of the slice header type is 2, the 20 th frame is a 2 nd video frame that is not the slice header type, and the 25 th frame is a 3 rd video frame that is not the slice header type, and then each frame of the first 24 frames can be used as a candidate slice type boundary position between the slice header and the feature film.
Due to recognition errors, some frame images in the slice header or slice trailer of the video may be recognized as the feature film type. The strategy of the application allows video frames within the tolerance threshold number in the target video segment to be recognized as non-target video segment types, so that the boundary position in the video can be determined more accurately.
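As an illustration of the tolerance-threshold strategy, the sketch below scans the per-frame recognition sequence of the first candidate segment and keeps every frame before the tolerance is exceeded as a candidate slice header boundary; the 0/1 encoding of the recognition sequence and the default tolerance of 2 are assumptions for illustration.

```python
def candidate_header_boundaries(recognition_seq, tolerance: int = 2):
    # recognition_seq: 1 if a frame was identified as the slice header type, else 0
    candidates, non_header = [], 0
    for idx, is_header in enumerate(recognition_seq):
        if not is_header:
            non_header += 1
            if non_header > tolerance:        # tolerance threshold exceeded: stop scanning
                break
        candidates.append(idx)                # frame usable as a candidate boundary position
    return candidates
```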
105. And separating candidate sub-segments from the candidate video segments according to the boundary positions of the candidate segment types.
For the first candidate segment (i.e., the candidate slice header segment), all video frames from the first frame of the video to the candidate segment type boundary position may be combined into a candidate sub-segment.
For the second candidate segment (i.e., the candidate end-of-segment), all video frames between the candidate segment type boundary position in the video and the last frame of the video may be combined to obtain a candidate sub-segment.
106. And acquiring the statistical parameters of the video frame fragment types in the candidate sub-fragments according to the identification result sequence, and determining a target sub-fragment from the candidate sub-fragments according to the statistical parameters and a preset statistical parameter threshold.
The statistical parameter is a parameter used to represent the probability that the candidate sub-segment is the target sub-segment, and may be expressed as the number of target video frames in the candidate sub-segment whose identification result is the type of the target video segment, or the ratio of the number of target video frames to the total number of video frames in the candidate sub-segment.
The preset statistical parameter threshold refers to the minimum value of the statistical parameter, which is predetermined by developers according to statistics and experience, when the candidate sub-segment is the target sub-segment.
In general, the candidate sub-segment with the largest statistical parameter and the statistical parameter not less than the preset statistical parameter threshold may be determined as the target sub-segment.
Wherein the target sub-segment can be represented as a slice header or a slice trailer in the video.
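The following sketch illustrates how a target sub-segment could be chosen from these candidates, using the ratio of header-type frames in the candidate sub-segment as the statistical parameter; the 0.8 value is a hypothetical stand-in for the preset statistical parameter threshold.

```python
def locate_header_boundary(recognition_seq, boundaries, ratio_threshold: float = 0.8):
    # recognition_seq: 1 if a frame was identified as the slice header type, else 0
    # boundaries: candidate boundary frame indices (e.g. from candidate_header_boundaries)
    best_boundary, best_ratio = None, 0.0
    for b in boundaries:
        sub = recognition_seq[: b + 1]                 # candidate sub-segment up to the boundary
        ratio = sum(sub) / len(sub)                    # statistical parameter of the sub-segment
        if ratio >= ratio_threshold and ratio > best_ratio:
            best_boundary, best_ratio = b, ratio       # keep the largest qualifying ratio
    return best_boundary                               # segment type boundary position (or None)
```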
107. And acquiring a candidate segment type boundary position corresponding to the target sub-segment as a segment type boundary position in the video to be identified.
As can be seen from the above, the embodiment of the application can acquire the video to be identified, and then select the candidate video segment from the video to be identified according to the duration of the video to be identified; performing segment type identification on each video frame in the candidate video segment to obtain an identification result sequence, wherein the identification result sequence is a sequence formed by identification results of each video frame in the candidate video segment; selecting a video frame from the candidate video clips as a candidate clip type boundary position; separating candidate sub-segments from the candidate video segments according to the boundary positions of the candidate segment types; acquiring statistical parameters of the types of the video frame segments in the candidate sub-segments according to the identification result sequence, and determining a target sub-segment from the candidate sub-segments according to the statistical parameters and a preset statistical parameter threshold; and acquiring a candidate segment type boundary position corresponding to the target sub-segment as a segment type boundary position in the video to be identified.
The scheme finally determines the boundary position according to the statistical parameters of the types of the video frame fragments, can avoid the influence on the determination of the boundary position on the error identification result of individual video frames, enables the identification result to be more accurate, does not need manual intervention, and can realize full automation, thereby being capable of identifying the boundary positions of the fragments of different types in the video more accurately and efficiently.
Since the strategy for determining the boundary position is different according to the identification result of the video frame, the video positioning method of the present application may further replace steps 104, 105, and 106 in the above embodiment with steps 204, 205, and 206, and refer to fig. 2b, where the specific flow is as follows:
201. and acquiring a video to be identified.
202. And selecting candidate video clips from the video to be identified according to the duration of the video to be identified.
203. And identifying the type of each video frame in the candidate video clip to obtain an identification result sequence, wherein the identification result sequence is a sequence formed by the identification results of each video frame in the candidate video clip.
Steps 201 to 203 are the same as steps 101 to 103 in the previous embodiment, and the specific process refers to the above embodiment, which is not described again.
The structure of the neural network may generally adopt the structure of the ResNet-101 model; of course, models such as the ResNet series, the Inception series or the MobileNet series may be substituted.
The training mode of the dual-model structure of the neural network is not limited to first loading and training the single-model structure for initialization and then fine-tuning each network separately; it can also be replaced by training modes such as training the two networks separately or joint multi-task training.
In order to solve the problems of hard example mining and sample imbalance, the loss function is not limited to the construction method in the above embodiment, and may also be constructed by methods such as OHEM (Online Hard Example Mining) and class-balanced cross-entropy loss.
204. And taking the video frame in the candidate video segment as a candidate segment type boundary position.
In the present embodiment, all video frames in the candidate video segment are taken as candidate segment type boundary positions.
205. And separating candidate sub-segments from the candidate video segments according to the boundary positions of the candidate segment types.
For a specific separation principle, see step 105 in the above embodiment, which is not described herein again.
206. And acquiring statistical parameters corresponding to the candidate sub-segments according to the identification results of the video frames in the candidate sub-segments and the positions corresponding to the video frames, and determining a target sub-segment from the candidate sub-segments according to the statistical parameters and a preset statistical parameter threshold.
The step of obtaining the statistical parameters corresponding to the candidate sub-segments according to the identification result of the video frame in the candidate sub-segments and the position corresponding to the video frame may specifically include the following steps:
acquiring initial statistical parameters of the candidate sub-segments according to the recognition result sequence; acquiring a position encouragement parameter according to the position of the candidate segment type boundary position in the candidate video segment; and fusing the position encouragement parameters and the corresponding initial statistical parameters to obtain the statistical parameters corresponding to the candidate sub-segments.
The initial statistical parameter may be expressed as the number of target video frames in the candidate sub-segment whose identification result is the type of the target video segment, or the ratio of the number of target video frames to the total number of video frames in the candidate sub-segment.
The statistical parameter is an actual parameter for representing the probability that the candidate sub-segment is the target sub-segment.
Wherein the position encouragement parameter is a parameter for indicating a degree of contribution of the position of the candidate segment type boundary position in the candidate video segment to the statistical parameter.
In one embodiment, the position encouragement parameter may be calculated using the following formula:
w_d = log(d) + 1, or w_d = log(d), where d denotes the position of the video frame in the candidate video segment.
In an embodiment, the position encouragement parameter may be multiplied by the corresponding initial statistical parameter to obtain the statistical parameter corresponding to the candidate sub-segment.
Some leaders open with real scenes, and the title card only appears in the latter half of the leader. When a person watches the earlier part, it is hard to tell in real time whether the current position still belongs to the leader; an accurate judgment can only be made after key features of the leader (such as the video title or a logo) appear later. Based on this, greater attention should be paid to later-appearing frames, so different position encouragements can be applied to the initial statistical parameters corresponding to different candidate segment type boundary positions, giving later video frames more weight and thereby determining the boundary position of the video to be identified more accurately.
When obtaining the statistical parameters, either the predicted segment types of the video frames can be counted, or the predicted probabilities that the video frames belong to a given segment type can be accumulated.
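A minimal sketch of the scoring in steps 204 to 207 might look as follows, assuming the frame-level predictions are already available as a list of type labels; the function name locate_boundary, the greedy selection rule, and the comparison of the fused score against the threshold are illustrative assumptions rather than the application's exact procedure.

```python
import math
from typing import List, Optional

def locate_boundary(recognition: List[int],
                    target_type: int = 1,
                    threshold: float = 0.5) -> Optional[int]:
    """recognition[i] is the predicted segment type of frame i in the candidate
    video segment; returns the chosen boundary index or None."""
    best_index, best_score = None, threshold
    for d in range(1, len(recognition) + 1):       # candidate boundary after frame d-1
        sub_segment = recognition[:d]              # candidate sub-segment
        hits = sum(1 for t in sub_segment if t == target_type)
        initial_stat = hits / len(sub_segment)     # initial statistical parameter
        position_weight = math.log(d) + 1.0        # position encouragement parameter
        score = initial_stat * position_weight     # fused statistical parameter
        if score >= best_score:
            best_index, best_score = d, score
    return best_index
```

With this weighting, a sub-segment that ends at a later frame while still being dominated by target-type frames scores higher than an equally pure but earlier one, which is the intended effect of the position encouragement.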
Steps 201 to 207 can be referred to as a grouping strategy. The grouping strategy is not limited to giving more encouragement to video frames that appear later (position encouragement); more encouragement may also be given to video frames of the same type that appear consecutively.
207. And acquiring a candidate segment type boundary position corresponding to the target sub-segment as a segment type boundary position in the video to be identified.
Referring to fig. 5d, when a video with a length of 242 s is input, the structure of the video can be predicted with the scheme of the present application: the leader ends at 3.00 s with a confidence of 0.805, and the trailer starts at 229.25 s with a confidence of 0.607.
When the confidence is lower than a preset threshold, the video positioning result can be automatically recalled and repositioned, or handed over for manual positioning.
In an embodiment, the video may also have no leader or trailer, for example, referring to fig. 5e, the video has no trailer, and the positioning result has only the boundary position of the leader and no boundary position of the trailer.
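Purely as an illustration of how such an output might be consumed downstream, the result could be represented with optional boundaries and routed by confidence; the PositioningResult type and its field names are hypothetical and not part of this application.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PositioningResult:
    leader_end: Optional[float]      # seconds; None if the video has no leader
    trailer_start: Optional[float]   # seconds; None if the video has no trailer
    leader_conf: float = 0.0
    trailer_conf: float = 0.0

def needs_manual_review(result: PositioningResult, conf_threshold: float = 0.6) -> bool:
    checks = [(result.leader_end, result.leader_conf),
              (result.trailer_start, result.trailer_conf)]
    # a located boundary whose confidence falls below the threshold is recalled
    return any(pos is not None and conf < conf_threshold for pos, conf in checks)
```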
According to this scheme, the boundary position is finally determined from the statistical parameters of the video frame segment types, so misrecognition of individual video frames does not affect the boundary determination and the identification result is more accurate; the scheme requires no manual intervention and can be fully automated, and it can therefore identify the boundary positions of different segment types in a video more accurately and efficiently.
The method described in the above embodiments is further illustrated in detail by way of example.
To verify the effect of the video positioning scheme provided by the embodiments of the present application, the video positioning results obtained with different neural network structures and with different methods of dividing candidate sub-segments (i.e. grouping strategies) are compared; they are shown in tables 1 and 2, respectively.
TABLE 1 (the content of table 1 appears as an image in the original publication and is not reproduced here)
In table 1, the rows correspond to different neural network structures, P denotes the positioning accuracy and R the recall rate. Given a preset threshold t, the positioning of a video is counted as accurate when its leader and trailer errors are smaller than t; otherwise the positioning is counted as inaccurate and the video needs to be recalled. The values in table 1 give the leader/trailer positioning accuracy and recall rate for each neural network structure.
Grouping strategy                 P      R
Fault-tolerant grouping strategy  0.834  0.618
Encouraging grouping strategy     0.845  0.645
TABLE 2
In table 2, the rows correspond to different methods of dividing candidate sub-segments: the method of steps 101 to 107 is referred to as the fault-tolerant grouping strategy, and the method of steps 201 to 207 as the encouraging grouping strategy. P denotes the positioning accuracy and R the recall rate; as for table 1, positioning is counted as accurate when the leader and trailer errors of a video are smaller than the preset threshold t, and otherwise as inaccurate and in need of recall. The values in table 2 give the positioning accuracy and recall rate for the two grouping strategies.
In order to better implement the method, correspondingly, an embodiment of the present application further provides a video positioning apparatus, where the video positioning apparatus may be specifically integrated in an electronic device, and the electronic device may be a server or a terminal.
For example, as shown in fig. 3, the video positioning apparatus may include an obtaining unit 301, a candidate unit 302, a recognition unit 303, a selecting unit 304, a separating unit 305, a determining unit 306, and a positioning unit 307, as follows:
(1) an obtaining unit 301, configured to obtain a video to be identified;
(2) a candidate unit 302, configured to select a candidate video segment from a video to be identified according to a duration of the video to be identified;
(3) the identifying unit 303 is configured to perform segment type identification on each video frame in the candidate video segment to obtain an identification result sequence, where the identification result sequence is a sequence formed by identification results of each video frame in the candidate video segment;
(4) a selecting unit 304, configured to select a video frame from the candidate video clips as a candidate clip type boundary position;
(5) a separation unit 305, configured to separate candidate sub-segments from the candidate video segment according to the candidate segment type boundary positions;
(6) a determining unit 306, configured to obtain statistical parameters of the video frame segment types in the candidate sub-segments according to the identification result sequence, and to determine a target sub-segment from the candidate sub-segments according to the statistical parameters and a preset statistical parameter threshold;
(7) a positioning unit 307, configured to obtain the candidate segment type boundary position corresponding to the target sub-segment as the segment type boundary position in the video to be identified.
Optionally, in some embodiments, the identifying unit 303 may specifically include:
the splicing subunit is configured to splice the first candidate video segment and the second candidate video segment to obtain a spliced candidate video segment;
the identification subunit is used for identifying the type of each video frame in the target candidate video clip by adopting an identification network in a preset neural network to obtain the clip type identification result of each video frame;
and the combining subunit is used for combining the segment type identification results according to the first candidate video segment and the second candidate video segment to obtain a first identification result sequence and a second identification result sequence (a code sketch of this splice-and-split flow is given after the subunit descriptions below).
Alternatively, the identifying unit 303 may include:
the first identification subunit is configured to perform segment type identification on each video frame in the first candidate video segment by using a first identification network to obtain a segment type identification result of each video frame;
the first combination subunit is used for combining the fragment type identification results to obtain a first identification result sequence;
the second identification subunit is configured to perform segment type identification on each video frame in the second candidate video segment by using a second identification network to obtain a segment type identification result of each video frame;
and the second combination subunit is used for combining the fragment type identification results to obtain a second identification result sequence.
Wherein the identification subunit may be specifically configured to:
extracting the characteristics of each video frame according to the convolutional network in the identification network to obtain the characteristic information of the video frame;
and carrying out full connection operation on the characteristic information according to a full connection network in the identification network to obtain a segment type identification result of the video frame.
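As referenced above, the splice-identify-split flow of the splicing, identification and combining subunits might be sketched as follows; the function name identify_spliced and the assumption that the recognition network accepts a batch of decoded frames are illustrative only.

```python
import torch

def identify_spliced(model: torch.nn.Module,
                     first_clip: torch.Tensor,
                     second_clip: torch.Tensor):
    """first_clip: (N1, 3, H, W); second_clip: (N2, 3, H, W) decoded frames."""
    spliced = torch.cat([first_clip, second_clip], dim=0)    # spliced candidate video segment
    with torch.no_grad():
        frame_types = model(spliced).argmax(dim=-1)           # segment-type result per frame
    first_sequence = frame_types[: first_clip.shape[0]]       # first identification result sequence
    second_sequence = frame_types[first_clip.shape[0]:]       # second identification result sequence
    return first_sequence, second_sequence
```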
Optionally, in an embodiment, the selecting unit 304 may specifically be configured to:
and selecting video frames meeting a preset tolerance threshold from the candidate video clips as candidate clip type boundary positions according to the identification result sequence and the preset tolerance threshold, wherein the tolerance threshold is the maximum number of video frames of which the clip type identification result is not the type of the target video clip in the target video clips.
Alternatively, the selecting unit 304 may be configured to:
and taking the video frame in the candidate video segment as a candidate segment type boundary position.
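A sketch of the fault-tolerant selection mentioned above might look as follows; it assumes the tolerance is counted over consecutive non-target frames, which is one possible reading of the tolerance threshold rather than the application's definitive rule, and the function name is illustrative.

```python
from typing import List

def fault_tolerant_boundaries(recognition: List[int],
                              target_type: int = 1,
                              tolerance: int = 3) -> List[int]:
    """Return frame indices kept as candidate segment type boundary positions."""
    candidates, misses = [], 0
    for index, frame_type in enumerate(recognition):
        if frame_type == target_type:
            misses = 0                   # a target-type frame resets the error budget
        else:
            misses += 1
            if misses > tolerance:       # tolerance threshold exceeded
                break                    # frames beyond this point are discarded
        candidates.append(index + 1)     # candidate boundary after frame `index`
    return candidates
```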
Optionally, in an embodiment, the determining unit 306 may specifically be configured to:
acquiring initial statistical parameters of the candidate sub-segments according to the recognition result sequence;
acquiring a position encouragement parameter according to the position of the candidate segment type boundary position in the candidate video segment;
and fusing the position encouragement parameters and the corresponding initial statistical parameters to obtain the statistical parameters corresponding to the candidate sub-segments.
Optionally, in some embodiments, the video positioning apparatus may further include a training unit, which may specifically include:
the acquiring subunit is used for acquiring a plurality of video clip samples, and the video clip samples comprise a plurality of video frames marked with real clip types;
the identification subunit is used for identifying the segment type corresponding to the video frame in the video segment sample through a preset initial neural network;
the prediction subunit is used for determining a current prediction result according to the fragment type obtained by identification and the real fragment type;
the construction subunit is used for constructing a loss function according to a preset adjustment coefficient, the real fragment type and the probability information corresponding to the fragment type obtained by identification;
and the training subunit is used for adopting a loss function to converge the preset initial neural network until the current prediction result is correct in prediction, so as to obtain the trained neural network.
In an embodiment, the adjustment coefficient includes a modulation coefficient and a balance coefficient, and the construction subunit may specifically be configured to perform the following (a code sketch is given after these steps):
constructing a loss weight parameter corresponding to the easily-identified sample according to the preset modulation coefficient, the real fragment type and probability information corresponding to the fragment type obtained by identification;
constructing an initial loss function according to the real fragment type and probability information corresponding to the fragment type obtained by identification;
and constructing the loss function according to the preset balance coefficient, the loss weight parameter and the initial loss function.
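A loss of the kind constructed in these steps might be sketched as follows; it follows the focal-loss pattern, with the modulation coefficient as the exponent gamma and the balance coefficient as the factor alpha, which is one plausible reading of the construction rather than the application's exact formula. Names are illustrative.

```python
import torch
import torch.nn.functional as F

def focal_style_loss(logits: torch.Tensor,
                     targets: torch.Tensor,
                     gamma: float = 2.0,     # modulation coefficient
                     alpha: float = 0.25     # balance coefficient
                     ) -> torch.Tensor:
    """logits: (N, num_types); targets: (N,) true segment types (long)."""
    log_probs = F.log_softmax(logits, dim=-1)
    log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)  # log p_t of the true type
    pt = log_pt.exp()
    weight = (1.0 - pt) ** gamma           # loss weight for easily identified samples
    ce = -log_pt                           # initial cross-entropy loss
    return (alpha * weight * ce).mean()    # final loss
```

Here the factor (1 - p_t)^gamma shrinks the contribution of easily identified samples, while alpha rebalances the classes, which addresses the sample-imbalance problem mentioned earlier.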
In a specific implementation, the above units may be implemented as independent entities, or may be combined arbitrarily to be implemented as the same or several entities, and the specific implementation of the above units may refer to the foregoing method embodiments, which are not described herein again.
As can be seen, the boundary position is finally determined from the statistical parameters of the video frame segment types, so misrecognition of individual video frames does not affect the boundary determination and the identification result is more accurate; the scheme needs no manual intervention and can be fully automated, so the boundary positions of different segment types in the video can be identified more accurately and efficiently.
In addition, an electronic device according to an embodiment of the present application is further provided, as shown in fig. 4, which shows a schematic structural diagram of the electronic device according to an embodiment of the present application, and specifically:
the electronic device may include components such as a processor 401 of one or more processing cores, memory 402 of one or more computer-readable storage media, a power supply 403, and an input unit 404. Those skilled in the art will appreciate that the electronic device configuration shown in fig. 4 does not constitute a limitation of the electronic device and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components. Wherein:
the processor 401 is a control center of the electronic device, connects various parts of the whole electronic device by various interfaces and lines, performs various functions of the electronic device and processes data by running or executing software programs and/or modules stored in the memory 402 and calling data stored in the memory 402, thereby performing overall monitoring of the electronic device. Optionally, processor 401 may include one or more processing cores; preferably, the processor 401 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 401.
The memory 402 may be used to store software programs and modules, and the processor 401 executes various functional applications and data processing by running the software programs and modules stored in the memory 402. The memory 402 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data created according to the use of the electronic device, and the like. Further, the memory 402 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. Accordingly, the memory 402 may also include a memory controller to provide the processor 401 with access to the memory 402.
The electronic device further comprises a power supply 403 for supplying power to the various components, and preferably, the power supply 403 is logically connected to the processor 401 through a power management system, so that functions of managing charging, discharging, and power consumption are realized through the power management system. The power supply 403 may also include any component of one or more dc or ac power sources, recharging systems, power failure detection circuitry, power converters or inverters, power status indicators, and the like.
The electronic device may further include an input unit 404, and the input unit 404 may be used to receive input numeric or character information and generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.
Although not shown, the electronic device may further include a display unit and the like, which are not described in detail herein. Specifically, in this embodiment, the processor 401 in the electronic device loads the executable file corresponding to the process of one or more application programs into the memory 402 according to the following instructions, and the processor 401 runs the application program stored in the memory 402, thereby implementing various functions as follows:
acquiring a video to be identified;
selecting candidate video clips from the video to be identified according to the duration of the video to be identified;
performing segment type identification on each video frame in the candidate video segment to obtain an identification result sequence, wherein the identification result sequence is a sequence formed by identification results of each video frame in the candidate video segment;
selecting a video frame from the candidate video clips as a candidate clip type boundary position;
separating candidate sub-segments from the candidate video segments according to the boundary positions of the candidate segment types;
acquiring statistical parameters of the types of the video frame segments in the candidate sub-segments according to the identification result sequence, and determining a target sub-segment from the candidate sub-segments according to the statistical parameters and a preset statistical parameter threshold;
and acquiring a candidate segment type boundary position corresponding to the target sub-segment as a segment type boundary position in the video to be identified.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
It can be seen that the electronic device finally determines the boundary position from the statistical parameters of the video frame segment types, so misrecognition of individual video frames does not affect the boundary determination, the identification result is more accurate, no manual intervention is required, and the process can be fully automated; the boundary positions of different segment types in the video to be identified can therefore be located more accurately and efficiently, improving the accuracy of video positioning.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by instructions or by associated hardware controlled by the instructions, which may be stored in a computer readable storage medium and loaded and executed by a processor.
To this end, embodiments of the present application further provide a storage medium in which a plurality of instructions are stored, where the instructions can be loaded by a processor to perform the steps in any one of the video positioning methods provided in the embodiments of the present application. For example, the instructions may perform the steps of:
acquiring a video to be identified;
selecting candidate video clips from the video to be identified according to the duration of the video to be identified;
performing segment type identification on each video frame in the candidate video segment to obtain an identification result sequence, wherein the identification result sequence is a sequence formed by identification results of each video frame in the candidate video segment;
selecting a video frame from the candidate video clips as a candidate clip type boundary position;
separating candidate sub-segments from the candidate video segments according to the boundary positions of the candidate segment types;
acquiring statistical parameters of the types of the video frame segments in the candidate sub-segments according to the identification result sequence, and determining a target sub-segment from the candidate sub-segments according to the statistical parameters and a preset statistical parameter threshold;
and acquiring a candidate segment type boundary position corresponding to the target sub-segment as a segment type boundary position in the video to be identified.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
Wherein the storage medium may include: read Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, and the like.
Since the instructions stored in the storage medium can execute the steps in any of the video positioning methods provided in the embodiments of the present application, the beneficial effects that can be achieved by any of the video positioning methods provided in the embodiments of the present application can also be achieved; for details, see the foregoing embodiments, which are not described herein again.
The video positioning method and apparatus, electronic device, and storage medium provided in the embodiments of the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the description of the above embodiments is only intended to help understand the method and its core idea. Meanwhile, those skilled in the art may, following the idea of the present application, make changes to the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (12)

1. A method for video localization, comprising:
acquiring a video to be identified;
selecting candidate video clips from the video to be identified according to the duration of the video to be identified;
performing segment type identification on each video frame in the candidate video segment to obtain an identification result sequence, wherein the identification result sequence is a sequence formed by identification results of each video frame in the candidate video segment;
selecting a video frame from the candidate video clips as a candidate clip type boundary position;
separating candidate sub-segments from the candidate video segments according to the boundary positions of the candidate segment types;
acquiring statistical parameters of the types of the video frame segments in the candidate sub-segments according to the identification result sequence, and determining a target sub-segment from the candidate sub-segments according to the statistical parameters and a preset statistical parameter threshold;
and acquiring a candidate segment type boundary position corresponding to the target sub-segment as a segment type boundary position in the video to be identified.
2. The video localization method of claim 1, wherein the candidate video segments comprise a first candidate video segment and a second candidate video segment;
the performing segment type identification on each video frame in the candidate video segment to obtain an identification result sequence includes:
splicing the first candidate video clip and the second candidate video clip to obtain a spliced candidate video clip;
adopting an identification network in a preset neural network to identify the type of each video frame in the target candidate video clip to obtain the clip type identification result of each video frame;
and combining the segment type recognition results according to the first candidate video segment and the second candidate video segment to obtain a first recognition result sequence and a second recognition result sequence.
3. The video localization method according to claim 1, wherein the candidate video segments comprise a first candidate video segment and a second candidate video segment, and the predetermined neural network comprises a first recognition network and a second recognition network;
the method for recognizing the segment type of each video frame in the candidate video segment by adopting the preset neural network to obtain a recognition result sequence comprises the following steps:
adopting a first identification network to identify the type of each video frame in the first candidate video clip to obtain the clip type identification result of each video frame;
combining the fragment type recognition results to obtain a first recognition result sequence;
adopting a second identification network to identify the type of each video frame in the second candidate video clip to obtain the clip type identification result of each video frame;
and combining the fragment type recognition results to obtain a second recognition result sequence.
4. The video positioning method according to claim 2 or 3, wherein the performing segment type identification on each video frame to obtain the segment type identification result of each video frame comprises:
extracting the characteristics of each video frame according to the convolutional network in the identification network to obtain the characteristic information of the video frame;
and carrying out full connection operation on the characteristic information according to a full connection network in the identification network to obtain a segment type identification result of the video frame.
5. The video localization method of claim 1, wherein said selecting a video frame from said candidate video segments as a candidate segment type boundary location comprises:
and selecting video frames meeting a preset tolerance threshold from the candidate video clips as candidate clip type boundary positions according to the identification result sequence and the preset tolerance threshold, wherein the tolerance threshold is the maximum number of video frames of which the clip type identification result is not the type of the target video clip in the target video clips.
6. The video localization method of claim 1, wherein said selecting a video frame from said candidate video segments as a candidate segment type boundary location comprises:
and taking the video frame in the candidate video segment as a candidate segment type boundary position.
7. The video positioning method according to claim 6, wherein said obtaining statistical parameters of video frame segment types in the candidate sub-segments according to the recognition result sequence comprises:
acquiring initial statistical parameters of the candidate sub-segments according to the recognition result sequence;
acquiring a position encouragement parameter according to the position of the candidate segment type boundary position in the candidate video segment;
and fusing the position encouragement parameters and the corresponding initial statistical parameters to obtain the statistical parameters corresponding to the candidate sub-segments.
8. The video positioning method according to any one of claims 1 to 7, wherein before said performing segment type recognition on each video frame in the candidate video segments by using a preset neural network to obtain a recognition result sequence, further comprising:
acquiring a plurality of video clip samples, wherein the video clip samples comprise a plurality of video frames marked with real clip types;
identifying a segment type corresponding to a video frame in the video segment sample through a preset initial neural network;
determining a current prediction result according to the fragment type obtained by identification and the real fragment type;
constructing a loss function according to a preset adjusting coefficient, the real fragment type and probability information corresponding to the fragment type obtained by identification;
and adopting a loss function to converge the preset initial neural network until the current prediction result is correct, and obtaining the trained neural network.
9. The video positioning method according to claim 8, wherein the adjustment coefficients include modulation coefficients and balance coefficients, and the constructing the loss function according to the preset adjustment coefficients, the real segment type, and the probability information corresponding to the identified segment type includes:
constructing a loss weight parameter corresponding to the easily-identified sample according to the preset modulation coefficient, the real fragment type and probability information corresponding to the fragment type obtained by identification;
constructing an initial loss function according to the real fragment type and probability information corresponding to the fragment type obtained by identification;
and constructing the loss function according to the preset balance coefficient, the loss weight parameter and the initial loss function.
10. A video positioning apparatus, comprising:
the acquisition unit is used for acquiring a video to be identified;
the candidate unit is used for selecting candidate video clips from the video to be identified according to the duration of the video to be identified;
the identification unit is used for identifying the type of each video frame in the candidate video clip to obtain an identification result sequence, and the identification result sequence is a sequence formed by the identification results of each video frame in the candidate video clip;
the selecting unit is used for selecting a video frame from the candidate video clips as a candidate clip type boundary position;
the separation unit is used for separating candidate sub-segments from the candidate video segments according to the candidate segment type boundary positions;
the determining unit is used for acquiring the statistical parameters of the video frame fragment types in the candidate sub-fragments according to the identification result sequence and determining a target sub-fragment from the candidate sub-fragments according to the statistical parameters and a preset statistical parameter threshold;
and the positioning unit is used for acquiring the candidate segment type boundary position corresponding to the target sub-segment and taking the candidate segment type boundary position as the segment type boundary position in the video to be identified.
11. An electronic device comprising a processor and a memory, the memory storing a plurality of instructions; the processor loads instructions from the memory to perform the steps in the video positioning method of any of claims 1-9.
12. A storage medium having stored thereon a computer program, characterized in that, when the computer program is run on a computer, it causes the computer to execute the video localization method according to any one of claims 1 to 9.
CN202010256464.4A 2020-04-02 2020-04-02 Video positioning method and device, electronic equipment and storage medium Active CN111479130B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010256464.4A CN111479130B (en) 2020-04-02 2020-04-02 Video positioning method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010256464.4A CN111479130B (en) 2020-04-02 2020-04-02 Video positioning method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111479130A true CN111479130A (en) 2020-07-31
CN111479130B CN111479130B (en) 2023-09-26

Family

ID=71749750

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010256464.4A Active CN111479130B (en) 2020-04-02 2020-04-02 Video positioning method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111479130B (en)



Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160342844A1 (en) * 2015-03-17 2016-11-24 Netflix, Inc. Detecting segments of a video program through image comparisons
WO2017166597A1 (en) * 2016-03-31 2017-10-05 乐视控股(北京)有限公司 Cartoon video recognition method and apparatus, and electronic device
WO2018033156A1 (en) * 2016-08-19 2018-02-22 北京市商汤科技开发有限公司 Video image processing method, device, and electronic apparatus
US20180192158A1 (en) * 2016-12-29 2018-07-05 Arris Enterprises Llc Video segment detection and replacement
CN107133266A (en) * 2017-03-31 2017-09-05 北京奇艺世纪科技有限公司 The detection method and device and database update method and device of video lens classification
WO2019144838A1 (en) * 2018-01-24 2019-08-01 北京一览科技有限公司 Method and apparatus for use in acquiring evaluation result information of video
WO2019182834A1 (en) * 2018-03-20 2019-09-26 Hulu, LLC Content type detection in videos using multiple classifiers
CN108769731A (en) * 2018-05-25 2018-11-06 北京奇艺世纪科技有限公司 The method, apparatus and electronic equipment of target video segment in a kind of detection video
WO2019233341A1 (en) * 2018-06-08 2019-12-12 Oppo广东移动通信有限公司 Image processing method and apparatus, computer readable storage medium, and computer device
CN108924586A (en) * 2018-06-20 2018-11-30 北京奇艺世纪科技有限公司 A kind of detection method of video frame, device and electronic equipment

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112153478A (en) * 2020-09-11 2020-12-29 腾讯科技(深圳)有限公司 Video processing method and video playing method
CN112153478B (en) * 2020-09-11 2022-03-08 腾讯科技(深圳)有限公司 Video processing method and video playing method
CN112291589A (en) * 2020-10-29 2021-01-29 腾讯科技(深圳)有限公司 Video file structure detection method and device
CN112291589B (en) * 2020-10-29 2023-09-22 腾讯科技(深圳)有限公司 Method and device for detecting structure of video file
CN113515997A (en) * 2020-12-28 2021-10-19 腾讯科技(深圳)有限公司 Video data processing method and device and readable storage medium
CN113515997B (en) * 2020-12-28 2024-01-19 腾讯科技(深圳)有限公司 Video data processing method and device and readable storage medium
CN113821675A (en) * 2021-06-30 2021-12-21 腾讯科技(北京)有限公司 Video identification method and device, electronic equipment and computer readable storage medium
CN113821675B (en) * 2021-06-30 2024-06-07 腾讯科技(北京)有限公司 Video identification method, device, electronic equipment and computer readable storage medium
CN113627363A (en) * 2021-08-13 2021-11-09 百度在线网络技术(北京)有限公司 Video file processing method, device, equipment and storage medium
CN113627363B (en) * 2021-08-13 2023-08-15 百度在线网络技术(北京)有限公司 Video file processing method, device, equipment and storage medium
CN115708359A (en) * 2021-08-20 2023-02-21 小米科技(武汉)有限公司 Video clip intercepting method and device and storage medium
CN114550070A (en) * 2022-03-08 2022-05-27 腾讯科技(深圳)有限公司 Video clip identification method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN111479130B (en) 2023-09-26

Similar Documents

Publication Publication Date Title
CN111479130B (en) Video positioning method and device, electronic equipment and storage medium
US10922866B2 (en) Multi-dimensional puppet with photorealistic movement
CN111209440B (en) Video playing method, device and storage medium
US11055537B2 (en) Systems and methods for determining actions depicted in media contents based on attention weights of media content frames
US20210174152A1 (en) Video classification method and server
Zhen et al. Action recognition via spatio-temporal local features: A comprehensive study
CN111723784B (en) Risk video identification method and device and electronic equipment
US11997151B2 (en) Multimedia data processing method and apparatus, device, and readable storage medium
KR102169925B1 (en) Method and System for Automatic Image Caption Generation
CN113207010B (en) Model training method, live broadcast recommendation method, device and storage medium
CN113518256A (en) Video processing method and device, electronic equipment and computer readable storage medium
US20230353828A1 (en) Model-based data processing method and apparatus
CN111242019B (en) Video content detection method and device, electronic equipment and storage medium
CN111079833A (en) Image recognition method, image recognition device and computer-readable storage medium
CN112200041A (en) Video motion recognition method and device, storage medium and electronic equipment
CN116935170B (en) Processing method and device of video processing model, computer equipment and storage medium
JP2024511103A (en) Method and apparatus for evaluating the quality of an image or video based on approximate values, method and apparatus for training a first model, electronic equipment, storage medium, and computer program
CN113762041A (en) Video classification method and device, computer equipment and storage medium
CN115905613A (en) Audio and video multitask learning and evaluation method, computer equipment and medium
CN117093733A (en) Training method of media classification model, media data classification method and device
Davids et al. Hybrid multi scale hard switch YOLOv4 network for cricket video summarization
CN113824989A (en) Video processing method and device and computer readable storage medium
US11756300B1 (en) Method and apparatus for summarization of unsupervised video with efficient key frame selection reward functions
CN112749614B (en) Multimedia content identification method and device, electronic equipment and storage medium
CN117156078B (en) Video data processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant