CN111479130B - Video positioning method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN111479130B
CN111479130B (application CN202010256464.4A)
Authority
CN
China
Prior art keywords
video
candidate
segment
type
fragment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010256464.4A
Other languages
Chinese (zh)
Other versions
CN111479130A (en)
Inventor
徐孩
梁健豪
车翔
管琰平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202010256464.4A priority Critical patent/CN111479130B/en
Publication of CN111479130A publication Critical patent/CN111479130A/en
Application granted granted Critical
Publication of CN111479130B publication Critical patent/CN111479130B/en
Legal status: Active


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N 21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N 21/234 Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs
    • H04N 21/23418 Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N 21/44008 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/83 Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N 21/845 Structuring of content, e.g. decomposing content into time segments
    • H04N 21/8456 Structuring of content, e.g. decomposing content into time segments by decomposing the content in the time domain, e.g. in time segments

Abstract

The embodiments of the present application disclose a video positioning method, a video positioning apparatus, an electronic device and a storage medium. In the embodiments, candidate video segments are selected from a video to be identified according to the duration of the video to be identified; segment type recognition is performed on each video frame in the candidate video segments to obtain a recognition result sequence composed of the recognition results of the individual video frames; video frames are selected from the candidate video segments as candidate segment-type boundary positions; candidate sub-segments are separated from the candidate video segments according to the candidate segment-type boundary positions; a statistical parameter of the video frame segment types in each candidate sub-segment is obtained from the recognition result sequence, and a target sub-segment is determined from the candidate sub-segments according to the statistical parameter and a preset statistical parameter threshold; and the candidate segment-type boundary position corresponding to the target sub-segment is obtained as the segment-type boundary position in the video to be identified. In this way, the boundary positions between different types of segments in a video can be determined efficiently and quickly.

Description

Video positioning method and device, electronic equipment and storage medium
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a video positioning method, a video positioning device, electronic equipment and a storage medium.
Background
In recent years, video content and video types have become increasingly rich. Video content is no longer limited to television series and films produced by professional teams; more and more users upload their own original short videos to video playback platforms. According to segment type, a video can be divided into several segments of different types, such as a head (opening), a feature (main body) and a tail (ending). Many video playback platforms identify the boundary positions between these different segment types in long videos in order to offer functions such as skipping the head and tail during playback. Most platforms determine the boundary positions by manual viewing, by obtaining head and tail samples directly from the video provider (or computing similarity against such samples), or by estimating them from properties such as video complexity. However, because the number of user-uploaded short videos is large, content creators are numerous, and heads and tails are produced in flexible and varied forms, existing video positioning methods cannot identify the boundary positions of different segment types in videos (especially short videos) accurately and efficiently.
Disclosure of Invention
In view of this, embodiments of the present application provide a video positioning method, apparatus, electronic device, and storage medium, which can accurately and efficiently identify the boundary positions of different types of clips in a video.
In a first aspect, an embodiment of the present application provides a video positioning method, including:
acquiring a video to be identified;
selecting candidate video clips from the videos to be identified according to the duration of the videos to be identified;
performing fragment type recognition on each video frame in the candidate video fragments to obtain a recognition result sequence, wherein the recognition result sequence is a sequence formed by recognition results of each video frame in the candidate video fragments;
selecting a video frame from the candidate video clips as a candidate clip type demarcation position;
separating candidate sub-segments from the candidate video segments according to the candidate segment type demarcation location;
according to the identification result sequence, acquiring a statistical parameter of the video frame fragment type in the candidate sub-fragment, and determining a target sub-fragment from the candidate sub-fragment according to the statistical parameter and a preset statistical parameter threshold;
and acquiring a candidate segment type boundary position corresponding to the target sub-segment as a segment type boundary position in the video to be identified.
In an embodiment, the candidate video clips include a first candidate video clip and a second candidate video clip;
the step of identifying the segment types of the video frames in the candidate video segments to obtain an identification result sequence comprises the following steps:
splicing the first candidate video segment and the second candidate video segment to obtain a spliced target candidate video segment;
using an identification network in a preset neural network to perform segment type identification on each video frame in the target candidate video segment, to obtain a segment type identification result for each video frame;
and combining the fragment type recognition results according to the first candidate video fragments and the second candidate video fragments to obtain a first recognition result sequence and a second recognition result sequence.
In an embodiment, the candidate video clips include a first candidate video clip and a second candidate video clip, and the preset neural network includes a first identification network and a second identification network;
the step of adopting a preset neural network to identify the segment types of each video frame in the candidate video segments to obtain an identification result sequence comprises the following steps:
performing fragment type identification on each video frame in the first candidate video fragment by adopting a first identification network to obtain a fragment type identification result of each video frame;
Combining the fragment type recognition results to obtain a first recognition result sequence;
performing fragment type identification on each video frame in the second candidate video fragment by adopting a second identification network to obtain a fragment type identification result of each video frame;
and combining the fragment type recognition results to obtain a second recognition result sequence.
In an embodiment, performing fragment type identification on each video frame to obtain a fragment type identification result of each video frame, including:
extracting the characteristics of each video frame according to the convolution network in the identification network to obtain the characteristic information of the video frame;
and carrying out full connection operation on the characteristic information according to the full connection network in the identification network to obtain a fragment type identification result of the video frame.
In an embodiment, the selecting the video frame from the candidate video segments as the candidate segment type boundary position includes:
and selecting, according to the identification result sequence and a preset tolerance threshold, video frames meeting the preset tolerance threshold from the candidate video segments as candidate segment-type boundary positions, wherein the tolerance threshold is the maximum permitted number of video frames within the target video segment whose segment type identification result is not the target video segment type.
In an embodiment, the selecting the video frame from the candidate video segments as the candidate segment type boundary position includes:
and taking the video frames in the candidate video fragments as candidate fragment type demarcation positions.
In an embodiment, the obtaining, according to the identification result sequence, a statistical parameter of a video frame segment type in the candidate sub-segment includes:
acquiring initial statistical parameters of the candidate sub-fragments according to the identification result sequence;
acquiring a position encouraging parameter according to the position of the candidate segment type demarcation position in the candidate video segment;
and fusing the position encouraging parameters with the corresponding initial statistical parameters to obtain the statistical parameters corresponding to the candidate sub-fragments.
In an embodiment, before the step of using the preset neural network to identify the segment type of each video frame in the candidate video segment to obtain the identification result sequence, the method further includes:
obtaining a plurality of video clip samples, wherein the video clip samples comprise a plurality of video frames marked with real clip types;
identifying a fragment type corresponding to a video frame in the video fragment sample through a preset initial neural network;
Determining a current prediction result according to the fragment type obtained by recognition and the real fragment type;
constructing a loss function according to a preset adjustment coefficient, the real fragment type and probability information corresponding to the identified fragment type;
and converging the preset initial neural network by adopting a loss function until the current prediction result is correct in prediction, so as to obtain the trained neural network.
In an embodiment, the constructing a loss function according to a preset adjustment coefficient, the real segment type, and probability information corresponding to the identified segment type includes:
constructing a loss weight parameter corresponding to an easily-identified sample according to the preset modulation coefficient, the real fragment type and probability information corresponding to the identified fragment type;
constructing an initial loss function according to the real fragment type and probability information corresponding to the fragment type obtained by recognition;
and constructing the loss function according to the preset balance coefficient, the loss weight parameter and the initial loss function.
In a second aspect, an embodiment of the present application provides a video positioning apparatus, including:
the acquisition unit is used for acquiring the video to be identified;
A candidate unit, configured to select candidate video segments from a video to be identified according to a duration of the video to be identified;
the identification unit is used for carrying out fragment type identification on each video frame in the candidate video fragments to obtain an identification result sequence, wherein the identification result sequence is a sequence formed by identification results of each video frame in the candidate video fragments;
the selecting unit is used for selecting video frames from the candidate video clips as candidate clip type demarcation positions;
the separation unit is used for separating candidate sub-fragments from the candidate video fragments according to the candidate fragment type demarcation position;
the determining unit is used for acquiring the statistical parameters of the video frame fragment types in the candidate sub-fragments according to the identification result sequence, and determining target sub-fragments from the candidate sub-fragments according to the statistical parameters and a preset statistical parameter threshold;
and the positioning unit is used for acquiring the candidate segment type boundary position corresponding to the target sub-segment as the segment type boundary position in the video to be identified.
In a third aspect, an electronic device provided by an embodiment of the present application includes a processor and a memory, where the memory stores a plurality of instructions; the processor loads instructions from the memory to perform the steps in the video localization method described above.
In a fourth aspect, embodiments of the present application provide a storage medium having a computer program stored thereon which, when run on a computer, causes the computer to perform a video localization method as provided in any of the embodiments of the present application.
The embodiment of the application can acquire the video to be identified, and then select candidate video fragments from the video to be identified according to the duration of the video to be identified; performing fragment type recognition on each video frame in the candidate video fragments to obtain a recognition result sequence, wherein the recognition result sequence is a sequence formed by recognition results of each video frame in the candidate video fragments; selecting a video frame from the candidate video clips as a candidate clip type demarcation position; separating candidate sub-segments from the candidate video segments according to the candidate segment type demarcation location; according to the identification result sequence, acquiring a statistical parameter of the video frame fragment type in the candidate sub-fragment, and determining a target sub-fragment from the candidate sub-fragment according to the statistical parameter and a preset statistical parameter threshold; and acquiring a candidate segment type boundary position corresponding to the target sub-segment as a segment type boundary position in the video to be identified.
According to the scheme, the demarcation position is finally determined according to the statistical parameters of the video frame fragment types, the situation that the demarcation position is influenced by the error identification result of an individual video frame can be avoided, the identification result is more accurate, the scheme does not need manual intervention, full automation can be realized, and therefore the demarcation positions of fragments of different types in the video can be more accurately and efficiently identified.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic view of a video positioning method according to an embodiment of the present application;
FIG. 2a is a flowchart of a video positioning method according to an embodiment of the present application;
FIG. 2b is another flowchart of a video positioning method according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a video positioning device according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
Fig. 5a is a schematic flow chart of a video positioning method according to an embodiment of the present application;
FIG. 5b is a schematic structural diagram of a neural network with a single model structure according to an embodiment of the present application;
FIG. 5c is a schematic diagram of a dual-mode structural neural network according to an embodiment of the present application;
fig. 5d is a schematic diagram of a video recognition result according to an embodiment of the present application;
fig. 5e is a schematic diagram of another video recognition result according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to fall within the scope of the application.
The embodiment of the application provides a video positioning method, a video positioning device, electronic equipment and a storage medium. The video positioning device can be integrated in electronic equipment, and the electronic equipment can be a server, a terminal and other equipment.
The video positioning method provided by the embodiments of the present application involves computer vision technology and machine learning in the field of artificial intelligence; the segment type of each video frame (i.e. each image composing the video) can be identified by a neural network obtained through machine learning training.
Artificial intelligence (AI) is the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making. Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Artificial intelligence software technology mainly includes directions such as computer vision technology and machine learning/deep learning.
Among them, machine Learning (ML) is a multi-domain interdisciplinary, and involves multiple disciplines such as probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, etc. It is specially studied how a computer simulates or implements learning behavior of a human to acquire new knowledge or skills, and reorganizes existing knowledge structures to continuously improve own performance. Machine learning is the core of artificial intelligence, a fundamental approach to letting computers have intelligence, which is applied throughout various areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, confidence networks, reinforcement learning, transfer learning, induction learning, teaching learning, and the like.
In the embodiments of the present application, video positioning refers to the technique and process of finding the boundary positions between different types of segments in a video. In the present application, a video may contain three different types of segments, namely a head, a feature and a tail; it may also contain only a feature, or a head and a feature, or a feature and a tail. The obtained boundary positions can be applied to different scenes and electronic devices. For example, when playing a video, the terminal can skip the head and tail according to the boundary positions: it can jump directly to the position where the feature starts, or skip the tail after the feature has been played and start playing the next video, thereby improving the user experience. For another example, for video understanding tasks such as video classification, video action segmentation and video stretch/deformation recognition, the server can skip the head and tail in the video so that they do not interfere with video understanding. Because head and tail content is typically of low relevance to the main body of the video, it does not help these video understanding tasks and may even have a negative impact.
For example, referring to fig. 1, first, the electronic device integrated with the video positioning device acquires a video to be identified, and then selects a candidate video clip from the video to be identified according to the duration of the video to be identified; performing fragment type recognition on each video frame in the candidate video fragments to obtain a recognition result sequence, wherein the recognition result sequence is a sequence formed by recognition results of each video frame in the candidate video fragments; selecting a video frame from the candidate video clips as a candidate clip type demarcation position; separating candidate sub-segments from the candidate video segments according to the candidate segment type demarcation location; according to the identification result sequence, acquiring a statistical parameter of the video frame fragment type in the candidate sub-fragment, and determining a target sub-fragment from the candidate sub-fragment according to the statistical parameter and a preset statistical parameter threshold; and acquiring a candidate segment type boundary position corresponding to the target sub-segment as a segment type boundary position in the video to be identified.
According to the scheme, the demarcation position is finally determined according to the statistical parameters of the video frame fragment types, the situation that the demarcation position is influenced by the error identification result of an individual video frame can be avoided, the identification result is more accurate, the scheme does not need manual intervention, full automation can be realized, and therefore the demarcation positions of fragments of different types in the video can be more accurately and efficiently identified.
The following will describe in detail. The following description of the embodiments is not intended to limit the preferred embodiments.
The embodiment will be described from the perspective of a video positioning device, which may be integrated in an electronic apparatus, where the electronic apparatus may be a server or a terminal, etc.; the terminal may include a mobile phone, a tablet computer, a notebook computer, a personal computer (Personal Computer, PC), and the like.
As shown in fig. 2a and fig. 5a, the specific flow of the video positioning method may be as follows:
101. and acquiring the video to be identified.
The video to be identified is a video which needs to be identified at the current moment and the demarcation position is determined.
The video positioning method can be applied to different scenes, the execution main bodies of the corresponding methods are different, and the methods for acquiring the videos to be identified are different.
For example, when the video positioning method is applied to a video playback scene, the execution subject of the method is usually a terminal, although it may also be a server. Based on the user's operation on the terminal interface, the terminal or server can determine the video currently to be played as the video to be identified, trigger a corresponding video acquisition instruction, and acquire the video to be identified from the server or local storage according to that instruction.
For another example, when the video positioning method is applied to a video understanding scene, the execution subject of the method is usually a server, although it may also be a terminal. Based on the user's operation, the terminal or server can determine the video on which video understanding is to be performed as the video to be identified, trigger a corresponding video acquisition instruction, and acquire that video from the server or local storage according to the instruction.
In some embodiments, in order to facilitate video file transmission, the acquired video to be identified may be an encapsulated and compressed file, and before the next step, the acquired original video file needs to be decoded and decapsulated to obtain the video to be identified, which may include the following steps:
the method comprises the steps of performing decapsulation processing on an obtained original video file to obtain an independent pure video stream and an independent pure audio stream;
and respectively decoding the pure video stream and the pure audio stream to obtain a video frame sequence and an audio frame sequence in the video to be identified.
The container format of the original video file is not limited; widely used formats include, for example, mp4 (MPEG-4 Part 14), ts and mkv. In one embodiment, these mainstream container formats can be demultiplexed with decapsulation software. For example, FFmpeg or third-party software tools can be used to demultiplex the mainstream container formats into a pure video stream and a pure audio stream. Decoding software, such as FFmpeg or third-party tools, can then be used to decode the pure video stream and the pure audio stream respectively, obtaining video frame data and audio frame data that can be processed further.
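As an illustrative sketch only (the tool invocation, file names and use of OpenCV are assumptions of this example, not steps prescribed by the application), the demultiplexing and decoding described above could look as follows in Python:

import subprocess
import cv2  # OpenCV, used here only to decode the demultiplexed video stream into frames

def demux(src_path):
    # Separate the container into a pure video stream and a pure audio stream (stream copy, no re-encoding)
    subprocess.run(["ffmpeg", "-y", "-i", src_path, "-an", "-c:v", "copy", "video_only.mp4"], check=True)
    subprocess.run(["ffmpeg", "-y", "-i", src_path, "-vn", "-c:a", "copy", "audio_only.m4a"], check=True)

def decode_video_frames(video_path):
    # Decode the pure video stream into the video frame sequence of the video to be identified
    frames = []
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
    cap.release()
    return frames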
102. And selecting candidate video clips from the videos to be identified according to the duration of the videos to be identified.
Wherein, the candidate video clips refer to video clips which may be the head or the tail, and the candidate video clips are represented as a sequence formed by a plurality of video frames.
Before selecting candidate video clips, a developer firstly determines a head duration threshold and a tail duration threshold corresponding to videos with different durations according to a large number of samples, and then selects the candidate clips from the videos to be identified according to the head duration threshold and the tail duration threshold.
Wherein the candidate video clips may include a first candidate video clip and a second candidate video clip. And selecting a group of video frames at the beginning from the videos to be identified as a first candidate video segment according to the duration threshold of the slice head, and selecting a group of video frames at the end from the videos to be identified as a second candidate video segment according to the duration threshold of the slice tail.
In an embodiment, a developer may divide a video into three video types, namely a long video type, a short video type and a small video type according to a duration range of the video, determine a duration proportion range of a head and a tail in the video of different types according to a large number of samples, determine the type of the video according to the duration of the video to be identified, determine a duration proportion range according to the type of the video, and finally determine a head duration threshold and a tail duration threshold.
In an embodiment, a developer may divide a video into three video types, namely, a long video type, a short video type and a small video type according to a duration range of the video, and determine a head duration threshold and a tail duration threshold corresponding to the videos of different types according to a large number of samples. And determining the type of the video according to the duration of the video to be identified, and then determining a head duration threshold and a tail duration threshold according to the video type.
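A minimal sketch of this selection step follows; the duration ranges, thresholds and frame-based slicing below are illustrative assumptions, not values specified by the application:

def select_candidate_clips(frames, fps, duration_s):
    # Illustrative head/tail duration thresholds per video type; in practice they would be
    # derived from statistics over a large number of samples, as described above.
    if duration_s > 1800:        # assumed "long video" type
        head_s, tail_s = 180, 120
    elif duration_s > 300:       # assumed "short video" type
        head_s, tail_s = 60, 45
    else:                        # assumed "small video" type
        head_s, tail_s = 15, 10
    head_frames = int(head_s * fps)
    tail_frames = int(tail_s * fps)
    first_candidate = frames[:head_frames]    # possible head segment
    second_candidate = frames[-tail_frames:]  # possible tail segment
    return first_candidate, second_candidate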
103. And carrying out fragment type identification on each video frame in the candidate video fragments to obtain an identification result sequence.
The identification result sequence is a sequence formed by identification results of all video frames in the candidate video clips.
The segment type identification result of the video frame at least comprises the segment type of the video frame, wherein the segment type can be expressed as the probability that the video frame belongs to a certain segment type. In an embodiment, the fragment type identification result of the video frame may further include a confidence that the video frame belongs to a certain fragment type.
Wherein, this step may be performed by using a neural network obtained by training a large number of samples, where the neural network may include different structures, and in an embodiment, referring to fig. 5b, the neural network includes an identification network, and the step of "performing segment type identification on each video frame in the candidate video segment to obtain an identification result sequence" may specifically include:
splicing the first candidate video segment and the second candidate video segment to obtain a spliced target candidate video segment;
using an identification network in the preset neural network to perform segment type identification on each video frame in the target candidate video segment, obtaining a segment type identification result for each video frame;
and combining the fragment type recognition results according to the first candidate video fragments and the second candidate video fragments to obtain a first recognition result sequence and a second recognition result sequence.
In another embodiment, referring to fig. 5c, the neural network includes a first recognition network and a second recognition network, and the step of recognizing a first candidate video segment and a second candidate video segment respectively, and the step of performing segment type recognition on each video frame in the candidate video segment to obtain a recognition result sequence may specifically include:
performing fragment type identification on each video frame in the first candidate video fragment by adopting a first identification network to obtain a fragment type identification result of each video frame;
combining the fragment type recognition results to obtain a first recognition result sequence;
performing fragment type identification on each video frame in the second candidate video fragment by adopting a second identification network to obtain a fragment type identification result of each video frame;
And combining the fragment type recognition results to obtain a second recognition result sequence.
The recognition network may be built on the existing ResNet model, a basic feature extraction network in computer vision. ResNet uses multiple parameterized network layers (referred to as convolution layers in the following embodiments) to learn a residual representation between input and output, instead of directly trying to learn the input-to-output mapping with parameter layers (i.e. network layers with parameters) as a generic CNN (e.g. AlexNet/VGG) does.
The recognition network may include a convolutional network for learning feature representation and a fully-connected network for classification recognition, and the recognition network performs fragment type recognition on each video frame to obtain a fragment type recognition result of each video frame, and specifically may include the following steps:
extracting the characteristics of each video frame according to the convolution network in the identification network to obtain the characteristic information of the video frame;
and carrying out full connection operation on the characteristic information according to the full connection network in the identification network to obtain a fragment type identification result of the video frame.
The convolutional network in this embodiment may include five convolutional layers (Convolution Layers), as follows:
convolution layer: the method is mainly used for extracting characteristics of an input image (such as a training sample or a video frame to be identified), wherein the size of a convolution kernel can be determined according to practical application, for example, the sizes of convolution kernels from a first layer of convolution layer to a fourth layer of convolution layer can be (7, 7), (5, 5), (3, 3) in sequence; alternatively, in order to reduce the complexity of computation and improve the computation efficiency, the convolution kernel sizes of the five convolution layers may also be set to (3, 3). Alternatively, in order to increase the expression capacity of the model, a nonlinear factor may also be added by adding an activation function, for example, the activation function may be "relu (linear rectification function, rectified Linear Unit)". In this embodiment, feature information obtained after the convolution operation is performed on the video frame is represented as a feature map.
Optionally, to further reduce the amount of computation, a downsampling (pooling) operation can be performed after the convolution layers and before the fully connected layer. Downsampling is essentially similar to convolution, except that its kernel takes only the maximum value (max pooling) or the average value (average pooling) of the corresponding positions. In this embodiment, the downsampling operation, specifically average pooling, may be performed after the fifth convolution layer of the convolutional network. In one embodiment, because fully connected layers have redundant parameters, global average pooling (GAP) can also be used instead of fully connected layers to fuse the learned depth features.
It should be noted that, for convenience of description, in the embodiment of the present invention, the downsampling layer (also referred to as the pooling layer) may be included in the fully connected network.
The fully connected network in this embodiment comprises at least one fully connected layer (FC, Fully Connected layer), as follows:
Fully connected layer: maps the learned features to the sample label space and mainly plays the role of a classifier in the whole recognition network. Each node of the fully connected layer is connected to all the nodes output by the previous layer (such as the downsampling layer); one node of the fully connected layer is called a neuron, and the number of neurons can be determined according to the actual application requirements. The fully connected layer performs a weighted operation on the feature information obtained by the convolution operation to obtain a score for each category. Similarly to the convolution layers, a nonlinear factor can optionally be added after the fully connected operation by adding an activation function, e.g. sigmoid or softmax.
In an embodiment, the fully connected network further includes a softmax layer placed after the fully connected layer. Softmax can be understood as normalization: it arranges the output results and maps the scores obtained by the fully connected operation into probabilities. If a video frame can belong to 3 segment types (head, tail, feature), the output of the softmax layer is a 3-dimensional vector. The first value in the vector is the probability that the current frame belongs to the first segment type (e.g. head), and the second value is the probability that the video frame belongs to the second segment type (e.g. tail). The elements of this 3-dimensional vector sum to 1. The input and output vectors of the softmax layer have the same dimension.
Of course, referring to fig. 5b and 5c, it will be understood that the neural network structure may further include an input layer for inputting data and an output layer for outputting data, which are not described herein.
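As a hedged, PyTorch-style sketch of such a recognition network (the ResNet-101 backbone, the class count and the API details are assumptions for illustration; the application only requires a convolutional feature extractor followed by a fully connected classifier, and the exact torchvision argument names vary between versions):

import torch
import torch.nn as nn
from torchvision import models

class SegmentTypeClassifier(nn.Module):
    def __init__(self, num_classes=3):  # e.g. head / feature / tail for the single-model structure
        super().__init__()
        backbone = models.resnet101(weights=None)  # convolutional network for feature extraction
        # keep the convolution layers and the global average pooling, drop the original classifier
        self.features = nn.Sequential(*list(backbone.children())[:-1])
        self.fc = nn.Linear(backbone.fc.in_features, num_classes)  # fully connected classification layer

    def forward(self, x):  # x: a batch of video frames, shape (N, 3, H, W)
        feats = self.features(x).flatten(1)
        logits = self.fc(feats)
        return torch.softmax(logits, dim=1)  # per-frame probabilities over the segment types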
Before the step of performing segment type recognition on each video frame in the candidate video segments to obtain a recognition result sequence, training the initial neural network with a large number of samples to obtain a trained neural network, wherein the training may include the following steps:
obtaining a plurality of video clip samples, wherein the video clip samples comprise a plurality of video frames marked with real clip types;
identifying a fragment type corresponding to a video frame in the video fragment sample through a preset initial neural network;
determining a current prediction result according to the fragment type obtained by recognition and the real fragment type;
constructing a loss function according to a preset adjustment coefficient, the real fragment type and probability information corresponding to the identified fragment type;
and converging the preset initial neural network by adopting a loss function until the current prediction result is correct in prediction, so as to obtain the trained neural network.
In the candidate video segments, positive samples (video frames of the head or tail type) are far fewer than negative samples (video frames of the feature type), so the negative samples tend to account for a large portion of the total loss, and most of this loss is contributed by easily identified samples. During training, the optimization direction is then mainly guided by the easily identified samples, and what the neural network learns is easily biased.
The adjustment coefficient is used for reducing the influence degree of sample type unbalance on feature learning, and the adjustment coefficient is set to correct the optimization direction, so that the problems of difficult-to-identify sample learning and positive and negative sample unbalance are solved.
In an embodiment, the adjustment coefficient includes a modulation coefficient r for modulating weights of the difficult-to-identify samples and the easy-to-identify samples, and a balance coefficient a for balancing the proportion of the positive and negative samples, and the step of "constructing a loss function according to a preset adjustment coefficient, the real fragment type, and probability information corresponding to the identified fragment type" may specifically include the following steps:
constructing a loss weight parameter corresponding to an easily-identified sample according to a preset modulation coefficient, the real fragment type and probability information corresponding to the identified fragment type;
constructing an initial loss function according to the real fragment type and probability information corresponding to the fragment type obtained by recognition;
and constructing the loss function according to a preset balance coefficient, the loss weight parameter and the initial loss function.
The modulation factor r is a parameter for modulating the weight of the sample difficult to identify and the sample easy to identify, and the balance factor a is a parameter for balancing the proportion of the positive sample and the negative sample.
The initial loss function may be flexibly set according to practical application requirements, for example, the initial loss function CE (p, y) may be calculated by using a standard cross entropy formula, as follows:
Take a neural network model containing two recognition networks as an example: each recognition network only needs to learn to recognize one candidate video segment, and the recognition result contains two segment types, feature ("positive") and non-feature, i.e. it is a binary classification network. Here p denotes the predicted probability that the segment type of a sample is 1 (p ranges from 0 to 1), and y denotes the true segment type annotated in the sample. When the true segment type is the first segment type (i.e. y = 1) and a sample x is predicted to be of this type with probability p = 0.6, the loss is -log(0.6). This example uses binary classification only for ease of understanding; the multi-class case is analogous. When the neural network contains only one recognition network, that network must learn to recognize the target candidate video segment (formed by splicing the first candidate video segment and the second candidate video segment), and the recognition result contains three segment types: feature, head and tail.
If the predicted segment type t is consistent with the true segment type, y is set to 1; otherwise y is set to 0. When y is 1, the confidence p_t of the predicted segment type t is the probability p of that segment type; when y is not 1 (e.g. 0), the confidence p_t is the difference between 1 and the probability p of the predicted segment type t. This can be written as:
p_t = p if y = 1, and p_t = 1 - p otherwise,
where p is the output segment-type prediction probability value. Training continues by reducing the error between the true segment type and the predicted probability value, so that the weights are adjusted to suitable values and a trained neural network model is obtained.
At this time, the calculation formula of the initial loss function may be written as:
CE(p, y) = CE(p_t) = -log(p_t)
For example, the loss function FL(p_t), which may be referred to as the focal loss, can be calculated using the following formula:
FL(p_t) = -a_t (1 - p_t)^r log(p_t)
where p_t is the confidence of the predicted segment type t, a_t is the balance coefficient, and r is the preset modulation coefficient. The term (1 - p_t)^r is called the loss weight parameter of the easily identified samples.
When the modulation coefficient is 0, the loss function reduces to the initial loss function. The modulation coefficient is a number greater than 1 and can be used to suppress the weight of easily identified samples in the loss function, so that during training the neural network pays more attention to samples that are difficult to identify. In addition, since positive samples (video frames of the head or tail type) are far fewer than negative samples (video frames of the feature type) in the candidate video segments, the preset balance coefficient can be set to a number in (0, 1) in order to give the rarer samples a larger weight and control the weight of negative samples in the total loss.
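A minimal sketch of this focal-loss computation for the two-class case (variable names and default values are illustrative assumptions):

import torch

def focal_loss(p, y, a=0.25, r=2.0):
    # p: predicted probability of the positive class for each sample, shape (N,)
    # y: true labels in {0, 1}; a is the balance coefficient, r the modulation coefficient
    p_t = torch.where(y == 1, p, 1.0 - p)  # confidence of the true segment type
    a_t = torch.where(y == 1, torch.full_like(p, a), torch.full_like(p, 1.0 - a))
    loss = -a_t * (1.0 - p_t) ** r * torch.log(p_t.clamp(min=1e-8))
    return loss.mean()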
It will be appreciated that the above method can also be used to train the recognition networks. Two trained recognition networks, together with the input and output layers and the layer for combining the recognition results, form what is referred to as the dual-model structure. Alternatively, one trained recognition network, together with the input and output layers, a layer for splicing the first candidate video segment and the second candidate video segment, and a layer for combining the recognition results, forms what is referred to as the single-model structure.
104. And selecting video frames meeting the preset tolerance threshold from the candidate video clips as candidate clip type demarcation positions according to the identification result sequence and the preset tolerance threshold.
The tolerance threshold is the maximum permitted number of video frames within the target video segment whose segment type identification result is not the target video segment type.
The target video clip type is a non-positive type corresponding to the candidate video clip, when the candidate video clip is a video start part, the target video clip type is a head type, and when the candidate video clip is a video end part, the target video clip type is a tail type.
For example, suppose the preset tolerance threshold for the head type is 2, the 2nd video frame that is not of the head type is the 20th frame, and the 3rd video frame that is not of the head type is the 25th frame; then each of the first 24 frames can be used as a candidate boundary position between the head and the feature.
Due to misidentification, a few frames within the head or tail may be recognized as the feature type. The strategy of the present application allows up to the tolerance threshold number of video frames inside the target video segment to be recognized as a non-target segment type, so that the boundary position in the video can be determined more accurately.
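A sketch of this tolerance strategy in plain Python (the label values and the tolerance value are illustrative assumptions):

def candidate_boundaries_with_tolerance(labels, target_type="head", tolerance=2):
    # labels: per-frame segment types recognized for the first candidate segment, in temporal order.
    # A frame index is a candidate boundary position as long as at most `tolerance` frames up to and
    # including it were recognized as something other than the target type.
    candidates = []
    mismatches = 0
    for i, label in enumerate(labels):
        if label != target_type:
            mismatches += 1
        if mismatches > tolerance:
            break
        candidates.append(i)
    return candidates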
105. And separating candidate sub-fragments from the candidate video fragments according to the candidate fragment type demarcation position.
For a first candidate segment (i.e., a candidate slice header segment), all video frames between the first frame of video to the candidate slice type boundary location may be combined to obtain a candidate sub-segment.
For the second candidate segment (i.e., candidate end segment), all video frames between the boundary position of the candidate segment type in the video and the last frame of the video may be combined to obtain the candidate sub-segment.
106. And acquiring the statistical parameters of the video frame fragment types in the candidate sub-fragments according to the identification result sequence, and determining a target sub-fragment from the candidate sub-fragments according to the statistical parameters and a preset statistical parameter threshold.
The statistical parameter is a parameter for representing the probability that the candidate sub-segment is the target sub-segment, and may be represented as the number of target video frames in the candidate sub-segment, the recognition result of which is the target video segment type, or the ratio of the number of target video frames to the total number of video frames in the candidate sub-segment.
The preset statistical parameter threshold value refers to a minimum value of the statistical parameter when the candidate sub-segment is a target sub-segment, which is predetermined by a developer according to statistics and experience.
In general, a candidate sub-segment having the largest statistical parameter and not smaller than a preset statistical parameter threshold may be determined as a target sub-segment.
Wherein the target sub-segment may appear as a head or tail in the video.
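A sketch of the statistics-based selection in step 106 (the ratio-style statistical parameter and the threshold value are illustrative assumptions):

def select_target_subsegment(labels, candidate_positions, target_type="head", ratio_threshold=0.8):
    # labels: per-frame recognition results for the candidate segment (temporal order)
    # candidate_positions: candidate segment-type boundary positions (frame indices)
    # For a candidate head sub-segment, frames 0..pos form the sub-segment; its statistical
    # parameter here is the proportion of frames recognized as the target type.
    best_pos, best_ratio = None, 0.0
    for pos in candidate_positions:
        sub = labels[:pos + 1]
        ratio = sum(1 for x in sub if x == target_type) / len(sub)
        if ratio >= ratio_threshold and ratio > best_ratio:
            best_pos, best_ratio = pos, ratio
    return best_pos  # None means no head boundary was located in the candidate segment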
107. And acquiring a candidate segment type boundary position corresponding to the target sub-segment as a segment type boundary position in the video to be identified.
As can be seen from the above, the embodiment of the present application can obtain the video to be identified, and then select candidate video clips from the video to be identified according to the duration of the video to be identified; performing fragment type recognition on each video frame in the candidate video fragments to obtain a recognition result sequence, wherein the recognition result sequence is a sequence formed by recognition results of each video frame in the candidate video fragments; selecting a video frame from the candidate video clips as a candidate clip type demarcation position; separating candidate sub-segments from the candidate video segments according to the candidate segment type demarcation location; according to the identification result sequence, acquiring a statistical parameter of the video frame fragment type in the candidate sub-fragment, and determining a target sub-fragment from the candidate sub-fragment according to the statistical parameter and a preset statistical parameter threshold; and acquiring a candidate segment type boundary position corresponding to the target sub-segment as a segment type boundary position in the video to be identified.
According to the scheme, the demarcation position is finally determined according to the statistical parameters of the video frame fragment types, the situation that the demarcation position is influenced by the error identification result of an individual video frame can be avoided, the identification result is more accurate, the scheme does not need manual intervention, full automation can be realized, and therefore the demarcation positions of fragments of different types in the video can be more accurately and efficiently identified.
Since the strategies for determining the boundary position according to the identification result of the video frame are different, the video positioning method of the present application may replace steps 104, 105 and 106 in the above embodiment with steps 204, 205 and 206, and referring to fig. 2b, the specific flow is as follows:
201. and acquiring the video to be identified.
202. And selecting candidate video clips from the videos to be identified according to the duration of the videos to be identified.
203. And carrying out fragment type recognition on each video frame in the candidate video fragments to obtain a recognition result sequence, wherein the recognition result sequence is a sequence formed by recognition results of each video frame in the candidate video fragments.
Steps 201 to 203 are identical to steps 101 to 103 in the previous embodiment, and specific processes are referred to the above embodiments, and are not repeated.
The neural network is typically built from a ResNet-101 model, which can be replaced by other models from the ResNet series, the Inception series, the MobileNet series, and so on.
The dual-model structure of the neural network is not limited to being trained by loading the single-model structure for initialization and then fine-tuning each network separately; it can also be trained with schemes such as independent training or joint multi-task training.
To address hard example mining and sample imbalance, the loss function is not limited to the construction described in the above embodiment; it can also be built with methods such as OHEM (online hard example mining) or a class-balanced cross-entropy loss.
204. And taking the video frames in the candidate video fragments as candidate fragment type demarcation positions. In this embodiment, all video frames in the candidate video clip are taken as candidate clip type boundary positions.
205. And separating candidate sub-fragments from the candidate video fragments according to the candidate fragment type demarcation position.
The specific separation principle is referred to in step 105 in the above embodiment, and will not be described here again.
206. And acquiring a statistical parameter corresponding to the candidate sub-segment according to the identification result of the video frame in the candidate sub-segment and the position corresponding to the video frame, and determining a target sub-segment from the candidate sub-segment according to the statistical parameter and a preset statistical parameter threshold.
The step of obtaining the statistical parameter corresponding to the candidate sub-segment according to the identification result of the video frame in the candidate sub-segment and the position corresponding to the video frame may specifically include the following steps:
acquiring initial statistical parameters of the candidate sub-fragments according to the identification result sequence; acquiring a position encouraging parameter according to the position of the candidate segment type demarcation position in the candidate video segment; and fusing the position encouraging parameters with the corresponding initial statistical parameters to obtain the statistical parameters corresponding to the candidate sub-fragments.
The initial statistical parameter may be expressed as a target video frame number of the candidate sub-segment, where the identification result is the target video segment type, or a ratio of the target video frame number to the total number of video frames in the candidate sub-segment.
Wherein the statistical parameter is an actual parameter for representing the probability that the candidate sub-segment is the target sub-segment.
Wherein the position encouragement parameter is a parameter for indicating a degree of contribution of the position of the candidate segment type demarcation location in the candidate video segment to the statistical parameter.
In one embodiment, the location incentive parameters may be calculated using the following formula:
w_d = log(log(d) + 1), or w_d = log(d), where d represents the position of the video frame in the candidate video segment.
There are various ways of fusing the position encouragement parameter with the corresponding initial statistical parameter to obtain the statistical parameter corresponding to the candidate sub-segment. In an embodiment, the position encouragement parameter may be multiplied with the corresponding initial statistical parameter to obtain the statistical parameter corresponding to the candidate sub-segment.
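As an illustrative sketch (not taken from the disclosure), the encouraging strategy for the slice-header case might be implemented as follows; the function name, the 1-based position, the selection rule (largest fused statistic above a threshold) and the default threshold are assumptions, and class probabilities could be accumulated instead of hard labels:

```python
import math
from typing import List, Optional

def locate_header_end(labels: List[int],
                      header_type: int = 1,
                      stat_threshold: float = 0.3) -> Optional[int]:
    # labels: per-frame recognition results of the header candidate segment, in playback
    # order. Every frame index is treated as a candidate segment-type demarcation
    # position; the candidate sub-segment runs from the first frame up to that position.
    best_pos, best_stat = None, 0.0
    header_count = 0
    for idx, label in enumerate(labels):
        if label == header_type:
            header_count += 1
        d = idx + 1                                   # 1-based position in the candidate segment
        ratio = header_count / d                      # initial statistical parameter
        w_d = math.log(math.log(d) + 1)               # position encouragement: w_d = log(log(d) + 1)
        stat = w_d * ratio                            # fuse by multiplication
        if stat >= stat_threshold and stat > best_stat:
            best_pos, best_stat = idx, stat
    return best_pos   # None: no slice header located in the candidate segment
```

The slice-tail case is symmetric, with the candidate sub-segment running from the candidate demarcation position to the last frame.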
The beginning of some slice headers consists of real scenes, and the distinctive marks of the slice header only appear in its second half. When watching the earlier part, it is difficult even for a human to judge in real time whether the current position still belongs to the slice header; an accurate judgment can only be made after key features of the slice header (such as the title, the logo of the video, and the like) appear later. Based on this, greater attention should be given to the later part of the slice header, so different position encouragements can be applied to the initial statistical parameters corresponding to different candidate segment type demarcation positions, thereby giving greater attention to video frames located later and, in turn, determining the demarcation position of the video to be identified more accurately.
When acquiring the statistical parameters, not only the recognized segment types of the video frames may be counted, but also the probability that each recognized video frame belongs to a certain segment type may be accumulated.
The strategy used in steps 201-207 may be referred to as a grouping strategy. The grouping strategy is not limited to giving more encouragement (position encouragement) to later-occurring video frames; it may also give more encouragement to consecutively occurring video frames.
207. And acquiring a candidate segment type boundary position corresponding to the target sub-segment as a segment type boundary position in the video to be identified.
Referring to fig. 5d, a video with a length of 242s is input, and the structure of the video can be predicted by adopting the scheme of the present application. Specifically, the end position of the slice header is recognized at 3.00s with a confidence of 0.805, and the start position of the slice tail at 229.25s with a confidence of 0.607.
The confidence refers to the credibility of the video positioning result. When the confidence is smaller than a preset threshold, the video positioning result can be automatically recalled and repositioned, or forwarded for manual positioning.
In an embodiment, the video may also have no slice header or no slice tail. For example, referring to fig. 5e, the video has no slice tail, so the positioning result contains only the boundary position of the slice header and no boundary position of the slice tail.
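A minimal sketch of this post-processing (names and the default threshold are assumptions): a located boundary below the confidence threshold is recalled for re-positioning or routed to manual positioning, and a missing boundary simply yields no result:

```python
def handle_boundary(boundary, confidence, conf_threshold: float = 0.6):
    # boundary: located segment-type demarcation position in seconds, or None
    # when the video has no slice header / slice tail of that kind.
    if boundary is None:
        return {"boundary": None, "action": "none"}        # nothing to locate
    if confidence < conf_threshold:
        return {"boundary": boundary, "action": "recall"}  # reposition or route to manual positioning
    return {"boundary": boundary, "action": "accept"}
```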
From the above, the demarcation position is finally determined according to the statistical parameters of the video frame segment types, so erroneous recognition results of individual video frames can be prevented from affecting the determination of the demarcation position, and the recognition result becomes more accurate. The scheme requires no manual intervention and can be fully automated, so the demarcation positions of different types of segments in the video can be identified more accurately and efficiently.
The method described in the above embodiments is described in further detail below by way of example.
In order to verify the effect of the video positioning scheme provided by the embodiment of the present application, the video positioning results obtained with different model structures are compared. The results obtained with different neural network structures and with different methods for dividing candidate sub-segments (namely, grouping strategies) are shown in Table 1 and Table 2, respectively.
TABLE 1
The ordinate in Table 1 represents the different neural network structures; on the abscissa, P represents the positioning accuracy and R represents the recall. According to a preset threshold t, when the slice-header or slice-tail error of a video is smaller than t, the positioning is considered accurate; otherwise the positioning is considered inaccurate and the video needs to be recalled. The values in Table 1 represent the positioning accuracy and recall of the slice header or slice tail under the different neural network structures.
TABLE 2
Model P R
Fault-tolerant grouping strategy 0.834 0.618
Encouraging grouping strategy 0.845 0.645
The ordinate in Table 2 represents the different methods for dividing candidate sub-segments (grouping strategies): the method used in steps 101 to 107 is called the fault-tolerant grouping strategy, and the method used in steps 201 to 207 is called the encouraging grouping strategy. On the abscissa, P represents the positioning accuracy and R represents the recall, defined as for Table 1. The values in Table 2 represent the positioning accuracy and recall corresponding to the different grouping strategies.
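Under this protocol, the accuracy P and recall R reported in Tables 1 and 2 could be computed as sketched below (the function name, data layout and default tolerance are assumptions; P is read as the fraction of predicted boundaries within t of the ground truth, R as the fraction of ground-truth boundaries that are correctly located):

```python
def boundary_precision_recall(predictions, ground_truth, t: float = 2.0):
    # predictions / ground_truth: dicts mapping video id -> boundary time in seconds,
    # or None when no boundary was predicted / no boundary exists.
    correct = predicted = relevant = 0
    for vid, gt in ground_truth.items():
        pred = predictions.get(vid)
        if gt is not None:
            relevant += 1
        if pred is not None:
            predicted += 1
            if gt is not None and abs(pred - gt) < t:   # accurate when the error is below t
                correct += 1
    precision = correct / predicted if predicted else 0.0
    recall = correct / relevant if relevant else 0.0
    return precision, recall
```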
In order to better implement the method, correspondingly, the embodiment of the application also provides a video positioning device which can be integrated in electronic equipment, wherein the electronic equipment can be a server or a terminal and other equipment.
For example, as shown in fig. 3, the video positioning apparatus may include an acquisition unit 301, a candidate unit 302, an identification unit 303, a selection unit 304, a separation unit 305, a determination unit 306, and a positioning unit 307, as follows:
(1) An acquiring unit 301, configured to acquire a video to be identified;
(2) A candidate unit 302, configured to select a candidate video segment from the video to be identified according to a duration of the video to be identified;
(3) The identifying unit 303 is configured to identify a segment type of each video frame in the candidate video segment, so as to obtain an identification result sequence, where the identification result sequence is a sequence formed by identification results of each video frame in the candidate video segment;
(4) A selecting unit 304, configured to select a video frame from the candidate video segments as a candidate segment type boundary position;
(5) A separation unit 305, configured to separate candidate sub-segments from the candidate video segments according to the candidate segment type demarcation position;
(6) A determining unit 306, configured to acquire the statistical parameters of the video frame segment types in the candidate sub-segments according to the identification result sequence, and determine a target sub-segment from the candidate sub-segments according to the statistical parameters and a preset statistical parameter threshold;
(7) A positioning unit 307, configured to acquire the candidate segment type boundary position corresponding to the target sub-segment as the segment type boundary position in the video to be identified.
Optionally, in some embodiments, the identifying unit 303 may specifically include:
a splicing subunit, configured to splice the first candidate video segment and the second candidate video segment to obtain a spliced candidate video segment;
the identification subunit is used for carrying out fragment type identification on each video frame in the target candidate video fragments by adopting an identification network in a preset neural network to obtain fragment type identification results of each video frame;
and the combining subunit is used for combining the fragment type recognition results according to the first candidate video fragment and the second candidate video fragment to obtain a first recognition result sequence and a second recognition result sequence.
Alternatively,
the first identification subunit is used for carrying out fragment type identification on each video frame in the first candidate video fragments by adopting a first identification network to obtain fragment type identification results of each video frame;
The first combination subunit is used for combining the fragment type identification results to obtain a first identification result sequence;
the second recognition subunit is used for carrying out fragment type recognition on each video frame in the second candidate video fragments by adopting a second recognition network to obtain fragment type recognition results of each video frame;
and the second combination subunit is used for combining the fragment type identification results to obtain a second identification result sequence.
Wherein the identification subunit may specifically be configured to:
extracting the characteristics of each video frame according to the convolution network in the identification network to obtain the characteristic information of the video frame;
and carrying out full connection operation on the characteristic information according to the full connection network in the identification network to obtain a fragment type identification result of the video frame.
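Taking the splicing, identification and combining subunits together, one possible flow is sketched below (an assumption-level illustration: `model` stands for the recognition network mapping preprocessed frames to segment-type logits, and all names are illustrative):

```python
import torch

def recognize_spliced(model, header_frames: torch.Tensor, tail_frames: torch.Tensor):
    # Splice the two candidate segments, score every frame with a single recognition
    # network in one pass, then split the per-frame results back into two sequences.
    spliced = torch.cat([header_frames, tail_frames], dim=0)   # splice the candidates
    with torch.no_grad():
        logits = model(spliced)
        labels = logits.argmax(dim=1)                          # segment-type result per frame
    n_header = header_frames.shape[0]
    first_sequence = labels[:n_header].tolist()                # results for the first candidate segment
    second_sequence = labels[n_header:].tolist()               # results for the second candidate segment
    return first_sequence, second_sequence
```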
Optionally, in an embodiment, the selecting unit 304 may specifically be configured to:
selecting, according to the identification result sequence and a preset tolerance threshold, video frames satisfying the tolerance threshold from the candidate video segments as candidate segment type demarcation positions, wherein the tolerance threshold is the maximum number of video frames in the target video segment whose segment type identification result is not the target video segment type.
Alternatively,
and taking the video frames in the candidate video fragments as candidate fragment type demarcation positions.
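For the first of these two options, a minimal sketch of candidate selection under the tolerance threshold (slice-header case; names and the default tolerance are assumptions) could look as follows:

```python
from typing import List

def fault_tolerant_candidates(labels: List[int],
                              header_type: int = 1,
                              tolerance: int = 3) -> List[int]:
    # A frame position qualifies as a candidate segment-type demarcation position
    # while the number of frames recognized as a non-header type, counted from the
    # first frame up to that position, stays below the tolerance threshold.
    candidates = []
    non_target = 0
    for idx, label in enumerate(labels):
        if label != header_type:
            non_target += 1
        if non_target < tolerance:
            candidates.append(idx)
    return candidates
```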
Alternatively, in an embodiment, the determining unit 306 may specifically be configured to:
acquiring initial statistical parameters of the candidate sub-fragments according to the identification result sequence;
acquiring a position encouraging parameter according to the position of the candidate segment type demarcation position in the candidate video segment;
and fusing the position encouraging parameters with the corresponding initial statistical parameters to obtain the statistical parameters corresponding to the candidate sub-fragments.
Optionally, in some embodiments, the video positioning device may further include a training unit, and specifically may include:
the video clip comprises an acquisition subunit, a processing subunit and a processing subunit, wherein the acquisition subunit is used for acquiring a plurality of video clip samples, and the video clip samples comprise a plurality of video frames marked with real clip types;
the identification subunit is used for identifying the segment type corresponding to the video frame in the video segment sample through a preset initial neural network;
the prediction subunit is used for determining a current prediction result according to the fragment type obtained by recognition and the real fragment type;
the construction subunit is used for constructing a loss function according to a preset adjustment coefficient, the real fragment type and probability information corresponding to the fragment type obtained by recognition;
And the training subunit is used for converging the preset initial neural network by adopting a loss function until the current prediction result is correct in prediction, so as to obtain the trained neural network.
In an embodiment, the adjustment coefficients comprise a modulation coefficient and a balance coefficient, and the construction subunit may specifically be configured to:
constructing a loss weight parameter corresponding to an easily-identified sample according to the preset modulation coefficient, the real fragment type and probability information corresponding to the identified fragment type;
constructing an initial loss function according to the real fragment type and probability information corresponding to the fragment type obtained by recognition;
and constructing the loss function according to the preset balance coefficient, the loss weight parameter and the initial loss function.
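A loss of this form resembles a focal-loss-style objective; the following sketch is one possible reading of the three construction steps above, where gamma plays the role of the modulation coefficient and alpha the balance coefficient (the exact formula and default values are assumptions, not taken from the disclosure):

```python
import torch
import torch.nn.functional as F

def modulated_balanced_loss(logits: torch.Tensor,
                            targets: torch.Tensor,
                            gamma: float = 2.0,    # modulation coefficient (assumed value)
                            alpha: float = 0.25):  # balance coefficient (assumed value)
    log_probs = F.log_softmax(logits, dim=1)
    ce = F.nll_loss(log_probs, targets, reduction="none")             # initial cross-entropy loss
    p_t = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1).exp()  # probability of the real segment type
    weight = (1.0 - p_t) ** gamma                                     # loss weight that down-weights easy samples
    return (alpha * weight * ce).mean()
```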
In the implementation, each unit may be implemented as an independent entity, or may be implemented as the same entity or several entities in any combination, and the implementation of each unit may be referred to the foregoing method embodiment, which is not described herein again.
From the above, it can be seen that the embodiment of the application finally determines the demarcation position according to the statistical parameters of the video frame segment types, which can avoid influencing the determination of the demarcation position by the error recognition result of the individual video frame, and make the recognition result more accurate.
In addition, the embodiment of the application further provides an electronic device, as shown in fig. 4, which shows a schematic structural diagram of the electronic device according to the embodiment of the application, specifically:
the electronic device may include a processor 401 having one or more processing cores, a memory 402 having one or more computer-readable storage media, a power supply 403, an input unit 404, and other components. Those skilled in the art will appreciate that the electronic device structure shown in fig. 4 does not constitute a limitation of the electronic device, which may include more or fewer components than shown, combine certain components, or adopt a different arrangement of components. Wherein:
the processor 401 is a control center of the electronic device, connects various parts of the entire electronic device using various interfaces and lines, and performs various functions of the electronic device and processes data by running or executing software programs and/or modules stored in the memory 402, and calling data stored in the memory 402, thereby performing overall detection of the electronic device. Optionally, processor 401 may include one or more processing cores; preferably, the processor 401 may integrate an application processor and a modem processor, wherein the application processor mainly processes an operating system, a user interface, an application program, etc., and the modem processor mainly processes wireless communication. It will be appreciated that the modem processor described above may not be integrated into the processor 401.
The memory 402 may be used to store software programs and modules, and the processor 401 executes various functional applications and data processing by running the software programs and modules stored in the memory 402. The memory 402 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data created according to the use of the electronic device, etc. In addition, the memory 402 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. Accordingly, the memory 402 may also include a memory controller to provide the processor 401 with access to the memory 402.
The electronic device further comprises a power supply 403 for supplying power to the various components, preferably the power supply 403 may be logically connected to the processor 401 by a power management system, so that functions of managing charging, discharging, and power consumption are performed by the power management system. The power supply 403 may also include one or more of any of a direct current or alternating current power supply, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and the like.
The electronic device may further comprise an input unit 404, which input unit 404 may be used for receiving input digital or character information and generating keyboard, mouse, joystick, optical or trackball signal inputs in connection with user settings and function control.
Although not shown, the electronic device may further include a display unit or the like, which is not described herein. In particular, in this embodiment, the processor 401 in the electronic device loads executable files corresponding to the processes of one or more application programs into the memory 402 according to the following instructions, and the processor 401 executes the application programs stored in the memory 402, so as to implement various functions as follows:
acquiring a video to be identified;
selecting candidate video clips from the videos to be identified according to the duration of the videos to be identified;
performing fragment type recognition on each video frame in the candidate video fragments to obtain a recognition result sequence, wherein the recognition result sequence is a sequence formed by recognition results of each video frame in the candidate video fragments;
selecting a video frame from the candidate video clips as a candidate clip type demarcation position;
separating candidate sub-segments from the candidate video segments according to the candidate segment type demarcation location;
According to the identification result sequence, acquiring a statistical parameter of the video frame fragment type in the candidate sub-fragment, and determining a target sub-fragment from the candidate sub-fragment according to the statistical parameter and a preset statistical parameter threshold;
and acquiring a candidate segment type boundary position corresponding to the target sub-segment as a segment type boundary position in the video to be identified.
The specific implementation of each operation above may be referred to the previous embodiments, and will not be described herein.
From the above, it can be seen that the embodiment of the present application finally determines the demarcation position according to the statistical parameters of the video frame segment types, which prevents erroneous recognition results of individual video frames from affecting the determination of the demarcation position, makes the recognition result more accurate, and thereby improves the accuracy of video positioning.
Those of ordinary skill in the art will appreciate that all or a portion of the steps of the various methods of the above embodiments may be performed by instructions, or by instructions controlling associated hardware, which may be stored in a computer-readable storage medium and loaded and executed by a processor.
To this end, an embodiment of the present application further provides a storage medium storing a plurality of instructions capable of being loaded by a processor to perform the steps in any one of the video positioning methods provided by the embodiments of the present application. For example, the instructions may perform the steps of:
acquiring a video to be identified;
selecting candidate video clips from the videos to be identified according to the duration of the videos to be identified;
performing fragment type recognition on each video frame in the candidate video fragments to obtain a recognition result sequence, wherein the recognition result sequence is a sequence formed by recognition results of each video frame in the candidate video fragments;
selecting a video frame from the candidate video clips as a candidate clip type demarcation position;
separating candidate sub-segments from the candidate video segments according to the candidate segment type demarcation location;
according to the identification result sequence, acquiring a statistical parameter of the video frame fragment type in the candidate sub-fragment, and determining a target sub-fragment from the candidate sub-fragment according to the statistical parameter and a preset statistical parameter threshold;
And acquiring a candidate segment type boundary position corresponding to the target sub-segment as a segment type boundary position in the video to be identified.
The specific implementation of each operation above may be referred to the previous embodiments, and will not be described herein.
Wherein the storage medium may include: read Only Memory (ROM), random access Memory (Random Access Memory, RAM), magnetic or optical disk, and the like.
The instructions stored in the storage medium can execute the steps in any video positioning method provided by the embodiments of the present application, and can therefore achieve the beneficial effects that can be achieved by any video positioning method provided by the embodiments of the present application, which are detailed in the previous embodiments and are not described herein again.
The foregoing describes in detail the video positioning method and apparatus, the electronic device and the storage medium provided by the embodiments of the present application. Specific examples are applied herein to illustrate the principles and embodiments of the present application, and the above examples are only used to help understand the method and core idea of the present application. Meanwhile, those skilled in the art may make variations to the specific embodiments and the application scope in light of the ideas of the present application; in summary, the content of this description should not be construed as limiting the present application.

Claims (10)

1. A video positioning method, comprising:
acquiring a video to be identified;
selecting candidate video clips from the video to be identified according to the duration of the video to be identified, wherein the candidate video clips comprise video clips which can be the head or the tail of the video;
performing fragment type recognition on each video frame in the candidate video fragments to obtain a recognition result sequence, wherein the recognition result sequence is a sequence formed by recognition results of each video frame in the candidate video fragments;
selecting a video frame meeting the preset tolerance threshold from the candidate video clips as a candidate clip type demarcation position according to the identification result sequence and a preset tolerance threshold, wherein the tolerance threshold is the maximum number of video frames of which the clip type identification result is not the target video clip type in the target video clips, the target video clip type comprises a head type or a tail type, when the candidate video clips are of the head type, the number of video frames of which the identification result is not the target video clip type in the video clips formed from the first frame to the candidate clip type demarcation position is smaller than the preset tolerance threshold, and when the candidate video clips are of the tail type, the number of video frames of which the identification result is not the target video clip type in the video clips formed from the candidate clip type demarcation position to the last frame is smaller than the preset tolerance threshold;
Separating candidate sub-segments from the candidate video segments according to the candidate segment type demarcation location;
according to the identification result sequence, acquiring a statistical parameter of the video frame segment type in the candidate sub-segment, and determining a target sub-segment from the candidate sub-segment according to the statistical parameter and a preset statistical parameter threshold, wherein the statistical parameter comprises the target video frame number of which the identification result in the candidate sub-segment is the target video segment type, or the ratio of the target video frame number to the total video frame number in the candidate sub-segment;
and acquiring a candidate segment type boundary position corresponding to the target sub-segment as a segment type boundary position in the video to be identified.
2. The video localization method of claim 1, wherein the candidate video segments comprise a first candidate video segment and a second candidate video segment;
the step of identifying the segment types of the video frames in the candidate video segments to obtain an identification result sequence comprises the following steps:
splicing the first candidate video segment and the second candidate video segment to obtain spliced candidate video segments;
Adopting an identification network in a preset neural network to identify the segment types of the video frames in the spliced candidate video segments, and obtaining segment type identification results of the video frames;
and combining the fragment type recognition results according to the first candidate video fragments and the second candidate video fragments to obtain a first recognition result sequence and a second recognition result sequence.
3. The video localization method of claim 1, wherein the candidate video segments comprise a first candidate video segment and a second candidate video segment, and the pre-determined neural network comprises a first identification network and a second identification network;
and carrying out fragment type identification on each video frame in the candidate video fragments by adopting a preset neural network to obtain an identification result sequence, wherein the identification result sequence comprises the following steps of:
performing fragment type identification on each video frame in the first candidate video fragment by adopting a first identification network to obtain a fragment type identification result of each video frame;
combining the fragment type recognition results to obtain a first recognition result sequence;
performing fragment type identification on each video frame in the second candidate video fragment by adopting a second identification network to obtain a fragment type identification result of each video frame;
And combining the fragment type recognition results to obtain a second recognition result sequence.
4. A video positioning method as claimed in claim 2 or 3, wherein the step of performing the segment type identification for each video frame to obtain the segment type identification result for each video frame comprises:
extracting the characteristics of each video frame according to the convolution network in the identification network to obtain the characteristic information of the video frame;
and carrying out full connection operation on the characteristic information according to the full connection network in the identification network to obtain a fragment type identification result of the video frame.
5. The video positioning method according to claim 2, wherein before the segment type identification is performed on each video frame in the candidate video segments by using the preset neural network, the method further comprises:
obtaining a plurality of video clip samples, wherein the video clip samples comprise a plurality of video frames marked with real clip types;
identifying a fragment type corresponding to a video frame in the video fragment sample through a preset initial neural network;
determining a current prediction result according to the fragment type obtained by recognition and the real fragment type;
Constructing a loss function according to a preset adjustment coefficient, the real fragment type and probability information corresponding to the identified fragment type;
and converging the preset initial neural network by adopting a loss function until the current prediction result is correct in prediction, so as to obtain the trained neural network.
6. The video positioning method of claim 5, wherein the adjustment coefficients include a modulation coefficient and a balance coefficient, and the constructing a loss function according to a preset adjustment coefficient, the real segment type, and probability information corresponding to the identified segment type includes:
constructing a loss weight parameter corresponding to an easily-identified sample according to a preset modulation coefficient, the real fragment type and probability information corresponding to the identified fragment type;
constructing an initial loss function according to the real fragment type and probability information corresponding to the fragment type obtained by recognition;
and constructing the loss function according to a preset balance coefficient, the loss weight parameter and the initial loss function.
7. A video positioning method, comprising:
acquiring a video to be identified;
selecting candidate video clips from the video to be identified according to the duration of the video to be identified, wherein the candidate video clips comprise video clips which can be the head or the tail of the video;
Performing fragment type recognition on each video frame in the candidate video fragments to obtain a recognition result sequence, wherein the recognition result sequence is a sequence formed by recognition results of each video frame in the candidate video fragments;
taking all video frames in the candidate video clips as candidate clip type demarcation positions;
separating candidate sub-segments from the candidate video segments according to the candidate segment type demarcation location;
acquiring initial statistical parameters of the candidate sub-fragments according to the identification result sequence; acquiring a position encouraging parameter according to the position of the candidate segment type demarcation position in the candidate video segment; fusing the position encouraging parameter with the corresponding initial statistical parameter to obtain a statistical parameter corresponding to the candidate sub-segment, wherein the initial statistical parameter comprises a target video frame number of which the identification result is a target video type in the candidate sub-segment or a ratio of the target video frame number to the total number of video frames in the candidate sub-segment, the statistical parameter comprises an actual parameter for representing the probability that the candidate sub-segment is the target sub-segment, and the position encouraging parameter comprises a parameter for representing the contribution degree of the position of the boundary position of the candidate segment type in the candidate video segment to the statistical parameter;
And acquiring a candidate segment type boundary position corresponding to the target sub-segment as a segment type boundary position in the video to be identified.
8. A video positioning apparatus, comprising:
the acquisition unit is used for acquiring the video to be identified;
a candidate unit, configured to select a candidate video segment from the video to be identified according to a duration of the video to be identified, where the candidate video segment includes a video segment that may be a head or a tail of the video;
the identification unit is used for carrying out fragment type identification on each video frame in the candidate video fragments to obtain an identification result sequence, wherein the identification result sequence is a sequence formed by identification results of each video frame in the candidate video fragments;
a selecting unit, configured to select, according to the identification result sequence and a preset tolerance threshold, a video frame satisfying the preset tolerance threshold as a candidate segment type demarcation position from the candidate video segments, where the tolerance threshold is a maximum number of video frames in a target video segment, the segment type identification result is not a target video segment type, the target video segment type includes a head type or a tail type, when the candidate video segment is the head type, the number of video frames in a video segment formed from a first frame to the candidate segment type demarcation position, the identification result of which is not the target video segment type, is less than the preset tolerance threshold, and when the candidate video segment is the tail type, the number of video frames in a video segment formed from the candidate segment type demarcation position to a last frame, the identification result of which is not the target video segment type, is less than the preset tolerance threshold;
The separation unit is used for separating candidate sub-fragments from the candidate video fragments according to the candidate fragment type demarcation position;
the determining unit is used for obtaining the statistical parameters of the video frame segment types in the candidate sub-segments according to the identification result sequence, and determining target sub-segments from the candidate sub-segments according to the statistical parameters and a preset statistical parameter threshold, wherein the statistical parameters comprise the target video frame number of which the identification result in the candidate sub-segments is the target video segment type, or the ratio of the target video frame number to the total video frame number in the candidate sub-segments;
and the positioning unit is used for acquiring the candidate segment type boundary position corresponding to the target sub-segment as the segment type boundary position in the video to be identified.
9. An electronic device comprising a processor and a memory, the memory storing a plurality of instructions; the processor loads instructions from the memory to perform the steps in the video localization method as claimed in any one of claims 1 to 7.
10. A storage medium having stored thereon a computer program which, when run on a computer, causes the computer to perform the video localization method of any one of claims 1 to 7.
CN202010256464.4A 2020-04-02 2020-04-02 Video positioning method and device, electronic equipment and storage medium Active CN111479130B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010256464.4A CN111479130B (en) 2020-04-02 2020-04-02 Video positioning method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111479130A CN111479130A (en) 2020-07-31
CN111479130B true CN111479130B (en) 2023-09-26

Family

ID=71749750

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010256464.4A Active CN111479130B (en) 2020-04-02 2020-04-02 Video positioning method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111479130B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112153478B (en) * 2020-09-11 2022-03-08 腾讯科技(深圳)有限公司 Video processing method and video playing method
CN112291589B (en) * 2020-10-29 2023-09-22 腾讯科技(深圳)有限公司 Method and device for detecting structure of video file
CN113515997B (en) * 2020-12-28 2024-01-19 腾讯科技(深圳)有限公司 Video data processing method and device and readable storage medium
CN113627363B (en) * 2021-08-13 2023-08-15 百度在线网络技术(北京)有限公司 Video file processing method, device, equipment and storage medium
CN115708359A (en) * 2021-08-20 2023-02-21 小米科技(武汉)有限公司 Video clip intercepting method and device and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9418296B1 (en) * 2015-03-17 2016-08-16 Netflix, Inc. Detecting segments of a video program
US10602235B2 (en) * 2016-12-29 2020-03-24 Arris Enterprises Llc Video segment detection and replacement

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017166597A1 (en) * 2016-03-31 2017-10-05 乐视控股(北京)有限公司 Cartoon video recognition method and apparatus, and electronic device
WO2018033156A1 (en) * 2016-08-19 2018-02-22 北京市商汤科技开发有限公司 Video image processing method, device, and electronic apparatus
CN107133266A (en) * 2017-03-31 2017-09-05 北京奇艺世纪科技有限公司 The detection method and device and database update method and device of video lens classification
WO2019144838A1 (en) * 2018-01-24 2019-08-01 北京一览科技有限公司 Method and apparatus for use in acquiring evaluation result information of video
WO2019182834A1 (en) * 2018-03-20 2019-09-26 Hulu, LLC Content type detection in videos using multiple classifiers
CN108769731A (en) * 2018-05-25 2018-11-06 北京奇艺世纪科技有限公司 The method, apparatus and electronic equipment of target video segment in a kind of detection video
WO2019233341A1 (en) * 2018-06-08 2019-12-12 Oppo广东移动通信有限公司 Image processing method and apparatus, computer readable storage medium, and computer device
CN108924586A (en) * 2018-06-20 2018-11-30 北京奇艺世纪科技有限公司 A kind of detection method of video frame, device and electronic equipment

Also Published As

Publication number Publication date
CN111479130A (en) 2020-07-31

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant