CN112149568A - Short video positioning method and device, electronic equipment and computer readable storage medium

Short video positioning method and device, electronic equipment and computer readable storage medium

Info

Publication number
CN112149568A
Authority
CN
China
Prior art keywords
video
short video
short
feature vector
alternative
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011012280.XA
Other languages
Chinese (zh)
Inventor
黄家水
徐华泽
蒋德才
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ainnovation Hefei Technology Co ltd
Original Assignee
Ainnovation Hefei Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ainnovation Hefei Technology Co ltd filed Critical Ainnovation Hefei Technology Co ltd
Priority to CN202011012280.XA priority Critical patent/CN112149568A/en
Publication of CN112149568A publication Critical patent/CN112149568A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7837 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Abstract

The invention relates to a short video positioning method and device, an electronic device, and a computer readable storage medium. The method includes: acquiring an alternative long video and a short video to be positioned; inputting the short video into a pre-stored deep neural network model to obtain a first feature vector corresponding to the short video; splitting the alternative long video into a plurality of different alternative short videos, each having the same number of frames as the short video; inputting the alternative short videos into the deep neural network model to obtain second feature vectors; and determining the position of the short video in the alternative long video according to the first feature vector and the second feature vectors. In this process, manual intervention is not needed, and labor cost can be reduced. In addition, since machines are more efficient than humans, the accuracy and efficiency of locating short videos may also be improved.

Description

Short video positioning method and device, electronic equipment and computer readable storage medium
Technical Field
The application belongs to the field of computer vision, and particularly relates to a short video positioning method and device, electronic equipment and a computer readable storage medium.
Background
Short video positioning means finding the point in time at which a short video occurs within a long video segment (referred to in the embodiments of the present application as an alternative long video), for example locating the point in time at which a certain advertisement appears within a long video segment.
In the prior art, short videos are generally located manually: a worker first watches the short video to be located and then, relying on memory, searches the alternative video for the time point at which the short video occurs. However, human vision is quite limited, so manual positioning suffers from poor accuracy and low efficiency, and it additionally incurs a high labor cost.
Disclosure of Invention
In view of the above, an object of the present application is to provide a short video positioning method, apparatus, electronic device and computer readable storage medium, so as to reduce labor cost and improve accuracy and efficiency of short video positioning.
The embodiment of the application is realized as follows:
in a first aspect, an embodiment of the present application provides a short video positioning method, including: acquiring an alternative long video and a short video to be positioned; inputting the short video into a pre-stored deep neural network model to obtain a first feature vector corresponding to the short video; splitting the alternative long video into a plurality of different alternative short videos, wherein each alternative short video has the same number of frames as the short video; respectively inputting the different alternative short videos into the deep neural network model to obtain a plurality of second feature vectors, the second feature vectors corresponding one-to-one to the different alternative short videos; and determining the position of the short video in the alternative long video according to the first feature vector and the plurality of second feature vectors. In this process, manual intervention is not needed, and labor cost can be reduced. In addition, since machines are more efficient than humans, the accuracy and efficiency of locating short videos may also be improved.
With reference to the embodiment of the first aspect, in a possible implementation manner, each alternative short video has a corresponding time stamp in the alternative long video, and determining the position of the short video in the alternative long video according to the first feature vector and the plurality of second feature vectors includes: calculating a correlation coefficient between the first feature vector and each second feature vector to obtain a plurality of correlation coefficients, the correlation coefficients corresponding one-to-one to the second feature vectors; determining a maximum correlation coefficient from the plurality of correlation coefficients; when the maximum correlation coefficient is larger than a threshold value, determining the time stamp, in the alternative long video, of the alternative short video corresponding to the second feature vector corresponding to the maximum correlation coefficient; and determining the position of the time stamp in the alternative long video as the position of the short video in the alternative long video. That is, the position of the short video to be positioned can be determined by means of the time stamp.
With reference to the embodiment of the first aspect, in a possible implementation manner, calculating the correlation coefficient between the first feature vector and each second feature vector includes calculating the correlation coefficient based on the formula

$$r_{X,Y} = \frac{\sum_{i=1}^{h}\left(x_i - \bar{X}\right)\left(y_i - \bar{Y}\right)}{\sqrt{\sum_{i=1}^{h}\left(x_i - \bar{X}\right)^2}\,\sqrt{\sum_{i=1}^{h}\left(y_i - \bar{Y}\right)^2}}$$

where $r_{X,Y}$ is the correlation coefficient between the first feature vector X and the second feature vector Y, $\bar{X}$ is the average of the elements included in the first feature vector, $\bar{Y}$ is the average of the elements included in the second feature vector, $x_i$ is the i-th element of the first feature vector X, $y_i$ is the i-th element of the second feature vector Y, and h is the number of elements included in the first feature vector X and the second feature vector Y.
With reference to the embodiment of the first aspect, in a possible implementation manner, inputting the short video into the pre-stored deep neural network model includes: sequentially extracting multiple frames of key frame images from the short video; and inputting the multiple key frame images into the pre-stored deep neural network model, so as to improve the efficiency of feature extraction and thereby the efficiency of positioning.
With reference to the embodiment of the first aspect, in a possible implementation manner, when the number of input layers of the deep neural network model is M and the number of frames N included in the short video is greater than M, sequentially extracting multiple frames of key frame images from the short video includes: dividing N by M to obtain a quotient; and, starting from the first frame of the short video, sequentially extracting M frames of images from the short video with the quotient as the step length, and determining the M frames of images as the multiple key frame images.
With reference to the embodiment of the first aspect, in a possible implementation manner, before inputting the multiple key frame images into the pre-stored deep neural network model, the method further includes: adjusting the sizes of the key frame images to be consistent, so as to improve the efficiency of feature extraction and thereby the efficiency of positioning.
With reference to the embodiment of the first aspect, in one possible implementation manner, the alternative long video includes $t_l$ frames and the short video includes $t_s$ frames, and splitting the alternative long video into a plurality of different alternative short videos includes: starting from the f-th frame of the alternative long video, sequentially extracting from the alternative long video, with a step length of 1, image frames equal in number to the frames of the short video to form an alternative short video, thereby obtaining the plurality of different alternative short videos, where f takes the values 1, 2, 3, ..., $t_l - t_s + 1$ in turn.
In a second aspect, an embodiment of the present application provides a short video positioning apparatus, including an acquisition module, an extraction module, a splitting module and a determination module. The acquisition module is used for acquiring the alternative long video and the short video to be positioned; the extraction module is used for inputting the short video into a pre-stored deep neural network model to obtain a first feature vector corresponding to the short video; the splitting module is used for splitting the alternative long video into a plurality of different alternative short videos, each of which has the same number of frames as the short video; the extraction module is further configured to input the plurality of different alternative short videos into the deep neural network model respectively to obtain a plurality of second feature vectors, the second feature vectors corresponding one-to-one to the different alternative short videos; and the determination module is used for determining the position of the short video in the alternative long video according to the first feature vector and the plurality of second feature vectors.
With reference to the second aspect, in a possible implementation manner, each alternative short video has a corresponding time stamp in the alternative long video, and the determination module is configured to calculate a correlation coefficient between the first feature vector and each second feature vector to obtain a plurality of correlation coefficients, the correlation coefficients corresponding one-to-one to the second feature vectors; determine a maximum correlation coefficient from the plurality of correlation coefficients; when the maximum correlation coefficient is larger than a threshold value, determine the time stamp, in the alternative long video, of the alternative short video corresponding to the second feature vector corresponding to the maximum correlation coefficient; and determine the position of the time stamp in the alternative long video as the position of the short video in the alternative long video.
With reference to the second aspect, in one possible implementation manner, the determination module is configured to calculate the correlation coefficient between the first feature vector and each second feature vector based on the formula

$$r_{X,Y} = \frac{\sum_{i=1}^{h}\left(x_i - \bar{X}\right)\left(y_i - \bar{Y}\right)}{\sqrt{\sum_{i=1}^{h}\left(x_i - \bar{X}\right)^2}\,\sqrt{\sum_{i=1}^{h}\left(y_i - \bar{Y}\right)^2}}$$

where $r_{X,Y}$ is the correlation coefficient between the first feature vector X and the second feature vector Y, $\bar{X}$ is the average of the elements included in the first feature vector, $\bar{Y}$ is the average of the elements included in the second feature vector, $x_i$ is the i-th element of the first feature vector X, $y_i$ is the i-th element of the second feature vector Y, and h is the number of elements included in the first feature vector X and the second feature vector Y.
With reference to the embodiment of the second aspect, in a possible implementation manner, the extracting module is configured to sequentially extract multiple frames of key frame images from the short video; and inputting the multi-frame key frame image into a pre-stored deep neural network model.
With reference to the second aspect, in a possible implementation manner, the number of input layers of the deep neural network model is M, and when the number of frames N included in the short video is greater than M, the extraction module is configured to divide N by M to obtain a quotient; and, starting from the first frame of the short video, sequentially extract M frames of images from the short video with the quotient as the step length, and determine the M frames of images as the multiple key frame images.
With reference to the second aspect, in a possible implementation manner, the apparatus further includes an adjusting module, configured to adjust a size of each frame of the key frame image to be uniform.
With reference to the second aspect, in one possible implementation manner, the alternative long video includes $t_l$ frames and the short video includes $t_s$ frames, and the splitting module is configured to, starting from the i-th frame of the alternative long video, sequentially extract from the alternative long video, with a step length of 1, image frames equal in number to the frames of the short video to form an alternative short video, thereby obtaining the plurality of different alternative short videos, where i takes the values 1, 2, 3, ..., $t_l - t_s + 1$ in turn.
In a third aspect, an embodiment of the present application further provides an electronic device, including: a memory and a processor, the memory and the processor connected; the memory is used for storing programs; the processor calls a program stored in the memory to perform the method of the first aspect embodiment and/or any possible implementation manner of the first aspect embodiment.
In a fourth aspect, the present application further provides a non-transitory computer-readable storage medium (hereinafter, referred to as a computer-readable storage medium), on which a computer program is stored, where the computer program is executed by a computer to perform the method in the foregoing first aspect and/or any possible implementation manner of the first aspect.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the embodiments of the application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and drawings.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings without creative efforts. The foregoing and other objects, features and advantages of the application will be apparent from the accompanying drawings. Like reference numerals refer to like parts throughout the drawings. The drawings are not intended to be to scale as practical, emphasis instead being placed upon illustrating the subject matter of the present application.
Fig. 1 shows a flowchart of a short video positioning method according to an embodiment of the present application.
Fig. 2 shows a block diagram of a short video positioning apparatus according to an embodiment of the present application.
Fig. 3 shows a schematic structural diagram of an electronic device provided in an embodiment of the present application.
Icon: 100 - electronic device; 110 - processor; 120 - memory; 400 - short video positioning apparatus; 410 - obtaining module; 420 - extraction module; 430 - splitting module; 440 - determination module.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Meanwhile, relational terms such as "first," "second," and the like may be used solely in the description herein to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
Further, the term "and/or" in the present application is only one kind of association relationship describing the associated object, and means that three kinds of relationships may exist, for example, a and/or B may mean: a exists alone, A and B exist simultaneously, and B exists alone.
In addition, the defects (poor positioning accuracy, low efficiency, high labor cost) existing in the short video positioning in the prior art are all the results obtained after the applicant has practiced and studied carefully, and therefore, the discovery process of the above defects and the solution proposed by the embodiments of the present application to the above defects in the following text should be considered as contributions of the applicant to the present application.
In order to solve the foregoing problems, embodiments of the present application provide a short video positioning method, apparatus, electronic device, and computer-readable storage medium, so as to reduce labor cost and improve accuracy and efficiency of short video positioning.
The technology can be realized by adopting corresponding software, hardware and a combination of software and hardware. The following describes embodiments of the present application in detail.
The following description will be directed to the short video positioning method provided in the present application.
Referring to fig. 1, an embodiment of the present application provides a short video positioning method applied to an electronic device. The steps involved will be described below with reference to fig. 1.
Step S110: acquiring the alternative long video and the short video to be positioned.
It will be appreciated that the alternative long video comprises a plurality of different short videos.
Suppose that a candidate long video includes advertisement a, advertisement B, television program a, advertisement C, and advertisement D. And the videos corresponding to the advertisement A, the advertisement B, the television program A, the advertisement C and the advertisement D are short videos relative to the alternative long video.
In the embodiment of the present application, a short video to be positioned needs to be specified first. For example, on the premise of the above assumption, the short video to be located is advertisement B, and accordingly, the purpose to be achieved by the method is to determine the location of the short video corresponding to advertisement B in the alternative long video composed of advertisement a, advertisement B, television program a, advertisement C, and advertisement D.
Step S120: inputting the short video into a pre-stored deep neural network model to obtain a first feature vector corresponding to the short video.
The deep neural network model is a network model which can process time series. When the features of the video are extracted through the deep neural network model, the features included in each frame of image can be fused with the time sequence, and compared with the traditional mode that the features are extracted and combined only based on a single frame of image, more reliable feature vectors can be provided, so that the accuracy of subsequent video positioning is improved.
In some alternative embodiments, the deep neural network may employ a 3D convolutional neural network, for example using a ResNet18-3D convolutional neural network model.
Of course, in some embodiments, other deep neural network models may be employed, and are not enumerated here. The following steps in the embodiments of the present application will be described with reference to the ResNet18-3D convolutional neural network model as an example. Of course, it is worth noting that the ResNet18-3D convolutional neural network model has been trained in advance and is capable of yielding an output that meets preset requirements that are predetermined by the operator.
For example, there is a ResNet18-3D convolutional neural network model that includes 18 3D convolutional layers and 2 average pooling layers.
Suppose the short video to be positioned, $V_s$, satisfies $V_s \in R^{t_s \times h \times w \times 3}$, where $t_s$ is the number of frames included in the short video, h is the length (height) of each frame image of the short video, w is the width of each frame image of the short video, and 3 is the number of channels (i.e., the short video consists of RGB images).
In some embodiments, the short video $V_s$ may be input directly into the ResNet18-3D convolutional neural network model to obtain the feature vector corresponding to the short video, which, for convenience of distinguishing from what follows, will be referred to as the first feature vector $F_s \in R^{256}$. Here 256 is the dimension of the feature vector, i.e., the number of elements included in the first feature vector; this value corresponds to the structure of the output layer of the ResNet18-3D convolutional neural network model. In some embodiments, when the output layer of the ResNet18-3D convolutional neural network model has another structure, the dimension of the first feature vector changes accordingly.
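For illustration only, this feature-extraction step can be sketched with torchvision's off-the-shelf r3d_18 video model standing in for the ResNet18-3D model described above; note that its feature dimension is 512 rather than the 256 mentioned here, and the function names are assumptions, not part of the patent.

```python
# Hedged sketch of step S120, using torchvision's r3d_18 as a stand-in for the
# ResNet18-3D model described above (feature dimension 512 here, not 256).
import torch
from torchvision.models.video import r3d_18

def build_extractor() -> torch.nn.Module:
    model = r3d_18(weights=None)       # in practice, load pretrained/trained weights
    model.fc = torch.nn.Identity()     # drop the classifier head to expose features
    return model.eval()

@torch.no_grad()
def extract_feature(model: torch.nn.Module, clip: torch.Tensor) -> torch.Tensor:
    """clip: (T, H, W, 3) float tensor in [0, 1] -> (512,) feature vector."""
    x = clip.permute(3, 0, 1, 2).unsqueeze(0)  # -> (1, 3, T, H, W), as r3d_18 expects
    return model(x).squeeze(0)
```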
When the total number of frames included in the short video to be positioned is large, in order to save the time consumed by subsequent feature extraction and improve its efficiency, in some implementations key frames may be extracted in order from the short video $V_s$ to be positioned, and the image sequence formed by the extracted key frame images may be input into the ResNet18-3D convolutional neural network model so as to output the first feature vector.

Extracting in order means that the original sequence of the frames is not changed.

In one embodiment, key frame images may be extracted at random from the frames included in the short video $V_s$ to be positioned, provided that, in the image sequence formed by the extracted key frame images, the order of the key frames is consistent with their order in $V_s$.

In one embodiment, when key frame extraction is performed on $V_s$, the number of extracted key frame images is consistent with the number of input layers of the ResNet18-3D convolutional neural network model.
Assuming that the ResNet18-3D convolutional neural network model includes M input layers, the short video to be located includes N frames, and N is greater than M. At this time, N may be divided by M to obtain a quotient Z. And subsequently, sequentially extracting M frames of images from the short video to be positioned by taking Z as a step length from the first frame of the short video to be positioned, and then determining the M frames of images as key frame images of the short video to be positioned.
For example, when N is 33 and M is 16, 33 divided by 16 gives a quotient of 2 with a remainder of 1, so Z is 2. In this case, starting from the first frame of the short video to be positioned, the following 16 frames are extracted in turn and determined to be its key frame images: the 1st, 3rd (1+2), 5th (3+2), 7th (5+2), 9th (7+2), 11th (9+2), 13th (11+2), 15th (13+2), 17th (15+2), 19th (17+2), 21st (19+2), 23rd (21+2), 25th (23+2), 27th (25+2), 29th (27+2) and 31st (29+2) frames. The 16 key frame images are then formed into an image sequence in the order of extraction.
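A minimal sketch of this sampling rule (the function name is assumed for illustration):

```python
# Key-frame selection: step length Z = N // M, then M frames starting from frame 1.
def sample_key_frames(n_frames: int, m_inputs: int) -> list:
    """Return 1-based key-frame indices: 1, 1+Z, 1+2Z, ... (M indices in total)."""
    z = max(n_frames // m_inputs, 1)   # quotient used as the sampling step
    return [1 + i * z for i in range(m_inputs)]

# Reproduces the worked example above.
assert sample_key_frames(33, 16) == [1, 3, 5, 7, 9, 11, 13, 15,
                                     17, 19, 21, 23, 25, 27, 29, 31]
```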
In addition, in some embodiments, in order to improve the efficiency of feature extraction, the input content (the short video to be positioned, or its corresponding image sequence) may be resized before being input into the ResNet18-3D convolutional neural network model, so that every frame of the input has a consistent size.
Optionally, the size may match that of the samples used to train the ResNet18-3D convolutional neural network model. For example, if each frame image in the samples has a size of 224 × 224, the frames of the content input for feature extraction are likewise resized to 224 × 224.
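A sketch of this resizing step, assuming the 224 × 224 training-sample size mentioned above and bilinear interpolation (the interpolation mode is not specified in the text):

```python
# Resize every frame of a clip so the network input has a uniform spatial size.
import torch.nn.functional as F

def resize_clip(clip, size=(224, 224)):
    """clip: (T, H, W, 3) -> (T, size[0], size[1], 3)."""
    x = clip.permute(0, 3, 1, 2)  # (T, 3, H, W) layout for interpolation
    x = F.interpolate(x, size=size, mode="bilinear", align_corners=False)
    return x.permute(0, 2, 3, 1)  # back to frames-last layout
```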
Step S130: splitting the alternative long video into a plurality of different alternative short videos, each of which has the same number of frames as the short video.
Suppose the alternative long video $V_l$ satisfies $V_l \in R^{t_l \times h \times w \times 3}$, with $t_l > t_s$. Then, for each value of f (with f taken to be 1, 2, 3, ..., $t_l - t_s + 1$ in turn), starting from the f-th frame of the alternative long video, image frames equal in number to the $t_s$ frames included in the short video are extracted in sequence with a step length of 1 to form an alternative short video, thereby obtaining a plurality of different alternative short videos.
It is worth noting that each alternative short video has a corresponding time stamp in the alternative long video. The time stamp is used to characterize the position of the corresponding alternative short video in the alternative long video.

In some alternative embodiments, the video start point (first frame) of the alternative short video in the alternative long video may be determined as the time stamp. For example, if the video start point of a certain alternative short video is the first frame of the alternative long video, the time stamp of that alternative short video is 1, representing that it starts from the first frame of the alternative long video.

In some alternative embodiments, the video end point (last frame) of the alternative short video in the alternative long video may be determined as the time stamp. For example, if the video end point of a certain alternative short video is the H-th frame of the alternative long video, the time stamp of that alternative short video is H, representing that it ends at the H-th frame of the alternative long video.

In some optional embodiments, the time period delimited by the video start point and the video end point of the alternative short video in the alternative long video may be determined as the time stamp. For example, if the video start point of a certain alternative short video is the 2nd frame of the alternative long video and its video end point is the 19th frame of the alternative long video, then its time stamp is 2-19, and its position in the alternative long video is the 2nd frame through the 19th frame.
Of course, the time stamp may also be characterized in other ways, and is not specifically limited herein.
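Step S130 can be sketched as follows, under two assumptions flagged here: 1-based frame indexing, and the video-start-point convention for the time stamp:

```python
# Split the long video into overlapping windows of t_s frames with step length 1.
def split_windows(long_video, t_s):
    """long_video: (t_l, H, W, 3) tensor -> list of (time_stamp, clip) pairs,
    where time stamp f = 1 .. t_l - t_s + 1 marks the window's first frame."""
    t_l = long_video.shape[0]
    return [(f + 1, long_video[f:f + t_s]) for f in range(t_l - t_s + 1)]
```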
Step S140: respectively inputting the different alternative short videos into the deep neural network model to obtain a plurality of second feature vectors, the second feature vectors corresponding one-to-one to the different alternative short videos.
That is, for each candidate short video corresponding to the f value, there is one corresponding second feature vector.
Optionally, before each candidate short video is input to the ResNet18-3D convolutional neural network model, each frame image included in each candidate short video may also be resized in a manner consistent with the above.
Step S150: determining the position of the short video in the alternative long video according to the first feature vector and the plurality of second feature vectors.
In the embodiment of the present application, the position of the short video in the candidate long video may be determined by calculating a correlation coefficient between the first feature vector and each second feature vector.
The correlation coefficient is calculated as follows.
The correlation coefficient between the first feature vector and each second feature vector may be calculated based on the formula

$$r_{X,Y} = \frac{\sum_{i=1}^{h}\left(x_i - \bar{X}\right)\left(y_i - \bar{Y}\right)}{\sqrt{\sum_{i=1}^{h}\left(x_i - \bar{X}\right)^2}\,\sqrt{\sum_{i=1}^{h}\left(y_i - \bar{Y}\right)^2}}$$

where $r_{X,Y}$ is the correlation coefficient (which may be understood as a similarity) between the first feature vector X and the second feature vector Y, $\bar{X}$ is the average of the elements included in the first feature vector, $\bar{Y}$ is the average of the elements included in the second feature vector, $x_i$ is the i-th element of the first feature vector X, $y_i$ is the i-th element of the second feature vector Y, and h is the number of elements included in the first feature vector X and the second feature vector Y.
By substituting each second feature vector in turn, the $r_{X,Y}$ corresponding to each second feature vector can be obtained. After the plurality of correlation coefficients is obtained, the largest correlation coefficient may be determined from among them.
By comparing the magnitude relation between the maximum correlation coefficient and a preset threshold value, whether the short video to be positioned is in the alternative long video can be determined. For example, when the maximum correlation coefficient is larger than a threshold value, determining that the short video to be positioned is in the alternative long video, otherwise, determining that the short video to be positioned is not in the alternative long video.
After determining that the short video to be positioned is in the candidate long video (that is, the maximum correlation coefficient is greater than the threshold), a time stamp of the candidate short video in the candidate long video corresponding to the second feature vector corresponding to the maximum correlation coefficient may be determined, and a position of the time stamp in the candidate long video may be determined as a position of the short video to be positioned in the candidate long video.
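The matching step can be sketched as follows; the threshold value is a tuning parameter that the text leaves open, so the 0.9 below is purely an assumed placeholder, and the function names are illustrative:

```python
# Pearson correlation between feature vectors, then threshold the best match.
import numpy as np

def correlation(x: np.ndarray, y: np.ndarray) -> float:
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() /
                 (np.sqrt((xc ** 2).sum()) * np.sqrt((yc ** 2).sum())))

def locate(first_vec, second_vecs, time_stamps, threshold=0.9):
    scores = [correlation(first_vec, v) for v in second_vecs]
    best = int(np.argmax(scores))
    if scores[best] > threshold:
        return time_stamps[best]   # short video found at this time stamp
    return None                    # short video is not in the alternative long video
```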
In the short video positioning method provided by the embodiment of the present application, a deep neural network model is introduced to extract features from the short video to be positioned, yielding a corresponding first feature vector; the alternative long video is split, and the resulting alternative short videos, each with the same number of frames as the short video to be positioned, are input into the deep neural network model to output a second feature vector for each alternative short video; the position of the short video to be positioned in the alternative long video is then determined by comparing the first feature vector with each second feature vector. In the process, manual intervention is not needed, and labor cost can be reduced. In addition, since machines are more efficient than humans, the accuracy and efficiency of locating short videos may also be improved.
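Tying the sketches above together, an end-to-end pass might look like this (all helper names come from the sketches in this description, not from the patent; `short_clip` and `long_video` are assumed to be pre-loaded frame tensors):

```python
# End-to-end sketch: extract, split, match.
model = build_extractor()
t_s = short_clip.shape[0]

first_vec = extract_feature(model, resize_clip(short_clip)).numpy()
stamps, clips = zip(*split_windows(long_video, t_s))
second_vecs = [extract_feature(model, resize_clip(c)).numpy() for c in clips]

position = locate(first_vec, second_vecs, list(stamps))  # None if not present
```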
As shown in fig. 2, an embodiment of the present application further provides a short video positioning apparatus 400, where the short video positioning apparatus 400 may include: an acquisition module 410, an extraction module 420, a splitting module 430, and a determination module 440.
An obtaining module 410, configured to obtain a candidate long video and a short video to be positioned;
an extracting module 420, configured to input the short video into a pre-stored deep neural network model to obtain a first feature vector corresponding to the short video;
a splitting module 430, configured to split the alternative long video into a plurality of different alternative short videos, each of which has the same number of frames as the short video;
the extracting module 420 is further configured to input the multiple different candidate short videos into the deep neural network model, so as to obtain multiple second feature vectors, where the multiple second feature vectors correspond to the multiple different candidate short videos one to one;
a determining module 440, configured to determine a position of the short video in the candidate long video according to the first feature vector and the plurality of second feature vectors.
In a possible implementation manner, each alternative short video has a corresponding time stamp in the alternative long video, and the determining module 440 is configured to calculate a correlation coefficient between the first feature vector and each second feature vector to obtain a plurality of correlation coefficients, the correlation coefficients corresponding one-to-one to the second feature vectors; determine a maximum correlation coefficient from the plurality of correlation coefficients; when the maximum correlation coefficient is larger than a threshold value, determine the time stamp, in the alternative long video, of the alternative short video corresponding to the second feature vector corresponding to the maximum correlation coefficient; and determine the position of the time stamp in the alternative long video as the position of the short video in the alternative long video.
In a possible implementation, the determining module 440 is configured to calculate the correlation coefficient between the first feature vector and each second feature vector based on the formula

$$r_{X,Y} = \frac{\sum_{i=1}^{h}\left(x_i - \bar{X}\right)\left(y_i - \bar{Y}\right)}{\sqrt{\sum_{i=1}^{h}\left(x_i - \bar{X}\right)^2}\,\sqrt{\sum_{i=1}^{h}\left(y_i - \bar{Y}\right)^2}}$$

where $r_{X,Y}$ is the correlation coefficient between the first feature vector X and the second feature vector Y, $\bar{X}$ is the average of the elements included in the first feature vector, $\bar{Y}$ is the average of the elements included in the second feature vector, $x_i$ is the i-th element of the first feature vector X, $y_i$ is the i-th element of the second feature vector Y, and h is the number of elements included in the first feature vector X and the second feature vector Y.
In a possible implementation, the extracting module 420 is configured to sequentially extract multiple frames of key frame images from the short video; and inputting the multi-frame key frame image into a pre-stored deep neural network model.
In a possible implementation manner, the number of input layers of the deep neural network model is M, and when the number of frames N included in the short video is greater than M, the extraction module 420 is configured to divide N by M to obtain a quotient; and, starting from the first frame of the short video, sequentially extract M frames of images from the short video with the quotient as the step length, and determine the M frames of images as the multiple key frame images.
In a possible implementation, the apparatus further includes an adjusting module for adjusting the size of each frame of the key frame image to be uniform.
In one possible implementation, the alternative long video includes $t_l$ frames and the short video includes $t_s$ frames, and the splitting module 430 is configured to, starting from the f-th frame of the alternative long video, sequentially extract from the alternative long video, with a step length of 1, image frames equal in number to the frames of the short video to form an alternative short video, thereby obtaining the plurality of different alternative short videos, where f takes the values 1, 2, 3, ..., $t_l - t_s + 1$ in turn.
The short video positioning apparatus 400 provided in the embodiment of the present application has the same implementation principle and the same technical effect as the foregoing method embodiments, and for the sake of brief description, reference may be made to the corresponding contents in the foregoing method embodiments for the parts of the apparatus embodiments that are not mentioned.
In addition, an embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a computer, the steps included in the short video positioning method are executed.
In addition, referring to fig. 3, an electronic device 100 is further provided in an embodiment of the present invention.
Alternatively, the electronic Device 100 may be, but is not limited to, a Personal Computer (PC), a smart phone, a tablet PC, a Mobile Internet Device (MID), a Personal digital assistant, a server, and the like.
Among them, the electronic device 100 may include: a processor 110, a memory 120.
It should be noted that the components and structure of electronic device 100 shown in FIG. 3 are exemplary only, and not limiting, and electronic device 100 may have other components and structures as desired. For example, in some cases, the electronic device 100 may further include a display screen to display various videos in the embodiment of the present application and display the result of the final positioning to the user for collation.
The processor 110, memory 120, and other components that may be present in the electronic device 100 are electrically connected to each other, directly or indirectly, to enable the transfer or interaction of data. For example, the processor 110, the memory 120, and other components that may be present may be electrically coupled to each other via one or more communication buses or signal lines.
The memory 120 is used for storing programs, for example, programs corresponding to short video positioning methods appearing later or short video positioning devices appearing later. Optionally, when the short video positioning device is stored in the memory 120, the short video positioning device includes at least one software function module that can be stored in the memory 120 in the form of software or firmware (firmware).
Alternatively, the software function module included in the short video positioning apparatus may also be solidified in an Operating System (OS) of the electronic device 100.
The processor 110 is used to execute executable modules stored in the memory 120, such as software functional modules or computer programs included in the short video positioning device. When the processor 110 receives the execution instruction, it may execute the computer program, for example, to perform: acquiring an alternative long video and a short video to be positioned; inputting the short video into a pre-stored deep neural network model to obtain a first feature vector corresponding to the short video; splitting the alternative long video into a plurality of different alternative short videos, wherein each alternative short video and the short video are equal in frame; respectively inputting the different alternative short videos into the deep neural network model to obtain a plurality of second feature vectors, wherein the second feature vectors correspond to the different alternative short videos one by one; and determining the position of the short video in the candidate long video according to the first feature vector and the plurality of second feature vectors.
Of course, the method disclosed in any of the embodiments of the present application can be applied to the processor 110, or implemented by the processor 110.
In summary, in the short video positioning method and apparatus, electronic device, and computer-readable storage medium provided by the embodiments of the present invention, a deep neural network model is introduced to extract features from the short video to be positioned, yielding a corresponding first feature vector; the alternative long video is split, and the resulting alternative short videos, each with the same number of frames as the short video to be positioned, are input into the deep neural network model to output a second feature vector for each alternative short video; the position of the short video to be positioned in the alternative long video is then determined by comparing the first feature vector with each second feature vector. In the process, manual intervention is not needed, and labor cost can be reduced. In addition, since machines are more efficient than humans, the accuracy and efficiency of locating short videos may also be improved.
It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a notebook computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application.

Claims (10)

1. A method for short video positioning, the method comprising:
acquiring an alternative long video and a short video to be positioned;
inputting the short video into a pre-stored deep neural network model to obtain a first feature vector corresponding to the short video;
splitting the alternative long video into a plurality of different alternative short videos, wherein each alternative short video has the same number of frames as the short video;
respectively inputting the different alternative short videos into the deep neural network model to obtain a plurality of second feature vectors, wherein the second feature vectors correspond one-to-one to the different alternative short videos;
and determining the position of the short video in the candidate long video according to the first feature vector and the plurality of second feature vectors.
2. The method of claim 1, wherein each of the alternative short videos has a corresponding time stamp in the alternative long video; the determining the position of the short video in the candidate long video according to the first feature vector and the plurality of second feature vectors includes:
calculating a correlation coefficient between the first feature vector and each second feature vector to obtain a plurality of correlation coefficients, wherein the correlation coefficients are in one-to-one correspondence with the second feature vectors;
determining a maximum correlation coefficient from the plurality of correlation coefficients;
when the maximum correlation coefficient is larger than a threshold value, determining the time stamp, in the alternative long video, of the alternative short video corresponding to the second feature vector corresponding to the maximum correlation coefficient;
determining the position of the time stamp in the alternative long video as the position of the short video in the alternative long video.
3. The method of claim 2, wherein said calculating a correlation coefficient between the first feature vector and each of the second feature vectors comprises:
calculating the correlation coefficient between the first feature vector and each second feature vector based on the formula

$$r_{X,Y} = \frac{\sum_{i=1}^{h}\left(x_i - \bar{X}\right)\left(y_i - \bar{Y}\right)}{\sqrt{\sum_{i=1}^{h}\left(x_i - \bar{X}\right)^2}\,\sqrt{\sum_{i=1}^{h}\left(y_i - \bar{Y}\right)^2}}$$

wherein $r_{X,Y}$ is the correlation coefficient between the first feature vector X and the second feature vector Y, $\bar{X}$ is the average of the elements included in the first feature vector, $\bar{Y}$ is the average of the elements included in the second feature vector, $x_i$ is the i-th element of the first feature vector X, $y_i$ is the i-th element of the second feature vector Y, and h is the number of elements included in the first feature vector X and the second feature vector Y.
4. The method of claim 1, wherein inputting the short video into a pre-saved deep neural network model comprises:
sequentially extracting multiple frames of key frame images from the short video;
and inputting the multi-frame key frame image into a pre-stored deep neural network model.
5. The method of claim 4, wherein the number of input layers of the deep neural network model is M, and when the number of frames N included in the short video is greater than M, the sequentially extracting multiple frames of key frame images from the short video comprises:
dividing N by M to obtain a quotient;
and, starting from the first frame of the short video, sequentially extracting M frames of images from the short video with the quotient as the step length, and determining the M frames of images as the multiple frames of key frame images.
6. The method of claim 4 or 5, wherein prior to said inputting said plurality of frames of key frame images into a pre-saved deep neural network model, said method further comprises:
the key frame images of each frame are resized to be uniform.
7. The method of claim 1, wherein the alternative long video includes $t_l$ frames and the short video includes $t_s$ frames, and wherein splitting the alternative long video into a plurality of different alternative short videos comprises:
starting from the f-th frame of the alternative long video, sequentially extracting from the alternative long video, with a step length of 1, image frames equal in number to the frames of the short video to form an alternative short video, so as to obtain the plurality of different alternative short videos, wherein f takes the values 1, 2, 3, ..., $t_l - t_s + 1$ in turn.
8. A short video positioning apparatus, the apparatus comprising:
the acquisition module is used for acquiring the alternative long video and the short video to be positioned;
the extraction module is used for inputting the short video into a pre-stored deep neural network model to obtain a first feature vector corresponding to the short video;
the splitting module is used for splitting the alternative long video into a plurality of different alternative short videos, each of which has the same number of frames as the short video;
the extraction module is further configured to input the multiple different candidate short videos into the deep neural network model respectively to obtain multiple second feature vectors, where the multiple second feature vectors correspond to the multiple different candidate short videos one to one;
and the determining module is used for determining the position of the short video in the candidate long video according to the first feature vector and the plurality of second feature vectors.
9. An electronic device, comprising: a memory and a processor, the memory and the processor connected;
the memory is used for storing programs;
the processor calls a program stored in the memory to perform the method of any of claims 1-7.
10. A computer-readable storage medium, on which a computer program is stored which, when executed by a computer, performs the method of any one of claims 1-7.
CN202011012280.XA 2020-09-23 2020-09-23 Short video positioning method and device, electronic equipment and computer readable storage medium Pending CN112149568A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011012280.XA CN112149568A (en) 2020-09-23 2020-09-23 Short video positioning method and device, electronic equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011012280.XA CN112149568A (en) 2020-09-23 2020-09-23 Short video positioning method and device, electronic equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN112149568A 2020-12-29

Family

ID=73896413

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011012280.XA Pending CN112149568A (en) 2020-09-23 2020-09-23 Short video positioning method and device, electronic equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN112149568A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113139084A (en) * 2021-05-14 2021-07-20 北京爱奇艺科技有限公司 Video duplicate removal method and device
CN114286198A (en) * 2021-12-30 2022-04-05 北京爱奇艺科技有限公司 Video association method and device, electronic equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106507188A (en) * 2016-11-25 2017-03-15 南京中密信息科技有限公司 A kind of video TV station symbol recognition device and method of work based on convolutional neural networks
CN108170791A (en) * 2017-12-27 2018-06-15 四川理工学院 Video image search method
CN109947986A (en) * 2019-03-18 2019-06-28 东华大学 Infrared video timing localization method based on structuring sectional convolution neural network
CN111182364A (en) * 2019-12-27 2020-05-19 杭州趣维科技有限公司 Short video copyright detection method and system
CN111241996A (en) * 2020-01-09 2020-06-05 桂林电子科技大学 Method for identifying human motion in video
CN111368133A (en) * 2020-04-16 2020-07-03 腾讯科技(深圳)有限公司 Method and device for establishing index table of video library, server and storage medium
CN111767814A (en) * 2020-06-19 2020-10-13 北京奇艺世纪科技有限公司 Video determination method and device

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106507188A (en) * 2016-11-25 2017-03-15 南京中密信息科技有限公司 A kind of video TV station symbol recognition device and method of work based on convolutional neural networks
CN108170791A (en) * 2017-12-27 2018-06-15 四川理工学院 Video image search method
CN109947986A (en) * 2019-03-18 2019-06-28 东华大学 Infrared video timing localization method based on structuring sectional convolution neural network
CN111182364A (en) * 2019-12-27 2020-05-19 杭州趣维科技有限公司 Short video copyright detection method and system
CN111241996A (en) * 2020-01-09 2020-06-05 桂林电子科技大学 Method for identifying human motion in video
CN111368133A (en) * 2020-04-16 2020-07-03 腾讯科技(深圳)有限公司 Method and device for establishing index table of video library, server and storage medium
CN111767814A (en) * 2020-06-19 2020-10-13 北京奇艺世纪科技有限公司 Video determination method and device

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113139084A (en) * 2021-05-14 2021-07-20 北京爱奇艺科技有限公司 Video duplicate removal method and device
CN114286198A (en) * 2021-12-30 2022-04-05 北京爱奇艺科技有限公司 Video association method and device, electronic equipment and storage medium
CN114286198B (en) * 2021-12-30 2023-11-10 北京爱奇艺科技有限公司 Video association method, device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN109803180B (en) Video preview generation method and device, computer equipment and storage medium
US10303968B2 (en) Method and apparatus for image recognition
US11270099B2 (en) Method and apparatus for generating facial feature
CN110555795A (en) High resolution style migration
US20130258198A1 (en) Video search system and method
CN108491866B (en) Pornographic picture identification method, electronic device and readable storage medium
CN110008997B (en) Image texture similarity recognition method, device and computer readable storage medium
CN110930296B (en) Image processing method, device, equipment and storage medium
CN112149568A (en) Short video positioning method and device, electronic equipment and computer readable storage medium
CN110147469B (en) Data processing method, device and storage medium
CN109426831B (en) Image similarity matching and model training method and device and computer equipment
CN108900788B (en) Video generation method, video generation device, electronic device, and storage medium
CN113297420A (en) Video image processing method and device, storage medium and electronic equipment
CN112270384B (en) Loop detection method and device, electronic equipment and storage medium
EP3026671A1 (en) Method and apparatus for detecting emotional key frame
CN107807979A (en) The searching method and device of a kind of similar pictures
CN106971386B (en) Method and device for judging image integrity and page loading degree and client equipment
CN103716685A (en) Icon recognition system, server and method
EP4047547A1 (en) Method and system for removing scene text from images
CN114697751B (en) Method and system for preventing re-cutting of video reverse playing
CN106375773B (en) Altering detecting method is pasted in frame duplication based on dynamic threshold
CN106778449B (en) Object identification method of dynamic image and interactive film establishment method for automatically capturing target image
CN110619362B (en) Video content comparison method and device based on perception and aberration
Huang et al. A harmonic means pooling strategy for structural similarity index measurement in image quality assessment
CN111127310B (en) Image processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20201229