WO2021012493A1

WO2021012493A1 - Short video keyword extraction method and apparatus, and storage medium

Info

Publication number: WO2021012493A1
Application number: PCT/CN2019/116933
Authority: WO
Inventors: 许剑勇
Original assignee: 平安科技（深圳）有限公司
Priority date: 2019-07-23
Filing date: 2019-11-10
Publication date: 2021-01-28
Also published as: CN110532431B; CN110532431A

Abstract

A short video keyword extraction method and apparatus, and a computer-readable storage medium. The method comprises: acquiring a short video set, obtaining different frames of images of the short video set by means of a regular screenshot, performing a pre-processing operation on the different frames of images to obtain a target image set and a tag set, and respectively performing target detection and attitude tracking on the target image set using a difference method and an optical flow method to obtain a difference image set and an optical flow atlas; training a pre-built short video keyword extraction model using the difference image set, the optical flow atlas and the tag set to obtain a trained short video keyword extraction model; and receiving a short video, using the trained short video keyword extraction model to obtain associated words of the short video, and performing keyword extraction on the associated words to obtain keywords of the short video. By means of the method, precise extraction of keywords of a short video is realized.

Description

Short video keyword extraction method, device and storage medium

This application claims the priority of a Chinese patent application filed with the Chinese Patent Office on July 23, 2019, the application number is 201910664967.2. The invention title is "short video keyword extraction method, device and storage medium", the entire content of which is incorporated by reference In this application.

Technical field

This application relates to the field of artificial intelligence technology, and in particular to a method, device and computer-readable storage medium for extracting related words from short videos.

Background technique

With the rapid development of digital media technology, electronic technology, communication technology, and the Internet, data resources have sprung up rapidly. Among these massive data resources, short video data is a type of multimedia data with rich semantics, complex structure, rapid development, and huge data volume. It is also a type of short video data. In Internet-based video retrieval systems, people are accustomed to implementing video retrieval using text as related words through a human-computer interface, and searching for required video data from various sites distributed on the Internet. According to the existing video retrieval system, it is difficult for people to effectively search for the video data they need from the vast array of video data. The reason is that there is no related word extraction technology based on short videos in the current market.

Summary of the invention

This application provides a short video keyword extraction method, device, and computer-readable storage medium, the main purpose of which is to present the user with accurate extraction results when the user extracts keywords in the short video.

In order to achieve the above objective, a short video keyword extraction method provided by this application includes: obtaining a short video set, obtaining different frame images of the short video set through timed screenshots, and performing preprocessing operations on the different frame images, Obtain the target image set and tag set and store them in the database; use the difference method to perform target detection on the target image set to obtain a differential image set, and perform posture tracking on the target image set according to the optical flow method to obtain an optical flow atlas Input the differential image set and the optical flow atlas as a training set into a pre-built short video keyword extraction model, use the training set to train the short video keyword extraction model, and pass the The activation function of the short video keyword extraction model outputs the picture content set in the differential image set and the time series information set in the optical flow atlas to obtain the associated word set of the differential image set and the optical flow atlas, and combine the related word set with all The tag set is input into the loss function of the short video keyword extraction model, and the loss function value is calculated. When the loss function value is less than the threshold, the short video keyword extraction model exits training; receiving the input short video Using the short video keyword extraction model to obtain related words of the short video, and perform keyword extraction on the related words to obtain the keywords of the short video.

In addition, in order to achieve the above objective, the present application also provides a short video keyword extraction device, which includes a memory and a processor, and the memory stores a short video keyword extraction program that can run on the processor. When the short video keyword extraction program is executed by the processor, the following steps are implemented: obtain a short video set, obtain different frame images of the short video set through timing screenshots, and perform preprocessing operations on the different frame images to obtain The target image set and the tag set are stored in a database; the target image set is detected by a difference method to obtain a difference image set, and the target image set is tracked according to the optical flow method to obtain an optical flow atlas; Input the differential image set and the optical flow atlas as a training set into a pre-built short video keyword extraction model, use the training set to train the short video keyword extraction model, and pass the short video The activation function of the video keyword extraction model outputs the picture content set in the differential image set and the time series information set in the optical flow atlas to obtain the associated word set of the differential image set and the optical flow atlas, and combine the associated word set with the The tag set is input into the loss function of the short video keyword extraction model, and the loss function value is calculated. When the loss function value is less than the threshold, the short video keyword extraction model exits training; receiving the input short video, The short video keyword extraction model is used to obtain related words of the short video, and keyword extraction is performed on the related words to obtain the keywords of the short video.

In addition, in order to achieve the above-mentioned object, this application also provides a computer-readable storage medium on which a short video keyword extraction program is stored. The short video keyword extraction program can be used by one or more The processor executes to implement the steps of the short video keyword extraction method as described above.

The short video keyword extraction method, device, and computer-readable storage medium proposed in this application obtain a short video set, perform preprocessing operations on the short video set, obtain a training set and a tag set, and compare the pre-built short video keywords The extraction model is trained to obtain a complete model, and the short video input by the user is received according to the trained model for keyword extraction, and the accurate short video keyword extraction result is presented to the user.

Description of the drawings

FIG. 1 is a schematic flowchart of a short video keyword extraction method provided by an embodiment of this application;

2 is a schematic diagram of the internal structure of a short video keyword extraction device provided by an embodiment of the application;

3 is a schematic diagram of modules of a short video keyword extraction program in a short video keyword extraction device provided by an embodiment of the application.

The realization, functional characteristics, and advantages of the purpose of this application will be further described in conjunction with the embodiments and with reference to the accompanying drawings.

Detailed ways

It should be understood that the specific embodiments described here are only used to explain the application, and are not used to limit the application.

This application provides a short video keyword extraction method. Referring to FIG. 1, it is a schematic flowchart of a short video keyword extraction method provided by an embodiment of this application. The method can be executed by a device, and the device can be implemented by software and/or hardware.

In this embodiment, the short video keyword extraction method includes:

S1. Obtain a short video set, obtain different frame images of the short video set through timing screenshots, perform a preprocessing operation on the different frame images to obtain a target image set and a tag set, and store them in a database.

In a preferred embodiment of the present application, the short video collection is obtained by searching a network video library. The timing screenshot is to perform a screenshot operation on the short video at a timing according to the set interval of screenshots to obtain different frame images of the short video.

In a preferred embodiment of the present application, the preprocessing operation includes: performing grayscale, thresholding, median filtering, and scale normalization operations on the image. The specific implementation steps of the preprocessing operation are as follows:

a. Image grayscale processing:

The image grayscale processing is to convert a color image into a grayscale image. The brightness information of the grayscale image can fully express the overall and local characteristics of the image, and the grayscale processing of the image can greatly reduce the amount of calculation for subsequent work.

In a preferred embodiment of the present application, the method of image gray-scale processing is to convert the R, G, and B components of the image pixels into the Y component of the YUV color space, that is, the brightness value. The calculation method of the Y component As shown in the following formula:

Y=0.3R+0.59G+0.11B

Among them, R, G, and B are the R, G, and B values of the image pixel in the RGB color mode.

b. Image thresholding:

The image thresholding process is an efficient algorithm for binarizing the grayscale image through the OTSU algorithm to obtain a binarized image. The preferred embodiment of the present application presets the gray level t to be the segmentation threshold of the foreground and background of the gray image, and assumes that the proportion of the number of front spots in the image is w ₀ , the average gray level is u ₀ ; the proportion of background points in the image is w ₁ , The average gray level is u ₁ , then the total average gray level of the gray image is:

u=w ₀ *u ₀ +w ₁ *u ₁ ,

The variance of the foreground and background image of the grayscale image is:

g=w ₀ *(u ₀ -u)*(u ₀ -u)+w ₁ *(u ₁ -u)*(u ₁ -u)=w ₀ *w ₁ *(u ₀ -u ₁ )* (u ₀ -u ₁ ),

Wherein, when the variance g is the largest, the difference between the foreground and the background is the largest at this time, the gray scale t at this time is the optimal threshold, and the gray scale value greater than the gray scale t in the gray scale image is set to 255, The gray value smaller than the gray t is set to 0, and the binarized image of the gray image is obtained.

c. Median filter processing:

The median filter is a non-linear signal processing technique that can effectively suppress noise based on ranking statistical theory. The preferred embodiment of the present application replaces the value of a point in the digital image or digital sequence with the median value of each point in a neighborhood of the point, which is used to approach the surrounding pixel values, thereby Eliminate isolated noise points.

d. Image scale normalization processing:

The preferred embodiment of the present application performs scale normalization processing on the denoising binarized image points to eliminate the influence of the resolution of the short video on the image. Among them, when performing scale normalization, the preferred embodiment of the present application needs to preserve the relative positional relationship of the pose sequence in the time and space dimensions. Therefore, it is necessary to ensure that the translation and zoom scales of the pose in the same video are consistent, and the coordinate components The zoom ratio is also consistent.

It is preset that the original coordinates of any point in the denoised binarized image are (x ₀ , y ₀ ), and the normalized coordinates are (x, y), namely:

Among them, d=max{w,h}, w and h are the width and height of the video respectively, after normalization, x,y∈(-1,1).

S2. Perform target detection on the target image set using a difference method to obtain a difference image set, and perform posture tracking on the target image set according to the optical flow method to obtain an optical flow atlas.

The preferred embodiment of the present application performs target detection on the target image set by the difference method between adjacent frames to obtain a difference image set. The adjacent inter-frame difference method uses the difference between two adjacent frames of images in a video sequence. When the background changes little and no moving target appears, the resulting pixel difference will be small. If the pixel difference is relatively large, then It is believed to be caused by entering the sports target. The specific description formula is as follows:

Among them, I _k (x, y) and I _k-1 (x, y) are the current frame image and the previous frame of the video respectively, D _k (x, y) is the binary image after the difference, and T is The set threshold for differential segmentation. When the pixel value in the obtained difference image is less than or equal to the preset difference segmentation threshold, the difference image is considered to be the background and its value is set to 0; when the pixel value in the obtained difference image is greater than the preset difference segmentation threshold, set The difference image is determined to be a foreground pixel, and its value is set to 1, so as to obtain the foreground moving target, obtain the difference image set, and realize target detection.

Further, a preferred embodiment of the present application performs posture tracking on the target image set according to the optical flow method to obtain an optical flow atlas. The optical flow method evaluates the deformation between two adjacent frame images, and calculates the movement of each pixel position of the two adjacent frame images from time T to T+t. The specific calculation formula is as follows:

Calculate the partial derivatives of the space and time coordinates of the target image set according to the image constraint equation:

Among them, I(x,y) represents the two frames of images x and y, I represents the partial derivative of the coordinates, and t represents the time difference between the two frames of images.

Using the assumption of conservation of gray level, transform the image constraint equation to obtain:

The gray-level conservation hypothesis means that the gray-level mode of two adjacent images in the image sequence remains unchanged when the corresponding points are optimally matched.

Further, the preferred embodiment of the present application calculates the aperture problem of the image constraint equation through the Horn-Schunck optical flow algorithm:

Where E represents the aperture of the image constraint equation,

with

Denote the mean value in the u neighborhood and v neighborhood respectively. The Horn-Schunck optical flow algorithm refers to the reduction of the optical flow solution to the extreme value of the solution, and the solution is solved by an iterative method. The iterative equation is as follows:

Among them, λ is the smoothing control factor. The value of λ is affected by the noise in the image. When the noise is strong, it means that the confidence of the image data itself is low, and it needs to rely more on optical flow constraints, indicating that λ is a larger value at this time. In a preferred embodiment of the present application, by presetting λ to a smaller value, the posture tracking of the target image set is performed to obtain the optical flow atlas.

S3. Input the differential image set and the optical flow atlas as a training set into a pre-built short video keyword extraction model, and use the training set and the short video keyword extraction model for training. The activation function of the short video keyword extraction model outputs the picture content set in the differential image set and the time series information set in the optical flow atlas to obtain the associated word set of the differential image set and the optical flow atlas, and combine the related word set with all The tag set is input into the loss function of the short video keyword extraction model, and the loss function value is calculated, until the loss function value is less than a threshold, the short video keyword extraction model exits training.

In a preferred embodiment of the present application, the short video keyword extraction model includes a two-branch convolutional neural network model constructed by a dual-stream method, wherein one of the two-branch convolutional neural network model is a branch model It is a spatial convolutional neural network model, and another branch model is a temporal convolutional neural network model. The literal meaning of the Shuangliu method refers to the fact that two small streams flow separately and finally converge together. In the embodiment of the present application, the name of one stream is the information of the differential image, and the name of the other stream is the information of the optical flow diagram.

The convolutional neural network is a feed-forward neural network. Its artificial neurons can respond to a part of the surrounding units in the coverage area. Its basic structure includes two layers. One is the feature extraction layer. The input of each neuron is The local receptive fields of the previous layer are connected, and the local features are extracted. Once the local feature is extracted, the positional relationship between it and other features is also determined; the second is the feature mapping layer, each computing layer of the network is composed of multiple feature maps, and each feature map is a plane. The weights of all neurons on the plane are equal.

In a preferred embodiment of the present application, the convolutional neural network model includes an input layer, a convolutional layer, a pooling layer, and an output layer. In the preferred embodiment of the present application, the differential image is input into the input layer of the spatial convolutional neural network model, and the optical flow graph is input into the input layer of the temporal convolutional neural network model, and each In the convolutional layer, the differential image and the optical flow graph are respectively convolved by a preset set of filters to extract the feature vector, and the pooling layer is used to perform the pooling operation on the feature vector and input to the fully connected Layer, normalize and calculate the feature vector through the activation function, and input the calculation result to the output layer, the output layer outputs the picture content set in the difference image set and the time series information set in the optical flow atlas to obtain The associated word set of the differential image set and the optical flow atlas. The normalization process is to "compress" a K-dimensional vector containing any real number to another K-dimensional real vector, so that the range of each element is between (0,1), and the sum of all elements is 1. .

The activation function in the embodiment of this application is the softmax function, and the calculation formula is as follows:

Among them, O _j represents the image content and timing information output value of the _jth neuron in the output layer of the convolutional neural network, I _j represents the input value of the jth neuron in the output layer of the convolutional neural network, and t represents all The total number of neurons in the output layer, e is an infinite non-recurring decimal

The loss function in the preferred embodiment of this application is the least square method:

Where s is the error value of the output picture content and timing information and the differential image and optical flow diagram, k is the number of the image set, y _i is the differential image and optical flow diagram, and y′ _i is the output The picture content and timing information.

S4. Receive the input short video, use the short video keyword extraction model to obtain related words of the short video, and perform keyword extraction on the related words to obtain the keywords of the short video.

In a preferred embodiment of the present application, keyword extraction is performed on the related word set through a keyword extraction algorithm. The keyword extraction algorithm uses statistical information, word vector information, and dependency syntax information between words to calculate the correlation strength between words by constructing a dependency relationship graph, and iteratively calculates the importance score of words using the TextRank algorithm, and calculates the importance of words according to the sentence The result of the dependency syntax analysis is to construct an undirected graph for all non-stop words, and calculate the weight of the edge by using the gravity value between the words and the degree of dependency correlation.

In detail, the TextRank algorithm includes:

Calculate the dependency correlation degree of any two words W _i and W _j in the related word set:

Among them, len(W _i , W _j ) represents the length of the dependency path between words W _i and W _j , and b is a hyperparameter;

Calculate the gravitational forces of words W _i and W _j :

Wherein, tfidf (W) is a TF-IDF value of word W, TF represents term frequency, IDF represents inverse document frequency index, d is the Euclidean distance between the vectors of words W _i and W words of _J;

The degree of association between words W _i and W _j is:

weight(W _i ,W _j )=Dep(W _i ,W _j )*f _grav (W _i ,W _j )

Establish an undirected graph G=(V,E), where V is the set of vertices and E is the set of edges;

Calculate the importance score of the word W _i :

among them,

W _i is associated with a set of vertices, η is the damping coefficient;

Sort all words according to the importance score, select a preset number of keywords from the words according to the sort, and perform symbolic grammar splicing on the extracted keywords to obtain short video related words.

The invention also provides a short video keyword extraction device. Referring to FIG. 2, it is a schematic diagram of the internal structure of a short video keyword extraction device provided by an embodiment of this application.

In this embodiment, the short video keyword extraction device 1 may be a PC (Personal Computer, personal computer), or a terminal device such as a smart phone, a tablet computer, or a portable computer, or a server. The short video keyword extraction device 1 at least includes a memory 11, a processor 12, a communication bus 13, and a network interface 14.

Wherein, the memory 11 includes at least one type of readable storage medium, and the readable storage medium includes flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory, etc.), magnetic memory, magnetic disk, optical disk, etc. In some embodiments, the memory 11 may be an internal storage unit of the short video keyword extraction device 1, for example, the hard disk of the short video keyword extraction device 1. In other embodiments, the memory 11 may also be an external storage device of the short video keyword extraction device 1, for example, a plug-in hard disk equipped on the short video keyword extraction device 1, a Smart Media Card (SMC), Secure Digital (SD) card, Flash Card, etc. Further, the memory 11 may also include both an internal storage unit of the short video keyword extraction device 1 and an external storage device. The memory 11 can be used not only to store application software and various data installed in the short video keyword extraction device 1, such as the code of the short video keyword extraction program 01, etc., but also to temporarily store data that has been output or will be output. .

The processor 12 may be a central processing unit (CPU), controller, microcontroller, microprocessor or other data processing chip in some embodiments, and is used to run the program code or processing stored in the memory 11 Data, such as execution of short video keyword extraction program 01, etc.

The communication bus 13 is used to realize the connection and communication between these components.

The network interface 14 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface), and is usually used to establish a communication connection between the device 1 and other electronic devices.

Optionally, the device 1 may also include a user interface. The user interface may include a display (Display) and an input unit such as a keyboard (Keyboard). The optional user interface may also include a standard wired interface and a wireless interface. Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode, organic light emitting diode) touch device, etc. Among them, the display can also be appropriately called a display screen or a display unit, which is used to display the information processed in the short video keyword extraction device 1 and to display a visualized user interface.

Figure 2 only shows the short video keyword extraction device 1 with components 11-14 and short video keyword extraction program 01. Those skilled in the art can understand that the structure shown in Figure 1 does not constitute a key to short video The definition of the word extraction device 1 may include fewer or more components than shown, or a combination of certain components, or a different component arrangement.

In the embodiment of the device 1 shown in FIG. 2, the short video keyword extraction program 01 is stored in the memory 11; the processor 12 implements the following steps when executing the short video keyword extraction program 01 stored in the memory 11:

Step 1: Obtain a short video set, obtain different frame images of the short video set through timing screenshots, perform a preprocessing operation on the different frame images, obtain a target image set and a tag set, and store them in a database.

In a preferred embodiment of the present application, the short video is obtained by searching a network video library. The timing screenshot is to perform a screenshot operation on the short video at a timing according to the set interval of screenshots to obtain different frame images of the short video.

a. Image grayscale processing:

Y=0.3R+0.59G+0.11B

b. Image thresholding:

u=w ₀ *u ₀ +w ₁ *u ₁ ,

The variance of the foreground and background image of the grayscale image is:

c. Median filter processing:

d. Image scale normalization processing:

Step 2: Perform target detection on the target image set using the difference method to obtain a difference image set, and perform posture tracking on the target image set according to the optical flow method to obtain an optical flow atlas.

Where E represents the aperture of the image constraint equation,

with

Step 3: Input the differential image set and the optical flow atlas as a training set into a pre-built short video keyword extraction model, and use the training set to perform training on the short video keyword extraction model. The activation function of the short video keyword extraction model outputs the picture content set in the differential image set and the time series information set in the optical flow atlas to obtain the associated word set of the differential image set and the optical flow atlas, and combine the related word set with The tag set is input into the loss function of the short video keyword extraction model, and the loss function value is calculated. When the loss function value is less than the threshold, the short video keyword extraction model exits training.

Where +, O _j represents the image content and timing information output value of the _jth neuron in the output layer of the convolutional neural network, I _j represents the input value of the jth neuron in the output layer of the convolutional neural network, and t represents The total amount of neurons in the output layer, e is an infinite non-recurring decimal

Step 4: Receive the input short video, use the short video keyword extraction model to obtain related words of the short video, and perform keyword extraction on the related words to obtain the keywords of the short video.

In detail, the TextRank algorithm includes:

Calculate the gravitational forces of words W _i and W _j :

The degree of association between words W _i and W _j is:

weight(W _i ,W _j )=Dep(W _i ,W _j )*f _grav (W _i ,W _j )

Calculate the importance score of the word W _i :

among them,

W _i is associated with a set of vertices, η is the damping coefficient;

Optionally, in other embodiments, the short video keyword extraction program can also be divided into one or more modules, and the one or more modules are stored in the memory 11 and run by one or more processors (this embodiment For example, it is executed by the processor 12) to complete this application. The module referred to in this application refers to a series of computer program instruction segments that can complete specific functions, and is used to describe the short video keyword extraction program in the short video keyword extraction device The implementation process.

For example, referring to FIG. 3, a schematic diagram of the program modules of the short video keyword extraction program in an embodiment of the short video keyword extraction device of this application. In this embodiment, the short video keyword extraction program can be divided into The short video acquisition module 10, the image preprocessing module 20, the model training module 30, and the keyword extraction module 40 are exemplary:

The short video acquisition module 10 is configured to obtain a short video set by searching a network video library, and perform a regular screenshot operation on the short video set.

The image preprocessing module 20 is configured to perform target detection on the target image set using a differential method to obtain a differential image set, and perform posture tracking on the target image set according to an optical flow method to obtain an optical flow atlas.

The model training module 30 is configured to: input the differential image set and the optical flow atlas as a training set into a pre-built short video keyword extraction model, and use the training set to compare the short video keyword The extraction model is trained, and the image content set in the differential image set and the time series information set in the optical flow atlas are output through the activation function of the short video keyword extraction model to obtain the associated word set of the differential image set and the optical flow atlas, And input the associated word set and the tag set into the loss function of the short video keyword extraction model, and calculate the loss function value until the loss function value is less than the threshold value, the short video keyword extraction model Quit training.

The keyword extraction module 40 is configured to: receive an input short video, use the short video keyword extraction model to obtain related words of the short video, and perform keyword extraction on the related words to obtain the key of the short video word.

The functions or operation steps implemented by the program modules such as the short video acquisition module 10, the image preprocessing module 20, the model training module 30, and the keyword extraction module 40 when executed are substantially the same as those in the foregoing embodiment, and will not be repeated here.

In addition, an embodiment of the present application also proposes a computer-readable storage medium, the computer-readable storage medium stores a short video keyword extraction program, and the short video keyword extraction program can be executed by one or more processors To achieve the following operations:

Obtain a short video set, obtain different frame images of the short video set through timing screenshots, perform a preprocessing operation on the different frame images to obtain a target image set and a tag set, and store them in a database;

Performing target detection on the target image set using a difference method to obtain a difference image set, and performing posture tracking on the target image set according to an optical flow method to obtain an optical flow atlas;

Input the differential image set and the optical flow atlas as a training set into a pre-built short video keyword extraction model, use the training set to train the short video keyword extraction model, and pass the short video The activation function of the video keyword extraction model outputs the picture content set in the differential image set and the time series information set in the optical flow atlas to obtain the associated word set of the differential image set and the optical flow atlas, and combine the associated word set with the The tag set is input into the loss function of the short video keyword extraction model, and the loss function value is calculated, until the loss function value is less than the threshold, the short video keyword extraction model exits training;

Receive the input short video, use the short video keyword extraction model to obtain the related words of the short video, and perform keyword extraction on the related words to obtain the keywords of the short video.

The specific implementations of the computer-readable storage medium of the present application are basically the same as the foregoing embodiments of the short video keyword extraction device and method, and will not be repeated here.

It should be noted that the serial numbers of the above embodiments of the present application are only for description, and do not represent the advantages and disadvantages of the embodiments. And the terms "include", "include" or any other variants thereof in this article are intended to cover non-exclusive inclusion, so that a process, device, article or method including a series of elements not only includes those elements, but also includes The other elements listed may also include elements inherent to the process, device, article, or method. If there are no more restrictions, the element defined by the sentence "including a..." does not exclude the existence of other identical elements in the process, device, article or method that includes the element.

Through the description of the above embodiments, those skilled in the art can clearly understand that the method of the above embodiments can be implemented by means of software plus the necessary general hardware platform. Of course, it can also be implemented by hardware, but in many cases the former is better.的实施方式。 Based on this understanding, the technical solution of this application essentially or the part that contributes to the existing technology can be embodied in the form of a software product, and the computer software product is stored in a storage medium (such as ROM/RAM) as described above. , Magnetic disk, optical disk), including several instructions to make a terminal device (which can be a mobile phone, a computer, a server, or a network device, etc.) execute the method described in each embodiment of the present application.

The above are only preferred embodiments of this application, and do not limit the scope of this application. Any equivalent structure or equivalent process transformation made using the content of the description and drawings of this application, or directly or indirectly used in other related technical fields , The same reason is included in the scope of patent protection of this application.

Claims

A method for extracting short video keywords, characterized in that the method includes:

Obtain a short video set, obtain different frame images of the short video set through timing screenshots, perform a preprocessing operation on the different frame images to obtain a target image set and a tag set, and store them in a database;

Performing target detection on the target image set using a difference method to obtain a difference image set, and performing posture tracking on the target image set according to an optical flow method to obtain an optical flow atlas;

Input the differential image set and the optical flow atlas as a training set into a pre-built short video keyword extraction model, use the training set to train the short video keyword extraction model, and pass the short video The activation function of the video keyword extraction model outputs the picture content set in the differential image set and the time series information set in the optical flow atlas to obtain the associated word set of the differential image set and the optical flow atlas, and combine the associated word set with the The tag set is input into the loss function of the short video keyword extraction model, and the loss function value is calculated, until the loss function value is less than the threshold, the short video keyword extraction model exits training;

Receive the input short video, use the short video keyword extraction model to obtain the related words of the short video, and perform keyword extraction on the related words to obtain the keywords of the short video.
The method for extracting short video keywords according to claim 1, wherein the preprocessing operation on the different frame images to obtain the target image set comprises:

Converting the different frame images into gray-scale images by using image gray-scale, and thresholding the gray-scale images according to the OTSU algorithm to obtain a binary image;

Median filtering is used to eliminate isolated noise points in the binarized image, and scale normalization is used to eliminate the influence of the resolution in the short video on the binarized image, thereby obtaining a target image set.
The method for extracting short video keywords according to claim 1, wherein the training set is used to train the short video keyword extraction model, and the short video keyword extraction model is output by an activation function The picture content set in the differential image set and the time sequence information set in the optical flow atlas to obtain the associated word set of the differential image set and the optical flow atlas includes:

Use the dual-stream method to construct a two-branch convolutional neural network model, one of which is a spatial convolutional neural network model, and the other branch is a temporal convolutional neural network model;

Inputting the differential image set into the spatial convolutional neural network model, and inputting the optical flow atlas into the temporal convolutional neural network model;

Using the spatial convolutional neural network model and the temporal convolutional neural network model to extract feature vectors from the differential image set and optical flow atlas respectively, perform a pooling operation, and then normalize the feature vectors through an activation function After processing and calculation, output the picture content set and the time sequence information set in the optical flow atlas in the differential image set to obtain the associated word set of the differential image set and the optical flow atlas.
The method for extracting short video keywords according to claim 1, wherein the activation function is a Softmax function, and the loss function is a least squares function:

Wherein, the softmax function is:

Among them, O j represents the image content and timing information output value of the jth neuron in the output layer of the convolutional neural network, I j represents the input value of the jth neuron in the output layer of the convolutional neural network, and t represents all State the total amount of neurons in the output layer, e is an infinite non-recurring decimal;

The least square method is:

Where s is the error value of the output picture content and timing information and the differential image and optical flow diagram, k is the number of the image set, y i is the differential image and optical flow diagram, and y′ i is the output The picture content and timing information.
The method for extracting short video keywords according to claim 2, wherein the activation function is a Softmax function, and the loss function is a least square function:

Wherein, the softmax function is:

Among them, O j represents the image content and timing information output value of the jth neuron in the output layer of the convolutional neural network, I j represents the input value of the jth neuron in the output layer of the convolutional neural network, and t represents all State the total amount of neurons in the output layer, e is an infinite non-recurring decimal;

The least square method is:

Where s is the error value of the output picture content and timing information and the differential image and optical flow diagram, k is the number of the image set, y i is the differential image and optical flow diagram, and y′ i is the output The picture content and timing information.
5. The short video keyword extraction method of claim 3, wherein the activation function is a Softmax function, and the loss function is a least square function:

Wherein, the softmax function is:

Among them, O j represents the image content and timing information output value of the jth neuron in the output layer of the convolutional neural network, I j represents the input value of the jth neuron in the output layer of the convolutional neural network, and t represents all State the total amount of neurons in the output layer, e is an infinite non-recurring decimal;

The least square method is:

Where s is the error value of the output picture content and timing information and the differential image and optical flow diagram, k is the number of the image set, y i is the differential image and optical flow diagram, and y′ i is the output The picture content and timing information.
The method for extracting short video keywords according to claim 1, wherein said keyword extraction comprises:

Calculate the dependency correlation degree of any two words W i and W j in the related word set:

Among them, len(W i , W j ) represents the length of the dependency path between words W i and W j , and b is a hyperparameter;

Calculate the gravitational forces of words W i and W j :

Wherein, tfidf (W) is a TF-IDF value of word W, TF represents term frequency, IDF represents inverse document frequency index, d is the Euclidean distance between the vectors of words W i and W words of J;

The degree of association between words W i and W j is:

weight(W i ,W j )=Dep(W i ,W j )*f grav (W i ,W j )

Establish an undirected graph G=(V,E), where V is the set of vertices and E is the set of edges;

Calculate the importance score of the word W i :

among them,
W i is associated with a set of vertices, η is the damping coefficient;

Sort all words according to the importance score, select a preset number of keywords from the words according to the sort, and perform symbolic grammar splicing on the extracted keywords to obtain short video keywords .
A short video keyword extraction device, characterized in that the device includes a memory and a processor, and a short video keyword extraction program that can run on the processor is stored in the memory, and the short video keyword The following steps are implemented when the extraction program is executed by the processor:

Obtain a short video set, obtain different frame images of the short video set through timing screenshots, perform a preprocessing operation on the different frame images to obtain a target image set and a tag set, and store them in a database;

Performing target detection on the target image set using a difference method to obtain a difference image set, and performing posture tracking on the target image set according to an optical flow method to obtain an optical flow atlas;

Input the differential image set and the optical flow atlas as a training set into a pre-built short video keyword extraction model, use the training set to train the short video keyword extraction model, and pass the short video The activation function of the video keyword extraction model outputs the picture content set in the differential image set and the time series information set in the optical flow atlas to obtain the associated word set of the differential image set and the optical flow atlas, and combine the associated word set with the The tag set is input into the loss function of the short video keyword extraction model, and the loss function value is calculated, until the loss function value is less than the threshold, the short video keyword extraction model exits training;

Receive the input short video, use the short video keyword extraction model to obtain the related words of the short video, and perform keyword extraction on the related words to obtain the keywords of the short video.
8. The short video keyword extraction device according to claim 8, wherein the preprocessing operation on the different frame images to obtain the target image set comprises:

Converting the different frame images into gray-scale images by using image gray-scale, and thresholding the gray-scale images according to the OTSU algorithm to obtain a binary image;

Median filtering is used to eliminate isolated noise points in the binarized image, and scale normalization is used to eliminate the influence of the resolution in the short video on the binarized image, thereby obtaining a target image set.
8. The short video keyword extraction device according to claim 8, wherein the short video keyword extraction model is trained using the training set, and the short video keyword extraction model is outputted through the activation function of the short video keyword extraction model. The picture content set in the differential image set and the time sequence information set in the optical flow atlas are obtained to obtain the associated word set of the differential image set and the optical flow atlas, including:

Use the dual-stream method to construct a two-branch convolutional neural network model, one of which is a spatial convolutional neural network model, and the other branch is a temporal convolutional neural network model;

Inputting the differential image set into the spatial convolutional neural network model, and inputting the optical flow atlas into the temporal convolutional neural network model;

Using the spatial convolutional neural network model and the temporal convolutional neural network model to extract feature vectors from the differential image set and optical flow atlas respectively, perform a pooling operation, and then normalize the feature vectors through an activation function After processing and calculation, output the picture content set and the time sequence information set in the optical flow atlas in the differential image set to obtain the associated word set of the differential image set and the optical flow atlas.
8. The short video keyword extraction device of claim 8, wherein the activation function is a Softmax function, and the loss function is a least square function:

Wherein, the softmax function is:

Among them, O j represents the image content and timing information output value of the jth neuron in the output layer of the convolutional neural network, I j represents the input value of the jth neuron in the output layer of the convolutional neural network, and t represents all State the total amount of neurons in the output layer, e is an infinite non-recurring decimal;

The least square method is:

Where s is the error value of the output picture content and timing information and the differential image and optical flow diagram, k is the number of the image set, y i is the differential image and optical flow diagram, and y′ i is the output The picture content and timing information.
The short video keyword extraction device according to claim 9, wherein the activation function is a Softmax function, and the loss function is a least square function:

Wherein, the softmax function is:

Among them, O j represents the image content and timing information output value of the jth neuron in the output layer of the convolutional neural network, I j represents the input value of the jth neuron in the output layer of the convolutional neural network, and t represents all State the total amount of neurons in the output layer, e is an infinite non-recurring decimal;

The least square method is:

Where s is the error value of the output picture content and timing information and the differential image and optical flow diagram, k is the number of the image set, y i is the differential image and optical flow diagram, and y′ i is the output The picture content and timing information.
The short video keyword extraction device of claim 10, wherein the activation function is a Softmax function, and the loss function is a least square function:

Wherein, the softmax function is:

Among them, O j represents the image content and timing information output value of the jth neuron in the output layer of the convolutional neural network, I j represents the input value of the jth neuron in the output layer of the convolutional neural network, and t represents all State the total amount of neurons in the output layer, e is an infinite non-recurring decimal;

The least square method is:

Where s is the error value of the output picture content and timing information and the differential image and optical flow diagram, k is the number of the image set, y i is the differential image and optical flow diagram, and y′ i is the output The picture content and timing information.
8. The short video keyword extraction device of claim 8, wherein the keyword extraction includes:

Calculate the dependency correlation degree of any two words W i and W j in the related word set:

Among them, len(W i , W j ) represents the length of the dependency path between words W i and W j , and b is a hyperparameter;

Calculate the gravitational forces of words W i and W j :

Wherein, tfidf (W) is a TF-IDF value of word W, TF represents term frequency, IDF represents inverse document frequency index, d is the Euclidean distance between the vectors of words W i and W words of J;

The degree of association between words W i and W j is:

weight(W i ,W j )=Dep(W i ,W j )*f grav (W i ,W j )

Establish an undirected graph G=(V,E), where V is the set of vertices and E is the set of edges;

Calculate the importance score of the word W i :

among them,
W i is associated with a set of vertices, η is the damping coefficient;

Sort all words according to the importance score, select a preset number of keywords from the words according to the sort, and perform symbolic grammar splicing on the extracted keywords to obtain short video keywords .
A computer-readable storage medium, characterized in that a short video keyword extraction program is stored on the computer-readable storage medium, and the short video keyword extraction program can be executed by one or more processors to achieve the following step:

Obtain a short video set, obtain different frame images of the short video set through timing screenshots, perform a preprocessing operation on the different frame images to obtain a target image set and a tag set, and store them in a database;

Performing target detection on the target image set using a difference method to obtain a difference image set, and performing posture tracking on the target image set according to an optical flow method to obtain an optical flow atlas;

Input the differential image set and the optical flow atlas as a training set into a pre-built short video keyword extraction model, use the training set to train the short video keyword extraction model, and pass the short video The activation function of the video keyword extraction model outputs the picture content set in the differential image set and the time series information set in the optical flow atlas to obtain the associated word set of the differential image set and the optical flow atlas, and combine the associated word set with the The tag set is input into the loss function of the short video keyword extraction model, and the loss function value is calculated, until the loss function value is less than the threshold, the short video keyword extraction model exits training;

Receive the input short video, use the short video keyword extraction model to obtain the related words of the short video, and perform keyword extraction on the related words to obtain the keywords of the short video.
15. The computer-readable storage medium of claim 15, wherein the preprocessing operation on the different frame images to obtain a target image set comprises:

Converting the different frame images into gray-scale images by using image gray-scale, and thresholding the gray-scale images according to the OTSU algorithm to obtain a binary image;

Median filtering is used to eliminate isolated noise points in the binarized image, and scale normalization is used to eliminate the influence of the resolution in the short video on the binarized image, thereby obtaining a target image set.
The computer-readable storage medium of claim 16, wherein the training set is used to train the short video keyword extraction model, and the difference is output through the activation function of the short video keyword extraction model The image content set in the image set and the time sequence information set in the optical flow atlas are obtained to obtain the associated word set of the differential image set and the optical flow atlas, including:

Use the dual-stream method to construct a two-branch convolutional neural network model, one of which is a spatial convolutional neural network model, and the other branch is a temporal convolutional neural network model;

Inputting the differential image set into the spatial convolutional neural network model, and inputting the optical flow atlas into the temporal convolutional neural network model;

Using the spatial convolutional neural network model and the temporal convolutional neural network model to extract feature vectors from the differential image set and optical flow atlas respectively, perform a pooling operation, and then normalize the feature vectors through an activation function After processing and calculation, output the picture content set and the time sequence information set in the optical flow atlas in the differential image set to obtain the associated word set of the differential image set and the optical flow atlas.
The computer-readable storage medium of claim 15, wherein the activation function is a Softmax function, and the loss function is a least squares function:

Wherein, the softmax function is:

Among them, O j represents the image content and timing information output value of the jth neuron in the output layer of the convolutional neural network, I j represents the input value of the jth neuron in the output layer of the convolutional neural network, and t represents all State the total amount of neurons in the output layer, e is an infinite non-recurring decimal;

The least square method is:

Where s is the error value of the output picture content and timing information and the differential image and optical flow diagram, k is the number of the image set, y i is the differential image and optical flow diagram, and y′ i is the output The picture content and timing information.
The computer-readable storage medium according to claim 16 or 17, wherein the activation function is a Softmax function, and the loss function is a least squares function:

Wherein, the softmax function is:

Among them, O j represents the image content and timing information output value of the jth neuron in the output layer of the convolutional neural network, I j represents the input value of the jth neuron in the output layer of the convolutional neural network, and t represents all State the total amount of neurons in the output layer, e is an infinite non-recurring decimal;

The least square method is:

Where s is the error value of the output picture content and timing information and the differential image and optical flow diagram, k is the number of the image set, y i is the differential image and optical flow diagram, and y′ i is the output The picture content and timing information.
15. The computer-readable storage medium of claim 15, wherein the keyword extraction includes:

Calculate the dependency correlation degree of any two words W i and W j in the related word set:

Among them, len(W i , W j ) represents the length of the dependency path between words W i and W j , and b is a hyperparameter;

Calculate the gravitational forces of words W i and W j :

Wherein, tfidf (W) is a TF-IDF value of word W, TF represents term frequency, IDF represents inverse document frequency index, d is the Euclidean distance between the vectors of words W i and W words of J;

The degree of association between words W i and W j is:

weight(W i ,W j )=Dep(W i ,W j )*f grav (W i ,W j )

Establish an undirected graph G=(V,E), where V is the set of vertices and E is the set of edges;

Calculate the importance score of the word W i :

among them,
W i is associated with a set of vertices, η is the damping coefficient;

Sort all words according to the importance score, select a preset number of keywords from the words according to the sort, and perform symbolic grammar splicing on the extracted keywords to obtain short video keywords .