CN114979705A - Automatic editing method based on deep learning, self-attention mechanism and symbolic reasoning - Google Patents

Automatic editing method based on deep learning, self-attention mechanism and symbolic reasoning Download PDF

Info

Publication number
CN114979705A
Authority
CN
China
Prior art keywords: video, text, quality, hake, self
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210383218.4A
Other languages
Chinese (zh)
Inventor
Zhou Jinglin (周景林)
Cao Hanyang (曹瀚洋)
Zhou Yixi (周奕希)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University
Priority to CN202210383218.4A
Publication of CN114979705A
Legal status: Pending (current)

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N21/23424Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving splicing one content stream with another content stream, e.g. for inserting or substituting an advertisement
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00Computing arrangements based on specific mathematical models
    • G06N7/01Probabilistic graphical models, e.g. probabilistic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/49Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44016Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving splicing one content stream with another content stream, e.g. for substituting a video clip

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Pure & Applied Mathematics (AREA)
  • Biophysics (AREA)
  • Algebra (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

The invention discloses an automatic editing method based on deep learning, a self-attention mechanism and symbolic reasoning, comprising the following steps: establishing a video material library for the field to be promoted; training an RVM network on the constructed database; establishing a primitive library for the video content to be described; organizing the primitives with the HAKE logical inference engine; establishing the text types that require semantic understanding; training a transformer on a data set to obtain a text understanding network; inputting the video to be automatically clipped into the RVM network; inputting the resulting video into the HAKE video understanding engine and outputting labeled video; inputting the clipping requirement text into the transformer model and outputting labels arranged in semantic order; comparing and matching the obtained labels; sorting the videos by the matching results; and integrating the above steps into a single system that simplifies user-facing operation. The invention addresses the problems that existing editing techniques demand a high level of skill, cannot clip multiple videos simultaneously, and consume large amounts of human and time resources.

Description

Automatic editing method based on deep learning, self-attention mechanism and symbolic reasoning
Technical Field
The invention relates to the technical field of automatic video editing methods based on deep learning, and in particular to an automatic editing method based on deep learning, a self-attention mechanism and symbolic reasoning.
Background
With the development of science and technology, public content consumption has undergone enormous changes, from text to pictures and from pictures to video. Compared with pictures and text, video is more three-dimensional and intuitive, and has become an important window connecting people with society. Meanwhile, facing a massive influx of information, the public's demands on content are gradually rising. Video has changed from a purely entertainment medium into an important channel for acquiring news and knowledge. Therefore, with socio-economic development and the transformation of communication media, research on automatic video generation urgently needs to advance.
In automatic video generation, classification according to feature values of the footage is the key to the whole generation process. In traditional video editing, the type and characteristics of each video segment are determined mainly by the editor's experience and subjective judgment, and the segments are fitted into a video logic the editor has conceived in advance. Although rising industry entry thresholds and continuously upgraded editing software have improved the quality and efficiency of video production to some extent, editing becomes ever more complex as footage accumulates and requirements grow. Traditional methods therefore can no longer meet current demands, whereas neural networks have clear advantages in processing such information and in solving classification problems with fuzzy labels. Existing editing techniques demand a high level of skill, cannot clip multiple videos simultaneously, and consume large amounts of human and time resources.
Disclosure of Invention
The invention aims to provide an automatic editing method based on deep learning, a self-attention mechanism and symbolic reasoning, solving the problems that existing editing techniques demand a high level of skill, cannot clip multiple videos simultaneously, and consume large amounts of human and time resources.
In order to achieve the above object, the present invention provides an automatic editing method based on deep learning, a self-attention mechanism and symbolic reasoning, comprising the following steps:
S1, establishing a video material library for the field to be promoted, and using an RVM to segment out low-quality video segments, wherein the library is required to contain both low-quality original videos and manually edited high-quality videos;
S2, training the RVM on the database built in step S1 to obtain a network architecture suited to the task, in which the original RVM is supervised-trained on a larger data set containing low-quality segments and their corresponding high-quality segments, yielding a network suited to separating video segments of high and low quality;
S3, establishing a primitive library for the video content to be described;
S4, organizing the primitives with the HAKE logical inference engine to obtain a series of labels conforming to semantic logic;
S5, establishing the text types that require semantic understanding, mainly drawing on manually annotated related data sets;
S6, training a transformer on the data set of step S5 to obtain a text understanding network accurate enough to parse clipping requirements;
S7, inputting the video to be automatically clipped into the RVM network trained in step S2, removing flawed parts caused by human error or environmental factors, and obtaining high-quality video;
S8, inputting the high-quality video obtained in step S7 into the HAKE video understanding engine, and outputting labeled video;
S9, inputting the clipping requirement text into the transformer model trained in step S6, and outputting labels arranged in semantic order;
S10, comparing and matching the labels obtained in steps S8 and S9;
S11, sorting the videos according to the matching results of step S10;
and S12, integrating the above steps into a single system that simplifies user-facing operation.
Preferably, outputting labeled video specifically comprises: first, collecting a large number of unprocessed video segments as input to the multi-channel pre-trained RVM network, deleting low-quality segments presumed to result from operator error or environmental factors, and outputting defect-free high-quality segments;
secondly, after the high-quality video clips are obtained, they enter HAKE as input; HAKE understands the video content through three stages of work: building a primitive library for the relevant field and continuously expanding its capacity as required, combining primitives according to linguistic logic using logical inference rules, and labeling the video content with a CNN; labeled video is then output.
Preferably, the construction of the primitive library is divided into three steps:
in the first step, two types of entities are identified: entities at different levels of the hierarchy and entities at the same level;
in the second step, hierarchy-aware knowledge graph embedding is performed; HAKE consists of two parts, a modulus part and a phase part, which model the two different classes of entities respectively; to distinguish the embeddings of the two parts, the modulus part uses e_m and r_m to denote entity embedding and relation embedding, while the phase part uses e_p and r_p; HAKE combines the modulus and phase parts, mapping the entities into a polar coordinate system in which the radial coordinate corresponds to the modulus part and the angular coordinate to the phase part; HAKE maps an entity h to [h_m; h_p], where [·;·] denotes the concatenation of two vectors, and uses the scoring function d_{r,m}(h, t) = ||h_m ∘ r_m - t_m||_2 (∘ being the element-wise product) to evaluate the effect of modulus and phase (a numerical sketch of this scoring appears at the end of this section);
the third step is parallel to the video segmentationPerforming text semantic segmentation, and completing the task by adopting a Transformer which consists of self-attention and Feed Forward Neural Network only, wherein in an encoder of the Transformer, data is firstly subjected to a module called self-attention to obtain a weighted feature vector Z which is expressed as a self-attention module
Figure BDA0003592662180000031
After obtaining Z, it is sent to the next module of the encoder, i.e. a Feed Forward Neural Network, which is fully connected with two layers, the activation function of the first layer is ReLU and the second layer is a linear activation function, which can be expressed as ffn (Z) ═ max (0, ZW) 1 +b 1 )W 2 +b 2 The two attentions are respectively used for calculating input and output weights, inputting a required text of a user into a transformer, outputting semantic labels which are logically arranged according to the text semantics through understanding the text, and mostly presenting in a primitive form;
and finally, comparing the video labels with the text labels by adopting a common comparison algorithm, matching within a certain fault-tolerant range, sequencing the video contents according to the sequence of the text labels, and finally obtaining high-quality video fragments which are arranged according to the text semantic sequence, wherein the video fragments can be used as fragments.
Therefore, the automatic editing method based on deep learning, a self-attention mechanism and symbolic reasoning solves the problems that existing editing techniques demand a high level of skill, cannot clip multiple videos simultaneously, and consume large amounts of human and time resources.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
FIG. 1 is a schematic flow chart illustrating an embodiment of an automatic editing method based on deep learning, self-attention mechanism and symbolic reasoning according to the present invention;
FIG. 2 is a schematic RVM structure diagram of an embodiment of an automatic clipping method based on deep learning, self-attention mechanism and symbolic reasoning according to the present invention;
FIG. 3 is a schematic diagram of a knowledge graph embedding diagram of an embodiment of an automatic editing method based on deep learning, a self-attention mechanism and symbolic reasoning according to the present invention;
FIG. 4 is a schematic structural diagram of a transformer in an embodiment of an automatic editing method based on deep learning, a self-attention mechanism and symbolic reasoning according to the present invention.
Detailed Description
The technical solution of the present invention is further illustrated by the accompanying drawings and examples.
Examples
The invention provides an automatic editing method based on deep learning, a self-attention mechanism and symbolic reasoning, comprising the following steps:
S1, establishing a video material library for the field to be promoted, and using an RVM to segment out low-quality video segments, wherein the library is required to contain both low-quality original videos and manually edited high-quality videos;
S2, training the RVM on the database built in step S1 to obtain a network architecture suited to the task, in which the original RVM is supervised-trained on a larger data set containing low-quality segments and their corresponding high-quality segments, yielding a network suited to separating video segments of high and low quality;
S3, establishing a primitive library for the video content to be described;
S4, organizing the primitives with the HAKE logical inference engine to obtain a series of labels conforming to semantic logic;
S5, establishing the text types that require semantic understanding, mainly drawing on manually annotated related data sets;
S6, training a transformer on the data set of step S5 to obtain a text understanding network accurate enough to parse clipping requirements;
S7, inputting the video to be automatically clipped into the RVM network trained in step S2, removing flawed parts caused by human error or environmental factors, and obtaining high-quality video;
S8, inputting the high-quality video obtained in step S7 into the HAKE video understanding engine, and outputting labeled video;
S9, inputting the clipping requirement text into the transformer model trained in step S6, and outputting labels arranged in semantic order;
S10, comparing and matching the labels obtained in steps S8 and S9;
S11, sorting the videos according to the matching results of step S10;
and S12, integrating the above steps into a single system that simplifies user-facing operation (a code skeleton of steps S7 to S11 follows below).
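As an orientation aid, steps S7 through S11 can be read as the following Python skeleton. The four component functions stand for the trained RVM, the HAKE engine, the transformer, and the matcher; they are passed in as parameters because this sketch does not commit to any concrete implementation, and the file names and tags in the smoke test are invented for illustration.

def auto_edit(raw_videos, requirement_text,
              rvm_filter, hake_label, transformer_tags, match_and_sort):
    # Steps S7-S11 of the method, expressed over four injected components
    clips = []
    for video in raw_videos:
        for segment in rvm_filter(video):       # S7: keep high-quality segments
            clips.append(hake_label(segment))   # S8: attach semantic labels
    text_tags = transformer_tags(requirement_text)  # S9: labels in semantic order
    return match_and_sort(clips, text_tags)         # S10-S11: compare, match, sort

# Tiny smoke test with stand-in components
print(auto_edit(
    ["raw.mp4"], "opening speech then applause",
    rvm_filter=lambda v: [v + "#seg1"],
    hake_label=lambda s: (s, ["speech"]),
    transformer_tags=lambda t: ["speech", "applause"],
    match_and_sort=lambda clips, tags: [c for c, _ in clips],
))  # -> ['raw.mp4#seg1']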
In the method provided by the invention, a large number of unprocessed video segments are first collected as input to the multi-channel pre-trained RVM network; low-quality segments presumed to result from operator error or environmental factors are deleted, and defect-free high-quality segments are output.
Secondly, after the high-quality video clips are obtained, they enter HAKE as input; HAKE understands the video content through three stages of work: building a primitive library for the relevant field and continuously expanding its capacity as required, combining primitives according to linguistic logic using logical inference rules, and labeling the video content with a CNN; labeled video is then output.
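As a toy illustration of "combining primitives according to linguistic logic", the following sketch composes hypothetical detected primitives into semantic labels with one-step lookup rules. A real HAKE inference engine chains such rules; every primitive and label name here is invented for the example.

# Hypothetical inference rules: a set of co-occurring primitives -> label
RULES = {
    frozenset({"person", "hold", "microphone"}): "interview",
    frozenset({"person", "stand", "podium"}): "speech",
}

def label_clip(detected_primitives):
    # Emit every label whose rule premises are all present in the clip
    found = set(detected_primitives)
    return [label for premises, label in RULES.items() if premises <= found]

print(label_clip({"person", "hold", "microphone", "indoor"}))  # ['interview']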
Finally, the construction of the primitive library is divided into three steps:
in the first step, two types of entities are identified: entities at different levels of the hierarchy and entities at the same level;
in the second step, hierarchy-aware knowledge graph embedding is performed; HAKE consists of two parts, a modulus part and a phase part, which model the two different classes of entities respectively; to distinguish the embeddings of the two parts, the modulus part uses e_m and r_m to denote entity embedding and relation embedding, while the phase part uses e_p and r_p; HAKE combines the modulus and phase parts, mapping the entities into a polar coordinate system in which the radial coordinate corresponds to the modulus part and the angular coordinate to the phase part; HAKE maps an entity h to [h_m; h_p], where [·;·] denotes the concatenation of two vectors, and uses the scoring function d_{r,m}(h, t) = ||h_m ∘ r_m - t_m||_2 (∘ being the element-wise product) to evaluate the effect of modulus and phase.
In the third step, parallel to the video segmentation, text semantic segmentation is performed; this task is completed with a Transformer, which consists only of self-attention and feed-forward neural network modules. In the Transformer encoder, the data first passes through the self-attention module to obtain a weighted feature vector Z, expressed as
Z = Attention(Q, K, V) = softmax(QK^T / √d_k) V.
After Z is obtained, it is sent to the next module of the encoder, a feed-forward neural network, which is a two-layer fully connected network whose first layer uses a ReLU activation and whose second layer is linear; it can be expressed as FFN(Z) = max(0, Z W_1 + b_1) W_2 + b_2. Two kinds of attention are used to compute the weights of input and output respectively: self-attention captures the relationship between the current translation and the already-translated context, while encoder-decoder attention captures the relationship between the current translation and the encoded feature vectors. This structure is well suited to the task of semantic understanding: the user's requirement text is input into the transformer, which outputs semantic labels logically arranged according to the text semantics, mostly presented in primitive form (both computations are sketched in code below).
And finally, a common comparison algorithm compares the video labels with the text labels, matching within a certain fault-tolerance range; the video contents are then ordered according to the sequence of the text labels, finally yielding high-quality video clips arranged in the text's semantic order, which can be used as the final cut (a sketch of this matching step follows after the attention example below).
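The attention and feed-forward computations described above can be written compactly as follows. This is a minimal NumPy sketch of a single encoder step (single head, no layer normalization or residual connections), with all shapes and weights chosen purely for illustration.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    # Z = softmax(Q K^T / sqrt(d_k)) V, the weighted feature vector Z above
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def ffn(Z, W1, b1, W2, b2):
    # FFN(Z) = max(0, Z W1 + b1) W2 + b2: ReLU layer, then a linear layer
    return np.maximum(0.0, Z @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 16))                   # 5 tokens, model width 16
W_q, W_k, W_v = (rng.standard_normal((16, 16)) for _ in range(3))
Z = self_attention(X, W_q, W_k, W_v)
out = ffn(Z, rng.standard_normal((16, 32)), np.zeros(32),
          rng.standard_normal((32, 16)), np.zeros(16))
print(out.shape)                                   # (5, 16)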
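For the final comparison step, the text does not name a specific algorithm; the sketch below uses Python's standard-library difflib as one plausible "common comparison algorithm", with the fault-tolerance range expressed as a similarity threshold. Clip IDs, tags, and the threshold value are illustrative assumptions.

from difflib import SequenceMatcher

def tag_similarity(video_tags, text_tags):
    # Ratio in [0, 1] between two tag sequences; 1.0 means identical
    return SequenceMatcher(None, video_tags, text_tags).ratio()

def order_clips(clips, text_tags, threshold=0.5):
    # clips: list of (clip_id, tag list). Keep clips whose tags are similar
    # enough to the text labels, then order them by where their first tag
    # appears in the text label sequence (the text's semantic order).
    kept = []
    for clip_id, v_tags in clips:
        if tag_similarity(v_tags, text_tags) >= threshold:
            pos = min((text_tags.index(t) for t in v_tags if t in text_tags),
                      default=len(text_tags))
            kept.append((pos, clip_id))
    return [clip_id for _, clip_id in sorted(kept)]

clips = [("b.mp4", ["crowd", "applause"]), ("a.mp4", ["speaker", "podium"])]
text_tags = ["speaker", "podium", "crowd", "applause"]
print(order_clips(clips, text_tags))  # ['a.mp4', 'b.mp4']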
Therefore, the automatic editing method based on deep learning, a self-attention mechanism and symbolic reasoning solves the problems that existing editing techniques demand a high level of skill, cannot clip multiple videos simultaneously, and consume large amounts of human and time resources.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting the same, and although the present invention is described in detail with reference to the preferred embodiments, those of ordinary skill in the art should understand that: modifications and equivalents may be made to the invention without departing from the spirit and scope of the invention.

Claims (3)

1. An automatic editing method based on deep learning, a self-attention mechanism and symbolic reasoning, characterized by comprising the following steps:
S1, establishing a video material library for the field to be promoted, and using an RVM to segment out low-quality video segments, wherein the library is required to contain both low-quality original videos and manually edited high-quality videos;
S2, training the RVM on the database built in step S1 to obtain a network architecture suited to the task, in which the original RVM is supervised-trained on a larger data set containing low-quality segments and their corresponding high-quality segments, yielding a network suited to separating video segments of high and low quality;
S3, establishing a primitive library for the video content to be described;
S4, organizing the primitives with the HAKE logical inference engine to obtain a series of labels conforming to semantic logic;
S5, establishing the text types that require semantic understanding, mainly drawing on manually annotated related data sets;
S6, training a transformer on the data set of step S5 to obtain a text understanding network accurate enough to parse clipping requirements;
S7, inputting the video to be automatically clipped into the RVM network trained in step S2, removing flawed parts caused by human error or environmental factors, and obtaining high-quality video;
S8, inputting the high-quality video obtained in step S7 into the HAKE video understanding engine, and outputting labeled video;
S9, inputting the clipping requirement text into the transformer model trained in step S6, and outputting labels arranged in semantic order;
S10, comparing and matching the labels obtained in steps S8 and S9;
S11, sorting the videos according to the matching results of step S10;
and S12, integrating the above steps into a single system that simplifies user-facing operation.
2. The automatic editing method based on deep learning, self-attention mechanism and symbolic reasoning according to claim 1, wherein:
outputting labeled video specifically comprises: first, collecting a large number of unprocessed video segments as input to the multi-channel pre-trained RVM network, deleting low-quality segments caused by operator error or environmental factors, and outputting defect-free high-quality segments;
secondly, after the high-quality video clips are obtained, they enter HAKE as input; HAKE understands the video content through three stages of work: building a primitive library for the relevant field and continuously expanding its capacity as required, combining primitives according to linguistic logic using logical inference rules, and labeling the video content with a CNN; labeled video is then output.
3. The automatic editing method based on deep learning, self-attention mechanism and symbolic reasoning according to claim 1, wherein the construction of the primitive library is divided into three steps:
in the first step, two types of entities are identified: entities at different levels of the hierarchy and entities at the same level;
in the second step, hierarchy-aware knowledge graph embedding is performed; HAKE consists of two parts, a modulus part and a phase part, which model the two different classes of entities respectively; to distinguish the embeddings of the two parts, the modulus part uses e_m and r_m to denote entity embedding and relation embedding, while the phase part uses e_p and r_p; HAKE combines the modulus and phase parts, mapping the entities into a polar coordinate system in which the radial coordinate corresponds to the modulus part and the angular coordinate to the phase part; HAKE maps an entity h to [h_m; h_p], where [·;·] denotes the concatenation of two vectors, and uses the scoring function d_{r,m}(h, t) = ||h_m ∘ r_m - t_m||_2 (∘ being the element-wise product) to evaluate the effect of modulus and phase;
in the third step, parallel to the video segmentation, text semantic segmentation is performed; this task is completed with a Transformer, which consists only of self-attention and feed-forward neural network modules; in the Transformer encoder, the data first passes through the self-attention module to obtain a weighted feature vector Z, expressed as
Z = Attention(Q, K, V) = softmax(QK^T / √d_k) V;
after Z is obtained, it is sent to the next module of the encoder, a feed-forward neural network, which is a two-layer fully connected network whose first layer uses a ReLU activation and whose second layer is linear, expressed as FFN(Z) = max(0, Z W_1 + b_1) W_2 + b_2; the two kinds of attention are used to compute input and output weights respectively; the user's requirement text is input into the transformer, which, by understanding the text, outputs semantic labels logically arranged according to the text semantics, presented in primitive form;
and finally, a common comparison algorithm compares the video labels with the text labels, matching within a certain fault-tolerance range; the video contents are ordered according to the sequence of the text labels, finally yielding high-quality video clips arranged in the text's semantic order, which can be used as the final cut.
CN202210383218.4A 2022-04-12 2022-04-12 Automatic editing method based on deep learning, self-attention mechanism and symbolic reasoning Pending CN114979705A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210383218.4A CN114979705A (en) 2022-04-12 2022-04-12 Automatic editing method based on deep learning, self-attention mechanism and symbolic reasoning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210383218.4A CN114979705A (en) 2022-04-12 2022-04-12 Automatic editing method based on deep learning, self-attention mechanism and symbolic reasoning

Publications (1)

Publication Number Publication Date
CN114979705A true CN114979705A (en) 2022-08-30

Family

ID=82978212

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210383218.4A Pending CN114979705A (en) 2022-04-12 2022-04-12 Automatic editing method based on deep learning, self-attention mechanism and symbolic reasoning

Country Status (1)

Country Link
CN (1) CN114979705A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117692676A (en) * 2023-12-08 2024-03-12 广东创意热店互联网科技有限公司 Video quick editing method based on artificial intelligence technology

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018040059A1 (en) * 2016-09-02 2018-03-08 Microsoft Technology Licensing, Llc Clip content categorization
CN110334219A (en) * 2019-07-12 2019-10-15 电子科技大学 The knowledge mapping for incorporating text semantic feature based on attention mechanism indicates learning method
CN112423023A (en) * 2020-12-09 2021-02-26 珠海九松科技有限公司 Intelligent automatic video mixed-cutting method
CN113821679A (en) * 2021-07-09 2021-12-21 腾讯科技(深圳)有限公司 Video frame positioning method, electronic equipment and computer readable storage medium
CN113870010A (en) * 2021-09-30 2021-12-31 平安科技(深圳)有限公司 Data processing method, device and equipment based on machine learning and storage medium
CN114026874A (en) * 2020-10-27 2022-02-08 深圳市大疆创新科技有限公司 Video processing method and device, mobile device and readable storage medium
CN114297440A (en) * 2021-12-30 2022-04-08 深圳市富之富信息科技有限公司 Video automatic generation method and device, computer equipment and storage medium

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018040059A1 (en) * 2016-09-02 2018-03-08 Microsoft Technology Licensing, Llc Clip content categorization
CN110334219A (en) * 2019-07-12 2019-10-15 电子科技大学 The knowledge mapping for incorporating text semantic feature based on attention mechanism indicates learning method
CN114026874A (en) * 2020-10-27 2022-02-08 深圳市大疆创新科技有限公司 Video processing method and device, mobile device and readable storage medium
CN112423023A (en) * 2020-12-09 2021-02-26 珠海九松科技有限公司 Intelligent automatic video mixed-cutting method
CN113821679A (en) * 2021-07-09 2021-12-21 腾讯科技(深圳)有限公司 Video frame positioning method, electronic equipment and computer readable storage medium
CN113870010A (en) * 2021-09-30 2021-12-31 平安科技(深圳)有限公司 Data processing method, device and equipment based on machine learning and storage medium
CN114297440A (en) * 2021-12-30 2022-04-08 深圳市富之富信息科技有限公司 Video automatic generation method and device, computer equipment and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ASHISH VASWANI ET AL.: "Attention Is All You Need", Proceedings of the 31st International Conference on Neural Information Processing Systems *
YONG-LU LI ET AL.: "HAKE: A Knowledge Engine Foundation for Human Activity Understanding", arXiv, pages 4-5 *
ZHANQIU ZHANG ET AL.: "Learning Hierarchy-Aware Knowledge Graph Embeddings for Link Prediction", arXiv *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117692676A (en) * 2023-12-08 2024-03-12 广东创意热店互联网科技有限公司 Video quick editing method based on artificial intelligence technology

Similar Documents

Publication Publication Date Title
CN108984683B (en) Method, system, equipment and storage medium for extracting structured data
WO2023065617A1 (en) Cross-modal retrieval system and method based on pre-training model and recall and ranking
CN111709518A (en) Method for enhancing network representation learning based on community perception and relationship attention
CN109874053A (en) The short video recommendation method with user's dynamic interest is understood based on video content
CN108427740B (en) Image emotion classification and retrieval algorithm based on depth metric learning
CN112580362B (en) Visual behavior recognition method, system and computer readable medium based on text semantic supervision
US11876986B2 (en) Hierarchical video encoders
CN111651566B (en) Multi-task small sample learning-based referee document dispute focus extraction method
CN113051914A (en) Enterprise hidden label extraction method and device based on multi-feature dynamic portrait
CN116049397A (en) Sensitive information discovery and automatic classification method based on multi-mode fusion
CN111353314A (en) Story text semantic analysis method for animation generation
WO2023124647A1 (en) Summary determination method and related device thereof
CN113657115A (en) Multi-modal Mongolian emotion analysis method based on ironic recognition and fine-grained feature fusion
CN112800263A (en) Video synthesis system, method and medium based on artificial intelligence
CN115563342A (en) Method, system, equipment and storage medium for video theme retrieval
CN114979705A (en) Automatic editing method based on deep learning, self-attention mechanism and symbolic reasoning
CN114328939A (en) Natural language processing model construction method based on big data
CN112528642B (en) Automatic implicit chapter relation recognition method and system
CN117173730A (en) Document image intelligent analysis and processing method based on multi-mode information
CN107491814B (en) Construction method of process case layered knowledge model for knowledge push
CN112800259B (en) Image generation method and system based on edge closure and commonality detection
CN114842301A (en) Semi-supervised training method of image annotation model
CN111708896B (en) Entity relationship extraction method applied to biomedical literature
CN110659390A (en) Video content retrieval method based on deep convolutional network
CN116611514B (en) Value orientation evaluation system construction method based on data driving

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20220830

RJ01 Rejection of invention patent application after publication