CN112328830A - Information positioning method based on deep learning and related equipment


Info

Publication number
CN112328830A
CN112328830A (application CN201910718176.3A)
Authority
CN
China
Prior art keywords
information
video
image
audio
vector
Prior art date
Legal status
Pending
Application number
CN201910718176.3A
Other languages
Chinese (zh)
Inventor
苏建
蔡云龙
Current Assignee
TCL Corp
TCL Research America Inc
Original Assignee
TCL Research America Inc
Priority date
Filing date
Publication date
Application filed by TCL Research America Inc filed Critical TCL Research America Inc
Priority claimed from CN201910718176.3A
Publication of CN112328830A
Legal status: Pending

Classifications

    • G - Physics
    • G06 - Computing; calculating or counting
    • G06F - Electric digital data processing
    • G06F 16/00 - Information retrieval; database structures therefor; file system structures therefor
    • G06F 16/70 - Information retrieval of video data
    • G06F 16/73 - Querying
    • G06F 16/735 - Filtering based on additional data, e.g. user or group profiles
    • G06F 16/78 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/783 - Retrieval using metadata automatically derived from the content
    • G06F 16/7834 - Retrieval using metadata automatically derived from the content, using audio features
    • G06F 16/7844 - Retrieval using metadata automatically derived from the content, using original textual content or text extracted from visual content or transcript of audio data
    • G06N - Computing arrangements based on specific computational models
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G06N 3/08 - Learning methods

Abstract

The invention provides an information positioning method based on deep learning and related equipment. The method acquires the audio feature vector corresponding to the audio information of a video file to be processed and the image feature vector corresponding to that audio feature vector, converts the audio feature vector and the image feature vector into first text information, and extracts the target information and the positioning information of the target information in the first text information. When the method of the invention is used for information positioning, recognition of the audio information corresponding to the video file to be processed is added to the recognition of the video images, so that the complete content of the video is fully considered, the accuracy of video content recognition is improved, and the accuracy of information positioning is improved.

Description

Information positioning method based on deep learning and related equipment
Technical Field
The invention relates to the technical field of display control, in particular to an information positioning method based on deep learning and related equipment.
Background
With the development of network technology, networks have become increasingly widespread, and shooting videos and sharing them over the network is increasingly popular with users.
Faced with the large amount of video information on the network, a user who wants to locate a video clip of interest, such as a clip of a favorite subject or item, or who wants to filter out disliked clips, generally does so by locating video tags. In tag-based video positioning, each video may carry dozens or even hundreds of tags, so positioning according to the tags consumes a large amount of human effort; moreover, many videos carry no corresponding tags at all, so the tag-based positioning approach cannot meet the requirement for accurate information positioning.
Therefore, the prior art is subject to further improvement.
Disclosure of Invention
In view of the above defects in the prior art, the invention provides an information positioning method based on deep learning and related equipment, which overcomes the drawbacks of the prior art that, when target information is searched, positioned or filtered only by means of video tags, the accuracy of the positioned or filtered target information is low and a large amount of human effort is consumed.
In a first aspect, an embodiment of the present invention provides an information positioning method based on deep learning, where the method includes:
acquiring audio information and video image information of a video file to be processed, wherein the time sequence of the video image information in the video file to be processed is the same as the time sequence of the audio information in the video file to be processed;
determining first character information according to the audio information and the video image information;
inputting the first character information into a trained character integration model to obtain extracted target information and positioning information of the target information in the first character information; the character integration model is trained based on the corresponding relation between the sample character information of the marked target information and the sample character information of the unmarked target information.
Optionally, the determining the first text information according to the audio information and the video image information includes:
determining an audio characteristic vector corresponding to the audio information according to the audio information, and determining an image characteristic vector corresponding to the video image information according to the video image information;
and converting the audio feature vector and the image feature vector into first text information.
Optionally, the step of converting the audio feature vector and the image feature vector into first text information includes:
splicing the audio feature vector and the image feature vector into a video vector matrix;
and translating the video vector contained in the video vector matrix into first text information.
Optionally, the step of translating the video vector contained in the video vector matrix into the first text information includes:
inputting the video vector matrix into a trained content recognition model, outputting the first character information by the content recognition model, and training the content recognition model based on the corresponding relation between the video vector of the marked character information and the video vector of the unmarked character information.
Optionally, the step of determining the audio feature vector corresponding to the audio information according to the audio information includes:
sampling the frequency spectrum of the audio information according to preset sampling points to obtain a sampling frequency spectrum;
encoding the sampled spectrum into the audio feature vector.
Optionally, the step of determining the image feature vector corresponding to the video image information according to the video image information includes:
intercepting an image frame of the video image information;
and extracting the characteristic graphs corresponding to the image frames in the video image information respectively, and obtaining the image characteristic vectors corresponding to the image frames respectively according to the characteristic graphs corresponding to the image frames respectively.
Optionally, the step of intercepting the image frame of the video image information includes:
cutting the video image information into a plurality of video image frame segments according to the length of a preset video frame;
intercepting the image frames of each video image frame segment to obtain an image frame set corresponding to each video image frame segment, wherein the image frame set corresponding to each video image frame segment comprises each image frame in the video image frame segment;
the step of extracting the feature maps corresponding to the image frames in the video image information comprises the following steps:
and extracting feature maps corresponding to the image frames in the image frame sets respectively.
Optionally, the step of obtaining the image feature vectors corresponding to the image frames according to the feature maps corresponding to the image frames includes:
and inputting the image characteristics corresponding to each image frame into a trained convolutional neural network to obtain the image characteristic vectors corresponding to each image frame, wherein the convolutional neural network is trained on the basis of the corresponding relation between the marked image characteristic vectors and the input characteristic graph.
Optionally, the step of inputting the feature map into a trained convolutional neural network to obtain an image feature vector of each image frame includes:
and converting the two-dimensional image feature vectors corresponding to the feature maps into one-dimensional image feature vectors corresponding to the feature maps through convolution and pooling operations.
Optionally, the step of splicing the audio feature vector and the image feature vector into a video vector matrix includes:
and merging and adding the one-dimensional image feature matrixes corresponding to the feature maps into the two-dimensional matrix corresponding to the audio feature vector in a column vector mode to obtain a two-dimensional video vector matrix.
Optionally,
the character integration model is an encoding-decoding model, and the encoding-decoding model comprises: an encoding layer, an attention layer, and a decoding layer;
the step of inputting the first text information into a trained text integration model to obtain the extracted target information and the positioning information of the target information in the first text information comprises:
converting the first text information into a text sequence, inputting the text sequence into the coding layer, and outputting a hidden layer sequence after the text sequence is hidden and coded;
inputting the hidden layer sequence into the attention layer, and outputting key information contained in the hidden layer sequence;
inputting the identified key information and the hidden layer sequence into the decoding layer, and extracting a theme of the hidden layer sequence and target information analyzed based on the theme;
and obtaining the positioning information of the target information according to the position information of the target information in the first character information.
Optionally, after the steps of extracting the theme of the hidden layer sequence and analyzing the target information based on the theme, the method further includes:
and displaying the extracted target information, and outputting analysis information of the target information.
Optionally, the information positioning method further includes:
utilizing the target information extracted by the information positioning method and the positioning information of the target information in the first text information;
and filtering the audio information and the video image information according to the target information and the positioning information, and generating a filtered video file according to the filtered audio information and the filtered video image information.
Optionally, the step of filtering the audio information and the video image information according to the target information and the positioning information, and generating a filtered video file according to the filtered audio information and the filtered video image information includes:
respectively splicing a plurality of audio feature vectors contained in the audio information and a plurality of image feature vectors corresponding to the plurality of audio feature vectors into a plurality of video vectors;
sequentially filtering a plurality of video vectors according to the extracted target information and the extracted positioning information;
and sequencing the video vectors according to the time sequence corresponding to the video vectors obtained after filtering, and integrating the sequenced video vectors into a filtered video file.
In a second aspect, the present embodiment further provides an information positioning apparatus based on deep learning, including:
the video information extraction module is used for acquiring audio information and video image information of a video file to be processed, wherein the time sequence of the video image information in the video file to be processed is the same as the time sequence of the audio information in the video file to be processed;
the description conversion module is used for determining first character information according to the audio information and the video image information;
the character integration module is used for inputting the first character information into a trained character integration model to obtain the extracted target information and the positioning information of the target information in the first character information; the text integration model is trained based on the corresponding relation between the text information marked as the target information and the sample text information not marked with the target information.
In a third aspect, a computer device comprises a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method when executing the computer program.
In a fourth aspect, a computer-readable storage medium, on which a computer program is stored, characterized in that the computer program realizes the steps of the method when executed by a processor.
Compared with the prior art, the embodiment of the invention has the following advantages:
according to the method provided by the embodiment of the invention, audio information of a video file to be processed and video image information with the same time sequence as the audio information are obtained, and first character information is determined according to the audio information and the video image information; and inputting the first character information into a trained character integration model to obtain the extracted target information and the positioning information of the target information in the first character information. Therefore, when the method of the invention is used for positioning information, the information positioning of the audio information corresponding to the video file is added while the video image is positioned, so that the whole content of the video is fully considered, the accuracy of video content identification is improved, and the effect of positioning and searching the target information in the video file is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. It is obvious that the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from these drawings without creative effort.
FIG. 1 is a flow chart illustrating the steps of an information positioning method based on deep learning according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an implementation of audio information recognition in an embodiment of the present invention;
FIG. 3 is a schematic diagram of an implementation of image feature vector identification in an embodiment of the present invention;
FIG. 4 is a schematic diagram of the implementation of key information and target information identification in the method of the present invention;
FIG. 5a is a schematic diagram of the filtering effect in the prior art, where audio is not considered during filtering;
FIG. 5b is a schematic diagram of the filtering effect considering the audio and image characteristics according to the embodiment of the present invention;
FIG. 6 is a schematic diagram of an apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer and clearer, the present invention is further described in detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The inventor has found that the information positioning methods in the prior art perform information positioning only on the video image information of a video file and do not consider the audio information of the video file together with the video image information, even though the content of a complete video file comprises two parts, namely the audio information and the video image information.
In order to solve the above problem, the embodiment of the present invention comprehensively considers filtering both the audio information and the video image information in a video file, so as to obtain an accurately filtered video file. In the embodiment of the invention, when a user wants to filter a video file, the audio information of the video file is first acquired, and then the video image information located on the same time sequence as the audio information is acquired; the audio feature vector of the audio information and the image feature vector of the video image information are obtained respectively, the audio feature vector and the image feature vector are converted into text information expressed in words, the target information contained in the converted text information is located, and the located target information is deleted from the video file, so that the target information in the video file is accurately positioned.
For example, the present embodiment can be applied to the following scenario. The terminal equipment may be any device with a video file input function, such as a mobile phone, an iPad or a desktop computer. The user inputs the video file to be processed into the terminal equipment, and the terminal equipment responds to the user's operation of performing information positioning on the video file to be processed and starts information positioning on that video file. When the terminal equipment starts information positioning on the video file to be processed, it first obtains the audio information of the video file and derives the audio feature vector from it, and obtains the image feature vector after acquiring the video image information of the video file. Secondly, the audio feature vector and the image feature vector are combined and converted into a text description, text recognition is performed on the text description as a whole, and the target information contained in the text description is recognized, thereby positioning the target information. The target information may be text (e.g. word or sentence) information that needs to be located during information positioning; for example, it may include at least one type of information, such as any one or more types of information that may cause adverse effects, such as illegal content, pornography or violence, and the target information may of course also include preset keywords such as music or pork.
Exemplary method
Referring to fig. 1, this embodiment is shown as an information positioning method based on deep learning, and the method embodiment may include the steps of:
step S1, obtaining audio information and video image information of a video file to be processed, wherein the time sequence of the video image information in the video file to be processed is the same as the time sequence of the audio information in the video file to be processed.
The time sequence is the time period that the clip corresponding to the audio information or the video image information occupies on the playing time axis of the video file to be processed. For example, if the time period of the audio information on the playing time axis of the video file to be processed is from the 10th second to the 200th second, then its time sequence is from the 10th second to the 200th second, and the video image information of the same time sequence also corresponds to the 10th second to the 200th second of the video file to be processed.
In the embodiment of the invention, the audio information and the video image information are comprehensively filtered to realize the accurate positioning of the information in the video file, so that when the information of the video file needs to be positioned, the audio information and the video image information of the video file to be processed need to be respectively extracted in the step.
Firstly, extracting audio information of the video file to be processed.
There are many known ways to extract audio information from a video file in this step, for example: using an audio extraction tool, such as the audio-extraction function contained in a video or audio application program; converting the format of the video file into the MP3 or MP4 format; or re-recording the audio information in the video file, and the like.
And secondly, acquiring video image information which has the same time sequence with the audio information.
There are likewise many existing ways to acquire the video image information having the same time sequence as the audio information in the video file to be processed, for example: using a video editing tool, such as application software like the video clip APP, the fast clip APP or the master shooting APP, or re-recording the corresponding video information in the video file to be processed with a tool having a video recording function, and the like.
The first step and the second step may also be interchanged, that is, the video image information may be acquired first, and then the audio information having the same time sequence as the video image information may be acquired.
And step S2, determining first character information according to the audio information and the video image information.
In this step, the audio information and the video image information of the video file to be processed, which were acquired in step S1, are converted into text descriptions, and the text descriptions converted from the audio information and the video image information are analyzed to identify whether the audio information and the video image information contain target information. Specifically, step S2 includes the following two steps:
step S21, determining an audio feature vector corresponding to the audio information according to the audio information, and determining an image feature vector corresponding to the video image information according to the video image information.
And step S22, converting the audio characteristic vector and the image characteristic vector into first character information.
Because both the audio information and the video image information need to be converted into text descriptions, this step extracts the audio feature vector of the audio information and obtains a text description of the audio information according to it, and likewise extracts the image feature vector of the video image information and obtains a text description of the video image information according to it.
Specifically, the method for determining the audio feature vector corresponding to the audio information according to the audio information in step S21 includes the following two ways:
the first method is implemented by sampling coding, and specifically, as shown in fig. 2, includes the following steps:
sampling the frequency spectrum of the audio information according to preset sampling points to obtain a sampling frequency spectrum;
encoding the sampled spectrum into the audio feature vector.
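A minimal Python sketch of this first, sampling-based path is given below; it only illustrates the idea and is not the patent's implementation. The frame length, hop size and number of sampling points are assumed values.

```python
import numpy as np

def audio_feature_vector(waveform, n_points=128, frame_len=1024, hop=512):
    """Sample the magnitude spectrum of each audio frame at preset points and
    encode the sampled spectra as a 2-D audio feature matrix.
    n_points, frame_len and hop are illustrative, not taken from the patent."""
    rows = []
    for start in range(0, len(waveform) - frame_len, hop):
        frame = waveform[start:start + frame_len]
        spectrum = np.abs(np.fft.rfft(frame))                 # magnitude spectrum of the frame
        idx = np.linspace(0, len(spectrum) - 1, n_points).astype(int)
        sampled = spectrum[idx]                               # sampling spectrum at the preset points
        rows.append(sampled / (sampled.max() + 1e-8))         # simple normalisation as the "encoding"
    return np.stack(rows)                                     # shape: (num_frames, n_points)

# usage: one second of a 440 Hz tone at 16 kHz
wave = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
print(audio_feature_vector(wave).shape)                       # (num_frames, 128)
```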
The second method is realized by a deep learning model and comprises the following steps:
after the speech contained in the audio information is converted into text information, the text information is converted into a feature-vector expression, namely the audio feature vector, by using a preset self-coding model.
The self-coding model may be a deep-learning-based autoencoder; the weight of each word segment, phrase or short sentence within the text information is adjusted so as to convert the text information into a feature-vector expression, thereby obtaining the audio feature vector output by the self-coding model.
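As a rough illustration of such a self-coding model, the sketch below uses a toy PyTorch autoencoder that compresses a bag-of-words representation of the text information into a low-dimensional feature vector; the vocabulary size, dimensions and reconstruction loss are assumptions, not details given in the patent.

```python
import torch
import torch.nn as nn

class TextAutoEncoder(nn.Module):
    """Toy autoencoder: a bag-of-words text vector is compressed into a
    low-dimensional feature vector (the encoding) and reconstructed back."""
    def __init__(self, vocab_size=5000, feat_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(vocab_size, 512), nn.ReLU(),
                                     nn.Linear(512, feat_dim))
        self.decoder = nn.Sequential(nn.Linear(feat_dim, 512), nn.ReLU(),
                                     nn.Linear(512, vocab_size))

    def forward(self, x):
        z = self.encoder(x)            # feature-vector expression of the text
        return self.decoder(z), z

model = TextAutoEncoder()
bow = torch.rand(4, 5000)              # 4 sentences as bag-of-words counts (placeholder data)
recon, features = model(bow)
loss = nn.functional.mse_loss(recon, bow)   # reconstruction objective used to train the weights
```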
In addition, the step of determining the image feature vector corresponding to the video image information according to the video image information in step S21 includes:
intercepting an image frame of the video image information;
and extracting the characteristic graphs corresponding to the image frames in the video image information respectively, and obtaining the image characteristic vectors corresponding to the image frames respectively according to the characteristic graphs corresponding to the image frames respectively.
The step of intercepting an image frame of the video image information comprises:
cutting the video image information into a plurality of video image frame segments according to the length of a preset video frame;
intercepting the image frames of each video image frame segment to obtain an image frame set corresponding to each video image frame segment, wherein the image frame set corresponding to each video image frame segment comprises each image frame in the video image frame segment;
the step of extracting the feature maps corresponding to the image frames in the video image information comprises the following steps:
and extracting feature maps corresponding to the image frames in the image frame sets respectively.
A feature map is an image that mainly contains the color features, texture features, shape features and spatial-relationship features of the original image; it is a global feature that describes the surface properties of the scene corresponding to the image or to an image region. In this embodiment the shape features of the image are mainly considered. The main idea of obtaining image feature vectors from feature maps is to project the original samples into a low-dimensional feature space, so as to obtain low-dimensional image features that best reflect the nature of the feature map or best distinguish one feature map from another.
In the above steps, the image frames of the video image information are obtained by taking every frame of the video image information or by sampling; the feature maps corresponding to these image frames are extracted, and image feature vectors that reflect the essence of the feature maps are extracted from them.
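The following sketch illustrates, under assumed segment lengths and sampling rates, how the video image information could be cut into frame segments, sampled into image frame sets, and reduced to coarse feature maps; the block-averaged map is only a stand-in for a real feature extractor.

```python
import numpy as np

def frame_sets(video_frames, segment_len=1500, every_k=25):
    """Cut the frame sequence into fixed-length segments and keep one frame out of
    every K frames of each segment; the lengths are illustrative assumptions."""
    sets = []
    for start in range(0, len(video_frames), segment_len):
        segment = video_frames[start:start + segment_len]
        sets.append(segment[::every_k])                  # image frame set for this segment
    return sets

def feature_map(frame, grid=8):
    """Stand-in feature extractor: a block-averaged intensity map that keeps only
    the coarse shape information of a grayscale frame."""
    h, w = frame.shape
    bh, bw = h // grid, w // grid
    cropped = frame[:bh * grid, :bw * grid]
    return cropped.reshape(grid, bh, grid, bw).mean(axis=(1, 3))   # (grid, grid) map

video = np.random.rand(3000, 240, 320)                   # placeholder grayscale frames
sets = frame_sets(video)                                 # two segments of 1500 frames each
maps = [feature_map(f) for f in sets[0]]                 # feature maps of the first frame set
```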
Specifically, the step of obtaining the image feature vectors corresponding to the image frames according to the feature maps corresponding to the image frames includes:
and inputting the image characteristics corresponding to each image frame into a trained convolutional neural network to obtain the image characteristic vectors corresponding to each image frame, wherein the convolutional neural network is trained on the basis of the corresponding relation between the marked image characteristic vectors and the input characteristic graph.
In the training of the convolutional neural network, the marked image feature vector is used as a convolution kernel, and the marked image feature vector is used for performing convolution on the input feature map to obtain the image feature vector of each image frame. The convolutional neural network modifies and marks the image characteristic vectors contained in the unknown characteristic diagram through the image characteristic vectors in the known characteristic diagram, finally outputs all image frames marked by the image characteristic vectors, and outputs the characteristic diagram marked with the image characteristic vectors based on the convolutional neural network to obtain the image characteristic vectors of the characteristic diagram.
Optionally, the step of inputting the feature map into a trained convolutional neural network to obtain an image feature vector of each image frame includes:
and converting the two-dimensional image feature vectors corresponding to the feature maps into one-dimensional image feature vectors corresponding to the feature maps through convolution and pooling operations.
Specifically, after feature-vector conversion through the self-coding model, the text information is converted into a word-vector matrix with M rows and N columns, where the M rows represent the word vectors and the N columns represent the weights that the word vectors occupy in the text information. The video image frames are converted into image feature vectors by using a convolutional neural network model: the two-dimensional matrix corresponding to each frame image is converted into the corresponding low-dimensional image feature vector, i.e. a one-dimensional image feature vector of lower dimensionality than two, by performing convolution and pooling operations on the features extracted from the image.
A convolutional neural network is composed of an input layer, convolutional layers, activation functions, pooling layers and fully-connected layers. The convolutional layers extract the features of the input image, and the pooling layers compress the input feature map, which reduces the feature map, simplifies the computational complexity of the network, and compresses the features so as to extract the main features. The two-dimensional matrix of a video image frame is thus converted into a one-dimensional image feature vector through the convolutional neural network model.
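A minimal PyTorch sketch of this convolution-and-pooling compression is shown below; the channel counts and the 128-dimensional output are assumptions chosen only for illustration.

```python
import torch
import torch.nn as nn

class FrameEncoder(nn.Module):
    """Convolution + pooling that compress a 2-D feature map into a 1-D image
    feature vector; channel sizes and the output dimension are assumptions."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)        # collapse the spatial dimensions
        self.fc = nn.Linear(32, feat_dim)          # project to the 1-D feature vector

    def forward(self, feature_map):                # feature_map: (batch, 1, H, W)
        x = self.pool(self.conv(feature_map)).flatten(1)
        return self.fc(x)                          # (batch, feat_dim)

frame = torch.rand(1, 1, 64, 64)                   # one 2-D feature map (placeholder)
vec = FrameEncoder()(frame)                        # 1-D image feature vector of length 128
```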
The above-mentioned steps of determining the audio feature vector according to the audio information and determining the image feature vector corresponding to the video image information in step S21 are described in detail, and then the step of converting the audio feature vector and the image feature vector into the first text information in step S22 is described in detail.
Since the audio feature vector and the image feature vector are integrated into the text information, whether the video file contains the target information or not can be identified more accurately, and therefore the audio feature vector and the image feature vector need to be converted into the text information, which is referred to as the first text information in this embodiment.
Step S22, the step of converting the audio feature vector and the image feature vector into the first text information specifically includes:
and S221, splicing the audio characteristic vector and the image characteristic vector into a video vector matrix.
In order to integrate the audio feature vector and the image feature vector, in this embodiment, a mode of splicing the two vectors is adopted, the image feature vector is added to the audio feature vector to form a video vector matrix, and the video vector spliced by the image feature vector and the audio feature vector contained in the spliced video vector matrix is converted to obtain the first text information.
The audio feature vector is a two-dimensional vector and the image feature vector is a one-dimensional vector, so splicing them together yields a two-dimensional matrix. If the matrices to which the audio feature vector and the image feature vector belong have different dimensions, default values are added so that both become two-dimensional matrices with the same number of rows and columns; the row data of one feature vector is then added above or below the row data of the other feature vector to splice the audio feature vector and the image feature vector, and the two-dimensional matrix obtained by this splicing is the video vector matrix containing the information of the whole video file.
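The sketch below illustrates one way this splicing with default-value padding could look, assuming the audio feature vectors already form a two-dimensional matrix and each image feature vector is one-dimensional; the padding value and the row-wise stacking are assumptions made for illustration.

```python
import numpy as np

def splice(audio_matrix, image_vectors, default=0.0):
    """Append the 1-D image feature vectors to the 2-D audio feature matrix as extra
    rows, padding the shorter side with a default value so both sides have the same
    number of columns; the padding scheme is an assumption."""
    image_matrix = np.stack(image_vectors)                # (num_frames, feat_dim)
    cols = max(audio_matrix.shape[1], image_matrix.shape[1])
    pad = lambda m: np.pad(m, ((0, 0), (0, cols - m.shape[1])),
                           constant_values=default)
    return np.vstack([pad(audio_matrix), pad(image_matrix)])   # 2-D video vector matrix

audio = np.random.rand(20, 128)                    # audio feature matrix (placeholder)
images = [np.random.rand(100) for _ in range(20)]  # one image feature vector per frame
video_matrix = splice(audio, images)               # shape: (40, 128)
```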
In this embodiment, if the audio feature vector and the image feature vector are converted separately and then the text translation results of the two vectors are obtained separately, the correlation between the audio information and the video image information is not considered, which may result in misjudgment.
In order to avoid misjudgment caused by not considering the correlation between the audio information and the video image information, in this embodiment, a video vector obtained by splicing the audio feature vector and the image feature vector is used for conversion to obtain the first text information. In the embodiment, the image feature vectors and the corresponding audio feature vectors of the same time sequence are spliced into a whole for processing, and the correlation between the audio information and the video image information is fully considered, so that the effect of more accurately identifying whether the target information is contained can be achieved.
For example, suppose the target information is violence-related content contained in the video. If the video image picture of a certain video file shows military news while the audio being played is a report on a terrorist attack, then analyzing only the military-news picture yields a news scene and no violence-related content, which leads to the erroneous judgment that the video file contains no violence-related content. With the method provided by this embodiment, both the audio information and the video image information of the video file are analyzed: when the audio information of the video file is analyzed, it is recognized as audio about the terrorist attack event, and because the audio of the terrorist attack event contains violence-related content, the video file can be judged to contain violence-related content; the audio segment of the terrorist attack event and the video segment corresponding to that audio segment both belong to the target information to be positioned. The method provided by this embodiment therefore produces a more accurate analysis.
Step S222, converting the video vector contained in the video vector matrix into first text information.
Inputting the video vector matrix into a trained content recognition model, outputting the first character information by the content recognition model, and training the content recognition model based on the corresponding relation between the video vector of the marked character information and the video vector of the unmarked character information.
In the training of the content recognition model, the video vector of the marked character information is used as a convolution kernel, and the video vector of the marked character information is utilized to carry out convolution on the video vector of the input unmarked character information to obtain the output character information of the input video vector. For example, the text information marked by a certain video vector is: and the scene, the mountain, the river and the tree, the content identification model identifies the video vector which is not marked with the character information in the input content identification model according to the character information marked by the video vector, marks the video vector which also contains the scene, the mountain, the river and the tree, and realizes the target of marking all the input video vectors based on the same marking method.
And recognizing the character information of the video vector matrix through the trained content recognition model so as to obtain the character information corresponding to each video vector in the video vector matrix.
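As a hedged illustration of a content recognition model, the toy sketch below scores each video vector against a small fixed vocabulary and emits the words above a threshold as the text; the vocabulary, threshold and single linear layer are assumptions standing in for the trained model described above.

```python
import torch
import torch.nn as nn

VOCAB = ["scenery", "mountain", "river", "tree", "people", "fight"]   # illustrative labels

class ContentRecognizer(nn.Module):
    """Toy content recognition model: scores each video vector against a small word
    vocabulary; words above a threshold form the text description of that vector."""
    def __init__(self, vec_dim=128):
        super().__init__()
        self.classifier = nn.Linear(vec_dim, len(VOCAB))

    def describe(self, video_vectors, threshold=0.5):
        probs = torch.sigmoid(self.classifier(video_vectors))    # (num_vectors, vocab)
        texts = []
        for row in probs:
            words = [VOCAB[i] for i, p in enumerate(row) if p > threshold]
            texts.append(" ".join(words) or "unknown")
        return texts

model = ContentRecognizer()
vectors = torch.rand(5, 128)                 # five video vectors (placeholder data)
print(model.describe(vectors))               # first text information, one line per vector
```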
Step S3, inputting the first text information into a trained text integration model to obtain extracted target information and positioning information of the target information in the first text information; the character integration model is trained based on the corresponding relation between the sample character information marked as the target information and the sample character information not marked with the target information.
The positioning information is the position information of the target information in the first text information. It is to be understood that the positioning information can embody the position of the object information in the video image information and the audio information, respectively. For example: if the target information is located in a second section second line to a third line and a third section first line of the first text information, the contents of the second section second line to the third line in the first text information correspond to the positions of the 30 th frame to the 50 th frame of the image video frame, the contents of the second section second line to the third line in the first text information correspond to the positions of the 3 rd second to the 5 th second of the audio information, the contents of the third section first line in the first text information correspond to the positions of the 100 th frame to the 120 th frame of the image video frame, and the contents of the third section first line in the first text information correspond to the 10 th second to the 12 th second of the audio information; the positioning information of the target information is the second line to the third line of the second segment and the first line of the third segment of the first text information, and the positioning information of the target information embodies that the target information is located in the 30 th frame to the 50 th frame and the 100 th frame to the 120 th frame in the image video information, and the target information is located in the 3 rd second to the 5 th second and the 10 th second to the 12 th second in the audio information.
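A small sketch of how such positioning information might be mapped back to frame and second ranges is given below; the alignment table is assumed to be recorded while the first text information is generated, and the numbers simply mirror the example above.

```python
# Hypothetical alignment between positions in the first text information and the
# image-frame / audio-second ranges they were generated from.
alignment = {
    "para2-line2-3": {"frames": (30, 50), "seconds": (3, 5)},
    "para3-line1":   {"frames": (100, 120), "seconds": (10, 12)},
}

def locate(target_positions):
    """Return the frame ranges and audio second ranges covered by the target information."""
    return [(alignment[p]["frames"], alignment[p]["seconds"]) for p in target_positions]

print(locate(["para2-line2-3", "para3-line1"]))
# [((30, 50), (3, 5)), ((100, 120), (10, 12))]
```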
In a possible implementation manner, the text integration model used in this embodiment is an encoding-decoding model, and the encoding-decoding model includes: an encoding layer, an attention layer, and a decoding layer;
specifically, the step of inputting the first text information into a trained text integration model to obtain the extracted target information and the positioning information of the target information in the first text information includes the following steps (a minimal sketch of the model follows these steps):
converting the first text information into a text sequence, inputting the text sequence into the coding layer, and outputting a hidden layer sequence after the text sequence is hidden and coded;
inputting the hidden layer sequence into the attention layer, and outputting key information contained in the hidden layer sequence;
inputting the identified key information and the hidden layer sequence into the decoding layer, and extracting a theme of the hidden layer sequence and target information which is analyzed and positioned based on the theme;
and obtaining the positioning information of the target information in the first text information according to the target information.
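The sketch below is a minimal PyTorch illustration of an encoding-attention-decoding stack of this kind; the GRU layers, the dimensions, and reading the attention weights as key-information scores are assumptions rather than the patent's exact model.

```python
import torch
import torch.nn as nn

class TextIntegrator(nn.Module):
    """Minimal encoding-attention-decoding sketch: a GRU encoder hides the text
    sequence, an attention layer weights the hidden states, and a GRU decoder emits
    the integrated output; all dimensions are illustrative."""
    def __init__(self, vocab=5000, emb=64, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)      # encoding layer
        self.attn = nn.Linear(hidden, 1)                          # attention layer
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)   # decoding layer
        self.out = nn.Linear(hidden, vocab)

    def forward(self, text_ids):
        hidden_seq, _ = self.encoder(self.embed(text_ids))        # hidden-layer sequence
        weights = torch.softmax(self.attn(hidden_seq), dim=1)     # key-information weights
        context = weights * hidden_seq                             # attended hidden states
        decoded, _ = self.decoder(context)
        return self.out(decoded), weights                          # per-position scores

model = TextIntegrator()
ids = torch.randint(0, 5000, (1, 40))             # a 40-token text sequence (placeholder)
scores, attention = model(ids)                    # attention peaks mark candidate target spans
```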
In order to achieve more convenient acquisition of the position of the target information identified in the video file and determine the position as the reason of the target information, after the steps of extracting the theme of the hidden layer sequence and analyzing the located target information based on the theme, the method further includes:
and displaying the target information positioned by the extracted theme, and outputting analysis information of the target information.
For example, when analyzing the text information translated from the video vectors, if a picture titled "fight" and words related to "fight" appear inside a children's song about a bird, the violent picture and its corresponding audio information are located in the corresponding time sequence, the violent picture and its corresponding audio information are output and displayed, and the filtering reason is given as: violence-related. Specifically, the "fight" picture is determined first; if its time sequence is from the 10th second to the 200th second, the audio information in the same time sequence, i.e. from the 10th second to the 200th second, is located according to that time sequence, the violent picture and the acquired audio information of the same time sequence are output and displayed synchronously, and the reason for positioning the video corresponding to that time sequence is given as: violence-related.
In one implementation, when determining the image feature vector corresponding to the video image information, in order to identify and delete image feature vectors quickly according to the positioning information of the image frames, step S21 cuts the video image information into a plurality of video image frame segments according to a preset video frame length, and intercepts the image frames of each video image frame segment to obtain the image frame set corresponding to each segment, where the image frame set of a segment contains every image frame in that segment. The step of extracting the feature maps corresponding to the image frames in the video image information then extracts the feature maps corresponding to the image frames in each image frame set, and the feature vectors corresponding to the image frames are obtained from these feature maps. For example, a ten-minute video is divided into one-minute segments and the frame information is then collected per one-minute segment. Because the image frames in this step belong to different video image frame segments, the image frames corresponding to the different segments can be searched synchronously and in parallel, so the positions of the image frames can be located quickly.
Specifically, the method of the present invention is further described with reference to fig. 2 to 5 by taking an embodiment of the method as an example.
First, as shown in fig. 2, first, audio information of a video file needs to be extracted from the video file, and the audio information is subjected to sampling coding to obtain an audio feature vector corresponding to the audio information.
Specifically, as shown in fig. 3, video image frame sets are screened from the video file according to their time span, the text information corresponding to a video image frame set is converted into a vector representation through the self-coding model, and the audio feature vector is spliced with the next frame set to generate a new feature vector matrix used to complete the training of the neural network.
Because the amount of information contained in a video file is too large, the video file needs to be preprocessed: it is cut into video segments of a certain length, one frame out of every K frames within a segment is taken, and these frames are cut into frame sets of length J, so that a frame set contains the information of K x J frames of the video segment. Each frame in a frame set is assigned a corresponding weight in the neural network and mapped into a two-dimensional matrix containing the frame image information. The backbone of the model is a CNN (convolutional neural network) whose main function is to convert the two-dimensional matrix corresponding to each frame into the corresponding image feature vector through convolution and pooling operations, after which the audio feature vector is added to the image feature vectors. Because the audio feature vector is added to the image feature vectors, and much of the information that a real video intends to express is carried by the audio information, the method recognizes the information contained in the video file more comprehensively than the traditional approach of analyzing only the image content and ignoring the audio content, and improves the accuracy of target-information positioning in the video file.
Secondly, the image feature vectors contained in the spliced feature vector matrix need to be converted to obtain the corresponding text expression. In this application, the neural network model used in this step is obtained by training based on a Transformer neural network. After the complete feature-vector representation of the video information is obtained, the image feature vectors it contains are translated into text descriptions. The feature-vector representation is much longer than a plain text representation, and the Transformer model is better suited to this task than an RNN-based translation model in terms of both computational complexity and long-sequence translation, so this model is used to recognize the video content more quickly and accurately.
Thirdly, information integration is carried out on the translated text information in the steps, and keywords and target information contained in the text information are identified.
Specifically, as shown in fig. 4, the text information integration module used in this step is a coding-decoding model commonly used in the NLP field, and the model is composed of a coding layer, an attention layer, and a decoding layer.
The step of inputting the first text information into a trained text integration model to obtain the extracted target information and the positioning information of the target information in the first text information comprises:
K1, converting the first text information into a text sequence, inputting the text sequence into the coding layer, and outputting the hidden layer sequence obtained by hidden coding of the text sequence through the calculation of the coding layer;
K2, inputting the hidden layer sequence into the attention layer and outputting the key information contained in the hidden layer sequence; the attention layer is the key part for understanding the content of the whole first text information, because the attention mechanism in this layer retains the information of the whole sequence well and does not ignore the information at the beginning of the sentence;
K3, inputting the identified key information and the hidden layer sequence into the decoding layer, and extracting the theme of the hidden layer sequence and the target information analyzed based on the theme;
K4, obtaining the positioning information of the target information according to the position information of the target information in the first text information.
In order to facilitate the user to obtain the specific content of the target information located in the video file, after the step of extracting the theme of the hidden layer sequence and analyzing the located target information based on the theme in the step K3, the method further includes:
and K31, displaying the extracted target information and outputting analysis information of the target information.
The user can conveniently know the reason why the piece of content is positioned through the displayed target information and the analysis information, so that the positioned information can be more fully known.
According to the information positioning method disclosed by the embodiment, when information is positioned, analysis of audio information corresponding to a video file is added on the basis of analyzing image video information, and the image video information and the audio information are combined, so that the overall analysis of the whole video file is realized, the accuracy of video content identification is improved, and technical guarantee is provided for accurately identifying a certain type of information or one or more character information in the video file.
As network technology develops, the network becomes ever more widespread, and sharing self-shot videos over the network is increasingly popular with users. However, since the spread of video information over the network is largely unrestricted, a large number of videos with undesirable content are also transmitted on the network. The spread of such videos troubles users, and for teenagers who lack the ability to discriminate, the guidance of the bad information in these videos can cause an even more serious influence.
In order to implement filtering of bad information in video information in the network, after step S3 in the above embodiment, the method further includes the steps of:
and filtering the audio information and the video image information according to the extracted target information and the extracted positioning information, and generating a filtered video file according to the filtered audio information and the filtered video image information.
By using the method for positioning the target information in the above embodiment, the target information in the video file is positioned to obtain the position information of the target information in the video file, and then the target information is filtered according to the positioned positioning information of the target information, so as to obtain the video file after the target information is filtered.
For example: when the target information is bad information, the positioning method of the embodiment is used for positioning the position information of the bad information in the video file, and then filtering the bad information contained in the video file according to the positioned position information to obtain a filtered video file, so that a clean audio-visual environment is provided for the network video.
In order to filter all the target information in the video file and achieve the filtering accuracy, the positioning method in the above embodiment is used to position both the target information contained in the audio information and the target information contained in the video image information in the video file, and both the target information contained in the audio information and the target information contained in the video image information are filtered according to the position of the positioned target information.
Specifically, the step of filtering the audio information and the video image information by the extracted target information and the extracted positioning information, and generating a filtered video file according to the filtered audio information and the filtered video image information includes:
splicing the plurality of audio feature vectors contained in the audio information and the plurality of corresponding image feature vectors contained in the video image information into a plurality of video vectors;
sequentially filtering a plurality of video vectors according to the extracted target information and the extracted positioning information;
and sequencing the video vectors according to the time sequence corresponding to the video vectors obtained after filtering, and integrating the sequenced video vectors into a filtered video file.
When the audio feature vectors and the image feature vectors contained in the audio information and the video image information are spliced, the audio feature vector and the corresponding image feature vector of the same time sequence may be spliced directly, or the one-dimensional image feature vector may be inserted as a column into the matrix in which the audio feature vectors lie; the splicing yields the video vector matrix, and each column vector in the video vector matrix is called a video vector.
According to the target information and the positioning information of the target information obtained by the analysis in the above steps, the content at the corresponding positions is deleted from the video vectors in sequence to obtain a plurality of filtered video vectors. For example, when the target information is a violent picture located from the 20th frame to the 100th frame, the violent pictures are filtered and deleted according to the positions of those pictures of the target information.
And integrating the video vectors with the target information deleted into a complete video file according to the time sequence of the video vectors.
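A short sketch of this filter-and-reassemble step is given below; representing each video vector as a (frame index, vector) pair is an assumption made purely for illustration.

```python
def filter_and_rebuild(video_vectors, target_frame_ranges):
    """Drop the video vectors whose frame index falls inside any located target range,
    then reorder the remaining vectors by their time sequence; video_vectors is a list
    of (frame_index, vector) pairs - the pair layout is an assumption for illustration."""
    def in_target(idx):
        return any(lo <= idx <= hi for lo, hi in target_frame_ranges)
    kept = [(idx, vec) for idx, vec in video_vectors if not in_target(idx)]
    kept.sort(key=lambda pair: pair[0])            # restore the original time order
    return [vec for _, vec in kept]                # filtered sequence to re-encode as video

# usage: remove the located "violence" span, frames 20-100
vectors = [(i, f"vec{i}") for i in range(0, 200, 10)]
clean = filter_and_rebuild(vectors, [(20, 100)])
```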
In order to let the user know the filtered target information, after the step of filtering the audio information and the video image information according to the extracted target information and the extracted positioning information, the method further includes:
and displaying the extracted subject and the filtered target information, and outputting filtering analysis information of the target information.
The user can know the theme and the detailed content corresponding to the target information contained in the video file according to the displayed theme of the filtering information and the content of the filtered target information, so that the user can know the video file more in detail.
The neural network model used in this step is a model based on an LSTM neural network; its input is the video vector matrix formed by combining the image feature vectors corresponding to the frame sets of the video file with the audio feature vectors corresponding to the audio file, and its output is the positioned video file. This module obtains a video that preserves the complete content of the video as much as possible while filtering out the small portion of bad information, rather than deleting the whole video.
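A minimal sketch of such an LSTM-based filter is shown below; predicting a per-time-step keep/drop probability from the spliced video vector matrix is an assumed reading of this module, and the sizes are illustrative.

```python
import torch
import torch.nn as nn

class FilterNet(nn.Module):
    """Minimal sketch of an LSTM-based filter: for each column of the spliced video
    vector matrix it predicts keep (1) or drop (0); the sizes are assumptions."""
    def __init__(self, vec_dim=128, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(vec_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, video_matrix):               # (batch, time, vec_dim)
        states, _ = self.lstm(video_matrix)
        return torch.sigmoid(self.head(states)).squeeze(-1)   # keep-probability per step

net = FilterNet()
matrix = torch.rand(1, 40, 128)                    # spliced video vector matrix (placeholder)
keep = net(matrix) > 0.5                           # boolean mask used to assemble the output video
```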
The video filtering method of this embodiment filters the target information contained in the video file, so that the filtered video file no longer contains content that may have negative effects, such as illegal content, pornography and violence, which provides a healthy viewing environment for video users and technical support for the normal distribution of video files.
Compared with the prior art, the filtering method of this embodiment additionally filters the target information in the audio information of the video file, and identifies and filters the audio information and the video image information jointly and synchronously, so a more accurate filtering effect can be obtained. As shown in fig. 5a and 5b, when the pictures in the input video file show two people, recognition based on the video images alone can only determine that two people are talking, but not what they are talking about; since the pictures themselves contain no relevant target information, a prior-art filtering method that analyzes only these pictures would not filter them. With the filtering method of this embodiment, however, the analysis not only determines from the pictures that two people are talking, but also analyzes the target information in their conversation. For example, if the spoken content contains profanity, the analysis concludes that the video frames show two people talking and that the conversation contains profanity, and the reason for filtering is given as: the conversation content contains target information. Therefore, compared with the prior art, the filtering method disclosed by the invention achieves a more accurate filtering effect.
Another embodiment of the present invention is an information positioning apparatus based on deep learning, as shown in fig. 6, including:
The video information extraction module 610 is configured to obtain audio information and video image information of a video file to be processed, where the time sequence of the video image information in the video file to be processed is the same as the time sequence of the audio information in the video file to be processed; its function is as described in step S1.
The description conversion module 620 is configured to determine first text information according to the audio information and the video image information; its function is as described in step S2.
The text integration module 630 is configured to input the first text information into a trained text integration model to obtain the extracted target information and the positioning information of the target information in the first text information, the text integration model being trained based on the correspondence between sample text information annotated with target information and sample text information not annotated with target information; its function is as described in step S3.
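For orientation only (the patent describes the apparatus functionally rather than as code), the three modules can be viewed as a thin pipeline; the following sketch uses hypothetical class, parameter and method names:

```python
class InformationPositioningApparatus:
    """Illustrative sketch of the apparatus: modules 610, 620 and 630 chained together."""

    def __init__(self, video_info_extractor, description_converter, text_integration_model):
        self.video_info_extractor = video_info_extractor      # module 610
        self.description_converter = description_converter    # module 620
        self.text_integration_model = text_integration_model  # module 630

    def locate_target_information(self, video_path):
        # Module 610: time-aligned audio information and video image information.
        audio_info, image_info = self.video_info_extractor(video_path)
        # Module 620: first text information derived from the audio and image features.
        first_text = self.description_converter(audio_info, image_info)
        # Module 630: extracted target information and its positioning information.
        return self.text_integration_model(first_text)
```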
Yet another embodiment of the present invention is a computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method when executing the computer program.
Yet another embodiment of the invention is a computer-readable storage medium having a computer program stored thereon, wherein the computer program realizes the steps of the method when executed by a processor.
The invention provides an information positioning method based on deep learning and related devices. Audio information of a video file to be processed and video image information with the same time sequence as the audio information are obtained, first text information is determined according to the audio information and the video image information, and the first text information is input into a trained text integration model to obtain the extracted target information and the positioning information of the target information in the first text information. When positioning information with this method, both the information contained in the video images and the information contained in the corresponding audio information are located, so all the content of the video file is taken into account and the accuracy of information positioning within the video content is improved. In addition, the method achieves accurate positioning of the target information, which in turn enables accurate filtering of the corresponding undesirable information in the video file and provides technical support for a well-regulated online video environment.
It should be understood that equivalents and modifications of the technical solution and inventive concept thereof may occur to those skilled in the art, and all such modifications and alterations should fall within the scope of the appended claims.

Claims (16)

1. An information positioning method based on deep learning, which is characterized by comprising the following steps:
acquiring audio information and video image information of a video file to be processed, wherein the time sequence of the video image information in the video file to be processed is the same as the time sequence of the audio information in the video file to be processed;
determining first text information according to the audio information and the video image information;
inputting the first text information into a trained text integration model to obtain extracted target information and positioning information of the target information in the first text information, wherein the text integration model is trained based on the correspondence between sample text information annotated with target information and sample text information not annotated with target information.
2. The method according to claim 1, wherein the step of determining first text information according to the audio information and the video image information comprises:
determining an audio feature vector corresponding to the audio information according to the audio information, and determining an image feature vector corresponding to the video image information according to the video image information;
and converting the audio feature vector and the image feature vector into first text information.
3. The method according to claim 2, wherein the step of converting the audio feature vector and the image feature vector into first text information comprises:
splicing the audio feature vector and the image feature vector into a video vector matrix;
and translating the video vector contained in the video vector matrix into first text information.
4. The method of claim 3, wherein the step of translating the video vector contained in the video vector matrix into the first text information comprises:
inputting the video vector matrix into a trained content recognition model, and outputting the first text information by the content recognition model, wherein the content recognition model is trained based on the correspondence between video vectors annotated with text information and video vectors not annotated with text information.
5. The method according to claim 2, wherein the step of determining the audio feature vector corresponding to the audio information according to the audio information comprises:
sampling the frequency spectrum of the audio information according to preset sampling points to obtain a sampling frequency spectrum;
encoding the sampled spectrum into the audio feature vector.
6. The method according to claim 3, wherein the step of determining the image feature vector corresponding to the video image information according to the video image information comprises:
intercepting image frames of the video image information;
and extracting the feature maps respectively corresponding to the image frames in the video image information, and obtaining the image feature vectors respectively corresponding to the image frames according to the feature maps respectively corresponding to the image frames.
7. The method of claim 6, wherein the step of intercepting the image frames of the video image information comprises:
cutting the video image information into a plurality of video image frame segments according to a preset video frame length;
intercepting the image frames of each video image frame segment to obtain an image frame set corresponding to each video image frame segment, wherein the image frame set corresponding to each video image frame segment comprises each image frame in the video image frame segment;
the step of extracting the feature maps corresponding to the image frames in the video image information comprises the following steps:
and extracting feature maps corresponding to the image frames in the image frame sets respectively.
8. The method according to claim 6 or 7, wherein the step of obtaining the image feature vector corresponding to each image frame according to the feature map corresponding to each image frame comprises:
and inputting the feature map corresponding to each image frame into a trained convolutional neural network to obtain the image feature vector corresponding to each image frame, wherein the convolutional neural network is trained based on the correspondence between annotated image feature vectors and the input feature maps.
9. The method according to claim 8, wherein the step of inputting the feature map into a trained convolutional neural network to obtain an image feature vector of each image frame comprises:
and converting the two-dimensional image feature vectors corresponding to the feature maps into one-dimensional image feature vectors corresponding to the feature maps through convolution and pooling operations.
10. The method according to claim 9, wherein the step of splicing the audio feature vector and the image feature vector into a video vector matrix comprises:
and merging the one-dimensional image feature vectors corresponding to the feature maps, as column vectors, into the two-dimensional matrix corresponding to the audio feature vectors to obtain a two-dimensional video vector matrix.
11. The deep learning-based information positioning method according to any one of claims 1 to 10, wherein the text integration model is an encoding-decoding model, and the encoding-decoding model comprises: an encoding layer, an attention layer, and a decoding layer;
the step of inputting the first text information into a trained text integration model to obtain the extracted target information and the positioning information of the target information in the first text information comprises:
converting the first text information into a text sequence, inputting the text sequence into the encoding layer, and outputting a hidden layer sequence obtained by encoding the text sequence;
inputting the hidden layer sequence into the attention layer, and outputting key information contained in the hidden layer sequence;
inputting the identified key information and the hidden layer sequence into the decoding layer, and extracting a theme of the hidden layer sequence and target information analyzed based on the theme;
and obtaining the positioning information of the target information according to the position information of the target information in the first character information.
12. The method according to claim 11, wherein after the steps of extracting the theme of the hidden layer sequence and analyzing the positioned target information based on the theme, the method further comprises:
and displaying the extracted target information, and outputting analysis information of the target information.
13. The deep learning based information positioning method according to any one of claims 1-10, further comprising:
and filtering the audio information and the video image information according to the target information and the positioning information, and generating a filtered video file according to the filtered audio information and the filtered video image information.
14. The method of claim 13, wherein the step of filtering the audio information and the video image information according to the target information and the positioning information, and generating a filtered video file according to the filtered audio information and video image information comprises:
respectively splicing a plurality of audio feature vectors contained in the audio information and a plurality of image feature vectors corresponding to the plurality of audio feature vectors into a plurality of video vectors;
sequentially filtering a plurality of video vectors according to the extracted target information and the extracted positioning information;
and sequencing the video vectors according to the time sequence corresponding to the video vectors obtained after filtering, and integrating the sequenced video vectors into a filtered video file.
15. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 14 when executing the computer program.
16. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 14.
CN201910718176.3A 2019-08-05 2019-08-05 Information positioning method based on deep learning and related equipment Pending CN112328830A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910718176.3A CN112328830A (en) 2019-08-05 2019-08-05 Information positioning method based on deep learning and related equipment

Publications (1)

Publication Number Publication Date
CN112328830A true CN112328830A (en) 2021-02-05

Family

ID=74319773

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910718176.3A Pending CN112328830A (en) 2019-08-05 2019-08-05 Information positioning method based on deep learning and related equipment

Country Status (1)

Country Link
CN (1) CN112328830A (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109344781A (en) * 2018-10-11 2019-02-15 上海极链网络科技有限公司 Expression recognition method in a kind of video based on audio visual union feature
CN109257622A (en) * 2018-11-01 2019-01-22 广州市百果园信息技术有限公司 A kind of audio/video processing method, device, equipment and medium

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113240004A (en) * 2021-05-11 2021-08-10 北京达佳互联信息技术有限公司 Video information determination method and device, electronic equipment and storage medium
CN113240004B (en) * 2021-05-11 2024-04-30 北京达佳互联信息技术有限公司 Video information determining method, device, electronic equipment and storage medium
CN115529460A (en) * 2021-10-29 2022-12-27 深圳小悠娱乐科技有限公司 Method for realizing dynamic mosaic based on content coding
CN114581749A (en) * 2022-05-09 2022-06-03 城云科技(中国)有限公司 Audio-visual feature fusion target behavior identification method and device and application
CN114581749B (en) * 2022-05-09 2022-07-26 城云科技(中国)有限公司 Audio-visual feature fusion target behavior identification method and device and application

Similar Documents

Publication Publication Date Title
US10692480B2 (en) System and method of reading environment sound enhancement based on image processing and semantic analysis
CN113301430B (en) Video clipping method, video clipping device, electronic equipment and storage medium
CN109218629B (en) Video generation method, storage medium and device
CN113365147B (en) Video editing method, device, equipment and storage medium based on music card point
CN112328830A (en) Information positioning method based on deep learning and related equipment
CN111061915B (en) Video character relation identification method
CN109408672B (en) Article generation method, article generation device, server and storage medium
CN112995749B (en) Video subtitle processing method, device, equipment and storage medium
CN113395578A (en) Method, device and equipment for extracting video theme text and storage medium
CN111144360A (en) Multimode information identification method and device, storage medium and electronic equipment
CN116645624A (en) Video content understanding method and system, computer device, and storage medium
CN114372172A (en) Method and device for generating video cover image, computer equipment and storage medium
CN114186074A (en) Video search word recommendation method and device, electronic equipment and storage medium
CN114357206A (en) Education video color subtitle generation method and system based on semantic analysis
CN111986259A (en) Training method of character and face detection model, auditing method of video data and related device
CN111008295A (en) Page retrieval method and device, electronic equipment and storage medium
CN116524906A (en) Training data generation method and system for voice recognition and electronic equipment
CN116186258A (en) Text classification method, equipment and storage medium based on multi-mode knowledge graph
CN110209880A (en) Video content retrieval method, Video content retrieval device and storage medium
CN114500879A (en) Video data processing method, device, equipment and storage medium
CN113762056A (en) Singing video recognition method, device, equipment and storage medium
CN114390369A (en) Dynamic cover generation method, device, equipment and storage medium
WO2023160515A1 (en) Video processing method and apparatus, device and medium
CN114697761B (en) Processing method, processing device, terminal equipment and medium
CN117033308B (en) Multi-mode retrieval method and device based on specific range

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination