CN111985612B - Encoder network model design method for improving video text description accuracy - Google Patents

Encoder network model design method for improving video text description accuracy

Info

Publication number
CN111985612B
CN111985612B (application CN202010706821.2A)
Authority
CN
China
Prior art keywords
semantic
video
training
feature
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010706821.2A
Other languages
Chinese (zh)
Other versions
CN111985612A (en)
Inventor
朱虹
熊鸽
潘晓容
杨恺庆
刘晶晶
杜森
王栋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian University of Technology
Original Assignee
Xian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Technology filed Critical Xian University of Technology
Priority to CN202010706821.2A priority Critical patent/CN111985612B/en
Publication of CN111985612A publication Critical patent/CN111985612A/en
Application granted granted Critical
Publication of CN111985612B publication Critical patent/CN111985612B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 - Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 - Retrieval characterised by using metadata automatically derived from the content
    • G06F16/7844 - Retrieval using original textual content or text extracted from visual content or transcript of audio data
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 - Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 - Retrieval characterised by using metadata automatically derived from the content
    • G06F16/7847 - Retrieval using low-level visual features of the video content
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/30 - Semantic analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/044 - Recurrent networks, e.g. Hopfield networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method for designing an encoder network model that improves the accuracy of video text description, comprising the following steps: step 1, constructing a visual feature extraction encoder for the video; step 2, constructing a semantic feature extraction encoder for the video; step 3, obtaining semantic features; and step 4, training an S-LSTM network model. In the encoding network model, the characteristics of the video are exploited to extract more accurate semantic information, and the differences between semantic words are amplified, so that more accurate semantic features are obtained. After the semantic features are fed into the decoding network, a text description of the video is generated. Compared with the algorithm indices published in the mainstream papers retrieved to date, the accuracy of the video text description is significantly improved.

Description

Encoder network model design method for improving video text description accuracy
Technical Field
The invention belongs to the technical field of video text description algorithms, and relates to an encoder network model design method for improving video text description accuracy.
Background
A video text description algorithm automatically outputs a natural-language description of the content of a given video. Such algorithms have important significance and wide practical application. For example, faced with massive amounts of video data, video text description can be used to rapidly analyze the videos a user clicks on, so that personalized services can be provided for the user; the text descriptions generated by a video text description algorithm can also be used to intelligently review short videos uploaded by users. In addition, video text description has important applications in early-childhood education assistance, video retrieval, helping people with low vision obtain information more easily, and so on.
In the process of video text description, the video needs to be converted into text for output, so accurately extracting the semantic information contained in the video plays an important role. Accurate semantic information is a precondition for outputting a video text description, and this part of the work is completed in the encoder of the model. However, the prior art suffers from inaccurate output information and low output speed in this respect.
Disclosure of Invention
The invention aims to provide a method for designing an encoder network model that improves the accuracy of video text description, solving the problem in the prior art that inaccurate semantic information extraction during video text description leads to inaccurate output text descriptions.
The technical scheme adopted by the invention is an encoder network model design method for improving video text description accuracy, implemented according to the following steps:
step 1, constructing a visual feature extraction encoder for the video;
step 2, constructing a semantic feature extraction encoder for the video;
step 3, obtaining the semantic features;
step 4, training an S-LSTM network model.
The method has the advantage that, in the encoding network model, the characteristics of the video are exploited to extract more accurate semantic information, and the differences between semantic words are amplified, so that more accurate semantic features are obtained; after the semantic features are fed into the decoding network, a text description of the video is generated. Compared with the algorithm indices published in the mainstream papers retrieved to date, the accuracy of the video text description is significantly improved.
Drawings
FIG. 1 is a block diagram of the overall architecture of a video semantic extraction encoding network model of the method of the present invention;
FIG. 2 is a structural flow chart of the Highway Layer module in the video semantic extraction encoding network model of the method of the present invention;
FIG. 3 is a structural flow chart of the amplified word difference module in the video semantic extraction encoding network model of the method of the present invention;
FIG. 4 is a structural flow chart of the decoding network model employed by the present invention.
Detailed Description
The invention will be described in detail below with reference to the drawings and the detailed description.
The invention relates to a design method of an encoder network model for improving the accuracy of video text description, which is implemented according to the following steps:
step 1, constructing a visual feature extraction encoder for the video, the specific process being as follows,
1.1) establishing a training data set,
training a deep-learning network requires a data set consisting of a large number of labeled samples; since labeling the data oneself has certain limitations and involves an enormous workload, published data sets are selected as the training-set samples in this step;
in this embodiment, the video samples of the published MSVD data set and the visual features of a number of videos from the MSRVTT data set are selected as the samples of the training set for the semantic feature extraction network; within the MSRVTT data set, after randomly selecting data samples for the training set, the remaining video samples are used as the samples of the validation set;
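As a minimal illustration of the data split described above (a sketch only: the patent does not specify the split ratio or how samples are identified, so the 80/20 fraction and the integer ids below are assumptions), the training/validation partition of the MSRVTT samples could be formed as follows:

```python
import random

def split_msrvtt(video_ids, train_fraction=0.8, seed=0):
    """Randomly split MSRVTT video ids into training and validation subsets.

    The 80/20 ratio and integer ids are illustrative assumptions; the patent
    only states that training samples are drawn at random and the remaining
    videos form the validation set.
    """
    rng = random.Random(seed)
    ids = list(video_ids)
    rng.shuffle(ids)
    cut = int(len(ids) * train_fraction)
    return ids[:cut], ids[cut:]

# Example: 10,000 MSRVTT clips indexed 0..9999
train_ids, val_ids = split_msrvtt(range(10000))
```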
1.2) establishing a semantic dictionary,
the K most commonly used words are selected from the labels of the training-set samples and the validation-set samples to form a semantic concept set (K is an empirical value chosen according to the text description domain; the preferred range of K is [250, 400]);
assuming the total number of training-set samples is N, for the i-th video sample, i = 1, 2, ..., N, semantic dictionary labeling is performed on the training-set samples with the selected K words, the semantic attribute label being computed as:
Y_i = [y_{i,1}, y_{i,2}, ..., y_{i,K}],  i = 1, 2, ..., N    (1)
where y_{i,k} = 1 if the k-th word of the semantic dictionary appears in the labels of the i-th video sample, and y_{i,k} = 0 otherwise    (2)
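A minimal sketch of the dictionary construction and the multi-hot labeling of equations (1) and (2), assuming each sample's annotations are available as lists of tokenized caption words (the function and variable names are illustrative, not taken from the patent):

```python
from collections import Counter

def build_semantic_dictionary(all_caption_tokens, K=300):
    """Select the K most frequently used words over all training/validation captions."""
    counts = Counter(w for tokens in all_caption_tokens for w in tokens)
    return [w for w, _ in counts.most_common(K)]

def semantic_attribute_label(sample_tokens, dictionary):
    """Multi-hot label Y_i: y_{i,k} = 1 if the k-th dictionary word occurs in the sample's captions."""
    present = set(sample_tokens)
    return [1 if w in present else 0 for w in dictionary]

# dictionary = build_semantic_dictionary(train_and_val_tokens, K=300)
# Y_i = semantic_attribute_label(tokens_of_sample_i, dictionary)
```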
1.3) constructing a visual feature extraction encoder,
the basic model structure of the visual feature extraction encoder constructed in this step consists of the two-dimensional convolutional network ResNeXt and the three-dimensional convolutional network ECO (both convolutional networks are prior art and can be found in published papers and professional books); the M_1-dimensional feature vector output by the conv5/block3 pooling layer of the two-dimensional convolutional network ResNeXt is taken to describe the two-dimensional visual characteristics of the video, where, according to the ResNeXt network structure, M_1 is preferably 2048; the M_2-dimensional feature vector output by the global pooling layer of the three-dimensional convolutional network ECO is taken to describe the three-dimensional visual characteristics of the video, where, according to the ECO network structure, M_2 is preferably 1536; the two feature vectors are concatenated and used as the visual features of the video, so the visual feature vector corresponding to each video has M = M_1 + M_2 dimensions, and the visual feature expression of the N training-set samples is as shown in equation (3):
X_i = [x_{i,1}, x_{i,2}, ..., x_{i,M}],  i = 1, 2, ..., N    (3)
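A sketch of the concatenation in equation (3), assuming the ResNeXt conv5/block3 pooled feature (M_1 = 2048) and the ECO global-pooled feature (M_2 = 1536) have already been extracted for a video; the extractor calls themselves are omitted, since the patent relies on the standard published networks:

```python
import torch

def video_visual_feature(resnext_feat: torch.Tensor, eco_feat: torch.Tensor) -> torch.Tensor:
    """Concatenate the 2-D (ResNeXt, 2048-d) and 3-D (ECO, 1536-d) features into X_i (M = 3584)."""
    assert resnext_feat.shape[-1] == 2048 and eco_feat.shape[-1] == 1536
    return torch.cat([resnext_feat, eco_feat], dim=-1)

# x_i = video_visual_feature(torch.randn(2048), torch.randn(1536))  # x_i.shape == (3584,)
```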
step 2, constructing a semantic feature extraction encoder for the video, the specific process being as follows,
2.1) designing the structure of the video semantic feature extraction network,
the structure of the video semantic feature extraction network is shown in FIG. 1; the visual features X_i = [x_{i,1}, x_{i,2}, ..., x_{i,M}] obtained in step 1 are taken as input and passed through two FC layers (namely FC1 and FC2) to reduce the dimension, yielding an M_s-dimensional feature vector (M_s is an empirical value, preferably M_s = 512) denoted H_i; H_i is then passed through a Highway Layer module and an FC3 layer, and an M_c-dimensional feature vector (M_c is an empirical value, preferably M_c = 300), denoted R_i, is obtained through a Sigmoid activation operation; finally, an M_c-dimensional semantic feature vector, denoted Z_i, is obtained through the amplified word difference module;
2.1.1) constructing the Highway Layer module,
the visual feature X_i = [x_{i,1}, x_{i,2}, ..., x_{i,M}], i = 1, 2, ..., N, obtained in step 1 generally contains a large amount of redundant information; therefore, in the structure shown in FIG. 1, in addition to the dimension reduction of the two fully connected layers, this step strengthens the encoding of the visual features through a Highway Layer module, so that the obtained visual features of the video are more accurate; the Highway Layer module is described in detail below;
as shown in FIG. 1, the input of the Highway Layer module is the M_s-dimensional feature vector H_i obtained after the visual feature X_i passes through the FC1 and FC2 operations; as shown in FIG. 2, the Highway Layer module consists of two parts, a Transform gate and a Carry gate:
first, the Transform gate applies the FC4 fully connected operation to the input information H_i followed by a Sigmoid nonlinear transformation, yielding a feature vector denoted L_i, computed as:
L_i = σ(W_t · H_i + b_t),  i = 1, 2, ..., N    (4)
where σ denotes the Sigmoid activation function, and the parameters W_t and b_t are obtained by network training;
second, the output of the Carry gate is J_i = 1 - L_i, i = 1, 2, ..., N;
the input information H_i is also passed through an FC5 nonlinear transformation, and the resulting feature vector, denoted D_i, is computed as:
D_i = f_act(W_g · H_i + b_g),  i = 1, 2, ..., N    (5)
where f_act denotes the ReLU activation function, and the parameters W_g and b_g are obtained by network training;
using the Transform gate and the Carry gate, the Highway Layer module fuses the input feature H_i and the nonlinearly transformed feature D_i to obtain the output vector of the Highway Layer module, denoted U_i, computed as:
U_i = L_i · D_i + J_i · H_i,  i = 1, 2, ..., N    (6)
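A minimal PyTorch sketch of the Highway Layer of equations (4) to (6); the assumption that FC4 and FC5 keep the M_s dimension unchanged, and the class and variable names, are illustrative rather than taken from the patent:

```python
import torch
import torch.nn as nn

class HighwayLayer(nn.Module):
    """Highway layer: U_i = L_i * D_i + (1 - L_i) * H_i, equations (4)-(6)."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.fc4 = nn.Linear(dim, dim)  # Transform gate, equation (4)
        self.fc5 = nn.Linear(dim, dim)  # nonlinear transform of the input, equation (5)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        l = torch.sigmoid(self.fc4(h))   # L_i, Transform gate output
        d = torch.relu(self.fc5(h))      # D_i, ReLU-transformed input
        return l * d + (1.0 - l) * h     # U_i, with Carry gate weight J_i = 1 - L_i

# u = HighwayLayer(512)(torch.randn(8, 512))  # a batch of 8 samples with M_s = 512
```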
2.1.2) constructing the amplified word difference module,
in order to accurately predict the semantic information contained in the video features, this step uses the semantic attribute labels as guidance to predict the words contained in the visual features of the video, obtaining video semantic features rich in semantic information. However, in the process of predicting words, if the differences between the probabilities of the predicted words are not large, the words used to represent the video features cannot express the semantic information contained in the video well. Therefore, this step adds an amplified word difference module to the video semantic extraction network to amplify the differences between semantic words, so as to obtain the words that are more important for expressing the video semantic information.
The structure of the amplified word difference module is shown in FIG. 3. Its input is the feature vector U_i output by the Highway Layer module in step 2.1.1); U_i is passed through a fully connected layer FC3 and activated by the Sigmoid function to obtain an M_c-dimensional feature vector denoted R_i; R_i is then passed through the two fully connected layers FC6 and FC7 to obtain two feature vectors with different expressions, denoted A_i and B_i respectively, computed as:
A_i = W_a · R_i + b_a,  i = 1, 2, ..., N    (7)
B_i = W_b · R_i + b_b,  i = 1, 2, ..., N    (8)
where the parameters W_a, W_b, b_a and b_b are obtained by network training;
the feature vector A_i and the feature vector B_i are multiplied element-wise to obtain the feature vector T_i, and the score of the word in each dimension, denoted Q_i, is obtained through the Sigmoid activation operation, computed as:
t_{i,ks} = a_{i,ks} · b_{i,ks},  i = 1, 2, ..., N,  ks = 1, 2, ..., M_c    (9)
q_{i,ks} = σ(t_{i,ks}),  i = 1, 2, ..., N,  ks = 1, 2, ..., M_c    (10)
where σ denotes the Sigmoid activation function;
the element-wise multiplication makes the dimensions with larger values in the feature even larger and those with smaller values even smaller, thereby amplifying the differences between words; the result after the Sigmoid function represents the score of each dimension of the word feature;
the word scores Q_i are multiplied element-wise with the input vector R_i to obtain the semantic feature vector Z_i, computed as:
z_{i,ks} = q_{i,ks} · r_{i,ks},  i = 1, 2, ..., N,  ks = 1, 2, ..., M_c    (11)
multiplying the scores Q_i with the corresponding elements of the word semantic features R_i plays the role of feature selection over the word feature dimensions and amplifies the differences between different word features;
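A sketch of the amplified word difference module of equations (7) to (11), together with the overall encoder path of FIG. 1 (FC1/FC2, Highway Layer, FC3 with Sigmoid, then the word difference module). It reuses the HighwayLayer sketch above; the width of FC1/FC2 and the absence of activations between them are assumptions, since the patent only states that the two layers reduce the dimension to M_s:

```python
import torch
import torch.nn as nn

class AmplifiedWordDifference(nn.Module):
    """Equations (7)-(11): amplify differences between word scores and select word features."""
    def __init__(self, mc: int = 300):
        super().__init__()
        self.fc6 = nn.Linear(mc, mc)
        self.fc7 = nn.Linear(mc, mc)

    def forward(self, r: torch.Tensor) -> torch.Tensor:
        a = self.fc6(r)             # A_i, equation (7)
        b = self.fc7(r)             # B_i, equation (8)
        q = torch.sigmoid(a * b)    # word scores Q_i, equations (9)-(10)
        return q * r                # semantic features Z_i, equation (11)

class SemanticEncoder(nn.Module):
    """FIG. 1 pipeline: X_i (3584-d) -> H_i (512-d) -> U_i -> R_i (300-d) -> Z_i (300-d)."""
    def __init__(self, m: int = 3584, ms: int = 512, mc: int = 300):
        super().__init__()
        self.fc1 = nn.Linear(m, ms)
        self.fc2 = nn.Linear(ms, ms)
        self.highway = HighwayLayer(ms)   # defined in the previous sketch
        self.fc3 = nn.Linear(ms, mc)
        self.word_diff = AmplifiedWordDifference(mc)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.fc2(self.fc1(x))         # H_i after FC1 and FC2
        u = self.highway(h)               # U_i
        r = torch.sigmoid(self.fc3(u))    # R_i after FC3 and Sigmoid
        return self.word_diff(r)          # Z_i

# z = SemanticEncoder()(torch.randn(8, 3584))  # z.shape == (8, 300)
```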
2.2) setting the parameters for training the network model,
the training process parameters are empirical values; the preferred values in this step are Epoch ∈ [500, 2000] and learning rate ∈ [0.0001, 0.0005]; the Batch size is selected according to the hardware used for the computation, with preferred values Batch size ∈ {64, 128, 256, 512};
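A training-loop sketch consistent with these parameter ranges; the patent does not specify the loss function or optimizer, so the multi-label binary cross-entropy against the semantic attribute labels Y_i of equation (1) (which assumes K = M_c) and the Adam optimizer used below are assumptions:

```python
import torch
import torch.nn as nn

def train_semantic_encoder(model, loader, epochs=1000, lr=3e-4, device="cpu"):
    """Fit the encoder to multi-hot semantic labels (the BCE objective is an assumption)."""
    model.to(device)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    bce = nn.BCELoss()  # encoder output is already Sigmoid-scaled to [0, 1]
    for epoch in range(epochs):
        for x, y in loader:  # x: (B, M) visual features, y: (B, K) semantic labels
            x, y = x.to(device), y.to(device).float()
            loss = bce(model(x), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```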
step 3, obtaining the semantic features, the specific process being as follows,
3.1) selecting a plurality of semantic feature models,
when the semantic feature extraction network model of step 2 is trained, this step saves the models at several points of the iterative process, i.e., the models at iterations N_1, N_2, ..., N_d are saved; N_1, N_2, ..., N_d are all empirical values, preferably d = 5, N_1 = 100, N_2 = 200, N_3 = 400, N_4 = 800, N_5 = 1000;
3.2) obtaining the semantic features,
the video features X_i = [x_{i,1}, x_{i,2}, ..., x_{i,M}] obtained in step 1 are fed into each of the d models obtained in step 3.1), yielding d semantic features, denoted Z_i(k_d) = [z_{i,1}(k_d), z_{i,2}(k_d), ..., z_{i,M_c}(k_d)], i = 1, 2, ..., N, k_d = 1, 2, ..., d; the mean average precision (mAP) of the semantic features Z_i(k_d) generated by the different models is computed to measure the accuracy of semantic feature generation, and finally the semantic feature with the highest mAP is selected as the semantic feature attribute corresponding to the video sample (i.e., one of the d = 5 results is selected for each training sample), denoted Z_i*;
the mAP calculation process can be found in the relevant professional literature; for convenience of explanation, it is listed below:
mAP = (1/K) Σ_{k=1}^{K} AP_k
where K is the number of words in the constructed semantic dictionary, i.e., the number of semantic labels, and AP_k is the index measuring the label prediction accuracy of the k-th word, computed as:
AP_k = (1/N_G) Σ_{j=1}^{N_G} P_{k,j}
where N_G is the number of decision-boundary settings under which the corresponding label is predicted correctly, and P_{k,j} is the label prediction precision for the k-th word under the j-th decision boundary, computed as:
P_{k,j} = TP_{k,j} / (TP_{k,j} + FP_{k,j})
where TP_{k,j} is the number of samples whose label for the k-th word is predicted correctly under the j-th decision boundary, and FP_{k,j} is the number of samples whose label for the k-th word is predicted incorrectly under the j-th decision boundary;
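A sketch of the checkpoint-selection step: per-word average precision is computed over the samples and averaged into mAP, and the checkpoint whose semantic features score highest is kept. The standard average-precision routine is used here as a stand-in for AP_k, and the helper names are illustrative:

```python
import numpy as np
from sklearn.metrics import average_precision_score

def mean_average_precision(y_true: np.ndarray, y_score: np.ndarray) -> float:
    """mAP over the K semantic words.

    y_true:  (N, K) multi-hot ground-truth semantic labels Y_i.
    y_score: (N, K) predicted semantic features Z_i(k_d).
    """
    aps = []
    for k in range(y_true.shape[1]):
        if y_true[:, k].any():  # skip words that never occur, for which AP is undefined
            aps.append(average_precision_score(y_true[:, k], y_score[:, k]))
    return float(np.mean(aps))

def select_best_semantic_features(y_true, feature_sets):
    """Among the d saved checkpoints, keep the semantic features with the highest mAP."""
    maps = [mean_average_precision(y_true, z) for z in feature_sets]
    best = int(np.argmax(maps))
    return feature_sets[best], maps[best]
```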
step 4, training the S-LSTM network model, the specific process being as follows,
4.1) feature-encoding the words,
all preset labels of the training samples are taken as a corpus, and the pre-labeled captions in the corpus are segmented into words, yielding K_s words; the words are sorted by frequency of use from high to low, and the top K_w words are selected as the lexicon, K_w being an empirical value; the remaining K_s - K_w words are represented by <unk>, and the terminator of a sentence is represented by <eos>; each word in the dictionary is given an integer index starting from 0;
in this step, word feature encoding is implemented with the word-embedding pre-training model GloVe released by Stanford University (the model can be retrieved from published papers and professional books), i.e., the M_c-dimensional coding features of the K_w words are obtained, denoted s(k_c), k_c = 1, 2, ..., K_w;
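A sketch of the lexicon construction and the GloVe lookup; the default K_w, the GloVe file name glove.6B.300d.txt and the 300-dimensional variant are assumptions chosen to match M_c = 300 (pre-trained GloVe vectors are distributed as plain-text files with one word and its vector per line):

```python
import numpy as np
from collections import Counter

def build_lexicon(captions, K_w=10000):
    """Top K_w most frequent words, indexed from 0, plus the <unk> and <eos> tokens."""
    counts = Counter(w for caption in captions for w in caption.split())
    words = [w for w, _ in counts.most_common(K_w)]
    return {w: i for i, w in enumerate(words + ["<unk>", "<eos>"])}

def load_glove_embeddings(word2idx, glove_path="glove.6B.300d.txt", dim=300):
    """Look up each lexicon word in a GloVe text file; words not found stay zero-initialized."""
    emb = np.zeros((len(word2idx), dim), dtype=np.float32)
    with open(glove_path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if parts[0] in word2idx:
                emb[word2idx[parts[0]]] = np.asarray(parts[1:], dtype=np.float32)
    return emb
```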
4.2) training the S-LSTM network model,
the S-LSTM network model is shown in FIG. 4; the training sample set is {X_i, Z_i*, S_i}, i = 1, 2, ..., N, where X_i is the visual feature of the video extracted in step 1, Z_i* is the video semantic feature obtained in step 3, and S_i = {s_{i,0}, s_{i,1}, ..., s_{i,t}} is a descriptive sentence of the training video, i.e., the word sequence composed of the word features obtained in step 4.1);
the training samples are input into the S-LSTM network model shown in FIG. 4, and the model is trained;
at this point, the trained encoding network is used to obtain the semantic features and visual features of the video, which are input into the decoding network S-LSTM to complete the text description of the video.
Compared with the algorithm indices published in the mainstream papers retrieved to date, the accuracy of the text description of the video is significantly improved: over multiple tests, the CIDEr score reaches 106.5 on the international standard test set MSVD and 54.0 on the standard test set MSRVTT.

Claims (1)

1. An encoder network model design method for improving video text description accuracy, characterized by comprising the following steps:
step 1, constructing a visual feature extraction encoder for the video, the specific process being as follows,
1.1) establishing a training data set,
the video samples of the published MSVD data set and the visual features of a number of videos from the MSRVTT data set are selected as the samples of the training set for the semantic feature extraction network; within the MSRVTT data set, after randomly selecting data samples for the training set, the remaining video samples are used as the samples of the validation set;
1.2) establishing a semantic dictionary,
the K most commonly used words are selected from the labels of the training-set samples and the validation-set samples to form a semantic concept set;
assuming the total number of training-set samples is N, for the i-th video sample, i = 1, 2, ..., N, semantic dictionary labeling is performed on the training-set samples with the selected K words, the semantic attribute label being computed as:
Y_i = [y_{i,1}, y_{i,2}, ..., y_{i,K}],  i = 1, 2, ..., N    (1)
where y_{i,k} = 1 if the k-th word of the semantic dictionary appears in the labels of the i-th video sample, and y_{i,k} = 0 otherwise    (2)
1.3) constructing a visual feature extraction encoder,
the basic model structure of the visual feature extraction encoder constructed in this step consists of the two-dimensional convolutional network ResNeXt and the three-dimensional convolutional network ECO; the M_1-dimensional feature vector output by the conv5/block3 pooling layer of the two-dimensional convolutional network ResNeXt is taken to describe the two-dimensional visual characteristics of the video; the M_2-dimensional feature vector output by the global pooling layer of the three-dimensional convolutional network ECO is taken to describe the three-dimensional visual characteristics of the video; the two feature vectors are concatenated and used as the visual features of the video, so the visual feature vector corresponding to each video has M = M_1 + M_2 dimensions, and the visual feature expression of the N training-set samples is as shown in equation (3):
X_i = [x_{i,1}, x_{i,2}, ..., x_{i,M}],  i = 1, 2, ..., N;    (3)
step 2, constructing a semantic feature extraction encoder for the video, the specific process being as follows,
2.1) designing the structure of the video semantic feature extraction network,
the visual features X_i = [x_{i,1}, x_{i,2}, ..., x_{i,M}] obtained in step 1 are taken as input and passed through the fully connected layers FC1 and FC2 to reduce the dimension, yielding an M_s-dimensional fully connected feature denoted H_i; H_i is then passed through a Highway Layer module and an FC3 layer, and an M_c-dimensional feature vector R_i is obtained through a Sigmoid activation operation; finally, an M_c-dimensional semantic feature vector, denoted Z_i, is obtained through the amplified word difference module;
2.1.1) constructing the Highway Layer module,
the input of the Highway Layer module is the M_s-dimensional feature vector H_i obtained after the visual feature X_i passes through the FC1 and FC2 operations; the Highway Layer module consists of two parts, a Transform gate and a Carry gate:
first, the Transform gate applies the FC4 fully connected operation to the input information H_i followed by a Sigmoid nonlinear transformation, yielding a feature vector denoted L_i, computed as:
L_i = σ(W_t · H_i + b_t),  i = 1, 2, ..., N    (4)
where σ denotes the Sigmoid activation function, and the parameters W_t and b_t are obtained by network training;
second, the output of the Carry gate is J_i = 1 - L_i, i = 1, 2, ..., N;
the input information H_i is also passed through an FC5 nonlinear transformation, and the resulting feature vector, denoted D_i, is computed as:
D_i = f_act(W_g · H_i + b_g),  i = 1, 2, ..., N    (5)
where f_act denotes the ReLU activation function, and the parameters W_g and b_g are obtained by network training;
using the Transform gate and the Carry gate, the Highway Layer module fuses the input feature H_i and the nonlinearly transformed feature D_i to obtain the output vector of the Highway Layer module, denoted U_i, computed as:
U_i = L_i · D_i + J_i · H_i,  i = 1, 2, ..., N;    (6)
2.1.2) constructing the amplified word difference module,
an amplified word difference module is added to the video semantic extraction network to amplify the differences between semantic words, so as to obtain the words that are more important for expressing the video semantic information;
step 3, obtaining the semantic features, the specific process being as follows,
3.1) selecting a plurality of semantic feature models,
when the semantic feature extraction network model of step 2 is trained, this step saves the models at several points of the iterative process, i.e., the models at iterations N_1, N_2, ..., N_d are saved; N_1, N_2, ..., N_d are all empirical values, where d = 5, N_1 = 100, N_2 = 200, N_3 = 400, N_4 = 800, N_5 = 1000;
3.2) obtaining the semantic features,
the video features X_i = [x_{i,1}, x_{i,2}, ..., x_{i,M}] obtained in step 1 are fed into each of the d models obtained in step 3.1), yielding d semantic features, denoted Z_i(k_d) = [z_{i,1}(k_d), z_{i,2}(k_d), ..., z_{i,M_c}(k_d)], i = 1, 2, ..., N, k_d = 1, 2, ..., d; the mean average precision mAP of the semantic features Z_i(k_d) generated by the different models is computed to measure the accuracy of semantic feature generation, and finally the semantic feature with the highest mAP is selected as the semantic feature attribute corresponding to the video sample, denoted Z_i*;
the mAP calculation process is as follows:
mAP = (1/K) Σ_{k=1}^{K} AP_k
where K is the number of words in the constructed semantic dictionary, i.e., the number of semantic labels, and AP_k is the index measuring the label prediction accuracy of the k-th word, computed as:
AP_k = (1/N_G) Σ_{j=1}^{N_G} P_{k,j}
where N_G is the number of decision-boundary settings under which the corresponding label is predicted correctly, and P_{k,j} is the label prediction precision for the k-th word under the j-th decision boundary, computed as:
P_{k,j} = TP_{k,j} / (TP_{k,j} + FP_{k,j})
where TP_{k,j} is the number of samples whose label for the k-th word is predicted correctly under the j-th decision boundary, and FP_{k,j} is the number of samples whose label for the k-th word is predicted incorrectly under the j-th decision boundary;
step 4, training the S-LSTM network model, the specific process being as follows,
4.1) feature-encoding the words,
all preset labels of the training samples are taken as a corpus, and the pre-labeled captions in the corpus are segmented into words, yielding K_s words; the words are sorted by frequency of use from high to low, and the top K_w words are selected as the lexicon, K_w being an empirical value; the remaining K_s - K_w words are represented by <unk>, and the terminator of a sentence is represented by <eos>; each word in the dictionary is given an integer index starting from 0;
the feature encoding of the words is implemented with the word-embedding pre-training model GloVe released by Stanford University, i.e., the M_c-dimensional coding features of the K_w words are obtained, denoted s(k_c), k_c = 1, 2, ..., K_w;
4.2) training the S-LSTM network model,
in the S-LSTM network model, the training sample set is {X_i, Z_i*, S_i}, i = 1, 2, ..., N, where X_i is the visual feature of the video extracted in step 1, Z_i* is the video semantic feature obtained in step 3, and S_i = {s_{i,0}, s_{i,1}, ..., s_{i,t}} is a descriptive sentence of the training video, i.e., the word sequence composed of the word features s(k_c), k_c = 1, 2, ..., K_w, obtained in step 4.1);
the training samples are input into the S-LSTM network model, and the S-LSTM network model is trained;
at this point, the trained encoding network is used to obtain the semantic features and visual features of the video, which are input into the decoding network S-LSTM to complete the text description of the video.
CN202010706821.2A 2020-07-21 2020-07-21 Encoder network model design method for improving video text description accuracy Active CN111985612B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010706821.2A CN111985612B (en) 2020-07-21 2020-07-21 Encoder network model design method for improving video text description accuracy

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010706821.2A CN111985612B (en) 2020-07-21 2020-07-21 Encoder network model design method for improving video text description accuracy

Publications (2)

Publication Number Publication Date
CN111985612A CN111985612A (en) 2020-11-24
CN111985612B true CN111985612B (en) 2024-02-06

Family

ID=73439530

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010706821.2A Active CN111985612B (en) 2020-07-21 2020-07-21 Encoder network model design method for improving video text description accuracy

Country Status (1)

Country Link
CN (1) CN111985612B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112580362B (en) * 2020-12-18 2024-02-20 西安电子科技大学 Visual behavior recognition method, system and computer readable medium based on text semantic supervision
CN112364850B (en) * 2021-01-13 2021-04-06 北京远鉴信息技术有限公司 Video quality inspection method and device, electronic equipment and storage medium
CN112733866B (en) * 2021-01-27 2023-09-26 湖南千里云医疗科技有限公司 Network construction method for improving text description correctness of controllable image
CN113269093B (en) * 2021-05-26 2023-08-22 大连民族大学 Visual feature segmentation semantic detection method and system in video description

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS6410300A (en) * 1987-07-03 1989-01-13 Hitachi Ltd User's interface system for searching
CN107038221A (en) * 2017-03-22 2017-08-11 杭州电子科技大学 A kind of video content description method guided based on semantic information
WO2018188240A1 (en) * 2017-04-10 2018-10-18 北京大学深圳研究生院 Cross-media retrieval method based on deep semantic space
CN109299262A (en) * 2018-10-09 2019-02-01 中山大学 A kind of text implication relation recognition methods for merging more granular informations
CN110991290A (en) * 2019-11-26 2020-04-10 西安电子科技大学 Video description method based on semantic guidance and memory mechanism

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Liu Shuoyan; Xu De; Feng Songhe; Liu Di; Qiu Zhengding. "A Visual Word Generation Algorithm for Image Blocks Based on Contextual Semantic Information." Acta Electronica Sinica, 2010, No. 5, full text. *
Zhang Lihong; Cao Liubin. "Research on a Video Description Method Based on Deep Transfer Learning." Journal of Test and Measurement Technology, 2018, No. 5, full text. *

Also Published As

Publication number Publication date
CN111985612A (en) 2020-11-24


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant