CN106778926A - Image captioning method based on a visual attention model - Google Patents

Image captioning method based on a visual attention model

Info

Publication number
CN106778926A
Authority
CN
China
Prior art keywords
image
sentinel
vector
model
vision
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201611207945.6A
Other languages
Chinese (zh)
Inventor
夏春秋 (Xia Chunqiu)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Vision Technology Co Ltd
Original Assignee
Shenzhen Vision Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Vision Technology Co Ltd filed Critical Shenzhen Vision Technology Co Ltd
Priority to CN201611207945.6A
Publication of CN106778926A
Legal status: Withdrawn (current)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10: Character recognition
    • G06V30/19: Recognition using electronic means
    • G06V30/192: Recognition using electronic means using simultaneous comparisons or correlations of the image signals with a plurality of references
    • G06V30/194: References adjustable by an adaptive method, e.g. learning
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The present invention proposes an image captioning method based on a visual attention model. Its main components are data input, preprocessing, an adaptive attention model, and image caption output. The process is as follows: first, an image dataset is used in which images depict people performing various actions and contain multiple objects in the context of complex scenes, with each image paired with 5 manually annotated captions; then, in preprocessing, caption lengths are truncated and the dataset is fed into an encoder to extract spatial image features; finally, the features are fed to a trained adaptive spatial attention model based on a "visual sentinel" gate, allowing a machine to perform the task of automatically generating image captions and to obtain the natural language description corresponding to an image. In image recognition, compared with template-based methods, the present invention performs best; it can also help visually impaired users, and makes it easy for users to organize and navigate large amounts of typically unstructured visual data.

Description

Image captioning method based on a visual attention model
Technical field
The present invention relates to the field of image recognition, and in particular to an image captioning method based on a visual attention model.
Background art
With the rapid development of science and technology, attention-based neural encoder-decoder frameworks have been widely applied to image captioning in the field of image recognition, i.e., automatically recognizing image content and describing it in natural language. However, the decoder may require little or no visual information from the image to predict non-visual words, and seemingly visual words can often be predicted reliably from the language model alone. An image captioning method based on a visual attention model can solve the problem of low-quality automatically generated image captions: it automatically determines when to rely on visual signals and when to rely only on the language model.
The present invention proposes an image captioning method based on a visual attention model. First, it uses an image dataset that depicts people performing various actions and contains multiple objects in the context of complex scenes, with each image paired with 5 manually annotated captions. Then, in preprocessing, caption lengths are truncated and the dataset is fed into an encoder to extract spatial image features. Finally, these features are fed to a trained adaptive spatial attention model based on a "visual sentinel" gate, allowing a machine to perform the task of automatically generating image captions and to obtain the natural language description corresponding to the image. In image recognition, compared with template-based methods, the present invention performs best; it can also help visually impaired users, and makes it easy for users to organize and navigate large amounts of typically unstructured visual data.
Summary of the invention
In view of the low quality of automatically generated image captions, it is an object of the present invention to provide an image captioning method based on a visual attention model.
To solve the above problem, the present invention provides an image captioning method based on a visual attention model, whose main contents include:
(1) data input;
(2) preprocessing;
(3) adaptive attention model;
(4) image caption output.
The image captioning method based on a visual attention model includes a novel spatial attention model for extracting spatial image features, and an adaptive attention mechanism that introduces a new long short-term memory (LSTM) extension producing an additional "visual sentinel" vector rather than a single hidden state. The "visual sentinel" is an additional latent representation of the decoder's memory that provides the decoder with a fallback option. A new sentinel gate is further derived from the "visual sentinel"; it decides how much new information the decoder obtains from the image, as opposed to relying on the "visual sentinel", when generating the next word.
The data input employs a scene-object dataset. Most images in the scene-object dataset depict people performing various actions and contain multiple objects in the context of complex scenes, and each image has 5 manually annotated captions.
In preprocessing, captions in the scene-object dataset whose length exceeds 18 words are truncated; a vocabulary is then built of the words that occur at least 5 times and at least 3 times, respectively, in the training sets.
The adaptive attention model includes an encoder, a spatial attention model, a sentinel gate and a decoder. It can automatically determine when to rely on visual signals and when to rely only on the language model, and when relying on visual signals, the model also determines which region of the image to attend to.
Further, the encoder obtains a representation of the image using a convolutional neural network, taking the spatial feature output of the last convolutional layer of ResNet, of size 2048 × 7 × 7. We use A = {a_1, …, a_k}, a_i ∈ R^2048 to denote the spatial CNN features at each of the k grid locations. The global image feature is obtained as:

a^g = (1/k) Σ_{i=1}^{k} a_i        (1)
where a^g is the global image feature. For modeling convenience, a single-layer perceptron with a rectifier activation function is used to transform the image feature vectors into new vectors with dimension d:
v_i = ReLU(W_a a_i)        (2)
v^g = ReLU(W_b a^g)        (3)

where W_a and W_b are weight parameters; the transformed spatial image features form V = [v_1, …, v_k].
Further, the spatial attention model is used to compute the context vector c_t, defined as:

c_t = g(V, h_t)        (4)

where g is the attention function, V = [v_1, …, v_k], v_i ∈ R^d are the spatial image features, each a d-dimensional representation corresponding to a part of the image, and h_t is the hidden state of the recurrent neural network at time t.
Given the spatial image features V ∈ R^{d×k} and the LSTM hidden state h_t ∈ R^d, they are fed through a single-layer neural network followed by a softmax function to produce the attention distribution over the k regions of the image:

z_t = w_h^T tanh(W_v V + (W_g h_t) 1^T)        (5)
α_t = softmax(z_t)        (6)

where 1 ∈ R^k is a vector with all elements set to 1, W_v, W_g ∈ R^{k×d} and w_h ∈ R^k are the parameters to be learned, and α ∈ R^k is the attention weight over the features in V. Based on the attention distribution, the context vector c_t is obtained by:

c_t = Σ_{i=1}^{k} α_{ti} v_{ti}        (7)
c_t and h_t are then combined to predict the next word y_{t+1} through the formula log p(y_t | y_1, …, y_{t−1}, I) = f(h_t, c_t).
Further, the sentinel gate extends the LSTM to obtain the "visual sentinel" vector s_t:

g_t = σ(W_x x_t + W_h h_{t−1})        (8)
s_t = g_t ⊙ tanh(m_t)        (9)

where W_x and W_h are the weight parameters to be learned, x_t is the input to the LSTM at time step t, and g_t is the gate applied to the memory cell m_t; ⊙ denotes the element-wise product and σ the logistic sigmoid activation.
Based on the "visual sentinel", an adaptive attention model is proposed to compute a new context vector ĉ_t, modeled as a mixture of the spatially attended image feature (i.e., the context vector of the spatial attention model) and the "visual sentinel" vector. The mixture model is defined as:

ĉ_t = β_t s_t + (1 − β_t) c_t        (10)

where β_t is the new sentinel gate at time t. In our mixture model, β_t lies in [0, 1]: a value of 1 means only the "visual sentinel" information is used, and 0 means only the spatial image information is used when generating the next word.
To compute the new sentinel gate β_t, the spatial attention component is modified. Specifically, an extra element is appended to z, the vector of attention scores defined in equation (5); this element indicates how much "attention" the network places on the sentinel (as opposed to the image features). Adding this extra element turns equation (6) into:

α̂_t = softmax([z_t ; w_h^T tanh(W_s s_t + W_g h_t)])        (11)

where [· ; ·] denotes concatenation, and W_s and W_g are weight parameters; notably, W_g is the same weight parameter as in equation (5). α̂_t ∈ R^{k+1} is the attention distribution over both the spatial image features and the "visual sentinel" vector. The last element of this vector is interpreted as the gate value:

β_t = α̂_t[k + 1]        (12)

The probability over the vocabulary of possible words at time t can be computed as:

p_t = softmax(W_p(ĉ_t + h_t))        (13)

where W_p is a weight parameter to be learned. This formulation encourages the model to adaptively consider the image and the "visual sentinel" when generating the next word. The sentinel vector is updated at each time step.
Further, the decoder uses a structure based on a recurrent neural network; the word embedding vector w_t and the global image feature vector v^g are concatenated to obtain the input vector x_t = [w_t; v^g]. A single-layer neural network transforms the "visual sentinel" vector s_t and the LSTM output vector h_t into new vectors with dimension d.
In the image caption output, the extracted spatial image features are fed to the trained adaptive spatial attention model based on the "visual sentinel" gate, allowing the machine to perform the task of automatically generating image captions and to obtain the natural language description corresponding to the image.
Brief description of the drawings
Fig. 1 is the system flow chart of an image captioning method based on a visual attention model according to the present invention.
Fig. 2 shows the scene-object dataset of an image captioning method based on a visual attention model according to the present invention.
Fig. 3 shows the model architecture diagram of an image captioning method based on a visual attention model according to the present invention.
Specific embodiments
It should be noted that, where no conflict arises, the embodiments of this application and the features of those embodiments may be combined with one another. The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
Fig. 1 is the system flow chart of an image captioning method based on a visual attention model according to the present invention. The method mainly includes data input, preprocessing, an adaptive attention model, and image caption output.
The data input employs a scene-object dataset. Most images in the scene-object dataset depict people performing various actions and contain multiple objects in the context of complex scenes, and each image has 5 manually annotated captions.
In preprocessing, captions in the scene-object dataset whose length exceeds 18 words are truncated; a vocabulary is then built of the words that occur at least 5 times and at least 3 times, respectively, in the training sets.
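As an illustration of this preprocessing step, the following is a minimal Python sketch; the function name, the special tokens and the list-of-token-lists input format are assumptions for illustration, not taken from the patent:

from collections import Counter

def preprocess(captions, max_len=18, min_count=5):
    # captions: list of token lists, e.g. [["a", "dog", "runs"], ...]
    truncated = [c[:max_len] for c in captions]            # truncate captions longer than 18 tokens
    counts = Counter(tok for c in truncated for tok in c)  # word frequencies over the training split
    vocab = {"<unk>": 0, "<start>": 1, "<end>": 2}         # special tokens (assumed)
    for tok, n in counts.items():
        if n >= min_count:                                 # keep words seen at least min_count times
            vocab.setdefault(tok, len(vocab))
    return truncated, vocab

For the smaller dataset, min_count would be set to 3 instead of 5, matching the two thresholds mentioned above.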
The adaptive attention model includes an encoder, a spatial attention model, a sentinel gate and a decoder. It can automatically determine when to rely on visual signals and when to rely only on the language model, and when relying on visual signals, the model also determines which region of the image to attend to.
Further, the encoder obtains a representation of the image using a convolutional neural network, taking the spatial feature output of the last convolutional layer of ResNet, of size 2048 × 7 × 7. We use A = {a_1, …, a_k}, a_i ∈ R^2048 to denote the spatial CNN features at each of the k grid locations. The global image feature is obtained as:

a^g = (1/k) Σ_{i=1}^{k} a_i        (1)

where a^g is the global image feature. For modeling convenience, a single-layer perceptron with a rectifier activation function is used to transform the image feature vectors into new vectors with dimension d:

v_i = ReLU(W_a a_i)        (2)
v^g = ReLU(W_b a^g)        (3)

where W_a and W_b are weight parameters; the transformed spatial image features form V = [v_1, …, v_k].
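The projection of equations (1) to (3) can be sketched in PyTorch as follows; tensor shapes, the dimension d = 512, and all class and variable names are assumptions for illustration, not a definitive implementation:

import torch
import torch.nn as nn
import torch.nn.functional as F

class EncoderProjection(nn.Module):
    def __init__(self, feat_dim=2048, d=512):
        super().__init__()
        self.W_a = nn.Linear(feat_dim, d)   # projects each a_i, eq. (2)
        self.W_b = nn.Linear(feat_dim, d)   # projects a^g, eq. (3)

    def forward(self, A):
        # A: (batch, k, 2048) spatial features from ResNet's last conv layer (k = 49 for a 7x7 grid)
        a_g = A.mean(dim=1)                 # eq. (1): a^g = (1/k) sum_i a_i
        V = F.relu(self.W_a(A))             # eq. (2): v_i = ReLU(W_a a_i)
        v_g = F.relu(self.W_b(a_g))         # eq. (3): v^g = ReLU(W_b a^g)
        return V, v_g                       # V: (batch, k, d), v_g: (batch, d)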
Further, the spatial attention model is used to compute the context vector c_t, defined as:

c_t = g(V, h_t)        (4)

where g is the attention function, V = [v_1, …, v_k], v_i ∈ R^d are the spatial image features, each a d-dimensional representation corresponding to a part of the image, and h_t is the hidden state of the recurrent neural network at time t.
Given the spatial image features V ∈ R^{d×k} and the LSTM hidden state h_t ∈ R^d, they are fed through a single-layer neural network followed by a softmax function to produce the attention distribution over the k regions of the image:

z_t = w_h^T tanh(W_v V + (W_g h_t) 1^T)        (5)
α_t = softmax(z_t)        (6)

where 1 ∈ R^k is a vector with all elements set to 1, W_v, W_g ∈ R^{k×d} and w_h ∈ R^k are the parameters to be learned, and α ∈ R^k is the attention weight over the features in V. Based on the attention distribution, the context vector c_t is obtained by:

c_t = Σ_{i=1}^{k} α_{ti} v_{ti}        (7)
c_t and h_t are then combined to predict the next word y_{t+1} through the formula log p(y_t | y_1, …, y_{t−1}, I) = f(h_t, c_t).
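A minimal PyTorch sketch of the spatial attention of equations (5) to (7), reusing the projected features V from the encoder sketch above; names and shapes are assumptions:

import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self, d=512, att_dim=512):
        super().__init__()
        self.W_v = nn.Linear(d, att_dim, bias=False)
        self.W_g = nn.Linear(d, att_dim, bias=False)
        self.w_h = nn.Linear(att_dim, 1, bias=False)

    def forward(self, V, h_t):
        # V: (batch, k, d) spatial features; h_t: (batch, d) LSTM hidden state
        z = self.w_h(torch.tanh(self.W_v(V)
                                + self.W_g(h_t).unsqueeze(1))).squeeze(-1)  # eq. (5), (batch, k)
        alpha = torch.softmax(z, dim=1)                     # eq. (6): attention over the k regions
        c_t = torch.bmm(alpha.unsqueeze(1), V).squeeze(1)   # eq. (7): c_t = sum_i alpha_ti v_i
        return c_t, alpha, z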
Further, the sentinel gate extends the LSTM to obtain the "visual sentinel" vector s_t:

g_t = σ(W_x x_t + W_h h_{t−1})        (8)
s_t = g_t ⊙ tanh(m_t)        (9)

where W_x and W_h are the weight parameters to be learned, x_t is the input to the LSTM at time step t, and g_t is the gate applied to the memory cell m_t; ⊙ denotes the element-wise product and σ the logistic sigmoid activation.
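The sentinel extension of equations (8) and (9) can be sketched as follows; x_t, h_prev and the cell state m_t are assumed to come from a standard LSTM cell, and all names are illustrative:

import torch
import torch.nn as nn

class VisualSentinel(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.W_x = nn.Linear(input_dim, hidden_dim, bias=False)
        self.W_h = nn.Linear(hidden_dim, hidden_dim, bias=False)

    def forward(self, x_t, h_prev, m_t):
        g_t = torch.sigmoid(self.W_x(x_t) + self.W_h(h_prev))  # eq. (8): sentinel gate
        s_t = g_t * torch.tanh(m_t)                            # eq. (9): elementwise product with tanh(m_t)
        return s_t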
Based on the "visual sentinel", an adaptive attention model is proposed to compute a new context vector ĉ_t, modeled as a mixture of the spatially attended image feature (i.e., the context vector of the spatial attention model) and the "visual sentinel" vector. The mixture model is defined as:

ĉ_t = β_t s_t + (1 − β_t) c_t        (10)

where β_t is the new sentinel gate at time t. In our mixture model, β_t lies in [0, 1]: a value of 1 means only the "visual sentinel" information is used, and 0 means only the spatial image information is used when generating the next word.
To compute the new sentinel gate β_t, the spatial attention component is modified. Specifically, an extra element is appended to z, the vector of attention scores defined in equation (5); this element indicates how much "attention" the network places on the sentinel (as opposed to the image features). Adding this extra element turns equation (6) into:

α̂_t = softmax([z_t ; w_h^T tanh(W_s s_t + W_g h_t)])        (11)

where [· ; ·] denotes concatenation, and W_s and W_g are weight parameters; notably, W_g is the same weight parameter as in equation (5). α̂_t ∈ R^{k+1} is the attention distribution over both the spatial image features and the "visual sentinel" vector. The last element of this vector is interpreted as the gate value:

β_t = α̂_t[k + 1]        (12)

The probability over the vocabulary of possible words at time t can be computed as:

p_t = softmax(W_p(ĉ_t + h_t))        (13)

where W_p is a weight parameter to be learned. This formulation encourages the model to adaptively consider the image and the "visual sentinel" when generating the next word. The sentinel vector is updated at each time step.
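A sketch of the adaptive attention head of equations (10) to (13), reusing the SpatialAttention module from the earlier sketch; the vocabulary size and all names are assumptions, not the patent's implementation:

import torch
import torch.nn as nn

class AdaptiveAttention(nn.Module):
    def __init__(self, d=512, att_dim=512, vocab_size=10000):
        super().__init__()
        self.spatial = SpatialAttention(d, att_dim)
        self.W_s = nn.Linear(d, att_dim, bias=False)  # sentinel projection in eq. (11)
        self.W_p = nn.Linear(d, vocab_size)           # output projection in eq. (13)

    def forward(self, V, h_t, s_t):
        c_t, _, z = self.spatial(V, h_t)                            # eqs. (5)-(7)
        z_s = self.spatial.w_h(torch.tanh(self.W_s(s_t)
                                          + self.spatial.W_g(h_t)))   # sentinel score, (batch, 1)
        alpha_hat = torch.softmax(torch.cat([z, z_s], dim=1), dim=1)  # eq. (11), over k+1 slots
        beta = alpha_hat[:, -1:]                                      # eq. (12): beta_t = alpha_hat[k+1]
        c_hat = beta * s_t + (1.0 - beta) * c_t                       # eq. (10): mixture of sentinel and context
        p_t = torch.softmax(self.W_p(c_hat + h_t), dim=-1)            # eq. (13): word distribution
        return p_t, beta

Sharing w_h and W_g with the spatial attention module mirrors the remark above that W_g in equation (11) is the same parameter as in equation (5).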
Further, the decoder uses a structure based on a recurrent neural network; the word embedding vector w_t and the global image feature vector v^g are concatenated to obtain the input vector x_t = [w_t; v^g]. A single-layer neural network transforms the "visual sentinel" vector s_t and the LSTM output vector h_t into new vectors with dimension d.
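Putting the pieces together, one decoder step might look as follows, a sketch under the same assumptions that combines the modules defined above:

import torch
import torch.nn as nn

class CaptionDecoderStep(nn.Module):
    def __init__(self, vocab_size, d=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d)
        self.lstm = nn.LSTMCell(2 * d, d)            # input is the concatenation [w_t ; v^g]
        self.sentinel = VisualSentinel(2 * d, d)     # from the sketch above
        self.attend = AdaptiveAttention(d, d, vocab_size)

    def forward(self, word_ids, v_g, V, state):
        h_prev, m_prev = state
        x_t = torch.cat([self.embed(word_ids), v_g], dim=1)  # x_t = [w_t ; v^g]
        h_t, m_t = self.lstm(x_t, (h_prev, m_prev))
        s_t = self.sentinel(x_t, h_prev, m_t)                # eqs. (8)-(9)
        p_t, beta = self.attend(V, h_t, s_t)                 # eqs. (10)-(13)
        return p_t, beta, (h_t, m_t)

A full captioning loop would iterate this step from a start token and zero-initialized (h, m) state, feeding the sampled word back in until an end token is produced.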
In the image caption output, the extracted spatial image features are fed to the trained adaptive spatial attention model based on the "visual sentinel" gate, allowing the machine to perform the task of automatically generating image captions and to obtain the natural language description corresponding to the image.
Fig. 2 shows the scene-object dataset of an image captioning method based on a visual attention model according to the present invention. Most images in the scene-object dataset depict people performing various actions and contain multiple objects in the context of complex scenes, and each image has 5 manually annotated captions.
Fig. 3 shows the model architecture diagram of an image captioning method based on a visual attention model according to the present invention. The model is a novel adaptive attention encoder-decoder framework, including an encoder, a spatial attention model, a sentinel gate and a decoder, which automatically decides when to look at the image and when to rely on the language model to generate the next word.
For those skilled in the art, the present invention is not restricted to the details of the above embodiments, and the present invention can be realized in other specific forms without departing from its spirit and scope. Furthermore, those skilled in the art may make various changes and modifications to the present invention without departing from the spirit and scope of the invention, and these improvements and modifications should also be regarded as within the protection scope of the invention. Therefore, the appended claims are intended to be construed as including the preferred embodiments and all changes and modifications that fall within the scope of the invention.

Claims (10)

1. An image captioning method based on a visual attention model, characterized in that it mainly comprises: data input (1); preprocessing (2); an adaptive attention model (3); and image caption output (4).
2. The image captioning method based on a visual attention model according to claim 1, characterized by comprising: a novel spatial attention model for extracting spatial image features; an adaptive attention mechanism that introduces a new long short-term memory (LSTM) extension, producing an additional "visual sentinel" vector rather than a single hidden state, the "visual sentinel" being an additional latent representation of the decoder's memory that provides the decoder with a fallback option; and a new sentinel gate further derived from the "visual sentinel", which decides how much new information the decoder obtains from the image, as opposed to relying on the "visual sentinel", when generating the next word.
3. The data input (1) according to claim 1, characterized in that a scene-object dataset is employed; most images in the scene-object dataset depict people performing various actions and contain multiple objects in the context of complex scenes, and each image has 5 manually annotated captions.
4. The preprocessing (2) according to claim 1, characterized in that captions in the scene-object dataset whose length exceeds 18 words are truncated; a vocabulary is then built of the words that occur at least 5 times and at least 3 times, respectively, in the training sets.
5. The adaptive attention model (3) according to claim 1, characterized by comprising an encoder, a spatial attention model, a sentinel gate and a decoder; it can automatically determine when to rely on visual signals and when to rely only on the language model, and when relying on visual signals, the model also determines which region of the image to attend to.
6. The encoder according to claim 5, characterized in that a representation of the image is obtained using a convolutional neural network, taking the spatial feature output of the last convolutional layer of ResNet, of size 2048 × 7 × 7; we use A = {a_1, …, a_k}, a_i ∈ R^2048 to denote the spatial CNN features at each of the k grid locations; the global image feature is obtained as:

a^g = (1/k) Σ_{i=1}^{k} a_i        (1)

where a^g is the global image feature; for modeling convenience, a single-layer perceptron with a rectifier activation function is used to transform the image feature vectors into new vectors with dimension d:

v_i = ReLU(W_a a_i)        (2)
v^g = ReLU(W_b a^g)        (3)

where W_a and W_b are weight parameters; the transformed spatial image features form V = [v_1, …, v_k].
7. The spatial attention model according to claim 5, characterized in that the spatial attention model is used to compute the context vector c_t, defined as:

c_t = g(V, h_t)        (4)

where g is the attention function, V = [v_1, …, v_k], v_i ∈ R^d are the spatial image features, each a d-dimensional representation corresponding to a part of the image, and h_t is the hidden state of the recurrent neural network at time t;

given the spatial image features V ∈ R^{d×k} and the LSTM hidden state h_t ∈ R^d, they are fed through a single-layer neural network followed by a softmax function to produce the attention distribution over the k regions of the image:

z_t = w_h^T tanh(W_v V + (W_g h_t) 1^T)        (5)
α_t = softmax(z_t)        (6)

where 1 ∈ R^k is a vector with all elements set to 1, W_v, W_g ∈ R^{k×d} and w_h ∈ R^k are the parameters to be learned, and α ∈ R^k is the attention weight over the features in V; based on the attention distribution, the context vector c_t is obtained by:

c_t = Σ_{i=1}^{k} α_{ti} v_{ti}        (7)

c_t and h_t are then combined to predict the next word y_{t+1} through the formula log p(y_t | y_1, …, y_{t−1}, I) = f(h_t, c_t).
8. The sentinel gate according to claim 5, characterized in that the LSTM is extended to obtain the "visual sentinel" vector s_t:

g_t = σ(W_x x_t + W_h h_{t−1})        (8)
s_t = g_t ⊙ tanh(m_t)        (9)

where W_x and W_h are the weight parameters to be learned, x_t is the input to the LSTM at time step t, and g_t is the gate applied to the memory cell m_t; ⊙ denotes the element-wise product and σ the logistic sigmoid activation;

based on the "visual sentinel", an adaptive attention model is proposed to compute a new context vector ĉ_t, modeled as a mixture of the spatially attended image feature (i.e., the context vector of the spatial attention model) and the "visual sentinel" vector; the mixture model is defined as:

ĉ_t = β_t s_t + (1 − β_t) c_t        (10)

where β_t is the new sentinel gate at time t; in our mixture model, β_t lies in [0, 1]: a value of 1 means only the "visual sentinel" information is used, and 0 means only the spatial image information is used when generating the next word;

to compute the new sentinel gate β_t, the spatial attention component is modified: an extra element is appended to z, the vector of attention scores defined in equation (5); this element indicates how much "attention" the network places on the sentinel (as opposed to the image features); adding this extra element turns equation (6) into:

α̂_t = softmax([z_t ; w_h^T tanh(W_s s_t + W_g h_t)])        (11)

where [· ; ·] denotes concatenation, and W_s and W_g are weight parameters; notably, W_g is the same weight parameter as in equation (5); α̂_t ∈ R^{k+1} is the attention distribution over both the spatial image features and the "visual sentinel" vector; the last element of this vector is interpreted as the gate value:

β_t = α̂_t[k + 1]        (12)

the probability over the vocabulary of possible words at time t can be computed as:

p_t = softmax(W_p(ĉ_t + h_t))        (13)

where W_p is a weight parameter to be learned; this formulation encourages the model to adaptively consider the image and the "visual sentinel" when generating the next word; the sentinel vector is updated at each time step.
9. The decoder according to claim 5, characterized in that it uses a structure based on a recurrent neural network; the word embedding vector w_t and the global image feature vector v^g are concatenated to obtain the input vector x_t = [w_t; v^g]; and a single-layer neural network transforms the "visual sentinel" vector s_t and the LSTM output vector h_t into new vectors with dimension d.
10. The image caption output (4) according to claim 1, characterized in that the extracted spatial image features are fed to the trained adaptive spatial attention model based on the "visual sentinel" gate, allowing the machine to perform the task of automatically generating image captions and to obtain the natural language description corresponding to the image.
CN201611207945.6A 2016-12-23 2016-12-23 Image captioning method based on a visual attention model Withdrawn CN106778926A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611207945.6A CN106778926A (en) 2016-12-23 2016-12-23 Image captioning method based on a visual attention model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611207945.6A CN106778926A (en) 2016-12-23 2016-12-23 Image captioning method based on a visual attention model

Publications (1)

Publication Number Publication Date
CN106778926A true CN106778926A (en) 2017-05-31

Family

ID=58919991

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611207945.6A Withdrawn CN106778926A (en) Image captioning method based on a visual attention model

Country Status (1)

Country Link
CN (1) CN106778926A (en)


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
JIASEN LU ET AL.: "Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning", arXiv:1612.01887v1 *

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107608943A (en) * 2017-09-08 2018-01-19 中国石油大学(华东) Image caption generation method and system fusing visual attention and semantic attention
CN107609563A (en) * 2017-09-15 2018-01-19 成都澳海川科技有限公司 Image semantic description method and device
CN108228700B (en) * 2017-09-30 2021-01-26 北京市商汤科技开发有限公司 Training method and device of image description model, electronic equipment and storage medium
CN108228700A (en) * 2017-09-30 2018-06-29 北京市商汤科技开发有限公司 Training method and device of image description model, electronic device and storage medium
CN108024158A (en) * 2017-11-30 2018-05-11 天津大学 Supervised video summary extraction method using a visual attention mechanism
CN108171283A (en) * 2017-12-31 2018-06-15 厦门大学 Automatic image content description method based on structured semantic embedding
CN108171283B (en) * 2017-12-31 2020-06-16 厦门大学 Image content automatic description method based on structured semantic embedding
CN108230413A (en) * 2018-01-23 2018-06-29 北京市商汤科技开发有限公司 Image description method and device, electronic device, computer storage medium, and program
CN108230413B (en) * 2018-01-23 2021-07-06 北京市商汤科技开发有限公司 Image description method and device, electronic equipment and computer storage medium
CN108985370A (en) * 2018-07-10 2018-12-11 中国人民解放军国防科技大学 Automatic generation method of image annotation sentences
CN108985370B (en) * 2018-07-10 2021-04-16 中国人民解放军国防科技大学 Automatic generation method of image annotation sentences
US11868738B2 (en) 2018-11-23 2024-01-09 Tencent Technology (Shenzhen) Company Limited Method and apparatus for generating natural language description information
CN109871736A (en) * 2018-11-23 2019-06-11 腾讯科技(深圳)有限公司 Method and device for generating natural language description information
CN109871736B (en) * 2018-11-23 2023-01-31 腾讯科技(深圳)有限公司 Method and device for generating natural language description information
CN110119754A (en) * 2019-02-27 2019-08-13 北京邮电大学 Image caption generation method, device and model
CN110119754B (en) * 2019-02-27 2022-03-29 北京邮电大学 Image generation description method, device and model
CN110210499A (en) * 2019-06-03 2019-09-06 中国矿业大学 Adaptive generation system for image semantic description
CN110210499B (en) * 2019-06-03 2023-10-13 中国矿业大学 Self-adaptive generation system for image semantic description
CN110188779A (en) * 2019-06-03 2019-08-30 中国矿业大学 Method for generating image semantic descriptions
CN114022735A (en) * 2021-11-09 2022-02-08 北京有竹居网络技术有限公司 Training method, device, equipment and medium for visual language pre-training model
CN114419402A (en) * 2022-03-29 2022-04-29 中国人民解放军国防科技大学 Image story description generation method and device, computer equipment and storage medium
CN114419402B (en) * 2022-03-29 2023-08-18 中国人民解放军国防科技大学 Image story description generation method, device, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
CN106778926A (en) Image captioning method based on a visual attention model
CN105244020B (en) Prosodic hierarchy model training method, text-to-speech method and text-to-speech device
CN108875807B (en) Image description method based on multiple attention and multiple scales
CN106650813B (en) Image understanding method based on a deep residual network and LSTM
CN107273355B (en) Chinese word vector generation method based on word and phrase joint training
CN111368993B (en) Data processing method and related equipment
CN102436811B (en) Full-sequence training of deep structures for speech recognition
CN109902293A (en) Text classification method based on local and global mutual attention mechanisms
CN108536754A (en) Electronic health record entity relation extraction method based on BLSTM and attention mechanism
CN110647612A (en) Visual dialogue generation method based on a dual visual attention network
CN107924680A (en) Speech understanding system
CN106844442A (en) Multimodal recurrent neural network image description method based on FCN feature extraction
CN106650756A (en) Image text description method based on knowledge transfer multi-modal recurrent neural network
CN109887484A (en) Speech recognition and speech synthesis method and device based on dual learning
CN108153864A (en) Method for generating text summaries based on a neural network
CN109271516B (en) Method and system for classifying entity types in knowledge graph
CN110223714A (en) Voice-based emotion recognition method
CN103793507B (en) Method for obtaining a bimodal similarity measure using a deep structure
CN110334196B (en) Neural network Chinese question generation system based on strokes and self-attention mechanism
CN114021524B (en) Emotion recognition method, device, equipment and readable storage medium
CN112926655B (en) Image content understanding and visual question and answer VQA method, storage medium and terminal
CN112348911A (en) Semantic constraint-based method and system for generating fine-grained image by stacking texts
JP2022503812A (en) Sentence processing method, sentence decoding method, device, program and equipment
Malakan et al. Vision transformer based model for describing a set of images as a story
Yang et al. Text classification based on convolutional neural network and attention model

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20170531

WW01 Invention patent application withdrawn after publication