CN106778926A - An image captioning method based on a visual attention model - Google Patents
An image captioning method based on a visual attention model
- Publication number
- CN106778926A (application CN201611207945.6A)
- Authority
- CN
- China
- Prior art keywords
- image
- sentinel
- vector
- model
- vision
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/19—Recognition using electronic means
- G06V30/192—Recognition using electronic means using simultaneous comparisons or correlations of the image signals with a plurality of references
- G06V30/194—References adjustable by an adaptive method, e.g. learning
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Biomedical Technology (AREA)
- Data Mining & Analysis (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Databases & Information Systems (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Multimedia (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Image Analysis (AREA)
Abstract
An image captioning method based on a visual attention model is proposed. Its main components are: data input, preprocessing, an adaptive attention model, and image caption output. The process is as follows: first, an image dataset is used whose images depict people performing various actions and contain multiple objects in the context of complex scenes, each image being paired with 5 manually annotated captions; next, preprocessing truncates overlong captions, and the dataset is fed into an encoder that extracts spatial image features; finally, the features are fed to a trained adaptive spatial attention model gated by a "visual sentinel", allowing a machine to perform the task of automatically generating image captions and to obtain the natural-language description corresponding to the image. In image recognition, the present invention performs best compared with template-based methods; it can also help visually impaired users and make it easy for users to organize and navigate large amounts of typically unstructured visual data.
Description
Technical field
The present invention relates to the field of image recognition, and in particular to an image captioning method based on a visual attention model.
Background
With the rapid development of science and technology, attention-based neural encoder-decoder frameworks have been widely used for image captioning in the field of image recognition, that is, for automatically recognizing image content and describing it in natural language. However, the decoder may need little or even no visual information from the image to predict non-visual words, and words that appear visual can often be predicted reliably from the language model alone. An image captioning method based on a visual attention model can solve the problem of low-quality automatically generated image captions, because it can automatically determine when to rely on the visual signal and when to rely only on the language model.
The present invention proposes an image captioning method based on a visual attention model. It first uses an image dataset whose images depict people performing various actions and contain multiple objects in the context of complex scenes, each image being paired with 5 manually annotated captions. Preprocessing then truncates overlong captions, and the dataset is fed into an encoder that extracts spatial image features. Finally, the features are fed to a trained adaptive spatial attention model gated by a "visual sentinel", allowing a machine to perform the task of automatically generating image captions and to obtain the natural-language description corresponding to the image. In image recognition, the present invention performs best compared with template-based methods; it can also help visually impaired users and make it easy for users to organize and navigate large amounts of typically unstructured visual data.
Summary of the invention
In view of the low quality of automatically generated image captions, it is an object of the present invention to provide an image captioning method based on a visual attention model.
To solve the above problem, the present invention provides an image captioning method based on a visual attention model, whose main components are:
(1) data input;
(2) preprocessing;
(3) an adaptive attention model;
(4) image caption output.
Wherein, the image captioning method based on a visual attention model includes a new spatial attention model for extracting spatial image features, and an adaptive attention mechanism that introduces a new long short-term memory (LSTM) extension producing an additional "visual sentinel" vector rather than a single hidden state. The "visual sentinel" is an additional latent representation of the decoder's memory and provides the decoder with a fallback option. A new sentinel gate is further derived from the "visual sentinel"; it decides how much new information the decoder obtains from the image, as opposed to relying on the "visual sentinel", when generating the next word.
Wherein, the data input employs a scene-object dataset. Most images in the scene-object dataset depict people performing various actions and contain multiple objects in the context of complex scenes, and each image has 5 manually annotated captions.
Wherein, the preprocessing truncates captions in the scene-object dataset that are longer than 18 words; a vocabulary is then built from the words that occur at least 5 times in the training set (at least 3 times for a smaller dataset).
Wherein, the adaptive attention model includes an encoder, a spatial attention model, a sentinel gate, and a decoder. It can automatically determine when to rely on the visual signal and when to rely only on the language model; when the visual signal is relied upon, the model also determines which region of the image to attend to.
Further, the encoder obtains a representation of the image using a convolutional neural network; the spatial feature output of the last convolutional layer of a ResNet is used, with size 2048 × 7 × 7. We use A = {a_1, …, a_k}, a_i ∈ R^2048, to denote the spatial convolutional neural network features at each of the k grid positions. The global image feature is obtained as follows:
a_g = (1/k) Σ_{i=1}^{k} a_i   (1)
where a_g is the global image feature. For modelling convenience, we use a single-layer perceptron with a rectifier activation function to convert the image feature vectors into new vectors of dimension d:
v_i = ReLU(W_a a_i)   (2)
v_g = ReLU(W_b a_g)   (3)
where W_a and W_b are weight parameters; the converted spatial image features form V = [v_1, …, v_k].
Further, the spatial attention model is used to compute a context vector c_t, defined as:
c_t = g(V, h_t)   (4)
where g is the attention function, V = [v_1, …, v_k], v_i ∈ R^d, are the spatial image features, each being a d-dimensional representation corresponding to a part of the image, and h_t is the hidden state of the recurrent neural network at time t.
Given the spatial image features V ∈ R^{d×k} and the LSTM hidden state h_t ∈ R^d, we feed them through a single-layer neural network followed by a softmax function to produce the attention distribution over the k regions of the image:
z_t = w_h^T tanh(W_v V + (W_g h_t) 1^T)   (5)
α_t = softmax(z_t)   (6)
where 1 ∈ R^k is a vector with all elements set to 1, W_v, W_g ∈ R^{k×d} and w_h ∈ R^k are parameters to be learnt, and α_t ∈ R^k is the attention weight over the features in V. Based on the attention distribution, the context vector c_t can be obtained by:
c_t = Σ_{i=1}^{k} α_{ti} v_{ti}   (7)
where c_t and h_t are combined to predict the next word y_{t+1} via log p(y_t | y_1, …, y_{t−1}, I) = f(h_t, c_t).
Further, for the sentinel gate, the LSTM is extended to obtain the "visual sentinel" vector s_t:
g_t = σ(W_x x_t + W_h h_{t−1})   (8)
s_t = g_t ⊙ tanh(m_t)   (9)
where W_x and W_h are weight parameters to be learnt, x_t is the input to the LSTM at time step t, and g_t is the gate applied to the memory cell m_t; ⊙ denotes the element-wise product and σ the logistic sigmoid activation.
Based on the "visual sentinel", we propose an adaptive attention model to compute a new context vector ĉ_t, modelled as a mixture of the spatially attended image features (i.e., the context vector of the spatial attention model) and the "visual sentinel" vector. The mixture model is defined as:
ĉ_t = β_t s_t + (1 − β_t) c_t   (10)
where β_t is the new sentinel gate at time t. In our mixture model β_t ranges over [0, 1]; a value of 1 means that only the "visual sentinel" information is used, and 0 means that only the spatial image information is used, when generating the next word.
To compute the new sentinel gate β_t, we modify the spatial attention component. In particular, we add an extra element to z_t, the vector of attention scores defined in Equation 5; this element indicates how much "attention" the network places on the sentinel (as opposed to the image features). Adding this extra element converts Equation 6 into:
α̂_t = softmax([z_t; w_h^T tanh(W_s s_t + W_g h_t)])   (11)
where [·; ·] denotes concatenation, and W_s and W_g are weight parameters; notably, W_g is the same weight parameter as in Equation 5. α̂_t ∈ R^{k+1} is the attention distribution over both the spatial image features and the "visual sentinel" vector, and we interpret its last element as the gate value: β_t = α̂_t[k + 1]. The probability over the vocabulary of possible words at time t can then be computed as:
p_t = softmax(W_p(ĉ_t + h_t))   (12)
where W_p is a weight parameter to be learnt. This formula encourages the model to adaptively consider the image and the "visual sentinel" when generating the next word; the sentinel vector is updated at every time step.
Further, the decoder uses a structure based on a recurrent neural network: the word embedding vector w_t is concatenated with the global image feature vector v_g to obtain the input vector x_t = [w_t; v_g], and single-layer neural networks transform the "visual sentinel" vector s_t and the LSTM output vector h_t into new vectors of dimension d.
Wherein, for the image caption output, the extracted spatial image features are fed to the trained adaptive spatial attention model gated by the "visual sentinel", allowing the machine to perform the task of automatically generating image captions and to obtain the natural-language description corresponding to the image.
Brief description of the drawings
Fig. 1 is the system flow chart of the image captioning method based on a visual attention model of the present invention.
Fig. 2 shows the scene-object dataset of the image captioning method based on a visual attention model of the present invention.
Fig. 3 shows the model framework of the image captioning method based on a visual attention model of the present invention.
Detailed description of the embodiments
It should be noted that, where no conflict arises, the embodiments of this application and the features in those embodiments may be combined with one another. The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
Fig. 1 is the system flow chart of the image captioning method based on a visual attention model of the present invention. It mainly comprises data input, preprocessing, the adaptive attention model, and image caption output.
Wherein, the data input employs a scene-object dataset. Most images in the scene-object dataset depict people performing various actions and contain multiple objects in the context of complex scenes, and each image has 5 manually annotated captions.
Wherein, the preprocessing truncates captions in the scene-object dataset that are longer than 18 words; a vocabulary is then built from the words that occur at least 5 times in the training set (at least 3 times for a smaller dataset).
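As an illustration only, the following is a minimal Python sketch of this preprocessing step, assuming captions are already tokenised into word lists; the thresholds mirror the values stated above, and all function and variable names are illustrative rather than part of the claimed method.

```python
from collections import Counter

MAX_LEN = 18     # captions longer than this are truncated
MIN_COUNT = 5    # minimum training-set frequency for a word to enter the vocabulary

def preprocess(captions):
    """captions: list of token lists from the training split (assumed format)."""
    truncated = [tokens[:MAX_LEN] for tokens in captions]
    counts = Counter(tok for tokens in truncated for tok in tokens)
    # Words below the frequency threshold map to a shared <unk> token.
    vocab = ["<pad>", "<start>", "<end>", "<unk>"]
    vocab += sorted(w for w, c in counts.items() if c >= MIN_COUNT)
    word2id = {w: i for i, w in enumerate(vocab)}
    ids = [[word2id.get(tok, word2id["<unk>"]) for tok in tokens] for tokens in truncated]
    return ids, vocab, word2id
```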
Wherein, the adaptive attention model includes an encoder, a spatial attention model, a sentinel gate, and a decoder. It can automatically determine when to rely on the visual signal and when to rely only on the language model; when the visual signal is relied upon, the model also determines which region of the image to attend to.
Further, the encoder obtains a representation of the image using a convolutional neural network; the spatial feature output of the last convolutional layer of a ResNet is used, with size 2048 × 7 × 7. We use A = {a_1, …, a_k}, a_i ∈ R^2048, to denote the spatial convolutional neural network features at each of the k grid positions. The global image feature is obtained as follows:
a_g = (1/k) Σ_{i=1}^{k} a_i   (1)
where a_g is the global image feature. For modelling convenience, we use a single-layer perceptron with a rectifier activation function to convert the image feature vectors into new vectors of dimension d:
v_i = ReLU(W_a a_i)   (2)
v_g = ReLU(W_b a_g)   (3)
where W_a and W_b are weight parameters; the converted spatial image features form V = [v_1, …, v_k].
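By way of example, a PyTorch sketch of one way such an encoder could be realised follows, assuming a torchvision ResNet whose last convolutional block yields a 2048 × 7 × 7 feature map (so k = 49) for a 224 × 224 input; the class and parameter names are illustrative assumptions, not a definitive implementation of the invention.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class Encoder(nn.Module):
    def __init__(self, d=512):
        super().__init__()
        resnet = models.resnet152(weights=None)   # backbone choice is an assumption
        # Keep everything up to and including the last convolutional block.
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])
        self.proj_local = nn.Linear(2048, d)      # W_a in eq. (2)
        self.proj_global = nn.Linear(2048, d)     # W_b in eq. (3)

    def forward(self, images):                    # images: (B, 3, 224, 224)
        fmap = self.backbone(images)              # (B, 2048, 7, 7)
        a = fmap.flatten(2).transpose(1, 2)       # (B, k=49, 2048) spatial features a_i
        a_g = a.mean(dim=1)                       # eq. (1): average of the a_i
        V = torch.relu(self.proj_local(a))        # eq. (2): v_i = ReLU(W_a a_i)
        v_g = torch.relu(self.proj_global(a_g))   # eq. (3): v_g = ReLU(W_b a_g)
        return V, v_g
```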
Further, the spatial attention model is used to compute a context vector c_t, defined as:
c_t = g(V, h_t)   (4)
where g is the attention function, V = [v_1, …, v_k], v_i ∈ R^d, are the spatial image features, each being a d-dimensional representation corresponding to a part of the image, and h_t is the hidden state of the recurrent neural network at time t.
Given the spatial image features V ∈ R^{d×k} and the LSTM hidden state h_t ∈ R^d, we feed them through a single-layer neural network followed by a softmax function to produce the attention distribution over the k regions of the image:
z_t = w_h^T tanh(W_v V + (W_g h_t) 1^T)   (5)
α_t = softmax(z_t)   (6)
where 1 ∈ R^k is a vector with all elements set to 1, W_v, W_g ∈ R^{k×d} and w_h ∈ R^k are parameters to be learnt, and α_t ∈ R^k is the attention weight over the features in V. Based on the attention distribution, the context vector c_t can be obtained by:
c_t = Σ_{i=1}^{k} α_{ti} v_{ti}   (7)
where c_t and h_t are combined to predict the next word y_{t+1} via log p(y_t | y_1, …, y_{t−1}, I) = f(h_t, c_t).
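A minimal sketch of this spatial attention step (Equations 4 to 7) under the same shape assumptions is given below; it follows the parameter shapes stated above (W_v, W_g ∈ R^{k×d}, w_h ∈ R^k), with illustrative names.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self, d, k):
        super().__init__()
        self.W_v = nn.Linear(d, k, bias=False)   # W_v in eq. (5)
        self.W_g = nn.Linear(d, k, bias=False)   # W_g in eq. (5)
        self.w_h = nn.Linear(k, 1, bias=False)   # w_h in eq. (5)

    def forward(self, V, h_t):
        # V: (B, k, d) spatial features; h_t: (B, d) LSTM hidden state.
        # eq. (5): z_t = w_h^T tanh(W_v V + (W_g h_t) 1^T)
        scores = torch.tanh(self.W_v(V) + self.W_g(h_t).unsqueeze(1))  # (B, k, k)
        z_t = self.w_h(scores).squeeze(-1)                             # (B, k)
        alpha_t = torch.softmax(z_t, dim=-1)                           # eq. (6)
        c_t = (alpha_t.unsqueeze(-1) * V).sum(dim=1)                   # eq. (7)
        return c_t, alpha_t, z_t
```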
Further, for the sentinel gate, the LSTM is extended to obtain the "visual sentinel" vector s_t:
g_t = σ(W_x x_t + W_h h_{t−1})   (8)
s_t = g_t ⊙ tanh(m_t)   (9)
where W_x and W_h are weight parameters to be learnt, x_t is the input to the LSTM at time step t, and g_t is the gate applied to the memory cell m_t; ⊙ denotes the element-wise product and σ the logistic sigmoid activation.
Based on the "visual sentinel", we propose an adaptive attention model to compute a new context vector ĉ_t, modelled as a mixture of the spatially attended image features (i.e., the context vector of the spatial attention model) and the "visual sentinel" vector. The mixture model is defined as:
ĉ_t = β_t s_t + (1 − β_t) c_t   (10)
where β_t is the new sentinel gate at time t. In our mixture model β_t ranges over [0, 1]; a value of 1 means that only the "visual sentinel" information is used, and 0 means that only the spatial image information is used, when generating the next word.
To compute the new sentinel gate β_t, we modify the spatial attention component. In particular, we add an extra element to z_t, the vector of attention scores defined in Equation 5; this element indicates how much "attention" the network places on the sentinel (as opposed to the image features). Adding this extra element converts Equation 6 into:
α̂_t = softmax([z_t; w_h^T tanh(W_s s_t + W_g h_t)])   (11)
where [·; ·] denotes concatenation, and W_s and W_g are weight parameters; notably, W_g is the same weight parameter as in Equation 5. α̂_t ∈ R^{k+1} is the attention distribution over both the spatial image features and the "visual sentinel" vector, and we interpret its last element as the gate value: β_t = α̂_t[k + 1]. The probability over the vocabulary of possible words at time t can then be computed as:
p_t = softmax(W_p(ĉ_t + h_t))   (12)
where W_p is a weight parameter to be learnt. This formula encourages the model to adaptively consider the image and the "visual sentinel" when generating the next word; the sentinel vector is updated at every time step.
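The sentinel gate and adaptive mixture (Equations 8 to 12) might be sketched as follows, again with illustrative names; for brevity the sketch declares its own W_g and w_h, whereas in the model described above W_g is shared with Equation 5.

```python
import torch
import torch.nn as nn

class AdaptiveAttention(nn.Module):
    def __init__(self, d, k, vocab_size):
        super().__init__()
        self.W_x = nn.Linear(2 * d, d, bias=False)  # x_t = [w_t; v_g] has size 2d
        self.W_h = nn.Linear(d, d, bias=False)
        self.W_s = nn.Linear(d, k, bias=False)
        self.W_g = nn.Linear(d, k, bias=False)      # shared with eq. (5) in the full model
        self.w_h = nn.Linear(k, 1, bias=False)
        self.W_p = nn.Linear(d, vocab_size)

    def forward(self, x_t, h_prev, m_t, h_t, c_t, z_t):
        g_t = torch.sigmoid(self.W_x(x_t) + self.W_h(h_prev))            # eq. (8)
        s_t = g_t * torch.tanh(m_t)                                      # eq. (9): visual sentinel
        # eq. (11): append one sentinel score to z_t and renormalise.
        z_s = self.w_h(torch.tanh(self.W_s(s_t) + self.W_g(h_t)))        # (B, 1)
        alpha_hat = torch.softmax(torch.cat([z_t, z_s], dim=-1), dim=-1) # (B, k+1)
        beta_t = alpha_hat[:, -1:]                                       # sentinel gate in [0, 1]
        c_hat = beta_t * s_t + (1.0 - beta_t) * c_t                      # eq. (10)
        p_t = torch.softmax(self.W_p(c_hat + h_t), dim=-1)               # eq. (12)
        return p_t, beta_t, s_t
```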
Further, the decoder uses a structure based on a recurrent neural network: the word embedding vector w_t is concatenated with the global image feature vector v_g to obtain the input vector x_t = [w_t; v_g], and single-layer neural networks transform the "visual sentinel" vector s_t and the LSTM output vector h_t into new vectors of dimension d.
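Tying the pieces together, one possible decoder step is sketched below, assuming the SpatialAttention and AdaptiveAttention sketches above are in scope and that an nn.LSTMCell's cell state plays the role of the memory cell m_t; the composition and names are assumptions for illustration only. At each step the next caption word can be drawn from p_t, e.g. greedily or with beam search.

```python
import torch
import torch.nn as nn

class DecoderStep(nn.Module):
    def __init__(self, d, k, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d)
        self.lstm = nn.LSTMCell(2 * d, d)        # input is x_t = [w_t; v_g]
        self.attend = SpatialAttention(d, k)     # sketched above
        self.adapt = AdaptiveAttention(d, k, vocab_size)

    def forward(self, word_ids, v_g, V, state):
        h_prev, m_prev = state
        x_t = torch.cat([self.embed(word_ids), v_g], dim=-1)   # x_t = [w_t; v_g]
        h_t, m_t = self.lstm(x_t, (h_prev, m_prev))            # m_t is the memory cell
        c_t, alpha_t, z_t = self.attend(V, h_t)                # spatial attention
        p_t, beta_t, s_t = self.adapt(x_t, h_prev, m_t, h_t, c_t, z_t)
        return p_t, (h_t, m_t)                                 # word distribution + new state
```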
Wherein, for the image caption output, the extracted spatial image features are fed to the trained adaptive spatial attention model gated by the "visual sentinel", allowing the machine to perform the task of automatically generating image captions and to obtain the natural-language description corresponding to the image.
Fig. 2 shows the scene-object dataset of the image captioning method based on a visual attention model of the present invention. Most images in the scene-object dataset depict people performing various actions and contain multiple objects in the context of complex scenes; each image has 5 manually annotated captions.
Fig. 3 shows the model framework of the image captioning method based on a visual attention model of the present invention. The model is a new adaptive attention encoder-decoder framework, comprising an encoder, a spatial attention model, a sentinel gate and a decoder, which automatically decides when to look at the image and when to generate the next word from the language model.
For those skilled in the art, the present invention is not limited to the details of the above embodiments, and it may be realised in other specific forms without departing from the spirit or scope of the invention. Moreover, those skilled in the art may make various changes and modifications to the present invention without departing from its spirit and scope, and such improvements and modifications shall also be regarded as falling within the protection scope of the invention. The appended claims are therefore intended to cover the preferred embodiments and all changes and modifications that fall within the scope of the invention.
Claims (10)
1. An image captioning method based on a visual attention model, characterised in that it mainly comprises data input (1); preprocessing (2); an adaptive attention model (3); and image caption output (4).
2. The image captioning method based on a visual attention model according to claim 1, characterised in that it comprises a new spatial attention model for extracting spatial image features, and an adaptive attention mechanism introducing a new long short-term memory (LSTM) extension that produces an additional "visual sentinel" vector rather than a single hidden state; the "visual sentinel" is an additional latent representation of the decoder's memory and provides the decoder with a fallback option; a new sentinel gate is further derived from the "visual sentinel", which decides how much new information the decoder obtains from the image, as opposed to relying on the "visual sentinel", when generating the next word.
3. The data input (1) according to claim 1, characterised in that a scene-object dataset is employed; most images in the scene-object dataset depict people performing various actions and contain multiple objects in the context of complex scenes, and each image has 5 manually annotated captions.
4. The preprocessing (2) according to claim 1, characterised in that captions in the scene-object dataset longer than 18 words are truncated, and a vocabulary is then built from the words that occur at least 5 times in the training set (at least 3 times for a smaller dataset).
5. The adaptive attention model (3) according to claim 1, characterised in that it comprises an encoder, a spatial attention model, a sentinel gate and a decoder; it can automatically determine when to rely on the visual signal and when to rely only on the language model, and when the visual signal is relied upon, the model also determines which region of the image to attend to.
6. The encoder according to claim 5, characterised in that a representation of the image is obtained using a convolutional neural network; the spatial feature output of the last convolutional layer of a ResNet is used, with size 2048 × 7 × 7; we use A = {a_1, …, a_k}, a_i ∈ R^2048, to denote the spatial convolutional neural network features at each of the k grid positions; the global image feature is obtained as follows:
a_g = (1/k) Σ_{i=1}^{k} a_i   (1)
where a_g is the global image feature; for modelling convenience, we use a single-layer perceptron with a rectifier activation function to convert the image feature vectors into new vectors of dimension d:
v_i = ReLU(W_a a_i)   (2)
v_g = ReLU(W_b a_g)   (3)
where W_a and W_b are weight parameters; the converted spatial image features form V = [v_1, …, v_k].
7. The spatial attention model according to claim 5, characterised in that the spatial attention model is used to compute a context vector c_t, defined as:
c_t = g(V, h_t)   (4)
where g is the attention function, V = [v_1, …, v_k], v_i ∈ R^d, are the spatial image features, each being a d-dimensional representation corresponding to a part of the image, and h_t is the hidden state of the recurrent neural network at time t;
given the spatial image features V ∈ R^{d×k} and the LSTM hidden state h_t ∈ R^d, we feed them through a single-layer neural network followed by a softmax function to produce the attention distribution over the k regions of the image:
z_t = w_h^T tanh(W_v V + (W_g h_t) 1^T)   (5)
α_t = softmax(z_t)   (6)
where 1 ∈ R^k is a vector with all elements set to 1, W_v, W_g ∈ R^{k×d} and w_h ∈ R^k are parameters to be learnt, and α_t ∈ R^k is the attention weight over the features in V; based on the attention distribution, the context vector c_t can be obtained by:
c_t = Σ_{i=1}^{k} α_{ti} v_{ti}   (7)
where c_t and h_t are combined to predict the next word y_{t+1} via log p(y_t | y_1, …, y_{t−1}, I) = f(h_t, c_t).
8. The sentinel gate according to claim 5, characterised in that the LSTM is extended to obtain the "visual sentinel" vector s_t:
g_t = σ(W_x x_t + W_h h_{t−1})   (8)
s_t = g_t ⊙ tanh(m_t)   (9)
where W_x and W_h are weight parameters to be learnt, x_t is the input to the LSTM at time step t, and g_t is the gate applied to the memory cell m_t; ⊙ denotes the element-wise product and σ the logistic sigmoid activation;
based on the "visual sentinel", we propose an adaptive attention model to compute a new context vector ĉ_t, modelled as a mixture of the spatially attended image features (i.e., the context vector of the spatial attention model) and the "visual sentinel" vector; the mixture model is defined as:
ĉ_t = β_t s_t + (1 − β_t) c_t   (10)
where β_t is the new sentinel gate at time t; in our mixture model β_t ranges over [0, 1]; a value of 1 means that only the "visual sentinel" information is used, and 0 means that only the spatial image information is used, when generating the next word;
to compute the new sentinel gate β_t, we modify the spatial attention component; in particular, we add an extra element to z_t, the vector of attention scores defined in Equation 5; this element indicates how much "attention" the network places on the sentinel (as opposed to the image features); adding this extra element converts Equation 6 into:
α̂_t = softmax([z_t; w_h^T tanh(W_s s_t + W_g h_t)])   (11)
where [·; ·] denotes concatenation, and W_s and W_g are weight parameters; notably, W_g is the same weight parameter as in Equation 5; α̂_t ∈ R^{k+1} is the attention distribution over both the spatial image features and the "visual sentinel" vector, and we interpret its last element as the gate value: β_t = α̂_t[k + 1]; the probability over the vocabulary of possible words at time t can then be computed as:
p_t = softmax(W_p(ĉ_t + h_t))   (12)
where W_p is a weight parameter to be learnt; this formula encourages the model to adaptively consider the image and the "visual sentinel" when generating the next word; the sentinel vector is updated at every time step.
9. The decoder according to claim 5, characterised in that a structure based on a recurrent neural network is used; the word embedding vector w_t is concatenated with the global image feature vector v_g to obtain the input vector x_t = [w_t; v_g]; single-layer neural networks transform the "visual sentinel" vector s_t and the LSTM output vector h_t into new vectors of dimension d.
10. The image caption output (4) according to claim 1, characterised in that the extracted spatial image features are fed to the trained adaptive spatial attention model gated by the "visual sentinel", allowing the machine to perform the task of automatically generating image captions and to obtain the natural-language description corresponding to the image.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611207945.6A CN106778926A (en) | 2016-12-23 | 2016-12-23 | An image captioning method based on a visual attention model
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611207945.6A CN106778926A (en) | 2016-12-23 | 2016-12-23 | An image captioning method based on a visual attention model
Publications (1)
Publication Number | Publication Date |
---|---|
CN106778926A true CN106778926A (en) | 2017-05-31 |
Family
ID=58919991
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611207945.6A | CN106778926A (en), withdrawn | 2016-12-23 | 2016-12-23 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106778926A (en) |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107608943A (en) * | 2017-09-08 | 2018-01-19 | 中国石油大学(华东) | Merge visual attention and the image method for generating captions and system of semantic notice |
CN107609563A (en) * | 2017-09-15 | 2018-01-19 | 成都澳海川科技有限公司 | Picture semantic describes method and device |
CN108024158A (en) * | 2017-11-30 | 2018-05-11 | 天津大学 | There is supervision video abstraction extraction method using visual attention mechanism |
CN108171283A (en) * | 2017-12-31 | 2018-06-15 | 厦门大学 | A kind of picture material automatic describing method based on structuring semantic embedding |
CN108230413A (en) * | 2018-01-23 | 2018-06-29 | 北京市商汤科技开发有限公司 | Image Description Methods and device, electronic equipment, computer storage media, program |
CN108228700A (en) * | 2017-09-30 | 2018-06-29 | 北京市商汤科技开发有限公司 | Training method, device, electronic equipment and the storage medium of image description model |
CN108985370A (en) * | 2018-07-10 | 2018-12-11 | 中国人民解放军国防科技大学 | Automatic generation method of image annotation sentences |
CN109871736A (en) * | 2018-11-23 | 2019-06-11 | 腾讯科技(深圳)有限公司 | The generation method and device of natural language description information |
CN110119754A (en) * | 2019-02-27 | 2019-08-13 | 北京邮电大学 | Image generates description method, apparatus and model |
CN110188779A (en) * | 2019-06-03 | 2019-08-30 | 中国矿业大学 | A kind of generation method of image, semantic description |
CN110210499A (en) * | 2019-06-03 | 2019-09-06 | 中国矿业大学 | A kind of adaptive generation system of image, semantic description |
CN114022735A (en) * | 2021-11-09 | 2022-02-08 | 北京有竹居网络技术有限公司 | Training method, device, equipment and medium for visual language pre-training model |
CN114419402A (en) * | 2022-03-29 | 2022-04-29 | 中国人民解放军国防科技大学 | Image story description generation method and device, computer equipment and storage medium |
- 2016-12-23: CN application CN201611207945.6A filed; published as CN106778926A (en); status not active (withdrawn)
Non-Patent Citations (1)
Title |
---|
JIASEN LU et al.: "Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning", arXiv:1612.01887v1 *
Cited By (22)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107608943A (en) * | 2017-09-08 | 2018-01-19 | 中国石油大学(华东) | Merge visual attention and the image method for generating captions and system of semantic notice |
CN107609563A (en) * | 2017-09-15 | 2018-01-19 | 成都澳海川科技有限公司 | Picture semantic describes method and device |
CN108228700B (en) * | 2017-09-30 | 2021-01-26 | 北京市商汤科技开发有限公司 | Training method and device of image description model, electronic equipment and storage medium |
CN108228700A (en) * | 2017-09-30 | 2018-06-29 | 北京市商汤科技开发有限公司 | Training method, device, electronic equipment and the storage medium of image description model |
CN108024158A (en) * | 2017-11-30 | 2018-05-11 | 天津大学 | There is supervision video abstraction extraction method using visual attention mechanism |
CN108171283A (en) * | 2017-12-31 | 2018-06-15 | 厦门大学 | A kind of picture material automatic describing method based on structuring semantic embedding |
CN108171283B (en) * | 2017-12-31 | 2020-06-16 | 厦门大学 | Image content automatic description method based on structured semantic embedding |
CN108230413A (en) * | 2018-01-23 | 2018-06-29 | 北京市商汤科技开发有限公司 | Image Description Methods and device, electronic equipment, computer storage media, program |
CN108230413B (en) * | 2018-01-23 | 2021-07-06 | 北京市商汤科技开发有限公司 | Image description method and device, electronic equipment and computer storage medium |
CN108985370A (en) * | 2018-07-10 | 2018-12-11 | 中国人民解放军国防科技大学 | Automatic generation method of image annotation sentences |
CN108985370B (en) * | 2018-07-10 | 2021-04-16 | 中国人民解放军国防科技大学 | Automatic generation method of image annotation sentences |
US11868738B2 (en) | 2018-11-23 | 2024-01-09 | Tencent Technology (Shenzhen) Company Limited | Method and apparatus for generating natural language description information |
CN109871736A (en) * | 2018-11-23 | 2019-06-11 | 腾讯科技(深圳)有限公司 | The generation method and device of natural language description information |
CN109871736B (en) * | 2018-11-23 | 2023-01-31 | 腾讯科技(深圳)有限公司 | Method and device for generating natural language description information |
CN110119754A (en) * | 2019-02-27 | 2019-08-13 | 北京邮电大学 | Image generates description method, apparatus and model |
CN110119754B (en) * | 2019-02-27 | 2022-03-29 | 北京邮电大学 | Image generation description method, device and model |
CN110210499A (en) * | 2019-06-03 | 2019-09-06 | 中国矿业大学 | A kind of adaptive generation system of image, semantic description |
CN110210499B (en) * | 2019-06-03 | 2023-10-13 | 中国矿业大学 | Self-adaptive generation system for image semantic description |
CN110188779A (en) * | 2019-06-03 | 2019-08-30 | 中国矿业大学 | A kind of generation method of image, semantic description |
CN114022735A (en) * | 2021-11-09 | 2022-02-08 | 北京有竹居网络技术有限公司 | Training method, device, equipment and medium for visual language pre-training model |
CN114419402A (en) * | 2022-03-29 | 2022-04-29 | 中国人民解放军国防科技大学 | Image story description generation method and device, computer equipment and storage medium |
CN114419402B (en) * | 2022-03-29 | 2023-08-18 | 中国人民解放军国防科技大学 | Image story description generation method, device, computer equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106778926A (en) | An image captioning method based on a visual attention model | |
CN105244020B (en) | Prosodic hierarchy model training method, text-to-speech method and text-to-speech device | |
CN108875807B (en) | Image description method based on multiple attention and multiple scales | |
CN106650813B (en) | A kind of image understanding method based on depth residual error network and LSTM | |
CN107273355B (en) | Chinese word vector generation method based on word and phrase joint training | |
CN111368993B (en) | Data processing method and related equipment | |
CN102436811B (en) | Full-sequence training of deep structures for speech recognition | |
CN109902293A (en) | A kind of file classification method based on part with global mutually attention mechanism | |
CN108536754A (en) | Electronic health record entity relation extraction method based on BLSTM and attention mechanism | |
CN110647612A (en) | Visual conversation generation method based on double-visual attention network | |
CN107924680A (en) | Speech understanding system | |
CN106844442A (en) | Multi-modal Recognition with Recurrent Neural Network Image Description Methods based on FCN feature extractions | |
CN106650756A (en) | Image text description method based on knowledge transfer multi-modal recurrent neural network | |
CN109887484A (en) | A kind of speech recognition based on paired-associate learning and phoneme synthesizing method and device | |
CN108153864A (en) | Method based on neural network generation text snippet | |
CN109271516B (en) | Method and system for classifying entity types in knowledge graph | |
CN110223714A (en) | A kind of voice-based Emotion identification method | |
CN103793507B (en) | A kind of method using deep structure to obtain bimodal similarity measure | |
CN110334196B (en) | Neural network Chinese problem generation system based on strokes and self-attention mechanism | |
CN114021524B (en) | Emotion recognition method, device, equipment and readable storage medium | |
CN112926655B (en) | Image content understanding and visual question and answer VQA method, storage medium and terminal | |
CN112348911A (en) | Semantic constraint-based method and system for generating fine-grained image by stacking texts | |
JP2022503812A (en) | Sentence processing method, sentence decoding method, device, program and equipment | |
Malakan et al. | Vision transformer based model for describing a set of images as a story | |
Yang et al. | Text classification based on convolutional neural network and attention model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| WW01 | Invention patent application withdrawn after publication | Application publication date: 20170531 |