CN112836709A - Automatic image description method based on spatial attention enhancement mechanism - Google Patents

Automatic image description method based on spatial attention enhancement mechanism

Info

Publication number
CN112836709A
CN112836709A (application CN202110168114.7A)
Authority
CN
China
Prior art keywords
image
attention
candidate
loss
calculating
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110168114.7A
Other languages
Chinese (zh)
Inventor
方玉明
朱旻炜
姜文晖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to CN202110168114.7A priority Critical patent/CN112836709A/en
Publication of CN112836709A publication Critical patent/CN112836709A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G06V10/46 - Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 - Salient features, e.g. scale invariant feature transforms [SIFT]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/217 - Validation; Performance evaluation; Active pattern learning techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/20 - Image preprocessing
    • G06V10/25 - Determination of region of interest [ROI] or a volume of interest [VOI]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 - Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 - Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides an automatic image description method based on a spatial attention enhancement mechanism. The method comprises: extracting potential target regions in an image and setting them as the image regions to be processed; acquiring the spatial features and position information of a plurality of image regions and extracting the image features of these regions; selecting, from the extracted image regions, image regions rich in positioning information as candidate frames according to the information of an entity data set, and obtaining cluster-based attention feature labels; calculating the attention intensity of the image candidate regions at each moment according to the extracted image features; calculating the cross-entropy loss on the description content and the significance loss on the cluster-based attention feature labels, and calculating the total loss; and calculating the loss between the true-value label and the initial predicted value, judging the difference between the initial predicted value and the true result, having the image description model perform self-learning according to the difference, and inputting the image features into the self-learned image description model to obtain the final predicted value. The invention can improve the performance of automatic image description methods.

Description

Automatic image description method based on spatial attention enhancement mechanism
Technical Field
The invention relates to the technical field of image description, in particular to an automatic image description method based on a spatial attention enhancement mechanism.
Background
Image description generation is a comprehensive problem combining computer vision and natural language processing. The task is very easy for human beings, but a machine, limited by the heterogeneous characteristics of data from different modalities, must understand the content of a picture and describe it in natural language; it is therefore required to generate fluent, human-understandable sentences that represent the complete image content.
Inspired by the application of attention mechanisms in machine translation, some researchers have introduced attention mechanisms into the traditional encoder-decoder framework, significantly improving the performance of the automatic image description task. The attention mechanism focuses on the key visual content in the image and provides more discriminative visual information to guide the sentence generation process while the image context vector is fed into the encoder-decoder framework. Although the attention mechanism can effectively improve the performance of automatic image description methods, current methods still suffer from problems such as insufficiently accurate attention, so that descriptions of objects that do not appear in the image show up in the generated captions.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides an automatic image description method based on a spatial attention enhancement mechanism, which improves the attention accuracy.
In order to achieve this purpose, the invention is realized by the following technical scheme. An automatic image description method based on a spatial attention enhancement mechanism comprises the following steps: after the image to be described is obtained, potential target regions in the image are extracted and set as the image regions to be processed, the spatial features and position information of a plurality of image regions are acquired, and the image features of the image regions are extracted; from the extracted image regions, image regions rich in positioning information are selected as candidate frames according to the information of an entity data set, and cluster-based attention feature labels are obtained; the attention intensity of the image candidate regions at each moment is calculated according to the extracted image features; the cross-entropy loss on the description content and the significance loss on the cluster-based attention feature labels are calculated, and the total loss is calculated; and the loss between the true-value label and the initial predicted value is calculated, the difference between the initial predicted value and the true result is judged, the image description model performs self-learning according to the difference, and the image features are input into the self-learned image description model to obtain the final predicted value.
Preferably, the acquiring of the spatial features and position information of the plurality of image regions comprises: extracting the bottom-up features in the image and the position information of the corresponding target bounding boxes by using a target detection algorithm pre-trained on the Visual Genome dataset.
Preferably, the selecting of image regions rich in positioning information as candidate frames based on the information of the entity data set comprises: locating the positioning nouns in the description content of the entity data set, matching the spatial features and position information of the image regions with the nouns in the entity data set, and selecting candidate frames rich in positioning information by using a cluster-information screening method.
Preferably, the selecting of candidate frames rich in positioning information by using the cluster-information screening method comprises: combining the spatial features and position information of the image regions with the nouns in the entity data set by using the cluster-information screening method, and selecting candidate frames rich in positioning information according to an intersection-ratio criterion and an overlap-ratio criterion.
Preferably, the selecting of candidate frames rich in positioning information according to the intersection-ratio criterion and the overlap-ratio criterion comprises: calculating the intersection ratio of the target noun rectangular frame G and the candidate frame B, wherein the calculation formula of the intersection ratio is as follows:
IoU(G, B) = area(G ∩ B) / area(G ∪ B)    (formula 1)
wherein G ∩ B represents the intersection region of the candidate frame and the target noun rectangular frame; when the intersection ratio is greater than a first threshold, the candidate frame is retained, and the intersection ratio of the candidate frame is marked as positive;
calculating the overlap ratio of the target noun rectangular frame G and the candidate frame B, wherein the calculation formula of the overlap ratio is as follows:
IoP(G, B) = area(G ∩ B) / area(B)    (formula 2)
when the overlap ratio is greater than a preset second threshold, the candidate frame is retained, and the overlap ratio of the candidate frame is marked as positive.
Preferably, a candidate frame whose intersection ratio of the target noun rectangular frame G to the candidate frame B is smaller than a first threshold value is marked as negative, and a candidate frame whose overlap ratio of the target noun rectangular frame G to the candidate frame B is smaller than a second threshold value is marked as negative.
Preferably, the calculating of the attention intensity of the image candidate regions at each moment according to the extracted image features comprises: inputting the spatial features and position information of the image regions into a feature mapping module and extracting semantic features from the feature regions of the N objects, denoted as K = {k_1, k_2, ..., k_N}; and inputting the extracted semantic features into an attention module to obtain the attention weight α_t at time t.
Preferably, the calculating of the cross-entropy loss on the description content and the significance loss of the cluster-based attention feature labels, and the calculating of the total loss, comprise calculating with the following formulas:
L_XE(θ) = -Σ_t log p(w_t | w_<t; θ)
[formula for the cluster-based significance loss L_grd(θ): rendered as an image in the original publication, not reproduced here]
L(θ) = λ·L_grd(θ) + L_XE(θ)
wherein L is the total loss, L_grd and L_XE are respectively the significance loss of the attention feature labels and the cross-entropy loss, θ is a parameter of the image description model, w_t and w_<t respectively represent the word vector at time t and the word vectors before time t, p represents the conditional probability, N_P represents the positive candidate frames, N is the total number of all candidate frames, B_n is a negative candidate frame, α_i represents the attention weight of the i-th candidate frame, and λ represents the weight of the cluster-based candidate-clustering loss function in the total loss function.
Compared with the prior art, the invention has the beneficial effects that:
the invention provides an automatic image description method based on a spatial attention enhancement mechanism, which uses an attention label based on a cluster to provide better reference for the attention weight in the description generation process, thereby generating more accurate description and improving the performance of the automatic image description method. The method of the invention achieves superior results by performing extensive experiments on mainstream datasets such as Flickr30k and COCO, and comparing with the most advanced methods. The method has practical significance for the scene of the visually impaired people to which the automatic image description method is applied.
Drawings
FIG. 1 is a block diagram of the structure used in an embodiment of the automatic image description method based on the spatial attention enhancement mechanism of the present invention;
FIG. 2 is a flow chart of an embodiment of the automatic image description method based on the spatial attention enhancement mechanism of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The automatic image description method based on a spatial attention enhancement mechanism of the invention can be implemented by a computer device. For example, the computer device comprises a processor and a memory; the memory stores a computer program which, when executed, implements the automatic image description method based on the spatial attention enhancement mechanism.
The method of the invention is applied in the system shown in fig. 1. The image 10 to be described is passed through the target detection algorithm module 11, from which the image features 13 are extracted; the image features 13 are input into the attention module 14, and the attention weights are obtained by calculation. At the same time, the image features 13 are combined with the noun matching 23 from the entity data set, and the attention weights 15 are calculated using the cluster information 24. From the attention weights 15, the image description information 17 is obtained through the calculation of the decoder 16; the image description information 17 is also obtained with the help of the description tag 25. The location tag 21 can be obtained from the image 10 to be described, and the description location nouns 23 can be obtained through the noun filtering 22.
Referring to fig. 2, the embodiment first executes step S1 to obtain the image to be described; for example, the image to be described is input into the image description model. Step S2 is then executed to extract the potential target regions in the image, which are the image regions to be processed. Next, the spatial features and position information of the image regions are acquired, and the image features are extracted. Specifically, the spatial features of the potential target regions in the image to be described are extracted and used as input for the subsequent modules. For example, a target detection algorithm pre-trained on the Visual Genome dataset is used to extract the bottom-up features of the image I to be described and the corresponding target bounding boxes; the extraction of the image features can be implemented with known techniques such as region proposal networks and region-of-interest pooling, and the target bounding boxes determine the positions of the target regions in the image.
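A minimal sketch of this feature extraction step is given below. It uses a COCO-pretrained torchvision Faster R-CNN as a stand-in for the Visual Genome bottom-up detector described above; the normalization details, the region limit of 36, and the function names are illustrative assumptions, not the patent's implementation:

```python
import torch
import torchvision
from torchvision.ops import roi_align

# COCO-pretrained detector used as a stand-in for the Visual Genome
# pre-trained bottom-up detector described in the embodiment.
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True).eval()

@torch.no_grad()
def extract_region_features(image: torch.Tensor, max_regions: int = 36):
    """image: float tensor (3, H, W) with values in [0, 1].
    Returns (features, boxes): one pooled spatial feature vector and one
    (x1, y1, x2, y2) bounding box per candidate region."""
    boxes = detector([image])[0]["boxes"][:max_regions]     # (N, 4) positions in the image
    # Pool a fixed-size feature for every box from the finest FPN level.
    fmap = detector.backbone(image.unsqueeze(0))["0"]        # (1, C, H/4, W/4)
    scale = fmap.shape[-1] / image.shape[-1]                 # image -> feature-map scale
    pooled = roi_align(fmap, [boxes], output_size=7, spatial_scale=scale)
    features = pooled.mean(dim=(2, 3))                       # (N, C) region features
    return features, boxes
```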
Next, step S3 is executed to extract the cluster-based attention feature labels. In this embodiment, candidate boxes rich in positioning information are selected with a cluster-information screening method; specifically, the spatial features and position information of the image regions are combined with the nouns in the entity data set, and the candidate boxes rich in positioning information are selected according to the intersection-ratio criterion and the overlap-ratio criterion.
For example, according to the sentence segmentation in the entity data set, the nouns carrying positioning information in a sentence are found; their localized area is the target noun rectangular box G, and the candidate boxes are the bounding boxes corresponding to the bottom-up features in the image. A candidate box is a box in the image to be described, located at the position corresponding to the previously obtained image features. In this embodiment, the entity data set is a preset data set that provides position labels of the noun phrases in the sentences describing COCO or Flickr images.
Then, the candidate boxes are screened; for example, the candidate boxes rich in positioning information are selected according to the intersection-ratio criterion and the overlap-ratio criterion.
When the intersection-ratio criterion is applied, the intersection ratio (IoU) of the target noun rectangular box G and the candidate box B is calculated, and the calculation formula of the intersection ratio is as follows:
IoU(G, B) = area(G ∩ B) / area(G ∪ B)    (formula 1)
wherein G ∩ B represents the intersection region of the candidate box and the target noun rectangular box. When the intersection ratio is greater than a first threshold, preferably 0.5, the candidate box is retained and its intersection ratio is marked as positive. The present embodiment therefore retains the candidate boxes B that have a high intersection ratio with the target noun rectangular box G.
When the overlap-ratio criterion is applied, the overlap ratio (IoP) of the target noun rectangular box G and the candidate box B is calculated, and the calculation formula of the overlap ratio is as follows:
IoP(G, B) = area(G ∩ B) / area(B)    (formula 2)
When the overlap ratio is greater than a preset second threshold, preferably 0.9, the candidate box is retained and its overlap ratio is marked as positive. The present embodiment therefore retains the candidate boxes B that have a high overlap ratio with the target noun rectangular box G.
Further, a candidate box whose intersection ratio with the target noun rectangular box G is smaller than the first threshold is marked as negative, and a candidate box whose overlap ratio with the target noun rectangular box G is smaller than the second threshold is marked as negative. The image features can therefore be divided into two clusters, a positive cluster and a negative cluster, according to the positive and negative marks; this cluster division constitutes the attention feature labels of this embodiment.
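A minimal sketch of this cluster-labelling step is shown below, using the thresholds 0.5 and 0.9 stated above. The (x1, y1, x2, y2) box format, the IoP denominator area(B), and the rule that a box is positive if either criterion marks it positive are assumptions where the text leaves the combination unspecified:

```python
def box_area(b):
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])

def intersection_area(g, b):
    x1, y1 = max(g[0], b[0]), max(g[1], b[1])
    x2, y2 = min(g[2], b[2]), min(g[3], b[3])
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)

def label_candidates(target_box, candidate_boxes, iou_thr=0.5, iop_thr=0.9):
    """Assign each candidate box to the positive (+1) or negative (-1) cluster
    according to the intersection-ratio (IoU) and overlap-ratio (IoP) criteria."""
    labels = []
    for b in candidate_boxes:
        inter = intersection_area(target_box, b)
        iou = inter / (box_area(target_box) + box_area(b) - inter + 1e-8)
        iop = inter / (box_area(b) + 1e-8)   # assumed IoP = area(G ∩ B) / area(B)
        # Assumed combination rule: positive if either criterion is satisfied.
        labels.append(1 if (iou > iou_thr or iop > iop_thr) else -1)
    return labels
```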
Next, step S4 is executed to calculate the attention intensity of the image candidate regions at each moment. For example, the attention intensity of the image candidate regions at each moment is calculated from the extracted image features: the spatial features and position information of the image regions are input into the feature mapping module, and semantic features are extracted from the feature regions of the N objects, denoted as K = {k_1, k_2, ..., k_N}. The extracted semantic features are then input into the attention module to obtain the attention weight α_t at time t.
Specifically, the extracted semantic features K are input into the attention module, which generates the attention weight α_t at a given time t by combining the semantic information S contained in the currently generated word; a higher intensity means that the candidate region receives more attention. The semantic information S is the word generated at the previous moment; from this word and the semantic features K, the attention weight α_t at the current time t can be obtained. The attention weight α_t at time t is calculated as follows:
a = (W_s S)^T (W_k K) / √d    (formula 3)
α_t = softmax(a)    (formula 4)
softmax(a)_i = e^(a_i) / Σ_j e^(a_j)    (formula 5)
wherein S is the text sequence at the previous moment, W_s and W_k are respectively the mapping matrices that map S and K into a common mapping space, d is the dimension of the mapping space, a_i denotes the i-th component of a, and e is the base of the natural logarithm.
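A minimal PyTorch sketch of this attention module follows. It implements formulas 3-5 as reconstructed above; the mapping dimension d = 512 and the exact tensor shapes of S and K are assumptions:

```python
import math
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Scaled dot-product attention over N candidate-region features K,
    queried by the semantic information S of the previously generated word."""
    def __init__(self, s_dim: int, k_dim: int, d: int = 512):
        super().__init__()
        self.W_s = nn.Linear(s_dim, d, bias=False)   # maps S into the common space
        self.W_k = nn.Linear(k_dim, d, bias=False)   # maps K into the common space
        self.d = d

    def forward(self, S: torch.Tensor, K: torch.Tensor) -> torch.Tensor:
        # S: (batch, s_dim) semantic vector of the previously generated word
        # K: (batch, N, k_dim) semantic features of the N candidate regions
        q = self.W_s(S).unsqueeze(1)               # (batch, 1, d)
        k = self.W_k(K)                            # (batch, N, d)
        a = (q * k).sum(-1) / math.sqrt(self.d)    # (batch, N), formula 3
        alpha_t = torch.softmax(a, dim=-1)         # (batch, N), formulas 4-5
        return alpha_t
```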
Then, step S5 is performed: the cross-entropy loss on the description content and the significance loss of the cluster-based attention feature labels are calculated, and the total loss is calculated. Specifically, the following formulas are adopted:
L_XE(θ) = -Σ_t log p(w_t | w_<t; θ)
[formula for the cluster-based significance loss L_grd(θ): rendered as an image in the original publication, not reproduced here]
L(θ) = λ·L_grd(θ) + L_XE(θ)
wherein L is the total loss, L_grd and L_XE are respectively the significance loss of the attention feature labels and the cross-entropy loss, θ is a parameter of the image description model, w_t and w_<t respectively represent the word vector at time t and the word vectors before time t, p represents the conditional probability, N_P represents the positive candidate boxes, N is the total number of all candidate boxes, B_n is a negative candidate box, α_i represents the attention weight of the i-th candidate box, α_j represents the attention weight of the j-th candidate box, and λ represents the weight of the cluster-based candidate-clustering loss function in the total loss function.
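The exact published form of L_grd is not recoverable from the text-only record. The sketch below therefore assumes a common attention-supervision formulation that concentrates attention mass on the positive-cluster boxes relative to all boxes, combined with the standard captioning cross-entropy; the specific L_grd expression is an assumption, not the patent's equation:

```python
import torch
import torch.nn.functional as F

def total_loss(word_logits, target_words, alphas, positive_mask, lam=0.1):
    """word_logits: (T, vocab) predicted word distributions per time step.
    target_words: (T,) ground-truth word indices.
    alphas: (T, N) attention weights over the N candidate boxes.
    positive_mask: (N,) float tensor, 1 for positive-cluster boxes, 0 otherwise.
    lam: weight λ of the cluster-based loss in the total loss."""
    # Cross-entropy loss L_XE on the description content.
    l_xe = F.cross_entropy(word_logits, target_words)

    # Assumed significance loss L_grd: maximise the attention assigned to the
    # positive-cluster boxes relative to all boxes at every time step.
    pos_mass = (alphas * positive_mask).sum(dim=1)   # Σ over positive boxes of α_i
    all_mass = alphas.sum(dim=1) + 1e-8              # Σ over all boxes of α_j
    l_grd = -(torch.log(pos_mass / all_mass + 1e-8)).mean()

    return lam * l_grd + l_xe                        # L = λ·L_grd + L_XE
```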
Then, step S6 is executed to calculate the loss between the true value tag and the initial predicted value calculated by the image description model, and to determine the difference between the initial predicted value and the true result, based on which the image description model performs self-learning.
Finally, step S7 is executed: after the image features are input into the image description model that has completed self-learning, the model obtains the final predicted value from the input image features; this final predicted value is the final image description sentence to be obtained in this embodiment.
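The patent does not detail the decoder beyond the decoder 16 shown in fig. 1. The sketch below assumes an LSTM decoder that, at each step, uses the SpatialAttention module sketched earlier to attend over the region features K and greedily picks the next word; all names and dimensions are illustrative:

```python
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    """Assumed LSTM decoder: at every step the attention module reweights the
    region features K, and the attended feature conditions the next word.
    Depends on the SpatialAttention class defined in the earlier sketch."""
    def __init__(self, vocab_size, k_dim, embed_dim=512, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attention = SpatialAttention(s_dim=embed_dim, k_dim=k_dim, d=hidden)
        self.lstm = nn.LSTMCell(embed_dim + k_dim, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    @torch.no_grad()
    def greedy_decode(self, K, bos_id, eos_id, max_len=20):
        # K: (1, N, k_dim) region features of one image
        word, state = torch.tensor([bos_id]), None
        caption = []
        for _ in range(max_len):
            S = self.embed(word)                                   # (1, embed_dim)
            alpha = self.attention(S, K)                           # (1, N) attention weights
            context = torch.bmm(alpha.unsqueeze(1), K).squeeze(1)  # attended region feature
            state = self.lstm(torch.cat([S, context], dim=-1), state)
            word = self.out(state[0]).argmax(dim=-1)               # next word id
            if word.item() == eos_id:
                break
            caption.append(word.item())
        return caption
```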
To verify the feasibility of this embodiment, experiments were carried out; specifically, the COCO dataset and the Flickr30k dataset were used for testing and comparison. The COCO dataset contains about one hundred and twenty thousand images and the Flickr30k dataset contains about thirty thousand images; for both datasets, each image has at least five manually annotated image description sentences, called true-value labels. In the experiments, the original training and validation sets of the COCO dataset and the original Flickr30k dataset were divided into training, validation and test sets using the Karpathy split, and the results on the test set were used for verification. The invention uses five evaluation criteria to quantitatively evaluate the performance of the image description methods: Bilingual Evaluation Understudy (BLEU), the recall-oriented ROUGE metric, Metric for Evaluation of Translation with Explicit ORdering (METEOR), Consensus-based Image Description Evaluation (CIDEr), and Semantic Propositional Image Caption Evaluation (SPICE). CIDEr better reflects semantic accuracy, and a good automatic image description method has a higher CIDEr value.
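As an illustrative sketch, the commonly used pycocoevalcap package (assumed to be installed; it is the Python port of the COCO caption evaluation toolkit) can compute such metrics from the true-value labels and the generated sentences:

```python
# pip install pycocoevalcap   (assumed dependency)
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider

def evaluate(references, hypotheses):
    """references: {image_id: [ground-truth captions]}  (the true-value labels)
    hypotheses: {image_id: [one generated caption]}
    Captions are assumed to be pre-tokenized, lower-cased strings."""
    bleu_scores, _ = Bleu(4).compute_score(references, hypotheses)  # BLEU-1..BLEU-4
    cider_score, _ = Cider().compute_score(references, hypotheses)  # CIDEr
    return {"BLEU-4": bleu_scores[3], "CIDEr": cider_score}
```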
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (8)

1. An automatic image description method based on a spatial attention enhancement mechanism is characterized by comprising the following steps:
after an image to be described is obtained, potential target areas in the image are extracted, the target areas are set as image areas to be processed, spatial features and position information of the image areas are obtained, and image features of the image areas are extracted;
selecting an image area rich in positioning information as a candidate frame according to the information of the entity data set from the extracted image area, and obtaining an attention characteristic label based on a cluster;
calculating the attention intensity of the image candidate region at each moment according to the extracted image features;
calculating cross entropy loss and cluster-based attention feature tag significance loss with respect to the descriptive content, and calculating total loss;
and calculating the loss between the real value label and the initial predicted value, judging the difference between the initial predicted value and the real result, carrying out self-learning by the image description model according to the difference, and inputting the image characteristics into the self-learned image description model to obtain the final predicted value.
2. The method for automatic image description based on the spatial attention enhancement mechanism according to claim 1, wherein:
acquiring spatial features and positional information of a plurality of image regions includes:
and extracting the bottom-up features in the image and the position information of the corresponding target boundary box in the image by using a target detection algorithm pre-trained by a visual gene data set.
3. The method for automatic image description based on the spatial attention enhancement mechanism according to claim 1, wherein:
selecting an image region enriched with positioning information as a candidate frame based on information of the entity data set includes:
and describing positioning nouns based on the content of the entity data set, matching the spatial features and the position information of the image area with the nouns in the entity data set, and selecting a candidate frame rich in positioning information by using a cluster information screening method.
4. The method according to claim 3, wherein the method comprises:
selecting a candidate frame rich in positioning information by using a cluster information screening method comprises the following steps:
and combining the spatial features and the position information of the image area with nouns in the entity data set by using a cluster information screening method, and selecting a candidate frame rich in positioning information according to a cross-over ratio criterion and an overlap-over ratio criterion.
5. The method according to claim 4, wherein the method comprises:
selecting the candidate boxes rich in positioning information according to the intersection ratio criterion and the overlap ratio criterion comprises:
calculating the intersection ratio of the target noun rectangular frame G and the candidate frame B, wherein the calculation formula of the intersection ratio is as follows:
IoU(G, B) = area(G ∩ B) / area(G ∪ B)
wherein G ∩ B represents the intersection region of the candidate frame and the target noun rectangular frame; when the intersection ratio is greater than a first threshold, the candidate frame is retained, and the intersection ratio of the candidate frame is marked as positive;
calculating the overlap ratio of the target noun rectangular frame G and the candidate frame B, wherein the calculation formula of the overlap ratio is as follows:
IoP(G, B) = area(G ∩ B) / area(B)
when the overlap ratio is greater than a preset second threshold, the candidate frame is retained, and the overlap ratio of the candidate frame is marked as positive.
6. The method according to claim 5, wherein the method comprises:
a candidate frame whose intersection ratio of the target noun rectangular frame G to the candidate frame B is smaller than the first threshold value is marked as negative, and a candidate frame whose overlap ratio of the target noun rectangular frame G to the candidate frame B is smaller than the second threshold value is marked as negative.
7. The method for automatic image description based on spatial attention enhancement mechanism according to any one of claims 1 to 6, characterized in that:
calculating the attention intensity of the image candidate region at each moment according to the extracted image features comprises the following steps:
inputting the spatial features and the position information of the image regions into a feature mapping module, extracting semantic features from the feature regions of the N objects, denoted as K = {k_1, k_2, ..., k_N}; and
inputting the extracted semantic features into an attention module to obtain the attention weight α_t at time t.
8. The method for automatic image description based on spatial attention enhancement mechanism according to any one of claims 1 to 6, characterized in that:
calculating a cross-entropy loss and a cluster-based attention feature label significance loss with respect to the descriptive content, and calculating a total loss comprises:
the significance loss with respect to cross-entropy loss and cluster-based attention feature labels describing the content is calculated using the following formula:
L_XE(θ) = -Σ_t log p(w_t | w_<t; θ)
[formula for the cluster-based significance loss L_grd(θ): rendered as an image in the original publication, not reproduced here]
L(θ) = λ·L_grd(θ) + L_XE(θ)
wherein L is the total loss, L_grd and L_XE are respectively the significance loss of the attention feature labels and the cross-entropy loss, θ is a parameter of the image description model, w_t and w_<t respectively represent the word vector at time t and the word vectors before time t, p represents the conditional probability, N_P represents the positive candidate frames, N is the total number of all candidate frames, B_n is a negative candidate frame, α_i represents the attention weight of the i-th candidate frame, and λ represents the weight of the cluster-based candidate-clustering loss function in the total loss function.
CN202110168114.7A 2021-02-07 2021-02-07 Automatic image description method based on spatial attention enhancement mechanism Pending CN112836709A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110168114.7A CN112836709A (en) 2021-02-07 2021-02-07 Automatic image description method based on spatial attention enhancement mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110168114.7A CN112836709A (en) 2021-02-07 2021-02-07 Automatic image description method based on spatial attention enhancement mechanism

Publications (1)

Publication Number Publication Date
CN112836709A true CN112836709A (en) 2021-05-25

Family

ID=75932647

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110168114.7A Pending CN112836709A (en) 2021-02-07 2021-02-07 Automatic image description method based on spatial attention enhancement mechanism

Country Status (1)

Country Link
CN (1) CN112836709A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114359741A (en) * 2022-03-19 2022-04-15 江西财经大学 Regional feature-based image description model attention mechanism evaluation method and system
CN114359741B (en) * 2022-03-19 2022-06-17 江西财经大学 Regional feature-based image description model attention mechanism evaluation method and system
CN116152118A (en) * 2023-04-18 2023-05-23 中国科学技术大学 Image description method based on contour feature enhancement
CN116152118B (en) * 2023-04-18 2023-07-14 中国科学技术大学 Image description method based on contour feature enhancement

Similar Documents

Publication Publication Date Title
CN108804530B (en) Subtitling areas of an image
EP3866026A1 (en) Theme classification method and apparatus based on multimodality, and storage medium
US20170278510A1 (en) Electronic device, method and training method for natural language processing
CN110490081B (en) Remote sensing object interpretation method based on focusing weight matrix and variable-scale semantic segmentation neural network
CN111612103B (en) Image description generation method, system and medium combined with abstract semantic representation
CN112270196B (en) Entity relationship identification method and device and electronic equipment
CN108765383B (en) Video description method based on deep migration learning
CN110674312B (en) Method, device and medium for constructing knowledge graph and electronic equipment
CN112100346A (en) Visual question-answering method based on fusion of fine-grained image features and external knowledge
CN112613273A (en) Compression method and system of multi-language BERT sequence labeling model
CN107305543B (en) Method and device for classifying semantic relation of entity words
CN108959474B (en) Entity relation extraction method
CN111708878B (en) Method, device, storage medium and equipment for extracting sports text abstract
CN112052684A (en) Named entity identification method, device, equipment and storage medium for power metering
CN110083832B (en) Article reprint relation identification method, device, equipment and readable storage medium
CN110929640B (en) Wide remote sensing description generation method based on target detection
CN112836709A (en) Automatic image description method based on spatial attention enhancement mechanism
CN109977253A (en) A kind of fast image retrieval method and device based on semanteme and content
CN115861995A (en) Visual question-answering method and device, electronic equipment and storage medium
CN113836306B (en) Composition automatic evaluation method, device and storage medium based on chapter component identification
CN112613293B (en) Digest generation method, digest generation device, electronic equipment and storage medium
JP2020098592A (en) Method, device and storage medium of extracting web page content
CN113673294A (en) Method and device for extracting key information of document, computer equipment and storage medium
CN113705207A (en) Grammar error recognition method and device
CN112836754A (en) Image description model generalization capability evaluation method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination