WO2022114322A1

WO2022114322A1 - System and method for automatically generating image caption by using image object attribute-oriented model based on deep learning algorithm

Info

Publication number: WO2022114322A1
Application number: PCT/KR2020/017272
Authority: WO
Inventors: 최호진; 한승호
Original assignee: 한국과학기술원
Priority date: 2020-11-30
Filing date: 2020-11-30
Publication date: 2022-06-02

Abstract

The present invention relates to an automatic image caption generation system and method which generate a caption by using deep learning to extract attribute information and object information in an image, and which restructure the generated caption by predicting the relationship between pieces of object information. An automatic caption generation system for automatically generating, for an image, a caption describing the image, according to an embodiment of the present invention, comprises: a client device for providing the image for which the caption is to be generated; and a caption generator which analyzes the image provided from the client device to generate the caption describing the image, and which transmits, to the client device, the generated caption and the basis for generating the caption.

Description

A system and method for automatically generating image captions using a deep learning algorithm-based image object attribute attention model

The present invention relates to a system and method for automatically generating image captions using an image object-attribute attention model based on a deep learning algorithm, and more particularly to a system and method for automatically generating image captions that restructure a generated caption by predicting relationships between pieces of object information.

Image captioning is the task of generating a natural language sentence that describes a provided image. Recently, with the development of artificial intelligence technology, techniques for automatically generating captions by machine have been developed.

Captions can be generated automatically by machine using many existing images and the label (one word describing the image) attached to each image. That is, a caption for an image can be created by searching for images having the same label or by assigning the labels of similar images to the image.

However, because this method generates captions for a new image using only stored image and label data, it is difficult to generate a caption as a natural language sentence, and even when such a caption is generated, the quality of the sentence is poor.

The present invention has been made to solve the above-described problem, and an object of the present invention is to provide an automatic image caption generation system and method that extract attribute information and object information in an image using deep learning to generate a caption, and that predict relationships between pieces of object information to restructure the generated caption.

In addition to the technical problem of the present invention mentioned above, other features and advantages of the present invention are described below, or will be clearly understood from that description by those of ordinary skill in the art.

According to an embodiment of the present invention for achieving the above-described object, an automatic caption generation system for automatically generating a caption describing an image may include a client device that provides an image for which a caption is to be generated, and a caption generator that analyzes the image received from the client device to generate a caption describing the image and transmits the generated caption and the basis for generating the caption to the client device.

Meanwhile, a method for automatically generating a caption describing an image according to an embodiment of the present invention for achieving the above-described object may include: extracting, in a caption generation module, attribute information and object information in the image using deep learning, and generating a caption using the attribute information and the object information; predicting, in a relationship generation module, relationships between objects in the image, and generating a tuple set in which the predicted relationships are structured in tuple form; generating, in a description generation module, an extended caption by restructuring the caption using the generated caption and the tuple set; and visualizing the extended caption and a graph for the tuple set.

The automatic image caption generation system and method according to an embodiment of the present invention generate a caption by reflecting attribute information and object information in an image using deep learning, and can thereby improve the performance of caption generation for an image.

In addition, other features and advantages of the present invention may be newly recognized through embodiments of the present invention.

FIG. 1 is a diagram showing the configuration of an image caption automatic generation system according to an embodiment of the present invention.

FIG. 2 is a diagram illustrating a configuration of a caption generator according to an embodiment of the present invention.

FIG. 3 is a diagram showing the configuration of a caption generating module according to an embodiment of the present invention.

FIG. 4 is a diagram showing the configuration of a relationship generating module according to an embodiment of the present invention.

FIG. 5 is a diagram illustrating the configuration of a description generating module according to an embodiment of the present invention.

FIG. 6 is a diagram illustrating caption generation for an image according to an embodiment of the present invention.

FIG. 7 is a diagram illustrating generation of extended captions according to an embodiment of the present invention.

FIG. 8 is a diagram illustrating a method for automatically generating image captions according to an embodiment of the present invention.

FIG. 9 is a diagram illustrating a method of generating a caption according to an embodiment of the present invention.

FIG. 10 is a diagram illustrating a method of generating an extended caption according to an embodiment of the present invention.

In order to clearly explain the present invention, parts irrelevant to the description are omitted, and the same reference numerals are assigned to the same or similar components throughout the specification.

The terminology used herein is for the purpose of referring to specific embodiments only, and is not intended to limit the present invention. As used herein, the singular forms also include the plural forms unless the context clearly indicates otherwise. The term "comprising," as used herein, specifies a particular characteristic, region, integer, step, operation, element and/or component, and does not exclude the presence or addition of another characteristic, region, integer, step, operation, element and/or component.

Unless defined otherwise, all terms including technical and scientific terms used herein have the same meaning as commonly understood by those of ordinary skill in the art to which the present invention belongs. Terms defined in commonly used dictionaries are further interpreted as having a meaning consistent with the related technical literature and the present disclosure, and unless so defined, are not interpreted in an idealized or overly formal sense.

Hereinafter, with reference to the accompanying drawings, embodiments of the present invention will be described in detail so that those of ordinary skill in the art can easily implement them. However, the present invention may be implemented in several different forms and is not limited to the embodiments described herein.

Referring to FIG. 1 , a system 1000 for automatically generating image captions according to an embodiment of the present invention may include a client 100 and a caption generator 200 .

The client 100 may provide an image for generating a caption. The client 100 may be a user device (or client device) such as a smart phone or a tablet PC. The client 100 may provide an image acquired (or photographed) in the user device and/or an image stored in the user device to the caption generator 200 . The client 100 according to embodiments of the present invention is not limited to the aforementioned smart phone or tablet PC, and may be equally applied to various types of electronic devices.

Also, the caption generator 200 may analyze the image provided from the client 100 to generate a caption describing the image, and transmit the generated caption and the basis for generating the caption to the client 100 . According to an embodiment, the caption generator 200 may be a server capable of communicating with a user device of the client 100 by wire and/or wirelessly.

Here, the caption generator 200 may analyze the image through deep learning. Specifically, the caption generator 200 may be trained on images and the correct caption for each image.

Using the learned images and the correct captions for those images, the caption generator 200 may generate a caption for a new image, that is, for the image provided from the client 100. Here, the correct caption may be a sentence including five or more phrases arbitrarily set by the user for the image. Also, the caption generator 200 may extract objects from the provided image to predict relationships between the objects, and may generate an extended caption by applying the predicted relationships to the generated caption.

The caption generator 200 may transmit the extended caption and the basis for generating the caption to the client 100, and the client 100 may interpret the results of deep learning using the caption for the image and the basis for generating the caption delivered from the caption generator 200. Here, the client 100 and the caption generator 200 may be connected by wire or wirelessly.

Referring to FIG. 2 , the caption generator 200 according to an embodiment of the present invention may include a caption generating module 210 , a relationship generating module 220 , and a description generating module 230 .

The caption generating module 210 may be trained on images and the correct caption for each image, and may generate a caption for the image provided from the client 100 by using the learned images and correct captions. The caption generation module 210 may extract attribute information and object information in the image, and generate a caption using the extracted attribute information and object information. Here, the attribute information may be words related to an image, and the object information may be a core target of the provided image. For example, in the case of an image including a dog in front of a sofa, the attribute information may be 'dog' or 'sofa', and the object information may be 'dog' or 'sofa' in the image.

The relationship generating module 220 may predict a relationship between objects in an image and generate a tuple set in which the predicted relationships are structured in a tuple form. Here, the tuple form enumerates elements, and the elements may be listed in parentheses '( )' and separated by commas ','. As an example, when an image including a dog in front of a sofa is provided, the relationship generating module 220 may predict a relationship between the objects 'dog' and 'sofa'. That is, the relationship generating module 220 may predict that the dog is in front of the sofa, and may structure the predicted relationship as (sofa, front, dog). In this case, '(sofa, front, dog)' may be a tuple set.
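The tuple form described above can be illustrated with a minimal sketch. The class name RelationTuple, its field names, and the render method are hypothetical names chosen for illustration; the field order follows the (first noun, predicate, second noun) form described for the relationship prediction model 224.

```python
from typing import NamedTuple


class RelationTuple(NamedTuple):
    """One predicted relationship between two objects (hypothetical type)."""
    first_noun: str
    predicate: str
    second_noun: str

    def render(self) -> str:
        # Elements are listed in parentheses and separated by commas.
        return f"({self.first_noun}, {self.predicate}, {self.second_noun})"


# The example from the description: a dog in front of a sofa.
tuple_set = [RelationTuple("sofa", "front", "dog")]
rendered = [t.render() for t in tuple_set]
```

Because a NamedTuple is still an ordinary tuple, it can be unpacked positionally wherever a plain three-element tuple is expected.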

The description generating module 230 may generate an extended caption by restructuring the caption using the caption generated by the caption generating module 210 and the tuple set generated by the relationship generating module 220 . That is, the description generating module 230 may generate an expanded caption by reflecting the relationship between the objects predicted by the relationship generating module 220 in the caption generated by the caption generating module 210 . Also, the description generating module 230 may visualize the extended caption and a graph for the tuple set that is the basis for generating the caption to the client 100 .

Referring to FIG. 3 , the caption generation module 210 according to an embodiment of the present invention may include an attribute extraction model 212 , an object recognition model 214 , and an image caption model 216 .

The attribute extraction model 212 may extract attribute information of the provided image and convert the attribute information into a vector representation (or tuple form). Here, the attribute extraction model 212 may learn images and captions for the images in advance using an image-text embedding model based on a deep learning algorithm. For example, before the caption generating module 210 of FIG. 2 is trained, the attribute extraction model 212 may be trained by extracting words related to each image in advance using an image caption database. The image-text embedding model may be a model that maps many images and the words related to each image into one vector space and, when a new image is input, outputs words related to the new image. That is, the attribute extraction model 212 may output words related to a new image using the images and related words mapped to and stored in the vector space, and the output words may be used for learning.

In addition, the attribute extraction model 212 may extract words from the captions for each image, namely words in verb form (including gerunds and participles) and words in noun form that appear three or more times in the captions. The attribute extraction model 212 may learn to embed the image and the extracted words into one vector space using a deep learning model.
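The word selection rule above can be sketched as follows. This is only an illustration under stated assumptions: the captions of one image are assumed to be already tokenized and tagged with the hypothetical part-of-speech labels "VERB" and "NOUN", the three-occurrence threshold is assumed to apply to nouns only, and the function name extract_attribute_words is not taken from the description.

```python
from collections import Counter


def extract_attribute_words(tagged_captions, min_noun_count=3):
    """Select verbs, plus nouns seen at least min_noun_count times.

    tagged_captions: the captions of one image, each a list of
    (word, pos) pairs with pos in {"VERB", "NOUN", ...} (assumed tags).
    """
    noun_counts = Counter(
        word
        for caption in tagged_captions
        for word, pos in caption
        if pos == "NOUN"
    )
    selected, seen = [], set()
    for caption in tagged_captions:
        for word, pos in caption:
            is_verb = pos == "VERB"
            is_frequent_noun = pos == "NOUN" and noun_counts[word] >= min_noun_count
            if (is_verb or is_frequent_noun) and word not in seen:
                seen.add(word)
                selected.append(word)  # keep first-seen order
    return selected


captions = [
    [("dog", "NOUN"), ("lying", "VERB"), ("floor", "NOUN")],
    [("dog", "NOUN"), ("sofa", "NOUN")],
    [("dog", "NOUN"), ("sitting", "VERB"), ("floor", "NOUN")],
]
words = extract_attribute_words(captions)
```

Here 'dog' occurs three times and is kept, while 'floor' and 'sofa' fall below the threshold; both verb forms are kept regardless of frequency.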

Accordingly, the attribute extraction model 212 may extract words most related to the provided image by using the learned image and caption data for the image.

The object recognition model 214 may extract an important object in the image and convert the object region including the extracted object into a vector representation (or tuple form). The object recognition model 214 may use a deep learning-based object recognition model, such as the Mask R-CNN algorithm, to extract regions corresponding to predefined object regions in the provided image as the object regions of the provided image. Like the attribute extraction model 212, the object recognition model 214 may be trained in advance before the caption generating module 210 of FIG. 2 is trained.

The image caption model 216 may generate a caption describing the image provided from the client 100 based on vectors generated using each word extracted from the attribute extraction model 212 and the object regions extracted from the object recognition model 214.

The image caption model 216 may be implemented using a deep learning algorithm, and may be based on a recurrent neural network (RNN). Accordingly, the image caption model 216 may predict the relationship between the objects in the image in time sequence.

The image caption model 216 according to an embodiment of the present invention may include an attribute attention model 216a, an object attention model 216b, a grammar learning model 216c, and a language generation model 216d.

The attribute attention model 216a may assign an attention score to the words extracted from the attribute extraction model 212. The attribute attention model 216a may assign word attention to the words in order of their relevance to the word tag generated by the language generation model 216d at the current time. Here, the word attention is a value between 0 and 1, and may be closer to 1 as the relevance to the word tag is higher.

The object attention model 216b may assign region attention to the object regions extracted from the object recognition model 214. The object attention model 216b may assign region attention to the regions in order of their relevance to the word tag generated by the language generation model 216d at the current time. Here, the region attention is a value between 0 and 1, and may be closer to 1 as the relevance to the word tag is higher.
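One possible way to obtain word attention or region attention values between 0 and 1 is sketched below. The description does not state how the values are computed; the dot-product relevance and softmax normalization used here are assumptions, as are the function name attention_scores and the toy vectors.

```python
import math


def attention_scores(query, candidates):
    """Softmax over dot-product relevance (an assumed scoring scheme).

    query: vector standing in for the word tag at the current time.
    candidates: word vectors or object-region vectors to be scored.
    """
    relevance = [sum(q * c for q, c in zip(query, cand)) for cand in candidates]
    peak = max(relevance)  # subtract the maximum for numerical stability
    exps = [math.exp(r - peak) for r in relevance]
    total = sum(exps)
    return [e / total for e in exps]


word_tag_state = [1.0, 0.0]
attribute_vectors = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
scores = attention_scores(word_tag_state, attribute_vectors)
```

With this choice, a candidate that is more relevant to the current word tag receives a score closer to 1, and the scores over all candidates sum to 1. The same function could serve for both word vectors and region vectors.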

The grammar learning model 216c may learn the grammar of a sentence for an image and a caption of the image. The grammar learning model 216c may tag each word in the sentence using a grammar tagging tool such as EasySRL for the correct caption sentence of the image, and learn the grammar of the correct caption sentence of the image. By learning the grammar of the caption sentence by the grammar learning model 216c, a grammatical aspect may be taken into account when generating a caption for the provided image.

The language generation model 216d may generate a word tag and a grammar tag for the caption at each time step based on the words extracted from the attribute extraction model 212, the object regions extracted from the object recognition model 214, the word attention generated by the attribute attention model 216a, and the region attention generated by the object attention model 216b.

The language generation model 216d may predict a word tag and a grammar tag at the current time by considering all of the following: the word attention value, the region attention value, the average vector of the words converted to a tuple form in the attribute extraction model 212, the average vector of the object regions converted to a tuple form in the object recognition model 214, the word generated at the previous time by the language generation model 216d, and compressed information on all words generated so far by the language generation model 216d. The language generation model 216d may calculate loss values for the generated word tag and grammar tag by comparing the predicted word tag and grammar tag, respectively, with the correct caption sentence. The language generation model 216d may update the learning parameters of the caption generation module 210 by reflecting the loss values for the word tag and the grammar tag.
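The two loss values described above can be illustrated with a small sketch. The description does not name the loss function or say how the two values are combined; cross-entropy and an unweighted sum are assumptions here, and the names cross_entropy and step_loss are hypothetical.

```python
import math


def cross_entropy(probs, target_index):
    """Negative log-likelihood of the correct class (assumed loss)."""
    return -math.log(probs[target_index])


def step_loss(word_probs, word_target, grammar_probs, grammar_target):
    """Loss for one time step: one term per predicted tag.

    word_probs / grammar_probs: predicted distributions over the word
    vocabulary and the grammar tag set at the current time step.
    word_target / grammar_target: indices taken from the correct caption.
    """
    word_loss = cross_entropy(word_probs, word_target)
    grammar_loss = cross_entropy(grammar_probs, grammar_target)
    # Unweighted sum is an assumption; a weighted sum is equally plausible.
    return word_loss, grammar_loss, word_loss + grammar_loss


word_loss, grammar_loss, total = step_loss([0.7, 0.2, 0.1], 0, [0.5, 0.5], 1)
```

In training, the combined value would be accumulated over the time steps of a caption and used to update the parameters of the caption generation module.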

Accordingly, the language generation model 216d may generate a caption sentence in which the grammar is considered with respect to the provided image by using the word tag and the grammar tag.

Referring to FIG. 4, the relationship generating module 220 according to an embodiment of the present invention may include an object extraction model 222, a relationship prediction model 224, and a relationship graph generation model 226.

The object extraction model 222 may extract important object regions in the provided image. The object extraction model 222 may extract important objects in the provided image, and may extract object regions including the extracted objects.

The relationship prediction model 224 may predict a relationship between the extracted object regions and structure the relationship between the predicted object regions in a tuple form. Here, the relationship prediction model 224 may structure the relationship between predicted object regions in the form of (first noun, predicate, second noun). The first noun and/or the second noun may be a noun representing an object in the image.

The relationship graph generation model 226 may generate one graph for the generated tuple set. The relationship graph generation model 226 may generate graphs for the tuple sets, such as displaying an arrow from a first noun to a predicate, and displaying an arrow from a predicate to a second noun.
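The arrow rule above amounts to turning each tuple into two directed edges. A minimal sketch follows; the function name build_relation_graph and the edge-list representation are assumptions, and for brevity identical predicate strings from different tuples are not kept apart as separate nodes.

```python
def build_relation_graph(tuple_set):
    """Return directed edges: first noun -> predicate -> second noun."""
    edges = []
    for first_noun, predicate, second_noun in tuple_set:
        edges.append((first_noun, predicate))   # arrow from first noun to predicate
        edges.append((predicate, second_noun))  # arrow from predicate to second noun
    return edges


edges = build_relation_graph([
    ("sofa", "front", "dog"),
    ("cat", "next to", "door"),
])
```

All tuples contribute to the same edge list, which corresponds to generating one graph for the whole tuple set.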

Referring to FIG. 5 , the description generating module 230 according to an embodiment of the present invention may include a sentence restructuring model 232 and a visualization model 234 .

The sentence restructuring model 232 may use the caption generated by the caption generation module 210 and the tuple set generated by the relationship generation module 220 to replace some words in the caption with phrases for the tuples according to an algorithm, thereby expanding the generated caption. That is, the sentence restructuring model 232 may further expand the caption by reflecting the tuple set generated by the relationship generating module 220 in the caption generated by the caption generating module 210.

The sentence restructuring model 232 may remove, from among the tuple sets generated by the relationship generation module 220, the tuple sets already included in the caption generated by the caption generation module 210. Here, when the first noun, the second noun, and the predicate of a tuple set are all included in the caption generated by the caption generating module 210, the tuple set is determined to be a duplicate tuple set, and the duplicate tuple set may be deleted.
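The duplicate check above can be sketched as a simple filter. How inclusion in the caption is tested is not specified; case-insensitive substring matching is an assumption, as is the function name remove_duplicate_tuples.

```python
def remove_duplicate_tuples(caption, tuple_set):
    """Drop tuples whose three elements all already occur in the caption."""
    caption_lower = caption.lower()
    remaining = []
    for first_noun, predicate, second_noun in tuple_set:
        # Assumed inclusion test: case-insensitive substring match.
        covered = all(
            part.lower() in caption_lower
            for part in (first_noun, predicate, second_noun)
        )
        if not covered:
            remaining.append((first_noun, predicate, second_noun))
    return remaining


caption = "a dog lying on the floor"
tuple_set = [("dog", "lying on", "floor"), ("sofa", "front", "dog")]
remaining = remove_duplicate_tuples(caption, tuple_set)
```

The first tuple is dropped because 'dog', 'lying on', and 'floor' all appear in the caption; the second is kept because 'sofa' and 'front' do not.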

The sentence restructuring model 232 may remove the duplicate tuple set and convert the remaining tuple sets into a sentence format. Here, when the predicate of the tuple set is a preposition, the sentence restructuring model 232 may convert the tuple set into a sentence form by listing the first noun, the preposition, and the second noun in that order. On the other hand, when the predicate of the tuple set is a verb, the sentence restructuring model 232 may convert the tuple set into a sentence form by listing the second noun, the verb, and the first noun in that order.

For example, when the tuple set is (sofa, front, dog), the predicate of the tuple set is a preposition, so the sentence restructuring model 232 may convert the tuple set into 'dog in front of the sofa'. As another example, when the tuple set is (person, lie down, bed), since the predicate of the tuple set is a verb, the sentence restructuring model 232 may convert the tuple set into 'a person lying in bed'.
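The ordering rule for the two predicate types can be sketched as below. The sketch only reproduces the stated element order and does not attempt to produce the fluent English phrases quoted in the examples, whose word order differs from the stated order, apparently because the rule was written for the word order of the original language. The function name order_tuple_elements and the predicate_kind argument are hypothetical; how the part of speech of the predicate is determined is not described.

```python
def order_tuple_elements(first_noun, predicate, second_noun, predicate_kind):
    """Order tuple elements for conversion into sentence form.

    predicate_kind: "preposition" or "verb", supplied by the caller
    (an assumption; the description does not say how it is obtained).
    """
    if predicate_kind == "preposition":
        # first noun, preposition, second noun
        return [first_noun, predicate, second_noun]
    if predicate_kind == "verb":
        # second noun, verb, first noun
        return [second_noun, predicate, first_noun]
    raise ValueError(f"unsupported predicate kind: {predicate_kind!r}")


prep_order = order_tuple_elements("sofa", "front", "dog", "preposition")
verb_order = order_tuple_elements("person", "lie down", "bed", "verb")
```

A later surface-realization step would still be needed to turn the ordered elements into a grammatical phrase in the target language.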

The sentence restructuring model 232 may convert the tuple set into a sentence format and reflect the converted sentence in the caption. Thereafter, a score may be calculated by comparing the caption in which the converted sentence is reflected (the extended caption) with the correct caption, and the phrase having the highest score may be selected. The sentence restructuring model 232 may repeat the steps of converting a tuple set into a sentence form, applying it to the caption, and selecting the phrase having the highest score until no tuple sets remain. Thereafter, the sentence restructuring model 232 may select the last selected phrase as the final extended caption.
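The iteration described above can be sketched as a greedy loop. Several details are assumptions made for illustration only: the scoring metric is not named in the description, so a simple word-overlap score stands in for it; each remaining tuple is assumed to have already been converted into a (head word, phrase) pair; and a phrase is applied by replacing the first occurrence of its head word in the caption. The names overlap_score and extend_caption are hypothetical.

```python
def overlap_score(candidate, reference):
    """Stand-in metric: fraction of reference words found in the candidate."""
    reference_words = reference.lower().split()
    candidate_words = set(candidate.lower().split())
    return sum(w in candidate_words for w in reference_words) / len(reference_words)


def extend_caption(caption, phrases, reference, score=overlap_score):
    """Greedily apply phrases until none remain.

    phrases: (head_word, phrase) pairs derived from the remaining tuples.
    reference: the correct caption used for scoring.
    """
    remaining = list(phrases)
    current = caption
    while remaining:
        # Score every candidate caption obtained by applying one phrase.
        scored = [
            (score(current.replace(head, phrase, 1), reference), index)
            for index, (head, phrase) in enumerate(remaining)
        ]
        _, best_index = max(scored)
        head, phrase = remaining.pop(best_index)
        current = current.replace(head, phrase, 1)  # keep the best candidate
    return current


caption = "a dog and a cat lying on the floor"
phrases = [("dog", "dog in front of a sofa"), ("cat", "cat near a laptop")]
reference = "a dog in front of a sofa and a cat near a laptop lying on the floor"
extended = extend_caption(caption, phrases, reference)
```

In this toy run the 'dog' phrase is applied first because it raises the overlap with the reference more than the 'cat' phrase, and the caption left after the last iteration is returned as the final extended caption.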

The visualization model 234 may visualize the caption extended by the sentence restructuring model 232 by matching it with the tuple set. The visualization model 234 may generate a graph representing the relationship between the tuple sets by matching the caption extended in the sentence restructuring model 232 with the tuple set. In addition, the visualization model 234 may transmit a graph representing the relationship between the generated tuple set to the client 100 so that the user can check the basis for generating the expanded caption.

The visualization model 234 may display an object region corresponding to the tuple set reflected in the caption on the provided image. In this case, the visualization model 234 may display each object area through different colors or different lines (such as line types or thicknesses). Also, the visualization model 234 may display a phrase corresponding to the object area in the final caption in the same color as the object area. For example, if the final caption sentence is 'a dog in front of a sofa lying on the floor and a cat around a laptop', the visualization model 234 may display the sofa and the dog in the provided image as one object area using a red line. Also, the visualization model 234 may display 'dog in front of the sofa' in red text in the final caption sentence. In this way, by displaying the corresponding phrase and the object area in the same color, the user can recognize the correspondence at a glance.

Referring to FIG. 6 , when the image 10 is provided from the client 100 , the attribute extraction model 212 may extract attribute information 1 in the provided image 10 . The attribute extraction model 212 may extract attribute information 1 in the provided image 10 based on the learned image and the correct caption of the image. As an example, the attribute extraction model 212 may extract a dog, a cat, a floor, etc. as the attribute information 1 .

Also, the object recognition model 214 may extract the object region 2 including the object information and the object in the provided image 10 at the same time that the attribute extraction model 212 extracts the attribute information 1 . The object recognition model 214 may extract the object region 2 in the provided image 10 based on the learned image and the correct caption of the image. As an example, the object recognition model 214 may extract a dog, a cat, a floor, etc. as object information, and may extract the object region 2 including the object information.

Also, the image caption model 216 may generate a caption 3 for the provided image 10 using the attribute information extracted from the attribute extraction model 212 and the object information extracted from the object recognition model 214 . As an example, the image caption model 216 may generate the caption 3 'a living room photo of a dog and a cat lying on the floor'.

Referring to FIG. 7 , the object extraction model 222 may extract the object region 2 including the object information and the object in the provided image 10 . The object extraction model 222 may extract object information in the provided image 10 based on the learned image and the correct caption of the image. For example, the object extraction model 222 may extract a dog, a cat, a sofa, a notebook, a door, etc. as object information, and may extract the object region 2 including the object information. In this case, the object extraction model 222 may extract the object region 2 to include two or more extracted objects. Through this, the relationship prediction model 224 may predict the relationship between objects in the object region 2 .

The relationship prediction model 224 may predict the relationship between the objects extracted from the object extraction model 222, and may generate the relationship between the objects as a tuple set 4. As an example, the relationship prediction model 224 may predict that the relationship between 'sofa' and 'dog' extracted as objects is that there is a dog in front of the sofa, and may generate a tuple set 4 accordingly. As another example, the relationship prediction model 224 may predict that the relationship between 'cat' and 'door' extracted as objects is that the cat is next to the door, and may generate a tuple set 4 accordingly.

The sentence restructuring model 232 may use the tuple set 4 generated by the relationship prediction model 224 to replace some words with phrases for the tuple set according to the algorithm, thereby expanding the generated caption 3. That is, the sentence restructuring model 232 may further expand the caption by reflecting the tuple set 4 generated by the relationship generating module 220 in the caption 3 generated by the caption generating module 210. For example, the sentence restructuring model 232 may extend the caption to 'a living room photo of a dog in front of a sofa lying on the floor and a cat near a laptop next to the door'.

The relationship graph generation model 226 may generate a relationship graph for the tuple set 4 generated by the relationship prediction model 224 . Here, the relationship graph generation model 226 may express the predicate of the tuple set 4 as a square box and the nouns of the tuple set as a circular box. The relationship graph generation model 226 may connect each box in the order of a first noun - a predicate - a second noun.
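The box shapes described above can be illustrated by emitting Graphviz DOT text, where predicates become box-shaped nodes and nouns become circle-shaped nodes. Using DOT at all is an assumption, since the description does not name a drawing tool; the function name to_dot and the node identifiers p0, p1, and so on are hypothetical.

```python
def to_dot(tuple_set):
    """Render a tuple set as Graphviz DOT text (assumed output format)."""
    lines = ["digraph relations {"]
    for index, (first_noun, predicate, second_noun) in enumerate(tuple_set):
        predicate_id = f"p{index}"  # one predicate node per tuple
        lines.append(f'  "{first_noun}" [shape=circle];')
        lines.append(f'  "{second_noun}" [shape=circle];')
        lines.append(f'  {predicate_id} [shape=box, label="{predicate}"];')
        # Connect the boxes in the order first noun, predicate, second noun.
        lines.append(f'  "{first_noun}" -> {predicate_id} -> "{second_noun}";')
    lines.append("}")
    return "\n".join(lines)


dot_text = to_dot([("sofa", "front", "dog")])
```

The resulting text can be passed to any DOT renderer; giving each predicate its own node identifier keeps two tuples with the same predicate from being merged into a single box.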

The visualization model 234 may display phrases of the extended caption on the image as object areas, and in this case, each object area may be displayed in a different color. Also, the visualization model 234 may visualize the phrases of the extended caption corresponding to each object area by displaying the phrases in the same color as the corresponding object area.

Referring to FIG. 8 , the caption generating module 210 may extract attribute information and object information of a provided image, and generate a caption by reflecting attribute information and object information of the extracted image ( S100 ).

The caption generation module 210 may extract attribute information and object information in the image, and generate a caption using the extracted attribute information and object information. Here, the attribute information may be words related to an image, and the object information may be a core target of the provided image. Here, the caption generating module 210 may generate a caption of the provided image based on the image learned through deep learning and captions for each image.

The relationship generating module 220 may predict a relationship between objects in an image and generate a tuple set for the predicted relationships ( S200 ). The relationship generating module 220 may represent a relationship between objects in an image as a tuple set consisting of (a first noun, a predicate, and a second noun).

The description generating module 230 may generate an extended caption using the caption generated by the caption generating module 210 and the tuple set generated by the relationship generating module 220 ( S300 ). The description generating module 230 may expand the caption by converting the tuple set into a sentence and reflecting it in the caption.

The description generating module 230 may visualize the relationship between the extended caption and the objects as a graph (S400). The description generating module 230 may generate a graph by matching the extended caption and the relationship between the objects. The description generating module 230 may transmit the generated graph to the client 100 so that the user can check the basis for generating the expanded caption.
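Steps S100 to S400 can be summarized as a short orchestration sketch. The four callables are placeholders for the caption generating module 210, the relationship generating module 220, and the two functions of the description generating module 230; their names and signatures are assumptions, and the stub functions below only stand in for trained models.

```python
def run_pipeline(image, generate_caption, predict_relations, extend_caption, visualize):
    """Run S100 to S400 with caller-supplied components (all hypothetical)."""
    caption = generate_caption(image)              # S100: caption from attributes and objects
    tuple_set = predict_relations(image)           # S200: relationships as a tuple set
    extended = extend_caption(caption, tuple_set)  # S300: extended caption
    graph = visualize(extended, tuple_set)         # S400: graph shown as the basis
    return extended, graph


# Stub components used only to show the data flow.
def stub_caption(image):
    return "a dog lying on the floor"

def stub_relations(image):
    return [("sofa", "front", "dog")]

def stub_extend(caption, tuple_set):
    return caption + " " + " ".join("(" + ", ".join(t) + ")" for t in tuple_set)

def stub_visualize(extended, tuple_set):
    return {"caption": extended, "edges": len(tuple_set) * 2}

extended, graph = run_pipeline(
    "image-bytes", stub_caption, stub_relations, stub_extend, stub_visualize
)
```

The pair returned by the function corresponds to what is transmitted to the client 100: the extended caption and the graph that serves as the basis for generating it.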

Referring to FIG. 9 , the attribute extraction model 212 may extract attribute information of an image ( S110 ). Here, the attribute extraction model 212 may be trained on an image and a caption for the image. Accordingly, the attribute extraction model 212 may output attribute information related to a new image by using the learned information.

The object recognition model 214 may extract an important object in the image and convert the object region including the extracted object into a tuple form (S120). The object recognition model 214 may utilize a deep learning-based object recognition model, such as a Mask R-CNN algorithm, to extract regions corresponding to a predefined object region in the provided image as an object region of the provided image.

The image caption model 216 may give word attention and region attention to the attribute information and object regions extracted from the provided image (S130). The image caption model 216 may assign word attention to the words in order of their relevance to the word tag generated at the current time. Here, the word attention and the region attention are values between 0 and 1, and may be closer to 1 as the relevance to the word tag increases.

The image caption model 216 may predict a word tag and a grammar tag for the caption at each time step, based on the attribute information extracted by the attribute extraction model 212, the object regions extracted by the object recognition model 214, the word attention, and the region attention (S140). The image caption model 216 may calculate loss values for the word tag and the grammar tag by comparing the predicted word tag and the predicted grammar tag, respectively, with the correct caption sentence.
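The loss calculation for the two tag types can be sketched as follows. This is a hedged example, not the method of the embodiment: it assumes a cross-entropy loss for each tag, averaged over time steps, and a hypothetical `grammar_weight` parameter for combining the two losses. The function names and the toy probabilities are likewise illustrative.

```python
import math

def cross_entropy(predicted_probs, correct_tag):
    """Negative log-probability that the model assigned to the correct tag."""
    return -math.log(predicted_probs[correct_tag])

def caption_loss(word_probs, grammar_probs, correct_words, correct_grammar,
                 grammar_weight=1.0):
    """Return (word loss, grammar loss, combined loss), each averaged over time steps."""
    steps = len(correct_words)
    word_loss = sum(cross_entropy(p, t)
                    for p, t in zip(word_probs, correct_words)) / steps
    grammar_loss = sum(cross_entropy(p, t)
                       for p, t in zip(grammar_probs, correct_grammar)) / steps
    # Assumed combination rule: weighted sum of the two losses.
    return word_loss, grammar_loss, word_loss + grammar_weight * grammar_loss

# Toy predictions for the three-step caption "a dog runs" (hypothetical values).
word_probs = [
    {"a": 0.7, "dog": 0.2, "runs": 0.1},
    {"a": 0.1, "dog": 0.8, "runs": 0.1},
    {"a": 0.1, "dog": 0.1, "runs": 0.8},
]
grammar_probs = [
    {"DET": 0.9, "NOUN": 0.05, "VERB": 0.05},
    {"DET": 0.1, "NOUN": 0.8, "VERB": 0.1},
    {"DET": 0.1, "NOUN": 0.2, "VERB": 0.7},
]
word_loss, grammar_loss, total_loss = caption_loss(
    word_probs, grammar_probs, ["a", "dog", "runs"], ["DET", "NOUN", "VERB"])
```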

The image caption model 216 may generate a caption by reflecting the loss values for the word tag and the grammar tag (S150). Accordingly, the image caption model 216 may learn to generate, for the provided image, a caption sentence in which grammar is considered, by using the word tag and the grammar tag.

Referring to FIG. 10, the description generating module 230 may remove, from among the tuple sets generated by the relationship generating module 220, the tuple sets already included in the caption generated by the caption generating module 210 (S310). Here, when the first noun, the second noun, and the predicate of a tuple set are all included in the caption generated by the caption generating module 210, the tuple set is determined to be a duplicate tuple set and may be deleted.
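The duplicate check in step S310 can be sketched as below. The tuple layout `(first noun, predicate, second noun)`, the exact word-level matching, and the names `is_duplicate` and `remove_duplicates` are assumptions for illustration; the description does not state how inclusion in the caption is tested.

```python
def is_duplicate(relation, caption):
    """A tuple is a duplicate when both nouns and the predicate all appear in the caption."""
    # Assumed matching rule: case-insensitive comparison against whitespace-separated words.
    words = set(caption.lower().replace(".", " ").split())
    first_noun, predicate, second_noun = relation
    return all(term.lower() in words for term in (first_noun, predicate, second_noun))

def remove_duplicates(tuple_set, caption):
    """Keep only the tuples that add information not already in the caption."""
    return [relation for relation in tuple_set if not is_duplicate(relation, caption)]

# Hypothetical caption and relationship tuples.
caption = "A man is riding a horse."
tuple_set = [
    ("man", "riding", "horse"),   # fully covered by the caption, so removed
    ("man", "wearing", "hat"),    # "wearing" and "hat" are missing, so kept
    ("horse", "on", "beach"),     # "on" and "beach" are missing, so kept
]
remaining = remove_duplicates(tuple_set, caption)
```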

The description generating module 230 may remove the duplicate tuple sets and convert the remaining tuple sets into a sentence form (S320). Here, when the predicate of a tuple set is a preposition, the description generating module 230 may convert the tuple set into a sentence form by listing the first noun, the preposition, and the second noun in that order. On the other hand, when the predicate of a tuple set is a verb, the description generating module 230 may convert the tuple set into a sentence form by listing the second noun, the verb, and the first noun in that order.
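A minimal sketch of the conversion rule in step S320 follows. The word orders are taken from the description as written; the tuple layout `(first noun, predicate, second noun)`, the small `PREPOSITIONS` set used to tell prepositions from verbs, and the function name `tuple_to_phrase` are assumptions made for this example.

```python
# Hypothetical preposition list; the description does not say how the
# predicate type is determined.
PREPOSITIONS = {"on", "in", "under", "near", "behind", "with"}

def tuple_to_phrase(relation):
    """Convert a (first noun, predicate, second noun) tuple into a phrase."""
    first_noun, predicate, second_noun = relation
    if predicate in PREPOSITIONS:
        # Preposition: first noun, preposition, second noun.
        return f"{first_noun} {predicate} {second_noun}"
    # Verb: second noun, verb, first noun, following the order stated in the description.
    return f"{second_noun} {predicate} {first_noun}"

preposition_phrase = tuple_to_phrase(("horse", "on", "beach"))
verb_phrase = tuple_to_phrase(("man", "wearing", "hat"))
```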

The description generating module 230 may reflect the sentences converted from the tuple sets in the caption (S330). Thereafter, a score may be calculated by comparing the caption in which a converted sentence is reflected (the extended caption) with the correct caption, and the phrase having the highest score may be selected. The sentence restructuring model 232 may repeat the cycle of converting a tuple set into a sentence form, applying it to the caption, and selecting the phrase having the highest score until no tuple sets remain. Thereafter, the description generating module 230 may select the last selected phrase as the final extended caption.
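The iteration in step S330 can be read as a greedy loop, sketched below under several assumptions. The scoring metric is not named in the description, so a simple word-overlap F1 (`overlap_score`) stands in for it. The way a phrase is "applied" to the caption is also assumed: `apply_phrase` replaces the first occurrence of an anchor noun with the phrase. All function names and example strings are hypothetical.

```python
def overlap_score(candidate, reference):
    """Stand-in score: F1 over the sets of words in the candidate and the reference."""
    cand = set(candidate.lower().split())
    ref = set(reference.lower().split())
    common = len(cand & ref)
    if common == 0:
        return 0.0
    precision, recall = common / len(cand), common / len(ref)
    return 2 * precision * recall / (precision + recall)

def apply_phrase(caption, anchor, phrase):
    """Replace the first occurrence of the anchor noun with the tuple phrase."""
    words = caption.split()
    if anchor not in words:
        return caption
    i = words.index(anchor)
    return " ".join(words[:i] + phrase.split() + words[i + 1:])

def extend_caption(caption, phrases, reference):
    """Greedily apply (anchor noun, phrase) pairs until none remain."""
    remaining = list(phrases)
    while remaining:
        # Score every remaining phrase against the current caption; keep the best one.
        best = max(remaining,
                   key=lambda pair: overlap_score(apply_phrase(caption, *pair), reference))
        remaining.remove(best)
        caption = apply_phrase(caption, *best)
    # The caption after the last selection is the final extended caption.
    return caption

extended = extend_caption(
    "a man riding a horse",
    [("man", "man with hat"), ("horse", "horse on beach")],
    "a man with a hat riding a horse on the beach",
)
```

Because the score is computed against the correct caption, a loop of this kind fits a setting in which a reference caption is available, such as training or evaluation.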

The description generating module 230 may visualize the caption extended by the sentence restructuring model 232 by matching it with the tuple set (S340). The description generating module 230 may generate a graph representing the relationships of the tuple set by matching the extended caption with the tuple set. In addition, the visualization model 234 may transmit the generated graph representing the relationships of the tuple set to the client 100 so that the user can check the basis for generating the extended caption.
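One possible way to represent such a graph is sketched below: each noun becomes a node and each predicate becomes a labeled edge, written out as Graphviz DOT text that a client could render. The DOT output format, the function name `tuples_to_dot`, and the example data are assumptions; the description does not specify a graph format. The sketch also assumes that the caption and the tuple terms contain no double-quote characters.

```python
def tuples_to_dot(tuple_set, caption):
    """Render relationship tuples as a DOT digraph labeled with the extended caption."""
    lines = ["digraph relations {", f'  label="{caption}";']
    for first_noun, predicate, second_noun in tuple_set:
        # One directed edge per tuple: first noun -> second noun, labeled by the predicate.
        lines.append(f'  "{first_noun}" -> "{second_noun}" [label="{predicate}"];')
    lines.append("}")
    return "\n".join(lines)

# Hypothetical extended caption and tuple set.
dot_text = tuples_to_dot(
    [("man", "riding", "horse"), ("horse", "on", "beach")],
    "a man riding a horse on beach",
)
```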

As described above, according to an embodiment of the present invention, an automatic image caption generation system and method may be provided in which a caption is generated by extracting attribute information and object information in an image using deep learning, and the generated caption is restructured by predicting the relationships between pieces of object information.

Those skilled in the art to which the present invention pertains should understand that the present invention may be embodied in other specific forms without changing its technical spirit or essential characteristics, and that the embodiments described above are therefore illustrative in all respects and not restrictive. The scope of the present invention is indicated by the following claims rather than by the detailed description, and all changes or modifications derived from the meaning and scope of the claims and their equivalent concepts should be construed as being included in the scope of the present invention.

Claims

1. An automatic caption generation system for automatically generating, for an image, a caption describing the image, the system comprising:

a client device providing the image; and

a caption generator that analyzes the image provided from the client device to generate a caption describing the image, and transmits the generated caption and a basis for generating the caption to the client device.
2. The system of claim 1,

wherein the caption generator comprises:

a caption generating module that extracts attribute information and object information in the provided image using deep learning, and generates the caption using the attribute information and the object information;

a relationship generating module that predicts a relationship between objects in the image and generates a tuple set in which the predicted relationships are structured in a tuple form; and

a description generating module that generates an extended caption by restructuring the caption using the caption generated by the caption generating module and the tuple set generated by the relationship generating module, and visualizes a graph for the extended caption and the tuple set.
3. The system of claim 2,

wherein the caption generating module comprises:

an attribute extraction model for extracting words most related to the provided image and converting each word into a tuple form;

an object recognition model for extracting important objects in the image and converting an object region including the extracted objects into a tuple form; and

an image caption model for generating a caption of the image by using the words extracted by the attribute extraction model and the object region extracted by the object recognition model.
4. The system of claim 2,

wherein the image caption model is implemented by a deep learning algorithm based on a recurrent neural network (RNN), and predicts a relationship between objects in the image in time series.
5. The system of claim 3,

wherein the image caption model comprises:

an attribute attention model for assigning word attention to the words extracted by the attribute extraction model;

an object attention model for assigning region attention to the object regions extracted by the object recognition model;

a grammar learning model for learning a grammar of a sentence for the image and the caption of the image; and

a language generation model for generating word tags and grammar tags for the caption at each time step based on the words extracted by the attribute extraction model, the object regions extracted by the object recognition model, the word attention, and the region attention.
6. The system of claim 5,

wherein the attribute attention model assigns the word attention in order of the words having the highest relevance to the word tag generated by the language generation model,

the object attention model assigns the region attention in order of relevance to the word tag generated by the language generation model, and

the word attention and the region attention are values between 0 and 1 that are closer to 1 as the relevance to the word tag increases.
7. The system of claim 2,

wherein the relationship generating module comprises:

an object recognition model for extracting important object regions in the provided image;

a relationship prediction model for predicting a relationship between the extracted regions and structuring the predicted relationships between the regions in a tuple form to generate a tuple set; and

a relation graph generation model for generating one graph for the generated tuple set.
8. The system of claim 7,

wherein the description generating module comprises:

a sentence restructuring model that replaces some words with phrases for tuples according to an algorithm using the caption generated by the caption generating module and the tuple set generated by the relationship generating module, and extends the generated caption; and

a visualization model for visualizing the caption extended by the sentence restructuring model by matching it with the tuple information.
9. An automatic caption generation method for automatically generating, for an image, a caption describing the image, the method comprising:

extracting, in a caption generating module, attribute information and object information from the image by using deep learning, and generating the caption using the attribute information and the object information;

predicting, in a relationship generating module, a relationship between objects in the image, and generating a tuple set in which the predicted relationships are structured in a tuple form; and

generating, in a description generating module, an extended caption by restructuring the caption using the generated caption and the tuple set, and visualizing a graph for the extended caption and the tuple set.
10. The method of claim 9, wherein the generating of the caption comprises:

extracting, in the caption generating module, words most related to the image and converting each word into a tuple form;

extracting, in an object recognition model, important objects in the image, and converting an object region including the extracted objects into a tuple form; and

generating, in an image caption model, a caption of the image by using the extracted words and the extracted object region.
11. The method of claim 10, wherein the generating of the caption of the image comprises:

assigning, in the image caption model, word attention to the extracted words;

assigning, in the image caption model, region attention to the extracted object regions; and

generating, in the image caption model, word tags and grammar tags for the caption at each time step based on the extracted words, the extracted object regions, the word attention, and the region attention.
12. The method of claim 9, wherein the generating of the tuple set comprises:

extracting, in an object recognition model, important object regions in the image;

predicting, in a relationship prediction model, a relationship between the extracted regions, and structuring the predicted relationships between the regions in a tuple form to generate a tuple set; and

generating, in a relation graph generation model, one graph for the generated tuple set.
13. The method of claim 9, wherein the visualizing of the graph for the tuple set comprises:

replacing, in a sentence restructuring model, some words with phrases for tuples according to an algorithm using the generated caption and the tuple set generated in the relationship generating module, and extending the generated caption; and

visualizing, in a visualization model, the extended caption by matching it with the tuple information.