WO2022114322A1 - System and method for automatically generating a caption using an image object-attribute attention model based on a deep learning algorithm - Google Patents

System and method for automatically generating a caption using an image object-attribute attention model based on a deep learning algorithm

Info

Publication number
WO2022114322A1
WO2022114322A1 PCT/KR2020/017272 KR2020017272W WO2022114322A1 WO 2022114322 A1 WO2022114322 A1 WO 2022114322A1 KR 2020017272 W KR2020017272 W KR 2020017272W WO 2022114322 A1 WO2022114322 A1 WO 2022114322A1
Authority
WO
WIPO (PCT)
Prior art keywords
caption
image
model
generating
relationship
Prior art date
Application number
PCT/KR2020/017272
Other languages
English (en)
Korean (ko)
Inventor
최호진
한승호
Original Assignee
한국과학기술원
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 한국과학기술원 filed Critical 한국과학기술원
Priority to PCT/KR2020/017272 priority Critical patent/WO2022114322A1/fr
Publication of WO2022114322A1 publication Critical patent/WO2022114322A1/fr

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 - Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 - Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/431 - Generation of visual interfaces for content selection or interaction; Content or additional data rendering
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 - Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47 - End-user applications
    • H04N21/488 - Data services, e.g. news ticker
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80 - Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/81 - Monomedia components thereof

Definitions

  • the present invention relates to a system and method for automatically generating image captions using an image object-attribute attention model based on a deep learning algorithm.
  • Image captioning is the task of generating a natural language sentence that describes a provided image. Recently, with the development of artificial intelligence technology, technologies for automatically generating captions with a machine are being developed.
  • Conventionally, a machine generates captions automatically by using a large set of existing images and the label information (a single word describing each image) attached to them; that is, a caption for an image is created by searching for images having the same label, or by assigning the labels of similar images to the image.
  • the present invention is intended to solve the above-described problem, and an object of the present invention is to provide an automatic caption generation system and method that extract attribute information and object information in an image using deep learning to generate a caption, and predict relationships between the objects to restructure the generated caption.
  • an automatic caption generation system for automatically generating a caption describing an image may include a client device that provides the image for which a caption is to be generated, and a caption generator that analyzes the image received from the client device to generate a caption describing the image and transmits the generated caption, together with the basis for generating it, to the client device.
  • the caption generation module may extract attribute information and object information in the image using deep learning and generate a caption from the extracted attribute information and object information; the relationship creation module may predict relationships between objects in the image and create a tuple set in which the predicted relationships are structured in tuple form; and the description generation module may create an extended caption by restructuring the caption using the generated caption and tuple set, and may visualize a graph for the extended caption and tuple set.
  • the automatic image caption generation system and method generate a caption that reflects attribute information and object information in the image using deep learning, which can improve the performance of caption generation for the image.
  • FIG. 1 is a diagram showing the configuration of an image caption automatic generation system according to an embodiment of the present invention.
  • FIG. 2 is a diagram illustrating a configuration of a caption generator according to an embodiment of the present invention.
  • FIG. 3 is a diagram showing the configuration of a caption generating module according to an embodiment of the present invention.
  • FIG. 4 is a diagram showing the configuration of a relationship creation module according to an embodiment of the present invention.
  • FIG. 5 is a diagram illustrating the configuration of a description generating module according to an embodiment of the present invention.
  • FIG. 6 is a diagram illustrating caption generation for an image according to an embodiment of the present invention.
  • FIG. 7 is a diagram illustrating generation of extended captions according to an embodiment of the present invention.
  • FIG. 8 is a diagram illustrating a method for automatically generating image captions according to an embodiment of the present invention.
  • FIG. 9 is a diagram illustrating a method of generating a caption according to an embodiment of the present invention.
  • FIG. 10 is a diagram illustrating a method of generating an extended caption according to an embodiment of the present invention.
  • FIG. 1 is a diagram showing the configuration of an image caption automatic generation system according to an embodiment of the present invention.
  • a system 1000 for automatically generating image captions may include a client 100 and a caption generator 200 .
  • the client 100 may provide an image for generating a caption.
  • the client 100 may be a user device (or client device) such as a smart phone or a tablet PC.
  • the client 100 may provide an image acquired (or photographed) in the user device and/or an image stored in the user device to the caption generator 200 .
  • the client 100 according to embodiments of the present invention is not limited to the aforementioned smart phone or tablet PC, and may be equally applied to various types of electronic devices.
  • the caption generator 200 may analyze the image provided from the client 100 to generate a caption describing the image, and transmit the generated caption and the basis for generating the caption to the client 100 .
  • the caption generator 200 may be a server capable of communicating with a user device of the client 100 by wire and/or wirelessly.
  • the caption generator 200 may analyze the image through deep learning. Specifically, the caption generator 200 may be trained on images and the correct (reference) captions for those images.
  • using the learned images and their correct captions, the caption generator 200 may generate a caption for a new image, such as the image provided from the client 100.
  • the correct caption may be a sentence containing five or more phrases set arbitrarily by a user for the image.
  • the caption generator 200 may extract an object of the provided image to predict a relationship between the objects, and may generate an extended caption by applying the predicted relationship to the generated caption.
  • the caption generator 200 may transmit the extended caption and the basis for generating it to the client 100, and the client 100 may use the delivered caption and its basis to interpret the results of the deep learning.
  • the client 100 and the caption generator 200 may be connected by wire or wirelessly.
  • FIG. 2 is a diagram illustrating a configuration of a caption generator according to an embodiment of the present invention.
  • the caption generator 200 may include a caption generating module 210 , a relationship generating module 220 , and a description generating module 230 .
  • the caption generating module 210 may be trained on images and their correct captions, and may generate a caption for the image provided from the client 100 by using the learned images and captions.
  • the caption generation module 210 may extract attribute information and object information in the image, and generate a caption using the extracted attribute information and object information.
  • the attribute information may be words related to the image, and the object information may be a core target of the provided image.
  • for example, the attribute information may be 'dog' or 'sofa', and the object information may be the dog or the sofa appearing in the image.
  • the relationship generating module 220 may predict a relationship between objects in an image and generate a tuple set in which the predicted relationships are structured in a tuple form.
  • the tuple form enumerates elements, which are listed inside parentheses '( )' and separated by commas ','.
  • for example, the relationship generating module 220 may predict a relationship between the dog and sofa objects: it may predict that the dog is in front of the sofa and structure the predicted relationship as (sofa, front, dog). Here, '(sofa, front, dog)' is one element of the tuple set.
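  • The following is a minimal illustrative sketch (the function name and typing are assumptions, not part of the patent) of how predicted relationships could be structured as (first noun, predicate, second noun) tuples and collected into a tuple set:

```python
# Illustrative sketch: structuring predicted relationships as tuples.
from typing import List, Tuple

Relation = Tuple[str, str, str]  # (first noun, predicate, second noun)

def build_tuple_set(predictions: List[Relation]) -> List[Relation]:
    """Collect predicted relations into a tuple set, dropping exact repeats."""
    seen, tuple_set = set(), []
    for rel in predictions:
        if rel not in seen:
            seen.add(rel)
            tuple_set.append(rel)
    return tuple_set

# Example from the description: a dog in front of a sofa.
print(build_tuple_set([("sofa", "front", "dog"), ("sofa", "front", "dog")]))
# [('sofa', 'front', 'dog')]
```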
  • the description generating module 230 may generate an extended caption by restructuring the caption, using the caption generated by the caption generating module 210 and the tuple set generated by the relationship generating module 220. That is, the description generating module 230 may generate the expanded caption by reflecting, in the caption generated by the caption generating module 210, the relationships between objects predicted by the relationship generating module 220. The description generating module 230 may also provide the client 100 with a visualization of the extended caption and a graph of the tuple set that served as the basis for generating the caption.
  • FIG. 3 is a diagram showing the configuration of a caption generating module according to an embodiment of the present invention.
  • the caption generation module 210 may include an attribute extraction model 212 , an object recognition model 214 , and an image caption model 216 .
  • the attribute extraction model 212 may extract attribute information of the provided image and convert the attribute information into a vector representation (or tuple form).
  • the attribute extraction model 212 may be trained in advance on images and their captions using an image-text embedding model based on a deep learning algorithm.
  • the attribute extraction model 212 may be trained in advance by extracting words related to each image from an image caption database.
  • the image-text embedding model may be a model that maps many images and the words related to each image into one vector space, so that when a new image is input it outputs the words related to that image. That is, the attribute extraction model 212 may output words related to a new image by using the images and related words stored in the vector space, and the output words may be used for learning.
  • from the captions for each image, the attribute extraction model 212 may extract words in verb form (including gerunds and participles) and noun-form words that appear three or more times. The attribute extraction model 212 may then learn to embed the image and the extracted words into one vector space using a deep learning model.
  • the attribute extraction model 212 may extract the words most related to the provided image by using the learned images and their caption data.
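  • As a rough illustration of the image-text embedding idea, the sketch below ranks vocabulary words by cosine similarity to an image embedding in a shared vector space. The encoders are not shown and all vectors are placeholders; this is an assumption-level sketch, not the patent's trained model.

```python
# Illustrative sketch: retrieving the words most related to an image embedding.
import numpy as np

def most_related_words(image_vec, word_vecs, top_k=5):
    """Rank vocabulary words by cosine similarity to the image embedding."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    ranked = sorted(word_vecs.items(), key=lambda kv: cos(image_vec, kv[1]), reverse=True)
    return [word for word, _ in ranked[:top_k]]

# Toy example with random 4-dimensional embeddings standing in for a learned space.
rng = np.random.default_rng(0)
vocab = {w: rng.normal(size=4) for w in ["dog", "sofa", "floor", "cat", "laptop"]}
print(most_related_words(rng.normal(size=4), vocab, top_k=3))
```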
  • the object recognition model 214 may extract an important object in the image and convert the object region including the extracted object into a vector representation (or tuple form).
  • the object recognition model 214 may use a deep-learning-based object recognition model, such as the Mask R-CNN algorithm, to extract regions of the provided image that correspond to predefined objects as the object regions of the provided image.
  • the object recognition model 214 may be trained in advance before the caption generating module 210 of FIG. 2 is trained.
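  • As one possible concrete form of such a pre-trained object recognizer, the sketch below uses torchvision's Mask R-CNN to keep detections above a score threshold as object regions. It assumes a recent torchvision; the threshold and usage are illustrative choices, not the patent's configuration.

```python
# Illustrative sketch: extracting object regions with a pre-trained Mask R-CNN.
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

def extract_object_regions(image: torch.Tensor, score_threshold: float = 0.7):
    """image: float tensor (3, H, W) in [0, 1]; returns confident boxes and labels."""
    model = maskrcnn_resnet50_fpn(weights="DEFAULT")
    model.eval()
    with torch.no_grad():
        outputs = model([image])[0]  # dict with 'boxes', 'labels', 'scores', 'masks'
    keep = outputs["scores"] >= score_threshold
    return outputs["boxes"][keep], outputs["labels"][keep]

# Usage (hypothetical file name):
# boxes, labels = extract_object_regions(torchvision.io.read_image("room.jpg") / 255.0)
```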
  • the image caption model 216 may generate a caption describing the image provided from the client 100 based on vectors generated from the words extracted by the attribute extraction model 212 and the object regions extracted by the object recognition model 214.
  • the image caption model 216 uses a deep learning algorithm and may be based on a recurrent neural network (RNN); accordingly, the image caption model 216 may predict relationships between objects in the image sequentially over time.
  • the image caption model 216 may include an attribute attention model 216a, an object attention model 216b, a grammar learning model 216c, and a language generation model 216d.
  • the attribute attention model 216a may assign an attention score to words extracted from the attribute extraction model 212 .
  • the attribute attention model 216a may assign higher word attention to extracted words in order of their relevance to the word tag generated by the language generation model 216d at the current time.
  • the word attention is a value between 0 and 1, and is closer to 1 the more relevant the word is to the word tag.
  • the object attention model 216b may assign region attention to the object regions extracted from the object recognition model 214.
  • the object attention model 216b may assign higher region attention to object regions in order of their relevance to the word tag generated by the language generation model 216d at the current time.
  • the region attention is a value between 0 and 1, and is closer to 1 the more relevant the region is to the word tag.
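  • A minimal sketch of the kind of soft attention described above is shown below: scores are softmax-normalized, so each weight lies between 0 and 1 and grows with relevance to the current decoder state. The dimensions and scoring function are assumptions, not the patent's exact design.

```python
# Illustrative sketch: soft attention over word (or region) vectors.
import torch
import torch.nn.functional as F

def soft_attention(query: torch.Tensor, keys: torch.Tensor) -> torch.Tensor:
    """query: (d,) state for the current word tag; keys: (n, d) word or region vectors.
    Returns (n,) attention weights in [0, 1] that sum to 1."""
    scores = keys @ query / keys.shape[-1] ** 0.5  # scaled dot products
    return F.softmax(scores, dim=0)

word_vecs = torch.randn(5, 16)  # vectors for 5 extracted attribute words
state = torch.randn(16)         # decoder state at the current time step
print(soft_attention(state, word_vecs))  # five weights, higher for more relevant words
```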
  • the grammar learning model 216c may learn the grammar of a sentence for an image and a caption of the image.
  • the grammar learning model 216c may tag each word of the correct caption sentence of the image using a grammar tagging tool such as EasySRL, and thereby learn the grammar of the correct caption sentence.
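  • The patent names EasySRL as one possible grammar tagging tool; its specific API is not reproduced here. As an assumption-level stand-in, the sketch below tags each word of a reference caption with a part-of-speech tag using NLTK, illustrating the per-word grammar tags the model would learn from.

```python
# Illustrative stand-in: per-word grammar (POS) tags for a correct caption sentence.
import nltk

# Resource names differ across NLTK versions; both downloads are attempted quietly.
nltk.download("averaged_perceptron_tagger", quiet=True)
nltk.download("averaged_perceptron_tagger_eng", quiet=True)

caption = "a dog and a cat lying on the floor"
print(nltk.pos_tag(caption.split()))
# e.g. [('a', 'DT'), ('dog', 'NN'), ('and', 'CC'), ('a', 'DT'), ('cat', 'NN'), ...]
```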
  • the language generation model 216d may generate a word tag and a grammar tag for the caption at each time step based on the words extracted by the attribute extraction model 212, the object regions extracted by the object recognition model 214, the word attention produced by the attribute attention model 216a, and the region attention produced by the object attention model 216b.
  • the language generation model 216d may predict a word tag and a grammar tag at the current time by considering the word attention value, the region attention value, the average vector of the words converted to tuple form by the attribute extraction model 212, the average vector of the object regions converted to tuple form by the object recognition model 214, the word it generated at the previous time, and compressed information about all the words it has generated so far.
  • the language generation model 216d may calculate loss values for the generated word tag and grammar tag by comparing the predicted word tag and grammar tag with the correct caption sentence.
  • the language generation model 216d may update the learning parameters of the caption generation module 210 by reflecting the loss values for the word tag and the grammar tag.
  • the language generation model 216d may generate a caption sentence in which the grammar is considered with respect to the provided image by using the word tag and the grammar tag.
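  • A minimal sketch, under assumed dimensions, of a decoder step that predicts a word tag and a grammar tag at each time step and sums their losses is shown below. It illustrates the two-headed prediction and combined loss described above, not the patent's exact network.

```python
# Illustrative sketch: one decoding step with word-tag and grammar-tag heads.
import torch
import torch.nn as nn

class CaptionDecoderStep(nn.Module):
    def __init__(self, ctx_dim, hidden_dim, vocab_size, grammar_tags):
        super().__init__()
        self.cell = nn.LSTMCell(ctx_dim, hidden_dim)
        self.word_head = nn.Linear(hidden_dim, vocab_size)
        self.grammar_head = nn.Linear(hidden_dim, grammar_tags)
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, context, state, word_target=None, grammar_target=None):
        # context: attended word/region features plus the previous word, shape (B, ctx_dim)
        h, c = self.cell(context, state)
        word_logits, grammar_logits = self.word_head(h), self.grammar_head(h)
        loss = None
        if word_target is not None and grammar_target is not None:
            # sum of word-tag and grammar-tag losses against the correct caption
            loss = self.loss_fn(word_logits, word_target) + self.loss_fn(grammar_logits, grammar_target)
        return word_logits, grammar_logits, (h, c), loss

step = CaptionDecoderStep(ctx_dim=64, hidden_dim=128, vocab_size=1000, grammar_tags=40)
state = (torch.zeros(2, 128), torch.zeros(2, 128))
*_, loss = step(torch.randn(2, 64), state, torch.randint(1000, (2,)), torch.randint(40, (2,)))
print(loss)  # combined loss for this time step
```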
  • FIG. 4 is a diagram showing the configuration of a relationship creation module according to an embodiment of the present invention.
  • the relationship creation module 220 may include an object extraction model 222 , a relationship prediction model 224 , and a relationship graph generation model 226 .
  • the object extraction model 222 may extract important objects in the provided image and may extract the object regions that include the extracted objects.
  • the relationship prediction model 224 may predict a relationship between the extracted object regions and structure the relationship between the predicted object regions in a tuple form.
  • the relationship prediction model 224 may structure the relationship between predicted object regions in the form of (first noun, predicate, second noun).
  • the first noun and/or the second noun may be a noun representing an object in the image.
  • the relationship graph generation model 226 may generate one graph for the generated tuple set.
  • the relationship graph generation model 226 may generate a graph for the tuple set, for example by drawing an arrow from the first noun to the predicate and an arrow from the predicate to the second noun.
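  • A minimal sketch of such a relationship graph, built with networkx, is shown below: each tuple contributes an edge from the first noun to the predicate and from the predicate to the second noun. Node naming and drawing details (box shapes, layout) are illustrative choices.

```python
# Illustrative sketch: building the tuple-set relationship graph.
import networkx as nx

def tuple_set_to_graph(tuple_set):
    graph = nx.DiGraph()
    for first_noun, predicate, second_noun in tuple_set:
        pred_node = f"{predicate} ({first_noun}->{second_noun})"  # keep predicate nodes distinct
        graph.add_edge(first_noun, pred_node)
        graph.add_edge(pred_node, second_noun)
    return graph

g = tuple_set_to_graph([("sofa", "front", "dog"), ("door", "next to", "cat")])
print(list(g.edges()))
```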
  • FIG. 5 is a diagram illustrating the configuration of a description generating module according to an embodiment of the present invention.
  • the description generating module 230 may include a sentence restructuring model 232 and a visualization model 234 .
  • the sentence restructuring model 232 may expand the generated caption by using the caption generated by the caption generation module 210 and the tuple set generated by the relationship generation module 220, replacing some words with phrases derived from the tuples according to an algorithm. That is, the sentence restructuring model 232 may further expand the caption by reflecting the tuple set generated by the relationship generating module 220 in the caption generated by the caption generating module 210.
  • the sentence restructuring model 232 may first remove, from among the tuple sets generated by the relationship generation module 220, the tuples already included in the caption generated by the caption generation module 210.
  • if the first noun, the second noun, and the predicate of a tuple are all included in the caption generated by the caption generating module 210, the tuple is determined to be a duplicate, and the duplicate tuple can be deleted.
  • the sentence restructuring model 232 may remove the duplicate tuples and convert the remaining tuples into sentence form.
  • when the predicate of a tuple is a preposition, the sentence restructuring model 232 may convert the tuple into sentence form by listing the first noun, the preposition, and the second noun in that order.
  • when the predicate of a tuple is a verb, the sentence restructuring model 232 may convert the tuple into sentence form by listing the second noun, the verb, and the first noun in that order.
  • the sentence restructuring model 232 may convert the tuple set into 'dog in front of the sofa'.
  • the sentence restructuring model 232 may convert the tuple set into 'a person lying in bed'.
  • the sentence restructuring model 232 may convert a tuple into sentence form and reflect the converted sentence in the caption. A score may then be calculated by comparing the caption in which the converted sentence is reflected (the extended caption) with the correct caption, and the phrase having the highest score may be selected. The sentence restructuring model 232 may repeat this process of converting a tuple into sentence form, applying it to the caption, and selecting the highest-scoring phrase until no tuples remain, and may then select the last selected phrase as the final extended caption.
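  • A simplified sketch of this loop is given below: tuples already covered by the caption are dropped, each remaining tuple is turned into a phrase and spliced into the caption, and the candidate scoring highest against a reference caption is kept until no tuples remain. The phrase substitution and the word-overlap scoring function are illustrative stand-ins, not the patent's exact algorithm.

```python
# Illustrative sketch: iterative caption restructuring from a tuple set.
def tuple_to_phrase(t, predicate_is_preposition=True):
    first, pred, second = t
    return f"{first} {pred} {second}" if predicate_is_preposition else f"{second} {pred} {first}"

def overlap_score(candidate, reference):
    cand, ref = set(candidate.split()), set(reference.split())
    return len(cand & ref) / max(len(ref), 1)

def restructure(caption, tuple_set, reference):
    remaining = [t for t in tuple_set if not all(w in caption for w in t)]  # drop duplicates
    while remaining:
        candidates = []
        for t in remaining:
            new_caption = caption.replace(t[0], tuple_to_phrase(t), 1)  # expand around the first noun
            candidates.append((overlap_score(new_caption, reference), new_caption, t))
        _, caption, best = max(candidates)  # keep the highest-scoring expansion
        remaining.remove(best)
    return caption

print(restructure("a dog and a cat lying on the floor",
                  [("dog", "in front of", "sofa")],
                  "a dog in front of a sofa lying on the floor"))
```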
  • the visualization model 234 may visualize the caption extended by the sentence restructuring model 232 by matching it with the tuple set.
  • the visualization model 234 may generate a graph representing the relationship between the tuple sets by matching the caption extended in the sentence restructuring model 232 with the tuple set.
  • the visualization model 234 may transmit a graph representing the relationship between the generated tuple set to the client 100 so that the user can check the basis for generating the expanded caption.
  • the visualization model 234 may display an object region corresponding to the tuple set reflected in the caption on the provided image.
  • the visualization model 234 may display each object area through different colors or different lines (such as line types or thicknesses).
  • the visualization model 234 may display a phrase corresponding to an object region in the final caption in the same color as that object region. For example, if the final caption sentence is 'a dog in front of a sofa lying on the floor and a cat around a laptop', the visualization model 234 may display the sofa and the dog in the provided image as one object region outlined with a red line.
  • the visualization model 234 may then display 'dog in front of the sofa' in red text in the final caption sentence. By displaying each phrase and its corresponding object region in the same color, the user can recognize the correspondence at a glance.
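  • A minimal sketch of this color-matched visualization with matplotlib is shown below. The image, boxes, phrases, and colors are placeholder values used only to illustrate drawing each object region and its caption phrase in the same color.

```python
# Illustrative sketch: object regions and matching caption phrases in shared colors.
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import numpy as np

image = np.ones((240, 320, 3))  # placeholder image
regions = {"dog in front of a sofa": ((40, 60, 120, 100), "red"),
           "cat around a laptop": ((180, 90, 90, 80), "blue")}

fig, ax = plt.subplots()
ax.imshow(image)
for i, (phrase, ((x, y, w, h), color)) in enumerate(regions.items()):
    ax.add_patch(patches.Rectangle((x, y), w, h, fill=False, edgecolor=color, linewidth=2))
    ax.text(5, 225 - 15 * i, phrase, color=color)  # caption phrase in the matching color
ax.axis("off")
plt.show()
```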
  • FIG. 6 is a diagram illustrating caption generation for an image according to an embodiment of the present invention.
  • the attribute extraction model 212 may extract attribute information 1 in the provided image 10 .
  • the attribute extraction model 212 may extract attribute information 1 in the provided image 10 based on the learned image and the correct caption of the image. As an example, the attribute extraction model 212 may extract a dog, a cat, a floor, etc. as the attribute information 1 .
  • the object recognition model 214 may extract the object region 2 including the object information and the object in the provided image 10 at the same time that the attribute extraction model 212 extracts the attribute information 1 .
  • the object recognition model 214 may extract the object region 2 in the provided image 10 based on the learned image and the correct caption of the image.
  • the object recognition model 214 may extract a dog, a cat, a floor, etc. as object information, and may extract the object region 2 including the object information.
  • the image caption model 216 may generate a caption 3 for the provided image 10 using the attribute information extracted from the attribute extraction model 212 and the object information extracted from the object recognition model 214 .
  • the image caption model 216 may generate the caption 3 'a living room photo of a dog and a cat lying on the floor'.
  • FIG. 7 is a diagram illustrating generation of extended captions according to an embodiment of the present invention.
  • the object extraction model 222 may extract the object region 2 including the object information and the object in the provided image 10 .
  • the object extraction model 222 may extract object information in the provided image 10 based on the learned images and their correct captions. For example, the object extraction model 222 may extract a dog, a cat, a sofa, a laptop, a door, etc. as object information, and may extract the object regions 2 that include the object information. In this case, the object extraction model 222 may extract each object region 2 so that it includes two or more extracted objects; through this, the relationship prediction model 224 may predict the relationship between the objects in the object region 2.
  • the relationship prediction model 224 may predict the relationship between the objects extracted from the object extraction model 222 , and may generate the relationship between the objects as a tuple set 4 .
  • the relationship prediction model 224 may predict that the relationship between the 'sofa' and the 'dog' extracted as objects is that the dog is in front of the sofa, and may create a corresponding tuple in the tuple set 4.
  • the relationship prediction model 224 may predict that the relationship between the 'cat' and the 'door' extracted as objects is that the cat is next to the door, and may create a corresponding tuple in the tuple set 4.
  • the sentence restructuring model 232 may use the tuple set 4 generated by the relationship prediction model 224 to replace some words with phrases for the tuples according to the algorithm and thereby expand the generated caption 3. That is, the sentence restructuring model 232 may further expand the caption by reflecting the tuple set 4 generated by the relationship generating module 220 in the caption 3 generated by the caption generating module 210. For example, the sentence restructuring model 232 may extend the caption to 'a living room photo of a dog in front of a sofa lying on the floor and a cat near a laptop next to the door'.
  • the relationship graph generation model 226 may generate a relationship graph for the tuple set 4 generated by the relationship prediction model 224 .
  • the relationship graph generation model 226 may express the predicate of the tuple set 4 as a square box and the nouns of the tuple set as a circular box.
  • the relationship graph generation model 226 may connect each box in the order of a first noun - a predicate - a second noun.
  • the visualization model 234 may display phrases of the extended caption on the image as object areas, and in this case, each object area may be displayed in a different color. Also, the visualization model 234 may visualize the phrases of the extended caption corresponding to each object area by displaying the phrases in the same color as the corresponding object area.
  • FIG. 8 is a diagram illustrating a method for automatically generating image captions according to an embodiment of the present invention.
  • the caption generating module 210 may extract attribute information and object information of a provided image, and generate a caption by reflecting attribute information and object information of the extracted image ( S100 ).
  • the caption generation module 210 may extract attribute information and object information in the image, and generate a caption using the extracted attribute information and object information.
  • the attribute information may be words related to the image, and the object information may be a core target of the provided image.
  • the caption generating module 210 may generate a caption of the provided image based on the image learned through deep learning and captions for each image.
  • the relationship generating module 220 may predict a relationship between objects in an image and generate a tuple set for the predicted relationships ( S200 ).
  • the relationship generating module 220 may represent a relationship between objects in an image as a tuple set consisting of (a first noun, a predicate, and a second noun).
  • the description generating module 230 may generate an extended caption using the caption generated by the caption generating module 210 and the tuple set generated by the relationship generating module 220 ( S300 ).
  • the description generating module 230 may expand the caption by converting the tuple set into a sentence and reflecting it in the caption.
  • the description generating module 230 may visualize the relationship between the extended caption and the objects as a graph (S400).
  • the description generating module 230 may generate a graph by matching the extended caption and the relationship between the objects.
  • the description generating module 230 may transmit the generated graph to the client 100 so that the user can check the basis for generating the expanded caption.
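  • At a high level, the four steps S100 to S400 can be read as the single pipeline sketched below. The component functions passed in are assumed stand-ins used only to show the data flow between the modules, not the patent's actual interfaces.

```python
# Illustrative sketch: the S100-S400 flow as one pipeline over pluggable components.
def generate_image_caption(image, generate_caption, predict_relations, restructure):
    caption = generate_caption(image)                      # S100: attributes + objects -> caption
    tuple_set = predict_relations(image)                   # S200: object relations as tuples
    extended = restructure(caption, tuple_set)             # S300: extended caption
    graph_edges = [(a, p) for a, p, b in tuple_set] + \
                  [(p, b) for a, p, b in tuple_set]        # S400: relation graph as an edge list
    return extended, graph_edges

# Toy usage with stand-in components:
extended, edges = generate_image_caption(
    image="photo.jpg",
    generate_caption=lambda img: "a dog and a cat lying on the floor",
    predict_relations=lambda img: [("sofa", "front", "dog")],
    restructure=lambda cap, ts: cap.replace("a dog", "a dog in front of the sofa"),
)
print(extended)
print(edges)
```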
  • FIG. 9 is a diagram illustrating a method of generating a caption according to an embodiment of the present invention.
  • the attribute extraction model 212 may extract attribute information of an image ( S110 ).
  • the attribute extraction model 212 may be trained on an image and a caption for the image. Accordingly, the attribute extraction model 212 may output attribute information related to a new image by using the learned information.
  • the object recognition model 214 may extract an important object in the image and convert the object region including the extracted object into a tuple form (S120).
  • the object recognition model 214 may utilize a deep learning-based object recognition model, such as a Mask R-CNN algorithm, to extract regions corresponding to a predefined object region in the provided image as an object region of the provided image.
  • the image caption model 216 may give word attention and region attention to attribute information and object regions extracted from the provided image ( S130 ).
  • the image caption model 216 may assign higher word attention to words in order of their relevance to the word tag generated at the current time.
  • the word attention and region attention are values between 0 and 1, and are closer to 1 the more relevant the word or region is to the word tag.
  • the image caption model 216 may predict a word tag and a grammar tag for the caption at each time step based on the attribute information extracted from the attribute extraction model 212, the object regions extracted from the object recognition model 214, and the word attention and region attention (S140).
  • the image caption model 216 may calculate loss values for the generated word tag and grammar tag by comparing the predicted word tag and grammar tag with the correct caption sentence.
  • the image caption model 216 may generate a caption by reflecting the loss values for the word tag and the grammar tag (S150). Accordingly, the image caption model 216 may generate a caption sentence in which the grammar is considered for the provided image by using the word tag and the grammar tag, and may learn it.
  • FIG. 10 is a diagram illustrating a method of generating an extended caption according to an embodiment of the present invention.
  • the description generating module 230 may remove, from among the tuple sets generated by the relationship generating module 220, the tuples already included in the caption generated by the caption generating module 210 (S310).
  • if the first noun, the second noun, and the predicate of a tuple are all included in the caption generated by the caption generating module 210, the tuple is determined to be a duplicate, and the duplicate tuple can be deleted.
  • the description generating module 230 may remove the duplicate tuples and convert the remaining tuples into sentence form (S320).
  • when the predicate of a tuple is a preposition, the description generating module 230 may convert the tuple into sentence form by listing the first noun, the preposition, and the second noun in that order.
  • when the predicate of a tuple is a verb, the description generating module 230 may convert the tuple into sentence form by listing the second noun, the verb, and the first noun in that order.
  • the description generating module 230 may reflect the converted sentences of the tuples in the caption (S330). A score may then be calculated by comparing the caption in which the converted sentence is reflected (the extended caption) with the correct caption, and the phrase having the highest score may be selected.
  • the sentence restructuring model 232 may repeat this process of converting a tuple into sentence form, applying it to the caption, and selecting the highest-scoring phrase until no tuples remain. Thereafter, the description generating module 230 may select the last selected phrase as the final extended caption.
  • the description generating module 230 may visualize the caption extended by the sentence restructuring model 232 by matching it with the tuple set (S340).
  • the description generating module 230 may generate a graph representing the relationships of the tuple set by matching the extended caption with the tuple set.
  • the visualization model 234 may transmit a graph representing the relationship between the generated tuple set to the client 100 so that the user can check the basis for generating the expanded caption.
  • according to embodiments of the present invention, an automatic caption generation system and method may be provided in which a caption is generated by extracting attribute information and object information in an image using deep learning, and the generated caption is restructured by predicting relationships between the objects.

Abstract

The present invention relates to an automatic caption generation system and method that generate a caption by using deep learning to extract attribute information and object information in an image, and that restructure the generated caption by predicting the relationship between pieces of object information. According to an embodiment of the present invention, an automatic caption generation system for automatically generating, for an image, a caption describing the image comprises: a client device for providing the image for which the caption is to be generated; and a caption generator that analyzes the image provided by the client device to generate the caption describing the image, and that transmits, to the client, the generated caption and a basis for generating the caption.
PCT/KR2020/017272 2020-11-30 2020-11-30 System and method for automatically generating a caption using an image object-attribute attention model based on a deep learning algorithm WO2022114322A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/KR2020/017272 WO2022114322A1 (fr) 2020-11-30 2020-11-30 System and method for automatically generating a caption using an image object-attribute attention model based on a deep learning algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/KR2020/017272 WO2022114322A1 (fr) 2020-11-30 2020-11-30 System and method for automatically generating a caption using an image object-attribute attention model based on a deep learning algorithm

Publications (1)

Publication Number Publication Date
WO2022114322A1 true WO2022114322A1 (fr) 2022-06-02

Family

ID=81755158

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2020/017272 WO2022114322A1 (fr) 2020-11-30 2020-11-30 System and method for automatically generating a caption using an image object-attribute attention model based on a deep learning algorithm

Country Status (1)

Country Link
WO (1) WO2022114322A1 (fr)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170200065A1 (en) * 2016-01-13 2017-07-13 Adobe Systems Incorporated Image Captioning with Weak Supervision
KR101996371B1 (ko) * 2018-02-22 2019-07-03 주식회사 인공지능연구원 Image caption generation system and method, and computer program therefor
KR20200104663A (ko) * 2019-02-27 2020-09-04 한국전력공사 System and method for automatically generating image captions
KR20200106115A (ko) * 2019-02-27 2020-09-11 한국전력공사 Apparatus and method for automatically generating image captions
WO2020190112A1 (fr) * 2019-03-21 2020-09-24 Samsung Electronics Co., Ltd. Method, apparatus, device and medium for generating caption information of multimedia data

Similar Documents

Publication Publication Date Title
WO2019168253A1 Interactive counseling conversational device and method for hierarchically understanding a user's utterance and generating a response
WO2011136425A1 Device and method for networking a resource description framework using an ontology schema comprising a combined named-entity dictionary and combined mining rules
WO2020111314A1 Conceptual-graph-based question-answering apparatus and method
WO2021215620A1 Device and method for automatically generating a domain-specific image caption using a semantic ontology
KR102622958B1 System and method for automatically generating image captions
WO2012060540A1 Machine translation device and method combining a syntax conversion model and a vocabulary conversion model
WO2021096009A1 Method and device for enriching knowledge based on a relation network
WO2014106979A1 Method for statistical voice language recognition
WO2020085663A1 Artificial-intelligence-based automatic logo generation system and logo generation service method using the same
WO2021071137A1 Method and system for automatically generating blank-space inference questions for a foreign-language sentence
WO2011162444A1 Named-entity dictionary combined with an ontology schema, and device and method for renewing a named-entity dictionary or mining-rule database using a mining rule
WO2021107449A1 Method for providing a knowledge-graph-based marketing information analysis service using transliterated-neologism conversion, and apparatus therefor
WO2014142422A1 Method for processing a dialogue based on a processing-instruction expression, and apparatus therefor
KR20200037077A Method, apparatus, device, and computer-readable medium for generating VQA training data
WO2019107625A1 Machine translation method and apparatus therefor
WO2022114322A1 System and method for automatically generating a caption using an image object-attribute attention model based on a deep learning algorithm
WO2022114368A1 Method and device for knowledge completion by embedding a neuro-symbolic relationship
CN114169408A Sentiment classification method based on a multimodal attention mechanism
WO2021107445A1 Method for providing a newly-coined-word information service based on a knowledge graph and country-specific transliteration conversion, and apparatus therefor
WO2018169276A1 Method for processing language information and electronic device therefor
WO2023167496A1 Music composition method using artificial intelligence
WO2021132760A1 Method for predicting columns and tables used when translating natural language into SQL queries based on a neural network
WO2021256578A1 Apparatus and method for automatically generating an image caption
WO2022177372A1 System for providing a tutoring service using artificial intelligence and method therefor
WO2020242086A1 Server, method, and computer program for inferring the comparative advantage of multiple knowledge

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20963716

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20963716

Country of ref document: EP

Kind code of ref document: A1