WO2023008609A1 - Image management server providing a scene image by merging objects from multiple images and method for creating the scene image using the same - Google Patents
- Publication number
- WO2023008609A1 (PCT/KR2021/009814)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- image
- objects
- caption
- keyword
- content
- Prior art date
Classifications
- G06F40/56—Natural language generation
- G06F16/51—Indexing; Data structures therefor; Storage structures
- G06F16/538—Presentation of query results
- G06F16/583—Retrieval characterised by using metadata automatically derived from the content
- G06F16/5866—Retrieval characterised by using information manually generated, e.g. tags, keywords, comments, manually generated location and time information
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
- G06F40/40—Processing or translation of natural language
- G06N3/042—Knowledge-based neural networks; Logical representations of neural networks
- G06N3/0442—Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
- G06N3/0464—Convolutional networks [CNN, ConvNet]
- G06N3/08—Learning methods
- G06N3/096—Transfer learning
- G06Q10/10—Office automation; Time management
- G06Q30/0276—Advertisement creation
- G06Q50/01—Social networking
- G06T11/60—Editing figures and text; Combining figures or text
- G06T7/11—Region-based segmentation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
- G06V20/00—Scenes; Scene-specific elements
- G06V20/70—Labelling scene content, e.g. deriving syntactic or semantic representations
- G06F40/106—Display of layout of documents; Previewing
- G06T2207/20081—Training; Learning
- G06T2207/20084—Artificial neural networks [ANN]
- G06T2207/20221—Image fusion; Image merging
Definitions
- the present invention relates to an image management server and an image creation method using the same, and more particularly, to an image management server providing a scene image by merging objects from multiple images and a method for creating the scene image using the same.
- Stock photography refers to photos stocked in large quantities; when photos without copyright issues are uploaded to a photo platform or website where stock photography is collected, companies or individuals pay to purchase the photos as needed.
- the stock photography is used as source photos for newspapers and magazines, and is also used as related images in advertisements, publicity materials, online postings, and the like.
- a problem to be solved by the present invention is to provide a method for creating a scene image capable of providing the scene image by extracting and merging objects from multiple images.
- Another problem to be solved by the present invention is to provide an image management server that performs such a method.
- a method for creating a scene image according to an embodiment of the present invention for achieving the problem described above is an image creation method for providing a scene image by merging objects from multiple images, the method including, by an image management server: a step of extracting a feature vector from a training image, based on training data composed of the training image and a target caption describing contents of the training image, and training a caption prediction model by using the feature vector of the training image as an input variable and the target caption as an output variable; a step of detecting an object from a content image and predicting a content caption describing contents of the content image by inputting a feature vector extracted from the detected object into the caption prediction model; a step of extracting a plurality of keywords and their relationship information from a search text input from a user terminal through natural language processing; a step of searching for a content caption matching each keyword and detecting a content image (referred to as a 'related image') corresponding to the searched content caption; a step of detecting objects corresponding to a keyword for each related image and calculating a size ratio between the detected objects by referring to a reference image other than the related image; and a step of cropping the detected objects for each related image and creating a scene image by merging a plurality of cropped objects based on the size ratio.
- the plurality of keywords may include first and second keywords, a first related image may be defined as having an object corresponding to the first keyword, a second related image may be defined as having an object corresponding to the second keyword, a reference image including both objects corresponding to the first and second keywords in one image may be detected, and a size ratio between the plurality of cropped objects may be calculated by using a size ratio between the objects included in the reference image.
- the plurality of keywords may include first and second keywords, a first related image may be defined as having an object corresponding to the first keyword, a second related image may be defined as having an object corresponding to the second keyword, a first reference image including an object corresponding to the first keyword may be detected, a second reference image including an object corresponding to the second keyword may be detected, and a size ratio between the plurality of cropped objects may be calculated by using a size ratio between common objects that exist in common in the first and second reference images.
- the plurality of keywords may include first and second keywords, a first related image may be defined as having an object corresponding to the first keyword, a second related image may be defined as having an object corresponding to the second keyword, a first reference image including an object corresponding to the first keyword may be detected, a second reference image including an object corresponding to the second keyword may be detected, a standard image that is not related to the first or second keyword may be detected, and a size ratio between the plurality of cropped objects may be calculated by using a size ratio between common objects that exist in common in the first reference image and the standard image and a size ratio between common objects that exist in common in the second reference image and the standard image.
- a feature vector may be extracted from the training image based on a convolutional neural network (CNN) algorithm and the caption prediction model may be trained based on a long short term memory (LSTM) algorithm.
- CNN convolutional neural network
- LSTM long short term memory
- the step of creating the scene image may include a step of predicting a layout indicating an arrangement relationship of the detected objects in the scene image based on a graph convolution network (GCN) algorithm using the detected object as a node and the relationship information as an edge, and a step of adjusting the sizes of the plurality of cropped objects according to the size ratio and then arranging the plurality of cropped objects on the layout.
- An image management server for achieving the other problem described above is an image management server for providing a scene image by merging objects from multiple images, the image management server including an image caption unit and a scene creation unit.
- the image caption unit may extract a feature vector from a training image, based on training data composed of the training image and a target caption describing contents of the training image, and train a caption prediction model by using the feature vector of the training image as an input variable and the target caption as an output variable.
- the image caption unit may detect an object from a content image and predict a content caption describing contents of the content image by inputting a feature vector extracted from the detected object into the caption prediction model.
- the scene creation unit may extract a plurality of keywords and their relationship information from a search text input from a user terminal through natural language processing.
- the scene creation unit may search for a content caption matching each keyword, and detect a content image (referred to as a 'related image') corresponding to the searched content caption.
- the scene creation unit may detect objects corresponding to a keyword for each related image, and calculate a size ratio between the detected objects by referring to a reference image other than the related image.
- the scene creation unit may crop the detected objects for each related image, and create a scene image by merging a plurality of cropped objects based on the size ratio.
- a content caption describing contents of each content image can be automatically predicted and stored through a caption prediction model.
- when a search text is input from a user terminal, a keyword and relationship information are extracted from the search text.
- a content image matching each keyword (this is referred to as a 'related image') can be detected.
- One scene image can be created by cropping and merging objects corresponding to a keyword for each related image.
- a layout indicating an arrangement relationship of objects in one scene image can be automatically predicted using the extracted relationship information, and objects can be arranged in one scene image according to the predicted layout.
- the size ratio between objects can be automatically calculated with reference to an existing content image as follows.
- the size ratio between the plurality of cropped objects to be merged can be automatically calculated by using the size ratio between the objects included in the content image.
- a content image can be individually detected for each of the objects, and the size ratio between the plurality of cropped objects to be merged can be automatically calculated by using the size ratio between common objects that exist in common in the detected content images.
- a standard content image can be additionally detected and the size ratio between the plurality of cropped objects to be merged can be automatically calculated by using the size ratio between the common objects that exist in common between the detected content images and the standard content image.
- the objects can be represented naturally and harmoniously with each other.
- FIG. 1 is a configuration diagram conceptually illustrating an image management server according to an embodiment of the present invention.
- FIG. 2 is a configuration diagram conceptually illustrating an image caption unit of FIG. 1.
- FIG. 3 is a configuration diagram conceptually illustrating a scene creation unit of FIG. 1.
- FIGS. 4 to 6 are diagrams exemplarily illustrating a method for calculating a size ratio according to an embodiment of the present invention.
- FIG. 7 is a flowchart illustrating a process of creating a caption prediction model in a method for creating a scene image according to an embodiment of the present invention.
- FIG. 8 is a flowchart illustrating a process of predicting a content caption in the method for creating the scene image according to the embodiment of the present invention.
- FIG. 9 is a flowchart illustrating a process of creating a scene image in the method for creating the scene image according to the embodiment of the present invention.
- FIG. 1 is a configuration diagram conceptually illustrating an image management server according to an embodiment of the present invention.
- FIG. 2 is a configuration diagram conceptually illustrating an image caption unit of FIG. 1.
- FIG. 3 is a configuration diagram conceptually illustrating a scene creation unit of FIG. 1.
- An image management server 10 is a server that provides a scene image by merging objects from several images, and includes an image caption unit 100 that predicts a content caption for each content image using a prediction model, a scene creation unit 200 that detects a related image based on a search text and crops and merges objects from each related image, and a database 300 that stores various images and data.
- the image caption unit 100 includes a prediction model training unit 110, a caption prediction unit 120, and a tag creation unit 130.
- Training data may include a training image and a target caption describing contents of the training image.
- the prediction model training unit 110 extracts a feature vector from a training image using the training data.
- the training image may be composed of image files in various formats such as JPEG, BMP, GIF, PNG, and TIFF.
- the target caption may be a ground-truth caption and may be composed of text files in various formats such as TXT.
- the prediction model training unit 110 may use transfer learning to pre-process a raw image based on a pre-trained convolution neural network (CNN) algorithm.
- the prediction model training unit 110 may create a feature vector by receiving a training image and extracting essential features of the corresponding training image based on the CNN algorithm.
- the feature vector refers to a value obtained by extracting features from image data.
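As a toy illustration of this step, the sketch below extracts a feature vector with one convolution layer, a ReLU, and global average pooling. The image, filters, and dimensions are illustrative assumptions; the patent does not fix a particular CNN architecture.

```python
# Minimal sketch of CNN-style feature extraction (toy example; a real
# implementation would use a pre-trained deep network).

def conv2d(img, kernel):
    """Valid 2-D convolution (cross-correlation) of img with kernel."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(img), len(img[0])
    out = []
    for i in range(h - kh + 1):
        row = []
        for j in range(w - kw + 1):
            s = sum(img[i + a][j + b] * kernel[a][b]
                    for a in range(kh) for b in range(kw))
            row.append(s)
        out.append(row)
    return out

def relu(fmap):
    return [[max(0.0, v) for v in row] for row in fmap]

def global_avg_pool(fmap):
    vals = [v for row in fmap for v in row]
    return sum(vals) / len(vals)

def extract_feature_vector(img, kernels):
    """One conv layer + ReLU + global average pooling -> feature vector."""
    return [global_avg_pool(relu(conv2d(img, k))) for k in kernels]

# Toy 4x4 grayscale "image" (a vertical edge) and two 3x3 edge filters.
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]
k_vertical = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]
k_horizontal = [[-1, -1, -1], [0, 0, 0], [1, 1, 1]]

vec = extract_feature_vector(image, [k_vertical, k_horizontal])
print(vec)  # [3.0, 0.0] -> strong vertical-edge response, no horizontal edge
```

Each entry of the resulting vector summarizes one filter's response, which is the sense in which the feature vector captures "essential features" of the image.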
- the prediction model training unit 110 trains the caption prediction model using the feature vector of the training image as an input variable and the target caption as an output variable.
- the prediction model training unit 110 decodes image features and learns a method for predicting a caption matching the target caption.
- the prediction model training unit 110 may train the caption prediction model based on a long short term memory (LSTM) algorithm.
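A minimal sketch of the decoding side follows, using the standard LSTM cell equations at scalar toy dimensions. The hand-set weights, `vocab`, and the nearest-embedding word choice are all hypothetical; a real caption prediction model would learn high-dimensional weights from the (image feature, target caption) training pairs.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h, c, W):
    """One LSTM cell step with scalar input/hidden state (toy dimensions).
    W holds (w_x, w_h, bias) for each of the four gates."""
    i = sigmoid(W["i"][0] * x + W["i"][1] * h + W["i"][2])    # input gate
    f = sigmoid(W["f"][0] * x + W["f"][1] * h + W["f"][2])    # forget gate
    o = sigmoid(W["o"][0] * x + W["o"][1] * h + W["o"][2])    # output gate
    g = math.tanh(W["g"][0] * x + W["g"][1] * h + W["g"][2])  # candidate
    c = f * c + i * g          # new cell state (long-term memory)
    h = o * math.tanh(c)       # new hidden state (short-term memory)
    return h, c

# Toy weights; training would fit these so decoded captions match targets.
W = {k: (0.5, 0.5, 0.0) for k in ("i", "f", "o", "g")}

# Decode: seed with an image feature, unroll a few steps, and at each step
# pick the "word" whose (scalar) embedding is closest to the hidden state.
vocab = {"a": 0.1, "dog": 0.6, "park": 0.9}
h, c = 0.0, 0.0
x = 0.8  # image feature from the CNN encoder (assumed scalar here)
caption = []
for _ in range(3):
    h, c = lstm_step(x, h, c, W)
    word = min(vocab, key=lambda w: abs(vocab[w] - h))
    caption.append(word)
    x = vocab[word]  # feed the chosen word back as the next input
print(" ".join(caption))
```

The gating structure is what lets the model carry image context across the whole generated sentence, which is why an LSTM (rather than a plain RNN) is named in the claims.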
- the caption prediction unit 120 detects an object in the content image and extracts a feature vector from the detected object. For example, the caption prediction unit 120 may extract the feature vector from the content image based on the CNN algorithm.
- the caption prediction unit 120 predicts the content caption describing contents of the content image by inputting the feature vector of the content image into the caption prediction model.
- the caption prediction unit 120 may predict the content caption for the content image by decoding image features of the content image based on the LSTM algorithm.
- the content image may be composed of image files in various formats such as JPEG, BMP, GIF, PNG, and TIFF.
- the content caption may be composed of text files in various formats such as TXT.
- One content caption and one or more objects may be defined for one content image.
- the caption prediction unit 120 stores a content caption and an object corresponding to each content image in the database 300.
- the tag creation unit 130 extracts a tag from the content caption using natural language processing. Specifically, the tag creation unit 130 performs sentence segmentation on the content caption, which is composed of a combination of corpora. Subsequently, the tag creation unit 130 divides each sentence into tokens.
- the tokens are a string having a meaning, and may be understood as a concept including a morpheme or a word.
- the tag creation unit 130 performs part-of-speech (POS) tagging for allocating part-of-speech information of the token.
- the tag creation unit 130 performs named entity recognition on the tokens, attaching entity name tags such as a person's name, a place name, and an organization name.
- the tag creation unit 130 stores the entity name tag in the database 300 together with the content caption and the object corresponding to each content image.
- the entity name tag can be used in a process of searching for the content caption.
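The tagging pipeline above can be sketched with a tiny hand-made lexicon standing in for a real POS tagger and named-entity recognizer; every lexicon entry below is an illustrative assumption, not the patent's actual tag set.

```python
# Toy POS tagging + entity-name tagging over caption tokens.

POS_LEXICON = {"a": "DET", "dog": "NOUN", "runs": "VERB", "in": "ADP",
               "seoul": "PROPN", "park": "NOUN"}
ENTITY_LEXICON = {"seoul": "PLACE"}  # entity name tags, e.g. place names

def tag_caption(caption):
    """Tokenize a content caption, then POS-tag and entity-tag the tokens."""
    tokens = caption.lower().rstrip(".").split()
    pos_tags = [(t, POS_LEXICON.get(t, "X")) for t in tokens]
    entity_tags = [(t, ENTITY_LEXICON[t]) for t in tokens
                   if t in ENTITY_LEXICON]
    return pos_tags, entity_tags

pos_tags, entity_tags = tag_caption("A dog runs in Seoul park.")
print(entity_tags)  # [('seoul', 'PLACE')]
```

The entity tags produced this way would be stored alongside the caption and later matched against search keywords.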
- the scene creation unit 200 includes a search text analysis unit 210, an image search unit 220, a ratio calculation unit 230, and an object merging unit 240.
- the search text analysis unit 210 extracts a plurality of keywords and relationship information between the keywords from the search text through natural language processing. Specifically, the search text analysis unit 210 extracts a plurality of keywords and their relationship information by using sentence separation, tokenization, POS tagging, entity name recognition, etc. For example, when the user inputs "A dog beside a cycle in a park" as the search text, the search text analysis unit 210 extracts "dog", "cycle", and "park" as keywords through natural language processing, and extracts "beside" and "in" as relationship information.
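A rule-based sketch of this analysis, run on the example query above, could look as follows (the stop-word and relation-word lists are illustrative assumptions standing in for the full NLP pipeline):

```python
# Toy keyword / relationship extraction from a search text.

ARTICLES = {"a", "an", "the"}
RELATION_WORDS = {"beside", "in", "on", "under", "near", "behind"}

def analyze_search_text(text):
    tokens = text.lower().split()  # tokenization
    keywords = [t for t in tokens
                if t not in ARTICLES and t not in RELATION_WORDS]
    relations = [t for t in tokens if t in RELATION_WORDS]
    return keywords, relations

keywords, relations = analyze_search_text("A dog beside a cycle in a park")
print(keywords)   # ['dog', 'cycle', 'park']
print(relations)  # ['beside', 'in']
```

The keywords drive the caption search, while the relation words later become the edges of the layout graph.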
- the image search unit 220 searches for a content caption matching each keyword among the content captions stored in the database 300 and detects a content image (this is referred to as a 'related image') corresponding to the searched content caption.
- when a plurality of keywords are extracted from the search text, a plurality of content images are detected.
- the ratio calculation unit 230 detects objects corresponding to the keyword for each related image, and calculates a size ratio between the detected objects.
- the ratio calculation unit 230 may automatically calculate the size ratio between the detected objects with reference to a content image other than the related image.
- a method for calculating the size ratio between detected objects will be described in detail with reference to FIGS. 4 to 6.
- FIGS. 4 to 6 are diagrams exemplarily illustrating a method for calculating a size ratio according to an embodiment of the present invention.
- a method for calculating a size ratio between the object (dog) and the object (cycle) that respectively correspond to a first keyword and a second keyword, when the first keyword is "dog" and the second keyword is "cycle", is exemplarily illustrated.
- the size ratio may be a horizontal ratio, a vertical ratio, an aspect ratio, etc. between the objects.
- FIG. 4 illustrates a case in which a content image (this is referred to as a 'reference image') including both the object (dog) corresponding to the first keyword and the object (cycle) corresponding to the second keyword in one image is detected from the database 300.
- a size ratio between the cropped objects may be calculated, in the later process of merging objects, using the size ratio between the object (dog) and the object (cycle) included in the reference image.
- FIG. 5 illustrates a case in which no single reference image includes both the object (dog) corresponding to the first keyword and the object (cycle) corresponding to the second keyword; instead, a first reference image includes the object (dog) and the object (tree), and a second reference image includes the object (cycle) and the object (tree).
- the object (tree) exists in common in the first reference image and the second reference image, which is referred to as a common object.
- a size ratio between the cropped objects may be calculated, in the later process of merging objects, using the size ratio between these common objects (tree).
- FIG. 6 illustrates a case in which no single reference image includes both the object (dog) corresponding to the first keyword and the object (cycle) corresponding to the second keyword, and in which no object exists in common between the first reference image and the second reference image: for example, the first reference image includes the object (dog) and the object (tree), while the second reference image includes the object (cycle) and the object (house).
- the ratio calculation unit 230 additionally detects a standard image that is not related to a keyword among the content images stored in the database 300.
- a common object (tree) exists in the first reference image and the standard image, and a common object (house) exists in the second reference image and the standard image.
- the size ratio between the objects cropped may be calculated in the object merging process later using the size ratio between these common objects (tree, house).
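The FIG. 6 chain can be worked through numerically: since no object is shared between the two reference images, the standard image bridges them. All pixel heights below are made-up illustrative numbers.

```python
# First reference image: dog next to a tree.
dog_h_ref1, tree_h_ref1 = 60.0, 240.0
# Standard image: tree next to a house.
tree_h_std, house_h_std = 200.0, 300.0
# Second reference image: cycle next to a house.
cycle_h_ref2, house_h_ref2 = 90.0, 270.0

# Express both target objects in "standard image" units via the common
# objects (tree for the first chain, house for the second), then compare.
dog_over_tree = dog_h_ref1 / tree_h_ref1              # 0.25
cycle_over_house = cycle_h_ref2 / house_h_ref2        # ~0.333
dog_in_std_units = dog_over_tree * tree_h_std         # 50.0
cycle_in_std_units = cycle_over_house * house_h_std   # 100.0

size_ratio = dog_in_std_units / cycle_in_std_units
print(size_ratio)  # 0.5 -> the dog crop gets half the cycle crop's height
```

The FIG. 4 and FIG. 5 cases are degenerate versions of the same computation with zero or one bridging step.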
- the object merging unit 240 crops the object for each related image.
- the object merging unit 240 may crop the object corresponding to the keyword by using algorithms such as YOLO, Saliency Map, Integral Image, Local Adaptive Thresholding, GrabCut, etc.
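At its core, the cropping step is a bounding-box extraction; the sketch below assumes the detector (e.g. YOLO or GrabCut, as listed above) has already supplied the box, which is outside the scope of this toy example.

```python
# Minimal bounding-box crop on a nested-list "image".

def crop(img, box):
    """box = (top, left, bottom, right); bottom/right are exclusive."""
    top, left, bottom, right = box
    return [row[left:right] for row in img[top:bottom]]

# 4x6 "image" whose pixels record their own (row, col) for easy checking.
image = [[(r, c) for c in range(6)] for r in range(4)]
patch = crop(image, (1, 2, 3, 5))  # 2 rows x 3 cols around the "object"
print(len(patch), len(patch[0]))   # 2 3
```

A real implementation would additionally mask out background pixels inside the box (which is what segmentation-style algorithms such as GrabCut provide).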
- the object merging unit 240 creates one scene image by merging a plurality of cropped objects based on the previously calculated size ratio. Specifically, the object merging unit 240 automatically predicts a layout indicating an arrangement relationship of objects in the scene image based on the GCN algorithm using the object corresponding to the keyword as a node and the relationship information as an edge. In addition, the object merging unit 240 adjusts the sizes of the plurality of cropped objects according to the size ratio and then arranges the cropped objects on the layout to complete the scene image.
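The layout step above uses a GCN over the object-relation graph; the rule-based stand-in below only illustrates the node/edge structure (objects as nodes, relation words as edges) and how object positions fall out of it. The placement rules and coordinates are assumptions, not a learned model.

```python
# Toy layout predictor: each relation edge maps to a placement rule.

def predict_layout(edges, canvas_w=100, canvas_h=100):
    """edges: list of (subject, relation, object) triples."""
    pos = {}
    for subj, rel, obj in edges:
        pos.setdefault(obj, (canvas_w // 2, canvas_h // 2))
        ox, oy = pos[obj]
        if rel == "beside":
            pos[subj] = (ox - 30, oy)   # place subject to the left
        elif rel == "in":
            pos[subj] = (ox, oy)        # place subject inside/over
        else:
            pos[subj] = (ox, oy - 30)   # fallback: above
    return pos

edges = [("cycle", "in", "park"), ("dog", "beside", "cycle")]
layout = predict_layout(edges)
print(layout)  # anchor positions for park, cycle, and dog
```

The merging unit would then resize each cropped object according to the previously calculated size ratio and paste it at its predicted anchor.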
- the database 300 stores various images and data used in the method for creating the scene image of the present invention, such as training data, content images together with the objects and content captions related to the content images, and scene images.
- FIG. 7 is a flowchart illustrating a process of creating a caption prediction model in the method for creating the scene image according to the embodiment of the present invention.
- FIG. 8 is a flowchart illustrating a process of predicting a content caption in the method for creating the scene image according to the embodiment of the present invention.
- FIG. 9 is a flowchart illustrating a process of creating a scene image in the method for creating the scene image according to the embodiment of the present invention.
- the prediction model training unit 110 extracts a feature vector from the training image by using the training data (S12). For example, the prediction model training unit 110 may extract a feature vector from the training image based on the CNN algorithm.
- the prediction model training unit 110 trains the caption prediction model using the feature vector of the training image as an input variable and the target caption as an output variable (S14).
- the prediction model training unit 110 may train the caption prediction model based on the LSTM algorithm.
- the caption prediction unit 120 detects an object in the content image (S22).
- the caption prediction unit 120 extracts the feature vector from the detected object (S24).
- the caption prediction unit 120 inputs the feature vector of the content image into the caption prediction model to predict a content caption describing the contents of the content image (S26).
- the caption prediction unit 120 stores the content caption and the object corresponding to each content image in the database 300 (S28).
- the search text analysis unit 210 extracts a plurality of keywords and relationship information between the keywords from the search text through natural language processing (S31).
- the image search unit 220 searches for the content caption matching each keyword among the content captions stored in the database 300 and detects the content image (referred to as a 'related image') corresponding to the searched content caption (S32).
- the ratio calculation unit 230 detects objects corresponding to the keyword for each related image, and calculates a size ratio between the detected objects with reference to the reference image (S33).
- the ratio calculation unit 230 may calculate the size ratio in the following way according to the presence or absence of the detected object in the reference image.
- assume that the plurality of keywords includes first and second keywords, that a content image having an object corresponding to the first keyword among the content images is defined as a first related image, and that a content image having an object corresponding to the second keyword among the content images is defined as a second related image.
- the ratio calculation unit 230 may calculate a size ratio between a plurality of cropped objects by using a size ratio between the objects included in the reference image.
- the ratio calculation unit 230 may calculate the size ratio between the plurality of cropped objects by using the size ratio between common objects that exist in common in the first and second reference images.
- the ratio calculation unit 230 detects a standard image that is not related to the first keyword or second keyword. Subsequently, the ratio calculation unit 230 may calculate a size ratio between the plurality of cropped objects by using a size ratio between the common objects that exist in common in the first reference image and the standard image and a size ratio between the common objects that exist in common in the second reference image and the standard image.
- the object merging unit 240 crops the object for each related image (S34).
- the object merging unit 240 creates one scene image by merging the plurality of cropped objects based on the previously calculated size ratio (S35). Specifically, the object merging unit 240 predicts a layout indicating the arrangement relationship of the detected objects in the scene image based on the GCN algorithm using the detected object corresponding to the keyword as a node and the relationship information as an edge. Subsequently, the object merging unit 240 adjusts the sizes of the plurality of cropped objects according to the previously calculated size ratio and then arranges the plurality of cropped objects on the layout to complete the scene image.
Abstract
Provided are an image management server that provides a scene image by extracting and merging objects from multiple images, and a method for creating the scene image using the same. The image management server predicts, using a caption prediction model, a content caption describing the contents of each content image, extracts a plurality of keywords and their relationship information from a search text, detects related images using the keywords and the content captions, crops objects from the related images, adjusts a layout and the sizes of the objects, and merges the objects to create one scene image.
Description
The present invention relates to an image management server and an image creation method using the same, and more particularly, to an image management server providing a scene image by merging objects from multiple images and a method for creating the scene image using the same.
Stock photography refers to photos stocked in large quantities, and when photos without copyright issues are uploaded to a photo platform or website where the stock photography is collected, companies or individuals pay money as needed to purchase the photos. The stock photography is used as material photos for newspapers and magazines, and is also used as related images in advertisements, publicity materials, online postings, etc.
When a user accesses the photo platform to purchase stock photography and inputs a search term, related photos are displayed on the screen as a result. However, although many photos are stored on the photo platform, it is not easy for the user to find the photos he/she likes. For example, if the user searches for stock photography of a person jogging in a park, the user may like only the background in some of the resulting photos and only the joggers in others.
A problem to be solved by the present invention is to provide a method for creating a scene image capable of providing the scene image by extracting and merging objects from multiple images.
Another problem to be solved by the present invention is to provide an image management server that performs such a method.
The problems to be solved by the present invention are not limited to the problems mentioned above, and other problems not mentioned will be clearly understood by those of ordinary skill in the art to which the present invention pertains from the following description.
A method for creating a scene image according to an embodiment of the present invention for achieving the problem described above is an image creation method for providing a scene image by merging objects from multiple images, the method including, by an image management server, a step of extracting a feature vector from a training image, based on training data composed of the training image and a target caption describing contents of the training image, and training a caption prediction model by using the feature vector of the training image as an input variable and the target caption as an output variable, a step of detecting an object from a content image and predicting a content caption describing contents of the content image by inputting a feature vector extracted from the detected object into the caption prediction model, a step of extracting a plurality of keywords and their relationship information from a search text input from a user terminal through natural language processing, a step of searching for a content caption matching each keyword, and detecting a content image (referred to as a 'related image') corresponding to the searched content caption, a step of detecting objects corresponding to a keyword for each related image, and calculating a size ratio between the detected objects with reference to a reference image other than the related image, and a step of cropping the detected objects for each related image, and creating a scene image by merging a plurality of cropped objects based on the size ratio.
The plurality of keywords may include first and second keywords, a first related image may be defined as having an object corresponding to the first keyword, a second related image may be defined as having an object corresponding to the second keyword, a reference image including both objects corresponding to the first and second keywords in one image may be detected, and a size ratio between the plurality of cropped objects may be calculated by using a size ratio between the objects included in the reference image.
The plurality of keywords may include first and second keywords, a first related image may be defined as having an object corresponding to the first keyword, a second related image may be defined as having an object corresponding to the second keyword, a first reference image including an object corresponding to the first keyword may be detected, a second reference image including an object corresponding to the second keyword may be detected, and a size ratio between the plurality of cropped objects may be calculated by using a size ratio between common objects that exist in common in the first and second reference images.
The plurality of keywords may include first and second keywords, a first related image may be defined as having an object corresponding to the first keyword, a second related image may be defined as having an object corresponding to the second keyword, a first reference image including an object corresponding to the first keyword may be detected, a second reference image including an object corresponding to the second keyword may be detected, a standard image that is not related to the first or second keyword may be detected, and a size ratio between the plurality of cropped objects may be calculated by using a size ratio between common objects that exist in common in the first reference image and the standard image and a size ratio between common objects that exist in common in the second reference image and the standard image.
A feature vector may be extracted from the training image based on a convolutional neural network (CNN) algorithm and the caption prediction model may be trained based on a long short term memory (LSTM) algorithm.
The step of creating the scene image may include a step of predicting a layout indicating an arrangement relationship of the detected objects in the scene image based on a graph convolution network (GCN) algorithm using the detected object as a node and the relationship information as an edge, and a step of adjusting the sizes of the plurality of cropped objects according to the size ratio and then arranging the plurality of cropped objects on the layout.
An image management server according to an embodiment of the present invention for achieving the other problem described above is an image management server for providing a scene image by merging objects from multiple images, the image management server including an image caption unit and a scene creation unit.
Here, the image caption unit may extract a feature vector from a training image, based on training data composed of the training image and a target caption describing contents of the training image, and train a caption prediction model by using the feature vector of the training image as an input variable and the target caption as an output variable.
The image caption unit may detect an object from a content image and predict a content caption describing contents of the content image by inputting a feature vector extracted from the detected object into the caption prediction model.
The scene creation unit may extract a plurality of keywords and their relationship information from a search text input from a user terminal through natural language processing. The scene creation unit may search for a content caption matching each keyword, and detect a content image (referred to as a 'related image') corresponding to the searched content caption. The scene creation unit may detect objects corresponding to a keyword for each related image, and calculate a size ratio between the detected objects by referring to a reference image other than the related image. The scene creation unit may crop the detected objects for each related image, and create a scene image by merging a plurality of cropped objects based on the size ratio.
The specific details of other embodiments are included in the specific content and drawings.
As described above, according to the image management server and the method for creating the scene image using the same according to the present invention, in a state in which numerous content images are stored in a database, a content caption describing contents of each content image can be automatically predicted and stored through a caption prediction model. When a search text is input from a user terminal, a keyword and relationship information are extracted from the search text.
Through a matching search between the extracted keyword and the content caption, a content image matching each keyword (this is referred to as a 'related image') can be detected. One scene image can be created by cropping and merging objects corresponding to a keyword for each related image.
In addition, a layout indicating an arrangement relationship of objects in one scene image can be automatically predicted using the extracted relationship information, and objects can be arranged in one scene image according to the predicted layout.
When merging a plurality of cropped objects in one scene image, it is important to adjust a size ratio between the objects. In the case of the present invention, the size ratio between objects can be automatically calculated with reference to an existing content image as follows.
First, if a content image including all of a plurality of objects exists in one image, the size ratio between the plurality of cropped objects to be merged can be automatically calculated by using the size ratio between the objects included in the content image.
Second, if the content image including all of the plurality of objects is not present in one image, a content image can be individually detected for each of the objects, and the size ratio between the plurality of cropped objects to be merged can be automatically calculated by using the size ratio between common objects that exist in common in the detected content images.
Third, if the content image including all of the plurality of objects is not present in one image and the common object that exists in common is not present among the content images individually detected for each object, a standard content image can be additionally detected and the size ratio between the plurality of cropped objects to be merged can be automatically calculated by using the size ratio between the common objects that exist in common between the detected content images and the standard content image.
As such, by automatically adjusting the size ratio between the objects when merging the plurality of cropped objects in one scene image, the objects can be represented naturally and harmoniously with each other.
FIG. 1 is a configuration diagram conceptually illustrating an image management server according to an embodiment of the present invention.
FIG. 2 is a configuration diagram conceptually illustrating an image caption unit of FIG. 1.
FIG. 3 is a configuration diagram conceptually illustrating a scene creation unit of FIG. 1.
FIGS. 4 to 6 are diagrams exemplarily illustrating a method for calculating a size ratio according to an embodiment of the present invention.
FIG. 7 is a flowchart illustrating a process of creating a caption prediction model in a method for creating a scene image according to an embodiment of the present invention.
FIG. 8 is a flowchart illustrating a process of predicting a content caption in the method for creating the scene image according to the embodiment of the present invention.
FIG. 9 is a flowchart illustrating a process of creating a scene image in the method for creating the scene image according to the embodiment of the present invention.
Advantages and features of the present invention, and methods for achieving them, will become apparent with reference to the embodiments described below in detail in conjunction with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms. The present embodiments are provided only so that the disclosure of the present invention is complete and fully informs those of ordinary skill in the art to which the present invention belongs of the scope of the invention; the present invention is defined only by the scope of the claims. The same reference numerals refer to the same components throughout the specification.
Hereinafter, an image management server and a method for creating a scene image using the same according to an embodiment of the present invention will be described in detail with reference to the accompanying drawings. FIG. 1 is a configuration diagram conceptually illustrating an image management server according to an embodiment of the present invention. FIG. 2 is a configuration diagram conceptually illustrating an image caption unit of FIG. 1. FIG. 3 is a configuration diagram conceptually illustrating a scene creation unit of FIG. 1.
An image management server 10 according to an embodiment of the present invention is a server that provides a scene image by merging objects from several images, and includes an image caption unit 100 that predicts a content caption for each content image using a prediction model, a scene creation unit 200 that detects a related image based on a search text and crops and merges objects from each related image, and a database 300 that stores various images and data.
The image caption unit 100 includes a prediction model training unit 110, a caption prediction unit 120, and a tag creation unit 130.
Training data may include a training image and a target caption describing contents of the training image. The prediction model training unit 110 extracts a feature vector from a training image using the training data. Here, the training image may be composed of image files in various formats such as JPEG, BMP, GIF, PNG, and TIFF, and the target caption may be a ground truth caption and may be composed of text files in various formats such as TXT. For example, the prediction model training unit 110 may use transfer learning to pre-process a raw image based on a pre-trained convolutional neural network (CNN) algorithm. For example, the prediction model training unit 110 may create a feature vector by receiving a training image and extracting essential features of the corresponding training image based on the CNN algorithm. Here, the feature vector refers to a value obtained by extracting features from image data.
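The feature-extraction step described above can be sketched as follows. This is a minimal stand-in, not the patent's actual encoder: instead of a pretrained CNN, it average-pools the image over a fixed grid to produce a fixed-length feature vector. The grid size and the synthetic input image are hypothetical.

```python
import numpy as np

def extract_feature_vector(image, grid=(4, 4)):
    """Toy stand-in for a CNN encoder: average-pool the image over a
    fixed grid and flatten the result into a feature vector.
    (A real implementation would use a pretrained CNN such as ResNet.)"""
    h, w = image.shape[:2]
    gh, gw = grid
    features = []
    for i in range(gh):
        for j in range(gw):
            # mean intensity of one grid cell of the image
            patch = image[i * h // gh:(i + 1) * h // gh,
                          j * w // gw:(j + 1) * w // gw]
            features.append(patch.mean())
    return np.array(features)

image = np.arange(64 * 64, dtype=float).reshape(64, 64)  # synthetic image
vec = extract_feature_vector(image)
print(vec.shape)  # (16,)
```

The resulting vector plays the role of the "input variable" that the caption prediction model is trained on.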
The prediction model training unit 110 trains the caption prediction model using the feature vector of the training image as an input variable and the target caption as an output variable. The prediction model training unit 110 decodes image features and learns a method for predicting a caption matching the target caption. For example, the prediction model training unit 110 may train the caption prediction model based on a long short term memory (LSTM) algorithm.
When a content image is input from the database 300, the caption prediction unit 120 detects an object in the content image and extracts a feature vector from the detected object. For example, the caption prediction unit 120 may extract the feature vector from the content image based on the CNN algorithm.
The caption prediction unit 120 predicts the content caption describing contents of the content image by inputting the feature vector of the content image into the caption prediction model. For example, the caption prediction unit 120 may predict the content caption for the content image by decoding image features of the content image based on the LSTM algorithm. Here, the content image may be composed of image files in various formats such as JPEG, BMP, GIF, PNG, and TIFF, and the content caption may be composed of text files in various formats such as TXT. One content caption and one or more objects may be defined for one content image. The caption prediction unit 120 stores a content caption and an object corresponding to each content image in the database 300.
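The prediction step can be illustrated with a deliberately simplified stand-in for the trained model: instead of an LSTM decoder, it returns the caption whose stored training feature vector is nearest to the input vector. The feature vectors and captions below are hypothetical example data, not part of the patent.

```python
import numpy as np

# Hypothetical "trained" association of captions with feature vectors;
# a real system would decode a caption token by token with an LSTM.
TRAINED = {
    "A dog runs on the grass": np.array([0.9, 0.1, 0.0]),
    "A cycle parked beside a tree": np.array([0.1, 0.8, 0.2]),
}

def predict_content_caption(feature_vec):
    # nearest-neighbour lookup in feature space as a toy decoder
    return min(TRAINED, key=lambda cap: np.linalg.norm(TRAINED[cap] - feature_vec))

print(predict_content_caption(np.array([0.85, 0.2, 0.1])))  # "A dog runs on the grass"
```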
The tag creation unit 130 extracts a tag from the content caption using natural language processing. Specifically, the tag creation unit 130 performs sentence segmentation on the content caption composed of a combination of corpuses. Subsequently, the tag creation unit 130 divides the sentence into tokens. Here, a token is a string having a meaning, and may be understood as a concept including a morpheme or a word. The tag creation unit 130 performs part-of-speech (POS) tagging to allocate part-of-speech information to each token. The tag creation unit 130 then performs named entity recognition on the tokens, attaching entity name tags such as a person's name, a place name, and an organization name. The tag creation unit 130 stores the entity name tag in the database 300 together with the content caption and the object corresponding to each content image. The entity name tag can be used in a process of searching for the content caption.
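The tokenize-then-tag pipeline above can be sketched as follows. The tiny POS lexicon is a hypothetical stand-in for a real tagger (e.g. from an NLP library); the sketch keeps noun tokens as searchable tags.

```python
import re

POS_LEXICON = {  # tiny hypothetical lexicon; a real tagger would come from an NLP library
    "dog": "NOUN", "park": "NOUN", "cycle": "NOUN",
    "runs": "VERB", "a": "DET", "in": "ADP", "the": "DET",
}

def create_tags(caption):
    """Tokenize a content caption, POS-tag each token, and keep nouns as tags."""
    tokens = re.findall(r"[a-z]+", caption.lower())
    tagged = [(tok, POS_LEXICON.get(tok, "X")) for tok in tokens]
    return [tok for tok, pos in tagged if pos == "NOUN"]

print(create_tags("A dog runs in the park"))  # ['dog', 'park']
```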
The scene creation unit 200 includes a search text analysis unit 210, an image search unit 220, a ratio calculation unit 230, and an object merging unit 240.
When the user inputs a search text for an image or photo desired to be found into the user terminal, the search text is transmitted to the image management server 10. The search text analysis unit 210 extracts a plurality of keywords and relationship information between the keywords from the search text through natural language processing. Specifically, the search text analysis unit 210 extracts the keywords and their relationship information by using sentence segmentation, tokenization, POS tagging, named entity recognition, etc. For example, when the user inputs "A dog beside a cycle in a park" as the search text, the search text analysis unit 210 extracts "dog", "cycle", and "park" as keywords through natural language processing, and extracts "beside" and "in" as relationship information.
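The keyword and relationship extraction can be sketched for the "A dog beside a cycle in a park" example. The noun and relation vocabularies below are hypothetical stand-ins for real POS tagging; relations are built by linking the nearest keyword on each side of a preposition.

```python
import re

NOUNS = {"dog", "cycle", "park"}             # hypothetical noun vocabulary
RELATIONS = {"beside", "in", "on", "under"}  # prepositions treated as relations

def analyze_search_text(text):
    tokens = re.findall(r"[a-z]+", text.lower())
    keywords = [t for t in tokens if t in NOUNS]
    relationship_info = []
    for i, tok in enumerate(tokens):
        if tok in RELATIONS:
            # link the nearest keyword before and after the preposition
            left = next((t for t in reversed(tokens[:i]) if t in NOUNS), None)
            right = next((t for t in tokens[i + 1:] if t in NOUNS), None)
            if left and right:
                relationship_info.append((left, tok, right))
    return keywords, relationship_info

keywords, relations = analyze_search_text("A dog beside a cycle in a park")
print(keywords)   # ['dog', 'cycle', 'park']
print(relations)  # [('dog', 'beside', 'cycle'), ('cycle', 'in', 'park')]
```

The (keyword, relation, keyword) triples later serve as the nodes and edges of the layout graph.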
The image search unit 220 searches for a content caption matching each keyword among the content captions stored in the database 300 and detects a content image (this is referred to as a 'related image') corresponding to the searched content caption. When a plurality of keywords are extracted from the search text, a plurality of content images are detected.
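The caption-matching search can be sketched with a simple word-containment check; the caption database contents are hypothetical, and a real system would also use the stored entity name tags.

```python
def find_related_images(keywords, caption_db):
    """caption_db maps image id -> predicted content caption.
    Returns, for each keyword, the ids of images whose caption mentions it."""
    related = {}
    for kw in keywords:
        related[kw] = [img_id for img_id, caption in caption_db.items()
                       if kw in caption.lower().split()]
    return related

caption_db = {  # hypothetical content captions stored in the database
    "img_001": "A dog runs on the grass",
    "img_002": "A cycle parked beside a tree",
    "img_003": "People walking in a park",
}
related = find_related_images(["dog", "cycle", "park"], caption_db)
print(related)  # {'dog': ['img_001'], 'cycle': ['img_002'], 'park': ['img_003']}
```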
The ratio calculation unit 230 detects objects corresponding to the keyword for each related image, and calculates a size ratio between the detected objects. The ratio calculation unit 230 may automatically calculate the size ratio between the detected objects with reference to a content image other than the related image. Hereinafter, a method for calculating the size ratio between detected objects will be described in detail with reference to FIGS. 4 to 6. FIGS. 4 to 6 are diagrams exemplarily illustrating a method for calculating a size ratio according to an embodiment of the present invention. In this embodiment, a method for calculating a size ratio between the object (dog) and the object (cycle), which respectively correspond to the first keyword "dog" and the second keyword "cycle", is exemplarily illustrated. Here, the size ratio may be a horizontal ratio, a vertical ratio, an aspect ratio, etc. between the objects.
FIG. 4 illustrates a case in which a content image (referred to as a 'reference image') including both the object (dog) corresponding to the first keyword and the object (cycle) corresponding to the second keyword in one image is detected from the database 300. A size ratio between the cropped objects may be calculated in the later object merging process by using the size ratio between the object (dog) and the object (cycle) included in the reference image.
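The FIG. 4 case reduces to reading both bounding boxes out of the one reference image; a minimal sketch, with hypothetical bounding-box sizes and the vertical (height) ratio chosen as the size ratio:

```python
def size_ratio_from_reference(ref_boxes, first_kw, second_kw):
    """ref_boxes maps object label -> (width, height) of its bounding box
    in the reference image. Returns the height ratio first:second."""
    _, h1 = ref_boxes[first_kw]
    _, h2 = ref_boxes[second_kw]
    return h1 / h2

# hypothetical reference image containing both a dog and a cycle
reference = {"dog": (80, 60), "cycle": (140, 100)}
ratio = size_ratio_from_reference(reference, "dog", "cycle")
print(ratio)  # 0.6 — the cropped dog is drawn at 0.6x the cycle's height
```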
FIG. 5 illustrates, as a case in which a reference image including both the object (dog) corresponding to the first keyword and the object (cycle) corresponding to the second keyword is not present in one image, a case where the first reference image includes the object (dog) and the object (tree) and the second reference image includes the object (cycle) and the object (tree). The object (tree) exists in common in the first reference image and the second reference image, and is referred to as a common object. A size ratio between the cropped objects may be calculated in the later object merging process by using the size ratio between these common objects (tree).
FIG. 6 illustrates, as a case in which a reference image including both the object (dog) corresponding to the first keyword and the object (cycle) corresponding to the second keyword is not present in one image, a case where the first reference image includes the object (dog) and the object (tree) and the second reference image includes the object (cycle) and the object (house), so that no object exists in common in the first and second reference images. The ratio calculation unit 230 additionally detects a standard image that is not related to either keyword among the content images stored in the database 300. For example, when the object (tree) and the object (house) are included in the standard image, a common object (tree) exists in the first reference image and the standard image, and a common object (house) exists in the second reference image and the standard image. The size ratio between the cropped objects may be calculated in the later object merging process using the size ratios between these common objects (tree, house).
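The FIG. 6 case amounts to chaining ratios through the bridging objects: dog/cycle = (dog/tree) x (tree/house) x (house/cycle), where the first factor comes from the first reference image, the second from the standard image, and the third from the second reference image. A sketch with hypothetical bounding-box heights:

```python
def chained_size_ratio(ref1, std, ref2, first_kw, second_kw):
    """No single image contains both target objects and the two reference
    images share no common object, so a standard image bridges them.
    Each dict maps object label -> bounding-box height."""
    common1 = next(iter(set(ref1) & set(std)))   # e.g. 'tree'
    common2 = next(iter(set(ref2) & set(std)))   # e.g. 'house'
    # dog/cycle = (dog/tree) * (tree/house) * (house/cycle)
    return (ref1[first_kw] / ref1[common1]) \
         * (std[common1] / std[common2]) \
         * (ref2[common2] / ref2[second_kw])

ref1 = {"dog": 60, "tree": 300}      # first reference image (hypothetical heights)
ref2 = {"cycle": 100, "house": 250}  # second reference image
std = {"tree": 240, "house": 200}    # standard image bridging the two
print(chained_size_ratio(ref1, std, ref2, "dog", "cycle"))  # 0.6
```

The FIG. 5 case is the same chain with a single shared object, i.e. without the middle factor.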
If a plurality of related images are detected through the previous content caption search and an object corresponding to a keyword is detected for each related image, the object merging unit 240 crops the object for each related image. For example, the object merging unit 240 may crop the object corresponding to the keyword by using algorithms such as YOLO, Saliency Map, Integral Image, Local Adaptive Thresholding, GrabCut, etc.
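The cropping step itself is a bounding-box slice once a detector has located the object; a minimal sketch on a synthetic image, with the box coordinates hypothetical (a real system would obtain them from YOLO, GrabCut, or similar):

```python
import numpy as np

def crop_object(image, box):
    """Crop an object out of a related image given its bounding box
    (x, y, width, height)."""
    x, y, w, h = box
    return image[y:y + h, x:x + w]

image = np.zeros((100, 200))   # hypothetical related image
image[20:60, 30:110] = 1.0     # pretend the detected object occupies this region
crop = crop_object(image, (30, 20, 80, 40))
print(crop.shape)  # (40, 80)
print(crop.min())  # 1.0 — the crop contains only the object region
```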
The object merging unit 240 creates one scene image by merging a plurality of cropped objects based on the previously calculated size ratio. Specifically, the object merging unit 240 automatically predicts a layout indicating an arrangement relationship of objects in the scene image based on the GCN algorithm using the object corresponding to the keyword as a node and the relationship information as an edge. In addition, the object merging unit 240 adjusts the sizes of the plurality of cropped objects according to the size ratio and then arranges the cropped objects on the layout to complete the scene image.
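The final merge can be sketched as resize-then-paste. The layout positions below are supplied by hand as a stand-in for the GCN layout prediction, the images are synthetic arrays, and the nearest-neighbour scaling is a stand-in for proper resampling.

```python
import numpy as np

def merge_objects(crops, positions, ratios, canvas_shape=(200, 300)):
    """Resize each cropped object by its size ratio and paste it onto the
    scene canvas at its layout position (top-left corner, row/column)."""
    canvas = np.zeros(canvas_shape)
    for label, crop in crops.items():
        r = ratios[label]
        h = max(1, int(crop.shape[0] * r))
        w = max(1, int(crop.shape[1] * r))
        # nearest-neighbour scaling via index resampling
        rows = np.arange(h) * crop.shape[0] // h
        cols = np.arange(w) * crop.shape[1] // w
        resized = crop[rows][:, cols]
        y, x = positions[label]
        canvas[y:y + h, x:x + w] = resized
    return canvas

crops = {"dog": np.ones((50, 50)), "cycle": 2 * np.ones((100, 100))}
positions = {"cycle": (50, 100), "dog": (120, 40)}  # hypothetical predicted layout
ratios = {"dog": 0.6, "cycle": 1.0}                 # from the ratio calculation step
scene = merge_objects(crops, positions, ratios)
print(scene.shape)  # (200, 300)
```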
The database 300 stores various images and data used in the method for creating the scene image of the present invention, such as training data, content images together with the objects and content captions related to the content images, and scene images.
Hereinafter, the method for creating the scene image according to an embodiment of the present invention will be described in detail with reference to FIGS. 7 to 9. FIG. 7 is a flowchart illustrating a process of creating a caption prediction model in the method for creating the scene image according to the embodiment of the present invention. FIG. 8 is a flowchart illustrating a process of predicting a content caption in the method for creating the scene image according to the embodiment of the present invention. FIG. 9 is a flowchart illustrating a process of creating a scene image in the method for creating the scene image according to the embodiment of the present invention.
Referring to FIG. 7, when training data including a training image and a target caption describing contents of the training image is input from the database 300 (S10), the prediction model training unit 110 extracts a feature vector from the training image by using the training data (S12). For example, the prediction model training unit 110 may extract a feature vector from the training image based on the CNN algorithm.
The prediction model training unit 110 trains the caption prediction model using the feature vector of the training image as an input variable and the target caption as an output variable (S14). The prediction model training unit 110 may train the caption prediction model based on the LSTM algorithm.
Subsequently, referring to FIG. 8, when a content image is input from the database 300 (S20), the caption prediction unit 120 detects an object in the content image (S22). The caption prediction unit 120 extracts the feature vector from the detected object (S24). The caption prediction unit 120 inputs the feature vector of the content image into the caption prediction model to predict a content caption describing the contents of the content image (S26). The caption prediction unit 120 stores the content caption and the object corresponding to each content image in the database 300 (S28).
Subsequently, referring to FIG. 9, when a search text is input from the user terminal (S30), the search text analysis unit 210 extracts a plurality of keywords and relationship information between the keywords from the search text through natural language processing (S31).
The image search unit 220 searches for the content caption matching each keyword among the content captions stored in the database 300 and detects the content image (referred to as a 'related image') corresponding to the searched content caption (S32).
The ratio calculation unit 230 detects objects corresponding to the keyword for each related image, and calculates a size ratio between the detected objects with reference to the reference image (S33). The ratio calculation unit 230 may calculate the size ratio in the following way according to the presence or absence of the detected object in the reference image. For convenience of explanation, the plurality of keywords includes first and second keywords, a content image having an object corresponding to the first keyword among the content images is defined as a first related image, a content image having an object corresponding to the second keyword among the content images is defined as a second related image.
If a reference image including both the objects corresponding to the first and second keywords in one image is detected, the ratio calculation unit 230 may calculate a size ratio between a plurality of cropped objects by using a size ratio between the objects included in the reference image.
If a content image including both the objects corresponding to the first and second keywords does not exist in one image, the first reference image including the object corresponding to the first keyword is detected, and the second reference image including the object corresponding to the second keyword is detected, the ratio calculation unit 230 may calculate the size ratio between the plurality of cropped objects by using the size ratio between common objects that exist in common in the first and second reference images.
If a content image including both the objects corresponding to the first keyword and the second keyword does not exist in one image, the first reference image including the object corresponding to the first keyword is detected, the second reference image including the object corresponding to the second keyword is detected, and an object that exists in common is not present in the first reference image and the second reference image, the ratio calculation unit 230 detects a standard image that is not related to the first keyword or second keyword. Subsequently, the ratio calculation unit 230 may calculate a size ratio between the plurality of cropped objects by using a size ratio between the common objects that exist in common in the first reference image and the standard image and a size ratio between the common objects that exist in common in the second reference image and the standard image.
The object merging unit 240 crops the object for each related image (S34). The object merging unit 240 creates one scene image by merging the plurality of cropped objects based on the previously calculated size ratio (S35). Specifically, the object merging unit 240 predicts a layout indicating the arrangement relationship of the detected objects in the scene image based on the GCN algorithm using the detected object corresponding to the keyword as a node and the relationship information as an edge. Subsequently, the object merging unit 240 adjusts the sizes of the plurality of cropped objects according to the previously calculated size ratio and then arranges the plurality of cropped objects on the layout to complete the scene image.
Although the embodiments of the present invention have been described with reference to the accompanying drawings, those of ordinary skill in the art to which the present invention pertains will understand that the present invention may be embodied in other specific forms without changing the technical spirit or essential features thereof. Therefore, it should be understood that the embodiments described above are illustrative in all respects and not restrictive.
Claims (7)
- An image creation method for providing a scene image by merging objects from multiple images, the method comprising, by an image management server: a step of extracting a feature vector from a training image, based on training data composed of the training image and a target caption describing contents of the training image, and training a caption prediction model by using the feature vector of the training image as an input variable and the target caption as an output variable; a step of detecting an object from a content image and predicting a content caption describing contents of the content image by inputting a feature vector extracted from the detected object into the caption prediction model; a step of extracting a plurality of keywords and their relationship information from a search text input from a user terminal through natural language processing; a step of searching for a content caption matching each keyword, and detecting a content image (referred to as a 'related image') corresponding to the searched content caption; a step of detecting objects corresponding to a keyword for each related image, and calculating a size ratio between the detected objects with reference to a reference image other than the related image; and a step of cropping the detected objects for each related image, and creating a scene image by merging a plurality of cropped objects based on the size ratio.
- The method according to claim 1, wherein the plurality of keywords include first and second keywords, a first related image is defined as having an object corresponding to the first keyword, a second related image is defined as having an object corresponding to the second keyword, a reference image including both objects corresponding to the first and second keywords in one image is detected, and a size ratio between the plurality of cropped objects is calculated by using a size ratio between the objects included in the reference image.
- The method according to claim 1, wherein the plurality of keywords include first and second keywords, a first related image is defined as having an object corresponding to the first keyword, a second related image is defined as having an object corresponding to the second keyword, a first reference image including an object corresponding to the first keyword is detected, a second reference image including an object corresponding to the second keyword is detected, and a size ratio between the plurality of cropped objects is calculated by using a size ratio between common objects that exist in common in the first and second reference images.
- The method according to claim 1, wherein the plurality of keywords include first and second keywords, a first related image is defined as having an object corresponding to the first keyword, a second related image is defined as having an object corresponding to the second keyword, a first reference image including an object corresponding to the first keyword is detected, a second reference image including an object corresponding to the second keyword is detected, a standard image that is not related to the first or second keyword is detected, and a size ratio between the plurality of cropped objects is calculated by using a size ratio between common objects that exist in common in the first reference image and the standard image and a size ratio between common objects that exist in common in the second reference image and the standard image.
- The method according to claim 1, wherein a feature vector is extracted from the training image based on a convolutional neural network (CNN) algorithm, and the caption prediction model is trained based on a long short-term memory (LSTM) algorithm.
- The method according to claim 1, wherein the step of creating the scene image includes a step of predicting a layout indicating an arrangement relationship of the detected objects in the scene image based on a graph convolution network (GCN) algorithm using the detected object as a node and the relationship information as an edge, and a step of adjusting the sizes of the plurality of cropped objects according to the size ratio and then arranging the plurality of cropped objects on the layout.
- An image management server for providing a scene image by merging objects from multiple images, the server comprising: an image caption unit; and a scene creation unit, wherein the image caption unit extracts a feature vector from a training image, based on training data composed of the training image and a target caption describing contents of the training image, and trains a caption prediction model by using the feature vector of the training image as an input variable and the target caption as an output variable; the image caption unit detects an object from a content image and predicts a content caption describing contents of the content image by inputting a feature vector extracted from the detected object into the caption prediction model; the scene creation unit extracts a plurality of keywords and their relationship information from a search text input from a user terminal through natural language processing; the scene creation unit searches for a content caption matching each keyword, and detects a content image (referred to as a 'related image') corresponding to the searched content caption; the scene creation unit detects objects corresponding to a keyword for each related image, and calculates a size ratio between the detected objects by referring to a reference image other than the related image; and the scene creation unit crops the detected objects for each related image, and creates a scene image by merging a plurality of cropped objects based on the size ratio.
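The caption-search step that both the method claim and the server claim describe (matching each extracted keyword against predicted content captions to find related images) can be sketched as follows. The image ids and captions here are invented for illustration, and plain substring matching stands in for whatever matching the server actually applies; in the patent the captions come from the CNN+LSTM caption prediction model and the keywords from NLP over the user's search text.

```python
def find_related_images(captions, keywords):
    """For each keyword, return the content images whose caption mentions it.

    captions : image id -> predicted content caption (string)
    keywords : list of keywords extracted from the search text
    Returns a dict: keyword -> list of matching image ids ('related images').
    """
    related = {}
    for kw in keywords:
        related[kw] = [img for img, cap in captions.items()
                       if kw.lower() in cap.lower()]
    return related

# illustrative caption store (image id -> caption)
captions = {
    "img001": "A dog running on the grass",
    "img002": "A red car parked on the street",
    "img003": "A man riding a bicycle",
}
related = find_related_images(captions, ["dog", "car"])
```

Each keyword's hit list is a set of candidate related images; the later steps (object detection, size-ratio calculation, cropping, merging) operate on these candidates.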
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020210098889A KR20230017433A (en) | 2021-07-28 | 2021-07-28 | Image management server providing a scene image by merging objects from multiple images and method for creating the scene image using the same |
KR10-2021-0098889 | 2021-07-28 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2023008609A1 (en) | 2023-02-02 |
Family
ID=85086890
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/KR2021/009814 WO2023008609A1 (en) | 2021-07-28 | 2021-07-28 | Image management server providing a scene image by merging objects from multiple images and method for creating the scene image using the same |
Country Status (2)
Country | Link |
---|---|
KR (1) | KR20230017433A (en) |
WO (1) | WO2023008609A1 (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR101744141B1 (en) * | 2016-01-25 | 2017-06-07 | 조선대학교산학협력단 | Method for reconstructing a photograph by object retargeting and the apparatus thereof |
KR20200075114A (en) * | 2018-12-12 | 2020-06-26 | 주식회사 인공지능연구원 | System and Method for Matching Similarity between Image and Text |
KR20200114708A (en) * | 2019-03-29 | 2020-10-07 | 경북대학교 산학협력단 | Electronic device, image searching system and controlling method thereof |
KR20200122119A (en) * | 2019-04-17 | 2020-10-27 | 주식회사 웨스트월드 | Image retrieval system and method through scene analysis |
US20210200803A1 (en) * | 2018-12-07 | 2021-07-01 | Seoul National University R&Db Foundation | Query response device and method |
2021
- 2021-07-28 KR KR1020210098889A patent/KR20230017433A/en unknown
- 2021-07-28 WO PCT/KR2021/009814 patent/WO2023008609A1/en unknown
Also Published As
Publication number | Publication date |
---|---|
KR20230017433A (en) | 2023-02-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112131350B (en) | Text label determining method, device, terminal and readable storage medium | |
Siersdorfer et al. | Analyzing and predicting sentiment of images on the social web | |
WO2020122456A1 (en) | System and method for matching similarities between images and texts | |
CN102053991B (en) | Method and system for multi-language document retrieval | |
CN111079444A (en) | Network rumor detection method based on multi-modal relationship | |
US20070196013A1 (en) | Automatic classification of photographs and graphics | |
WO2010134752A2 (en) | Semantic search method and system in which a plurality of classification systems are linked | |
WO2012108623A1 (en) | Method, system and computer-readable recording medium for adding a new image and information on the new image to an image database | |
CN112100438A (en) | Label extraction method and device and computer readable storage medium | |
WO2020103899A1 (en) | Method for generating inforgraphic information and method for generating image database | |
CN109740152A (en) | Determination method, apparatus, storage medium and the computer equipment of text classification | |
WO2021235617A1 (en) | System for recommending scientific and technical knowledge information, and method therefor | |
Liu et al. | Documentclip: Linking figures and main body text in reflowed documents | |
Wang et al. | Data-driven approach for bridging the cognitive gap in image retrieval | |
WO2023008609A1 (en) | Image management server providing a scene image by merging objects from multiple images and method for creating the scene image using the same | |
WO2014148664A1 (en) | Multi-language search system, multi-language search method, and image search system, based on meaning of word | |
US20080015843A1 (en) | Linguistic Image Label Incorporating Decision Relevant Perceptual, Semantic, and Relationships Data | |
JP2002007413A (en) | Image retrieving device | |
CN116955707A (en) | Content tag determination method, device, equipment, medium and program product | |
WO2022092497A1 (en) | System for providing similar case information, and method therefor | |
CN117009578A (en) | Video data labeling method and device, electronic equipment and storage medium | |
WO2020122440A1 (en) | Apparatus for detecting contextually-anomalous sentence in document, method therefor, and computer-readable recording medium having program for performing same method recorded thereon | |
JP2022185874A (en) | Information processing device, information processing system, information processing method, and program | |
CN115114467A (en) | Training method and device of picture neural network model | |
Wang et al. | Exploring statistical correlations for image retrieval |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 21951979; Country of ref document: EP; Kind code of ref document: A1) |
NENP | Non-entry into the national phase (Ref country code: DE) |