WO2023008609A1 - Image management server providing a scene image by merging objects from multiple images and method for creating the scene image using the same - Google Patents

Image management server providing a scene image by merging objects from multiple images and method for creating the scene image using the same

Info

Publication number
WO2023008609A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
objects
caption
keyword
content
Prior art date
Application number
PCT/KR2021/009814
Other languages
French (fr)
Inventor
Imtiaz Ahmed
Noor Al Din AHMED
Nourin Haque RIDI
Original Assignee
Imtiaz Ahmed
Priority date
Filing date
Publication date
Application filed by Imtiaz Ahmed
Publication of WO2023008609A1

Classifications

    • G06F 40/56: Natural language generation
    • G06F 16/51: Indexing; data structures therefor; storage structures
    • G06F 16/538: Presentation of query results
    • G06F 16/583: Retrieval of still image data using metadata automatically derived from the content
    • G06F 16/5866: Retrieval of still image data using manually generated information, e.g. tags, keywords, comments, manually generated location and time information
    • G06F 40/284: Lexical analysis, e.g. tokenisation or collocates
    • G06F 40/40: Processing or translation of natural language
    • G06N 3/042: Knowledge-based neural networks; logical representations of neural networks
    • G06N 3/0442: Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G06N 3/0464: Convolutional networks [CNN, ConvNet]
    • G06N 3/08: Learning methods
    • G06N 3/096: Transfer learning
    • G06Q 10/10: Office automation; time management
    • G06Q 30/0276: Advertisement creation
    • G06Q 50/01: Social networking
    • G06T 11/60: Editing figures and text; combining figures or text
    • G06T 7/11: Region-based segmentation
    • G06V 10/774: Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V 20/00: Scenes; scene-specific elements
    • G06V 20/70: Labelling scene content, e.g. deriving syntactic or semantic representations
    • G06F 40/106: Display of layout of documents; previewing
    • G06T 2207/20081: Training; learning
    • G06T 2207/20084: Artificial neural networks [ANN]
    • G06T 2207/20221: Image fusion; image merging

Definitions

  • The present invention relates to an image management server and an image creation method using the same, and more particularly, to an image management server providing a scene image by merging objects from multiple images and a method for creating the scene image using the same.
  • Stock photography refers to photos stocked in large quantities, and when photos without copyright issues are uploaded to a photo platform or website where the stock photography is collected, companies or individuals pay money as needed to purchase the photos.
  • The stock photography is used as material photos for newspapers and magazines, and is also used as related images in advertisements, publicity materials, online postings, etc.
  • A problem to be solved by the present invention is to provide a method for creating a scene image capable of providing the scene image by extracting and merging objects from multiple images.
  • Another problem to be solved by the present invention is to provide an image management server that performs such a method.
  • A method for creating a scene image according to an embodiment of the present invention for achieving the problem described above is an image creation method for providing a scene image by merging objects from multiple images, the method including, by an image management server, a step of extracting a feature vector from a training image, based on training data composed of the training image and a target caption describing contents of the training image, and training a caption prediction model by using the feature vector of the training image as an input variable and the target caption as an output variable, a step of detecting an object from a content image and predicting a content caption describing contents of the content image by inputting a feature vector extracted from the detected object into the caption prediction model, a step of extracting a plurality of keywords and their relationship information from a search text input from a user terminal through natural language processing, a step of searching for a content caption matching each keyword, and detecting a content image (referred to as a 'related image') corresponding to the searched content caption, a step of detecting objects corresponding to a keyword for each related image, and calculating a size ratio between the detected objects with reference to a reference image other than the related image, and a step of cropping the detected objects for each related image, and creating a scene image by merging a plurality of cropped objects based on the size ratio.
  • The plurality of keywords may include first and second keywords, a first related image may be defined as having an object corresponding to the first keyword, a second related image may be defined as having an object corresponding to the second keyword, a reference image including both objects corresponding to the first and second keywords in one image may be detected, and a size ratio between the plurality of cropped objects may be calculated by using a size ratio between the objects included in the reference image.
  • The plurality of keywords may include first and second keywords, a first related image may be defined as having an object corresponding to the first keyword, a second related image may be defined as having an object corresponding to the second keyword, a first reference image including an object corresponding to the first keyword may be detected, a second reference image including an object corresponding to the second keyword may be detected, and a size ratio between the plurality of cropped objects may be calculated by using a size ratio between common objects that exist in common in the first and second reference images.
  • The plurality of keywords may include first and second keywords, a first related image may be defined as having an object corresponding to the first keyword, a second related image may be defined as having an object corresponding to the second keyword, a first reference image including an object corresponding to the first keyword may be detected, a second reference image including an object corresponding to the second keyword may be detected, a standard image that is not related to the first or second keyword may be detected, and a size ratio between the plurality of cropped objects may be calculated by using a size ratio between common objects that exist in common in the first reference image and the standard image and a size ratio between common objects that exist in common in the second reference image and the standard image.
  • A feature vector may be extracted from the training image based on a convolutional neural network (CNN) algorithm, and the caption prediction model may be trained based on a long short-term memory (LSTM) algorithm.
  • The step of creating the scene image may include a step of predicting a layout indicating an arrangement relationship of the detected objects in the scene image based on a graph convolution network (GCN) algorithm, using the detected objects as nodes and the relationship information as edges, and a step of adjusting the sizes of the plurality of cropped objects according to the size ratio and then arranging the plurality of cropped objects on the layout.
  • An image management server for achieving the other problem described above is an image management server for providing a scene image by merging objects from multiple images, the image management server including an image caption unit and a scene creation unit.
  • The image caption unit may extract a feature vector from a training image, based on training data composed of the training image and a target caption describing contents of the training image, and train a caption prediction model by using the feature vector of the training image as an input variable and the target caption as an output variable.
  • The image caption unit may detect an object from a content image and predict a content caption describing contents of the content image by inputting a feature vector extracted from the detected object into the caption prediction model.
  • The scene creation unit may extract a plurality of keywords and their relationship information from a search text input from a user terminal through natural language processing.
  • The scene creation unit may search for a content caption matching each keyword, and detect a content image (referred to as a 'related image') corresponding to the searched content caption.
  • The scene creation unit may detect objects corresponding to a keyword for each related image, and calculate a size ratio between the detected objects by referring to a reference image other than the related image.
  • The scene creation unit may crop the detected objects for each related image, and create a scene image by merging a plurality of cropped objects based on the size ratio.
  • A content caption describing contents of each content image can be automatically predicted and stored through a caption prediction model.
  • When a search text is input from a user terminal, a keyword and relationship information are extracted from the search text.
  • A content image matching each keyword (this is referred to as a 'related image') can be detected.
  • One scene image can be created by cropping and merging objects corresponding to a keyword for each related image.
  • A layout indicating an arrangement relationship of objects in one scene image can be automatically predicted using the extracted relationship information, and objects can be arranged in one scene image according to the predicted layout.
  • The size ratio between objects can be automatically calculated with reference to an existing content image as follows.
  • First, if a content image including all of the objects in one image exists, the size ratio between the plurality of cropped objects to be merged can be automatically calculated by using the size ratio between the objects included in that content image.
  • Second, if no such content image exists, a content image can be individually detected for each of the objects, and the size ratio between the plurality of cropped objects to be merged can be automatically calculated by using the size ratio between common objects that exist in common in the detected content images.
  • Third, if the individually detected content images share no common object, a standard content image can be additionally detected and the size ratio between the plurality of cropped objects to be merged can be automatically calculated by using the size ratio between the common objects that exist in common between the detected content images and the standard content image.
  • By automatically adjusting the size ratio in this way, the objects can be represented naturally and in harmony with one another.
  • FIG. 1 is a configuration diagram conceptually illustrating an image management server according to an embodiment of the present invention.
  • FIG. 2 is a configuration diagram conceptually illustrating an image caption unit of FIG. 1.
  • FIG. 3 is a configuration diagram conceptually illustrating a scene creation unit of FIG. 1.
  • FIGS. 4 to 6 are diagrams exemplarily illustrating a method for calculating a size ratio according to an embodiment of the present invention.
  • FIG. 7 is a flowchart illustrating a process of creating a caption prediction model in a method for creating a scene image according to an embodiment of the present invention.
  • FIG. 8 is a flowchart illustrating a process of predicting a content caption in the method for creating the scene image according to the embodiment of the present invention.
  • FIG. 9 is a flowchart illustrating a process of creating a scene image in the method for creating the scene image according to the embodiment of the present invention.
  • An image management server 10 is a server that provides a scene image by merging objects from several images, and includes an image caption unit 100 that predicts a content caption for each content image using a prediction model, a scene creation unit 200 that detects a related image based on a search text and crops and merges objects from each related image, and a database 300 that stores various images and data.
  • The image caption unit 100 includes a prediction model training unit 110, a caption prediction unit 120, and a tag creation unit 130.
  • Training data may include a training image and a target caption describing contents of the training image.
  • The prediction model training unit 110 extracts a feature vector from a training image using the training data.
  • The training image may be composed of image files in various formats such as JPEG, BMP, GIF, PNG, and TIFF.
  • The target caption may be a ground truth caption and may be composed of text files in various formats such as TXT.
  • The prediction model training unit 110 may use transfer learning to pre-process a raw image based on a pre-trained convolutional neural network (CNN) algorithm.
  • The prediction model training unit 110 may create a feature vector by receiving a training image and extracting essential features of the corresponding training image based on the CNN algorithm.
  • The feature vector refers to a value obtained by extracting features from image data.
  • The prediction model training unit 110 trains the caption prediction model using the feature vector of the training image as an input variable and the target caption as an output variable.
  • The prediction model training unit 110 decodes image features and learns a method for predicting a caption matching the target caption.
  • The prediction model training unit 110 may train the caption prediction model based on a long short-term memory (LSTM) algorithm.
  • The caption prediction unit 120 detects an object in the content image and extracts a feature vector from the detected object. For example, the caption prediction unit 120 may extract the feature vector from the content image based on the CNN algorithm.
  • The caption prediction unit 120 predicts the content caption describing contents of the content image by inputting the feature vector of the content image into the caption prediction model.
  • The caption prediction unit 120 may predict the content caption for the content image by decoding image features of the content image based on the LSTM algorithm.
  • The content image may be composed of image files in various formats such as JPEG, BMP, GIF, PNG, and TIFF.
  • The content caption may be composed of text files in various formats such as TXT.
  • One content caption and one or more objects may be defined for one content image.
  • The caption prediction unit 120 stores a content caption and an object corresponding to each content image in the database 300.
  • The tag creation unit 130 extracts a tag from the content caption using natural language processing. Specifically, the tag creation unit 130 performs sentence segmentation on the content caption, which is composed of a combination of corpora. Subsequently, the tag creation unit 130 divides each sentence into tokens.
  • The tokens are meaningful strings, and may be understood as a concept including morphemes or words.
  • The tag creation unit 130 performs part-of-speech (POS) tagging, which allocates part-of-speech information to each token.
  • The tag creation unit 130 performs named entity recognition on the tokens, attaching entity name tags such as a person's name, a place name, and an organization name.
  • The tag creation unit 130 stores the entity name tags in the database 300 together with the content caption and the object corresponding to each content image.
  • The entity name tags can be used in the process of searching for content captions.
  • The scene creation unit 200 includes a search text analysis unit 210, an image search unit 220, a ratio calculation unit 230, and an object merging unit 240.
  • The search text analysis unit 210 extracts a plurality of keywords and relationship information between the keywords from the search text through natural language processing. Specifically, the search text analysis unit 210 extracts a plurality of keywords and their relationship information by using sentence separation, tokenization, POS tagging, entity name recognition, etc. For example, when the user inputs "A dog beside a cycle in a park" as the search text, the search text analysis unit 210 extracts "dog", "cycle", and "park" as keywords through natural language processing, and extracts "beside" and "in" as relationship information.
  • The image search unit 220 searches for a content caption matching each keyword among the content captions stored in the database 300 and detects a content image (this is referred to as a 'related image') corresponding to the searched content caption.
  • When a plurality of keywords are extracted from the search text, a plurality of content images are detected.
  • The ratio calculation unit 230 detects objects corresponding to the keyword for each related image, and calculates a size ratio between the detected objects.
  • The ratio calculation unit 230 may automatically calculate the size ratio between the detected objects with reference to a content image other than the related image.
  • A method for calculating the size ratio between detected objects will be described in detail with reference to FIGS. 4 to 6.
  • FIGS. 4 to 6 are diagrams exemplarily illustrating a method for calculating a size ratio according to an embodiment of the present invention.
  • A method for calculating a size ratio between an object (dog) and an object (cycle) that respectively correspond to a first keyword and a second keyword, when the first keyword is "dog" and the second keyword is "cycle", is exemplarily illustrated.
  • The size ratio may be a horizontal ratio, a vertical ratio, an aspect ratio, etc. between the objects.
  • FIG. 4 illustrates a case in which a content image (this is referred to as a 'reference image') including both the object (dog) corresponding to the first keyword and the object (cycle) corresponding to the second keyword in one image is detected from the database 300.
  • A size ratio between the cropped objects may be calculated, for use in the later object merging process, using the size ratio between the object (dog) and the object (cycle) included in the reference image.
  • FIG. 5 illustrates a case in which no reference image includes both the object (dog) corresponding to the first keyword and the object (cycle) corresponding to the second keyword in one image; instead, a first reference image includes the object (dog) and an object (tree), and a second reference image includes the object (cycle) and an object (tree).
  • The object (tree) exists in common in the first reference image and the second reference image, and is referred to as a common object.
  • A size ratio between the cropped objects may be calculated in the later object merging process using the size ratio between these common objects (tree).
  • FIG. 6 likewise illustrates a case in which no reference image includes both the object (dog) and the object (cycle) in one image; here the first reference image includes the object (dog) and an object (tree) and the second reference image includes the object (cycle) and an object (house), so that no object exists in common in the first reference image and the second reference image.
  • The ratio calculation unit 230 then additionally detects a standard image that is not related to either keyword among the content images stored in the database 300.
  • For example, when the object (tree) and the object (house) are included in the standard image, a common object (tree) exists in the first reference image and the standard image, and a common object (house) exists in the second reference image and the standard image.
  • The size ratio between the cropped objects may be calculated in the later object merging process using the size ratios between these common objects (tree, house).
  • The object merging unit 240 crops the object for each related image.
  • The object merging unit 240 may crop the object corresponding to the keyword by using algorithms such as YOLO, Saliency Map, Integral Image, Local Adaptive Thresholding, GrabCut, etc.
  • The object merging unit 240 creates one scene image by merging a plurality of cropped objects based on the previously calculated size ratio. Specifically, the object merging unit 240 automatically predicts a layout indicating an arrangement relationship of objects in the scene image based on the GCN algorithm, using the objects corresponding to the keywords as nodes and the relationship information as edges. In addition, the object merging unit 240 adjusts the sizes of the plurality of cropped objects according to the size ratio and then arranges the cropped objects on the layout to complete the scene image.
  • The database 300 stores various images and data used in the method for creating the scene image of the present invention, such as training data, content images together with the objects and content captions related to the content images, and scene images.
  • FIG. 7 is a flowchart illustrating a process of creating a caption prediction model in the method for creating the scene image according to the embodiment of the present invention.
  • FIG. 8 is a flowchart illustrating a process of predicting a content caption in the method for creating the scene image according to the embodiment of the present invention.
  • FIG. 9 is a flowchart illustrating a process of creating a scene image in the method for creating the scene image according to the embodiment of the present invention.
  • The prediction model training unit 110 extracts a feature vector from the training image by using the training data (S12). For example, the prediction model training unit 110 may extract a feature vector from the training image based on the CNN algorithm.
  • The prediction model training unit 110 trains the caption prediction model using the feature vector of the training image as an input variable and the target caption as an output variable (S14).
  • The prediction model training unit 110 may train the caption prediction model based on the LSTM algorithm.
  • The caption prediction unit 120 detects an object in the content image (S22).
  • The caption prediction unit 120 extracts the feature vector from the detected object (S24).
  • The caption prediction unit 120 inputs the feature vector of the content image into the caption prediction model to predict a content caption describing the contents of the content image (S26).
  • The caption prediction unit 120 stores the content caption and the object corresponding to each content image in the database 300 (S28).
  • The search text analysis unit 210 extracts a plurality of keywords and relationship information between the keywords from the search text through natural language processing (S31).
  • The image search unit 220 searches for the content caption matching each keyword among the content captions stored in the database 300 and detects the content image (referred to as a 'related image') corresponding to the searched content caption (S32).
  • The ratio calculation unit 230 detects objects corresponding to the keyword for each related image, and calculates a size ratio between the detected objects with reference to the reference image (S33).
  • The ratio calculation unit 230 may calculate the size ratio in the following ways according to the presence or absence of the detected objects in the reference image.
  • Suppose the plurality of keywords includes first and second keywords, a content image having an object corresponding to the first keyword among the content images is defined as a first related image, and a content image having an object corresponding to the second keyword among the content images is defined as a second related image.
  • When a reference image including both objects in one image is detected, the ratio calculation unit 230 may calculate a size ratio between the plurality of cropped objects by using a size ratio between the objects included in the reference image.
  • When first and second reference images respectively including the objects corresponding to the first and second keywords are detected, the ratio calculation unit 230 may calculate the size ratio between the plurality of cropped objects by using the size ratio between common objects that exist in common in the first and second reference images.
  • When the first and second reference images share no common object, the ratio calculation unit 230 detects a standard image that is not related to the first keyword or second keyword. Subsequently, the ratio calculation unit 230 may calculate a size ratio between the plurality of cropped objects by using a size ratio between the common objects that exist in common in the first reference image and the standard image and a size ratio between the common objects that exist in common in the second reference image and the standard image.
  • The object merging unit 240 crops the object for each related image (S34).
  • The object merging unit 240 creates one scene image by merging the plurality of cropped objects based on the previously calculated size ratio (S35). Specifically, the object merging unit 240 predicts a layout indicating the arrangement relationship of the detected objects in the scene image based on the GCN algorithm, using the detected objects corresponding to the keywords as nodes and the relationship information as edges. Subsequently, the object merging unit 240 adjusts the sizes of the plurality of cropped objects according to the previously calculated size ratio and then arranges the plurality of cropped objects on the layout to complete the scene image.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Business, Economics & Management (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Strategic Management (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Multimedia (AREA)
  • Economics (AREA)
  • Library & Information Science (AREA)
  • Marketing (AREA)
  • Human Resources & Organizations (AREA)
  • Entrepreneurship & Innovation (AREA)
  • General Business, Economics & Management (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Tourism & Hospitality (AREA)
  • Finance (AREA)
  • Development Economics (AREA)
  • Accounting & Taxation (AREA)
  • Processing Or Creating Images (AREA)
  • Quality & Reliability (AREA)
  • Operations Research (AREA)
  • Primary Health Care (AREA)
  • Game Theory and Decision Science (AREA)
  • Medical Informatics (AREA)

Abstract

Provided are an image management server providing a scene image by extracting and merging objects from multiple images and a method for creating the scene image using the same. The image management server predicts, using a caption prediction model, a content caption describing the contents of each content image, extracts a plurality of keywords and relationship information from a search text, detects related images using the keywords and the content captions, crops objects from the related images, then adjusts a layout and the sizes of the objects, and merges the objects to create one scene image.

Description

IMAGE MANAGEMENT SERVER PROVIDING A SCENE IMAGE BY MERGING OBJECTS FROM MULTIPLE IMAGES AND METHOD FOR CREATING THE SCENE IMAGE USING THE SAME
The present invention relates to an image management server and an image creation method using the same, and more particularly, to an image management server providing a scene image by merging objects from multiple images and a method for creating the scene image using the same.
Stock photography refers to photos stocked in large quantities, and when photos without copyright issues are uploaded to a photo platform or website where the stock photography is collected, companies or individuals pay money as needed to purchase the photos. The stock photography is used as material photos for newspapers and magazines, and is also used as related images in advertisements, publicity materials, online postings, etc.
When a user accesses the photo platform to purchase stock photography and inputs a search term, related photos are displayed on the screen as a result. However, although many photos are stored on the photo platform, it is not easy for the user to find the photos he/she likes. For example, if the user searches for stock photography of a person jogging in a park, among the resulting photos the user often likes only the background in some photos and only the jogger in others.
A problem to be solved by the present invention is to provide a method for creating a scene image capable of providing the scene image by extracting and merging objects from multiple images.
Another problem to be solved by the present invention is to provide an image management server that performs such a method.
The problems to be solved by the present invention are not limited to the problems mentioned above, and other problems not mentioned will be clearly understood by those of ordinary skill in the art to which the present invention pertains from the following description.
A method for creating a scene image according to an embodiment of the present invention for achieving the problem described above is an image creation method for providing a scene image by merging objects from multiple images, the method including, by an image management server, a step of extracting a feature vector from a training image, based on training data composed of the training image and a target caption describing contents of the training image, and training a caption prediction model by using the feature vector of the training image as an input variable and the target caption as an output variable, a step of detecting an object from a content image and predicting a content caption describing contents of the content image by inputting a feature vector extracted from the detected object into the caption prediction model, a step of extracting a plurality of keywords and their relationship information from a search text input from a user terminal through natural language processing, a step of searching for a content caption matching each keyword, and detecting a content image (referred to as a 'related image') corresponding to the searched content caption, a step of detecting objects corresponding to a keyword for each related image, and calculating a size ratio between the detected objects with reference to a reference image other than the related image, and a step of cropping the detected objects for each related image, and creating a scene image by merging a plurality of cropped objects based on the size ratio.
The plurality of keywords may include first and second keywords, a first related image may be defined as having an object corresponding to the first keyword, a second related image may be defined as having an object corresponding to the second keyword, a reference image including both objects corresponding to the first and second keywords in one image may be detected, and a size ratio between the plurality of cropped objects may be calculated by using a size ratio between the objects included in the reference image.
The plurality of keywords may include first and second keywords, a first related image may be defined as having an object corresponding to the first keyword, a second related image may be defined as having an object corresponding to the second keyword, a first reference image including an object corresponding to the first keyword may be detected, a second reference image including an object corresponding to the second keyword may be detected, and a size ratio between the plurality of cropped objects may be calculated by using a size ratio between common objects that exist in common in the first and second reference images.
The plurality of keywords may include first and second keywords, a first related image may be defined as having an object corresponding to the first keyword, a second related image may be defined as having an object corresponding to the second keyword, a first reference image including an object corresponding to the first keyword may be detected, a second reference image including an object corresponding to the second keyword may be detected, a standard image that is not related to the first or second keyword may be detected, and a size ratio between the plurality of cropped objects may be calculated by using a size ratio between common objects that exist in common in the first reference image and the standard image and a size ratio between common objects that exist in common in the second reference image and the standard image.
A feature vector may be extracted from the training image based on a convolutional neural network (CNN) algorithm, and the caption prediction model may be trained based on a long short-term memory (LSTM) algorithm.
The step of creating the scene image may include a step of predicting a layout indicating an arrangement relationship of the detected objects in the scene image based on a graph convolution network (GCN) algorithm using the detected object as a node and the relationship information as an edge, and a step of adjusting the sizes of the plurality of cropped objects according to the size ratio and then arranging the plurality of cropped objects on the layout.
An image management server according to an embodiment of the present invention for achieving the other problem described above is an image management server for providing a scene image by merging objects from multiple images, the image management server including an image caption unit and a scene creation unit.
Here, the image caption unit may extract a feature vector from a training image, based on training data composed of the training image and a target caption describing contents of the training image, and train a caption prediction model by using the feature vector of the training image as an input variable and the target caption as an output variable.
The image caption unit may detect an object from a content image and predict a content caption describing contents of the content image by inputting a feature vector extracted from the detected object into the caption prediction model.
The scene creation unit may extract a plurality of keywords and their relationship information from a search text input from a user terminal through natural language processing. The scene creation unit may search for a content caption matching each keyword, and detect a content image (referred to as a 'related image') corresponding to the searched content caption. The scene creation unit may detect objects corresponding to a keyword for each related image, and calculate a size ratio between the detected objects by referring to a reference image other than the related image. The scene creation unit may crop the detected objects for each related image, and create a scene image by merging a plurality of cropped objects based on the size ratio.
The specific details of other embodiments are included in the specific content and drawings.
As described above, according to the image management server and the method for creating the scene image using the same according to the present invention, in a state in which numerous content images are stored in a database, a content caption describing contents of each content image can be automatically predicted and stored through a caption prediction model. When a search text is input from a user terminal, a keyword and relationship information are extracted from the search text.
Through a matching search between the extracted keyword and the content caption, a content image matching each keyword (this is referred to as a 'related image') can be detected. One scene image can be created by cropping and merging objects corresponding to a keyword for each related image.
In addition, a layout indicating an arrangement relationship of objects in one scene image can be automatically predicted using the extracted relationship information, and objects can be arranged in one scene image according to the predicted layout.
When merging a plurality of cropped objects in one scene image, it is important to adjust a size ratio between the objects. In the case of the present invention, the size ratio between objects can be automatically calculated with reference to an existing content image as follows.
First, if a content image including all of a plurality of objects exists in one image, the size ratio between the plurality of cropped objects to be merged can be automatically calculated by using the size ratio between the objects included in the content image.
Second, if the content image including all of the plurality of objects is not present in one image, a content image can be individually detected for each of the objects, and the size ratio between the plurality of cropped objects to be merged can be automatically calculated by using the size ratio between common objects that exist in common in the detected content images.
Third, if the content image including all of the plurality of objects is not present in one image and the common object that exists in common is not present among the content images individually detected for each object, a standard content image can be additionally detected and the size ratio between the plurality of cropped objects to be merged can be automatically calculated by using the size ratio between the common objects that exist in common between the detected content images and the standard content image.
As such, by automatically adjusting the size ratio between the objects when merging the plurality of cropped objects in one scene image, the objects can be represented naturally and harmoniously with each other.
FIG. 1 is a configuration diagram conceptually illustrating an image management server according to an embodiment of the present invention.
FIG. 2 is a configuration diagram conceptually illustrating an image caption unit of FIG. 1.
FIG. 3 is a configuration diagram conceptually illustrating a scene creation unit of FIG. 1.
FIGS. 4 to 6 are diagrams exemplarily illustrating a method for calculating a size ratio according to an embodiment of the present invention.
FIG. 7 is a flowchart illustrating a process of creating a caption prediction model in a method for creating a scene image according to an embodiment of the present invention.
FIG. 8 is a flowchart illustrating a process of predicting a content caption in the method for creating the scene image according to the embodiment of the present invention.
FIG. 9 is a flowchart illustrating a process of creating a scene image in the method for creating the scene image according to the embodiment of the present invention.
Advantages and features of the present invention, and methods for achieving them, will become apparent with reference to the embodiments described below in detail in conjunction with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms. The present embodiments are provided only so that the disclosure of the present invention is complete and fully informs those of ordinary skill in the art to which the present invention belongs of the scope of the invention; the present invention is defined only by the scope of the claims. The same reference numerals refer to the same components throughout the specification.
Hereinafter, an image management server and a method for creating a scene image using the same according to an embodiment of the present invention will be described in detail with reference to the accompanying drawings. FIG. 1 is a configuration diagram conceptually illustrating an image management server according to an embodiment of the present invention. FIG. 2 is a configuration diagram conceptually illustrating an image caption unit of FIG. 1. FIG. 3 is a configuration diagram conceptually illustrating a scene creation unit of FIG. 1.
An image management server 10 according to an embodiment of the present invention is a server that provides a scene image by merging objects from several images, and includes an image caption unit 100 that predicts a content caption for each content image using a prediction model, a scene creation unit 200 that detects a related image based on a search text and crops and merges objects from each related image, and a database 300 that stores various images and data.
The image caption unit 100 includes a prediction model training unit 110, a caption prediction unit 120, and a tag creation unit 130.
Training data may include a training image and a target caption describing contents of the training image. The prediction model training unit 110 extracts a feature vector from a training image using the training data. Here, the training image may be composed of image files in various formats such as JPEG, BMP, GIF, PNG, and TIFF, and the target caption may be a ground truth caption and may be composed of text files in various formats such as TXT. For example, the prediction model training unit 110 may use transfer learning to pre-process a raw image based on a pre-trained convolutional neural network (CNN) algorithm. For example, the prediction model training unit 110 may create a feature vector by receiving a training image and extracting essential features of the corresponding training image based on the CNN algorithm. Here, the feature vector refers to a value obtained by extracting features from image data.
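The patent specifies only "a pre-trained CNN", so the following is a minimal sketch assuming a torchvision ResNet-50 whose classification head is dropped, leaving the pooled activations as the feature vector; the network choice and the 2048-dimensional output are illustrative assumptions, not details from the source.

```python
# Sketch: CNN feature extraction via transfer learning (ResNet-50 assumed).
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

cnn = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
cnn.fc = torch.nn.Identity()          # drop the classifier; keep pooled 2048-d features
cnn.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def extract_feature_vector(path: str) -> torch.Tensor:
    """Return a feature vector for one training or content image."""
    image = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return cnn(image).squeeze(0)   # shape: (2048,)
```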
The prediction model training unit 110 trains the caption prediction model using the feature vector of the training image as an input variable and the target caption as an output variable. The prediction model training unit 110 decodes image features and learns a method for predicting a caption matching the target caption. For example, the prediction model training unit 110 may train the caption prediction model based on a long short-term memory (LSTM) algorithm.
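One way to realize this training step is an LSTM decoder conditioned on the image feature vector. The sketch below is a hedged reading of that setup; all dimensions, the padding index, and the optimizer are chosen for illustration rather than taken from the patent.

```python
# Sketch: caption prediction model trained with the image feature vector as
# input and the target caption (token ids) as output.
import torch
import torch.nn as nn

class CaptionModel(nn.Module):
    def __init__(self, feat_dim=2048, embed_dim=256, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.img_proj = nn.Linear(feat_dim, embed_dim)  # map CNN features into embedding space
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, feats, caption_ids):
        tokens = self.embed(caption_ids)                              # (B, T-1, E)
        inputs = torch.cat([self.img_proj(feats).unsqueeze(1), tokens], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)                                       # (B, T, vocab)

model = CaptionModel()
criterion = nn.CrossEntropyLoss(ignore_index=0)       # 0 = assumed padding id
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(feats, caption_ids):
    """One update: the image feature is prepended, the target caption supervises."""
    logits = model(feats, caption_ids[:, :-1])
    loss = criterion(logits.reshape(-1, logits.size(-1)), caption_ids.reshape(-1))
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```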
When a content image is input from the database 300, the caption prediction unit 120 detects an object in the content image and extracts a feature vector from the detected object. For example, the caption prediction unit 120 may extract the feature vector from the content image based on the CNN algorithm.
The caption prediction unit 120 predicts the content caption describing contents of the content image by inputting the feature vector of the content image into the caption prediction model. For example, the caption prediction unit 120 may predict the content caption for the content image by decoding image features of the content image based on the LSTM algorithm. Here, the content image may be composed of image files in various formats such as JPEG, BMP, GIF, PNG, and TIFF, and the content caption may be composed of text files in various formats such as TXT. One content caption and one or more objects may be defined for one content image. The caption prediction unit 120 stores a content caption and an object corresponding to each content image in the database 300.
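At prediction time the same model can be run autoregressively. A greedy decoding sketch, reusing the CaptionModel above with an assumed end-of-caption token id, might look like this:

```python
# Sketch: greedy content-caption prediction with the trained LSTM decoder.
def predict_caption(feats, max_len=20, end_id=2):
    """Decode token ids for one image; feats has shape (1, feat_dim)."""
    model.eval()
    ids, state = [], None
    inp = model.img_proj(feats).unsqueeze(1)                   # step 0: the image feature
    with torch.no_grad():
        for _ in range(max_len):
            hidden, state = model.lstm(inp, state)
            next_id = model.out(hidden[:, -1]).argmax(dim=-1)  # most likely next token
            if next_id.item() == end_id:                       # assumed <end> id
                break
            ids.append(next_id.item())
            inp = model.embed(next_id).unsqueeze(1)            # feed the prediction back
    return ids   # a real system maps these ids back to vocabulary words
```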
The tag creation unit 130 extracts a tag from the content caption using natural language processing. Specifically, the tag creation unit 130 performs sentence segmentation on the content caption, which is composed of a combination of corpora. Subsequently, the tag creation unit 130 divides each sentence into tokens. Here, the tokens are meaningful strings, and may be understood as a concept including morphemes or words. The tag creation unit 130 performs part-of-speech (POS) tagging, which allocates part-of-speech information to each token. The tag creation unit 130 performs named entity recognition on the tokens, attaching entity name tags such as a person's name, a place name, and an organization name. The tag creation unit 130 stores the entity name tags in the database 300 together with the content caption and the object corresponding to each content image. The entity name tags can be used in the process of searching for content captions.
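The patent does not name an NLP toolkit; as one concrete possibility, spaCy covers the whole described pipeline (sentence segmentation, tokenization, POS tagging, named entity recognition) in a few lines:

```python
# Sketch: tag creation from a content caption with spaCy (toolkit assumed).
import spacy

nlp = spacy.load("en_core_web_sm")   # assumes the small English model is installed

def extract_tags(caption: str):
    doc = nlp(caption)
    tokens = [(t.text, t.pos_) for sent in doc.sents for t in sent]  # segmentation + POS
    entities = [(ent.text, ent.label_) for ent in doc.ents]          # entity name tags
    return tokens, entities

print(extract_tags("A dog runs beside a bicycle in Central Park."))
# the entity list contains ('Central Park', ...) with a location-type label
```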
The scene creation unit 200 includes a search text analysis unit 210, an image search unit 220, a ratio calculation unit 230, and an object merging unit 240.
When the user inputs a search text for an image or photo desired to be found into the user terminal, the search text is transmitted to the image management server 10. The search text analysis unit 210 extracts a plurality of keywords and relationship information between the keywords from the search text through natural language processing. Specifically, the search text analysis unit 210 extracts a plurality of keywords and their relationship information by using sentence separation, tokenization, POS tagging, entity name recognition, etc. For example, when the user inputs "A dog beside a cycle in a park" as the search text, the search text analysis unit 210 extracts "dog", "cycle", and "park" as keywords through natural language processing, and extracts "beside" and "in" as relationship information.
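A minimal sketch of this analysis, again assuming spaCy: treat nouns as keywords and prepositions as relationship information. This mirrors only the "A dog beside a cycle in a park" example; a production grammar would need to be richer.

```python
# Sketch: keyword and relationship extraction from the search text.
import spacy

nlp = spacy.load("en_core_web_sm")

def analyze_search_text(text: str):
    doc = nlp(text)
    keywords = [t.lemma_ for t in doc if t.pos_ in ("NOUN", "PROPN")]
    relations = [t.text for t in doc if t.pos_ == "ADP"]   # prepositions
    return keywords, relations

print(analyze_search_text("A dog beside a cycle in a park"))
# -> (['dog', 'cycle', 'park'], ['beside', 'in'])
```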
The image search unit 220 searches for a content caption matching each keyword among the content captions stored in the database 300 and detects a content image (this is referred to as a 'related image') corresponding to the searched content caption. When a plurality of keywords are extracted from the search text, a plurality of content images are detected.
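The matching itself can be as simple as a keyword-in-caption test; the in-memory dictionary below stands in for database 300 and is purely illustrative.

```python
# Sketch: detecting related images by matching keywords against stored captions.
captions_db = {                       # image_id -> predicted content caption (illustrative)
    "img_001": "a dog runs on the grass",
    "img_002": "a cycle leaning against a tree",
    "img_003": "people walking in a park",
}

def find_related_images(keyword: str) -> list[str]:
    return [img_id for img_id, caption in captions_db.items()
            if keyword in caption.split()]

print(find_related_images("dog"))     # -> ['img_001']
```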
The ratio calculation unit 230 detects objects corresponding to the keyword for each related image, and calculates a size ratio between the detected objects. The ratio calculation unit 230 may automatically calculate the size ratio between the detected objects with reference to a content image other than the related image. Hereinafter, a method for calculating the size ratio between detected objects will be described in detail with reference to FIGS. 4 to 6. FIGS. 4 to 6 are diagrams exemplarily illustrating a method for calculating a size ratio according to an embodiment of the present invention. In this embodiment, a method for calculating a size ratio between an object (dog) and an object (cycle) that respectively correspond to a first keyword and a second keyword, when the first keyword is "dog" and the second keyword is "cycle", is exemplarily illustrated. Here, the size ratio may be a horizontal ratio, a vertical ratio, an aspect ratio, etc. between the objects.
FIG. 4 illustrates a case in which a content image (this is referred to as a 'reference image') including both the object (dog) corresponding to the first keyword and the object (cycle) corresponding to the second keyword in one image is detected from the database 300. The size ratio between the cropped objects may then be calculated, for use in the later object merging process, from the size ratio between the object (dog) and the object (cycle) included in the reference image.
FIG. 5 illustrates a case in which no reference image includes both the object (dog) corresponding to the first keyword and the object (cycle) corresponding to the second keyword in one image; instead, a first reference image includes the object (dog) and an object (tree), and a second reference image includes the object (cycle) and an object (tree). The object (tree) that exists in common in the first reference image and the second reference image is referred to as a common object. The size ratio between the cropped objects may be calculated in the later object merging process using the size ratio between these common objects (tree).
FIG. 6 likewise illustrates a case in which no reference image includes both the object (dog) and the object (cycle) in one image; here the first reference image includes the object (dog) and an object (tree) and the second reference image includes the object (cycle) and an object (house), so that no object exists in common in the first reference image and the second reference image. The ratio calculation unit 230 then additionally detects a standard image that is not related to either keyword among the content images stored in the database 300. For example, when the object (tree) and the object (house) are included in the standard image, a common object (tree) exists in the first reference image and the standard image, and a common object (house) exists in the second reference image and the standard image. The size ratio between the cropped objects may be calculated in the later object merging process using the size ratios between these common objects (tree, house).
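The three cases reduce to chaining pixel-size ratios through shared objects, under the working assumption that a common object depicts roughly the same real-world size in every image. A sketch with object heights in pixels (all numbers illustrative):

```python
# Sketch: the three size-ratio cases of FIGS. 4-6, using object heights (px).

def ratio_same_image(dog_h, cycle_h):
    """FIG. 4: both objects appear in one reference image."""
    return dog_h / cycle_h

def ratio_via_common_object(dog_h1, tree_h1, cycle_h2, tree_h2):
    """FIG. 5: a tree in both reference images links the two scales."""
    return (dog_h1 / tree_h1) / (cycle_h2 / tree_h2)

def ratio_via_standard_image(dog_h1, tree_h1, tree_hs, house_hs, house_h2, cycle_h2):
    """FIG. 6: a standard image containing tree and house bridges two
    reference images that share no common object."""
    return (dog_h1 / tree_h1) * (tree_hs / house_hs) * (house_h2 / cycle_h2)

# Worked example for FIG. 5: a 60 px dog beside a 300 px tree, and a 110 px
# cycle beside a 275 px tree, give (60/300) / (110/275) = 0.2 / 0.4 = 0.5,
# i.e. the cropped dog should be rendered at half the cycle's height.
print(ratio_via_common_object(60, 300, 110, 275))   # 0.5
```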
If a plurality of related images are detected through the preceding content caption search and an object corresponding to a keyword is detected for each related image, the object merging unit 240 crops the object from each related image. For example, the object merging unit 240 may crop the object corresponding to the keyword using algorithms such as YOLO, Saliency Map, Integral Image, Local Adaptive Thresholding, and GrabCut.
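As a non-authoritative example, one of the named options (GrabCut, via OpenCV) could crop an object roughly as follows; the bounding rectangle would in practice come from a detector such as YOLO, so the values shown are placeholders:

    import cv2
    import numpy as np

    def crop_object(image_path, rect):
        # rect = (x, y, w, h): in practice supplied by a detector such as YOLO
        img = cv2.imread(image_path)
        mask = np.zeros(img.shape[:2], np.uint8)
        bgd, fgd = np.zeros((1, 65), np.float64), np.zeros((1, 65), np.float64)
        # separate foreground from background inside the rectangle
        cv2.grabCut(img, mask, rect, bgd, fgd, 5, cv2.GC_INIT_WITH_RECT)
        fg = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 1, 0)
        x, y, w, h = rect
        return (img * fg[:, :, None].astype("uint8"))[y:y + h, x:x + w]

    # cropped = crop_object("related_image.jpg", (50, 30, 200, 260))  # placeholder rect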
The object merging unit 240 creates one scene image by merging the plurality of cropped objects based on the previously calculated size ratio. Specifically, the object merging unit 240 automatically predicts a layout indicating the arrangement relationship of the objects in the scene image based on a graph convolution network (GCN) algorithm, using the objects corresponding to the keywords as nodes and the relationship information as edges. In addition, the object merging unit 240 adjusts the sizes of the plurality of cropped objects according to the size ratio and then arranges the cropped objects on the layout to complete the scene image.
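The layout step can be pictured with the following toy sketch, in which the keywords are nodes, the relationship information is an edge, and a single unnormalized graph-convolution step maps node features to bounding boxes. The weights are random placeholders, so the output is shape-correct only, not a trained prediction:

    import numpy as np

    nodes = ["dog", "cycle"]                 # objects from the keywords
    edges = [(0, 1)]                         # relationship information as an edge
    feats = np.random.rand(len(nodes), 16)   # stand-in node embeddings

    adj = np.eye(len(nodes))                 # adjacency with self-loops
    for i, j in edges:
        adj[i, j] = adj[j, i] = 1.0

    w = np.random.rand(16, 4)                # 4 outputs per node: x, y, width, height
    layout = adj @ feats @ w                 # one (unnormalized) graph-convolution step
    print(layout.shape)                      # (2, 4): one bounding box per object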
The database 300 stores various images and data used in the method for creating the scene image of the present invention, such as the training data, the content images, the objects and content captions related to the content images, and the scene images.
Hereinafter, the method for creating the scene image according to an embodiment of the present invention will be described in detail with reference to FIGS. 7 to 9. FIG. 7 is a flowchart illustrating a process of creating a caption prediction model in the method for creating the scene image according to the embodiment of the present invention. FIG. 8 is a flowchart illustrating a process of predicting a content caption in the method for creating the scene image according to the embodiment of the present invention. FIG. 9 is a flowchart illustrating a process of creating a scene image in the method for creating the scene image according to the embodiment of the present invention.
Referring to FIG. 7, when training data including a training image and a target caption describing contents of the training image is input from the database 300 (S10), the prediction model training unit 110 extracts a feature vector from the training image by using the training data (S12). For example, the prediction model training unit 110 may extract a feature vector from the training image based on the CNN algorithm.
The prediction model training unit 110 trains the caption prediction model using the feature vector of the training image as an input variable and the target caption as an output variable (S14). The prediction model training unit 110 may train the caption prediction model based on the LSTM algorithm.
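By way of a hedged sketch, the S10-S14 setup can be expressed in Keras as a CNN-feature encoder combined with an LSTM decoder that predicts the next caption word; the vocabulary size, sequence length, and layer widths below are assumed values, not taken from the patent:

    from tensorflow.keras import layers, Model

    vocab_size, max_len, feat_dim = 5000, 20, 2048   # assumed sizes

    img_feats = layers.Input(shape=(feat_dim,))      # CNN feature vector (S12)
    cap_in = layers.Input(shape=(max_len,))          # caption tokens so far

    x1 = layers.Dense(256, activation="relu")(img_feats)
    x2 = layers.Embedding(vocab_size, 256, mask_zero=True)(cap_in)
    x2 = layers.LSTM(256)(x2)                        # LSTM over the caption (S14)

    merged = layers.add([x1, x2])
    out = layers.Dense(vocab_size, activation="softmax")(merged)  # next word

    model = Model([img_feats, cap_in], out)
    model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")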
Subsequently, referring to FIG. 8, when a content image is input from the database 300 (S20), the caption prediction unit 120 detects an object in the content image (S22). The caption prediction unit 120 extracts the feature vector from the detected object (S24). The caption prediction unit 120 inputs the feature vector of the content image into the caption prediction model to predict a content caption describing the contents of the content image (S26). The caption prediction unit 120 stores the content caption and the object corresponding to each content image in the database 300 (S28).
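Continuing the sketch above, S22 to S26 amount to greedy decoding with the trained model; the start/end token ids and the helper name are assumptions:

    import numpy as np

    def predict_caption(model, feats, start_id, end_id, max_len=20):
        # feats: feature vector of the detected object (S24), shape (2048,)
        seq = [start_id]
        while len(seq) < max_len:
            padded = np.pad(seq, (0, max_len - len(seq)))[None, :]
            probs = model.predict([feats[None, :], padded], verbose=0)
            next_id = int(np.argmax(probs))
            if next_id == end_id:
                break
            seq.append(next_id)
        return seq   # token ids of the predicted content caption (S26)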
Subsequently, referring to FIG. 9, when a search text is input from the user terminal (S30), the search text analysis unit 210 extracts a plurality of keywords and relationship information between the keywords from the search text through natural language processing (S31).
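As one possible reading of S31, the nouns in the search text can serve as keywords and the verb linking them as the relationship information. A spaCy sketch might look like this; the embodiment names no specific NLP toolkit, so the toolkit and the dependency rules are assumptions:

    import spacy

    nlp = spacy.load("en_core_web_sm")    # assumes the small English model
    doc = nlp("a dog riding a cycle")     # example search text

    keywords = [tok.text for tok in doc if tok.pos_ == "NOUN"]
    relations = [(tok.head.text, tok.text) for tok in doc
                 if tok.pos_ == "NOUN" and tok.dep_ in ("dobj", "pobj")]

    print(keywords)    # ['dog', 'cycle']
    print(relations)   # typically [('riding', 'cycle')]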
The image search unit 220 searches for the content caption matching each keyword among the content captions stored in the database 300 and detects the content image (referred to as a 'related image') corresponding to the searched content caption (S32).
The ratio calculation unit 230 detects the objects corresponding to the keywords for each related image, and calculates a size ratio between the detected objects with reference to a reference image (S33). The ratio calculation unit 230 may calculate the size ratio in one of the following ways, depending on where the detected objects appear in the reference images. For convenience of explanation, it is assumed that the plurality of keywords include first and second keywords, that a content image having an object corresponding to the first keyword among the content images is defined as a first related image, and that a content image having an object corresponding to the second keyword among the content images is defined as a second related image.
If a reference image including both the objects corresponding to the first and second keywords in one image is detected, the ratio calculation unit 230 may calculate a size ratio between a plurality of cropped objects by using a size ratio between the objects included in the reference image.
If no content image includes both the objects corresponding to the first and second keywords in one image, but a first reference image including the object corresponding to the first keyword is detected and a second reference image including the object corresponding to the second keyword is detected, the ratio calculation unit 230 may calculate the size ratio between the plurality of cropped objects by using the size ratio between common objects that exist in common in the first and second reference images.
If no content image includes both the objects corresponding to the first and second keywords in one image, a first reference image including the object corresponding to the first keyword is detected, a second reference image including the object corresponding to the second keyword is detected, and no common object is present in the first reference image and the second reference image, the ratio calculation unit 230 detects a standard image that is not related to the first keyword or the second keyword. Subsequently, the ratio calculation unit 230 may calculate the size ratio between the plurality of cropped objects by using the size ratio between the common objects that exist in common in the first reference image and the standard image and the size ratio between the common objects that exist in common in the second reference image and the standard image.
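Putting the three branches of S33 together, the selection logic can be summarized in the following self-contained sketch; the Img class, its methods, and all heights are illustrative assumptions:

    class Img:
        def __init__(self, heights):          # {object name: pixel height}
            self.h = heights
        def has(self, *names):
            return all(n in self.h for n in names)
        def ratio(self, a, b):                # size ratio of a to b in this image
            return self.h[a] / self.h[b]
        def common(self, other):
            return sorted(set(self.h) & set(other.h))

    def size_ratio(a, b, images):
        for img in images:                    # FIG. 4: one reference image
            if img.has(a, b):
                return img.ratio(a, b)
        ref1 = next(i for i in images if i.has(a))
        ref2 = next(i for i in images if i.has(b))
        shared = ref1.common(ref2)
        if shared:                            # FIG. 5: shared common object
            c = shared[0]
            return ref1.ratio(a, c) * ref2.ratio(c, b)
        # FIG. 6: bridge through a standard image unrelated to either keyword
        std = next(i for i in images
                   if not i.has(a) and not i.has(b)
                   and i.common(ref1) and i.common(ref2))
        c1, c2 = ref1.common(std)[0], ref2.common(std)[0]
        return ref1.ratio(a, c1) * std.ratio(c1, c2) * ref2.ratio(c2, b)

    imgs = [Img({"dog": 60, "tree": 240}), Img({"cycle": 100, "house": 400}),
            Img({"tree": 300, "house": 600})]          # last one: standard image
    print(size_ratio("dog", "cycle", imgs))            # 0.5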
The object merging unit 240 crops the object for each related image (S34). The object merging unit 240 creates one scene image by merging the plurality of cropped objects based on the previously calculated size ratio (S35). Specifically, the object merging unit 240 predicts a layout indicating the arrangement relationship of the detected objects in the scene image based on the GCN algorithm using the detected object corresponding to the keyword as a node and the relationship information as an edge. Subsequently, the object merging unit 240 adjusts the sizes of the plurality of cropped objects according to the previously calculated size ratio and then arranges the plurality of cropped objects on the layout to complete the scene image.
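Finally, S34 and S35 can be pictured with a short Pillow sketch in which the cropped objects are rescaled by the computed ratio and pasted at the layout positions; all file names, sizes, positions, and the 0.5 ratio are assumed example values:

    from PIL import Image

    def scale_to_height(img, h):
        return img.resize((int(img.width * h / img.height), h))

    scene = Image.new("RGB", (800, 600), "white")    # blank scene canvas
    dog = Image.open("dog_crop.png")                 # assumed cropped-object files
    cycle = Image.open("cycle_crop.png")

    ratio = 0.5                                      # dog height / cycle height (S33)
    cycle_h = 300                                    # assumed layout height for the cycle
    scene.paste(scale_to_height(cycle, cycle_h), (350, 250))            # layout position
    scene.paste(scale_to_height(dog, int(cycle_h * ratio)), (380, 280))
    scene.save("scene.png")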
Although the embodiments of the present invention have been described with reference to the accompanying drawings, those of ordinary skill in the art to which the present invention pertains will understand that the present invention may be embodied in other specific forms without changing the technical spirit or essential features thereof. Therefore, it should be understood that the embodiments described above are illustrative in all respects and not restrictive.

Claims (7)

  1. An image creation method for providing a scene image by merging objects from multiple images, the method comprising:
    by an image management server,
    a step of extracting a feature vector from a training image, based on training data composed of the training image and a target caption describing contents of the training image, and training a caption prediction model by using the feature vector of the training image as an input variable and the target caption as an output variable;
    a step of detecting an object from a content image and predicting a content caption describing contents of the content image by inputting a feature vector extracted from the detected object into the caption prediction model;
    a step of extracting a plurality of keywords and their relationship information from a search text input from a user terminal through natural language processing;
    a step of searching for a content caption matching each keyword, and detecting a content image (referred to as a 'related image') corresponding to the searched content caption;
    a step of detecting objects corresponding to a keyword for each related image, and calculating a size ratio between the detected objects with reference to a reference image other than the related image; and
    a step of cropping the detected objects for each related image, and creating a scene image by merging a plurality of cropped objects based on the size ratio.
  2. The method according to claim 1, wherein
    the plurality of keywords include first and second keywords, a first related image is defined as having an object corresponding to the first keyword, a second related image is defined as having an object corresponding to the second keyword,
    a reference image including both objects corresponding to the first and second keywords in one image is detected, and
    a size ratio between the plurality of cropped objects is calculated by using a size ratio between the objects included in the reference image.
  3. The method according to claim 1, wherein
    the plurality of keywords include first and second keywords, a first related image is defined as having an object corresponding to the first keyword, a second related image is defined as having an object corresponding to the second keyword,
    a first reference image including an object corresponding to the first keyword is detected, a second reference image including an object corresponding to the second keyword is detected, and
    a size ratio between the plurality of cropped objects is calculated by using a size ratio between common objects that exist in common in the first and second reference images.
  4. The method according to claim 1, wherein
    the plurality of keywords include first and second keywords, a first related image is defined as having an object corresponding to the first keyword, a second related image is defined as having an object corresponding to the second keyword,
    a first reference image including an object corresponding to the first keyword is detected, a second reference image including an object corresponding to the second keyword is detected, a standard image that is not related to the first or second keyword is detected, and
    a size ratio between the plurality of cropped objects is calculated by using a size ratio between common objects that exist in common in the first reference image and the standard image and a size ratio between common objects that exist in common in the second reference image and the standard image.
  5. The method according to claim 1, wherein
    a feature vector is extracted from the training image based on a convolutional neural network (CNN) algorithm, and
    the caption prediction model is trained based on a long short term memory (LSTM) algorithm.
  6. The method according to claim 1, wherein
    the step of creating the scene image includes
    a step of predicting a layout indicating an arrangement relationship of the detected objects in the scene image based on a graph convolution network (GCN) algorithm using the detected object as a node and the relationship information as an edge, and
    a step of adjusting the sizes of the plurality of cropped objects according to the size ratio and then arranging the plurality of cropped objects on the layout.
  7. An image management server for providing a scene image by merging objects from multiple images, the server comprising:
    an image caption unit; and
    a scene creation unit, wherein
    the image caption unit extracts a feature vector from a training image, based on training data composed of the training image and a target caption describing contents of the training image, and trains a caption prediction model by using the feature vector of the training image as an input variable and the target caption as an output variable,
    the image caption unit detects an object from a content image and predicts a content caption describing contents of the content image by inputting a feature vector extracted from the detected object into the caption prediction model,
    the scene creation unit extracts a plurality of keywords and their relationship information from a search text input from a user terminal through natural language processing,
    the scene creation unit searches for a content caption matching each keyword, and detects a content image (referred to as a 'related image') corresponding to the searched content caption,
    the scene creation unit detects objects corresponding to a keyword for each related image, and calculates a size ratio between the detected objects by referring to a reference image other than the related image, and
    the scene creation unit crops the detected objects for each related image, and creates a scene image by merging a plurality of cropped objects based on the size ratio.
PCT/KR2021/009814 2021-07-28 2021-07-28 Image management server providing a scene image by merging objects from multiple images and method for creating the scene image using the same WO2023008609A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR1020210098889A KR20230017433A (en) 2021-07-28 2021-07-28 Image management server providing a scene image by merging objects from multiple images and method for creating the scene image using the same
KR10-2021-0098889 2021-07-28

Publications (1)

Publication Number Publication Date
WO2023008609A1 (en)

Family

ID=85086890

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2021/009814 WO2023008609A1 (en) 2021-07-28 2021-07-28 Image management server providing a scene image by merging objects from multiple images and method for creating the scene image using the same

Country Status (2)

Country Link
KR (1) KR20230017433A (en)
WO (1) WO2023008609A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101744141B1 (en) * 2016-01-25 2017-06-07 조선대학교산학협력단 Method for reconstructing a photograph by object retargeting and the apparatus thereof
KR20200075114A (en) * 2018-12-12 2020-06-26 주식회사 인공지능연구원 System and Method for Matching Similarity between Image and Text
KR20200114708A (en) * 2019-03-29 2020-10-07 경북대학교 산학협력단 Electronic device, image searching system and controlling method thereof
KR20200122119A (en) * 2019-04-17 2020-10-27 주식회사 웨스트월드 Image retrieval system and method through scene analysis
US20210200803A1 (en) * 2018-12-07 2021-07-01 Seoul National University R&Db Foundation Query response device and method

Also Published As

Publication number Publication date
KR20230017433A (en) 2023-02-06

Similar Documents

Publication Publication Date Title
CN112131350B (en) Text label determining method, device, terminal and readable storage medium
Siersdorfer et al. Analyzing and predicting sentiment of images on the social web
WO2020122456A1 (en) System and method for matching similarities between images and texts
CN102053991B (en) Method and system for multi-language document retrieval
CN111079444A (en) Network rumor detection method based on multi-modal relationship
US20070196013A1 (en) Automatic classification of photographs and graphics
WO2010134752A2 (en) Semantic search method and system in which a plurality of classification systems are linked
WO2012108623A1 (en) Method, system and computer-readable recording medium for adding a new image and information on the new image to an image database
CN112100438A (en) Label extraction method and device and computer readable storage medium
WO2020103899A1 (en) Method for generating inforgraphic information and method for generating image database
CN109740152A (en) Determination method, apparatus, storage medium and the computer equipment of text classification
WO2021235617A1 (en) System for recommending scientific and technical knowledge information, and method therefor
Liu et al. Documentclip: Linking figures and main body text in reflowed documents
Wang et al. Data-driven approach for bridging the cognitive gap in image retrieval
WO2023008609A1 (en) Image management server providing a scene image by merging objects from multiple images and method for creating the scene image using the same
WO2014148664A1 (en) Multi-language search system, multi-language search method, and image search system, based on meaning of word
US20080015843A1 (en) Linguistic Image Label Incorporating Decision Relevant Perceptual, Semantic, and Relationships Data
JP2002007413A (en) Image retrieving device
CN116955707A (en) Content tag determination method, device, equipment, medium and program product
WO2022092497A1 (en) System for providing similar case information, and method therefor
CN117009578A (en) Video data labeling method and device, electronic equipment and storage medium
WO2020122440A1 (en) Apparatus for detecting contextually-anomalous sentence in document, method therefor, and computer-readable recording medium having program for performing same method recorded thereon
JP2022185874A (en) Information processing device, information processing system, information processing method, and program
CN115114467A (en) Training method and device of picture neural network model
Wang et al. Exploring statistical correlations for image retrieval

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21951979

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE