WO2023008609A1 - Image management server providing a scene image by merging objects from multiple images and method for creating the scene image using the same - Google Patents

Image management server providing a scene image by merging objects from multiple images and method for creating the scene image using the same

Info

Publication number
WO2023008609A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
objects
caption
keyword
content
Prior art date
Application number
PCT/KR2021/009814
Other languages
French (fr)
Inventor
Imtiaz Ahmed
Noor Al Din AHMED
Nourin Haque RIDI
Original Assignee
Imtiaz Ahmed
Priority date
Filing date
Publication date
Application filed by Imtiaz Ahmed
Publication of WO2023008609A1

Classifications

    • G06F 40/56: Natural language generation
    • G06F 16/51: Indexing; data structures therefor; storage structures
    • G06F 16/538: Presentation of query results
    • G06F 16/583: Retrieval of still image data using metadata automatically derived from the content
    • G06F 16/5866: Retrieval of still image data using manually generated information, e.g. tags, keywords, comments, manually generated location and time information
    • G06F 40/284: Lexical analysis, e.g. tokenisation or collocates
    • G06F 40/40: Processing or translation of natural language
    • G06N 3/042: Knowledge-based neural networks; logical representations of neural networks
    • G06N 3/0442: Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G06N 3/0464: Convolutional networks [CNN, ConvNet]
    • G06N 3/08: Learning methods
    • G06N 3/096: Transfer learning
    • G06Q 10/10: Office automation; time management
    • G06Q 30/0276: Advertisement creation
    • G06Q 50/01: Social networking
    • G06T 11/60: Editing figures and text; combining figures or text
    • G06T 7/11: Region-based segmentation
    • G06V 10/774: Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V 20/00: Scenes; scene-specific elements
    • G06V 20/70: Labelling scene content, e.g. deriving syntactic or semantic representations
    • G06F 40/106: Display of layout of documents; previewing
    • G06T 2207/20081: Training; learning
    • G06T 2207/20084: Artificial neural networks [ANN]
    • G06T 2207/20221: Image fusion; image merging

Definitions

  • The present invention relates to an image management server and an image creation method using the same, and more particularly, to an image management server providing a scene image by merging objects from multiple images and a method for creating the scene image using the same.
  • Stock photography refers to photos stocked in large quantities, and when photos without copyright issues are uploaded to a photo platform or website where the stock photography is collected, companies or individuals pay money as needed to purchase the photos.
  • The stock photography is used as material photos for newspapers and magazines, and is also used as related images in advertisements, publicity materials, online postings, etc.
  • A problem to be solved by the present invention is to provide a method for creating a scene image capable of providing the scene image by extracting and merging objects from multiple images.
  • Another problem to be solved by the present invention is to provide an image management server that performs such a method.
  • A method for creating a scene image according to an embodiment of the present invention for achieving the problem described above is an image creation method for providing a scene image by merging objects from multiple images, the method including, by an image management server, a step of extracting a feature vector from a training image, based on training data composed of the training image and a target caption describing contents of the training image, and training a caption prediction model by using the feature vector of the training image as an input variable and the target caption as an output variable, a step of detecting an object from a content image and predicting a content caption describing contents of the content image by inputting a feature vector extracted from the detected object into the caption prediction model, a step of extracting a plurality of keywords and their relationship information from a search text input from a user terminal through natural language processing, a step of searching for a content caption matching each keyword, and detecting a content image (referred to as a 'related image') corresponding to the searched content caption, a step of detecting objects corresponding to a keyword for each related image, and calculating a size ratio between the detected objects with reference to a reference image other than the related image, and a step of cropping the detected objects for each related image, and creating a scene image by merging a plurality of cropped objects based on the size ratio.
  • The plurality of keywords may include first and second keywords, a first related image may be defined as having an object corresponding to the first keyword, a second related image may be defined as having an object corresponding to the second keyword, a reference image including both objects corresponding to the first and second keywords in one image may be detected, and a size ratio between the plurality of cropped objects may be calculated by using a size ratio between the objects included in the reference image.
  • The plurality of keywords may include first and second keywords, a first related image may be defined as having an object corresponding to the first keyword, a second related image may be defined as having an object corresponding to the second keyword, a first reference image including an object corresponding to the first keyword may be detected, a second reference image including an object corresponding to the second keyword may be detected, and a size ratio between the plurality of cropped objects may be calculated by using a size ratio between common objects that exist in common in the first and second reference images.
  • The plurality of keywords may include first and second keywords, a first related image may be defined as having an object corresponding to the first keyword, a second related image may be defined as having an object corresponding to the second keyword, a first reference image including an object corresponding to the first keyword may be detected, a second reference image including an object corresponding to the second keyword may be detected, a standard image that is not related to the first or second keyword may be detected, and a size ratio between the plurality of cropped objects may be calculated by using a size ratio between common objects that exist in common in the first reference image and the standard image and a size ratio between common objects that exist in common in the second reference image and the standard image.
  • A feature vector may be extracted from the training image based on a convolutional neural network (CNN) algorithm, and the caption prediction model may be trained based on a long short-term memory (LSTM) algorithm.
  • The step of creating the scene image may include a step of predicting a layout indicating an arrangement relationship of the detected objects in the scene image based on a graph convolution network (GCN) algorithm, using the detected objects as nodes and the relationship information as edges, and a step of adjusting the sizes of the plurality of cropped objects according to the size ratio and then arranging the plurality of cropped objects on the layout.
  • An image management server for achieving the other problem described above is an image management server for providing a scene image by merging objects from multiple images, the image management server including an image caption unit and a scene creation unit.
  • The image caption unit may extract a feature vector from a training image, based on training data composed of the training image and a target caption describing contents of the training image, and train a caption prediction model by using the feature vector of the training image as an input variable and the target caption as an output variable.
  • The image caption unit may detect an object from a content image and predict a content caption describing contents of the content image by inputting a feature vector extracted from the detected object into the caption prediction model.
  • The scene creation unit may extract a plurality of keywords and their relationship information from a search text input from a user terminal through natural language processing.
  • The scene creation unit may search for a content caption matching each keyword, and detect a content image (referred to as a 'related image') corresponding to the searched content caption.
  • The scene creation unit may detect objects corresponding to a keyword for each related image, and calculate a size ratio between the detected objects by referring to a reference image other than the related image.
  • The scene creation unit may crop the detected objects for each related image, and create a scene image by merging a plurality of cropped objects based on the size ratio.
  • A content caption describing contents of each content image can be automatically predicted and stored through a caption prediction model.
  • When a search text is input from a user terminal, a keyword and relationship information are extracted from the search text.
  • A content image matching each keyword (this is referred to as a 'related image') can be detected.
  • One scene image can be created by cropping and merging objects corresponding to a keyword for each related image.
  • A layout indicating an arrangement relationship of objects in one scene image can be automatically predicted using the extracted relationship information, and objects can be arranged in one scene image according to the predicted layout.
  • The size ratio between objects can be automatically calculated with reference to an existing content image as follows.
  • First, if a content image including all of the objects in one image exists, the size ratio between the plurality of cropped objects to be merged can be automatically calculated by using the size ratio between the objects included in that content image.
  • Second, if no such content image exists, a content image can be individually detected for each of the objects, and the size ratio between the plurality of cropped objects to be merged can be automatically calculated by using the size ratio between common objects that exist in common in the detected content images.
  • Third, if the individually detected content images share no common object, a standard content image can be additionally detected and the size ratio between the plurality of cropped objects to be merged can be automatically calculated by using the size ratio between the common objects that exist in common between the detected content images and the standard content image.
  • By automatically adjusting the size ratio in this way, the objects can be represented naturally and in harmony with one another.
  • FIG. 1 is a configuration diagram conceptually illustrating an image management server according to an embodiment of the present invention.
  • FIG. 2 is a configuration diagram conceptually illustrating an image caption unit of FIG. 1.
  • FIG. 3 is a configuration diagram conceptually illustrating a scene creation unit of FIG. 1.
  • FIGS. 4 to 6 are diagrams exemplarily illustrating a method for calculating a size ratio according to an embodiment of the present invention.
  • FIG. 7 is a flowchart illustrating a process of creating a caption prediction model in a method for creating a scene image according to an embodiment of the present invention.
  • FIG. 8 is a flowchart illustrating a process of predicting a content caption in the method for creating the scene image according to the embodiment of the present invention.
  • FIG. 9 is a flowchart illustrating a process of creating a scene image in the method for creating the scene image according to the embodiment of the present invention.
  • An image management server 10 is a server that provides a scene image by merging objects from several images, and includes an image caption unit 100 that predicts a content caption for each content image using a prediction model, a scene creation unit 200 that detects a related image based on a search text and crops and merges objects from each related image, and a database 300 that stores various images and data.
  • The image caption unit 100 includes a prediction model training unit 110, a caption prediction unit 120, and a tag creation unit 130.
  • Training data may include a training image and a target caption describing contents of the training image.
  • The prediction model training unit 110 extracts a feature vector from a training image using the training data.
  • The training image may be composed of image files in various formats such as JPEG, BMP, GIF, PNG, and TIFF.
  • The target caption may be a ground truth caption and may be composed of text files in various formats such as TXT.
  • The prediction model training unit 110 may use transfer learning to pre-process a raw image based on a pre-trained convolutional neural network (CNN) algorithm.
  • The prediction model training unit 110 may create a feature vector by receiving a training image and extracting essential features of the corresponding training image based on the CNN algorithm.
  • The feature vector refers to a value obtained by extracting features from image data.
  • The prediction model training unit 110 trains the caption prediction model using the feature vector of the training image as an input variable and the target caption as an output variable.
  • The prediction model training unit 110 decodes image features and learns a method for predicting a caption matching the target caption.
  • The prediction model training unit 110 may train the caption prediction model based on a long short-term memory (LSTM) algorithm.
  • The caption prediction unit 120 detects an object in the content image and extracts a feature vector from the detected object. For example, the caption prediction unit 120 may extract the feature vector from the content image based on the CNN algorithm.
  • The caption prediction unit 120 predicts the content caption describing contents of the content image by inputting the feature vector of the content image into the caption prediction model.
  • The caption prediction unit 120 may predict the content caption for the content image by decoding image features of the content image based on the LSTM algorithm.
  • The content image may be composed of image files in various formats such as JPEG, BMP, GIF, PNG, and TIFF.
  • The content caption may be composed of text files in various formats such as TXT.
  • One content caption and one or more objects may be defined for one content image.
  • The caption prediction unit 120 stores a content caption and an object corresponding to each content image in the database 300.
  • The tag creation unit 130 extracts a tag from the content caption using natural language processing. Specifically, the tag creation unit 130 performs sentence segmentation on the content caption, which is composed of a combination of corpora. Subsequently, the tag creation unit 130 divides each sentence into tokens.
  • The tokens are meaningful strings, and may be understood as a concept including morphemes or words.
  • The tag creation unit 130 performs part-of-speech (POS) tagging, which allocates part-of-speech information to each token.
  • The tag creation unit 130 performs named entity recognition on the tokens, attaching entity name tags such as a person's name, a place name, and an organization name.
  • The tag creation unit 130 stores the entity name tags in the database 300 together with the content caption and the object corresponding to each content image.
  • The entity name tags can be used in the process of searching for content captions.
  • The scene creation unit 200 includes a search text analysis unit 210, an image search unit 220, a ratio calculation unit 230, and an object merging unit 240.
  • The search text analysis unit 210 extracts a plurality of keywords and relationship information between the keywords from the search text through natural language processing. Specifically, the search text analysis unit 210 extracts a plurality of keywords and their relationship information by using sentence separation, tokenization, POS tagging, entity name recognition, etc. For example, when the user inputs "A dog beside a cycle in a park" as the search text, the search text analysis unit 210 extracts "dog", "cycle", and "park" as keywords through natural language processing, and extracts "beside" and "in" as relationship information.
  • The image search unit 220 searches for a content caption matching each keyword among the content captions stored in the database 300 and detects a content image (this is referred to as a 'related image') corresponding to the searched content caption.
  • When a plurality of keywords are extracted from the search text, a plurality of content images are detected.
  • The ratio calculation unit 230 detects objects corresponding to the keyword for each related image, and calculates a size ratio between the detected objects.
  • The ratio calculation unit 230 may automatically calculate the size ratio between the detected objects with reference to a content image other than the related image.
  • A method for calculating the size ratio between detected objects will be described in detail with reference to FIGS. 4 to 6.
  • FIGS. 4 to 6 are diagrams exemplarily illustrating a method for calculating a size ratio according to an embodiment of the present invention.
  • A method for calculating a size ratio between an object (dog) and an object (cycle) that respectively correspond to a first keyword and a second keyword, when the first keyword is "dog" and the second keyword is "cycle", is exemplarily illustrated.
  • The size ratio may be a horizontal ratio, a vertical ratio, an aspect ratio, etc. between the objects.
  • FIG. 4 illustrates a case in which a content image (this is referred to as a 'reference image') including both the object (dog) corresponding to the first keyword and the object (cycle) corresponding to the second keyword in one image is detected from the database 300.
  • A size ratio between the cropped objects may be calculated, for use in the later object merging process, using the size ratio between the object (dog) and the object (cycle) included in the reference image.
  • FIG. 5 illustrates a case in which no reference image includes both the object (dog) corresponding to the first keyword and the object (cycle) corresponding to the second keyword in one image; instead, a first reference image includes the object (dog) and an object (tree), and a second reference image includes the object (cycle) and an object (tree).
  • The object (tree) exists in common in the first reference image and the second reference image, and is referred to as a common object.
  • A size ratio between the cropped objects may be calculated in the later object merging process using the size ratio between these common objects (tree).
  • FIG. 6 likewise illustrates a case in which no reference image includes both the object (dog) and the object (cycle) in one image; here the first reference image includes the object (dog) and an object (tree) and the second reference image includes the object (cycle) and an object (house), so that no object exists in common in the first reference image and the second reference image.
  • The ratio calculation unit 230 then additionally detects a standard image that is not related to either keyword among the content images stored in the database 300.
  • For example, when the object (tree) and the object (house) are included in the standard image, a common object (tree) exists in the first reference image and the standard image, and a common object (house) exists in the second reference image and the standard image.
  • The size ratio between the cropped objects may be calculated in the later object merging process using the size ratios between these common objects (tree, house).
  • The object merging unit 240 crops the object for each related image.
  • The object merging unit 240 may crop the object corresponding to the keyword by using algorithms such as YOLO, Saliency Map, Integral Image, Local Adaptive Thresholding, GrabCut, etc.
  • The object merging unit 240 creates one scene image by merging a plurality of cropped objects based on the previously calculated size ratio. Specifically, the object merging unit 240 automatically predicts a layout indicating an arrangement relationship of objects in the scene image based on the GCN algorithm, using the objects corresponding to the keywords as nodes and the relationship information as edges. In addition, the object merging unit 240 adjusts the sizes of the plurality of cropped objects according to the size ratio and then arranges the cropped objects on the layout to complete the scene image.
  • The database 300 stores various images and data used in the method for creating the scene image of the present invention, such as training data, content images together with the objects and content captions related to the content images, and scene images.
  • FIG. 7 is a flowchart illustrating a process of creating a caption prediction model in the method for creating the scene image according to the embodiment of the present invention.
  • FIG. 8 is a flowchart illustrating a process of predicting a content caption in the method for creating the scene image according to the embodiment of the present invention.
  • FIG. 9 is a flowchart illustrating a process of creating a scene image in the method for creating the scene image according to the embodiment of the present invention.
  • The prediction model training unit 110 extracts a feature vector from the training image by using the training data (S12). For example, the prediction model training unit 110 may extract a feature vector from the training image based on the CNN algorithm.
  • The prediction model training unit 110 trains the caption prediction model using the feature vector of the training image as an input variable and the target caption as an output variable (S14).
  • The prediction model training unit 110 may train the caption prediction model based on the LSTM algorithm.
  • The caption prediction unit 120 detects an object in the content image (S22).
  • The caption prediction unit 120 extracts the feature vector from the detected object (S24).
  • The caption prediction unit 120 inputs the feature vector of the content image into the caption prediction model to predict a content caption describing the contents of the content image (S26).
  • The caption prediction unit 120 stores the content caption and the object corresponding to each content image in the database 300 (S28).
  • The search text analysis unit 210 extracts a plurality of keywords and relationship information between the keywords from the search text through natural language processing (S31).
  • The image search unit 220 searches for the content caption matching each keyword among the content captions stored in the database 300 and detects the content image (referred to as a 'related image') corresponding to the searched content caption (S32).
  • The ratio calculation unit 230 detects objects corresponding to the keyword for each related image, and calculates a size ratio between the detected objects with reference to the reference image (S33).
  • The ratio calculation unit 230 may calculate the size ratio in the following ways according to the presence or absence of the detected objects in the reference image.
  • Suppose the plurality of keywords includes first and second keywords, a content image having an object corresponding to the first keyword among the content images is defined as a first related image, and a content image having an object corresponding to the second keyword among the content images is defined as a second related image.
  • When a reference image including both objects in one image is detected, the ratio calculation unit 230 may calculate a size ratio between the plurality of cropped objects by using a size ratio between the objects included in the reference image.
  • When first and second reference images respectively including the objects corresponding to the first and second keywords are detected, the ratio calculation unit 230 may calculate the size ratio between the plurality of cropped objects by using the size ratio between common objects that exist in common in the first and second reference images.
  • When the first and second reference images share no common object, the ratio calculation unit 230 detects a standard image that is not related to the first keyword or second keyword. Subsequently, the ratio calculation unit 230 may calculate a size ratio between the plurality of cropped objects by using a size ratio between the common objects that exist in common in the first reference image and the standard image and a size ratio between the common objects that exist in common in the second reference image and the standard image.
  • The object merging unit 240 crops the object for each related image (S34).
  • The object merging unit 240 creates one scene image by merging the plurality of cropped objects based on the previously calculated size ratio (S35). Specifically, the object merging unit 240 predicts a layout indicating the arrangement relationship of the detected objects in the scene image based on the GCN algorithm, using the detected objects corresponding to the keywords as nodes and the relationship information as edges. Subsequently, the object merging unit 240 adjusts the sizes of the plurality of cropped objects according to the previously calculated size ratio and then arranges the plurality of cropped objects on the layout to complete the scene image.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Business, Economics & Management (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Strategic Management (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Multimedia (AREA)
  • Economics (AREA)
  • Library & Information Science (AREA)
  • Marketing (AREA)
  • Human Resources & Organizations (AREA)
  • Entrepreneurship & Innovation (AREA)
  • General Business, Economics & Management (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Tourism & Hospitality (AREA)
  • Finance (AREA)
  • Development Economics (AREA)
  • Accounting & Taxation (AREA)
  • Processing Or Creating Images (AREA)
  • Quality & Reliability (AREA)
  • Operations Research (AREA)
  • Primary Health Care (AREA)
  • Game Theory and Decision Science (AREA)
  • Medical Informatics (AREA)

Abstract

Provided are an image management server providing a scene image by extracting and merging objects from multiple images and a method for creating the scene image using the same. The image management server predicts, using a caption prediction model, a content caption describing the contents of each content image, extracts a plurality of keywords and relationship information from a search text, detects related images using the keywords and the content captions, crops objects from the related images, then adjusts a layout and the sizes of the objects, and merges the objects to create one scene image.

Description

IMAGE MANAGEMENT SERVER PROVIDING A SCENE IMAGE BY MERGING OBJECTS FROM MULTIPLE IMAGES AND METHOD FOR CREATING THE SCENE IMAGE USING THE SAME
The present invention relates to an image management server and an image creation method using the same, and more particularly, to an image management server providing a scene image by merging objects from multiple images and a method for creating the scene image using the same.
Stock photography refers to photos stocked in large quantities, and when photos without copyright issues are uploaded to a photo platform or website where the stock photography is collected, companies or individuals pay money as needed to purchase the photos. The stock photography is used as material photos for newspapers and magazines, and is also used as related images in advertisements, publicity materials, online postings, etc.
When a user accesses the photo platform to purchase stock photography and inputs a search term, related photos are displayed on the screen as a result. However, although many photos are stored on the photo platform, it is not easy for the user to find the photos he/she likes. For example, if the user searches for stock photography of a person jogging in a park, among the resulting photos the user often likes only the background in some photos and only the jogger in others.
A problem to be solved by the present invention is to provide a method for creating a scene image capable of providing the scene image by extracting and merging objects from multiple images.
Another problem to be solved by the present invention is to provide an image management server that performs such a method.
The problems to be solved by the present invention are not limited to the problems mentioned above, and other problems not mentioned will be clearly understood by those of ordinary skill in the art to which the present invention pertains from the following description.
A method for creating a scene image according to an embodiment of the present invention for achieving the problem described above is an image creation method for providing a scene image by merging objects from multiple images, the method including, by an image management server, a step of extracting a feature vector from a training image, based on training data composed of the training image and a target caption describing contents of the training image, and training a caption prediction model by using the feature vector of the training image as an input variable and the target caption as an output variable, a step of detecting an object from a content image and predicting a content caption describing contents of the content image by inputting a feature vector extracted from the detected object into the caption prediction model, a step of extracting a plurality of keywords and their relationship information from a search text input from a user terminal through natural language processing, a step of searching for a content caption matching each keyword, and detecting a content image (referred to as a 'related image') corresponding to the searched content caption, a step of detecting objects corresponding to a keyword for each related image, and calculating a size ratio between the detected objects with reference to a reference image other than the related image, and a step of cropping the detected objects for each related image, and creating a scene image by merging a plurality of cropped objects based on the size ratio.
The plurality of keywords may include first and second keywords, a first related image may be defined as having an object corresponding to the first keyword, a second related image may be defined as having an object corresponding to the second keyword, a reference image including both objects corresponding to the first and second keywords in one image may be detected, and a size ratio between the plurality of cropped objects may be calculated by using a size ratio between the objects included in the reference image.
The plurality of keywords may include first and second keywords, a first related image may be defined as having an object corresponding to the first keyword, a second related image may be defined as having an object corresponding to the second keyword, a first reference image including an object corresponding to the first keyword may be detected, a second reference image including an object corresponding to the second keyword may be detected, and a size ratio between the plurality of cropped objects may be calculated by using a size ratio between common objects that exist in common in the first and second reference images.
The plurality of keywords may include first and second keywords, a first related image may be defined as having an object corresponding to the first keyword, a second related image may be defined as having an object corresponding to the second keyword, a first reference image including an object corresponding to the first keyword may be detected, a second reference image including an object corresponding to the second keyword may be detected, a standard image that is not related to the first or second keyword may be detected, and a size ratio between the plurality of cropped objects may be calculated by using a size ratio between common objects that exist in common in the first reference image and the standard image and a size ratio between common objects that exist in common in the second reference image and the standard image.
A feature vector may be extracted from the training image based on a convolutional neural network (CNN) algorithm, and the caption prediction model may be trained based on a long short-term memory (LSTM) algorithm.
The step of creating the scene image may include a step of predicting a layout indicating an arrangement relationship of the detected objects in the scene image based on a graph convolution network (GCN) algorithm using the detected object as a node and the relationship information as an edge, and a step of adjusting the sizes of the plurality of cropped objects according to the size ratio and then arranging the plurality of cropped objects on the layout.
An image management server according to an embodiment of the present invention for achieving the other problem described above is an image management server for providing a scene image by merging objects from multiple images, the image management server including an image caption unit and a scene creation unit.
Here, the image caption unit may extract a feature vector from a training image, based on training data composed of the training image and a target caption describing contents of the training image, and train a caption prediction model by using the feature vector of the training image as an input variable and the target caption as an output variable.
The image caption unit may detect an object from a content image and predict a content caption describing contents of the content image by inputting a feature vector extracted from the detected object into the caption prediction model.
The scene creation unit may extract a plurality of keywords and their relationship information from a search text input from a user terminal through natural language processing. The scene creation unit may search for a content caption matching each keyword, and detect a content image (referred to as a 'related image') corresponding to the searched content caption. The scene creation unit may detect objects corresponding to a keyword for each related image, and calculate a size ratio between the detected objects by referring to a reference image other than the related image. The scene creation unit may crop the detected objects for each related image, and create a scene image by merging a plurality of cropped objects based on the size ratio.
The specific details of other embodiments are included in the specific content and drawings.
As described above, according to the image management server and the method for creating the scene image using the same according to the present invention, in a state in which numerous content images are stored in a database, a content caption describing contents of each content image can be automatically predicted and stored through a caption prediction model. When a search text is input from a user terminal, a keyword and relationship information are extracted from the search text.
Through a matching search between the extracted keyword and the content caption, a content image matching each keyword (this is referred to as a 'related image') can be detected. One scene image can be created by cropping and merging objects corresponding to a keyword for each related image.
In addition, a layout indicating an arrangement relationship of objects in one scene image can be automatically predicted using the extracted relationship information, and objects can be arranged in one scene image according to the predicted layout.
When merging a plurality of cropped objects in one scene image, it is important to adjust a size ratio between the objects. In the case of the present invention, the size ratio between objects can be automatically calculated with reference to an existing content image as follows.
First, if a content image including all of a plurality of objects exists in one image, the size ratio between the plurality of cropped objects to be merged can be automatically calculated by using the size ratio between the objects included in the content image.
Second, if the content image including all of the plurality of objects is not present in one image, a content image can be individually detected for each of the objects, and the size ratio between the plurality of cropped objects to be merged can be automatically calculated by using the size ratio between common objects that exist in common in the detected content images.
Third, if the content image including all of the plurality of objects is not present in one image and the common object that exists in common is not present among the content images individually detected for each object, a standard content image can be additionally detected and the size ratio between the plurality of cropped objects to be merged can be automatically calculated by using the size ratio between the common objects that exist in common between the detected content images and the standard content image.
As such, by automatically adjusting the size ratio between the objects when merging the plurality of cropped objects in one scene image, the objects can be represented naturally and harmoniously with each other.
FIG. 1 is a configuration diagram conceptually illustrating an image management server according to an embodiment of the present invention.
FIG. 2 is a configuration diagram conceptually illustrating an image caption unit of FIG. 1.
FIG. 3 is a configuration diagram conceptually illustrating a scene creation unit of FIG. 1.
FIGS. 4 to 6 are diagrams exemplarily illustrating a method for calculating a size ratio according to an embodiment of the present invention.
FIG. 7 is a flowchart illustrating a process of creating a caption prediction model in a method for creating a scene image according to an embodiment of the present invention.
FIG. 8 is a flowchart illustrating a process of predicting a content caption in the method for creating the scene image according to the embodiment of the present invention.
FIG. 9 is a flowchart illustrating a process of creating a scene image in the method for creating the scene image according to the embodiment of the present invention.
Advantages and features of the present invention, and methods for achieving them, will become apparent with reference to the embodiments described below in detail in conjunction with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms. The present embodiments are provided only so that the disclosure of the present invention is complete and fully informs those of ordinary skill in the art to which the present invention belongs of the scope of the invention; the present invention is defined only by the scope of the claims. The same reference numerals refer to the same components throughout the specification.
Hereinafter, an image management server and a method for creating a scene image using the same according to an embodiment of the present invention will be described in detail with reference to the accompanying drawings. FIG. 1 is a configuration diagram conceptually illustrating an image management server according to an embodiment of the present invention. FIG. 2 is a configuration diagram conceptually illustrating an image caption unit of FIG. 1. FIG. 3 is a configuration diagram conceptually illustrating a scene creation unit of FIG. 1.
An image management server 10 according to an embodiment of the present invention is a server that provides a scene image by merging objects from several images, and includes an image caption unit 100 that predicts a content caption for each content image using a prediction model, a scene creation unit 200 that detects a related image based on a search text and crops and merges objects from each related image, and a database 300 that stores various images and data.
The image caption unit 100 includes a prediction model training unit 110, a caption prediction unit 120, and a tag creation unit 130.
Training data may include a training image and a target caption describing contents of the training image. The prediction model training unit 110 extracts a feature vector from a training image using the training data. Here, the training image may be composed of image files in various formats such as JPEG, BMP, GIF, PNG, and TIFF, and the target caption may be a ground truth caption and may be composed of text files in various formats such as TXT. For example, the prediction model training unit 110 may use transfer learning to pre-process a raw image based on a pre-trained convolutional neural network (CNN) algorithm. For example, the prediction model training unit 110 may create a feature vector by receiving a training image and extracting essential features of the corresponding training image based on the CNN algorithm. Here, the feature vector refers to a value obtained by extracting features from image data.
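The patent specifies only "a pre-trained CNN", so the following is a minimal sketch assuming a torchvision ResNet-50 whose classification head is dropped, leaving the pooled activations as the feature vector; the network choice and the 2048-dimensional output are illustrative assumptions, not details from the source.

```python
# Sketch: CNN feature extraction via transfer learning (ResNet-50 assumed).
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

cnn = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
cnn.fc = torch.nn.Identity()          # drop the classifier; keep pooled 2048-d features
cnn.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def extract_feature_vector(path: str) -> torch.Tensor:
    """Return a feature vector for one training or content image."""
    image = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return cnn(image).squeeze(0)   # shape: (2048,)
```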
The prediction model training unit 110 trains the caption prediction model using the feature vector of the training image as an input variable and the target caption as an output variable. The prediction model training unit 110 decodes image features and learns a method for predicting a caption matching the target caption. For example, the prediction model training unit 110 may train the caption prediction model based on a long short-term memory (LSTM) algorithm.
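One way to realize this training step is an LSTM decoder conditioned on the image feature vector. The sketch below is a hedged reading of that setup; all dimensions, the padding index, and the optimizer are chosen for illustration rather than taken from the patent.

```python
# Sketch: caption prediction model trained with the image feature vector as
# input and the target caption (token ids) as output.
import torch
import torch.nn as nn

class CaptionModel(nn.Module):
    def __init__(self, feat_dim=2048, embed_dim=256, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.img_proj = nn.Linear(feat_dim, embed_dim)  # map CNN features into embedding space
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, feats, caption_ids):
        tokens = self.embed(caption_ids)                              # (B, T-1, E)
        inputs = torch.cat([self.img_proj(feats).unsqueeze(1), tokens], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)                                       # (B, T, vocab)

model = CaptionModel()
criterion = nn.CrossEntropyLoss(ignore_index=0)       # 0 = assumed padding id
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(feats, caption_ids):
    """One update: the image feature is prepended, the target caption supervises."""
    logits = model(feats, caption_ids[:, :-1])
    loss = criterion(logits.reshape(-1, logits.size(-1)), caption_ids.reshape(-1))
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```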
When a content image is input from the database 300, the caption prediction unit 120 detects an object in the content image and extracts a feature vector from the detected object. For example, the caption prediction unit 120 may extract the feature vector from the content image based on the CNN algorithm.
The caption prediction unit 120 predicts the content caption describing contents of the content image by inputting the feature vector of the content image into the caption prediction model. For example, the caption prediction unit 120 may predict the content caption for the content image by decoding image features of the content image based on the LSTM algorithm. Here, the content image may be composed of image files in various formats such as JPEG, BMP, GIF, PNG, and TIFF, and the content caption may be composed of text files in various formats such as TXT. One content caption and one or more objects may be defined for one content image. The caption prediction unit 120 stores a content caption and an object corresponding to each content image in the database 300.
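At prediction time the same model can be run autoregressively. A greedy decoding sketch, reusing the CaptionModel above with an assumed end-of-caption token id, might look like this:

```python
# Sketch: greedy content-caption prediction with the trained LSTM decoder.
def predict_caption(feats, max_len=20, end_id=2):
    """Decode token ids for one image; feats has shape (1, feat_dim)."""
    model.eval()
    ids, state = [], None
    inp = model.img_proj(feats).unsqueeze(1)                   # step 0: the image feature
    with torch.no_grad():
        for _ in range(max_len):
            hidden, state = model.lstm(inp, state)
            next_id = model.out(hidden[:, -1]).argmax(dim=-1)  # most likely next token
            if next_id.item() == end_id:                       # assumed <end> id
                break
            ids.append(next_id.item())
            inp = model.embed(next_id).unsqueeze(1)            # feed the prediction back
    return ids   # a real system maps these ids back to vocabulary words
```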
The tag creation unit 130 extracts a tag from the content caption using natural language processing. Specifically, the tag creation unit 130 performs sentence segmentation on the content caption, which is composed of a combination of corpora. Subsequently, the tag creation unit 130 divides each sentence into tokens. Here, the tokens are meaningful strings, and may be understood as a concept including morphemes or words. The tag creation unit 130 performs part-of-speech (POS) tagging, which allocates part-of-speech information to each token. The tag creation unit 130 performs named entity recognition on the tokens, attaching entity name tags such as a person's name, a place name, and an organization name. The tag creation unit 130 stores the entity name tags in the database 300 together with the content caption and the object corresponding to each content image. The entity name tags can be used in the process of searching for content captions.
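The patent does not name an NLP toolkit; as one concrete possibility, spaCy covers the whole described pipeline (sentence segmentation, tokenization, POS tagging, named entity recognition) in a few lines:

```python
# Sketch: tag creation from a content caption with spaCy (toolkit assumed).
import spacy

nlp = spacy.load("en_core_web_sm")   # assumes the small English model is installed

def extract_tags(caption: str):
    doc = nlp(caption)
    tokens = [(t.text, t.pos_) for sent in doc.sents for t in sent]  # segmentation + POS
    entities = [(ent.text, ent.label_) for ent in doc.ents]          # entity name tags
    return tokens, entities

print(extract_tags("A dog runs beside a bicycle in Central Park."))
# the entity list contains ('Central Park', ...) with a location-type label
```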
The scene creation unit 200 includes a search text analysis unit 210, an image search unit 220, a ratio calculation unit 230, and an object merging unit 240.
When the user inputs a search text for an image or photo desired to be found into the user terminal, the search text is transmitted to the image management server 10. The search text analysis unit 210 extracts a plurality of keywords and relationship information between the keywords from the search text through natural language processing. Specifically, the search text analysis unit 210 extracts a plurality of keywords and their relationship information by using sentence separation, tokenization, POS tagging, entity name recognition, etc. For example, when the user inputs "A dog beside a cycle in a park" as the search text, the search text analysis unit 210 extracts "dog", "cycle", and "park" as keywords through natural language processing, and extracts "beside" and "in" as relationship information.
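A minimal sketch of this analysis, again assuming spaCy: treat nouns as keywords and prepositions as relationship information. This mirrors only the "A dog beside a cycle in a park" example; a production grammar would need to be richer.

```python
# Sketch: keyword and relationship extraction from the search text.
import spacy

nlp = spacy.load("en_core_web_sm")

def analyze_search_text(text: str):
    doc = nlp(text)
    keywords = [t.lemma_ for t in doc if t.pos_ in ("NOUN", "PROPN")]
    relations = [t.text for t in doc if t.pos_ == "ADP"]   # prepositions
    return keywords, relations

print(analyze_search_text("A dog beside a cycle in a park"))
# -> (['dog', 'cycle', 'park'], ['beside', 'in'])
```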
The image search unit 220 searches for a content caption matching each keyword among the content captions stored in the database 300 and detects a content image (this is referred to as a 'related image') corresponding to the searched content caption. When a plurality of keywords are extracted from the search text, a plurality of content images are detected.
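The matching itself can be as simple as a keyword-in-caption test; the in-memory dictionary below stands in for database 300 and is purely illustrative.

```python
# Sketch: detecting related images by matching keywords against stored captions.
captions_db = {                       # image_id -> predicted content caption (illustrative)
    "img_001": "a dog runs on the grass",
    "img_002": "a cycle leaning against a tree",
    "img_003": "people walking in a park",
}

def find_related_images(keyword: str) -> list[str]:
    return [img_id for img_id, caption in captions_db.items()
            if keyword in caption.split()]

print(find_related_images("dog"))     # -> ['img_001']
```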
The ratio calculation unit 230 detects objects corresponding to the keyword for each related image, and calculates a size ratio between the detected objects. The ratio calculation unit 230 may automatically calculate the size ratio between the detected objects with reference to a content image other than the related image. Hereinafter, a method for calculating the size ratio between detected objects will be described in detail with reference to FIGS. 4 to 6. FIGS. 4 to 6 are diagrams exemplarily illustrating a method for calculating a size ratio according to an embodiment of the present invention. In this embodiment, a method for calculating a size ratio between an object (dog) and an object (cycle) that respectively correspond to a first keyword and a second keyword, when the first keyword is "dog" and the second keyword is "cycle", is exemplarily illustrated. Here, the size ratio may be a horizontal ratio, a vertical ratio, an aspect ratio, etc. between the objects.
FIG. 4 illustrates a case in which a content image (this is referred to as a 'reference image') including both the object (dog) corresponding to the first keyword and the object (cycle) corresponding to the second keyword in one image is detected from the database 300. The size ratio between the cropped objects may then be calculated, for use in the later object merging process, from the size ratio between the object (dog) and the object (cycle) included in the reference image.
FIG. 5 illustrates a case in which no reference image includes both the object (dog) corresponding to the first keyword and the object (cycle) corresponding to the second keyword in one image; instead, a first reference image includes the object (dog) and an object (tree), and a second reference image includes the object (cycle) and an object (tree). The object (tree) that exists in common in the first reference image and the second reference image is referred to as a common object. The size ratio between the cropped objects may be calculated in the later object merging process using the size ratio between these common objects (tree).
FIG. 6 likewise illustrates a case in which no reference image includes both the object (dog) and the object (cycle) in one image; here the first reference image includes the object (dog) and an object (tree) and the second reference image includes the object (cycle) and an object (house), so that no object exists in common in the first reference image and the second reference image. The ratio calculation unit 230 then additionally detects a standard image that is not related to either keyword among the content images stored in the database 300. For example, when the object (tree) and the object (house) are included in the standard image, a common object (tree) exists in the first reference image and the standard image, and a common object (house) exists in the second reference image and the standard image. The size ratio between the cropped objects may be calculated in the later object merging process using the size ratios between these common objects (tree, house).
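The three cases reduce to chaining pixel-size ratios through shared objects, under the working assumption that a common object depicts roughly the same real-world size in every image. A sketch with object heights in pixels (all numbers illustrative):

```python
# Sketch: the three size-ratio cases of FIGS. 4-6, using object heights (px).

def ratio_same_image(dog_h, cycle_h):
    """FIG. 4: both objects appear in one reference image."""
    return dog_h / cycle_h

def ratio_via_common_object(dog_h1, tree_h1, cycle_h2, tree_h2):
    """FIG. 5: a tree in both reference images links the two scales."""
    return (dog_h1 / tree_h1) / (cycle_h2 / tree_h2)

def ratio_via_standard_image(dog_h1, tree_h1, tree_hs, house_hs, house_h2, cycle_h2):
    """FIG. 6: a standard image containing tree and house bridges two
    reference images that share no common object."""
    return (dog_h1 / tree_h1) * (tree_hs / house_hs) * (house_h2 / cycle_h2)

# Worked example for FIG. 5: a 60 px dog beside a 300 px tree, and a 110 px
# cycle beside a 275 px tree, give (60/300) / (110/275) = 0.2 / 0.4 = 0.5,
# i.e. the cropped dog should be rendered at half the cycle's height.
print(ratio_via_common_object(60, 300, 110, 275))   # 0.5
```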
If a plurality of related images are detected through the preceding content caption search and an object corresponding to a keyword is detected for each related image, the object merging unit 240 crops the object from each related image. For example, the object merging unit 240 may crop the object corresponding to the keyword using algorithms such as YOLO, Saliency Map, Integral Image, Local Adaptive Thresholding, and GrabCut.
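As a non-authoritative example, one of the named options (GrabCut, via OpenCV) could crop an object roughly as follows; the bounding rectangle would in practice come from a detector such as YOLO, so the values shown are placeholders:

    import cv2
    import numpy as np

    def crop_object(image_path, rect):
        # rect = (x, y, w, h): in practice supplied by a detector such as YOLO
        img = cv2.imread(image_path)
        mask = np.zeros(img.shape[:2], np.uint8)
        bgd, fgd = np.zeros((1, 65), np.float64), np.zeros((1, 65), np.float64)
        # separate foreground from background inside the rectangle
        cv2.grabCut(img, mask, rect, bgd, fgd, 5, cv2.GC_INIT_WITH_RECT)
        fg = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 1, 0)
        x, y, w, h = rect
        return (img * fg[:, :, None].astype("uint8"))[y:y + h, x:x + w]

    # cropped = crop_object("related_image.jpg", (50, 30, 200, 260))  # placeholder rect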
The object merging unit 240 creates one scene image by merging the plurality of cropped objects based on the previously calculated size ratio. Specifically, the object merging unit 240 automatically predicts a layout indicating the arrangement relationship of the objects in the scene image based on a graph convolution network (GCN) algorithm, using the objects corresponding to the keywords as nodes and the relationship information as edges. In addition, the object merging unit 240 adjusts the sizes of the plurality of cropped objects according to the size ratio and then arranges the cropped objects on the layout to complete the scene image.
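The layout step can be pictured with the following toy sketch, in which the keywords are nodes, the relationship information is an edge, and a single unnormalized graph-convolution step maps node features to bounding boxes. The weights are random placeholders, so the output is shape-correct only, not a trained prediction:

    import numpy as np

    nodes = ["dog", "cycle"]                 # objects from the keywords
    edges = [(0, 1)]                         # relationship information as an edge
    feats = np.random.rand(len(nodes), 16)   # stand-in node embeddings

    adj = np.eye(len(nodes))                 # adjacency with self-loops
    for i, j in edges:
        adj[i, j] = adj[j, i] = 1.0

    w = np.random.rand(16, 4)                # 4 outputs per node: x, y, width, height
    layout = adj @ feats @ w                 # one (unnormalized) graph-convolution step
    print(layout.shape)                      # (2, 4): one bounding box per object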
The database 300 stores various images and data used in the method for creating the scene image of the present invention, such as the training data, the content images, the objects and content captions related to the content images, and the scene images.
Hereinafter, the method for creating the scene image according to an embodiment of the present invention will be described in detail with reference to FIGS. 7 to 9. FIG. 7 is a flowchart illustrating a process of creating a caption prediction model in the method for creating the scene image according to the embodiment of the present invention. FIG. 8 is a flowchart illustrating a process of predicting a content caption in the method for creating the scene image according to the embodiment of the present invention. FIG. 9 is a flowchart illustrating a process of creating a scene image in the method for creating the scene image according to the embodiment of the present invention.
Referring to FIG. 7, when training data including a training image and a target caption describing contents of the training image is input from the database 300 (S10), the prediction model training unit 110 extracts a feature vector from the training image by using the training data (S12). For example, the prediction model training unit 110 may extract a feature vector from the training image based on the CNN algorithm.
The prediction model training unit 110 trains the caption prediction model using the feature vector of the training image as an input variable and the target caption as an output variable (S14). The prediction model training unit 110 may train the caption prediction model based on the LSTM algorithm.
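By way of a hedged sketch, the S10-S14 setup can be expressed in Keras as a CNN-feature encoder combined with an LSTM decoder that predicts the next caption word; the vocabulary size, sequence length, and layer widths below are assumed values, not taken from the patent:

    from tensorflow.keras import layers, Model

    vocab_size, max_len, feat_dim = 5000, 20, 2048   # assumed sizes

    img_feats = layers.Input(shape=(feat_dim,))      # CNN feature vector (S12)
    cap_in = layers.Input(shape=(max_len,))          # caption tokens so far

    x1 = layers.Dense(256, activation="relu")(img_feats)
    x2 = layers.Embedding(vocab_size, 256, mask_zero=True)(cap_in)
    x2 = layers.LSTM(256)(x2)                        # LSTM over the caption (S14)

    merged = layers.add([x1, x2])
    out = layers.Dense(vocab_size, activation="softmax")(merged)  # next word

    model = Model([img_feats, cap_in], out)
    model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")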
Subsequently, referring to FIG. 8, when a content image is input from the database 300 (S20), the caption prediction unit 120 detects an object in the content image (S22). The caption prediction unit 120 extracts the feature vector from the detected object (S24). The caption prediction unit 120 inputs the feature vector of the content image into the caption prediction model to predict a content caption describing the contents of the content image (S26). The caption prediction unit 120 stores the content caption and the object corresponding to each content image in the database 300 (S28).
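Continuing the sketch above, S22 to S26 amount to greedy decoding with the trained model; the start/end token ids and the helper name are assumptions:

    import numpy as np

    def predict_caption(model, feats, start_id, end_id, max_len=20):
        # feats: feature vector of the detected object (S24), shape (2048,)
        seq = [start_id]
        while len(seq) < max_len:
            padded = np.pad(seq, (0, max_len - len(seq)))[None, :]
            probs = model.predict([feats[None, :], padded], verbose=0)
            next_id = int(np.argmax(probs))
            if next_id == end_id:
                break
            seq.append(next_id)
        return seq   # token ids of the predicted content caption (S26)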
Subsequently, referring to FIG. 9, when a search text is input from the user terminal (S30), the search text analysis unit 210 extracts a plurality of keywords and relationship information between the keywords from the search text through natural language processing (S31).
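As one possible reading of S31, the nouns in the search text can serve as keywords and the verb linking them as the relationship information. A spaCy sketch might look like this; the embodiment names no specific NLP toolkit, so the toolkit and the dependency rules are assumptions:

    import spacy

    nlp = spacy.load("en_core_web_sm")    # assumes the small English model
    doc = nlp("a dog riding a cycle")     # example search text

    keywords = [tok.text for tok in doc if tok.pos_ == "NOUN"]
    relations = [(tok.head.text, tok.text) for tok in doc
                 if tok.pos_ == "NOUN" and tok.dep_ in ("dobj", "pobj")]

    print(keywords)    # ['dog', 'cycle']
    print(relations)   # typically [('riding', 'cycle')]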
The image search unit 220 searches for the content caption matching each keyword among the content captions stored in the database 300 and detects the content image (referred to as a 'related image') corresponding to the searched content caption (S32).
The ratio calculation unit 230 detects the objects corresponding to the keywords for each related image, and calculates a size ratio between the detected objects with reference to a reference image (S33). The ratio calculation unit 230 may calculate the size ratio in one of the following ways, depending on where the detected objects appear in the reference images. For convenience of explanation, it is assumed that the plurality of keywords include first and second keywords, that a content image having an object corresponding to the first keyword among the content images is defined as a first related image, and that a content image having an object corresponding to the second keyword among the content images is defined as a second related image.
If a reference image including both the objects corresponding to the first and second keywords in one image is detected, the ratio calculation unit 230 may calculate a size ratio between a plurality of cropped objects by using a size ratio between the objects included in the reference image.
If no content image includes both the objects corresponding to the first and second keywords in one image, but a first reference image including the object corresponding to the first keyword is detected and a second reference image including the object corresponding to the second keyword is detected, the ratio calculation unit 230 may calculate the size ratio between the plurality of cropped objects by using the size ratio between common objects that exist in common in the first and second reference images.
If no content image includes both the objects corresponding to the first and second keywords in one image, a first reference image including the object corresponding to the first keyword is detected, a second reference image including the object corresponding to the second keyword is detected, and no common object is present in the first reference image and the second reference image, the ratio calculation unit 230 detects a standard image that is not related to the first keyword or the second keyword. Subsequently, the ratio calculation unit 230 may calculate the size ratio between the plurality of cropped objects by using the size ratio between the common objects that exist in common in the first reference image and the standard image and the size ratio between the common objects that exist in common in the second reference image and the standard image.
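Putting the three branches of S33 together, the selection logic can be summarized in the following self-contained sketch; the Img class, its methods, and all heights are illustrative assumptions:

    class Img:
        def __init__(self, heights):          # {object name: pixel height}
            self.h = heights
        def has(self, *names):
            return all(n in self.h for n in names)
        def ratio(self, a, b):                # size ratio of a to b in this image
            return self.h[a] / self.h[b]
        def common(self, other):
            return sorted(set(self.h) & set(other.h))

    def size_ratio(a, b, images):
        for img in images:                    # FIG. 4: one reference image
            if img.has(a, b):
                return img.ratio(a, b)
        ref1 = next(i for i in images if i.has(a))
        ref2 = next(i for i in images if i.has(b))
        shared = ref1.common(ref2)
        if shared:                            # FIG. 5: shared common object
            c = shared[0]
            return ref1.ratio(a, c) * ref2.ratio(c, b)
        # FIG. 6: bridge through a standard image unrelated to either keyword
        std = next(i for i in images
                   if not i.has(a) and not i.has(b)
                   and i.common(ref1) and i.common(ref2))
        c1, c2 = ref1.common(std)[0], ref2.common(std)[0]
        return ref1.ratio(a, c1) * std.ratio(c1, c2) * ref2.ratio(c2, b)

    imgs = [Img({"dog": 60, "tree": 240}), Img({"cycle": 100, "house": 400}),
            Img({"tree": 300, "house": 600})]          # last one: standard image
    print(size_ratio("dog", "cycle", imgs))            # 0.5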
The object merging unit 240 crops the object for each related image (S34). The object merging unit 240 creates one scene image by merging the plurality of cropped objects based on the previously calculated size ratio (S35). Specifically, the object merging unit 240 predicts a layout indicating the arrangement relationship of the detected objects in the scene image based on the GCN algorithm using the detected object corresponding to the keyword as a node and the relationship information as an edge. Subsequently, the object merging unit 240 adjusts the sizes of the plurality of cropped objects according to the previously calculated size ratio and then arranges the plurality of cropped objects on the layout to complete the scene image.
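Finally, S34 and S35 can be pictured with a short Pillow sketch in which the cropped objects are rescaled by the computed ratio and pasted at the layout positions; all file names, sizes, positions, and the 0.5 ratio are assumed example values:

    from PIL import Image

    def scale_to_height(img, h):
        return img.resize((int(img.width * h / img.height), h))

    scene = Image.new("RGB", (800, 600), "white")    # blank scene canvas
    dog = Image.open("dog_crop.png")                 # assumed cropped-object files
    cycle = Image.open("cycle_crop.png")

    ratio = 0.5                                      # dog height / cycle height (S33)
    cycle_h = 300                                    # assumed layout height for the cycle
    scene.paste(scale_to_height(cycle, cycle_h), (350, 250))            # layout position
    scene.paste(scale_to_height(dog, int(cycle_h * ratio)), (380, 280))
    scene.save("scene.png")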
Although the embodiments of the present invention have been described with reference to the accompanying drawings, those of ordinary skill in the art to which the present invention pertains will understand that the present invention may be embodied in other specific forms without changing the technical spirit or essential features thereof. Therefore, it should be understood that the embodiments described above are illustrative in all respects and not restrictive.

Claims (7)

  1. An image creation method for providing a scene image by merging objects from multiple images, the method comprising:
    by an image management server,
    a step of extracting a feature vector from a training image, based on training data composed of the training image and a target caption describing contents of the training image, and training a caption prediction model by using the feature vector of the training image as an input variable and the target caption as an output variable;
    a step of detecting an object from a content image and predicting a content caption describing contents of the content image by inputting a feature vector extracted from the detected object into the caption prediction model;
    a step of extracting a plurality of keywords and their relationship information from a search text input from a user terminal through natural language processing;
    a step of searching for a content caption matching each keyword, and detecting a content image (referred to as a 'related image') corresponding to the searched content caption;
    a step of detecting objects corresponding to a keyword for each related image, and calculating a size ratio between the detected objects with reference to a reference image other than the related image; and
    a step of cropping the detected objects for each related image, and creating a scene image by merging a plurality of cropped objects based on the size ratio.
  2. The method according to claim 1, wherein
    the plurality of keywords include first and second keywords, a first related image is defined as having an object corresponding to the first keyword, a second related image is defined as having an object corresponding to the second keyword,
    a reference image including both objects corresponding to the first and second keywords in one image is detected, and
    a size ratio between the plurality of cropped objects is calculated by using a size ratio between the objects included in the reference image.
  3. The method according to claim 1, wherein
    the plurality of keywords include first and second keywords, a first related image is defined as having an object corresponding to the first keyword, a second related image is defined as having an object corresponding to the second keyword,
    a first reference image including an object corresponding to the first keyword is detected, a second reference image including an object corresponding to the second keyword is detected, and
    a size ratio between the plurality of cropped objects is calculated by using a size ratio between common objects that exist in common in the first and second reference images.
  4. The method according to claim 1, wherein
    the plurality of keywords include first and second keywords, a first related image is defined as having an object corresponding to the first keyword, a second related image is defined as having an object corresponding to the second keyword,
    a first reference image including an object corresponding to the first keyword is detected, a second reference image including an object corresponding to the second keyword is detected, a standard image that is not related to the first or second keyword is detected, and
    a size ratio between the plurality of cropped objects is calculated by using a size ratio between common objects that exist in common in the first reference image and the standard image and a size ratio between common objects that exist in common in the second reference image and the standard image.
  5. The method according to claim 1, wherein
    a feature vector is extracted from the training image based on a convolutional neural network (CNN) algorithm, and
    the caption prediction model is trained based on a long short term memory (LSTM) algorithm.
  6. The method according to claim 1, wherein
    the step of creating the scene image includes
    a step of predicting a layout indicating an arrangement relationship of the detected objects in the scene image based on a graph convolution network (GCN) algorithm using the detected object as a node and the relationship information as an edge, and
    a step of adjusting the sizes of the plurality of cropped objects according to the size ratio and then arranging the plurality of cropped objects on the layout.
  7. An image management server for providing a scene image by merging objects from multiple images, the server comprising:
    an image caption unit; and
    a scene creation unit, wherein
    the image caption unit extracts a feature vector from a training image, based on training data composed of the training image and a target caption describing contents of the training image, and trains a caption prediction model by using the feature vector of the training image as an input variable and the target caption as an output variable,
    the image caption unit detects an object from a content image and predicts a content caption describing contents of the content image by inputting a feature vector extracted from the detected object into the caption prediction model,
    the scene creation unit extracts a plurality of keywords and their relationship information from a search text input from a user terminal through natural language processing,
    the scene creation unit searches for a content caption matching each keyword, and detects a content image (referred to as a 'related image') corresponding to the searched content caption,
    the scene creation unit detects objects corresponding to a keyword for each related image, and calculates a size ratio between the detected objects by referring to a reference image other than the related image, and
    the scene creation unit crops the detected objects for each related image, and creates a scene image by merging a plurality of cropped objects based on the size ratio.
PCT/KR2021/009814 2021-07-28 2021-07-28 Image management server providing a scene image by merging objects from multiple images and method for creating the scene image using the same WO2023008609A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR1020210098889A KR20230017433A (en) 2021-07-28 2021-07-28 Image management server providing a scene image by merging objects from multiple images and method for creating the scene image using the same
KR10-2021-0098889 2021-07-28

Publications (1)

Publication Number Publication Date
WO2023008609A1 (en)

Family

ID=85086890

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2021/009814 WO2023008609A1 (en) 2021-07-28 2021-07-28 Image management server providing a scene image by merging objects from multiple images and method for creating the scene image using the same

Country Status (2)

Country Link
KR (1) KR20230017433A (en)
WO (1) WO2023008609A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101744141B1 (en) * 2016-01-25 2017-06-07 조선대학교산학협력단 Method for reconstructing a photograph by object retargeting and the apparatus thereof
KR20200075114A (en) * 2018-12-12 2020-06-26 주식회사 인공지능연구원 System and Method for Matching Similarity between Image and Text
KR20200114708A (en) * 2019-03-29 2020-10-07 경북대학교 산학협력단 Electronic device, image searching system and controlling method thereof
KR20200122119A (en) * 2019-04-17 2020-10-27 주식회사 웨스트월드 Image retrieval system and method through scene analysis
US20210200803A1 (en) * 2018-12-07 2021-07-01 Seoul National University R&Db Foundation Query response device and method

Also Published As

Publication number Publication date
KR20230017433A (en) 2023-02-06

Similar Documents

Publication Publication Date Title
CN112131350B (en) Text label determining method, device, terminal and readable storage medium
Siersdorfer et al. Analyzing and predicting sentiment of images on the social web
WO2020122456A1 (en) System and method for matching similarities between images and texts
CN102053991B (en) Method and system for multi-language document retrieval
CN111079444A (en) Network rumor detection method based on multi-modal relationship
US20070196013A1 (en) Automatic classification of photographs and graphics
WO2010134752A2 (en) Semantic search method and system in which a plurality of classification systems are linked
WO2012108623A1 (en) Method, system and computer-readable recording medium for adding a new image and information on the new image to an image database
CN112100438A (en) Label extraction method and device and computer readable storage medium
WO2020103899A1 (en) Method for generating inforgraphic information and method for generating image database
CN109740152A (en) Determination method, apparatus, storage medium and the computer equipment of text classification
WO2021235617A1 (en) System for recommending scientific and technical knowledge information, and method therefor
Liu et al. Documentclip: Linking figures and main body text in reflowed documents
Wang et al. Data-driven approach for bridging the cognitive gap in image retrieval
WO2023008609A1 (en) Image management server providing a scene image by merging objects from multiple images and method for creating the scene image using the same
WO2014148664A1 (en) Multi-language search system, multi-language search method, and image search system, based on meaning of word
US20080015843A1 (en) Linguistic Image Label Incorporating Decision Relevant Perceptual, Semantic, and Relationships Data
JP2002007413A (en) Image retrieving device
CN116955707A (en) Content tag determination method, device, equipment, medium and program product
WO2022092497A1 (en) System for providing similar case information, and method therefor
CN117009578A (en) Video data labeling method and device, electronic equipment and storage medium
WO2020122440A1 (en) Apparatus for detecting contextually-anomalous sentence in document, method therefor, and computer-readable recording medium having program for performing same method recorded thereon
JP2022185874A (en) Information processing device, information processing system, information processing method, and program
CN115114467A (en) Training method and device of picture neural network model
Wang et al. Exploring statistical correlations for image retrieval

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21951979

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE