WO2021149923A1

WO2021149923A1 - Method and apparatus for providing image search

Info

Publication number: WO2021149923A1
Application number: PCT/KR2020/018892
Authority: WO
Inventors: 안재용; 강민수
Original assignee: 주식회사 씨오티커넥티드
Priority date: 2020-01-20
Filing date: 2020-12-22
Publication date: 2021-07-29

Abstract

Disclosed are a method and an apparatus for providing image search. A method for providing image search according to an embodiment comprises the steps of: receiving, from an image analysis engine, one or more keywords corresponding to each shot of an image and a confidence value corresponding to the keyword; determining, in connection with successive shots, that shots each having a candidate keyword matching ratio equal to or larger than a second threshold value constitute a single scene; statistically processing confidence values corresponding to keywords of shots constituting a scene so as to determine a final keyword representing the scene; receiving a scene search request including a search query; determining a scene corresponding to the search query on the basis of the final keyword; and providing an image corresponding to the determined scene.

Description

Method and apparatus for providing video search

The following embodiments relate to an image search providing method and apparatus.

Users are often interested in only part of the content of a specific image rather than the entire image. For example, a user who wants to watch a soccer broadcast may want to watch only the scene in which a specific player scores a goal, rather than the entire broadcast. Likewise, a viewer of an entertainment program may want to view only a specific scene. However, in a general video search method, the entire soccer broadcast or the entire entertainment program is the target of the search, so it is impossible to search for the particular scenes of the video that the user wants.

Conventionally, in order to enable a search for a portion of an image, an attempt has been made to search for an image using dialogue information (or subtitle information) included in an image. That is, by searching for a portion having dialogue or subtitles related to the search word entered by the user from among the dialogues or subtitles of a specific image, it was attempted to enable the user to find the desired part of the image.

However, in many cases, it is not possible to properly reflect the contents of the video only with dialogue or subtitles. For example, when trying to find a kissing scene in a drama, it is very difficult to find a scene in which the main characters of a drama are kissing only with lines or subtitles because there are often no lines in the kissing scene. In addition, in the case of an image without dialogue or subtitles, it is impossible to search for an image using this method at all.

Embodiments attempt to extract keywords at regular time intervals from an image.

Embodiments intend to generate metadata in which a keyword corresponding to a scene and time information are matched.

Embodiments are intended to provide a search for an image and a search for a specific scene based on the generated metadata.

An image search providing method according to an embodiment comprises: receiving one or more keywords corresponding to each shot of an image and a confidence value corresponding to the keywords from an image analysis engine; in successive shots, determining shots in which the match ratio of the candidate keyword of each shot is equal to or greater than a second threshold value as one scene; determining a final keyword representing the scene by statistically processing confidence values corresponding to keywords of shots constituting the scene; receiving a scene search request comprising a search query; determining a scene corresponding to the search query based on the final keyword; and providing an image corresponding to the determined scene.

The method of providing an image search according to an embodiment may further include modifying the final keyword based on the search query.

The first threshold value may be determined based on a distribution of the confidence values.

The determining of the final keyword may include: determining a weight for each shot of the shots constituting the scene; weighted summing confidence values corresponding to keywords of shots constituting the scene based on the weights for each shot; and determining a keyword whose weighted sum is equal to or greater than a third threshold value as the final keyword.

The determining of the weight for each shot may include determining the weight for each shot of the shots constituting the scene based on the number of candidate keywords included in each of the shots constituting the scene.

The third threshold value may be determined based on a distribution of the confidence values corresponding to the keywords of shots constituting the scene.

The weighted summing may include weighted summing confidence values corresponding to the candidate keywords of the shots constituting the scene based on the weight.

The determining of the scene corresponding to the search query may include: comparing the search query with a keyword equal to or greater than the third threshold; and determining a scene corresponding to the search query based on the comparison result.

The method of providing an image search according to an embodiment may further include adding one or more relational keywords having a predetermined relation with the keywords equal to or greater than the third threshold, and the determining of the scene corresponding to the search query may include determining the scene corresponding to the search query by further considering the relational keywords.

The image analysis engine may include an external image analysis engine that receives the image and generates the first metadata.

An image search providing apparatus according to an embodiment includes a processor that: receives, from an image analysis engine, one or more keywords corresponding to each shot of an image and a confidence value corresponding to the keywords; determines, among successive shots, shots in which the match ratio of the candidate keywords is equal to or greater than a second threshold value as one scene; statistically processes confidence values corresponding to the keywords of the shots constituting the scene to determine a final keyword representing the scene; receives a scene search request including a search query; determines a scene corresponding to the search query based on the final keyword; and provides an image corresponding to the determined scene.

The processor may modify the final keyword based on the search query.

The processor may determine a weight for each shot of the shots constituting the scene, compute a weighted sum of the confidence values corresponding to keywords of the shots constituting the scene based on the weight for each shot, and determine a keyword whose weighted sum is equal to or greater than a third threshold value as the final keyword.

The processor may determine a weight for each shot of the shots constituting the scene based on the number of the candidate keywords included in each of the shots constituting the scene.

The third threshold value may be determined based on a distribution of the confidence values corresponding to the keywords of shots constituting a scene.

The processor may compute a weighted sum of the confidence values corresponding to the candidate keywords of the shots constituting the scene based on the weight.

The processor may compare the search query with a keyword equal to or greater than the third threshold, and determine a scene corresponding to the search query based on the comparison result.

The processor may determine a scene corresponding to the search query by adding one or more relational keywords having a predetermined relation with a keyword equal to or greater than the third threshold value, and further considering the relational keyword.

Embodiments may extract keywords from an image at regular time intervals.

Embodiments may generate metadata in which a keyword corresponding to a scene and time information are matched.

Embodiments may provide image search based on the generated metadata.

FIG. 1 is a diagram illustrating an example of a network environment according to an embodiment.

FIG. 2 is a diagram for explaining a method of operating an image search providing system according to an exemplary embodiment.

FIG. 3 is a diagram illustrating an example of an image search result page according to an exemplary embodiment.

FIG. 4 is a flowchart illustrating a method for providing an image search according to an exemplary embodiment.

FIG. 5 is a diagram for describing a specific method of determining a scene according to an exemplary embodiment.

Specific structural or functional descriptions disclosed in this specification are merely illustrative for the purpose of describing embodiments according to technical concepts. The embodiments may be embodied in various other forms and are not limited to the embodiments described herein.

Terms such as first or second may be used to describe various elements, but these terms should be understood only for the purpose of distinguishing one element from another element. For example, a first component may be termed a second component, and similarly, a second component may also be termed a first component.

When a component is referred to as being "connected" or "coupled" to another component, it should be understood that the component may be directly connected or coupled to the other component, but that other components may exist in between. On the other hand, when a component is referred to as being "directly connected" or "directly coupled" to another component, it should be understood that no other component is present in between. Expressions describing the relationship between components, for example, "between" and "directly between" or "adjacent to" and "directly adjacent to", should be interpreted similarly.

The singular expression includes the plural expression unless the context clearly dictates otherwise. In this specification, terms such as "comprise" or "have" are intended to designate that the described feature, number, step, operation, component, part, or combination thereof exists, and should be understood not to preclude in advance the possibility of the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless defined otherwise, all terms used herein, including technical or scientific terms, have the same meaning as commonly understood by one of ordinary skill in the art. Terms such as those defined in a commonly used dictionary should be interpreted as having a meaning consistent with their meaning in the context of the related art, and are not to be interpreted in an ideal or excessively formal sense unless explicitly defined in the present specification.

The embodiments may be implemented in various types of products, such as personal computers, laptop computers, tablet computers, smart phones, televisions, smart home appliances, intelligent cars, kiosks, wearable devices, and the like. Hereinafter, embodiments will be described in detail with reference to the accompanying drawings. Like reference numerals in each figure indicate like elements.

Referring to FIG. 1, a network environment according to an embodiment may include a plurality of terminals 110, 120, 130, and 140, a plurality of servers 150 and 160, and a network 170. FIG. 1 is an example for describing the invention, and the number of terminals or the number of servers is not limited to that shown in FIG. 1.

The plurality of terminals 110, 120, 130, and 140 may each be a fixed terminal or a mobile terminal implemented as a computer device. Examples of the plurality of terminals 110, 120, 130, and 140 include a smart phone, a mobile phone, a navigation device, a computer, a notebook computer, a digital broadcasting terminal, a PDA (Personal Digital Assistant), a PMP (Portable Multimedia Player), a tablet PC, an HMD (Head Mounted Display), a TV, a smart TV, and the like.

The terminal 110 may communicate with the other terminals 120, 130, and 140 and/or the servers 150 and 160 through the network 170 using a wireless or wired communication method. The server 150 may communicate with the terminals 110, 120, 130, and 140 and/or the other server 160 through the network 170 using a wireless or wired communication method.

The communication method is not limited, and may include not only a communication method using a communication network that the network 170 may include (e.g., a mobile communication network, the wired Internet, the wireless Internet, or a broadcasting network) but also short-range wireless communication between devices. For example, the network 170 may include any one or more of a personal area network (PAN), a local area network (LAN), a campus area network (CAN), a metropolitan area network (MAN), a wide area network (WAN), a broadband network (BBN), the Internet, and the like. In addition, the network 170 may include any one or more network topologies, including a bus network, a star network, a ring network, a mesh network, a star-bus network, and a tree or hierarchical network, but is not limited thereto.

Each of the servers 150 and 160 may be implemented as a computer device or a plurality of computer devices that communicate with the plurality of terminals 110, 120, 130, and 140 through the network 170 to provide commands, codes, files, content, services, and the like.

The server 150 may provide a file for installing an application to the terminal 110 connected through the network 170. In this case, the terminal 110 may install the application using the file provided from the server 150. In addition, the terminal 110 may access the server 150 under the control of an operating system (OS) and at least one program (e.g., a browser or the installed application) included in the terminal 110, and may be provided with a service or content offered by the server 150. For example, when the terminal 110 transmits a service request message to the server 150 through the network 170 under the control of the application, the server 150 may transmit a code corresponding to the service request message to the terminal 110, and the terminal 110 may provide content to the user by composing and displaying a screen according to the code under the control of the application.

Referring to FIG. 2, an image search providing system according to an embodiment may include a terminal 210, an image search providing apparatus 220, and an image analysis engine 230. The image search providing apparatus 220 includes a processor 221 and a database 222. The terminal 210 according to an embodiment may be one of the terminals 110 to 140 of FIG. 1, and the image search providing apparatus 220 and the image analysis engine 230 may each be one of the servers 150 and 160 of FIG. 1.

In the image search providing system according to an embodiment, when a user sends a scene search request including a search query to the image search providing apparatus 220 through the terminal 210, the image search providing apparatus 220 may determine a scene corresponding to the search query and provide an image corresponding to the scene to the user.

In the conventional case, it may be difficult to provide an accurate search result because an image search is provided depending on metadata information (title, year, character, plot, etc.) and subtitle data.

The image search providing system according to an embodiment uses the keywords derived from a scene, their appearance frequency, and their relationships for the search, and further uses users' search results and viewing records, so that more accurate search results can be provided.

The image search providing apparatus 220 may extract, at regular time intervals, keywords related to the image corresponding to each time. More specifically, the image search providing apparatus 220 may transmit the image corresponding to each time to the image analysis engine 230 at regular time intervals, and may receive a keyword corresponding to each image from the image analysis engine 230. For example, an image may be divided into minimum time intervals, and the image corresponding to each interval may be referred to as a shot. The image search providing apparatus 220 may transmit the shot to the image analysis engine 230, and may receive first metadata corresponding to each shot from the image analysis engine 230. The first metadata may include one or more keywords corresponding to the image and a confidence value corresponding to each keyword. The confidence value may be a numerical value indicating the degree of relation between the corresponding keyword and the shot. For example, the confidence value may be a value between 0 and 1, and the closer it is to 1, the more strongly the keyword is related to the corresponding shot.
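As a non-limiting illustration, the first metadata described above could be held in a small per-shot record. The sketch below is an assumption made for explanation only: the class name `ShotMetadata` and the fields `shot_index` and `start_seconds` are hypothetical and do not appear in the embodiments, which only require keywords, confidence values between 0 and 1, and a correspondence to time.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ShotMetadata:
    """First metadata for one shot: keywords mapped to confidence values."""

    shot_index: int        # hypothetical: position of the shot in the image
    start_seconds: float   # hypothetical: time at which the shot was sampled
    keywords: Dict[str, float] = field(default_factory=dict)

    def add(self, keyword: str, confidence: float) -> None:
        # The description states that a confidence value lies between 0 and 1.
        if not 0.0 <= confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")
        self.keywords[keyword] = confidence


# Two keyword/confidence pairs taken from shot 1 of the worked example.
shot = ShotMetadata(shot_index=1, start_seconds=0.0)
shot.add("Man", 0.88)
shot.add("Picture frame", 0.85)
```

Keeping the confidence next to each keyword allows the later thresholding and weighted-sum steps to operate on the same record without a second lookup.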

The image search providing apparatus 220 may determine a keyword having a confidence value equal to or greater than a first threshold value among one or more keywords as a candidate keyword. The image search providing apparatus 220 may determine a scene including one or more shots based on the first metadata. Through this, the image search providing apparatus 220 may also determine a scene change time based on the first metadata. As an example, the image search providing apparatus 220 may determine, among consecutive shots, shots in which the match ratio of the candidate keyword of each shot is equal to or greater than a second threshold value as one scene. A specific method of determining a scene based on the first metadata will be described in detail below with reference to FIG. 5 .

The image search providing apparatus 220 may determine second metadata corresponding to the scene, and may determine a scene corresponding to the search query input by the user based on the second metadata of the scene. For example, the image search providing apparatus 220 may statistically process confidence values corresponding to keywords of shots constituting a scene to determine a final keyword representing the scene.

The image analysis engine 230 may include an external image analysis engine that receives an image and generates first metadata corresponding to the input image. For example, the image analysis engine 230 may be the Google Vision API. The Google Vision API builds metadata by finding the dominant object included in the image, and can classify objects in the image into thousands of categories using the built metadata. However, the Google Vision API is merely an example, and various other types of models or devices that perform object recognition and output corresponding metadata may be employed as the image analysis engine.

When an external image analysis engine is used in the image search providing system, the image search providing apparatus 220 uses the first metadata received from the external image analysis engine and does not perform image analysis itself, so processing speed can be improved.

The image search providing apparatus 220 includes a processor 221 and a database 222 . The image search providing apparatus 220 may include more components than those of FIG. 2 . For example, the image search providing apparatus 220 may further include other components such as a memory, a communication module, an input/output interface, and the like.

The memory is a computer-readable recording medium and may include a random access memory (RAM), a read only memory (ROM), and a permanent mass storage device such as a disk drive. In addition, codes for a browser installed and driven in the operating system and at least one program code or the above-described application may be stored in the memory. These software components may be loaded from a computer-readable recording medium separate from the memory using a drive mechanism. The separate computer-readable recording medium may include a computer-readable recording medium such as a floppy drive, a disk, a tape, a DVD/CD-ROM drive, and a memory card.

In another embodiment, the software components may be loaded into the memory through a communication module rather than a computer-readable recording medium. For example, the at least one program may be loaded into the memory based on a file distribution system that distributes installation files of developers or applications.

The processor 221 may be configured to process instructions of a computer program by performing basic arithmetic, logic, and input/output operations. Instructions may be provided to the processor 221 by a memory or communication module. For example, the processor 221 may be configured to execute a received instruction according to a program code stored in a recording device such as a memory.

The communication module may provide a function for the terminal 210 and the image search providing apparatus 220 to communicate with each other through a network, and may provide a function for communicating with another server. For example, a control signal, command, content, file, etc. provided under the control of the processor 221 of the image search providing apparatus 220 may be received by the terminal 210 through the communication module and the network 170.

Referring to FIG. 3 , the image search result page may include a scene image title field 310 , a player field 320 for reproducing a scene image, and an additional information field 330 .

The scene video title field 310 may display a title of a scene video selected as a search result. According to the search word input by the user, "a scene where AAA and BBB enjoy a picture of a child in OOO" shown in FIG. 3 may be the title of the scene video.

A player for playing the scene video selected as a result of the search is displayed in the player field 320 . In this case, the scene video may be set to be played only when the user clicks the play button of the player.

In the additional information field 330 , image quality information, file type information, capacity information, reproduction time information, screen size information, source information, and the like may be displayed.

According to an embodiment, the image search method may provide the user with a search result including not only the scene image selected as a result of the scene image search, but also the entire image selected as matching the search query. That is, the image search method may display the entire image search result and the scene image search result corresponding to the search word input by the user.

Referring to FIG. 4 , in an embodiment, steps 410 to 460 may be performed by the image search providing apparatus 220 described above with reference to FIG. 2 . The image search providing apparatus 220 may be implemented by one or more hardware modules, one or more software modules, or various combinations thereof.

In step 410 , the image search providing apparatus 220 receives one or more keywords corresponding to each shot of an image and a confidence value corresponding to the keywords from the image analysis engine. The image analysis engine may be the image analysis engine 230 described above with reference to FIG. 2 .

In operation 420 , the image search providing apparatus 220 determines, among one or more keywords, a keyword having a confidence value equal to or greater than a first threshold value as a candidate keyword. The image search providing apparatus 220 may regard keywords having a confidence value less than the first threshold value as noise among keywords corresponding to the shot. The first threshold value may be determined based on a distribution of confidence values. For example, the first threshold value may be determined based on an average and variance of confidence values corresponding to a specific shot.
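As a non-limiting illustration of deriving the first threshold value from the average and variance of a shot's confidence values, the sketch below uses the mean minus `k` standard deviations. This particular formula, the function name `first_threshold`, the parameter `k`, and the sample values are all assumptions made for explanation; the embodiments only state that the threshold may be based on the distribution.

```python
from statistics import mean, pstdev


def first_threshold(confidences, k=1.0):
    """Return mean - k * population standard deviation, clamped to [0, 1].

    This is one hypothetical way to base the threshold on average and variance.
    """
    if not confidences:
        raise ValueError("at least one confidence value is required")
    value = mean(confidences) - k * pstdev(confidences)
    return min(1.0, max(0.0, value))


# Hypothetical confidence values for a single shot.
values = [0.9, 0.8, 0.7, 0.2]
threshold = first_threshold(values)

# Keywords below the threshold are regarded as noise; the rest are candidates.
candidates = [v for v in values if v >= threshold]
```

With these sample values the single outlier at 0.2 falls below the threshold and is treated as noise, while the other three values remain candidates.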

In step 430, the image search providing apparatus 220 determines, among consecutive shots, shots in which the match ratio of the candidate keywords of each shot is equal to or greater than the second threshold value as one scene. One or more consecutive shots may be grouped together to form a scene. For example, a scene in which the main character walks down the street may be divided into several shots from various angles, but all of those shots show the main character walking. A specific method for determining a scene will be described with reference to FIG. 5.

Referring to FIG. 5, shot 1 510 to shot 3 530 according to an embodiment are consecutive shots, and Table 1 below shows the first metadata (e.g., keywords and confidence values) corresponding to shot 1 510 to shot 3 530 received from the image analysis engine.

Table 1

Shot 1            Confidence   Shot 2            Confidence   Shot 3            Confidence
Man               0.88         Woman             0.84         Woman             0.83
Picture frame     0.85         Man               0.84         Person            0.79
Person            0.81         Picture frame     0.78         Top               0.67
Clothing          0.69         Clothing          0.56         Gesture           0.81
Luggage & bags    0.57         Luggage & bags    0.55         Forehead          0.78
Art               0.86         Art               0.68         Finger            0.72
Visual Arts       0.82         Room              0.66         Scene             0.71
Modern Art        0.74         Event             0.63         Hand              0.70
Painting          0.74         Photography       0.62         Mouth             0.68
Organism          0.72         Conversation      0.57         Smile             0.64
Fun               0.70         Gesture           0.56         Photography       0.62
Event             0.67         Visual Arts       0.55         Black Hair        0.61
Adaptation        0.67                                        Conversation      0.58
Room              0.66                                        Jaw               0.57
Art Exhibition    0.65
Drawing           0.59
Portrait          0.57
Animation         0.56
Exhibition        0.56
Illustration      0.55
Conversation      0.54
Mural             0.53

The image search providing apparatus 220 may determine, as a candidate keyword, a keyword having a confidence value equal to or greater than a first threshold value among the keywords corresponding to each shot. The first threshold value may be determined based on a distribution of confidence values. For example, the first threshold value may be determined as the confidence value corresponding to the lower 20%. Referring to Table 1, the first threshold value of shot 1 510 may be determined as 0.565, the first threshold value of shot 2 520 may be determined as 0.555, and the first threshold value of shot 3 530 may be determined as 0.615. Accordingly, the keywords 'Animation', 'Exhibition', 'Illustration', 'Conversation', and 'Mural' of shot 1 510 may be excluded from the candidate keywords, the keywords 'Luggage & bags' and 'Visual Arts' of shot 2 520 may be excluded from the candidate keywords, and the keywords 'Black Hair', 'Conversation', and 'Jaw' of shot 3 530 may be excluded from the candidate keywords.
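As a non-limiting illustration, the candidate keyword selection can be sketched for shot 2 of Table 1 using the first threshold value of 0.555 stated above. The function name `candidate_keywords` is hypothetical, and the threshold is taken as given rather than derived in this sketch.

```python
def candidate_keywords(keywords, first_threshold):
    """Keep only the keywords whose confidence is at or above the threshold."""
    return {kw: conf for kw, conf in keywords.items() if conf >= first_threshold}


# Keywords and confidence values of shot 2 from Table 1.
shot2 = {
    "Woman": 0.84, "Man": 0.84, "Picture frame": 0.78, "Clothing": 0.56,
    "Luggage & bags": 0.55, "Art": 0.68, "Room": 0.66, "Event": 0.63,
    "Photography": 0.62, "Conversation": 0.57, "Gesture": 0.56,
    "Visual Arts": 0.55,
}

candidates = candidate_keywords(shot2, first_threshold=0.555)
excluded = sorted(set(shot2) - set(candidates))
```

Here `excluded` contains 'Luggage & bags' and 'Visual Arts', leaving the 10 candidate keywords of shot 2 referred to in the description.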

The image search providing apparatus 220 may determine, among consecutive shots, shots in which the match ratio of candidate keywords of each shot is equal to or greater than the second threshold value as one scene.

According to an exemplary embodiment, among consecutive shots, the image search providing apparatus 220 may determine, as one scene, shots in which a match ratio of a candidate keyword with a previous shot is equal to or greater than a second threshold value.

Referring to Table 1, 5 of the 10 candidate keywords of shot 2 520 ('Man', 'Picture frame', 'Clothing', 'Art', 'Event') match candidate keywords of shot 1 510. Also, 3 of the 10 candidate keywords of shot 3 530 ('Woman', 'Photography', 'Gesture') match keywords of shot 2 520. When the second threshold value is, for example, 0.5, shot 1 510 and shot 2 520 have a match ratio of 0.5 (5/10), which is equal to or greater than the second threshold value, so they may be determined as shots constituting one scene. On the other hand, shot 2 520 and shot 3 530 have a match ratio of 0.3 (3/10), which is less than the second threshold value, so they may be determined as shots constituting different scenes.
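As a non-limiting illustration, grouping consecutive shots into scenes by comparing each shot with the previous shot can be sketched as follows. The function names and the toy keyword sets are hypothetical and are not the Table 1 data. Following the worked example, the match ratio is assumed to be the number of matching candidate keywords divided by the number of candidate keywords of the current shot.

```python
def match_ratio(current, previous):
    """Fraction of the current shot's candidate keywords found in the previous shot."""
    if not current:
        return 0.0
    return len(current & previous) / len(current)


def group_into_scenes(shots, second_threshold=0.5):
    """Return lists of shot indices; a new scene starts when the ratio drops below the threshold."""
    scenes = []
    for i, candidates in enumerate(shots):
        if i > 0 and match_ratio(candidates, shots[i - 1]) >= second_threshold:
            scenes[-1].append(i)   # same scene as the previous shot
        else:
            scenes.append([i])     # scene change detected (or first shot)
    return scenes


# Hypothetical candidate keyword sets for three consecutive shots.
shots = [
    {"Man", "Art", "Room", "Event"},
    {"Man", "Art", "Woman", "Event"},      # 3 of 4 match the previous shot
    {"Woman", "Hand", "Smile", "Finger"},  # 1 of 4 match the previous shot
]
scenes = group_into_scenes(shots)
```

The index at which a new inner list begins corresponds to a scene change time, which is how the first metadata alone can be used to locate scene boundaries.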

According to another embodiment, the image search providing apparatus 220 may cumulatively count the number of matching candidate keywords across successive shots and calculate the match ratio of the candidate keywords based on the accumulated count. Referring to Table 1, among the 10 candidate keywords of shot 3 530, one ('Person') matches a keyword of shot 1 510 and three ('Woman', 'Photography', 'Gesture') match keywords of shot 2 520; accumulating these, shot 3 530 may have a match ratio of 0.4 (4/10).
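As a non-limiting illustration of the cumulative variant, the sketch below counts a keyword of the current shot as matching if it appears in any of the preceding shots. Counting each keyword at most once (a union over the preceding shots) is an assumption; the description does not say how a keyword matching several earlier shots would be counted. The keyword sets are abridged, hypothetical sets chosen to reproduce the 4/10 count of the example.

```python
def cumulative_match_ratio(current, previous_shots):
    """Fraction of the current shot's candidate keywords seen in any preceding shot."""
    if not current:
        return 0.0
    seen = set().union(*previous_shots)  # assumption: each keyword counted once
    return len(current & seen) / len(current)


# Abridged, hypothetical candidate keyword sets.
shot1 = {"Man", "Person", "Art"}
shot2 = {"Woman", "Man", "Photography", "Gesture"}
shot3 = {
    "Woman", "Person", "Top", "Gesture", "Forehead",
    "Finger", "Scene", "Hand", "Mouth", "Photography",
}

ratio = cumulative_match_ratio(shot3, [shot1, shot2])
```

One keyword ('Person') matches shot 1 and three ('Woman', 'Photography', 'Gesture') match shot 2, so `ratio` is 4/10.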

Referring back to FIG. 4 , in step 440 , the image search providing apparatus 220 statistically processes confidence values corresponding to keywords of shots constituting a scene to determine a final keyword representing the scene. The image search providing apparatus 220 may determine a final keyword composed of keywords representing the scene by removing keywords that can be viewed as noise from keywords corresponding to one or more shots constituting the scene. The image search providing apparatus 220 may generate a final keyword matched with time information corresponding to a scene, and the final keyword matched with time information may be stored in a database in the form of a file.
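As a non-limiting illustration, a final keyword record matched with time information could be serialized for storage in file form as sketched below. The use of JSON, the function name `scene_record`, and the field names `scene_id`, `start_seconds`, `end_seconds`, and `final_keywords` are all assumptions; the embodiments do not specify a file format or schema.

```python
import json


def scene_record(scene_id, start_seconds, end_seconds, final_keywords):
    """Build a hypothetical second-metadata record for one scene."""
    return {
        "scene_id": scene_id,
        "start_seconds": start_seconds,   # time information matched to the scene
        "end_seconds": end_seconds,
        "final_keywords": list(final_keywords),
    }


record = scene_record(1, 0.0, 12.5, ["Man", "Picture frame", "Art"])

# Serialize to text that could be written to a file and stored in a database.
serialized = json.dumps(record, ensure_ascii=False)
restored = json.loads(serialized)
```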

According to an embodiment, the image search providing apparatus 220 may determine a weight for each shot of the shots constituting the scene, and may compute a weighted sum of the confidence values corresponding to keywords of the shots constituting the scene based on the weight for each shot. Furthermore, the image search providing apparatus 220 may determine a keyword whose weighted sum is equal to or greater than the third threshold value as the final keyword. The statistical processing method is not limited to the above-described weighted sum, and may include any method related to statistical processing.

The image search providing apparatus 220 may determine a weight for each shot based on the importance of shots constituting the scene. For example, the image search providing apparatus 220 may assign a greater weight to a shot having a higher importance.

According to an embodiment, the image search providing apparatus 220 may determine that the greater the number of candidate keywords in a shot, the higher its importance. Accordingly, the image search providing apparatus 220 may determine a weight for each shot of the shots constituting the scene based on the number of candidate keywords included in each of the shots constituting the scene. For example, in Table 1, shot 1 510 has 17 candidate keywords and shot 2 520 has 10 candidate keywords, so shot 1 510 may have a weight of 0.63 (17/27) and shot 2 520 may have a weight of 0.37 (10/27). Table 2 below shows the keywords of the shots constituting the scene (e.g., shot 1 510 and shot 2 520) and the weighted sums of the corresponding confidence values according to the example above.

Table 2

Scene 1
Keyword           Weighted sum
Man               0.8652
Picture frame     0.8241
Art               0.8304
Event             0.4221
Clothing          0.6382
Visual Arts       0.5166
Person            0.5103
Modern Art        0.4662
Painting          0.4662
Organism          0.4536
Fun               0.441
Adaptation        0.4221
Room              0.4158
Art Exhibition    0.4095
Drawing           0.3717
Luggage & bags    0.3591
Portrait          0.3591
Animation         0.3528
Exhibition        0.3528
Illustration      0.3465
Conversation      0.3402
Mural             0.3339
Woman             0.3108
Room              0.2442
Photography       0.2294
Conversation      0.2109
Gesture           0.2072
Luggage & bags    0.2035
Visual Arts       0.2035

Referring to Table 2, the image search providing apparatus 220 may determine, as the final keywords of scene 1, the keywords ('Man', 'Picture frame', 'Art', 'Event', 'Clothing', 'Visual Arts', 'Person') for which the weighted sum of the confidence values corresponding to the keywords of shot 1 510 and shot 2 520 is equal to or greater than a predetermined third threshold value (e.g., 0.5).
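The weighting described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the function names, the dictionary layout, and the sample confidence values are hypothetical (they are not the values of Table 1), each shot's weight is assumed to be its share of the scene's total candidate-keyword count as in the 17/27 and 10/27 example, and a keyword absent from a shot is assumed to contribute zero to the sum.

```python
from typing import Dict, List


def shot_weights(candidate_counts: List[int]) -> List[float]:
    """Weight each shot by its share of the scene's candidate keywords."""
    total = sum(candidate_counts)
    if total == 0:
        raise ValueError("scene has no candidate keywords")
    return [count / total for count in candidate_counts]


def weighted_keyword_sums(
    shots: List[Dict[str, float]], weights: List[float]
) -> Dict[str, float]:
    """Sum each keyword's confidence over the shots, scaled by the shot weight."""
    sums: Dict[str, float] = {}
    for keywords, weight in zip(shots, weights):
        for keyword, confidence in keywords.items():
            sums[keyword] = sums.get(keyword, 0.0) + weight * confidence
    return sums


def final_keywords(sums: Dict[str, float], third_threshold: float) -> List[str]:
    """Keep the keywords whose weighted sum reaches the third threshold."""
    return [k for k, v in sums.items() if v >= third_threshold]


# Hypothetical, abbreviated keyword lists for a two-shot scene.
shot1 = {"Man": 0.9, "Art": 0.8, "Room": 0.4}
shot2 = {"Man": 0.8, "Art": 0.9}
# Candidate-keyword counts taken from the example in the text (17 and 10).
weights = shot_weights([17, 10])  # approximately 0.63 and 0.37
sums = weighted_keyword_sums([shot1, shot2], weights)
print(final_keywords(sums, third_threshold=0.5))  # ['Man', 'Art']
```

In this sketch 'Room' appears only in the first shot, so its weighted sum is about 0.25 and it is dropped as noise, while keywords shared by both shots survive.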

The third threshold value according to an embodiment may be determined based on a distribution of the confidence values corresponding to the keywords of the shots constituting a scene. For example, the third threshold value may be determined based on the average and variance of the confidence values corresponding to the keywords of the shots constituting the scene. Alternatively, the third threshold value may be determined based on a predetermined percentile of the confidence values corresponding to the keywords of the shots constituting the scene. The method of determining the third threshold value is not limited to the above examples, and may include any method that determines the threshold value based on the distribution of the confidence values.
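Two ways of deriving such a threshold from the distribution can be sketched as below. The formulas are assumptions: the text names the average, the variance, and a percentile but gives no formula, so the first function uses the mean plus `k` standard deviations (the square root of the variance) with a hypothetical tuning parameter `k`, and the second uses a linearly interpolated percentile.

```python
import statistics
from typing import Sequence


def threshold_from_mean_std(confidences: Sequence[float], k: float = 0.5) -> float:
    """Threshold = mean + k * population standard deviation (k is hypothetical)."""
    mean = statistics.fmean(confidences)
    std = statistics.pstdev(confidences)
    return mean + k * std


def threshold_from_percentile(
    confidences: Sequence[float], percentile: float = 75.0
) -> float:
    """Threshold = the given percentile, interpolated linearly between ranks."""
    if not confidences:
        raise ValueError("no confidence values")
    ordered = sorted(confidences)
    rank = (len(ordered) - 1) * percentile / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)
```

Either function adapts the cut-off to the scene: a scene whose keywords all have low confidence gets a lower threshold than a fixed value such as 0.5 would impose.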

In step 450, the image search providing apparatus 220 receives a scene search request including a search query. The image search providing apparatus 220 may itself include a search engine or may use an external search engine. When the image search providing apparatus 220 itself includes a search engine, it may receive the scene search request including the search query directly from a user's terminal. Alternatively, when an external search engine is used, the user may request a scene search through the external search engine, and the image search providing apparatus 220 may receive the scene search request including the search query from the external search engine.

In step 460, the image search providing apparatus 220 determines a scene corresponding to the search query based on the final keyword. The image search providing apparatus 220 may compare the final keyword with the search query, and may determine the scene corresponding to the search query based on the comparison result.

Furthermore, the image search providing apparatus 220 may add one or more relational keywords having a predetermined relationship with the final keyword, and may determine the scene corresponding to the search query by further considering the relational keywords. The relational keywords may include words similar to the final keyword, superordinate or subordinate words (hypernyms or hyponyms) of the final keyword, and the like.
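The comparison of a search query against final keywords and relational keywords can be sketched as follows. Everything specific here is an assumption for illustration: the hand-written `RELATED` table, the case-insensitive match on whitespace-separated query terms, and the ranking by number of matching terms. Multi-word keywords such as 'Picture frame' would need phrase matching, which this sketch omits.

```python
from typing import Dict, List, Set

# Hypothetical relational keywords (similar words, broader or narrower terms).
RELATED: Dict[str, Set[str]] = {
    "painting": {"artwork", "picture"},
    "man": {"person", "male"},
}


def expand_keywords(final_keywords: List[str], related: Dict[str, Set[str]]) -> Set[str]:
    """Return the final keywords plus their relational keywords, lower-cased."""
    expanded: Set[str] = set()
    for keyword in final_keywords:
        key = keyword.lower()
        expanded.add(key)
        expanded |= related.get(key, set())
    return expanded


def search_scenes(
    query: str, scenes: Dict[str, List[str]], related: Dict[str, Set[str]]
) -> List[str]:
    """Rank scene ids by how many query terms hit their expanded keyword set."""
    terms = set(query.lower().split())
    scored = []
    for scene_id, keywords in scenes.items():
        hits = len(terms & expand_keywords(keywords, related))
        if hits > 0:
            scored.append((hits, scene_id))
    # More hits first; ties are broken by scene id for a stable order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [scene_id for _, scene_id in scored]
```

With scenes `{"scene1": ["Man", "Painting"], "scene2": ["Dog"]}`, the query "artwork" matches scene1 only through the relational keyword, which is the case the relational keywords are meant to cover.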

In step 470, the image search providing apparatus 220 provides an image corresponding to the determined scene. When an external search engine is used, the image search providing apparatus 220 may provide an image corresponding to a scene determined by the external search engine.

The image search providing apparatus 220 may modify the final keyword based on the search query. The image search providing apparatus 220 may improve search accuracy while continuously correcting determined information by applying feedback on the search result.

The embodiments described above may be implemented by a hardware component, a software component, and/or a combination of a hardware component and a software component. For example, the apparatus, methods, and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. The processing device may execute an operating system (OS) and one or more software applications running on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to execution of the software. For convenience of understanding, one processing device is sometimes described as being used, but one of ordinary skill in the art will recognize that the processing device may include a plurality of processing elements and/or a plurality of types of processing elements. For example, the processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure the processing device to operate as desired or may independently or collectively instruct the processing device. The software and/or data may be embodied, permanently or temporarily, in any kind of machine, component, physical device, virtual equipment, computer storage medium or device, or transmitted signal wave, in order to be interpreted by the processing device or to provide instructions or data to the processing device. The software may be distributed over networked computer systems, and stored or executed in a distributed manner. Software and data may be stored in one or more computer-readable recording media.

The method according to the embodiment may be implemented in the form of program instructions that can be executed through various computer means and recorded in a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiment, or may be known and available to those skilled in the art of computer software. Examples of the computer-readable recording medium include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine language code such as that generated by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like.

As described above, although the embodiments have been described with reference to the limited drawings, those skilled in the art may apply various technical modifications and variations based on the above. For example, an appropriate result may be achieved even if the described techniques are performed in an order different from the described method, and/or the described components of the system, structure, apparatus, circuit, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents to the claims are also within the scope of the following claims.

Claims

1. An image search providing method comprising:

receiving, from an image analysis engine, one or more keywords corresponding to each shot of an image and confidence values corresponding to the keywords;

determining, among the one or more keywords, a keyword having a confidence value equal to or greater than a first threshold value as a candidate keyword;

determining, among successive shots, shots in which the match ratio of the candidate keywords of each shot is equal to or greater than a second threshold value as one scene;

determining a final keyword representing the scene by statistically processing the confidence values corresponding to the keywords of the shots constituting the scene;

receiving a scene search request including a search query;

determining a scene corresponding to the search query based on the final keyword; and

providing an image corresponding to the determined scene.
2. The method of claim 1, further comprising:

modifying the final keyword based on the search query.
3. The method of claim 1, wherein the first threshold value is determined based on a distribution of the confidence values.
4. The method of claim 3, wherein the determining of the final keyword comprises:

determining a weight for each of the shots constituting the scene;

calculating a weighted sum of the confidence values corresponding to the keywords of the shots constituting the scene based on the weight for each shot; and

determining a keyword whose weighted sum is equal to or greater than a third threshold value as the final keyword.
5. The method of claim 4, wherein the determining of the weight for each shot comprises:

determining the weight for each of the shots constituting the scene based on the number of candidate keywords included in each of the shots constituting the scene.
6. The method of claim 4, wherein the third threshold value is determined based on a distribution of the confidence values corresponding to the keywords of the shots constituting the scene.
7. The method of claim 4, wherein the calculating of the weighted sum comprises:

calculating a weighted sum of the confidence values corresponding to the candidate keywords of the shots constituting the scene based on the weight.
8. The method of claim 4, wherein the determining of the scene corresponding to the search query comprises:

comparing the search query with the keyword whose weighted sum is equal to or greater than the third threshold value; and

determining the scene corresponding to the search query based on the comparison result.
9. The method of claim 8, further comprising:

adding one or more relational keywords having a predetermined relationship with the keyword whose weighted sum is equal to or greater than the third threshold value,

wherein the determining of the scene corresponding to the search query comprises:

determining the scene corresponding to the search query by further considering the relational keywords.
10. The method of claim 1, wherein the image analysis engine comprises an external image analysis engine that receives the image and generates the keywords and the confidence values corresponding to the keywords.
11. A computer program stored on a medium, in combination with hardware, to execute the method of any one of claims 1 to 10.
12. An image search providing apparatus comprising:

a processor configured to receive, from an image analysis engine, one or more keywords corresponding to each shot of an image and confidence values corresponding to the keywords; determine, among the one or more keywords, a keyword having a confidence value equal to or greater than a first threshold value as a candidate keyword; determine, among successive shots, shots in which the match ratio of the candidate keywords of each shot is equal to or greater than a second threshold value as one scene; determine a final keyword representing the scene by statistically processing the confidence values corresponding to the keywords of the shots constituting the scene; receive a scene search request including a search query; determine a scene corresponding to the search query based on the final keyword; and provide an image corresponding to the determined scene.
13. The apparatus of claim 12, wherein the processor modifies the final keyword based on the search query.
14. The apparatus of claim 12, wherein the first threshold value is determined based on a distribution of the confidence values.
15. The apparatus of claim 14, wherein the processor determines a weight for each of the shots constituting the scene, calculates a weighted sum of the confidence values corresponding to the keywords of the shots constituting the scene based on the weight for each shot, and determines a keyword whose weighted sum is equal to or greater than a third threshold value as the final keyword.
16. The apparatus of claim 15, wherein the processor determines the weight for each of the shots constituting the scene based on the number of candidate keywords included in each of the shots constituting the scene.
17. The apparatus of claim 15, wherein the third threshold value is determined based on a distribution of the confidence values corresponding to the keywords of the shots constituting the scene.
18. The apparatus of claim 15, wherein the processor calculates a weighted sum of the confidence values corresponding to the candidate keywords of the shots constituting the scene based on the weight.
19. The apparatus of claim 15, wherein the processor compares the keyword whose weighted sum is equal to or greater than the third threshold value with the search query, and determines the scene corresponding to the search query based on the comparison result.
20. The apparatus of claim 15, wherein the processor adds one or more relational keywords having a predetermined relationship with the keyword whose weighted sum is equal to or greater than the third threshold value, and determines the scene corresponding to the search query by further considering the relational keywords.