WO2021149924A1

WO2021149924A1 - Method and apparatus for providing media enrichment

Info

Publication number: WO2021149924A1
Application number: PCT/KR2020/018893
Authority: WO
Inventors: 안재용; 강민수
Original assignee: 주식회사 씨오티커넥티드
Priority date: 2020-01-20
Filing date: 2020-12-22
Publication date: 2021-07-29

Abstract

A method and an apparatus for providing media enrichment are disclosed. A method for providing media enrichment, according to one embodiment, comprises the steps of: receiving, from a video analysis engine, one or more keywords corresponding to each shot of a video, and confidence values corresponding to the keywords; determining, to be a candidate keyword, a keyword of which the confidence value is a first threshold or higher from among the one or more keywords; determining, to be one scene, shots of which the match ratio of a candidate keyword of each shot is a second threshold or higher from consecutive shots; determining a final keyword representing the scene by performing statistical processing on confidence values corresponding to the keywords of the shots constituting the scene; receiving a request for image analysis of the video; and acquiring at least one media enrichment object on the basis of the final keyword of a scene corresponding to a time point at which the request for image analysis was received.

Description

Method and apparatus for providing media enrichment

The following embodiments relate to a method and apparatus for providing media enrichment.

Conventional video services are one-way: the viewer simply watches the transmitted video, so it is difficult to satisfy the curiosity that arises while watching a broadcast. For example, a viewer who wants information about a character or a sponsored product appearing in the content must search for it on a separate web search screen, and if the viewer does not know the search terms for that character or product, the search itself is impossible.

Embodiments aim to extract keywords from an image at regular time intervals.

Embodiments aim to determine a scene change time based on the extracted keywords.

Embodiments aim to provide media/advertisement content as an overlay on the screen being viewed, using the extracted keywords.

A method for providing media enrichment according to an embodiment includes: receiving one or more keywords corresponding to each shot of an image and a confidence value corresponding to the keyword from an image analysis engine; determining, among the one or more keywords, a keyword having the confidence value equal to or greater than a first threshold value as a candidate keyword; in successive shots, determining shots in which the match ratio of the candidate keyword of each shot is equal to or greater than a second threshold value as one scene; determining a final keyword representing the scene by statistically processing confidence values corresponding to keywords of shots constituting the scene; receiving an image analysis request for the image; and obtaining at least one media enrichment object based on a final keyword of a scene corresponding to a time point at which the image analysis request is received.

The determining of the one scene may include: cumulatively counting, over the successive shots, the number of matching candidate keywords in each shot; and calculating the match ratio of the candidate keywords based on the accumulated count value.

The first threshold value may be determined based on a distribution of the confidence values.

The determining of the final keyword may include: determining a weight for each shot of the shots constituting the scene; weighted summing confidence values corresponding to keywords of shots constituting the scene based on the weights for each shot; and determining a keyword whose weighted sum is equal to or greater than a third threshold value as the final keyword.

The determining of the weight for each shot may include determining the weight for each shot of the shots constituting the scene based on the number of candidate keywords included in each of the shots constituting the scene.

The third threshold value may be determined based on a distribution of the confidence values corresponding to the keywords of shots constituting the scene.

The weighted summing may include weighted summing confidence values corresponding to the candidate keywords of the shots constituting the scene based on the weight.

The method for providing media enrichment according to an embodiment may further include rendering the at least one media enrichment object as an overlay on the image.

The obtaining of the at least one media enrichment object may include: inputting the final keyword as a query to at least one service server; and obtaining the at least one media enrichment object corresponding to the query from the service server.

The at least one media enrichment object may include a thumbnail hyperlink of at least one of a photo, text, sound, external link, social link information, advertisement content, and related and playable separate content.

The method for providing media enrichment according to an embodiment may further include determining at least one keyword based on a result of voice recognition of the sound of the scene corresponding to the time point at which the image analysis request is received, wherein the obtaining of the at least one media enrichment object may include obtaining the at least one media enrichment object with further reference to that keyword.

The image analysis engine may include an external image analysis engine that receives the image and generates the keyword and the confidence value corresponding to the keyword.

The apparatus for providing media enrichment according to an embodiment includes a processor that: receives, from an image analysis engine, one or more keywords corresponding to each shot of an image and a confidence value corresponding to each keyword; determines, among the one or more keywords, a keyword having a confidence value equal to or greater than a first threshold value as a candidate keyword; determines, among successive shots, shots in which the match ratio of the candidate keywords of each shot is equal to or greater than a second threshold value as one scene; statistically processes confidence values corresponding to keywords of the shots constituting the scene to determine a final keyword representing the scene; receives an image analysis request for the image; and obtains at least one media enrichment object based on the final keyword of the scene corresponding to the time point at which the image analysis request is received.

In the successive shots, the processor may cumulatively count the number of matching candidate keywords in each shot, and calculate a matching ratio of the candidate keywords based on the accumulated count value.

The processor may determine a weight for each of the shots constituting the scene, compute a weighted sum of the confidence values corresponding to keywords of the shots constituting the scene based on the weight for each shot, and determine a keyword whose weighted sum is equal to or greater than a third threshold value as the final keyword.

The processor may determine a weight for each shot of the shots constituting the scene based on the number of the candidate keywords included in each of the shots constituting the scene.

The processor may compute a weighted sum of the confidence values corresponding to the candidate keywords of the shots constituting the scene based on the weight.

The processor may render the at least one media enrichment object as an overlay on the image.

The processor may input the final keyword as a query to at least one service server, and obtain the at least one media enrichment object corresponding to the query from the service server.

The processor may determine at least one keyword based on a result of voice recognition of the sound of the scene corresponding to the time point at which the image analysis request is received, and may obtain the at least one media enrichment object with further reference to that keyword.

Embodiments may extract keywords from an image at regular time intervals.

Embodiments may determine a scene change time based on the extracted keywords.

Embodiments may provide media/advertisement content in an overlay manner on the screen being viewed by using the extracted keyword.

FIG. 1 is a diagram illustrating an example of a network environment according to an embodiment.

FIG. 2 is a diagram for explaining a method of operating a system for providing media enrichment according to an embodiment.

FIG. 3 is a diagram for describing an enrichment object related to a scene corresponding to a time point at which an image analysis request is received, according to an embodiment.

FIG. 4 is a flowchart illustrating a method for providing media enrichment according to an embodiment.

FIG. 5 is a diagram for describing a specific method of determining a scene according to an embodiment.

Specific structural or functional descriptions disclosed in this specification are merely illustrative for the purpose of describing embodiments according to technical concepts; the embodiments may be embodied in various other forms and are not limited to the embodiments described herein.

Terms such as first or second may be used to describe various elements, but these terms should be understood only for the purpose of distinguishing one element from another element. For example, a first component may be termed a second component, and similarly, a second component may also be termed a first component.

When a component is referred to as being "connected" or "coupled" to another component, it should be understood that the component may be directly connected or coupled to the other component, but that other components may exist in between. On the other hand, when a component is referred to as being "directly connected" or "directly coupled" to another component, it should be understood that no other component is present in between. Expressions describing the relationship between components, for example, "between" and "immediately between" or "neighboring to" and "directly adjacent to", should be interpreted similarly.

The singular expression includes the plural expression unless the context clearly dictates otherwise. In this specification, terms such as "comprise" or "have" are intended to designate that the described feature, number, step, operation, component, part, or combination thereof exists, and should be understood not to preclude in advance the possibility of the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless defined otherwise, all terms used herein, including technical or scientific terms, have the same meaning as commonly understood by one of ordinary skill in the art. Terms such as those defined in a commonly used dictionary should be interpreted as having a meaning consistent with their meaning in the context of the related art, and should not be interpreted in an ideal or excessively formal sense unless explicitly defined in the present specification.

The embodiments may be implemented in various types of products, such as personal computers, laptop computers, tablet computers, smart phones, televisions, smart home appliances, intelligent cars, kiosks, wearable devices, and the like. Hereinafter, embodiments will be described in detail with reference to the accompanying drawings. Like reference numerals in each figure indicate like elements.

Referring to FIG. 1, a network environment according to an embodiment may include a plurality of terminals 110, 120, 130, 140, a plurality of servers 150, 160, and a network 170. FIG. 1 is an example for the description of the invention, and the number of terminals or servers is not limited to that shown in FIG. 1.

The plurality of terminals 110, 120, 130, and 140 may each be a fixed terminal or a mobile terminal implemented as a computer device. Examples of the plurality of terminals 110, 120, 130, and 140 include a smart phone, a mobile phone, a navigation device, a computer, a notebook computer, a digital broadcasting terminal, a PDA (Personal Digital Assistant), a PMP (Portable Multimedia Player), a tablet PC, an HMD (Head Mounted Display), a TV, and a smart TV.

The terminal 110 may communicate with the other terminals 120, 130, 140 and/or the servers 150 and 160 through the network 170 using a wireless or wired communication method. The server 150 may communicate with the terminals 110, 120, 130, 140 and/or the other server 160 through the network 170 using a wireless or wired communication method.

The communication method is not limited, and may include not only communication over a communication network that the network 170 may include (e.g., a mobile communication network, the wired Internet, the wireless Internet, or a broadcasting network) but also short-range wireless communication between devices. For example, the network 170 may include any one or more of a personal area network (PAN), a local area network (LAN), a campus area network (CAN), a metropolitan area network (MAN), a wide area network (WAN), a broadband network (BBN), the Internet, and the like. In addition, the network 170 may include, but is not limited to, any one or more network topologies, including a bus network, a star network, a ring network, a mesh network, a star-bus network, and a tree or hierarchical network.

Each of the servers 150 and 160 may be implemented as a computer device or a plurality of computer devices that communicate with the plurality of terminals 110, 120, 130, 140 through the network 170 to provide commands, code, files, content, services, and the like.

The server 150 may provide a file for installing an application to the terminal 110 connected through the network 170. In this case, the terminal 110 may install the application using the file provided by the server 150. In addition, the terminal 110 may access the server 150 under the control of an operating system (OS) and at least one program (e.g., a browser or the installed application) included in the terminal 110, and may be provided with a service or content offered by the server 150. For example, when the terminal 110 transmits a service request message to the server 150 through the network 170 under the control of the application, the server 150 may transmit code corresponding to the service request message to the terminal 110, and the terminal 110 may provide content to the user by composing and displaying a screen according to the code under the control of the application.

Referring to FIG. 2, the media enrichment providing system according to an embodiment may include a terminal 210, a media enrichment providing apparatus 220, and an image analysis engine 230. The media enrichment providing apparatus 220 includes a processor 221 and a database 222. The terminal 210 according to an embodiment may be one of the terminals 110 to 140 of FIG. 1, and the media enrichment providing apparatus 220 and the image analysis engine 230 may each be one of the servers 150 and 160 of FIG. 1.

In the media enrichment providing system according to an embodiment, when a user requests image analysis of an image from the media enrichment providing apparatus 220 through the terminal 210, the media enrichment providing apparatus 220 may obtain a media enrichment object related to the scene corresponding to the time point at which the image analysis request is received and provide it to the user. The media enrichment object may include, but is not limited to, a thumbnail hyperlink of at least one of a photo, text, sound, external link, social link information, advertisement content, and related and playable separate content.

Through this, the user may be provided with information about a character or a sponsored product of the video to be viewed without performing a separate search. For example, the user may request the media enrichment providing apparatus 220 to analyze the image at a desired time point while viewing the video using the remote control of the terminal 210 or the like. The image may include a streaming image as well as a real-time channel image and VOD image.

The media enrichment providing apparatus 220 may obtain an object related to a scene corresponding to the time when the image analysis request is received and provide it to the user. A specific example of an object related to a scene corresponding to a time point at which an image analysis request is received will be described in detail with reference to FIG.

The media enrichment providing apparatus 220 may extract, at regular time intervals, keywords related to the image corresponding to each time. More specifically, the media enrichment providing apparatus 220 may transmit the image corresponding to each time to the image analysis engine 230 at regular time intervals, and may receive a keyword corresponding to each image from the image analysis engine 230. For example, an image may be divided into minimum time intervals, and the image corresponding to each time may be referred to as a shot. The media enrichment providing apparatus 220 may transmit each shot to the image analysis engine 230, and may receive first metadata corresponding to each shot from the image analysis engine 230. The first metadata may include one or more keywords corresponding to the shot and a confidence value corresponding to each keyword. The confidence value may be a numerical value indicating how strongly the keyword is related to the shot. For example, the confidence value may be a value between 0 and 1, and the closer it is to 1, the more strongly the keyword is related to the corresponding shot.
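As a minimal sketch of the first metadata described above, the per-shot result could be held in a small record that maps each keyword to its confidence value. The names `ShotMetadata`, `shot_index`, `keywords`, and `validate` are illustrative assumptions that do not appear in the specification; the sample values are the first rows of shot 1 in Table 1.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ShotMetadata:
    """First metadata for one shot (hypothetical structure, not from the specification)."""
    shot_index: int
    # Maps each keyword to its confidence value, expected to lie between 0 and 1.
    keywords: Dict[str, float] = field(default_factory=dict)

    def validate(self) -> None:
        """Raise ValueError if any confidence value falls outside [0, 1]."""
        for keyword, confidence in self.keywords.items():
            if not 0.0 <= confidence <= 1.0:
                raise ValueError(f"confidence out of range for {keyword!r}")


shot1 = ShotMetadata(
    shot_index=1,
    keywords={"Man": 0.88, "Picture frame": 0.85, "Person": 0.81},
)
shot1.validate()
```

Keeping the keywords in a dictionary makes the later steps (threshold filtering, matching between shots, weighted sums) simple lookups, which is the only reason this layout was chosen here.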

The media enrichment providing apparatus 220 may determine, as a candidate keyword, a keyword having a confidence value equal to or greater than a first threshold value among one or more keywords. The media enrichment providing apparatus 220 may determine a scene including one or more shots based on the first metadata. Through this, the apparatus 220 for providing media enrichment may also determine a scene change time based on the first metadata. For example, in successive shots, the apparatus 220 for providing media enrichment may determine shots in which the match ratio of the candidate keyword of each shot is equal to or greater than a second threshold value as one scene. A specific method of determining a scene based on the first metadata will be described in detail below with reference to FIG. 5 .

The media enrichment providing apparatus 220 may determine second metadata corresponding to the scene, and may obtain at least one media enrichment object corresponding to the scene based on the second metadata of the scene. For example, the media enrichment providing apparatus 220 may statistically process confidence values corresponding to keywords of the shots constituting a scene to determine a final keyword representing the scene.
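The specification does not say how the scene corresponding to the request time is looked up. One possible sketch, assuming each scene is stored with a start time, an end time, and its final keywords, and that scenes are sorted and do not overlap, is a binary search over the start times. The function name `find_scene`, the tuple layout, and the time sections below are all hypothetical.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

# (start_seconds, end_seconds, final_keywords); assumed sorted by start and non-overlapping.
Scene = Tuple[float, float, List[str]]


def find_scene(scenes: List[Scene], request_time: float) -> Optional[Scene]:
    """Return the scene whose time section contains request_time, or None."""
    starts = [scene[0] for scene in scenes]
    position = bisect_right(starts, request_time) - 1
    if position < 0:
        return None  # request precedes the first scene
    scene = scenes[position]
    # The end time is treated as exclusive so adjacent scenes do not both match.
    return scene if request_time < scene[1] else None


scenes = [
    (0.0, 12.0, ["Man", "Picture frame", "Art"]),  # hypothetical time sections
    (12.0, 20.0, ["Woman", "Gesture"]),
]
```

With these sample values, a request at 5.0 seconds resolves to the first scene and a request at 12.0 seconds resolves to the second; a request outside every time section returns `None`.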

The image analysis engine 230 may include an external image analysis engine that receives an image and generates first metadata corresponding to the input image. For example, the image analysis engine 230 may be the Google Vision API. The Google Vision API builds metadata by finding the dominant objects in an image, and can classify objects in the image into thousands of categories using that metadata. However, the Google Vision API is merely an example, and various types of models or devices that perform object recognition and output corresponding metadata may be employed as the image analysis engine.

When an external image analysis engine is used in the media enrichment providing system, the media enrichment providing apparatus 220 only uses the first metadata received from the external image analysis engine and does not perform image analysis itself, so the processing speed can be improved.

The media enrichment providing apparatus 220 includes a processor 221 and a database 222 . The media enrichment providing apparatus 220 may include more components than those of FIG. 2 . For example, the media enrichment providing apparatus 220 may further include other components such as a memory, a communication module, an input/output interface, and the like.

The memory is a computer-readable recording medium and may include a random access memory (RAM), a read only memory (ROM), and a permanent mass storage device such as a disk drive. In addition, codes for a browser installed and driven in the operating system and at least one program code or the above-described application may be stored in the memory. These software components may be loaded from a computer-readable recording medium separate from the memory using a drive mechanism. The separate computer-readable recording medium may include a computer-readable recording medium such as a floppy drive, a disk, a tape, a DVD/CD-ROM drive, and a memory card.

In another embodiment, the software components may be loaded into the memory through a communication module rather than a computer-readable recording medium. For example, the at least one program may be loaded into the memory based on a file distribution system that distributes installation files of developers or applications.

The processor 221 may be configured to process instructions of a computer program by performing basic arithmetic, logic, and input/output operations. Instructions may be provided to the processor 221 by a memory or communication module. For example, the processor 221 may be configured to execute a received instruction according to a program code stored in a recording device such as a memory.

The communication module may provide a function for the terminal 210 and the media enrichment providing apparatus 220 to communicate with each other through a network, and may provide a function for communicating with another server. For example, a control signal, command, content, file, or the like provided under the control of the processor 221 of the media enrichment providing apparatus 220 may be received by the terminal 210 through the communication module and the network 170.

Referring to the drawings 310 and 320 of FIG. 3, when an image is displayed in a first area of one screen, the apparatus for providing media enrichment according to an embodiment may provide a media enrichment object related to the image in a second area.

The media enrichment providing apparatus may divide keywords into categories and provide media enrichment objects for each category. For example, referring to the drawing 310, the apparatus for providing media enrichment may acquire keywords corresponding to the scene at the time when an image analysis request is received, divide the keywords into a 'character' category, a 'product/shopping' category, and a 'social' category, and provide a media enrichment object for each category. Furthermore, the apparatus for providing media enrichment may display the number of the scene in the current image, and may also display the time section corresponding to the scene.

Similarly, referring to the drawing 320, the media enrichment providing apparatus may obtain keywords corresponding to the scene at the time when the image analysis request is received, divide the keywords into a 'Repeat' category, a 'Ryu Hyun-jin' category, and an 'LA Dodgers' category, and provide a media enrichment object for each category.

The embodiments illustrated in FIG. 3 are only examples for describing an enrichment object related to a scene corresponding to a time point at which an image analysis request is received, and the media enrichment object is not limited thereto. For example, the media enrichment object may be utilized in various images such as news, drama, home shopping, and entertainment.

Referring to FIG. 4 , steps 410 to 460 according to an embodiment may be performed by the apparatus for providing media enrichment described above with reference to FIGS. 2 to 3 . The apparatus for providing media enrichment may be implemented by one or more hardware modules, one or more software modules, or various combinations thereof.

In operation 410, the media enrichment providing apparatus 220 receives one or more keywords corresponding to each shot of an image and a confidence value corresponding to the keywords from the image analysis engine. The image analysis engine may be the image analysis engine 230 described above with reference to FIG. 2 .

In operation 420, the media enrichment providing apparatus 220 determines, among one or more keywords, a keyword having a confidence value equal to or greater than a first threshold value as a candidate keyword. Among the keywords corresponding to a shot, the apparatus 220 for providing media enrichment may regard keywords having a confidence value less than the first threshold value as noise. The first threshold value may be determined based on a distribution of confidence values. For example, the first threshold value may be determined based on an average and variance of confidence values corresponding to a specific shot.

In step 430, the media enrichment providing apparatus 220 determines, among successive shots, shots in which the match ratio of the candidate keywords of each shot is equal to or greater than a second threshold value as one scene. One or more consecutive shots may be grouped together to form a scene. For example, a scene in which the main character walks down the street may be divided into several shots taken from various angles, but all of those shots belong to the same scene in which the main character is walking. A specific method for determining a scene will be described with reference to FIG. 5.

Referring to FIG. 5, shot 1 (510) to shot 3 (530) according to an embodiment are consecutive shots, and Table 1 below shows the first metadata (e.g., keywords and confidence values) received from the image analysis engine for shot 1 (510) to shot 3 (530).

Table 1
Shot 1	Confidence	Shot 2	Confidence	Shot 3	Confidence
Man	0.88	Woman	0.84	Woman	0.83
Picture frame	0.85	Man	0.84	Person	0.79
Person	0.81	Picture frame	0.78	Top	0.67
Clothing	0.69	Clothing	0.56	Gesture	0.81
Luggage & bags	0.57	Luggage & bags	0.55	Forehead	0.78
Art	0.86	Art	0.68	Finger	0.72
Visual Arts	0.82	Room	0.66	Scene	0.71
Modern Art	0.74	Event	0.63	Hand	0.70
Painting	0.74	Photography	0.62	Mouth	0.68
Organism	0.72	Conversation	0.57	Smile	0.64
Fun	0.70	Gesture	0.56	Photography	0.62
Event	0.67	Visual Arts	0.55	Black Hair	0.61
Adaptation	0.67			Conversation	0.58
Room	0.66			Jaw	0.57
Art Exhibition	0.65
Drawing	0.59
Portrait	0.57
Animation	0.56
Exhibition	0.56
Illustration	0.55
Conversation	0.54
Mural	0.53

The media enrichment providing apparatus 220 may determine, as a candidate keyword, a keyword having a confidence value equal to or greater than a first threshold value among the keywords corresponding to each shot. The first threshold value may be determined based on a distribution of confidence values. For example, the first threshold value may be determined as the confidence value corresponding to the lower 20%. Referring to Table 1, the first threshold value of shot 1 (510) may be determined as 0.565, the first threshold value of shot 2 (520) as 0.555, and the first threshold value of shot 3 (530) as 0.615. Accordingly, the keywords 'Animation', 'Exhibition', 'Illustration', 'Conversation', and 'Mural' of shot 1 (510) may be excluded from the candidate keywords, the keywords 'Luggage & bags' and 'Visual Arts' of shot 2 (520) may be excluded from the candidate keywords, and the keywords 'Black Hair', 'Conversation', and 'Jaw' of shot 3 (530) may be excluded from the candidate keywords.
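The candidate keyword selection can be sketched as a simple filter. The example below takes the first threshold value of shot 2 (0.555) as given in the text rather than deriving it from the distribution, since the exact derivation is not spelled out; the function name `candidate_keywords` is a hypothetical label, and the data are the shot 2 column of Table 1.

```python
from typing import Dict


def candidate_keywords(keywords: Dict[str, float], first_threshold: float) -> Dict[str, float]:
    """Keep only the keywords whose confidence is at or above the first threshold."""
    return {
        keyword: confidence
        for keyword, confidence in keywords.items()
        if confidence >= first_threshold
    }


# Shot 2 keywords and confidence values from Table 1.
shot2 = {
    "Woman": 0.84, "Man": 0.84, "Picture frame": 0.78, "Clothing": 0.56,
    "Luggage & bags": 0.55, "Art": 0.68, "Room": 0.66, "Event": 0.63,
    "Photography": 0.62, "Conversation": 0.57, "Gesture": 0.56, "Visual Arts": 0.55,
}

candidates = candidate_keywords(shot2, first_threshold=0.555)
excluded = sorted(set(shot2) - set(candidates))
print(len(candidates), excluded)
```

Run on the shot 2 data, this prints `10 ['Luggage & bags', 'Visual Arts']`, which agrees with the 10 candidate keywords and the two excluded keywords stated for shot 2.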

The apparatus 220 for providing media enrichment may determine, among consecutive shots, shots in which the match ratio of candidate keywords of each shot is equal to or greater than the second threshold value as one scene.

According to an embodiment, among consecutive shots, the apparatus 220 for providing media enrichment may determine, as one scene, shots in which the match ratio of candidate keywords with the previous shot is equal to or greater than the second threshold value. Referring to Table 1, in shot 2 (520), 5 out of 10 candidate keywords ('Man', 'Picture frame', 'Clothing', 'Art', 'Event') match candidate keywords of shot 1 (510). Likewise, in shot 3 (530), 3 out of 10 candidate keywords ('Woman', 'Photography', 'Gesture') match keywords of shot 2 (520). When the second threshold value is, for example, 0.5, shot 1 (510) and shot 2 (520) have a match ratio of 0.5 (5/10), which is equal to or greater than the second threshold value of 0.5, so they may be determined as shots constituting one scene. On the other hand, shot 2 (520) and shot 3 (530) have a match ratio of 0.3 (3/10), which is less than the second threshold value of 0.5, so shot 2 (520) and shot 3 (530) may be determined as shots constituting different scenes.
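A minimal sketch of this previous-shot comparison follows. It assumes the match ratio is the number of the current shot's candidate keywords that also appear in the previous shot, divided by the current shot's candidate keyword count (one reading of the 5/10 example). The function names and the short keyword sets are hypothetical and are not the Table 1 data.

```python
from typing import List, Set


def match_ratio(current: Set[str], previous: Set[str]) -> float:
    """Fraction of the current shot's candidate keywords also found in the previous shot."""
    if not current:
        return 0.0
    return len(current & previous) / len(current)


def group_into_scenes(shots: List[Set[str]], second_threshold: float) -> List[List[int]]:
    """Group consecutive shot indices into scenes by comparing each shot with its predecessor."""
    scenes: List[List[int]] = []
    for index, keywords in enumerate(shots):
        if index > 0 and match_ratio(keywords, shots[index - 1]) >= second_threshold:
            scenes[-1].append(index)   # same scene as the previous shot
        else:
            scenes.append([index])     # scene change: start a new scene
    return scenes


# Hypothetical candidate keyword sets for three consecutive shots.
shots = [
    {"Man", "Art", "Room", "Event"},
    {"Man", "Art", "Woman", "Photography"},  # 2 of 4 match the previous shot: 0.5
    {"Woman", "Hand", "Smile", "Finger"},    # 1 of 4 match the previous shot: 0.25
]
scenes = group_into_scenes(shots, second_threshold=0.5)
```

With a second threshold value of 0.5, the first two shots fall into one scene and the third shot starts a new one, so the index at which a new list begins marks a scene change time.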

According to another embodiment, the apparatus 220 for providing media enrichment may cumulatively count, over successive shots, the number of matching candidate keywords in each shot, and calculate the match ratio of the candidate keywords based on the accumulated count value. Referring to Table 1, in shot 3 (530), out of 10 candidate keywords, 1 keyword ('Person') matches a keyword of shot 1 (510) and 3 keywords ('Woman', 'Photography', 'Gesture') match keywords of shot 2 (520); accumulating these, shot 3 (530) may have a match ratio of 0.4 (4/10).
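The cumulative variant can be sketched as below. The text does not state whether a keyword that matches several earlier shots is counted once or several times; this sketch assumes it is counted once. The function name and the shortened keyword sets are hypothetical, chosen only so that the counts mirror the 1 + 3 matches out of 10 in the example.

```python
from typing import List, Set


def cumulative_match_ratio(current: Set[str], earlier_shots: List[Set[str]]) -> float:
    """Fraction of the current shot's candidate keywords found in any earlier shot."""
    if not current:
        return 0.0
    matched: Set[str] = set()
    for earlier in earlier_shots:
        matched |= current & earlier   # accumulate matches shot by shot
    return len(matched) / len(current)


# Hypothetical, shortened candidate keyword sets (not the full Table 1 columns).
earlier = [
    {"Man", "Person", "Art"},                       # contributes 'Person'
    {"Woman", "Man", "Photography", "Gesture"},     # contributes three more
]
current = {
    "Woman", "Person", "Gesture", "Photography", "Top",
    "Hand", "Smile", "Mouth", "Finger", "Scene",
}
ratio = cumulative_match_ratio(current, earlier)
```

Here 4 of the 10 keywords in `current` are matched across the earlier shots, giving a ratio of 0.4.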

Referring back to FIG. 4 , in step 440 , the apparatus 220 for providing media enrichment statistically processes confidence values corresponding to keywords of shots constituting a scene to determine a final keyword representing the scene. The media enrichment providing apparatus 220 may determine a final keyword representing a scene by removing keywords that can be viewed as noise from among keywords corresponding to one or more shots constituting a scene.

According to an embodiment, the apparatus 220 for providing media enrichment may determine a weight for each of the shots constituting a scene, and may compute a weighted sum of the confidence values corresponding to keywords of the shots constituting the scene based on the weight for each shot. Furthermore, the media enrichment providing apparatus 220 may determine a keyword whose weighted sum is equal to or greater than the third threshold value as the final keyword. The statistical processing method is not limited to the above-described weighted sum, and may include any method related to statistical processing.

The media enrichment providing apparatus 220 may determine a weight for each shot based on the importance of shots constituting the scene. For example, the media enrichment providing apparatus 220 may assign a greater weight to a shot having a higher importance.

According to an embodiment, the apparatus 220 for providing media enrichment may determine that the greater the number of candidate keywords a shot has, the higher its importance. Accordingly, the media enrichment providing apparatus 220 may determine a weight for each of the shots constituting the scene based on the number of candidate keywords included in each of those shots. For example, in Table 1, shot 1 510 has 17 candidate keywords and shot 2 520 has 10 candidate keywords, so shot 1 510 may have a weight of 0.63 (17/27) and shot 2 520 may have a weight of 0.37 (10/27). Table 2 below shows the keywords of the shots constituting the scene (e.g., shot 1 510 and shot 2 520) and the weighted sums of the corresponding confidence values according to the example above.

Scene 1
Keyword	Weighted sum
Man	0.8652
Picture frame	0.8241
Art	0.8304
Event	0.4221
Clothing	0.6382
Visual Arts	0.5166
Person	0.5103
Modern Art	0.4662
Painting	0.4662
Organism	0.4536
Fun	0.441
Adaptation	0.4221
Room	0.4158
Art Exhibition	0.4095
Drawing	0.3717
Luggage & bags	0.3591
Portrait	0.3591
Animation	0.3528
Exhibition	0.3528
Illustration	0.3465
Conversation	0.3402
Mural	0.3339
Woman	0.3108
Room	0.2442
Photography	0.2294
Conversation	0.2109
Gesture	0.2072
Luggage & bags	0.2035
Visual Arts	0.2035

Referring to Table 2, the media enrichment providing apparatus 220 may determine, as the final keywords of scene 1, the keywords ('Man', 'Picture frame', 'Art', 'Event', 'Clothing', 'Visual Arts', 'Person') for which the weighted sum of the confidence values corresponding to the keywords of shot 1 510 and shot 2 520 is equal to or greater than a predetermined third threshold value (e.g., 0.5).
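The weighting and selection steps can be sketched as below. This is an illustrative sketch under stated assumptions: each shot is represented as a dictionary from candidate keyword to confidence value, the weight of a shot is its share of the scene's candidate keywords (as in the 17/27 and 10/27 example), and the function names `shot_weights` and `final_keywords` are hypothetical. The confidence values in the usage example are made up for illustration and are not taken from Table 1 or Table 2.

```python
from collections import defaultdict
from typing import Dict, List


def shot_weights(shots: List[Dict[str, float]]) -> List[float]:
    """Weight each shot by its share of the scene's candidate keywords."""
    total = sum(len(shot) for shot in shots)
    if total == 0:
        return [0.0 for _ in shots]
    return [len(shot) / total for shot in shots]


def final_keywords(
    shots: List[Dict[str, float]], third_threshold: float = 0.5
) -> Dict[str, float]:
    """Return keywords whose weighted confidence sum reaches the third threshold."""
    sums: Dict[str, float] = defaultdict(float)
    for weight, shot in zip(shot_weights(shots), shots):
        for keyword, confidence in shot.items():
            sums[keyword] += weight * confidence  # weighted sum over the scene's shots
    return {k: s for k, s in sums.items() if s >= third_threshold}


# Usage with made-up confidence values: weights are 0.75 and 0.25.
scene = [
    {"Man": 0.9, "Art": 0.8, "Room": 0.4},
    {"Man": 0.8},
]
selected = final_keywords(scene, third_threshold=0.5)
```

Keywords that appear in only one low-weight shot, or with low confidence, fall below the threshold and are dropped, which is the noise-removal effect described for step 440.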

The third threshold value according to an embodiment may be determined based on a distribution of confidence values corresponding to keywords of shots constituting a scene. For example, the third threshold value may be determined based on an average and variance of confidence values corresponding to keywords of shots constituting a scene. Alternatively, the third threshold value may be determined based on a predetermined percentile value of confidence values corresponding to keywords of shots constituting a scene. The method of determining the third threshold value is not limited to the above example, and includes any method that may be determined based on the distribution of confidence values.
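Two possible ways of deriving the third threshold value from the distribution of confidence values are sketched below. The description does not fix a formula, so both are assumptions: the first uses the mean plus a hypothetical multiplier `k` times the population standard deviation (the square root of the variance), and the second uses a nearest-rank percentile. The function names are hypothetical.

```python
import math
import statistics
from typing import List


def threshold_from_mean_and_variance(values: List[float], k: float = 0.0) -> float:
    """Mean plus k standard deviations; k is a hypothetical tuning parameter."""
    if not values:
        raise ValueError("at least one confidence value is required")
    return statistics.fmean(values) + k * statistics.pstdev(values)


def threshold_from_percentile(values: List[float], percentile: float = 75.0) -> float:
    """Nearest-rank percentile of the confidence values."""
    if not values:
        raise ValueError("at least one confidence value is required")
    ordered = sorted(values)
    rank = max(1, math.ceil(percentile / 100 * len(ordered)))
    return ordered[rank - 1]
```

A larger `k` or a higher percentile keeps fewer final keywords per scene; either parameter would need to be tuned for a given analysis engine.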

In step 450, the apparatus for providing media enrichment receives an image analysis request for the image. For example, the apparatus for providing media enrichment may receive the image analysis request from a user who, while viewing the image, uses a remote control of the terminal to request analysis at a desired time point.

In step 460, the media enrichment providing apparatus acquires at least one media enrichment object based on the final keyword of the scene corresponding to the time point at which the image analysis request is received. For example, the media enrichment providing apparatus may generate a search term by combining the final keywords, and may obtain a media enrichment object by inputting the generated search term as a query to a database or service server of the media enrichment providing apparatus. The apparatus for providing media enrichment may generate the search term in consideration of the confidence values corresponding to the final keywords. For example, when a plurality of final keywords exist, the apparatus for providing media enrichment may select a final keyword having a high confidence value as a search term candidate in preference to the other final keywords.
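Step 460 can be sketched as a scene lookup followed by search term generation. This is a hedged illustration: the description does not say how scenes are indexed in time or how keywords are combined, so the sketch assumes a sorted list of scene start times, joins the highest-confidence final keywords with spaces, and uses the hypothetical names `scene_at`, `build_search_term`, and `max_terms`. The values in the usage example are illustrative.

```python
from bisect import bisect_right
from typing import Dict, List


def scene_at(scene_starts: List[float], request_time: float) -> int:
    """Index of the scene whose start time is the latest one not after request_time.

    scene_starts must be sorted in ascending order (seconds from the start).
    """
    return max(0, bisect_right(scene_starts, request_time) - 1)


def build_search_term(final_keywords: Dict[str, float], max_terms: int = 3) -> str:
    """Join the highest-confidence final keywords into one search term."""
    ranked = sorted(final_keywords.items(), key=lambda item: item[1], reverse=True)
    return " ".join(keyword for keyword, _ in ranked[:max_terms])


# Usage: a request at 13.0 s falls in the second scene (index 1).
starts = [0.0, 12.5, 40.0]
index = scene_at(starts, 13.0)
term = build_search_term({"Man": 0.87, "Art": 0.83, "Person": 0.51, "Clothing": 0.64})
```

The resulting string would then be sent as the query to the database or service server; that call is omitted here because its interface is not described.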

Also, the apparatus for providing media enrichment may select and provide media content or advertisement content to the user according to user preference. The media enrichment providing apparatus may receive and store user preference, and may provide a media enrichment object to the user in consideration of this.

In addition, the apparatus for providing media enrichment may determine at least one keyword based on a result of voice recognition of the sound of the scene corresponding to the time point at which the image analysis request is received, and may obtain at least one media enrichment object with further reference to the keyword determined based on the result of the voice recognition.

The embodiments described above may be implemented by a hardware component, a software component, and/or a combination of a hardware component and a software component. For example, the apparatus, methods and components described in the embodiments may be implemented using one or more general purpose or special purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. The processing device may execute an operating system (OS) and one or more software applications running on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to execution of the software. For convenience of understanding, although one processing device is sometimes described as being used, one of ordinary skill in the art will recognize that the processing device may include a plurality of processing elements and/or a plurality of types of processing elements. For example, the processing device may include a plurality of processors or one processor and one controller. Other processing configurations are also possible, such as parallel processors.

Software may comprise a computer program, code, instructions, or a combination of one or more of these, and may configure a processing device to operate as desired or may command the processing device independently or collectively. The software and/or data may be embodied, permanently or temporarily, in any kind of machine, component, physical device, virtual equipment, computer storage medium or device, or transmitted signal wave, so as to be interpreted by the processing device or to provide instructions or data to the processing device. The software may be distributed over networked computer systems, and stored or executed in a distributed manner. Software and data may be stored in one or more computer-readable recording media.

The method according to the embodiment may be implemented in the form of program instructions that can be executed through various computer means and recorded in a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, etc. alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiment, or may be known and available to those skilled in the art of computer software. Examples of the computer-readable recording medium include magnetic media such as hard disks, floppy disks and magnetic tapes, optical media such as CD-ROMs and DVDs, magneto-optical media such as floptical disks, and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, flash memory, and the like. Examples of program instructions include not only machine language codes such as those generated by a compiler, but also high-level language codes that can be executed by a computer using an interpreter or the like.

As described above, although the embodiments have been described with reference to the limited drawings, those skilled in the art may apply various technical modifications and variations based on the above. For example, an appropriate result may be achieved even if the described techniques are performed in a different order than the described method, and/or the described components of the system, structure, apparatus, circuit, etc. are coupled or combined in a different form than the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents to the claims are also within the scope of the following claims.

Claims

1. A method of providing media enrichment, the method comprising:

receiving, from an image analysis engine, one or more keywords corresponding to each shot of an image and confidence values corresponding to the keywords;

determining, among the one or more keywords, a keyword having a confidence value equal to or greater than a first threshold value as a candidate keyword;

determining, among successive shots, shots in which the match ratio of the candidate keywords of each shot is equal to or greater than a second threshold value as one scene;

determining a final keyword representing the scene by statistically processing confidence values corresponding to keywords of the shots constituting the scene;

receiving an image analysis request for the image; and

acquiring at least one media enrichment object based on the final keyword of the scene corresponding to the time point at which the image analysis request is received.
2. The method of claim 1, wherein the determining as one scene comprises:

cumulatively counting, in the successive shots, the number of matching candidate keywords in each shot; and

calculating the match ratio of the candidate keywords based on the accumulated count value.
3. The method of claim 1, wherein the first threshold value is determined based on a distribution of the confidence values.
4. The method of claim 1, wherein the determining of the final keyword comprises:

determining a weight for each shot of the shots constituting the scene;

weighted summing confidence values corresponding to keywords of the shots constituting the scene based on the weight for each shot; and

determining a keyword whose weighted sum is equal to or greater than a third threshold value as the final keyword.
5. The method of claim 4, wherein the determining of the weight for each shot comprises:

determining a weight for each shot of the shots constituting the scene based on the number of candidate keywords included in each of the shots constituting the scene.
6. The method of claim 4, wherein the third threshold value is determined based on a distribution of the confidence values corresponding to the keywords of the shots constituting the scene.
7. The method of claim 4, wherein the weighted summing comprises:

weighted summing confidence values corresponding to the candidate keywords of the shots constituting the scene based on the weight.
8. The method of claim 1, further comprising:

rendering the at least one media enrichment object so as to overlap the image.
9. The method of claim 1, wherein the acquiring of the at least one media enrichment object comprises:

inputting the final keyword as a query to at least one service server; and

obtaining the at least one media enrichment object corresponding to the query from the service server.
10. The method of claim 1, wherein the at least one media enrichment object comprises at least one of a photo, text, sound, an external link, social link information, advertisement content, and a thumbnail hyperlink of associated, playable separate content.
11. The method of claim 1, further comprising:

determining at least one keyword based on a result of voice recognition of a sound of the scene corresponding to the time point at which the image analysis request is received,

wherein the acquiring of the at least one media enrichment object comprises:

obtaining the at least one media enrichment object with further reference to the keyword.
12. The method of claim 1, wherein the image analysis engine comprises an external image analysis engine that receives the image and generates the keywords and the confidence values corresponding to the keywords.
13. A computer program stored on a medium in combination with hardware to execute the method of any one of claims 1 to 12.
14. An apparatus for providing media enrichment, the apparatus comprising a processor configured to: receive, from an image analysis engine, one or more keywords corresponding to each shot of an image and confidence values corresponding to the keywords; determine, among the one or more keywords, a keyword having a confidence value equal to or greater than a first threshold value as a candidate keyword; determine, among successive shots, shots in which the match ratio of the candidate keywords of each shot is equal to or greater than a second threshold value as one scene; determine a final keyword representing the scene by statistically processing confidence values corresponding to keywords of the shots constituting the scene; receive an image analysis request for the image; and acquire at least one media enrichment object based on the final keyword of the scene corresponding to the time point at which the image analysis request is received.
15. The apparatus of claim 14, wherein the processor is configured to cumulatively count, in the successive shots, the number of matching candidate keywords in each shot, and to calculate the match ratio of the candidate keywords based on the accumulated count value.
16. The apparatus of claim 14, wherein the first threshold value is determined based on a distribution of the confidence values.
17. The apparatus of claim 14, wherein the processor is configured to determine a weight for each shot of the shots constituting the scene, to weight and sum confidence values corresponding to keywords of the shots constituting the scene based on the weight for each shot, and to determine a keyword whose weighted sum is equal to or greater than a third threshold value as the final keyword.
18. The apparatus of claim 17, wherein the processor is configured to determine a weight for each shot of the shots constituting the scene based on the number of candidate keywords included in each of the shots constituting the scene.
19. The apparatus of claim 17, wherein the third threshold value is determined based on a distribution of the confidence values corresponding to the keywords of the shots constituting the scene.
20. The apparatus of claim 17, wherein the processor is configured to weight and sum, based on the weight, the confidence values corresponding to the candidate keywords of the shots constituting the scene.
21. The apparatus of claim 14, wherein the processor is configured to render the at least one media enrichment object so as to overlap the image.
22. The apparatus of claim 14, wherein the processor is configured to input the final keyword as a query to at least one service server, and to obtain the at least one media enrichment object corresponding to the query from the service server.
23. The apparatus of claim 14, wherein the processor is configured to determine at least one keyword based on a result of voice recognition of a sound of the scene corresponding to the time point at which the image analysis request is received, and to obtain the at least one media enrichment object with further reference to the keyword.