US20110222782A1 - Information processing apparatus, information processing method, and program - Google Patents

Information processing apparatus, information processing method, and program Download PDF

Info

Publication number
US20110222782A1
Authority
US
United States
Prior art keywords
video
characteristic amounts
scene
characteristic
unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
US13/038,625
Other versions
US8731307B2 (en)
Inventor
Akifumi Kashiwagi
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sony Corp
Original Assignee
Sony Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sony Corp filed Critical Sony Corp
Assigned to SONY CORPORATION reassignment SONY CORPORATION ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: KASHIWAGI, AKIFUMI
Publication of US20110222782A1
Application granted
Publication of US8731307B2
Status: Expired - Fee Related
Adjusted expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/40: Scenes; Scene-specific elements in video content

Definitions

  • FIG. 7 is a flowchart showing the labeling process for characteristic amounts.
  • First, each scene in which a face has been detected is labeled with the name of the detected person (step S21).
  • The label names do not need to be the proper names of the people; any information that serves as a unique identifier indicating a specific person can be used.
  • Next, a movement change pattern of a person, obtained by carrying out detection of body movements using the detected face, is labeled with the same name as the face in question (step S22).
  • It is then verified whether the voiceprint obtained by the audio analyzing unit 126 is that of the person whose name has been assigned as the label to the face and/or body movements in the scene (step S23).
  • The method disclosed in Japanese Laid-Open Patent Publication No. 2009-278180 is used to recognize the voiceprint.
  • If, as a result of verification, the voiceprint obtained for a scene matches the person indicated by the label (step S24), the label is assigned to the voiceprint (step S26). Meanwhile, if the voiceprint has been recognized as that of another person (step S24), the voiceprint is labeled as background sound and is excluded from subsequent processing (step S25). By doing so, it is possible to reduce the amount of processing in the subsequent similarity judging process.
  • If BGM has been detected as in FIG. 6, the BGM can be labeled as a characteristic amount in the same way as a voiceprint and used when judging similarity.
  • It is also verified whether the spoken content obtained from speech recognition by the audio analyzing unit 126 or from subtitle information produced by the metadata analyzing unit 124 is speech by the person whose name has been used as a label (step S27). If the result of speech recognition by the audio analyzing unit 126 is used as the spoken content, a voiceprint can be extracted at the same time by carrying out voiceprint recognition, so when a person can be specified from the voiceprint, the person to whom the spoken content belongs can easily be specified.
  • For spoken content obtained from subtitle information, the person to whom the spoken content belongs can be specified by comparing the speech timing that accompanies the subtitle information against the time in the scene at which lip movements have been detected using facial recognition by the image analyzing unit 122.
  • The method of associating a speaker and a spoken content will be described later.
  • If, as a result of verification, the spoken content has been recognized as speech of the person in question (step S28), the present label is assigned to the spoken content (step S30). Conversely, if the spoken content has been recognized as speech of another person, the spoken content is labeled as background sound, and subsequent processing is not carried out (step S29). This completes the labeling process for characteristic amounts.
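  • The decision flow of steps S23 to S30 can be summarized by the following sketch. The two callback arguments stand in for the voiceprint recognition and speech/subtitle verification described above and are hypothetical names introduced only for illustration.

```python
def label_audio_features(person_label, voiceprint, spoken_content,
                         voiceprint_matches, speech_matches):
    """Assign the person's label to audio characteristic amounts, or mark
    them as background sound (steps S23-S30). The two *_matches arguments
    are hypothetical callbacks standing in for voiceprint recognition and
    speech/subtitle verification."""
    labels = {}
    # Steps S23-S26: the voiceprint belongs to the labeled person, or is background.
    if voiceprint_matches(voiceprint, person_label):
        labels["voiceprint"] = person_label
    else:
        labels["voiceprint"] = "background"   # excluded from later similarity checks
    # Steps S27-S30: the same decision for the spoken content (dialog).
    if speech_matches(spoken_content, person_label):
        labels["dialog"] = person_label
    else:
        labels["dialog"] = "background"
    return labels
```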
  • FIG. 8 is a diagram useful in showing the associating of a speaker and a spoken content.
  • Spoken content from subtitle information obtained by the metadata analyzing unit 124 is assigned to each scene based on the accompanying timing information.
  • Scenes in which lip movements have been detected by the image analyzing unit 122 and the assigned subtitle information are then placed together on the time axis. By doing so, it is possible to specify which spoken content was spoken by which person.
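  • One way to realize this association is to check, for each subtitle interval, which person's detected lip movements overlap it for the longest time. The sketch below assumes subtitle entries of the form (start, end, text) and lip-movement detections of the form (start, end, person); these data shapes are assumptions made for illustration.

```python
def associate_speakers(subtitles, lip_movements):
    """Match each subtitle (start, end, text) to the person whose detected
    lip movements (start, end, person) overlap it longest on the time axis."""
    results = []
    for s_start, s_end, text in subtitles:
        best_person, best_overlap = None, 0.0
        for l_start, l_end, person in lip_movements:
            overlap = min(s_end, l_end) - max(s_start, l_start)
            if overlap > best_overlap:
                best_person, best_overlap = person, overlap
        results.append((text, best_person))   # None if no lips moved during the subtitle
    return results
```

  • Choosing the longest overlap mirrors the idea of placing the lip-movement scenes and the subtitle timing together on the time axis and picking the speaker whose movements best cover the subtitle.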
  • The following data is stored as the characteristic amounts.
  • Video/Scene Relationship Storing Database 106
  • The similarity of every characteristic amount is then measured between the scenes in which the matching characteristic amount is present. Based on such measurements, the relationship between two videos or scenes is decided.
  • Face: similarity for the faces of characters is determined between scenes from the contours and/or color composition ratios of the faces that have been detected.
  • Movement: similarity for the movements of characters is determined between scenes from changes in posture on a time axis.
  • Voiceprint: similarity for the voices of characters is determined between scenes from the frequency distribution of the audio.
  • BGM: similarity for BGM is determined between scenes from audio information that plays for a certain period.
  • Dialog: similarity for the dialog of characters is determined between scenes from the voiceprint and the subtitles and/or spoken content.
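  • For concreteness, the per-scene characteristic amounts in the five categories above could be gathered into a record such as the following sketch; the field names and types are illustrative assumptions rather than the storage format actually used by the database 106.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SceneFeatures:
    """Characteristic amounts extracted from one scene (illustrative types)."""
    label: str              # person label assigned by the labeling unit
    face: np.ndarray        # contour components and color composition of the face
    movement: np.ndarray    # endpoint positions p(t, n) over the scene
    voiceprint: np.ndarray  # frequency distribution of the person's voice
    bgm: np.ndarray         # background sound / theme music in the scene
    dialog: str             # spoken content (subtitles or speech recognition)
```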
  • (1) A face is the same or is similar.
  • (2) A movement pattern is the same or is similar.
  • (3) A voiceprint is the same or is similar.
  • (4) Dialog is the same or is similar.
  • Evaluation may be carried out as described below to see whether the degree of similarity between two videos or scenes is zero or below for such characteristic amounts (that is, such amounts are not similar), or whether the degree of similarity is larger than a threshold set in advance.
  • The relationship between two videos or scenes is judged in view of the overall similarity for all of the characteristic amounts in each scene.
  • If the characteristic amounts of a face are given priority, it is possible to classify two scenes as belonging to the same series or as being different videos in which the same person appears.
  • However, if the characteristic amounts of a face are given priority, it may not be possible to spot the relationship between videos where the characteristic amounts aside from the face are the same but the faces are different.
  • For example, if the dialog or voice is the same but the face is different, it may not be possible to determine similarity by carrying out processing that gives priority to the characteristic amounts of a face.
  • When the dialog or voice is the same but the face is different, it can be assumed that there is an "impersonation"-type relationship, in which what is actually a different person is doing an imitation or the like. For this reason, when inferring a relationship between videos, it is not considered preferable to limit the processing to scenes where the result of facial recognition (that is, the label) is the same.
  • FIG. 10 is a diagram useful in showing the method of judging similarity for the characteristic amounts.
  • Characteristic amounts in a scene are extracted from scene a in video A that is being processed and passed to the database 106.
  • The scenes to be processed are scenes in which faces have been detected by a typical facial detection method.
  • Face information of a person, gestures (a movement pattern within a scene), a voiceprint of the face in question, BGM in the scene (background sound in the scene produced by excluding the voice of the person in question), and dialog (subtitle information) can be given as five examples of the characteristic amounts extracted from each scene. Note that the characteristic amounts are not necessarily limited to these five examples and it is also possible to use other characteristic amounts.
  • The extracted characteristic amounts are registered in the database 106 by the present system ((2) in FIG. 10). At the same time, a degree of similarity is calculated between the extracted characteristic amounts and characteristic amounts extracted from other videos (scenes) that are already registered in the database 106 ((3) in FIG. 10).
  • A judgment of similarity for faces compares the contours of faces as well as color information.
  • The two-dimensional image plane is expressed by x and y. The two-dimensional contour information of the faces in the scenes A and B is expressed as F_l(A(x,y)) and F_l(B(x,y)), and the two-dimensional color information is expressed as F_c(A(x,y)) and F_c(B(x,y)).
  • The similarity R_F(A,B) for faces in the scenes A and B is expressed as shown in Equation 2 below.
  • R_F(A,B) = u · Σ_{x,y} [1 − |F_l(A(x,y)) − F_l(B(x,y))|] / (L_MAX · F_s(B)) + (1 − u) · Σ_{x,y} [1 − |F_c(A(x,y)) − F_c(B(x,y))|] / (C_MAX · F_s(B))   (Equation 2)
  • L_MAX and C_MAX express the respective maximum values for the contour information and the color information.
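  • Equation 2 can be transcribed almost directly into code. The sketch below assumes that u is a weighting coefficient between the contour and color terms, that F_s(B) is the area (pixel count) of the face region in scene B, and that the contour and color arrays have already been aligned to the same size; none of these assumptions is stated explicitly in the text above.

```python
import numpy as np

def face_similarity(contour_a, contour_b, color_a, color_b,
                    u=0.5, l_max=255.0, c_max=255.0):
    """Similarity R_F(A, B) in the form of Equation 2.

    The default values of u, l_max and c_max are illustrative assumptions,
    and the contour/color arrays are assumed to be aligned to the same
    shape and scaled to [0, l_max] and [0, c_max] respectively.
    """
    area_b = contour_b.size                      # stands in for F_s(B)
    contour_term = np.sum(1.0 - np.abs(contour_a - contour_b)) / (l_max * area_b)
    color_term = np.sum(1.0 - np.abs(color_a - color_b)) / (c_max * area_b)
    return u * contour_term + (1.0 - u) * color_term
```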
  • Judgment of similarity for voiceprints is carried out by comparing the frequency distributions of voices.
  • The similarity R_V(A,B) for voiceprints in the scenes A and B is expressed as shown in Equation 3, in which F_MAX and D_MAX respectively express a frequency maximum value and a value for normalizing sound.
  • A judgment of similarity for gestures detects five endpoint positions (i.e., the head and both hands and feet) of a body using an existing body movement detection method and measures and compares the movement loci of the respective endpoints within scenes.
  • An endpoint number is expressed as n, the position of endpoint n at a time t is expressed as p(t,n), and the movement vector of endpoint n_0 from a time t_0 to another time t_1 is expressed as (p(t_1,n_0) − p(t_0,n_0)).
  • The default position for the endpoints uses, as the standard, a state where the face is facing to the front and the midline between both eyes is perpendicular to the horizontal. This means that it is possible to estimate the posture of a person based on the inclination of the detected face to the horizontal and the vertical and to calculate the endpoint positions in three dimensions.
  • Here, DIM expresses the number of dimensions, T_MAX expresses the length (in time) of the scenes being compared, and N_MAX expresses the number of endpoints being compared.
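  • The gesture similarity equation itself (Equation 4) does not appear in this text, so the sketch below is only one plausible reading of the definitions above: movement vectors of corresponding endpoints are compared and their differences are normalized into a value between zero and one. The normalization constant pos_max is an assumption.

```python
import numpy as np

def gesture_similarity(p_a, p_b, pos_max=1.0):
    """Compare movement loci of endpoints between scenes A and B.

    p_a and p_b are arrays of shape (T_MAX, N_MAX, DIM) holding endpoint
    positions p(t, n). pos_max normalizes position differences and is an
    illustrative assumption; the patent's Equation 4 is not shown here.
    """
    move_a = np.diff(p_a, axis=0)                # movement vectors p(t+1, n) - p(t, n)
    move_b = np.diff(p_b, axis=0)
    diff = np.abs(move_a - move_b) / pos_max     # per-dimension normalized difference
    return float(np.mean(1.0 - np.clip(diff, 0.0, 1.0)))
```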
  • The judgment of similarity for dialog is carried out by text matching of the spoken content in the two scenes.
  • Here, S_MAX expresses the length of the character strings to be compared.
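  • Since Equation 5 is likewise not reproduced in this text, the sketch below approximates the text-matching step with Python's difflib, taking S_MAX as the length of the longer of the two character strings; this is an illustrative substitute rather than the patent's actual formula.

```python
from difflib import SequenceMatcher

def dialog_similarity(dialog_a, dialog_b):
    """Normalized text match between the spoken contents of two scenes.

    S_MAX is taken as the length of the longer string; the longest common
    block found by difflib is an illustrative substitute for Equation 5.
    """
    if not dialog_a or not dialog_b:
        return 0.0
    s_max = max(len(dialog_a), len(dialog_b))
    match = SequenceMatcher(None, dialog_a, dialog_b).find_longest_match(
        0, len(dialog_a), 0, len(dialog_b))
    return match.size / s_max
```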
  • The judgment of similarity for BGM is carried out by measuring the amount of time for which the same continuous playback sound is included in both scenes.
  • The BGM waveforms or melodies obtained from scenes A and B at a time t are respectively expressed as g_A(t) and g_B(t).
  • A function that measures the correlation between g_A(t) and g_B(t) is expressed as R(g_A(t), g_B(t)), and a function that selects the longest region out of the regions for which high correlation has been obtained is expressed as L_r({t | R(g_A(t), g_B(t))}).
  • The similarity R_G(A,B) for BGM in the scenes A and B is expressed as shown in Equation 6 below.
  • R_G(A,B) = L_r({t | R(g_A(t), g_B(t))}) / T_MAX   (Equation 6)
  • T_MAX expresses the length (in time) of the scenes being compared.
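  • Following the reconstruction of Equation 6 above, the sketch below measures a windowed correlation between g_A(t) and g_B(t), keeps the longest run of windows with high correlation, and normalizes by T_MAX. The window length and correlation threshold are illustrative assumptions.

```python
import numpy as np

def bgm_similarity(g_a, g_b, window=1024, threshold=0.8):
    """R_G(A, B): longest region of high correlation between g_A(t) and
    g_B(t), divided by T_MAX. Window size and threshold are assumptions."""
    t_max = min(len(g_a), len(g_b))
    longest, current = 0, 0
    for start in range(0, t_max - window, window):
        a = g_a[start:start + window]
        b = g_b[start:start + window]
        corr = np.corrcoef(a, b)[0, 1]           # R(g_A(t), g_B(t)) over this window
        current = current + window if corr > threshold else 0
        longest = max(longest, current)          # L_r(...): longest high-correlation run
    return longest / t_max if t_max else 0.0
```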
  • Examples of combinations of the similarities calculated for the respective categories and the relationships between scenes based on such similarities are shown in FIG. 12.
  • If the characteristic amounts match completely, the similarity is one, and if the characteristic amounts are completely different, the similarity is zero. Since the degrees of similarity that are actually calculated are arbitrary values in a range of zero to one, inclusive, the following is not an exhaustive list of the relationships between scenes that may be determined.
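  • Because the table of FIG. 12 is not reproduced in this text, the rule set below is only a hypothetical illustration of how combinations of per-characteristic-amount similarities might be mapped onto the kinds of relationships mentioned in this description (same series, same person in different videos, impersonation, parody); the thresholds are arbitrary.

```python
def infer_relationship(sim, high=0.8, low=0.2):
    """Map per-category similarities to a relationship label.

    sim is a dict with keys 'face', 'movement', 'voiceprint', 'bgm' and
    'dialog' holding values in [0, 1]. The rules and thresholds are a
    hypothetical illustration, not the actual table of FIG. 12.
    """
    if all(v >= high for v in sim.values()):
        return "same scene or same video"
    if sim["face"] >= high and sim["voiceprint"] >= high:
        return "same person (same series or different video)"
    if sim["face"] < low and (sim["dialog"] >= high or sim["voiceprint"] >= high):
        return "impersonation or parody"
    if sim["dialog"] >= high and sim["movement"] >= high:
        return "parody or remake with a different cast"
    return "no clear relationship"
```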
  • By labeling characteristic amounts extracted from a video, it is possible to store the characteristic amounts of a person who appears in the video as data, and based on such characteristic amounts, it is possible to evaluate whom people appearing in another video resemble and which parts are similar.

Abstract

There is provided an information processing apparatus including: a characteristic amount extracting unit extracting a plurality of characteristic amounts, which are information expressing characteristics of a video, from the video; a labeling unit associating the extracted characteristic amounts with a person or a background; a matching degree judging unit judging a degree of matching between the associated characteristic amounts and the characteristic amounts of at least one other video; a comparing unit comparing the plurality of characteristic amounts of one scene in the video from which the characteristic amounts have been extracted and the plurality of characteristic amounts of one scene in the at least one other video; and a relationship inferring unit inferring a relationship between the one scene in the video and the one scene in the at least one other video based on a comparison result of the comparing unit.

Description

    BACKGROUND OF THE INVENTION
  • 1. Field of the Invention
  • The present invention relates to an information processing apparatus, an information processing method, and a program.
  • 2. Description of the Related Art
  • In recent years, advances have been made in the speed and capacity of data transfer on networks. One result is that a great amount of video has become easily available to a huge number of users at certain locations on networks. It has also become possible for many users to upload their own videos and share such content with others using video sharing sites. Although it has become easy to share large files such as videos, there is now a vast amount of content available, which makes it difficult for users to find content that matches their preferences and/or to judge whether content that has been uploaded is illegal.
  • One conceivable solution would be to introduce a system that inputs a video being viewed by the user and/or retail content already owned by the user and outputs other content that is similar. By using this type of system, it would be possible to automatically recommend other content and judge whether uploaded content is illegal, thereby making it unnecessary to manually manage a vast amount of content.
  • A number of inventions that relate to judging similarity between videos have been disclosed as listed below. For example, the first five of the following references relate to methods of measuring similarity between video images using information obtained from an “image part” of a video.
    • Japanese Laid-Open Patent Publication No. 2002-203245 (Japanese Patent No. 3711022)
    • Japanese Laid-Open Patent Publication No. 2006-285907
    • Japanese Laid-Open Patent Publication No. 2009-147603
    • Japanese Laid-Open Patent Publication (Translated PCT Application) No. 2006-514451
    • Japanese Laid-Open Patent Publication No. 2002-32761 (Japanese Patent No. 3636674)
    • Japanese Laid-Open Patent Publication No. 2009-70278
    • Japanese Laid-Open Patent Publication No. H11-308581 (Japanese Patent No. 3569441)
  • The technologies listed above use movement recognition or luminance measuring means to measure similarity between different videos based on characteristic amounts obtained from an “image part” of the videos. However, since similarity is judged only for the image parts, such technologies can fundamentally only be used to carry out judgments of similarity for videos where the content of the image part is substantially the same.
  • In PCT Application WO2004061711, videos with images that have similar transitions or pictures are recommended to the user, so that the system can be described as a content recommending system that is dedicated to situations where video images are shot with a fixed camera and a distinct pattern is present in the image part, as in video images of sports such as tennis. It is therefore doubtful that such system would be as effective when making recommendations for all types of videos.
  • In addition, with all of the methods listed above, since no reference is made to the content of the videos, none of the methods is suited to recommending videos, such as parodies, where the content is similar but the pictures are different, or to discovering illegal videos where the image part is different but only the "audio part" corresponds to non-permitted use of commercial material.
  • As other methods, Japanese Laid-Open Patent Publication No. 2009-70278 measures similarity between video images using “comments” and Japanese Laid-Open Patent Publication No. 11-308581 (Japanese Patent No. 3569441) measures similarity by searching “text in a program guide” that accompanies programs.
  • Publication No. 2009-70278 extracts words referred to as “characteristic words” from the content of comments that accompany each video and measures the similarity between videos by comparing the distribution of the sets of obtained characteristic words. The premise here is that a plurality of comments have been assigned to at least a plurality of scenes in all of the videos to be compared. This means that the ability to specify similar videos and the precision when doing so are dependent on the number of comments assigned to the videos being compared. Although it is assumed that there is a high probability of preferred characteristic words being included in comments, since the content of comments is fundamentally freely chosen by users, there is no guarantee that preferred characteristic words will be included. Meanwhile, it is not realistic to implement limitations over the comments that can be made by users.
  • Publication No. H11-308581 uses program guide information that accompanies a program to measure similarity between a program being viewed by the user and programs that the user can view and that have been assigned a program guide and recommends programs with high similarity to the program being viewed. With this method, videos are recommended based on information that accompanies programs. However, program guides are merely summaries of programs provided by the respective suppliers of videos. Also, in the same way as with Publication No. 2009-70278, since there are no limitations on how such information is written, the number of similar videos that can be discovered by this method is extremely limited, and as a result it is thought difficult to make sufficient recommendations given the great amount of content that is available. Also, since there is fundamentally a one-to-one relationship between program guides and programs, it is not possible with this method to judge similarity between videos in units of scenes.
  • SUMMARY OF THE INVENTION
  • The present invention was conceived in view of the problem described above and aims to provide a novel and improved information processing apparatus, an information processing method, and a program that enable wide and flexible searches of multimedia content with similar characteristics.
  • According to an embodiment of the present invention, there is provided an information processing apparatus including a characteristic amount extracting unit extracting a plurality of characteristic amounts, which are information expressing characteristics of a video, from the video, a labeling unit associating the extracted characteristic amounts with a person or a background, a matching degree judging unit judging a degree of matching between the associated characteristic amounts and the characteristic amounts of at least one other video, a comparing unit comparing the plurality of characteristic amounts of one scene in the video from which the characteristic amounts have been extracted and the plurality of characteristic amounts of one scene in the at least one other video, and a relationship inferring unit inferring a relationship between the one scene in the video and the one scene in the at least one other video based on a comparison result of the comparing unit.
  • The matching degree judging unit may judge, for the associated characteristic amounts, a degree of matching with the characteristic amounts of the at least one other video that have been recorded in a storage unit. The comparing unit may be operable, when it has been judged using at least one threshold that at least one of the associated characteristic amounts matches at least one of the characteristic amounts of another video, to compare the plurality of characteristic amounts of one scene in the video and the plurality of characteristic amounts of one scene in the other video.
  • The characteristic amount extracting unit may extract a plurality of characteristic amounts for each scene in the video.
  • The characteristic amount extracting unit may be operable when at least one similar characteristic amount is obtained from a plurality of scenes in the video, to assign index information showing that the at least one characteristic amount is similar for the plurality of scenes.
  • As the characteristic amounts, the characteristic amount extracting unit may recognize a face of a person and detect body movements of the person, and the labeling unit may associate the face and the body movements with the person and gather together the associated characteristic amounts for each person.
  • According to an embodiment of the present invention, there is provided an information processing method including steps of extracting a plurality of characteristic amounts, which are information expressing characteristics of a video, from the video, associating the extracted characteristic amounts with a person or a background, judging a degree of matching between the associated characteristic amounts and the characteristic amounts of at least one other video, comparing the plurality of characteristic amounts of one scene in the video from which the characteristic amounts have been extracted and the plurality of characteristic amounts of one scene in the at least one other video, and inferring a relationship between the one scene in the video and the one scene in the at least one other video based on a comparison result of the comparing step.
  • According to an embodiment of the present invention, there is provided a program causing a computer to carry out steps of extracting a plurality of characteristic amounts, which are information expressing characteristics of a video, from the video, associating the extracted characteristic amounts with a person or a background, judging a degree of matching between the associated characteristic amounts and the characteristic amounts of at least one other video, comparing the plurality of characteristic amounts of one scene in the video from which the characteristic amounts have been extracted and the plurality of characteristic amounts of one scene in the at least one other video, and inferring a relationship between the one scene in the video and the one scene in the at least one other video based on a comparison result of the comparing step.
  • According to the embodiments of the present invention described above, it is possible to carry out wide and flexible searches of multimedia content with similar characteristics.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • FIGS. 1A to 1D are diagrams useful in showing the overall processing of an information processing apparatus 100 according to an embodiment of the present invention;
  • FIG. 2 is a block diagram showing an information processing system including the information processing apparatus 100 according to the present embodiment;
  • FIG. 3 is a flowchart showing a processing operation of the information processing apparatus 100 according to the present embodiment;
  • FIGS. 4A to 4G are diagrams useful in showing an example of high-frequency components that are a characteristic of a detected face and a color composition of the face;
  • FIGS. 5A and 5B are diagrams useful in showing an example of detection of a gesture/movement pattern;
  • FIG. 6 is a diagram showing detection of a theme song (theme BGM) of a video;
  • FIG. 7 is a flowchart showing a labeling process for characteristic amounts;
  • FIG. 8 is a diagram useful in showing how speakers and spoken contents are associated;
  • FIG. 9 is a diagram useful in showing the relationship between a time line of a video and extracted characteristic amounts;
  • FIG. 10 is a diagram useful in showing a similarity judging method for characteristic amounts;
  • FIG. 11 is a table showing characteristic amounts that are classified into three categories; and
  • FIG. 12 is a table showing combinations of degrees of similarity and relationships between scenes.
  • DETAILED DESCRIPTION OF THE EMBODIMENTS
  • Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the appended drawings. Note that, in this specification and the appended drawings, structural elements that have substantially the same function and structure are denoted with the same reference numerals, and repeated explanation of these structural elements is omitted.
  • The following description is given in the order indicated below.
  • 1. Overview of Processing by Information Processing Apparatus 100
  • 2. Information Processing System According to the Present Embodiment
  • 3. Flow of Processes
  • 4. Types of Characteristic Amounts Extracted from a Video and Method of Extraction
  • 5. Labeling of Characteristic Amounts
  • 6. Associating Speaker and Spoken Content
  • 7. Storage of Characteristic Amounts and Method of Judging Similarity
  • 8. Method of Inferring Relationship between Videos
  • 9. Method of Judging Similarity for Characteristic Amounts
  • 10. Effects of the Present Embodiment
  • 1. Overview of Processing by Information Processing Apparatus 100
  • The present embodiment provides a method that extracts various characteristic amounts from video, audio, and the like in videos being processed and, based on such characteristic amounts, finds and presents videos and scenes with similar characteristics using a more flexible method than in the background art. By doing so, as examples, it is possible to associate an original and a parody version together and to recommend a live-action video version of an animated video that is being viewed by the user.
  • Also, in the present embodiment, characteristic amounts obtained from videos can be aggregated for each character and registered in a database 106. After this, by using the database 106, it is possible to distinguish characters in unknown videos and to associate videos that use the same characters.
  • In the present embodiment, as examples of the characteristic amounts, information on the face of a character (such as high-frequency components and skin color composition), audio information (such as a voiceprint of the character and BGM), subtitle information, and gestures (movement change patterns on a time axis) are extracted from video content and used.
  • The present embodiment mainly carries out the processing described below.
  • 1. The characteristic amounts described above are extracted from every scene in a video and such characteristic amounts are judged and labeled as belonging to a character or to the background.
    2. After the processing in (1.), if the labeled characteristic amounts are judged to match or be similar to the characteristic amounts of another video already registered in the database 106, the similarity between the scenes that caused such judgment is then determined for each characteristic amount.
    3. In accordance with the result of the judgments of similarity in (2.), an overall similarity for the two videos is calculated and the relationship between the two videos is inferred.
  • The flow of the processing in the present embodiment will now be described with reference to FIGS. 1A to 1D. FIGS. 1A to 1D are diagrams useful in explaining the overall processing of an information processing apparatus 100 according to the present embodiment.
  • In the present embodiment, first characteristic amounts that are decided in advance are extracted from a video being processed. For example, as shown in FIG. 1A, characteristic amounts 1 to 3 are extracted from video A. The extracted characteristic amounts are individually judged and labeled as belonging to a character or to the background.
  • It is then verified whether the labeled characteristic amounts are similar to characteristic amounts that belong to another video present in the database 106. As one example, as shown in FIG. 1A, it is verified whether the characteristic amount 1 of video A is similar to the characteristic amount 1 of video B.
  • If characteristic amounts that match or are similar are present in the database 106, the similarity for all of the characteristic amounts is then judged for the scenes that caused such judgment of similarity. For example, as shown in FIG. 1B, similarity is judged for the characteristic amounts 2 and 3 aside from the characteristic amount 1 between scene A of video A and scene B of video B. As a result of judging the similarity, a relationship between scene A and scene B is obtained (FIG. 1C).
  • After this, based on the judgment result for similarity for the respective characteristic amounts, an overall judgment of similarity is made for the two videos. At the same time, a relationship between the two videos is inferred with reference to the similarity for each characteristic amount (see FIG. 1D).
  • An information processing system according to the present embodiment, the flow of the respective processes, the types of characteristic amounts extracted from the videos and the method of extraction, the method of judging similarity between the characteristic amounts, and the method of inferring a relationship between videos will now be described.
  • 2. Information Processing System According to the Present Embodiment
  • An information processing system that includes the information processing apparatus 100 according to the present embodiment is shown in FIG. 2. FIG. 2 is a block diagram showing the information processing system including the information processing apparatus 100 according to the present embodiment.
  • The information processing apparatus 100 according to the present embodiment includes a central processing unit 102, a temporary storage unit 104, a database (storage apparatus) 106, a facial recognition database 112, a decoder/encoder 114, a voice recognition database 116, an image analyzing unit 122, a metadata analyzing unit 124, an audio analyzing unit 126, and the like. The information processing apparatus 100 may be used having been incorporated into a household video recording appliance.
  • The information processing apparatus 100 receives videos from a video sharing site and/or a video image providing apparatus 20, such as a household video recording appliance or a television broadcasting station, decodes or encodes a video stream as necessary, and then divides the video stream into an image part, an audio part, and a metadata part.
  • The image analyzing unit 122, the audio analyzing unit 126, and the metadata analyzing unit 124 receive the divided stream as appropriate and extract characteristic amounts of the video.
  • The central processing unit 102 carries out a process that receives the extracted characteristic amounts and accumulates the characteristic amounts in the temporary storage unit 104 and/or stores the characteristic amounts in the database 106. The central processing unit 102 outputs, via a display apparatus 30, statistical information on the characteristic amounts accumulated in the temporary storage unit 104 and/or information obtained as a result of carrying out the process that stores the characteristic amounts in the database 106. The central processing unit 102 also has an environment that is capable of acquiring information relating to characteristic amounts as necessary from a network 10.
  • 3. Flow of Processes
  • The processing flow in the present embodiment is shown in FIG. 3. FIG. 3 is a flowchart showing processing operations of the information processing apparatus 100 according to the present embodiment.
  • First, a video is inputted (step S11). Information (“characteristic amounts”) expressing characteristics of the video in every scene is then extracted from the inputted video (step S12). The extraction of characteristic amounts is carried out by the image analyzing unit 122, the audio analyzing unit 126, and the metadata analyzing unit 124 shown in FIG. 2. The image analyzing unit 122, the audio analyzing unit 126, and the metadata analyzing unit 124 are one example of a “characteristic amount extracting unit” for the present invention. FIG. 9 shows the relationship between a timeline of a video and characteristic amounts that are extracted.
  • The image analyzing unit 122 has a typical facial recognition function and body movement recognition function, and mainly extracts high-frequency components of a face, the color and distribution of the face, movements, a person specified using facial recognition, and the color and distribution of the body as necessary. The facial recognition database 112 has a dictionary that is generated in advance, and is used when specifying people using facial recognition.
  • The audio analyzing unit 126 includes an audio information (frequency characteristic) extracting function and extracts mainly a voiceprint (frequency distribution) of a person, volume, and sections where the frequency distribution sharply changes from the audio information of a video. The audio information (frequency characteristic) extracting function is capable of using the technology disclosed in the specification of Japanese Laid-Open Patent Publication No. 2009-278180, for example. Also, if the audio analyzing unit 126 has a speech recognition (voice recognition) function, a spoken content is extracted as a characteristic amount. The voice recognition database 116 has a dictionary that is generated in advance, and is used to specify a person via extraction of voice information.
  • The metadata analyzing unit 124 extracts mainly subtitle information from metadata that accompanies a video. If the title of the video is included in the obtained metadata, the title is also extracted as necessary as a characteristic amount. If the names of characters are included in the obtained metadata, the metadata analyzing unit 124 refers as necessary via the central processing unit 102 to facial images on the network 10 based on the names of the characters and registers composition information of the faces of the people in question in the facial recognition database 112.
  • Next, a labeling unit of the central processing unit 102 specifies to which person the extracted characteristic amounts belong or whether the characteristic amounts do not belong to any person (step S13). The method of labeling the characteristic amounts in this process will be described later.
  • After this, a match judging unit of the central processing unit 102 confirms whether data with similar values to the characteristic amounts of each scene that have been labeled is present in the database 106 (step S14). Here, the data in the database 106 is characteristic amounts of other videos that have been registered in the database 106 by previously carrying out the same process.
  • If, as a result of verification, data with similar characteristic amounts has been found in the database 106 (step S15), a comparing unit of the central processing unit 102 compares the characteristic amounts of the two videos that have such characteristic amounts (step S16). Such comparing is carried out for all of the characteristic amounts included in the scenes judged to have similar characteristic amounts.
  • From the result of the comparison, a relationship inferring unit of the central processing unit 102 infers a relationship between the two videos based on the similarity of the respective characteristic amounts (step S17).
  • Meanwhile, if there is no data that is similar to the characteristic amounts in the database 106, the comparison process and the relationship inferring process for a video are not carried out.
  • Lastly, the characteristic amounts are newly registered in the database 106 to complete the processing (step S18). Additionally, if data that is similar to the characteristic amounts to be registered has been found in the database 106, information on the relationship between the scenes and videos to which the two characteristic amounts belong is added to the characteristic amounts and also to the registered content of the characteristic amounts of the similar data.
  • As a supplementary explanation, if similar characteristic amounts are obtained for a plurality of scenes in a video when the characteristic amounts are extracted in step S12, assigning index information in advance to show that such amounts are similar makes it possible to reduce the number of searches carried out when data in the database 106 is compared with newly extracted characteristic amounts, and as a result to reduce the processing time.
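  • As a non-limiting illustration of the flow in steps S11 to S18, the following Python sketch shows how extracted characteristic amounts might be matched against a database, used to infer a relationship, and then registered. The dictionary-based scene representation, the toy similarity function, and the threshold value are assumptions made only for this example and are not part of the disclosed apparatus.

def similarity(a, b):
    """Toy similarity in [0, 1] between two equal-length feature vectors."""
    return 1.0 - sum(abs(x - y) for x, y in zip(a, b)) / max(len(a), 1)

def process_scene(scene_features, database, threshold=0.8):
    """Relate one scene's features to similar registered scenes, then register them (S14-S18)."""
    relationships = []
    for other in database:
        face_score = similarity(scene_features["face"], other["face"])
        if face_score >= threshold:                      # S15: similar data found
            # S16-S17: compare further amounts and infer a relationship
            voice_score = similarity(scene_features["voice"], other["voice"])
            relation = ("possibly the same person" if voice_score >= threshold
                        else "possibly a different person (e.g. an impression)")
            relationships.append((other["id"], relation))
    database.append(scene_features)                      # S18: register the new amounts
    return relationships

database = [{"id": 1, "face": [0.20, 0.50, 0.90], "voice": [0.40, 0.40, 0.10]}]
new_scene = {"id": 2, "face": [0.21, 0.50, 0.88], "voice": [0.40, 0.41, 0.10]}
print(process_scene(new_scene, database))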
  • 4. Types of Characteristic Amounts Extracted from a Video and Method of Extraction
  • The types of characteristic amounts extracted from a video and the method of extraction are described below.
  • Image Analyzing Unit 122
  • FIGS. 4A to 4F are diagrams useful in explaining one example of high-frequency components that are characteristics of a detected face and the color composition of the face. FIG. 4A shows one example of a face in a video.
  • As shown in FIGS. 4B and 4F, as a facial recognition function, the image analyzing unit 122 extracts contour (high-frequency) components by carrying out a Fourier transform. As shown in FIGS. 4C to 4E and 4G, the image analyzing unit 122 also calculates composition ratios of colors of the detected face as ratios relative to the area of the face. After this, it is possible to carry out a facial recognition process using the contours of the face and/or information on the color composition such as that in FIGS. 4A to 4G obtained by the characteristic amount extracting process. In addition, as a body movement recognition function, the image analyzing unit 122 detects body movements from the video. FIGS. 5A and 5B are diagrams useful in showing one example of detection of gesture/movement patterns. The image analyzing unit 122 then associates the face and body movements from the results of the facial recognition and the body movement detection and registers movement changes of the character in a series of scenes.
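  • The following sketch illustrates, under simplifying assumptions, the two kinds of facial characteristic amounts described above: high-frequency (contour) components obtained with a Fourier transform, and color composition ratios relative to the face area. It assumes the face region has already been cropped to an RGB array; the frequency cutoff and the number of color bins are illustrative choices, not values from this disclosure.

import numpy as np

def contour_components(face_gray, cutoff=0.25):
    """Keep only the high-frequency (contour-like) part of the 2-D spectrum."""
    spectrum = np.fft.fftshift(np.fft.fft2(face_gray))
    h, w = spectrum.shape
    cy, cx = h // 2, w // 2
    ry, rx = int(h * cutoff), int(w * cutoff)
    spectrum[cy - ry:cy + ry, cx - rx:cx + rx] = 0   # suppress low frequencies
    return np.abs(spectrum)

def color_composition(face_rgb, bins=4):
    """Ratio of each coarse color to the total face area (the ratios sum to 1)."""
    quantized = (face_rgb.astype(int) // (256 // bins)).reshape(-1, 3)
    codes = quantized[:, 0] * bins * bins + quantized[:, 1] * bins + quantized[:, 2]
    counts = np.bincount(codes, minlength=bins ** 3).astype(float)
    return counts / counts.sum()

face = (np.random.rand(64, 64, 3) * 255).astype(np.uint8)   # stand-in for a cropped face
print(contour_components(face.mean(axis=2)).shape, color_composition(face)[:5])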
  • Audio Analyzing Unit 126
  • The audio analyzing unit 126 detects a voiceprint from the video. Based on lip movements of the detected face, the audio analyzing unit 126 also separates voice information of a person from background sound to acquire the voice information. By carrying out speech recognition, the audio analyzing unit 126 also extracts the spoken content (dialog) of the person. Also, as shown in FIG. 6, the audio analyzing unit 126 detects BGM (BackGround Music) from the video. FIG. 6 is a diagram useful in showing extraction of a theme song (theme BGM) of a video. The audio analyzing unit 126 refers for example to sudden changes in people appearing in scenes, volume, and/or high-frequency components to separate background sound.
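  • As a rough illustration of the audio-side characteristic amounts, the sketch below computes an overall frequency distribution (a stand-in for a voiceprint), per-frame volume, and the frames where the frequency distribution changes sharply. The frame length and change threshold are arbitrary example values, and no lip-based separation of voice from background sound is attempted here.

import numpy as np

def audio_features(samples, frame=1024, change_threshold=0.5):
    """Per-frame spectra give a mean frequency distribution, RMS volume, and sharp-change frames."""
    n_frames = len(samples) // frame
    spectra, volumes = [], []
    for i in range(n_frames):
        chunk = samples[i * frame:(i + 1) * frame]
        spectra.append(np.abs(np.fft.rfft(chunk)))
        volumes.append(float(np.sqrt(np.mean(chunk ** 2))))   # RMS volume of the frame
    spectra = np.array(spectra)
    voiceprint = spectra.mean(axis=0)                         # overall frequency distribution
    diffs = np.abs(np.diff(spectra, axis=0)).mean(axis=1)     # frame-to-frame spectral change
    sharp_changes = np.where(diffs > change_threshold * diffs.max())[0]
    return voiceprint, volumes, sharp_changes

audio = np.random.randn(16000 * 5)   # stand-in for 5 s of mono audio at 16 kHz
voiceprint, volumes, changes = audio_features(audio)
print(voiceprint.shape, len(volumes), changes[:5])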
  • Metadata Analyzing Unit 124
  • If subtitles are included in the metadata of a video, the metadata analyzing unit 124 extracts the subtitles from the video.
  • 5. Labeling of Characteristic Amounts
  • Next, a labeling process for characteristic amounts will be described. FIG. 7 is a flowchart showing the labeling process for characteristic amounts.
  • First, for scenes that have been subjected to facial recognition by the image analyzing unit 122, each scene in which a face has been detected is labeled with the name of the detected person (step S21). The label names do not need to be the proper names of the people, and it is possible to use any information that serves as unique identifiers indicating specific people.
  • Next, a movement change pattern of a person obtained by carrying out detection of body movements using the detected face is labeled with the same name as the face in question (step S22).
  • In addition, for the audio information of a scene in which the face described above and the body movements have been detected, it is verified whether the voiceprint obtained by the audio analyzing unit 126 is the voiceprint of the person whose name is the label assigned to the face and/or body movements in the scene (step S23). Here, as one example, the method disclosed in Japanese Laid-Open Patent Publication No. 2009-278180 is used to recognize the voiceprint.
  • If, as a result of verification, the voiceprint obtained for a scene matches the person indicated by the label (step S24), the label is assigned to the voiceprint (step S26). Meanwhile, if the voiceprint has been recognized as that of another person (step S24), the voiceprint is assigned a label as background sound and is excluded from subsequent processing (step S25). By doing so, it is possible to reduce the amount of processing in the subsequent similarity judging process.
  • Note that by using the voiceprint recognition disclosed in Japanese Laid-Open Patent Publication No. 2009-278180, it is possible to specify a person from a voice using only audio information such as that described above. However, in the present embodiment, emphasis is placed on gathering characteristic amounts for a video based on characters. For this reason, information that includes only audio but no images is judged to be insufficient for extracting characteristics of a person and accordingly such information is not used.
  • Also, if BGM has been detected as in FIG. 6, such BGM can be labeled as a characteristic amount in the same way as a voiceprint and used when judging similarity.
  • After this, in the same way as for the voiceprint, it is verified for each scene whether the spoken content obtained from speech recognition by the audio analyzing unit 126, or from subtitle information produced by the metadata analyzing unit 124, is speech by the person whose name has been used as the label (step S27). If the result of speech recognition by the audio analyzing unit 126 is used as the spoken content, a voiceprint can be extracted at the same time by carrying out voiceprint recognition, so if a person can be specified from the voiceprint, the person to whom the spoken content belongs can easily be specified.
  • Meanwhile, regarding spoken content obtained from subtitle information, by comparing speech timing that accompanies the subtitle information against the time in a scene when lip movements have been detected using facial recognition by the image analyzing unit 122, it is possible to specify the person to which the spoken content belongs. A method of associating a speaker and a spoken content will be described later.
  • If, as a result of verification, the spoken content has been recognized as speech of the person in question (step S28), the present label is assigned to the spoken content (step S30). Conversely, if the spoken content has been recognized as speech of another person, the spoken content is labeled as background sound, and subsequent processing is not carried out (step S29). This completes the labeling process for characteristic amounts.
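  • The labeling flow of steps S21 to S30 can be pictured with the following simplified sketch. The per-scene dictionary layout and the same_person() helper are hypothetical stand-ins for the outputs of the facial recognition, body movement detection, voiceprint recognition, and speech recognition functions described above.

def same_person(candidate_label, recognized_label):
    """Stand-in for comparing a label with a recognizer's output."""
    return candidate_label == recognized_label

def label_scene(scene):
    person = scene["face_label"]                    # S21: label from facial recognition
    scene["movement_label"] = person                # S22: same label for the body movements

    # S23-S26: keep the voiceprint only if it belongs to the labelled person
    if same_person(person, scene["voiceprint_owner"]):
        scene["voiceprint_label"] = person
    else:
        scene["voiceprint_label"] = "background"    # S25: excluded from later processing

    # S27-S30: the same check for the spoken content (speech recognition or subtitles)
    if same_person(person, scene["speech_owner"]):
        scene["speech_label"] = person
    else:
        scene["speech_label"] = "background"        # S29: treated as background sound
    return scene

scene = {"face_label": "person_A", "voiceprint_owner": "person_A", "speech_owner": "person_B"}
print(label_scene(scene))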
  • 6. Associating Speaker and Spoken Content
  • The associating of a speaker and a spoken content will now be described with reference to FIG. 8. FIG. 8 is a diagram useful in showing the associating of a speaker and a spoken content.
  • First, the characters in each scene are detected and specified by the facial recognition function of the image analyzing unit 122. Next, out of such scenes, the scenes that include lip movements are detected and marked.
  • Meanwhile, spoken content from subtitle information obtained by the metadata analyzing unit 124 is assigned to each scene based on the accompanying timing information. Here, scenes in which lip movements have been detected by the image analyzing unit 122 and the assigned subtitle information are placed together on the time axis. By doing so, it is possible to specify which spoken content was spoken by which person.
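  • A minimal sketch of this association is shown below: the intervals in which lip movements were detected for each person and the subtitle timings are placed on the same time axis, and each subtitle line is assigned to the person whose lip-movement intervals overlap it the most. The interval and subtitle data are made-up examples, with times in seconds.

def overlap(a, b):
    """Length of the overlap between two (start, end) intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def assign_speakers(lip_intervals, subtitles):
    """lip_intervals: {person: [(start, end), ...]}; subtitles: [(start, end, text), ...]."""
    assignments = []
    for start, end, text in subtitles:
        best_person, best_overlap = None, 0.0
        for person, intervals in lip_intervals.items():
            total = sum(overlap((start, end), iv) for iv in intervals)
            if total > best_overlap:
                best_person, best_overlap = person, total
        assignments.append((text, best_person))
    return assignments

lips = {"person_A": [(0.0, 2.5), (6.0, 8.0)], "person_B": [(3.0, 5.5)]}
subs = [(0.5, 2.0, "Hello."), (3.2, 5.0, "Hi there."), (6.1, 7.5, "Goodbye.")]
print(assign_speakers(lips, subs))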
  • 7. Storage of Characteristic Amounts and Method of Judging Similarity
  • The following data is stored as the characteristic amounts.
  • (Storage of Characteristic Amounts in Temporary Storage Unit 104)
      • Characteristic Amount Types
      • Values of Characteristic Amounts
      • Labels
      • Scene Start Time
      • Scene End Time
      • Index Numbers
    (Characteristic Amount Storing Database 106)
      • Characteristic Amount Types
      • Values of Characteristic Amounts
      • Labels
      • Scene Start Time
      • Scene End Time
      • ID number of Video
    (Video/Scene Relationship Storing Database 106)
      • ID number of Video 1
      • Scene Start Time of Video 1
      • Scene End Time of Video 1
      • ID number of Video 2
      • Scene Start Time of Video 2
      • Scene End Time of Video 2
      • Video/Scene Flag
      • Relationship Type
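  • One possible in-memory rendering of the records listed above is sketched below using Python dataclasses. The field names follow the lists; the concrete types, and the reuse of a single class for both the temporary storage unit 104 and the characteristic amount storing database 106, are assumptions made for illustration only.

from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class CharacteristicAmount:
    amount_type: str                     # e.g. "face", "voiceprint", "gesture", "dialog", "BGM"
    value: Any                           # the extracted value itself
    label: str                           # person name or "background"
    scene_start: float                   # scene start time (seconds)
    scene_end: float                     # scene end time (seconds)
    index_number: Optional[int] = None   # used in the temporary storage unit 104
    video_id: Optional[int] = None       # used in the characteristic amount storing database 106

@dataclass
class VideoSceneRelationship:
    video1_id: int
    video1_start: float
    video1_end: float
    video2_id: int
    video2_start: float
    video2_end: float
    video_scene_flag: bool               # whole-video relationship or scene-level relationship
    relationship_type: str               # e.g. "same content", "series", "parody"

record = CharacteristicAmount("face", value=[0.1, 0.4], label="person_A",
                              scene_start=12.0, scene_end=18.5, video_id=42)
print(record)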
  • If the database 106 contains data in which a characteristic amount matches, or data that is judged to be similar based on a threshold, the similarity of every characteristic amount is then measured between the scenes in which the matching characteristic amount is present. Based on such measurements, the relationship between the two videos or scenes is decided.
  • Next, the calculation of similarity will be described.
  • (Similarity for Video Images)
  • Face . . . similarity for faces of characters is determined between scenes from the contours and/or color composition ratio of faces that have been detected.
  • Movement . . . similarity for movements of characters is determined between scenes from changes in posture on a time axis.
  • (Similarity for Audio)
  • Voiceprint . . . similarity for voices of characters is determined between scenes from a frequency distribution of audio.
  • BGM . . . similarity for BGM is determined between scenes from audio information that plays for a certain period.
  • (Similarity for Content)
  • Dialog . . . similarity for dialog of characters is determined between scenes from the voiceprint and subtitles and/or spoken content.
  • 8. Method of Inferring Relationship Between Videos
  • By comparing the characteristic amounts described above to find the degree of similarity of various characteristics of two videos, it is possible to classify relationships as shown below.
  • Characteristic Amounts to be Compared and Similarity
  • (1) A face is the same. Or is similar.
    (2) A movement pattern is the same. Or is similar.
    (3) A voiceprint is the same. Or is similar.
    (4) Dialog is the same. Or is similar.
  • Evaluation may be carried out as described below to see whether the degree of similarity between two videos or scenes is zero or below for such characteristic amounts (that is, such amounts are not similar), or whether the degree of similarity is larger than a threshold set in advance.
  • Degree of Similarity and Evaluation
  • (1) If similarity is zero or below→possibly a different person
    (2) If similarity is equal to or above a threshold→possibly the same person
    (3) If similarity is below the threshold→possibly a different person doing an impression or some kind of a modification
  • The relationship between two videos or scenes is judged in view of the overall similarity for all of the characteristics in each scene.
  • Result of Similarity and Judgment of Relationship Between Content
  • (1) If the degree of similarity is above a threshold for every characteristic amount given above→the two videos have the same content
    (2) If the degree of similarity is above a threshold for the face and voiceprint of at least a certain number of people→the two videos are part of a series
    (3) If the degree of similarity is above a threshold for the face and voiceprint of at least one person but less than the certain number of people→the two videos are different programs with the same characters.
    (4) If there is a scene where the degree of similarity for the face and voiceprint is below a threshold but the degree of similarity of a movement pattern and/or dialog is above a threshold→the video is a parody including another person doing an impression
    (5) If the degree of similarity is below a threshold for every characteristic amount→the videos are unrelated.
  • By totaling the number of scenes for which high similarity has been calculated, the judgment described above makes it possible to evaluate whether a relationship holds between the two entire videos or only between specific scenes.
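  • The judgments (1) to (5) above can be pictured as a simple rule-based classifier, sketched below. The per-person score layout, the threshold value, and the "certain number of people" parameter are illustrative assumptions rather than values given in this disclosure.

def infer_relationship(per_person, threshold=0.8, series_min_people=3):
    """per_person: {name: {"face": s, "voiceprint": s, "movement": s, "dialog": s}}."""
    def above(scores, keys):
        return all(scores[k] >= threshold for k in keys)

    all_above = bool(per_person) and all(above(s, s.keys()) for s in per_person.values())
    face_voice_count = sum(above(s, ("face", "voiceprint")) for s in per_person.values())
    parody_like = any(not above(s, ("face", "voiceprint"))
                      and (s["movement"] >= threshold or s["dialog"] >= threshold)
                      for s in per_person.values())

    if all_above:
        return "same content"                                   # rule (1)
    if face_voice_count >= series_min_people:
        return "same series"                                    # rule (2)
    if face_voice_count >= 1:
        return "different programs with the same characters"    # rule (3)
    if parody_like:
        return "parody including an impression"                 # rule (4)
    return "unrelated"                                          # rule (5)

scores = {"person_A": {"face": 0.2, "voiceprint": 0.3, "movement": 0.9, "dialog": 0.85}}
print(infer_relationship(scores))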
  • When inferring a relationship between videos, although it would be possible to carry out processing for each scene in which the same face (person) is shown, it is preferable to carry out facial recognition together with the associated body movement detection and to subject to processing only those scenes in which the same characters appear.
  • By doing so, it is possible to label (index) the respective characteristic amounts for each character in each video. As a result, it is possible to aggregate the characteristic amounts for a person and to decide the importance of such information and/or sort through such information in advance. This makes it possible to increase the processing speed.
  • Meanwhile, if the characteristic amounts of a face are given priority, it is possible to classify two scenes as belonging to the same series or as being different videos in which the same person appears. However, such processing may fail to spot the relationship between videos in which characteristic amounts other than the face are the same but the faces are different. For example, if the dialog or voice is the same but the face is different, processing that gives priority to the characteristic amounts of a face may not be able to determine similarity, even though an "impersonation"-type relationship can be assumed in which what is actually a different person is doing an imitation or the like. For this reason, when inferring a relationship between videos, it is not considered preferable to carry out processing on a scene-by-scene basis for scenes where the result of facial recognition (that is, a label) is the same.
  • 9. Method of Judging Similarity for Characteristic Amounts
  • The method of judging similarity for the characteristic amounts will now be described with reference to a flow shown in FIG. 10. FIG. 10 is a diagram useful in showing the method of judging similarity for the characteristic amounts.
  • First, in (1) in FIG. 10, characteristic amounts in a scene are extracted from scene a in video A that is being processed and passed to the database 106. Here, it is assumed that the scenes to be processed are scenes in which faces have been detected by a typical facial detection method.
  • Face information of a person, gestures (a movement pattern within a scene), a voiceprint of the face in question, BGM in the scene (background sound in the scene produced by excluding the voice of the person in question), and dialog (subtitle information) can be given as five examples of the characteristic amounts extracted from each scene. Note that the characteristic amounts are not necessarily limited to these five examples and it is also possible to use other characteristic amounts.
  • The extracted characteristic amounts are registered in the database 106 by the present system ((2) in FIG. 10). At the same time, a degree of similarity is calculated between the extracted characteristic amounts and characteristic amounts extracted from other videos (scenes) that are already registered in the database 106 ((3) in FIG. 10).
  • The standards and calculation formulas for judging similarity for the respective characteristic amounts are given below. In the following description, the two scenes for which similarity is being judged are referred to as "A" and "B". The degrees of similarity calculated by the calculation formulas take a value in the range of zero to one, with higher values indicating greater similarity.
  • (Judgment of Similarity for Faces)
  • A judgment of similarity for faces compares the contours of faces as well as color information.
  • When faces are compared between scenes, resizing is first carried out to make both faces the same size. As one example, if the sizes of the faces detected in the respective scenes A and B are expressed as F_s(A) and F_s(B), the resizing ratio r is expressed by Equation 1 below.

  • r = F_s(B) / F_s(A)  (Equation 1)
  • Here, it is assumed that resizing is carried out with the same ratio in the vertical and horizontal axes to prevent deformation of the faces.
  • After this, the degree of similarity is calculated for the contours and the colors of both faces.
  • Here, a two-dimensional plane is expressed by x and y. Two-dimensional contour information of the faces in scenes A and B is expressed as F_l(A(x,y)) and F_l(B(x,y)), and two-dimensional color information is expressed as F_c(A(x,y)) and F_c(B(x,y)). In addition, if a loading of the comparison result is set at u, the similarity R_F(A,B) for faces in the scenes A and B is expressed as shown in Equation 2 below.

  • R_F(A,B) = u × Σ_(x,y)[1 − {F_l(A(x,y)) − F_l(B(x,y))}] / (L_MAX × F_s(B)) + (1 − u) × Σ_(x,y)[1 − {F_c(A(x,y)) − F_c(B(x,y))}] / (C_MAX × F_s(B))  (Equation 2)
  • Here, L_MAX and C_MAX express the respective maximum values for the contour information and the color information.
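  • A sketch in the spirit of Equations 1 and 2 is given below: one face is resized to the size of the other, and a weighted (loading u) sum of per-pixel contour agreement and color agreement is taken, normalized by L_MAX, C_MAX, and the face area F_s(B). The nearest-neighbour resizing and the placement of the normalization inside the bracket (so that each term stays in the zero-to-one range) are simplifications made for this example.

import numpy as np

def resize_nearest(img, target_shape):
    """Nearest-neighbour resize to the reference face's size (stands in for Equation 1)."""
    ys = (np.arange(target_shape[0]) * img.shape[0] / target_shape[0]).astype(int)
    xs = (np.arange(target_shape[1]) * img.shape[1] / target_shape[1]).astype(int)
    return img[ys][:, xs]

def face_similarity(contour_a, color_a, contour_b, color_b, u=0.5,
                    l_max=255.0, c_max=255.0):
    """Weighted sum (loading u) of contour and color agreement between two faces."""
    contour_a = resize_nearest(contour_a, contour_b.shape)
    color_a = resize_nearest(color_a, color_b.shape)
    area_b = contour_b.size                    # F_s(B): area of the reference face
    term_contour = np.sum(1.0 - np.abs(contour_a - contour_b) / l_max) / area_b
    term_color = np.sum(1.0 - np.abs(color_a - color_b) / c_max) / area_b
    return u * term_contour + (1.0 - u) * term_color

rng = np.random.default_rng(0)
contour_a, color_a = rng.random((48, 48)) * 255, rng.random((48, 48)) * 255
contour_b, color_b = rng.random((64, 64)) * 255, rng.random((64, 64)) * 255
print(round(face_similarity(contour_a, color_a, contour_b, color_b), 3))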
  • Judgment of Similarity for Voiceprints
  • Judgment of similarity for voiceprints is carried out by comparing the frequency distributions of voices.
  • If the frequency is expressed as f and the voiceprints, that is, the frequency distributions of the people in scenes A and B, are expressed as V_FA(f) and V_FB(f), the similarity R_V(A,B) for voiceprints in the scenes A and B is expressed as shown in Equation 3 below.

  • R_V(A,B) = Σ_f {V_FA(f) − V_FB(f)} / (F_MAX × D_MAX)  (Equation 3)
  • Here, F_MAX and D_MAX respectively express a frequency maximum value and a value for normalizing sound.
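  • A sketch following Equation 3 is shown below. The frequency-distribution difference is normalized by the number of frequency bins (standing in for F_MAX) and a sound-normalization value D_MAX; folding the result into 1 − (…) so that the score lands in the zero-to-one range used by the other judgments is an interpretation made for this example.

import numpy as np

def voiceprint_similarity(vf_a, vf_b, d_max=1.0):
    """Normalized difference between two frequency distributions, folded into [0, 1]."""
    f_max = len(vf_a)                                  # number of frequency bins compared
    diff = np.sum(np.abs(vf_a - vf_b)) / (f_max * d_max)
    return 1.0 - min(diff, 1.0)

rng = np.random.default_rng(0)
voice_a = rng.random(513)
voice_b = voice_a + rng.normal(0.0, 0.01, 513)         # nearly the same voiceprint
print(round(voiceprint_similarity(voice_a, voice_b), 3))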
  • Judgment of Similarity for Gestures
  • A judgment of similarity for gestures detects five endpoint positions (i.e., the head and both hands and feet) of a body using an existing body movement detection method and measures and compares the movement loci of the respective endpoints within scenes.
  • If time is expressed as t, an endpoint number is expressed as n, and the position of an endpoint is expressed as p(t,n), a movement vector of endpoint n0 from a time t0 to another time t1 is expressed as (p(t1,n0)−p(t0,n0)).
  • Here, the default positions of the endpoints take as their standard a state in which the face is facing the front and the midline between both eyes is perpendicular to the horizontal. This means that it is possible to estimate the posture of a person based on the inclination of the detected face to the horizontal and the vertical and to calculate the endpoint positions in three dimensions.
  • Next, similarity is calculated for the endpoint movement vectors calculated for the scenes A and B. If the movement vectors for endpoint n in scenes A and B at time t are expressed as v_A(t,n) and v_B(t,n), the similarity R_M(A,B) for gestures in the scenes A and B is expressed as shown in Equation 4 below.

  • R_M(A,B) = 1 − Σ_(t,n) |{(v_A(t,n) − v_B(t,n)) / (|v_A(t,n)| |v_B(t,n)|)}| / (DIM × T_MAX × N_MAX)  (Equation 4)
  • Here, DIM expresses the number of dimensions, T_MAX expresses the length (in time) of the scenes being compared, and N_MAX expresses the number of endpoints being compared.
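  • The sketch below follows the structure of Equation 4: the movement vectors of the endpoints are normalized, their differences are accumulated over time, endpoints, and dimensions, and the total is subtracted from one. Treating the division by |v_A(t,n)| |v_B(t,n)| as per-vector normalization is an interpretation, and the trajectories are toy data.

import numpy as np

def gesture_similarity(vecs_a, vecs_b, eps=1e-8):
    """vecs_a, vecs_b: movement vectors with shape (T_MAX, N_MAX, DIM)."""
    t_max, n_max, dim = vecs_a.shape
    norm_a = vecs_a / (np.linalg.norm(vecs_a, axis=2, keepdims=True) + eps)
    norm_b = vecs_b / (np.linalg.norm(vecs_b, axis=2, keepdims=True) + eps)
    diff = np.sum(np.abs(norm_a - norm_b)) / (dim * t_max * n_max)
    return 1.0 - min(diff, 1.0)

rng = np.random.default_rng(1)
traj_a = rng.normal(size=(30, 5, 3))                   # 30 frames, 5 endpoints, 3-D vectors
traj_b = traj_a + rng.normal(0.0, 0.05, size=(30, 5, 3))
print(round(gesture_similarity(traj_a, traj_a), 3), round(gesture_similarity(traj_a, traj_b), 3))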
  • Judgment of Similarity for Dialog
  • The judgment of similarity for dialog is carried out by text matching for the spoken content in the two scenes.
  • If the spoken contents obtained from scenes A and B are respectively expressed as s(A) and s(B) and a function that measures the length of words and sentences common to scenes A and B is expressed as C_l(s(A), s(B)), the similarity R_S(A,B) for dialog in the scenes A and B is expressed as shown in Equation 5 below.

  • R_S(A,B) = C_l(s(A), s(B)) / S_MAX  (Equation 5)
  • Here, S_MAX expresses the length of the character strings to be compared.
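  • A sketch of Equation 5 using Python's standard difflib is given below: C_l is approximated by the total length of matching text blocks between the two spoken contents, and S_MAX by the length of the longer string. The example lines of dialog are invented.

from difflib import SequenceMatcher

def dialog_similarity(spoken_a: str, spoken_b: str) -> float:
    """Total length of matching text blocks divided by the longer string's length."""
    matcher = SequenceMatcher(None, spoken_a, spoken_b)
    common_length = sum(block.size for block in matcher.get_matching_blocks())
    s_max = max(len(spoken_a), len(spoken_b), 1)
    return common_length / s_max

line_a = "I'll be back before the sun rises."
line_b = "I'll be back before sunrise."
print(round(dialog_similarity(line_a, line_b), 3))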
  • Judgment of Similarity for BGM
  • The judgment of similarity for BGM is carried out by measuring the amount of time for which the same continuous playback sound is included in both scenes.
  • The BGM waveforms or melodies obtained from scenes A and B at time t are respectively expressed as g_A(t) and g_B(t). In addition, if a function that measures the correlation between g_A(t) and g_B(t) is expressed as R(g_A(t), g_B(t)) and a function that selects the longest region out of the regions for which high correlation has been obtained is expressed as L_rt{R(g_A(t), g_B(t))}, the similarity R_G(A,B) for BGM in the scenes A and B is expressed as shown in Equation 6 below.

  • R_G(A,B) = L_rt{R(g_A(t), g_B(t))} / T_MAX  (Equation 6)
  • Here, T_MAX expresses the time of scenes being compared.
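  • The sketch below follows Equation 6: the two BGM signals are compared window by window, windows with high correlation are marked, and the longest run of such windows (the L_rt{...} selection) is divided by the total number of windows, standing in for T_MAX. The window size and correlation threshold are illustrative choices.

import numpy as np

def bgm_similarity(g_a, g_b, window=1024, threshold=0.7):
    """Longest run of highly correlated windows divided by the number of windows (T_MAX)."""
    n_windows = min(len(g_a), len(g_b)) // window
    correlated = []
    for i in range(n_windows):
        a = g_a[i * window:(i + 1) * window]
        b = g_b[i * window:(i + 1) * window]
        correlated.append(np.corrcoef(a, b)[0, 1] >= threshold)
    longest = run = 0
    for flag in correlated:               # the L_rt{...} selection of the longest region
        run = run + 1 if flag else 0
        longest = max(longest, run)
    return longest / max(n_windows, 1)

rng = np.random.default_rng(2)
song = rng.normal(size=16000 * 4)                      # stand-in for a theme song
other = np.concatenate([rng.normal(size=16000), song[16000:48000], rng.normal(size=16000)])
print(round(bgm_similarity(song, other), 3))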
  • The judgment results given below are obtained for the characteristic amounts based on the values calculated from the formulas given above.
  • Face . . . [1:same]>[Overall similarity for contours or color composition]>[Similarity for parts of contours or color composition]>[0:different]
  • Voiceprint . . . [1:same]>[At least one part is a continuous section that is the same. Some endpoints have different loci]>[0:different]
  • Gestures . . . [1:same]>[All points plot similar loci for a long time in a time series]>[All points plot similar loci for a short time. Alternatively, many points plot similar loci for a long time]>[Many points plot similar loci for a short time]>[0:different]
  • Dialog . . . [1:same]>[0:different]. Note that frequently occurring parts are excluded and only characteristic dialog is kept.
  • BGM . . . [1:same]>[Partially the same for entire length]>[Melody is the same but performance/recording method etc. is different. Different material with the same content]>[A different part is included]>[0:different]
  • It is assumed that the judgments described above are carried out using various thresholds.
  • The relationship between the two scenes is inferred based on the above judgment results ((4) in FIG. 10).
  • First the characteristic amounts described above are classified into the three categories shown in FIG. 11.
  • Examples of combinations of similarities calculated for the respective categories and relationships between scenes based on such similarities are shown in FIG. 12. Here, if the characteristic amounts belonging to a category are the same between scenes, the similarity is one, and if the characteristic amounts are completely different, the similarity is zero. Since the degrees of similarity that are actually calculated are arbitrary values in a range of zero to one, inclusive, the following is not an exhaustive list of the relationships between scenes that may be determined.
  • 10. Effects of the Present Embodiment
  • It is possible to associate not only a video with substantially the same content as a video that is the standard for judging similarity, but also a wide range of videos with similar characteristics, such as videos in a series, a parody, and a video that is an animation version. In addition, based on which parts of a plurality of videos are similar, it is possible to further classify related videos according to their relationships with the video used as a standard.
  • It is also possible to evaluate similarity and/or a relationship between videos not only in units of the videos themselves but also in units of scenes (arbitrary sections).
  • By labeling characteristic amounts extracted from a video, it is possible to store characteristic amounts of a person who appears in the video as data, and based on such characteristic amounts, it is possible to evaluate who people appearing in another video resemble, and which parts are similar.
  • Using the characteristic amounts extracted from commercial content, it is possible to easily investigate whether a video that has been uploaded to a video sharing site or a personal web page infringes a copyright.
  • By compiling statistics on dialog and/or movement patterns for each person from the extracted characteristic amounts, it is possible to know the person's way of talking and gestures.
  • It is also possible to use (or replace) the movement pattern, dialog, voiceprint, and the like of a character registered in the database 106 for a new character who has been separately created.
  • It is also possible to quantitatively evaluate the extent to which someone doing an impersonation is similar to the person being impersonated and which characteristics are similar.
  • It is possible to use metadata of another video that is very similar for a video that has not been assigned metadata. It is also possible to assign the results of similarity judgments to respective videos as metadata.
  • By extracting a plurality of characteristic amounts individually from a video, it is possible to use the respective characteristic amounts to acquire information on characters and the like that is not related to a video or scene from the Web or from a similar item.
  • Although preferred embodiments of the present invention have been described in detail with reference to the attached drawings, the present invention is not limited to the above examples. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and alterations may occur depending on design requirements and other factors insofar as they are within the scope of the appended claims or the equivalents thereof.
  • The present application contains subject matter related to that disclosed in Japanese Priority Patent Application JP 2010-52919 filed in the Japan Patent Office on Mar. 10, 2010, the entire content of which is hereby incorporated by reference.

Claims (7)

1. An information processing apparatus comprising:
a characteristic amount extracting unit extracting a plurality of characteristic amounts, which are information expressing characteristics of a video, from the video;
a labeling unit associating the extracted characteristic amounts with a person or a background;
a matching degree judging unit judging a degree of matching between the associated characteristic amounts and the characteristic amounts of at least one other video;
a comparing unit comparing the plurality of characteristic amounts of one scene in the video from which the characteristic amounts have been extracted and the plurality of characteristic amounts of one scene in the at least one other video; and
a relationship inferring unit inferring a relationship between the one scene in the video and the one scene in the at least one other video based on a comparison result of the comparing unit.
2. An information processing apparatus according to claim 1,
wherein the matching degree judging unit judges, for the associated characteristic amounts, a degree of matching with the characteristic amounts of the at least one other video that have been recorded in a storage unit, and
the comparing unit is operable when it has been judged using at least one threshold that at least one of the associated characteristic amounts matches at least one of the characteristic amounts of another video item, to compare the plurality of characteristic amounts of one scene in the video and the plurality of characteristic amounts in one scene of the other video.
3. An information processing apparatus according to claim 1,
wherein the characteristic amount extracting unit extracts a plurality of characteristic amounts for each scene in the video.
4. An information processing apparatus according to claim 3,
wherein the characteristic amount extracting unit is operable when at least one similar characteristic amount is obtained from a plurality of scenes in the video, to assign index information showing that the at least one characteristic amount is similar for the plurality of scenes.
5. An information processing apparatus according to claim 1,
wherein as the characteristic amounts, the characteristic amount extracting unit recognizes a face of a person and detects body movements of the person, and
the labeling unit associates the face and the body movements with the person and gathers together the associated characteristic amounts for each person.
6. An information processing method comprising steps of:
extracting a plurality of characteristic amounts, which are information expressing characteristics of a video, from the video;
associating the extracted characteristic amounts with a person or a background;
judging a degree of matching between the associated characteristic amounts and the characteristic amounts of at least one other video;
comparing the plurality of characteristic amounts of one scene in the video from which the characteristic amounts have been extracted and the plurality of characteristic amounts of one scene in the at least one other video; and
inferring a relationship between the one scene in the video and the one scene in the at least one other video based on a comparison result of the comparing step.
7. A program causing a computer to carry out steps of:
extracting a plurality of characteristic amounts, which are information expressing characteristics of a video, from the video;
associating the extracted characteristic amounts with a person or a background;
judging a degree of matching between the associated characteristic amounts and the characteristic amounts of at least one other video;
comparing the plurality of characteristic amounts of one scene in the video from which the characteristic amounts have been extracted and the plurality of characteristic amounts of one scene in the at least one other video; and
inferring a relationship between the one scene in the video and the one scene in the at least one other video based on a comparison result of the comparing step.
US13/038,625 2010-03-10 2011-03-02 Information processing apparatus, information processing method, and program Expired - Fee Related US8731307B2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2010-052919 2010-03-10
JP2010052919A JP2011188342A (en) 2010-03-10 2010-03-10 Information processing apparatus, information processing method, and program

Publications (2)

Publication Number Publication Date
US20110222782A1 true US20110222782A1 (en) 2011-09-15
US8731307B2 US8731307B2 (en) 2014-05-20

Family

ID=44560026

Family Applications (1)

Application Number Title Priority Date Filing Date
US13/038,625 Expired - Fee Related US8731307B2 (en) 2010-03-10 2011-03-02 Information processing apparatus, information processing method, and program

Country Status (2)

Country Link
US (1) US8731307B2 (en)
JP (1) JP2011188342A (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013082709A1 (en) * 2011-12-06 2013-06-13 Aastra Technologies Limited Collaboration system and method
JP6070584B2 (en) * 2014-01-17 2017-02-01 ソニー株式会社 Information processing apparatus, information processing method, and program
JP6316685B2 (en) * 2014-07-04 2018-04-25 日本電信電話株式会社 Voice imitation voice evaluation device, voice imitation voice evaluation method and program
US20190020913A9 (en) * 2016-06-27 2019-01-17 Facebook, Inc. Systems and methods for identifying matching content

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3569441B2 (en) 1998-04-24 2004-09-22 シャープ株式会社 Similar program search device, similar program search method, and medium recording similar program search program
KR100677096B1 (en) 2000-05-31 2007-02-05 삼성전자주식회사 Similarity measuring method of images and similarity measuring device
JP3711022B2 (en) 2000-12-28 2005-10-26 株式会社東芝 Method and apparatus for recognizing specific object in moving image
JP2006285907A (en) 2005-04-05 2006-10-19 Nippon Hoso Kyokai <Nhk> Designation distribution content specification device, designation distribution content specification program and designation distribution content specification method
JP2009070278A (en) 2007-09-14 2009-04-02 Toshiba Corp Content similarity determination apparatus and content similarity determination method
JP5061877B2 (en) 2007-12-13 2012-10-31 オムロン株式会社 Video identification device

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20010004739A1 (en) * 1999-09-27 2001-06-21 Shunichi Sekiguchi Image retrieval system and image retrieval method
US20020006221A1 (en) * 2000-05-31 2002-01-17 Hyun-Doo Shin Method and device for measuring similarity between images
US20100005070A1 (en) * 2002-04-12 2010-01-07 Yoshimi Moriya Metadata editing apparatus, metadata reproduction apparatus, metadata delivery apparatus, metadata search apparatus, metadata re-generation condition setting apparatus, and metadata delivery method and hint information description method
US20060184963A1 (en) * 2003-01-06 2006-08-17 Koninklijke Philips Electronics N.V. Method and apparatus for similar video content hopping
US7676820B2 (en) * 2003-01-06 2010-03-09 Koninklijke Philips Electronics N.V. Method and apparatus for similar video content hopping
US20080298643A1 (en) * 2007-05-30 2008-12-04 Lawther Joel S Composite person model from image collection
US20100111501A1 (en) * 2008-10-10 2010-05-06 Koji Kashima Display control apparatus, display control method, and program

Cited By (36)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9025835B2 (en) 2011-10-28 2015-05-05 Intellectual Ventures Fund 83 Llc Image recomposition from face detection and facial features
US8811747B2 (en) * 2011-10-28 2014-08-19 Intellectual Ventures Fund 83 Llc Image recomposition from face detection and facial features
US8938100B2 (en) 2011-10-28 2015-01-20 Intellectual Ventures Fund 83 Llc Image recomposition from face detection and facial features
US20130108169A1 (en) * 2011-10-28 2013-05-02 Raymond William Ptucha Image Recomposition From Face Detection And Facial Features
US9008436B2 (en) 2011-10-28 2015-04-14 Intellectual Ventures Fund 83 Llc Image recomposition from face detection and facial features
US9025836B2 (en) 2011-10-28 2015-05-05 Intellectual Ventures Fund 83 Llc Image recomposition from face detection and facial features
US20140086496A1 (en) * 2012-09-27 2014-03-27 Sony Corporation Image processing device, image processing method and program
US20140086556A1 (en) * 2012-09-27 2014-03-27 Sony Corporation Image processing apparatus, image processing method, and program
US9549162B2 (en) * 2012-09-27 2017-01-17 Sony Corporation Image processing apparatus, image processing method, and program
US9489594B2 (en) * 2012-09-27 2016-11-08 Sony Corporation Image processing device, image processing method and program
US10001904B1 (en) 2013-06-26 2018-06-19 R3 Collaboratives, Inc. Categorized and tagged video annotation
US8984405B1 (en) * 2013-06-26 2015-03-17 R3 Collaboratives, Inc. Categorized and tagged video annotation
US11669225B2 (en) 2013-06-26 2023-06-06 R3 Collaboratives, Inc. Categorized and tagged video annotation
US11294540B2 (en) 2013-06-26 2022-04-05 R3 Collaboratives, Inc. Categorized and tagged video annotation
US10908778B1 (en) 2013-06-26 2021-02-02 R3 Collaboratives, Inc. Categorized and tagged video annotation
US9201900B2 (en) * 2013-08-29 2015-12-01 Htc Corporation Related image searching method and user interface controlling method
US20150063725A1 (en) * 2013-08-29 2015-03-05 Htc Corporation Related Image Searching Method and User Interface Controlling Method
US10416774B2 (en) 2013-09-06 2019-09-17 Immersion Corporation Automatic remote sensing and haptic conversion system
JP2015053056A (en) * 2013-09-06 2015-03-19 イマージョン コーポレーションImmersion Corporation Automatic remote sensing and haptic conversion system
US20150248918A1 (en) * 2014-02-28 2015-09-03 United Video Properties, Inc. Systems and methods for displaying a user selected object as marked based on its context in a program
US9996769B2 (en) * 2016-06-08 2018-06-12 International Business Machines Corporation Detecting usage of copyrighted video content using object recognition
US11301714B2 (en) 2016-06-08 2022-04-12 International Business Machines Corporation Detecting usage of copyrighted video content using object recognition
US20170357875A1 (en) * 2016-06-08 2017-12-14 International Business Machines Corporation Detecting usage of copyrighted video content using object recognition
US10579899B2 (en) 2016-06-08 2020-03-03 International Business Machines Corporation Detecting usage of copyrighted video content using object recognition
US20190095750A1 (en) * 2016-06-13 2019-03-28 Nec Corporation Reception apparatus, reception system, reception method, and storage medium
US20190180138A1 (en) * 2016-06-13 2019-06-13 Nec Corporation Reception apparatus, reception system, reception method, and storage medium
US11430207B2 (en) * 2016-06-13 2022-08-30 Nec Corporation Reception apparatus, reception system, reception method and storage medium
US11514663B2 (en) * 2016-06-13 2022-11-29 Nec Corporation Reception apparatus, reception system, reception method, and storage medium
US11850728B2 (en) 2016-06-13 2023-12-26 Nec Corporation Reception apparatus, reception system, reception method, and storage medium
US11030462B2 (en) 2016-06-27 2021-06-08 Facebook, Inc. Systems and methods for storing content
CN107958212A (en) * 2017-11-20 2018-04-24 珠海市魅族科技有限公司 A kind of information cuing method, device, computer installation and computer-readable recording medium
US10299008B1 (en) * 2017-11-21 2019-05-21 International Business Machines Corporation Smart closed caption positioning system for video content
CN110390242A (en) * 2018-04-20 2019-10-29 富士施乐株式会社 Information processing unit and storage medium
WO2020045157A1 (en) * 2018-08-31 2020-03-05 Nec Corporation Methods, systems, and non-transitory computer readable medium for grouping same persons
US11615119B2 (en) 2018-08-31 2023-03-28 Nec Corporation Methods, systems, and non-transitory computer readable medium for grouping same persons
CN112866800A (en) * 2020-12-31 2021-05-28 四川金熊猫新媒体有限公司 Video content similarity detection method, device, equipment and storage medium

Also Published As

Publication number Publication date
US8731307B2 (en) 2014-05-20
JP2011188342A (en) 2011-09-22

Similar Documents

Publication Publication Date Title
US8731307B2 (en) Information processing apparatus, information processing method, and program
WO2022116888A1 (en) Method and device for video data processing, equipment, and medium
US8558952B2 (en) Image-sound segment corresponding apparatus, method and program
JP6824332B2 (en) Video service provision method and service server using this
Brezeale et al. Automatic video classification: A survey of the literature
Aran et al. Broadcasting oneself: Visual discovery of vlogging styles
US10134440B2 (en) Video summarization using audio and visual cues
WO2020232796A1 (en) Multimedia data matching method and device, and storage medium
US20160364397A1 (en) System and Methods for Locally Customizing Media Content for Rendering
CN111683209A (en) Mixed-cut video generation method and device, electronic equipment and computer-readable storage medium
Xu et al. An HMM-based framework for video semantic analysis
US20110243449A1 (en) Method and apparatus for object identification within a media file using device identification
CN103200463A (en) Method and device for generating video summary
CN112445935B (en) Automatic generation method of video selection collection based on content analysis
WO2023011094A1 (en) Video editing method and apparatus, electronic device, and storage medium
WO2006025272A1 (en) Video classification device, video classification program, video search device, and videos search program
CN106250553A (en) A kind of service recommendation method and terminal
Zhao et al. Flexible presentation of videos based on affective content analysis
CN113259780A (en) Holographic multidimensional audio and video playing progress bar generating, displaying and playing control method
WO2020135756A1 (en) Video segment extraction method, apparatus and device, and computer-readable storage medium
Xu et al. Fast summarization of user-generated videos: Exploiting semantic, emotional, and quality clues
Ionescu et al. An audio-visual approach to web video categorization
Ionescu et al. Content-based video description for automatic video genre categorization
Gu et al. Deepfake video detection using audio-visual consistency
Kim et al. Automatic color scheme extraction from movies

Legal Events

Date Code Title Description
AS Assignment

Owner name: SONY CORPORATION, JAPAN

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:KASHIWAGI, AKIFUMI;REEL/FRAME:025901/0494

Effective date: 20110131

FEPP Fee payment procedure

Free format text: PAYOR NUMBER ASSIGNED (ORIGINAL EVENT CODE: ASPN); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

FEPP Fee payment procedure

Free format text: MAINTENANCE FEE REMINDER MAILED (ORIGINAL EVENT CODE: REM.)

LAPS Lapse for failure to pay maintenance fees

Free format text: PATENT EXPIRED FOR FAILURE TO PAY MAINTENANCE FEES (ORIGINAL EVENT CODE: EXP.)

STCH Information on status: patent discontinuation

Free format text: PATENT EXPIRED DUE TO NONPAYMENT OF MAINTENANCE FEES UNDER 37 CFR 1.362

FP Lapsed due to failure to pay maintenance fee

Effective date: 20180520