WO2021221209A1 - Method and apparatus for searching for information inside video - Google Patents

Method and apparatus for searching for information inside video

Info

Publication number
WO2021221209A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
scene
shot
sentence
metadata
Prior art date
Application number
PCT/KR2020/005718
Other languages
French (fr)
Korean (ko)
Inventor
구원용
홍의재
Original Assignee
엠랩 주식회사
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 엠랩 주식회사
Priority to PCT/KR2020/005718
Priority to KR1020207014777A
Publication of WO2021221209A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval of video data
    • G06F16/71 Indexing; Data structures therefor; Storage structures
    • G06F16/73 Querying
    • G06F16/732 Query formulation
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval using metadata automatically derived from the content
    • G06F16/7844 Retrieval using original textual content or text extracted from visual content or transcript of audio data
    • G06F16/7847 Retrieval using low-level visual features of the video content
    • G06F16/785 Retrieval using low-level visual features of the video content, using colour or luminescence
    • G06F16/7867 Retrieval using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems

Definitions

  • the present invention relates to a method for retrieving information inside a moving picture.
  • it is intended to propose a method of searching for, and providing to the user, only the specific section of a video that contains the information the user wants.
  • a method for searching for information inside a video includes: receiving a sentence as a search term from a user; searching, within a video that is indexed scene by scene because metadata in sentence form is assigned to each scene, for the scene with the highest degree of matching with the search term; and reproducing only the section from the start point to the end point of the searched scene in the video.
  • the user selects a single video to search and enters the content to be found inside the selected video in the form of a sentence.
  • the video is composed of at least one scene, and each scene is assigned, in sentence form, a summary sentence describing its contents, which is used as metadata.
  • the degree of matching is determined using the Levenshtein distance technique, which is 0 when two sentences are identical and grows larger as the similarity between the two sentences decreases.
  • a method for retrieving information inside a video includes: segmenting the video into shots; assigning a tag set to each segmented shot and deriving, through topic analysis of the tag sets, keywords that highlight the characteristics of each shot; and generating scenes by performing hierarchical clustering based on the similarity between adjacent shots determined from the keywords.
  • the tag set is composed of an image tag and a voice tag.
  • a scene tag is assigned to each scene, and metadata in sentence form is also assigned, so that the video is indexed scene by scene.
  • the metadata is generated based on at least one keyword derived from each of the at least one shot constituting a scene and on text obtained by converting the voice data of each of those shots through a Speech-To-Text (STT) technique.
  • at least one sentence containing the keyword is selected from the text data obtained by STT-converting the voice data of each of the at least one shot constituting the scene, a single sentence is generated from the selected sentences through deep learning, and the generated sentence is stored as metadata describing the scene.
  • to extract shots, frames are extracted from the video as images and each image is converted into the HSV color space; three time series consisting of the median values of H (hue), S (saturation) and V (brightness) are generated; and when the inflection points detected in all three time series coincide, the corresponding point is set as the start or end point of a shot.
  • an apparatus for searching video internal information includes: a search term input unit that receives a sentence as a search term from a user; a video section search unit that searches, within the video, for the specific section most relevant to the search term; and a video section playback unit that reproduces only that specific section. The video is segmented based on meaning, a sentence explaining the meaning of each segmented section is assigned to it as metadata, and the relevance to the search term is determined by using those sentences as an index along the timeline of the video.
  • the apparatus and method for searching video internal information provide the user with only the specific section of the video containing the content the user wants, so the user can grasp the desired information quickly and easily without having to watch the video from beginning to end.
  • the user can check in advance what content is included before watching the video.
  • FIG. 1 illustrates an example in which components constituting a moving picture are divided into a scene and a shot as a preferred embodiment of the present invention.
  • FIG. 2 is a flowchart of a method for retrieving information inside a moving picture as a preferred embodiment of the present invention.
  • FIG. 3 is a diagram showing an internal configuration of an apparatus for searching video internal information as a preferred embodiment of the present invention.
  • FIG. 4 shows an example of dividing a shot in a moving picture as a preferred embodiment of the present invention.
  • FIG. 5 shows an example of assigning a tag set to a shot as a preferred embodiment of the present invention.
  • FIG. 6 shows an example of grouping a shot into a scene as a preferred embodiment of the present invention.
  • FIG. 7 is a flowchart of a method for retrieving information in a moving picture as another preferred embodiment of the present invention.
  • FIG. 8 shows an embodiment of searching for information inside a moving picture as a preferred embodiment of the present invention.
  • a method for searching for information inside a video includes the steps of: receiving a sentence as a search term from a user; searching, within a video indexed scene by scene with metadata assigned in sentence form to each scene, for the scene with the highest degree of matching with the search term; and reproducing only the section from the start point to the end point of the searched scene in the video.
  • FIG. 1 shows an example in which components constituting a moving picture are divided into a scene and a shot as a preferred embodiment of the present invention.
  • the moving picture 100 is segmented into n shots (n is a natural number) 111, 113, 121, 123, 125, 131, 133. For the method used to distinguish shots in a video, refer to FIG. 4.
  • at least one shot is grouped into units with similar meaning or subject to constitute a scene. In the example of FIG. 1, the first shot 111 and the second shot 113 are grouped into the first scene 110, the third shot 121, the fourth shot 123 and the fifth shot 125 into the second scene 120, and the sixth shot 131 and the seventh shot 133 into the third scene 130.
  • in the present invention, a subject may include at least one meaning.
  • FIG. 2 is a flowchart of a method for retrieving video internal information as a preferred embodiment of the present invention.
  • the user selects the video and inputs a search word through a search word input interface provided when video selection is activated.
  • the video is indexed in units of scenes by providing metadata in the form of sentences for each scene.
  • the video internal information search apparatus searches for a specific section that matches the search word or has high relevance in the video, and reproduces only the searched specific section.
  • when the user inputs a search term related to content to be found inside a specific video (S210), the video internal information search apparatus searches the video for the scene with the highest degree of matching with the search term (S220) and plays back only the section from the start point to the end point of the searched scene (S230).
  • FIG. 3 shows an internal configuration diagram of an apparatus 300 for searching video internal information as a preferred embodiment of the present invention.
  • FIGS. 4 to 6 show detailed functions of the video section search unit 320 constituting the apparatus 300 for searching video internal information.
  • FIG. 7 is a flowchart of searching for video internal information.
  • a method for searching video internal information with the apparatus for searching video internal information is described below with reference to FIGS. 3 to 7.
  • the apparatus 300 for searching video internal information may be implemented in a terminal, a computer, a notebook computer, a handheld device, or a wearable device.
  • the apparatus 300 for searching video internal information may be implemented in the form of a terminal having an input unit for receiving a user's search word, a display for displaying a video, and a processor.
  • the method of searching the video internal information may be implemented by being installed in the form of an application in the terminal.
  • the apparatus 300 for searching video internal information includes a search word input unit 310 , a video section search unit 320 , and a video section playback unit 330 .
  • the video section search unit 320 includes a shot segmentation unit 340 , a scene generation unit 350 , a metadata generation unit 360 , and a video index unit 370 .
  • the search word input unit 310 receives a search word from the user in the form of a sentence.
  • the user can use all forms such as voice search, text search, and image search.
  • An example of an image search is a case where the contents scanned from a book are converted into text and used as a search term.
  • the search word input unit 310 may be implemented as a keyboard, a stylus pen, a microphone, or the like.
  • the video section search unit 320 searches for a specific section in the video that matches the search word input from the search word input unit 310 or has content related to the search word. As an embodiment, the video section search unit 320 searches for a scene in which a sentence having the highest degree of matching with the input search word sentence is assigned as metadata.
  • the video section search unit 320 indexes and manages videos so that information can be searched within a single video.
  • referring further to FIG. 7, the shot segmentation unit 340 segments the video into shots (S710), assigns a tag set to each segmented shot (S720), and derives a keyword for each shot by applying a topic analysis algorithm to the tag sets assigned to the shots (S730).
  • the keywords are derived so as to identify and distinguish the content of each of the at least one shot constituting the video.
  • the scene generator 350 determines the similarity between shots that are adjacent on the timeline of the video. The similarity determination may be performed based on the keyword derived from each shot, the objects detected in each shot, the voice features detected in each shot, and the like. As a preferred embodiment of the present invention, the scene generator 350 may create a scene by grouping adjacent shots whose keyword-based similarity is high (S740).
  • An algorithm for performing grouping may include a hierarchical clustering technique (S750). In this case, a plurality of shots included in one scene may be interpreted as delivering content having similar meaning or subject matter. For an example of grouping shots through hierarchical clustering in the scene generator 350 , refer to FIG. 8 .
  • the scene generator 350 assigns a scene tag to each created scene (351, 353, 355).
  • the scene tag may be generated based on an image tag assigned to each of at least one shot included in each scene.
  • a scene tag may be generated by a combination of a tag set assigned to each of at least one shot constituting a scene.
  • the scene keyword may be generated by a combination of keywords derived from each of at least one shot constituting the scene.
  • the scene tag may serve as a weight when generating metadata for each scene.
  • the metadata generator 360 analyzes the scenes generated by the scene generator 350, and provides metadata for each scene, thereby supporting a search for internal video content (S760). Metadata assigned to each scene acts as an index.
  • the metadata is in the form of a summary sentence indicating the contents of each scene.
  • the metadata may be generated by further referring to a scene tag assigned to each of at least one shot constituting one scene.
  • Scene tags can serve as weights when performing deep learning to generate metadata. For example, weight may be assigned to image tag information and voice tag information extracted from at least one tag set included in the scene tag.
  • the metadata is generated based on Speech-to-Text (STT) data obtained from the voice data of the at least one shot constituting each scene and on the scene tag extracted from each of those shots.
  • for example, a summary sentence is generated by applying deep-learning-based machine learning to the at least one piece of STT data and the at least one scene tag obtained from the shots constituting a scene, and metadata is assigned to each scene using the summary sentence generated for that scene.
  • the video indexing unit 370 uses the metadata assigned to each scene of the video S300 as an index. For example, if the video S300 is divided into three scenes, the video indexing unit 370 uses the first sentence 371 assigned as metadata to the first scene 351 (0:00 to t1) as an index, the second sentence 373 assigned as metadata to the second scene 353 (t1 to t2) as an index, and the third sentence 375 assigned as metadata to the third scene 355 (t2 to t3) as an index.
  • when the user's search sentence is the first search sentence S311 and the first search sentence S311 has the highest degree of matching with the first sentence 371 among the plurality of metadata 371, 373, 375 assigned to the scenes of the video, the video indexing unit determines that the video section with the highest degree of matching with the search sentence entered in the search term input unit 310 is the first scene 351, and the video section playback unit 330 plays back only the section 0:00 to t1 of the first scene 351 in the video S300.
  • the video indexing unit 370 may determine the degree of matching using the Levenshtein distance technique, in which the value is 0 when two sentences are identical and increases as the similarity between the two sentences decreases; however, it is not limited thereto, and various algorithms for determining the similarity between two sentences may be used.
  • likewise, when the user's search sentence is the second search sentence S313 and it has the highest degree of matching with the second sentence 373 among the plurality of metadata 371, 373, 375, the video indexing unit determines that the best-matching video section is the second scene 353, and the video section playback unit 330 plays back only the section t1 to t2 of the second scene 353 in the video S300.
  • similarly, when the video indexing unit 370 determines that the user's search sentence is the third search sentence S315 and that it has the highest degree of matching with the third sentence 375, it determines that the video section best matching the search sentence entered in the search term input unit 310 is the third scene 355, and the video section playback unit 330 plays back only the section t2 to t3 of the third scene 355 in the video S300.
  • FIG. 4 shows an example of dividing a shot in a moving picture as a preferred embodiment of the present invention.
  • the x-axis represents time (sec)
  • the y-axis represents a representative HSV value.
  • the shot segmentation unit 340 of the video internal information search apparatus extracts frames from the video S300 as images at regular intervals and converts each image into the HSV color space. It then generates three time series consisting of the median values of H (hue) S401, S (saturation) S403 and V (brightness) S405 of each image. When the inflection points of the three time series coincide, or fall within a certain time window, the corresponding point is set as the start point or end point of a shot.
  • FIG. 5 shows an example of assigning a tag set to a shot as a preferred embodiment of the present invention.
  • FIG. 5 illustrates an example in which the first tag set 550 is applied to the first shot 510 .
  • the shot 510 is classified into image data 510a and audio data 510b.
  • the image data 510a is processed by extracting images once per second (520a), detecting objects in each image (530a), and generating an image tag based on the detected objects (540a). The image tag can be created from the objects extracted from each image after building training data by applying object annotation or labeling to the detected objects and then performing object recognition through deep learning for image recognition.
  • for the voice data 510b, STT conversion is performed (520b), morphemes are extracted (530b), and a voice tag is generated (540b); when both the image tag 540a and the voice tag 540b have been generated, the tag set 550 is generated.
  • the tag set refers to the combination of the image tag 540a and the voice tag 540b detected during the interval of the first shot 510, for example the section from 00:00 to 00:10.
  • FIG. 6 shows an example of grouping a shot into a scene as a preferred embodiment of the present invention.
  • FIG. 6 illustrates an example of creating a scene through hierarchical clustering 640 after determining the degree of similarity based on the keyword 630 .
  • FIG. 8 shows an embodiment of searching for information inside a moving picture as a preferred embodiment of the present invention.
  • FIG. 8 shows an example in which the video 800 selected by the user in the shot segmentation unit is segmented into seven shots 801 to 807.
  • the apparatus for searching video internal information generates a tag set by extracting an image tag and a voice tag from each of the seven shots 801 to 807, then performs topic analysis such as LDA on the tag sets to derive a keyword for each of the shots 801 to 807.
  • the first shot 801 is the section from 0:00 to 0:17, and the first keyword derived from it is (Japan, COVID-19, severe) 801a.
  • the second shot 802 is the section from 0:18 to 0:29, and the second keyword derived from it is (Japan, COVID-19, spread) 802a.
  • the third shot 803 is the section from 0:30 to 0:34, and the third keyword derived from it is (New York, COVID-19, Europe, inflow) 803a.
  • the fourth shot 804 is the section from 0:34 to 0:38, and the fourth keyword derived from it is (US, COVID-19, death) 804a.
  • the fifth shot 805 is the section from 0:39 to 0:41, and the fifth keyword derived from it is (US, COVID-19, confirmed, death).
  • the sixth shot 806 is the section from 0:42 to 0:45, and the sixth keyword derived from it is (US, COVID-19, death) 806a.
  • the seventh shot 807 is the section from 0:46 to 0:50, and the seventh keyword derived from it is (US, COVID-19, death) 807a.
  • the scene generator groups at least one shot based on the similarity.
  • the degree of similarity can be determined based on keywords extracted from each shot, and video tags and voice tags can be further referred to.
  • the first shot 801 and the second shot 802 are grouped into the first scene 810, the third shot 803 forms the second scene 820, and the fourth to seventh shots 804 to 807 are grouped into the third scene 830.
  • the first scene 810 is the section from 0:00 to 0:29; based on the first keyword (Japan, COVID-19, severe) 801a derived from the first shot 801, the second keyword (Japan, COVID-19, spread) 802a derived from the second shot 802, and the voice data of the first shot 801 and the second shot 802, the metadata sentence "COVID-19 continues to spread in Japan" 810b is assigned.
  • the second scene 820 is the section from 0:30 to 0:34; based on the third keyword (New York, COVID-19, Europe, inflow) 803a derived from the third shot 803 and the voice data of the third shot 803, the metadata sentence "New York's COVID-19 is said to be coming in from Europe" 820b is assigned.
  • the third scene 830 is the section from 0:35 to 0:50; based on the fourth keyword (US, COVID-19, death) 804a derived from the fourth shot 804, the fifth keyword (US, COVID-19, confirmed, death) derived from the fifth shot 805, the sixth keyword (US, COVID-19, death) 806a derived from the sixth shot 806, and the voice data of the fourth shot 804 to the sixth shot 806, the metadata sentence "This is news of COVID-19 deaths in the United States" 830b is assigned.
  • when the user selects the video 800 and the search term input interface is activated, the user inputs the content to be searched in the form of a sentence; for example, the search sentence "What is the current state of COVID-19 in the United States?" 840 may be input.
  • the video indexing unit searches for metadata with the highest degree of matching with the search word sentence 840 by using the metadata given to each scene as an index.
  • the degree of matching is determined based on the similarity between the search term 840 and the metadata 810b, 820b and 830b; the Levenshtein distance technique, which yields 0 when two sentences are identical, can be used.
  • the video indexing unit searches for the metadata most similar to the user's search term 840 "What is the current state of COVID-19 in the United States?", finds the third sentence 830b, and the third scene 830 to which it is assigned is played back to the user.
  • as a result, the user can search for and view only the section of the third scene 830, 0:35 to 0:50, that is related to the search term 840 within the video 800.
  • the video indexing unit may provide the user with metadata assigned to each scene 810 to 830 constituting the video as an index. Users can preview the contents of the video in advance through the video index.
  • Methods according to an embodiment of the present invention may be implemented in the form of program instructions that can be executed through various computer means and recorded in a computer-readable medium.
  • the computer-readable medium may include program instructions, data files, data structures, etc. alone or in combination.
  • the program instructions recorded on the medium may be specially designed and configured for the present invention, or may be known and available to those skilled in the art of computer software.

Abstract

As a preferred embodiment of the present invention, a method for searching for information inside a video comprises the steps of: receiving a sentence as a search term from a user; searching for the scene having the highest degree of matching with the search term within a video that is indexed scene by scene, with metadata assigned in sentence form to each scene; and reproducing only the section from the start point to the end point of the searched scene in the video.

Description

Method and apparatus for searching for information inside a video
The present invention relates to a method for searching for information inside a video.
With the rapid spread of the Internet, IPTV, SNS and mobile LTE, and the expansion of video OTT services such as YouTube and Netflix, the distribution and consumption of multimedia video is increasing rapidly. To check the contents of a video, a user generally has to watch it from beginning to end.
Recently, however, users increasingly want to watch only the scenes within a video that provide the information they are looking for; yet because video search is currently based on titles and descriptions, detailed search of the content contained inside a video remains difficult.
A preferred embodiment of the present invention therefore proposes a method of searching for, and providing to the user, only the specific section of a video that contains the information the user wants.
As a preferred embodiment of the present invention, a method for searching for information inside a video includes: receiving a sentence as a search term from a user; searching, within a video that is indexed scene by scene because metadata in sentence form is assigned to each scene, for the scene with the highest degree of matching with the search term; and reproducing only the section from the start point to the end point of the searched scene in the video.
As a preferred embodiment of the present invention, the user selects a single video to search and enters the content to be found inside the selected video in the form of a sentence.
As a preferred embodiment of the present invention, the video is composed of at least one scene, and each scene is assigned, in sentence form, a summary sentence describing its contents, which is used as metadata.
As a preferred embodiment of the present invention, the degree of matching is determined using the Levenshtein distance technique, which is 0 when two sentences are identical and grows larger as the similarity between the two sentences decreases.
As a preferred embodiment of the present invention, a method for searching for information inside a video includes: segmenting the video into shots; assigning a tag set to each segmented shot and deriving, through topic analysis of the tag sets, keywords that highlight the characteristics of each shot; and generating scenes by performing hierarchical clustering based on the similarity between adjacent shots determined from the keywords.
As a preferred embodiment of the present invention, the tag set is composed of an image tag and a voice tag.
As a preferred embodiment of the present invention, a scene tag is assigned to each scene and metadata in sentence form is also assigned, so that the video is indexed scene by scene.
As a preferred embodiment of the present invention, the metadata is generated based on at least one keyword derived from each of the at least one shot constituting a scene and on text obtained by converting the voice data of each of those shots using a Speech-To-Text (STT) technique.
As a preferred embodiment of the present invention, at least one sentence containing the keyword is selected from the text data obtained by STT-converting the voice data of each of the at least one shot constituting the scene, a single sentence is generated from the selected sentences through deep learning, and the generated sentence is stored as metadata describing the scene.
As a preferred embodiment of the present invention, the method further includes: extracting frames from the video as images in order to extract shots, and converting each image into the HSV color space; generating three time series consisting of the median values of H (hue), S (saturation) and V (brightness); and setting a point as the start or end point of a shot when the inflection points detected in all three time series coincide.
As a preferred embodiment of the present invention, an apparatus for searching video internal information includes: a search term input unit that receives a sentence as a search term from a user; a video section search unit that searches, within the video, for the specific section most relevant to the search term; and a video section playback unit that reproduces only that specific section within the video. The video is segmented based on meaning, a sentence explaining the meaning of each segmented section is assigned to it as metadata, and the relevance to the search term is determined by using those sentences as an index along the timeline of the video.
As a preferred embodiment of the present invention, the apparatus and method for searching video internal information provide the user with only the specific section of the video containing the content the user wants, so the user can grasp the desired information quickly and easily without watching the video from beginning to end. By presenting the contents of the video to the user like the table of contents of a book, the user can also check what the video contains before watching it.
FIG. 1 illustrates an example, as a preferred embodiment of the present invention, in which the components of a video are divided into scenes and shots.
FIG. 2 is a flowchart of a method for searching for information inside a video as a preferred embodiment of the present invention.
FIG. 3 shows the internal configuration of an apparatus for searching video internal information as a preferred embodiment of the present invention.
FIG. 4 shows an example of distinguishing shots in a video as a preferred embodiment of the present invention.
FIG. 5 shows an example of assigning a tag set to a shot as a preferred embodiment of the present invention.
FIG. 6 shows an example of grouping shots into a scene as a preferred embodiment of the present invention.
FIG. 7 is a flowchart of a method for searching for information inside a video as another preferred embodiment of the present invention.
FIG. 8 shows an embodiment of searching for information inside a video as a preferred embodiment of the present invention.
As a preferred embodiment of the present invention, a method for searching for information inside a video includes: receiving a sentence as a search term from a user; searching, within a video that is indexed scene by scene because metadata in sentence form is assigned to each scene, for the scene with the highest degree of matching with the search term; and reproducing only the section from the start point to the end point of the searched scene in the video.
Hereinafter, the invention is described in detail with reference to the drawings so that a person of ordinary skill in the art to which the invention pertains can easily understand and reproduce it.
FIG. 1 shows an example, as a preferred embodiment of the present invention, in which the components of a video are divided into scenes and shots.
As a preferred embodiment of the present invention, the video 100 is segmented into n shots (n is a natural number) 111, 113, 121, 123, 125, 131, 133. For the method used to distinguish shots in a video, refer to FIG. 4.
At least one shot is grouped into units with similar meaning or subject to constitute a scene. Referring to the example of FIG. 1, the first shot 111 and the second shot 113 are grouped into the first scene 110, the third shot 121, the fourth shot 123 and the fifth shot 125 into the second scene 120, and the sixth shot 131 and the seventh shot 133 into the third scene 130. In the present invention, a subject may include at least one meaning.
FIG. 2 is a flowchart of a method for searching video internal information as a preferred embodiment of the present invention.
When there is content the user wants to find in a specific video, the user selects the video and enters a search term through the search term input interface provided when the video selection is activated. It is assumed that the video is indexed scene by scene, with metadata in sentence form assigned to each scene. As a preferred embodiment of the present invention, when the video internal information search apparatus receives a sentence from the user as a search term, it searches the video for the specific section that matches or is highly related to the search term and plays back only that section.
As another preferred embodiment of the present invention, when the user inputs a search term related to content to be found inside a specific video (S210), the video internal information search apparatus searches the video for the scene with the highest degree of matching with the search term (S220) and plays back only the section from the start point to the end point of the searched scene (S230).
FIG. 3 shows the internal configuration of an apparatus 300 for searching video internal information as a preferred embodiment of the present invention. FIGS. 4 to 6 show detailed functions of the video section search unit 320 constituting the apparatus 300. FIG. 7 is a flowchart of searching for video internal information. A method for searching video internal information with the apparatus is described below with reference to FIGS. 3 to 7.
As a preferred embodiment of the present invention, the apparatus 300 for searching video internal information may be implemented in a terminal, a computer, a notebook computer, a handheld device or a wearable device. The apparatus 300 may also be implemented in the form of a terminal having an input unit that receives the user's search term, a display that shows the video, and a processor. The method for searching video internal information may also be implemented as an application installed on the terminal.
As a preferred embodiment of the present invention, the apparatus 300 for searching video internal information includes a search term input unit 310, a video section search unit 320 and a video section playback unit 330. The video section search unit 320 includes a shot segmentation unit 340, a scene generation unit 350, a metadata generation unit 360 and a video indexing unit 370.
The search term input unit 310 receives a search term from the user in the form of a sentence. The user can use voice search, text search, image search and the like. An example of image search is converting content scanned from a book into text and using it as a search term. The search term input unit 310 may be implemented as a keyboard, a stylus pen, a microphone or the like.
The video section search unit 320 searches for the specific section of the video that matches, or contains content related to, the search term received from the search term input unit 310. In one embodiment, the video section search unit 320 searches for the scene whose assigned metadata sentence has the highest degree of matching with the input search sentence.
The video section search unit 320 indexes and manages the video so that information can be searched within a single video.
Referring further to FIG. 7, the shot segmentation unit 340 segments the video into shots (S710), assigns a tag set to each segmented shot (S720), and derives a keyword for each shot by applying a topic analysis algorithm to the tag sets assigned to the shots (S730). The keywords are derived so as to identify and distinguish the content of each of the at least one shot constituting the video.
The scene generation unit 350 determines the similarity between shots that are adjacent on the timeline of the video. The similarity determination may be performed based on the keyword derived from each shot, the objects detected in each shot, the voice features detected in each shot, and the like. As a preferred embodiment of the present invention, the scene generation unit 350 may create a scene by grouping adjacent shots whose keyword-based similarity is high (S740). An algorithm for performing the grouping may include a hierarchical clustering technique (S750). In this case, the plurality of shots included in one scene can be interpreted as delivering content with similar meaning or subject matter. See FIG. 8 for an example of grouping shots through hierarchical clustering in the scene generation unit 350.
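A minimal sketch of how such keyword-based grouping could be carried out is given below. It assumes each shot is represented only by its derived keywords; the TF-IDF vectorisation, average-linkage cosine clustering and the distance threshold are illustrative assumptions, and the adjacency constraint along the timeline is ignored for brevity.

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.feature_extraction.text import TfidfVectorizer

def group_shots_into_scenes(shot_keywords, distance_threshold=0.5):
    """Hierarchically cluster shots whose keyword sets are similar; returns one scene label per shot."""
    documents = [" ".join(keywords) for keywords in shot_keywords]
    vectors = TfidfVectorizer().fit_transform(documents).toarray()
    tree = linkage(vectors, method="average", metric="cosine")   # S750: hierarchical clustering
    return list(fcluster(tree, t=distance_threshold, criterion="distance"))

# Shots with heavily overlapping keywords tend to receive the same scene label.
labels = group_shots_into_scenes([
    ["japan", "covid19", "severe"],
    ["japan", "covid19", "spread"],
    ["usa", "covid19", "death"],
])
```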
The scene generation unit 350 assigns a scene tag to each created scene 351, 353, 355. The scene tag may be generated based on the image tags assigned to the at least one shot included in each scene. In the present invention, a scene tag may be generated as a combination of the tag sets assigned to the at least one shot constituting the scene. A scene keyword may likewise be generated as a combination of the keywords derived from the at least one shot constituting the scene. In a preferred embodiment of the present invention, the scene tag can serve as a weight when generating the metadata for each scene.
The metadata generation unit 360 analyzes each scene generated by the scene generation unit 350 and assigns metadata to each scene, thereby supporting search of the content inside the video (S760). The metadata assigned to each scene acts as an index and takes the form of a summary sentence describing the contents of the scene.
The metadata may be generated by further referring to the scene tags assigned to the at least one shot constituting a scene. The scene tag can serve as a weight when performing deep learning to generate the metadata. For example, a weight may be assigned to the image tag information and the voice tag information extracted from the at least one tag set included in the scene tag.
The metadata is generated based on the Speech-to-Text (STT) data of the voice data extracted from the at least one shot constituting each scene and on the scene tag extracted from each of those shots. For example, a summary sentence is generated by applying deep-learning-based machine learning to the at least one piece of STT data and the at least one scene tag obtained from the shots constituting a scene, and metadata is assigned to each scene using the summary sentence generated for that scene.
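As a rough sketch of the sentence-selection step described above: the STT transcripts of a scene's shots are split into sentences, and sentences mentioning any of the scene's keywords are kept as input for the summarisation model. Splitting on periods and the simple keyword-containment test are simplifying assumptions; the deep-learning summarisation model itself is not shown.

```python
def scene_summary_candidates(stt_texts, scene_keywords):
    """Select STT sentences that mention at least one scene keyword; these feed the summarisation model."""
    sentences = [s.strip() for text in stt_texts for s in text.split(".") if s.strip()]
    return [s for s in sentences if any(keyword in s for keyword in scene_keywords)]

# The selected sentences would then be condensed into a single summary sentence by a
# sequence-to-sequence summarisation model (not shown); that sentence is stored as the
# scene's metadata and used as its index.
```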
The video indexing unit 370 uses the metadata assigned to each scene of the video S300 as an index. For example, if the video S300 is divided into three scenes, the video indexing unit 370 uses the first sentence 371 assigned as metadata to the first scene 351 (0:00 to t1) as an index, the second sentence 373 assigned as metadata to the second scene 353 (t1 to t2) as an index, and the third sentence 375 assigned as metadata to the third scene 355 (t2 to t3) as an index.
When the user's search sentence is the first search sentence S311 and the video indexing unit 370 determines that the first search sentence S311 has the highest degree of matching with the first sentence 371 among the plurality of metadata 371, 373, 375 assigned to the scenes of the video, it determines that the video section best matching the search sentence entered in the search term input unit 310 is the first scene 351. In this case, the video section playback unit 330 plays back only the section 0:00 to t1 of the first scene 351 in the video S300.
As a preferred embodiment of the present invention, the video indexing unit 370 may determine the degree of matching using the Levenshtein distance technique, in which the value is 0 when two sentences are identical and increases as the similarity between the two sentences decreases; however, it is not limited thereto, and various algorithms for determining the similarity between two sentences may be used.
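A minimal sketch of this matching step is shown below. It assumes each scene is represented by its metadata sentence and its (start, end) interval; the dictionary layout and the function names are illustrative, not part of the original disclosure.

```python
def levenshtein(a, b):
    """Edit distance between two sentences: 0 when identical, larger as similarity decreases."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(current[j - 1] + 1,              # insertion
                               previous[j] + 1,                 # deletion
                               previous[j - 1] + (ca != cb)))   # substitution
        previous = current
    return previous[-1]

def best_matching_scene(query, scene_index):
    """scene_index maps each metadata sentence to its (start, end) interval; return the closest scene."""
    best_sentence = min(scene_index, key=lambda sentence: levenshtein(query, sentence))
    return best_sentence, scene_index[best_sentence]

# Example: only the scene whose metadata sentence is closest to the search sentence is played back.
sentence, (start, end) = best_matching_scene(
    "What is the current state of COVID-19 in the United States?",
    {"COVID-19 continues to spread in Japan": (0, 29),
     "This is news of COVID-19 deaths in the United States": (35, 50)})
```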
Likewise, when the user's search sentence is the second search sentence S313 and the video indexing unit 370 determines that it has the highest degree of matching with the second sentence 373 among the plurality of metadata 371, 373, 375, it determines that the best-matching video section is the second scene 353. In this case, the video section playback unit 330 plays back only the section t1 to t2 of the second scene 353 in the video S300.
Similarly, when the video indexing unit 370 determines that the user's search sentence is the third search sentence S315 and that it has the highest degree of matching with the third sentence 375, it determines that the video section best matching the search sentence entered in the search term input unit 310 is the third scene 355. In this case, the video section playback unit 330 plays back only the section t2 to t3 of the third scene 355 in the video S300.
FIG. 4 shows an example of distinguishing shots in a video as a preferred embodiment of the present invention. In FIG. 4, the x-axis represents time (sec) and the y-axis represents the representative HSV value.
As a preferred embodiment of the present invention, the shot segmentation unit 340 of the video internal information search apparatus extracts frames from the video S300 as images at regular intervals and converts each image into the HSV color space. It then generates three time series consisting of the median values of H (hue) S401, S (saturation) S403 and V (brightness) S405 of each image. When the inflection points of the three time series coincide, or fall within a certain time window, the corresponding point is set as the start point or end point of a shot. In FIG. 4, the point t = 10 sec, where the inflection points of the three time series coincide, is set as the end point of the first shot 410 and the start point of the second shot 420, and the point t = 21 sec, where they coincide again, is set as the end point of the second shot 420 and the start point of the third shot 430.
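A rough sketch of this shot-boundary step follows, assuming OpenCV is available for frame sampling and HSV conversion. The text detects coinciding inflection points in the three median series; the sketch uses a simple threshold on frame-to-frame change as a stand-in for that inflection-point test, so the threshold and window constants are illustrative assumptions.

```python
import cv2
import numpy as np

def hsv_median_series(video_path, step_sec=1.0):
    """Sample frames at a fixed interval and return the time axis plus per-frame H, S, V medians."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(int(round(fps * step_sec)), 1)
    times, h, s, v = [], [], [], []
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
            times.append(index / fps)
            h.append(np.median(hsv[:, :, 0]))
            s.append(np.median(hsv[:, :, 1]))
            v.append(np.median(hsv[:, :, 2]))
        index += 1
    cap.release()
    return np.array(times), np.array(h), np.array(s), np.array(v)

def shot_boundaries(times, h, s, v, window_sec=1.0, k=3.0):
    """Mark a boundary where abrupt changes in H, S and V all fall within the same time window."""
    def change_points(series):
        diffs = np.abs(np.diff(series))
        return times[1:][diffs > diffs.mean() + k * diffs.std()]
    h_pts, s_pts, v_pts = change_points(h), change_points(s), change_points(v)
    return [float(t) for t in h_pts
            if (np.abs(s_pts - t) <= window_sec).any() and (np.abs(v_pts - t) <= window_sec).any()]
```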
FIG. 5 shows an example of assigning a tag set to a shot as a preferred embodiment of the present invention.
In a preferred embodiment of the present invention, after the video is segmented into shots, a tag set is assigned to each shot. FIG. 5 shows an example of assigning the first tag set 550 to the first shot 510.
In a preferred embodiment of the present invention, the shot 510 is separated into image data 510a and voice data 510b. For the image data 510a, images are extracted once per second (520a), objects are detected in each image (530a), and an image tag is generated based on the detected objects (540a). The image tag can be created from the objects extracted from each image after building training data by applying object annotation or labeling to the detected objects and performing object recognition through deep learning for image recognition.
For the voice data 510b, STT conversion is performed (520b), morphemes are extracted (530b), and a voice tag is generated (540b). When both the image tag 540a and the voice tag 540b have been generated, the tag set 550 is generated. The tag set refers to the combination of the image tag 540a and the voice tag 540b detected during the interval of the first shot 510, for example the section from 00:00 to 00:10.
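A schematic sketch of building one tag set per shot is given below. The callables detect_objects, speech_to_text and extract_morphemes are hypothetical placeholders for an object-detection model, an STT engine and a morphological analyser; none of them is named in the original text.

```python
def build_tag_set(shot_frames, shot_audio, detect_objects, speech_to_text, extract_morphemes):
    """Combine image tags (object labels per frame) with voice tags (morphemes of the STT transcript)."""
    image_tags = sorted({label for frame in shot_frames for label in detect_objects(frame)})  # 520a-540a
    transcript = speech_to_text(shot_audio)                                                    # 520b: STT
    voice_tags = sorted(set(extract_morphemes(transcript)))                                    # 530b-540b
    return {"image_tags": image_tags, "voice_tags": voice_tags, "transcript": transcript}
```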
FIG. 6 illustrates an example of grouping shots into a scene according to a preferred embodiment of the present invention.
After an integrated tag is assigned to each shot (610), a topic analysis algorithm is applied to the tag sets assigned to the shots (620) to derive keywords for each shot (630). The scene generation unit then determines the similarity between adjacent shots and groups them into scenes. FIG. 6 illustrates an example in which scenes are created through hierarchical clustering (640) after the similarity is determined based on the keywords (630).
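As an illustration of this step, the sketch below derives per-shot keywords with an off-the-shelf LDA implementation and then groups shots by hierarchical clustering of keyword similarity. It is a simplified stand-in: it assumes scikit-learn (≥ 1.0 for `get_feature_names_out`) and SciPy are available, measures similarity with the Jaccard index over keyword sets, and does not enforce the timeline-adjacency constraint described in the text.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def shot_keywords(shot_tags, n_topics=4, n_keywords=3):
    """shot_tags: one set of tag words (image + voice tags) per shot.
    Fit LDA over all shot 'documents' and keep the top words of each
    shot's dominant topic as that shot's keywords."""
    docs = [" ".join(sorted(tags)) for tags in shot_tags]
    vec = CountVectorizer()
    X = vec.fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0).fit(X)
    vocab = np.array(vec.get_feature_names_out())
    keywords = []
    for row in lda.transform(X):                      # per-shot topic distribution
        topic = int(row.argmax())
        top = lda.components_[topic].argsort()[::-1][:n_keywords]
        keywords.append(set(vocab[top]))
    return keywords

def group_into_scenes(keywords, max_distance=0.5):
    """Average-linkage hierarchical clustering over Jaccard distances between
    per-shot keyword sets; returns a scene label for each shot."""
    n = len(keywords)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            union = keywords[i] | keywords[j]
            d = 1.0 if not union else 1.0 - len(keywords[i] & keywords[j]) / len(union)
            dist[i, j] = dist[j, i] = d
    Z = linkage(squareform(dist), method="average")
    return fcluster(Z, t=max_distance, criterion="distance")
```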
FIG. 8 illustrates an embodiment of searching for information inside a video according to a preferred embodiment of the present invention.
Assume, as an example, that a user selects a 50-second video 800 whose content the user wants to search.
The embodiment of FIG. 8 shows an example in which the shot segmentation unit segments the video 800 selected by the user into seven shots 801 to 807. The apparatus for searching information inside a video extracts an image tag and a voice tag from each of the seven shots 801 to 807 to generate a tag set, and then performs topic analysis such as LDA on the tag sets to derive keywords for each of the shots 801 to 807.
Referring to the embodiment of FIG. 8, the first shot 801 covers the interval 0:00–0:17, and the first keywords derived from the first shot 801 are (Japan, COVID-19, severe) 801a. The second shot 802 covers 0:18–0:29, and the second keywords derived from the second shot 802 are (Japan, COVID-19, spread) 802a. The third shot 803 covers 0:30–0:34, and the third keywords derived from the third shot 803 are (New York, COVID-19, Europe, inflow) 803a. The fourth shot 804 covers 0:34–0:38, and the fourth keywords derived from the fourth shot 804 are (US, COVID-19, death) 804a. The fifth shot 805 covers 0:39–0:41, and the fifth keywords derived from the fifth shot 805 are (US, COVID-19, confirmed, death). The sixth shot 806 covers 0:42–0:45, and the sixth keywords derived from the sixth shot 806 are (US, COVID-19, death) 806a. The seventh shot 807 covers 0:46–0:50, and the seventh keywords derived from the seventh shot 807 are (US, COVID-19, death) 807a.
The scene generation unit groups one or more shots based on similarity. The similarity can be determined based on the keywords extracted from each shot, and the image tags and voice tags can additionally be taken into account.
In the embodiment of FIG. 8, the first shot 801 and the second shot 802 are grouped into the first scene 810, the third shot 803 is grouped into the second scene 820, and the fourth to seventh shots 804 to 807 are grouped into the third scene 830.
The first scene 810 covers the interval 0:00–0:29. Based on the first keywords (Japan, COVID-19, severe) 801a derived from the first shot 801, the second keywords (Japan, COVID-19, spread) 802a derived from the second shot 802, and the audio data of the first shot 801 and the second shot 802, the metadata sentence "COVID-19 continues to spread in Japan." 810b is assigned to the scene.
The second scene 820 covers the interval 0:30–0:34. Based on the third keywords (New York, COVID-19, Europe, inflow) 803a derived from the third shot 803 and the audio data of the third shot 803, the metadata sentence "COVID-19 in New York is reportedly being brought in from Europe." 820b is assigned.
The third scene 830 covers the interval 0:35–0:50. Based on the fourth keywords (US, COVID-19, death) 804a derived from the fourth shot 804, the fifth keywords (US, COVID-19, confirmed, death) derived from the fifth shot 805, the sixth keywords (US, COVID-19, death) 806a derived from the sixth shot 806, and the audio data of the fourth shot 804 to the sixth shot 806, the metadata sentence "This is news of confirmed COVID-19 deaths in the United States." 830b is assigned.
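The metadata sentences in this example could be produced, in a much-simplified form, by picking from the scene's STT transcript the sentence that covers the most scene keywords. The description (and claim 9) speak of generating a single sentence through deep learning, for example abstractive summarization; the heuristic below is only a hypothetical stand-in for that step.

```python
def scene_metadata_sentence(scene_keywords, stt_sentences):
    """Pick, from the scene's STT transcript, the sentence that covers the most
    scene keywords and use it as the scene's metadata sentence.  A keyword-
    coverage heuristic standing in for the deep-learning summarization step."""
    def coverage(sentence):
        lowered = sentence.lower()
        return sum(1 for kw in scene_keywords if kw.lower() in lowered)
    candidates = [s for s in stt_sentences if coverage(s) > 0]
    if not candidates:
        return " ".join(sorted(scene_keywords))   # fall back to the bare keywords
    return max(candidates, key=coverage)
```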
In a preferred embodiment of the present invention, after the user selects the video 800 and the query input interface is activated, the user enters the content to be searched in the form of a sentence. For example, the user may enter the query sentence "What is the current COVID-19 situation in the United States?" 840.
The video indexing unit uses the metadata assigned to each scene as an index and searches for the metadata with the highest degree of matching to the query sentence 840. The degree of matching is determined based on the similarity between the query 840 and the metadata 810b, 820b, and 830b; the Levenshtein distance, which is 0 when two sentences are identical and grows as the similarity between the two sentences decreases, can be used.
The video indexing unit finds "This is news of confirmed COVID-19 deaths in the United States." 830b as the metadata most similar to the user query 840 "What is the current COVID-19 situation in the United States?", and then plays back for the user the third scene 830 to which that metadata is assigned. By entering the query 840, the user can retrieve and watch only the third scene 830, i.e. the interval 0:35–0:50 of the video 800 that is related to the query 840.
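A compact sketch of the matching and playback selection described here is given below: a plain Levenshtein edit distance between the query sentence and each scene's metadata sentence, with the closest scene's interval returned for playback. The scene tuples and sentences are the illustrative FIG. 8 values as translated above, not data from an actual system.

```python
def levenshtein(a, b):
    """Edit distance: 0 for identical sentences, larger as similarity decreases."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (ca != cb))) # substitution
        prev = cur
    return prev[-1]

def find_best_scene(query, scenes):
    """scenes: list of (start_sec, end_sec, metadata_sentence) tuples.
    Returns the scene whose metadata sentence is closest to the query."""
    return min(scenes, key=lambda scene: levenshtein(query, scene[2]))

# Usage with the illustrative FIG. 8 values:
scenes = [
    (0, 29, "COVID-19 continues to spread in Japan."),
    (30, 34, "COVID-19 in New York is reportedly being brought in from Europe."),
    (35, 50, "This is news of confirmed COVID-19 deaths in the United States."),
]
start, end, _ = find_best_scene(
    "What is the current COVID-19 situation in the United States?", scenes)
# start, end give the interval to play back; with these sentences the third
# scene (0:35-0:50) is expected to be the closest match.
```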
As another preferred embodiment of the present invention, the video indexing unit may provide the user with the metadata assigned to each of the scenes 810 to 830 constituting the video as an index. Through the video index, the user can preview an outline of the contents of the video in advance.
The methods according to embodiments of the present invention may be implemented in the form of program instructions that can be executed by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the present invention, or may be known to and usable by those skilled in the art of computer software.
Although the present invention has been described above with reference to limited embodiments and drawings, the present invention is not limited to these embodiments, and those of ordinary skill in the art to which the present invention pertains will be able to make various modifications and variations from this description.

Claims (15)

  1. A method of searching for information inside a video, the method comprising:
    receiving a sentence as a query from a user;
    searching, within a video indexed scene by scene with metadata assigned to each scene in the form of a sentence, for the scene with the highest degree of matching to the query; and
    playing back only the section of the video from the start point to the end point of the retrieved scene.
  2. The method of claim 1,
    wherein the user selects one video to be searched and enters the content to be found inside the selected video in the form of a sentence.
  3. The method of claim 1,
    wherein the video consists of at least one scene, and each of the at least one scene is assigned a summary sentence describing the contents of that scene, in the form of a sentence, which is used as metadata.
  4. The method of claim 1,
    wherein the degree of matching is determined using the Levenshtein distance, which is 0 when two sentences are identical and grows as the similarity between the two sentences decreases.
  5. The method of claim 1, further comprising:
    segmenting the video into shots;
    assigning a tag set to each of the segmented shots and deriving, for each shot through topic analysis of the tag set, keywords that highlight the characteristics of that shot; and
    generating the scenes by performing hierarchical clustering based on the similarity between adjacent shots determined from the keywords.
  6. The method of claim 5,
    wherein the tag set consists of image tags and voice tags.
  7. The method of claim 5,
    wherein each of the scenes is assigned a scene tag and metadata in the form of a sentence, so that the video is indexed scene by scene.
  8. The method of claim 5,
    wherein the metadata is generated based on at least one keyword derived from each of the at least one shot constituting a scene and on text obtained by converting the audio data of each of the at least one shot through speech-to-text (STT).
  9. The method of claim 8,
    wherein at least one sentence containing the keywords is selected from the at least one text obtained by converting, through STT, the audio data of the at least one shot constituting the scene, a single sentence is generated from the selected at least one sentence through deep learning, and the generated single sentence is stored as metadata describing the scene.
  10. The method of claim 1, further comprising:
    extracting frames from the video as images and converting each image into the HSV color space;
    generating three time series composed of the representative (median) values of H (hue), S (saturation), and V (brightness); and
    setting a point as the start or end point of a shot when the three inflection points detected in the three time series all coincide at that point.
  11. The method of claim 10, further comprising:
    grouping at least one shot into a scene, wherein the grouping is performed based on the similarity between shots that are adjacent on the timeline.
  12. An apparatus for searching information inside a video, the apparatus comprising:
    a query input unit for receiving a sentence as a query from a user;
    a video section search unit for searching, within the video, for the specific section with the highest relevance to the query; and
    a video section playback unit for playing back only the specific section within the video,
    wherein the video is segmented on a semantic basis, a sentence describing the meaning of each segmented section is assigned to that section as metadata, and the relevance to the query is determined along the timeline of the video using the sentences as an index.
  13. The apparatus of claim 12, wherein the video section search unit comprises:
    a shot segmentation unit for segmenting the video into shots, assigning a tag set to each segmented shot, and deriving, for each shot through topic analysis of the tag set, keywords that highlight the characteristics of that shot; and
    a scene generation unit for generating the scenes by performing hierarchical clustering based on the similarity between adjacent shots determined from the keywords.
  14. The apparatus of claim 13, further comprising:
    a metadata generation unit for analyzing each scene generated by the scene generation unit and assigning metadata to each scene in the form of a sentence.
  15. A computer-readable recording medium on which a program for performing the method of any one of claims 1 to 11 is recorded.
PCT/KR2020/005718 2020-04-29 2020-04-29 Method and apparatus for searching for information inside video WO2021221209A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/KR2020/005718 WO2021221209A1 (en) 2020-04-29 2020-04-29 Method and apparatus for searching for information inside video
KR1020207014777A KR20210134866A (en) 2020-04-29 2020-04-29 Methods and devices for retrieving information inside a video

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/KR2020/005718 WO2021221209A1 (en) 2020-04-29 2020-04-29 Method and apparatus for searching for information inside video

Publications (1)

Publication Number Publication Date
WO2021221209A1 true WO2021221209A1 (en) 2021-11-04

Family

ID=78374167

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2020/005718 WO2021221209A1 (en) 2020-04-29 2020-04-29 Method and apparatus for searching for information inside video

Country Status (2)

Country Link
KR (1) KR20210134866A (en)
WO (1) WO2021221209A1 (en)



Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20080111376A (en) * 2007-06-18 2008-12-23 한국전자통신연구원 System and method for managing digital videos using video features
JP2016035607A (en) * 2012-12-27 2016-03-17 パナソニック株式会社 Apparatus, method and program for generating digest
KR20150022088A (en) * 2013-08-22 2015-03-04 주식회사 엘지유플러스 Context-based VOD Search System And Method of VOD Search Using the Same
KR20190114548A (en) * 2018-03-30 2019-10-10 주식회사 엘지유플러스 Apparatus and method for controlling contents, or content control apparatus and method thereof
KR20190129266A (en) * 2018-05-10 2019-11-20 네이버 주식회사 Content providing server, content providing terminal and content providing method

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116702707A (en) * 2023-08-03 2023-09-05 腾讯科技(深圳)有限公司 Action generation method, device and equipment based on action generation model
CN116702707B (en) * 2023-08-03 2023-10-03 腾讯科技(深圳)有限公司 Action generation method, device and equipment based on action generation model
CN117633297A (en) * 2024-01-26 2024-03-01 江苏瑞宁信创科技有限公司 Video retrieval method, device, system and medium based on annotation

Also Published As

Publication number Publication date
KR20210134866A (en) 2021-11-11

Similar Documents

Publication Publication Date Title
WO2010117213A2 (en) Apparatus and method for providing information related to broadcasting programs
WO2020080606A1 (en) Method and system for automatically generating video content-integrated metadata using video metadata and script data
US6507838B1 (en) Method for combining multi-modal queries for search of multimedia data using time overlap or co-occurrence and relevance scores
US6578040B1 (en) Method and apparatus for indexing of topics using foils
US20050038814A1 (en) Method, apparatus, and program for cross-linking information sources using multiple modalities
KR101516995B1 (en) Context-based VOD Search System And Method of VOD Search Using the Same
US20110078176A1 (en) Image search apparatus and method
US20100318532A1 (en) Unified inverted index for video passage retrieval
WO2021221209A1 (en) Method and apparatus for searching for information inside video
US9131207B2 (en) Video recording apparatus, information processing system, information processing method, and recording medium
WO2017188606A2 (en) Terminal device and method for providing additional information
KR101640317B1 (en) Apparatus and method for storing and searching image including audio and video data
Luo et al. Exploring large-scale video news via interactive visualization
JP2007328713A (en) Related term display device, searching device, method thereof, and program thereof
US20080016068A1 (en) Media-personality information search system, media-personality information acquiring apparatus, media-personality information search apparatus, and method and program therefor
CN110008314B (en) Intention analysis method and device
WO2021221210A1 (en) Method and apparatus for generating smart route
Sack et al. Automated annotations of synchronized multimedia presentations
WO2021167238A1 (en) Method and system for automatically creating table of contents of video on basis of content
Aletras et al. Computing similarity between cultural heritage items using multimodal features
Kim et al. Content-Based Video Indexing and Retrieval--A Natural Language Approach--
WO2015190834A1 (en) Method for searching for and providing video
JP2007293602A (en) System and method for retrieving image and program
WO2016089110A1 (en) Entry-based knowledge resource generation device and method
KR20200063316A (en) Apparatus for searching video based on script and method for the same

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20933240

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20933240

Country of ref document: EP

Kind code of ref document: A1


32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205 DATED 11/04/2023)
