WO2021221209A1 - Method and apparatus for searching for information inside video - Google Patents
- Publication number
- WO2021221209A1 (PCT/KR2020/005718)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- video
- scene
- shot
- sentence
- metadata
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/783—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
- G06F16/7844—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/71—Indexing; Data structures therefor; Storage structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/73—Querying
- G06F16/732—Query formulation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/783—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/783—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
- G06F16/7847—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content
- G06F16/785—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content using colour or luminescence
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/7867—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
Definitions
- the present invention relates to a method for retrieving information inside a moving picture.
- it is intended to propose a method of searching for only the specific section of a video that contains the information desired by the user, and of providing that section to the user.
- a method for searching for information inside a video includes: receiving a sentence as a search word from a user; searching for the scene having the highest degree of matching with the search word within a video that is indexed in scene units by metadata assigned in sentence form to each scene; and reproducing only the section from the start point to the end point of the searched scene in the video.
- the user selects one video to be searched, and searches for the desired content within the selected video in the form of a sentence.
- the moving picture is composed of at least one scene, and a summary sentence indicating the content of each of the at least one scene is assigned to each scene in the form of a sentence and used as metadata.
- the degree of matching is determined using the Levenshtein distance technique, in which the value becomes 0 when two sentences are identical and increases as the similarity between the two sentences decreases.
- a method for retrieving information inside a moving image includes: segmenting the moving image into shot units; assigning a tag set to each of the segmented shots, and deriving, for each shot, keywords that highlight the characteristics of that shot through topic analysis of the tag set; and generating the scene by performing hierarchical clustering based on the similarity between adjacent shots determined based on the keywords.
- the tag set is composed of a video tag and an audio tag.
- a scene tag is assigned to each scene, and metadata is assigned in the form of a sentence, so that the moving picture is indexed for each scene.
- the metadata is generated based on at least one keyword derived from each of the at least one shot constituting one scene and on text obtained by converting the voice data of each of the at least one shot through a Speech To Text (STT) technique.
- at least one sentence containing the keyword is selected from the at least one text data item obtained by STT conversion of the voice data of each of the at least one shot constituting the scene; a single sentence is then generated from the selected sentences through deep learning, and the generated single sentence is stored as metadata describing the scene.
- the method further includes: extracting frames from the video as images and converting each image into an HSV color space; generating three time series data consisting of the median values of H (hue), S (saturation), and V (brightness); and setting the corresponding point as the start or end point of a shot when the three inflection points detected in the three time series data all coincide.
- an apparatus for searching video internal information includes: a search word input unit for receiving a sentence as a search word from a user; a video section search unit for searching for the specific section having the highest degree of relevance to the search word within the video; and a video section playback unit for reproducing only the specific section within the video, wherein the video is segmented based on meaning, a sentence explaining the meaning of each segmented section is assigned to that section as metadata, and the relevance to the search word is determined by using the sentences as an index along the timeline of the video.
- the apparatus and method for searching video internal information provide the user with only the specific section of a video that the user wants, so that the user can grasp the desired information quickly and easily without having to watch the video from beginning to end.
- the user can check in advance what content is included before watching the video.
- FIG. 1 illustrates an example in which components constituting a moving picture are divided into a scene and a shot as a preferred embodiment of the present invention.
- FIG. 2 is a flowchart of a method for retrieving information inside a moving picture as a preferred embodiment of the present invention.
- FIG. 3 is a diagram showing an internal configuration of an apparatus for searching video internal information as a preferred embodiment of the present invention.
- FIG. 4 shows an example of dividing a shot in a moving picture as a preferred embodiment of the present invention.
- FIG. 5 shows an example of assigning a tag set to a shot as a preferred embodiment of the present invention.
- FIG. 6 shows an example of grouping a shot into a scene as a preferred embodiment of the present invention.
- FIG. 7 is a flowchart of a method for retrieving information in a moving picture as another preferred embodiment of the present invention.
- FIG. 8 shows an embodiment of searching for information inside a moving picture as a preferred embodiment of the present invention.
- a method for searching for information inside a video includes the steps of: receiving a sentence as a search word from a user; searching for the scene having the highest degree of matching with the search word within a video in which metadata is assigned in sentence form to each scene and which is thereby indexed in scene units; and reproducing only the section from the start point to the end point of the searched scene in the video.
- FIG. 1 shows an example in which components constituting a moving picture are divided into a scene and a shot as a preferred embodiment of the present invention.
- the moving picture 100 is segmented into n shots (n is a natural number) 111, 113, 121, 123, 125, 131, 133.
- For a method of dividing a moving picture into shots, refer to FIG. 4.
- At least one shot is grouped into units having similar meanings or subjects to constitute a scene.
- the first shot 111 and the second shot 113 are grouped into the first scene 110, the third shot 121, the fourth shot 123, and the fifth shot 125 may be grouped into the second scene 120, and the sixth shot 131 and the seventh shot 133 may be grouped into the third scene 130.
- a subject may include at least one meaning.
- FIG. 2 is a flowchart of a method for retrieving video internal information as a preferred embodiment of the present invention.
- the user selects the video and inputs a search word through a search word input interface provided when video selection is activated.
- the video is indexed in units of scenes by providing metadata in the form of sentences for each scene.
- the video internal information search apparatus searches for a specific section that matches the search word or has high relevance in the video, and reproduces only the searched specific section.
- the video internal information search apparatus receives the search word (S210), searches the video for the scene with the highest degree of matching with the search word (S220), and plays back only the section from the start point to the end point of the searched scene (S230).
- FIG. 3 shows an internal configuration diagram of an apparatus 300 for searching video internal information as a preferred embodiment of the present invention.
- FIGS. 4 to 6 show detailed functions of the video section search unit 320 constituting the apparatus 300 for searching video internal information.
- FIG. 7 is a flowchart of a method of searching for video internal information.
- a method for searching video internal information in a device for searching video internal information will be described with reference to FIGS. 3 to 7 .
- the apparatus 300 for searching video internal information may be implemented in a terminal, a computer, a notebook computer, a handheld device, or a wearable device.
- the apparatus 300 for searching video internal information may be implemented in the form of a terminal having an input unit for receiving a user's search word, a display for displaying a video, and a processor.
- the method of searching the video internal information may be implemented by being installed in the form of an application in the terminal.
- the apparatus 300 for searching video internal information includes a search word input unit 310 , a video section search unit 320 , and a video section playback unit 330 .
- the video section search unit 320 includes a shot segmentation unit 340 , a scene generation unit 350 , a metadata generation unit 360 , and a video index unit 370 .
- the search word input unit 310 receives a search word from the user in the form of a sentence.
- the user can use all forms such as voice search, text search, and image search.
- An example of an image search is a case where the contents scanned from a book are converted into text and used as a search term.
- the search word input unit 310 may be implemented as a keyboard, a stylus pen, a microphone, or the like.
- the video section search unit 320 searches for a specific section in the video that matches the search word input from the search word input unit 310 or has content related to the search word. As an embodiment, the video section search unit 320 searches for a scene in which a sentence having the highest degree of matching with the input search word sentence is assigned as metadata.
- the video section search unit 320 indexes and manages videos so that information can be searched within a single video.
- the shot segmentation unit 340 segments the video into shot units (S710), assigns a tag set to each segmented shot (S720), and derives a keyword for each shot by applying a topic analysis algorithm to the tag set (S730). The keywords are derived so as to identify and distinguish the content of each of the at least one shot constituting the moving picture.
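- The patent leaves the topic-analysis algorithm open (the FIG. 8 discussion later names LDA as one option). A minimal Python sketch of step S730 using gensim's LDA is shown below; the tag vocabulary, topic count, and keyword-selection rule are illustrative assumptions, not values fixed by the disclosure.

```python
# Hypothetical sketch of S720-S730: derive per-shot keywords by topic analysis
# of each shot's tag set. Tag values and parameters are illustrative only.
from gensim import corpora
from gensim.models import LdaModel

# One tag set per shot: image tags and voice tags combined (assumed values).
shot_tag_sets = [
    ["japan", "street", "mask", "corona19", "severe", "anchor"],
    ["japan", "crowd", "mask", "corona19", "spread", "subway"],
    ["new_york", "airport", "corona19", "europe", "inflow"],
]

dictionary = corpora.Dictionary(shot_tag_sets)
corpus = [dictionary.doc2bow(tags) for tags in shot_tag_sets]
lda = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=20, random_state=0)

def shot_keywords(bow, top_n=3):
    """Take the shot's dominant topic and use its top terms as the shot's keywords."""
    topic_id, _ = max(lda.get_document_topics(bow), key=lambda t: t[1])
    return [term for term, _ in lda.show_topic(topic_id, topn=top_n)]

for i, bow in enumerate(corpus):
    print(f"shot {i}: {shot_keywords(bow)}")
```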
- the scene generator 350 determines the similarity between shots that are adjacent on the timeline of the video. The similarity determination may be performed based on the keyword derived from each shot, the objects detected in each shot, the voice features detected in each shot, and the like. As a preferred embodiment of the present invention, the scene generator 350 may create a scene by grouping adjacent shots having a high degree of similarity based on the keywords (S740).
- An algorithm for performing grouping may include a hierarchical clustering technique (S750). In this case, a plurality of shots included in one scene may be interpreted as delivering content having similar meaning or subject matter. For an example of grouping shots through hierarchical clustering in the scene generator 350 , refer to FIG. 8 .
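- For the grouping step (S740-S750), the disclosure requires only hierarchical clustering over a keyword-based similarity between adjacent shots. The sketch below is one possible reading, using Jaccard similarity over keyword sets and a merge threshold; both of those choices are assumptions.

```python
# A minimal sketch (S740-S750) of grouping timeline-adjacent shots into scenes by
# agglomerative (hierarchical) clustering over keyword similarity. The Jaccard
# measure and the 0.4 threshold are assumptions, not values fixed by the patent.

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster_shots(shot_keywords: list, threshold: float = 0.4) -> list:
    # Start with every shot as its own scene, then repeatedly merge the most
    # similar pair of *adjacent* scenes until no pair clears the threshold.
    scenes = [[i] for i in range(len(shot_keywords))]
    def scene_kw(scene):
        return set().union(*(shot_keywords[i] for i in scene))
    while len(scenes) > 1:
        sims = [jaccard(scene_kw(scenes[i]), scene_kw(scenes[i + 1]))
                for i in range(len(scenes) - 1)]
        best = max(range(len(sims)), key=lambda i: sims[i])
        if sims[best] < threshold:
            break
        scenes[best:best + 2] = [scenes[best] + scenes[best + 1]]
    return scenes

keywords = [{"japan", "corona19", "severe"}, {"japan", "corona19", "spread"},
            {"new_york", "corona19", "europe"}, {"us", "corona19", "death"},
            {"us", "corona19", "confirmed", "death"}]
print(cluster_shots(keywords))  # [[0, 1], [2], [3, 4]]
```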
- the scene generator 350 assigns a scene tag to each created scene (351, 353, 355).
- the scene tag may be generated based on an image tag assigned to each of at least one shot included in each scene.
- a scene tag may be generated by a combination of a tag set assigned to each of at least one shot constituting a scene.
- the scene keyword may be generated by a combination of keywords derived from each of at least one shot constituting the scene.
- the scene tag may serve as a weight when generating metadata for each scene.
- the metadata generator 360 analyzes the scenes generated by the scene generator 350, and provides metadata for each scene, thereby supporting a search for internal video content (S760). Metadata assigned to each scene acts as an index.
- the metadata is in the form of a summary sentence indicating the contents of each scene.
- the metadata may be generated by further referring to a scene tag assigned to each of at least one shot constituting one scene.
- Scene tags can serve as weights when performing deep learning to generate metadata. For example, weight may be assigned to image tag information and voice tag information extracted from at least one tag set included in the scene tag.
- the metadata is generated based on STT (Speech to Text) data of voice data extracted from at least one shot constituting each scene, and a scene tag extracted from each of at least one shot constituting each scene.
- a summary sentence is generated by performing deep-learning-based machine learning on at least one STT data item and at least one scene tag obtained from the at least one shot constituting one scene. Metadata is assigned to each scene by using the summary sentence generated through this machine learning.
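- The disclosure does not name the deep learning model that condenses the selected sentences into one. A hedged sketch of the metadata step, substituting an off-the-shelf abstractive summarizer (the Hugging Face pipeline and model name are assumptions), might look like:

```python
# Sketch of the metadata step: STT text per shot -> keep sentences containing a
# scene keyword -> condense to a single sentence with a summarizer. The model
# is an assumed stand-in for the patent's unspecified deep learning module.
import re
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def scene_metadata(stt_texts: list, keywords: set) -> str:
    # Split each shot's STT transcript into sentences.
    sentences = [s.strip() for text in stt_texts
                 for s in re.split(r"(?<=[.!?])\s+", text)]
    # Select only the sentences that contain at least one scene keyword.
    selected = [s for s in sentences
                if any(k.lower() in s.lower() for k in keywords)]
    if not selected:
        return ""
    out = summarizer(" ".join(selected), max_length=30, min_length=5,
                     do_sample=False)
    return out[0]["summary_text"]  # stored as the scene's sentence-form metadata
```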
- the video indexing unit 370 uses the metadata assigned to each scene of the video S300 as an index. For example, if the video S300 is divided into three scenes, the video indexing unit 370 uses as indexes the first sentence 371 assigned as metadata to the first scene 351 (0:00 to t1), the second sentence 373 assigned as metadata to the second scene 353 (t1 to t2), and the third sentence 375 assigned as metadata to the third scene 355 (t2 to t3).
- when the user's search sentence is the first search sentence S311, and the first search sentence S311 has the highest degree of matching with the first sentence 371 among the plurality of metadata 371, 373, and 375 assigned to the scenes of the video, the video section having the highest degree of matching with the search sentence input to the search word input unit 310 is the first scene 351.
- the video section reproducing unit 330 then reproduces only the section 0:00 to t1 of the first scene 351 in the video S300.
- the video indexing unit 370 may determine the degree of matching using the Levenshtein distance technique, in which the value becomes 0 when two sentences are identical and increases as the similarity between the two sentences decreases; however, the present invention is not limited thereto, and various other algorithms for determining the similarity between two sentences may be used.
- likewise, when the user's search sentence is the second search sentence S313 and the second search sentence S313 has the highest degree of matching with the second sentence 373 among the plurality of metadata 371, 373, and 375, the video section reproducing unit 330 reproduces only the section t1 to t2 of the second scene 353 in the video S300.
- when the video indexing unit 370 determines that the user's search sentence is the third search sentence S315 and that the third search sentence S315 has the highest degree of matching with the third sentence 375, it determines that the video section having the highest degree of matching with the search sentence input to the search word input unit 310 is the third scene 355.
- the video section reproducing unit 330 reproduces only the section t2 to t3 of the third scene 355 in the video S300.
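- Putting the index lookup together, the sketch below implements the Levenshtein matching described above over a small scene index; the scene sentences and timestamps are illustrative, and a real system could substitute any of the other similarity measures the paragraph above allows.

```python
# A minimal sketch of the index lookup: the scene whose sentence metadata has
# the smallest Levenshtein distance to the query is returned with its interval.
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance: 0 when the sentences are
    # identical, growing as they become less similar.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

scene_index = [  # (metadata sentence, start, end) -- illustrative values
    ("Japan Corona 19 continues to spread", "0:00", "0:29"),
    ("New York's Corona 19 is said to be coming from Europe", "0:30", "0:34"),
    ("This is the news of the death of COVID-19 in the United States", "0:35", "0:50"),
]

query = "What is the current state of Corona in the United States?"
best = min(scene_index, key=lambda s: levenshtein(query.lower(), s[0].lower()))
print(best)  # the player then reproduces only best[1]..best[2]
```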
- FIG. 4 shows an example of dividing a shot in a moving picture as a preferred embodiment of the present invention.
- the x-axis represents time (sec)
- the y-axis represents a representative HSV value.
- the shot segmentation unit 340 of the video internal information search apparatus extracts frames from the video S300 at regular intervals as images and converts each image into the HSV color space. It then generates three time series data consisting of the median values of H (hue) S401, S (saturation) S403, and V (brightness) S405 of each image. When the inflection points of the three time series data of H (hue) S401, S (saturation) S403, and V (brightness) S405 all coincide, or fall within a certain time window of one another, the corresponding point is set as the start point or end point of a shot.
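- A sketch of this boundary rule, assuming OpenCV for decoding and using slope sign changes as a simple stand-in for the inflection-point test, is shown below; the sampling interval and coincidence window are illustrative.

```python
# Sketch of FIG. 4: sample frames at a fixed interval, take the median H/S/V per
# frame as three time series, and mark a shot boundary where all three series
# change direction at (nearly) the same time.
import cv2
import numpy as np

def hsv_median_series(path: str, step_sec: float = 1.0):
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    times, series, idx = [], [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % max(1, int(fps * step_sec)) == 0:
            hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
            series.append([np.median(hsv[..., c]) for c in range(3)])
            times.append(idx / fps)
        idx += 1
    cap.release()
    return np.asarray(times), np.asarray(series)  # columns: H, S, V medians

def shot_boundaries(t, series, window_sec: float = 1.0):
    # Turning points (sign changes of the first difference) per channel serve
    # as a simple stand-in for the patent's inflection points; a boundary is
    # declared where all three channels turn within the coincidence window.
    turns = [set(t[1:-1][np.diff(np.sign(np.diff(series[:, c]))) != 0])
             for c in range(3)]
    return sorted(p for p in turns[0]
                  if any(abs(p - q) <= window_sec for q in turns[1])
                  and any(abs(p - r) <= window_sec for r in turns[2]))
```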
- FIG. 5 shows an example of assigning a tag set to a shot as a preferred embodiment of the present invention.
- FIG. 5 illustrates an example in which the first tag set 550 is applied to the first shot 510 .
- the shot 510 is classified into image data 510a and audio data 510b.
- for the image data 510a, images are extracted at a rate of one per second (520a), an object is detected in each image (530a), and an image tag is generated based on the detected objects (540a).
- image tags can be created based on information obtained by extracting objects from each image, by applying object annotation or labeling to the detected objects to construct training data and then performing object recognition through deep learning for image recognition.
- a tag set 550 is then generated.
- the tag set refers to the combination of the image tags 540a and the voice tags 540b detected during the time span of the first shot 510, for example between 0 and 10 seconds.
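- As a rough illustration of FIG. 5, the sketch below builds a tag set from per-second object detections and an STT transcript; the Ultralytics YOLO detector is an assumed stand-in for whatever recognizer is used, and `transcribe_audio` is a hypothetical helper.

```python
# A rough sketch of FIG. 5: build one shot's tag set from per-second object
# detections plus words from an STT transcript of the shot's audio.
from ultralytics import YOLO

detector = YOLO("yolov8n.pt")  # any pretrained object detector would do

def transcribe_audio(audio_path: str) -> str:
    # Hypothetical STT helper: plug in any speech recognizer here; the patent
    # only requires that the shot's voice data be converted to text.
    raise NotImplementedError

def image_tags(frames) -> set:
    """Detect objects in one frame per second of the shot and collect class names."""
    tags = set()
    for frame in frames:
        for result in detector(frame, verbose=False):
            tags |= {detector.names[int(c)] for c in result.boxes.cls}
    return tags

def tag_set(frames, audio_path: str) -> set:
    """Tag set 550 = image tags (540a) combined with voice tags (540b)."""
    voice_tags = set(transcribe_audio(audio_path).lower().split())
    return image_tags(frames) | voice_tags
```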
- FIG. 6 shows an example of grouping a shot into a scene as a preferred embodiment of the present invention.
- FIG. 6 illustrates an example of creating a scene through hierarchical clustering 640 after determining the degree of similarity based on the keyword 630 .
- FIG. 8 shows an embodiment of searching for information inside a moving picture as a preferred embodiment of the present invention.
- FIG. 8 shows an example in which the video 800 selected by the user in the shot segmentation unit is segmented into seven shots 801 to 807.
- the device for searching video internal information generates a tag set by extracting an image tag and a voice tag from each of the seven shots 801 to 807, and then derives a keyword for each shot 801 to 807 by performing topic analysis, such as LDA, on the tag set.
- the first shot 801 is a section from 0:00 to 0:17, and the first keyword derived from the first shot 801 is (Japan, Corona 19, severe) 801a.
- the second shot 802 is a section from 0:18 to 0:29, and the second keyword derived from the second shot 802 is (Japan, Corona 19, Spread) 802a.
- the third shot 803 is a section from 0:30 to 0:34, and the third keyword derived from the third shot 803 is (New York, Corona 19, Europe, Inflow) 803a.
- the fourth shot 804 is a section from 0:34 to 0:38, and the fourth keyword derived from the fourth shot 804 is (US, Corona 19, death) 804a.
- the fifth shot 805 is a section from 0:39 to 0:41, and the fifth keyword derived from the fifth shot 805 is (US, Corona 19, confirmed, dead).
- the sixth shot 806 is a section from 0:42 to 0:45, and the sixth keyword derived from the sixth shot 806 is (US, Corona 19, death) 806a.
- the seventh shot 807 is a section from 0:46 to 0:50, and the seventh keyword derived from the seventh shot 807 is (US, Corona 19, death) 807a.
- the scene generator groups at least one shot based on the similarity.
- the degree of similarity can be determined based on keywords extracted from each shot, and video tags and voice tags can be further referred to.
- the first shot 801 and the second shot 802 are grouped into the first scene 810
- the third shot 803 is grouped into the second scene 820
- the fourth to seventh shots 804 to 807 are grouped into the third scene 830 .
- the first scene 810 is a section from 0:00 to 0:29; based on the first keyword (Japan, Corona 19, severe) 801a derived from the first shot 801 and the second keyword (Japan, Corona 19, spread) 802a derived from the second shot 802, and with reference to the voice data of the first shot 801 and the second shot 802, the metadata "Japan Corona 19 continues to spread" 810b is assigned.
- the second scene 820 is a section from 0:30 to 0:34; based on the third keyword (New York, Corona 19, Europe, inflow) 803a derived from the third shot 803, and with reference to the voice data of the third shot 803, the metadata "New York's Corona 19 is said to be coming from Europe" 820b is assigned.
- the third scene 830 is a section from 0:35 to 0:50; based on the fourth keyword (US, Corona 19, death) 804a derived from the fourth shot 804, the fifth keyword (US, Corona 19, confirmed, dead) derived from the fifth shot 805, and the sixth keyword (US, Corona 19, dead) 806a derived from the sixth shot 806, and with reference to the voice data of the fourth shot 804 to the sixth shot 806, the metadata "This is the news of the death of COVID-19 in the United States." 830b is assigned.
- when the user selects the video 800 and the search word input interface is activated, the user inputs the content to be searched in the form of a sentence; for example, the search sentence "What is the current state of Corona in the United States?" 840 may be input.
- the video indexing unit searches for metadata with the highest degree of matching with the search word sentence 840 by using the metadata given to each scene as an index.
- the degree of matching is determined based on the similarity between the search word 840 and the metadata 810b, 820b, and 830b; for example, the Levenshtein distance technique, in which the value becomes 0 when two sentences are identical, may be used.
- the video indexer searches for the metadata most similar to the user's search term 840, "What is the current state of Corona in the United States?", and the matching third scene 830 is played back to the user.
- the user can thus search for and view only the section of the third scene 830, from 0:35 to 0:50, that is related to the search word 840 in the video 800.
- the video indexing unit may provide the user with metadata assigned to each scene 810 to 830 constituting the video as an index. Users can preview the contents of the video in advance through the video index.
- Methods according to an embodiment of the present invention may be implemented in the form of program instructions that can be executed through various computer means and recorded in a computer-readable medium.
- the computer-readable medium may include program instructions, data files, data structures, etc. alone or in combination.
- the program instructions recorded on the medium may be specially designed and configured for the present invention, or may be known and available to those skilled in the art of computer software.
Abstract
Description
Claims (15)
- 1. A method of searching for information inside a video, the method comprising: receiving a sentence as a search word from a user; searching for the scene having the highest degree of matching with the search word within a video that is indexed in scene units by metadata assigned in sentence form to each scene; and reproducing only the section from the start point to the end point of the searched scene in the video.
- 2. The method of claim 1, wherein the user selects one video to be searched and searches for the desired content within the selected video in the form of a sentence.
- 3. The method of claim 1, wherein the video consists of at least one scene, and a summary sentence indicating the content of each of the at least one scene is assigned to each scene in the form of a sentence and used as metadata.
- 4. The method of claim 1, wherein the degree of matching is determined using the Levenshtein distance technique, in which the value becomes 0 when two sentences are identical and increases as the similarity between the two sentences decreases.
- 5. The method of claim 1, further comprising: segmenting the video into shot units; assigning a tag set to each segmented shot and deriving, for each shot, a keyword that highlights the characteristics of that shot through topic analysis of the tag set; and generating the scene by performing hierarchical clustering based on the similarity between adjacent shots determined based on the keywords.
- 6. The method of claim 5, wherein the tag set consists of an image tag and a voice tag.
- 7. The method of claim 5, wherein a scene tag is assigned to each scene and metadata is assigned in sentence form, so that the video is indexed in scene units.
- 8. The method of claim 5, wherein the metadata is generated based on at least one keyword derived from each of the at least one shot constituting one scene and on text obtained by converting the voice data of each of the at least one shot through a Speech To Text (STT) technique.
- 9. The method of claim 8, wherein at least one sentence containing the keyword is selected from the at least one text data item obtained by STT conversion of the voice data of each of the at least one shot constituting the scene, a single sentence is generated from the selected sentences through deep learning, and the generated single sentence is stored as metadata describing the scene.
- 10. The method of claim 1, further comprising: extracting frames from the video as images and converting each image into an HSV color space; generating three time series data consisting of the median values of H (hue), S (saturation), and V (brightness); and setting the corresponding point as the start or end point of a shot when the three inflection points detected in the three time series data all coincide.
- 11. The method of claim 10, further comprising grouping at least one shot into a scene, wherein the grouping is performed based on the similarity between shots adjacent on the timeline.
- 12. An apparatus for searching video internal information, the apparatus comprising: a search word input unit that receives a sentence as a search word from a user; a video section search unit that searches for the specific section within a video that is most relevant to the search word; and a video section playback unit that reproduces only the specific section of the video, wherein the video is segmented based on meaning, a sentence describing the meaning of each segmented section is assigned to that section as metadata, and the relevance to the search word is determined by using the sentences as an index along the timeline of the video.
- 13. The apparatus of claim 12, wherein the video section search unit comprises: a shot segmentation unit that segments the video into shot units, assigns a tag set to each segmented shot, and derives, for each shot, a keyword that highlights the characteristics of that shot through topic analysis of the tag set; and a scene generation unit that generates the scene by performing hierarchical clustering based on the similarity between adjacent shots determined based on the keywords.
- 14. The apparatus of claim 13, further comprising a metadata generation unit that analyzes each scene generated by the scene generation unit and assigns metadata in sentence form to each scene.
- 15. A computer-readable recording medium on which a program for performing the method of any one of claims 1 to 11 is recorded.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/KR2020/005718 WO2021221209A1 (en) | 2020-04-29 | 2020-04-29 | Method and apparatus for searching for information inside video |
KR1020207014777A KR20210134866A (en) | 2020-04-29 | 2020-04-29 | Methods and devices for retrieving information inside a video |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/KR2020/005718 WO2021221209A1 (en) | 2020-04-29 | 2020-04-29 | Method and apparatus for searching for information inside video |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021221209A1 (en) | 2021-11-04 |
Family
ID=78374167
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/KR2020/005718 WO2021221209A1 (en) | 2020-04-29 | 2020-04-29 | Method and apparatus for searching for information inside video |
Country Status (2)
Country | Link |
---|---|
KR (1) | KR20210134866A (en) |
WO (1) | WO2021221209A1 (en) |
- 2020-04-29 KR KR1020207014777A patent/KR20210134866A/en not_active Application Discontinuation
- 2020-04-29 WO PCT/KR2020/005718 patent/WO2021221209A1/en active Application Filing
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20080111376A (en) * | 2007-06-18 | 2008-12-23 | 한국전자통신연구원 | System and method for managing digital videos using video features |
JP2016035607A (en) * | 2012-12-27 | 2016-03-17 | パナソニック株式会社 | Apparatus, method and program for generating digest |
KR20150022088A (en) * | 2013-08-22 | 2015-03-04 | 주식회사 엘지유플러스 | Context-based VOD Search System And Method of VOD Search Using the Same |
KR20190114548A (en) * | 2018-03-30 | 2019-10-10 | 주식회사 엘지유플러스 | Apparatus and method for controlling contents, or content control apparatus and method thereof |
KR20190129266A (en) * | 2018-05-10 | 2019-11-20 | 네이버 주식회사 | Content providing server, content providing terminal and content providing method |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116702707A (en) * | 2023-08-03 | 2023-09-05 | 腾讯科技(深圳)有限公司 | Action generation method, device and equipment based on action generation model |
CN116702707B (en) * | 2023-08-03 | 2023-10-03 | 腾讯科技(深圳)有限公司 | Action generation method, device and equipment based on action generation model |
CN117633297A (en) * | 2024-01-26 | 2024-03-01 | 江苏瑞宁信创科技有限公司 | Video retrieval method, device, system and medium based on annotation |
Also Published As
Publication number | Publication date |
---|---|
KR20210134866A (en) | 2021-11-11 |
Legal Events

Date | Code | Title | Description |
---|---|---|---|
 | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 20933240; Country of ref document: EP; Kind code of ref document: A1 |
 | NENP | Non-entry into the national phase | Ref country code: DE |
 | 122 | Ep: pct application non-entry in european phase | Ref document number: 20933240; Country of ref document: EP; Kind code of ref document: A1 |
 | 32PN | Ep: public notification in the ep bulletin as address of the addressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205 DATED 11/04/2023) |