WO2021221209A1 - Method and apparatus for searching for information inside video - Google Patents

Method and apparatus for searching for information inside video

Info

Publication number
WO2021221209A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
scene
shot
sentence
metadata
Prior art date
Application number
PCT/KR2020/005718
Other languages
French (fr)
Korean (ko)
Inventor
구원용
홍의재
Original Assignee
엠랩 주식회사
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 엠랩 주식회사
Priority to PCT/KR2020/005718
Priority to KR1020207014777A
Publication of WO2021221209A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval of video data
    • G06F16/71 Indexing; Data structures therefor; Storage structures
    • G06F16/73 Querying
    • G06F16/732 Query formulation
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval using metadata automatically derived from the content
    • G06F16/7844 Retrieval using original textual content or text extracted from visual content or transcript of audio data
    • G06F16/7847 Retrieval using low-level visual features of the video content
    • G06F16/785 Retrieval using low-level visual features of the video content, using colour or luminescence
    • G06F16/7867 Retrieval using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems

Definitions

  • the present invention relates to a method for retrieving information inside a moving picture.
  • it is intended to propose a method of searching for, and providing to the user, only the specific section of a video that contains the information the user wants.
  • a method for searching for information inside a video includes: receiving a sentence as a search term from a user; searching, within a video that is indexed scene by scene because metadata in sentence form is assigned to each scene, for the scene with the highest degree of matching with the search term; and reproducing only the section from the start point to the end point of the searched scene in the video.
  • the user selects a single video to search and enters the content to be found inside the selected video in the form of a sentence.
  • the video is composed of at least one scene, and each scene is assigned, in sentence form, a summary sentence describing its contents, which is used as metadata.
  • the degree of matching is determined using the Levenshtein distance technique, which is 0 when two sentences are identical and grows larger as the similarity between the two sentences decreases.
  • a method for retrieving information inside a video includes: segmenting the video into shots; assigning a tag set to each segmented shot and deriving, through topic analysis of the tag sets, keywords that highlight the characteristics of each shot; and generating scenes by performing hierarchical clustering based on the similarity between adjacent shots determined from the keywords.
  • the tag set is composed of an image tag and a voice tag.
  • a scene tag is assigned to each scene, and metadata in sentence form is also assigned, so that the video is indexed scene by scene.
  • the metadata is generated based on at least one keyword derived from each of the at least one shot constituting a scene and on text obtained by converting the voice data of each of those shots through a Speech-To-Text (STT) technique.
  • at least one sentence containing the keyword is selected from the text data obtained by STT-converting the voice data of each of the at least one shot constituting the scene, a single sentence is generated from the selected sentences through deep learning, and the generated sentence is stored as metadata describing the scene.
  • to extract shots, frames are extracted from the video as images and each image is converted into the HSV color space; three time series consisting of the median values of H (hue), S (saturation) and V (brightness) are generated; and when the inflection points detected in all three time series coincide, the corresponding point is set as the start or end point of a shot.
  • an apparatus for searching video internal information includes: a search term input unit that receives a sentence as a search term from a user; a video section search unit that searches, within the video, for the specific section most relevant to the search term; and a video section playback unit that reproduces only that specific section. The video is segmented based on meaning, a sentence explaining the meaning of each segmented section is assigned to it as metadata, and the relevance to the search term is determined by using those sentences as an index along the timeline of the video.
  • the apparatus and method for searching video internal information provide the user with only the specific section of the video containing the content the user wants, so the user can grasp the desired information quickly and easily without having to watch the video from beginning to end.
  • the user can check in advance what content is included before watching the video.
  • FIG. 1 illustrates an example in which components constituting a moving picture are divided into a scene and a shot as a preferred embodiment of the present invention.
  • FIG. 2 is a flowchart of a method for retrieving information inside a moving picture as a preferred embodiment of the present invention.
  • FIG. 3 is a diagram showing an internal configuration of an apparatus for searching video internal information as a preferred embodiment of the present invention.
  • FIG. 4 shows an example of dividing a shot in a moving picture as a preferred embodiment of the present invention.
  • FIG. 5 shows an example of assigning a tag set to a shot as a preferred embodiment of the present invention.
  • FIG. 6 shows an example of grouping a shot into a scene as a preferred embodiment of the present invention.
  • FIG. 7 is a flowchart of a method for retrieving information in a moving picture as another preferred embodiment of the present invention.
  • FIG. 8 shows an embodiment of searching for information inside a moving picture as a preferred embodiment of the present invention.
  • a method for searching for information inside a video includes the steps of: receiving a sentence as a search term from a user; searching, within a video indexed scene by scene with metadata assigned in sentence form to each scene, for the scene with the highest degree of matching with the search term; and reproducing only the section from the start point to the end point of the searched scene in the video.
  • FIG. 1 shows an example in which components constituting a moving picture are divided into a scene and a shot as a preferred embodiment of the present invention.
  • the moving picture 100 is segmented into n shots (n is a natural number) 111, 113, 121, 123, 125, 131, 133. For the method used to distinguish shots in a video, refer to FIG. 4.
  • at least one shot is grouped into units with similar meaning or subject to constitute a scene. In the example of FIG. 1, the first shot 111 and the second shot 113 are grouped into the first scene 110, the third shot 121, the fourth shot 123 and the fifth shot 125 into the second scene 120, and the sixth shot 131 and the seventh shot 133 into the third scene 130.
  • in the present invention, a subject may include at least one meaning.
  • FIG. 2 is a flowchart of a method for retrieving video internal information as a preferred embodiment of the present invention.
  • the user selects the video and inputs a search word through a search word input interface provided when video selection is activated.
  • the video is indexed in units of scenes by providing metadata in the form of sentences for each scene.
  • the video internal information search apparatus searches for a specific section that matches the search word or has high relevance in the video, and reproduces only the searched specific section.
  • when the user inputs a search term related to content to be found inside a specific video (S210), the video internal information search apparatus searches the video for the scene with the highest degree of matching with the search term (S220) and plays back only the section from the start point to the end point of the searched scene (S230).
  • FIG. 3 shows an internal configuration diagram of an apparatus 300 for searching video internal information as a preferred embodiment of the present invention.
  • FIGS. 4 to 6 show detailed functions of the video section search unit 320 constituting the apparatus 300 for searching video internal information.
  • FIG. 7 is a flowchart of searching for video internal information.
  • a method for searching video internal information with the apparatus for searching video internal information is described below with reference to FIGS. 3 to 7.
  • the apparatus 300 for searching video internal information may be implemented in a terminal, a computer, a notebook computer, a handheld device, or a wearable device.
  • the apparatus 300 for searching video internal information may be implemented in the form of a terminal having an input unit for receiving a user's search word, a display for displaying a video, and a processor.
  • the method of searching the video internal information may be implemented by being installed in the form of an application in the terminal.
  • the apparatus 300 for searching video internal information includes a search word input unit 310 , a video section search unit 320 , and a video section playback unit 330 .
  • the video section search unit 320 includes a shot segmentation unit 340 , a scene generation unit 350 , a metadata generation unit 360 , and a video index unit 370 .
  • the search word input unit 310 receives a search word from the user in the form of a sentence.
  • the user can use all forms such as voice search, text search, and image search.
  • An example of an image search is a case where the contents scanned from a book are converted into text and used as a search term.
  • the search word input unit 310 may be implemented as a keyboard, a stylus pen, a microphone, or the like.
  • the video section search unit 320 searches for a specific section in the video that matches the search word input from the search word input unit 310 or has content related to the search word. As an embodiment, the video section search unit 320 searches for a scene in which a sentence having the highest degree of matching with the input search word sentence is assigned as metadata.
  • the video section search unit 320 indexes and manages videos so that information can be searched within a single video.
  • referring further to FIG. 7, the shot segmentation unit 340 segments the video into shots (S710), assigns a tag set to each segmented shot (S720), and derives a keyword for each shot by applying a topic analysis algorithm to the tag sets assigned to the shots (S730).
  • the keywords are derived so as to identify and distinguish the content of each of the at least one shot constituting the video.
  • the scene generator 350 determines the similarity between shots that are adjacent on the timeline of the video. The similarity determination may be performed based on the keyword derived from each shot, the objects detected in each shot, the voice features detected in each shot, and the like. As a preferred embodiment of the present invention, the scene generator 350 may create a scene by grouping adjacent shots whose keyword-based similarity is high (S740).
  • An algorithm for performing grouping may include a hierarchical clustering technique (S750). In this case, a plurality of shots included in one scene may be interpreted as delivering content having similar meaning or subject matter. For an example of grouping shots through hierarchical clustering in the scene generator 350 , refer to FIG. 8 .
  • the scene generator 350 assigns a scene tag to each created scene (351, 353, 355).
  • the scene tag may be generated based on an image tag assigned to each of at least one shot included in each scene.
  • a scene tag may be generated by a combination of a tag set assigned to each of at least one shot constituting a scene.
  • the scene keyword may be generated by a combination of keywords derived from each of at least one shot constituting the scene.
  • the scene tag may serve as a weight when generating metadata for each scene.
  • the metadata generator 360 analyzes the scenes generated by the scene generator 350, and provides metadata for each scene, thereby supporting a search for internal video content (S760). Metadata assigned to each scene acts as an index.
  • the metadata is in the form of a summary sentence indicating the contents of each scene.
  • the metadata may be generated by further referring to a scene tag assigned to each of at least one shot constituting one scene.
  • Scene tags can serve as weights when performing deep learning to generate metadata. For example, weight may be assigned to image tag information and voice tag information extracted from at least one tag set included in the scene tag.
  • the metadata is generated based on Speech-to-Text (STT) data obtained from the voice data of the at least one shot constituting each scene and on the scene tag extracted from each of those shots.
  • for example, a summary sentence is generated by applying deep-learning-based machine learning to the at least one piece of STT data and the at least one scene tag obtained from the shots constituting a scene, and metadata is assigned to each scene using the summary sentence generated for that scene.
  • the video indexing unit 370 uses the metadata assigned to each scene of the video S300 as an index. For example, if the video S300 is divided into three scenes, the video indexing unit 370 uses the first sentence 371 assigned as metadata to the first scene 351 (0:00 to t1) as an index, the second sentence 373 assigned as metadata to the second scene 353 (t1 to t2) as an index, and the third sentence 375 assigned as metadata to the third scene 355 (t2 to t3) as an index.
  • when the user's search sentence is the first search sentence S311 and the first search sentence S311 has the highest degree of matching with the first sentence 371 among the plurality of metadata 371, 373, 375 assigned to the scenes of the video, the video indexing unit determines that the video section with the highest degree of matching with the search sentence entered in the search term input unit 310 is the first scene 351, and the video section playback unit 330 plays back only the section 0:00 to t1 of the first scene 351 in the video S300.
  • the video indexing unit 370 may determine the degree of matching using the Levenshtein distance technique, in which the value is 0 when two sentences are identical and increases as the similarity between the two sentences decreases; however, it is not limited thereto, and various algorithms for determining the similarity between two sentences may be used.
  • likewise, when the user's search sentence is the second search sentence S313 and it has the highest degree of matching with the second sentence 373 among the plurality of metadata 371, 373, 375, the video indexing unit determines that the best-matching video section is the second scene 353, and the video section playback unit 330 plays back only the section t1 to t2 of the second scene 353 in the video S300.
  • similarly, when the video indexing unit 370 determines that the user's search sentence is the third search sentence S315 and that it has the highest degree of matching with the third sentence 375, it determines that the video section best matching the search sentence entered in the search term input unit 310 is the third scene 355, and the video section playback unit 330 plays back only the section t2 to t3 of the third scene 355 in the video S300.
  • FIG. 4 shows an example of dividing a shot in a moving picture as a preferred embodiment of the present invention.
  • the x-axis represents time (sec)
  • the y-axis represents a representative HSV value.
  • the shot segmentation unit 340 of the video internal information search apparatus extracts frames from the video S300 as images at regular intervals and converts each image into the HSV color space. It then generates three time series consisting of the median values of H (hue) S401, S (saturation) S403 and V (brightness) S405 of each image. When the inflection points of the three time series coincide, or fall within a certain time window, the corresponding point is set as the start point or end point of a shot.
  • FIG. 5 shows an example of assigning a tag set to a shot as a preferred embodiment of the present invention.
  • FIG. 5 illustrates an example in which the first tag set 550 is applied to the first shot 510 .
  • the shot 510 is classified into image data 510a and audio data 510b.
  • the image data 510a is processed by extracting images once per second (520a), detecting objects in each image (530a), and generating an image tag based on the detected objects (540a). The image tag can be created from the objects extracted from each image after building training data by applying object annotation or labeling to the detected objects and then performing object recognition through deep learning for image recognition.
  • for the voice data 510b, STT conversion is performed (520b), morphemes are extracted (530b), and a voice tag is generated (540b); when both the image tag 540a and the voice tag 540b have been generated, the tag set 550 is generated.
  • the tag set refers to the combination of the image tag 540a and the voice tag 540b detected during the interval of the first shot 510, for example the section from 00:00 to 00:10.
  • FIG. 6 shows an example of grouping a shot into a scene as a preferred embodiment of the present invention.
  • FIG. 6 illustrates an example of creating a scene through hierarchical clustering 640 after determining the degree of similarity based on the keyword 630 .
  • FIG. 8 shows an embodiment of searching for information inside a moving picture as a preferred embodiment of the present invention.
  • FIG. 8 shows an example in which the video 800 selected by the user in the shot segmentation unit is segmented into seven shots 801 to 807.
  • the apparatus for searching video internal information generates a tag set by extracting an image tag and a voice tag from each of the seven shots 801 to 807, then performs topic analysis such as LDA on the tag sets to derive a keyword for each of the shots 801 to 807.
  • the first shot 801 is the section from 0:00 to 0:17, and the first keyword derived from it is (Japan, COVID-19, severe) 801a.
  • the second shot 802 is the section from 0:18 to 0:29, and the second keyword derived from it is (Japan, COVID-19, spread) 802a.
  • the third shot 803 is the section from 0:30 to 0:34, and the third keyword derived from it is (New York, COVID-19, Europe, inflow) 803a.
  • the fourth shot 804 is the section from 0:34 to 0:38, and the fourth keyword derived from it is (US, COVID-19, death) 804a.
  • the fifth shot 805 is the section from 0:39 to 0:41, and the fifth keyword derived from it is (US, COVID-19, confirmed, death).
  • the sixth shot 806 is the section from 0:42 to 0:45, and the sixth keyword derived from it is (US, COVID-19, death) 806a.
  • the seventh shot 807 is the section from 0:46 to 0:50, and the seventh keyword derived from it is (US, COVID-19, death) 807a.
  • the scene generator groups at least one shot based on the similarity.
  • the degree of similarity can be determined based on keywords extracted from each shot, and video tags and voice tags can be further referred to.
  • the first shot 801 and the second shot 802 are grouped into the first scene 810, the third shot 803 forms the second scene 820, and the fourth to seventh shots 804 to 807 are grouped into the third scene 830.
  • the first scene 810 is the section from 0:00 to 0:29; based on the first keyword (Japan, COVID-19, severe) 801a derived from the first shot 801, the second keyword (Japan, COVID-19, spread) 802a derived from the second shot 802, and the voice data of the first shot 801 and the second shot 802, the metadata sentence "COVID-19 continues to spread in Japan" 810b is assigned.
  • the second scene 820 is the section from 0:30 to 0:34; based on the third keyword (New York, COVID-19, Europe, inflow) 803a derived from the third shot 803 and the voice data of the third shot 803, the metadata sentence "New York's COVID-19 is said to be coming in from Europe" 820b is assigned.
  • the third scene 830 is the section from 0:35 to 0:50; based on the fourth keyword (US, COVID-19, death) 804a derived from the fourth shot 804, the fifth keyword (US, COVID-19, confirmed, death) derived from the fifth shot 805, the sixth keyword (US, COVID-19, death) 806a derived from the sixth shot 806, and the voice data of the fourth shot 804 to the sixth shot 806, the metadata sentence "This is news of COVID-19 deaths in the United States" 830b is assigned.
  • when the user selects the video 800 and the search term input interface is activated, the user inputs the content to be searched in the form of a sentence; for example, the search sentence "What is the current state of COVID-19 in the United States?" 840 may be input.
  • the video indexing unit searches for metadata with the highest degree of matching with the search word sentence 840 by using the metadata given to each scene as an index.
  • the degree of matching is determined based on the similarity between the search term 840 and the metadata 810b, 820b and 830b; the Levenshtein distance technique, which yields 0 when two sentences are identical, can be used.
  • the video indexing unit searches for the metadata most similar to the user's search term 840 "What is the current state of COVID-19 in the United States?", finds the third sentence 830b, and the third scene 830 to which it is assigned is played back to the user.
  • as a result, the user can search for and view only the section of the third scene 830, 0:35 to 0:50, that is related to the search term 840 within the video 800.
  • the video indexing unit may provide the user with metadata assigned to each scene 810 to 830 constituting the video as an index. Users can preview the contents of the video in advance through the video index.
  • Methods according to an embodiment of the present invention may be implemented in the form of program instructions that can be executed through various computer means and recorded in a computer-readable medium.
  • the computer-readable medium may include program instructions, data files, data structures, etc. alone or in combination.
  • the program instructions recorded on the medium may be specially designed and configured for the present invention, or may be known and available to those skilled in the art of computer software.

Abstract

As a preferred embodiment of the present invention, a method for searching for information inside a video comprises the steps of: receiving a sentence as a search term from a user; searching for the scene having the highest degree of matching with the search term within a video that is indexed scene by scene, with metadata assigned in sentence form to each scene; and reproducing only the section from the start point to the end point of the searched scene in the video.

Description

Method and apparatus for searching for information inside a video
The present invention relates to a method for searching for information inside a video.
With the rapid spread of the Internet, IPTV, SNS and mobile LTE, and the expansion of video OTT services such as YouTube and Netflix, the distribution and consumption of multimedia video is increasing rapidly. To check the contents of a video, a user generally has to watch it from beginning to end.
Recently, however, users increasingly want to watch only the scenes within a video that provide the information they are looking for; yet because video search is currently based on titles and descriptions, detailed search of the content contained inside a video remains difficult.
A preferred embodiment of the present invention therefore proposes a method of searching for, and providing to the user, only the specific section of a video that contains the information the user wants.
As a preferred embodiment of the present invention, a method for searching for information inside a video includes: receiving a sentence as a search term from a user; searching, within a video that is indexed scene by scene because metadata in sentence form is assigned to each scene, for the scene with the highest degree of matching with the search term; and reproducing only the section from the start point to the end point of the searched scene in the video.
As a preferred embodiment of the present invention, the user selects a single video to search and enters the content to be found inside the selected video in the form of a sentence.
As a preferred embodiment of the present invention, the video is composed of at least one scene, and each scene is assigned, in sentence form, a summary sentence describing its contents, which is used as metadata.
As a preferred embodiment of the present invention, the degree of matching is determined using the Levenshtein distance technique, which is 0 when two sentences are identical and grows larger as the similarity between the two sentences decreases.
As a preferred embodiment of the present invention, a method for searching for information inside a video includes: segmenting the video into shots; assigning a tag set to each segmented shot and deriving, through topic analysis of the tag sets, keywords that highlight the characteristics of each shot; and generating scenes by performing hierarchical clustering based on the similarity between adjacent shots determined from the keywords.
As a preferred embodiment of the present invention, the tag set is composed of an image tag and a voice tag.
As a preferred embodiment of the present invention, a scene tag is assigned to each scene and metadata in sentence form is also assigned, so that the video is indexed scene by scene.
As a preferred embodiment of the present invention, the metadata is generated based on at least one keyword derived from each of the at least one shot constituting a scene and on text obtained by converting the voice data of each of those shots using a Speech-To-Text (STT) technique.
As a preferred embodiment of the present invention, at least one sentence containing the keyword is selected from the text data obtained by STT-converting the voice data of each of the at least one shot constituting the scene, a single sentence is generated from the selected sentences through deep learning, and the generated sentence is stored as metadata describing the scene.
As a preferred embodiment of the present invention, the method further includes: extracting frames from the video as images in order to extract shots, and converting each image into the HSV color space; generating three time series consisting of the median values of H (hue), S (saturation) and V (brightness); and setting a point as the start or end point of a shot when the inflection points detected in all three time series coincide.
As a preferred embodiment of the present invention, an apparatus for searching video internal information includes: a search term input unit that receives a sentence as a search term from a user; a video section search unit that searches, within the video, for the specific section most relevant to the search term; and a video section playback unit that reproduces only that specific section within the video. The video is segmented based on meaning, a sentence explaining the meaning of each segmented section is assigned to it as metadata, and the relevance to the search term is determined by using those sentences as an index along the timeline of the video.
As a preferred embodiment of the present invention, the apparatus and method for searching video internal information provide the user with only the specific section of the video containing the content the user wants, so the user can grasp the desired information quickly and easily without watching the video from beginning to end. By presenting the contents of the video to the user like the table of contents of a book, the user can also check what the video contains before watching it.
FIG. 1 illustrates an example, as a preferred embodiment of the present invention, in which the components of a video are divided into scenes and shots.
FIG. 2 is a flowchart of a method for searching for information inside a video as a preferred embodiment of the present invention.
FIG. 3 shows the internal configuration of an apparatus for searching video internal information as a preferred embodiment of the present invention.
FIG. 4 shows an example of distinguishing shots in a video as a preferred embodiment of the present invention.
FIG. 5 shows an example of assigning a tag set to a shot as a preferred embodiment of the present invention.
FIG. 6 shows an example of grouping shots into a scene as a preferred embodiment of the present invention.
FIG. 7 is a flowchart of a method for searching for information inside a video as another preferred embodiment of the present invention.
FIG. 8 shows an embodiment of searching for information inside a video as a preferred embodiment of the present invention.
As a preferred embodiment of the present invention, a method for searching for information inside a video includes: receiving a sentence as a search term from a user; searching, within a video that is indexed scene by scene because metadata in sentence form is assigned to each scene, for the scene with the highest degree of matching with the search term; and reproducing only the section from the start point to the end point of the searched scene in the video.
Hereinafter, the invention is described in detail with reference to the drawings so that a person of ordinary skill in the art to which the invention pertains can easily understand and reproduce it.
FIG. 1 shows an example, as a preferred embodiment of the present invention, in which the components of a video are divided into scenes and shots.
As a preferred embodiment of the present invention, the video 100 is segmented into n shots (n is a natural number) 111, 113, 121, 123, 125, 131, 133. For the method used to distinguish shots in a video, refer to FIG. 4.
At least one shot is grouped into units with similar meaning or subject to constitute a scene. Referring to the example of FIG. 1, the first shot 111 and the second shot 113 are grouped into the first scene 110, the third shot 121, the fourth shot 123 and the fifth shot 125 into the second scene 120, and the sixth shot 131 and the seventh shot 133 into the third scene 130. In the present invention, a subject may include at least one meaning.
FIG. 2 is a flowchart of a method for searching video internal information as a preferred embodiment of the present invention.
When there is content the user wants to find in a specific video, the user selects the video and enters a search term through the search term input interface provided when the video selection is activated. It is assumed that the video is indexed scene by scene, with metadata in sentence form assigned to each scene. As a preferred embodiment of the present invention, when the video internal information search apparatus receives a sentence from the user as a search term, it searches the video for the specific section that matches or is highly related to the search term and plays back only that section.
As another preferred embodiment of the present invention, when the user inputs a search term related to content to be found inside a specific video (S210), the video internal information search apparatus searches the video for the scene with the highest degree of matching with the search term (S220) and plays back only the section from the start point to the end point of the searched scene (S230).
FIG. 3 shows the internal configuration of an apparatus 300 for searching video internal information as a preferred embodiment of the present invention. FIGS. 4 to 6 show detailed functions of the video section search unit 320 constituting the apparatus 300. FIG. 7 is a flowchart of searching for video internal information. A method for searching video internal information with the apparatus is described below with reference to FIGS. 3 to 7.
As a preferred embodiment of the present invention, the apparatus 300 for searching video internal information may be implemented in a terminal, a computer, a notebook computer, a handheld device or a wearable device. The apparatus 300 may also be implemented in the form of a terminal having an input unit that receives the user's search term, a display that shows the video, and a processor. The method for searching video internal information may also be implemented as an application installed on the terminal.
As a preferred embodiment of the present invention, the apparatus 300 for searching video internal information includes a search term input unit 310, a video section search unit 320 and a video section playback unit 330. The video section search unit 320 includes a shot segmentation unit 340, a scene generation unit 350, a metadata generation unit 360 and a video indexing unit 370.
The search term input unit 310 receives a search term from the user in the form of a sentence. The user can use voice search, text search, image search and the like. An example of image search is converting content scanned from a book into text and using it as a search term. The search term input unit 310 may be implemented as a keyboard, a stylus pen, a microphone or the like.
The video section search unit 320 searches for the specific section of the video that matches, or contains content related to, the search term received from the search term input unit 310. In one embodiment, the video section search unit 320 searches for the scene whose assigned metadata sentence has the highest degree of matching with the input search sentence.
The video section search unit 320 indexes and manages the video so that information can be searched within a single video.
Referring further to FIG. 7, the shot segmentation unit 340 segments the video into shots (S710), assigns a tag set to each segmented shot (S720), and derives a keyword for each shot by applying a topic analysis algorithm to the tag sets assigned to the shots (S730). The keywords are derived so as to identify and distinguish the content of each of the at least one shot constituting the video.
The scene generation unit 350 determines the similarity between shots that are adjacent on the timeline of the video. The similarity determination may be performed based on the keyword derived from each shot, the objects detected in each shot, the voice features detected in each shot, and the like. As a preferred embodiment of the present invention, the scene generation unit 350 may create a scene by grouping adjacent shots whose keyword-based similarity is high (S740). An algorithm for performing the grouping may include a hierarchical clustering technique (S750). In this case, the plurality of shots included in one scene can be interpreted as delivering content with similar meaning or subject matter. See FIG. 8 for an example of grouping shots through hierarchical clustering in the scene generation unit 350.
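A minimal sketch of how such keyword-based grouping could be carried out is given below. It assumes each shot is represented only by its derived keywords; the TF-IDF vectorisation, average-linkage cosine clustering and the distance threshold are illustrative assumptions, and the adjacency constraint along the timeline is ignored for brevity.

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.feature_extraction.text import TfidfVectorizer

def group_shots_into_scenes(shot_keywords, distance_threshold=0.5):
    """Hierarchically cluster shots whose keyword sets are similar; returns one scene label per shot."""
    documents = [" ".join(keywords) for keywords in shot_keywords]
    vectors = TfidfVectorizer().fit_transform(documents).toarray()
    tree = linkage(vectors, method="average", metric="cosine")   # S750: hierarchical clustering
    return list(fcluster(tree, t=distance_threshold, criterion="distance"))

# Shots with heavily overlapping keywords tend to receive the same scene label.
labels = group_shots_into_scenes([
    ["japan", "covid19", "severe"],
    ["japan", "covid19", "spread"],
    ["usa", "covid19", "death"],
])
```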
The scene generation unit 350 assigns a scene tag to each created scene 351, 353, 355. The scene tag may be generated based on the image tags assigned to the at least one shot included in each scene. In the present invention, a scene tag may be generated as a combination of the tag sets assigned to the at least one shot constituting the scene. A scene keyword may likewise be generated as a combination of the keywords derived from the at least one shot constituting the scene. In a preferred embodiment of the present invention, the scene tag can serve as a weight when generating the metadata for each scene.
The metadata generation unit 360 analyzes each scene generated by the scene generation unit 350 and assigns metadata to each scene, thereby supporting search of the content inside the video (S760). The metadata assigned to each scene acts as an index and takes the form of a summary sentence describing the contents of the scene.
The metadata may be generated by further referring to the scene tags assigned to the at least one shot constituting a scene. The scene tag can serve as a weight when performing deep learning to generate the metadata. For example, a weight may be assigned to the image tag information and the voice tag information extracted from the at least one tag set included in the scene tag.
The metadata is generated based on the Speech-to-Text (STT) data of the voice data extracted from the at least one shot constituting each scene and on the scene tag extracted from each of those shots. For example, a summary sentence is generated by applying deep-learning-based machine learning to the at least one piece of STT data and the at least one scene tag obtained from the shots constituting a scene, and metadata is assigned to each scene using the summary sentence generated for that scene.
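As a rough sketch of the sentence-selection step described above: the STT transcripts of a scene's shots are split into sentences, and sentences mentioning any of the scene's keywords are kept as input for the summarisation model. Splitting on periods and the simple keyword-containment test are simplifying assumptions; the deep-learning summarisation model itself is not shown.

```python
def scene_summary_candidates(stt_texts, scene_keywords):
    """Select STT sentences that mention at least one scene keyword; these feed the summarisation model."""
    sentences = [s.strip() for text in stt_texts for s in text.split(".") if s.strip()]
    return [s for s in sentences if any(keyword in s for keyword in scene_keywords)]

# The selected sentences would then be condensed into a single summary sentence by a
# sequence-to-sequence summarisation model (not shown); that sentence is stored as the
# scene's metadata and used as its index.
```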
The video indexing unit 370 uses the metadata assigned to each scene of the video S300 as an index. For example, if the video S300 is divided into three scenes, the video indexing unit 370 uses the first sentence 371 assigned as metadata to the first scene 351 (0:00 to t1) as an index, the second sentence 373 assigned as metadata to the second scene 353 (t1 to t2) as an index, and the third sentence 375 assigned as metadata to the third scene 355 (t2 to t3) as an index.
When the user's search sentence is the first search sentence S311 and the video indexing unit 370 determines that the first search sentence S311 has the highest degree of matching with the first sentence 371 among the plurality of metadata 371, 373, 375 assigned to the scenes of the video, it determines that the video section best matching the search sentence entered in the search term input unit 310 is the first scene 351. In this case, the video section playback unit 330 plays back only the section 0:00 to t1 of the first scene 351 in the video S300.
As a preferred embodiment of the present invention, the video indexing unit 370 may determine the degree of matching using the Levenshtein distance technique, in which the value is 0 when two sentences are identical and increases as the similarity between the two sentences decreases; however, it is not limited thereto, and various algorithms for determining the similarity between two sentences may be used.
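A minimal sketch of this matching step is shown below. It assumes each scene is represented by its metadata sentence and its (start, end) interval; the dictionary layout and the function names are illustrative, not part of the original disclosure.

```python
def levenshtein(a, b):
    """Edit distance between two sentences: 0 when identical, larger as similarity decreases."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(current[j - 1] + 1,              # insertion
                               previous[j] + 1,                 # deletion
                               previous[j - 1] + (ca != cb)))   # substitution
        previous = current
    return previous[-1]

def best_matching_scene(query, scene_index):
    """scene_index maps each metadata sentence to its (start, end) interval; return the closest scene."""
    best_sentence = min(scene_index, key=lambda sentence: levenshtein(query, sentence))
    return best_sentence, scene_index[best_sentence]

# Example: only the scene whose metadata sentence is closest to the search sentence is played back.
sentence, (start, end) = best_matching_scene(
    "What is the current state of COVID-19 in the United States?",
    {"COVID-19 continues to spread in Japan": (0, 29),
     "This is news of COVID-19 deaths in the United States": (35, 50)})
```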
Likewise, when the user's search sentence is the second search sentence S313 and the video indexing unit 370 determines that it has the highest degree of matching with the second sentence 373 among the plurality of metadata 371, 373, 375, it determines that the best-matching video section is the second scene 353. In this case, the video section playback unit 330 plays back only the section t1 to t2 of the second scene 353 in the video S300.
Similarly, when the video indexing unit 370 determines that the user's search sentence is the third search sentence S315 and that it has the highest degree of matching with the third sentence 375, it determines that the video section best matching the search sentence entered in the search term input unit 310 is the third scene 355. In this case, the video section playback unit 330 plays back only the section t2 to t3 of the third scene 355 in the video S300.
FIG. 4 shows an example of distinguishing shots in a video as a preferred embodiment of the present invention. In FIG. 4, the x-axis represents time (sec) and the y-axis represents the representative HSV value.
As a preferred embodiment of the present invention, the shot segmentation unit 340 of the video internal information search apparatus extracts frames from the video S300 as images at regular intervals and converts each image into the HSV color space. It then generates three time series consisting of the median values of H (hue) S401, S (saturation) S403 and V (brightness) S405 of each image. When the inflection points of the three time series coincide, or fall within a certain time window, the corresponding point is set as the start point or end point of a shot. In FIG. 4, the point t = 10 sec, where the inflection points of the three time series coincide, is set as the end point of the first shot 410 and the start point of the second shot 420, and the point t = 21 sec, where they coincide again, is set as the end point of the second shot 420 and the start point of the third shot 430.
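A rough sketch of this shot-boundary step follows, assuming OpenCV is available for frame sampling and HSV conversion. The text detects coinciding inflection points in the three median series; the sketch uses a simple threshold on frame-to-frame change as a stand-in for that inflection-point test, so the threshold and window constants are illustrative assumptions.

```python
import cv2
import numpy as np

def hsv_median_series(video_path, step_sec=1.0):
    """Sample frames at a fixed interval and return the time axis plus per-frame H, S, V medians."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(int(round(fps * step_sec)), 1)
    times, h, s, v = [], [], [], []
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
            times.append(index / fps)
            h.append(np.median(hsv[:, :, 0]))
            s.append(np.median(hsv[:, :, 1]))
            v.append(np.median(hsv[:, :, 2]))
        index += 1
    cap.release()
    return np.array(times), np.array(h), np.array(s), np.array(v)

def shot_boundaries(times, h, s, v, window_sec=1.0, k=3.0):
    """Mark a boundary where abrupt changes in H, S and V all fall within the same time window."""
    def change_points(series):
        diffs = np.abs(np.diff(series))
        return times[1:][diffs > diffs.mean() + k * diffs.std()]
    h_pts, s_pts, v_pts = change_points(h), change_points(s), change_points(v)
    return [float(t) for t in h_pts
            if (np.abs(s_pts - t) <= window_sec).any() and (np.abs(v_pts - t) <= window_sec).any()]
```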
FIG. 5 shows an example of assigning a tag set to a shot as a preferred embodiment of the present invention.
In a preferred embodiment of the present invention, after the video is segmented into shots, a tag set is assigned to each shot. FIG. 5 shows an example of assigning the first tag set 550 to the first shot 510.
In a preferred embodiment of the present invention, the shot 510 is separated into image data 510a and voice data 510b. For the image data 510a, images are extracted once per second (520a), objects are detected in each image (530a), and an image tag is generated based on the detected objects (540a). The image tag can be created from the objects extracted from each image after building training data by applying object annotation or labeling to the detected objects and performing object recognition through deep learning for image recognition.
For the voice data 510b, STT conversion is performed (520b), morphemes are extracted (530b), and a voice tag is generated (540b). When both the image tag 540a and the voice tag 540b have been generated, the tag set 550 is generated. The tag set refers to the combination of the image tag 540a and the voice tag 540b detected during the interval of the first shot 510, for example the section from 00:00 to 00:10.
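A schematic sketch of building one tag set per shot is given below. The callables detect_objects, speech_to_text and extract_morphemes are hypothetical placeholders for an object-detection model, an STT engine and a morphological analyser; none of them is named in the original text.

```python
def build_tag_set(shot_frames, shot_audio, detect_objects, speech_to_text, extract_morphemes):
    """Combine image tags (object labels per frame) with voice tags (morphemes of the STT transcript)."""
    image_tags = sorted({label for frame in shot_frames for label in detect_objects(frame)})  # 520a-540a
    transcript = speech_to_text(shot_audio)                                                    # 520b: STT
    voice_tags = sorted(set(extract_morphemes(transcript)))                                    # 530b-540b
    return {"image_tags": image_tags, "voice_tags": voice_tags, "transcript": transcript}
```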
FIG. 6 illustrates an example of grouping shots into a scene according to a preferred embodiment of the present invention.
After an integrated tag is assigned to each shot (610), a topic analysis algorithm is applied to the tag sets assigned to the shots (620) to derive keywords for each shot (630). The scene generation unit then determines the similarity between adjacent shots and groups them into scenes. FIG. 6 illustrates an example in which scenes are created through hierarchical clustering (640) after the similarity is determined based on the keywords (630).
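As an illustration of this step, the sketch below derives per-shot keywords with an off-the-shelf LDA implementation and then groups shots by hierarchical clustering of keyword similarity. It is a simplified stand-in: it assumes scikit-learn (≥ 1.0 for `get_feature_names_out`) and SciPy are available, measures similarity with the Jaccard index over keyword sets, and does not enforce the timeline-adjacency constraint described in the text.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def shot_keywords(shot_tags, n_topics=4, n_keywords=3):
    """shot_tags: one set of tag words (image + voice tags) per shot.
    Fit LDA over all shot 'documents' and keep the top words of each
    shot's dominant topic as that shot's keywords."""
    docs = [" ".join(sorted(tags)) for tags in shot_tags]
    vec = CountVectorizer()
    X = vec.fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0).fit(X)
    vocab = np.array(vec.get_feature_names_out())
    keywords = []
    for row in lda.transform(X):                      # per-shot topic distribution
        topic = int(row.argmax())
        top = lda.components_[topic].argsort()[::-1][:n_keywords]
        keywords.append(set(vocab[top]))
    return keywords

def group_into_scenes(keywords, max_distance=0.5):
    """Average-linkage hierarchical clustering over Jaccard distances between
    per-shot keyword sets; returns a scene label for each shot."""
    n = len(keywords)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            union = keywords[i] | keywords[j]
            d = 1.0 if not union else 1.0 - len(keywords[i] & keywords[j]) / len(union)
            dist[i, j] = dist[j, i] = d
    Z = linkage(squareform(dist), method="average")
    return fcluster(Z, t=max_distance, criterion="distance")
```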
FIG. 8 illustrates an embodiment of searching for information inside a video according to a preferred embodiment of the present invention.
Assume, as an example, that a user selects a 50-second video 800 whose content the user wants to search.
The embodiment of FIG. 8 shows an example in which the shot segmentation unit segments the video 800 selected by the user into seven shots 801 to 807. The apparatus for searching information inside a video extracts an image tag and a voice tag from each of the seven shots 801 to 807 to generate a tag set, and then performs topic analysis such as LDA on the tag sets to derive keywords for each of the shots 801 to 807.
Referring to the embodiment of FIG. 8, the first shot 801 covers the interval 0:00–0:17, and the first keywords derived from the first shot 801 are (Japan, COVID-19, severe) 801a. The second shot 802 covers 0:18–0:29, and the second keywords derived from the second shot 802 are (Japan, COVID-19, spread) 802a. The third shot 803 covers 0:30–0:34, and the third keywords derived from the third shot 803 are (New York, COVID-19, Europe, inflow) 803a. The fourth shot 804 covers 0:34–0:38, and the fourth keywords derived from the fourth shot 804 are (US, COVID-19, death) 804a. The fifth shot 805 covers 0:39–0:41, and the fifth keywords derived from the fifth shot 805 are (US, COVID-19, confirmed, death). The sixth shot 806 covers 0:42–0:45, and the sixth keywords derived from the sixth shot 806 are (US, COVID-19, death) 806a. The seventh shot 807 covers 0:46–0:50, and the seventh keywords derived from the seventh shot 807 are (US, COVID-19, death) 807a.
The scene generation unit groups one or more shots based on similarity. The similarity can be determined based on the keywords extracted from each shot, and the image tags and voice tags can additionally be taken into account.
In the embodiment of FIG. 8, the first shot 801 and the second shot 802 are grouped into the first scene 810, the third shot 803 is grouped into the second scene 820, and the fourth to seventh shots 804 to 807 are grouped into the third scene 830.
The first scene 810 covers the interval 0:00–0:29. Based on the first keywords (Japan, COVID-19, severe) 801a derived from the first shot 801, the second keywords (Japan, COVID-19, spread) 802a derived from the second shot 802, and the audio data of the first shot 801 and the second shot 802, the metadata sentence "COVID-19 continues to spread in Japan." 810b is assigned to the scene.
The second scene 820 covers the interval 0:30–0:34. Based on the third keywords (New York, COVID-19, Europe, inflow) 803a derived from the third shot 803 and the audio data of the third shot 803, the metadata sentence "COVID-19 in New York is reportedly being brought in from Europe." 820b is assigned.
The third scene 830 covers the interval 0:35–0:50. Based on the fourth keywords (US, COVID-19, death) 804a derived from the fourth shot 804, the fifth keywords (US, COVID-19, confirmed, death) derived from the fifth shot 805, the sixth keywords (US, COVID-19, death) 806a derived from the sixth shot 806, and the audio data of the fourth shot 804 to the sixth shot 806, the metadata sentence "This is news of confirmed COVID-19 deaths in the United States." 830b is assigned.
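The metadata sentences in this example could be produced, in a much-simplified form, by picking from the scene's STT transcript the sentence that covers the most scene keywords. The description (and claim 9) speak of generating a single sentence through deep learning, for example abstractive summarization; the heuristic below is only a hypothetical stand-in for that step.

```python
def scene_metadata_sentence(scene_keywords, stt_sentences):
    """Pick, from the scene's STT transcript, the sentence that covers the most
    scene keywords and use it as the scene's metadata sentence.  A keyword-
    coverage heuristic standing in for the deep-learning summarization step."""
    def coverage(sentence):
        lowered = sentence.lower()
        return sum(1 for kw in scene_keywords if kw.lower() in lowered)
    candidates = [s for s in stt_sentences if coverage(s) > 0]
    if not candidates:
        return " ".join(sorted(scene_keywords))   # fall back to the bare keywords
    return max(candidates, key=coverage)
```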
In a preferred embodiment of the present invention, after the user selects the video 800 and the query input interface is activated, the user enters the content to be searched in the form of a sentence. For example, the user may enter the query sentence "What is the current COVID-19 situation in the United States?" 840.
The video indexing unit uses the metadata assigned to each scene as an index and searches for the metadata with the highest degree of matching to the query sentence 840. The degree of matching is determined based on the similarity between the query 840 and the metadata 810b, 820b, and 830b; the Levenshtein distance, which is 0 when two sentences are identical and grows as the similarity between the two sentences decreases, can be used.
The video indexing unit finds "This is news of confirmed COVID-19 deaths in the United States." 830b as the metadata most similar to the user query 840 "What is the current COVID-19 situation in the United States?", and then plays back for the user the third scene 830 to which that metadata is assigned. By entering the query 840, the user can retrieve and watch only the third scene 830, i.e. the interval 0:35–0:50 of the video 800 that is related to the query 840.
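A compact sketch of the matching and playback selection described here is given below: a plain Levenshtein edit distance between the query sentence and each scene's metadata sentence, with the closest scene's interval returned for playback. The scene tuples and sentences are the illustrative FIG. 8 values as translated above, not data from an actual system.

```python
def levenshtein(a, b):
    """Edit distance: 0 for identical sentences, larger as similarity decreases."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (ca != cb))) # substitution
        prev = cur
    return prev[-1]

def find_best_scene(query, scenes):
    """scenes: list of (start_sec, end_sec, metadata_sentence) tuples.
    Returns the scene whose metadata sentence is closest to the query."""
    return min(scenes, key=lambda scene: levenshtein(query, scene[2]))

# Usage with the illustrative FIG. 8 values:
scenes = [
    (0, 29, "COVID-19 continues to spread in Japan."),
    (30, 34, "COVID-19 in New York is reportedly being brought in from Europe."),
    (35, 50, "This is news of confirmed COVID-19 deaths in the United States."),
]
start, end, _ = find_best_scene(
    "What is the current COVID-19 situation in the United States?", scenes)
# start, end give the interval to play back; with these sentences the third
# scene (0:35-0:50) is expected to be the closest match.
```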
As another preferred embodiment of the present invention, the video indexing unit may provide the user with the metadata assigned to each of the scenes 810 to 830 constituting the video as an index. Through the video index, the user can preview an outline of the contents of the video in advance.
The methods according to embodiments of the present invention may be implemented in the form of program instructions that can be executed by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the present invention, or may be known to and usable by those skilled in the art of computer software.
Although the present invention has been described above with reference to limited embodiments and drawings, the present invention is not limited to these embodiments, and those of ordinary skill in the art to which the present invention pertains will be able to make various modifications and variations from this description.

Claims (15)

  1. A method of searching for information inside a video, the method comprising:
    receiving a sentence as a query from a user;
    searching, within a video indexed scene by scene with metadata assigned to each scene in the form of a sentence, for the scene with the highest degree of matching to the query; and
    playing back only the section of the video from the start point to the end point of the retrieved scene.
  2. The method of claim 1,
    wherein the user selects one video to be searched and enters the content to be found inside the selected video in the form of a sentence.
  3. The method of claim 1,
    wherein the video consists of at least one scene, and each of the at least one scene is assigned a summary sentence describing the contents of that scene, in the form of a sentence, which is used as metadata.
  4. The method of claim 1,
    wherein the degree of matching is determined using the Levenshtein distance, which is 0 when two sentences are identical and grows as the similarity between the two sentences decreases.
  5. The method of claim 1, further comprising:
    segmenting the video into shots;
    assigning a tag set to each of the segmented shots and deriving, for each shot through topic analysis of the tag set, keywords that highlight the characteristics of that shot; and
    generating the scenes by performing hierarchical clustering based on the similarity between adjacent shots determined from the keywords.
  6. The method of claim 5,
    wherein the tag set consists of image tags and voice tags.
  7. The method of claim 5,
    wherein each of the scenes is assigned a scene tag and metadata in the form of a sentence, so that the video is indexed scene by scene.
  8. The method of claim 5,
    wherein the metadata is generated based on at least one keyword derived from each of the at least one shot constituting a scene and on text obtained by converting the audio data of each of the at least one shot through speech-to-text (STT).
  9. The method of claim 8,
    wherein at least one sentence containing the keywords is selected from the at least one text obtained by converting, through STT, the audio data of the at least one shot constituting the scene, a single sentence is generated from the selected at least one sentence through deep learning, and the generated single sentence is stored as metadata describing the scene.
  10. The method of claim 1, further comprising:
    extracting frames from the video as images and converting each image into the HSV color space;
    generating three time series composed of the representative (median) values of H (hue), S (saturation), and V (brightness); and
    setting a point as the start or end point of a shot when the three inflection points detected in the three time series all coincide at that point.
  11. The method of claim 10, further comprising:
    grouping at least one shot into a scene, wherein the grouping is performed based on the similarity between shots that are adjacent on the timeline.
  12. An apparatus for searching information inside a video, the apparatus comprising:
    a query input unit for receiving a sentence as a query from a user;
    a video section search unit for searching, within the video, for the specific section with the highest relevance to the query; and
    a video section playback unit for playing back only the specific section within the video,
    wherein the video is segmented on a semantic basis, a sentence describing the meaning of each segmented section is assigned to that section as metadata, and the relevance to the query is determined along the timeline of the video using the sentences as an index.
  13. The apparatus of claim 12, wherein the video section search unit comprises:
    a shot segmentation unit for segmenting the video into shots, assigning a tag set to each segmented shot, and deriving, for each shot through topic analysis of the tag set, keywords that highlight the characteristics of that shot; and
    a scene generation unit for generating the scenes by performing hierarchical clustering based on the similarity between adjacent shots determined from the keywords.
  14. The apparatus of claim 13, further comprising:
    a metadata generation unit for analyzing each scene generated by the scene generation unit and assigning metadata to each scene in the form of a sentence.
  15. A computer-readable recording medium on which a program for performing the method of any one of claims 1 to 11 is recorded.
PCT/KR2020/005718 2020-04-29 2020-04-29 Method and apparatus for searching for information inside video WO2021221209A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/KR2020/005718 WO2021221209A1 (en) 2020-04-29 2020-04-29 Method and apparatus for searching for information inside video
KR1020207014777A KR20210134866A (en) 2020-04-29 2020-04-29 Methods and devices for retrieving information inside a video

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/KR2020/005718 WO2021221209A1 (en) 2020-04-29 2020-04-29 Method and apparatus for searching for information inside video

Publications (1)

Publication Number Publication Date
WO2021221209A1 true WO2021221209A1 (en) 2021-11-04

Family

ID=78374167

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2020/005718 WO2021221209A1 (en) 2020-04-29 2020-04-29 Method and apparatus for searching for information inside video

Country Status (2)

Country Link
KR (1) KR20210134866A (en)
WO (1) WO2021221209A1 (en)



Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20080111376A (en) * 2007-06-18 2008-12-23 한국전자통신연구원 System and method for managing digital videos using video features
JP2016035607A (en) * 2012-12-27 2016-03-17 パナソニック株式会社 Apparatus, method and program for generating digest
KR20150022088A (en) * 2013-08-22 2015-03-04 주식회사 엘지유플러스 Context-based VOD Search System And Method of VOD Search Using the Same
KR20190114548A (en) * 2018-03-30 2019-10-10 주식회사 엘지유플러스 Apparatus and method for controlling contents, or content control apparatus and method thereof
KR20190129266A (en) * 2018-05-10 2019-11-20 네이버 주식회사 Content providing server, content providing terminal and content providing method

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116702707A (en) * 2023-08-03 2023-09-05 腾讯科技(深圳)有限公司 Action generation method, device and equipment based on action generation model
CN116702707B (en) * 2023-08-03 2023-10-03 腾讯科技(深圳)有限公司 Action generation method, device and equipment based on action generation model
CN117633297A (en) * 2024-01-26 2024-03-01 江苏瑞宁信创科技有限公司 Video retrieval method, device, system and medium based on annotation

Also Published As

Publication number Publication date
KR20210134866A (en) 2021-11-11

Similar Documents

Publication Publication Date Title
WO2010117213A2 (en) Apparatus and method for providing information related to broadcasting programs
WO2020080606A1 (en) Method and system for automatically generating video content-integrated metadata using video metadata and script data
US6507838B1 (en) Method for combining multi-modal queries for search of multimedia data using time overlap or co-occurrence and relevance scores
US6578040B1 (en) Method and apparatus for indexing of topics using foils
US20050038814A1 (en) Method, apparatus, and program for cross-linking information sources using multiple modalities
KR101516995B1 (en) Context-based VOD Search System And Method of VOD Search Using the Same
US20110078176A1 (en) Image search apparatus and method
US20100318532A1 (en) Unified inverted index for video passage retrieval
WO2021221209A1 (en) Method and apparatus for searching for information inside video
US9131207B2 (en) Video recording apparatus, information processing system, information processing method, and recording medium
WO2017188606A2 (en) Terminal device and method for providing additional information
KR101640317B1 (en) Apparatus and method for storing and searching image including audio and video data
Luo et al. Exploring large-scale video news via interactive visualization
JP2007328713A (en) Related term display device, searching device, method thereof, and program thereof
US20080016068A1 (en) Media-personality information search system, media-personality information acquiring apparatus, media-personality information search apparatus, and method and program therefor
CN110008314B (en) Intention analysis method and device
WO2021221210A1 (en) Method and apparatus for generating smart route
Sack et al. Automated annotations of synchronized multimedia presentations
WO2021167238A1 (en) Method and system for automatically creating table of contents of video on basis of content
Aletras et al. Computing similarity between cultural heritage items using multimodal features
Kim et al. Content-Based Video Indexing and Retrieval--A Natural Language Approach--
WO2015190834A1 (en) Method for searching for and providing video
JP2007293602A (en) System and method for retrieving image and program
WO2016089110A1 (en) Entry-based knowledge resource generation device and method
KR20200063316A (en) Apparatus for searching video based on script and method for the same

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20933240

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20933240

Country of ref document: EP

Kind code of ref document: A1


32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205 DATED 11/04/2023)
