WO2021167238A1

WO2021167238A1 - Method and system for automatically creating table of contents of video on basis of content

Info

Publication number: WO2021167238A1
Application number: PCT/KR2021/000093
Authority: WO
Inventors: 손영석
Original assignee: 제주대학교 산학협력단
Priority date: 2020-02-17
Filing date: 2021-01-05
Publication date: 2021-08-26
Also published as: KR102252522B1; WO2021167220A1

Abstract

The present invention provides a method for automatically creating a table of contents of video on the basis of content, comprising the steps of: selecting video content for which a table of contents is to be created; extracting voice information spoken in the video content so as to generate text information; classifying the text information into morphemes, and converting, into data, morpheme information including part of speech and frequency of use of each morpheme; dividing the video content into a plurality of sections, and selecting characteristic words from the divided sections of the video content on the basis of the morpheme information; sequentially arranging the characteristic words selected from the divided sections of the video content so as to generate table of contents information; setting an individual playback section of the video content on the basis of spoken time information of the characteristic words in the video content; linking the table of contents information and an individual playback section of video content; and linking the table of contents information so that same is displayed in relation to the video content.

Description

Content-based video table of contents automatic creation method and system

The present invention relates to a method and system for automatically generating a table of contents for a video, and more particularly, to a new method and system that automatically generate a table of contents based on the contents of video content, without manual work by a user, and automatically play the section related to each table of contents entry.

In the knowledge-information society, where informatization has been achieved in all fields through the rapid development of information and communication systems such as computers, the world is interconnected by high-speed information and communication networks, and a great deal of information is being digitized. It is a society in which technology and industry are led by the convergence of all kinds of information and creative knowledge, built on advanced information and communication facilities and their use.

In the knowledge-information society, not only the systems for producing and supplying information but also the technical basis for sharing information smoothly is very important. In particular, selecting the desired information from an exponentially growing sea of information, and quickly and easily finding a specific part of that information, is essential to using information effectively.

Personal computers and the web-based Internet have long been a means of producing and sharing various kinds of digital information, and in recent years mobile platforms, which further spread the production and sharing of information, have continued to grow. With the development of information platforms, the technology for using information is also changing. In the case of printed materials, typified by conventional books, the contents are summarized in a table of contents, so users can find out where the desired information is written without reading sequentially from the first page. In the case of video content, on the other hand, no table of contents is provided, so it is difficult to guess what was uttered in which part without watching the video sequentially from beginning to end, which makes it difficult to find the desired information.

Until just a few years ago, most information retrieval was based on text. Recently, however, video-based information retrieval has been spreading rapidly. This can be confirmed by statistics indicating that young people in their teens and twenties, as well as middle-aged people, now use video sharing services such as YouTube ten times more than portal search sites such as Naver.

With the rapid development of video-related technologies, user demand for searching for video files of interest on a network is increasing, and various video search methods have been developed to meet this demand. The video search method in common use today is the annotation-based search, in which a search engine matches an input text keyword against text annotations attached to the entire video file. In this method, a video annotated with text that can represent it, such as a movie title or the title of a related newspaper article, is found by comparing the text annotation with the entered keyword. As another method, a system has been proposed that automatically searches for the sections of a video in which a specific person appears, based on section information about where that person appears.

As such, video searches are limited to the 'title' or 'hashtag (#)'. In other words, with current video search methods, if the keyword entered by the user does not match the title or hashtag of a video (arbitrarily attached by the person who uploaded it), the desired video cannot be found. Such a search method is inadequate for searching the countless videos on the Internet, and it is an obstacle to the popular use and spread of the flood of video content.

In this situation, the present inventor has proposed a content-based video search system that, when the content a user wants to search for corresponds to specific content in a video file, controls playback so that only the scenes in the section reflecting that content are played (see Registered Patent 10-1940289).

This technology extracts, from among several videos, only the parts in which a specific word is used, arranges them, and plays them sequentially. Specifically, the system consists of: a video searcher terminal that requests a video search using a search keyword; video storage servers that store and manage video content serviced on the Internet; and a video search server. Based on the search keyword transmitted with the video search request from the video searcher terminal, the video search server collects, from among the video content stored and managed in the video storage servers, the video content in which the search keyword is uttered, sets for each collected video content a 'search keyword playback section', which is the playback section in which the search keyword is uttered, and provides it to the video searcher terminal, so that only the search keyword playback section is played when the video content is viewed on the video searcher terminal.

The video search server includes a DB unit and a video search engine. The DB unit includes a per-storage-location video content identification information DB, which holds identifiers for the video content stored and managed by the video storage servers, and a per-video-content conversation text information DB, in which the dialogue output from that video content is recorded in sentence units and stored separately for each video content together with the time at which each dialogue is uttered. The video search engine matches the search keyword transmitted from the video searcher terminal against the conversation text information in the conversation text information DB, recognizes the per-storage-location video content identification information associated with the matching conversation text information, collects the video content containing the search keyword that corresponds to the recognized identification information, and, referring to the conversation text information, links the collected video content so that it is played back over the time region corresponding to the sentence unit containing the search keyword, including the preceding and following context.

Such a conventional video search method is meaningful in that, when the content a user wants to search for corresponds to specific content in a video file, videos reflecting that content can be found. However, it was still difficult to confirm what each part of a video contains, so a user could not search within a video more specifically or select and play only a desired part of a specific video.

The present invention was conceived under the above technical background, and an object of the present invention is to provide a new video search method tailored to the needs of users in searching for various video content provided online.

Another object of the present invention is to provide a video playback method in which a user can check a desired part and play only the corresponding part without playing the entire video.

Another object of the present invention is to provide a system that automatically creates a table of contents for video content and allows the video part related to each table of contents entry to be played, without the cumbersome work in which the video content creator writes a table of contents according to the video content and links each entry to the corresponding part of the video.

In addition, other objects and technical features of the present invention will be presented in more detail in the following detailed description.

In order to achieve the above objects, the present invention provides a method for automatically generating a content-based video table of contents, comprising the steps of: selecting video content for which a table of contents is to be generated; extracting voice information uttered in the video content to generate text information; dividing the text information into morphemes and converting into data the morpheme information, including the part of speech of each morpheme and the number of times it is used; dividing the video content into a plurality of sections and selecting a characteristic word from each divided section of the video content based on the morpheme information; generating table of contents information by sequentially arranging the characteristic words selected in the divided sections of the video content; setting individual playback sections of the video content based on the utterance information of the characteristic words in the video content; linking the table of contents information with the individual playback sections of the video content; and linking the table of contents information so that it is displayed in relation to the video content.

In the present invention, when generating text information from the audio information of the video content, it is preferable to generate text information in characters of a language corresponding to the audio information.

In the present invention, when dividing the video content into sections, the entire playback time of the video content can be divided into a plurality of sections of equal duration. In addition, the text with the highest utterance frequency in a section of the video content may be selected as the characteristic word.

In addition, in the present invention, it is possible to set an individual playback section of the video content based on the first utterance time of the first characteristic word and the first utterance time of the second characteristic word in the video content.

The present invention also provides a system for automatically generating a content-based video table of contents, including: a video management unit that selects the target video content for creating a table of contents and divides the selected video content into a plurality of sections; a text conversion unit that extracts voice information uttered in the video content and generates text information from the extracted voice information; a morpheme analysis unit that divides the text information into morphemes and converts into data the morpheme information, including the part of speech of each morpheme and the number of times it is used; a table of contents generation unit that selects a characteristic word from each divided section of the video content based on the morpheme information and generates table of contents information by sequentially listing the selected characteristic words; and an output control unit that sets individual playback sections of the video content based on the utterance information of the characteristic words, links the table of contents information with the individual playback sections, and links the table of contents information so that it is displayed in relation to the video content.

According to the present invention, table of contents information can be generated automatically, selectively or collectively, for the various video content provided through the Internet. The contents of individual sections of a video can be checked through the generated table of contents information without replaying the entire video, and, if necessary, only a desired section can be played through the corresponding table of contents entry.

The present invention is expected to be widely usable by individual YouTubers and by the various video platforms and related IT companies that provide video streams on the Internet.

FIG. 1 is a schematic diagram showing the configuration of a table of contents generation system of the present invention;

FIG. 2 is a flowchart illustrating a method for generating a video table of contents according to the present invention;

FIG. 3 is a schematic diagram showing a key word extraction method;

FIG. 4 is a schematic diagram showing a division section and a playback section;

FIG. 5 is an Internet screen showing video content and table of contents information.

Advantages and features of the present invention, and methods for achieving them, will become apparent with reference to the embodiments described below in detail in conjunction with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms.

The present invention proposes a new method and system for automatically generating a table of contents based on the contents of video content and automatically reproducing sections related to the table of contents.

In order for an automatically created table of contents to be based on the contents of the video, in a preferred embodiment of the present invention the table of contents is generated from words that appear frequently in a certain part of the video (characteristic words, also called feature words), and each generated entry is linked with the corresponding video part so that the video related to that entry is played. In addition to selecting a frequent word as the characteristic word, various other methods may be used for content-based table of contents generation.

This automatic table of contents generation can be applied to individual video content, so that the video part corresponding to a desired entry can be watched selectively within a single video. It is also possible to create tables of contents for a plurality of videos with similar content or categories and then selectively watch, across those videos, the video corresponding to a desired entry.

Implementing the table of contents generation method of the present invention requires technical means for selecting a video, creating and displaying a table of contents, dividing (segmenting) the video, extracting key words, and linking each displayed entry to the corresponding part of the video. To this end, the system according to the present invention requires various hardware and software means. FIG. 1 shows the configuration of the function processing units of the table of contents generation system 100 of the present invention, which includes a video management unit 110, a text conversion unit 120, a morpheme analysis unit 130, a table of contents generation unit 140, and an output control unit 150. The system may be built, for example, in the form of an online platform, may include a server such as a computer terminal, and may be networked with other servers or Internet websites through a wired/wireless communication network.

The video management unit basically selects the target video content for creating the table of contents, manages the corresponding content information, and also performs an additional function of dividing the selected video content into a plurality of sections. The video management unit may be included in a server related to the entire system.

The text conversion unit extracts voice information uttered in the video content and generates text information from the extracted voice information. It can also convert into text any textual or other information besides the voice information, and it includes a function to determine the language when the speech is in the language of a specific country or ethnic group. The text conversion unit may include a text database for storing the text information converted from voice information. The morpheme analysis unit divides the text information into morphemes and converts into data the morpheme information, including the part of speech of each morpheme and the number of times it is used, and it may include a database for storing the obtained morpheme information.

The text conversion unit and the morpheme analysis unit may be included in the server as components having independent functions, and when using an external open API, the system server may include only a database for storing the result using the external API.

The table of contents generator selects a characteristic word from a section of the moving picture content based on the morpheme information, and sequentially lists the selected feature words from a section of the moving picture content to generate the table of contents information. The generated table of contents may include connection information related to a corresponding part (a specific playback section) of a video together with table of contents information in text form.

The output control unit sets an individual playback section of the video content based on the utterance information of the characteristic word in the video content, links the table of contents information with the individual playback section, and links the table of contents information so that it is displayed in relation to the video content. The output control unit may be configured independently and included in the system server, or it may be included in the video management unit or the table of contents generation unit to form an integrated video control unit.

In addition, in the present invention, the video management unit and the output control unit can exchange data in real time over the Internet with an external video providing server (S), for example a video providing website or mobile platform, and can provide table of contents information (playback information) to it.

With such a content-based automatic video table of contents generation system, a table of contents can be created automatically not only for user-generated videos but also for third-party videos published on the Internet. The method is implemented through the following main steps: automatically transcribing the contents uttered in the video to generate text; dividing the text into word units using a morpheme analysis program or the like; generating a table of contents based on the most frequent words or characteristic words in a certain part of the video; and controlling the video so that only the scenes of the selected 'table of contents section' are played.

Such a method for automatically generating a table of contents for a video will be described in more detail with reference to FIG. 2 .

First, the video management unit of the system selects or receives the target video content for which a table of contents is to be created (step S110). The video may be selected by the server itself, or information on a video for which table of contents generation is requested may be received from outside. A request may cover a plurality of videos as well as an individual video.

Once the video for table of contents creation is determined, the text conversion unit extracts the voice information uttered in the video content and generates text information from it (step S120). The audio of most videos includes one or more languages. When generating text information from such voice information, it is preferable to generate the text in the characters of the language corresponding to the voice information, and, when a plurality of languages are included, to convert each into the characters of its own language. If necessary, voice information in two or more languages can also be unified into text in a single language. For automatically transcribing the content uttered in the video, a text conversion unit including the corresponding algorithm may be provided in the server itself, but an external service may also be used, for example a speech-to-text (STT) service that records voice and automatically converts it into text. In this case, the system server includes a control unit related to text conversion and a database of the converted text information.

Next, the text information is divided into morphemes, and morpheme information including the part of speech of each morpheme and the number of times it is used is converted into data (step S130). The transcribed text (sentences) can be divided into word units by a morpheme analysis program (morpheme analyzer). For example, a sentence can be automatically divided as 'magpie/magpie/New Year's Day/eun/yesterday/go/yo, we/we/New Year's Day/eun/today/since/yo.' In addition, the part of speech of each word and the number of times it is used can be checked, such as magpie (noun) 2 times, we (noun) 2 times, New Year's Day (noun) 2 times, yo (noun) 2 times, and today (noun) 1 time. The morpheme analyzer may be provided in the system server itself, or an external public program may be used. In this case, the system server includes a control unit related to morpheme information conversion and a database of the extracted morpheme information.
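The data conversion of step S130 can be sketched as follows. This is a minimal illustration only, assuming the morpheme analyzer has already returned (morpheme, part of speech) pairs; the function name `morpheme_table`, the English glosses, and the placeholder tag "other" for morphemes whose part of speech is not stated above are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical analyzer output for the example sentence: (morpheme, part of speech).
# Tags marked "other" are placeholders; the description does not state them.
tagged = [
    ("magpie", "noun"), ("magpie", "noun"), ("New Year's Day", "noun"),
    ("eun", "other"), ("yesterday", "other"), ("go", "other"), ("yo", "noun"),
    ("we", "noun"), ("we", "noun"), ("New Year's Day", "noun"),
    ("eun", "other"), ("today", "noun"), ("since", "other"), ("yo", "noun"),
]

def morpheme_table(pairs):
    """Count how many times each (morpheme, part of speech) pair is used."""
    return Counter(pairs)

table = morpheme_table(tagged)

# Print the morpheme information as rows that could be stored in a database.
for (morpheme, pos), count in table.most_common():
    print(f"{morpheme} ({pos}): {count}")
```

A `Counter` keyed on the pair keeps the part of speech and the usage count together, which is the information the later characteristic word selection relies on.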

When text conversion and data conversion of morpheme information are completed, the video content is divided into a plurality of sections, and a characteristic word is selected from each division section of the video content based on the morpheme information (step S140).

FIG. 3 is a schematic diagram illustrating a method of extracting a key word; it shows that a high-frequency word (AAA) is selected as the key word from among the words uttered in a specific division section (section 1). Here, a division section of the video content is different from the video playback section described later (see FIG. 4); it is a segment obtained by temporarily dividing the video content into a plurality of regions for characteristic word extraction. The division into sections can be set according to various criteria. For example, the entire playback time of the video content can be divided into a plurality of sections of equal duration, and the length (time) of each section can vary according to the length of the video.

In the case of a moving picture, since there is no physical boundary, a word frequently used in a section divided by a predetermined time interval can be selected as a characteristic word of the section. For example, if the video is 10 minutes long, first, the video section is mechanically divided into 10 equal parts, and then each word that is particularly frequently used in each section divided by 1 minute (which appears less in other sections) is selected as a characteristic word. The characteristic word may be in the form of a noun, a verb, or a short sentence in which a noun and a verb are combined. In addition, when there is no frequent word, the first word in the corresponding section may be determined as a temporary characteristic word.
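The equal-time division and frequency-based selection described above can be sketched as follows. This is a minimal illustration under stated assumptions: the input is a list of (word, utterance time in seconds) pairs already filtered by the morpheme analysis, and the function names and sample data are hypothetical.

```python
from collections import Counter

def split_equal(duration, n_sections):
    """Return n_sections (start, end) pairs of equal length covering the video."""
    length = duration / n_sections
    return [(i * length, (i + 1) * length) for i in range(n_sections)]

def characteristic_words(utterances, duration, n_sections):
    """Select the most frequently uttered word of each division section.

    If no word occurs more than once in a section, the first word uttered
    in that section is used as a temporary characteristic word.
    """
    ordered = sorted(utterances, key=lambda u: u[1])
    result = []
    for start, end in split_equal(duration, n_sections):
        words = [w for w, t in ordered if start <= t < end]
        if not words:
            result.append(None)  # nothing was uttered in this section
            continue
        word, count = Counter(words).most_common(1)[0]
        result.append(word if count > 1 else words[0])
    return result

# Hypothetical transcript of a 3-minute video, divided into 1-minute sections.
utterances = [
    ("intro", 5), ("python", 12), ("python", 40),
    ("loop", 70), ("list", 95), ("loop", 110),
    ("class", 130),
]
selected = characteristic_words(utterances, duration=180, n_sections=3)
print(selected)
```

In the last section only one word is uttered, so the fallback rule for sections without a frequent word applies.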

How a characteristic word (important vocabulary) is extracted from a division section is very important in generating the video table of contents of the present invention. As described above, a word frequently used in a section divided by a predetermined time interval may be used as the characteristic word of that section, or another characteristic word extraction technique may be applied. Term Frequency - Inverse Document Frequency (TF-IDF) is one of the representative methods. TF-IDF is a weight used in information retrieval and text mining; given a document group consisting of several documents, it is a statistical value indicating how important a word is in a specific document. With this method, a key word can be extracted from a division section of the video, or similar key words in a plurality of division sections can be compared to distinguish the characteristic word of each.
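As an illustration of the TF-IDF alternative, the sketch below treats each division section as one document. It assumes one common variant of the weight, tf = count / (words in the section) and idf = log(N / df); the description does not fix a particular formula, and the function name and sample data are hypothetical.

```python
import math
from collections import Counter

def tfidf_keywords(sections):
    """Return the highest-scoring TF-IDF word for each section.

    sections: list of word lists, one list per division section.
    """
    n = len(sections)
    # Document frequency: in how many sections does each word occur?
    df = Counter()
    for words in sections:
        df.update(set(words))

    keywords = []
    for words in sections:
        counts = Counter(words)
        total = len(words)
        scores = {
            w: (c / total) * math.log(n / df[w])
            for w, c in counts.items()
        }
        keywords.append(max(scores, key=scores.get) if scores else None)
    return keywords

# "video" is frequent everywhere, so its idf is log(3/3) = 0 in every section.
sections = [
    ["video", "python", "python", "video"],
    ["video", "loop", "loop"],
    ["video", "class", "video"],
]
keywords = tfidf_keywords(sections)
print(keywords)
```

In the third section plain frequency would pick "video", whereas the TF-IDF weight picks "class", because "video" appears in every section and therefore does not distinguish any of them.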

Next, the table of contents information is generated by sequentially arranging the characteristic words selected in the division sections of the video content (step S150). For example, a table of contents can be created from words (characteristic words) that occur only in certain parts of a video, and each entry may take the form of a noun, a verb, or a combination of a noun and a verb.

After generating the table of contents information, the output controller controls the video so that only the scenes of the 'table of contents section' are reproduced based on the table of contents. Specifically, an individual playback section of the video content is set based on the utterance information of the characteristic word in the video content, and the table of contents information is linked with the individual playback section of the video content (step S160).

FIG. 4 is a schematic diagram showing a division section and a playback section. In contrast to the division sections, into which the entire video is divided, for example, by equal time, the playback sections may differ in length because each is set based on the utterance information of the extracted characteristic word. A playback section may be set based on the time at which a first characteristic word is first uttered and the time at which a second characteristic word is first uttered in the video content. For example, the section from the first appearance of characteristic word A to just before the first appearance of characteristic word B is set as the first table of contents playback section, related to characteristic word A, and the section from the first appearance of characteristic word B to just before the first appearance of characteristic word C may be set as the second table of contents playback section, related to characteristic word B. In this case, if the first entry, related to characteristic word A, is clicked in the table of contents information, the video is played from the point where characteristic word A first appears rather than from the start of the video. To this end, the output control unit links the table of contents entry corresponding to each selected characteristic word with the individual playback section set for it. It is also possible to set the playback section to start at a point in time before the characteristic word is first uttered.
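The playback section rule described above can be sketched as follows. This is a minimal illustration, assuming (word, utterance time in seconds) pairs as input; the function name, the `lead_in` parameter for starting slightly before the first utterance, and the choice to order entries by first utterance time are assumptions made for the example.

```python
def playback_sections(toc_words, utterances, duration, lead_in=0.0):
    """Set one playback section per characteristic word.

    Each section runs from the first utterance of its word to the first
    utterance of the next characteristic word; the last section runs to
    the end of the video. lead_in shifts each start earlier, never below 0.
    """
    # First utterance time of every word in the transcript.
    first = {}
    for word, t in sorted(utterances, key=lambda u: u[1]):
        first.setdefault(word, t)

    # Keep each characteristic word once, ordered by its first utterance.
    starts = sorted(
        {w: first[w] for w in toc_words if w in first}.items(),
        key=lambda item: item[1],
    )

    sections = []
    for i, (word, t) in enumerate(starts):
        end = starts[i + 1][1] if i + 1 < len(starts) else duration
        sections.append((word, max(0.0, t - lead_in), end))
    return sections

# Hypothetical transcript of a 3-minute video.
utterances = [
    ("intro", 5), ("python", 12), ("python", 40),
    ("loop", 70), ("list", 95), ("loop", 110),
    ("class", 130),
]
sections = playback_sections(["python", "loop", "class"], utterances, duration=180)
print(sections)
```

Unlike the equal-length division sections, the resulting playback sections here last 58, 60, and 50 seconds, because their boundaries follow the first utterance times.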

The output control unit also links the table of contents information so that it is displayed in relation to the video content. In this case, the table of contents information may include a layout suitable for displaying the table of contents list, as well as the video playback section and link information for each entry. FIG. 5 is an Internet screen 200 showing video content 210 and table of contents information. The table of contents information may be displayed integrally within the video content (see 300a) or displayed separately in another area outside the video (see 300b).
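One possible shape for table of contents entries that carry both display text and link information is sketched below. The record fields, the function names, and the example URL are hypothetical; the link uses the temporal fragment syntax `#t=start,end` from the W3C Media Fragments URI recommendation, on the assumption that the player honors it.

```python
def format_time(seconds):
    """Format a time in seconds as MM:SS for display in the table of contents."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"

def build_toc(video_url, sections):
    """Build display and link information for each table of contents entry.

    sections: list of (characteristic word, start, end) in seconds.
    """
    entries = []
    for word, start, end in sections:
        entries.append({
            "title": word,
            "label": f"{format_time(start)} {word}",
            # Temporal media fragment; an assumed linking mechanism.
            "link": f"{video_url}#t={int(start)},{int(end)}",
        })
    return entries

toc = build_toc(
    "https://example.com/video.mp4",  # placeholder URL
    [("python", 12, 70), ("loop", 70, 130), ("class", 130, 180)],
)
for entry in toc:
    print(entry["label"], "->", entry["link"])
```

Keeping the label and the link in one record lets the same data be rendered either inside the video area or in a separate area of the page.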

Through the table of contents information generated from the characteristic words and the links to the playback sections, individual video sections can be played selectively, and the desired content can be watched without replaying a specific video in its entirety. Automatic creation of a table of contents for video content improves user convenience in using video, and it is expected to be used by video platform companies that provide various videos on the Internet in streaming form.

In particular, the method for generating a table of contents according to the present invention can be applied to lecture videos or music (song) videos, and it can also generate table of contents information effectively for long videos such as dramas or movies. In addition, the present invention can be technically linked with the content-based video search system of Patent No. 10-1940289, previously developed by the present inventor, so that a user performs a content-based video search by key word across the various video content provided online and then watches the video through the table of contents service, which is likewise organized around key words.

Although the present invention has been described above by way of preferred embodiments, it is not limited to such specific embodiments and may be modified, changed, or improved in various forms within the scope of the technical idea presented in the present invention, specifically the claims.

The components of the present invention may be implemented as software programming or software elements. Likewise, the present invention may be implemented in a programming or scripting language such as C, C++, Java, or assembler, with its various algorithms implemented as data structures, processes, routines, or combinations of other programming constructs. Functional aspects may be implemented as algorithms running on one or more processors. Further, the present invention may employ prior art techniques for electronic configuration, signal processing, and/or data processing. Terms such as “mechanism”, “element”, “means” and “configuration” may be used broadly and are not limited to mechanical and physical configurations; they may include the meaning of a series of software routines in association with a processor or the like.

As described above, the technical idea of the present invention has been specifically described in the preferred embodiment, but the preferred embodiment is for the purpose of explanation and not for limitation. As such, those skilled in the art will be able to understand that various embodiments are possible through the combination of the embodiments of the present invention within the scope of the technical spirit of the present invention.

Claims

A method for automatically generating a content-based video table of contents, comprising the steps of:

selecting video content for which a table of contents is to be created;

generating text information by extracting voice information uttered in the video content;

dividing the text information into morphemes and converting into data the morpheme information, including the part of speech of each morpheme and the number of times it is used;

dividing the video content into a plurality of sections, and selecting a characteristic word from each divided section of the video content based on the morpheme information;

generating table of contents information by sequentially arranging the characteristic words selected in the divided sections of the video content;

setting an individual playback section of the video content based on the utterance information of the characteristic word in the video content;

linking the table of contents information and the individual playback sections of the video content; and

linking the table of contents information so that it is displayed in relation to the video content,

wherein, when the text information is generated from the voice information of the video content, the text information is generated in characters of a language corresponding to the voice information.

The method according to claim 1,

wherein, when the video content is divided into sections, the entire playback time of the video content is divided by equal time into a plurality of sections.

The method according to claim 1,

wherein the text with the highest utterance frequency in a section of the video content is selected as the characteristic word.

The method according to claim 1,

wherein the individual playback sections of the video content are set based on the first utterance time of a first characteristic word and the first utterance time of a second characteristic word in the video content.

A system for automatically generating a content-based video table of contents, comprising:

a video management unit that selects video content for which a table of contents is to be created and divides the selected video content into a plurality of sections;

a text conversion unit that extracts voice information uttered in the video content and generates text information from the extracted voice information;

a morpheme analysis unit that divides the text information into morphemes and converts into data the morpheme information, including the part of speech of each morpheme and the number of times it is used;

a table of contents generation unit that selects a characteristic word from each divided section of the video content based on the morpheme information, and sequentially arranges the selected characteristic words to generate table of contents information; and

an output control unit that sets an individual playback section of the video content based on the utterance information of the characteristic word in the video content, links the table of contents information to the individual playback section of the video content, and links the table of contents information so that it is displayed in relation to the video content.