CN116680440A - Segment division processing device, method, and storage medium

Info

Publication number: CN116680440A
Application number: CN202211059350.6A
Authority: CN (China)
Prior art keywords: video, section, segment, tag, processing device
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 小林优佳, 吉田尚水, 岩田宪治, 久岛務嗣, 三原功雄, 永江尚义, 渡边奈夕子
Current Assignee: Toshiba Corp
Original Assignee: Toshiba Corp
Application filed by Toshiba Corp

Classifications

    • G06F16/70 Information retrieval of video data
    • G06F16/783 Retrieval characterised by using metadata automatically derived from the content
    • G06F16/7834 using audio features
    • G06F16/7844 using original textual content or text extracted from visual content or transcript of audio data
    • G06F16/7847 using low-level visual features of the video content
    • G06F16/7867 Retrieval characterised by using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings

Abstract

Embodiments relate to a segment division processing device, method, and storage medium capable of efficient management and viewing of video content or audio content. The segment division processing device of an embodiment includes an information acquisition unit, a division unit, a segment tag candidate acquisition unit, a segment tag selection unit, and a segment tag assignment unit. The information acquisition unit acquires video or audio data, the field of the data, and text information of the data. The division unit divides the video or audio data into one or more segments. The segment tag candidate acquisition unit acquires segment tag candidates corresponding to the field. The segment tag selection unit selects, for each segment, a segment tag from the segment tag candidates based on the text information. The segment tag assignment unit assigns the selected segment tag to the segment.

Description

Segment division processing device, method, and storage medium
This application is based upon and claims the benefit of priority from Japanese Patent Application No. 2022-025818, filed in 2022, the entire contents of which are incorporated herein by reference.
Technical Field
Embodiments described herein relate generally to a segment division processing device, method, and storage medium.
Background
In recent years, online education, online academic conferences, and the like have been increasing, and so have opportunities to view lecture videos and listen to recorded lecture audio. Accordingly, attention has been drawn to techniques for managing large numbers of video and audio contents and for viewing large numbers of contents efficiently.
In such techniques, a video is divided into one or more segments based on information in the video, and a segment name is assigned to each of the divided segments. In this case, arbitrary segment names are assigned independently to each video, so the segment names are not unified across videos. A user who wants to view only the important part of each video must therefore check the segment names one by one in order to decide which segment to view.
Disclosure of Invention
An object of the present invention is to provide a segment division processing device, method, and storage medium capable of efficient management and viewing of video content or audio content.
To solve the above problem, the segment division processing device according to an embodiment includes an information acquisition unit, a division unit, a segment tag candidate acquisition unit, a segment tag selection unit, and a segment tag assignment unit. The information acquisition unit acquires video or audio data, the field of the data, and text information of the data. The division unit divides the video or audio data into one or more segments. The segment tag candidate acquisition unit acquires segment tag candidates corresponding to the field. The segment tag selection unit selects, for each segment, a segment tag from the segment tag candidates based on the text information. The segment tag assignment unit assigns the selected segment tag to the segment.
With the above configuration of the segment division processing device, a plurality of video or audio contents can be viewed efficiently.
Drawings
Fig. 1 is a diagram showing an example of the configuration of the segment division processing device according to embodiment 1.
Fig. 2 is a flowchart showing the procedure of the video division processing in the segment division processing device according to embodiment 1.
Fig. 3 is a diagram showing an example in which a video in the "academic" field is divided by the segment division processing device according to embodiment 1 using text information.
Fig. 4 is a diagram showing an example in which a segment tag is assigned to each segment obtained by the segment division processing device according to embodiment 1 dividing a video in the "academic" field.
Fig. 5 is a diagram showing an example in which a segment tag is assigned to each segment obtained by the segment division processing device according to embodiment 1 dividing a video in the "education" field.
Fig. 6 is a diagram showing an example of the configuration of the segment division processing device according to embodiment 2.
Fig. 7 is a flowchart showing the procedure of the video division processing in the segment division processing device according to embodiment 2.
Fig. 8 is a diagram showing an example in which a segment name is set for each segment obtained by the segment division processing device according to embodiment 2 dividing a video in the "academic" field.
Fig. 9 is a diagram showing an example in which a segment name is selected for each segment obtained by the segment division processing device according to embodiment 2 dividing a video in the "education" field.
Fig. 10 is a diagram showing an example of the configuration of the segment division processing device according to modification 1 of embodiment 2.
Fig. 11 is a flowchart showing the procedure of the video division processing in the segment division processing device according to modification 1 of embodiment 2.
Fig. 12 is a diagram showing an example of the configuration of the segment division processing device according to modification 2 of embodiment 2.
Fig. 13 is a flowchart showing the procedure of the video division processing in the segment division processing device according to modification 2 of embodiment 2.
Fig. 14 is a diagram showing an example of the configuration of the segment division processing device according to modification 3 of embodiment 2.
Fig. 15 is a flowchart showing the procedure of the video division processing in the segment division processing device according to modification 3 of embodiment 2.
(description of symbols)
100: segment division processing device; 101: video information acquisition unit; 102: video division unit; 103: segment tag candidate acquisition unit; 104: segment tag selection unit; 105: segment tag assignment unit; 106: segment name generation unit; 107: keyword detection unit; 108: search term setting unit; 109: introduction data generation unit; 110: similarity calculation unit; 111: video generation unit; A1 to A6: sentences; B1 to B6: content word lists; C1 to C4, D1 to D4, F1 to F6: segments.
Detailed Description
Hereinafter, embodiments of a segment division processing device, method, and program will be described in detail with reference to the drawings. In the following description, constituent elements having substantially the same functions and configurations are denoted by the same reference numerals, and duplicate description is given only where necessary.
(embodiment 1)
Fig. 1 is a diagram showing the configuration of the segment division processing device 100 according to embodiment 1. The segment division processing device 100 acquires a plurality of videos to be viewed by a user and divides each video into a plurality of segments based on the content of text information obtainable from the video, so that the video can be viewed in part. The segment division processing device 100 further assigns to each segment a common segment tag that is unified per field. Assigning unified segment tags to a plurality of videos makes it easy to manage a large number of videos.
Further, the segment division processing device 100 is applicable not only to management of data in the form of video files but also to management of data in the form of audio files, as well as to management of a collection in which video-file data and audio-file data are mixed. In the present embodiment, management of data in video form is described as an example, but in the above and following descriptions, words such as "video", "video content", and "video data" may be read as "audio", "audio content", and "audio data".
The segment division processing device 100 is implemented, for example, as a video management application on a terminal device such as a PC used by the user, or on a cloud server connected to the terminal device via a network. The terminal device includes, for example, a communication interface and communication function for communicating with the segment division processing device 100, an input interface and input function for inputting video, a display and display control function for showing a video management screen and a video playback screen, and a video search function for finding a specific video among the managed videos. The network is, for example, a LAN (Local Area Network). The connection to the network may be either wired or wireless. The network is not limited to a LAN and may be the Internet, a public communication line, or the like.
Examples of the video include recordings of the audio and images of e-learning courses, university lectures, and academic-conference presentations. The video may be a recording of a lecture such as an online lecture or online class, or data downloaded from a video-sharing site. As the field of the video, fields in which lectures are common, such as "education" and "academic", can be used, but the field is not limited to these. The video may also be data that contains only audio, such as a lecture recording without images, or data that contains only images in which text representing the lecture content is displayed, without audio.
The segment division processing device 100 includes a processing circuit that controls the entire device and a storage medium (memory). The processing circuit is a processor that implements the functions of the video information acquisition unit 101, the video division unit 102, the segment tag candidate acquisition unit 103, the segment tag selection unit 104, and the segment tag assignment unit 105 by loading and executing programs from the storage medium. The processing circuit is formed of an integrated circuit such as a CPU (Central Processing Unit), an ASIC (Application Specific Integrated Circuit), or an FPGA (Field Programmable Gate Array). The processor may be formed of a single integrated circuit or of a plurality of integrated circuits.
The storage medium stores the processing programs used by the processor and the parameters and tables used in the processor's computations. The storage medium is a storage device such as an HDD (Hard Disk Drive), an SSD (Solid State Drive), or an integrated circuit that stores various information. Besides an HDD or SSD, it may be a portable storage medium such as a CD (Compact Disc), a DVD (Digital Versatile Disc), or a flash memory, or a drive device that reads and writes various information from and to a semiconductor memory element such as a flash memory or RAM (Random Access Memory). The storage medium also stores the plurality of videos, the segment tag candidates described later, data used in the processing of the processing circuit, thresholds, and the like. The storage medium is an example of a storage unit.
The functions of the video information acquisition unit 101, the video division unit 102, the segment tag candidate acquisition unit 103, the segment tag selection unit 104, and the segment tag assignment unit 105 may be realized by a single processing circuit, or the processing circuit may be composed of a combination of independent processors, each executing a program. Each of these functions may also be implemented as a separate hardware circuit. All or part of the functions of the processing circuit may also be provided on a cloud server that performs the processing in the cloud.
The video information acquisition unit 101 acquires a video, information on the field of the video (hereinafter, field information), and text information of the video. The text information is text data representing the content of the video. For example, the user inputs the video and the field information via the input interface of the terminal device. The video information acquisition unit 101 converts speech in the video into text data by speech recognition processing and acquires the converted text data as text information. The speech may be, for example, the voice of a lecturer, the voice of a presenter of an educational video, or a synthesized voice in an educational video. Alternatively, the video information acquisition unit 101 may apply OCR (Optical Character Recognition) processing to the images in the video, convert the text displayed in the video into character data, and acquire the converted character data as text information.
The video information acquisition unit 101 may also detect, for each piece of text data included in the text information, the utterance time of the speech, or the start time and end time of the on-screen display of the corresponding text. In this case, the video information acquisition unit 101 stores the detected times in association with the text data. "Time" here means the elapsed time within the video, with the start of the video set to "0". The video information acquisition unit 101 is an example of an information acquisition unit.
The video information acquisition unit 101 may further detect, for character data acquired from the images displayed in the video, the display coordinates (display position) within the image. In this case, the video information acquisition unit 101 stores the detected display coordinates in association with the corresponding text data.
The video division unit 102 divides a video into one or more segments. For example, the video division unit 102 divides each of a plurality of videos into one or more segments based on the text information. Typically each video is divided into a plurality of segments, but as a result of the division processing a video may also remain a single segment. Examples of division methods are described below; other known division methods may also be used. In the following, the case where one video is divided into a plurality of segments is described as an example. The video division unit 102 is an example of a division unit.
One method of dividing a video using text information relies on specific phrases that a speaker uses when explicitly marking a break. In educational and lecture videos, such phrases appear where the speaker intends to close or open a topic, for example "This concludes this session." or "Next, I will explain ○○.". Hereinafter, such phrases are referred to as segment division phrases. The segment division phrases are stored, for example, in the storage medium.
In this method, the video division unit 102 detects segment division phrases in the text information and divides each video into a plurality of segments with the detected locations as boundaries. Specifically, the video division unit 102 first reads the segment division phrases stored in the storage medium and, for each sentence in the text information, calculates a similarity to each of the phrases, for example using the edit distance. The video division unit 102 then treats any sentence whose similarity to a segment division phrase is at least a predetermined value as a division-target sentence and places a segment boundary before or after it. Some segment division phrases are used at the end of a topic and some at the beginning, so for each phrase it is set in advance whether the boundary is placed before or after the sentence in which the phrase is detected. For example, when a sentence similar to the closing phrase "This concludes this session." is detected, the video division unit 102 places the boundary immediately after that sentence, so that the detected sentence and the following sentence fall into different segments. Conversely, when a sentence similar to the opening phrase "Next, I will explain ○○." is detected, the boundary is placed immediately before the sentence, so that the detected sentence and the preceding sentence fall into different segments.
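By way of illustration only, the following Python sketch scores each sentence against a small list of segment division phrases with an edit-distance-style similarity and returns the resulting boundaries; the phrase list, the threshold value, and all function names are assumptions made for this sketch, not part of the embodiment.

    # Hypothetical sketch of segment-division-phrase matching (not from the embodiment).
    from difflib import SequenceMatcher

    # Each phrase carries where the boundary goes: "after" the matching
    # sentence for closing phrases, "before" it for opening phrases.
    DIVISION_PHRASES = [
        ("This concludes this session.", "after"),
        ("Next, I will explain", "before"),
    ]
    SIMILARITY_THRESHOLD = 0.7  # assumed value

    def similarity(a: str, b: str) -> float:
        # SequenceMatcher's ratio is a cheap stand-in for an
        # edit-distance-based similarity in [0, 1].
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def find_boundaries(sentences: list[str]) -> list[int]:
        """Return indices i such that a new segment starts at sentence i."""
        boundaries = []
        for i, sent in enumerate(sentences):
            for phrase, where in DIVISION_PHRASES:
                if similarity(sent, phrase) >= SIMILARITY_THRESHOLD:
                    boundaries.append(i + 1 if where == "after" else i)
                    break
        return sorted(set(b for b in boundaries if 0 < b < len(sentences)))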
Next, as another method of dividing a video using text information, a method using the content words contained in the text information will be described. Content words are the words that remain after removing particles, auxiliary verbs, pronouns, interjections, and the like; they represent the content of a sentence. In this method, the video division unit 102 splits the text contained in the text information into a plurality of sentences, detects from each sentence the content words related to the content of the video, compares the sentences using those content words, and divides each video into a plurality of segments with the points where the content words change as boundaries.
Specifically, the video division unit 102 first splits the text information into individual sentences. One way to do this is to use symbols that mark the end of a sentence, such as periods. This applies to text information acquired by speech recognition as well as by OCR processing.
Another way to split the text information into sentences is as follows: for text information acquired by speech recognition, the utterance time of each piece of text data is consulted, periods in which nothing is uttered in the video are detected as silent intervals, and the text information is split wherever a silent interval lasts a certain time or longer. This applies to text information acquired by speech recognition.
Yet another way to split the text information into sentences is as follows: for text information acquired by OCR processing, the display coordinates of each piece of text data are consulted, and the text information is split wherever the display coordinates change by a certain amount or more. This applies to text information acquired by OCR processing.
After splitting the text information into sentences, the video division unit 102 detects the content words in each sentence. Various known methods can be used for this; one example uses morphological analysis. In this method, the video division unit 102 applies morphological analysis to each of the split sentences to divide it into words and extracts the content words from the divided words.
One way to extract the content words is to consult a pre-stored list of non-content words (stop words) and extract, as content words, the words that remain after removing the stop words from the divided words.
Another way to extract the content words uses the IDF (Inverse Document Frequency). In this method, the video division unit 102 first consults a set of pre-stored external documents, counts the number Nd of external documents that contain each divided word, and computes the IDF of each word with formula (1) below, where N is the number of prepared external documents. The IDF is small for commonly used words and large for words that characterize a sentence. The video division unit 102 compares the IDF of each word with a predetermined threshold and extracts the words whose IDF exceeds the threshold as content words.
IDF = log(N / Nd)   (1)
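A minimal sketch of this IDF-based extraction might look as follows, assuming a pre-tokenized external corpus and an illustrative threshold; none of these names or values comes from the embodiment.

    # Hypothetical sketch of IDF-based content-word extraction (formula (1)).
    import math

    def idf_scores(external_docs: list[set[str]]) -> dict[str, float]:
        """Compute IDF = log(N / Nd) for every word seen in the corpus."""
        n = len(external_docs)
        scores = {}
        vocab = set().union(*external_docs)
        for word in vocab:
            nd = sum(1 for doc in external_docs if word in doc)
            scores[word] = math.log(n / nd)
        return scores

    def extract_content_words(words, idf, threshold=1.5, default=0.0):
        # Words absent from the corpus get `default`; both values are assumed.
        return [w for w in words if idf.get(w, default) > threshold]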
After extracting the content words, the video division unit 102 generates, for each sentence, a content word list containing the extracted content words. The video division unit 102 then computes the similarity of the content word lists for every two consecutive sentences in the text information and places a segment boundary between the two compared sentences whenever the similarity falls below a threshold. For example, starting from the first sentence of the text information, each sentence's content word list is compared with that of the immediately following sentence. The similarity of two content word lists can be computed, for example, from the number of matching content words or from the edit distance between the content words in the lists. Alternatively, the similarity may be computed using a pre-prepared distributed word-representation (word-embedding) model.
It is also possible to accumulate the content word lists of sentences between which no boundary was placed and compare the accumulated list with the content word list of the following sentence. In this case, for example, when the similarity between the content word list of the first sentence of the text information (hereinafter, the 1st content word list) and that of the second sentence (the 2nd content word list) is above the threshold, the video division unit 102 judges that the first and second sentences belong to the same segment and places no boundary between them. When it next computes the similarity against the content word list of the third sentence (the 3rd content word list), it uses a list containing the content words of the first sentence in addition to those of the second, computes the similarity between that list and the 3rd content word list, and decides from it whether to place a boundary between the second and third sentences. Using for the comparison a content word list that covers all sentences of the current segment improves the accuracy of the segment division.
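Putting these pieces together, the following sketch keeps a running content word pool for the current segment and opens a new segment when the similarity to the next sentence's content words falls below a threshold; the Jaccard measure and the threshold value are stand-ins chosen for illustration.

    # Hypothetical sketch of content-word-based segmentation (not from the embodiment).
    def jaccard(a: set[str], b: set[str]) -> float:
        return len(a & b) / len(a | b) if a | b else 0.0

    def split_into_segments(sentence_words: list[set[str]], threshold=0.2):
        """sentence_words: one content-word set per sentence, in order."""
        segments, current, pool = [], [0], set(sentence_words[0])
        for i in range(1, len(sentence_words)):
            if jaccard(pool, sentence_words[i]) >= threshold:
                current.append(i)           # same segment: grow the pool
                pool |= sentence_words[i]
            else:
                segments.append(current)    # boundary: start a new segment
                current, pool = [i], set(sentence_words[i])
        segments.append(current)
        return segments  # list of lists of sentence indices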
Next, a method of dividing a video using, in addition to the text information, features of the audio or images in the video will be described. In this method, the video division unit 102 obtains the audio information of the video or feature values of its images and divides the video into a plurality of segments based on those features together with the text information. When the audio in the video is used, segments can be divided, for example, at the timing at which a specific piece of music is played or at which silence lasts for a certain period. When image features are used, segments can be divided, for example, at the timing of a switch from a slide still image to live video, a change of speaker, or a color change of text or graphics in the video.
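As one concrete illustration of an audio cue, the sketch below flags long silent stretches in a mono waveform using windowed RMS energy; the sample rate, window length, and both thresholds are assumed values.

    # Hypothetical silence detection on a mono PCM signal (illustrative only).
    import numpy as np

    def silent_boundaries(samples: np.ndarray, sr: int = 16000,
                          win_s: float = 0.5, rms_thresh: float = 0.01,
                          min_silence_s: float = 3.0) -> list[float]:
        """Return times (seconds) at the centers of silences >= min_silence_s."""
        win = int(sr * win_s)
        n_win = len(samples) // win
        rms = np.sqrt((samples[: n_win * win].reshape(n_win, win) ** 2).mean(axis=1))
        quiet = rms < rms_thresh
        boundaries, run_start = [], None
        for i, q in enumerate(quiet):
            if q and run_start is None:
                run_start = i
            elif not q and run_start is not None:
                if (i - run_start) * win_s >= min_silence_s:
                    boundaries.append((run_start + i) / 2 * win_s)
                run_start = None
        if run_start is not None and (n_win - run_start) * win_s >= min_silence_s:
            boundaries.append((run_start + n_win) / 2 * win_s)
        return boundaries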
The segment tag candidate acquisition unit 103 acquires segment tag candidates corresponding to the field of the video. The segment tags included in the candidates are determined in advance according to the typical flow of videos in each field, for example according to the role or function each segment plays in the video as a whole, and are stored in the storage medium per field. Fields of video are, for example, "academic" and "education". Segment tag candidates for the academic field include, for example, "research background", "proposed method", "experiment", and "summary". Segment tag candidates for the education field include, for example, "review", "overview", "detailed description", "specific example", and "summary". The number of segment tags included in the candidates may be one or more, and may be, for example, six or more.
The segment tag selection unit 104 selects, for each divided segment, an appropriate segment tag from the segment tag candidates acquired for the field of the video, based on the text information of the segment.
One method of selecting a segment tag from the candidates uses the content words extracted from the text information. In this method, a table indicating the degree of association between the content words expected in each field and the segment tag candidates is stored in the storage medium in advance. The segment tag selection unit 104 first detects the content words of each segment from the text information and generates a content word list per segment; any of the methods described above may be used, and content words already detected per sentence during the processing of the video division unit 102 may be reused. Using the content word list and the table, the segment tag selection unit 104 then computes a degree of association for each segment tag in the candidates: for each tag, it looks up the degree of association with each content word in the list and takes the average or the maximum of those values as the tag's degree of association with the segment. The segment tag selection unit 104 selects the tag with the highest degree of association among the candidates as the segment's tag.
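A sketch of this table lookup is shown below with a toy association table for the "academic" field; the content words, scores, and aggregation choice are invented for illustration.

    # Hypothetical segment-tag selection from an association table.
    from statistics import mean

    # Toy table: degree of association between content words and tags
    # for the "academic" field (values invented for illustration).
    ASSOCIATION = {
        "dataset":    {"experiment": 0.9, "proposed method": 0.4},
        "accuracy":   {"experiment": 0.8, "summary": 0.3},
        "prior work": {"research background": 0.9},
    }
    TAG_CANDIDATES = ["research background", "proposed method", "experiment", "summary"]

    def select_tag(content_words: list[str], agg=mean) -> str:
        scores = {}
        for tag in TAG_CANDIDATES:
            vals = [ASSOCIATION.get(w, {}).get(tag, 0.0) for w in content_words]
            scores[tag] = agg(vals) if vals else 0.0
        return max(scores, key=scores.get)  # tag with the highest association

    print(select_tag(["dataset", "accuracy"]))  # -> "experiment"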
The processing by which the video information acquisition unit 101 acquires text information may be performed with a machine learning model that takes a video as input and outputs its text information; a deep neural network, for example, can be used. Likewise, the division of a video into a plurality of segments by the video division unit 102 may be performed with a machine learning model that takes a video and the text information acquired from it as input and outputs the division result, and the selection of segment tags by the segment tag selection unit 104 may be performed with a machine learning model that takes a video, its text information, and the division result as input and outputs an appropriate segment tag for each segment.
The segment tag assignment unit 105 assigns to each segment the segment tag selected by the processing of the segment tag selection unit 104 and stores the video, with the selected tag assigned to each segment, in the storage medium. The storage medium thus stores a plurality of videos to which the segment tag assignment unit 105 has assigned segment tags.
Next, the operation of the segment division processing device 100 will be described. Fig. 2 is a flowchart showing an example of the procedure of the video division processing, which divides an input video into a plurality of segments and assigns to each divided segment a common segment tag defined per field. The video division processing runs automatically when a new video is input to the segment division processing device 100. The procedure described below is merely an example; steps can be changed, omitted, replaced, or added as appropriate according to the embodiment.
(Video division processing)
(step S201)
When a new video is input to the segment division processing device 100, the video information acquisition unit 101 acquires the input video and the video's field information. The field information is entered manually by the user, for example. The video information acquisition unit 101 applies speech recognition processing and OCR processing to the acquired video to acquire the video's text information.
(step S202)
Next, the video division unit 102 divides the video into a plurality of segments using the text information acquired in step S201, for example by any of the methods described above.
Fig. 3 is a diagram showing an example in which a video in the "academic" field is divided using text information. In Fig. 3, the text contained in the text information is split at the periods into six sentences A1 to A6, and the video is divided into four segments C1 to C4 using the content word lists B1 to B6 generated from the sentences A1 to A6. The content words shown in bold in the lists B1 to B6 are those shared with the neighboring sentences; the more content words two sentences share, the more likely they are to fall into the same segment.
(step S203)
Next, the segment tag candidate acquisition unit 103 acquires from the storage medium the segment tag candidates corresponding to the field of the video. If the field is "academic", for example, candidates including the tags "research background", "proposed method", "experiment", and "summary" are acquired. If the field is "education", candidates including the tags "review", "overview", "detailed description", "specific example", and "summary" are acquired.
(step S204)
The segment tag selection unit 104 selects, for each segment, an appropriate segment tag from the candidates acquired in step S203.
(step S205)
The segment tag assignment unit 105 assigns the segment tag selected in step S204 to the corresponding segment.
Fig. 4 is a diagram showing an example in which a segment tag is assigned to each of the segments D1 to D4 obtained by dividing a video in the "academic" field: one of the tags "research background", "proposed method", "experiment", and "summary" acquired as candidates is selected for each of the segments D1 to D4.
Fig. 5 is a diagram showing an example in which a segment tag is selected for each of the segments F1 to F7 obtained by dividing a video in the "education" field: one of the tags "review", "overview", "detailed description", "specific example", and "summary" acquired as candidates is selected for each of the segments F1 to F7. As shown in Fig. 5, the same segment tag may be assigned to more than one segment.
When a segment tag has been assigned to every segment, the video division processing ends. Executing the above video division processing for every input video gives the managed videos segment tags that are unified per field. Although steps S201 to S205 have been described above as applied to a single video, they may also be applied to a plurality of input videos so that the video division processing for the plurality of videos is performed collectively.
Next, effects of the segment division processing device 100 according to the present embodiment will be described.
The segment division processing device 100 according to the present embodiment can acquire video or audio data, the field of the data, and text information of the data, divide the data into one or more segments, acquire the segment tag candidates corresponding to the field, select a segment tag from the candidates for each divided segment based on the text information, and assign the selected tag to the segment.
In recent years, online education, online academic conferences, and the like have increased, and with them the opportunities to view lecture videos. There is therefore strong demand for efficient viewing in which only the essential parts of many long videos are watched. However, when the assigned segment names are not unified across videos and each video carries arbitrary names, segments cannot be related across videos, and the user must check the names to decide which segment to view. With the segment division processing device 100 according to the present embodiment, each managed video can be divided into a plurality of segments, and a segment tag selected from candidates prepared per field can be assigned to each segment. The user therefore need not check segment names one by one on the management screen and can efficiently view only the desired part of each of a plurality of videos by selecting the segments bearing a specific tag. That is, assigning unified segment tags to the videos the user wants to manage makes it easy to find the videos the user wants to view.
For example, in a video in the academic field, topics common to the field are generally discussed in the "research background", while the gist of the video is generally found in the "proposed method". By selecting only the segments tagged "proposed method" in the academic-field videos, the user can efficiently view only the essential part of each video. Similarly, for videos in the education field, selecting only the segments tagged "summary", which carry the gist of the video, allows efficient viewing of only the essential parts.
In educational settings such as elementary, junior-high, and high schools and corporate training, teaching styles based on video viewing are also spreading. In such styles, the segment division processing device 100 according to the present embodiment lets the user choose a video freely, view only the necessary part, or view it at increased speed. The device thus supports a free viewing style and enables efficient video viewing. Viewing styles based on video are likewise spreading in academic conferences, lectures, and the like, and the segment division processing device 100 according to the present embodiment is applicable to those fields as well.
The segment division processing device 100 according to the present embodiment can acquire the audio information of the video or audio data, or feature values of the video's images, and divide the data into one or more segments based on those features together with the text information. When audio information is used, segments can be divided, for example, at the timing at which a specific piece of music plays or silence lasts for a certain period; using the audio of the video allows accurate division even when the images change little, as in lecture videos. When image features are used, segments can be divided, for example, at a switch from a slide still image to live video, a change of speaker, or a color change of text or graphics in the video. These configurations allow the segments to be divided with high accuracy.
The segment division processing device 100 according to the present embodiment can detect segment division phrases in the text information and divide the video or audio data into one or more segments with the detected locations as boundaries. A segment division phrase is a specific expression used when a speaker in an educational or lecture video wants to mark a break explicitly, for example "This concludes this session." or "Next, I will explain ○○.". This configuration allows the segments to be divided with high accuracy.
The segment division processing device 100 according to the present embodiment can split the text information into a plurality of sentences, detect from them the content words related to the content of the video or audio data, compare the sentences using the content words, and divide the data into one or more segments with the points where the content words change as boundaries. Content words are the words of a sentence excluding particles, auxiliary verbs, pronouns, interjections, and the like, and they represent the sentence's content; using them allows the segments to be divided with high accuracy.
The segment division processing device 100 according to the present embodiment can select a segment tag from the candidates based on the content words, which allows selection of an appropriate tag matching the content of the segment.
The segment tag selection unit 104 may also generate a segment tag from the text information for a segment for which no suitable tag exists among the candidates. For example, when selecting a tag using the content words extracted from the text information, if none of the content words detected in the segment has a degree of association greater than a predetermined value with any tag in the candidates, the segment tag selection unit 104 judges that no suitable tag exists and, instead of selecting from the candidates, generates a new tag from the text information. The tag can be generated in the same way as a segment name (described later). Alternatively, instead of generating a new tag, the user may input a suitable tag for a segment for which none was found.
(embodiment 2)
Embodiment 2 will be described. The present embodiment modifies the configuration of embodiment 1 as follows; configurations, operations, and effects identical to those of embodiment 1 are not described again. The segment division processing device 100 according to the present embodiment, after assigning a segment tag to each divided segment, sets a title and search keywords for each segment.
Fig. 6 is a diagram showing the configuration of the segment division processing device 100 according to the present embodiment. The processing circuit of the segment division processing device 100 further includes a segment name generation unit 106, a keyword detection unit 107, and a search term setting unit 108.
The segment name generation unit 106 generates, from the text information, a segment name for each segment; the segment name is used as the segment's title. One method of generating a segment name uses the content words extracted from the text information: the segment name generation unit 106 first detects the content words using the text information and sets the content word with the highest frequency of occurrence in each segment as its name. The IDF of each content word may also be computed and the content word with the highest IDF used as the name; using IDF allows a rarely occurring, characteristic word to be chosen as the segment name. Alternatively, a template such as "About ○○" may be prepared in advance and combined with several frequently occurring content words of the segment to generate a name such as "About a fast, high-accuracy graph neural network", where "fast", "high accuracy", and "graph neural network" are the frequent content words. It is also possible to detect, from the segment's text information, a specific phrase such as "Next, I will explain ○○." and use the ○○ part of the detected phrase as the segment name; in this case the match need not be exact, and a similarity to the specific phrase computed with the edit distance or the like may be used, taking a highly similar part for the segment name. The segment name may also be generated by a machine learning model trained to take the segment's text information and content word list as input and output a segment name in the form of a natural sentence.
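A minimal sketch of the frequency-based variant with an "About ○○" template might look as follows; the template wording and the choice of the top three words are assumptions.

    # Hypothetical segment-name generation from frequent content words.
    from collections import Counter

    def segment_name(content_words: list[str], top_k: int = 3) -> str:
        """Combine the segment's most frequent content words with a template."""
        top = [w for w, _ in Counter(content_words).most_common(top_k)]
        return "About " + ", ".join(top)

    print(segment_name(["graph neural network", "accuracy",
                        "graph neural network", "speed", "accuracy"]))
    # -> "About graph neural network, accuracy, speed"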
The keyword detection unit 107 detects, from the text information, keywords for each segment. Keywords are words that characterize the content of a segment. One detection method uses the content words detected from the text information: the keyword detection unit 107 detects the content words in each segment and takes several of the most frequent ones as keywords, optionally computing each content word's IDF and ordering the keywords from highest to lowest IDF. A specific phrase such as "About ○○" may also be detected in the segment's text information and the ○○ part of the detected phrase used as a keyword. Keywords may also be detected by a machine learning model trained to take text information as input and output keywords.
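Continuing in the same vein, keywords can be taken as a segment's most frequent content words reordered by the IDF of formula (1); the top-k cutoff is an assumed value.

    # Hypothetical keyword detection: frequent content words, highest IDF first.
    from collections import Counter

    def detect_keywords(content_words: list[str], idf: dict[str, float],
                        top_k: int = 5) -> list[str]:
        frequent = [w for w, _ in Counter(content_words).most_common(top_k)]
        return sorted(frequent, key=lambda w: idf.get(w, 0.0), reverse=True)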
The search term setting unit 108 selects, based on the segment tags assigned to the segments, an attention segment from among the segments and sets only the keywords detected from the attention segment's text information as the video's search keywords. The attention segment is the segment assumed to contain the most important content of each video, namely the segment bearing a specific tag defined per field; which tag marks the attention segment is determined in advance per field and stored in the storage medium. For example, in a video in the academic field, the segment tagged "proposed method" can be presumed to be the most important attention segment; in a video in the education field, the segment tagged "summary" can be presumed to be the most important attention segment.
Specifically, the search term setting unit 108 first obtains the important segment tag defined for the field and selects the video's attention segment by consulting the tags assigned to the segments. The search term setting unit 108 then sets, among the keywords detected by the keyword detection unit 107, only those detected from the attention segment as the video's search keywords. The set search keywords are stored in the storage medium in association with the video.
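The selection itself reduces to a lookup, as in the sketch below; the field-to-tag map and the Segment record are illustrative assumptions.

    # Hypothetical search-keyword setting from the attention segment.
    from dataclasses import dataclass

    @dataclass
    class Segment:
        tag: str
        keywords: list[str]

    # Which tag marks the attention segment, per field (assumed values).
    ATTENTION_TAG = {"academic": "proposed method", "education": "summary"}

    def search_keywords(field: str, segments: list[Segment]) -> list[str]:
        """Collect keywords only from segments bearing the field's attention tag."""
        tag = ATTENTION_TAG[field]
        return [kw for seg in segments if seg.tag == tag for kw in seg.keywords]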
Next, the operation of the segment division processing device 100 according to the present embodiment will be described. Fig. 7 is a flowchart showing an example of the procedure of the video division processing of the present embodiment. Steps S701 to S705 are the same as steps S201 to S205 in Fig. 2, respectively, and their description is omitted. The procedure described below is merely an example; steps can be changed, omitted, replaced, or added as appropriate according to the embodiment.
(Video division processing)
(step S706)
After segment tags unified per field have been assigned by the processing of steps S701 to S705, the segment name generation unit 106 generates and sets a segment name for each segment, and stores the set segment names in the storage medium in association with the corresponding segments.
Fig. 8 is a diagram showing an example in which a segment name is set for each of the segments D1 to D4 obtained by dividing a video in the "academic" field. Fig. 9 is a diagram showing an example in which a segment name is selected for each of the segments F1 to F7 obtained by dividing a video in the "education" field.
(step S707)
The keyword detection unit 107 detects the keywords of each segment using the text information and stores the detected keywords in the storage medium in association with the corresponding segments.
Fig. 8 also shows an example in which keywords are detected for each of the segments D1 to D4 obtained by dividing a video in the "academic" field, and Fig. 9 an example in which keywords are detected for each of the segments F1 to F7 obtained by dividing a video in the "education" field.
(step S708)
The search term setting unit 108 selects the video's attention segment and sets only the keywords detected from the attention segment as the video's search keywords, storing the set search keywords in the storage medium in association with the video. For example, only the keywords detected from the segment tagged "proposed method" in an academic-field video, or from the segment tagged "summary" in an education-field video, are set as the video's search keywords.
Next, effects of the segment division processing device 100 according to the present embodiment will be described.
The segment division processing device 100 according to the present embodiment can generate, for each segment, a segment name from the text information and detect keywords from the text information. With this configuration, additional information indicating the content of a segment is attached to each segment as its name, so the user can grasp the content of a video before viewing it by checking the segment names.
In a conventional video management apparatus, when the user searches a large collection for a video to view using a query, the matching videos are presented as search results, and the search uses keywords detected from the whole of each video. Among such keywords, however, are words used commonly across videos of the same field; in the education field, for example, "performance improvement" and "reservation". If the keywords of all segments are used for search, these common keywords are included among the search keywords, making it hard to narrow down the videos to view. Moreover, as in the example of Fig. 9, an example video may be inserted into a video: a scripted video that reenacts an instance of the content in order to present a specific example. Keywords detected from example videos, such as "Mr. Suzuki" and "complaint", are often unrelated to the content of the video. For a video containing an example video, using the keywords of all segments therefore brings keywords of low relevance to the video's content into the search keywords and lowers search accuracy.
To address this problem, the segment division processing device 100 according to the present embodiment can select an attention segment from among the segments based on the assigned segment tags and set only the keywords detected from the attention segment's text information as the search keywords of the video or audio data. With this configuration, the search keywords are limited to those of a highly important segment, so each video is searched using only distinctive keywords that represent its features. Unnecessary search keywords are thereby removed, and the user can efficiently find a video with the desired content.
(modification 1 of embodiment 2)
Modification 1 of embodiment 2 will be described. The present modification is an example in which the structure of embodiment 2 is modified as follows. The same structure, operation, and effects as those of embodiment 2 will not be described. The segment division processing device 100 according to this modification generates an introduction text and an introduction image of a video using only the keywords of a specific segment.
Fig. 10 is a diagram showing a configuration of a segment division processing device 100 according to the present modification. The processing circuit of the segmentation process device 100 includes an introduction data generation unit 109 in place of the search term setting unit 108.
The introduction data generation unit 109 selects a region of interest from a region based on a region tag for each video, and generates introduction data of the video using keywords detected from text information of the region of interest. The introduction data of the video are, for example, introduction text of the video, introduction images. The introduction text and the introduction image are displayed on a management screen of the video together with the video, for example.
As the introduction text, for example, the segment name of the segment of interest is used. Alternatively, a longer introduction text may be generated by the same method as the segment name generation process.
As the introduction image, for example, a word cloud image using the keywords of the segment of interest is used. A word cloud image is an image in which a plurality of keywords are displayed together in a single image; the more important a keyword, that is, the higher its appearance frequency in the text data, the larger the font size in which it is displayed. The introduction image may also be used, for example, as the thumbnail of the video on the video management screen.
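As one possible realization of such an introduction image, the sketch below renders a word cloud with the third-party Python wordcloud package; the package choice, function names, and frequency values are assumptions, since the embodiment does not name a rendering method.

```python
# Minimal sketch of rendering an introduction image as a word cloud from the
# keywords of the segment of interest. The `wordcloud` package and the
# frequency values are assumptions for illustration.
from wordcloud import WordCloud

def make_introduction_image(keyword_frequencies, path="introduction.png"):
    """Render keywords so that more frequent ones get a larger font size."""
    wc = WordCloud(width=640, height=360, background_color="white")
    wc.generate_from_frequencies(keyword_frequencies)
    wc.to_file(path)  # the saved image can double as the video's thumbnail

# Frequencies counted from the text data of the segment of interest (assumed).
make_introduction_image({"active listening": 12, "customer service": 8,
                         "role play": 3})
```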
Next, the operation of the processing performed by the segment division processing device 100 according to the present modification will be described. Fig. 11 is a flowchart showing an example of the procedure of the video segmentation process according to the present modification. The processing of steps S1101 to S1107 is the same as that of steps S701 to S707 of fig. 7, respectively, and its description is therefore omitted. The processing procedure described below is merely an example, and each step may be changed as appropriate to the extent possible. Steps may also be omitted, replaced, or added as appropriate according to the embodiment.
(video segmentation process)
(step S1108)
After the keywords of each segment have been detected through the processing of steps S1101 to S1107, the introduction data generation unit 109 generates introduction data of the video using the detected keywords. The introduction data of the video is, for example, an introduction text generated using the keywords and an introduction image in which the keywords are displayed. At this time, the introduction data generation unit 109 determines the segment of interest containing the important content of the video and generates the introduction data using only the keywords of that segment. The generated introduction data is stored in the storage medium in association with the video.
The segment division processing device 100 according to the present modification can select a segment of interest from among the segments based on the segment tags, and generate introduction data of the video or audio data using the keywords detected from the text information of the segment of interest. For example, by generating the introduction data using only the keywords of important segments of a video, an introduction text and an introduction image that represent the features of the video with high accuracy can be generated. By checking the introduction data, the user can efficiently judge whether or not to view the video.
(modification 2 of embodiment 2)
Modification 2 of embodiment 2 will be described. This modification is an example in which the configuration of embodiment 2 is modified as described below. Descriptions of the same configuration, operation, and effects as those of embodiment 2 are omitted. The segment division processing device 100 according to this modification calculates the degree of association between videos using only the keywords of specific segments, and displays videos with a high degree of association together.
Fig. 12 is a diagram showing a configuration of the segment division processing device 100 according to the present modification. The processing circuit of the segment division processing device 100 includes a similarity calculation unit 110 instead of the search term setting unit 108.
For each video, the similarity calculation unit 110 selects a segment of interest from among the segments based on the segment tags, calculates the degree of association between a plurality of videos using the text information of the segments of interest, and associates videos with a high degree of association with one another. The associated videos are, for example, displayed together on the video management screen shown on the terminal device. The degree of association can be calculated, for example, by comparing the keywords of the segments of interest; in that case, for example, the rate at which the keywords of the segments of interest agree between two videos is calculated as the degree of association. The similarity calculation unit 110 is an example of a degree-of-association calculation unit.
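One concrete way to realize this agreement rate is the Jaccard overlap of the two keyword sets, as in the sketch below; Jaccard is an assumed choice, since the modification only calls for some measure of keyword agreement.

```python
# Minimal sketch of the keyword-agreement calculation, realized as the
# Jaccard overlap of the keyword sets of two videos' segments of interest.
def degree_of_association(keywords_a, keywords_b):
    """Fraction of segment-of-interest keywords shared by two videos."""
    a, b = set(keywords_a), set(keywords_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

video1 = ["active listening", "customer service", "complaints"]
video2 = ["customer service", "complaints", "telephone manner"]
print(degree_of_association(video1, video2))  # 0.5
```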
Next, the operation of the processing performed by the segment division processing device 100 according to the present modification will be described. Fig. 13 is a flowchart showing an example of the procedure of the video segmentation process according to the present modification. The processing of steps S1301 to S1307 is the same as that of steps S701 to S707 of fig. 7, respectively, and its description is therefore omitted. The processing procedure described below is merely an example, and each step may be changed as appropriate to the extent possible. Steps may also be omitted, replaced, or added as appropriate according to the embodiment.
(video segmentation process)
(step S1308)
After the keywords of each segment have been detected through the processing of steps S1301 to S1307, the similarity calculation unit 110 calculates the degrees of association between the managed videos. For example, for each video stored in the storage medium, the degree of association with every other video is calculated. At this time, the similarity calculation unit 110 determines the segment of interest containing the important content of each video and calculates the degrees of association between videos using only the keywords of the segments of interest. The calculated degrees of association are stored in the storage medium in association with the corresponding videos.
The segment division processing device 100 according to the present modification can select, for each of a plurality of videos or pieces of audio data, a segment of interest from among the segments based on the segment tags, calculate the degree of association between the videos or between the pieces of audio data using the text information of the segments of interest, and associate those with a high degree of association with one another. With this configuration, a user viewing a certain video can be introduced to videos with a high degree of association with it. Alternatively, by displaying highly associated videos adjacent to each other, the user can view them together. Moreover, because the degree of association is calculated using only the keywords of the important segments of each video, the characteristic portions of the videos are compared with each other, and the degree of association between videos can be calculated with high accuracy.
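As a sketch of this adjacent display, the stored degrees of association can be used to pick, for each video, its most associated neighbors; the dict-of-dicts layout and the top-k cutoff below are assumptions for illustration.

```python
# Minimal sketch: given stored pairwise degrees of association, list for
# each video the k most associated other videos for adjacent display on
# the management screen. The dict-of-dicts layout is an assumed structure.
def related_videos(association, k=2):
    display_order = {}
    for video, scores in association.items():
        ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        display_order[video] = [other for other, _ in ranked[:k]]
    return display_order

association = {
    "lecture1.mp4": {"lecture2.mp4": 0.5, "lecture3.mp4": 0.1},
    "lecture2.mp4": {"lecture1.mp4": 0.5, "lecture3.mp4": 0.2},
    "lecture3.mp4": {"lecture1.mp4": 0.1, "lecture2.mp4": 0.2},
}
print(related_videos(association))
```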
(modification 3 of embodiment 2)
Modification 3 of embodiment 2 will be described. This modification is an example in which the configuration of embodiment 2 is modified as described below. Descriptions of the same configuration, operation, and effects as those of embodiment 2 are omitted. The segment division processing device 100 according to this modification generates a summary video by combining only specific segments.
Fig. 14 is a diagram showing the configuration of the segment division processing device 100 according to the present modification. The processing circuit of the segment division processing device 100 includes a video generation unit 111 in place of the search term setting unit 108.
For each video, the video generation unit 111 selects segments of interest from among the segments based on the segment tags, and generates a summary video by combining only the segments of interest of a plurality of videos. The summary video is, for example, a video in which only specific segments of a plurality of videos are collected. The video generation unit 111 is an example of a generation unit.
Next, the operation of the processing performed by the segment division processing device 100 according to the present modification will be described. Fig. 15 is a flowchart showing an example of the procedure of the video segmentation process according to the present modification. The processing of steps S1501 to S1507 is the same as that of steps S701 to S707 of fig. 7, respectively, and its description is therefore omitted. The processing procedure described below is merely an example, and each step may be changed as appropriate to the extent possible. Steps may also be omitted, replaced, or added as appropriate according to the embodiment.
(video segmentation process)
(step S1508)
After the keywords of each segment have been detected through the processing of steps S1501 to S1507, the video generation unit 111 selects segments of interest from among the segments based on the segment tags, and generates a summary video combining a plurality of videos in the same field from among the managed videos. At this time, the video generation unit 111 generates the summary video by combining only the segments of interest of the videos being combined. The generated summary video is displayed, for example, on the management screen.
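While the modification does not specify how segments are cut and combined, the sketch below assembles a summary video by invoking ffmpeg from Python; the segment layout (file path, tag, start and end seconds) and the file names are assumptions, and stream copying assumes the input videos share a codec.

```python
# Minimal sketch of combining only the segments of interest of several
# videos into one summary video via ffmpeg. Cut points snap to keyframes
# under stream copy, which is acceptable for a sketch.
import subprocess

def build_summary(videos, tags_of_interest, out="summary.mp4"):
    parts = []
    for path, segments in videos:
        for tag, start, end in segments:
            if tag in tags_of_interest:
                part = f"part{len(parts)}.mp4"
                # Cut one segment of interest without re-encoding.
                subprocess.run(["ffmpeg", "-y", "-i", path,
                                "-ss", str(start), "-to", str(end),
                                "-c", "copy", part], check=True)
                parts.append(part)
    with open("list.txt", "w") as f:
        f.writelines(f"file '{p}'\n" for p in parts)
    # Join the cut segments with ffmpeg's concat demuxer.
    subprocess.run(["ffmpeg", "-y", "-f", "concat", "-safe", "0",
                    "-i", "list.txt", "-c", "copy", out], check=True)

videos = [("lecture1.mp4", [("summary", 300.0, 420.0)]),
          ("lecture2.mp4", [("intro", 0.0, 30.0), ("summary", 560.0, 640.0)])]
build_summary(videos, {"summary"})
```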
The segment division processing device 100 according to the present modification can select, for each video or piece of audio data, segments of interest from among the segments based on the segment tags, and generate summary content obtained by combining only the segments of interest of a plurality of videos or pieces of audio data. The summary content is, for example, a summary video. Since the summary video combines only the important segments of interest of each video, by viewing it the user can concentrate on the characteristic, important parts of each video.
According to at least one of the embodiments described above, it is possible to provide a segment division processing device, method, and storage medium that enable a user to efficiently view a plurality of video or audio contents.
The present invention is not limited to the above-described embodiments, and at the implementation stage the constituent elements may be modified and embodied without departing from the gist of the invention. In addition, various inventions can be formed by appropriately combining the plurality of constituent elements disclosed in the above embodiments. For example, some components may be deleted from all the components shown in an embodiment. Further, constituent elements of different embodiments may be combined as appropriate.
The above embodiments can be summarized as follows.
(technical solution 1)
A segment division processing device is provided with:
an information acquisition unit that acquires video or audio data, a field of the video or audio data, and text information of the video or audio data;
a dividing unit that divides the video or audio data into 1 or more segments;
a segment tag candidate acquisition unit that acquires segment tag candidates corresponding to the field;
a segment tag selection unit that selects, for each segment, a segment tag from the segment tag candidates according to the text information; and
a segment tag assignment unit that assigns the selected segment tag to the segment.
(technical solution 2)
The segment division processing device according to technical solution 1, further comprising: a segment name generation unit that generates a segment name for each segment from the text information; and a keyword detection unit that detects keywords from the text information for each segment.
(technical solution 3)
The segment division processing device according to technical solution 2, further comprising a search term setting unit that selects a segment of interest from among the segments based on the segment tags assigned to the segments, and sets only the keywords detected from the text information of the segment of interest as the search keywords of the video or audio data.
(technical solution 4)
The segment division processing device according to technical solution 2 or 3, further comprising an introduction data generation unit that selects a segment of interest from among the segments based on the segment tags assigned to the segments, and generates an introduction text or an introduction image of the video or audio data using the keywords detected from the text information of the segment of interest.
(technical solution 5)
The segment division processing device according to any one of technical solutions 2 to 4, further comprising a degree-of-association calculation unit, wherein
the information acquisition unit acquires a plurality of videos or pieces of audio data from a storage unit that stores the plurality of videos or pieces of audio data whose segments have been assigned segment tags, and
the degree-of-association calculation unit selects, for each of the plurality of videos or pieces of audio data, a segment of interest from among the segments based on the segment tags assigned to the segments, calculates the degree of association between the plurality of videos or between the plurality of pieces of audio data using the text information of the segments of interest, and associates videos or pieces of audio data with a high degree of association with one another.
(technical solution 6)
The segment division processing device according to any one of technical solutions 2 to 5, further comprising a generation unit, wherein
the information acquisition unit acquires a plurality of videos or pieces of audio data from a storage unit that stores the plurality of videos or pieces of audio data whose segments have been assigned segment tags, and
the generation unit selects, for each of the plurality of videos or pieces of audio data, a segment of interest from among the segments based on the segment tags assigned to the segments, and generates summary content obtained by combining only the segments of interest of the plurality of videos or pieces of audio data.
(technical solution 7)
The segment division processing device according to any one of technical solutions 1 to 6, wherein the segment tag assignment unit generates, for a segment for which no suitable segment tag exists among the segment tag candidates, a segment tag from the text information of that segment.
(technical solution 8)
The segment division processing device according to any one of technical solutions 1 to 7, wherein the dividing unit obtains audio information of the video or audio data or feature amounts of images of the video, and divides the video or audio data into the 1 or more segments based on the audio information or the feature amounts and the text information.
(technical solution 9)
The segment division processing device according to any one of technical solutions 1 to 7, wherein the dividing unit detects segmentation words from the text information and divides the video or audio data into the 1 or more segments with the portions where the segmentation words are detected as boundaries.
(technical solution 10)
The segment division processing device according to any one of technical solutions 1 to 7, wherein the dividing unit divides the text information into a plurality of sentences, detects content words related to the content of the video or audio data from the plurality of sentences, compares the plurality of sentences using the content words, and divides the video or audio data into the 1 or more segments with the portions where the content words change as boundaries.
(technical solution 11)
The segment division processing device according to technical solution 10, wherein the segment tag selection unit selects a segment tag from the segment tag candidates based on the content words.
(technical solution 12)
A segment division processing method comprising:
acquiring video or audio data, a field of the video or audio data, and text information of the video or audio data;
dividing the video or audio data into 1 or more segments;
acquiring segment tag candidates corresponding to the field;
selecting, for each segment, a segment tag from the segment tag candidates according to the text information; and
assigning the selected segment tag to the segment.
(technical solution 13)
A computer-readable non-transitory storage medium storing a segment division processing program for causing a computer to realize:
a function of acquiring video or audio data, a field of the video or audio data, and text information of the video or audio data;
a function of dividing the video or audio data into 1 or more segments;
a function of acquiring segment tag candidates corresponding to the field;
a function of selecting, for each segment, a segment tag from the segment tag candidates according to the text information; and
a function of assigning the selected segment tag to the segment.

Claims (13)

1. A segment division processing device comprising:
an information acquisition unit that acquires video or audio data, a field of the video or audio data, and text information of the video or audio data;
a dividing unit that divides the video or audio data into 1 or more segments;
a segment tag candidate acquisition unit that acquires segment tag candidates corresponding to the field;
a segment tag selection unit that selects, for each segment, a segment tag from the segment tag candidates according to the text information; and
a segment tag assignment unit that assigns the selected segment tag to the segment.
2. The segment division processing device according to claim 1, further comprising:
a segment name generation unit that generates a segment name for each segment from the text information; and
a keyword detection unit that detects keywords from the text information for each segment.
3. The segment division processing device according to claim 2, further comprising a search term setting unit that selects a segment of interest from among the segments based on the segment tags assigned to the segments, and sets only the keywords detected from the text information of the segment of interest as the search keywords of the video or audio data.
4. The segment division processing device according to claim 2 or 3, further comprising an introduction data generation unit that selects a segment of interest from among the segments based on the segment tags assigned to the segments, and generates an introduction text or an introduction image of the video or audio data using the keywords detected from the text information of the segment of interest.
5. The segment division processing device according to any one of claims 2 to 4, further comprising a degree-of-association calculation unit, wherein
the information acquisition unit acquires a plurality of videos or pieces of audio data from a storage unit that stores the plurality of videos or pieces of audio data whose segments have been assigned segment tags, and
the degree-of-association calculation unit selects, for each of the plurality of videos or pieces of audio data, a segment of interest from among the segments based on the segment tags assigned to the segments, calculates the degree of association between the plurality of videos or between the plurality of pieces of audio data using the text information of the segments of interest, and associates videos or pieces of audio data with a high degree of association with one another.
6. The segment division processing device according to any one of claims 2 to 5, further comprising a generation unit, wherein
the information acquisition unit acquires a plurality of videos or pieces of audio data from a storage unit that stores the plurality of videos or pieces of audio data whose segments have been assigned segment tags, and
the generation unit selects, for each of the plurality of videos or pieces of audio data, a segment of interest from among the segments based on the segment tags assigned to the segments, and generates summary content obtained by combining only the segments of interest of the plurality of videos or pieces of audio data.
7. The segment division processing device according to any one of claims 1 to 6, wherein the segment tag assignment unit generates, for a segment for which no suitable segment tag exists among the segment tag candidates, a segment tag from the text information of that segment.
8. The segment division processing device according to any one of claims 1 to 7, wherein the dividing unit obtains audio information of the video or audio data or feature amounts of images of the video, and divides the video or audio data into the 1 or more segments based on the audio information or the feature amounts and the text information.
9. The segment division processing device according to any one of claims 1 to 7, wherein the dividing unit detects segmentation words from the text information and divides the video or audio data into the 1 or more segments with the portions where the segmentation words are detected as boundaries.
10. The segment division processing device according to any one of claims 1 to 7, wherein the dividing unit divides the text information into a plurality of sentences, detects content words related to the content of the video or audio data from the plurality of sentences, compares the plurality of sentences using the content words, and divides the video or audio data into the 1 or more segments with the portions where the content words change as boundaries.
11. The segment division processing device according to claim 10, wherein the segment tag selection unit selects a segment tag from the segment tag candidates based on the content words.
12. A segment division processing method comprising:
acquiring video or audio data, a field of the video or audio data, and text information of the video or audio data;
dividing the video or audio data into 1 or more segments;
acquiring segment tag candidates corresponding to the field;
selecting, for each segment, a segment tag from the segment tag candidates according to the text information; and
assigning the selected segment tag to the segment.
13. A computer-readable non-transitory storage medium storing a segment division processing program for causing a computer to realize:
a function of acquiring video or audio data, a field of the video or audio data, and text information of the video or audio data;
a function of dividing the video or audio data into 1 or more segments;
a function of acquiring segment tag candidates corresponding to the field;
a function of selecting, for each segment, a segment tag from the segment tag candidates according to the text information; and
a function of assigning the selected segment tag to the segment.
Applications Claiming Priority (2)

Application Number: JP2022-025818; Priority Date: 2022-02-22
Application Number: JP2022025818A (published as JP2023122236A); Priority Date: 2022-02-22; Filing Date: 2022-02-22; Title: Section division processing device, method, and program

Publications (1)

Publication Number: CN116680440A; Publication Date: 2023-09-01

Family ID: 87781484

Family Applications (1)

Application Number: CN202211059350.6A; Priority Date: 2022-02-22; Filing Date: 2022-08-31; Title: Segment division processing device, method, and storage medium; Status: Pending

Also Published As: JP2023122236A (2023-09-01)

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination