CN114996510A - Teaching video segmentation and information point extraction method, device, electronic equipment and medium - Google Patents


Publication number
CN114996510A
Authority
CN
China
Prior art keywords
information
video
text
segmentation
teaching video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110223138.8A
Other languages
Chinese (zh)
Inventor
杨立春
夏德虎
张志发
赵梦凯
巩稼民
蒋杰伟
Current Assignee
Shenzhen Penguin Network Technology Co ltd
Xian University of Posts and Telecommunications
Original Assignee
Shenzhen Penguin Network Technology Co ltd
Xian University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Shenzhen Penguin Network Technology Co ltd and Xian University of Posts and Telecommunications
Priority: CN202110223138.8A
Publication: CN114996510A
Legal status: Pending

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 — Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 — Information retrieval of video data
    • G06F16/78 — Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 — Retrieval using metadata automatically derived from the content
    • G06F16/7844 — Retrieval using original textual content or text extracted from visual content or transcript of audio data
    • G06F16/7847 — Retrieval using low-level visual features of the video content

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Library & Information Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to a teaching video segmentation and information point extraction method, apparatus, electronic device, and medium. The method comprises the following steps: acquiring a teaching video and reading the image information in the teaching video; extracting the text information in the teaching video; segmenting the teaching video according to the text information and the image information to generate segmented videos; and extracting information points from each segmented video according to its corresponding text information and image information, and determining the information points. With this method, the whole segmentation and information point extraction process is completed automatically without manual participation, which improves the working efficiency of video segmentation and information point extraction and reduces cost.

Description

Teaching video segmentation and information point extraction method, device, electronic equipment and medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a method and an apparatus for segmenting a teaching video and extracting an information point, an electronic device, and a medium.
Background
With the rapid development of intelligent devices and the Internet, more and more people share their learning experiences and daily life in video form through Internet platforms. Major education platforms likewise provide their own online teaching video courses. Compared with traditional offline courses, online teaching video courses have unique advantages, such as freedom from classroom location and class-time constraints and support for playback. However, online teaching video courses also have problems: although a student can find the required video according to its title and content introduction, a student who wants to jump to a particular piece of knowledge in the video cannot quickly and accurately locate the target information point. Especially for long videos, searching for the target position costs the student significant time.
At present, teaching videos are segmented and their information points extracted manually. Manual segmentation and extraction not only consumes a large amount of manpower and material resources and is inefficient, but also yields inconsistent segmentation and information points, because different people understand the same video differently. The traditional approach to teaching video segmentation and information point extraction is therefore time-consuming, labor-intensive, and inefficient.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a teaching video segmentation and information point extraction method, apparatus, electronic device, and medium capable of improving working efficiency.
In a first aspect of the present application, a method for segmenting a teaching video and extracting information points is provided, which includes:
acquiring a teaching video, and reading image information in the teaching video;
extracting text information in the teaching video;
segmenting the teaching video according to the text information and the image information to generate a segmented video;
and extracting information points of the segmented video according to the text information and the image information corresponding to the segmented video, and determining the information points.
In one embodiment, the text information is an audio text, and the extracting the text information in the teaching video includes:
and extracting audio content in the teaching video, and correspondingly storing text content and time line in the audio content according to an audio-to-text technology to obtain an audio text.
In one embodiment, the segmenting the teaching video according to the text information and the image information to generate a segmented video includes:
determining a preliminary segmentation point according to the text information;
determining secondary segmentation points according to the image information;
determining a final segmentation point according to the preliminary segmentation point and the secondary segmentation point;
and according to the final segmentation point, carrying out segmentation processing on the teaching video to generate a segmented video.
In one embodiment, the determining a preliminary segmentation point according to the text information includes:
and extracting time points of which the text time interval is larger than a preset interval threshold value in the text information according to the text information, and determining a preliminary segmentation point.
In one embodiment, the determining a secondary segmentation point according to the image information comprises:
extracting images according to the image information and preset time intervals, and calculating the similarity of adjacent images;
and if the similarity is smaller than a first preset similarity threshold, determining the time point between the corresponding adjacent images as a secondary segmentation point.
In one embodiment, the extracting images at preset time intervals according to the image information and calculating the similarity of adjacent images includes:
extracting images according to the image information and preset time intervals;
converting the extracted image into a four-level gray image, and vectorizing the four-level gray image to obtain an image vector;
carrying out standardization processing on the image vector to obtain a standardized vector;
and calculating the similarity of the adjacent images according to the normalized vector.
In one embodiment, the extracting information points from the segmented video and determining information points includes:
determining a first candidate information point according to text information corresponding to the segmented video;
determining a second candidate information point according to the image information in the segmented video;
and determining the information points of the segmented video according to the first candidate information points and the second candidate information points.
The second aspect of the present application provides a teaching video segmentation and information point extraction device, including:
the information extraction module is used for acquiring a teaching video and reading image information in the teaching video; extracting text information in the teaching video;
the segmented video generation module is used for carrying out segmented processing on the teaching video according to the text information and the image information to generate a segmented video;
and the information point extraction module is used for extracting the information points of the segmented video according to the text information and the image information corresponding to the segmented video.
In a third aspect of the present application, an electronic device is provided, which includes a memory and a processor, where the memory stores a computer program, and the processor implements the steps of the method in the foregoing embodiments when executing the computer program.
In a fourth aspect of the present application, a computer-readable storage medium is provided, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method in the above-described embodiments.
According to the above teaching video segmentation and information point extraction method, a teaching video is first acquired, its image information is read, and its text information is extracted; the teaching video is then segmented according to the text information and the image information to generate segmented videos; finally, information points are extracted from the segmented videos. The whole process of video segmentation and information point extraction is completed automatically without manual participation, which helps to improve working efficiency.
Drawings
FIG. 1 is a flow chart illustrating a method for teaching video segmentation and information point extraction according to an embodiment;
FIG. 2 is a schematic flowchart of a method for segmenting a teaching video and extracting information points according to another embodiment;
FIG. 3 is a flowchart illustrating segmentation of a teaching video according to text information and image information to generate a segmented video according to an embodiment;
fig. 4 is a schematic flow chart illustrating a process of extracting information points from a segmented video according to text information and image information corresponding to the segmented video and determining the information points in one embodiment;
FIG. 5 is a block diagram of an apparatus for segmentation and information point extraction of a teaching video according to another embodiment;
FIG. 6 is a diagram illustrating an internal structure of an electronic device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In an embodiment, as shown in fig. 1, a teaching video segmentation and information point extraction method is provided. In this embodiment, the method is described as applied to a terminal by way of example; it is understood that the method may also be applied to a server, or to a system comprising a terminal and a server and implemented through interaction between the terminal and the server. In this embodiment, the method includes steps S200 to S800.
Step S200: and acquiring a teaching video and reading image information in the teaching video.
A teaching video is a video containing images and audio. Its content may be teaching material for school subjects such as Chinese, mathematics, or English, or other types of instruction such as cooking or flower arranging; in short, the embodiments of the present application do not limit the specific content or subject type of the teaching video. The image information in the teaching video refers to the set of frame images of the teaching video. Specifically, after the teaching video is acquired, the image information in it is read.
Step S400: and extracting text information in the teaching video.
The text information in the teaching video comprises subtitle text and audio text. In general, the video file and the subtitle file of a teaching video are stored separately, and not all teaching videos contain subtitles. Preferably, after the teaching video is obtained, it is first determined whether a subtitle file is attached to the teaching video; if so, the subtitle file is read to obtain the subtitle text, and if not, the audio text is extracted directly.
In one embodiment, the text information is an audio text, and the specific process of extracting the audio text in the teaching video is as follows: and extracting audio content in the teaching video, and correspondingly storing text content and time line in the audio content according to an audio-to-text technology to obtain an audio text.
Specifically, the audio-to-text technology is adopted to convert the audio content, and the text content and the timeline are stored in a one-to-one correspondence manner, so that the format of the obtained audio text is shown in the following table.
[Table: recognized text content stored in one-to-one correspondence with its timeline, i.e., each sentence with its start and end time]
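As a hedged illustration (the patent does not name a specific speech recognition engine), the audio-text structure described above — text content stored in one-to-one correspondence with the timeline — might be sketched as:

```python
# Minimal sketch of the audio-text structure: each recognized sentence is kept
# together with its start and end time so later steps can compute the pauses
# between sentences. transcribe() is a stand-in for any audio-to-text engine.

def transcribe(audio_path):
    # Placeholder: a real implementation would call a speech recognition API.
    return [
        ("Welcome to today's lesson.", 0.0, 2.4),
        ("First, let us review the previous chapter.", 4.1, 7.8),
    ]

def build_audio_text(audio_path):
    # Store text content and timeline in one-to-one correspondence.
    return [{"text": t, "start": s, "end": e} for t, s, e in transcribe(audio_path)]

audio_text = build_audio_text("lecture.wav")
print(audio_text[0])  # {'text': "Welcome to today's lesson.", 'start': 0.0, 'end': 2.4}
```

The dictionary record format is an assumption; any representation that keeps each sentence paired with its timestamps serves the same purpose.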
Step S600: and carrying out segmentation processing on the teaching video according to the text information and the image information to generate a segmented video.
The segmentation processing divides the teaching video into a plurality of segmented videos along the timeline according to certain criteria. Specifically, the specific content of the teaching video can be obtained from the text information and the image information acquired in the above steps; segmentation points are then determined from this content, and the teaching video is split into a plurality of segmented videos.
Step S800: and extracting information points of the segmented video according to the text information and the image information corresponding to the segmented video, and determining the information points.
The information points refer to keywords or key sentences that can be used to represent the corresponding segmented video. Specifically, according to text information and image information corresponding to the segmented video, candidate information points in the video can be extracted, and then the candidate information points are screened according to a preset algorithm, so that the final information point can be determined.
Furthermore, after the information points are determined, the segmented video, the segmented points and the information points can be correspondingly stored, so that the segmented video is convenient to search. The segmentation point may refer to a start point or an end point of the corresponding segmented video.
According to the above teaching video segmentation and information point extraction method, a teaching video is first acquired, its image information is read, and its text information is extracted; the teaching video is then segmented according to the text information and the image information to generate segmented videos; finally, information points are extracted from the segmented videos. The whole process of video segmentation and information point extraction is completed automatically without manual participation, which helps to improve working efficiency.
In one embodiment, referring to fig. 3, step S600 includes steps S620 to S680.
Step S620: and determining a preliminary segmentation point according to the text information.
Specifically, a set of interval words may be preset; the interval words occurring in the text information are extracted, and the preliminary segmentation points are determined according to the time points of those interval words. The interval words include, but are not limited to, "next section", "now", "start", "complete", and the like.
In addition, the preliminary segmentation point may be determined according to the interval time, that is, the pause between sentences during the teacher's lecture; the text time interval obtained from the text information in the above steps is exactly this interval time. Preferably, the time points in the text information at which the text time interval is greater than a preset interval threshold are extracted and determined as preliminary segmentation points. The starting position of the first sentence in the text information is the first preliminary segmentation point, and the ending position of the last sentence is the last preliminary segmentation point.
Specifically, the start point or the end point of the text interval may be determined as a preliminary segmentation point, or the middle time point of the interval may be used. For example, let the sentence sequence be S_1, S_2, …, S_i, S_{i+1}, …, S_n, with subscript s denoting the start of a sentence and e denoting its end, and let the set of intervals between adjacent sentences be {T_1, T_2, …, T_i, …, T_{n-1}}, where T_i = S_{(i+1)s} − S_{ie}. The preliminary segmentation point may then be D_i = S_{ie}, D_i = S_{(i+1)s}, or D_i = S_{ie} + T_i/2.
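The interval-based rule above can be sketched as follows, using the mid-gap variant D_i = S_{ie} + T_i/2 (the sentence times are illustrative):

```python
# Sketch: derive preliminary segmentation points from inter-sentence pauses.
# sentences is a list of (start, end) times; a gap T_i = S_(i+1)s - S_ie greater
# than the threshold yields a point, placed here at mid-gap: D_i = S_ie + T_i/2.

def preliminary_points(sentences, gap_threshold):
    points = [sentences[0][0]]                 # start of the first sentence
    for (_, prev_end), (next_start, _) in zip(sentences, sentences[1:]):
        gap = next_start - prev_end
        if gap > gap_threshold:
            points.append(prev_end + gap / 2)  # mid-gap variant of D_i
    points.append(sentences[-1][1])            # end of the last sentence
    return points

pts = preliminary_points([(0.0, 2.4), (4.1, 7.8), (8.0, 12.5)], gap_threshold=1.0)
print(pts)  # ≈ [0.0, 3.25, 12.5]
```

The start-point or end-point variants differ only in which endpoint of the gap is appended.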
Step S640: determining secondary segmentation points according to the image information.
Specifically, the teaching video generally includes a presentation document. According to the image information, the image information can be processed first, the demonstration document information in the image information is extracted, and the time node corresponding to the demonstration document information is marked. And determining a time node for switching to the next title according to the title text in the demonstration document information, and determining the time node as a secondary segmentation point.
In addition, secondary segmentation points can be determined according to the similarity of adjacent images. Specifically, images may be extracted from the image information at preset time intervals and the similarity between adjacent images calculated; if the similarity is smaller than a first preset similarity threshold, the time point between the corresponding adjacent images is determined as a secondary segmentation point. The preset time interval may be 2 s, 3 s, or another interval.
Further, in an embodiment, the process of extracting images according to the image information and the preset time interval and calculating the similarity between adjacent images includes: extracting images according to preset time intervals according to the image information; converting the extracted image into a four-level gray image, and vectorizing the four-level gray image to obtain an image vector; carrying out standardization processing on the image vector to obtain a standardized vector; and calculating the similarity of the adjacent images according to the normalized vector.
A grayscale digital image is an image in which each pixel has a single sample value; such images are typically displayed as shades of gray ranging from the darkest black to the brightest white. A grayscale image differs from a black-and-white image: a black-and-white image has only the two colors black and white, whereas a grayscale image also has many transition levels between black and white. A four-level gray image is a gray image with two transition levels added between white and black. Converting the image into a four-level gray image reduces the influence of non-key factors such as background color and brightness on the one hand, and on the other hand preserves the content information in the image, such as the text content and text attributes. Specifically, two adjacent four-level gray images are vectorized to obtain image vectors X and Y, and their 2-norms are computed according to the formulas
L_X = sqrt(Σ_i x_i²) and L_Y = sqrt(Σ_i y_i²).
After normalization, the normalized vectors Norm_X = X/L_X and Norm_Y = Y/L_Y are obtained. Finally, the dot product S = Norm_X · Norm_Y of the normalized vectors is computed, yielding the cosine similarity of the two images, i.e., the similarity of the adjacent images.
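A small self-contained sketch of this similarity computation; the images here are plain nested lists of 0–255 gray values, whereas a real pipeline would first decode and grayscale the video frames:

```python
import math

# Quantize each image to four gray levels, flatten it to a vector, compute the
# 2-norms L_X and L_Y, and take the dot product of the normalized vectors,
# i.e. the cosine similarity of the adjacent images.

def four_level(img):
    # Map 0-255 gray values onto 4 levels (0..3).
    return [min(p // 64, 3) for row in img for p in row]

def image_similarity(img_a, img_b):
    x, y = four_level(img_a), four_level(img_b)
    lx = math.sqrt(sum(v * v for v in x))  # L_X, the 2-norm of X
    ly = math.sqrt(sum(v * v for v in y))  # L_Y, the 2-norm of Y
    if lx == 0 or ly == 0:
        return 0.0
    # Dot product of Norm_X = X/L_X and Norm_Y = Y/L_Y.
    return sum(a * b for a, b in zip(x, y)) / (lx * ly)

frame = [[0, 64], [128, 255]]
print(round(image_similarity(frame, frame), 6))  # 1.0 (identical frames)
```

The 64-per-level quantization is an assumption; the patent only requires four levels between black and white.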
Step S660: and determining a final segmentation point according to the primary segmentation point and the secondary segmentation point.
Specifically, the preliminary segmentation points and the secondary segmentation points are merged into one set along the timeline. In the merged set, the time interval between adjacent segmentation points may be too short. To avoid this, a Viterbi algorithm is used to search the merged set and screen out the optimal segmentation points, as follows:
Each preliminary segmentation point carries an interval-time attribute, and each secondary segmentation point carries a similarity-value attribute. Because the two data types differ, both are first converted to a common measurement scale and normalized. The Viterbi algorithm is then used to solve for the segmentation-point sequence with the highest score, with a constraint ensuring that the time interval between adjacent segmentation points is greater than a minimum duration threshold; this yields the final set of segmentation points. The minimum duration threshold may be 5 minutes, 10 minutes, or any other duration.
Step S680: and according to the final segmentation point, carrying out segmentation processing on the teaching video to generate a segmented video.
In the above embodiment, the preliminary segmentation points are determined according to the text information, the secondary segmentation points are determined according to the image information, the two kinds of segmentation points are fused by a preset algorithm to determine the final segmentation points, and the teaching video is segmented accordingly to generate the segmented videos. This improves the accuracy of segmentation and, in turn, the accuracy of information point extraction.
In one embodiment, referring to fig. 4, step S800 includes steps S820 to S860.
Step S820: and determining a first candidate information point according to the text information corresponding to the segmented video.
Specifically, for the subtitle text or the audio text corresponding to each segmented video, a preset algorithm may be used to obtain the probability distribution of the keyword information points, and the keyword information points with the probability distribution greater than a preset probability threshold are used as the first candidate information points.
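As an illustration only — the patent leaves the "preset algorithm" unspecified — relative word frequency can serve as a stand-in probability distribution over keywords:

```python
from collections import Counter

# Score candidate keywords of a segment's text by relative frequency and keep
# those whose probability exceeds the preset threshold as first candidates.

def first_candidates(words, prob_threshold):
    counts = Counter(words)
    total = sum(counts.values())
    return {w for w, c in counts.items() if c / total > prob_threshold}

words = ["fourier", "transform", "fourier", "series", "fourier", "transform"]
print(sorted(first_candidates(words, prob_threshold=0.25)))  # ['fourier', 'transform']
```

A production system would substitute a proper keyword-extraction algorithm; the threshold comparison against a probability value is the part the text specifies.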
Step S840: and determining a second candidate information point according to the image information corresponding to the segmented video.
Specifically, the text content and the attributes of that text content can be extracted from those images in the segmented video whose similarity value is smaller than a second preset similarity threshold; this avoids repeatedly recognizing highly similar images and improves efficiency without losing content. The text content that meets a preset attribute condition is taken as the second candidate information points.
The attribute of the text content may be the text size, and the preset attribute condition may be that the font size is larger than a set size. The attributes of the text content may further include the font, whether it is bold, the gray value, and the like; similarly, different preset attribute conditions can be set according to the specific attribute. For example, according to the format of the presentation document, the text size and font of the primary and secondary directories can be determined, a preset attribute condition set accordingly, and the second candidate information points determined by that condition. The text content and its attributes may be extracted by recognizing the text in the image with OCR while retaining the text attributes. Because presentation documents are used extensively in teaching videos, in order to extract information points more accurately this embodiment retains the attribute most indicative of information points: the text size.
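A sketch of the attribute filter described above; the (text, font_size) record format is an assumption about what an OCR engine would supply:

```python
# Keep recognized text whose font size exceeds a set threshold as second
# candidate information points (larger text on a slide tends to be a heading).

def second_candidates(ocr_items, min_font_size):
    return [text for text, size in ocr_items if size > min_font_size]

items = [("Chapter 3: Fourier Transform", 36), ("page 12", 10), ("Definition", 28)]
print(second_candidates(items, min_font_size=24))
# ['Chapter 3: Fourier Transform', 'Definition']
```

Other attribute conditions (font, bold, gray value) would add further fields to each record and further predicates to the filter.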
Step S860: and determining the information points of the segmented video according to the first candidate information points and the second candidate information points.
Specifically, the information points with high text similarity between the first candidate information points and the second candidate information points are extracted using cosine similarity, thereby determining the information points of the segmented video.
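A minimal sketch of this matching step using character-level cosine similarity; the threshold value and the lower-casing are assumptions, not specified by the text:

```python
from collections import Counter
import math

# A first candidate is kept as an information point of the segment when its
# cosine similarity to some second candidate exceeds the threshold.

def text_cosine(a, b):
    va, vb = Counter(a.lower()), Counter(b.lower())
    dot = sum(va[ch] * vb[ch] for ch in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def final_points(first, second, threshold=0.6):
    return [p for p in first if any(text_cosine(p, q) > threshold for q in second)]

print(final_points(["fourier transform", "homework"], ["Fourier Transform"]))
# ['fourier transform']
```

A word- or embedding-level similarity would work the same way; only the vectorization of the candidate strings changes.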
It should be understood that, although the steps in the flowcharts involved in the above embodiments are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise, there is no strict ordering restriction on these steps, and they may be performed in other orders. Moreover, at least some of the steps in the flowcharts may comprise multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times; their order of execution is not necessarily sequential, and they may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 5, a teaching video segmentation and information point extraction apparatus is provided, comprising: the information extraction module 100, configured to acquire a teaching video, read the image information in the teaching video, and extract the text information in the teaching video; the segmented video generation module 200, configured to segment the teaching video according to the text information and the image information to generate segmented videos; and the information point extraction module 300, configured to extract the information points of each segmented video according to its corresponding text information and image information.
In one embodiment, please refer to fig. 5, the apparatus further includes a storage module 400 for correspondingly storing the segmented video and the segmentation points and the information points thereof.
In one embodiment, the information extraction module 100 is specifically configured to: and extracting audio content in the teaching video, and correspondingly storing text content and time line in the audio content according to an audio-to-text technology to obtain an audio text.
In one embodiment, the segmented video generation module 200 includes: a preliminary segmentation point determination unit for determining a preliminary segmentation point according to the text information; a secondary segmentation point determining unit, configured to determine a secondary segmentation point according to the image information; a final segmentation point determining unit, configured to determine a final segmentation point according to the primary segmentation point and the secondary segmentation point; and the segmented video generating unit is used for carrying out segmented processing on the teaching video according to the final segmentation point to generate a segmented video.
In an embodiment, the preliminary segmentation point determination unit is specifically configured to: and extracting time points of which the text time interval is larger than a preset interval threshold value in the text information according to the text information, and determining a preliminary segmentation point.
In an embodiment, the secondary segmentation point determination unit is specifically configured to: extracting images according to the image information and preset time intervals, and calculating the similarity of adjacent images; and if the similarity is smaller than a first preset similarity threshold, determining the time point between the corresponding adjacent images as a secondary segmentation point.
In an embodiment, the secondary segmentation point determination unit is further specifically configured to: extract images from the image information at preset time intervals; convert each extracted image into a four-level gray image and vectorize it to obtain an image vector; normalize the image vector to obtain a normalized vector; and calculate the similarity of adjacent images from the normalized vectors.
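The quantization, vectorization, normalization and similarity steps can be sketched as follows. This is a pure-Python illustration under stated assumptions: the 4x4 sample frames stand in for decoded grayscale video frames, and cosine similarity of zero-mean vectors is one common choice the patent does not mandate:

```python
import math

def to_four_level_vector(gray_image):
    """Quantize a grayscale image (rows of 0-255 pixel values) to four
    gray levels (0-3) and flatten it into one image vector."""
    return [min(pixel // 64, 3) for row in gray_image for pixel in row]

def normalize(vec):
    """Zero-mean, unit-norm standardization of an image vector."""
    mean = sum(vec) / len(vec)
    centered = [v - mean for v in vec]
    norm = math.sqrt(sum(v * v for v in centered))
    return [v / norm for v in centered] if norm > 0 else centered

def frame_similarity(img_a, img_b):
    """Cosine similarity of the normalized four-level vectors of two
    frames. A time point between frames whose similarity falls below a
    preset threshold becomes a secondary segmentation point."""
    va = normalize(to_four_level_vector(img_a))
    vb = normalize(to_four_level_vector(img_b))
    return sum(x * y for x, y in zip(va, vb))

slide = [[0, 64, 128, 255]] * 4           # a 4x4 "slide" with a gradient
same_slide = [row[:] for row in slide]
same_slide[0][0] = 10                     # tiny change, same gray level
new_slide = [row[::-1] for row in slide]  # reversed gradient: new content
print(frame_similarity(slide, same_slide) > frame_similarity(slide, new_slide))  # True
```

Coarse four-level quantization deliberately discards fine detail (noise, compression artifacts, the lecturer's small movements) so that only substantial changes such as a slide switch drive the similarity down.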
In an embodiment, the information point extraction module 300 is specifically configured to: determine first candidate information points from the text information corresponding to the segmented video; determine second candidate information points from the image information in the segmented video; and determine the information points of the segmented video from the first and second candidate information points.
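The fusion rule for the two candidate sets is likewise not spelled out here. The sketch below assumes one simple hypothetical ranking, in which terms found by both modalities come first; all names and sample terms are illustrative:

```python
def merge_information_points(text_candidates, image_candidates, limit=5):
    """Hypothetical fusion rule: terms found in both the speech text and
    the on-screen text rank first, then the remaining text candidates,
    then the remaining image candidates, deduplicated, capped at `limit`."""
    both = [t for t in text_candidates if t in image_candidates]
    rest = [t for t in text_candidates + image_candidates if t not in both]
    seen, merged = set(), []
    for term in both + rest:
        if term not in seen:
            seen.add(term)
            merged.append(term)
    return merged[:limit]

print(merge_information_points(
    ["binary search", "sorted array", "loop invariant"],  # from audio text
    ["binary search", "complexity"],                      # from slide OCR
))
```

Ranking cross-modal agreement first reflects the idea that a term both spoken aloud and shown on a slide is a strong information-point candidate for that segment.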
For the specific definition of the teaching video segmentation and information point extraction apparatus, reference may be made to the definition of the teaching video segmentation and information point extraction method above, which is not repeated here. The modules of the apparatus may be implemented wholly or partially in software, in hardware, or in a combination of the two. The modules may be embedded, in hardware form, in a processor of the computer device or be independent of it, or may be stored, in software form, in a memory of the computer device, so that the processor can invoke them and execute the operations corresponding to each module.
In one embodiment, an electronic device is provided, which may be a terminal; its internal structure may be as shown in fig. 6. The electronic device comprises a processor, a memory, a communication interface, a display screen and an input device connected through a system bus. The processor of the electronic device provides computing and control capabilities. The memory of the electronic device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program; the internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The communication interface of the electronic device is used for wired or wireless communication with an external terminal; the wireless communication can be implemented through WIFI, an operator network, NFC (near field communication) or other technologies. The computer program, when executed by the processor, implements the teaching video segmentation and information point extraction method. The display screen of the electronic device may be a liquid crystal display or an electronic ink display, and the input device may be a touch layer covering the display screen, a key, a trackball or a touchpad arranged on the housing of the electronic device, or an external keyboard, touchpad or mouse.
Those skilled in the art will appreciate that the structure shown in fig. 6 is a block diagram of only part of the structure relevant to the present application and does not limit the electronic device to which the present application is applied; a particular electronic device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the above method embodiments. Any reference to memory, storage, database or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory or optical storage. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static random access memory (SRAM) and dynamic random access memory (DRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of these technical features are described, but any such combination should be considered within the scope of this specification as long as it contains no contradiction.
The above embodiments express only several implementations of the present application; their description is specific and detailed but should not therefore be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within its scope of protection. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A teaching video segmentation and information point extraction method is characterized by comprising the following steps:
acquiring a teaching video, and reading image information in the teaching video;
extracting text information in the teaching video;
segmenting the teaching video according to the text information and the image information to generate a segmented video;
and extracting information points of the segmented video according to the text information and the image information corresponding to the segmented video, and determining the information points.
2. The method of claim 1, wherein the text information is audio text, and the extracting the text information from the teaching video comprises:
extracting audio content from the teaching video, converting the audio content to text using an audio-to-text technique, and storing the text content together with its timeline to obtain an audio text.
3. The method for segmenting and extracting the information points in the teaching video according to claim 1, wherein the step of segmenting the teaching video according to the text information and the image information to generate a segmented video comprises:
determining a preliminary segmentation point according to the text information;
determining secondary segmentation points according to the image information;
determining a final segmentation point according to the primary segmentation point and the secondary segmentation point;
and according to the final segmentation point, carrying out segmentation processing on the teaching video to generate a segmented video.
4. The method as claimed in claim 3, wherein the step of determining the preliminary segmentation points according to the text information comprises:
extracting, from the text information, time points at which the interval between adjacent text segments exceeds a preset interval threshold, and determining these as preliminary segmentation points.
5. The method as claimed in claim 3, wherein said determining secondary segmentation points according to the image information comprises:
extracting images according to the image information and preset time intervals, and calculating the similarity of adjacent images;
and if the similarity is smaller than a first preset similarity threshold, determining a time point between corresponding adjacent images as a secondary segmentation point.
6. The teaching video segmentation and information point extraction method as claimed in claim 5, wherein the extracting images according to the image information and the preset time interval and calculating the similarity of adjacent images comprises:
extracting images according to the image information and preset time intervals;
converting the extracted image into a four-level gray image, and vectorizing the four-level gray image to obtain an image vector;
standardizing the image vector to obtain a standardized vector;
and calculating the similarity of the adjacent images according to the normalized vector.
7. The teaching video segmentation and information point extraction method according to claim 1, wherein the extracting information points from the segmented video and determining information points comprises:
determining a first candidate information point according to the text information corresponding to the segmented video;
determining a second candidate information point according to the image information in the segmented video;
and determining the information points of the segmented video according to the first candidate information points and the second candidate information points.
8. A teaching video segmentation and information point extraction apparatus, characterized by comprising:
the information extraction module is used for acquiring a teaching video and reading image information in the teaching video; extracting text information in the teaching video;
the segmented video generating module is used for carrying out segmented processing on the teaching video according to the text information and the image information to generate a segmented video;
and the information point extraction module is used for extracting the information points of the segmented video according to the text information and the image information corresponding to the segmented video.
9. An electronic device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor realizes the steps of the method of any of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN202110223138.8A 2021-03-01 2021-03-01 Teaching video segmentation and information point extraction method, device, electronic equipment and medium Pending CN114996510A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110223138.8A CN114996510A (en) 2021-03-01 2021-03-01 Teaching video segmentation and information point extraction method, device, electronic equipment and medium

Publications (1)

Publication Number Publication Date
CN114996510A 2022-09-02

Family

ID=83018866

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110223138.8A Pending CN114996510A (en) 2021-03-01 2021-03-01 Teaching video segmentation and information point extraction method, device, electronic equipment and medium

Country Status (1)

Country Link
CN (1) CN114996510A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116226453A (en) * 2023-05-10 2023-06-06 北京小糖科技有限责任公司 Method, device and terminal equipment for identifying dancing teaching video clips
CN116226453B (en) * 2023-05-10 2023-09-26 北京小糖科技有限责任公司 Method, device and terminal equipment for identifying dancing teaching video clips


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination