CN114357206A - Education video color subtitle generation method and system based on semantic analysis - Google Patents

Education video color subtitle generation method and system based on semantic analysis

Info

Publication number
CN114357206A
Authority
CN
China
Prior art keywords
text
frames
key
frame
key frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210037173.5A
Other languages
Chinese (zh)
Inventor
邵增珍
董树霞
孙中志
肖建新
韩帅
李壮壮
徐卫志
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Womens University
Original Assignee
Shandong Womens University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Womens University filed Critical Shandong Womens University
Priority to CN202210037173.5A priority Critical patent/CN114357206A/en
Publication of CN114357206A publication Critical patent/CN114357206A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an educational video color subtitle generation method and system based on semantic analysis, comprising the following steps: acquiring an educational video to be processed; sampling the educational video to be processed and extracting a plurality of text frames; sampling the educational video to be processed and extracting a plurality of text-free frames; extracting key frames from all the text frames to obtain a key frame set with text frames; extracting key frames from all the text-free frames to obtain a key frame set without text frames; summarizing the two key frame sets to obtain all key frames of the educational video to be processed; extracting a content text and a voice text for each of the key frames; and providing the color subtitles corresponding to the key frames according to the similarity between the content text and the voice text. Acquiring key frames reduces the storage burden and effectively improves video retrieval speed, and it also helps to capture the main content of the lecture video.

Description

Education video color subtitle generation method and system based on semantic analysis
Technical Field
The invention relates to the technical field of video summarization and key frame extraction, and in particular to a method and a system for generating educational video color subtitles based on semantic analysis.
Background
The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.
Color subtitles are realized on the basis of key frames. Video key frame extraction captures the salient features of each shot in a video; it can effectively reduce the time required for video retrieval and improve retrieval accuracy. Short video has risen rapidly in recent years, the number of videos is expected to keep growing quickly in the coming years, and some of these videos are educational videos. Compared with traditional image-and-text data, video data has richer content and a more complex structure. The massive amount of video data brings convenience to students, but how to find videos of interest quickly and accurately has become an urgent problem. Meanwhile, video data generally comprises video sequences, scenes, shots, image frames and the like, and the required storage space is huge. Therefore, summarizing video data and retaining only those frames that are information-rich and strongly representative can, to a certain extent, alleviate the inefficiency of video storage and transmission.
Generally, key frame extraction methods fall into the following categories: shot-boundary-based key frame extraction, content-based key frame extraction, and clustering-based key frame extraction. Shot-boundary-based algorithms segment the video into shots with existing shot detection techniques and extract a fixed number of key frames from each shot. Content-based algorithms obtain key frames by extracting low-level features of the frames, commonly pixel-level features; such an algorithm usually takes the first frame as a key frame, compares subsequent frames with it in turn, and adds a frame to the key frame set when the similarity falls below a set threshold, until detection is finished. Clustering-based algorithms consider intra-shot and inter-shot correlation, group frames with similar features into classes, and then select the most representative frame of each class as a key frame.
In the process of implementing the invention, the inventor finds that the following technical problems exist in the prior art:
The shot-boundary-based approach suits videos with relatively simple content, little scene change or little camera movement; for videos with varied transition patterns, large deviations and even serious errors can occur when key frames are extracted.
The content-based key frame extraction algorithm is highly adaptive and can extract different numbers of key frames at different positions according to the shot content; however, when the shot content changes greatly, many key frames are generated automatically and the result is highly redundant.
The clustering-based key frame extraction algorithm can express content changes in a video well, but it requires preprocessing and analyzing the whole video in advance to determine the number of clusters, and the choice of the clustering threshold also affects the quality of the selected key frames, causing redundant or missing frames.
Disclosure of Invention
In order to overcome the deficiencies of the prior art, the invention provides a method and a system for generating educational video color subtitles based on semantic analysis. Acquiring key frames reduces the storage burden and effectively improves video retrieval speed, which also helps to capture the main content of the lecture video.
In a first aspect, the invention provides a semantic analysis-based educational video color subtitle generating method;
the educational video color subtitle generating method based on semantic analysis comprises the following steps:
acquiring an education video to be processed; sampling an educational video to be processed, and extracting a plurality of text frames; sampling an educational video to be processed, and extracting a plurality of text-free frames;
extracting key frames of all the text frames to obtain a key frame set with the text frames; extracting key frames of all the non-text frames to obtain a key frame set of the non-text frames;
summarizing a key frame set with text frames and a key frame set without text frames to obtain all key frames of the education videos to be processed;
extracting a content text and a voice text for each frame in all the key frames; and according to the similarity between the content text and the voice text, providing the color subtitles corresponding to the key frames.
In a second aspect, the invention provides a semantic analysis-based educational video color subtitle generating system;
educational video color subtitle generating system based on semantic analysis comprises:
an acquisition module configured to: acquiring an education video to be processed; sampling an educational video to be processed, and extracting a plurality of text frames; sampling an educational video to be processed, and extracting a plurality of text-free frames;
a key frame extraction module configured to: extracting key frames of all the text frames to obtain a key frame set with the text frames; extracting key frames of all the non-text frames to obtain a key frame set of the non-text frames;
a summarization module configured to: summarizing a key frame set with text frames and a key frame set without text frames to obtain all key frames of the education videos to be processed;
a subtitle generation module configured to: extracting a content text and a voice text for each frame in all the key frames; and according to the similarity between the content text and the voice text, providing the color subtitles corresponding to the key frames.
In a third aspect, the present invention further provides an electronic device, including:
a memory for non-transitory storage of computer readable instructions; and
a processor for executing the computer readable instructions,
wherein the computer readable instructions, when executed by the processor, perform the method of the first aspect.
In a fourth aspect, the present invention also provides a storage medium storing non-transitory computer readable instructions, wherein the non-transitory computer readable instructions, when executed by a computer, perform the instructions of the method of the first aspect.
In a fifth aspect, the invention also provides a computer program product comprising a computer program for implementing the method of the first aspect when run on one or more processors.
Compared with the prior art, the invention has the beneficial effects that:
(1) Acquiring key frames reduces the storage burden and effectively improves video retrieval speed, which also helps to capture the main content of the lecture video.
(2) Color subtitles can be marked by capturing key frames. In the course of teaching, in order to explain a certain knowledge point more clearly, the teacher repeatedly emphasizes it verbally; adding color marks highlights the teacher's key teaching points more intuitively.
(3) Experimental comparison on a set of educational video summary reference datasets jointly created by multiple users shows that, compared with traditional methods, the key subtitles generated by the disclosed method not only describe the video content accurately but also contain fewer redundant items.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention; they illustrate exemplary embodiments of the invention and, together with the description, serve to explain the invention without limiting it.
FIG. 1 is a flowchart of a method according to a first embodiment;
FIG. 2 is a flowchart of subtitle extraction according to the first embodiment;
FIG. 3 is a flowchart of speech signal processing according to the first embodiment.
Detailed Description
It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit exemplary embodiments according to the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should further be understood that the terms "comprises" and "comprising", and any variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article or apparatus.
The embodiments and features of the embodiments of the present invention may be combined with each other without conflict.
All data in the embodiments are obtained and used legally, in compliance with laws and regulations and with user consent.
This work exploits the special properties of educational videos. An educational video is generally a lecture in which a teacher explains one or more knowledge points within a certain period of time, so its content is more concentrated than that of other videos; it is characterized by repetitive language, a concentrated central idea and a single speaker, so multi-channel speech does not need to be distinguished and keywords representing the main content are easier to find. The method takes the text in educational video frames as the main feature, processes the two frame sets separately with a text similarity comparison algorithm and a perceptual hash algorithm to obtain the key frames, and finally generates the video summary in chronological order. The extraction of the key subtitles is then realized on the basis of the key frame extraction and the audio text extraction.
Example one
The embodiment provides a semantic analysis-based educational video color subtitle generating method;
the educational video color subtitle generating method based on semantic analysis comprises the following steps:
S101: acquiring an education video to be processed; sampling an educational video to be processed, and extracting a plurality of text frames; sampling an educational video to be processed, and extracting a plurality of text-free frames;
S102: extracting key frames of all the text frames to obtain a key frame set with the text frames; extracting key frames of all the non-text frames to obtain a key frame set of the non-text frames;
S103: summarizing a key frame set with text frames and a key frame set without text frames to obtain all key frames of the education videos to be processed;
S104: extracting a content text and a voice text for each frame in all the key frames; and according to the similarity between the content text and the voice text, providing the color subtitles corresponding to the key frames.
Further, the step S101: sampling an educational video to be processed, and extracting a plurality of text frames; the method specifically comprises the following steps:
and sampling the educational video to be processed according to a set frequency, and extracting a plurality of text frames.
Further, the step S101: sampling an educational video to be processed, and extracting a plurality of text-free frames; the method specifically comprises the following steps:
and sampling the educational video to be processed according to a set frequency, and extracting a plurality of non-text frames.
Further, the S102: extracting key frames of all the text frames to obtain a key frame set with the text frames; the method specifically comprises the following steps:
S102a1: performing text recognition on all the frames with texts in a text recognition mode;
S102a2: based on a TF-IDF weight calculation mode, carrying out vector representation on the text identified by each frame;
S102a3: the first frame is regarded as a key frame; for a non-first frame, measuring the vector similarity between a current frame and a previous frame through cosine similarity, and if the vector similarity of two adjacent frames is smaller than a set threshold, not taking the current frame as a key frame; and if the vector similarity of two adjacent frames is greater than or equal to a set threshold, taking the current frame as a key frame.
Further, the S102: extracting key frames of all the non-text frames to obtain a key frame set of the non-text frames; the method specifically comprises the following steps:
S102b1: reducing each non-text frame to a set size;
S102b2: graying the reduced image;
S102b3: decomposing the grayed image by adopting discrete cosine transform to obtain a discrete cosine transform matrix; calculating the mean value of the discrete cosine transform matrix;
S102b4: comparing the pixel value of each element of the discrete cosine transform matrix with the mean value of the discrete cosine transform matrix, setting the element larger than the mean value as 1, and setting the element smaller than the mean value as 0; obtaining a discrete cosine transform matrix after transformation;
S102b5: generating a hash value based on the transformed discrete cosine transform matrix, and taking the hash value as the fingerprint of the current frame;
S102b6: regarding the first non-text frame as a key frame; for a non-first non-text frame, measuring the similarity between a current frame and a previous frame through a Hamming distance, and if the Hamming distance is smaller than a set threshold value, not taking the current non-text frame as a key frame; and if the Hamming distance is larger than or equal to a set threshold, taking the current text-free frame as a key frame.
Further, the step S103: summarizing a key frame set with text frames and a key frame set without text frames to obtain all key frames of the education videos to be processed; the method specifically comprises the following steps:
and sequencing and summarizing all key frames according to the time sequence to the key frame set with the text frame and the key frame set without the text frame to obtain all key frames of the education video to be processed.
Further, the S104: extracting a content text and a voice text for each frame in all the key frames; the method specifically comprises the following steps:
extracting a content text for each key frame by adopting an Optical Character Recognition (OCR) algorithm;
and extracting a voice text from the audio within a set time range before and after the time point corresponding to each key frame by adopting Automatic Speech Recognition (ASR).
Further, according to the similarity of the content text and the voice text, color subtitles corresponding to the key frames are given; the method specifically comprises the following steps:
if the similarity between the content text and the voice text is greater than a set threshold value, giving the color caption of the current key frame;
and if the similarity between the content text and the voice text is less than or equal to a set threshold value, not giving the color caption of the current key frame.
The invention provides a novel education video key frame extraction algorithm based on inter-frame text semantic analysis, KFEVSA (KeyFrame Education Video Semantic Analysis), to realize color subtitle marking. First, text recognition is performed based on OCR, and the frames are stored separately as a text frame set and a no-text frame set. For the text frame set, frame texts are compared by computing text similarity, and the frames with larger similarity differences are retained in order to generate a key frame set; for the no-text frame set, a perceptual hash algorithm is applied to measure frame similarity and find representative frames. Finally, the two groups of frames are merged into the final key frame set in chronological order. Speech text and image text are then extracted for the key frames and compared for similarity to obtain the key subtitles.
The KFEVSA algorithm is divided into three main stages: first, OCR (Optical Character Recognition) is used to divide the sampled frames into a text frame set and an empty-text frame set; second, the two frame sets are processed with text similarity and the perceptual hash algorithm respectively to generate their key frame sets; finally, the two key frame sets are merged in chronological order to generate the final video summary set.
The detailed process of the KFEVSA algorithm is shown in FIG. 1. The video is first sampled at 3 frames/second to reduce redundancy in the video data; for a given video, the sampled frames are stored in a set F = {f_1, f_2, f_3, …, f_i, …, f_n}, where n is the total number of sampled frames and f_i is the i-th sampled frame. OCR text recognition is then applied to the frames in F in turn: frames with recognized text are stored in a database in order, and frames whose recognized text is empty are marked and stored in order. Next, the frame texts are expressed as vectors using the TF-IDF (Term Frequency-Inverse Document Frequency) weighting scheme of the vector space model; the similarity between two frame texts is measured by cosine similarity, extremely similar frames are filtered out, and the frames with larger differences are retained as key frames. A perceptual hash algorithm then generates a fingerprint for each marked (empty-text) frame, the similarity between frames is measured with the Hamming distance, and key frames are taken out. Finally, the key frames extracted from the two parts are fused in chronological order to generate the final static video summary set.
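As a concrete illustration of the sampling and splitting step, the sketch below (Python) samples a video at roughly 3 frames per second with OpenCV and uses pytesseract OCR to divide the sampled frames into a text-frame set and an empty-text-frame set. The library choices, the `chi_sim` language setting and the tuple layout of the returned sets are assumptions made for illustration, not requirements of the patent.

```python
# Minimal sketch of the sampling and OCR-splitting step (assumptions noted above).
import cv2
import pytesseract


def sample_and_split(video_path, sample_rate=3):
    """Sample about `sample_rate` frames per second and split them into
    a text-frame list and an empty-text-frame list based on OCR output."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    step = max(int(round(fps / sample_rate)), 1)

    text_frames, empty_frames = [], []
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            timestamp = index / fps
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            # 'chi_sim' assumes Chinese lecture slides; adjust as needed
            text = pytesseract.image_to_string(rgb, lang="chi_sim").strip()
            if text:
                text_frames.append((timestamp, frame, text))
            else:
                empty_frames.append((timestamp, frame))
        index += 1
    cap.release()
    return text_frames, empty_frames
```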
In the first stage, the preprocessed frame texts are first mapped to vectors through a Vector Space Model (VSM). The TF-IDF (Term Frequency-Inverse Document Frequency) weighting function is commonly used for this purpose: the weight of a feature term is the product of its term frequency (TF) and its inverse text frequency (IDF). The term frequency (TF) can be calculated by the following formula:
TF_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}    (1)
where i denotes the index of a word, j denotes the index of a text, n_{i,j} denotes the number of times the i-th word appears in the j-th text, and \sum_{k} n_{k,j} denotes the total number of words in the j-th text. TF_{i,j} therefore refers to the frequency with which a given word i appears in text j.
The inverse text frequency (IDF) may be calculated by:
IDF_{i} = \log \frac{|D|}{|\{\, j : t_i \in d_j \,\}|}    (2)
where |D| denotes the total number of documents in the corpus, and the denominator |\{ j : t_i \in d_j \}| denotes the number of documents containing the term t_i. If t_i does not appear in the corpus the denominator would be zero, so 1 + |\{ j : t_i \in d_j \}| is typically used to avoid a zero denominator. The TF-IDF weight corresponding to each feature term of a text vector is then calculated by the following formula:
TF\text{-}IDF_{i,j} = TF_{i,j} \times IDF_{i}    (3)
representative words in the text can be well distinguished by using the formula, higher weight is given, and after each characteristic weight is calculated through a Vector Space Model (VSM), the frame text is effectively represented through a Vector. Similarity between adjacent frames of text is then measured by the distance between the vectors, with closer distances indicating more similar frames and less similar frames. The method for measuring the distance between texts commonly used comprises cosine similarity, Euclidean distance, Mahalton distance and the like, compared with the method for measuring the distance between texts in the same frame, the cosine similarity is selected, the similarity of adjacent frames is measured by calculating the size of cosine values between two text vectors, and the cosine values of the adjacent frames are sequentially stored in a set C (C)1,c2,c3,…,ci,…,cmAnd f, wherein m is the total number of cosine values, which is the ith cosine value, and the cosine value can be calculated by the following formula:
\cos(\theta) = \frac{\sum_{i=1}^{n} X_i Y_i}{\sqrt{\sum_{i=1}^{n} X_i^{2}} \, \sqrt{\sum_{i=1}^{n} Y_i^{2}}}    (4)
where X_i and Y_i denote the i-th components of the TF-IDF-weighted term-frequency vectors of the two adjacent frame texts X and Y, and n denotes the total number of keywords extracted from the two texts after word segmentation and stop-word filtering. Since term frequencies are non-negative, the value lies in the interval [0, 1]; the closer it is to 1, the more similar the two texts are, and vice versa.
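The sketch below transcribes formulas (1)-(4) directly into Python as a minimal illustration. It assumes the frame texts have already been segmented into token lists (the patent does not name a tokenizer); the smoothed denominator in the IDF follows the 1 + |{j : t_i ∈ d_j}| convention described above.

```python
# TF (1), smoothed IDF (2), TF-IDF weights (3) and cosine similarity (4)
# for frame texts represented as lists of tokens.
import math
from collections import Counter


def tf_idf_vector(tokens, corpus):
    """Build a sparse TF-IDF vector for one frame text.
    `corpus` is the list of all frame-text token lists (used for IDF)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    vec = {}
    for term, n_ij in counts.items():
        tf = n_ij / total                                  # formula (1)
        df = sum(1 for doc in corpus if term in doc)
        idf = math.log(len(corpus) / (1 + df))             # formula (2), smoothed
        vec[term] = tf * idf                               # formula (3)
    return vec


def cosine_similarity(vec_x, vec_y):
    """Formula (4): cosine of the angle between two sparse vectors."""
    dot = sum(w * vec_y.get(t, 0.0) for t, w in vec_x.items())
    norm_x = math.sqrt(sum(w * w for w in vec_x.values()))
    norm_y = math.sqrt(sum(w * w for w in vec_y.values()))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0
    return dot / (norm_x * norm_y)
```

In this reading, adjacent frame texts whose cosine value stays close to 1 would be treated as near-duplicates, and only frames with larger differences would be kept as key frames.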
In the second stage, the empty-text frames are processed with a perceptual hash algorithm: key frames that can represent the sequence are selected from them, and representative fingerprints are generated from the low-frequency information of the frames. The specific steps are as follows:
(1) reducing the frame to 32 × 32, and rapidly removing high frequency and details, thereby improving the efficiency of the algorithm;
(2) carrying out graying processing on the reduced picture to further simplify the calculated amount;
(3) The picture is decomposed into frequency components by the Discrete Cosine Transform (DCT), and only the 8 × 8 matrix at the top-left corner is retained, which contains the lowest-frequency information of the picture.
(4) The 64 values of the 8 × 8 matrix are averaged to obtain the DCT mean; each element of the 8 × 8 DCT matrix is compared with this mean, elements larger than the mean are set to 1 and elements smaller than the mean are set to 0, producing a 64-bit hash value consisting only of 0s and 1s.
(5) The hash value is treated as a fingerprint of the frame.
After the hash fingerprints of the two frames are obtained, the similarity between the two frames is measured by the Hamming distance. In general, if the Hamming distance between two frames is less than 10 they are considered similar; if it is greater than 10 they are considered dissimilar. The Hamming distance is calculated as follows:
D(P_1, P_2) = \sum_{i=1}^{8} \sum_{j=1}^{8} \left| p^{1}_{i,j} - p^{2}_{i,j} \right|    (5)
where p^{1}_{i,j} and p^{2}_{i,j} are the hash bit values corresponding to the fingerprints of the two frames, and i and j are the row and column indices of the retained 8 × 8 DCT matrix.
The key frames are extracted from the empty-text frames as follows: take the first frame as a key frame and compute, in turn, the Hamming distance between each following frame and the previously retained frame. When the value is less than 10 the two frames are considered similar, the previously retained frame is discarded and the current frame is retained and compared with the next frame, until a frame whose Hamming distance is greater than 10 is found; the retained frame is then taken as the key frame of the previous shot and the next frame as the initial key frame of the next shot. This is repeated until all marked frames have been read, and all extracted key frames are kept in chronological order.
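The following sketch puts steps (1)-(5) and the Hamming-distance rule together, using OpenCV's DCT. The threshold of 10 and the "replace the retained frame while frames stay similar" behaviour follow the description above; the function names and the (timestamp, image) tuple layout are illustrative assumptions.

```python
# Perceptual-hash fingerprints and key-frame selection for empty-text frames.
import cv2
import numpy as np


def phash(frame):
    """Steps (1)-(5): resize to 32x32, grayscale, DCT, keep the top-left 8x8
    block, threshold against its mean, return 64 boolean hash bits."""
    small = cv2.resize(frame, (32, 32), interpolation=cv2.INTER_AREA)
    gray = cv2.cvtColor(small, cv2.COLOR_BGR2GRAY).astype(np.float32)
    low = cv2.dct(gray)[:8, :8]
    return (low > low.mean()).flatten()


def hamming(fp1, fp2):
    """Formula (5): number of differing hash bits."""
    return int(np.count_nonzero(fp1 != fp2))


def keyframes_from_empty_frames(frames, threshold=10):
    """`frames` is a chronologically ordered list of (timestamp, image) pairs."""
    if not frames:
        return []
    keyframes = [frames[0]]                 # the first frame is a key frame
    prev_fp = phash(frames[0][1])
    for ts, img in frames[1:]:
        fp = phash(img)
        if hamming(prev_fp, fp) < threshold:
            keyframes[-1] = (ts, img)       # similar: keep the later frame
        else:
            keyframes.append((ts, img))     # dissimilar: start of the next shot
        prev_fp = fp
    return keyframes
```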
The two generated key frame sets are merged to produce the final educational video summary, stored in SAF = {saf_1, saf_2, …, saf_i, …, saf_t}, where saf_i is the i-th key frame generated by KFEVSA and t is the total number of generated key frames. When the two groups of frames are stored, they are named and marked according to the chronological order of frame capture, which facilitates the frame merging operation in this step.
As shown in FIG. 2, after extracting the key frames, the invention uses an OCR algorithm to obtain the content text in each key frame and an ASR (Automatic Speech Recognition) algorithm to obtain the speech text of the audio within a set time range before and after the time point of the key frame.
W^{*} = \arg\max_{W} P(W \mid Y)
      = \arg\max_{W} \frac{P(Y \mid W)\, P(W)}{P(Y)}
      = \arg\max_{W} P(Y \mid W)\, P(W)    (6)
In the above formula, W denotes the text sequence and Y denotes the speech input. The first line of formula (6) states that the goal of speech recognition is to find the most probable word sequence given the speech input. Applying Bayes' rule gives the second line, in which the denominator P(Y) is the probability of the speech occurring; it does not depend on the text sequence being solved for and can be ignored during the solution, yielding the final expression. In that expression, P(Y|W) is the probability of the audio given a text sequence, which is the acoustic model in speech recognition; P(W) is the probability of the word sequence occurring, i.e., the language model.
As shown in FIG. 3, the working principle of the ASR (Automatic Speech Recognition) algorithm is as follows: after signal processing, the audio is split into frames; each short waveform segment is converted into a multi-dimensional feature vector modeled on the characteristics of human hearing; the frame-level features are recognized as states, the states are combined into phonemes, and finally the phonemes are combined into words that are concatenated into sentences.
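One possible realisation of the content-text and speech-text extraction described above (FIG. 2) is sketched below, using pytesseract for the frame text and the SpeechRecognition package for the speech text. It assumes the audio track has already been exported to a WAV file (for example with ffmpeg) and uses an illustrative window of a few seconds around the key-frame timestamp; neither the libraries nor the window size are prescribed by the patent.

```python
# Hedged sketch: content text via OCR, speech text via ASR around a key frame.
import pytesseract
import speech_recognition as sr


def content_text(frame_image):
    """OCR the visible text of a key frame (assuming Chinese slides)."""
    return pytesseract.image_to_string(frame_image, lang="chi_sim").strip()


def speech_text(wav_path, timestamp, window=5.0):
    """Recognise the audio within +/- `window` seconds of the key frame."""
    recognizer = sr.Recognizer()
    with sr.AudioFile(wav_path) as source:
        offset = max(timestamp - window, 0.0)
        audio = recognizer.record(source, offset=offset, duration=2 * window)
    try:
        return recognizer.recognize_google(audio, language="zh-CN")
    except sr.UnknownValueError:
        # no intelligible speech in this window
        return ""
```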
Then the text content in each key frame and the time interval (t_i, t_j) of each shot are obtained, and the speech text is segmented with the shot interval (t_i, t_j) as reference. Taking the text f_i of a key frame as the reference text, its similarity with the speech-segment text v_i is compared using the TF-IDF algorithm; the speech texts whose weight ratio is higher than 80% are highlighted, and the high-frequency words w_j in the segment that can serve as video tags are also displayed as labels. Considering the difference between written language and spoken narration, semantic similarity is taken into account in the text similarity comparison in the hope of improving the extraction accuracy of the key subtitles, and finally the acquired key subtitles are labeled.
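A minimal sketch of the matching described above is given below; it reuses the `tf_idf_vector` and `cosine_similarity` helpers from the earlier TF-IDF sketch. Reading the 80% weight-ratio rule as a cosine-similarity threshold of 0.8, as well as the tag count and the dictionary output format, are assumptions made only for illustration.

```python
# Sketch: pick out the speech segments that match a key frame's text and the
# high-frequency words of the shot that can serve as video tags.
from collections import Counter


def key_subtitle(frame_tokens, speech_segments, corpus, threshold=0.8, n_tags=3):
    """frame_tokens: token list of the key-frame text (reference text f_i).
    speech_segments: list of token lists, one per speech sentence in the shot.
    corpus: all token lists, used for the IDF statistics."""
    ref_vec = tf_idf_vector(frame_tokens, corpus)

    highlighted = []
    for segment in speech_segments:
        seg_vec = tf_idf_vector(segment, corpus)
        if cosine_similarity(ref_vec, seg_vec) >= threshold:
            highlighted.append(" ".join(segment))   # candidate color subtitle

    # high-frequency words across the shot's speech act as tag words w_j
    word_counts = Counter(t for segment in speech_segments for t in segment)
    tags = [w for w, _ in word_counts.most_common(n_tags)]
    return {"highlight": highlighted, "tags": tags}
```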
Example two
The embodiment provides an educational video color subtitle generating system based on semantic analysis;
educational video color subtitle generating system based on semantic analysis comprises:
an acquisition module configured to: acquiring an education video to be processed; sampling an educational video to be processed, and extracting a plurality of text frames; sampling an educational video to be processed, and extracting a plurality of text-free frames;
a key frame extraction module configured to: extracting key frames of all the text frames to obtain a key frame set with the text frames; extracting key frames of all the non-text frames to obtain a key frame set of the non-text frames;
a summarization module configured to: summarizing a key frame set with text frames and a key frame set without text frames to obtain all key frames of the education videos to be processed;
a subtitle generation module configured to: extracting a content text and a voice text for each frame in all the key frames; and according to the similarity between the content text and the voice text, providing the color subtitles corresponding to the key frames.
It should be noted here that the acquisition module, the key frame extraction module, the summarization module and the subtitle generation module correspond to steps S101 to S104 of the first embodiment; the examples and application scenarios realized by these modules are the same as those of the corresponding steps, but are not limited to the contents disclosed in the first embodiment. It should be noted that the above modules, as part of a system, may be implemented in a computer system such as a set of computer-executable instructions.
In the foregoing embodiments, the descriptions of the embodiments have different emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The proposed system can be implemented in other ways. For example, the above-described system embodiments are merely illustrative, and for example, the division of the above-described modules is merely a logical division, and in actual implementation, there may be other divisions, for example, multiple modules may be combined or integrated into another system, or some features may be omitted, or not executed.
EXAMPLE III
The present embodiment also provides an electronic device, including: one or more processors, one or more memories, and one or more computer programs; wherein, a processor is connected with the memory, the one or more computer programs are stored in the memory, and when the electronic device runs, the processor executes the one or more computer programs stored in the memory, so as to make the electronic device execute the method according to the first embodiment.
It should be understood that in this embodiment the processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory may include both read-only memory and random access memory, and may provide instructions and data to the processor, and a portion of the memory may also include non-volatile random access memory. For example, the memory may also store device type information.
In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in a processor or instructions in the form of software.
The method in the first embodiment may be directly implemented by a hardware processor, or by a combination of hardware and software modules in the processor. The software modules may be located in RAM, flash memory, ROM, PROM or EPROM, registers, or other storage media well known in the art. The storage medium is located in a memory, and the processor reads the information in the memory and completes the steps of the method in combination with its hardware. To avoid repetition, details are not described here.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
Example four
The present embodiments also provide a computer-readable storage medium for storing computer instructions, which when executed by a processor, perform the method of the first embodiment.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. The educational video color subtitle generating method based on semantic analysis is characterized by comprising the following steps:
acquiring an education video to be processed; sampling an educational video to be processed, and extracting a plurality of text frames; sampling an educational video to be processed, and extracting a plurality of text-free frames;
extracting key frames of all the text frames to obtain a key frame set with the text frames; extracting key frames of all the non-text frames to obtain a key frame set of the non-text frames;
summarizing a key frame set with text frames and a key frame set without text frames to obtain all key frames of the education videos to be processed;
extracting a content text and a voice text for each frame in all the key frames; and according to the similarity between the content text and the voice text, providing the color subtitles corresponding to the key frames.
2. The method for generating educational video color captions based on semantic analysis according to claim 1, wherein the educational video to be processed is sampled to extract a plurality of text frames; the method specifically comprises the following steps:
sampling an educational video to be processed according to a set frequency, and extracting a plurality of text frames;
sampling an educational video to be processed, and extracting a plurality of text-free frames; the method specifically comprises the following steps:
and sampling the educational video to be processed according to a set frequency, and extracting a plurality of non-text frames.
3. The method for generating educational video color captions based on semantic analysis according to claim 1, wherein the key frame extraction is performed on all the frames with text to obtain a key frame set with text frames; the method specifically comprises the following steps:
performing text recognition on all the frames with texts in a text recognition mode;
based on a TF-IDF weight calculation mode, carrying out vector representation on the text identified by each frame;
the first frame is regarded as a key frame; for a non-first frame, measuring the vector similarity between a current frame and a previous frame through cosine similarity, and if the vector similarity of two adjacent frames is smaller than a set threshold, not taking the current frame as a key frame; and if the vector similarity of two adjacent frames is greater than or equal to a set threshold, taking the current frame as a key frame.
4. The method for generating educational video color captions based on semantic analysis according to claim 1, wherein key frame extraction is performed on all the non-text frames to obtain a key frame set of non-text frames; the method specifically comprises the following steps:
reducing each non-text frame to a set size;
graying the reduced image;
decomposing the grayed image by adopting discrete cosine transform to obtain a discrete cosine transform matrix; calculating the mean value of the discrete cosine transform matrix;
comparing the pixel value of each element of the discrete cosine transform matrix with the mean value of the discrete cosine transform matrix, setting the element larger than the mean value as 1, and setting the element smaller than the mean value as 0; obtaining a discrete cosine transform matrix after transformation;
generating a hash value based on the transformed discrete cosine transform matrix, and taking the hash value as the fingerprint of the current frame;
regarding the first non-text frame as a key frame; for a non-first non-text frame, measuring the similarity between a current frame and a previous frame through a Hamming distance, and if the Hamming distance is smaller than a set threshold value, not taking the current non-text frame as a key frame; and if the Hamming distance is larger than or equal to a set threshold, taking the current text-free frame as a key frame.
5. The method for generating educational video color subtitles based on semantic analysis according to claim 1, wherein the key frame set with text frames and the key frame set without text frames are summarized to obtain all key frames of the educational video to be processed; the method specifically comprises the following steps:
and sequencing and summarizing all key frames according to the time sequence to the key frame set with the text frame and the key frame set without the text frame to obtain all key frames of the education video to be processed.
6. The method for generating educational video color captions based on semantic analysis according to claim 1, wherein for each of all key frames, content text and speech text are extracted; the method specifically comprises the following steps:
extracting a content text for each key frame by adopting an optical character recognition algorithm;
and extracting a voice text from the audio within a preset time range before and after the time point corresponding to each key frame by adopting automatic voice recognition.
7. The method for generating educational video color captions based on semantic analysis according to claim 1, wherein the color captions corresponding to the key frames are given according to the similarity between the content text and the speech text; the method specifically comprises the following steps:
if the similarity between the content text and the voice text is greater than a set threshold value, giving the color caption of the current key frame;
and if the similarity between the content text and the voice text is less than or equal to a set threshold value, not giving the color caption of the current key frame.
8. The educational video color subtitle generating system based on semantic analysis is characterized by comprising the following components:
an acquisition module configured to: acquiring an education video to be processed; sampling an educational video to be processed, and extracting a plurality of text frames; sampling an educational video to be processed, and extracting a plurality of text-free frames;
a key frame extraction module configured to: extracting key frames of all the text frames to obtain a key frame set with the text frames; extracting key frames of all the non-text frames to obtain a key frame set of the non-text frames;
a summarization module configured to: summarizing a key frame set with text frames and a key frame set without text frames to obtain all key frames of the education videos to be processed;
a subtitle generation module configured to: extracting a content text and a voice text for each frame in all the key frames; and according to the similarity between the content text and the voice text, providing the color subtitles corresponding to the key frames.
9. An electronic device, comprising:
a memory for non-transitory storage of computer readable instructions; and
a processor for executing the computer readable instructions,
wherein the computer readable instructions, when executed by the processor, perform the method of any of claims 1-7.
10. A storage medium storing non-transitory computer-readable instructions, wherein the non-transitory computer-readable instructions, when executed by a computer, perform the instructions of the method of any one of claims 1-7.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210037173.5A CN114357206A (en) 2022-01-13 2022-01-13 Education video color subtitle generation method and system based on semantic analysis

Publications (1)

Publication Number Publication Date
CN114357206A true CN114357206A (en) 2022-04-15

Family

ID=81109776

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210037173.5A Pending CN114357206A (en) 2022-01-13 2022-01-13 Education video color subtitle generation method and system based on semantic analysis

Country Status (1)

Country Link
CN (1) CN114357206A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115640422A (en) * 2022-11-29 2023-01-24 苏州琅日晴传媒科技有限公司 Network media video data analysis and supervision system
CN115640422B (en) * 2022-11-29 2023-12-22 深圳有影传媒有限公司 Network media video data analysis and supervision system
CN116701707A (en) * 2023-08-08 2023-09-05 成都市青羊大数据有限责任公司 Educational big data management system
CN116701707B (en) * 2023-08-08 2023-11-10 成都市青羊大数据有限责任公司 Educational big data management system

Similar Documents

Publication Publication Date Title
CN111968649B (en) Subtitle correction method, subtitle display method, device, equipment and medium
CN111444723B (en) Information extraction method, computer device, and storage medium
CN111046133A (en) Question-answering method, question-answering equipment, storage medium and device based on atlas knowledge base
US11288324B2 (en) Chart question answering
CN111078943B (en) Video text abstract generation method and device
Wilkinson et al. Neural Ctrl-F: segmentation-free query-by-string word spotting in handwritten manuscript collections
CN110851641B (en) Cross-modal retrieval method and device and readable storage medium
CN112633431B (en) Tibetan-Chinese bilingual scene character recognition method based on CRNN and CTC
CN114357206A (en) Education video color subtitle generation method and system based on semantic analysis
CN114465737A (en) Data processing method and device, computer equipment and storage medium
CN113920085A (en) Automatic auditing method and system for product display video
CN114363695B (en) Video processing method, device, computer equipment and storage medium
CN113435438A (en) Video screen board extraction and video segmentation method for image and subtitle fusion
Chivadshetti et al. Content based video retrieval using integrated feature extraction and personalization of results
CN112613293A (en) Abstract generation method and device, electronic equipment and storage medium
Rasheed et al. A deep learning-based method for Turkish text detection from videos
CN114691907B (en) Cross-modal retrieval method, device and medium
CN111008295A (en) Page retrieval method and device, electronic equipment and storage medium
CN115396690A (en) Audio and text combination method and device, electronic equipment and storage medium
CN112818984B (en) Title generation method, device, electronic equipment and storage medium
CN111681676B (en) Method, system, device and readable storage medium for constructing audio frequency by video object identification
CN114780757A (en) Short media label extraction method and device, computer equipment and storage medium
CN111681680B (en) Method, system, device and readable storage medium for acquiring audio frequency by video recognition object
CN108882033B (en) Character recognition method, device, equipment and medium based on video voice
CN114222193B (en) Video subtitle time alignment model training method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination