CN113395578B - Method, device, equipment and storage medium for extracting video theme text - Google Patents

Method, device, equipment and storage medium for extracting video theme text

Info

Publication number: CN113395578B (granted); earlier publication CN113395578A
Application number: CN202011363335.1A
Authority: CN (China)
Prior art keywords: video frame, text, sub-sequence, video
Legal status: Active
Other languages: Chinese (zh)
Inventor: Liu Gang (刘刚)
Current assignee: Tencent Technology Shenzhen Co Ltd
Original assignee: Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd; priority to CN202011363335.1A

Classifications

    • H04N21/44008: Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • H04N21/4394: Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • H04N21/8549: Creating video summaries, e.g. movie trailer
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present application relates to computer technology, and in particular to artificial intelligence, computer vision technology, and speech technology. The application provides a method, a device, equipment, and a storage medium for extracting video theme text, which are used to improve the accuracy of extracting video theme text. The method comprises the following steps: acquiring a video frame sequence of a video to be extracted; dividing the video frame sequence into at least one video frame sub-sequence according to the degree of difference between every two adjacent video frames; performing video frame text recognition on each video frame sub-sequence, and acquiring text information of each video frame sub-sequence based on the video frame text recognition result; and fusing the text information of the video frame sub-sequences to obtain the theme text of the video to be extracted.

Description

Method, device, equipment and storage medium for extracting video theme text
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a storage medium for extracting video theme text.
Background
With the rapid growth of the internet, more and more users not only acquire content from the network but also share content on it, such as self-media content, professionally generated content (PGC), or user-generated content (UGC). As the sources of video become richer, the volume of video uploaded to playback platforms, including both long and short videos, is also growing rapidly. The amount of video to be processed by a playback platform therefore keeps increasing; for example, after the platform extracts the theme text of a video, the video can be further audited and recommended to users according to their preferences.
Currently, playback platforms extract video theme text in two ways. One is to watch the video manually and extract its key information as labels for the video or to classify the video. However, with the rapid increase in video volume, manual viewing requires high labor cost, and during manual processing it is unavoidable that key information is extracted incorrectly because reviewers understand the video differently or do not watch it carefully, which leads to wrong labels or wrong classifications. The other is for the playback platform to obtain the theme text of the video from the video title, video classification, or video keywords provided when the user uploads the video. However, this approach depends entirely on the user: if the user does not provide a video title, classification, or keywords, or provides inaccurate ones, the platform cannot obtain accurate theme text for the video. It can be seen that the accuracy of extracting video theme text is currently low.
Disclosure of Invention
The embodiment of the application provides a method, a device, equipment and a storage medium for extracting video theme texts, which are used for improving the accuracy of extracting the video theme texts.
In a first aspect, a method for extracting text of a video theme is provided, including:
acquiring a video frame sequence of a video to be extracted;
dividing the video frame sequence into at least one video frame sub-sequence according to the degree of difference between every two adjacent video frames; the difference degree between adjacent video frames in each video frame sub-sequence is within a set first threshold value, the difference degree is in direct proportion to the number of difference pixel points between the adjacent video frames, and the difference pixel points comprise pixel points meeting preset pixel difference conditions at corresponding positions between the adjacent video frames;
respectively carrying out video frame text recognition on each video frame sub-sequence, and acquiring text information of each video frame sub-sequence based on a video frame text recognition result;
and carrying out fusion processing on the text information of each video frame sub-sequence to obtain the subject text of the video to be extracted.
In a second aspect, an apparatus for extracting text of a video theme is provided, including:
an acquisition module, configured to obtain a video frame sequence of a video to be extracted;
a segmentation module, configured to divide the video frame sequence into at least one video frame sub-sequence according to the degree of difference between every two adjacent video frames, wherein the degree of difference between adjacent video frames in each video frame sub-sequence is within a set first threshold, the degree of difference is in direct proportion to the number of difference pixel points between the adjacent video frames, and the difference pixel points comprise pixel points at corresponding positions between the adjacent video frames that meet a preset pixel difference condition;
a processing module, configured to perform video frame text recognition on each video frame sub-sequence and acquire text information of each video frame sub-sequence based on the video frame text recognition result, and to fuse the text information of the video frame sub-sequences to obtain the theme text of the video to be extracted.
Optionally, the preset pixel difference condition includes:
the ratio of the absolute value of the pixel value difference value of two pixel points at the corresponding positions between adjacent video frames to the sum of the pixel values is greater than a second threshold.
Optionally, the difference degree is a ratio of the number of difference pixels between adjacent video frames to the total number of pixels of the video frames.
Optionally, the processing module is further configured to: before the text information of each video frame sub-sequence is acquired, perform audio recognition on the audio file corresponding to each video frame sub-sequence to obtain an audio recognition result; and
the processing module is specifically configured to: combine the audio recognition result and the video frame text recognition result to obtain the text information of each video frame sub-sequence.
Optionally, the processing module is specifically configured to:
Determining the similarity between the audio recognition result and the text recognition result corresponding to each video frame sub-sequence respectively;
and when the similarity between the audio recognition result and the text recognition result corresponding to the video frame sub-sequence is larger than the preset similarity, combining the audio recognition result and the text recognition result to obtain text information of the video frame sub-sequence.
Optionally, for each video frame sub-sequence, the processing module is specifically configured to:
sampling each video frame in the video frame sub-sequence to obtain at least one target video frame;
carrying out video frame text recognition on each target video frame to obtain a sub-text corresponding to each target video frame;
according to preset filtering conditions, filtering each sub text;
based on each sub-text after filtering, obtaining a video frame text recognition result;
based on the result of the video frame text recognition, text information of each video frame sub-sequence is acquired.
Optionally, the processing module is specifically configured to perform one or any combination of the following:
filtering preset keywords in each sub-text, wherein the preset keywords comprise keywords irrelevant to the video theme;
de-duplicating similar sub-texts according to the similarity among the sub-texts;
filtering low-frequency sub-texts according to the duration of the target video frames associated with each sub-text.
Optionally, for the two sub-texts, the processing module is specifically configured to:
converting the first sub-text and the second sub-text into a first character string and a second character string respectively;
performing character adding, character replacing or character deleting operation on the first character string so as to convert the first character string into the second character string;
determining the similarity between the first sub-text and the second sub-text according to the minimum operation times required for converting the first character string into the second character string, wherein the operation times are inversely proportional to the similarity;
and if the similarity between the first sub-text and the second sub-text is greater than a similarity threshold, filtering the first sub-text or the second sub-text.
Optionally, the processing module is specifically configured to:
inputting the text information of each video frame sub-sequence into a trained text probability model to obtain abstract information of each video frame sub-sequence, wherein the abstract information expresses the semantics of the text information with preset text; the trained text probability model is obtained by training on the text information of each video frame sub-sequence;
Inputting abstract information of each video frame sub-sequence into a trained topic probability model to obtain topic text of the video to be extracted; the trained topic probability model is trained based on summary information of each video frame sub-sequence.
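Purely as an illustration of this two-stage flow (the summarization and topic-extraction steps below are hypothetical placeholders passed in as callables, not the trained text probability model and topic probability model themselves), a minimal Python sketch:

```python
from typing import Callable, List

def fuse_to_theme_text(sub_sequence_texts: List[str],
                       summarize: Callable[[str], str],
                       extract_topic: Callable[[List[str]], str]) -> str:
    """Stage 1: summarize each sub-sequence's text information;
    stage 2: derive the theme text from the summaries."""
    summaries = [summarize(text) for text in sub_sequence_texts]
    return extract_topic(summaries)
```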
In a third aspect, a computer device comprises:
a memory for storing program instructions;
and a processor for calling program instructions stored in the memory and executing the method according to the first aspect according to the obtained program instructions.
In a fourth aspect, a storage medium stores computer-executable instructions for causing a computer to perform the method according to the first aspect.
In the embodiment of the application, the video frame sequence is divided into at least one video frame sub-sequence, for example, the video can be divided into a plurality of fragments according to the shooting scene of the video, so that the text information of each video frame sub-sequence can be acquired in a targeted manner, for example, text recognition is performed on the video frames in the same shooting scene, the semantic information of the text in the shooting scene can be acquired more accurately, and the accuracy of the text information is improved. And the difference degree between the two video frames is determined only according to the number of the difference pixel points between the two video frames, so that a difference image between the two video frames does not need to be analyzed, the process of determining the difference degree between the two video frames is simplified, and the efficiency of dividing the video frame sequence is improved. And finally, fusing text information of all the video frame sub-sequences to obtain the theme text of the video. The semantic information contained in the video is fully utilized to extract the topic text of the video, so that the sources of the extracted topic text are enriched, and the problem of inaccurate topic text of the extracted video caused by too limited video titles or video classifications provided by users who release the video is avoided.
Drawings
Fig. 1 is a first schematic diagram of a method for extracting video theme text according to an embodiment of the present application;
fig. 2a is a first application scenario of the method for extracting video theme text provided in an embodiment of the present application;
fig. 2b is a second application scenario of the method for extracting video theme text provided in an embodiment of the present application;
fig. 3 is a schematic flow chart of the method for extracting video theme text according to an embodiment of the present application;
fig. 4 is a second schematic diagram of the method for extracting video theme text according to an embodiment of the present application;
fig. 5 is a third schematic diagram of the method for extracting video theme text according to an embodiment of the present application;
fig. 6 is a fourth schematic diagram of the method for extracting video theme text according to an embodiment of the present application;
fig. 7 is a first schematic structural diagram of an apparatus for extracting video theme text according to an embodiment of the present application;
fig. 8 is a second schematic structural diagram of an apparatus for extracting video theme text according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
Some of the terms in the embodiments of the present application are explained below to facilitate understanding by those skilled in the art.
(1) Long Short-Term Memory network (LSTM):
LSTM is a recurrent neural network (RNN) specifically designed to address the long-term dependency problem present in standard RNNs; it is applied to process and predict important events with very long intervals and delays in a time series.
(2)RNN:
A recurrent neural network is a type of recursive neural network that takes sequence data as input, performs recursion in the evolution direction of the sequence, and connects all nodes (recurrent units) in a chain.
Embodiments of the present application relate to artificial intelligence (AI) and machine learning techniques, and are designed based on the computer vision (CV), speech technology, natural language processing (NLP), and machine learning (ML) techniques within artificial intelligence.
Artificial intelligence is the theory, method, technique and application system that uses a digital computer or a digital computer-controlled machine to simulate, extend and expand human intelligence, sense the environment, acquire knowledge and use the knowledge to obtain optimal results. In other words, artificial intelligence is an integrated technology of computer science that attempts to understand the essence of intelligence and to produce a new intelligent machine that can react in a similar way to human intelligence. Artificial intelligence, i.e. research on design principles and implementation methods of various intelligent machines, enables the machines to have functions of sensing, reasoning and decision. Artificial intelligence techniques mainly include computer vision techniques, natural language processing techniques, machine learning/deep learning, and other major directions.
Computer vision (CV) is a science that studies how to make machines "see"; more specifically, it replaces human eyes with cameras and computers to perform machine vision tasks such as recognition, tracking, and measurement on a target, and further performs graphic processing so that the computer produces an image more suitable for human observation or for transmission to an instrument for detection. As a scientific discipline, computer vision studies related theory and technology in an attempt to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D techniques, virtual reality, augmented reality, and simultaneous localization and mapping, as well as common biometric recognition techniques such as face recognition and fingerprint recognition.
The key technologies of speech technology are automatic speech recognition (ASR), text-to-speech synthesis (TTS), and voiceprint recognition. Enabling computers to listen, see, speak, and feel is the future direction of human-computer interaction, and speech is expected to become one of the best modes of human-computer interaction.
Machine learning is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It specifically studies how a computer can simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied throughout all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and the like.
Natural language processing technology is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods for realizing effective communication between humans and computers using natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics. Research in this field therefore involves natural language, the language people use daily, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, robot question answering, knowledge graph techniques, and the like.
Text processing is a main procedure in natural language processing technology and can be widely applied in various application scenarios. Identifying parallel sentences in text is an important part of text processing. For example, during essay correction, if the parallel sentences in an essay can be identified, the essay can be evaluated more accurately along that dimension.
With research and progress of artificial intelligence technology, artificial intelligence is developed in various fields such as common smart home, intelligent recommendation system, virtual assistant, smart speaker, smart marketing, unmanned, automatic driving, robot, smart medical, etc., and it is believed that with the development of technology, artificial intelligence will be applied in more fields and become more and more important value.
The application field of the technical scheme provided by the embodiment of the application is briefly described below.
With the rapid development of internet technology, more and more users acquire content from the internet or share content on it; for example, users can watch videos or publish videos on a video playing platform. After a publishing user uploads a video, the video playing platform first needs to transcode the video to standardize the video file, store the video's meta-information, and improve compatibility. After transcoding, the platform needs to audit the video content to ensure its validity, and also needs to classify the video content so that relevant videos can be recommended to viewers according to their preferences, or so that the videos can be managed conveniently. After the platform completes the audit, the video can be released on the platform.
Auditing a video can be converted into auditing its theme text, so the theme text of the video needs to be extracted accurately. There are generally two methods for extracting video theme text. One is to extract the theme text by watching the video manually. This manual auditing requires the reviewer to watch the video completely and understand its content correctly. However, as the upload volume grows, a video platform may receive millions of uploads per day, so manual auditing is inefficient and its labor cost is high. In addition, the videos uploaded to the platform are rich and varied in content; neither the reviewers' ability to understand the content nor whether they watch the videos carefully can be guaranteed, and manual auditing places high demands on the reviewers' professional quality.
The other method is to extract the video theme text by machine. After receiving a video uploaded by a publishing user, the video playing platform obtains the video title set by that user, the video classification the user selected, and so on. The platform performs semantic recognition on the video title and extracts the video theme text based on the title and classification provided by the publisher. However, the titles set by publishing users are often designed to attract viewers to click on the video and may not express the video content accurately; for example, when a legal-looking title is set for an illegal video, the theme text cannot be extracted accurately. The same problem exists with the publisher-selected video classification. The accuracy of this method of extracting the theme text depends entirely on the user who published the video. It can be seen that the accuracy of the related-art methods for extracting video theme text is low.
In order to solve the problem of low accuracy of extracting video theme text in the related art, the present application provides a method for extracting video theme text. In the method, the video frame sequence is divided according to the degree of difference between adjacent video frames, yielding at least one video frame sub-sequence, so that the video can be divided into several segments according to different shooting scenes, different shooting targets, or different environmental factors in the video, and the text information of each video segment can be extracted in a targeted manner. Shooting scenes include, for example, a bedroom, a living room, indoors, outdoors, a court, or a mall. Shooting targets include, for example, user A, user B, an animal, a plant, or a still object. Environmental factors include, for example, ambient brightness or weather. Finally, the text information of all the video frame sub-sequences is fused to obtain the theme text of the video. The semantic information contained in the video is fully utilized to extract the theme text, which avoids the problem of inaccurate theme text caused by relying only on the limited video titles or video classifications provided by the users who publish the videos.
Fig. 1 is a schematic diagram of the method for extracting video theme text. A video frame sequence of the video to be extracted is acquired, and the video frame sequence is divided into at least one video frame sub-sequence according to the degree of difference between every two adjacent video frames in the video frame sequence. Video frame text recognition is performed on each video frame sub-sequence to obtain its text information. The text information of the video frame sub-sequences is fused to obtain the theme text of the video to be extracted.
In the embodiment of the application, the video frame sequence is divided into at least one video frame sub-sequence, for example, the video can be divided into a plurality of fragments according to the shooting scene of the video, so that the text information of each video frame sub-sequence can be acquired in a targeted manner, for example, text recognition is performed on the video frames in the same shooting scene, the semantic information of the text in the shooting scene can be acquired more accurately, and the accuracy of the text information is improved. And finally, fusing text information of all the video frame sub-sequences to obtain the theme text of the video. The semantic information contained in the video is fully utilized to extract the topic text of the video, so that the problem of inaccurate topic text of the extracted video caused by too limited video titles or video classifications provided by users who release the video is avoided.
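For orientation only, the following is a minimal Python skeleton of the flow shown in Fig. 1; the step functions are passed in as placeholders and their names are illustrative, not part of the patent:

```python
from typing import Callable, List, Sequence

def extract_theme_text(frames: Sequence,
                       split_fn: Callable[[Sequence], List[Sequence]],
                       ocr_fn: Callable[[Sequence], str],
                       fuse_fn: Callable[[List[str]], str]) -> str:
    """Split the frame sequence, recognize text per sub-sequence, then fuse."""
    sub_sequences = split_fn(frames)                # by degree of difference between adjacent frames
    texts = [ocr_fn(seq) for seq in sub_sequences]  # text information per sub-sequence
    return fuse_fn(texts)                           # theme text of the video to be extracted
```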
The application scenario of the method for extracting the video theme text provided by the application is described below.
Please refer to fig. 2a, which is an application scenario of the method for extracting the text of the video theme. The application scene includes a video providing device 101, a subject text extracting device 102, and a video processing device 103. Communication may be provided between the video providing device 101 and the subject text extraction device 102, and communication may be provided between the subject text extraction device 102 and the video processing device 103. The communication mode can be wired communication, for example, communication is carried out through a connecting network cable or a serial port line; the communication method may be wireless communication, such as bluetooth, and is not particularly limited.
The video providing apparatus 101 generally refers to an apparatus that can transmit video to the subject text extracting apparatus 102, for example, a terminal apparatus, a server, a client, or the like. The terminal device may be a mobile phone, a desktop computer, a tablet computer, or the like. The server may be a local server of the subject text extraction device 102, or a third party server associated with the subject text extraction device 102, or a cloud server, or the like. The client may be a third party application installed in the subject text extraction device 102 or a web page or the like that the subject text extraction device 102 may access.
The subject text extraction device 102 generally refers to a device, such as a terminal device, a server, or a client, that can extract subject text of a video. The video processing device 103 generally refers to a device that can process video, such as categorizing video, recommending video to a user, and the like. The video processing device 103 may be a terminal device, a server, a client, or the like.
As an embodiment, the video providing apparatus 101 and the subject text extracting apparatus 102 may be the same apparatus, or the subject text extracting apparatus 102 and the video processing apparatus 103 may be the same apparatus, or the video providing apparatus 101, the subject text extracting apparatus 102 and the video processing apparatus 103 may be the same apparatus, without being limited in particular. In the embodiment of the present application, description will be given taking an example in which the video providing apparatus 101, the subject text extracting apparatus 102, and the video processing apparatus 103 are different apparatuses.
The interaction between the devices is illustrated below based on fig. 2 a:
the video providing apparatus 101 may transmit the video to be extracted to the subject text extraction apparatus 102, and the subject text extraction apparatus 102 receives the video to be extracted transmitted by the video providing apparatus 101.
The subject text extraction device 102 obtains a sequence of video frames of the video to be extracted and divides the sequence of video frames into at least one sub-sequence of video frames according to the degree of difference between every two adjacent video frames. The topic text extraction device 102 performs video frame text recognition for each video frame sub-sequence, respectively, to obtain text information of each video frame sub-sequence. The topic text extraction device 102 performs fusion processing on text information of each video frame sub-sequence to obtain topic text of the video to be extracted.
The topic text extraction device 102 sends the video to be extracted and the topic text of the video to be extracted to the video processing device 103, and the video processing device 103 receives the video to be extracted and the topic text of the video to be extracted sent by the topic text extraction device 102. The video processing device 103 classifies the video to be extracted according to the subject text of the video to be extracted, and recommends the video to be extracted to the relevant user according to the interest portraits of the respective users.
Please refer to fig. 2b, which is an application scenario of the method for extracting the text of the video theme. The application scene comprises a video generating device 201, a first storage device 202, a second storage device 203, a scheduling device 204, a theme text extraction device 205 and a video processing device 206, wherein the devices can communicate with each other.
The user who issues the video may upload the video to be extracted through a front-end interface or a back-end interface of the video generating apparatus 201. The information uploaded with the video to be extracted can also comprise a video title, a publisher, a video abstract, a cover map, a release time and the like. The scheduling device 204 stores the video to be extracted or the information uploaded together with the video to be extracted in the first storage device 202. The first storage means 202 is, for example, a video content storage server, i.e. a relational database. The scheduling means 204 stores meta information of the video to be extracted, such as a video file size, a cover map link, a code rate, a video file format, a video title, a publisher, a video summary, a publication time, and the like, in the second storage means 203. The second storage means 203 is, for example, a content database, i.e. a non-relational database. When a user watching a video acquires the video, the scheduling device 204 may determine index information of the video displayed to the user in the first storage device 202 according to the subject text of each video, download a streaming media file of the video from the first storage device 202, and play the video through a local player of the user watching the video. The subject text of each video is obtained by the subject text extraction means 205.
As an embodiment, the second storage means 203 further stores video classifications or video tags of the video to be extracted. For example, regarding videos of brand a handsets, the primary classification is science and technology, the secondary classification is handsets, the tertiary classification is domestic handsets, and the tags are brand a handsets and models.
As an embodiment, the scheduling device 204 may obtain the content stored in the first storage device 202 or the second storage device 203, and repeat the video according to the subject text extracted by the subject text extracting device 205, delete the video that is repeatedly uploaded or the video that has a plagiarism, and so on.
As one example, the video processing device 206 may recommend relevant videos to a user viewing the videos according to their interest portraits based on the subject text of the videos; or displaying the related video to the user watching the video according to the keyword searched by the user.
In the embodiment of the present application, a method for extracting a video theme text is specifically described with respect to the theme text extraction device 102 or the theme text extraction apparatus 205 in the above two application scenarios.
Referring to fig. 3, a flow chart of a method for extracting video theme text is shown, and the flow chart of the method for extracting video theme text is specifically described below.
S301, acquiring a video frame sequence of a video to be extracted.
The video frame sequence comprises video frames arranged according to time sequence, and each video frame in the video frame sequence is switched according to time sequence to form a video.
S302, dividing the video frame sequence into at least one video frame sub-sequence according to the difference degree between every two adjacent video frames.
After a sequence of video frames of the video to be extracted is obtained, a degree of difference between every two adjacent video frames in the sequence of video frames is determined. The degree of difference between two adjacent video frames is proportional to the number of difference pixels between two adjacent video frames. The difference pixel points between two adjacent video frames are pixel points meeting the preset pixel difference condition at the corresponding positions between the two adjacent video frames. The difference pixel points can be pixel points meeting preset pixel difference conditions at corresponding positions between two adjacent complete video frames; alternatively, the pixel points meeting the preset pixel difference condition at the corresponding positions between the designated areas in the two adjacent video frames can be used. For example, whether the pixel point at each corresponding position in two adjacent video frames meets a preset pixel difference condition is sequentially determined, or when the designated area is a face area, whether the pixel point at each corresponding position in the face area of two adjacent video frames meets a preset pixel difference condition is sequentially determined.
The process of dividing a sequence of video frames into at least one sub-sequence of video frames is described in more detail below.
S1.1, determining a difference pixel point between two adjacent video frames.
The preset pixel difference conditions include a plurality of kinds, and two kinds of them are described below as examples.
Presetting a pixel difference condition one:
the absolute value of the difference between the pixel values of two pixel points at the corresponding positions between the adjacent video frames is larger than a third threshold value.
A large difference between the pixel values of the two pixel points at a corresponding position indicates that the pixel values have changed greatly between the adjacent video frames; when the absolute value of the difference is greater than the third threshold, the two pixel points at that corresponding position are determined to be difference pixel points.
According to the following formula (1), a difference pixel point between adjacent video frames is determined.
$|T_m(i,j) - T_n(i,j)| > \varepsilon_1$    (1)
where i denotes the row index and j the column index of a pixel point in the video frame; m and n denote two adjacent video frames; $T_m(i,j)$ denotes the pixel value at position (i,j) of the m-th frame and $T_n(i,j)$ the pixel value at position (i,j) of the n-th frame; i is an integer greater than 0 and less than the total number of rows of pixel points in the video frame, and j is an integer greater than 0 and less than the total number of columns of pixel points in the video frame; $\varepsilon_1$ denotes the third threshold.
Presetting a pixel difference condition II:
The ratio of the absolute value of the pixel value difference of two pixel points at corresponding positions between adjacent video frames to the sum of their pixel values is greater than a second threshold.
A large difference between the pixel values of the two pixel points at a corresponding position between adjacent video frames may simply be because the pixel values at that position are large. Therefore the pixel value difference of the two pixel points is normalized: the difference pixel points between adjacent video frames are determined from the pixel value difference divided by the sum of the pixel values of the two pixel points. If the ratio of the absolute value of the pixel value difference to the sum of the pixel values is large, the pixel values of the two pixel points at that corresponding position have changed greatly between the adjacent video frames.
According to the following formula (2), a difference pixel point between adjacent video frames is determined.
$\dfrac{|T_m(i,j) - T_n(i,j)|}{T_m(i,j) + T_n(i,j)} > \varepsilon_2$    (2)
where i denotes the row index and j the column index of a pixel point in the video frame; m and n denote two adjacent video frames; $T_m(i,j)$ and $T_n(i,j)$ denote the pixel values at position (i,j) of the m-th and n-th frames, respectively; i is an integer greater than 0 and less than the total number of rows of pixel points in the video frame, and j is an integer greater than 0 and less than the total number of columns of pixel points in the video frame; $\varepsilon_2$ denotes the second threshold.
As an embodiment, the sum of the pixel values of the two pixel points can also be replaced by their average, i.e. the denominator becomes $\bigl(T_m(i,j) + T_n(i,j)\bigr)/2$.
As an embodiment, the pixel values of the pixels in the video frame are different according to the color space of the sequence of video frames. The color space is, for example, an RGB color space, a YUV color space, or the like.
For the RGB color space, the pixel value of a pixel point covers three color channels (red, green, and blue), so the pixel value of a pixel point can be taken as the sum of the values of the three channels, in which case the pixel value difference of two pixel points is the difference of these channel sums; alternatively, the pixel value of a pixel point may include the value of each channel separately, in which case the pixel value difference of two pixel points can be the sum of the differences of the corresponding channel values, and so on. For the YUV color space, the pixel value of a pixel point includes one luminance and two chrominance components, so the pixel value can be the sum of luminance and chrominance, or may include luminance and chrominance separately, and so on.
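As an illustration only, the following Python/NumPy sketch computes the difference pixel points of formulas (1) and (2) for two single-channel frames (e.g. grayscale, or per-pixel sums of RGB channels); the threshold values are assumptions, not values prescribed by the embodiment:

```python
import numpy as np

def difference_pixel_mask(frame_m: np.ndarray, frame_n: np.ndarray,
                          eps2: float = 0.2) -> np.ndarray:
    """Boolean mask of difference pixels under condition two (formula (2)).

    frame_m, frame_n: same-sized 2-D arrays of pixel values. eps2 is the second threshold.
    """
    m = frame_m.astype(np.float64)
    n = frame_n.astype(np.float64)
    diff = np.abs(m - n)
    denom = m + n
    # Avoid division by zero where both pixels are 0 (no difference there).
    ratio = np.divide(diff, denom, out=np.zeros_like(diff), where=denom > 0)
    return ratio > eps2

def difference_pixel_mask_abs(frame_m: np.ndarray, frame_n: np.ndarray,
                              eps1: float = 30.0) -> np.ndarray:
    """Boolean mask of difference pixels under condition one (formula (1))."""
    return np.abs(frame_m.astype(np.float64) - frame_n.astype(np.float64)) > eps1
```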
S1.2, determining the difference degree between two adjacent video frames.
After the difference pixel point between two adjacent video frames is determined according to the preset pixel difference condition, the difference degree between the two adjacent video frames is determined according to the difference pixel point between the two adjacent video frames. The degree of difference between two adjacent video frames is proportional to the number of difference pixels between two adjacent video frames.
Determining a degree of difference between two adjacent video frames according to the following formula (3);
$d_{m,n} = \sum_{i,j} h_{m,n}(i,j)$    (3)
where $h_{m,n}(i,j)$ represents the difference pixel points between the two video frames, so that $\sum_{i,j} h_{m,n}(i,j)$ is the number of difference pixel points between the m-th frame and the n-th frame; i is an integer greater than 0 and less than the total number of rows of pixel points in the video frame, and j is an integer greater than 0 and less than the total number of columns of pixel points in the video frame.
As an embodiment, a large number of difference pixel points between two adjacent video frames may simply be because the frames themselves contain many pixel points. To improve the applicability of the method for determining the degree of difference, the number of difference pixel points can be normalized by the total number of pixel points in the video frame, i.e. the degree of difference between two adjacent video frames is the number of difference pixel points divided by the total number of pixel points in the video frame. In this way the degree of difference is not affected by the total number of pixel points, and the method applies to video frames of any size.
Determining a degree of difference between two video frames according to the following formula (4);
$d_{m,n} = \dfrac{\sum_{i,j} h_{m,n}(i,j)}{A \times B}$    (4)
where A is the total number of rows of pixel points in the video frame and B is the total number of columns of pixel points in the video frame.
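Continuing the earlier sketch, the normalized degree of difference of formula (4) can be computed as follows (an illustration that reuses the hypothetical difference_pixel_mask helper above):

```python
import numpy as np

def degree_of_difference(frame_m: np.ndarray, frame_n: np.ndarray,
                         eps2: float = 0.2) -> float:
    """Formula (4): number of difference pixels divided by the total number of pixels."""
    mask = difference_pixel_mask(frame_m, frame_n, eps2)  # from the sketch above
    return float(mask.sum()) / mask.size                  # mask.size equals A * B
```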
S1.3, dividing the video frame sequence into at least one video frame sub-sequence.
After determining the degree of difference between every two video frames in the sequence of video frames, the sequence of video frames is divided into at least one sub-sequence of video frames according to the degree of difference between every two video frames.
The sequence of video frames is partitioned according to the following equation (5).
$d_{n+1,n} - d_{n,n-1} > \varepsilon_3$    (5)
where $\varepsilon_3$ denotes the fourth threshold.
The first video frame of the video frame sequence is taken as the first video frame of the first video frame sub-sequence. Starting from the second video frame of the sequence, each video frame is examined in turn: if the difference between the degree of difference between the current video frame and its next video frame and the degree of difference between the current video frame and its previous video frame is determined to be greater than the fourth threshold (formula (5)), the current video frame is taken as the last video frame of the current video frame sub-sequence, and the next video frame is taken as the first video frame of the next video frame sub-sequence. When the last video frame of the video frame sequence is reached, it is taken as the last video frame of the last video frame sub-sequence, and each video frame sub-sequence is thus obtained.
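A minimal sketch of this segmentation rule, assuming the frames are NumPy arrays and reusing the hypothetical degree_of_difference helper above (the threshold value is illustrative):

```python
from typing import List
import numpy as np

def split_into_sub_sequences(frames: List[np.ndarray],
                             eps3: float = 0.3) -> List[List[np.ndarray]]:
    """Cut the sequence after frame n whenever d(n+1,n) - d(n,n-1) > eps3 (formula (5))."""
    if len(frames) < 3:
        return [frames]
    # d[k] is the degree of difference between frames[k] and frames[k + 1].
    d = [degree_of_difference(frames[k], frames[k + 1]) for k in range(len(frames) - 1)]
    sub_sequences, current = [], [frames[0]]
    for n in range(1, len(frames) - 1):
        current.append(frames[n])
        if d[n] - d[n - 1] > eps3:          # current frame ends this sub-sequence
            sub_sequences.append(current)
            current = []
    current.append(frames[-1])              # last frame closes the last sub-sequence
    sub_sequences.append(current)
    return sub_sequences
```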
S303, respectively carrying out video frame text recognition on each video frame sub-sequence, and acquiring text information of each video frame sub-sequence based on the result of the video frame text recognition.
After each video frame sub-sequence is obtained, video frame text recognition can be respectively carried out on each video frame in each video frame sub-sequence based on a computer vision technology in the artificial intelligence field, and text information of each video frame sub-sequence is obtained based on a text recognition result of each video frame in the video frame sub-sequence; or, each video frame sub-sequence may be sampled, video frame text recognition may be performed on each sampled target video frame, and text information of each video frame sub-sequence may be obtained based on the result of text recognition of each sampled target video frame.
The following description will be given by taking the case of sampling each video frame sub-sequence and performing video frame text recognition on each sampled target video frame, where the principle of performing video frame text recognition on each video frame in each video frame sub-sequence is the same, and will not be repeated.
And S2.1, sampling each video frame in the video frame sub-sequence.
A video frame that represents the motion of a person or the change of an object in the video frame sub-sequence, or an I frame (intra picture), is sampled to obtain at least one target video frame.
As an example, at least one target video frame may be time-ordered to form a sequence of target video frames. Each two adjacent target video frames in the sequence of target video frames may be two adjacent video frames in the sequence of video frames or may be two non-adjacent video frames in the sequence of video frames.
As an example, for every two adjacent target video frames in the target video frame sequence, the duration of the earlier target video frame (in time order) may be determined from the number of video frames between them in the video frame sub-sequence. The video frames spaced between two adjacent target video frames can be regarded as transition or intermediate video frames associated with the earlier target video frame; they improve the continuity of the transformation from one target video frame to the next, so text recognition can be performed only on the target video frames, which reduces the amount of data to be processed and improves the efficiency of text recognition.
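The sketch below shows one possible sampler (a fixed-stride choice is an assumption; the embodiment samples representative frames or I-frames) together with the duration attribution described above:

```python
from typing import List, Tuple
import numpy as np

def sample_target_frames(sub_sequence: List[np.ndarray], stride: int = 10,
                         fps: float = 25.0) -> List[Tuple[np.ndarray, float]]:
    """Return (target_frame, duration_in_seconds) pairs for one sub-sequence.

    The duration of a target frame covers it plus the frames spaced between
    it and the next target frame.
    """
    indices = list(range(0, len(sub_sequence), stride))
    pairs = []
    for k, idx in enumerate(indices):
        next_idx = indices[k + 1] if k + 1 < len(indices) else len(sub_sequence)
        pairs.append((sub_sequence[idx], (next_idx - idx) / fps))
    return pairs
```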
S2.2, carrying out video frame text recognition on each target video frame to obtain a sub-text corresponding to each target video frame.
After at least one target video frame is obtained, text detection is performed on each target video frame based on machine learning techniques in the field of artificial intelligence to determine whether the target video frame includes text. If the target video frame does not include text, text recognition is not performed on it; if it includes text, text recognition of the target video frame continues.
As an embodiment, please refer to fig. 4, which is a schematic diagram of text detection. First, feature extraction is performed on a target video frame, and at least one text region, such as a first text region, a second text region, and a third text region, is obtained according to the feature extraction result. The number of text regions may be preset by the user or determined automatically by the device according to the feature extraction result, and is not limited here. A text region may be a single pixel point or a region containing multiple pixel points; it indicates that the pixel points within it are related to the sub-text corresponding to the target video frame. Second, for each text region, surrounding pixel points are continuously merged into the region according to the similarity between pixel points, so that each text region expands; finally, all text regions are merged to obtain the text region containing the complete sub-text corresponding to the target video frame. Compared with text detection via a rectangular box, this approach does not constrain the form in which the sub-text appears, and it is highly applicable to arbitrary fonts, handwritten text, and the like.
As an example, in detecting whether text is included in the target video frame, text detection may also be performed using a trained PSENet model or the like.
After text detection of the target video frame, if text is included in the target video frame, text recognition of the target video frame continues based on natural language processing techniques in the field of artificial intelligence. Referring to fig. 5, a schematic diagram of a principle of text recognition of a video frame is shown. Firstly, extracting features of a text region obtained through text detection by using a feature extraction network to obtain a text sequence of the text region. And secondly, carrying out vector compression on the text sequence of the text region by using a vector compression network, and obtaining a semantic vector after vector compression according to the generated forward hidden state and reverse hidden state. Then, the semantic vector is restored to a text sequence based on the vector compressed semantic vector and at least one mapping parameter using a mapping network. By introducing at least one mapping parameter, the degree of association between different positions of the matching semantic vector and different positions of the text sequence is improved, and the accuracy of determining the text sequence is improved.
The text recognition model is introduced below, taking the feature extraction network to be an EfficientNet model, the vector compression network to be a BiRNN model formed by two layers of LSTM, and the mapping network to be a multi-head attention model as an example.
The text region is input into the EfficientNet model to obtain a text sequence $\{x_1, x_2, \ldots, x_n\}$. The text sequence is input into the BiRNN model, and each hidden state is calculated. The first LSTM layer of the BiRNN model processes the text sequence from left to right and generates the forward hidden states $\overrightarrow{h_i}$; the second LSTM layer processes the text sequence from right to left and generates the reverse hidden states $\overleftarrow{h_i}$. A semantic vector is obtained from the forward and reverse hidden states, $h_i = [\overrightarrow{h_i}; \overleftarrow{h_i}]$. The semantic vector is input into the multi-head attention model and restored to a text sequence according to the mapping parameters, yielding the sub-text of the target video frame.
Please refer to fig. 6, which is a schematic diagram of a mapping network. The semantic vector is input into a mapping network, and each text sequence is restored through at least one mapping parameter. The individual text sequences are fused into the final sub-text.
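Purely to illustrate the bidirectional compression step (a toy stand-in, not the actual EfficientNet/BiRNN/multi-head-attention models), the sketch below runs a simple tanh RNN cell over a feature sequence in both directions and concatenates the hidden states into semantic vectors:

```python
import numpy as np

def toy_rnn_pass(xs, W, U, b):
    """Run a simple tanh RNN over xs (a list of feature vectors); return all hidden states."""
    h = np.zeros(U.shape[0])
    states = []
    for x in xs:
        h = np.tanh(W @ x + U @ h + b)
        states.append(h)
    return states

def birnn_semantic_vectors(xs, W, U, b):
    """Concatenate forward and backward hidden states: h_i = [h_fwd_i ; h_bwd_i]."""
    fwd = toy_rnn_pass(xs, W, U, b)                                   # left to right
    bwd = list(reversed(toy_rnn_pass(list(reversed(xs)), W, U, b)))   # right to left
    return [np.concatenate([f, r]) for f, r in zip(fwd, bwd)]

# Example with random weights: 8-dim features, 16-dim hidden state.
rng = np.random.default_rng(0)
xs = [rng.normal(size=8) for _ in range(5)]
W, U, b = rng.normal(size=(16, 8)), rng.normal(size=(16, 16)), np.zeros(16)
vectors = birnn_semantic_vectors(xs, W, U, b)   # five 32-dim semantic vectors
```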
As one embodiment, the duration of the video frame associated with the sub-text of the target video frame is the duration of the target video frame. If the sub-text of the plurality of target video frames is the same, the duration of the video frame associated with the sub-text is the sum of the durations of the plurality of target video frames.
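A small sketch (the data layout is an assumption) of aggregating the durations associated with each sub-text, which later supports filtering low-frequency sub-texts by duration:

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def aggregate_subtext_durations(recognized: List[Tuple[str, float]]) -> Dict[str, float]:
    """recognized: (sub_text, duration_seconds) per target frame -> total duration per sub-text."""
    totals: Dict[str, float] = defaultdict(float)
    for sub_text, duration in recognized:
        totals[sub_text] += duration
    return dict(totals)
```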
As an embodiment, after text detection determines whether a target video frame includes text, if the target video frame does not include any text, a textual description of it can be generated by combining computer vision, natural language processing, and machine learning techniques, translating the image into text. This establishes a connection between the two modal spaces of image and text, can simulate human descriptive behavior, can describe the emotion of a person in the image with text, and improves the flexibility and intelligence of obtaining the text of the target video frame.
Inputting the target video frame into a trained text description model, and extracting the characteristics of the target video frame by a characteristic extraction network in the text description model to obtain characteristic vectors representing the spatial information of the target video frame. After the feature vector of the target video frame is obtained, the generated sequence network in the text description model determines keywords associated with the target video frame based on the feature vector of the target video frame, and obtains each keyword sequence, so that the sub-text of the target video frame can be obtained.
The feature extraction network in the text description model can be built from a convolutional neural network and a fully connected network, and the generation sequence network in the text description model can be built from a recurrent neural network, so that the generation sequence network can memorize and retain information over a long period.
The text description model may be trained on a number of sample images that do not include text, together with the text label corresponding to each sample image. A sample image that does not include text may be a video frame without text taken from a video obtained from a network resource, and the text label may be obtained from manual annotation or device annotation, or from text associated with the sample image at its source, without limitation.
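As an illustration of the text description model outlined above, the sketch below uses a small convolutional encoder plus fully connected layer to map a frame without text to a feature vector, and a recurrent generation network to greedily emit a keyword id sequence. The vocabulary size, dimensions, start token, and greedy decoding loop are assumptions, not the patented architecture.

```python
# Hedged sketch of a CNN encoder + recurrent decoder frame-captioning model.
import torch
import torch.nn as nn

class FrameCaptioner(nn.Module):
    def __init__(self, vocab_size=5000, embed=256, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(                 # CNN + fully connected layer
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, hidden),
        )
        self.embed = nn.Embedding(vocab_size, embed)
        self.decoder = nn.GRUCell(embed, hidden)      # recurrent generation network
        self.out = nn.Linear(hidden, vocab_size)

    @torch.no_grad()
    def describe(self, frame, max_len=10, bos=1):
        h = self.encoder(frame)                       # spatial feature vector
        token = torch.full((frame.size(0),), bos, dtype=torch.long)
        words = []
        for _ in range(max_len):                      # greedy keyword decoding
            h = self.decoder(self.embed(token), h)
            token = self.out(h).argmax(dim=-1)
            words.append(token)
        return torch.stack(words, dim=1)              # keyword id sequence

ids = FrameCaptioner().describe(torch.randn(1, 3, 64, 64))
print(ids.shape)                                      # torch.Size([1, 10])
```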
S2.3, the sub-texts are filtered according to preset filtering conditions.
The obtained sub-texts may include text that is irrelevant to the video theme or repeated, so filtering conditions can be set to filter the sub-texts and ensure the accuracy and conciseness of the obtained video theme text. Three filtering conditions are described below as examples.
Filtering condition one:
Preset keywords in each sub-text are filtered out.
The video may contain text unrelated to the subject of the video, such as watermark text or trademark text, so preset keywords can be set and filtered out of each sub-text.
For example, a user shoots a video through software A and releases it through software B. The video includes the name watermark of software A, and the name of software A is in fact unrelated to the video content, so the name text of software A can be deleted from the obtained sub-texts of each target video frame, which improves the accuracy and conciseness of the extracted theme text.
Filtering condition two:
Similar sub-texts are de-duplicated according to the similarity among the sub-texts.
Similar text, such as similar words or sentences, may be present in the sub-texts of the target video frames, possibly due to text recognition errors or paraphrasing, so similar sub-texts among the sub-texts may be filtered.
After the sub-texts are obtained, for each video frame sub-sequence, the similarity between every two sub-texts in the video frame sub-sequence is determined. If the similarity between two sub-texts is greater than a similarity threshold, the two sub-texts are determined to be similar to each other. Similar sub-texts are then de-duplicated: for example, among a plurality of similar sub-texts, only the sub-text whose associated video frame duration is longest is retained and the similar sub-texts with shorter associated durations are deleted; alternatively, only the similar sub-text of the earliest target video frame in chronological order is retained and the similar sub-texts of the later target video frames are deleted, and so on.
As an example, there are various methods of determining the similarity between two sub-texts, such as the Euclidean distance, cosine similarity, or Levenshtein distance. The method of determining the similarity between two sub-texts is described below using the Levenshtein distance as an example.
The two sub-texts are converted into two character strings, operations such as adding characters, deleting characters or replacing characters are performed on the first character string, and the minimum number of operations required to convert the first character string into the second character string is counted. The similarity between the two sub-texts can be determined from this number of operations, which is inversely proportional to the similarity. If the number of operations needed to convert the first character string into the second character string falls within a preset range, the similarity between the two sub-texts is determined to be high, and the two sub-texts are similar sub-texts.
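A small Python sketch of this similarity check is given below; it computes the Levenshtein (edit) distance between two sub-texts and converts it into a normalized similarity. The length-based normalization and the 0.8 threshold are illustrative assumptions rather than values from the embodiments.

```python
# Levenshtein distance between two sub-texts and a simple similarity gate.
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insert/delete/replace operations turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete a character
                           cur[j - 1] + 1,              # add a character
                           prev[j - 1] + (ca != cb)))   # replace a character
        prev = cur
    return prev[-1]

def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    ops = levenshtein(a, b)
    sim = 1.0 - ops / max(len(a), len(b), 1)   # fewer operations -> higher similarity
    return sim > threshold

print(similar("breaking news tonight", "breaking news tonite"))  # True
```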
Filtering condition three:
Low-frequency sub-texts are filtered according to the duration of the target video frames associated with each sub-text.
If some sub-texts appear in the video only for a short period of time, for example in only 3 video frames, these sub-texts can be regarded as low-frequency sub-texts unrelated to the video theme text, so sub-texts associated with a short duration can be deleted.
After the video frame duration associated with each sub-text in the video frame sub-sequence is obtained, the sub-texts can be sorted in descending order of associated video frame duration, and all sub-texts ranked after a preset position are deleted. Alternatively, all sub-texts whose associated video frame duration is less than a preset duration may be deleted. Alternatively, after the sub-texts are sorted in descending order of associated video frame duration, the difference in video frame duration between every two adjacent sub-texts may be calculated, and if the difference is greater than a preset difference, all sub-texts ranked after that sub-text may be deleted, and so on.
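The following toy sketch illustrates filtering condition three under assumed inputs: each sub-text is mapped to its total associated duration, and entries below a threshold are treated as low frequency and dropped. The 1.0-second threshold and the dictionary input format are assumptions.

```python
# Drop low-frequency sub-texts by their associated video frame duration.
def filter_low_frequency(sub_texts, min_duration=1.0):
    """sub_texts: dict mapping sub-text -> total associated duration (seconds)."""
    kept = {t: d for t, d in sub_texts.items() if d >= min_duration}
    # Alternatively: sort by duration and keep only the entries before a preset rank.
    return dict(sorted(kept.items(), key=lambda kv: kv[1], reverse=True))

print(filter_low_frequency({"episode 12": 5.2, "LIVE": 3.0, "noise": 0.12}))
# {'episode 12': 5.2, 'LIVE': 3.0}
```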
After the sub-texts of the target video frames in each video frame sub-sequence are filtered, the text information of each video frame sub-sequence is obtained.
As an embodiment, before the text information of each video frame sub-sequence is obtained, audio recognition may be performed on the audio file corresponding to each video frame sub-sequence based on a speech technology in the artificial intelligence field, or audio recognition may be performed on the audio file corresponding to each target video frame, so as to obtain an audio recognition result. After the text recognition result is obtained, for example after the sub-text of each target video frame is obtained, the text recognition result and the audio recognition result may be combined, and the text information of each video frame sub-sequence is obtained after the combined text is filtered.
The audio file is, for example, a dubbing file corresponding to the video frame sequence. For example, the playing time of the current target video frame in the video to be extracted is used as the start time and the playing time of the next target video frame in the video to be extracted is used as the end time; the corresponding audio information in the audio file of the video to be extracted is determined, the text recognized from that audio information is obtained, and the sub-text of the current target video frame is thereby obtained.
For another example, according to the playing time corresponding to each video frame sub-sequence, determining the audio information corresponding to each video frame sub-sequence in the audio file of the video to be extracted, and performing audio recognition on the audio information to obtain the audio recognition result corresponding to each video frame sub-sequence.
As an embodiment, in the process of merging the audio recognition result and the text recognition result, in order to avoid the audio recognition result being misaligned in content with the corresponding text recognition result, or being unrelated to the text recognition result and causing content confusion, the similarity between the audio recognition result and the text recognition result corresponding to each video frame sub-sequence can be determined first. When the similarity between the audio recognition result and the text recognition result corresponding to a video frame sub-sequence is greater than a preset similarity, the two are combined to obtain the text information of the video frame sub-sequence. Alternatively, the similarity between the audio recognition result and the text recognition result corresponding to each target video frame may be determined, and when that similarity is greater than the preset similarity, the two are combined to obtain the text information of the video frame sub-sequence.
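As a rough illustration of this merge step, the sketch below only combines the audio recognition result with the text recognition result when the two are similar enough, otherwise it keeps the on-screen text alone. The token-overlap similarity measure and the 0.3 threshold are assumptions introduced for the example.

```python
# Combine audio and text recognition results only when they agree in content.
def merge_recognition(text_result: str, audio_result: str,
                      preset_similarity: float = 0.3) -> str:
    text_tokens, audio_tokens = set(text_result.split()), set(audio_result.split())
    overlap = len(text_tokens & audio_tokens)
    similarity = overlap / max(len(text_tokens | audio_tokens), 1)
    if similarity > preset_similarity:
        return text_result + " " + audio_result   # combine both modalities
    return text_result                            # keep the on-screen text only

print(merge_recognition("weather forecast for monday",
                        "the weather forecast for monday is sunny"))
```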
S304, fusion processing is carried out on the text information of each video frame sub-sequence, and the theme text of the video to be extracted is obtained.
For each video frame sub-sequence, the text information of the video frame sub-sequence is spliced according to the order of the video frames in the sub-sequence to obtain the spliced text information of that sub-sequence. The spliced text information of each video frame sub-sequence is then spliced according to the order of the sub-sequences in the video frame sequence to obtain the theme text of the video to be extracted.
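A toy sketch of this splicing order is shown below: sub-texts are joined in frame order within each sub-sequence, and the sub-sequence texts are then joined in sequence order to form the theme text. The nested-list input structure is an assumption for illustration.

```python
# Splice sub-texts in frame order, then splice sub-sequences in sequence order.
def fuse_theme_text(sub_sequences):
    """sub_sequences: list (in sequence order) of lists of sub-texts (in frame order)."""
    sub_sequence_texts = [" ".join(texts) for texts in sub_sequences]
    return " ".join(sub_sequence_texts)

print(fuse_theme_text([["city marathon", "starts downtown"],
                       ["runners cross", "the finish line"]]))
```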
As an example, if a video frame sub-sequence includes a large number of video frames, the obtained text information of that sub-sequence may contain a large amount of text. To avoid the theme text of the video to be extracted not being concise after the fusion processing, summary extraction, such as keyword extraction, can be performed on the text information of any video frame sub-sequence whose number of video frames is greater than a preset number. The text information of such sub-sequences is updated to the extracted summary information, and the text information of each video frame sub-sequence is then fused to obtain the theme text of the video to be extracted.
Alternatively, if the text information of a video frame sub-sequence contains a large amount of text, in order to avoid the obtained theme text of the video to be extracted not being concise, the text information of each video frame sub-sequence can be fused based on a text probability model and a topic probability model to determine the theme text of the video to be extracted.
The text information of each video frame sub-sequence is input into a trained text probability model, which can segment the text information into a plurality of words and obtain a word vector for each word. After the words in the text information and their word vectors are obtained, the text probability model can determine, for each combination mode among the various combination modes of the at least one word, the probability that the combination expresses the semantics of the text information of the video frame sub-sequence, combine the at least one word according to the combination mode with the highest probability, and output the summary information of the video frame sub-sequence. The summary information can characterize the semantics of the text information with a preset amount of text. The preset amount of text may refer to the number of characters in the text information, or to the memory size occupied by the text information, which is not particularly limited.
After the summary information of each video frame sub-sequence is obtained, it can be input into a trained topic probability model. The topic probability model determines the information vector corresponding to each piece of summary information, determines, for each combination mode among the various combination modes of the pieces of summary information, the probability that the combination expresses the semantics of the summary information of the video frame sub-sequences, combines the pieces of summary information according to the combination mode with the highest probability, and outputs the theme text of the video frame sequence.
As one embodiment, the text probability model can be trained in an unsupervised manner on the text information of each video frame sub-sequence: if the Markov chain of the obtained probability distribution converges, a trained text probability model is obtained; otherwise, the model parameters of the text probability model are adjusted and training continues. The topic probability model can be trained in an unsupervised manner on the summary information of each video frame sub-sequence; its training process is similar to that of the text probability model and is not repeated here.
As an embodiment, the text probability model and the topic probability model may be built from a latent Dirichlet allocation (LDA) model.
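For orientation only, the following sketch uses scikit-learn's LDA implementation to model the text information of video frame sub-sequences and pull out the highest-weight words of each text's dominant topic as summary keywords. This is not the patent's training procedure; the toy corpus, topic count, and number of keywords are assumptions.

```python
# Hedged LDA sketch: topic modelling over per-sub-sequence text information.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

sub_sequence_texts = [
    "city marathon starts downtown thousands of runners join the race",
    "runners cross the finish line as the marathon ends near the stadium",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(sub_sequence_texts)        # word counts per text
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

words = vectorizer.get_feature_names_out()
for doc, topic_dist in zip(sub_sequence_texts, lda.transform(counts)):
    topic = topic_dist.argmax()                               # most probable topic
    top_words = [words[i] for i in lda.components_[topic].argsort()[-3:][::-1]]
    print(doc[:30], "->", top_words)                          # summary keywords
```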
Based on the same inventive concept, the embodiments of the present application provide a device for extracting a video topic text, which is equivalent to the topic text extraction device 102 or the topic text extraction device 205 discussed above, and can implement functions corresponding to the method for extracting a video topic text. Referring to fig. 7, the apparatus includes an acquisition module 701, a segmentation module 702, and a processing module 703, where:
the acquisition module 701 is configured to obtain a video frame sequence of a video to be extracted;
the segmentation module 702 is configured to divide the video frame sequence into at least one video frame sub-sequence according to the degree of difference between every two adjacent video frames; the difference degree between adjacent video frames in each video frame sub-sequence is within a set first threshold, the difference degree is proportional to the number of difference pixel points between the adjacent video frames, and the difference pixel points include pixel points at corresponding positions between the adjacent video frames that meet a preset pixel difference condition;
the processing module 703 is configured to perform video frame text recognition on each video frame sub-sequence respectively, and acquire the text information of each video frame sub-sequence based on the video frame text recognition result; and to perform fusion processing on the text information of each video frame sub-sequence to obtain the subject text of the video to be extracted.
In one possible embodiment, the preset pixel difference condition includes:
the ratio of the absolute value of the pixel value difference value of two pixel points at the corresponding positions between adjacent video frames to the sum of the pixel values is greater than a second threshold.
In one possible embodiment, the degree of difference is the ratio of the number of difference pixels between adjacent video frames to the total number of pixels of the video frame.
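A minimal NumPy sketch of the pixel difference condition and difference degree described above is given below: a pixel is counted as a difference pixel when the ratio of the absolute pixel difference to the pixel sum exceeds the second threshold, and the difference degree is the share of such pixels in the frame. The threshold value and the single-channel grayscale input are illustrative assumptions.

```python
# Difference degree between two aligned video frames.
import numpy as np

def difference_degree(frame_a, frame_b, second_threshold=0.2):
    a = frame_a.astype(np.float64)
    b = frame_b.astype(np.float64)
    ratio = np.abs(a - b) / np.maximum(a + b, 1e-6)    # avoid division by zero
    diff_pixels = np.count_nonzero(ratio > second_threshold)
    return diff_pixels / a.size                        # proportion of difference pixels

f1 = np.random.randint(0, 256, (720, 1280), dtype=np.uint8)
f2 = np.random.randint(0, 256, (720, 1280), dtype=np.uint8)
print(difference_degree(f1, f2))
```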
In one possible embodiment, the processing module 703 is further configured to: before the text information of each video frame sub-sequence is acquired, perform audio recognition on the audio file corresponding to each video frame sub-sequence to obtain an audio recognition result; and,
the processing module 703 is specifically configured to: and combining the audio recognition result and the video frame text recognition result to obtain text information of each video frame sub-sequence.
In one possible embodiment, the processing module 703 is specifically configured to:
determining the similarity between the audio recognition result and the text recognition result corresponding to each video frame sub-sequence respectively;
and when the similarity between the audio recognition result and the text recognition result corresponding to the video frame sub-sequence is larger than the preset similarity, combining the audio recognition result and the text recognition result to obtain text information of the video frame sub-sequence.
In one possible embodiment, for each video frame sub-sequence, the processing module 703 is specifically configured to:
sampling each video frame in the video frame sub-sequence to obtain at least one target video frame;
carrying out video frame text recognition on each target video frame to obtain a sub-text corresponding to each target video frame;
according to preset filtering conditions, filtering each sub text;
based on each sub-text after filtering, obtaining a video frame text recognition result;
based on the result of the video frame text recognition, text information of each video frame sub-sequence is acquired.
In one possible embodiment, the processing module 703 is specifically configured to perform one or any combination of the following:
filtering preset keywords in each sub-text, wherein the preset keywords comprise keywords irrelevant to video topics;
duplicate removal of similar sub-texts according to the similarity among the sub-texts;
the low frequency sub-text is filtered by the duration of each target video frame associated with the sub-text.
In one possible embodiment, for two sub-texts, the processing module 703 is specifically configured to:
converting the first sub-text and the second sub-text into a first character string and a second character string respectively;
Performing character adding, character replacing or character deleting operations on the first character string to convert the first character string into a second character string;
determining the similarity between the first sub-text and the second sub-text according to the minimum operation times required for converting the first character string into the second character string, wherein the operation times are inversely proportional to the similarity;
and if the similarity between the first sub-text and the second sub-text is greater than a similarity threshold, filtering the first sub-text or the second sub-text.
In one possible embodiment, the processing module 703 is specifically configured to:
inputting the text information of each video frame sub-sequence into a trained text probability model to obtain abstract information of each video frame sub-sequence; the abstract information is used for representing the semantics of the text information with a preset amount of text; the trained text probability model is obtained through training based on the text information of each video frame sub-sequence;
inputting abstract information of each video frame sub-sequence into a trained topic probability model to obtain topic text of a video to be extracted; the trained topic probability model is trained based on summary information for each video frame sub-sequence.
Based on the same inventive concept, embodiments of the present application provide a computer apparatus, and the computer apparatus 800 will be described below.
Referring to fig. 8, the apparatus for extracting video theme text may be run on a computer device 800, and a current version and a history version of a program for extracting video theme text and application software corresponding to the program for extracting video theme text may be installed on the computer device 800, where the computer device 800 includes a display unit 840, a processor 880 and a memory 820, and the display unit 840 includes a display panel 841 for displaying an interface interacted with by a user, etc.
In one possible embodiment, the display panel 841 may be configured in the form of a liquid crystal display (Liquid Crystal Display, LCD) or an Organic Light-Emitting Diode (OLED), or the like.
The processor 880 is configured to read the computer program and then execute a method defined by the computer program, for example, the processor 880 reads a program or a file or the like for extracting the text of the video theme, thereby running the program for extracting the text of the video theme on the computer apparatus 800, and displaying a corresponding interface on the display unit 840. The processor 880 may include one or more general-purpose processors and may further include one or more DSPs (Digital Signal Processor, digital signal processors) for performing related operations to implement the technical solutions provided by the embodiments of the present application.
Memory 820 typically includes internal memory and external memory; the internal memory may be random access memory (RAM), read-only memory (ROM), a cache (CACHE), etc. The external memory can be a hard disk, an optical disk, a USB disk, a floppy disk, a tape drive, etc. The memory 820 is used to store computer programs, including application programs corresponding to the respective clients, and other data, which may include data generated after the operating system or application programs are run, including system data (e.g., configuration parameters of the operating system) and user data. In the embodiment of the present application, program instructions are stored in the memory 820, and the processor 880 executes the program instructions stored in the memory 820 to implement any of the methods for extracting video theme text discussed in the previous figures.
The above-described display unit 840 is used to receive input digital information, character information, or touch operation/noncontact gestures, and to generate signal inputs related to user settings and function controls of the computer device 800, and the like. Specifically, in the embodiment of the present application, the display unit 840 may include a display panel 841. The display panel 841, for example, a touch screen, may collect touch operations thereon or thereabout by a user (such as operations of the user on the display panel 841 or on the display panel 841 using any suitable object or accessory such as a finger, a stylus, etc.), and drive the corresponding connection device according to a predetermined program.
In one possible embodiment, the display panel 841 may include two parts, a touch detection device and a touch controller. The touch detection device detects the touch azimuth of a player, detects a signal brought by touch operation and transmits the signal to the touch controller; the touch controller receives touch information from the touch detection device and converts it into touch point coordinates, which are then sent to the processor 880 and can receive commands from the processor 880 and execute them.
The display panel 841 may be implemented by various types such as resistive, capacitive, infrared, and surface acoustic wave. In addition to the display unit 840, the computer device 800 may also include an input unit 830, where the input unit 830 may include a graphical input device 831 and other input devices 832, where other input devices may include, but are not limited to, one or more of a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, etc.
In addition to the above, the computer device 800 may also include a power supply 890 for powering other modules, audio circuitry 860, near field communication module 870, and RF circuitry 810. The computer device 800 may also include one or more sensors 850, such as acceleration sensors, light sensors, pressure sensors, and the like. The audio circuit 860 specifically includes a speaker 861 and a microphone 862, etc., and for example, the computer device 800 can collect the sound of the user through the microphone 862, perform corresponding operations, etc.
The number of processors 880 may be one or more, and the processors 880 and memory 820 may be coupled or may be relatively independent.
As an example, the processor 880 of fig. 8 may be used to implement the functions of the acquisition module, the segmentation module, and the processing module of fig. 7.
As an example, the processor 880 of fig. 8 may be configured to implement the functionality associated with the apparatus for extracting video theme text discussed above.
Those of ordinary skill in the art will appreciate that: all or part of the steps for implementing the above method embodiments may be implemented by hardware associated with program instructions, where the foregoing program may be stored in a computer readable storage medium, and when executed, the program performs steps including the above method embodiments; and the aforementioned storage medium includes: a mobile storage device, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk or an optical disk, or the like, which can store program codes.
Alternatively, the above-described integrated units of the present invention may be stored in a computer-readable storage medium if implemented in the form of software functional modules and sold or used as separate products. Based on such understanding, the technical solutions of the embodiments of the present invention may be embodied in essence or a part contributing to the prior art in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the methods described in the embodiments of the present invention. And the aforementioned storage medium includes: a removable storage device, ROM, RAM, magnetic or optical disk, or other medium capable of storing program code.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
It will be apparent to those skilled in the art that various modifications and variations can be made in the present application without departing from the spirit or scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims and the equivalents thereof, the present application is intended to cover such modifications and variations.

Claims (13)

1. A method of extracting text of a video theme, comprising:
acquiring a video frame sequence of a video to be extracted;
taking a first video frame of the video frame sequence as a first video frame of a first video frame sub-sequence, starting from a second video frame of the video frame sequence, and sequentially performing the following operation on each video frame until the last-to-last video frame of the video frame sequence:
Determining a first difference degree between a current video frame and a previous video frame and a second difference degree between the current video frame and a next video frame;
if the difference between the first difference and the second difference is greater than a fourth threshold, dividing the video frame sequence once, and taking the current video frame as the last video frame of the current video frame sub-sequence, and taking the next video frame of the current video frame as the first video frame of the next video frame sub-sequence of the current video frame sub-sequence;
taking the last video frame of the video frame sequence as the last video frame of the last video frame sub-sequence to obtain each video frame sub-sequence; the difference degree between adjacent video frames in each video frame sub-sequence is in a set first threshold value, the difference degree is in direct proportion to the number of difference pixel points between the adjacent video frames, the difference pixel points comprise pixel points which meet preset pixel difference conditions at corresponding positions between the adjacent video frames, and the preset pixel difference conditions comprise that the ratio of the absolute value of the pixel value difference value of two pixel points at corresponding positions between the adjacent video frames to the sum of the pixel values is larger than a second threshold value;
Respectively carrying out video frame text recognition on each video frame sub-sequence, and acquiring text information of each video frame sub-sequence based on a video frame text recognition result;
and carrying out fusion processing on the text information of each video frame sub-sequence to obtain the subject text of the video to be extracted.
2. The method of claim 1, further comprising, prior to obtaining text information for each sub-sequence of video frames:
respectively carrying out audio recognition on the audio files corresponding to each video frame sub-sequence to obtain an audio recognition result, and,
the text information of each video frame sub-sequence is acquired based on the result of the video frame text recognition, and the method comprises the following steps:
and combining the audio recognition result and the video frame text recognition result to obtain text information of each video frame sub-sequence.
3. The method of claim 2, wherein combining the results of the audio recognition and the results of the video frame text recognition comprises:
determining the similarity between the audio recognition result and the text recognition result corresponding to each video frame sub-sequence respectively;
and when the similarity between the audio recognition result and the text recognition result corresponding to the video frame sub-sequence is larger than the preset similarity, combining the audio recognition result and the text recognition result to obtain text information of the video frame sub-sequence.
4. A method according to any one of claims 1 to 3, wherein the performing video frame text recognition for each video frame sub-sequence, when obtaining text information of each video frame sub-sequence based on the result of the video frame text recognition, includes, for each video frame sub-sequence:
sampling each video frame in the video frame sub-sequence to obtain at least one target video frame;
carrying out video frame text recognition on each target video frame to obtain a sub-text corresponding to each target video frame;
according to preset filtering conditions, filtering each sub text;
based on each sub-text after filtering, obtaining a video frame text recognition result;
based on the result of the video frame text recognition, text information of each video frame sub-sequence is acquired.
5. The method according to claim 4, wherein the filtering process is performed on each sub-text according to a preset filtering condition, including one or any combination of the following:
filtering preset keywords in each sub-text, wherein the preset keywords comprise keywords irrelevant to video topics;
duplicate removal of similar sub-texts according to the similarity among the sub-texts;
The low frequency sub-text is filtered by the duration of each target video frame associated with the sub-text.
6. The method of claim 5, wherein de-duplicating similar sub-text with a similarity between the respective sub-text comprises, for both sub-text:
converting the first sub-text and the second sub-text into a first character string and a second character string respectively;
performing character adding, character replacing or character deleting operation on the first character string so as to convert the first character string into the second character string;
determining the similarity between the first sub-text and the second sub-text according to the minimum operation times required for converting the first character string into the second character string, wherein the operation times are inversely proportional to the similarity;
and if the similarity between the first sub-text and the second sub-text is greater than a similarity threshold, filtering the first sub-text or the second sub-text.
7. The method according to claim 1, wherein the merging the text information of each video frame sub-sequence to obtain the subject text of the video to be extracted comprises:
inputting the text information of each video frame sub-sequence into a trained text probability model to obtain abstract information of each video frame sub-sequence; the abstract information is used for showing the semantics of the text information by a preset text table; the trained text probability model is obtained based on text information training of each video frame sub-sequence;
Inputting abstract information of each video frame sub-sequence into a trained topic probability model to obtain topic text of the video to be extracted; the trained topic probability model is trained based on summary information of each video frame sub-sequence.
8. An apparatus for extracting text of a video theme, comprising:
the acquisition module is used for: the method comprises the steps of obtaining a video frame sequence of a video to be extracted;
and a segmentation module: for taking a first video frame of the sequence of video frames as a first video frame of a first sub-sequence of video frames, starting from a second video frame of the sequence of video frames, for each video frame in turn, performing the following operations until a penultimate video frame of the sequence of video frames: determining a first difference degree between a current video frame and a previous video frame and a second difference degree between the current video frame and a next video frame; if the difference between the first difference and the second difference is greater than a fourth threshold, dividing the video frame sequence once, and taking the current video frame as the last video frame of the current video frame sub-sequence, and taking the next video frame of the current video frame as the first video frame of the next video frame sub-sequence of the current video frame sub-sequence; taking the last video frame of the video frame sequence as the last video frame of the last video frame sub-sequence to obtain each video frame sub-sequence; the difference degree between adjacent video frames in each video frame sub-sequence is in a set first threshold value, the difference degree is in direct proportion to the number of difference pixel points between the adjacent video frames, the difference pixel points comprise pixel points which meet preset pixel difference conditions at corresponding positions between the adjacent video frames, and the preset pixel difference conditions comprise that the ratio of the absolute value of the pixel value difference value of two pixel points at corresponding positions between the adjacent video frames to the sum of the pixel values is larger than a second threshold value;
The processing module is used for: the method comprises the steps of carrying out video frame text recognition on each video frame sub-sequence respectively, and acquiring text information of each video frame sub-sequence based on a video frame text recognition result; and carrying out fusion processing on the text information of each video frame sub-sequence to obtain the subject text of the video to be extracted.
9. The apparatus of claim 8, wherein the processing module is further configured to:
before text information of each video frame sub-sequence is acquired, audio recognition is performed on the audio file corresponding to each video frame sub-sequence, an audio recognition result is obtained, and,
the processing module is specifically configured to:
and combining the audio recognition result and the video frame text recognition result to obtain text information of each video frame sub-sequence.
10. The apparatus according to claim 8 or 9, wherein the processing module is specifically configured to:
sampling each video frame in the video frame sub-sequence to obtain at least one target video frame;
carrying out video frame text recognition on each target video frame to obtain a sub-text corresponding to each target video frame;
according to preset filtering conditions, filtering each sub text;
Based on each sub-text after filtering, obtaining a video frame text recognition result;
based on the result of the video frame text recognition, text information of each video frame sub-sequence is acquired.
11. The apparatus of claim 10, wherein the processing module is specifically configured to include one or any combination of:
filtering preset keywords in each sub-text, wherein the preset keywords comprise keywords irrelevant to video topics;
duplicate removal of similar sub-texts according to the similarity among the sub-texts;
the low frequency sub-text is filtered by the duration of each target video frame associated with the sub-text.
12. A computer device, comprising:
a memory for storing program instructions;
a processor for invoking program instructions stored in the memory and executing the method according to any of claims 1-7 in accordance with the obtained program instructions.
13. A storage medium storing computer-executable instructions for causing a computer to perform the method of any one of claims 1 to 7.
CN202011363335.1A 2020-11-27 2020-11-27 Method, device, equipment and storage medium for extracting video theme text Active CN113395578B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011363335.1A CN113395578B (en) 2020-11-27 2020-11-27 Method, device, equipment and storage medium for extracting video theme text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011363335.1A CN113395578B (en) 2020-11-27 2020-11-27 Method, device, equipment and storage medium for extracting video theme text

Publications (2)

Publication Number Publication Date
CN113395578A CN113395578A (en) 2021-09-14
CN113395578B true CN113395578B (en) 2023-06-30

Family

ID=77616559

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011363335.1A Active CN113395578B (en) 2020-11-27 2020-11-27 Method, device, equipment and storage medium for extracting video theme text

Country Status (1)

Country Link
CN (1) CN113395578B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114143479B (en) * 2021-11-29 2023-07-25 中国平安人寿保险股份有限公司 Video abstract generation method, device, equipment and storage medium
WO2023134088A1 (en) * 2022-01-11 2023-07-20 平安科技(深圳)有限公司 Video summary generation method and apparatus, electronic device, and storage medium
CN114241471B (en) * 2022-02-23 2022-06-21 阿里巴巴达摩院(杭州)科技有限公司 Video text recognition method and device, electronic equipment and readable storage medium
CN114697762B (en) * 2022-04-07 2023-11-28 脸萌有限公司 Processing method, processing device, terminal equipment and medium
CN115334367B (en) * 2022-07-11 2023-10-17 北京达佳互联信息技术有限公司 Method, device, server and storage medium for generating abstract information of video
CN116208824A (en) * 2023-02-07 2023-06-02 腾讯音乐娱乐科技(深圳)有限公司 Title generation method, computer device, storage medium, and computer program product
CN115830519B (en) * 2023-03-01 2023-05-23 杭州遁甲科技有限公司 Intelligent lock message reminding method

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101009343A (en) * 2006-01-25 2007-08-01 启萌科技有限公司 Luminescence device and its light guiding board
DE102007013811A1 (en) * 2007-03-22 2008-09-25 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. A method for temporally segmenting a video into video sequences and selecting keyframes for finding image content including subshot detection
CN102647436A (en) * 2011-02-21 2012-08-22 腾讯科技(深圳)有限公司 File releasing method and system based on point-to-point
CN103379283A (en) * 2012-04-27 2013-10-30 捷讯研究有限公司 Camera device with a dynamic touch screen shutter
CN104008377A (en) * 2014-06-07 2014-08-27 北京联合大学 Ground traffic sign real-time detection and recognition method based on space-time correlation
CN109218748A (en) * 2017-06-30 2019-01-15 京东方科技集团股份有限公司 Video transmission method, device and computer readable storage medium
CN111104930A (en) * 2019-12-31 2020-05-05 腾讯科技(深圳)有限公司 Video processing method and device, electronic equipment and storage medium
WO2020119496A1 (en) * 2018-12-14 2020-06-18 深圳壹账通智能科技有限公司 Communication method, device and equipment based on artificial intelligence and readable storage medium
CN111368141A (en) * 2020-03-18 2020-07-03 腾讯科技(深圳)有限公司 Video tag expansion method and device, computer equipment and storage medium
CN111507097A (en) * 2020-04-16 2020-08-07 腾讯科技(深圳)有限公司 Title text processing method and device, electronic equipment and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AUPO525697A0 (en) * 1997-02-24 1997-03-20 Redflex Traffic Systems Pty Ltd Vehicle imaging and verification
CN110147467A (en) * 2019-04-11 2019-08-20 北京达佳互联信息技术有限公司 A kind of generation method, device, mobile terminal and the storage medium of text description

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101009343A (en) * 2006-01-25 2007-08-01 启萌科技有限公司 Luminescence device and its light guiding board
DE102007013811A1 (en) * 2007-03-22 2008-09-25 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. A method for temporally segmenting a video into video sequences and selecting keyframes for finding image content including subshot detection
CN102647436A (en) * 2011-02-21 2012-08-22 腾讯科技(深圳)有限公司 File releasing method and system based on point-to-point
CN103379283A (en) * 2012-04-27 2013-10-30 捷讯研究有限公司 Camera device with a dynamic touch screen shutter
CN104008377A (en) * 2014-06-07 2014-08-27 北京联合大学 Ground traffic sign real-time detection and recognition method based on space-time correlation
CN109218748A (en) * 2017-06-30 2019-01-15 京东方科技集团股份有限公司 Video transmission method, device and computer readable storage medium
WO2020119496A1 (en) * 2018-12-14 2020-06-18 深圳壹账通智能科技有限公司 Communication method, device and equipment based on artificial intelligence and readable storage medium
CN111104930A (en) * 2019-12-31 2020-05-05 腾讯科技(深圳)有限公司 Video processing method and device, electronic equipment and storage medium
CN111368141A (en) * 2020-03-18 2020-07-03 腾讯科技(深圳)有限公司 Video tag expansion method and device, computer equipment and storage medium
CN111507097A (en) * 2020-04-16 2020-08-07 腾讯科技(深圳)有限公司 Title text processing method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN113395578A (en) 2021-09-14

Similar Documents

Publication Publication Date Title
CN113395578B (en) Method, device, equipment and storage medium for extracting video theme text
CN110781347B (en) Video processing method, device and equipment and readable storage medium
CN112163165B (en) Information recommendation method, device, equipment and computer readable storage medium
CN110245259B (en) Video labeling method and device based on knowledge graph and computer readable medium
CN110737783B (en) Method and device for recommending multimedia content and computing equipment
WO2022184117A1 (en) Deep learning-based video clipping method, related device, and storage medium
US10671895B2 (en) Automated selection of subjectively best image frames from burst captured image sequences
CN111723784B (en) Risk video identification method and device and electronic equipment
CN103686344A (en) Enhanced video system and method
CN110619284B (en) Video scene division method, device, equipment and medium
CN113766299B (en) Video data playing method, device, equipment and medium
CN113761253A (en) Video tag determination method, device, equipment and storage medium
CN111027419A (en) Method, device, equipment and medium for detecting video irrelevant content
CN113301382B (en) Video processing method, device, medium, and program product
CN116935170B (en) Processing method and device of video processing model, computer equipment and storage medium
CN113572981B (en) Video dubbing method and device, electronic equipment and storage medium
Fei et al. Learning user interest with improved triplet deep ranking and web-image priors for topic-related video summarization
CN113407775B (en) Video searching method and device and electronic equipment
CN116701706B (en) Data processing method, device, equipment and medium based on artificial intelligence
CN115935049A (en) Recommendation processing method and device based on artificial intelligence and electronic equipment
CN116910302A (en) Multi-mode video content effectiveness feedback visual analysis method and system
CN116977992A (en) Text information identification method, apparatus, computer device and storage medium
CN114419514B (en) Data processing method, device, computer equipment and storage medium
CN116955707A (en) Content tag determination method, device, equipment, medium and program product
CN113395584B (en) Video data processing method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40053138

Country of ref document: HK

SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant