CN113395578A - Method, device and equipment for extracting video theme text and storage medium


Info

Publication number
CN113395578A
Authority
CN
China
Prior art keywords: text, video frame, video, sub, subsequence
Prior art date
Legal status: Granted
Application number
CN202011363335.1A
Other languages
Chinese (zh)
Other versions
CN113395578B (en)
Inventor
刘刚
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202011363335.1A
Publication of CN113395578A
Application granted
Publication of CN113395578B
Legal status: Active

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/44008Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439Processing of audio elementary streams
    • H04N21/4394Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/85Assembly of content; Generation of multimedia applications
    • H04N21/854Content authoring
    • H04N21/8549Creating video summaries, e.g. movie trailer
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to computer technology, artificial intelligence, computer vision technology and speech technology. The application provides a method, a device, equipment and a storage medium for extracting a video theme text, which are used for improving the accuracy of extracting the video theme text. The method comprises the following steps: acquiring a video frame sequence of a video to be extracted; dividing the video frame sequence into at least one video frame subsequence according to the difference degree between every two adjacent video frames; respectively carrying out video frame text recognition on each video frame subsequence, and acquiring text information of each video frame subsequence based on a video frame text recognition result; and performing fusion processing on the text information of each video frame subsequence to obtain the subject text of the video to be extracted.

Description

Method, device and equipment for extracting video theme text and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a storage medium for extracting a video theme text.
Background
With the rapid development of the internet, more and more users not only obtain content from the network, but also share content in the network, such as self-media, Professional Generated Content (PGC) or User Generated Content (UGC). As the sources of videos become more abundant, the amount of video uploaded to playing platforms, including both long videos and short videos, is also increasing rapidly. Therefore, the amount of video to be processed by a playing platform keeps growing; for example, after the playing platform extracts the subject text of a video, it can further review the video and recommend it to users according to their preferences.
At present, playing platforms mainly use two methods to extract video theme texts. One is to watch the video manually, extract its key information as tags, classify the video, and so on. However, with the rapid increase of the video amount, manually watching videos requires a high labor cost, and during manual processing it is inevitable that key information is extracted incorrectly because the video is understood differently or is not watched carefully, which leads to wrong video tags or wrong video classification. The other is that the playing platform obtains the subject text of the video from the video title, video classification or video keywords provided by the user when the video is uploaded. However, this approach depends entirely on the user: if the user does not provide a video title, video category or video keywords, or provides inaccurate ones, the playing platform cannot obtain the exact subject text of the video. Therefore, the accuracy of extracting the video theme text is currently low.
Disclosure of Invention
The embodiment of the application provides a method, a device, equipment and a storage medium for extracting a video theme text, which are used for improving the accuracy of extracting the video theme text.
In a first aspect, a method for extracting a video subject text is provided, including:
acquiring a video frame sequence of a video to be extracted;
dividing the video frame sequence into at least one video frame subsequence according to the difference degree between every two adjacent video frames; the difference degree between adjacent video frames in each video frame subsequence is within a set first threshold value, the difference degree is in direct proportion to the number of difference pixel points between the adjacent video frames, and the difference pixel points comprise pixel points which meet a preset pixel difference condition at corresponding positions between the adjacent video frames;
respectively carrying out video frame text recognition on each video frame subsequence, and acquiring text information of each video frame subsequence based on a video frame text recognition result;
and performing fusion processing on the text information of each video frame subsequence to obtain the subject text of the video to be extracted.
In a second aspect, an apparatus for extracting a video theme text is provided, including:
an acquisition module: the method comprises the steps of obtaining a video frame sequence of a video to be extracted;
a segmentation module: the video frame sequence is divided into at least one video frame subsequence according to the difference degree between every two adjacent video frames; the difference degree between adjacent video frames in each video frame subsequence is within a set first threshold value, the difference degree is in direct proportion to the number of difference pixel points between the adjacent video frames, and the difference pixel points comprise pixel points which meet a preset pixel difference condition at corresponding positions between the adjacent video frames;
a processing module: the system comprises a video frame text recognition module, a text information module and a text information module, wherein the video frame text recognition module is used for respectively carrying out video frame text recognition on each video frame subsequence and acquiring the text information of each video frame subsequence based on the result of the video frame text recognition; and performing fusion processing on the text information of each video frame subsequence to obtain the subject text of the video to be extracted.
Optionally, the preset pixel difference condition includes:
the ratio of the absolute value of the pixel value difference of two pixels at corresponding positions between adjacent video frames to the sum of the pixel values is greater than a second threshold.
Optionally, the difference degree is a ratio of the number of difference pixels between adjacent video frames to the total number of pixels of the video frames.
Optionally, the processing module is further configured to: before acquiring the text information of each video frame subsequence, respectively perform audio recognition on the audio file corresponding to each video frame subsequence to obtain an audio recognition result; and,
the processing module is specifically configured to: and combining the audio recognition result and the video frame text recognition result to acquire text information of each video frame subsequence.
Optionally, the processing module is specifically configured to:
determining the similarity between the audio recognition result corresponding to each video frame subsequence and the text recognition result;
and when the similarity between the audio recognition result and the text recognition result corresponding to the video frame subsequence is greater than the preset similarity, combining the audio recognition result and the text recognition result to obtain the text information of the video frame subsequence.
Optionally, for each video frame subsequence, the processing module is specifically configured to:
sampling each video frame in the video frame subsequence to obtain at least one target video frame;
performing video frame text recognition on each target video frame to obtain a sub-text corresponding to each target video frame;
filtering each sub-text according to a preset filtering condition;
obtaining a video frame text recognition result based on each filtered sub-text;
and acquiring text information of each video frame subsequence based on the result of the video frame text identification.
Optionally, the processing module is specifically configured to include one or any combination of the following:
filtering preset keywords in each sub-text, wherein the preset keywords comprise keywords irrelevant to the video theme;
de-duplicating similar sub-texts according to the similarity between the sub-texts;
and filtering the low-frequency sub-texts according to the duration of each target video frame associated with the sub-texts.
Optionally, for two sub-texts, the processing module is specifically configured to:
converting the first sub-text and the second sub-text into a first character string and a second character string respectively;
adding characters, replacing characters or deleting characters on the first character string to convert the first character string into the second character string;
determining the similarity between the first sub-text and the second sub-text according to the minimum operation times required for converting the first character string into the second character string, wherein the operation times are in inverse proportion to the similarity;
and if the similarity between the first sub-text and the second sub-text is greater than a similarity threshold, filtering the first sub-text or the second sub-text.
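As an illustration, the edit-distance computation described above can be sketched as follows. This is a minimal sketch only; the normalization of the operation count into a similarity score in [0, 1] is an assumption, since the text only states that the number of operations is inversely proportional to the similarity, and the threshold value is illustrative.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of character additions, deletions and replacements turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete a character
                           cur[j - 1] + 1,              # add a character
                           prev[j - 1] + (ca != cb)))   # replace a character
        prev = cur
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Assumed normalization: fewer edit operations means higher similarity."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))

def filter_similar(first_sub_text: str, second_sub_text: str, threshold: float = 0.8):
    """Return the sub-texts to keep; drop one of them if they are too similar."""
    if similarity(first_sub_text, second_sub_text) > threshold:
        return [first_sub_text]            # either one may be filtered out
    return [first_sub_text, second_sub_text]
```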
Optionally, the processing module is specifically configured to:
inputting the text information of each video frame subsequence into a trained text probability model to obtain abstract information of each video frame subsequence; the abstract information is used for expressing the semantics of the text information with a preset amount of text; the trained text probability model is obtained by training based on the text information of each video frame subsequence;
inputting the abstract information of each video frame subsequence into a trained topic probability model to obtain a topic text of the video to be extracted; the trained topic probability model is obtained by training based on the abstract information of each video frame subsequence.
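The optional embodiment above does not disclose the internal structure of the text probability model or the topic probability model. The following sketch therefore uses TF-IDF sentence scoring and scikit-learn's Latent Dirichlet Allocation purely as stand-ins, under the assumption that the input texts are already whitespace-tokenized and non-empty; it only illustrates the two-stage summary-then-topic flow.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def summarize(text: str, max_sentences: int = 2) -> str:
    """Stand-in for the text probability model: keep the highest-scoring sentences."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    if len(sentences) <= max_sentences:
        return text
    scores = np.asarray(TfidfVectorizer().fit_transform(sentences).sum(axis=1)).ravel()
    top = sorted(np.argsort(scores)[-max_sentences:])    # keep original sentence order
    return ". ".join(sentences[i] for i in top)

def topic_words(summaries, n_topics: int = 1, n_words: int = 8):
    """Stand-in for the topic probability model: derive topic words from the summaries."""
    vectorizer = CountVectorizer().fit(summaries)
    counts = vectorizer.transform(summaries)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0).fit(counts)
    vocab = vectorizer.get_feature_names_out()
    return [[vocab[i] for i in comp.argsort()[::-1][:n_words]] for comp in lda.components_]

# Usage: one summary per video frame subsequence, then a fused set of topic words.
# topic = topic_words([summarize(t) for t in subsequence_texts])
```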
In a third aspect, a computer device comprises:
a memory for storing program instructions;
a processor for calling the program instructions stored in the memory and executing the method according to the first aspect according to the obtained program instructions.
In a fourth aspect, a storage medium stores computer-executable instructions for causing a computer to perform the method of the first aspect.
In the embodiment of the application, the video frame sequence is divided into at least one video frame subsequence, for example, the video can be divided into a plurality of segments according to the shooting scene of the video, so that the text information of each video frame subsequence is obtained in a targeted manner, for example, text recognition is performed on video frames in the same shooting scene, semantic information of texts in the shooting scene can be obtained more accurately, and the accuracy of the text information is improved. And the difference degree between the two video frames is determined only according to the number of the difference pixel points between the two video frames, and the difference image between the two video frames does not need to be analyzed, so that the process of determining the difference degree between the two video frames is simplified, and the efficiency of segmenting the video frame sequence is improved. And finally, fusing the text information of all the video frame subsequences to obtain the subject text of the video. The semantic information contained in the video is fully utilized to extract the subject text of the video, so that the sources of the extracted subject text are enriched, and the problem that the extracted subject text of the video is inaccurate due to the fact that the video title or video classification provided by a user who publishes the video is too limited is solved.
Drawings
Fig. 1 is a schematic diagram illustrating a principle of a method for extracting a video theme text according to an embodiment of the present application;
fig. 2a is an application scenario one of the method for extracting a video theme text according to the embodiment of the present application;
fig. 2b is a second application scenario of the method for extracting a video theme text according to the embodiment of the present application;
fig. 3 is a schematic flowchart of a method for extracting a video theme text according to an embodiment of the present application;
fig. 4 is a schematic diagram illustrating a principle of a method for extracting a video theme text according to an embodiment of the present application;
fig. 5 is a schematic diagram illustrating a principle of a method for extracting a video theme text according to an embodiment of the present application;
fig. 6 is a schematic diagram illustrating a principle of a method for extracting a video theme text according to an embodiment of the present application;
fig. 7 is a first schematic structural diagram of an apparatus for extracting a video theme text according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a device for extracting a video theme text according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
Some terms in the embodiments of the present application are explained below to facilitate understanding by those skilled in the art.
(1) Long Short-Term Memory network (LSTM):
LSTM is a type of time-recurrent neural network, specifically designed to solve the long-term dependency problem of ordinary Recurrent Neural Networks (RNNs), and is used to process and predict important events with very long intervals and delays in a time series.
(2)RNN:
A recurrent neural network is a type of neural network that takes sequence data as input, performs recursion in the evolution direction of the sequence, and connects all nodes (recurrent units) in a chain.
Embodiments of the present application relate to Artificial Intelligence (AI) and Machine Learning technologies, and are designed based on Computer Vision Technology (CV), Speech Technology (Speech Technology), Natural Language Processing (NLP) Technology, and Machine Learning (ML) in the AI.
Artificial intelligence is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making. The artificial intelligence technology mainly comprises a computer vision technology, a natural language processing technology, machine learning/deep learning and other directions.
Computer Vision (CV) is a science that studies how to make a machine "see"; more specifically, it uses cameras and computers instead of human eyes to identify, track and measure targets, and further performs image processing so that the processed image is more suitable for human eyes to observe or for transmission to an instrument for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can capture information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, simultaneous localization and mapping, and other technologies, as well as common biometric technologies such as face recognition and fingerprint recognition.
The key technologies of Speech Technology are Automatic Speech Recognition (ASR), Text-To-Speech synthesis (TTS) and voiceprint recognition. Enabling computers to listen, see, speak and feel is the development direction of future human-computer interaction, and speech is expected to become one of the most promising human-computer interaction modes.
Machine learning is a multi-field interdisciplinary subject, involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specializes in studying how a computer simulates or realizes human learning behavior in order to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning and inductive learning.
Natural language processing is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science integrating linguistics, computer science and mathematics. Research in this field involves natural language, i.e. the language people use every day, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graphs and the like.
Text processing is a main process in natural language processing technology and can be widely applied in various application scenarios. Identifying parallel sentences in a text is an important part of text processing. For example, in composition correction, if the parallel sentences in a composition can be identified, the composition can be evaluated more accurately in the literary-quality dimension.
With the research and progress of artificial intelligence technology, artificial intelligence is researched and applied in a plurality of fields, such as common smart homes, smart recommendation systems, virtual assistants, smart speakers, smart marketing, unmanned driving, automatic driving, robots, smart medical treatment and the like.
The application field of the technical scheme provided by the embodiment of the application is briefly introduced below.
With the rapid development of internet technology, more and more users can obtain content in the internet or share content in the internet, for example, a user can watch a video or publish a video on a video playing platform. After a user who releases a video uploads the video on a video playing platform, the video playing platform firstly needs to transcode the video to standardize a video file, and stores the meta-information of the video, so that the compatibility of the video is improved. After transcoding the video, the video playing platform needs to examine the video content to ensure the validity of the video content, and also needs to classify the video content to ensure that the relevant video can be recommended to the user watching the video according to the preference of the user watching the video, or to facilitate management of the video. After the video playing platform finishes auditing the video, the video can be published on the playing platform.
The review of a video can often be converted into the review of the video subject text, so the subject text of the video needs to be extracted accurately. There are generally two methods for extracting the video theme text. The first is to extract the video theme text by manually watching the video. The manual review mode requires the auditor to watch the video completely and understand its content correctly. However, as the upload volume grows, the number of videos uploaded to a video playing platform can reach millions, and the manual review mode is not only inefficient but also costly in labor. In addition, the videos uploaded to the video playing platform are rich and diverse, the auditors' ability to comprehend the video content cannot be guaranteed, whether the auditors watch the videos carefully cannot be guaranteed either, and the manual auditing mode places high requirements on the professional quality of the auditors.
The other method for extracting the video theme text is to extract it by machine. After receiving the video uploaded by the user who publishes the video, the video playing platform obtains the video title set by that user, the video category selected by that user, and so on. The video playing platform then performs semantic recognition on the video title and extracts the video subject text based on the video title set by the publishing user and the video classification selected by the publishing user. However, the video title set by the publishing user is often intended to attract viewers to click and watch, and may not accurately express the video content; for example, when a legal-looking video title is set for an illegal video, the video topic text cannot be accurately extracted. The same problem exists for the video classification selected by the publishing user. The accuracy of this method of extracting the video subject text depends completely on the user who publishes the video. Therefore, the methods for extracting the video theme text in the related art have low accuracy.
In order to solve the problems of low accuracy and the like of extracting a video theme text in the related technology, the application provides a method for extracting the video theme text. The method divides the video frame sequence according to the difference between adjacent video frames in the video frame sequence to obtain at least one video frame subsequence, thereby dividing the video into a plurality of segments according to the basis of different shooting scenes, different shooting targets or different environmental factors and the like in the video and extracting the text information of each video segment in a targeted manner. The shooting scene includes, for example, a bedroom, a living room, an indoor space, an outdoor space, a court or a mall, and the like. The photographic subject includes, for example, a user a, a user B, an animal, a plant, a still, or the like. The environmental factors include, for example, ambient brightness or weather. And finally, fusing the text information of all the video frame subsequences to obtain the subject text of the video. The semantic information contained in the video is fully utilized to extract the subject text of the video, so that the problem that the extracted subject text of the video is inaccurate due to the fact that the video title or video classification provided by a user who publishes the video is too limited is solved.
Please refer to fig. 1, which is a schematic diagram illustrating the principle of the method for extracting a video theme text. A video frame sequence of the video to be extracted is acquired, and the video frame sequence is divided into at least one video frame subsequence according to the difference degree between every two adjacent video frames in the video frame sequence. Video frame text recognition is performed on each video frame subsequence respectively to obtain text information of each video frame subsequence. The text information of each video frame subsequence is fused to obtain the subject text of the video to be extracted.
In the embodiment of the application, the video frame sequence is divided into at least one video frame subsequence, for example, the video can be divided into a plurality of segments according to the shooting scene of the video, so that the text information of each video frame subsequence is obtained in a targeted manner, for example, text recognition is performed on video frames in the same shooting scene, semantic information of texts in the shooting scene can be obtained more accurately, and the accuracy of the text information is improved. And finally, fusing the text information of all the video frame subsequences to obtain the subject text of the video. The semantic information contained in the video is fully utilized to extract the subject text of the video, so that the problem that the extracted subject text of the video is inaccurate due to the fact that the video title or video classification provided by a user who publishes the video is too limited is solved.
An application scenario of the method for extracting a video theme text provided by the present application is described below.
Please refer to fig. 2a, which is an application scenario of the method for extracting the video theme text. The application scenario includes a video providing device 101, a theme text extraction device 102, and a video processing device 103. Communication is possible between the video providing apparatus 101 and the subject text extracting apparatus 102, and communication is possible between the subject text extracting apparatus 102 and the video processing apparatus 103. The communication mode can be wired communication, for example, communication is performed through a connecting network line or a serial port line; the communication may also be wireless communication, such as bluetooth, and the like, and is not limited specifically.
The video providing apparatus 101 broadly refers to an apparatus, such as a terminal apparatus, a server, or a client, which can transmit a video to the subject text extraction apparatus 102. The terminal device can be a mobile phone, a desktop computer or a tablet computer. The server may be a local server of the subject text extraction device 102, or a third party server associated with the subject text extraction device 102, or a cloud server, etc. The client may be a third party application installed in the subject text extraction device 102 or a web page or the like accessible by the subject text extraction device 102.
The theme text extraction device 102 generally refers to a device that can extract the theme text of the video, for example, a terminal device, a server, a client, or the like. Video processing device 103 generally refers to a device that can process videos, such as sorting videos, recommending videos to a user, and so forth. The video processing device 103 may be a terminal device, a server or a client, etc.
As an embodiment, the video providing device 101 and the subject text extracting device 102 may be the same device, or the subject text extracting device 102 and the video processing device 103 may be the same device, or the video providing device 101, the subject text extracting device 102 and the video processing device 103 may be the same device, which is not limited specifically. In the embodiment of the present application, the video providing device 101, the theme text extraction device 102, and the video processing device 103 are different devices, and are described as an example.
The interaction between the devices is illustrated below based on fig. 2 a:
the video providing apparatus 101 may send a video to be extracted to the topic text extraction apparatus 102, and the topic text extraction apparatus 102 receives the video to be extracted sent by the video providing apparatus 101.
The theme text extraction device 102 obtains a video frame sequence of a video to be extracted, and divides the video frame sequence into at least one video frame subsequence according to the difference degree between every two adjacent video frames. The topic text extraction device 102 performs video frame text recognition on each video frame subsequence respectively to obtain text information of each video frame subsequence. The topic text extraction device 102 performs fusion processing on the text information of each video frame subsequence to obtain a topic text of the video to be extracted.
The subject text extraction device 102 sends the video to be extracted and the subject text of the video to be extracted to the video processing device 103, and the video processing device 103 receives the video to be extracted and the subject text of the video to be extracted sent by the subject text extraction device 102. The video processing device 103 classifies the video to be extracted according to the subject text of the video to be extracted, and recommends the video to be extracted to the relevant users according to the interest figures of the users.
Please refer to fig. 2b, which is an application scenario of the method for extracting the video theme text. The application scenario includes a video generating device 201, a first storage device 202, a second storage device 203, a scheduling device 204, a subject text extracting device 205 and a video processing device 206, which can communicate with each other.
The user who publishes the video can upload the video to be extracted through the front-end interface or the back-end interface of the video generating apparatus 201. The information uploaded together with the video to be extracted may further include a video title, a publisher, a video summary, a cover page, a publishing time, and the like. The scheduling device 204 stores the video to be extracted or the information uploaded together with the video to be extracted into the first storage device 202. The first storage device 202 is, for example, a video content storage server, i.e., a relational database. The scheduling device 204 stores meta information of the video to be extracted, such as video file size, cover map link, bit rate, video file format, video title, publisher, video summary, and publishing time, etc., into the second storage device 203. The second storage means 203 is for example a content database, i.e. a non-relational database. When a user watching a video acquires the video, the scheduling device 204 may determine index information of the video displayed to the user in the first storage device 202 according to the subject text of each video, download a streaming media file of the video from the first storage device 202, and play the video through a local player of the user watching the video. The subject text of each video is obtained by the subject text extraction means 205.
As an embodiment, the second storage device 203 further stores video classification or video tags of the video to be extracted. For example, regarding videos of brand a phones, the first level is classified as technology, the second level is classified as phones, the third level is classified as home phones, and the label is brand a phones and model.
As an embodiment, the scheduling device 204 may obtain the content stored in the first storage device 202 or the second storage device 203, and according to the theme text extracted by the theme text extraction device 205, rearrange the video, delete the repeatedly uploaded video or the video with plagiarism suspicion, and the like.
As an embodiment, the video processing apparatus 206 may recommend a relevant video to a user watching the video according to the interest figures of the user according to the subject text of the video; or displaying the related video to the user watching the video according to the keyword searched by the user.
In the embodiment of the present application, a method for extracting a video theme text by using the theme text extraction device 102 or the theme text extraction apparatus 205 in the two application scenarios is specifically described.
Referring to fig. 3, a flow of the method for extracting a video theme text is described in detail below, which is a schematic flow of the method for extracting a video theme text.
S301, acquiring a video frame sequence of the video to be extracted.
The video frame sequence comprises video frames which are arranged according to a time sequence, and each video frame in the video frame sequence is switched sequentially according to the time sequence to form a video.
S302, dividing the video frame sequence into at least one video frame subsequence according to the difference degree between every two adjacent video frames.
After obtaining the video frame sequence of the video to be extracted, determining the difference degree between every two adjacent video frames in the video frame sequence. The difference degree between two adjacent video frames is in direct proportion to the number of difference pixel points between the two adjacent video frames. The difference pixel point between two adjacent video frames is the pixel point which meets the preset pixel difference condition at the corresponding position between the two adjacent video frames. The difference pixel points can be pixel points which meet the preset pixel difference condition at corresponding positions between two adjacent complete video frames; or, the difference between the two adjacent video frames may be a pixel point which satisfies a preset pixel difference condition at a corresponding position between specified regions in the two adjacent video frames. For example, whether the pixel point at each corresponding position in two adjacent video frames meets the preset pixel difference condition is sequentially determined, or, when the designated area is a face area, whether the pixel point at each corresponding position in the face areas of two adjacent video frames meets the preset pixel difference condition is sequentially determined, and the like.
The process of dividing a sequence of video frames into at least one sub-sequence of video frames is described in detail below.
S1.1, determining difference pixel points between two adjacent video frames.
The preset pixel difference condition includes a plurality of conditions, and two of the conditions are described below as examples.
Preset pixel difference condition one:
and the absolute value of the pixel value difference of two pixel points at the corresponding position between the adjacent video frames is greater than a third threshold value.
If the pixel value difference of the two pixel points at the corresponding position between adjacent video frames is large, that is, when the absolute value of the difference is greater than the third threshold, the two pixel points at the corresponding position are determined to be difference pixel points.
According to the following formula (1), the difference pixel point between adjacent video frames is determined.
|T_m(i, j) − T_n(i, j)| > ε_1    (1)
Wherein i denotes the row index of a pixel point in the video frame and j denotes the column index of a pixel point in the video frame. m and n denote two adjacent video frames; T_m(i, j) denotes the pixel value of the pixel point at position (i, j) of the m-th frame, and T_n(i, j) denotes the pixel value of the pixel point at position (i, j) of the n-th frame. i is an integer greater than 0 and less than the total number of rows of pixel points in the video frame, and j is an integer greater than 0 and less than the total number of columns of pixel points in the video frame. ε_1 denotes the third threshold.
Preset pixel difference condition two:
the ratio of the absolute value of the pixel value difference of two pixels at the corresponding position between the adjacent video frames to the sum of the pixel values is larger than a second threshold value.
If the pixel value difference of two pixel points at the corresponding position between adjacent video frames is large, this may simply be because the pixel values at that position are large. Therefore, the pixel value difference of the two pixel points is normalized, and the difference pixel point between adjacent video frames is determined according to the pixel value difference of the two pixel points and the sum of their pixel values. The larger the ratio of the absolute value of the pixel value difference of the two pixel points at the corresponding position to the sum of their pixel values, the larger the change of the pixel values of the two pixel points at the corresponding position between the adjacent video frames.
According to the following formula (2), the difference pixel point between adjacent video frames is determined.
|T_m(i, j) − T_n(i, j)| / (T_m(i, j) + T_n(i, j)) > ε_2    (2)
Wherein i denotes the row index of a pixel point in the video frame and j denotes the column index of a pixel point in the video frame. m and n denote two adjacent video frames; T_m(i, j) denotes the pixel value of the pixel point at position (i, j) of the m-th frame, and T_n(i, j) denotes the pixel value of the pixel point at position (i, j) of the n-th frame. i is an integer greater than 0 and less than the total number of rows of pixel points in the video frame, and j is an integer greater than 0 and less than the total number of columns of pixel points in the video frame. ε_2 denotes the second threshold.
As an embodiment, the sum of the pixel values of the two pixel points may be replaced by the average of the pixel values of the two pixel points, that is, (T_m(i, j) + T_n(i, j)) / 2.
In one embodiment, the pixel values of the pixels in the video frames are different according to the color space of the video frame sequence. The color space is, for example, an RGB color space, a YUV color space, or the like.
For the RGB color space, the pixel values of the pixel points include three color channels of red, green, and blue, so the pixel value of the pixel point can be the sum of the values of the three color channels, and then the pixel value difference value of the two pixel points can be the sum of the difference values of the pixel values; or, the pixel values of the pixel points may include a value of each channel, and then the pixel value difference of the two pixel points may be the sum of differences of values of corresponding channels of the two pixel points, and the like. For the YUV color space, the pixel value of the pixel point includes a brightness and two chromaticities, and thus the pixel value of the pixel point may be the sum of the brightness and the chromaticity, or may include the brightness and the chromaticity, and the like.
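As an illustration of S1.1, the following NumPy sketch marks the difference pixel points between two adjacent frames under either of the two preset pixel difference conditions. It assumes frames of identical size; multi-channel frames are reduced by summing the channel values per pixel (one of the options described above), and the threshold values eps1 and eps2 are illustrative rather than values taken from this application.

```python
import numpy as np

def difference_pixel_mask(frame_m: np.ndarray, frame_n: np.ndarray,
                          eps1: float = 30.0, eps2: float = 0.2,
                          use_ratio: bool = True) -> np.ndarray:
    """Boolean mask marking the difference pixel points between two adjacent frames."""
    m = frame_m.astype(np.float64)
    n = frame_n.astype(np.float64)
    if m.ndim == 3:                        # e.g. RGB: take the per-pixel channel sum
        m, n = m.sum(axis=-1), n.sum(axis=-1)
    diff = np.abs(m - n)
    if use_ratio:                          # condition two: |Tm - Tn| / (Tm + Tn) > eps2
        return diff / (m + n + 1e-6) > eps2
    return diff > eps1                     # condition one: |Tm - Tn| > eps1
```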
S1.2, determining the difference degree between two adjacent video frames.
After the difference pixel point between two adjacent video frames is determined according to the preset pixel difference condition, the difference degree between the two adjacent video frames is determined according to the difference pixel point between the two adjacent video frames. The difference degree between two adjacent video frames is in direct proportion to the number of difference pixel points between the two adjacent video frames.
Determining the difference degree between two adjacent video frames according to the following formula (3);
d_{m,n} = Σ_{i,j} h_{m,n}(i, j)    (3)
wherein h_{m,n}(i, j) denotes a difference pixel point between the two video frames, and Σ_{i,j} h_{m,n}(i, j) denotes the number of difference pixel points between the m-th frame and the n-th frame; i is an integer greater than 0 and less than the total number of rows of pixel points in the video frame, and j is an integer greater than 0 and less than the total number of columns of pixel points in the video frame.
As an embodiment, if the number of difference pixels between two adjacent video frames is large, possibly because the video frames themselves include many pixels, in order to improve the applicability of the method for determining the difference degree between two adjacent video frames, the number of difference pixels may be normalized according to the total number of pixels in the video frames, that is, the difference degree between two adjacent video frames may be determined according to the number of difference pixels between two adjacent video frames divided by the total number of pixels in the video frames. Therefore, the difference degree between two adjacent video frames cannot be influenced by the total number of pixel points in the video frames, and the method is suitable for video frames of any size.
Determining the difference degree between two video frames according to the following formula (4);
d_{m,n} = Σ_{i,j} h_{m,n}(i, j) / (A × B)    (4)
wherein A is the total number of rows of pixel points in the video frame and B is the total number of columns of pixel points in the video frame.
S1.3, dividing the video frame sequence into at least one video frame subsequence.
After determining the degree of difference between every two video frames in the sequence of video frames, the sequence of video frames is divided into at least one subsequence of video frames according to the degree of difference between every two video frames.
The sequence of video frames is divided according to the following equation (5).
d_{n+1,n} − d_{n,n−1} > ε_3    (5)
wherein ε_3 denotes the fourth threshold.
The first video frame of the video frame sequence is taken as the first video frame of the first video frame subsequence. Then, starting from the second video frame of the video frame sequence, for each video frame in turn: if the difference between the difference degree between the current video frame and its next video frame and the difference degree between the current video frame and its previous video frame is determined to be greater than the fourth threshold, the video frame sequence is divided at this point; the current video frame is taken as the last video frame of the current video frame subsequence, and the next video frame is taken as the first video frame of the next video frame subsequence. When the last video frame of the video frame sequence is reached, it is taken as the last video frame of the last video frame subsequence, thereby obtaining each video frame subsequence.
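A minimal sketch of S1.3 follows, splitting a frame list into subsequences according to formula (5); it reuses difference_degree() from the sketch above, and the split threshold (the fourth threshold) is an illustrative value.

```python
def split_into_subsequences(frames, eps_split: float = 0.3):
    """frames: list of same-sized frame arrays; returns a list of video frame subsequences."""
    if len(frames) < 3:
        return [frames]
    # d[k] is the difference degree between frame k and frame k + 1, i.e. d_{k+1,k}
    d = [difference_degree(frames[k], frames[k + 1]) for k in range(len(frames) - 1)]
    subsequences, current = [], [frames[0]]
    for n in range(1, len(frames) - 1):
        current.append(frames[n])
        if d[n] - d[n - 1] > eps_split:   # formula (5): start a new subsequence after frame n
            subsequences.append(current)
            current = []
    current.append(frames[-1])
    subsequences.append(current)
    return subsequences
```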
And S303, respectively carrying out video frame text recognition on each video frame subsequence, and acquiring text information of each video frame subsequence based on the result of the video frame text recognition.
After obtaining each video frame subsequence, respectively performing video frame text recognition on each video frame in each video frame subsequence based on a computer vision technology in the field of artificial intelligence, and obtaining text information of each video frame subsequence based on a text recognition result of each video frame in the video frame subsequence; or, each video frame subsequence may be sampled, video frame text recognition may be performed on each sampled target video frame, and text information of each video frame subsequence may be acquired based on a result of the text recognition of each sampled target video frame.
The following description takes as an example sampling each video frame subsequence and performing video frame text recognition on each sampled target video frame; the principle of performing video frame text recognition on every video frame in each video frame subsequence is the same and is not repeated.
S2.1, sampling each video frame in the video frame subsequence.
Sampling video frames or I frames (Intra Picture) which represent the movement of a person or the change of an object in the video frame subsequence to obtain at least one target video frame.
As an example, the at least one target video frame may constitute the sequence of target video frames in chronological order. Each two adjacent target video frames in the sequence of target video frames may be two adjacent video frames in the subsequence of video frames or may be two non-adjacent video frames in the subsequence of video frames.
As an embodiment, for every two adjacent target video frames in the target video frame sequence, the duration of the chronologically earlier target video frame may be determined according to the number of video frames spaced between them in the video frame subsequence. The video frames spaced between every two adjacent target video frames can be regarded as transition or intermediate video frames associated with the earlier of the two target video frames; these transition or intermediate video frames improve the continuity of the change from one target video frame to another. Therefore, text recognition can be performed only on the target video frames, which reduces the amount of data to be processed and improves the efficiency of text recognition.
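As an illustration of S2.1, the sketch below samples target video frames from a subsequence and attributes the duration of the skipped transition/intermediate frames to the preceding target video frame. Sampling with a fixed stride and a fixed frame rate are assumptions made only for this illustration; the application itself mentions sampling frames that show movement or object changes, or I-frames.

```python
def sample_target_frames(subsequence, stride: int = 5, fps: float = 25.0):
    """Return (target_frame, duration_in_seconds) pairs for one video frame subsequence."""
    indices = list(range(0, len(subsequence), stride))
    targets = []
    for k, idx in enumerate(indices):
        next_idx = indices[k + 1] if k + 1 < len(indices) else len(subsequence)
        targets.append((subsequence[idx], (next_idx - idx) / fps))
    return targets
```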
And S2.2, performing video frame text recognition on each target video frame to obtain a sub text corresponding to each target video frame.
After at least one target video frame is obtained, text detection is carried out on each target video frame based on machine learning technology in the field of artificial intelligence, and whether the text is included in the target video frame or not is determined. If the target video frame does not comprise text, the text recognition is not carried out on the target video frame, and if the target video frame comprises text, the text recognition is continuously carried out on the target video frame.
As an embodiment, please refer to fig. 4, which is a schematic diagram illustrating a principle of text detection. Firstly, feature extraction is carried out on a target video frame, and at least one text region, such as a first text region, a second text region and a third text region, is obtained according to the feature extraction result. The number of the at least one text region may be preset by a user, or may be automatically determined by the device according to the feature extraction result, which is not limited in particular. The text region may be a pixel or a region including a plurality of pixels, and the text region indicates that the pixel in the text region is a pixel related to the sub-text corresponding to the target video frame. And secondly, aiming at each text region, continuously merging the pixel points around the text region according to the similarity between the pixel points, expanding each text region, and finally fusing all the text regions to obtain the text region of the sub-text corresponding to the complete target video frame. Compared with a mode of detecting a text through a rectangular box, the method and the device for detecting the text do not need to limit the appearance form of the sub-text corresponding to the target video frame, and have strong applicability to any font or handwritten text and the like.
In one embodiment, when detecting whether text is included in the target video frame, text detection may also be performed by using a trained PSENet model or the like.
After text detection is carried out on the target video frame, if the text is included in the target video frame, text recognition is continuously carried out on the target video frame based on natural language processing technology in the field of artificial intelligence. Please refer to fig. 5, which is a schematic diagram illustrating a principle of text recognition of a video frame. Firstly, feature extraction is carried out on a text region obtained through text detection by using a feature extraction network, and a text sequence of the text region is obtained. And secondly, performing vector compression on the text sequence of the text region by using a vector compression network, and obtaining a semantic vector after vector compression according to the generated forward hidden state and the reverse hidden state. Then, the semantic vector is restored to the text sequence based on the vector-compressed semantic vector and the at least one mapping parameter by using a mapping network. By introducing at least one mapping parameter, matching different positions of the semantic vector and the degree of association between different positions of the text sequence, the accuracy of determining the text sequence is improved.
The text recognition model is described below by taking, as an example, an EfficientNet model as the feature extraction network, a BiRNN model formed by two LSTM layers as the vector compression network, and a Multi-head Attention model as the mapping network.
The text region is input into the EfficientNet model to obtain a text sequence {x_1, x_2, …, x_n}. The text sequence is input into the BiRNN model and each hidden state is calculated. The first LSTM layer of the BiRNN model processes the text sequence from left to right to generate the forward hidden states (→h_1, →h_2, …, →h_n), and the second LSTM layer of the BiRNN model processes the text sequence from right to left to generate the reverse hidden states (←h_1, ←h_2, …, ←h_n). A semantic vector h_t = [→h_t; ←h_t] is obtained from the forward hidden state and the reverse hidden state. The semantic vector is input into the Multi-head Attention model and restored into a text sequence according to the mapping parameters, so as to obtain the sub-text of the target video frame.
Please refer to fig. 6, which is a schematic diagram illustrating a mapping network. And inputting the semantic vector into a mapping network, and restoring each text sequence through at least one mapping parameter. And fusing the text sequences into a final sub text.
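The recognition pipeline above (feature extraction, bidirectional compression, attention-based mapping) can be sketched in PyTorch as follows. The EfficientNet backbone is abstracted away (its output feature sequence is taken as input), and the layer sizes, vocabulary size and per-position classification head are assumptions of this sketch rather than details disclosed by the application.

```python
import torch
import torch.nn as nn

class RecognitionHead(nn.Module):
    """Bidirectional LSTM compression followed by multi-head attention mapping."""
    def __init__(self, feat_dim: int = 256, hidden: int = 128,
                 vocab_size: int = 6000, num_heads: int = 4):
        super().__init__()
        # forward and backward passes over the text sequence (the BiRNN of two LSTMs)
        self.birnn = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * hidden, num_heads, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, vocab_size)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, seq_len, feat_dim) from the feature extraction backbone
        semantic, _ = self.birnn(features)        # concatenated forward/reverse hidden states
        mapped, _ = self.attn(semantic, semantic, semantic)
        return self.classifier(mapped)            # per-position character logits
```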
In one embodiment, the duration of the video frame associated with the sub-text of a target video frame is the duration of that target video frame. If the sub-texts of a plurality of target video frames are the same, the duration of the video frame associated with that sub-text is the sum of the durations of these target video frames.
As an embodiment, after text detection is performed on each target video frame, it is determined whether the target video frame includes text. If the target video frame does not include any text, a text description can be generated for it by combining computer vision, natural language processing and machine learning technologies: the image is translated into text, establishing a connection between the two different modal spaces of images and texts. This can simulate the human behavior of describing a picture in words, and can even describe the emotions of people in the image through text, improving the flexibility and intelligence of obtaining the text of the target video frame.
The target video frame is input into the trained text description model, and the feature extraction network in the text description model performs feature extraction on the target video frame to obtain a feature vector representing the spatial information of the target video frame. After the feature vector of the target video frame is obtained, the sequence generation network in the text description model determines the keywords associated with the target video frame based on the feature vector and arranges the keywords into a sequence, so that the sub-text of the target video frame can be obtained.
The feature extraction network in the text description model can be built from a convolutional neural network and a fully connected network, and the sequence generation network in the text description model can be built from a recurrent neural network, so that the sequence generation network can memorize and maintain information over a long period.
The text description model can be trained on a large number of sample images that do not include text and the text labels corresponding to the sample images. A sample image not including text may be a text-free video frame of a video acquired from a network resource, and the text label may be obtained through manual or automatic annotation, or from the text associated with the sample image at its source, which is not specifically limited.
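As a non-authoritative illustration, the following Python (PyTorch) sketch shows one possible shape of such a text description model: a small convolutional network with a fully connected layer as the feature extraction network, and an LSTM as the sequence generation network that emits a word sequence conditioned on the frame feature. The class name CaptionModel, the vocabulary size and the layer sizes are assumptions of this sketch; the application does not prescribe a concrete architecture beyond the network types named above.

```python
import torch
import torch.nn as nn

class CaptionModel(nn.Module):
    """Hypothetical sketch of a text description model: CNN + fully connected encoder, recurrent decoder."""
    def __init__(self, vocab_size=8000, embed=256, hidden=512):
        super().__init__()
        # Feature extraction network: a small CNN followed by a fully connected layer.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, embed)
        # Sequence generation network: an LSTM that emits words one step at a time.
        self.embed = nn.Embedding(vocab_size, embed)
        self.lstm = nn.LSTM(embed, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, frame, caption_tokens):
        feat = self.fc(self.cnn(frame).flatten(1)).unsqueeze(1)      # (batch, 1, embed) image feature
        seq = torch.cat([feat, self.embed(caption_tokens)], dim=1)   # feed the image feature as the first step
        hidden_states, _ = self.lstm(seq)
        return self.out(hidden_states)                               # next-token logits

model = CaptionModel()
logits = model(torch.randn(1, 3, 224, 224), torch.randint(0, 8000, (1, 10)))
print(logits.shape)   # torch.Size([1, 11, 8000])
```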
S2.3, filter each sub-text according to preset filtering conditions.
Each acquired sub-text may include text irrelevant to the video theme or a large amount of repeated text, so each sub-text can be filtered by setting filtering conditions, thereby ensuring the accuracy and conciseness of the acquired video theme text. Three filtering conditions are described below as examples.
Filtering condition one:
Filter preset keywords in each sub-text.
The video may have text irrelevant to the video theme, such as watermark text or trademark text, and therefore, preset keywords may be set, and the preset keywords in each sub-text may be filtered.
For example, a user has shot a video through software A and distributed it through software B. The video includes the name watermark of software A, and the name of software A is actually unrelated to the video content, so the name text of software A can be deleted from the obtained sub-texts of each target video frame, which improves the accuracy and conciseness of the extracted theme text.
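A minimal sketch of filtering condition one is shown below, assuming the preset keywords are held in a simple list; the keyword values and the function name are made-up examples, not values from this application.

```python
# A minimal sketch of filtering condition one: remove preset keywords (e.g. watermark or brand
# names) from each sub-text. The keyword list here is a made-up example, not from the patent.
PRESET_KEYWORDS = ["SoftwareA", "sample watermark"]

def filter_keywords(sub_text, keywords=PRESET_KEYWORDS):
    for kw in keywords:
        sub_text = sub_text.replace(kw, "")
    return sub_text.strip()

print(filter_keywords("Cooking tutorial SoftwareA"))  # -> "Cooking tutorial"
```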
Filtering condition two:
De-duplicate similar sub-texts according to the similarity between the sub-texts.
Similar text, such as similar words or sentences, may exist in the sub-texts of the target video frames due to text recognition errors or similar wording, and therefore the similar sub-texts among the sub-texts can be filtered.
After the sub-texts are obtained, for each video frame subsequence, the similarity between every two sub-texts in the video frame subsequence is determined. If the similarity between two sub-texts is greater than the similarity threshold, the two sub-texts are determined to be similar to each other. The similar sub-texts are then de-duplicated: for example, among a plurality of similar sub-texts, only the similar sub-text whose associated video frame duration is longest is kept, and the similar sub-texts with shorter associated video frame durations are deleted; or only the similar sub-text of the target video frame that comes earliest in time order is retained, and the similar sub-texts of the later target video frames are deleted.
As an example, there are various methods for determining the similarity between two sub-texts, such as Euclidean distance, cosine similarity, or Levenshtein distance. The method of determining the similarity between two sub-texts is described below using the Levenshtein distance as an example.
The two sub-texts are converted into two character strings, operations such as adding, deleting or replacing characters are performed on the first character string, and the minimum number of operations required to convert the first character string into the second character string is counted. The similarity between the two sub-texts can be determined from this number of operations, with the number of operations inversely proportional to the similarity. If the number of operations required to convert the first character string into the second character string is within a preset range, the similarity between the two sub-texts is determined to be high, and the two sub-texts are regarded as similar sub-texts.
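The following sketch illustrates this measure with a standard dynamic-programming implementation of the Levenshtein distance and a similarity check; the threshold of 2 operations and the function names are illustrative assumptions only.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # delete a character from a
                           cur[j - 1] + 1,               # insert a character into a
                           prev[j - 1] + (ca != cb)))    # replace (or keep) a character
        prev = cur
    return prev[-1]

def are_similar(sub_text_a, sub_text_b, max_ops=2):
    # The number of edit operations is inversely proportional to similarity; max_ops is an
    # illustrative preset range, not a value taken from the patent.
    return levenshtein(sub_text_a, sub_text_b) <= max_ops

print(are_similar("today's recipe", "todays recipe"))    # True, one deletion apart
```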
Filtering condition three:
Filter low-frequency sub-texts according to the duration of the target video frames associated with each sub-text.
If a sub-text appears in the video for only a short time, for example in only 3 video frames, the sub-text can be regarded as a low-frequency sub-text that is unrelated to the video theme text, and therefore sub-texts whose associated duration is short can be deleted.
After the video frame duration associated with each sub-text in the video frame subsequence is obtained, the sub-texts can be sorted from largest to smallest by associated video frame duration, and all sub-texts ranked after a preset position can be deleted. Alternatively, all sub-texts whose associated video frame duration is less than a preset duration may be deleted. Or, after sorting the sub-texts from largest to smallest by associated video frame duration, the difference between the video frame durations of every two adjacent sub-texts is calculated, and if the difference is greater than a preset difference, all sub-texts ranked after that sub-text are deleted, and so on.
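As an illustrative sketch of filtering condition three, the snippet below keeps only the sub-texts whose associated video frame duration clears a minimum duration and sorts them by duration; the sample durations and the 1.0-second threshold are assumptions chosen for the example.

```python
# A minimal sketch of filtering condition three: drop low-frequency sub-texts by their associated
# video frame duration. The sample durations and the 1.0-second threshold are illustrative only.
def filter_low_frequency(durations, min_duration=1.0):
    """durations maps each sub-text to the total duration (in seconds) of its associated target frames."""
    kept = [(text, d) for text, d in durations.items() if d >= min_duration]
    kept.sort(key=lambda item: item[1], reverse=True)   # longest-lived sub-texts first
    return [text for text, _ in kept]

print(filter_low_frequency({"Episode 3: making dumplings": 12.4, "LIVE": 0.2}))
# -> ['Episode 3: making dumplings']
```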
After filtering the sub-text of each target video frame in each video frame sub-sequence, the text information of each video frame sub-sequence is obtained.
As an embodiment, before the text information of each video frame subsequence is obtained, audio recognition may be performed, based on speech technology in the field of artificial intelligence, on the audio file corresponding to each video frame subsequence, or on the audio file corresponding to each target video frame, so as to obtain an audio recognition result. After the text recognition result is obtained, for example after the sub-text of each target video frame is obtained, the text recognition result and the audio recognition result may be merged, and the text information of each video frame subsequence is obtained after the merged text is filtered.
The audio file is, for example, a dubbing file corresponding to the video frame sequence. For example, the playing time of the current target video frame in the video to be extracted is used as a start time, the playing time of the next target video frame in the video to be extracted is used as an end time, the audio information within this interval of the audio file of the video to be extracted is determined, the text recognized from this audio information is acquired, and the sub-text of the current target video frame is obtained.
For another example, according to the playing time corresponding to each video frame subsequence, the audio information corresponding to each video frame subsequence is determined in the audio file of the video to be extracted, and audio recognition is performed on the audio information to obtain the audio recognition result corresponding to each video frame subsequence.
As an embodiment, when merging the audio recognition result and the text recognition result, in order to avoid situations such as the audio recognition result being misaligned with the corresponding text recognition result at the content level, or the audio recognition result being unrelated to the text recognition result and causing content confusion, the similarity between the audio recognition result and the text recognition result corresponding to each video frame subsequence may be determined first. When the similarity between the audio recognition result and the text recognition result corresponding to a video frame subsequence is greater than a preset similarity, the audio recognition result and the text recognition result are merged to obtain the text information of the video frame subsequence. Alternatively, the similarity between the audio recognition result and the text recognition result corresponding to each target video frame may be determined first, and when that similarity is greater than the preset similarity, the two results are merged to obtain the text information of the video frame subsequence.
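The following sketch illustrates such a similarity-gated merge; the token-overlap similarity measure and the 0.5 preset similarity are stand-ins chosen for illustration, since this application does not prescribe a particular similarity measure for this step.

```python
# A minimal sketch of the similarity-gated merge: the audio recognition result for a subsequence
# is only merged with the text recognition result when the two agree closely enough. The token
# Jaccard similarity and the 0.5 threshold are stand-ins, not values prescribed by the patent.
def token_similarity(a, b):
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def merge_recognition(text_result, audio_result, preset_similarity=0.5):
    if token_similarity(text_result, audio_result) > preset_similarity:
        return f"{text_result} {audio_result}".strip()   # merged text information of the subsequence
    return text_result                                    # keep only the on-screen text otherwise

print(merge_recognition("how to fold dumplings", "today we learn how to fold dumplings"))
```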
S304, perform fusion processing on the text information of each video frame subsequence to obtain the subject text of the video to be extracted.
For each video frame subsequence, the text information of the subsequence is spliced according to the order of the video frames in the subsequence to obtain the spliced text information of the video frame subsequence. The spliced text information of each video frame subsequence is then spliced according to the order of the video frame subsequences in the video frame sequence to obtain the theme text of the video to be extracted.
As an example, if a video frame subsequence includes many video frames, the text information obtained for the video frame subsequence may contain a large amount of text. In order to avoid the theme text of the video to be extracted being insufficiently concise after the fusion processing, summary extraction, such as keyword extraction, can be performed on the text information of any video frame subsequence whose number of video frames is greater than a preset number. The text information of the video frame subsequence is updated to the extracted summary information, and the text information of each video frame subsequence is then fused to obtain the subject text of the video to be extracted.
Alternatively, if the text information of a video frame subsequence contains a large amount of text, in order to keep the obtained subject text of the video to be extracted concise, the text information of each video frame subsequence may be fused based on a text probability model and a topic probability model to determine the subject text of the video to be extracted.
The text information of each video frame subsequence is input into the trained text probability model. The text probability model first segments the text information into a plurality of words and obtains a word vector for each word. After the words in the text information and their word vectors are obtained, the text probability model can determine, for each of the various combination modes of at least one of the words, the probability that the combination can express the semantics of the text information of the video frame subsequence, combine at least one word according to the combination mode with the highest probability, and output the summary information of the video frame subsequence. The summary information characterizes the semantics of the text information within a preset text amount. The preset text amount may refer to the number of characters in the text information, or to the memory size occupied by the text information, and the like, and is not specifically limited.
After the summary information of each video frame subsequence is obtained, the summary information of the video frame subsequences may be input into the trained topic probability model. The topic probability model determines an information vector corresponding to each piece of summary information, determines, for each of the various combination modes of the pieces of summary information, the probability that the combination can express the semantics of the summary information of the video frame subsequences, combines the pieces of summary information according to the combination mode with the highest probability, and outputs the topic text of the video frame sequence.
As an embodiment, the text probability model can be trained in an unsupervised manner on the text information of each video frame subsequence: if the Markov chain of the obtained probability distribution converges, the trained text probability model is obtained; otherwise, the model parameters of the text probability model are adjusted and training continues. The topic probability model can be trained in an unsupervised manner on the summary information of each video frame subsequence; the training process is similar to that of the text probability model and is not repeated here.
As an example, the text probability model and the topic probability model may be constructed using a latent Dirichlet allocation (LDA) model.
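By way of illustration, the sketch below uses the scikit-learn LatentDirichletAllocation class as a stand-in for such a probability model, fitting two topics over a toy set of subsequence texts and printing the highest-probability words of each topic as a rough summary; the corpus, the topic count and the choice of library are assumptions of this sketch, not part of this application.

```python
# A minimal sketch of using LDA to summarize subsequence text into topic words, with
# scikit-learn as a stand-in implementation. The toy corpus, topic count and use of the
# top topic words as "summary information" are illustrative assumptions only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

subsequence_texts = [
    "fold the dumpling wrapper press the edges",
    "boil the dumplings three minutes serve hot",
    "mix the pork filling with ginger and soy sauce",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(subsequence_texts)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

vocab = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top_words = [vocab[i] for i in weights.argsort()[::-1][:3]]
    print(f"topic {k}: {top_words}")   # highest-probability words per topic
```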
Based on the same inventive concept, the embodiment of the present application provides a device for extracting a video theme text, which is equivalent to the theme text extraction device 102 or the theme text extraction device 205 discussed above, and can implement the corresponding functions of the method for extracting a video theme text. Referring to fig. 7, the apparatus includes an acquisition module 701, a segmentation module 702, and a processing module 703, wherein:
an acquisition module 701, configured to acquire a video frame sequence of a video to be extracted;
a segmentation module 702, configured to divide the video frame sequence into at least one video frame subsequence according to the difference degree between every two adjacent video frames; the difference degree between adjacent video frames in each video frame subsequence is within a set first threshold value, the difference degree is in direct proportion to the number of difference pixel points between the adjacent video frames, and the difference pixel points comprise pixel points at corresponding positions between the adjacent video frames that meet a preset pixel difference condition;
a processing module 703, configured to perform video frame text recognition on each video frame subsequence respectively, and acquire the text information of each video frame subsequence based on the result of the video frame text recognition; and perform fusion processing on the text information of each video frame subsequence to obtain the subject text of the video to be extracted.
In one possible embodiment, the preset pixel difference condition comprises:
the ratio of the absolute value of the pixel value difference of two pixels at corresponding positions between adjacent video frames to the sum of the pixel values is greater than a second threshold.
In one possible embodiment, the difference degree is a ratio of the number of difference pixels between adjacent video frames to the total number of pixels of the video frames.
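A minimal NumPy sketch of the pixel difference condition and difference degree described above is given below; the second threshold of 0.2 and first threshold of 0.3 are illustrative values only, not values prescribed by this application.

```python
import numpy as np

# A minimal sketch of the pixel difference condition and difference degree described above.
# The second_threshold and first_threshold values are illustrative assumptions only.
def difference_degree(frame_a, frame_b, second_threshold=0.2):
    a = frame_a.astype(np.float64)
    b = frame_b.astype(np.float64)
    # Pixel difference condition: |pa - pb| / (pa + pb) > second_threshold at corresponding positions.
    ratio = np.abs(a - b) / np.maximum(a + b, 1e-6)
    diff_pixels = np.count_nonzero(ratio > second_threshold)
    # Difference degree: number of difference pixels over the total number of pixels.
    return diff_pixels / ratio.size

frame1 = np.random.randint(0, 256, (4, 4), dtype=np.uint8)
frame2 = np.random.randint(0, 256, (4, 4), dtype=np.uint8)
is_same_subsequence = difference_degree(frame1, frame2) <= 0.3   # first_threshold, illustrative
print(is_same_subsequence)
```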
In a possible embodiment, the processing module 703 is further configured to: before acquiring text information of each video frame subsequence, respectively perform audio recognition on the audio file corresponding to each video frame subsequence to obtain an audio recognition result; and,
the processing module 703 is specifically configured to: and combining the audio recognition result and the video frame text recognition result to acquire the text information of each video frame subsequence.
In a possible embodiment, the processing module 703 is specifically configured to:
determining the similarity between the audio recognition result corresponding to each video frame subsequence and the text recognition result;
and when the similarity between the audio recognition result and the text recognition result corresponding to the video frame subsequence is greater than the preset similarity, combining the audio recognition result and the text recognition result to obtain the text information of the video frame subsequence.
In a possible embodiment, for each video frame sub-sequence, the processing module 703 is specifically configured to:
sampling each video frame in the video frame subsequence to obtain at least one target video frame;
performing video frame text recognition on each target video frame to obtain a sub-text corresponding to each target video frame;
filtering each sub-text according to a preset filtering condition;
obtaining a video frame text recognition result based on each filtered sub-text;
and acquiring text information of each video frame subsequence based on the result of the video frame text identification.
In a possible embodiment, the processing module 703 is specifically configured to perform one or any combination of the following:
filtering preset keywords in each sub-text, wherein the preset keywords comprise keywords irrelevant to the video theme;
de-duplicating the similar sub-texts according to the similarity between the sub-texts;
and filtering the low-frequency sub-texts according to the duration of each target video frame associated with the sub-texts.
In a possible embodiment, for two sub-texts, the processing module 703 is specifically configured to:
converting the first sub-text and the second sub-text into a first character string and a second character string respectively;
adding characters, replacing characters or deleting characters on the first character string to convert the first character string into a second character string;
determining the similarity between the first sub-text and the second sub-text according to the minimum operation times required for converting the first character string into the second character string, wherein the operation times are in inverse proportion to the similarity;
and if the similarity between the first sub text and the second sub text is larger than the similarity threshold value, filtering the first sub text or the second sub text.
In a possible embodiment, the processing module 703 is specifically configured to:
inputting the text information of each video frame subsequence into a trained text probability model to obtain abstract information of each video frame subsequence; the abstract information is used for characterizing the semantics of the text information within a preset text amount; the trained text probability model is obtained by training based on the text information of each video frame subsequence;
inputting the abstract information of each video frame subsequence into the trained topic probability model to obtain a topic text of the video to be extracted; the trained topic probability model is obtained by training based on the summary information of each video frame subsequence.
Based on the same inventive concept, the embodiment of the present application provides a computer device, and the computer device 800 is described below.
Referring to fig. 8, the apparatus for extracting video theme text may run on a computer device 800. A current version and a historical version of a program for extracting video theme text, as well as application software corresponding to the program for extracting video theme text, may be installed on the computer device 800. The computer device 800 includes a display unit 840, a processor 880 and a memory 820, where the display unit 840 includes a display panel 841 for displaying an interface for user interaction, and the like.
In one possible embodiment, the Display panel 841 may be configured in the form of a Liquid Crystal Display (LCD) or an Organic Light-Emitting Diode (OLED) or the like.
The processor 880 is configured to read the computer program and then execute a method defined by the computer program, for example, the processor 880 reads a program or a file or the like for extracting the video theme text, so that the program for extracting the video theme text is run on the computer device 800, and a corresponding interface is displayed on the display unit 840. The Processor 880 may include one or more general-purpose processors, and may further include one or more DSPs (Digital Signal processors) for performing relevant operations to implement the technical solutions provided in the embodiments of the present application.
Memory 820 typically includes both internal and external memory, which may be Random Access Memory (RAM), Read Only Memory (ROM), and CACHE memory (CACHE). The external memory can be a hard disk, an optical disk, a USB disk, a floppy disk or a tape drive. The memory 820 is used for storing a computer program including an application program and the like corresponding to each client, and other data, which may include data generated after an operating system or the application program is executed, including system data (e.g., configuration parameters of the operating system) and user data. The program instructions in the embodiments of the present application are stored in the memory 820, and the processor 880 executes the program instructions stored in the memory 820 to implement any one of the methods for extracting the video theme text discussed in the previous figures.
The display unit 840 is used to receive input numerical information, character information, or contact touch operation/non-contact gesture, and generate signal input related to user setting and function control of the computer device 800, and the like. Specifically, in the embodiment of the present application, the display unit 840 may include a display panel 841. The display panel 841, such as a touch screen, may collect touch operations of a user (e.g., operations of a user on the display panel 841 or on the display panel 841 using a finger, a stylus, or any other suitable object or accessory) thereon or nearby, and drive a corresponding connection device according to a preset program.
In one possible embodiment, the display panel 841 may include two parts: a touch detection device and a touch controller. The touch detection device detects the touch orientation of the user, detects the signal generated by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into touch point coordinates, and sends the touch point coordinates to the processor 880, and can receive and execute commands sent from the processor 880.
The display panel 841 can be implemented by various types, such as resistive, capacitive, infrared, and surface acoustic wave. In addition to the display unit 840, the computer device 800 may also include an input unit 830, the input unit 830 may include a graphical input device 831 and other input devices 832, wherein the other input devices may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and the like.
In addition to the above, computer device 800 may also include a power supply 890 for powering the other modules, audio circuitry 860, near field communication module 870, and RF circuitry 810. The computer device 800 may also include one or more sensors 850, such as acceleration sensors, light sensors, pressure sensors, and the like. The audio circuit 860 specifically includes a speaker 861, a microphone 862, and the like, for example, the computer device 800 may collect the sound of the user through the microphone 862 and perform corresponding operations.
For one embodiment, the number of the processors 880 may be one or more, and the processors 880 and the memory 820 may be coupled or relatively independent.
As an example, the processor 880 of fig. 8 may be used to implement the functions of the acquisition module, the segmentation module, and the processing module of fig. 7.
As an embodiment, the processor 880 in fig. 8 may be used to implement the corresponding functions of the apparatus for extracting video theme text discussed above.
Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: a mobile storage device, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
Alternatively, the integrated unit of the present invention may be stored in a computer-readable storage medium if it is implemented in the form of a software functional module and sold or used as a separate product. Based on such understanding, the technical solutions of the embodiments of the present invention may be essentially implemented or a part contributing to the prior art may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present invention. And the aforementioned storage medium includes: a removable storage device, a ROM, a RAM, a magnetic or optical disk, or various other media that can store program code.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (15)

1. A method for extracting a video theme text, comprising:
acquiring a video frame sequence of a video to be extracted;
dividing the video frame sequence into at least one video frame subsequence according to the difference degree between every two adjacent video frames; the difference degree between adjacent video frames in each video frame subsequence is within a set first threshold value, the difference degree is in direct proportion to the number of difference pixel points between the adjacent video frames, and the difference pixel points comprise pixel points which meet a preset pixel difference condition at corresponding positions between the adjacent video frames;
respectively carrying out video frame text recognition on each video frame subsequence, and acquiring text information of each video frame subsequence based on a video frame text recognition result;
and performing fusion processing on the text information of each video frame subsequence to obtain the subject text of the video to be extracted.
2. The method of claim 1, wherein the preset pixel difference condition comprises:
the ratio of the absolute value of the pixel value difference of two pixels at corresponding positions between adjacent video frames to the sum of the pixel values is greater than a second threshold.
3. The method of claim 1, wherein the difference degree is a ratio of a number of difference pixels between adjacent video frames to a total number of pixels of the video frames.
4. The method of claim 1, further comprising, prior to obtaining text information for each of the subsequences of video frames:
respectively carrying out audio recognition on the audio files corresponding to each video frame subsequence to obtain the result of the audio recognition, and,
the acquiring text information of each video frame subsequence based on the result of the video frame text recognition comprises:
and combining the audio recognition result and the video frame text recognition result to acquire text information of each video frame subsequence.
5. The method of claim 4, wherein merging the result of the audio recognition and the result of the video frame text recognition comprises:
determining the similarity between the audio recognition result corresponding to each video frame subsequence and the text recognition result;
and when the similarity between the audio recognition result and the text recognition result corresponding to the video frame subsequence is greater than the preset similarity, combining the audio recognition result and the text recognition result to obtain the text information of the video frame subsequence.
6. The method according to any one of claims 1 to 5, wherein the performing video frame text recognition on each video frame subsequence respectively and acquiring text information of each video frame subsequence based on the result of the video frame text recognition comprises, for each video frame subsequence:
sampling each video frame in the video frame subsequence to obtain at least one target video frame;
performing video frame text recognition on each target video frame to obtain a sub-text corresponding to each target video frame;
filtering each sub-text according to a preset filtering condition;
obtaining a video frame text recognition result based on each filtered sub-text;
and acquiring text information of each video frame subsequence based on the result of the video frame text identification.
7. The method according to claim 6, wherein the filtering processing is performed on each sub-text according to a preset filtering condition, and includes one or any combination of the following:
filtering preset keywords in each sub-text, wherein the preset keywords comprise keywords irrelevant to the video theme;
de-duplicating the similar sub-texts according to the similarity between the sub-texts;
and filtering the low-frequency sub-texts according to the duration of each target video frame associated with the sub-texts.
8. The method of claim 7, wherein the step of de-duplicating similar sub-texts according to the similarity between the sub-texts comprises, for two sub-texts:
converting the first sub-text and the second sub-text into a first character string and a second character string respectively;
adding characters, replacing characters or deleting characters on the first character string to convert the first character string into the second character string;
determining the similarity between the first sub-text and the second sub-text according to the minimum operation times required for converting the first character string into the second character string, wherein the operation times are in inverse proportion to the similarity;
and if the similarity between the first sub-text and the second sub-text is greater than a similarity threshold, filtering the first sub-text or the second sub-text.
9. The method according to claim 1, wherein the fusing the text information of each video frame subsequence to obtain the subject text of the video to be extracted comprises:
inputting the text information of each video frame subsequence into a trained text probability model to obtain abstract information of each video frame subsequence; the abstract information is used for characterizing the semantics of the text information within a preset text amount; the trained text probability model is obtained by training based on text information of each video frame subsequence;
inputting the abstract information of each video frame subsequence into a trained topic probability model to obtain a topic text of the video to be extracted; the trained topic probability model is obtained by training based on the abstract information of each video frame subsequence.
10. An apparatus for extracting a video theme text, comprising:
an acquisition module, configured to acquire a video frame sequence of a video to be extracted;
a segmentation module, configured to divide the video frame sequence into at least one video frame subsequence according to the difference degree between every two adjacent video frames; wherein the difference degree between adjacent video frames in each video frame subsequence is within a set first threshold value, the difference degree is in direct proportion to the number of difference pixel points between the adjacent video frames, and the difference pixel points comprise pixel points which meet a preset pixel difference condition at corresponding positions between the adjacent video frames; and
a processing module, configured to perform video frame text recognition on each video frame subsequence respectively, and acquire text information of each video frame subsequence based on the result of the video frame text recognition; and perform fusion processing on the text information of each video frame subsequence to obtain the subject text of the video to be extracted.
11. The apparatus of claim 10, wherein the processing module is further configured to:
before acquiring the text information of each video frame subsequence, respectively performing audio recognition on the audio file corresponding to each video frame subsequence to obtain an audio recognition result, and,
the processing module is specifically configured to:
and combining the audio recognition result and the video frame text recognition result to acquire text information of each video frame subsequence.
12. The apparatus according to claim 10 or 11, wherein the processing module is specifically configured to:
sampling each video frame in the video frame subsequence to obtain at least one target video frame;
performing video frame text recognition on each target video frame to obtain a sub-text corresponding to each target video frame;
filtering each sub-text according to a preset filtering condition;
obtaining a video frame text recognition result based on each filtered sub-text;
and acquiring text information of each video frame subsequence based on the result of the video frame text identification.
13. The apparatus according to claim 12, wherein the processing module is specifically configured to perform one or any combination of the following:
filtering preset keywords in each sub-text, wherein the preset keywords comprise keywords irrelevant to the video theme;
de-duplicating the similar sub-texts according to the similarity between the sub-texts;
and filtering the low-frequency sub-texts according to the duration of each target video frame associated with the sub-texts.
14. A computer device, comprising:
a memory for storing program instructions;
a processor for calling the program instructions stored in the memory and executing the method according to any one of claims 1 to 9 according to the obtained program instructions.
15. A storage medium having stored thereon computer-executable instructions for causing a computer to perform the method of any one of claims 1 to 9.
CN202011363335.1A 2020-11-27 2020-11-27 Method, device, equipment and storage medium for extracting video theme text Active CN113395578B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011363335.1A CN113395578B (en) 2020-11-27 2020-11-27 Method, device, equipment and storage medium for extracting video theme text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011363335.1A CN113395578B (en) 2020-11-27 2020-11-27 Method, device, equipment and storage medium for extracting video theme text

Publications (2)

Publication Number Publication Date
CN113395578A true CN113395578A (en) 2021-09-14
CN113395578B CN113395578B (en) 2023-06-30

Family

ID=77616559

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011363335.1A Active CN113395578B (en) 2020-11-27 2020-11-27 Method, device, equipment and storage medium for extracting video theme text

Country Status (1)

Country Link
CN (1) CN113395578B (en)

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AUPO525697A0 (en) * 1997-02-24 1997-03-20 Redflex Traffic Systems Pty Ltd Vehicle imaging and verification
CN101009343A (en) * 2006-01-25 2007-08-01 启萌科技有限公司 Luminescence device and its light guiding board
DE102007013811A1 (en) * 2007-03-22 2008-09-25 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. A method for temporally segmenting a video into video sequences and selecting keyframes for finding image content including subshot detection
CN102647436A (en) * 2011-02-21 2012-08-22 腾讯科技(深圳)有限公司 File releasing method and system based on point-to-point
CN103379283A (en) * 2012-04-27 2013-10-30 捷讯研究有限公司 Camera device with a dynamic touch screen shutter
CN104008377A (en) * 2014-06-07 2014-08-27 北京联合大学 Ground traffic sign real-time detection and recognition method based on space-time correlation
CN109218748A (en) * 2017-06-30 2019-01-15 京东方科技集团股份有限公司 Video transmission method, device and computer readable storage medium
WO2020119496A1 (en) * 2018-12-14 2020-06-18 深圳壹账通智能科技有限公司 Communication method, device and equipment based on artificial intelligence and readable storage medium
US20200242296A1 (en) * 2019-04-11 2020-07-30 Beijing Dajia Internet Information Technology Co., Ltd. Text description generating method and device, mobile terminal and storage medium
CN111104930A (en) * 2019-12-31 2020-05-05 腾讯科技(深圳)有限公司 Video processing method and device, electronic equipment and storage medium
CN111368141A (en) * 2020-03-18 2020-07-03 腾讯科技(深圳)有限公司 Video tag expansion method and device, computer equipment and storage medium
CN111507097A (en) * 2020-04-16 2020-08-07 腾讯科技(深圳)有限公司 Title text processing method and device, electronic equipment and storage medium

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114143479A (en) * 2021-11-29 2022-03-04 中国平安人寿保险股份有限公司 Video abstract generation method, device, equipment and storage medium
CN114143479B (en) * 2021-11-29 2023-07-25 中国平安人寿保险股份有限公司 Video abstract generation method, device, equipment and storage medium
CN114359810A (en) * 2022-01-11 2022-04-15 平安科技(深圳)有限公司 Video abstract generation method and device, electronic equipment and storage medium
WO2023134088A1 (en) * 2022-01-11 2023-07-20 平安科技(深圳)有限公司 Video summary generation method and apparatus, electronic device, and storage medium
CN114241471A (en) * 2022-02-23 2022-03-25 阿里巴巴达摩院(杭州)科技有限公司 Video text recognition method and device, electronic equipment and readable storage medium
CN114241471B (en) * 2022-02-23 2022-06-21 阿里巴巴达摩院(杭州)科技有限公司 Video text recognition method and device, electronic equipment and readable storage medium
WO2023195914A3 (en) * 2022-04-07 2023-11-30 脸萌有限公司 Processing method and apparatus, terminal device and medium
CN115334367A (en) * 2022-07-11 2022-11-11 北京达佳互联信息技术有限公司 Video summary information generation method, device, server and storage medium
CN115334367B (en) * 2022-07-11 2023-10-17 北京达佳互联信息技术有限公司 Method, device, server and storage medium for generating abstract information of video
CN115830519A (en) * 2023-03-01 2023-03-21 杭州遁甲科技有限公司 Intelligent lock message reminding method

Also Published As

Publication number Publication date
CN113395578B (en) 2023-06-30

Similar Documents

Publication Publication Date Title
CN113395578B (en) Method, device, equipment and storage medium for extracting video theme text
CN109447140B (en) Image identification and cognition recommendation method based on neural network deep learning
CN110737783B (en) Method and device for recommending multimedia content and computing equipment
WO2022184117A1 (en) Deep learning-based video clipping method, related device, and storage medium
CN113709561B (en) Video editing method, device, equipment and storage medium
US7853582B2 (en) Method and system for providing information services related to multimodal inputs
CN112015949B (en) Video generation method and device, storage medium and electronic equipment
CN116484318B (en) Lecture training feedback method, lecture training feedback device and storage medium
CN113569088A (en) Music recommendation method and device and readable storage medium
CN111783712A (en) Video processing method, device, equipment and medium
CN113766299A (en) Video data playing method, device, equipment and medium
Maybury Multimedia information extraction: Advances in video, audio, and imagery analysis for search, data mining, surveillance and authoring
CN111414506A (en) Emotion processing method and device based on artificial intelligence, electronic equipment and storage medium
US20110093264A1 (en) Providing Information Services Related to Multimodal Inputs
CN114282059A (en) Video retrieval method, device, equipment and storage medium
CN113572981B (en) Video dubbing method and device, electronic equipment and storage medium
CN114372172A (en) Method and device for generating video cover image, computer equipment and storage medium
CN116701706B (en) Data processing method, device, equipment and medium based on artificial intelligence
Fei et al. Learning user interest with improved triplet deep ranking and web-image priors for topic-related video summarization
CN117331460A (en) Digital exhibition hall content optimization method and device based on multidimensional interaction data analysis
CN111986259A (en) Training method of character and face detection model, auditing method of video data and related device
CN116910302A (en) Multi-mode video content effectiveness feedback visual analysis method and system
CN116977701A (en) Video classification model training method, video classification method and device
CN114419514B (en) Data processing method, device, computer equipment and storage medium
CN116955707A (en) Content tag determination method, device, equipment, medium and program product

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40053138

Country of ref document: HK

SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant