CN116597360A - Video processing method, system, device, medium, and program product


Info

Publication number
CN116597360A
Authority
CN
China
Prior art keywords
video
key frame
description information
candidate region
key
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310615111.2A
Other languages
Chinese (zh)
Inventor
尹君豪
杜春赛
康积华
杨晶生
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zitiao Network Technology Co Ltd
Original Assignee
Beijing Zitiao Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zitiao Network Technology Co Ltd filed Critical Beijing Zitiao Network Technology Co Ltd
Priority to CN202310615111.2A
Publication of CN116597360A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10 Character recognition
    • G06V 30/14 Image acquisition
    • G06V 30/1444 Selective acquisition, locating or processing of specific regions, e.g. highlighted text, fiducial marks or predetermined fields
    • G06V 30/19 Recognition using electronic means
    • G06V 30/19007 Matching; Proximity measures

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a video processing method, system, device, computer-readable storage medium, and computer program product. The method comprises: acquiring a video to be processed; determining a key frame in the video according to the video to be processed, the key frame being at least one of a plurality of image frames of the video; identifying the content of a candidate region of the key frame to obtain description information of the key frame; and presenting the description information of the key frame. By identifying the content of candidate regions of key frames in the video, the method obtains the description information of the key frames. In some specific application scenarios, such as a video conference, the title of the document currently being presented in the video can be obtained, so that a user can learn the description information of multiple key frames in the video, conveniently locate positions in the video, and watch the video efficiently.

Description

Video processing method, system, device, medium, and program product
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a video processing method, system, apparatus, computer readable storage medium, and computer program product.
Background
Video refers to dynamic images that are captured, recorded, processed, stored, transmitted, and reproduced as electrical signals. Video typically carries relatively rich information; for example, video content may include documents containing a large amount of text. Taking a conference scene as an example, participants can record a document presented during the conference to obtain a video of the document content, so that others can learn the conference content through the video, or the participants themselves can review the conference content through the video.
Users often have to watch a video from the beginning. If a user wants to view a specific portion of the content, the user needs to locate the corresponding point in time in the video and watch from that point. However, the user typically needs to perform multiple manual positioning operations to determine the specific point in time corresponding to the portion to be viewed, which is time-consuming and labor-intensive.
Therefore, there is a need for a video processing method that enables efficient video viewing.
Disclosure of Invention
The application provides a video processing method. The method enables locating positions in the video through the description information of key frames, making it convenient for the user to jump to the content to be watched and thereby watch the video efficiently. The application also provides a system, a device, a computer-readable storage medium, and a computer program product corresponding to the method.
In a first aspect, the present application provides a video processing method. The method comprises the following steps:
acquiring a video to be processed;
determining a key frame in the video according to the video to be processed, wherein the key frame is at least one of a plurality of image frames of the video;
identifying the content of the candidate region of the key frame to obtain the description information of the key frame;
and presenting the description information of the key frames.
In some possible implementations, the presenting the description information of the key frame includes:
presenting the description information of the key frames in a navigation area of the video; or
presenting the description information of the key frames on the time axis of the video.
In some possible implementations, the presenting the description information of the key frame on the time axis of the video includes:
based on the time information of the key frame in the video, the description information of the key frame is displayed in association with the time axis of the video.
In some possible implementations, the method further includes:
and responding to the triggering operation of the user on the description information of the key frame, positioning the video to a time point corresponding to the description information of the key frame, and playing the video from the time point.
In some possible implementations, the identifying the content of the candidate region of the key frame to obtain the description information of the key frame includes:
recognizing the content of the candidate region of the key frame through optical character recognition (OCR) to obtain a character recognition result of the candidate region;
and obtaining the description information of the key frame according to the character recognition result of the candidate region.
In some possible implementations, the obtaining the description information of the key frame according to the text recognition result of the candidate region includes:
filtering the character recognition result of the candidate region according to the set filtering condition;
and obtaining the description information of the key frame according to the filtered character recognition result.
In some possible implementations, the candidate region of the key frame includes a first candidate region and a second candidate region, the first candidate region has a higher priority than the second candidate region, the obtaining the description information of the key frame according to the text recognition result of the candidate region includes:
adjusting the character recognition result of the first candidate region by utilizing the character recognition result of the second candidate region;
and obtaining the description information of the key frame according to the character recognition result of the adjusted first candidate region.
In some possible implementations, the obtaining the description information of the key frame according to the text recognition result of the adjusted first candidate region includes:
and obtaining the description information of the key frame according to the character recognition result of the adjusted first candidate region and the description information of the adjacent key frame adjacent to the key frame.
In some possible implementations, the candidate region of the key frame includes a first candidate region and a second candidate region, the first candidate region has a higher priority than the second candidate region, the first candidate region is empty, the obtaining the description information of the key frame according to the text recognition result of the candidate region includes:
performing character matching according to the character recognition result of the second candidate region;
if a matching result is obtained, determining the matching result as the description information of the key frame;
otherwise, determining the description information of the previous key frame of the key frame as the description information of the key frame.
In some possible implementations, the determining, according to the video to be processed, a key frame in the video includes:
extracting frames from the video at set time intervals to obtain a plurality of image frames of the video;
classifying a plurality of image frames of the video, and de-duplicating the plurality of image frames of the video, the key frames being determined from the plurality of image frames of the video.
In some possible implementations, the video to be processed is generated based on a real-time interactive process.
In some possible implementations, the determining, according to the video to be processed, a key frame in the video includes:
determining candidate key frames from the video to be processed based on preset user behaviors in the real-time interaction process and/or in response to at least one image frame in the video to be processed including preset identification information;
and determining a key frame from the candidate key frames according to the semantic features of the candidate key frames.
In a second aspect, the present application provides a video processing system. The system comprises:
the acquisition unit is used for acquiring the video to be processed;
a determining unit, configured to determine a key frame in the video according to the video to be processed, where the key frame is at least one of a plurality of image frames of the video;
the identification unit is used for identifying the contents of the candidate areas of the key frames and obtaining the description information of the key frames;
and the presentation unit is used for presenting the description information of the key frames.
In some possible implementations, the presentation unit is specifically configured to:
presenting the description information of the key frames in a navigation area of the video; or
presenting the description information of the key frames on the time axis of the video.
In some possible implementations, the presentation unit is specifically configured to:
based on the time information of the key frame in the video, the description information of the key frame is displayed in association with the time axis of the video.
In some possible implementations, the system further includes:
and the positioning unit is used for responding to the triggering operation of the user on the description information of the key frame and positioning the video to a time point corresponding to the description information of the key frame so as to enable the video to be played from the time point.
In some possible implementations, the identification unit is specifically configured to:
recognizing the content of the candidate region of the key frame through optical character recognition (OCR) to obtain a character recognition result of the candidate region;
and obtaining the description information of the key frame according to the character recognition result of the candidate region.
In some possible implementations, the identification unit is specifically configured to:
filtering the character recognition result of the candidate region according to the set filtering condition;
and obtaining the description information of the key frame according to the filtered character recognition result.
In some possible implementations, the candidate region of the key frame includes a first candidate region and a second candidate region, the first candidate region has a higher priority than the second candidate region, and the identifying unit is specifically configured to:
adjusting the character recognition result of the first candidate region by utilizing the character recognition result of the second candidate region;
and obtaining the description information of the key frame according to the character recognition result of the adjusted first candidate region.
In some possible implementations, the identification unit is specifically configured to:
and obtaining the description information of the key frame according to the character recognition result of the adjusted first candidate region and the description information of the adjacent key frame adjacent to the key frame.
In some possible implementations, the candidate region of the key frame includes a first candidate region and a second candidate region, the first candidate region has a higher priority than the second candidate region, the first candidate region is empty, and the identifying unit is specifically configured to:
performing character matching according to the character recognition result of the second candidate region;
if a matching result is obtained, determining the matching result as the description information of the key frame;
otherwise, determining the description information of the previous key frame of the key frame as the description information of the key frame.
In some possible implementations, the determining unit is specifically configured to:
extracting frames from the video at set time intervals to obtain a plurality of image frames of the video;
classifying a plurality of image frames of the video, and de-duplicating the plurality of image frames of the video, the key frames being determined from the plurality of image frames of the video.
In some possible implementations, the video to be processed is generated based on a real-time interactive process.
In some possible implementations, the determining unit is specifically configured to:
determining candidate key frames from the video to be processed based on preset user behaviors in the real-time interaction process and/or in response to at least one image frame in the video to be processed including preset identification information;
and determining a key frame from the candidate key frames according to the semantic features of the candidate key frames.
In a third aspect, the present application provides an electronic device comprising a processor and a memory. The processor and the memory communicate with each other. The processor is configured to execute instructions stored in the memory to cause the electronic device to perform the video processing method as in the first aspect or any implementation of the first aspect.
In a fourth aspect, the present application provides a computer readable storage medium having stored therein instructions for instructing an electronic device to execute the video processing method according to the first aspect or any implementation manner of the first aspect.
In a fifth aspect, the present application provides a computer program product comprising instructions which, when run on an electronic device, cause the electronic device to perform the video processing method of the first aspect or any implementation of the first aspect.
The implementations provided in the above aspects of the present application may be further combined to provide additional implementations.
From the above technical solutions, the embodiment of the present application has the following advantages:
the embodiment of the application provides a video processing method. Firstly, acquiring a video to be processed, then determining a key frame in the video according to the video to be processed, wherein the key frame is at least one of a plurality of image frames of the video, then identifying the content of a candidate area of the key frame, acquiring description information of the key frame, and presenting the description information of the key frame. According to the method, the content of the candidate area of the key frame in the video is identified, the description information of the key frame can be obtained, and in some specific application scenes, such as a video conference scene, the title of the document currently demonstrated by the video can be obtained, so that a user can know the description information of a plurality of key frames in the video, the user can conveniently locate the video, and the video can be watched efficiently.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings used in the embodiments will be briefly described below.
Fig. 1 is a schematic flow chart of a video processing method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a candidate region of a key frame according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a character recognition result of a candidate region according to an embodiment of the present application;
FIG. 4 is a schematic flow chart of determining description information of a key frame according to an embodiment of the present application;
FIG. 5A is a schematic diagram illustrating presentation of description information according to an embodiment of the present application;
FIG. 5B is a schematic diagram illustrating presentation of description information according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a video processing system according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The terms "first", "second" in embodiments of the application are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature.
Some technical terms related to the embodiments of the present application will be described first.
Video refers to dynamic images that are captured, recorded, processed, stored, transmitted, and reproduced as electrical signals. According to the persistence-of-vision principle, when continuous images change at a rate of more than 24 frames per second, the human eye cannot distinguish the individual static pictures, and the sequence of static pictures appears as a smooth, continuous visual effect, thereby producing video.
With the rapid development of computer technology, video has gradually entered daily life. For example, a user may record video with a camera, a mobile phone, or other devices. At work, video usually carries rich information, so using video can improve work efficiency.
Next, a conference scene at work is taken as an example. With the rapid development of internet technology, work conferences have gradually shifted from face-to-face offline meetings to more flexible video conferences and cloud conferences. In a video conference, participants access the conference via the internet (e.g., through a video conferencing application).
In a video conference, participants can share, present, and explain documents by sharing their screens. Furthermore, participants can record the conference, obtaining a video that captures the document content. Participants can review the conference content by watching the video, and others can learn about the conference content by watching the video.
Users (who may be participants or others) typically have to watch the video from the beginning. If a user wants to view a specific portion of the content, the user needs to locate the corresponding point in time in the video and watch from that point. For example, to view the third part of a document, the user needs to skip the first and second parts, locate the point in time corresponding to the third part, and start watching from there.
However, since the user cannot know the content of the video in advance, the user often needs to perform manual positioning multiple times to determine the specific point in time corresponding to the portion to be viewed. For example, the user first seeks to 1/2 of the video timeline and finds that the video is presenting the fourth part of the document; the user then seeks to 1/4 of the timeline and finds the second part; the user finally seeks to 1/3 of the timeline and thereby determines the point in time corresponding to the third part of the document. It can be seen that accurately locating the point in time to be viewed in this way is difficult, time-consuming, and labor-intensive.
In view of this, the embodiment of the present application provides a video processing method. First, a video to be processed is acquired; then a key frame in the video is determined according to the video to be processed, the key frame being at least one of a plurality of image frames of the video; then the content of a candidate region of the key frame is identified to obtain the description information of the key frame, and the description information of the key frame is presented.
According to the method, by identifying the content of candidate regions of key frames in the video, the description information of the key frames can be obtained. In some specific application scenarios, such as a video conference, the title of the content currently being presented and explained in the video can be obtained, so that a user can learn the description information of multiple key frames in the video, conveniently locate positions in the video, and watch the video efficiently.
In order to facilitate understanding of the technical solution provided by the embodiments of the present application, the following description will be given with reference to the accompanying drawings.
Referring to fig. 1, a flowchart of a video processing method according to an embodiment of the present application is shown, where the video processing method may be performed by a video processing system. The method specifically comprises the following steps:
s102: the video processing system acquires a video to be processed.
The video to be processed may be a video that includes multiple parts of content, for example, multiple parts of content under the same topic, or multiple parts of content corresponding to different topics.
In some possible implementations, the video to be processed may be generated based on a real-time interactive process. In some embodiments, the video to be processed may be generated based on video conference scenes, such as video recorded by participants in a video conference. In other embodiments, the video to be processed may be generated based on a live scene, such as video generated by a host during a live broadcast.
In the embodiment of the application, the video source to be processed is not limited. In some possible implementations, the video to be processed may be video stored in an application server (e.g., conference application server, live application server). In other possible implementations, the video to be processed may be video stored in a personal terminal (e.g., may be a personal computer).
S104: the video processing system determines key frames in the video according to the video to be processed.
A key frame refers to at least one image frame that is relevant to the video content. For example, for a video conference scene, a key frame may be at least one image frame in which a document is being presented or explained. For another example, for a live scene, a key frame may be at least one image frame in which the host introduces a commodity.
Specifically, the video processing system may extract frames from the video to be processed at set time intervals to obtain a plurality of image frames of the video, then classify the plurality of image frames of the video and de-duplicate them, and determine the key frames from the plurality of image frames of the video.
In order to improve video processing efficiency, the video processing system may first perform frame extraction on a video to be processed according to actual requirements of a user. For example, the video processing system may perform frame extraction at 3 second intervals to obtain a plurality of image frames of the video.
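As an illustration of the frame extraction described above, a minimal sketch using OpenCV is given below; the library choice, function name, and fallback frame rate are assumptions for illustration, not part of the described method.

```python
import cv2

def extract_frames(video_path: str, interval_seconds: float = 3.0):
    """Sample one frame every `interval_seconds` from the video (illustrative sketch)."""
    capture = cv2.VideoCapture(video_path)
    fps = capture.get(cv2.CAP_PROP_FPS) or 25.0  # assume 25 fps if the metadata is missing
    step = max(int(round(fps * interval_seconds)), 1)
    frames, index = [], 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if index % step == 0:
            frames.append((index / fps, frame))  # (timestamp in seconds, image)
        index += 1
    capture.release()
    return frames
```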
Considering that the video to be processed may contain portions that do not need to be processed, the video processing system may classify and de-duplicate the image frames of the video to determine the key frames. Specifically, the video processing system may input the plurality of image frames of the video into a classification model to obtain the image categories of the image frames, and determine the image frames of the target image categories from the plurality of image frames of the video. Since the description information of a key frame is usually presented in the form of text, the image categories may include a document category and other categories, and the target image category may be the document category. In this way, the video processing system can determine the image frames of the document category.
In the embodiment of the present application, the classification model is not limited. For example, the classification model may be a K-nearest-neighbor (KNN) model, a naive Bayes model, a decision tree model, or another classification model.
The video processing system may then de-duplicate the image frames of the target image categories. In some embodiments, the video processing system may use the distances between these image frames to perform de-duplication, removing duplicate image frames of the same target image category and obtaining the key frames of the video, thereby saving subsequent computing resources.
In the embodiment of the present application, the distance between the image frames of the target image categories is not limited. For example, the distance may be a Euclidean distance, a Hamming distance, a Manhattan distance, or another distance.
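A rough sketch of the distance-based de-duplication idea follows, assuming an average-hash feature compared with a Hamming distance; the feature, threshold value, and function names are assumptions for illustration only.

```python
import cv2
import numpy as np

def average_hash(image: np.ndarray, size: int = 8) -> np.ndarray:
    """Average hash: an 8x8 grayscale thumbnail thresholded at its mean (illustrative)."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    small = cv2.resize(gray, (size, size), interpolation=cv2.INTER_AREA)
    return (small > small.mean()).flatten()

def deduplicate(frames, hamming_threshold: int = 5):
    """Keep a frame only if it differs enough from the last kept frame (assumed threshold)."""
    kept, last_hash = [], None
    for timestamp, image in frames:
        h = average_hash(image)
        if last_hash is None or int(np.count_nonzero(h != last_hash)) > hamming_threshold:
            kept.append((timestamp, image))
            last_hash = h
    return kept
```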
In some possible implementations, the video processing system may also remove image frames corresponding to periods of no human speech in the video to be processed. It will be appreciated that the image frames corresponding to periods of no human speech in the video to be processed are typically not associated with the video content. For example, for a video conference scenario, an image frame corresponding to a time period when no person is speaking in the video to be processed means that no document explanation or presentation is performed by the participants. Therefore, in order to avoid resource waste, it is unnecessary to obtain the description information of the key frame corresponding to the above-mentioned time period.
It should be noted that, in the embodiment of the present application, the order of performing the classification of the plurality of image frames of the video and the de-duplication of the plurality of image frames of the video is not limited. In some possible implementations of the embodiments of the present application, the video processing system may perform de-duplication on multiple image frames of the video, remove the same image frame, and then classify the multiple image frames from which the same image frame is removed, so as to obtain a keyframe of the video.
In other embodiments, the video processing system may determine candidate key frames in the video based on the video to be processed, and determine key frames from the candidate key frames based on semantic features of the candidate key frames.
Specifically, the candidate key frames may be determined based on preset user behaviors during the real-time interaction. The preset user behaviors may differ across video scenes. For example, in a live broadcast scenario, the preset user behavior may be listing a commodity, delisting a commodity, or issuing a coupon. For another example, in a video conference scenario, the preset user behavior may be triggering a document presentation, ending a document presentation, or browsing or editing behavior during the presentation of a document.
In addition, the video processing system may also determine candidate key frames from the video to be processed in response to at least one image frame in the video to be processed including preset identification information. In particular, the video processing system may match at least one image frame in the video with a preset identification template, and determine the at least one image frame as a key frame when the matching result indicates that the at least one image frame includes preset identification information.
Taking a live broadcast scene as an example, the identification template in this scene may be a template describing a commodity identifier, for example a uniform resource locator (URL) characterizing the commodity, or a template characterizing the storage address of a certain type of presented content. By matching image frames against the identification template, the image frames that include the identification information can be screened out, and the key frames of the video are thereby determined.
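As a loose illustration of matching a frame against a preset identification template, the sketch below checks whether a frame's recognized text contains a commodity URL pattern; the regular expression and function name are assumptions.

```python
import re

# Assumed example pattern for a commodity URL; the real identification template is scene-specific.
COMMODITY_URL_PATTERN = re.compile(r"https?://\S*(item|product)\S*", re.IGNORECASE)

def contains_identification(ocr_text: str) -> bool:
    """Return True if the recognized text of a frame matches the assumed identification template."""
    return COMMODITY_URL_PATTERN.search(ocr_text) is not None
```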
S106: the video processing system identifies the contents of candidate areas of the key frames and obtains the description information of the key frames.
The description information of a key frame refers to information describing the content included in the key frame. For example, for a video conference scenario, the description information of the key frame may be information such as the title, abstract, or subject of the document content included in the key frame. For another example, for a live scene, the description information of the key frame may be information such as the name of the recommended commodity or the merchant shown in the key frame.
The candidate region of a key frame refers to a region where the description information of the key frame may appear. In some possible implementations, the video processing system may obtain the candidate regions of the key frame through an object detection model. The object detection model may include a deep-learning-based object detection algorithm. For example, the object detection algorithm may be a one-stage object detector or a two-stage object detector, which is not limited in the embodiments of the present application.
For visual understanding, a video conference scenario will be taken as an example, and the candidate regions of the key frames will be described in detail with reference to fig. 2.
Referring to the schematic diagram of candidate regions of a key frame shown in fig. 2, key frame 200 is a key frame determined in a video conference scenario. The key frame 200 may include a participant list area 201, a sharing prompt area 202, and a shared screen area 203.
Specifically, the participant list area 201 is used to show the people participating in the video conference; for example, the participants may be AA, BB, and CC. The participant list area 201 may also be used to present information such as the participants' names and avatars, and may also indicate the participant who is currently speaking. For example, the region frame corresponding to the speaking participant may be displayed in bold.
The sharing prompt area 202 is used to display the information of the participants who are sharing the screen, for example, by displaying the text of "screen of AA", indicating that the participants who are sharing the screen are AA. The shared screen area 203 is used to show the screen being shared, for example, the screen being shared by the attendees AA is a document.
The video processing system may obtain the candidate regions of the key frame 200 by performing detection on the shared screen area 203. Specifically, the candidate regions of the key frame 200 may include a document name region 2041, a directory column region 2042, a directory highlight header region 2043, an upper half screen header region 2044, and other header regions 2045.
The document name region 2041 is located at the upper-left part of the shared screen area 203, the directory column region 2042 is located at the left part of the shared screen area 203, the directory highlight header region 2043 is located inside the directory column region 2042, the upper half screen header region 2044 is located in the upper half of the shared screen area 203, and the other header regions 2045 are located in the parts of the shared screen area 203 other than the upper half.
In some embodiments, the video processing system may perform detection according to the principle that the shortest side length is largest, so as to obtain the header regions with the largest font size in the shared screen area 203, determine the header regions located in the upper half of the shared screen area 203 as the upper half screen header region 2044, and determine the header regions located in the parts of the shared screen area 203 other than the upper half as the other header regions 2045.
Further, to facilitate locating candidate regions in the shared screen area 203, the video processing system may also select the uppermost header region among the candidate regions according to the principle that the y-axis coordinate of the upper-left vertex is smallest, add to the uppermost header regions any upper half screen header region whose upper-left-vertex y-axis coordinate is smaller than that of the uppermost header region plus a preset threshold (e.g., 50), and, among the uppermost header regions, select the regions with the smallest and largest upper-left-vertex x-axis coordinates as the upper-left corner header region and the upper-right corner header region, respectively.
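A sketch of these coordinate-based heuristics is given below, assuming each detected title region is represented by its upper-left vertex coordinates; the 50-pixel threshold follows the example above, while the data layout and function name are assumptions.

```python
def select_top_title_regions(title_regions, y_threshold: float = 50.0):
    """Pick the uppermost title region(s) and the left/right-most among them (illustrative).

    Each region is assumed to be a dict with 'x' and 'y' giving its upper-left vertex.
    """
    if not title_regions:
        return []
    top_y = min(region["y"] for region in title_regions)
    # Regions whose upper-left y coordinate is within the threshold of the uppermost one.
    uppermost = [r for r in title_regions if r["y"] <= top_y + y_threshold]
    left = min(uppermost, key=lambda r: r["x"])   # upper-left corner title region
    right = max(uppermost, key=lambda r: r["x"])  # upper-right corner title region
    return [left, right] if left is not right else [left]
```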
After obtaining the candidate region, the video processing system may further identify the candidate region to obtain descriptive information of the key frame.
In some possible implementations, the video processing system may identify the content of the candidate region of the key frame through optical character recognition (OCR) to obtain a text recognition result of the candidate region, and then obtain the description information of the key frame according to the text recognition result of the candidate region. OCR refers to the process of recognizing the shapes in the image content and converting them into text in text format using character recognition methods.
In order to improve recognition accuracy, the video processing system may first pre-process the candidate regions of the key frame and then recognize them through OCR. In some embodiments, the video processing system may pre-process the bounding boxes (bbox) of the candidate regions. For example, the video processing system may add 10 pixels to the left and right of the bbox along the x-axis and pad the candidate region into a square with white pixels, so that the square candidate region is recognized by OCR, thereby improving recognition accuracy.
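A minimal sketch of this pre-processing, assuming OpenCV and a bbox given as integer (x1, y1, x2, y2) coordinates; the exact representation is an assumption.

```python
import cv2
import numpy as np

def preprocess_candidate_region(frame: np.ndarray, bbox) -> np.ndarray:
    """Widen the bbox by 10 pixels on each side and pad the crop into a white square."""
    x1, y1, x2, y2 = bbox
    x1 = max(x1 - 10, 0)
    x2 = min(x2 + 10, frame.shape[1])
    crop = frame[y1:y2, x1:x2]
    height, width = crop.shape[:2]
    side = max(height, width)
    square = np.full((side, side, 3), 255, dtype=np.uint8)  # white canvas
    y_off, x_off = (side - height) // 2, (side - width) // 2
    square[y_off:y_off + height, x_off:x_off + width] = crop
    return square
```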
After the text recognition results of the candidate regions are obtained, the video processing system can filter the text recognition results of the candidate regions according to set filtering conditions. The set filtering conditions are used to filter out content in the candidate regions that is unlikely to belong to the description information.
In some possible implementations, the description information of the key frame is the title of the key frame, and the set filtering conditions may include:
1. Filtering text recognition results whose confidence is smaller than a confidence threshold. The confidence is output together with the text recognition result when character recognition is performed through OCR, and can be used to characterize the accuracy of the text recognition result. If the confidence is smaller than the confidence threshold, the current text recognition result has a low degree of accuracy, and filtering out low-accuracy results can effectively save computing resources. In some embodiments, the confidence threshold may be 0.8.
2. Filtering text recognition results belonging to a watermark. Many documents are watermarked to ensure document security. The video processing system can filter out text recognition results belonging to the watermark, thereby preventing watermark text from affecting the subsequently determined description information of the key frame. Since watermark text is usually small, the video processing system can filter text recognition results whose font size is smaller than a size threshold and whose x-axis coordinate difference between the upper-left vertex and the lower-left vertex of the candidate region is larger than a coordinate-difference threshold, thereby filtering text recognition results belonging to the watermark. In some embodiments, the font size threshold may be 25 and the coordinate-difference threshold may be 2.5.
3. Filtering text recognition results in non-title directions. Text in multiple directions may appear in a key frame; for example, both horizontal and vertical text may appear. The video processing system may determine the title direction based on the font size and the candidate region size, and filter text recognition results in non-title directions.
4. Filtering text recognition results whose font size does not meet a preset condition. Recognition errors may occur in the text recognition process; for example, a mouse pointer in a key frame may be recognized as text. The video processing system may therefore filter according to a preset condition, for example filtering text recognition results whose height is smaller than a height threshold. In some embodiments, the height threshold may be related to the height of the largest-area text recognition result, for example 0.33 times that height.
5. Filtering text recognition results belonging to subtitles. For video conference scenarios, the video conferencing application may provide the user with a service that automatically generates subtitles. The video processing system may filter text recognition results belonging to subtitles based on where subtitles appear in the key frame. For example, the video processing system may filter text recognition results whose font size is smaller than 35 and whose candidate region satisfies the subtitle region condition. In some embodiments, the subtitle region condition may be that the top-edge y-axis coordinate of the candidate region is greater than 0.8 times the key frame height, the left-edge x-axis coordinate of the candidate region is less than half the key frame width, and the right-edge x-axis coordinate is greater than half the key frame width.
6. Filtering text recognition results whose text length is too short or too long. Typically, a title is neither too short nor too long, so the video processing system can filter text recognition results whose text length is too short or too long. For example, the video processing system may filter text recognition results whose text length is 1 or less. For another example, the video processing system may filter Chinese text recognition results longer than 40 characters and English text recognition results longer than 100 characters.
7. Filtering text recognition results that match a default string. Typically, documents provide default templates for users. If the user does not modify the information in the default template, a default string will appear in the key frame. The video processing system may filter text recognition results that match the default string. In some embodiments, the default strings may include "click here add title", "enter title", and "No Signal".
8. Filtering text recognition results that are the same as the document name. It will be appreciated that the candidate regions may include the document name region, and the document name is typically not the title of the current key frame (i.e., of the content currently being presented and explained). Thus, the video processing system can filter out text recognition results that are the same as the document name.
By filtering the text recognition results of the candidate regions, the text recognition results that could affect the determination of the key frames' description information, or that are not the description information of the key frames, can be filtered out, which reduces the waste of computing resources and improves video processing efficiency.
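To make a few of these filters concrete, the sketch below applies conditions 1, 2, 5, 6, and 7, assuming each recognition result is a dictionary carrying fields like those described with reference to fig. 3; the thresholds follow the examples above, while the field names are assumptions.

```python
DEFAULT_STRINGS = {"click here add title", "enter title", "No Signal"}

def passes_filters(result: dict, frame_height: int, frame_width: int) -> bool:
    """Apply a subset of the described filtering conditions (illustrative only)."""
    text = result["text"].strip()
    # 1. Low-confidence results.
    if result["confidence"] < 0.8:
        return False
    # 2. Likely watermark: small, slanted text (upper-left vs lower-left x difference).
    if result["font_size"] < 25 and abs(result["x_top_left"] - result["x_bottom_left"]) > 2.5:
        return False
    # 5. Likely subtitle: small text in the bottom strip spanning the horizontal centre.
    if (result["font_size"] < 35
            and result["y_top"] > 0.8 * frame_height
            and result["x_left"] < frame_width / 2 < result["x_right"]):
        return False
    # 6. Too short or too long.
    too_long = (result["language"] == "cn" and len(text) > 40) or \
               (result["language"] == "en" and len(text) > 100)
    if len(text) <= 1 or too_long:
        return False
    # 7. Default template strings.
    if text in DEFAULT_STRINGS:
        return False
    return True
```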
Further, the video processing system can also normalize the text recognition results of the candidate regions to obtain normalized text recognition results. In some possible implementations, the description information of the key frame is the title of the key frame, and the normalization processing may include:
1. Generating punctuation in the text recognition result. For longer text recognition results, the video processing system may generate punctuation in the text recognition result based on a punctuation model. For example, if the text recognition result exceeds one line, the video processing system may generate punctuation by calling the generate_button_title function, so that the text recognition result is more fluent and easier to understand.
2. Generating spaces in the text recognition result. For longer text recognition results, the video processing system may generate spaces in the text recognition result. For example, if the text recognition result exceeds one line, the video processing system may add spaces to the text recognition result by calling the add_space function.
3. Removing special symbols from the text recognition result. It will be appreciated that a title does not usually begin with a special symbol, so the video processing system may remove the spaces at the beginning and end of the text recognition result, as well as special symbols at the head of the text recognition result.
4. Removing unmatched symbols from the text recognition result. If unmatched symbols appear in the text recognition result, for example only one half of a pair of brackets or book-title marks, the video processing system can remove the unmatched symbols to avoid asymmetry in the text recognition result.
5. Normalizing punctuation in the text recognition result. The video processing system can normalize the punctuation in the text recognition result according to the language information of the text recognition result. For example, if the text recognition result is Chinese but contains the English punctuation "!", it can be normalized to the Chinese punctuation "！".
By normalizing the text recognition results, the obtained text recognition results better meet the requirements of the description information, which facilitates subsequently obtaining and presenting the description information of the key frames.
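A sketch of some of the normalization rules follows; the paired-symbol table and the English-to-Chinese punctuation mapping are assumptions for illustration.

```python
import re

PAIRED = {"(": ")", "（": "）", "《": "》", "【": "】", "[": "]", "“": "”"}
EN_TO_CN_PUNCT = {"!": "！", "?": "？", ",": "，", ":": "："}

def normalize_title(text: str, language: str) -> str:
    """Apply a subset of the normalization rules described above (illustrative only)."""
    text = text.strip()
    text = re.sub(r"^[\W_]+", "", text)           # drop special symbols at the head
    for left, right in PAIRED.items():            # drop unmatched paired symbols
        if text.count(left) != text.count(right):
            text = text.replace(left, "").replace(right, "")
    if language == "cn":                          # normalize punctuation for Chinese text
        for en, cn in EN_TO_CN_PUNCT.items():
            text = text.replace(en, cn)
    return text
```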
In some possible implementations, the video processing system may store the text recognition results of the candidate regions in the form of a table. Next, description will be made with reference to a schematic diagram of the character recognition result of one candidate region shown in fig. 3.
The text recognition result of the candidate region may include a variety of information. Wherein, the first column is a key frame name, which may be determined according to the frame extraction order, for example, the key frame names may be "001", "002", etc.
The second column is a candidate region type, where "max" may represent the upper half screen header region 2044 in fig. 2, "doc_category" may represent the directory column region 2042 in fig. 2, "doc_highlight" may represent the directory highlight header region 2043 in fig. 2, and "doc_name" may represent the document name region 2041 in fig. 2.
The third column is language information, where "cn" may represent Chinese and "en" may represent English.
The fourth column is the recognition result, namely the character recognition result of the candidate region.
And the fifth column is confidence level used for representing the confidence level of the character recognition result.
The sixth to eleventh columns are coordinate information of the candidate region, wherein the sixth and seventh columns are coordinate information of a center point of the candidate region, the eighth to eleventh columns are coordinate information of a vertex of the candidate region, and the twelfth column is a font size.
It should be noted that the text recognition result shown in fig. 3 is only one possible form. In other embodiments, the text recognition result of a candidate region may not include all of the above information.
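To make the record layout of fig. 3 concrete, one possible in-memory representation is sketched below; the field names are assumptions, while the columns follow the description above.

```python
from dataclasses import dataclass

@dataclass
class CandidateRegionResult:
    """One row of the candidate-region recognition table (field names are illustrative)."""
    key_frame_name: str   # e.g. "001", assigned in frame-extraction order
    region_type: str      # "max", "doc_category", "doc_highlight" or "doc_name"
    language: str         # "cn" or "en"
    text: str             # the text recognition result of the candidate region
    confidence: float     # confidence of the recognition result
    center_x: float       # centre-point coordinates of the candidate region
    center_y: float
    x_left: float         # vertex coordinates of the candidate region
    y_top: float
    x_right: float
    y_bottom: float
    font_size: float
```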
Further, the video processing system can obtain the description information of the key frame according to the character recognition result of the candidate region. Specifically, the candidate region of the key frame may include a first candidate region and a second candidate region. Wherein the first candidate region has a higher priority than the second candidate region. The video processing system can adjust the character recognition result of the first candidate region by using the character recognition result of the second candidate region, and then acquire the description information of the key frame according to the adjusted character recognition result of the first candidate region.
The candidate region of the key frame shown in fig. 2 is taken as an example for illustration. In the key frame 200, the first candidate region may be the upper half screen header region 2044, and the second candidate region may include a document name region 2041, a directory column region 2042, and a directory highlight header region 2043. It will be appreciated that in the key frame 200, the header of the key frame 200 is more likely to appear in the upper half screen header area 2044, and therefore, the upper half screen header area 2044 has a higher priority than other candidate areas.
In order to avoid possible influence of inaccurate recognition results on the description information of the key frame, the video processing system may adjust the text recognition result of the first candidate region by using the text recognition result of the second candidate region.
In some embodiments, the video processing system may adjust text recognition results that may not have been fully recognized, so as to obtain complete text recognition results. For example, if the text recognition result of the first candidate region ends with "…", and the text recognition result with the "…" removed is a substring of another text recognition result of the second candidate region, the video processing system may adjust the text recognition result to that parent string.
To illustrate with a specific example: if the text recognition result of the first candidate region is "one two three …" and a text recognition result of the second candidate region is "one two three four", the video processing system may adjust "one two three …" to "one two three four". Further, if the second candidate region also contains another text recognition result "one two three four five", the video processing system may adjust "one two three …" to the longest such text recognition result, that is, to "one two three four five".
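A sketch of this truncation repair follows, assuming the truncated result ends with an ellipsis; the function name and string handling are assumptions.

```python
def repair_truncated_title(first_region_text: str, second_region_texts: list) -> str:
    """If the first-region result ends with an ellipsis and, with the ellipsis removed, is a
    substring of second-region results, replace it with the longest such parent string."""
    if not first_region_text.endswith(("...", "…")):
        return first_region_text
    stem = first_region_text.rstrip(".…")
    parents = [t for t in second_region_texts if stem and stem in t]
    return max(parents, key=len) if parents else first_region_text
```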
Thus, the video processing system can obtain the description information of the key frame according to the adjusted text recognition result of the first candidate region and the description information of the adjacent key frame adjacent to the key frame.
In particular, the video processing system may determine the text recognition result located in the first candidate region as candidate descriptive information. Taking the candidate region of the key frame shown in fig. 2 as an example, the video processing system may determine the text recognition result corresponding to the uppermost header region in the first candidate region (i.e., the upper half-screen header region 2044) as candidate description information, i.e., a "technical field". Then, the video processing system may compare the candidate description information with the description information of the neighboring key frame, and if the candidate description information of the current key frame is a sub-string of the description information of the neighboring key frame, the video processing system may modify the candidate description information into a corresponding parent string, and the description information of the current key frame is the parent string.
In addition, the video processing system can also compare the similarity between the current key frame and the adjacent key frame, and further judge whether to modify the candidate description information by using the similarity. For example, the video processing system may compare the cosine similarity and euclidean similarity of the current key frame and the neighboring key frames, and modify the candidate descriptive information only if the similarity meets a similarity threshold and the candidate descriptive information is a substring of the descriptive information of the neighboring key frames, thereby ensuring accuracy of the modification. In some embodiments, the cosine similarity threshold may be 0.97 and the euclidean similarity threshold may be 6.6.
Next, the video processing system may compare the candidate description information with the description information of the adjacent key frames; if the description information of the adjacent key frames is found to be the same but the candidate description information differs from it, the video processing system may modify the candidate description information to the description information of the adjacent key frames. For example, suppose the candidate description information of the current key frame is "technical field" while the description information of the adjacent key frames (i.e., the previous key frame and the next key frame of the current key frame) is "background" in both cases; because the three key frames are adjacent in time on the time axis, in order to avoid an abrupt jump in the description information, the video processing system may modify the candidate description information of the current key frame to "background", thereby obtaining the description information of the key frame.
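A sketch combining the two neighbor-based adjustments follows; the thresholds 0.97 and 6.6 follow the examples above, while the comparison direction for the Euclidean similarity, the function signature, and the assumption that the similarities are computed elsewhere are illustrative.

```python
def smooth_with_neighbors(candidate: str, prev_info: str, next_info: str,
                          cosine_sim: float, euclidean_sim: float) -> str:
    """Adjust the candidate description using the adjacent key frames (illustrative sketch)."""
    # Assumed comparison direction; the text only says the similarity must meet the thresholds.
    similar_enough = cosine_sim >= 0.97 and euclidean_sim >= 6.6
    # Rule 1: promote a substring of an adjacent frame's description to the parent string.
    for neighbor in (prev_info, next_info):
        if similar_enough and candidate and neighbor and candidate != neighbor and candidate in neighbor:
            return neighbor
    # Rule 2: if both adjacent frames share the same description but the candidate differs, follow them.
    if prev_info and prev_info == next_info and candidate != prev_info:
        return prev_info
    return candidate
```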
In other possible implementations, the candidate region of the key frame includes a first candidate region and a second candidate region, where the first candidate region has a higher priority than the second candidate region and the first candidate region is empty. The video processing system can perform text matching according to the text recognition result of the second candidate region; if a matching result is obtained, the matching result is determined as the description information of the key frame, otherwise the description information of the previous key frame is determined as the description information of the key frame.
See the schematic flowchart for determining the description information of a key frame shown in fig. 4. First, the video processing system may determine whether the directory highlight title region in the second candidate region is empty. If it is not empty, the video processing system may further determine whether the highlighted title (i.e., the text recognition result of the highlight title region) is complete; if it is complete, the video processing system may determine the highlighted title as the description information of the key frame. If the highlighted title is incomplete, the video processing system may match the highlighted title, for example against the other text recognition results of the second candidate region. If the matching succeeds, a matching result is obtained, and the video processing system may determine the matching result as the highlighted title.
If the directory highlight title region is empty, or the highlighted title is matched but no matching result is obtained, the video processing system may match the other text recognition results of the second candidate region. For example, the video processing system may match the text recognition result of the directory column region against the other text recognition results of the second candidate region; if the matching succeeds, the video processing system may determine the title following the successfully matched title in the directory column as the matching result, and determine the matching result as the description information of the key frame. If the matching does not succeed, the video processing system can determine the description information of the previous key frame as the description information of the key frame.
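The fallback flow of fig. 4 can be sketched roughly as follows, assuming the completeness of a highlighted title is judged by whether it ends with an ellipsis; that completeness test and all helper names are assumptions.

```python
def describe_from_catalog(highlight_title, catalog_titles, second_region_texts, previous_info):
    """Rough sketch of the fig. 4 fallback flow when the first candidate region is empty."""
    if highlight_title:
        if not highlight_title.endswith(("...", "…")):    # assumed completeness test
            return highlight_title
        stem = highlight_title.rstrip(".…")
        matches = [t for t in second_region_texts if stem and stem in t]
        if matches:                                       # completed highlighted title
            return max(matches, key=len)
    # Otherwise match the directory column entries; on success take the next title in the
    # directory column, otherwise fall back to the previous key frame's description information.
    for index, title in enumerate(catalog_titles):
        if title in second_region_texts and index + 1 < len(catalog_titles):
            return catalog_titles[index + 1]
    return previous_info
```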
After obtaining the description information of the current key frame, the video processing system may merge the description information of key frames. Specifically, if the description information of adjacent key frames is the same, the video processing system may merge the identical description information. For example, if the description information of the current key frame, of its previous key frame, and of its next key frame is "technical field" in each case, the video processing system may merge the description information of the three key frames.
In some embodiments, the video processing system may also remove description information that lasts only a short time. Specifically, the video processing system may determine the duration of each piece of description information according to the time range covered by the merged key frames, and remove description information whose duration is less than a time threshold, for example 4 seconds.
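A minimal sketch of the merging and short-duration filtering is given below; it assumes each key frame is represented as a (timestamp, description) pair ordered by time, uses the 4-second threshold from the example above, and all names are illustrative.

def merge_and_filter(frames, min_duration=4.0):
    """Merge identical adjacent descriptions into time segments and drop
    segments shorter than min_duration seconds.

    frames: list of (timestamp_seconds, description) ordered by time.
    Returns a list of (start_time, end_time, description) segments.
    """
    if not frames:
        return []
    segments = []
    start_time, current = frames[0]
    for time, desc in frames[1:]:
        if desc != current:                       # description changes here
            segments.append((start_time, time, current))
            start_time, current = time, desc
    segments.append((start_time, frames[-1][0], current))
    return [s for s in segments if s[1] - s[0] >= min_duration]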
It should be noted that the embodiment of the present application does not limit the manner of identifying the content of the candidate region of the key frame. In some possible implementations, the video processing system may also obtain an image recognition result by recognizing an image in the candidate region of the key frame. For example, the video processing system may recognize an aircraft image in the candidate region of the key frame to obtain the image recognition result "aircraft", and then obtain the description information of the key frame from that result.
S108: the video processing system presents descriptive information of the key frames.
The embodiment of the application supports the presentation of the description information of the key frames in different modes. In some embodiments, the video processing system may present descriptive information of key frames in a navigation area of the video.
Specifically, refer to the schematic diagram of presenting description information shown in fig. 5A. The presentation interface may include a video content area 501, a video timeline 502, a video control 503, and a video navigation area 504. The video content area 501 presents the video content for the user to watch; the user can track the viewing progress of the video via the video timeline 502, and perform operations such as play, pause, fast forward and fast rewind via the video control 503.
The video navigation area 504 is used to present the description information of the key frames, for example together with the corresponding time points. In response to a user's triggering operation on the description information of a key frame, the video processing system positions the video at the time point corresponding to that description information so that the video plays from that point. For example, when the user clicks "technical field", whose start time is "00:00:20", the video jumps to the 20-second mark, achieving video positioning so that the user can watch the content related to "technical field".
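As an illustrative sketch of how entries in the video navigation area 504 could drive positioning, the following assumes a player object exposing a seek method; this interface and the entry layout are hypothetical.

from dataclasses import dataclass

@dataclass
class NavigationEntry:
    """One row of the navigation area: a description and its start time."""
    description: str
    start_seconds: float

def on_entry_clicked(entry, player):
    """Jump the player to the time point of the clicked description."""
    player.seek(entry.start_seconds)    # e.g. "technical field" -> 20 s

navigation_area = [NavigationEntry("technical field", 20.0),
                   NavigationEntry("background", 95.0)]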
In other embodiments, the video processing system may present the description information of the key frames on the time axis of the video.
Specifically, referring to the schematic diagram of presenting description information shown in fig. 5B, the description information of key frames may be presented in the video timeline 502. Based on the time information of a key frame in the video, its description information can be displayed in association with the time axis of the video. The time information of a key frame may be the time point at which the key frame first appears in the video, or the time period occupied by key frames sharing the same description information. In some embodiments, the time point at which a key frame first appears may be marked by a dot in the video timeline 502, and when the user hovers over the dot, the description information of the key frame is presented. In other embodiments, different time periods may be distinguished by color in the video timeline 502, for example by using the same color for the time period in which multiple key frames with the same description information appear. In this way, the user can intuitively grasp the content of different parts of the video from the differently colored segments of the time axis, enabling quick video positioning. Similarly, in response to a user's triggering operation on the description information of a key frame, the video processing system positions the video at the corresponding time point so that the video plays from that point.
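One possible way to let time periods sharing the same description information share a colour on the time axis is sketched below; the palette and the segment layout are assumptions.

from itertools import cycle

def colour_timeline_segments(segments,
                             palette=("#4C78A8", "#F58518", "#54A24B")):
    """Assign one colour per description so segments with the same
    description information are drawn in the same colour.

    segments: list of (start_seconds, end_seconds, description).
    Returns a list of (start_seconds, end_seconds, description, colour).
    """
    colour_of = {}
    colours = cycle(palette)
    coloured = []
    for start, end, desc in segments:
        if desc not in colour_of:
            colour_of[desc] = next(colours)
        coloured.append((start, end, desc, colour_of[desc]))
    return coloured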
Based on the above description, the embodiment of the present application provides a video processing method. By identifying the content of the candidate regions of key frames in a video, the method obtains the description information of those key frames. In some application scenarios, such as a video conference, this yields the titles of the content currently being presented and explained, so that the user learns the description information of multiple key frames in the video, can locate positions in the video conveniently, and can watch the video efficiently.
The video processing method provided by the embodiment of the present application is described in detail above with reference to fig. 1 to 5, and the system and the device provided by the embodiment of the present application are described below with reference to the accompanying drawings.
Referring to the schematic diagram of the video processing system shown in fig. 6, the system 600 includes:
an acquiring unit 601, configured to acquire a video to be processed;
a determining unit 602, configured to determine a key frame in a video according to a video to be processed, where the key frame is at least one of a plurality of image frames of the video;
an identifying unit 603, configured to identify contents of candidate areas of the key frame, and obtain description information of the key frame;
and a presenting unit 604, configured to present the description information of the key frame.
In some possible implementations, the presenting unit 604 is specifically configured to:
presenting the description information of the key frames in a navigation area of the video; or
presenting the description information of the key frames on the time axis of the video.
In some possible implementations, the presenting unit 604 is specifically configured to:
presenting the description information of the key frame in association with the time axis of the video based on the time information of the key frame in the video.
In some possible implementations, the system 600 further includes:
and the positioning unit is used for responding to the triggering operation of the user on the description information of the key frames and positioning the video to the time point corresponding to the description information of the key frames so as to enable the video to be played from the time point.
In some possible implementations, the identifying unit 603 is specifically configured to:
recognizing the contents of the candidate areas of the key frames through optical character recognition OCR to obtain character recognition results of the candidate areas;
and obtaining the description information of the key frame according to the character recognition result of the candidate region.
In some possible implementations, the identifying unit 603 is specifically configured to:
filtering character recognition results of the candidate areas according to the set filtering conditions;
and obtaining the description information of the key frame according to the filtered character recognition result.
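Purely as an illustration of OCR followed by simple filtering conditions, the sketch below uses the pytesseract package together with two assumed rules (a minimum text length and a small list of banned words); none of these choices are prescribed by the embodiment.

from PIL import Image
import pytesseract   # assumes a Tesseract engine is installed

def recognize_and_filter(region_image_path, min_length=2,
                         banned_words=("page",)):
    """OCR a candidate-region image and filter the recognized lines."""
    text = pytesseract.image_to_string(Image.open(region_image_path))
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    # Example filtering conditions (illustrative): drop very short fragments
    # and lines containing words that are unlikely to belong to a title.
    return [line for line in lines
            if len(line) >= min_length
            and not any(word in line.lower() for word in banned_words)]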
In some possible implementations, the candidate region of the key frame includes a first candidate region and a second candidate region, where the first candidate region has a higher priority than the second candidate region, and the identifying unit 603 is specifically configured to:
adjusting the character recognition result of the first candidate region by utilizing the character recognition result of the second candidate region;
and obtaining the description information of the key frame according to the character recognition result of the adjusted first candidate region.
In some possible implementations, the identifying unit 603 is specifically configured to:
and obtaining the description information of the key frame according to the character recognition result of the adjusted first candidate region and the description information of the adjacent key frame adjacent to the key frame.
In some possible implementations, the candidate region of the key frame includes a first candidate region and a second candidate region, where the first candidate region has a higher priority than the second candidate region, and the first candidate region is empty, and the identifying unit 603 is specifically configured to:
performing character matching according to the character recognition result of the second candidate region;
if a matching result is obtained, determining the matching result as the description information of the key frame;
Otherwise, determining the description information of the previous key frame of the key frame as the description information of the key frame.
In some possible implementations, the determining unit 602 is specifically configured to:
extracting frames from the video at set time intervals to obtain a plurality of image frames of the video;
classifying and de-duplicating the plurality of image frames of the video, and determining the key frame from the plurality of image frames of the video.
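An illustrative sketch of frame extraction at a set time interval with a crude difference-based de-duplication is given below; it uses OpenCV, the interval and threshold values are assumptions, and the classification step mentioned above is omitted.

import cv2
import numpy as np

def extract_candidate_key_frames(video_path, interval_seconds=1.0,
                                 diff_threshold=10.0):
    """Sample frames at a set interval and keep only frames that differ
    noticeably from the previously kept frame (simple de-duplication)."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    step = max(1, int(round(fps * interval_seconds)))
    kept, prev_gray, index = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            if prev_gray is None or np.mean(cv2.absdiff(gray, prev_gray)) > diff_threshold:
                kept.append((index / fps, frame))   # (timestamp, image)
                prev_gray = gray
        index += 1
    cap.release()
    return kept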
In some possible implementations, the video to be processed is generated based on a real-time interactive process.
In some possible implementations, the determining unit 602 is specifically configured to:
determining candidate key frames from the video to be processed based on preset user behaviors in the real-time interaction process and/or in response to at least one image frame in the video to be processed including preset identification information;
and determining a key frame from the candidate key frames according to the semantic features of the candidate key frames.
The video processing system 600 according to the embodiment of the present application may correspondingly perform the methods described in the embodiments of the present application; the above and other operations and/or functions of the respective modules/units of the video processing system 600 implement the corresponding flows of the methods in the embodiment shown in fig. 1, and are not repeated here for brevity.
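For orientation only, the units of the system 600 could be composed as in the following hypothetical sketch, in which each step is an injected callable (for example, the recognition and presentation sketches given earlier); the class and its interface are not part of the embodiment.

class VideoProcessingSystemSketch:
    """Hypothetical composition of the acquiring, determining, identifying
    and presenting steps of system 600."""

    def __init__(self, acquire, determine_key_frames, identify, present):
        self.acquire = acquire                            # () -> video
        self.determine_key_frames = determine_key_frames  # video -> key frames
        self.identify = identify                          # key frames -> descriptions
        self.present = present                            # descriptions -> UI

    def run(self):
        video = self.acquire()
        key_frames = self.determine_key_frames(video)
        descriptions = self.identify(key_frames)
        self.present(descriptions)
        return descriptions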
The embodiment of the application also provides an electronic device. The electronic device is specifically configured to implement the functionality of the video processing system 600 in the embodiment shown in fig. 6.
Fig. 7 provides a schematic structural diagram of an electronic device 700, and as shown in fig. 7, the electronic device 700 includes a bus 701, a processor 702, a communication interface 703, and a memory 704. Communication between processor 702, memory 704 and communication interface 703 is via bus 701.
The bus 701 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. Buses may be divided into address buses, data buses, control buses, and so on. For ease of illustration, only one thick line is shown in fig. 7, but this does not mean that there is only one bus or only one type of bus.
The processor 702 may be any one or more of a central processing unit (CPU), a graphics processing unit (GPU), a microprocessor (MP), or a digital signal processor (DSP).
The communication interface 703 is used for communication with the outside. For example, the communication interface 703 may be used for communication with a terminal. The communication interface 703 is configured to send the description information of the key frame to the terminal, so that the terminal presents the description information of the key frame.
The memory 704 may include volatile memory, such as random access memory (RAM). The memory 704 may also include non-volatile memory, such as read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid state drive (SSD).
The memory 704 has stored therein executable code that the processor 702 executes to perform the video processing methods described previously.
Specifically, when the embodiment shown in fig. 6 is implemented and the modules or units of the video processing system 600 described in the embodiment of fig. 6 are implemented by software, the software or program code required to perform the functions of the modules/units in fig. 6 may be stored, in part or in whole, in the memory 704. The processor 702 executes the program code corresponding to the respective units stored in the memory 704 to perform the aforementioned video processing method.
The embodiment of the application also provides a computer readable medium in which instructions or a computer program are stored; when run on an electronic device, the instructions or computer program cause the electronic device to execute any implementation of the video processing method provided by the embodiments of the present application.
The computer readable medium of the present application may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present application, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.
In some implementations, the clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (Hypertext Transfer Protocol), and may be interconnected with digital data communication in any form or medium (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
The computer readable medium may be included in the electronic device, or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to perform the method described above.
Computer program code for carrying out operations of the present application may be written in one or more programming languages, including, but not limited to, object-oriented programming languages such as Java, Smalltalk and C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units involved in the embodiments of the present application may be implemented in software or in hardware. In some cases, the names of the units/modules do not constitute a limitation of the units themselves.
The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a system on a chip (SOC), a Complex Programmable Logic Device (CPLD), and the like.
In the context of the present application, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
It should be noted that, in the present description, the embodiments are described in a progressive manner, each embodiment focuses on its differences from the other embodiments, and for identical or similar parts between the embodiments, reference may be made to one another. Since the system or device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief, and the relevant points can be found in the description of the method section.
It should be understood that in the present application, "at least one (item)" means one or more, and "a plurality" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may represent: only A, only B, or both A and B, where A and B may be singular or plural. The character "/" generally indicates an "or" relationship between the associated objects before and after it. "At least one of" and similar expressions refer to any combination of the listed items, including any combination of single items or plural items. For example, "at least one of a, b or c" may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b and c may be singular or plural.
It is further noted that relational terms such as first and second are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual relationship or order between such entities or actions. Moreover, the terms "comprises", "comprising", or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a(n) ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (16)

1. A method of video processing, the method comprising:
acquiring a video to be processed;
determining a key frame in the video according to the video to be processed, wherein the key frame is at least one of a plurality of image frames of the video;
identifying the content of the candidate region of the key frame to obtain the description information of the key frame;
and presenting the description information of the key frames.
2. The method of claim 1, wherein the presenting the description information of the key frames comprises:
presenting the description information of the key frames in a navigation area of the video; or
presenting the description information of the key frames on the time axis of the video.
3. The method of claim 2, wherein the presenting the description information of the key frames on the time axis of the video comprises:
based on the time information of the key frame in the video, the description information of the key frame is displayed in association with the time axis of the video.
4. The method according to claim 2, wherein the method further comprises:
and responding to the triggering operation of the user on the description information of the key frame, positioning the video to a time point corresponding to the description information of the key frame, and playing the video from the time point.
5. The method according to claim 1, wherein the identifying the content of the candidate region of the key frame to obtain the description information of the key frame includes:
recognizing the content of the candidate region of the key frame through optical character recognition (OCR) to obtain a character recognition result of the candidate region;
and obtaining the description information of the key frame according to the character recognition result of the candidate region.
6. The method according to claim 5, wherein the obtaining the description information of the key frame according to the text recognition result of the candidate region includes:
filtering the character recognition result of the candidate region according to the set filtering condition;
and obtaining the description information of the key frame according to the filtered character recognition result.
7. The method of claim 5, wherein the candidate region of the key frame includes a first candidate region and a second candidate region, the first candidate region having a higher priority than the second candidate region, the obtaining the description information of the key frame according to the text recognition result of the candidate region includes:
adjusting the character recognition result of the first candidate region by utilizing the character recognition result of the second candidate region;
and obtaining the description information of the key frame according to the character recognition result of the adjusted first candidate region.
8. The method of claim 7, wherein the obtaining the description information of the key frame according to the text recognition result of the adjusted first candidate region includes:
and obtaining the description information of the key frame according to the character recognition result of the adjusted first candidate region and the description information of the adjacent key frame adjacent to the key frame.
9. The method of claim 5, wherein the candidate region of the key frame includes a first candidate region and a second candidate region, the first candidate region has a higher priority than the second candidate region, the first candidate region is empty, the obtaining the description information of the key frame according to the text recognition result of the candidate region includes:
performing character matching according to the character recognition result of the second candidate region;
if a matching result is obtained, determining the matching result as the description information of the key frame;
otherwise, determining the description information of the previous key frame of the key frame as the description information of the key frame.
10. The method according to any one of claims 1 to 9, wherein said determining key frames in said video from said video to be processed comprises:
extracting frames from the video at set time intervals to obtain a plurality of image frames of the video;
classifying and de-duplicating the plurality of image frames of the video, and determining the key frame from the plurality of image frames of the video.
11. The method according to any one of claims 1 to 9, wherein the video to be processed is generated based on a real-time interactive process.
12. The method of claim 11, wherein the determining key frames in the video from the video to be processed comprises:
determining candidate key frames from the video to be processed based on preset user behaviors in the real-time interaction process and/or in response to at least one image frame in the video to be processed including preset identification information;
and determining a key frame from the candidate key frames according to the semantic features of the candidate key frames.
13. A video processing system, the system comprising:
the acquisition unit is used for acquiring the video to be processed;
a determining unit, configured to determine a key frame in the video according to the video to be processed, where the key frame is at least one of a plurality of image frames of the video;
the identification unit is used for identifying the contents of the candidate areas of the key frames and obtaining the description information of the key frames;
and the presentation unit is used for presenting the description information of the key frames.
14. An electronic device comprising a processor and a memory;
the processor is configured to execute instructions stored in the memory, causing the electronic device to perform the method of any one of claims 1 to 12.
15. A computer readable storage medium comprising instructions that instruct an electronic device to perform the method of any one of claims 1 to 12.
16. A computer program product, characterized in that the computer program product, when run on an electronic device, causes the electronic device to perform the method of any one of claims 1 to 12.