CN112183249A - Video processing method and device - Google Patents
Video processing method and device
- Publication number
- CN112183249A (application number CN202010960748.1A)
- Authority
- CN
- China
- Prior art keywords
- picture
- video
- text
- value
- screenshot
- Prior art date
- Legal status: Pending (status is assumed, not a legal conclusion; no legal analysis has been performed)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
- G06V30/41—Analysis of document content
- G06V30/414—Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T3/00—Geometric image transformations in the plane of the image
- G06T3/40—Scaling of whole images or parts thereof, e.g. expanding or contracting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
- G06T5/10—Image enhancement or restoration using non-spatial domain filtering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
- G06T5/40—Image enhancement or restoration using histogram techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
- G06T5/90—Dynamic range modification of images or parts thereof
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/12—Fingerprints or palmprints
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10016—Video; Image sequence
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20048—Transform domain processing
- G06T2207/20052—Discrete cosine transform [DCT]
Abstract
The application discloses a video processing method and device. The method comprises the following steps: acquiring video content; capturing video screenshots from the video content in frame order and de-duplicating them to obtain a de-duplicated picture sequence; performing text recognition on each picture in the picture sequence to obtain text data, and generating text boxes containing that text data; generating a base-map picture for each picture from its bitmap data; and generating, for each picture, the corresponding presentation page according to its text boxes, its base-map picture, and its sequence number in the picture sequence. With this technical scheme, no manual participation is needed in generating the presentation; the process is convenient and fast, and the output and collation of the presentation are completed more efficiently.
Description
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a video processing method and apparatus.
Background
When a presentation must be collated from video content, the most common approach is to watch the video manually and transcribe it. As shown in fig. 1, the workflow for manually extracting a PPT presentation from a video generally includes the following steps:
First, the staff member transcribes the narration of the video by typing while the video plays.
Then, the transcribed text is segmented according to the narrative structure of the video; when the video reaches the key frame corresponding to each text segment, the staff member pauses the video to make the corresponding PPT page, thereby obtaining both the text of the video and the related PPT pages.
Finally, when playback ends, the staff member collates the transcribed text and the PPT pages and outputs the finished presentation.
Manually extracting a presentation from a video in this way is the most common means of meeting the demand, but it has several problems. The transcription time is usually much longer than the video duration: the typing speed of the staff may be slower than the narration speed, so transcription cannot finish at the same time as playback, and the staff must repeatedly rewind the video in order to make each PPT page. Moreover, when many presentations need to be converted, multiple workers must work simultaneously, which incurs a large labor cost.
Disclosure of Invention
The present application provides a video processing method and apparatus to solve or partially solve the above problems.
According to an aspect of the present application, there is provided a video processing method, including:
acquiring video content;
intercepting a video screenshot from the video content according to a frame sequence, and performing duplication elimination processing on the video screenshot to obtain a duplicate-eliminated picture sequence;
performing text recognition on each picture in the picture sequence to obtain text data, and generating a text box comprising the text data according to the text data; generating a base image picture of each picture according to the bitmap data of the picture;
and generating a presentation page corresponding to each picture in the presentation according to the text box and the base picture corresponding to each picture and the sequence number information of the picture in the picture sequence.
According to an aspect of the present application, there is provided a video processing apparatus including:
a video acquisition unit for acquiring video content;
the picture duplication removing unit is used for intercepting the video screenshots from the video content according to a frame sequence and carrying out duplication removing processing on the video screenshots to obtain a duplicated picture sequence;
the text recognition unit is used for performing text recognition on each picture in the picture sequence to obtain text data and generating a text box comprising the text data according to the text data;
a base map generation unit for generating a base map picture of each picture from the bitmap data of the picture;
and the manuscript generating unit is used for generating a demonstration manuscript page corresponding to each picture in the demonstration manuscript according to the text box and the base image picture corresponding to each picture and the sequence number information of the picture in the picture sequence.
According to one aspect of the present application, there is provided an electronic device comprising a memory and a processor, wherein the memory stores computer-executable instructions that, when executed, cause the processor to perform the method of generating a presentation based on video content.
According to an aspect of the present application, there is provided a computer readable storage medium having one or more computer programs stored thereon which, when executed, implement a method of generating a presentation based on video content.
The beneficial effects of this application are as follows: the video screenshots captured from the video content are de-duplicated; text recognition and base-map generation are performed on each picture in the de-duplicated picture sequence to obtain each picture's text data and base-map picture; and the presentation is generated from the text data and the base-map pictures, so the whole process runs without manual participation and completes the output and collation of the presentation more efficiently.
Drawings
Fig. 1 is a schematic diagram of extraction of a PPT presentation from a video based on manual means;
FIG. 2 is a schematic flow chart of a video processing method according to an embodiment of the present application;
figure 3 is a flow diagram of generating a PPT presentation according to one embodiment of the present application;
fig. 4 and 5 are schematic diagrams of gray scale values of each pixel point of two 4 × 4 gray scale maps according to an embodiment of the present application;
FIG. 6 is a diagram illustrating distribution of picture text data according to an embodiment of the present application;
fig. 7 is a hardware configuration diagram of an electronic device in which a video processing apparatus according to an embodiment of the present application is located;
fig. 8 is a functional block diagram of a video processing apparatus according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present application. The word "if" as used herein may be interpreted as "at … …" or "when … …" or "in response to a determination", depending on the context.
When a PPT presentation must be collated from video content, the most common method is to watch the video manually and collate the output. This method is simple and can meet the demand in most cases. However, when the videos to be converted are long or numerous, workers must repeatedly watch the video to transcribe the text and frequently pause it to make PPT pages, which takes a long time and incurs a large labor cost.
In order to meet the requirement of rapidly extracting a presentation from video content with large data volume, the embodiment of the disclosure provides a method for generating a PPT presentation based on video content by combining technologies such as data analysis, data mining and image processing.
Referring to fig. 2, the method comprises the steps of:
in step S210, video content is acquired.
And S220, intercepting a video screenshot from the video content according to a frame sequence, and performing duplication elimination processing on the video screenshot to obtain a duplicate-eliminated picture sequence.
In this step, the de-duplication of the video screenshots is performed according to the image fingerprint information and the histogram information of the screenshots; judging similarity by combining these two image features improves the accuracy of the similarity judgment.
Step S230, performing text recognition on each picture in the picture sequence to obtain text data, and generating a text box comprising the text data according to the text data; and generating a base picture of each picture according to the bitmap data of the picture.
In this step, a text recognition tool may be used to perform text recognition on the picture, for example a Python Chinese-text recognition package.
And step S240, generating a presentation page corresponding to each picture in the presentation according to the text box and the base image picture corresponding to each picture and the sequence number information of the picture in the picture sequence.
In this embodiment, presentations include PPT presentations from Microsoft Corporation and WPS presentations from Beijing Kingsoft Office Software, Inc.
As shown in fig. 2, in this embodiment the video screenshots captured from the video content are de-duplicated, text recognition and base-map generation are performed on each picture in the de-duplicated picture sequence to obtain each picture's text data and base-map picture, and the presentation is generated from the text data and the base-map pictures.
The steps of generating the presentation are specifically described below with reference to fig. 3 to 6.
Referring to fig. 3, fig. 3 takes generating a PPT presentation as an example, and the process of generating a WPS presentation is the same, which is not described herein again.
As shown in fig. 3, the step of generating the PPT presentation includes: the method comprises four steps of picture duplication removal, text extraction, base map generation and PPT presentation generation, and the steps are detailed below.
1. Picture de-duplication:
First, screenshots must be captured from the video content frame by frame to obtain the un-de-duplicated video screenshots, and then the de-duplication work is carried out.
In one embodiment, the video screenshot can be subjected to size scaling processing and graying processing to obtain a grayed video screenshot with a preset size; and calculating image fingerprint information and gray histogram information of the processed video screenshot, and performing duplicate removal processing on the video screenshot by using the image fingerprint information and the gray histogram information to obtain a duplicate-removed picture sequence.
One implementation proceeds as follows. A fingerprint hash value of each video screenshot is calculated with preset hash algorithms, including a mean hash algorithm, a difference hash algorithm and a perceptual hash algorithm. The Hamming distance between adjacent video screenshots is calculated from their fingerprint hash values, and the similarity between them is derived from that distance. A gray histogram of each video screenshot is also calculated, for example with a single-channel histogram algorithm and a three-channel histogram algorithm, and the histogram overlap ratio of adjacent screenshots is computed from the gray histograms. Once the similarity and the histogram overlap ratio between adjacent screenshots are obtained, a similarity index between them is calculated from these two values; when the similarity index is greater than a similarity threshold, either one of the two adjacent screenshots is deleted and the other is kept. Repeating this completes the de-duplication of all video screenshots and yields the de-duplicated picture sequence.
Because the similarity and the histogram overlap ratio lie in two different value domains, when the similarity index is calculated, the first value domain (corresponding to the similarity) and the second value domain (corresponding to the histogram overlap ratio) are first unified. The unified similarity and overlap ratio are then averaged, and the resulting mean value is the similarity index.
In one embodiment, the capture frame number can be set according to the frame rate of the video content, and the capture frame number is used for indicating the time interval for capturing the screen capture; and then intercepting a video screenshot from the video content according to the interception frame number, storing a storage path of the intercepted video screenshot in a list, and implementing duplicate removal of the video screenshot by operating elements in the list. Of course, in other embodiments of the present application, the number of capturing frames may also be set based on the frame rate of the video content and the time length corresponding to the video content, as long as the key video frames are not missed and the number of video screenshots is reduced, and the manner of setting the number of capturing frames is not specifically limited in the present application.
Assuming that the frame rate of the video content is 30 frames/second and the capture frame number is set to 30 frames (of course, the capture frame number may also be set to be greater than 30 frames or less than 30 frames), when performing video capture on the video content according to the capture frame number of 30 frames, the video capture processing is performed every 1 second. In the embodiment, the screenshot is carried out on the video content by setting the screenshot frame number, so that the number of the intercepted video screenshots is reduced and the occupation of memory resources is reduced on the premise of ensuring that key video frames are not missed.
In addition, the storage paths of all video screenshots are stored in a list, and de-duplication then operates on the elements of that list; there is no need to use the os library (a Python standard library that provides functions for interacting with the operating system) to delete or save files under a local path. This saves storage space and leaves more memory for the similarity calculation.
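As a concrete illustration of this step, the following is a minimal Python sketch assuming OpenCV (cv2) is available; the function name, file-naming scheme and default interval are illustrative, not taken from the patent:

```python
import cv2

def capture_screenshots(video_path, out_dir, capture_every=30):
    """Save one screenshot every `capture_every` frames, in frame order."""
    cap = cv2.VideoCapture(video_path)
    paths, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % capture_every == 0:
            path = f"{out_dir}/shot_{index:06d}.png"
            cv2.imwrite(path, frame)
            paths.append(path)  # later steps operate on this list, not on the file system
        index += 1
    cap.release()
    return paths
```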
In one embodiment, the flow of deduplication work is as follows:
(1) the video shot is scaled to 8 x 8 gray scale.
The scaling of the embodiment is not fixed, and the scaling can be set based on the size of the video screenshot and the recognition effect.
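A minimal sketch of this preprocessing, again assuming OpenCV; the 8 x 8 default mirrors the example above and can be changed as the text suggests:

```python
import cv2

def preprocess(path, size=8):
    """Scale a screenshot to size x size and convert it to grayscale."""
    img = cv2.imread(path)
    img = cv2.resize(img, (size, size), interpolation=cv2.INTER_AREA)
    return cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
```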
(2) The similarity of the video screenshots is cooperatively judged by adopting five algorithms, which are respectively as follows:
and (3) mean value hash algorithm: and calculating the gray average value of the pixels of the video screenshot, then comparing the gray of each pixel with the average value, marking the pixels which are more than or equal to the average value as 1, and marking the pixels which are less than the average value as 0 to obtain the fingerprint hash value of the video screenshot.
Difference hash algorithm: and regarding the gray-scale image of the video screenshot as a matrix, comparing two adjacent elements of the matrix, for example, subtracting the right element from the left element to obtain a difference value, marking a pixel point with the difference value being a positive number or zero as 1, and marking the pixel point as 0 if the difference value is a negative number to obtain a fingerprint hash value of the video screenshot.
Perceptual hashing algorithm: the video screenshot is subjected to Discrete Cosine Transform (DCT), for example, a 32 x 32 matrix is obtained after DCT transformation, only the matrix of 8 x 8 at the upper left corner is reserved, the matrix can present the lowest frequency of the video screenshot, pixel points which are more than or equal to the DCT mean value are marked as 1, otherwise, the pixel points are marked as 0, and fingerprint hash values are sequentially generated.
Single-channel histogram algorithm: an image is composed of pixels with different gray values, and the distribution of gray values in the image is an important feature of the image. The gray histogram of an image is a function of gray level that describes the number of pixels in the image having each gray level. The single-channel method first computes a single-channel histogram of each video screenshot and then computes the overlap ratio of the two screenshots' histograms to obtain a single-channel similarity value.
Three-channel histogram algorithm: namely, the video screenshot is separated into three channels of RGB, and the similarity value of each channel is calculated.
It should be noted that the sequence of 1s and 0s output by the mean hash, difference hash and perceptual hash algorithms is exactly the fingerprint hash value of the picture. In these hash algorithms the fingerprint hash values of two pictures are compared and the Hamming distance is calculated from them; for the 8 x 8 gray maps of the example above, each picture outputs a 64-bit hash value, and the fewer the differing bits, the more similar the pictures. The Hamming distance is the number of bit changes required to turn one binary string into the other, so it measures the difference between two pictures: the smaller the Hamming distance, the higher the similarity, with a normalized distance of 0 meaning the two pictures are identical and a normalized distance of 1 meaning they are completely different. In the histogram algorithms, the larger the overlapping area of the two histograms, the more similar the pictures.
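The following Python sketch shows one way the five similarity signals could be computed with OpenCV and NumPy. It follows the descriptions above but is an illustration under assumptions, not the patent's reference implementation; in particular, the histogram overlap is computed here as the intersection area of normalized histograms:

```python
import cv2
import numpy as np

def average_hash(gray):
    """Mean hash: 1 where a pixel >= the mean gray value, else 0."""
    return (gray >= gray.mean()).astype(np.uint8).ravel()

def difference_hash(gray):
    """Difference hash: left element minus right element, 1 if >= 0."""
    diff = gray[:, :-1].astype(int) - gray[:, 1:].astype(int)
    return (diff >= 0).astype(np.uint8).ravel()

def perceptual_hash(gray32):
    """Perceptual hash: DCT of a 32 x 32 gray map; keep the top-left 8 x 8
    low-frequency block and threshold it at its mean."""
    low = cv2.dct(np.float32(gray32))[:8, :8]
    return (low >= low.mean()).astype(np.uint8).ravel()

def hamming(h1, h2):
    """Number of differing bits between two fingerprint hash values."""
    return int(np.count_nonzero(h1 != h2))

def histogram_overlap(img, other, channel=0):
    """Overlap ratio of one channel's histograms; near 1 means similar.
    Averaging this ratio over the R, G and B channels gives the
    three-channel variant."""
    h1 = cv2.calcHist([img], [channel], None, [256], [0, 256]).ravel()
    h2 = cv2.calcHist([other], [channel], None, [256], [0, 256]).ravel()
    h1, h2 = h1 / h1.sum(), h2 / h2.sum()
    return float(np.minimum(h1, h2).sum())  # overlapping histogram area
```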
In one example, as shown in fig. 4 and fig. 5, two 4 x 4 matrices represent the gray maps converted from two pictures, and each number is the gray value of a pixel. A 4 x 4 gray map is used here for illustration; the principle is identical for an 8 x 8 map, and the smaller bit count is simply easier to show.
The process of calculating the fingerprint hash values of the two pictures based on the mean hash algorithm comprises the following steps:
First, the average gray value of each gray map is calculated: 139.5 for fig. 4 and 151.8 for fig. 5. The gray value of each pixel is then compared with the average; pixels greater than or equal to the average are marked 1 and pixels below it are marked 0, giving the fingerprint hash value 1001110100110001 for fig. 4 and 1110001101010011 for fig. 5. The hashes of the two 4 x 4 gray maps differ in 9 bits.
The process of calculating the fingerprint hash values of the two pictures based on the difference hash algorithm comprises the following steps:
Two adjacent elements of the matrix are compared (illustratively, the right element is subtracted from the left element); a pixel whose difference is positive or zero is marked 1 and one whose difference is negative is marked 0, giving the fingerprint hash value 110011010001110 for fig. 4 and 101100010101000 for fig. 5. The hashes of the two 4 x 4 gray maps differ in 8 bits.
The process of calculating the fingerprint hash values of the two pictures based on the perceptual hash algorithm comprises the following steps:
Suppose the two pictures to be compared have been DCT-transformed into 32 x 32 gray maps, and that the two 4 x 4 matrices of fig. 4 and fig. 5 are the upper-left 4 x 4 corners of those two 32 x 32 matrices; these corners present the lowest frequencies of the pictures. Pixels greater than or equal to the DCT mean are marked 1, otherwise 0, and the fingerprint hash values are generated in sequence.
(3) Calculating the similarity index: the results of the five algorithms are combined by averaging. Although all five results express similarity between pictures, the hash algorithms and the histogram algorithms work on different principles, so their outputs lie in different value domains; "different" here means the value domains differ, not that the output values themselves differ. The value domains must therefore be unified. Unifying the hash-algorithm domain into the histogram-algorithm domain works the same way as the reverse, so only one direction is described here.
Since the output of a hash algorithm is the Hamming distance between two pictures, that distance is divided by the bit count of the gray map; for an 8 x 8 gray map it is divided by 64, giving a result n in the range [0, 1]. However, histogram values indicate more similarity as they approach 1, while hash values indicate more similarity as they approach 0, so n is replaced by 1 - n to align the hash outputs with the histogram outputs. The similarity index is then the average of the five per-algorithm similarity results.
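A sketch of this unification and averaging, under the assumption that the three hash distances and two histogram overlap ratios have already been computed:

```python
def similarity_index(hash_distances, hash_bits, hist_overlaps):
    """Unify value domains and average the five results.

    hash_distances: Hamming distances from the three hash algorithms
    hash_bits: bit count of each fingerprint (64 for an 8 x 8 gray map)
    hist_overlaps: overlap ratios from the two histogram algorithms, in [0, 1]
    """
    unified = [1.0 - d / hash_bits for d in hash_distances]  # 1 = identical
    return (sum(unified) + sum(hist_overlaps)) / (len(unified) + len(hist_overlaps))
```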
(4) Removing the duplicate of the picture: and when the similarity index is larger than the similarity threshold, deleting one of the two video screenshots and storing the other video screenshot to obtain the de-duplicated picture sequence.
For example, suppose there are 4 un-de-duplicated screenshots, denoted Pic1, Pic2, Pic3 and Pic4. If the similarity index between Pic1 and Pic2 is greater than 0.75, Pic2 is deleted; the index between Pic1 and Pic3 is then calculated, and if it is also greater than 0.75, Pic3 is deleted and the index between Pic1 and Pic4 is computed next. If instead the index between Pic1 and Pic3 is not greater than 0.75, Pic3 is retained, and the indices between Pic1 and Pic4 and between Pic3 and Pic4 are then calculated respectively.
The image duplicate removal can be realized through the four steps.
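Put together, the de-duplication walk just described can be sketched as follows. Here `index_of` stands in for the full five-algorithm similarity_index pipeline above, and comparing each candidate against every retained screenshot is one reading of the Pic1 to Pic4 example:

```python
def deduplicate(paths, index_of, threshold=0.75):
    """Keep a screenshot only if its similarity index with every retained
    screenshot is at most `threshold`."""
    kept = [paths[0]]
    for path in paths[1:]:
        if all(index_of(prev, path) <= threshold for prev in kept):
            kept.append(path)  # not similar to anything retained: keep it
    return kept
```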
2. Text extraction:
A text recognition tool is used to perform text recognition on each picture to obtain text data. The Hamming distance between adjacent text elements in the text data is calculated, and the text data is partitioned according to those distances to obtain a data block for each partition. A text box is then generated for each data block, and the block's text data is inserted into it. When several text boxes are generated, their relative positions follow the relative positions of the data blocks in the picture.
The text elements in this embodiment include elements such as Chinese characters and English words.
The text data can be partitioned by the following method (a sketch in code follows these steps):
when the Hamming distance between the ith and (i+1)th text elements differs from the Hamming distance between the (i-1)th and ith text elements, the (i+1)th text element is taken as a partition starting position;
when the Hamming distance between the jth and (j+1)th text elements differs from the Hamming distance between the (j-1)th and jth text elements, the jth text element is taken as a partition ending position, where i and j are positive integers;
the elements from the (i+1)th through the jth text element are then divided into one region, completing the partitioning of the text data.
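One possible reading of these partition rules is that a block boundary falls wherever the adjacent-element Hamming distance changes; the sketch below assumes that reading and a caller-supplied distance function:

```python
def partition_text(elements, distance):
    """Split `elements` into data blocks wherever the Hamming distance
    between adjacent elements changes."""
    if not elements:
        return []
    blocks, current, prev_d = [], [elements[0]], None
    for left, right in zip(elements, elements[1:]):
        d = distance(left, right)
        if prev_d is not None and d != prev_d:
            blocks.append(current)  # distance changed: close the block at `left`
            current = []
        current.append(right)
        prev_d = d
    blocks.append(current)
    return blocks
```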
By this method the text of every picture in the picture sequence is recognized, region division is completed during recognition, and the text data of a picture is divided into the corresponding number of data blocks. A PPT page can therefore be generated for each picture according to the number of data blocks it contains: the page has the same number of text boxes as data blocks, the text boxes are positioned as the data blocks are positioned in the picture, and each text box receives the text data of its corresponding data block.
Take picture 10 in a picture sequence as an example: text recognition determines that picture 10 contains text data A. Referring to fig. 6, picture 10 contains three text boxes, text box 1, text box 2 and text box 3, with text data A1 in text box 1, text data A2 in text box 2 and text data A3 in text box 3.
After text data A is partitioned by the above method, three partitions are obtained, each corresponding to one data block; the text data of the blocks are A1, A2 and A3 respectively. A PPT page can then be generated for picture 10 with three text boxes laid out just as they are in picture 10, and A1, A2 and A3 are inserted into the corresponding boxes, completing the text recognition and text box generation for picture 10.
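With python-pptx (an assumption; the patent does not name a library), placing a text box at a data block's relative position could look like the sketch below, where rel_box is the block's (left, top, width, height) expressed as fractions of the source picture:

```python
from pptx.util import Emu

def add_text_block(slide, slide_w, slide_h, block_text, rel_box):
    """Place one text box so its relative position on the slide matches the
    data block's relative position in the source picture."""
    left, top, w, h = rel_box
    box = slide.shapes.add_textbox(Emu(int(left * slide_w)), Emu(int(top * slide_h)),
                                   Emu(int(w * slide_w)), Emu(int(h * slide_h)))
    box.text_frame.text = block_text
    return box
```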
3. Base map generation:
A blank map is generated for each picture, with a one-to-one correspondence between its pixels and the picture's pixels. According to the picture's bitmap data, it is judged whether the RGB value of each pixel in the picture is a valid value. For pixels with valid values, the RGB value is copied to the corresponding pixel of the blank map. For each target pixel of the blank map that corresponds to a pixel with an invalid value, the RGB mean of the target pixel's adjacent color block in the blank map is computed and used as the target pixel's RGB value. The blank map with all RGB values set is then taken as the base-map picture of the picture.
In one embodiment, assume picture A has size 4 x 4, i.e. 16 pixels denoted P(i,j) with i and j taking values in [0, 1, 2, 3]; the pixel P(2,2) of the blank map is taken as the target pixel in this example.
If, according to the bitmap data of picture A, the RGB value at P(2,2) is a valid value, the target pixel P(2,2) of the blank map is set to the RGB value at P(2,2) in picture A. If the RGB value at P(2,2) is an invalid value, the target pixel P(2,2) of the blank map is set to the RGB mean of its adjacent color block in the blank map; for example, the pixels P(1,1), P(1,2), P(1,3), P(2,1), P(2,2), P(2,3), P(3,1), P(3,2) and P(3,3) form the adjacent color block, and the mean of their RGB values is the RGB mean of the adjacent color block of target pixel P(2,2).
It should be noted that, the size of the adjacent color blocks is not limited in this embodiment, and may be set according to application requirements. The method for judging whether the RGB value of the pixel point on the picture is an effective value is as follows:
The RGB value of each pixel of the picture is obtained from the picture's bitmap data, and a confidence is calculated from it. For example, the confidence of pixel (i, j) may be calculated from the Red component of its RGB value as C(i,j) = Red / 255.0, i.e. the Red-based confidence is used as the confidence of pixel (i, j). Alternatively, three initial confidences can be computed from the Red, Green and Blue values respectively, and their weighted sum used as the pixel's confidence. Because base-map recognition based on the Red-value confidence is more accurate than recognition based on the Green or Blue values, and so reduces the probability of misidentifying text in the picture as base map, the Red-based confidence is preferred. When the confidence is greater than a set threshold, the pixel's RGB value is judged to be a valid value; when it is not greater than the threshold, the RGB value is judged to be an invalid value.
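A sketch of the base-map construction under these rules. The 0.5 threshold and the 3 x 3 adjacent color block are assumed values, and invalid neighbors that have not been filled yet simply contribute their current (zero) values, which is a simplification of the description above:

```python
import numpy as np

def build_base_map(img, threshold=0.5):
    """img: H x W x 3 RGB array. Copy pixels whose Red-based confidence
    exceeds `threshold`; fill the rest with their 3 x 3 neighborhood mean."""
    base = np.zeros_like(img)
    valid = img[..., 0] / 255.0 > threshold       # C(i, j) = Red / 255.0; threshold assumed
    base[valid] = img[valid]
    for i, j in zip(*np.nonzero(~valid)):
        patch = base[max(0, i - 1):i + 2, max(0, j - 1):j + 2]
        base[i, j] = patch.reshape(-1, 3).mean(axis=0).astype(img.dtype)
    return base
```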
4. PPT presentation generation:
and generating a PPT page corresponding to each picture in the PPT presentation according to the text box and the base picture corresponding to each picture and the sequence number information of the picture in the picture sequence.
The serial number of a PPT page is generated from the picture's serial number in the picture sequence; if the picture is the first in the sequence, the generated PPT page has serial number 1, i.e. it is the first page of the PPT presentation. The picture's text boxes become the text boxes of the PPT page, and the picture's base-map picture becomes the page's background picture, producing the PPT page corresponding to that picture in the PPT presentation.
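Assembling the deck with python-pptx could then be sketched as below, reusing the add_text_block helper from the text-extraction step; the page dictionary keys and the blank-layout choice are illustrative assumptions:

```python
from pptx import Presentation

def build_presentation(pages, out_path="output.pptx"):
    """pages: the de-duplicated picture sequence in order; each entry holds
    the base-map image path and its text blocks with relative positions."""
    prs = Presentation()
    blank = prs.slide_layouts[6]                  # blank slide layout
    for page in pages:                            # list order = page order
        slide = prs.slides.add_slide(blank)
        slide.shapes.add_picture(page["base_map"], 0, 0,
                                 width=prs.slide_width, height=prs.slide_height)
        for text, rel_box in page["blocks"]:
            add_text_block(slide, prs.slide_width, prs.slide_height, text, rel_box)
    prs.save(out_path)
```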
Generating the PPT presentation through steps 1 to 4 is more efficient than the traditional manual transcription of PPT presentations and saves labor cost for enterprises. In addition, the image de-duplication step computes similarity cooperatively with five algorithms, which is more accurate than common de-duplication methods; deletion, search and other operations are performed on the list of paths, which occupies no extra storage space, leaves more computation space for the similarity calculation, and makes processing faster; and the PPT page base map is obtained by computing a confidence from the RGB values, which reduces the probability of misidentifying text as background.
Corresponding to the method, the application also provides a device for generating the presentation based on the video content, and the device can be applied to electronic equipment such as a personal computer.
The device for generating the presentation based on the video content can be realized by software, or can be realized by hardware or a combination of the software and the hardware.
For example, in a software implementation, machine executable instructions corresponding to the device 60 for generating a presentation based on video content in the non-volatile memory 50 may be read by the processor 10 into the volatile memory 40 for execution.
In terms of hardware, fig. 7 is a hardware structure diagram of an electronic device in which an apparatus for generating a presentation based on video content according to the present application is located. Besides the processor 10, internal bus 20, network interface 30, volatile memory 40 and non-volatile memory 50 shown in fig. 7, the electronic device may include other hardware according to its actual functions, which is not described again.
In various embodiments, the non-volatile memory 50 may be: a storage drive (e.g., hard disk drive), a solid state drive, any type of storage disk (e.g., compact disk, DVD, etc.), or similar storage medium, or a combination thereof. The volatile memory 40 may be: RAM (random Access Memory), and the like.
Further, the non-volatile memory 50 and the volatile memory 40 serve as machine-readable storage media on which machine-executable instructions corresponding to the video content-based presentation generating apparatus 60 executed by the processor 10 may be stored.
Functionally divided, as shown in fig. 8, the apparatus 60 for generating a presentation based on video content includes:
a video acquisition unit 610 for acquiring video content;
a picture duplication removing unit 620, configured to intercept a video screenshot from the video content according to a frame sequence, and perform duplication removing processing on the video screenshot to obtain a duplicate-removed picture sequence;
a text recognition unit 630, configured to perform text recognition on each picture in the picture sequence to obtain text data, and generate a text box including the text data according to the text data;
a base map generation unit 640 for generating a base map picture of each picture from the bitmap data of the picture;
and the manuscript generating unit 650 is configured to generate a presentation page corresponding to each picture in the presentation according to the text box and the base image picture corresponding to each picture and according to the sequence number information of the picture in the picture sequence.
The text recognition unit 630 in one embodiment includes a recognition module, a calculation module, and a text box module;
the recognition module is used for performing text recognition on each picture by adopting a text recognition tool to obtain text data;
the calculation module is used for calculating the Hamming distance between adjacent text elements in the text data and carrying out partition processing on the text data based on the Hamming distance between the adjacent text elements to obtain a data block corresponding to each partition;
and the text box module is used for generating a text box for each data block and inserting the text data in the data block into the text box.
The calculation module is further used for taking the (i+1)th text element as a partition starting position when the Hamming distance between the ith text element and the (i+1)th text element is different from the Hamming distance between the (i-1)th text element and the ith text element; taking the jth text element as a partition ending position when the Hamming distance between the jth text element and the (j+1)th text element is different from the Hamming distance between the (j-1)th text element and the jth text element, wherein i and j are positive integers; and dividing the elements from the (i+1)th text element to the jth text element into a region, thereby completing the partition processing of the text data.
In one embodiment, the base map generating unit 640 is configured to generate a blank map for each picture, where a one-to-one correspondence relationship exists between a pixel point in the blank map and a pixel point in the picture; judging whether the RGB value of a pixel point on the picture is an effective value or not according to the bitmap data of the picture; setting the RGB value of the pixel point with the effective value on the picture as the RGB value of the corresponding pixel point in the blank picture according to the one-to-one correspondence; calculating the average value of adjacent color blocks of a target pixel point corresponding to the pixel point with an invalid value in the blank image, and taking the average value as the RGB value of the target pixel point; and determining the blank map with the set RGB values as a base map picture corresponding to the picture.
The base map generating unit 640 is further configured to obtain an RGB value of each pixel point of the picture according to the bitmap data of the picture, and calculate a confidence of the pixel point according to the RGB value; and when the confidence coefficient is greater than the set threshold value, judging that the RGB value of the pixel point is an effective value, and when the confidence coefficient is not greater than the set threshold value, judging that the RGB value of the pixel point is an invalid value.
In one embodiment, the picture deduplication unit 620 is configured to set an interception frame number according to a frame rate of the video content, where the interception frame number is used to indicate a time interval for intercepting a screenshot of a video screen; and intercepting a video screenshot from the video content according to the interception frame number, storing a storage path of the intercepted video screenshot in a list according to a frame sequence, and implementing duplication removal of the video screenshot by operating elements in the list.
The picture deduplication unit 620 in one embodiment comprises a picture preprocessing module and a picture calculation module;
the picture preprocessing module is used for carrying out size scaling processing and graying processing on the video screenshot to obtain a grayed video screenshot with a preset size;
and the picture calculation module is used for calculating the image fingerprint information and the gray histogram information of the processed video screenshot, and performing duplicate removal processing on the video screenshot by using the image fingerprint information and the gray histogram information to obtain a duplicate-removed picture sequence.
The picture calculation module is also used for calculating a fingerprint hash value of each video screenshot based on a preset hash algorithm, calculating the Hamming distance of adjacent video screenshots according to the fingerprint hash value, and calculating the similarity between the adjacent video screenshots according to the Hamming distance; calculating a gray level histogram of each video screenshot, and calculating the histogram contact ratio of adjacent video screenshots according to the gray level histogram; and calculating a similarity index between the adjacent video screenshots based on the similarity between the adjacent video screenshots and the histogram overlap ratio of the adjacent video screenshots, and deleting one video screenshot from the adjacent video screenshots and storing the other video screenshot when the similarity index is greater than a similarity threshold, thereby completing the de-duplication processing of the video screenshots and obtaining a de-duplicated picture sequence.
In this embodiment, the picture calculation module is further configured to perform domain unification on a first numerical domain corresponding to the similarity between the adjacent video screenshots and a second numerical domain corresponding to the histogram overlap ratio of the adjacent video screenshots, so as to obtain the similarity and the histogram overlap ratio between the adjacent video screenshots after the domain unification; and carrying out numerical value average processing on the similarity between the adjacent video screenshots after the domains are unified and the coincidence degree of the histograms, wherein the obtained average value is the similarity index.
For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
It is noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. The terms "comprises", "comprising", or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
Claims (10)
1. A video processing method, comprising:
acquiring video content;
intercepting a video screenshot from the video content according to a frame sequence, and performing duplication elimination processing on the video screenshot to obtain a duplicate-eliminated picture sequence;
performing text recognition on each picture in the picture sequence to obtain text data, and generating a text box comprising the text data according to the text data; generating a base image picture of each picture according to the bitmap data of the picture;
and generating a presentation page corresponding to each picture in the presentation according to the text box and the base picture corresponding to each picture and the sequence number information of the picture in the picture sequence.
2. The method of claim 1, wherein performing text recognition on each picture in the sequence of pictures to obtain text data, and generating a text box including the text data according to the text data comprises:
performing text recognition on each picture by adopting a text recognition tool to obtain text data;
calculating the Hamming distance between adjacent text elements in the text data, and carrying out partition processing on the text data based on the Hamming distance between the adjacent text elements to obtain a data block corresponding to each partition;
a text box is generated for each data block and the text data in the data block is inserted into the text box.
3. The method of claim 2, wherein partitioning the text data based on the hamming distance between adjacent text elements comprises:
when the Hamming distance between the ith text element and the (i+1)th text element is different from the Hamming distance between the (i-1)th text element and the ith text element, taking the (i+1)th text element as a partition starting position;
when the Hamming distance between the jth text element and the (j+1)th text element is different from the Hamming distance between the (j-1)th text element and the jth text element, taking the jth text element as a partition ending position; wherein i and j are positive integers;
and dividing the elements from the (i+1)th text element to the jth text element into a region, thereby completing the partition processing of the text data.
4. The method of claim 1, wherein generating a base picture for each picture from the bitmap data for the picture comprises:
generating a blank map for each picture, wherein pixel points in the blank map have a one-to-one correspondence with pixel points of the picture;
judging whether the RGB value of a pixel point on the picture is an effective value or not according to the bitmap data of the picture;
setting the RGB value of the pixel point with the effective value on the picture as the RGB value of the corresponding pixel point in the blank picture according to the one-to-one correspondence; calculating the RGB mean value of the adjacent color blocks of the target pixel points corresponding to the pixel points with invalid values in the blank image, and taking the RGB mean value as the RGB value of the target pixel points;
and determining the blank map with the set RGB values as a base map picture corresponding to the picture.
5. The method of claim 4, wherein determining whether the RGB values of the pixel points of the picture are valid values according to the bitmap data of the picture comprises:
acquiring the RGB value of each pixel point of the picture according to the bitmap data of the picture, and calculating the confidence of the pixel point according to the RGB value;
and when the confidence coefficient is greater than the set threshold value, judging that the RGB value of the pixel point is an effective value, and when the confidence coefficient is not greater than the set threshold value, judging that the RGB value of the pixel point is an invalid value.
6. The method of claim 1, wherein capturing video screenshots from the video content in a frame order, and performing a deduplication process on the video screenshots to obtain a deduplicated picture sequence comprises:
setting an intercepting frame number according to the frame rate of the video content, wherein the intercepting frame number is used for indicating the time interval of intercepting the screenshot of the video screen;
and intercepting a video screenshot from the video content according to the interception frame number, storing a storage path of the intercepted video screenshot in a list according to a frame sequence, and implementing duplication removal of the video screenshot by operating elements in the list.
7. The method of claim 1, wherein the performing the de-duplication process on the video screenshot to obtain a de-duplicated picture sequence comprises:
carrying out size scaling processing and graying processing on the video screenshot to obtain a grayed video screenshot with a preset size;
and calculating image fingerprint information and gray histogram information of the processed video screenshot, and performing duplicate removal processing on the video screenshot by using the image fingerprint information and the gray histogram information to obtain a duplicate-removed picture sequence.
8. The method of claim 7, wherein calculating image fingerprint information and histogram of gray level information of the processed video screenshot, and performing de-duplication processing on the video screenshot by using the image fingerprint information and histogram of gray level information to obtain a de-duplicated picture sequence comprises:
calculating a fingerprint hash value of each video screenshot based on a preset hash algorithm, calculating a Hamming distance of adjacent video screenshots according to the fingerprint hash value, and calculating the similarity between the adjacent video screenshots according to the Hamming distance;
calculating a gray level histogram of each video screenshot, and calculating the histogram contact ratio of adjacent video screenshots according to the gray level histogram;
and calculating a similarity index between the adjacent video screenshots based on the similarity between the adjacent video screenshots and the histogram overlap ratio of the adjacent video screenshots, and deleting one video screenshot from the adjacent video screenshots and storing the other video screenshot when the similarity index is greater than a similarity threshold value, thereby completing the de-duplication processing of the video screenshots and obtaining a de-duplicated picture sequence.
9. The method of claim 8, wherein calculating the similarity index between adjacent video screenshots based on their similarity and their histogram overlap comprises:
unifying the first numeric domain of the similarity between the adjacent video screenshots with the second numeric domain of their histogram overlap, to obtain a domain-unified similarity and histogram overlap;
and averaging the domain-unified similarity and histogram overlap, the resulting mean being the similarity index.
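Claim 9's domain unification can be read as mapping both measures onto a common range before averaging, so that neither dominates the index. A minimal sketch, assuming the hash similarity already lies in [0, 1] and the histogram overlap is a correlation score in [-1, 1]:

```python
def similarity_index(hash_similarity: float, hist_overlap: float) -> float:
    """Average of the two measures after mapping both onto [0, 1]."""
    unified_overlap = (hist_overlap + 1.0) / 2.0   # [-1, 1] -> [0, 1]
    return (hash_similarity + unified_overlap) / 2.0
```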
10. A video processing apparatus, comprising:
a video acquisition unit for acquiring video content;
a picture de-duplication unit for capturing video screenshots from the video content in frame order and de-duplicating the video screenshots to obtain a de-duplicated picture sequence;
a text recognition unit for performing text recognition on each picture of the picture sequence to obtain text data, and generating, from the text data, a text box comprising the text data;
a base map generation unit for generating a base map picture of each picture from the bitmap data of the picture;
and a presentation generation unit for generating, for each picture, the corresponding presentation page of a presentation according to the picture's text box, its base map picture, and its sequence-number information in the picture sequence.
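The apparatus of claim 10 maps naturally onto a pipeline of five components. A hypothetical orchestration in Python, where the unit names mirror the claim and everything else (callable interfaces, page representation) is assumed:

```python
class VideoProcessor:
    """Wires the five claim-10 units together as plain callables."""

    def __init__(self, capture, dedupe, recognize, base_map, build_page):
        self.capture = capture          # video acquisition / screenshot capture
        self.dedupe = dedupe            # picture de-duplication unit
        self.recognize = recognize      # text recognition unit
        self.base_map = base_map        # base map generation unit
        self.build_page = build_page    # presentation page generation unit

    def run(self, video_path: str) -> list:
        pictures = self.dedupe(self.capture(video_path))
        pages = []
        for order, picture in enumerate(pictures):   # sequence-number information
            text_boxes = self.recognize(picture)
            background = self.base_map(picture)
            pages.append(self.build_page(order, text_boxes, background))
        return pages
```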
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010960748.1A CN112183249A (en) | 2020-09-14 | 2020-09-14 | Video processing method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112183249A (en) | 2021-01-05
Family
ID=73920869
Family Applications (1)
Application Number | Title | Priority Date | Filing Date | Status
---|---|---|---|---
CN202010960748.1A (CN112183249A) | Video processing method and device | 2020-09-14 | 2020-09-14 | Pending
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112183249A (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105025392A (en) * | 2015-06-25 | 2015-11-04 | 西北工业大学 | Video abstract key frame extraction method based on abstract space feature learning |
CN106327502A (en) * | 2016-09-06 | 2017-01-11 | 山东大学 | Multi-scene multi-target recognition and tracking method in security video |
WO2018120821A1 (en) * | 2016-12-26 | 2018-07-05 | 北京奇虎科技有限公司 | Method and device for producing presentation |
CN107992872A (en) * | 2017-12-25 | 2018-05-04 | 广东小天才科技有限公司 | Method for carrying out text recognition on picture and mobile terminal |
CN110414352A (en) * | 2019-06-26 | 2019-11-05 | 深圳市容会科技有限公司 | The method and relevant device of PPT the file information are extracted from video file |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113778595A (en) * | 2021-08-25 | 2021-12-10 | 维沃移动通信有限公司 | Document generation method and device and electronic equipment |
CN113971229A (en) * | 2021-10-20 | 2022-01-25 | 成都智元汇信息技术股份有限公司 | Frame comparison method analysis method and device |
CN114554133A (en) * | 2022-02-22 | 2022-05-27 | 联想(北京)有限公司 | Information processing method and device and electronic equipment |
CN115546241A (en) * | 2022-12-06 | 2022-12-30 | 成都数之联科技股份有限公司 | Edge detection method, edge detection device, electronic equipment and computer readable storage medium |
CN116523555A (en) * | 2023-05-12 | 2023-08-01 | 珍岛信息技术(上海)股份有限公司 | Clue business opportunity insight system based on NLP text processing technology |
CN116523555B (en) * | 2023-05-12 | 2024-07-16 | 珍岛信息技术(上海)股份有限公司 | Clue business opportunity insight system based on NLP text processing technology |
CN116320622A (en) * | 2023-05-17 | 2023-06-23 | 成都索贝数码科技股份有限公司 | Broadcast television news video-to-picture manuscript manufacturing system and manufacturing method |
CN116320622B (en) * | 2023-05-17 | 2023-08-18 | 成都索贝数码科技股份有限公司 | Broadcast television news video-to-picture manuscript manufacturing system and manufacturing method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112183249A (en) | Video processing method and device | |
US8942487B1 (en) | Similar image selection | |
EP2565804A1 (en) | Text-based searching of image data | |
CN111209897B (en) | Video processing method, device and storage medium | |
CN112203036B (en) | Method and device for generating text document based on video content | |
WO2020186779A1 (en) | Image information identification method and apparatus, and computer device and storage medium | |
CN110245132A (en) | Data exception detection method, device, computer readable storage medium and computer equipment | |
CN111625687B (en) | Method and system for quickly searching people in media asset video library through human faces | |
CN110188217A (en) | Image duplicate checking method, apparatus, equipment and computer-readable storage media | |
CN112528799B (en) | Teaching live broadcast method and device, computer equipment and storage medium | |
CN111444362B (en) | Malicious picture interception method, device, equipment and storage medium | |
US20220414393A1 (en) | Methods and Systems for Generating Composite Image Descriptors | |
CN113205046A (en) | Method, system, device and medium for identifying question book | |
CN114357206A (en) | Education video color subtitle generation method and system based on semantic analysis | |
CN109697240B (en) | Image retrieval method and device based on features | |
CN114329050A (en) | Visual media data deduplication processing method, device, equipment and storage medium | |
US11874869B2 (en) | Media retrieval method and apparatus | |
CN104484869A (en) | Image matching method and system for ordinal measure features | |
Maret et al. | A novel replica detection system using binary classifiers, r-trees, and pca | |
CN118135250B (en) | Image recognition method, device, electronic equipment and medium | |
CN116452835A (en) | Image processing method, device and equipment | |
CN117058581A (en) | Video duplication detection method and device | |
CN112528722A (en) | Book checking method and system based on SSD-ResNet algorithm | |
CN118035494A (en) | Information determination method, apparatus, device and computer readable storage medium | |
Valveny | Image-Text Matching for Large-Scale Book Collections |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
WD01 | Invention patent application deemed withdrawn after publication | |
Application publication date: 2021-01-05