CN114581900A - Method and device for identifying video subtitles, electronic equipment and storage medium - Google Patents

Method and device for identifying video subtitles, electronic equipment and storage medium

Publication number: CN114581900A
Application number: CN202210232824.6A
Authority: CN (China)
Applicant / Current assignee: Beijing Minglue Zhaohui Technology Co Ltd (assignee listing not legally verified by Google)
Inventors: 安达, 唐大闰
Other languages: Chinese (zh)
Legal status: Pending (assumed; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Prior art keywords: text box, text, video, width, boxes

Abstract

The application relates to the technical field of video processing and discloses a method for recognizing video subtitles, comprising the following steps: performing character recognition on a plurality of video frames to obtain all text boxes in each video frame; determining a plurality of text box sets according to how often each text box height occurs; calculating the mean square error of the text box widths within each set; and determining the subtitles from these width mean square errors. Because subtitle text boxes in a video share the same height while their widths differ from frame to frame, the subtitle text boxes can be accurately identified from the height and width of the text boxes alone. When identifying the subtitles of different types of videos, information such as the video size and subtitle position need not be considered, and no manual labeling is required to classify subtitles. The application also discloses a device, an electronic device and a storage medium for identifying video subtitles.

Description

Method and device for identifying video subtitles, electronic equipment and storage medium
Technical Field
The present application relates to the field of video processing technologies, and for example, to a method and an apparatus for identifying video subtitles, an electronic device, and a storage medium.
Background
At present, video has become an increasingly common medium for people to convey information and can be divided into long videos, short videos, live videos and the like. Whatever the type of video, subtitles help convey the video content clearly to the user, and users may also want to extract or recognize subtitles from a video for other purposes such as content understanding.
In the related art, subtitles are identified by assuming a fixed position: for example, when the video content is a movie or television clip, the subtitles appear at the bottom of the video and are located by setting a fixed height. Alternatively, a classification method is used: for videos that contain text other than subtitles, a large number of subtitle and non-subtitle samples are manually labeled, and a neural network classification model is trained to locate the subtitles. The related art also provides a video subtitle positioning method comprising: acquiring all image frames of a video; performing character detection on all image frames to obtain a first text box set for all image frames; traversing the first text box set and computing a first similarity, in a first direction, between text boxes of every pair of image frames; constructing a first graph network over the first text box set based on the plurality of first similarities; clustering the first graph network into a plurality of first sub-networks; and extracting the text boxes of the video subtitles from the first sub-networks whose number of nodes satisfies a first preset condition.
In the process of implementing the embodiments of the present disclosure, it is found that at least the following problems exist in the related art:
Because different videos have different content, the subtitle position, subtitle type and subtitle font differ from video to video, and some videos contain not only subtitle text but also background text, bullet-screen text, special-effect text and the like. Under the influence of so many factors, subtitle text cannot be effectively distinguished from other text during recognition, and misrecognition easily occurs.
Disclosure of Invention
The following presents a simplified summary in order to provide a basic understanding of some aspects of the disclosed embodiments. This summary is not an extensive overview; it is intended neither to identify key or critical elements nor to delineate the scope of such embodiments, but to serve as a prelude to the more detailed description that is presented later.
The embodiment of the disclosure provides a method and a device for video subtitle recognition, an electronic device and a storage medium, so as to improve the accuracy of subtitle recognition of different videos.
In some embodiments, the method for video subtitle recognition includes: performing character recognition on a plurality of video frames to obtain all text boxes in each video frame; wherein each text box comprises a line of words; determining a plurality of text box sets according to the occurrence times of the height of the text boxes; wherein the text boxes in each text box set have the same height; calculating the mean square error of the width of each text box in each text box set; and determining the subtitles according to the mean square error of the widths of the text boxes.
In some embodiments, the apparatus for video caption recognition comprises: the character recognition module is configured to perform character recognition on a plurality of video frames to obtain all text boxes in each video frame; wherein each text box comprises a line of words; a determining module configured to determine a plurality of text box sets according to the number of occurrences of the text box height; wherein the text boxes in each text box set have the same height; a calculation module configured to calculate a text box width mean square error for each set of text boxes; and the caption identification module determines the caption according to the mean square error of the width of each text box.
In some embodiments, the apparatus for video caption recognition includes a processor and a memory storing program instructions, the processor being configured to execute the method for video caption recognition as described above when executing the program instructions.
In some embodiments, the electronic device comprises the apparatus for video caption recognition as described above.
In some embodiments, the storage medium stores program instructions that, when executed, perform the method for video subtitle recognition as described above.
The method and device for identifying video subtitles, the electronic device and the storage medium provided by the embodiments of the present disclosure can achieve the following technical effects. Once the text boxes in all video frames are obtained, multiple text box sets are grouped by text box height, because subtitle text boxes in a video share the same height while their widths differ across frames, and the subtitle text boxes can be accurately identified by computing the width mean square error of each set. When identifying subtitles in different types of videos, information such as the video size and subtitle position need not be considered, and no manual labeling is required to classify subtitles. Compared with positioning methods from the field of deep learning, this computer-vision method reduces computational complexity and accelerates computation, thereby improving the accuracy of recognizing subtitles in different videos and effectively eliminating other interfering information in the video.
The foregoing general description and the following description are exemplary and explanatory only and are not restrictive of the application.
Drawings
One or more embodiments are illustrated by the accompanying drawings, which are exemplary rather than limiting; elements bearing the same reference numerals denote like elements:
fig. 1 is a schematic diagram of a method for video subtitle recognition according to an embodiment of the present disclosure;
fig. 2 is a schematic diagram of another method for video subtitle recognition provided by an embodiment of the present disclosure;
fig. 3 is a schematic diagram of another method for video subtitle recognition provided by an embodiment of the present disclosure;
fig. 4 is a schematic diagram of another method for video subtitle recognition provided by an embodiment of the present disclosure;
fig. 5 is a schematic diagram of an apparatus for video subtitle recognition according to an embodiment of the present disclosure;
fig. 6 is a schematic diagram of another apparatus for video subtitle recognition according to an embodiment of the present disclosure.
Detailed Description
So that the manner in which the features and elements of the disclosed embodiments can be understood in detail, a more particular description of the disclosed embodiments, briefly summarized above, may be had by reference to the embodiments, some of which are illustrated in the appended drawings. In the following description of the technology, for purposes of explanation, numerous details are set forth in order to provide a thorough understanding of the disclosed embodiments. However, one or more embodiments may be practiced without these details. In other instances, well-known structures and devices may be shown in simplified form in order to simplify the drawing.
The terms "first," "second," and the like in the description, the claims, and the drawings of the embodiments of the present disclosure are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It should be understood that the data so used may be interchanged under appropriate circumstances, so that the embodiments of the present disclosure described herein can be practiced in orders other than those illustrated or described herein. Furthermore, the terms "comprising" and "having," as well as any variations thereof, are intended to cover non-exclusive inclusions.
The term "plurality" means two or more unless otherwise specified.
In the embodiments of the present disclosure, the character "/" indicates that the preceding and following objects are in an "or" relationship. For example, A/B represents: A or B.
The term "and/or" describes an association between objects and indicates that three relationships may exist. For example, "A and/or B" represents: A alone, B alone, or both A and B.
The term "correspond" may refer to an association or binding relationship; "A corresponds to B" means that A and B are associated or bound to each other.
As shown in fig. 1, an embodiment of the present disclosure provides a method for video subtitle recognition, including:
s101, a processor performs character recognition on a plurality of video frames to obtain all text boxes in each video frame; wherein each text box comprises a line of words;
step S102, the processor determines a plurality of text box sets according to the occurrence times of the height of the text boxes; wherein the text boxes in each text box set have the same height;
step S103, the processor calculates the mean square error of the width of each text box in each text box set;
and step S104, the processor determines the subtitles according to the mean square error of the width of each text box.
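The four steps above can be sketched end to end in plain Python. This is an illustrative reading of the claimed method, not code from the patent; the function name, the use of the population standard deviation, and the synthetic box data are all assumptions:

```python
import statistics
from collections import defaultdict

def find_subtitle_boxes(boxes, top_n=3):
    """boxes: (height, width) pairs gathered by OCR over all video frames.

    Steps S101-S104: group boxes by height, keep the top_n most frequent
    heights, then pick the set whose widths vary the most (largest
    population standard deviation) as the subtitle set.
    """
    sets_by_height = defaultdict(list)
    for height, width in boxes:
        sets_by_height[height].append(width)
    # Step S102: keep the N height sets that occur most often.
    frequent = sorted(sets_by_height.items(),
                      key=lambda kv: len(kv[1]), reverse=True)[:top_n]
    # Steps S103-S104: subtitle widths change line by line, so the subtitle
    # set is the one with the largest width dispersion.
    best_height, _ = max(frequent, key=lambda kv: statistics.pstdev(kv[1]))
    return best_height

# Synthetic data: height-16 boxes are "subtitles" (widths vary a lot),
# height-12 boxes are a static watermark (constant width).
boxes = [(16, w) for w in (40, 120, 75, 200, 95, 150)] + [(12, 80)] * 6
print(find_subtitle_boxes(boxes))  # 16
```

Static text such as a watermark keeps a near-constant width, so its set scores a low standard deviation and loses to the subtitle set.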
With the method for identifying video subtitles provided by the embodiments of the present disclosure, once the text boxes in all video frames are obtained, multiple text box sets are grouped by text box height, because subtitle text boxes in a video share the same height while their widths differ across frames, and the subtitle text boxes can be accurately identified by computing the width mean square error of each set. When identifying subtitles in different types of videos, information such as the video size and subtitle position need not be considered, and no manual labeling is required to classify subtitles. Thus, through a computer-vision method, computational complexity is reduced and computation is accelerated, improving the accuracy of recognizing subtitles in different videos and effectively eliminating other interfering information in the video.
Character recognition is performed on a plurality of video frames to obtain all text boxes in each video frame. Optionally, the video may be of any kind, such as a long video, a short video, a live video, a television series, a movie or a variety show. Besides subtitle text, the video may contain other text such as bullet-screen text, background text and advertisement text; if the video is a short video, it may additionally contain text information such as watermark text, a user nickname or the video title.
Here, the video is cut into frames to obtain all the video frames that compose it, and Optical Character Recognition (OCR) is performed on each video frame, so that the characters are recognized and text boxes containing them are obtained; these text boxes cover all the kinds of text mentioned above. OCR acquires character images from printed matter such as bills, newspapers, books and manuscripts through optical input such as scanning or photographing, and then uses pattern-recognition algorithms to analyze character shape features and convert the image information into text that a computer can process.
In this embodiment, a plurality of text box sets are determined according to the number of occurrences of each text box height. Different kinds of text have different characteristics: background text in the same video, for example, stays essentially unchanged as long as the scene does not change. For video subtitles, the text height may change slightly as the text changes, but the width of a whole line of text changes markedly. Moreover, different kinds of subtitles have different heights and occur with different frequencies, so all text boxes can be classified according to how often each text box height occurs.
Among the multiple text box sets, each set contains multiple text boxes of the same height, that is, multiple lines of text of the same height, which makes it convenient to determine which same-height set contains the subtitle text.
In this embodiment, the mean square error of the text box widths in each set is calculated. The mean square error here is the standard deviation: the arithmetic square root of the variance, i.e., of the arithmetic mean of the squared deviations from the mean. It reflects the degree of dispersion of a data set. Because the standard deviation is measured relative to the mean of the sample data, it indicates how far an observed value lies from the mean, and it is strongly affected by extreme values. In this application it expresses the dispersion of the text box widths within a text box set.
In this embodiment, since the width of a whole line of subtitle text changes markedly, the widths of the text boxes in each set can be examined. The smaller the standard deviation, the closer together the widths of the text boxes in the set; the larger the standard deviation, the more dispersed they are. The subtitles are determined according to the width mean square error of each set.
Optionally, the method for recognizing video subtitles provided in the embodiments of the present disclosure may also be applied to recognizing text in a plurality of image frames; the operation steps are the same as those of the method above and are not repeated here.
With reference to fig. 2, another method for identifying a video subtitle according to an embodiment of the present disclosure includes:
Step S201: a processor performs character recognition on a plurality of video frames to obtain all text boxes in each video frame, wherein each text box comprises a line of words;
Step S202: the processor clusters the text boxes having the same height to generate a plurality of text box sets, wherein the text boxes in each set have the same height;
Step S203: the processor arranges the text box sets in descending order according to the number of text boxes in each set;
Step S204: the processor selects the first N text box sets, where N is an integer greater than 1;
Step S205: the processor calculates the mean square error of the text box widths in each text box set;
Step S206: the processor determines the subtitles according to the width mean square error of each text box set.
In this embodiment, in order to further improve the accuracy of recognizing different video subtitles, enough video frames and enough text box sets are required, and a certain number of text boxes are required in the text box sets. In this way, the video subtitles can be identified more accurately.
In this embodiment, since subtitle text boxes in a video have the same height, a plurality of text box sets are generated by clustering text boxes of equal height, and the sets are arranged in descending order of the number of text boxes they contain. Sets formed from many text boxes therefore sit at the front of the queue, which provides good data for the width mean-square-error computation and improves the accuracy of recognizing subtitles in different videos.
In this embodiment, when a video contains a large amount of dialogue text, the frame-cutting frequency can be reduced appropriately; when it contains little dialogue text, the frame-cutting frequency can be increased appropriately.
In this embodiment, when there are many video frames, some non-subtitle text can be filtered out by arranging the text box sets in descending order. For example, if the video contains bullet-screen text, that text may appear in only a single extracted frame. The height set formed by the bullet-screen text therefore falls at the back of the queue, and selecting only the front sets filters it out directly, which effectively eliminates interference in the video, reduces computational complexity and accelerates computation.
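The filtering effect described here can be illustrated with a small sketch (the heights and counts are invented for the example):

```python
from collections import Counter

# Invented heights collected across frames: 14 is the recurring subtitle
# height; 9 appears once (a bullet-screen comment caught in a single frame).
heights = [14] * 50 + [20] * 30 + [11] * 25 + [9]

counts = Counter(heights)
# Descending order by set size: a one-frame bullet screen lands at the back
# of the queue and is dropped when only the first N sets are kept.
ordered = [h for h, _ in counts.most_common()]
top_n = ordered[:3]
print(top_n)  # [14, 20, 11] -- the bullet-screen height 9 is filtered out
```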
Optionally, in some embodiments, the text box sets may instead be arranged in ascending order according to the number of text boxes in each set, and the last N text box sets selected, where N is an integer greater than 1.
To calculate the text box width mean square error of each text box set, in some embodiments, the mean square error is computed as:

$$\sigma_I = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(w_i - \bar{w}\right)^2}$$

where $\sigma_I$ is the text box width mean square error of the $I$-th text box set, $I = 1, \dots, N$; $n$ is the number of text boxes in the set; $w_i$ is the width of the $i$-th text box, $i = 1, \dots, n$; and $\bar{w}$ is the mean width of the text boxes in the set.
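A direct transcription of this formula, as a sketch; `statistics.pstdev` computes the same population standard deviation and serves as a cross-check:

```python
import math
import statistics

def width_mse(widths):
    """Population standard deviation of text box widths (the sigma_I above)."""
    mean_w = sum(widths) / len(widths)
    return math.sqrt(sum((w - mean_w) ** 2 for w in widths) / len(widths))

widths = [2, 4, 4, 4, 5, 5, 7, 9]
print(width_mse(widths))  # 2.0
assert width_mse(widths) == statistics.pstdev(widths)
```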
In some embodiments, the video is cut into 100 video frames, and OCR over those 100 frames yields 544 text boxes. The correspondence between text box height and the number of text boxes is:

    Height h:   10mm  11mm  12mm  13mm  14mm  15mm  16mm
    Number n:   102   82    73    53    46    100   88
In the embodiment, the first 5 text box sets, namely a 10mm text box set, a 15mm text box set, a 16mm text box set, an 11mm text box set and a 12mm text box set, are selected by sorting the number of the text boxes;
First, the mean width of the text boxes in each selected set is calculated: over the 102 text boxes of the 10mm set, the 100 text boxes of the 15mm set, the 88 text boxes of the 16mm set, the 82 text boxes of the 11mm set, and the 73 text boxes of the 12mm set.
Substituting each text box width, the mean width and the number of text boxes into the mean square error formula yields the width mean square error of each of the 5 sets, and the subtitles are determined from these values, effectively eliminating interference in the video, reducing computational complexity and accelerating computation.
With reference to fig. 3, another method for identifying video subtitles is provided in an embodiment of the present disclosure, including:
step S301, the processor performs character recognition on a plurality of video frames to obtain all text boxes in each video frame; wherein each text box comprises a line of words;
step S302, the processor carries out clustering collection on the text boxes with the same height to generate a plurality of text box collections; wherein the text boxes in each text box set have the same height;
step S303, the processor performs descending order arrangement on each text box set according to the number of the text boxes in each text box set;
s304, the processor selects the first N text box sets, wherein N is an integer larger than 1;
s305, calculating the mean square error of the width of each text box set by the processor;
s306, the processor determines the maximum value in the mean square error of the width of each text box;
in step S307, the processor uses each text box in the text box set corresponding to the maximum value as a subtitle.
In this embodiment, to further improve the accuracy of recognizing subtitles in different videos, the width of each text box, the mean width and the number of text boxes are substituted into the mean square error formula to obtain the width mean square error of each text box set. The set whose width mean square error is largest can then be identified directly by comparison, and each text box in it taken as a subtitle; alternatively, the sets can be sorted by their width mean square error values, the corresponding set selected, and each text box in that set taken as a subtitle.
In some embodiments, when determining the text box sets according to the number of occurrences of each height, the detected height of the same text may differ slightly between video frames, and the text height itself may also change slightly as the text changes. In such cases it is necessary to compute the mean height $\bar{h}$ of similar text boxes and determine the text box sets according to that mean, clustering the text boxes whose heights are close to $\bar{h}$; this improves the accuracy of recognizing subtitles in different videos.
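One way to realize this height-averaging step is a greedy clustering sketch; the patent only says that similar heights are averaged and clustered around that mean, so the running-mean comparison and the tolerance value here are assumptions:

```python
def cluster_heights(heights, tol=0.5):
    """Greedily group heights lying within tol of a cluster's running mean.

    The grouping rule and the tolerance are assumptions, not from the patent.
    """
    clusters = []  # each cluster: [sum_of_heights, count, members]
    for h in sorted(heights):
        for c in clusters:
            if abs(h - c[0] / c[1]) <= tol:  # compare against running mean
                c[0] += h
                c[1] += 1
                c[2].append(h)
                break
        else:
            clusters.append([h, 1, [h]])
    return [c[2] for c in clusters]

print(cluster_heights([10.0, 10.2, 9.9, 15.1, 15.0]))
# [[9.9, 10.0, 10.2], [15.0, 15.1]] -- one cluster per true height
```

Tolerance-based grouping absorbs the small frame-to-frame jitter in detected heights that an exact-equality grouping would split into separate sets.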
In this embodiment, under the condition of a large number of text box sets, the number of selected text box sets can be reduced appropriately, so as to reduce the calculation complexity and the calculation amount and accelerate the calculation speed.
Optionally, in some embodiments, the width mean square error values may instead be arranged in ascending order, and the text box width mean square error with the largest value selected.
As shown in fig. 4, another method for identifying video subtitles according to an embodiment of the present disclosure includes:
Step S401: the processor 100 performs character recognition on a plurality of consecutive video frames;
Step S402: the processor 100 identifies all text boxes in each video frame to obtain the height and width of each text box, wherein each text box comprises a line of words;
Step S403: the processor 100 clusters the text boxes having the same height to generate a plurality of text box sets, wherein the text boxes in each set have the same height;
Step S404: the processor arranges the text box sets in descending order according to the number of text boxes in each set;
Step S405: the processor selects the first N text box sets, where N is an integer greater than 1;
Step S406: the processor calculates the mean square error of the text box widths in each text box set;
Step S407: the processor determines the maximum value among the width mean square errors;
Step S408: the processor takes each text box in the text box set corresponding to the maximum value as a subtitle.
In this embodiment, character recognition is performed on a plurality of consecutive video frames, all text boxes in each frame are identified, and the height and width of each text box are obtained. Within a preset time period, character recognition is performed on consecutive video frames at a fixed interval; optionally, the interval is 1 s, 2 s, 3 s, 4 s and so on. Taking a preset time period of 2 min and an interval of 1 s as an example, the video is cut into one frame per second, finally yielding 120 video frames. All text boxes in the 120 frames are identified, and the height and width of each is recorded; text boxes of equal height are clustered, finally yielding 5 text box sets of different heights, where set h1 has 3 text boxes, h2 has 1, h3 has 10, h4 has 8, and h5 has 15. Sorting the sets in descending order of size gives h5 > h3 > h4 > h1 > h2. The sets h5, h3 and h4 are selected, their text box width mean square errors are calculated, and the final ordering of those mean square errors is h5 > h3 > h4; the text boxes in set h5 are therefore the subtitles. In this way, different text can be classified by text box height, while computing the width mean square error of each set accurately identifies the subtitle text boxes, effectively eliminating other interfering information in the video and improving the accuracy of recognizing subtitles in different videos.
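The descending ordering in this walkthrough can be checked with a short sketch (the set names and counts restate the example above):

```python
# Box counts per height set from the walkthrough above (120 frames).
set_sizes = {"h1": 3, "h2": 1, "h3": 10, "h4": 8, "h5": 15}

order = sorted(set_sizes, key=set_sizes.get, reverse=True)
print(order)  # ['h5', 'h3', 'h4', 'h1', 'h2']

# h5, h3 and h4 go on to the width mean-square-error step; h5 wins there,
# so its text boxes are taken as the subtitles.
selected = order[:3]
```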
As shown in fig. 5, an apparatus for recognizing video subtitles according to an embodiment of the present disclosure includes a text recognition module 501, a determination module 502, a calculation module 503, and a subtitle recognition module 504. A character recognition module 501 configured to perform character recognition on a plurality of video frames to obtain all text boxes in each video frame; wherein each text box comprises a line of words; a determining module 502 configured to determine a plurality of text box sets according to the number of occurrences of the text box height; wherein the text boxes in each text box set have the same height; a calculation module 503 configured to calculate a text box width mean square error of each text box set; and the subtitle identification module 504 determines the subtitles according to the mean square error of the widths of the text boxes.
With this device for identifying video subtitles, once the text boxes in all video frames are obtained, multiple text box sets are grouped by text box height, because subtitle text boxes in a video share the same height while their widths differ across frames, and the subtitle text boxes are accurately identified by computing the width mean square error of each set. When identifying subtitles in different types of videos, information such as the video size and subtitle position need not be considered, and no manual labeling is required to classify subtitles. Computational complexity is thus reduced and computation accelerated, improving the accuracy of recognizing subtitles in different videos and effectively eliminating other interfering information in the video.
Optionally, the determining module 502 further comprises an aggregating unit, a sorting unit and a selecting unit. The aggregating unit is configured to cluster the text boxes having the same height to generate a plurality of text box sets; the sorting unit is configured to arrange the text box sets in descending order according to the number of text boxes in each set; and the selecting unit is configured to select the first N text box sets, where N is an integer greater than 1.
Optionally, the text recognition module 501 is specifically configured to perform text recognition on a plurality of consecutive video frames; all text boxes in each video frame are identified, and the height and width of each text box are obtained.
Optionally, the subtitle recognition module 504 is specifically configured to determine a maximum value of the mean square error of the widths of the text boxes; and taking each text box in the text box set corresponding to the maximum value as a subtitle.
As shown in fig. 6, an apparatus for video subtitle recognition according to an embodiment of the present disclosure includes a processor (processor)100 and a memory (memory) 101. Optionally, the apparatus may also include a Communication Interface (Communication Interface)102 and a bus 103. The processor 100, the communication interface 102, and the memory 101 may communicate with each other via a bus 103. The communication interface 102 may be used for information transfer. The processor 100 may call logic instructions in the memory 101 to perform the method for video caption recognition of the above-described embodiment.
In addition, the logic instructions in the memory 101 may be implemented in the form of software functional units and, when sold or used as an independent product, may be stored in a computer-readable storage medium.
The memory 101, which is a computer-readable storage medium, may be used for storing software programs, computer-executable programs, such as program instructions/modules corresponding to the methods in the embodiments of the present disclosure. The processor 100 executes functional applications and data processing, i.e., implements the method for video caption recognition in the above-described embodiments, by executing program instructions/modules stored in the memory 101.
The memory 101 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the terminal device, and the like. In addition, the memory 101 may include a high-speed random access memory, and may also include a nonvolatile memory.
The embodiment of the disclosure provides an electronic device, which comprises the above device for recognizing the video subtitles.
The disclosed embodiments provide a computer-readable storage medium storing computer-executable instructions configured to perform the above-described method for video subtitle recognition.
The disclosed embodiments provide a computer program product comprising a computer program stored on a computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform the above-described method for video subtitle recognition.
The computer-readable storage medium described above may be a transitory computer-readable storage medium or a non-transitory computer-readable storage medium.
The technical solution of the embodiments of the present disclosure may be embodied in the form of a software product, where the computer software product is stored in a storage medium and includes one or more instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method of the embodiments of the present disclosure. The aforementioned storage medium may be a non-transitory storage medium, including: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code; it may also be a transitory storage medium.
The above description and drawings sufficiently illustrate embodiments of the disclosure to enable those skilled in the art to practice them. Other embodiments may incorporate structural, logical, electrical, process, and other changes. The examples merely typify possible variations. Individual components and functions are optional unless explicitly required, and the sequence of operations may vary. Portions and features of some embodiments may be included in or substituted for those of others. Furthermore, the words used in the specification are words of description for example only and are not limiting upon the claims. As used in the description of the embodiments and the claims, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. Similarly, the term "and/or" as used in this application is meant to encompass any and all possible combinations of one or more of the associated listed. Furthermore, the terms "comprises" and/or "comprising," when used in this application, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. Without further limitation, an element defined by the phrase "comprising an …" does not exclude the presence of other like elements in a process, method or apparatus that comprises the element. In this document, each embodiment may be described with emphasis on differences from other embodiments, and the same and similar parts between the respective embodiments may be referred to each other. For methods, products, etc. of the embodiment disclosures, reference may be made to the description of the method section for relevance if it corresponds to the method section of the embodiment disclosure.
Those of skill in the art would appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software may depend upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the disclosed embodiments. It can be clearly understood by the skilled person that, for convenience and brevity of description, the specific working processes of the system, the apparatus and the unit described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the embodiments disclosed herein, the disclosed methods, products (including but not limited to devices, apparatuses, etc.) may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units may be only one type of logical functional division, and there may be other divisions in actual implementation, for example, multiple units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form. The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to implement the present embodiment. In addition, functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. In the description corresponding to the flowcharts and block diagrams in the figures, operations or steps corresponding to different blocks may also occur in different orders than disclosed in the description, and sometimes there is no specific order between the different operations or steps. For example, two sequential operations or steps may in fact be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending upon the functionality involved. Each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

Claims (10)

1. A method for video caption recognition, comprising:
performing character recognition on a plurality of video frames to obtain all text boxes in each video frame; wherein each of the text boxes comprises a line of text;
determining a plurality of text box sets according to the occurrence times of the text box heights; wherein the text boxes in each text box set have the same height;
calculating the mean square error of the width of each text box in each text box set;
and determining the subtitles according to the mean square error of the width of each text box.
2. The method of claim 1, wherein determining a plurality of text box sets based on the number of occurrences of the text box height comprises:
clustering the text boxes with the same height to generate a plurality of text box sets;
sorting the text box sets in descending order according to the number of the text boxes in each text box set;
and selecting the first N text box sets, wherein N is an integer greater than 1.
3. The method of claim 2, wherein the mean square error of the width of each text box in each text box set is calculated as:
σ_I = sqrt( (1/n) · Σ_{i=1}^{n} (w_i − w̄)² )
wherein σ_I is the mean square error of the width of the text boxes in the I-th text box set, I = 1, …, N, and N is the number of text box sets; n is the number of text boxes in the text box set; w_i is the width of the i-th text box, i = 1, …, n; and w̄ is the average width of the text boxes in the text box set.
4. The method of any one of claims 1 to 3, wherein determining the subtitles according to the mean square error of the width of each text box comprises:
determining the maximum value in the mean square error of the width of each text box;
and taking each text box in the text box set corresponding to the maximum value as a subtitle.
5. The method of claim 4, wherein performing text recognition on a plurality of video frames to obtain all text boxes in each video frame comprises:
performing character recognition on a plurality of continuous video frames;
all text boxes in each video frame are identified, and the height and width of each text box are obtained.
6. An apparatus for video caption recognition, comprising:
the character recognition module is configured to perform character recognition on a plurality of video frames to obtain all text boxes in each video frame; wherein each of the text boxes comprises a line of text;
a determining module configured to determine a plurality of text box sets according to the number of occurrences of the text box height; wherein the text boxes in each text box set have the same height;
a calculation module configured to calculate a text box width mean square error of each text box set;
and a subtitle recognition module configured to determine the subtitles according to the mean square error of the width of each text box.
7. The apparatus of claim 6, wherein the determining module comprises:
the aggregation unit is configured to cluster the text boxes with the same height to generate a plurality of text box sets;
the sorting unit is configured to sort the text box sets in a descending order according to the number of the text boxes in each text box set;
and the selecting unit is configured to select the first N text box sets, wherein N is an integer larger than 1.
8. An apparatus for video caption recognition comprising a processor and a memory having stored thereon program instructions, wherein the processor is configured to execute the method for video caption recognition according to any one of claims 1 to 5 when executing the program instructions.
9. An electronic device, characterized in that it comprises an apparatus for video subtitle recognition according to any one of claims 6 to 8.
10. A storage medium storing program instructions which, when executed, perform a method for video subtitle recognition according to any one of claims 1 to 5.
CN202210232824.6A 2022-03-09 2022-03-09 Method and device for identifying video subtitles, electronic equipment and storage medium Pending CN114581900A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210232824.6A CN114581900A (en) 2022-03-09 2022-03-09 Method and device for identifying video subtitles, electronic equipment and storage medium


Publications (1)

Publication Number Publication Date
CN114581900A true CN114581900A (en) 2022-06-03

Family

ID=81773257

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210232824.6A Pending CN114581900A (en) 2022-03-09 2022-03-09 Method and device for identifying video subtitles, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114581900A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115797921A (en) * 2023-02-03 2023-03-14 北京探境科技有限公司 Subtitle recognition method and device, electronic equipment and readable storage medium

Similar Documents

Publication Publication Date Title
CN102144236B (en) Text localization for image and video OCR
US6047251A (en) Automatic language identification system for multilingual optical character recognition
US8965127B2 (en) Method for segmenting text words in document images
JP5050075B2 (en) Image discrimination method
US9892342B2 (en) Automatic image product creation for user accounts comprising large number of images
US8462394B2 (en) Document type classification for scanned bitmaps
US8315465B1 (en) Effective feature classification in images
CN102177520A (en) Segmenting printed media pages into articles
EP2220590A1 (en) A method for processing optical character recognition (ocr) data, wherein the output comprises visually impaired character images
CN102982330A (en) Method and device recognizing characters in character images
CN102782702A (en) Paragraph recognition in an optical character recognition (OCR) process
US20180089525A1 (en) Method for line and word segmentation for handwritten text images
CN103761221B (en) System and method for identifying sensitive text messages
JP7026165B2 (en) Text recognition method and text recognition device, electronic equipment, storage medium
Shivakumara et al. Video text detection based on filters and edge features
CN114581900A (en) Method and device for identifying video subtitles, electronic equipment and storage medium
CN104966109A (en) Medical laboratory report image classification method and apparatus
CN111144445A (en) Error detection method and system for printing book and periodical writing format and electronic equipment
Karanje et al. Survey on text detection, segmentation and recognition from a natural scene images
CN110796134B (en) Method for combining words of Chinese characters in strong-noise complex background image
US20170053185A1 (en) Automatic image product creation for user accounts comprising large number of images
Kumar et al. Line based robust script identification for indianlanguages
CN113111882B (en) Card identification method and device, electronic equipment and storage medium
CN111445433B (en) Method and device for detecting blank page and fuzzy page of electronic file
Boiangiu et al. Efficient solutions for ocr text remote correction in content conversion systems

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination