WO2020052085A1

WO2020052085A1 - Video text detection method and device, and computer readable storage medium

Info

Publication number: WO2020052085A1
Application number: PCT/CN2018/117715
Authority: WO
Inventors: 周多友; 王长虎
Original assignee: 北京字节跳动网络技术有限公司
Priority date: 2018-09-13
Filing date: 2018-11-27
Publication date: 2020-03-19
Also published as: CN109299682A

Abstract

Disclosed are a video text detection method, a video text detection device, a video text detection hardware device, and a computer readable storage medium. The video text detection method comprises: partitioning an image to be detected extracted from a video to be detected, so as to obtain at least one image block; and determining whether said video comprises text information according to text detection results of the image blocks. In embodiments of the present disclosure, first, the image to be detected extracted from the video to be detected is partitioned to obtain at least one image block, and then, it is determined whether said video comprises text information according to the text detection results of the image blocks; therefore, the text detection accuracy can be improved.

Description

Video text detection method, device and computer-readable storage medium

Cross reference

The present disclosure refers to Chinese patent application No. 201811065276.2, filed on September 13, 2018 and entitled "Video Text Detection Method, Apparatus, and Computer-readable Storage Medium", which is incorporated herein by reference in its entirety.

Technical field

The present disclosure relates to the technical field of information processing, and in particular, to a video text detection method, device, and computer-readable storage medium.

Background technique

In recent years, with the rapid development of multimedia technology and computer networks, the volume of digital video has been growing at an alarming rate. Images captured from digital video often contain important text information, which plays an important role in text-based retrieval from video databases. To some extent, such text makes it easier to describe the main content of a video concisely, to classify videos, or to identify illegal videos.

Videos often contain text, such as advertisements, introductions, or text on signboards. In current technology, whether a video contains text is usually judged by extracting each frame of the video and performing optical character recognition (OCR) on it. However, when the text contained in the image is small, the OCR result is not ideal and its accuracy is not high enough.

Summary of the Invention

The technical problem solved by the present disclosure is to provide a video text detection method to at least partially solve the technical problem that the OCR has a poor recognition effect and low recognition accuracy when recognizing small characters. In addition, a video text detection device, a video text detection hardware device, a computer-readable storage medium, and a video text detection terminal are also provided.

To achieve the above objective, according to one aspect of the present disclosure, the following technical solutions are provided:

A video text detection method includes:

Segmenting the to-be-detected picture extracted from the to-be-detected video to obtain at least one image block;

Determining, according to a text detection result of the image block, whether the video to be detected contains text information.

Further, the step of determining whether text information is included in the video to be detected according to a text detection result on the image block includes:

Performing text detection on each image block;

If it is detected that any image block contains text information, it is determined that the video to be detected contains text information.

Further, the method further includes:

Segmenting pictures that are known to contain text information and / or pictures that are known not to contain text information to obtain at least one image block as a training sample;

Labeling the training samples according to whether they contain text information;

Using a deep learning classification algorithm to train on the labeled training samples to obtain an image classifier.

Further, the step of segmenting the to-be-detected picture extracted from the to-be-detected video to obtain at least one image block includes:

Inputting the picture to be detected into the image classifier, and dividing the picture to be detected by the image classifier to obtain at least one image block;

The method further includes:

Text detection is performed on the image block by the image classifier, and a text detection result of the image block is determined according to a classification result of the image classifier.

Further, the steps of performing text detection on the image block by the image classifier, and determining the text detection result of the image block based on the classification result of the image classifier, include:

Scoring each image block by the image classifier to obtain a score value of each image block;

A text detection result of the image block is determined according to the score.

Further, the step of determining a text detection result of the image block according to the score includes:

If the score exceeds a preset score, determine that the image block contains text information; or, select a maximum score from the scores, and if the maximum score exceeds the preset score, determine that the image block contains text information; or, if the score is less than a preset score, determine that the image block contains text information; or, select a minimum score from the scores, and if the minimum score is less than the preset score, determine that the image block contains text information.

Further, the step of performing text detection on the image block by the image classifier, and determining a text detection result of the image block according to a classification result of the image classifier, includes:

Perform text detection on each image block through the image classifier, and directly output any one of the following results through the image classifier: including text information and not including text information;

The output result is used as a text detection result of the image block.

To achieve the above object, according to another aspect of the present disclosure, the following technical solutions are also provided:

A video text detection device includes:

A picture block module, configured to divide a picture to be detected, extracted from a video to be detected, into blocks to obtain at least one image block;

A text determining module is configured to determine whether text information is included in the video to be detected according to a text detection result of the image block.

Further, the text determination module is specifically configured to: perform text detection on each image block; if it is detected that any image block contains text information, determine that the video to be detected includes text information.

Further, the device further includes:

A classifier training module, configured to: divide pictures known to contain text information and / or pictures known not to contain text information into blocks to obtain at least one image block as a training sample; label the training samples according to whether they contain text information; and use a deep learning classification algorithm to train on the labeled training samples to obtain an image classifier.

Further, the picture block module is specifically configured to: input the picture to be detected into the image classifier, and divide the picture to be detected by the image classifier to obtain at least one image block;

The device further includes:

A text detection module is configured to perform text detection on the image block through the image classifier, and determine a text detection result of the image block according to a classification result of the image classifier.

Further, the text detection module includes:

A scoring unit, configured to score each image block through the image classifier to obtain a score value of each image block;

A text detection unit, configured to determine a text detection result of the image block according to the score.

Further, the text detection unit is specifically configured to:

If the score exceeds a preset score, determine that the image block contains text information; or, select a maximum score from the scores, and if the maximum score exceeds the preset score, determine that the image block contains text information; or, if the score is less than a preset score, determine that the image block contains text information; or, select a minimum score from the scores, and if the minimum score is less than the preset score, determine that the image block contains text information.

Further, the text detection module is specifically configured to: perform text detection on each image block through the image classifier, and directly output one of the following results through the image classifier: including text information, or not including text information; the output result is used as the text detection result of the image block.

A video text detection hardware device includes:

Memory for storing non-transitory computer-readable instructions; and

A processor, configured to run the computer-readable instructions, so that the processor, when running them, implements the steps of the video text detection method described in any of the foregoing technical solutions.

A computer-readable storage medium is used for storing non-transitory computer-readable instructions. When the non-transitory computer-readable instructions are executed by a computer, the computer is caused to execute the steps of the video text detection method described in any of the foregoing technical solutions.

A video text detection terminal includes any of the video text detection devices described above.

Embodiments of the present disclosure provide a video text detection method, a video text detection device, a video text detection hardware device, a computer-readable storage medium, and a video text detection terminal. The video text detection method includes segmenting a to-be-detected picture extracted from the to-be-detected video to obtain at least one image block, and determining whether the to-be-detected video contains text information based on a text detection result of the image block. Because the picture is first divided into at least one image block and the decision is then made from the text detection results of the image blocks, text detection accuracy can be improved.

The above description is only an overview of the technical solutions of the present disclosure. So that the technical means of the present disclosure can be understood more clearly and implemented in accordance with the contents of the description, and so that the above and other objects, features, and advantages of the present disclosure are more apparent, preferred embodiments are described in detail below with reference to the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1a is a schematic flowchart of a video text detection method according to an embodiment of the present disclosure;

FIG. 1b is a schematic flowchart of a video text detection method according to another embodiment of the present disclosure;

FIG. 1c is a schematic flowchart of a video text detection method according to another embodiment of the present disclosure;

FIG. 2a is a schematic structural diagram of a video text detection device according to an embodiment of the present disclosure;

FIG. 2b is a schematic structural diagram of a video text detection device according to another embodiment of the present disclosure;

FIG. 3 is a schematic structural diagram of a video text detection hardware device according to an embodiment of the present disclosure;

FIG. 4 is a schematic structural diagram of a computer-readable storage medium according to an embodiment of the present disclosure;

FIG. 5 is a schematic structural diagram of a video text detection terminal according to an embodiment of the present disclosure.

Detailed description

The embodiments of the present disclosure are described below through specific examples. Those skilled in the art can easily understand other advantages and effects of the present disclosure from the content disclosed in this specification. Obviously, the described embodiments are only a part of the embodiments of the present disclosure, not all of them. The present disclosure may also be implemented or applied through other specific implementations, and various details in this specification may also be modified or changed based on different viewpoints and applications without departing from the spirit of the present disclosure. It should be noted that, in the case of no conflict, the following embodiments and the features in the embodiments can be combined with each other. Based on the embodiments in the present disclosure, all other embodiments obtained by a person having ordinary skill in the art without making creative efforts fall within the protection scope of the present disclosure.

It should be noted that various aspects of the embodiments within the scope of the appended claims are described below. It should be apparent that aspects described herein may be embodied in a wide variety of forms and that any specific structure and / or function described herein is merely illustrative. Based on the present disclosure, those skilled in the art should understand that one aspect described herein may be implemented independently of any other aspect, and that two or more of these aspects may be combined in various ways. For example, any number of the aspects set forth herein may be used to implement a device and / or a practice method. In addition, the apparatus and / or the method may be implemented using other structures and / or functionality than one or more of the aspects set forth herein.

It should also be noted that the illustrations provided in the following embodiments only illustrate the basic idea of the present disclosure in a schematic manner. The drawings show only the components related to the present disclosure and are not drawn according to the number, shape, and size of the components in an actual implementation. In an actual implementation, the type, quantity, and proportion of each component can be changed at will, and the component layout may be more complicated.

In addition, in the following description, specific details are provided to facilitate a thorough understanding of the examples. However, those skilled in the art will understand that the described aspects may be practiced without these specific details.

In order to solve the technical problem that OCR has a poor recognition effect and low recognition accuracy when recognizing small characters, an embodiment of the present disclosure provides a video text detection method. As shown in FIG. 1a, the video text detection method mainly includes the following steps S1 to S2:

Step S1: Divide the picture to be detected, extracted from the video to be detected, into blocks to obtain at least one image block.

The picture to be detected may be one frame or multiple frames. When there are multiple frames, each picture to be detected is divided into blocks.
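As a non-limiting illustration of step S1, the following sketch divides a picture into a regular grid of image blocks and applies the same grid to every extracted frame. The disclosure does not prescribe a block layout; the regular grid, the representation of a picture as nested lists of grayscale values, and the names `split_into_blocks` and `split_frames` are all assumptions made for this example.

```python
from typing import List

# Assumption: a picture is a list of rows of grayscale pixel values.
Image = List[List[int]]


def split_into_blocks(picture: Image, rows: int, cols: int) -> List[Image]:
    """Divide a picture into a rows x cols grid of blocks, in row-major order."""
    height, width = len(picture), len(picture[0])
    if not (1 <= rows <= height and 1 <= cols <= width):
        raise ValueError("grid must fit inside the picture")
    blocks = []
    for r in range(rows):
        # Integer boundaries so that the blocks cover every pixel exactly once.
        top, bottom = r * height // rows, (r + 1) * height // rows
        for c in range(cols):
            left, right = c * width // cols, (c + 1) * width // cols
            blocks.append([row[left:right] for row in picture[top:bottom]])
    return blocks


def split_frames(frames: List[Image], rows: int, cols: int) -> List[List[Image]]:
    """Apply the same grid to each of several pictures to be detected."""
    return [split_into_blocks(frame, rows, cols) for frame in frames]
```

A fixed grid is only one possible choice; overlapping or variably sized blocks would fit the same step.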

The number of image blocks or the size of the image blocks may be specifically determined according to the size of the picture to be detected. Specifically, in order to improve the accuracy of text detection, a plurality of pictures of different sizes can be divided into blocks in advance, and text detection can be performed, and the optimal number or size of blocks can be determined according to the accuracy of text detection.
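The selection of the number of blocks described above can be sketched as a simple search over candidate grids. The function name `choose_grid`, the `(rows, cols)` representation of a grid, and the accuracy figures below are placeholders invented for illustration; they are not results from the disclosure.

```python
from typing import Callable, Iterable, Tuple

Grid = Tuple[int, int]  # (rows, cols); a hypothetical way to describe a blocking scheme


def choose_grid(candidates: Iterable[Grid],
                accuracy_of: Callable[[Grid], float]) -> Grid:
    """Return the candidate grid whose measured detection accuracy is highest."""
    return max(candidates, key=accuracy_of)


# Placeholder accuracies, as if measured in advance on pictures of one size.
measured = {(1, 1): 0.71, (2, 2): 0.86, (4, 4): 0.80}
best = choose_grid(measured, measured.get)
```

In practice `accuracy_of` would run text detection on pre-divided pictures with known labels; a separate table could be kept for each picture size.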

Step S2: Determine whether text information is included in the video to be detected according to the text detection result of the image block.

The text information includes, but is not limited to, any one or combination of numbers, Chinese characters, and foreign languages.

Specifically, for a picture to be detected that contains small text information, the text information can be enlarged by segmentation, thereby improving the accuracy of text detection.
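One way to read the enlargement effect is in terms of the share of the examined region that the text occupies. The short calculation below assumes a hypothetical 40 x 20 pixel caption in a 1280 x 720 picture divided into a 4 x 4 grid, with the caption falling entirely inside one block; these figures are illustrative only.

```python
from fractions import Fraction


def text_area_share(text_w: int, text_h: int, region_w: int, region_h: int) -> Fraction:
    """Fraction of a region's area that is covered by the text."""
    return Fraction(text_w * text_h, region_w * region_h)


# Share of the whole picture versus share of a single block of a 4 x 4 grid.
whole_share = text_area_share(40, 20, 1280, 720)
block_share = text_area_share(40, 20, 1280 // 4, 720 // 4)
gain = block_share / whole_share
```

Under these assumptions the text occupies 16 times as large a share of the block as of the whole picture, which matches the number of blocks in the grid.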

In this embodiment, the to-be-detected picture extracted from the to-be-detected video is divided to obtain at least one image block, and whether the to-be-detected video contains text information is then determined based on the text detection results of the image blocks, which can improve the accuracy of text detection.

In an optional embodiment, as shown in FIG. 1b, step S2 includes:

S21: Perform text detection on each image block.

In this step, a text detection method from the prior art can be used to perform text detection on the image block. Because the picture to be detected is divided, the text contained in an image block may be incomplete. For example, an image block may contain only part of a character or part of a piece of text; in that case, it is still determined that the image block contains text information.

S22: If it is detected that any image block contains text information, it is determined that the video to be detected contains text information.

In this embodiment, the to-be-detected picture extracted from the to-be-detected video is divided to obtain at least one image block, and text detection is performed on each image block. If any image block is detected to contain text information, it is determined that the video to be detected contains text information. Because dividing the picture into blocks enlarges the text information it contains, the accuracy of text detection is improved.
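Steps S21 and S22 amount to an "any block" rule, which can be sketched as follows. The per-block detector is passed in as a callable because the disclosure leaves the detection method open; the name `video_contains_text` and the toy detector `has_ink` are assumptions for illustration.

```python
from typing import Callable, Iterable, List

Image = List[List[int]]  # assumption: a block is a list of rows of pixel values


def video_contains_text(frames_blocks: Iterable[Iterable[Image]],
                        block_has_text: Callable[[Image], bool]) -> bool:
    """Return True as soon as any block of any frame is detected as containing text."""
    return any(block_has_text(block)
               for blocks in frames_blocks
               for block in blocks)


def has_ink(block: Image) -> bool:
    """Toy stand-in for a real text detector: any non-zero pixel counts as text."""
    return any(pixel for row in block for pixel in row)
```

Because `any` short-circuits, the remaining blocks are not examined once one block is found to contain text.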

In an optional embodiment, as shown in FIG. 1c, the method in this embodiment further includes:

S3: Divide pictures that are known to contain text information and / or pictures that are known not to contain text information into blocks to obtain at least one image block as a training sample.

S4: Annotate training samples according to whether text information is included.

Specifically, before training, in order to distinguish different image blocks, that is, image blocks containing text information, and image blocks that do not contain text information, each image block needs to be labeled. For example, an image block containing text information is marked with 1 and an image block without text information is marked with 0.
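The labeling convention in step S4 can be sketched as follows, assuming the blocks have already been sorted into those that contain text and those that do not. The function name `label_training_samples` is hypothetical.

```python
from typing import List, Sequence, Tuple, TypeVar

Block = TypeVar("Block")


def label_training_samples(text_blocks: Sequence[Block],
                           non_text_blocks: Sequence[Block]) -> List[Tuple[Block, int]]:
    """Pair each block with its label: 1 for text, 0 for no text."""
    labeled = [(block, 1) for block in text_blocks]
    labeled += [(block, 0) for block in non_text_blocks]
    return labeled
```

Note that the label belongs to the block rather than to the source picture: a block cut from a picture that contains text may itself contain none.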

S5: The deep learning classification algorithm is used to train and learn the labeled training samples to obtain an image classifier.

The deep learning classification algorithms that can be used include, but are not limited to, any of the following: Naive Bayes algorithm, artificial neural network algorithm, genetic algorithm, K-Nearest Neighbor (KNN) classification algorithm, clustering algorithm, and the like.
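As one hedged example, a k-nearest-neighbor classifier, which is among the algorithms listed above, could be applied to labeled training samples roughly as follows. The sketch assumes each block has already been reduced to a numeric feature vector; the two-dimensional features and their values are invented for illustration, and the disclosure does not specify how features are computed.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]  # (feature vector, label), label 1 = text, 0 = no text


def knn_predict(train: List[Sample], query: Sequence[float], k: int = 3) -> int:
    """Label a query by majority vote among its k nearest training samples."""
    nearest = sorted(train, key=lambda sample: math.dist(sample[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Invented feature vectors for labeled image blocks.
train = [((0.9, 0.8), 1), ((0.8, 0.9), 1), ((0.7, 0.7), 1),
         ((0.1, 0.2), 0), ((0.2, 0.1), 0), ((0.1, 0.1), 0)]
```

An odd `k` avoids tied votes when there are only two labels.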

Wherein, the image classifier obtained through this embodiment not only has an automatic block function, but also can directly determine whether each image block contains text information.

Further, based on FIG. 1c, step S1 specifically includes:

The picture to be detected is input to an image classifier, and the picture to be detected is divided into blocks by the image classifier to obtain at least one image block.

The method in this embodiment further includes:

S6: Perform text detection on the image block through the image classifier, and determine the text detection result of the image block according to the classification result of the image classifier.

Further, step S6 specifically includes:

S61: Score each image block by an image classifier to obtain a score value of each image block.

The score may be a normalized score, for example, any value from 0 to 100 or 0-1.
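If the classifier reports scores on a 0 to 100 scale, they can be mapped onto 0 to 1 with a linear rescaling. The sketch below is one simple way to do so; the function name `normalize_score` and the clipping of out-of-range values are assumptions, not part of the disclosure.

```python
def normalize_score(raw: float, low: float = 0.0, high: float = 100.0) -> float:
    """Linearly map a raw score from [low, high] onto [0, 1], clipping outliers."""
    if high <= low:
        raise ValueError("high must be greater than low")
    clipped = min(max(raw, low), high)
    return (clipped - low) / (high - low)
```

For instance, `normalize_score(75)` maps a score of 75 on the 0 to 100 scale to 0.75.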

S62: Determine the text detection result of the image block according to the score.

Further, step S62 specifically includes:

If the score exceeds the preset score, determine that the image block contains text information; or, select the maximum score from the scores, and if the maximum score exceeds the preset score, determine that the image block contains text information; or, if the score is smaller than the preset score, determine that the image block contains text information; or, select the minimum score from the scores, and if the minimum score is smaller than the preset score, determine that the image block contains text information.

Regarding this step, a scoring rule can be set in advance. For example, the larger the score, the higher the probability that text information is included; or, the smaller the score, the higher the probability that text information is included. Based on the scoring rule set in advance, it is determined whether the image block contains text information.
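The four alternatives in step S62 reduce to two decisions: which direction of the score indicates text, and whether each block is tested individually or only the extreme score is tested. A minimal sketch, with the hypothetical flag `higher_means_text` standing for the preset scoring rule:

```python
from typing import Sequence


def block_has_text(score: float, preset: float, higher_means_text: bool = True) -> bool:
    """Compare one block's score with the preset score under the chosen scoring rule."""
    return score > preset if higher_means_text else score < preset


def extreme_block_has_text(scores: Sequence[float], preset: float,
                           higher_means_text: bool = True) -> bool:
    """Test only the maximum (or minimum) score among all blocks."""
    extreme = max(scores) if higher_means_text else min(scores)
    return block_has_text(extreme, preset, higher_means_text)
```

The comparisons are strict, following the wording "exceeds" and "smaller than"; a score equal to the preset score is treated as not containing text in this sketch.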

Further, step S6 specifically includes:

S63: Perform text detection on each image block through the image classifier, and directly output any one of the following results through the image classifier: including text information and not including text information.

S64: Use the output result as the text detection result of the image block.

Those skilled in the art should understand that, on the basis of the foregoing embodiments, obvious modifications (for example, combining the listed modes) or equivalent replacements can also be performed.

Although the steps in the embodiments of the video text detection method are described in the above order, those skilled in the art should understand that the steps in the embodiments of the present disclosure are not necessarily performed in that order; they may also be performed in reverse order, in parallel, interleaved, or in other orders. Moreover, on the basis of the above steps, those skilled in the art may add other steps. These obvious variations or equivalent replacements should also be included in the protection scope of the present disclosure and are not described in detail here.

The following is a device embodiment of the present disclosure. The device embodiment can be used to perform the steps implemented by the method embodiments of the present disclosure. For convenience of explanation, only the parts related to the embodiments of the present disclosure are shown. For specific technical details that are not disclosed, refer to the method embodiments of the present disclosure.

In order to solve the technical problem of how to improve the user experience effect, an embodiment of the present disclosure provides a video text detection device. The device can perform the steps in the foregoing embodiment of the video text detection method. As shown in FIG. 2a, the device mainly includes a picture block module 21 and a text determination module 22. The picture block module 21 is configured to divide a picture to be detected, extracted from a video to be detected, into blocks to obtain at least one image block; the text determination module 22 is configured to determine whether the video to be detected contains text information according to a text detection result of the image block.

In this embodiment, the picture block module 21 divides the picture to be detected, extracted from the video to be detected, to obtain at least one image block, and the text determination module 22 then determines whether the video to be detected contains text information based on the text detection result of the image block, which improves text detection accuracy.

In an optional embodiment, based on FIG. 2a, the text determination module 22 is specifically configured to: perform text detection on each image block; if it is detected that any image block contains text information, determine that the video to be detected contains text information.

The text determination module 22 may use a text detection method from the prior art to perform text detection on image blocks. Because the picture to be detected is divided, the text contained in an image block may be incomplete; for example, an image block may contain only part of a character or part of a piece of text. In that case, it is still determined that the image block contains text information.

In this embodiment, the picture block module 21 divides the picture to be detected, extracted from the video to be detected, to obtain at least one image block, and the text determination module 22 performs text detection on each image block. If any image block is detected to contain text information, it is determined that the video to be detected contains text information. Because dividing the picture into blocks enlarges the text information it contains, the accuracy of text detection is improved.
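The cooperation of the two modules in FIG. 2a can be sketched in software as two small classes. This is only one possible arrangement; the class and method names, the regular grid, and the toy detector `has_ink` are assumptions for illustration, and the modules may equally be implemented in other ways.

```python
from dataclasses import dataclass
from typing import Callable, List

Image = List[List[int]]  # assumption: a picture is a list of rows of pixel values


@dataclass
class PictureBlockModule:
    """Stands in for module 21: divides a picture into a rows x cols grid."""
    rows: int
    cols: int

    def divide(self, picture: Image) -> List[Image]:
        height, width = len(picture), len(picture[0])
        return [
            [row[c * width // self.cols:(c + 1) * width // self.cols]
             for row in picture[r * height // self.rows:(r + 1) * height // self.rows]]
            for r in range(self.rows)
            for c in range(self.cols)
        ]


@dataclass
class TextDeterminationModule:
    """Stands in for module 22: reports text if any block is detected as text."""
    block_has_text: Callable[[Image], bool]

    def video_has_text(self, blocks: List[Image]) -> bool:
        return any(self.block_has_text(block) for block in blocks)


def has_ink(block: Image) -> bool:
    """Toy stand-in for a real detector: any non-zero pixel counts as text."""
    return any(pixel for row in block for pixel in row)
```

A caller would pass the output of `PictureBlockModule.divide` to `TextDeterminationModule.video_has_text`, mirroring the data flow from module 21 to module 22.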

In an optional embodiment, as shown in FIG. 2b, the device in this embodiment further includes a classifier training module 23. The classifier training module 23 is configured to: divide pictures known to contain text information and / or pictures known not to contain text information into blocks to obtain at least one image block as a training sample; label the training samples according to whether they contain text information; and use a deep learning classification algorithm to train on the labeled training samples to obtain an image classifier.

Specifically, before training, the classifier training module 23 needs to label each image block in order to distinguish different image blocks, that is, image blocks containing text information and image blocks that do not contain text information. For example, an image block containing text information is marked with 1 and an image block without text information is marked with 0.

Further, based on FIG. 2b, the picture block module 21 is specifically configured to: input the picture to be detected into the image classifier, and divide the picture to be detected by the image classifier to obtain at least one image block.

The device of this embodiment further includes a text detection module 24; wherein the text detection module 24 is configured to perform text detection on the image block through the image classifier, and determine the text detection result of the image block according to the classification result of the image classifier.

Further, the text detection module 24 includes a scoring unit 241 and a text detection unit 242. The scoring unit 241 is configured to score each image block through the image classifier to obtain a score value of each image block; the text detection unit 242 is configured to determine the text detection result of the image block according to the score.

Further, the text detection unit 242 is specifically configured to: if the score exceeds a preset score, determine that the image block contains text information; or, select a maximum score from the scores, and if the maximum score exceeds the preset score, determine that the image block contains text information; or, if the score is less than a preset score, determine that the image block contains text information; or, select a minimum score from the scores, and if the minimum score is less than the preset score, determine that the image block contains text information.

Regarding the text detection unit 242, a scoring rule can be set in advance. For example, the larger the score, the higher the probability that text information is included; or, the smaller the score, the higher the probability that text information is included. Based on the scoring rule set in advance, it is determined whether the image block contains text information.

Further, the text detection module 24 is specifically configured to: perform text detection on each image block through the image classifier, and directly output one of the following results through the image classifier: including text information, or not including text information; and use the output result as the text detection result of the image block.

For a detailed description of the working principle and technical effects of the video text detection device embodiment, refer to the related description in the foregoing embodiment of the video text detection method; details are not described herein again.

FIG. 3 is a hardware block diagram illustrating a video text detection hardware device according to an embodiment of the present disclosure. As shown in FIG. 3, a video text detection hardware device 30 according to an embodiment of the present disclosure includes a memory 31 and a processor 32.

The memory 31 is configured to store non-transitory computer-readable instructions. Specifically, the memory 31 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and / or non-volatile memory. The volatile memory may include, for example, a random access memory (RAM) and / or a cache memory. The non-volatile memory may include, for example, a read-only memory (ROM), a hard disk, a flash memory, and the like.

The processor 32 may be a central processing unit (CPU) or another form of processing unit having data processing capabilities and / or instruction execution capabilities, and may control other components in the video text detection hardware device 30 to perform a desired function. In an embodiment of the present disclosure, the processor 32 is configured to run the computer-readable instructions stored in the memory 31, so that the video text detection hardware device 30 executes all or part of the steps of the foregoing video text detection method of the embodiments of the present disclosure.

Those skilled in the art should understand that, in order to solve the technical problem of how to obtain a good user experience effect, this embodiment may also include well-known structures such as a communication bus and an interface. These well-known structures should also be included within the protection scope of the present disclosure.

For detailed descriptions of this embodiment, reference may be made to corresponding descriptions in the foregoing embodiments, and details are not described herein again.

FIG. 4 is a schematic diagram illustrating a computer-readable storage medium according to an embodiment of the present disclosure. As shown in FIG. 4, a computer-readable storage medium 40 according to an embodiment of the present disclosure stores non-transitory computer-readable instructions 41 thereon. When the non-transitory computer-readable instructions 41 are executed by a processor, all or part of the steps of the video text detection method of the foregoing embodiments of the present disclosure are performed.

The computer-readable storage medium 40 includes, but is not limited to, optical storage media (for example, CD-ROM and DVD), magneto-optical storage media (for example, MO), magnetic storage media (for example, magnetic tape or mobile hard disk), rewritable non-volatile memory media (for example, memory card), and media with built-in ROM (for example, ROM cartridge).

FIG. 5 is a schematic diagram illustrating a hardware structure of a terminal according to an embodiment of the present disclosure. As shown in FIG. 5, the video text detection terminal 50 includes the foregoing video text detection device embodiment.

The terminal may be implemented in various forms. The terminal in the present disclosure may include, but is not limited to, mobile terminals such as a mobile phone, a smart phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a navigation device, an on-board terminal, an on-board display terminal, and an on-board electronic rear-view mirror, as well as fixed terminals such as a digital TV and a desktop computer.

As an equivalent alternative, the terminal may further include other components. As shown in FIG. 5, the video text detection terminal 50 may include a power supply unit 51, a wireless communication unit 52, an A/V (audio/video) input unit 53, a user input unit 54, a sensing unit 55, an interface unit 56, a controller 57, an output unit 58, a memory 59, and so on. FIG. 5 shows a terminal with various components, but it should be understood that not all of the illustrated components are required, and more or fewer components may be implemented instead.

The wireless communication unit 52 allows radio communication between the terminal 50 and a wireless communication system or network. The A/V input unit 53 is used to receive audio or video signals. The user input unit 54 may generate key input data according to a command input by the user to control various operations of the terminal. The sensing unit 55 detects the current state of the terminal 50, the position of the terminal 50, the presence or absence of a user's touch input to the terminal 50, the orientation of the terminal 50, the acceleration or deceleration movement and direction of the terminal 50, and the like, and generates commands or signals for controlling the operation of the terminal 50. The interface unit 56 functions as an interface through which at least one external device can be connected to the terminal 50. The output unit 58 is configured to provide an output signal in a visual, audio, and/or tactile manner. The memory 59 may store software programs and the like for processing and control operations performed by the controller 57, or may temporarily store data that has been output or is to be output. The memory 59 may include at least one type of storage medium. Moreover, the terminal 50 may cooperate with a network storage device that performs the storage function of the memory 59 through a network connection. The controller 57 generally controls the overall operation of the terminal. In addition, the controller 57 may include a multimedia module for reproducing or playing back multimedia data. The controller 57 may perform a pattern recognition process to recognize a handwriting input or a picture drawing input performed on the touch screen as characters or images. The power supply unit 51 receives external power or internal power under the control of the controller 57 and provides the appropriate power required to operate each element and component.

Various embodiments of the video text detection method proposed by the present disclosure may be implemented in a computer-readable medium using, for example, computer software, hardware, or any combination thereof. For hardware implementation, various embodiments of the video text detection method proposed in the present disclosure can be implemented by using at least one of an application-specific integrated circuit (ASIC), a digital signal processor (DSP), a digital signal processing device (DSPD), a programmable logic device (PLD), a field programmable gate array (FPGA), a processor, a controller, a microcontroller, a microprocessor, and an electronic unit designed to perform the functions described herein. In some cases, various embodiments of the video text detection method proposed in the present disclosure may be implemented in the controller 57. For software implementation, various embodiments of the video text detection method proposed by the present disclosure can be implemented with a separate software module that allows at least one function or operation to be performed. The software code may be implemented by a software application (or program) written in any suitable programming language, and the software code may be stored in the memory 59 and executed by the controller 57.

The basic principles of the present disclosure have been described above in conjunction with specific embodiments, but it should be noted that the advantages, merits, effects, etc. mentioned in this disclosure are merely examples and not limitations, and these advantages, merits, effects, etc. cannot be considered as required for the various embodiments of the present disclosure. In addition, the specific details disclosed above are provided only for the purpose of example and ease of understanding, and are not limiting; the above details do not restrict the present disclosure to being implemented with those specific details.

The block diagrams of the components, devices, equipment, and systems involved in this disclosure are only illustrative examples and are not intended to require or imply that they must be connected, arranged, and configured in the manner shown in the block diagrams. As will be recognized by those skilled in the art, these components, devices, equipment, and systems can be connected, arranged, and configured in any manner. Words such as "including," "comprising," "having," and the like are open-ended words that mean "including, but not limited to," and can be used interchangeably with that phrase. As used herein, the terms "or" and "and" mean "and/or" and are used interchangeably with it unless the context clearly indicates otherwise. The term "such as" as used herein means "such as, but not limited to," and is used interchangeably with that phrase.

In addition, as used herein, an "or" used in an enumeration of items beginning with "at least one" indicates a disjunctive enumeration, such that, for example, an enumeration of "at least one of A, B, or C" means A or B or C, or AB or AC or BC, or ABC (i.e., A and B and C). Furthermore, the word "exemplary" does not mean that the described example is preferred or better than other examples.

It should also be noted that in the system and method of the present disclosure, each component or each step can be decomposed and/or recombined. Such decompositions and/or recombinations should be regarded as equivalent solutions of the present disclosure.

Various changes, substitutions, and alterations to the techniques described herein can be made without departing from the techniques taught by the appended claims. Further, the scope of the claims of the present disclosure is not limited to the specific aspects of the processes, machines, manufacture, compositions of matter, means, methods, and actions described above. Processes, machines, manufacture, compositions of matter, means, methods, or actions that currently exist or are developed later, and that perform substantially the same functions or achieve substantially the same results as the corresponding aspects described herein, may be utilized. Accordingly, the appended claims include within their scope such processes, machines, manufacture, compositions of matter, means, methods, or actions.

The above description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these aspects will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other aspects without departing from the scope of the present disclosure. Accordingly, the disclosure is not intended to be limited to the aspects shown herein, but to the broadest scope consistent with the principles and novel features disclosed herein.

The foregoing description has been given for the purposes of illustration and description. Furthermore, this description is not intended to limit the embodiments of the present disclosure to the forms disclosed herein. Although a number of example aspects and embodiments have been discussed above, those skilled in the art will recognize certain variations, modifications, changes, additions and sub-combinations thereof.

Claims

A video text detection method, comprising:

segmenting a picture to be detected, extracted from a video to be detected, to obtain at least one image block; and

determining whether the video to be detected contains text information according to a text detection result of the image block.
The method according to claim 1, wherein the step of determining whether the video to be detected contains text information according to a text detection result of the image block comprises:

performing text detection on each image block; and

if any image block is detected to contain text information, determining that the video to be detected contains text information.
The method according to claim 1, further comprising:

segmenting pictures that are known to contain text information and/or pictures that are known not to contain text information to obtain at least one image block as a training sample;

labeling the training samples according to whether text information is included; and

training on the labeled training samples using a deep learning classification algorithm to obtain an image classifier.
The method according to claim 3, wherein the step of segmenting the picture to be detected extracted from the video to be detected to obtain at least one image block comprises:

inputting the picture to be detected into the image classifier, and segmenting the picture to be detected by the image classifier to obtain at least one image block;

and wherein the method further comprises:

performing text detection on the image block by the image classifier, and determining a text detection result of the image block according to a classification result of the image classifier.
The method according to claim 4, wherein the step of performing text detection on the image block by the image classifier and determining the text detection result of the image block according to the classification result of the image classifier comprises:

scoring each image block by the image classifier to obtain a score of each image block; and

determining a text detection result of the image block according to the score.
The method according to claim 5, wherein the step of determining a text detection result of the image block according to the score comprises:

if the score exceeds a preset score, determining that the image block contains text information; or selecting a maximum score from the scores and, if the maximum score exceeds the preset score, determining that the image block contains text information; or, if the score is less than a preset score, determining that the image block contains text information; or selecting a minimum score from the scores and, if the minimum score is less than the preset score, determining that the image block contains text information.
The method according to claim 4, wherein the step of performing text detection on the image block by the image classifier and determining the text detection result of the image block according to the classification result of the image classifier comprises:

performing text detection on each image block by the image classifier, and directly outputting, by the image classifier, either one of the following results: contains text information, or does not contain text information; and

using the output result as the text detection result of the image block.
A video text detection device, comprising:

a picture segmentation module, configured to segment a picture to be detected, extracted from a video to be detected, to obtain at least one image block; and

a text determination module, configured to determine whether the video to be detected contains text information according to a text detection result of the image block.
The device according to claim 8, wherein the text determination module is specifically configured to: perform text detection on each image block; and, if any image block is detected to contain text information, determine that the video to be detected contains text information.
The device according to claim 8, further comprising:

a classifier training module, configured to: segment pictures that are known to contain text information and/or pictures that are known not to contain text information to obtain at least one image block as a training sample; label the training samples according to whether text information is included; and train on the labeled training samples using a deep learning classification algorithm to obtain an image classifier.
The device according to claim 10, wherein the picture segmentation module is specifically configured to: input the picture to be detected into the image classifier, and segment the picture to be detected by the image classifier to obtain at least one image block;

and wherein the device further comprises:

a text detection module, configured to perform text detection on the image block by the image classifier, and determine a text detection result of the image block according to a classification result of the image classifier.
The device according to claim 11, wherein the text detection module comprises:

a scoring unit, configured to score each image block by the image classifier to obtain a score of each image block; and

a text detection unit, configured to determine a text detection result of the image block according to the score.
The device according to claim 12, wherein the text detection unit is specifically configured to:

if the score exceeds a preset score, determine that the image block contains text information; or select a maximum score from the scores and, if the maximum score exceeds a preset score, determine that the image block contains text information; or, if the score is less than a preset score, determine that the image block contains text information; or select a minimum score from the scores and, if the minimum score is less than a preset score, determine that the image block contains text information.
The device according to claim 11, wherein the text detection module is specifically configured to: perform text detection on each image block by the image classifier, and directly output, by the image classifier, either one of the following results: contains text information, or does not contain text information; and use the output result as the text detection result of the image block.
A video text detection hardware device, comprising:

a memory, configured to store non-transitory computer-readable instructions; and

a processor, configured to run the computer-readable instructions such that, when the instructions are executed, the processor implements the video text detection method according to any one of claims 1-7.
A computer-readable storage medium, configured to store non-transitory computer-readable instructions which, when executed by a computer, cause the computer to execute the video text detection method according to any one of claims 1-7.
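As an informal illustration only, and not part of the claims, the block-wise decision recited in the method claims (segment a picture into image blocks, score each block, compare the score with a preset score, and report text if any block contains text) might be sketched in Python as follows. Everything named here is an assumption made for illustration: the grayscale list-of-rows picture representation, the fixed grid segmentation, the helper names `partition`, `block_has_text`, and `contains_text`, and in particular `contrast_score`, which is a trivial placeholder standing in for the trained deep learning image classifier. The sketch also assumes the convention that a higher score indicates text; the claims equally allow the opposite convention.

```python
from typing import Callable, List

# Assumed representation: a grayscale picture as rows of 0-255 pixel values.
Picture = List[List[int]]


def partition(picture: Picture, block_h: int, block_w: int) -> List[Picture]:
    """Split a picture into non-overlapping blocks; edge blocks may be smaller."""
    blocks = []
    for top in range(0, len(picture), block_h):
        rows = picture[top:top + block_h]
        for left in range(0, len(picture[0]), block_w):
            blocks.append([row[left:left + block_w] for row in rows])
    return blocks


def block_has_text(score: float, preset_score: float) -> bool:
    """Assumed convention: a score above the preset score means 'contains text'."""
    return score > preset_score


def contains_text(
    picture: Picture,
    score_block: Callable[[Picture], float],
    preset_score: float,
    block_h: int,
    block_w: int,
) -> bool:
    """Report text if any image block of the picture is judged to contain text."""
    blocks = partition(picture, block_h, block_w)
    return any(block_has_text(score_block(b), preset_score) for b in blocks)


def contrast_score(block: Picture) -> float:
    """Placeholder scorer (NOT a trained classifier): normalized intensity range."""
    pixels = [p for row in block for p in row]
    return (max(pixels) - min(pixels)) / 255.0


# Usage: a blank 4x4 picture versus one with a single bright pixel.
blank = [[0] * 4 for _ in range(4)]
marked = [row[:] for row in blank]
marked[0][0] = 255
blank_result = contains_text(blank, contrast_score, 0.5, 2, 2)
marked_result = contains_text(marked, contrast_score, 0.5, 2, 2)
```

Under the higher-score-means-text assumption, selecting the maximum block score and comparing it with the preset score yields the same picture-level decision as the `any` test used here.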