CN114067322A - Image character detection method and system and electronic equipment

Image character detection method and system and electronic equipment

Info

Publication number: CN114067322A
Application number: CN202010751818.2A
Authority: CN (China)
Legal status: Pending
Prior art keywords: image, text, points, detected, point
Other languages: Chinese (zh)
Inventors: 谭明强, 李文华, 赵耀, 张彬, 雷剑, 谢新标, 韩增辉
Current and Original Assignee: China Mobile Communications Group Co Ltd; China Mobile Group Shandong Co Ltd; China Mobile Group Guizhou Co Ltd
Application filed by China Mobile Communications Group Co Ltd, China Mobile Group Shandong Co Ltd and China Mobile Group Guizhou Co Ltd
Priority and filing date: 2020-07-30
Publication date: 2022-02-18

Classifications

    • G (PHYSICS) > G06 (COMPUTING; CALCULATING OR COUNTING) > G06F (ELECTRIC DIGITAL DATA PROCESSING) > G06F18/00 Pattern recognition > G06F18/20 Analysing > G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation > G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G > G06 > G06F > G06F18/00 Pattern recognition > G06F18/20 Analysing > G06F18/24 Classification techniques > G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G > G06 > G06F > G06F18/00 Pattern recognition > G06F18/20 Analysing > G06F18/25 Fusion techniques > G06F18/253 Fusion techniques of extracted features
    • G (PHYSICS) > G06N (COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS) > G06N3/00 Computing arrangements based on biological models > G06N3/02 Neural networks > G06N3/04 Architecture, e.g. interconnection topology > G06N3/045 Combinations of networks

Abstract

The invention discloses an image character detection method, system and electronic device. The method comprises the following steps: extracting image features of an image to be detected by using a high-resolution feature extraction network, wherein the image features comprise a plurality of image points; segmenting the image points in the image to be detected according to whether the image points belong to the same text, to obtain a plurality of image text boxes; and extracting the image text boxes, wherein the characters contained in the image text boxes are the characters of the image to be detected. By adopting a high-resolution feature extraction network, the embodiment of the invention keeps features of different resolutions communicating with one another throughout feature extraction, avoids the information loss caused by feature-map shrinkage during convolution, and, by judging whether the neighboring points of each image point belong to the same text, achieves fast and effective detection of characters in complex scenes in the image.

Description

Image character detection method and system and electronic equipment
Technical Field
The invention relates to the field of character recognition, in particular to an image character detection method, an image character detection system and electronic equipment.
Background
Characters in an image play a pivotal role in visual recognition tasks. Characters are a product of human beings: they carry high-level semantic information and convey human thought and emotion that video or image information alone can hardly describe directly, and through written records they allow us to learn about historical events from thousands of years ago. As carriers of information, characters are also important clues in visual recognition. For pictures of natural scenes, auxiliary information about the image, such as the place and type of the shot, can be acquired from the character information on the picture, which greatly assists subsequent visual recognition tasks.
In traditional document character detection, the background is usually clean, the fonts are regular, the layout is even and the colors are monotonous. Character detection in video images, by contrast, faces three challenges: (1) the text often varies across video images, for example in orientation, language, font, size and color; (2) the text background is also diverse, and many scenes contain patterns with a certain similarity to text, such as signal lights, windows and fences, which interfere with detection; (3) the images have quality problems, for example uneven illumination, low resolution and partial occlusion in the detected picture, which make detection very difficult.
Disclosure of Invention
The embodiments of the invention provide an image character detection method, an image character detection system and an electronic device, aiming to solve the problem in the prior art that characters cannot be detected during image character recognition because of inconsistent character orientation, size and color, low resolution, partial occlusion and the like.
In order to solve the technical problem, the invention is realized as follows:
in a first aspect, an image text detection method is provided, where the method includes:
extracting image characteristics of an image to be detected by using a high-resolution characteristic extraction network, wherein the image characteristics comprise a plurality of image points;
according to whether the image points belong to the same text or not, segmenting the image points in the image to be detected to obtain a plurality of image text boxes;
and extracting the image text boxes, wherein the characters contained in the image text boxes are the characters of the image to be detected.
In a second aspect, an image text detection system is provided, which includes:
the extraction module is used for extracting the image characteristics of the image to be detected by utilizing a high-resolution characteristic extraction network, wherein the image characteristics comprise a plurality of image points;
the segmentation module is used for segmenting the image points in the image to be detected according to whether the image points belong to the same text or not to obtain a plurality of image text boxes;
and the detection module is used for extracting the image text boxes, and the characters contained in the image text boxes are the characters of the image to be detected.
In a third aspect, an electronic device is provided, including: a memory, a processor and a computer program stored on the memory and executable on the processor, the computer program, when executed by the processor, implementing the steps of the method according to the first aspect.
In a fourth aspect, a computer-readable storage medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, realizes the steps of the method according to the first aspect.
In the embodiment of the invention, the image features of an image to be detected are first extracted with a high-resolution feature extraction network; the image points in the image to be detected are then segmented according to whether the image points in the image features belong to the same text, yielding a plurality of image text boxes; finally the image text boxes are extracted, the characters they contain being the characters of the image to be detected. By adopting a high-resolution feature extraction network, the embodiment keeps features of different resolutions communicating with one another throughout feature extraction, avoids the information loss caused by feature-map shrinkage during convolution, and, by judging whether the neighboring points of each image point belong to the same text, achieves fast and effective detection of characters in complex scenes in the image.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention and not to limit the invention. In the drawings:
fig. 1 is a schematic flow chart of an image text detection method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a high resolution feature extraction network according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an image text detection system according to an embodiment of the present invention;
fig. 4 is a schematic diagram of a hardware structure of a terminal device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiments of the invention provide an image character detection method, system and electronic device that adopt a convolutional neural network with a purpose-designed high-resolution feature extraction network, and realize character detection in an image by predicting whether each image point lies inside a text box on the image and whether the neighboring points of that image point belong to the same text. The designed high-resolution network extracts high-level semantic features from a variety of images better than an ordinary convolutional network: when the detected image is too small, an ordinary convolutional neural network makes the extracted image features ever smaller as convolution proceeds, and once the features are too small part of the semantic information is lost, whereas the designed high-resolution network keeps image features at several resolutions throughout convolution and continuously fuses the features across these resolutions, reducing information loss. The character information on the image is then detected by judging whether each image point is in a text box and whether the neighboring image points around it belong to the same text. This approach effectively simplifies the character detection pipeline, reduces the amount of data required for training, and alleviates the situation in which segmentation on single pixels alone cannot effectively separate different texts once different characters are too close together.
Fig. 1 is a schematic flow chart of an image text detection method according to an embodiment of the present invention. As shown in fig. 1, the image text detection method may include steps S101 to S103.
In step S101, an image feature of the image to be detected is extracted using a high-resolution feature extraction network, where the image feature includes a plurality of image points.
In step S102, the image points in the image to be detected are segmented according to whether the image points belong to the same text, so as to obtain a plurality of image text boxes.
In step S103, a plurality of image text boxes are extracted, where the characters contained in the image text boxes are the characters of the image to be detected.
In the embodiment of the invention, the image features of an image to be detected are first extracted with a high-resolution feature extraction network; the image points in the image to be detected are then segmented according to whether the image points in the image features belong to the same text, yielding a plurality of image text boxes; finally the image text boxes are extracted, the characters they contain being the characters of the image to be detected. By adopting a high-resolution feature extraction network, the embodiment keeps features of different resolutions communicating with one another throughout feature extraction, avoids the information loss caused by feature-map shrinkage during convolution, and, by judging whether the neighboring points of each image point belong to the same text, achieves fast and effective detection of characters in complex scenes in the image.
In one possible embodiment of the present invention, extracting the image features of the image to be detected by using the high-resolution feature extraction network may include the following steps.
Determining a first preset number of branches in a high-resolution feature extraction network;
and fusing the features of the branches with the first preset number by adopting a feature pyramid network (FPN) to obtain the image features of the image to be detected.
Specifically, the design and feature fusion of the high-resolution network consist of designing a first preset number of branches and fusing the features on the different branches with one another. Taking four branches as an example, as shown in fig. 2: the output-convolved feature of branch 4 is up-sampled and added to the output-convolved feature of branch 3 to obtain feature 3; feature 3 is up-sampled and fused by addition with the output-convolved feature of branch 2 to obtain feature 2; and likewise feature 2 is up-sampled and fused with the output-convolved feature of branch 1 to obtain the final output feature 1. The operational details of the high-resolution network design are as follows.
In the feature fusion, up-sampling applies bilinear interpolation to the lower-resolution feature so that its shape matches that of the higher-resolution feature, while down-sampling applies strided convolution to the high-resolution feature within a branch to reduce it to the lower resolution. Within the same branch, the feature size remains consistent throughout the convolution process.
The output convolutions adopted in the feature fusion at the final stage of the network are identical operations. After the output convolution on the different branches, the network obtains, at matching feature size and dimension, the classification of each image point and whether each of the 4 neighboring points of the image point belongs to the same text box.
In the embodiment of the application, by designing a high-resolution network comprising a plurality of branches, the size of the features on each branch stays consistent throughout the feature convolution process, which avoids the information loss caused by feature shrinkage during convolution.
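For illustration only, the sketch below shows one way the four-branch fusion stage and output heads described above could be written in PyTorch. The patent publishes no code, so the module names, channel counts and the 2-channel point-classification / 8-channel link outputs are assumptions, not the reference implementation.

```python
# Hypothetical sketch of the four-branch fusion stage (assumed shapes/names).
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionHead(nn.Module):
    def __init__(self, chs=(18, 36, 72, 144), out_ch=18):
        super().__init__()
        # "output convolution": the same 1x1 operation applied on every branch
        self.out_convs = nn.ModuleList([nn.Conv2d(c, out_ch, 1) for c in chs])
        self.cls_head = nn.Conv2d(out_ch, 2, 1)   # text / non-text point
        self.link_head = nn.Conv2d(out_ch, 8, 1)  # 4 neighbours x 2 classes

    def forward(self, feats):
        # feats: branch 1 (highest resolution) ... branch 4 (lowest resolution)
        f = self.out_convs[3](feats[3])
        for i in (2, 1, 0):
            g = self.out_convs[i](feats[i])
            # bilinear up-sampling to the shape of the higher-resolution feature
            f = F.interpolate(f, size=g.shape[-2:], mode="bilinear",
                              align_corners=False) + g
        return self.cls_head(f), self.link_head(f)
```

Within a branch, down-sampling toward a lower-resolution branch would analogously be a strided convolution; it is omitted here to keep the sketch short.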
In a possible embodiment of the present application, before segmenting image points in an image to be detected according to whether the image points belong to the same text to obtain a plurality of image text boxes, the image text detection method may further include the following steps.
Judging whether the image point is in the text box or not;
if the image point is in the text box, marking the image point as a text image point;
if the image point is not in the text box, the image point is marked as a non-text image point.
In this embodiment, before judging whether image points belong to the same text, it may first be judged whether each image point is in a text box, that is, whether it is a point of a text. If the image point is in a text box, that is, in a text, it is a text image point; otherwise it is not a text image point. Determining whether the image points of the features in the image to be detected are text image points makes the subsequent character detection more accurate, so that characters in complex scenes in the image can be detected quickly and effectively.
In a possible embodiment of the present invention, segmenting image points in an image to be detected according to whether the image points belong to the same text to obtain a plurality of image text boxes may include the following steps.
Judging whether a second preset number of text image points adjacent to the text image points are positioned in the same text box or not;
and processing the text box through a computer vision library to obtain the coordinates of the text box.
Specifically, for each image point in a text box it is determined whether each of a second preset number of neighboring points is contained in the same text: the value for a neighboring point that belongs to the same text as the image point is set to 1, and otherwise it is set to 0. Having found the image points in a text box and the neighboring points belonging to the same text as those points, the coordinates of the text box corresponding to the text can be obtained through simple processing with a computer vision library.
The second preset number may be 3, 4 or other numbers, and may be set according to actual conditions. The computer vision library may be an opencv computer vision library, or may be another computer vision library, which is not specifically limited in the embodiment of the present invention.
In the embodiment of the invention, the image points that are text and the image points that belong to the same text are collected into the same set, and the text-box coordinates of the image points belonging to the same text box can be obtained through simple processing in the opencv computer vision library. This makes text-box prediction simple and effective, and avoids the large amount of training data needed by regression-based methods.
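As an illustration only, a minimal decoding sketch along these lines might group linked text points with a union-find and read each group's box off with opencv. The probability-map layout, threshold and function names here are assumptions, not the patent's notation.

```python
# Hypothetical post-processing sketch: merge text points that are linked to
# the same text, then extract one box per group with opencv.
import cv2
import numpy as np

def decode_text_boxes(text_prob, link_prob, thr=0.5):
    """text_prob: (H, W); link_prob: (4, H, W) for up/down/left/right links."""
    H, W = text_prob.shape
    parent = np.arange(H * W)

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    text = text_prob > thr
    offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
    for k, (dy, dx) in enumerate(offsets):
        ys, xs = np.nonzero((link_prob[k] > thr) & text)
        for y, x in zip(ys, xs):
            ny, nx = y + dy, x + dx
            if 0 <= ny < H and 0 <= nx < W and text[ny, nx]:
                parent[find(y * W + x)] = find(ny * W + nx)  # union

    groups = {}
    for y, x in zip(*np.nonzero(text)):
        groups.setdefault(find(y * W + x), []).append((x, y))

    boxes = []
    for pts in groups.values():
        rect = cv2.minAreaRect(np.array(pts, dtype=np.float32))
        boxes.append(cv2.boxPoints(rect))  # 4 corner points of one text box
    return boxes
```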
In one possible embodiment of the present invention, determining whether a second predetermined number of text image points adjacent to the text image point are located in the same text box may include the following steps.
Acquiring a preset data set;
encoding the label text in a preset data set to obtain training labels, wherein the training labels comprise the classification of the image points and whether each image point and the image points connected to it belong to the same text box;
training the high-resolution feature extraction network by using the training labels to obtain training weights, wherein a cross entropy function (written out after this list) is used as the loss function of the high-resolution feature extraction network;
inputting the training weight into the high-resolution feature extraction network to obtain the trained high-resolution feature extraction network;
and dividing the text image points adjacent to the text image points into the same set by using the trained high-resolution feature extraction network, and taking the text image points in the same set as the text image points in the same text box.
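For reference, the cross entropy mentioned in the steps above is the standard classification loss; the following formula is general background rather than notation taken from the patent:

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c} y_{i,c}\,\log p_{i,c}$$

where $y_{i,c}$ is the 0/1 label of image point $i$ for class $c$ (text/non-text, or same-text/not-same-text for each neighbor link) and $p_{i,c}$ is the probability the network predicts for that class.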
In the embodiment of the application, a deep-learning-based method trained on a large amount of data achieves a better character detection effect in complex scenes, with higher precision and recall.
The specific flow of the present invention is as follows.
In the first step, a high-resolution feature extraction network is designed.
The high-resolution network is designed with 4 branches whose features are fused with one another. As shown in fig. 2: the output-convolved feature of branch 4 is up-sampled and added to the output-convolved feature of branch 3 to obtain feature 3; feature 3 is up-sampled and fused by addition with the output-convolved feature of branch 2 to obtain feature 2; and likewise feature 2 is up-sampled and fused with the output-convolved feature of branch 1 to obtain the final output feature 1. Up-sampling, down-sampling and the output convolutions are performed exactly as described above: up-sampling is bilinear interpolation of the lower-resolution feature to the shape of the higher-resolution feature, down-sampling is strided convolution within a branch, the feature size stays consistent within the same branch, and the identical output convolutions at the final stage yield the classification of each image point and whether each of its 4 neighboring points belongs to the same text box.
And secondly, classifying the image points and judging whether the adjacent image points belong to the same text or not.
Character extraction adopts a method based on image-point classification combined with judging whether the 4 neighboring points around each image point belong to the same text. The specific steps are as follows:
points in the text box on the image are marked as text image points, and points not in the text box are marked as non-text image points.
For each image point in a text box, it is determined whether each of the 4 surrounding neighboring points is contained in the same text; the value for a neighboring point that belongs to the same text as the point is set to 1, and otherwise it is set to 0.
By finding the image point in the text box and finding the adjacent point belonging to the same text with the image point, the coordinates of the text box corresponding to the text can be obtained through simple processing of an opencv computer vision library.
And thirdly, acquiring and expanding the preset data.
Main data and expansion data are acquired. The main data adopts the MLT2019 data set from ICDAR2019, the ICPR2018 data set and the RCTW data set (ICDAR2017 Competition on Reading Chinese Text in the Wild). The MLT2019 data set contains characters of different languages, the ICPR2018 data set contains web pictures, and RCTW is a data set for Chinese text in images. These serve as the main data. After the three are fused, the expansion data is added by generation and partial annotation; the expansion data is difficult data annotated or generated for certain specific scenes. The expanded data is merged in to form the final training data.
And fourthly, encoding the preset data.
The data set and the corresponding label text information in the data set are encoded into training labels, which comprise the classification of each image point (text point/non-text point) and whether the 4 neighboring points near the image point belong to the same text.
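A minimal label-encoding sketch, assuming the annotations are text-box polygons; all names and the rasterization choices are illustrative, not from the patent:

```python
# Hypothetical label encoding: rasterize annotated text boxes into a
# text/non-text map plus four neighbour-link maps.
import cv2
import numpy as np

def encode_labels(h, w, polygons):
    """polygons: list of (N, 2) point arrays, one per annotated text box."""
    text = np.zeros((h, w), dtype=np.uint8)  # 1 = text point, 0 = non-text
    inst = np.zeros((h, w), dtype=np.int32)  # which text a point belongs to
    for i, poly in enumerate(polygons, start=1):
        mask = np.zeros((h, w), dtype=np.uint8)
        cv2.fillPoly(mask, [poly.astype(np.int32)], 1)
        text |= mask
        inst[mask == 1] = i

    pad = np.pad(inst, 1)                       # zero border around instances
    link = np.zeros((4, h, w), dtype=np.uint8)  # up, down, left, right
    for k, (dy, dx) in enumerate([(-1, 0), (1, 0), (0, -1), (0, 1)]):
        nbr = pad[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]  # nbr[y,x]=inst[y+dy,x+dx]
        # neighbour lies in the same text -> link label 1, otherwise 0
        link[k] = ((inst > 0) & (inst == nbr)).astype(np.uint8)
    return text, link
```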
And fifthly, training the high-resolution feature extraction network.
The network is trained: for the image-point classification part and for the part judging whether neighboring image points belong to the same text, a cross entropy function is used as the loss function, and an SGD (Stochastic Gradient Descent) optimizer is used to optimize the training weights during training, until training converges.
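A training-loop sketch under the same assumptions as above (PyTorch; the hyperparameters, tensor shapes and names are illustrative guesses, not values given by the patent):

```python
# Hypothetical training sketch: cross entropy on both the point-classification
# and the 4-neighbour link outputs, optimized with SGD until convergence.
import torch
import torch.nn as nn

def train(model, loader, epochs=60, lr=0.01):
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        for image, text_label, link_label in loader:
            # model returns (N,2,H,W) point logits and (N,8,H,W) link logits
            cls_logits, link_logits = model(image)
            loss = criterion(cls_logits, text_label.long())
            n, _, h, w = link_logits.shape
            # 4 neighbours x 2 classes: fold the neighbour axis into the batch
            link_logits = link_logits.view(n, 4, 2, h, w).flatten(0, 1)
            loss = loss + criterion(link_logits, link_label.long().flatten(0, 1))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```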
And sixthly, reasoning by fixing the weight to determine the text box.
The trained weights are fixed and loaded into the high-resolution network, and inference detection is performed on videos or images, followed by image-point classification: the image points predicted by the high-resolution network to belong to a text and the neighboring image points connected to them are divided into the same set. The post-processing in fig. 2 uses the computer vision library opencv to obtain the text-box coordinates formed by the image points in each set; the text box is visualized to give the final text detection result, or the text-box image is cropped out for subsequent tasks such as text recognition.
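Tying the sketches above together, inference could look like the following; the checkpoint file name and pre-processing are assumptions for illustration:

```python
# Hypothetical end-to-end inference sketch reusing decode_text_boxes above.
import cv2
import numpy as np
import torch

def detect(image_path, model, weights="hrnet_text.pth"):
    model.load_state_dict(torch.load(weights, map_location="cpu"))  # fix weights
    model.eval()
    img = cv2.imread(image_path)
    x = torch.from_numpy(img).permute(2, 0, 1).float().unsqueeze(0) / 255.0
    with torch.no_grad():
        cls_logits, link_logits = model(x)
    text_prob = cls_logits.softmax(1)[0, 1].numpy()   # P(point is text)
    h, w = text_prob.shape
    link_prob = link_logits.view(4, 2, h, w).softmax(1)[:, 1].numpy()
    boxes = decode_text_boxes(text_prob, link_prob)   # see the earlier sketch
    for box in boxes:                                 # visualise the result
        cv2.polylines(img, [box.astype(np.int32)], True, (0, 255, 0), 2)
    return img, boxes
```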
In the embodiment of the invention, points inside a text box on the image to be detected are marked as text points and given the label value 1, while image points outside any text box are marked as non-text points and given the label value 0. At the same time, for the four points around each image point (upper, lower, left and right) it is judged whether they belong to the same text, and 4 values are assigned to the four points accordingly: 1 if the point belongs to the same text, and 0 if it does not. On the final feature output by the high-resolution network, the image points predicted to be text points and the points belonging to the same text as them are collected into the same set, forming the point set of one text in the picture; this point set can be visualized through simple post-processing with the opencv computer vision library to obtain the final text box, avoiding the complex post-processing pipeline of existing pixel-segmentation-based methods.
On the basis of the features extracted at high resolution, the method judges whether each image point lies on text and whether the point and its 4 nearby points belong to the same text, and finally extracts the text information on the image. The scheme has strong feature extraction capability and copes well with pictures from a variety of scenes, such as very low resolution and complex backgrounds, while the character detection method based on classifying image points and judging which image points belong to the same text avoids the complex post-processing of pixel-segmentation-based methods and separates image points belonging to different characters more efficiently. Combining the two yields better precision and recall on character pictures from different scenes, so the method is suitable for character detection in a variety of scenarios and has good robustness.
The embodiment of the invention further provides an image text detection system. Fig. 3 is a schematic diagram of an image text detection system according to an embodiment of the present invention. As shown in fig. 3, the image text detection system may include: an extraction module 301, a segmentation module 302 and a detection module 303.
Specifically, the extraction module 301 is configured to extract image features of an image to be detected by using a high-resolution feature extraction network, where the image features include a plurality of image points; the segmentation module 302 is configured to segment the image points in the image to be detected according to whether the image points belong to the same text, so as to obtain a plurality of image text boxes; and the detection module 303 is configured to extract the plurality of image text boxes, where the characters contained in the image text boxes are the characters of the image to be detected.
In the embodiment of the present invention, the extraction module 301 first extracts the image features of the image to be detected using a high-resolution feature extraction network; the segmentation module 302 then segments the image points in the image to be detected according to whether the image points in the image features belong to the same text, obtaining a plurality of image text boxes; and finally the detection module 303 extracts the image text boxes, whose contained characters are the characters of the image to be detected. By adopting a high-resolution feature extraction network, the embodiment keeps features of different resolutions communicating with one another throughout feature extraction, avoids the information loss caused by feature-map shrinkage during convolution, and, by judging whether the neighboring points of each image point belong to the same text, achieves fast and effective detection of characters in complex scenes in the image.
Optionally, the extraction module may include: a first determination unit and a fusion unit.
Specifically, a first determination unit configured to determine a first preset number of branches in a high-resolution feature extraction network; and the fusion unit is configured to fuse the features of the first preset number of branches by adopting a feature pyramid network to obtain the image features of the image to be detected.
Optionally, the image text detection system may further include: the device comprises a judging module, a first determining module and a second determining module.
Specifically, the judging module is configured to judge whether the image point is in the text box; a first determination module configured to mark an image point as a text image point if the image point is in the text box; a second determination module configured to mark the image point as a non-text image point if the image point is not in the text box.
Optionally, the segmentation module may include: a judging unit and a second determining unit.
Specifically, the judging unit is configured to judge whether a second preset number of text image points adjacent to the text image point are located in the same text box; and the second determining unit is configured to process the text box through the computer vision library to obtain the coordinates of the text box.
Alternatively, the determination unit may be configured to:
acquiring a preset data set;
carrying out coding training on a label text in a preset data set to obtain a training label, wherein the training label comprises the classification of image points and whether the image points and connected image points belong to the same text box;
training the high-resolution feature extraction network by using the training labels to obtain training weights, wherein a cross entropy function is used as a loss function of the high-resolution feature extraction network;
inputting the training weight into the high-resolution feature extraction network to obtain the trained high-resolution feature extraction network;
and dividing the text image points adjacent to the text image points into the same set by using the trained high-resolution feature extraction network, and taking the text image points in the same set as the text image points in the same text box.
The functions of the image text detection system of the present invention have been described in detail in the method embodiments shown in fig. 1 and fig. 2, so the description of this embodiment is kept brief; reference may be made to the related descriptions in the foregoing embodiments, which are not repeated here.
Fig. 4 is a schematic diagram of a hardware structure of a terminal device for implementing various embodiments of the present invention.
The terminal device 400 includes but is not limited to: radio frequency unit 401, network module 402, audio output unit 403, input unit 404, sensor 405, display unit 406, user input unit 407, interface unit 408, memory 409, processor 410, and power supply 411. Those skilled in the art will appreciate that the terminal device configuration shown in fig. 4 does not constitute a limitation of the terminal device, and that the terminal device may include more or fewer components than shown, or combine certain components, or a different arrangement of components. In the embodiment of the present invention, the terminal device includes, but is not limited to, a mobile phone, a tablet computer, a notebook computer, a palm computer, a vehicle-mounted terminal, a wearable device, a pedometer, and the like.
Wherein, the processor 410 may be configured to:
extracting image characteristics of an image to be detected by using a high-resolution characteristic extraction network, wherein the image characteristics comprise a plurality of image points;
according to whether the image points belong to the same text or not, segmenting the image points in the image to be detected to obtain a plurality of image text boxes;
and extracting a plurality of image text boxes, wherein the characters contained in the image text boxes are the characters of the image to be detected.
In the embodiment of the invention, firstly, a high-resolution feature extraction network is utilized to extract the image features of an image to be detected, then, image points in the image to be detected are segmented according to whether the image points in the image features belong to the same text to obtain a plurality of image text boxes, and finally, a plurality of image text boxes are extracted, wherein characters contained in the text boxes are characters of the image to be detected. The embodiment of the invention adopts the high-resolution feature extraction network, can always keep the communication among the features with different resolutions in the feature extraction process, can avoid information loss caused by feature reduction in the convolution process, and can realize quick and effective detection of complicated scene characters in the image by judging whether the adjacent points of the image point belong to the characters.
It should be understood that, in the embodiment of the present invention, the radio frequency unit 401 may be used for receiving and sending signals during a message transceiving or call process; specifically, it receives downlink data from a base station and forwards it to the processor 410 for processing, and it transmits uplink data to the base station. Typically, the radio frequency unit 401 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier, a duplexer and the like. Furthermore, the radio frequency unit 401 can also communicate with a network and other devices through a wireless communication system.
The terminal device provides wireless broadband internet access to the user through the network module 402, such as helping the user send and receive e-mails, browse web pages, and access streaming media.
The audio output unit 403 may convert audio data received by the radio frequency unit 401 or the network module 402 or stored in the memory 409 into an audio signal and output as sound. Also, the audio output unit 403 may also provide audio output related to a specific function performed by the terminal apparatus 400 (e.g., a call signal reception sound, a message reception sound, etc.). The audio output unit 403 includes a speaker, a buzzer, a receiver, and the like.
The input unit 404 is used to receive audio or video signals. The input unit 404 may include a Graphics Processing Unit (GPU) 4041 and a microphone 4042. The graphics processor 4041 processes image data of still pictures or video obtained by an image capturing apparatus (such as a camera) in a video capturing mode or an image capturing mode. The processed image frames may be displayed on the display unit 406, stored in the memory 409 (or another storage medium), or transmitted via the radio frequency unit 401 or the network module 402. The microphone 4042 may receive sound and process it into audio data; in the phone call mode, the processed audio data may be converted into a format transmittable to a mobile communication base station via the radio frequency unit 401 and output.
The terminal device 400 further comprises at least one sensor 405, such as light sensors, motion sensors and other sensors. Specifically, the light sensor includes an ambient light sensor that adjusts the brightness of the display panel 4061 according to the brightness of ambient light, and a proximity sensor that turns off the display panel 4061 and/or the backlight when the terminal apparatus 400 is moved to the ear. As one of the motion sensors, the accelerometer sensor can detect the magnitude of acceleration in each direction (generally three axes), detect the magnitude and direction of gravity when stationary, and can be used to identify the terminal device posture (such as horizontal and vertical screen switching, related games, magnetometer posture calibration), vibration identification related functions (such as pedometer, tapping), and the like; the sensors 405 may also include a fingerprint sensor, a pressure sensor, an iris sensor, a molecular sensor, a gyroscope, a barometer, a hygrometer, a thermometer, an infrared sensor, etc., which will not be described in detail herein.
The display unit 406 is used to display information input by the user or information provided to the user. The Display unit 406 may include a Display panel 4061, and the Display panel 4061 may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like.
The user input unit 407 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the terminal device. Specifically, the user input unit 407 includes a touch panel 4071 and other input devices 4072. The touch panel 4071, also referred to as a touch screen, may collect touch operations by a user on or near it (for example, operations by a user on or near the touch panel 4071 using a finger, a stylus or any suitable object or attachment). The touch panel 4071 may include two parts: a touch detection device and a touch controller. The touch detection device detects the position and direction of the user's touch, detects the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into touch point coordinates, sends them to the processor 410, and receives and executes commands from the processor 410. In addition, the touch panel 4071 can be implemented in various types such as resistive, capacitive, infrared and surface acoustic wave. Besides the touch panel 4071, the user input unit 407 may include other input devices 4072. Specifically, the other input devices 4072 may include, but are not limited to, a physical keyboard, function keys (such as volume control keys and switch keys), a track ball, a mouse and a joystick, which are not described here again.
Further, the touch panel 4071 can be overlaid on the display panel 4061, and when the touch panel 4071 detects a touch operation thereon or nearby, the touch operation is transmitted to the processor 410 to determine the type of the touch event, and then the processor 410 provides a corresponding visual output on the display panel 4061 according to the type of the touch event. Although in fig. 4, the touch panel 4071 and the display panel 4061 are two independent components to implement the input and output functions of the terminal device, in some embodiments, the touch panel 4071 and the display panel 4061 may be integrated to implement the input and output functions of the terminal device, which is not limited herein.
The interface unit 408 is an interface for connecting an external device to the terminal apparatus 400. For example, the external device may include a wired or wireless headset port, an external power supply (or battery charger) port, a wired or wireless data port, a memory card port, a port for connecting a device having an identification module, an audio input/output (I/O) port, a video I/O port, an earphone port, and the like. The interface unit 408 may be used to receive input (e.g., data information, power, etc.) from an external device and transmit the received input to one or more elements within the terminal apparatus 400 or may be used to transmit data between the terminal apparatus 400 and an external device.
The memory 409 may be used to store software programs as well as various data. The memory 409 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the cellular phone, and the like. Further, the memory 409 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device.
The processor 410 is a control center of the terminal device, connects various parts of the entire terminal device by using various interfaces and lines, and performs various functions of the terminal device and processes data by operating or executing software programs and/or modules stored in the memory 409 and calling data stored in the memory 409, thereby performing overall monitoring of the terminal device. Processor 410 may include one or more processing units; preferably, the processor 410 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 410.
The terminal device 400 may further include a power supply 411 (e.g., a battery) for supplying power to various components, and preferably, the power supply 411 may be logically connected to the processor 410 through a power management system, so as to implement functions of managing charging, discharging, and power consumption through the power management system.
In addition, the terminal device 400 includes some functional modules that are not shown, and are not described in detail herein.
Preferably, an embodiment of the present invention further provides a terminal device, including a processor 410, a memory 409, and a computer program stored in the memory 409 and executable on the processor 410; when executed by the processor 410, the computer program implements each process of the above image text detection method embodiment and achieves the same technical effect, which is not repeated here to avoid repetition.
An embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements each process of the above image text detection method embodiment and achieves the same technical effect, which is not repeated here to avoid repetition. The computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (10)

1. An image character detection method is characterized by comprising the following steps:
extracting image characteristics of an image to be detected by using a high-resolution characteristic extraction network, wherein the image characteristics comprise a plurality of image points;
according to whether the image points belong to the same text or not, segmenting the image points in the image to be detected to obtain a plurality of image text boxes;
and extracting the image text boxes, wherein the characters contained in the image text boxes are the characters of the image to be detected.
2. The method of claim 1, wherein the extracting image features of the image to be detected by using the high-resolution feature extraction network comprises:
determining a first preset number of branches in the high-resolution feature extraction network;
and fusing the characteristics of the branches with the first preset number by adopting a characteristic pyramid network to obtain the image characteristics of the image to be detected.
3. The method of claim 1, wherein before segmenting image points in the image to be detected according to whether the image points belong to the same text to obtain a plurality of image text boxes, the method further comprises:
judging whether the image point is in a text box or not;
if the image point is in the text box, marking the image point as a text image point;
and if the image point is not in the text box, marking the image point as a non-text image point.
4. The method of claim 3, wherein the segmenting image points in the image to be detected according to whether the image points belong to the same text to obtain a plurality of image text boxes comprises:
judging whether a second preset number of text image points adjacent to the text image points are positioned in the same text box or not;
and processing the text box through a computer vision library to obtain the coordinates of the text box.
5. The method of claim 4, wherein said determining whether a second predetermined number of text image points adjacent to said text image point are located in the same text box comprises:
acquiring a preset data set;
performing coding training on the label text in the preset data set to obtain a training label, wherein the training label comprises the classification of image points and whether the image points and the connected image points belong to the same text box;
training the high-resolution feature extraction network by using the training labels to obtain training weights, wherein a cross entropy function is used as a loss function of the high-resolution feature extraction network;
inputting the training weight into the high-resolution feature extraction network to obtain a trained high-resolution feature extraction network;
and dividing the text image points adjacent to the text image points into the same set by using the trained high-resolution feature extraction network, and taking the text image points in the same set as the text image points in the same text box.
6. An image text detection system, comprising:
the extraction module is used for extracting the image characteristics of the image to be detected by utilizing a high-resolution characteristic extraction network, wherein the image characteristics comprise a plurality of image points;
the segmentation module is used for segmenting the image points in the image to be detected according to whether the image points belong to the same text or not to obtain a plurality of image text boxes;
and the detection module is used for extracting the image text boxes, and the characters contained in the image text boxes are the characters of the image to be detected.
7. The system of claim 6, wherein the extraction module comprises:
a first determining unit, configured to determine a first preset number of branches in the high-resolution feature extraction network;
and the fusion unit is used for fusing the characteristics of the branches with the first preset number by adopting a characteristic pyramid network to obtain the image characteristics of the image to be detected.
8. The system of claim 6, further comprising:
the judging module is used for judging whether the image point is in the text box or not;
the first determining module is used for marking the image point as a text image point if the image point is in a text box;
and the second determining module is used for marking the image point as a non-text image point if the image point is not in the text box.
9. The system of claim 8, wherein the segmentation module comprises:
the judging unit is used for judging whether a second preset number of text image points adjacent to the text image points are positioned in the same text box or not;
and the second determining unit is used for processing the text box through the computer vision library to obtain the coordinates of the text box.
10. An electronic device, comprising: memory, processor and computer program stored on the memory and executable on the processor, which computer program, when executed by the processor, carries out the steps of the method according to any one of claims 1 to 5.
Application CN202010751818.2A, priority date 2020-07-30, filing date 2020-07-30: Image character detection method and system and electronic equipment. Status: Pending. Publication: CN114067322A (en).

Priority Applications (1)

    • CN202010751818.2A (priority date 2020-07-30, filing date 2020-07-30): Image character detection method and system and electronic equipment

Publications (1)

    • CN114067322A, published 2022-02-18

Family

    • ID: 80227406

Family Applications (1)

    • CN202010751818.2A: Image character detection method and system and electronic equipment (pending)

Country Status (1)

    • CN: CN114067322A (en)


Legal Events

    • PB01: Publication
    • SE01: Entry into force of request for substantive examination