CN112926565B - Picture text recognition method, system, equipment and storage medium - Google Patents


Publication number
CN112926565B
Authority
CN
China
Prior art keywords
text
width
picture
long
line
Prior art date
Legal status
Active
Application number
CN202110213721.0A
Other languages
Chinese (zh)
Other versions
CN112926565A (en)
Inventor
何小臻
Current Assignee
Ping An Life Insurance Company of China Ltd
Original Assignee
Ping An Life Insurance Company of China Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Life Insurance Company of China Ltd filed Critical Ping An Life Insurance Company of China Ltd
Priority to CN202110213721.0A priority Critical patent/CN112926565B/en
Publication of CN112926565A publication Critical patent/CN112926565A/en
Application granted granted Critical
Publication of CN112926565B publication Critical patent/CN112926565B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical
Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/20 - Image preprocessing
    • G06V10/22 - Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 - Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 - Character recognition
    • G06V30/14 - Image acquisition
    • G06V30/148 - Segmentation of character regions
    • G06V30/153 - Segmentation of character regions using recognition of characters or words
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 - Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 - Character recognition
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Character Input (AREA)

Abstract

The invention provides a picture text recognition method, which comprises: obtaining a picture to be recognized and preprocessing it; detecting the picture with a preset text detection model to obtain the coordinates of each text line in the picture; obtaining the width value corresponding to each text line from its coordinates; sorting the text lines by width value, splicing the text line with the longest width and the text line with the shortest width into a long text, repeating the operation, and stopping splicing when the width threshold would be exceeded; detecting the long text and, if its width value does not reach the width threshold, repairing its width according to the width threshold; repeating the operation until the long texts formed from the remaining text lines all reach the width threshold, forming a long text set; and inputting the long text set into a preset text recognition model for recognition, then disassembling the returned result to obtain the recognition result of the picture. Through this underlying logic of two-level batching, the invention effectively reduces the processing time of the backend model.

Description

Picture text recognition method, system, equipment and storage medium
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to a method, a system, an apparatus, and a storage medium for recognizing a picture text.
Background
In the current artificial intelligence field, algorithms must ultimately be deployed as engineering products, and there are several commonly used deployment frameworks on the market, such as TensorFlow Serving for TensorFlow, Baidu's PaddlePaddle platform, and NVIDIA's TensorRT framework. TensorRT is a high-performance deep learning inference optimizer that can provide low-latency, high-throughput deployment inference for deep learning applications. TensorRT can be used to accelerate inference in very large-scale data centers, on embedded platforms, or on autonomous driving platforms. TensorRT now supports almost all deep learning frameworks, including TensorFlow, Caffe, MXNet, and PyTorch, and by combining TensorRT with NVIDIA GPUs, fast and efficient deployment inference can be performed from almost any framework. Taking NVIDIA's TensorRT framework as an example, for a text recognition scene the typical solution is as follows: all text lines are detected in the input picture, and the text lines are then sent sequentially to the recognition model for text recognition.
Since the number of text lines in an input picture is variable, and is especially large for pictures with dense text, the overall processing is very time-consuming even though a framework like NVIDIA's TensorRT supports batch processing. Therefore, in order to shorten the processing time, we propose a new picture text recognition method.
Disclosure of Invention
Based on the above, the invention provides a method, a system, a device and a storage medium for identifying the picture text, so as to accelerate the identification speed of the picture text.
In order to achieve the above object, the present invention provides a method for identifying a picture text, the method comprising:
acquiring a picture to be identified, and preprocessing the picture;
detecting the preprocessed picture by using a text detection model which is trained in advance to obtain the coordinate of each text line in the picture;
calculating according to the coordinates of each text line to obtain a width value corresponding to each text line;
sorting the text lines according to the width values, traversing all the width values, splicing the text line with the longest width and the text line with the shortest width to form a long text, repeating the operation, and stopping splicing when the width value of the long text being spliced would exceed a width threshold value;
detecting the long text, and if the width value of the long text does not reach the width threshold value, repairing the width of the long text according to the width threshold value;
splicing and repairing the rest text lines until all text lines form long text reaching a width threshold value to form a long text set;
And inputting the long text set into a preset text recognition model for recognition, and disassembling the returned result to obtain a recognition result of the picture to be recognized.
Preferably, the step of preprocessing the picture includes:
scaling the picture, wherein the maximum width of the scaling of the picture is not more than 1600 pixels, the maximum height is not more than 2400 pixels, the minimum width is not less than 600 pixels, and the minimum height is not less than 800 pixels;
and converting the zoomed picture into a gray scale picture.
Preferably, the step of detecting the preprocessed picture by using the pre-trained text detection model to obtain coordinates of each text line in the picture includes:
invoking a trained text detection model, wherein the text detection model adopts the DBNet algorithm; the preprocessed picture is input into the text detection model, and the text detection model outputs, for each pixel point in the picture, a probability value that the pixel corresponds to text;
thresholding selection is carried out on the pixel points in the picture, and the pixel points in the picture are divided into text pixel points and non-text pixel points according to the probability value, so that a binary image is obtained;
calculating a connected domain set of the binary image by using a first image processing algorithm;
And inputting the connected domain set into a second image processing algorithm, and calculating the minimum circumscribed rectangle of each connected domain, wherein four vertexes of the minimum circumscribed rectangle are coordinates of text lines.
Preferably, the step of calculating the width value of each corresponding text line according to the coordinates of each text line includes:
calculating according to coordinates of four vertexes of each text line to obtain the original width and the original height of each text line;
scaling each text line to the same preset height, and calculating to obtain a scaling ratio corresponding to each text line through the original height and the preset height of each text line;
the current width value of each text line is obtained through the original width and the scaling ratio of each text line.
Preferably, the step of detecting the spliced long text and, if its width value does not reach the width threshold, repairing its width according to the width threshold includes:
detecting the long text, and judging whether the width value of the long text reaches a width threshold value, wherein the width threshold value is 1600 pixels;
if not, selecting black edges spanning the three color channels of the picture to be identified, and appending them to the long text according to the width threshold value, so that the width of the long text reaches 1600 pixels.
Preferably, the step of inputting the long text set into a preset text recognition model for recognition and disassembling the returned result to obtain the recognition result of the picture to be recognized includes:
inputting the long text set into the text recognition model for recognition calculation through the batch processing operation of TensorRT, wherein the text recognition model adopts a CRNN algorithm;
splitting the text strings recognized from the long texts and mapping them back to the corresponding text lines;
and obtaining the recognition result of the picture to be identified.
Preferably, after the identification result of the picture to be identified is obtained, the identification result is uploaded to a blockchain, so that the blockchain stores the identification result in an encrypted manner.
In order to achieve the above object, the present invention further provides a system for recognizing a picture text, the system comprising:
the preprocessing module is used for acquiring a picture to be identified and preprocessing the picture;
the detection module is used for detecting the preprocessed picture by utilizing the text detection model which is trained in advance to obtain the coordinate of each text line in the picture;
the width value module is used for calculating a width value corresponding to each text line according to the coordinates of each text line;
The splicing module is used for sorting the text lines according to the width values, traversing all the width values, splicing the text line with the longest width and the text line with the shortest width to form a long text, repeating the operation, and stopping splicing when the width value of the long text being spliced would exceed a width threshold value;
the repair module is used for detecting the long text, and repairing the width of the long text according to the width threshold if the width value of the long text does not reach the width threshold;
the long text set module is used for controlling the splicing module and the repairing module to splice and repair the rest text lines until all the text lines form long texts reaching the width threshold value so as to form a long text set;
the recognition module is used for inputting the long text set into a preset text recognition model for recognition, and disassembling the returned result to obtain a recognition result of the picture to be recognized.
To achieve the above object, the present invention also provides a computer device including a memory and a processor, wherein the memory stores readable instructions that, when executed by the processor, cause the processor to perform the steps of the method for recognizing a picture text as described above.
To achieve the above object, the present invention also provides a computer-readable storage medium storing a program file capable of realizing the method of recognizing a picture text as described above.
The invention provides a picture text recognition method, system, device, and storage medium. The recognition method comprises: acquiring a picture to be identified and preprocessing it; detecting the preprocessed picture with a pre-trained text detection model to obtain the coordinates of each text line in the picture; calculating the width value corresponding to each text line from its coordinates; sorting the text lines according to the width values, traversing all the width values, splicing the text line with the longest width and the text line with the shortest width to form a long text, repeating the operation, and stopping splicing when the width value of the long text being spliced would exceed a width threshold value; detecting the long text and, if its width value does not reach the width threshold value, repairing its width according to the width threshold value; splicing and repairing the remaining text lines until all text lines form long texts reaching the width threshold value, so as to form a long text set; and inputting the long text set into a preset text recognition model for recognition, then disassembling the returned result to obtain the recognition result of the picture to be identified. Thus, through this batch text recognition acceleration method based on text merging, the recognition method can effectively reduce the calculation and processing time of the backend model via the underlying logic of two-level batching, in real usage scenarios with highly concurrent requests after the project is deployed.
Drawings
FIG. 1 is a diagram of an implementation environment for an identification method provided in one embodiment;
FIG. 2 is a block diagram of the internal architecture of a computer device in one embodiment;
FIG. 3 is a flow chart of an identification method in one embodiment;
FIG. 4 is a schematic diagram of an identification system in one embodiment;
FIG. 5 is a schematic diagram of a computer device in one embodiment;
FIG. 6 is a schematic diagram of a structure of a computer-readable storage medium in one embodiment.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
It will be understood that the terms "first," "second," and the like, as used herein, may be used to describe various elements, but these elements are not limited by these terms. These terms are only used to distinguish one element from another element.
Fig. 1 is a diagram of an implementation environment of a method for recognizing a picture text provided in one embodiment, as shown in fig. 1, in the implementation environment, including a computer device 110 and a display device 120.
The computer device 110 may be a computer device such as a computer used by a user, and a picture text recognition system is installed on the computer device 110. When performing calculations, the user may carry out the calculation at the computer device 110 according to a picture text recognition method and display the calculation result through the display device 120.
It should be noted that, the combination of the computer device 110 and the display device 120 may be a smart phone, a tablet computer, a notebook computer, a desktop computer, etc., but is not limited thereto.
FIG. 2 is a schematic diagram of the internal structure of a computer device in one embodiment. As shown in fig. 2, the computer device includes a processor, a non-volatile storage medium, a memory, and a network interface connected by a system bus. The non-volatile storage medium of the computer device stores an operating system, a database, and computer readable instructions; the database may store a control information sequence, and the computer readable instructions, when executed by the processor, cause the processor to implement a picture text recognition method. The processor of the computer device provides computing and control capabilities and supports the operation of the entire computer device. The memory of the computer device may store computer readable instructions that, when executed by the processor, cause the processor to perform a picture text recognition method. The network interface of the computer device is used for communicating with a terminal. It will be appreciated by those skilled in the art that the structure shown in fig. 2 is merely a block diagram of some of the structures associated with the present application and does not limit the computer device to which the present application may be applied; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
As shown in fig. 3, in one embodiment, a method for identifying a picture text is provided, where the method may be applied to the computer device 110 and the display device 120, and may specifically include the following steps:
and step 31, acquiring a picture to be identified, and preprocessing the picture.
Specifically, the specific step of preprocessing the picture includes:
s311, scaling the picture, wherein the maximum width of the scaling of the picture is not more than 1600 pixels, the maximum height is not more than 2400 pixels, the minimum width is not less than 600 pixels, and the minimum height is not less than 800 pixels;
specifically, the obtained picture is generally a color picture. The picture is scaled first; in the scaling process, prior values derived from the resolution of real pictures are used, namely the maximum width of the picture is not more than 1600 pixels and the maximum height is not more than 2400 pixels, while the minimum width is not less than 600 pixels and the minimum height is not less than 800 pixels. The scaling ratio of the picture is therefore not fixed, but the result is constrained to these intervals. More specifically, the picture is scaled according to its real resolution: if the picture is too small, the text detection effect is poor, and if the picture is too large, text detection takes too long. For every picture to be identified, the width and height are limited to these intervals through scaling, i.e. the picture only needs to fall within the intervals, and the scaling itself is performed with an interpolation algorithm.
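The scaling rule above can be sketched in a few lines. The helper name and the clamping order are assumptions for illustration only (not the patent's reference implementation), and it assumes an aspect ratio for which both the width and height intervals can be satisfied simultaneously:

```python
# Illustrative sketch of the prior-value scaling rule: keep the width within
# [600, 1600] pixels and the height within [800, 2400] pixels.
def scale_factor(w, h, min_w=600, max_w=1600, min_h=800, max_h=2400):
    # Shrink if either dimension exceeds its maximum...
    scale = min(1.0, max_w / w, max_h / h)
    # ...then enlarge if either dimension falls below its minimum.
    scale = max(scale, min_w / w, min_h / h)
    return scale
```

The actual resize would then be performed with an interpolation algorithm, e.g. OpenCV's cv2.resize with the computed factor.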
S312, converting the zoomed picture into a gray scale picture.
Specifically, the picture is converted into a grayscale image after scaling. In typical real usage scenarios the user uploads a color picture, but the subsequent algorithms operate on grayscale images, so the input picture is converted to grayscale.
Further, converting a color picture to grayscale means converting the 3 channels (RGB) of the picture into 1 channel. There are three common implementations:
(1) Average method: the simplest method, which averages the values of the 3 RGB channels at the same pixel position, I(x, y) = 1/3 * I_R(x, y) + 1/3 * I_G(x, y) + 1/3 * I_B(x, y).
(2) Max-min average method: average the largest and smallest of the RGB values at the same pixel position.
(3) Weighted average method: I(x, y) = 0.3 * I_R(x, y) + 0.59 * I_G(x, y) + 0.11 * I_B(x, y); this is the most popular method. The weighting coefficients 0.3, 0.59, and 0.11 are parameters tuned to the human brightness perception system and are widely used standardized parameters.
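The three conversions can be sketched as follows, assuming an H x W x 3 uint8 array in RGB channel order (OpenCV loads images as BGR, in which case the channel indices would be swapped); the function name is an assumption for illustration:

```python
import numpy as np

# Sketch of the three RGB-to-grayscale conversions listed above.
def to_gray(img_rgb, method="weighted"):
    rgb = img_rgb.astype(np.float64)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    if method == "average":        # (1) plain average of the three channels
        gray = (r + g + b) / 3.0
    elif method == "min_max":      # (2) average of the max and min channel
        gray = (np.maximum(np.maximum(r, g), b) +
                np.minimum(np.minimum(r, g), b)) / 2.0
    else:                          # (3) perceptual weighted average
        gray = 0.3 * r + 0.59 * g + 0.11 * b
    return gray.astype(np.uint8)
```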
And step 32, detecting the preprocessed picture by using the pre-trained text detection model to obtain the coordinates of each text line in the picture.
The text detection model can detect the coordinates of each text line in the picture, where the coordinates refer to the four vertex coordinates of the smallest circumscribed rectangle of the text line; the origin of the coordinate system is the upper-left corner vertex of the picture, with the picture width along the abscissa axis and the picture height along the ordinate axis.
Specifically, the step of detecting the preprocessed picture by using the text detection model which is trained in advance to obtain the coordinates of each text line in the picture includes:
s321, calling a trained text detection model, wherein the text detection model adopts a dbnet algorithm, a preprocessed picture is input into the text detection model, and the text detection model outputs a probability value of a corresponding text in a picture pixel point;
specifically, the text detection model can be constructed by adopting a text detection algorithm based on pixel segmentation, and an attention mechanism can be added into the text detection model. The segmentation-based text detection algorithm may be any one of algorithms such as SENet, DBNet, pixelLink. In this embodiment, a trained text detection model needs to be invoked, where the text detection model adopts a dbnet algorithm (Differentiable Binarization Network, a differentiable binary network), and the dbnet algorithm is a text segmentation model designed based on a picture segmentation method. According to the dbnet algorithm, the user can use the self-marked data as a training set to train to obtain a desired text detection model. Specifically, the result output by the text detection model is the probability (0-1) of each pixel point in the picture, for example, a 100-pixel x 100-pixel picture, after dbnet calculation is completed, the probability value corresponding to 10000 pixel points and belonging to the text is output, that is, how many of 10000 pixel points correspond to the text in the picture, and how many of 10000 pixel points correspond to the blank or non-text part in the picture.
S322, thresholding selection is carried out on the pixel points in the picture, and the pixel points in the picture are divided into text pixel points and non-text pixel points according to the probability value, so that a binary image is obtained;
specifically, a user may set a threshold, and divide the 10000 pixels into text pixels and non-text pixels through probability values, where the non-text pixels may be pixels belonging to a blank portion in a picture, so that a binary image may be formed.
S323, calculating a connected domain set of the binary image by using a first image processing algorithm;
in particular, the first image processing algorithm may be the computer vision toolkit OpenCV, which is widely used in the industry today.
S324, inputting the connected domain set into a second image processing algorithm, and calculating the minimum circumscribed rectangle of each connected domain, wherein four vertexes of the minimum circumscribed rectangle are coordinates of a text line.
Specifically, the second image processing algorithm calls OpenCV's findContours function (with minAreaRect then giving the minimum circumscribed rectangle of each contour).
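Steps S322 through S324 can be illustrated end to end with the following pure-Python sketch. The patent delegates this work to OpenCV (connected components, then findContours/minAreaRect); the flood fill below is a simplified stand-in that returns axis-aligned rather than minimum-area rectangles, purely for exposition:

```python
import numpy as np

# S322: binarize the per-pixel text probabilities; S323: find connected
# domains; S324: return one bounding box (x_min, y_min, x_max, y_max) per
# domain. Names and the 4-connectivity choice are illustrative assumptions.
def text_boxes(prob_map, threshold=0.5):
    binary = prob_map >= threshold          # S322: thresholding selection
    visited = np.zeros_like(binary, dtype=bool)
    boxes = []
    h, w = binary.shape
    for sy in range(h):
        for sx in range(w):
            if binary[sy, sx] and not visited[sy, sx]:
                stack, pts = [(sy, sx)], []
                visited[sy, sx] = True
                while stack:                # S323: grow one connected domain
                    y, x = stack.pop()
                    pts.append((y, x))
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if 0 <= ny < h and 0 <= nx < w and \
                           binary[ny, nx] and not visited[ny, nx]:
                            visited[ny, nx] = True
                            stack.append((ny, nx))
                ys, xs = zip(*pts)          # S324: bounding rectangle
                boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return boxes
```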
It should be noted that each text line output by the text detection model contains at least one line of text, which may be understood as a text outline image or an outline image of a text line/column. In a possible implementation, the text line may also be a sub-image containing text that is segmented from the target picture.
In some embodiments, the text line is a bounding box in units of particular content, where the particular content may be a word, a line of text, a single word, or the like. In some embodiments, the text detection model may generate different lines of text based on the type of text in the picture to be identified. For example, when the picture contains english text, the text detection model may respectively frame the english text of the picture line by line in units of words, so as to generate a plurality of text lines, and it may be understood that the text in the text lines determined in this embodiment is a single english word. For another example, when the picture to be identified includes chinese, the text detection model may frame the chinese text of the picture in units of lines to generate a plurality of text lines, and it may be understood that the text in the text lines determined in this embodiment is a line of chinese text. For another example, when the picture to be identified contains chinese, the text detection model may frame the chinese text of the picture in units of a single word to generate a plurality of text lines, and it may be understood that the text in the text lines determined in this embodiment is a word.
It is easily understood by those skilled in the art that the typesetting of the characters in most pictures is relatively regular, with the characters generally arranged along straight lines, e.g., horizontal, vertical, or oblique. The text lines obtained in this step are therefore generally quadrangular. Specifically, corresponding data can be selected as the training set according to the text typesetting of the pictures to be detected; if the text of the pictures to be detected is typeset horizontally, the training set data for the text detection model consists of pictures of horizontal text.
Specifically, after the target picture is input into the text detection model, the text line information output by the text detection model is the information of one or more text lines in the target picture. All text line information output by the text detection model can be characterized as D = {p1, p2, p3, ..., pN}, where N characterizes the number of text lines detected in the target picture. When each piece of text line information includes the four vertex coordinates of the text line, the four vertex coordinates of each text line can be characterized as pi = {(x1, y1), (x2, y2), (x3, y3), (x4, y4)}, where (x1, y1) is the upper-left corner coordinate point of the text line, (x2, y2) the upper-right corner, (x3, y3) the lower-right corner, and (x4, y4) the lower-left corner.
In the process of training the text detection model, the number of training sample pictures adopted can be 500, 800, 1000 and the like, and specifically, the training sample pictures can be determined by a research and development personnel according to actual conditions.
And step 33, calculating the width value corresponding to each text line according to the coordinates of each text line.
Specifically, the step of calculating the width value of each corresponding text line according to the coordinates of each text line includes:
S331, calculating according to coordinates of four vertexes of each text line to obtain an original width and an original height of each text line;
specifically, since a text line detected by the text detection model is rectangular, the coordinate information of its four vertices is available, and the width and height of the text line can be calculated from these values, for example: width = (x_max - x_min); height = (y_max - y_min).
S332, scaling each text line to the same preset height, and calculating to obtain a scaling ratio corresponding to each text line through the original height and the preset height of each text line;
specifically, the batch processing function of TensorRT requires that the resolution, i.e. the width and height, of each picture in a batch be consistent, while the text lines detected by the text detection model come in various resolutions, so a scaling operation must be performed on all detected text lines. The original height of a text line and the scaled height are needed to calculate the scaling ratio: scaling ratio = scaled height / original height. In one embodiment, the scaled height may be uniformly set to 32 pixels.
S333, obtaining the current width value of each text line through the original width and the scaling ratio of each text line.
Specifically, the scaled width=original width×scaling ratio, and thus the scaled width value is calculated.
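Steps S331 through S333 amount to a few lines of arithmetic. A sketch under the embodiment's 32-pixel target height (the function and variable names are illustrative assumptions):

```python
# Derive a text line's original width/height from its four vertices, rescale
# it to a common target height, and return its new width.
TARGET_H = 32  # uniform scaled height from the embodiment

def scaled_width(vertices, target_h=TARGET_H):
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    orig_w = max(xs) - min(xs)     # S331: width  = x_max - x_min
    orig_h = max(ys) - min(ys)     #        height = y_max - y_min
    ratio = target_h / orig_h      # S332: ratio = scaled height / original height
    return orig_w * ratio          # S333: scaled width = original width * ratio
```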
And step 34, sorting the text lines according to the width values, traversing all the width values, splicing the text line with the longest width and the text line with the shortest width to form a long text, repeating the operation, and stopping the splicing when the width value of the long text being spliced would exceed the width threshold value.
Specifically, the text line width values are sorted from small to large (or from large to small) to form a text line set. Then, starting from both ends of the sorted set, text lines are merged: the set is traversed from the longest text toward the shortest, and lines are spliced together end to end. For example, with a preset width threshold of 1600 pixels, two text lines a and b are selected in turn from the text line set and spliced into a new text line c, i.e. long text c; the new text line c is then spliced with other text lines in the set to obtain a new long text d, and the operation is repeated until adding the next text line would exceed the width threshold, at which point one long text is complete. The preset width threshold is a prior value set according to experimental conditions, with 1600 pixels preferred; this prior value is fixed, is an experimental value obtained through experiments, and is unrelated to other factors.
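The two-ended greedy pairing described above can be sketched as follows. The function returns the indices of the lines placed in each long text so that recognition results can later be mapped back; the function name and the exact alternation policy are assumptions for illustration:

```python
# Sort the scaled widths, then walk inward from both ends, alternately taking
# the current longest and shortest lines into one long text until the next
# line would exceed the width threshold.
def group_lines(widths, threshold=1600):
    order = sorted(range(len(widths)), key=lambda i: widths[i])
    lo, hi = 0, len(order) - 1
    groups = []
    while lo <= hi:
        group, total = [], 0
        take_long = True
        while lo <= hi:
            idx = order[hi] if take_long else order[lo]
            if total + widths[idx] > threshold:
                break                      # next line would exceed the threshold
            group.append(idx)
            total += widths[idx]
            if take_long:
                hi -= 1
            else:
                lo += 1
            take_long = not take_long
        if not group:                      # single line already over threshold
            group, total = [order[hi]], widths[order[hi]]
            hi -= 1
        groups.append((group, total))
    return groups
```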
Step 35, detecting the long text, and if the width value of the long text does not reach the width threshold, repairing the width of the long text according to the width threshold.
Specifically, detecting the long text, if the width value of the long text does not reach the width threshold, repairing the width of the long text according to the width threshold includes:
s351, detecting the long text, and judging whether the width value of the long text reaches a width threshold value, wherein the width threshold value is 1600 pixels;
In general, few spliced long texts are exactly equal to the width threshold, so an interpolation operation must be performed on the long text so that its width is padded up to the set long-text width. For example, when the width value of a spliced long text is 1580 pixels while the preset long-text width is 1600 pixels, and no short text of 20 pixels remains in the text line set that could be spliced on to reach 1600 pixels, 20 pixels are added at the tail of the spliced long text.
And S352, if not, selecting black edges of three color channels of the picture to be identified, and adding the black edges to the width of the long text according to a width threshold value so that the width of the long text reaches 1600 pixels.
Specifically, the source of the interpolation is black pixels across the three color channels (RGB) of the picture to be identified, used to pad the width up to 1600 pixels. More specifically, the width of each long text must be limited to a fixed size, such as 1600 pixels, because the TensorRT framework is called for batch processing, which requires the same resolution for every picture in a batch.
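The black-edge padding of S352 can be sketched with plain nested lists standing in for an H × W × 3 RGB image (a real implementation would use numpy or OpenCV; the function name is illustrative):

```python
def pad_to_width(img_rows, target_width=1600):
    """Pad each row of an H x W x 3 image (nested lists of [R, G, B]
    pixels) on the right with black pixels so the long text reaches
    the common batch width."""
    padded = []
    for row in img_rows:
        # black pixels across all three color channels
        pad = [[0, 0, 0] for _ in range(target_width - len(row))]
        padded.append(row + pad)
    return padded
```

A 1580-pixel-wide long text thus gains a 20-pixel black edge at its tail, matching the example above.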
Step 36, the operations of splicing and patching are performed on the remaining text lines until all text lines form long texts reaching the width threshold, so as to form a long text set.
Specifically, for the long text set, after the repair of one long text is completed, the next iteration is performed until all long texts have been formed. In one embodiment, the serial numbers of the text lines composing each long text and their corresponding positions in the original picture are recorded at the same time. The serial numbers 0, 1, 2, … are assigned when the text detection model detects the batch of text lines; the positions are the position coordinates in the original picture, i.e. the four vertex coordinates of each text line. The aim of this step is to perform a batching operation before the TensorRT batch is run, so that feeding the TensorRT batch amounts to two levels of batching, achieving a speed-up.
And 37, inputting the long text set into a preset text recognition model for recognition, and disassembling the returned result to obtain a recognition result of the picture to be recognized.
Specifically, the step of inputting the long text set into a preset text recognition model for recognition and disassembling the returned result to obtain the recognition result of the picture to be recognized includes:
s371, performing recognition calculation on a text recognition model by using the long text set according to the batch processing operation of TensorRT, wherein the text recognition model adopts a CRNN algorithm;
Specifically, a text recognition model needs to be trained in advance; the text recognition model adopts a CRNN algorithm. The model is then deployed in the TensorRT framework, and text recognition is performed using the batch inference capability of TensorRT; the recognition result is the text content of each long text.
S372, splitting the text strings identified by the long text and corresponding to the corresponding text lines;
Specifically, the returned result needs to be disassembled: the text string identified from each 1600-pixel-wide long text is split up and mapped back to the corresponding text lines, i.e. the initial text lines from which the long text was spliced. For example, a 1600-pixel long text line is composed of four short text lines a, b, c, d in turn. Since the serial numbers 0, 1, 2, … were assigned when the text detection model detected the batch of text lines, and the position coordinates of the text lines were recorded at the same time, the serial numbers and position coordinates of the four short text lines a, b, c, d are known. In the text recognition model, there is one text candidate for every four pixels, so a 1600-pixel-wide long text line yields a candidate set of 400 text candidates; if a is 400 pixels wide, it corresponds to 100 of those text candidates, and so on. Because each initial text line has a serial number and position coordinates, the text string recognized from the spliced long text also corresponds to those serial numbers and position coordinates, and the string recognized by the text recognition model can therefore be split back into each corresponding short text line.
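The disassembly can be sketched as splitting the flat per-position candidate sequence by each short line's share, at one candidate per 4 pixels as described above (names are assumptions; a real CRNN pipeline would then CTC-decode each chunk):

```python
CANDIDATE_STRIDE = 4  # the recognition model emits one candidate per 4 px

def split_candidates(candidates, line_widths):
    """Split the flat candidate sequence of a spliced long text back
    into the chunks belonging to each short text line, in splice order."""
    chunks, pos = [], 0
    for w in line_widths:
        n = w // CANDIDATE_STRIDE       # candidates owned by this line
        chunks.append(candidates[pos:pos + n])
        pos += n
    return chunks
```

For a 1600-pixel long text composed of lines of widths 400, 800, and 400 pixels, the 400 candidates split into chunks of 100, 200, and 100, which are then mapped back via the recorded serial numbers and coordinates.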
And S373, obtaining the identification result of the picture to be identified.
Specifically, when composing a long text, the serial number of each short text line composing the long text and their coordinate information have been recorded. After the long text recognition results are disassembled, the character strings of each short text line and the positions of these strings in the original picture are known. Since each long text is formed by combining short text lines, the recognition result is split into the recognition results of the corresponding short text lines, the coordinate value of each short text line is mapped one by one, and finally the OCR recognition result of the picture is returned, including the coordinate position and text content of each text line.
According to the above steps, multiple experiments have shown that this text-combination approach can increase speed by about 30% under the TensorRT framework.
OCR (Optical Character Recognition) is a technique in which an image of a manuscript is input into a computer by a scanner or other input means; the computer then preprocesses the image, recognizes each character in the preprocessed image, and converts it into the corresponding character code.
In an alternative embodiment, the result of the picture text recognition method may also be uploaded to a blockchain.
Specifically, corresponding summary information is obtained from the result of the picture text recognition method; specifically, the summary information is obtained by hashing that result, for example with the SHA-256 algorithm. Uploading the summary information to the blockchain ensures its security and its fairness and transparency to the user. The user can download the summary information from the blockchain to verify whether the result of the picture text recognition method has been tampered with. The blockchain referred to in this example is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated and linked by cryptographic means, where each data block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product services layer, an application services layer, and the like.
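The summary computation can be sketched as follows; the JSON serialization is an assumption, since the embodiment only specifies hashing the recognition result, e.g. with SHA-256:

```python
import hashlib
import json

def summarize(result: dict) -> str:
    """Produce the SHA-256 summary of an OCR result dict that would
    be uploaded to the blockchain. Deterministic serialization keeps
    the digest reproducible for later tamper checks."""
    payload = json.dumps(result, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

The user can recompute this digest from the downloaded result and compare it with the on-chain summary to detect tampering.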
The invention provides a method, a system, a device, and a storage medium for recognizing picture text. The recognition method acquires a picture to be recognized and preprocesses it; detects the preprocessed picture with a pre-trained text detection model to obtain the coordinates of each text line in the picture; calculates a width value for each text line from its coordinates; sorts the text lines by width value, traverses all width values, splices the text line with the longest width and the text line with the shortest width to form a long text, repeats the operation, and stops splicing when the width value of the long text being spliced would exceed a width threshold; detects the long text and, if its width value does not reach the width threshold, repairs the width of the long text according to the width threshold; splices and repairs the remaining text lines until all text lines form long texts reaching the width threshold, forming a long text set; and inputs the long text set into a preset text recognition model for recognition, disassembling the returned result to obtain the recognition result of the picture to be recognized. With this batched text recognition acceleration method based on text combination, the underlying two-level batching logic effectively reduces the computation and processing time of the background model in high-concurrency scenarios once the project is deployed.
As shown in fig. 4, the present invention further provides a system for recognizing a picture text, which may be integrated in the computer device 110, and specifically may include a preprocessing module 20, a detection module 30, a width value module 40, a stitching module 50, a patching module 60, a long text aggregation module 70, and a recognition module 80.
The preprocessing module 20 is configured to obtain a picture to be identified and preprocess the picture; the detection module 30 is configured to detect the preprocessed picture by using a pre-trained text detection model, so as to obtain the coordinates of each text line in the picture; the width value module 40 is configured to calculate a width value corresponding to each text line according to the coordinates of each text line; the splicing module 50 is configured to sort the text lines according to the width values, traverse all the width values, splice the text line with the longest width and the text line with the shortest width to form a long text, and repeat the operation, stopping the splicing when the width value of the long text being spliced would exceed the width threshold; the patching module 60 is configured to detect the long text and, if the width value of the long text does not reach the width threshold, patch the width of the long text according to the width threshold; the long text set module 70 is configured to control the splicing module 50 and the patching module 60 to splice and patch the remaining text lines until all text lines form long texts reaching the width threshold, so as to form a long text set; the recognition module 80 is configured to input the long text set into a preset text recognition model for recognition, and disassemble the returned result to obtain a recognition result of the picture to be recognized.
In one embodiment, the preprocessing step of the preprocessing module 20 includes:
scaling the picture, wherein the maximum width of the scaling of the picture is not more than 1600 pixels, the maximum height is not more than 2400 pixels, the minimum width is not less than 600 pixels, and the minimum height is not less than 800 pixels;
and converting the zoomed picture into a gray scale picture.
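The scaling bounds can be sketched as below. Using a single uniform scale factor that preserves the aspect ratio is an assumption; the embodiment only states the pixel bounds:

```python
def clamp_scale(w, h, max_w=1600, max_h=2400, min_w=600, min_h=800):
    """Pick one scale factor that keeps the picture within the stated
    bounds where possible (shrink oversized pictures first, then grow
    undersized ones), preserving aspect ratio. Returns (new_w, new_h)."""
    scale = 1.0
    scale = min(scale, max_w / w, max_h / h)   # shrink if too large
    scale = max(scale, min_w / w, min_h / h)   # grow if too small
    return round(w * scale), round(h * scale)
```

For example, a 3200 × 2400 scan is halved to 1600 × 1200, while a 300 × 400 thumbnail is doubled to 600 × 800; the result would then be converted to grayscale.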
In one embodiment, the processing steps of the detection module 30 include:
invoking a trained text detection model, wherein the text detection model adopts a dbnet algorithm, a preprocessed picture is input into the text detection model, and the text detection model outputs a probability value of a corresponding text in a picture pixel point;
thresholding selection is carried out on the pixel points in the picture, and the pixel points in the picture are divided into text pixel points and non-text pixel points according to the probability value, so that a binary image is obtained;
calculating a connected domain set of the binary image by using a first image processing algorithm;
and inputting the connected domain set into a second image processing algorithm, and calculating the minimum circumscribed rectangle of each connected domain, wherein four vertexes of the minimum circumscribed rectangle are coordinates of text lines.
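The "first image processing algorithm" could be any connected-component labelling routine (e.g. `cv2.connectedComponents` in OpenCV, with `cv2.minAreaRect` as the second algorithm); for illustration, a pure-Python 4-connected sketch on a binary image:

```python
from collections import deque

def connected_components(binary):
    """Label 4-connected components of a binary image given as a list
    of rows of 0/1 values. Returns one list of (y, x) pixels per
    connected domain."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    comps = []
    for y in range(h):
        for x in range(w):
            if binary[y][x] and not seen[y][x]:
                queue, comp = deque([(y, x)]), []
                seen[y][x] = True
                while queue:                      # breadth-first flood fill
                    cy, cx = queue.popleft()
                    comp.append((cy, cx))
                    for ny, nx in ((cy + 1, cx), (cy - 1, cx),
                                   (cy, cx + 1), (cy, cx - 1)):
                        if (0 <= ny < h and 0 <= nx < w
                                and binary[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                comps.append(comp)
    return comps
```

Each returned component would then be fed to the second algorithm to compute its minimum enclosing rectangle, whose four vertices give the text line coordinates.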
In one embodiment, the processing steps of the width value module 40 include:
Calculating according to coordinates of four vertexes of each text line to obtain the original width and the original height of each text line;
scaling each text line to the same preset height, and calculating to obtain a scaling ratio corresponding to each text line through the original height and the preset height of each text line;
the current width value of each text line is obtained through the original width and the scaling ratio of each text line.
In one embodiment, the processing steps of the patching module 60 include:
detecting the long text, and judging whether the width value of the long text reaches a width threshold value, wherein the width threshold value is 1600 pixels;
if not, selecting black edges of three color channels of the picture to be identified, and adding the black edges to the width of the long text according to a width threshold value, so that the width of the long text reaches 1600 pixels.
In one embodiment, the processing steps of the identification module 80 include:
carrying out recognition calculation on a text recognition model by the long text set according to the batch processing operation of TensorRT, wherein the text recognition model adopts a CRNN algorithm;
splitting the text strings identified by the long text and corresponding to the corresponding text lines;
and obtaining the identification result of the picture to be identified.
In one embodiment, the identifying system further includes a blockchain module (not shown) configured to upload the identification result to a blockchain after the identification result of the picture to be identified is obtained, so that the blockchain stores the identification result in an encrypted manner.
The processing steps of the above modules are described in specific detail in the embodiments of the method and are not further described herein.
Referring to fig. 5, fig. 5 is a schematic structural diagram of an apparatus according to an embodiment of the invention. As shown in fig. 5, the device 200 includes a processor 201 and a memory 202 coupled to the processor 201.
The memory 202 stores program instructions for implementing the method for recognizing a picture text according to any of the above embodiments.
The processor 201 is configured to execute program instructions stored by the memory 202.
The processor 201 may also be referred to as a CPU (Central Processing Unit). The processor 201 may be an integrated circuit chip with signal processing capabilities. The processor 201 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
Referring to fig. 6, fig. 6 is a schematic structural diagram of a storage medium according to an embodiment of the present invention. The storage medium of the embodiment of the present invention stores a program file 301 capable of implementing the method for recognizing picture text, where the program file 301 may be stored in the storage medium in the form of a software product, and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) or a processor to execute all or part of the steps of the method in the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program code, or a terminal device such as a computer, a server, a mobile phone, or a tablet.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, apparatus, article, or method that comprises the element.
The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments. From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) as described above, comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the method according to the embodiments of the present invention.

Claims (10)

1. A method for identifying a picture text, the method comprising:
acquiring a picture to be identified, and preprocessing the picture;
detecting the preprocessed picture by using a text detection model which is trained in advance to obtain the coordinate of each text line in the picture;
calculating according to the coordinates of each text line to obtain a width value corresponding to each text line;
sequencing the text lines according to the width values, traversing all the width values, splicing the text line with the longest width and the text line with the shortest width to form a long text, repeating the operation, and stopping splicing when the width value of the long text being spliced would exceed a width threshold; wherein the text line width values are sorted from small to large or from large to small to form a text line set, the sorted text line set is combined from both ends, i.e. traversed from the longest text and the shortest text, and spliced together, the splicing being to splice two text lines end to end; further comprising selecting two text lines a and b in turn from the text line set and splicing them together to obtain a new text line c, then splicing the new text line c with other text lines in the set to obtain a new long text d, and repeating the operation until the width threshold would be exceeded after the next text line is added, finally forming a long text;
detecting the long text, and if the width value of the long text does not reach the width threshold, repairing the width of the long text according to the width threshold, so that the width value of the long text reaches the width threshold;
Splicing and repairing the rest text lines until all text lines form long text reaching a width threshold value to form a long text set;
inputting the long text set into a preset text recognition model for recognition, performing recognition calculation on the text recognition model by the long text set according to batch processing operation, and disassembling the returned result to obtain a recognition result of the picture to be recognized.
2. The identification method of claim 1, wherein the step of preprocessing the picture comprises:
scaling the picture, wherein the maximum width of the scaling of the picture is not more than 1600 pixels, the maximum height is not more than 2400 pixels, the minimum width is not less than 600 pixels, and the minimum height is not less than 800 pixels;
and converting the zoomed picture into a gray scale picture.
3. The method of claim 1, wherein the step of detecting the preprocessed picture using the pre-trained text detection model to obtain coordinates of each text line in the picture comprises:
invoking a trained text detection model, wherein the text detection model adopts a dbnet algorithm, a preprocessed picture is input into the text detection model, and the text detection model outputs a probability value of a corresponding text in a pixel point of the picture;
Thresholding selection is carried out on the pixel points in the picture, and the pixel points in the picture are divided into text pixel points and non-text pixel points according to the probability value, so that a binary image is obtained;
calculating a connected domain set of the binary image by using a first image processing algorithm;
and inputting the connected domain set into a second image processing algorithm, and calculating the minimum circumscribed rectangle of each connected domain, wherein four vertexes of the minimum circumscribed rectangle are coordinates of text lines.
4. The recognition method as set forth in claim 1, wherein the step of calculating a width value corresponding to each text line based on coordinates of each text line comprises:
calculating according to coordinates of four vertexes of each text line to obtain the original width and the original height of each text line;
scaling each text line to the same preset height, and calculating to obtain a scaling ratio corresponding to each text line through the original height and the preset height of each text line;
the current width value of each text line is obtained through the original width and the scaling ratio of each text line.
5. The method of claim 1, wherein the step of detecting the long text, if the long text width value does not reach the width threshold, repairs the long text width according to the width threshold comprises:
Detecting the long text, and judging whether the width value of the long text reaches a width threshold value, wherein the width threshold value is 1600 pixels;
if not, selecting black edges of three color channels of the picture to be identified, and adding the black edges to the width of the long text according to a width threshold value so that the width of the long text reaches 1600 pixels.
6. The recognition method according to claim 1, wherein the step of inputting the long text set into a preset text recognition model for recognition and disassembling the returned result to obtain the recognition result of the picture to be recognized comprises:
carrying out recognition calculation on a text recognition model by the long text set according to the batch processing operation of TensorRT, wherein the text recognition model adopts a CRNN algorithm;
splitting the text strings identified by the long text and corresponding to the corresponding text lines;
and obtaining the identification result of the picture to be identified.
7. The method according to claim 1, wherein after the identification result of the picture to be identified is obtained, the identification result is uploaded to a blockchain, so that the blockchain stores the identification result in an encrypted manner.
8. A system for recognizing a picture text, the recognition system comprising:
The preprocessing module is used for acquiring a picture to be identified and preprocessing the picture;
the detection module is used for detecting the preprocessed picture by utilizing the text detection model which is trained in advance to obtain the coordinate of each text line in the picture;
the width value module is used for calculating a width value corresponding to each text line according to the coordinates of each text line;
the splicing module is used for sequencing the text lines according to the width values, traversing all the width values, splicing the text line with the longest width and the text line with the shortest width to form a long text, repeating the operation, and stopping splicing when the width value of the long text being spliced would exceed a width threshold; wherein the text line width values are sorted from small to large or from large to small to form a text line set, the sorted text line set is combined from both ends, i.e. traversed from the longest text and the shortest text, and spliced together, the splicing being to splice two text lines end to end; further comprising selecting two text lines a and b in turn from the text line set and splicing them together to obtain a new text line c, then splicing the new text line c with other text lines in the set to obtain a new long text d, and repeating the operation until the width threshold would be exceeded after the next text line is added, finally forming a long text;
The patching module is used for detecting the long text, and if the width value of the long text does not reach the width threshold value, patching the width of the long text according to the width threshold value so that the width value of the long text reaches the width threshold value;
the long text set module is used for controlling the splicing module and the repairing module to splice and repair the rest text lines until all the text lines form long texts reaching the width threshold value so as to form a long text set;
the recognition module is used for inputting the long text set into a preset text recognition model for recognition, carrying out recognition calculation on the text recognition model by the long text set according to batch processing operation, and disassembling the returned result to obtain a recognition result of the picture to be recognized.
9. A computer device comprising a memory and a processor, the memory having stored therein readable instructions that when executed by the processor cause the processor to perform the steps of the method of recognizing a photo-text as claimed in any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that a program file capable of realizing the method of recognizing a picture text according to any one of claims 1 to 7 is stored.
CN202110213721.0A 2021-02-25 2021-02-25 Picture text recognition method, system, equipment and storage medium Active CN112926565B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110213721.0A CN112926565B (en) 2021-02-25 2021-02-25 Picture text recognition method, system, equipment and storage medium


Publications (2)

Publication Number Publication Date
CN112926565A CN112926565A (en) 2021-06-08
CN112926565B true CN112926565B (en) 2024-02-06

Family

ID=76172020


Country Status (1)

Country Link
CN (1) CN112926565B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115082938B (en) * 2021-03-12 2024-09-27 广州视源电子科技股份有限公司 Text recognition method, storage medium and equipment
CN113435437A (en) * 2021-06-24 2021-09-24 随锐科技集团股份有限公司 Method and device for identifying state of switch on/off indicator and storage medium
CN113920527B (en) * 2021-10-13 2024-09-17 中国平安人寿保险股份有限公司 Text recognition method, device, computer equipment and storage medium
CN114399710A (en) * 2022-01-06 2022-04-26 昇辉控股有限公司 Identification detection method and system based on image segmentation and readable storage medium
CN118279923B (en) * 2024-05-29 2024-08-23 天津市天益达科技发展有限公司 Picture character recognition method, system and storage medium based on deep learning training

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108764226A (en) * 2018-04-13 2018-11-06 顺丰科技有限公司 Image text recognition methods, device, equipment and its storage medium
WO2019174130A1 (en) * 2018-03-14 2019-09-19 平安科技(深圳)有限公司 Bill recognition method, server, and computer readable storage medium
CN110598566A (en) * 2019-08-16 2019-12-20 深圳中兴网信科技有限公司 Image processing method, device, terminal and computer readable storage medium
CN110969154A (en) * 2019-11-29 2020-04-07 上海眼控科技股份有限公司 Text recognition method and device, computer equipment and storage medium
CN111738233A (en) * 2020-08-07 2020-10-02 北京易真学思教育科技有限公司 Text detection method, electronic device and computer readable medium
CN112041851A (en) * 2018-12-29 2020-12-04 华为技术有限公司 Text recognition method and terminal equipment
CN112183372A (en) * 2020-09-29 2021-01-05 深圳数联天下智能科技有限公司 Text recognition method, device and equipment and readable storage medium
CN112329777A (en) * 2021-01-06 2021-02-05 平安科技(深圳)有限公司 Character recognition method, device, equipment and medium based on direction detection




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant