WO2023024793A1 - Character recognition method and related device thereof - Google Patents


Info

Publication number
WO2023024793A1
WO2023024793A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
recognized
text
text image
slice
Prior art date
Application number
PCT/CN2022/107728
Other languages
French (fr)
Chinese (zh)
Inventor
蔡悦
张宇轩
黄灿
王长虎
Original Assignee
北京有竹居网络技术有限公司
Priority date
Filing date
Publication date
Application filed by 北京有竹居网络技术有限公司
Publication of WO2023024793A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques

Definitions

  • the present application relates to the technical field of data processing, in particular to a character recognition method and related equipment.
  • character recognition technology is being applied to an ever wider range of scenarios.
  • the character recognition technology is used to perform recognition processing on characters appearing in an image.
  • long text recognition refers to a process of character recognition for an image including long text.
  • the present application provides a character recognition method and related equipment, which can improve the recognition accuracy of long text recognition.
  • An embodiment of the present application provides a character recognition method, the method comprising:
  • the text image to be recognized is acquired, and a first segmentation process is performed on the text image to be recognized according to preset slice parameters to obtain at least one image slice and position information of the at least one image slice; wherein the text image to be recognized includes long text;
  • the actual image-cutting position corresponding to the text image to be recognized is determined according to the single-character detection result of the at least one image slice and the position information of the at least one image slice, and a second segmentation process is performed on the text image to be recognized according to that actual image-cutting position to obtain at least one image to be used;
  • the character recognition result of the text image to be recognized is determined according to the character recognition result of the at least one image to be used.
  • the determining the actual image cutting position corresponding to the text image to be recognized according to the word detection result of the at least one image slice and the position information of the at least one image slice includes:
  • the actual cutting position corresponding to the text image to be recognized is determined according to the single-character detection result of the at least one image slice, the position information of the at least one image slice, and the preset cutting position corresponding to the text image to be recognized.
  • the process of determining the actual cutting position corresponding to the text image to be recognized includes:
  • the single-character detection results of the at least one image slice are spliced according to the position information of the at least one image slice to obtain the single-character detection result of the text image to be recognized;
  • the actual cut position corresponding to the text image to be recognized is determined.
  • the preset slice parameters include a segmentation interval and a segmentation offset length; wherein, the segmentation offset length is smaller than the segmentation interval;
  • the process of determining the at least one image slice includes:
  • Segmenting the image to be segmented according to the segmentation interval to obtain at least one image slice.
  • the preset slicing parameters also include a cut-off start position;
  • the determination process of the image to be segmented includes:
  • the process of determining the word detection result of the at least one image slice includes:
  • the embodiment of the present application also provides a text recognition device, including:
  • the first segmentation unit is configured to perform a first segmentation process on the text image to be recognized according to preset slice parameters after acquiring the text image to be recognized, to obtain at least one image slice and position information of the at least one image slice; wherein the text image to be recognized includes long text;
  • a position determination unit configured to determine the actual cut-out position corresponding to the text image to be recognized according to the word detection result of the at least one image slice and the position information of the at least one image slice;
  • the second segmentation unit is configured to perform a second segmentation process on the text image to be recognized according to the actual image cutting position corresponding to the text image to be recognized to obtain at least one image to be used;
  • the result determination unit is configured to determine the character recognition result of the text image to be recognized according to the character recognition result of the at least one image to be used.
  • the embodiment of the present application also provides a device, the device includes a processor and a memory:
  • the memory is used to store computer programs
  • the processor is configured to execute any implementation of the character recognition method provided in the embodiments of the present application according to the computer program.
  • the embodiment of the present application also provides a computer-readable storage medium, the computer-readable storage medium is used to store a computer program, and the computer program is used to execute any implementation manner of the character recognition method provided in the embodiment of the present application.
  • the embodiment of the present application also provides a computer program product, which, when running on a terminal device, enables the terminal device to execute any implementation manner of the character recognition method provided in the embodiment of the present application.
  • FIG. 1 is a flowchart of a character recognition method provided by an embodiment of the present application;
  • FIG. 2 is a schematic diagram of a text image to be recognized provided by an embodiment of the present application;
  • FIG. 3 is a schematic diagram of another text image to be recognized provided by an embodiment of the present application;
  • FIG. 4 is a schematic diagram of an image slice processing process provided by an embodiment of the present application;
  • FIG. 5 is a schematic diagram comparing two character recognition processes provided by an embodiment of the present application;
  • FIG. 6 is a schematic structural diagram of a single-character detection model provided by an embodiment of the present application;
  • FIG. 7 is a schematic diagram of a character recognition process provided by an embodiment of the present application;
  • FIG. 8 is a schematic structural diagram of a character recognition device provided by an embodiment of the present application.
  • because substantially shrinking an image usually greatly reduces its definition, the reduced image is prone to blurred content, so the character recognition result determined from the reduced image is inaccurate, and the recognition accuracy of long text recognition is therefore low.
  • the embodiment of the present application provides a text recognition method. The method includes: after obtaining a text image to be recognized that includes long text, first performing a first segmentation process on the text image to be recognized according to preset slice parameters to obtain at least one image slice and position information of the at least one image slice; next, determining, according to the single-character detection result and the position information of the at least one image slice, the actual image-cutting position corresponding to the text image to be recognized; then, performing a second segmentation process on the text image to be recognized according to that actual image-cutting position to obtain at least one image to be used; and finally, determining the character recognition result of the text image to be recognized according to the character recognition results of the at least one image to be used, so that character recognition for long text is realized.
  • because the actual cutting position is determined based on the single-character detection result, it will, as far as possible, not fall inside a character; thus, when the image is cut at the actual image-cutting position, characters are rarely split, so incomplete characters can largely be avoided in each cut image corresponding to the text image to be recognized (that is, in each image to be used), which is conducive to improving the recognition accuracy of long text recognition.
  • the processing time for each image slice is much shorter than the processing time for the text image to be recognized, which is conducive to improving the efficiency of text recognition.
  • the embodiment of the present application does not limit the execution subject of the character recognition method.
  • the character recognition method provided in the embodiment of the present application can be applied to data processing devices such as terminal devices or servers.
  • the terminal device may be a smart phone, a computer, a personal digital assistant (Personal Digital Assistant, PDA), or a tablet computer.
  • the server can be an independent server, a cluster server or a cloud server.
  • Refer to FIG. 1, which is a flowchart of a character recognition method provided by an embodiment of the present application.
  • the character recognition method provided in the embodiment of this application includes S1-S5:
  • the text image to be recognized refers to an image that requires character recognition processing (especially long text recognition processing); and the text image to be recognized includes long text (especially super long text).
  • long text refers to a text whose number of characters exceeds the first threshold; moreover, the first threshold can be preset.
  • Extremely long text refers to text whose number of characters exceeds a second threshold; the second threshold can be preset, and the second threshold is greater than the above-mentioned "first threshold”.
  • the text image to be recognized may be the image to be processed as shown in FIG. 2 , or the text image corresponding to the image to be processed as shown in FIG. 3 .
  • the "text image corresponding to the image to be processed” refers to an image cut from the image to be processed according to the text detection result of the image to be processed.
  • Example 1 may specifically include: after the image to be processed is acquired, the image to be processed may be directly determined as the text image to be recognized.
  • in order to avoid, as much as possible, adverse effects on long text recognition caused by image information other than text in the image to be processed, S1 may specifically include S11-S12:
  • the image to be processed refers to an image that requires image processing (such as text detection and/or character recognition); and the embodiment of the present application does not limit the image to be processed, for example, the image to be processed may be a frame of video image.
  • the text detection result of the image to be processed is used to describe the position, in the image to be processed, of the text it contains (for example, the text "this is an image including long text").
  • the embodiment of the present application does not limit the implementation of "text detection” in S11, and any existing or future method capable of text detection for images can be used for implementation.
  • the image area corresponding to the text detection result is cut out of the image to be processed to obtain the text image to be recognized (as shown in FIG. 3), so that the text image to be recognized can more accurately represent the character information carried by the image to be processed.
  • based on the above, the text image to be recognized can be determined from the image to be processed, so that it represents the character information carried by the image to be processed, and that character information can subsequently be determined accurately based on the text image to be recognized.
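The cropping in S11-S12 can be sketched as follows, assuming the text detection result reduces to a single axis-aligned box and representing the image as a nested list of pixels; the function name `crop_text_region` is illustrative, not taken from the application:

```python
def crop_text_region(image, box):
    """Crop the detected text region (S12): keep only the rows and columns
    inside the axis-aligned box (left, top, right, bottom)."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]

# Toy 4x6 "image" of labelled pixels; the text occupies columns 1..4 of rows 1..2.
image = [[f"p{r}{c}" for c in range(6)] for r in range(4)]
text_image = crop_text_region(image, (1, 1, 5, 3))
```

Real text detectors usually return quadrilaterals per text line; the axis-aligned simplification here only illustrates the region-cutting idea.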
  • S2 Perform a first segmentation process on the text image to be recognized according to preset slice parameters to obtain at least one image slice and position information of the at least one image slice.
  • the "preset slice parameter” refers to the parameter that needs to be referred to when performing the first segmentation process on the text image to be recognized; and the embodiment of the present application does not limit the “preset slice parameter", for example, it may include the segmentation interval.
  • the "segmentation interval" is used to indicate the distance between two adjacent segmentation positions when the first segmentation process is performed on the text image to be recognized; and the embodiment of the present application does not limit the "segmentation interval" (for example, 512 pixels, as shown in FIG. 4).
  • the "first slicing process” is used to indicate the slicing process implemented according to the above-mentioned “preset slicing parameters”.
  • "at least one image slice" refers to the image segments obtained after the first segmentation process of the text image to be recognized; and "position information of at least one image slice" is used to describe the position of each image slice in the text image to be recognized.
  • the embodiment of the present application does not limit the process of determining "at least one image slice".
  • two possible implementations are described below.
  • the determination process of "at least one image slice" may specifically include: performing segmentation processing on the text image to be recognized according to the segmentation interval to obtain at least one image slice, so that the length of each image slice is the above-mentioned "segmentation interval" (e.g., 512 pixels, as shown in FIG. 4).
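This simple variant amounts to splitting the image width at fixed intervals; a minimal sketch (the helper name `slice_ranges` is hypothetical, and the 512-pixel interval follows the example above):

```python
def slice_ranges(width, interval):
    """First segmentation, simple variant: split [0, width) into consecutive
    slices of `interval` pixels; the last slice may be shorter."""
    return [(start, min(start + interval, width))
            for start in range(0, width, interval)]

# A 1280-pixel-wide text image with a 512-pixel segmentation interval.
ranges = slice_ranges(1280, 512)
# Each (start, end) pair doubles as the slice's position information.
```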
  • the embodiment of the present application also provides another possible implementation of determining "at least one image slice".
  • the determination process of "at least one image slice” may specifically include S21-S22:
  • the "segmentation offset length" is used to indicate the segmentation offset to be used when performing the first segmentation process on the text image to be recognized; and the “segmentation offset length” can be smaller than the above-mentioned “segmentation interval ".
  • the embodiment of the present application does not limit the "segmentation offset length". For example, as shown in FIG. 4, when the above-mentioned "segmentation interval" is 512 pixels, the “segmentation offset length" can be 256 pixels .
  • the embodiment of the present application does not limit the position of the above-mentioned "image area with the segmentation offset length"; for example, it may be located in the leftmost area of the text image to be recognized (as shown in FIG. 4), in the rightmost area of the text image to be recognized, or in a preset inner area of the text image to be recognized.
  • S21 may specifically include S211-S212:
  • S211: Determine the cut-off region position according to the cut-off start position and the segmentation offset length.
  • the "cut-off start position" is used to indicate the position, in the above-mentioned "text image to be recognized", of one boundary (such as the left boundary) of the above-mentioned "image area with the segmentation offset length"; and the embodiment of the present application does not limit the "cut-off start position"; for example, as shown in FIG. 4, it may be the left boundary position of the text image to be recognized.
  • the "cut-off region position" is used to indicate the position of the above-mentioned "image area with the segmentation offset length" in the "text image to be recognized"; the length of the cut-off region is the above-mentioned "segmentation offset length", and its boundary includes the above-mentioned "cut-off start position".
  • S212: Perform region-cutting processing on the text image to be recognized according to the cut-off region position to obtain the image to be segmented.
  • specifically, the image area occupying the cut-off region position (that is, the above-mentioned "image area with the segmentation offset length") can be cut off from the text image to be recognized, and the remaining area of the text image to be recognized determined as the image to be segmented; the image to be segmented thus represents the image areas of the text image to be recognized other than the "image area with the segmentation offset length", and does not include that area.
  • based on the above, the image area with the segmentation offset length can be cut off from the text image to be recognized to obtain the image to be segmented, so that the image to be segmented does not include that area and can subsequently be subjected to segmentation processing.
  • S22 Perform segmentation processing on the image to be segmented according to the segmentation interval to obtain at least one image slice.
  • the image to be segmented may be segmented according to the segmentation interval to obtain at least one image slice (a plurality of image slices as shown in FIG. 4 ).
  • it should be noted that, because of the cut-off region, the segmentation positions used when segmenting the "image to be segmented" are offset by a certain amount relative to the "text image to be recognized", so they are almost never the same as the "preset image-cutting positions" below, which effectively avoids the above-mentioned adverse effects caused by split characters in the "first segmentation process".
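The boundary-shifting effect of the offset can be illustrated numerically; a sketch of S21-S22 under assumed widths (2048-pixel image, 512-pixel interval, 256-pixel offset), with a hypothetical helper name:

```python
def slice_ranges_with_offset(width, interval, offset):
    """First segmentation with a cut-off region (S21-S22): drop the leading
    `offset` pixels, then slice the remainder at `interval` pixels."""
    return [(start, min(start + interval, width))
            for start in range(offset, width, interval)]

# Plain slicing puts boundaries at 512, 1024, 1536; the 256-pixel offset
# moves them to 768, 1280, 1792, so the two sets never coincide.
plain = [(s, min(s + 512, 2048)) for s in range(0, 2048, 512)]
shifted = slice_ranges_with_offset(2048, 512, 256)
```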
  • based on the above, the text image to be recognized can be subjected to the first segmentation process according to the preset slice parameters to obtain at least one image slice and the position information of the at least one image slice, so that the character recognition result of the "text image to be recognized" can later be determined based on the at least one image slice.
  • in addition, the time consumed to process each image slice is far less than the time consumed to process the whole text image to be recognized, which is beneficial to improving the efficiency of text recognition.
  • the "single character detection result of at least one image slice" is used to indicate the position of each character in each image slice.
  • the embodiment of the present application does not limit the determination process of "the word detection result of at least one image slice". For example, word detection processing may be performed on each image slice to obtain the word detection result of each image slice. It should be noted that the embodiment of the present application does not limit the implementation manner of "single word detection processing", for example, any existing or future single word detection method may be used for implementation. As another example, the "single character detection model” shown below can be used for implementation.
  • the embodiment of the present application also provides another possible implementation of determining "the single-character detection result of at least one image slice", which may specifically include: using a pre-built single-character detection model to perform single-character detection processing on the at least one image slice in parallel, to obtain the single-character detection result of the at least one image slice.
  • the single-character detection model is used to perform character position detection (for example, to perform character boundary position detection) on the input data of the single-character detection model.
  • the single-character detection model 600 may include a feature extraction layer 601 and a single-character position determination layer 602, and the input data of the single-character position determination layer 602 includes the output data of the feature extraction layer 601.
  • the process of determining the single character detection result of the above target image is taken as an example to describe below.
  • the "target image” is used to represent any image slice in the above "at least one image slice”.
  • the process of using the word detection model 600 to determine the above-mentioned "single word detection result” may specifically include steps 11-12:
  • Step 11 Input the target image into the feature extraction layer 601 to obtain the image position feature output by the feature extraction layer 601 .
  • the feature extraction layer 601 is used to perform feature extraction on the input data of the feature extraction layer 601; and the embodiment of the present application does not limit the feature extraction layer 601; for example, it can be implemented using any convolutional neural network (Convolutional Neural Network, CNN), e.g., a Visual Geometry Group (VGG) network.
  • the image position feature is used to represent the information carried by each position in the target image (especially, the information carried by each position in the width direction).
  • the embodiment of the present application does not limit the image position feature.
  • for example, if the target image is a [C, H, W] matrix, the image position feature may be a [1, 1, W/4] matrix.
  • Step 12 Input the feature of the image position into the word position determination layer 602, and obtain the word detection result of the target image output by the word position determination layer 602.
  • the character position determining layer 602 is used for performing character boundary position recognition processing on the input data of the character position determining layer 602 .
  • the embodiment of the present application does not limit the word position determination layer 602.
  • the word location determination layer 602 may include a location classification layer and a location mapping layer, and the input data of the location mapping layer includes the output data of the location classification layer.
  • the determination process of the above-mentioned "single character detection result" may include steps 21-22:
  • Step 21 Input the image position feature into the position classification layer, and obtain the position classification result output by the position classification layer.
  • the position classification layer is used to judge whether the input data of the position classification layer belongs to the character boundary position.
  • the embodiment of the present application does not limit the implementation manner of the position classification layer, and any existing or future classifier (eg, softmax, etc.) may be used for implementation.
  • the position classification result is used to indicate whether each position in the target image belongs to a character boundary (especially, whether each position in the width direction of the target image belongs to a character boundary).
  • Step 22 Input the position classification result into the position mapping layer, and obtain the word detection result of the target image output by the position mapping layer.
  • the position mapping layer is used to perform mapping processing on the input data of the position mapping layer.
  • the embodiment of the present application does not limit the working principle of the location mapping layer.
  • the location mapping layer may map each position in the position classification result according to formula (1):
  • y = a × x + b (1)
  • where y represents the mapped position coordinate corresponding to x; "a" represents the ratio between the width of the target image and the width of the image position feature (for example, 4); x represents a position coordinate in the position classification result (in particular, a position coordinate of the position classification result in the width direction); and "b" represents the convolution offset used in the feature extraction layer 601.
  • because the width of the image position feature is smaller than the width of the target image (for example, 1/4 of it), the width of the position classification result determined from the image position feature is also smaller than the width of the target image (for example, also 1/4 of it); at this time, in order to describe more accurately whether each width-direction position of the target image belongs to a character boundary, each width-direction position coordinate of the position classification result can be mapped to a width-direction position coordinate of the target image according to formula (1).
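Formula (1) is a simple linear coordinate mapping; a sketch assuming a width ratio of 4 and a convolution offset of 0 (both values are examples, not fixed by the application):

```python
def map_to_image_coord(x, a=4, b=0):
    """Formula (1): y = a * x + b maps a width-direction coordinate x in the
    (downsampled) position classification result back to the target image.
    a = target image width / feature width (e.g. 4); b = the feature
    extraction layer's convolution offset (assumed 0 here)."""
    return a * x + b

# A character boundary detected at feature column 13 maps to pixel column 52.
mapped = map_to_image_coord(13)
```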
  • based on the relevant content of steps 11 and 12 above, it can be seen that the single-character detection model 600 shown in FIG. 6 performs feature extraction processing and single-character position determination processing on the target image, and obtains and outputs the single-character detection result of the target image, so that the single-character detection result can accurately represent the boundary position of each character in the target image.
  • the single character detection model can be constructed in advance according to the sample text image and the actual position of each character in the sample text image.
  • the sample text image refers to the image used to construct the word detection model; and the embodiment of the present application does not limit the number of sample text images.
  • the embodiment of the present application does not limit the actual position of each character in the sample text image, for example, it may be the actual boundary position of each character in the sample text image.
  • the embodiment of the present application does not limit the construction process of the single word detection model, for example, in a possible implementation manner, the construction process of the single word detection model may include steps 31-step 34:
  • Step 31 Input the sample text image into the model to be trained, and obtain the predicted character position of the sample text image output by the model to be trained.
  • the model to be trained is used for character position detection (for example, character boundary position detection) for the data input of the model to be trained.
  • the model structure of the model to be trained is the same as the "word detection model" above, so the relevant content of the model structure of the model to be trained can refer to the relevant content of the model structure of the "word detection model” above.
  • the predicted character position of the sample text image is used to describe the predicted position of at least one character in the sample text image.
  • Step 32 Judging whether the preset stop condition is met, if yes, execute step 34; if not, execute step 33.
  • the preset stop condition can be set in advance; for example, it can be that the loss value of the model to be trained is lower than a preset loss threshold, that the rate of change of the loss value of the model to be trained is lower than a preset rate-of-change threshold (that is, the character position detection performance of the model to be trained has converged), or that the number of updates of the model to be trained reaches a preset count threshold.
  • the loss value of the model to be trained is used to characterize the character position detection performance of the model to be trained; moreover, the embodiment of the present application does not limit the method for determining the loss value of the model to be trained.
  • the preset loss threshold, the preset change rate threshold, and the preset number of times threshold can all be preset.
  • Step 33 Update the model to be trained according to the predicted character position of the sample text image and the actual position of each character in the sample text image, and return to step 31.
  • specifically, when the model to be trained in the current round does not meet the preset stop condition, the model to be trained can be updated according to the difference between the predicted character position of the sample text image and the actual position of each character in the sample text image, so that the updated model has better character position detection performance, and then step 31 and its subsequent steps are executed again.
  • Step 34 Determine a word detection model according to the model to be trained.
  • after it is determined that the model to be trained in the current round meets the preset stop condition, it can be concluded that this model has good character position detection performance, so the single-character detection model can be determined from it: for example, the current round's model to be trained can be directly used as the single-character detection model, or the model structure and model parameters of the single-character detection model can be set according to the model structure and model parameters of the current round's model to be trained, so that they are consistent. In this way, the single-character detection model also has good character position detection performance, so the single-character detection results it later produces for the at least one image slice can accurately represent the position of each character in each image slice.
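Steps 31-34 form a standard train-until-converged loop; a minimal sketch with a toy one-parameter "model" standing in for the network (all names and the update rule are illustrative, not the application's actual training procedure):

```python
def train_word_detector(loss_fn, update_fn, params,
                        loss_threshold=1e-3, max_updates=100):
    """Steps 31-34 as a loop: evaluate the model (step 31), check the preset
    stop condition (step 32: loss below threshold or update budget spent),
    otherwise update and repeat (step 33), then return the model (step 34)."""
    for _ in range(max_updates):
        loss = loss_fn(params)          # step 31: predict and score
        if loss < loss_threshold:       # step 32: preset stop condition
            break
        params = update_fn(params)      # step 33: update, loop back
    return params                       # step 34: final detection model

# Toy stand-in: "loss" is the squared distance of the parameter to 5.0,
# and each update halves the remaining distance.
final = train_word_detector(lambda p: (p - 5.0) ** 2,
                            lambda p: p + 0.5 * (5.0 - p),
                            params=0.0)
```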
  • the embodiment of the present application does not limit the determination process of "the actual cutting position corresponding to the text image to be recognized" (That is, the embodiment of S3), for example, can first determine the word position information of the text image to be recognized according to the word detection result of at least one image slice and the position information of the at least one image slice; then according to the text image to be recognized
  • the single character position information of the text image to be recognized is determined to determine the actual cutting position corresponding to the text image to be recognized, so that the "actual cutting position corresponding to the text image to be recognized" will not appear inside the character as much as possible.
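One way to keep cut positions out of character interiors is to snap each candidate cut to the nearest edge of any character it would split; this is an illustrative sketch of the idea, not necessarily the application's exact procedure, with hypothetical names and coordinates:

```python
def snap_cuts(candidate_cuts, char_boxes):
    """Move each candidate cut that falls inside a detected character
    (given as (left, right) width intervals) to the nearer character edge,
    so that no actual cutting position lands mid-character."""
    snapped = []
    for cut in candidate_cuts:
        for left, right in char_boxes:
            if left < cut < right:  # this cut would split the character
                cut = left if cut - left <= right - cut else right
                break
        snapped.append(cut)
    return snapped

# Characters detected at [500, 530] and [1010, 1060]; cuts at 512 and 1024
# would split them, so they snap to the nearer edges 500 and 1010.
actual = snap_cuts([512, 1024], [(500, 530), (1010, 1060)])
```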
  • end users may set character recognition efficiency requirements; or, different application scenarios may correspond to different character recognition efficiency requirements.
  • The embodiment of the present application also provides a possible implementation of S3, which may specifically include: determining the actual cutting position corresponding to the text image to be recognized according to the single-word detection result of the at least one image slice, the position information of the at least one image slice, and the preset cutting position corresponding to the text image to be recognized.
  • The "preset cutting position corresponding to the text image to be recognized" refers to a cutting position preset for the text image to be recognized, and it is determined according to the above-mentioned character recognition efficiency requirements.
  • this embodiment of the present application does not limit the preset cut position corresponding to the text image to be recognized, for example, it may include at least one hard cut position.
  • the "hard cutting position" is used to indicate a preset cutting position corresponding to the text image to be recognized.
  • the text image to be recognized shown in FIG. 7 is taken as an example for description below.
  • the preset cut position corresponding to the text image to be recognized may be ⁇ 512, 1024, 1536, 2048 ⁇ .
  • "512", “1024", “1536", and “2048” are hard-cut positions corresponding to the text image to be recognized.
  • the embodiment of the present application does not limit the process of determining the preset cutting position corresponding to the text image to be recognized.
  • it may specifically include steps 41-42:
  • Step 41 Obtain preset segmentation parameters.
  • The "preset segmentation parameter" is used to indicate the maximum width of a cut-out picture (that is, the distance between two adjacent hard-cut positions in the above-mentioned "preset cutting position"); and the preset segmentation parameter can be preset according to the application scenario (in particular, it can be set according to the character recognition efficiency requirement in the application scenario).
  • For example, the preset segmentation parameter may be 512 pixels.
  • Step 42 According to the preset segmentation parameters and the text image to be recognized, determine the preset cutting position corresponding to the text image to be recognized.
  • After the preset segmentation parameter is obtained, it can be referred to in order to determine the preset cutting position corresponding to the text image to be recognized (such as {512, 1024, 1536, 2048} in Fig. 7), so that the interval between adjacent positions in the preset cutting position does not exceed the preset segmentation parameter.
  • Based on the relevant content of the above steps 41-42, the preset cutting position corresponding to the text image to be recognized can be determined according to the application scenario (in particular, according to the character recognition efficiency requirement in the application scenario), so that the actual cutting position determined based on the preset cutting position can perform image segmentation on the premise of meeting the character recognition efficiency requirement in this application scenario, and the character recognition method provided by this application can therefore meet that requirement.
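The steps 41-42 above can be sketched as follows. This is a minimal illustration rather than the patented implementation; the function name and the example image width of 2200 pixels are assumptions, while the 512-pixel segmentation parameter is taken from the example above.

```python
def preset_cut_positions(image_width: int, seg_param: int) -> list[int]:
    """Step 42: hard-cut positions spaced at most `seg_param` pixels apart,
    covering the full width of the text image to be recognized."""
    return list(range(seg_param, image_width, seg_param))

# A text image 2200 pixels wide with the 512-pixel parameter from the example:
print(preset_cut_positions(2200, 512))  # [512, 1024, 1536, 2048]
```

With this choice, the gap between adjacent preset positions never exceeds the preset segmentation parameter, matching the constraint stated for step 42.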
  • this embodiment of the present application does not limit the above-mentioned implementation of determining the actual cutting position corresponding to the text image to be recognized by referring to the above-mentioned "preset cutting position". For example, it may specifically include steps 51-52:
  • Step 51 Concatenate the word detection result of at least one image slice according to the position information of the at least one image slice, to obtain the word detection result of the text image to be recognized.
  • the "single word detection result of the text image to be recognized" is used to describe the position of at least one character in the text image to be recognized.
  • this embodiment of the present application does not limit the "single character detection result of the text image to be recognized", for example, the single character detection result may include at least one boundary position.
  • A boundary position is used to indicate the edge position of a character.
  • the word detection result of the text image to be recognized may be ⁇ 43, 82, 293, 309, . . . ⁇ .
  • Among them, "43" represents the left boundary of "this", "82" represents the right boundary of "this", "293" represents the left boundary of "yes", and "309" represents the right boundary of "yes".
  • Based on the relevant content of step 51, after obtaining the single-word detection result of the at least one image slice, the single-word detection results can be spliced according to the position information of the at least one image slice to obtain the single-word detection result of the text image to be recognized, so that this result describes the position of at least one character in the text image to be recognized.
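The splicing in step 51 can be sketched as below. The data layout (a list of `(slice_start, boundaries)` pairs, with each slice's boundary positions expressed relative to that slice) is an assumption for illustration; the idea is simply to shift each slice's local positions by the slice's starting coordinate before merging.

```python
def splice_word_detections(slices: list[tuple[int, list[int]]]) -> list[int]:
    """Shift each slice's local boundary positions by the slice's start
    offset and merge them into one sorted list for the whole image."""
    merged = []
    for slice_start, local_boundaries in slices:
        merged.extend(slice_start + b for b in local_boundaries)
    return sorted(merged)

# Two slices: the first starts at x=0, the second at x=256.
print(splice_word_detections([(0, [43, 82]), (256, [37, 53])]))
# [43, 82, 293, 309]
```

The output matches the image-level boundary positions {43, 82, 293, 309, ...} used in the example above.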
  • Step 52 Determine the actual cutting position corresponding to the text image to be recognized according to the word detection result of the text image to be recognized and the preset cutting position corresponding to the text image to be recognized.
  • After obtaining the single-word detection result of the text image to be recognized and the preset cutting position corresponding to the text image to be recognized, the actual cutting position corresponding to the text image to be recognized can be determined by referring to both. The determination process may specifically include: as shown in Figure 7, a preset algorithm can be used to match the preset cutting position corresponding to the text image to be recognized against the single-word detection result of the text image to be recognized, to obtain the actual cutting position corresponding to the text image to be recognized.
  • the preset algorithm may be preset, for example, the preset algorithm may be a greedy algorithm or a Hungarian algorithm.
  • In order to facilitate the understanding of step 52, the following will be described in conjunction with an example.
  • Step 52 may specifically include Step 61-Step 63:
  • Step 61 Determine a first position set and a second position set according to the word detection result of the text image to be recognized and the preset cutting position corresponding to the text image to be recognized.
  • the number of positions in the first position set is not less than the number of positions in the second position set. That is, the first position set refers to a set with more image cutting positions, and the second position set refers to a set with fewer image cutting positions.
  • step 61 may specifically include step 611-step 612:
  • Step 611: If the number of boundary positions is not less than the number of hard-cut positions, determine the set of "at least one boundary position" as the first position set, and determine the set of "at least one hard-cut position" as the second position set.
  • Step 612: If the number of boundary positions is lower than the number of hard-cut positions, determine the set of "at least one hard-cut position" as the first position set, and determine the set of "at least one boundary position" as the second position set.
  • Based on the relevant content of step 611 and step 612, the first position set and the second position set can be determined according to the size relationship between the number of cutting positions represented by the single-word detection result (that is, the boundary positions) and the number of cutting positions represented by the preset cutting position (that is, the hard-cut positions), so that the first position set represents whichever of the two sets has the larger number of positions, and the second position set represents whichever has the smaller number of positions.
  • For example, if the single-word detection result of the text image to be recognized is the position set {43, 82, 293, 309, ...} shown in Figure 7, and the preset cutting position corresponding to the text image to be recognized is the position set {512, 1024, 1536, 2048} shown in Figure 7, then the first position set may be {43, 82, 293, 309, ...}, and the second position set may be {512, 1024, 1536, 2048}.
  • Step 62 Match each position in the second position set with at least one position in the first position set, and obtain matching results corresponding to each position in the second position set.
  • In the embodiment of the present application, for the nth position in the second position set, the position that successfully matches it can be searched for from the first position set (for example, the position in the first position set closest to the nth position in the second position set), to obtain the matching result corresponding to the nth position in the second position set, so that this matching result represents the position in the first position set that successfully matches the nth position.
  • the matching result corresponding to 512 may be that "512" and "335" match successfully, ... (and so on).
  • Step 63 According to the matching results corresponding to each position in the second position set, determine the actual cutting position corresponding to the text image to be recognized.
  • the actual image cutting position corresponding to the text image to be recognized can be determined by referring to the matching result corresponding to each position in the second position set ( For example, the matching result corresponding to each position in the second position set is directly determined as the actual cutting position corresponding to the text image to be recognized).
  • Based on the relevant content of the above steps 61-63, after obtaining the single-word detection result of the text image to be recognized and the preset cutting position corresponding to the text image to be recognized, the number of cutting positions indicated by the single-word detection result and the number of cutting positions indicated by the preset cutting position can first be determined; then each cutting position in the set with fewer cutting positions is matched against the cutting positions in the set with more cutting positions, to obtain the matching result corresponding to each cutting position in the smaller set; finally, the actual cutting position corresponding to the text image to be recognized is determined according to these matching results.
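Steps 61-63 can be sketched with a simple greedy nearest-neighbour match (one of the preset algorithms mentioned above; the Hungarian algorithm would be an alternative). The boundary values beyond {43, 82, 293, 309} are hypothetical, chosen so that 512 matches 335 as in the example.

```python
def actual_cut_positions(boundaries: list[int], hard_cuts: list[int]) -> list[int]:
    """Steps 61-63: put the larger set first (steps 611/612), match each
    position of the smaller set to the nearest position of the larger set
    (step 62), and take the matched positions as the actual cut positions
    (step 63)."""
    if len(boundaries) >= len(hard_cuts):
        first, second = boundaries, hard_cuts   # step 611
    else:
        first, second = hard_cuts, boundaries   # step 612
    return [min(first, key=lambda p: abs(p - pos)) for pos in second]

boundaries = [43, 82, 293, 309, 335, 698, 1003, 1490, 1512, 2040, 2077]
print(actual_cut_positions(boundaries, [512, 1024, 1536, 2048]))
# [335, 1003, 1512, 2040]
```

Each hard-cut position is thus snapped to a nearby character boundary, so the resulting cut positions never fall inside a character.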
  • In another possible implementation, step 52 may specifically include steps 71-74:
  • Step 71 Determine a first position set and a second position set according to the word detection result of the text image to be recognized and the preset cutting position corresponding to the text image to be recognized.
  • For the relevant content of step 71, please refer to step 61 above.
  • Step 72 If it is determined that the second set of positions includes at least one boundary position, then determine the second set of positions as the actual cut-out position corresponding to the text image to be recognized.
  • If the second position set includes at least one boundary position, it can be determined that the second position set was determined according to the single-word detection result of the text image to be recognized, so that each position in the second position set does not fall inside a character. Therefore, the second position set can be directly determined as the actual cutting position corresponding to the text image to be recognized, so that the actual cutting position does not fall inside a character and cutting the image based on it does not damage any character.
  • Step 73 If it is determined that the second set of positions includes at least one hard-cut position, then match each position in the second set of positions with at least one position in the first set of positions, and obtain matching results corresponding to each position in the second set of positions .
  • Step 73 can be implemented by using any implementation manner of step 62 above.
  • If the second position set includes at least one hard-cut position, it can be determined that the second position set was determined according to the preset cutting position corresponding to the text image to be recognized, so that a position in the second position set may fall inside a character. Therefore, the positions that successfully match each position in the second position set can be searched for from the first position set, and these found positions can be used to determine the actual cutting position corresponding to the text image to be recognized. In this way, the actual cutting position does not fall inside a character, characters are not damaged when cutting the image based on the actual cutting position, and the occurrence of incomplete characters can be effectively avoided, which is beneficial to improving the recognition accuracy of long text recognition.
  • Step 74 According to the matching results corresponding to each position in the second position set, determine the actual image cutting position corresponding to the text image to be recognized.
  • For the relevant content of step 74, please refer to step 63 above.
  • Based on the relevant content of the above steps 71-74, after obtaining the single-word detection result of the text image to be recognized and the preset cutting position corresponding to the text image to be recognized, the actual cutting position corresponding to the text image to be recognized should be selected from the single-word detection result as much as possible, so that the actual cutting position can meet the character recognition efficiency requirement in the application scenario as far as possible without cutting through characters.
  • Based on the relevant content of the above step 52, after obtaining the single-word detection result of the text image to be recognized and the preset cutting position corresponding to the text image to be recognized, the two can be combined to determine the actual cutting position corresponding to the text image to be recognized, so that the actual cutting position can meet the character recognition efficiency requirement in the application scenario as far as possible without cutting through characters.
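The branch logic of steps 71-74 can be sketched as below. This is a hedged illustration: `choose_cut_positions` and its example values are assumptions, and the smaller set is identified here simply by comparing set sizes, as in steps 611/612.

```python
def choose_cut_positions(boundaries: list[int], hard_cuts: list[int]) -> list[int]:
    """Steps 71-74: prefer positions taken from the word detection result.
    If the smaller (second) set already consists of boundary positions,
    use it directly (step 72); otherwise snap each hard-cut position to
    the nearest boundary position (steps 73-74)."""
    if len(hard_cuts) > len(boundaries):
        # Second set = boundary positions: safe, never inside a character.
        return boundaries                      # step 72
    # Second set = hard-cut positions: match each one to a boundary.
    return [min(boundaries, key=lambda b: abs(b - cut)) for cut in hard_cuts]

# Hypothetical values: few boundaries -> use them directly.
print(choose_cut_positions([500], [256, 512, 768]))        # [500]
# Hypothetical values: few hard cuts -> snap 512 to the nearest boundary.
print(choose_cut_positions([43, 482, 530, 960], [512]))    # [530]
```

Either way, every emitted cut position is a character boundary, which is why cutting at these positions cannot produce incomplete characters.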
  • In addition, because the preset cutting position corresponding to the text image to be recognized is determined according to the preset segmentation parameter corresponding to the application scenario, the actual cutting position determined based on the preset cutting position also meets the character recognition efficiency requirement in this application scenario. Therefore, the character recognition process based on the preset cutting position can meet the character recognition efficiency requirement in this application scenario and, on the premise of ensuring the recognition accuracy of long text recognition, the character recognition efficiency requirements in different application scenarios can be met as far as possible.
  • Based on the relevant content of S3, after obtaining the single-word detection result and the position information of the at least one image slice, the actual cutting position corresponding to the text image to be recognized can be determined by referring to them.
  • S4 Perform a second segmentation process on the text image to be recognized according to the actual picture cutting position corresponding to the text image to be recognized, to obtain at least one picture to be used.
  • the "second segmentation processing” refers to the process of performing segmentation processing on the text image to be recognized according to the actual image cutting position corresponding to the text image to be recognized.
  • In the embodiment of the present application, after obtaining the actual cutting position corresponding to the text image to be recognized, the text image to be recognized can be segmented according to the actual cutting position to obtain each cut-out picture corresponding to the text image to be recognized, and each cut-out picture is respectively determined as a picture to be used.
  • S5 Determine the text recognition result of the text image to be recognized according to the text recognition result of at least one image to be used.
  • The character recognition result of a picture to be used describes the character information carried by that picture; the embodiment of the present application does not limit the determination process of this result, and any existing or future character recognition method may be used (for example, an OCR model).
  • character recognition processing can be performed on all the pictures to be used in parallel to obtain a character recognition result of each picture to be used.
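The parallel recognition mentioned above can be sketched with a thread pool. Here `recognize` is a stand-in for whatever OCR model is actually used (its body is an assumption purely for demonstration); the point is that `executor.map` runs the pictures concurrently while preserving their order.

```python
from concurrent.futures import ThreadPoolExecutor

def recognize(picture: str) -> str:
    # Stand-in for applying a real OCR model to one picture to be used.
    return picture.upper()

def recognize_all(pictures: list[str]) -> list[str]:
    """Run character recognition on all pictures to be used in parallel;
    `executor.map` preserves the input order of the pictures."""
    with ThreadPoolExecutor() as executor:
        return list(executor.map(recognize, pictures))

print(recognize_all(["ab", "cd"]))  # ['AB', 'CD']
```

Order preservation matters because the per-picture results are later spliced according to the pictures' arrangement order in S5.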
  • the character recognition result of the text image to be recognized is used to describe the character information carried by the text image to be recognized.
  • In a possible implementation, S5 may specifically include: splicing the character recognition results of the at least one picture to be used according to the arrangement order corresponding to the at least one picture to be used, to obtain the character recognition result of the text image to be recognized.
  • The arrangement order corresponding to the at least one picture to be used represents the positional adjacency of the at least one picture to be used in the text image to be recognized; for example, the picture to be used with sequence number 1 is adjacent to the picture to be used with sequence number 2, the picture with sequence number 2 is adjacent to the picture with sequence number 3, ... (and so on), and the picture with sequence number T-1 is adjacent to the picture with sequence number T. Here, T is a positive integer and represents the number of pictures to be used.
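The splicing in S5 reduces to an ordered concatenation. The example strings below are hypothetical stand-ins for per-picture recognition results.

```python
def splice_results(ordered_results: list[str]) -> str:
    """S5: concatenate the per-picture recognition results in the order
    the pictures appear in the text image (sequence numbers 1..T)."""
    return "".join(ordered_results)

print(splice_results(["this is ", "a long ", "text line"]))
# this is a long text line
```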
  • In the character recognition method provided by the embodiment of the present application, after the text image to be recognized including long text is acquired, the text image to be recognized is first subjected to a first segmentation process according to the preset slice parameters, to obtain at least one image slice and the position information of the at least one image slice; then the actual cutting position corresponding to the text image to be recognized is determined according to the single-word detection result and the position information of the at least one image slice; next, the text image to be recognized is subjected to a second segmentation process according to the actual cutting position, to obtain at least one picture to be used; finally, the character recognition result of the text image to be recognized is determined according to the character recognition result of the at least one picture to be used, so that the character recognition process for long text can be realized.
  • Among them, because the actual cutting position is determined based on the single-word detection result, the actual cutting position does not fall inside a character as far as possible, so that when cutting the image based on the actual cutting position, characters are not cut through as far as possible. In this way, incomplete characters can be avoided as far as possible in each cut-out picture corresponding to the text image to be recognized (that is, in each picture to be used), which is beneficial to improving the recognition accuracy of long text recognition.
  • the processing time for each image slice is much shorter than the processing time for the text image to be recognized, which is conducive to improving the efficiency of text recognition.
  • the embodiment of the present application also provides a character recognition device, which will be explained and described below with reference to the accompanying drawings.
  • FIG. 8 is a schematic structural diagram of a character recognition device provided by an embodiment of the present application.
  • the character recognition device 800 provided in the embodiment of the present application includes:
  • the first segmentation unit 801 is configured to, after acquiring the text image to be recognized, perform the first segmentation process on the text image to be recognized according to preset slice parameters, to obtain at least one image slice and the at least one image slice Location information; wherein, the text image to be recognized includes long text;
  • a position determination unit 802 configured to determine the actual cut-out position corresponding to the text image to be recognized according to the word detection result of the at least one image slice and the position information of the at least one image slice;
  • the second segmentation unit 803 is configured to perform a second segmentation process on the text image to be recognized according to the actual image cutting position corresponding to the text image to be recognized to obtain at least one picture to be used;
  • the result determining unit 804 is configured to determine a character recognition result of the to-be-recognized text image according to the character recognition result of the at least one image to be used.
  • In a possible implementation, the position determination unit 802 is specifically configured to: determine the actual cutting position corresponding to the text image to be recognized according to the single-word detection result of the at least one image slice, the position information of the at least one image slice, and the preset cutting position corresponding to the text image to be recognized.
  • In a possible implementation, the position determination unit 802 is specifically configured to: splice the single-word detection results of the at least one image slice according to the position information of the at least one image slice, to obtain the single-word detection result of the text image to be recognized; and determine the actual cutting position corresponding to the text image to be recognized according to the single-word detection result of the text image to be recognized and the preset cutting position corresponding to the text image to be recognized.
  • the preset slice parameters include a segmentation interval and a segmentation offset length; wherein, the segmentation offset length is smaller than the segmentation interval;
  • the first dividing unit 801 includes:
  • a region cutting subunit configured to cut an image region having the segmentation offset length from the text image to be recognized to obtain the image to be segmented
  • the image slice subunit is configured to perform segmentation processing on the image to be segmented according to the segmentation interval to obtain at least one image slice.
  • In a possible implementation, the preset slice parameters further include a cutting start position;
  • the region cutting subunit is specifically configured to: determine the position of the cutting region according to the cutting start position and the segmentation offset length; perform region cutting processing on the text image to be recognized according to the cutting region position, Obtain the image to be segmented.
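The first segmentation performed by unit 801 can be sketched as below. This is an assumption-laden illustration: it takes the cut-off region from the left edge (cut start 0), represents slices as `(x_begin, x_end)` ranges in original-image coordinates, and says nothing about how the removed region is used afterwards, since that is not covered in this excerpt.

```python
def first_segmentation(width: int, interval: int, offset: int, cut_start: int = 0):
    """Region cut-off then slicing: remove the region
    [cut_start, cut_start + offset) from the text image, then split the
    remaining image to be segmented into slices at most `interval`
    pixels wide."""
    assert offset < interval  # the offset must be smaller than the interval
    removed = (cut_start, cut_start + offset)
    slices = []
    x = removed[1]
    while x < width:
        end = min(x + interval, width)
        slices.append((x, end))
        x = end
    return removed, slices

print(first_segmentation(1200, 512, 128))
# ((0, 128), [(128, 640), (640, 1152), (1152, 1200)])
```

The returned ranges double as the "position information of the at least one image slice" needed later to splice the per-slice detection results.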
  • the process of determining the word detection result of the at least one image slice includes: using a pre-built word detection model to perform parallel word detection processing on the at least one image slice to obtain the at least one The single character detection result of the image slice; wherein, the single character detection model is constructed according to the sample text image and the actual position of each character in the sample text image.
  • Based on the above-mentioned related content of the character recognition device 800, after the text image to be recognized including long text is acquired, the text image to be recognized is first subjected to a first segmentation process according to the preset slice parameters, to obtain at least one image slice and the position information of the at least one image slice; then the actual cutting position corresponding to the text image to be recognized is determined according to the single-word detection result and the position information of the at least one image slice; next, the text image to be recognized is subjected to a second segmentation process according to the actual cutting position corresponding to the text image to be recognized, to obtain at least one picture to be used; finally, the character recognition result of the text image to be recognized is determined according to the character recognition result of the at least one picture to be used, so that the character recognition process for long text can be realized.
  • Among them, because the actual cutting position is determined based on the single-word detection result, the actual cutting position does not fall inside a character as far as possible, so that when cutting the image based on the actual cutting position, characters are not cut through as far as possible. In this way, incomplete characters can be avoided as far as possible in each cut-out picture corresponding to the text image to be recognized (that is, in each picture to be used), which is beneficial to improving the recognition accuracy of long text recognition.
  • the processing time for each image slice is much shorter than the processing time for the text image to be recognized, which is conducive to improving the efficiency of text recognition.
  • the embodiment of the present application also provides a device, the device includes a processor and a memory:
  • the memory is used to store computer programs
  • the processor is configured to execute any implementation of the character recognition method provided in the embodiments of the present application according to the computer program.
  • the embodiment of the present application also provides a computer-readable storage medium, the computer-readable storage medium is used to store a computer program, and the computer program is used to execute any of the character recognition methods provided in the embodiment of the present application. implementation.
  • the embodiment of the present application also provides a computer program product, which, when running on a terminal device, enables the terminal device to execute any implementation manner of the character recognition method provided in the embodiment of the present application.
  • "At least one (item)" means one or more, and "multiple" means two or more.
  • "And/or" is used to describe the association relationship of associated objects, indicating that three relationships can exist; for example, "A and/or B" can mean: only A exists, only B exists, or both A and B exist, where A and B can be singular or plural.
  • The character "/" generally indicates that the contextual objects are in an "or" relationship.
  • "At least one of the following" or similar expressions refer to any combination of these items, including any combination of single items or plural items.
  • For example, at least one item of a, b, or c can mean: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c can each be single or multiple.

Abstract

The present application discloses a character recognition method and a related device thereof. The method comprises: after acquiring a text image to be recognized comprising a long text, performing first segmentation processing on the text image to be recognized according to a preset slice parameter, obtaining at least one image slice and position information of the at least one image slice; according to a single character detection result and the position information of the at least one image slice, determining an actual slicing position corresponding to the text image to be recognized; then, according to the actual slicing position corresponding to the text image to be recognized, performing second segmentation processing on the text image to be recognized, obtaining at least one picture to be used; finally, according to the character recognition result of the at least one picture to be used, determining a character recognition result of the text image to be recognized, so that a character recognition process for the long text can be realized.

Description

A character recognition method and related equipment
This disclosure claims priority to the Chinese patent application with application number 202110988932.1, entitled "A character recognition method and related equipment", filed with the China Patent Office on August 26, 2021, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the technical field of data processing, and in particular to a character recognition method and related equipment.
Background
With the development of character recognition technology, its application range is getting wider and wider. Character recognition technology is used to recognize the characters appearing in an image.
However, because some character recognition technologies (such as Optical Character Recognition (OCR)) have defects, the recognition accuracy of these technologies is low in some application scenarios (such as long text recognition). Here, "long text recognition" refers to the process of performing character recognition on an image that includes long text.
Summary of the Invention
In order to solve the above technical problems, the present application provides a character recognition method and related equipment, which can improve the recognition accuracy of long text recognition.
In order to achieve the above objectives, the technical solutions provided in the embodiments of the present application are as follows:
An embodiment of the present application provides a character recognition method, the method comprising:
after acquiring a text image to be recognized, performing a first segmentation process on the text image to be recognized according to preset slice parameters, to obtain at least one image slice and position information of the at least one image slice; wherein the text image to be recognized includes long text;
determining the actual cutting position corresponding to the text image to be recognized according to the single-word detection result of the at least one image slice and the position information of the at least one image slice;
performing a second segmentation process on the text image to be recognized according to the actual cutting position corresponding to the text image to be recognized, to obtain at least one picture to be used;
determining the character recognition result of the text image to be recognized according to the character recognition result of the at least one picture to be used.
In a possible implementation, the determining the actual cutting position corresponding to the text image to be recognized according to the single-word detection result of the at least one image slice and the position information of the at least one image slice includes:
determining the actual cutting position corresponding to the text image to be recognized according to the single-word detection result of the at least one image slice, the position information of the at least one image slice, and the preset cutting position corresponding to the text image to be recognized.
In a possible implementation, the process of determining the actual cutting position corresponding to the text image to be recognized includes:
splicing the single-word detection results of the at least one image slice according to the position information of the at least one image slice, to obtain the single-word detection result of the text image to be recognized;
determining the actual cutting position corresponding to the text image to be recognized according to the single-word detection result of the text image to be recognized and the preset cutting position corresponding to the text image to be recognized.
In a possible implementation, the preset slice parameters include a segmentation interval and a segmentation offset length, where the segmentation offset length is smaller than the segmentation interval;
the process of determining the at least one image slice includes:
cutting an image region with the segmentation offset length off the text image to be recognized to obtain an image to be segmented;
segmenting the image to be segmented according to the segmentation interval to obtain the at least one image slice.
In a possible implementation, the preset slice parameters further include a cutting start position;
the process of determining the image to be segmented includes:
determining a cut-off region position according to the cutting start position and the segmentation offset length;
performing region cutting processing on the text image to be recognized according to the cut-off region position to obtain the image to be segmented.
In a possible implementation, the process of determining the single-character detection result of the at least one image slice includes:
performing parallel single-character detection processing on the at least one image slice by using a pre-built single-character detection model to obtain the single-character detection result of the at least one image slice, where the single-character detection model is built according to a sample text image and the actual position of each character in the sample text image.
An embodiment of the present application further provides a character recognition apparatus, including:
a first segmentation unit, configured to, after a text image to be recognized is acquired, perform first segmentation processing on the text image to be recognized according to preset slice parameters to obtain at least one image slice and position information of the at least one image slice, where the text image to be recognized includes long text;
a position determination unit, configured to determine an actual cutting position corresponding to the text image to be recognized according to a single-character detection result of the at least one image slice and the position information of the at least one image slice;
a second segmentation unit, configured to perform second segmentation processing on the text image to be recognized according to the actual cutting position corresponding to the text image to be recognized, to obtain at least one picture to be used;
a result determination unit, configured to determine a character recognition result of the text image to be recognized according to character recognition results of the at least one picture to be used.
An embodiment of the present application further provides a device, the device including a processor and a memory:
the memory is configured to store a computer program;
the processor is configured to execute, according to the computer program, any implementation of the character recognition method provided by the embodiments of the present application.
An embodiment of the present application further provides a computer-readable storage medium, the computer-readable storage medium being configured to store a computer program, the computer program being configured to execute any implementation of the character recognition method provided by the embodiments of the present application.
An embodiment of the present application further provides a computer program product which, when run on a terminal device, causes the terminal device to execute any implementation of the character recognition method provided by the embodiments of the present application.
Description of the Drawings
In order to explain the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments recorded in the present application; a person of ordinary skill in the art can also obtain other drawings from these drawings without creative effort.
FIG. 1 is a flowchart of a character recognition method provided by an embodiment of the present application;
FIG. 2 is a schematic diagram of a text image to be recognized provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of another text image to be recognized provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of a process of obtaining image slices provided by an embodiment of the present application;
FIG. 5 is a schematic comparison diagram of two character recognition processes provided by an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a single-character detection model provided by an embodiment of the present application;
FIG. 7 is a schematic diagram of a character recognition process provided by an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a character recognition apparatus provided by an embodiment of the present application.
Detailed Description
In research on character recognition, the inventor found that because some character recognition models (for example, optical character recognition (OCR) models) usually only support input data of a fixed width, after an image including long text is acquired, the image first needs to be reduced substantially in size, and the character recognition model then performs character recognition on the reduced image to obtain the character recognition result of the long text. However, because this substantial reduction usually greatly lowers the image definition, the content of the reduced image tends to become blurred, so the character recognition result determined based on the reduced image is inaccurate, which results in low recognition accuracy for long text.
Based on the above findings, in order to solve the technical problems described in the background section, an embodiment of the present application provides a character recognition method. The method includes: after a text image to be recognized including long text is acquired, first performing first segmentation processing on the text image to be recognized according to preset slice parameters to obtain at least one image slice and position information of the at least one image slice; then determining an actual cutting position corresponding to the text image to be recognized according to the single-character detection result and the position information of the at least one image slice; then performing second segmentation processing on the text image to be recognized according to the actual cutting position corresponding to the text image to be recognized, to obtain at least one picture to be used; and finally determining a character recognition result of the text image to be recognized according to the character recognition results of the at least one picture to be used. In this way, character recognition for long text can be realized.
It can be seen that, because the above “single-character detection result and position information of the at least one image slice” can accurately represent the position of each character in the text image to be recognized, the actual cutting positions determined based on the single-character detection result fall inside a character as rarely as possible, so that characters are cut apart as rarely as possible when the image is cut at the actual cutting positions. Incomplete characters in the cut images corresponding to the text image to be recognized (that is, in the pictures to be used) can thus be avoided as far as possible, which helps improve the recognition accuracy of long text recognition. In addition, because the length of each image slice is far smaller than the length of the text image to be recognized, processing each image slice takes far less time than processing the whole text image to be recognized, which helps improve character recognition efficiency.
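The core of the idea described above — moving each preset cutting position into a gap between detected characters so that no cut falls inside a character — can be sketched as follows. This is a minimal illustration; the function name, the box representation, and the gap-midpoint heuristic are assumptions for the sketch, not taken from the application.

```python
def choose_cut_positions(char_boxes, preset_cuts):
    """char_boxes: (left, right) x-extents of detected characters, sorted by
    left edge. preset_cuts: candidate x positions from the preset layout.
    Returns one adjusted cut position per preset position."""
    # Midpoints of the gaps between consecutive characters are safe cut points.
    gaps = [(a_right + b_left) / 2
            for (_, a_right), (b_left, _) in zip(char_boxes, char_boxes[1:])]
    # Snap each preset cut to the closest inter-character gap.
    return [min(gaps, key=lambda g: abs(g - cut)) for cut in preset_cuts]

boxes = [(0, 30), (35, 70), (76, 110), (118, 150)]   # four character boxes
print(choose_cut_positions(boxes, [72, 115]))        # → [73.0, 114.0]
```

Both adjusted cuts land between characters, so neither picture to be used contains a split character.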
In addition, the embodiments of the present application do not limit the execution subject of the character recognition method. For example, the character recognition method provided by the embodiments of the present application may be applied to a data processing device such as a terminal device or a server. The terminal device may be a smartphone, a computer, a personal digital assistant (PDA), a tablet computer, or the like. The server may be an independent server, a cluster server, or a cloud server.
In order to enable a person skilled in the art to better understand the solutions of the present application, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without creative effort fall within the protection scope of the present application.
Method Embodiment
Referring to FIG. 1, FIG. 1 is a flowchart of a character recognition method provided by an embodiment of the present application.
The character recognition method provided by the embodiment of the present application includes S1-S5:
S1: Acquire a text image to be recognized.
The text image to be recognized refers to an image that needs character recognition processing (in particular, long text recognition processing), and the text image to be recognized includes long text (in particular, extra-long text). Here, “long text” refers to text whose number of characters exceeds a first threshold, and the first threshold may be preset. “Extra-long text” refers to text whose number of characters exceeds a second threshold; the second threshold may be preset and is greater than the above “first threshold”.
In addition, the embodiment of the present application does not limit the text image to be recognized. For example, the text image to be recognized may be the image to be processed shown in FIG. 2, or may be the text image corresponding to the image to be processed shown in FIG. 3. Here, “the text image corresponding to the image to be processed” refers to an image cut from the image to be processed according to the text detection result of the image to be processed. For details of the “image to be processed” and the “text detection result of the image to be processed”, see S11 below.
Moreover, the embodiment of the present application does not limit the implementation of S1. For ease of understanding, two examples are described below.
Example 1: S1 may specifically include: after the image to be processed is acquired, directly determining the image to be processed as the text image to be recognized.
Example 2: In order to prevent, as far as possible, image information other than text in the image to be processed from adversely affecting long text recognition, S1 may specifically include S11-S12:
S11: After the image to be processed is acquired, perform text detection on the image to be processed to obtain a text detection result of the image to be processed.
Here, the image to be processed refers to an image that needs image processing (for example, text detection and/or character recognition). The embodiment of the present application does not limit the image to be processed; for example, the image to be processed may be a frame of a video.
The text detection result of the image to be processed describes the position, within the image to be processed, of the text in that image (for example, “this is an image including long text”).
In addition, the embodiment of the present application does not limit the implementation of the “text detection” in S11; any existing or future method capable of performing text detection on an image may be used.
S12: Cut the text image to be recognized out of the image to be processed according to the text detection result of the image to be processed.
In the embodiment of the present application, after the text detection result of the image to be processed (shown in FIG. 2) is obtained, the image region corresponding to the text detection result is cut out of the image to be processed to obtain the text image to be recognized (shown in FIG. 3), so that the text image to be recognized can more accurately represent the character information carried by the image to be processed.
Based on the above description of S1, after an image to be processed (for example, a frame of a video) is acquired, the text image to be recognized can be determined according to the image to be processed, so that the text image to be recognized represents the character information carried by the image to be processed, and that character information can subsequently be determined accurately based on the text image to be recognized.
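A minimal sketch of the S11-S12 flow, under the assumption that the text detector returns an axis-aligned bounding box (x, y, w, h); real detectors may return rotated boxes or polygons, and the function name is illustrative only.

```python
def crop_text_image(image, box):
    """Cut the region described by the detection result out of the image.
    image: 2-D list of pixel rows; box: (x, y, w, h) in pixel coordinates."""
    x, y, w, h = box
    return [row[x:x + w] for row in image[y:y + h]]

# Toy 8x4 "image" whose pixel values equal their column index.
img = [[c for c in range(8)] for _ in range(4)]
print(crop_text_image(img, (2, 1, 3, 2)))  # → [[2, 3, 4], [2, 3, 4]]
```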
S2: Perform first segmentation processing on the text image to be recognized according to preset slice parameters to obtain at least one image slice and position information of the at least one image slice.
Here, the “preset slice parameters” refer to the parameters referred to when the first segmentation processing is performed on the text image to be recognized. The embodiment of the present application does not limit the “preset slice parameters”; for example, they may include a segmentation interval. The “segmentation interval” represents the distance between two adjacent segmentation positions when the first segmentation processing is performed on the text image to be recognized; the embodiment of the present application does not limit the “segmentation interval” (for example, 512 pixels as shown in FIG. 4).
The “first segmentation processing” refers to the segmentation processing performed according to the above “preset slice parameters”.
The “at least one image slice” refers to the at least one image segment obtained after the first segmentation processing is performed on the text image to be recognized, and the “position information of the at least one image slice” describes the position of each image slice within the text image to be recognized.
In addition, the embodiment of the present application does not limit the process of determining the “at least one image slice”. For ease of understanding, two possible implementations are described below.
In one possible implementation, when the above “preset slice parameters” include a segmentation interval, the process of determining the “at least one image slice” may specifically include: performing the first segmentation processing on the text image to be recognized according to the segmentation interval to obtain at least one image slice, so that the length of each image slice equals the “segmentation interval” (for example, 512 pixels as shown in FIG. 4).
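This fixed-interval first segmentation can be sketched as follows (the 512-pixel interval comes from the text; representing each slice's position information as a (start, end) span is an assumption of the sketch):

```python
def slice_by_interval(width, interval):
    """Cut [0, width) every `interval` pixels; each span doubles as the
    slice's position information within the original image."""
    return [(s, min(s + interval, width)) for s in range(0, width, interval)]

print(slice_by_interval(1280, 512))  # → [(0, 512), (512, 1024), (1024, 1280)]
```

Note that the last slice may be shorter than the interval when the image width is not an exact multiple of it.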
In some cases (for example, the case, similar to S3 below, in which the “actual cutting position corresponding to the text image to be recognized” is determined with reference to the “preset cutting position corresponding to the text image to be recognized”), when the above “first segmentation processing” cuts through a character (for example, cutting “结” into the two parts “纟” and “吉” as shown in FIG. 5), the damaged characters easily cause subsequent character recognition errors. For example, when the segmentation positions involved in the “first segmentation processing” and the “preset cutting positions” described below share a common position (for example, both contain the position between “纟” and “吉” shown in FIG. 5), character recognition errors are likely to occur (for example, “结” may be wrongly recognized as the two characters “三” and “吉”).
Based on the above analysis, in order to avoid as far as possible the adverse effects caused by the “first segmentation processing” cutting through characters, it suffices to ensure that the segmentation positions involved in the “first segmentation processing” share no common position with the “preset cutting positions” described below. On this basis, the embodiment of the present application further provides another possible implementation of determining the “at least one image slice”. In this implementation, when the above “preset slice parameters” include a segmentation interval and a segmentation offset length, the process of determining the “at least one image slice” may specifically include S21-S22:
S21: Cut the image region with the segmentation offset length off the text image to be recognized to obtain an image to be segmented, so that the image to be segmented does not include the above “image region with the segmentation offset length”.
Here, the “segmentation offset length” represents the segmentation offset used when the first segmentation processing is performed on the text image to be recognized, and the “segmentation offset length” may be smaller than the above “segmentation interval”. The embodiment of the present application does not limit the “segmentation offset length”; for example, as shown in FIG. 4, when the “segmentation interval” is 512 pixels, the “segmentation offset length” may be 256 pixels.
In addition, the embodiment of the present application does not limit the position of the above “image region with the segmentation offset length”. For example, it may be located at the leftmost region of the text image to be recognized (as shown in FIG. 4), at the rightmost region of the text image to be recognized, or at a preset interior region of the text image to be recognized.
Moreover, the embodiment of the present application does not limit the implementation of S21. For example, in one possible implementation, if the above “preset slice parameters” further include a cutting start position, S21 may specifically include S211-S212:
S211: Determine a cut-off region position according to the cutting start position and the segmentation offset length.
Here, the “cutting start position” represents the position, within the “text image to be recognized”, of one boundary (for example, the left boundary) of the above “image region with the segmentation offset length”. The embodiment of the present application does not limit the “cutting start position”; for example, as shown in FIG. 4, it may be the left boundary of the text image to be recognized.
The “cut-off region position” represents the position of the above “image region with the segmentation offset length” within the “text image to be recognized”; the length of the “cut-off region position” equals the above “segmentation offset length”, and its boundaries include the above “cutting start position”.
S212: Perform region cutting processing on the text image to be recognized according to the cut-off region position to obtain the image to be segmented.
In the embodiment of the present application, after the cut-off region position is obtained, the image region occupying the cut-off region position (that is, the above “image region with the segmentation offset length”) can be cut off the text image to be recognized, and the remaining region of the text image to be recognized is determined as the image to be segmented, so that the image to be segmented represents the image regions of the text image to be recognized other than the above “image region with the segmentation offset length” and thus does not include that region.
Based on the above description of S21, after the text image to be recognized is acquired, the image region with the segmentation offset length can be cut off the text image to be recognized to obtain the image to be segmented, so that the image to be segmented does not include the above “image region with the segmentation offset length” and can subsequently be segmented.
S22: Segment the image to be segmented according to the segmentation interval to obtain at least one image slice.
In the embodiment of the present application, after the image to be segmented is obtained, the image to be segmented can be segmented according to the segmentation interval to obtain at least one image slice (the multiple image slices shown in FIG. 4). Because the “image to be segmented” lacks a region compared with the “text image to be recognized”, the segmentation positions used when segmenting the “image to be segmented” are shifted by a certain amount relative to the “text image to be recognized”, so that it is almost impossible for these segmentation positions to share a common position with the “preset cutting positions” described below. This effectively avoids the adverse effects caused by the “first segmentation processing” cutting through characters.
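Assuming, as in FIG. 4, that the offset region is removed from the left edge, S21-S22 can be sketched as follows; with the example values from the text (interval 512, offset 256), the internal slice boundaries of this shifted pass never coincide with those of an unshifted pass. The function and span representation are illustrative assumptions.

```python
def slice_with_offset(width, interval, offset):
    """Remove [0, offset) from the image, then slice the remainder by
    `interval`; spans are expressed in the original image's coordinates."""
    return [(s, min(s + interval, width)) for s in range(offset, width, interval)]

print(slice_with_offset(1280, 512, 256))  # → [(256, 768), (768, 1280)]
```

Compare with the unshifted boundaries 512 and 1024: the shifted boundaries 768 (and the span edges 256, 1280) share no interior cut position with them.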
Based on the above description of S2, after the text image to be recognized is acquired, the first segmentation processing can be performed on the text image to be recognized according to the preset slice parameters to obtain at least one image slice and its position information, so that the character recognition result of the “text image to be recognized” can subsequently be obtained based on the “at least one image slice”. Because the length of each image slice is far smaller than the length of the text image to be recognized, processing each image slice takes far less time than processing the whole text image to be recognized, which helps improve character recognition efficiency.
S3: Determine the actual cutting position corresponding to the text image to be recognized according to the single-character detection result of the at least one image slice and the position information of the at least one image slice.
Here, the “single-character detection result of the at least one image slice” represents the position of each character in each image slice.
In addition, the embodiment of the present application does not limit the process of determining the “single-character detection result of the at least one image slice”. For example, single-character detection processing may be performed on each image slice separately to obtain the single-character detection result of each image slice. It should be noted that the embodiment of the present application does not limit the implementation of the “single-character detection processing”; for example, any existing or future single-character detection method may be used, or the “single-character detection model” shown below may be used.
Moreover, in order to further improve single-character detection efficiency, the embodiment of the present application also provides another possible implementation of determining the “single-character detection result of the at least one image slice”, which may specifically include: performing parallel single-character detection processing on the at least one image slice by using a pre-built single-character detection model to obtain the single-character detection result of the at least one image slice.
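The parallel detection step can be sketched as below; `detect_chars` is a placeholder standing in for the pre-built single-character detection model, which the application does not specify at the code level, and the thread-pool choice is an assumption of the sketch.

```python
from concurrent.futures import ThreadPoolExecutor

def detect_chars(image_slice):
    # Placeholder model: a real detector would return character bounding
    # boxes for the slice; here one dummy box is emitted per element.
    return [(0, 10)] * len(image_slice)

def detect_all(slices):
    # Run detection over all slices concurrently; pool.map preserves the
    # input order, so results still line up with the slices' positions.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(detect_chars, slices))

results = detect_all([["a"], ["b", "c"]])
print([len(r) for r in results])  # → [1, 2]
```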
Here, the single-character detection model is used to perform character position detection (for example, character boundary position detection) on its input data.
The embodiment of the present application does not limit the model structure of the single-character detection model. For example, in one possible implementation, as shown in FIG. 6, the single-character detection model 600 may include a feature extraction layer 601 and a single-character position determination layer 602, where the input data of the single-character position determination layer 602 includes the output data of the feature extraction layer 601.
To facilitate understanding of the working principle of the single-character detection model 600, the process of determining the single-character detection result of a target image is described below as an example. Here, the “target image” refers to any image slice in the above “at least one image slice”.
As an example, the process of determining the above “single-character detection result” using the single-character detection model 600 may specifically include steps 11-12:
Step 11: Input the target image into the feature extraction layer 601 to obtain the image position feature output by the feature extraction layer 601.
其中,特征提取层601用于针对该特征提取层601的输入数据进行特征提取;而且本申请实施例不限定特征提取层601,例如,该特征提取层601可以利用任一种卷积神经网络(Convolutional Neural Networks,CNN)进行实施(如,可以采用(Visual Geometry Group,VGG)网络进行实施)。Wherein, the feature extraction layer 601 is used to perform feature extraction for the input data of the feature extraction layer 601; and the embodiment of the present application does not limit the feature extraction layer 601, for example, the feature extraction layer 601 can use any convolutional neural network ( Convolutional Neural Networks, CNN) for implementation (eg, (Visual Geometry Group, VGG) network can be used for implementation).
图像位置特征用于表示目标图像中各个位置携带的信息(尤其是,在宽度方向上各个位置携带的信息)。另外,本申请实施例不限定图像位置特征,例如,若目标图像为[C,H,W]矩阵,则该图像位置特征可以是[1,1,W/4]的矩阵。其中,C表示图像通道数(如,C=3),H表示图像高度(如,H=32),W表示图像宽度(如,W=512)。The image position feature is used to represent the information carried by each position in the target image (especially, the information carried by each position in the width direction). In addition, the embodiment of the present application does not limit the image position feature. For example, if the target image is a [C, H, W] matrix, the image position feature may be a [1, 1, W/4] matrix. Wherein, C represents the number of image channels (eg, C=3), H represents the image height (eg, H=32), and W represents the image width (eg, W=512).
步骤12:将图像位置特征输入单字位置确定层602,得到该单字位置确定层602输出的目标图像的单字检测结果。Step 12: Input the feature of the image position into the word position determination layer 602, and obtain the word detection result of the target image output by the word position determination layer 602.
The single-character position determination layer 602 is used to perform character boundary position recognition on the input data of the single-character position determination layer 602.
In addition, the embodiment of the present application does not limit the single-character position determination layer 602. For example, in one possible implementation, if the width of the image position feature is smaller than the width of the target image (for example, the width of the image position feature is 1/4 of the width of the target image), the single-character position determination layer 602 may include a position classification layer and a position mapping layer, where the input data of the position mapping layer includes the output data of the position classification layer.
To facilitate understanding of the working principle of the single-character position determination layer 602, the process of determining the above "single-character detection result" is described below as an example.
As an example, if the single-character position determination layer 602 includes a position classification layer and a position mapping layer, the process of determining the above "single-character detection result" may include steps 21 to 22:
Step 21: Input the image position feature into the position classification layer to obtain the position classification result output by the position classification layer.
The position classification layer is used to determine whether each position in the input data of the position classification layer belongs to a character boundary.
In addition, the embodiment of the present application does not limit the implementation of the position classification layer; it may be implemented with any existing or future classifier (e.g., softmax).
The position classification result is used to indicate whether each position in the target image belongs to a character boundary (in particular, whether each position of the target image along the width direction belongs to a character boundary).
Step 22: Input the position classification result into the position mapping layer to obtain the single-character detection result of the target image output by the position mapping layer.
The position mapping layer is used to perform mapping processing on the input data of the position mapping layer.
In addition, the embodiment of the present application does not limit the working principle of the position mapping layer. For example, the position mapping layer may map each position in the position classification result according to formula (1):
y = a × x + b    (1)
In formula (1), y denotes the mapped position coordinate corresponding to x; a denotes the ratio between the width of the target image and the width of the image position feature (e.g., 4); x denotes a position coordinate in the position classification result (in particular, a position coordinate of the position classification result along the width direction); and b denotes the convolution offset used in the feature extraction layer 601.
It can be seen that, in some cases, because the width of the image position feature is smaller than the width of the target image (for example, 1/4 of it), the width of the position classification result determined from the image position feature is also smaller than the width of the target image (for example, also 1/4 of it). In this case, in order to describe more accurately whether each position of the target image along the width direction belongs to a character boundary, each position coordinate of the position classification result along the width direction can be mapped, according to formula (1), to a position coordinate of the target image along the width direction.
Based on the above steps 11 to 12, for the single-character detection model 600 shown in FIG. 6, after the target image is input into the single-character detection model 600, the model can sequentially perform feature extraction and single-character position determination on the target image, and obtain and output the single-character detection result of the target image, so that the single-character detection result accurately represents the boundary position of each character in the target image.
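The classification-then-mapping flow of steps 21 to 22 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the CNN feature extraction layer 601 is stubbed out as precomputed per-position boundary probabilities, and the function names, the 0.5 threshold, and b = 0 are choices made here for the example.

```python
def position_mapping(feature_positions, a, b=0):
    """Formula (1): map a feature-space coordinate x to an
    image-space coordinate y = a * x + b."""
    return [a * x + b for x in feature_positions]

def single_char_detection(boundary_probs, image_width, threshold=0.5):
    """boundary_probs: boundary probabilities for each of the W/4
    feature positions, as the position classification layer would
    output them (values here are illustrative)."""
    # Position classification layer: keep feature positions
    # classified as character boundaries.
    boundary_feature_positions = [
        x for x, p in enumerate(boundary_probs) if p >= threshold
    ]
    # Position mapping layer: a is the width ratio between the
    # target image and the image position feature.
    a = image_width // len(boundary_probs)
    return position_mapping(boundary_feature_positions, a)

# A 512-pixel-wide slice yields a feature of width 128 (ratio a = 4).
probs = [0.0] * 128
probs[10], probs[20] = 0.9, 0.8
print(single_char_detection(probs, image_width=512))  # [40, 80]
```

Feature positions 10 and 20 are mapped back to image coordinates 40 and 80, matching the 4:1 width ratio used in the [1, 1, W/4] example above.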
In addition, the single-character detection model may be constructed in advance according to a sample text image and the actual position of each character in the sample text image. The sample text image refers to an image used to construct the single-character detection model, and the embodiment of the present application does not limit the number of sample text images. Nor does it limit the actual position of each character in the sample text image; for example, this may be the actual boundary position of each character in the sample text image.
Furthermore, the embodiment of the present application does not limit the construction process of the single-character detection model. For example, in one possible implementation, the construction process may include steps 31 to 34:
Step 31: Input the sample text image into the model to be trained to obtain the predicted character position of the sample text image output by the model to be trained.
The model to be trained is used to perform character position detection (for example, character boundary position detection) on its input data. In addition, the model structure of the model to be trained is the same as that of the "single-character detection model" above, so for its model structure, refer to the description of the model structure of the "single-character detection model" above.
The predicted character position of the sample text image is used to describe the predicted position of at least one character in the sample text image.
Step 32: Determine whether a preset stop condition is met; if so, execute step 34; if not, execute step 33.
The preset stop condition may be set in advance. For example, the preset stop condition may be that the loss value of the model to be trained is below a preset loss threshold, that the rate of change of the loss value of the model to be trained is below a preset change-rate threshold (that is, the character position detection performance of the model to be trained has converged), or that the number of updates of the model to be trained reaches a preset count threshold.
The loss value of the model to be trained is used to characterize the character position detection performance of the model to be trained; moreover, the embodiment of the present application does not limit the method for determining this loss value.
In addition, the preset loss threshold, the preset change-rate threshold, and the preset count threshold may all be set in advance.
Step 33: Update the model to be trained according to the predicted character position of the sample text image and the actual position of each character in the sample text image, and return to step 31.
In the embodiment of the present application, after it is determined that the current round of the model to be trained has not yet met the preset stop condition, it can be concluded that the current round of the model still has poor character position detection performance. Therefore, the current round of the model to be trained can be updated according to the difference between the predicted character position of the sample text image and the actual position of each character in the sample text image, so that the updated model has better character position detection performance, and the process returns to step 31 and its subsequent steps.
Step 34: Determine the single-character detection model according to the model to be trained.
In the embodiment of the present application, after it is determined that the current round of the model to be trained has met the preset stop condition, it can be concluded that the current round of the model already has good character position detection performance. Therefore, the single-character detection model can be determined according to the current round of the model to be trained (for example, the current round of the model may be directly taken as the single-character detection model; alternatively, the model structure and model parameters of the single-character detection model may be determined according to the model structure and model parameters of the current round of the model, so that they remain consistent with each other). In this way, the single-character detection model also has good character position detection performance, so that the single-character detection results subsequently determined for the at least one image slice using this model can accurately represent the position of each character in each image slice.
The above "actual cut position corresponding to the text image to be recognized" is used to describe the actual cutting position for the text image to be recognized. Moreover, the embodiment of the present application does not limit the process of determining this actual cut position (that is, the implementation of S3). For example, the single-character position information of the text image to be recognized may first be determined according to the single-character detection result of the at least one image slice and the position information of the at least one image slice; the actual cut position corresponding to the text image to be recognized is then determined according to this single-character position information, so that the actual cut position falls inside a character as rarely as possible.
In some cases, an end user may set a character recognition efficiency requirement; alternatively, different application scenarios may correspond to different character recognition efficiency requirements. Based on this, in order to meet such a "character recognition efficiency requirement", the embodiment of the present application further provides a possible implementation of S3, which may specifically include: determining the actual cut position corresponding to the text image to be recognized according to the single-character detection result of the at least one image slice, the position information of the at least one image slice, and the preset cut position corresponding to the text image to be recognized.
The "preset cut position corresponding to the text image to be recognized" refers to a cutting position preset for the text image to be recognized, and this preset cut position is determined according to the above "character recognition efficiency requirement".
In addition, the embodiment of the present application does not limit the preset cut position corresponding to the text image to be recognized; for example, it may include at least one hard-cut position, where a "hard-cut position" denotes one preset cutting position corresponding to the text image to be recognized. For ease of understanding, the text image to be recognized shown in FIG. 7 is taken as an example below.
As an example, if the text image to be recognized is the one shown in FIG. 7, the preset cut position corresponding to the text image to be recognized may be {512, 1024, 1536, 2048}, where "512", "1024", "1536", and "2048" are all hard-cut positions corresponding to the text image to be recognized.
Furthermore, the embodiment of the present application does not limit the process of determining the preset cut position corresponding to the text image to be recognized. For example, it may specifically include steps 41 to 42:
Step 41: Obtain a preset segmentation parameter.
The "preset segmentation parameter" is used to indicate the maximum width of one cut image (that is, the distance between two adjacent hard-cut positions in the above "preset cut position"), and it may be set in advance according to the application scenario (in particular, according to the character recognition efficiency requirement in that scenario). For example, the preset segmentation parameter may be 512 pixels.
Step 42: Determine the preset cut position corresponding to the text image to be recognized according to the preset segmentation parameter and the text image to be recognized.
In the embodiment of the present application, after the text image to be recognized is obtained, the preset cut position corresponding to the text image to be recognized (e.g., {512, 1024, 1536, 2048} in FIG. 7) may be determined with reference to the preset segmentation parameter, so that the interval between adjacent positions in the preset cut position does not exceed the preset segmentation parameter.
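Steps 41 to 42 amount to placing hard-cut positions at most one segmentation parameter apart across the image width. A sketch follows; the 2200-pixel image width is an assumed value, chosen here only so that the example reproduces the {512, 1024, 1536, 2048} set of FIG. 7.

```python
def preset_cut_positions(image_width, seg_param=512):
    """Hard-cut positions such that the interval between adjacent
    positions (and to the image edges) never exceeds seg_param."""
    return list(range(seg_param, image_width, seg_param))

print(preset_cut_positions(2200))  # [512, 1024, 1536, 2048]
```

An image narrower than the segmentation parameter yields no hard-cut positions, i.e. it is recognized as a single slice.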
Based on the above steps 41 to 42, the preset cut position corresponding to the text image to be recognized can be determined according to the application scenario (in particular, according to the character recognition efficiency requirement in that scenario), so that image segmentation based on the actual cut position derived from this preset cut position satisfies the character recognition efficiency requirement of that scenario. The character recognition method provided by the present application can thus meet the character recognition efficiency requirement of the application scenario.
In addition, the embodiment of the present application does not limit the implementation of determining the actual cut position corresponding to the text image to be recognized with reference to the above "preset cut position". For example, it may specifically include steps 51 to 52:
Step 51: Splice the single-character detection results of the at least one image slice according to the position information of the at least one image slice, to obtain the single-character detection result of the text image to be recognized.
The "single-character detection result of the text image to be recognized" is used to describe the position of at least one character in the text image to be recognized.
In addition, the embodiment of the present application does not limit the "single-character detection result of the text image to be recognized"; for example, it may include at least one boundary position, where a "boundary position" denotes an edge position of one character. For ease of understanding, the text image to be recognized shown in FIG. 7 is taken as an example below.
As an example, if the text image to be recognized is the one shown in FIG. 7, its single-character detection result may be {43, 82, 293, 309, ...}, where "43" denotes the left boundary of "这", "82" the right boundary of "这", "293" the left boundary of "是", "309" the right boundary of "是", and so on.
Based on the above step 51, after the single-character detection results of the at least one image slice are obtained, they can be spliced according to the position information of the at least one image slice, to obtain the single-character detection result of the text image to be recognized, so that this result describes the position of at least one character in the text image to be recognized.
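Step 51 can be sketched as offsetting each slice's local boundary positions by that slice's left edge in the original image. The interface below, which reduces a slice's position information to a single left offset, is an assumption made here for illustration.

```python
def stitch_detections(slice_results):
    """slice_results: list of (slice_left_offset, boundary_positions),
    with boundary positions in slice-local coordinates. Returns the
    boundary positions in whole-image coordinates."""
    merged = []
    for offset, boundaries in slice_results:
        merged.extend(offset + b for b in boundaries)
    return sorted(merged)

# Two 512-pixel slices: local positions 21 and 60 of the second
# slice become 533 and 572 in the full text image.
print(stitch_detections([(0, [43, 82, 293, 309]), (512, [21, 60])]))
# [43, 82, 293, 309, 533, 572]
```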
Step 52: Determine the actual cut position corresponding to the text image to be recognized according to the single-character detection result of the text image to be recognized and the preset cut position corresponding to the text image to be recognized.
In the embodiment of the present application, after the single-character detection result of the text image to be recognized and the corresponding preset cut position are obtained, the actual cut position corresponding to the text image to be recognized may be determined with reference to both. Specifically, as shown in FIG. 7, a preset algorithm may be used to match the preset cut position corresponding to the text image to be recognized against the single-character detection result of the text image to be recognized, to obtain the actual cut position corresponding to the text image to be recognized. The preset algorithm may be set in advance; for example, it may be a greedy algorithm or the Hungarian algorithm.
To facilitate understanding of step 52, two examples are described below.
Example 1: step 52 may specifically include steps 61 to 63:
Step 61: Determine a first position set and a second position set according to the single-character detection result of the text image to be recognized and the preset cut position corresponding to the text image to be recognized.
The number of positions in the first position set is not less than the number of positions in the second position set; that is, the first position set is the set with more cut positions, and the second position set is the set with fewer cut positions.
In addition, the embodiment of the present application does not limit the implementation of step 61. For example, if the single-character detection result of the text image to be recognized includes at least one boundary position, and the preset cut position corresponding to the text image to be recognized includes at least one hard-cut position, step 61 may specifically include steps 611 to 612:
Step 611: If the number of boundary positions is not lower than the number of hard-cut positions, determine the set of the above "at least one boundary position" as the first position set, and the set of the above "at least one hard-cut position" as the second position set.
Step 612: If the number of boundary positions is lower than the number of hard-cut positions, determine the set of the above "at least one hard-cut position" as the first position set, and the set of the above "at least one boundary position" as the second position set.
Based on the above steps 611 to 612, the first position set and the second position set can be determined according to the relationship between the number of cut positions (that is, boundary positions) represented by the single-character detection result and the number of cut positions (that is, hard-cut positions) represented by the preset cut position, so that the first position set is whichever of the two sets has more positions and the second position set is whichever has fewer. For example, if the single-character detection result of the text image to be recognized is the position set {43, 82, 293, 309, ...} shown in FIG. 7, and the preset cut position corresponding to the text image to be recognized is the position set {512, 1024, 1536, 2048} shown in FIG. 7, then the first position set may be {43, 82, 293, 309, ...}, and the second position set may be {512, 1024, 1536, 2048}.
Step 62: Match each position in the second position set against at least one position in the first position set, to obtain a matching result corresponding to each position in the second position set.
In the embodiment of the present application, if the second position set includes N positions, a position that successfully matches the n-th position in the second position set may be found from the first position set (for example, the position in the first position set closest to the n-th position in the second position set), to obtain the matching result corresponding to the n-th position, so that this matching result represents the position in the first position set that successfully matches the n-th position. For example, as shown in FIG. 7, if the first position set is {43, 82, 293, 309, ...} and the second position set is {512, 1024, 1536, 2048}, the matching result corresponding to "512" in the second position set may be that "512" successfully matches "335", and so on.
Step 63: Determine the actual cut position corresponding to the text image to be recognized according to the matching results corresponding to the positions in the second position set.
In the embodiment of the present application, after the matching results corresponding to the positions in the second position set are obtained, the actual cut position corresponding to the text image to be recognized may be determined with reference to these matching results (for example, the matching results corresponding to the positions in the second position set may be directly taken as the actual cut position corresponding to the text image to be recognized).
Based on the above steps 61 to 63, after the single-character detection result of the text image to be recognized and the corresponding preset cut position are obtained, the number of cut positions represented by the single-character detection result and the number represented by the preset cut position are compared first; each cut position in the set with fewer positions is then matched against at least one cut position in the set with more positions, to obtain the matching result corresponding to each cut position in the smaller set; finally, the actual cut position corresponding to the text image to be recognized is determined according to these matching results.
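Steps 61 to 63 can be sketched with a greedy nearest-neighbor match (the patent also allows other matching schemes such as the Hungarian algorithm). The boundary values below are illustrative, not taken from FIG. 7.

```python
def actual_cut_positions(boundary_positions, hard_cut_positions):
    # Step 61: the larger set becomes the first set, the smaller
    # set becomes the second set.
    if len(boundary_positions) >= len(hard_cut_positions):
        first, second = boundary_positions, hard_cut_positions
    else:
        first, second = hard_cut_positions, boundary_positions
    # Steps 62-63: greedily match each position in the second set to
    # the closest position in the first set; the matched positions
    # are taken as the actual cut positions.
    return [min(first, key=lambda p: abs(p - q)) for q in second]

boundaries = [43, 82, 293, 309, 335, 520, 545]  # illustrative values
print(actual_cut_positions(boundaries, [512]))  # [520]
```

Because the matched positions are drawn from the character boundaries whenever the boundaries outnumber the hard cuts, the resulting actual cut positions avoid falling inside characters.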
Example 2: if the single-character detection result of the text image to be recognized includes at least one boundary position, and the preset cut position corresponding to the text image to be recognized includes at least one hard-cut position, step 52 may specifically include steps 71 to 74:
Step 71: Determine a first position set and a second position set according to the single-character detection result of the text image to be recognized and the preset cut position corresponding to the text image to be recognized.
It should be noted that, for the details of step 71, refer to step 61 above.
Step 72: If it is determined that the second position set includes at least one boundary position, determine the second position set as the actual cut position corresponding to the text image to be recognized.
In the embodiment of the present application, if it is determined that the second position set includes at least one boundary position, it can be concluded that the second position set was derived from the single-character detection result of the text image to be recognized, so that no position in the second position set falls inside a character. The second position set can therefore be directly determined as the actual cut position corresponding to the text image to be recognized, so that the actual cut position does not fall inside a character and no character is cut apart when the image is cut based on it. This effectively prevents incomplete characters from appearing in the cut images corresponding to the text image to be recognized, which helps improve the recognition accuracy of long text recognition.
步骤73:若确定第二位置集合包括至少一个硬切位置,则将第二位置集合中各个位置分别与第一位置集合中至少一个位置进行匹配,得到第二位置集合中各个位置对应的匹配结果。Step 73: If it is determined that the second set of positions includes at least one hard-cut position, then match each position in the second set of positions with at least one position in the first set of positions, and obtain matching results corresponding to each position in the second set of positions .
需要说明的是,步骤73可以采用上文S62的任一实施方式进行实施。It should be noted that step 73 can be implemented by using any implementation manner of S62 above.
可见，若确定第二位置集合包括至少一个硬切位置，则可以确定该第二位置集合是根据待识别文本图像对应的预设切图位置确定的，使得该第二位置集合有可能出现在字符内部，故可以从第一位置集合中分别查找能够与第二位置集合中各个位置匹配成功的位置，以便后续能够利用这些查找到的位置确定待识别文本图像对应的实际切图位置，以使该实际切图位置不会出现在字符内部，从而使得在基于该实际切图位置进行切图时不会出现切坏字符的现象，如此能够有效地避免该待识别文本图像对应的各个切图中出现不完整字符，从而有利于提高长文本识别的识别准确性。It can be seen that if the second position set includes at least one hard-cut position, it can be concluded that the second position set was determined from the preset cut positions corresponding to the text image to be recognized, so some of its positions may fall inside a character. Positions in the first position set that successfully match each position in the second position set can therefore be looked up, so that the found positions can subsequently be used to determine the actual cut positions corresponding to the text image to be recognized. In this way the actual cut positions do not fall inside characters, no character is cut apart when the image is split at those positions, incomplete characters are effectively kept out of the cut images corresponding to the text image to be recognized, and the recognition accuracy of long text recognition is improved.
步骤74:根据第二位置集合中各个位置对应的匹配结果,确定待识别文本图像对应的实际切图位置。Step 74: According to the matching results corresponding to each position in the second position set, determine the actual image cutting position corresponding to the text image to be recognized.
需要说明的是,步骤74的相关内容请参见上文S63。It should be noted that, for the relevant content of step 74, please refer to the above S63.
基于上述步骤71至步骤74的相关内容可知，在获取到待识别文本图像的单字检测结果和该待识别文本图像对应的预设切图位置之后，应该尽可能地从该单字检测结果中挑选出该待识别文本图像对应的实际切图位置，以使该实际切图位置能够在不切坏字符的情况下尽可能地满足应用场景下的文字识别效率需求。Based on the content of steps 71 to 74 above, after the single-character detection result of the text image to be recognized and the preset cut positions corresponding to that image are obtained, the actual cut positions corresponding to the text image to be recognized should, as far as possible, be selected from the single-character detection result, so that the actual cut positions can satisfy the text recognition efficiency requirements of the application scenario as far as possible without cutting characters apart.
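Steps 71 to 74 amount to snapping preset hard-cut positions onto nearby inter-character boundaries. The sketch below is a minimal illustration of that idea, not the patented method itself: character boxes are assumed to come from a single-character detector as `(left, right)` pairs, candidate boundaries are taken as midpoints of the gaps between adjacent boxes, and every name (`snap_cut_positions`, `tolerance`, the box format) is an assumption introduced here for illustration.

```python
def snap_cut_positions(preset_cuts, char_boxes, tolerance):
    """For each preset (hard) cut position, find the nearest gap between
    detected character boxes; use the gap position if one lies within
    `tolerance` pixels, otherwise fall back to the preset position."""
    boxes = sorted(char_boxes)
    # Candidate boundaries: midpoints of the gaps between consecutive boxes.
    gaps = [(boxes[i][1] + boxes[i + 1][0]) / 2 for i in range(len(boxes) - 1)]
    actual = []
    for cut in preset_cuts:
        if gaps:
            nearest = min(gaps, key=lambda g: abs(g - cut))
            actual.append(nearest if abs(nearest - cut) <= tolerance else cut)
        else:
            actual.append(cut)  # no detections: keep the hard cut as-is
    return sorted(set(actual))
```

A preset cut at 48 between boxes (0, 40) and (50, 90) would be moved to the gap midpoint 45, so the cut cannot land inside either character.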
基于上述步骤52的相关内容可知，在获取到待识别文本图像的单字检测结果和该待识别文本图像对应的预设切图位置之后，可以综合上述两者来确定该待识别文本图像对应的实际切图位置，以使该待识别文本图像对应的实际切图位置能够在不切坏字符的情况下尽可能地满足应用场景下的文字识别效率需求。其中，因上述“待识别文本图像对应的预设切图位置”是依据应用场景对应的预设切分参数确定，使得该预设切图位置符合该应用场景下的文字识别效率需求，从而使得基于该预设切图位置确定的实际切图位置也符合该应用场景下的文字识别效率需求，从而使得基于该预设切图位置实现的文字识别过程能够满足该应用场景下的文字识别效率需求，如此实现在保证长文本识别的识别准确性的前提下尽可能地满足不同应用场景下的文字识别效率需求。Based on the content of step 52 above, after the single-character detection result of the text image to be recognized and the preset cut positions corresponding to that image are obtained, the two can be combined to determine the actual cut positions corresponding to the text image to be recognized, so that those actual cut positions satisfy the text recognition efficiency requirements of the application scenario as far as possible without cutting characters apart. Because the above "preset cut positions corresponding to the text image to be recognized" are determined according to the preset segmentation parameters of the application scenario, the preset cut positions meet the text recognition efficiency requirements of that scenario; the actual cut positions determined from them therefore also meet those requirements, and the text recognition process based on them can satisfy the efficiency requirements of the scenario. In this way, the text recognition efficiency requirements of different application scenarios can be satisfied as far as possible while the recognition accuracy of long text recognition is guaranteed.
基于上述S3的相关内容可知，在获取到至少一个图像切片的单字检测结果和该至少一个图像切片的位置信息之后，可以参考该至少一个图像切片的单字检测结果以及位置信息，确定待识别文本图像对应的实际切图位置。Based on the content of S3 above, after the single-character detection result and the position information of the at least one image slice are obtained, the actual cut positions corresponding to the text image to be recognized can be determined with reference to that detection result and position information.
S4:按照待识别文本图像对应的实际切图位置,对该待识别文本图像进行第二切分处理,得到至少一个待使用图片。S4: Perform a second segmentation process on the text image to be recognized according to the actual picture cutting position corresponding to the text image to be recognized, to obtain at least one picture to be used.
其中,“第二切分处理”是指按照待识别文本图像对应的实际切图位置对该待识别文本图像进行切分处理的过程。Wherein, the "second segmentation processing" refers to the process of performing segmentation processing on the text image to be recognized according to the actual image cutting position corresponding to the text image to be recognized.
可见，在获取到待识别文本图像对应的实际切图位置之后，可以按照该实际切图位置对该待识别文本图像进行切分，得到该待识别文本图像对应的各个切图，并将各个切图分别确定为待使用图片。It can be seen that, after the actual cut positions corresponding to the text image to be recognized are obtained, the text image to be recognized can be split at those positions to obtain the cut images corresponding to it, and each cut image is then taken as a picture to be used.
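As a minimal sketch of the second segmentation on a one-dimensional pixel axis (the image content itself is abstracted away; `split_at_positions` and its arguments are illustrative names introduced here, not terms from the source):

```python
def split_at_positions(image_width, cut_positions):
    """Turn the actual cut positions into contiguous (left, right) pixel
    spans covering the whole image width; cropping the image to each span
    yields one picture-to-use."""
    inner = sorted(p for p in cut_positions if 0 < p < image_width)
    edges = [0] + inner + [image_width]
    return [(edges[i], edges[i + 1]) for i in range(len(edges) - 1)]
```

Cut positions outside `(0, image_width)` are dropped, so the spans always tile the image exactly once with no empty slices at the edges.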
S5:根据至少一个待使用图片的文字识别结果,确定待识别文本图像的文字识别结果。S5: Determine the text recognition result of the text image to be recognized according to the text recognition result of at least one image to be used.
其中，待使用图片的文字识别结果用于描述该待使用图片携带的字符信息；而且本申请实施例不限定待使用图片的文字识别结果的确定过程，可以采用现有的或者未来出现的任一种文字识别方法进行实施（例如，可以采用OCR模型进行实施）。另外，为了提高文字识别效率，可以将所有待使用图片并行进行文字识别处理，得到各个待使用图片的文字识别结果。The text recognition result of a picture to be used describes the character information carried by that picture. The embodiments of the present application do not limit how this result is determined; any existing or future text recognition method may be used (for example, an OCR model). In addition, to improve text recognition efficiency, all the pictures to be used may be recognized in parallel to obtain the text recognition result of each picture.
待识别文本图像的文字识别结果用于描述该待识别文本图像携带的字符信息。The character recognition result of the text image to be recognized is used to describe the character information carried by the text image to be recognized.
另外，本申请实施例不限定S5的实施方式，例如，S5具体可以包括：将至少一个待使用图片的文字识别结果按照该至少一个待使用图片对应的排列顺序进行拼接，得到待识别文本图像的文字识别结果。In addition, the embodiments of the present application do not limit the implementation of S5. For example, S5 may specifically include: concatenating the text recognition results of the at least one picture to be used in the order corresponding to the at least one picture to be used, to obtain the text recognition result of the text image to be recognized.
其中，至少一个待使用图片对应的排列顺序用于表示该至少一个待使用图片在待识别文本图像中的位置相邻关系；而且其具体为：排列序号为1的待使用图片与排列序号为2的待使用图片相邻，排列序号为2的待使用图片与排列序号为3的待使用图片相邻，……（以此类推），排列序号为T-1的待使用图片与排列序号为T的待使用图片相邻。其中，T为正整数，T表示待使用图片个数。The order corresponding to the at least one picture to be used represents the positional adjacency of those pictures in the text image to be recognized; specifically, the picture with sequence number 1 is adjacent to the picture with sequence number 2, the picture with sequence number 2 is adjacent to the picture with sequence number 3, and so on, up to the picture with sequence number T-1 being adjacent to the picture with sequence number T, where T is a positive integer denoting the number of pictures to be used.
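The parallel recognition and order-preserving concatenation of S5 can be sketched as follows; the `ocr` callable stands in for whatever recognizer is used (e.g. an OCR model), and all names here are illustrative assumptions:

```python
from concurrent.futures import ThreadPoolExecutor

def recognize_long_text(pictures, ocr):
    """Recognize every picture-to-use in parallel, then join the per-picture
    results in the pictures' left-to-right order (sequence numbers 1..T)."""
    with ThreadPoolExecutor() as pool:
        # Executor.map returns results in input order regardless of
        # completion order, which preserves the arrangement of the pictures.
        results = list(pool.map(ocr, pictures))
    return "".join(results)
```

Because `map` preserves input order, the concatenated string corresponds to reading the pictures in their original left-to-right arrangement even though recognition runs concurrently.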
基于上述S1至S5的相关内容可知，对于本申请实施例提供的文字识别方法来说，在获取到包括长文本的待识别文本图像之后，先将该待识别文本图像按照预设切片参数进行第一切分处理，得到至少一个图像切片和该至少一个图像切片的位置信息；再根据该至少一个图像切片的单字检测结果以及位置信息，确定该待识别文本图像对应的实际切图位置；然后，按照该待识别文本图像对应的实际切图位置，对该待识别文本图像进行第二切分处理，得到至少一个待使用图片；最后，根据该至少一个待使用图片的文字识别结果，确定该待识别文本图像的文字识别结果，如此能够实现针对长文本的文字识别过程。Based on the content of S1 to S5 above, in the text recognition method provided by the embodiments of the present application, after a text image to be recognized that includes long text is obtained, the text image is first subjected to the first segmentation according to the preset slice parameters, yielding at least one image slice and its position information; the actual cut positions corresponding to the text image are then determined from the single-character detection results and position information of the at least one image slice; next, the text image is subjected to the second segmentation at those actual cut positions, yielding at least one picture to be used; finally, the text recognition result of the text image to be recognized is determined from the text recognition results of the at least one picture to be used. In this way, text recognition for long text can be realized.
可见，因上述“至少一个图像切片的单字检测结果以及位置信息”能够准确地表示出待识别文本图像中至少一个字符的位置信息，使得基于该单字检测结果确定的实际切图位置尽可能地不会出现在字符内部，从而使得在基于该实际切图位置进行切图时尽可能地不会出现切坏字符的现象，如此能够尽可能地避免该待识别文本图像对应的各个切图（也就是，各个待使用图片）中出现不完整字符，从而有利于提高长文本识别的识别准确性。还因各个图像切片的长度远远小于待识别文本图像的长度，使得针对各个图像切片的处理耗时远远小于针对待识别文本图像的处理耗时，如此有利于提高文字识别效率。It can be seen that, because the above "single-character detection results and position information of at least one image slice" accurately represent the positions of the characters in the text image to be recognized, the actual cut positions determined from the detection results fall inside characters as rarely as possible, so that characters are, as far as possible, not cut apart when the image is split at those positions. Incomplete characters are thus kept out of the cut images (that is, the pictures to be used) corresponding to the text image to be recognized, which helps improve the recognition accuracy of long text recognition. Moreover, because each image slice is far shorter than the text image to be recognized, processing each slice takes far less time than processing the whole image, which helps improve text recognition efficiency.
基于上述方法实施例提供的文字识别方法,本申请实施例还提供了一种文字识别装置,下面结合附图进行解释和说明。Based on the character recognition method provided by the above method embodiment, the embodiment of the present application also provides a character recognition device, which will be explained and described below with reference to the accompanying drawings.
装置实施例Device embodiment
装置实施例提供的文字识别装置的技术详情,请参照上述方法实施例。For the technical details of the character recognition device provided by the device embodiment, please refer to the above method embodiment.
参见图8,该图为本申请实施例提供的一种文字识别装置的结构示意图。Refer to FIG. 8 , which is a schematic structural diagram of a character recognition device provided by an embodiment of the present application.
本申请实施例提供的文字识别装置800,包括:The character recognition device 800 provided in the embodiment of the present application includes:
第一切分单元801，用于在获取到待识别文本图像之后，将所述待识别文本图像按照预设切片参数进行第一切分处理，得到至少一个图像切片和所述至少一个图像切片的位置信息；其中，所述待识别文本图像包括长文本；a first segmentation unit 801, configured to, after a text image to be recognized is obtained, perform a first segmentation on the text image to be recognized according to preset slice parameters, to obtain at least one image slice and position information of the at least one image slice; wherein the text image to be recognized includes long text;
位置确定单元802,用于根据所述至少一个图像切片的单字检测结果和所述至少一个图像切片的位置信息,确定所述待识别文本图像对应的实际切图位置;A position determination unit 802, configured to determine the actual cut-out position corresponding to the text image to be recognized according to the word detection result of the at least one image slice and the position information of the at least one image slice;
第二切分单元803,用于按照所述待识别文本图像对应的实际切图位置,对所述待识别文本图像进行第二切分处理,得到至少一个待使用图片;The second segmentation unit 803 is configured to perform a second segmentation process on the text image to be recognized according to the actual image cutting position corresponding to the text image to be recognized to obtain at least one picture to be used;
结果确定单元804,用于根据所述至少一个待使用图片的文字识别结果,确定所述待识别文本图像的文字识别结果。The result determining unit 804 is configured to determine a character recognition result of the to-be-recognized text image according to the character recognition result of the at least one image to be used.
在一种可能的实施方式中，所述位置确定单元802，具体用于：根据所述至少一个图像切片的单字检测结果、所述至少一个图像切片的位置信息、和所述待识别文本图像对应的预设切图位置，确定所述待识别文本图像对应的实际切图位置。In a possible implementation, the position determination unit 802 is specifically configured to: determine the actual cut position corresponding to the text image to be recognized according to the single-character detection result of the at least one image slice, the position information of the at least one image slice, and the preset cut position corresponding to the text image to be recognized.
在一种可能的实施方式中，所述位置确定单元802，具体用于：将所述至少一个图像切片的单字检测结果按照所述至少一个图像切片的位置信息进行拼接处理，得到所述待识别文本图像的单字检测结果；根据所述待识别文本图像的单字检测结果和所述待识别文本图像对应的预设切图位置，确定所述待识别文本图像对应的实际切图位置。In a possible implementation, the position determination unit 802 is specifically configured to: concatenate the single-character detection results of the at least one image slice according to the position information of the at least one image slice, to obtain the single-character detection result of the text image to be recognized; and determine, according to the single-character detection result of the text image to be recognized and the preset cut positions corresponding to it, the actual cut positions corresponding to the text image to be recognized.
在一种可能的实施方式中,所述预设切片参数包括切分间隔和切分偏移长度;其中,所述切分偏移长度小于所述切分间隔;In a possible implementation manner, the preset slice parameters include a segmentation interval and a segmentation offset length; wherein, the segmentation offset length is smaller than the segmentation interval;
所述第一切分单元801,包括:The first dividing unit 801 includes:
区域切除子单元,用于从所述待识别文本图像中切除具有所述切分偏移长度的图像区域,得到待切分图像;a region cutting subunit, configured to cut an image region having the segmentation offset length from the text image to be recognized to obtain the image to be segmented;
图像切片子单元,用于将所述待切分图像按照所述切分间隔进行切分处理,得到至少一个图像切片。The image slice subunit is configured to perform segmentation processing on the image to be segmented according to the segmentation interval to obtain at least one image slice.
在一种可能的实施方式中,所述预设切片参数还包括切除起始位置;In a possible implementation manner, the preset slicing parameters also include a resection starting position;
所述区域切除子单元，具体用于：根据所述切除起始位置和所述切分偏移长度，确定切除区域位置；按照所述切除区域位置对所述待识别文本图像进行区域切除处理，得到所述待切分图像。The region excision subunit is specifically configured to: determine the position of the excision region according to the excision start position and the segmentation offset length; and perform region excision on the text image to be recognized according to the excision region position, to obtain the image to be segmented.
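The excision-then-slice behaviour of the first segmentation unit can be sketched on a one-dimensional pixel axis as follows. This is a simplified reading under stated assumptions: the excised region is taken to start at the excision start position, the part of the axis before it (if any) is kept as its own slice, and all names are illustrative rather than from the source:

```python
def first_segmentation(image_width, interval, offset, start=0):
    """Remove the region [start, start + offset) of length `offset`
    (offset < interval) from the axis, then cut the remainder into spans
    of at most `interval` pixels; each (left, right) span is one image
    slice, and its left edge serves as the slice's position information."""
    assert offset < interval, "the offset length must be smaller than the interval"
    slices = [(0, start)] if start > 0 else []  # keep any prefix before the excision
    pos = start + offset
    while pos < image_width:
        slices.append((pos, min(pos + interval, image_width)))
        pos += interval
    return slices
```

Running the same routine once without the offset and once with it yields two staggered slicings, so a character that straddles a slice boundary in one pass lies wholly inside a slice in the other.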
在一种可能的实施方式中，所述至少一个图像切片的单字检测结果的确定过程，包括：利用预先构建的单字检测模型对所述至少一个图像切片进行并行单字检测处理，得到所述至少一个图像切片的单字检测结果；其中，所述单字检测模型是根据样本文本图像和所述样本文本图像中各个字符的实际位置进行构建的。In a possible implementation, the process of determining the single-character detection result of the at least one image slice includes: performing parallel single-character detection on the at least one image slice with a pre-built single-character detection model, to obtain the single-character detection result of the at least one image slice; wherein the single-character detection model is built from a sample text image and the actual position of each character in the sample text image.
基于上述文字识别装置800的相关内容可知，对于文字识别装置800来说，在获取到包括长文本的待识别文本图像之后，先将该待识别文本图像按照预设切片参数进行第一切分处理，得到至少一个图像切片和该至少一个图像切片的位置信息；再根据该至少一个图像切片的单字检测结果以及位置信息，确定该待识别文本图像对应的实际切图位置；然后，按照该待识别文本图像对应的实际切图位置，对该待识别文本图像进行第二切分处理，得到至少一个待使用图片；最后，根据该至少一个待使用图片的文字识别结果，确定该待识别文本图像的文字识别结果，如此能够实现针对长文本的文字识别过程。Based on the above description of the text recognition device 800, after the device obtains a text image to be recognized that includes long text, it first performs the first segmentation on the text image according to the preset slice parameters, obtaining at least one image slice and its position information; it then determines the actual cut positions corresponding to the text image from the single-character detection results and position information of the at least one image slice; next, it performs the second segmentation on the text image at those actual cut positions, obtaining at least one picture to be used; finally, it determines the text recognition result of the text image to be recognized from the text recognition results of the at least one picture to be used. In this way, text recognition for long text can be realized.
可见，因上述“至少一个图像切片的单字检测结果以及位置信息”能够准确地表示出待识别文本图像中至少一个字符的位置信息，使得基于该单字检测结果确定的实际切图位置尽可能地不会出现在字符内部，从而使得在基于该实际切图位置进行切图时尽可能地不会出现切坏字符的现象，如此能够尽可能地避免该待识别文本图像对应的各个切图（也就是，各个待使用图片）中出现不完整字符，从而有利于提高长文本识别的识别准确性。还因各个图像切片的长度远远小于待识别文本图像的长度，使得针对各个图像切片的处理耗时远远小于针对待识别文本图像的处理耗时，如此有利于提高文字识别效率。It can be seen that, because the above "single-character detection results and position information of at least one image slice" accurately represent the positions of the characters in the text image to be recognized, the actual cut positions determined from the detection results fall inside characters as rarely as possible, so that characters are, as far as possible, not cut apart when the image is split at those positions. Incomplete characters are thus kept out of the cut images (that is, the pictures to be used) corresponding to the text image to be recognized, which helps improve the recognition accuracy of long text recognition. Moreover, because each image slice is far shorter than the text image to be recognized, processing each slice takes far less time than processing the whole image, which helps improve text recognition efficiency.
进一步地,本申请实施例还提供了一种设备,所述设备包括处理器以及存储器:Further, the embodiment of the present application also provides a device, the device includes a processor and a memory:
所述存储器用于存储计算机程序;The memory is used to store computer programs;
所述处理器用于根据所述计算机程序执行本申请实施例提供的文字识别方法的任一实施方式。The processor is configured to execute any implementation of the character recognition method provided in the embodiments of the present application according to the computer program.
进一步地，本申请实施例还提供了一种计算机可读存储介质，所述计算机可读存储介质用于存储计算机程序，所述计算机程序用于执行本申请实施例提供的文字识别方法的任一实施方式。Further, an embodiment of the present application also provides a computer-readable storage medium, where the computer-readable storage medium is used to store a computer program, and the computer program is used to execute any implementation of the text recognition method provided in the embodiments of the present application.
进一步地,本申请实施例还提供了一种计算机程序产品,所述计算机程序产品在终端设备上运行时,使得所述终端设备执行本申请实施例提供的文字识别方法的任一实施方式。Furthermore, the embodiment of the present application also provides a computer program product, which, when running on a terminal device, enables the terminal device to execute any implementation manner of the character recognition method provided in the embodiment of the present application.
应当理解，在本申请中，“至少一个（项）”是指一个或者多个，“多个”是指两个或两个以上。“和/或”，用于描述关联对象的关联关系，表示可以存在三种关系，例如，“A和/或B”可以表示：只存在A，只存在B以及同时存在A和B三种情况，其中A，B可以是单数或者复数。字符“/”一般表示前后关联对象是一种“或”的关系。“以下至少一项（个）”或其类似表达，是指这些项中的任意组合，包括单项（个）或复数项（个）的任意组合。例如，a，b或c中的至少一项（个），可以表示：a，b，c，“a和b”，“a和c”，“b和c”，或“a和b和c”，其中a，b，c可以是单个，也可以是多个。It should be understood that in this application, "at least one (item)" means one or more, and "multiple" means two or more. "And/or" describes an association between associated objects and indicates that three relationships may exist; for example, "A and/or B" can mean: only A exists, only B exists, or both A and B exist, where A and B can each be singular or plural. The character "/" generally indicates an "or" relationship between the associated objects before and after it. "At least one of the following" or similar expressions refer to any combination of these items, including any combination of single items or plural items. For example, at least one of a, b, or c can mean: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c can each be single or multiple.
以上所述，仅是本发明的较佳实施例而已，并非对本发明作任何形式上的限制。虽然本发明已以较佳实施例揭露如上，然而并非用以限定本发明。任何熟悉本领域的技术人员，在不脱离本发明技术方案范围情况下，都可利用上述揭示的方法和技术内容对本发明技术方案做出许多可能的变动和修饰，或修改为等同变化的等效实施例。因此，凡是未脱离本发明技术方案的内容，依据本发明的技术实质对以上实施例所做的任何简单修改、等同变化及修饰，均仍属于本发明技术方案保护的范围内。The above are only preferred embodiments of the present invention and do not limit the present invention in any form. Although the present invention has been disclosed above by way of preferred embodiments, they are not intended to limit it. Any person skilled in the art may, without departing from the scope of the technical solution of the present invention, use the methods and technical content disclosed above to make many possible changes and modifications to the technical solution of the present invention, or modify it into equivalent embodiments with equivalent changes. Therefore, any simple modification, equivalent change, or modification made to the above embodiments according to the technical essence of the present invention, without departing from the content of the technical solution of the present invention, still falls within the protection scope of the technical solution of the present invention.

Claims (10)

  1. 一种文字识别方法,其特征在于,所述方法包括:A character recognition method, characterized in that the method comprises:
    在获取到待识别文本图像之后，将所述待识别文本图像按照预设切片参数进行第一切分处理，得到至少一个图像切片和所述至少一个图像切片的位置信息；其中，所述待识别文本图像包括长文本；after a text image to be recognized is obtained, performing a first segmentation on the text image to be recognized according to preset slice parameters, to obtain at least one image slice and position information of the at least one image slice; wherein the text image to be recognized includes long text;
    根据所述至少一个图像切片的单字检测结果和所述至少一个图像切片的位置信息,确定所述待识别文本图像对应的实际切图位置;According to the word detection result of the at least one image slice and the position information of the at least one image slice, determine the actual image cutting position corresponding to the text image to be recognized;
    按照所述待识别文本图像对应的实际切图位置,对所述待识别文本图像进行第二切分处理,得到至少一个待使用图片;Performing a second segmentation process on the text image to be recognized according to the actual image cutting position corresponding to the text image to be recognized to obtain at least one picture to be used;
    根据所述至少一个待使用图片的文字识别结果,确定所述待识别文本图像的文字识别结果。According to the character recognition result of the at least one image to be used, the character recognition result of the text image to be recognized is determined.
  2. 根据权利要求1所述的方法，其特征在于，所述根据所述至少一个图像切片的单字检测结果和所述至少一个图像切片的位置信息，确定所述待识别文本图像对应的实际切图位置，包括：The method according to claim 1, wherein determining, according to the single-character detection result of the at least one image slice and the position information of the at least one image slice, the actual cut position corresponding to the text image to be recognized comprises:
    根据所述至少一个图像切片的单字检测结果、所述至少一个图像切片的位置信息、和所述待识别文本图像对应的预设切图位置,确定所述待识别文本图像对应的实际切图位置。According to the word detection result of the at least one image slice, the position information of the at least one image slice, and the preset cut position corresponding to the text image to be recognized, determine the actual cut position corresponding to the text image to be recognized .
  3. 根据权利要求2所述的方法,其特征在于,所述待识别文本图像对应的实际切图位置的确定过程,包括:The method according to claim 2, wherein the process of determining the actual cutting position corresponding to the text image to be recognized includes:
    将所述至少一个图像切片的单字检测结果按照所述至少一个图像切片的位置信息进行拼接处理,得到所述待识别文本图像的单字检测结果;performing splicing processing on the word detection result of the at least one image slice according to the position information of the at least one image slice, to obtain the word detection result of the text image to be recognized;
    根据所述待识别文本图像的单字检测结果和所述待识别文本图像对应的预设切图位置,确定所述待识别文本图像对应的实际切图位置。According to the single character detection result of the text image to be recognized and the preset cut position corresponding to the text image to be recognized, the actual cut position corresponding to the text image to be recognized is determined.
  4. 根据权利要求1所述的方法,其特征在于,所述预设切片参数包括切分间隔和切分偏移长度;其中,所述切分偏移长度小于所述切分间隔;The method according to claim 1, wherein the preset slice parameters include a segmentation interval and a segmentation offset length; wherein the segmentation offset length is smaller than the segmentation interval;
    所述至少一个图像切片的确定过程,包括:The process of determining the at least one image slice includes:
    从所述待识别文本图像中切除具有所述切分偏移长度的图像区域,得到待切分图像;cutting the image region with the segmentation offset length from the text image to be recognized to obtain the image to be segmented;
    将所述待切分图像按照所述切分间隔进行切分处理,得到至少一个图像切片。Segmenting the image to be segmented according to the segmentation interval to obtain at least one image slice.
  5. 根据权利要求4所述的方法,其特征在于,所述预设切片参数还包括切除起始位置;The method according to claim 4, wherein the preset slicing parameters further include a resection starting position;
    所述待切分图像的确定过程,包括:The determination process of the image to be segmented includes:
    根据所述切除起始位置和所述切分偏移长度,确定切除区域位置;Determine the position of the resection area according to the resection start position and the segmentation offset length;
    按照所述切除区域位置对所述待识别文本图像进行区域切除处理,得到所述待切分图像。Perform region cutting processing on the to-be-recognized text image according to the cut-off region position to obtain the to-be-segmented image.
  6. 根据权利要求1-5任一项所述的方法,其特征在于,所述至少一个图像切片的单字检测结果的确定过程,包括:The method according to any one of claims 1-5, wherein the determination process of the word detection result of the at least one image slice comprises:
    利用预先构建的单字检测模型对所述至少一个图像切片进行并行单字检测处理，得到所述至少一个图像切片的单字检测结果；其中，所述单字检测模型是根据样本文本图像和所述样本文本图像中各个字符的实际位置进行构建的。performing parallel single-character detection on the at least one image slice with a pre-built single-character detection model, to obtain a single-character detection result of the at least one image slice; wherein the single-character detection model is built from a sample text image and the actual position of each character in the sample text image.
  7. 一种文字识别装置,其特征在于,包括:A character recognition device, characterized in that it comprises:
    第一切分单元，用于在获取到待识别文本图像之后，将所述待识别文本图像按照预设切片参数进行第一切分处理，得到至少一个图像切片和所述至少一个图像切片的位置信息；其中，所述待识别文本图像包括长文本；a first segmentation unit, configured to, after a text image to be recognized is obtained, perform a first segmentation on the text image to be recognized according to preset slice parameters, to obtain at least one image slice and position information of the at least one image slice; wherein the text image to be recognized includes long text;
    位置确定单元,用于根据所述至少一个图像切片的单字检测结果和所述至少一个图像切片的位置信息,确定所述待识别文本图像对应的实际切图位置;A position determination unit, configured to determine the actual cut-out position corresponding to the text image to be recognized according to the word detection result of the at least one image slice and the position information of the at least one image slice;
    第二切分单元,用于按照所述待识别文本图像对应的实际切图位置,对所述待识别文本图像进行第二切分处理,得到至少一个待使用图片;The second segmentation unit is configured to perform a second segmentation process on the text image to be recognized according to the actual image cutting position corresponding to the text image to be recognized to obtain at least one image to be used;
    结果确定单元,用于根据所述至少一个待使用图片的文字识别结果,确定所述待识别文本图像的文字识别结果。The result determination unit is configured to determine the character recognition result of the text image to be recognized according to the character recognition result of the at least one picture to be used.
  8. 一种设备,其特征在于,所述设备包括处理器以及存储器:A device, characterized in that the device includes a processor and a memory:
    所述存储器用于存储计算机程序;The memory is used to store computer programs;
    所述处理器用于根据所述计算机程序执行权利要求1-6中任一项所述的方法。The processor is configured to execute the method according to any one of claims 1-6 according to the computer program.
  9. 一种计算机可读存储介质,其特征在于,所述计算机可读存储介质用于存储计算机程序,所述计算机程序用于执行权利要求1-6中任一项所述的方法。A computer-readable storage medium, characterized in that the computer-readable storage medium is used to store a computer program, and the computer program is used to execute the method according to any one of claims 1-6.
  10. 一种计算机程序产品,其特征在于,所述计算机程序产品在终端设备上运行时,使得所述终端设备执行权利要求1-6中任一项所述的方法。A computer program product, characterized in that, when the computer program product is run on a terminal device, the terminal device is made to execute the method according to any one of claims 1-6.
PCT/CN2022/107728 2021-08-26 2022-07-26 Character recognition method and related device thereof WO2023024793A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110988932.1 2021-08-26
CN202110988932.1A CN113657369A (en) 2021-08-26 2021-08-26 Character recognition method and related equipment thereof

Publications (1)

Publication Number Publication Date
WO2023024793A1 true WO2023024793A1 (en) 2023-03-02

Family

ID=78492998

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/107728 WO2023024793A1 (en) 2021-08-26 2022-07-26 Character recognition method and related device thereof

Country Status (2)

Country Link
CN (1) CN113657369A (en)
WO (1) WO2023024793A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113657369A (en) * 2021-08-26 2021-11-16 北京有竹居网络技术有限公司 Character recognition method and related equipment thereof

Citations (4)

Publication number Priority date Publication date Assignee Title
CN110738602A (en) * 2019-09-12 2020-01-31 北京三快在线科技有限公司 Image processing method and device, electronic equipment and readable storage medium
WO2020155763A1 (en) * 2019-01-28 2020-08-06 平安科技(深圳)有限公司 Ocr recognition method and electronic device thereof
CN111582085A (en) * 2020-04-26 2020-08-25 中国工商银行股份有限公司 Document shooting image identification method and device
CN113657369A (en) * 2021-08-26 2021-11-16 北京有竹居网络技术有限公司 Character recognition method and related equipment thereof

Family Cites Families (6)

Publication number Priority date Publication date Assignee Title
CN104298982B (en) * 2013-07-16 2019-03-08 深圳市腾讯计算机系统有限公司 A kind of character recognition method and device
CN105046254A (en) * 2015-07-17 2015-11-11 腾讯科技(深圳)有限公司 Character recognition method and apparatus
CN105678293A (en) * 2015-12-30 2016-06-15 成都数联铭品科技有限公司 Complex image and text sequence identification method based on CNN-RNN
CN106056114B (en) * 2016-05-24 2019-07-05 腾讯科技(深圳)有限公司 Contents of visiting cards recognition methods and device
CN110991437B (en) * 2019-11-28 2023-11-14 嘉楠明芯(北京)科技有限公司 Character recognition method and device, training method and device for character recognition model
CN113139629A (en) * 2020-01-16 2021-07-20 武汉金山办公软件有限公司 Font identification method and device, electronic equipment and storage medium

Patent Citations (4)

Publication number Priority date Publication date Assignee Title
WO2020155763A1 (en) * 2019-01-28 2020-08-06 平安科技(深圳)有限公司 OCR recognition method and electronic device thereof
CN110738602A (en) * 2019-09-12 2020-01-31 北京三快在线科技有限公司 Image processing method and device, electronic equipment and readable storage medium
CN111582085A (en) * 2020-04-26 2020-08-25 中国工商银行股份有限公司 Document shooting image identification method and device
CN113657369A (en) * 2021-08-26 2021-11-16 北京有竹居网络技术有限公司 Character recognition method and related equipment thereof

Also Published As

Publication number Publication date
CN113657369A (en) 2021-11-16

Similar Documents

Publication Publication Date Title
US20210201445A1 (en) Image cropping method
US11275961B2 (en) Character image processing method and apparatus, device, and storage medium
CN109146892B (en) Image clipping method and device based on aesthetics
WO2021017260A1 (en) Multi-language text recognition method and apparatus, computer device, and storage medium
US20190188222A1 (en) Thumbnail-Based Image Sharing Method and Terminal
CN110136198B (en) Image processing method, apparatus, device and storage medium thereof
CN111027563A (en) Text detection method, device and recognition system
US20120076423A1 (en) Near-duplicate image detection
RU2697649C1 (en) Methods and systems of document segmentation
WO2019128254A1 (en) Image analysis method and apparatus, and electronic device and readable storage medium
CN110942004A (en) Handwriting recognition method and device based on neural network model and electronic equipment
CN112101317B (en) Page direction identification method, device, equipment and computer readable storage medium
JP2021166070A (en) Document comparison method, device, electronic apparatus, computer readable storage medium and computer program
WO2023024793A1 (en) Character recognition method and related device thereof
CN113903036B (en) Text recognition method and device, electronic equipment, medium and product
EP3910590A2 (en) Method and apparatus of processing image, electronic device, and storage medium
US20230237633A1 (en) Image processing method and apparatus, system, and storage medium
WO2023147717A1 (en) Character detection method and apparatus, electronic device and storage medium
CN111612004A (en) Image clipping method and device based on semantic content
CN113642584A (en) Character recognition method, device, equipment, storage medium and intelligent dictionary pen
CN114187595A (en) Document layout recognition method and system based on fusion of visual features and semantic features
CN112581355A (en) Image processing method, image processing device, electronic equipment and computer readable medium
WO2020232866A1 (en) Scanned text segmentation method and apparatus, computer device and storage medium
CN113902899A (en) Training method, target detection method, device, electronic device and storage medium
WO2013097072A1 (en) Method and apparatus for recognizing a character of a video

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22860139

Country of ref document: EP

Kind code of ref document: A1