CN115223166A

CN115223166A - Picture pre-labeling method, picture labeling method and device, and electronic equipment

Info

Publication number: CN115223166A
Application number: CN202211146686.6A
Authority: CN
Inventors: 林群书; 刘明皓; 祁士刚; 杨易; 张超; 赵子健; 左汪洋
Original assignee: Integer Intelligence Information Technology Hangzhou Co ltd
Current assignee: Integer Intelligence Information Technology Hangzhou Co ltd
Priority date: 2022-09-20
Filing date: 2022-09-20
Publication date: 2022-10-21

Abstract

The invention discloses a picture pre-labeling method, a picture labeling device and electronic equipment, wherein the picture pre-labeling method comprises the following steps of: respectively inputting pictures to be pre-labeled into a plurality of trained text region prediction network models to obtain a plurality of region prediction results; integrating a plurality of regional prediction results to obtain a total regional prediction result; obtaining the area of each text in the picture according to the total area prediction result; cutting the area from the picture to obtain a sub-picture; correcting the text direction in the subgraph; respectively inputting the subgraphs with the corrected text direction into a plurality of trained text recognition network models to obtain a plurality of text recognition results; integrating a plurality of text recognition results to obtain a total text recognition result; and pre-labeling the picture according to the total region prediction result and the total text recognition result. By using multi-model fusion, the method has better recognition effect on different scenes; and by adopting region identification and region character identification, the identification effect is obviously improved.

Description

Picture pre-labeling method, picture labeling method and device and electronic equipment

Technical Field

The present application relates to the field of image annotation, and in particular, to an image pre-annotation method, an image annotation method and apparatus, and an electronic device.

Background

With the popularization of the artificial intelligence technology in the traffic field, different subjects, such as artificial intelligence companies, science and technology companies, scientific research institutions, industry enterprises and the like, have larger and larger demand for data acquisition and marking, especially for OCR data.

The AI model quality inspection can play a role in auxiliary marking, missing and filling in gaps and improving the overall data quality in the data marking process. Some companies and enterprises train models first through a small amount of data, then generate pre-labeled results, and finally further optimize training according to manual feedback results. However, such pre-labeling methods typically require repeated training of a single AI model through a large amount of data, which is costly and inefficient. In addition, the existing pre-labeling system does not consider quality inspection and further feedback, and the labeling results are good and uneven.

In the process of implementing the invention, the inventor finds that at least the following problems exist in the prior art:

1. only a single model is used, and the single model is difficult to solve very complex tasks; 2. the model is repeatedly trained, so that the effect is low and the cost is high; 3. the single-stage generation of the pre-label is difficult to complete a complex recognition task, and the performance is low.

Disclosure of Invention

The embodiment of the application aims to provide a picture pre-labeling method, a picture labeling device and electronic equipment, so as to solve the technical problems of low performance and repeated model training in the related technology.

According to a first aspect of the embodiments of the present application, there is provided a picture pre-labeling method, including:

respectively inputting pictures to be pre-marked into a plurality of trained text region prediction network models to obtain a plurality of region prediction results;

integrating a plurality of the regional prediction results to obtain a total regional prediction result;

obtaining the region of each text in the picture according to the total region prediction result;

cropping the area from the picture to obtain a subgraph;

correcting the text direction in the subgraph;

inputting the subgraphs with the corrected text direction into a plurality of trained text recognition network models respectively to obtain a plurality of text recognition results;

integrating a plurality of text recognition results to obtain a total text recognition result;

and pre-labeling the picture according to the total region prediction result and the total text recognition result.

Optionally, the building of the trained text region prediction network model includes:

constructing a plurality of text region prediction network initial models taking different feature extraction networks as a backbone network;

and training a plurality of text region prediction network initial models by using the existing public data set to obtain a plurality of trained text region prediction network models.

Optionally, integrating a plurality of the regional predictors to obtain a total regional predictor, comprising:

respectively inputting pre-labeled pictures into the plurality of text region prediction network models to obtain a plurality of initial region prediction results;

distributing weights to each text region prediction network model, wherein the weights correspond to the initial region prediction results one by one;

obtaining an initial integrated prediction result according to the weight and the initial region prediction result;

calculating an error between a pre-labeled picture result and the initial integrated prediction result, and updating the weight of each text region prediction network model by using a gradient generated by the error;

and using the updated weight as a final weight result to obtain a total area prediction result.

Optionally, correcting the text direction in the sub-image includes:

predicting a text direction within the subgraph using a convolutional network model;

and correcting the text direction in the subgraph according to the prediction result.

Optionally, integrating a plurality of the text recognition results to obtain a total text recognition result, including:

respectively inputting pre-labeled pictures into the plurality of text recognition prediction network models to obtain a plurality of initial text recognition results;

distributing a weight to each text recognition prediction network model, wherein the weight is in one-to-one correspondence with the initial text recognition result;

obtaining an initial integrated recognition result according to the weight and the initial text recognition result;

calculating an error between a pre-labeled picture result and the initially integrated recognition result, and updating the weight of each text recognition prediction network model by using a gradient generated by the error;

and using the updated weight as a final weight result to obtain a total text recognition result.

According to a second aspect of the embodiments of the present application, there is provided a picture pre-labeling apparatus, including:

the prediction module is used for respectively inputting the pictures to be pre-labeled into a plurality of trained text region prediction network models to obtain a plurality of region prediction results;

a first integration module, configured to integrate a plurality of the regional prediction results to obtain a total regional prediction result;

the region obtaining module is used for obtaining the region of each text in the picture according to the total region prediction result;

the cropping module is used for cropping the area from the picture to obtain a sub-picture;

the correcting module is used for correcting the text direction in the subgraph;

the recognition module is used for respectively inputting the subgraphs after the text direction correction into a plurality of trained text recognition network models to obtain a plurality of text recognition results;

the second integration module is used for integrating a plurality of text recognition results to obtain a total text recognition result;

and the marking module is used for pre-marking the picture according to the total region prediction result and the total text recognition result.

According to a third aspect of the embodiments of the present application, there is provided a picture annotation method, including:

acquiring an OCR picture to be marked;

preprocessing the OCR picture;

pre-labeling the preprocessed OCR picture by using the method of the first aspect to obtain a pre-labeling result;

sending the pre-labeling result to a first appointed platform for labeling correction, receiving a correction result of the first appointed platform, and taking the correction result as data to be subjected to quality inspection;

performing quality inspection on the data to be subjected to quality inspection by using the neural network model to obtain the probability of each data to be subjected to quality inspection;

and sending the data to be quality-tested with the probability lower than the set threshold value to a second specified platform for quality testing again, receiving the quality testing result of the second specified platform, and taking the quality testing result as a final labeling result.

Optionally, performing quality inspection on the data to be subjected to quality inspection by using the neural network model to obtain the probability of each data to be subjected to quality inspection, including:

constructing a neural network model by using a convolution, an LSTM network and a full connection layer, wherein feature extraction is carried out on the data to be tested by using the convolution and the LSTM network, the extracted features use the full connection layer, and a predicted value corresponding to each feature is output;

training the neural network model using the labeled data;

performing quality inspection on the data to be inspected by using the trained neural network model to obtain a quality inspection result;

and obtaining the probability of each data to be inspected according to the quality inspection result.

According to a fourth aspect of the embodiments of the present application, there is provided a picture labeling apparatus, including:

the acquisition module is used for acquiring an OCR picture to be marked;

the preprocessing module is used for preprocessing the OCR picture;

the pre-labeling module is used for pre-labeling the preprocessed OCR picture by using the method in the first aspect to obtain a pre-labeling result;

the correction module is used for sending the pre-labeling result to a first specified platform for labeling correction, receiving the correction result of the first specified platform and taking the correction result as data to be subjected to quality inspection;

the first quality inspection module is used for performing quality inspection on the data to be subjected to quality inspection by using the neural network model to obtain the probability of each data to be subjected to quality inspection;

and the second quality inspection module is used for sending the data to be inspected, the probability of which is lower than a set threshold value, to a second designated platform for quality inspection again, receiving a quality inspection result of the second designated platform again, and taking the quality inspection result again as a final labeling result.

According to a fifth aspect of embodiments herein, there is provided an electronic device comprising:

one or more processors;

a memory for storing one or more programs;

when executed by the one or more processors, cause the one or more processors to implement a method as described in the first or third aspects.

According to a sixth aspect of embodiments herein, there is provided a computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the steps of the method according to the first or third aspect.

The technical scheme provided by the embodiment of the application can have the following beneficial effects:

according to the embodiment, in order to overcome the problem that a single model is difficult to solve a very complex task, the invention uses a method of fusing multiple models (a text region prediction network model and a text recognition network model) in a pre-labeling stage, and the multiple models can be dynamically modified according to different tasks, so that the method has a better recognition effect on different scenes; in order to avoid the problems of low efficiency and high calculation cost caused by repeatedly training the model in the labeling process, the models used by the invention are the models which are trained by utilizing a large amount of data, can adapt to the pre-labeled data task, and do not need to be repeatedly trained in the labeling process, so that the calculation cost is reduced, and the operation efficiency is improved; in order to overcome the problems that complicated recognition tasks are often difficult to complete and the performance is low when the pre-labeling is generated in a single stage, the method also adopts the strategies of region recognition, region character recognition and two-stage fusion, so that the step-by-step optimization of the pre-labeling effect is achieved, and the recognition and supervision effects are obviously improved.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and, together with the description, serve to explain the principles of the application.

Fig. 1 is a flowchart illustrating a picture pre-labeling method according to an exemplary embodiment.

Fig. 2 is a flowchart illustrating S11 according to an exemplary embodiment.

FIG. 3 is a diagram illustrating multi-model region prediction, according to an example embodiment.

Fig. 4 is a flowchart illustrating S12 according to an exemplary embodiment.

Fig. 5 is a flowchart illustrating S15 according to an exemplary embodiment.

Fig. 6 is an exemplary diagram of S15 shown according to an exemplary embodiment.

FIG. 7 is a flow diagram illustrating the construction of a plurality of trained plurality of text recognition network models, according to an example embodiment.

Fig. 8 is a flowchart illustrating S17 according to an exemplary embodiment.

Fig. 9 is an exemplary diagram illustrating S17 according to an exemplary embodiment.

Fig. 10 is a block diagram illustrating a picture pre-labeling apparatus according to an exemplary embodiment.

Fig. 11 is a flowchart illustrating a method of annotating pictures according to an exemplary embodiment.

Fig. 12 is a flowchart illustrating S25 according to an exemplary embodiment.

Fig. 13 is a block diagram illustrating a picture annotation device according to an exemplary embodiment.

Fig. 14 is a schematic structural diagram of an electronic device shown in accordance with an exemplary embodiment.

Detailed Description

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the application, as detailed in the appended claims.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.

It should be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present application. The word "if," as used herein, may be interpreted as "at \8230; \8230when" or "when 8230; \823030when" or "in response to a determination," depending on the context.

Fig. 1 is a flowchart illustrating a picture pre-labeling method according to an exemplary embodiment, and as shown in fig. 1, the method may include the following steps:

s11: respectively inputting pictures to be pre-marked into a plurality of trained text region prediction network models to obtain a plurality of region prediction results;

s12: integrating a plurality of the regional prediction results to obtain a total regional prediction result;

s13: obtaining the area of each text in the picture according to the total area prediction result;

s14: cropping the area from the picture to obtain a subgraph;

s15: correcting the text direction in the subgraph;

s16: respectively inputting the subgraphs with the corrected text direction into a plurality of trained text recognition network models to obtain a plurality of text recognition results;

s17: integrating a plurality of text recognition results to obtain a total text recognition result;

s18: and pre-labeling the picture according to the total region prediction result and the total text recognition result.

According to the embodiment, in order to overcome the problem that a single model is difficult to solve a very complex task, the invention uses a method of fusing multiple models (a text region prediction network model and a text recognition network model) in a pre-labeling stage, and the multiple models can be dynamically modified according to different tasks, so that the method has a better recognition effect on different scenes; in order to avoid the problems of low efficiency and high calculation cost caused by repeatedly training the model in the labeling process, the models used by the invention are all models which are trained by utilizing a large amount of data, can adapt to the data tasks of pre-labeling and do not need to be repeatedly trained in the labeling process, thereby reducing the calculation cost and improving the operation efficiency; in order to overcome the problems that complicated recognition tasks are often difficult to complete and the performance is low when the pre-labeling is generated in a single stage, the method also adopts the strategies of region recognition, region character recognition and two-stage fusion, so that the step-by-step optimization of the pre-labeling effect is achieved, and the recognition effect is obviously improved.

In the specific implementation of S11: respectively inputting pictures to be pre-labeled into a plurality of trained text region prediction network models to obtain a plurality of region prediction results;

specifically, the pictures to be pre-labeled are uploaded on a first designated platform such as a Molar platform or oss, and the pictures can be uploaded by a user in the form of folders and single files.

Referring to fig. 2, the construction of the plurality of trained text region prediction network models herein may include the following steps:

s111: constructing a plurality of text region prediction network initial models taking different feature extraction networks as backbone networks;

specifically, a plurality of text region prediction network initial models with model1 (ABCNet), model2 (FPN), model3 (ContourNet), and the like as the backbone network can be constructed. Since the features of the OCR regions extracted by different feature extraction networks are different, for example, ABCNet is extracted by using a parameterized inner-seeler curve, FPN can extract the features of a multi-scale region, and ContourNet can adapt to a text region with large scale (shape) variation. The multi-model fusion strategy can integrate the advantages of different models and cover various scenes of the pre-labeled data. In addition, different models can be dynamically updated and replaced according to the requirements to be marked, and the detection effect of the models is further improved. Referring to fig. 3, the pictures to be labeled are respectively input into three text region prediction models, and the prediction result of each model is obtained. According to the prediction result of each model, the region frame obtained by the model1 is not accurate enough on the boundary, the region frame obtained by the model2 is too large for the 'integer intelligent' 4-word prediction region, the English region prediction is good, and the regions recognized by the model3 are crossed. And the result of fusing multiple models can realize high-precision regional prediction, thereby exceeding the precision of a single model.

S112: and training a plurality of text region prediction network initial models by using a data set to obtain a plurality of text region prediction network models.

Specifically, the pictures to be labeled are input into a plurality of trained text region prediction network models respectively, and a plurality of region prediction results are obtained.

In a specific implementation of S12: integrating a plurality of the regional prediction results to obtain a total regional prediction result; referring to fig. 4, this step may include the steps of:

s121: respectively inputting pre-labeled pictures into the plurality of text region prediction network models to obtain a plurality of initial region prediction results;

specifically, the pictures to be labeled are sorted into batches (e.g., 4 pictures in one batch) and are respectively input into the trained text region prediction network models, and the emphasis points of the prediction results of different models are different. The mode of taking the pictures in batches as input can effectively improve the calculation efficiency.

S122: distributing weights to each text region prediction network model, wherein the weights correspond to the initial region prediction results one by one;

specifically, the initial weight of each text region prediction network model is equally distributed, if 4 network models exist, the weight of each model is 1/4, and other values can be given according to the data requirements. The total prediction result can be obtained through the initial weight.

S123: obtaining an initial integrated prediction result according to the weight and the initial region prediction result;

s124: calculating an error between a pre-labeled picture result and the initial integrated prediction result, and updating the weight of each text region prediction network model by using a gradient generated by the error;

specifically, the pre-finely labeled picture is input into a trained text region prediction model to obtain a total initial prediction result. And calculating the difference between the fine annotation and the initial prediction result through the Euclidean distance to obtain the total loss. The gradient of the loss for each model is calculated and the weight is updated by the gradient. The weight of each model is determined through data, the contribution of each model can be further refined, and the pre-labeling precision is improved.

S125: and using the updated weight as a final weight result to obtain an overall regional prediction result.

In a specific implementation of S13: obtaining the region of each text in the picture according to the total region prediction result;

in a specific implementation of S14: cropping the area from the picture to obtain a subgraph;

in the specific implementation of S15: correcting the text direction in the subgraph; referring to fig. 5, this step may include the following sub-steps:

s151: predicting the text direction of the picture by using a convolutional network model;

specifically, the deflection angle of the picture text is predicted by using a trained ResNet or MobileNet network, the direction of the text is obtained, the recognition result can be further optimized, and the precision is improved. As shown in fig. 6, taking (a) as an example, the prediction is accurate for the "integer smart" region, but the text region is tilted. And (3) knowing that the anticlockwise rotation angle of the character area is 330 degrees through a text direction prediction network, rotating the character area to obtain a corrected picture (b), and filling other areas.

S152: and correcting the text direction in the subgraph according to the prediction result of the text direction of the picture.

Specifically, according to the predicted deflection angle of the text direction of the picture, the text picture is rotated, blank supplementary residual regions are added, and if the deflection angle is 90 degrees, the picture is rotated by 90 degrees. The direction of the text is corrected, so that the recognition result can be further optimized, and the precision is improved.

In a specific implementation of S16: inputting the subgraphs with the corrected text direction into a plurality of trained text recognition network models respectively to obtain a plurality of text recognition results;

specifically, the pictures to be labeled are arranged into batches (for example, 4 pictures are arranged in a batch) and are respectively input into the trained text recognition network models, and the emphasis points of the prediction results of different models are different. The batch as an input can effectively improve the calculation efficiency.

Referring to fig. 7, the construction of the plurality of trained plurality of text recognition network models herein may include the steps of:

s161: constructing a plurality of text recognition network initial models;

specifically, CRNN and RARE text recognition network initial models may be constructed. Since the scenes handled by text recognition network models are different, e.g., CRNN may recognize longer text sequences, RARE accurately recognizes perspective transformed short text, as well as warped short text. The multi-model fusion strategy can integrate the advantages of different models and cover various scenes of the pre-marked data. In addition, different models can be dynamically updated and replaced according to the requirements to be marked, and the recognition effect of the models is further improved.

S162: and respectively training a plurality of text recognition network initial models by using an internal data set to obtain a text recognition network model.

In particular, a public data set may be collected to train the text recognition network initial model. The generalization capability of the model can be improved by training the model through a large amount of data, so that the model is prevented from being trained for many times.

In a specific implementation of S17: integrating a plurality of text recognition results to obtain a total text recognition result; referring to fig. 8, this step may include the following sub-steps:

s171: respectively inputting pre-labeled pictures into the plurality of text recognition prediction network models to obtain a plurality of initial text recognition results;

specifically, the pictures to be labeled are sorted into batches (for example, 4 pictures in one batch) and are respectively input into the trained text recognition network models, and the emphasis points of the prediction results of different models are different. The calculation efficiency can be effectively improved by taking the batch as an input.

S172: distributing a weight to each text recognition prediction network model, wherein the weight is in one-to-one correspondence with the initial text recognition result;

specifically, the initial weight of each text region prediction network model is equally distributed, if 2 network models exist, the weight of each model is 1/2, and other values can be given according to the data requirements. The total prediction result can be obtained by the initial weight.

S173: obtaining an initial integrated recognition result according to the weight and the initial text recognition result;

s174: calculating an error between a pre-labeled picture result and the initially integrated recognition result, and updating the weight of each text recognition prediction network model by using a gradient generated by the error;

specifically, the pre-finely labeled picture is input into the trained text content recognition model to obtain a total initial recognition result. The difference between the fine annotation and the initial recognition result is calculated by BCELoss, resulting in the total loss. The gradient of the loss to each model is calculated, and the weight is updated by the gradient. The weight of each model is determined through data, the contribution of each model can be further refined, and the pre-labeling precision is improved.

S175: and using the updated weight as a final weight result to obtain a total text recognition result.

In a specific implementation of S18: and pre-labeling the picture according to the total region prediction result and the total text recognition result.

Specifically, the weight of each model output result is calculated, and as shown in fig. 9, the picture to be labeled is respectively input into the text recognition model4, the text recognition model5 and the text recognition model 6. Taking the character "A" as an example, the result recognized as "A" is model5 and model6, the weights of the two models are 0.3 and 0.3 respectively, and for the pair, the probability of "A" is 0.6 and is greater than the probability of recognizing as other characters (^), the total prediction result is "A", the total prediction result is integrated with the advantages of the models, and the overall recognition accuracy can be improved by integrating the results of a plurality of models.

Corresponding to the embodiment of the image pre-labeling method, the application also provides an embodiment of an image pre-labeling device.

Fig. 10 is a block diagram illustrating a picture pre-labeling apparatus according to an exemplary embodiment. Referring to fig. 10, the apparatus includes a prediction module 11, a first integration module 12, an area obtaining module 13, a cropping module 14, a correction module 15, an identification module 16, a second integration module 17, and an annotation module 18.

The prediction module 11 is configured to input pictures to be pre-labeled into a plurality of trained text region prediction network models respectively to obtain a plurality of region prediction results;

a first integration module 12, configured to integrate a plurality of the regional prediction results to obtain a total regional prediction result;

a region obtaining module 13, configured to obtain a region of each text in the picture according to the total region prediction result;

a cropping module 14, configured to crop the region from the picture to obtain a sub-picture;

the correcting module 15 is used for correcting the text direction in the subgraph;

the recognition module 16 is configured to input the subgraph after the text direction correction into a plurality of trained text recognition network models respectively to obtain a plurality of text recognition results;

a second integration module 17, configured to integrate a plurality of the text recognition results to obtain a total text recognition result;

and the labeling module 18 is configured to perform pre-labeling on the picture according to the total region prediction result and the total text recognition result.

With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.

For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the application. One of ordinary skill in the art can understand and implement it without inventive effort.

Example 2:

fig. 11 is a flowchart illustrating a picture labeling method according to an exemplary embodiment, and as shown in fig. 11, the method may include the following steps:

s21: acquiring an OCR picture to be marked;

s22: preprocessing the OCR picture;

s23: pre-labeling the preprocessed OCR picture by using the method in the embodiment 1 to obtain a pre-labeling result;

s24: sending the pre-labeling result to a first specified platform for labeling correction, receiving a correction result of the first specified platform, and taking the correction result as data to be subjected to quality inspection;

s25: performing quality inspection on the data to be subjected to quality inspection by using the neural network model to obtain the probability of each data to be subjected to quality inspection;

s26: and sending the data to be quality-tested with the probability lower than the set threshold value to a second specified platform for quality testing again, receiving the quality testing result of the second specified platform, and taking the quality testing result as a final labeling result.

According to the embodiment, in order to overcome the problem that a single model is difficult to solve a very complex task, the invention uses a method of fusing multiple models (a text region prediction network model and a text recognition network model) in a pre-labeling stage, and the multiple models can be dynamically modified according to different tasks, so that the method has a better recognition effect on different scenes; in order to avoid the problems of low efficiency and high calculation cost caused by repeatedly training the model in the labeling process, the models used by the invention are the models which are trained by utilizing a large amount of data, can adapt to the pre-labeled data task, and do not need to be repeatedly trained in the labeling process, so that the calculation cost is reduced, and the operation efficiency is improved; in order to overcome the problems that complicated recognition tasks are often difficult to complete and the performance is low when the pre-labeling is generated in a single stage, the method also adopts the strategies of region recognition, region character recognition and two-stage fusion, so that the step-by-step optimization of the pre-labeling effect is achieved, and the recognition effect is obviously improved. In addition, in order to further improve the accuracy of the model, the invention also adds manual work and the model to further check and modify the labeling result.

In a specific implementation of S21: acquiring an OCR picture to be marked;

specifically, the OCR pictures to be marked are uploaded on a first specified platform, and a user can upload the pictures in a folder or single file mode. For example, a specified platform (e.g., a WEB front end) is returned via a first specified platform (e.g., a WEB back end).

In a specific implementation of S22: preprocessing the OCR picture;

specifically, each picture can be scaled to 512 × 512, and in addition, the large picture is cut into a plurality of sub-pictures according to the size, such as 1000 × 1000 pictures, and the sub-pictures are cut into 4 pictures. The images are conveniently scaled to form batch data, and the calculation cost is reduced.

In a specific implementation of S23: pre-labeling the preprocessed OCR picture by using the method of the embodiment 1 to obtain a pre-labeling result;

specifically, a plurality of OCR pictures are organized into batch data (for example, 4 pictures in one batch), and the batch data is input to a second designated platform (server) to obtain a pre-labeling result returned by the platform, including a polygonal area of a text and a recognition result.

In a specific implementation of S24: sending the pre-labeling result to a first designated platform for labeling correction, receiving a correction result of the first designated platform, such as the correction of the position size of the area and the correction of the character recognition result, and taking the correction result as data to be subjected to quality inspection;

specifically, the appointed personnel review and modify the pre-annotation result at the first appointed platform, wherein the modified content comprises a polygonal area and the identified content. And the appointed personnel audits the passed or modified labeling result as the data to be subjected to quality inspection. By adding manual steps, the marking precision is improved.

In a specific implementation of S25: performing quality inspection on the data to be subjected to quality inspection by using the neural network model to obtain the probability of each data to be subjected to quality inspection; referring to fig. 12, this step may include the following sub-steps:

s251: constructing a neural network model by using a convolution, an LSTM network and a full connection layer, wherein feature extraction is carried out on the data to be tested by using the convolution and the LSTM network, the extracted features use the full connection layer, and a predicted value corresponding to each feature is output;

s252: training the neural network model using the labeled data;

s253: performing quality inspection on the data to be inspected by using the trained neural network model to obtain a quality inspection result;

specifically, the data to be tested is input into the neural network model according to batches, and a network output result is obtained: polygon boxes and specific text content. Through batch input, the calculation cost is reduced.

S254: and obtaining the probability of each data to be inspected according to the quality inspection result.

Specifically, calculating the difference between a frame output by the network and a marked frame, and calculating the difference by using the Euclidean distance to obtain the probability of the frame; and calculating the difference between the text content output by the network and the labeled text content, and calculating the difference by using the text distance to obtain the text recognition probability. The probability can be used for quality detection, and the labeling quality is improved.

In a specific implementation of S26: and sending the data to be quality-tested with the probability lower than the set threshold value to a second specified platform for quality testing again, receiving the quality testing result of the second specified platform, and taking the quality testing result as a final labeling result.

Corresponding to the embodiment of the image annotation method, the application also provides an embodiment of an image annotation device.

FIG. 13 is a block diagram illustrating a picture annotation device in accordance with an exemplary embodiment. Referring to fig. 13, the apparatus includes an obtaining module 21, a preprocessing module 22, a pre-labeling module 23, a modification module 24, a first quality inspection module 25, and a second quality inspection module 26.

The obtaining module 21 is configured to obtain an OCR picture to be labeled;

the preprocessing module 22 is used for preprocessing the OCR pictures;

the pre-labeling module 23 is configured to perform pre-labeling on the preprocessed OCR picture by using the method described in embodiment 1 to obtain a pre-labeling result;

the correcting module 24 is configured to send the pre-annotation result to a first specified platform for annotation correction, receive a correction result of the first specified platform, and use the correction result as data to be quality inspected;

the first quality inspection module 25 is configured to perform quality inspection on data to be subjected to quality inspection by using the neural network model to obtain a probability of each data to be subjected to quality inspection;

and the second quality inspection module 26 is configured to send the data to be quality inspected, of which the probability is lower than the set threshold, to the second designated platform for quality inspection again, receive a quality inspection result of the second designated platform again, and use the quality inspection result again as a final labeling result.

For the device embodiment, since it basically corresponds to the method embodiment, reference may be made to the partial description of the method embodiment for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the application. One of ordinary skill in the art can understand and implement without inventive effort.

Example 3:

correspondingly, the present application also provides an electronic device, comprising: one or more processors; a memory for storing one or more programs; when the one or more programs are executed by the one or more processors, the one or more processors implement the picture pre-labeling method or the picture labeling method as described above. As shown in fig. 14, for a hardware structure diagram of an arbitrary device with data processing capability where a picture pre-labeling apparatus or a picture labeling apparatus is located according to an embodiment of the present invention, in addition to the processor, the memory, and the memory shown in fig. 14, the arbitrary device with data processing capability where the apparatus is located in the embodiment may also include other hardware according to an actual function of the arbitrary device with data processing capability, which is not described again.

Accordingly, the present application also provides a computer readable storage medium, on which computer instructions are stored, and the instructions, when executed by a processor, implement the image pre-labeling method or the image labeling method as described above. The computer readable storage medium may be an internal storage unit, such as a hard disk or a memory, of any data processing device described in any previous embodiment. The computer readable storage medium may also be an external storage device of the wind turbine, such as a plug-in hard disk, a Smart Media Card (SMC), an SD Card, a Flash memory Card (Flash Card), and the like, provided on the device. Further, the computer readable storage medium may include both an internal storage unit and an external storage device of any data processing capable device. The computer-readable storage medium is used for storing the computer program and other programs and data required by the arbitrary data processing capable device, and may also be used for temporarily storing data that has been output or is to be output.

Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the present disclosure. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.

It will be understood that the present application is not limited to the precise arrangements that have been described above and shown in the drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims

1. A picture pre-labeling method is characterized by comprising the following steps:

respectively inputting pictures to be pre-labeled into a plurality of trained text region prediction network models to obtain a plurality of region prediction results;

cropping the area from the picture to obtain a subgraph;

correcting the text direction in the subgraph;

2. The method of claim 1, wherein the constructing of the plurality of trained text region prediction network models comprises:

constructing a plurality of text region prediction network initial models taking different feature extraction networks as backbone networks;

3. The method of claim 1, wherein correcting the direction of the text within the subgraph comprises:

4. The method of claim 1, wherein integrating a plurality of the text recognition results to obtain a total text recognition result comprises:

respectively inputting pre-labeled pictures into a plurality of text recognition prediction network models to obtain a plurality of initial text recognition results;

5. A picture pre-labeling device, comprising:

the cropping module is used for cropping the area from the picture to obtain a subgraph;

6. A picture marking method is characterized by comprising the following steps:

acquiring an OCR picture to be marked;

preprocessing the OCR picture;

pre-labeling the preprocessed OCR picture by using the method of any one of claims 1 to 4 to obtain a pre-labeling result;

7. The method of claim 6, wherein the performing quality inspection on the data to be inspected by using the neural network model to obtain the probability of each data to be inspected, comprises:

constructing a neural network model by using a convolution, an LSTM network and a full connection layer, wherein the convolution and the LSTM network are used for extracting the characteristics of the data to be detected, the full connection layer is used for extracting the characteristics, and a predicted value corresponding to each characteristic is output;

training the neural network model using the labeled data;

performing quality inspection on data to be inspected by using the trained neural network model to obtain a quality inspection result;

8. A picture labeling device, comprising:

the acquisition module is used for acquiring an OCR picture to be marked;

the preprocessing module is used for preprocessing the OCR picture;

the pre-labeling module is used for pre-labeling the preprocessed OCR picture by using the method of any one of claims 1-4 to obtain a pre-labeling result;

the first quality inspection module is used for performing quality inspection on data to be subjected to quality inspection by using the neural network model to obtain the probability of each data to be subjected to quality inspection;

and the second quality inspection module is used for sending the data to be inspected, the probability of which is lower than a set threshold value, to a second designated platform for re-quality inspection, receiving a re-quality inspection result of the second designated platform and taking the re-quality inspection result as a final labeling result.

9. An electronic device, comprising:

one or more processors;

a memory for storing one or more programs;

when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-4, 6-7.

10. A computer-readable storage medium having stored thereon computer instructions, which when executed by a processor, perform the steps of the method according to any one of claims 1-4, 6-7.