CN110533079B - Method, apparatus, medium, and electronic device for forming image sample

Info

Publication number: CN110533079B (granted publication of application CN110533079A)
Application number: CN201910717086.2A
Inventor: 李壮 (Li Zhuang)
Applicant and current assignee: Beike Technology Co Ltd
Original language: Chinese (zh)
Legal status: Active (granted)
Prior art keywords: text, image sample, text box, information, image

Classifications

    • G06F 18/214 Pattern recognition: generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06N 3/045 Neural networks: combinations of networks
    • G06N 3/08 Neural networks: learning methods

Abstract

A method, apparatus, medium, and electronic device for forming an image sample are disclosed. The method of forming an image sample includes: acquiring a first image sample, wherein the first image sample is provided with at least one piece of text labeling information; providing the first image sample to a text box detection model, and performing text box detection processing on the first image sample through the text box detection model to obtain detected text box position information; determining the text labeling information corresponding to the detected text box position information; and setting new text labeling information for the first image sample according to the text box position information and the corresponding text labeling information to form a second image sample. The method and apparatus help improve the efficiency of setting text labeling information for image samples, enrich the image samples, and improve the recognition accuracy of the text content recognition model.

Description

Method, apparatus, medium, and electronic device for forming image sample
Technical Field
The present disclosure relates to computer vision technology, and more particularly, to a method of forming an image sample, an apparatus for forming an image sample, a storage medium, and an electronic device.
Background
OCR (Optical Character Recognition) is a technology capable of recognizing characters (such as text and symbols) on paper documents.
Currently, some OCR techniques are implemented by means of deep learning. Specifically, an image to be recognized is provided to a deep-learning-based text box detection model, which performs text box detection processing on the input image to obtain the text box position information in the image. The image is then cropped according to the text box position information to obtain image blocks to be recognized, and each image block is provided to a text content recognition model, so that the text content in the image block can be obtained from the output information of the text content recognition model.
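As a rough illustration (not part of the patent text), this two-stage pipeline can be sketched as follows; detect_text_boxes and recognize_text are hypothetical wrappers for the detection and recognition models, and the image is assumed to be a NumPy-style array indexed as image[row, column]:

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height), an assumed box format

def ocr_pipeline(image, detector, recognizer) -> List[str]:
    """Two-stage OCR: detect text boxes, crop image blocks, recognize text.

    `detector` and `recognizer` stand in for a deep-learning text box
    detection model and a text content recognition model."""
    boxes: List[Box] = detector.detect_text_boxes(image)
    texts: List[str] = []
    for x, y, w, h in boxes:
        block = image[y:y + h, x:x + w]          # cut out the image block
        texts.append(recognizer.recognize_text(block))
    return texts
```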
How to improve the recognition accuracy of the text content recognition model is a technical problem of great concern.
Disclosure of Invention
The present disclosure is proposed to solve the above technical problems. Embodiments of the present disclosure provide a method of forming an image sample, an apparatus for forming an image sample, a storage medium, and an electronic device.
According to an aspect of an embodiment of the present disclosure, there is provided a method of forming an image sample, including: acquiring a first image sample, wherein the first image sample is provided with at least one piece of text labeling information; providing the first image sample to a text box detection model, and performing text box detection processing on the first image sample through the text box detection model to obtain detected text box position information; determining text labeling information corresponding to the detected text box position information; and setting new text labeling information for the first image sample according to the text box position information and the corresponding text labeling information to form a second image sample.
In an embodiment of the present disclosure, the providing the first image sample to a text box detection model, and performing text box detection processing on the first image sample via the text box detection model to obtain detected text box position information includes: and providing the first image sample to a plurality of text box detection models based on different detection algorithms, and respectively carrying out text box detection processing on the first image sample through the plurality of text box detection models to obtain the text box position information detected by the plurality of text box detection models.
In another embodiment of the present disclosure, the plurality of text box detection models have differences in corresponding hyper-parameters during the training process.
In another embodiment of the present disclosure, the determining text label information corresponding to the detected text box position information includes: determining the text box area overlapping information according to the text box position marking information in each text marking information of the first image sample and the detected text box position information; and determining text labeling information corresponding to the detected text box position information according to the overlapping information and a preset condition.
In another embodiment of the present disclosure, the setting new text label information for the first image sample according to the text box position information and the corresponding text label information to form a second image sample includes: and taking the text content marking information in the text box position information and the corresponding text marking information as new text marking information of the first image sample to form a second image sample.
In yet another embodiment of the present disclosure, the method further comprises: and training the text content recognition model to be trained by utilizing the second image sample.
In another embodiment of the present disclosure, the training of the text content recognition model to be trained by using the second image sample includes: according to the text labeling information of the first image sample and the text labeling information of the second image sample respectively obtained by a plurality of text box detection models based on different detection algorithms, cutting out image block samples containing text content from the first image sample and the second image sample; acquiring image block samples from a first image sample and image block samples from a second image sample corresponding to different detection algorithms according to a preset mixing ratio; providing the obtained image block samples to a text content identification model to be trained, and carrying out text content identification processing on each image block sample through the text content identification model to be trained to obtain a plurality of recognized text contents; and adjusting the model parameters of the text content recognition model to be trained according to the difference between the recognized text contents and the text content marking information in the text marking information.
According to another aspect of an embodiment of the present disclosure, there is provided an apparatus for forming an image sample, the apparatus including: the system comprises an acquisition module, a processing module and a display module, wherein the acquisition module is used for acquiring a first image sample, and the first image sample is provided with at least one piece of text labeling information; the detection module is used for providing the first image sample acquired by the acquisition module for a text box detection model, and performing text box detection processing on the first image sample through the text box detection model to acquire detected text box position information; the determining module is used for determining the text labeling information corresponding to the text box position information detected by the detecting module; and the setting module is used for setting new text labeling information for the first image sample according to the position information of the text box detected by the detection module and the corresponding text labeling information determined by the determination module to form a second image sample.
In an embodiment of the present disclosure, the detection module is further configured to: and providing the first image sample to a plurality of text box detection models based on different detection algorithms, and respectively carrying out text box detection processing on the first image sample through the plurality of text box detection models to obtain the text box position information detected by the plurality of text box detection models.
In another embodiment of the present disclosure, there is a difference between the hyper-parameters corresponding to the text box detection models in the training process.
In yet another embodiment of the present disclosure, the determining module includes: the first sub-module is used for determining the text box area overlapping information according to the text box position marking information in each text marking information of the first image sample and the detected text box position information; and the second submodule is used for determining the text marking information corresponding to the detected text box position information according to the overlapping information determined by the first submodule and a preset condition.
In yet another embodiment of the present disclosure, the setting module is further configured to: and taking the text content marking information in the text box position information and the corresponding text marking information as new text marking information of the first image sample to form a second image sample.
In yet another embodiment of the present disclosure, the apparatus further includes: and the training module is used for training the text content recognition model to be trained by utilizing the second image sample.
In yet another embodiment of the present disclosure, the training module includes: the third sub-module is used for cutting out image block samples containing text contents from the first image samples and the second image samples according to the text labeling information of the first image samples and the text labeling information of the second image samples respectively obtained by a plurality of text box detection models based on different detection algorithms; the fourth sub-module is used for acquiring image block samples from the first image samples and image block samples from the second image samples corresponding to different detection algorithms according to a preset mixing ratio; the fifth sub-module is used for providing the obtained image block samples to a text content identification model to be trained, and performing text content identification processing on each image block sample through the text content identification model to be trained to obtain a plurality of recognized text contents; and the sixth submodule is used for adjusting the model parameters of the text content identification model to be trained according to the difference between the plurality of recognized text contents and the text content marking information in the text marking information.
According to still another aspect of the embodiments of the present disclosure, there is provided a computer-readable storage medium storing a computer program for executing the above-described method of forming an image sample.
According to still another aspect of an embodiment of the present disclosure, there is provided an electronic apparatus including: a processor; a memory for storing the processor-executable instructions; the processor is used for reading the executable instructions from the memory and executing the instructions to realize the method for forming the image sample.
According to the method and apparatus for forming an image sample provided by the embodiments of the present disclosure, text box detection processing is performed on the first image sample by using the text box detection model, so that the detected text box position information can be obtained. Because the text box position labeling information of the image sample is obtained through detection by the text box detection model, when the image sample is used for training the text content recognition model, the successfully trained text content recognition model fits the actual application scenario better. Therefore, the technical solution provided by the present disclosure helps improve the efficiency of setting the text labeling information of image samples and enrich the image samples, while also helping improve the recognition accuracy of the text content recognition model.
The technical solution of the present disclosure is further described in detail by the accompanying drawings and examples.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description, serve to explain the principles of the disclosure.
The present disclosure may be more clearly understood from the following detailed description taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a schematic diagram of one embodiment of a suitable scenario for use with the present disclosure;
FIG. 2 is a flow chart of one embodiment of a method of forming an image sample of the present disclosure;
FIG. 3 is a schematic view of one embodiment of a first region and a second region of the present disclosure;
FIG. 4 is a schematic diagram illustrating one embodiment of transferring text content annotation information according to the present disclosure;
FIG. 5 is a flow diagram of one embodiment of the present disclosure for training a textual content recognition model to be trained;
FIG. 6 is a flow diagram of another embodiment of the present disclosure for training a textual content recognition model to be trained;
FIG. 7 is a schematic diagram illustrating the structure of one embodiment of an apparatus for forming an image sample according to the present disclosure;
fig. 8 is a block diagram of an electronic device provided in an exemplary embodiment of the present disclosure.
Detailed Description
Example embodiments according to the present disclosure will be described in detail below with reference to the accompanying drawings. It is to be understood that the described embodiments are merely a subset of the embodiments of the present disclosure and not all embodiments of the present disclosure, with the understanding that the present disclosure is not limited to the example embodiments described herein.
It should be noted that: the relative arrangement of the components and steps, the numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless specifically stated otherwise.
It will be understood by those of skill in the art that the terms "first," "second," and the like in the embodiments of the present disclosure are used merely to distinguish one element from another, and are not intended to imply any particular technical meaning or any necessary logical order between them.
It is also understood that in embodiments of the present disclosure, "a plurality" may refer to two or more than two and "at least one" may refer to one, two or more than two.
It is also to be understood that any reference to any component, data, or structure in the embodiments of the disclosure, may be generally understood as one or more, unless explicitly defined otherwise or stated otherwise.
In addition, the term "and/or" in the present disclosure describes only an association relationship between associated objects, indicating that three kinds of relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" in the present disclosure generally indicates that the former and latter associated objects are in an "or" relationship.
It should also be understood that the description of the various embodiments of the present disclosure emphasizes the differences between the various embodiments, and the same or similar parts may be referred to each other, so that the descriptions thereof are omitted for brevity.
Meanwhile, it should be understood that the sizes of the respective portions shown in the drawings are not drawn in an actual proportional relationship for the convenience of description.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
Embodiments of the present disclosure may be implemented in electronic devices such as terminal devices, computer systems, servers, etc., which are operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well known terminal devices, computing systems, environments, and/or configurations that may be suitable for use with an electronic device, such as a terminal device, computer system, or server, include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, hand-held or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputer systems, mainframe computer systems, distributed cloud computing environments that include any of the above, and the like.
Electronic devices such as terminal devices, computer systems, servers, etc. may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, etc. that perform particular tasks or implement particular abstract data types. The computer system/server may be implemented in a distributed cloud computing environment. In a distributed cloud computing environment, tasks may be performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computer system storage media including memory storage devices.
Summary of the disclosure
In the process of implementing the present disclosure, the inventors found that training a text content recognition model often requires a large number of image samples. Labeling image samples manually generally consumes high labor and time costs. If existing image samples are extended by means of scaling, position translation, contrast adjustment, color adjustment, noise addition, and the like, the extended image samples often differ in distribution from real images, so training the text content recognition model with such samples may not ensure its recognition accuracy in practical applications.
In addition, in practical applications, the image blocks provided to the text content recognition model are usually cut out of the image to be recognized according to the text box position information that the text box detection model detects in that image. Training the text content recognition model with image samples formed by the extension approach above is therefore equivalent to severing the connection between the text box detection model and the text content recognition model, which does not help ensure the recognition accuracy of the text content recognition model in practical applications.
The present disclosure performs text box detection processing on a first image sample by using a text box detection model to obtain detected text box position information, and sets new text labeling information for the first image sample by using that position information together with its corresponding text labeling information, thereby forming a second image sample. In this way, part of the content in the text labeling information of the first image sample is transferred into the text labeling information of the second image sample, so new image samples can be formed quickly and the image samples are enriched. Moreover, because the text box position information in the new sample's text labeling information is obtained through detection by the text box detection model, the connection between the text content recognition model and the text box detection model is preserved during training, which helps ensure the recognition accuracy of the text content recognition model in practical applications.
Exemplary Application Scenario
The technology for forming image samples provided by the disclosure is generally applied to an application scenario for training a text content recognition model. One application scenario of the technique of forming an image sample of the present disclosure is illustrated below with reference to fig. 1.
In fig. 1, the image sample set 100 includes a plurality of image samples; for example, the image sample set 100 includes image sample 1, image sample 2, …, and image sample N. At least a portion of the image samples in the image sample set 100 are formed using the techniques of forming image samples of the present disclosure. Each image sample is provided with text labeling information, which generally includes: text box position labeling information and text content labeling information.
A plurality of image samples are obtained from the image sample set 100, and each obtained image sample is cropped according to its text box position labeling information, so as to obtain a plurality of image block samples, such as image block sample 1, image block sample 2, …, and image block sample M. According to the text content labeling information of the image samples, the text content labeling information of each image block sample can be determined.
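A minimal sketch of this cropping step, assuming the image is a NumPy array indexed as image[row, column] and each annotation is a hypothetical (box, content) pair; none of these names come from the disclosure:

```python
import numpy as np

def crop_image_blocks(image: np.ndarray, annotations):
    """Cut one image block sample per annotated text box.

    Each annotation is assumed to be ((x, y, w, h), content); every
    returned block keeps the text content labeling of its box."""
    blocks = []
    for (x, y, w, h), content in annotations:
        blocks.append((image[y:y + h, x:x + w], content))
    return blocks
```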
According to the preset batch processing size, a certain number of image block samples are obtained from the obtained plurality of image block samples, and the obtained image block samples are respectively used as input and provided for the text content identification model 101 to be trained. The text content recognition processing is respectively performed on each input image block sample through the text content recognition model 101 to be trained, so that the text content recognized by the text content recognition model 101 to be trained for each image block sample can be obtained. And adjusting model parameters of the text content recognition model 101 to be trained, such as convolution kernel weight and/or matrix weight, according to the text content recognized by the text content recognition model 101 for each image block sample and the text content label information of each image block sample.
By using the training process, the successfully trained text content recognition model 101 can be finally obtained.
Exemplary Method
FIG. 2 is a schematic flow chart diagram illustrating one embodiment of a method of forming an image sample according to the present disclosure.
As shown in fig. 2, the method of this embodiment includes the steps of: s200, S201, S202, and S203. The following describes each step.
And S200, acquiring a first image sample.
The first image sample in the present disclosure refers to an image sample used for training a neural network; it is called the first image sample to distinguish it from the new image sample generated later. The first image sample may be an image sample in a preset image sample set. The first image sample usually contains characters, which may include words, symbols, and the like. The characters may be Chinese characters and words, or characters and words in other languages (such as English). The first image sample in the present disclosure may be an image obtained by scanning, photographing, or the like.
Each first image sample in the disclosure is provided with at least one piece of text labeling information. The text labeling information can be set by manual labeling or in other ways. One piece of text labeling information in the present disclosure generally includes: text box position labeling information and text content labeling information. The text box position labeling information is used to characterize the position of a text box in the first image sample. The text content labeling information is used to represent the text content in the text box, such as the characters and symbols in it.
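For concreteness, one piece of text labeling information could be modeled as in the sketch below; the field names and the sample values are invented for illustration only:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class TextAnnotation:
    box: Tuple[int, int, int, int]  # text box position labeling: (x, y, w, h)
    content: str                    # text content labeling: characters in the box

# A first image sample paired with its text labeling information.
first_sample_annotations = [
    TextAnnotation(box=(120, 40, 200, 32), content="example text"),
]
```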
S201, providing the first image sample to a text box detection model, and performing text box detection processing on the first image sample through the text box detection model to obtain detected text box position information.
A text box detection model in this disclosure may refer to a neural network used to predict the specific location of a text box in an image sample. The neural network may comprise a convolutional neural network or the like. Neural networks in the present disclosure may include, but are not limited to: convolutional layers, ReLU (Rectified Linear Unit) layers (which may also be referred to as activation layers), pooling layers, and fully connected layers, among others. The greater the number of layers the neural network contains, the deeper the network. The present disclosure does not limit the specific structure of the neural network.
The text box detection model in the present disclosure may be a model obtained by training with an image sample in advance. The textbox detection processing procedure of the textbox detection model in the present disclosure on the first image sample may include: generating a plurality of candidate frames, performing confidence prediction on the candidate frames, performing regression processing on the corresponding candidate frames according to the confidence prediction result, and the like. The text box detection model in the present disclosure generally employs a corresponding detection algorithm to perform text box detection processing, which is not limited by the present disclosure.
Text box position information in the present disclosure is used to represent the position of the text box in the first image sample. The text box position information may include: the coordinate information of any vertex of the text box together with the width and height of the text box. The text box position information may also include: the coordinate information of the center point of the text box together with the width and height of the text box. The present disclosure does not limit the specific representation of the text box position information.
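The two representations mentioned above carry the same information and can be converted into each other; a small sketch under the assumed (x, y, width, height) layout:

```python
def corner_to_center(x: float, y: float, w: float, h: float):
    """(top-left x, top-left y, width, height) -> (center x, center y, width, height)."""
    return x + w / 2.0, y + h / 2.0, w, h

def center_to_corner(cx: float, cy: float, w: float, h: float):
    """(center x, center y, width, height) -> (top-left x, top-left y, width, height)."""
    return cx - w / 2.0, cy - h / 2.0, w, h
```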
S202, determining text label information corresponding to the detected text box position information.
Since each piece of text label information of the first image sample includes the text box position label information, the present disclosure may determine, from each piece of text label information, the text label information corresponding to the detected text box position information according to the area indicated by each piece of text box position label information of the first image sample and the area indicated by the detected text box position information. That is, the detected position information of the text box may represent a first area, and each piece of position label information of the text box of the first image sample may represent a second area, respectively.
S203, setting new text label information for the first image sample according to the text box position information and the corresponding text label information to form a second image sample.
The second image sample in this disclosure may also be used to train a neural network. The second image sample is typically identical to the first image sample in image content, except that the textual annotation information is not identical. The text box position information in the text annotation information of the second image sample is obtained by performing text box detection processing on the first image sample by using a text box detection model, and other annotation information in the text annotation information of the second image sample is obtained by inheriting from the text annotation information of the first image sample. That is, part of the annotation information in the text annotation information of the first image sample is passed to the text annotation information of the second image sample.
By performing text box detection processing on the first image sample with the text box detection model, the detected text box position information can be obtained, and new text labeling information can be set for the first image sample by using that position information together with its corresponding text labeling information to form the second image sample. Part of the content in the text labeling information of the first image sample is thus transferred into the text labeling information of the second image sample, so the method can conveniently form new image samples, enrich the image samples, improve the efficiency of setting text labeling information, and avoid the heavy labor and time costs of manual labeling. Moreover, because the text box position information in the newly formed sample's text labeling information is obtained through detection by the text box detection model, training the content recognition model with the new samples keeps the training process of the text content recognition model linked to the detection process of the text box detection model, which helps improve the recognition accuracy of the text content recognition model in practical applications.
In an optional example, there may be multiple text box detection models in the present disclosure, the multiple text box detection models generally being based on different text box detection algorithms; it may also be considered that the multiple text box detection models implement different kinds of text box position detection. For example, the plurality of text box detection models include, but are not limited to: CTPN (Connectionist Text Proposal Network), EAST (An Efficient and Accurate Scene Text Detector), or a text box detection model based on the SegLink oriented scene text detection algorithm, etc.
Optionally, the present disclosure may provide the first image samples in the training data set to the plurality of text box detection models, and perform text box detection processing on the input first image samples through the plurality of text box detection models, so as to obtain text box position information detected by each of the plurality of text box detection models with respect to the first image sample.
According to the method, the plurality of text box detection models respectively perform text box detection processing on the first image sample. Because different text box detection models output different detection results, the method can generate, from those different results and by transferring part of the content of the first image sample's text labeling information, a plurality of second image samples with different text labeling information. This helps improve the generation efficiency of image samples and enriches them. In addition, because the text box position information in the new samples' text labeling information is detected by multiple text box detection models, training the content recognition model with these samples links the training process of the text content recognition model to the detection processes of the multiple text box detection models, which further helps improve the recognition accuracy of the text content recognition model in practical applications.
In an alternative example, the hyper-parameters used in the training processes of the multiple text box detection models in the present disclosure differ; that is, they are not identical across models. The hyper-parameters in the present disclosure include, but are not limited to: the batch size (batch_size), the pixel threshold (pixel_threshold), the boundary pixel threshold (side_vertex_pixel_threshold), and the head-to-tail pixel discrimination threshold (round_threshold).
Optionally, in the multiple textbox detection models in the present disclosure, the number of used image samples may be different in the training process. For example, some text box detection models are trained using full-size image samples, while some text box detection models are trained using non-full-size image samples. As another example, different textbox detection models may not have the same requirements for the size of the spatial resolution of the input image samples.
Using different hyper-parameters for the multiple text box detection models during training enhances the randomness of their training processes and the diversity of the text box detection results they output, which in turn enhances the diversity of the second image samples that are formed.
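For illustration, per-model training hyper-parameters might be varied as below; the parameter names follow the list above, while the model keys and all values are invented placeholders:

```python
# Hypothetical hyper-parameter settings for three text box detection
# models trained with deliberately different configurations.
detector_training_configs = {
    "ctpn":    {"batch_size": 32, "pixel_threshold": 0.90,
                "side_vertex_pixel_threshold": 0.80, "round_threshold": 0.10},
    "east":    {"batch_size": 16, "pixel_threshold": 0.80,
                "side_vertex_pixel_threshold": 0.90, "round_threshold": 0.20},
    "seglink": {"batch_size": 24, "pixel_threshold": 0.85,
                "side_vertex_pixel_threshold": 0.85, "round_threshold": 0.15},
}
```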
In an alternative example, the process of determining text label information corresponding to the detected text box position information of the present disclosure may include the following two steps:
step 1, determining text box region overlapping information according to text box position marking information in each text marking information of the first image sample and detected text box position information.
Optionally, the text box region overlapping information in the present disclosure may reflect whether two text boxes correspond to the same text content in the first image sample. The text box region overlap information in the present disclosure may include, but is not limited to: the intersection-over-union (IoU) of the two text boxes.
It is assumed that, for the detection result of a given text box detection model on the input first image sample, the present disclosure obtains N1 (N1 is an integer greater than zero) pieces of detected text box position information, each corresponding to one region in the first image sample, hereinafter referred to as a first region. That is, the present disclosure may obtain N1 first regions from the detection result of a text box detection model on the input first image sample. The region formed by the dashed box enclosing "××" in fig. 3 is the first region.
It is also assumed that the first image sample has N2 pieces of text labeling information (N2 is an integer greater than zero; N2 and N1 may or may not be equal), and the text box position labeling information in each piece of text labeling information corresponds to a region in the first image sample, referred to below as a second region. That is, the present disclosure may obtain N2 second regions from the text labeling information of the first image sample. The region formed by the solid-line box enclosing "××" in fig. 3 is the second region.
The present disclosure may calculate, for each first region, the intersection ratio between that first region and every second region, so that N1 × N2 intersection ratios are obtained in total. For example, the intersection ratio between the first region and the second region in fig. 3 is calculated as follows: the intersection area between the first region and the second region is shown as the dot-filled area in fig. 4, and the union area between them is shown as the area filled with vertical lines in fig. 4. The intersection ratio between the first region and the second region in fig. 3 may then be the ratio of the dot-filled area to the vertical-line-filled area in fig. 4.
And 2, determining text labeling information corresponding to the detected text box position information according to the overlapping information and preset conditions.
The preset conditions in the present disclosure may include, but are not limited to: the intersection ratio of the first area and the second area is larger than a preset threshold value. For example, the present disclosure may compare each of the obtained N1 × N2 intersection ratios with a preset threshold, so as to obtain the intersection ratios exceeding the preset threshold, and may then use the text labeling information associated with the second area of each such intersection ratio as the text labeling information corresponding to the detected text box position information.
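A sketch of this two-step matching, reusing the TextAnnotation sketch above and assuming (x, y, width, height) boxes; the preset condition is the IoU threshold:

```python
def iou(a, b) -> float:
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ix = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def match_annotations(detected_boxes, annotations, threshold: float):
    """Step 1: compute the overlap of each detected box with every
    annotated box. Step 2: keep the best-overlapping annotation when
    its IoU exceeds the preset threshold."""
    matches = []
    for det in detected_boxes:
        best = max(annotations, key=lambda ann: iou(det, ann.box))
        if iou(det, best.box) > threshold:
            matches.append((det, best))
    return matches
```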
By utilizing the overlapping information and the preset condition, the method and apparatus can conveniently and accurately determine the text labeling information corresponding to the detected text box position information, which helps ensure the accuracy of transferring part of the content of the first image sample's text labeling information and thus the accuracy of the second image sample's text labeling information.
In an optional example, in the present disclosure, the process of setting new text labeling information for the first image sample according to the text box position information and the corresponding text labeling information may be: using the text box position information obtained through detection by the text box detection model, together with the text content labeling information in the determined corresponding text labeling information, as new text labeling information of the first image sample to form a second image sample. For example, in fig. 4, when the intersection ratio between the first region and the second region exceeds the preset threshold T, the text content labeling information "××" associated with the second region may be transferred to the detected first region; that is, the text box position information obtained through detection by the text box detection model and the text content labeling information "××" may be used together as one piece of text labeling information of the second image sample.
By using the text content labeling information in the text labeling information of the first image sample together with the corresponding text box position information obtained through detection by the text box detection model as the text labeling information of the second image sample, the present disclosure can set the text labeling information of the second image sample automatically, conveniently, and accurately, thereby avoiding the labor and time costs of forming image samples through manual labeling, improving the generation efficiency of image samples, and enriching the image samples.
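Combining the matching step with the label transfer described here, a second image sample's text labeling information could be assembled as follows, building on the sketches above:

```python
def form_second_sample_annotations(detected_boxes, first_sample_annotations,
                                   threshold: float):
    """Pair each detected text box position with the text content labeling
    inherited from the matched annotation of the first image sample."""
    return [
        TextAnnotation(box=det, content=ann.content)
        for det, ann in match_annotations(detected_boxes,
                                          first_sample_annotations, threshold)
    ]
```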
In one optional example, the second image sample generated by the present disclosure is used to train a text content recognition model to be trained. An example of training a text content recognition model to be trained is shown in fig. 5.
In fig. 5, the second image sample obtained by the present disclosure may form an image sample set, and the second image sample may also form an image sample set together with the first image sample. The first image sample and the second image sample in the set of image samples may both be referred to as image samples.
S500, obtaining a plurality of image samples from the image sample set, wherein the plurality of image samples may include at least one second image sample. The plurality of image samples may further comprise at least one first image sample.
S501, according to the acquired text box position marking information of each image sample, each image sample is cut, and therefore a plurality of image block samples are obtained.
The method and the device can determine the text content marking information of each image block sample according to the text content marking information of the image samples.
And S502, acquiring a certain number of image block samples from the acquired image block samples according to a preset batch processing size, and providing the acquired image block samples as input to a text content identification model to be trained.
And S503, respectively carrying out text content identification processing on each input image block sample by the text content identification model to be trained.
S504, obtaining the text content recognized by the text content recognition model to be trained aiming at each image block sample according to the recognition result output by the text content recognition model to be trained.
And S505, performing loss calculation by using corresponding loss functions according to the text content respectively recognized by the to-be-trained text content recognition model aiming at each image block sample and the text content labeling information of each image block sample.
And S506, performing back propagation according to the loss obtained by calculation to adjust model parameters of the text content recognition model to be trained, for example, adjusting a convolution kernel weight and/or a matrix weight of the text content recognition model to be trained.
And S507, judging whether a preset iteration condition is reached, if so, going to S508, and if not, returning to S502.
Optionally, the predetermined iteration condition in the present disclosure may include: the text content recognition model to be trained meets the preset difference requirement aiming at the difference between the text content output by the image block sample and the text content marking information of the image block sample. And under the condition that the difference meets the preset difference requirement, successfully training the text content recognition model. The predetermined iteration condition in the present disclosure may also include: and training the text content recognition model to be trained, wherein the number of the used image block samples meets the requirement of a preset number, and the like. And under the condition that the number of the used image block samples meets the requirement of the preset number but the difference does not meet the requirement of the preset difference, the text content identification model to be trained at this time is not trained successfully. The successfully trained text content recognition model can be used for text content detection processing.
And S508, finishing the training process.
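A condensed sketch of the S502 to S507 loop above; the model interface (forward, loss, step, converged) and all names are placeholders assumed for illustration, not an API named in this disclosure:

```python
import random

def train_recognizer(model, block_samples, batch_size: int, max_steps: int):
    """block_samples is a list of (image block, text content labeling) pairs."""
    for _ in range(max_steps):
        batch = random.sample(block_samples, batch_size)   # S502
        images, labels = zip(*batch)
        predicted = model.forward(images)                  # S503, S504
        loss = model.loss(predicted, labels)               # S505
        model.step(loss)   # back-propagate, adjust model parameters (S506)
        if model.converged():                              # S507
            break                                          # S508
    return model
```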
According to the method, the image sample set containing the second image sample is used for training the text content recognition model to be trained, and the text box position information in the text labeling information of the second image sample is obtained by detecting the text box detection model, so that the training process of the text content recognition model can be linked with the detection process of the text box detection model, and the method is favorable for improving the recognition accuracy of the text content recognition model in practical application.
In an alternative example, in a case where the disclosure employs a text box detection model with multiple algorithms to obtain multiple second image samples, the disclosure may use proportionally mixed image block samples in training a text content recognition model to be trained by using a first image sample and a second image sample. Another example of training a text content recognition model to be trained is shown in fig. 6.
In fig. 6, a plurality of image sample sets may be formed from the second image samples obtained by the text box detection models of the various text box detection algorithms according to the present disclosure, with one image sample set corresponding to the second image samples obtained by one text box detection model. In addition, the first image samples form an image sample set of their own.
S600, respectively obtaining a plurality of image samples from each image sample set.
S601, according to the position marking information of the text box of each image sample obtained from different image sample sets, each image sample is cut, and a plurality of image block samples corresponding to each image sample set are obtained.
The method and the device can determine the text content marking information of each image block sample according to the text content marking information of the image samples.
And S602, respectively acquiring a certain number of image block samples from a plurality of image block samples corresponding to each image sample set according to a preset batch processing size and a preset mixing ratio, and respectively taking all the acquired image block samples as input to provide the input to a text content recognition model to be trained.
The preset mixing ratio in the present disclosure may be: a ratio set for the number of image block samples from the first image samples and the number of image block samples from the second image samples corresponding to different algorithms. An example is as follows:
Assuming that there are three text box detection models based on different detection algorithms in the present disclosure, the second image samples obtained by using the text box detection model of each detection algorithm form one image sample set, so that three image sample sets can be obtained, i.e., second image sample set 1, second image sample set 2, and second image sample set 3. The corresponding set of first image samples may be referred to as the first image sample set.
The present disclosure may obtain a certain number of first image samples from a first image sample set, and perform a cropping process on each of the currently obtained first image samples, where the obtained plurality of image block samples form a first image block sample set.
The present disclosure may obtain a certain number of second image samples from the second image sample set 1, and perform a cropping process on each currently obtained second image sample, and the obtained multiple image block samples form a second image block sample set.
The present disclosure may obtain a certain number of second image samples from the second image sample set 2, and perform a cropping process on each currently obtained second image sample, and the obtained multiple image block samples form a third image block sample set.
The present disclosure may obtain a certain number of second image samples from the second image sample set 3, and perform a cropping process on each of the currently obtained second image samples, where the obtained multiple image block samples form a fourth image block sample set.
According to the preset batch processing size, the present disclosure may obtain a certain number of image block samples from the first, second, third, and fourth image block sample sets, respectively, in a preset mixing ratio a1:a2:a3:a4; a code sketch of this proportional drawing is given after the steps below. The sum of a1, a2, a3, and a4 may be 1. In addition, the specific values of a1, a2, a3, and a4 can be set according to actual requirements. For example, if the text box detection model of a certain detection algorithm is used over a wider range, the ratio value corresponding to that text box detection model may be set slightly larger. The present disclosure is not limited thereto.
And S603, the text content identification model to be trained respectively identifies the text content of each input image block sample.
And S604, obtaining the text content recognized by the text content recognition model to be trained aiming at each image block sample according to the recognition result output by the text content recognition model to be trained.
And S605, performing loss calculation by using a corresponding loss function according to the text content respectively recognized by the text content recognition model to be trained aiming at each image block sample and the text content labeling information of each image block sample.
And S606, performing back propagation according to the loss obtained by calculation to adjust model parameters of the text content recognition model to be trained, for example, adjusting convolution kernel weight and/or matrix weight of the text content recognition model to be trained.
And S607, judging whether a preset iteration condition is reached, if so, going to S608, and if not, returning to S602.
Optionally, the predetermined iteration condition in the present disclosure may include: the text content recognition model to be trained meets the preset difference requirement aiming at the difference between the text content output by the image block sample and the text content marking information of the image block sample. And under the condition that the difference meets the preset difference requirement, successfully training the text content recognition model. The predetermined iteration condition in the present disclosure may also include: and training the text content recognition model to be trained, wherein the number of the used image block samples meets the requirement of a preset number, and the like. And under the condition that the number of the used image block samples meets the requirement of a preset number but the difference does not meet the requirement of a preset difference, the text content recognition model to be trained is not trained successfully. The successfully trained text content recognition model can be used for text content detection processing. In addition, if the number of image block samples corresponding to a certain image sample set is insufficient in the process of obtaining image block samples from a plurality of image block samples corresponding to each image sample set according to a preset mixing ratio, the present disclosure may obtain the image block samples with the insufficient number from the image block samples corresponding to other image sample sets, so that the number of the obtained image block samples meets the preset requirement of batch processing size.
And S608, finishing the training process.
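A sketch of the proportional drawing in S602, including the fallback to other sets when one set runs short (described at the end of S607's discussion); the set names and ratios in the usage line are illustrative only:

```python
import random

def draw_mixed_batch(block_sample_sets, ratios, batch_size: int):
    """Draw image block samples from each image block sample set according
    to the preset mixing ratio; if a set cannot supply its share, fill the
    shortfall from the remaining samples of the other sets."""
    shuffled = [random.sample(s, len(s)) for s in block_sample_sets]
    batch, spare = [], []
    for samples, ratio in zip(shuffled, ratios):
        want = round(batch_size * ratio)
        batch.extend(samples[:want])
        spare.extend(samples[want:])
    if len(batch) < batch_size:        # some set was too small
        random.shuffle(spare)
        batch.extend(spare[:batch_size - len(batch)])
    random.shuffle(batch)
    return batch

# e.g. draw_mixed_batch([set1, set2, set3, set4], [0.4, 0.2, 0.2, 0.2], 64)
```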
According to the method and the device, the image block samples from different image samples are obtained according to the preset mixing proportion, so that the training process of the text content recognition model is further favorably linked with the detection process of the text box detection model, and the recognition accuracy of the text content recognition model in practical application is further favorably improved.
Exemplary devices
FIG. 7 is a schematic structural diagram of one embodiment of an apparatus for forming an image sample according to the present disclosure. The apparatus of this embodiment may be used to implement the method embodiments of the present disclosure described above. As shown in fig. 7, the apparatus of this embodiment mainly includes: the device comprises an acquisition module 700, a detection module 701, a determination module 702 and a setting module 703. Optionally, the apparatus may further include: and a training module 704.
The acquisition module 700 is configured to acquire a first image sample. Wherein the first image sample is provided with at least one text label information. The acquisition module 700 may acquire a first image sample from a set of image samples. The operation specifically performed by the obtaining module 700 may refer to the description of S200 in the above method embodiment, and is not described in detail here.
The detecting module 701 is configured to provide the first image sample obtained by the obtaining module 700 to a text box detecting model, and perform text box detecting processing on the first image sample through the text box detecting model to obtain detected text box position information.
Optionally, the detecting module 701 may be further configured to provide the first image sample to a plurality of text box detecting models based on different detecting algorithms, and perform text box detecting processing on the first image sample through the plurality of text box detecting models, respectively, to obtain text box position information detected by each of the plurality of text box detecting models. Optionally, the hyper-parameters corresponding to the multiple text box detection models in the training process have differences. The operation specifically performed by the detection module 701 may refer to the description of S201 in the above method embodiment, and is not described in detail here.
The determination module 702 is configured to determine the text annotation information corresponding to the text box position information detected by the detection module 701.
Optionally, the determination module 702 may include a first sub-module and a second sub-module. The first sub-module is configured to determine text box region overlap information according to the text box position annotation information in each piece of text annotation information of the first image sample and the detected text box position information. The second sub-module is configured to determine the text annotation information corresponding to the detected text box position information according to the overlap information determined by the first sub-module and a preset condition. For the operations specifically performed by the determination module 702, refer to the description of S202 in the method embodiment above; details are not repeated here.
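Purely as an illustration of one way the two sub-modules could be instantiated, the sketch below uses intersection-over-union as the overlap information and a 0.5 threshold as the preset condition; the disclosure fixes neither choice, and the annotation dictionary layout (a "box" key holding an (x1, y1, x2, y2) tuple) is an assumption of this sketch.

```python
def iou(box_a, box_b):
    """Overlap information between two axis-aligned (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
             + (box_b[2] - box_b[0]) * (box_b[3] - box_b[1]) - inter)
    return inter / union if union else 0.0

def match_annotation(detected_box, annotations, threshold=0.5):
    """Return the labeled annotation whose box best overlaps the detected
    box, or None when no overlap meets the preset condition (threshold)."""
    if not annotations:
        return None
    best = max(annotations, key=lambda ann: iou(detected_box, ann["box"]))
    return best if iou(detected_box, best["box"]) >= threshold else None
```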
The setting module 703 is configured to set new text annotation information for the first image sample according to the text box position information detected by the detection module 701 and the corresponding text annotation information determined by the determination module 702, so as to form a second image sample.
Optionally, the setting module 703 may be further configured to use the detected text box position information, together with the text content annotation information contained in the corresponding text annotation information, as the new text annotation information of the first image sample, so as to form the second image sample. For the operations specifically performed by the setting module 703, refer to the description of S203 in the method embodiment above; details are not repeated here.
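To make the relabeling concrete, here is a minimal sketch of forming one piece of new text annotation information: the detected box replaces the position, while every other annotation field is inherited from the matched annotation of the first image sample. The dictionary keys are illustrative assumptions of this sketch.

```python
def relabel(matched_annotation, detected_box):
    """Build one piece of new text annotation information for the second
    image sample: detected position + inherited content annotation."""
    new_annotation = dict(matched_annotation)   # inherit text content, etc.
    new_annotation["box"] = list(detected_box)  # replace the box position
    return new_annotation
```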
The training module 704 is configured to train the text content recognition model to be trained by using the second image sample. Optionally, the training module 704 may include a third sub-module, a fourth sub-module, a fifth sub-module, and a sixth sub-module. The third sub-module is configured to cut image block samples containing text content out of the first image sample and the second image samples according to the text annotation information of the first image sample and the text annotation information of the second image samples respectively obtained via the plurality of text box detection models based on different detection algorithms. The fourth sub-module is configured to acquire, according to a preset mixing ratio, image block samples from the first image sample and image block samples from the second image samples corresponding to the different detection algorithms. The fifth sub-module is configured to provide the acquired image block samples to the text content recognition model to be trained, and to perform text content recognition processing on each image block sample via that model to obtain a plurality of recognized text contents. The sixth sub-module is configured to adjust the model parameters of the text content recognition model to be trained according to the differences between the plurality of recognized text contents and the text content annotation information in the text annotation information. For the operations specifically performed by the training module 704, refer to the descriptions above of FIG. 5 and FIG. 6; details are not repeated here.
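The third sub-module's cropping step can be pictured with the following sketch, which cuts each annotated text region out of an image with Pillow; the annotation layout and the crop_patches name are assumptions of this sketch. The resulting patches, pooled per detection algorithm, are exactly what the assemble_batch sketch earlier would mix according to the preset ratio.

```python
from PIL import Image

def crop_patches(image_path, annotations):
    """Cut image block samples containing text content out of one image
    sample according to its text annotation information."""
    image = Image.open(image_path)
    patches = []
    for ann in annotations:
        # Image.crop expects a (left, upper, right, lower) tuple.
        patch = image.crop(tuple(ann["box"]))
        patches.append({"patch": patch, "text": ann["text"]})
    return patches
```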
Exemplary electronic device
An electronic device according to an embodiment of the present disclosure is described below with reference to fig. 8. FIG. 8 shows a block diagram of an electronic device in accordance with an embodiment of the disclosure. As shown in fig. 8, the electronic device 81 includes one or more processors 811 and memory 812.
The processor 811 may be a central processing unit (CPU) or another form of processing unit having data processing capability and/or instruction execution capability, and may control other components in the electronic device 81 to perform desired functions.
The memory 812 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, random access memory (RAM) and/or cache memory. The non-volatile memory may include, for example, read-only memory (ROM), a hard disk, flash memory, and the like. One or more computer program instructions may be stored on the computer-readable storage medium and executed by the processor 811 to implement the methods of forming image samples of the various embodiments of the present disclosure described above and/or other desired functions. Various contents such as input signals, signal components, and noise components may also be stored on the computer-readable storage medium.
In one example, the electronic device 81 may further include an input device 813, an output device 814, and the like, interconnected by a bus system and/or another form of connection mechanism (not shown). The input device 813 may include, for example, a keyboard, a mouse, and the like. The output device 814 may output various information to the outside and may include, for example, a display, a speaker, a printer, and a communication network and its connected remote output devices.
Of course, for simplicity, only some of the components of the electronic device 81 relevant to the present disclosure are shown in fig. 8, and components such as buses, input/output interfaces, and the like are omitted. In addition, the electronic device 81 may include any other suitable components, depending on the particular application.
Exemplary computer program product and computer-readable storage medium
In addition to the methods and apparatus described above, embodiments of the present disclosure may also be a computer program product comprising computer program instructions that, when executed by a processor, cause the processor to perform the steps in the methods of forming an image sample according to various embodiments of the present disclosure described in the "exemplary methods" section above in this specification.
The computer program product may write program code for carrying out operations of embodiments of the present disclosure in any combination of one or more programming languages, including object-oriented programming languages such as Java or C++ and conventional procedural programming languages such as the "C" language or similar languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present disclosure may also be a computer-readable storage medium having stored thereon computer program instructions that, when executed by a processor, cause the processor to perform steps in a method of forming an image sample according to various embodiments of the present disclosure described in the "exemplary methods" section above in this specification.
The computer-readable storage medium may take any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium may include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The foregoing describes the general principles of the present disclosure in conjunction with specific embodiments. It should be noted, however, that the advantages, effects, and the like mentioned in the present disclosure are merely examples, not limitations, and should not be considered essential to the various embodiments of the present disclosure. Furthermore, the foregoing disclosure of specific details is for purposes of illustration and description only and is not intended to be limiting, since the disclosure is not limited to the specific details described above.
In the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts in the embodiments are referred to each other. For the system embodiment, since it basically corresponds to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The block diagrams of devices, apparatuses, and systems referred to in this disclosure are given only as illustrative examples and are not intended to require or imply that connections, arrangements, or configurations must be made in the manner shown in the block diagrams. These devices, apparatuses, and systems may be connected, arranged, or configured in any manner, as will be appreciated by those skilled in the art. Words such as "including," "comprising," "having," and the like are open-ended words that mean "including, but not limited to," and are used interchangeably therewith. The word "or" as used herein means, and is used interchangeably with, the word "and/or," unless the context clearly dictates otherwise. The phrase "such as" is used herein to mean, and is used interchangeably with, the phrase "such as, but not limited to."
The methods and apparatus of the present disclosure may be implemented in a number of ways. For example, the methods and apparatus of the present disclosure may be implemented by software, hardware, firmware, or any combination of software, hardware, and firmware. The above-described order for the steps of the method is for illustration only, and the steps of the method of the present disclosure are not limited to the order specifically described above unless specifically stated otherwise. Further, in some embodiments, the present disclosure may also be embodied as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present disclosure. Thus, the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.
It is also noted that in the devices, apparatuses, and methods of the present disclosure, each component or step can be decomposed and/or recombined. These decompositions and/or recombinations are to be considered equivalents of the present disclosure.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the disclosure. Thus, the present disclosure is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, the description is not intended to limit embodiments of the disclosure to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.

Claims (16)

1. A method of forming an image sample, comprising:
acquiring a first image sample, wherein the first image sample is provided with at least one piece of text labeling information;
providing the first image sample to a text box detection model, and performing text box detection processing on the first image sample through the text box detection model to obtain detected text box position information;
determining text labeling information corresponding to the detected text box position information;
and setting new text annotation information for the first image sample according to the text box position information and the corresponding text annotation information to form a second image sample, wherein annotation information other than the text box position information in the text annotation information of the second image sample is inherited from the text annotation information of the first image sample.
2. The method of claim 1, wherein the providing the first image sample to a text box detection model, performing text box detection processing on the first image sample via the text box detection model, and obtaining detected text box position information comprises:
and providing the first image sample to a plurality of text box detection models based on different detection algorithms, and respectively carrying out text box detection processing on the first image sample through the plurality of text box detection models to obtain text box position information detected by each of the plurality of text box detection models.
3. The method of claim 2, wherein the hyper-parameters corresponding to the plurality of text box detection models during training are different from one another.
4. The method of forming an image sample of any of claims 1 to 3, wherein said determining text annotation information corresponding to said detected text box location information comprises:
determining text box region overlap information according to the text box position annotation information in each piece of text annotation information of the first image sample and the detected text box position information;
and determining the text annotation information corresponding to the detected text box position information according to the overlap information and a preset condition.
5. The method for forming image samples according to any one of claims 1 to 4, wherein the setting of new text label information for a first image sample according to the text box position information and the corresponding text label information to form a second image sample comprises:
and using the detected text box position information, together with the text content annotation information in the corresponding text annotation information, as the new text annotation information of the first image sample to form the second image sample.
6. The method of forming an image sample of any of claims 1 to 5, wherein the method further comprises:
and training the text content recognition model to be trained by utilizing the second image sample.
7. The method for forming image samples according to claim 6, wherein the training the text content recognition model to be trained by using the second image sample comprises:
according to the text labeling information of the first image sample and the text labeling information of the second image sample respectively obtained by a plurality of text box detection models based on different detection algorithms, cutting out an image block sample containing text content from the first image sample and the second image sample;
acquiring image block samples from a first image sample and image block samples from a second image sample corresponding to different detection algorithms according to a preset mixing ratio;
providing the obtained image block samples to a text content recognition model to be trained, and performing text content recognition processing on each image block sample via the text content recognition model to be trained to obtain a plurality of recognized text contents;
and adjusting the model parameters of the text content recognition model to be trained according to the differences between the plurality of recognized text contents and the text content annotation information in the text annotation information.
8. An apparatus for forming an image sample, wherein the apparatus comprises:
the system comprises an acquisition module, a processing module and a display module, wherein the acquisition module is used for acquiring a first image sample, and the first image sample is provided with at least one piece of text labeling information;
the detection module is used for providing the first image sample acquired by the acquisition module for a text box detection model, and performing text box detection processing on the first image sample through the text box detection model to acquire detected text box position information;
the determining module is used for determining text labeling information corresponding to the text box position information detected by the detecting module;
and the setting module is used for setting new text annotation information for the first image sample according to the text box position information detected by the detection module and the corresponding text annotation information determined by the determination module to form a second image sample, wherein other annotation information except the text box position information in the text annotation information of the second image sample is inherited from the text annotation information of the first image sample.
9. The apparatus for forming an image sample of claim 8, wherein the detection module is further configured to:
and providing the first image sample to a plurality of text box detection models based on different detection algorithms, and respectively carrying out text box detection processing on the first image sample through the plurality of text box detection models to obtain the text box position information detected by the plurality of text box detection models.
10. The apparatus for forming an image sample as recited in claim 9, wherein the hyper-parameters corresponding to the plurality of text box detection models during training are different from one another.
11. The apparatus for forming an image sample according to any one of claims 8 to 10, wherein the determining means comprises:
the first sub-module is used for determining text box region overlap information according to the text box position annotation information in each piece of text annotation information of the first image sample and the detected text box position information;
and the second sub-module is used for determining the text annotation information corresponding to the detected text box position information according to the overlap information determined by the first sub-module and a preset condition.
12. An apparatus for forming an image sample according to any of claims 8 to 11, wherein the setup module is further configured to:
and using the detected text box position information, together with the text content annotation information in the corresponding text annotation information, as the new text annotation information of the first image sample to form the second image sample.
13. An apparatus for forming an image sample according to any one of claims 8 to 12, wherein the apparatus further comprises:
and the training module is used for training the text content recognition model to be trained by utilizing the second image sample.
14. An apparatus for forming an image specimen as recited in claim 13, wherein said training module comprises:
the third sub-module is used for cutting out image block samples containing text contents from the first image samples and the second image samples according to the text labeling information of the first image samples and the text labeling information of the second image samples respectively obtained by a plurality of text box detection models based on different detection algorithms;
the fourth sub-module is used for acquiring image block samples from the first image samples and image block samples from the second image samples corresponding to different detection algorithms according to a preset mixing ratio;
the fifth sub-module is used for providing the obtained image block samples to a text content recognition model to be trained, and performing text content recognition processing on each image block sample via the text content recognition model to be trained to obtain a plurality of recognized text contents;
and the sixth sub-module is used for adjusting the model parameters of the text content recognition model to be trained according to the differences between the plurality of recognized text contents and the text content annotation information in the text annotation information.
15. A computer-readable storage medium, wherein the storage medium stores a computer program for performing the method of any one of claims 1 to 7.
16. An electronic device, the electronic device comprising:
a processor;
a memory for storing the processor-executable instructions;
the processor is configured to read the executable instructions from the memory and execute the instructions to implement the method of any one of claims 1-7.
CN201910717086.2A 2019-08-05 2019-08-05 Method, apparatus, medium, and electronic device for forming image sample Active CN110533079B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910717086.2A CN110533079B (en) 2019-08-05 2019-08-05 Method, apparatus, medium, and electronic device for forming image sample

Publications (2)

Publication Number Publication Date
CN110533079A CN110533079A (en) 2019-12-03
CN110533079B (en) 2022-05-24

Family

ID=68661690

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910717086.2A Active CN110533079B (en) 2019-08-05 2019-08-05 Method, apparatus, medium, and electronic device for forming image sample

Country Status (1)

Country Link
CN (1) CN110533079B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111340023B (en) * 2020-02-24 2022-09-09 创新奇智(上海)科技有限公司 Text recognition method and device, electronic equipment and storage medium
CN111353458B (en) * 2020-03-10 2023-08-18 腾讯科技(深圳)有限公司 Text box labeling method, device and storage medium
CN113762292B (en) * 2020-06-03 2024-02-02 杭州海康威视数字技术股份有限公司 Training data acquisition method and device and model training method and device
CN112580637B (en) * 2020-12-31 2023-05-12 苏宁金融科技(南京)有限公司 Text information identification method, text information extraction method, text information identification device, text information extraction device and text information extraction system

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108090400A (en) * 2016-11-23 2018-05-29 中移(杭州)信息技术有限公司 A kind of method and apparatus of image text identification
CN108229397A (en) * 2018-01-04 2018-06-29 华南理工大学 Method for text detection in image based on Faster R-CNN
CN108495185A (en) * 2018-03-14 2018-09-04 北京奇艺世纪科技有限公司 A kind of video title generation method and device
CN108920707A (en) * 2018-07-20 2018-11-30 百度在线网络技术(北京)有限公司 Method and device for markup information
CN109871915A (en) * 2018-12-19 2019-06-11 深圳市欧珀软件科技有限公司 Material information mask method and labeling system
CN109919106A (en) * 2019-03-11 2019-06-21 同济大学 Gradual target finely identifies and description method
CN109934227A (en) * 2019-03-12 2019-06-25 上海兑观信息科技技术有限公司 System for recognizing characters from image and method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"结合边缘检测与CNN 分类场景文本检测的研究";张哲;《现代计算机》;20190531(第13期);摘要、第2-3节 *

Similar Documents

Publication Publication Date Title
CN110533079B (en) Method, apparatus, medium, and electronic device for forming image sample
US11854283B2 (en) Method and apparatus for visual question answering, computer device and medium
CN109117831B (en) Training method and device of object detection network
US10769496B2 (en) Logo detection
WO2019119966A1 (en) Text image processing method, device, equipment, and storage medium
CN109308681B (en) Image processing method and device
US10423827B1 (en) Image text recognition
WO2018054326A1 (en) Character detection method and device, and character detection training method and device
WO2018010657A1 (en) Structured text detection method and system, and computing device
US20200273078A1 (en) Identifying key-value pairs in documents
CN109684005B (en) Method and device for determining similarity of components in graphical interface
US10657359B2 (en) Generating object embeddings from images
RU2634195C1 (en) Method and device for determining document suitability for optical character recognition (ocr)
US20170185913A1 (en) System and method for comparing training data with test data
CN111222368B (en) Method and device for identifying document paragraphs and electronic equipment
CN111832396A (en) Document layout analysis method and device, electronic equipment and storage medium
CN117420974B (en) Thermal printer page mode control method and related device
US9047528B1 (en) Identifying characters in grid-based text
CN111738252A (en) Method and device for detecting text lines in image and computer system
CN109919214B (en) Training method and training device for neural network model
US20220171966A1 (en) Information processing system, information processing method, and information storage medium
CN110363189B (en) Document content restoration method and device, electronic equipment and readable storage medium
CN116468970A (en) Model training method, image processing method, device, equipment and medium
US20230115091A1 (en) Method and system for providing signature recognition and attribution service for digital documents
CN117859122A (en) AI-enhanced audit platform including techniques for automated document processing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant