CN111062385A - Network model construction method and system for image text information detection - Google Patents

Network model construction method and system for image text information detection

Info

Publication number
CN111062385A
CN111062385A CN201911127868.7A
Authority
CN
China
Prior art keywords
training sample
text information
text
sample image
image
Prior art date
Legal status
Pending
Application number
CN201911127868.7A
Other languages
Chinese (zh)
Inventor
周康明
冯晓锐
Current Assignee
Shanghai Eye Control Technology Co Ltd
Original Assignee
Shanghai Eye Control Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Eye Control Technology Co Ltd filed Critical Shanghai Eye Control Technology Co Ltd
Priority to CN201911127868.7A
Publication of CN111062385A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/62Text, e.g. of license plates, overlay texts or captions on TV images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition


Abstract

The method comprises: acquiring label information of each training sample image; extracting corresponding text information based on the label information; determining a text feature matrix corresponding to the text information; splicing the text feature matrix with the corresponding training sample image to obtain a spliced training sample image; and inputting the spliced training sample image into an improved network model for training until a preset training threshold is met. By combining the text information features in the image, the method can detect smaller text information in the image with higher detection precision, and the constructed network model has better generalization capability, which expands its range of application.

Description

Network model construction method and system for image text information detection
Technical Field
The application relates to the technical field of computer image processing, in particular to a network model construction technology for image text information detection.
Background
At present, a common method for detecting text information in an image is to use a target detection neural network. The SSD (Single Shot MultiBox Detector, a target detection method that requires only a single deep neural network) performs well in simple text information detection. However, if many similar pieces of text information appear together in images, for example forms with similar formats in which only some frame lines or contents differ, the SSD shows only mediocre detection performance. And if a separate SSD is trained for each type of image among the various images with similar text information, resources are wasted and efficiency is low.
Disclosure of Invention
The application aims to provide a network model construction method and system for image text information detection.
According to one aspect of the application, a network model construction method for image text information detection is provided, wherein the method comprises the following steps:
acquiring label information of each training sample image;
extracting corresponding text information based on the label information;
determining a text characteristic matrix corresponding to the text information based on the text information;
splicing the text feature matrix and the corresponding training sample image to obtain a spliced training sample image;
and inputting the spliced training sample images into an improved network model for training until a preset training threshold value is met.
Preferably, the method further comprises:
acquiring characters corresponding to text information in all training sample images;
establishing a character library based on the characters, wherein the characters contained in the character library are not repeated, and each character corresponds to a unique character label;
wherein the determining the text feature matrix corresponding to the text information based on the text information comprises:
determining a corresponding character based on the text information;
and determining a text feature matrix corresponding to the text information based on the character label of each character in the character library.
Preferably, when a new training sample image is added, it is judged whether the text information of the new training sample image contains characters which do not exist in the character library;
if so, the characters which do not exist in the character library are added to the character library so as to update the character library.
Preferably, if the training sample image is a three-channel color image, the text feature matrix is spliced with the corresponding training sample image to obtain a four-channel spliced training sample image; and if the training sample image is a single-channel gray scale image, splicing the text feature matrix and the corresponding training sample image to obtain a spliced training sample image of two channels.
Preferably, the improved network model is based on an SSD neural network, and comprises 1 data layer, a VGG-16 base network, and 6 convolutional layers behind the VGG-16 base network.
According to another aspect of the present application, there is provided a network model construction system for image text information detection, wherein the system comprises:
the label information acquisition module is used for acquiring label information of each training sample image;
the text information extraction module is used for extracting corresponding text information based on the label information acquired by the label information acquisition module;
the text characteristic matrix module is used for determining a text characteristic matrix corresponding to the text information based on the text information extracted by the text information extraction module;
the splicing module is used for splicing the text characteristic matrix and the corresponding training sample image to obtain a spliced training sample image;
and the network model construction module is used for inputting the spliced training sample images into an improved network model for training until a preset training threshold value is met.
Compared with the prior art, the method for constructing the network model for detecting the image text information comprises the steps of firstly obtaining label information of each training sample image, then extracting corresponding text information based on the label information, then determining a text characteristic matrix corresponding to the text information based on the text information, then splicing the text characteristic matrix and the corresponding training sample image to obtain a spliced training sample image, and finally inputting the spliced training sample image into the improved network model for training until a preset training threshold value is met. The method combines the text information characteristics in the image, can realize the detection of the smaller text information in the image, has higher detection precision, and the constructed network model has better generalization capability, thereby expanding the application range of the constructed network model.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments made with reference to the following drawings:
FIG. 1 illustrates a flow diagram of a network model construction method for image text information detection in accordance with an aspect of the subject application;
FIG. 2 illustrates a block diagram of a network model construction system for image text information detection, in accordance with another aspect of the subject application;
the same or similar reference numbers in the drawings identify the same or similar elements.
Detailed Description
The present invention is described in further detail below with reference to the attached drawing figures.
In a typical configuration of the present application, each module and trusted party of the system includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read-Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transient media), such as modulated data signals and carrier waves.
In order to further explain the technical means and effects adopted by the present application, the following description clearly and completely describes the technical solution of the present application with reference to the accompanying drawings and preferred embodiments.
Fig. 1 is a flowchart illustrating a network model construction method for image text information detection according to an aspect of the present application, where the method of an embodiment includes:
s11, acquiring label information of each training sample image;
s12 extracting corresponding text information based on the label information;
s13, determining a text feature matrix corresponding to the text information based on the text information;
s14, splicing the text feature matrix and the corresponding training sample image to obtain a spliced training sample image;
and S15, inputting the spliced training sample images into an improved network model for training until a preset training threshold value is met.
in the present application, the method is performed by a device 1, the device 1 is a computer device and/or a cloud, the computer device includes but is not limited to a personal computer, a notebook computer, an industrial computer, a network host, a single network server, a plurality of network server sets; the Cloud is made up of a large number of computers or web servers based on Cloud Computing (Cloud Computing), which is a type of distributed Computing, a virtual supercomputer consisting of a collection of loosely coupled computers.
The computer device and/or cloud are merely examples, and other existing or future devices and/or resource sharing platforms, as applicable to the present application, are also intended to be included within the scope of the present application and are hereby incorporated by reference.
In this embodiment, in step S11, the apparatus 1 acquires the label information of each training sample image. The training sample image may include text information of a form type or the like. The training sample image may be obtained by uploading it to the apparatus 1 after photographing or scanning, or by processing paper text information through a camera/scanner carried by the apparatus 1; the obtaining manner is merely an example, and other manners applicable to this application should also be included in its scope.
The label information may be embodied in a file, such as an xml file or other format file, where the file may include contents such as an image file name and a category corresponding to the training sample image, text information in the training sample image, and coordinate information corresponding to the text information in the image.
Preferably, the label information at least includes text information of the training sample image and coordinate information corresponding to the text information.
Continuing in this embodiment, in the step S12, corresponding text information is extracted based on the tag information. And analyzing the label information, and extracting corresponding text information from the label information.
In this embodiment, in step S13, a text feature matrix corresponding to the text information is determined based on the text information.
The text feature matrix is used to represent the text information, for example, the text information may be divided into a plurality of segments, and the plurality of segments are placed in a matrix representing text features, or the text information may be represented in the form of a single character, and each character is placed in the matrix, or features that correspond to the text information one to one are obtained, and the features are arranged in a text feature matrix, and the like.
Preferably, characters corresponding to text information in all training sample images are acquired, and a character library is established based on the characters, wherein the characters included in the character library are not repeated, and each character corresponds to a unique character label, wherein the step S13 includes:
determining a corresponding character based on the text information;
and determining a text feature matrix corresponding to the text information based on the character label of each character in the character library.
The characters of the text information in all the training sample images in the training sample image set can be counted manually, or counted and sorted automatically through software tools in the apparatus 1 or other devices. When the set of training sample images is fixed, the character library is fixed; for example, the character library contains n characters, each of which corresponds to a uniquely determined character label.
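As an illustrative sketch (the function and variable names are assumptions, not from the patent), the character library described above might be maintained as follows, with labels assigned in first-seen order starting from 0 so that existing labels never change when new samples are added:

```python
def build_char_library(all_text_infos):
    """Build a character library mapping each unique character to a label.

    Characters are numbered in first-seen order; no character is repeated,
    and each character gets a unique label.
    """
    char_to_label = {}
    for text in all_text_infos:
        for ch in text:
            if ch not in char_to_label:
                char_to_label[ch] = len(char_to_label)
    return char_to_label

def update_char_library(char_to_label, new_text):
    """Append characters not yet in the library behind the last entry,
    leaving labels of existing characters (and hence existing text
    feature matrices) unchanged."""
    for ch in new_text:
        if ch not in char_to_label:
            char_to_label[ch] = len(char_to_label)
    return char_to_label
```

Because updates only append, the text feature matrices of the original training sample images are unaffected, as the description requires.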
The text feature matrix can embody the corresponding text information. Each character in the text information can be one-hot encoded. One-hot encoding, also called one-bit-effective encoding, converts a categorical variable into a form that is easy for machine-learning algorithms to use, with only one active position at a time.
One-hot encoding expresses discrete features as extended feature vectors, so that distance calculations between features become more reasonable.
For example, consider the text information "capital amount" (four characters, literally "big", "write", "gold", "forehead") in a training sample image. After each character traverses the characters in the character library, it has a unique corresponding label in the library; assume the characters correspond to the a-th, b-th, c-th and d-th positions in the character library, that is, the label corresponding to "big" is a, the label corresponding to "write" is b, the label corresponding to "gold" is c, and the label corresponding to "forehead" is d. Through one-hot encoding, an M×N text feature matrix can be obtained, where M×N is determined by the pixel size of the input image supported by the trained network model and N is larger than the label corresponding to each character. The a-th column of the first row of the text feature matrix is set to 1 and the remaining columns of that row are set to 0; the b-th column of the second row is set to 1 and the rest of that row to 0; the c-th column of the third row is set to 1 and the rest of that row to 0; the d-th column of the fourth row is set to 1 and the rest of that row to 0. An M×N text feature matrix corresponding to the text information "capital amount" is thus finally obtained through one-hot encoding, where M×N is the pixel size of the training sample image after size adjustment.
In particular, if the number n of characters in the character library is greater than the number N of columns of the text feature matrix, then for text information containing characters whose labels lie in the range (N, n], the one-hot encoding of each character spans two rows. For example, if the text information of a training sample contains 4 characters corresponding to the e-th, f-th, g-th and h-th positions in the character library, where the labels e, f and g of the first three characters are all smaller than N and the label h of the 4th character is larger than N, an M×N text feature matrix can be obtained by one-hot encoding: the e-th column of the first row is set to 1, and the remaining columns of that row and the whole second row are set to 0; the f-th column of the third row is set to 1, and the rest of that row and the whole fourth row are set to 0; the g-th column of the fifth row is set to 1, and the rest of that row and the whole sixth row are set to 0; the seventh row is set to 0 in its entirety, the (h−N)-th column of the eighth row is set to 1, and the remaining columns of that row are set to 0.
In particular, if a training sample image contains multiple segments of text information, such as a capital amount and a lowercase amount, then after each character traverses the characters in the character library it has a unique corresponding label. The two segments contain 5 non-repeating characters; assume they correspond to the a-th, b-th, c-th, d-th and e-th positions in the character library, that is, the label corresponding to "big" is a, the label corresponding to "write" is b, the label corresponding to "gold" is c, the label corresponding to "forehead" is d, and the label corresponding to "small" is e. An M×N text feature matrix is obtained through one-hot encoding, where N is larger than the label corresponding to each character: the a-th column of the first row is set to 1 and the remaining columns of that row to 0; the b-th column of the second row to 1 and the rest of that row to 0; the c-th column of the third row to 1 and the rest of that row to 0; the d-th column of the fourth row to 1 and the rest of that row to 0; the e-th column of the fifth row to 1 and the rest of that row to 0; the b-th column of the sixth row to 1 and the rest of that row to 0; the c-th column of the seventh row to 1 and the rest of that row to 0; and the d-th column of the eighth row to 1 and the rest of that row to 0. An M×N text feature matrix corresponding to the text information of the capital amount and the lowercase amount is finally obtained through one-hot encoding.
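The one-row-per-character encoding above can be sketched as follows. This is a minimal illustration, assuming every character label is smaller than the column count N (the two-row scheme for labels larger than N is omitted for brevity), and all names are illustrative:

```python
import numpy as np

def text_feature_matrix(texts, char_to_label, M, N):
    """One-hot encode the characters of the text information into an MxN
    matrix: one row per character, with the rows of successive text
    segments following one another, as in the capital/lowercase amount
    example. Assumes every character label is < N."""
    mat = np.zeros((M, N), dtype=np.float32)
    row = 0
    for text in texts:
        for ch in text:
            label = char_to_label[ch]
            mat[row, label] = 1.0  # set the label-th column of this row
            row += 1
    return mat
```

Each row of the resulting matrix has at most one active position, matching the one-hot property described above.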
Preferably, when a new training sample image is added, it is judged whether the text information of the new training sample image contains characters which do not exist in the character library;
if so, the characters which do not exist in the character library are added to the character library so as to update the character library.
If the characters contained in the text information in the newly added training sample images are not in the character library, the new characters are automatically added behind the last character in the character library, so that the characters of the text information in each training sample image have unique corresponding labels in the updated character library, and the text feature matrix corresponding to the original training sample image is not influenced.
In this embodiment, in step S14, the text feature matrix is spliced with the corresponding training sample image to obtain a spliced training sample image.
Before the corresponding training sample image is spliced with the text feature matrix, conventional parameters such as brightness, chroma, contrast and divergence can be transformed and conventional data augmentation performed, which increases sample diversity and improves the robustness of the trained network model.
After these conventional parameter transformations and data augmentation, the pixel size of the corresponding training sample image is adjusted to M×N to meet the requirement of the trained network model on the input image size, on the premise that the features of the text information in the training sample image are not weakened.
The text feature matrix of the training sample image is the same as that before parameter transformation and data augmentation.
Preferably, the text feature matrix is spliced with the corresponding training sample image to obtain a spliced training sample image, and if the corresponding training sample image is a three-channel color image, the text feature matrix is spliced with the corresponding training sample image to obtain a four-channel spliced training sample image; and if the training sample image is a single-channel gray scale image, splicing the text feature matrix and the corresponding training sample image to obtain a spliced training sample image of two channels.
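The channel splicing can be sketched with a simple array concatenation, assuming the image is stored as H×W×C (or H×W for grayscale) with H = M and W = N; the function name is illustrative:

```python
import numpy as np

def splice(image, text_feature_matrix):
    """Concatenate the MxN text feature matrix to the image as an extra
    channel: a 3-channel colour image becomes 4 channels, a
    single-channel grayscale image becomes 2 channels."""
    if image.ndim == 2:                      # grayscale HxW -> HxWx1
        image = image[:, :, None]
    extra = text_feature_matrix[:, :, None]  # MxN -> MxNx1
    return np.concatenate([image, extra], axis=2)
```

The spliced result is what the data layer of the improved network model receives as input.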
Continuing in this embodiment, in step S15, the stitched training sample image is input into an improved network model for training until a preset training threshold is met.
Preferably, the improved network model is based on an SSD neural network, and comprises 1 data layer, a VGG-16 base network, and 6 convolutional layers behind the VGG-16 base network.
Wherein the improved network model adapts the stitched training sample image input based on a data layer of an SSD target detection neural network.
The VGG-16 basic network comprises 13 convolutional layers and 5 pooling layers, wherein the convolutional layers extract features of different dimensions, and the pooling layers perform feature dimension reduction to reduce the amount of computation.
The 6 convolutional layers behind the VGG-16 basic network are divided into 3 groups, each group of 2 convolutional layers, the convolutional kernel size of one convolutional layer is 1 x 1, and the step length is 1; the convolution kernel size of the other convolution layer is 3 x 3 with a step size of 2. These 6 convolutional layers are used to further extract features to obtain a feature map.
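The effect of the 3 groups of extra convolutional layers on the feature-map size can be sketched with the standard convolution output-size formula. Padding 1 on the 3×3 stride-2 convolutions is an assumption here (the patent does not state the padding); names are illustrative:

```python
def conv_out(size, kernel, stride, pad):
    """Standard convolution output size: (size + 2*pad - kernel)//stride + 1."""
    return (size + 2 * pad - kernel) // stride + 1

def extra_layer_sizes(size):
    """Spatial size after each of the 3 groups of extra conv layers
    (a 1x1 stride-1 conv followed by a 3x3 stride-2 conv per group).
    The 1x1 conv leaves the size unchanged; the 3x3 stride-2 conv
    roughly halves it, so detection branches see ever larger receptive
    fields."""
    sizes = []
    for _ in range(3):
        size = conv_out(size, 1, 1, 0)  # 1x1, stride 1
        size = conv_out(size, 3, 2, 1)  # 3x3, stride 2, pad 1 (assumed)
        sizes.append(size)
    return sizes
```

Starting from a 19×19 map, the three groups would yield 10×10, 5×5 and 3×3 maps under these assumptions, which is why later branches predict larger targets.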
Specifically, in the feature extraction process, 5 detection branches are respectively led out, wherein 2 detection branches are led out from the VGG-16 basic network, and 3 detection branches are led out from 6 convolutional layers behind the VGG-16 basic network. Because the receptive fields of different convolutional layers are different, different detection branches detect texts with different scales to obtain a prediction frame of a target text region, a low layer (convolutional layer close to the input end) predicts a small target, a high layer (convolutional layer close to the output end) predicts a large target, and each pixel point on the feature map generates a prediction frame with different aspect ratios.
The number, the size and the aspect ratio of the prediction frames can be set according to specific scenes, and the detection efficiency of the network model can be improved by selecting different numbers of prediction frames and setting the aspect ratios of the different prediction frames. Since the text information areas in the training sample images of the present application are substantially rectangular, for example, 5 prediction boxes may be selected, and the aspect ratios are respectively set to (1/1,1/2,1/3,1/4,1/5), which is only an example, and the selection of parameters such as the number of other prediction boxes and the aspect ratio should also be included in the protection scope of the present application as applicable to the present application. The SSD classifies and regresses all prediction frames of each pixel point of the feature map, and the prediction frames are used as the reference, so that the training difficulty is reduced to a certain extent.
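A sketch of how the 5 prediction boxes with aspect ratios (1/1, 1/2, 1/3, 1/4, 1/5) might be generated at one feature-map position. Interpreting the ratio as height/width (so ratios below 1 give the wide, flat boxes that suit rectangular text regions) is an assumption, as are the names:

```python
def prediction_boxes(cx, cy, base_size, aspect_ratios=(1, 1/2, 1/3, 1/4, 1/5)):
    """Generate prediction (default) boxes centred at (cx, cy).

    With ratio = height/width, width = base/sqrt(ratio) and
    height = base*sqrt(ratio), so the box area stays base_size**2
    while the shape flattens as the ratio decreases.
    """
    boxes = []
    for ar in aspect_ratios:
        w = base_size / (ar ** 0.5)
        h = base_size * (ar ** 0.5)
        boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes
```

Every pixel point of a feature map would get such a set of boxes, which the SSD then classifies and regresses.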
Specifically, in the training process, the SSD searches for the real label (that is, the coordinates of the target box) with the largest IOU (Intersection over Union) for each prediction box. If a real label can be matched, the prediction box is considered a positive sample. For all remaining prediction boxes not matched to a real label, if the IOU of a prediction box with some real label is greater than a preset confidence threshold (generally 0.5, adjustable according to the actual training situation), the prediction box is considered matched to that real label and is also taken as a positive sample; if its IOU with every real label is smaller than the preset confidence threshold, the prediction box is considered a negative sample, i.e. background. In the actual training process, in order to balance the proportion of positive and negative samples and make training converge more easily, an HNM (Hard Negative Mining) strategy can be adopted: the negative samples are sorted by confidence error (the smaller the confidence of a negative sample, the larger its error), and a certain number of prediction boxes with larger errors, for example the top-k boxes in the sorted order, are selected as training negative samples.
Ideally, the ratio of positive samples to negative samples is 1:1, but in the actual training process it is difficult to ensure this ideal situation, and negative samples are usually more numerous. If the ratio of negative to positive samples is seriously unbalanced, the training effect of the network is often poor: for example, with 1 positive sample and 10 negative samples, the emphasis during network training may fall on the negative samples, while it is the positive samples that really need to be learned.
By selecting positive and negative samples according to the IOU and the confidence in this way, with a positive-to-negative ratio of generally 1:3, network training converges quickly, the training result is stable, and a good effect is obtained.
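The matching and hard-negative-mining steps above can be sketched as follows. This is a simplified illustration (boxes as (x1, y1, x2, y2) tuples; names, the argument shapes, and passing per-box confidence errors directly are assumptions):

```python
def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def match_boxes(pred_boxes, gt_boxes, conf_errors, iou_thresh=0.5, neg_pos_ratio=3):
    """Split prediction boxes into positives (best IoU with a real label
    above the threshold) and hard negatives (largest confidence error
    first, keeping roughly a 1:neg_pos_ratio positive:negative ratio)."""
    positives, negatives = [], []
    for i, pb in enumerate(pred_boxes):
        best = max((iou(pb, gt) for gt in gt_boxes), default=0.0)
        (positives if best > iou_thresh else negatives).append(i)
    # hard negative mining: keep only the top-k negatives by error
    negatives.sort(key=lambda i: conf_errors[i], reverse=True)
    k = max(1, neg_pos_ratio * len(positives))
    return positives, negatives[:k]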
The corresponding real label is matched for each positive-sample prediction box and decoded to obtain the real position on the corresponding training sample image (clipped if it exceeds the image range); the prediction boxes are sorted by confidence error to obtain the top k, the prediction boxes with larger overlap are filtered out by the NMS (Non-Maximum Suppression) method, and the prediction result is finally obtained.
The NMS method is used in target detection to extract the prediction box with the highest score as the prediction result. The method comprises the following steps: the first k prediction boxes form a prediction box list B; the detection box M with the maximum score is selected according to the confidence s and placed into the final detection result list D; among the remaining prediction boxes in B, every box whose IOU with M is larger than a preset threshold is removed, giving a new prediction box list B; the detection box M′ with the maximum score is then selected from the new list according to the confidence s and placed into D, and the boxes whose IOU with M′ is larger than the preset threshold are removed. These steps are repeated until the prediction box list is empty.
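The NMS loop just described can be sketched directly (boxes as (x1, y1, x2, y2) tuples; the function names are illustrative):

```python
def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Repeatedly keep the highest-scoring box and drop the remaining
    boxes whose IoU with it exceeds the threshold, until the list is
    empty. Returns the indices of the kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        m = order.pop(0)                 # box M with the maximum score
        keep.append(m)
        order = [i for i in order if iou(boxes[m], boxes[i]) <= iou_thresh]
    return keep
```

The kept indices correspond to the final detection result list D in the description.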
Specifically, in the training process, the weight parameters and bias parameters of the network model may be updated according to the obtained loss function value of the SSD network. The loss function of an SSD network is typically defined as follows:
$$L(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\right)$$

where $N$ is the number of matched prediction boxes, $x$ is the matching indicator, $c$ denotes the class confidences, $l$ the predicted box parameters, $g$ the ground-truth box parameters, and $\alpha$ is a weight balancing the two terms.
As the formula shows, the loss function of the SSD network consists of a confidence loss and a localization loss. The confidence loss is the softmax loss between the true label and the prediction box; the localization loss is the smooth L1 loss between the prediction box and the real label.
The softmax function maps its input into the (0,1) interval, yielding the probability of belonging to each classification category; the softmax loss is then calculated from the probability of each class.
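As a small illustration of the two quantities just described (a sketch, not the patent's code):

```python
import numpy as np

def softmax(logits):
    """Map raw scores into (0, 1) probabilities that sum to 1."""
    z = logits - np.max(logits)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def softmax_loss(logits, true_class):
    """Cross-entropy of the softmax probabilities against the true label."""
    return -np.log(softmax(logits)[true_class])
```

The loss is small when the true class receives a high probability and large when it does not.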
The smooth L1 loss function is defined as follows:
$$\mathrm{smooth}_{L1}(x) = \begin{cases} 0.5x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}$$
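As an illustration, smooth L1 (quadratic for |x| < 1, linear elsewhere) translates directly into NumPy; this helper is assumed for the example, not taken from the patent:

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1 loss: quadratic near zero, linear elsewhere."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) < 1, 0.5 * x ** 2, np.abs(x) - 0.5)
```

Unlike plain L1, it is differentiable at zero; unlike L2, it grows only linearly for large errors, making training less sensitive to outlier boxes.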
Specifically, the network model parameters are updated by stochastic gradient descent during training. When the loss function of the SSD network meets the preset training threshold, the network tends to be stable, training ends, and construction of the network model is complete.
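The update-until-threshold loop described here can be sketched generically. `grad_fn` and `loss_fn` are placeholders for the SSD gradient and loss computations; everything in this sketch is illustrative:

```python
def train(params, grad_fn, loss_fn, lr=1e-3, loss_threshold=1e-3, max_steps=10000):
    """Plain SGD: update parameters until the loss falls below a preset threshold."""
    for _ in range(max_steps):
        if loss_fn(params) < loss_threshold:
            break                                    # preset training threshold met
        grads = grad_fn(params)
        params = [w - lr * g for w, g in zip(params, grads)]
    return params
```

For instance, minimizing f(w) = w² from w = 2 with lr = 0.1 shrinks w by a factor of 0.8 per step until the threshold is reached.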
Fig. 2 illustrates a system block diagram of a network model building system for image text information detection according to another aspect of the present application, wherein the system of an embodiment comprises:
a label information obtaining module 21, configured to obtain label information of each training sample image;
the text information extraction module 22 is configured to extract corresponding text information based on the label information acquired by the label information obtaining module;
the text feature matrix module 23 is configured to determine a text feature matrix corresponding to the text information based on the text information extracted by the text information extraction module;
the splicing module 24 is configured to splice the text feature matrix and the corresponding training sample image to obtain a spliced training sample image;
and a network model building module 25, configured to input the spliced training sample image into an improved network model for training until a preset training threshold is met.
The above modules reside in the device 1 and together realize a network model construction system for image text information detection.
According to yet another aspect of the present application, there is also provided a computer readable medium having stored thereon computer readable instructions executable by a processor to implement the foregoing method.
According to another aspect of the present application, there is also provided a network model building apparatus for image text information detection, wherein the apparatus includes:
one or more processors; and
a memory storing computer readable instructions that, when executed, cause the processor to perform operations of the method as previously described.
For example, the computer readable instructions, when executed, cause the one or more processors to: first acquire label information of each training sample image; extract the corresponding text information based on the label information; determine the text feature matrix corresponding to the text information; splice the text feature matrix with the corresponding training sample image to obtain a spliced training sample image; and finally input the spliced training sample image into the improved network model for training until a preset training threshold is met.
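The splicing step, where a three-channel color image becomes four channels and a single-channel grayscale image becomes two, can be illustrated with a short NumPy sketch; the function name is assumed for this example:

```python
import numpy as np

def splice(image_hwc, text_feature_hw):
    """Append the text feature matrix as an extra channel of the image.
    A 3-channel color image becomes 4 channels; a 1-channel gray image, 2."""
    if image_hwc.ndim == 2:                          # grayscale: H x W -> H x W x 1
        image_hwc = image_hwc[..., np.newaxis]
    # The text feature matrix must match the image's spatial size.
    assert image_hwc.shape[:2] == text_feature_hw.shape
    return np.concatenate([image_hwc, text_feature_hw[..., np.newaxis]], axis=-1)
```

The network's data layer then receives the extra channel alongside the pixel channels.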
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the apparatus claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.

Claims (9)

1. A network model construction method for image text information detection is characterized by comprising the following steps:
acquiring label information of each training sample image;
extracting corresponding text information based on the label information;
determining a text feature matrix corresponding to the text information based on the text information;
splicing the text feature matrix and the corresponding training sample image to obtain a spliced training sample image;
and inputting the spliced training sample images into an improved network model for training until a preset training threshold value is met.
2. The method of claim 1, further comprising:
acquiring characters corresponding to text information in all training sample images;
establishing a character library based on the characters, wherein the characters contained in the character library are not repeated, and each character corresponds to a unique character label;
wherein the determining the text feature matrix corresponding to the text information based on the text information comprises:
determining a corresponding character based on the text information;
and determining a text feature matrix corresponding to the text information based on the character label of each character in the character library.
3. The method of claim 2, further comprising:
when a new training sample image is added, judging whether the text information of the new training sample image contains the characters which do not exist in the character library;
and if so, adding the characters which do not exist in the character library to the character library so as to update the character library.
4. The method according to claim 1 or 2, wherein if the training sample image is a three-channel color image, the text feature matrix is spliced with the corresponding training sample image to obtain a four-channel spliced training sample image; and if the training sample image is a single-channel gray scale image, splicing the text feature matrix and the corresponding training sample image to obtain a spliced training sample image of two channels.
5. The method of any of claims 1 to 4, wherein the improved network model is based on an SSD neural network, comprising 1 data layer, a VGG-16 base network, and 6 convolutional layers after the VGG-16 base network.
6. The method according to any one of claims 1 to 5, wherein the label information at least includes text information of the training sample image and coordinate information corresponding to the text information.
7. A network model building system for image text information detection, the system comprising:
the label information acquisition module is used for acquiring label information of each training sample image;
the text information extraction module is used for extracting corresponding text information based on the label information acquired by the label information acquisition module;
the text feature matrix module is used for determining a text feature matrix corresponding to the text information based on the text information extracted by the text information extraction module;
the splicing module is used for splicing the text feature matrix and the corresponding training sample image to obtain a spliced training sample image;
and the network model construction module is used for inputting the spliced training sample images into an improved network model for training until a preset training threshold value is met.
8. A computer-readable medium, wherein,
stored thereon computer readable instructions executable by a processor to implement the method of any one of claims 1 to 6.
9. A network model building apparatus for image text information detection, wherein the apparatus comprises:
one or more processors; and
memory storing computer readable instructions that, when executed, cause the processor to perform the operations of the method of any of claims 1 to 6.
CN201911127868.7A 2019-11-18 2019-11-18 Network model construction method and system for image text information detection Pending CN111062385A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911127868.7A CN111062385A (en) 2019-11-18 2019-11-18 Network model construction method and system for image text information detection

Publications (1)

Publication Number Publication Date
CN111062385A true CN111062385A (en) 2020-04-24

Family

ID=70298018

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911127868.7A Pending CN111062385A (en) 2019-11-18 2019-11-18 Network model construction method and system for image text information detection

Country Status (1)

Country Link
CN (1) CN111062385A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113553985A (en) * 2021-08-02 2021-10-26 中再云图技术有限公司 High-altitude smoke detection and identification method based on artificial intelligence, storage device and server
CN114998749A (en) * 2022-07-28 2022-09-02 北京卫星信息工程研究所 SAR data amplification method for target detection

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120314941A1 (en) * 2011-06-13 2012-12-13 Microsoft Corporation Accurate text classification through selective use of image data
WO2018188240A1 (en) * 2017-04-10 2018-10-18 北京大学深圳研究生院 Cross-media retrieval method based on deep semantic space
CN109299274A (en) * 2018-11-07 2019-02-01 南京大学 A kind of natural scene Method for text detection based on full convolutional neural networks
CN109447078A (en) * 2018-10-23 2019-03-08 四川大学 A kind of detection recognition method of natural scene image sensitivity text
CN109522531A (en) * 2017-09-18 2019-03-26 腾讯科技(北京)有限公司 Official documents and correspondence generation method and device, storage medium and electronic device
CN109710924A (en) * 2018-12-07 2019-05-03 平安科技(深圳)有限公司 Text model training method, text recognition method, device, equipment and medium
CN110033019A (en) * 2019-03-06 2019-07-19 腾讯科技(深圳)有限公司 Method for detecting abnormality, device and the storage medium of human body
CN110097096A (en) * 2019-04-16 2019-08-06 天津大学 A kind of file classification method based on TF-IDF matrix and capsule network
CN110110715A (en) * 2019-04-30 2019-08-09 北京金山云网络技术有限公司 Text detection model training method, text filed, content determine method and apparatus

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
DENG Zhenrong; ZHANG Baojun; JIANG Zhouqin; HUANG Wenming: "An image caption model fusing word2vec and an attention mechanism", no. 04 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination