CN110390251B

CN110390251B - Image and character semantic segmentation method based on multi-neural-network model fusion processing

Info

Publication number: CN110390251B
Application number: CN201910403196.1A
Authority: CN
Inventors: 刘晋; 张鑫; 李云辉
Original assignee: Shanghai Maritime University
Current assignee: Shanghai Maritime University
Priority date: 2019-05-15
Filing date: 2019-05-15
Publication date: 2022-09-30
Anticipated expiration: 2039-05-15
Also published as: CN110390251A

Abstract

The invention provides an image and character semantic segmentation method based on multi-neural-network model fusion processing, which comprises a multi/multi semantic segmentation model training method and a multi-model fusion processing method. The invention utilizes a plurality of semantic segmentation network models such as a multi-scale full convolution neural network model MSFCN, a U-shaped full convolution neural network model U-net, a full convolution neural network model R-FCN based on a region, a Faster region convolution neural network model Faster R-CNN based on a region and the like to carry out semantic positioning on a character region in an image, but is not limited to the 4 semantic segmentation network models, and can be adjusted and replaced by other plurality/various semantic segmentation neural network models based on a global or local region. The invention can effectively eliminate the interference of complex non-character areas and can semantically divide character areas containing various character sizes, character colors, character fonts and character languages by utilizing the deep neural network technology, thereby having wide application range and strong robustness.

Description

Image character semantic segmentation method based on multi-neural network model fusion processing

Technical Field

The invention relates to the field of image recognition processing, in particular to a semantic segmentation method for characters in an image.

Background

The characters are used as a vital tool for people to exchange information at ordinary times, and have very important influence on the development of the whole society. As the times have been advanced, more and more characters and information are required to be processed, and more data and worksheets are identified and analyzed by human beings, which has become more difficult. The research on some methods for recognizing literal characters has become an urgent need at present.

Character segmentation is a difficult and hot spot of literal character recognition. The number of characters is large, and 3000 common vocabularies of Chinese characters are only. Currently, the main character segmentation methods can be divided into three types: 1. segmentation based on structural analysis; 2. a method based on recognition; 3. and (4) an integral segmentation strategy. These methods require the image to be presented in a specific format before segmentation is performed to simplify subsequent processing. The preprocessing comprises digitalization, denoising, binarization and normalization. However, there are various factors that hinder the text-based image segmentation process, some of which are as follows: image quality, location of text content, texture file, text type.

For the segmentation of image characters, interference of other information needs to be considered, and some rule-based segmentation methods cannot effectively perform segmentation. Meanwhile, in recent years, it is desired to improve accuracy of character recognition. Therefore, the defects of the traditional method can be solved by applying the deep neural network technologies such as the full convolution neural network FCN, the region convolution-based neural network model R-CNN and the like to the semantic segmentation of the image characters. Meanwhile, the situation that a single model has poor performance on a certain type or a certain types of detection objects can be solved by adopting a multi-model fusion processing method. In practical application, the application limit of a single model can be broken through by multi-model fusion processing, and a better detection effect is achieved.

The full convolution Neural network (FCN) is proposed by the research team of the university of california, and the original Convolutional Neural Network (CNN) is popularized to classify the picture of any size at the pixel level in an end-to-end manner, thereby solving the problem of image segmentation at the semantic level.

MSFCN (Multi-scale full volumetric Neural Networks) is a Multi-scale full convolution Neural network model.

The U-shaped full convolution neural network U-Net is improved based on the full convolution neural network FCN, and some data with fewer samples can be trained by data enhancement.

R-FCN (region based full Convolutional neural network) is a region-based full Convolutional neural network model, and Position-sensitive score maps are provided to solve the Position sensitivity problem of target detection

The Faster R-CNN (Faster Region-CNN) is a Faster regional convolutional neural network.

Disclosure of Invention

The invention aims to overcome the defects in the prior art and provide a method for segmenting the meaning of characters in an image.

The invention specifically provides an image character semantic segmentation method based on multi-neural network model fusion processing, which comprises the following steps:

the multiple/multiple semantic segmentation model training method comprises the following steps: preprocessing the semantically segmented sample image such as graying, normalization and the like; performing multi-scale feature extraction on a semantically segmented sample image; generating labels for semantic segmentation respectively by semantic annotation aiming at multiple models to obtain a data set for deep learning; based on a semantic segmentation neural network technology, constructing a multi-scale full convolution neural network model MSFCN, a U-shaped full convolution neural network model U-net, a region-based full convolution neural network model R-FCN and a Faster region-based convolution neural network model Faster R-CNN; constructing a convolutional neural network model CNN based on the evaluation of the character region and the non-character region; and respectively training the plurality of deep neural network models.

The multi-model fusion processing method comprises the following steps: preprocessing the image to be semantically segmented, such as graying, normalization and the like; performing multi-scale feature extraction on a semantically segmented sample image; respectively applying the processed sample image to be semantically segmented to a multi-scale full convolution neural network model MSFCN, a U-shaped full convolution neural network model U-net, a region-based full convolution neural network model R-FCN and a Faster region convolution neural network model Faster R-CNN to obtain a prediction result image of each single model; and evaluating and fusing the prediction result of the single model by using the convolutional neural network model CNN to obtain a final semantic segmentation result.

The image character semantic segmentation method based on the multi-neural-network model fusion processing is different from the traditional morphological processing and other methods, can effectively eliminate the interference of complex non-character areas, and can perform semantic segmentation on character areas containing various character sizes, character colors, character fonts and character languages, and has wide application range and strong robustness.

The above description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood, and to make the above and other objects, features, and advantages of the present invention more apparent.

Drawings

FIG. 1 is a schematic diagram of the implementation steps of the method of the present invention

FIG. 2 shows three dimensions of feature images

FIG. 3 is a schematic diagram of a semantic annotation image

FIG. 4 is a schematic diagram of a semantic box information xml format

FIG. 5 is a schematic diagram of a matrix text region image

FIG. 6 is a flow chart of a multi-scale full convolution neural network model

FIG. 7 is a flow chart of a model for constructing a U-shaped full convolution neural network

FIG. 8 is a flowchart of a method for constructing a full convolution neural network model based on regions

FIG. 9 is a flow chart for building a faster model of a convolutional neural network based on regions

FIG. 10 is a schematic diagram of a multi-model voting strategy process

FIG. 11 is a sample diagram of semantic segmentation of characters to be imaged

FIG. 12 is a semantic location graph for neural network processing

FIG. 13 is a block diagram of semantic segmentation for neural network processing

FIG. 14 is a diagram of sub-picture sample instances divided by semantics

Detailed Description

The present invention will be described in further detail with reference to test examples and specific embodiments. It should not be understood that the scope of the above-described subject matter is limited to the following examples, and that any technique that can be implemented based on the teachings of the present invention is within the scope of the present invention.

The invention provides a method for realizing image and character semantic segmentation based on multi-neural-network model fusion processing, which comprises the following steps as shown in figure 1.

In step S110, preprocessing such as graying and normalization and multi-scale feature extraction are performed on the semantic segmentation sample image.

In a specific embodiment of the present invention, the multi-scale feature extraction algorithm may be as follows: in order for the semantic segmentation model to become more sensitive to retaining detailed information, it is necessary to provide the model with more information about the distance. The gap features can be effectively extracted by adopting a multi-scale feature extraction algorithm. By adding the multi-scale feature map to the semantic segmentation model, more pitch information can be provided for the full convolutional neural network model, for example. In the multi-scale feature extraction algorithm, the number of the multi-scales may be an appropriate value according to the semantic segmentation model, for example, the value in an embodiment of the present invention is 3, and may also be 4, 5, 6, and other values. Fig. 2 shows the feature images of 3 scales obtained by the multi-scale feature extraction algorithm. The three image sizes shown in fig. 2 may be preset values: 512 × 376 × 1, 256 × 188 × 1, and 128 × 94 × 1. The first and second digits represent the width and height of the image, and the third digit represents channel information for the feature image. However, other image sizes, such as 32 × 32, 64 × 128, etc., and other amounts of feature image channel information, such as 3, 4, etc., are also possible.

In step S120, semantic labeling is performed manually or semi-manually according to the semantically segmented sample image, and a semantic label image or semantic label box information is generated. Specifically, for a neural network of a full convolution type, such as the MSFCN, U-net and R-FCN, a semantic annotation image as shown in FIG. 3 is constructed; an xml file with labeled semantic box information as shown in fig. 4 is constructed for a neural network of the region-based convolution type, such as the FasterR-CNN described above. For the convolutional neural network CNN for text region and non-text region evaluation, an image of a plurality of matrix text regions and non-text regions is generated by clipping the text regions and non-text regions labeled semantically, as shown in fig. 5.

In step S130, the data set is expanded by a data enhancement method, such as translation, rotation, mirroring, reflection transformation, and the like.

In step S140, the multiple/multiple neural network structures for semantic segmentation are constructed, according to an embodiment of the present invention, 4 models, such as a multi-scale full convolution neural network model MSFCN, a U-type full convolution neural network model U-net, a region-based full convolution neural network model R-FCN, and a Faster region-based convolution neural network model fast R-CNN, may be used to perform semantic localization on an image containing text.

In step S150, according to an embodiment of the present invention, a convolutional neural network model CNN for evaluating text regions and non-text regions may be used.

For each neural network model described above, the following is a preferred example given by the present invention.

Fig. 6 is a flow chart of the method for constructing a multi-scale full convolution neural network model according to the present invention. In one embodiment of the present invention, images with 3 scales, such as 512 × 376 × 1, 256 × 188 × 1, 128 × 94 × 1, etc., are generated by preprocessing the image to be semantically segmented and generating a multi-feature extraction algorithm, and are respectively represented as Scale1x, Scale2x, and Scale4 x. For the result of the Scale1x feature image passing through 2 convolutional layers and 1 pooling layer, 1 fusion layer processing is performed on the result of the Scale2x feature image passing through 2 convolutional layers. And processing the result of the fusion result passing through 2 convolutional layers and 1 pooling layer and the result of the Scale4x feature image passing through 2 convolutional layers by 1 fusion layer. And processing the fusion result by 2 convolution layers, 1 pooling layer and 2 convolution layers, and then processing by 3 deconvolution layers to obtain an output result of the neural network. And calculating loss according to the corresponding semantic segmentation label graph, and updating model parameters until training is finished and the model result is stored.

Fig. 7 is a flow chart of the method for constructing the U-shaped full convolution neural network model according to the present invention. In one embodiment of the invention, the image to be semantically segmented is preprocessed. Marking the feature map of the preprocessing result after being processed by 2 convolution layers as Fa 1; the characteristic graph of Fa1 after being processed by 1 pooling layer and 2 convolutional layers is marked as Fa 2; the characteristic diagram of Fa2 after being processed by 1 pooling layer and 2 convolutional layers is recorded as Fa 3. After Fa3 is treated by 1 pooling layer and 2 convolution layers, the result of 1 deconvolution layer and Fa3 are treated by 1 fusion layer; processing the result of the fusion result by 2 convolutional layers and 1 deconvolution layer and Fa2 by 1 fusion layer; and (3) processing the result of the fusion result passing through 2 convolutional layers and 1 deconvolution layer and Fa1 by 1 fusion layer. And processing the fusion result by 2 convolution layers and 1 convolution layer with convolution kernel of 1 multiplied by 1 to obtain the output result of the neural network. And calculating loss according to the corresponding semantic segmentation label graph, and updating the model parameters until the training is finished and the model result is stored.

FIG. 8 is a flow chart of the present invention for constructing a region-based full convolution neural network model. In one embodiment of the invention, a basic convolutional network similar to ResNet-101 is used, a region suggestion network RPN, a convolutional layer for the position sensitive score map, and finally a ROI pooling layer and decision layer for voting. Where a convolutional network like ResNet-101 contains 15 convolutional layers, 1 globally averaged pooling layer and 1 fully connected layer. The R-FCN calculates parameters for updating the neural network model by using a loss function of the ROI by adopting two steps including region suggestion and region classification. And (5) storing the model result after multiple times of training.

FIG. 9 is a flow chart of the method for constructing a faster model of a convolutional neural network based on regions. In one embodiment of the invention, a regional proposal network RPN and a Faster R-CNN network are constructed. The RPN construction is based on a VGG16 network structure, where RPN and Fast-R-CNN share 13 VGG convolutional layers. Initializing network parameters through a pre-training model, training RPN and Faster R-CNN independently, performing convolution and pooling operation on candidate regions output by RPN for multiple times, performing ROI pooling and full-link layer, and performing target classification on one output result and region regression on the other output result. And training the RPN again, and only updating the unique partial parameters of the RPN. The Fast-RCNN network is again fine-tuned with the results of the RPN, updating only the parameters of the unique part of Fast-RCNN. And after training, storing the model result.

In one embodiment of the invention, a convolutional neural network model CNN for evaluating a text region and a non-text region adopts 6 convolutional layers and 2 full-link layers, and finally, the region is evaluated by using a regression value. After training is complete, the model results are saved.

In one embodiment of the present invention, the above-mentioned deep neural networks all use a3 × 3 convolution kernel. For example, 5 × 5, 7 × 7 convolution kernels or other scale punctured convolution kernels may also be used in the present invention.

In step S170, a prediction result map obtained through each single model process is obtained. This includes the need to process the results that are predicted to be semantic box information, such as Faster region-based convolutional neural network model fast R-CNN. And constructing a semantic segmentation image which is the same as the FCN prediction result of the full convolution neural network by using semantic box information obtained by processing the semantic segmentation model. The semantic segmentation image marks the possibility of being a text region with different pixel values 0 to 255, for example, black, which has a pixel value of 0 indicating that the possibility is the maximum in a non-text region, and white, which has a pixel value of 255 indicating that the possibility is the maximum in a text region. The pixel point included in each semantic box in the prediction result represents the possibility of the character area by using the product of the predicted value obtained by fast R-CNN aiming at the semantic box and 255. All pixel values outside the semantic box are marked as 0. Thereby generating a semantically segmented image.

The operation of pseudo-binarization processing on the pixel points in the prediction result graph by using the full convolution network models such as the multi-scale full convolution neural network model MSFCN, the U-shaped full convolution neural network model U-net, the region-based full convolution neural network model R-FCN and the like is required to be included.

The pseudo-binarization processing operation is to screen pixel points in a prediction result graph according to a proper threshold value: the pixel values smaller than the threshold are all marked as 0, and the pixel values larger than the threshold keep the original value.

Such processing operation makes it possible to represent what category the pixel points represent. For example, for a text region and a non-text region, each pixel value in the prediction result may indicate the possibility of being a text region.

In step S180, the method of evaluating the prediction result graph of each single model includes: and evaluating by using the trained convolutional neural network model CNN for evaluating the character region and the non-character region according to the character region marked in the prediction result graph of each single model. The evaluation result is a value of 0 to 1, where 0 indicates the highest probability of being a non-text region and 1 indicates the highest probability of being a text region. The entire prediction result map is then updated with the product of the evaluation value of each text region and the value of each pixel within the text region.

In step S190, the final semantic segmentation map generated through the fusion algorithm may adopt the following strategies:

firstly, carrying out binarization processing on a proper threshold value to ensure that the numerical values of all pixel points of the prediction graph are only 0 or 255, and then generating a final semantic segmentation result by adopting a voting strategy mode for each pixel point. The voting strategy can be expressed as: assume that there are N semantic segmentation models. For each pixel point in the image to be detected, the prediction value of the N semantic segmentation models at the point (i, j) is represented as S _(i，j) ＝{S ₁ ,S ₂ ,S ₃ ,…,S _N-1 ,S _N }. Wherein S _k Is 0 or 255, and k is between 1 and N. Then one voting strategy may be S _(i，j) ＝Max{Num(S _k ＝0),Num(S _k 255). Where Num (S) _k 0) represents S _(i，j) Middle S _k The number of the same Num (S) as 0 _k 255) denotes S _(i，j) Middle S _k The number of values is 255. Max means the largest number of values. According to the condition that the segmentation result of most semantic segmentation is credible, the prediction result of most models at the point (i, j) can be taken as the final prediction result. Pixel points for all positionsThe operations are repeated to generate a semantic segmentation image through fusion.

The multi-model voting strategy processing strategy is shown in fig. 10, namely, the results of a plurality of single models are utilized to perform optimization fusion, and the processing has better performance than the single model processing.

It can be understood by those skilled in the art that the voting strategy by pixel point mentioned in the above-mentioned multi-model fusion processing method is only a specific strategy of one embodiment of the present invention. In step S190, other multi-model fusion processing methods may be further improved based on this idea, for example, a similar multi-model weighted-average-by-pixel policy, a weighted multi-model voting policy, and the like.

In a specific embodiment of the invention, a preprocessed text image to be semantically segmented is received. In the process of image character semantic segmentation of multi-model fusion processing:

fig. 11 is the image originally containing text. The size of the picture to be processed is converted into the scale size of 512 multiplied by 376 multiplied by 1 through the preprocessing operation on the semantic segmentation image, and then another two pictures with the scale sizes of 256 multiplied by 188 multiplied by 1 and 128 multiplied by 94 multiplied by 1 are generated through the multi-scale feature extraction algorithm. Through a self-adaptive binarization processing method, a binary image which can still reflect the whole and local characteristics of the image is obtained by selecting 256 brightness-level gray level images through self-adaptive threshold values, and a result picture is input as the multi-scale characteristics of a neural network.

When the semantic segmentation is carried out on the image by utilizing the semantic segmentation model/models, the following 2 steps are included:

step 1: and performing prediction operation on the input image with the three scale characteristics generated by processing in the step 6 by utilizing the trained multiple/multiple semantic segmentation models. And marking the semantic character area with different pixel values by using each single model.

Step 2: and performing character region evaluation processing according to the trained convolutional neural network, processing prediction results obtained by all single models by utilizing a multimode fusion algorithm, and finally processing and generating a semantic segmentation result graph as shown in fig. 12.

The effect marked on the image of the original size according to the final semantic segmentation result graph is shown in fig. 13. A sub-picture obtained by performing the segmentation processing on the text region block is shown in fig. 14.

And obtaining the sub-picture generated by segmenting the word meaning region on the whole image through processing of the semantic region segmentation module for multiple times.

It will be appreciated by those of ordinary skill in the art that the foregoing description provides numerous implementation details. However, embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.

Claims

1. An image character semantic segmentation method based on multi-neural network model fusion processing is characterized by comprising the following steps:

the method comprises the following steps: receiving a character image to be semantically segmented;

step two: preprocessing the character image to be semantically segmented, such as graying, normalization and the like;

step three: respectively inputting the preprocessed images into the trained multiple/multiple semantic segmentation models for prediction;

step four: evaluating and processing a character region and a non-character region by using a plurality of single model prediction results obtained by the trained convolutional neural network CNN pair;

step five: and (4) adopting a multi-model fusion method for the evaluated and processed result to generate the final semantic segmentation result.

2. The image text semantic segmentation method based on the fusion processing of the multi-neural network model according to claim 1, characterized in that: the training method of the multiple/multiple semantic segmentation models related to the third step comprises the following steps:

step 21: graying and normalizing the semantically segmented sample image and extracting multi-scale features;

step 22: generating labels for semantic segmentation respectively by semantic annotation aiming at multiple models to obtain a data set for deep learning;

step 23: based on a semantic segmentation neural network technology, constructing a multi-scale full convolution neural network model MSFCN, a U-shaped full convolution neural network model U-net, a region-based full convolution neural network model R-FCN and a faster region-based convolution neural network model FasterR-CNN;

step 24: and respectively training and storing the plurality of deep neural network models.

3. The image text semantic segmentation method based on the fusion processing of the multi-neural network model according to claim 2, characterized in that: the method for preprocessing the semantic segmentation sample image and extracting the multi-scale features in the step 21 comprises the following steps:

step 31: carrying out gray processing on a semantically segmented sample image, wherein the pixel of each point is represented by a numerical value from 0 to 255;

step 32: normalizing the grayed image, and scaling the length and width of the image to the size of a preset image;

step 33: and generating a multi-scale feature sample image by adopting a multi-scale feature extraction algorithm based on multi-scale transformation on the normalized and scaled image.

4. The image text semantic segmentation method based on the fusion processing of the multi-neural network model according to claim 2, characterized in that: the method for obtaining the deep learning data set in the step 22 is characterized by comprising the following steps:

step 41: performing semantic marking through manual work or semi-manual work according to the sample image subjected to semantic segmentation to generate a semantic marked image or semantic marked frame information;

step 42: cutting and generating images of a plurality of matrix character areas and non-character areas according to the character areas and the non-character areas marked by the semantemes;

step 43: the data set is expanded through a data enhancement method, such as translation, rotation, mirror image, reflection transformation and the like, and processed into a training data set format suitable for multiple/multiple semantic segmentation models.

5. The image text semantic segmentation method based on the fusion processing of the multi-neural network model according to claim 1, characterized in that: the method is characterized by comprising a multi-scale full convolution neural network model MSFCN, a U-shaped full convolution neural network model U-net, a full convolution neural network model R-FCN based on a region and a Faster full convolution neural network model Faster R-CNN based on the region, but not limited to the mentioned network structure, and also comprising the following steps:

the FCN and various modified variant structures thereof are based on a basic multi-layer single-input single-output full convolution neural network, such as a full convolution neural network structure for semantic segmentation, which comprises single-input multiple-output, multiple-input single-output, multiple-input multiple-output, and the like;

and a neural network structure for semantic segmentation based on global processing and local processing, such as various variant structures based on a regional convolutional neural network model R-CNN.

6. The image text semantic segmentation method based on the fusion processing of the multi-neural network model according to claim 1, characterized in that: the method for evaluating text regions and non-text regions in the fourth step is characterized by using a convolutional neural network model CNN.

7. The image text semantic segmentation method based on the fusion processing of the multi-neural network model according to claim 1, characterized in that: the process of respectively predicting the multiple/multiple semantic segmentation models in the third step to obtain the prediction result graph is as follows: respectively applying the processed sample image to be semantically segmented to a multi-scale full convolution neural network model MSFCN, a U-shaped full convolution neural network model U-net, a region-based full convolution neural network model R-FCN and a Faster region-based convolution neural network model Faster R-CNN to obtain a prediction result image of each single model; the prediction result map indicates the possibility of whether or not the image is a text region by using pixel values of 0 to 255, and for example, a black pixel value of 0 indicates that the image is a non-text region and a white pixel value of 255 indicates that the image is a text region.

8. The image text semantic segmentation method based on the fusion processing of the multi-neural network model according to claim 1, characterized in that: in the fourth step, the method for evaluating and processing the character area and the non-character area by using the plurality of single model prediction results obtained by the convolutional neural network CNN pair comprises the following steps: and (3) aiming at a prediction result graph generated by a single semantic segmentation model, individually segmenting the marked text region, and evaluating by a convolutional neural network model (CNN). The convolutional neural network model CNN derives an evaluation value of 0 to 1 for whether it is a text region, where 0 indicates the maximum possibility of being a non-text region and 1 indicates the maximum possibility of being a text region; all the character areas in the prediction result graph generated by the single model are subjected to the evaluation processing, and then the evaluation value is used as weight to be added to all the pixel values of the character areas in the corresponding prediction result graph; and performing fusion processing on all the estimated prediction result graphs to generate a final semantic segmentation result.

9. The image text semantic segmentation method based on the fusion processing of the multi-neural network model according to claim 1, characterized in that: in the fourth step, a multi-model fusion method is adopted for the result after evaluation and processing, and the method comprises the following steps: performing binarization processing by using a proper threshold value; and (4) adopting a voting strategy aiming at each pixel point, and finally generating a final semantic segmentation result.

10. The image text semantic segmentation method based on the fusion processing of the multi-neural network model as claimed in claim 6, wherein: further comprising: the following algorithm for classification: a naive Bayes model, a support vector machine model and a related improved classification algorithm.