CN113673509B - Instrument detection classification method based on image text - Google Patents
- Publication number
- CN113673509B CN113673509B CN202110855223.6A CN202110855223A CN113673509B CN 113673509 B CN113673509 B CN 113673509B CN 202110855223 A CN202110855223 A CN 202110855223A CN 113673509 B CN113673509 B CN 113673509B
- Authority
- CN
- China
- Prior art keywords
- layer
- convolution
- network
- training
- combined
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention discloses an instrument detection and classification method based on image text, comprising the following steps: 1) construct an instrument positioning data set, train an improved YOLO network, and output dial images with the network; 2) construct a character detection data set, train an improved EAST network, and output character images with the network; 3) construct a character recognition data set, train a CRNN network, and output character information with the network; 4) construct a text classification data set, train a TextCNN network, and output the instrument type with the network. The invention uses neural networks to detect instruments and to detect and recognize the text information on them. It achieves higher accuracy and better generalization under different backgrounds, accurately detects instruments of different sizes, and is not limited by acquisition angle or distance. By exploiting the character information on the instrument, it solves the problem in the machine vision field that the position of an instrument can be identified but its type is difficult to distinguish: the method both detects the instrument and identifies its type.
Description
Technical Field
The invention relates to the technical field of image processing and neural networks, in particular to an instrument detection classification method based on image texts.
Background
Instruments serve as monitoring devices, mainly including pressure instruments, temperature instruments, flow instruments, electrical instruments, and electronic measuring instruments, and are widely applied in industrial production and daily life, where they provide great convenience. Compared with manual classification, automatic methods offer a wide application range and high classification efficiency, and they have gradually become mainstream with the development of image processing and neural network technology. The key links in such methods are the localization and recognition of the text on the meter, and the accuracy of this text localization and recognition has an important influence on meter classification and value reading.
At present, research on instrument classification methods mainly focuses on training neural networks to classify images of different instrument types. This approach has shortcomings: the visual difference between different instrument types is often small, so the recognition and classification performance of deep networks on different instruments is not ideal. Research on character recognition on instruments has mainly applied traditional image processing, acquiring character information through a pipeline that includes filtering, graying, thresholding, edge detection, and template matching. With the rapid development of image processing and neural network technology in recent years, neural networks can now be used for character localization, recognition, and classification: text detection localizes character information with quadrilateral boxes, a text recognition algorithm reads the text information on the instrument, and a text classification algorithm classifies the text to obtain the instrument type.
In view of the above discussion, an instrument detection and classification method that is both real-time and highly accurate has high practical application value.
Disclosure of Invention
The invention aims to overcome the defects and shortcomings of the prior art by providing an instrument detection and classification method based on image text. The method uses neural networks to detect instruments and to detect and recognize their text information; it achieves higher accuracy and better generalization under different backgrounds, accurately detects instruments of different sizes, and is not limited by acquisition angle or distance. By exploiting the character information on the instrument, it solves the problem in the machine vision field that the position of an instrument can be identified but its type is difficult to distinguish: the method both detects the instrument and identifies its type.
In order to achieve the above purpose, the technical scheme provided by the invention is as follows: an instrument detection classification method based on image text comprises the following steps:
1) Mark the dial positions in the instrument images to construct an instrument positioning data set and divide it into a training set and a test set; load training parameters and train the improved YOLO network with the training set; after training, obtain the optimal improved YOLO network, input the test set into it, output the dial images, and crop them out. The improved YOLO network replaces the backbone with a MobileNet lightweight network to reduce the parameter count and computation and improve the running speed;
2) Mark the character positions in the dial images cropped in step 1) to construct a character detection data set and divide it into a training set and a test set; load training parameters and train the improved EAST network with the training set; after training, obtain the optimal improved EAST network, input the test set into it, output the character positions, and crop out the character images. The improved EAST network changes the backbone to VGG to improve detection accuracy, and modifies the prediction module of the output layer to predict vertices using only the head elements, improving prediction performance on long text;
3) Mark the character information in the character images cropped in step 2) to construct a character recognition data set and divide it into a training set and a test set; load training parameters and train the CRNN network with the training set; after training, obtain the optimal CRNN network, input the test set into it, and output the character information;
4) Splice the character information output in step 3) into text and mark the instrument type corresponding to each text to construct a text classification data set, divided into a training set and a test set; load training parameters and train the TextCNN network with the training set; after training, obtain the optimal TextCNN network, input the test set into it, and output the instrument type corresponding to the text.
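The four-stage pipeline of steps 1) to 4) can be sketched as follows. This is a minimal illustration of the data flow only: the four stage functions are hypothetical placeholders standing in for the trained improved-YOLO, improved-EAST, CRNN, and TextCNN networks, and the keyword rule in `classify_meter` is an assumption for demonstration, not the patented TextCNN classifier.

```python
def locate_dials(image):
    # Stage 1 (improved YOLO): return cropped dial regions.
    # Placeholder: treat the whole image as one dial crop.
    return [image]

def detect_text(dial):
    # Stage 2 (improved EAST): return cropped text-line images.
    return [dial]

def recognize_text(line):
    # Stage 3 (CRNN): return the character string in one text line.
    return str(line)

def classify_meter(text):
    # Stage 4 (TextCNN): map the spliced text to a meter type.
    # Hypothetical keyword rule, for illustration only.
    return "pressure meter" if "MPa" in text else "unknown"

def classify_image(image):
    # Chain the four stages: dial crops -> text crops -> strings -> type.
    texts = []
    for dial in locate_dials(image):
        for line in detect_text(dial):
            texts.append(recognize_text(line))
    return classify_meter(" ".join(texts))
```

With real networks substituted for the placeholders, each stage's output feeds the next exactly as in steps 1) to 4).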
In step 1), instrument images of various types are collected with a camera under different environments, filtering and image enhancement preprocessing operations are applied, and abnormal data are removed, including data with surface dirt, extreme illumination, or incomplete capture; the remaining data are marked with the dial positions to construct the instrument positioning data set, which is divided into a training set and a test set.
Further, in step 1), the specific case of the improved YOLO network is as follows:
a. constructing a feature extraction network according to the real-time and high-precision requirements:
the first layer is a combined convolution module 1-A, which consists of a zero filling layer, a convolution layer, a batch normalization layer and an activation layer;
the second layer is a combined convolution module 1-B, which consists of a depth convolution layer, two batch normalization layers, two activation layers and a convolution layer;
the third layer is a combined convolution module 1-C, which consists of a zero filling layer, a depth convolution layer, two batch normalization layers, two activation layers and a convolution layer;
the fourth layer is a combined convolution module 1-B, which consists of a deep convolution layer, two batch normalization layers, two activation layers and a convolution layer;
The fifth layer is a combined convolution module 1-C, which consists of a zero filling layer, a depth convolution layer, two batch normalization layers, two activation layers and a convolution layer;
the sixth layer is a combined convolution module 1-B, which consists of a deep convolution layer, two batch normalization layers, two activation layers and a convolution layer;
the seventh layer is a combined convolution module 1-C, which consists of a zero filling layer, a depth convolution layer, two batch normalization layers, two activation layers and a convolution layer;
the eighth layer is a combined convolution module 1-D, which consists of five combined convolution modules 1-B;
the ninth layer is a combined convolution module 1-C, which consists of a zero filling layer, a depth convolution layer, two batch normalization layers, two activation layers and a convolution layer;
the tenth layer is a combined convolution module 1-B, which consists of a depth convolution layer, two batch normalization layers, two activation layers and a convolution layer;
b. constructing prediction networks that predict targets of different sizes from the outputs of different layers of the feature extraction network, comprising a large-size target prediction network, a medium-size target prediction network, and a small-size target prediction network;
b1, the large-size target prediction network takes the tenth-layer output of the feature extraction network as input and consists of several combined convolution modules and convolution layers, with the following structure:
The first layer is a combined convolution module 1-D, which consists of five combined convolution modules 1-B;
the second layer is a combined convolution module 1-B, which consists of a depth convolution layer, two batch normalization layers, two activation layers and a convolution layer;
the third layer is a convolution layer;
b2, the medium-size target prediction network takes as input the eighth-layer output of the feature extraction network and the first-layer output of the large-size target prediction network; it consists of several combined convolution modules and convolution layers, with the following structure:
the first layer is an input fusion module 1-E, which consists of a combined convolution module 1-B, an up-sampling layer and a tensor splicing layer;
the second layer is a combined convolution module 1-D, which consists of five combined convolution modules 1-B;
the third layer is a combined convolution module 1-B, which consists of a depth convolution layer, two batch normalization layers, two activation layers and a convolution layer;
the fourth layer is a convolution layer;
b3, the small-size target prediction network takes as input the sixth-layer output of the feature extraction network and the second-layer output of the medium-size target prediction network; it consists of several combined convolution modules and convolution layers, with the following structure:
the first layer is an input fusion module 1-E, which consists of a combined convolution module 1-B, an up-sampling layer and a tensor splicing layer;
The second layer is a combined convolution module 1-D, which consists of five combined convolution modules 1-B;
the third layer is a combined convolution module 1-B, which consists of a depth convolution layer, two batch normalization layers, two activation layers and a convolution layer;
the fourth layer is a convolution layer;
finally, the output of the large-size target prediction network, the medium-size target prediction network and the small-size target prediction network is processed through a non-maximum suppression layer to obtain the predicted target position and category;
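The final non-maximum suppression step that merges the three prediction networks' outputs can be illustrated with a plain-Python sketch. This is a generic greedy NMS over axis-aligned boxes; the box format `(x1, y1, x2, y2)` and the IoU threshold are assumptions for illustration, not values taken from the patent.

```python
def iou(a, b):
    # Intersection over union of two boxes given as (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thresh=0.5):
    # Greedy NMS: keep the highest-scoring box, drop overlapping ones, repeat.
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep
```

Applied to the concatenated detections of all three prediction heads, `nms` returns the indices of the surviving predicted targets.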
c. the loss function is set to have a center coordinate loss function, a wide-high loss function, a confidence loss function and a category loss function;
the center-coordinate loss function formula is as follows:
Loss_xy = mark_object * (2 - w*h) * Loss_log(xy_true, xy_predict)
where Loss_xy denotes the center-coordinate loss; mark_object is a flag bit indicating whether an object exists in the anchor box; w and h are the width and height of the anchor box; Loss_log denotes binary cross-entropy loss; xy_true is the true center coordinate and xy_predict the predicted center coordinate;
the width-height loss function formula is as follows:
Loss_wh = 0.5 * mark_object * (2 - w*h) * (wh_true - wh_predict)^2
where Loss_wh denotes the width-height loss, wh_true the true width-height value, and wh_predict the predicted width-height value;
the confidence loss function formula is as follows:
Loss_confidence = mark_object * Loss_log(mark_object, c_predict) + (1 - mark_object) * Loss_log(mark_object, c_predict) * mark_ignore
where Loss_confidence denotes the confidence loss, c_predict the confidence value of the prediction box, and mark_ignore a flag bit for anchor boxes whose IoU is below a threshold;
the class loss function formula is as follows:
Loss_cls = mark_object * Loss_log(cls_true, cls_predict)
where Loss_cls denotes the class loss, cls_true the true class, and cls_predict the predicted class;
the total loss function formula is as follows:
Loss = (Loss_xy + Loss_wh + Loss_confidence + Loss_cls) / numf
where Loss denotes the total loss and numf is the total input count as a floating-point number;
load training parameters and train the improved YOLO network, with the training parameters set as follows: the optimizer is Adam, the initial learning rate 0.001, the maximum number of training epochs 500, and the batch size 8; the training accuracy is checked on the validation set at intervals, and the best-performing network is saved once training completes; training is considered complete when the maximum number of epochs is reached or the mean IoU meets the requirement;
inputting the test set into an optimal improved YOLO network to obtain the dial position and the dial image.
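As a numerical illustration of the center-coordinate and width-height loss terms defined above, the following sketch evaluates them for a single anchor box. `bce` stands in for the binary cross-entropy Loss_log; the (2 - w*h) factor assumes w and h are normalized to [0, 1], so small boxes are weighted more heavily. This is an illustrative sketch, not the patent's training code.

```python
import math

def bce(y_true, y_pred, eps=1e-7):
    # Binary cross entropy, the Loss_log of the formulas above.
    p = min(max(y_pred, eps), 1 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

def yolo_box_losses(mark_object, w, h, xy_true, xy_pred, wh_true, wh_pred):
    # Loss_xy = mark_object * (2 - w*h) * Loss_log(xy_true, xy_predict)
    # Loss_wh = 0.5 * mark_object * (2 - w*h) * (wh_true - wh_predict)^2
    scale = mark_object * (2 - w * h)
    loss_xy = scale * sum(bce(t, p) for t, p in zip(xy_true, xy_pred))
    loss_wh = 0.5 * scale * sum((t - p) ** 2 for t, p in zip(wh_true, wh_pred))
    return loss_xy, loss_wh
```

For a perfect width-height prediction, Loss_wh is zero while Loss_xy still reflects the entropy of the coordinate targets.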
Further, in step 2), the specific case of the improved EAST network is as follows:
a. a feature extraction network is constructed, with the following structure:
the first layer is a combined convolution module 2-B, which consists of two combined convolution modules 2-A and a maximum pooling layer, wherein the combined convolution module 2-A consists of a zero filling layer, a convolution layer and an activation layer;
the second layer is a combined convolution module 2-B, which consists of two combined convolution modules 2-A and a maximum pooling layer;
The third layer is a combined convolution module 2-C, which consists of three combined convolution modules 2-A and a maximum pooling layer;
the fourth layer is a combined convolution module 2-C, which consists of three combined convolution modules 2-A and a maximum pooling layer;
the fifth layer is a combined convolution module 2-C, which consists of three combined convolution modules 2-A and a maximum pooling layer;
b. the feature fusion network is constructed, and the structure is as follows:
the first layer is an input fusion module 2-G, which consists of an up-sampling layer and a tensor splicing layer;
the second layer is a combined convolution module 2-E, which consists of two batch normalization layers, a combined convolution module 2-D and a combined convolution module 2-A; wherein the combined convolution module 2-D consists of a zero padding layer, a convolution layer and an activation layer;
the third layer is an input fusion module 2-G, which consists of an up-sampling layer and a tensor splicing layer;
the fourth layer is a combined convolution module 2-E, which consists of two batch normalization layers, a combined convolution module 2-D and a combined convolution module 2-A;
the fifth layer is an input fusion module 2-G, which consists of an up-sampling layer and a tensor splicing layer;
the sixth layer is a combined convolution module 2-F, which consists of three batch normalization layers, one combined convolution module 2-D and two combined convolution modules 2-A;
c. The prediction network is constructed, and the structure is as follows:
the first layer is divided into three branches, and the first branch consists of a combined convolution module 2-D; the second branch consists of a combined convolution module 2-D; the third branch consists of a combined convolution module 2-D;
the second layer is an input fusion module which is formed by splicing three branches of the first layer;
d. the set loss function comprises a category loss function, a geometric shape loss function and an angle loss function;
the class loss function formula is as follows:
L_S = -β * Y* * log(Ŷ) - (1 - β) * (1 - Y*) * log(1 - Ŷ)
where L_S denotes the class loss, β the balancing weight, Ŷ the predicted class, and Y* the true class;
the geometry loss function formula is as follows:
L_AABB = -log IoU(R̂, R*)
where L_AABB denotes the geometry loss, R̂ the geometry of the predicted quadrilateral text box AABB, R* the geometry of the true quadrilateral text box AABB, and IoU the intersection-over-union;
the angle loss function formula is as follows:
L_θ(θ̂, θ*) = 1 - cos(θ̂ - θ*)
where L_θ denotes the angle loss, θ̂ the predicted rotation angle, and θ* the true rotation angle;
load training parameters and train the improved EAST network, with the training parameters set as follows: the optimizer is Adam, the initial learning rate 0.001, the maximum number of training epochs 500, and the batch size 8; the training accuracy is checked on the validation set at intervals, and the best-performing network is saved once training completes; training is considered complete when the maximum number of epochs is reached or the mean IoU meets the requirement;
Inputting the test set into the optimal improved EAST network to obtain text positions, and cutting the text positions into character images.
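The geometry and angle loss terms above can be evaluated directly for toy boxes. This is an illustrative sketch under simplifying assumptions: boxes are axis-aligned rectangles `(x1, y1, x2, y2)` rather than the network's per-pixel AABB distance maps, and angles are in radians.

```python
import math

def east_geometry_loss(pred, true):
    # L_AABB = -log IoU(pred, true) for axis-aligned boxes.
    ix1, iy1 = max(pred[0], true[0]), max(pred[1], true[1])
    ix2, iy2 = min(pred[2], true[2]), min(pred[3], true[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    iou = inter / (area(pred) + area(true) - inter)
    return -math.log(iou)

def east_angle_loss(theta_pred, theta_true):
    # L_theta = 1 - cos(theta_pred - theta_true): zero when angles match,
    # growing smoothly as the predicted rotation drifts.
    return 1.0 - math.cos(theta_pred - theta_true)
```

Both terms vanish for a perfect prediction and increase monotonically with the geometric and angular error, which is what makes them usable as training signals.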
Further, in step 3), the specific cases of the CRNN network are as follows:
a. a feature extraction network is constructed, with the following structure:
the first layer is a combined convolution module 3-A, which consists of a zero filling layer, a convolution layer and an activation layer;
the second layer is a maximum pooling layer;
the third layer is a combined convolution module 3-A, which consists of a zero filling layer, a convolution layer and an activation layer;
the fourth layer is the largest pooling layer;
the fifth layer is a combined convolution module 3-B, which consists of a zero filling layer, a convolution layer, a batch normalization layer and an activation layer;
the sixth layer is a combined convolution module 3-A, which consists of a zero filling layer, a convolution layer and an activation layer;
the seventh layer is the largest pooling layer;
the eighth layer is a combined convolution module 3-B, which consists of a zero filling layer, a convolution layer, a batch normalization layer and an activation layer;
the ninth layer is a combined convolution module 3-A, which consists of a zero filling layer, a convolution layer and an activation layer;
the tenth layer is the largest pooling layer;
the eleventh layer is a combined convolution module 3-C, which consists of a zero filling layer, a convolution layer, a batch normalization layer and an activation layer;
b. The prediction network is constructed, and the structure is as follows:
the first layer is a recurrent module, consisting of a bidirectional LSTM;
the second layer is a full-connection layer;
the third layer is a recurrent module, consisting of a bidirectional LSTM;
the fourth layer is a full-connection layer;
c. setting a decoder to convert the output sequence into character information;
d. setting a loss function as a CTC (Connectionist Temporal Classification) loss function;
the CTC loss function formula is as follows:
L_CTC = -ln ∏_{(x,z)∈S} p(z|x) = -∑_{(x,z)∈S} ln p(z|x)
where L_CTC denotes the CTC loss, p(z|x) the probability of output sequence z given input x, and S the training set;
load training parameters and train the CRNN network, with the training parameters set as follows: the optimizer is Adam, the initial learning rate 0.0001, the maximum number of training epochs 100, and the batch size 32; the training accuracy is checked on the validation set at intervals, and the best-performing network is saved once training completes; training is considered complete when the maximum number of epochs is reached or the recognition accuracy meets the requirement;
and inputting the test set into the optimal CRNN network to obtain character information.
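Step c) above converts the CRNN output sequence into character information with a decoder. A common minimal choice, shown here as an assumed illustration rather than the patent's exact decoder, is greedy CTC decoding: take the arg-max label per frame, collapse consecutive repeats, and drop blanks.

```python
def ctc_greedy_decode(frame_probs, alphabet, blank=0):
    # frame_probs: one probability row per time frame over [blank] + alphabet.
    best = [max(range(len(row)), key=row.__getitem__) for row in frame_probs]
    chars, prev = [], blank
    for idx in best:
        if idx != blank and idx != prev:  # collapse repeats, drop blanks
            chars.append(alphabet[idx - 1])
        prev = idx
    return "".join(chars)
```

The blank label is what lets CTC emit the same character twice in a row: a blank frame between two identical labels breaks the repeat-collapse.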
Further, in step 4), the concrete case of the TextCNN network is as follows:
a. the network structure is constructed as follows:
the first layer is an embedded layer;
The second layer is a convolution module;
the third layer is a maximum pooling layer;
the fourth layer consists of a full connection layer, a Dropout layer and an activation layer;
the fifth layer consists of a full connection layer and an activation layer;
b. the loss function is set as multi-class cross entropy, with the following formula:
L_CrossEntropy = -∑_{i=1}^{n} y_i * log(ŷ_i)
where L_CrossEntropy denotes the loss, n the number of categories, y_i the true probability of category i, and ŷ_i the predicted probability of category i;
load training parameters and train the TextCNN network, with the training parameters set as follows: the optimizer is Adam, the initial learning rate 0.001, the number of iterations 1000, and the batch size 64; the training accuracy is checked on the validation set at intervals, and the best-performing network is saved once training completes; training is considered complete when the maximum number of iterations is reached and the accuracy meets the requirement;
and inputting the test set into an optimal textCNN network to obtain a corresponding instrument type.
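The multi-class cross-entropy loss used for the TextCNN network reduces, for a one-hot true label, to the negative log of the predicted probability of the true category. A minimal sketch of the formula above:

```python
import math

def cross_entropy(y_true, y_pred, eps=1e-12):
    # y_true: true probability distribution (often one-hot over n categories),
    # y_pred: predicted probabilities; eps guards against log(0).
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))
```

If the true category is predicted with probability 0.8, the loss is -ln 0.8 ≈ 0.223, and it grows without bound as that probability approaches zero.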
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. the invention uses the neural network to realize instrument positioning and character detection and recognition on the instrument, and has higher precision and better generalization capability under different backgrounds compared with the traditional image processing method.
2. The improved YOLO network of the invention can predict targets of different sizes and select the best-fitting prediction boxes overall, so instruments of different sizes can be accurately detected without limitations on acquisition angle or distance.
3. Compared with other target detection networks, the improved EAST network has better detection performance and faster detection speed on characters, and has better detection effect on characters shot at different angles.
4. By splicing the character information on the instrument into text, the invention solves the problem in the machine vision field that the position of an instrument can be identified but its type is difficult to distinguish: it both detects the instrument and identifies its type.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Fig. 2 is a schematic block diagram of an improved YOLO network.
Fig. 3 is a schematic block diagram of an improved EAST network.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but embodiments of the present invention are not limited thereto.
As shown in fig. 1, the instrument detection and classification method based on image text of this embodiment comprises the following steps:
1) Collect instrument images shot in different actual scenes, remove interference data whose recognition is affected by blur, extreme angles, or missing instruments, and mark the dial positions in the remaining data with the open-source annotation tool labelImg to construct an instrument positioning data set, which is divided into a training set and a test set. Load training parameters and train the improved YOLO network with the training set; after training, obtain the optimal improved YOLO network, input the test set into it, output the dial images, and crop them out. The improved YOLO network replaces the backbone with a MobileNet lightweight network to reduce the parameter count and computation and improve the running speed.
The improved YOLO network is designed according to the specific application scenario and the characteristics of the recognition target. Unless otherwise stated, every activation layer in this step uses the Leaky ReLU activation function. The steps are as follows:
a. constructing a feature extraction network
And constructing a feature extraction network according to the real-time and high-precision requirements. The feature extraction network is mainly composed of a plurality of combined convolution modules.
The feature extraction network has the following structure:
the input image is 416×416×3.
The first layer is the combined convolution module 1-a, as shown in fig. 2 (a). The module first passes through zero padding layer and outputs 418 x 3. Then the convolution layer, the batch normalization layer and the activation layer are adopted, the convolution kernel is (3, 3), the step length is 2, the number of filters is 32, and the output is 208 multiplied by 32.
The second layer is the combined convolution module 1-B, as shown in fig. 2 (B). The module first goes through the deep convolution, batch normalization layer and activation layer, the convolution kernel is (3, 3), the step size is 1, the filling is used to make the input and output size consistent, and the output is 208×208×32. And then the convolution, the batch normalization layer and the activation layer are carried out, the convolution kernel is (1, 1), the step length is 1, the number of filters is 64, the filling is used to ensure that the sizes of the input and the output are consistent, and the output is 208 multiplied by 64.
The third layer is the combined convolution module 1-C, as shown in fig. 2 (C). The module first passes through the zero padding layer and outputs 210 x 64. And then the depth convolution, the batch normalization layer and the activation layer are carried out, the convolution kernel is (3, 3), the step size is 2, and the output is 104 multiplied by 64. Finally, through convolution, a batch normalization layer and an activation layer, the convolution kernel is (1, 1), the step length is 1, the number of filters is 128, the input and output sizes are consistent by using filling, and the output is 104 multiplied by 128.
The fourth layer is the combined convolution module 1-B, as shown in fig. 2 (B). The module first goes through the deep convolution, batch normalization layer and activation layer, the convolution kernel is (3, 3), the step size is 1, the filling is used to make the input and output size consistent, and the output is 104×104×128. And then the convolution, the batch normalization layer and the activation layer are carried out, the convolution kernel is (1, 1), the step length is 1, the number of filters is 128, the filling is used to ensure that the sizes of the input and the output are consistent, and the output is 104 multiplied by 128.
The fifth layer is the combined convolution module 1-C, as shown in fig. 2 (C). The module first outputs 106×106×128 through zero padding layers. And then the depth convolution, the batch normalization layer and the activation layer are carried out, the convolution kernel is (3, 3), the step size is 2, and the output is 52 multiplied by 128. Finally, through convolution, batch normalization layer and activation layer, the convolution kernel is (1, 1), step length is 1, the number of filters is 256, the input and output are consistent by using filling, and the output is 52 multiplied by 256.
The sixth layer is the combined convolution module 1-B, as shown in fig. 2 (B). The module first goes through the deep convolution, batch normalization layer and activation layer, the convolution kernel is (3, 3), the step size is 1, the filling is used to make the input and output size consistent, and the output is 52×52×256. And then the convolution, the batch normalization layer and the activation layer are carried out, the convolution kernel is (1, 1), the step length is 1, the number of filters is 256, the filling is used to ensure that the sizes of the input and the output are consistent, and the output is 52 multiplied by 256.
The seventh layer is combined convolution module 1-C, as shown in fig. 2 (c). The input first passes through a zero-padding layer, outputting 54×54×256. It then passes through a depthwise convolution, batch normalization layer and activation layer with kernel (3, 3) and stride 2, outputting 26×26×256. Finally it passes through a convolution, batch normalization layer and activation layer with kernel (1, 1), stride 1, 512 filters and size-preserving padding, outputting 26×26×512.
The eighth layer is combined convolution module 1-D, as shown in fig. 2 (d), which applies five combined convolution modules 1-B in sequence, each as shown in fig. 2 (b). In each combined convolution module 1-B, the input first passes through a depthwise convolution, batch normalization layer and activation layer with kernel (3, 3), stride 1 and size-preserving padding, outputting 26×26×512, then through a convolution, batch normalization layer and activation layer with kernel (1, 1), stride 1, 512 filters and size-preserving padding, outputting 26×26×512. After the five identical combined convolution modules 1-B, the output is 26×26×512.
The ninth layer is combined convolution module 1-C, as shown in fig. 2 (c). The input first passes through a zero-padding layer, outputting 28×28×512. It then passes through a depthwise convolution, batch normalization layer and activation layer with kernel (3, 3) and stride 2, outputting 13×13×512. Finally it passes through a convolution, batch normalization layer and activation layer with kernel (1, 1), stride 1, 1024 filters and size-preserving padding, outputting 13×13×1024.
The tenth layer is combined convolution module 1-B, as shown in fig. 2 (b). The input first passes through a depthwise convolution, batch normalization layer and activation layer with kernel (3, 3), stride 1 and size-preserving padding, outputting 13×13×1024. It then passes through a convolution, batch normalization layer and activation layer with kernel (1, 1), stride 1, 1024 filters and size-preserving padding, outputting 13×13×1024.
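As a sanity check on the sizes quoted above, the strided stages of the feature extraction network can be reproduced with standard convolution arithmetic. This is an illustrative sketch, not code from the patent; the helper names are ours.

```python
# Sketch: reproduce the spatial sizes of the strided stages of the
# feature extraction network, assuming standard convolution arithmetic.

def conv_out(size, kernel, stride, pad=0):
    """Output spatial size of a convolution/pooling: floor((n + 2p - k)/s) + 1."""
    return (size + 2 * pad - kernel) // stride + 1

# Combined module 1-C: zero-pad by 1 pixel per side, then a 3x3 depthwise
# convolution with stride 2 (the following 1x1 convolution keeps the size).
def module_1c(size):
    padded = size + 2          # the zero-padding layer adds 1 pixel per side
    return conv_out(padded, 3, 2)

# Trace the strided stages quoted in the text: 208 -> 104 -> 52 -> 26 -> 13.
sizes = [208]
for _ in range(4):
    sizes.append(module_1c(sizes[-1]))
print(sizes)  # [208, 104, 52, 26, 13]
```

This confirms that each combined convolution module 1-C halves the spatial size, e.g. the 210×210 padded map becomes 104×104 after the stride-2 depthwise convolution.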
b. Constructing a predictive network
Prediction networks are constructed from the outputs of different layers of the feature extraction network to predict targets of different sizes; they comprise a large-size target prediction network, a medium-size target prediction network and a small-size target prediction network.
b1, large-size target prediction network
The input is the tenth-layer output of the feature extraction network; the large-size target prediction network consists mainly of combined convolution modules and convolution layers.
The input image is 13×13×1024.
The large-size target prediction network has the following structure:
the first layer is combined convolution module 1-D, as shown in fig. 2 (d), which applies five combined convolution modules 1-B in sequence, as shown in fig. 2 (b), alternating between two variants. In the first variant, the input passes through a depthwise convolution, batch normalization layer and activation layer with kernel (1, 1), stride 1 and size-preserving padding, outputting 13×13×1024, then through a convolution, batch normalization layer and activation layer with kernel (1, 1), stride 1, 512 filters and size-preserving padding, outputting 13×13×512. In the second variant, the input passes through a depthwise convolution, batch normalization layer and activation layer with kernel (3, 3), stride 1 and size-preserving padding, outputting 13×13×512, then through a convolution, batch normalization layer and activation layer with kernel (1, 1), stride 1, 1024 filters and size-preserving padding, outputting 13×13×1024. After alternating through the five modules, the output is 13×13×512.
The second layer is combined convolution module 1-B, as shown in fig. 2 (b). The input first passes through a depthwise convolution, batch normalization layer and activation layer with kernel (3, 3), stride 1 and size-preserving padding, outputting 13×13×512. It then passes through a convolution, batch normalization layer and activation layer with kernel (1, 1), stride 1, 1024 filters and size-preserving padding, outputting 13×13×1024.
The third layer is a convolution layer with kernel (1, 1), stride 1 and 256 filters, outputting 13×13×256.
b2, medium-sized target prediction network
The inputs are the eighth-layer output of the feature extraction network and the first-layer output of the large-size target prediction network; the medium-size target prediction network consists mainly of combined convolution modules and convolution layers.
The input images are 26×26×512 and 13×13×512.
The medium-size target prediction network structure is as follows:
the first layer is the input fusion module, as shown in fig. 2 (e). The input 13×13×512 first passes through a combined convolution module 1-B: a depthwise convolution, batch normalization layer and activation layer with kernel (1, 1), stride 1 and size-preserving padding, outputting 13×13×512, followed by a convolution, batch normalization layer and activation layer with kernel (1, 1), stride 1, 512 filters and size-preserving padding, outputting 13×13×512. It then passes through an upsampling layer with sampling factor 2, outputting 26×26×512. Finally, this output and the input 26×26×512 pass through the tensor stitching layer, outputting 26×26×1024.
The second layer is combined convolution module 1-D, as shown in fig. 2 (d), which applies five combined convolution modules 1-B in sequence, as shown in fig. 2 (b), alternating between two variants. In the first variant, the input passes through a depthwise convolution, batch normalization layer and activation layer with kernel (1, 1), stride 1 and size-preserving padding, outputting 26×26×1024, then through a convolution, batch normalization layer and activation layer with kernel (1, 1), stride 1, 256 filters and size-preserving padding, outputting 26×26×256. In the second variant, the input passes through a depthwise convolution, batch normalization layer and activation layer with kernel (3, 3), stride 1 and size-preserving padding, outputting 26×26×256, then through a convolution, batch normalization layer and activation layer with kernel (1, 1), stride 1, 512 filters and size-preserving padding, outputting 26×26×512. After alternating through the five modules, the output is 26×26×256.
The third layer is combined convolution module 1-B, as shown in fig. 2 (b). The input first passes through a depthwise convolution, batch normalization layer and activation layer with kernel (3, 3), stride 1 and size-preserving padding, outputting 26×26×256. It then passes through a convolution, batch normalization layer and activation layer with kernel (1, 1), stride 1, 512 filters and size-preserving padding, outputting 26×26×512.
The fourth layer is a convolution layer with kernel (1, 1), stride 1 and 256 filters, outputting 26×26×256.
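The input fusion module used by the first layer above, a ×2 upsampling of the deeper feature map followed by channel concatenation with the shallower one, can be sketched in NumPy. The function name and the choice of nearest-neighbour upsampling are our assumptions, not taken from the patent.

```python
import numpy as np

# Hypothetical sketch of the input fusion module: nearest-neighbour
# upsampling by a factor of 2, then tensor stitching (channel concatenation).
def fuse(deep, shallow):
    up = deep.repeat(2, axis=0).repeat(2, axis=1)   # (H, W, C) -> (2H, 2W, C)
    return np.concatenate([up, shallow], axis=-1)   # stack along channels

deep = np.zeros((13, 13, 512), np.float32)      # from the large-size branch
shallow = np.zeros((26, 26, 512), np.float32)   # eighth-layer backbone output
print(fuse(deep, shallow).shape)  # (26, 26, 1024)
```

The resulting 26×26×1024 tensor matches the first-layer output quoted for the medium-size target prediction network.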
b3, small-sized target prediction network
The inputs are the sixth-layer output of the feature extraction network and the second-layer output of the medium-size target prediction network; the small-size target prediction network consists mainly of combined convolution modules and convolution layers.
The input images are 52×52×256 and 26×26×256.
The small-size target prediction network structure is as follows:
the first layer is the input fusion module, as shown in fig. 2 (e). The input 26×26×256 first passes through a combined convolution module 1-B: a depthwise convolution, batch normalization layer and activation layer with kernel (1, 1), stride 1 and size-preserving padding, outputting 26×26×256, followed by a convolution, batch normalization layer and activation layer with kernel (1, 1), stride 1, 256 filters and size-preserving padding, outputting 26×26×256. It then passes through an upsampling layer with sampling factor 2, outputting 52×52×256. Finally, this output and the input 52×52×256 pass through the tensor stitching layer, outputting 52×52×512.
The second layer is combined convolution module 1-D, as shown in fig. 2 (d), which applies five combined convolution modules 1-B in sequence, as shown in fig. 2 (b), alternating between two variants. In the first variant, the input passes through a depthwise convolution, batch normalization layer and activation layer with kernel (1, 1), stride 1 and size-preserving padding, outputting 52×52×512, then through a convolution, batch normalization layer and activation layer with kernel (1, 1), stride 1, 128 filters and size-preserving padding, outputting 52×52×128. In the second variant, the input passes through a depthwise convolution, batch normalization layer and activation layer with kernel (3, 3), stride 1 and size-preserving padding, outputting 52×52×128, then through a convolution, batch normalization layer and activation layer with kernel (1, 1), stride 1, 256 filters and size-preserving padding, outputting 52×52×256. After alternating through the five modules, the output is 52×52×128.
The third layer is combined convolution module 1-B, as shown in fig. 2 (b). The input first passes through a depthwise convolution, batch normalization layer and activation layer with kernel (3, 3), stride 1 and size-preserving padding, outputting 52×52×128. It then passes through a convolution, batch normalization layer and activation layer with kernel (1, 1), stride 1, 256 filters and size-preserving padding, outputting 52×52×256.
The fourth layer is a convolution layer with kernel (1, 1), stride 1 and 256 filters, outputting 52×52×256.
Finally, the outputs of the large-size (13×13×256), medium-size (26×26×256) and small-size (52×52×256) target prediction networks are processed by a non-maximum suppression layer to obtain the predicted target positions and categories.
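The non-maximum suppression step can be sketched with the standard greedy algorithm. This is a NumPy sketch under our own naming, not code from the patent.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression.
    boxes: list of [x1, y1, x2, y2]; scores: per-box confidence.
    Returns indices of the kept boxes, highest score first."""
    boxes = np.asarray(boxes, float)
    order = np.argsort(scores)[::-1]        # sort by descending confidence
    keep = []
    while order.size:
        i = order[0]
        keep.append(int(i))
        if order.size == 1:
            break
        rest = order[1:]
        # Intersection of the top box with every remaining box.
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thresh]     # suppress heavily overlapping boxes
    return keep

print(nms([[0, 0, 10, 10], [1, 1, 10, 10], [20, 20, 30, 30]],
          [0.9, 0.8, 0.7]))  # [0, 2]
```

The second box overlaps the first with IoU 0.81, above the 0.5 threshold, so only the first and third boxes survive.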
c. Setting a loss function
The loss function is set as the sum of the center coordinate loss, the width-height loss, the confidence loss and the category loss, averaged over the inputs. The loss function formula is as follows:

Loss = (Loss_xy + Loss_wh + Loss_confidence + Loss_cls) / numf

where Loss is the total loss, Loss_xy the center coordinate loss, Loss_wh the width-height loss, Loss_confidence the confidence loss, Loss_cls the category loss, and numf the total number of inputs as a floating-point number. The individual loss functions are as follows:
Loss_xy = mark_object * (2 - w*h) * Loss_log(xy_true, xy_predict)

Loss_wh = 0.5 * mark_object * (2 - w*h) * (wh_true - wh_predict)^2

Loss_confidence = mark_object * Loss_log(mark_object, c_predict) + (1 - mark_object) * Loss_log(mark_object, c_predict) * mark_ignore

Loss_cls = mark_object * Loss_log(cls_true, cls_predict)

where mark_object is a flag bit indicating whether an object exists in the anchor box, w and h are the width and height of the anchor box, Loss_log is the binary cross entropy loss, xy_true and xy_predict are the true and predicted center coordinate values, wh_true and wh_predict are the true and predicted width-height values, c_predict is the confidence value of the prediction box, mark_ignore is a flag bit marking anchor boxes whose IOU is below the threshold, and cls_true and cls_predict are the true and predicted categories.
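The four terms can be evaluated for a single toy anchor box in NumPy. This is a sketch with our own variable names and toy values; Loss_log is taken to be binary cross entropy as stated above.

```python
import numpy as np

# Binary cross entropy, summed over elements (stands in for Loss_log above).
def loss_log(y_true, y_pred, eps=1e-7):
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)).sum()

mark_object = 1.0            # an object exists in this anchor box
mark_ignore = 0.0            # anchor box is not in the ignored set
w, h = 0.3, 0.2              # normalized anchor box width and height
xy_true, xy_pred = np.array([0.5, 0.5]), np.array([0.4, 0.6])
wh_true, wh_pred = np.array([w, h]), np.array([0.25, 0.25])
c_pred = 0.8                 # predicted confidence
cls_true, cls_pred = np.array([1.0, 0.0]), np.array([0.7, 0.3])

loss_xy = mark_object * (2 - w * h) * loss_log(xy_true, xy_pred)
loss_wh = 0.5 * mark_object * (2 - w * h) * ((wh_true - wh_pred) ** 2).sum()
loss_conf = (mark_object * loss_log(np.array([mark_object]), np.array([c_pred]))
             + (1 - mark_object) * loss_log(np.array([mark_object]),
                                            np.array([c_pred])) * mark_ignore)
loss_cls = mark_object * loss_log(cls_true, cls_pred)

numf = 1.0                   # total number of inputs, as a float
total = (loss_xy + loss_wh + loss_conf + loss_cls) / numf
```

The (2 - w*h) factor up-weights small boxes, since their area w*h is close to zero.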
Training the improved YOLO network comprises the following steps:
d1, setting training parameters
The training optimizer is set to Adam, the initial learning rate to 0.001, the number of iterations to 500 and the batch size to 8; all labels are clustered with K-means to generate the initial prior boxes (38, 29), (65, 52), (94, 87), (142, 134), (195, 69), (216, 206), (337, 320), (397, 145) and (638, 569).
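A minimal sketch of K-means clustering of label box sizes with a 1 - IoU distance, a common way to generate such prior boxes. The initialization, iteration count and all names are our assumptions, not the patent's procedure.

```python
import numpy as np

def wh_iou(wh, centers):
    """IoU between box sizes, treating all boxes as sharing a corner."""
    inter = (np.minimum(wh[:, None, 0], centers[None, :, 0])
             * np.minimum(wh[:, None, 1], centers[None, :, 1]))
    union = ((wh[:, 0] * wh[:, 1])[:, None]
             + (centers[:, 0] * centers[:, 1])[None, :] - inter)
    return inter / union

def kmeans_anchors(wh, k, iters=20):
    wh = np.asarray(wh, float)
    centers = wh[:k].copy()                  # deterministic init: first k boxes
    for _ in range(iters):
        assign = (1 - wh_iou(wh, centers)).argmin(axis=1)  # distance = 1 - IoU
        for j in range(k):
            if (assign == j).any():
                centers[j] = np.median(wh[assign == j], axis=0)
    return centers[np.argsort(centers.prod(axis=1))]       # sort by area

boxes = [[28, 30], [32, 29], [30, 31], [290, 300], [310, 295], [300, 305]]
print(kmeans_anchors(boxes, 2))  # two anchors, near (30, 30) and (300, 300)
```

Sorting by area mirrors the ascending order of the prior boxes listed above.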
d2, on-line data enhancement
Data enhancement is performed on the input images to expand the data set. The enhancement methods are: random mirror flipping, random noise addition and random contrast adjustment.
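The three enhancement operations can be sketched in NumPy. The function name, noise level and contrast range below are our assumptions for illustration.

```python
import numpy as np

# Hypothetical online-augmentation sketch of the three operations named above.
def augment(img, rng):
    if rng.random() < 0.5:                      # random mirror flip
        img = img[:, ::-1]
    img = img.astype(np.float64) + rng.normal(0, 5, img.shape)  # random noise
    gain = rng.uniform(0.8, 1.2)                # random contrast adjustment
    img = (img - img.mean()) * gain + img.mean()
    return np.clip(img, 0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
sample = np.full((416, 416, 3), 128, np.uint8)
print(augment(sample, rng).shape)  # (416, 416, 3)
```

Because the operations are applied on the fly, each epoch sees a different variant of every training image.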
d3, setting training completion mark
The training accuracy is checked on the verification set at intervals, and the network is saved once training reaches the optimum; the training completion criteria are reaching the maximum of 500 iterations and meeting the accuracy requirements.
Inputting the test set into an optimal improved YOLO network to obtain the dial position and the dial image.
2) The character positions in the dial images cut out in step 1) are labeled to construct a character detection data set, which is divided into a training set and a test set. The training parameters are reset and the improved EAST network is trained with the training set; after training, the optimal improved EAST network is obtained. The test set is input into the optimal improved EAST network, which outputs the character positions in the data set, and the character regions are cut out as character images. The improved EAST network changes the backbone network to VGG to improve detection accuracy, and the output-layer prediction module is modified to predict vertices using only the head elements, improving prediction performance on long text.
According to the specific application scene and the characteristics of the objects to be recognized, an improved EAST network is designed. Unless otherwise stated, all activation layers below use the ReLU activation function. The method comprises the following steps:
a. constructing a feature extraction network
The feature extraction network has the following structure:
The input image is 256×256×3.
The first layer is combined convolution module 2-B which, as shown in fig. 3 (b), consists of two combined convolution modules 2-A and one max pooling layer. The first combined convolution module 2-A passes through a zero-padding layer, outputting 258×258×3, then through a convolution layer and activation layer with kernel (3, 3), stride 1 and 64 filters, outputting 256×256×64. The second combined convolution module 2-A passes through a zero-padding layer, outputting 258×258×64, then through a convolution layer and activation layer with kernel (3, 3), stride 1 and 64 filters, outputting 256×256×64. A max pooling layer with pooling kernel (2, 2) and stride 2 follows, outputting 128×128×64.
The second layer is combined convolution module 2-B, consisting of two combined convolution modules 2-A and one max pooling layer. Each combined convolution module 2-A passes through a zero-padding layer, then a convolution layer and activation layer with kernel (3, 3), stride 1 and 128 filters. A max pooling layer with pooling kernel (2, 2) and stride 2 follows, outputting 64×64×128.
The third layer is combined convolution module 2-C which, as shown in fig. 3 (c), consists of three combined convolution modules 2-A and one max pooling layer. Each combined convolution module 2-A passes through a zero-padding layer, then a convolution layer and activation layer with kernel (3, 3), stride 1 and 256 filters. A max pooling layer with pooling kernel (2, 2) and stride 2 follows, outputting 32×32×256.
The fourth layer is combined convolution module 2-C, identical in structure but with 512 filters in each combined convolution module 2-A; its max pooling layer outputs 16×16×512.
The fifth layer is combined convolution module 2-C, again with 512 filters in each combined convolution module 2-A; its max pooling layer outputs 8×8×512.
b. Constructing feature fusion networks
The first layer is input fusion module 2-G, as shown in fig. 3 (g). The last-layer output of the feature extraction network, 8×8×512, first passes through an upsampling layer with sampling factor 2, outputting 16×16×512. This output and the fourth-layer output of the feature extraction network, 16×16×512, pass through the tensor stitching layer, outputting 16×16×1024.
The second layer is combined convolution module 2-E which, as shown in fig. 3 (e), consists of two batch normalization layers, one combined convolution module 2-D and one combined convolution module 2-A. The combined convolution module 2-D consists of a zero-padding layer, a convolution layer and an activation layer. The input first passes through a batch normalization layer, then through the combined convolution module 2-D with kernel (1, 1), stride 1 and 128 filters, outputting 16×16×128. It then passes through another batch normalization layer and the combined convolution module 2-A with kernel (3, 3), stride 1 and 64 filters, outputting 16×16×64.
The third layer is input fusion module 2-G. The second-layer output of the feature fusion network, 16×16×64, first passes through an upsampling layer with sampling factor 2, outputting 32×32×64. This output and the third-layer output of the feature extraction network, 32×32×256, pass through the tensor stitching layer, outputting 32×32×320.
The fourth layer is combined convolution module 2-E. The input first passes through a batch normalization layer, then through the combined convolution module 2-D with kernel (1, 1), stride 1 and 128 filters, outputting 32×32×128. It then passes through another batch normalization layer and the combined convolution module 2-A with kernel (3, 3), stride 1 and 64 filters, outputting 32×32×64.
The fifth layer is input fusion module 2-G. The fourth-layer output of the feature fusion network, 32×32×64, first passes through an upsampling layer with sampling factor 2, outputting 64×64×64. This output and the second-layer output of the feature extraction network, 64×64×128, pass through the tensor stitching layer, outputting 64×64×192.
The sixth layer is combined convolution module 2-F which, as shown in fig. 3 (f), consists of three batch normalization layers, one combined convolution module 2-D and two combined convolution modules 2-A. The input first passes through a batch normalization layer, then through the combined convolution module 2-D with kernel (1, 1), stride 1 and 32 filters, outputting 64×64×32. It then passes through a batch normalization layer and a combined convolution module 2-A with kernel (3, 3), stride 1 and 32 filters, outputting 64×64×32, and finally through another batch normalization layer and a second combined convolution module 2-A with the same parameters, outputting 64×64×32.
c. Constructing a predictive network
The first layer has three branches, each consisting of a combined convolution module 2-D that passes through a zero-padding layer, then a convolution layer and an activation layer with kernel (1, 1) and stride 1. The first branch has 1 filter and outputs 64×64×1, the second branch has 2 filters and outputs 64×64×2, and the third branch has 4 filters and outputs 64×64×4.
The second layer is an input fusion module that splices the three branches of the first layer of the prediction network; the output is 64×64×7.
d. Setting a loss function
The loss function is set to be the sum of category loss, geometry loss, and angle loss.
The class loss function formula is as follows:

L_S = -β · Y* · ln(Ŷ) - (1 - β) · (1 - Y*) · ln(1 - Ŷ)

where L_S is the class loss, β is the balancing weight, Ŷ is the predicted category and Y* is the true category.
The geometry loss function formula is as follows:

L_AABB = -ln IoU(R̂, R*)

where L_AABB is the geometry loss function, R̂ is the predicted AABB geometry, R* is the real AABB geometry and IoU is the intersection-over-union.
The angle loss function formula is as follows:

L_θ(θ̂, θ*) = 1 - cos(θ̂ - θ*)

where L_θ(θ̂, θ*) is the angle loss function, θ̂ is the predicted rotation angle and θ* is the true rotation angle.
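Assuming the standard published EAST formulations behind the symbols above (balanced cross entropy for the class loss, -ln IoU for the AABB geometry and 1 - cos for the angle), the three terms can be sketched in NumPy; names and test values are ours.

```python
import numpy as np

def class_loss(y_true, y_pred, beta, eps=1e-7):
    """Balanced binary cross entropy over the score map."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return np.mean(-beta * y_true * np.log(y_pred)
                   - (1 - beta) * (1 - y_true) * np.log(1 - y_pred))

def aabb_loss(d_true, d_pred):
    """-ln IoU for axis-aligned boxes given as (top, right, bottom, left)
    distances from a pixel to the four box edges."""
    area_t = (d_true[0] + d_true[2]) * (d_true[1] + d_true[3])
    area_p = (d_pred[0] + d_pred[2]) * (d_pred[1] + d_pred[3])
    inter = ((min(d_true[0], d_pred[0]) + min(d_true[2], d_pred[2]))
             * (min(d_true[1], d_pred[1]) + min(d_true[3], d_pred[3])))
    iou = inter / (area_t + area_p - inter)
    return -np.log(iou)

def angle_loss(theta_pred, theta_true):
    return 1 - np.cos(theta_pred - theta_true)
```

All three terms are zero (or near zero) for a perfect prediction and grow as the prediction diverges.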
Training the improved EAST network comprises the following steps:
e1, setting training parameters
The training optimizer, initial learning rate, number of iterations, batch size and initial prior boxes are set.
e2, on-line data enhancement
Data enhancement is performed on the input images to expand the data set. The main enhancement methods are random noise addition and random contrast adjustment.
e3, setting training completion mark
The training accuracy is checked on the verification set at intervals, and the network is saved once training reaches the optimum; the training completion criteria are reaching the maximum number of iterations and meeting the accuracy requirements.
Inputting the test set into the optimal improved EAST network to obtain the character image.
3) The character information in the character images cut out in step 2) is labeled to construct a character recognition data set, which is divided into a training set and a test set. The training parameters are reset and the CRNN network is trained with the training set; after training, the optimal CRNN network is obtained. The test set is input into the optimal CRNN network, which outputs the character information.
The CRNN network is constructed, which comprises the following steps:
a. constructing a feature extraction network
The input image is w×32×1, where w is the width of the input image, and is adaptively changed according to the input picture size.
The first layer is combined convolution module 3-A, which passes through a zero-padding layer, then a convolution layer and an activation layer, with kernel (3, 3), stride 1 and 64 filters, outputting w×32×64.
The second layer is a max pooling layer with pooling kernel (2, 2) and stride 2, outputting (w/2)×16×64.
The third layer is combined convolution module 3-A, which passes through a zero-padding layer, then a convolution layer and an activation layer, with kernel (3, 3), stride 1 and 128 filters, outputting (w/2)×16×128.
The fourth layer is a max pooling layer with pooling kernel (2, 2) and stride 2, outputting (w/4)×8×128.
The fifth layer is combined convolution module 3-B, which first passes through a zero-padding layer, then a convolution layer, a batch normalization layer and an activation layer, with kernel (3, 3), stride 1 and 256 filters, outputting (w/4)×8×256.
The sixth layer is combined convolution module 3-A, which passes through a zero-padding layer, then a convolution layer and an activation layer, with kernel (3, 3), stride 1 and 256 filters, outputting (w/4)×8×256.
The seventh layer is a max pooling layer with pooling kernel (2, 2) and stride 2, outputting (w/8)×4×256.
The eighth layer is combined convolution module 3-B, which first passes through a zero-padding layer, then a convolution layer, a batch normalization layer and an activation layer, with kernel (3, 3), stride 1 and 512 filters, outputting (w/8)×4×512.
The ninth layer is combined convolution module 3-A, which passes through a zero-padding layer, then a convolution layer and an activation layer, with kernel (3, 3), stride 1 and 512 filters, outputting (w/8)×4×512.
The tenth layer is a max pooling layer with pooling kernel (2, 2) and stride 2, outputting (w/16)×2×512.
The eleventh layer is combined convolution module 3-C, which passes through a convolution layer, a batch normalization layer and an activation layer, with kernel (2, 2), stride 1 and 512 filters; without padding, this reduces the output to ((w/16)-1)×1×512.
b. Constructing a predictive network
The first layer is a recurrent module consisting of a bidirectional LSTM, which processes the feature map column by column as a sequence.
The second layer is likewise a recurrent module consisting of a bidirectional LSTM.
c. Setting up a decoder
The output of the prediction network is converted into a sequence in which each element ranges from 0 to 6735 and corresponds to an individual character in the character library (0 corresponds to the blank character); this amounts to dividing a line of text into blocks, each block being predicted as one character. The sequence is processed from left to right: when an element is not 0 and differs from the previous element, the character corresponding to its value in the character library is output.
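A minimal greedy CTC decoder in the spirit of the rule above: scan left to right, emit a character only when the element is non-zero and differs from the previous element, then map the surviving indices through the character library. The 3-character library here is a toy stand-in for the 6735-character one in the text:

```python
BLANK = 0  # index 0 is the blank character, as stated above

def ctc_greedy_decode(seq, charset):
    """Collapse repeats, drop blanks, map indices to characters."""
    out = []
    prev = None
    for idx in seq:
        if idx != BLANK and idx != prev:
            out.append(charset[idx])
        prev = idx
    return "".join(out)

charset = {1: "A", 2: "B", 3: "C"}  # toy character library
print(ctc_greedy_decode([1, 1, 0, 2, 2, 0, 2, 3], charset))  # ABBC
```

Note how the blank between the two runs of 2 keeps the repeated "B" from being collapsed into one character.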
d. Setting a loss function
The loss function is set to the CTC (Connectionist Temporal Classification) loss function.
The CTC loss function formula is as follows:
L_CTC = -ln ∏_{(x,z)∈S} p(z|x) = -∑_{(x,z)∈S} ln p(z|x)

wherein L_CTC represents the CTC loss function, p(z|x) represents the probability of outputting sequence z given input x, and S is the training set.
Training a CRNN network comprising the steps of:
e1, setting training parameters
Setting a training optimizer, an initial learning rate, iteration times and batch sizes.
e2, setting training completion mark
Setting the interval at which training accuracy is checked on the verification set, and saving the optimal network after training; the training completion mark is reaching the maximum number of iterations or meeting the accuracy requirement.
And inputting the test set into the optimal CRNN network to obtain character information.
4) Splicing the character information output in step 3) into a text, marking the instrument type corresponding to the text to construct a text classification data set, dividing the text classification data set into a training set and a test set, loading training parameters to train the TextCNN network by using the training set, obtaining an optimal TextCNN network after training, and inputting the test set into the optimal TextCNN network to output the instrument type corresponding to the text.
The textCNN network is constructed, comprising the following steps:
a. constructing a network structure:
the first layer is an embedding layer, which takes an input text of length m and maps it to word vectors, giving a 600 × 64 input tensor;
the second layer is a convolution module; the convolution kernel is (5, 5), the step length is 1 and the number of filters is 256, with an output of 596 × 256;
the third layer is a maximum pooling layer, with an output of 1 × 256;
the fourth layer consists of a full-connection layer, a Dropout layer and an activation layer, with an output of 1 × 128;
the fifth layer consists of a full-connection layer and an activation layer, with an output of 1 × cls, where cls is the number of categories;
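The stated shapes are mutually consistent: a width-5 valid convolution over the 600-step embedded text yields 596 positions, and global max pooling then keeps one value per filter. A quick arithmetic check (the helper name is ours, not the patent's):

```python
# Sketch: output-length arithmetic for the TextCNN described above.
def conv1d_out(length, kernel, stride=1):
    """Length after a valid (unpadded) 1-D convolution."""
    return (length - kernel) // stride + 1

seq_len = 600                      # embedded text length from the text
conv_len = conv1d_out(seq_len, 5)  # width-5 kernel, step length 1
print(conv_len)                    # 596, matching the stated 596 x 256
```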
b. setting a loss function
The loss function formula is as follows:

L_CrossEntropy = -∑_{i=1}^{n} y_i ln(ŷ_i)

wherein L_CrossEntropy represents the loss, n represents the number of categories, y_i represents the true probability of category i, and ŷ_i represents the predicted probability of category i.
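As a numeric illustration of the cross-entropy defined above (the probability values are chosen arbitrarily): with a one-hot true distribution, the loss reduces to minus the log of the probability the model assigns to the true category.

```python
import math

def cross_entropy(y_true, y_pred):
    """Multi-class cross-entropy: L = -sum_i y_i * ln(y_hat_i)."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)

# One-hot truth for category 1; the model assigns it probability 0.8:
loss = cross_entropy([0.0, 1.0, 0.0], [0.1, 0.8, 0.1])
print(round(loss, 4))  # 0.2231, i.e. -ln(0.8)
```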
Training TextCNN network, comprising the steps of:
c1, setting training parameters
Setting a training optimizer, an initial learning rate, the number of iterations and the batch size.
c2, setting training completion mark
Setting the interval at which training accuracy is checked on the verification set, and saving the optimal network after training; the training completion mark is reaching the maximum number of iterations or meeting the accuracy requirement.
And inputting the test set into the optimal TextCNN network to output the instrument type corresponding to the text.
In summary, the invention provides a new method for detecting and classifying instrument images. By using neural networks as an effective means of instrument detection and classification, the method addresses the difficulty of reading instrument types, promotes the development of automatic instrument identification technology, and has practical popularization value.
The above examples are preferred embodiments of the present invention, but the embodiments of the present invention are not limited thereto; any change, modification, substitution, combination or simplification made without departing from the spirit and principle of the present invention is an equivalent replacement and falls within the protection scope of the present invention.
Claims (5)
1. An instrument detection and classification method based on image text, characterized by comprising the following steps:
1) Marking the dial positions in the instrument images to construct an instrument positioning data set, dividing the instrument positioning data set into a training set and a test set, loading training parameters, training an improved YOLO network by using the training set, obtaining an optimal improved YOLO network after training, inputting the test set into the optimal improved YOLO network, outputting dial images and cutting them out; wherein the improved YOLO network replaces the backbone network with a MobileNet lightweight network so as to reduce network parameters and calculation amount and improve operation speed;
2) Marking the character positions in the dial images cut out in step 1) to construct a character detection data set, dividing the character detection data set into a training set and a test set, loading training parameters, training an improved EAST network by using the training set, obtaining an optimal improved EAST network after training, inputting the test set into the optimal improved EAST network, outputting the character positions, and cutting them into character images; wherein the improved EAST network changes the backbone network to VGG to improve detection accuracy, and modifies the output layer structure so that the prediction module uses only head elements to predict vertices, improving the prediction performance on long text;
the specific cases of the improved EAST network are as follows:
a. the feature extraction network is constructed, and the structure is as follows:
the first layer is a combined convolution module 2-B, which consists of two combined convolution modules 2-A and a maximum pooling layer, wherein the combined convolution module 2-A consists of a zero filling layer, a convolution layer and an activation layer;
the second layer is a combined convolution module 2-B, which consists of two combined convolution modules 2-A and a maximum pooling layer;
the third layer is a combined convolution module 2-C, which consists of three combined convolution modules 2-A and a maximum pooling layer;
The fourth layer is a combined convolution module 2-C, which consists of three combined convolution modules 2-A and a maximum pooling layer;
the fifth layer is a combined convolution module 2-C, which consists of three combined convolution modules 2-A and a maximum pooling layer;
b. the feature fusion network is constructed, and the structure is as follows:
the first layer is an input fusion module 2-G, which consists of an up-sampling layer and a tensor splicing layer;
the second layer is a combined convolution module 2-E, which consists of two batch normalization layers, a combined convolution module 2-D and a combined convolution module 2-A; wherein the combined convolution module 2-D consists of a zero padding layer, a convolution layer and an activation layer;
the third layer is an input fusion module 2-G, which consists of an up-sampling layer and a tensor splicing layer;
the fourth layer is a combined convolution module 2-E, which consists of two batch normalization layers, a combined convolution module 2-D and a combined convolution module 2-A;
the fifth layer is an input fusion module 2-G, which consists of an up-sampling layer and a tensor splicing layer;
the sixth layer is a combined convolution module 2-F, which consists of three batch normalization layers, one combined convolution module 2-D and two combined convolution modules 2-A;
c. The prediction network is constructed, and the structure is as follows:
the first layer is divided into three branches, and the first branch consists of a combined convolution module 2-D; the second branch consists of a combined convolution module 2-D; the third branch consists of a combined convolution module 2-D;
the second layer is an input fusion module which is formed by splicing three branches of the first layer;
d. the set loss function comprises a category loss function, a geometric shape loss function and an angle loss function;
the class loss function formula is as follows:
L_S = -β Y* ln Ŷ - (1-β)(1-Y*) ln(1-Ŷ)

wherein L_S represents the class loss, β represents the weight, Ŷ is the predicted category, and Y* is the true category;
The geometry loss function formula is as follows:
L_AABB = -ln IoU(R̂, R*)

wherein L_AABB represents the geometry loss, R̂ represents the geometry of the predicted quadrilateral text box AABB, R* represents the geometry of the real quadrilateral text box AABB, and IoU represents the intersection ratio;
the angle loss function formula is as follows:
L_θ = 1 - cos(θ̂ - θ*)

wherein L_θ is the angle loss, θ̂ is the predicted value of the rotation angle, and θ* is the true value of the rotation angle;
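Read together, the three terms penalize misclassification, box-overlap error, and angle error. A plain-Python sketch follows; the exact formulas are reconstructed here from the variable definitions above and the published EAST design (balanced cross-entropy, negative log-IoU, and 1 − cos of the angle difference), so treat them as an assumption rather than the patent's verbatim arithmetic:

```python
import math

def class_loss(y_true, y_pred, beta):
    """Balanced cross-entropy: -b*Y*ln(Yhat) - (1-b)*(1-Y)*ln(1-Yhat)."""
    return (-beta * y_true * math.log(y_pred)
            - (1 - beta) * (1 - y_true) * math.log(1 - y_pred))

def geometry_loss(iou):
    """L_AABB = -ln IoU(R_hat, R_star)."""
    return -math.log(iou)

def angle_loss(theta_pred, theta_true):
    """L_theta = 1 - cos(theta_hat - theta_star)."""
    return 1 - math.cos(theta_pred - theta_true)

print(round(geometry_loss(0.5), 4))   # 0.6931: half overlap costs ln 2
print(round(angle_loss(0.3, 0.3), 4)) # 0.0: exact angle costs nothing
```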
training and improving EAST network by loading training parameters, wherein the training parameters are set as follows: setting a training optimizer to Adam, an initial learning rate of 0.001, a maximum training period of 500 and a batch size of 8; setting the interval detection training accuracy of the verification set, and storing a network after the training is completed to the optimal condition, wherein the training completion mark is the maximum training period or the average crossing ratio meeting the requirement;
Inputting the test set into an optimal improved EAST network to obtain a text position, and cutting the text position into character images;
3) Marking the character information in the character images cut in step 2) to construct a character recognition data set, dividing the character recognition data set into a training set and a test set, loading training parameters to train the CRNN network by using the training set, obtaining an optimal CRNN network after training, inputting the test set into the optimal CRNN network, and outputting the character information;
4) Splicing the character information output in step 3) into a text, marking the instrument type corresponding to the text to construct a text classification data set, dividing the text classification data set into a training set and a test set, loading training parameters to train the TextCNN network by using the training set, obtaining an optimal TextCNN network after training, and inputting the test set into the optimal TextCNN network to output the instrument type corresponding to the text.
2. The image-text-based instrument detection and classification method according to claim 1, wherein in step 1), instrument images of various types are collected by a camera under different environments; filtering and image enhancement preprocessing operations are performed on the instrument images; abnormal data, including data with surface dirt, extreme illumination and abnormal shooting, are removed; the remaining data are marked, the marked content being the dial position, to construct an instrument positioning data set; and the instrument positioning data set is divided into a training set and a test set.
3. The method of image text based meter test classification of claim 1, wherein in step 1), said improved YOLO network is specified as follows:
a. constructing a feature extraction network according to the real-time and high-precision requirements:
the first layer is a combined convolution module 1-A, which consists of a zero filling layer, a convolution layer, a batch normalization layer and an activation layer;
the second layer is a combined convolution module 1-B, which consists of a depth convolution layer, two batch normalization layers, two activation layers and a convolution layer;
the third layer is a combined convolution module 1-C, which consists of a zero filling layer, a depth convolution layer, two batch normalization layers, two activation layers and a convolution layer;
the fourth layer is a combined convolution module 1-B, which consists of a deep convolution layer, two batch normalization layers, two activation layers and a convolution layer;
the fifth layer is a combined convolution module 1-C, which consists of a zero filling layer, a depth convolution layer, two batch normalization layers, two activation layers and a convolution layer;
the sixth layer is a combined convolution module 1-B, which consists of a deep convolution layer, two batch normalization layers, two activation layers and a convolution layer;
The seventh layer is a combined convolution module 1-C, which consists of a zero filling layer, a depth convolution layer, two batch normalization layers, two activation layers and a convolution layer;
the eighth layer is a combined convolution module 1-D, which consists of five combined convolution modules 1-B;
the ninth layer is a combined convolution module 1-C, which consists of a zero filling layer, a depth convolution layer, two batch normalization layers, two activation layers and a convolution layer;
the tenth layer is a combined convolution module 1-B, which consists of a depth convolution layer, two batch normalization layers, two activation layers and a convolution layer;
b. constructing and outputting prediction networks for predicting targets with different sizes according to the output of different layers of the feature extraction network, wherein the prediction networks comprise a large-size target prediction network, a medium-size target prediction network and a small-size target prediction network;
b1, inputting a tenth layer of output of a feature extraction network, wherein the large-size target prediction network consists of a plurality of combination convolution modules and convolution layers, and has the following structure:
the first layer is a combined convolution module 1-D, which consists of five combined convolution modules 1-B;
the second layer is a combined convolution module 1-B, which consists of a depth convolution layer, two batch normalization layers, two activation layers and a convolution layer;
The third layer is a convolution layer;
b2, inputting an eighth layer output of a characteristic extraction network and a first layer output of a large-size target prediction network, wherein the medium-size target prediction network consists of a plurality of combination convolution modules and convolution layers, and the structure is as follows:
the first layer is an input fusion module 1-E, which consists of a combined convolution module 1-B, an up-sampling layer and a tensor splicing layer;
the second layer is a combined convolution module 1-D, which consists of five combined convolution modules 1-B;
the third layer is a combined convolution module 1-B, which consists of a depth convolution layer, two batch normalization layers, two activation layers and a convolution layer;
the fourth layer is a convolution layer;
b3, taking the input as the sixth layer output of the characteristic extraction network and the second layer output of the medium-size target prediction network, wherein the small-size target prediction network consists of a plurality of combination convolution modules and convolution layers and has the following structure:
the first layer is an input fusion module 1-E, which consists of a combined convolution module 1-B, an up-sampling layer and a tensor splicing layer;
the second layer is a combined convolution module 1-D, which consists of five combined convolution modules 1-B;
the third layer is a combined convolution module 1-B, which consists of a depth convolution layer, two batch normalization layers, two activation layers and a convolution layer;
The fourth layer is a convolution layer;
finally, the output of the large-size target prediction network, the medium-size target prediction network and the small-size target prediction network is processed through a non-maximum suppression layer to obtain the predicted target position and category;
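The non-maximum suppression step above can be sketched as follows: keep the highest-confidence box, drop boxes that overlap it beyond a threshold, and repeat. Boxes are (x1, y1, x2, y2, score) tuples, and the 0.5 overlap threshold is an illustrative assumption, not a value stated in the claim:

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def nms(boxes, thresh=0.5):
    """Greedy non-maximum suppression over (x1, y1, x2, y2, score)."""
    boxes = sorted(boxes, key=lambda b: b[4], reverse=True)
    kept = []
    for b in boxes:
        if all(iou(b, k) < thresh for k in kept):
            kept.append(b)
    return kept

boxes = [(0, 0, 10, 10, 0.9), (1, 1, 10, 10, 0.8), (20, 20, 30, 30, 0.7)]
print(len(nms(boxes)))  # 2: the second box overlaps the first too much
```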
c. the loss function is set to have a center coordinate loss function, a wide-high loss function, a confidence loss function and a category loss function;
the center coordinate loss function formula is as follows:
Loss_xy = mark_object * (2 - w*h) * Loss_log(xy_true, xy_predict)

wherein Loss_xy represents the center coordinate loss, mark_object is a flag bit indicating whether an object exists in the anchor frame, w represents the width of the anchor frame, h represents the height of the anchor frame, Loss_log represents the binary cross-entropy loss, xy_true represents the true center coordinate value, and xy_predict represents the predicted center coordinate value;
the wide-high loss function formula is as follows:
Loss_wh = 0.5 * mark_object * (2 - w*h) * (wh_true - wh_predict)²

wherein Loss_wh represents the width-height loss, wh_true represents the true width-height value, and wh_predict represents the predicted width-height value;
the confidence loss function formula is as follows:
Loss_confidence = mark_object * Loss_log(mark_object, c_predict) + (1 - mark_object) * Loss_log(mark_object, c_predict) * mark_ignore

wherein Loss_confidence represents the confidence loss, c_predict represents the confidence value of the prediction frame, and mark_ignore is a flag bit for anchor frames whose IOU is less than the threshold;
the class loss function formula is as follows:
Loss_cls = mark_object * Loss_log(cls_true, cls_predict)

wherein Loss_cls represents the class loss, cls_true represents the true category, and cls_predict represents the predicted category;
the total loss function formula is as follows:
Loss = (Loss_xy + Loss_wh + Loss_confidence + Loss_cls) / numf

wherein Loss represents the total loss and numf represents the total input number as a floating point value;
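A scalar sketch of the loss terms as defined above, taking Loss_log to be binary cross-entropy per the claim; scalars stand in for the full tensors, and the concrete numbers are illustrative only:

```python
import math

def bce(t, p):
    """Binary cross-entropy Loss_log(t, p), p strictly in (0, 1)."""
    return -(t * math.log(p) + (1 - t) * math.log(1 - p))

def loss_xy(mark, w, h, xy_t, xy_p):
    """Center coordinate loss: mark * (2 - w*h) * Loss_log(...)."""
    return mark * (2 - w * h) * bce(xy_t, xy_p)

def loss_wh(mark, w, h, wh_t, wh_p):
    """Width-height loss: 0.5 * mark * (2 - w*h) * squared error."""
    return 0.5 * mark * (2 - w * h) * (wh_t - wh_p) ** 2

def loss_conf(mark, c_p, mark_ignore):
    """Confidence loss with the ignore mask on objectless anchors."""
    return mark * bce(mark, c_p) + (1 - mark) * bce(mark, c_p) * mark_ignore

print(round(loss_wh(1, 0.5, 0.5, 0.6, 0.4), 4))  # 0.035
```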
loading training parameters to train the improved YOLO network, wherein the training parameters are set as follows: the training optimizer is Adam, the initial learning rate is 0.001, the maximum training period is 500, and the batch size is 8; setting the interval at which training accuracy is checked on the verification set, and saving the optimal network after training, the training completion mark being reaching the maximum training period or the average intersection ratio meeting the requirement;
inputting the test set into an optimal improved YOLO network to obtain the dial position and the dial image.
4. The method of claim 1, wherein in step 3), the CRNN network is as follows:
a. the feature extraction network is constructed, and the structure is as follows:
the first layer is a combined convolution module 3-A, which consists of a zero filling layer, a convolution layer and an activation layer;
the second layer is a maximum pooling layer;
the third layer is a combined convolution module 3-A, which consists of a zero filling layer, a convolution layer and an activation layer;
the fourth layer is the largest pooling layer;
the fifth layer is a combined convolution module 3-B, which consists of a zero filling layer, a convolution layer, a batch normalization layer and an activation layer;
The sixth layer is a combined convolution module 3-A, which consists of a zero filling layer, a convolution layer and an activation layer;
the seventh layer is the largest pooling layer;
the eighth layer is a combined convolution module 3-B, which consists of a zero filling layer, a convolution layer, a batch normalization layer and an activation layer;
the ninth layer is a combined convolution module 3-A, which consists of a zero filling layer, a convolution layer and an activation layer;
the tenth layer is the largest pooling layer;
the eleventh layer is a combined convolution module 3-C, which consists of a zero filling layer, a convolution layer, a batch normalization layer and an activation layer;
b. the prediction network is constructed, and the structure is as follows:
the first layer is a cyclic convolution module, which consists of a bidirectional LSTM;
the second layer is a full-connection layer;
the third layer is a circular convolution module, which consists of a bidirectional LSTM;
the fourth layer is a full-connection layer;
c. setting a decoder to convert the output sequence into character information;
d. setting a loss function as a CTC (Connectionist Temporal Classification) loss function;
the CTC loss function formula is as follows:
L_CTC = -ln ∏_{(x,z)∈S} p(z|x) = -∑_{(x,z)∈S} ln p(z|x)

wherein L_CTC represents the CTC loss, p(z|x) represents the probability of outputting sequence z given input x, and S is the training set;
Training the CRNN network by loading training parameters, wherein the training parameters are set as follows: the training optimizer is Adam, the initial learning rate is 0.0001, the maximum training period is 100, and the batch size is 32; setting the interval at which training accuracy is checked on the verification set, and saving the optimal network after training, the training completion mark being reaching the maximum training period or the recognition accuracy meeting the requirement;
and inputting the test set into the optimal CRNN network to obtain character information.
5. The method of claim 1, wherein in step 4), the TextCNN network is as follows:
a. the network structure is constructed as follows:
the first layer is an embedded layer;
the second layer is a convolution module;
the third layer is a maximum pooling layer;
the fourth layer consists of a full connection layer, a Dropout layer and an activation layer;
the fifth layer consists of a full connection layer and an activation layer;
b. the loss function is set as multi-class cross-entropy, and the formula is as follows:

L_CrossEntropy = -∑_{i=1}^{n} y_i ln(ŷ_i)

wherein L_CrossEntropy represents the loss, n represents the number of categories, y_i represents the true probability of category i, and ŷ_i represents the predicted probability of category i;
loading training parameters to train the TextCNN network, wherein the training parameters are set as follows: the training optimizer is Adam, the initial learning rate is 0.001, the number of iterations is 1000, and the batch size is 64; setting the interval at which training accuracy is checked on the verification set, and saving the optimal network after training, the training completion mark being reaching the maximum number of iterations or meeting the accuracy requirement;
And inputting the test set into an optimal textCNN network to obtain a corresponding instrument type.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110855223.6A CN113673509B (en) | 2021-07-28 | 2021-07-28 | Instrument detection classification method based on image text |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113673509A CN113673509A (en) | 2021-11-19 |
CN113673509B true CN113673509B (en) | 2023-06-09 |
Family
ID=78540390
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110855223.6A Active CN113673509B (en) | 2021-07-28 | 2021-07-28 | Instrument detection classification method based on image text |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113673509B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113936280B (en) * | 2021-11-23 | 2024-04-05 | 河海大学 | Automatic character recognition system and method for code disc of embedded instrument |
CN115424121B (en) * | 2022-07-30 | 2023-10-13 | 南京理工大学紫金学院 | Electric power pressing plate switch inspection method based on computer vision |
CN116416626B (en) * | 2023-06-12 | 2023-08-29 | 平安银行股份有限公司 | Method, device, equipment and storage medium for acquiring circular seal data |
CN116958998B (en) * | 2023-09-20 | 2023-12-26 | 四川泓宝润业工程技术有限公司 | Digital instrument reading identification method based on deep learning |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108710831A (en) * | 2018-04-24 | 2018-10-26 | 华南理工大学 | A kind of small data set face recognition algorithms based on machine vision |
CN110543878A (en) * | 2019-08-07 | 2019-12-06 | 华南理工大学 | pointer instrument reading identification method based on neural network |
CN111062282A (en) * | 2019-12-05 | 2020-04-24 | 武汉科技大学 | Transformer substation pointer type instrument identification method based on improved YOLOV3 model |
CN111368825A (en) * | 2020-02-25 | 2020-07-03 | 华南理工大学 | Pointer positioning method based on semantic segmentation |
CN111401358A (en) * | 2020-02-25 | 2020-07-10 | 华南理工大学 | Instrument dial plate correction method based on neural network |
CN111639643A (en) * | 2020-05-22 | 2020-09-08 | 深圳市赛为智能股份有限公司 | Character recognition method, character recognition device, computer equipment and storage medium |
CN111814919A (en) * | 2020-08-31 | 2020-10-23 | 江西小马机器人有限公司 | Instrument positioning and identifying system based on deep learning |
CN112801094A (en) * | 2021-02-02 | 2021-05-14 | 中国长江三峡集团有限公司 | Pointer instrument image inclination correction method |
CN112861867A (en) * | 2021-02-01 | 2021-05-28 | 北京大学 | Pointer type instrument panel identification method, system and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN113673509A (en) | 2021-11-19 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||