CN113673509A - Instrument detection and classification method based on image text - Google Patents

Instrument detection and classification method based on image text

Info

Publication number
CN113673509A
CN113673509A (application CN202110855223.6A)
Authority
CN
China
Prior art keywords
layer
convolution
network
training
combined
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110855223.6A
Other languages
Chinese (zh)
Other versions
CN113673509B (en)
Inventor
田联房
王昭霖
杜启亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Zhuhai Institute of Modern Industrial Innovation of South China University of Technology
Original Assignee
South China University of Technology SCUT
Zhuhai Institute of Modern Industrial Innovation of South China University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT, Zhuhai Institute of Modern Industrial Innovation of South China University of Technology filed Critical South China University of Technology SCUT
Priority to CN202110855223.6A priority Critical patent/CN113673509B/en
Publication of CN113673509A publication Critical patent/CN113673509A/en
Application granted granted Critical
Publication of CN113673509B publication Critical patent/CN113673509B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an instrument detection and classification method based on image text, which comprises the following steps: 1) construct an instrument positioning data set, train an improved YOLO network, and use the network to output dial images; 2) construct a character detection data set, train an improved EAST network, and use the network to output character images; 3) construct a character recognition data set, train a CRNN network, and use the network to output character information; 4) construct a text classification data set, train a TextCNN network, and use the network to output the instrument type. The invention uses neural networks to detect instruments and to detect and recognize the text on them. It achieves higher precision and better generalization across different backgrounds, accurately detects instruments of different sizes without restrictions on acquisition angle or distance, and, by using the character information on the instruments, solves the problem that instrument positions can be recognized but instrument types are hard to distinguish, so that the method both detects instruments and identifies their types.

Description

Instrument detection and classification method based on image text
Technical Field
The invention relates to the technical field of image processing and neural networks, in particular to an instrument detection and classification method based on an image text.
Background
Instruments serve as monitoring devices and mainly include pressure instruments, temperature instruments, flow instruments, electrical instruments and electronic measuring instruments. They are widely used in many aspects of industrial production and social life and bring great convenience to both. Compared with manual classification, automatic classification based on image processing and neural networks has a wider application range and higher classification efficiency, and it is gradually becoming mainstream as image processing and neural network technologies develop.
At present, research on instrument classification mainly focuses on training neural networks to classify images of different instrument types. This approach has drawbacks: every instrument type to be recognized must appear in the training data, different instrument types often look very similar in images, and deep networks therefore do not distinguish them reliably. Research on recognizing the characters on instruments has mainly relied on traditional image processing, obtaining character information through a pipeline of filtering, graying, thresholding, edge detection and template matching. With the rapid development of image processing and neural network technology in recent years, locating, recognizing and classifying characters with neural networks has become practical. Such a method comprises three algorithms: text detection, which localizes character information with quadrilateral boxes through a neural network; text recognition, which reads the text information on the instrument; and text classification, which classifies the text information to obtain the instrument type.
Considering the above, an instrument detection and classification method that is both real-time and highly accurate has considerable practical value.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide an instrument detection and classification method based on image text. The method uses neural networks to detect instruments and to detect and recognize the text on them; it offers higher precision and better generalization under different backgrounds, accurately detects instruments of different sizes without restrictions on acquisition angle or distance, and, by using the character information on the instruments, solves the problem that instrument positions can be recognized but instrument types are hard to distinguish, so that the method both detects instruments and identifies their types.
In order to achieve this purpose, the technical scheme provided by the invention is as follows: an instrument detection and classification method based on image text comprises the following steps (a high-level sketch of the four-stage pipeline is given after step 4):
1) marking dial positions on instrument images to construct an instrument positioning data set, dividing the data set into a training set and a test set, loading training parameters and training the improved YOLO network with the training set, obtaining the optimal improved YOLO network after training, then inputting the test set into the optimal improved YOLO network and outputting and cropping the dial images; the improvement to the YOLO network is that the backbone is replaced with a lightweight mobile network so as to reduce network parameters and computation and increase running speed;
2) marking the character positions in the dial images cropped in step 1) to construct a character detection data set, dividing the data set into a training set and a test set, then loading training parameters and training the improved EAST network with the training set, obtaining the optimal improved EAST network after training, inputting the test set into the optimal improved EAST network, outputting the character positions and cropping them into character images; the improvements to the EAST network are that the backbone network is changed to VGG to improve detection accuracy, and that the output-layer structure of the prediction module is modified so that only the head element is used to predict a vertex, which improves prediction for long text;
3) marking the character information in the character image cut in the step 2) to construct a character recognition data set, dividing the character recognition data set into a training set and a testing set, then loading training parameters to train the CRNN by using the training set, obtaining an optimal CRNN after training is finished, inputting the testing set into the optimal CRNN, and outputting the character information;
4) splicing the character information output in the step 3) into a text, labeling the instrument type corresponding to the text to construct a text classification data set, dividing the text classification data set into a training set and a test set, loading training parameters to train a TextCNN network by using the training set, obtaining an optimal TextCNN network after the training is finished, and inputting the test set into the optimal TextCNN network to output the instrument type corresponding to the text.
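For illustration, a minimal Python sketch of how the four trained networks are chained at inference time is given below; the model objects and their predict_dial / predict_text_boxes / recognize / classify methods are hypothetical placeholders, not APIs defined by the invention.

```python
# Hypothetical end-to-end pipeline for one input image (steps 1-4 above).
def crop(img, box):
    """Crop an image to an axis-aligned box (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return img[y1:y2, x1:x2]

def classify_meter(image, yolo_model, east_model, crnn_model, textcnn_model):
    """Return (dial_box, meter_type) for a single camera image."""
    dial_box = yolo_model.predict_dial(image)                 # step 1: locate the dial
    dial_img = crop(image, dial_box)
    text_boxes = east_model.predict_text_boxes(dial_img)      # step 2: detect text regions
    strings = [crnn_model.recognize(crop(dial_img, b))        # step 3: read each region
               for b in text_boxes]
    meter_type = textcnn_model.classify(" ".join(strings))    # step 4: classify spliced text
    return dial_box, meter_type
```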
Further, in step 1), various instrument images are first collected with a camera in different environments and preprocessed by filtering and image enhancement; abnormal data are then removed, namely data with surface dirt, extreme illumination or incomplete captures; the remaining data are labeled with the dial positions to construct the instrument positioning data set, which is divided into a training set and a test set.
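The patent names only "filtering and image enhancement" as the preprocessing operations; the sketch below shows one plausible choice (bilateral filtering plus CLAHE contrast enhancement), whose parameters are assumptions for illustration.

```python
import cv2

def preprocess_meter_image(img_bgr):
    """Denoise and enhance a raw meter photo before labelling/training.

    Bilateral filtering + CLAHE on the luminance channel is an assumed
    concrete choice; the patent does not specify the filters used.
    """
    # edge-preserving smoothing to suppress sensor noise
    smoothed = cv2.bilateralFilter(img_bgr, d=9, sigmaColor=75, sigmaSpace=75)
    # contrast-limited adaptive histogram equalisation on the L channel
    lab = cv2.cvtColor(smoothed, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    lab = cv2.merge((clahe.apply(l), a, b))
    return cv2.cvtColor(lab, cv2.COLOR_LAB2BGR)
```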
Further, in step 1), the specific conditions of the improved YOLO network are as follows:
a. constructing a feature extraction network according to the requirements of real-time performance and high precision:
the first layer is a combined convolution module 1-A which consists of a zero filling layer, a convolution layer, a batch normalization layer and an activation layer;
the second layer is a combined convolution module 1-B which consists of a depth convolution layer, two batch normalization layers, two activation layers and a convolution layer;
the third layer is a combined convolution module 1-C which consists of a zero filling layer, a depth convolution layer, two batch normalization layers, two activation layers and a convolution layer;
the fourth layer is a combined convolution module 1-B which consists of a depth convolution layer, two batch normalization layers, two activation layers and a convolution layer;
the fifth layer is a combined convolution module 1-C which consists of a zero filling layer, a depth convolution layer, two batch normalization layers, two activation layers and a convolution layer;
the sixth layer is a combined convolution module 1-B which consists of a deep convolution layer, two batch normalization layers, two activation layers and a convolution layer;
the seventh layer is a combined convolution module 1-C which consists of a zero filling layer, a deep convolution layer, two batch normalization layers, two activation layers and a convolution layer;
the eighth layer is a combined convolution module 1-D which consists of five combined convolution modules 1-B;
the ninth layer is a combined convolution module 1-C which consists of a zero filling layer, a depth convolution layer, two batch normalization layers, two activation layers and a convolution layer;
the tenth layer is a combined convolution module 1-B which consists of a deep convolution layer, two batch normalization layers, two activation layers and a convolution layer;
b. constructing prediction networks for outputting and predicting targets with different sizes according to the output of different layers of the feature extraction network, wherein the prediction networks comprise a large-size target prediction network, a medium-size target prediction network and a small-size target prediction network;
b1, inputting the data as the tenth layer output of the feature extraction network, wherein the large-size target prediction network consists of a plurality of combined convolution modules and convolution layers, and has the following structure:
the first layer is a combined convolution module 1-D which consists of five combined convolution modules 1-B;
the second layer is a combined convolution module 1-B which consists of a depth convolution layer, two batch normalization layers, two activation layers and a convolution layer;
the third layer is a convolution layer;
b2, inputting the eighth-layer output of the feature extraction network and the first-layer output of the large-size target prediction network, wherein the medium-size target prediction network consists of a plurality of combined convolution modules and convolution layers, and the structure of the medium-size target prediction network is as follows:
the first layer is an input fusion module 1-E, which consists of a combined convolution module 1-B, an up-sampling layer and a tensor splicing layer;
the second layer is a combined convolution module 1-D which consists of five combined convolution modules 1-B;
the third layer is a combined convolution module 1-B which consists of a depth convolution layer, two batch normalization layers, two activation layers and a convolution layer;
the fourth layer is a convolution layer;
b3, inputting the sixth layer output of the feature extraction network and the second layer output of the medium-size target prediction network, wherein the small-size target prediction network consists of a plurality of combined convolution modules and convolution layers, and the structure of the small-size target prediction network is as follows:
the first layer is an input fusion module 1-E, which consists of a combined convolution module 1-B, an up-sampling layer and a tensor splicing layer;
the second layer is a combined convolution module 1-D which consists of five combined convolution modules 1-B;
the third layer is a combined convolution module 1-B which consists of a depth convolution layer, two batch normalization layers, two activation layers and a convolution layer;
the fourth layer is a convolution layer;
finally, the outputs of the large-size target prediction network, the medium-size target prediction network and the small-size target prediction network pass through a non-maximum suppression layer to obtain the predicted target positions and classes;
c. setting loss functions, including a center-coordinate loss function, a width-height loss function, a confidence loss function and a category loss function;
the center-coordinate loss function is formulated as follows:
Loss_xy = mark_object * (2 - w*h) * Loss_log(xy_true, xy_predict)
where Loss_xy is the center-coordinate loss, mark_object is the flag indicating whether the anchor box contains an object, w and h are the width and height of the anchor box, Loss_log is the binary cross-entropy loss, xy_true is the true center coordinate and xy_predict is the predicted center coordinate;
the width-height loss function is formulated as follows:
Loss_wh = 0.5 * mark_object * (2 - w*h) * (wh_true - wh_predict)^2
where Loss_wh is the width-height loss, wh_true is the true width-height value and wh_predict is the predicted width-height value;
the confidence loss function is formulated as follows:
Loss_confidence = mark_object * Loss_log(mark_object, c_predict) + (1 - mark_object) * Loss_log(mark_object, c_predict) * mark_ignore
where Loss_confidence is the confidence loss, c_predict is the confidence value of the prediction box and mark_ignore is the flag for anchor boxes whose IOU is below the threshold;
the category loss function is formulated as follows:
Loss_cls = mark_object * Loss_log(cls_true, cls_predict)
where Loss_cls is the category loss, cls_true is the true category and cls_predict is the predicted category;
the total loss function is formulated as follows:
Loss = (Loss_xy + Loss_wh + Loss_confidence + Loss_cls) / num_f
where Loss is the total loss and num_f is the total number of inputs as a floating-point number;
loading training parameters to train the improved YOLO network, wherein the training parameters are set as follows: setting the training optimizer to Adam, the initial learning rate to 0.001, the maximum training period to 500 and the batch size to 8; evaluating training accuracy on a validation set at set intervals, marking training as complete when the maximum training period is reached or the mean intersection-over-union meets the requirement, and saving the optimal network after training is completed;
and inputting the test set into an optimal improved YOLO network to obtain the dial position and the dial image.
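A minimal TensorFlow sketch of the loss terms defined above is given below; the use of element-wise binary cross-entropy for Loss_log follows the formulas, while the argument layout, tensor shapes and reduction are assumptions.

```python
import tensorflow as tf

def yolo_loss_terms(mark_object, mark_ignore, w, h,
                    xy_true, xy_pred, wh_true, wh_pred,
                    conf_pred, cls_true, cls_pred, num_f):
    """Per-anchor loss terms as formulated above; all inputs are tensors
    broadcastable over the anchor grid (shapes are an assumption)."""
    bce = tf.keras.backend.binary_crossentropy   # Loss_log
    scale = 2.0 - w * h                          # small boxes weighted more (w, h assumed in [0, 1])
    loss_xy = mark_object * scale * bce(xy_true, xy_pred)
    loss_wh = 0.5 * mark_object * scale * tf.square(wh_true - wh_pred)
    loss_conf = (mark_object * bce(mark_object, conf_pred)
                 + (1.0 - mark_object) * bce(mark_object, conf_pred) * mark_ignore)
    loss_cls = mark_object * bce(cls_true, cls_pred)
    total = (tf.reduce_sum(loss_xy) + tf.reduce_sum(loss_wh)
             + tf.reduce_sum(loss_conf) + tf.reduce_sum(loss_cls)) / num_f
    return total
```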
Further, in step 2), the details of the modified EAST network are as follows:
a. constructing a feature extraction network, wherein the structure is as follows:
the first layer is a combined convolution module 2-B which consists of two combined convolution modules 2-A and a maximum pooling layer, and the combined convolution module 2-A consists of a zero padding layer, a convolution layer and an active layer;
the second layer is a combined convolution module 2-B which consists of two combined convolution modules 2-A and a maximum pooling layer;
the third layer is a combined convolution module 2-C which consists of three combined convolution modules 2-A and a maximum pooling layer;
the fourth layer is a combined convolution module 2-C which consists of three combined convolution modules 2-A and a maximum pooling layer;
the fifth layer is a combined convolution module 2-C which consists of three combined convolution modules 2-A and a maximum pooling layer;
b. constructing a feature fusion network, wherein the structure is as follows:
the first layer is an input fusion module 2-G which consists of an up-sampling layer and a tensor splicing layer;
the second layer is a combined convolution module 2-E which consists of two batch normalization layers, a combined convolution module 2-D and a combined convolution module 2-A; the combined convolution module 2-D consists of a zero padding layer, a convolution layer and an active layer;
the third layer is an input fusion module 2-G which consists of an up-sampling layer and a tensor splicing layer;
the fourth layer is a combined convolution module 2-E which consists of two batch normalization layers, a combined convolution module 2-D and a combined convolution module 2-A;
the fifth layer is an input fusion module 2-G which consists of an up-sampling layer and a tensor splicing layer;
the sixth layer is a combined convolution module 2-F which consists of three batch normalization layers, a combined convolution module 2-D and two combined convolution modules 2-A;
c. constructing a prediction network, wherein the structure is as follows:
the first layer is divided into three branches, and the first branch consists of a combined convolution module 2-D; the second branch consists of a combined convolution module 2-D; the third branch consists of a combined convolution module 2-D;
the second layer is an input fusion module which is formed by splicing three branches of the first layer;
d. setting loss functions, including a class loss function, a geometry loss function and an angle loss function;
the class loss function is formulated as follows:
L_s = -β·Y*·log(Ŷ) - (1 - β)·(1 - Y*)·log(1 - Ŷ)
where L_s is the class loss, β is the balancing weight, Ŷ is the predicted class and Y* is the real class;
the geometry loss function is formulated as follows:
L_AABB = -log IoU(R̂, R*)
where L_AABB is the geometry loss, R̂ is the geometry of the predicted quadrilateral text box AABB, R* is the geometry of the real quadrilateral text box AABB, and IoU is the intersection-over-union;
the angle loss function is formulated as follows:
L_θ(θ̂, θ*) = 1 - cos(θ̂ - θ*)
where L_θ(θ̂, θ*) is the angle loss, θ̂ is the predicted rotation angle and θ* is the real rotation angle;
loading training parameters to train the improved EAST network, wherein the training parameters are set as follows: setting the training optimizer to Adam, the initial learning rate to 0.001, the maximum training period to 500 and the batch size to 8; evaluating training accuracy on a validation set at set intervals, marking training as complete when the maximum training period is reached or the mean intersection-over-union meets the requirement, and saving the optimal network after training is completed;
and inputting the test set into an optimal improved EAST network to obtain a text position, and cutting the text position into a character image.
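For illustration, the sketch below computes the three loss terms as formulated above; the decomposition of the AABB geometry into four distances (top, bottom, left, right) follows the original EAST formulation and is an assumption here.

```python
import tensorflow as tf

def east_losses(y_true_score, y_pred_score, geo_true, geo_pred,
                theta_true, theta_pred, beta):
    """Class, geometry and angle losses as formulated above (a sketch;
    tensor layouts are assumptions). geo_* hold four AABB distances."""
    eps = 1e-6
    # balanced cross-entropy on the score map
    l_s = -(beta * y_true_score * tf.math.log(y_pred_score + eps)
            + (1.0 - beta) * (1.0 - y_true_score) * tf.math.log(1.0 - y_pred_score + eps))
    # IoU loss between predicted and real axis-aligned boxes
    d1_t, d2_t, d3_t, d4_t = tf.unstack(geo_true, axis=-1)   # top, bottom, left, right
    d1_p, d2_p, d3_p, d4_p = tf.unstack(geo_pred, axis=-1)
    area_t = (d1_t + d2_t) * (d3_t + d4_t)
    area_p = (d1_p + d2_p) * (d3_p + d4_p)
    h_i = tf.minimum(d1_t, d1_p) + tf.minimum(d2_t, d2_p)
    w_i = tf.minimum(d3_t, d3_p) + tf.minimum(d4_t, d4_p)
    inter = h_i * w_i
    union = area_t + area_p - inter
    l_aabb = -tf.math.log((inter + eps) / (union + eps))
    # angle loss
    l_theta = 1.0 - tf.cos(theta_pred - theta_true)
    return l_s, l_aabb, l_theta
```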
Further, in step 3), the specific situation of the CRNN network is as follows:
a. constructing a feature extraction network, wherein the structure is as follows:
the first layer is a combined convolution module 3-A which consists of a zero filling layer, a convolution layer and an active layer;
the second layer is a maximum pooling layer;
the third layer is a combined convolution module 3-A which consists of a zero filling layer, a convolution layer and an activation layer;
the fourth layer is a maximum pooling layer;
the fifth layer is a combined convolution module 3-B which consists of a zero filling layer, a convolution layer, a batch normalization layer and an activation layer;
the sixth layer is a combined convolution module 3-A which consists of a zero filling layer, a convolution layer and an activation layer;
the seventh layer is a maximum pooling layer;
the eighth layer is a combined convolution module 3-B which consists of a zero filling layer, a convolution layer, a batch normalization layer and an activation layer;
the ninth layer is a combined convolution module 3-A which consists of a zero filling layer, a convolution layer and an activation layer;
the tenth layer is a maximum pooling layer;
the eleventh layer is a combined convolution module 3-C, which consists of a zero-padding layer, a convolution layer, a batch normalization layer and an activation layer;
b. constructing a prediction network, wherein the structure is as follows:
the first layer is a cyclic convolution module, which consists of a bidirectional LSTM;
the second layer is a full connection layer;
the third layer is a cyclic convolution module which consists of a bidirectional LSTM;
the fourth layer is a full connection layer;
c. setting a decoder to convert the output sequence into character information;
d. setting the loss function as the CTC (Connectionist Temporal Classification) loss function;
the CTC loss function is formulated as follows:
L_CTC = -ln ∏_{(x,z)∈S} p(z|x) = -Σ_{(x,z)∈S} ln p(z|x)
where L_CTC is the CTC loss, p(z|x) is the probability of the output sequence z given the input x, and S is the training set;
loading training parameters to train the CRNN, wherein the training parameters are set as follows: setting the training optimizer to Adam, the initial learning rate to 0.0001, the maximum training period to 100, and the batch size to 32; evaluating training accuracy on a validation set at set intervals, marking training as complete when the maximum training period is reached or the recognition accuracy meets the requirement, and saving the optimal network after training is completed;
and inputting the test set into the optimal CRNN network to obtain character information.
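A minimal sketch of the CTC loss and a greedy (best-path) decoder for the CRNN output is shown below, using the standard Keras CTC helpers; tensor shapes and the charset argument are assumptions.

```python
import tensorflow as tf

def crnn_ctc_loss(y_true, y_pred, input_length, label_length):
    """CTC loss for the CRNN output (assumed shapes: y_pred is
    (batch, time_steps, num_classes) softmax output, y_true is
    (batch, max_label_len) integer labels)."""
    return tf.keras.backend.ctc_batch_cost(y_true, y_pred, input_length, label_length)

def greedy_decode(y_pred, input_length, charset):
    """Best-path CTC decoding of the network output into character strings."""
    decoded, _ = tf.keras.backend.ctc_decode(y_pred, input_length, greedy=True)
    out = []
    for seq in decoded[0].numpy():
        out.append("".join(charset[i] for i in seq if i >= 0))  # -1 entries are padding
    return out
```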
Further, in step 4), the details of the TextCNN network are as follows:
a. a network structure is constructed as follows:
the first layer is an embedding layer;
the second layer is a convolution module;
the third layer is a maximum pooling layer;
the fourth layer consists of a full connection layer, a Dropout layer and an activation layer;
the fifth layer consists of a full connecting layer and an activation layer;
b. setting the loss function as the multi-class cross entropy, formulated as follows:
L_CrossEntropy = -Σ_{i=1}^{n} y_i · log(ŷ_i)
where L_CrossEntropy is the loss, n is the number of classes, y_i is the true probability of class i and ŷ_i is the predicted probability of class i;
loading training parameters to train the TextCNN network, wherein the training parameters are set as follows: setting the training optimizer to Adam, the initial learning rate to 0.001, the number of iterations to 1000 and the batch size to 64; evaluating training accuracy on a validation set at set intervals, marking training as complete when the maximum number of iterations is reached and the accuracy meets the requirement, and saving the optimal network after training is completed;
and inputting the test set into an optimal TextCNN network to obtain a corresponding instrument type.
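A minimal Keras sketch of the five-layer TextCNN described above follows; the embedding dimension, filter count, kernel size, hidden width and dropout rate are not specified in the patent and are assumptions.

```python
from tensorflow.keras import layers, models, optimizers

def build_textcnn(vocab_size, seq_len, embed_dim=64, num_meter_types=10,
                  num_filters=128, kernel_size=3, dropout=0.5):
    """TextCNN with the five layers listed above (embedding, convolution,
    max pooling, FC + Dropout + activation, FC + activation)."""
    inp = layers.Input(shape=(seq_len,), dtype="int32")
    x = layers.Embedding(vocab_size, embed_dim)(inp)                   # first layer
    x = layers.Conv1D(num_filters, kernel_size, activation="relu")(x)  # second layer
    x = layers.GlobalMaxPooling1D()(x)                                 # third layer
    x = layers.Dense(64)(x)                                            # fourth layer: FC
    x = layers.Dropout(dropout)(x)                                     #               Dropout
    x = layers.Activation("relu")(x)                                   #               activation
    out = layers.Dense(num_meter_types, activation="softmax")(x)       # fifth layer
    model = models.Model(inp, out)
    # optimizer and loss as stated above (Adam, lr 0.001, multi-class cross entropy)
    model.compile(optimizer=optimizers.Adam(learning_rate=0.001),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model
```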
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. the invention realizes instrument positioning and character detection and identification on the instrument by using the neural network, and has higher precision and better generalization capability under different backgrounds compared with the traditional image processing method.
2. The invention can predict targets with different sizes by using the improved YOLO network, comprehensively selects the prediction frame with the most suitable size, can accurately detect instruments with different sizes, and has no limitation on acquisition angle and distance.
3. Compared with other target detection networks, the improved EAST network has better detection performance and higher detection speed on characters, and can have better detection effect on characters shot at different angles.
4. The invention splices the character information on the instruments into a text, which solves the problem in machine vision that instrument positions can be recognized but instrument types are hard to distinguish, so that the method both detects instruments and identifies their types.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Fig. 2 is a block diagram of an improved YOLO network.
Fig. 3 is a block diagram of an improved EAST network.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.
As shown in fig. 1, the image-text-based instrument detection and classification method provided by this embodiment comprises the following steps:
1) collecting instrument images captured in different real scenes, removing interference data affected by blur, extreme angles or partially missing instruments that would hinder recognition, labeling the dial positions in the remaining data with the open-source labeling tool labelImg to construct an instrument positioning data set, dividing the data set into a training set and a test set, loading training parameters and training the improved YOLO network with the training set, obtaining the optimal improved YOLO network after training, then inputting the test set into the optimal improved YOLO network, and outputting and cropping the dial images; the improvement to the YOLO network is that the backbone is replaced with a lightweight mobile network so as to reduce network parameters and computation and increase running speed.
An improved YOLO network is designed according to the specific application scene and the characteristics of the recognized objects. Unless otherwise stated, every activation layer in this step uses the Leaky ReLU activation function. The design comprises the following steps:
a. constructing a feature extraction network
A feature extraction network is constructed according to the requirements of real-time performance and high precision. The feature extraction network is mainly composed of a plurality of combined convolution modules.
The structure of the feature extraction network is as follows:
the input image is 416 × 416 × 3.
The first layer is the combined convolution module 1-a, shown in fig. 2 (a). The module first passes through the zero-padding layer with an output of 418 x 3. Then the convolution kernel is (3,3), the step length is 2, the filter number is 32, and the output is 208 multiplied by 32.
The second layer is the combined convolution module 1-B, shown in fig. 2 (B). The module first goes through deep convolution, batch normalization layer and activation layer, the convolution kernel is (3,3), the step size is 1, padding is used to make the input and output sizes consistent, and the output is 208 × 208 × 32. And after the convolution, batch normalization and activation layers, the convolution kernel is (1,1), the step size is 1, the number of filters is 64, the input and output sizes are consistent by using filling, and the output is 208 multiplied by 64.
The third layer is a combined convolution module 1-C, as shown in fig. 2 (C). The module first passes through the zero-padding layer and the output is 210 x 64. And then the obtained product is subjected to deep convolution, batch normalization layer and activation layer, the convolution kernel is (3,3), the step size is 2, and the output is 104 multiplied by 64. And finally, after convolution, batch normalization layer and activation layer, the convolution kernel is (1,1), the step size is 1, the number of filters is 128, the input and output sizes are consistent by using filling, and the output is 104 multiplied by 128.
The fourth layer is a combined convolution module 1-B, as shown in fig. 2 (B). The module first goes through deep convolution, batch normalization layer and activation layer, the convolution kernel is (3,3), the step size is 1, the input and output sizes are consistent by using padding, and the output is 104 × 104 × 128. And then the filter passes through a convolution layer, a batch normalization layer and an activation layer, the convolution kernel is (1,1), the step size is 1, the number of filters is 128, the input and output sizes are consistent by using filling, and the output is 104 multiplied by 128.
The fifth layer is a combined convolution module 1-C, as shown in fig. 2 (C). The module first passes through the zero-padding layer with an output of 106 x 128. And then the obtained product is subjected to deep convolution, batch normalization layer and activation layer, the convolution kernel is (3,3), the step size is 2, and the output is 52 multiplied by 128. And finally, after convolution, batch normalization layer and activation layer, the convolution kernel is (1,1), the step size is 1, the number of filters is 256, the input and output sizes are consistent by using filling, and the output is 52 multiplied by 256.
The sixth layer is a combined convolution module 1-B, shown in fig. 2 (B). The module first goes through deep convolution, batch normalization layer and activation layer, the convolution kernel is (3,3), the step size is 1, padding is used to make the input and output sizes consistent, and the output is 52 × 52 × 256. And then the filter passes through a convolution layer, a batch normalization layer and an activation layer, the convolution kernel is (1,1), the step size is 1, the number of filters is 256, the input and output sizes are consistent by using filling, and the output is 52 multiplied by 256.
The seventh layer is a combined convolution module 1-C, as shown in fig. 2 (C). The module first passes through the zero-padding layer and the output is 54 x 256. And then the obtained product is subjected to deep convolution, batch normalization layer and activation layer, the convolution kernel is (3,3), the step size is 2, and the output is 26 multiplied by 256. And finally, after convolution, batch normalization layer and activation layer, the convolution kernel is (1,1), the step size is 1, the number of filters is 512, the input and output sizes are consistent by using filling, and the output is 26 multiplied by 512.
The eighth layer is the combined convolution module 1-D, as shown in FIG. 2 (D). The modules pass through five combined convolution modules 1-B in sequence, as shown in fig. 2 (B). In each of the combined convolution modules 1-B, the input first goes through deep convolution, batch normalization layer and activation layer, the convolution kernel is (3,3), the step size is 1, the input and output sizes are consistent by using padding, and the output is 26 × 26 × 512. And (3) performing convolution, batch normalization and activation, wherein the convolution kernel is (1,1), the step size is 1, the number of filters is 512, the input and output sizes are consistent by using filling, and the output is 26 multiplied by 512. After sequentially passing through the same combined convolution modules 1-B, the output is 26 × 26 × 512.
The ninth layer is a combined convolution module 1-C, as shown in FIG. 2 (C). The module first passes through the zero-padding layer and the output is 28 x 512. And then the obtained product is subjected to deep convolution, batch normalization layer and activation layer, the convolution kernel is (3,3), the step length is 2, and the output is 13 multiplied by 512. And finally, performing convolution, batch normalization and activation layers, wherein the convolution kernel is (1,1), the step length is 1, the number of filters is 1024, the input and output sizes are consistent by using filling, and the output is 13 multiplied by 1024.
The tenth layer is the combined convolution module 1-B, as shown in fig. 2 (B). The module first goes through deep convolution, batch normalization layer and activation layer, the convolution kernel is (3,3), the step size is 1, the input and output sizes are consistent by using padding, and the output is 13 × 13 × 1024. And then the convolution, batch normalization and activation layers are carried out, the convolution kernel is (1,1), the step length is 1, the number of filters is 1024, the input and output sizes are consistent by using filling, and the output is 13 multiplied by 1024.
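For illustration, the combined convolution modules 1-A, 1-B and 1-C can be written as the following Keras sketch; bias-free convolutions and the default Leaky ReLU slope are assumptions, while kernel sizes, strides and padding follow the layer descriptions above.

```python
from tensorflow.keras import layers

def combo_1a(x, filters, kernel=3, stride=2):
    """Combined convolution module 1-A: zero padding + conv + BN + Leaky ReLU."""
    x = layers.ZeroPadding2D(1)(x)
    x = layers.Conv2D(filters, kernel, strides=stride, use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    return layers.LeakyReLU()(x)

def combo_1b(x, filters):
    """Module 1-B: depth (depthwise) conv + BN + activation, then a 1x1
    pointwise conv + BN + activation (stride 1, 'same' padding)."""
    x = layers.DepthwiseConv2D(3, strides=1, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.LeakyReLU()(x)
    x = layers.Conv2D(filters, 1, strides=1, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    return layers.LeakyReLU()(x)

def combo_1c(x, filters):
    """Module 1-C: zero padding, a stride-2 depthwise conv + BN + activation,
    then a 1x1 pointwise conv + BN + activation."""
    x = layers.ZeroPadding2D(1)(x)
    x = layers.DepthwiseConv2D(3, strides=2, use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.LeakyReLU()(x)
    x = layers.Conv2D(filters, 1, strides=1, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    return layers.LeakyReLU()(x)
```

For a 416 × 416 × 3 input, combo_1a(x, 32) reproduces the 208 × 208 × 32 first-layer output, and chaining combo_1b and combo_1c with the filter counts listed above reproduces the remaining feature maps.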
b. Building a predictive network
And constructing prediction networks for outputting and predicting targets with different sizes according to the output of different layers of the feature extraction network, wherein the prediction networks comprise a large-size target prediction network, a medium-size target prediction network and a small-size target prediction network.
b1 large-size target prediction network
The input is the tenth layer output of the feature extraction network, and the large-size target prediction network mainly comprises a plurality of combined convolution modules and convolution layers.
The input image is 13 × 13 × 1024.
The large-size target prediction network has the following structure:
the first layer is the combined convolution module 1-D, as shown in FIG. 2 (D). The modules pass through five combined convolution modules 1-B in sequence, as shown in fig. 2 (B). In the first combined convolution module 1-B, the input first goes through deep convolution, batch normalization layer and activation layer, the convolution kernel is (1,1), the step size is 1, padding is used to make the input and output size consistent, and the output is 13 × 13 × 1024. And performing convolution, batch normalization and activation layers, wherein the convolution kernel is (1,1), the step length is 1, the number of filters is 512, the input and output sizes are consistent by using filling, and the output is 13 multiplied by 512. In the second combined convolution module 1-B, the input is first passed through deep convolution, batch normalization layer and activation layer, the convolution kernel is (3,3), the step size is 1, padding is used to make the input and output sizes consistent, and the output is 13 × 13 × 512. And then the convolution, batch normalization and activation layers are carried out, the convolution kernel is (1,1), the step length is 1, the number of filters is 1024, the input and output sizes are consistent by using filling, and the output is 13 multiplied by 1024. Then, after the two combined convolution modules 1-B with different parameters are alternately input, the output is 13 multiplied by 512.
The second layer is the combined convolution module 1-B, shown in fig. 2 (B). The module first goes through deep convolution, batch normalization layer and activation layer, the convolution kernel is (3,3), the step size is 1, the input and output sizes are consistent by using padding, and the output is 13 × 13 × 512. And then the convolution, batch normalization and activation layers are carried out, the convolution kernel is (1,1), the step length is 1, the number of filters is 1024, the input and output sizes are consistent by using filling, and the output is 13 multiplied by 1024.
The third layer is a convolutional layer. The convolution kernel is (1,1), the step size is 1, the number of filters is 256, and the output is 13 × 13 × 256.
b2 medium size target prediction network
The input is the eighth layer output of the feature extraction network and the first layer output of the large-size target prediction network, and the medium-size target prediction network mainly comprises a plurality of combined convolution modules and convolution layers.
The input images are 26 × 26 × 512 and 13 × 13 × 512.
The medium-sized target prediction network structure is as follows:
the first layer is the input fusion module, as shown in fig. 2 (e). The input 13 x 512 first goes through the combined convolution module 1-B where the deep convolution, batch normalization layer and activation layer are first passed, the convolution kernel is (1,1) with a step size of 1, padding is used to make the input and output sizes consistent, and the output is 13 x 512. And performing convolution, batch normalization and activation layers, wherein the convolution kernel is (1,1), the step length is 1, the number of filters is 512, the input and output sizes are consistent by using filling, and the output is 13 multiplied by 512. And then passes through an up-sampling layer, the sampling factor is 2, and the output is 26 multiplied by 512. Finally, the output and input are 26 × 26 × 512 through a tensor splicing layer, and the output is 26 × 26 × 1024.
The second layer is the combined convolution module 1-D, as shown in FIG. 2 (D). The modules pass through five combined convolution modules 1-B in sequence, as shown in fig. 2 (B). In the first combined convolution module 1-B, the input first goes through deep convolution, batch normalization layer and activation layer, the convolution kernel is (1,1), the step size is 1, padding is used to make the input and output size consistent, and the output is 26 × 26 × 1024. And then the data is subjected to convolution, batch normalization and activation layers, the convolution kernel is (1,1), the step size is 1, the number of filters is 256, the input and output sizes are consistent by using filling, and the output is 26 multiplied by 256. In the second combined convolution module 1-B, the input is first passed through deep convolution, batch normalization layer and activation layer, the convolution kernel is (3,3), the step size is 1, padding is used to make the input and output size consistent, and the output is 26 × 26 × 256. And (3) performing convolution, batch normalization and activation, wherein the convolution kernel is (1,1), the step size is 1, the number of filters is 512, the input and output sizes are consistent by using filling, and the output is 26 multiplied by 512. After the two different parameters of the combined convolution modules 1-B are alternately input, the output is 26 × 26 × 256.
The third layer is a combined convolution module 1-B, as shown in fig. 2 (B). The module first goes through deep convolution, batch normalization layer and activation layer, the convolution kernel is (3,3), the step size is 1, the input and output sizes are consistent by using padding, and the output is 26 × 26 × 256. And (3) performing convolution, batch normalization and activation, wherein the convolution kernel is (1,1), the step size is 1, the number of filters is 512, the input and output sizes are consistent by using filling, and the output is 26 multiplied by 512.
The fourth layer is a convolutional layer. The convolution kernel is (1,1), the step size is 1, the number of filters is 256, and the output is 26 × 26 × 256.
b3 small-size target prediction network
The input is the sixth layer output of the feature extraction network and the second layer output of the medium-size target prediction network, and the small-size target prediction network mainly comprises a plurality of combined convolution modules and convolution layers.
The input images are 52 × 52 × 256 and 26 × 26 × 256.
The small-size target prediction network structure is as follows:
the first layer is the input fusion module, as shown in fig. 2 (e). The input 26 × 26 × 256 first passes through the combined convolution module 1-B, where the deep convolution, batch normalization layer and activation layer are first passed, the convolution kernel is (1,1) and the step size is 1, padding is used to make the input and output size uniform, and the output is 26 × 26 × 256. And then the data is subjected to convolution, batch normalization and activation layers, the convolution kernel is (1,1), the step size is 1, the number of filters is 256, the input and output sizes are consistent by using filling, and the output is 26 multiplied by 256. And the sampling factor is 2 after passing through an up-sampling layer, and the output is 52 multiplied by 256. Finally, the output and input 52 × 52 × 256 go through a tensor concatenation layer, and the output is 52 × 52 × 512.
The second layer is the combined convolution module 1-D, as shown in FIG. 2 (D). The modules pass through five combined convolution modules 1-B in sequence, as shown in fig. 2 (B). In the first combined convolution module 1-B, the input first goes through deep convolution, batch normalization layer and activation layer, the convolution kernel is (1,1), the step size is 1, padding is used to make the input and output size consistent, and the output is 52 × 52 × 512. And then the filter passes through a convolution layer, a batch normalization layer and an activation layer, the convolution kernel is (1,1), the step size is 1, the number of filters is 128, the input and output sizes are consistent by using filling, and the output is 52 multiplied by 128. In the second combined convolution module 1-B, the input is first passed through a deep convolution, batch normalization layer and activation layer, the convolution kernel is (3,3), the step size is 1, padding is used to make the input and output sizes consistent, and the output is 52 x 128. And then the filter passes through a convolution layer, a batch normalization layer and an activation layer, the convolution kernel is (1,1), the step size is 1, the number of filters is 256, the input and output sizes are consistent by using filling, and the output is 52 multiplied by 256. After the two different parameters of the combined convolution modules 1-B are alternately input, the output is 52 x 128.
The third layer is a combined convolution module 1-B, as shown in fig. 2 (B). The module first goes through deep convolution, batch normalization layer and activation layer, the convolution kernel is (3,3), the step size is 1, padding is used to make the input and output sizes consistent, and the output is 52 × 52 × 128. And then the filter passes through a convolution layer, a batch normalization layer and an activation layer, the convolution kernel is (1,1), the step size is 1, the number of filters is 256, the input and output sizes are consistent by using filling, and the output is 52 multiplied by 256.
The fourth layer is a convolutional layer. The convolution kernel is (1,1), the step size is 1, the number of filters is 256, and the output is 52 × 52 × 256.
Finally, the 13 × 13 × 256 output of the large-size target prediction network, the 26 × 26 × 256 output of the medium-size target prediction network and the 52 × 52 × 256 output of the small-size target prediction network pass through a non-maximum suppression layer to obtain the predicted target positions and classes.
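The fusion module 1-E and a size-specific prediction branch can be sketched as follows; _sep_conv mirrors the combined convolution module 1-B from the previous sketch, and the alternating reduce/expand filter counts inside module 1-D follow the description above, while bias-free convolutions and the activation slope remain assumptions.

```python
from tensorflow.keras import layers

def _sep_conv(x, filters, kernel=3):
    """Depthwise-separable block mirroring combined convolution module 1-B."""
    x = layers.DepthwiseConv2D(kernel, strides=1, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.LeakyReLU()(x)
    x = layers.Conv2D(filters, 1, strides=1, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    return layers.LeakyReLU()(x)

def fusion_1e(deep_feat, shallow_feat, filters):
    """Input fusion module 1-E: a 1-B module, 2x upsampling, tensor splicing."""
    x = _sep_conv(deep_feat, filters, kernel=1)
    x = layers.UpSampling2D(size=2)(x)
    return layers.Concatenate(axis=-1)([x, shallow_feat])

def combo_1d(x, reduce_filters, expand_filters):
    """Module 1-D: five 1-B modules that alternately reduce and expand channels."""
    for i in range(5):
        if i % 2 == 0:
            x = _sep_conv(x, reduce_filters, kernel=1)
        else:
            x = _sep_conv(x, expand_filters, kernel=3)
    return x

def prediction_branch(x, reduce_filters, expand_filters, out_channels=256):
    """One prediction branch: module 1-D, one 1-B module, then a 1x1 convolution."""
    x = combo_1d(x, reduce_filters, expand_filters)
    x = _sep_conv(x, expand_filters, kernel=3)
    return layers.Conv2D(out_channels, 1, strides=1)(x)
```

Applied to the 13 × 13 × 1024 feature map with reduce_filters=512 and expand_filters=1024, prediction_branch yields the 13 × 13 × 256 large-size output; non-maximum suppression is then applied to the decoded boxes outside the network graph.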
c. Setting a loss function
The loss function is set as the mean of the sum of the center-coordinate loss, the width-height loss, the confidence loss and the category loss:
Loss = (Loss_xy + Loss_wh + Loss_confidence + Loss_cls) / num_f
where Loss is the total loss, Loss_xy is the center-coordinate loss, Loss_wh is the width-height loss, Loss_confidence is the confidence loss, Loss_cls is the category loss and num_f is the total number of inputs as a floating-point number. The individual losses are formulated as follows:
Loss_xy = mark_object * (2 - w*h) * Loss_log(xy_true, xy_predict)
Loss_wh = 0.5 * mark_object * (2 - w*h) * (wh_true - wh_predict)^2
Loss_confidence = mark_object * Loss_log(mark_object, c_predict) + (1 - mark_object) * Loss_log(mark_object, c_predict) * mark_ignore
Loss_cls = mark_object * Loss_log(cls_true, cls_predict)
where mark_object is the flag indicating whether the anchor box contains an object, w and h are the width and height of the anchor box, Loss_log is the binary cross-entropy loss, xy_true and xy_predict are the true and predicted center coordinates, wh_true and wh_predict are the true and predicted width-height values, c_predict is the confidence value of the prediction box, mark_ignore is the flag for anchor boxes whose IOU is below the threshold, and cls_true and cls_predict are the true and predicted categories.
Training an improved YOLO network, comprising the steps of:
d1 setting training parameters
The training optimizer is set to Adam, the initial learning rate to 0.001, the number of iterations to 500 and the batch size to 8, and K-means clustering on all labels generates the initial prior boxes (38, 29), (65, 52), (94, 87), (142, 134), (195, 69), (216, 206), (337, 320), (397, 145), (638, 569).
d2, Online data enhancement
Data enhancement is performed on the input images to expand the data set; the enhancement methods are random mirror flipping, random noise addition and random contrast adjustment.
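A simple sketch of the three online augmentations named above follows; the probabilities, noise level and contrast range are assumptions, and box labels must be mirrored together with the image when flipping.

```python
import numpy as np

def augment(img, rng=None):
    """Random mirror flip, random noise, random contrast adjustment."""
    rng = rng or np.random.default_rng()
    out = img.astype(np.float32)
    if rng.random() < 0.5:                       # random horizontal mirror
        out = out[:, ::-1, :]
    if rng.random() < 0.5:                       # random Gaussian noise
        out = out + rng.normal(0.0, 5.0, out.shape)
    if rng.random() < 0.5:                       # random contrast around the mean
        alpha = rng.uniform(0.7, 1.3)
        out = (out - out.mean()) * alpha + out.mean()
    return np.clip(out, 0, 255).astype(np.uint8)
```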
d3 setting training completion flag
The training accuracy is evaluated on a validation set at set intervals; the training completion flag is reaching the maximum of 500 iterations with the accuracy meeting the requirement, and the optimal network is saved after training is completed.
And inputting the test set into an optimal improved YOLO network to obtain the dial position and the dial image.
2) Marking the character positions in the dial images cropped in step 1) to construct a character detection data set, dividing the data set into a training set and a test set, then loading training parameters and training the improved EAST network with the training set, obtaining the optimal improved EAST network after training, inputting the test set into the optimal improved EAST network, outputting the character positions and cropping them into character images; the improvements to the EAST network are that the backbone network is changed to VGG to improve detection accuracy, and that the output-layer structure of the prediction module is modified so that only the head element is used to predict a vertex, which improves prediction for long text.
An improved EAST network is designed according to the specific application scene and the characteristics of the recognized objects. Unless otherwise stated, every activation layer below uses the ReLU activation function. The design comprises the following steps:
a. constructing a feature extraction network
The structure of the feature extraction network is as follows:
the input image is 256 × 256 × 3.
The first layer is a combined convolution module 2-B, which, as shown in fig. 3 (B), consists of two combined convolution modules 2-a and one max-pooling layer. The first combined convolution module 2-a passes through the zero-padding layer first, with an output of 258 x 3, the convolution kernel of (3,3), the step size of 1, the number of filters of 64, and the output of 256 x 64, through the convolutional layer and the active layer. The second convolution module 2-a, first passes through the zero-padding layer and outputs 258 × 258 × 64, then passes through the convolution layer and the active layer, the convolution kernel is (3,3), the step size is 1, the number of filters is 64, and the output is 256 × 256 × 64. After a maximum pooling level, the pooling kernel size is (2,2) step size is 2 and the output is 128 × 128 × 64.
The second layer is a combined convolution module 2-B, which consists of two combined convolution modules 2-A and one maximum pooling layer. The first combined convolution module 2-a passes through the zero-padding layer first, then through the convolutional layer and the active layer, with a convolution kernel of (3,3), a step size of 1, and a number of filters of 128. The second convolution module 2-a, first passes through the zero-crossing filler layer, then the convolution layer and the activation layer, with a convolution kernel of (3,3), a step size of 1, and a filter number of 128. After a maximum pooling level, the pooling kernel size is (2,2) with step size of 2 and the output is 64 × 64 × 128.
The third layer is a combined convolution module 2-C, which, as shown in fig. 3 (C), consists of three combined convolution modules 2-a and one max pooling layer. The first combined convolution module 2-a passes through the zero-padding layer first, then through the convolutional layer and the active layer, with a convolution kernel of (3,3), a step size of 1, and a filter number of 256. The second convolution module 2-a, first passes through the zero-crossing filler layer, then the convolution layer and the activation layer, with a convolution kernel of (3,3), a step size of 1, and a filter number of 256. The third convolution module 2-a passes through the zero-crossing filling layer, the convolution layer and the activation layer, the convolution kernel is (3,3), the step length is 1, and the number of filters is 256. After a maximum pooling layer, the pooling kernel size is (2,2) step size is 2 and the output is 32 × 32 × 256.
The fourth layer is a combined convolution module 2-C, which consists of three combined convolution modules 2-A and a maximum pooling layer. The first combined convolution module 2-a passes through the zero-padding layer first, then through the convolutional layer and the active layer, with a convolution kernel of (3,3), a step size of 1, and a filter number of 512. The second convolution module 2-a, first passes through the zero-crossing filler layer, then the convolution layer and the activation layer, the convolution kernel is (3,3), the step size is 1, and the number of filters is 512. The third convolution module 2-a passes through the zero-crossing filling layer, the convolution layer and the activation layer, the convolution kernel is (3,3), the step length is 1, and the number of filters is 512. And then passing through a maximum pooling layer, the pooling kernel size is (2,2) and the step size is 2, and the output is 16 × 16 × 512.
And the fifth layer is a combined convolution module 2-C which consists of three combined convolution modules 2-A and a maximum pooling layer. The first combined convolution module 2-a passes through the zero-padding layer first, then through the convolutional layer and the active layer, with a convolution kernel of (3,3), a step size of 1, and a filter number of 512. The second convolution module 2-a, first passes through the zero-crossing filler layer, then the convolution layer and the activation layer, the convolution kernel is (3,3), the step size is 1, and the number of filters is 512. The third convolution module 2-a passes through the zero-crossing filling layer, the convolution layer and the activation layer, the convolution kernel is (3,3), the step length is 1, and the number of filters is 512. And passing through a maximum pooling layer, the pooling kernel size is (2,2), the step size is 2, and the output is 8 × 8 × 512.
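The VGG-style feature extraction network can be sketched in Keras as follows; the returned feature-map names are an assumption, while kernel sizes, filter counts and pooling follow the layer descriptions above.

```python
from tensorflow.keras import layers

def combo_2a(x, filters):
    """Combined convolution module 2-A: zero padding + 3x3 conv (stride 1) + ReLU."""
    x = layers.ZeroPadding2D(1)(x)
    return layers.Conv2D(filters, 3, strides=1, activation="relu")(x)

def combo_2b(x, filters):
    """Module 2-B: two 2-A modules followed by 2x2 max pooling with stride 2."""
    x = combo_2a(x, filters)
    x = combo_2a(x, filters)
    return layers.MaxPooling2D(pool_size=2, strides=2)(x)

def combo_2c(x, filters):
    """Module 2-C: three 2-A modules followed by 2x2 max pooling with stride 2."""
    for _ in range(3):
        x = combo_2a(x, filters)
    return layers.MaxPooling2D(pool_size=2, strides=2)(x)

def east_backbone(inp):
    """Feature extractor for a 256x256x3 input; returns the maps the fusion
    network consumes (f2...f5 naming is an assumption)."""
    f2 = combo_2b(inp, 64)      # 128 x 128 x 64
    f2 = combo_2b(f2, 128)      # 64 x 64 x 128
    f3 = combo_2c(f2, 256)      # 32 x 32 x 256
    f4 = combo_2c(f3, 512)      # 16 x 16 x 512
    f5 = combo_2c(f4, 512)      # 8 x 8 x 512
    return f2, f3, f4, f5
```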
b. Building feature fusion networks
The first layer is an input fusion module 2-G, as shown in FIG. 3 (G). The last-layer output of the feature extraction network, 8 × 8 × 512, first passes through an upsampling layer with a sampling factor of 2, giving an output of 16 × 16 × 512. This output and the fourth-layer output of the feature extraction network, 16 × 16 × 512, pass through a tensor splicing layer, and the output is 16 × 16 × 1024.
The second layer is a combined convolution module 2-E which, as shown in fig. 3 (E), consists of two batch normalization layers, one combined convolution module 2-D and one combined convolution module 2-A; the combined convolution module 2-D consists of a zero-padding layer, a convolution layer and an activation layer. The input first passes through a batch normalization layer and then through a combined convolution module 2-D, which passes through the zero-padding layer, the convolution layer and the activation layer with a convolution kernel of (1,1), a step size of 1 and 128 filters; the output is 16 × 16 × 128. It then passes through a batch normalization layer and a combined convolution module 2-A, which passes through the zero-padding layer, the convolution layer and the activation layer with a convolution kernel of (3,3), a step size of 1 and 64 filters; the output is 16 × 16 × 64.
The third layer is an input fusion module 2-G. The second-layer output of the feature fusion network, 16 × 16 × 64, first passes through an upsampling layer with a sampling factor of 2, giving an output of 32 × 32 × 64. This output and the third-layer output of the feature extraction network, 32 × 32 × 256, pass through a tensor splicing layer, and the output is 32 × 32 × 320.
The fourth layer is a combined convolution module 2-E. The input first passes through a batch normalization layer and then through a combined convolution module 2-D, which passes through the zero-padding layer, the convolution layer and the activation layer with a convolution kernel of (1,1), a step size of 1 and 128 filters; the output is 32 × 32 × 128. It then passes through a batch normalization layer and a combined convolution module 2-A, which passes through the zero-padding layer, the convolution layer and the activation layer with a convolution kernel of (3,3), a step size of 1 and 64 filters; the output is 32 × 32 × 64.
The fifth layer is an input fusion module 2-G. The fourth-layer output of the feature fusion network, 32 × 32 × 64, first passes through an upsampling layer with a sampling factor of 2, giving an output of 64 × 64 × 64. This output and the second-layer output of the feature extraction network, 64 × 64 × 128, pass through a tensor splicing layer, and the output is 64 × 64 × 192.
The sixth layer is a combined convolution module 2-F which, as shown in fig. 3 (F), consists of three batch normalization layers, one combined convolution module 2-D and two combined convolution modules 2-A. The input first passes through a batch normalization layer and then through a combined convolution module 2-D, which passes through the zero-padding layer, the convolution layer and the activation layer with a convolution kernel of (1,1), a step size of 1 and 32 filters; the output is 64 × 64 × 32. It then passes through a batch normalization layer and a combined convolution module 2-A, with a convolution kernel of (3,3), a step size of 1 and 32 filters; the output is 64 × 64 × 32. Finally, it passes through a batch normalization layer and another combined convolution module 2-A, with a convolution kernel of (3,3), a step size of 1 and 32 filters; the output is 64 × 64 × 32.
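The fusion pattern above (upsample, splice, normalize, 1×1 convolution, then 3×3 convolution) can be sketched as follows. This is a minimal illustration using the Keras API, in which padding="same" stands in for the explicit zero-padding layers and ReLU is an assumed activation; the helper names fusion_2G and conv_block_2E are illustrative:

import tensorflow as tf
from tensorflow.keras import layers

def fusion_2G(deep, shallow):
    # upsampling layer (sampling factor 2) followed by a tensor splicing (concatenation) layer
    up = layers.UpSampling2D(size=(2, 2))(deep)
    return layers.Concatenate(axis=-1)([up, shallow])

def conv_block_2E(x, mid_filters, out_filters):
    # batch normalization -> 1x1 combined convolution 2-D -> batch normalization -> 3x3 combined convolution 2-A
    x = layers.BatchNormalization()(x)
    x = layers.Conv2D(mid_filters, (1, 1), strides=1, padding="same", activation="relu")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Conv2D(out_filters, (3, 3), strides=1, padding="same", activation="relu")(x)
    return x

deep = tf.keras.Input(shape=(8, 8, 512))     # last-layer output of the feature extraction network
skip = tf.keras.Input(shape=(16, 16, 512))   # fourth-layer output of the feature extraction network
fused = fusion_2G(deep, skip)                # first fusion layer: (16, 16, 1024)
out = conv_block_2E(fused, 128, 64)          # second fusion layer: (16, 16, 64)
model = tf.keras.Model([deep, skip], out)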
c. Building the prediction network
The first layer has three branches. The first branch consists of a combined convolution module 2-D: it passes through a zero-padding layer first, then a convolution layer and an activation layer, with a convolution kernel of (1,1), a step size of 1 and 1 filter; the output is 64 × 64 × 1. The second branch consists of a combined convolution module 2-D: it passes through a zero-padding layer first, then a convolution layer and an activation layer, with a convolution kernel of (1,1), a step size of 1 and 2 filters; the output is 64 × 64 × 2. The third branch consists of a combined convolution module 2-D: it passes through a zero-padding layer first, then a convolution layer and an activation layer, with a convolution kernel of (1,1), a step size of 1 and 4 filters; the output is 64 × 64 × 4.
The second layer is an input fusion module formed by splicing the three branches of the first layer of the prediction network; the output is 64 × 64 × 7.
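A minimal sketch of this prediction head is given below (Keras API). The sigmoid activations and the branch names are assumptions for illustration; the essential point is that three 1×1 convolutions with 1, 2 and 4 filters are applied to the 64 × 64 × 32 fusion feature map and spliced into a 64 × 64 × 7 output:

import tensorflow as tf
from tensorflow.keras import layers

features = tf.keras.Input(shape=(64, 64, 32))  # sixth-layer output of the feature fusion network
branch_1 = layers.Conv2D(1, (1, 1), strides=1, activation="sigmoid")(features)  # 64 x 64 x 1
branch_2 = layers.Conv2D(2, (1, 1), strides=1, activation="sigmoid")(features)  # 64 x 64 x 2
branch_3 = layers.Conv2D(4, (1, 1), strides=1, activation="sigmoid")(features)  # 64 x 64 x 4
prediction = layers.Concatenate(axis=-1)([branch_1, branch_2, branch_3])        # 64 x 64 x 7
head = tf.keras.Model(features, prediction)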
d. Setting a loss function
The loss function is set to the sum of the class loss, the geometry loss and the angle loss.
The class loss function is formulated as follows:
L_S = -β·Y*·log(Ŷ) - (1-β)·(1-Y*)·log(1-Ŷ)
wherein L_S represents the class loss, β represents the weight, Ŷ is the predicted category, and Y* is the real category.
The geometry loss function is formulated as follows:
L_AABB = -log IoU(R̂, R*)
wherein L_AABB represents the geometry loss function, R̂ represents the predicted AABB geometry, R* represents the real AABB geometry, and IoU represents the intersection-over-union ratio.
The angle loss function is formulated as follows:
L_θ(θ̂, θ*) = 1 - cos(θ̂ - θ*)
wherein L_θ(θ̂, θ*) is the angle loss function, θ̂ is the predicted rotation angle, and θ* is the true rotation angle.
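Taken together, the three terms above can be sketched per pixel as follows (NumPy). The weight β = 0.9 and the small constant eps are illustrative assumptions, and in practice the geometry and angle terms are typically evaluated only at text positions:

import numpy as np

def east_loss(y_hat, y_true, iou, theta_hat, theta_true, beta=0.9, eps=1e-6):
    # class loss L_S: weighted (balanced) binary cross-entropy between score maps
    l_s = (-beta * y_true * np.log(y_hat + eps)
           - (1.0 - beta) * (1.0 - y_true) * np.log(1.0 - y_hat + eps))
    # geometry loss L_AABB: negative log of the intersection-over-union of predicted and real boxes
    l_aabb = -np.log(iou + eps)
    # angle loss L_theta: 1 - cos(predicted angle - true angle)
    l_theta = 1.0 - np.cos(theta_hat - theta_true)
    # total loss: sum of the three terms, averaged over all positions
    return float(np.mean(l_s + l_aabb + l_theta))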
Training an improved EAST network, comprising the steps of:
e1. Setting training parameters
A training optimizer, an initial learning rate, the number of iterations, the batch size and the initial prior boxes are set.
e2. Online data enhancement
Data enhancement is performed on the input images to expand the data set; the main enhancement methods are randomly adding noise and randomly adjusting contrast.
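A minimal sketch of this online enhancement is shown below (NumPy); the noise level, the contrast range and the 50% application probabilities are illustrative assumptions:

import numpy as np

def augment(image, rng=np.random.default_rng()):
    img = image.astype(np.float32)
    if rng.random() < 0.5:                       # randomly add noise
        img += rng.normal(0.0, 10.0, img.shape)
    if rng.random() < 0.5:                       # randomly adjust contrast
        factor = rng.uniform(0.7, 1.3)
        img = (img - img.mean()) * factor + img.mean()
    return np.clip(img, 0, 255).astype(np.uint8)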
e3. Setting training completion flag
The training accuracy is checked on a validation set at intervals; the training completion flag is set when the maximum number of iterations is reached and the accuracy requirement is met, and the optimal network is saved after training is completed.
The test set is then input into the optimal improved EAST network to obtain the character images.
3) The character information in the character images cut out in step 2) is labeled to construct a character recognition data set, which is divided into a training set and a test set. Training parameters are then loaded to train the CRNN network with the training set; the optimal CRNN network is obtained after training is finished, the test set is input into the optimal CRNN network, and the character information is output.
Constructing a CRNN network, comprising the following steps:
a. constructing a feature extraction network
The input image is w × 32 × 1, where w is the width of the input image, which changes adaptively with the size of the input picture.
The first layer is a combined convolution module 3-A: it first passes through a zero-padding layer, then a convolution layer and an activation layer, with a convolution kernel of (3,3), a step size of 1 and 64 filters; the output is w × 32 × 64.
The second layer is a max pooling layer with a pooling kernel size of (2,2) and a step size of 2; the output is (w/2) × 16 × 64.
The third layer is a combined convolution module 3-A: it first passes through a zero-padding layer, then a convolution layer and an activation layer, with a convolution kernel of (3,3), a step size of 1 and 128 filters; the output is (w/2) × 16 × 128.
The fourth layer is a max pooling layer with a pooling kernel size of (2,2) and a step size of 2; the output is (w/4) × 8 × 128.
The fifth layer is a combined convolution module 3-B: it first passes through a zero-padding layer, then a convolution layer, a batch normalization layer and an activation layer, with a convolution kernel of (3,3), a step size of 1 and 256 filters; the output is (w/4) × 8 × 256.
The sixth layer is a combined convolution module 3-A: it first passes through a zero-padding layer, then a convolution layer and an activation layer, with a convolution kernel of (3,3), a step size of 1 and 256 filters; the output is (w/4) × 8 × 256.
The seventh layer is a max pooling layer with a pooling kernel size of (2,2) and a step size of 2; the output is (w/8) × 4 × 256.
The eighth layer is a combined convolution module 3-B: it first passes through a zero-padding layer, then a convolution layer, a batch normalization layer and an activation layer, with a convolution kernel of (3,3), a step size of 1 and 512 filters; the output is (w/8) × 4 × 512.
The ninth layer is a combined convolution module 3-A: it first passes through a zero-padding layer, then a convolution layer and an activation layer, with a convolution kernel of (3,3), a step size of 1 and 512 filters; the output is (w/8) × 4 × 512.
The tenth layer is a max pooling layer with a pooling kernel size of (2,2) and a step size of 2; the output is (w/16) × 2 × 512.
The eleventh layer is a combined convolution module 3-C: it passes through a convolution layer, a batch normalization layer and an activation layer, with a convolution kernel of (2,2), a step size of 1 and 512 filters; the output is (w/16 - 1) × 1 × 512.
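The eleven layers above can be sketched as follows (Keras API). Note that Keras orders tensors as height × width × channels, whereas the text writes width × height × channels; the ReLU activations are assumed:

import tensorflow as tf
from tensorflow.keras import layers

def crnn_backbone():
    inp = tf.keras.Input(shape=(32, None, 1))                         # height 32, variable width w
    x = layers.Conv2D(64, (3, 3), padding="same", activation="relu")(inp)   # module 3-A
    x = layers.MaxPooling2D((2, 2), strides=2)(x)                     # height 16, width w/2
    x = layers.Conv2D(128, (3, 3), padding="same", activation="relu")(x)
    x = layers.MaxPooling2D((2, 2), strides=2)(x)                     # height 8, width w/4
    x = layers.Conv2D(256, (3, 3), padding="same")(x)                 # module 3-B
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    x = layers.Conv2D(256, (3, 3), padding="same", activation="relu")(x)
    x = layers.MaxPooling2D((2, 2), strides=2)(x)                     # height 4, width w/8
    x = layers.Conv2D(512, (3, 3), padding="same")(x)                 # module 3-B
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    x = layers.Conv2D(512, (3, 3), padding="same", activation="relu")(x)
    x = layers.MaxPooling2D((2, 2), strides=2)(x)                     # height 2, width w/16
    x = layers.Conv2D(512, (2, 2), strides=1)(x)                      # module 3-C, valid convolution
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)                                  # height 1, width w/16 - 1
    return tf.keras.Model(inp, x)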
b. Building the prediction network
The first layer is a cyclic convolution module, which consists of a bidirectional LSTM.
The second layer is a fully connected layer.
The third layer is a cyclic convolution module, which consists of a bidirectional LSTM.
The fourth layer is a fully connected layer, whose output is a sequence containing one prediction vector for each character prediction block.
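A minimal sketch of this recurrent prediction head is given below (Keras API). The LSTM hidden size of 256, the intermediate fully connected size of 256 and the class count of 6736 (6735 characters plus the blank) are assumptions for illustration; the unit-height dimension of the backbone output is assumed to have been squeezed out so that the input is a (time, 512) sequence:

import tensorflow as tf
from tensorflow.keras import layers

seq = tf.keras.Input(shape=(None, 512))                                   # feature sequence from the backbone
x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(seq)    # first cyclic convolution module
x = layers.Dense(256)(x)                                                  # first fully connected layer
x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(x)      # second cyclic convolution module
probs = layers.Dense(6736, activation="softmax")(x)                       # per-block character probabilities
head = tf.keras.Model(seq, probs)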
c. Setting the decoder
The output of the prediction network is converted into a sequence in which each element ranges from 0 to 6735 and corresponds to an independent character (0 corresponds to the blank character); the sequence corresponds to dividing the line of text into character prediction blocks. The sequence is processed from left to right, and character information is output according to the character library whenever an element is not 0 and is not the same as the previous element.
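The decoding rule just described is ordinary greedy (best-path) CTC decoding; a minimal sketch follows (NumPy), where charset is an assumed mapping from the element values 1 to 6735 to their characters:

import numpy as np

def greedy_decode(probs, charset):
    # probs: (T, 6736) per-block class probabilities; element 0 is the blank character
    best = np.argmax(probs, axis=-1)
    out, prev = [], 0
    for idx in best:                      # process the sequence from left to right
        if idx != 0 and idx != prev:      # not blank and not a repeat of the previous element
            out.append(charset[idx])
        prev = idx
    return "".join(out)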
d. Setting a loss function
The loss function is set to the CTC (Connectionist Temporal Classification) loss function.
The CTC loss function is formulated as follows:
L_CTC = -ln ∏_{(x,z)∈S} p(z|x) = -∑_{(x,z)∈S} ln p(z|x)
wherein L_CTC represents the CTC loss function, p(z|x) represents the probability of outputting the sequence z given the input x, and S is the training set.
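In practice this formula is rarely coded by hand; a minimal sketch using the Keras backend helper is shown below (the shapes in the comments are illustrative):

import tensorflow as tf

def ctc_loss(y_true, y_pred, input_length, label_length):
    # y_true: (batch, max_label_len) label indices; y_pred: (batch, T, classes) softmax outputs;
    # input_length / label_length: (batch, 1) lengths of each prediction sequence and label
    return tf.keras.backend.ctc_batch_cost(y_true, y_pred, input_length, label_length)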
Training a CRNN network, comprising the steps of:
e1. Setting training parameters
A training optimizer, an initial learning rate, the number of iterations and the batch size are set.
e2. Setting training completion flag
The training accuracy is checked on a validation set at intervals; the training completion flag is set when the maximum number of iterations is reached and the accuracy requirement is met, and the optimal network is saved after training is completed.
The test set is then input into the optimal CRNN network to obtain the character information.
4) The character information output in step 3) is spliced into a text, and the instrument type corresponding to the text is labeled to construct a text classification data set, which is divided into a training set and a test set. Training parameters are loaded to train the TextCNN network with the training set; the optimal TextCNN network is obtained after training is finished, the test set is input into the optimal TextCNN network, and the instrument type corresponding to the text is output.
Constructing a TextCNN network, comprising the following steps:
a. constructing a network structure:
the first layer is an embedding layer: the input text of length m is word-vectorized into an input tensor of 600 × 64;
the second layer is a convolution module with a convolution kernel of (5, 5), a step size of 1 and 256 filters; the output is 596 × 256;
the third layer is a max pooling layer; the output is 1 × 256;
the fourth layer consists of a fully connected layer, a Dropout layer and an activation layer; the output is 1 × 128;
the fifth layer consists of a fully connected layer and an activation layer; the output is 1 × cls, where cls is the number of categories;
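The five layers above can be sketched as follows (Keras API). The vocabulary size, the dropout rate, the ReLU activation and the use of a 1-D convolution of width 5 over the 600 token positions (which reproduces the stated 596 × 256 output) are assumptions for illustration:

import tensorflow as tf
from tensorflow.keras import layers

def build_textcnn(vocab_size=6736, cls=10):
    inp = tf.keras.Input(shape=(600,))                    # token ids, padded to length 600
    x = layers.Embedding(vocab_size, 64)(inp)             # embedding layer: 600 x 64
    x = layers.Conv1D(256, kernel_size=5, strides=1)(x)   # convolution module: 596 x 256
    x = layers.GlobalMaxPooling1D()(x)                    # max pooling layer: 1 x 256
    x = layers.Dense(128)(x)                              # fully connected layer
    x = layers.Dropout(0.5)(x)                            # Dropout layer
    x = layers.Activation("relu")(x)                      # activation layer: 1 x 128
    out = layers.Dense(cls, activation="softmax")(x)      # fully connected + activation: 1 x cls
    return tf.keras.Model(inp, out)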
b. setting a loss function
Setting loss function as multi-class cross entropy
L_CrossEntropy = -∑_{i=1}^{n} y_i·log(ŷ_i)
wherein L_CrossEntropy represents the loss, n represents the number of classes, y_i represents the true probability of class i, and ŷ_i represents the predicted probability of class i.
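As a small worked example of this formula with n = 3 categories (the probability values are illustrative):

import numpy as np

y_true = np.array([0.0, 1.0, 0.0])          # one-hot true distribution: the real category is class 1
y_pred = np.array([0.1, 0.8, 0.1])          # predicted probabilities
loss = -np.sum(y_true * np.log(y_pred))     # = -ln(0.8) ≈ 0.223
print(round(float(loss), 3))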
Training a TextCNN network, comprising the steps of:
c1. Setting training parameters
A training optimizer, an initial learning rate, the number of iterations and the batch size are set.
c2. Setting training completion flag
The training accuracy is checked on a validation set at intervals; the training completion flag is set when the maximum number of iterations is reached and the accuracy requirement is met, and the optimal network is saved after training is completed.
The test set is input into the optimal TextCNN network, which outputs the instrument type corresponding to the text.
In conclusion, the invention provides a new method for detecting and classifying instrument images. By using neural networks as an effective means of instrument detection and classification, it effectively solves the problem that instrument types are difficult to read and promotes the development of automatic instrument identification technology, so the method has practical value and is worth popularizing.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (6)

1. A meter detection and classification method based on image texts is characterized by comprising the following steps:
1) marking dial plate positions on instrument images to construct an instrument positioning data set, dividing the instrument positioning data set into a training set and a testing set, loading training parameters and training the improved YOLO network by using the training set, obtaining an optimal improved YOLO network after training is finished, inputting the testing set into the optimal improved YOLO network, and outputting and cutting out the dial plate image; the improved YOLO network is characterized in that the backbone network is optimized into a mobile lightweight network so as to reduce network parameters and calculated amount and improve operation speed;
2) marking the character positions in the dial plate image cut out in step 1) to construct a character detection data set, dividing the character detection data set into a training set and a testing set, then loading training parameters, training the improved EAST network by using the training set, obtaining an optimal improved EAST network after training is finished, inputting the testing set into the optimal improved EAST network, and outputting the character positions and cutting them into character images; the improved EAST network is characterized in that the backbone network is changed into VGG to improve network detection accuracy, and the output layer structure of its prediction module is modified so that the vertices are predicted using only the head element, so as to improve the prediction performance for long characters;
3) marking the character information in the character image cut in the step 2) to construct a character recognition data set, dividing the character recognition data set into a training set and a testing set, then loading training parameters to train the CRNN by using the training set, obtaining an optimal CRNN after training is finished, inputting the testing set into the optimal CRNN, and outputting the character information;
4) splicing the character information output in the step 3) into a text, labeling the instrument type corresponding to the text to construct a text classification data set, dividing the text classification data set into a training set and a test set, loading training parameters to train a TextCNN network by using the training set, obtaining an optimal TextCNN network after the training is finished, and inputting the test set into the optimal TextCNN network to output the instrument type corresponding to the text.
2. The method for detecting and classifying meters based on image texts as claimed in claim 1, wherein in step 1), first, various meter images under different environments are collected by a camera, preprocessing operations of filtering and image enhancement are performed on the meter images, then abnormal data in the meter images, including data with dirty surface, extreme illumination and incomplete shooting abnormality, are removed, and then the rest of data are labeled, wherein the labeled content is the dial position, and a meter positioning data set is constructed and divided into a training set and a testing set.
3. The method for instrument detection and classification based on image text as claimed in claim 1, wherein in step 1), the specific conditions of the improved YOLO network are as follows:
a. constructing a feature extraction network according to the requirements of real-time performance and high precision:
the first layer is a combined convolution module 1-A which consists of a zero filling layer, a convolution layer, a batch normalization layer and an activation layer;
the second layer is a combined convolution module 1-B which consists of a depth convolution layer, two batch normalization layers, two activation layers and a convolution layer;
the third layer is a combined convolution module 1-C which consists of a zero filling layer, a depth convolution layer, two batch normalization layers, two activation layers and a convolution layer;
the fourth layer is a combined convolution module 1-B which consists of a depth convolution layer, two batch normalization layers, two activation layers and a convolution layer;
the fifth layer is a combined convolution module 1-C which consists of a zero filling layer, a depth convolution layer, two batch normalization layers, two activation layers and a convolution layer;
the sixth layer is a combined convolution module 1-B which consists of a deep convolution layer, two batch normalization layers, two activation layers and a convolution layer;
the seventh layer is a combined convolution module 1-C which consists of a zero filling layer, a deep convolution layer, two batch normalization layers, two activation layers and a convolution layer;
the eighth layer is a combined convolution module 1-D which consists of five combined convolution modules 1-B;
the ninth layer is a combined convolution module 1-C which consists of a zero filling layer, a depth convolution layer, two batch normalization layers, two activation layers and a convolution layer;
the tenth layer is a combined convolution module 1-B which consists of a deep convolution layer, two batch normalization layers, two activation layers and a convolution layer;
b. constructing prediction networks for outputting and predicting targets with different sizes according to the output of different layers of the feature extraction network, wherein the prediction networks comprise a large-size target prediction network, a medium-size target prediction network and a small-size target prediction network;
b1, inputting the data as the tenth layer output of the feature extraction network, wherein the large-size target prediction network consists of a plurality of combined convolution modules and convolution layers, and has the following structure:
the first layer is a combined convolution module 1-D which consists of five combined convolution modules 1-B;
the second layer is a combined convolution module 1-B which consists of a depth convolution layer, two batch normalization layers, two activation layers and a convolution layer;
the third layer is a convolution layer;
b2, inputting the eighth-layer output of the feature extraction network and the first-layer output of the large-size target prediction network, wherein the medium-size target prediction network consists of a plurality of combined convolution modules and convolution layers, and the structure of the medium-size target prediction network is as follows:
the first layer is an input fusion module 1-E, which consists of a combined convolution module 1-B, an up-sampling layer and a tensor splicing layer;
the second layer is a combined convolution module 1-D which consists of five combined convolution modules 1-B;
the third layer is a combined convolution module 1-B which consists of a depth convolution layer, two batch normalization layers, two activation layers and a convolution layer;
the fourth layer is a convolution layer;
b3, inputting the sixth layer output of the feature extraction network and the second layer output of the medium-size target prediction network, wherein the small-size target prediction network consists of a plurality of combined convolution modules and convolution layers, and the structure of the small-size target prediction network is as follows:
the first layer is an input fusion module 1-E, which consists of a combined convolution module 1-B, an up-sampling layer and a tensor splicing layer;
the second layer is a combined convolution module 1-D which consists of five combined convolution modules 1-B;
the third layer is a combined convolution module 1-B which consists of a depth convolution layer, two batch normalization layers, two activation layers and a convolution layer;
the fourth layer is a convolution layer;
finally, the output of the large-size target prediction network, the medium-size target prediction network and the small-size target prediction network passes through a non-maximum value inhibition layer to obtain the positions and the types of the predicted targets;
c. setting loss functions including a central coordinate loss function, a width and height loss function, a confidence coefficient loss function and a category loss function;
the center coordinate loss function is formulated as follows:
Loss_xy = mark_object * (2 - w*h) * Loss_log(xy_true, xy_predict)
in the formula, Loss_xy represents the center coordinate loss, mark_object represents the flag bit indicating whether the anchor box contains an object, w represents the width of the anchor box, h represents the height of the anchor box, Loss_log represents the binary cross-entropy loss, xy_true represents the true center coordinate value, and xy_predict represents the predicted center coordinate value;
the width-height loss function is formulated as follows:
Loss_wh = 0.5 * mark_object * (2 - w*h) * (wh_true - wh_predict)^2
in the formula, Loss_wh represents the width-height loss, wh_true represents the true width-height value, and wh_predict represents the predicted width-height value;
the confidence loss function is formulated as follows:
Loss_confidence = mark_object * Loss_log(mark_object, c_predict) + (1 - mark_object) * Loss_log(mark_object, c_predict) * mark_ignore
in the formula, Loss_confidence represents the confidence loss, c_predict represents the confidence value of the prediction box, and mark_ignore represents the flag bit of an anchor box whose IOU is less than the threshold;
the class loss function is formulated as follows:
Loss_cls = mark_object * Loss_log(cls_true, cls_predict)
in the formula, Loss_cls represents the class loss, cls_true represents the true class, and cls_predict represents the predicted class;
the total loss function is formulated as follows:
Loss = (Loss_xy + Loss_wh + Loss_confidence + Loss_cls) / num_f
in the formula, Loss represents the total loss, and num_f represents the total number of inputs as a floating-point number;
loading training parameters to train the improved YOLO network, wherein the training parameters are set as follows: setting the training optimizer to Adam, the initial learning rate to 0.001, the maximum training period to 500 and the batch size to 8; checking the training accuracy on a verification set at intervals, setting the training completion flag as reaching the maximum training period or meeting the requirement on the mean intersection-over-union ratio, and storing the optimal network after training is completed;
and inputting the test set into an optimal improved YOLO network to obtain the dial position and the dial image.
4. The method for detecting and classifying meters based on image texts as claimed in claim 1, wherein in step 2), the concrete conditions of the improved EAST network are as follows:
a. constructing a feature extraction network, wherein the structure is as follows:
the first layer is a combined convolution module 2-B which consists of two combined convolution modules 2-A and a maximum pooling layer, and the combined convolution module 2-A consists of a zero padding layer, a convolution layer and an active layer;
the second layer is a combined convolution module 2-B which consists of two combined convolution modules 2-A and a maximum pooling layer;
the third layer is a combined convolution module 2-C which consists of three combined convolution modules 2-A and a maximum pooling layer;
the fourth layer is a combined convolution module 2-C which consists of three combined convolution modules 2-A and a maximum pooling layer;
the fifth layer is a combined convolution module 2-C which consists of three combined convolution modules 2-A and a maximum pooling layer;
b. constructing a feature fusion network, wherein the structure is as follows:
the first layer is an input fusion module 2-G which consists of an up-sampling layer and a tensor splicing layer;
the second layer is a combined convolution module 2-E which consists of two batch normalization layers, a combined convolution module 2-D and a combined convolution module 2-A; the combined convolution module 2-D consists of a zero padding layer, a convolution layer and an active layer;
the third layer is an input fusion module 2-G which consists of an up-sampling layer and a tensor splicing layer;
the fourth layer is a combined convolution module 2-E which consists of two batch normalization layers, a combined convolution module 2-D and a combined convolution module 2-A;
the fifth layer is an input fusion module 2-G which consists of an up-sampling layer and a tensor splicing layer;
the sixth layer is a combined convolution module 2-F which consists of three batch normalization layers, a combined convolution module 2-D and two combined convolution modules 2-A;
c. constructing a prediction network, wherein the structure is as follows:
the first layer is divided into three branches, and the first branch consists of a combined convolution module 2-D; the second branch consists of a combined convolution module 2-D; the third branch consists of a combined convolution module 2-D;
the second layer is an input fusion module which is formed by splicing three branches of the first layer;
d. setting a loss function comprising a category loss function, a geometric shape loss function and an angle loss function;
the class loss function is formulated as follows:
L_S = -β·Y*·log(Ŷ) - (1-β)·(1-Y*)·log(1-Ŷ)
in the formula, L_S represents the class loss, β represents the weight, Ŷ is the predicted category, and Y* is the real category;
the geometry loss function is formulated as follows:
L_AABB = -log IoU(R̂, R*)
in the formula, L_AABB represents the geometry loss, R̂ represents the geometry of the predicted quadrilateral text box AABB, R* represents the geometry of the real quadrilateral text box AABB, and IoU represents the intersection-over-union ratio;
the angle loss function is formulated as follows:
L_θ(θ̂, θ*) = 1 - cos(θ̂ - θ*)
in the formula, L_θ(θ̂, θ*) is the angle loss, θ̂ is the predicted value of the rotation angle, and θ* is the actual value of the rotation angle;
loading training parameters to train the improved EAST network, wherein the training parameters are set as follows: setting the training optimizer to Adam, the initial learning rate to 0.001, the maximum training period to 500 and the batch size to 8; checking the training accuracy on a verification set at intervals, setting the training completion flag as reaching the maximum training period or meeting the requirement on the mean intersection-over-union ratio, and storing the optimal network after training is completed;
and inputting the test set into an optimal improved EAST network to obtain a text position, and cutting the text position into a character image.
5. The method as claimed in claim 1, wherein in step 3), the CRNN network is specifically as follows:
a. constructing a feature extraction network, wherein the structure is as follows:
the first layer is a combined convolution module 3-A which consists of a zero filling layer, a convolution layer and an active layer;
the second layer is a maximum pooling layer;
the third layer is a combined convolution module 3-A which consists of a zero filling layer, a convolution layer and an activation layer;
the fourth layer is a maximum pooling layer;
the fifth layer is a combined convolution module 3-B which consists of a zero filling layer, a convolution layer, a batch normalization layer and an activation layer;
the sixth layer is a combined convolution module 3-A which consists of a zero filling layer, a convolution layer and an activation layer;
the seventh layer is a maximum pooling layer;
the eighth layer is a combined convolution module 3-B which consists of a zero filling layer, a convolution layer, a batch normalization layer and an activation layer;
the ninth layer is a combined convolution module 3-A which consists of a zero filling layer, a convolution layer and an activation layer;
the tenth layer is a maximum pooling layer;
the eleventh layer is a combined convolution module 3-C, which consists of a zero-padding layer, a convolution layer, a batch normalization layer and an activation layer;
b. constructing a prediction network, wherein the structure is as follows:
the first layer is a cyclic convolution module, which consists of a bidirectional LSTM;
the second layer is a full connection layer;
the third layer is a cyclic convolution module which consists of a bidirectional LSTM;
the fourth layer is a full connection layer;
c. setting a decoder to convert the output sequence into character information;
d. setting the loss function as the CTC (Connectionist Temporal Classification) loss function;
the CTC loss function is formulated as follows:
L_CTC = -ln ∏_{(x,z)∈S} p(z|x) = -∑_{(x,z)∈S} ln p(z|x)
in the formula, L_CTC represents the CTC loss, p(z|x) represents the probability of outputting the sequence z given the input x, and S is the training set;
loading training parameters to train the CRNN network, wherein the training parameters are set as follows: setting the training optimizer to Adam, the initial learning rate to 0.0001, the maximum training period to 100, and the batch size to 32; checking the training accuracy on a verification set at intervals, setting the training completion flag as reaching the maximum training period or meeting the requirement on recognition accuracy, and storing the optimal network after training is completed;
and inputting the test set into the optimal CRNN network to obtain character information.
6. The method for classifying meters based on image texts according to claim 1, wherein in step 4), the TextCNN network is specified as follows:
a. a network structure is constructed as follows:
the first layer is an embedding layer;
the second layer is a convolution module;
the third layer is a maximum pooling layer;
the fourth layer consists of a full connection layer, a Dropout layer and an activation layer;
the fifth layer consists of a full connecting layer and an activation layer;
b. setting a loss function as a multi-class cross entropy, wherein the formula is as follows:
L_CrossEntropy = -∑_{i=1}^{n} y_i·log(ŷ_i)
in the formula, L_CrossEntropy represents the loss, n represents the number of classes, y_i represents the true probability of class i, and ŷ_i represents the predicted probability of class i;
loading training parameters to train the TextCNN network, wherein the training parameters are set as follows: setting the training optimizer to Adam, the initial learning rate to 0.001, the number of iterations to 1000 and the batch size to 64; checking the training accuracy on a verification set at intervals, setting the training completion flag as reaching the maximum number of iterations and meeting the accuracy requirement, and storing the optimal network after training is completed;
and inputting the test set into an optimal TextCNN network to obtain a corresponding instrument type.
CN202110855223.6A 2021-07-28 2021-07-28 Instrument detection classification method based on image text Active CN113673509B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110855223.6A CN113673509B (en) 2021-07-28 2021-07-28 Instrument detection classification method based on image text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110855223.6A CN113673509B (en) 2021-07-28 2021-07-28 Instrument detection classification method based on image text

Publications (2)

Publication Number Publication Date
CN113673509A true CN113673509A (en) 2021-11-19
CN113673509B CN113673509B (en) 2023-06-09

Family

ID=78540390

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110855223.6A Active CN113673509B (en) 2021-07-28 2021-07-28 Instrument detection classification method based on image text

Country Status (1)

Country Link
CN (1) CN113673509B (en)


Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108710831A (en) * 2018-04-24 2018-10-26 华南理工大学 A kind of small data set face recognition algorithms based on machine vision
CN110543878A (en) * 2019-08-07 2019-12-06 华南理工大学 pointer instrument reading identification method based on neural network
CN111062282A (en) * 2019-12-05 2020-04-24 武汉科技大学 Transformer substation pointer type instrument identification method based on improved YOLOV3 model
CN111368825A (en) * 2020-02-25 2020-07-03 华南理工大学 Pointer positioning method based on semantic segmentation
CN111401358A (en) * 2020-02-25 2020-07-10 华南理工大学 Instrument dial plate correction method based on neural network
CN111639643A (en) * 2020-05-22 2020-09-08 深圳市赛为智能股份有限公司 Character recognition method, character recognition device, computer equipment and storage medium
CN111814919A (en) * 2020-08-31 2020-10-23 江西小马机器人有限公司 Instrument positioning and identifying system based on deep learning
CN112801094A (en) * 2021-02-02 2021-05-14 中国长江三峡集团有限公司 Pointer instrument image inclination correction method
CN112861867A (en) * 2021-02-01 2021-05-28 北京大学 Pointer type instrument panel identification method, system and storage medium


Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113936280A (en) * 2021-11-23 2022-01-14 河海大学 Embedded instrument code disc character automatic identification system and method
CN113936280B (en) * 2021-11-23 2024-04-05 河海大学 Automatic character recognition system and method for code disc of embedded instrument
CN114338346A (en) * 2021-12-29 2022-04-12 中国工商银行股份有限公司 Alarm message processing method and device and electronic equipment
CN115424121A (en) * 2022-07-30 2022-12-02 南京理工大学紫金学院 Power pressing plate switch inspection method based on computer vision
CN115424121B (en) * 2022-07-30 2023-10-13 南京理工大学紫金学院 Electric power pressing plate switch inspection method based on computer vision
CN116416626A (en) * 2023-06-12 2023-07-11 平安银行股份有限公司 Method, device, equipment and storage medium for acquiring circular seal data
CN116416626B (en) * 2023-06-12 2023-08-29 平安银行股份有限公司 Method, device, equipment and storage medium for acquiring circular seal data
CN116958998A (en) * 2023-09-20 2023-10-27 四川泓宝润业工程技术有限公司 Digital instrument reading identification method based on deep learning
CN116958998B (en) * 2023-09-20 2023-12-26 四川泓宝润业工程技术有限公司 Digital instrument reading identification method based on deep learning

Also Published As

Publication number Publication date
CN113673509B (en) 2023-06-09

Similar Documents

Publication Publication Date Title
CN110543878B (en) Pointer instrument reading identification method based on neural network
CN113673509B (en) Instrument detection classification method based on image text
CN106127204B (en) A kind of multi-direction meter reading Region detection algorithms of full convolutional neural networks
CN111368825B (en) Pointer positioning method based on semantic segmentation
CN113239930A (en) Method, system and device for identifying defects of cellophane and storage medium
CN115223063B (en) Deep learning-based unmanned aerial vehicle remote sensing wheat new variety lodging area extraction method and system
CN105335760A (en) Image number character recognition method
CN113420619A (en) Remote sensing image building extraction method
CN111369526A (en) Multi-type old bridge crack identification method based on semi-supervised deep learning
CN116704137B (en) Reverse modeling method for point cloud deep learning of offshore oil drilling platform
CN105184225A (en) Multinational paper money image identification method and apparatus
CN113850799A (en) YOLOv 5-based trace DNA extraction workstation workpiece detection method
CN112926556A (en) Aerial photography power transmission line strand breaking identification method and system based on semantic segmentation
CN112464704A (en) Remote sensing image identification method based on feature fusion and rotating target detector
CN113837166B (en) Automatic pointer instrument reading method based on deep learning
CN116977747B (en) Small sample hyperspectral classification method based on multipath multi-scale feature twin network
CN117523394A (en) SAR vessel detection method based on aggregation characteristic enhancement network
CN113673508B (en) Pointer instrument image data synthesis method
CN116052110A (en) Intelligent positioning method and system for pavement marking defects
CN115953394A (en) Target segmentation-based detection method and system for mesoscale ocean vortexes
CN116310902A (en) Unmanned aerial vehicle target detection method and system based on lightweight neural network
CN116385364A (en) Multi-level ground lead defect identification method based on parallax auxiliary semantic segmentation
CN115830302A (en) Multi-scale feature extraction and fusion power distribution network equipment positioning identification method
CN110889418A (en) Gas contour identification method
CN114782322A (en) YOLOv5 model arc additive manufacturing molten pool defect detection method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant