CN113673509B - Instrument detection classification method based on image text - Google Patents
- Publication number
- CN113673509B CN113673509B CN202110855223.6A CN202110855223A CN113673509B CN 113673509 B CN113673509 B CN 113673509B CN 202110855223 A CN202110855223 A CN 202110855223A CN 113673509 B CN113673509 B CN 113673509B
- Authority
- CN
- China
- Prior art keywords
- layer
- convolution
- network
- training
- combined
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention discloses an instrument detection and classification method based on image text, comprising the following steps: 1) construct an instrument positioning data set, train an improved YOLO network, and output dial images with the network; 2) construct a character detection data set, train an improved EAST network, and output character images with the network; 3) construct a character recognition data set, train a CRNN network, and output character information with the network; 4) construct a text classification data set, train a TextCNN network, and output the instrument type with the network. The invention uses neural networks to detect instruments and to detect and recognize the text information on them. It achieves higher accuracy and better generalization under different backgrounds, accurately detects instruments of different sizes, and is not limited by acquisition angle or distance. By exploiting the character information on the instrument, it solves the problem in the machine vision field that the position of an instrument can be identified but its type is difficult to distinguish: the method both detects the instrument and identifies its type.
Description
Technical Field
The invention relates to the technical field of image processing and neural networks, in particular to an instrument detection classification method based on image texts.
Background
Instruments serve as monitoring devices, mainly including pressure instruments, temperature instruments, flow instruments, electrical instruments, and electronic measuring instruments, and are widely applied in industrial production and daily life, where they provide great convenience. Compared with manual classification, automatic methods offer a wide application range and high classification efficiency, and they have gradually become mainstream with the development of image processing and neural network technology. The key links in such methods are the localization and recognition of the text on the meter, and the accuracy of this text localization and recognition has an important influence on meter classification and value reading.
At present, research on instrument classification methods mainly focuses on training neural networks to classify images of different instrument types. This approach has shortcomings: the visual difference between different instrument types is often small, so the recognition and classification performance of deep networks on different instruments is not ideal. Research on character recognition on instruments has mainly applied traditional image processing, acquiring character information through a pipeline that includes filtering, graying, thresholding, edge detection, and template matching. With the rapid development of image processing and neural network technology in recent years, neural networks can now be used for character localization, recognition, and classification: text detection localizes character information with quadrilateral boxes, a text recognition algorithm reads the text information on the instrument, and a text classification algorithm classifies the text to obtain the instrument type.
In view of the above discussion, an instrument detection and classification method that is both real-time and highly accurate has high practical application value.
Disclosure of Invention
The invention aims to overcome the defects and shortcomings of the prior art by providing an instrument detection and classification method based on image text. The method uses neural networks to detect instruments and to detect and recognize their text information; it achieves higher accuracy and better generalization under different backgrounds, accurately detects instruments of different sizes, and is not limited by acquisition angle or distance. By exploiting the character information on the instrument, it solves the problem in the machine vision field that the position of an instrument can be identified but its type is difficult to distinguish: the method both detects the instrument and identifies its type.
In order to achieve the above purpose, the technical scheme provided by the invention is as follows: an instrument detection classification method based on image text comprises the following steps:
1) Mark the dial positions in the instrument images to construct an instrument positioning data set and divide it into a training set and a test set; load training parameters and train the improved YOLO network with the training set; after training, obtain the optimal improved YOLO network, input the test set into it, output the dial images, and crop them out. The improved YOLO network replaces the backbone with a MobileNet lightweight network to reduce the parameter count and computation and improve the running speed;
2) Mark the character positions in the dial images cropped in step 1) to construct a character detection data set and divide it into a training set and a test set; load training parameters and train the improved EAST network with the training set; after training, obtain the optimal improved EAST network, input the test set into it, output the character positions, and crop out the character images. The improved EAST network changes the backbone to VGG to improve detection accuracy, and modifies the prediction module of the output layer to predict vertices using only the head elements, improving prediction performance on long text;
3) Mark the character information in the character images cropped in step 2) to construct a character recognition data set and divide it into a training set and a test set; load training parameters and train the CRNN network with the training set; after training, obtain the optimal CRNN network, input the test set into it, and output the character information;
4) Splice the character information output in step 3) into text and mark the instrument type corresponding to each text to construct a text classification data set, divided into a training set and a test set; load training parameters and train the TextCNN network with the training set; after training, obtain the optimal TextCNN network, input the test set into it, and output the instrument type corresponding to the text.
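The four-stage pipeline of steps 1) to 4) can be sketched as follows. This is a minimal illustration of the data flow only: the four stage functions are hypothetical placeholders standing in for the trained improved-YOLO, improved-EAST, CRNN, and TextCNN networks, and the keyword rule in `classify_meter` is an assumption for demonstration, not the patented TextCNN classifier.

```python
def locate_dials(image):
    # Stage 1 (improved YOLO): return cropped dial regions.
    # Placeholder: treat the whole image as one dial crop.
    return [image]

def detect_text(dial):
    # Stage 2 (improved EAST): return cropped text-line images.
    return [dial]

def recognize_text(line):
    # Stage 3 (CRNN): return the character string in one text line.
    return str(line)

def classify_meter(text):
    # Stage 4 (TextCNN): map the spliced text to a meter type.
    # Hypothetical keyword rule, for illustration only.
    return "pressure meter" if "MPa" in text else "unknown"

def classify_image(image):
    # Chain the four stages: dial crops -> text crops -> strings -> type.
    texts = []
    for dial in locate_dials(image):
        for line in detect_text(dial):
            texts.append(recognize_text(line))
    return classify_meter(" ".join(texts))
```

With real networks substituted for the placeholders, each stage's output feeds the next exactly as in steps 1) to 4).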
In step 1), instrument images of various types are collected with a camera under different environments, filtering and image enhancement preprocessing operations are applied, and abnormal data are removed, including data with surface dirt, extreme illumination, or incomplete capture; the remaining data are marked with the dial positions to construct the instrument positioning data set, which is divided into a training set and a test set.
Further, in step 1), the specific case of the improved YOLO network is as follows:
a. constructing a feature extraction network according to the real-time and high-precision requirements:
the first layer is a combined convolution module 1-A, which consists of a zero filling layer, a convolution layer, a batch normalization layer and an activation layer;
the second layer is a combined convolution module 1-B, which consists of a depth convolution layer, two batch normalization layers, two activation layers and a convolution layer;
the third layer is a combined convolution module 1-C, which consists of a zero filling layer, a depth convolution layer, two batch normalization layers, two activation layers and a convolution layer;
the fourth layer is a combined convolution module 1-B, which consists of a deep convolution layer, two batch normalization layers, two activation layers and a convolution layer;
The fifth layer is a combined convolution module 1-C, which consists of a zero filling layer, a depth convolution layer, two batch normalization layers, two activation layers and a convolution layer;
the sixth layer is a combined convolution module 1-B, which consists of a deep convolution layer, two batch normalization layers, two activation layers and a convolution layer;
the seventh layer is a combined convolution module 1-C, which consists of a zero filling layer, a depth convolution layer, two batch normalization layers, two activation layers and a convolution layer;
the eighth layer is a combined convolution module 1-D, which consists of five combined convolution modules 1-B;
the ninth layer is a combined convolution module 1-C, which consists of a zero filling layer, a depth convolution layer, two batch normalization layers, two activation layers and a convolution layer;
the tenth layer is a combined convolution module 1-B, which consists of a depth convolution layer, two batch normalization layers, two activation layers and a convolution layer;
b. constructing prediction networks that predict targets of different sizes from the outputs of different layers of the feature extraction network, comprising a large-size target prediction network, a medium-size target prediction network, and a small-size target prediction network;
b1, the large-size target prediction network takes the tenth-layer output of the feature extraction network as input and consists of several combined convolution modules and convolution layers, with the following structure:
The first layer is a combined convolution module 1-D, which consists of five combined convolution modules 1-B;
the second layer is a combined convolution module 1-B, which consists of a depth convolution layer, two batch normalization layers, two activation layers and a convolution layer;
the third layer is a convolution layer;
b2, the medium-size target prediction network takes as input the eighth-layer output of the feature extraction network and the first-layer output of the large-size target prediction network; it consists of several combined convolution modules and convolution layers, with the following structure:
the first layer is an input fusion module 1-E, which consists of a combined convolution module 1-B, an up-sampling layer and a tensor splicing layer;
the second layer is a combined convolution module 1-D, which consists of five combined convolution modules 1-B;
the third layer is a combined convolution module 1-B, which consists of a depth convolution layer, two batch normalization layers, two activation layers and a convolution layer;
the fourth layer is a convolution layer;
b3, the small-size target prediction network takes as input the sixth-layer output of the feature extraction network and the second-layer output of the medium-size target prediction network; it consists of several combined convolution modules and convolution layers, with the following structure:
the first layer is an input fusion module 1-E, which consists of a combined convolution module 1-B, an up-sampling layer and a tensor splicing layer;
The second layer is a combined convolution module 1-D, which consists of five combined convolution modules 1-B;
the third layer is a combined convolution module 1-B, which consists of a depth convolution layer, two batch normalization layers, two activation layers and a convolution layer;
the fourth layer is a convolution layer;
finally, the output of the large-size target prediction network, the medium-size target prediction network and the small-size target prediction network is processed through a non-maximum suppression layer to obtain the predicted target position and category;
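The final non-maximum suppression step that merges the three prediction networks' outputs can be illustrated with a plain-Python sketch. This is a generic greedy NMS over axis-aligned boxes; the box format `(x1, y1, x2, y2)` and the IoU threshold are assumptions for illustration, not values taken from the patent.

```python
def iou(a, b):
    # Intersection over union of two boxes given as (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thresh=0.5):
    # Greedy NMS: keep the highest-scoring box, drop overlapping ones, repeat.
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep
```

Applied to the concatenated detections of all three prediction heads, `nms` returns the indices of the surviving predicted targets.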
c. the loss function is set to have a center coordinate loss function, a wide-high loss function, a confidence loss function and a category loss function;
the center-coordinate loss function formula is as follows:
Loss_xy = mark_object * (2 - w*h) * Loss_log(xy_true, xy_predict)
where Loss_xy denotes the center-coordinate loss; mark_object is a flag bit indicating whether an object exists in the anchor box; w and h are the width and height of the anchor box; Loss_log denotes binary cross-entropy loss; xy_true is the true center coordinate and xy_predict the predicted center coordinate;
the width-height loss function formula is as follows:
Loss_wh = 0.5 * mark_object * (2 - w*h) * (wh_true - wh_predict)^2
where Loss_wh denotes the width-height loss, wh_true the true width-height value, and wh_predict the predicted width-height value;
the confidence loss function formula is as follows:
Loss_confidence = mark_object * Loss_log(mark_object, c_predict) + (1 - mark_object) * Loss_log(mark_object, c_predict) * mark_ignore
where Loss_confidence denotes the confidence loss, c_predict the confidence value of the prediction box, and mark_ignore a flag bit for anchor boxes whose IoU is below a threshold;
the class loss function formula is as follows:
Loss_cls = mark_object * Loss_log(cls_true, cls_predict)
where Loss_cls denotes the class loss, cls_true the true class, and cls_predict the predicted class;
the total loss function formula is as follows:
Loss = (Loss_xy + Loss_wh + Loss_confidence + Loss_cls) / numf
where Loss denotes the total loss and numf is the total input count as a floating-point number;
load training parameters and train the improved YOLO network, with the training parameters set as follows: the optimizer is Adam, the initial learning rate 0.001, the maximum number of training epochs 500, and the batch size 8; the training accuracy is checked on the validation set at intervals, and the best-performing network is saved once training completes; training is considered complete when the maximum number of epochs is reached or the mean IoU meets the requirement;
inputting the test set into an optimal improved YOLO network to obtain the dial position and the dial image.
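As a numerical illustration of the center-coordinate and width-height loss terms defined above, the following sketch evaluates them for a single anchor box. `bce` stands in for the binary cross-entropy Loss_log; the (2 - w*h) factor assumes w and h are normalized to [0, 1], so small boxes are weighted more heavily. This is an illustrative sketch, not the patent's training code.

```python
import math

def bce(y_true, y_pred, eps=1e-7):
    # Binary cross entropy, the Loss_log of the formulas above.
    p = min(max(y_pred, eps), 1 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

def yolo_box_losses(mark_object, w, h, xy_true, xy_pred, wh_true, wh_pred):
    # Loss_xy = mark_object * (2 - w*h) * Loss_log(xy_true, xy_predict)
    # Loss_wh = 0.5 * mark_object * (2 - w*h) * (wh_true - wh_predict)^2
    scale = mark_object * (2 - w * h)
    loss_xy = scale * sum(bce(t, p) for t, p in zip(xy_true, xy_pred))
    loss_wh = 0.5 * scale * sum((t - p) ** 2 for t, p in zip(wh_true, wh_pred))
    return loss_xy, loss_wh
```

For a perfect width-height prediction, Loss_wh is zero while Loss_xy still reflects the entropy of the coordinate targets.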
Further, in step 2), the specific case of the improved EAST network is as follows:
a. a feature extraction network is constructed, with the following structure:
the first layer is a combined convolution module 2-B, which consists of two combined convolution modules 2-A and a maximum pooling layer, wherein the combined convolution module 2-A consists of a zero filling layer, a convolution layer and an activation layer;
the second layer is a combined convolution module 2-B, which consists of two combined convolution modules 2-A and a maximum pooling layer;
The third layer is a combined convolution module 2-C, which consists of three combined convolution modules 2-A and a maximum pooling layer;
the fourth layer is a combined convolution module 2-C, which consists of three combined convolution modules 2-A and a maximum pooling layer;
the fifth layer is a combined convolution module 2-C, which consists of three combined convolution modules 2-A and a maximum pooling layer;
b. the feature fusion network is constructed, and the structure is as follows:
the first layer is an input fusion module 2-G, which consists of an up-sampling layer and a tensor splicing layer;
the second layer is a combined convolution module 2-E, which consists of two batch normalization layers, a combined convolution module 2-D and a combined convolution module 2-A; wherein the combined convolution module 2-D consists of a zero padding layer, a convolution layer and an activation layer;
the third layer is an input fusion module 2-G, which consists of an up-sampling layer and a tensor splicing layer;
the fourth layer is a combined convolution module 2-E, which consists of two batch normalization layers, a combined convolution module 2-D and a combined convolution module 2-A;
the fifth layer is an input fusion module 2-G, which consists of an up-sampling layer and a tensor splicing layer;
the sixth layer is a combined convolution module 2-F, which consists of three batch normalization layers, one combined convolution module 2-D and two combined convolution modules 2-A;
c. The prediction network is constructed, and the structure is as follows:
the first layer is divided into three branches, and the first branch consists of a combined convolution module 2-D; the second branch consists of a combined convolution module 2-D; the third branch consists of a combined convolution module 2-D;
the second layer is an input fusion module which is formed by splicing three branches of the first layer;
d. the set loss function comprises a category loss function, a geometric shape loss function and an angle loss function;
the class loss function formula is as follows:
L_S = -β * Y* * log(Ŷ) - (1 - β) * (1 - Y*) * log(1 - Ŷ)
where L_S denotes the class loss, β the balancing weight, Ŷ the predicted class, and Y* the true class;
the geometry loss function formula is as follows:
L_AABB = -log IoU(R̂, R*)
where L_AABB denotes the geometry loss, R̂ the geometry of the predicted quadrilateral text box AABB, R* the geometry of the true quadrilateral text box AABB, and IoU the intersection-over-union;
the angle loss function formula is as follows:
L_θ(θ̂, θ*) = 1 - cos(θ̂ - θ*)
where L_θ denotes the angle loss, θ̂ the predicted rotation angle, and θ* the true rotation angle;
load training parameters and train the improved EAST network, with the training parameters set as follows: the optimizer is Adam, the initial learning rate 0.001, the maximum number of training epochs 500, and the batch size 8; the training accuracy is checked on the validation set at intervals, and the best-performing network is saved once training completes; training is considered complete when the maximum number of epochs is reached or the mean IoU meets the requirement;
Inputting the test set into the optimal improved EAST network to obtain text positions, and cutting the text positions into character images.
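The geometry and angle loss terms above can be evaluated directly for toy boxes. This is an illustrative sketch under simplifying assumptions: boxes are axis-aligned rectangles `(x1, y1, x2, y2)` rather than the network's per-pixel AABB distance maps, and angles are in radians.

```python
import math

def east_geometry_loss(pred, true):
    # L_AABB = -log IoU(pred, true) for axis-aligned boxes.
    ix1, iy1 = max(pred[0], true[0]), max(pred[1], true[1])
    ix2, iy2 = min(pred[2], true[2]), min(pred[3], true[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    iou = inter / (area(pred) + area(true) - inter)
    return -math.log(iou)

def east_angle_loss(theta_pred, theta_true):
    # L_theta = 1 - cos(theta_pred - theta_true): zero when angles match,
    # growing smoothly as the predicted rotation drifts.
    return 1.0 - math.cos(theta_pred - theta_true)
```

Both terms vanish for a perfect prediction and increase monotonically with the geometric and angular error, which is what makes them usable as training signals.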
Further, in step 3), the specific cases of the CRNN network are as follows:
a. a feature extraction network is constructed, with the following structure:
the first layer is a combined convolution module 3-A, which consists of a zero filling layer, a convolution layer and an activation layer;
the second layer is a maximum pooling layer;
the third layer is a combined convolution module 3-A, which consists of a zero filling layer, a convolution layer and an activation layer;
the fourth layer is the largest pooling layer;
the fifth layer is a combined convolution module 3-B, which consists of a zero filling layer, a convolution layer, a batch normalization layer and an activation layer;
the sixth layer is a combined convolution module 3-A, which consists of a zero filling layer, a convolution layer and an activation layer;
the seventh layer is the largest pooling layer;
the eighth layer is a combined convolution module 3-B, which consists of a zero filling layer, a convolution layer, a batch normalization layer and an activation layer;
the ninth layer is a combined convolution module 3-A, which consists of a zero filling layer, a convolution layer and an activation layer;
the tenth layer is the largest pooling layer;
the eleventh layer is a combined convolution module 3-C, which consists of a zero filling layer, a convolution layer, a batch normalization layer and an activation layer;
b. The prediction network is constructed, and the structure is as follows:
the first layer is a recurrent module, consisting of a bidirectional LSTM;
the second layer is a full-connection layer;
the third layer is a recurrent module, consisting of a bidirectional LSTM;
the fourth layer is a full-connection layer;
c. setting a decoder to convert the output sequence into character information;
d. setting a loss function as a CTC (Connectionist Temporal Classification) loss function;
the CTC loss function formula is as follows:
L_CTC = -ln ∏_{(x,z)∈S} p(z|x) = -∑_{(x,z)∈S} ln p(z|x)
where L_CTC denotes the CTC loss, p(z|x) the probability of output sequence z given input x, and S the training set;
load training parameters and train the CRNN network, with the training parameters set as follows: the optimizer is Adam, the initial learning rate 0.0001, the maximum number of training epochs 100, and the batch size 32; the training accuracy is checked on the validation set at intervals, and the best-performing network is saved once training completes; training is considered complete when the maximum number of epochs is reached or the recognition accuracy meets the requirement;
and inputting the test set into the optimal CRNN network to obtain character information.
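Step c) above converts the CRNN output sequence into character information with a decoder. A common minimal choice, shown here as an assumed illustration rather than the patent's exact decoder, is greedy CTC decoding: take the arg-max label per frame, collapse consecutive repeats, and drop blanks.

```python
def ctc_greedy_decode(frame_probs, alphabet, blank=0):
    # frame_probs: one probability row per time frame over [blank] + alphabet.
    best = [max(range(len(row)), key=row.__getitem__) for row in frame_probs]
    chars, prev = [], blank
    for idx in best:
        if idx != blank and idx != prev:  # collapse repeats, drop blanks
            chars.append(alphabet[idx - 1])
        prev = idx
    return "".join(chars)
```

The blank label is what lets CTC emit the same character twice in a row: a blank frame between two identical labels breaks the repeat-collapse.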
Further, in step 4), the concrete case of the TextCNN network is as follows:
a. the network structure is constructed as follows:
the first layer is an embedded layer;
The second layer is a convolution module;
the third layer is a maximum pooling layer;
the fourth layer consists of a full connection layer, a Dropout layer and an activation layer;
the fifth layer consists of a full connection layer and an activation layer;
b. the loss function is set as multi-class cross entropy, with the following formula:
L_CrossEntropy = -∑_{i=1}^{n} y_i * log(ŷ_i)
where L_CrossEntropy denotes the loss, n the number of categories, y_i the true probability of category i, and ŷ_i the predicted probability of category i;
load training parameters and train the TextCNN network, with the training parameters set as follows: the optimizer is Adam, the initial learning rate 0.001, the number of iterations 1000, and the batch size 64; the training accuracy is checked on the validation set at intervals, and the best-performing network is saved once training completes; training is considered complete when the maximum number of iterations is reached and the accuracy meets the requirement;
and inputting the test set into an optimal textCNN network to obtain a corresponding instrument type.
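The multi-class cross-entropy loss used for the TextCNN network reduces, for a one-hot true label, to the negative log of the predicted probability of the true category. A minimal sketch of the formula above:

```python
import math

def cross_entropy(y_true, y_pred, eps=1e-12):
    # y_true: true probability distribution (often one-hot over n categories),
    # y_pred: predicted probabilities; eps guards against log(0).
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))
```

If the true category is predicted with probability 0.8, the loss is -ln 0.8 ≈ 0.223, and it grows without bound as that probability approaches zero.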
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. the invention uses the neural network to realize instrument positioning and character detection and recognition on the instrument, and has higher precision and better generalization capability under different backgrounds compared with the traditional image processing method.
2. The improved YOLO network of the invention can predict targets of different sizes and select the best-fitting prediction boxes overall, so instruments of different sizes can be accurately detected without limitations on acquisition angle or distance.
3. Compared with other target detection networks, the improved EAST network has better detection performance and faster detection speed on characters, and has better detection effect on characters shot at different angles.
4. By splicing the character information on the instrument into text, the invention solves the problem in the machine vision field that the position of an instrument can be identified but its type is difficult to distinguish: it both detects the instrument and identifies its type.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Fig. 2 is a schematic block diagram of an improved YOLO network.
Fig. 3 is a schematic block diagram of an improved EAST network.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but embodiments of the present invention are not limited thereto.
As shown in fig. 1, the instrument detection and classification method based on image text of this embodiment comprises the following steps:
1) Collect instrument images shot in different actual scenes, remove interference data whose recognition is affected by blur, extreme angles, or missing instruments, and mark the dial positions in the remaining data with the open-source annotation tool labelImg to construct an instrument positioning data set, which is divided into a training set and a test set. Load training parameters and train the improved YOLO network with the training set; after training, obtain the optimal improved YOLO network, input the test set into it, output the dial images, and crop them out. The improved YOLO network replaces the backbone with a MobileNet lightweight network to reduce the parameter count and computation and improve the running speed.
The improved YOLO network is designed according to the specific application scenario and the characteristics of the recognition target. Unless otherwise stated, every activation layer in this step uses the Leaky ReLU activation function. The steps are as follows:
a. constructing a feature extraction network
And constructing a feature extraction network according to the real-time and high-precision requirements. The feature extraction network is mainly composed of a plurality of combined convolution modules.
The feature extraction network has the following structure:
the input image is 416×416×3.
The first layer is the combined convolution module 1-a, as shown in fig. 2 (a). The module first passes through zero padding layer and outputs 418 x 3. Then the convolution layer, the batch normalization layer and the activation layer are adopted, the convolution kernel is (3, 3), the step length is 2, the number of filters is 32, and the output is 208 multiplied by 32.
The second layer is the combined convolution module 1-B, as shown in fig. 2 (B). The module first goes through the deep convolution, batch normalization layer and activation layer, the convolution kernel is (3, 3), the step size is 1, the filling is used to make the input and output size consistent, and the output is 208×208×32. And then the convolution, the batch normalization layer and the activation layer are carried out, the convolution kernel is (1, 1), the step length is 1, the number of filters is 64, the filling is used to ensure that the sizes of the input and the output are consistent, and the output is 208 multiplied by 64.
The third layer is the combined convolution module 1-C, as shown in fig. 2 (C). The module first passes through the zero padding layer and outputs 210 x 64. And then the depth convolution, the batch normalization layer and the activation layer are carried out, the convolution kernel is (3, 3), the step size is 2, and the output is 104 multiplied by 64. Finally, through convolution, a batch normalization layer and an activation layer, the convolution kernel is (1, 1), the step length is 1, the number of filters is 128, the input and output sizes are consistent by using filling, and the output is 104 multiplied by 128.
The fourth layer is the combined convolution module 1-B, as shown in fig. 2 (B). The module first goes through the deep convolution, batch normalization layer and activation layer, the convolution kernel is (3, 3), the step size is 1, the filling is used to make the input and output size consistent, and the output is 104×104×128. And then the convolution, the batch normalization layer and the activation layer are carried out, the convolution kernel is (1, 1), the step length is 1, the number of filters is 128, the filling is used to ensure that the sizes of the input and the output are consistent, and the output is 104 multiplied by 128.
The fifth layer is the combined convolution module 1-C, as shown in fig. 2 (C). The module first outputs 106×106×128 through zero padding layers. And then the depth convolution, the batch normalization layer and the activation layer are carried out, the convolution kernel is (3, 3), the step size is 2, and the output is 52 multiplied by 128. Finally, through convolution, batch normalization layer and activation layer, the convolution kernel is (1, 1), step length is 1, the number of filters is 256, the input and output are consistent by using filling, and the output is 52 multiplied by 256.
The sixth layer is the combined convolution module 1-B, as shown in fig. 2 (B). The module first goes through the deep convolution, batch normalization layer and activation layer, the convolution kernel is (3, 3), the step size is 1, the filling is used to make the input and output size consistent, and the output is 52×52×256. And then the convolution, the batch normalization layer and the activation layer are carried out, the convolution kernel is (1, 1), the step length is 1, the number of filters is 256, the filling is used to ensure that the sizes of the input and the output are consistent, and the output is 52 multiplied by 256.
The seventh layer is combined convolution module 1-C, as shown in fig. 2 (c). The input first passes through a zero-padding layer, outputting 54×54×256. It then passes through a depthwise convolution, batch normalization layer and activation layer with kernel (3, 3) and stride 2, outputting 26×26×256. Finally it passes through a convolution, batch normalization layer and activation layer with kernel (1, 1), stride 1, 512 filters and size-preserving padding, outputting 26×26×512.
The eighth layer is combined convolution module 1-D, as shown in fig. 2 (d), which applies five combined convolution modules 1-B in sequence, each as shown in fig. 2 (b). In each combined convolution module 1-B, the input first passes through a depthwise convolution, batch normalization layer and activation layer with kernel (3, 3), stride 1 and size-preserving padding, outputting 26×26×512, then through a convolution, batch normalization layer and activation layer with kernel (1, 1), stride 1, 512 filters and size-preserving padding, outputting 26×26×512. After the five identical combined convolution modules 1-B, the output is 26×26×512.
The ninth layer is combined convolution module 1-C, as shown in fig. 2 (c). The input first passes through a zero-padding layer, outputting 28×28×512. It then passes through a depthwise convolution, batch normalization layer and activation layer with kernel (3, 3) and stride 2, outputting 13×13×512. Finally it passes through a convolution, batch normalization layer and activation layer with kernel (1, 1), stride 1, 1024 filters and size-preserving padding, outputting 13×13×1024.
The tenth layer is combined convolution module 1-B, as shown in fig. 2 (b). The input first passes through a depthwise convolution, batch normalization layer and activation layer with kernel (3, 3), stride 1 and size-preserving padding, outputting 13×13×1024. It then passes through a convolution, batch normalization layer and activation layer with kernel (1, 1), stride 1, 1024 filters and size-preserving padding, outputting 13×13×1024.
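As a sanity check on the sizes quoted above, the strided stages of the feature extraction network can be reproduced with standard convolution arithmetic. This is an illustrative sketch, not code from the patent; the helper names are ours.

```python
# Sketch: reproduce the spatial sizes of the strided stages of the
# feature extraction network, assuming standard convolution arithmetic.

def conv_out(size, kernel, stride, pad=0):
    """Output spatial size of a convolution/pooling: floor((n + 2p - k)/s) + 1."""
    return (size + 2 * pad - kernel) // stride + 1

# Combined module 1-C: zero-pad by 1 pixel per side, then a 3x3 depthwise
# convolution with stride 2 (the following 1x1 convolution keeps the size).
def module_1c(size):
    padded = size + 2          # the zero-padding layer adds 1 pixel per side
    return conv_out(padded, 3, 2)

# Trace the strided stages quoted in the text: 208 -> 104 -> 52 -> 26 -> 13.
sizes = [208]
for _ in range(4):
    sizes.append(module_1c(sizes[-1]))
print(sizes)  # [208, 104, 52, 26, 13]
```

This confirms that each combined convolution module 1-C halves the spatial size, e.g. the 210×210 padded map becomes 104×104 after the stride-2 depthwise convolution.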
b. Constructing a predictive network
Prediction networks are constructed from the outputs of different layers of the feature extraction network to predict targets of different sizes; they comprise a large-size target prediction network, a medium-size target prediction network and a small-size target prediction network.
b1, large-size target prediction network
The input is the tenth-layer output of the feature extraction network; the large-size target prediction network consists mainly of combined convolution modules and convolution layers.
The input image is 13×13×1024.
The large-size target prediction network has the following structure:
the first layer is combined convolution module 1-D, as shown in fig. 2 (d), which applies five combined convolution modules 1-B in sequence, as shown in fig. 2 (b), alternating between two variants. In the first variant, the input passes through a depthwise convolution, batch normalization layer and activation layer with kernel (1, 1), stride 1 and size-preserving padding, outputting 13×13×1024, then through a convolution, batch normalization layer and activation layer with kernel (1, 1), stride 1, 512 filters and size-preserving padding, outputting 13×13×512. In the second variant, the input passes through a depthwise convolution, batch normalization layer and activation layer with kernel (3, 3), stride 1 and size-preserving padding, outputting 13×13×512, then through a convolution, batch normalization layer and activation layer with kernel (1, 1), stride 1, 1024 filters and size-preserving padding, outputting 13×13×1024. After alternating through the five modules, the output is 13×13×512.
The second layer is combined convolution module 1-B, as shown in fig. 2 (b). The input first passes through a depthwise convolution, batch normalization layer and activation layer with kernel (3, 3), stride 1 and size-preserving padding, outputting 13×13×512. It then passes through a convolution, batch normalization layer and activation layer with kernel (1, 1), stride 1, 1024 filters and size-preserving padding, outputting 13×13×1024.
The third layer is a convolution layer with kernel (1, 1), stride 1 and 256 filters, outputting 13×13×256.
b2, medium-sized target prediction network
The inputs are the eighth-layer output of the feature extraction network and the first-layer output of the large-size target prediction network; the medium-size target prediction network consists mainly of combined convolution modules and convolution layers.
The input images are 26×26×512 and 13×13×512.
The medium-size target prediction network structure is as follows:
the first layer is the input fusion module, as shown in fig. 2 (e). The input 13×13×512 first passes through a combined convolution module 1-B: a depthwise convolution, batch normalization layer and activation layer with kernel (1, 1), stride 1 and size-preserving padding, outputting 13×13×512, followed by a convolution, batch normalization layer and activation layer with kernel (1, 1), stride 1, 512 filters and size-preserving padding, outputting 13×13×512. It then passes through an upsampling layer with sampling factor 2, outputting 26×26×512. Finally, this output and the input 26×26×512 pass through the tensor stitching layer, outputting 26×26×1024.
The second layer is combined convolution module 1-D, as shown in fig. 2 (d), which applies five combined convolution modules 1-B in sequence, as shown in fig. 2 (b), alternating between two variants. In the first variant, the input passes through a depthwise convolution, batch normalization layer and activation layer with kernel (1, 1), stride 1 and size-preserving padding, outputting 26×26×1024, then through a convolution, batch normalization layer and activation layer with kernel (1, 1), stride 1, 256 filters and size-preserving padding, outputting 26×26×256. In the second variant, the input passes through a depthwise convolution, batch normalization layer and activation layer with kernel (3, 3), stride 1 and size-preserving padding, outputting 26×26×256, then through a convolution, batch normalization layer and activation layer with kernel (1, 1), stride 1, 512 filters and size-preserving padding, outputting 26×26×512. After alternating through the five modules, the output is 26×26×256.
The third layer is combined convolution module 1-B, as shown in fig. 2 (b). The input first passes through a depthwise convolution, batch normalization layer and activation layer with kernel (3, 3), stride 1 and size-preserving padding, outputting 26×26×256. It then passes through a convolution, batch normalization layer and activation layer with kernel (1, 1), stride 1, 512 filters and size-preserving padding, outputting 26×26×512.
The fourth layer is a convolution layer with kernel (1, 1), stride 1 and 256 filters, outputting 26×26×256.
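The input fusion module used by the first layer above, a ×2 upsampling of the deeper feature map followed by channel concatenation with the shallower one, can be sketched in NumPy. The function name and the choice of nearest-neighbour upsampling are our assumptions, not taken from the patent.

```python
import numpy as np

# Hypothetical sketch of the input fusion module: nearest-neighbour
# upsampling by a factor of 2, then tensor stitching (channel concatenation).
def fuse(deep, shallow):
    up = deep.repeat(2, axis=0).repeat(2, axis=1)   # (H, W, C) -> (2H, 2W, C)
    return np.concatenate([up, shallow], axis=-1)   # stack along channels

deep = np.zeros((13, 13, 512), np.float32)      # from the large-size branch
shallow = np.zeros((26, 26, 512), np.float32)   # eighth-layer backbone output
print(fuse(deep, shallow).shape)  # (26, 26, 1024)
```

The resulting 26×26×1024 tensor matches the first-layer output quoted for the medium-size target prediction network.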
b3, small-sized target prediction network
The inputs are the sixth-layer output of the feature extraction network and the second-layer output of the medium-size target prediction network; the small-size target prediction network consists mainly of combined convolution modules and convolution layers.
The input images are 52×52×256 and 26×26×256.
The small-size target prediction network structure is as follows:
the first layer is the input fusion module, as shown in fig. 2 (e). The input 26×26×256 first passes through a combined convolution module 1-B: a depthwise convolution, batch normalization layer and activation layer with kernel (1, 1), stride 1 and size-preserving padding, outputting 26×26×256, followed by a convolution, batch normalization layer and activation layer with kernel (1, 1), stride 1, 256 filters and size-preserving padding, outputting 26×26×256. It then passes through an upsampling layer with sampling factor 2, outputting 52×52×256. Finally, this output and the input 52×52×256 pass through the tensor stitching layer, outputting 52×52×512.
The second layer is combined convolution module 1-D, as shown in fig. 2 (d), which applies five combined convolution modules 1-B in sequence, as shown in fig. 2 (b), alternating between two variants. In the first variant, the input passes through a depthwise convolution, batch normalization layer and activation layer with kernel (1, 1), stride 1 and size-preserving padding, outputting 52×52×512, then through a convolution, batch normalization layer and activation layer with kernel (1, 1), stride 1, 128 filters and size-preserving padding, outputting 52×52×128. In the second variant, the input passes through a depthwise convolution, batch normalization layer and activation layer with kernel (3, 3), stride 1 and size-preserving padding, outputting 52×52×128, then through a convolution, batch normalization layer and activation layer with kernel (1, 1), stride 1, 256 filters and size-preserving padding, outputting 52×52×256. After alternating through the five modules, the output is 52×52×128.
The third layer is combined convolution module 1-B, as shown in fig. 2 (b). The input first passes through a depthwise convolution, batch normalization layer and activation layer with kernel (3, 3), stride 1 and size-preserving padding, outputting 52×52×128. It then passes through a convolution, batch normalization layer and activation layer with kernel (1, 1), stride 1, 256 filters and size-preserving padding, outputting 52×52×256.
The fourth layer is a convolution layer with kernel (1, 1), stride 1 and 256 filters, outputting 52×52×256.
Finally, the outputs of the large-size (13×13×256), medium-size (26×26×256) and small-size (52×52×256) target prediction networks are processed by a non-maximum suppression layer to obtain the predicted target positions and categories.
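The non-maximum suppression step can be sketched with the standard greedy algorithm. This is a NumPy sketch under our own naming, not code from the patent.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression.
    boxes: list of [x1, y1, x2, y2]; scores: per-box confidence.
    Returns indices of the kept boxes, highest score first."""
    boxes = np.asarray(boxes, float)
    order = np.argsort(scores)[::-1]        # sort by descending confidence
    keep = []
    while order.size:
        i = order[0]
        keep.append(int(i))
        if order.size == 1:
            break
        rest = order[1:]
        # Intersection of the top box with every remaining box.
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thresh]     # suppress heavily overlapping boxes
    return keep

print(nms([[0, 0, 10, 10], [1, 1, 10, 10], [20, 20, 30, 30]],
          [0.9, 0.8, 0.7]))  # [0, 2]
```

The second box overlaps the first with IoU 0.81, above the 0.5 threshold, so only the first and third boxes survive.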
c. Setting a loss function
The loss function is set as the sum of the center coordinate loss, the width-height loss, the confidence loss and the category loss, averaged over the inputs. The loss function formula is as follows:

Loss = (Loss_xy + Loss_wh + Loss_confidence + Loss_cls) / numf

where Loss is the total loss, Loss_xy the center coordinate loss, Loss_wh the width-height loss, Loss_confidence the confidence loss, Loss_cls the category loss, and numf the total number of inputs as a floating-point number. The individual loss functions are as follows:
Loss_xy = mark_object * (2 - w*h) * Loss_log(xy_true, xy_predict)

Loss_wh = 0.5 * mark_object * (2 - w*h) * (wh_true - wh_predict)^2

Loss_confidence = mark_object * Loss_log(mark_object, c_predict) + (1 - mark_object) * Loss_log(mark_object, c_predict) * mark_ignore

Loss_cls = mark_object * Loss_log(cls_true, cls_predict)

where mark_object is a flag bit indicating whether an object exists in the anchor box, w and h are the width and height of the anchor box, Loss_log is the binary cross entropy loss, xy_true and xy_predict are the true and predicted center coordinate values, wh_true and wh_predict are the true and predicted width-height values, c_predict is the confidence value of the prediction box, mark_ignore is a flag bit marking anchor boxes whose IOU is below the threshold, and cls_true and cls_predict are the true and predicted categories.
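The four terms can be evaluated for a single toy anchor box in NumPy. This is a sketch with our own variable names and toy values; Loss_log is taken to be binary cross entropy as stated above.

```python
import numpy as np

# Binary cross entropy, summed over elements (stands in for Loss_log above).
def loss_log(y_true, y_pred, eps=1e-7):
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)).sum()

mark_object = 1.0            # an object exists in this anchor box
mark_ignore = 0.0            # anchor box is not in the ignored set
w, h = 0.3, 0.2              # normalized anchor box width and height
xy_true, xy_pred = np.array([0.5, 0.5]), np.array([0.4, 0.6])
wh_true, wh_pred = np.array([w, h]), np.array([0.25, 0.25])
c_pred = 0.8                 # predicted confidence
cls_true, cls_pred = np.array([1.0, 0.0]), np.array([0.7, 0.3])

loss_xy = mark_object * (2 - w * h) * loss_log(xy_true, xy_pred)
loss_wh = 0.5 * mark_object * (2 - w * h) * ((wh_true - wh_pred) ** 2).sum()
loss_conf = (mark_object * loss_log(np.array([mark_object]), np.array([c_pred]))
             + (1 - mark_object) * loss_log(np.array([mark_object]),
                                            np.array([c_pred])) * mark_ignore)
loss_cls = mark_object * loss_log(cls_true, cls_pred)

numf = 1.0                   # total number of inputs, as a float
total = (loss_xy + loss_wh + loss_conf + loss_cls) / numf
```

The (2 - w*h) factor up-weights small boxes, since their area w*h is close to zero.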
Training the improved YOLO network comprises the following steps:
d1, setting training parameters
The training optimizer is set to Adam, the initial learning rate to 0.001, the number of iterations to 500 and the batch size to 8; all labels are clustered with K-means to generate the initial prior boxes (38, 29), (65, 52), (94, 87), (142, 134), (195, 69), (216, 206), (337, 320), (397, 145) and (638, 569).
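A minimal sketch of K-means clustering of label box sizes with a 1 - IoU distance, a common way to generate such prior boxes. The initialization, iteration count and all names are our assumptions, not the patent's procedure.

```python
import numpy as np

def wh_iou(wh, centers):
    """IoU between box sizes, treating all boxes as sharing a corner."""
    inter = (np.minimum(wh[:, None, 0], centers[None, :, 0])
             * np.minimum(wh[:, None, 1], centers[None, :, 1]))
    union = ((wh[:, 0] * wh[:, 1])[:, None]
             + (centers[:, 0] * centers[:, 1])[None, :] - inter)
    return inter / union

def kmeans_anchors(wh, k, iters=20):
    wh = np.asarray(wh, float)
    centers = wh[:k].copy()                  # deterministic init: first k boxes
    for _ in range(iters):
        assign = (1 - wh_iou(wh, centers)).argmin(axis=1)  # distance = 1 - IoU
        for j in range(k):
            if (assign == j).any():
                centers[j] = np.median(wh[assign == j], axis=0)
    return centers[np.argsort(centers.prod(axis=1))]       # sort by area

boxes = [[28, 30], [32, 29], [30, 31], [290, 300], [310, 295], [300, 305]]
print(kmeans_anchors(boxes, 2))  # two anchors, near (30, 30) and (300, 300)
```

Sorting by area mirrors the ascending order of the prior boxes listed above.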
d2, on-line data enhancement
Data enhancement is performed on the input images to expand the data set. The enhancement methods are: random mirror flipping, random noise addition and random contrast adjustment.
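The three enhancement operations can be sketched in NumPy. The function name, noise level and contrast range below are our assumptions for illustration.

```python
import numpy as np

# Hypothetical online-augmentation sketch of the three operations named above.
def augment(img, rng):
    if rng.random() < 0.5:                      # random mirror flip
        img = img[:, ::-1]
    img = img.astype(np.float64) + rng.normal(0, 5, img.shape)  # random noise
    gain = rng.uniform(0.8, 1.2)                # random contrast adjustment
    img = (img - img.mean()) * gain + img.mean()
    return np.clip(img, 0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
sample = np.full((416, 416, 3), 128, np.uint8)
print(augment(sample, rng).shape)  # (416, 416, 3)
```

Because the operations are applied on the fly, each epoch sees a different variant of every training image.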
d3, setting training completion mark
The training accuracy is checked on the verification set at intervals, and the network is saved once training reaches the optimum; the training completion criteria are reaching the maximum of 500 iterations and meeting the accuracy requirements.
Inputting the test set into an optimal improved YOLO network to obtain the dial position and the dial image.
2) The character positions in the dial images cut out in step 1) are labeled to construct a character detection data set, which is divided into a training set and a test set. The training parameters are reset and the improved EAST network is trained with the training set; after training, the optimal improved EAST network is obtained. The test set is input into the optimal improved EAST network, which outputs the character positions in the data set, and the character regions are cut out as character images. The improved EAST network changes the backbone network to VGG to improve detection accuracy, and the output-layer prediction module is modified to predict vertices using only the head elements, improving prediction performance on long text.
According to the specific application scene and the characteristics of the objects to be recognized, an improved EAST network is designed. Unless otherwise stated, all activation layers below use the ReLU activation function. The method comprises the following steps:
a. constructing a feature extraction network
The feature extraction network has the following structure:
The input image is 256×256×3.
The first layer is combined convolution module 2-B which, as shown in fig. 3 (b), consists of two combined convolution modules 2-A and one max pooling layer. The first combined convolution module 2-A passes through a zero-padding layer, outputting 258×258×3, then through a convolution layer and activation layer with kernel (3, 3), stride 1 and 64 filters, outputting 256×256×64. The second combined convolution module 2-A passes through a zero-padding layer, outputting 258×258×64, then through a convolution layer and activation layer with kernel (3, 3), stride 1 and 64 filters, outputting 256×256×64. A max pooling layer with pooling kernel (2, 2) and stride 2 follows, outputting 128×128×64.
The second layer is combined convolution module 2-B, consisting of two combined convolution modules 2-A and one max pooling layer. Each combined convolution module 2-A passes through a zero-padding layer, then a convolution layer and activation layer with kernel (3, 3), stride 1 and 128 filters. A max pooling layer with pooling kernel (2, 2) and stride 2 follows, outputting 64×64×128.
The third layer is combined convolution module 2-C which, as shown in fig. 3 (c), consists of three combined convolution modules 2-A and one max pooling layer. Each combined convolution module 2-A passes through a zero-padding layer, then a convolution layer and activation layer with kernel (3, 3), stride 1 and 256 filters. A max pooling layer with pooling kernel (2, 2) and stride 2 follows, outputting 32×32×256.
The fourth layer is combined convolution module 2-C, identical in structure but with 512 filters in each combined convolution module 2-A; its max pooling layer outputs 16×16×512.
The fifth layer is combined convolution module 2-C, again with 512 filters in each combined convolution module 2-A; its max pooling layer outputs 8×8×512.
b. Constructing feature fusion networks
The first layer is input fusion module 2-G, as shown in fig. 3 (g). The last-layer output of the feature extraction network, 8×8×512, first passes through an upsampling layer with sampling factor 2, outputting 16×16×512. This output and the fourth-layer output of the feature extraction network, 16×16×512, pass through the tensor stitching layer, outputting 16×16×1024.
The second layer is combined convolution module 2-E which, as shown in fig. 3 (e), consists of two batch normalization layers, one combined convolution module 2-D and one combined convolution module 2-A. The combined convolution module 2-D consists of a zero-padding layer, a convolution layer and an activation layer. The input first passes through a batch normalization layer, then through the combined convolution module 2-D with kernel (1, 1), stride 1 and 128 filters, outputting 16×16×128. It then passes through another batch normalization layer and the combined convolution module 2-A with kernel (3, 3), stride 1 and 64 filters, outputting 16×16×64.
The third layer is input fusion module 2-G. The second-layer output of the feature fusion network, 16×16×64, first passes through an upsampling layer with sampling factor 2, outputting 32×32×64. This output and the third-layer output of the feature extraction network, 32×32×256, pass through the tensor stitching layer, outputting 32×32×320.
The fourth layer is combined convolution module 2-E. The input first passes through a batch normalization layer, then through the combined convolution module 2-D with kernel (1, 1), stride 1 and 128 filters, outputting 32×32×128. It then passes through another batch normalization layer and the combined convolution module 2-A with kernel (3, 3), stride 1 and 64 filters, outputting 32×32×64.
The fifth layer is input fusion module 2-G. The fourth-layer output of the feature fusion network, 32×32×64, first passes through an upsampling layer with sampling factor 2, outputting 64×64×64. This output and the second-layer output of the feature extraction network, 64×64×128, pass through the tensor stitching layer, outputting 64×64×192.
The sixth layer is combined convolution module 2-F which, as shown in fig. 3 (f), consists of three batch normalization layers, one combined convolution module 2-D and two combined convolution modules 2-A. The input first passes through a batch normalization layer, then through the combined convolution module 2-D with kernel (1, 1), stride 1 and 32 filters, outputting 64×64×32. It then passes through a batch normalization layer and a combined convolution module 2-A with kernel (3, 3), stride 1 and 32 filters, outputting 64×64×32, and finally through another batch normalization layer and a second combined convolution module 2-A with the same parameters, outputting 64×64×32.
c. Constructing a predictive network
The first layer has three branches, each consisting of a combined convolution module 2-D that passes through a zero-padding layer, then a convolution layer and an activation layer with kernel (1, 1) and stride 1. The first branch has 1 filter and outputs 64×64×1, the second branch has 2 filters and outputs 64×64×2, and the third branch has 4 filters and outputs 64×64×4.
The second layer is an input fusion module that splices the three branches of the first layer of the prediction network; the output is 64×64×7.
d. Setting a loss function
The loss function is set to be the sum of category loss, geometry loss, and angle loss.
The class loss function formula is as follows:

L_S = -β · Y* · ln(Ŷ) - (1 - β) · (1 - Y*) · ln(1 - Ŷ)

where L_S is the class loss, β is the balancing weight, Ŷ is the predicted category and Y* is the true category.
The geometry loss function formula is as follows:

L_AABB = -ln IoU(R̂, R*)

where L_AABB is the geometry loss function, R̂ is the predicted AABB geometry, R* is the real AABB geometry and IoU is the intersection-over-union.
The angle loss function formula is as follows:

L_θ(θ̂, θ*) = 1 - cos(θ̂ - θ*)

where L_θ(θ̂, θ*) is the angle loss function, θ̂ is the predicted rotation angle and θ* is the true rotation angle.
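Assuming the standard published EAST formulations behind the symbols above (balanced cross entropy for the class loss, -ln IoU for the AABB geometry and 1 - cos for the angle), the three terms can be sketched in NumPy; names and test values are ours.

```python
import numpy as np

def class_loss(y_true, y_pred, beta, eps=1e-7):
    """Balanced binary cross entropy over the score map."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return np.mean(-beta * y_true * np.log(y_pred)
                   - (1 - beta) * (1 - y_true) * np.log(1 - y_pred))

def aabb_loss(d_true, d_pred):
    """-ln IoU for axis-aligned boxes given as (top, right, bottom, left)
    distances from a pixel to the four box edges."""
    area_t = (d_true[0] + d_true[2]) * (d_true[1] + d_true[3])
    area_p = (d_pred[0] + d_pred[2]) * (d_pred[1] + d_pred[3])
    inter = ((min(d_true[0], d_pred[0]) + min(d_true[2], d_pred[2]))
             * (min(d_true[1], d_pred[1]) + min(d_true[3], d_pred[3])))
    iou = inter / (area_t + area_p - inter)
    return -np.log(iou)

def angle_loss(theta_pred, theta_true):
    return 1 - np.cos(theta_pred - theta_true)
```

All three terms are zero (or near zero) for a perfect prediction and grow as the prediction diverges.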
Training the improved EAST network comprises the following steps:
e1, setting training parameters
The training optimizer, initial learning rate, number of iterations, batch size and initial prior boxes are set.
e2, on-line data enhancement
Data enhancement is performed on the input images to expand the data set. The main enhancement methods are random noise addition and random contrast adjustment.
e3, setting training completion mark
The training accuracy is checked on the verification set at intervals, and the network is saved once training reaches the optimum; the training completion criteria are reaching the maximum number of iterations and meeting the accuracy requirements.
Inputting the test set into the optimal improved EAST network to obtain the character image.
3) The character information in the character images cut out in step 2) is labeled to construct a character recognition data set, which is divided into a training set and a test set. The training parameters are reset and the CRNN network is trained with the training set; after training, the optimal CRNN network is obtained. The test set is input into the optimal CRNN network, which outputs the character information.
The CRNN network is constructed, which comprises the following steps:
a. constructing a feature extraction network
The input image is w×32×1, where w is the width of the input image, and is adaptively changed according to the input picture size.
The first layer is combined convolution module 3-A, which passes through a zero-padding layer, then a convolution layer and an activation layer, with kernel (3, 3), stride 1 and 64 filters, outputting w×32×64.
The second layer is a max pooling layer with pooling kernel (2, 2) and stride 2, outputting (w/2)×16×64.
The third layer is combined convolution module 3-A, which passes through a zero-padding layer, then a convolution layer and an activation layer, with kernel (3, 3), stride 1 and 128 filters, outputting (w/2)×16×128.
The fourth layer is a max pooling layer with pooling kernel (2, 2) and stride 2, outputting (w/4)×8×128.
The fifth layer is combined convolution module 3-B, which first passes through a zero-padding layer, then a convolution layer, a batch normalization layer and an activation layer, with kernel (3, 3), stride 1 and 256 filters, outputting (w/4)×8×256.
The sixth layer is combined convolution module 3-A, which passes through a zero-padding layer, then a convolution layer and an activation layer, with kernel (3, 3), stride 1 and 256 filters, outputting (w/4)×8×256.
The seventh layer is a max pooling layer with pooling kernel (2, 2) and stride 2, outputting (w/8)×4×256.
The eighth layer is combined convolution module 3-B, which first passes through a zero-padding layer, then a convolution layer, a batch normalization layer and an activation layer, with kernel (3, 3), stride 1 and 512 filters, outputting (w/8)×4×512.
The ninth layer is combined convolution module 3-A, which passes through a zero-padding layer, then a convolution layer and an activation layer, with kernel (3, 3), stride 1 and 512 filters, outputting (w/8)×4×512.
The tenth layer is a max pooling layer with pooling kernel (2, 2) and stride 2, outputting (w/16)×2×512.
The eleventh layer is combined convolution module 3-C, which passes through a convolution layer, a batch normalization layer and an activation layer, with kernel (2, 2), stride 1 and 512 filters; without padding, this reduces the output to ((w/16)-1)×1×512.
b. Constructing a predictive network
The first layer is a recurrent module consisting of a bidirectional LSTM, which processes the feature map column by column as a sequence.
The second layer is likewise a recurrent module consisting of a bidirectional LSTM.
c. Setting up a decoder
The output of the prediction network is converted into a sequence in which each element ranges from 0 to 6735 and corresponds to an individual character in the character library (0 corresponds to the blank character); this amounts to dividing a line of text into blocks, each block being predicted as one character. The sequence is processed from left to right: when an element is not 0 and differs from the previous element, the character corresponding to its value in the character library is output.
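A minimal greedy CTC decoder in the spirit of the rule above: scan left to right, emit a character only when the element is non-zero and differs from the previous element, then map the surviving indices through the character library. The 3-character library here is a toy stand-in for the 6735-character one in the text:

```python
BLANK = 0  # index 0 is the blank character, as stated above

def ctc_greedy_decode(seq, charset):
    """Collapse repeats, drop blanks, map indices to characters."""
    out = []
    prev = None
    for idx in seq:
        if idx != BLANK and idx != prev:
            out.append(charset[idx])
        prev = idx
    return "".join(out)

charset = {1: "A", 2: "B", 3: "C"}  # toy character library
print(ctc_greedy_decode([1, 1, 0, 2, 2, 0, 2, 3], charset))  # ABBC
```

Note how the blank between the two runs of 2 keeps the repeated "B" from being collapsed into one character.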
d. Setting a loss function
The loss function is set to the CTC (Connectionist Temporal Classification) loss function.
The CTC loss function formula is as follows:
L_CTC = -ln ∏_{(x,z)∈S} p(z|x) = -∑_{(x,z)∈S} ln p(z|x)

wherein L_CTC represents the CTC loss function, p(z|x) represents the probability of outputting sequence z given input x, and S is the training set.
Training a CRNN network comprising the steps of:
e1, setting training parameters
Setting a training optimizer, an initial learning rate, iteration times and batch sizes.
e2, setting training completion mark
Setting the interval at which training accuracy is checked on the verification set, and saving the optimal network after training; the training completion mark is reaching the maximum number of iterations or meeting the accuracy requirement.
And inputting the test set into the optimal CRNN network to obtain character information.
4) Splicing the character information output in step 3) into a text, marking the instrument type corresponding to the text to construct a text classification data set, dividing the text classification data set into a training set and a test set, loading training parameters to train the TextCNN network by using the training set, obtaining an optimal TextCNN network after training, and inputting the test set into the optimal TextCNN network to output the instrument type corresponding to the text.
The textCNN network is constructed, comprising the following steps:
a. constructing a network structure:
the first layer is an embedding layer, which takes an input text of length m and maps it to word vectors, giving a 600 × 64 input tensor;
the second layer is a convolution module; the convolution kernel is (5, 5), the step length is 1 and the number of filters is 256, with an output of 596 × 256;
the third layer is a maximum pooling layer, with an output of 1 × 256;
the fourth layer consists of a full-connection layer, a Dropout layer and an activation layer, with an output of 1 × 128;
the fifth layer consists of a full-connection layer and an activation layer, with an output of 1 × cls, where cls is the number of categories;
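The stated shapes are mutually consistent: a width-5 valid convolution over the 600-step embedded text yields 596 positions, and global max pooling then keeps one value per filter. A quick arithmetic check (the helper name is ours, not the patent's):

```python
# Sketch: output-length arithmetic for the TextCNN described above.
def conv1d_out(length, kernel, stride=1):
    """Length after a valid (unpadded) 1-D convolution."""
    return (length - kernel) // stride + 1

seq_len = 600                      # embedded text length from the text
conv_len = conv1d_out(seq_len, 5)  # width-5 kernel, step length 1
print(conv_len)                    # 596, matching the stated 596 x 256
```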
b. setting a loss function
The loss function formula is as follows:

L_CrossEntropy = -∑_{i=1}^{n} y_i ln(ŷ_i)

wherein L_CrossEntropy represents the loss, n represents the number of categories, y_i represents the true probability of category i, and ŷ_i represents the predicted probability of category i.
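As a numeric illustration of the cross-entropy defined above (the probability values are chosen arbitrarily): with a one-hot true distribution, the loss reduces to minus the log of the probability the model assigns to the true category.

```python
import math

def cross_entropy(y_true, y_pred):
    """Multi-class cross-entropy: L = -sum_i y_i * ln(y_hat_i)."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)

# One-hot truth for category 1; the model assigns it probability 0.8:
loss = cross_entropy([0.0, 1.0, 0.0], [0.1, 0.8, 0.1])
print(round(loss, 4))  # 0.2231, i.e. -ln(0.8)
```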
Training TextCNN network, comprising the steps of:
c1, setting training parameters
Setting a training optimizer, an initial learning rate, the number of iterations and the batch size.
c2, setting training completion mark
Setting the interval at which training accuracy is checked on the verification set, and saving the optimal network after training; the training completion mark is reaching the maximum number of iterations or meeting the accuracy requirement.
And inputting the test set into the optimal TextCNN network to output the instrument type corresponding to the text.
In summary, the invention provides a new method for detecting and classifying instrument images. By using neural networks as an effective means of instrument detection and classification, the method addresses the difficulty of reading instrument types, promotes the development of automatic instrument identification technology, and has practical popularization value.
The above examples are preferred embodiments of the present invention, but the embodiments of the present invention are not limited thereto; any change, modification, substitution, combination or simplification made without departing from the spirit and principle of the present invention is an equivalent replacement and falls within the protection scope of the present invention.
Claims (5)
1. An instrument detection and classification method based on image text, characterized by comprising the following steps:
1) Marking the dial positions in the instrument images to construct an instrument positioning data set, dividing the instrument positioning data set into a training set and a test set, loading training parameters, training an improved YOLO network by using the training set, obtaining an optimal improved YOLO network after training, inputting the test set into the optimal improved YOLO network, outputting dial images and cutting them out; wherein the improved YOLO network replaces the backbone network with a MobileNet lightweight network so as to reduce network parameters and calculation amount and improve operation speed;
2) Marking the character positions in the dial images cut out in step 1) to construct a character detection data set, dividing the character detection data set into a training set and a test set, loading training parameters, training an improved EAST network by using the training set, obtaining an optimal improved EAST network after training, inputting the test set into the optimal improved EAST network, outputting the character positions, and cutting them into character images; wherein the improved EAST network changes the backbone network to VGG to improve detection accuracy, and modifies the output layer structure so that the prediction module uses only head elements to predict vertices, improving the prediction performance on long text;
the specific cases of the improved EAST network are as follows:
a. the feature extraction network is constructed, and the structure is as follows:
the first layer is a combined convolution module 2-B, which consists of two combined convolution modules 2-A and a maximum pooling layer, wherein the combined convolution module 2-A consists of a zero filling layer, a convolution layer and an activation layer;
the second layer is a combined convolution module 2-B, which consists of two combined convolution modules 2-A and a maximum pooling layer;
the third layer is a combined convolution module 2-C, which consists of three combined convolution modules 2-A and a maximum pooling layer;
The fourth layer is a combined convolution module 2-C, which consists of three combined convolution modules 2-A and a maximum pooling layer;
the fifth layer is a combined convolution module 2-C, which consists of three combined convolution modules 2-A and a maximum pooling layer;
b. the feature fusion network is constructed, and the structure is as follows:
the first layer is an input fusion module 2-G, which consists of an up-sampling layer and a tensor splicing layer;
the second layer is a combined convolution module 2-E, which consists of two batch normalization layers, a combined convolution module 2-D and a combined convolution module 2-A; wherein the combined convolution module 2-D consists of a zero padding layer, a convolution layer and an activation layer;
the third layer is an input fusion module 2-G, which consists of an up-sampling layer and a tensor splicing layer;
the fourth layer is a combined convolution module 2-E, which consists of two batch normalization layers, a combined convolution module 2-D and a combined convolution module 2-A;
the fifth layer is an input fusion module 2-G, which consists of an up-sampling layer and a tensor splicing layer;
the sixth layer is a combined convolution module 2-F, which consists of three batch normalization layers, one combined convolution module 2-D and two combined convolution modules 2-A;
c. The prediction network is constructed, and the structure is as follows:
the first layer is divided into three branches, and the first branch consists of a combined convolution module 2-D; the second branch consists of a combined convolution module 2-D; the third branch consists of a combined convolution module 2-D;
the second layer is an input fusion module which is formed by splicing three branches of the first layer;
d. the set loss function comprises a category loss function, a geometric shape loss function and an angle loss function;
the class loss function formula is as follows:
L_S = -β Y* ln Ŷ - (1-β)(1-Y*) ln(1-Ŷ)

wherein L_S represents the class loss, β represents the weight, Ŷ is the predicted category, and Y* is the true category;
The geometry loss function formula is as follows:
L_AABB = -ln IoU(R̂, R*)

wherein L_AABB represents the geometry loss, R̂ represents the geometry of the predicted quadrilateral text box AABB, R* represents the geometry of the real quadrilateral text box AABB, and IoU represents the intersection ratio;
the angle loss function formula is as follows:
L_θ = 1 - cos(θ̂ - θ*)

wherein L_θ is the angle loss, θ̂ is the predicted value of the rotation angle, and θ* is the true value of the rotation angle;
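Read together, the three terms penalize misclassification, box-overlap error, and angle error. A plain-Python sketch follows; the exact formulas are reconstructed here from the variable definitions above and the published EAST design (balanced cross-entropy, negative log-IoU, and 1 − cos of the angle difference), so treat them as an assumption rather than the patent's verbatim arithmetic:

```python
import math

def class_loss(y_true, y_pred, beta):
    """Balanced cross-entropy: -b*Y*ln(Yhat) - (1-b)*(1-Y)*ln(1-Yhat)."""
    return (-beta * y_true * math.log(y_pred)
            - (1 - beta) * (1 - y_true) * math.log(1 - y_pred))

def geometry_loss(iou):
    """L_AABB = -ln IoU(R_hat, R_star)."""
    return -math.log(iou)

def angle_loss(theta_pred, theta_true):
    """L_theta = 1 - cos(theta_hat - theta_star)."""
    return 1 - math.cos(theta_pred - theta_true)

print(round(geometry_loss(0.5), 4))   # 0.6931: half overlap costs ln 2
print(round(angle_loss(0.3, 0.3), 4)) # 0.0: exact angle costs nothing
```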
training and improving EAST network by loading training parameters, wherein the training parameters are set as follows: setting a training optimizer to Adam, an initial learning rate of 0.001, a maximum training period of 500 and a batch size of 8; setting the interval detection training accuracy of the verification set, and storing a network after the training is completed to the optimal condition, wherein the training completion mark is the maximum training period or the average crossing ratio meeting the requirement;
Inputting the test set into an optimal improved EAST network to obtain a text position, and cutting the text position into character images;
3) Marking the character information in the character images cut in step 2) to construct a character recognition data set, dividing the character recognition data set into a training set and a test set, loading training parameters to train the CRNN network by using the training set, obtaining an optimal CRNN network after training, inputting the test set into the optimal CRNN network, and outputting the character information;
4) Splicing the character information output in step 3) into a text, marking the instrument type corresponding to the text to construct a text classification data set, dividing the text classification data set into a training set and a test set, loading training parameters to train the TextCNN network by using the training set, obtaining an optimal TextCNN network after training, and inputting the test set into the optimal TextCNN network to output the instrument type corresponding to the text.
2. The image-text-based instrument detection and classification method according to claim 1, wherein in step 1), instrument images of various types are collected by a camera under different environments; filtering and image enhancement preprocessing operations are performed on the instrument images; abnormal data, including data with surface dirt, extreme illumination and abnormal shooting, are removed; the remaining data are marked, the marked content being the dial position, to construct an instrument positioning data set; and the instrument positioning data set is divided into a training set and a test set.
3. The method of image text based meter test classification of claim 1, wherein in step 1), said improved YOLO network is specified as follows:
a. constructing a feature extraction network according to the real-time and high-precision requirements:
the first layer is a combined convolution module 1-A, which consists of a zero filling layer, a convolution layer, a batch normalization layer and an activation layer;
the second layer is a combined convolution module 1-B, which consists of a depth convolution layer, two batch normalization layers, two activation layers and a convolution layer;
the third layer is a combined convolution module 1-C, which consists of a zero filling layer, a depth convolution layer, two batch normalization layers, two activation layers and a convolution layer;
the fourth layer is a combined convolution module 1-B, which consists of a deep convolution layer, two batch normalization layers, two activation layers and a convolution layer;
the fifth layer is a combined convolution module 1-C, which consists of a zero filling layer, a depth convolution layer, two batch normalization layers, two activation layers and a convolution layer;
the sixth layer is a combined convolution module 1-B, which consists of a deep convolution layer, two batch normalization layers, two activation layers and a convolution layer;
The seventh layer is a combined convolution module 1-C, which consists of a zero filling layer, a depth convolution layer, two batch normalization layers, two activation layers and a convolution layer;
the eighth layer is a combined convolution module 1-D, which consists of five combined convolution modules 1-B;
the ninth layer is a combined convolution module 1-C, which consists of a zero filling layer, a depth convolution layer, two batch normalization layers, two activation layers and a convolution layer;
the tenth layer is a combined convolution module 1-B, which consists of a depth convolution layer, two batch normalization layers, two activation layers and a convolution layer;
b. constructing and outputting prediction networks for predicting targets with different sizes according to the output of different layers of the feature extraction network, wherein the prediction networks comprise a large-size target prediction network, a medium-size target prediction network and a small-size target prediction network;
b1, inputting a tenth layer of output of a feature extraction network, wherein the large-size target prediction network consists of a plurality of combination convolution modules and convolution layers, and has the following structure:
the first layer is a combined convolution module 1-D, which consists of five combined convolution modules 1-B;
the second layer is a combined convolution module 1-B, which consists of a depth convolution layer, two batch normalization layers, two activation layers and a convolution layer;
The third layer is a convolution layer;
b2, inputting an eighth layer output of a characteristic extraction network and a first layer output of a large-size target prediction network, wherein the medium-size target prediction network consists of a plurality of combination convolution modules and convolution layers, and the structure is as follows:
the first layer is an input fusion module 1-E, which consists of a combined convolution module 1-B, an up-sampling layer and a tensor splicing layer;
the second layer is a combined convolution module 1-D, which consists of five combined convolution modules 1-B;
the third layer is a combined convolution module 1-B, which consists of a depth convolution layer, two batch normalization layers, two activation layers and a convolution layer;
the fourth layer is a convolution layer;
b3, taking the input as the sixth layer output of the characteristic extraction network and the second layer output of the medium-size target prediction network, wherein the small-size target prediction network consists of a plurality of combination convolution modules and convolution layers and has the following structure:
the first layer is an input fusion module 1-E, which consists of a combined convolution module 1-B, an up-sampling layer and a tensor splicing layer;
the second layer is a combined convolution module 1-D, which consists of five combined convolution modules 1-B;
the third layer is a combined convolution module 1-B, which consists of a depth convolution layer, two batch normalization layers, two activation layers and a convolution layer;
The fourth layer is a convolution layer;
finally, the output of the large-size target prediction network, the medium-size target prediction network and the small-size target prediction network is processed through a non-maximum suppression layer to obtain the predicted target position and category;
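The non-maximum suppression step above can be sketched as follows: keep the highest-confidence box, drop boxes that overlap it beyond a threshold, and repeat. Boxes are (x1, y1, x2, y2, score) tuples, and the 0.5 overlap threshold is an illustrative assumption, not a value stated in the claim:

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def nms(boxes, thresh=0.5):
    """Greedy non-maximum suppression over (x1, y1, x2, y2, score)."""
    boxes = sorted(boxes, key=lambda b: b[4], reverse=True)
    kept = []
    for b in boxes:
        if all(iou(b, k) < thresh for k in kept):
            kept.append(b)
    return kept

boxes = [(0, 0, 10, 10, 0.9), (1, 1, 10, 10, 0.8), (20, 20, 30, 30, 0.7)]
print(len(nms(boxes)))  # 2: the second box overlaps the first too much
```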
c. the loss function is set to have a center coordinate loss function, a wide-high loss function, a confidence loss function and a category loss function;
the center coordinate loss function formula is as follows:
Loss_xy = mark_object * (2 - w*h) * Loss_log(xy_true, xy_predict)

wherein Loss_xy represents the center coordinate loss, mark_object is a flag bit indicating whether an object exists in the anchor frame, w represents the width of the anchor frame, h represents the height of the anchor frame, Loss_log represents the binary cross-entropy loss, xy_true represents the true center coordinate value, and xy_predict represents the predicted center coordinate value;
the wide-high loss function formula is as follows:
Loss_wh = 0.5 * mark_object * (2 - w*h) * (wh_true - wh_predict)²

wherein Loss_wh represents the width-height loss, wh_true represents the true width-height value, and wh_predict represents the predicted width-height value;
the confidence loss function formula is as follows:
Loss_confidence = mark_object * Loss_log(mark_object, c_predict) + (1 - mark_object) * Loss_log(mark_object, c_predict) * mark_ignore

wherein Loss_confidence represents the confidence loss, c_predict represents the confidence value of the prediction frame, and mark_ignore is a flag bit for anchor frames whose IOU is less than the threshold;
the class loss function formula is as follows:
Loss_cls = mark_object * Loss_log(cls_true, cls_predict)

wherein Loss_cls represents the class loss, cls_true represents the true category, and cls_predict represents the predicted category;
the total loss function formula is as follows:
Loss = (Loss_xy + Loss_wh + Loss_confidence + Loss_cls) / numf

wherein Loss represents the total loss and numf represents the total input number as a floating point value;
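A scalar sketch of the loss terms as defined above, taking Loss_log to be binary cross-entropy per the claim; scalars stand in for the full tensors, and the concrete numbers are illustrative only:

```python
import math

def bce(t, p):
    """Binary cross-entropy Loss_log(t, p), p strictly in (0, 1)."""
    return -(t * math.log(p) + (1 - t) * math.log(1 - p))

def loss_xy(mark, w, h, xy_t, xy_p):
    """Center coordinate loss: mark * (2 - w*h) * Loss_log(...)."""
    return mark * (2 - w * h) * bce(xy_t, xy_p)

def loss_wh(mark, w, h, wh_t, wh_p):
    """Width-height loss: 0.5 * mark * (2 - w*h) * squared error."""
    return 0.5 * mark * (2 - w * h) * (wh_t - wh_p) ** 2

def loss_conf(mark, c_p, mark_ignore):
    """Confidence loss with the ignore mask on objectless anchors."""
    return mark * bce(mark, c_p) + (1 - mark) * bce(mark, c_p) * mark_ignore

print(round(loss_wh(1, 0.5, 0.5, 0.6, 0.4), 4))  # 0.035
```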
loading training parameters to train the improved YOLO network, wherein the training parameters are set as follows: the training optimizer is Adam, the initial learning rate is 0.001, the maximum training period is 500, and the batch size is 8; setting the interval at which training accuracy is checked on the verification set, and saving the optimal network after training, the training completion mark being reaching the maximum training period or the average intersection ratio meeting the requirement;
inputting the test set into an optimal improved YOLO network to obtain the dial position and the dial image.
4. The method of claim 1, wherein in step 3), the CRNN network is as follows:
a. the feature extraction network is constructed, and the structure is as follows:
the first layer is a combined convolution module 3-A, which consists of a zero filling layer, a convolution layer and an activation layer;
the second layer is a maximum pooling layer;
the third layer is a combined convolution module 3-A, which consists of a zero filling layer, a convolution layer and an activation layer;
the fourth layer is the largest pooling layer;
the fifth layer is a combined convolution module 3-B, which consists of a zero filling layer, a convolution layer, a batch normalization layer and an activation layer;
The sixth layer is a combined convolution module 3-A, which consists of a zero filling layer, a convolution layer and an activation layer;
the seventh layer is the largest pooling layer;
the eighth layer is a combined convolution module 3-B, which consists of a zero filling layer, a convolution layer, a batch normalization layer and an activation layer;
the ninth layer is a combined convolution module 3-A, which consists of a zero filling layer, a convolution layer and an activation layer;
the tenth layer is the largest pooling layer;
the eleventh layer is a combined convolution module 3-C, which consists of a zero filling layer, a convolution layer, a batch normalization layer and an activation layer;
b. the prediction network is constructed, and the structure is as follows:
the first layer is a cyclic convolution module, which consists of a bidirectional LSTM;
the second layer is a full-connection layer;
the third layer is a circular convolution module, which consists of a bidirectional LSTM;
the fourth layer is a full-connection layer;
c. setting a decoder to convert the output sequence into character information;
d. setting a loss function as a CTC (Connectionist Temporal Classification) loss function;
the CTC loss function formula is as follows:
L_CTC = -ln ∏_{(x,z)∈S} p(z|x) = -∑_{(x,z)∈S} ln p(z|x)

wherein L_CTC represents the CTC loss, p(z|x) represents the probability of outputting sequence z given input x, and S is the training set;
Training the CRNN network by loading training parameters, wherein the training parameters are set as follows: the training optimizer is Adam, the initial learning rate is 0.0001, the maximum training period is 100, and the batch size is 32; setting the interval at which training accuracy is checked on the verification set, and saving the optimal network after training, the training completion mark being reaching the maximum training period or the recognition accuracy meeting the requirement;
and inputting the test set into the optimal CRNN network to obtain character information.
5. The method of claim 1, wherein in step 4), the TextCNN network is as follows:
a. the network structure is constructed as follows:
the first layer is an embedded layer;
the second layer is a convolution module;
the third layer is a maximum pooling layer;
the fourth layer consists of a full connection layer, a Dropout layer and an activation layer;
the fifth layer consists of a full connection layer and an activation layer;
b. the loss function is set as multi-class cross-entropy, and the formula is as follows:

L_CrossEntropy = -∑_{i=1}^{n} y_i ln(ŷ_i)

wherein L_CrossEntropy represents the loss, n represents the number of categories, y_i represents the true probability of category i, and ŷ_i represents the predicted probability of category i;
loading training parameters to train the TextCNN network, wherein the training parameters are set as follows: the training optimizer is Adam, the initial learning rate is 0.001, the number of iterations is 1000, and the batch size is 64; setting the interval at which training accuracy is checked on the verification set, and saving the optimal network after training, the training completion mark being reaching the maximum number of iterations or meeting the accuracy requirement;
And inputting the test set into an optimal textCNN network to obtain a corresponding instrument type.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110855223.6A CN113673509B (en) | 2021-07-28 | 2021-07-28 | Instrument detection classification method based on image text |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113673509A CN113673509A (en) | 2021-11-19 |
CN113673509B true CN113673509B (en) | 2023-06-09 |
Family
ID=78540390
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110855223.6A Active CN113673509B (en) | 2021-07-28 | 2021-07-28 | Instrument detection classification method based on image text |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113673509B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113936280B (en) * | 2021-11-23 | 2024-04-05 | 河海大学 | Automatic character recognition system and method for code disc of embedded instrument |
CN115424121B (en) * | 2022-07-30 | 2023-10-13 | 南京理工大学紫金学院 | Electric power pressing plate switch inspection method based on computer vision |
CN116416626B (en) * | 2023-06-12 | 2023-08-29 | 平安银行股份有限公司 | Method, device, equipment and storage medium for acquiring circular seal data |
CN116958998B (en) * | 2023-09-20 | 2023-12-26 | 四川泓宝润业工程技术有限公司 | Digital instrument reading identification method based on deep learning |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108710831A (en) * | 2018-04-24 | 2018-10-26 | 华南理工大学 | A kind of small data set face recognition algorithms based on machine vision |
CN110543878A (en) * | 2019-08-07 | 2019-12-06 | 华南理工大学 | pointer instrument reading identification method based on neural network |
CN111062282A (en) * | 2019-12-05 | 2020-04-24 | 武汉科技大学 | Transformer substation pointer type instrument identification method based on improved YOLOV3 model |
CN111368825A (en) * | 2020-02-25 | 2020-07-03 | 华南理工大学 | Pointer positioning method based on semantic segmentation |
CN111401358A (en) * | 2020-02-25 | 2020-07-10 | 华南理工大学 | Instrument dial plate correction method based on neural network |
CN111639643A (en) * | 2020-05-22 | 2020-09-08 | 深圳市赛为智能股份有限公司 | Character recognition method, character recognition device, computer equipment and storage medium |
CN111814919A (en) * | 2020-08-31 | 2020-10-23 | 江西小马机器人有限公司 | Instrument positioning and identifying system based on deep learning |
CN112801094A (en) * | 2021-02-02 | 2021-05-14 | 中国长江三峡集团有限公司 | Pointer instrument image inclination correction method |
CN112861867A (en) * | 2021-02-01 | 2021-05-28 | 北京大学 | Pointer type instrument panel identification method, system and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN113673509A (en) | 2021-11-19 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||