Disclosure of Invention
In order to solve the above problems, the present application provides a text detection and recognition method and system based on a mobile device.
In one aspect, the present application provides a text detection and recognition method based on a mobile device, including:
preprocessing an RGB image to obtain a first image;
calculating a plurality of sets of coordinates in the first image;
extracting each second image corresponding to each group of coordinates;
extracting image features of the second image by using a MobileNet convolution network;
converting the image features into a sequence of features using a feature mapping function;
regularizing the feature sequence using DropConnect, and then extracting a text prediction value of the feature sequence using a long short-term memory (LSTM) unit;
processing the text prediction value by using a softmax function to obtain text probability distribution data;
and extracting text content from the text probability distribution data by using an argmax function and outputting the text content.
Preferably, after using the long short-term memory unit to extract the text prediction value of the feature sequence, the method further includes:
calculating a recognition loss between the text prediction value and the corresponding label text by using a CTC loss function;
if the recognition loss is less than or equal to the recognition loss threshold value, obtaining a trained text recognition model;
if the recognition loss is greater than the recognition loss threshold, automatically adjusting the weight parameters of the text recognition model according to the recognition loss and continuing to train the text recognition model on the images in the training set to obtain a new recognition loss; if the new recognition loss is less than or equal to the recognition loss threshold, obtaining a trained text recognition model; and if the new recognition loss is greater than the recognition loss of the previous step and cannot be reduced, outputting the text recognition model with the minimum recognition loss.
Preferably, after the calculating the plurality of sets of coordinates in the first image, the method further includes:
calculating coordinate loss between the multiple groups of coordinates and the corresponding label coordinates;
if the coordinate loss is less than or equal to the detection loss threshold value, obtaining a trained text detection model;
if the coordinate loss is greater than the detection loss threshold, adjusting the weight parameters of the text detection model according to the coordinate loss and continuing to train the text detection model on the images in the training set to obtain a new coordinate loss; if the new coordinate loss is less than or equal to the detection loss threshold, obtaining a trained text detection model; and if the new coordinate loss is greater than the coordinate loss of the previous step and cannot be reduced, outputting the text detection model with the minimum coordinate loss.
Preferably, the preprocessing the RGB image to obtain a first image includes:
converting the RGB image into an HSV image;
carrying out histogram equalization on the HSV image to obtain an equalized image;
and converting colors lower than the gray threshold value in the equalized image into white to obtain a first image.
Preferably, the calculating the plurality of sets of coordinates in the first image comprises:
extracting the Y-axis coordinate of each line, the horizontal translation amount of the text, and the height of the text area;
and calculating four-point coordinates of the text area of each line by using the Y-axis coordinates of each line, the horizontal translation amount of the text and the height of the text area to obtain a plurality of groups of coordinates.
Preferably, the images in the training set are images labeled with labels.
Preferably, the label is a data unit comprising: the label text, the label coordinates, and the image in the area of the first image corresponding to the label coordinates.
In a second aspect, the present application provides a mobile device-based text detection and recognition system, comprising:
the preprocessing module is used for preprocessing the RGB image to obtain a first image;
the text detection module is used for calculating a plurality of groups of coordinates in the first image;
the text extraction module is used for extracting each second image corresponding to each group of coordinates;
the text recognition module is used for extracting image features of the second image by using a MobileNet convolutional network; converting the image features into a feature sequence using a feature mapping function; regularizing the feature sequence using DropConnect, and then extracting a text prediction value of the feature sequence using a long short-term memory unit; processing the text prediction value by using a softmax function to obtain text probability distribution data; and extracting text content from the text probability distribution data by using an argmax function and outputting the text content.
Preferably, the system further comprises a first training module for calculating a recognition loss between the text prediction value and its corresponding label text by using a CTC loss function; if the recognition loss is less than or equal to the recognition loss threshold, obtaining a trained text recognition model; if the recognition loss is greater than the recognition loss threshold, automatically adjusting the weight parameters of the text recognition model according to the recognition loss and continuing to train the text recognition model on the images in the training set to obtain a new recognition loss; if the new recognition loss is less than or equal to the recognition loss threshold, obtaining a trained text recognition model; and if the new recognition loss is greater than the recognition loss of the previous step and cannot be reduced, outputting the text recognition model with the minimum recognition loss.
Preferably, the system further comprises a second training module configured to calculate a coordinate loss between the sets of coordinates and the corresponding label coordinates; if the coordinate loss is less than or equal to the detection loss threshold, obtaining a trained text detection model; if the coordinate loss is greater than the detection loss threshold, adjusting the weight parameters of the text detection model according to the coordinate loss and continuing to train the text detection model on the images in the training set to obtain a new coordinate loss; if the new coordinate loss is less than or equal to the detection loss threshold, obtaining a trained text detection model; and if the new coordinate loss is greater than the coordinate loss of the previous step and cannot be reduced, outputting the text detection model with the minimum coordinate loss.
The advantages of the present application are as follows: preprocessing the RGB image reduces interference from complex text backgrounds, and using MobileNet enables fast, low-power text detection and recognition on a mobile platform. Preprocessing the image effectively reduces the interference produced by complex environments in various scenes, which benefits text recognition in real application scenarios.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
According to an embodiment of the present application, a text detection and recognition method based on a mobile device is provided, as shown in fig. 1, including:
s101, preprocessing an RGB image to obtain a first image;
s102, calculating a plurality of groups of coordinates in the first image;
s103, extracting second images corresponding to the groups of coordinates;
s104, extracting image features of the second image by using a MobileNet convolution network;
s105, converting the image features into a feature sequence by using a feature mapping function;
s106, after regularizing the feature sequence using DropConnect, extracting a text prediction value of the (regularized) feature sequence using a long short-term memory (LSTM) unit;
s107, processing the text predicted value by using a softmax function to obtain text probability distribution data;
and S108, extracting text content from the text probability distribution data by using an argmax function, and outputting the text content.
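Steps S107 and S108 can be sketched as follows. This is a minimal illustrative example, not the patented implementation: the alphabet, the blank index, and the CTC-style collapse of repeated characters are assumptions consistent with the CTC transcription described later in this disclosure.

```python
import numpy as np

# Illustrative alphabet; index 0 is assumed to be the CTC blank symbol.
ALPHABET = "-0123456789abcdefghijklmnopqrstuvwxyz"

def softmax(logits):
    """S107: convert per-timestep prediction values to a probability distribution."""
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def decode(logits):
    """S108: argmax per timestep, then collapse repeats and remove blanks."""
    probs = softmax(logits)           # shape (T, num_classes)
    best = probs.argmax(axis=-1)      # most likely class at each timestep
    chars, prev = [], -1
    for idx in best:
        if idx != prev and idx != 0:  # skip repeated indices and the blank (0)
            chars.append(ALPHABET[idx])
        prev = idx
    return "".join(chars)
```

For example, a logit sequence whose per-timestep argmax is `c, c, blank, a, t` decodes to the text "cat".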
The long short-term memory unit is a deep bidirectional long short-term memory (Deep BiLSTM) unit.
After using the long short-term memory unit to extract the text prediction value of the feature sequence, the method further comprises the following steps:
calculating the recognition loss between the text prediction value and the corresponding label text by using a CTC loss function;
if the recognition loss is less than or equal to the recognition loss threshold value, obtaining a trained text recognition model;
if the recognition loss is greater than the recognition loss threshold, automatically adjusting the weight parameters of the text recognition model according to the recognition loss and continuing to train the text recognition model on the images in the training set to obtain a new recognition loss; if the new recognition loss is less than or equal to the recognition loss threshold, obtaining a trained text recognition model; and if the new recognition loss is greater than the recognition loss of the previous step and cannot be reduced, outputting the text recognition model with the minimum recognition loss.
When the recognition loss is greater than the recognition loss threshold, the weight parameters of the text recognition model are automatically adjusted according to the recognition loss, and the text recognition model continues to be trained on the images in the training set to obtain a new recognition loss. If the new recognition loss is smaller than the recognition loss of the previous step but still greater than the recognition loss threshold, the weight parameters are again adjusted according to the recognition loss and training continues on the images in the training set, until the newly obtained recognition loss is less than or equal to the recognition loss threshold, at which point the trained text recognition model is obtained. If the new recognition loss cannot be reduced, that is, it remains greater than the recognition loss of the previous step, the text recognition model with the minimum recognition loss is output.
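The training procedure above can be sketched as a loop that stops when the loss falls under the settable threshold and otherwise outputs the minimum-loss model once the loss stops decreasing for a settable number of consecutive rounds. This is a hypothetical sketch; `train_one_round` stands in for whatever weight-update step the model uses and is not named in the source.

```python
import copy

def train_until_converged(model, train_one_round, loss_threshold,
                          patience=3, max_rounds=1000):
    """train_one_round(model) updates the weights and returns the new loss."""
    best_loss, best_model = float("inf"), model
    worse_streak, prev_loss = 0, float("inf")
    for _ in range(max_rounds):
        loss = train_one_round(model)
        if loss < best_loss:
            best_loss, best_model = loss, copy.deepcopy(model)
        if loss <= loss_threshold:
            return model, loss            # trained model: loss under threshold
        # loss still above threshold: count consecutive non-improving rounds
        worse_streak = worse_streak + 1 if loss > prev_loss else 0
        if worse_streak >= patience:
            return best_model, best_loss  # output the minimum-loss model
        prev_loss = loss
    return best_model, best_loss
```

The same loop shape applies to the text detection model, with the coordinate loss and detection loss threshold substituted.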
The resulting (output) text recognition model is installed on a mobile device to perform text recognition.
The new recognition loss is deemed unable to be reduced when the recognition loss obtained over several consecutive iterations is higher than the previously obtained recognition loss.
The number of consecutive times may be set.
The identification loss threshold may be set.
The CTC loss is calculated by Connectionist Temporal Classification (CTC).
The recognition loss is a training loss value.
The text recognition model is used for calculating a predicted value of the text.
After the calculating the plurality of sets of coordinates in the first image, further comprising:
calculating coordinate loss between the multiple groups of coordinates and the corresponding label coordinates;
if the coordinate loss is less than or equal to the detection loss threshold value, obtaining a trained text detection model;
if the coordinate loss is greater than the detection loss threshold, adjusting the weight parameters of the text detection model according to the coordinate loss and continuing to train the text detection model on the images in the training set to obtain a new coordinate loss; if the new coordinate loss is less than or equal to the detection loss threshold, obtaining a trained text detection model; and if the new coordinate loss is greater than the coordinate loss of the previous step and cannot be reduced, outputting the text detection model with the minimum coordinate loss.
When the coordinate loss is greater than the detection loss threshold, the weight parameters of the text detection model are automatically adjusted according to the coordinate loss, and the text detection model continues to be trained on the images in the training set to obtain a new coordinate loss. If the new coordinate loss is smaller than the coordinate loss of the previous step but still greater than the detection loss threshold, the weight parameters are again adjusted according to the coordinate loss and training continues on the images in the training set, until the newly obtained coordinate loss is less than or equal to the detection loss threshold, at which point the trained text detection model is obtained. If the new coordinate loss cannot be reduced, that is, it remains greater than the coordinate loss of the previous step, the text detection model with the minimum coordinate loss is output.
The resulting (output) text detection model is installed on a mobile device to perform text detection.
The new coordinate loss is deemed unable to be reduced when the coordinate loss obtained over several consecutive iterations is higher than the previously obtained coordinate loss.
The number of consecutive times may be set.
The detection loss threshold may be set.
The coordinate loss between the sets of coordinates and the corresponding label coordinates may be calculated using algorithms such as cross-entropy loss (Cross Entropy Loss), smooth L1 loss (Smooth L1 Loss), and softmax loss (Softmax Loss).
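As an illustrative sketch of one of the losses mentioned above, the smooth L1 loss between predicted coordinates and label coordinates can be computed as below. It behaves quadratically for small errors and linearly for large ones, making it less sensitive to outlying corner points than a plain L2 loss.

```python
import numpy as np

def smooth_l1_loss(pred, target, beta=1.0):
    """Mean smooth L1 loss between predicted and label coordinates."""
    diff = np.abs(np.asarray(pred, dtype=float) - np.asarray(target, dtype=float))
    # quadratic below beta, linear above it
    loss = np.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta)
    return loss.mean()
```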
The text detection model is used to calculate a plurality of sets of coordinates in the first image.
The preprocessing the RGB image to obtain a first image, comprising:
converting the RGB image into an HSV image;
carrying out histogram equalization on the HSV image to obtain an equalized image;
and converting colors lower than the gray threshold value in the equalized image into white to obtain a first image.
The grayscale threshold may be set.
HSV is a color space different from RGB, consisting of Hue (H), Saturation (S), and Value (V). Hue is measured as an angle, with red at 0 degrees; saturation represents the depth of a color; and value represents the brightness of a color. Converting to the HSV color space facilitates direct processing of image brightness by subsequent algorithms. Histogram equalization is a method for correcting images that are overexposed or underexposed, either overall or locally, and it can also enhance the contrast of a picture for machine recognition.
The input RGB picture is converted into HSV channels so that subsequent processing can be carried out directly on the gray level (the V channel) of the image. Histogram equalization is then applied to the resulting HSV image to obtain an image with a more balanced gray distribution; this operation effectively reduces overexposed and shadowed areas in the image. Finally, colors below the gray threshold are removed by color filtering; for example, pixels with a V value less than 46 are treated as black pixels and converted into white, yielding the first image.
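The preprocessing described above can be sketched in pure numpy, operating directly on the V (value) channel of the HSV image: equalize its histogram, then turn pixels darker than the gray threshold (46 in the example) white. This is a simplified sketch; a real implementation would first convert RGB to HSV, for example with OpenCV.

```python
import numpy as np

def preprocess_v_channel(v, gray_threshold=46):
    """Histogram-equalize the V channel, then set dark pixels to white (255)."""
    v = np.asarray(v, dtype=np.uint8)
    hist = np.bincount(v.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0].min()               # first non-zero CDF value
    # classic histogram-equalization remap of gray levels to 0..255
    lut = np.round((cdf - cdf_min) / max(cdf[-1] - cdf_min, 1) * 255)
    equalized = lut.clip(0, 255).astype(np.uint8)[v]
    first = equalized.copy()
    first[equalized < gray_threshold] = 255    # pixels below threshold become white
    return first
```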
The calculating of the plurality of sets of coordinates in the first image comprises:
extracting Y-axis coordinates of each line, horizontal translation amount of the text and height of the text area;
and calculating four-point coordinates of the text area of each line by using the Y-axis coordinates of each line, the horizontal translation amount of the text and the height of the text area to obtain a plurality of groups of coordinates.
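The four-point calculation above can be sketched as follows. The source does not specify how the area width is obtained, so this sketch assumes each text area extends from the horizontal translation amount to the right edge of the image; that width is an assumption for illustration only.

```python
def four_point_coords(lines, image_width):
    """lines: iterable of (y, x_offset, height) per detected text line.

    Returns one group of four corner points per line, ordered
    top-left, top-right, bottom-right, bottom-left.
    """
    coords = []
    for y, x_offset, height in lines:
        coords.append([
            (x_offset, y),              # top-left
            (image_width, y),           # top-right (assumed right edge)
            (image_width, y + height),  # bottom-right
            (x_offset, y + height),     # bottom-left
        ])
    return coords
```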
Image features of the first image are extracted using a MobileNet convolutional network, converted into a feature sequence using a feature mapping function, and regularized using DropConnect; a text-position prediction value of the (regularized) feature sequence is then extracted using a long short-term memory unit. After the text-position prediction value passes through the fully connected layer, the Y-axis coordinate of each line, the horizontal translation amount of the text, and the height of the text area are output.
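The feature mapping function described above can be sketched as slicing a convolutional feature map column by column into a left-to-right sequence for the BiLSTM. The shapes below are illustrative.

```python
import numpy as np

def map_to_sequence(feature_map):
    """Convert a (channels, height, width) feature map into a
    (width, channels * height) sequence: one vector per image column."""
    c, h, w = feature_map.shape
    return feature_map.transpose(2, 0, 1).reshape(w, c * h)
```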
The images in the training set are labeled with labels.
The label is a manually annotated label.
The label is a data unit comprising: the label text, the label coordinates, and the image in the area of the first image corresponding to the label coordinates.
The label text comprises text contents corresponding to all the text areas.
The label coordinates include the coordinates of all text regions.
The image in the area of the first image corresponding to the label coordinates is the image in the area of the first image corresponding to the text-region coordinates.
After extracting and outputting the text content from the text probability distribution data by using the argmax function, the method further includes:
and outputting the corresponding text content according to the setting.
According to requirements, the user can select the text in the largest area and/or the text in the smallest area found by text detection, or select a text area manually. The user may also select particular text content, such as outputting only numbers. The user may also select different combinations as desired.
The size of each region is determined by calculating the perimeter obtained from the group of four-point coordinates corresponding to that region.
Taking as an example a user who needs to output only the price on a label, the user may choose to output only the numbers in the largest area. The neural networks for text detection and recognition produce a large amount of data, much of it redundant. In this case, the non-numeric parts must first be filtered out with a regular expression. The perimeter of each text area is then calculated from the (four-point) coordinates output by the text detection part, and finally the text in the area with the largest perimeter (after the non-numeric parts have been filtered out) is selected as the final output.
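The price-label example above can be sketched as follows: strip non-numeric characters with a regular expression, rank regions by the perimeter of their four-point coordinates, and output the filtered text of the largest region. Function names here are illustrative, not from the source.

```python
import re

def perimeter(quad):
    """Perimeter of a quadrilateral given as four (x, y) points in order."""
    return sum(
        ((x2 - x1) ** 2 + (y2 - y1) ** 2) ** 0.5
        for (x1, y1), (x2, y2) in zip(quad, quad[1:] + quad[:1])
    )

def select_numbers_in_largest_area(regions):
    """regions: list of (recognized_text, four_point_coords) pairs."""
    digits_only = [(re.sub(r"\D", "", text), quad) for text, quad in regions]
    best_text, _ = max(digits_only, key=lambda r: perimeter(r[1]))
    return best_text
```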
When a convolutional neural network and a recurrent neural network composed of bidirectional long short-term memory units are used to process text, the two networks must be connected to realize prediction from the image end to the text end.
As shown in fig. 2, for the neural network of the text detection part, MobileNet is used as the convolutional network part; the layers from the last pooling layer onward in MobileNet are removed, a two-layer deep bidirectional long short-term memory unit (Deep BiLSTM) is connected after the last convolutional layer, and the output of the LSTM unit is then connected to a fully connected layer using DropConnect regularization.
A text detection model for installation on a mobile device for text detection includes a neural network of the text detection portion.
As shown in fig. 3, for the neural network of the text recognition part during training, MobileNet is used as the convolutional network part; the last fully connected layer and the following Softmax layer in MobileNet are removed, the last pooling layer is connected to a two-layer deep bidirectional long short-term memory unit via a feature mapping function with DropConnect regularization, the output of the LSTM unit is connected to a transcription layer using Connectionist Temporal Classification (CTC), and finally to an output layer.
As shown in fig. 4, after training, when the neural network of the text recognition part is installed on a mobile device for text recognition, MobileNet is used as the convolutional network part; the last fully connected layer and the following Softmax layer in MobileNet are removed, the last pooling layer is connected to a two-layer deep bidirectional long short-term memory unit (Deep BiLSTM) via a feature mapping function with DropConnect regularization, and the output of the LSTM unit is connected to a Softmax layer that outputs the text probability distribution data.
And extracting text content from the text probability distribution data by using an argmax function and outputting the text content.
The method can also output the four-point coordinates of each line (corresponding to each text region).
The lower the CTC loss, the higher the accuracy of text recognition.
According to an embodiment of the present application, there is also provided a text detection and recognition system based on a mobile device, as shown in fig. 5, including:
the preprocessing module 101 is configured to preprocess the RGB image to obtain a first image;
a text detection module 102, configured to calculate multiple sets of coordinates in the first image;
a text extraction module 103, configured to extract each second image corresponding to each group of coordinates;
a text recognition module 104, configured to extract image features of the second image using a MobileNet convolutional network; convert the image features into a feature sequence using a feature mapping function; regularize the feature sequence using DropConnect and then extract a text prediction value of the feature sequence using a long short-term memory unit; process the text prediction value using a softmax function to obtain text probability distribution data; and extract text content from the text probability distribution data using an argmax function and output the text content.
The system further comprises a first training module for calculating a recognition loss between the text prediction value and its corresponding label text by using a CTC loss function; if the recognition loss is less than or equal to the recognition loss threshold, obtaining a trained text recognition model; if the recognition loss is greater than the recognition loss threshold, automatically adjusting the weight parameters of the text recognition model according to the recognition loss and continuing to train the text recognition model on the images in the training set to obtain a new recognition loss; if the new recognition loss is less than or equal to the recognition loss threshold, obtaining a trained text recognition model; and if the new recognition loss is greater than the recognition loss of the previous step and cannot be reduced, outputting the text recognition model with the minimum recognition loss.
As shown in fig. 6, the text recognition module includes a prediction value calculation unit and a content extraction unit. The prediction value calculation unit is used to calculate a text prediction value of the input second image; the content extraction unit is used to extract text content according to the text prediction value and output the extracted text content. The first training module is used to train the prediction value calculation unit.
As shown in fig. 7, for the neural network of the prediction value calculation unit of the text recognition module during training, MobileNet is used as the convolutional network part; the last fully connected layer and the following Softmax layer in MobileNet are removed, and the last pooling layer is connected to a two-layer deep bidirectional long short-term memory unit (Deep BiLSTM) via a feature mapping function with DropConnect regularization. The output of the LSTM unit is connected to the transcription layer using CTC in the first training module, and finally to an output layer.
As shown in fig. 8, after training, when the neural network of the prediction value calculation unit of the text recognition module is used for text recognition, MobileNet is used as the convolutional network part; the last fully connected layer and the following Softmax layer in MobileNet are removed, the last pooling layer is connected to a two-layer deep bidirectional long short-term memory unit (Deep BiLSTM) via a feature mapping function with DropConnect regularization, and the output of the LSTM unit is connected to the Softmax layer of the content extraction unit's neural network, which outputs text probability distribution data. Text content is then extracted from the text probability distribution data using an argmax function and output.
The system further comprises a second training module for calculating a coordinate loss between the sets of coordinates and the corresponding label coordinates; if the coordinate loss is less than or equal to the detection loss threshold, obtaining a trained text detection model; if the coordinate loss is greater than the detection loss threshold, adjusting the weight parameters of the text detection model according to the coordinate loss and continuing to train the text detection model on the images in the training set to obtain a new coordinate loss; if the new coordinate loss is less than or equal to the detection loss threshold, obtaining a trained text detection model; and if the new coordinate loss is greater than the coordinate loss of the previous step and cannot be reduced, outputting the text detection model with the minimum coordinate loss.
The system also comprises an output selection module used for outputting the corresponding text content according to the setting.
The output selection module is connected with the text recognition module.
According to requirements, the user can select the text in the largest area and/or the text in the smallest area found by text detection, or select a text area manually. The user may also select particular text content, such as outputting only numbers. The user may also select different combinations as desired.
The size of each region is determined by calculating the perimeter obtained from the group of four-point coordinates corresponding to that region.
Taking as an example a user who needs to output only the price on a label, the user may choose to output only the numbers in the largest area. The neural networks for text detection and recognition produce a large amount of data, much of it redundant. In this case, the non-numeric parts must first be filtered out with a regular expression. The perimeter of each text area is then calculated from the (four-point) coordinates output by the text detection part, and finally the text in the area with the largest perimeter (after the non-numeric parts have been filtered out) is selected as the final output.
According to the present application, preprocessing the RGB image reduces interference from complex text backgrounds, and using MobileNet enables fast, low-power text detection and recognition on a mobile platform. Preprocessing the image effectively reduces the interference produced by complex environments in various scenes, which benefits text recognition in real application scenarios. By combining text detection and recognition into one process, the method can directly process an image of a complex scene to obtain the target text, realizing end-to-end recognition that can be applied directly to different scenes. By optimizing for the mobile platform and reducing the amount of computation with a lightweight neural network, even mobile devices with low processing performance can process quickly and occupy less storage space. Modifying the structure of the neural network based on MobileNet makes it lighter, meeting the mobile terminal's requirements for speed and low power consumption. Adding DropConnect regularization between layers randomly drops weights between hidden-layer nodes, regularizing the network structure and reducing the interdependence between nodes; this strengthens the generalization ability of the network, avoids overfitting, and allows accurate prediction results in a variety of complex scenes.
The above description is only for the preferred embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.