CN110533027B - Text detection and identification method and system based on mobile equipment - Google Patents

Text detection and identification method and system based on mobile equipment

Info

Publication number
CN110533027B
CN110533027B CN201910663009.3A CN201910663009A
Authority
CN
China
Prior art keywords
text
loss
recognition
image
coordinates
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910663009.3A
Other languages
Chinese (zh)
Other versions
CN110533027A (en)
Inventor
陈曦
龚小龙
陈奕臻
麻志毅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Advanced Institute of Information Technology AIIT of Peking University
Hangzhou Weiming Information Technology Co Ltd
Original Assignee
Advanced Institute of Information Technology AIIT of Peking University
Hangzhou Weiming Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Advanced Institute of Information Technology AIIT of Peking University, Hangzhou Weiming Information Technology Co Ltd filed Critical Advanced Institute of Information Technology AIIT of Peking University
Priority to CN201910663009.3A priority Critical patent/CN110533027B/en
Publication of CN110533027A publication Critical patent/CN110533027A/en
Application granted granted Critical
Publication of CN110533027B publication Critical patent/CN110533027B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/62Text, e.g. of license plates, overlay texts or captions on TV images
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application discloses a text detection and recognition method and system based on a mobile device, comprising the following steps: preprocessing an RGB image to obtain a first image; calculating a plurality of sets of coordinates in the first image; extracting each second image corresponding to each set of coordinates; extracting image features of the second image using a MobileNet convolutional network; converting the image features into a feature sequence using a feature mapping function; after regularizing the feature sequence with DropConnect, extracting a text prediction value of the feature sequence using a bidirectional long short-term memory unit; processing the text prediction value with a softmax function to obtain text probability distribution data; and extracting text content from the text probability distribution data with an argmax function and outputting it. Preprocessing the RGB image effectively reduces the text background interference produced by complex environments in various scenes, which benefits text recognition in real application scenarios; by using MobileNet, the system achieves rapid, low-power text detection and recognition on a mobile platform.

Description

Text detection and identification method and system based on mobile equipment
Technical Field
The application is applicable to the mobile internet industry, and particularly relates to a text detection and recognition method and system based on a mobile device.
Background
Existing technical solutions of this kind generally address only text recognition and the related algorithms and cannot reduce the interference affecting text in complex scenes, even though scene interference is a common problem across many scenarios. Text recognition in complex natural scenes usually suffers reduced accuracy because of environmental factors such as the illumination angle of the light source, the exposure of the picture, and the influence of background colors and text fonts. Moreover, text in complex scenes cannot be recognized accurately enough to avoid recognizing non-text objects. At the same time, many existing neural-network text recognition technologies have long processing times and high power consumption, and do not meet the requirement for fast, low-power operation on mobile terminals.
In summary, it is desirable to provide a method and system that can recognize text in complex scenes with high accuracy while running quickly and with low power consumption on a mobile terminal.
Disclosure of Invention
In order to solve the above problems, the present application provides a text detection and recognition method and system based on a mobile device.
In one aspect, the present application provides a text detection and recognition method based on a mobile device, including:
preprocessing an RGB image to obtain a first image;
calculating a plurality of sets of coordinates in the first image;
extracting each second image corresponding to each group of coordinates;
extracting image features of the second image by using a MobileNet convolution network;
converting the image features into a sequence of features using a feature mapping function;
after regularizing the feature sequence using DropConnect, extracting a text prediction value of the feature sequence using a long short-term memory unit;
processing the text prediction value by using a softmax function to obtain text probability distribution data;
and extracting text content from the text probability distribution data by using an argmax function and outputting the text content.
Preferably, after extracting the text prediction value of the feature sequence using the long short-term memory unit, the method further includes:
calculating the recognition loss between the text prediction value and the corresponding label text using a CTC loss function;
if the recognition loss is less than or equal to the recognition loss threshold, obtaining a trained text recognition model;
if the recognition loss is greater than the recognition loss threshold, automatically adjusting the weight parameters in the text recognition model according to the recognition loss and continuing to train the text recognition model with the images in the training set to obtain a new recognition loss; if the new recognition loss is less than or equal to the recognition loss threshold, obtaining the trained text recognition model; and if the new recognition loss is greater than the recognition loss obtained in the previous step and can no longer be reduced, outputting the text recognition model with the minimum recognition loss.
Preferably, after calculating the plurality of sets of coordinates in the first image, the method further includes:
calculating a coordinate loss between the plurality of sets of coordinates and the corresponding label coordinates;
if the coordinate loss is less than or equal to the detection loss threshold, obtaining a trained text detection model;
if the coordinate loss is greater than the detection loss threshold, adjusting the weight parameters in the text detection model according to the coordinate loss and continuing to train the text detection model with the images in the training set to obtain a new coordinate loss; if the new coordinate loss is less than or equal to the detection loss threshold, obtaining the trained text detection model; and if the new coordinate loss is greater than the coordinate loss obtained in the previous step and can no longer be reduced, outputting the text detection model with the minimum coordinate loss.
Preferably, the preprocessing the RGB image to obtain a first image includes:
converting the RGB image into an HSV image;
carrying out histogram equalization on the HSV image to obtain an equalized image;
and converting colors lower than the gray threshold value in the equalized image into white to obtain a first image.
Preferably, calculating the plurality of sets of coordinates in the first image comprises:
extracting the Y-axis coordinate of each line, the horizontal translation amount of the text, and the height of the text area;
and calculating the four-point coordinates of the text area of each line using the Y-axis coordinate of each line, the horizontal translation amount of the text, and the height of the text area, to obtain the plurality of sets of coordinates.
Preferably, the images in the training set are images labeled with labels.
Preferably, the label is a data unit comprising: the label text, the label coordinates, and the image in the area of the first image corresponding to the label coordinates.
In a second aspect, the present application provides a mobile device-based text detection and recognition system, comprising:
the preprocessing module is used for preprocessing the RGB image to obtain a first image;
the text detection module is used for calculating a plurality of groups of coordinates in the first image;
the text extraction module is used for extracting each second image corresponding to each group of coordinates;
the text recognition module is used for extracting image features of the second image using a MobileNet convolutional network; converting the image features into a feature sequence using a feature mapping function; after processing the feature sequence with DropConnect, extracting a text prediction value of the feature sequence using a long short-term memory unit; processing the text prediction value with a softmax function to obtain text probability distribution data; and extracting text content from the text probability distribution data with an argmax function and outputting the text content.
Preferably, the system further comprises a first training module for calculating the recognition loss between the text prediction value and its corresponding label text using a CTC loss function; if the recognition loss is less than or equal to the recognition loss threshold, obtaining a trained text recognition model; if the recognition loss is greater than the recognition loss threshold, automatically adjusting the weight parameters in the text recognition model according to the recognition loss and continuing to train the text recognition model with the images in the training set to obtain a new recognition loss; if the new recognition loss is less than or equal to the recognition loss threshold, obtaining the trained text recognition model; and if the new recognition loss is greater than the recognition loss obtained in the previous step and can no longer be reduced, outputting the text recognition model with the minimum recognition loss.
Preferably, the system further comprises a second training module, configured to calculate the coordinate loss between the plurality of sets of coordinates and the corresponding label coordinates; if the coordinate loss is less than or equal to the detection loss threshold, obtain a trained text detection model; if the coordinate loss is greater than the detection loss threshold, adjust the weight parameters in the text detection model according to the coordinate loss and continue to train the text detection model with the images in the training set to obtain a new coordinate loss; if the new coordinate loss is less than or equal to the detection loss threshold, obtain the trained text detection model; and if the new coordinate loss is greater than the coordinate loss obtained in the previous step and can no longer be reduced, output the text detection model with the minimum coordinate loss.
The advantages of the present application are: preprocessing the RGB image reduces the interference of complex text backgrounds, and using MobileNet enables rapid, low-power text detection and recognition on a mobile platform. Preprocessing the image effectively reduces the interference produced by complex environments in various scenes, which benefits text recognition in real application scenarios.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to denote like parts throughout the drawings. In the drawings:
FIG. 1 is a schematic diagram illustrating steps of a mobile device based text detection and recognition method provided herein;
FIG. 2 is a schematic structural diagram of a text detection portion of a mobile device-based text detection and recognition method provided in the present application;
FIG. 3 is a schematic structural diagram of a text recognition part during training of a text detection and recognition method based on a mobile device according to the present application;
FIG. 4 is a schematic diagram illustrating a trained text recognition portion of a mobile device-based text detection and recognition method according to the present application;
FIG. 5 is a schematic structural diagram of a mobile device based text detection and recognition system provided herein;
FIG. 6 is a schematic diagram of a connection between a text recognition module and a first training module of a mobile device based text detection and recognition system provided herein;
FIG. 7 is a schematic diagram of the prediction value calculation unit of a mobile device-based text detection and recognition system connected to the first training module during training;
FIG. 8 is a schematic structural diagram of the prediction value calculation unit of a mobile device-based text detection and recognition system connected to the content extraction unit after training.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
According to an embodiment of the present application, a text detection and recognition method based on a mobile device is provided, as shown in fig. 1, including:
S101, preprocessing an RGB image to obtain a first image;
S102, calculating a plurality of sets of coordinates in the first image;
S103, extracting the second images corresponding to the sets of coordinates;
S104, extracting image features of the second image using a MobileNet convolutional network;
S105, converting the image features into a feature sequence using a feature mapping function;
S106, after the feature sequence is regularized using DropConnect, extracting a text prediction value of the (regularized) feature sequence using a long short-term memory unit;
S107, processing the text prediction value using a softmax function to obtain text probability distribution data;
and S108, extracting text content from the text probability distribution data using an argmax function, and outputting the text content. The overall flow is sketched below.
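The flow of steps S101 to S108 can be outlined with the following Python sketch (illustrative only; the callables passed in, such as `preprocess`, `detect`, `crop`, `recognize` and `decode`, are hypothetical stand-ins for the components described below and are not part of the original disclosure):

```python
def detect_and_recognize(rgb_image, preprocess, detect, crop, recognize, decode):
    """High-level flow of steps S101-S108; each stage is supplied by the caller."""
    first_image = preprocess(rgb_image)            # S101: preprocessing
    coords = detect(first_image)                   # S102: sets of four-point coordinates
    texts = []
    for quad in coords:
        second_image = crop(first_image, quad)     # S103: extract the second image
        probs = recognize(second_image)            # S104-S107: features -> probability data
        texts.append(decode(probs))                # S108: argmax extraction of text content
    return texts
```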
The long short-term memory unit is a deep bidirectional long short-term memory unit.
After the long short-term memory unit extracts the text prediction value of the feature sequence, the method further comprises the following steps:
calculating the recognition loss between the text prediction value and the corresponding label text using a CTC loss function;
if the recognition loss is less than or equal to the recognition loss threshold value, obtaining a trained text recognition model;
if the recognition loss is greater than the recognition loss threshold, automatically adjusting the weight parameters in the text recognition model according to the recognition loss and continuing to train the text recognition model with the images in the training set to obtain a new recognition loss; if the new recognition loss is less than or equal to the recognition loss threshold, obtaining the trained text recognition model; and if the new recognition loss is greater than the recognition loss obtained in the previous step and can no longer be reduced, outputting the text recognition model with the minimum recognition loss.
That is, when the recognition loss is greater than the recognition loss threshold, the weight parameters in the text recognition model are automatically adjusted according to the recognition loss and the text recognition model is trained again with the images in the training set to obtain a new recognition loss. If the new recognition loss is smaller than the recognition loss of the previous step but still greater than the recognition loss threshold, the weight parameters are adjusted again according to the recognition loss and training continues with the images in the training set until the newly obtained recognition loss is less than or equal to the recognition loss threshold, at which point the trained text recognition model is obtained. If the new recognition loss can no longer be reduced, that is, it remains greater than the recognition loss of the previous step, the text recognition model with the minimum recognition loss is output.
The resulting (output) text recognition model is installed on a mobile device for text recognition.
"The new recognition loss cannot be reduced" means that the recognition loss obtained in several consecutive iterations is higher than the previously obtained recognition loss.
The number of consecutive times may be set.
The recognition loss threshold may be set.
The CTC loss is calculated using Connectionist Temporal Classification (CTC).
The text prediction value is used to compute the training loss value.
The text recognition model is used for calculating a predicted value of the text.
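A minimal PyTorch training-loop sketch for the recognition model, assuming a model that returns CTC log-probabilities and a data loader that yields encoded label texts (the function name, threshold, patience and learning-rate values are illustrative assumptions, not values from the disclosure):

```python
import torch
import torch.nn as nn

def train_recognition(model, train_loader, loss_threshold=0.05, patience=5, max_epochs=100):
    """Hypothetical training loop: CTC recognition loss with a loss threshold and a
    'loss can no longer be reduced' stopping rule. `model` must return log-probabilities
    of shape (T, N, num_classes); `train_loader` yields (images, targets, target_lengths)."""
    ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    best_loss, best_state, bad_rounds = float("inf"), None, 0

    for _ in range(max_epochs):
        epoch_loss = 0.0
        for images, targets, target_lengths in train_loader:
            log_probs = model(images)                                  # (T, N, num_classes)
            input_lengths = torch.full((log_probs.size(1),), log_probs.size(0),
                                       dtype=torch.long)
            loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
            optimizer.zero_grad()
            loss.backward()            # weight parameters are adjusted according to the loss
            optimizer.step()
            epoch_loss += loss.item()
        epoch_loss /= max(len(train_loader), 1)

        if epoch_loss < best_loss:
            best_loss, best_state, bad_rounds = epoch_loss, model.state_dict(), 0
        else:
            bad_rounds += 1
        if epoch_loss <= loss_threshold:    # recognition loss below threshold: model trained
            break
        if bad_rounds >= patience:          # loss can no longer be reduced
            break                           # keep the model with the minimum recognition loss
    model.load_state_dict(best_state)
    return model
```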
After calculating the plurality of sets of coordinates in the first image, the method further comprises:
calculating a coordinate loss between the plurality of sets of coordinates and the corresponding label coordinates;
if the coordinate loss is less than or equal to the detection loss threshold value, obtaining a trained text detection model;
if the coordinate loss is greater than the detection loss threshold, adjusting the weight parameters in the text detection model according to the coordinate loss and continuing to train the text detection model with the images in the training set to obtain a new coordinate loss; if the new coordinate loss is less than or equal to the detection loss threshold, obtaining the trained text detection model; and if the new coordinate loss is greater than the coordinate loss obtained in the previous step and can no longer be reduced, outputting the text detection model with the minimum coordinate loss.
That is, when the coordinate loss is greater than the detection loss threshold, the weight parameters in the text detection model are automatically adjusted according to the coordinate loss and the text detection model is trained again with the images in the training set to obtain a new coordinate loss. If the new coordinate loss is smaller than the coordinate loss of the previous step but still greater than the detection loss threshold, the weight parameters are adjusted again according to the coordinate loss and training continues with the images in the training set until the newly obtained coordinate loss is less than or equal to the detection loss threshold, at which point the trained text detection model is obtained. If the new coordinate loss can no longer be reduced, that is, it remains greater than the coordinate loss of the previous step, the text detection model with the minimum coordinate loss is output.
The resulting (output) text detection model is installed on a mobile device for text detection.
"The new coordinate loss cannot be reduced" means that the coordinate loss obtained in several consecutive iterations is higher than the previously obtained coordinate loss.
The number of consecutive times may be set.
The detection loss threshold may be set.
The coordinate loss between the plurality of sets of coordinates and the corresponding label coordinates may be calculated with loss functions such as cross-entropy loss, Smooth L1 loss, or softmax loss.
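For example, the coordinate loss could be computed with the Smooth L1 loss as follows (a sketch; the tensor shapes and names are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

# Hypothetical tensors of shape (num_text_lines, 8): four (x, y) corner points per line.
pred_coords = torch.randn(4, 8)
label_coords = torch.randn(4, 8)
coordinate_loss = F.smooth_l1_loss(pred_coords, label_coords)  # Smooth L1 coordinate loss
```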
The text detection model is used to calculate a plurality of sets of coordinates in the first image.
Preprocessing the RGB image to obtain the first image comprises:
converting the RGB image into an HSV image;
carrying out histogram equalization on the HSV image to obtain an equalized image;
and converting colors lower than the gray threshold value in the equalized image into white to obtain a first image.
The grayscale threshold may be set.
HSV is a color space different from RGB, composed of Hue (H), Saturation (S) and Value (V). Hue is measured as an angle, with red at 0 degrees; saturation represents how deep or pale a color is; and value represents the brightness of the color. Converting to the HSV color space makes it convenient for subsequent algorithms to operate directly on the brightness of the image. Histogram equalization is a method for correcting images that are over-bright or under-exposed, either globally or locally, and it also enhances the contrast of the picture for machine recognition.
The input RGB picture is converted into HSV channels so that subsequent processing can operate directly on the gray level of the image. Histogram equalization is then applied to the resulting HSV image to obtain an image with a more balanced gray distribution; this operation effectively reduces over-exposed and shaded areas in the image. Finally, a color-filtering method is applied with the gray threshold: taking pixels with a V value smaller than 46 as black pixels, for example, the pixels with a V value greater than 46 are converted into white to obtain the first image.
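A minimal OpenCV sketch of this preprocessing, assuming the V-channel threshold of 46 mentioned above (the function and variable names are illustrative, not from the disclosure):

```python
import cv2
import numpy as np

def preprocess(rgb_image: np.ndarray, v_threshold: int = 46) -> np.ndarray:
    """Produce the 'first image': equalized brightness with the bright background turned white."""
    # Convert RGB to HSV so the brightness (V) channel can be processed directly.
    hsv = cv2.cvtColor(rgb_image, cv2.COLOR_RGB2HSV)
    _, _, v = cv2.split(hsv)
    # Histogram equalization balances over-exposed and shaded regions.
    v = cv2.equalizeHist(v)
    # Keep dark pixels (treated as text) black and convert brighter pixels to white.
    return np.where(v <= v_threshold, 0, 255).astype(np.uint8)
```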
Calculating the plurality of sets of coordinates in the first image comprises:
extracting Y-axis coordinates of each line, horizontal translation amount of the text and height of the text area;
and calculating four-point coordinates of the text area of each line by using the Y-axis coordinates of each line, the horizontal translation amount of the text and the height of the text area to obtain a plurality of groups of coordinates.
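As an illustration, one way such a conversion might look is sketched below; how the right edge of each line is obtained is not specified above, so `line_width` is a hypothetical parameter:

```python
def line_to_quad(y_top: float, x_shift: float, height: float, line_width: float):
    """Turn one detected line (Y-axis coordinate, horizontal translation amount of the text,
    text-region height) into four corner coordinates, clockwise from top-left."""
    x_left, x_right = x_shift, x_shift + line_width
    return [(x_left, y_top), (x_right, y_top),
            (x_right, y_top + height), (x_left, y_top + height)]
```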
For the text detection part, image features of the first image are extracted using a MobileNet convolutional network, the image features are converted into a feature sequence using a feature mapping function, the feature sequence is regularized using DropConnect, and a text position prediction value is extracted from the (regularized) feature sequence using a bidirectional long short-term memory unit. After the text position prediction value passes through the fully connected layer, the Y-axis coordinate of each line, the horizontal translation amount of the text, and the height of the text area are output.
The images in the training set are labeled with labels.
The labels are manually annotated.
The label is a data unit comprising: the label text, the label coordinates, and the image in the area of the first image corresponding to the label coordinates.
The label text comprises the text content corresponding to every text area.
The label coordinates include the coordinates of every text area.
The image in the area of the first image corresponding to the label coordinates is the image within the area of the first image bounded by the text-area coordinates.
After extracting and outputting the text content from the text probability distribution data by using the argmax function, the method further includes:
outputting the corresponding text content according to the setting.
According to requirements, the user can select the text in the largest region and/or the smallest region found by text detection, or select a text region manually. The user may also select specific text content, such as outputting only numbers, and may combine these options as needed.
The size of each region is determined by the perimeter calculated from the four-point coordinates of the group corresponding to that region.
Take the case where the user only needs the price on a label: the user may choose to output only the numbers in the largest region. The neural networks for text detection and recognition produce more data than is needed, with much redundancy. In this case, the non-numeric parts are first filtered out with a regular expression; then the perimeter of each text region is calculated from the coordinates (four-point coordinates) output by the text detection part; finally, the text (with the non-numeric parts filtered out) in the region with the largest perimeter is found and output.
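A small Python sketch of this output selection (the `detections` structure and the regular expression are illustrative assumptions):

```python
import re

def select_price_text(detections):
    """Pick the numeric text from the text region with the largest perimeter.
    `detections` is a list of (quad, text) pairs, where `quad` holds the four
    corner points of a region and `text` its recognized content."""
    def perimeter(quad):
        return sum(((quad[i][0] - quad[(i + 1) % 4][0]) ** 2 +
                    (quad[i][1] - quad[(i + 1) % 4][1]) ** 2) ** 0.5 for i in range(4))

    _, best_text = max(detections, key=lambda d: perimeter(d[0]))
    # Filter out the non-numeric parts with a regular expression before output.
    return "".join(re.findall(r"[0-9.]+", best_text))
```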
When a convolutional neural network and a recurrent neural network composed of bidirectional long short-term memory units are used to process text, the two networks must be connected to realize prediction from the image end to the text end.
As shown in fig. 2, for the neural network of the text detection part, MobileNet is used as the convolutional network part; the last pooling layer and all layers after it in MobileNet are removed, a two-layer deep bidirectional long short-term memory unit (Deep BiLSTM) is connected from the last convolutional layer, and the output of the long short-term memory unit is then connected to a fully connected layer using DropConnect regularization.
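A PyTorch sketch of this detection architecture, assuming torchvision's MobileNetV2 as the backbone; the layer sizes, the height-averaging feature mapping and the three-value head are illustrative assumptions, not exact values from the disclosure:

```python
import torch
import torch.nn as nn
import torchvision

class DropConnectLinear(nn.Linear):
    """Fully connected layer with DropConnect: weights (not activations) are randomly dropped."""
    def __init__(self, in_features, out_features, drop_prob=0.5):
        super().__init__(in_features, out_features)
        self.drop_prob = drop_prob

    def forward(self, x):
        w = nn.functional.dropout(self.weight, self.drop_prob, self.training)
        return nn.functional.linear(x, w, self.bias)

class TextDetectionNet(nn.Module):
    """MobileNet convolutional features -> two-layer deep BiLSTM -> DropConnect FC head."""
    def __init__(self, hidden=128):
        super().__init__()
        # Only the convolutional feature layers are used; the pooling/classifier part is dropped.
        self.backbone = torchvision.models.mobilenet_v2(weights=None).features
        self.bilstm = nn.LSTM(input_size=1280, hidden_size=hidden, num_layers=2,
                              bidirectional=True, batch_first=True)
        # Each sequence step predicts (Y coordinate, horizontal shift, region height).
        self.head = DropConnectLinear(2 * hidden, 3, drop_prob=0.5)

    def forward(self, x):
        feat = self.backbone(x)                   # (N, 1280, H', W')
        seq = feat.mean(dim=2).permute(0, 2, 1)   # collapse height -> sequence (N, W', 1280)
        out, _ = self.bilstm(seq)
        return self.head(out)                     # (N, W', 3)
```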
A text detection model for installation on a mobile device for text detection includes a neural network of the text detection portion.
As shown in fig. 3, for the neural network of the text recognition part during training, MobileNet is used as the convolutional network part; the last fully connected layer and the following Softmax layer in MobileNet are removed, the last pooling layer is connected to a two-layer deep bidirectional long short-term memory unit through a feature mapping function with DropConnect regularization, the output of the long short-term memory unit is connected to a transcription layer using Connectionist Temporal Classification (CTC), and finally an output layer is connected.
As shown in fig. 4, after training, when the neural network of the text recognition part is installed on a mobile device for text recognition, MobileNet is used as the convolutional network part; the last fully connected layer and the following Softmax layer in MobileNet are removed, the last pooling layer is connected to a two-layer deep bidirectional long short-term memory unit (Deep BiLSTM) through a feature mapping function with DropConnect regularization, and the output of the long short-term memory unit is connected to a Softmax layer that outputs the text probability distribution data.
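A corresponding PyTorch sketch of the recognition branch at inference time; the feature-mapping step, the class count and the alphabet are illustrative assumptions, and DropConnect (active only during training) is omitted here:

```python
import torch
import torch.nn as nn
import torchvision

class TextRecognitionNet(nn.Module):
    """MobileNet features -> feature sequence -> two-layer deep BiLSTM -> softmax distribution."""
    def __init__(self, num_classes=6625, hidden=256):
        super().__init__()
        self.backbone = torchvision.models.mobilenet_v2(weights=None).features
        self.bilstm = nn.LSTM(1280, hidden, num_layers=2, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, x):
        feat = self.backbone(x)                   # (N, 1280, H', W')
        # Feature mapping: collapse the height axis so each column becomes one sequence step.
        seq = feat.mean(dim=2).permute(0, 2, 1)   # (N, W', 1280)
        out, _ = self.bilstm(seq)
        return torch.softmax(self.fc(out), dim=-1)   # text probability distribution data

def decode(probs, alphabet, blank=0):
    """Greedy argmax decoding in CTC style: collapse repeats and drop blank labels."""
    ids = probs.argmax(dim=-1)[0].tolist()        # best class per step for the first sample
    chars, prev = [], blank
    for i in ids:
        if i != blank and i != prev:
            chars.append(alphabet[i])
        prev = i
    return "".join(chars)
```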
And extracting text content from the text probability distribution data by using an argmax function and outputting the text content.
The method is also capable of outputting four-point coordinates of each line (corresponding to each text region).
The lower the CTC loss, the higher the accuracy of text recognition.
According to an embodiment of the present application, there is also provided a text detection and recognition system based on a mobile device, as shown in fig. 5, including:
the preprocessing module 101 is configured to preprocess the RGB image to obtain a first image;
a text detection module 102, configured to calculate multiple sets of coordinates in the first image;
a text extraction module 103, configured to extract each second image corresponding to each group of coordinates;
a text recognition module 104, configured to extract image features of the second image using a MobileNet convolutional network; convert the image features into a feature sequence using a feature mapping function; after processing the feature sequence with DropConnect, extract a text prediction value of the feature sequence using a long short-term memory unit; process the text prediction value with a softmax function to obtain text probability distribution data; and extract text content from the text probability distribution data with an argmax function and output the text content.
The system further comprises a first training module for calculating the recognition loss between the text prediction value and its corresponding label text using a CTC loss function; if the recognition loss is less than or equal to the recognition loss threshold, a trained text recognition model is obtained; if the recognition loss is greater than the recognition loss threshold, the weight parameters in the text recognition model are automatically adjusted according to the recognition loss and the text recognition model continues to be trained with the images in the training set to obtain a new recognition loss; if the new recognition loss is less than or equal to the recognition loss threshold, the trained text recognition model is obtained; and if the new recognition loss is greater than the recognition loss obtained in the previous step and can no longer be reduced, the text recognition model with the minimum recognition loss is output.
As shown in fig. 6, the text recognition module includes a prediction value calculation unit and a content extraction unit. The prediction value calculation unit is used for calculating the text prediction value of the input second image; the content extraction unit is used for extracting text content according to the text prediction value and outputting the extracted text content. The first training module is used for training the prediction value calculation unit.
As shown in fig. 7, for the neural network of the prediction value calculation unit of the text recognition module during training, MobileNet is used as the convolutional network part; the last fully connected layer and the following Softmax layer in MobileNet are removed, and the last pooling layer is connected to a two-layer deep bidirectional long short-term memory unit (Deep BiLSTM) through a feature mapping function with DropConnect regularization. The output of the long short-term memory unit is connected to the transcription layer using CTC in the first training module, and finally to an output layer.
As shown in fig. 8, after training, for the neural network of the prediction value calculation unit of the text recognition module used for text recognition, MobileNet is used as the convolutional network part; the last fully connected layer and the following Softmax layer in MobileNet are removed, the last pooling layer is connected to a two-layer deep bidirectional long short-term memory unit (Deep BiLSTM) through a feature mapping function with DropConnect regularization, and the output of the long short-term memory unit is connected to the Softmax layer of the content extraction unit, which outputs the text probability distribution data. Text content is then extracted from the text probability distribution data with an argmax function and output.
The system further comprises a second training module for calculating the coordinate loss between the plurality of sets of coordinates and the corresponding label coordinates; if the coordinate loss is less than or equal to the detection loss threshold, a trained text detection model is obtained; if the coordinate loss is greater than the detection loss threshold, the weight parameters in the text detection model are adjusted according to the coordinate loss and the text detection model continues to be trained with the images in the training set to obtain a new coordinate loss; if the new coordinate loss is less than or equal to the detection loss threshold, the trained text detection model is obtained; and if the new coordinate loss is greater than the coordinate loss obtained in the previous step and can no longer be reduced, the text detection model with the minimum coordinate loss is output.
The system also comprises an output selection module used for outputting the corresponding text content according to the setting.
The output selection module is connected with the text recognition module.
According to requirements, the user can select the text in the largest region and/or the smallest region found by text detection, or select a text region manually. The user may also select specific text content, such as outputting only numbers, and may combine these options as needed.
The size of each region is determined by the perimeter calculated from the four-point coordinates of the group corresponding to that region.
Assuming, for example, that the user only needs the price on a label, the user may choose to output only the numbers in the largest region. The neural networks for text detection and recognition produce more data than is needed, with much redundancy. In this case, the non-numeric parts are first filtered out with a regular expression; then the perimeter of each text region is calculated from the coordinates (four-point coordinates) output by the text detection part; finally, the text (with the non-numeric parts filtered out) in the region with the largest perimeter is found and output.
According to the present application, preprocessing the RGB image reduces the interference of complex text backgrounds, and using MobileNet enables rapid, low-power text detection and recognition on a mobile platform. Preprocessing the image effectively reduces the interference produced by complex environments in various scenes, which benefits text recognition in real application scenarios. By combining text detection and recognition into one pipeline, the method can directly process images of complex scenes to obtain the target text, realizing end-to-end recognition that can be applied directly to different scenarios. By optimizing for the mobile platform and reducing the amount of computation with a lightweight neural network, even mobile devices with low processing performance can process quickly while occupying little storage space. Based on MobileNet, the structure of the neural network is made lighter, meeting the mobile terminal's requirements for speed and low power consumption. By adding DropConnect regularization between layers, the weights between hidden-layer nodes are randomly dropped; this regularizes the network structure, reduces the interdependence between nodes, enhances the generalization ability of the network, avoids over-fitting, and allows accurate predictions in a variety of complex scenes.
The above description is only for the preferred embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A text detection and recognition method based on a mobile device, comprising:
preprocessing an RGB image to obtain a first image;
calculating a plurality of sets of coordinates in the first image;
calculating the plurality of sets of coordinates using a text detection model, which specifically comprises: when using a MobileNet convolutional network, removing the last pooling layer and all layers after it in MobileNet, connecting the last convolutional layer to a two-layer deep bidirectional long short-term memory unit, and then connecting the output of the two-layer deep bidirectional long short-term memory unit to a fully connected layer using DropConnect regularization;
extracting each second image corresponding to each group of coordinates;
recognizing the second image by adopting a text recognition model, which specifically comprises the following steps:
extracting image features of the second image by using a MobileNet convolution network;
converting the image features into a sequence of features using a feature mapping function;
after processing the feature sequence using DropConnect, extracting a text prediction value of the feature sequence using a two-layer deep bidirectional long short-term memory unit;
processing the text prediction value by using a softmax function to obtain text probability distribution data;
and extracting text content from the text probability distribution data by using an argmax function and outputting the text content.
2. The mobile device-based text detection and recognition method of claim 1, wherein after extracting the text prediction value of the feature sequence using the two-layer deep bidirectional long short-term memory unit, the method further comprises:
calculating the recognition loss between the text prediction value and the corresponding label text using a CTC loss function;
if the recognition loss is less than or equal to the recognition loss threshold value, obtaining a trained text recognition model;
if the recognition loss is greater than the recognition loss threshold, automatically adjusting the weight parameters in the text recognition model according to the recognition loss and continuing to train the text recognition model with the images in the training set to obtain a new recognition loss; if the new recognition loss is less than or equal to the recognition loss threshold, obtaining the trained text recognition model; and if the new recognition loss is greater than the recognition loss obtained in the previous step and can no longer be reduced, outputting the text recognition model with the minimum recognition loss.
3. A mobile device based text detection and recognition method as recited in claim 1, further comprising, after said computing the sets of coordinates in the first image:
calculating coordinate loss between the multiple groups of coordinates and the corresponding label coordinates;
if the coordinate loss is less than or equal to the detection loss threshold value, obtaining a trained text detection model;
if the coordinate loss is greater than the detection loss threshold, adjusting the weight parameters in the text detection model according to the coordinate loss and continuing to train the text detection model with the images in the training set to obtain a new coordinate loss; if the new coordinate loss is less than or equal to the detection loss threshold, obtaining the trained text detection model; and if the new coordinate loss is greater than the coordinate loss obtained in the previous step and can no longer be reduced, outputting the text detection model with the minimum coordinate loss.
4. The method of claim 1, wherein preprocessing the RGB image to obtain the first image comprises:
converting the RGB image into an HSV image;
carrying out histogram equalization on the HSV image to obtain an equalized image;
and converting colors lower than the gray threshold value in the equalized image into white to obtain a first image.
5. A mobile device based text detection and recognition method as recited in claim 1, wherein said computing sets of coordinates in the first image comprises:
extracting Y-axis coordinates of each line, horizontal translation amount of the text and height of the text area;
and calculating four-point coordinates of the text area of each line by using the Y-axis coordinates of each line, the horizontal translation amount of the text and the height of the text area to obtain a plurality of groups of coordinates.
6. The method as claimed in claim 2 or 3, wherein the images in the training set are images labeled with labels.
7. The mobile device-based text detection and recognition method of claim 6, wherein the label is a data unit comprising: the label text, the label coordinates, and the image in the area of the first image corresponding to the label coordinates.
8. A mobile device based text detection and recognition system, comprising:
the preprocessing module is used for preprocessing the RGB image to obtain a first image;
the text detection module is used for calculating a plurality of sets of coordinates in the first image, wherein, when a MobileNet convolutional network is used, the last pooling layer and all layers after it in MobileNet are removed, the last convolutional layer is connected to a two-layer deep bidirectional long short-term memory unit, and the output of the two-layer deep bidirectional long short-term memory unit is then connected to a fully connected layer through DropConnect regularization;
the text extraction module is used for extracting each second image corresponding to each group of coordinates;
the text recognition module is used for recognizing the second image by extracting image features of the second image using a MobileNet convolutional network; converting the image features into a feature sequence using a feature mapping function; after processing the feature sequence with DropConnect, extracting a text prediction value of the feature sequence using the two-layer deep bidirectional long short-term memory unit; processing the text prediction value with a softmax function to obtain text probability distribution data; and extracting text content from the text probability distribution data with an argmax function and outputting the text content.
9. The mobile device-based text detection and recognition system of claim 8, further comprising a first training module for calculating the recognition loss between the text prediction value and its corresponding label text using a CTC loss function; if the recognition loss is less than or equal to the recognition loss threshold, obtaining a trained text recognition model; if the recognition loss is greater than the recognition loss threshold, automatically adjusting the weight parameters in the text recognition model according to the recognition loss and continuing to train the text recognition model with the images in the training set to obtain a new recognition loss; if the new recognition loss is less than or equal to the recognition loss threshold, obtaining the trained text recognition model; and if the new recognition loss is greater than the recognition loss obtained in the previous step and can no longer be reduced, outputting the text recognition model with the minimum recognition loss.
10. The mobile device-based text detection and recognition system of claim 8, further comprising a second training module for calculating the coordinate loss between the plurality of sets of coordinates and their corresponding label coordinates; if the coordinate loss is less than or equal to the detection loss threshold, obtaining a trained text detection model; if the coordinate loss is greater than the detection loss threshold, adjusting the weight parameters in the text detection model according to the coordinate loss and continuing to train the text detection model with the images in the training set to obtain a new coordinate loss; if the new coordinate loss is less than or equal to the detection loss threshold, obtaining the trained text detection model; and if the new coordinate loss is greater than the coordinate loss obtained in the previous step and can no longer be reduced, outputting the text detection model with the minimum coordinate loss.
CN201910663009.3A 2019-07-22 2019-07-22 Text detection and identification method and system based on mobile equipment Active CN110533027B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910663009.3A CN110533027B (en) 2019-07-22 2019-07-22 Text detection and identification method and system based on mobile equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910663009.3A CN110533027B (en) 2019-07-22 2019-07-22 Text detection and identification method and system based on mobile equipment

Publications (2)

Publication Number Publication Date
CN110533027A CN110533027A (en) 2019-12-03
CN110533027B (en) 2022-09-02

Family

ID=68661769

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910663009.3A Active CN110533027B (en) 2019-07-22 2019-07-22 Text detection and identification method and system based on mobile equipment

Country Status (1)

Country Link
CN (1) CN110533027B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113780276B (en) * 2021-09-06 2023-12-05 成都人人互娱科技有限公司 Text recognition method and system combined with text classification
CN114758179A (en) * 2022-04-19 2022-07-15 电子科技大学 Imprinted character recognition method and system based on deep learning

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106355180A (en) * 2016-09-07 2017-01-25 武汉安可威视科技有限公司 Method for positioning license plates on basis of combination of color and edge features
CN108846841A (en) * 2018-07-02 2018-11-20 北京百度网讯科技有限公司 Display screen quality determining method, device, electronic equipment and storage medium
CN109117848A (en) * 2018-09-07 2019-01-01 泰康保险集团股份有限公司 A kind of line of text character identifying method, device, medium and electronic equipment
CN109325464A (en) * 2018-10-16 2019-02-12 上海翎腾智能科技有限公司 A kind of finger point reading character recognition method and interpretation method based on artificial intelligence
CN109840521A (en) * 2018-12-28 2019-06-04 安徽清新互联信息科技有限公司 A kind of integrated licence plate recognition method based on deep learning
CN109886330A (en) * 2019-02-18 2019-06-14 腾讯科技(深圳)有限公司 Method for text detection, device, computer readable storage medium and computer equipment
CN109978139A (en) * 2019-03-20 2019-07-05 深圳大学 Picture automatically generates method, system, electronic device and the storage medium of description

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106355180A (en) * 2016-09-07 2017-01-25 武汉安可威视科技有限公司 Method for positioning license plates on basis of combination of color and edge features
CN108846841A (en) * 2018-07-02 2018-11-20 北京百度网讯科技有限公司 Display screen quality determining method, device, electronic equipment and storage medium
CN109117848A (en) * 2018-09-07 2019-01-01 泰康保险集团股份有限公司 A kind of line of text character identifying method, device, medium and electronic equipment
CN109325464A (en) * 2018-10-16 2019-02-12 上海翎腾智能科技有限公司 A kind of finger point reading character recognition method and interpretation method based on artificial intelligence
CN109840521A (en) * 2018-12-28 2019-06-04 安徽清新互联信息科技有限公司 A kind of integrated licence plate recognition method based on deep learning
CN109886330A (en) * 2019-02-18 2019-06-14 腾讯科技(深圳)有限公司 Method for text detection, device, computer readable storage medium and computer equipment
CN109978139A (en) * 2019-03-20 2019-07-05 深圳大学 Picture automatically generates method, system, electronic device and the storage medium of description

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Handwritten Chinese Text Recognition Using Separable Multi-Dimensional Recurrent Neural Network; Yi-Chao Wu et al.; 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR); 2018-01-29; full text *
Document Block Image Classification Algorithm Based on Feature Extraction and Machine Learning (基于特征提取和机器学习的文档区块图像分类算法); 李翌昕 et al.; Signal Processing (《信号处理》); 2019-05-25; Vol. 1, No. 5; full text *

Also Published As

Publication number Publication date
CN110533027A (en) 2019-12-03

Similar Documents

Publication Publication Date Title
CN108829826A (en) A kind of image search method based on deep learning and semantic segmentation
CN109145872B (en) CFAR and Fast-RCNN fusion-based SAR image ship target detection method
CN107133622A (en) The dividing method and device of a kind of word
CN112766195B (en) Electrified railway bow net arcing visual detection method
CN109558806A (en) The detection method and system of high score Remote Sensing Imagery Change
CN110555465A (en) Weather image identification method based on CNN and multi-feature fusion
CN111274987B (en) Facial expression recognition method and facial expression recognition device
CN108875619A (en) Method for processing video frequency and device, electronic equipment, computer readable storage medium
CN107704878B (en) Hyperspectral database semi-automatic establishment method based on deep learning
CN110533027B (en) Text detection and identification method and system based on mobile equipment
CN112819858B (en) Target tracking method, device, equipment and storage medium based on video enhancement
CN110503103A (en) A kind of character cutting method in line of text based on full convolutional neural networks
CN112163508A (en) Character recognition method and system based on real scene and OCR terminal
CN109409210B (en) Face detection method and system based on SSD (solid State disk) framework
CN113409355A (en) Moving target identification system and method based on FPGA
CN116110036A (en) Electric power nameplate information defect level judging method and device based on machine vision
CN112990220B (en) Intelligent identification method and system for target text in image
CN114463197A (en) Text recognition method and equipment for power equipment
CN115909336A (en) Text recognition method and device, computer equipment and computer-readable storage medium
US10764471B1 (en) Customized grayscale conversion in color form processing for text recognition in OCR
Zhang et al. A novel approach for binarization of overlay text
CN107424134A (en) Image processing method, device, computer-readable recording medium and computer equipment
KR101600617B1 (en) Method for detecting human in image frame
CN112200008A (en) Face attribute recognition method in community monitoring scene
CN115984133A (en) Image enhancement method, vehicle snapshot method, device and medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
CB02 Change of applicant information

Address after: Room 101, building 1, block C, Qianjiang Century Park, ningwei street, Xiaoshan District, Hangzhou City, Zhejiang Province

Applicant after: Hangzhou Weiming Information Technology Co.,Ltd.

Applicant after: Institute of Information Technology, Zhejiang Peking University

Address before: Room 288-1, 857 Xinbei Road, Ningwei Town, Xiaoshan District, Hangzhou City, Zhejiang Province

Applicant before: Institute of Information Technology, Zhejiang Peking University

Applicant before: Hangzhou Weiming Information Technology Co.,Ltd.

GR01 Patent grant
GR01 Patent grant