CN111461105A - Text recognition method and device - Google Patents

Text recognition method and device

Info

Publication number
CN111461105A
Authority
CN
China
Prior art keywords
text
image
network model
determining
characteristic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910108577.7A
Other languages
Chinese (zh)
Other versions
CN111461105B (en)
Inventor
刘聪海
陈亮亮
方清
曾晓嘉
淦小健
朱正一
崔子玲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SF Technology Co Ltd
Original Assignee
SF Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SF Technology Co Ltd filed Critical SF Technology Co Ltd
Priority to CN201910108577.7A priority Critical patent/CN111461105B/en
Publication of CN111461105A publication Critical patent/CN111461105A/en
Application granted granted Critical
Publication of CN111461105B publication Critical patent/CN111461105B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V20/63 Scene text, e.g. street names
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition

Abstract

The application discloses a text recognition method, a text recognition device and a storage medium. The text recognition device acquires an image to be recognized that contains a target text; determines a characteristic text region from the image to be recognized according to a preset text recognition network model and preset characteristic words; extracts text information from the characteristic text region according to a trained text extraction network model, wherein the text extraction network model consists of four CNN modules, one RNN module and one CTC module; and finally determines the target text according to the text information. In this scheme, a text region related to the characteristic words is first cropped out of the image to be recognized, text information is then extracted from that characteristic text region according to the text extraction network model, and the target text is recognized from the text information.

Description

Text recognition method and device
Technical Field
The present application relates to the field of image recognition, and in particular, to a text recognition method and apparatus.
Background
A natural scene image is an image whose picture contains other scene content in addition to text, and extracting text of a specified type from such an image is difficult.
For example, existing map systems often need the house number information of each unit building in a residential area to meet practical requirements. For express delivery services, if an electronic map can provide high-precision positioning of building numbers, labor cost can be greatly reduced and delivery speed increased. However, because extracting house number text (text of a specified type) from natural scene images is difficult, the information often has to be collected manually, either from natural scene images containing it or during on-site visits.
Disclosure of Invention
The embodiment of the application provides a text recognition method and device, which are used for automatically acquiring a target text from an image.
In one aspect, the present application provides a text recognition method, including:
acquiring an image to be identified containing a target text;
determining a characteristic text region from the image to be recognized according to a preset text recognition network model and preset characteristic words;
extracting text information from the characteristic text region according to a trained text extraction network model, wherein the text extraction network model consists of four Convolutional Neural Network (CNN) modules, a Recurrent Neural Network (RNN) module and a Connectionist Temporal Classification (CTC) module;
and determining the target text according to the text information.
Optionally, the determining a feature text region from the image to be recognized according to a preset text recognition network model and preset feature words includes:
determining a text region from the image to be recognized according to the text recognition network model;
and determining the characteristic text region from the text region according to the characteristic words.
Optionally, the determining the target text according to the text information includes:
mapping the text information to a trained high-dimensional space model to obtain word distances between the feature words and a plurality of sub texts;
and determining the sub text with the minimum word distance as the target text.
Optionally, before mapping the text information into the trained high-dimensional spatial model, the method further includes:
and training the high-dimensional space model according to training samples, wherein the training samples are samples with known word distances.
Optionally, before determining the feature text region from the image to be recognized according to the preset text recognition network model and the preset feature word, the method further includes:
detecting the image to be recognized according to a preset angle detection model to obtain the inclination angle of the image to be recognized;
carrying out angle adjustment on the image to be recognized according to the inclination angle to obtain an adjusted image to be recognized;
the determining a characteristic text region from the image to be recognized according to a preset text recognition network model and preset characteristic words comprises the following steps:
and determining a characteristic text region from the adjusted image to be recognized according to a preset text recognition network model and preset characteristic words.
Optionally, before extracting text information from the feature text region according to the trained text extraction network model, the method further includes:
and constructing the text extraction network model, wherein the text extraction network model consists of four Convolutional Neural Network (CNN) modules, a Recurrent Neural Network (RNN) module and a CTC module.
And training the text extraction network model according to the CTC loss value to obtain the trained text extraction network model.
Optionally, the CNN module includes a convolutional layer, a first BN layer, a first ReLU layer, and a max pooling layer, and the RNN module includes a flatten layer, a first fully-connected layer, a second BN layer, a second ReLU layer, a dropout layer, a first bidirectional RNN layer, a second fully-connected layer, a second bidirectional RNN layer, a third fully-connected layer, and a softmax layer.
Optionally, after determining the sub-text in the text information closest in word distance to the feature word as the target text, the method further includes:
and extracting the target text.
Optionally, the target text is house number text information.
In one aspect, the present application further provides a text recognition apparatus, including:
the acquiring unit is used for acquiring an image to be recognized containing a target text;
the first determining unit is used for determining a characteristic text region from the image to be recognized according to a preset text recognition network model and preset characteristic words;
the first extraction unit is used for extracting text information from the characteristic text region according to the trained text extraction network model;
and the second determining unit is used for determining the sub-text in the text information closest in word distance to the characteristic word as the target text.
Optionally, the first determining unit is specifically configured to:
determining a text region from the image to be recognized according to the text recognition network model;
and determining the characteristic text region from the text region according to the characteristic words.
Optionally, the second determining unit is specifically configured to:
mapping the text information to a trained high-dimensional space model to obtain word distances between the feature words and a plurality of sub texts;
and determining the sub text with the minimum word distance as the target text.
Optionally, the apparatus further comprises:
and the first training unit is used for training the high-dimensional space model according to training samples, wherein the training samples are samples with known word distances.
Optionally, the apparatus further comprises:
the detection unit is used for detecting the image to be recognized according to a preset angle detection model to obtain the inclination angle of the image to be recognized;
the adjusting unit is used for carrying out angle adjustment on the image to be recognized according to the inclination angle to obtain an adjusted image to be recognized;
the first determining unit is specifically configured to:
and determining a characteristic text region from the adjusted image to be recognized according to a preset text recognition network model and preset characteristic words.
Optionally, the apparatus further comprises:
the construction unit is used for constructing the text extraction network model, wherein the text extraction network model consists of four Convolutional Neural Network (CNN) modules, a Recurrent Neural Network (RNN) module and a CTC module; and
the second training unit is used for training the text extraction network model according to the CTC loss value to obtain the trained text extraction network model.
Optionally, the CNN module includes a convolutional layer, a first BN layer, a first ReLU layer, and a max pooling layer, and the RNN module includes a flatten layer, a first fully-connected layer, a second BN layer, a second ReLU layer, a dropout layer, a first bidirectional RNN layer, a second fully-connected layer, a second bidirectional RNN layer, a third fully-connected layer, and a softmax layer.
Optionally, the apparatus further comprises:
and the second extraction unit is used for extracting the target text.
Optionally, the target text is house number text information.
In addition, a storage medium is further provided, where multiple instructions are stored, and the instructions are suitable for being loaded by a processor to perform steps in any text recognition method provided in an embodiment of the present application.
In the embodiment of the application, a text recognition device acquires an image to be recognized containing a target text; then determines a characteristic text region from the image to be recognized according to a preset text recognition network model and preset characteristic words; extracts text information from the characteristic text region according to a trained text extraction network model, wherein the text extraction network model consists of four CNN modules, one RNN module and one CTC module; and finally determines the target text according to the text information. This scheme can automatically acquire the target text from the image, and because its text extraction network model combines CNN modules, an RNN module and a CTC module, the text information in the text region can be accurately identified, improving the accuracy of target text recognition.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic view of an application scenario of a text recognition method according to an embodiment of the present application;
fig. 2 is a schematic flowchart of a text recognition method according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a text extraction network model provided in an embodiment of the present application;
fig. 4 is another schematic flowchart of a text recognition method according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a text recognition apparatus according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of another text recognition apparatus provided in an embodiment of the present application;
fig. 7 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In the description that follows, specific embodiments of the present application will be described with reference to steps and symbols executed by one or more computers, unless otherwise indicated. Accordingly, these steps and operations will at times be referred to as being performed by a computer, meaning that a processing unit of the computer operates on electronic signals representing data in a structured form. These operations transform the data or maintain it at locations in the computer's memory system, which may be reconfigured or otherwise altered in a manner well known to those skilled in the art. The data maintains a data structure, namely a physical location in memory with particular characteristics defined by the data format. However, while the principles of the application are described in the foregoing terms, this is not intended as a limitation to the specific embodiments shown, and those of ordinary skill in the art will recognize that various of the steps and operations described below may also be implemented in hardware.
The principles of the present application may be employed in numerous other general-purpose or special-purpose computing, communication environments or configurations. Examples of well known computing systems, environments, and configurations that may be suitable for use with the application include, but are not limited to, hand-held telephones, personal computers, servers, multiprocessor systems, microcomputer-based systems, mainframe-based computers, and distributed computing environments that include any of the above systems or devices.
The terms "first", "second", and "third", etc. in this application are used to distinguish between different objects and not to describe a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions.
The embodiment of the application provides a text recognition method, a text recognition device and a storage medium.
The text recognition device can be integrated in a server. It can automatically recognize a target text of a specified type from an image with high accuracy. The text recognition method includes:
acquiring an image to be recognized containing a target text; determining a characteristic text region from the image to be recognized according to a preset text recognition network model and preset characteristic words; extracting text information from the characteristic text region according to the trained text extraction network model; and determining the sub-text in the text information closest in word distance to the characteristic word as the target text.
The target text mentioned in the embodiment of the application may be house number text information, and the image to be recognized is a natural scene image containing the house number text information.
Referring to fig. 1, fig. 1 is a schematic view of a specific scene of a text recognition method according to an embodiment of the present application.
Referring to fig. 2, fig. 2 is a schematic flow chart of a text recognition method according to an embodiment of the present application.
The method comprises the following specific processes:
201. and acquiring an image to be recognized containing the target text.
The image to be recognized in the embodiment of the application may be a natural scene image containing house number text information; the image may be collected by couriers of a logistics company taking photos, or in other ways, which is not limited here.
In some embodiments, because photographing may leave the image to be recognized rotated, the image needs to be "straightened" to improve the accuracy of the detection result, as follows:
a. and detecting the image to be identified according to a preset angle detection model to obtain the inclination angle of the image to be detected.
Specifically, the angle detection model adopts the parameters of a VGG16 (a classic convolutional neural network architecture) pre-trained model as initialization parameters and achieves end-to-end training and detection with the limited collected training samples. On this basis, a deep residual module from the ResNet (a classic convolutional neural network architecture) model is added to make the model more effective, so that sparse matrices are clustered into denser sub-matrices while the deep network is fully utilized to extract features, improving detection accuracy.
After the acquired image to be recognized is input into the angle detection model, the inclination angle of the image is detected, for example 90°, 180°, 270° or 0°. A detected angle of 0° indicates that the image is not tilted, in which case no angle adjustment is needed; in the other cases, angle adjustment is needed.
b. And adjusting the angle of the image to be recognized according to the inclination angle to obtain the adjusted image to be recognized.
For example, if the image is detected to be inclined by 30 °, the image needs to be rotated by 30 ° in the opposite direction to "correct" the image, so as to obtain a "corrected" image to be recognized.
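As an illustration, a minimal sketch of this correction step is given below, assuming OpenCV is available; the `straighten` helper and the rotation sign convention are assumptions for the example, not the patent's implementation.

```python
import cv2

def straighten(image, tilt_deg):
    """Undo a detected tilt by rotating the image in the opposite
    direction (the sign convention here is an assumption)."""
    if tilt_deg == 0:
        return image  # 0 degrees: not tilted, no adjustment needed
    h, w = image.shape[:2]
    # build a rotation matrix about the image center and apply it
    m = cv2.getRotationMatrix2D((w / 2, h / 2), -tilt_deg, 1.0)
    return cv2.warpAffine(image, m, (w, h))
```

For the 90°, 180° and 270° classes output by the angle detection model, a lossless `numpy.rot90` would serve equally well.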
202. And determining a characteristic text region from the image to be recognized according to a preset text recognition network model and preset characteristic words.
The text recognition network model adopts a YOLO (You Only Look Once) network, and the region the algorithm needs to detect is the picture region containing house number text information.
The YOLO neural network divides the input image into S × S grid cells and then predicts B bounding boxes for each cell. Each bounding box contains 5 predicted values: x, y, w, h and confidence, where x and y are the predicted center coordinates of the bounding box, w and h are the predicted width and height of the bounding box, and confidence is the confidence of the category to which the bounding box belongs.
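To make the shape of these predictions concrete, here is a small sketch of decoding such an output tensor; the grid size, box count and threshold are illustrative assumptions, not values fixed by the patent.

```python
import numpy as np

S, B = 7, 2  # grid size and boxes per cell, assumed for illustration

def decode(pred: np.ndarray, threshold: float = 0.5):
    """Turn an (S, S, B, 5) YOLO output into a list of boxes.

    Each box carries the 5 predicted values (x, y, w, h, confidence)
    described above; low-confidence boxes are discarded.
    """
    boxes = []
    for row in range(S):
        for col in range(S):
            for b in range(B):
                x, y, w, h, conf = pred[row, col, b]
                if conf >= threshold:
                    boxes.append((x, y, w, h, float(conf)))
    return boxes
```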
For example, in this embodiment, the target text to be extracted is house number text information, and the feature words may be numbers, letters, and characters such as "door", "house", "number", "building" and "unit".
In some embodiments, determining the feature text region from the image to be recognized according to the preset text recognition network model and the preset feature words specifically includes:
a. and determining a text region from the image to be recognized according to the text recognition network model.
The angle-adjusted image to be recognized is input into the text recognition network model, which then automatically selects or crops out the picture regions containing text (text regions) from the image to be recognized, i.e., divides the image into "text regions" and "non-text regions"; there are generally multiple text regions.
b. And determining the characteristic text region from the text region according to the characteristic words.
After the text regions are determined from the image to be recognized, they need to be further processed to remove interfering text information, reduce interference from irrelevant text, and improve detection accuracy.
Specifically, the determined text regions may be input into the YOLO neural network corresponding to the feature words, which screens out the feature text regions corresponding to the feature words from the multiple text regions, i.e., distinguishes the two types of regions, "house number" and "non-house-number", within the "text regions".
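Putting steps a and b together, the selection stage might look like the following sketch; `text_detector` and `feature_classifier` are hypothetical wrappers around the two networks described above.

```python
def find_feature_text_regions(image, text_detector, feature_classifier):
    """Two-stage region selection: detect all text regions, then keep
    those the feature-word network labels as house-number regions."""
    text_regions = text_detector.detect(image)      # step a
    return [region for region in text_regions       # step b
            if feature_classifier.predict(region) == "house_number"]
```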
203. And extracting text information from the characteristic text region according to the trained text extraction network model, wherein the text extraction network model consists of four CNN modules, one RNN module and one CTC module.
Before this step, the text extraction network model needs to be constructed and trained, specifically:
a. and constructing a text extraction network model.
The text extraction network model consists of four Convolutional Neural Network (CNN) modules, a Recurrent Neural Network (RNN) module and a CTC (Connectionist Temporal Classification) module, where each CNN module is a block and the RNN module is an out_block. The text extraction network model, shown in FIG. 3, is thus block1-block2(mp)-block3-block4(mp)-out_block-CTC, where each block consists of a Conv2D (convolution layer with 3 × 3 kernel)-BN-ReLU-maxpool (max pooling layer) unit, mp indicates that the maxpool layer is used in that block, and the out_block is organized as flatten-dense (fully-connected layer)-BN-ReLU-dropout-biRNN (bidirectional RNN)-dense-biRNN-dense-softmax.
In addition, the CTC module is used for the backward training of the text extraction network model: it measures the difference between the output obtained after the input data passes through the model and the true output data, and produces the final result.
The text extraction network model comprises multiple network modules, and the inventors' experiments show that the accuracy of its extraction results is higher than that of other existing text extraction network models.
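As a concrete reading of this architecture, the following Keras sketch assembles the block1-block2(mp)-block3-block4(mp)-out_block stack described above; the filter counts, input size and unit widths are assumptions for illustration, and a per-column reshape stands in for the flatten layer so that the width axis survives as the CTC time axis.

```python
from tensorflow import keras
from tensorflow.keras import layers

def block(x, filters, mp):
    """One CNN module: Conv2D(3x3) - BN - ReLU, plus maxpool when mp."""
    x = layers.Conv2D(filters, 3, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    return layers.MaxPooling2D()(x) if mp else x

def build_text_extraction_model(height=32, width=256, num_classes=100):
    inp = keras.Input(shape=(height, width, 1))
    x = block(inp, 64, mp=False)   # block1
    x = block(x, 128, mp=True)     # block2(mp)
    x = block(x, 256, mp=False)    # block3
    x = block(x, 256, mp=True)     # block4(mp)
    # out_block: flatten-dense-BN-ReLU-dropout-biRNN-dense-biRNN-dense-softmax;
    # each image column becomes one timestep of the sequence
    x = layers.Permute((2, 1, 3))(x)           # put width (time) first
    x = layers.Reshape((width // 4, -1))(x)    # flatten height x channels
    x = layers.Dense(256)(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    x = layers.Dropout(0.25)(x)
    x = layers.Bidirectional(layers.SimpleRNN(128, return_sequences=True))(x)
    x = layers.Dense(128)(x)
    x = layers.Bidirectional(layers.SimpleRNN(128, return_sequences=True))(x)
    # one extra output class for the CTC blank symbol
    out = layers.Dense(num_classes + 1, activation="softmax")(x)
    return keras.Model(inp, out)
```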
b. And training the text extraction network model according to the CTC loss value to obtain the trained text extraction network model.
And then training the text extraction network model according to the CTC loss value corresponding to the CTC module, wherein the specific method comprises the following steps:
after the text extraction network model is constructed, pass is set to CTC pass (loss value). The classifier CTC outputs the recognition probability distribution of each frame sequence characteristic data through a CTC layer, and the training process of the CTC is realized through
Figure BDA0001950435240000091
Adjusting the value of w to maximize the target value, provided that
Figure BDA0001950435240000092
Can be obtained according to the reverse propagation
Figure BDA0001950435240000093
Wherein z represents the corresponding output label, x is the cross-cut sequence, w is the weight,
Figure BDA0001950435240000094
denotes the probability of outputting k at time t, and p denotes the sequence.
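A sketch of wiring this loss up in Keras follows, under the assumption that every sample uses the full time axis and label width (real training code would pass the true lengths); `build_text_extraction_model` refers to the earlier sketch.

```python
import tensorflow as tf
from tensorflow import keras

def ctc_loss(y_true, y_pred):
    """CTC loss: the negative of the log-likelihood ln p(z|x) above."""
    batch = tf.shape(y_pred)[0]
    input_len = tf.ones([batch, 1], dtype="int32") * tf.shape(y_pred)[1]
    label_len = tf.ones([batch, 1], dtype="int32") * tf.shape(y_true)[1]
    return keras.backend.ctc_batch_cost(y_true, y_pred, input_len, label_len)

model = build_text_extraction_model()
model.compile(optimizer="adam", loss=ctc_loss)  # trained on (image, label) pairs
```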
204. And determining a target text according to the text information.
After text information is extracted from the feature text region through the previous steps, the house number text information still needs to be picked out through natural language processing (i.e., the target text obtained), because the scene is complex and text content such as advertisements and sign text is mixed into the selected regions.
In some embodiments, determining the target text according to the text information specifically includes:
a. and mapping the text information to the trained high-dimensional space model to obtain word distances between the feature words and the plurality of subfiles.
Before mapping the text information into the trained high-dimensional space model, we need to train the high-dimensional space model according to training samples, wherein the training samples are samples with known word distances.
The high-dimensional space model in this application adopts an idea similar to the n-gram (a language model). The n-gram is generally used for word-frequency statistics over large vocabularies and large corpora, and it only counts the occurrence probability of each word.
Since the target vocabulary (target text) of this task is small (containing only numbers, the 26 letters, and a limited set of characters such as "door", "house", "number", "building" and "unit"), the correct arrangement order of the target text cannot be distinguished by simple statistical probability, so directly using an n-gram to recognize the house number text would not be accurate. This scheme therefore improves on it: the target numbers, English characters and Chinese characters are mapped into a high-dimensional space, and the words they form are used as word vectors to calculate distances, so that the correct ordering of the target text can be obtained and the accuracy of target text extraction improved.
In this step, all the characters of the text information need to be mapped into the trained high-dimensional space model, and then words formed by the characters are used as word vectors (sub-texts) to calculate distances.
b. And determining the sub text with the minimum word distance as the target text.
After multiple distance values are obtained, the sub-text corresponding to the vector with the minimum distance value is taken as the target text, i.e., the sub-text closest in word distance to "house number" (the highest-confidence match) is taken as the recognition result.
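The following sketch shows this word-vector comparison under two assumptions stated in the comments: `embed` (a hypothetical interface) returns the trained high-dimensional vector of a character, and a sub-text vector is formed by averaging its character vectors.

```python
import numpy as np

def pick_target_text(sub_texts, embed, feature_vec):
    """Return the sub-text whose word vector lies closest to the
    house-number feature vector (minimum word distance wins)."""
    def word_vec(text):
        # assumption: a word vector is the mean of its character vectors
        return np.mean([embed(ch) for ch in text], axis=0)
    distances = [np.linalg.norm(word_vec(t) - feature_vec) for t in sub_texts]
    return sub_texts[int(np.argmin(distances))]
```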
In some embodiments, after the target text is determined, it needs to be extracted and sent to a map system, where it is placed at the position corresponding to the text, so that the electronic map in the map system becomes more accurate, facilitating courier delivery, saving courier delivery time, and so on.
In the embodiment of the application, a text recognition device acquires an image to be recognized containing a target text; then determines a characteristic text region from the image to be recognized according to a preset text recognition network model and preset characteristic words; extracts text information from the characteristic text region according to a trained text extraction network model, wherein the text extraction network model consists of four CNN modules, an RNN module and a CTC module; and finally determines the target text according to the text information. In this scheme, a text region related to the characteristic words is cropped out of the image to be recognized, text information is extracted from that characteristic text region according to the text extraction network model, and the target text is then recognized from the text information.
Referring to fig. 4, fig. 4 is another schematic flow chart of the text recognition method according to the embodiment of the present application, and the specific flow of the method may be as follows:
401. and acquiring an image to be recognized containing house plate text information.
The image to be recognized in the embodiment of the application may be a natural scene image containing house number text information; the image may be collected by couriers of a logistics company taking photos, or in other ways, which is not limited here.
402. And detecting the image to be recognized according to a preset angle detection model to obtain the inclination angle of the image to be recognized.
Specifically, the angle detection model adopts the parameters of a VGG16 classic pre-trained model as initialization parameters and achieves end-to-end training and detection with the limited collected training samples. On this basis, a deep residual module from the ResNet model is added to make the model more effective, so that sparse matrices are clustered into denser sub-matrices while the deep network is fully utilized to extract features, improving detection accuracy.
After the acquired image to be recognized is input into the angle detection model, the inclination angle of the image is detected, for example 90°, 180°, 270° or 0°. A detected angle of 0° indicates that the image is not tilted, in which case no angle adjustment is needed; in the other cases, angle adjustment is needed.
403. And adjusting the angle of the image to be recognized according to the inclination angle to obtain the adjusted image to be recognized.
For example, if the image is detected to be inclined by 40 °, the image needs to be rotated by 40 ° in the reverse direction to "correct" the image, and the image to be recognized after "correction" is obtained.
404. And determining a text region from the image to be recognized according to the text recognition network model.
The text recognition network model adopts a YOLO network, and the region the algorithm needs to detect is the picture region containing house number text information.
The YOLO neural network divides the input image into S × S grid cells and then predicts B bounding boxes for each cell. Each bounding box contains 5 predicted values: x, y, w, h and confidence, where x and y are the predicted center coordinates of the bounding box, w and h are the predicted width and height of the bounding box, and confidence is the confidence of the category to which the bounding box belongs.
For example, in this embodiment, the target text to be extracted is house number text information, and the feature words may be numbers, letters, and characters such as "door", "house", "number", "building" and "unit".
In this embodiment, the angle-adjusted image to be recognized is input into the text recognition network model, which then automatically selects or crops out the picture regions containing text (i.e., text regions) from the image to be recognized, that is, divides the image into "text regions" and "non-text regions"; there are generally multiple text regions.
405. And determining a house number text region from the text regions according to the characteristic words.
After the text regions are determined from the image to be recognized, they need to be further processed to remove interfering text information, reduce interference from irrelevant text, and improve detection accuracy.
Specifically, the determined text regions may be input into the YOLO neural network corresponding to the feature words, which screens out the feature text regions corresponding to the feature words from the multiple text regions, i.e., distinguishes the two types of regions, "house number" and "non-house-number", within the "text regions".
406. And extracting text information from the house number text region according to the trained text extraction network model.
Before this step, the text extraction network model needs to be constructed and trained, specifically:
a. and constructing a text extraction network model.
The text extraction network model comprises four Convolutional Neural Network (CNN) modules, a Recurrent Neural Network (RNN) module and a CTC module, where each CNN module is a block and the RNN module is an out_block. The text extraction network model, shown in FIG. 3, is thus block1-block2(mp)-block3-block4(mp)-out_block-CTC, where each block consists of a Conv2D (convolution layer with 3 × 3 kernel)-BN-ReLU-maxpool (max pooling layer) unit, mp indicates that the maxpool layer is used in that block, and the out_block is organized as flatten-dense (fully-connected layer)-BN-ReLU-dropout-biRNN (bidirectional RNN)-dense-biRNN-dense-softmax.
In addition, the CTC module is used for the backward training of the text extraction network model: it measures the difference between the output obtained after the input data passes through the model and the true output data, and produces the final result.
The text extraction network model comprises multiple network modules, and the inventors' experiments show that the accuracy of its extraction results is higher than that of other existing text extraction network models.
b. And training the text extraction network model according to the CTC loss value to obtain the trained text extraction network model.
And then training the text extraction network model according to the CTC loss value corresponding to the CTC module, wherein the specific method comprises the following steps:
After the text extraction network model is constructed, the loss is set to the CTC loss (loss value). The classifier CTC outputs, through the CTC layer, the recognition probability distribution of each frame of sequence feature data. Training the CTC amounts to adjusting the value of $w$ to maximize the objective

$$O = \sum_{(x,z)} \ln p(z \mid x),$$

provided that

$$p(z \mid x) = \sum_{\pi \in \mathcal{B}^{-1}(z)} \prod_{t=1}^{T} y_{\pi_t}^{t},$$

from which back-propagation yields

$$\frac{\partial \ln p(z \mid x)}{\partial w} = \sum_{t} \sum_{k} \frac{\partial \ln p(z \mid x)}{\partial y_{k}^{t}} \, \frac{\partial y_{k}^{t}}{\partial w},$$

where $z$ denotes the corresponding output label, $x$ the input feature sequence, $w$ the weights, $y_{k}^{t}$ the probability of outputting $k$ at time $t$, and $\pi$ a path (label sequence) that collapses to $z$ under the many-to-one map $\mathcal{B}$.
407. And mapping the text information to the trained high-dimensional space model to obtain word distances between the feature words and a plurality of sub-texts.
After text information is extracted from the feature text region through the previous steps, the house number text information still needs to be sorted out through natural language processing, because the scene is complex and text content such as advertisements and sign text is mixed into the selected regions.
Before mapping text information into a trained high-dimensional space model, the high-dimensional space model needs to be trained according to training samples, wherein the training samples are samples with known word distances.
The high-dimensional space model in this application adopts an idea similar to the n-gram. The n-gram is generally used for word-frequency statistics over large vocabularies and large corpora, and it only counts the occurrence probability of each word.
Since the target vocabulary of this task (house number text information) is small (containing only numbers, the 26 letters, and a limited set of characters such as "door", "house", "number", "building" and "unit"), the correct arrangement order of the house number text cannot be distinguished by simple statistical probability, so directly using an n-gram for house number text recognition would not be accurate. This technical scheme therefore improves on it: the target numbers, English characters and Chinese characters are mapped into a high-dimensional space, and the words they form are used as word vectors to calculate distances, so that the correct ordering of the house number text information can be obtained and the accuracy of its extraction improved.
In this step, all the characters of the text information need to be mapped into the trained high-dimensional space model, and then words formed by the characters are used as word vectors (sub-texts) to calculate distances.
408. And determining the sub-text with the minimum word distance as the house number text information.
After multiple distance values are obtained, the sub-text corresponding to the vector with the minimum distance value is taken as the house number text information, i.e., the sub-text closest in word distance to "house number" (the highest-confidence match) is taken as the recognition result.
409. And extracting the house number text information.
After the house number text information is determined, it needs to be extracted and sent to a map system, where it is placed at the position corresponding to the text, so that the electronic map in the map system becomes more accurate, facilitating courier delivery and saving courier delivery time.
In the embodiment of the application, a text recognition device acquires an image to be recognized containing house number text information; then determines a characteristic text region from the image to be recognized according to a preset text recognition network model and preset characteristic words; extracts text information from the characteristic text region according to a trained text extraction network model, wherein the text extraction network model consists of four CNN modules, one RNN module and one CTC module; and finally determines the target text according to the text information. In this scheme, a text region related to the characteristic words is cropped out of the image to be recognized, text information is extracted from that characteristic text region according to the text extraction network model, and the target text is then recognized from the text information.
In order to better implement the text recognition method provided by the embodiment of the present application, the embodiment of the present application further provides a text recognition apparatus, and the text recognition apparatus may be specifically integrated in a server. The meanings of the nouns are the same as those in the text recognition method, and specific implementation details can refer to the description in the method embodiment.
Referring to fig. 5, fig. 5 is a schematic structural diagram of a text recognition apparatus according to an embodiment of the present disclosure, in which the text recognition apparatus 500 includes an obtaining unit 501, a first determining unit 502, a first extracting unit 503, and a second determining unit 504, as follows:
an obtaining unit 501, configured to obtain an image to be recognized that includes a target text;
a first determining unit 502, configured to determine a feature text region from the image to be recognized according to a preset text recognition network model and preset feature words;
a first extraction unit 503, configured to extract text information from the feature text region according to a trained text extraction network model, where the text extraction network model is composed of four convolutional neural network CNN modules, a recurrent neural network RNN module, and a CTC module;
a second determining unit 504, configured to determine the target text according to the text information.
In some embodiments, the first determining unit is specifically configured to:
determining a text region from the image to be recognized according to the text recognition network model;
and determining the characteristic text region from the text region according to the characteristic words.
In some embodiments, the second determining unit is specifically configured to:
mapping the text information to a trained high-dimensional space model to obtain word distances between the feature words and a plurality of sub texts;
and determining the sub text with the minimum word distance as the target text.
Referring to fig. 6, in some embodiments, the apparatus 500 further includes:
a first training unit 505, configured to train the high-dimensional space model according to training samples, where the training samples are samples with known word distances.
In some embodiments, the apparatus 500 further comprises:
the detection unit 506 is configured to detect the image to be recognized according to a preset angle detection model to obtain the inclination angle of the image to be recognized;
the adjusting unit 507 is configured to perform angle adjustment on the image to be recognized according to the inclination angle to obtain an adjusted image to be recognized;
the first determining unit 502 is specifically configured to:
and determining a characteristic text region from the adjusted image to be recognized according to a preset text recognition network model and preset characteristic words.
In some embodiments, the apparatus 500 further comprises:
a constructing unit 508, configured to construct the text extraction network model, where the text extraction network model is composed of four convolutional neural network CNN modules, a recurrent neural network RNN module, and a CTC module.
A second training unit 509, configured to train the text extraction network model according to the CTC loss values, so as to obtain the trained text extraction network model.
Optionally, the CNN module includes a convolutional layer, a first BN layer, a first ReLU layer, and a max pooling layer, and the RNN module includes a flatten layer, a first fully-connected layer, a second BN layer, a second ReLU layer, a dropout layer, a first bidirectional RNN layer, a second fully-connected layer, a second bidirectional RNN layer, a third fully-connected layer, and a softmax layer.
In some embodiments, the apparatus 500 further comprises:
a second extracting unit 510, configured to extract the target text.
In some embodiments, the target text is house number text information.
In the embodiment of the application, the obtaining unit 501 obtains an image to be recognized that includes a target text; then the first determining unit 502 determines a feature text region from the image to be recognized according to a preset text recognition network model and preset feature words; the first extraction unit 503 extracts text information from the feature text region according to the trained text extraction network model, where the text extraction network model is composed of four CNN modules, one RNN module, and one CTC module; finally, the second determining unit 504 determines the target text from the text information. In this scheme, a text region related to the feature words is cropped out of the image to be recognized, text information is extracted from that feature text region according to the text extraction network model, and the target text is then recognized from the text information.
Referring to fig. 7, the present application provides a server 700, which may include a processor 701 with one or more processing cores, a memory 702 of one or more computer-readable storage media, a Radio Frequency (RF) circuit 703, a power supply 704, an input unit 705, and a display unit 706. Those skilled in the art will appreciate that the server structure shown in FIG. 7 is not limiting and may include more or fewer components than shown, combine some components, or arrange the components differently. Wherein:
the processor 701 is a control center of the server, connects various parts of the entire server using various interfaces and lines, and performs various functions of the server and processes data by running or executing software programs and/or modules stored in the memory 702 and calling data stored in the memory 702, thereby performing overall monitoring of the server. Optionally, processor 701 may include one or more processing cores; preferably, the processor 701 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 701.
The memory 702 may be used to store software programs and modules, and the processor 701 executes various functional applications and data processing by operating the software programs and modules stored in the memory 702.
The RF circuit 703 may be used for receiving and transmitting signals during transmission and reception of information.
The server also includes a power supply 704 (e.g., a battery) for powering the various components; preferably, the power supply is logically coupled to the processor 701 via a power management system, which manages charging, discharging and power consumption.
The server may further include an input unit 705, and the input unit 705 may be used to receive input numeric or character information and generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.
The server may also include a display unit 706, which display unit 706 may be used to display information input by or provided to the user, as well as various graphical user interfaces of the server, which may be made up of graphics, text, icons, video, and any combination thereof. Specifically, in this embodiment, the processor 701 in the server loads the executable file corresponding to the process of one or more application programs into the memory 702 according to the following instructions, and the processor 701 runs the application program stored in the memory 702, thereby implementing various functions as follows:
acquiring an image to be identified containing a target text;
determining a characteristic text region from the image to be recognized according to a preset text recognition network model and preset characteristic words;
extracting text information from the characteristic text region according to a trained text extraction network model, wherein the text extraction network model consists of four Convolutional Neural Network (CNN) modules, a Recurrent Neural Network (RNN) module and a CTC module;
and determining the target text according to the text information.
As can be seen from the above, in the embodiment of the present application, the text recognition apparatus obtains the image to be recognized including the target text; then determines a characteristic text region from the image to be recognized according to a preset text recognition network model and preset characteristic words; extracts text information from the characteristic text region according to a trained text extraction network model, wherein the text extraction network model consists of four CNN modules, one RNN module and one CTC module; and finally determines the target text according to the text information. In this scheme, a text region related to the characteristic words is cropped out of the image to be recognized, text information is extracted from that characteristic text region according to the text extraction network model, and the target text is then recognized from the text information.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by instructions or by associated hardware controlled by the instructions, which may be stored in a computer readable storage medium and loaded and executed by a processor.
To this end, embodiments of the present application provide a storage medium, in which a plurality of instructions are stored, where the instructions can be loaded by a processor to execute the steps in any one of the text recognition methods provided in the embodiments of the present application. For example, the instructions may perform the steps of:
acquiring an image to be identified containing a target text;
determining a characteristic text region from the image to be recognized according to a preset text recognition network model and preset characteristic words;
extracting text information from the characteristic text region according to a trained text extraction network model, wherein the text extraction network model consists of four Convolutional Neural Network (CNN) modules, a Recurrent Neural Network (RNN) module and a CTC module;
and determining the target text according to the text information.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
Wherein the storage medium may include: read Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, and the like.
Since the instructions stored in the storage medium can execute the steps in any text recognition method provided in the embodiments of the present application, the beneficial effects that can be achieved by any text recognition method provided in the embodiments of the present application can be achieved, which are detailed in the foregoing embodiments and will not be described herein again.
The text recognition method, the text recognition device, and the storage medium provided by the embodiments of the present application are described in detail above, and specific examples are applied herein to explain the principles and implementations of the present application, and the description of the above embodiments is only used to help understand the method and the core ideas of the present application; meanwhile, for those skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (10)

1. A text recognition method, comprising:
acquiring an image to be identified containing a target text;
determining a characteristic text region from the image to be recognized according to a preset text recognition network model and preset characteristic words;
extracting text information from the characteristic text region according to a trained text extraction network model, wherein the text extraction network model consists of four Convolutional Neural Network (CNN) modules, a Recurrent Neural Network (RNN) module and a CTC module;
and determining the target text according to the text information.
2. The method according to claim 1, wherein the determining a characteristic text region from the image to be recognized according to a preset text recognition network model and preset characteristic words comprises:
determining a text region from the image to be recognized according to the text recognition network model;
and determining the characteristic text region from the text region according to the characteristic words.
3. The method of claim 1, wherein the determining the target text from the text information comprises:
mapping the text information to a trained high-dimensional space model to obtain word distances between the feature words and a plurality of sub texts;
and determining the sub text with the minimum word distance as the target text.
4. The method of claim 3, wherein prior to mapping the textual information into the trained high-dimensional spatial model, the method further comprises:
and training the high-dimensional space model according to training samples, wherein the training samples are samples with known word distances.
5. The method according to claim 1, wherein before determining the characteristic text region from the image to be recognized according to the preset text recognition network model and the preset characteristic words, the method further comprises:
detecting the image to be recognized according to a preset angle detection model to obtain the inclination angle of the image to be recognized;
carrying out angle adjustment on the image to be recognized according to the inclination angle to obtain an adjusted image to be recognized;
the determining a characteristic text region from the image to be recognized according to a preset text recognition network model and preset characteristic words comprises the following steps:
and determining a characteristic text region from the adjusted image to be recognized according to a preset text recognition network model and preset characteristic words.
6. The method of claim 1, wherein before extracting text information from the feature text region according to the trained text extraction network model, the method further comprises:
constructing the text extraction network model;
and training the text extraction network model according to the CTC loss value to obtain the trained text extraction network model.
7. The method of any one of claims 1 to 6, wherein the CNN module consists of a convolutional layer, a first BN layer, a first ReLU layer, and a max pooling layer, and wherein the RNN module consists of a flatten layer, a first fully-connected layer, a second BN layer, a second ReLU layer, a dropout layer, a first bidirectional RNN layer, a second fully-connected layer, a second bidirectional RNN layer, a third fully-connected layer, and a softmax layer.
8. The method of any one of claims 1 to 6, wherein the target text is house number text information.
9. A text recognition apparatus, comprising:
the acquiring unit is used for acquiring an image to be recognized containing a target text;
the first determining unit is used for determining a characteristic text region from the image to be recognized according to a preset text recognition network model and preset characteristic words;
the first extraction unit is used for extracting text information from the characteristic text region according to a trained text extraction network model, wherein the text extraction network model consists of four Convolutional Neural Network (CNN) modules, a Recurrent Neural Network (RNN) module and a CTC module;
and the second determining unit is used for determining the target text according to the text information.
10. The apparatus according to claim 9, wherein the first determining unit is specifically configured to:
determining a text region from the image to be recognized according to the text recognition network model;
and determining the characteristic text region from the text region according to the characteristic words.
CN201910108577.7A 2019-01-18 2019-01-18 Text recognition method and device Active CN111461105B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910108577.7A CN111461105B (en) 2019-01-18 2019-01-18 Text recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910108577.7A CN111461105B (en) 2019-01-18 2019-01-18 Text recognition method and device

Publications (2)

Publication Number Publication Date
CN111461105A true CN111461105A (en) 2020-07-28
CN111461105B CN111461105B (en) 2023-11-28

Family

ID=71685044

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910108577.7A Active CN111461105B (en) 2019-01-18 2019-01-18 Text recognition method and device

Country Status (1)

Country Link
CN (1) CN111461105B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112016561A (en) * 2020-09-01 2020-12-01 中国银行股份有限公司 Text recognition method and related equipment
CN112633422A (en) * 2021-03-10 2021-04-09 北京易真学思教育科技有限公司 Training method of text recognition model, text recognition method, device and equipment
CN112949455A (en) * 2021-02-26 2021-06-11 武汉天喻信息产业股份有限公司 Value-added tax invoice identification system and method
CN114241467A (en) * 2021-12-21 2022-03-25 北京有竹居网络技术有限公司 Text recognition method and related equipment thereof

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9263036B1 (en) * 2012-11-29 2016-02-16 Google Inc. System and method for speech recognition using deep recurrent neural networks
WO2016197381A1 (en) * 2015-06-12 2016-12-15 Sensetime Group Limited Methods and apparatus for recognizing text in an image
CN107203606A (en) * 2017-05-17 2017-09-26 西北工业大学 Text detection and recognition methods under natural scene based on convolutional neural networks
WO2018010657A1 (en) * 2016-07-15 2018-01-18 北京市商汤科技开发有限公司 Structured text detection method and system, and computing device
CN107688808A (en) * 2017-08-07 2018-02-13 电子科技大学 A kind of quickly natural scene Method for text detection
CN108446621A (en) * 2018-03-14 2018-08-24 平安科技(深圳)有限公司 Bank slip recognition method, server and computer readable storage medium
CN108960073A (en) * 2018-06-05 2018-12-07 大连理工大学 Cross-module state image steganalysis method towards Biomedical literature

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9263036B1 (en) * 2012-11-29 2016-02-16 Google Inc. System and method for speech recognition using deep recurrent neural networks
WO2016197381A1 (en) * 2015-06-12 2016-12-15 Sensetime Group Limited Methods and apparatus for recognizing text in an image
CN107636691A (en) * 2015-06-12 2018-01-26 商汤集团有限公司 Method and apparatus for identifying the text in image
WO2018010657A1 (en) * 2016-07-15 2018-01-18 北京市商汤科技开发有限公司 Structured text detection method and system, and computing device
CN107203606A (en) * 2017-05-17 2017-09-26 西北工业大学 Text detection and recognition methods under natural scene based on convolutional neural networks
CN107688808A (en) * 2017-08-07 2018-02-13 电子科技大学 A kind of quickly natural scene Method for text detection
CN108446621A (en) * 2018-03-14 2018-08-24 平安科技(深圳)有限公司 Bank slip recognition method, server and computer readable storage medium
CN108960073A (en) * 2018-06-05 2018-12-07 大连理工大学 Cross-module state image steganalysis method towards Biomedical literature

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112016561A (en) * 2020-09-01 2020-12-01 中国银行股份有限公司 Text recognition method and related equipment
CN112016561B (en) * 2020-09-01 2023-08-04 中国银行股份有限公司 Text recognition method and related equipment
CN112949455A (en) * 2021-02-26 2021-06-11 武汉天喻信息产业股份有限公司 Value-added tax invoice identification system and method
CN112949455B (en) * 2021-02-26 2024-04-05 武汉天喻信息产业股份有限公司 Value-added tax invoice recognition system and method
CN112633422A (en) * 2021-03-10 2021-04-09 北京易真学思教育科技有限公司 Training method of text recognition model, text recognition method, device and equipment
CN112633422B (en) * 2021-03-10 2021-06-22 北京易真学思教育科技有限公司 Training method of text recognition model, text recognition method, device and equipment
CN114241467A (en) * 2021-12-21 2022-03-25 北京有竹居网络技术有限公司 Text recognition method and related equipment thereof

Also Published As

Publication number Publication date
CN111461105B (en) 2023-11-28

Similar Documents

Publication Publication Date Title
US11238310B2 (en) Training data acquisition method and device, server and storage medium
CN111461105A (en) Text recognition method and device
CN110472675B (en) Image classification method, image classification device, storage medium and electronic equipment
EP3910551A1 (en) Face detection method, apparatus, device, and storage medium
CN111488985B (en) Deep neural network model compression training method, device, equipment and medium
CN110619051B (en) Question sentence classification method, device, electronic equipment and storage medium
CN111259940A (en) Target detection method based on space attention map
CN112329826A (en) Training method of image recognition model, image recognition method and device
EP4113376A1 (en) Image classification model training method and apparatus, computer device, and storage medium
CN111475613A (en) Case classification method and device, computer equipment and storage medium
CN113392253B (en) Visual question-answering model training and visual question-answering method, device, equipment and medium
CN115658955B (en) Cross-media retrieval and model training method, device, equipment and menu retrieval system
CN110489747A (en) A kind of image processing method, device, storage medium and electronic equipment
CN116226785A (en) Target object recognition method, multi-mode recognition model training method and device
CN113947188A (en) Training method of target detection network and vehicle detection method
CN110489423A (en) A kind of method, apparatus of information extraction, storage medium and electronic equipment
CN111079374A (en) Font generation method, device and storage medium
CN114495113A (en) Text classification method and training method and device of text classification model
CN113255501A (en) Method, apparatus, medium, and program product for generating form recognition model
CN113095072A (en) Text processing method and device
CN115359322A (en) Target detection model training method, device, equipment and storage medium
CN110442714B (en) POI name normative evaluation method, device, equipment and storage medium
CN113947140A (en) Training method of face feature extraction model and face feature extraction method
CN114119972A (en) Model acquisition and object processing method and device, electronic equipment and storage medium
CN113536876A (en) Image recognition method and related device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant