CN111461105A - Text recognition method and device - Google Patents

Text recognition method and device

Info

Publication number
CN111461105A
Authority
CN
China
Prior art keywords
text
image
network model
determining
characteristic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910108577.7A
Other languages
Chinese (zh)
Other versions
CN111461105B (en)
Inventor
刘聪海
陈亮亮
方清
曾晓嘉
淦小健
朱正一
崔子玲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SF Technology Co Ltd
Original Assignee
SF Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SF Technology Co Ltd filed Critical SF Technology Co Ltd
Priority to CN201910108577.7A priority Critical patent/CN111461105B/en
Publication of CN111461105A publication Critical patent/CN111461105A/en
Application granted granted Critical
Publication of CN111461105B publication Critical patent/CN111461105B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V20/63 Scene text, e.g. street names
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition

Abstract

The application discloses a text recognition method, a text recognition device and a storage medium. The text recognition device acquires an image to be recognized that contains a target text; determines a characteristic text region from the image to be recognized according to a preset text recognition network model and preset characteristic words; extracts text information from the characteristic text region according to a trained text extraction network model, wherein the text extraction network model consists of four CNN modules, one RNN module and one CTC module; and finally determines the target text according to the text information. In this scheme, a text region related to the characteristic words is first cropped out of the image to be recognized, text information is then extracted from that characteristic text region according to the text extraction network model, and the target text is recognized from the text information.

Description

Text recognition method and device
Technical Field
The present application relates to the field of image recognition, and in particular, to a text recognition method and apparatus.
Background
A natural scene image is an image whose picture contains other scene content in addition to text, and extracting text of a specified type from such an image is difficult.
For example, existing map systems often need the house number information of each unit building in a residential area to meet practical requirements. For express delivery services, if an electronic map can provide high-precision positioning of building numbers, labor cost can be greatly reduced and delivery speed increased. However, because extracting house number text (text of a specified type) from natural scene images is difficult, the information often has to be collected manually, either from natural scene images containing it or during on-site visits.
Disclosure of Invention
The embodiment of the application provides a text recognition method and device, which are used for automatically acquiring a target text from an image.
In one aspect, the present application provides a text recognition method, including:
acquiring an image to be identified containing a target text;
determining a characteristic text region from the image to be recognized according to a preset text recognition network model and preset characteristic words;
extracting text information from the characteristic text region according to a trained text extraction network model, wherein the text extraction network model consists of four Convolutional Neural Network (CNN) modules, a Recurrent Neural Network (RNN) module and a Connectionist Temporal Classification (CTC) module;
and determining the target text according to the text information.
Optionally, the determining a feature text region from the image to be recognized according to a preset text recognition network model and preset feature words includes:
determining a text region from the image to be recognized according to the text recognition network model;
and determining the characteristic text region from the text region according to the characteristic words.
Optionally, the determining the target text according to the text information includes:
mapping the text information to a trained high-dimensional space model to obtain word distances between the feature words and a plurality of sub texts;
and determining the sub text with the minimum word distance as the target text.
Optionally, before mapping the text information into the trained high-dimensional spatial model, the method further includes:
and training the high-dimensional space model according to training samples, wherein the training samples are samples with known word distances.
Optionally, before determining the feature text region from the image to be recognized according to the preset text recognition network model and the preset feature word, the method further includes:
detecting the image to be recognized according to a preset angle detection model to obtain the inclination angle of the image to be recognized;
carrying out angle adjustment on the image to be recognized according to the inclination angle to obtain an adjusted image to be recognized;
the determining a characteristic text region from the image to be recognized according to a preset text recognition network model and preset characteristic words comprises the following steps:
and determining a characteristic text region from the adjusted image to be recognized according to a preset text recognition network model and preset characteristic words.
Optionally, before extracting text information from the feature text region according to the trained text extraction network model, the method further includes:
and constructing the text extraction network model, wherein the text extraction network model consists of four Convolutional Neural Network (CNN) modules, a Recurrent Neural Network (RNN) module and a CTC module.
And training the text extraction network model according to the CTC loss value to obtain the trained text extraction network model.
Optionally, the CNN module includes a convolutional layer, a first BN layer, a first ReLU layer, and a max pooling layer, and the RNN module includes a flatten layer, a first fully-connected layer, a second BN layer, a second ReLU layer, a dropout layer, a first bidirectional RNN layer, a second fully-connected layer, a second bidirectional RNN layer, a third fully-connected layer, and a softmax layer.
Optionally, after determining the sub-text in the text information closest in word distance to the feature word as the target text, the method further includes:
and extracting the target text.
Optionally, the target text is house number text information.
In one aspect, the present application further provides a text recognition apparatus, including:
the acquiring unit is used for acquiring an image to be recognized containing a target text;
the first determining unit is used for determining a characteristic text region from the image to be recognized according to a preset text recognition network model and preset characteristic words;
the first extraction unit is used for extracting text information from the characteristic text region according to the trained text extraction network model;
and the second determining unit is used for determining the sub-text in the text information closest in word distance to the characteristic word as the target text.
Optionally, the first determining unit is specifically configured to:
determining a text region from the image to be recognized according to the text recognition network model;
and determining the characteristic text region from the text region according to the characteristic words.
Optionally, the second determining unit is specifically configured to:
mapping the text information to a trained high-dimensional space model to obtain word distances between the feature words and a plurality of sub texts;
and determining the sub text with the minimum word distance as the target text.
Optionally, the apparatus further comprises:
and the first training unit is used for training the high-dimensional space model according to training samples, wherein the training samples are samples with known word distances.
Optionally, the apparatus further comprises:
the detection unit is used for detecting the image to be recognized according to a preset angle detection model to obtain the inclination angle of the image to be recognized;
the adjusting unit is used for carrying out angle adjustment on the image to be recognized according to the inclination angle to obtain an adjusted image to be recognized;
the first determining unit is specifically configured to:
and determining a characteristic text region from the adjusted image to be recognized according to a preset text recognition network model and preset characteristic words.
Optionally, the apparatus further comprises:
the construction unit is used for constructing the text extraction network model, wherein the text extraction network model consists of four Convolutional Neural Network (CNN) modules, a Recurrent Neural Network (RNN) module and a CTC module; and
the second training unit is used for training the text extraction network model according to the CTC loss value to obtain the trained text extraction network model.
Optionally, the CNN module includes a convolutional layer, a first BN layer, a first ReLU layer, and a max pooling layer, and the RNN module includes a flatten layer, a first fully-connected layer, a second BN layer, a second ReLU layer, a dropout layer, a first bidirectional RNN layer, a second fully-connected layer, a second bidirectional RNN layer, a third fully-connected layer, and a softmax layer.
Optionally, the apparatus further comprises:
and the second extraction unit is used for extracting the target text.
Optionally, the target text is house number text information.
In addition, a storage medium is further provided, where multiple instructions are stored, and the instructions are suitable for being loaded by a processor to perform steps in any text recognition method provided in an embodiment of the present application.
In the embodiment of the application, a text recognition device acquires an image to be recognized containing a target text; then determines a characteristic text region from the image to be recognized according to a preset text recognition network model and preset characteristic words; extracts text information from the characteristic text region according to a trained text extraction network model, wherein the text extraction network model consists of four CNN modules, one RNN module and one CTC module; and finally determines the target text according to the text information. This scheme can automatically acquire the target text from the image, and because its text extraction network model combines CNN modules, an RNN module and a CTC module, the text information in the text region can be accurately identified, improving the accuracy of target text recognition.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic view of an application scenario of a text recognition method according to an embodiment of the present application;
fig. 2 is a schematic flowchart of a text recognition method according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a text extraction network model provided in an embodiment of the present application;
fig. 4 is another schematic flowchart of a text recognition method according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a text recognition apparatus according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of another text recognition apparatus provided in an embodiment of the present application;
fig. 7 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In the description that follows, specific embodiments of the present application will be described with reference to steps and symbols executed by one or more computers, unless otherwise indicated. Accordingly, these steps and operations will at times be referred to as being performed by a computer, meaning that a processing unit of the computer operates on electronic signals representing data in a structured form. These operations transform the data or maintain it at locations in the computer's memory system, which may be reconfigured or otherwise altered in a manner well known to those skilled in the art. The data maintains a data structure, namely a physical location in memory with particular characteristics defined by the data format. However, while the principles of the application are described in the foregoing terms, this is not intended as a limitation to the specific embodiments shown, and those of ordinary skill in the art will recognize that various of the steps and operations described below may also be implemented in hardware.
The principles of the present application may be employed in numerous other general-purpose or special-purpose computing, communication environments or configurations. Examples of well known computing systems, environments, and configurations that may be suitable for use with the application include, but are not limited to, hand-held telephones, personal computers, servers, multiprocessor systems, microcomputer-based systems, mainframe-based computers, and distributed computing environments that include any of the above systems or devices.
The terms "first", "second", and "third", etc. in this application are used to distinguish between different objects and not to describe a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions.
The embodiment of the application provides a text recognition method, a text recognition device and a storage medium.
The text recognition device can be integrated in a server. It can automatically recognize a target text of a specified type from an image with high accuracy. The text recognition method includes:
acquiring an image to be recognized containing a target text; determining a characteristic text region from the image to be recognized according to a preset text recognition network model and preset characteristic words; extracting text information from the characteristic text region according to the trained text extraction network model; and determining the sub-text in the text information closest in word distance to the characteristic word as the target text.
The target text mentioned in the embodiment of the application may be house number text information, and the image to be recognized is a natural scene image containing the house number text information.
Referring to fig. 1, fig. 1 is a schematic view of a specific scene of a text recognition method according to an embodiment of the present application.
Referring to fig. 2, fig. 2 is a schematic flow chart of a text recognition method according to an embodiment of the present application.
The method comprises the following specific processes:
201. and acquiring an image to be recognized containing the target text.
The image to be recognized in the embodiment of the application may be a natural scene image containing house number text information; the image may be collected by couriers of a logistics company taking photos, or in other ways, which is not limited here.
In some embodiments, because photographing may leave the image to be recognized rotated, the image needs to be "straightened" to improve the accuracy of the detection result, as follows:
a. and detecting the image to be identified according to a preset angle detection model to obtain the inclination angle of the image to be detected.
Specifically, the angle detection model adopts the parameters of a VGG16 (a classic convolutional neural network architecture) pre-trained model as initialization parameters and achieves end-to-end training and detection with the limited collected training samples. On this basis, a deep residual module from the ResNet (a classic convolutional neural network architecture) model is added to make the model more effective, so that sparse matrices are clustered into denser sub-matrices while the deep network is fully utilized to extract features, improving detection accuracy.
After the acquired image to be recognized is input into the angle detection model, the inclination angle of the image is detected, for example 90°, 180°, 270° or 0°. A detected angle of 0° indicates that the image is not tilted, in which case no angle adjustment is needed; in the other cases, angle adjustment is needed.
b. And adjusting the angle of the image to be recognized according to the inclination angle to obtain the adjusted image to be recognized.
For example, if the image is detected to be inclined by 30 °, the image needs to be rotated by 30 ° in the opposite direction to "correct" the image, so as to obtain a "corrected" image to be recognized.
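As an illustration, a minimal sketch of this correction step is given below, assuming OpenCV is available; the `straighten` helper and the rotation sign convention are assumptions for the example, not the patent's implementation.

```python
import cv2

def straighten(image, tilt_deg):
    """Undo a detected tilt by rotating the image in the opposite
    direction (the sign convention here is an assumption)."""
    if tilt_deg == 0:
        return image  # 0 degrees: not tilted, no adjustment needed
    h, w = image.shape[:2]
    # build a rotation matrix about the image center and apply it
    m = cv2.getRotationMatrix2D((w / 2, h / 2), -tilt_deg, 1.0)
    return cv2.warpAffine(image, m, (w, h))
```

For the 90°, 180° and 270° classes output by the angle detection model, a lossless `numpy.rot90` would serve equally well.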
202. And determining a characteristic text region from the image to be recognized according to a preset text recognition network model and preset characteristic words.
The text recognition network model adopts a YOLO (You Only Look Once) network, and the region the algorithm needs to detect is the picture region containing house number text information.
The YOLO neural network divides the input image into S × S grid cells and then predicts B bounding boxes for each cell. Each bounding box contains 5 predicted values: x, y, w, h and confidence, where x and y are the predicted center coordinates of the bounding box, w and h are the predicted width and height of the bounding box, and confidence is the confidence of the category to which the bounding box belongs.
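To make the shape of these predictions concrete, here is a small sketch of decoding such an output tensor; the grid size, box count and threshold are illustrative assumptions, not values fixed by the patent.

```python
import numpy as np

S, B = 7, 2  # grid size and boxes per cell, assumed for illustration

def decode(pred: np.ndarray, threshold: float = 0.5):
    """Turn an (S, S, B, 5) YOLO output into a list of boxes.

    Each box carries the 5 predicted values (x, y, w, h, confidence)
    described above; low-confidence boxes are discarded.
    """
    boxes = []
    for row in range(S):
        for col in range(S):
            for b in range(B):
                x, y, w, h, conf = pred[row, col, b]
                if conf >= threshold:
                    boxes.append((x, y, w, h, float(conf)))
    return boxes
```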
For example, in this embodiment, the target text to be extracted is house number text information, and the feature words may be numbers, letters, and characters such as "door", "house", "number", "building" and "unit".
In some embodiments, determining the feature text region from the image to be recognized according to the preset text recognition network model and the preset feature words specifically includes:
a. and determining a text region from the image to be recognized according to the text recognition network model.
The angle-adjusted image to be recognized is input into the text recognition network model, which then automatically selects or crops out the picture regions containing text (text regions) from the image to be recognized, i.e., divides the image into "text regions" and "non-text regions"; there are generally multiple text regions.
b. And determining the characteristic text region from the text region according to the characteristic words.
After the text regions are determined from the image to be recognized, they need to be further processed to remove interfering text information, reduce interference from irrelevant text, and improve detection accuracy.
Specifically, the determined text regions may be input into the YOLO neural network corresponding to the feature words, which screens out the feature text regions corresponding to the feature words from the multiple text regions, i.e., distinguishes the two types of regions, "house number" and "non-house-number", within the "text regions".
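Putting steps a and b together, the selection stage might look like the following sketch; `text_detector` and `feature_classifier` are hypothetical wrappers around the two networks described above.

```python
def find_feature_text_regions(image, text_detector, feature_classifier):
    """Two-stage region selection: detect all text regions, then keep
    those the feature-word network labels as house-number regions."""
    text_regions = text_detector.detect(image)      # step a
    return [region for region in text_regions       # step b
            if feature_classifier.predict(region) == "house_number"]
```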
203. And extracting text information from the characteristic text region according to the trained text extraction network model, wherein the text extraction network model consists of four CNN modules, one RNN module and one CTC module.
Before this step, the text extraction network model needs to be constructed and trained, specifically:
a. and constructing a text extraction network model.
The text extraction network model consists of four Convolutional Neural Network (CNN) modules, a Recurrent Neural Network (RNN) module and a CTC (Connectionist Temporal Classification) module, where each CNN module is a block and the RNN module is an out_block. The text extraction network model, shown in FIG. 3, is thus block1-block2(mp)-block3-block4(mp)-out_block-CTC, where each block consists of a Conv2D (convolution layer with 3 × 3 kernel)-BN-ReLU-maxpool (max pooling layer) unit, mp indicates that the maxpool layer is used in that block, and the out_block is organized as flatten-dense (fully-connected layer)-BN-ReLU-dropout-biRNN (bidirectional RNN)-dense-biRNN-dense-softmax.
In addition, the CTC module is used for the backward training of the text extraction network model: it measures the difference between the output obtained after the input data passes through the model and the true output data, and produces the final result.
The text extraction network model comprises multiple network modules, and the inventors' experiments show that the accuracy of its extraction results is higher than that of other existing text extraction network models.
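As a concrete reading of this architecture, the following Keras sketch assembles the block1-block2(mp)-block3-block4(mp)-out_block stack described above; the filter counts, input size and unit widths are assumptions for illustration, and a per-column reshape stands in for the flatten layer so that the width axis survives as the CTC time axis.

```python
from tensorflow import keras
from tensorflow.keras import layers

def block(x, filters, mp):
    """One CNN module: Conv2D(3x3) - BN - ReLU, plus maxpool when mp."""
    x = layers.Conv2D(filters, 3, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    return layers.MaxPooling2D()(x) if mp else x

def build_text_extraction_model(height=32, width=256, num_classes=100):
    inp = keras.Input(shape=(height, width, 1))
    x = block(inp, 64, mp=False)   # block1
    x = block(x, 128, mp=True)     # block2(mp)
    x = block(x, 256, mp=False)    # block3
    x = block(x, 256, mp=True)     # block4(mp)
    # out_block: flatten-dense-BN-ReLU-dropout-biRNN-dense-biRNN-dense-softmax;
    # each image column becomes one timestep of the sequence
    x = layers.Permute((2, 1, 3))(x)           # put width (time) first
    x = layers.Reshape((width // 4, -1))(x)    # flatten height x channels
    x = layers.Dense(256)(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    x = layers.Dropout(0.25)(x)
    x = layers.Bidirectional(layers.SimpleRNN(128, return_sequences=True))(x)
    x = layers.Dense(128)(x)
    x = layers.Bidirectional(layers.SimpleRNN(128, return_sequences=True))(x)
    # one extra output class for the CTC blank symbol
    out = layers.Dense(num_classes + 1, activation="softmax")(x)
    return keras.Model(inp, out)
```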
b. And training the text extraction network model according to the CTC loss value to obtain the trained text extraction network model.
And then training the text extraction network model according to the CTC loss value corresponding to the CTC module, wherein the specific method comprises the following steps:
after the text extraction network model is constructed, pass is set to CTC pass (loss value). The classifier CTC outputs the recognition probability distribution of each frame sequence characteristic data through a CTC layer, and the training process of the CTC is realized through
Figure BDA0001950435240000091
Adjusting the value of w to maximize the target value, provided that
Figure BDA0001950435240000092
Can be obtained according to the reverse propagation
Figure BDA0001950435240000093
Wherein z represents the corresponding output label, x is the cross-cut sequence, w is the weight,
Figure BDA0001950435240000094
denotes the probability of outputting k at time t, and p denotes the sequence.
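A sketch of wiring this loss up in Keras follows, under the assumption that every sample uses the full time axis and label width (real training code would pass the true lengths); `build_text_extraction_model` refers to the earlier sketch.

```python
import tensorflow as tf
from tensorflow import keras

def ctc_loss(y_true, y_pred):
    """CTC loss: the negative of the log-likelihood ln p(z|x) above."""
    batch = tf.shape(y_pred)[0]
    input_len = tf.ones([batch, 1], dtype="int32") * tf.shape(y_pred)[1]
    label_len = tf.ones([batch, 1], dtype="int32") * tf.shape(y_true)[1]
    return keras.backend.ctc_batch_cost(y_true, y_pred, input_len, label_len)

model = build_text_extraction_model()
model.compile(optimizer="adam", loss=ctc_loss)  # trained on (image, label) pairs
```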
204. And determining a target text according to the text information.
After text information is extracted from the feature text region through the previous steps, the house number text information still needs to be picked out through natural language processing (i.e., the target text obtained), because the scene is complex and text content such as advertisements and sign text is mixed into the selected regions.
In some embodiments, determining the target text according to the text information specifically includes:
a. and mapping the text information to the trained high-dimensional space model to obtain word distances between the feature words and the plurality of subfiles.
Before mapping the text information into the trained high-dimensional space model, we need to train the high-dimensional space model according to training samples, wherein the training samples are samples with known word distances.
The high-dimensional space model in this application adopts an idea similar to the n-gram (a language model). The n-gram is generally used for word-frequency statistics over large vocabularies and large corpora, and it only counts the occurrence probability of each word.
Since the target vocabulary (target text) of this task is small (containing only numbers, the 26 letters, and a limited set of characters such as "door", "house", "number", "building" and "unit"), the correct arrangement order of the target text cannot be distinguished by simple statistical probability, so directly using an n-gram to recognize the house number text would not be accurate. This scheme therefore improves on it: the target numbers, English characters and Chinese characters are mapped into a high-dimensional space, and the words they form are used as word vectors to calculate distances, so that the correct ordering of the target text can be obtained and the accuracy of target text extraction improved.
In this step, all the characters of the text information need to be mapped into the trained high-dimensional space model, and then words formed by the characters are used as word vectors (sub-texts) to calculate distances.
b. And determining the sub text with the minimum word distance as the target text.
After multiple distance values are obtained, the sub-text corresponding to the vector with the minimum distance value is taken as the target text, i.e., the sub-text closest in word distance to "house number" (the highest-confidence match) is taken as the recognition result.
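The following sketch shows this word-vector comparison under two assumptions stated in the comments: `embed` (a hypothetical interface) returns the trained high-dimensional vector of a character, and a sub-text vector is formed by averaging its character vectors.

```python
import numpy as np

def pick_target_text(sub_texts, embed, feature_vec):
    """Return the sub-text whose word vector lies closest to the
    house-number feature vector (minimum word distance wins)."""
    def word_vec(text):
        # assumption: a word vector is the mean of its character vectors
        return np.mean([embed(ch) for ch in text], axis=0)
    distances = [np.linalg.norm(word_vec(t) - feature_vec) for t in sub_texts]
    return sub_texts[int(np.argmin(distances))]
```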
In some embodiments, after the target text is determined, it needs to be extracted and sent to a map system, where it is placed at the position corresponding to the text, so that the electronic map in the map system becomes more accurate, facilitating courier delivery, saving courier delivery time, and so on.
In the embodiment of the application, a text recognition device acquires an image to be recognized containing a target text; then determines a characteristic text region from the image to be recognized according to a preset text recognition network model and preset characteristic words; extracts text information from the characteristic text region according to a trained text extraction network model, wherein the text extraction network model consists of four CNN modules, an RNN module and a CTC module; and finally determines the target text according to the text information. In this scheme, a text region related to the characteristic words is cropped out of the image to be recognized, text information is extracted from that characteristic text region according to the text extraction network model, and the target text is then recognized from the text information.
Referring to fig. 4, fig. 4 is another schematic flow chart of the text recognition method according to the embodiment of the present application, and the specific flow of the method may be as follows:
401. and acquiring an image to be recognized containing house plate text information.
The image to be recognized in the embodiment of the application may be a natural scene image containing house number text information; the image may be collected by couriers of a logistics company taking photos, or in other ways, which is not limited here.
402. And detecting the image to be recognized according to a preset angle detection model to obtain the inclination angle of the image to be recognized.
Specifically, the angle detection model adopts the parameters of a VGG16 classic pre-trained model as initialization parameters and achieves end-to-end training and detection with the limited collected training samples. On this basis, a deep residual module from the ResNet model is added to make the model more effective, so that sparse matrices are clustered into denser sub-matrices while the deep network is fully utilized to extract features, improving detection accuracy.
After the acquired image to be recognized is input into the angle detection model, the inclination angle of the image is detected, for example 90°, 180°, 270° or 0°. A detected angle of 0° indicates that the image is not tilted, in which case no angle adjustment is needed; in the other cases, angle adjustment is needed.
403. And adjusting the angle of the image to be recognized according to the inclination angle to obtain the adjusted image to be recognized.
For example, if the image is detected to be inclined by 40 °, the image needs to be rotated by 40 ° in the reverse direction to "correct" the image, and the image to be recognized after "correction" is obtained.
404. And determining a text region from the image to be recognized according to the text recognition network model.
The text recognition network model adopts a YOLO network, and the region the algorithm needs to detect is the picture region containing house number text information.
The YOLO neural network divides the input image into S × S grid cells and then predicts B bounding boxes for each cell. Each bounding box contains 5 predicted values: x, y, w, h and confidence, where x and y are the predicted center coordinates of the bounding box, w and h are the predicted width and height of the bounding box, and confidence is the confidence of the category to which the bounding box belongs.
For example, in this embodiment, the target text to be extracted is house number text information, and the feature words may be numbers, letters, and characters such as "door", "house", "number", "building" and "unit".
In this embodiment, the angle-adjusted image to be recognized is input into the text recognition network model, which then automatically selects or crops out the picture regions containing text (i.e., text regions) from the image to be recognized, that is, divides the image into "text regions" and "non-text regions"; there are generally multiple text regions.
405. And determining a house number text region from the text regions according to the characteristic words.
After the text regions are determined from the image to be recognized, they need to be further processed to remove interfering text information, reduce interference from irrelevant text, and improve detection accuracy.
Specifically, the determined text regions may be input into the YOLO neural network corresponding to the feature words, which screens out the feature text regions corresponding to the feature words from the multiple text regions, i.e., distinguishes the two types of regions, "house number" and "non-house-number", within the "text regions".
406. And extracting text information from the house number text region according to the trained text extraction network model.
Before this step, the text extraction network model needs to be constructed and trained, specifically:
a. and constructing a text extraction network model.
The text extraction network model comprises four Convolutional Neural Network (CNN) modules, a Recurrent Neural Network (RNN) module and a CTC module, where each CNN module is a block and the RNN module is an out_block. The text extraction network model, shown in FIG. 3, is thus block1-block2(mp)-block3-block4(mp)-out_block-CTC, where each block consists of a Conv2D (convolution layer with 3 × 3 kernel)-BN-ReLU-maxpool (max pooling layer) unit, mp indicates that the maxpool layer is used in that block, and the out_block is organized as flatten-dense (fully-connected layer)-BN-ReLU-dropout-biRNN (bidirectional RNN)-dense-biRNN-dense-softmax.
In addition, the CTC module is used for the backward training of the text extraction network model: it measures the difference between the output obtained after the input data passes through the model and the true output data, and produces the final result.
The text extraction network model comprises multiple network modules, and the inventors' experiments show that the accuracy of its extraction results is higher than that of other existing text extraction network models.
b. And training the text extraction network model according to the CTC loss value to obtain the trained text extraction network model.
And then training the text extraction network model according to the CTC loss value corresponding to the CTC module, wherein the specific method comprises the following steps:
After the text extraction network model is constructed, the loss is set to the CTC loss (loss value). The classifier CTC outputs, through the CTC layer, the recognition probability distribution of each frame of sequence feature data. Training the CTC amounts to adjusting the value of $w$ to maximize the objective

$$O = \sum_{(x,z)} \ln p(z \mid x),$$

provided that

$$p(z \mid x) = \sum_{\pi \in \mathcal{B}^{-1}(z)} \prod_{t=1}^{T} y_{\pi_t}^{t},$$

from which back-propagation yields

$$\frac{\partial \ln p(z \mid x)}{\partial w} = \sum_{t} \sum_{k} \frac{\partial \ln p(z \mid x)}{\partial y_{k}^{t}} \, \frac{\partial y_{k}^{t}}{\partial w},$$

where $z$ denotes the corresponding output label, $x$ the input feature sequence, $w$ the weights, $y_{k}^{t}$ the probability of outputting $k$ at time $t$, and $\pi$ a path (label sequence) that collapses to $z$ under the many-to-one map $\mathcal{B}$.
407. And mapping the text information to the trained high-dimensional space model to obtain word distances between the feature words and a plurality of sub-texts.
After text information is extracted from the feature text region through the previous steps, the house number text information still needs to be sorted out through natural language processing, because the scene is complex and text content such as advertisements and sign text is mixed into the selected regions.
Before mapping text information into a trained high-dimensional space model, the high-dimensional space model needs to be trained according to training samples, wherein the training samples are samples with known word distances.
The high-dimensional space model in this application adopts an idea similar to the n-gram. The n-gram is generally used for word-frequency statistics over large vocabularies and large corpora, and it only counts the occurrence probability of each word.
Since the target vocabulary of this task (house number text information) is small (containing only numbers, the 26 letters, and a limited set of characters such as "door", "house", "number", "building" and "unit"), the correct arrangement order of the house number text cannot be distinguished by simple statistical probability, so directly using an n-gram for house number text recognition would not be accurate. This technical scheme therefore improves on it: the target numbers, English characters and Chinese characters are mapped into a high-dimensional space, and the words they form are used as word vectors to calculate distances, so that the correct ordering of the house number text information can be obtained and the accuracy of its extraction improved.
In this step, all the characters of the text information need to be mapped into the trained high-dimensional space model, and then words formed by the characters are used as word vectors (sub-texts) to calculate distances.
408. And determining the sub-text with the minimum word distance as the house number text information.
After multiple distance values are obtained, the sub-text corresponding to the vector with the minimum distance value is taken as the house number text information, i.e., the sub-text closest in word distance to "house number" (the highest-confidence match) is taken as the recognition result.
409. And extracting the house number text information.
After the house number text information is determined, it needs to be extracted and sent to a map system, where it is placed at the position corresponding to the text, so that the electronic map in the map system becomes more accurate, facilitating courier delivery and saving courier delivery time.
In the embodiment of the application, a text recognition device acquires an image to be recognized containing house number text information; then determines a characteristic text region from the image to be recognized according to a preset text recognition network model and preset characteristic words; extracts text information from the characteristic text region according to a trained text extraction network model, wherein the text extraction network model consists of four CNN modules, one RNN module and one CTC module; and finally determines the target text according to the text information. In this scheme, a text region related to the characteristic words is cropped out of the image to be recognized, text information is extracted from that characteristic text region according to the text extraction network model, and the target text is then recognized from the text information.
In order to better implement the text recognition method provided by the embodiment of the present application, the embodiment of the present application further provides a text recognition apparatus, and the text recognition apparatus may be specifically integrated in a server. The meanings of the nouns are the same as those in the text recognition method, and specific implementation details can refer to the description in the method embodiment.
Referring to fig. 5, fig. 5 is a schematic structural diagram of a text recognition apparatus according to an embodiment of the present disclosure, in which the text recognition apparatus 500 includes an obtaining unit 501, a first determining unit 502, a first extracting unit 503, and a second determining unit 504, as follows:
an obtaining unit 501, configured to obtain an image to be recognized that includes a target text;
a first determining unit 502, configured to determine a feature text region from the image to be recognized according to a preset text recognition network model and preset feature words;
a first extraction unit 503, configured to extract text information from the feature text region according to a trained text extraction network model, where the text extraction network model is composed of four convolutional neural network CNN modules, a recurrent neural network RNN module, and a CTC module;
a second determining unit 504, configured to determine the target text according to the text information.
In some embodiments, the first determining unit is specifically configured to:
determining a text region from the image to be recognized according to the text recognition network model;
and determining the characteristic text region from the text region according to the characteristic words.
In some embodiments, the second determining unit is specifically configured to:
mapping the text information to a trained high-dimensional space model to obtain word distances between the feature words and a plurality of sub texts;
and determining the sub text with the minimum word distance as the target text.
Referring to fig. 6, in some embodiments, the apparatus 500 further includes:
a first training unit 505, configured to train the high-dimensional space model according to training samples, where the training samples are samples with known word distances.
In some embodiments, the apparatus 500 further comprises:
the detection unit 506 is configured to detect the image to be recognized according to a preset angle detection model to obtain the inclination angle of the image to be recognized;
the adjusting unit 507 is configured to perform angle adjustment on the image to be recognized according to the inclination angle to obtain an adjusted image to be recognized;
the first determining unit 502 is specifically configured to:
and determining a characteristic text region from the adjusted image to be recognized according to a preset text recognition network model and preset characteristic words.
In some embodiments, the apparatus 500 further comprises:
a constructing unit 508, configured to construct the text extraction network model, where the text extraction network model is composed of four convolutional neural network CNN modules, a recurrent neural network RNN module, and a CTC module.
A second training unit 509, configured to train the text extraction network model according to the CTC loss values, so as to obtain the trained text extraction network model.
Optionally, the CNN module includes a convolutional layer, a first BN layer, a first ReLU layer, and a max pooling layer, and the RNN module includes a flatten layer, a first fully-connected layer, a second BN layer, a second ReLU layer, a dropout layer, a first bidirectional RNN layer, a second fully-connected layer, a second bidirectional RNN layer, a third fully-connected layer, and a softmax layer.
In some embodiments, the apparatus 500 further comprises:
a second extracting unit 510, configured to extract the target text.
In some embodiments, the target text is house number text information.
In the embodiment of the application, the obtaining unit 501 obtains an image to be recognized that includes a target text; then the first determining unit 502 determines a feature text region from the image to be recognized according to a preset text recognition network model and preset feature words; the first extraction unit 503 extracts text information from the feature text region according to the trained text extraction network model, where the text extraction network model is composed of four CNN modules, one RNN module, and one CTC module; finally, the second determining unit 504 determines the target text from the text information. In this scheme, a text region related to the feature words is cropped out of the image to be recognized, text information is extracted from that feature text region according to the text extraction network model, and the target text is then recognized from the text information.
Referring to fig. 7, the present application provides a server 700, which may include a processor 701 with one or more processing cores, a memory 702 of one or more computer-readable storage media, a Radio Frequency (RF) circuit 703, a power supply 704, an input unit 705, and a display unit 706. Those skilled in the art will appreciate that the server structure shown in FIG. 7 is not limiting and may include more or fewer components than shown, combine some components, or arrange the components differently. Wherein:
the processor 701 is a control center of the server, connects various parts of the entire server using various interfaces and lines, and performs various functions of the server and processes data by running or executing software programs and/or modules stored in the memory 702 and calling data stored in the memory 702, thereby performing overall monitoring of the server. Optionally, processor 701 may include one or more processing cores; preferably, the processor 701 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 701.
The memory 702 may be used to store software programs and modules, and the processor 701 executes various functional applications and data processing by operating the software programs and modules stored in the memory 702.
The RF circuit 703 may be used for receiving and transmitting signals during transmission and reception of information.
The server also includes a power supply 704 (e.g., a battery) for powering the various components; preferably, the power supply is logically coupled to the processor 701 via a power management system, which manages charging, discharging and power consumption.
The server may further include an input unit 705, and the input unit 705 may be used to receive input numeric or character information and generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.
The server may also include a display unit 706, which display unit 706 may be used to display information input by or provided to the user, as well as various graphical user interfaces of the server, which may be made up of graphics, text, icons, video, and any combination thereof. Specifically, in this embodiment, the processor 701 in the server loads the executable file corresponding to the process of one or more application programs into the memory 702 according to the following instructions, and the processor 701 runs the application program stored in the memory 702, thereby implementing various functions as follows:
acquiring an image to be identified containing a target text;
determining a characteristic text region from the image to be recognized according to a preset text recognition network model and preset characteristic words;
extracting text information from the characteristic text region according to a trained text extraction network model, wherein the text extraction network model consists of four Convolutional Neural Network (CNN) modules, a Recurrent Neural Network (RNN) module and a CTC module;
and determining the target text according to the text information.
As can be seen from the above, in the embodiment of the present application, the text recognition apparatus obtains the image to be recognized including the target text; then determines a characteristic text region from the image to be recognized according to a preset text recognition network model and preset characteristic words; extracts text information from the characteristic text region according to a trained text extraction network model, wherein the text extraction network model consists of four CNN modules, one RNN module and one CTC module; and finally determines the target text according to the text information. In this scheme, a text region related to the characteristic words is cropped out of the image to be recognized, text information is extracted from that characteristic text region according to the text extraction network model, and the target text is then recognized from the text information.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by instructions or by associated hardware controlled by the instructions, which may be stored in a computer readable storage medium and loaded and executed by a processor.
To this end, embodiments of the present application provide a storage medium, in which a plurality of instructions are stored, where the instructions can be loaded by a processor to execute the steps in any one of the text recognition methods provided in the embodiments of the present application. For example, the instructions may perform the steps of:
acquiring an image to be identified containing a target text;
determining a characteristic text region from the image to be recognized according to a preset text recognition network model and preset characteristic words;
extracting text information from the characteristic text region according to a trained text extraction network model, wherein the text extraction network model consists of four Convolutional Neural Network (CNN) modules, a Recurrent Neural Network (RNN) module and a CTC module;
and determining the target text according to the text information.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
Wherein the storage medium may include: read Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, and the like.
Since the instructions stored in the storage medium can execute the steps in any text recognition method provided in the embodiments of the present application, the beneficial effects that can be achieved by any text recognition method provided in the embodiments of the present application can be achieved, which are detailed in the foregoing embodiments and will not be described herein again.
The text recognition method, the text recognition device, and the storage medium provided by the embodiments of the present application are described in detail above, and specific examples are applied herein to explain the principles and implementations of the present application, and the description of the above embodiments is only used to help understand the method and the core ideas of the present application; meanwhile, for those skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (10)

1. A text recognition method, comprising:
acquiring an image to be identified containing a target text;
determining a characteristic text region from the image to be recognized according to a preset text recognition network model and preset characteristic words;
extracting text information from the characteristic text region according to a trained text extraction network model, wherein the text extraction network model consists of four Convolutional Neural Network (CNN) modules, a Recurrent Neural Network (RNN) module and a CTC module;
and determining the target text according to the text information.
2. The method according to claim 1, wherein the determining a characteristic text region from the image to be recognized according to a preset text recognition network model and preset characteristic words comprises:
determining a text region from the image to be recognized according to the text recognition network model;
and determining the characteristic text region from the text region according to the characteristic words.
3. The method of claim 1, wherein the determining the target text from the text information comprises:
mapping the text information to a trained high-dimensional space model to obtain word distances between the feature words and a plurality of sub texts;
and determining the sub text with the minimum word distance as the target text.
4. The method of claim 3, wherein prior to mapping the textual information into the trained high-dimensional spatial model, the method further comprises:
and training the high-dimensional space model according to training samples, wherein the training samples are samples with known word distances.
5. The method according to claim 1, wherein before determining the characteristic text region from the image to be recognized according to the preset text recognition network model and the preset characteristic words, the method further comprises:
detecting the image to be recognized according to a preset angle detection model to obtain the inclination angle of the image to be recognized;
carrying out angle adjustment on the image to be recognized according to the inclination angle to obtain an adjusted image to be recognized;
the determining a characteristic text region from the image to be recognized according to a preset text recognition network model and preset characteristic words comprises the following steps:
and determining a characteristic text region from the adjusted image to be recognized according to a preset text recognition network model and preset characteristic words.
6. The method of claim 1, wherein before extracting text information from the feature text region according to the trained text extraction network model, the method further comprises:
constructing the text extraction network model;
and training the text extraction network model according to the CTC loss value to obtain the trained text extraction network model.
7. The method of any one of claims 1 to 6, wherein the CNN module consists of a convolutional layer, a first BN layer, a first ReLU layer, and a max pooling layer, and wherein the RNN module consists of a flatten layer, a first fully-connected layer, a second BN layer, a second ReLU layer, a dropout layer, a first bidirectional RNN layer, a second fully-connected layer, a second bidirectional RNN layer, a third fully-connected layer, and a softmax layer.
8. The method of any one of claims 1 to 6, wherein the target text is house number text information.
9. A text recognition apparatus, comprising:
the acquiring unit is used for acquiring an image to be recognized containing a target text;
the first determining unit is used for determining a characteristic text region from the image to be recognized according to a preset text recognition network model and preset characteristic words;
the first extraction unit is used for extracting text information from the characteristic text region according to a trained text extraction network model, wherein the text extraction network model consists of four Convolutional Neural Network (CNN) modules, a Recurrent Neural Network (RNN) module and a CTC module;
and the second determining unit is used for determining the target text according to the text information.
10. The apparatus according to claim 9, wherein the first determining unit is specifically configured to:
determining a text region from the image to be recognized according to the text recognition network model;
and determining the characteristic text region from the text region according to the characteristic words.
CN201910108577.7A 2019-01-18 2019-01-18 Text recognition method and device Active CN111461105B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910108577.7A CN111461105B (en) 2019-01-18 2019-01-18 Text recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910108577.7A CN111461105B (en) 2019-01-18 2019-01-18 Text recognition method and device

Publications (2)

Publication Number Publication Date
CN111461105A true CN111461105A (en) 2020-07-28
CN111461105B CN111461105B (en) 2023-11-28

Family

ID=71685044

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910108577.7A Active CN111461105B (en) 2019-01-18 2019-01-18 Text recognition method and device

Country Status (1)

Country Link
CN (1) CN111461105B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112016561A (en) * 2020-09-01 2020-12-01 中国银行股份有限公司 Text recognition method and related equipment
CN112633422A (en) * 2021-03-10 2021-04-09 北京易真学思教育科技有限公司 Training method of text recognition model, text recognition method, device and equipment
CN112949455A (en) * 2021-02-26 2021-06-11 武汉天喻信息产业股份有限公司 Value-added tax invoice identification system and method
CN114241467A (en) * 2021-12-21 2022-03-25 北京有竹居网络技术有限公司 Text recognition method and related equipment thereof

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9263036B1 (en) * 2012-11-29 2016-02-16 Google Inc. System and method for speech recognition using deep recurrent neural networks
WO2016197381A1 (en) * 2015-06-12 2016-12-15 Sensetime Group Limited Methods and apparatus for recognizing text in an image
CN107203606A (en) * 2017-05-17 2017-09-26 西北工业大学 Text detection and recognition methods under natural scene based on convolutional neural networks
WO2018010657A1 (en) * 2016-07-15 2018-01-18 北京市商汤科技开发有限公司 Structured text detection method and system, and computing device
CN107688808A (en) * 2017-08-07 2018-02-13 电子科技大学 A kind of quickly natural scene Method for text detection
CN108446621A (en) * 2018-03-14 2018-08-24 平安科技(深圳)有限公司 Bank slip recognition method, server and computer readable storage medium
CN108960073A (en) * 2018-06-05 2018-12-07 大连理工大学 Cross-module state image steganalysis method towards Biomedical literature

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9263036B1 (en) * 2012-11-29 2016-02-16 Google Inc. System and method for speech recognition using deep recurrent neural networks
WO2016197381A1 (en) * 2015-06-12 2016-12-15 Sensetime Group Limited Methods and apparatus for recognizing text in an image
CN107636691A (en) * 2015-06-12 2018-01-26 商汤集团有限公司 Method and apparatus for identifying the text in image
WO2018010657A1 (en) * 2016-07-15 2018-01-18 北京市商汤科技开发有限公司 Structured text detection method and system, and computing device
CN107203606A (en) * 2017-05-17 2017-09-26 西北工业大学 Text detection and recognition methods under natural scene based on convolutional neural networks
CN107688808A (en) * 2017-08-07 2018-02-13 电子科技大学 A kind of quickly natural scene Method for text detection
CN108446621A (en) * 2018-03-14 2018-08-24 平安科技(深圳)有限公司 Bank slip recognition method, server and computer readable storage medium
CN108960073A (en) * 2018-06-05 2018-12-07 大连理工大学 Cross-module state image steganalysis method towards Biomedical literature

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112016561A (en) * 2020-09-01 2020-12-01 中国银行股份有限公司 Text recognition method and related equipment
CN112016561B (en) * 2020-09-01 2023-08-04 中国银行股份有限公司 Text recognition method and related equipment
CN112949455A (en) * 2021-02-26 2021-06-11 武汉天喻信息产业股份有限公司 Value-added tax invoice identification system and method
CN112949455B (en) * 2021-02-26 2024-04-05 武汉天喻信息产业股份有限公司 Value-added tax invoice recognition system and method
CN112633422A (en) * 2021-03-10 2021-04-09 北京易真学思教育科技有限公司 Training method of text recognition model, text recognition method, device and equipment
CN112633422B (en) * 2021-03-10 2021-06-22 北京易真学思教育科技有限公司 Training method of text recognition model, text recognition method, device and equipment
CN114241467A (en) * 2021-12-21 2022-03-25 北京有竹居网络技术有限公司 Text recognition method and related equipment thereof

Also Published As

Publication number Publication date
CN111461105B (en) 2023-11-28

Similar Documents

Publication Publication Date Title
US11238310B2 (en) Training data acquisition method and device, server and storage medium
CN111461105A (en) Text recognition method and device
CN110472675B (en) Image classification method, image classification device, storage medium and electronic equipment
EP3910551A1 (en) Face detection method, apparatus, device, and storage medium
CN111488985B (en) Deep neural network model compression training method, device, equipment and medium
CN110619051B (en) Question sentence classification method, device, electronic equipment and storage medium
CN111259940A (en) Target detection method based on space attention map
CN112329826A (en) Training method of image recognition model, image recognition method and device
EP4113376A1 (en) Image classification model training method and apparatus, computer device, and storage medium
CN111475613A (en) Case classification method and device, computer equipment and storage medium
CN113392253B (en) Visual question-answering model training and visual question-answering method, device, equipment and medium
CN115658955B (en) Cross-media retrieval and model training method, device, equipment and menu retrieval system
CN110489747A (en) A kind of image processing method, device, storage medium and electronic equipment
CN116226785A (en) Target object recognition method, multi-mode recognition model training method and device
CN113947188A (en) Training method of target detection network and vehicle detection method
CN110489423A (en) A kind of method, apparatus of information extraction, storage medium and electronic equipment
CN111079374A (en) Font generation method, device and storage medium
CN114495113A (en) Text classification method and training method and device of text classification model
CN113255501A (en) Method, apparatus, medium, and program product for generating form recognition model
CN113095072A (en) Text processing method and device
CN115359322A (en) Target detection model training method, device, equipment and storage medium
CN110442714B (en) POI name normative evaluation method, device, equipment and storage medium
CN113947140A (en) Training method of face feature extraction model and face feature extraction method
CN114119972A (en) Model acquisition and object processing method and device, electronic equipment and storage medium
CN113536876A (en) Image recognition method and related device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant