CN111368841A - Text recognition method, device, equipment and storage medium - Google Patents

Text recognition method, device, equipment and storage medium

Info

Publication number
CN111368841A
CN111368841A
Authority
CN
China
Prior art keywords
text
picture
recognized
features
recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010127912.0A
Other languages
Chinese (zh)
Inventor
章放
邹雨晗
杨海军
徐倩
杨强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
WeBank Co Ltd
Original Assignee
WeBank Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by WeBank Co Ltd filed Critical WeBank Co Ltd
Priority to CN202010127912.0A
Publication of CN111368841A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10 Character recognition
    • G06V 30/14 Image acquisition
    • G06V 30/148 Segmentation of character regions
    • G06V 30/153 Segmentation of character regions using recognition of characters or words
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10 Character recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Character Discrimination (AREA)

Abstract

The invention discloses a text recognition method, apparatus, device, and storage medium. The method comprises: acquiring a text picture to be recognized; calling a text feature extraction structure in a pre-trained text recognition model to extract the text features of the text picture to be recognized; calling a picture type feature extraction structure in the text recognition model to extract the picture type features of the text picture to be recognized; and calling a recognition structure in the text recognition model to obtain, based on the text features and the picture type features, a text recognition result for the text picture to be recognized. The method and apparatus can accurately recognize the characters in a text even when text pictures are of different types, thereby improving the versatility of the text recognition model, i.e., the versatility of an OCR model.

Description

Text recognition method, device, equipment and storage medium
Technical Field
The present invention relates to the field of computer vision technologies, and in particular, to a text recognition method, apparatus, device, and storage medium.
Background
OCR (Optical Character Recognition) refers to the process of scanning text material, analyzing the resulting image file, and obtaining its text and layout information. Current OCR practice typically trains and uses a different OCR model for each type of text picture, such as identification cards, vehicle registration cards, driving licenses, and documents, rather than a unified OCR model. Because a current OCR model generally achieves a good recognition rate for only a specific type of text picture, a dedicated OCR model is trained and used for each specific type, and whenever a new type of text picture appears, an OCR model must be retrained or even redesigned to handle it; that is, such OCR models have poor versatility.
Disclosure of Invention
The main object of the present invention is to provide a text recognition method, apparatus, device, and storage medium, aiming to solve the problem that an existing OCR model has a good recognition rate for only one type of text picture and poor versatility across different types of text pictures.
In order to achieve the above object, the present invention provides a text recognition method, including the steps of:
acquiring a text picture to be recognized;
calling a text feature extraction structure in a text recognition model to extract the text features of the text picture to be recognized, wherein the text recognition model is obtained by pre-training;
calling a picture type feature extraction structure in the text recognition model to extract the picture type features of the text picture to be recognized;
and calling a recognition structure in the text recognition model to obtain, based on the text features and the picture type features, a text recognition result for the text picture to be recognized.
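The four steps above can be sketched as a minimal pipeline. All shapes, the random stand-in weights, and the digit charset below are illustrative assumptions for the sketch, not the patent's actual structures:

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_text_features(picture, num_width_units=8, dim=32):
    # Text feature extraction structure: one sub-text feature
    # vector per width unit (hypothetical shapes, random values).
    return rng.standard_normal((num_width_units, dim))

def extract_picture_type_features(text_features):
    # Picture type feature extraction structure: here, a simple
    # summary of the sub-text features across width units.
    return text_features.mean(axis=0)

def recognize(text_features, type_features, charset="0123456789"):
    # Recognition structure: score every charset entry per width
    # unit, modulate by the picture type features, pick the best.
    scores = text_features @ rng.standard_normal((text_features.shape[1], len(charset)))
    scores = scores * (type_features @ rng.standard_normal((type_features.shape[0], len(charset))))
    return "".join(charset[i] for i in scores.argmax(axis=1))

picture = rng.standard_normal((32, 128))                 # text picture to be recognized
text_feats = extract_text_features(picture)              # step 2
type_feats = extract_picture_type_features(text_feats)   # step 3
result = recognize(text_feats, type_feats)               # step 4
```

The point of the sketch is the data flow: the recognition structure consumes both feature sets, so the picture type features influence every character decision.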
Optionally, the step of calling a picture type feature extraction structure in the text recognition model to extract the picture type feature of the text picture to be recognized includes:
and calling the picture type feature extraction structure to extract comprehensive features of the text picture to be recognized spanning multiple width units, and taking the comprehensive features as the picture type features of the text picture to be recognized.
Optionally, the text features include sub-text features respectively corresponding to each width unit of the text picture to be recognized,
the step of calling the picture type feature extraction structure to extract the comprehensive features of the text picture to be recognized across a plurality of width units comprises the following steps:
and calling the picture type feature extraction structure to perform feature comprehensive processing on each sub-text feature of the text picture to be recognized in multiple width units to obtain comprehensive features of the text picture to be recognized across multiple width units.
Optionally, the recognition structure includes a first fully connected layer and a second fully connected layer, and the step of calling the recognition structure in the text recognition model to obtain the text recognition result of the text picture to be recognized based on the text features and the picture type features includes:
inputting the sub-text feature corresponding to each single width unit into the first fully connected layer to obtain a first vector;
inputting the picture type features into the second fully connected layer to obtain a second vector, wherein the first vector and the second vector have the same dimension;
multiplying the first vector and the second vector element-wise to obtain a third vector;
and determining the text recognition result of the text picture to be recognized according to the third vector corresponding to each width unit.
Optionally, the recognition structure further includes a decoder, and the step of determining the text recognition result of the text picture to be recognized according to the third vector corresponding to each width unit includes:
respectively inputting the third vector corresponding to each width unit into the decoder to obtain a target character corresponding to each width unit, and taking the target characters as the text recognition result of the text picture to be recognized.
Optionally, the step of calling the picture type feature extraction structure to perform feature comprehensive processing on each sub-text feature of the text picture to be recognized in multiple width units to obtain a comprehensive feature of the text picture to be recognized across multiple width units includes:
and calling the picture type feature extraction structure to calculate an average value of each sub-text feature of the text picture to be recognized in multiple width units, so as to obtain the comprehensive feature of the text picture to be recognized across multiple width units.
Optionally, the step of acquiring a text picture to be recognized includes:
calling a preset text detection model to detect the text position in the picture to be recognized;
and cutting the picture to be recognized based on the text position to obtain the text picture to be recognized containing one line or one column of text characters.
In order to achieve the above object, the present invention further provides a text recognition apparatus, including:
the acquisition module is used for acquiring a text picture to be recognized;
the text feature extraction module is used for calling a text feature extraction structure in a text recognition model to extract the text features of the text picture to be recognized, wherein the text recognition model is obtained by pre-training;
the picture type feature extraction module is used for calling a picture type feature extraction structure in the text recognition model to extract the picture type features of the text picture to be recognized;
and the recognition module is used for calling a recognition structure in the text recognition model to recognize and obtain a text recognition result of the text picture to be recognized based on the text feature and the picture type feature.
In order to achieve the above object, the present invention also provides a text recognition apparatus, including: a memory, a processor and a text recognition program stored on the memory and executable on the processor, the text recognition program when executed by the processor implementing the steps of the text recognition method as described above.
Furthermore, to achieve the above object, the present invention also provides a computer readable storage medium having a text recognition program stored thereon, which when executed by a processor implements the steps of the text recognition method as described above.
In the present invention, a text recognition model is trained in advance; the text feature extraction structure in the model is called to extract the text features of the text picture to be recognized, the picture type feature extraction structure is called to extract its picture type features, and the recognition structure is called to obtain the text recognition result based on the text features and the picture type features. The picture type features thus become a factor influencing the recognition result, so even when text pictures differ in type, and therefore in font, background, and so on, the characters in the text can be accurately recognized. This improves the versatility of the text recognition model, i.e., the versatility of the OCR model. Because the model is more versatile, there is no need to train a separate model for each type of text picture; only one universal model needs to be trained, which improves training efficiency and also broadens the usable training data.
Drawings
FIG. 1 is a schematic diagram of a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a text recognition method according to a first embodiment of the present invention;
FIG. 3 is a block diagram of a text recognition apparatus according to a preferred embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As shown in fig. 1, fig. 1 is a schematic device structure diagram of a hardware operating environment according to an embodiment of the present invention.
It should be noted that, in the embodiment of the present invention, the text recognition device may be a smart phone, a personal computer, a server, and the like, and is not limited specifically here.
As shown in fig. 1, the text recognition apparatus may include: a processor 1001, such as a CPU, a network interface 1004, a user interface 1003, a memory 1005, a communication bus 1002. Wherein a communication bus 1002 is used to enable connective communication between these components. The user interface 1003 may include a Display screen (Display), an input unit such as a Keyboard (Keyboard), and the optional user interface 1003 may also include a standard wired interface, a wireless interface. The network interface 1004 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory (e.g., a magnetic disk memory). The memory 1005 may alternatively be a storage device separate from the processor 1001.
Those skilled in the art will appreciate that the device configuration shown in FIG. 1 does not constitute a limitation of text recognition devices, and may include more or fewer components than shown, or some components in combination, or a different arrangement of components.
As shown in fig. 1, a memory 1005, which is a kind of computer storage medium, may include therein an operating system, a network communication module, a user interface module, and a text recognition program. Among them, the operating system is a program that manages and controls the hardware and software resources of the device, supporting the execution of the text recognition program and other software or programs.
In the device shown in fig. 1, the user interface 1003 is mainly used for data communication with a client; the network interface 1004 is mainly used for establishing communication connection with a server; and the processor 1001 may be configured to invoke a text recognition program stored in the memory 1005 and perform the following operations:
acquiring a text picture to be identified;
calling a text feature extraction structure in a text recognition model to extract text features of the text picture to be recognized, wherein the text recognition model is obtained by pre-training;
calling a picture type feature extraction structure in the text recognition model to extract picture type features of the text picture to be recognized;
and calling a recognition structure in the text recognition model to obtain, based on the text features and the picture type features, a text recognition result for the text picture to be recognized.
Further, the step of calling a picture type feature extraction structure in the text recognition model to extract the picture type features of the text picture to be recognized comprises:
and calling the picture type feature extraction structure to extract comprehensive features of the text picture to be recognized spanning multiple width units, and taking the comprehensive features as the picture type features of the text picture to be recognized.
Further, the text features comprise sub-text features respectively corresponding to each width unit of the text picture to be recognized,
the step of calling the picture type feature extraction structure to extract the comprehensive features of the text picture to be recognized across a plurality of width units comprises the following steps:
and calling the picture type feature extraction structure to perform feature comprehensive processing on each sub-text feature of the text picture to be recognized in multiple width units to obtain comprehensive features of the text picture to be recognized across multiple width units.
Further, the recognition structure includes a first fully connected layer and a second fully connected layer, and the step of calling the recognition structure in the text recognition model to obtain the text recognition result of the text picture to be recognized based on the text features and the picture type features includes:
inputting the sub-text feature corresponding to each single width unit into the first fully connected layer to obtain a first vector;
inputting the picture type features into the second fully connected layer to obtain a second vector, wherein the first vector and the second vector have the same dimension;
multiplying the first vector and the second vector element-wise to obtain a third vector;
and determining the text recognition result of the text picture to be recognized according to the third vector corresponding to each width unit.
Further, the recognition structure further includes a decoder, and the step of determining the text recognition result of the text picture to be recognized according to the third vector corresponding to each width unit includes:
respectively inputting the third vector corresponding to each width unit into the decoder to obtain a target character corresponding to each width unit, and taking the target characters as the text recognition result of the text picture to be recognized.
Further, the step of calling the picture type feature extraction structure to perform feature comprehensive processing on each sub-text feature of the text picture to be recognized in multiple width units to obtain comprehensive features of the text picture to be recognized across multiple width units includes:
and calling the picture type feature extraction structure to calculate an average value of each sub-text feature of the text picture to be recognized in multiple width units, so as to obtain the comprehensive feature of the text picture to be recognized across multiple width units.
Further, the step of acquiring the text picture to be recognized includes:
calling a preset text detection model to detect the text position in the picture to be recognized;
and cutting the picture to be recognized based on the text position to obtain the text picture to be recognized containing one line or one column of text characters.
Based on the above structure, various embodiments of the text recognition method are proposed.
Referring to fig. 2, fig. 2 is a flowchart illustrating a text recognition method according to a first embodiment of the present invention.
Although a logical order is shown in the flowchart, in some cases the steps shown or described may be performed in a different order. The executing subject of each embodiment of the text recognition method of the present invention may be a device such as a smart phone, a personal computer, or a server; for convenience of description, the executing subject is omitted in the following embodiments. In this embodiment, the text recognition method includes:
step S10, acquiring a text picture to be recognized;
the text picture may be a picture in which text (characters, numbers, characters, or the like, hereinafter collectively referred to as characters) exists in the picture, and may be a picture obtained by shooting or scanning various certificates or paper documents. The text picture to be recognized is the text picture of the text in the picture to be recognized. It should be noted that the text picture to be recognized may be a small picture containing one, one or one column of text, for example, a small picture containing one row of numbers that is cut out by the user through a picture editing tool; it may also be a large picture containing multiple rows or columns, for example, a picture obtained by scanning a paper document; the image may also include a text region and a non-text region, for example, a picture obtained by scanning the front surface of the identification card, where the text region includes "name", "gender", and the like, and also includes a blank non-text region.
There are various ways to acquire the text picture to be recognized, and they may differ across application scenarios: for example, a pre-stored text picture may be read, a paper document may be scanned, or a certificate may be photographed.
Further, step S10 includes:
step S101, calling a preset text detection model to detect a text position in a picture to be recognized;
and S102, cutting the picture to be recognized based on the text position to obtain the text picture to be recognized containing one line or one column of text characters.
The process of acquiring the text picture to be recognized may be as follows. For a picture to be recognized that contains a text region and a non-text region, a preset text detection model may be called to detect the text position in the picture, i.e., the position of the text region within it; the text detection model may be any commonly used existing text detection model, which is not described further here. The picture to be recognized is then cut according to the text position, and the text region is cut out to obtain a text picture to be recognized containing one line or one column of text characters.
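The detect-then-cut procedure of steps S101 and S102 can be sketched as follows. `detect_text_positions` is a hypothetical stand-in for the preset text detection model (the patent does not specify which detector is used); the boxes it returns here are fixed toy values:

```python
import numpy as np

def detect_text_positions(picture):
    # Hypothetical stand-in for the preset text detection model:
    # returns bounding boxes (top, bottom, left, right) of text lines.
    # A real detector would be called here instead.
    h, w = picture.shape[:2]
    return [(0, h // 2, 0, w), (h // 2, h, 0, w)]

def cut_text_lines(picture, boxes):
    # Cut the picture to be recognized along each detected box,
    # yielding one text picture per line of text.
    return [picture[t:b, l:r] for (t, b, l, r) in boxes]

page = np.zeros((64, 200), dtype=np.uint8)   # scanned page (toy array)
boxes = detect_text_positions(page)          # step S101
lines = cut_text_lines(page, boxes)          # step S102
```

Each element of `lines` is then a single-line text picture ready for the recognition model.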
Step S20, calling a text feature extraction structure in a text recognition model to extract the text features of the text picture to be recognized, wherein the text recognition model is obtained by pre-training;
In this embodiment, after the text picture to be recognized is obtained, a text recognition model may be used to perform text recognition on it. The text recognition model is trained in advance. Specifically, a large number of text pictures, together with labels for the texts in them, may be used to train the model; the training itself may follow existing machine learning procedures, which are not detailed here. It should be noted that the text pictures used for training may be of multiple picture types; when different types of text pictures are used for training, the trained model can achieve a good recognition effect on those different types, which improves the versatility of the text recognition model. Picture types may be distinguished by the type of text in the picture, such as Chinese, English, or digits, or by how the picture was obtained; for example, pictures obtained by scanning paper documents and identity documents belong to different types.
In this embodiment, the text recognition model may include a text feature extraction structure, a picture type feature extraction structure, and a recognition structure. The text feature extraction structure is used to extract the text features in a text picture. It is understood that the text features characterize the text in the picture, and different texts obviously have different features; for example, the shapes of the digits 1 and 2 clearly differ. The extracted text features may be represented in any form, such as a vector or a matrix, whose elements may represent features of the text in various aspects, such as shape, color, or size. The text feature extraction structure may include at least a backbone network, which may adopt a common feature extraction network such as VGG (Visual Geometry Group), ResNet (residual network), or DenseNet. The text feature extraction structure may further include a head network connected after the backbone network, which may adopt a recurrent neural network such as an RNN (Recurrent Neural Network), LSTM (Long Short-Term Memory), BiLSTM (Bi-directional Long Short-Term Memory), or GRU (Gated Recurrent Unit). Because the text in a text picture has contextual correlation, processing the sequential input with such a head network makes the extracted text features more accurate.
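The backbone-plus-head arrangement can be sketched at the level of tensor shapes. Both functions below are deliberately crude stand-ins (the real structures would be a trained CNN and a recurrent network); the width-unit count, feature dimension, and neighbour-averaging "recurrence" are all assumptions made for the sketch:

```python
import numpy as np

rng = np.random.default_rng(1)

def backbone(picture, num_width_units=10, dim=16):
    # Stand-in for a VGG/ResNet/DenseNet backbone: collapses the
    # height axis and pools the width axis into fixed width units,
    # leaving one feature vector per width unit.
    h, w = picture.shape
    cols = picture.mean(axis=0)                              # (w,)
    units = cols.reshape(num_width_units, -1).mean(axis=1)   # (T,)
    return units[:, None] * rng.standard_normal((1, dim))    # (T, dim)

def recurrent_head(feats):
    # Stand-in for the recurrent head (RNN/LSTM/BiLSTM/GRU):
    # mixes each width unit with its left and right neighbours,
    # mimicking the contextual correlation the patent relies on.
    left = np.roll(feats, 1, axis=0)
    right = np.roll(feats, -1, axis=0)
    return (left + feats + right) / 3.0

picture = rng.random((32, 100))
sub_text_features = recurrent_head(backbone(picture))
```

The output has one sub-text feature vector per width unit, which is the shape the later gating and decoding steps consume.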
Calling the text feature extraction structure in the text recognition model to extract the text features of the text picture to be recognized specifically means inputting the text picture into the text feature extraction structure, which then outputs the text features of the text picture to be recognized.
Step S30, calling a picture type feature extraction structure in the text recognition model to extract the picture type features of the text picture to be recognized;
the picture type feature extraction structure may be a picture type feature for extracting a text picture, and it is understood that the picture type feature represents a picture feature of the text picture, and obviously different types of picture types have different features, for example, picture background features of text pictures obtained by scanning different documents are different. The specific representation form of the extracted image type feature is not limited, and may be, for example, a vector or a matrix, etc., where elements of the vector or the matrix may represent the type feature representation of the text image in various aspects, such as image background, source, etc. The picture type feature extraction structure can adopt a common neural network structure; the picture type feature extraction structure can be set independently from the text feature extraction structure, and the text picture is taken as input to extract the picture type feature of the whole text picture; or the text features output by the text feature recognition structure are used as input, and the picture type features of the text picture are extracted based on the text features.
And calling a picture type feature extraction structure in the text recognition model to extract the picture type features of the text picture to be recognized. Specifically, the text features of the text picture to be recognized may be input into the picture type feature extraction structure, and the picture type feature extraction structure outputs the picture type features of the text picture to be recognized.
And step S40, calling an identification structure in the text identification model to identify and obtain a text identification result of the text picture to be identified based on the text feature and the picture type feature.
A recognition structure in the text recognition model is called to perform recognition based on the text features and the picture type features, yielding the text recognition result of the text picture to be recognized. The recognition structure may be connected after the text feature extraction structure and the picture type feature extraction structure; that is, it takes the text features and the picture type features as input and outputs the text recognition result of the text picture to be recognized. Specifically, the recognition structure may adopt one or more fully connected layers that take the text features and the picture type features as input, followed by a conventional classifier; the classifier outputs which character each text unit in the picture belongs to, and the output characters together form the text recognition result. Because the recognition structure recognizes the text not only from the text features but also by combining the picture type features with them, the picture type features become a factor influencing the recognition result. Characters can therefore be accurately recognized even when fonts, backgrounds, and so on differ across text picture types, which improves the versatility of the text recognition model, i.e., the versatility of the OCR model. The text recognition model may be used alone as an OCR model, or a text detection model may be connected in front of it and the whole used as an OCR model.
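The gating that the claims describe — a first fully connected layer per sub-text feature, a second for the picture type feature, then an element-wise product decoded per width unit — can be shown with toy numpy tensors. All dimensions and weights are random illustrative values, and the CTC-style blank index is an assumption (the patent does not specify the decoder):

```python
import numpy as np

rng = np.random.default_rng(2)
T, D, K = 6, 8, 11          # width units, feature dim, output size (10 digits + 1 blank)
charset = "0123456789"

sub_text = rng.standard_normal((T, D))   # one sub-text feature per width unit
pic_type = sub_text.mean(axis=0)         # picture type feature (here: mean pooling)

W1 = rng.standard_normal((D, K))         # first fully connected layer (random stand-in)
W2 = rng.standard_normal((D, K))         # second fully connected layer (random stand-in)

first = sub_text @ W1                    # (T, K): one first vector per width unit
second = pic_type @ W2                   # (K,): same dimension as each first vector
third = first * second                   # element-wise product -> third vectors

# Decoder: greedy per-width-unit argmax; index K-1 is treated as a
# CTC-style blank and dropped (an assumption, not the patent's text).
ids = third.argmax(axis=1)
result = "".join(charset[i] for i in ids if i < len(charset))
```

Because `second` multiplies every width unit's scores, the picture type feature modulates each character decision rather than being a separate output.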
In this embodiment, a text recognition model is trained in advance; the text feature extraction structure in the model is called to extract the text features of the text picture to be recognized, the picture type feature extraction structure is called to extract its picture type features, and the recognition structure is then called to perform recognition based on the text features and the picture type features, obtaining the text recognition result. The picture type features thus become a factor influencing the recognition result, so the characters in the text can be accurately recognized even when text pictures are of different types, which improves the versatility of the text recognition model, i.e., the versatility of the OCR model. Because the model is more versatile, there is no need to train a separate model for each type of text picture; only one universal model needs to be trained, which improves training efficiency and also broadens the usable training data.
Further, based on the first embodiment described above, a second embodiment of the text recognition method according to the present invention is proposed, and in this embodiment, the step S30 includes:
step S301, calling the picture type feature extraction structure to extract the comprehensive features of the text picture to be recognized spanning multiple width units, and taking the comprehensive features as the picture type features of the text picture to be recognized.
In this embodiment, the text picture may be divided into a plurality of width units of a preset size; for example, the width occupied by one character may be taken as one width unit, so that a small picture containing one line of N characters is divided into N width units. The picture type feature extraction structure may be configured to extract a comprehensive feature of the text picture spanning a plurality of width units and use that comprehensive feature as the picture type feature of the text picture to be recognized. The comprehensive feature may be obtained by integrating the features extracted from the individual width units; those per-unit features may be text features or other features, such as background features and font features. Because the text features extracted by the text feature extraction structure describe a single character, they do not capture information integrated across a plurality of width units; if the text were identified from the text features alone, the influence of the whole text picture on the recognition of each single character would be ignored. This embodiment therefore uses the picture type feature extraction structure to extract a comprehensive feature spanning multiple width units, which can assist the recognition of each single character.
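The division into width units can be pictured with a minimal NumPy sketch. The fixed per-character pixel width `char_w` and the function name are illustrative assumptions — in a real model this division happens implicitly through strided convolutions rather than explicit slicing:

```python
import numpy as np

def split_into_width_units(img, char_w):
    """Split a single-line text image of shape (height, Width) into
    crops of width char_w, one crop per width unit.

    char_w is an assumed fixed per-character width, used here only to
    illustrate the 'one character = one width unit' convention.
    """
    height, width = img.shape
    n = width // char_w  # number of width units N
    return [img[:, i * char_w:(i + 1) * char_w] for i in range(n)]

img = np.zeros((32, 320))                      # a line picture, 32 px tall, 320 px wide
units = split_into_width_units(img, char_w=32) # 10 width units of 32 px each
```

Each element of `units` then corresponds to one sub-text feature's receptive region.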
Further, the text features include sub-text features corresponding to each width unit of the text picture to be recognized, and the step S301 includes:
step S3011, calling the picture type feature extraction structure to perform feature comprehensive processing on each sub-text feature of the text picture to be recognized in multiple width units, so as to obtain a comprehensive feature of the text picture to be recognized across multiple width units.
The text feature extraction structure may be configured to divide an input text picture into a plurality of width units and extract the text features of the picture in each width unit separately — the sub-text features — so that each width unit corresponds to one sub-text feature. The text features extracted by the text feature extraction structure can be represented by a matrix (w, h), where w is the width dimension and h is the feature dimension, so the matrix elements express the features of each width unit along the different feature dimensions. The comprehensive feature spanning a plurality of width units extracted by the picture type feature extraction structure may be obtained by comprehensively processing the sub-text features corresponding to those width units. In this case, the picture type feature extraction structure may be placed after the text feature extraction structure and take the sub-text features — for example, the matrix (w, h) above — as input; it may adopt one or more fully connected layers, and its output may be a vector or a matrix representing the extracted picture type feature of the text picture.
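One way to realize the fully-connected variant described above is sketched below in NumPy. The weight initialization, the choice to flatten the (w, h) matrix, and the output dimension h are all illustrative assumptions, not the patent's prescribed implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
w, h = 10, 64                          # width units x feature dimension
text_feats = rng.normal(size=(w, h))   # sub-text features, one row per width unit

# A single fully connected layer mapping the flattened (w*h) sub-text
# features to an h-dimensional picture type feature vector.
# Weights are random stand-ins; a trained model would learn them.
W_fc = rng.normal(size=(w * h, h)) * 0.01
b_fc = np.zeros(h)
pic_type_feat = text_feats.reshape(-1) @ W_fc + b_fc
```

The output `pic_type_feat` plays the role of the picture type feature that is later fed to the recognition structure.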
Further, the step S3011 includes:
step a, calling the picture type feature extraction structure to calculate an average value of each sub-text feature of the text picture to be recognized in multiple width units, and obtaining a comprehensive feature of the text picture to be recognized across the multiple width units.
The picture type feature extraction structure may be a structure that calculates an average value: it is called to average the sub-text features corresponding to the plurality of width units, and the average is taken as the comprehensive feature spanning those width units. For example, when the text features are represented by a matrix (w, h), each row of the matrix is one sub-text feature; averaging the sub-text features means averaging each column of the matrix, yielding a vector a = [a_1, a_2, ..., a_h]. The vector a represents the comprehensive feature spanning multiple width units and is used as the extracted picture type feature. The recognition structure of the text recognition model is then called to perform recognition based on the text features and the picture type features, obtaining the text recognition result of the text picture to be recognized.
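The column-wise averaging described above amounts to a mean over the width axis of the feature matrix. A minimal sketch, with a deterministic stand-in feature matrix for illustration:

```python
import numpy as np

w, h = 10, 64
# Stand-in sub-text features: one row per width unit, h feature dimensions.
text_feats = np.arange(w * h, dtype=float).reshape(w, h)

# Average over the width dimension (each column of the matrix),
# giving the cross-width-unit comprehensive feature a = [a_1, ..., a_h].
a = text_feats.mean(axis=0)
```

The resulting h-dimensional vector `a` is what the second fully connected layer of the recognition structure consumes.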
In this embodiment, the text feature extraction structure is called to extract the sub-text features corresponding to each width unit of the text picture, the picture type feature extraction structure then averages those sub-text features and the result is used as the picture type feature of the text picture, and the recognition structure is then called to recognize the text picture to be recognized based on the text features and the picture type features. During recognition, the model therefore relies not only on the text features of a single character but also on the feature formed by combining the text features of all the characters, making the recognition of each single character more accurate and avoiding the poor recognition that occurs when the background, font and the like change. Better recognition is thus achieved for different types of text pictures.
Further, based on the first and second embodiments, a third embodiment of the text recognition method according to the present invention is proposed, in which the recognition structure includes a first fully connected layer and a second fully connected layer, and step S40 includes:
step S401, inputting the sub-text features corresponding to a single width unit into the first full connection layer to obtain a first vector;
In this embodiment, the recognition structure may comprise a first fully connected layer and a second fully connected layer. The first fully connected layer is configured to take the text features — specifically, one sub-text feature — as input; the second fully connected layer is configured to take the picture type features as input. The outputs of the first and second fully connected layers may each be a vector whose number of elements is the total number of characters the text recognition model can recognize.
To perform recognition based on the text features and the picture type features, the recognition structure inputs the sub-text feature corresponding to a single width unit into the first fully connected layer, which outputs the first vector. Likewise, the sub-text features of the other width units are input into the first fully connected layer in turn, yielding a first vector for each width unit.
Step S402, inputting the picture type characteristics into the second full-connection layer to obtain a second vector, wherein the dimensions of the first vector and the second vector are the same;
and inputting the picture type characteristics into a second full-connection layer, and outputting to obtain a second vector. The first vector and the second vector have the same dimension, that is, the number of elements of the vectors is the same.
Step S403, multiplying corresponding elements of the first vector and the second vector to obtain a third vector;
and multiplying the corresponding elements of the first vector and the second vector to obtain a third vector. The corresponding elements are multiplied, that is, the first element of the first vector is multiplied by the first element of the second vector, the second element of the first vector is multiplied by the second element of the second vector, and so on, and the result is taken as the element of the third vector, it can be understood that the number of the elements of the third vector is the same as that of the first vector and the second vector. Similarly, the plurality of first vectors are respectively multiplied by the corresponding elements of the second vector, and then a plurality of third vectors respectively corresponding to the width units are obtained. Since the first vector is derived from the text feature and the second vector is derived from the picture type feature, the third vector corresponding to each width unit integrates the text feature and the picture type feature.
Step S404, determining a text recognition result of the text picture to be recognized according to the third vector corresponding to each width unit.
And determining a text recognition result of the text picture to be recognized according to the third vector corresponding to each width unit. Specifically, if one width unit corresponds to one character, each character can be recognized according to each third vector, and the recognition results of each character are combined to obtain the text recognition result of the text picture to be recognized.
Further, the identification structure further includes a decoder, and the step S404 includes:
step S4041, the third vectors corresponding to the width units are respectively input into the decoder to obtain target characters corresponding to the width units, and the target characters are used as the text recognition results of the text pictures to be recognized.
The recognition structure may further include a decoder, which may be a decoder used by common classification models, with a preset correspondence between the element values of the third vector and the characters the text recognition model can recognize. Specifically, the third vectors corresponding to the width units are respectively input into the decoder, which decodes each into a target character; the target characters, combined in order, give the text recognition result of the text picture to be recognized. Further, a softmax (activation function) operation may be applied to the third vector before it is passed to the decoder; softmax is used in multi-class classification to map the outputs of a set of neurons into the (0, 1) interval.
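A greedy softmax-then-argmax decoder of the kind described can be sketched as follows. The four-character table and the logit values are purely hypothetical; real models use far larger character sets and often more elaborate (e.g. CTC-style) decoding:

```python
import numpy as np

charset = ["0", "O", "1", "l"]     # hypothetical character table

def softmax(v):
    e = np.exp(v - v.max())        # subtract max for numerical stability
    return e / e.sum()

def decode(third_vectors, charset):
    """Greedy decode: softmax each third vector, pick the argmax character,
    and concatenate the per-width-unit results in order."""
    return "".join(charset[int(np.argmax(softmax(v)))] for v in third_vectors)

third_vectors = [np.array([3.0, 0.1, 0.2, 0.3]),
                 np.array([0.1, 0.2, 4.0, 0.3])]
text = decode(third_vectors, charset)   # -> "01"
```

Softmax does not change the argmax, so it matters for training losses and confidences rather than for this greedy readout.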
Further, in one embodiment, the text picture to be recognized may be a small picture (a text picture containing one row or one column of text characters) produced by a text detection model. The input text picture to be recognized is processed by a backbone network and a head network in the text recognition structure into a tensor T[batch_size, w, h], where batch_size is the batch size of the input text pictures. The value of w is computed as follows: assuming the text feature extraction structure reduces the width of an input picture from Width to Width/x_w, where Width is the width of the input text picture and x_w is a preset picture scaling ratio, then w = Width/x_w. The value of h is a preset feature dimension parameter. The first dimension of T indexes the batch; each sub-tensor in the batch is a matrix [w, h], the feature matrix extracted from one text picture to be recognized in the batch. Averaging each column of this matrix [w, h] yields a vector a = [a_1, a_2, ..., a_h]. The first fully connected layer is then applied to each row [bw_1, bw_2, ..., bw_h] of each matrix in the batch of T, and the second fully connected layer is applied to a. Both fully connected layers have input dimension h and output dimension c, the total number of characters the text recognition model can recognize. A fully connected layer essentially performs a matrix multiplication; denote the matrices of the two fully connected layers by D_1 and D_2, with elements d_1_i_j and d_2_i_j respectively, where 1 ≤ i ≤ c and 1 ≤ j ≤ h.
The outputs of the two fully connected layers (the first vector and the second vector) are then multiplied element-wise to obtain the final output, a c-dimensional vector (the third vector) denoted O, whose elements O_k are given by: O_k = (d_1_k_1·bw_1 + d_1_k_2·bw_2 + ... + d_1_k_h·bw_h) × (d_2_k_1·a_1 + d_2_k_2·a_2 + ... + d_2_k_h·a_h), where 1 ≤ k ≤ c. The second factor, d_2_k_1·a_1 + d_2_k_2·a_2 + ... + d_2_k_h·a_h, represents information such as the font and background of the text picture to be recognized. A softmax operation is then applied to the vector O, and a decoder is connected afterwards to decode it into the text recognition result of the text picture to be recognized.
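The O_k formula above is simply the element-wise product of two matrix-vector products, which can be verified directly with toy sizes (the random values and dimensions below are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
h, c = 4, 3                        # feature dimension and character count (toy sizes)
bw = rng.normal(size=h)            # sub-text feature of one width unit [bw_1..bw_h]
a = rng.normal(size=h)             # picture type feature [a_1..a_h] (column means)
D1 = rng.normal(size=(c, h))       # matrix of the first fully connected layer
D2 = rng.normal(size=(c, h))       # matrix of the second fully connected layer

# O_k = (sum_j D1[k,j]*bw[j]) * (sum_j D2[k,j]*a[j]):
# an element-wise product of the two fully-connected-layer outputs.
O = (D1 @ bw) * (D2 @ a)
```

Because `D2 @ a` is shared across all width units of one picture, the font/background factor rescales every character score consistently.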
In this embodiment, the reason existing OCR models lack universality across different types of text pictures is that the main difference between text pictures lies in background and font, and under different font and background combinations a situation often arises in which characters look the same but mean different things. For example, the character O represents the digit 0 under some backgrounds and fonts, and the letter O under others; a character such as θ represents the Greek letter theta in some fonts and backgrounds, while in other cases it may represent the digit 0. A single conventional OCR model therefore cannot recognize nearly identical characters as two different results under different backgrounds and fonts. In this embodiment, the text features and the picture type features of the text picture to be recognized are integrated through the first and second fully connected layers of the recognition structure, so the final recognition result takes the background and font of the text picture to be recognized into account; text pictures with different fonts and backgrounds can then receive different classification results even when the characters look almost the same, improving the universality of the text recognition model.
In addition, an embodiment of the present invention further provides a text recognition apparatus, and referring to fig. 3, the text recognition apparatus includes:
the acquiring module 10 is used for acquiring a text picture to be identified;
the text feature extraction module 20 is configured to invoke a text feature extraction structure in a text recognition model to extract text features of the text picture to be recognized, where the text recognition model is obtained through pre-training;
the picture type feature extraction module 30 is configured to invoke a picture type feature extraction structure in the text recognition model to extract picture type features of the text picture to be recognized;
and the recognition module 40 is used for calling a recognition structure in the text recognition model to perform recognition based on the text features and the picture type features, obtaining a text recognition result of the text picture to be recognized.
Further, the picture type feature extraction module 30 is further configured to:
and calling the picture type feature extraction structure to extract comprehensive features of the text picture to be recognized spanning multiple width units, and taking the comprehensive features as the picture type features of the text picture to be recognized.
Further, the text features comprise sub-text features respectively corresponding to each width unit of the text picture to be recognized,
the picture type feature extraction module 30 is further configured to:
and calling the picture type feature extraction structure to perform feature comprehensive processing on each sub-text feature of the text picture to be recognized in multiple width units to obtain comprehensive features of the text picture to be recognized across multiple width units.
Further, the identification structure comprises a first fully connected layer and a second fully connected layer, and the identification module 40 comprises:
a first input unit, configured to input the sub-text feature corresponding to a single width unit into the first full-link layer, so as to obtain a first vector;
a second input unit, configured to input the picture type feature into the second full-link layer to obtain a second vector, where dimensions of the first vector and the second vector are the same;
the calculating unit is used for multiplying corresponding elements of the first vector and the second vector to obtain a third vector;
and the determining unit is used for determining the text recognition result of the text picture to be recognized according to the third vector corresponding to each width unit.
Further, the identification structure further comprises a decoder, and the determining unit is further configured to:
and respectively inputting the third vectors corresponding to the width units into the decoder to obtain target characters corresponding to the width units, and taking the target characters as text recognition results of the text pictures to be recognized.
Further, the picture type feature extraction module 30 is further configured to:
and calling the picture type feature extraction structure to calculate an average value of each sub-text feature of the text picture to be recognized in multiple width units, so as to obtain the comprehensive feature of the text picture to be recognized across multiple width units.
Further, the obtaining module 10 includes:
the calling unit is used for calling a preset text detection model to detect the text position in the picture to be recognized;
and the cutting unit is used for cutting the picture to be recognized based on the text position to obtain the text picture to be recognized containing one line or one column of text characters.
The specific implementation of the text recognition apparatus of the present invention is basically the same as the embodiments of the text recognition method, and is not described herein again.
In addition, an embodiment of the present invention further provides a computer-readable storage medium, on which a text recognition program is stored; when executed by a processor, the text recognition program implements the steps of the text recognition method as described above.
The embodiments of the text recognition device and the computer-readable storage medium of the present invention can refer to the embodiments of the text recognition method of the present invention, and are not described herein again.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A text recognition method, characterized in that the text recognition method comprises the steps of:
acquiring a text picture to be identified;
calling a text feature extraction structure in a text recognition model to extract text features of the text picture to be recognized, wherein the text recognition model is obtained by pre-training;
calling a picture type feature extraction structure in the text recognition model to extract picture type features of the text picture to be recognized;
and calling an identification structure in the text identification model to identify and obtain a text identification result of the text picture to be identified based on the text feature and the picture type feature.
2. The text recognition method of claim 1, wherein the step of invoking the picture type feature extraction structure in the text recognition model to extract the picture type features of the text picture to be recognized comprises:
and calling the picture type feature extraction structure to extract comprehensive features of the text picture to be recognized spanning multiple width units, and taking the comprehensive features as the picture type features of the text picture to be recognized.
3. The text recognition method according to claim 2, wherein the text features comprise sub-text features corresponding to the width units of the text picture to be recognized respectively,
the step of calling the picture type feature extraction structure to extract the comprehensive features of the text picture to be recognized across a plurality of width units comprises the following steps:
and calling the picture type feature extraction structure to perform feature comprehensive processing on each sub-text feature of the text picture to be recognized in multiple width units to obtain comprehensive features of the text picture to be recognized across multiple width units.
4. The text recognition method of claim 3, wherein the recognition structure comprises a first fully-connected layer and a second fully-connected layer, and the step of calling the recognition structure in the text recognition model to obtain the text recognition result of the text picture to be recognized based on the text feature and the picture type feature recognition comprises:
inputting the sub-text features corresponding to a single width unit into the first full-connection layer to obtain a first vector;
inputting the picture type features into the second full-connection layer to obtain a second vector, wherein the dimensions of the first vector and the second vector are the same;
multiplying the first vector and the second vector by corresponding elements to obtain a third vector;
and determining a text recognition result of the text picture to be recognized according to the third vector corresponding to each width unit.
5. The text recognition method of claim 4, wherein the recognition structure further comprises a decoder, and the step of determining the text recognition result of the text picture to be recognized according to the third vector corresponding to each width unit comprises:
and respectively inputting the third vectors corresponding to the width units into the decoder to obtain target characters corresponding to the width units, and taking the target characters as text recognition results of the text pictures to be recognized.
6. The method of claim 3, wherein the step of calling the picture type feature extraction structure to perform feature synthesis processing on each of the sub-text features of the text picture to be recognized in multiple width units to obtain a synthesized feature of the text picture to be recognized across multiple width units comprises:
and calling the picture type feature extraction structure to calculate an average value of each sub-text feature of the text picture to be recognized in multiple width units, so as to obtain the comprehensive feature of the text picture to be recognized across multiple width units.
7. The text recognition method of any one of claims 1 to 6, wherein the step of obtaining the text picture to be recognized comprises:
calling a preset text detection model to detect the text position in the picture to be recognized;
and cutting the picture to be recognized based on the text position to obtain the text picture to be recognized containing one line or one column of text characters.
8. A text recognition apparatus, characterized in that the text recognition apparatus comprises:
the acquisition module is used for acquiring a text picture to be identified;
the text feature extraction module is used for calling a text feature extraction structure in a text recognition model to extract the text features of the text picture to be recognized, wherein the text recognition model is obtained by pre-training;
the picture type feature extraction module is used for calling a picture type feature extraction structure in the text recognition model to extract the picture type features of the text picture to be recognized;
and the recognition module is used for calling a recognition structure in the text recognition model to recognize and obtain a text recognition result of the text picture to be recognized based on the text feature and the picture type feature.
9. A text recognition apparatus characterized by comprising: memory, processor and text recognition program stored on the memory and executable on the processor, which when executed by the processor implements the steps of the text recognition method according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a text recognition program which, when executed by a processor, implements the steps of the text recognition method according to any one of claims 1 to 7.
CN202010127912.0A 2020-02-28 2020-02-28 Text recognition method, device, equipment and storage medium Pending CN111368841A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010127912.0A CN111368841A (en) 2020-02-28 2020-02-28 Text recognition method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010127912.0A CN111368841A (en) 2020-02-28 2020-02-28 Text recognition method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN111368841A true CN111368841A (en) 2020-07-03

Family

ID=71206527

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010127912.0A Pending CN111368841A (en) 2020-02-28 2020-02-28 Text recognition method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111368841A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112990175A * 2021-04-01 2021-06-18 深圳思谋信息科技有限公司 Method and device for recognizing handwritten Chinese characters, computer equipment and storage medium
CN113780098A * 2021-08-17 2021-12-10 北京百度网讯科技有限公司 Character recognition method, character recognition device, electronic equipment and storage medium
CN113780098B * 2021-08-17 2024-02-06 北京百度网讯科技有限公司 Character recognition method, character recognition device, electronic equipment and storage medium
WO2023065895A1 * 2021-10-18 2023-04-27 北京有竹居网络技术有限公司 Text recognition method and apparatus, readable medium, and electronic device

Similar Documents

Publication Publication Date Title
US10846553B2 (en) Recognizing typewritten and handwritten characters using end-to-end deep learning
US8494273B2 (en) Adaptive optical character recognition on a document with distorted characters
CN110705233B (en) Note generation method and device based on character recognition technology and computer equipment
CN111368841A (en) Text recognition method, device, equipment and storage medium
EP1231558A2 (en) A printing control interface system and method with handwriting discrimination capability
CN110503054B (en) Text image processing method and device
CN109409349B (en) Credit certificate authentication method, credit certificate authentication device, credit certificate authentication terminal and computer readable storage medium
CN111340037B (en) Text layout analysis method and device, computer equipment and storage medium
JP2002063215A (en) Method and system for displaying document, computer program and recording medium
US7796817B2 (en) Character recognition method, character recognition device, and computer product
EP3540644B1 (en) Image processing device, image processing method, and image processing program
US11418658B2 (en) Image processing apparatus, image processing system, image processing method, and storage medium
EP3525441A1 (en) Receipt processing apparatus, program, and report production method
CN112801084A (en) Image processing method and device, electronic equipment and storage medium
CN108090728B (en) Express information input method and system based on intelligent terminal
CN113673528A (en) Text processing method and device, electronic equipment and readable storage medium
CN115083024A (en) Signature identification method, device, medium and equipment based on region division
CN115188006A (en) Method and device for extracting text from image, storage medium and electronic equipment
CN111213157A (en) Express information input method and system based on intelligent terminal
CN113111882A (en) Card identification method and device, electronic equipment and storage medium
JP4945593B2 (en) Character string collation device, character string collation program, and character string collation method
CN114202761B (en) Information batch extraction method based on picture information clustering
JP3114446B2 (en) Character recognition device
JP7410532B2 (en) Character recognition device and character recognition program
EP4379678A1 (en) Image processing system, image processing method, and program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination