CN109389115A - Text recognition method, device, storage medium and computer equipment - Google Patents
Text recognition method, device, storage medium and computer equipment
- Publication number
- CN109389115A CN201710687380.4A
- Authority
- CN
- China
- Prior art keywords
- character
- text sequence
- type
- image
- text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/60—Type of objects
- G06V20/62—Text, e.g. of license plates, overlay texts or captions on TV images
- G06V20/63—Scene text, e.g. street names
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Theoretical Computer Science (AREA)
- Character Discrimination (AREA)
Abstract
The present invention relates to a text recognition method, device, storage medium and computer equipment. The method comprises: obtaining a text sequence image; performing character recognition on the text sequence image according to the character recognition mode corresponding to each character type, where characters in the text sequence image that do not belong to the respective character type are recognized as general characters not belonging to that character type, to obtain a text sequence corresponding to each character type; choosing a text sequence from the text sequences corresponding to the character types; determining the positions of the general characters not belonging to the respective character type in the chosen text sequence; obtaining, from the text sequences remaining after the choice, the characters at those positions that belong to the respective character type; and correcting the characters at those positions in the chosen text sequence according to the obtained characters, to obtain a recognition result. The scheme provided by the present application improves the accuracy of text recognition.
Description
Technical field
The present invention relates to the field of computer technology, and in particular to a text recognition method, device, storage medium and computer equipment.
Background
With the development of computer technology, more and more text is added to images to convey information. Using text recognition technology to recognize the text contained in an image is also increasingly common, for example recognizing the text on a business card or in a photo.
At present, text recognition for all kinds of images is mainly based on extracting fixed character features and recognizing each character from them. However, when the text content is complex and diverse, the accuracy of the recognition results produced by this approach drops significantly.
Summary of the invention
In view of this, it is necessary to provide a text recognition method, device, storage medium and computer equipment that address the low recognition accuracy of traditional text recognition methods when the text content is complex and diverse.
A text recognition method, the method comprising:
obtaining a text sequence image;
performing character recognition on the text sequence image according to the character recognition mode corresponding to each character type, where characters in the text sequence image that do not belong to the respective character type are recognized as general characters not belonging to that character type, to obtain a text sequence corresponding to each character type;
choosing a text sequence from the text sequences corresponding to the character types;
determining the positions of the characters not belonging to the respective character type in the chosen text sequence;
obtaining, from the text sequences remaining after the choice, the characters at those positions that belong to the respective character type;
correcting the characters at those positions in the chosen text sequence according to the obtained characters, to obtain a recognition result.
A text recognition device, the device comprising:
a first obtaining module, configured to obtain a text sequence image;
a recognition module, configured to perform character recognition on the text sequence image according to the character recognition mode corresponding to each character type, where characters in the text sequence image that do not belong to the respective character type are recognized as general characters not belonging to that character type, to obtain a text sequence corresponding to each character type;
a choosing module, configured to choose a text sequence from the text sequences corresponding to the character types;
a determining module, configured to determine the positions of the characters not belonging to the respective character type in the chosen text sequence;
a second obtaining module, configured to obtain, from the text sequences remaining after the choice, the characters at those positions that belong to the respective character type;
a correction module, configured to correct the characters at those positions in the chosen text sequence according to the obtained characters, to obtain a recognition result.
One or more non-volatile computer-readable storage media storing computer-executable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the text recognition method.
Computer equipment, including a memory and a processor, the memory storing computer-readable instructions which, when executed by the processor, cause the processor to perform the steps of the text recognition method.
With the above text recognition method, device, storage medium and computer equipment, after a text sequence image is obtained, character recognition is performed separately according to the character recognition mode corresponding to each character type, yielding a text sequence for each character type. When recognition is performed with the character recognition mode of a given character type, the characters in the text sequence image that do not belong to that character type are recognized as general characters not belonging to that character type. One text sequence is then chosen, the positions of the general characters not belonging to the respective character type in the chosen text sequence are determined, and the characters at those positions in the chosen text sequence are corrected using the characters at the same positions in the remaining text sequences, to obtain a recognition result. Recognizing the text sequence image type by type in this way ensures the recognition accuracy of the characters that belong to each character type when recognition is performed for that type. When the text content is complex and diverse, it also covers the recognition of the text of every character type contained in the text sequence image. The characters in each type's text sequence that belong to that type are then used to correct the characters at the corresponding positions in the other text sequences, which yields the recognition result and improves text recognition accuracy.
Brief description of the drawings
Fig. 1 is a schematic diagram of the internal structure of computer equipment in one embodiment;
Fig. 2 is a flow diagram of the text recognition method in one embodiment;
Fig. 3 is a schematic diagram of the text recognition method in one embodiment;
Fig. 4 is a schematic illustration of character correction in one embodiment when the number of obtained characters is consistent with the number of characters at the positions in the chosen text sequence;
Fig. 5 is a schematic illustration of character correction in one embodiment when the number of obtained characters is inconsistent with the number of characters at the positions in the chosen text sequence;
Fig. 6 is a schematic diagram of the text recognition method in another embodiment;
Fig. 7 is a principle flow chart of the text recognition method in a specific application scenario;
Fig. 8 is a structural block diagram of the text recognition device in one embodiment;
Fig. 9 is a structural block diagram of the text recognition device in another embodiment.
Detailed description of the embodiments
To make the objectives, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein merely illustrate the present invention and are not intended to limit it.
Fig. 1 is a schematic diagram of the internal structure of computer equipment in one embodiment. As shown in Fig. 1, the computer equipment includes a processor, a non-volatile storage medium and an internal memory connected by a system bus. The non-volatile storage medium of the computer equipment can store an operating system and computer-readable instructions which, when executed, cause the processor to perform a text recognition method. The processor provides computing and control capabilities and supports the operation of the entire computer equipment. The internal memory can also store computer-readable instructions which, when executed by the processor, cause the processor to perform a text recognition method. The computer equipment may be a terminal or a server. The terminal may specifically be a desktop terminal or a mobile terminal, and the mobile terminal may specifically be at least one of a mobile phone, a tablet computer, a laptop and the like. The server may specifically be an independent physical server or a cluster of physical servers. Those skilled in the art will understand that the structure shown in Fig. 1 is only a block diagram of the part of the structure relevant to the scheme of the present application and does not limit the terminal to which the scheme is applied; a specific terminal may include more or fewer components than shown in the figure, combine certain components, or have a different component arrangement.
Fig. 2 is a flow diagram of the text recognition method in one embodiment. This embodiment is mainly illustrated by applying the method to the computer equipment in Fig. 1 above. Referring to Fig. 2, the method specifically includes the following steps:
S202: obtain a text sequence image.
A text sequence is a character string made up of more than one character in order. A text sequence image is an image that contains a text sequence. Depending on the layout of the text sequence image, a text sequence may be a text line or a text column. A text line is a text sequence whose characters are arranged roughly horizontally, and a text column is a text sequence whose characters are arranged roughly vertically.
In one embodiment, the computer equipment may directly obtain a text sequence image that has already been segmented by text sequence. The text sequence image may be one that the computer equipment receives from another computer device, one that the computer equipment crawls from the internet, or one that the computer equipment obtains by scanning or photographing.
In one embodiment, the computer equipment may first obtain an image awaiting text sequence segmentation and then perform text sequence segmentation on that image to obtain text sequence images. Images awaiting text sequence segmentation include, for example, business card images and document images. A business card image is an image containing business card content and may be a business card photo, a scanned business card or an electronic business card picture. A document image is an image formed by one or more text sequences combined according to a specific layout.
In one embodiment, because different text sequences follow regular layout features, the computer equipment may detect text sequence images in an image based on prior knowledge of those layout features. Such prior layout features include, for example, the gaps between different text lines or text columns, the character spacing inside a text line or text column, and the fact that the centers of the characters inside a text line or text column lie roughly on a straight line. The computer equipment may use these prior layout features to segment the different text sequence images out of the image.
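One of the prior layout features above, the gap between text lines, can be illustrated with a simple row projection. The following is only a minimal sketch under stated assumptions, not the patent's implementation: the image is assumed to be already binarized into rows of 0/1 values with 1 marking a text pixel, and the function name `split_text_lines` is hypothetical.

```python
def split_text_lines(binary_image):
    """Split a binary image (rows of 0/1, 1 = text pixel) into text-line row ranges.

    Returns a list of (start_row, end_row_exclusive) tuples. A row with no
    text pixels is treated as a gap between text lines (an assumption).
    """
    lines = []
    start = None
    for row_index, row in enumerate(binary_image):
        has_text = any(row)
        if has_text and start is None:
            start = row_index                 # a new text line begins
        elif not has_text and start is not None:
            lines.append((start, row_index))  # a blank row closes the line
            start = None
    if start is not None:
        lines.append((start, len(binary_image)))  # line reaches the bottom edge
    return lines


image = [
    [0, 0, 0, 0],
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]
print(split_text_lines(image))
```

Text columns could be handled the same way by projecting onto columns instead of rows.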
In one embodiment, the computer equipment may perform connected domain analysis on the image to extract connected domains. Because the connected domains within the same text sequence can together form one complete connected domain, the computer equipment may take the outer contour of multiple connected domains lying approximately on the same straight line as a text sequence image and segment the different text sequence images out of the image.
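The connected domain approach can be sketched as follows. This is an illustrative example only: the use of 4-connectivity, the bounding-box representation, the `tolerance` parameter and the function names `connected_domains` and `group_into_lines` are all assumptions, and the grouping handles horizontal text lines only.

```python
from collections import deque


def connected_domains(binary_image):
    """Return bounding boxes (top, left, bottom, right) of 4-connected regions of 1-pixels."""
    height = len(binary_image)
    width = len(binary_image[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for r in range(height):
        for c in range(width):
            if binary_image[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first search over one connected domain.
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and binary_image[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


def group_into_lines(boxes, tolerance=1):
    """Group boxes whose vertical centers lie within `tolerance` rows of a group's first box."""
    lines = []
    for box in sorted(boxes, key=lambda b: (b[0], b[1])):
        center = (box[0] + box[2]) / 2
        for line in lines:
            first = line[0]
            if abs((first[0] + first[2]) / 2 - center) <= tolerance:
                line.append(box)
                break
        else:
            lines.append([box])
    return lines


image = [
    [1, 1, 0, 1, 0],
    [1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1],
]
boxes = connected_domains(image)
print(group_into_lines(boxes))
```

Each group of boxes approximates one text line; the union of a group's boxes would give the region to crop as a text sequence image.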
S204: perform character recognition on the text sequence image according to the character recognition mode corresponding to each character type, recognizing the characters in the text sequence image that do not belong to the respective character type as general characters not belonging to that character type, to obtain a text sequence corresponding to each character type.
A character type is a category obtained by classifying characters according to character features, such as the stroke features of a character or the language to which it belongs.
In this embodiment, the computer equipment may classify characters by language, for example into an English character type, a Chinese character type and a Korean character type. Characters left over after classification by language, such as digits and punctuation marks, may be grouped by the computer equipment into a single separate type, such as an "other" character type. The computer equipment may also assign the leftover characters to one of the language-based character types; for example, the English character type may include both English characters and the characters left over after classification by language.
The computer equipment may build a character repertoire for each character type, containing a large number of characters that belong to that character type. For example, the character repertoire built for the English character type contains a large number of English characters. The character recognition mode corresponding to a character type is a recognition mode that accurately recognizes the characters belonging to that character type. The computer equipment may recognize characters not belonging to that character type either accurately or only roughly. For example, the character recognition mode corresponding to the English character type recognizes English characters accurately and places no requirement on the recognition precision of non-English characters.
A general character is a character preset by the computer equipment. When character recognition is performed according to the character recognition mode corresponding to a character type, the general character serves as the recognition result for characters that do not belong to that character type. For example, when character recognition is performed according to the character recognition mode corresponding to the English character type, a character that does not belong to the English character type is recognized as the general character not belonging to the English character type.
In one embodiment, for each character type there may be a single general character not belonging to that character type. For example, for the English character type there may be a general character "Chinese" that does not belong to the English character type, and every character that does not belong to the English character type, such as a Chinese character or a Korean character, is recognized as "Chinese".
In one embodiment, for each character type there may also be multiple general characters not belonging to that character type. These general characters may all belong to the same character type. For example, for the English character type there may be multiple general characters such as "Chinese" and "Korean", with Chinese characters recognized as "Chinese" and Korean characters recognized as "Korean". These general characters may also be characters that correspond one-to-one to the character types other than the respective character type. For example, for the English character type there may be a general character "Chinese" together with a general character that is itself a Korean character, with Chinese characters recognized as "Chinese" and Korean characters recognized as that Korean character.
In one embodiment, when performing character recognition on the text sequence image according to the character recognition mode corresponding to each character type, the computer equipment may first perform character type recognition on the characters in the text sequence image. Character type recognition may be a two-class classification process that determines whether a character belongs to the respective character type or not. The computer equipment may then accurately recognize the characters that belong to the respective character type, and directly use the general character not belonging to the respective character type as the recognition result for the characters that do not belong to it.
For example, for the English character type, assume that the general character preset by the computer equipment is "Chinese", and that a text sequence image containing a Chinese character (rendered "I" in translation), a Korean character and the letter "A" is recognized by the character recognition mode corresponding to the English character type. The first character, the Chinese character, is determined not to belong to the English character type, and "Chinese" is taken as its recognition result. The second character, the Korean character, is likewise determined not to belong to the English character type, and "Chinese" is taken as its recognition result. The third character, "A", is determined to belong to the English character type and is recognized further to obtain an accurate recognition result.
Character type recognition may also be a multi-class classification process that determines which character type a character belongs to. The computer equipment may then accurately recognize the characters that belong to the respective character type, and, for a character that does not, directly use the general character that does not belong to the respective character type and that matches the character type of the character to be recognized as its recognition result.
For example, for the English character type, assume that the computer equipment has preset "Chinese" as the general character for Chinese characters and a Korean character as the general character for Korean characters, and that a text sequence image containing a Chinese character (rendered "I" in translation), a Korean character and the letter "A" is recognized by the character recognition mode corresponding to the English character type. The first character is determined to be of the Chinese character type, and "Chinese" is taken as its recognition result. The second character is determined to be of the Korean character type, and the Korean general character is taken as its recognition result. The third character, "A", is determined to belong to the English character type and is recognized further to obtain an accurate recognition result.
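The two-class and multi-class variants described above can be sketched as follows. This is only an illustration under loud assumptions: real input would be character images, whereas here known characters stand in for them; the Unicode-range classifier `char_type`, the function `recognize_as`, and the placeholder general characters "□" and "△" are all hypothetical and not taken from the patent.

```python
def char_type(ch):
    """Rough character-type classifier based on Unicode ranges (an assumption)."""
    if ch.isascii() and ch.isalpha():
        return "english"
    if "\u4e00" <= ch <= "\u9fff":      # CJK Unified Ideographs
        return "chinese"
    if "\uac00" <= ch <= "\ud7a3":      # Hangul Syllables
        return "korean"
    return "other"


def recognize_as(chars, target_type, general="□", per_type_general=None):
    """Produce the text sequence for `target_type`.

    Characters of `target_type` are kept (standing in for accurate recognition).
    Other characters become a general character: a single one in the
    two-class case, or one chosen by their own type in the multi-class case.
    """
    out = []
    for ch in chars:
        actual = char_type(ch)
        if actual == target_type:
            out.append(ch)
        elif per_type_general is not None:
            out.append(per_type_general.get(actual, general))
        else:
            out.append(general)
    return "".join(out)


sample = "我한A"  # hypothetical input: Chinese, Korean, English
print(recognize_as(sample, "english"))
print(recognize_as(sample, "english", per_type_general={"chinese": "□", "korean": "△"}))
```

In the two-class call both non-English characters collapse to the same general character, while the multi-class call keeps track of which foreign type each position held.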
In one embodiment, the computer equipment may perform character recognition based on template matching. The character recognition mode corresponding to a character type is then a recognition mode that matches against the character templates of that character type. For example, the character recognition mode corresponding to the English character type matches against the character templates of the English character type, so that English characters can be recognized accurately. If the computer equipment needs to recognize non-English characters accurately, it may at the same time match against the character templates of other character types. If the computer equipment does not need to recognize non-English characters accurately, it may directly recognize them as the general character for non-English characters.
Specifically, the computer equipment collects a character template for each character in the character repertoire built for a character type, performs correlation matching between the character to be recognized and the collected templates of that character type, calculates the similarity between the character to be recognized and each character template, and takes the character corresponding to the template with the highest similarity as the recognition result, thereby obtaining the text sequence corresponding to each character type.
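The similarity-and-maximum step above can be sketched with tiny binary bitmaps. This is a hedged illustration, not the patent's matcher: the pixel-agreement similarity, the 3x3 templates, the function names, and in particular the `min_similarity` cutoff for falling back to a general character are assumptions introduced here.

```python
def similarity(a, b):
    """Fraction of pixels that agree between two equally sized binary bitmaps."""
    total = 0
    same = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            total += 1
            same += int(pa == pb)
    return same / total


def match_character(glyph, templates, general="□", min_similarity=0.8):
    """Return the character whose template is most similar to `glyph`.

    If even the best template scores below `min_similarity` (a hypothetical
    cutoff), the glyph is treated as not belonging to this character type
    and the general character is returned instead.
    """
    best_char, best_score = None, -1.0
    for ch, template in templates.items():
        score = similarity(glyph, template)
        if score > best_score:
            best_char, best_score = ch, score
    return best_char if best_score >= min_similarity else general


english_templates = {
    "I": [[0, 1, 0], [0, 1, 0], [0, 1, 0]],
    "L": [[1, 0, 0], [1, 0, 0], [1, 1, 1]],
}
noisy_i = [[0, 1, 0], [0, 1, 0], [0, 1, 1]]
unknown = [[1, 0, 1], [0, 1, 0], [1, 0, 1]]
print(match_character(noisy_i, english_templates))
print(match_character(unknown, english_templates))
```

Running the same glyphs against the templates of each character type in turn would give one text sequence per type.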
For example, consider the text sequence recognized by the character recognition mode corresponding to the English character type: "Chinese Chinese Chinese My name is Addy", where each "Chinese" stands for one general character. Here "M", "y", "n" and so on are present in the character repertoire of the English character type and are therefore characters belonging to the English character type. "Chinese" is not present in the character repertoire of the English character type and is therefore a character not belonging to the English character type.
In one embodiment, the computer equipment may also perform character recognition based on feature extraction. The character recognition mode corresponding to a character type is then a recognition mode that matches against the character features of that character type. Specifically, the computer equipment may extract the character features of each character in the character repertoire built for a character type, extract the character features of the character to be recognized, perform correlation matching against the character features of each character in the repertoire, calculate the similarity between the character to be recognized and each set of character features, and take the character whose features have the highest similarity as the recognition result, thereby obtaining the text sequence corresponding to each character type.
Specifically, the computer equipment may extract geometric features of a character, such as its endpoints, branch points, concave and convex parts, line segments in the horizontal, vertical and oblique directions, and closed loops, and then make a logical combination judgment according to the positions of the extracted features and their relationships to obtain the recognition result.
In one embodiment, the computer equipment may perform character recognition directly on the text sequence image according to the character recognition mode corresponding to each character type, or it may first cut the text sequence image into single-character images and then perform character recognition on each single-character image.
In one embodiment, the computer equipment may use a machine learning model for character recognition. The machine learning model may be a neural network model, specifically a CNN (Convolutional Neural Networks) model or an FCNN (Fully Convolutional Neural Networks) model. CNN models have strong classification ability in the vision domain and can accurately recognize individual characters.
S206: choose a text sequence from the text sequences corresponding to the character types.
Specifically, the computer equipment may randomly choose a text sequence from the text sequences corresponding to the character types. Before choosing, the computer equipment may also count, for each text sequence, the number of characters it contains that belong to its own character type, and choose the text sequence that contains the most such characters.
For example, after recognizing the text sequence image by character type, the computer equipment obtains text sequence A for the Chinese character type, text sequence B for the English character type, text sequence C for the Korean character type and text sequence D for the Japanese character type. A contains 15 Chinese characters, B contains 69 English characters, C contains 3 Korean characters and D contains 6 Japanese characters. The computer equipment may choose any one of the four text sequences A, B, C and D, or it may choose text sequence B, which contains the most characters of its own character type.
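The count-based choice can be sketched as below. This is a minimal example under assumptions: each character type's repertoire is modeled as a Python set, the general character is the placeholder "□", and the names `count_own_type` and `choose_sequence` as well as the sample sequences are hypothetical.

```python
import string


def count_own_type(sequence, charset):
    """Count the characters of `sequence` that are in its type's character repertoire."""
    return sum(1 for ch in sequence if ch in charset)


def choose_sequence(sequences, charsets):
    """Return the character type whose text sequence holds the most own-type characters."""
    return max(sequences, key=lambda t: count_own_type(sequences[t], charsets[t]))


# Hypothetical recognition results for one text sequence image.
sequences = {
    "english": "□□□My name is Addy",
    "chinese": "你好吗" + "□" * 15,
}
charsets = {
    "english": set(string.ascii_letters),
    "chinese": set("你好吗"),
}
chosen_type = choose_sequence(sequences, charsets)
print(chosen_type, sequences[chosen_type])
```

Choosing the sequence with the most own-type characters leaves the fewest positions to correct later; a random choice, as the text also allows, would simply replace the `max` call.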
S208: determine the positions of the general characters not belonging to the respective character type in the chosen text sequence.
Specifically, after choosing a text sequence, the computer equipment may first determine the character type corresponding to the text sequence, then determine the general characters in the text sequence that do not belong to that character type, and then determine the positions of these characters in the chosen text sequence. A character not belonging to the character type corresponding to the text sequence may specifically be a character that is not included in the character repertoire of that character type.
In one embodiment, the computer equipment may traverse the characters in the chosen text sequence and, for each character reached, judge whether it is included in the character repertoire of the respective character type. If the character currently reached is included in that character repertoire, the traversal continues. If it is not included in that character repertoire, the computer equipment records the position of that character in the chosen text sequence.
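The traversal just described can be sketched in a few lines. The example assumes the character repertoire is a Python set that also contains the space character, and that "□" is the general character; these choices and the name `find_foreign_positions` are hypothetical.

```python
import string


def find_foreign_positions(sequence, charset):
    """Traverse `sequence` and record the index of every character not in `charset`."""
    positions = []
    for index, ch in enumerate(sequence):
        if ch in charset:
            continue                 # belongs to the character type, keep traversing
        positions.append(index)      # not in the repertoire, record its position
    return positions


english_charset = set(string.ascii_letters + " ")
chosen = "□□□My name is Addy"
print(find_foreign_positions(chosen, english_charset))
```

The recorded indices are absolute positions counted from the start of the text sequence.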
In one embodiment, when performing character recognition on the text sequence image according to the character recognition mode corresponding to each character type, the computer equipment may mark a character when it is recognized as not belonging to the respective character type. After choosing a text sequence, the computer equipment may look for the marked characters in the chosen text sequence to determine which characters do not belong to the respective character type, and then determine the positions of these characters in the chosen text sequence.
In one embodiment, the position of a character not belonging to the respective character type in the chosen text sequence may be its position relative to the characters in the text sequence that do belong to the respective character type. For example, for the text sequence recognized by the English character type, "Chinese Chinese Chinese My name is Addy", where each "Chinese" stands for one general character, the position of the characters "Chinese" not belonging to the respective character type may be expressed as in front of "My name is Addy".
In one embodiment, the position of a character not belonging to the respective character type in the chosen text sequence may be its absolute position in the text sequence. For example, for the text sequence recognized by the English character type, "Chinese Chinese Chinese My name is Addy", where each "Chinese" stands for one general character, the position of the characters "Chinese" not belonging to the respective character type may be expressed as the first to the third characters.
S210: obtain, from the text sequences remaining after the choice, the characters at the positions that belong to the respective character type.
Specifically, the computer equipment may traverse the remaining text sequences and, for each text sequence reached, judge whether the character at the position belongs to the character type corresponding to that text sequence. If the character at the position in the text sequence currently reached belongs to the character type corresponding to that text sequence, the computer equipment obtains the character. If it does not, the traversal continues.
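This lookup across the remaining text sequences can be sketched as follows. The example assumes that all text sequences are aligned index by index (one output character per input character), which the text does not guarantee in every embodiment; the name `collect_candidates` and the sample data are hypothetical.

```python
def collect_candidates(positions, remaining, charsets):
    """For each position, find a character that belongs to its own sequence's type.

    `remaining` maps a character type to its text sequence; `charsets` maps a
    character type to its character repertoire. Sequences are assumed aligned.
    """
    found = {}
    for pos in positions:
        for type_name, sequence in remaining.items():
            if pos < len(sequence) and sequence[pos] in charsets[type_name]:
                found[pos] = sequence[pos]   # this sequence recognized it accurately
                break                        # stop at the first match
            # otherwise keep traversing the other remaining sequences
    return found


remaining = {
    "chinese": "你好吗" + "□" * 15,
    "korean": "□" * 18,
}
charsets = {
    "chinese": set("你好吗"),
    "korean": set("한"),
}
print(collect_candidates([0, 1, 2], remaining, charsets))
```

Positions for which no remaining sequence holds an own-type character are simply absent from the result.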
S212: correct the characters at the positions in the chosen text sequence according to the obtained characters, to obtain a recognition result.
Specifically, after obtaining the characters at the positions in the remaining text sequences that belong to the respective character types, the computer equipment may, for each determined position, compare the character obtained for that position with the character at that position in the chosen text sequence. When the two are inconsistent, the computer equipment corrects the character at that position in the chosen text sequence with the obtained character. Once the characters at all determined positions have been corrected, a recognition result with higher accuracy is obtained.
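The compare-and-correct step can be sketched as below, assuming the obtained characters are held in a dictionary keyed by absolute position and that there is exactly one obtained character per position; the name `correct_sequence` and the sample values are hypothetical.

```python
def correct_sequence(chosen, candidates):
    """Replace characters of `chosen` with the obtained characters where they differ.

    `candidates` maps an absolute position to the character obtained from the
    remaining text sequences for that position.
    """
    chars = list(chosen)
    for pos, ch in candidates.items():
        if chars[pos] != ch:     # only correct when the two are inconsistent
            chars[pos] = ch
    return "".join(chars)


chosen = "□□□My name is Addy"
candidates = {0: "你", 1: "好", 2: "吗"}
print(correct_sequence(chosen, candidates))
```

Positions without an obtained character keep whatever the chosen text sequence already had.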
With the above text recognition method, after a text sequence image is obtained, character recognition is performed separately for each character type, yielding a text sequence for each character type. When recognition is performed with the character recognition mode of a given character type, the characters in the text sequence image that do not belong to that character type are recognized as general characters not belonging to that character type. One text sequence is then chosen, the positions of the general characters not belonging to the respective character type in the chosen text sequence are determined, and the characters at those positions in the chosen text sequence are corrected using the characters at the same positions in the remaining text sequences, to obtain a recognition result. Recognizing the text sequence image type by type in this way ensures the recognition accuracy of the characters that belong to each character type when recognition is performed for that type. When the text content is complex and diverse, it also covers the recognition of the text of every character type contained in the text sequence image. The characters in each type's text sequence that belong to that type are then used to correct the characters at the corresponding positions in the other text sequences, which yields the recognition result and improves text recognition accuracy.
Fig. 3 shows a schematic diagram of the text recognition method in one embodiment. Referring to Fig. 3, after acquiring a text sequence image, the computer equipment performs character recognition on it separately for each character type to obtain the text sequence corresponding to each type. It then uses the characters of each type in that type's text sequence to correct the characters at the corresponding positions in the other text sequences, thereby obtaining the recognition result.
In one embodiment, step S202 includes: acquiring an image to be recognized; binarizing the image to be recognized to obtain a text image; extracting a text texture image from the text image; determining the connected domains in the text texture image; and determining the text sequence image according to the connected domains.
Here, the image to be recognized is an image whose text sequences are to undergo character recognition, for example a business card image or a document image. Binarizing an image means setting the gray value of every pixel to one of two pixel values, so that the whole image visibly contains only two pixel values.
Specifically, the computer equipment may use a fixed-threshold or an adaptive-threshold binarization algorithm to set the pixels of the image to be recognized that are above the threshold and those that are below it to one of two preset pixel values respectively, namely a first pixel value and a second pixel value. After binarization, all pixels representing text have the first pixel value, for example white, and all pixels representing background have the second pixel value, for example black.
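As a minimal sketch of the fixed-threshold case, the following Python function maps each gray value to one of two pixel values. The function name `binarize`, the threshold of 128, and the assumption that text is darker than the background are illustrative choices that the text does not specify; an adaptive-threshold variant would compute the threshold per region instead.

```python
TEXT, BACKGROUND = 255, 0  # first pixel value (white) and second pixel value (black)

def binarize(gray, threshold=128):
    """Fixed-threshold binarization of a grayscale image given as a list of rows.

    Assumption: text is darker than the background, so pixels below the
    threshold become TEXT and all other pixels become BACKGROUND.
    """
    return [[TEXT if value < threshold else BACKGROUND for value in row]
            for row in gray]

# A 2x3 image with one dark column in the middle.
binary = binarize([[250, 30, 240],
                   [245, 20, 235]])
```

After this call only the middle column of `binary` carries the text pixel value.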
Further, the computer equipment may extract, from the binarized image, the image region formed by the pixels that have the first pixel value (text), obtaining the text image. It may then extract the character stroke texture from the text image and determine the image region formed by the pixels that make up the stroke texture, obtaining the text texture image.
Further, the computer equipment may perform connected domain analysis on the text texture image to extract the connected domains, and may also merge adjacent connected domains. Specifically, the computer equipment may use a stroke smoothing algorithm for the analysis and merging; the algorithm connects the pixels of adjacent connected domains into a single block. Because the connected domains within one text sequence are relatively close to each other, the connected domains of the same text sequence form one complete connected domain.
Still further, the computer equipment may take the outer contour of multiple connected domains that lie approximately on the same straight line as the position of a text sequence image and record it, so as to determine the corresponding text sequence image. Alternatively, the computer equipment may treat each connected domain as an independent text sequence image.
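The connected domain step can be illustrated with a short Python sketch. It is not the stroke smoothing algorithm named above: it labels 8-connected regions with a breadth-first search and then merges neighbouring bounding boxes with a simple gap rule. The names `connected_boxes` and `merge_row` and the `max_gap` parameter are hypothetical, and the merge only compares each box with the previously merged one, which is enough for a single horizontal text line.

```python
from collections import deque

def connected_boxes(binary, text=255):
    """Return bounding boxes (top, left, bottom, right) of 8-connected text regions."""
    height, width = len(binary), len(binary[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if binary[y][x] != text or seen[y][x]:
                continue
            seen[y][x] = True
            queue = deque([(y, x)])
            top, left, bottom, right = y, x, y, x
            while queue:
                cy, cx = queue.popleft()
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and binary[ny][nx] == text and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes

def merge_row(boxes, max_gap=2):
    """Merge boxes that overlap vertically and are at most max_gap columns apart."""
    merged = []
    for box in sorted(boxes, key=lambda b: b[1]):
        if merged:
            top, left, bottom, right = merged[-1]
            overlaps = box[0] <= bottom and top <= box[2]
            if overlaps and box[1] - right - 1 <= max_gap:
                merged[-1] = (min(top, box[0]), left,
                              max(bottom, box[2]), max(right, box[3]))
                continue
        merged.append(box)
    return merged

T = 255
image = [[T, 0, T, 0, 0, 0, 0, 0, T],
         [T, 0, T, 0, 0, 0, 0, 0, T]]
boxes = connected_boxes(image)
sequences = merge_row(boxes)
```

Here the first two strokes are one column apart and merge into one box, while the third stroke is five columns away and stays separate.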
In this embodiment, the text texture image is extracted step by step from the image to be recognized, and the text sequence image is then determined from the connected domains in the text texture image. This avoids including too much background in the text sequence image, so that subsequent character recognition is more accurate.
In one embodiment, step S204 includes: recognizing, in the recognition mode of each character type, the characters in the text sequence image that belong to that type, and recognizing the characters that do not belong to it as general characters; and combining the characters recognized for each character type in order, to obtain the text sequence corresponding to each character type.
Specifically, the computer equipment may set a recognition strategy for each character type in advance. In one embodiment, for a given character type, the computer equipment accurately recognizes the characters that belong to the type, obtaining the actual character for each of them. Characters that do not belong to the type undergo fuzzy processing: they are marked as general characters that do not belong to the type, which distinguishes the accurately recognized characters from the fuzzily processed ones.
In one embodiment, the step of recognizing, in the recognition mode of each character type, the characters in the text sequence image that belong to that type and the general characters that do not includes: segmenting single-character images from the text sequence image; and performing character recognition on the single-character images with the machine learning model of each character type, obtaining the characters that belong to the type and the general characters that do not.
Here, a single-character image is a rectangular image containing a single character; the computer equipment cuts the single-character images one by one out of the text sequence image. Specifically, the computer equipment may segment the sequence of single-character images from the text sequence image using prior knowledge such as the spacing features of the text sequence, character length features, and the consistency of character proportions. Before segmentation, the text sequence image may undergo image enhancement, such as an increase in contrast.
In one embodiment, the computer equipment may binarize the text sequence image, project its pixel values onto the longitudinal direction of the text sequence image to obtain accumulated values, and cut at the local maxima or local minima of the accumulated values to obtain single-character images. If the pixels representing characters are white after binarization, the local minima are sought; if they are black, the local maxima are sought.
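The projection idea can be sketched in Python as follows, assuming a horizontal text line whose character pixels are white (255) after binarization. The sketch simplifies the search for local minima to columns whose accumulated value is zero, that is, columns containing no text pixel; the function names are hypothetical.

```python
def column_sums(binary):
    """Accumulate the pixel values of each column of a binarized image."""
    return [sum(column) for column in zip(*binary)]

def split_columns(binary):
    """Return (start, end) column ranges of characters, end exclusive.

    Simplification: a cut is made wherever the accumulated value is zero,
    i.e. at the lowest possible local minimum for white-on-black text.
    """
    spans, start = [], None
    sums = column_sums(binary)
    for x, total in enumerate(sums):
        if total > 0 and start is None:
            start = x                  # a character begins
        elif total == 0 and start is not None:
            spans.append((start, x))   # a character ends at an empty column
            start = None
    if start is not None:
        spans.append((start, len(sums)))
    return spans

line = [[255, 255, 0, 255],
        [0,   255, 0, 255]]
spans = split_columns(line)
```

For black text on a white background the same loop would look for the highest accumulated values instead.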
Further, after segmenting the single-character images from the text sequence image, the computer equipment may perform character recognition on them with machine learning models. The machine learning model for each character type may be trained in advance.
In one embodiment, training the machine learning model of each character type includes: acquiring a character image sample set; for each character type, labeling the character images in the sample set that belong to the type with their corresponding characters, and labeling the character images that do not belong to the type with the general character; and training the machine learning model of each character type on the character images in the sample set and the labels added for that type.
Here, the character image sample set contains a number of character images, which may include images generated from characters of various character types. The sample set used to train the model of each character type may be one unified set, or each character type may have its own set. A type-specific sample set is biased toward its character type: it may contain a large number of character images generated from characters of that type and a small number generated from characters that do not belong to it.
Specifically, a machine learning model is a function that maps a character image to its labeled character. Training the model on the character image sample set means using samples whose mapping to labeled characters is known to adjust the model's internal parameters, so that the model can predict the character to which a new character image maps and thereby recognize characters in images that contain them. The machine learning model may be an SVM (support vector machine) or any of various neural networks.
In one embodiment, the machine learning model is a convolutional neural network (CNN). A CNN is an end-to-end learning method that directly receives the pixels of a character image as input, so the number of input-layer nodes equals the number of pixels in the normalized character image. After the data is input, the CNN first performs several layers of local feature extraction and pooling, then a fully connected global feature transformation in the intermediate layers, and finally the output layer outputs the target of the task.
Specifically, for each character type, the computer equipment may label the character images in the sample set that belong to the type with their corresponding characters, and label the character images that do not belong to the type with the general character. The computer equipment then trains the machine learning model of each character type on the character images in the sample set and the labels added for that type.
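The per-type labeling can be sketched in Python as below. The Unicode range test for Chinese characters, the ASCII-letter test for English characters, the sample identifiers, and the placeholder `"#"` used as the general character of the English model are all assumptions made for illustration; `"A"` follows the general character the text uses for the Chinese model.

```python
def is_chinese(ch):
    # Assumption: the CJK Unified Ideographs block stands in for "Chinese character".
    return "\u4e00" <= ch <= "\u9fff"

def is_english(ch):
    return ch.isascii() and ch.isalpha()

# character type -> (membership test, general character for everything else)
CHARACTER_TYPES = {
    "chinese": (is_chinese, "A"),
    "english": (is_english, "#"),  # "#" is a hypothetical placeholder
}

def label_samples(samples, character_type):
    """Label (image_id, true_char) samples for the model of one character type."""
    belongs, general = CHARACTER_TYPES[character_type]
    return [(image_id, ch if belongs(ch) else general)
            for image_id, ch in samples]

samples = [("img0", "我"), ("img1", "M"), ("img2", "y")]
chinese_labels = label_samples(samples, "chinese")
english_labels = label_samples(samples, "english")
```

The same sample set thus yields a different label list for each character type, and each list would train that type's model.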
In one embodiment, the machine learning model may be obtained by iteratively adjusting, on the character image sample set, the parameters of a convolutional neural network that has already been trained for image recognition.
In this embodiment, the strong learning and representation capability of machine learning models is used to learn from large amounts of character data, and the models obtained by training recognize characters better than conventional methods do.
In the above embodiment, the text sequence image is segmented into single-character images, which are then recognized with machine learning models, so that character recognition of the text sequence image can be completed conveniently and quickly.
For each character type, after recognizing the characters in the text sequence image that belong to the type and the general characters that do not, the computer equipment combines the recognized characters in the order of the text in the text sequence image, obtaining the text sequence corresponding to that character type.
After obtaining the text sequence for each character type, the computer equipment can tell whether a character in a text sequence belongs to the corresponding character type by whether it is a general character. Thus, after choosing a text sequence, the computer equipment can look up the general characters in it directly; the positions of those general characters are the positions in the text sequence where character correction is needed.
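Looking up the positions of general characters can be sketched as a scan for runs of the placeholder. In this Python sketch a text sequence is a plain string, `"#"` is a hypothetical placeholder for the general character of the English model, and `"A"` is the general character of the Chinese model; `general_runs` is an illustrative name.

```python
def general_runs(sequence, general):
    """Return (start, end) index ranges, end exclusive, of consecutive general characters."""
    runs, start = [], None
    for index, ch in enumerate(sequence):
        if ch == general and start is None:
            start = index
        elif ch != general and start is not None:
            runs.append((start, index))
            start = None
    if start is not None:
        runs.append((start, len(sequence)))
    return runs

english_runs = general_runs("####### My name is Addy", "#")
other_runs = general_runs("XY AA AAAA", "A")
```

Each returned range is one position at which the chosen text sequence needs correction.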
In the above embodiment, when the text sequence image is recognized by character type, the characters that do not belong to the corresponding type undergo fuzzy processing and are marked with general characters. The characters that need correction can therefore be located quickly during character correction, and a more accurate recognition result is obtained once the correction is complete.
In one embodiment, segmenting single-character images from the text sequence image includes: choosing candidate cut points in the text sequence image along its long side, at a spacing shorter than its short side; obtaining the cutting confidence of each candidate cut point; determining the cut points according to the cutting confidences; and segmenting single-character images from the text sequence image at the determined cut points.
Here, a candidate cut point is a candidate cutting position, which can be expressed as a coordinate or as a distance from the starting point at the head of the text sequence image.
In one embodiment, the text sequence image is rectangular. Its short side is roughly the width or height of a character in the text sequence, and its long side is roughly the length of the text sequence, so the computer equipment may choose candidate cut points at a spacing shorter than the short side. Specifically, the spacing may be less than or equal to one half, one third, or one quarter of the short side of the text sequence image.
Further, cutting confidence level be corresponding candidate cut-off be actual cut-off probability quantized value.Meter
Calculate machine equipment specifically can be syncopated as corresponding picture according to candidate cut-off, by the picture being syncopated as extract characteristics of image after according to
It is secondary to be input in trained classifier, export the cutting confidence level of corresponding candidate cut-off.Classifier can be used random gloomy
Woods classifier.Wherein, the characteristics of image of extraction can be using HOG (Histogram of Oriented Gradient, direction ladder
Spend histogram) feature, it also can also be using other spies such as LBP (Local Binary Patterns, local binary patterns) features
Sign.
Further, computer equipment can be sentenced if being higher than preset threshold by cutting confidence level compared with preset threshold
It is set to actual cut-off.Cutting is carried out at the computer equipment cut-off that everywhere determines in text sequence image again, is obtained
To individual character image one by one.
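The selection of cut points can be sketched in Python as follows. The classifier is replaced by a caller-supplied `confidence` function, since the random forest and its HOG or LBP features are outside the scope of this sketch; the function names, the default spacing of one quarter of the short side, and the threshold of 0.5 are assumptions.

```python
def candidate_points(long_side, short_side, fraction=0.25):
    """Candidate cut positions along the long side, spaced at a fraction of the short side."""
    step = max(1, int(short_side * fraction))
    return list(range(step, long_side, step))

def select_cuts(candidates, confidence, threshold=0.5):
    """Keep the candidates whose cutting confidence is above the threshold."""
    return [x for x in candidates if confidence(x) > threshold]

def cut_spans(long_side, cuts):
    """Turn cut positions into (start, end) spans of single-character images."""
    edges = [0] + list(cuts) + [long_side]
    return list(zip(edges[:-1], edges[1:]))

# Toy stand-in for the trained classifier: confident only at three positions.
def toy_confidence(x):
    return 0.9 if x in (10, 20, 30) else 0.1

candidates = candidate_points(long_side=40, short_side=8)
cuts = select_cuts(candidates, toy_confidence)
spans = cut_spans(40, cuts)
```

With a short side of 8 pixels the candidates are placed every 2 pixels, which is the dense sampling the embodiment relies on.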
In the above embodiment, candidate cut points are chosen densely in the text sequence image, and their cutting confidences are used to segment the text sequence image into single-character images. This allows the text sequence image to be segmented accurately, which improves subsequent text recognition accuracy.
In one embodiment, step S212 includes: when the number of acquired characters equals the number of characters at the position in the chosen text sequence, replacing the characters at the position in the chosen text sequence one by one with the acquired characters that correspond to them in character order; and when the number of acquired characters differs from the number of characters at the position in the chosen text sequence, and the number of characters at the position in the chosen text sequence is more than one, replacing the characters at the position in the chosen text sequence as a whole with the acquired characters.
Specifically, the computer equipment may first count the characters at the position in the chosen text sequence, then acquire the characters of the corresponding character type at the position in the remaining text sequences and count them, and compare the two counts. If the counts are equal, each character in the corresponding part of the text sequence image is considered to have been recognized as exactly one character, and the computer equipment may replace the characters at the position in the chosen text sequence one by one with the acquired characters that correspond to them in character order.
If the counts differ, some characters in the corresponding part of the text sequence image are considered not to have been recognized. In that case, when the number of characters at the position in the chosen text sequence is more than one, the computer equipment replaces the characters at the position as a whole with the acquired characters, so as to obtain a recognition result that is as accurate as possible.
For example, Fig. 4 illustrates character correction in one embodiment when the number of acquired characters equals the number of characters at the position in the chosen text sequence. Referring to Fig. 4, the original content of the text sequence image is a seven-character Chinese phrase (meaning "I am a Han person") followed by "My name is Addy". The text sequence obtained by Chinese character type recognition is the seven Chinese characters followed by "AA AAAA AA AAAA". The text sequence obtained by English character type recognition is seven general characters "Chinese" followed by "My name is Addy".
The computer equipment may choose the text sequence obtained by English character type recognition and determine the position of the general characters "Chinese", which do not belong to the English character type, and their number: 7. In the remaining text sequence, obtained by Chinese character type recognition, the characters of the Chinese character type at that position are the seven characters of the Chinese phrase, so their number is also 7. The two numbers are equal, so the characters at the position in the text sequence of the English character type are replaced one by one, in character order, with the characters acquired from the text sequence of the Chinese character type.
Fig. 5 illustrates character correction in one embodiment when the number of acquired characters differs from the number of characters at the position in the chosen text sequence. Referring to Fig. 5, the original content of the text sequence image is a seven-character Chinese phrase (meaning "I am a Han person") followed by "My name is Addy". The text sequence obtained by Chinese character type recognition is the seven Chinese characters followed by "AA AAAA". The text sequence obtained by English character type recognition is three general characters "Chinese" followed by "My name is Addy".
The computer equipment may choose the text sequence obtained by English character type recognition and determine the position of the general characters "Chinese", which do not belong to the English character type, and their number: 3. In the remaining text sequence, obtained by Chinese character type recognition, the characters of the Chinese character type at that position are the seven characters of the Chinese phrase, so their number is 7. The two numbers differ and 3 is greater than 1, so the characters at the position in the text sequence of the English character type are replaced as a whole with the characters acquired from the text sequence of the Chinese character type.
The above embodiment provides the ways of correcting characters when the number of acquired characters is equal or unequal to the number of characters at the position in the chosen text sequence. Correcting characters in this way yields a recognition result that is as accurate as possible.
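The two correction cases can be sketched in Python as below. Text sequences are plain strings, `"#"` is a hypothetical placeholder standing in for the general character of the English model, and the seven-character Chinese phrase is an assumed illustration of the kind of content shown in Fig. 4 and Fig. 5. The text does not say what happens when the counts differ and only one character is at the position; this sketch leaves the sequence unchanged in that case.

```python
def correct(selected, position, acquired):
    """Correct the general characters at `position` in the chosen text sequence.

    selected: the chosen text sequence (str)
    position: (start, end) index range of the general characters, end exclusive
    acquired: characters of the matching type taken from a remaining text sequence
    """
    start, end = position
    count = end - start
    if count == len(acquired):
        # Equal counts: replace one by one, in character order.
        chars = list(selected)
        for offset, ch in enumerate(acquired):
            chars[start + offset] = ch
        return "".join(chars)
    if count > 1:
        # Unequal counts and more than one character: replace as a whole.
        return selected[:start] + acquired + selected[end:]
    return selected  # case not covered by the text; left unchanged here

acquired = "我是一个汉族人"  # assumed example phrase, 7 characters
equal_case = correct("####### My name is Addy", (0, 7), acquired)
unequal_case = correct("### My name is Addy", (0, 3), acquired)
```

Both calls produce the same corrected sequence, which is the point of the whole-run replacement: a miscounted run of general characters does not shorten the final text.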
As shown in Fig. 6, in a specific embodiment, the text recognition method includes the following steps:
S602: acquire an image to be recognized; binarize the image to be recognized to obtain a text image; extract a text texture image from the text image; determine the connected domains in the text texture image; determine the text sequence image according to the connected domains.
S604: in the text sequence image, choose candidate cut points along the long side of the text sequence image at a spacing shorter than its short side; obtain the cutting confidence of each candidate cut point; determine the cut points according to the cutting confidences; segment single-character images from the text sequence image at the determined cut points.
S606: acquire a character image sample set; for each character type, label the character images in the sample set that belong to the type with their corresponding characters, and label the character images that do not belong to the type with the general character.
S608: train the machine learning model of each character type on the character images in the sample set and the labels added for that type.
S610: perform character recognition on the single-character images with the machine learning model of each character type, obtaining the characters that belong to the type and the general characters that do not.
S612: combine the characters recognized for each character type in order, obtaining the text sequence corresponding to each character type.
S614: choose a text sequence from the text sequences corresponding to the character types.
S616: determine the positions of the general characters, which do not belong to the corresponding character type, in the chosen text sequence.
S618: acquire the characters of the corresponding character type at the position in the remaining text sequences.
S620: determine whether the number of acquired characters equals the number of characters at the position in the chosen text sequence; if so, go to step S622; if not, go to step S624.
S622: replace the characters at the position in the chosen text sequence one by one with the acquired characters that correspond to them in character order.
S624: if the number of characters at the position in the chosen text sequence is more than one, replace the characters at the position in the chosen text sequence as a whole with the acquired characters.
In this embodiment, after a text sequence image is acquired, character recognition is performed separately for each character type, yielding the text sequence corresponding to each type. One text sequence is then chosen, the positions of the characters in it that do not belong to the corresponding character type are determined, and the characters at those positions are corrected with the characters at the same positions in the remaining text sequences, giving the recognition result. Recognizing the text sequence image by character type in this way ensures the recognition accuracy for the characters of each type when recognition is performed for that type, and still covers every kind of character contained in the text sequence image when the text content is complex and varied. The characters of each type in the text sequence recognized for that type are then used to correct the characters at the corresponding positions in the other text sequences, which yields the recognition result and improves text recognition accuracy.
Fig. 7 shows the flow of the text recognition method in a specific application scenario. Referring to Fig. 7, the scenario is the recognition of text in a business card picture. The computer equipment first performs text line detection on the business card image. After a text line is detected, the characters in it are recognized with the machine learning model of the Chinese character type and with the machine learning model of the English character type. In this embodiment, other characters such as digits and punctuation marks are accurately recognized by the machine learning model of the English character type.
After a text line passes through the machine learning model of the Chinese character type, the resulting text sequence contains the accurately recognized Chinese characters and the general character "A", which marks non-Chinese characters. After a text line passes through the machine learning model of the English character type, the resulting text sequence contains the accurately recognized English characters, digits and punctuation marks, and the general character "Chinese", which marks Chinese characters.
The computer equipment may then choose the text sequence obtained by English character type recognition, determine the position of the general characters "Chinese", which do not belong to the English character type, and acquire the characters of the Chinese character type at that position in the text sequence obtained by Chinese character type recognition. It then replaces the characters at the position in the text sequence of the English character type as a whole with the characters acquired from the text sequence of the Chinese character type.
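The whole-run replacement used in this scenario can be sketched in Python with regular expressions. The sketch assumes that the k-th run of placeholders in the English-type sequence corresponds to the k-th run of Chinese characters in the Chinese-type sequence, which holds for a single text line read in order but is not stated in the text. The placeholder `"#"`, the function name, and the business card content are hypothetical.

```python
import re

def merge_sequences(english_seq, chinese_seq, placeholder="#"):
    """Replace each placeholder run in the English-type sequence with the
    matching run of Chinese characters from the Chinese-type sequence."""
    # Runs of recognized Chinese characters, in reading order.
    chinese_runs = iter(re.findall(r"[\u4e00-\u9fff]+", chinese_seq))
    pattern = re.escape(placeholder) + "+"
    # If the Chinese-type sequence runs out of runs, keep the placeholders.
    return re.sub(pattern,
                  lambda match: next(chinese_runs, match.group(0)),
                  english_seq)

# Hypothetical business card line: a Chinese name followed by a phone number.
english_seq = "## Tel: 123"    # English model: letters, digits, punctuation
chinese_seq = "张三 AAAA AAA"   # Chinese model: "A" marks non-Chinese characters
result = merge_sequences(english_seq, chinese_seq)
```

Because each run is replaced as a whole, the result does not depend on the two models agreeing on how many characters the Chinese part contains.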
As shown in Fig. 8, one embodiment provides a text recognition device 800. Referring to Fig. 8, the text recognition device 800 includes a first acquisition module 801, a recognition module 802, a selection module 803, a determination module 804, a second acquisition module 805 and a correction module 806.
The first acquisition module 801 is configured to acquire a text sequence image.
The recognition module 802 is configured to perform character recognition on the text sequence image in the recognition mode of each character type, recognizing the characters in the text sequence image that do not belong to the corresponding character type as general characters that do not belong to that type, to obtain the text sequence corresponding to each character type.
The selection module 803 is configured to choose a text sequence from the text sequences corresponding to the character types.
The determination module 804 is configured to determine the positions of the general characters, which do not belong to the corresponding character type, in the chosen text sequence.
The second acquisition module 805 is configured to acquire the characters of the corresponding character type at the position in the remaining text sequences.
The correction module 806 is configured to correct the characters at the position in the chosen text sequence according to the acquired characters, to obtain a recognition result.
After acquiring a text sequence image, the above text recognition device 800 performs character recognition separately for each character type, yielding the text sequence corresponding to each type. When recognition is performed in the character recognition mode of a given character type, characters in the text sequence image that do not belong to that type are recognized as general characters that do not belong to that type. One text sequence is then chosen, the positions of the general characters in it are determined, and the characters at those positions are corrected with the characters at the same positions in the remaining text sequences, giving the recognition result. Recognizing the text sequence image by character type in this way ensures the recognition accuracy for the characters of each type when recognition is performed for that type, and still covers every kind of character contained in the text sequence image when the text content is complex and varied. The characters of each type in the text sequence recognized for that type are then used to correct the characters at the corresponding positions in the other text sequences, which yields the recognition result and improves text recognition accuracy.
In one embodiment, the first acquisition module 801 is also used to obtain images to be recognized;Images to be recognized is carried out two
Value processing, obtains text image;Text texture image is extracted from text image;Determine the connection in text texture image
Domain;Text sequence image is determined according to connected domain.
In the present embodiment, by after gradually extracting text texture image in images to be recognized, further according to text line
The connected domain in image is managed, determines corresponding text sequence image, avoiding will be excessive in text sequence image determination process
Background area include into so that accuracy rate is higher when subsequent progress character recognition.
In one embodiment, identification module 802 is also used to by each corresponding identification method of character type, from text sequence
The character for belonging to respective symbols type is identified in image, and is identified from text sequence image and be not belonging to respective symbols type
General character;It successively combines the character gone out by each character category identification respectively, obtains each corresponding text of character type
Sequence.In the present embodiment, when carrying out text sequence image recognition by character type, it will not belong to the word of respective symbols type
Symbol carries out Fuzzy Processing, and is marked with general character, can quickly locate and need to carry out when carrying out character correction
The character of correction obtains more accurate recognition result to complete character correction.
In one embodiment, identification module 802 is also used to be syncopated as individual character image from text sequence image;By each
The corresponding machine learning model of character type carries out character recognition to individual character image respectively, obtains belonging to respective symbols type
Character and the general character for being not belonging to respective symbols type.
In the present embodiment, text sequence image cutting is obtained into individual character image, then machine learning is used to individual character image
Model carries out character recognition, can conveniently and efficiently complete the character recognition process to text sequence image.
In one embodiment, the identification module 802 is further configured to: choose candidate cut points
in the text sequence image, along its long side, at a spacing shorter than its short side; obtain the
cutting confidence of each candidate cut point; determine the cut points according to the cutting
confidence; and segment individual character images from the text sequence image at the determined
cut points.
In this embodiment, candidate cut points are chosen densely in the text sequence image, and the cutting
confidence of each candidate cut point is used to segment the text sequence image into individual
character images. This allows the text sequence image to be segmented accurately, which improves the
accuracy of subsequent text recognition.
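As a non-limiting illustration, dense candidate selection and confidence-based cutting can be sketched as follows. The sketch assumes a horizontal text line stored as a 0/1 mask whose width is the long side, and it assumes a simple cutting confidence (the share of background pixels in a column); how the cutting confidence is actually obtained is not specified here, and all function names are hypothetical.

```python
def candidate_cut_points(width, height, step):
    """Candidate x positions along the long side, spaced more closely than the short side."""
    if not 0 < step < height:
        raise ValueError("spacing must be positive and shorter than the short side")
    return list(range(step, width, step))


def cut_confidence(mask, x):
    """Assumed score: share of background pixels in column x (1.0 means an empty column)."""
    column = [row[x] for row in mask]
    return 1.0 - sum(column) / len(column)


def choose_cut_points(mask, step=1, threshold=1.0):
    """Keep the first confident candidate of each run of adjacent confident candidates."""
    height, width = len(mask), len(mask[0])
    cuts, previous = [], None
    for x in candidate_cut_points(width, height, step):
        if cut_confidence(mask, x) >= threshold:
            if previous is None or x - previous > step:
                cuts.append(x)
            previous = x
    return cuts


def split(mask, cuts):
    """Cut the mask into individual character images at the chosen cut points."""
    bounds = [0] + cuts + [len(mask[0])]
    return [[row[a:b] for row in mask] for a, b in zip(bounds, bounds[1:])]
```

Because the spacing is shorter than the short side, and a character is roughly as wide as the line is tall, several candidates fall within each character width, so a gap between characters is unlikely to be missed.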
In one embodiment, the correction module 806 is further configured to: when the number of acquired
characters is consistent with the number of characters at the position in the chosen text sequence,
replace the characters at the position in the chosen text sequence one by one with the acquired
characters that correspond to them one-to-one in character order; and when the number of acquired
characters is inconsistent with the number of characters at the position in the chosen text sequence,
and the number of characters at the position in the chosen text sequence is more than one, replace the
characters at the position in the chosen text sequence as a whole with the acquired characters.
This embodiment specifies how character correction is carried out when the number of acquired
characters is consistent or inconsistent with the number of characters at the position in the chosen
text sequence. Correcting characters in this way yields a recognition result that is as accurate as
possible.
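As a non-limiting illustration, the two correction cases can be sketched as follows. The function name `correct` and the half-open range `[start, end)` used to describe the position are assumptions of this sketch. The case in which the counts differ and only one character sits at the position is not covered by the description above, so the sketch leaves the chosen text sequence unchanged in that case.

```python
def correct(selected, start, end, acquired):
    """Correct selected[start:end] (general characters) using the acquired characters."""
    run = selected[start:end]
    if len(acquired) == len(run):
        # Counts agree: replace one by one, in character order.
        chars = list(selected)
        for offset, ch in enumerate(acquired):
            chars[start + offset] = ch
        return "".join(chars)
    if len(run) > 1:
        # Counts differ and more than one character is at the position:
        # replace the whole run with the acquired characters.
        return selected[:start] + acquired + selected[end:]
    # Not covered by the description; left unchanged (assumption).
    return selected
```

For example, `correct("AB**", 2, 4, "12")` replaces the two general characters one by one, while `correct("AB**", 2, 4, "123")` replaces them as a whole.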
As shown in Fig. 9, in one embodiment, the text recognition device 800 further includes a training
module 807.
The training module 807 is configured to: obtain a character image sample set; for each character
type, add the label of the corresponding character to the character images in the sample set that
belong to that character type, and add the label of the general character to the character images in
the sample set that do not belong to it; and train the machine learning model corresponding to each
character type according to the character images in the sample set and the labels added for that
character type.
In this embodiment, the strong learning and representation capabilities of machine learning models
are used to learn from large amounts of character data, and the trained machine learning models
recognize characters better than conventional methods do.
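As a non-limiting illustration, the per-type labelling of the character image sample set can be sketched as follows. The placeholder `"*"`, the two character types, and the representation of a sample as an `(image, true_character)` pair are assumptions made for this sketch; the training of the machine learning models themselves is not shown.

```python
import string

GENERAL = "*"  # hypothetical label for the general character

# Character types chosen for illustration only.
CHARACTER_TYPES = {
    "digit": set(string.digits),
    "letter": set(string.ascii_letters),
}


def labels_for_type(samples, type_name):
    """Label in-type samples with their character and all other samples with GENERAL."""
    alphabet = CHARACTER_TYPES[type_name]
    return [(image, ch if ch in alphabet else GENERAL) for image, ch in samples]


def build_training_sets(samples):
    """Build one labelled training set per character type from the same sample set."""
    return {name: labels_for_type(samples, name) for name in CHARACTER_TYPES}
```

Every model is trained on the full sample set; only the labels differ from one character type to the next, which is what lets each model output the general character for out-of-type input.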
In one embodiment, one or more computer-readable storage media storing computer-readable instructions
are provided. When executed by one or more processors, the computer-readable instructions cause the
one or more processors to perform the following steps: obtaining a text sequence image; performing
character recognition on the text sequence image using the recognition mode corresponding to each
character type, where characters in the text sequence image that do not belong to the corresponding
character type are recognized as general characters not belonging to that character type, to obtain
the text sequence corresponding to each character type; choosing a text sequence from the text
sequences corresponding to the character types; determining the position of the general characters
not belonging to the corresponding character type in the chosen text sequence; obtaining, from the
text sequences remaining after the choice, the characters at that position that belong to the
corresponding character type; and correcting the characters at that position in the chosen text
sequence according to the acquired characters, to obtain a recognition result.
In one embodiment, obtaining the text sequence image includes: obtaining an image to be recognized;
binarizing the image to be recognized to obtain a text image; extracting a text texture image from
the text image; determining the connected domains in the text texture image; and determining the text
sequence image according to the connected domains.
In one embodiment, performing character recognition on the text sequence image using the recognition
mode corresponding to each character type, to obtain the text sequence corresponding to each character
type, includes: identifying, using the recognition mode corresponding to each character type, the
characters in the text sequence image that belong to the corresponding character type, and
recognizing the characters in the text sequence image that do not belong to that character type as
general characters; and combining, in order, the characters recognized for each character type, to
obtain the text sequence corresponding to each character type.
In one embodiment, identifying the characters that belong to the corresponding character type and
recognizing the remaining characters as general characters, using the recognition mode corresponding
to each character type, includes: segmenting individual character images from the text sequence
image; and performing character recognition on each individual character image using the machine
learning model corresponding to each character type, to obtain characters that belong to the
corresponding character type and general characters that do not belong to it.
In one embodiment, segmenting individual character images from the text sequence image includes:
choosing candidate cut points in the text sequence image, along its long side, at a spacing shorter
than its short side; obtaining the cutting confidence of each candidate cut point; determining the cut
points according to the cutting confidence; and segmenting individual character images from the text
sequence image at the determined cut points.
In one embodiment, the computer-readable instructions further cause the processor to perform the
following steps: obtaining a character image sample set; for each character type, adding the label of
the corresponding character to the character images in the sample set that belong to that character
type, and adding the label of the general character to the character images in the sample set that
do not belong to it; and training the machine learning model corresponding to each character type
according to the character images in the sample set and the labels added for that character type.
In one embodiment, correcting the characters at the position in the chosen text sequence according to
the acquired characters, to obtain the recognition result, includes: when the number of acquired
characters is consistent with the number of characters at the position in the chosen text sequence,
replacing the characters at the position in the chosen text sequence one by one with the acquired
characters that correspond to them one-to-one in character order; and when the number of acquired
characters is inconsistent with the number of characters at the position in the chosen text sequence,
and the number of characters at the position in the chosen text sequence is more than one, replacing
the characters at the position in the chosen text sequence as a whole with the acquired characters.
After a text sequence image is obtained, the above storage medium performs character recognition
separately for each character type, obtaining the text sequence corresponding to each character type.
When recognition is performed using the character recognition mode corresponding to a given character
type, the characters in the text sequence image that do not belong to that character type are
recognized as general characters not belonging to that character type. Any one text sequence is then
chosen, the position of the general characters not belonging to the corresponding character type in
the chosen text sequence is determined, and the characters at that position in the chosen text
sequence are corrected using the characters at that position in the text sequences remaining after
the choice, to obtain the recognition result. Recognizing the text sequence image by character type
in this way ensures the recognition accuracy of the characters belonging to each character type when
recognition is performed for that type. When the text content is complex and varied, text of the
various character types contained in the text sequence image can still be handled: the characters
belonging to each character type, in the text sequence obtained by recognition for that type, are
used to correct the characters at the corresponding position in the other text sequences, which
yields the recognition result and improves text recognition accuracy.
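As a non-limiting illustration, choosing one text sequence and correcting it from the remaining text sequences can be sketched as follows. The sketch assumes that all text sequences come from the same individual character images, so that index `i` refers to the same character in every sequence; the placeholder `"*"` and the function names are likewise assumptions.

```python
GENERAL = "*"  # hypothetical placeholder for the general character


def general_positions(sequence):
    """Indices at which the chosen text sequence holds a general character."""
    return [i for i, ch in enumerate(sequence) if ch == GENERAL]


def merge(sequences, selected_type):
    """Correct the chosen text sequence with in-type characters from the remaining ones."""
    selected = list(sequences[selected_type])
    others = [seq for name, seq in sequences.items() if name != selected_type]
    for i in general_positions(selected):
        for seq in others:
            # Take the first remaining sequence that recognized a real character here.
            if i < len(seq) and seq[i] != GENERAL:
                selected[i] = seq[i]
                break
    return "".join(selected)
```

Under the alignment assumption the result does not depend on which text sequence is chosen: a position stays a general character only if no character type recognized it.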
In one embodiment, a computer device is provided, including a memory and a processor. The memory
stores computer-readable instructions which, when executed by the processor, cause the processor to
perform the following steps: obtaining a text sequence image; performing character recognition on the
text sequence image using the recognition mode corresponding to each character type, where characters
in the text sequence image that do not belong to the corresponding character type are recognized as
general characters not belonging to that character type, to obtain the text sequence corresponding to
each character type; choosing a text sequence from the text sequences corresponding to the character
types; determining the position of the general characters not belonging to the corresponding
character type in the chosen text sequence; obtaining, from the text sequences remaining after the
choice, the characters at that position that belong to the corresponding character type; and
correcting the characters at that position in the chosen text sequence according to the acquired
characters, to obtain a recognition result.
In one embodiment, obtaining the text sequence image includes: obtaining an image to be recognized;
binarizing the image to be recognized to obtain a text image; extracting a text texture image from
the text image; determining the connected domains in the text texture image; and determining the text
sequence image according to the connected domains.
In one embodiment, performing character recognition on the text sequence image using the recognition
mode corresponding to each character type, to obtain the text sequence corresponding to each character
type, includes: identifying, using the recognition mode corresponding to each character type, the
characters in the text sequence image that belong to the corresponding character type, and
recognizing the characters in the text sequence image that do not belong to that character type as
general characters; and combining, in order, the characters recognized for each character type, to
obtain the text sequence corresponding to each character type.
In one embodiment, identifying the characters that belong to the corresponding character type and
recognizing the remaining characters as general characters, using the recognition mode corresponding
to each character type, includes: segmenting individual character images from the text sequence
image; and performing character recognition on each individual character image using the machine
learning model corresponding to each character type, to obtain characters that belong to the
corresponding character type and general characters that do not belong to it.
In one embodiment, segmenting individual character images from the text sequence image includes:
choosing candidate cut points in the text sequence image, along its long side, at a spacing shorter
than its short side; obtaining the cutting confidence of each candidate cut point; determining the cut
points according to the cutting confidence; and segmenting individual character images from the text
sequence image at the determined cut points.
In one embodiment, the computer-readable instructions further cause the processor to perform the
following steps: obtaining a character image sample set; for each character type, adding the label of
the corresponding character to the character images in the sample set that belong to that character
type, and adding the label of the general character to the character images in the sample set that
do not belong to it; and training the machine learning model corresponding to each character type
according to the character images in the sample set and the labels added for that character type.
In one embodiment, correcting the characters at the position in the chosen text sequence according to
the acquired characters, to obtain the recognition result, includes: when the number of acquired
characters is consistent with the number of characters at the position in the chosen text sequence,
replacing the characters at the position in the chosen text sequence one by one with the acquired
characters that correspond to them one-to-one in character order; and when the number of acquired
characters is inconsistent with the number of characters at the position in the chosen text sequence,
and the number of characters at the position in the chosen text sequence is more than one, replacing
the characters at the position in the chosen text sequence as a whole with the acquired characters.
After a text sequence image is obtained, the above computer device performs character recognition
separately for each character type, obtaining the text sequence corresponding to each character type.
When recognition is performed using the character recognition mode corresponding to a given character
type, the characters in the text sequence image that do not belong to that character type are
recognized as general characters not belonging to that character type. Any one text sequence is then
chosen, the position of the general characters not belonging to the corresponding character type in
the chosen text sequence is determined, and the characters at that position in the chosen text
sequence are corrected using the characters at that position in the text sequences remaining after
the choice, to obtain the recognition result. Recognizing the text sequence image by character type
in this way ensures the recognition accuracy of the characters belonging to each character type when
recognition is performed for that type. When the text content is complex and varied, text of the
various character types contained in the text sequence image can still be handled: the characters
belonging to each character type, in the text sequence obtained by recognition for that type, are
used to correct the characters at the corresponding position in the other text sequences, which
yields the recognition result and improves text recognition accuracy.
Those of ordinary skill in the art will understand that all or part of the processes in the methods
of the above embodiments can be implemented by a computer program instructing the relevant hardware.
The program can be stored in a non-volatile computer-readable storage medium and, when executed, may
include the processes of the embodiments of the above methods. The storage medium may be a magnetic
disk, an optical disc, a read-only memory (Read-Only Memory, ROM), or the like.
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all
possible combinations of the technical features in the above embodiments are described; however, as
long as a combination of these technical features contains no contradiction, it should be considered
within the scope of this specification.
The above embodiments express only several implementations of the present invention. Their
description is relatively specific and detailed, but it should not therefore be construed as limiting
the scope of the patent. It should be noted that those of ordinary skill in the art can make various
modifications and improvements without departing from the concept of the present invention, and these
all fall within the scope of protection of the present invention. Therefore, the scope of protection
of this patent shall be subject to the appended claims.
Claims (15)
1. A text recognition method, the method comprising:
obtaining a text sequence image;
performing character recognition on the text sequence image using the character recognition mode corresponding to each character type, wherein characters in the text sequence image that do not belong to the corresponding character type are recognized as general characters not belonging to that character type, to obtain the text sequence corresponding to each character type;
choosing a text sequence from the text sequences corresponding to the character types;
determining the position of the general characters not belonging to the corresponding character type in the chosen text sequence;
obtaining, from the text sequences remaining after the choice, the characters at the position that belong to the corresponding character type; and
correcting the characters at the position in the chosen text sequence according to the acquired characters, to obtain a recognition result.
2. The method according to claim 1, wherein obtaining the text sequence image comprises:
obtaining an image to be recognized;
binarizing the image to be recognized to obtain a text image;
extracting a text texture image from the text image;
determining the connected domains in the text texture image; and
determining the text sequence image according to the connected domains.
3. The method according to claim 1, wherein performing character recognition on the text sequence image using the character recognition mode corresponding to each character type, with characters in the text sequence image that do not belong to the corresponding character type recognized as general characters not belonging to that character type, to obtain the text sequence corresponding to each character type, comprises:
identifying, using the character recognition mode corresponding to each character type, the characters in the text sequence image that belong to the corresponding character type, and recognizing the characters in the text sequence image that do not belong to that character type as general characters; and
combining, in order, the characters recognized for each character type, to obtain the text sequence corresponding to each character type.
4. The method according to claim 3, wherein identifying, using the character recognition mode corresponding to each character type, the characters in the text sequence image that belong to the corresponding character type, and recognizing the characters in the text sequence image that do not belong to that character type as general characters, comprises:
segmenting individual character images from the text sequence image; and
performing character recognition on each individual character image using the machine learning model corresponding to each character type, to obtain characters that belong to the corresponding character type and general characters that do not belong to it.
5. The method according to claim 4, wherein segmenting individual character images from the text sequence image comprises:
choosing candidate cut points in the text sequence image, along the long side of the text sequence image, at a spacing shorter than the short side of the text sequence image;
obtaining the cutting confidence of each candidate cut point;
determining the cut points according to the cutting confidence; and
segmenting individual character images from the text sequence image at the determined cut points.
6. The method according to claim 4, wherein the method further comprises:
obtaining a character image sample set;
for each character type, adding the label of the corresponding character to the character images in the character image sample set that belong to that character type, and adding the label of the general character to the character images in the character image sample set that do not belong to it; and
training the machine learning model corresponding to each character type according to the character images in the character image sample set and the labels added for that character type.
7. The method according to any one of claims 1 to 6, wherein correcting the characters at the position in the chosen text sequence according to the acquired characters, to obtain the recognition result, comprises:
when the number of acquired characters is consistent with the number of characters at the position in the chosen text sequence, replacing the characters at the position in the chosen text sequence one by one with the acquired characters that correspond to them one-to-one in character order; and
when the number of acquired characters is inconsistent with the number of characters at the position in the chosen text sequence, and the number of characters at the position in the chosen text sequence is more than one, replacing the characters at the position in the chosen text sequence as a whole with the acquired characters.
8. A text recognition device, the device comprising:
a first obtaining module, configured to obtain a text sequence image;
an identification module, configured to perform character recognition on the text sequence image using the character recognition mode corresponding to each character type, wherein characters in the text sequence image that do not belong to the corresponding character type are recognized as general characters not belonging to that character type, to obtain the text sequence corresponding to each character type;
a choosing module, configured to choose a text sequence from the text sequences corresponding to the character types;
a determining module, configured to determine the position of the general characters not belonging to the corresponding character type in the chosen text sequence;
a second obtaining module, configured to obtain, from the text sequences remaining after the choice, the characters at the position that belong to the corresponding character type; and
a correction module, configured to correct the characters at the position in the chosen text sequence according to the acquired characters, to obtain a recognition result.
9. The device according to claim 8, wherein the first obtaining module is further configured to: obtain an image to be recognized; binarize the image to be recognized to obtain a text image; extract a text texture image from the text image; determine the connected domains in the text texture image; and determine the text sequence image according to the connected domains.
10. The device according to claim 8, wherein the identification module is further configured to: identify, using the character recognition mode corresponding to each character type, the characters in the text sequence image that belong to the corresponding character type, and recognize the characters in the text sequence image that do not belong to that character type as general characters; and combine, in order, the characters recognized for each character type, to obtain the text sequence corresponding to each character type.
11. The device according to claim 10, wherein the identification module is further configured to: segment individual character images from the text sequence image; and perform character recognition on each individual character image using the machine learning model corresponding to each character type, to obtain characters that belong to the corresponding character type and general characters that do not belong to it.
12. The device according to claim 11, wherein the identification module is further configured to: choose candidate cut points in the text sequence image, along the long side of the text sequence image, at a spacing shorter than the short side of the text sequence image; obtain the cutting confidence of each candidate cut point; determine the cut points according to the cutting confidence; and segment individual character images from the text sequence image at the determined cut points.
13. The device according to claim 11, wherein the device further comprises:
a training module, configured to: obtain a character image sample set; for each character type, add the label of the corresponding character to the character images in the character image sample set that belong to that character type, and add the label of the general character to the character images in the character image sample set that do not belong to it; and train the machine learning model corresponding to each character type according to the character images in the character image sample set and the labels added for that character type.
14. One or more non-volatile computer-readable storage media storing computer-executable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the method according to any one of claims 1 to 6.
15. A computer device, comprising a memory and a processor, the memory storing computer-readable instructions which, when executed by the processor, cause the processor to perform the steps of the method according to any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710687380.4A CN109389115B (en) | 2017-08-11 | 2017-08-11 | Text recognition method, device, storage medium and computer equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109389115A (en) | 2019-02-26 |
CN109389115B (en) | 2023-05-23 |
Family
ID=65413997
Family Applications (1)
Application Number | Status | Priority Date | Filing Date | Title |
---|---|---|---|---|
CN201710687380.4A CN109389115B (en) | Active | 2017-08-11 | 2017-08-11 | Text recognition method, device, storage medium and computer equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109389115B (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110210469A (en) * | 2019-05-31 | 2019-09-06 | 中科软科技股份有限公司 | A kind of method and system identifying picture character languages |
CN110674876A (en) * | 2019-09-25 | 2020-01-10 | 北京猎户星空科技有限公司 | Character detection method and device, electronic equipment and computer readable medium |
CN110969161A (en) * | 2019-12-02 | 2020-04-07 | 上海肇观电子科技有限公司 | Image processing method, circuit, visual impairment assisting apparatus, electronic apparatus, and medium |
CN111339910A (en) * | 2020-02-24 | 2020-06-26 | 支付宝实验室(新加坡)有限公司 | Text processing method and device and text classification model training method and device |
CN111797922A (en) * | 2020-07-03 | 2020-10-20 | 泰康保险集团股份有限公司 | Text image classification method and device |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH11272799A (en) * | 1998-03-20 | 1999-10-08 | Canon Inc | Method and device for character recognition processing and storage medium |
CN101777124A (en) * | 2010-01-29 | 2010-07-14 | 北京新岸线网络技术有限公司 | Method for extracting video text message and device thereof |
CN102156865A (en) * | 2010-12-14 | 2011-08-17 | 上海合合信息科技发展有限公司 | Handwritten text line character segmentation method and identification method |
CN102332096A (en) * | 2011-10-17 | 2012-01-25 | 中国科学院自动化研究所 | Video caption text extraction and identification method |
WO2013097072A1 (en) * | 2011-12-26 | 2013-07-04 | 华为技术有限公司 | Method and apparatus for recognizing a character of a video |
WO2014131339A1 (en) * | 2013-02-26 | 2014-09-04 | 山东新北洋信息技术股份有限公司 | Character identification method and character identification apparatus |
CN104268603A (en) * | 2014-09-16 | 2015-01-07 | 科大讯飞股份有限公司 | Intelligent marking method and system for text objective questions |
CN106056114A (en) * | 2016-05-24 | 2016-10-26 | 腾讯科技(深圳)有限公司 | Business card content identification method and business card content identification device |
Non-Patent Citations (2)
Title |
---|
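Fu Qiang; Ding Xiaoqing; Jiang Yan: "Segmentation and recognition of Chinese handwritten address strings based on multi-information fusion" * |
Yang Wuyi; Zhang Shuwu: "An integrated segmentation and recognition algorithm for characters in video" * |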
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110210469A (en) * | 2019-05-31 | 2019-09-06 | 中科软科技股份有限公司 | A kind of method and system identifying picture character languages |
CN110674876A (en) * | 2019-09-25 | 2020-01-10 | 北京猎户星空科技有限公司 | Character detection method and device, electronic equipment and computer readable medium |
CN110969161A (en) * | 2019-12-02 | 2020-04-07 | 上海肇观电子科技有限公司 | Image processing method, circuit, visual impairment assisting apparatus, electronic apparatus, and medium |
CN110969161B (en) * | 2019-12-02 | 2023-11-07 | 上海肇观电子科技有限公司 | Image processing method, circuit, vision-impaired assisting device, electronic device, and medium |
CN111339910A (en) * | 2020-02-24 | 2020-06-26 | 支付宝实验室(新加坡)有限公司 | Text processing method and device and text classification model training method and device |
CN111339910B (en) * | 2020-02-24 | 2023-11-28 | 支付宝实验室(新加坡)有限公司 | Text processing and text classification model training method and device |
CN111797922A (en) * | 2020-07-03 | 2020-10-20 | 泰康保险集团股份有限公司 | Text image classification method and device |
CN111797922B (en) * | 2020-07-03 | 2023-11-28 | 泰康保险集团股份有限公司 | Text image classification method and device |
Also Published As
Publication number | Publication date |
---|---|
CN109389115B (en) | 2023-05-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Neumann et al. | | Efficient scene text localization and recognition with local character refinement |
CN107133622B (en) | | Word segmentation method and device |
CN106056114B (en) | | Business card content identification method and device |
CN109389115A (en) | | Text recognition method, device, storage medium and computer equipment |
US8744196B2 (en) | | Automatic recognition of images |
Pan et al. | | A robust system to detect and localize texts in natural scene images |
CN110647829A (en) | | Bill text recognition method and system |
CN104217203B (en) | | Complex background card face information identifying method and system |
CN108717543B (en) | | Invoice identification method and device and computer storage medium |
JP5176763B2 (en) | | Low quality character identification method and apparatus |
CN106203539B (en) | | Method and device for identifying container number |
Vanetti et al. | | Gas meter reading from real world images using a multi-net system |
Ye et al. | | Scene text detection via integrated discrimination of component appearance and consensus |
RU2581786C1 (en) | | Determination of image transformations to increase quality of optical character recognition |
Shivakumara et al. | | New gradient-spatial-structural features for video script identification |
CN109447080B (en) | | Character recognition method and device |
Salvi et al. | | Handwritten text segmentation using average longest path algorithm |
CN113158895A (en) | | Bill identification method and device, electronic equipment and storage medium |
JPWO2015146113A1 (en) | | Identification dictionary learning system, identification dictionary learning method, and identification dictionary learning program |
Ramirez et al. | | Automatic recognition of square notation symbols in western plainchant manuscripts |
Li et al. | | Leveraging surrounding context for scene text detection |
Chen et al. | | Salient object detection: Integrate salient features in the deep learning framework |
CN113780116A (en) | | Invoice classification method and device, computer equipment and storage medium |
Vidhyalakshmi et al. | | Text detection in natural images with hybrid stroke feature transform and high performance deep Convnet computing |
US9092688B2 (en) | | Assisted OCR |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |