CN109389115A - Text recognition method, device, storage medium and computer equipment - Google Patents

Info

Publication number
CN109389115A
Authority
CN
China
Prior art keywords
character
text sequence
type
image
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710687380.4A
Other languages
Chinese (zh)
Other versions
CN109389115B (en)
Inventor
刘银松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shanghai Co Ltd
Original Assignee
Tencent Technology Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shanghai Co Ltd
Priority to CN201710687380.4A
Publication of CN109389115A
Application granted
Publication of CN109389115B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V20/63 Scene text, e.g. street names
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Character Discrimination (AREA)

Abstract

The present invention relates to a text recognition method, device, storage medium and computer equipment. The method comprises: obtaining a text sequence image; performing character recognition on the text sequence image using the character recognition mode corresponding to each character type, wherein characters in the text sequence image that do not belong to the corresponding character type are recognized as a general character denoting non-membership of that character type, to obtain a text sequence corresponding to each character type; selecting a text sequence from the text sequences corresponding to the character types; determining the positions, in the selected text sequence, of the general characters that do not belong to the corresponding character type; obtaining, from the remaining text sequences, the characters at those positions that belong to the corresponding character type; and correcting the characters at those positions in the selected text sequence according to the obtained characters, to obtain a recognition result. The scheme provided by the present application improves the accuracy of text recognition.

Description

Text recognition method, device, storage medium and computer equipment
Technical field
The present invention relates to the field of computer technology, and in particular to a text recognition method, device, storage medium and computer equipment.
Background
With the development of computer technology, more and more text is embedded in images to convey information. Using text recognition technology to recognize the text contained in an image is also increasingly common, for example recognizing the text on a business card or in a photo.
At present, text recognition for various images mainly extracts fixed character features and recognizes each character from them. However, when the text content is complex and varied, the accuracy of the recognition result obtained in this way drops significantly.
Summary of the invention
In view of this, to address the low recognition accuracy of traditional text recognition methods when the text content is complex and varied, it is necessary to provide a text recognition method, device, storage medium and computer equipment.
A text recognition method, comprising:
obtaining a text sequence image;
performing character recognition on the text sequence image using the character recognition mode corresponding to each character type, wherein characters in the text sequence image that do not belong to the corresponding character type are recognized as a general character denoting non-membership of that character type, to obtain a text sequence corresponding to each character type;
selecting a text sequence from the text sequences corresponding to the character types;
determining the positions, in the selected text sequence, of the characters that do not belong to the corresponding character type;
obtaining, from the remaining text sequences, the characters at those positions that belong to the corresponding character type;
correcting the characters at those positions in the selected text sequence according to the obtained characters, to obtain a recognition result.
A text recognition device, comprising:
a first obtaining module, configured to obtain a text sequence image;
a recognition module, configured to perform character recognition on the text sequence image using the character recognition mode corresponding to each character type, wherein characters in the text sequence image that do not belong to the corresponding character type are recognized as a general character denoting non-membership of that character type, to obtain a text sequence corresponding to each character type;
a selection module, configured to select a text sequence from the text sequences corresponding to the character types;
a determining module, configured to determine the positions, in the selected text sequence, of the characters that do not belong to the corresponding character type;
a second obtaining module, configured to obtain, from the remaining text sequences, the characters at those positions that belong to the corresponding character type;
a correction module, configured to correct the characters at those positions in the selected text sequence according to the obtained characters, to obtain a recognition result.
One or more non-volatile computer-readable storage media storing computer-executable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the text recognition method.
Computer equipment, comprising a memory and a processor, the memory storing computer-readable instructions which, when executed by the processor, cause the processor to perform the steps of the text recognition method.
With the above text recognition method, device, storage medium and computer equipment, after a text sequence image is obtained, character recognition is performed separately in the character recognition mode corresponding to each character type, yielding a text sequence for each character type. When recognizing in the mode corresponding to a given character type, characters in the text sequence image that do not belong to that type are recognized as the general character denoting non-membership of that type. One text sequence is then selected, the positions of the general characters in it are determined, and the characters at those positions are corrected using the characters at the same positions in the remaining text sequences, giving the recognition result. Recognizing the text sequence image by character type in this way ensures the recognition accuracy for the characters of each type when recognizing by that type, and also accommodates text sequence images whose content mixes several character types. Using the characters of each type in the text sequence recognized for that type to correct the characters at the corresponding positions in the other text sequences then yields the recognition result and improves text recognition accuracy.
Brief description of the drawings
Fig. 1 is a schematic diagram of the internal structure of the computer equipment in one embodiment;
Fig. 2 is a flow diagram of the text recognition method in one embodiment;
Fig. 3 is a schematic diagram of the text recognition method in one embodiment;
Fig. 4 is a schematic illustration of character correction in one embodiment when the number of obtained characters is consistent with the number of characters at the position in the selected text sequence;
Fig. 5 is a schematic illustration of character correction in one embodiment when the number of obtained characters is inconsistent with the number of characters at the position in the selected text sequence;
Fig. 6 is a schematic diagram of the text recognition method in another embodiment;
Fig. 7 is a flow chart of the text recognition method in a specific application scenario;
Fig. 8 is a structural block diagram of the text recognition device in one embodiment;
Fig. 9 is a structural block diagram of the text recognition device in another embodiment.
Detailed description
To make the objectives, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here merely illustrate the present invention and are not intended to limit it.
Fig. 1 is a schematic diagram of the internal structure of the computer equipment in one embodiment. As shown in Fig. 1, the computer equipment includes a processor, a non-volatile storage medium and an internal memory connected by a system bus. The non-volatile storage medium of the computer equipment can store an operating system and computer-readable instructions which, when executed, cause the processor to perform a text recognition method. The processor provides computing and control capability and supports the operation of the entire computer equipment. The internal memory can also store computer-readable instructions which, when executed by the processor, cause the processor to perform a text recognition method. The computer equipment may be a terminal or a server. The terminal may specifically be a desktop terminal or a mobile terminal, and the mobile terminal may specifically be at least one of a mobile phone, a tablet computer, a laptop and the like. The server may specifically be an independent physical server or a cluster of physical servers. Those skilled in the art will understand that the structure shown in Fig. 1 is only a block diagram of the part of the structure relevant to the scheme of the present application and does not limit the terminal to which the scheme is applied; a specific terminal may include more or fewer components than shown, combine certain components, or have a different component layout.
Fig. 2 is a flow diagram of the text recognition method in one embodiment. This embodiment is mainly described by taking the method as applied to the computer equipment in Fig. 1 as an example. Referring to Fig. 2, the method specifically includes the following steps:
S202: obtain a text sequence image.
Here, a text sequence is a string of more than one character arranged in order, and a text sequence image is an image containing a text sequence. Depending on the layout of the text sequence image, a text sequence may be a text line or a text column. A text line is a text sequence whose characters are arranged roughly horizontally; a text column is a text sequence whose characters are arranged roughly vertically.
In one embodiment, the computer equipment may directly obtain a text sequence image that has already been segmented by text sequence. The text sequence image may be one received from another computer device, one crawled from the Internet, or one obtained by the computer equipment through scanning or shooting.
In one embodiment, the computer equipment may first obtain an image awaiting text sequence segmentation and then perform text sequence segmentation on it to obtain text sequence images. Such an image may be, for example, a business card image or a document image. A business card image is an image containing business card content, such as a business card photo, a scanned business card or an electronic business card picture. A document image is an image formed by combining one or more text sequences according to a specific layout.
In one embodiment, because different text sequences are arranged according to regular layout features, the computer equipment can detect text sequence images in the image from prior knowledge of these features. Such prior layout features include, for example, the gaps between different text lines or text columns, the character spacing within a text line or text column, and the fact that the character centres within a text line or text column lie roughly on a straight line. Using these priors, the computer equipment can segment the different text sequence images from the image.
In one embodiment, the computer equipment can perform connected domain analysis on the image and extract connected domains. Because the connected domains within the same text sequence can together form one complete connected region, the computer equipment can take the outer contour of multiple connected domains lying approximately on the same straight line as a text sequence image, and thereby segment the different text sequence images from the image.
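The connected domain approach can be illustrated with a minimal sketch. It assumes a tiny binary grid (1 for text pixels), 4-connectivity, and a hypothetical `tolerance` on vertical centres to decide that domains lie on the same line; none of these choices comes from the patent text, and the function names are illustrative only.

```python
from collections import deque

def connected_components(grid):
    """Return bounding boxes (top, left, bottom, right) of 4-connected regions of 1s."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first search over one connected domain.
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and grid[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes

def group_into_lines(boxes, tolerance=1):
    """Group boxes with similar vertical centres; return one outer box per text line."""
    lines = []
    for box in sorted(boxes, key=lambda b: (b[0] + b[2]) / 2):
        centre = (box[0] + box[2]) / 2
        for line in lines:
            reference = (line[0][0] + line[0][2]) / 2
            if abs(centre - reference) <= tolerance:
                line.append(box)
                break
        else:
            lines.append([box])
    return [
        (min(b[0] for b in line), min(b[1] for b in line),
         max(b[2] for b in line), max(b[3] for b in line))
        for line in lines
    ]

# Two rows of "characters" separated by blank rows.
grid = [
    [1, 1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 1],
    [0, 1, 0, 0, 1, 1],
]
text_lines = group_into_lines(connected_components(grid))
```

Each returned box is the outer contour of one candidate text line; a text column variant would compare horizontal centres instead.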
S204: perform character recognition on the text sequence image using the character recognition mode corresponding to each character type, recognizing characters in the text sequence image that do not belong to the corresponding character type as a general character denoting non-membership of that type, to obtain the text sequence corresponding to each character type.
Here, a character type is a category obtained by classifying characters according to character features, such as stroke features or the language to which a character belongs.
In this embodiment, the computer equipment may classify characters by language, for example into an English character type, a Chinese character type and a Korean character type. Characters left over after classification by language, such as digits and punctuation marks, may be assigned to a single separate category, such as an "other" character type. Alternatively, the computer equipment may assign the leftover characters to one of the language-based character types; for example, the English character type may include both English characters and the characters left over after classification by language.
The computer equipment can build a character library for each character type, containing a large number of characters of that type. For example, the character library built for the English character type contains a large number of English characters. The character recognition mode corresponding to a character type is a recognition mode that accurately recognizes the characters belonging to that type. Characters not belonging to the type may be recognized either accurately or only coarsely. For example, the character recognition mode corresponding to the English character type recognizes English characters accurately and places no requirement on the recognition precision for non-English characters.
A general character is a character preset by the computer equipment. When character recognition is performed in the recognition mode corresponding to a character type, the general character serves as the recognition result for characters that do not belong to that type. For example, when recognizing in the mode corresponding to the English character type, characters that do not belong to the English character type are recognized as the general character denoting non-English characters.
In one embodiment, each character type may have a single general character denoting characters outside that type. For example, the English character type may have the general character "Chinese", so that any character outside the English character type, such as a Chinese or Korean character, is recognized as "Chinese".
In one embodiment, each character type may instead have multiple general characters denoting characters outside that type. These general characters may all belong to the same character type. For example, the English character type may have the general characters "Chinese" and "Korean", with Chinese characters recognized as "Chinese" and Korean characters recognized as "Korean". Alternatively, the general characters may be characters of the other character types, in one-to-one correspondence with those types. For example, the English character type may have the general character "Chinese" and a Korean-script general character, with Chinese characters recognized as "Chinese" and Korean characters recognized as the Korean-script general character.
In one embodiment, when performing character recognition on the text sequence image in the recognition mode corresponding to each character type, the computer equipment may first perform character type recognition on the characters in the image. Character type recognition may be a binary classification that determines whether a character belongs to the corresponding character type or not. The computer equipment then accurately recognizes the characters that belong to the corresponding type, and directly uses the general character as the recognition result for the characters that do not.
For example, for the English character type, suppose the general character preset by the computer equipment is "Chinese". When a text sequence image containing a Chinese character, a Korean character and "A" is recognized in the mode corresponding to the English character type, the first character (Chinese) is determined not to belong to the English character type and "Chinese" is used as its recognition result; the second character (Korean) is likewise determined not to belong to the English character type and "Chinese" is used as its recognition result; the third character, "A", is determined to belong to the English character type and is recognized further to obtain an accurate result.
Character type recognition may also be a multi-class classification that determines which character type a character belongs to. The computer equipment then accurately recognizes the characters that belong to the corresponding type and, for each remaining character, directly uses as the recognition result the general character that matches the character type to which that character belongs.
For example, for the English character type, suppose the computer equipment has preset "Chinese" as the general character for Chinese characters and a Korean-script character as the general character for Korean characters. When a text sequence image containing a Chinese character, a Korean character and "A" is recognized in the mode corresponding to the English character type, the first character is determined to be of the Chinese character type and "Chinese" is used as its recognition result; the second character is determined to be of the Korean character type and the Korean general character is used as its recognition result; the third character, "A", is determined to belong to the English character type and is recognized further to obtain an accurate result.
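The two classification variants above can be sketched on already-decoded characters. This is only a stand-in for a real image recognizer: the Unicode-range test in `char_type`, the placeholder symbols used as general characters, and the sample string are all assumptions made for illustration and are not taken from the patent.

```python
# Placeholder general characters (assumed; the patent presets its own).
GENERAL = "□"
GENERAL_BY_TYPE = {"chinese": "□", "korean": "■", "english": "△"}

def char_type(ch):
    """Rough character type from Unicode ranges (an illustrative assumption)."""
    code = ord(ch)
    if 0x4E00 <= code <= 0x9FFF:      # CJK Unified Ideographs
        return "chinese"
    if 0xAC00 <= code <= 0xD7A3:      # Hangul Syllables
        return "korean"
    return "english"                  # everything else is lumped here

def recognize_as(target_type, chars, multiclass=False):
    """Keep characters of target_type; replace all others with a general character."""
    out = []
    for ch in chars:
        kind = char_type(ch)
        if kind == target_type:
            out.append(ch)                      # "accurate" recognition
        elif multiclass:
            out.append(GENERAL_BY_TYPE[kind])   # one general character per other type
        else:
            out.append(GENERAL)                 # single general character
    return "".join(out)

sample = "我한A"  # hypothetical content: one Chinese, one Korean, one English character
binary_result = recognize_as("english", sample)
multiclass_result = recognize_as("english", sample, multiclass=True)
```

With the binary variant both non-English characters collapse to the same general character, while the multi-class variant preserves which other type each position belongs to.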
In one embodiment, the computer equipment may perform character recognition by template matching. The character recognition mode corresponding to a character type is then a mode that matches against the character templates of that type. For example, the mode corresponding to the English character type matches against English character templates, so English characters can be recognized accurately. If the computer equipment needs to recognize non-English characters accurately, it can match against the character templates of the other character types at the same time. If it does not, it can directly recognize non-English characters as the general character for non-English characters.
Specifically, the computer equipment collects a character template for each character in the character library built for each character type, matches the character to be recognized against the collected templates of the given character type, computes the similarity between the character to be recognized and each template, and takes the character corresponding to the template with the highest similarity as the recognition result, thereby obtaining the text sequence corresponding to each character type.
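A minimal sketch of template matching follows, assuming 3x3 binary glyphs, a similarity defined as the fraction of agreeing pixels, and a hypothetical `threshold` below which the general character is returned. The patent does not specify the similarity measure or any threshold; both are illustrative choices.

```python
# Toy templates for a hypothetical three-character library.
TEMPLATES = {
    "I": ((0, 1, 0), (0, 1, 0), (0, 1, 0)),
    "L": ((1, 0, 0), (1, 0, 0), (1, 1, 1)),
    "T": ((1, 1, 1), (0, 1, 0), (0, 1, 0)),
}

def similarity(glyph, template):
    """Fraction of pixels on which glyph and template agree."""
    total = sum(len(row) for row in glyph)
    same = sum(
        pixel == ref
        for glyph_row, template_row in zip(glyph, template)
        for pixel, ref in zip(glyph_row, template_row)
    )
    return same / total

def match(glyph, templates=TEMPLATES, general="□", threshold=0.8):
    """Return the best-matching character, or the general character if no template is close."""
    best_char, best_score = max(
        ((ch, similarity(glyph, tpl)) for ch, tpl in templates.items()),
        key=lambda pair: pair[1],
    )
    return best_char if best_score >= threshold else general

noisy_l = ((1, 0, 0), (1, 0, 0), (1, 1, 0))   # "L" with one pixel missing
unknown = ((1, 0, 1), (0, 1, 0), (1, 0, 1))   # not in the library
```

Here `match(noisy_l)` selects the template with the highest similarity, and `match(unknown)` falls back to the general character because no template is similar enough.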
For example, a text sequence recognized in the mode corresponding to the English character type might be "Chinese Chinese Chinese My name is Addy", where each "Chinese" is one occurrence of the general character. Characters such as "M", "y" and "n" are present in the character library of the English character type and therefore belong to it, whereas "Chinese" is not present in that library and therefore does not.
In one embodiment, the computer equipment may also perform character recognition by feature extraction. The character recognition mode corresponding to a character type is then a mode that matches against the character features of that type. Specifically, the computer equipment may extract the character features of each character in the character library built for each character type, extract the character features of the character to be recognized, match them against the features of each character in the library, compute the similarity between the character to be recognized and each set of character features, and take the character whose features have the highest similarity as the recognition result, thereby obtaining the text sequence corresponding to each character type.
Specifically, the computer equipment may extract geometric features of a character, such as its endpoints, bifurcation points, concave and convex parts, line segments in horizontal, vertical, inclined and other directions, and closed loops, and then make a logical combination judgment based on the positions of the extracted features and their relationships to obtain the recognition result.
In one embodiment, the computer equipment may perform character recognition directly on the text sequence image in the mode corresponding to each character type, or may first cut the text sequence image into single-character images and then perform character recognition on each single-character image.
In one embodiment, the computer equipment may use a machine learning model for character recognition. The machine learning model may be a neural network model, specifically a CNN (Convolutional Neural Networks) model or an FCNN (Fully Convolutional Neural Networks) model. CNN models have strong classification capability in the vision field and can recognize single characters accurately.
S206: select a text sequence from the text sequences corresponding to the character types.
Specifically, the computer equipment may select a text sequence at random from the text sequences corresponding to the character types. Alternatively, before selecting, it may count for each text sequence the number of characters of the corresponding character type that it contains, and select the text sequence containing the most such characters.
For example, by recognizing the text sequence image by character type, the computer equipment obtains text sequence A for the Chinese character type, text sequence B for the English character type, text sequence C for the Korean character type and text sequence D for the Japanese character type, where A contains 15 Chinese characters, B contains 69 English characters, C contains 3 Korean characters and D contains 6 Japanese characters. The computer equipment may select any one of the four text sequences A, B, C and D, or may select text sequence B, which contains the most characters of its corresponding type.
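The count-based selection can be sketched as follows, assuming each text sequence is a plain string in which a placeholder symbol stands for the general character. The function names and sample strings are hypothetical.

```python
def count_own_type(sequence, general_chars):
    """Count characters that were recognized as belonging to the sequence's own type."""
    return sum(ch not in general_chars for ch in sequence)

def choose_sequence(sequences, general_chars):
    """Return the character type whose sequence holds the most own-type characters."""
    return max(sequences, key=lambda kind: count_own_type(sequences[kind], general_chars))

# Hypothetical per-type outputs for one image; "□" marks the general character.
sequences = {
    "chinese": "你好□□□□□",
    "english": "□□Hello",
}
chosen = choose_sequence(sequences, general_chars={"□"})
```

Choosing the sequence with the most own-type characters leaves the fewest positions to correct later; a random choice, which the text also permits, would simply replace `max` with `random.choice` over the keys.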
S208: determine the positions, in the selected text sequence, of the general characters that do not belong to the corresponding character type.
Specifically, after selecting a text sequence, the computer equipment may first determine the character type to which the text sequence corresponds, then identify the general characters in it that do not belong to that character type, and then determine the positions of these characters in the selected text sequence. A character that does not belong to the character type of the text sequence may specifically be a character not included in the character library of that type.
In one embodiment, the computer equipment may traverse the characters in the selected text sequence and, for each character reached, judge whether it is included in the character library of the corresponding character type. If it is, the traversal continues; if it is not, the position of that character in the selected text sequence is recorded.
In one embodiment, when performing character recognition on the text sequence image in the mode corresponding to each character type, the computer equipment may mark each character it identifies as not belonging to the corresponding character type. After selecting a text sequence, the computer equipment can then check which characters in it carry the mark, to determine the characters that do not belong to the corresponding character type and hence their positions in the selected text sequence.
In one embodiment, the position of a character that does not belong to the corresponding character type may be its position in the selected text sequence relative to the characters that do belong to that type. For example, in the text sequence "Chinese Chinese Chinese My name is Addy" obtained by recognition for the English character type, the position of the characters "Chinese", which do not belong to the corresponding character type, may be given as in front of "My name is Addy".
In one embodiment, the position of a character that does not belong to the corresponding character type may be its absolute position in the selected text sequence. For example, in the text sequence "Chinese Chinese Chinese My name is Addy" obtained by recognition for the English character type, the position of the characters "Chinese" may be given as the first to the third character.
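Locating the general characters by absolute position can be sketched as below, assuming the selected text sequence is a string and "□" is a placeholder for the general character. `general_runs` additionally merges consecutive positions into half-open ranges, which is one illustrative way to express "the first to the third character"; both function names are hypothetical.

```python
def general_positions(sequence, general_chars):
    """Absolute indices of every general character in the sequence."""
    return [i for i, ch in enumerate(sequence) if ch in general_chars]

def general_runs(sequence, general_chars):
    """Consecutive general characters merged into (start, end) ranges, end exclusive."""
    runs, start = [], None
    for i, ch in enumerate(sequence):
        if ch in general_chars:
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(sequence)))
    return runs

selected = "□□□My name is Addy"   # stands for the English-type sequence in the example
positions = general_positions(selected, {"□"})
runs = general_runs(selected, {"□"})
```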
S210: obtain, from the remaining text sequences, the characters at the positions that belong to the corresponding character type.
Specifically, the computer equipment may traverse the remaining text sequences and, for each one, judge whether the character at the position belongs to the character type of that text sequence. If it does, the character is obtained; if it does not, the traversal continues.
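The traversal of the remaining text sequences might look like the following sketch. It assumes all sequences are strings aligned character by character, so that index `pos` refers to the same image location in each, and that "□" is a placeholder general character; the function name and sample data are hypothetical.

```python
def collect_replacements(positions, remaining, general_chars):
    """For each position, take the first own-type character found in the remaining sequences."""
    found = {}
    for pos in positions:
        for sequence in remaining:
            # A non-general character means this sequence's type claims the position.
            if pos < len(sequence) and sequence[pos] not in general_chars:
                found[pos] = sequence[pos]
                break
    return found

remaining = [
    "大家好□□",   # hypothetical Chinese-type sequence
    "□□□□□",      # hypothetical Korean-type sequence (nothing recognized)
]
replacements = collect_replacements([0, 1, 2], remaining, {"□"})
```

Positions for which no remaining sequence offers an own-type character are simply absent from the returned mapping.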
S212: correct the characters at the positions in the selected text sequence according to the obtained characters, to obtain the recognition result.
Specifically, after obtaining the characters at the positions in the remaining text sequences that belong to the corresponding character types, the computer equipment may, for each determined position, compare the character obtained for that position with the character at that position in the selected text sequence and, on detecting that the two are inconsistent, correct the character in the selected text sequence with the obtained character. Once the characters at all determined positions have been corrected, a recognition result with higher accuracy is obtained.
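The comparison and correction step can be sketched as follows for the simple case in which each obtained character replaces exactly one character of the selected text sequence. The mapping from position to obtained character, the function name and the sample strings are assumptions for illustration.

```python
def apply_corrections(selected, replacements):
    """Overwrite the character at each position when it differs from the obtained character."""
    chars = list(selected)
    for pos, obtained in replacements.items():
        if chars[pos] != obtained:      # inconsistent, so correct it
            chars[pos] = obtained
    return "".join(chars)

selected = "□□□My name is Addy"               # "□" stands for the general character
replacements = {0: "大", 1: "家", 2: "好"}      # hypothetical characters from other sequences
result = apply_corrections(selected, replacements)
```

When the number of obtained characters differs from the number of characters at the position, the replacement is no longer one-to-one and this sketch does not cover that case.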
Above-mentioned text recognition method carries out word according to different character types after getting text sequence image respectively Symbol identification, obtains each corresponding text sequence of character type.Wherein, by the corresponding character recognition mode of certain character type into When row identification, by the character recognition for being not belonging to the character type in text sequence image at being not belonging to the general of the character type Character.And then an optional text sequence, determine the general character institute that respective symbols type is not belonging in the text sequence chosen Position the position in the text sequence of selection is corrected by the character after choosing in remaining text sequence at the position The character at place, obtains recognition result.Text sequence image is known using being known otherwise by character type in this way Not, it is ensured that when being identified by each character type, belong to the recognition accuracy of the character of the character type, and in text When content is complicated various, the identification for the various characters kind class text for including in text sequence image can also be taken into account, is recycled each The character for belonging to the character type in the text sequence that character category identification obtains, to corresponding position in other text sequences Character is corrected, and recognition result can be obtained, and improves text identification accuracy rate.
Fig. 3 shows a schematic diagram of the text recognition method in one embodiment. With reference to Fig. 3, after obtaining a text sequence image, the computer equipment performs character recognition on it separately according to each character type, obtaining a text sequence for each character type. It then uses the characters in each text sequence that belong to the corresponding character type to correct the characters at the corresponding positions in the other text sequences, thereby obtaining the recognition result.
In one embodiment, step S202 includes: obtaining an image to be recognized; binarizing the image to be recognized to obtain a text image; extracting a text texture image from the text image; determining the connected domains in the text texture image; and determining the text sequence image according to the connected domains.
Here, the image to be recognized is an image whose text sequences are to undergo character recognition, for example a business card image or a document image. Binarizing an image means setting the gray value of each pixel to one of two pixel values, so that the whole image shows a distinct visual effect with only two pixel values.
Specifically, the computer equipment may use a fixed-threshold binarization algorithm or an adaptive-threshold binarization algorithm to set the pixels of the image to be recognized whose values are above the threshold to one of two preset pixel values, and the pixels whose values are below the threshold to the other. The two preset values are the first pixel value and the second pixel value. In the binarized image, everything that represents text has the first pixel value, for example white, and everything that represents background has the second pixel value, for example black.
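As a minimal sketch of the fixed-threshold case, assuming a grayscale image stored as a list of rows and text that is brighter than the background, binarization could look like the following. The function name `binarize`, the threshold of 128 and the values 255 and 0 are illustrative choices and are not specified by the method.

```python
def binarize(gray, threshold=128, first_value=255, second_value=0):
    """Map every pixel to one of two preset values.

    Assumption: text is brighter than the background, so pixels above the
    threshold get first_value (text) and all others get second_value.
    """
    return [
        [first_value if pixel > threshold else second_value for pixel in row]
        for row in gray
    ]


# Tiny 2x3 grayscale image (hypothetical data).
gray_image = [[10, 200, 30], [220, 240, 15]]
binary_image = binarize(gray_image)
print(binary_image)
```

An adaptive-threshold variant would compute the threshold per neighbourhood instead of using one global value; it is not shown here.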
Further, the computer equipment may extract from the binarized image the image region formed by the pixels that have the first pixel value, which represents text, to obtain the text image. From the text image it may then extract the character stroke texture and determine the image region formed by the pixels that make up the stroke texture, obtaining the text texture image.
Further, the computer equipment may perform connected-domain analysis on the text texture image to extract the connected domains, and may also merge adjacent connected domains. It may use a stroke smoothing algorithm for the analysis and merging; this algorithm connects the pixels of adjacent connected domains so that they form a single region. Because the connected domains inside one text sequence lie close to one another, the connected domains of the same text sequence can form one complete connected domain.
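The idea of extracting connected domains and merging neighbours can be sketched as follows. This is not the stroke smoothing algorithm mentioned above; it swaps in a plain breadth-first labelling with 8-connectivity and a gap-based merge of bounding boxes, and the names `connected_components`, `merge_adjacent` and `max_gap` are hypothetical.

```python
from collections import deque


def connected_components(binary, text_value=1):
    """Return bounding boxes (top, left, bottom, right) of 8-connected regions."""
    rows, cols = len(binary), len(binary[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if binary[r][c] != text_value or seen[r][c]:
                continue
            # Breadth-first search over one connected domain.
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, left = min(top, y), min(left, x)
                bottom, right = max(bottom, y), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and not seen[ny][nx]
                                and binary[ny][nx] == text_value):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


def merge_adjacent(boxes, max_gap=2):
    """Merge boxes that overlap vertically and are at most max_gap columns apart."""
    merged = []
    for box in sorted(boxes, key=lambda b: b[1]):
        if merged:
            t, l, b, r = merged[-1]
            overlaps = box[0] <= b and box[2] >= t
            if overlaps and box[1] - r - 1 <= max_gap:
                merged[-1] = (min(t, box[0]), l, max(b, box[2]), max(r, box[3]))
                continue
        merged.append(box)
    return merged


# Hypothetical 2x9 binary image: two nearby strokes and one distant stroke.
binary = [
    [1, 1, 0, 1, 0, 0, 0, 0, 1],
    [1, 0, 0, 1, 0, 0, 0, 0, 1],
]
boxes = connected_components(binary)
merged_boxes = merge_adjacent(boxes)
print(boxes, merged_boxes)
```

With `max_gap=2` the two nearby strokes are merged into one region while the distant stroke stays separate, which mirrors the observation that connected domains within one text sequence lie close together.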
Still further, the computer equipment may take the outer contour of several connected domains that lie approximately on the same straight line as the position of a text sequence image and record it, so as to determine the corresponding text sequence image. Alternatively, the computer equipment may treat each connected domain as an independent text sequence image.
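One simple way to decide that connected domains lie "approximately on the same straight line" is to compare the vertical centres of their bounding boxes. The sketch below assumes horizontal text lines and boxes given as `(top, left, bottom, right)`; the function name `group_into_lines` and the `tolerance` parameter are hypothetical, and real text may need a more robust line-fitting step.

```python
def group_into_lines(boxes, tolerance=3):
    """Group boxes whose vertical centres are close and return each group's outer box."""
    groups = []  # each entry: [reference_centre, list_of_boxes]
    for box in sorted(boxes, key=lambda b: (b[0] + b[2]) / 2):
        centre = (box[0] + box[2]) / 2
        if groups and abs(centre - groups[-1][0]) <= tolerance:
            groups[-1][1].append(box)
        else:
            groups.append([centre, [box]])
    # The outer contour of a group is approximated by its bounding rectangle.
    return [
        (min(b[0] for b in members), min(b[1] for b in members),
         max(b[2] for b in members), max(b[3] for b in members))
        for _, members in groups
    ]


# Two boxes on one line and one box on a lower line (hypothetical data).
line_boxes = group_into_lines([(0, 0, 9, 5), (1, 8, 10, 14), (30, 0, 40, 6)])
print(line_boxes)
```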
In this embodiment, the text texture image is extracted step by step from the image to be recognized, and the text sequence image is then determined from the connected domains in the text texture image. This avoids including too much background area when determining the text sequence image, so that subsequent character recognition is more accurate.
In one embodiment, step S204 includes: recognizing, in the manner corresponding to each character type, the characters in the text sequence image that belong to that character type, and recognizing the characters that do not belong to that character type as general characters; and combining, for each character type, the recognized characters in order to obtain the text sequence corresponding to that character type.
Specifically, the computer equipment may set a recognition strategy for each character type in advance. In one embodiment, for a given character type, the computer equipment recognizes the characters belonging to that character type exactly, obtaining the actual character in each case. Characters that do not belong to that character type undergo fuzzy processing: they are marked with a general character denoting "not of this character type", so that exactly recognized characters can be told apart from fuzzily processed ones.
In one embodiment, the step of recognizing, in the manner corresponding to each character type, the characters in the text sequence image that belong to that character type and the general characters that do not belong to it includes: cutting individual character images out of the text sequence image; and performing character recognition on the individual character images with the machine learning model corresponding to each character type, obtaining the characters that belong to the corresponding character type and the general characters that do not.
Here, an individual character image is a rectangular image containing a single character; the computer equipment cuts the individual character images one by one out of the text sequence image. It may do so using prior knowledge such as the character spacing within the text sequence, the character length, and the consistency of character proportions, obtaining a sequence of individual character images. Before it is split, the text sequence image may undergo image enhancement, for example an increase in contrast.
In one embodiment, the computer equipment may binarize the text sequence image, project each pixel value onto the longitudinal direction of the text sequence image to obtain accumulated values, and cut at the local maximum or local minimum accumulated values to obtain the individual character images. If the pixels representing characters are white after binarization, the local minimum accumulated values are sought; if they are black, the local maximum accumulated values are sought.
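The projection approach can be illustrated with a small sketch. It assumes a horizontal text line, character pixels stored as 1 ("white") and background as 0, and it simplifies "local minimum" to columns whose accumulated value is zero; the function names are hypothetical.

```python
def column_projection(binary):
    """Accumulate the pixel values of each column."""
    return [sum(column) for column in zip(*binary)]


def split_by_projection(binary):
    """Return (start, end) column spans of characters, cutting at empty columns."""
    profile = column_projection(binary)
    spans, start = [], None
    for x, value in enumerate(profile):
        if value > 0 and start is None:
            start = x                      # a character begins
        elif value == 0 and start is not None:
            spans.append((start, x))       # a gap ends the character
            start = None
    if start is not None:
        spans.append((start, len(profile)))
    return spans


# Hypothetical 2x8 binarized text line containing three characters.
line = [
    [1, 1, 0, 0, 1, 0, 1, 1],
    [1, 0, 0, 0, 1, 0, 0, 1],
]
profile = column_projection(line)
spans = split_by_projection(line)
print(profile, spans)
```

Touching characters never produce an empty column, which is one reason a confidence-based choice of cut points can be preferable in practice.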
Further, after cutting the individual character images out of the text sequence image, the computer equipment may perform character recognition on them with machine learning models. The machine learning model corresponding to each character type may be trained in advance.
In one embodiment, the step of training the machine learning model corresponding to each character type includes: obtaining a character image sample set; for each character type, labelling the character images in the sample set that belong to that character type with the corresponding character, and labelling the character images that do not belong to that character type with the general character; and training the machine learning model corresponding to each character type according to the character images in the sample set and the labels added for that character type.
Here, the character image sample set contains a number of character images, which may include character images generated from characters of various character types. The sample set used to train the machine learning model of each character type may be one unified character image sample set, or each character type may have its own sample set. A sample set dedicated to a character type is biased towards that character type: it may contain a large number of character images generated from characters of that type and a small number generated from characters of other types.
Specifically, a machine learning model is a functional relationship that maps a character image to the character with which it is labelled. Training the model on the character image sample set means using samples whose mapped characters are known to adjust the parameters inside the model, so that the model can predict the character to which a new character image maps, and thereby recognize characters in images that contain them. The machine learning model may be an SVM (support vector machine) or any of various neural networks.
In one embodiment, the machine learning model is a convolutional neural network (CNN). A CNN is an end-to-end learning method: it takes the pixels of the character image directly as input, so the number of input-layer nodes equals the number of pixels of the normalized character image. After the input, several layers of local feature extraction and pooling are applied first, then the intermediate layers perform a fully connected global feature transformation, and finally the output layer produces the output for the target of the task.
Specifically, for each character type, the computer equipment may label the character images in the sample set that belong to that character type with the corresponding character, and label the character images that do not belong to that character type with the general character. The computer equipment then trains the machine learning model corresponding to each character type according to the character images in the sample set and the labels added for that character type.
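The per-type labelling can be sketched as below. The sketch assumes two character types, "chinese" and "english", a hypothetical helper `char_type` that treats the CJK Unified Ideographs range as Chinese and everything else as English, and the general characters "A" (for the Chinese model) and "汉" (for the English model) used in the business card scenario of this description. Images are represented by placeholder strings.

```python
# General character used by each type-specific model (illustrative choice).
GENERAL_CHAR = {"chinese": "A", "english": "汉"}


def char_type(ch):
    """Hypothetical classifier: CJK Unified Ideographs count as Chinese."""
    return "chinese" if "\u4e00" <= ch <= "\u9fff" else "english"


def label_samples(samples, target_type):
    """Label each (image, true_char) pair for the model of target_type.

    Characters of the target type keep their own label; all others are
    labelled with that model's general character.
    """
    return [
        (image, ch if char_type(ch) == target_type else GENERAL_CHAR[target_type])
        for image, ch in samples
    ]


# Placeholder sample set: image identifiers paired with their true characters.
samples = [("img0", "我"), ("img1", "M"), ("img2", "y")]
chinese_labels = label_samples(samples, "chinese")
english_labels = label_samples(samples, "english")
print(chinese_labels)
print(english_labels)
```

Each labelled list would then be used to train the model of its own character type, so that both models see the same images but different targets.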
In one embodiment, the machine learning model may be obtained by iteratively adjusting, according to the character image sample set, the parameters of a convolutional neural network that has already been trained for image recognition.
In this embodiment, the strong learning and representation capability of the machine learning model is used to learn from large amounts of character data; recognizing characters with the trained model works better than recognizing them with traditional methods.
In the above embodiment, the text sequence image is cut into individual character images, and character recognition is then performed on the individual character images with machine learning models, so that character recognition of the text sequence image can be completed conveniently and quickly.
For a given character type, after the computer equipment has recognized the characters in the text sequence image that belong to that character type and the general characters that do not, it combines the recognized characters in the order in which the text appears in the text sequence image, obtaining the text sequence corresponding to that character type.
After obtaining the text sequence corresponding to each character type, the computer equipment can tell whether a character in a text sequence belongs to the corresponding character type by checking whether it is a general character. Thus, after selecting one text sequence from the text sequences, the computer equipment can look up the general characters in it directly; the positions of the general characters are the positions in that text sequence at which character correction is needed.
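Locating the positions that need correction amounts to finding the runs of general characters in the selected text sequence. A minimal sketch follows, assuming the text sequence is a Python string and the general character is a single known symbol; `general_runs` is a hypothetical name.

```python
def general_runs(sequence, general_char):
    """Return (start, end) index pairs of consecutive general characters."""
    runs, start = [], None
    for i, ch in enumerate(sequence):
        if ch == general_char and start is None:
            start = i
        elif ch != general_char and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(sequence)))
    return runs


# Text sequence from an English-type model, with "汉" as its general character.
english_sequence = "汉汉汉 My name is Addy"
runs = general_runs(english_sequence, "汉")
print(runs)
```

Each returned pair marks one position, in the sense used above, whose characters are to be corrected from the remaining text sequences.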
In the above embodiment, when the text sequence image is recognized by character type, the characters that do not belong to the corresponding character type undergo fuzzy processing and are marked with a general character. The characters that need correction can therefore be located quickly during character correction, so that the correction can be completed and a more accurate recognition result obtained.
In one embodiment, the step of cutting individual character images out of the text sequence image includes: in the text sequence image, choosing candidate cut points along the long side of the text sequence image at a spacing shorter than the short side of the text sequence image; obtaining the cutting confidence of each candidate cut point; determining the cut points according to the cutting confidence; and cutting the individual character images out of the text sequence image at the determined cut points.
Here, a candidate cut point is a candidate segmentation position. It may be expressed as a coordinate or as a distance from the starting point at the head of the text sequence image.
In one embodiment, the text sequence image is a rectangular image. Its short side is approximately the width or height of a character in the text sequence, and its long side is approximately the length of the text sequence. The computer equipment may choose candidate cut points at a spacing shorter than the short side; specifically, the spacing may be less than or equal to one half, one third, or one quarter of the short side of the text sequence image.
Further, the cutting confidence is a quantized value of the probability that the corresponding candidate cut point is an actual cut point. The computer equipment may cut out the corresponding pictures according to the candidate cut points, extract image features from the pictures, and feed the features in turn into a trained classifier, which outputs the cutting confidence of the corresponding candidate cut point. The classifier may be a random forest classifier. The extracted image features may be HOG (Histogram of Oriented Gradient) features, or other features such as LBP (Local Binary Patterns) features.
Further, the computer equipment may compare the cutting confidence with a preset threshold and, if the confidence is higher than the threshold, determine that the candidate is an actual cut point. The computer equipment then cuts the text sequence image at each determined cut point, obtaining the individual character images one by one.
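The selection of cut points can be sketched as follows. The classifier is replaced by an arbitrary callable `confidence`, which stands in for the trained model (for example a random forest over HOG features) and is purely hypothetical here; the spacing fraction of one quarter and the threshold of 0.5 are likewise illustrative.

```python
def candidate_cut_points(long_side, short_side, fraction=0.25):
    """Place candidates along the long side at a spacing shorter than the short side."""
    step = max(1, int(short_side * fraction))
    return list(range(step, long_side, step))


def select_cut_points(candidates, confidence, threshold=0.5):
    """Keep the candidates whose cutting confidence exceeds the threshold."""
    return [x for x in candidates if confidence(x) > threshold]


def toy_confidence(x):
    """Stand-in for a trained classifier: confident only at multiples of 16."""
    return 0.9 if x % 16 == 0 else 0.1


# A 40x16 text sequence image yields candidates every 4 pixels.
candidates = candidate_cut_points(long_side=40, short_side=16)
cut_points = select_cut_points(candidates, toy_confidence)
print(candidates, cut_points)
```

Dense candidates make it unlikely that a true character boundary is missed; the classifier then decides which candidates to keep.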
In the above embodiment, candidate cut points are chosen densely in the text sequence image, and the cutting confidence of each candidate cut point is used to cut the text sequence image into individual character images. This allows the text sequence image to be cut accurately, which improves the accuracy of subsequent text recognition.
In one embodiment, step S212 includes: when the number of obtained characters is the same as the number of characters at the position in the selected text sequence, replacing the characters at the position in the selected text sequence one by one with the obtained characters, matched one-to-one in character order; and when the number of obtained characters differs from the number of characters at the position in the selected text sequence, and the number of characters at the position in the selected text sequence is more than one, replacing the characters at the position in the selected text sequence as a whole with the obtained characters.
Specifically, the computer equipment may first count the characters at the position in the selected text sequence, then obtain the characters of the corresponding character type at the position in the remaining text sequences, count the obtained characters, and compare the two counts. If the two counts are the same, each corresponding character in the text sequence image is considered to have been recognized, with the recognized characters corresponding one to one. The computer equipment may then replace the characters at the position in the selected text sequence one by one with the obtained characters, matched one-to-one in character order.
If the computer equipment determines that the number of obtained characters differs from the number of characters at the position in the selected text sequence, some of the corresponding characters in the text sequence image are considered not to have been recognized. In that case, when the number of characters at the position in the selected text sequence is more than one, the computer equipment may replace the characters at the position in the selected text sequence as a whole with the obtained characters, so as to obtain a recognition result that is as accurate as possible.
For example, Fig. 4 shows a schematic illustration of character correction in one embodiment when the number of obtained characters is the same as the number of characters at the position in the selected text sequence. With reference to Fig. 4, the original content of the text sequence image is a seven-character Chinese sentence meaning "I am a Han person", followed by "My name is Addy". The text sequence obtained by recognition according to the Chinese character type is the seven Chinese characters followed by "AA AAAA AA AAAA". The text sequence obtained by recognition according to the English character type is seven copies of the general character "汉" ("Han", i.e. Chinese) followed by "My name is Addy".
The computer equipment may select the text sequence obtained according to the English character type, determine the position of the general character "汉" that does not belong to the English character type, and count the general characters at that position: 7. In the remaining text sequence, obtained according to the Chinese character type, the characters of the Chinese character type at that position are the seven characters of the Chinese sentence, so their count is also 7. Since the two counts are the same, the characters at that position in the text sequence of the English character type are replaced one by one, in character order, with the characters obtained from the text sequence of the Chinese character type.
Fig. 5 shows a schematic illustration of character correction in one embodiment when the number of obtained characters differs from the number of characters at the position in the selected text sequence. With reference to Fig. 5, the original content of the text sequence image is again the seven-character Chinese sentence meaning "I am a Han person", followed by "My name is Addy". The text sequence obtained by recognition according to the Chinese character type is the seven Chinese characters followed by "AA AAAA". The text sequence obtained by recognition according to the English character type is three copies of the general character "汉" ("Han", i.e. Chinese) followed by "My name is Addy".
The computer equipment may select the text sequence obtained according to the English character type, determine the position of the general character "汉" that does not belong to the English character type, and count the general characters at that position: 3. In the remaining text sequence, obtained according to the Chinese character type, the characters of the Chinese character type at that position are the seven characters of the Chinese sentence, so their count is 7. The two counts differ and 3 is greater than 1, so the characters at that position in the text sequence of the English character type are replaced as a whole with the characters obtained from the text sequence of the Chinese character type.
The above embodiment provides the way character correction is handled both when the number of obtained characters is the same as the number of characters at the position in the selected text sequence and when it is different. Correcting characters in this way yields a recognition result that is as accurate as possible.
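The two correction cases can be sketched in a few lines. The sketch assumes text sequences are Python strings and that the position to correct is given as a half-open index range `[start, end)` of general characters; `correct_run` is a hypothetical name, and the seven-character Chinese string is an illustrative stand-in for the sentence in Figs. 4 and 5. The case in which the counts differ and only one general character is present is not covered by the rule described above, so the sketch leaves the text unchanged there.

```python
def correct_run(selected, start, end, obtained):
    """Correct selected[start:end] (a run of general characters) with `obtained`."""
    count = end - start
    if len(obtained) == count:
        # Same count: replace one by one, matched in character order.
        chars = list(selected)
        for offset, ch in enumerate(obtained):
            chars[start + offset] = ch
        return "".join(chars)
    if count > 1:
        # Different count and more than one general character: replace as a whole.
        return selected[:start] + obtained + selected[end:]
    # Case not covered by the described rule: leave the text unchanged (assumption).
    return selected


chinese_part = "我是一个汉族人"  # seven Chinese characters (illustrative)

# Fig. 4 style case: seven general characters, seven obtained characters.
same_count = correct_run("汉汉汉汉汉汉汉 My name is Addy", 0, 7, chinese_part)

# Fig. 5 style case: three general characters, seven obtained characters.
different_count = correct_run("汉汉汉 My name is Addy", 0, 3, chinese_part)

print(same_count)
print(different_count)
```

Both calls yield the same corrected text, although the second one changes the length of the selected text sequence.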
As shown in Fig. 6, in a specific embodiment, the text recognition method includes the following steps:
S602: obtain an image to be recognized; binarize the image to be recognized to obtain a text image; extract a text texture image from the text image; determine the connected domains in the text texture image; determine the text sequence image according to the connected domains.
S604: in the text sequence image, choose candidate cut points along the long side of the text sequence image at a spacing shorter than the short side of the text sequence image; obtain the cutting confidence of each candidate cut point; determine the cut points according to the cutting confidence; cut the individual character images out of the text sequence image at the determined cut points.
S606: obtain a character image sample set; for each character type, label the character images in the sample set that belong to that character type with the corresponding character, and label the character images that do not belong to that character type with the general character.
S608: train the machine learning model corresponding to each character type according to the character images in the sample set and the labels added for that character type.
S610: perform character recognition on the individual character images with the machine learning model corresponding to each character type, obtaining the characters that belong to the corresponding character type and the general characters that do not.
S612: for each character type, combine the recognized characters in order to obtain the text sequence corresponding to that character type.
S614: select one text sequence from the text sequences corresponding to the character types.
S616: determine the positions of the general characters, which do not belong to the corresponding character type, in the selected text sequence.
S618: obtain the characters of the corresponding character type at the position in the remaining text sequences.
S620: judge whether the number of obtained characters is the same as the number of characters at the position in the selected text sequence; if so, go to step S622; if not, go to step S624.
S622: replace the characters at the position in the selected text sequence one by one with the obtained characters, matched one-to-one in character order.
S624: if the number of characters at the position in the selected text sequence is more than one, replace the characters at the position in the selected text sequence as a whole with the obtained characters.
In this embodiment, after a text sequence image is obtained, character recognition is performed separately according to each character type, yielding a text sequence for each character type. One text sequence is then selected, the positions of the characters in it that do not belong to the corresponding character type are determined, and the characters at those positions are corrected with the characters found at the same positions in the remaining text sequences, giving the recognition result. Recognizing by character type in this way keeps the recognition accuracy high for the characters of each character type and, when the text content is complex and varied, also covers all the character types contained in the text sequence image. The characters that each type-specific text sequence recognizes as belonging to its own character type are then used to correct the characters at the corresponding positions in the other text sequences, which yields the recognition result and improves text recognition accuracy.
Fig. 7 shows a flow diagram of the principle of the text recognition method in a specific application scenario. With reference to Fig. 7, the scenario is text recognition in a business card image. The computer equipment first performs text line detection on the business card image. After text lines are detected, the characters in each text line are recognized by the machine learning model corresponding to the Chinese character type and by the machine learning model corresponding to the English character type. In this embodiment, other characters such as digits and punctuation marks are recognized exactly by the machine learning model corresponding to the English character type.
After a text line passes through the machine learning model corresponding to the Chinese character type, the resulting text sequence contains the exactly recognized Chinese characters and the general character "A", which marks non-Chinese characters. After the text line passes through the machine learning model corresponding to the English character type, the resulting text sequence contains the exactly recognized English characters, digits and punctuation marks, and the general character "汉" ("Han", i.e. Chinese), which marks Chinese characters.
The computer equipment may then select the text sequence obtained according to the English character type, determine the positions of the general character "汉" ("Han", i.e. Chinese) that does not belong to the English character type, and obtain the characters of the Chinese character type at those positions in the text sequence obtained according to the Chinese character type. It then replaces the characters at those positions in the text sequence of the English character type as a whole with the characters obtained from the text sequence of the Chinese character type.
As shown in Fig. 8, in one embodiment, a text recognition device 800 is provided. With reference to Fig. 8, the text recognition device 800 includes a first obtaining module 801, a recognition module 802, a selection module 803, a determining module 804, a second obtaining module 805 and a correction module 806.
The first obtaining module 801 is configured to obtain a text sequence image.
The recognition module 802 is configured to perform character recognition on the text sequence image in the manner corresponding to each character type, recognizing the characters in the text sequence image that do not belong to the corresponding character type as general characters, to obtain the text sequence corresponding to each character type.
The selection module 803 is configured to select one text sequence from the text sequences corresponding to the character types.
The determining module 804 is configured to determine the positions of the general characters, which do not belong to the corresponding character type, in the selected text sequence.
The second obtaining module 805 is configured to obtain the characters of the corresponding character type at the position in the remaining text sequences.
The correction module 806 is configured to correct the characters at the position in the selected text sequence according to the obtained characters, to obtain the recognition result.
After obtaining a text sequence image, the above text recognition device 800 performs character recognition separately according to each character type, yielding a text sequence for each character type. When recognition is performed in the manner corresponding to a given character type, characters in the text sequence image that do not belong to that character type are recognized as a general character denoting "not of this character type". One text sequence is then selected, the positions of the general characters in the selected text sequence are determined, and the characters at those positions are corrected with the characters found at the same positions in the remaining text sequences, giving the recognition result. Recognizing by character type in this way keeps the recognition accuracy high for the characters of each character type and, when the text content is complex and varied, also covers all the character types contained in the text sequence image. The characters that each type-specific text sequence recognizes as belonging to its own character type are then used to correct the characters at the corresponding positions in the other text sequences, which yields the recognition result and improves text recognition accuracy.
In one embodiment, the first obtaining module 801 is further configured to obtain an image to be recognized; binarize the image to be recognized to obtain a text image; extract a text texture image from the text image; determine the connected domains in the text texture image; and determine the text sequence image according to the connected domains.
In this embodiment, the text texture image is extracted step by step from the image to be recognized, and the text sequence image is then determined from the connected domains in the text texture image. This avoids including too much background area when determining the text sequence image, so that subsequent character recognition is more accurate.
In one embodiment, the recognition module 802 is further configured to recognize, in the manner corresponding to each character type, the characters in the text sequence image that belong to that character type, and to recognize the characters that do not belong to that character type as general characters; and to combine, for each character type, the recognized characters in order to obtain the text sequence corresponding to that character type.
In this embodiment, when the text sequence image is recognized by character type, the characters that do not belong to the corresponding character type undergo fuzzy processing and are marked with a general character. The characters that need correction can therefore be located quickly during character correction, so that the correction can be completed and a more accurate recognition result obtained.
In one embodiment, the recognition module 802 is further configured to cut individual character images out of the text sequence image, and to perform character recognition on the individual character images with the machine learning model corresponding to each character type, obtaining the characters that belong to the corresponding character type and the general characters that do not.
In this embodiment, the text sequence image is cut into individual character images, and character recognition is then performed on the individual character images with machine learning models, so that character recognition of the text sequence image can be completed conveniently and quickly.
In one embodiment, the recognition module 802 is further configured to choose, in the text sequence image, candidate cut points along the long side of the text sequence image at a spacing shorter than the short side of the text sequence image; obtain the cutting confidence of each candidate cut point; determine the cut points according to the cutting confidence; and cut the individual character images out of the text sequence image at the determined cut points.
In the present embodiment, can be by densely selecting candidate cut-off in text sequence image, and utilize each time It selects the cutting confidence level of cut-off to carry out cutting text sequence image and obtains individual character image, the standard to text sequence image may be implemented Definite point, to improve follow-up text recognition accuracy.
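A minimal sketch of dense candidate selection and confidence-based cutting is given below. The embodiment does not say how the segmentation confidence is computed, so the blank-column score used here is an assumption, as are the `ratio` and `threshold` parameters and all function names.

```python
def candidate_cut_points(width, height, ratio=0.25):
    """Candidate x positions along the long side, spaced closer than the short side."""
    step = max(1, int(height * ratio))  # ratio < 1 keeps the spacing below the short side
    return list(range(step, width, step))


def blank_column_confidence(image, x):
    """Stand-in confidence (an assumption): share of background pixels in column x."""
    column = [row[x] for row in image]
    return column.count(0) / len(column)


def select_cut_points(candidates, confidences, threshold=0.99):
    """Keep confident candidates; collapse each run of neighbours to its middle one."""
    cuts, run = [], []
    for x, conf in zip(candidates, confidences):
        if conf >= threshold:
            run.append(x)
        elif run:
            cuts.append(run[len(run) // 2])
            run = []
    if run:
        cuts.append(run[len(run) // 2])
    return cuts


def split_at(image, cuts):
    """Slice the binary image at the cut points, dropping pieces without foreground."""
    bounds = [0] + list(cuts) + [len(image[0])]
    pieces = []
    for left, right in zip(bounds, bounds[1:]):
        piece = [row[left:right] for row in image]
        if any(any(row) for row in piece):
            pieces.append(piece)
    return pieces


# Toy binary text line: 1 is foreground, 0 is background; two glyphs, 4 pixels tall.
image = [[int(c) for c in "011100111000"] for _ in range(4)]
height, width = len(image), len(image[0])
candidates = candidate_cut_points(width, height)
confidences = [blank_column_confidence(image, x) for x in candidates]
cuts = select_cut_points(candidates, confidences)
pieces = split_at(image, cuts)
```

In practice the confidence would more plausibly come from a trained classifier applied at each candidate; the projection-style score is used here only so the example runs without a model.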
In one embodiment, the correction module 806 is further configured to: when the number of obtained characters is consistent with the number of characters at the position in the selected text sequence, replace the characters at the position in the selected text sequence, one by one and in character order, with the corresponding obtained characters; and when the number of obtained characters is inconsistent with the number of characters at the position in the selected text sequence, and the number of characters at the position in the selected text sequence is more than one, replace the characters at the position in the selected text sequence, as a whole, with the obtained characters.
This embodiment specifies how character correction is performed when the number of obtained characters is, or is not, consistent with the number of characters at the position in the selected text sequence. Correcting characters in this way yields as accurate a recognition result as possible.
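The two replacement cases can be sketched as a single function. The function name, the `#` marker, and the half-open `start`/`end` span convention are assumptions made for illustration. The case where the counts differ and the span holds exactly one character is not covered by the embodiment, so the sketch leaves the sequence unchanged there.

```python
def correct_span(selected, start, end, obtained):
    """Correct selected[start:end] (a run of general characters) using `obtained`."""
    span_length = end - start
    if len(obtained) == span_length:
        # Counts agree: replace one by one, in character order.
        chars = list(selected)
        for offset, ch in enumerate(obtained):
            chars[start + offset] = ch
        return "".join(chars)
    if span_length > 1:
        # Counts differ and the span holds more than one character: replace as a whole.
        return selected[:start] + obtained + selected[end:]
    # Not specified by the embodiment; left unchanged in this sketch.
    return selected


same_count = correct_span("电话###", 2, 5, "123")
different_count = correct_span("ab##cd", 2, 4, "123")
unspecified = correct_span("a#b", 1, 2, "12")
```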
As shown in Fig. 9, in one embodiment, the text identification device 800 further includes a training module 807.
The training module 807 is configured to: obtain a character image sample set; for each character type, label the character images in the sample set that belong to that type with their corresponding characters, and label the character images in the sample set that do not belong to that type with the general character; and train the machine learning model corresponding to each character type according to the character images in the sample set and the labels added for that character type.
In this embodiment, the strong learning and representation capability of machine learning models is used to learn from large amounts of character data. Recognizing characters with the trained models works better than recognizing them with conventional methods.
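The per-type labelling of the sample set can be sketched as below. The `#` marker, the file names, and the coarse Unicode-range test in `char_type_of` are hypothetical; the embodiment does not define how a character's type is decided, and the training step itself is omitted.

```python
GENERAL = "#"  # hypothetical label for the general character


def char_type_of(ch):
    """Coarse character-type test by code point range (an illustrative assumption)."""
    if "0" <= ch <= "9":
        return "digit"
    if "a" <= ch <= "z" or "A" <= ch <= "Z":
        return "latin"
    if "\u4e00" <= ch <= "\u9fff":
        return "chinese"
    return "other"


def label_samples(samples, char_types):
    """Build one labelled copy of the sample set per character type."""
    labelled = {}
    for target in char_types:
        labelled[target] = [
            # Keep the true character for in-type samples, otherwise use the general label.
            (image, ch if char_type_of(ch) == target else GENERAL)
            for image, ch in samples
        ]
    return labelled


samples = [("a.png", "电"), ("b.png", "7"), ("c.png", "x")]
labels = label_samples(samples, ["chinese", "digit"])
```

Every model is trained on the full sample set, so each one also learns to emit the general character for glyphs outside its own type.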
In one embodiment, one or more computer-readable storage media storing computer-readable instructions are provided. When executed by one or more processors, the computer-readable instructions cause the one or more processors to perform the following steps: obtaining a text sequence image; performing character recognition on the text sequence image using the recognition method corresponding to each character type, where characters in the text sequence image that do not belong to a given character type are recognized as general characters not belonging to that type, to obtain the text sequence corresponding to each character type; selecting a text sequence from the text sequences corresponding to the character types; determining the position, in the selected text sequence, of the general characters that do not belong to the corresponding character type; obtaining, from the text sequences remaining after the selection, the characters at that position that belong to the corresponding character type; and correcting the characters at that position in the selected text sequence according to the obtained characters, to obtain the recognition result.
In one embodiment, obtaining the text sequence image includes: obtaining an image to be recognized; binarizing the image to be recognized to obtain a text image; extracting a text texture image from the text image; determining the connected domains in the text texture image; and determining the text sequence image according to the connected domains.
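Two of these steps, binarization and connected-domain detection, can be sketched in plain Python as follows. The fixed threshold, the dark-text-on-light-background assumption, and the 4-connectivity are choices made for this illustration only. The text texture extraction step and the rule for turning connected domains into a text sequence image are not specified here and are therefore omitted.

```python
from collections import deque


def binarize(gray, threshold=128):
    """Mark dark pixels as foreground (1); assumes dark text on a light background."""
    return [[1 if pixel < threshold else 0 for pixel in row] for row in gray]


def connected_boxes(binary):
    """Return (left, top, right, bottom) boxes of 4-connected foreground regions."""
    height, width = len(binary), len(binary[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if not binary[y][x] or seen[y][x]:
                continue
            # Breadth-first flood fill from an unvisited foreground pixel.
            queue = deque([(y, x)])
            seen[y][x] = True
            left, top, right, bottom = x, y, x, y
            while queue:
                cy, cx = queue.popleft()
                left, right = min(left, cx), max(right, cx)
                top, bottom = min(top, cy), max(bottom, cy)
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < height and 0 <= nx < width:
                        if binary[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((left, top, right, bottom))
    return boxes


gray = [
    [255, 0, 0, 255, 255, 0],
    [255, 0, 255, 255, 255, 0],
    [255, 255, 255, 255, 255, 255],
]
binary = binarize(gray)
boxes = connected_boxes(binary)
```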
In one embodiment, performing character recognition on the text sequence image using the recognition method corresponding to each character type, to obtain the text sequence corresponding to each character type, includes: identifying, from the text sequence image, the characters belonging to each character type using the recognition method corresponding to that type, and identifying the general characters that do not belong to that type; and combining, in order, the characters recognized for each character type to obtain the text sequence corresponding to each character type.
In one embodiment, identifying, from the text sequence image, the characters belonging to each character type using the recognition method corresponding to that type, and identifying the general characters that do not belong to that type, includes: segmenting single-character images from the text sequence image; and performing character recognition on the single-character images with the machine learning model corresponding to each character type, obtaining the characters that belong to that type and the general characters that do not.
In one embodiment, segmenting single-character images from the text sequence image includes: choosing candidate cut points in the text sequence image, along its long side, at a spacing shorter than its short side; obtaining the segmentation confidence of each candidate cut point; determining the cut points according to the segmentation confidence; and segmenting single-character images from the text sequence image at the determined cut points.
In one embodiment, the computer-readable instructions further cause the processor to perform the following steps: obtaining a character image sample set; for each character type, labeling the character images in the sample set that belong to that type with their corresponding characters, and labeling the character images in the sample set that do not belong to that type with the general character; and training the machine learning model corresponding to each character type according to the character images in the sample set and the labels added for that character type.
In one embodiment, correcting the characters at the position in the selected text sequence according to the obtained characters, to obtain the recognition result, includes: when the number of obtained characters is consistent with the number of characters at the position in the selected text sequence, replacing the characters at the position in the selected text sequence, one by one and in character order, with the corresponding obtained characters; and when the number of obtained characters is inconsistent with the number of characters at the position in the selected text sequence, and the number of characters at the position in the selected text sequence is more than one, replacing the characters at the position in the selected text sequence, as a whole, with the obtained characters.
After obtaining a text sequence image, the above storage medium performs character recognition separately for the different character types and obtains the text sequence corresponding to each character type. When recognition is performed using the character recognition method of a given character type, the characters in the text sequence image that do not belong to that type are recognized as general characters not belonging to that type. Any one text sequence is then selected, the position of the general characters in the selected text sequence is determined, and the characters at that position in the selected text sequence are corrected using the characters at that position in the text sequences remaining after the selection, which gives the recognition result. Recognizing the text sequence image by character type in this way ensures the recognition accuracy of the characters belonging to each character type when that type is being recognized. When the text content is complex and varied, text of all the character types contained in the text sequence image can still be handled: the characters of a given type in the text sequence recognized for that type are used to correct the characters at the corresponding positions in the other text sequences, which yields the recognition result and improves text recognition accuracy.
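The selection and correction flow summarized above can be sketched end to end. This simplified version assumes that all per-type text sequences have the same length and are aligned position by position, and it uses `#` as a hypothetical marker for the general character; neither assumption comes from the text.

```python
GENERAL = "#"  # hypothetical marker for the general character


def general_runs(sequence):
    """Return half-open (start, end) spans of consecutive general characters."""
    runs, start = [], None
    for index, ch in enumerate(sequence):
        if ch == GENERAL and start is None:
            start = index
        elif ch != GENERAL and start is not None:
            runs.append((start, index))
            start = None
    if start is not None:
        runs.append((start, len(sequence)))
    return runs


def merge_sequences(sequences, chosen):
    """Correct the chosen sequence using the remaining per-type sequences."""
    selected = sequences[chosen]
    others = [seq for name, seq in sequences.items() if name != chosen]
    result = list(selected)
    for start, end in general_runs(selected):
        for index in range(start, end):
            for other in others:
                # Take the first remaining sequence that recognized a real character here.
                if other[index] != GENERAL:
                    result[index] = other[index]
                    break
    return "".join(result)


sequences = {
    "chinese": "电话#####",
    "digit": "##123##",
    "latin": "#####ab",
}
merged = merge_sequences(sequences, "chinese")
```

Because every position is covered by exactly one character type in this toy input, the merged result does not depend on which text sequence is chosen first.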
In one embodiment, a computer device is provided, including a memory and a processor. The memory stores computer-readable instructions which, when executed by the processor, cause the processor to perform the following steps: obtaining a text sequence image; performing character recognition on the text sequence image using the recognition method corresponding to each character type, where characters in the text sequence image that do not belong to a given character type are recognized as general characters not belonging to that type, to obtain the text sequence corresponding to each character type; selecting a text sequence from the text sequences corresponding to the character types; determining the position, in the selected text sequence, of the general characters that do not belong to the corresponding character type; obtaining, from the text sequences remaining after the selection, the characters at that position that belong to the corresponding character type; and correcting the characters at that position in the selected text sequence according to the obtained characters, to obtain the recognition result.
In one embodiment, obtaining the text sequence image includes: obtaining an image to be recognized; binarizing the image to be recognized to obtain a text image; extracting a text texture image from the text image; determining the connected domains in the text texture image; and determining the text sequence image according to the connected domains.
In one embodiment, performing character recognition on the text sequence image using the recognition method corresponding to each character type, to obtain the text sequence corresponding to each character type, includes: identifying, from the text sequence image, the characters belonging to each character type using the recognition method corresponding to that type, and identifying the general characters that do not belong to that type; and combining, in order, the characters recognized for each character type to obtain the text sequence corresponding to each character type.
In one embodiment, identifying, from the text sequence image, the characters belonging to each character type using the recognition method corresponding to that type, and identifying the general characters that do not belong to that type, includes: segmenting single-character images from the text sequence image; and performing character recognition on the single-character images with the machine learning model corresponding to each character type, obtaining the characters that belong to that type and the general characters that do not.
In one embodiment, segmenting single-character images from the text sequence image includes: choosing candidate cut points in the text sequence image, along its long side, at a spacing shorter than its short side; obtaining the segmentation confidence of each candidate cut point; determining the cut points according to the segmentation confidence; and segmenting single-character images from the text sequence image at the determined cut points.
In one embodiment, the computer-readable instructions further cause the processor to perform the following steps: obtaining a character image sample set; for each character type, labeling the character images in the sample set that belong to that type with their corresponding characters, and labeling the character images in the sample set that do not belong to that type with the general character; and training the machine learning model corresponding to each character type according to the character images in the sample set and the labels added for that character type.
In one embodiment, correcting the characters at the position in the selected text sequence according to the obtained characters, to obtain the recognition result, includes: when the number of obtained characters is consistent with the number of characters at the position in the selected text sequence, replacing the characters at the position in the selected text sequence, one by one and in character order, with the corresponding obtained characters; and when the number of obtained characters is inconsistent with the number of characters at the position in the selected text sequence, and the number of characters at the position in the selected text sequence is more than one, replacing the characters at the position in the selected text sequence, as a whole, with the obtained characters.
After obtaining a text sequence image, the above computer device performs character recognition separately for the different character types and obtains the text sequence corresponding to each character type. When recognition is performed using the character recognition method of a given character type, the characters in the text sequence image that do not belong to that type are recognized as general characters not belonging to that type. Any one text sequence is then selected, the position of the general characters in the selected text sequence is determined, and the characters at that position in the selected text sequence are corrected using the characters at that position in the text sequences remaining after the selection, which gives the recognition result. Recognizing the text sequence image by character type in this way ensures the recognition accuracy of the characters belonging to each character type when that type is being recognized. When the text content is complex and varied, text of all the character types contained in the text sequence image can still be handled: the characters of a given type in the text sequence recognized for that type are used to correct the characters at the corresponding positions in the other text sequences, which yields the recognition result and improves text recognition accuracy.
A person of ordinary skill in the art will understand that all or part of the processes in the above method embodiments can be implemented by a computer program instructing the relevant hardware. The program can be stored in a non-volatile computer-readable storage medium, and when executed it may include the processes of the above method embodiments. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), or the like.
The technical features of the above embodiments can be combined arbitrarily. For brevity, not every possible combination of these technical features is described; however, any combination that involves no contradiction should be considered within the scope of this specification.
The above embodiments express only several implementations of the present invention. Their description is relatively specific and detailed, but it should not be construed as limiting the scope of the invention patent. It should be noted that a person of ordinary skill in the art can make various modifications and improvements without departing from the inventive concept, and these all fall within the protection scope of the present invention. The protection scope of this patent shall therefore be subject to the appended claims.

Claims (15)

1. A text recognition method, comprising:
obtaining a text sequence image;
performing character recognition on the text sequence image using the character recognition method corresponding to each character type, wherein characters in the text sequence image that do not belong to a given character type are recognized as general characters not belonging to that character type, to obtain a text sequence corresponding to each character type;
selecting a text sequence from the text sequences corresponding to the character types;
determining the position, in the selected text sequence, of the general characters that do not belong to the corresponding character type;
obtaining, from the text sequences remaining after the selection, the characters at the position that belong to the corresponding character type; and
correcting the characters at the position in the selected text sequence according to the obtained characters, to obtain a recognition result.
2. The method according to claim 1, wherein obtaining the text sequence image comprises:
obtaining an image to be recognized;
binarizing the image to be recognized to obtain a text image;
extracting a text texture image from the text image;
determining the connected domains in the text texture image; and
determining the text sequence image according to the connected domains.
3. The method according to claim 1, wherein performing character recognition on the text sequence image using the character recognition method corresponding to each character type, with characters in the text sequence image that do not belong to a given character type recognized as general characters not belonging to that character type, to obtain the text sequence corresponding to each character type, comprises:
identifying, from the text sequence image, the characters belonging to each character type using the character recognition method corresponding to that character type, and identifying, from the text sequence image, the general characters that do not belong to that character type; and
combining, in order, the characters recognized for each character type to obtain the text sequence corresponding to each character type.
4. The method according to claim 3, wherein identifying, from the text sequence image, the characters belonging to each character type using the character recognition method corresponding to that character type, and identifying, from the text sequence image, the general characters that do not belong to that character type, comprises:
segmenting single-character images from the text sequence image; and
performing character recognition on the single-character images with the machine learning model corresponding to each character type, to obtain the characters belonging to that character type and the general characters not belonging to that character type.
5. The method according to claim 4, wherein segmenting single-character images from the text sequence image comprises:
choosing candidate cut points in the text sequence image, along the long side of the text sequence image, at a spacing shorter than the short side of the text sequence image;
obtaining the segmentation confidence of each candidate cut point;
determining the cut points according to the segmentation confidence; and
segmenting single-character images from the text sequence image at the determined cut points.
6. The method according to claim 4, wherein the method further comprises:
obtaining a character image sample set;
for each character type, labeling the character images in the character image sample set that belong to that character type with their corresponding characters, and labeling the character images in the character image sample set that do not belong to that character type with the general character; and
training the machine learning model corresponding to each character type according to the character images in the character image sample set and the labels added for that character type.
7. The method according to any one of claims 1 to 6, wherein correcting the characters at the position in the selected text sequence according to the obtained characters, to obtain the recognition result, comprises:
when the number of obtained characters is consistent with the number of characters at the position in the selected text sequence,
replacing the characters at the position in the selected text sequence, one by one and in character order, with the corresponding obtained characters; and
when the number of obtained characters is inconsistent with the number of characters at the position in the selected text sequence, and the number of characters at the position in the selected text sequence is more than one,
replacing the characters at the position in the selected text sequence, as a whole, with the obtained characters.
8. A text identification device, comprising:
a first obtaining module, configured to obtain a text sequence image;
an identification module, configured to perform character recognition on the text sequence image using the character recognition method corresponding to each character type, wherein characters in the text sequence image that do not belong to a given character type are recognized as general characters not belonging to that character type, to obtain a text sequence corresponding to each character type;
a selection module, configured to select a text sequence from the text sequences corresponding to the character types;
a determining module, configured to determine the position, in the selected text sequence, of the general characters that do not belong to the corresponding character type;
a second obtaining module, configured to obtain, from the text sequences remaining after the selection, the characters at the position that belong to the corresponding character type; and
a correction module, configured to correct the characters at the position in the selected text sequence according to the obtained characters, to obtain a recognition result.
9. The device according to claim 8, wherein the first obtaining module is further configured to: obtain an image to be recognized; binarize the image to be recognized to obtain a text image; extract a text texture image from the text image; determine the connected domains in the text texture image; and determine the text sequence image according to the connected domains.
10. The device according to claim 8, wherein the identification module is further configured to: identify, from the text sequence image, the characters belonging to each character type using the character recognition method corresponding to that character type, and identify, from the text sequence image, the general characters that do not belong to that character type; and combine, in order, the characters recognized for each character type to obtain the text sequence corresponding to each character type.
11. The device according to claim 10, wherein the identification module is further configured to: segment single-character images from the text sequence image; and perform character recognition on the single-character images with the machine learning model corresponding to each character type, to obtain the characters belonging to that character type and the general characters not belonging to that character type.
12. The device according to claim 11, wherein the identification module is further configured to: choose candidate cut points in the text sequence image, along the long side of the text sequence image, at a spacing shorter than the short side of the text sequence image; obtain the segmentation confidence of each candidate cut point; determine the cut points according to the segmentation confidence; and segment single-character images from the text sequence image at the determined cut points.
13. The device according to claim 11, wherein the device further comprises:
a training module, configured to: obtain a character image sample set; for each character type, label the character images in the character image sample set that belong to that character type with their corresponding characters, and label the character images in the character image sample set that do not belong to that character type with the general character; and train the machine learning model corresponding to each character type according to the character images in the character image sample set and the labels added for that character type.
14. One or more non-volatile computer-readable storage media storing computer-executable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the method according to any one of claims 1 to 6.
15. A computer device, comprising a memory and a processor, the memory storing computer-readable instructions which, when executed by the processor, cause the processor to perform the steps of the method according to any one of claims 1 to 6.
CN201710687380.4A 2017-08-11 2017-08-11 Text recognition method, device, storage medium and computer equipment Active CN109389115B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710687380.4A CN109389115B (en) 2017-08-11 2017-08-11 Text recognition method, device, storage medium and computer equipment

Publications (2)

Publication Number Publication Date
CN109389115A true CN109389115A (en) 2019-02-26
CN109389115B CN109389115B (en) 2023-05-23

Family

ID=65413997

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710687380.4A Active CN109389115B (en) 2017-08-11 2017-08-11 Text recognition method, device, storage medium and computer equipment

Country Status (1)

Country Link
CN (1) CN109389115B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH11272799A (en) * 1998-03-20 1999-10-08 Canon Inc Method and device for character recognition processing and storage medium
CN101777124A (en) * 2010-01-29 2010-07-14 北京新岸线网络技术有限公司 Method for extracting video text message and device thereof
CN102156865A (en) * 2010-12-14 2011-08-17 上海合合信息科技发展有限公司 Handwritten text line character segmentation method and identification method
CN102332096A (en) * 2011-10-17 2012-01-25 中国科学院自动化研究所 Video caption text extraction and identification method
WO2013097072A1 (en) * 2011-12-26 2013-07-04 华为技术有限公司 Method and apparatus for recognizing a character of a video
WO2014131339A1 (en) * 2013-02-26 2014-09-04 山东新北洋信息技术股份有限公司 Character identification method and character identification apparatus
CN104268603A (en) * 2014-09-16 2015-01-07 科大讯飞股份有限公司 Intelligent marking method and system for text objective questions
CN106056114A (en) * 2016-05-24 2016-10-26 腾讯科技(深圳)有限公司 Business card content identification method and business card content identification device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
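Fu Qiang; Ding Xiaoqing; Jiang Yan: "Segmentation and recognition of Chinese handwritten address character strings based on multi-information fusion" (基于多信息融合的中文手写地址字符串切分与识别) *
Yang Wuyi; Zhang Shuwu: "An integrated segmentation and recognition algorithm for characters in video" (一种视频中字符的集成型切分与识别算法) *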

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110210469A (en) * 2019-05-31 2019-09-06 中科软科技股份有限公司 A kind of method and system identifying picture character languages
CN110674876A (en) * 2019-09-25 2020-01-10 北京猎户星空科技有限公司 Character detection method and device, electronic equipment and computer readable medium
CN110969161A (en) * 2019-12-02 2020-04-07 上海肇观电子科技有限公司 Image processing method, circuit, visual impairment assisting apparatus, electronic apparatus, and medium
CN110969161B (en) * 2019-12-02 2023-11-07 上海肇观电子科技有限公司 Image processing method, circuit, vision-impaired assisting device, electronic device, and medium
CN111339910A (en) * 2020-02-24 2020-06-26 支付宝实验室(新加坡)有限公司 Text processing method and device and text classification model training method and device
CN111339910B (en) * 2020-02-24 2023-11-28 支付宝实验室(新加坡)有限公司 Text processing and text classification model training method and device
CN111797922A (en) * 2020-07-03 2020-10-20 泰康保险集团股份有限公司 Text image classification method and device
CN111797922B (en) * 2020-07-03 2023-11-28 泰康保险集团股份有限公司 Text image classification method and device

Also Published As

Publication number Publication date
CN109389115B (en) 2023-05-23

Similar Documents

Publication Publication Date Title
Neumann et al. Efficient scene text localization and recognition with local character refinement
CN107133622B (en) Word segmentation method and device
CN106056114B (en) Business card content identification method and device
CN109389115A (en) Text recognition method, device, storage medium and computer equipment
US8744196B2 (en) Automatic recognition of images
Pan et al. A robust system to detect and localize texts in natural scene images
CN110647829A (en) Bill text recognition method and system
CN104217203B (en) Complex background card face information identifying method and system
CN108717543B (en) Invoice identification method and device and computer storage medium
JP5176763B2 (en) Low quality character identification method and apparatus
CN106203539B (en) Method and device for identifying container number
Vanetti et al. Gas meter reading from real world images using a multi-net system
Ye et al. Scene text detection via integrated discrimination of component appearance and consensus
RU2581786C1 (en) Determination of image transformations to increase quality of optical character recognition
Shivakumara et al. New gradient-spatial-structural features for video script identification
CN109447080B (en) Character recognition method and device
Salvi et al. Handwritten text segmentation using average longest path algorithm
CN113158895A (en) Bill identification method and device, electronic equipment and storage medium
JPWO2015146113A1 (en) Identification dictionary learning system, identification dictionary learning method, and identification dictionary learning program
Ramirez et al. Automatic recognition of square notation symbols in western plainchant manuscripts
Li et al. Leveraging surrounding context for scene text detection
Chen et al. Salient object detection: Integrate salient features in the deep learning framework
CN113780116A (en) Invoice classification method and device, computer equipment and storage medium
Vidhyalakshmi et al. Text detection in natural images with hybrid stroke feature transform and high performance deep Convnet computing
US9092688B2 (en) Assisted OCR

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant