Summary of the Invention
Embodiments of the present invention provide a recognition verification method and device for character strings, which correct a recognition result by using domain terms and can effectively improve recognition accuracy.
A first aspect of the embodiments of the present invention provides a recognition verification method for a character string, including:
creating a text library of domain terms, where each domain term in the text library has a corresponding index;
searching the text library, based on a preset adjacent-character similarity algorithm, for the domain term corresponding to a character string to be corrected.
Optionally, the method further includes: establishing the index of each domain term according to the pinyin of its Chinese characters; or establishing the index of each domain term according to the pinyin and the positions of its Chinese characters.
Optionally, the method further includes: setting a word-frequency probability for each domain term.
Optionally, searching the text library, based on the preset adjacent-character similarity algorithm, for the domain term corresponding to the character string to be corrected includes:
searching for the domain term corresponding to the character string to be corrected by the following algorithm:
decomposing the character string to be corrected into a set of 2-tuples of adjacent characters;
retrieving each 2-tuple of the set in the text library, to obtain a retrieval set corresponding to each 2-tuple;
calculating the similarity of each domain term in each retrieval set;
determining, across the 2-tuples, the domain term with the highest similarity, and determining that domain term as the domain term corresponding to the character string to be corrected.
Optionally, after calculating the similarity of each domain term in each retrieval set, the method further includes: if the similarities of the domain terms are equal, determining the domain term with the highest word-frequency probability among them as the domain term corresponding to the character string to be corrected.
A second aspect of the embodiments of the present invention provides a recognition verification device for a character string, including:
a creation module, configured to create a text library of domain terms, where each domain term in the text library has a corresponding index;
a searching module, configured to search the text library, based on a preset adjacent-character similarity algorithm, for the domain term corresponding to a character string to be corrected.
Optionally, the device further includes:
an establishing module, configured to establish the index of each domain term according to the pinyin of its Chinese characters; or to establish the index of each domain term according to the pinyin and the positions of its Chinese characters.
Optionally, the device further includes:
a setting module, configured to set a word-frequency probability for each domain term.
Optionally, the searching module is specifically configured to search for the domain term corresponding to the character string to be corrected by the following algorithm:
decomposing the character string to be corrected into a set of 2-tuples of adjacent characters;
retrieving each 2-tuple of the set in the text library, to obtain a retrieval set corresponding to each 2-tuple;
calculating the similarity of each domain term in each retrieval set;
determining, across the 2-tuples, the domain term with the highest similarity, and determining that domain term as the domain term corresponding to the character string to be corrected.
Optionally, the device further includes:
a determining module, configured to, if the similarities of the domain terms are equal, determine the domain term with the highest word-frequency probability among them as the domain term corresponding to the character string to be corrected.
A third aspect of the embodiments of the present invention provides an electronic device, including at least one processor;
and a memory communicatively connected to the at least one processor;
where the memory stores an instruction program executable by the at least one processor, and the instruction program is executed by the at least one processor so that the at least one processor can perform the method described above.
A fourth aspect of the embodiments of the present invention provides a computer program product for use in a recognition verification device for character strings, the computer program product including any of the function modules described above.
As can be seen from the above technical solutions, the embodiments of the present invention have the following advantages: a text library of domain terms is created, in which each domain term has a corresponding index; and the text library is searched, based on a preset adjacent-character similarity algorithm, for the domain term corresponding to the character string to be corrected. The recognition result can thus be corrected by using domain terms, which can effectively improve recognition accuracy.
The terms "first" and "second" in the description and claims of this specification and in the above drawings are used to distinguish similar objects and are not used to describe a particular order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments described herein can be implemented in an order other than that illustrated or described herein. In addition, the terms "include" and "have" and any variations of them are intended to cover a non-exclusive inclusion. For example, a process, method, system, product or device that includes a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product or device.
Due to the limitations of OCR technology, photographs taken with a mobile phone suffer from shadows and blurring, so the recognition rate is not high. The embodiments of the present invention therefore consider using a pre-established text library of domain terms: the characters of the recognition result are compared with each domain term in the text library (the specific scenario addressed is text recognition on photographs of hospital laboratory test documents), and the domain term closest to the recognition result is used to correct the misrecognized Chinese characters, so as to obtain the correct recognition result. The premise, of course, is that only one or two characters in a domain term are misrecognized; the probability that most of the characters in a domain term are misrecognized is very small. For example, let Ps be the probability that the recognition result of a Chinese phrase is entirely wrong, and let Pi be the recognition error rate of the i-th Chinese character in the phrase. Given how the recognition algorithm works, the error rates of the individual Chinese characters are independent and do not affect one another, so the calculation formula is as follows:
Ps = P1 × P2 × ... × Pi × ... × Pn, where n is the number of Chinese characters in the phrase.
Assume that Pi is 0.4 (that is, the recognition error rate of a single Chinese character is 40%; because the input is a mobile phone picture, the error rate is much higher than that of printed Chinese characters) and that the phrase consists of 5 Chinese characters. The probability that the phrase is entirely wrong is then Ps = 0.4 × 0.4 × 0.4 × 0.4 × 0.4 = 0.01024.
As can be seen, such an outcome is very unlikely. Usually only one or two characters are wrong, and these are easily corrected by comparison with the domain terms in a medical dictionary.
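The arithmetic above can be checked with a short sketch. It assumes, as the text does, that per-character errors are independent; the function name `all_wrong_probability` is hypothetical and used only for illustration.

```python
from math import prod


def all_wrong_probability(char_error_rates):
    """Return Ps = P1 * P2 * ... * Pn, the probability that every
    character of the phrase is misrecognized (independent errors assumed)."""
    return prod(char_error_rates)


# Five characters, each with a 40% error rate, as in the example above.
ps = all_wrong_probability([0.4] * 5)
print(round(ps, 5))  # 0.01024
```

Under this independence assumption the all-wrong probability shrinks quickly with the phrase length, which is why long medical terms usually retain enough correct characters to be matched.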
In fact, the non-Chinese characters in medical terms are very difficult to recognize, because Greek letters are sometimes very similar to English letters, for example a and α, B and β, E and ε, y and γ. Resolving these cases by improving the accuracy of the recognition algorithm is practically impossible, so correction is instead performed by matching against the domain terms in a medical dictionary.
The recognition verification method for character strings of the present invention is described below through a specific embodiment. Referring to Fig. 1, an embodiment of the present invention provides a recognition verification method for a character string, including the following steps:
S10: create a text library of domain terms, where each domain term in the text library has a corresponding index.
In this embodiment, considering that existing OCR technology still recognizes mobile phone photographs poorly, a text library of domain terms is established for a specific domain. For example, in text recognition of hospital examination documents in the medical domain, correcting the recognition result with domain terms can effectively improve recognition accuracy.
Taking the medical domain as an example, some medical terms are awkward to read. Some are transliterated from English, for example penicillin; others are formed by combining different terms according to a molecular formula, for example serum | asparagine acyl | transferase | assay. Medical terms are sometimes very long, and if every Chinese character were compared one by one, retrieval would take considerable time. Setting up a set of index rules is therefore very worthwhile; that is, each domain term in the text library has a corresponding index.
The index of each domain term can be established according to the pinyin of its Chinese characters. For example, for the term penicillin from the example above (盘尼西林, pinyin pán ní xī lín), the following index can be established: PNXL. Retrieval by Chinese character thus becomes retrieval by English letters, which can greatly speed up retrieval.
Meanwhile, to retain the position information of the Chinese characters, the index of each domain term can also be established from the pinyin initial of each Chinese character together with a position code. For example, for the term penicillin from the example above, the following index can be established: P1N2X3L4.
Medical terms often mix Chinese characters with English letters, digits, symbols, Greek letters and the like. Such symbols can be followed directly by their position information, without being converted to a pinyin initial. For example, for serum γ-glutamyl transferase assay, the retrieval index can be X1Q2γ3-4G5A6X7J8Z9Y10M11.
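The index rules described above can be sketched as follows. This is only an illustration under stated assumptions: the table `PINYIN_INITIAL` and the function `build_index` are hypothetical, the table covers only the characters needed here, and the Chinese spellings 盘尼西林 and 血清γ-谷氨酰基转移酶 are inferred from the indexes given in the text rather than quoted from it.

```python
# Hypothetical lookup table from Chinese character to pinyin initial.
# A real system would need a complete table or a pinyin library.
PINYIN_INITIAL = {
    "盘": "P", "尼": "N", "西": "X", "林": "L",
    "血": "X", "清": "Q", "谷": "G", "氨": "A",
    "酰": "X", "基": "J", "转": "Z", "移": "Y", "酶": "M",
}


def build_index(term, with_position=True):
    """Build a retrieval index from pinyin initials, optionally followed
    by the 1-based position of each character."""
    parts = []
    for pos, ch in enumerate(term, start=1):
        # Letters, digits, symbols and Greek letters are kept unchanged.
        key = PINYIN_INITIAL.get(ch, ch)
        parts.append(f"{key}{pos}" if with_position else key)
    return "".join(parts)


print(build_index("盘尼西林", with_position=False))  # PNXL
print(build_index("盘尼西林"))                       # P1N2X3L4
print(build_index("血清γ-谷氨酰基转移酶"))            # X1Q2γ3-4G5A6X7J8Z9Y10M11
```

Keeping the position code next to each initial lets a later lookup match characters position by position, at the cost of a longer index string.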
Of course, many terms contain different Chinese characters but may share the same pinyin initials, so an index lookup may return a set of results. If a unique retrieval index is desired, the four-corner method may be introduced. The four-corner method is one of the character lookup methods commonly used in Chinese dictionaries; it classifies Chinese characters using at most 5 Arabic digits.
If the domain addressed is the medical domain, the number of terms is not large, so even if the results of a pinyin index lookup are not unique, this matters little; the retrieval speed is still greatly improved.
In addition, a word-frequency probability can be set for each domain term. If the similarities calculated from adjacent characters are close, the word-frequency probabilities of the terms can be consulted: the higher the word-frequency probability, the more likely the result is that term.
S20: search the text library, based on a preset adjacent-character similarity algorithm, for the domain term corresponding to the character string to be corrected.
The preset adjacent-character similarity algorithm given in this embodiment draws on the n-gram algorithm. The n-gram is a language model commonly used in large-vocabulary continuous speech recognition; for Chinese, it is called the Chinese Language Model (CLM). The Chinese language model uses the collocation information between adjacent words in context. When continuous pinyin without spaces, strokes, or letters or digits representing strokes need to be converted into a Chinese character string (that is, a sentence), the model can compute the sentence with the maximum probability. Automatic conversion to Chinese characters is thereby achieved without manual selection by the user, which avoids the ambiguity that arises when many Chinese characters correspond to the same pinyin (or stroke string or digit string).
The language model is based on the assumption that the occurrence of the n-th word depends only on the preceding n-1 words and is independent of any other word; the probability of the whole sentence is the product of the occurrence probabilities of its words.
For example, for a sequence (or sentence) composed of $m$ words, the probability is $P(w_1, w_2, \ldots, w_m)$. By the chain rule:
$P(w_1, w_2, \ldots, w_m) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1, w_2) \cdots P(w_m \mid w_1, \ldots, w_{m-1})$;
This probability is clearly hard to compute. The Markov chain assumption can be used instead: the current word depends only on a limited number of preceding words, so there is no need to trace back to the very first word, which greatly shortens the formula above. That is:
$P(w_i \mid w_1, \ldots, w_{i-1}) = P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})$;
In particular, for small values of $n$:
when $n = 1$, the unigram model is $P(w_1, w_2, \ldots, w_m) = \prod_{i=1}^{m} P(w_i)$;
when $n = 2$, the bigram model is $P(w_1, w_2, \ldots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_{i-1})$;
when $n = 3$, the trigram model is $P(w_1, w_2, \ldots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_{i-2}, w_{i-1})$.
Next, maximum likelihood estimation can be used to find a set of parameters that maximizes the probability of the training samples. Here $C(w_1, \ldots, w_n)$ denotes the number of times the n-gram $w_1, \ldots, w_n$ occurs in the training corpus, and $M$ is the total number of words in the corpus (for example, for "yes no no no yes", $M = 5$).
For the unigram model,
$P(w_i) = C(w_i) / M$;
for the bigram model,
$P(w_i \mid w_{i-1}) = C(w_{i-1} w_i) / C(w_{i-1})$;
for the n-gram model,
$P(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) = C(w_{i-n+1}, \ldots, w_i) / C(w_{i-n+1}, \ldots, w_{i-1})$.
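As a minimal sketch of the maximum likelihood estimates above, the unigram and bigram probabilities can be computed by counting over a token list. The function names are hypothetical, and the toy corpus is the "yes no no no yes" example from the text.

```python
from collections import Counter


def train_bigram(tokens):
    """Return two functions giving the maximum likelihood estimates
    P(w) = C(w) / M and P(w | prev) = C(prev w) / C(prev)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)  # M, the total number of words in the corpus

    def p_unigram(w):
        return unigrams[w] / total

    def p_bigram(w, prev):
        # An unseen context has no estimate; 0.0 is returned by convention.
        if unigrams[prev] == 0:
            return 0.0
        return bigrams[(prev, w)] / unigrams[prev]

    return p_unigram, p_bigram


tokens = "yes no no no yes".split()
p_uni, p_bi = train_bigram(tokens)
print(p_uni("no"))        # 0.6, since C(no) = 3 and M = 5
print(p_bi("no", "yes"))  # 0.5, since C(yes no) = 1 and C(yes) = 2
```

No smoothing is applied here, so any pair that does not occur in the corpus gets probability zero; practical language models usually add smoothing, which the text does not discuss.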
N-gram techniques are widely used for word segmentation, semantic analysis, text compression, spell checking, accelerating string search, and document language identification. The application scenario here is somewhat different, so the algorithm herein borrows the similarity calculation rule of the n-gram and derives its own calculation method:
First, the algorithm is based on the following scenario: recognition targets the core content of a hospital laboratory test sheet, where the words in the first several columns are widely spaced, so word segmentation has already been performed during recognition. During correction, the string length is therefore accurate, but whether each Chinese character is correct is still undetermined.
Second, a bigram model is adopted, with the following definitions:
Definition 1: adjacent-character 2-tuple (nb): a 2-tuple formed from 2 adjacent characters of the recognition result (the position needs to be recorded for retrieval). For example, the adjacent-character 2-tuples of uric acid detection (尿酸检测) are the three 2-tuples 尿酸, 酸检 and 检测; together they form the set NB. The number of elements of NB is $|NB| = n - 1$, where $n$ is the length of the recognition result character string.
Definition 2: 2-tuple retrieval result: the result of retrieving the recognition result in the term library, one pair of adjacent characters at a time. The result may be a set (terms whose length does not match are rejected during retrieval), denoted Rnb (nb ∈ NB), where each element (that is, a term) is r.
Definition 3: the overall retrieval result Unb is obtained by merging all of the Rnb sets:
Unb = Rnb1 + ... + Rnbi + ... + Rnbm (where m = |NB|).
Definition 4: when the sets are merged, the number of times each element is repeated is recorded; this count is the similarity Sr.
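Definitions 1 to 4 can be sketched as follows. This is one possible reading rather than the definitive implementation: it assumes that a term is retrieved for a 2-tuple only when the term has the same length as the recognition result and contains the 2-tuple at the same position. The function names, the small term library and the misrecognized input are all hypothetical.

```python
from collections import Counter


def adjacent_bigrams(s):
    """Definition 1: (position, 2-tuple) pairs; a string of length n gives n - 1."""
    return [(i, s[i:i + 2]) for i in range(len(s) - 1)]


def retrieve(bigram, pos, terms, length):
    """Definition 2: R_nb, the terms of matching length that contain
    the 2-tuple at the recorded position."""
    return [t for t in terms if len(t) == length and t[pos:pos + 2] == bigram]


def similarity_counts(recognized, terms):
    """Definitions 3 and 4: merge every R_nb and count how often each
    term is repeated; the count is its similarity S_r."""
    counts = Counter()
    for pos, bigram in adjacent_bigrams(recognized):
        counts.update(retrieve(bigram, pos, terms, len(recognized)))
    return counts


terms = ["尿酸检测", "尿素检测", "血糖检测"]  # hypothetical term library
counts = similarity_counts("原酸检测", terms)  # first character misrecognized
print(counts["尿酸检测"])  # 2
print(counts["尿素检测"])  # 1
```

With one wrong character, the correct term still shares two of the three 2-tuples with the recognition result, so it receives the highest count.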
The adjacent-character similarity algorithm is as follows:
decompose the character string to be corrected into a set of 2-tuples of adjacent characters;
retrieve each 2-tuple of the set in the text library, to obtain a retrieval set corresponding to each 2-tuple;
calculate the similarity of each domain term in each retrieval set;
determine, across the 2-tuples, the domain term with the highest similarity, and determine that domain term as the domain term corresponding to the character string to be corrected.
In addition, if the similarities of the domain terms are equal, the domain term with the highest word-frequency probability among them is determined as the domain term corresponding to the character string to be corrected.
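The final selection step, including the word-frequency tie-break, might look like the sketch below. The function `pick_term`, the similarity values and the word-frequency probabilities are hypothetical and are given only to illustrate the rule.

```python
def pick_term(similarity, word_freq):
    """Return the term with the highest similarity; when similarities are
    equal, prefer the term with the higher word-frequency probability."""
    if not similarity:
        return None  # nothing was retrieved, so no correction is proposed
    return max(similarity, key=lambda t: (similarity[t], word_freq.get(t, 0.0)))


# Three candidates with equal similarity, so word frequency decides.
similarity = {"尿酸检测": 1, "尿素检测": 1, "血糖检测": 1}
word_freq = {"尿酸检测": 0.2, "尿素检测": 0.1, "血糖检测": 0.5}  # hypothetical values
print(pick_term(similarity, word_freq))  # 血糖检测
```

Comparing on the pair (similarity, word frequency) means the word-frequency probability only matters when the similarities are tied, which matches the rule stated above.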
In this embodiment, a text library of domain terms is created, in which each domain term has a corresponding index; and the text library is searched, based on a preset adjacent-character similarity algorithm, for the domain term corresponding to the character string to be corrected. The recognition result can thus be corrected by using domain terms, which can effectively improve recognition accuracy.
An embodiment of the present invention further provides a recognition verification device for a character string. As shown in Fig. 2, the device includes:
a creation module 10, configured to create a text library of domain terms, where each domain term in the text library has a corresponding index;
a searching module 20, configured to search the text library, based on a preset adjacent-character similarity algorithm, for the domain term corresponding to a character string to be corrected.
Further, the device may also include: an establishing module, configured to establish the index of each domain term according to the pinyin of its Chinese characters; or to establish the index of each domain term according to the pinyin and the positions of its Chinese characters.
Further, the device may also include: a setting module, configured to set a word-frequency probability for each domain term.
Further, the searching module 20 is specifically configured to search for the domain term corresponding to the character string to be corrected by the following algorithm:
decomposing the character string to be corrected into a set of 2-tuples of adjacent characters;
retrieving each 2-tuple of the set in the text library, to obtain a retrieval set corresponding to each 2-tuple;
calculating the similarity of each domain term in each retrieval set;
determining, across the 2-tuples, the domain term with the highest similarity, and determining that domain term as the domain term corresponding to the character string to be corrected.
Further, the device also includes:
a determining module, configured to, if the similarities of the domain terms are equal, determine the domain term with the highest word-frequency probability among them as the domain term corresponding to the character string to be corrected.
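The index rules, the word-frequency probability and the adjacent-character similarity algorithm used by the modules are the same as those described for the method embodiment above and are not repeated here.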
In this embodiment, a text library of domain terms is created, in which each domain term has a corresponding index; and the text library is searched, based on a preset adjacent-character similarity algorithm, for the domain term corresponding to the character string to be corrected. The recognition result can thus be corrected by using domain terms, which can effectively improve recognition accuracy.
Fig. 3 is a schematic diagram of the hardware structure of an electronic device provided by an embodiment of the present application. The device includes one or more processors 301 and a memory 302; one processor 301 is taken as an example in Fig. 3. The processor 301 and the memory 302 can be connected by a bus or by other means; connection by a bus is taken as an example in Fig. 3.
As a non-volatile computer-readable storage medium, the memory 302 can be used to store non-volatile software programs, non-volatile computer-executable programs and modules, such as the program instructions/modules corresponding to the recognition verification device for character strings in the embodiments of the present invention. By running the non-volatile software programs, instructions and modules stored in the memory 302, the processor 301 executes the various functional applications and data processing of the server, that is, implements the recognition verification of character strings described in the above embodiments.
The memory 302 may include a program storage area and a data storage area. The program storage area can store an operating system and the application programs required by at least one function; the data storage area can store data created through use of the recognition verification device for character strings, and the like. In addition, the memory 302 may include high-speed random access memory and may also include non-volatile memory, for example at least one magnetic disk storage device, flash memory device or other non-volatile solid-state storage device. In some embodiments, the memory 302 optionally includes memory located remotely from the processor 301, and such remote memory can be connected to the recognition verification device for character strings through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks and combinations thereof.
The above electronic device can execute the device or method provided by the embodiments of the present application and has the function modules and beneficial effects corresponding to that device or method. For technical details not described in detail in this embodiment, refer to the device or method provided by the embodiments of the present application.
Furthermore, the system embodiments described above are merely illustrative. Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
From the above description of the embodiments, those of ordinary skill in the art can clearly understand that each embodiment can be implemented by software plus a general hardware platform, and of course also by hardware. Based on this understanding, the technical solution of the present invention, in essence, or the part of it that contributes over the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to execute all or some of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk or an optical disc.
The above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments or make equivalent replacements of some of their technical features, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.