CN108564086A - Character string recognition correction method and device - Google Patents

Character string recognition correction method and device

Info

Publication number
CN108564086A
Authority
CN
China
Prior art keywords
field term
character string
term
corrected
field
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810221541.5A
Other languages
Chinese (zh)
Inventor
祝安
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Kedu Medical Technology Co ltd
Original Assignee
Shenzhen Geek Thinking Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Geek Thinking Technology Co Ltd
Priority to CN201810221541.5A
Publication of CN108564086A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/148 Segmentation of character regions
    • G06V30/153 Segmentation of character regions using recognition of characters or words
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures

Abstract

Embodiments of the invention disclose a method and device for correcting recognized character strings, which use domain terms to correct the recognition result and can effectively improve recognition accuracy. The method includes: creating a text library of domain terms, where each domain term in the text library has a corresponding index; and searching the text library for the domain term corresponding to a character string to be corrected, based on a preset adjacent-character similarity algorithm.

Description

Character string recognition correction method and device
Technical field
The present invention relates to the field of character recognition, and in particular to a method and device for correcting recognized character strings.
Background
With the spread of the mobile Internet, Internet healthcare is becoming a popular emerging sector of medical informatization. Out of concern for information security, however, hospitals do not put their information on the Internet for interconnection and exchange. From the patient's side, if a hospital examination report can be photographed and run through text recognition to obtain electronic medical information, the data become easier to organize and collect, and electronic medical records become easier to structure.
However, owing to the limitations of OCR technology, photos taken with a mobile phone suffer from shadows and blur, so the recognition rate is low.
Summary of the invention
Embodiments of the present invention provide a method and device for correcting recognized character strings, which use domain terms to correct the recognition result and can effectively improve recognition accuracy.
A first aspect of the embodiments of the present invention provides a character string recognition correction method, including:
creating a text library of domain terms, where each domain term in the text library has a corresponding index;
searching the text library for the domain term corresponding to a character string to be corrected, based on a preset adjacent-character similarity algorithm.
Optionally, the method further includes: building the index of each domain term from the pinyin of its Chinese characters; or building the index of each domain term from the pinyin and the positions of its Chinese characters.
Optionally, the method further includes: setting a word frequency probability for each domain term.
Optionally, searching the text library for the domain term corresponding to the character string to be corrected, based on the preset adjacent-character similarity algorithm, includes:
finding the domain term corresponding to the character string to be corrected by the following algorithm:
decomposing the character string to be corrected into a set of 2-tuples of adjacent characters;
retrieving each 2-tuple of the set in the text library, obtaining the retrieval set corresponding to each 2-tuple;
calculating the similarity of each domain term in the retrieval sets;
determining the domain term with the highest similarity over the 2-tuples, and taking that domain term as the domain term corresponding to the character string to be corrected.
Optionally, after calculating the similarity of each domain term in the retrieval sets, the method further includes: if the similarities of the domain terms are equal, taking the domain term with the highest word frequency probability among them as the domain term corresponding to the character string to be corrected.
A second aspect of the embodiments of the present invention provides a character string recognition correction device, including:
a creation module, configured to create a text library of domain terms, where each domain term in the text library has a corresponding index;
a search module, configured to search the text library for the domain term corresponding to a character string to be corrected, based on a preset adjacent-character similarity algorithm.
Optionally, the device further includes:
an index building module, configured to build the index of each domain term from the pinyin of its Chinese characters; or to build the index of each domain term from the pinyin and the positions of its Chinese characters.
Optionally, the device further includes:
a setting module, configured to set a word frequency probability for each domain term.
Optionally, the search module is specifically configured to find the domain term corresponding to the character string to be corrected by the following algorithm:
decomposing the character string to be corrected into a set of 2-tuples of adjacent characters;
retrieving each 2-tuple of the set in the text library, obtaining the retrieval set corresponding to each 2-tuple;
calculating the similarity of each domain term in the retrieval sets;
determining the domain term with the highest similarity over the 2-tuples, and taking that domain term as the domain term corresponding to the character string to be corrected.
Optionally, the device further includes:
a determining module, configured to, if the similarities of the domain terms are equal, take the domain term with the highest word frequency probability among them as the domain term corresponding to the character string to be corrected.
A third aspect of the embodiments of the present invention provides an electronic device, including at least one processor;
and a memory communicatively connected to the at least one processor;
where the memory stores an instruction program executable by the at least one processor, and the instruction program is executed by the at least one processor so that the at least one processor can perform the method described above.
A fourth aspect of the embodiments of the present invention provides a computer program product for use in a character string recognition correction device, the computer program product including any of the functional modules described above.
As can be seen from the above technical solutions, the embodiments of the present invention have the following advantages: a text library of domain terms is created, where each domain term in the text library has a corresponding index; and the domain term corresponding to a character string to be corrected is searched for in the text library based on a preset adjacent-character similarity algorithm. The recognition result can thus be corrected using domain terms, which effectively improves recognition accuracy.
Brief description of the drawings
Fig. 1 is a schematic diagram of one embodiment of the character string recognition correction method in the embodiments of the present invention;
Fig. 2 is a schematic diagram of one embodiment of the character string recognition correction device in the embodiments of the present invention;
Fig. 3 is a schematic diagram of one embodiment of the electronic device in the embodiments of the present invention.
Detailed description
Embodiments of the present invention provide a method and device for correcting recognized character strings, which use domain terms to correct the recognition result and can effectively improve recognition accuracy.
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art from these embodiments without creative effort fall within the protection scope of the present invention.
The terms "first" and "second" in the description, the claims, and the drawings are used to distinguish similar objects and do not describe a specific order or sequence. Data so labeled are interchangeable where appropriate, so that the embodiments described here can be carried out in an order other than the one illustrated or described. Furthermore, the terms "include" and "have" and any variants of them are intended to cover non-exclusive inclusion: a process, method, system, product, or device that contains a series of steps or units is not necessarily limited to the steps or units explicitly listed, and may include other steps or units that are not explicitly listed or that are inherent to the process, method, product, or device.
Owing to the limitations of OCR technology, photos taken with a mobile phone suffer from shadows and blur, so the recognition rate is low. The embodiments of the present invention therefore use a pre-built text library of domain terms: the recognized text is compared with each domain term in the library (the specific scenario is text recognition on photos of hospital laboratory reports), and the domain term closest to the recognition result is used to correct the misrecognized Chinese characters, yielding the correct result. This presupposes that only one or two characters in a domain term are misrecognized; the probability that every character in a term is misrecognized is very small. For example, let Ps be the probability that an entire Chinese phrase is misrecognized and Pi the recognition error rate of the i-th Chinese character in the phrase. Because of the way the recognition algorithm works, the per-character error rates are independent of one another, so the formula is:
$P_s = P_1 \cdot P_2 \cdots P_i \cdots P_n$, where n is the number of Chinese characters in the phrase.
Suppose Pi = 0.4 for every character (a per-character error rate of 40%, since recognition errors on mobile phone photos are much more frequent than on printed text) and the phrase consists of 5 Chinese characters. The probability that the whole phrase is wrong is then Ps = 0.4 × 0.4 × 0.4 × 0.4 × 0.4 = 0.01024.
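The arithmetic above can be checked with a minimal Python sketch. The function name `all_wrong_probability` is hypothetical and is not part of the patent; the sketch assumes, as the text does, that per-character errors are independent.

```python
import math

def all_wrong_probability(per_char_error_rates):
    """Probability that every character of a phrase is misrecognized,
    assuming the per-character errors are independent."""
    return math.prod(per_char_error_rates)

# Five characters, each with a 40% error rate, as in the example above.
ps = all_wrong_probability([0.4] * 5)
print(round(ps, 5))
```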
Such an outcome is therefore very unlikely. Usually only one or two characters are wrong, and the error is easily corrected by comparison with the domain terms in a medical dictionary.
In addition, the non-Chinese characters in medical terms are in fact very hard to recognize, because Greek letters are sometimes very similar to English letters, for example a and α, B and β, E and ∈, y and γ. Resolving these by improving the accuracy of the recognition algorithm is practically impossible, so they are corrected by matching against the domain terms in the medical dictionary.
The character string recognition correction method of the present invention is described below through a specific embodiment. Referring to Fig. 1, an embodiment of the present invention provides a character string recognition correction method including the following steps:
S10: create a text library of domain terms, where each domain term in the text library has a corresponding index.
In this embodiment, because existing OCR technology still performs poorly on mobile phone photos, a text library of domain terms is built for a specific domain. For example, in text recognition of hospital examination reports in the medical domain, correcting the recognition result with domain terms can effectively improve recognition accuracy.
Taking the medical domain as an example, some medical terms are hard to pronounce. Some are transliterated from English, such as penicillin (盘尼西林); others are combined from different terms according to a molecular formula, such as serum | asparagine acyl | transferase | assay. Medical terms are sometimes very long, and comparing every Chinese character one by one would make retrieval very slow, so it is worthwhile to define a set of index rules; that is, each domain term in the text library has a corresponding index.
The index of each domain term can be built from the pinyin of its Chinese characters. For the term penicillin above, for example, the index PNXL can be built (the first letters of pan, ni, xi, lin), so that retrieval operates on Latin letters instead of Chinese characters, which greatly speeds it up.
To also retain the position information of the Chinese characters, the index of each domain term can be built from the first pinyin letter of each character together with a position code. For the term penicillin above, the index is then P1N2X3L4.
Medical terms often mix Chinese characters with English letters, digits, symbols, and Greek letters. Such symbols are followed directly by their position and are not converted to a pinyin initial. For example, for the serum γ-glutamyl transferase assay, the retrieval index can be X1Q2γ3-4G5A6X7J8Z9Y10M11.
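The index rules above can be sketched in Python as follows. This is only an illustration: the table `PINYIN_INITIAL` and the function `build_index` are hypothetical names, and the tiny hand-written table stands in for a real pinyin dictionary, which the patent does not specify.

```python
# Hypothetical, hand-written table of pinyin initials; a real system
# would need a full character-to-pinyin dictionary.
PINYIN_INITIAL = {"盘": "P", "尼": "N", "西": "X", "林": "L"}

def build_index(term, with_position=False):
    """Build a retrieval index from pinyin initials.

    Characters missing from the table (Latin or Greek letters, digits,
    symbols) are kept unchanged, as the text describes.
    """
    parts = []
    for pos, ch in enumerate(term, start=1):
        key = PINYIN_INITIAL.get(ch, ch)
        parts.append(f"{key}{pos}" if with_position else key)
    return "".join(parts)

print(build_index("盘尼西林"))                      # pinyin-only index
print(build_index("盘尼西林", with_position=True))  # pinyin plus position
```

Positions are 1-based to match the P1N2X3L4 form used in the text.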
Of course, many terms consist of different Chinese characters but share the same pinyin initials, so an index lookup may return a set of terms. If the retrieval index must be unique, the four-corner method can be introduced. The four-corner method is one of the character indexing methods commonly used in Chinese dictionaries; it classifies each Chinese character with at most 5 Arabic digits.
When the target domain is the medical domain, the number of terms is not large, so it matters little that a pinyin index lookup is not unique; retrieval is still greatly accelerated.
A word frequency probability can also be set for each domain term. If the similarities calculated from adjacent characters are close, the word frequency probabilities of the terms can be consulted: the higher the word frequency probability, the more likely that term is the correct one.
S20: search the text library for the domain term corresponding to the character string to be corrected, based on a preset adjacent-character similarity algorithm.
The preset adjacent-character similarity algorithm of this embodiment is modeled on the n-gram algorithm. The n-gram is a language model commonly used in large-vocabulary continuous speech recognition; for Chinese it is called the Chinese Language Model (CLM). Using the collocation information between adjacent words in context, a CLM can compute the sentence with the highest probability when a continuous, unspaced sequence of pinyin, strokes, or digits representing letters or strokes must be converted into a Chinese character string (a sentence). The conversion to Chinese characters is thus automatic, without manual selection by the user, which avoids the ambiguity of many Chinese characters sharing the same pinyin (or stroke string or digit string).
The model rests on the assumption that the occurrence of the n-th word depends only on the preceding n−1 words and on no other word, so that the probability of a whole sentence is the product of the occurrence probabilities of its words.
For example, for a sequence (a sentence) of m words with probability $P(w_1, w_2, \ldots, w_m)$, the chain rule gives
$P(w_1, w_2, \ldots, w_m) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1, w_2) \cdots P(w_m \mid w_1, \ldots, w_{m-1})$.
This probability is hard to compute directly. Under the Markov assumption, the current word depends only on a limited number of preceding words, so there is no need to trace back to the very first word, and the formula above becomes much shorter:
$P(w_i \mid w_1, \ldots, w_{i-1}) = P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})$.
In particular, for small values of n:
when n = 1, the unigram model is $P(w_1, w_2, \ldots, w_m) = \prod_{i=1}^{m} P(w_i)$;
when n = 2, the bigram model is $P(w_1, w_2, \ldots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_{i-1})$;
when n = 3, the trigram model is $P(w_1, w_2, \ldots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_{i-2}, w_{i-1})$.
Next, maximum likelihood estimation can be used to find the set of parameters that maximizes the probability of the training sample.
For the unigram model, where $C(w_1, \ldots, w_n)$ denotes the number of times the n-gram $w_1, \ldots, w_n$ occurs in the training corpus and M is the total number of words in the corpus (for example, M = 5 for "yes no no no yes"):
$P(w_i) = C(w_i) / M$;
for the bigram model,
$P(w_i \mid w_{i-1}) = C(w_{i-1} w_i) / C(w_{i-1})$;
for the n-gram model,
$P(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) = C(w_{i-n+1}, \ldots, w_i) / C(w_{i-n+1}, \ldots, w_{i-1})$.
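As a hedged illustration of the maximum likelihood estimates above, the following Python sketch computes unigram and bigram probabilities from raw counts on the small "yes no no no yes" corpus mentioned in the text. The function names are hypothetical, and no smoothing is applied.

```python
from collections import Counter

def unigram_mle(tokens):
    """P(w) = C(w) / M, where M is the total number of tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    return {w: c / total for w, c in counts.items()}

def bigram_mle(tokens):
    """P(w_i | w_{i-1}) = C(w_{i-1} w_i) / C(w_{i-1}), using raw counts."""
    unigram = Counter(tokens)
    bigram = Counter(zip(tokens, tokens[1:]))
    return {(a, b): c / unigram[a] for (a, b), c in bigram.items()}

corpus = "yes no no no yes".split()  # M = 5
p_uni = unigram_mle(corpus)
p_bi = bigram_mle(corpus)
print(p_uni["no"], p_bi[("no", "no")])
```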
N-gram techniques are widely used for word segmentation, semantic analysis, text compression, spell checking, faster string search, and language identification of documents. The application scenario here is somewhat different, so the algorithm here borrows the n-gram similarity calculation rules to derive its own method:
First, the scenario is as follows. Recognition targets the main body of a hospital laboratory report, where the first several columns are widely spaced, so word segmentation has already been done during recognition. During correction, the length of the character string is accurate, but whether each Chinese character is correct is undetermined.
Second, a bigram model is adopted, with the following definitions:
Definition 1: adjacent-character 2-tuple (nb): a 2-tuple formed from 2 adjacent characters of the recognition result (its position must be recorded for retrieval). For example, the four-character string 尿酸检测 ("uric acid test") yields three adjacent-character 2-tuples: 尿酸, 酸检, and 检测. Their set is NB. The number of elements of NB is $|NB| = n - 1$, where n is the length of the recognized character string.
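Definition 1 can be sketched in a few lines of Python. The function name `adjacent_bigrams` is hypothetical, and positions are 0-based here purely as an implementation choice.

```python
def adjacent_bigrams(recognized):
    """Return (position, 2-tuple) pairs for every pair of adjacent
    characters; a string of length n yields n - 1 pairs."""
    return [(i, recognized[i:i + 2]) for i in range(len(recognized) - 1)]

nb = adjacent_bigrams("尿酸检测")
print(nb)
```

Strings shorter than two characters yield an empty list, so a caller would need another strategy for single-character results.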
Definition 2: 2-tuple retrieval result: each adjacent pair of characters of the recognition result is looked up in the term library in turn. The result may be a set of terms (terms whose length differs from that of the recognition result are rejected during retrieval), denoted $R_{nb}$ ($nb \in NB$), where each element (a term) is r.
Definition 3: the overall retrieval result $U_{nb}$ is the union of the $R_{nb}$, written as a sum in which repeated elements are counted:
$U_{nb} = R_{nb_1} + \cdots + R_{nb_i} + \cdots + R_{nb_m}$, where $m = |NB|$.
Definition 4: when the union is formed, the number of times each element is repeated is recorded; this count is the similarity $S_r$ of that element.
The adjacent-character similarity algorithm is then as follows:
decompose the character string to be corrected into a set of 2-tuples of adjacent characters;
retrieve each 2-tuple of the set in the text library, obtaining the retrieval set corresponding to each 2-tuple;
calculate the similarity of each domain term in the retrieval sets;
determine the domain term with the highest similarity over the 2-tuples, and take that domain term as the domain term corresponding to the character string to be corrected.
In addition, if the similarities of the domain terms are equal, the domain term with the highest word frequency probability among them is taken as the domain term corresponding to the character string to be corrected.
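The steps above can be sketched end to end in Python. This is a minimal sketch under stated assumptions, not the patented implementation: the function `correct`, the three-term library, and its word frequency probabilities are all hypothetical, and a linear scan over the library stands in for the pinyin index lookup.

```python
from collections import Counter

def correct(recognized, term_freq):
    """Return the domain term closest to `recognized`, or None.

    `term_freq` maps each domain term to its word frequency probability.
    Similarity S_r is the number of adjacent-character 2-tuples that a
    term shares with the recognized string at the same position; terms
    of a different length are rejected. Ties go to the higher frequency.
    """
    n = len(recognized)
    similarity = Counter()
    for i in range(n - 1):
        bigram = recognized[i:i + 2]
        for term in term_freq:  # linear scan; a real system would use the index
            if len(term) == n and term[i:i + 2] == bigram:
                similarity[term] += 1
    if not similarity:
        return None
    return max(similarity, key=lambda t: (similarity[t], term_freq[t]))

# Hypothetical term library with made-up word frequency probabilities.
library = {"尿酸检测": 0.6, "尿糖检测": 0.3, "血糖检测": 0.1}
print(correct("尿酸栓测", library))  # one misrecognized character
print(correct("床糖检测", library))  # two terms tie on similarity
```

In the second call two terms share two 2-tuples with the input, so the word frequency probability decides between them.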
In this embodiment, a text library of domain terms is created, where each domain term in the text library has a corresponding index; and the domain term corresponding to the character string to be corrected is searched for in the text library based on a preset adjacent-character similarity algorithm. The recognition result can thus be corrected using domain terms, which effectively improves recognition accuracy.
An embodiment of the present invention further provides a character string recognition correction device. As shown in Fig. 2, the device includes:
a creation module 10, configured to create a text library of domain terms, where each domain term in the text library has a corresponding index;
a search module 20, configured to search the text library for the domain term corresponding to a character string to be corrected, based on a preset adjacent-character similarity algorithm.
Further, the device may also include: an index building module, configured to build the index of each domain term from the pinyin of its Chinese characters; or to build the index of each domain term from the pinyin and the positions of its Chinese characters.
Further, the device may also include: a setting module, configured to set a word frequency probability for each domain term.
Further, the search module 20 is specifically configured to find the domain term corresponding to the character string to be corrected by the following algorithm:
decomposing the character string to be corrected into a set of 2-tuples of adjacent characters;
retrieving each 2-tuple of the set in the text library, obtaining the retrieval set corresponding to each 2-tuple;
calculating the similarity of each domain term in the retrieval sets;
determining the domain term with the highest similarity over the 2-tuples, and taking that domain term as the domain term corresponding to the character string to be corrected.
Further, the device also includes:
a determining module, configured to, if the similarities of the domain terms are equal, take the domain term with the highest word frequency probability among them as the domain term corresponding to the character string to be corrected.
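The index construction, the word frequency probabilities, and the adjacent-character similarity algorithm used by the device are the same as in the method embodiment and are not described again here.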
In the present embodiment, the text library of field term is created, each field term in the text library has corresponding rope Draw;Similarity algorithm based on preset adjacent words searches the corresponding field art of character string to be corrected in the text library Language.So as to be corrected to recognition result using field term, the accuracy of identification can effectively improve.
Fig. 3 is the hardware architecture diagram of electronic equipment provided by the embodiments of the present application, which includes:It is one or more Processor 301 and memory 302.In Fig. 3 for one.Wherein, processor 301 and memory 302 can be by total Line or other modes connect, in Fig. 3 for being connected by bus.
Memory 302 is used as a kind of non-volatile computer readable storage medium storing program for executing, can be used for storing non-volatile software journey Sequence, non-volatile computer executable program and module, as the identification calibration equipment of character string in the embodiment of the present invention corresponds to Program instruction/module.Processor 301 by run storage non-volatile software program in the memory 302, instruction and Character string in above method embodiment is realized in module, the various function application to execute server and data processing Identify calibration equipment.
Memory 302 may include storing program area and storage data field, wherein storing program area can store operation system System, the required application program of at least one function;Storage data field can be stored to be made according to the identification calibration equipment of character string With the data etc. created.In addition, memory 302 may include high-speed random access memory, can also include non-volatile Memory, for example, at least a disk memory, flush memory device or other non-volatile solid state memory parts.In some realities It applies in example, it includes the memory remotely located relative to processor 301 that memory 302 is optional, these remote memories can lead to Network connection is crossed to the identification calibration equipment of character string.The example of above-mentioned network include but not limited to internet, intranet, LAN, mobile radio communication and combinations thereof.
The above electronic device can execute the device or method provided by the embodiments of the present application, and has the functional modules and beneficial effects corresponding to executing that device or method. For technical details not described in detail in this embodiment, refer to the device or method provided by the embodiments of the present application.
In addition, the system embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
From the description of the above embodiments, those of ordinary skill in the art can clearly understand that each embodiment can be implemented by software plus a general-purpose hardware platform, and of course also by hardware. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (Read-Only Memory, ROM), random access memory (Random Access Memory, RAM), a magnetic disk, or an optical disc.
The above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A character string recognition and verification method, characterized by comprising:
creating a text library of field terms, each field term in the text library having a corresponding index;
searching the text library for the field term corresponding to a character string to be corrected, based on a preset adjacent-word similarity algorithm.
2. The method of claim 1, characterized by further comprising: establishing the index of each field term according to the pinyin of its Chinese characters; or establishing the index of each field term according to the pinyin and position of its Chinese characters.
3. The method of claim 2, characterized by further comprising: setting a word frequency probability for each field term.
4. The method of claim 3, characterized in that searching the text library for the field term corresponding to the character string to be corrected, based on the preset adjacent-word similarity algorithm, comprises:
searching for the field term corresponding to the character string to be corrected by the following algorithm:
decomposing the character string to be corrected into a set of 2-tuples of adjacent words;
retrieving each 2-tuple of the set in the text library, to obtain the retrieval set corresponding to each 2-tuple;
calculating the similarity of each field term in each retrieval set;
determining, for each 2-tuple, the field term with the highest similarity, and determining the field term with the highest similarity as the field term corresponding to the character string to be corrected.
5. The method of claim 4, characterized in that, after calculating the similarity of each field term in each retrieval set, the method further comprises: if the similarities of the field terms are equal, determining the field term with the highest word frequency probability among them as the field term corresponding to the character string to be corrected.
6. A character string recognition and verification device, characterized by comprising:
a creation module, configured to create a text library of field terms, each field term in the text library having a corresponding index;
a search module, configured to search the text library for the field term corresponding to a character string to be corrected, based on a preset adjacent-word similarity algorithm.
7. The device of claim 6, characterized by further comprising:
an establishing module, configured to establish the index of each field term according to the pinyin of its Chinese characters, or to establish the index of each field term according to the pinyin and position of its Chinese characters.
8. The device of claim 7, characterized by further comprising:
a setting module, configured to set a word frequency probability for each field term.
9. The device of claim 8, characterized in that the search module is specifically configured to search for the field term corresponding to the character string to be corrected by the following algorithm:
decomposing the character string to be corrected into a set of 2-tuples of adjacent words;
retrieving each 2-tuple of the set in the text library, to obtain the retrieval set corresponding to each 2-tuple;
calculating the similarity of each field term in each retrieval set;
determining, for each 2-tuple, the field term with the highest similarity, and determining the field term with the highest similarity as the field term corresponding to the character string to be corrected.
10. The device of claim 9, characterized by further comprising:
a determining module, configured to, if the similarities of the field terms are equal, determine the field term with the highest word frequency probability among them as the field term corresponding to the character string to be corrected.
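The lookup recited in the claims (pinyin-based index, decomposition into 2-tuples, per-tuple retrieval, similarity scoring, and the word-frequency tie-break) can be sketched in Python as follows. This is only one possible reading, not the claimed implementation. The toneless `PINYIN` table is a hypothetical stand-in that covers just the demo characters. The Dice coefficient is an assumed similarity measure. The sketch also collapses the per-tuple maxima into a single maximum over all retrieved terms.

```python
from collections import defaultdict

# Hypothetical toneless pinyin table covering only the demo characters;
# a real system would need a full pronunciation dictionary.
PINYIN = {
    "血": "xue", "红": "hong", "蛋": "dan", "白": "bai", "百": "bai",
    "细": "xi", "胞": "bao", "计": "ji", "数": "shu", "小": "xiao", "板": "ban",
}


def pinyin_pairs(text):
    """Adjacent-character 2-tuples, with each character replaced by its pinyin."""
    syllables = [PINYIN.get(ch, ch) for ch in text]
    return list(zip(syllables, syllables[1:]))


def build_index(term_freq):
    """Index every field term under each of its pinyin 2-tuples."""
    index = defaultdict(set)
    for term in term_freq:
        for pair in pinyin_pairs(term):
            index[pair].add(term)
    return index


def similarity(query, term):
    """Dice coefficient over pinyin 2-tuples (an assumed measure)."""
    q, t = set(pinyin_pairs(query)), set(pinyin_pairs(term))
    return 2 * len(q & t) / (len(q) + len(t)) if q and t else 0.0


def correct(query, index, term_freq):
    """Return the best-matching field term for query, or None if none is retrieved."""
    # Decompose the query and fetch one retrieval set per 2-tuple.
    retrieval_sets = [index.get(pair, set()) for pair in pinyin_pairs(query)]
    # Score every retrieved term once.
    scores = {term: similarity(query, term) for s in retrieval_sets for term in s}
    if not scores:
        return None
    # Highest similarity wins; equal similarities fall back to word frequency.
    return max(scores, key=lambda term: (scores[term], term_freq[term], term))
```

With word frequency probabilities of 0.5 for 血红蛋白, 0.3 for 白细胞计数 and 0.2 for 血小板计数, the homophone error 血红蛋百 maps to 血红蛋白, while the ambiguous query 计数 matches the last two terms equally and resolves to 白细胞计数 because of its higher word frequency probability.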
CN201810221541.5A 2018-03-17 2018-03-17 A kind of the identification method of calibration and device of character string Pending CN108564086A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810221541.5A CN108564086A (en) 2018-03-17 2018-03-17 A kind of the identification method of calibration and device of character string

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810221541.5A CN108564086A (en) 2018-03-17 2018-03-17 A kind of the identification method of calibration and device of character string

Publications (1)

Publication Number Publication Date
CN108564086A true CN108564086A (en) 2018-09-21

Family

ID=63532966

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810221541.5A Pending CN108564086A (en) 2018-03-17 2018-03-17 A kind of the identification method of calibration and device of character string

Country Status (1)

Country Link
CN (1) CN108564086A (en)

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003331214A (en) * 2002-05-15 2003-11-21 Nippon Telegr & Teleph Corp <Ntt> Character recognition error correction method, device and program
US20050278292A1 (en) * 2004-06-11 2005-12-15 Hitachi, Ltd. Spelling variation dictionary generation system
CN102375807A (en) * 2010-08-27 2012-03-14 汉王科技股份有限公司 Method and device for proofing characters
CN102324233A (en) * 2011-08-03 2012-01-18 中国科学院计算技术研究所 Method for automatically correcting identification error of repeated words in Chinese pronunciation identification
CN103530840A (en) * 2013-10-10 2014-01-22 中国中医科学院 Accurate and quick electronic medical record type-in system
US9037967B1 (en) * 2014-02-18 2015-05-19 King Fahd University Of Petroleum And Minerals Arabic spell checking technique
CN103870575A (en) * 2014-03-19 2014-06-18 北京百度网讯科技有限公司 Method and device for extracting domain keywords
US20160019430A1 (en) * 2014-07-21 2016-01-21 Optum, Inc. Targeted optical character recognition (ocr) for medical terminology
CN105512110A (en) * 2015-12-15 2016-04-20 江苏科技大学 Wrong word knowledge base construction method based on fuzzy matching and statistics
CN105550173A (en) * 2016-02-06 2016-05-04 北京京东尚科信息技术有限公司 Text correction method and device
CN106127265A (en) * 2016-06-22 2016-11-16 北京邮电大学 A kind of text in picture identification error correction method based on activating force model
CN106383853A (en) * 2016-08-30 2017-02-08 刘勇 Realization method and system for electronic medical record post-structuring and auxiliary diagnosis
CN106528846A (en) * 2016-11-21 2017-03-22 广州华多网络科技有限公司 Retrieval method and device
CN106682397A (en) * 2016-12-09 2017-05-17 江西中科九峰智慧医疗科技有限公司 Knowledge-based electronic medical record quality control method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
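Duan Jianyong (段建勇) et al.: "Query error-correction method combining statistics and features" (基于统计和特征相结合的查询纠错方法), 《现代图书情报技术》 *
Wang Chenmin (王宸敏): "Research on laboratory test report recognition methods based on OCR technology" (基于OCR技术的化验单识别方法研究), 《中国优秀硕士学位论文全文数据库 信息科技辑》 *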

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109413504A (en) * 2018-09-30 2019-03-01 武汉斗鱼网络科技有限公司 Barrage method of calibration, device, terminal and storage medium based on character string replacement
CN109413504B (en) * 2018-09-30 2021-04-09 武汉斗鱼网络科技有限公司 Bullet screen checking method, device, terminal and storage medium based on character string replacement
CN111898612A (en) * 2020-06-30 2020-11-06 北京来也网络科技有限公司 OCR recognition method and device combining RPA and AI, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20240403

Address after: 200333, Block E, East Side, 3rd Floor, Building 3, No. 14, Lane 172, Jinshajiang Road, Putuo District, Shanghai

Applicant after: SHANGHAI KEDU MEDICAL TECHNOLOGY CO.,LTD.

Country or region after: China

Address before: 518000, B202-33, 2nd Floor, Building 3, Yu'anju Tongjian Building, Xin'an Street, Bao'an District, Shenzhen City, Guangdong Province

Applicant before: SHENZHEN JIKE SISUO TECHNOLOGY CO.,LTD.

Country or region before: China