CN108564086A

CN108564086A - Method and device for recognition verification of character strings

Info

Publication number: CN108564086A
Application number: CN201810221541.5A
Authority: CN
Inventors: 祝安
Original assignee: Shenzhen Geek Thinking Technology Co Ltd
Current assignee: Shanghai Kedu Medical Technology Co ltd
Priority date: 2018-03-17
Filing date: 2018-03-17
Publication date: 2018-09-21

Abstract

Embodiments of the invention disclose a method and device for recognition verification of character strings, which use domain terms to correct recognition results and can effectively improve recognition accuracy. The method comprises: creating a text library of domain terms, in which each domain term has a corresponding index; and searching the text library, using a preset similarity algorithm based on adjacent characters, for the domain term corresponding to the character string to be corrected.

Description

Method and device for recognition verification of character strings

Technical field

The present invention relates to the field of character recognition, and in particular to a method and device for recognition verification of character strings.

Background

With the spread of the mobile Internet, Internet-based healthcare is becoming an emerging and popular sector in the development of medical informatization. For reasons of information security, however, hospitals do not put their information on the Internet to achieve interconnection and sharing. From the patient's point of view, though, allowing someone to photograph a hospital examination document and run text recognition on it to obtain electronic medical information makes it easier to organize and collect data and to process electronic medical records into structured form.

However, owing to the limitations of OCR technology, photos taken with a mobile phone suffer from shadows and blur, so the recognition rate is not high.

Summary of the invention

Embodiments of the present invention provide a method and device for recognition verification of character strings, which use domain terms to correct recognition results and can effectively improve recognition accuracy.

A first aspect of the embodiments of the present invention provides a method for recognition verification of character strings, comprising:

creating a text library of domain terms, each domain term in the text library having a corresponding index;

searching the text library, using a preset similarity algorithm based on adjacent characters, for the domain term corresponding to the character string to be corrected.

Optionally, the method further comprises: establishing the index of each domain term according to the pinyin of its Chinese characters; or establishing the index of each domain term according to the pinyin and the positions of its Chinese characters.

Optionally, the method further comprises: setting a word frequency probability for each domain term.

Optionally, searching the text library, using the preset similarity algorithm based on adjacent characters, for the domain term corresponding to the character string to be corrected comprises:

finding the domain term corresponding to the character string to be corrected by the following algorithm:

decomposing the character string to be corrected into a set of 2-tuples of adjacent characters;

retrieving each 2-tuple of the set in the text library, to obtain the retrieval set corresponding to each 2-tuple;

calculating the similarity of each domain term in each retrieval set;

determining, for the 2-tuples, the domain term with the highest similarity, and taking the domain term with the highest similarity as the domain term corresponding to the character string to be corrected.

Optionally, after calculating the similarity of each domain term in each retrieval set, the method further comprises: if the similarities of the domain terms are equal, taking the domain term with the highest word frequency probability among them as the domain term corresponding to the character string to be corrected.

A second aspect of the embodiments of the present invention provides a device for recognition verification of character strings, comprising:

a creation module, configured to create a text library of domain terms, each domain term in the text library having a corresponding index;

a lookup module, configured to search the text library, using a preset similarity algorithm based on adjacent characters, for the domain term corresponding to the character string to be corrected.

Optionally, the device further comprises:

an establishing module, configured to establish the index of each domain term according to the pinyin of its Chinese characters; or to establish the index of each domain term according to the pinyin and the positions of its Chinese characters.

Optionally, the device further comprises:

a setting module, configured to set the word frequency probability of each domain term.

Optionally, the lookup module is specifically configured to find the domain term corresponding to the character string to be corrected by the following algorithm:

calculating the similarity of each domain term in each retrieval set;

Optionally, the device further comprises:

a determining module, configured to, if the similarities of the domain terms are equal, take the domain term with the highest word frequency probability among them as the domain term corresponding to the character string to be corrected.

A third aspect of the embodiments of the present invention provides an electronic device, comprising at least one processor;

and a memory communicatively connected to the at least one processor;

wherein the memory stores an instruction program executable by the at least one processor, and the instruction program is executed by the at least one processor so that the at least one processor can carry out the method described above.

A fourth aspect of the embodiments of the present invention provides a computer program product for use in a device for recognition verification of character strings, the computer program product comprising any of the functional modules described above.

As can be seen from the above technical solutions, the embodiments of the present invention have the following advantages: a text library of domain terms is created, each domain term in the text library having a corresponding index; and the text library is searched, using a preset similarity algorithm based on adjacent characters, for the domain term corresponding to the character string to be corrected. The recognition result can thus be corrected using domain terms, which effectively improves recognition accuracy.

Brief description of the drawings

Fig. 1 is a schematic diagram of an embodiment of the method for recognition verification of character strings in the embodiments of the present invention;

Fig. 2 is a schematic diagram of an embodiment of the device for recognition verification of character strings in the embodiments of the present invention;

Fig. 3 is a schematic diagram of an embodiment of the electronic device in the embodiments of the present invention.

Detailed description

Embodiments of the present invention provide a method and device for recognition verification of character strings, which use domain terms to correct recognition results and can effectively improve recognition accuracy.

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are obviously only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those skilled in the art on the basis of the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

The terms "first" and "second" in the description, the claims, and the above drawings of the present invention are used to distinguish similar objects and are not used to describe a specific order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments described herein can be implemented in an order other than that illustrated or described herein. In addition, the terms "comprise" and "have" and any variations of them are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that comprises a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.

Owing to the limitations of OCR technology, photos taken with a mobile phone suffer from shadows and blur, so the recognition rate is not high. The embodiments of the present invention therefore use a pre-established text library of domain terms: the characters of the recognition result are compared with each domain term in the text library (the specific scenario addressed is text recognition on photos of hospital laboratory documents), and the domain term closest to the recognition result is used to correct the misrecognized Chinese characters and obtain the correct recognition result. The premise, of course, is that only one or two characters of a domain term are misrecognized; the probability that most of the characters of a domain term are misrecognized is very small. For example, let Ps be the probability that a Chinese phrase is recognized entirely incorrectly and Pi the recognition error rate of the i-th Chinese character in the phrase. Given how the recognition algorithm works, the error rates of the individual characters are independent and do not affect one another, so the formula is as follows:

$$P_s = P_1 \cdot P_2 \cdots P_i \cdots P_n$$

where n is the number of Chinese characters in the phrase.

Assume Pi is 0.4 (that is, the recognition error rate for a single Chinese character is 40%; because the input is a mobile phone photo, the error rate is much higher than for printed Chinese characters) and the phrase consists of 5 Chinese characters. The probability that the phrase is entirely wrong is then Ps = 0.4 × 0.4 × 0.4 × 0.4 × 0.4 = 0.01024.

Such an outcome is therefore very unlikely. Usually only one or two characters are wrong, and these are easily corrected by comparison with the domain terms in a medical dictionary.
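The arithmetic above can be checked with a short sketch. It relies on the same assumption as the text, namely that per-character recognition errors are independent; the function name `all_wrong_probability` is illustrative and does not come from the original disclosure.

```python
import math

def all_wrong_probability(char_error_rates):
    """Probability that every character of a phrase is misrecognized,
    assuming the per-character errors are independent."""
    return math.prod(char_error_rates)

# Five characters, each with an assumed 40% error rate, as in the example.
ps = all_wrong_probability([0.4] * 5)
print(round(ps, 5))  # 0.01024
```

Even with a pessimistic 40% error rate per character, a five-character term is entirely wrong only about 1% of the time, which is why a dictionary match on the surviving characters is usually possible.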

Moreover, the non-Chinese characters in medical terms are in fact very hard to recognize: Greek letters are sometimes very similar to English letters, for example a and α, B and β, E and ∈, y and γ. Resolving these by improving the accuracy of the recognition algorithm is practically impossible, so they are instead corrected by matching against the domain terms in a medical dictionary.

The method for recognition verification of character strings of the present invention is described below through a specific embodiment. Referring to Fig. 1, an embodiment of the present invention proposes a method for recognition verification of character strings, comprising the following steps:

S10: creating a text library of domain terms, each domain term in the text library having a corresponding index;

In this embodiment, considering that existing OCR technology still recognizes mobile phone photos poorly, a text library of domain terms is built for a specific domain. In text recognition of hospital examination documents in the medical field, for example, correcting the recognition result with domain terms can effectively improve recognition accuracy.

Taking the medical field as an example, some medical terms are very awkward to read. Some are transliterated from English, for example penicillin; others are composed of several terms combined according to the molecular formula, for example serum | asparaginyl | transferase | assay. Medical terms are sometimes very long, and comparing every Chinese character one by one would make retrieval very time-consuming, so setting up a set of indexing rules is well worthwhile; that is, each domain term in the text library has a corresponding index.

The index of each domain term can be established according to the pinyin of its Chinese characters. For the term penicillin in the example above, the following index can be established: PNXL. Retrieval by Chinese characters thus becomes retrieval by English letters, which can greatly speed up retrieval.

To retain the position information of the Chinese characters as well, the index of each domain term can also be established from the pinyin initials of the Chinese characters together with position codes. For the term penicillin in the example above, the following index can be established: P1N2X3L4.

Medical terms often mix Chinese characters with English letters, digits, symbols, Greek letters, and so on. These symbols can be followed directly by their position information without being converted to a pinyin initial. For serum γ-glutamyl transferase assay, for example, the retrieval index can be X1Q2γ3-4G5A6X7J8Z9Y10M11.
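The indexing rules described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the table `PINYIN_INITIALS` and the function `build_index` are hypothetical names, the table covers only the four characters of 盘尼西林 (assumed here to be the Chinese transliteration of penicillin behind the index PNXL), and a real system would need a complete character-to-pinyin mapping.

```python
# Hypothetical lookup table: these four entries cover only the example term.
PINYIN_INITIALS = {"盘": "P", "尼": "N", "西": "X", "林": "L"}

def build_index(term, with_positions=False):
    """Build a pinyin-initial index, optionally with 1-based position codes."""
    parts = []
    for position, char in enumerate(term, start=1):
        # Characters missing from the table (letters, digits, symbols,
        # Greek letters) are kept unchanged.
        key = PINYIN_INITIALS.get(char, char)
        parts.append(f"{key}{position}" if with_positions else key)
    return "".join(parts)

print(build_index("盘尼西林"))                       # PNXL
print(build_index("盘尼西林", with_positions=True))  # P1N2X3L4
print(build_index("γ-", with_positions=True))        # γ1-2
```

One limitation of this sketch is that a Chinese character missing from the table also passes through unchanged, so the table must be complete before the index is reliable.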

Of course, many terms differ in their Chinese characters yet share the same pinyin initials, so an index lookup may return a set of terms. If a unique retrieval index is wanted, the four-corner method can be introduced. The four-corner method is one of the character lookup methods commonly used in Chinese dictionaries; it classifies each Chinese character with at most 5 Arabic digits.

If the target domain is the medical field, the number of medical terms is not large, so it matters little that a pinyin index lookup does not return a unique result; retrieval is still much faster.

In addition, a word frequency probability can be set for each domain term. If the similarities calculated from adjacent characters are close, the word frequency probabilities of the terms can be consulted: the higher the word frequency probability, the more likely it is that the term is the intended one.

S20: searching the text library, using a preset similarity algorithm based on adjacent characters, for the domain term corresponding to the character string to be corrected.

The preset similarity algorithm based on adjacent characters given in this embodiment draws on the n-gram algorithm. N-gram is a language model commonly used in large-vocabulary continuous speech recognition; for Chinese it is called the Chinese Language Model (CLM). The Chinese language model uses the collocation information between adjacent words in context: when continuous, unspaced pinyin, strokes, or letters or digits representing strokes need to be converted into a Chinese character string (that is, a sentence), it can compute the sentence with the maximum probability. Automatic conversion to Chinese characters is thus achieved without manual selection by the user, which avoids the problem of many Chinese characters corresponding to the same pinyin (or stroke string or digit string).

The language model is based on the assumption that the occurrence of the N-th word depends only on the preceding N-1 words and on no other word, and that the probability of a whole sentence is the product of the occurrence probabilities of its words.

Suppose there is a sequence (or sentence) made up of m words, with probability $P(w_1, w_2, \dots, w_m)$. By the chain rule,

$$P(w_1, w_2, \dots, w_m) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1, w_2) \cdots P(w_m \mid w_1, \dots, w_{m-1})$$

This probability is clearly hard to compute, so the Markov assumption is applied: the current word depends only on a limited number of preceding words, so there is no need to trace back to the very first word, which greatly shortens the formula above. That is,

$$P(w_i \mid w_1, \dots, w_{i-1}) = P(w_i \mid w_{i-n+1}, \dots, w_{i-1})$$

In particular, for small values of n:

When n = 1, the unigram model is $P(w_1, w_2, \dots, w_m) = \prod_{i=1}^{m} P(w_i)$;

When n = 2, the bigram model is $P(w_1, w_2, \dots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_{i-1})$;

When n = 3, the trigram model is $P(w_1, w_2, \dots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_{i-2} w_{i-1})$.

Next, maximum likelihood estimation can be used to find the set of parameters that maximizes the probability of the training sample. Here $C(w_1, \dots, w_n)$ denotes the number of times the n-gram $w_1, \dots, w_n$ occurs in the training corpus, and M is the total number of words in the corpus (for example, for "yes no no no yes", M = 5).

For the unigram model,

$$P(w_i) = \frac{C(w_i)}{M}$$

For the bigram model,

$$P(w_i \mid w_{i-1}) = \frac{C(w_{i-1} w_i)}{C(w_{i-1})}$$

For the n-gram model,

$$P(w_i \mid w_{i-n+1}, \dots, w_{i-1}) = \frac{C(w_{i-n+1}, \dots, w_i)}{C(w_{i-n+1}, \dots, w_{i-1})}$$
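The maximum likelihood estimates above can be illustrated on the five-word corpus mentioned in the text. This is only a sketch of the counting step; the function names `unigram_probs` and `bigram_probs` are illustrative and not part of the original disclosure.

```python
from collections import Counter

def unigram_probs(tokens):
    """P(w) = C(w) / M, where M is the total number of tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}

def bigram_probs(tokens):
    """P(w | prev) = C(prev w) / C(prev)."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    return {
        (prev, word): count / unigram_counts[prev]
        for (prev, word), count in bigram_counts.items()
    }

corpus = "yes no no no yes".split()
print(unigram_probs(corpus))                # {'yes': 0.4, 'no': 0.6}
print(bigram_probs(corpus)[("yes", "no")])  # 0.5
```

Here "yes" occurs twice among five words, giving 2/5, and the pair "yes no" occurs once against two occurrences of "yes", giving 1/2.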

N-gram techniques are widely used for word segmentation, semantic analysis, text compression, spell checking, accelerating string search, and document language identification. The application scenario here is not quite the same, so the algorithm in this document borrows the similarity calculation rules of n-gram to derive its own calculation method:

First, the method assumes the following scenario: what is recognized is mainly hospital laboratory test sheets, in which the first several columns are widely spaced, so word segmentation has already been done during recognition. During correction, the string length is therefore accurate, but whether each Chinese character is correct is still undetermined.

Second, a bigram model is adopted, with the following definitions:

Definition 1: adjacent-character 2-tuple (nb): a 2-tuple formed from 2 adjacent characters of the recognition result (its position must be recorded for retrieval). For example, the adjacent-character 2-tuples of "uric acid detection" (尿酸检测) are the three 2-tuples "uric acid" (尿酸), "acid inspection" (酸检), and "detection" (检测); their collection is NB. The number of elements of NB is $|NB| = n - 1$, where n is the length of the recognition result character string.
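Definition 1 can be sketched in a few lines. The function name `adjacent_pairs` is illustrative, and the input 尿酸检测 is assumed to be the Chinese term behind the "uric acid detection" example; positions are 0-based here, which is a choice of this sketch rather than something the source specifies.

```python
def adjacent_pairs(text):
    """Return (position, 2-character tuple) for each pair of adjacent characters."""
    return [(i, text[i:i + 2]) for i in range(len(text) - 1)]

pairs = adjacent_pairs("尿酸检测")
print(pairs)       # [(0, '尿酸'), (1, '酸检'), (2, '检测')]
print(len(pairs))  # 3, that is, n - 1 for a string of length n = 4
```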

Definition 2: 2-tuple retrieval result: the result of retrieving the recognition result in the term library, one pair of adjacent characters at a time. The result may be a set (terms whose length does not match are rejected during retrieval), denoted $R_{nb}$ ($nb \in NB$), in which each element (that is, a term) is r.

Definition 3: the overall retrieval result $U_{nb}$ is the union of all $R_{nb}$:

$$U_{nb} = R_{nb_1} + \dots + R_{nb_i} + \dots + R_{nb_m}, \quad m = |NB|$$

Definition 4: when the union is formed, the number of times each element is repeated must be recorded; this count is the similarity $S_r$.

The similarity algorithm based on adjacent characters is as follows:

calculating the similarity of each domain term in each retrieval set;

In addition, if the similarities of the domain terms are equal, the domain term with the highest word frequency probability among them is taken as the domain term corresponding to the character string to be corrected.
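Putting the definitions together, the correction step might look like the sketch below. It is an illustration under stated assumptions, not the patented implementation: the names `adjacent_pairs` and `correct` are hypothetical, the terms and word frequency probabilities are invented, a 2-tuple is assumed to match only at the same position in a term of the same length, and the library is scanned linearly rather than through the pinyin index.

```python
from collections import Counter

def adjacent_pairs(text):
    """(position, 2-character tuple) for every pair of adjacent characters."""
    return [(i, text[i:i + 2]) for i in range(len(text) - 1)]

def correct(recognized, term_freq):
    """Return the library term closest to `recognized`, or None if nothing matches.

    term_freq maps each domain term to its word frequency probability.
    """
    similarity = Counter()  # S_r: number of 2-tuple retrievals that returned term r
    for position, pair in adjacent_pairs(recognized):
        for term in term_freq:
            # Reject terms of a different length and require the pair to sit
            # at the same position (an assumption of this sketch).
            if len(term) == len(recognized) and term[position:position + 2] == pair:
                similarity[term] += 1
    if not similarity:
        return None
    # Highest similarity wins; ties go to the higher word frequency probability.
    return max(similarity, key=lambda t: (similarity[t], term_freq[t]))

# Invented terms and frequencies, for illustration only.
library = {"尿酸检测": 0.3, "尿素检测": 0.6, "血糖检测": 0.1}
print(correct("尿酸捡测", library))  # 尿酸检测
print(correct("尿?检测", library))   # 尿素检测
```

In the first call only 尿酸检测 shares a 2-tuple with the input, so it wins on similarity. In the second call all three terms match once, so the tie is broken by the highest word frequency probability.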

In this embodiment, a text library of domain terms is created, each domain term in the text library having a corresponding index; and the text library is searched, using a preset similarity algorithm based on adjacent characters, for the domain term corresponding to the character string to be corrected. The recognition result can thus be corrected using domain terms, which effectively improves recognition accuracy.

An embodiment of the present invention further provides a device for recognition verification of character strings. As shown in Fig. 2, the device comprises:

a creation module 10, configured to create a text library of domain terms, each domain term in the text library having a corresponding index;

a lookup module 20, configured to search the text library, using a preset similarity algorithm based on adjacent characters, for the domain term corresponding to the character string to be corrected.

Further, the device may also comprise: an establishing module, configured to establish the index of each domain term according to the pinyin of its Chinese characters; or to establish the index of each domain term according to the pinyin and the positions of its Chinese characters.

Further, the device may also comprise: a setting module, configured to set the word frequency probability of each domain term.

Further, the lookup module 20 is specifically configured to find the domain term corresponding to the character string to be corrected by the following algorithm:

calculating the similarity of each domain term in each retrieval set;

Further, the device also comprises:

The preset similarity algorithm based on adjacent characters given in this embodiment draws on the n-gram algorithm. N-gram is a language model commonly used in large-vocabulary continuous speech recognition; for Chinese it is called the Chinese Language Model (CLM). The Chinese language model uses the collocation information between adjacent words in context: when continuous, unspaced pinyin, strokes, or letters or digits representing strokes need to be converted into a Chinese character string (that is, a sentence), it can compute the sentence with the maximum probability. Automatic conversion to Chinese characters is thus achieved without manual selection by the user, which avoids the problem of many Chinese characters corresponding to the same pinyin (or stroke string or digit string).

Second, a bigram model is adopted, with the following definitions:

Definition 3: the overall retrieval result $U_{nb}$ is the union of all $R_{nb}$:

$$U_{nb} = R_{nb_1} + \dots + R_{nb_i} + \dots + R_{nb_m}, \quad m = |NB|$$

The similarity algorithm based on adjacent characters is obtained in this way and is not described again here.

Fig. 3 is a schematic diagram of the hardware structure of the electronic device provided by an embodiment of the present application. The device comprises one or more processors 301 and a memory 302; Fig. 3 takes one processor as an example. The processor 301 and the memory 302 may be connected by a bus or in other ways; Fig. 3 takes a bus connection as an example.

As a non-volatile computer-readable storage medium, the memory 302 can be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the program instructions/modules corresponding to the device for recognition verification of character strings in the embodiments of the present invention. By running the non-volatile software programs, instructions, and modules stored in the memory 302, the processor 301 executes the various functional applications and data processing of the server, that is, implements the device for recognition verification of character strings of the above method embodiments.

The memory 302 may include a program storage area and a data storage area. The program storage area can store an operating system and the application program required by at least one function; the data storage area can store data created through use of the device for recognition verification of character strings, and so on. In addition, the memory 302 may include high-speed random access memory and may also include non-volatile memory, for example at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, the memory 302 optionally includes memory located remotely from the processor 301, and such remote memory can be connected over a network to the device for recognition verification of character strings. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The above electronic device can execute the device or method provided by the embodiments of the present application and has the functional modules and beneficial effects corresponding to executing that device or method. For technical details not described in detail in this embodiment, refer to the device or method provided by the embodiments of the present application.

Furthermore, the system embodiment described above is only illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

From the description of the above embodiments, those of ordinary skill in the art can clearly understand that each embodiment can be implemented by software plus a general-purpose hardware platform, or of course by hardware. On this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.

The above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions recorded in the foregoing embodiments or make equivalent replacements of some of their technical features, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A method for recognition verification of character strings, characterized by comprising:

searching the text library, using a preset similarity algorithm based on adjacent characters, for the domain term corresponding to the character string to be corrected.

2. The method according to claim 1, characterized by further comprising: establishing the index of each domain term according to the pinyin of its Chinese characters; or establishing the index of each domain term according to the pinyin and the positions of its Chinese characters.

3. The method according to claim 2, characterized by further comprising: setting a word frequency probability for each domain term.

4. The method according to claim 3, characterized in that searching the text library, using the preset similarity algorithm based on adjacent characters, for the domain term corresponding to the character string to be corrected comprises:

retrieving each 2-tuple of the set in the text library, to obtain the retrieval set corresponding to each 2-tuple;

calculating the similarity of each domain term in each retrieval set;

determining, for the 2-tuples, the domain term with the highest similarity, and taking the domain term with the highest similarity as the domain term corresponding to the character string to be corrected.

5. The method according to claim 4, characterized in that, after calculating the similarity of each domain term in each retrieval set, the method further comprises: if the similarities of the domain terms are equal, taking the domain term with the highest word frequency probability among them as the domain term corresponding to the character string to be corrected.

6. A device for recognition verification of character strings, characterized by comprising:

a lookup module, configured to search the text library, using a preset similarity algorithm based on adjacent characters, for the domain term corresponding to the character string to be corrected.

7. The device according to claim 6, characterized by further comprising:

an establishing module, configured to establish the index of each domain term according to the pinyin of its Chinese characters; or to establish the index of each domain term according to the pinyin and the positions of its Chinese characters.

8. The device according to claim 7, characterized by further comprising:

9. The device according to claim 8, characterized in that the lookup module is specifically configured to find the domain term corresponding to the character string to be corrected by the following algorithm:

calculating the similarity of each domain term in each retrieval set;

10. The device according to claim 9, characterized by further comprising:

a determining module, configured to, if the similarities of the domain terms are equal, take the domain term with the highest word frequency probability among them as the domain term corresponding to the character string to be corrected.