CN103870822A - Word identification method and device - Google Patents

Info

Publication number: CN103870822A
Authority: CN
Grant status: Application
Prior art keywords: word, words, confidence, common, candidate
Application number: CN 201210570618
Other languages: Chinese (zh)
Other versions: CN103870822B (en)
Inventor: 郑大念
Original Assignee: 北京千橡网景科技发展有限公司

Abstract

An embodiment of the invention provides a word recognition method. The method comprises: recognizing each character in a word, and recording the several highest-confidence candidate characters in the recognition result together with their confidences; searching whether each character of each common word appears among the candidate characters of the characters of the word, and if it does, recording the confidence of that candidate character for the character in the common word, and if it does not, recording the confidence of the character as zero; calculating, for each common word, the average confidence of its characters as the confidence of the common word; and if the confidence of the highest-confidence common word is greater than a threshold, outputting that common word as the recognition result of the word, otherwise outputting the highest-confidence candidate character of each character of the word as the recognition result of the word. By recognizing the word using prior knowledge of common words, a confidence is obtained for the word as a whole, the error of single-character recognition is reduced, and the accuracy and efficiency of word recognition are improved.

Description

Word recognition method and apparatus

Technical Field

[0001] Embodiments of the present invention relate to a word recognition method and apparatus.

Background

[0002] When optical character recognition is performed on a word, the word is usually first split into multiple characters by some segmentation method, and each character is then recognized separately. This approach is slow. In addition, a character region may be unclear or incomplete, which makes recognition of the corresponding character inaccurate. Furthermore, because the recognition of each character has a certain probability of error, the probability of recognizing the whole word correctly is lower still.

Summary

[0003] In view of the above, the present invention provides a word recognition method and apparatus that compare frequently occurring common words with the word to be recognized, thereby achieving a higher recognition rate for words.

[0004] According to one aspect of the present invention, a word recognition method is provided, comprising: recognizing each character in the word, and recording the several highest-confidence candidate characters in the recognition result together with their corresponding confidences; searching whether each character of each common word appears among the candidate characters of the characters of the word, and if it does, recording the confidence of that candidate character for the character in the common word, and if it does not, counting the confidence of the character as zero; calculating, for each common word, the average confidence of its characters as the confidence of the common word; and if the confidence of the highest-confidence common word is greater than a threshold, outputting that common word as the recognition result of the word, otherwise outputting the highest-confidence candidate character of each character of the word as the recognition result of the word.

[0005] According to another aspect of the present invention, the characters are recognized using optical character recognition (OCR).

[0006] According to another aspect of the present invention, it is searched whether each character of each common word appears among all candidate characters of all characters of the word.

[0007] According to another aspect of the present invention, when a character of a common word appears among the candidate characters of a character of the word being recognized, the other characters of that common word are not searched for among the candidate characters of that character of the word being recognized.

[0008] According to another aspect of the present invention, only common words having the same number of characters as the word being recognized are searched.

[0009] According to another aspect of the present invention, it is searched whether each character of each common word appears among the candidate characters of the character at the same position in the word.

[0010] According to another aspect of the present invention, a word recognition apparatus is provided, comprising: a character recognition unit configured to recognize each character in the word and to record the several highest-confidence candidate characters in the recognition result together with their corresponding confidences; a common word search unit configured to search whether each character of each common word appears among the candidate characters of the characters of the word, and if it does, to record the confidence of that candidate character for the character in the common word, and if it does not, to count the confidence of the character as zero; a confidence calculation unit configured to calculate, for each common word, the average confidence of its characters as the confidence of the common word; and an output unit which, if the confidence of the highest-confidence common word is greater than a threshold, outputs that common word as the recognition result of the word, and otherwise outputs the highest-confidence candidate character of each character of the word as the recognition result of the word.

[0011] According to another aspect of the present invention, the character recognition unit includes an optical character recognition (OCR) engine.

[0012] According to another aspect of the present invention, the common word search unit is configured to search whether each character of each common word appears among all candidate characters of all characters of the word.

[0013] According to another aspect of the present invention, the common word search unit is configured so that, when a character of a common word appears among the candidate characters of a character of the word being recognized, the other characters of that common word are not searched for among the candidate characters of that character of the word being recognized.

[0014] According to another aspect of the present invention, the common word search unit is configured to search only common words having the same number of characters as the word being recognized.

[0015] According to another aspect of the present invention, the common word search unit is configured to search whether each character of each common word appears among the candidate characters of the character at the same position in the word.

[0016] By recognizing a word using prior knowledge of common words, a confidence is obtained for the word as a whole, which reduces the error of single-character recognition and improves the accuracy and efficiency of word recognition. The word recognition method and apparatus are particularly advantageous for word recognition in settings that have specific common words, such as business cards.

Brief Description of the Drawings

[0017] These and other objects, features, and advantages will become apparent from the following detailed description of exemplary embodiments when read in conjunction with the accompanying drawings, in which:

[0018] Fig. 1 is a flowchart of a word recognition method according to a preferred embodiment of the present invention;

[0019] Fig. 2 is a schematic block diagram of a word recognition apparatus suitable for practicing embodiments of the present invention;

[0020] Fig. 3 is a schematic block diagram of a mobile terminal for practicing embodiments of the present invention.

Detailed Description

[0021] The flowcharts and block diagrams in the drawings illustrate the architecture, functionality, and operation of possible implementations of apparatuses, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code, which comprises one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functionality involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.

[0022] Various embodiments of the present invention will now be described in detail, by way of example only.

[0023] Fig. 1 is a flowchart of a word recognition method according to a preferred embodiment of the present invention. The method can typically be used to recognize common words on business cards. The method comprises the following steps:

[0024] Step S11 is performed first: each character in the word is recognized, and the several highest-confidence candidate characters in the recognition result are recorded together with their corresponding confidences. Individual characters are preferably recognized by optical character recognition (OCR), a well-known character recognition technique that is not described further here. Several candidate characters may be recognized for each character, and each candidate character has a corresponding recognition confidence. For each character, the several candidates with the highest confidence and their confidences are recorded for use in later steps. For example, suppose the word AB is to be recognized and the three highest-confidence candidates are kept for each character. OCR recognizes the character A in AB as A with confidence 0.9, as A' with confidence 0.4, and as A'' with confidence 0.2, and recognizes the character B as B with confidence 0.8, as B' with confidence 0.4, and as B'' with confidence 0.1. These data are recorded.
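The bookkeeping of step S11 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `top_candidates`, the dictionary layout of the raw OCR scores, and the extra low-scoring entries X and Y are assumptions made for the example. Characters are represented as string tokens so that A' and A'' can be written directly.

```python
from typing import Dict, List, Tuple

Candidate = Tuple[str, float]  # (candidate character, confidence)


def top_candidates(scores: Dict[str, float], k: int = 3) -> List[Candidate]:
    """Keep the k highest-confidence candidates for one character position."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Hypothetical raw OCR scores for the two characters of the word AB.
raw_scores = [
    {"A": 0.9, "A'": 0.4, "A''": 0.2, "X": 0.05},
    {"B": 0.8, "B'": 0.4, "B''": 0.1, "Y": 0.02},
]

# One candidate list per character position, best candidate first.
candidates = [top_candidates(scores) for scores in raw_scores]
```

Keeping each list sorted best-first means the fallback output (the top candidate of each character) can later be read off as the first entry of each list.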

[0025] Step S12 is then performed: it is searched whether each character of each common word appears among the candidate characters of the characters of the word; if it does, the confidence of that candidate character is recorded for the character in the common word, and if it does not, the confidence of the character is counted as zero. A common word here is a word that is used with high probability and needs to be recognized very accurately. For word recognition on business cards, for example, the common words may be words that usually appear on a business card, such as "name" (姓名), "telephone" (电话), and "address" (地址). Common words can be recorded by building a common-word lexicon, and common words can be added to or deleted from the lexicon as needed.

[0026] In a first preferred embodiment, it is searched whether each character of each common word appears among all candidate characters of all characters of the word. Suppose the word AB is to be recognized. When the common word AC from the lexicon is searched, the set {A, A', A'', B, B', B''} is first searched for the A of AC. A is in the set, so the confidence 0.9 of candidate A is recorded. The set is then searched for the C of AC. C is not in the set, so the confidence for C in the common word AC is set to zero.
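A minimal sketch of this pooled search, assuming the candidate lists from the AB example. The function name `pooled_confidences` is hypothetical, and keeping the highest score when the same character is a candidate at more than one position is an assumption; the text does not say how such ties are resolved.

```python
from typing import Dict, List, Sequence, Tuple

Candidate = Tuple[str, float]  # (candidate character, confidence)


def pooled_confidences(common_word: Sequence[str],
                       candidates: List[List[Candidate]]) -> List[float]:
    """Look up each character of a common word in the pooled candidate set."""
    pool: Dict[str, float] = {}
    for position in candidates:
        for char, conf in position:
            # Assumption: if a character is a candidate at several
            # positions, keep its best score.
            pool[char] = max(conf, pool.get(char, 0.0))
    # Characters absent from the pool get confidence zero.
    return [pool.get(char, 0.0) for char in common_word]


candidates = [
    [("A", 0.9), ("A'", 0.4), ("A''", 0.2)],
    [("B", 0.8), ("B'", 0.4), ("B''", 0.1)],
]
ac_scores = pooled_confidences(("A", "C"), candidates)  # A is found, C is not
```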

[0027] Preferably, when a character of a common word appears among the candidate characters of a character of the word being recognized, the other characters of that common word are not searched for among the candidate characters of that same character. In the example above, once the A of the common word AC has been found in the set, the search for the C of AC no longer covers A's candidates A, A', and A'' but only the remaining candidates B, B', and B''. Because the A of the word being recognized has already found its counterpart in the common word, A's other candidates are very likely misrecognitions or unrelated to the common word, so they need not be searched again for the other characters of that common word. This saves computing resources and speeds up the common-word search.
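The exclusion rule can be sketched as below. The function name `confidences_with_exclusion` is hypothetical, and processing the common word's characters left to right while claiming the best-scoring unused position for each one is an assumed (greedy) reading of the paragraph, not something the text prescribes.

```python
from typing import List, Optional, Sequence, Set, Tuple

Candidate = Tuple[str, float]  # (candidate character, confidence)


def confidences_with_exclusion(common_word: Sequence[str],
                               candidates: List[List[Candidate]]) -> List[float]:
    """Score a common word, skipping positions already claimed by a match."""
    used: Set[int] = set()  # positions of the recognized word already matched
    result: List[float] = []
    for char in common_word:
        match: Optional[Tuple[int, float]] = None  # (position, confidence)
        for pos, position_candidates in enumerate(candidates):
            if pos in used:
                continue  # this position's candidates are no longer searched
            for cand, conf in position_candidates:
                if cand == char and (match is None or conf > match[1]):
                    match = (pos, conf)
        if match is None:
            result.append(0.0)  # character not found: confidence zero
        else:
            used.add(match[0])
            result.append(match[1])
    return result


candidates = [
    [("A", 0.9), ("A'", 0.4), ("A''", 0.2)],
    [("B", 0.8), ("B'", 0.4), ("B''", 0.1)],
]
ac_scores = confidences_with_exclusion(("A", "C"), candidates)
```

With this rule a hypothetical common word AA would score 0.9 for its first A and zero for the second, because the only position offering A is consumed by the first match.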

[0028] After the common word AC has been searched, the characters of the other common words in the lexicon, such as AB, AD, EB, and AFG, are searched for in the set in the same way, yielding a confidence for each character of each common word. For example, in the common word AB the confidence of A is 0.9 and the confidence of B is 0.8; in AD the confidence of A is 0.9, and the confidence of D is zero because D is not in the set.

[0029] Step S13 is then performed: for each common word, the average confidence of its characters is calculated as the confidence of the common word. The average confidence is obtained by taking the mean of the confidences of the characters in the common word. For the common word AC above, the confidence of A is 0.9 and the confidence of C is 0, so the average confidence of AC is (0.9+0)/2 = 0.45. The average confidence of the common word AB is (0.9+0.8)/2 = 0.85. The average confidence of the common word AFG is (0.9+0+0)/3 = 0.3. This step yields the confidence of the word being recognized with respect to every common word in the lexicon. Because this confidence takes into account the recognition probabilities of all characters in the word, it reflects the probability that the recognized word matches the common word more fully than any single character does, reduces the effect of single-character recognition errors on recognition of the word as a whole, and thus favors accurate recognition of the whole word.
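The averaging in step S13 is a plain arithmetic mean, sketched here with the three worked examples from the text. The function name `word_confidence` is hypothetical, and returning zero for an empty input is an added guard rather than part of the described method.

```python
from typing import Sequence


def word_confidence(char_confidences: Sequence[float]) -> float:
    """Mean of the per-character confidences of one common word."""
    if not char_confidences:
        return 0.0  # guard for empty input; not part of the described method
    return sum(char_confidences) / len(char_confidences)


conf_ac = word_confidence([0.9, 0.0])        # (0.9 + 0) / 2
conf_ab = word_confidence([0.9, 0.8])        # (0.9 + 0.8) / 2
conf_afg = word_confidence([0.9, 0.0, 0.0])  # (0.9 + 0 + 0) / 3
```

Dividing by the number of characters of the common word, rather than of the recognized word, is what pulls AFG down to about 0.3 even though its first character matches well.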

[0030] Once the confidence of each common word has been obtained, step S14 is performed to determine whether the confidence of the highest-confidence common word is greater than a threshold. The threshold can be set empirically so that it both ensures a certain recognition accuracy and allows some tolerance for defects in the word image; for example, the threshold may be set to 0.8. If the confidence of the highest-confidence common word is greater than the threshold, step S15 is performed and that common word is output as the recognition result of the word; otherwise step S16 is performed and the highest-confidence candidate character of each character of the word is output as the recognition result of the word. In the example above, the highest-confidence common word is AB, with confidence 0.85, which is greater than the threshold 0.8, so the word AB is recognized and output as the common word AB from the lexicon. The recognized word is evidently correct. If the lexicon does not contain AB and the common word AC has the highest confidence, 0.45, which is less than the threshold 0.8, the word does not match AC well. AC is therefore not output; instead, the highest-confidence candidate of each character of AB recognized in step S11 is output as the recognition result. Among A's candidates, A has the highest confidence, 0.9, so A is output; among B's candidates, B has the highest confidence, 0.8, so B is output. The output is therefore AB, which matches the word.
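Steps S14 to S16 can be sketched end to end as follows. The names `score` and `recognize` are hypothetical; the sketch uses the pooled search of the first preferred embodiment for scoring and assumes that each candidate list is non-empty and sorted best-first and that lexicon entries are non-empty.

```python
from typing import Dict, List, Tuple

Candidate = Tuple[str, float]  # (candidate character, confidence)
Word = Tuple[str, ...]         # a word as a tuple of character tokens


def score(common_word: Word, candidates: List[List[Candidate]]) -> float:
    """Average confidence of a common word against the pooled candidates."""
    pool: Dict[str, float] = {}
    for position in candidates:
        for char, conf in position:
            pool[char] = max(conf, pool.get(char, 0.0))
    return sum(pool.get(char, 0.0) for char in common_word) / len(common_word)


def recognize(candidates: List[List[Candidate]],
              lexicon: List[Word],
              threshold: float = 0.8) -> Word:
    """Output the best common word if it clears the threshold (S15),
    otherwise the top candidate of each character (S16)."""
    best = max(lexicon, key=lambda word: score(word, candidates), default=None)
    if best is not None and score(best, candidates) > threshold:
        return best
    return tuple(position[0][0] for position in candidates)


candidates = [
    [("A", 0.9), ("A'", 0.4), ("A''", 0.2)],
    [("B", 0.8), ("B'", 0.4), ("B''", 0.1)],
]
with_ab = recognize(candidates, [("A", "C"), ("A", "B"), ("A", "D")])
without_ab = recognize(candidates, [("A", "C"), ("A", "D")])
```

In this example both calls yield the word AB: the first through the lexicon entry AB, the second through the per-character fallback because no remaining common word exceeds 0.8.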

[0031] In a second preferred embodiment, only common words having the same number of characters as the word being recognized are searched. In the example above, the word AB consists of two characters, so only two-character common words such as AC, AB, AD, and EB are searched for in the candidate set of these two characters, and common words that are not two characters long, such as AFG, are not searched. This saves the time that would be spent searching common words of mismatched length and improves recognition efficiency.
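The length filter amounts to a single pass over the lexicon before any scoring. A minimal sketch, with the hypothetical function name `same_length_words` and words represented as tuples of character tokens:

```python
from typing import List, Tuple

Word = Tuple[str, ...]  # a word as a tuple of character tokens


def same_length_words(lexicon: List[Word], word_length: int) -> List[Word]:
    """Keep only common words with as many characters as the recognized word."""
    return [word for word in lexicon if len(word) == word_length]


lexicon = [("A", "C"), ("A", "B"), ("A", "D"), ("E", "B"), ("A", "F", "G")]
shortlist = same_length_words(lexicon, word_length=2)  # drops AFG
```

A lexicon that is queried often could instead be grouped by length once, so that the shortlist becomes a dictionary lookup; that is a design option, not something the text describes.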

[0032] Preferably, it is searched whether each character of each common word appears among the candidate characters of the character at the same position in the word. For example, the A of the common word AB is in the first character position, so A is searched for only among A, A', and A'', the candidates of the character A in the first position of the word AB being recognized; the B of the common word AB is in the second character position, so B is searched for only among B, B', and B'', the candidates of the character B in the second position of the word being recognized. Position-matched searching reduces the amount of search computation and improves search efficiency.
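Position-matched search can be sketched as a lookup in one candidate list per character. The function name `positional_confidences` is hypothetical, and raising an error on a length mismatch is an assumption that follows from combining this variant with the same-length restriction of the second preferred embodiment.

```python
from typing import List, Sequence, Tuple

Candidate = Tuple[str, float]  # (candidate character, confidence)


def positional_confidences(common_word: Sequence[str],
                           candidates: List[List[Candidate]]) -> List[float]:
    """Look up each character only among the candidates at its own position."""
    if len(common_word) != len(candidates):
        raise ValueError("common word and recognized word differ in length")
    scores = []
    for char, position_candidates in zip(common_word, candidates):
        # dict() turns the (character, confidence) pairs into a lookup table.
        scores.append(dict(position_candidates).get(char, 0.0))
    return scores


candidates = [
    [("A", 0.9), ("A'", 0.4), ("A''", 0.2)],
    [("B", 0.8), ("B'", 0.4), ("B''", 0.1)],
]
ab_scores = positional_confidences(("A", "B"), candidates)
ba_scores = positional_confidences(("B", "A"), candidates)
```

Unlike the pooled search, a hypothetical common word BA scores zero for both characters here, since neither character is a candidate at its own position.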

[0033] The word recognition method is suitable for implementation by a computer program.

[0034] Fig. 2 is a schematic block diagram of a word recognition apparatus suitable for practicing embodiments of the present invention. In Fig. 2, the word recognition apparatus 200 comprises: a character recognition unit 201 configured to recognize each character in the word and to record the several highest-confidence candidate characters in the recognition result together with their corresponding confidences; a common word search unit 202 configured to search whether each character of each common word appears among the candidate characters of the characters of the word, and if it does, to record the confidence of that candidate character for the character in the common word, and if it does not, to count the confidence of the character as zero; a confidence calculation unit 203 configured to calculate, for each common word, the average confidence of its characters as the confidence of the common word; and an output unit 204 which, if the confidence of the highest-confidence common word is greater than a threshold, outputs that common word as the recognition result of the word, and otherwise outputs the highest-confidence candidate character of each character of the word as the recognition result of the word.

[0035] Preferably, the character recognition unit includes an optical character recognition (OCR) engine.

[0036] Preferably, the common word search unit is configured to search whether each character of each common word appears among all candidate characters of all characters of the word.

[0037] Preferably, the common word search unit is configured so that, when a character of a common word appears among the candidate characters of a character of the word being recognized, the other characters of that common word are not searched for among the candidate characters of that character of the word being recognized.

[0038] Preferably, the common word search unit is configured to search only common words having the same number of characters as the word being recognized.

[0039] Preferably, the common word search unit is configured to search whether each character of each common word appears among the candidate characters of the character at the same position in the word.

[0040] The word recognition apparatus 200 is adapted to perform the word recognition methods described above.

[0041] The word recognition apparatus is suitable for implementation by computer hardware loaded with the word recognition method above. It is particularly suitable for implementation by a mobile device with computing capability, such as a mobile phone, loaded with the word recognition algorithm above. The mobile device preferably also has a digital camera for capturing images that contain text, such as business cards. Using the loaded program, the mobile device can extract, recognize, and store words from captured images such as business cards on the spot.

[0042] 下面参考图3,其示出了适于用来实践本发明实施方式的移动终端300的示意性框图。 [0042] Referring to FIG 3, which shows a schematic block diagram of embodiments of the invention suitable for use in the practice of the mobile terminal 300. 在图3所示的示例中,移动终端300是一个具有无线通信能力的移动设备。 In the example shown in FIG. 3, the mobile terminal 300 is a mobile device with wireless communication capabilities. 然而,可以理解,这仅仅是示例性而非限制性的。 However, it is understood that this is merely exemplary and not limiting. 其他类型的移动终端也可以容易地采用本发明的实施方式,诸如便携式数字助理(PDA)、寻呼机、移动计算机、移动电视、游戏设备、膝上型计算机、照相机、录像机、GPS设备以及其他类型的语音和文本通信系统。 Other types of mobile terminals may also readily employ embodiments of the present invention, such as a portable digital assistant (PDA), pagers, mobile computers, mobile televisions, gaming devices, laptop computers, cameras, video recorders, GPS devices and other types of voice and text communications systems. 固定式移动终端同样可以容易地使用本发明的实施方式。 The mobile terminal can also be stationary embodiment of the present invention is easy to use.

[0043] 移动终端300包括一个或天线312,其可操作地与发射机314和接收机316进行通信。 [0043] or a mobile terminal 300 comprises an antenna 312, which is operable to communicate with the transmitter 314 and receiver 316. 移动终端300还包括处理器312或者其他处理元件,其分别提供去往发射机314的信号和接收来自接收机316的信号。 The mobile terminal 300 further includes a processor 312 or other processing elements which are destined to provide the transmitter 314 and the reception signal 316 from the signal receiver. 信号包括按照适当蜂窝系统的空中接口标准的信令信息,并且还包括用户语音、接收的数据和/或用户生成的数据。 Signals include signaling information in accordance with the air interface standard of the applicable cellular system, and also user speech, received data and / or user generated data. 在此方面,移动终端300能够利用一个或多个空中接口标准、通信协议、调制类型以及接入类型来进行操作。 In this regard, the mobile terminal 300 can utilize one or more air interface standards, communication protocols, modulation types, and access types operate. 作为示范,移动终端300能够根据多个第一代、第二代、第三代和/或第四代通信协议等中的任何协议来进行操作。 As in the illustration, the mobile terminal 300 can be a plurality of first, second, third and / or fourth-generation communication protocols or the like to operate. 例如,移动终端300可以能够按照第二代(G)无线通信协议IS-136 (TDMA)、GSM和IS-95 (CDMA)来进行操作,或者按照诸如UMTS、CDMA2000, WCDMA和TD-SCDMA的第三代(G)无线通信协议来进行操作,或者按照第四代(4G)无线通信协议和/或类似协议进行操作。 For example, the mobile terminal 300 may be capable (G) wireless communication protocols IS-136 (TDMA), GSM, and IS-95 (CDMA) to operate in accordance with the second generation, according to the first or such as UMTS, CDMA2000, WCDMA and TD-SCDMA is three generations (G) to operate a wireless communication protocol, or (4G) wireless communication protocols and / or the like operates according to the fourth generation.

[0044] 可以理解,处理器312包括实现移动终端300的功能所需的电路。 [0044] It will be appreciated, the processor 312 includes circuitry required for implementing functions of the mobile terminal 300. 例如,处理器312可以包括数字信号处理器设备、微处理器设备、各种模数转换器、数模转换器和其他支持电路。 For example, processor 312 may comprise a digital signal processor device, a microprocessor device, various analog to digital converters, digital to analog converters, and other support circuits. 移动终端300的控制和信号处理功能按照这些设备各自的能力在其间分配。 The mobile terminal control and signal processing functions 300 are allocated between devices in accordance with these capabilities. 处理器312由此还可以包括在调制和传输之前对消息和数据进行卷积编码和交织的功能。 The processor 312 thus may also include message data and functionality to convolutionally encode and interleave prior to modulation and transmission. 处理器312还可以另外包括内部语音编码器,并且可以包括内部数据调制解调器。 The processor 312 may additionally include an internal voice coder, and may include an internal data modem. 此外,处理器312可以包括对可以存储在存储器中的一个或多个软件程序进行操作的功能。 Further, the processor 312 may include one or more software programs may be stored in a memory functionality to operate. 例如,处理器312可以能够操作连接程序,诸如传统的Web浏览器。 For example, the processor 312 may be capable of operating a connectivity program, such as a conventional Web browser. 连接程序继而可以允许移动终端300例如按照无线应用协议(WAP)、超文本传输协议(HTTP)等来发射和接收Web内容(诸如基于位置的内容和/或其他web页面内容)。 The connectivity program may then allow the mobile terminal 300, for example, according to a Wireless Application Protocol (WAP), Hypertext Transfer Protocol (HTTP) and the like to transmit and receive Web content (such as location-based content and / or other web page content).

[0045] 移动终端300还可以包括用户接口,其例如可以包括耳机或者扬声器324、振铃器322、麦克风326、显示屏328以及输入接口331,所有这些设备都耦合至处理器312。 [0045] The mobile terminal 300 may further include a user interface, which may include e.g. earphone or speaker 324, a ringer 322, a microphone 326, an input interface 331 and display 328, all of which are coupled to the processor 312. 移动终端300可以包括小键盘330。 The mobile terminal 300 may include a keypad 330. 小键盘330可以包括传统的数字键(0_9)和相关键(#、*),以及用于操作移动终端300的其他键。 The keypad 330 may include the conventional numeric (0_9) and related keys (#, *), and other keys for operating the mobile terminal 300. 备选地,小键盘330可以包括传统的QWERTY小键盘布置。 Alternatively, the keypad 330 may include a conventional QWERTY keypad arrangement. 小键盘330还可以包括与功能相关联的各种软键。 The keypad 330 may also include various soft keys with associated functions. 移动终端300还可以包括相机模块336,用于捕获静态和/或动态图像。 The mobile terminal 300 may further include a camera module 336 for capturing still and / or moving images.

[0046] 特别地,显示屏328可以包括触摸式屏幕和/或邻近式屏幕,用户可以通过直接操作屏幕而操作移动终端300。 [0046] In particular, the display 328 may include a touch screen and / or adjacent to the screen, the user may operate the mobile terminal 300 by touching the screen directly. 此时,显示屏328同时充当输入设备和输出设备二者。 At this time, the display 328 also acts as both an input device and an output device. 在这样的实施方式中,输入接口331可以配置用于接收用户通过例如普通的笔、专用触笔和/或手指在显示屏328上提供的输入,包括指点输入和手势输入。 In such an embodiment, the input interface 331 may be configured to receive user input via an ordinary pen, a dedicated stylus and / or fingers 328 provided on the display screen, for example, include a gesture input and a pointing input. 处理器312可配置用于检测此类输入,并且识别出用户的手势。 The processor 312 may be configured to detect such an input, and recognizes the user's gesture.

[0047] 此外,移动终端300可以包括诸如操纵杆的接口设备或者其他用于输入接口。 [0047] Further, the mobile terminal 300 may include an interface device such as a joystick or other input interface. 移动终端300还包括电池334,诸如振动电池组,用于为操作移动终端300所需的各种电路供电,以及可选地提供机械振动作为可检测输出。 The mobile terminal 300 further includes a battery 334, such as a vibrating battery pack, for powering the various circuits required to operate the mobile terminal 300, as well as optionally providing mechanical vibration as a detectable output.

[0048] The mobile terminal 300 may further include a user identity module (UIM) 338. The UIM 338 is typically a memory device with a built-in processor. The UIM 338 may include, for example, a subscriber identity module (SIM), a universal integrated circuit card (UICC), a universal subscriber identity module (USIM), a removable user identity module (R-UIM), and the like. The UIM 338 typically stores information elements related to a mobile subscriber.

[0049] The mobile terminal 300 may also have memory. For example, the mobile terminal 300 may include volatile memory 340, such as volatile random access memory (RAM) that includes a cache area for the temporary storage of data. The mobile terminal 300 may also include other non-volatile memory 342, which may be embedded and/or removable. The non-volatile memory 342 may additionally or alternatively include, for example, EEPROM and flash memory. The memories can store any of a number of pieces of information and data used by the mobile terminal 300 to implement the functions of the mobile terminal 300.

[0050] The mobile terminal 300 may be configured to implement the method described above in connection with FIG. 1 and to serve as the apparatus described in connection with FIG. 2.

[0051] It should be understood that the block diagram of FIG. 3 is shown for purposes of example only and does not limit the scope of the present invention. In some cases, certain devices may be added or omitted as circumstances require.

[0052] The description of the present invention has been presented for purposes of illustration and description, but it is not intended to be exhaustive or to limit the invention to the form disclosed. Many modifications and variations will occur to those skilled in the art. Those skilled in the art will appreciate that the methods and apparatus in the embodiments of the present invention may be implemented in software, hardware, firmware, or a combination thereof.

[0053] Accordingly, the embodiments were chosen and described in order to better explain the principles of the invention and its practical application, and to enable others skilled in the art to understand that all modifications and substitutions made without departing from the spirit of the invention fall within the scope of protection of the invention as defined by the appended claims.

Claims (12)

  1. A word recognition method, comprising: recognizing each single character in a word, and recording the several candidate characters with the highest confidence in the recognition result together with their corresponding confidences; searching whether each single character of each common word appears among the candidate characters of the single characters of the word, and, if it appears, recording the confidence of that candidate character for that single character of the common word, and, if it does not appear, counting the confidence of that character as zero; calculating, for each common word, the average confidence of its single characters as the confidence of that common word; and, if the confidence of the common word with the highest confidence is greater than a threshold, outputting that common word as the recognition result of the word, and otherwise outputting the highest-confidence candidate character of each single character of the word as the recognition result of the word.
  2. The method according to claim 1, wherein the single characters are recognized using optical character recognition (OCR).
  3. The method according to claim 1 or 2, wherein the search determines whether each single character of each common word appears among all candidate characters of all single characters of the word.
  4. The method according to claim 3, wherein, when a single character of a common word appears among the candidate characters of a single character of the word being recognized, the candidate characters of that single character of the word being recognized are not searched for the other single characters of that common word.
  5. The method according to claim 1 or 2, wherein only common words having the same number of characters as the word being recognized are searched.
  6. The method according to claim 5, wherein the search determines whether each single character of each common word appears among the candidate characters of the single character of the word that is at the same position as in the common word.
  7. A word recognition apparatus, comprising: a single-character recognition unit configured to recognize each single character in a word and to record the several candidate characters with the highest confidence in the recognition result together with their corresponding confidences; a common word search unit configured to search whether each single character of each common word appears among the candidate characters of the single characters of the word, and, if it appears, to record the confidence of that candidate character for that single character of the common word, and, if it does not appear, to count the confidence of that character as zero; a confidence calculation unit configured to calculate, for each common word, the average confidence of its single characters as the confidence of that common word; and an output unit configured to output, if the confidence of the common word with the highest confidence is greater than a threshold, that common word as the recognition result of the word, and otherwise to output the highest-confidence candidate character of each single character of the word as the recognition result of the word.
  8. The apparatus according to claim 7, wherein the single-character recognition unit includes an optical character recognition (OCR) engine.
  9. The apparatus according to claim 7 or 8, wherein the common word search unit is configured to search whether each single character of each common word appears among all candidate characters of all single characters of the word.
  10. The apparatus according to claim 9, wherein the common word search unit is configured such that, when a single character of a common word appears among the candidate characters of a single character of the word being recognized, the candidate characters of that single character of the word being recognized are not searched for the other single characters of that common word.
  11. The apparatus according to claim 7 or 8, wherein the common word search unit is configured to search only common words having the same number of characters as the word being recognized.
  12. The apparatus according to claim 11, wherein the common word search unit is configured to search whether each single character of each common word appears among the candidate characters of the single character of the word that is at the same position as in the common word.
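
For illustration only, the method of claim 1, narrowed as in claims 5 and 6 (common words of the same length, compared position by position), might be sketched in Python as follows. The function names, the dictionary-based candidate structure, the sample characters, the confidences, and the threshold values are assumptions made for this sketch; none of them is specified in the claims.

```python
from typing import Dict, List, Sequence

# Hypothetical structure: one dict per character position of the word,
# mapping each recorded candidate character to its confidence.
Candidates = List[Dict[str, float]]


def score_common_word(word: str, candidates: Candidates) -> float:
    """Average per-character confidence of a common word.

    A character that is not among the candidates at its position
    contributes a confidence of zero.
    """
    # Only common words with the same number of characters are considered.
    if not word or len(word) != len(candidates):
        return 0.0
    total = sum(candidates[i].get(ch, 0.0) for i, ch in enumerate(word))
    return total / len(word)


def recognize_word(candidates: Candidates,
                   common_words: Sequence[str],
                   threshold: float) -> str:
    """Return the best common word if its confidence exceeds the threshold,
    otherwise the top candidate character at each position."""
    best_word, best_score = "", 0.0
    for word in common_words:
        score = score_common_word(word, candidates)
        if score > best_score:
            best_word, best_score = word, score
    if best_score > threshold:
        return best_word
    # Fallback: highest-confidence candidate for every single character.
    return "".join(max(c, key=c.get) for c in candidates)


# Made-up single-character results for a two-character word.
candidates = [{"北": 0.6, "比": 0.3}, {"凉": 0.5, "京": 0.45}]
common_words = ["北京", "比较", "北京市"]

print(recognize_word(candidates, common_words, threshold=0.5))  # 北京
print(recognize_word(candidates, common_words, threshold=0.6))  # 北凉
```

In this made-up example the common word 北京 scores (0.6 + 0.45) / 2 = 0.525, so it replaces the per-character result 北凉 when the threshold is 0.5 but not when it is 0.6. The variant of claims 3 and 4, which searches the candidates of all positions and uses each recognized character at most once, would need a different matching loop and is not shown here.
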
CN 201210570618 2012-12-17 2012-12-17 Word recognition method and apparatus CN103870822B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201210570618 CN103870822B (en) 2012-12-17 2012-12-17 Word recognition method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201210570618 CN103870822B (en) 2012-12-17 2012-12-17 Word recognition method and apparatus

Publications (2)

Publication Number Publication Date
CN103870822A (en) 2014-06-18
CN103870822B (en) 2018-09-25

Family

ID=50909338

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201210570618 CN103870822B (en) 2012-12-17 2012-12-17 Word recognition method and apparatus

Country Status (1)

Country Link
CN (1) CN103870822B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2007094684A2 (en) * 2006-02-17 2007-08-23 Lumex As Method and system for verification of uncertainly recognized words in an ocr system
CN101482862A (en) * 2009-01-20 2009-07-15 上海邮政科学研究院 Chinese automatic translation method for English mail address
CN102375807A (en) * 2010-08-27 2012-03-14 汉王科技股份有限公司 Method and device for proofing characters

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2007094684A2 (en) * 2006-02-17 2007-08-23 Lumex As Method and system for verification of uncertainly recognized words in an ocr system
CN101443787A (en) * 2006-02-17 2009-05-27 徕美股份公司 Method and system for verification of uncertainly recognized words in an OCR system
CN101482862A (en) * 2009-01-20 2009-07-15 上海邮政科学研究院 Chinese automatic translation method for English mail address
CN102375807A (en) * 2010-08-27 2012-03-14 汉王科技股份有限公司 Method and device for proofing characters

Also Published As

Publication number Publication date Type
CN103870822B (en) 2018-09-25 grant

Legal Events

Date Code Title Description
C06 Publication
C10 Entry into substantive examination
C41 Transfer of patent application or patent right or utility model
COR Change of bibliographic data