CN101882226B - Method and device for improving language discrimination among characters - Google Patents

Method and device for improving language discrimination among characters

Info

Publication number
CN101882226B
CN101882226B · CN201010218319A · CN 201010218319
Authority
CN
China
Prior art keywords
character
class
characters
probability
similarity vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN 201010218319
Other languages
Chinese (zh)
Other versions
CN101882226A (en)
Inventor
郭育生
邹明福
王利娟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hanwang Technology Co Ltd
Original Assignee
Hanwang Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hanwang Technology Co Ltd filed Critical Hanwang Technology Co Ltd
Priority to CN 201010218319 priority Critical patent/CN101882226B/en
Publication of CN101882226A publication Critical patent/CN101882226A/en
Application granted granted Critical
Publication of CN101882226B publication Critical patent/CN101882226B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Character Discrimination (AREA)

Abstract

The embodiment of the invention discloses a method and a device for improving the language discrimination among characters. The invention relates to information processing technology and aims to improve the language discrimination among similar characters. The method comprises the following steps: recognizing character samples and clustering the characters according to the recognition results; computing the intra-class language probability from the clustering result of the characters; computing the inter-class language probability from the clustering result of the characters; and computing the language probability of each character from the intra-class and inter-class language probabilities. The embodiment of the invention is mainly used in character recognition and phrase recognition technology.

Description

Method and device for improving language discrimination between characters
Technical Field
The present invention relates to information processing technologies, and in particular, to a method and an apparatus for improving linguistic distinction between characters.
Background
In pattern recognition technology, the characters that can be recognized include English, simplified Chinese, traditional Chinese, Arabic, Greek letters, various symbols, and the like. Among these characters there are a large number of similar characters, such as the visually similar Chinese characters "已" and "己". When recognizing such similar characters, it is difficult to identify the correct character efficiently using a shape-based character recognition technique alone. In order to distinguish these similar characters, the probability statistics technique of the language model is widely used in pattern recognition.
In the process of implementing the invention, the inventors found that the probability statistics technique of the language model counts language probabilities over all characters; a language probability estimated over all characters, including dissimilar ones, still makes it difficult to distinguish similar characters accurately.
Disclosure of Invention
The embodiment of the invention provides a method and a device for improving the language discrimination between characters, which are used for improving the language discrimination between similar characters.
The embodiment of the invention adopts the following technical scheme:
a method of improving linguistic distinctness between characters, comprising:
identifying character samples, and clustering characters according to identification results;
calculating the intra-class language probability according to the clustering result of the characters;
calculating the language probability among classes according to the clustering result of the characters;
and calculating the language probability of the character according to the language probability in the class and the language probability among the classes.
An apparatus for improving linguistic distinction between characters, comprising:
the clustering unit is used for identifying the character samples and clustering the characters according to the identification result;
the first calculating unit is used for calculating the intra-class language probability according to the clustering result of the characters;
the second calculating unit is used for calculating the language probability among the classes according to the clustering result of the characters;
and the third calculating unit is used for calculating the language probability of the character according to the language probability in the class and the language probability among the classes.
The method and the device for improving the language discrimination between characters in the embodiments of the invention first classify the characters by a recognition technique, then obtain the intra-class language probabilities between different characters of the same class and the inter-class language probabilities between different classes, and finally obtain the language probability of each character. That is, the character language probabilities calculated from the intra-class and inter-class language probabilities fully reflect both the probability differences between different characters of the same class and the probability differences between different classes, so the method and device of the embodiments of the invention can improve the language discrimination between similar characters.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a flowchart illustrating a method for improving linguistic distinction between characters according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating an apparatus for improving linguistic distinction between characters according to an embodiment of the present invention;
FIG. 3 is a block diagram of an apparatus for improving linguistic distinction between characters according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the method for improving the language discrimination between characters, character samples are first recognized by a recognition technique and the characters are clustered according to the recognition results; the intra-class language probability and the inter-class language probability are then obtained from the clustering results of the characters; finally, the language probability of each character is calculated from the intra-class and inter-class language probabilities.
According to the technical scheme, the language probabilities of the characters calculated through the intra-class language probabilities and the inter-class language probabilities fully reflect the language probability difference between different characters in the same class and the language probability difference between different classes, and therefore the method can improve the language discrimination between similar characters.
The following describes the implementation process of the above technical solution in detail with reference to specific embodiments.
As shown in fig. 1, a method for improving linguistic distinction between characters in an embodiment of the present invention includes:
and 11, acquiring character samples D (c, k), wherein c is the identity of the samples, and k represents the sequence of the samples with the same identity.
The character samples provide the shape characteristics of the characters and are the basis of the recognition-based clustering method, while the corpus provides the material for the language probabilities. In this embodiment, when collecting the character samples, a character recognition program may be used to label the character samples automatically, the labeling results of the character samples may be corrected manually, and the frequency with which the characters appear in the corpus may be counted automatically.
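Counting how often each character occurs in the corpus, as described above, can be sketched as follows (a minimal illustration; the helper name and the toy corpus lines are made up, not from the patent):

```python
from collections import Counter

def character_frequencies(corpus_lines):
    """Count how often each character appears in a text corpus.

    These counts supply the frequencies n(C) used later when computing
    the intra-class and inter-class language probabilities.
    """
    counts = Counter()
    for line in corpus_lines:
        # skip whitespace; every other character is counted as-is
        counts.update(ch for ch in line if not ch.isspace())
    return counts

corpus = ["the cat sat", "the hat"]
freq = character_frequencies(corpus)
# 't' occurs 5 times across the two lines
```

The same counter works unchanged for Chinese text, since Python iterates strings by code point.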
Step 12, recognizing the character samples and clustering the characters according to the recognition results.
In this embodiment, the step may include:
step 121, recognizing the character sample D (C, k) by using the character recognition core, obtaining a character set C: { C1,C2,K,CNWhere N is the total number of characters in the character set.
Step 122, calculating each character C in the character setiA first recognition similarity vector Z (C) with the character sample D (C, k)i,k)=(z1(Ci,k),z2(Ci,k),K,zN(Ci,k))T
In the process, specifically, character samples are recognized by using a character recognition core, and each character C in the character set is calculatediRecognition similarity z with the character sample D (c, k)i(C, k) and generating the first recognition similarity vector Z (C) according to the recognition similarityi,k)=(z1(Ci,k),z2(Ci,k),K,zN(Ci,k))T
Step 123, according to the first similarity vector Z (C)i,k)=(z1(Ci,k),z2(Ci,k),K,zN(Ci,k))TAnd calculating the character set C: { C1,C2,K,CNWith each character C in said character setiSecond recognition similarity vector Y (C)i)=(y1(Ci),y2(Ci),K,yN(Ci))T
In this process, specifically, for the character samples having the same character in the character set, that is, having the same identity CiAccording to the character C calculated in step 122iA first recognition similarity vector Z (C) with the character samplei,k)=(z1(Ci,k),z2(Ci,k),K,zN(Ci,k))TTaking the average value of the first recognition similarity vector as the second recognition similarity vector Y (C)i)=(y1(Ci),y2(Ci),K,yN(Ci))TWhereinn(Ci) Is identity CiTotal number of character samples.
Due to the possible presence of yi(Cj)≠yj(Ci) Thus introducing xi,j=(yi(Cj)+yj(Ci) 2) such that the second recognition similarity vector can be represented as Xi=(x1,i,x2,i,K,xN,i)T
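The averaging of step 123 and the symmetrisation can be sketched as follows (an illustrative sketch assuming the recognition similarities are already available as a NumPy array; the function name, array layout, and toy values are hypothetical):

```python
import numpy as np

def second_similarity_vectors(Z, identities, n_chars):
    """Average the first similarity vectors Z(c, k) over all samples
    sharing the identity C_i, giving Y(C_i); then symmetrise to X.

    Z          : (num_samples, N) array, Z[s, j] = z_j for sample s
    identities : length-num_samples array of character indices in [0, N)
    Returns the symmetric N x N matrix whose entry (i, j) is x_{i,j}.
    """
    Y = np.zeros((n_chars, n_chars))
    for i in range(n_chars):
        mask = identities == i
        Y[i] = Z[mask].mean(axis=0)   # y_j(C_i): mean over samples of C_i
    X = (Y + Y.T) / 2.0               # x_{i,j} = (y_i(C_j) + y_j(C_i)) / 2
    return X

# hypothetical toy data: N = 2 characters, three samples
Z = np.array([[1.0, 0.0],
              [0.5, 0.5],
              [0.2, 0.8]])
identities = np.array([0, 0, 1])      # first two samples share identity C_1
X = second_similarity_vectors(Z, identities, 2)
```

Column i of the resulting matrix is the symmetrised vector X_i used by the clustering step.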
Step 124, clustering the character set according to the first and second recognition similarity vectors.
Firstly, a class center is set for each of the classes to be formed. Then each character in the character set is assigned to the class whose center is nearest to it, and the class centers are re-estimated. This process is repeated until the class centers no longer change. Finally, characters with the same identification are placed in the same class. The result of clustering the character set can be expressed as:
Ω_1 = {C_{c_1,1}, C_{c_1,2}, …}, Ω_2 = {C_{c_2,1}, C_{c_2,2}, …}, …, Ω_M = {C_{c_M,1}, C_{c_M,2}, …}
where Ω_i represents the i-th class, C_{c_i,k} represents the k-th character of class Ω_i, and M is the number of classes. The number of characters in each class is determined by the clustering of step 124, but the total number of characters over all classes equals the total number N of characters in the character set.
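The iterative assignment of step 124 can be sketched as a k-means-style loop (an illustrative sketch; initialising the centers from the first rows and using Euclidean distance are assumptions, since the patent does not fix an initialisation scheme or metric):

```python
import numpy as np

def cluster_characters(X, n_classes, n_iter=100):
    """Cluster characters by their symmetrised similarity vectors:
    assign each character to its nearest class center, re-estimate the
    centers, and repeat until the assignment no longer changes.

    X is the N x N symmetric similarity matrix; row i is taken as X_i.
    Returns a length-N array: labels[i] is the class of character C_i.
    """
    vectors = np.asarray(X, dtype=float)
    N = vectors.shape[0]
    centers = vectors[:n_classes].copy()      # assumed initialisation
    labels = np.full(N, -1)
    for _ in range(n_iter):
        # distance of every character vector to every class center
        dists = np.linalg.norm(vectors[:, None, :] - centers[None, :, :],
                               axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                             # centers stable: done
        labels = new_labels
        for c in range(n_classes):            # re-estimate centers
            if (labels == c).any():
                centers[c] = vectors[labels == c].mean(axis=0)
    return labels

# hypothetical 4-character set: characters 0,1 are mutually similar,
# as are characters 2,3
X_demo = np.array([[1.0, 0.9, 0.1, 0.1],
                   [0.9, 1.0, 0.1, 0.1],
                   [0.1, 0.1, 1.0, 0.9],
                   [0.1, 0.1, 0.9, 1.0]])
labels = cluster_characters(X_demo, 2)
```

With the toy matrix above, the two similar pairs end up in separate classes, which is exactly the grouping the intra-class language model then operates on.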
Step 13, calculating the intra-class language probability according to the clustering result of the characters.
The intra-class language probability is calculated as:
P(C_{c_i,k} | Ω_i) = -K_1 × log( n(C_{c_i,k}) / Σ_j n(C_{c_i,j}) )
where Ω_i represents the class, n(C_{c_i,k}) and n(C_{c_i,j}) represent the frequencies of occurrence of the k-th and j-th characters of class Ω_i, and K_1 is a fixed parameter, the magnification factor of the negative logarithmic probability of a character within the class.
Step 14, calculating the inter-class language probability according to the clustering result of the characters.
The inter-class language probability is calculated as:
P(Ω_i) = -K_2 × log( Σ_k n(C_{c_i,k}) / S )
where Ω_i represents the class, n(C_{c_i,k}) represents the frequency of occurrence of the k-th character of class Ω_i, S represents the sum of the frequencies of occurrence of all characters of all classes, and K_2 is a fixed parameter, the magnification factor of the negative logarithmic probability of the class.
Step 15, calculating the language probability of the characters from the intra-class language probability and the inter-class language probability.
The language probability of a character is calculated from the intra-class and inter-class language probabilities as:
P(C_{c_i,k}) = ω × P(C_{c_i,k} | Ω_i) + (1 − ω) × P(Ω_i), 0 ≤ ω ≤ 1,
where P(C_{c_i,k} | Ω_i) is the intra-class language probability of the k-th character of class Ω_i, and P(Ω_i) is the inter-class language probability of class Ω_i.
It can be seen from the above that the language probabilities of the characters calculated by the intra-class language probabilities and the inter-class language probabilities sufficiently reflect the language probability difference between different characters in the same class and the language probability difference between different classes, and therefore, the method of the embodiment of the present invention can improve the language discrimination between similar characters.
As shown in fig. 2, an embodiment of the present invention further provides a device for improving linguistic distinction between characters, including: clustering section 21, first calculating section 22, second calculating section 23, and third calculating section 24.
The clustering unit 21 is configured to identify character samples and cluster characters according to an identification result; the first calculating unit 22 is configured to calculate an intra-class language probability according to the clustering result of the characters; the second calculating unit 23 is configured to calculate an inter-class language probability according to the clustering result of the characters; the third calculating unit 24 is configured to calculate the language probability of the character according to the intra-class language probability and the inter-class language probability.
Wherein the clustering unit 21 may include: a character set acquisition module, used for recognizing the character samples and obtaining a character set; a first calculation module, used for calculating a first recognition similarity vector between each character in the character set and the character samples; a second calculation module, used for calculating, from the first recognition similarity vectors, a second recognition similarity vector between the character set and each character in the character set; and a classification module, used for clustering the character set according to the first and second recognition similarity vectors.
Further, the classification module may include: a setting submodule, used for setting the class center of each class according to the number of classes to be formed; a class identification submodule, used for labeling each character in the character set with the class of the class center nearest to it; and a class division submodule, used for placing characters with the same label into the same class.
The first calculating module is specifically configured to calculate the recognition similarity between each character in the character set and a character sample and to generate the first recognition similarity vector from these similarities. The second calculating module is specifically configured, for all character samples sharing the same character identity in the character set, to average their first recognition similarity vectors and use the average as the second recognition similarity vector.
It can be seen from the above that the language probabilities of the characters calculated by the intra-class language probabilities and the inter-class language probabilities sufficiently reflect the language probability difference between different characters in the same class and the language probability difference between different classes, and therefore, the device of the embodiment of the present invention can improve the language discrimination between similar characters.
To further improve efficiency, as shown in fig. 3, the apparatus according to the embodiment of the present invention may further include: a sample acquiring unit 20, configured to acquire the character sample.
The working principle of the device according to the embodiment of the present invention can refer to the description of the foregoing method embodiment.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims (11)

1. A method for improving linguistic distinctness between characters, comprising:
identifying character samples, wherein the character samples provide shape characteristics of characters, and clustering the characters according to identification results;
calculating the intra-class language probability according to the clustering result of the characters;
calculating the language probability among classes according to the clustering result of the characters;
calculating the language probability of the character according to the intra-class language probability and the inter-class language probability,
wherein, the recognizing the character sample and clustering the characters according to the recognition result comprises:
recognizing the character sample by using the character recognition core to obtain a character set;
calculating a first recognition similarity vector of each character in the character set and the character sample;
calculating a second recognition similarity vector of the character set and each character in the character set according to the first recognition similarity vector;
and clustering the character set according to the first recognition similarity vector and the second recognition similarity vector.
2. The method of claim 1, wherein the computing a first recognition similarity vector for each character in the character set to the character sample comprises:
and calculating the recognition similarity of each character in the character set and the character sample, and generating the first recognition similarity vector according to the recognition similarity.
3. The method of claim 1, wherein computing a second recognition similarity vector for each character in the character set and the character set based on the first recognition similarity vector comprises:
for the character samples having the same character of the character set, calculating the first recognition similarity vectors of that character and the character samples, and taking the average value of the first recognition similarity vectors as the second recognition similarity vector.
4. The method of claim 1, wherein the clustering the character set according to the first recognition similarity vector and the second recognition similarity vector comprises:
setting a class center of each classification according to the number of the classes to be classified;
for each character in the character set, labeling the character with the class of the class center nearest to it;
characters having the same identification are classified into the same category.
5. The method of claim 1, wherein the intra-class linguistic probability is calculated by:
P(C_{c_i,k} | Ω_i) = -K_1 × log( n(C_{c_i,k}) / Σ_j n(C_{c_i,j}) )
wherein Ω_i represents the class, n(C_{c_i,k}) represents the frequency of occurrence of the k-th character of class Ω_i, n(C_{c_i,j}) represents the frequency of occurrence of the j-th character of class Ω_i, and K_1 represents the magnification factor of the negative logarithmic probability of a character within the class.
6. The method of claim 1, wherein the inter-class linguistic probability is calculated by:
P(Ω_i) = -K_2 × log( Σ_k n(C_{c_i,k}) / S )
wherein Ω_i represents the class, n(C_{c_i,k}) represents the frequency of occurrence of the k-th character of class Ω_i, S represents the sum of the frequencies of occurrence of all characters of all classes, and K_2 represents the magnification factor of the negative logarithmic probability of class Ω_i.
7. The method of claim 1, wherein the computing the linguistic probability for the character based on the intra-class linguistic probability and the inter-class linguistic probability is by:
P(c) = ω × P(c | Ω_c) + (1 − ω) × P(Ω_c), 0 ≤ ω ≤ 1,
wherein P(c | Ω_c) is the intra-class language probability, P(Ω_c) is the inter-class language probability, c denotes the character, and Ω_c denotes the class containing character c.
8. An apparatus for improving linguistic distinction between characters, comprising:
the clustering unit is used for recognizing character samples, wherein the character samples provide the shape characteristics of the characters, and clustering the characters according to the recognition result;
the first calculating unit is used for calculating the intra-class language probability according to the clustering result of the characters;
the second calculating unit is used for calculating the language probability among the classes according to the clustering result of the characters;
a third calculation unit for calculating the language probability of the character based on the intra-class language probability and the inter-class language probability,
wherein the clustering unit includes:
the character set acquisition module is used for recognizing the character samples by using the character recognition core to obtain a character set;
a first calculation module, configured to calculate a first recognition similarity vector between each character in the character set and the character sample;
the second calculation module is used for calculating a second recognition similarity vector of each character in the character set and the character set according to the first recognition similarity vector;
and the classification module is used for clustering the character set according to the first recognition similarity vector and the second recognition similarity vector.
9. The apparatus according to claim 8, wherein the first calculating module is specifically configured to calculate a recognition similarity between each character in the character set and the character sample, and generate the first recognition similarity vector according to the recognition similarity.
10. The apparatus of claim 8, wherein the second computing module is specifically configured to compute, for a character sample having a same character in the character set, a first recognition similarity vector between the character in the character set and the character sample, and use an average of the first recognition similarity vector as the second recognition similarity vector.
11. The apparatus of claim 8, wherein the classification module comprises:
the setting submodule is used for setting the class center of each class according to the number of classes to be formed;
the class identification submodule is used for labeling each character in the character set with the class of the class center nearest to it;
and the class division submodule is used for dividing the characters with the same identification into the same class.
CN 201010218319 2010-06-24 2010-06-24 Method and device for improving language discrimination among characters Active CN101882226B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201010218319 CN101882226B (en) 2010-06-24 2010-06-24 Method and device for improving language discrimination among characters

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201010218319 CN101882226B (en) 2010-06-24 2010-06-24 Method and device for improving language discrimination among characters

Publications (2)

Publication Number Publication Date
CN101882226A CN101882226A (en) 2010-11-10
CN101882226B true CN101882226B (en) 2013-07-24

Family

ID=43054238

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201010218319 Active CN101882226B (en) 2010-06-24 2010-06-24 Method and device for improving language discrimination among characters

Country Status (1)

Country Link
CN (1) CN101882226B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1342967A (en) * 2000-09-13 2002-04-03 中国科学院自动化研究所 Unified recognizing method for multi-speed working pattern
CN1346112A (en) * 2000-09-27 2002-04-24 中国科学院自动化研究所 Integrated prediction searching method for Chinese continuous speech recognition
CN1369877A (en) * 2000-10-04 2002-09-18 微软公司 Method and system for identifying property of new word in non-divided text
CN101090461A (en) * 2006-06-13 2007-12-19 中国科学院计算技术研究所 Automatic translation method for digital video captions
CN101095138A (en) * 2004-09-30 2007-12-26 谷歌公司 Methods and systems for selecting a language for text segmentation
CN101707873A (en) * 2007-03-26 2010-05-12 谷歌公司 Large language models in the mechanical translation

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1342967A (en) * 2000-09-13 2002-04-03 中国科学院自动化研究所 Unified recognizing method for multi-speed working pattern
CN1346112A (en) * 2000-09-27 2002-04-24 中国科学院自动化研究所 Integrated prediction searching method for Chinese continuous speech recognition
CN1369877A (en) * 2000-10-04 2002-09-18 微软公司 Method and system for identifying property of new word in non-divided text
CN101095138A (en) * 2004-09-30 2007-12-26 谷歌公司 Methods and systems for selecting a language for text segmentation
CN101090461A (en) * 2006-06-13 2007-12-19 中国科学院计算技术研究所 Automatic translation method for digital video captions
CN101707873A (en) * 2007-03-26 2010-05-12 谷歌公司 Large language models in the mechanical translation

Also Published As

Publication number Publication date
CN101882226A (en) 2010-11-10

Similar Documents

Publication Publication Date Title
CN109918673B (en) Semantic arbitration method and device, electronic equipment and computer-readable storage medium
CN104167208B (en) A kind of method for distinguishing speek person and device
CN105702251B (en) Reinforce the speech-emotion recognition method of audio bag of words based on Top-k
CN105389593A (en) Image object recognition method based on SURF
CN109189892B (en) Recommendation method and device based on article comments
CN110853648B (en) Bad voice detection method and device, electronic equipment and storage medium
CN110134777B (en) Question duplication eliminating method and device, electronic equipment and computer readable storage medium
CN110287311B (en) Text classification method and device, storage medium and computer equipment
CN108038208B (en) Training method and device of context information recognition model and storage medium
CN113254643B (en) Text classification method and device, electronic equipment and text classification program
CN107180084A (en) Word library updating method and device
WO2014022172A2 (en) Information classification based on product recognition
CN111506726B (en) Short text clustering method and device based on part-of-speech coding and computer equipment
CN104200814A (en) Speech emotion recognition method based on semantic cells
CN105609116A (en) Speech emotional dimensions region automatic recognition method
CN115982144A (en) Similar text duplicate removal method and device, storage medium and electronic device
CN110867180B (en) System and method for generating word-by-word lyric file based on K-means clustering algorithm
CN117235137B (en) Professional information query method and device based on vector database
CN114416991A (en) Method and system for analyzing text emotion reason based on prompt
CN111125329B (en) Text information screening method, device and equipment
CN105006231A (en) Distributed large population speaker recognition method based on fuzzy clustering decision tree
CN101882226B (en) Method and device for improving language discrimination among characters
CN116070642A (en) Text emotion analysis method and related device based on expression embedding
Dileep et al. Speaker recognition using pyramid match kernel based support vector machines
CN112071304B (en) Semantic analysis method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant