CN101882226B - Method and device for improving language discrimination among characters - Google Patents
Method and device for improving language discrimination among characters
- Publication number
- CN101882226B (application CN201010218319A)
- Authority
- CN
- China
- Prior art keywords
- character
- class
- characters
- probability
- similarity vector
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Abstract
The embodiment of the invention discloses a method and a device for improving language discrimination among characters, relating to information processing technology and aiming to improve the language discrimination among similar characters. The method comprises the following steps: recognizing character samples and clustering the characters according to the recognition results; computing the intra-class language probability according to the clustering result of the characters; computing the inter-class language probability according to the clustering result of the characters; and computing the language probability of each character according to the intra-class language probability and the inter-class language probability. The embodiment of the invention is mainly used in character recognition or phrase recognition technology.
Description
Technical Field
The present invention relates to information processing technologies, and in particular, to a method and an apparatus for improving linguistic distinction between characters.
Background
In pattern recognition technology, the characters that can be recognized include English, simplified Chinese, traditional Chinese, Arabic, Greek letters, various symbols, and the like. Among these characters there are a large number of similar pairs, such as the visually similar Chinese characters "己" and "已". When recognizing such similar characters, it is difficult to reliably select the correct character using character recognition techniques alone. To distinguish these similar characters, the probability statistics technique of the language model is widely used in pattern recognition.
In the course of implementing the invention, the inventors found that the probability statistics technique of the language model computes language probabilities over all characters; a statistical method that lumps together all characters, including dissimilar ones, still has difficulty accurately distinguishing the similar characters.
Disclosure of Invention
The embodiment of the invention provides a method and a device for improving the language discrimination between characters, which are used for improving the language discrimination between similar characters.
The embodiment of the invention adopts the following technical scheme:
a method of improving linguistic distinctness between characters, comprising:
identifying character samples, and clustering characters according to identification results;
calculating the intra-class language probability according to the clustering result of the characters;
calculating the inter-class language probability according to the clustering result of the characters; and
calculating the language probability of each character according to the intra-class language probability and the inter-class language probability.
An apparatus for improving linguistic distinction between characters, comprising:
the clustering unit is used for identifying the character samples and clustering the characters according to the identification result;
the first calculating unit is used for calculating the intra-class language probability according to the clustering result of the characters;
the second calculating unit is used for calculating the inter-class language probability according to the clustering result of the characters;
and the third calculating unit is used for calculating the language probability of each character according to the intra-class language probability and the inter-class language probability.
The method and the device for improving the language discrimination between the characters in the embodiment of the invention firstly classify the characters by the recognition technology, then obtain the intra-class language probability between different characters in the same class and the inter-class language probability between different classes, and finally obtain the language probability of the characters. That is to say, the language probabilities of the characters calculated by the intra-class language probabilities and the inter-class language probabilities sufficiently represent the language probability difference between different characters in the same class and the language probability difference between different classes, so that the method and the device of the embodiment of the invention can improve the language discrimination between similar characters.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a flowchart illustrating a method for improving linguistic distinction between characters according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating an apparatus for improving linguistic distinction between characters according to an embodiment of the present invention;
FIG. 3 is a block diagram of an apparatus for improving linguistic distinction between characters according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments that can be derived by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.
In the method for improving the language discrimination between characters, the character samples are first identified by a recognition technique and the characters are clustered according to the identification results; then the intra-class language probability and the inter-class language probability are obtained from the clustering results of the characters; finally, the language probability of each character is calculated from the intra-class language probability and the inter-class language probability.
According to the technical scheme, the language probabilities of the characters calculated through the intra-class language probabilities and the inter-class language probabilities fully reflect the language probability difference between different characters in the same class and the language probability difference between different classes, and therefore the method can improve the language discrimination between similar characters.
The following describes the implementation process of the above technical solution in detail with reference to specific embodiments.
As shown in fig. 1, a method for improving linguistic distinction between characters in an embodiment of the present invention includes:
and 11, acquiring character samples D (c, k), wherein c is the identity of the samples, and k represents the sequence of the samples with the same identity.
The character sample provides shape characteristics of characters, and is the basis of a character recognition-based clustering method, and the corpus provides language probability materials. In this embodiment, when collecting the character samples, the character recognition program may be used to automatically label the character samples, manually correct the labeling result of the character samples, and automatically count the frequency of the characters appearing in the corpus.
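By way of illustration, the corpus frequency counting mentioned above can be sketched in Python as follows (a minimal sketch; the toy corpus and function name are illustrative, not part of the patent):

```python
from collections import Counter

def count_character_frequencies(corpus: str) -> Counter:
    """Count how often each character occurs in the corpus.  These
    frequencies feed the intra-class and inter-class language
    probabilities computed in the later steps."""
    return Counter(ch for ch in corpus if not ch.isspace())

# Hypothetical miniature corpus, purely for illustration.
freq = count_character_frequencies("the cat sat on the mat")
print(freq["t"])  # the character 't' occurs 5 times
```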
Step 12: recognizing the character samples, and clustering the characters according to the recognition result.
In this embodiment, the step may include:
step 121, recognizing the character sample D (C, k) by using the character recognition core, obtaining a character set C: { C1,C2,K,CNWhere N is the total number of characters in the character set.
Step 122: calculating, for each character Ci in the character set, a first recognition similarity vector with the character sample D(c, k): Z(Ci, k) = (z1(Ci, k), z2(Ci, k), …, zN(Ci, k))^T.
Specifically, the character samples are recognized with the character recognition core, the recognition similarity zi(c, k) between each character Ci in the character set and the character sample D(c, k) is calculated, and the first recognition similarity vector Z(Ci, k) = (z1(Ci, k), z2(Ci, k), …, zN(Ci, k))^T is formed from these similarities.
Step 123: calculating, from the first recognition similarity vectors Z(Ci, k) = (z1(Ci, k), z2(Ci, k), …, zN(Ci, k))^T, a second recognition similarity vector Y(Ci) = (y1(Ci), y2(Ci), …, yN(Ci))^T between the character set C = {C1, C2, …, CN} and each character Ci in the character set.
Specifically, for all character samples sharing the same identity Ci, the first recognition similarity vectors Z(Ci, k) calculated in step 122 are averaged, and the average is taken as the second recognition similarity vector: Y(Ci) = (1/n(Ci)) Σk Z(Ci, k), where n(Ci) is the total number of character samples with identity Ci.
Since it may happen that yi(Cj) ≠ yj(Ci), the symmetrized quantity xi,j = (yi(Cj) + yj(Ci)) / 2 is introduced, so that the second recognition similarity vector can be represented as Xi = (x1,i, x2,i, …, xN,i)^T.
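Steps 122–123 can be sketched as follows, assuming the recognition core has already produced the per-sample first similarity vectors (the array layout and function name are illustrative assumptions, not the patent's implementation):

```python
import numpy as np

def second_similarity_vectors(Z: dict) -> np.ndarray:
    """Form the second recognition similarity vectors by averaging the
    first vectors over samples, then symmetrize them.

    Z maps character index j -> array of shape (n_j, N), holding the
    first recognition similarity vectors Z(C_j, k) for all n_j samples
    of character C_j.  Returns X of shape (N, N), whose column i is
    X_i = (x_{1,i}, ..., x_{N,i})^T with
    x_{i,j} = (y_i(C_j) + y_j(C_i)) / 2.
    """
    N = len(Z)
    # Row j holds Y(C_j): the mean of the first vectors of C_j's samples.
    Y = np.vstack([Z[j].mean(axis=0) for j in range(N)])
    # Symmetrize, since y_i(C_j) and y_j(C_i) need not be equal.
    return (Y + Y.T) / 2.0
```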
Step 124: clustering the character set according to the first recognition similarity vectors and the second recognition similarity vectors.
First, a class center is set for each class according to the number of classes to be formed. Then each character in the character set is assigned to the class whose class center is nearest to it. This process is repeated until the class centers no longer change. Finally, characters with the same identification are placed in the same class. The clustering result can be expressed as {Ω1, Ω2, …, ΩM}, where Ωi denotes a class; the characters in class Ωi can be written in the general form c_k^i, i.e., c_k^i denotes the k-th character of the i-th class Ωi. The number of characters in each class is determined by the clustering result of step 124, but the total number of characters over all classes equals the total number N of characters in the character set.
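The iterative class-center procedure of step 124 behaves like k-means over the symmetrized similarity vectors; a compact sketch under that reading (the deterministic initialization and the class count M are illustrative choices, not prescribed by the patent):

```python
import numpy as np

def cluster_characters(X: np.ndarray, M: int, iters: int = 100) -> np.ndarray:
    """Assign each character (column i of X is its similarity vector X_i)
    to one of M classes by nearest class center, iterating until the
    class centers no longer change."""
    vecs = X.T.copy()                     # one N-dim vector per character
    step = max(1, len(vecs) // M)
    centers = vecs[::step][:M].copy()     # simple deterministic init
    labels = np.zeros(len(vecs), dtype=int)
    for it in range(iters):
        # Assign every character to its nearest class center.
        d = np.linalg.norm(vecs[:, None, :] - centers[None, :, :], axis=2)
        new_labels = d.argmin(axis=1)
        if it > 0 and np.array_equal(new_labels, labels):
            break                         # centers stable: done
        labels = new_labels
        # Move each center to the mean of its current members.
        for m in range(M):
            if (labels == m).any():
                centers[m] = vecs[labels == m].mean(axis=0)
    return labels
```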
Step 13: calculating the intra-class language probability according to the clustering result of the characters.
The intra-class language probability is calculated as:
P(c_k^i | Ωi) = −K1 · log( f(c_k^i) / Σj f(c_j^i) ),
where Ωi denotes a class, f(c_k^i) and f(c_j^i) denote the occurrence frequencies of the k-th and j-th characters in class Ωi (the denominator sums over all characters of class Ωi), and K1 is a fixed parameter, a magnification factor for the negative log probability of a character in the class.
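Reading the intra-class probability as the negative log of a character's relative frequency within its class, amplified by K1 — an interpretation consistent with the variable definitions above, since the original formula image is not reproduced in the text — a sketch:

```python
import math

def intra_class_score(freqs: list, k: int, K1: float = 1.0) -> float:
    """Negative-log relative frequency of the k-th character within its
    class, amplified by K1.  freqs holds the occurrence frequencies of
    all characters in the class (an interpretation, not the patent's
    literal formula)."""
    total = sum(freqs)          # sum over all characters of the class
    return -K1 * math.log(freqs[k] / total)
```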
Step 14: calculating the inter-class language probability according to the clustering result of the characters.
The inter-class language probability is calculated as:
P(Ωi) = −K2 · log( Σk f(c_k^i) / S ),
where Ωi denotes a class, f(c_k^i) denotes the occurrence frequency of the k-th character in class Ωi, S denotes the sum of the occurrence frequencies of all characters of all classes, and K2 is a fixed parameter, a magnification factor for the negative log probability of a character in the class.
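Under the same reading, the inter-class probability is the negative-log share of a class's total frequency in the grand total S (again an interpretation of the variable definitions, as the formula image is not reproduced):

```python
import math

def inter_class_score(class_freqs: list, i: int, K2: float = 1.0) -> float:
    """Negative-log share of class i's total character frequency in the
    grand total S over all classes, amplified by K2 (an interpretation
    of the patent's variable definitions)."""
    S = sum(sum(f) for f in class_freqs)   # total frequency of all classes
    return -K2 * math.log(sum(class_freqs[i]) / S)
```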
Step 15: calculating the language probability of each character according to the intra-class language probability and the inter-class language probability.
The language probability of a character c is calculated from the intra-class and inter-class language probabilities as:
P(c) = ω × P(c | Ωc) + (1 − ω) × P(Ωc), with 0 ≤ ω ≤ 1,
where P(c | Ωc) is the intra-class language probability of character c within its class Ωc, and P(Ωc) is the inter-class language probability of class Ωc.
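The combination in step 15 matches the weighted form of claim 7, P(c) = ω·P(c|Ωc) + (1−ω)·P(Ωc); a direct sketch (the default value of ω here is an arbitrary illustration):

```python
def language_probability(intra: float, inter: float, omega: float = 0.5) -> float:
    """Combine the intra-class and inter-class language probabilities:
    P(c) = omega * P(c|Omega_c) + (1 - omega) * P(Omega_c), 0 <= omega <= 1
    (the combination given in claim 7; omega is a tunable weight)."""
    if not 0.0 <= omega <= 1.0:
        raise ValueError("omega must lie in [0, 1]")
    return omega * intra + (1.0 - omega) * inter
```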
It can be seen from the above that the language probabilities of the characters calculated by the intra-class language probabilities and the inter-class language probabilities sufficiently reflect the language probability difference between different characters in the same class and the language probability difference between different classes, and therefore, the method of the embodiment of the present invention can improve the language discrimination between similar characters.
As shown in fig. 2, an embodiment of the present invention further provides a device for improving linguistic distinction between characters, including: clustering section 21, first calculating section 22, second calculating section 23, and third calculating section 24.
The clustering unit 21 is configured to identify character samples and cluster characters according to an identification result; the first calculating unit 22 is configured to calculate an intra-class language probability according to the clustering result of the characters; the second calculating unit 23 is configured to calculate an inter-class language probability according to the clustering result of the characters; the third calculating unit 24 is configured to calculate the language probability of the character according to the intra-class language probability and the inter-class language probability.
Wherein the clustering unit 21 may include: the character set acquisition module is used for identifying the character samples and acquiring a character set; a first calculation module, configured to calculate a first recognition similarity vector between each character in the character set and the character sample; the second calculation module is used for calculating a second recognition similarity vector of each character in the character set and the character set according to the first similarity vector; and the classification module is used for clustering the character set according to the first identification similarity vector and the second identification similarity vector.
Further, the classification module may include: a setting submodule, configured to set a class center for each class according to the number of classes to be formed; a class identification submodule, configured to label each character in the character set with the class of the class center nearest to it; and a class division submodule, configured to divide characters with the same identification into the same class.
The first calculating module is specifically configured to calculate a recognition similarity between each character in the character set and the character sample, and generate the first recognition similarity vector according to the recognition similarity. The second calculating module is specifically configured to calculate, for a character sample having a same character in the character set, a first recognition similarity vector between the character and the character sample, and use an average value of the first recognition similarity vector as the second recognition similarity vector.
It can be seen from the above that the language probabilities of the characters calculated by the intra-class language probabilities and the inter-class language probabilities sufficiently reflect the language probability difference between different characters in the same class and the language probability difference between different classes, and therefore, the device of the embodiment of the present invention can improve the language discrimination between similar characters.
To further improve efficiency, as shown in fig. 3, the apparatus according to the embodiment of the present invention may further include: a sample acquiring unit 20, configured to acquire the character sample.
The working principle of the device according to the embodiment of the present invention can refer to the description of the foregoing method embodiment.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.
Claims (11)
1. A method for improving linguistic distinctness between characters, comprising:
identifying character samples, wherein the character samples provide shape characteristics of characters, and clustering the characters according to identification results;
calculating the intra-class language probability according to the clustering result of the characters;
calculating the language probability among classes according to the clustering result of the characters;
calculating the language probability of the character according to the intra-class language probability and the inter-class language probability,
wherein, the recognizing the character sample and clustering the characters according to the recognition result comprises:
recognizing the character sample by using the character recognition core to obtain a character set;
calculating a first recognition similarity vector of each character in the character set and the character sample;
calculating a second recognition similarity vector of the character set and each character in the character set according to the first recognition similarity vector;
and clustering the character set according to the first identification similarity vector and the second identification similarity vector.
2. The method of claim 1, wherein the computing a first recognition similarity vector for each character in the character set to the character sample comprises:
and calculating the recognition similarity of each character in the character set and the character sample, and generating the first recognition similarity vector according to the recognition similarity.
3. The method of claim 1, wherein computing a second recognition similarity vector for each character in the character set and the character set based on the first recognition similarity vector comprises:
and for a character sample with the same character in the character set, calculating a first recognition similarity vector of the character in the character set and the character sample, and taking the average value of the first recognition similarity vector as the second recognition similarity vector.
4. The method of claim 1, wherein the clustering the character set according to the first recognition similarity vector and the second recognition similarity vector comprises:
setting a class center of each classification according to the number of the classes to be classified;
for each character in the character set, when the distance between the character and the class center is the shortest, identifying the character by using the class to which the class center belongs;
characters having the same identification are classified into the same category.
5. The method of claim 1, wherein the intra-class linguistic probability is calculated by:
P(c_k^i | Ωi) = −K1 · log( f(c_k^i) / Σj f(c_j^i) ),
wherein Ωi denotes a class, f(c_k^i) and f(c_j^i) denote the occurrence frequencies of the k-th and j-th characters in class Ωi, and K1 denotes a magnification factor of the negative log probability of a character in the class.
6. The method of claim 1, wherein the inter-class linguistic probability is calculated by:
P(Ωi) = −K2 · log( Σk f(c_k^i) / S ),
wherein Ωi denotes a class, f(c_k^i) denotes the occurrence frequency of the k-th character in class Ωi, S denotes the sum of the occurrence frequencies of all characters of all classes, and K2 denotes the magnification factor of the negative log probability of a character in class Ωi.
7. The method of claim 1, wherein the computing the linguistic probability for the character based on the intra-class linguistic probability and the inter-class linguistic probability is by:
P(c)=ω×P(c|Ωc)+(1-ω)×P(Ωc),0≤ω≤1,
wherein P(c | Ωc) is the intra-class language probability, P(Ωc) is the inter-class language probability, c denotes the character, and Ωc denotes the class in which character c is located.
8. An apparatus for improving linguistic distinction between characters, comprising:
the clustering unit is used for identifying character samples, wherein the character samples provide shape characteristics of characters, and for clustering the characters according to an identification result;
the first calculating unit is used for calculating the intra-class language probability according to the clustering result of the characters;
the second calculating unit is used for calculating the language probability among the classes according to the clustering result of the characters;
a third calculation unit for calculating the language probability of the character based on the intra-class language probability and the inter-class language probability,
wherein the clustering unit includes:
the character set acquisition module is used for identifying the character samples by utilizing the character identification core to obtain a character set;
a first calculation module, configured to calculate a first recognition similarity vector between each character in the character set and the character sample;
the second calculation module is used for calculating a second recognition similarity vector of each character in the character set and the character set according to the first recognition similarity vector;
and the classification module is used for clustering the character set according to the first identification similarity vector and the second identification similarity vector.
9. The apparatus according to claim 8, wherein the first calculating module is specifically configured to calculate a recognition similarity between each character in the character set and the character sample, and generate the first recognition similarity vector according to the recognition similarity.
10. The apparatus of claim 8, wherein the second computing module is specifically configured to compute, for a character sample having a same character in the character set, a first recognition similarity vector between the character in the character set and the character sample, and use an average of the first recognition similarity vector as the second recognition similarity vector.
11. The apparatus of claim 8, wherein the classification module comprises:
the setting submodule is used for setting a class center for each class according to the number of classes to be formed;
the class identification submodule is used for identifying each character in the character set with the class to which the nearest class center belongs;
and the class division submodule is used for dividing the characters with the same identification into the same class.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN 201010218319 CN101882226B (en) | 2010-06-24 | 2010-06-24 | Method and device for improving language discrimination among characters |
Publications (2)
Publication Number | Publication Date |
---|---|
CN101882226A CN101882226A (en) | 2010-11-10 |
CN101882226B true CN101882226B (en) | 2013-07-24 |
Family
ID=43054238
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN 201010218319 Active CN101882226B (en) | 2010-06-24 | 2010-06-24 | Method and device for improving language discrimination among characters |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN101882226B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1342967A (en) * | 2000-09-13 | 2002-04-03 | 中国科学院自动化研究所 | Unified recognizing method for multi-speed working pattern |
CN1346112A (en) * | 2000-09-27 | 2002-04-24 | 中国科学院自动化研究所 | Integrated prediction searching method for Chinese continuous speech recognition |
CN1369877A (en) * | 2000-10-04 | 2002-09-18 | 微软公司 | Method and system for identifying property of new word in non-divided text |
CN101090461A (en) * | 2006-06-13 | 2007-12-19 | 中国科学院计算技术研究所 | Automatic translation method for digital video captions |
CN101095138A (en) * | 2004-09-30 | 2007-12-26 | 谷歌公司 | Methods and systems for selecting a language for text segmentation |
CN101707873A (en) * | 2007-03-26 | 2010-05-12 | 谷歌公司 | Large language models in machine translation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant |