CN101882226B - Method and device for improving language discrimination among characters - Google Patents

Method and device for improving language discrimination among characters

Info

Publication number
CN101882226B
CN101882226B · CN201010218319A · CN 201010218319
Authority
CN
China
Prior art keywords
character
class
characters
probability
similarity vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN 201010218319
Other languages
Chinese (zh)
Other versions
CN101882226A (en)
Inventor
郭育生
邹明福
王利娟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hanwang Technology Co Ltd
Original Assignee
Hanwang Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hanwang Technology Co Ltd filed Critical Hanwang Technology Co Ltd
Priority to CN 201010218319 priority Critical patent/CN101882226B/en
Publication of CN101882226A publication Critical patent/CN101882226A/en
Application granted granted Critical
Publication of CN101882226B publication Critical patent/CN101882226B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Character Discrimination (AREA)

Abstract

The embodiment of the invention discloses a method and a device for improving the language discrimination among characters. The invention relates to information processing technology and aims to improve the language discrimination among similar characters. The method comprises the following steps: recognizing character samples and clustering the characters according to the recognition results; computing the intra-class language probability from the clustering result of the characters; computing the inter-class language probability from the clustering result of the characters; and computing the language probability of each character from the intra-class and inter-class language probabilities. The embodiment of the invention is mainly used in character recognition and phrase recognition technology.

Description

Method and device for improving language discrimination between characters
Technical Field
The present invention relates to information processing technologies, and in particular, to a method and an apparatus for improving linguistic distinction between characters.
Background
In pattern recognition technology, the characters that can be recognized include English, simplified Chinese, traditional Chinese, Arabic, Greek letters, various symbols, and the like. Among these characters there are a large number of similar characters, such as the visually similar Chinese characters "已" and "己". When recognizing such similar characters, it is difficult to identify the correct character efficiently using a shape-based character recognition technique alone. In order to distinguish these similar characters, the probability statistics technique of the language model is widely used in pattern recognition.
In the process of implementing the invention, the inventors found that the probability statistics technique of the language model counts language probabilities over all characters; a language probability estimated over all characters, including dissimilar ones, still makes it difficult to distinguish similar characters accurately.
Disclosure of Invention
The embodiment of the invention provides a method and a device for improving the language discrimination between characters, which are used for improving the language discrimination between similar characters.
The embodiment of the invention adopts the following technical scheme:
a method of improving linguistic distinctness between characters, comprising:
identifying character samples, and clustering characters according to identification results;
calculating the intra-class language probability according to the clustering result of the characters;
calculating the language probability among classes according to the clustering result of the characters;
and calculating the language probability of the character according to the language probability in the class and the language probability among the classes.
An apparatus for improving linguistic distinction between characters, comprising:
the clustering unit is used for identifying the character samples and clustering the characters according to the identification result;
the first calculating unit is used for calculating the intra-class language probability according to the clustering result of the characters;
the second calculating unit is used for calculating the language probability among the classes according to the clustering result of the characters;
and the third calculating unit is used for calculating the language probability of the character according to the language probability in the class and the language probability among the classes.
The method and the device for improving the language discrimination between characters in the embodiments of the invention first classify the characters by a recognition technique, then obtain the intra-class language probabilities between different characters of the same class and the inter-class language probabilities between different classes, and finally obtain the language probability of each character. That is, the character language probabilities calculated from the intra-class and inter-class language probabilities fully reflect both the probability differences between different characters of the same class and the probability differences between different classes, so the method and device of the embodiments of the invention can improve the language discrimination between similar characters.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a flowchart illustrating a method for improving linguistic distinction between characters according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating an apparatus for improving linguistic distinction between characters according to an embodiment of the present invention;
FIG. 3 is a block diagram of an apparatus for improving linguistic distinction between characters according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the method for improving the language discrimination between characters, character samples are first recognized by a recognition technique and the characters are clustered according to the recognition results; the intra-class language probability and the inter-class language probability are then obtained from the clustering results of the characters; finally, the language probability of each character is calculated from the intra-class and inter-class language probabilities.
According to the technical scheme, the language probabilities of the characters calculated through the intra-class language probabilities and the inter-class language probabilities fully reflect the language probability difference between different characters in the same class and the language probability difference between different classes, and therefore the method can improve the language discrimination between similar characters.
The following describes the implementation process of the above technical solution in detail with reference to specific embodiments.
As shown in fig. 1, a method for improving linguistic distinction between characters in an embodiment of the present invention includes:
and 11, acquiring character samples D (c, k), wherein c is the identity of the samples, and k represents the sequence of the samples with the same identity.
The character samples provide the shape characteristics of the characters and are the basis of the recognition-based clustering method, while the corpus provides the material for the language probabilities. In this embodiment, when collecting the character samples, a character recognition program may be used to label the character samples automatically, the labeling results of the character samples may be corrected manually, and the frequency with which the characters appear in the corpus may be counted automatically.
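Counting how often each character occurs in the corpus, as described above, can be sketched as follows (a minimal illustration; the helper name and the toy corpus lines are made up, not from the patent):

```python
from collections import Counter

def character_frequencies(corpus_lines):
    """Count how often each character appears in a text corpus.

    These counts supply the frequencies n(C) used later when computing
    the intra-class and inter-class language probabilities.
    """
    counts = Counter()
    for line in corpus_lines:
        # skip whitespace; every other character is counted as-is
        counts.update(ch for ch in line if not ch.isspace())
    return counts

corpus = ["the cat sat", "the hat"]
freq = character_frequencies(corpus)
# 't' occurs 5 times across the two lines
```

The same counter works unchanged for Chinese text, since Python iterates strings by code point.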
Step 12, recognizing the character samples and clustering the characters according to the recognition results.
In this embodiment, the step may include:
step 121, recognizing the character sample D (C, k) by using the character recognition core, obtaining a character set C: { C1,C2,K,CNWhere N is the total number of characters in the character set.
Step 122, calculating each character C in the character setiA first recognition similarity vector Z (C) with the character sample D (C, k)i,k)=(z1(Ci,k),z2(Ci,k),K,zN(Ci,k))T
In the process, specifically, character samples are recognized by using a character recognition core, and each character C in the character set is calculatediRecognition similarity z with the character sample D (c, k)i(C, k) and generating the first recognition similarity vector Z (C) according to the recognition similarityi,k)=(z1(Ci,k),z2(Ci,k),K,zN(Ci,k))T
Step 123, according to the first similarity vector Z (C)i,k)=(z1(Ci,k),z2(Ci,k),K,zN(Ci,k))TAnd calculating the character set C: { C1,C2,K,CNWith each character C in said character setiSecond recognition similarity vector Y (C)i)=(y1(Ci),y2(Ci),K,yN(Ci))T
In this process, specifically, for the character samples having the same character in the character set, that is, having the same identity CiAccording to the character C calculated in step 122iA first recognition similarity vector Z (C) with the character samplei,k)=(z1(Ci,k),z2(Ci,k),K,zN(Ci,k))TTaking the average value of the first recognition similarity vector as the second recognition similarity vector Y (C)i)=(y1(Ci),y2(Ci),K,yN(Ci))TWhereinn(Ci) Is identity CiTotal number of character samples.
Due to the possible presence of yi(Cj)≠yj(Ci) Thus introducing xi,j=(yi(Cj)+yj(Ci) 2) such that the second recognition similarity vector can be represented as Xi=(x1,i,x2,i,K,xN,i)T
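The averaging of step 123 and the symmetrisation can be sketched as follows (an illustrative sketch assuming the recognition similarities are already available as a NumPy array; the function name, array layout, and toy values are hypothetical):

```python
import numpy as np

def second_similarity_vectors(Z, identities, n_chars):
    """Average the first similarity vectors Z(c, k) over all samples
    sharing the identity C_i, giving Y(C_i); then symmetrise to X.

    Z          : (num_samples, N) array, Z[s, j] = z_j for sample s
    identities : length-num_samples array of character indices in [0, N)
    Returns the symmetric N x N matrix whose entry (i, j) is x_{i,j}.
    """
    Y = np.zeros((n_chars, n_chars))
    for i in range(n_chars):
        mask = identities == i
        Y[i] = Z[mask].mean(axis=0)   # y_j(C_i): mean over samples of C_i
    X = (Y + Y.T) / 2.0               # x_{i,j} = (y_i(C_j) + y_j(C_i)) / 2
    return X

# hypothetical toy data: N = 2 characters, three samples
Z = np.array([[1.0, 0.0],
              [0.5, 0.5],
              [0.2, 0.8]])
identities = np.array([0, 0, 1])      # first two samples share identity C_1
X = second_similarity_vectors(Z, identities, 2)
```

Column i of the resulting matrix is the symmetrised vector X_i used by the clustering step.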
Step 124, clustering the character set according to the first and second recognition similarity vectors.
Firstly, a class center is set for each of the classes to be formed. Then each character in the character set is assigned to the class whose center is nearest to it, and the class centers are re-estimated. This process is repeated until the class centers no longer change. Finally, characters with the same identification are placed in the same class. The result of clustering the character set can be expressed as:
Ω_1 = {C_{c_1,1}, C_{c_1,2}, …}, Ω_2 = {C_{c_2,1}, C_{c_2,2}, …}, …, Ω_M = {C_{c_M,1}, C_{c_M,2}, …}
where Ω_i represents the i-th class, C_{c_i,k} represents the k-th character of class Ω_i, and M is the number of classes. The number of characters in each class is determined by the clustering of step 124, but the total number of characters over all classes equals the total number N of characters in the character set.
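The iterative assignment of step 124 can be sketched as a k-means-style loop (an illustrative sketch; initialising the centers from the first rows and using Euclidean distance are assumptions, since the patent does not fix an initialisation scheme or metric):

```python
import numpy as np

def cluster_characters(X, n_classes, n_iter=100):
    """Cluster characters by their symmetrised similarity vectors:
    assign each character to its nearest class center, re-estimate the
    centers, and repeat until the assignment no longer changes.

    X is the N x N symmetric similarity matrix; row i is taken as X_i.
    Returns a length-N array: labels[i] is the class of character C_i.
    """
    vectors = np.asarray(X, dtype=float)
    N = vectors.shape[0]
    centers = vectors[:n_classes].copy()      # assumed initialisation
    labels = np.full(N, -1)
    for _ in range(n_iter):
        # distance of every character vector to every class center
        dists = np.linalg.norm(vectors[:, None, :] - centers[None, :, :],
                               axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                             # centers stable: done
        labels = new_labels
        for c in range(n_classes):            # re-estimate centers
            if (labels == c).any():
                centers[c] = vectors[labels == c].mean(axis=0)
    return labels

# hypothetical 4-character set: characters 0,1 are mutually similar,
# as are characters 2,3
X_demo = np.array([[1.0, 0.9, 0.1, 0.1],
                   [0.9, 1.0, 0.1, 0.1],
                   [0.1, 0.1, 1.0, 0.9],
                   [0.1, 0.1, 0.9, 1.0]])
labels = cluster_characters(X_demo, 2)
```

With the toy matrix above, the two similar pairs end up in separate classes, which is exactly the grouping the intra-class language model then operates on.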
Step 13, calculating the intra-class language probability according to the clustering result of the characters.
The intra-class language probability is calculated as:
P(C_{c_i,k} | Ω_i) = -K_1 × log( n(C_{c_i,k}) / Σ_j n(C_{c_i,j}) )
where Ω_i represents the class, n(C_{c_i,k}) and n(C_{c_i,j}) represent the frequencies of occurrence of the k-th and j-th characters of class Ω_i, and K_1 is a fixed parameter, the magnification factor of the negative logarithmic probability of a character within the class.
Step 14, calculating the inter-class language probability according to the clustering result of the characters.
The inter-class language probability is calculated as:
P(Ω_i) = -K_2 × log( Σ_k n(C_{c_i,k}) / S )
where Ω_i represents the class, n(C_{c_i,k}) represents the frequency of occurrence of the k-th character of class Ω_i, S represents the sum of the frequencies of occurrence of all characters of all classes, and K_2 is a fixed parameter, the magnification factor of the negative logarithmic probability of the class.
Step 15, calculating the language probability of the characters from the intra-class language probability and the inter-class language probability.
The language probability of a character is calculated from the intra-class and inter-class language probabilities as:
P(C_{c_i,k}) = ω × P(C_{c_i,k} | Ω_i) + (1 − ω) × P(Ω_i), 0 ≤ ω ≤ 1,
where P(C_{c_i,k} | Ω_i) is the intra-class language probability of the k-th character of class Ω_i, and P(Ω_i) is the inter-class language probability of class Ω_i.
It can be seen from the above that the language probabilities of the characters calculated by the intra-class language probabilities and the inter-class language probabilities sufficiently reflect the language probability difference between different characters in the same class and the language probability difference between different classes, and therefore, the method of the embodiment of the present invention can improve the language discrimination between similar characters.
As shown in fig. 2, an embodiment of the present invention further provides a device for improving linguistic distinction between characters, including: clustering section 21, first calculating section 22, second calculating section 23, and third calculating section 24.
The clustering unit 21 is configured to identify character samples and cluster characters according to an identification result; the first calculating unit 22 is configured to calculate an intra-class language probability according to the clustering result of the characters; the second calculating unit 23 is configured to calculate an inter-class language probability according to the clustering result of the characters; the third calculating unit 24 is configured to calculate the language probability of the character according to the intra-class language probability and the inter-class language probability.
Wherein the clustering unit 21 may include: a character set acquisition module, used for recognizing the character samples and obtaining a character set; a first calculation module, used for calculating a first recognition similarity vector between each character in the character set and the character samples; a second calculation module, used for calculating, from the first recognition similarity vectors, a second recognition similarity vector between the character set and each character in the character set; and a classification module, used for clustering the character set according to the first and second recognition similarity vectors.
Further, the classification module may include: a setting submodule, used for setting the class center of each class according to the number of classes to be formed; a class identification submodule, used for labeling each character in the character set with the class of the class center nearest to it; and a class division submodule, used for placing characters with the same label into the same class.
The first calculating module is specifically configured to calculate the recognition similarity between each character in the character set and a character sample and to generate the first recognition similarity vector from these similarities. The second calculating module is specifically configured, for all character samples sharing the same character identity in the character set, to average their first recognition similarity vectors and use the average as the second recognition similarity vector.
It can be seen from the above that the language probabilities of the characters calculated by the intra-class language probabilities and the inter-class language probabilities sufficiently reflect the language probability difference between different characters in the same class and the language probability difference between different classes, and therefore, the device of the embodiment of the present invention can improve the language discrimination between similar characters.
To further improve efficiency, as shown in fig. 3, the apparatus according to the embodiment of the present invention may further include: a sample acquiring unit 20, configured to acquire the character sample.
The working principle of the device according to the embodiment of the present invention can refer to the description of the foregoing method embodiment.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims (11)

1. A method for improving linguistic distinctness between characters, comprising:
identifying character samples, wherein the character samples provide shape characteristics of characters, and clustering the characters according to identification results;
calculating the intra-class language probability according to the clustering result of the characters;
calculating the language probability among classes according to the clustering result of the characters;
calculating the language probability of the character according to the intra-class language probability and the inter-class language probability,
wherein, the recognizing the character sample and clustering the characters according to the recognition result comprises:
recognizing the character sample by using the character recognition core to obtain a character set;
calculating a first recognition similarity vector of each character in the character set and the character sample;
calculating a second recognition similarity vector of the character set and each character in the character set according to the first recognition similarity vector;
and clustering the character set according to the first recognition similarity vector and the second recognition similarity vector.
2. The method of claim 1, wherein the computing a first recognition similarity vector for each character in the character set to the character sample comprises:
and calculating the recognition similarity of each character in the character set and the character sample, and generating the first recognition similarity vector according to the recognition similarity.
3. The method of claim 1, wherein computing a second recognition similarity vector for each character in the character set and the character set based on the first recognition similarity vector comprises:
for the character samples having the same character of the character set, calculating the first recognition similarity vectors of that character and the character samples, and taking the average value of the first recognition similarity vectors as the second recognition similarity vector.
4. The method of claim 1, wherein the clustering the character set according to the first recognition similarity vector and the second recognition similarity vector comprises:
setting a class center of each classification according to the number of the classes to be classified;
for each character in the character set, labeling the character with the class of the class center nearest to it;
characters having the same identification are classified into the same category.
5. The method of claim 1, wherein the intra-class linguistic probability is calculated by:
P(C_{c_i,k} | Ω_i) = -K_1 × log( n(C_{c_i,k}) / Σ_j n(C_{c_i,j}) )
wherein Ω_i represents the class, n(C_{c_i,k}) represents the frequency of occurrence of the k-th character of class Ω_i, n(C_{c_i,j}) represents the frequency of occurrence of the j-th character of class Ω_i, and K_1 represents the magnification factor of the negative logarithmic probability of a character within the class.
6. The method of claim 1, wherein the inter-class linguistic probability is calculated by:
P(Ω_i) = -K_2 × log( Σ_k n(C_{c_i,k}) / S )
wherein Ω_i represents the class, n(C_{c_i,k}) represents the frequency of occurrence of the k-th character of class Ω_i, S represents the sum of the frequencies of occurrence of all characters of all classes, and K_2 represents the magnification factor of the negative logarithmic probability of class Ω_i.
7. The method of claim 1, wherein the computing the linguistic probability for the character based on the intra-class linguistic probability and the inter-class linguistic probability is by:
P(c) = ω × P(c | Ω_c) + (1 − ω) × P(Ω_c), 0 ≤ ω ≤ 1,
wherein P(c | Ω_c) is the intra-class language probability, P(Ω_c) is the inter-class language probability, c denotes the character, and Ω_c denotes the class containing character c.
8. An apparatus for improving linguistic distinction between characters, comprising:
the clustering unit is used for recognizing character samples, wherein the character samples provide the shape characteristics of the characters, and clustering the characters according to the recognition result;
the first calculating unit is used for calculating the intra-class language probability according to the clustering result of the characters;
the second calculating unit is used for calculating the language probability among the classes according to the clustering result of the characters;
a third calculation unit for calculating the language probability of the character based on the intra-class language probability and the inter-class language probability,
wherein the clustering unit includes:
the character set acquisition module is used for recognizing the character samples by using the character recognition core to obtain a character set;
a first calculation module, configured to calculate a first recognition similarity vector between each character in the character set and the character sample;
the second calculation module is used for calculating a second recognition similarity vector of each character in the character set and the character set according to the first recognition similarity vector;
and the classification module is used for clustering the character set according to the first recognition similarity vector and the second recognition similarity vector.
9. The apparatus according to claim 8, wherein the first calculating module is specifically configured to calculate a recognition similarity between each character in the character set and the character sample, and generate the first recognition similarity vector according to the recognition similarity.
10. The apparatus of claim 8, wherein the second computing module is specifically configured to compute, for a character sample having a same character in the character set, a first recognition similarity vector between the character in the character set and the character sample, and use an average of the first recognition similarity vector as the second recognition similarity vector.
11. The apparatus of claim 8, wherein the classification module comprises:
the setting submodule is used for setting the class center of each class according to the number of classes to be formed;
the class identification submodule is used for labeling each character in the character set with the class of the class center nearest to it;
and the class division submodule is used for dividing the characters with the same identification into the same class.
CN 201010218319 2010-06-24 2010-06-24 Method and device for improving language discrimination among characters Active CN101882226B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201010218319 CN101882226B (en) 2010-06-24 2010-06-24 Method and device for improving language discrimination among characters

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201010218319 CN101882226B (en) 2010-06-24 2010-06-24 Method and device for improving language discrimination among characters

Publications (2)

Publication Number Publication Date
CN101882226A CN101882226A (en) 2010-11-10
CN101882226B true CN101882226B (en) 2013-07-24

Family

ID=43054238

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201010218319 Active CN101882226B (en) 2010-06-24 2010-06-24 Method and device for improving language discrimination among characters

Country Status (1)

Country Link
CN (1) CN101882226B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1342967A (en) * 2000-09-13 2002-04-03 中国科学院自动化研究所 Unified recognizing method for multi-speed working pattern
CN1346112A (en) * 2000-09-27 2002-04-24 中国科学院自动化研究所 Integrated prediction searching method for Chinese continuous speech recognition
CN1369877A (en) * 2000-10-04 2002-09-18 微软公司 Method and system for identifying property of new word in non-divided text
CN101090461A (en) * 2006-06-13 2007-12-19 中国科学院计算技术研究所 Automatic translation method for digital video captions
CN101095138A (en) * 2004-09-30 2007-12-26 谷歌公司 Methods and systems for selecting a language for text segmentation
CN101707873A (en) * 2007-03-26 2010-05-12 谷歌公司 Large language models in the mechanical translation

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1342967A (en) * 2000-09-13 2002-04-03 中国科学院自动化研究所 Unified recognizing method for multi-speed working pattern
CN1346112A (en) * 2000-09-27 2002-04-24 中国科学院自动化研究所 Integrated prediction searching method for Chinese continuous speech recognition
CN1369877A (en) * 2000-10-04 2002-09-18 微软公司 Method and system for identifying property of new word in non-divided text
CN101095138A (en) * 2004-09-30 2007-12-26 谷歌公司 Methods and systems for selecting a language for text segmentation
CN101090461A (en) * 2006-06-13 2007-12-19 中国科学院计算技术研究所 Automatic translation method for digital video captions
CN101707873A (en) * 2007-03-26 2010-05-12 谷歌公司 Large language models in the mechanical translation

Also Published As

Publication number Publication date
CN101882226A (en) 2010-11-10

Similar Documents

Publication Publication Date Title
CN109918673B (en) Semantic arbitration method and device, electronic equipment and computer-readable storage medium
CN104167208B (en) A kind of method for distinguishing speek person and device
CN105702251B (en) Reinforce the speech-emotion recognition method of audio bag of words based on Top-k
CN105389593A (en) Image object recognition method based on SURF
CN109189892B (en) Recommendation method and device based on article comments
CN110853648B (en) Bad voice detection method and device, electronic equipment and storage medium
CN110134777B (en) Question duplication eliminating method and device, electronic equipment and computer readable storage medium
CN110287311B (en) Text classification method and device, storage medium and computer equipment
CN108038208B (en) Training method and device of context information recognition model and storage medium
CN113254643B (en) Text classification method and device, electronic equipment and text classification program
CN107180084A (en) Word library updating method and device
WO2014022172A2 (en) Information classification based on product recognition
CN111506726B (en) Short text clustering method and device based on part-of-speech coding and computer equipment
CN104200814A (en) Speech emotion recognition method based on semantic cells
CN105609116A (en) Speech emotional dimensions region automatic recognition method
CN115982144A (en) Similar text duplicate removal method and device, storage medium and electronic device
CN110867180B (en) System and method for generating word-by-word lyric file based on K-means clustering algorithm
CN117235137B (en) Professional information query method and device based on vector database
CN114416991A (en) Method and system for analyzing text emotion reason based on prompt
CN111125329B (en) Text information screening method, device and equipment
CN105006231A (en) Distributed large population speaker recognition method based on fuzzy clustering decision tree
CN101882226B (en) Method and device for improving language discrimination among characters
CN116070642A (en) Text emotion analysis method and related device based on expression embedding
Dileep et al. Speaker recognition using pyramid match kernel based support vector machines
CN112071304B (en) Semantic analysis method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant