CN110442861B

CN110442861B - Chinese professional term and new word discovery method based on real world statistics

Info

Publication number: CN110442861B
Application number: CN201910608625.9A
Authority: CN
Inventors: 马逸韬; 宁光; 姚华彦; 崔斌; 张敬谊; 李光亚; 张鑫金
Original assignee: SHANGHAI INSTITUTE OF ENDOCRINE AND METABOLIC DISEASES; WONDERS INFORMATION CO Ltd
Current assignee: SHANGHAI INSTITUTE OF ENDOCRINE AND METABOLIC DISEASES; WONDERS INFORMATION CO Ltd
Priority date: 2019-07-08
Filing date: 2019-07-08
Publication date: 2023-04-07
Anticipated expiration: 2039-07-08
Also published as: CN110442861A

Abstract

The invention relates to a method for discovering Chinese professional terms and new words based on real world statistics. The invention uses pointwise mutual information (PMI) and adjacent entropy (BE) to identify and search for "seeds" (words with a high degree of solidification). These two measures are adopted mainly because both are unsupervised and they complement each other. After the "seeds" are found, new words are screened using refined statistics from a real world corpus of 1.6 billion words.

Description

Chinese professional term and new word discovery method based on real world statistics

Technical Field

The invention relates to a Chinese professional term and new word discovery method based on real world statistics, which is used for detecting new words and professional terms in Chinese texts in the professional field.

Background

Word segmentation plays an important role in Chinese natural language processing (NLP) and is its first task. In current practice, constructing a dictionary of professional words for highly specialized texts is an effective way to improve segmentation quality. Efficiently building such a dictionary for a professional field is difficult. At present, most methods are deep learning algorithms that rely on manual labeling, a process called named entity recognition. However, these methods cannot handle professional medical texts such as drug names or operation names without expert assistance or existing dictionaries.

Taking drug names in the medical field as an example, see Table 1.

Table 1. Chinese-English reference table of drug names (partial)

It can be seen that most non-professionals are simply unable to label such professional text accurately. Take "doreggi (fentanyl transdermal patch)" as an example. This term can be divided into two parts: "doreggi" and "fentanyl transdermal patch". The second part, however, is easily segmented incorrectly because it contains transliterated words and technical terms. Its correct division is "fentanyl", "transdermal", and "patch", yet the word "transdermal" is easily merged into other words by human annotators. The same problem appears frequently in similar texts: professional text is very hard to label, and traditional NLP methods perform very poorly on it and cannot meet practical application requirements.

Disclosure of Invention

The purpose of the invention is to determine the degree of solidification of a word based on information entropy and adjacent entropy, so as to discover new words and professional terms in text.

In order to achieve the above object, the technical solution of the present invention is to provide a method for discovering Chinese professional terms and new words based on real world statistics, which is characterized by comprising the following steps:

step 1, collecting news corpora from various news media and defining them as news texts, and taking the clinical drug names of medical institutions as the reference medical professional test texts and defining them as professional texts;

step 2, applying binary word segmentation (splitting into two-character bigrams) to the news texts and the professional texts respectively, discarding non-Chinese characters in the segmentation result of the news texts to obtain candidate words, and counting the number of occurrences and the frequency of each candidate word; after the low-frequency candidate words are removed, calculating a PMI value for each remaining candidate word, where the PMI value measures the degree of solidification between the two characters of a candidate word and a higher PMI value indicates a closer connection between the two characters; after the PMI value of each candidate word is calculated, discarding the candidate words whose PMI values fall behind a quantile cutoff, thereby obtaining target words;

step 3, calculating the external adjacent entropy of any one target word x obtained in the step 2

and the internal adjacent entropy

Wherein:

in the formula, H_r(x) represents the right adjacent entropy of the target word x, H_l(x) represents the left adjacent entropy of the target word x, H_r(x_l) represents the right adjacent entropy of the left character x_l in the target word x, and H_l(x_r) represents the left adjacent entropy of the right character x_r in the target word x;

step 4, calculating the BE value of each target word from the external adjacent entropy and the internal adjacent entropy of that target word, and normalizing each BE value to obtain a normalized BE value; the BE value of the target word x is denoted BE(x), and the normalized BE value is denoted

Then there are:

in the formula,

represents the mean of the BE values of all target words, and Std(BE(x)) represents the standard deviation of BE(x);

step 5, obtaining a score for each target word; the score of the target word x is denoted Score(x), and then:

in the formula, λ represents a weight, PMI' represents a PMI value of the target word x;

step 6, taking the target word with the score value larger than a set threshold value as a seed word;

step 7, after the seed words are generated, obtaining a phrase table of two-character words (bigrams) whose degree of solidification is high; since the professional terms to be extracted are extracted from the news text in the form of bigrams, the bigrams in the phrase table are then integrated.

Preferably, in step 2, if the PMI value of any candidate word obtained by binary word segmentation of the news text is PMI', then:

in the formula, x represents one character in candidate words obtained by binary word segmentation of the professional text, the other character is y, p (x) represents the frequency of the character x in the professional text, p (y) represents the frequency of the character y in the professional text, and p (x, y) represents the frequency of the character xy in the professional text; x 'represents one word in candidate words obtained by binary word segmentation of news text, x' = x, another word is y ', y' = y, p (x ') represents the frequency of occurrence of the word x' in the news text, p (y ') represents the frequency of occurrence of the word y' in the news text, and p (x ', y') represents the frequency of occurrence of the word x 'y' in the news text.

Preferably, in step 2, the obtained pmi 'is normalized, so that the normalized value of the pmi' is

Then there are:

the obtained normalized value

As the PMI value of the current candidate word.

Preferably, in step 7, when the bigrams in the word group table are integrated together, the bigrams in the word group table are recombined or elongated by using the conditional probability.

The invention uses pointwise mutual information (PMI) and adjacent entropy (BE) to identify and search for "seeds" (words with a high degree of solidification). These two measures are adopted mainly because both are unsupervised and they complement each other. After the "seeds" are found, new words are screened using refined statistics from a real world corpus of 1.6 billion words.

Detailed Description

The invention will be further illustrated with reference to the following specific examples. It should be understood that these examples are for illustrative purposes only and are not intended to limit the scope of the present invention. Further, it should be understood that various changes or modifications of the present invention may be made by those skilled in the art after reading the teaching of the present invention, and such equivalents may fall within the scope of the present invention as defined in the appended claims.

The invention provides a method for discovering Chinese professional terms and new words based on real world statistics, which comprises the following steps:

In the first step, news corpora (hereinafter referred to as news texts) are collected from the online media Sina News, China Daily, Tencent News, Baidu News and People's Daily. The clinical drug names of a medical institution are used as the reference medical professional test text (hereinafter referred to as the professional text).

In the second step, binary word segmentation is applied to the news text and the professional text respectively, and non-Chinese characters in the segmentation results are discarded. The result consists of the candidate words, the number of occurrences of each candidate word, and its frequency. Table 2 shows partial results of these operations on 1 GB of news text. Candidate words with a frequency lower than α are eliminated. After the low-frequency words are removed, the PMI and BE values of each candidate word are calculated to obtain the degree of solidification of each bigram.

Candidate word	Number of occurrences	Frequency of occurrence
Represent	544435	0.00133
Product(s)	422727	0.00103
Editing	372018	0.00091
Report on	259518	0.00063
Beijing	249406	0.00060
Appear by	245255	0.00059
In part	240593	0.00058
Become into	229208	0.00056
To	226781	0.00055
First of all	224486	0.00054

Table 2
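The counting in the second step can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function names, the restriction to the basic CJK Unified Ideographs block, the choice to split the text at non-Chinese characters so that no bigram straddles them, and the use of the total bigram count as the frequency denominator are all assumptions. The default cutoff of 5 occurrences follows the specific example given later in the description.

```python
import re
from collections import Counter

# Runs of two or more Chinese characters (basic CJK block only; an assumption).
HAN_RUN = re.compile(r"[\u4e00-\u9fff]{2,}")


def bigram_counts(text):
    """Count overlapping two-character bigrams inside runs of Chinese characters."""
    counts = Counter()
    for run in HAN_RUN.findall(text):
        for i in range(len(run) - 1):
            counts[run[i:i + 2]] += 1
    return counts


def candidate_table(text, min_count=5):
    """Map each bigram to (occurrences, relative frequency), dropping rare ones.

    The relative frequency is taken over all bigram occurrences (an assumption).
    """
    counts = bigram_counts(text)
    total = sum(counts.values())
    return {w: (c, c / total) for w, c in counts.items() if c >= min_count}
```

Each entry of the returned dictionary corresponds to one row of a table like Table 2: the candidate word, its number of occurrences, and its frequency.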

Thirdly, calculating the PMI value of each candidate word obtained in the second step

The PMI value is a criterion for measuring the degree of solidification between two characters; a higher value indicates a closer relationship between them. Mathematically, the PMI value can be expressed as:

$$\mathrm{PMI}(x, y) = \log \frac{P(x, y)}{P(x)\,P(y)} \tag{1}$$

In equation (1), x represents one character, y represents the other character, P(x) represents the frequency of occurrence of the character x in the text, P(y) represents the frequency of occurrence of the character y in the text, and P(x, y) represents the frequency of occurrence of the bigram xy in the text.
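The standard PMI definition, the logarithm of P(x, y) divided by P(x) P(y), can be sketched in Python as below. The function name `pmi_table`, the two count dictionaries it takes, and the base-2 logarithm are assumptions made for illustration; the corrected value PMI' introduced next is not covered here.

```python
import math


def pmi_table(bigram_counts, char_counts):
    """Compute PMI(x, y) = log2(P(xy) / (P(x) * P(y))) for every bigram xy.

    bigram_counts maps two-character strings to occurrence counts;
    char_counts maps single characters to occurrence counts.
    """
    total_bigrams = sum(bigram_counts.values())
    total_chars = sum(char_counts.values())
    table = {}
    for bigram, count in bigram_counts.items():
        x, y = bigram
        p_xy = count / total_bigrams
        p_x = char_counts[x] / total_chars
        p_y = char_counts[y] / total_chars
        table[bigram] = math.log2(p_xy / (p_x * p_y))
    return table
```

A positive value means the two characters co-occur more often than they would if they were independent; a value at or below zero means they do not.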

After formula (1) is applied, the invention corrects the PMI value. The corrected PMI value PMI' of each candidate word obtained from the news text by binary word segmentation in the second step is calculated as follows:

in the formula (2), x represents one character in candidate words obtained by binary word segmentation of the professional text, and the other character is y; x 'represents one word in candidate words obtained by binary word segmentation of news text, x' = x, and the other word is y ', y' = y.

Candidate words with PMI' ≤ 0 are first discarded, and then the normalized value of the corrected PMI value PMI' is calculated

The normalized value is taken as the PMI value of the current candidate word, as follows:

The invention aims to correct the statistics of the professional text by means of the candidate words in the news text. The professional text is the test object of the invention, and the goal is to mine the words in it. Words mined by traditional methods, however, have various problems. The invention therefore makes corrections using the news text, i.e., real world data.

The PMI values of all candidate words are sorted in descending order, and the candidate words whose PMI values rank behind the first quartile (the value at the 25% position when all sample values are arranged from small to large) are discarded, thereby obtaining the target words.
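A short Python sketch of this quartile filter follows. It assumes that "behind the first quartile" means a PMI value below the first quartile, so that roughly the lowest quarter of candidates is dropped; the function name and the choice of the inclusive quantile method are likewise illustrative assumptions.

```python
import statistics


def drop_below_first_quartile(pmi_by_word):
    """Keep candidate words whose PMI is at or above the first quartile."""
    # quantiles(..., n=4) returns the three cut points Q1, Q2, Q3.
    q1 = statistics.quantiles(pmi_by_word.values(), n=4, method="inclusive")[0]
    return {word: pmi for word, pmi in pmi_by_word.items() if pmi >= q1}
```

The words that survive this filter play the role of the target words used in the following steps.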

In the following steps, the invention calculates the BE value of each target word; BE (adjacent entropy) is another criterion for determining the degree of solidification of a word. For a target word x, we define x_i as one of its adjacent characters. The unidirectional adjacent entropy of x can be written as:

$$H(x) = -\sum_i p(x_i) \log_2 p(x_i)$$

where p(x_i) represents the frequency of occurrence of x_i in the text. It reflects the diversity of the characters on the left and right sides of a target word. A higher value indicates that the word appears in more varied contexts; conversely, a lower value indicates that the word appears in few contexts and is more likely to merge with its adjacent characters into a new word.
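The unidirectional adjacent entropy can be sketched in Python as below. The sketch assumes that p(x_i) is the share of the neighbouring character x_i among all characters observed on the chosen side of the word; the function name and the `side` parameter are hypothetical.

```python
import math
from collections import Counter


def neighbor_entropy(text, word, side="right"):
    """Entropy (base 2) of the characters adjacent to `word` on one side."""
    neighbors = Counter()
    start = text.find(word)
    while start != -1:
        idx = start + len(word) if side == "right" else start - 1
        if 0 <= idx < len(text):  # skip occurrences at the text boundary
            neighbors[text[idx]] += 1
        start = text.find(word, start + 1)
    total = sum(neighbors.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in neighbors.values())
```

A word that is always followed by the same character has a right adjacent entropy of zero, which suggests that it is only a fragment of a longer word.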

Fourthly, calculating the external adjacent entropy of any one target word x obtained in the third step

and the internal adjacent entropy

Wherein:

in formulae (4) and (5), H_r(x) represents the right adjacent entropy of the target word x, H_l(x) represents the left adjacent entropy of the target word x, H_r(x_l) represents the right adjacent entropy of the left character x_l in the target word x, and H_l(x_r) represents the left adjacent entropy of the right character x_r in the target word x.

x_lr represents the right adjacent entropy of the word on the left of the target word x, and x_rl represents the left adjacent entropy of the word on the right of the target word x.

The external adjacent entropy reflects the diversity of a word's contexts: when the external adjacent entropy is large, the word occurs in a large number of contexts. The invention also obtains better results by calculating the internal adjacent entropy.

In the fifth step, the BE value of each target word is calculated from its external adjacent entropy and internal adjacent entropy, and each BE value is normalized to obtain a normalized BE value. The BE value of the target word x is denoted BE(x), and the normalized BE value is denoted

Then there are:

in the formulae (6) and (7),

represents the mean of the BE values of all target words, and Std(BE(x)) represents the standard deviation of BE(x). Equation (6) combines the internal adjacent entropy and the external adjacent entropy into a new value whose magnitude expresses the degree of solidification of the candidate word. Ideally, a seed word appears in a wide variety of contexts while its internal degree of solidification is high, which is what equation (6) expresses mathematically.

The present invention requires the combination of BE and PMI to calculate the degree of coagulation of all the words of the "seed", and therefore requires a normalization process on the final result. Under the condition that the overall sample distribution and parameters are not known, the method uses the t distribution to carry out normalization processing, as shown in the formula (7).
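The normalization can be illustrated with the Python sketch below. The normalization formula itself is not reproduced in this text, so the sketch assumes the common form in which the mean is subtracted from each value and the result is divided by the sample standard deviation; the function name is hypothetical.

```python
import statistics


def standardize(value_by_word):
    """Rescale values to zero mean and unit sample standard deviation.

    Assumed form: (value - mean) / std, with std the sample standard deviation.
    """
    values = list(value_by_word.values())
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return {word: (value - mean) / std for word, value in value_by_word.items()}
```

Putting the BE and PMI values on a common scale in this way is what allows them to be combined into a single score.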

In the sixth step, a score is obtained for each target word. The score of the target word x is denoted Score(x), and then:

in formula (7), λ represents a weight and PMI' represents the PMI value of the target word x.

To obtain higher-quality seed words, a weight needs to be applied when PMI and BE are combined, so λ is introduced into the calculation as a parameter, as shown in formula (7).

In the seventh step, the target words whose scores are greater than a set threshold are taken as seed words. The larger the score, the more the candidate seed word resembles a fixed collocation; conversely, a small score indicates that the word does not qualify as a new word or a candidate.
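Seed selection by score threshold can be sketched in Python as follows. The score formula is not reproduced in this text, so the weighted sum used here, with λ applied to the normalized PMI and 1 − λ to the normalized BE, is purely an illustrative assumption, as are the function name and the default threshold. The default λ = 0.3 follows the specific example in the description.

```python
def select_seeds(norm_pmi, norm_be, lam=0.3, threshold=0.0):
    """Score each word and keep those above the threshold, best first.

    ASSUMPTION: score = lam * PMI + (1 - lam) * BE. This weighted sum is a
    stand-in; the source's own score formula is not available here.
    """
    scores = {
        word: lam * norm_pmi[word] + (1 - lam) * norm_be[word]
        for word in norm_pmi
        if word in norm_be
    }
    seeds = {word: s for word, s in scores.items() if s > threshold}
    return dict(sorted(seeds.items(), key=lambda item: item[1], reverse=True))
```

Whatever the exact combination, the selection step itself is a simple comparison of each score against the set threshold.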

And eighthly, obtaining a phrase table of binary characters after the generation of the seed words is finished, wherein the solidification degree of the words is high. Due to the fact that real-world statistics are combined as the screening condition, the professional terms to be extracted can be considered to be extracted through the form of the binary words. These bigrams need to be integrated later. The present invention recombines or lengthens these scattered strings by using conditional probabilities.

For example, starting from the word "two", we always take the last character of the string as the starting point. So for "two", we choose "A" as the starting point. The next step looks for all words in the "seeds" table that begin with "A". The probability of occurrence of these words follows the form of a Bayesian conditional probability

The invention then sets a threshold to determine which words can serve as extensions. For example, a candidate extension of "inject" may be "inject liquid", and so on; recombining these words yields a new three-character word. The iteration continues until there is no next word or none of the candidate words reaches the set threshold. At that point, the discovery of all professional words is complete.
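One extension step can be sketched in Python as below. The conditional probability formula is not reproduced in this text, so the sketch assumes that the probability of an extension is the count of the seed bigram divided by the total count of all seed bigrams that begin with the same character; the function name and the `seed_counts` dictionary are hypothetical. The threshold 0.3 follows the specific example in the description.

```python
def next_candidates(word, seed_counts, threshold=0.3):
    """Return {extended word: probability} for extensions above the threshold.

    ASSUMPTION: P(next | last char) = count(seed bigram) divided by the summed
    counts of all seed bigrams starting with the last character of `word`.
    """
    last = word[-1]
    followers = {s: c for s, c in seed_counts.items() if s[0] == last}
    total = sum(followers.values())
    return {
        word + seed[1:]: count / total
        for seed, count in followers.items()
        if count / total > threshold
    }
```

For a seed table in which one continuation clearly dominates, only that continuation is returned, and the word grows by one character.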

The invention is further illustrated by the following specific examples:

Step 1: collect news corpora (hereinafter referred to as news texts) from the online media Sina News, China Daily, Tencent News, Baidu News and People's Daily, with a time span from 2014 to 2018 and covering fields such as sports, entertainment, politics, science, art and culture. Each news item contains about 1000 words, and the corpus totals 8 GB of news data and 1.6 billion words. The clinical drug names of a medical institution are used as the reference medical professional test text (hereinafter referred to as the professional text).

Step 2: a bigram list is generated, and candidate words that occur fewer than 5 times are discarded. The PMI of each candidate word is then computed, and candidates with PMI < 0 are discarded because such a value means the two characters are not bound closely enough to form a word.

Step 3: with the weight set to λ = 0.3, the BE value of each candidate is calculated and combined with the PMI into a new quantity, denoted score. Finally, a threshold value for score is determined, and candidate words whose score is smaller than this value are discarded.

After all the above processes are completed, a statistically significant table with a high degree of word solidity is obtained. Table 3 shows the results of the descending order of the seed table, showing the higher ranked results of the degree of coagulation. A common feature of these "seeds" is that they occur in reality but are rarely used and, in addition, occur in large numbers and often together in our test text. Therefore, we believe that they can be high quality seeds and are ready to extend word length.

Rank	Candidate word	Score
1	Medical debate	14.919
2	Fork assembly	13.762
3	Point matching	12.535
4	Measuring pump	12.414
5	Chamber or	12.385
6	Coriolus versicolor	11.798
7	Check and	11.794
8	backup instrument	11.537
9	Two sides of the bag	10.590
10	Study and examination	10.178

Table 3. The 10 words in the test text with the highest degree of solidification and their scores

Step 4: two lists named Continue and stop are generated. Continue stores the words whose length can still be extended; stop stores the words whose length cannot be extended further. Word-length extension then begins with the probability threshold set to 0.3, meaning that a word is considered an extension candidate only when P_next > 0.3. If an extension can still be found for the current word, the word is put into the Continue list; otherwise, it is put into the stop list.
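The bookkeeping with the Continue and stop lists can be sketched in Python as follows. This is one possible reading, not the patented implementation: words that gain an extension are replaced in Continue by their longer forms, words with no extension move to stop, and the `max_len` guard against endless cycles is an added assumption. The extension probability uses the same illustrative count ratio assumed for P_next, with the 0.3 threshold from the example.

```python
def grow_terms(seeds, seed_counts, threshold=0.3, max_len=10):
    """Extend seed words until no extension passes the threshold."""

    def extensions(word):
        # Seed bigrams starting with the last character of the current word.
        last = word[-1]
        followers = {s: c for s, c in seed_counts.items() if s[0] == last}
        total = sum(followers.values())
        return [word + s[1:] for s, c in followers.items() if c / total > threshold]

    continue_list = list(seeds)  # words that may still be extended
    stop_list = []               # words that cannot be extended further
    while continue_list:
        word = continue_list.pop()
        longer = extensions(word) if len(word) < max_len else []
        if longer:
            continue_list.extend(longer)
        else:
            stop_list.append(word)
    return sorted(stop_list)
```

When the loop ends, the stop list holds the fully extended words, which correspond to the discovered professional terms.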

Claims

1. A method for discovering Chinese professional terms and new words based on real world statistics is characterized by comprising the following steps:

step 2, applying binary word segmentation (splitting into two-character bigrams) to the news texts and the professional texts respectively, discarding non-Chinese characters in the segmentation result of the news texts to obtain candidate words, and counting the number of occurrences and the frequency of each candidate word; after the candidate words with a frequency less than 5 are removed, calculating a PMI value for each remaining candidate word, where the PMI value measures the degree of solidification between the two characters of a candidate word and a higher PMI value indicates a closer relation between the two characters; after the PMI value of each candidate word is calculated, discarding the candidate words whose PMI values fall behind a quantile cutoff, thereby obtaining target words; if the PMI value of any candidate word obtained by binary word segmentation of the news text is PMI', then:

in the formula, x represents one character of a candidate word obtained by binary word segmentation of the professional text and y represents the other character, p(x) represents the frequency of the character x in the professional text, p(y) represents the frequency of the character y in the professional text, and p(x, y) represents the frequency of the bigram xy in the professional text; x' represents one character of a candidate word obtained by binary word segmentation of the news text, with x' = x, and y' represents the other character, with y' = y; p(x') represents the frequency of occurrence of the character x' in the news text, p(y') represents the frequency of occurrence of the character y' in the news text, and p(x', y') represents the frequency of occurrence of the bigram x'y' in the news text;

normalizing the obtained pmi 'so that the normalized value of the pmi' is

Then there are:

the obtained normalized value

PMI value as current candidate word;

step 3, calculating the external adjacent entropy of any one target word x obtained in step 2

and the internal adjacent entropy

Wherein:

in the formula, H_r(x) represents the right adjacent entropy of the target word x, H_l(x) represents the left adjacent entropy of the target word x, H_r(x_l) represents the right adjacent entropy of the left character x_l in the target word x, and H_l(x_r) represents the left adjacent entropy of the right character x_r in the target word x;

step 4, calculating the BE value of each target word from the external adjacent entropy and the internal adjacent entropy of that target word, and normalizing each BE value to obtain a normalized BE value; the BE value of the target word x is denoted BE(x), and the normalized BE value is denoted

Then there are:

in the formula,

step 5, obtaining a score for each target word; the score of the target word x is denoted Score(x), and then:

in the formula, λ represents a weight and PMI' represents the PMI value of the target word x;

step 7, after the seed words are generated, obtaining a phrase table of two-character words (bigrams) whose degree of solidification is high; since the professional terms to be extracted are extracted from the news text in the form of bigrams, the bigrams in the phrase table are then integrated; when the bigrams in the phrase table are integrated, they are recombined or lengthened by using conditional probabilities.