CN110442861B - Chinese professional term and new word discovery method based on real world statistics - Google Patents

Info

Publication number
CN110442861B
CN110442861B CN201910608625.9A CN201910608625A CN110442861B CN 110442861 B CN110442861 B CN 110442861B CN 201910608625 A CN201910608625 A CN 201910608625A CN 110442861 B CN110442861 B CN 110442861B
Authority
CN
China
Prior art keywords
word
value
words
pmi
news
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910608625.9A
Other languages
Chinese (zh)
Other versions
CN110442861A (en)
Inventor
马逸韬
宁光
姚华彦
崔斌
张敬谊
李光亚
张鑫金
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SHANGHAI INSTITUTE OF ENDOCRINE AND METABOLIC DISEASES
WONDERS INFORMATION CO Ltd
Original Assignee
SHANGHAI INSTITUTE OF ENDOCRINE AND METABOLIC DISEASES
WONDERS INFORMATION CO Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SHANGHAI INSTITUTE OF ENDOCRINE AND METABOLIC DISEASES, WONDERS INFORMATION CO Ltd filed Critical SHANGHAI INSTITUTE OF ENDOCRINE AND METABOLIC DISEASES
Priority to CN201910608625.9A priority Critical patent/CN110442861B/en
Publication of CN110442861A publication Critical patent/CN110442861A/en
Application granted granted Critical
Publication of CN110442861B publication Critical patent/CN110442861B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Machine Translation (AREA)

Abstract

The invention relates to a method for discovering Chinese professional terms and new words based on real-world statistics. The invention uses pointwise mutual information (PMI) and adjacency entropy (BE) to identify "seeds" (words with a high degree of cohesion). These two measures are adopted mainly because both are unsupervised and they complement each other. After the seeds are found, new words are filtered out using refined statistics from a real-world corpus of 1.6 billion words.

Description

Chinese professional term and new word discovery method based on real world statistics
Technical Field
The invention relates to a Chinese professional term and new word discovery method based on real world statistics, which is used for detecting new words and professional terms in Chinese texts in the professional field.
Background
Word segmentation plays an important role in Chinese natural language processing (NLP) and is the first task in the processing pipeline. For highly specialized texts, building a dictionary of professional terms is an effective way to improve segmentation quality. Building such a dictionary efficiently for a professional field is difficult. At present, most methods are deep learning algorithms that rely on manual labeling, a process known as named entity recognition. However, these methods cannot handle professional medical text such as drug names or operation names without expert assistance or an existing dictionary.
Take drug names in the medical field as an example (see Table 1).
Figure BDA0002121585740000011
Table 1. Chinese-English reference table of drug names (excerpt)
It can be seen that most non-professionals simply cannot label such professional text accurately. Take "doreggi (fentanyl transdermal patch)" as an example. The term can be divided into two parts: "doreggi" and "fentanyl transdermal patch". The second part, however, is easily segmented incorrectly because it contains transliterated words and technical terms. Its correct division is "fentanyl", "transdermal", and "patch"; the word "transdermal" is easily merged into other words by human annotators. The same problem appears frequently in similar texts. Professional text is therefore very difficult to label, and traditional NLP methods handle it poorly and cannot meet practical requirements.
Disclosure of Invention
The purpose of the invention is to determine the degree of solidification of a word based on information entropy and adjacency entropy, so as to discover new words and professional terms in text.
In order to achieve the above object, the technical solution of the present invention is to provide a method for discovering Chinese professional terms and new words based on real world statistics, which is characterized by comprising the following steps:
step 1, collecting news corpora from various news media and defining them as the news text; taking the clinical drug names of a medical institution as the reference medical test text and defining it as the professional text;
step 2, applying binary (two-character) word segmentation to the news text and the professional text respectively; discarding non-Chinese characters in the segmentation result of the news text to obtain candidate words; counting the number of occurrences and the frequency of each candidate word; removing low-frequency candidate words and calculating a PMI value for each remaining candidate word, where the PMI value measures the degree of solidification between the two characters of a candidate word and a higher PMI value indicates a closer connection between them; and, after the PMI value of every candidate word has been calculated, discarding the candidate words whose PMI values fall below a quantile cutoff, thereby obtaining the target words;
step 3, calculating the external adjacent entropy of any one target word x obtained in the step 2
Figure BDA0002121585740000021
and the internal adjacency entropy
Figure BDA0002121585740000022
Wherein:
Figure BDA0002121585740000023
Figure BDA0002121585740000024
in the formulas, H_r(x) represents the right adjacency entropy of the target word x, H_l(x) represents the left adjacency entropy of the target word x, H_r(x_l) represents the right adjacency entropy of the left character x_l of the target word x, and H_l(x_r) represents the left adjacency entropy of the right character x_r of the target word x;
step 4, calculating the BE value of each target word from its external adjacency entropy and internal adjacency entropy, and normalizing each BE value to obtain a normalized BE value; let the BE value of the target word x be BE(x) and the normalized BE value be
Figure BDA0002121585740000025
Then there are:
Figure BDA0002121585740000026
Figure BDA0002121585740000027
in the formulas,
Figure BDA0002121585740000028
represents the mean of the BE values for all target words, std (BE (x)) represents the standard deviation of BE (x);
step 5, obtaining a score for each target word; let the score of the target word x be Score(x), then:
Figure BDA0002121585740000029
in the formula, λ represents a weight, PMI' represents a PMI value of the target word x;
step 6, taking the target word with the score value larger than a set threshold value as a seed word;
step 7, after the seed words have been generated, obtaining a phrase table of two-character words with a high degree of solidification; the professional terms to be extracted are captured from the news text in the form of bigrams, and the bigrams in the phrase table are then integrated.
Preferably, in step 2, if the PMI value of any candidate word obtained by binary word segmentation of the news text is PMI', then:
Figure BDA0002121585740000031
in the formula, x and y represent the two characters of a candidate word obtained by binary word segmentation of the professional text; p(x) represents the frequency of the character x in the professional text, p(y) represents the frequency of the character y in the professional text, and p(x, y) represents the frequency of the string xy in the professional text; x' and y' represent the two characters of a candidate word obtained by binary word segmentation of the news text, with x' = x and y' = y; p(x') represents the frequency of the character x' in the news text, p(y') represents the frequency of the character y' in the news text, and p(x', y') represents the frequency of the string x'y' in the news text.
Preferably, in step 2, the obtained PMI' is normalized; let the normalized value of PMI' be
Figure BDA0002121585740000032
Then there are:
Figure BDA0002121585740000033
the obtained normalized value
Figure BDA0002121585740000034
is taken as the PMI value of the current candidate word.
Preferably, in step 7, when the bigrams in the word group table are integrated together, the bigrams in the word group table are recombined or elongated by using the conditional probability.
The invention uses pointwise mutual information (PMI) and adjacency entropy (BE) to identify "seeds" (words with a high degree of cohesion). These two measures are adopted mainly because both are unsupervised and they complement each other. After the seeds are found, new words are filtered out using refined statistics from a real-world corpus of 1.6 billion words.
Detailed Description
The invention will be further illustrated with reference to the following specific examples. It should be understood that these examples are for illustrative purposes only and are not intended to limit the scope of the present invention. Further, it should be understood that various changes or modifications of the present invention may be made by those skilled in the art after reading the teaching of the present invention, and such equivalents may fall within the scope of the present invention as defined in the appended claims.
The invention provides a method for discovering Chinese professional terms and new words based on real world statistics, which comprises the following steps:
In the first step, news corpora (hereinafter referred to as news text) are collected from the online media Sina News, China News, Tencent News, Baidu News, and People's Daily News. The clinical drug names of a medical institution are used as the reference medical test text (hereinafter referred to as professional text).
In the second step, binary word segmentation is applied to the news text and the professional text respectively, and non-Chinese characters in the segmentation results are discarded. The result comprises the candidate words, the number of occurrences of each candidate word, and its frequency; Table 2 shows partial results of these operations on 1 GB of news text. Candidate words with a frequency lower than α are eliminated. The PMI and BE values of each remaining candidate word are then calculated to obtain the word solidity of each two-character string.
Candidate word    Number of occurrences    Frequency of occurrence
Represent         544435                   0.00133
Product(s)        422727                   0.00103
Editing           372018                   0.00091
Report on         259518                   0.00063
Beijing           249406                   0.00060
Appear by         245255                   0.00059
In part           240593                   0.00058
Become into       229208                   0.00056
To                226781                   0.00055
First of all      224486                   0.00054
Table 2
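The counting described in the second step can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the function name `bigram_table`, the `min_count` parameter, and the choice to treat only the basic CJK Unified Ideographs block as "Chinese characters" are assumptions, and a bigram is simply skipped when either of its characters is non-Chinese.

```python
import re
from collections import Counter

# Assumption: "Chinese character" means the basic CJK Unified Ideographs block.
CJK_CHAR = re.compile(r"[\u4e00-\u9fff]")

def bigram_table(text, min_count=1):
    """Return {bigram: (count, frequency)} for overlapping two-character strings."""
    counts = Counter(
        a + b
        for a, b in zip(text, text[1:])
        if CJK_CHAR.fullmatch(a) and CJK_CHAR.fullmatch(b)
    )
    total = sum(counts.values())
    # Frequency is relative to all Chinese bigrams, computed before low-count pruning.
    return {w: (c, c / total) for w, c in counts.items() if c >= min_count}

table = bigram_table("注射液,注射液")
```

Here the comma breaks the sequence, so only the bigrams "注射" and "射液" are counted, each with a count of 2 and a frequency of 0.5.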
Thirdly, the PMI value of each candidate word obtained in the second step is calculated.
The PMI value measures the degree of solidification between two characters; a higher value indicates a closer relationship between them. Mathematically, the PMI value can be expressed as:
Figure BDA0002121585740000051
In Equation (1), x represents one character, y represents the other character, P(x) represents the frequency of the character x in the text, P(y) represents the frequency of the character y in the text, and P(x, y) represents the frequency of the string xy in the text.
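As a small illustration, the standard definition of pointwise mutual information uses exactly the three quantities named for Equation (1). The sketch below assumes a base-2 logarithm and a hypothetical function name `pmi`; the patent's own formula image is not reproduced in this text, so the exact form used there may differ.

```python
import math

def pmi(p_xy, p_x, p_y):
    """Standard PMI of a two-character string xy (base-2 logarithm assumed).

    p_xy: frequency of the string xy; p_x, p_y: frequencies of the single characters.
    """
    return math.log2(p_xy / (p_x * p_y))

# Characters that co-occur exactly as often as chance predicts score 0.
independent = pmi(0.25, 0.5, 0.5)
# Characters that always occur together score above 0.
bound = pmi(0.5, 0.5, 0.5)
```

A positive value means the two characters appear together more often than independent characters would, which is why candidates with non-positive values are poor word candidates.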
In the invention, the PMI formula (1) is corrected. The corrected PMI value PMI' of each candidate word obtained in the second step by binary word segmentation of the news text is calculated as:
Figure BDA0002121585740000052
In Equation (2), x and y represent the two characters of a candidate word obtained by binary word segmentation of the professional text; x' and y' represent the two characters of a candidate word obtained by binary word segmentation of the news text, with x' = x and y' = y.
Candidate words with PMI' ≤ 0 are first discarded, and then the normalized value of the corrected PMI value PMI', denoted
Figure BDA0002121585740000053
is calculated and taken as the PMI value of the current candidate word, as follows:
Figure BDA0002121585740000054
The invention aims to correct the professional text by means of the candidate words in the news text. The professional text is the test object of the invention, and the goal is to mine the words it contains. Words mined by traditional methods, however, suffer from various problems. The invention therefore applies a correction based on news text, that is, real-world data.
The PMI values of all candidate words are sorted in descending order, and the candidate words whose PMI values fall behind the first quartile (the value at the 25% position when all values in the sample are arranged from smallest to largest) are discarded, thereby obtaining the target words.
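One possible reading of this quartile cut is rank-based: sort the candidates by PMI in descending order and drop the lowest quarter. The Python sketch below follows that reading; the function name `keep_above_first_quartile` and the rounding of the cut position (integer division) are assumptions rather than details given in the patent.

```python
def keep_above_first_quartile(pmi_by_word):
    """Sort candidates by PMI (descending) and drop the lowest quarter."""
    ranked = sorted(pmi_by_word.items(), key=lambda kv: kv[1], reverse=True)
    n_drop = len(ranked) // 4  # size of the lowest quarter, rounded down (assumption)
    return dict(ranked[: len(ranked) - n_drop])

# Eight hypothetical candidates w1..w8 with PMI values 1.0..8.0.
candidates = {f"w{i}": float(i) for i in range(1, 9)}
targets = keep_above_first_quartile(candidates)
```

With eight candidates, the two lowest-scoring ones (w1 and w2) are removed and the remaining six become target words.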
In the following steps, the invention calculates the BE value of each target word. BE (adjacency entropy) is another criterion for judging word solidification. For a target word x, x_i is defined as one of its adjacent characters. The one-directional adjacency entropy of x can be written as H(x) = -Σ_i p(x_i) log_2 p(x_i), where p(x_i) denotes the frequency of x_i in the text. It reflects the diversity of the characters on the left and right sides of a target word. A higher value indicates that the word appears in more varied contexts; a lower value indicates that the word appears in few contexts and is more likely to merge with adjacent characters into a new word.
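The one-directional entropy H(x) = -Σ_i p(x_i) log_2 p(x_i) can be illustrated with a short Python sketch. It assumes that p(x_i) is estimated as the share of occurrences of the word in which x_i is the neighboring character on the given side; the function names `entropy` and `adjacency_entropy` are hypothetical.

```python
import math
from collections import Counter

def entropy(counter):
    """Shannon entropy (base 2) of the distribution given by a Counter."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counter.values())

def adjacency_entropy(text, word):
    """Return (left entropy, right entropy) of `word` over its neighbors in `text`."""
    left, right = Counter(), Counter()
    start = text.find(word)
    while start != -1:
        if start > 0:
            left[text[start - 1]] += 1       # character just before the word
        end = start + len(word)
        if end < len(text):
            right[text[end]] += 1            # character just after the word
        start = text.find(word, start + 1)
    return entropy(left), entropy(right)

h_left, h_right = adjacency_entropy("甲注射液乙注射剂丙注射液", "注射")
```

In this toy text the word has three different left neighbors but only two different right neighbors (one of them repeated), so its left entropy is higher than its right entropy.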
Fourthly, calculating the external adjacent entropy of any one target word x obtained in the third step
Figure BDA0002121585740000055
and the internal adjacency entropy
Figure BDA0002121585740000056
Wherein:
Figure BDA0002121585740000057
Figure BDA0002121585740000058
In Equations (4) and (5), H_r(x) represents the right adjacency entropy of the target word x, H_l(x) represents the left adjacency entropy of the target word x, H_r(x_l) represents the right adjacency entropy of the left character x_l of the target word x, and H_l(x_r) represents the left adjacency entropy of the right character x_r of the target word x.
Figure BDA0002121585740000061
Figure BDA0002121585740000062
Figure BDA0002121585740000063
Figure BDA0002121585740000064
x_lr represents the right-adjacent entropy of the left character of the target word x, and x_rl represents the left-adjacent entropy of the right character of the target word x.
The external adjacency entropy
Figure BDA00021215857400000611
reflects the contextual diversity of a word: when the external adjacency entropy
Figure BDA00021215857400000612
is large, the word occurs in a large number of contexts. The invention also achieves a better result by additionally calculating the internal adjacency entropy.
Fifthly, the BE value of each target word is calculated from its external adjacency entropy and internal adjacency entropy, and each BE value is normalized to obtain a normalized BE value. Let the BE value of the target word x be BE(x) and the normalized BE value be
Figure BDA0002121585740000067
Then there are:
Figure BDA0002121585740000068
Figure BDA0002121585740000069
in the formulae (6) and (7),
Figure BDA00021215857400000610
represents the mean of the BE values of all target words, and Std(BE(x)) represents the standard deviation of BE(x). Equation (6) combines the internal adjacency entropy and the external adjacency entropy into a new value whose magnitude expresses the degree of solidification of the candidate word. Ideally, a seed word should appear in a variety of contexts while being highly cohesive internally; Equation (6) expresses this mathematically.
The invention needs to combine BE and PMI to calculate the degree of solidification of all "seed" words, so the final results must be normalized. Because the overall sample distribution and its parameters are unknown, the method uses the t-distribution for normalization, as shown in Equation (7).
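A normalization built from the mean and the standard deviation of the BE values, as the symbols above suggest, can be sketched in Python as a standard score. This is an assumption about the shape of Equation (7), whose image is not reproduced in this text; the use of the sample standard deviation and the function name `standardize` are likewise assumptions.

```python
import statistics

def standardize(value_by_word):
    """Subtract the mean and divide by the standard deviation of all values."""
    values = list(value_by_word.values())
    mean = statistics.fmean(values)
    std = statistics.stdev(values)  # sample standard deviation (assumption)
    return {w: (v - mean) / std for w, v in value_by_word.items()}

# Hypothetical BE values for three target words.
be_norm = standardize({"a": 1.0, "b": 2.0, "c": 3.0})
```

After this step the BE values are centered on zero and expressed in units of standard deviation, which puts them on a scale comparable to a PMI value normalized in the same way.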
Sixthly, a score is obtained for each target word. Let the score of the target word x be Score(x), then:
Figure BDA0002121585740000071
In Equation (8), λ represents a weight and PMI' represents the PMI value of the target word x.
To obtain higher-quality seed words, PMI and BE must be weighted when they are combined, so λ is introduced into the calculation as a parameter, as shown in Equation (8).
Seventhly, target words whose score exceeds a set threshold are taken as seed words. The larger the score, the more the candidate seed word resembles a fixed collocation; a smaller score indicates that the word does not qualify as a new word or candidate.
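The scoring and thresholding steps can be illustrated with the Python sketch below. The patent's score formula is an image that is not reproduced in this text, so the convex combination λ·PMI + (1 − λ)·BE used here is purely an assumption chosen for illustration, as are the function name `select_seeds` and the default threshold.

```python
def select_seeds(pmi_norm, be_norm, lam=0.3, threshold=0.0):
    """Combine normalized PMI and BE with weight lam; keep words above threshold.

    Assumption: Score = lam * PMI + (1 - lam) * BE. The actual formula may differ.
    """
    scores = {
        w: lam * pmi_norm[w] + (1 - lam) * be_norm[w]
        for w in pmi_norm.keys() & be_norm.keys()
    }
    return {w: s for w, s in scores.items() if s > threshold}

# Two hypothetical target words with normalized PMI and BE values.
seeds = select_seeds({"a": 1.0, "b": -1.0}, {"a": 1.0, "b": -1.0})
```

Only the word whose combined score exceeds the threshold is kept as a seed; the weight λ controls how much the PMI term contributes relative to the BE term.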
Eighthly, once the seed words have been generated, a phrase table of two-character words with a high degree of solidification is obtained. Because real-world statistics are used as the screening condition, the professional terms to be extracted can be considered to have been captured in the form of bigrams. These bigrams then need to be integrated. The invention recombines or lengthens these scattered strings using conditional probabilities.
For example, starting from a two-character word, the last character of the string is always taken as the starting point. So for a two-character word whose last character is "A", "A" is chosen as the starting point. The next step searches the "seeds" table for all words beginning with "A". The occurrence probability of these words takes the form of a Bayesian conditional probability.
The invention then sets a threshold to determine which words can serve as extensions. For example, a candidate extension of "inject" may be "inject liquid"; recombining these words yields a new three-character word. The iteration continues until there is no next word or no candidate word reaches the set threshold. At that point, the discovery of all professional words is complete.
The invention is further illustrated by the following specific examples:
Step 1: news corpora (hereinafter referred to as news text) are collected from the online media Sina News, China Daily News, Tencent News, Baidu News, and People's Daily News. The time span is 2014 to 2018, and the fields covered include sports, entertainment, politics, science, art, and culture. Each news item is about 1000 words long; in total there are 8 GB of news data containing 1.6 billion words. The clinical drug names of a medical institution are used as the reference medical test text (hereinafter referred to as professional text).
Step 2: a bigram word list is generated, and candidate words that occur fewer than 5 times are discarded. The PMI of each word is then computed, and candidates with PMI < 0 are discarded because such a value means they do not qualify as a word.
Step 3: with the weight set to λ = 0.3, the BE of each candidate is calculated and combined with the PMI into a new quantity, denoted score. Finally, a threshold for the score is determined, and candidate words scoring below it are discarded.
After all of the above is completed, a statistically meaningful table of words with a high degree of solidification is obtained. Table 3 shows the seed table sorted in descending order, listing the highest-ranked results. A common feature of these "seeds" is that they do occur in real-world text but are rarely used there, whereas they occur in large numbers, and often together, in the test text. They are therefore considered high-quality seeds, ready for word-length extension.
Rank    Candidate word         Score
1       Medical debate         14.919
2       Fork assembly          13.762
3       Point matching         12.535
4       Measuring pump         12.414
5       Chamber or             12.385
6       Coriolus versicolor    11.798
7       Check and              11.794
8       Backup instrument      11.537
9       Two sides of the bag   10.590
10      Study and examination  10.178
Table 3. The 10 words in the test text with the highest degree of word solidification and their scores
Step 4: two lists named Continue and stop are generated. Continue stores words whose length can still be extended; stop stores words whose length cannot be extended further. Word-length extension then begins with the probability threshold set to 0.3, meaning that a word is considered an extendable candidate only when P_next > 0.3. If an extension can still be found for the current word, it is put into the Continue list; otherwise, it is put into the stop list.
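The extension loop with its two lists can be sketched in Python as follows. The patent does not spell out how P_next is computed, so this sketch assumes it is the share of a seed's count among all seeds that start with the current word's last character. The function name `extend_terms`, the `max_len` guard against endless cycles, and the sample counts are all hypothetical.

```python
def extend_terms(seed_counts, threshold=0.3, max_len=8):
    """Lengthen seed bigrams by chaining seeds that start with a word's last character."""
    by_first = {}
    for seed, count in seed_counts.items():
        by_first.setdefault(seed[0], []).append((seed, count))

    continue_list = sorted(seed_counts)  # words that may still be extended
    stop_list = []                       # words that cannot be extended further
    while continue_list:
        word = continue_list.pop()
        options = by_first.get(word[-1], [])
        total = sum(count for _, count in options)
        # Assumption: P_next = count(seed) / total count of seeds sharing the first character.
        longer = [
            word + seed[1:]
            for seed, count in options
            if count / total > threshold and len(word) < max_len
        ]
        if longer:
            continue_list.extend(longer)
        else:
            stop_list.append(word)
    return sorted(set(stop_list))

terms = extend_terms({"注射": 10, "射液": 9, "射剂": 1})
```

In this example "注射" is extended to "注射液" because "射液" accounts for 0.9 of the seeds beginning with "射", while "射剂" at 0.1 falls below the 0.3 threshold and produces no extension.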

Claims (1)

1. A method for discovering Chinese professional terms and new words based on real world statistics is characterized by comprising the following steps:
step 1, collecting news corpora from various news media and defining them as the news text; taking the clinical drug names of a medical institution as the reference medical test text and defining it as the professional text;
step 2, applying binary (two-character) word segmentation to the news text and the professional text respectively; discarding non-Chinese characters in the segmentation result of the news text to obtain candidate words; counting the number of occurrences and the frequency of each candidate word; removing candidate words with a frequency less than 5 and calculating a PMI value for each remaining candidate word, where the PMI value measures the degree of solidification between the two characters of a candidate word and a higher PMI value indicates a closer connection between them; and, after the PMI value of every candidate word has been calculated, discarding the candidate words whose PMI values fall below a quantile cutoff, thereby obtaining the target words; if the PMI value of any candidate word obtained by binary word segmentation of the news text is PMI', then:
Figure FDA0004058561070000011
in the formula, x and y represent the two characters of a candidate word obtained by binary word segmentation of the professional text; p(x) represents the frequency of the character x in the professional text, p(y) represents the frequency of the character y in the professional text, and p(x, y) represents the frequency of the string xy in the professional text; x' and y' represent the two characters of a candidate word obtained by binary word segmentation of the news text, with x' = x and y' = y; p(x') represents the frequency of the character x' in the news text, p(y') represents the frequency of the character y' in the news text, and p(x', y') represents the frequency of the string x'y' in the news text;
normalizing the obtained PMI'; let the normalized value of PMI' be
Figure FDA0004058561070000012
Then there are:
Figure FDA0004058561070000013
the obtained normalized value
Figure FDA0004058561070000014
is taken as the PMI value of the current candidate word;
step 3, calculating the external adjacent entropy of any one target word x obtained in step 2
Figure FDA0004058561070000015
and the internal adjacency entropy
Figure FDA0004058561070000016
Wherein:
Figure FDA0004058561070000017
Figure FDA0004058561070000018
in the formulas, H_r(x) represents the right adjacency entropy of the target word x, H_l(x) represents the left adjacency entropy of the target word x, H_r(x_l) represents the right adjacency entropy of the left character x_l of the target word x, and H_l(x_r) represents the left adjacency entropy of the right character x_r of the target word x;
step 4, calculating the BE value of each target word from its external adjacency entropy and internal adjacency entropy, and normalizing each BE value to obtain a normalized BE value; let the BE value of the target word x be BE(x) and the normalized BE value be
Figure FDA0004058561070000021
Then there are:
Figure FDA0004058561070000022
Figure FDA0004058561070000023
in the formulas,
Figure FDA0004058561070000024
represents the mean of the BE values for all target words, std (BE (x)) represents the standard deviation of BE (x);
step 5, obtaining a score for each target word; let the score of the target word x be Score(x), then:
Figure FDA0004058561070000025
in the formula, λ represents a weight and PMI' represents the PMI value of the target word x;
step 6, taking the target word with the score value larger than a set threshold value as a seed word;
step 7, after the seed words have been generated, obtaining a phrase table of two-character words with a high degree of solidification; the professional terms to be extracted are captured from the news text in the form of bigrams, and the bigrams in the phrase table are then integrated; when the bigrams in the phrase table are integrated, they are recombined or lengthened using conditional probabilities.
CN201910608625.9A 2019-07-08 2019-07-08 Chinese professional term and new word discovery method based on real world statistics Active CN110442861B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910608625.9A CN110442861B (en) 2019-07-08 2019-07-08 Chinese professional term and new word discovery method based on real world statistics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910608625.9A CN110442861B (en) 2019-07-08 2019-07-08 Chinese professional term and new word discovery method based on real world statistics

Publications (2)

Publication Number Publication Date
CN110442861A CN110442861A (en) 2019-11-12
CN110442861B true CN110442861B (en) 2023-04-07

Family

ID=68429578

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910608625.9A Active CN110442861B (en) 2019-07-08 2019-07-08 Chinese professional term and new word discovery method based on real world statistics

Country Status (1)

Country Link
CN (1) CN110442861B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112988953B (en) * 2021-04-26 2021-09-03 成都索贝数码科技股份有限公司 Adaptive broadcast television news keyword standardization method

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105224682A (en) * 2015-10-27 2016-01-06 上海智臻智能网络科技股份有限公司 New word discovery method and device
CN105786991A (en) * 2016-02-18 2016-07-20 中国科学院自动化研究所 Chinese emotion new word recognition method and system in combination with user emotion expression ways
CN106126606A (en) * 2016-06-21 2016-11-16 国家计算机网络与信息安全管理中心 A kind of short text new word discovery method
CN108509425A (en) * 2018-04-10 2018-09-07 中国人民解放军陆军工程大学 Chinese new word discovery method based on novelty
CN108845982A (en) * 2017-12-08 2018-11-20 昆明理工大学 A kind of Chinese word cutting method of word-based linked character
CN108874921A (en) * 2018-05-30 2018-11-23 广州杰赛科技股份有限公司 Extract method, apparatus, terminal device and the storage medium of text feature word
CN108959259A (en) * 2018-07-05 2018-12-07 第四范式(北京)技术有限公司 New word discovery method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
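Liang Yang et al. Extraction New Sentiment Words in Weibo Based on Relative Branch Entropy. China Conference on Information Retrieval. 2018, full text. *
刘伟童; 刘培玉; 刘文锋; 李娜娜. New word discovery algorithm based on mutual information and adjacency entropy. Application Research of Computers. 2018, (No. 05), full text. *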

Also Published As

Publication number Publication date
CN110442861A (en) 2019-11-12

Similar Documents

Publication Publication Date Title
Liu et al. A soft-label method for noise-tolerant distantly supervised relation extraction
CN104391942B (en) Short essay eigen extended method based on semantic collection of illustrative plates
TWI518528B (en) Method, apparatus and system for identifying target words
CN104899260B (en) Chinese pathological text structured processing method
CN105786991B (en) In conjunction with the Chinese emotion new word identification method and system of user feeling expression way
CN106897559B (en) A kind of symptom and sign class entity recognition method and device towards multi-data source
CN106844351B (en) Medical institution organization entity identification method and device oriented to multiple data sources
CN111899890B (en) Medical data similarity detection system and method based on bit string hash
RU2018119771A (en) COMPARISON OF HOSPITALS FROM DECLINED HEALTH DATABASES WITHOUT OBVIOUS QUASI-IDENTIFIERS
CN107993724A (en) A kind of method and device of medicine intelligent answer data processing
CN110502750A (en) Disambiguation method, system, equipment and medium during Chinese medicine text participle
CN109344250A (en) Single diseases diagnostic message rapid structure method based on medical insurance data
CN113343703B (en) Medical entity classification extraction method and device, electronic equipment and storage medium
CN105488098B (en) A kind of new words extraction method based on field otherness
CN109947951A (en) A kind of automatically updated emotion dictionary construction method for financial text analyzing
CN109215798B (en) Knowledge base construction method for traditional Chinese medicine ancient languages
CN106959943B (en) Language identification updating method and device
CN109471950A (en) The construction method of the structural knowledge network of abdominal ultrasonic text data
CN105956158B (en) The method that network neologisms based on massive micro-blog text and user information automatically extract
CN112632910A (en) Operation encoding method, electronic device and storage device
CN110442861B (en) Chinese professional term and new word discovery method based on real world statistics
CN115982222A (en) Searching method based on special disease and special medicine scenes
CN111104481A (en) Method, device and equipment for identifying matching field
US11556706B2 (en) Effective retrieval of text data based on semantic attributes between morphemes
Gafni Child phonology analyzer: Processing and analyzing transcribed speech.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant