CN110502750A - Disambiguation method, system, device and medium for traditional Chinese medicine text word segmentation
- Publication number
- CN110502750A (application CN201910722134.7A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Abstract
The present disclosure provides a disambiguation method for word segmentation of Chinese medicine text, comprising: acquiring a Chinese medicine text to be segmented; preprocessing the Chinese medicine text; performing word segmentation on the preprocessed Chinese medicine text; matching the word segmentation result against a pre-constructed combined ambiguous word bank and screening out combined ambiguous words and non-combined ambiguous words; storing the non-combined ambiguous words in a word segmentation result database; marking the screened combined ambiguous words with word frequency and part of speech, calculating the mutual information vector of the current combined ambiguous word from its part of speech and word frequency, inputting the mutual information vector into a pre-trained support vector machine model, and outputting whether the category of the current combined ambiguous word is a detachable category; and splitting or not splitting the current combined ambiguous word according to the category. The method eliminates incorrect segmentation of combined words during word segmentation of Chinese medicine text and realizes accurate disambiguation of combined Chinese medicine words.
Description
Technical Field
The present disclosure relates to the field of text segmentation technologies, and in particular, to a disambiguation method, system, device, and medium for use in a text segmentation process in traditional Chinese medicine.
Background
The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.
In the course of implementing the present disclosure, the inventors found that the following technical problems exist in the prior art:
in the existing word segmentation process of the traditional Chinese medicine text, the word segmentation result is not accurate enough, and particularly, the accurate word segmentation and the accurate disambiguation cannot be realized on combined ambiguous words, so that the word segmentation result is unsatisfactory.
Disclosure of Invention
In order to solve the deficiencies of the prior art, the present disclosure provides disambiguation methods, systems, devices and media in the process of Chinese medicine text word segmentation;
in a first aspect, the present disclosure provides a disambiguation method in a text-to-word segmentation process of traditional Chinese medicine;
the disambiguation method in the process of Chinese medicine text word segmentation comprises the following steps:
acquiring a Chinese medicine text to be segmented; preprocessing the Chinese medicine text, wherein the preprocessing comprises the following steps: deleting stop words, repeated words and tone words;
performing word segmentation on the preprocessed traditional Chinese medicine text;
matching the result after word segmentation processing with a pre-constructed combined ambiguous word library, and screening out combined ambiguous words and non-combined ambiguous words from the result after word segmentation processing; storing the non-combined ambiguous words into a word segmentation result database;
performing word frequency and word property marking on the screened combined ambiguous words, calculating mutual information vectors of the current combined ambiguous words according to the word properties and the word frequencies of the screened combined ambiguous words, inputting the mutual information vectors into a pre-trained support vector machine model, and outputting whether the category of the current combined ambiguous words is a detachable category or not; and splitting or not splitting the current combined ambiguous word according to the category.
In a second aspect, the present disclosure also provides a disambiguation system in the process of Chinese medicine text word segmentation;
the disambiguation system in the process of Chinese medicine text word segmentation comprises:
the preprocessing module is used for acquiring a Chinese medicine text to be segmented; preprocessing the Chinese medicine text, wherein the preprocessing comprises the following steps: deleting stop words, repeated words and tone words;
the word segmentation module is used for carrying out word segmentation on the preprocessed traditional Chinese medicine text;
the matching module is used for matching the result after the word segmentation processing with a pre-constructed combined ambiguous word bank and screening out combined ambiguous words and non-combined ambiguous words from the result after the word segmentation processing; storing the non-combined ambiguous words into a word segmentation result database;
the disambiguation module is used for marking the word frequency and part of speech of the screened combined ambiguous words, calculating mutual information vectors of the current combined ambiguous words according to the part of speech and the word frequency of the screened combined ambiguous words, inputting the mutual information vectors into a pre-trained support vector machine model, and outputting whether the category of the current combined ambiguous words is a detachable category; and splitting or not splitting the current combined ambiguous word according to the category.
In a third aspect, the present disclosure also provides an electronic device comprising a memory and a processor, and computer instructions stored on the memory and executed on the processor, wherein the computer instructions, when executed by the processor, perform the steps of the method of the first aspect.
In a fourth aspect, the present disclosure also provides a computer-readable storage medium for storing computer instructions which, when executed by a processor, perform the steps of the method of the first aspect.
Compared with the prior art, the beneficial effect of this disclosure is:
the word segmentation result is accurate and resolves the ambiguity of combined words; in particular, incorrect segmentation of combined words during word segmentation of the Chinese medicine text is eliminated, and accurate disambiguation of combined Chinese medicine words is realized.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application.
FIG. 1 is a flow chart of the method of the first embodiment.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of the stated features, steps, operations, devices, components, and/or combinations thereof.
Embodiment one: the present disclosure provides a disambiguation method for the Chinese medicine text word segmentation process;
as shown in fig. 1, the disambiguation method in the process of Chinese medicine text word segmentation includes:
s1: acquiring a Chinese medicine text to be segmented; preprocessing the Chinese medicine text, wherein the preprocessing comprises the following steps: deleting stop words, repeated words and tone words;
s2: performing word segmentation on the preprocessed traditional Chinese medicine text;
s3: matching the result after word segmentation processing with a pre-constructed combined ambiguous word library, and screening out combined ambiguous words and non-combined ambiguous words from the result after word segmentation processing; storing the non-combined ambiguous words into a word segmentation result database;
s4: performing word frequency and word property marking on the screened combined ambiguous words, calculating mutual information vectors of the current combined ambiguous words according to the word properties and the word frequencies of the screened combined ambiguous words, inputting the mutual information vectors into a pre-trained support vector machine model, and outputting whether the category of the current combined ambiguous words is a detachable category or not; and splitting or not splitting the current combined ambiguous word according to the category.
As one or more embodiments, the obtained text of the traditional Chinese medicine to be segmented includes a text of a medical record of the traditional Chinese medicine, specifically includes a patient's self-describing disease condition or a doctor's diagnosis conclusion.
As one or more embodiments, the word segmentation processing is performed on the preprocessed Chinese medicine text by using the Chinese word segmentation system of the Chinese Academy of Sciences.
As one or more embodiments, the pre-constructed combined ambiguous word bank is constructed by the following steps:
segmenting all the data sets into words and combining each segmented field with its nearest adjacent field; if the combined word also exists in a Chinese medicine dictionary, labeling the current field and its adjacent field; then manually reviewing all labeled fields, and if a labeled pair is truly a combined word, putting it into the combined word bank;
or,
performing word segmentation on all the data sets and collecting statistics on all resulting words; performing secondary word segmentation on each word independently, and labeling every word that can be segmented a second time; extracting the labeled words and reviewing them manually, and if an extracted word is truly a combined word, putting it into the combined word bank.
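The first construction route above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `candidate_combined_words`, the representation of the Chinese medicine dictionary as a Python set, and the toy tokens are all assumptions. The output is only a candidate list, since the text requires manual review before a word enters the combined word bank.

```python
def candidate_combined_words(segmented_sentences, tcm_dictionary):
    """Flag adjacent token pairs whose concatenation is itself a dictionary word.

    segmented_sentences: list of token lists (output of a word segmenter).
    tcm_dictionary: set of known words (hypothetical stand-in for the
    Chinese medicine dictionary mentioned in the text).
    """
    candidates = set()
    for tokens in segmented_sentences:
        # Pair every token with its immediate right-hand neighbour.
        for left, right in zip(tokens, tokens[1:]):
            merged = left + right
            if merged in tcm_dictionary:
                candidates.add((left, right, merged))
    return candidates


# Toy placeholder data, not taken from the patent's corpus.
sentences = [["A", "B", "C"], ["B", "C", "D"]]
dictionary = {"A", "B", "C", "D", "AB", "CD"}
found = candidate_combined_words(sentences, dictionary)
```

Each tuple in `found` records the two fields and the merged word, so a reviewer can see the pair in both forms before accepting it.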
As one or more embodiments, the word frequency tagging is performed on the screened ambiguous words, which means that the frequency of the current ambiguous words appearing in the current chinese medical text is tagged.
As one or more embodiments, the part-of-speech tagging is performed on the selected ambiguous word, which means that the part-of-speech of the current ambiguous word in the text of the chinese medical science is tagged. The part of speech includes: nouns, verbs, adjectives, time words, and so forth.
As one or more embodiments, calculating a mutual information vector of the current ambiguous word according to the part of speech and the word frequency of the selected ambiguous word; the method comprises the following specific steps:
$MI_3 = P(w_{i-1}\mid s_{i-1})\,P(s_i\mid s_{i-1})\,P(w_i\mid s_i)\,P(s_{i+1}\mid s_i)\,P(w_{i+1}\mid s_{i+1})$; (3)
$MI_4 = P(w_{i-1}\mid s_{i-1})\,P(s_{i1}\mid s_{i-1})\,P(w_{i1}\mid s_{i1})\,P(s_{i2}\mid s_{i1})\,P(w_{i2}\mid s_{i2})\,P(s_{i+1}\mid s_{i2})\,P(w_{i+1}\mid s_{i+1})$; (4)
wherein $MI_1$ represents the first mutual information vector; $MI_2$ represents the second mutual information vector; $MI_3$ represents the third mutual information vector; $MI_4$ represents the fourth mutual information vector; $w_{i-1}$ represents the word preceding the ambiguous field, and $s_{i-1}$ represents the part of speech of that preceding word; $w_{i+1}$ represents the word following the ambiguous field, and $s_{i+1}$ represents the part of speech of that following word; $w_i$ represents the combined ambiguous field treated as a single, unsplit field, and $s_i$ represents the part of speech of that single field; $w_{i1}$ and $w_{i2}$ represent the two fields obtained when the combined ambiguous field is split; $s_{i1}$ and $s_{i2}$ represent the parts of speech of those two fields.
As one or more embodiments, a pre-trained support vector machine model; the specific training steps include:
s41, selecting a plurality of Chinese medicine medical case texts for word segmentation;
s42, matching each field in the word segmentation result with a pre-constructed combined ambiguous word bank; carrying out ambiguous word recognition, and labeling ambiguous words:
if a certain field exists in the combined ambiguous word bank and the combination of the field and the next field also exists in the combined word bank, labeling the field, wherein the labeling of the field is represented in a mode that the current combined word can be split and processed;
if a certain field exists in the combined ambiguous word stock, but the combination of the field and the next field does not exist in the combined word stock, labeling the field, wherein the labeling of the field is represented in a form that the current combined word is not separable;
if a certain field does not exist in the combined word stock, continuing to match other fields with the combined ambiguous word stock;
S43, calculating the mutual information values $MI_1$, $MI_2$, $MI_3$ and $MI_4$ of the ambiguous word to obtain the vector $\langle MI_1, MI_2, MI_3, MI_4\rangle$;
S44, inputting the vector $\langle MI_1, MI_2, MI_3, MI_4\rangle$ together with the known category (detachable or not) of the current ambiguous word into the support vector machine model for training, to obtain the trained support vector machine model.
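The labeling rules of step S42 can be sketched as below. The sketch follows the three rules literally; the function name `label_fields`, the label strings, and the use of a single set for the combined word bank are illustrative assumptions.

```python
DETACHABLE = "detachable"
NOT_DETACHABLE = "not_detachable"


def label_fields(tokens, combined_lexicon):
    """Apply the three S42 rules to one segmented sentence.

    combined_lexicon: set of strings (hypothetical stand-in for the
    combined ambiguous word bank).
    Returns (index, field, label) for every field found in the lexicon.
    """
    labels = []
    for i, field in enumerate(tokens):
        if field not in combined_lexicon:
            continue  # rule 3: not in the word bank, move on to the next field
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if nxt is not None and field + nxt in combined_lexicon:
            labels.append((i, field, DETACHABLE))      # rule 1
        else:
            labels.append((i, field, NOT_DETACHABLE))  # rule 2
    return labels


# Toy placeholder data.
lexicon = {"A", "AB"}
first = label_fields(["x", "A", "B", "y"], lexicon)
second = label_fields(["A", "y"], lexicon)
```

The labeled fields, paired with their mutual information vectors, form the training set used in S44.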
As one or more embodiments, the splitting or non-splitting processing of the current ambiguous word is implemented according to a category; the method comprises the following specific steps:
if the ambiguous words are classified in a detachable way, performing word segmentation on the current ambiguous words, and storing the splitting result of the current ambiguous words into a word segmentation result database as a final word segmentation result;
and if the ambiguous word is not the detachable category, not segmenting the current ambiguous word, and directly storing the current ambiguous word into a word segmentation result database as a final word segmentation result.
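A minimal sketch of this routing is shown below. The `classify` callable stands in for the mutual information vector plus the trained support vector machine, and `split_table` (a mapping from a combined word to its parts) is a hypothetical helper; neither name comes from the source.

```python
def disambiguate(tokens, combined_lexicon, classify, split_table):
    """Build the final segmentation result for one sentence.

    classify(tokens, index) must return "detachable" or any other value.
    split_table maps a combined word to the list of its parts.
    """
    result = []
    for i, tok in enumerate(tokens):
        if tok not in combined_lexicon:
            result.append(tok)               # non-combined word: store as is
        elif classify(tokens, i) == "detachable":
            result.extend(split_table[tok])  # detachable: store the split result
        else:
            result.append(tok)               # not detachable: keep the word whole
    return result


# Stub classifiers used only to exercise both branches.
def always_split(tokens, index):
    return "detachable"


def never_split(tokens, index):
    return "not_detachable"


tokens = ["p", "AB", "q"]
lexicon = {"AB"}
parts = {"AB": ["A", "B"]}
split_result = disambiguate(tokens, lexicon, always_split, parts)
kept_result = disambiguate(tokens, lexicon, never_split, parts)
```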
1. Feature selection
In traditional Chinese medicine medical records, doctors follow certain conventions when writing the record text, and a symptom is formed from one or more words: noun-type symptom names of traditional Chinese medicine, such as thunder head wind, stroke and dizziness; symptom + change, such as relief of suffocation, relief of shortness of breath and hypomenorrhea; body part + adjective or adjective + body part, such as abdominal pain, dizziness and green tongue. Therefore, in traditional Chinese medicine medical records a word is strongly connected to the words before and after it, and the parts of speech of neighboring words show a certain regularity. Based on the word frequency and the part of speech of words, this work builds word-frequency mutual information and part-of-speech mutual information for feature selection.
1.1 conventional mutual information
$w_i$ denotes the current combined ambiguous field in its "combined" state, and $w_{i1}$ and $w_{i2}$ denote the current combined ambiguous field in its "split" state. $W = w_1 w_2 \ldots w_i \ldots w_n$ represents a segmented sentence in which the current combined ambiguous field is in the "combined" form, and $W = w_1 w_2 \ldots w_{i1} w_{i2} \ldots w_n$ represents the sentence in which the field is in the "split" form. Mutual information is often used for extracting text features: it reflects the degree of correlation between words, and the higher the mutual information value between two words, the more strongly they are correlated. The calculation formula can be expressed as:
wherein $P(w_{i-1}\mid w_i)$ denotes the probability that the feature word $w_{i-1}$ occurs in the Chinese medicine medical record data set when the combined ambiguous field $w_i$ is in the "combined" form, and $P(w_{i-1})$ denotes the probability that the feature word $w_{i-1}$ occurs in the data set. According to related research, mutual information performs worse in experiments than other feature selection methods (chi-square and information gain). The reason is that the probability of the feature word appears in the denominator: when a low-frequency word is selected as a feature, the value of the whole formula increases, so low-frequency words receive high mutual information values. Past studies therefore often extracted low-frequency words as important features while ignoring factors such as word frequency and part of speech, which seriously degrades text disambiguation.
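The conventional formula itself is not reproduced in this text. As a rough illustration of the low-frequency problem only, the sketch below assumes the standard pointwise form log(P(w_{i-1}|w_i) / P(w_{i-1})), which uses the two probabilities named above. The natural logarithm, the counts, and the function name `pmi` are all illustrative assumptions.

```python
import math


def pmi(count_joint, count_field, count_word, total):
    """Pointwise mutual information under the assumed standard form.

    count_joint: times the feature word appears next to the field
    count_field: times the field appears in "combined" form
    count_word:  times the feature word appears in the data set
    total:       number of words in the data set
    """
    p_cond = count_joint / count_field  # P(w_{i-1} | w_i)
    p_word = count_word / total         # P(w_{i-1}), the denominator
    return math.log(p_cond / p_word)


total = 10_000
field_count = 100
# A word seen only twice overall, both times next to the field.
rare = pmi(count_joint=2, count_field=field_count, count_word=2, total=total)
# A word seen 1000 times overall, 50 of them next to the field.
frequent = pmi(count_joint=50, count_field=field_count, count_word=1_000, total=total)
```

With these made-up counts the rare word scores higher than the frequent one even though it supplies only two observations, which is the inflation the paragraph describes.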
1.2 improved word frequency mutual information
When the feature word is a low-frequency word, $P(w_{i-1})$, being the denominator in the conventional mutual information formula, makes the feature value of the low-frequency word large. In traditional Chinese medicine text, medium- and high-frequency words are the most important features of a passage or sentence and matter greatly for text data mining, whereas low-frequency words contribute little to the text and can become noise. To address the high mutual information values of low-frequency words, this research adds a word frequency factor $\eta_i$ of the feature word to the mutual information, defined as the word frequency of the feature word under the different forms of the ambiguity. The formula is as follows:
wherein $P(w_{i-1}\mid w_i)$ denotes the word frequency of the feature word $w_{i-1}$ when the combined ambiguous field $w_i$ is in the "combined" form, $P^*(w_{i-1}\mid w_i)$ denotes the number of occurrences of the feature word $w_{i-1}$ when $w_i$ is in the "combined" form, and $P^*(w_i)$ denotes the total number of occurrences of $w_i$ in the "combined" form. After adding the word frequency factor $\eta_i$ to the mutual information formula, the formula of mutual information is:
1.3 parts-of-speech mutual information construction
Combining the characteristics of Chinese medicine text with the study of combined ambiguous fields, it is found that words in Chinese medicine medical records are closely related to their parts of speech. The part of speech of a combined ambiguous field is strongly associated with the feature words in its context, and this matters in particular for deciding between the "combined" and "split" forms of the field. Based on the part-of-speech association between a combined ambiguous field and the feature words in its context, $MI_3$ and $MI_4$ denote the part-of-speech mutual information values between the feature words and the field in its "combined" and "split" forms, respectively. $S = s_1 s_2 \ldots s_i \ldots s_n$ and $S = s_1 s_2 \ldots s_{i1} s_{i2} \ldots s_n$ are the part-of-speech tag strings of the sentence when the combined ambiguous field is in the "combined" and "split" forms, respectively. $s_i$ denotes the part of speech of the ambiguous field $w_i$ in the "combined" form, $s_{i-1}$ denotes the part of speech of the word $w_{i-1}$ preceding the ambiguous field, and $s_{i1}$ and $s_{i2}$ denote the parts of speech of the words $w_{i1}$ and $w_{i2}$ in the "split" form.
The part-of-speech mutual information is defined by formulas (2) and (3):
$MI_3 = P(w_{i-1}\mid s_{i-1})\,P(s_i\mid s_{i-1})\,P(w_i\mid s_i)\,P(s_{i+1}\mid s_i)\,P(w_{i+1}\mid s_{i+1})$ (2)
$MI_4 = P(w_{i-1}\mid s_{i-1})\,P(s_{i1}\mid s_{i-1})\,P(w_{i1}\mid s_{i1})\,P(s_{i2}\mid s_{i1})\,P(w_{i2}\mid s_{i2})\,P(s_{i+1}\mid s_{i2})\,P(w_{i+1}\mid s_{i+1})$ (3)
$P(w_i\mid s_i)$ denotes the probability that the word $w_i$ occurs in the Chinese medicine medical records given that its part of speech is $s_i$; $P(s_i\mid s_{i-1})$ denotes the probability that the word $w_i$ has part of speech $s_i$ given that the preceding word $w_{i-1}$ has part of speech $s_{i-1}$.
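Formulas (2) and (3) are plain products of word-given-tag and tag-given-previous-tag probabilities, so they can be evaluated directly from two lookup tables. The sketch below is illustrative: the table layout (`emit[(word, tag)]`, `trans[(tag, previous_tag)]`), the function names, and every probability value are made-up assumptions, and how the probabilities are estimated from the corpus is left out.

```python
def mi3(emit, trans, w_prev, s_prev, w, s, w_next, s_next):
    """Part-of-speech mutual information for the "combined" form, formula (2)."""
    return (emit[(w_prev, s_prev)] * trans[(s, s_prev)] * emit[(w, s)]
            * trans[(s_next, s)] * emit[(w_next, s_next)])


def mi4(emit, trans, w_prev, s_prev, w1, s1, w2, s2, w_next, s_next):
    """Part-of-speech mutual information for the "split" form, formula (3)."""
    return (emit[(w_prev, s_prev)] * trans[(s1, s_prev)] * emit[(w1, s1)]
            * trans[(s2, s1)] * emit[(w2, s2)]
            * trans[(s_next, s2)] * emit[(w_next, s_next)])


# emit[(word, tag)] = P(word | tag); trans[(tag, prev_tag)] = P(tag | prev_tag).
# All values are toy numbers chosen for easy arithmetic.
emit = {("prev", "N"): 0.5, ("AB", "V"): 0.2, ("A", "D"): 0.5,
        ("B", "P"): 0.5, ("next", "N"): 0.5}
trans = {("V", "N"): 0.5, ("N", "V"): 0.5, ("D", "N"): 0.5,
         ("P", "D"): 0.5, ("N", "P"): 0.5}

combined_score = mi3(emit, trans, "prev", "N", "AB", "V", "next", "N")
split_score = mi4(emit, trans, "prev", "N", "A", "D", "B", "P", "next", "N")
```

Note that formula (3) multiplies seven factors against five in formula (2), so the two raw values are not on the same scale; in the method they are used as separate components of the feature vector rather than compared directly.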
1.4 construction of vectors
In this example, $MI_1$ denotes the word-frequency mutual information between the combined ambiguous field $w_i$ and its context when the field is in the "combined" form, and $MI_2$ denotes the word-frequency mutual information between the field and its context when the field is in the "split" form. $MI_3$ denotes the part-of-speech mutual information between the combined ambiguous field $w_i$ and its context in the "combined" form, and $MI_4$ denotes the part-of-speech mutual information between the field and its context in the "split" form. Each ambiguous field is expressed as a vector of the values obtained from the word-frequency and part-of-speech mutual information, recorded as $\langle MI_1, MI_2, MI_3, MI_4\rangle$.
2 support vector machine model
The support vector machine (SVM) is a common machine learning algorithm with good classification accuracy, and it is particularly suitable for binary classification. Its working principle is to find an optimal separating hyperplane that meets the classification accuracy requirement while maximizing the margin to the samples on both sides. A combined ambiguity in Chinese medicine medical records has two possible forms, "combined" and "split", which can be regarded as two classes, so the support vector machine is used to solve this binary classification problem.
The basic idea of the SVM algorithm is as follows. The data set used is $(x_1, y_1), \ldots, (x_i, y_i), \ldots, (x_n, y_n)$, $i = 1, 2, \ldots, n$, with $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, +1\}$. The separating hyperplane given by the SVM is:
$w^{T}x + b = 0$
the support vectors lie on the hyperplanes defined by:
$w^{T}x + b = \pm 1$
the decision function of the SVM is:
$g(x) = \operatorname{sgn}\left((w^{*})^{T}x + b^{*}\right)$
When a sample $x$ is to be classified, its class is determined by calculating $g(x)$; the output value of the function is the classification result.
$MI_1$, $MI_2$, $MI_3$ and $MI_4$ of the combined ambiguous field are calculated according to formulas (1), (2) and (3) to obtain the vector $\langle MI_1, MI_2, MI_3, MI_4\rangle$, which is substituted into the classification function $g(x)$: if the result equals 1, the ambiguous field is in the "combined" form; if the result equals -1, the ambiguous field is in the "split" form.
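The decision rule can be illustrated with a linear decision function evaluated on a four-component vector. This is only a sketch: the weights and bias below are hypothetical numbers standing in for the trained parameters, the feature values are made up, and treating a score of exactly zero as +1 is an assumed convention.

```python
def sgn(value):
    # Assumed convention: a score of exactly 0 is mapped to +1.
    return 1 if value >= 0 else -1


def g(x, w, b):
    """Linear decision function g(x) = sgn(w . x + b)."""
    return sgn(sum(wi * xi for wi, xi in zip(w, x)) + b)


# Hypothetical trained parameters (not taken from the source).
w_star = [1.0, -1.0, 1.0, -1.0]
b_star = 0.0

# Made-up vectors <MI1, MI2, MI3, MI4> for two ambiguous fields.
looks_combined = [0.8, 0.1, 0.6, 0.2]
looks_split = [0.1, 0.9, 0.2, 0.7]

form_a = g(looks_combined, w_star, b_star)  # +1 means "combined"
form_b = g(looks_split, w_star, b_star)     # -1 means "split"
```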
3 disambiguation model construction
3.1 definition, construction, acquisition of Combined thesaurus
(1) Combinatorial ambiguity definition
Definition of combinatory ambiguity fields herein:
Combinatorial ambiguity field: assume a field AB composed of the fields A and B, where A, B and AB can each be a word. If the Chinese text contains a sentence W in which A and B as separate words also hold both grammatically and semantically, AB is a combinatorial ambiguity field.
An example of a combinatorial ambiguity field is as follows:
1: insomnia/somnolence/sleep.
2: recent/3/year/last/burst/breathlessness/,/multi/on/exertion/post/occurrence.
The word "more than" can be regarded as a combinatorial ambiguity field in the sentences above. In example 1 it is a single word (the second token), i.e. the "combined" form; in example 2 it is split into two words (the tokens rendered "multi" and "on"), i.e. the "split" form.
(2) Establishment of combined word stock
Combinatorial disambiguation techniques have gradually matured in current research. However, a combined ambiguity corpus for disambiguation is still lacking; in particular, no suitable combined ambiguity corpus exists for Chinese medicine text disambiguation, and important features in sentences, such as word frequency and part of speech, are not fully exploited, so disambiguation performance is not ideal. In this research, to support combinatorial ambiguity resolution, a combined ambiguous word bank is built from combined ambiguous words selected from medical records and is used to identify the combined ambiguous words present in Chinese medicine medical records.
(3) Acquisition of combinatory ambiguous fields
The obtained Chinese medicine medical records are preprocessed by word segmentation, part-of-speech tagging and the like, and the combined ambiguous fields are then identified, labeled and extracted by a matching algorithm using the combined word bank built for Chinese medicine text. In the medical record data set, a combinatorially ambiguous field occurs both in the "combined" form (AB) and in the "split" form (A, B), and both forms are labeled. As required by the experiment, 500 instances of combined ambiguous fields are extracted from the segmented corpus of Chinese medicine medical records. The flow of the combined ambiguous field extraction method is shown in fig. 1.
The following sentences contain combined ambiguous fields and have been segmented and tagged with parts of speech:
more recently/t few/m/qmonth/n ,/un night/t insomnia/v than/v peaceful/v ./un
near/a three/m years/n burst/q occurrence/v breathlessness/n ,/un breathlessness/n multiple/v at/p exertion/an after/f occurrence/v ./un
Table 1 lists the feature information contained in the above example sentences when the window size is 2.
TABLE 1 Feature information
Type of feature | Feature value |
Local word | Insomnia, insomnia, shortness of breath and fatigue |
Local word part of speech | $s_{i-1}=v$, $s_{i+1}=v$; $s_{i-1}=n$, $s_{i+1}=an$ |
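The features in Table 1 can be read off a tagged sentence by taking the word and tag on each side of the ambiguous field. The sketch below assumes that "window size 2" means one neighbor on the left and one on the right; the function name `context_features` is hypothetical, and the token lists are shortened copies of the two tagged example sentences as rendered in this text.

```python
def context_features(tagged, start, end):
    """Return the word and tag just before and just after tagged[start:end + 1].

    tagged: list of (word, tag) pairs; start and end are inclusive indices
    of the ambiguous field (start == end for the "combined" form).
    """
    prev_word, prev_tag = tagged[start - 1] if start > 0 else (None, None)
    next_word, next_tag = tagged[end + 1] if end + 1 < len(tagged) else (None, None)
    return {"prev_word": prev_word, "prev_tag": prev_tag,
            "next_word": next_word, "next_tag": next_tag}


# Field in "combined" form: one token at index 2.
sentence_1 = [("night", "t"), ("insomnia", "v"), ("than", "v"), ("peaceful", "v")]
combined = context_features(sentence_1, 2, 2)

# Field in "split" form: two tokens at indices 1 and 2.
sentence_2 = [("breathlessness", "n"), ("multiple", "v"), ("at", "p"),
              ("exertion", "an"), ("after", "f"), ("occurrence", "v")]
split = context_features(sentence_2, 1, 2)
```

The tags recovered this way (v and v for the first sentence, n and an for the second) agree with the part-of-speech row of Table 1.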
3.2 Disambiguation steps
The specific disambiguation algorithm is described as follows:
(1) The main steps of the training phase are as follows:
Step 1: Select 200 traditional Chinese medicine medical records and perform word segmentation.
Step 2: Match the segmented medical records against the combined word bank, identify the ambiguous words with the matching extraction algorithm, and label them.
Step 3: Calculate the mutual information values MI_1, MI_2, MI_3 and MI_4 of each combined ambiguous field to obtain the vector <MI_1, MI_2, MI_3, MI_4>.
Step 4: Feed the vector <MI_1, MI_2, MI_3, MI_4> into the support vector machine model for training to obtain the classification function g(x).
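The construction of the four-component vector in Step 3 can be sketched as below. The formulas (1) to (3) of the disclosure are not reproduced in this passage, so the sketch substitutes standard pointwise mutual information, and it assumes that MI_1 and MI_2 relate the field to its preceding and following word while MI_3 and MI_4 do the same for parts of speech. The function names, the zero-count fallback, and the use of the token count as a common denominator are likewise assumptions.

```python
import math
from collections import Counter
from typing import List, Tuple

Tagged = Tuple[str, str]  # (word, part-of-speech tag)


def pmi(joint: int, left: int, right: int, total: int) -> float:
    """Pointwise mutual information log2(p(x,y) / (p(x) p(y))) from raw counts.

    Returns 0.0 when any count is zero (an arbitrary choice for this sketch).
    """
    if joint == 0 or left == 0 or right == 0:
        return 0.0
    return math.log2(joint * total / (left * right))


def count(seqs: List[List[str]]) -> Tuple[Counter, Counter]:
    """Unigram counts and adjacent-pair counts over token sequences."""
    uni, bi = Counter(), Counter()
    for seq in seqs:
        uni.update(seq)
        bi.update(zip(seq, seq[1:]))
    return uni, bi


def mi_vector(corpus: List[List[Tagged]], prev: Tagged, field: Tagged, nxt: Tagged):
    """Assumed layout: (word MI left, word MI right, tag MI left, tag MI right)."""
    w_uni, w_bi = count([[w for w, _ in sent] for sent in corpus])
    t_uni, t_bi = count([[t for _, t in sent] for sent in corpus])
    total = sum(w_uni.values())  # token count, shared by words and tags
    (pw, pt), (fw, ft), (nw, nt) = prev, field, nxt
    return (
        pmi(w_bi[(pw, fw)], w_uni[pw], w_uni[fw], total),
        pmi(w_bi[(fw, nw)], w_uni[fw], w_uni[nw], total),
        pmi(t_bi[(pt, ft)], t_uni[pt], t_uni[ft], total),
        pmi(t_bi[(ft, nt)], t_uni[ft], t_uni[nt], total),
    )


# Toy corpus with placeholder words; real counts would come from the medical records.
corpus = [[("a", "n"), ("b", "v"), ("c", "n")], [("a", "n"), ("b", "v"), ("d", "n")]]
print(mi_vector(corpus, ("a", "n"), ("b", "v"), ("c", "n")))
```

The resulting tuple, paired with the known category of each labeled field, is the kind of input a support vector machine implementation would be trained on.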
(2) The main steps of the testing stage are as follows:
Step 1: Select 300 traditional Chinese medicine medical records and perform word segmentation to obtain a segmented data set; match the data set against the combined word bank and identify the combined ambiguous fields contained in the sentences.
Step 2: For each sentence containing a combined ambiguous field, obtain the two segmentation paths (the "combined" form and the "split" form) and the corresponding part-of-speech tag strings.
Step 3: Extract the words and their parts of speech, and calculate the word frequencies and part-of-speech frequencies. Substitute these into formulas (1), (2) and (3) to calculate MI_1, MI_2, MI_3 and MI_4, expressed as the vector <MI_1, MI_2, MI_3, MI_4>.
Step 4: Substitute the obtained vector into the trained classification function g(x) to obtain the category 1 or -1 and hence the corresponding segmentation result.
Step 5: Resolve the ambiguity of the combined ambiguous field to obtain the disambiguated word segmentation result, and end the experiment.
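Steps 4 and 5 can be illustrated with a linear decision function. This is a sketch under stated assumptions: the disclosure does not specify the kernel, so a linear form g(x) = sign(w·x + b) is assumed; the weights and bias below are made-up numbers, not trained values; and mapping category 1 to "split" and -1 to "keep combined" is an assumed convention.

```python
from typing import List, Sequence


def g(x: Sequence[float], w: Sequence[float], b: float) -> int:
    """Linear decision function: returns 1 or -1 from the sign of w.x + b."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1


def resolve(first: str, second: str, category: int) -> List[str]:
    """Assumed convention: 1 means split into A, B; -1 means keep AB."""
    return [first, second] if category == 1 else [first + second]


# Hypothetical parameters standing in for a trained model.
w, b = (1.0, 1.0, 1.0, 1.0), -1.0

print(resolve("A", "B", g((1.0, 0.5, 0.2, 0.1), w, b)))  # ['A', 'B']
print(resolve("A", "B", g((0.1, 0.1, 0.1, 0.1), w, b)))  # ['AB']
```

In practice the parameters would come from the training phase, and a kernelized model would replace the dot product with its own decision function.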
4 Experiments
4.1 Experimental data
Traditional Chinese medicine text is characterized by specialized terminology and by a mixture of classical and modern Chinese. Segmenting such text requires taking word frequency and the relations between words into account. Because no public evaluation corpus exists for combinatorial ambiguity resolution in Chinese medicine text, a Chinese medicine medical record ambiguous word library is established in this research to test the effect of the medical record disambiguation method. The main corpus comes from 20,000 medical records of the affiliated hospitals of Shandong University of Traditional Chinese Medicine. The obtained text is split into paragraphs and clauses, segmented into words, and part-of-speech tagged, which finally yields the word segmentation corpus required by the experiment. A traditional Chinese medicine combined word bank is constructed by manual scanning, combined word extraction and similar steps; the word bank is then matched against the word segmentation corpus with a matching algorithm to obtain and label the combined ambiguous fields of the text. The corpus is encoded in UTF-8. 2000 of the medical records were selected for the ambiguous word resolution experiments.
The ambiguous word resolution experiments were divided into three groups. Experiment one performs word segmentation with the traditional feature extraction method based on mutual information; experiment two performs word segmentation with a feature extraction method based on part-of-speech mutual information; experiment three performs word segmentation with the ambiguity resolution method proposed herein. The standard corpus described herein is used as the experimental corpus, giving totals of 25052 words and 10356 words.
4.2 Analysis of results
This document extracts 5 example sentences from the test corpus to show the disambiguation results.
Example one: the weight of the body is reduced by 15kg within 2 years.
Example two: the patient had a slightly swollen tongue.
Example three: the lumbago, shortness of breath and tenesmus of stool are obviously improved in rainy days.
Example four: the patient has normal menstruation, and the menstrual cycle is prolonged by 7 days after taking cold drink 2 years ago.
Table 3 lists the comparison results of 5 example sentences in the test corpus from experiment one to experiment two.
Table 3. Word segmentation results on the test corpus
As can be seen from Table 3, the disambiguation method based on context information gives unsatisfactory results at the beginning of a sentence, because the field has no preceding word. The disambiguation method based on the support vector machine gives unsatisfactory segmentation results when it encounters specialized terms. Overall, the experimental results show that the word segmentation system with the disambiguation method added achieves a good segmentation effect.
Embodiment two: the present disclosure also provides a disambiguation system for combinatorial ambiguity in Chinese medicine text.
The disambiguation system in the process of Chinese medicine text word segmentation comprises:
a preprocessing module, used for acquiring a Chinese medicine text to be segmented and preprocessing the Chinese medicine text, wherein the preprocessing comprises deleting stop words, repeated words and tone words;
a word segmentation module, used for performing word segmentation on the preprocessed Chinese medicine text;
a matching module, used for matching the word segmentation result against a pre-constructed combined ambiguous word bank, screening out combined ambiguous words and non-combined ambiguous words from the word segmentation result, and storing the non-combined ambiguous words into a word segmentation result database;
a disambiguation module, used for tagging the screened combined ambiguous words with word frequency and part of speech, calculating the mutual information vector of the current combined ambiguous word from the part of speech and word frequency, inputting the mutual information vector into a pre-trained support vector machine model, outputting whether the category of the current combined ambiguous word is the detachable category, and splitting or not splitting the current combined ambiguous word according to the category.
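The way the four modules hand data to one another can be sketched as plain functions. This is only an architectural illustration under strong simplifications: the input is assumed to be whitespace-delimited so that segmentation can be stubbed with `split()` in place of a real Chinese word segmenter, the word bank is assumed to map each combined word to its two parts, and the classifier is passed in as a hypothetical callback `is_detachable` rather than a trained support vector machine.

```python
from typing import Callable, Dict, List, Set, Tuple


def preprocess(text: str, drop: Set[str]) -> str:
    """Delete listed stop/tone words and immediately repeated words."""
    kept: List[str] = []
    for tok in text.split():
        if tok in drop or (kept and kept[-1] == tok):
            continue
        kept.append(tok)
    return " ".join(kept)


def segment(text: str) -> List[str]:
    """Stub for the word segmentation module."""
    return text.split()


def run(text: str,
        drop: Set[str],
        lexicon: Dict[str, Tuple[str, str]],
        is_detachable: Callable[[List[str], int], bool]) -> List[str]:
    """Preprocess, segment, match against the word bank, then disambiguate."""
    tokens = segment(preprocess(text, drop))
    result: List[str] = []
    for i, tok in enumerate(tokens):
        if tok in lexicon and is_detachable(tokens, i):
            result.extend(lexicon[tok])   # detachable: store the split parts
        else:
            result.append(tok)            # non-ambiguous or kept combined
    return result


lexicon = {"AB": ("A", "B")}
print(run("x x AB um y", {"um"}, lexicon, lambda toks, i: True))   # ['x', 'A', 'B', 'y']
print(run("x x AB um y", {"um"}, lexicon, lambda toks, i: False))  # ['x', 'AB', 'y']
```

Keeping the classifier behind a callback lets the matching and storage logic stay unchanged when the model is retrained or replaced.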
The present disclosure also provides an electronic device comprising a memory, a processor, and computer instructions stored in the memory and executable on the processor; when the computer instructions are executed by the processor, each operation of the method is completed. For brevity, details are not repeated here.
The electronic device may be a mobile terminal or a non-mobile terminal. The non-mobile terminal includes a desktop computer; the mobile terminal includes a smartphone (such as an Android phone or an iOS phone), smart glasses, a smart watch, a smart bracelet, a tablet computer, a notebook computer, a personal digital assistant, and other mobile internet devices capable of wireless communication.
It should be understood that, in the present disclosure, the processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and the like. A general-purpose processor may be a microprocessor or any conventional processor.
The memory may include both read-only memory and random access memory, and may provide instructions and data to the processor, and a portion of the memory may also include non-volatile random access memory. For example, the memory may also store device type information.
In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in a processor or by instructions in the form of software. The steps of a method disclosed in connection with the present disclosure may be embodied directly in a hardware processor, or in a combination of hardware and software modules within the processor. The software modules may be located in RAM, flash memory, ROM, PROM, EPROM, registers, or other storage media well known in the art. The storage medium is located in a memory; the processor reads information from the memory and completes the steps of the method in combination with its hardware. To avoid repetition, this is not described in detail here.
Those of ordinary skill in the art will appreciate that the various illustrative units and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or as a combination of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends on the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality in different ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division into units is only a division by logical function, and other divisions are possible in an actual implementation: a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection of devices or units through some interfaces, and may be electrical, mechanical or of another form.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, or the part of it that contributes over the prior art, may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.
Claims (10)
1. A disambiguation method in the process of Chinese medicine text word segmentation, characterized by comprising the following steps:
acquiring a Chinese medicine text to be segmented; preprocessing the Chinese medicine text, wherein the preprocessing comprises deleting stop words, repeated words and tone words;
performing word segmentation on the preprocessed Chinese medicine text;
matching the word segmentation result against a pre-constructed combined ambiguous word bank, and screening out combined ambiguous words and non-combined ambiguous words from the word segmentation result; storing the non-combined ambiguous words into a word segmentation result database;
tagging the screened combined ambiguous words with word frequency and part of speech, calculating the mutual information vector of the current combined ambiguous word from the part of speech and word frequency of the screened combined ambiguous words, inputting the mutual information vector into a pre-trained support vector machine model, and outputting whether the category of the current combined ambiguous word is the detachable category; and splitting or not splitting the current combined ambiguous word according to the category.
2. The method of claim 1, wherein the Chinese medicine text to be segmented comprises a traditional Chinese medicine medical record text, specifically the patient's self-reported condition or the doctor's diagnostic conclusion.
3. The method of claim 1, wherein the word segmentation of the preprocessed Chinese medicine text is performed with the Chinese word segmentation system of the Chinese Academy of Sciences.
4. The method of claim 1, wherein the pre-constructed combined ambiguous word bank is constructed by the following steps:
performing word segmentation on all data sets; combining each segmented field with its nearest adjacent field; if the combined word also exists in a Chinese medicine dictionary, labeling the current field and its nearest adjacent field; then manually reviewing all labeled fields, and if a labeled field is indeed a combined word, putting it into the combined word bank;
or,
performing word segmentation on all data sets and collecting statistics on all segmented words; attempting a secondary word segmentation on each word individually; if a word can be segmented a second time, labeling it; extracting the labeled words and manually reviewing them; and if an extracted word is indeed a combined word, putting it into the combined word bank.
5. The method of claim 1, wherein
tagging the screened ambiguous words with word frequency means marking the frequency with which the current ambiguous word appears in the current Chinese medicine text;
and tagging the screened ambiguous words with part of speech means marking the part of speech of the current ambiguous word in the Chinese medicine text.
6. The method of claim 1, wherein the specific training steps of the pre-trained support vector machine model include:
S41, selecting a plurality of Chinese medicine medical record texts and performing word segmentation;
S42, matching each field in the word segmentation result against the pre-constructed combined ambiguous word bank, identifying the ambiguous words, and labeling them:
if a field exists in the combined ambiguous word bank and the combination of the field with the next field also exists in the combined word bank, labeling the field, the label indicating that the current combined word is detachable;
if a field exists in the combined ambiguous word bank but the combination of the field with the next field does not exist in the combined word bank, labeling the field, the label indicating that the current combined word is not detachable;
if a field does not exist in the combined word bank, continuing to match the other fields against the combined ambiguous word bank;
S43, calculating the mutual information values MI_1, MI_2, MI_3 and MI_4 of the ambiguous word to obtain the vector <MI_1, MI_2, MI_3, MI_4>;
S44, feeding the vector <MI_1, MI_2, MI_3, MI_4> together with the known detachable category of the current ambiguous word into the support vector machine model for training, to obtain the trained support vector machine model.
7. The method of claim 1, wherein splitting or not splitting the current ambiguous word according to the category comprises the following specific steps:
if the category of the ambiguous word is the detachable category, segmenting the current ambiguous word, and storing the split result into the word segmentation result database as the final word segmentation result;
and if the category of the ambiguous word is not the detachable category, not segmenting the current ambiguous word, and storing it directly into the word segmentation result database as the final word segmentation result.
8. A disambiguation system in the process of Chinese medicine text word segmentation, characterized by comprising:
a preprocessing module, used for acquiring a Chinese medicine text to be segmented and preprocessing the Chinese medicine text, wherein the preprocessing comprises deleting stop words, repeated words and tone words;
a word segmentation module, used for performing word segmentation on the preprocessed Chinese medicine text;
a matching module, used for matching the word segmentation result against a pre-constructed combined ambiguous word bank, screening out combined ambiguous words and non-combined ambiguous words from the word segmentation result, and storing the non-combined ambiguous words into a word segmentation result database;
a disambiguation module, used for tagging the screened combined ambiguous words with word frequency and part of speech, calculating the mutual information vector of the current combined ambiguous word from the part of speech and word frequency, inputting the mutual information vector into a pre-trained support vector machine model, outputting whether the category of the current combined ambiguous word is the detachable category, and splitting or not splitting the current combined ambiguous word according to the category.
9. An electronic device comprising a memory and a processor and computer instructions stored on the memory and executable on the processor, the computer instructions when executed by the processor performing the steps of the method of any of claims 1 to 7.
10. A computer-readable storage medium storing computer instructions which, when executed by a processor, perform the steps of the method of any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910722134.7A CN110502750B (en) | 2019-08-06 | 2019-08-06 | Disambiguation method, disambiguation system, disambiguation equipment and disambiguation medium in Chinese medicine text word segmentation process |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110502750A (en) | 2019-11-26 |
CN110502750B (en) | 2023-08-11 |
Family
ID=68587902
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910722134.7A Active CN110502750B (en) | 2019-08-06 | 2019-08-06 | Disambiguation method, disambiguation system, disambiguation equipment and disambiguation medium in Chinese medicine text word segmentation process |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110502750B (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||