CN110502750A - Disambiguation method, system, device and medium for Chinese medicine text word segmentation - Google Patents


Info

Publication number
CN110502750A
Authority
CN
China
Prior art keywords
word
words
combined
ambiguous
chinese medicine
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910722134.7A
Other languages
Chinese (zh)
Other versions
CN110502750B (en
Inventor
袁锋
王冰
郑向伟
于凤洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Normal University
Original Assignee
Shandong Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Normal University
Priority to CN201910722134.7A
Publication of CN110502750A
Application granted
Publication of CN110502750B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G06F16/353: Clustering; Classification into predefined classes
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A: TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00: Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10: Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

The present disclosure provides a disambiguation method for Chinese medicine text word segmentation, comprising: obtaining a Chinese medicine text to be segmented; preprocessing the text; performing word segmentation on the preprocessed text; matching the segmentation result against a pre-constructed combined ambiguous word lexicon to separate combined ambiguous words from non-combined ambiguous words; storing the non-combined ambiguous words in a word segmentation result database; tagging each screened combined ambiguous word with its word frequency and part of speech, calculating the mutual information vector of the current combined ambiguous word from these, inputting the vector into a pre-trained support vector machine model, and outputting whether the word belongs to the splittable category; and splitting or not splitting the current combined ambiguous word according to that category. The method eliminates incorrect segmentation of combined words during Chinese medicine text segmentation and achieves accurate disambiguation of combined Chinese medicine vocabulary.

Description

Disambiguation method, system, device and medium in Chinese medicine text word segmentation process
Technical Field
The present disclosure relates to the field of text segmentation technologies, and in particular, to a disambiguation method, system, device, and medium for use in a text segmentation process in traditional Chinese medicine.
Background
The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.
In the course of implementing the present disclosure, the inventors found the following technical problem in the prior art:
Existing word segmentation of traditional Chinese medicine text is not accurate enough. In particular, combined ambiguous words cannot be segmented and disambiguated correctly, so the segmentation result is unsatisfactory.
Disclosure of Invention
To address the deficiencies of the prior art, the present disclosure provides a disambiguation method, system, device, and medium for Chinese medicine text word segmentation.
In a first aspect, the present disclosure provides a disambiguation method for Chinese medicine text word segmentation.
The disambiguation method for Chinese medicine text word segmentation comprises the following steps:
acquiring a Chinese medicine text to be segmented, and preprocessing the text, wherein the preprocessing comprises deleting stop words, repeated words, and modal particles;
performing word segmentation on the preprocessed Chinese medicine text;
matching the segmentation result against a pre-constructed combined ambiguous word lexicon, and screening the combined ambiguous words and the non-combined ambiguous words out of the segmentation result; storing the non-combined ambiguous words in a word segmentation result database;
tagging the screened combined ambiguous words with word frequency and part of speech; calculating the mutual information vector of the current combined ambiguous word from its part of speech and word frequency; inputting the mutual information vector into a pre-trained support vector machine model, which outputs whether the current combined ambiguous word belongs to the splittable category; and splitting or not splitting the current combined ambiguous word according to that category.
In a second aspect, the present disclosure also provides a disambiguation system for Chinese medicine text word segmentation.
The disambiguation system for Chinese medicine text word segmentation comprises:
a preprocessing module, configured to acquire a Chinese medicine text to be segmented and to preprocess the text, wherein the preprocessing comprises deleting stop words, repeated words, and modal particles;
a word segmentation module, configured to perform word segmentation on the preprocessed Chinese medicine text;
a matching module, configured to match the segmentation result against a pre-constructed combined ambiguous word lexicon, to screen the combined ambiguous words and the non-combined ambiguous words out of the segmentation result, and to store the non-combined ambiguous words in a word segmentation result database;
a disambiguation module, configured to tag the screened combined ambiguous words with word frequency and part of speech, to calculate the mutual information vector of the current combined ambiguous word from its part of speech and word frequency, to input the mutual information vector into a pre-trained support vector machine model that outputs whether the current combined ambiguous word belongs to the splittable category, and to split or not split the current combined ambiguous word according to that category.
In a third aspect, the present disclosure also provides an electronic device comprising a memory and a processor, and computer instructions stored on the memory and executed on the processor, wherein the computer instructions, when executed by the processor, perform the steps of the method of the first aspect.
In a fourth aspect, the present disclosure also provides a computer-readable storage medium for storing computer instructions which, when executed by a processor, perform the steps of the method of the first aspect.
Compared with the prior art, the beneficial effects of the present disclosure are:
The word segmentation result is accurate and free of the ambiguity caused by combined words. In particular, incorrect segmentation of combined words during Chinese medicine text word segmentation is eliminated, and accurate disambiguation of combined Chinese medicine words is achieved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application.
FIG. 1 is a flow chart of the method of the first embodiment.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of the stated features, steps, operations, devices, components, and/or combinations thereof.
Embodiment one provides a disambiguation method for Chinese medicine text word segmentation.
As shown in FIG. 1, the disambiguation method for Chinese medicine text word segmentation includes:
S1: acquiring a Chinese medicine text to be segmented, and preprocessing the text, wherein the preprocessing comprises deleting stop words, repeated words, and modal particles;
S2: performing word segmentation on the preprocessed Chinese medicine text;
S3: matching the segmentation result against a pre-constructed combined ambiguous word lexicon, and screening the combined ambiguous words and the non-combined ambiguous words out of the segmentation result; storing the non-combined ambiguous words in a word segmentation result database;
S4: tagging the screened combined ambiguous words with word frequency and part of speech; calculating the mutual information vector of the current combined ambiguous word from its part of speech and word frequency; inputting the mutual information vector into a pre-trained support vector machine model, which outputs whether the current combined ambiguous word belongs to the splittable category; and splitting or not splitting the current combined ambiguous word according to that category.
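The flow of S1 to S4 can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the patented implementation: the callables `preprocess`, `segment`, and `is_splittable`, and the lexicon format (a combined word mapped to its two parts), are hypothetical placeholders. In particular, `is_splittable` stands in for the mutual information vector plus support vector machine decision.

```python
from typing import Callable, Dict, List, Tuple


def disambiguate(
    text: str,
    preprocess: Callable[[str], str],
    segment: Callable[[str], List[str]],
    lexicon: Dict[str, Tuple[str, str]],
    is_splittable: Callable[[List[str], int], bool],
) -> List[str]:
    """Run S1-S4 and return the final token list (the 'result database')."""
    tokens = segment(preprocess(text))      # S1 and S2
    result: List[str] = []
    for i, token in enumerate(tokens):
        if token not in lexicon:            # S3: not a combined ambiguous word
            result.append(token)
        elif is_splittable(tokens, i):      # S4: classifier says "split"
            result.extend(lexicon[token])
        else:                               # S4: keep the field as one word
            result.append(token)
    return result


# Toy run with placeholder tokens; the lambda stands in for the SVM decision.
toy_lexicon = {"AB": ("A", "B")}
print(disambiguate("x AB y AB", str.strip, str.split, toy_lexicon,
                   lambda toks, i: i == 1))
```

Passing the classifier as a callable keeps the lexicon lookup (S3) independent of how the splittable category is decided (S4).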
As one or more embodiments, the Chinese medicine text to be segmented includes Chinese medicine medical record text, specifically a patient's own description of the condition or a doctor's diagnostic conclusion.
As one or more embodiments, word segmentation of the preprocessed Chinese medicine text is performed with the Chinese word segmentation system of the Chinese Academy of Sciences.
As one or more embodiments, the pre-constructed combined ambiguous word bank is constructed by the following steps:
segmenting all data sets; combining each segmented field with its nearest adjacent field; if the combined word also exists in a Chinese medicine dictionary, labeling the current field and its adjacent field; then manually reviewing all labeled fields, and putting a labeled field into the combined word lexicon if it is confirmed to be a combined word;
or,
segmenting all data sets and collecting statistics on all resulting words; applying a second round of word segmentation to each word on its own; labeling every word that can be segmented again; extracting the labeled words and manually reviewing them; and putting a word into the combined word lexicon if it is confirmed to be a combined word.
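The first construction route (merging each field with its neighbor and checking a dictionary) can be sketched as follows. The function name, the token list, and the dictionary contents are hypothetical. The candidates it returns would still need the manual review that the text requires before entering the lexicon.

```python
from typing import List, Set, Tuple


def adjacent_merge_candidates(
    tokens: List[str], dictionary: Set[str]
) -> List[Tuple[int, str]]:
    """Return (index, merged_word) for every adjacent pair of fields
    whose concatenation is itself a dictionary word."""
    candidates: List[Tuple[int, str]] = []
    for i in range(len(tokens) - 1):
        merged = tokens[i] + tokens[i + 1]
        if merged in dictionary:
            candidates.append((i, merged))
    return candidates


# Placeholder fields: "A" followed by "B" merges into the dictionary word "AB".
print(adjacent_merge_candidates(["A", "B", "C"], {"AB"}))
```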
As one or more embodiments, word frequency tagging of a screened ambiguous word means tagging the frequency with which the current ambiguous word appears in the current Chinese medicine text.
As one or more embodiments, part-of-speech tagging of a screened ambiguous word means tagging the part of speech of the current ambiguous word in the Chinese medicine text. Parts of speech include nouns, verbs, adjectives, time words, and so forth.
As one or more embodiments, the mutual information vector of the current ambiguous word is calculated from the part of speech and the word frequency of the screened ambiguous word, specifically:
$$MI_3 = P(w_{i-1}\mid s_{i-1})\,P(s_i\mid s_{i-1})\,P(w_i\mid s_i)\,P(s_{i+1}\mid s_i)\,P(w_{i+1}\mid s_{i+1}) \tag{3}$$
$$MI_4 = P(w_{i-1}\mid s_{i-1})\,P(s_{i1}\mid s_{i-1})\,P(w_{i1}\mid s_{i1})\,P(s_{i2}\mid s_{i1})\,P(w_{i2}\mid s_{i2})\,P(s_{i+1}\mid s_{i2})\,P(w_{i+1}\mid s_{i+1}) \tag{4}$$
where $MI_1$ denotes the first mutual information vector; $MI_2$ denotes the second mutual information vector; $MI_3$ denotes the third mutual information vector; $MI_4$ denotes the fourth mutual information vector; $w_{i-1}$ denotes the word preceding the ambiguous field and $s_{i-1}$ denotes its part of speech; $w_{i+1}$ denotes the word following the ambiguous field and $s_{i+1}$ denotes its part of speech; $w_i$ denotes the combined ambiguous field treated as a single, unsplit field and $s_i$ denotes the part of speech of that single field; $w_{i1}$ and $w_{i2}$ denote the two fields obtained when the combined ambiguous field is split; and $s_{i1}$ and $s_{i2}$ denote the parts of speech of those two fields.
As one or more embodiments, the support vector machine model is pre-trained through the following steps:
S41: selecting a number of Chinese medicine medical record texts and segmenting them;
S42: matching each field in the segmentation result against the pre-constructed combined ambiguous word lexicon to identify and label ambiguous words:
if a field exists in the combined ambiguous word lexicon and the combination of the field with the next field also exists in the combined word lexicon, the field is labeled as a combined word that can be split;
if a field exists in the combined ambiguous word lexicon but the combination of the field with the next field does not exist in the combined word lexicon, the field is labeled as a combined word that cannot be split;
if a field does not exist in the combined word lexicon, matching continues with the remaining fields;
S43: calculating the mutual information values $MI_1$, $MI_2$, $MI_3$, and $MI_4$ of each ambiguous word to obtain the vector $\langle MI_1, MI_2, MI_3, MI_4\rangle$;
S44: feeding the vector $\langle MI_1, MI_2, MI_3, MI_4\rangle$, together with the known splittable category of the current ambiguous word, into the support vector machine model for training, yielding the trained support vector machine model.
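The three labeling rules of S42 can be written down directly. The sketch below is illustrative only: the function and label names are hypothetical, and it assumes the two lexicons named in the text are passed as separate sets (they may well be the same resource in practice).

```python
from typing import List, Set, Tuple


def label_fields(
    tokens: List[str],
    ambiguous_lexicon: Set[str],
    combined_lexicon: Set[str],
) -> List[Tuple[int, str, str]]:
    """Apply the S42 rules and return (index, field, label) triples."""
    labels: List[Tuple[int, str, str]] = []
    for i, field in enumerate(tokens):
        if field not in ambiguous_lexicon:
            continue  # rule 3: not in the lexicon, move on to the next field
        has_next = i + 1 < len(tokens)
        if has_next and field + tokens[i + 1] in combined_lexicon:
            labels.append((i, field, "splittable"))    # rule 1
        else:
            labels.append((i, field, "unsplittable"))  # rule 2
    return labels


# Placeholder fields: "A"+"B" is in the combined lexicon, "A"+"C" is not.
print(label_fields(["A", "B", "A", "C"], {"A"}, {"AB"}))
```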
As one or more embodiments, the current ambiguous word is split or not split according to its category, specifically:
if the ambiguous word belongs to the splittable category, the current ambiguous word is segmented, and the split result is stored in the word segmentation result database as the final segmentation result;
if the ambiguous word does not belong to the splittable category, the current ambiguous word is not segmented and is stored directly in the word segmentation result database as the final segmentation result.
1. Feature selection
In traditional Chinese medicine medical records, doctors follow certain conventions when writing the record text, and one or more words make up one symptom: noun-type TCM symptom names, such as thunder head wind, stroke, and dizziness; symptom + change, such as suffocation relief, shortness of breath relief, and hypomenorrhea; and body part + adjective or adjective + body part, such as abdominal pain, dizziness, and green tongue. In these records a word is therefore strongly connected to the words before and after it, and the parts of speech of neighboring words follow certain regularities. Based on the word frequency and part of speech of words, this work builds word-frequency and part-of-speech mutual information for feature selection.
1.1 Conventional mutual information
$w_i$ denotes the current combined ambiguous field in its "joined" state, and $w_{i1}$ and $w_{i2}$ denote the field in its "split" state. $W = w_1 w_2 \dots w_i \dots w_n$ denotes a segmented sentence in which the current combined ambiguous field takes the "joined" form, and $W = w_1 w_2 \dots w_{i1} w_{i2} \dots w_n$ denotes the sentence when the field takes the "split" form. Mutual information is often used to extract text features: it reflects the degree of correlation between words, and the higher the mutual information between two words, the more strongly they are correlated. The calculation formula can be expressed as:
where $P(w_{i-1}\mid w_i)$ is the probability that the feature word $w_{i-1}$ occurs in the Chinese medicine medical record data set when the combined ambiguous field $w_i$ is in the "joined" form, and $P(w_{i-1})$ is the probability that the feature word $w_{i-1}$ occurs in the data set. According to related research, mutual information performs worse in experiments than other feature selection methods (chi-square and information gain). The reason is that when a low-frequency word is selected as a feature, its small probability sits in the denominator and inflates the value of the whole formula, so the mutual information of low-frequency words becomes large. Past studies therefore often extracted low-frequency words as important features while ignoring factors such as word frequency and part of speech, which seriously harms text disambiguation.
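The formula announced above is not reproduced in this text. Assuming it takes the usual pointwise form $MI(w_{i-1}, w_i) = \log \frac{P(w_{i-1}\mid w_i)}{P(w_{i-1})}$ (an assumption inferred from the symbol definitions, not a quotation from the source), the short sketch below shows the inflation effect described in the paragraph. The probabilities are made up for illustration.

```python
import math


def mutual_information(p_feature_given_field: float, p_feature: float) -> float:
    """Assumed conventional form: log( P(w_{i-1} | w_i) / P(w_{i-1}) )."""
    return math.log(p_feature_given_field / p_feature)


# Same conditional probability, different overall frequency (made-up values).
common = mutual_information(0.02, 0.01)    # frequent feature word
rare = mutual_information(0.02, 0.0001)    # low-frequency feature word

# The rare word receives the much larger score purely because of its
# small denominator, which is the weakness the text points out.
print(common, rare)
```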
1.2 Improved word frequency mutual information
When the feature word is a low-frequency word, $P(w_{i-1})$, as the denominator of the conventional mutual information formula, makes the feature value of the low-frequency word large. In traditional Chinese medicine text, medium- and high-frequency words are the most important features of a passage or sentence and matter for text data mining, whereas low-frequency words contribute little to the text and can become noise. To counter the inflated mutual information of low-frequency words, this work adds a word frequency factor $\eta_i$ of the feature word to the mutual information, representing the word frequency of the feature word under the different ambiguity forms. The formula is as follows:
where $P(w_{i-1}\mid w_i)$ denotes the word frequency of the feature word $w_{i-1}$ when the combined ambiguous field $w_i$ is "joined", $P^*(w_{i-1}\mid w_i)$ denotes the number of occurrences of the feature word $w_{i-1}$ when $w_i$ is "joined", and $P^*(w_i)$ denotes the total number of occurrences of $w_i$ in the "joined" form. After the word frequency factor $\eta_i$ is added, the mutual information formula becomes:
1.3 Part-of-speech mutual information construction
Combining the characteristics of Chinese medicine text with the study of combined ambiguous fields shows that the words in Chinese medicine medical records are strongly related to their parts of speech. The part of speech of a combined ambiguous field is closely associated with the feature words in its context, and this matters for both the "joined" and the "split" form of the field. Based on this association, $MI_3$ and $MI_4$ denote the part-of-speech mutual information between the feature words and the combined ambiguous field in its "joined" and "split" forms, respectively. $S = s_1 s_2 \dots s_i \dots s_n$ and $S = s_1 s_2 \dots s_{i1} s_{i2} \dots s_n$ are the part-of-speech tag strings of the sentence when the combined ambiguous field is in the "joined" and the "split" form, respectively. $s_i$ denotes the part of speech of the ambiguous field $w_i$ in the "joined" form, $s_{i-1}$ denotes the part of speech of the word $w_{i-1}$ preceding the ambiguous field $w_i$, and $s_{i1}$ and $s_{i2}$ denote the parts of speech of the words $w_{i1}$ and $w_{i2}$ in the "split" form.
The part-of-speech mutual information is defined by formulas (2) and (3):
$$MI_3 = P(w_{i-1}\mid s_{i-1})\,P(s_i\mid s_{i-1})\,P(w_i\mid s_i)\,P(s_{i+1}\mid s_i)\,P(w_{i+1}\mid s_{i+1}) \tag{2}$$
$$MI_4 = P(w_{i-1}\mid s_{i-1})\,P(s_{i1}\mid s_{i-1})\,P(w_{i1}\mid s_{i1})\,P(s_{i2}\mid s_{i1})\,P(w_{i2}\mid s_{i2})\,P(s_{i+1}\mid s_{i2})\,P(w_{i+1}\mid s_{i+1}) \tag{3}$$
$P(w_i\mid s_i)$ is the probability, in the Chinese medicine medical records, that the word $w_i$ occurs given that its part of speech is $s_i$; $P(s_i\mid s_{i-1})$ is the probability that the part of speech $s_i$ of the word $w_i$ occurs given that the part of speech of the preceding word $w_{i-1}$ is $s_{i-1}$.
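Formulas (2) and (3) share one structure: a product of word-given-tag probabilities and tag-to-tag transition probabilities along a short path through the context. The sketch below computes both with a single helper. The probability tables, token names, and tags are made up for illustration, and the helper name `path_score` is hypothetical.

```python
from typing import Dict, List, Tuple

Emission = Dict[Tuple[str, str], float]    # (word, tag) -> P(word | tag)
Transition = Dict[Tuple[str, str], float]  # (prev_tag, tag) -> P(tag | prev_tag)


def path_score(words: List[str], tags: List[str],
               emit: Emission, trans: Transition) -> float:
    """Multiply P(w_k | s_k) for every word and P(s_k | s_{k-1}) for
    every adjacent tag pair along the path."""
    score = emit[(words[0], tags[0])]
    for k in range(1, len(words)):
        score *= trans[(tags[k - 1], tags[k])] * emit[(words[k], tags[k])]
    return score


# Made-up tables: every probability is 0.5 so the result is easy to check.
emit = {("p", "n"): 0.5, ("ab", "v"): 0.5, ("q", "n"): 0.5,
        ("a", "d"): 0.5, ("b", "p"): 0.5}
trans = {("n", "v"): 0.5, ("v", "n"): 0.5,
         ("n", "d"): 0.5, ("d", "p"): 0.5, ("p", "n"): 0.5}

# Formula (2): previous word, unsplit field, next word (5 factors).
mi3 = path_score(["p", "ab", "q"], ["n", "v", "n"], emit, trans)
# Formula (3): previous word, two split fields, next word (7 factors).
mi4 = path_score(["p", "a", "b", "q"], ["n", "d", "p", "n"], emit, trans)
print(mi3, mi4)
```

Because the split path has two more factors than the unsplit path, its raw score tends to be smaller, which is one reason the text feeds both values to a classifier rather than comparing them directly.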
1.4 Construction of vectors
In this example, $MI_1$ denotes the word-frequency mutual information between the combined ambiguous field $w_i$ and its context when the field is "joined", and $MI_2$ denotes the word-frequency mutual information between the field and its context when the field is "split". $MI_3$ denotes the part-of-speech mutual information between the combined ambiguous field $w_i$ and its context when the field is "joined", and $MI_4$ denotes the part-of-speech mutual information between the field and its context when the field is "split". From the values of the word-frequency and part-of-speech mutual information, each ambiguous field is expressed as a vector, recorded as $\langle MI_1, MI_2, MI_3, MI_4\rangle$.
2 Support vector machine model
The support vector machine (SVM) is a common machine learning algorithm with good classification accuracy, and it is particularly suited to binary classification. It works by finding an optimal separating hyperplane that, while classifying the samples correctly, maximizes the margin to the samples on either side. A combined ambiguity in a Chinese medicine medical record has two possible readings, "joined" and "split". These two forms can be regarded as two classes, so the combined ambiguity is treated as a binary classification problem and solved with a support vector machine.
The basic idea of the SVM algorithm is as follows. Given a data set $(x_1, y_1), \dots, (x_i, y_i), \dots, (x_n, y_n)$, $i = 1, 2, \dots, n$, with $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, +1\}$, the separating hyperplane of the SVM is (here $w$ is the weight vector, not a word):
$$w^T x + b = 0$$
The support vectors satisfy:
$$w^T x + b = \pm 1$$
The decision function of the SVM is:
$$g(x) = \operatorname{sgn}(w^{*} \cdot x + b^{*})$$
To classify a test sample $x$, $g(x)$ is calculated, and the value of the function is the classification result.
$MI_1$, $MI_2$, $MI_3$, and $MI_4$ of the combined ambiguous field are calculated according to formulas (1), (2), and (3) to obtain the vector $\langle MI_1, MI_2, MI_3, MI_4\rangle$, which is substituted into the classification function $g(x)$. If the result equals 1, the ambiguous field is in the "joined" form; if the result equals -1, the ambiguous field is in the "split" form.
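To make the decision rule concrete, the sketch below trains a linear soft-margin SVM with plain subgradient descent on the hinge loss and then applies $g(x) = \operatorname{sgn}(w \cdot x + b)$. The solver is a stand-in chosen so the example needs only the standard library; the source does not say which SVM solver or kernel was used. The four-dimensional toy vectors, the hyperparameters, and all function names are assumptions for illustration.

```python
from typing import List, Sequence, Tuple


def train_linear_svm(X: Sequence[Sequence[float]], y: Sequence[int],
                     lam: float = 0.01, lr: float = 0.1,
                     epochs: int = 200) -> Tuple[List[float], float]:
    """Subgradient descent on lam/2 * ||w||^2 + hinge loss (linear kernel)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, label in zip(X, y):
            margin = label * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            if margin < 1:   # inside the margin: hinge term is active
                w = [wj - lr * (lam * wj - label * xj) for wj, xj in zip(w, x)]
                b += lr * label
            else:            # outside the margin: only regularization acts
                w = [wj - lr * lam * wj for wj in w]
    return w, b


def g(x: Sequence[float], w: Sequence[float], b: float) -> int:
    """Decision function: +1 means 'joined', -1 means 'split'."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Toy <MI1, MI2, MI3, MI4> vectors (made up, linearly separable).
X = [[2, 0, 1, 0], [3, 1, 1, 0], [-2, 0, -1, 0], [-3, -1, -1, 0]]
y = [1, 1, -1, -1]
w, b = train_linear_svm(X, y)
print([g(x, w, b) for x in X])
```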
3 Disambiguation model construction
3.1 Definition, construction, and acquisition of the combined word lexicon
(1) Definition of combined ambiguity
A combined ambiguous field is defined here as follows:
Combined ambiguous field: assume a field AB composed of the fields A and B, where A, B, and AB can each be a word. If there is a sentence W in Chinese text in which both the reading AB and the reading A, B are grammatically and semantically valid, AB is a combined ambiguous field.
Examples of a combined ambiguous field are as follows:
1: insomnia/somnolence/sleep.
2: recent/3/year/last/burst/breathlessness/,/multi/on/exertion/post/occurrence.
The word "more than" is a combined ambiguous field in the sentences above. In example 1, "more than" appears in the "joined" form, as a single combined word. In example 2, "more than" appears in the "split" form, divided into two words (shown as "multi/on" in the segmentation above).
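The dictionary part of this definition (A, B, and AB can each be a word) is easy to check mechanically. The sketch below does only that part, with hypothetical function names and placeholder entries. Whether both readings are actually valid in real sentences still requires corpus evidence or manual review.

```python
from typing import List, Set, Tuple


def binary_splits(word: str, dictionary: Set[str]) -> List[Tuple[str, str]]:
    """All (A, B) with A + B == word where A and B are both dictionary words."""
    return [(word[:k], word[k:])
            for k in range(1, len(word))
            if word[:k] in dictionary and word[k:] in dictionary]


def is_candidate(word: str, dictionary: Set[str]) -> bool:
    """True if AB, A, and B are all dictionary words for some split point."""
    return word in dictionary and bool(binary_splits(word, dictionary))


# Placeholder dictionary: "AB" can be split into words, "CD" cannot.
toy_dictionary = {"AB", "A", "B", "CD", "C"}
print(is_candidate("AB", toy_dictionary), is_candidate("CD", toy_dictionary))
```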
(2) Establishment of the combined word lexicon
Combined ambiguity resolution techniques have gradually matured in current research. However, a combined ambiguity corpus for disambiguation is lacking. In particular, no suitable combined ambiguity corpus is available for Chinese medicine text disambiguation, and important features in sentences, such as word frequency and part of speech, are not fully used, so disambiguation performance is not ideal. In view of the characteristics of combined ambiguity resolution, this work builds a combined ambiguous word lexicon from combined ambiguous words selected from medical records, and uses it to identify the combined ambiguous words in Chinese medicine medical records.
(3) Acquisition of combined ambiguous fields
The obtained Chinese medicine medical records are preprocessed by word segmentation, part-of-speech tagging, and so on. The combined word lexicon built for Chinese medicine text is then used with a matching algorithm to identify, label, and extract combined ambiguous fields. When both the "joined" form (AB) and the "split" form (A, B) of a combined ambiguous field occur in the medical record data set, both are labeled. As required by the experiment, 500 combined ambiguous fields were extracted from the segmented medical record corpus. The flow of the extraction method for combined ambiguous fields is shown in FIG. 1.
The following sentences contain combined ambiguous fields and have been segmented and part-of-speech tagged:
more recently/t few/m /q month/n ,/un night/t insomnia/v than/v peaceful/v ./un
near/a three/m years/n burst/q occurrence/v breathlessness/n ,/un breathlessness/n multiple/v at/p exertion/an after/f occurrence/v ./un
Table 1 lists the feature information contained in the example sentences above when the window size is 2.
Table 1. Feature information
| Feature type | Feature value |
|---|---|
| Local words | insomnia, peaceful, breathlessness, exertion |
| Parts of speech of local words | $s_{i-1}=v$, $s_{i+1}=v$, $s_{i-1}=n$, $s_{i+1}=an$ |
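The window features in Table 1 are the word and part of speech immediately before and after the ambiguous field. A minimal sketch of that extraction follows, reading "window size 2" as one token on each side (an interpretation, not stated explicitly in the source). The function name and the `width` parameter, which covers a field written as two tokens, are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def context_features(tagged: List[Tuple[str, str]], i: int,
                     width: int = 1) -> Dict[str, Optional[str]]:
    """Return the neighbors of the ambiguous field that starts at index i
    and spans `width` tokens (1 if unsplit, 2 if split)."""
    prev_word, prev_pos = tagged[i - 1] if i > 0 else (None, None)
    j = i + width
    next_word, next_pos = tagged[j] if j < len(tagged) else (None, None)
    return {"w_prev": prev_word, "s_prev": prev_pos,
            "w_next": next_word, "s_next": next_pos}


# Fragments of the two tagged example sentences from the text.
joined = [("night", "t"), ("insomnia", "v"), ("than", "v"), ("peaceful", "v")]
split = [("breathlessness", "n"), ("multiple", "v"), ("at", "p"), ("exertion", "an")]
print(context_features(joined, 2))          # field occupies one token
print(context_features(split, 1, width=2))  # field occupies two tokens
```

A field at the start or end of a sentence yields `None` for the missing neighbor, so the caller has to decide how to score a missing context word.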
3.2 Disambiguation steps
The disambiguation algorithm is described as follows:
(1) The main steps of the training phase are as follows:
Step 1: Select 200 traditional Chinese medicine medical records and segment them.
Step 2: Match the segmented medical records against the combined word lexicon, identify ambiguous words with the matching extraction algorithm, and label them.
Step 3: Calculate the mutual information values $MI_1$, $MI_2$, $MI_3$, and $MI_4$ of each combined ambiguous field to obtain the vector $\langle MI_1, MI_2, MI_3, MI_4\rangle$.
Step 4: Feed the vector $\langle MI_1, MI_2, MI_3, MI_4\rangle$ into the support vector machine model for training to obtain the classification function $g(x)$.
(2) The main steps of the testing phase are as follows:
Step 1: Select 300 traditional Chinese medicine medical records and segment them to obtain a segmented data set; match the data set against the combined word lexicon and identify the combined ambiguous fields contained in the sentences.
Step 2: For each sentence containing a combined ambiguous field, obtain the two segmentation paths, "joined" and "split", and the corresponding part-of-speech tag strings.
Step 3: Extract the words and their parts of speech, and calculate the word frequencies and part-of-speech frequencies. Substitute these into formulas (1), (2), and (3) to calculate $MI_1$, $MI_2$, $MI_3$, and $MI_4$, expressed as the vector $\langle MI_1, MI_2, MI_3, MI_4\rangle$.
Step 4: Substitute the vector into the trained classification function $g(x)$ to obtain the category, 1 or -1, and thus the corresponding segmentation result.
Step 5: Resolve the ambiguity of the combined ambiguous field to obtain the disambiguated word segmentation result, and end the experiment.
4 experiment
4.1 Experimental data
In the text of traditional Chinese medicine, there are the characteristics of traditional Chinese medicine terms nouns, ancient Chinese and modern language adulteration. The Chinese language used in the traditional Chinese medicine needs to be segmented, and the word frequency and the relation between words need to be considered. In order to solve the characteristic that no evaluation language material is disclosed in the combined type ambiguity resolution work of the Chinese medicine texts, a Chinese medicine medical case ambiguity word library is established for testing the effect of the medical case ambiguity resolution method for the research. The main language material adopted by the text is from 2 ten thousand medical cases of subsidiary hospitals of Shandong Chinese medicine university, and the obtained Chinese medicine text is subjected to part-of-speech tagging in steps of segmentation, clause segmentation, word segmentation and the like, so that the word segmentation language material required by the experiment is finally obtained. The method comprises the steps of manual scanning, combined word extraction and the like to construct a traditional Chinese medicine combined word bank, then matching the traditional Chinese medicine combined word bank with word segmentation corpora by using a matching algorithm, and obtaining and labeling ambiguous fields of traditional Chinese medicine text combination. UTF-8 is adopted for encoding the Chinese medicinal corpus. 2000 of the cases were selected for ambiguous word resolution experiments.
The ambiguous word resolution experiments were divided into three groups. Experiment one performs word segmentation with a traditional feature extraction method based on mutual information; experiment two uses a feature extraction method based on part-of-speech mutual information; experiment three uses the ambiguity resolution method proposed herein. The standard corpus described above is used as the experimental corpus, yielding counts of 25052 and 10356 words.
4.2 analysis of results
Here, 5 example sentences are extracted from the test corpus to show the disambiguation results.
Example one: body weight decreased by 15 kg within 2 years.
Example two: the patient had a slightly swollen tongue.
Example three: the lumbago, shortness of breath and tenesmus of stool are obviously improved in rainy days.
Example four: the patient has normal menstruation, and the menstrual cycle is prolonged by 7 days after taking cold drink 2 years ago.
Table 3 lists the comparison results of the 5 example sentences in the test corpus from experiment one to experiment two.
Table 3: Word segmentation results on the test corpus
Table 3 shows that the disambiguation method based on context information gives unsatisfactory results at the beginning of a sentence, because there is no preceding word. The disambiguation method based on the support vector machine gives unsatisfactory segmentation results when it encounters professional terms. Overall, the experimental results show that the word segmentation system with the disambiguation method added segments the text well.
Embodiment two: the present disclosure also provides a disambiguation system for combined ambiguity in Chinese medicine texts.
The disambiguation system in the process of Chinese medicine text word segmentation comprises:
the preprocessing module is used for acquiring a Chinese medicine text to be segmented and preprocessing the Chinese medicine text, wherein the preprocessing comprises deleting stop words, repeated words and modal particles;
the word segmentation module is used for carrying out word segmentation on the preprocessed Chinese medicine text;
the matching module is used for matching the result after the word segmentation processing with a pre-constructed combined ambiguous word bank, screening out combined ambiguous words and non-combined ambiguous words from the result after the word segmentation processing, and storing the non-combined ambiguous words into a word segmentation result database;
the disambiguation module is used for performing word frequency and part-of-speech tagging on the screened combined ambiguous words, calculating the mutual information vector of the current combined ambiguous word according to the parts of speech and word frequencies of the screened combined ambiguous words, inputting the mutual information vector into a pre-trained support vector machine model, and outputting whether the category of the current combined ambiguous word is the detachable category; and splitting or not splitting the current combined ambiguous word according to the category.
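The interaction of the four modules can be sketched as a small pipeline. This is a structural illustration only: `preprocess`, `segment`, `is_detachable` and `split` are hypothetical callables supplied by the caller, standing in for the preprocessing module, the word segmentation module, the trained support vector machine decision and the splitting routine respectively.

```python
def run_pipeline(text, preprocess, segment, combined_lexicon, is_detachable, split):
    """Run preprocessing, segmentation, matching and disambiguation in order.

    Returns the final list of tokens that would be stored in the result database.
    """
    results = []
    for token in segment(preprocess(text)):
        if token not in combined_lexicon:
            # Non-combined ambiguous word: store it directly.
            results.append(token)
        elif is_detachable(token):
            # Combined ambiguous word classified as detachable: store its parts.
            results.extend(split(token))
        else:
            # Combined ambiguous word classified as not detachable: keep it whole.
            results.append(token)
    return results
```

Passing the stages in as callables keeps the sketch independent of any particular segmenter or classifier.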
The present disclosure also provides an electronic device, which includes a memory, a processor, and a computer instruction stored in the memory and executed on the processor, where when the computer instruction is executed by the processor, each operation in the method is completed, and details are not described herein for brevity.
The electronic device may be a mobile terminal or a non-mobile terminal. The non-mobile terminal includes a desktop computer, and the mobile terminal includes a smart phone (such as an Android phone or an iOS phone), smart glasses, a smart watch, a smart bracelet, a tablet computer, a notebook computer, a personal digital assistant, and other mobile internet devices capable of wireless communication.
It should be understood that in the present disclosure, the processor may be a central processing unit (CPU), but may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general-purpose processor may be a microprocessor or any conventional processor.
The memory may include both read-only memory and random access memory, and may provide instructions and data to the processor, and a portion of the memory may also include non-volatile random access memory. For example, the memory may also store device type information.
In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in a processor or by instructions in the form of software. The steps of a method disclosed in connection with the present disclosure may be embodied directly in a hardware processor, or in a combination of hardware and software modules within the processor. The software modules may be located in RAM, flash memory, ROM, PROM or EPROM, registers, or other storage media well known in the art. The storage medium is located in a memory, and a processor reads information in the memory and completes the steps of the method in combination with its hardware. To avoid repetition, it is not described in detail here. Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented as electronic hardware or as combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is merely a division of one logic function, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some interfaces, and may be in an electrical, mechanical or other form.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, or the part of it that contributes to the prior art, may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk, and other media capable of storing program code.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (10)

1. The disambiguation method in the process of Chinese medicine text word segmentation is characterized by comprising the following steps:
acquiring a Chinese medicine text to be segmented; preprocessing the Chinese medicine text, wherein the preprocessing comprises the following steps: deleting stop words, repeated words and modal particles;
performing word segmentation on the preprocessed traditional Chinese medicine text;
matching the result after word segmentation processing with a pre-constructed combined ambiguous word library, and screening out combined ambiguous words and non-combined ambiguous words from the result after word segmentation processing; storing the non-combined ambiguous words into a word segmentation result database;
performing word frequency and part-of-speech tagging on the screened combined ambiguous words, calculating the mutual information vector of the current combined ambiguous word according to the parts of speech and the word frequencies of the screened combined ambiguous words, inputting the mutual information vector into a pre-trained support vector machine model, and outputting whether the category of the current combined ambiguous word is the detachable category; and splitting or not splitting the current combined ambiguous word according to the category.
2. The method of claim 1, wherein the Chinese medicine text to be segmented comprises the text of a Chinese medicine medical record, specifically a patient's self-described condition or a doctor's diagnostic conclusion.
3. The method of claim 1, wherein the word segmentation of the preprocessed Chinese medicine text is performed using the Chinese word segmentation system of the Chinese Academy of Sciences.
4. The method of claim 1, wherein the pre-constructed combinatorial ambiguous lexicon is constructed by the steps of:
performing word segmentation on all the data sets, combining each segmented field with its nearest adjacent field, labeling the current field and its nearest adjacent field if the combined word also exists in a Chinese medicine dictionary, then manually reviewing all labeled fields, and putting a labeled field into the combined word bank if it is confirmed to be a true combined word;
or,
performing word segmentation on all the data sets and collecting statistics on all segmented words; performing secondary word segmentation on each word independently, labeling every word that can be segmented a second time, extracting the labeled words, manually reviewing the extracted words, and putting a word into the combined word bank if it is confirmed to be a combined word.
5. The method as set forth in claim 1, wherein,
performing word frequency marking on the screened ambiguous words, namely marking the frequency of the current ambiguous words appearing in the current Chinese medicine text;
and the part of speech tagging is carried out on the screened ambiguous words, namely, the part of speech of the current ambiguous words in the Chinese medicine text is tagged.
6. The method of claim 1, wherein the specific training steps of the pre-trained support vector machine model include:
S41, selecting a plurality of Chinese medicine medical case texts for word segmentation;
S42, matching each field in the word segmentation result with the pre-constructed combined ambiguous word bank, carrying out ambiguous word recognition, and labeling the ambiguous words:
if a certain field exists in the combined ambiguous word bank and the combination of the field and the next field also exists in the combined word bank, labeling the field with a label indicating that the current combined word can be split;
if a certain field exists in the combined ambiguous word bank, but the combination of the field and the next field does not exist in the combined word bank, labeling the field with a label indicating that the current combined word cannot be split;
if a certain field does not exist in the combined word bank, continuing to match the other fields with the combined ambiguous word bank;
S43, calculating the mutual information components MI1, MI2, MI3 and MI4 of the ambiguous word to obtain the vector <MI1, MI2, MI3, MI4>;
S44, substituting the vector <MI1, MI2, MI3, MI4> and the known detachable category of the current ambiguous word into the support vector machine model for training, to obtain the trained support vector machine model.
7. The method of claim 1, wherein splitting or not splitting the current ambiguous word according to the category comprises the following specific steps:
if the ambiguous word is of the detachable category, performing word segmentation on the current ambiguous word, and storing the splitting result of the current ambiguous word into a word segmentation result database as a final word segmentation result;
and if the ambiguous word is not of the detachable category, not segmenting the current ambiguous word, and directly storing the current ambiguous word into the word segmentation result database as a final word segmentation result.
8. The disambiguation system in the process of Chinese medicine text word segmentation is characterized by comprising the following components:
the preprocessing module is used for acquiring a Chinese medicine text to be segmented; preprocessing the Chinese medicine text, wherein the preprocessing comprises the following steps: deleting stop words, repeated words and modal particles;
the word segmentation module is used for carrying out word segmentation on the preprocessed traditional Chinese medicine text;
the matching module is used for matching the result after the word segmentation processing with a pre-constructed combined ambiguous word bank and screening out combined ambiguous words and non-combined ambiguous words from the result after the word segmentation processing; storing the non-combined ambiguous words into a word segmentation result database;
the disambiguation module is used for performing word frequency and part-of-speech tagging on the screened combined ambiguous words, calculating the mutual information vector of the current combined ambiguous word according to the parts of speech and the word frequencies of the screened combined ambiguous words, inputting the mutual information vector into a pre-trained support vector machine model, and outputting whether the category of the current combined ambiguous word is the detachable category; and splitting or not splitting the current combined ambiguous word according to the category.
9. An electronic device comprising a memory and a processor and computer instructions stored on the memory and executable on the processor, the computer instructions when executed by the processor performing the steps of the method of any of claims 1 to 7.
10. A computer-readable storage medium storing computer instructions which, when executed by a processor, perform the steps of the method of any one of claims 1 to 7.
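The labeling rule of step S42 in claim 6 can be illustrated with a minimal sketch. It assumes that label 1 denotes the detachable category and -1 the non-detachable one; the text names the two values but does not state which is which. The function name `label_fields` and both lexicon arguments are hypothetical.

```python
def label_fields(tokens, ambiguous_lexicon, combined_lexicon):
    """Label each field found in the ambiguous word bank as 1 or -1.

    1  : the field joined with the next field is in the combined word bank.
    -1 : the field is in the ambiguous word bank but the joined form is not.
    Fields outside the ambiguous word bank are skipped.
    """
    labels = []
    for i, tok in enumerate(tokens):
        if tok not in ambiguous_lexicon:
            continue
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        joined_found = bool(nxt) and (tok + nxt) in combined_lexicon
        labels.append((i, tok, 1 if joined_found else -1))
    return labels
```

The resulting labels, paired with the mutual information vectors of step S43, would form the training set of step S44.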
CN201910722134.7A 2019-08-06 2019-08-06 Disambiguation method, disambiguation system, disambiguation equipment and disambiguation medium in Chinese medicine text word segmentation process Active CN110502750B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910722134.7A CN110502750B (en) 2019-08-06 2019-08-06 Disambiguation method, disambiguation system, disambiguation equipment and disambiguation medium in Chinese medicine text word segmentation process

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910722134.7A CN110502750B (en) 2019-08-06 2019-08-06 Disambiguation method, disambiguation system, disambiguation equipment and disambiguation medium in Chinese medicine text word segmentation process

Publications (2)

Publication Number Publication Date
CN110502750A true CN110502750A (en) 2019-11-26
CN110502750B CN110502750B (en) 2023-08-11

Family

ID=68587902

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910722134.7A Active CN110502750B (en) 2019-08-06 2019-08-06 Disambiguation method, disambiguation system, disambiguation equipment and disambiguation medium in Chinese medicine text word segmentation process

Country Status (1)

Country Link
CN (1) CN110502750B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103365974A (en) * 2013-06-28 2013-10-23 百度在线网络技术(北京)有限公司 Semantic disambiguation method and system based on related words topic
CN105426539A (en) * 2015-12-23 2016-03-23 成都电科心通捷信科技有限公司 Dictionary-based lucene Chinese word segmentation method
CN106202039A (en) * 2016-06-30 2016-12-07 昆明理工大学 Vietnamese portmanteau word disambiguation method based on condition random field
CN108549639A (en) * 2018-04-20 2018-09-18 山东管理学院 Based on the modified Chinese medicine case name recognition methods of multiple features template and system
WO2019113938A1 (en) * 2017-12-15 2019-06-20 华为技术有限公司 Data annotation method and apparatus, and storage medium
CN110069630A (en) * 2019-03-20 2019-07-30 重庆信科设计有限公司 A kind of improved mutual information feature selection approach

Non-Patent Citations (9)

* Cited by examiner, † Cited by third party
Title
尤慧丽等: "中文分词中组合型切分歧义的消解研究", 《计算机工程与应用》, pages 125 - 127 *
张辉丽;孟昭鹏;王慧芝;: "汉语自动分词中的歧义处理", 微计算机应用, no. 06, pages 685 - 688 *
李佳 等: "基于多分类器加权投票法的越南语组合歧义消歧", 《计算机科学》, 15 January 2018 (2018-01-15), pages 167 - 172 *
王冰: "中医医案文本消歧算法的研究与实现", 《中国优秀硕士论文全文数据库》, 15 August 2020 (2020-08-15) *
秦鹏;张华平;刘金刚;: "基于新词发现技术的关键词提算法的研究", 微计算机信息, no. 33, pages 257 - 258 *
赵岩;王晓龙;刘秉权;关毅;: "融合聚类触发对特征的最大熵词性标注模型", 计算机研究与发展, no. 02, pages 268 - 274 *
魏博诚;王爱平;沙先军;王永;: "一种消除中文分词中交集型歧义的方法", 计算机技术与发展, no. 05, pages 60 - 63 *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111178064A (en) * 2019-12-13 2020-05-19 平安医疗健康管理股份有限公司 Information pushing method and device based on field word segmentation processing and computer equipment
CN111178064B (en) * 2019-12-13 2022-11-29 深圳平安医疗健康科技服务有限公司 Information pushing method and device based on field word segmentation processing and computer equipment
CN111259667A (en) * 2020-01-16 2020-06-09 上海国民集团健康科技有限公司 Chinese medicine word segmentation algorithm
CN111274806A (en) * 2020-01-20 2020-06-12 医惠科技有限公司 Method and device for recognizing word segmentation and part of speech and method and device for analyzing electronic medical record
CN111626055A (en) * 2020-05-25 2020-09-04 泰康保险集团股份有限公司 Text processing method and device, computer storage medium and electronic equipment
CN111931478A (en) * 2020-07-16 2020-11-13 丰图科技(深圳)有限公司 Address interest plane model training method, address prediction method and device
CN111931478B (en) * 2020-07-16 2023-11-10 丰图科技(深圳)有限公司 Training method of address interest surface model, and prediction method and device of address
CN112800321A (en) * 2021-01-05 2021-05-14 百威投资(中国)有限公司 Ambiguous post identification method based on keyword retrieval and computer equipment
CN112800321B (en) * 2021-01-05 2023-01-20 百威投资(中国)有限公司 Ambiguous post identification method based on keyword retrieval and computer equipment
CN113343686A (en) * 2021-04-30 2021-09-03 山东师范大学 Text multi-feature ambiguity resolution method and system
CN114662477A (en) * 2022-03-10 2022-06-24 平安科技(深圳)有限公司 Stop word list generating method and device based on traditional Chinese medicine conversation and storage medium
CN114662477B (en) * 2022-03-10 2024-02-02 平安科技(深圳)有限公司 Method, device and storage medium for generating deactivated word list based on Chinese medicine dialogue

Also Published As

Publication number Publication date
CN110502750B (en) 2023-08-11

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant