CN110502750A

CN110502750A - Disambiguation method, system, device and medium in Chinese medicine text word segmentation process

Info

Publication number: CN110502750A
Application number: CN201910722134.7A
Authority: CN
Inventors: 袁锋; 王冰; 郑向伟; 于凤洋
Original assignee: Shandong Normal University
Current assignee: Shandong Normal University
Priority date: 2019-08-06
Filing date: 2019-08-06
Publication date: 2019-11-26
Anticipated expiration: 2039-08-06
Also published as: CN110502750B

Abstract

The present disclosure provides a disambiguation method for Chinese medicine text word segmentation, comprising: obtaining a Chinese medicine text to be segmented; preprocessing the Chinese medicine text; performing word segmentation on the preprocessed Chinese medicine text; matching the word segmentation result against a pre-constructed combined ambiguous word library, and screening out combined ambiguous words and non-combined ambiguous words from the word segmentation result; storing the non-combined ambiguous words in a word segmentation result database; tagging the screened combined ambiguous words with word frequency and part of speech, calculating the mutual information vector of the current combined ambiguous word from its part of speech and word frequency, inputting the mutual information vector into a pre-trained support vector machine model, and outputting whether the current combined ambiguous word belongs to the detachable category; and splitting or not splitting the current combined ambiguous word according to the category. The method achieves correct segmentation of combined words during Chinese medicine text word segmentation and accurate disambiguation of combined Chinese medicine vocabulary.

Description

Disambiguation method, system, device and medium in Chinese medicine text word segmentation process

Technical Field

The present disclosure relates to the field of text segmentation technologies, and in particular, to a disambiguation method, system, device, and medium for use in a text segmentation process in traditional Chinese medicine.

Background

The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.

In the course of implementing the present disclosure, the inventors found that the following technical problems exist in the prior art:

in the existing word segmentation process of the traditional Chinese medicine text, the word segmentation result is not accurate enough, and particularly, the accurate word segmentation and the accurate disambiguation cannot be realized on combined ambiguous words, so that the word segmentation result is unsatisfactory.

Disclosure of Invention

In order to solve the deficiencies of the prior art, the present disclosure provides disambiguation methods, systems, devices and media in the process of Chinese medicine text word segmentation;

in a first aspect, the present disclosure provides a disambiguation method in the process of Chinese medicine text word segmentation;

the disambiguation method in the process of Chinese medicine text word segmentation comprises the following steps:

acquiring a Chinese medicine text to be segmented; preprocessing the Chinese medicine text, wherein the preprocessing comprises the following steps: deleting stop words, repeated words and tone words;

performing word segmentation on the preprocessed traditional Chinese medicine text;

matching the result after word segmentation processing with a pre-constructed combined ambiguous word library, and screening out combined ambiguous words and non-combined ambiguous words from the result after word segmentation processing; storing the non-combined ambiguous words into a word segmentation result database;

performing word frequency and word property marking on the screened combined ambiguous words, calculating mutual information vectors of the current combined ambiguous words according to the word properties and the word frequencies of the screened combined ambiguous words, inputting the mutual information vectors into a pre-trained support vector machine model, and outputting whether the category of the current combined ambiguous words is a detachable category or not; and splitting or not splitting the current combined ambiguous word according to the category.

In a second aspect, the present disclosure also provides a disambiguation system in the process of Chinese medicine text word segmentation;

the disambiguation system in the process of Chinese medicine text word segmentation comprises:

the preprocessing module is used for acquiring a Chinese medicine text to be segmented; preprocessing the Chinese medicine text, wherein the preprocessing comprises the following steps: deleting stop words, repeated words and tone words;

the word segmentation module is used for carrying out word segmentation on the preprocessed traditional Chinese medicine text;

the matching module is used for matching the result after the word segmentation processing with a pre-constructed combined ambiguous word bank and screening out combined ambiguous words and non-combined ambiguous words from the result after the word segmentation processing; storing the non-combined ambiguous words into a word segmentation result database;

the disambiguation module is used for marking the word frequency and part of speech of the screened combined ambiguous words, calculating mutual information vectors of the current combined ambiguous words according to the part of speech and the word frequency of the screened combined ambiguous words, inputting the mutual information vectors into a pre-trained support vector machine model, and outputting whether the category of the current combined ambiguous words is a detachable category; and splitting or not splitting the current combined ambiguous word according to the category.

In a third aspect, the present disclosure also provides an electronic device comprising a memory and a processor, and computer instructions stored on the memory and executed on the processor, wherein the computer instructions, when executed by the processor, perform the steps of the method of the first aspect.

In a fourth aspect, the present disclosure also provides a computer-readable storage medium for storing computer instructions which, when executed by a processor, perform the steps of the method of the first aspect.

Compared with the prior art, the beneficial effect of this disclosure is:

the word segmentation result is accurate, and the ambiguity of combined vocabulary is eliminated from the word segmentation result; in particular, correct word segmentation of combined words is achieved in the word segmentation process of the Chinese medicine text, and accurate disambiguation of combined Chinese medicine words is realized.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application.

FIG. 1 is a flow chart of the method of the first embodiment.

Detailed Description

It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.

It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.

Embodiment one: the present disclosure provides a disambiguation method in the process of Chinese medicine text word segmentation;

As shown in FIG. 1, the disambiguation method in the process of Chinese medicine text word segmentation includes:

S1: acquiring a Chinese medicine text to be segmented; preprocessing the Chinese medicine text, wherein the preprocessing comprises: deleting stop words, repeated words and tone words;

S2: performing word segmentation on the preprocessed traditional Chinese medicine text;

S3: matching the result after word segmentation processing with a pre-constructed combined ambiguous word library, and screening out combined ambiguous words and non-combined ambiguous words from the result after word segmentation processing; storing the non-combined ambiguous words into a word segmentation result database;

S4: performing word frequency and part-of-speech marking on the screened combined ambiguous words, calculating mutual information vectors of the current combined ambiguous words according to the part of speech and the word frequency of the screened combined ambiguous words, inputting the mutual information vectors into a pre-trained support vector machine model, and outputting whether the category of the current combined ambiguous words is a detachable category or not; and splitting or not splitting the current combined ambiguous word according to the category.
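Steps S3 and S4 can be sketched in Python as follows. This is a minimal illustration rather than the disclosed implementation: it assumes S1 and S2 have already produced a token list, the lexicon is a hypothetical dict mapping each combined ambiguous word to its two split parts, and `is_splittable` is a hypothetical stand-in for the trained support vector machine.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical lexicon type: combined ambiguous word -> its two split parts.
Lexicon = Dict[str, Tuple[str, str]]


def disambiguate(
    tokens: Sequence[str],
    lexicon: Lexicon,
    is_splittable: Callable[[Sequence[str], int], bool],
) -> List[str]:
    """Return the final segmentation for an already segmented sentence."""
    result: List[str] = []
    for i, tok in enumerate(tokens):
        if tok not in lexicon:
            # S3: non-combined ambiguous words are stored unchanged.
            result.append(tok)
        elif is_splittable(tokens, i):
            # S4: classified as detachable, so store the split parts.
            result.extend(lexicon[tok])
        else:
            # S4: not detachable, so keep the word whole.
            result.append(tok)
    return result


if __name__ == "__main__":
    lexicon = {"AB": ("A", "B")}
    sentence = ["x", "AB", "y"]
    print(disambiguate(sentence, lexicon, lambda toks, i: True))
    print(disambiguate(sentence, lexicon, lambda toks, i: False))
```

Passing the classifier in as a callable keeps the lexicon lookup separate from the model, so the same loop works whichever classifier is plugged in.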

As one or more embodiments, the Chinese medicine text to be segmented includes the text of a traditional Chinese medicine medical record, specifically the patient's self-reported condition or the doctor's diagnostic conclusion.

As one or more embodiments, word segmentation is performed on the preprocessed Chinese medicine text using the Chinese word segmentation system of the Chinese Academy of Sciences.

As one or more embodiments, the pre-constructed combined ambiguous word bank is constructed by the following steps:

segmenting all the data sets; combining each segmented field with its nearest adjacent field; if the combined word also exists in a Chinese medicine dictionary, labeling the current field and its nearest adjacent field; then manually reviewing all labeled fields, and if the combination is indeed a combined word, putting the labeled fields into the combined word library;

or,

segmenting all the data sets and compiling statistics on all the segmented words; performing secondary word segmentation on each word independently; if a word can be segmented a second time, labeling it; extracting the labeled words and manually reviewing them; and if an extracted word is indeed a combined word, putting it into the combined word library.
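The first construction route (merging each field with its adjacent field and checking the result against a dictionary) could be prototyped as below. This is a sketch under assumptions: `dictionary` is a hypothetical set of known Chinese medicine terms, only the right-hand neighbor is tried, and the output is a candidate list that still needs the manual review the text calls for.

```python
from typing import Iterable, List, Sequence, Set, Tuple


def adjacent_merge_candidates(
    sentences: Iterable[Sequence[str]],
    dictionary: Set[str],
) -> List[Tuple[str, str, str]]:
    """Collect (left, right, merged) triples whose merged form is a dictionary word."""
    seen: Set[str] = set()
    candidates: List[Tuple[str, str, str]] = []
    for tokens in sentences:
        # Pair every field with the field immediately after it.
        for left, right in zip(tokens, tokens[1:]):
            merged = left + right
            if merged in dictionary and merged not in seen:
                seen.add(merged)
                candidates.append((left, right, merged))
    return candidates


if __name__ == "__main__":
    corpus = [["A", "B", "C"], ["B", "C"]]
    print(adjacent_merge_candidates(corpus, {"AB", "BC"}))
```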

As one or more embodiments, word frequency tagging of the screened ambiguous words means tagging the frequency with which the current ambiguous word appears in the current Chinese medicine text.

As one or more embodiments, part-of-speech tagging of the screened ambiguous words means tagging the part of speech of the current ambiguous word in the Chinese medicine text. The parts of speech include nouns, verbs, adjectives, time words, and so forth.

As one or more embodiments, calculating a mutual information vector of the current ambiguous word according to the part of speech and the word frequency of the selected ambiguous word; the method comprises the following specific steps:

wherein MI₁ represents the first mutual information vector; MI₂ represents the second mutual information vector; MI₃ represents the third mutual information vector; MI₄ represents the fourth mutual information vector; w_{i-1} represents the word preceding the ambiguous field, and s_{i-1} represents the part of speech of that preceding word; w_{i+1} represents the word following the ambiguous field, and s_{i+1} represents the part of speech of that following word; w_i represents the combined ambiguous field treated as a single, unsplit field, and s_i represents its part of speech as a single, unsplit field; w_{i1} and w_{i2} represent the two fields obtained when the combined ambiguous field is split; s_{i1} and s_{i2} represent the parts of speech of those two fields.

As one or more embodiments, a pre-trained support vector machine model; the specific training steps include:

S41: selecting a plurality of Chinese medicine medical record texts for word segmentation;

S42: matching each field in the word segmentation result against the pre-constructed combined ambiguous word library, identifying ambiguous words, and labeling them:

if a field exists in the combined ambiguous word library and the combination of the field and the next field also exists in the combined word library, labeling the field as a combined word that can be split;

if a field exists in the combined ambiguous word library but the combination of the field and the next field does not exist in the combined word library, labeling the field as a combined word that cannot be split;

if a field does not exist in the combined word library, continuing to match the other fields against the combined ambiguous word library;

S43: calculating the mutual information values MI₁, MI₂, MI₃, and MI₄ of each ambiguous word to obtain the vector <MI₁, MI₂, MI₃, MI₄>;

S44: feeding the vector <MI₁, MI₂, MI₃, MI₄> and the known detachable category of the current ambiguous word into the support vector machine model for training to obtain the trained support vector machine model.
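The labeling rule of step S42 can be written down directly. The sketch below is one literal reading of the three cases, assuming a single hypothetical `lexicon` set that plays the role of both the combined ambiguous word library and the combined word library; the label strings are illustrative names, not taken from the disclosure.

```python
from typing import List, Sequence, Set, Tuple

SPLITTABLE = "splittable"
NOT_SPLITTABLE = "not_splittable"


def label_fields(tokens: Sequence[str], lexicon: Set[str]) -> List[Tuple[int, str, str]]:
    """Return (position, field, label) for every field found in the lexicon."""
    labels: List[Tuple[int, str, str]] = []
    for i, tok in enumerate(tokens):
        if tok not in lexicon:
            # Third case: not in the library, move on to the next field.
            continue
        has_next = i + 1 < len(tokens)
        if has_next and (tok + tokens[i + 1]) in lexicon:
            # First case: the field plus its successor is also in the library.
            labels.append((i, tok, SPLITTABLE))
        else:
            # Second case: the combination with the successor is not in the library.
            labels.append((i, tok, NOT_SPLITTABLE))
    return labels


if __name__ == "__main__":
    print(label_fields(["A", "B", "C"], {"A", "AB", "C"}))
```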

As one or more embodiments, the splitting or non-splitting processing of the current ambiguous word is implemented according to a category; the method comprises the following specific steps:

if the ambiguous word belongs to the detachable category, splitting the current ambiguous word and storing the split result in the word segmentation result database as the final word segmentation result;

and if the ambiguous word does not belong to the detachable category, leaving the current ambiguous word unsplit and storing it directly in the word segmentation result database as the final word segmentation result.

1. Feature selection

In traditional Chinese medicine (TCM) medical records, doctors follow certain conventions when writing record texts, and a symptom is formed from one or more words: noun-type TCM symptom names, such as thunder head wind, stroke, and dizziness; symptom + change, such as suffocation relief, shortness of breath relief, and hypomenorrhea; body part + adjective or adjective + body part, such as abdominal pain, dizziness, and green tongue. Therefore, in TCM medical records a word is strongly connected to the words before and after it, and the parts of speech of neighboring words follow certain regularities. Based on word frequency and part of speech, this work constructs word-frequency mutual information and part-of-speech mutual information for feature selection.

1.1 conventional mutual information

w_i denotes the current combinatorial ambiguity field in its "combined" state, and w_{i1} and w_{i2} denote the current combinatorial ambiguity field in its "split" state. W = w₁w₂…w_i…w_n denotes a segmented sentence in which the current combinatorial ambiguity field takes the "combined" form, and W = w₁w₂…w_{i1}w_{i2}…w_n denotes the same sentence with the field in the "split" form. Mutual information is often used to extract text features: it reflects the degree of correlation between words, and the higher the mutual information value between two words, the stronger their correlation. The calculation formula can be expressed as:

wherein P(w_{i-1}|w_i) denotes the probability that the feature word w_{i-1} occurs in the TCM medical record data set when the combinatorial ambiguity field w_i is in the "combined" form, and P(w_{i-1}) denotes the probability that the feature word w_{i-1} occurs in the data set. According to related research, mutual information performs less well in experiments than other feature selection methods (chi-square and information gain). The reason is that when a low-frequency word is chosen as a feature, its small probability appears in the denominator and inflates the value of the whole formula. Low-frequency words therefore receive high mutual information values and were often extracted as important features in past research, while factors such as word frequency and part of speech were ignored, which seriously harms the disambiguation of texts.
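The low-frequency bias described above is easy to reproduce numerically. The formula itself is not reproduced in this text, so the sketch below assumes the standard pointwise form MI(w_{i-1}, w_i) = log₂(P(w_{i-1}|w_i) / P(w_{i-1})), which matches the two probabilities defined in the paragraph; the log base and all counts are illustrative assumptions.

```python
import math


def pointwise_mi(cooc: int, context_total: int, word_count: int, corpus_total: int) -> float:
    """log2( P(w_prev | w_i) / P(w_prev) ), with probabilities estimated from raw counts."""
    p_cond = cooc / context_total        # P(w_{i-1} | w_i)
    p_word = word_count / corpus_total   # P(w_{i-1}), the denominator
    return math.log2(p_cond / p_word)


if __name__ == "__main__":
    # A rare word seen once in the whole corpus, next to the ambiguous field.
    rare = pointwise_mi(cooc=1, context_total=16, word_count=1, corpus_total=1024)
    # A frequent word that co-occurs with the field far more often.
    frequent = pointwise_mi(cooc=8, context_total=16, word_count=256, corpus_total=1024)
    print(rare, frequent)
```

With these toy counts the rare word scores higher than the frequent one even though it co-occurs with the field only once, which is the distortion the word frequency factor is meant to counter.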

1.2 improved word frequency mutual information

When the feature word is a low-frequency word, P(w_{i-1}) in the conventional mutual information calculation is the denominator of the formula, which makes the feature value of the low-frequency word large. In TCM texts, medium- and high-frequency words are the most important features of a passage or sentence and matter greatly for text data mining, whereas low-frequency words contribute little to the text and can become noise. To address the inflated mutual information of low-frequency words, this work adds a word frequency factor η_i of the feature word to the mutual information; η_i is the word frequency of the feature word under the different forms of the ambiguity, given by:

wherein P(w_{i-1}|w_i) denotes the word frequency of the feature word w_{i-1} when the combinatorial ambiguity field w_i is in the "combined" form, P*(w_{i-1}|w_i) denotes the number of occurrences of the feature word w_{i-1} when w_i is in the "combined" form, and P*(w_i) denotes the total number of occurrences of w_i in the "combined" form. After the word frequency factor η_i is added to the mutual information formula, the formula becomes:

1.3 parts-of-speech mutual information construction

Combining the characteristics of TCM texts with the study of combinatorial ambiguity fields shows that words in TCM medical records are strongly related to their parts of speech. The part of speech of a combinatorial ambiguity field is closely associated with the feature words in its context, and this holds in particular for both the "combined" and the "split" form of the field. Based on the part-of-speech association between the combinatorial ambiguity field and the feature words in its context, MI₃ and MI₄ denote the part-of-speech mutual information values between the feature words and the field in its "combined" and "split" forms, respectively. S = s₁s₂…s_i…s_n and S = s₁s₂…s_{i1}s_{i2}…s_n are the part-of-speech tag strings of the sentence when the combinatorial ambiguity field is in the "combined" and "split" forms, respectively. s_i denotes the part of speech of the ambiguity field w_i in the "combined" form, s_{i-1} is the part of speech of the word w_{i-1} preceding w_i, and s_{i1} and s_{i2} are the parts of speech of the words w_{i1} and w_{i2} in the "split" form.

The part-of-speech mutual information is defined by formulas (2) and (3):

wherein P(w_i|t_i) denotes the probability, in the TCM medical records, that the word w_i occurs given that its part of speech is s_i; and P(t_i|t_{i-1}) denotes the probability that the word w_i has part of speech s_i given that the preceding word w_{i-1} has part of speech s_{i-1}. In these probabilities the tag written t_i is the part of speech denoted s_i in the text.
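The two probabilities named above can be estimated by relative frequency from a tagged corpus. The sketch below assumes plain maximum-likelihood counting without smoothing, which the text does not specify, and uses a tiny hypothetical corpus of (word, tag) pairs.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple

TaggedSentence = Sequence[Tuple[str, str]]


def estimate_probs(
    corpus: Sequence[TaggedSentence],
) -> Tuple[Dict[Tuple[str, str], float], Dict[Tuple[str, str], float]]:
    """Return (emission, transition): P(word | tag) and P(tag | previous tag)."""
    tag_count: Counter = Counter()
    emit_count: Counter = Counter()
    prev_count: Counter = Counter()
    trans_count: Counter = Counter()
    for sent in corpus:
        for word, tag in sent:
            tag_count[tag] += 1
            emit_count[(word, tag)] += 1
        # Count tag bigrams within the sentence.
        for (_, prev_tag), (_, cur_tag) in zip(sent, sent[1:]):
            prev_count[prev_tag] += 1
            trans_count[(prev_tag, cur_tag)] += 1
    emission = {key: n / tag_count[key[1]] for key, n in emit_count.items()}
    transition = {key: n / prev_count[key[0]] for key, n in trans_count.items()}
    return emission, transition


if __name__ == "__main__":
    toy = [[("a", "n"), ("b", "v")], [("c", "n"), ("d", "n")]]
    emission, transition = estimate_probs(toy)
    print(emission[("a", "n")], transition[("n", "v")])
```

Emission keys are (word, tag) and transition keys are (previous tag, current tag); unseen pairs are simply absent from the dicts.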

1.4 construction of vectors

In this embodiment, MI₁ denotes the word-frequency mutual information between the combinatorial ambiguity field w_i in its "combined" form and its context, and MI₂ denotes the word-frequency mutual information between the field in its "split" form and its context. MI₃ denotes the part-of-speech mutual information between the field in its "combined" form and its context, and MI₄ denotes the part-of-speech mutual information between the field in its "split" form and its context. Each ambiguity field is represented as a vector of the values obtained from the word-frequency and part-of-speech mutual information, written <MI₁, MI₂, MI₃, MI₄>.

2 support vector machine model

The support vector machine (SVM) is a common machine learning algorithm with good classification accuracy, and it is particularly suited to binary classification. It works by finding an optimal separating hyperplane that satisfies the classification accuracy while maximizing the distance to the samples on both sides. A combinatorial ambiguity in a TCM medical record has two cases, "combined" and "split"; these two forms can be treated as two classes, so the binary classification problem of combinatorial ambiguity is solved with a support vector machine.

The basic idea of the SVM algorithm is as follows. Given a data set (x₁, y₁), …, (x_i, y_i), …, (x_n, y_n), i = 1, 2, …, n, with x_i ∈ R^d and y_i ∈ {-1, +1}, the separating hyperplane given by the SVM is:

w^T x + b = 0

The support vectors lie on the margin hyperplanes:

w^T x + b = ±1

The decision function of the SVM is:

g(x) = sgn((w*)^T x + b*)

where w* and b* are the parameters obtained by training. When a sample x is to be classified, its class is determined by computing g(x); the function value is the classification result.

MI₁, MI₂, MI₃, and MI₄ of the combinatorial ambiguity field are calculated according to formulas (1), (2), and (3) to obtain the vector <MI₁, MI₂, MI₃, MI₄>, which is substituted into the classification function g(x). If the result equals 1, the ambiguity field takes the "combined" form; if the result equals -1, the ambiguity field takes the "split" form.
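Applying the decision function to one feature vector can be sketched as follows. The weights, bias, and feature values are hypothetical numbers chosen only to show both outcomes, a linear kernel is assumed, and a score of exactly zero is mapped to +1 by convention.

```python
from typing import Sequence


def classify_field(w: Sequence[float], b: float, mi_vector: Sequence[float]) -> str:
    """Evaluate g(x) = sgn(w . x + b) and map +1 / -1 to the two forms."""
    score = sum(wi * xi for wi, xi in zip(w, mi_vector)) + b
    # +1 means the field stays whole, -1 means it is split.
    return "combined" if score >= 0 else "split"


if __name__ == "__main__":
    w, b = [1.0, -1.0, 1.0, -1.0], 0.0  # hypothetical trained parameters
    print(classify_field(w, b, [2.0, 1.0, 2.0, 1.0]))
    print(classify_field(w, b, [1.0, 2.0, 1.0, 2.0]))
```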

3 disambiguation model construction

3.1 definition, construction, acquisition of Combined thesaurus

(1) Combinatorial ambiguity definition

Combinatorial ambiguity fields are defined herein as follows:

Combinatorial ambiguity field: assume a field AB consists of the fields A and B, and that A, B, and AB can each be a word. If the Chinese text contains a sentence W in which A and B, as separate words, are both grammatically and semantically valid, AB is a combinatorial ambiguity field.

An example of a combinatorial ambiguity field is as follows:

1: insomnia/more than/peaceful sleep.

2: recent/3/year/last/burst/breathlessness/,/multi/on/exertion/post/occurrence.

The word rendered "more than" is a combinatorial ambiguity field in the sentences above. In example 1 it appears in its "combined" form as a single word; in example 2 it appears in its "split" form as two words, rendered "multi" and "on".

(2) Establishment of combined word stock

Combinatorial disambiguation techniques have gradually matured in current research. However, combinatorial ambiguity corpora for disambiguation are lacking; in particular, no suitable combinatorial ambiguity corpus exists for TCM text disambiguation, and important features such as word frequency and part of speech within sentences are not fully exploited, so disambiguation performance is unsatisfactory. In this work, given that combinatorial disambiguation relies on a dictionary, a combinatorial ambiguous word library is built from combinatorially ambiguous words selected from medical records and is used to identify the combinatorially ambiguous words present in TCM medical records.

(3) Acquisition of combinatory ambiguous fields

The obtained TCM medical records are preprocessed by word segmentation, part-of-speech tagging, and the like; combinatorial ambiguity fields are then identified, labeled, and extracted by a matching algorithm using the established combined word library for TCM texts. A field is labeled as combinatorially ambiguous only when both its "combined" form (AB) and its "split" form (A, B) occur in the TCM medical record data set. As required by the experiment, 500 combinatorial ambiguity fields were extracted from the segmented corpus of TCM medical records. The flow of the combinatorial ambiguity field extraction method is shown in FIG. 1.

The following sentences contain combinatorial ambiguity fields and have been segmented and part-of-speech tagged:

recent/t few/m …/q month/n ,/un night/t insomnia/v more-than/v peaceful/v ./un

near/a three/m years/n burst/q occurrence/v breathlessness/n ,/un breathlessness/n multi/v on/p exertion/an after/f occurrence/v ./un

Table 1 lists the feature information contained in the example sentences above when the window size is 2.

TABLE 1 Feature information

Feature type	Feature value
Local words	insomnia, peaceful; breathlessness, exertion
Local word parts of speech	t_{i-1} = v, t_{i+1} = v; t_{i-1} = n, t_{i+1} = an
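Extracting the local features of Table 1 from a tagged sentence can be sketched as below. The code assumes the `word/tag` whitespace-separated layout used for the tagged sentences, reads "window size 2" as one word on each side of the ambiguous field, and uses an illustrative English gloss of the first example sentence; the function and key names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def parse_tagged(line: str) -> List[Tuple[str, str]]:
    """Split 'word/tag word/tag ...' into (word, tag) pairs."""
    pairs: List[Tuple[str, str]] = []
    for item in line.split():
        # Split on the last slash so punctuation tokens such as './un' work.
        word, tag = item.rsplit("/", 1)
        pairs.append((word, tag))
    return pairs


def context_features(tagged: List[Tuple[str, str]], i: int) -> Dict[str, Optional[str]]:
    """Return the neighboring words and tags of the field at position i."""
    prev = tagged[i - 1] if i > 0 else (None, None)
    nxt = tagged[i + 1] if i + 1 < len(tagged) else (None, None)
    return {"w_prev": prev[0], "t_prev": prev[1], "w_next": nxt[0], "t_next": nxt[1]}


if __name__ == "__main__":
    sentence = parse_tagged("night/t insomnia/v more-than/v peaceful/v ./un")
    print(context_features(sentence, 2))
```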

3.2 disambiguation step

The specific disambiguation algorithm is described as follows:

(1) the main steps of the training phase are as follows:

Step 1: select 200 TCM medical records and segment them.

Step 2: match the segmented TCM medical records against the combined word library, identify ambiguous words with the matching extraction algorithm, and label them.

Step 3: calculate the mutual information values MI₁, MI₂, MI₃, and MI₄ of each combinatorial ambiguity field to obtain the vector <MI₁, MI₂, MI₃, MI₄>.

Step 4: feed the vectors <MI₁, MI₂, MI₃, MI₄> into the support vector machine model for training to obtain the classification function g(x).
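Step 4 of the training phase can be illustrated with a small linear SVM trained by sub-gradient descent on the regularized hinge loss. The text does not state the kernel, solver, or hyperparameters, so the linear form, the learning rate `lr`, the regularization weight `lam`, and the four toy vectors below are all assumptions made for the sketch.

```python
from typing import List, Sequence, Tuple


def train_linear_svm(
    samples: Sequence[Sequence[float]],
    labels: Sequence[int],
    lr: float = 0.1,
    lam: float = 0.01,
    epochs: int = 200,
) -> Tuple[List[float], float]:
    """Minimize lam/2 * |w|^2 + hinge loss with per-sample sub-gradient steps."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # The L2 penalty shrinks the weights on every step.
            w = [wi * (1.0 - lr * lam) for wi in w]
            if margin < 1.0:
                # The sample violates the margin, so move the hyperplane toward it.
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def g(w: Sequence[float], b: float, x: Sequence[float]) -> int:
    """Decision function sgn(w . x + b), with zero mapped to +1."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


if __name__ == "__main__":
    # Toy <MI1, MI2, MI3, MI4> vectors: +1 = "combined", -1 = "split".
    X = [[2.0, 0.5, 1.5, 0.5], [0.5, 2.0, 0.5, 1.5],
         [1.8, 0.4, 1.2, 0.6], [0.4, 1.8, 0.6, 1.2]]
    Y = [1, -1, 1, -1]
    w, b = train_linear_svm(X, Y)
    print([g(w, b, x) for x in X])
```

A production system would more likely rely on an established SVM library; the hand-written loop only makes the training objective visible.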

(2) The main steps of the testing stage are as follows:

Step 1: select 300 TCM medical records and segment them to obtain a segmented data set; match the data set against the combined word library and identify the combinatorial ambiguity fields contained in the sentences.

Step 2: for each sentence containing a combinatorial ambiguity field, obtain the two segmentation paths, "combined" and "split", and the corresponding part-of-speech tag strings.

Step 3: extract the words and their parts of speech, and calculate the word frequencies and part-of-speech frequencies. Substitute these into formulas (1), (2), and (3) to calculate MI₁, MI₂, MI₃, and MI₄, expressed as the vector <MI₁, MI₂, MI₃, MI₄>.

Step 4: substitute the vector into the trained classification function g(x) to obtain the class 1 or -1 and hence the corresponding segmentation result.

Step 5: resolve the combinatorial ambiguity field to obtain the disambiguated word segmentation result; the experiment ends.
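Steps 2 and 4 of the testing phase (building both segmentation paths, then keeping the one the classifier selects) can be sketched as follows. The sentence, tags, and split parts are abstract placeholders following the A, B, AB notation, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Tagged = Tuple[str, str]  # (word, part-of-speech tag)


def segmentation_paths(
    sentence: Sequence[Tagged], i: int, split_parts: Sequence[Tagged]
) -> Tuple[List[Tagged], List[Tagged]]:
    """Return the 'combined' path and the 'split' path for the field at position i."""
    combined = list(sentence)
    split = list(sentence[:i]) + list(split_parts) + list(sentence[i + 1:])
    return combined, split


def tag_string(path: Sequence[Tagged]) -> str:
    """Part-of-speech tag string of a segmentation path."""
    return " ".join(tag for _, tag in path)


def choose_path(cls: int, combined: List[Tagged], split: List[Tagged]) -> List[Tagged]:
    """Class 1 keeps the combined path, class -1 keeps the split path."""
    return combined if cls == 1 else split


if __name__ == "__main__":
    sentence = [("x", "n"), ("AB", "v"), ("y", "an")]
    combined, split = segmentation_paths(sentence, 1, [("A", "v"), ("B", "p")])
    print(tag_string(combined), "|", tag_string(split))
    print(choose_path(-1, combined, split))
```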

4 experiment

4.1 Experimental data

TCM texts mix TCM terminology, classical Chinese, and modern language. Segmenting the Chinese used in TCM requires considering word frequency and the relations between words. Because no public evaluation corpus exists for combinatorial disambiguation of TCM texts, a TCM medical record ambiguous word library was built for this work to test the effect of the medical record disambiguation method. The main corpus comes from 20,000 medical records of the affiliated hospitals of Shandong University of Traditional Chinese Medicine; the obtained TCM texts were split into paragraphs and clauses, segmented into words, and part-of-speech tagged to produce the segmented corpus required by the experiment. A TCM combined word library was built through manual scanning, combined word extraction, and similar steps; it was then matched against the segmented corpus with a matching algorithm to obtain and label the combinatorial ambiguity fields in the TCM texts. The TCM corpus is encoded in UTF-8. 2000 of the records were selected for the ambiguous word resolution experiments.

The ambiguous word resolution experiments were divided into three groups. Experiment one performs word segmentation with the conventional mutual-information-based feature extraction method; experiment two performs word segmentation with the feature extraction method based on part-of-speech mutual information; experiment three performs word segmentation with the ambiguity resolution method proposed herein. The standard corpus described above serves as the experimental corpus, giving a total of 25052 words and 10356 words.

4.2 analysis of results

Five example sentences are extracted from the test corpus to illustrate the disambiguation results.

Example one: the weight of the body is reduced by 15kg within 2 years.

Example two: the patient had a slightly swollen tongue.

Example three: the lumbago, shortness of breath and tenesmus of stool are obviously improved in rainy days.

Example four: the patient has normal menstruation, and the menstrual cycle is prolonged by 7 days after taking cold drink 2 years ago.

Table 3 lists the comparison results of 5 example sentences in the test corpus from experiment one to experiment two.

TABLE 3 presentation of test corpus participle results

Table 3 shows that the disambiguation method based on context information gives unsatisfactory results at the beginning of a sentence, because there is no preceding word. The disambiguation method based on the support vector machine gives unsatisfactory segmentation when it meets technical terms. Overall, the experimental results show that the word segmentation system with the disambiguation method added segments well.

In the second embodiment, the disclosure also provides a disambiguation system for the combined type ambiguity of the Chinese medicine texts;

The present disclosure also provides an electronic device, which includes a memory, a processor, and a computer instruction stored in the memory and executed on the processor, where when the computer instruction is executed by the processor, each operation in the method is completed, and details are not described herein for brevity.

The electronic device may be a mobile terminal or a non-mobile terminal. Non-mobile terminals include desktop computers; mobile terminals include smartphones (such as Android phones and iOS phones), smart glasses, smart watches, smart bracelets, tablet computers, notebook computers, personal digital assistants, and other mobile internet devices capable of wireless communication.

It should be understood that, in the present disclosure, the processor may be a central processing unit (CPU), or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general-purpose processor may be a microprocessor or any conventional processor.

The memory may include both read-only memory and random access memory, and may provide instructions and data to the processor, and a portion of the memory may also include non-volatile random access memory. For example, the memory may also store device type information.

In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in a processor or by instructions in the form of software. The steps of a method disclosed in connection with the present disclosure may be embodied directly in a hardware processor, or in a combination of hardware and software modules within the processor. The software modules may be located in RAM, flash memory, ROM, PROM or EPROM, registers, or other storage media well known in the art. The storage medium is located in a memory; a processor reads the information in the memory and completes the steps of the method in combination with its hardware. To avoid repetition, this is not described in detail here.

Those of ordinary skill in the art will appreciate that the various illustrative elements, i.e., algorithm steps, described in connection with the embodiments disclosed herein may be implemented as electronic hardware or as a combination of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends on the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality in different ways for each particular application, but such implementation decisions should not be interpreted as a departure from the scope of the present application.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division into units is merely a division by logical function, and other divisions are possible in an actual implementation: a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection of devices or units through some interfaces, and may be electrical, mechanical or of another form.

The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence or in the portion that contributes to the prior art, may be embodied in the form of a software product that is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and various other media capable of storing program code.

The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims

1. A disambiguation method for the process of Chinese medicine text word segmentation, characterized by comprising the following steps:

2. The method of claim 1, wherein the obtained Chinese medicine text to be segmented comprises Chinese medicine medical record text, specifically a patient's self-described condition or a doctor's diagnostic conclusion.

3. The method of claim 1, wherein the word segmentation of the preprocessed Chinese medicine text is performed using the Chinese word segmentation system of the Chinese Academy of Sciences.

4. The method of claim 1, wherein the pre-constructed combinational ambiguity dictionary is constructed by the following steps:


5. The method of claim 1, wherein:

performing word frequency marking on the screened ambiguous words means marking the frequency with which the current ambiguous word appears in the current Chinese medicine text;

and performing part-of-speech tagging on the screened ambiguous words means tagging the part of speech of the current ambiguous word in the Chinese medicine text.

6. The method of claim 1, wherein the specific training steps of the pre-trained support vector machine model include:

S43, calculating the mutual information values MI₁, MI₂, MI₃ and MI₄ of the ambiguous word to obtain the vector <MI₁, MI₂, MI₃, MI₄>;

7. The method of claim 1, wherein the splitting or non-splitting processing of the current ambiguous word according to its category comprises the following specific steps:

8. A disambiguation system for the process of Chinese medicine text word segmentation, characterized by comprising the following components:

9. An electronic device comprising a memory and a processor and computer instructions stored on the memory and executable on the processor, the computer instructions when executed by the processor performing the steps of the method of any of claims 1 to 7.

10. A computer-readable storage medium storing computer instructions which, when executed by a processor, perform the steps of the method of any one of claims 1 to 7.