CN111737460A - Unsupervised learning multipoint matching method based on clustering algorithm

Info

Publication number: CN111737460A
Application number: CN202010470688.5A
Authority: CN
Inventors: 陈明东; 黄越
Original assignee: Sipai Health Industry Investment Co ltd
Current assignee: Sipai Health Industry Investment Co ltd
Priority date: 2020-05-28
Filing date: 2020-05-28
Publication date: 2020-10-02

Abstract

The invention discloses an unsupervised learning multipoint matching method based on a clustering algorithm, comprising: S1, preprocessing a short text library to obtain a first type of mapping chain (short text -> word segments of the short text -> characters contained in each word segment), and deriving from it a second type of mapping chain (character -> word segment -> short text); and S2, inputting a text to be matched, splitting it into single characters, mapping the characters to word segments and the word segments to short texts using the second type of mapping chain, and, according to the position of each character in the text to be matched, describing each short text's reference to the text to be matched as a vector, so as to obtain a reference matrix of the short text library. The advantages are that, through parallel multipoint matching, the algorithm can extract all possibly matching short texts in a single pass, which improves matching efficiency and avoids repeatedly looping over the same text to be matched.

Description

Unsupervised learning multipoint matching method based on clustering algorithm

Technical Field

The invention relates to the field of text processing, in particular to an unsupervised learning multipoint matching method based on a clustering algorithm.

Background

At present, text processing mainly relies on supervised learning methods. Although their accuracy is generally high, they require a large amount of labeled text as a training set to train the model.

Disclosure of Invention

The invention aims to provide an unsupervised learning multipoint matching method based on a clustering algorithm, so that the problems in the prior art are solved.

In order to achieve the purpose, the technical scheme adopted by the invention is as follows:

An unsupervised learning multipoint matching method based on a clustering algorithm comprises the following steps:

S1, preprocessing a short text library to obtain a first type of mapping chain, namely short text -> word segments of the short text -> characters contained in each word segment, and obtaining a second type of mapping chain, namely character -> word segment -> short text, from the mapping relation of the first type of mapping chain;

S2, inputting a text to be matched, splitting the text to be matched into single characters, mapping the characters to word segments and the word segments to short texts using the second type of mapping chain, and, according to the position of each character in the text to be matched, describing each short text's reference to the text to be matched as a vector, so as to obtain a reference matrix of the short text library;

and S3, performing cluster analysis on the reference matrix so as to divide the short texts in the short text library into regions, scoring the match between the short texts contained in each class and the corresponding region of the text to be matched, and selecting the best matching short texts to form a target matching set as the final matching result.

Preferably, step S1 specifically comprises performing word segmentation on each short text in the short text library to obtain the first type of mapping chain, whose mapping relation is short text -> word segments of the short text -> characters contained in each word segment; and inverting the first type of mapping chain to obtain the second type of mapping chain, whose mapping relation is character -> word segment -> short text. The first type of mapping chain is the forward mapping, and the second type of mapping chain is the reverse mapping.

Preferably, in the second type of mapping chain, each level of mapping is a one-to-many relation; that is, a character may appear in different word segments, and a word segment may appear in different short texts.
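By way of illustration only, the two mapping chains can be sketched as plain dictionaries. The function names, the pre-segmented sample library, and the Chinese sample entries are assumptions made for this sketch; the source does not prescribe a data structure or a particular word segmenter.

```python
from collections import defaultdict

def build_forward_chain(segmented_library):
    """Forward chain: short text -> word segments -> characters."""
    return {
        text: {seg: list(seg) for seg in segments}
        for text, segments in segmented_library.items()
    }

def invert_chain(forward):
    """Reverse chain: character -> word segments, word segment -> short texts.

    Both levels are one-to-many, so the values are sets.
    """
    char_to_segments = defaultdict(set)
    segment_to_texts = defaultdict(set)
    for text, segments in forward.items():
        for seg, chars in segments.items():
            segment_to_texts[seg].add(text)
            for ch in chars:
                char_to_segments[ch].add(seg)
    return dict(char_to_segments), dict(segment_to_texts)

# Hypothetical library that has already been segmented by some external tool.
library = {
    "原发性高血压": ["原发性", "高血压"],   # essential hypertension
    "继发性高血压": ["继发性", "高血压"],   # secondary hypertension
    "糖尿病": ["糖尿病"],                   # diabetes
}
forward = build_forward_chain(library)
char_to_segments, segment_to_texts = invert_chain(forward)
```

In this sample the character "性" maps to two word segments and the word segment "高血压" maps to two short texts, which is the one-to-many behaviour described above.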

Preferably, in step S2, describing each short text's reference to the text to be matched as a vector according to the position of each character in the text to be matched, so as to obtain the reference matrix of the short text library, specifically comprises: checking in turn whether each character of the text to be matched appears in the first short text in the short text library, replacing the character with "1" if it does and with "0" otherwise, thereby generating a corresponding matrix for the first short text; processing every short text in the short text library in the same way to generate a plurality of corresponding matrices; and splicing the corresponding matrices in order to form the reference matrix of the short text library. Each row of the reference matrix corresponds to one short text, each column corresponds to one position in the text to be matched, the number of columns equals the length of the text to be matched, and the value at a given row and column indicates whether the character at that position in the text to be matched appears in the short text corresponding to that row.
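A minimal sketch of the reference matrix construction described above follows. It compares characters directly, as this paragraph states; the function names and the sample strings are assumptions for illustration only.

```python
def reference_vector(short_text, query):
    """One row: 1 where the query character occurs in short_text, else 0."""
    chars = set(short_text)
    return [1 if ch in chars else 0 for ch in query]

def reference_matrix(library, query):
    """Stack one row per short text; the column count equals len(query)."""
    return [reference_vector(text, query) for text in library]

# Hypothetical text to be matched and short text library.
query = "患者诊断为原发性高血压"
library = ["高血压", "原发性高血压", "糖尿病"]
matrix = reference_matrix(library, query)

for text, row in zip(library, matrix):
    print(text, "".join(map(str, row)))
```

A row consisting only of zeros, such as the last one here, means the short text is not referenced by the text to be matched at all.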

Preferably, step S3 specifically comprises: performing cluster analysis on the reference matrix so as to divide the short texts in the short text library into regions; scoring the match between the short texts contained in each class and the corresponding region; selecting the highest-scoring short text in each class as the best matching short text of that class and recording its score; and comparing the score of the best matching short text of every class in turn with a set threshold, discarding the best matching short text if its score is below the threshold and retaining it otherwise. All retained best matching short texts form the target matching set, which is the final matching result.
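The selection and threshold filtering in step S3 can be sketched as follows, assuming the class labels and matching scores have already been computed. The function name and the sample scores are hypothetical and are not taken from the source.

```python
def select_matches(scored, threshold):
    """Keep the best-scoring short text of each class if it reaches the threshold.

    scored: iterable of (class_label, short_text, score) triples.
    Returns {class_label: (short_text, score)}.
    """
    best = {}
    for label, text, score in scored:
        if label not in best or score > best[label][1]:
            best[label] = (text, score)
    # A score that is not smaller than the threshold is retained.
    return {label: ts for label, ts in best.items() if ts[1] >= threshold}

# Hypothetical scores on a 0-100 scale.
scored = [
    (0, "原发性高血压", 100.0),
    (0, "继发性高血压", 89.0),
    (1, "糖尿病", 92.0),
    (2, "高血脂", 55.0),
]
result = select_matches(scored, threshold=80)
```

Class 2 is dropped because its best score is below the threshold, while classes 0 and 1 each contribute one short text to the target matching set.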

The invention has the following beneficial effects: when extracting short texts from a text to be matched, the matching method does not need the large amount of labeled data required by supervised learning methods, which saves a great deal of manual labor; in addition, through parallel multipoint matching, the algorithm can extract all possibly matching short texts in a single pass, which improves matching efficiency and avoids repeatedly looping over the same text to be matched.

Drawings

Fig. 1 is a flow chart illustrating a matching method according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the invention, are intended for purposes of illustration only and are not intended to limit the scope of the invention.

Example one

As shown in Fig. 1, this embodiment provides an unsupervised learning multipoint matching method based on a clustering algorithm, comprising steps S1 to S3 as set out above.

In this embodiment, step S1 specifically comprises performing word segmentation on each short text in the short text library to obtain the first type of mapping chain, whose mapping relation is short text -> word segments of the short text -> characters contained in each word segment; and inverting the first type of mapping chain to obtain the second type of mapping chain, whose mapping relation is character -> word segment -> short text. The first type of mapping chain is the forward mapping, and the second type of mapping chain is the reverse mapping.

In the second type of mapping chain, each level of mapping is a one-to-many relation; that is, a character may appear in different word segments, and a word segment may appear in different short texts.

In this embodiment, in step S2, describing each short text's reference to the text to be matched as a vector according to the position of each character in the text to be matched, so as to obtain the reference matrix of the short text library, specifically comprises: checking in turn whether each character of the text to be matched appears in the first short text in the short text library, replacing the character with "1" if it does and with "0" otherwise, thereby generating a corresponding matrix for the first short text; processing every short text in the short text library in the same way to generate a plurality of corresponding matrices; and splicing the corresponding matrices in order to form the reference matrix of the short text library. Each row of the reference matrix corresponds to one short text, each column corresponds to one position in the text to be matched, the number of columns equals the length of the text to be matched, and the value at a given row and column indicates whether the character at that position in the text to be matched appears in the short text corresponding to that row.

In this embodiment, step S3 specifically comprises: performing cluster analysis on the reference matrix so as to divide the short texts in the short text library into regions; scoring the match between the short texts contained in each class and the corresponding region; selecting the highest-scoring short text in each class as the best matching short text of that class and recording its score; and comparing the score of the best matching short text of every class in turn with a set threshold, discarding the best matching short text if its score is below the threshold and retaining it otherwise. All retained best matching short texts form the target matching set, which is the final matching result.

In this embodiment, the cluster analysis uses the DBSCAN algorithm. The region division refers to the region of the text to be matched that each short text corresponds to, for example the "hypertension" region or the "diabetes" region in "hypertension and diabetes". The divided regions are the classes produced by the cluster analysis.
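To illustrate how DBSCAN can separate rows of a reference matrix into such regions, a compact pure-Python sketch is given below. The source names DBSCAN but does not state the distance measure or the parameters; the Hamming distance, eps=1, min_samples=2, and the sample rows are all assumptions made for this sketch.

```python
def hamming(a, b):
    """Number of positions at which two 0/1 rows differ."""
    return sum(x != y for x, y in zip(a, b))

def dbscan(rows, eps, min_samples, dist=hamming):
    """Minimal DBSCAN. Returns one label per row; -1 marks noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(rows)

    def neighbours(i):
        # The point itself counts as one of its own neighbours.
        return [j for j in range(len(rows)) if dist(rows[i], rows[j]) <= eps]

    cluster = -1
    for i in range(len(rows)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster          # border point of this cluster
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            nbrs = neighbours(j)
            if len(nbrs) >= min_samples:     # core point: keep expanding
                queue.extend(nbrs)
    return labels

# Hypothetical rows for a 7-position text such as "hypertension and diabetes".
rows = [
    [1, 1, 1, 0, 0, 0, 0],  # hypertension
    [1, 1, 1, 0, 0, 0, 0],  # essential hypertension
    [1, 1, 0, 0, 0, 0, 0],  # hyperlipidemia (shares two characters)
    [0, 0, 0, 0, 1, 1, 1],  # diabetes
    [0, 0, 0, 0, 1, 1, 1],  # type 2 diabetes
]
labels = dbscan(rows, eps=1, min_samples=2)
```

With these assumed parameters the first three rows form one class and the last two another. In practice, rows that are entirely zero would presumably be dropped before clustering, since they do not reference the text to be matched.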

In this embodiment, the matching score may be computed with any commonly used text matching algorithm, such as the jaro_winkler algorithm in the Levenshtein package.
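For readers unfamiliar with the Jaro-Winkler score, a dependency-free sketch of its standard definition is given below. It is not the Levenshtein package's implementation, and the prefix weight of 0.1 with a prefix cap of four characters is the conventional choice, assumed here.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3

def jaro_winkler(s1, s2, prefix_weight=0.1):
    """Jaro similarity boosted for a common prefix of up to four characters."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_weight * (1 - score)

print(round(jaro_winkler("MARTHA", "MARHTA"), 4))
```

The score lies between 0 and 1; multiplying by 100 gives a 0-100 scale if a threshold on that scale is preferred.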

Example two

In this embodiment, the matching method automatically matches all diagnoses in a discharge summary or medical record and maps them to standard diagnoses.

In actual model building, the first step is preprocessing the short text library. The short text library is chosen according to the practical application; for example, if the purpose of modeling is to match standard diagnosis names in a discharge summary, the standard diagnosis library (ICD) can be chosen as the short text library. During preprocessing, each short text in the library is first segmented into words; any existing word segmentation method may be used. This yields the mapping chain: short text -> word segments of the short text -> characters contained in each word segment. Inverting this chain gives the inverted mapping chain: character -> word segment -> short text. Note that each level of the inverted mapping chain is a one-to-many relation; that is, a character may appear in different word segments, and a word segment may appear in different short texts. The first mapping chain is referred to as the forward mapping and the second as the reverse mapping.

After the short text library has been preprocessed, the next modeling step can be performed. When a new text to be matched (such as a discharge summary) is input, it is first split into single characters. Then, using the reverse mapping, the word segments mapped to by each character, and the short texts mapped to by those word segments, are obtained. Since the position of each character in the text to be matched is fixed (for example, the position of "high" in "diagnosis of essential hypertension of patient" is 9, counted in characters of the original Chinese text), the reference of each short text in the short text library (the standard text, i.e. the matching target) to the text to be matched can be described by a one-hot encoded vector of 0s and 1s; for example, the reference of "essential hypertension" to "diagnosis of essential hypertension of patient" is (00000000111). All reference vectors for the short text library can thus be written as a matrix.
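The lookup through the reverse mapping can be sketched as follows. The Chinese strings are an assumed reconstruction of the translated example (an 11-character text in which "高" sits at position 9), and the small reverse mapping and the function name are hypothetical.

```python
from collections import defaultdict

# Hypothetical reverse mapping chain: character -> segments -> short texts.
char_to_segments = {
    "原": {"原发性"}, "发": {"原发性"}, "性": {"原发性"},
    "高": {"高血压"}, "血": {"高血压"}, "压": {"高血压"},
}
segment_to_texts = {
    "原发性": {"原发性高血压"},
    "高血压": {"高血压", "原发性高血压"},
}

def reference_vectors(query, char_to_segments, segment_to_texts):
    """Return {short_text: 0/1 vector over the character positions of query}."""
    hits = defaultdict(set)
    for pos, ch in enumerate(query):
        for seg in char_to_segments.get(ch, ()):
            for text in segment_to_texts.get(seg, ()):
                hits[text].add(pos)
    return {
        text: [1 if i in positions else 0 for i in range(len(query))]
        for text, positions in hits.items()
    }

query = "患者诊断为原发性高血压"   # assumed original of the translated example
vectors = reference_vectors(query, char_to_segments, segment_to_texts)
as_string = {t: "".join(map(str, v)) for t, v in vectors.items()}
```

Under these assumptions, every short text reachable from any character of the text to be matched is collected in one pass, and the three-character entry "高血压" yields 00000000111, the same pattern as the vector quoted above.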

The reference matrix is clustered using the DBSCAN algorithm. In the reference matrix input to the clustering, each row corresponds to a standard short text, each column corresponds to a position in the text to be matched, and the number of columns equals the length of the text to be matched; the value at a given row and column indicates whether the character at that position in the text to be matched appears in the standard short text of that row (1 if it appears, 0 if not). Clustering divides all short texts in the short text library into regions. For example, if the text to be matched is "the patient is diagnosed with essential hypertension and the first stage of diabetes", the hypertension diagnoses are concentrated in positions 6-11 of the text to be matched and the diabetes diagnoses in positions 13-17.
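One possible way to turn cluster labels into position ranges such as 6-11 and 13-17 is sketched below. Taking the smallest and largest referenced column of each class is an assumption of this sketch, as are the 17-character Chinese sample text, the hand-written rows, and the labels; the source does not say how the region boundaries are derived.

```python
def cluster_regions(matrix, labels):
    """Return {label: (first, last)} as 1-based inclusive column positions."""
    regions = {}
    for row, label in zip(matrix, labels):
        cols = [i + 1 for i, v in enumerate(row) if v]
        if label == -1 or not cols:
            continue  # skip noise rows and rows with no reference at all
        lo, hi = regions.get(label, (cols[0], cols[-1]))
        regions[label] = (min(lo, cols[0]), max(hi, cols[-1]))
    return regions

def span_row(first, last, length):
    """Helper for the demo: a row with 1s at positions first..last (1-based)."""
    return [1 if first <= pos <= last else 0 for pos in range(1, length + 1)]

text = "患者诊断为原发性高血压和糖尿病一期"   # assumed original, 17 characters
n = len(text)
matrix = [
    span_row(6, 11, n),    # essential hypertension
    span_row(9, 11, n),    # hypertension
    span_row(13, 17, n),   # diabetes, stage one
    span_row(13, 15, n),   # diabetes
]
labels = [0, 0, 1, 1]      # hypothetical clustering output

regions = cluster_regions(matrix, labels)
region_texts = {label: text[lo - 1:hi] for label, (lo, hi) in regions.items()}
```

The slice of the text to be matched obtained for each class is what the short texts of that class would then be scored against.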

Each short text in each class is then scored against the text in the corresponding region of the text to be matched (for example, using the Jaro-Winkler method); the highest-scoring short text in each class is selected as the best matching short text of that class, and its score is recorded. A threshold is applied to the highest score of every class (for example, 80 on a 0-100 scale; the value depends on the requirements of the actual application). The best matching short texts that pass the threshold filter form the final matching result.
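The scoring of one class against its region can be sketched with the Python standard library. Note that this sketch substitutes difflib's SequenceMatcher ratio for Jaro-Winkler purely to stay dependency-free; the function names and the sample strings are likewise assumptions.

```python
from difflib import SequenceMatcher

def score(candidate, region_text):
    """Similarity on a 0-100 scale (difflib ratio, a stand-in for Jaro-Winkler)."""
    return 100.0 * SequenceMatcher(None, candidate, region_text).ratio()

def rank_candidates(region_text, candidates):
    """Score every short text of one class against the region, best first."""
    scored = [(c, score(c, region_text)) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical class: short texts clustered onto the region "原发性高血压".
ranked = rank_candidates("原发性高血压", ["高血压", "继发性高血压", "原发性高血压"])
best_text, best_score = ranked[0]

for text, value in ranked:
    print(text, round(value, 1))
```

The first entry of the ranking is the best matching short text of the class; it would be kept only if its score is not below the chosen threshold, for example 80.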

By adopting the technical scheme disclosed by the invention, the following beneficial effects are obtained:

The invention provides an unsupervised learning multipoint matching method based on a clustering algorithm. When extracting short texts from a text to be matched, the method does not need the large amount of labeled data required by supervised learning methods, which saves a great deal of manual labor; in addition, through parallel multipoint matching, the algorithm can extract all possibly matching short texts in a single pass, which improves matching efficiency and avoids repeatedly looping over the same text to be matched.

The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and improvements can be made without departing from the principle of the present invention, and such modifications and improvements should also be considered within the scope of the present invention.

Claims

1. An unsupervised learning multipoint matching method based on a clustering algorithm, characterized by comprising the following steps:

2. The unsupervised learning multipoint matching method based on a clustering algorithm as claimed in claim 1, wherein: step S1 specifically comprises performing word segmentation on each short text in the short text library to obtain a first type of mapping chain, whose mapping relation is short text -> word segments of the short text -> characters contained in each word segment; and inverting the first type of mapping chain to obtain a second type of mapping chain, whose mapping relation is character -> word segment -> short text; the first type of mapping chain is the forward mapping, and the second type of mapping chain is the reverse mapping.

3. The unsupervised learning multipoint matching method based on a clustering algorithm as claimed in claim 2, wherein: in the second type of mapping chain, each level of mapping is a one-to-many relation, that is, a character may appear in different word segments, and a word segment may appear in different short texts.

4. The unsupervised learning multipoint matching method based on a clustering algorithm as claimed in claim 3, wherein: in step S2, describing each short text's reference to the text to be matched as a vector according to the position of each character in the text to be matched, so as to obtain the reference matrix of the short text library, specifically comprises: checking in turn whether each character of the text to be matched appears in the first short text in the short text library, replacing the character with "1" if it does and with "0" otherwise, thereby generating a corresponding matrix for the first short text; processing every short text in the short text library in the same way to generate a plurality of corresponding matrices; and splicing the corresponding matrices in order to form the reference matrix of the short text library; each row of the reference matrix corresponds to one short text, each column corresponds to one position in the text to be matched, the number of columns equals the length of the text to be matched, and the value at a given row and column indicates whether the character at that position in the text to be matched appears in the short text corresponding to that row.

5. The unsupervised learning multipoint matching method based on a clustering algorithm as claimed in claim 4, wherein: step S3 specifically comprises performing cluster analysis on the reference matrix so as to divide the short texts in the short text library into regions; scoring the match between the short texts contained in each class and the corresponding region; selecting the highest-scoring short text in each class as the best matching short text of that class and recording its score; and comparing the score of the best matching short text of every class in turn with a set threshold, discarding the best matching short text if its score is below the threshold and retaining it otherwise; all retained best matching short texts form the target matching set, which is the final matching result.