CN111737460A - Unsupervised learning multipoint matching method based on clustering algorithm - Google Patents


Publication number
CN111737460A
Authority
CN
China
Prior art keywords
text
short
short text
mapping
matched
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010470688.5A
Other languages
Chinese (zh)
Inventor
陈明东
黄越
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sipai Health Industry Investment Co ltd
Original Assignee
Sipai Health Industry Investment Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sipai Health Industry Investment Co ltd filed Critical Sipai Health Industry Investment Co ltd
Priority to CN202010470688.5A priority Critical patent/CN111737460A/en
Publication of CN111737460A publication Critical patent/CN111737460A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking

Abstract

The invention discloses an unsupervised learning multipoint matching method based on a clustering algorithm. The method comprises: S1, preprocessing a short text library to obtain a first type of mapping chain, short text -> word segments of the short text -> characters contained in the word segments, and deriving from it a second type of mapping chain, character -> word segment -> short text; and S2, inputting a text to be matched, splitting it into single characters, mapping the characters to word segments and the word segments to short texts by means of the second type of mapping chain, and, according to the position of each character in the text to be matched, describing with a vector how each short text references the text to be matched, thereby obtaining a reference matrix of the short text library. The advantages are that, through parallel multipoint matching, the algorithm extracts all possibly matching short texts in one pass, which improves matching efficiency and avoids repeatedly looping over the same text to be matched.

Description

Unsupervised learning multipoint matching method based on clustering algorithm
Technical Field
The invention relates to the field of text processing, in particular to an unsupervised learning multipoint matching method based on a clustering algorithm.
Background
At present, text processing relies mainly on supervised learning. Although supervised methods are generally accurate, they require a large amount of labeled text as a training set to train the model.
Disclosure of Invention
The invention aims to provide an unsupervised learning multipoint matching method based on a clustering algorithm, so that the problems in the prior art are solved.
In order to achieve the purpose, the technical scheme adopted by the invention is as follows:
An unsupervised learning multipoint matching method based on a clustering algorithm comprises the following steps:
S1, preprocessing the short text library to obtain a first type of mapping chain, short text -> word segments of the short text -> characters contained in the word segments, and obtaining a second type of mapping chain, character -> word segment -> short text, from the mapping relationship of the first type of mapping chain;
S2, inputting a text to be matched, splitting the text to be matched into single characters, mapping the characters to word segments and the word segments to short texts using the second type of mapping chain, and, according to the position of each character in the text to be matched, describing with a vector the reference relationship of each short text to the text to be matched, so as to obtain a reference matrix of the short text library;
S3, performing cluster analysis on the reference matrix to divide the short texts in the short text library into regions, scoring the match between the short texts in each class and the corresponding region of the text to be matched, and selecting the best-matching short texts to form a target matching set as the final matching result.
Preferably, step S1 specifically comprises performing word segmentation on each short text in the short text library to obtain the first type of mapping chain, whose mapping relationship is short text -> word segments of the short text -> characters contained in the word segments; and reversing the first type of mapping chain to obtain the second type of mapping chain, whose mapping relationship is character -> word segment -> short text. The first type of mapping chain is a forward mapping, and the second type of mapping chain is a reverse mapping.
Preferably, in the second type of mapping chain, each level of mapping is one-to-many, i.e. a character may appear in different word segments, and a word segment may appear in different short texts.
Preferably, in step S2, describing with a vector the reference relationship of each short text to the text to be matched, so as to obtain the reference matrix of the short text library, specifically comprises: for the first short text in the short text library, checking in turn whether each character of the text to be matched appears in that short text, and replacing the character with "1" if it does and with "0" otherwise, which generates the corresponding matrix for the first short text; processing every short text in the short text library in the same way to generate a plurality of corresponding matrices; and splicing the corresponding matrices in order to form the reference matrix of the short text library. Each row of the reference matrix corresponds to a short text, each column corresponds to a position in the text to be matched, the number of columns equals the length of the text to be matched, and the value at a given row and column indicates whether the character at that position of the text to be matched appears in the short text corresponding to that row.
Preferably, step S3 specifically comprises: performing cluster analysis on the reference matrix to divide the short texts in the short text library into regions; scoring the match between the short texts in each class and the corresponding region of the text to be matched; selecting the short text with the highest score in each class as the best-matching short text of that class and recording its score; and comparing the score of the best-matching short text of each class with a set threshold in turn, rejecting the best-matching short text if its score is less than the threshold and retaining it otherwise. All retained best-matching short texts form the target matching set, which is the final matching result.
The invention has the following beneficial effects: when extracting short texts from a text to be matched, the matching method avoids the need for the large amount of labeled data required by supervised learning methods and saves considerable manual effort; in addition, through parallel multipoint matching, the algorithm extracts all possibly matching short texts in one pass, which improves matching efficiency and avoids repeatedly looping over the same text to be matched.
Drawings
Fig. 1 is a flow chart illustrating a matching method according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the invention, are intended for purposes of illustration only and are not intended to limit the scope of the invention.
Example one
As shown in Fig. 1, this embodiment provides an unsupervised learning multipoint matching method based on a clustering algorithm, which comprises the following steps:
S1, preprocessing the short text library to obtain a first type of mapping chain, short text -> word segments of the short text -> characters contained in the word segments, and obtaining a second type of mapping chain, character -> word segment -> short text, from the mapping relationship of the first type of mapping chain;
S2, inputting a text to be matched, splitting the text to be matched into single characters, mapping the characters to word segments and the word segments to short texts using the second type of mapping chain, and, according to the position of each character in the text to be matched, describing with a vector the reference relationship of each short text to the text to be matched, so as to obtain a reference matrix of the short text library;
S3, performing cluster analysis on the reference matrix to divide the short texts in the short text library into regions, scoring the match between the short texts in each class and the corresponding region of the text to be matched, and selecting the best-matching short texts to form a target matching set as the final matching result.
In this embodiment, step S1 specifically comprises performing word segmentation on each short text in the short text library to obtain the first type of mapping chain, whose mapping relationship is short text -> word segments of the short text -> characters contained in the word segments; and reversing the first type of mapping chain to obtain the second type of mapping chain, whose mapping relationship is character -> word segment -> short text. The first type of mapping chain is a forward mapping, and the second type of mapping chain is a reverse mapping.
In the second type of mapping chain, each level of mapping is one-to-many, that is, a character may appear in different word segments, and a word segment may appear in different short texts.
In this embodiment, in step S2, describing with a vector the reference relationship of each short text to the text to be matched, so as to obtain the reference matrix of the short text library, specifically comprises: for the first short text in the short text library, checking in turn whether each character of the text to be matched appears in that short text, and replacing the character with "1" if it does and with "0" otherwise, which generates the corresponding matrix for the first short text; processing every short text in the short text library in the same way to generate a plurality of corresponding matrices; and splicing the corresponding matrices in order to form the reference matrix of the short text library. Each row of the reference matrix corresponds to a short text, each column corresponds to a position in the text to be matched, the number of columns equals the length of the text to be matched, and the value at a given row and column indicates whether the character at that position of the text to be matched appears in the short text corresponding to that row.
In this embodiment, step S3 specifically comprises: performing cluster analysis on the reference matrix to divide the short texts in the short text library into regions; scoring the match between the short texts in each class and the corresponding region of the text to be matched; selecting the short text with the highest score in each class as the best-matching short text of that class and recording its score; and comparing the score of the best-matching short text of each class with a set threshold in turn, rejecting the best-matching short text if its score is less than the threshold and retaining it otherwise. All retained best-matching short texts form the target matching set, which is the final matching result.
In this embodiment, the cluster analysis uses the DBSCAN algorithm. Region division refers to the region of the text to be matched to which each short text corresponds, such as the "hypertension" region or the "diabetes" region in "hypertension and diabetes". The divided regions are in fact the classes produced by the cluster analysis.
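The relationship between a class and its region can be sketched in Python as follows. This is only an illustration under stated assumptions: the text does not say how region boundaries are derived from a class, so taking the span from the first to the last referenced position is an assumption, and the function names `cluster_region` and `regions_by_class`, the label convention (-1 for noise) and the toy matrix are all hypothetical.

```python
def cluster_region(rows):
    """Return the 1-based inclusive (start, end) span of positions that
    at least one row in the class references, or None if no row does."""
    covered = [pos for pos in range(len(rows[0])) if any(row[pos] for row in rows)]
    if not covered:
        return None
    return covered[0] + 1, covered[-1] + 1


def regions_by_class(matrix, labels):
    """Group the rows of a reference matrix by class label and compute
    one region per class. Label -1 is treated as noise and skipped."""
    groups = {}
    for row, label in zip(matrix, labels):
        if label != -1:
            groups.setdefault(label, []).append(row)
    return {label: cluster_region(rows) for label, rows in groups.items()}


# Toy reference matrix: 5 short texts (rows) over a 9-character text.
matrix = [
    [0, 0, 1, 1, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
]
labels = [0, 0, 1, 1, -1]  # assumed output of a clustering step
regions = regions_by_class(matrix, labels)
print(regions)
```

Each class then owns one contiguous stretch of the text to be matched, which is the fragment its short texts are later scored against.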
In this embodiment, the match scoring may use a commonly used text matching algorithm, such as the jaro_winkler algorithm in the Levenshtein package.
Example two
In this embodiment, the matching method automatically matches all diagnoses in a discharge summary or medical record and maps them to standard diagnoses.
In actual model building, the first step is preprocessing the short text library. The short text library is chosen according to the practical application; for example, if the purpose of modeling is to match standard diagnosis names from a discharge summary, the standard diagnosis library (ICD) can be used as the short text library. During preprocessing, each short text in the short text library is first segmented into words, and any existing word segmentation method may be used. This yields the mapping chain: short text -> word segments of the short text -> characters contained in the word segments. Inverting this mapping chain gives the inverted mapping chain: character -> word segment -> short text. Note that each level of the inverted mapping chain is a one-to-many mapping, i.e. a character may appear in different word segments, and a word segment may appear in different short texts. The first mapping chain is referred to as the forward mapping and the second as the reverse mapping.
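The preprocessing step can be sketched in Python as below. This is a minimal illustration, not the patented implementation: the whitespace-based `toy_segment` is a hypothetical stand-in for whatever word segmenter is actually chosen, the function names are invented for the example, and the three English entries are a made-up short text library.

```python
from collections import defaultdict


def toy_segment(text):
    """Hypothetical stand-in for a real word segmenter: split on whitespace."""
    return text.split()


def build_forward_chain(library, segment=toy_segment):
    """Forward mapping: short text -> word segments -> characters in each segment."""
    return {text: {seg: list(seg) for seg in segment(text)} for text in library}


def build_reverse_chain(forward):
    """Reverse mapping: character -> word segments, and word segment -> short texts.
    Both levels are one-to-many, so sets are used as values."""
    char_to_segs = defaultdict(set)
    seg_to_texts = defaultdict(set)
    for text, segs in forward.items():
        for seg, chars in segs.items():
            seg_to_texts[seg].add(text)
            for ch in chars:
                char_to_segs[ch].add(seg)
    return char_to_segs, seg_to_texts


library = ["essential hypertension", "renal hypertension", "type 1 diabetes"]
forward = build_forward_chain(library)
char_to_segs, seg_to_texts = build_reverse_chain(forward)
print(sorted(seg_to_texts["hypertension"]))
```

Because the reverse mapping is built once from the library, it can be reused for every new text to be matched.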
After the short text library has been preprocessed, the next modeling step can be performed. When a new text to be matched (such as a discharge summary) is input, it is first split into single characters. The reverse mapping then gives the word segments to which each character maps and the short texts to which those word segments map. Since the position of each character in the text to be matched is known (for example, the position of "high" in "diagnosis of essential hypertension of patient" is 9, counting characters in the original Chinese text), the reference of each short text in the short text library (a standard text, i.e. a matching target) to the text to be matched can be described by a binary (0/1) encoded vector; for example, the reference of "primary hypertension" to "diagnosis of essential hypertension of patient" is (00000000111). All reference vectors of the short text library can thus be written as a matrix.
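A small Python sketch of the reference matrix construction follows. It rests on simplifying assumptions: the word-segment level of the reverse mapping is collapsed into a direct character -> short text index, the names `char_index` and `reference_matrix` are hypothetical, and abstract letter strings are used instead of real diagnoses so that positions are easy to check by eye.

```python
def char_index(library):
    """Collapsed reverse mapping: character -> set of short texts containing it."""
    index = {}
    for text in library:
        for ch in set(text):
            index.setdefault(ch, set()).add(text)
    return index


def reference_matrix(query, library):
    """One row per short text, one column per position of the text to be
    matched; a cell is 1 if the character at that position appears in the
    short text of that row, otherwise 0."""
    index = char_index(library)
    rows = {text: [0] * len(query) for text in library}
    # A single pass over the text to be matched fills every row at once.
    for pos, ch in enumerate(query):
        for text in index.get(ch, ()):
            rows[text][pos] = 1
    return [rows[text] for text in library]


library = ["abc", "xyz"]
query = "qqabcqxyz"
matrix = reference_matrix(query, library)
for text, row in zip(library, matrix):
    print(text, "".join(map(str, row)))
```

The text to be matched is traversed only once, regardless of how many short texts the library contains, which is the sense in which the matching is performed at multiple points in parallel.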
The reference matrix is clustered using the DBSCAN algorithm. In the reference matrix input to the clustering, each row corresponds to a standard short text, each column corresponds to a position in the text to be matched, and the number of columns equals the length of the text to be matched; the value at a given row and column indicates whether the character at that position of the text to be matched appears in the standard short text of that row (1 if it appears, 0 if not). Clustering divides all short texts in the short text library into regions. For example, when the text to be matched is "the patient is diagnosed with essential hypertension and the first stage of diabetes", the hypertension diagnoses are concentrated in positions 6-11 of the text to be matched and the diabetes diagnoses in positions 13-17.
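The clustering step can be illustrated with a compact pure-Python DBSCAN over the rows of a reference matrix. Several choices here are assumptions, since the text names only the algorithm: the Jaccard distance between binary rows, the values `eps=0.5` and `min_samples=2`, and the toy matrix are all picked for the example. In practice a library implementation such as scikit-learn's `DBSCAN` could be substituted.

```python
def jaccard_distance(a, b):
    """1 - |intersection| / |union| of the positions set to 1 (assumed metric)."""
    inter = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return 1.0 if union == 0 else 1.0 - inter / union


def dbscan(points, eps, min_samples, dist=jaccard_distance):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    UNSEEN, NOISE = None, -1
    labels = [UNSEEN] * len(points)

    def neighbours(i):
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNSEEN:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not UNSEEN:
                continue
            labels[j] = cluster
            nbrs = neighbours(j)
            if len(nbrs) >= min_samples:  # j is a core point: keep expanding
                queue.extend(nbrs)
    return labels


matrix = [
    [0, 0, 1, 1, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
]
labels = dbscan(matrix, eps=0.5, min_samples=2)
print(labels)
```

Rows that reference overlapping positions end up in the same class, while a short text that is not referenced at all (an all-zero row) is left as noise under this distance.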
Each short text in each class is then scored against the text in the corresponding region of the text to be matched (for example, by the jaro-winkler method); the highest-scoring short text in each class is selected as the best-matching short text of that class, and its score is recorded. A threshold is applied to the highest score of every class (for example, a threshold of 80 on a 0-100 scale; the value depends on the requirements of the actual application). The best-matching short texts that remain after threshold filtering form the final matching result.
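The scoring and filtering step might look like the following Python sketch. To keep it dependency-free, the similarity from the standard library's `difflib.SequenceMatcher` is swapped in for Jaro-Winkler, so the scores differ from what the method named above would give; the function names, the region format (1-based inclusive spans) and the toy data are likewise assumptions made for illustration.

```python
from difflib import SequenceMatcher


def score(a, b):
    """Similarity on a 0-100 scale (difflib ratio standing in for Jaro-Winkler)."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def best_matches(query, classes, threshold=80.0):
    """classes: list of (region, candidates), where region is a 1-based
    inclusive (start, end) span of the text to be matched and candidates
    are the short texts of that class."""
    results = []
    for (start, end), candidates in classes:
        fragment = query[start - 1:end]
        # Best-matching short text of this class and its score.
        best = max(candidates, key=lambda text: score(text, fragment))
        best_score = score(best, fragment)
        # Keep it only if the score is not less than the threshold.
        if best_score >= threshold:
            results.append((best, best_score))
    return results


query = "qqabcdqxyzw"
classes = [((3, 6), ["abcd", "abd"]), ((8, 11), ["xy"])]
matches = best_matches(query, classes)
print(matches)
```

Here the first class contributes its exact match, while the second class is dropped because its only candidate scores below the threshold of 80.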
By adopting the technical scheme disclosed by the invention, the following beneficial effects are obtained:
The invention provides an unsupervised learning multipoint matching method based on a clustering algorithm. When extracting short texts from a text to be matched, the method avoids the need for the large amount of labeled data required by supervised learning methods and saves considerable manual effort; in addition, through parallel multipoint matching, the algorithm extracts all possibly matching short texts in one pass, which improves matching efficiency and avoids repeatedly looping over the same text to be matched.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and improvements can be made without departing from the principle of the present invention, and such modifications and improvements should also be considered within the scope of the present invention.

Claims (5)

1. An unsupervised learning multipoint matching method based on a clustering algorithm, characterized by comprising the following steps:
S1, preprocessing the short text library to obtain a first type of mapping chain, short text -> word segments of the short text -> characters contained in the word segments, and obtaining a second type of mapping chain, character -> word segment -> short text, from the mapping relationship of the first type of mapping chain;
S2, inputting a text to be matched, splitting the text to be matched into single characters, mapping the characters to word segments and the word segments to short texts using the second type of mapping chain, and, according to the position of each character in the text to be matched, describing with a vector the reference relationship of each short text to the text to be matched, so as to obtain a reference matrix of the short text library;
S3, performing cluster analysis on the reference matrix to divide the short texts in the short text library into regions, scoring the match between the short texts in each class and the corresponding region of the text to be matched, and selecting the best-matching short texts to form a target matching set as the final matching result.
2. The unsupervised learning multipoint matching method based on a clustering algorithm as claimed in claim 1, wherein: step S1 specifically comprises performing word segmentation on each short text in the short text library to obtain the first type of mapping chain, whose mapping relationship is short text -> word segments of the short text -> characters contained in the word segments; and reversing the first type of mapping chain to obtain the second type of mapping chain, whose mapping relationship is character -> word segment -> short text; the first type of mapping chain is a forward mapping, and the second type of mapping chain is a reverse mapping.
3. The unsupervised learning multipoint matching method based on a clustering algorithm as claimed in claim 2, wherein: in the second type of mapping chain, each level of mapping is one-to-many, i.e. a character may appear in different word segments, and a word segment may appear in different short texts.
4. The unsupervised learning multipoint matching method based on a clustering algorithm as claimed in claim 3, wherein: in step S2, describing with a vector the reference relationship of each short text to the text to be matched, so as to obtain the reference matrix of the short text library, specifically comprises: for the first short text in the short text library, checking in turn whether each character of the text to be matched appears in that short text, and replacing the character with "1" if it does and with "0" otherwise, which generates the corresponding matrix for the first short text; processing every short text in the short text library in the same way to generate a plurality of corresponding matrices; and splicing the corresponding matrices in order to form the reference matrix of the short text library; each row of the reference matrix corresponds to a short text, each column corresponds to a position in the text to be matched, the number of columns equals the length of the text to be matched, and the value at a given row and column indicates whether the character at that position of the text to be matched appears in the short text corresponding to that row.
5. The unsupervised learning multipoint matching method based on a clustering algorithm as claimed in claim 4, wherein: step S3 specifically comprises performing cluster analysis on the reference matrix to divide the short texts in the short text library into regions; scoring the match between the short texts in each class and the corresponding region of the text to be matched; selecting the short text with the highest score in each class as the best-matching short text of that class and recording its score; and comparing the score of the best-matching short text of each class with a set threshold in turn, rejecting the best-matching short text if its score is less than the threshold and retaining it otherwise; all retained best-matching short texts form the target matching set, which is the final matching result.
CN202010470688.5A 2020-05-28 2020-05-28 Unsupervised learning multipoint matching method based on clustering algorithm Pending CN111737460A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010470688.5A CN111737460A (en) 2020-05-28 2020-05-28 Unsupervised learning multipoint matching method based on clustering algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010470688.5A CN111737460A (en) 2020-05-28 2020-05-28 Unsupervised learning multipoint matching method based on clustering algorithm

Publications (1)

Publication Number Publication Date
CN111737460A true CN111737460A (en) 2020-10-02

Family

ID=72648081

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010470688.5A Pending CN111737460A (en) 2020-05-28 2020-05-28 Unsupervised learning multipoint matching method based on clustering algorithm

Country Status (1)

Country Link
CN (1) CN111737460A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030004942A1 (en) * 2001-06-29 2003-01-02 International Business Machines Corporation Method and apparatus of metadata generation
CN102622338A (en) * 2012-02-24 2012-08-01 北京工业大学 Computer-assisted computing method of semantic distance between short texts
US20190005049A1 (en) * 2014-03-17 2019-01-03 NLPCore LLC Corpus search systems and methods
CN109472019A (en) * 2018-10-11 2019-03-15 厦门快商通信息技术有限公司 A kind of short text Similarity Match Method and system based on thesaurus
CN110825852A (en) * 2019-11-07 2020-02-21 四川长虹电器股份有限公司 Long text-oriented semantic matching method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ASIM M.EI TAHIR ALI等: "Using Kohonen Maps and Singular Value Decomposition for Plagiarism Detection", 《IEEE》 *
李宏广: "基于深度神经网络的文本匹配算法研究", 《中国优秀硕士学位论文全文数据库信息科技辑》 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
AD01 Patent right deemed abandoned

Effective date of abandoning: 20240202