CN114064904A - Clustering method, system and device for medical texts

Info

Publication number: CN114064904A
Application number: CN202111426905.1A
Authority: CN
Inventors: 金迪; 李征
Original assignee: Henan University
Current assignee: Henan University
Priority date: 2021-11-28
Filing date: 2021-11-28
Publication date: 2022-02-18

Abstract

Embodiments of the invention relate to a clustering method, device and system for medical texts in the technical field of text data mining. The method comprises the following steps: acquiring medical labels and the text of the question-and-answer section of a medical website; updating a word segmentation lexicon with the label text, segmenting the medical text with the updated lexicon, and filtering stop words to construct a training corpus; training a model on the training corpus to obtain trained word vectors; acquiring the medical texts to be clustered, and performing word segmentation and stop-word filtering on them; and clustering the medical texts to be clustered with a clustering model to obtain a clustering result. The invention can classify medical texts more accurately and specifically, and can determine the number of cluster categories automatically.

Description

Clustering method, system and device for medical texts

Technical Field

The invention relates to the technical field of text data mining, in particular to a clustering method, system and device for medical texts.

Background

In the era of explosive growth of internet data, text data in every industry keeps increasing. Most medical texts on the network are semi-structured or unstructured, and medical workers facing massive volumes of medical text process and classify it manually, which is time-consuming and laborious. Against this background, clustering technology can be used to simplify and analyze the text data and to classify the texts, so that medical workers can find useful information in massive network information and work more efficiently.

In the medical field, medical texts can be divided into several categories, including symptoms, treatments, examinations, etiologies, care, prevention, and the like. In the large number of articles on the network these categories are intermingled and of mixed quality, so classifying large volumes of text is of considerable practical significance. Clearly classified text lets a doctor judge a patient's condition quickly and prescribe treatment suited to the symptoms, which improves the doctor's working efficiency.

Text clustering technology is widely applied in text mining, information retrieval and similar areas, and has important application value in organizing and browsing large-scale text collections, automatically generating hierarchical classifications of text collections, and the like. The goal of text clustering is to partition a data set into different classes or clusters according to a specific criterion (e.g., distance), so that data objects in the same cluster are as similar as possible and data objects in different clusters are as different as possible. The clustering techniques in common use today mainly include the K-means algorithm, the density-based algorithm DBSCAN, the hierarchy-based algorithm BIRCH, and the like. The K-means algorithm requires the number of clusters to be set in advance. Because the types and number of texts are generally large, the number of clusters is difficult to determine accurately, and the clustering result is therefore inaccurate. In the BIRCH algorithm, the CF-Tree limits the number of CFs in each node, so the clustering result may differ from the real category distribution.

Disclosure of Invention

The invention aims to provide a clustering method, system and device for medical texts that solve the problems set forth in the background above. Taking medical text as its starting point, the invention aims to achieve a more specific classification of medical text data.

In order to achieve the above object, the present invention provides a clustering method for medical texts, which mainly comprises the following steps.

Step S100: acquiring medical labels and the text of the question-and-answer section of a medical website.

Step S200: updating the word segmentation lexicon with the label text, segmenting the medical text with the updated lexicon, and filtering stop words to construct a training corpus.

Step S300: training a model on the training corpus to obtain trained word vectors.

Step S400: acquiring the medical texts to be clustered, and performing word segmentation and stop-word filtering on them.

Step S500: clustering the medical texts to be clustered with the clustering model to obtain a clustering result.

Preferably, the question tags of the question-and-answer pages of the medical professional website obtained in step S100 serve as professional vocabulary of the medical field and are used as a custom vocabulary when segmenting sentences.

Preferably, a regular expression is applied to the medical text that has been read in order to extract the Chinese sentences from it.
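The regular-expression step can be sketched as follows. This is a minimal illustration rather than the pattern used by the inventors: the source does not give the expression, so the character range (CJK Unified Ideographs, U+4E00 to U+9FFF), the sentence delimiters and the function name extract_chinese_sentences are all assumptions. It also assumes the step is meant to retain the Chinese text and discard everything else.

```python
import re

# Assumption: "Chinese" means CJK Unified Ideographs U+4E00..U+9FFF.
CHINESE_RUN = re.compile(r"[\u4e00-\u9fff]+")

# Assumption: sentences end at these punctuation marks or at a newline.
SENTENCE_END = re.compile(r"[。！？!?；;\n]+")


def extract_chinese_sentences(text):
    """Split text into sentences and keep only the Chinese characters of each."""
    sentences = []
    for piece in SENTENCE_END.split(text):
        chinese = "".join(CHINESE_RUN.findall(piece))
        if chinese:  # drop pieces that contain no Chinese at all
            sentences.append(chinese)
    return sentences
```

Pieces that contain only Latin letters, digits or markup are dropped entirely, and punctuation inside a sentence is removed along with other non-Chinese characters.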

Preferably, after word segmentation and stop-word filtering, the training corpus is input into a word2vec model for training, and the resulting word vectors suited to the medical field are stored.

Preferably, the medical texts to be clustered are preprocessed: every text is segmented and its stop words are filtered, and a sentence vector is obtained as the average of the word vectors in the sentence.

Preferably, the texts to be clustered are clustered with the DBSCAN clustering method, and similar texts are grouped by calculating the similarity between texts and applying a cluster radius.

Corresponding to the method, the invention also provides a clustering system for medical texts, which comprises the following modules.

The data collection module is used for acquiring the medical text corpus and the medical labels; the medical labels are used in the custom lexicon for word segmentation, and the medical text corpus is used for training word vectors in the medical field.

The training word vector module is used for obtaining word vectors in the medical field and converting the words obtained by segmenting the sentences in the text data into word vectors.

The clustering module is used for classifying the texts to be clustered according to the set text similarity threshold and cluster radius, so that similar texts are grouped into one class.

Corresponding to the system, the embodiment of the invention provides a clustering device for medical texts, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the program to realize the steps of a clustering method for the medical texts.

Embodiments of the present invention provide a computer-readable storage medium, which may store a computer program that, when executed by a processor, implements a clustering method for medical texts as described in any one of the above.

Compared with the prior art, the invention has the following advantages and beneficial effects.

(1) The invention provides a clustering method, system and device for medical texts. Word vectors are obtained by first expanding the vocabulary with terms specific to the medical field and then training word2vec on the preprocessed medical texts. Because these word vectors are trained on a medical text corpus, they can express the semantic information of medical terms.

(2) The invention can realize the specific classification of the medical texts.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the embodiments are briefly described below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.

Fig. 1 is a schematic flow chart of a medical text clustering method provided by the present invention.

Fig. 2 is a functional block diagram of the medical text clustering system provided by the present invention.

Fig. 3 is a schematic structural diagram of an electronic device provided in the present invention.

Detailed Description

So that the above objects, features and advantages of the present invention can be clearly understood, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments that a person skilled in the art can derive from the embodiments given herein without creative effort fall within the protection scope of the present invention.

The invention provides a clustering method, a system and a device for medical texts, which solve the problem that medical workers waste time and labor when classifying massive medical texts.

The first embodiment.

The flow of the medical text clustering method of the present invention is described below with reference to Fig. 1. The method includes the following.

Data is collected from the content of a medical professional website, including the text of the medical labels and the text of the medical question-and-answer section.

The custom lexicon for the medical field is expanded: the medical label texts are stored as terms specific to the medical field and used when the texts are segmented. Using medical professional terms as custom words avoids the error of splitting medical terms apart during segmentation, so that the semantic features of these terms can be learned fully in later steps.

The text is preprocessed. The preprocessing comprises deleting HTML tags, segmenting the text, and removing stop words. The preset stop words may be punctuation marks, function words, non-medical noise words, and the like. Each word produced by segmentation is compared with the preset stop words, and the stop words in the sentence are filtered out.
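The tag deletion and stop-word comparison described above might look like the following sketch. The stop-word list, the tag pattern and the function names are hypothetical and chosen only for illustration; the source does not publish its stop-word list. The tokens are assumed to have been produced already by the segmentation step.

```python
import html
import re

# Assumption: anything between "<" and ">" is treated as an HTML tag.
TAG = re.compile(r"<[^>]+>")

# Hypothetical stop words: punctuation, function words, non-medical filler.
STOP_WORDS = {"的", "了", "吗", "请问", "，", "。", "？"}


def strip_html(raw):
    """Remove HTML tags, decode entities, and collapse runs of whitespace."""
    without_tags = TAG.sub(" ", raw)
    decoded = html.unescape(without_tags)
    return re.sub(r"\s+", " ", decoded).strip()


def filter_stop_words(tokens, stop_words=STOP_WORDS):
    """Compare each segmented token with the stop-word set and keep the rest."""
    return [t for t in tokens if t.strip() and t not in stop_words]
```

A set is used for the stop words so that each comparison is a constant-time lookup, which matters when the corpus is large.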

For word segmentation, the jieba word segmentation tool is used, and the medical text is segmented with the added custom lexicon so that medical professional terms are not split. This embodiment does not limit the type of word segmentation tool.
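To show why adding medical labels to the lexicon keeps professional terms intact, the sketch below uses forward maximum matching, a simple dictionary-based segmenter. It is a stand-in chosen so the example needs no third-party package; it is not jieba's algorithm, and the lexicon entries and the example label are hypothetical.

```python
def segment(text, lexicon, max_len=8):
    """Forward maximum matching: at each position take the longest lexicon
    entry that matches, falling back to a single character."""
    tokens = []
    i = 0
    while i < len(text):
        longest = min(max_len, len(text) - i)
        for size in range(longest, 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical general-purpose lexicon and one hypothetical medical label.
base_lexicon = {"糖尿病", "视网膜", "病变", "治疗"}
medical_labels = {"糖尿病视网膜病变"}

without_labels = segment("糖尿病视网膜病变治疗", base_lexicon)
with_labels = segment("糖尿病视网膜病变治疗", base_lexicon | medical_labels)
```

With only the general lexicon the disease name is split into three pieces, while the expanded lexicon keeps it as one token. With jieba itself, the comparable step would be to register the labels as user-dictionary words before segmenting.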

In the training process described in this embodiment, each word in the text to be clustered is converted into a word vector by a word embedding method. This embodiment does not limit the type of word embedding method, which may be an artificial neural network or the like; the word2vec model is used in this embodiment. Converting words into word vectors maps them to vectors over the real numbers, which can effectively improve text clustering performance.

Specifically, when the word2vec model is trained, the words appearing in the corpus may be counted, and all the counted words form the vocabulary. The word2vec model is trained and optimized according to its objective function until a preset termination condition is met. Words in the corpus can then be converted into vectors by the trained word2vec model. A word2vec model trained in this way can represent the semantic features of each word in the medical text more accurately.
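The vocabulary-counting step can be sketched as below, together with the generation of (center, context) training pairs. The source does not state whether the skip-gram or CBOW variant of word2vec is used, nor the window size or minimum count, so the skip-gram pairing, window and min_count here are assumptions, and both function names are hypothetical. The actual optimization of the objective function is left to a word2vec implementation.

```python
from collections import Counter


def build_vocabulary(corpus, min_count=1):
    """Count every token in the segmented corpus and assign integer ids,
    most frequent first (ties broken by the token text)."""
    counts = Counter(token for sentence in corpus for token in sentence)
    kept = [w for w, c in counts.items() if c >= min_count]
    kept.sort(key=lambda w: (-counts[w], w))
    vocab = {word: index for index, word in enumerate(kept)}
    return vocab, counts


def skipgram_pairs(sentence, vocab, window=2):
    """Yield (center_id, context_id) pairs within the window, skip-gram style."""
    ids = [vocab[t] for t in sentence if t in vocab]
    pairs = []
    for pos, center in enumerate(ids):
        start = max(0, pos - window)
        stop = min(len(ids), pos + window + 1)
        pairs.extend((center, ids[j]) for j in range(start, stop) if j != pos)
    return pairs
```

Each pair is one training example in which the model learns to predict the context word from the center word, which is how co-occurring medical terms end up with nearby vectors.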

The medical texts to be clustered are preprocessed by segmenting the sentences and filtering stop words; filtering the stop words gives the texts to be clustered better category-distinguishing capability.

In this embodiment, the sentence vector is calculated as the average of the word vectors in the sentence and serves as the feature vector of the sentence.
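A minimal sketch of the averaging step follows. The tiny two-dimensional word vectors are made up for illustration, and the handling of words missing from the trained vocabulary (they are skipped, and a sentence with no known word gets a zero vector) is an assumption, since the source does not address that case.

```python
def sentence_vector(tokens, word_vectors, dim):
    """Average the vectors of the known tokens, component by component."""
    vectors = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vectors:
        # Assumption: a sentence with no known word maps to the zero vector.
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Hypothetical 2-dimensional word vectors, for illustration only.
toy_vectors = {"高血压": [1.0, 0.0], "症状": [0.0, 1.0]}
example = sentence_vector(["高血压", "症状", "未知词"], toy_vectors, dim=2)
```

Averaging gives every sentence a vector of the same dimension regardless of its length, which is what allows sentences to be compared directly in the clustering step.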

In the clustering process described in this embodiment, whether two texts belong to the same class is determined by calculating the cosine similarity of the two texts to be clustered and comparing the resulting distance with the cluster radius. The smaller the cosine similarity of two texts, the larger their distance; the larger the similarity, the smaller the distance. This embodiment does not limit the way the similarity is calculated.
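The relationship between similarity, distance and cluster radius can be illustrated with a small DBSCAN sketch. The source only says that distance grows as cosine similarity shrinks, so the conversion distance = 1 - similarity is an assumption, as are the parameter names eps (the cluster radius) and min_samples and the example values. This is a plain textbook DBSCAN with quadratic neighbor search, not the inventors' implementation.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cosine_distance(u, v):
    # Assumed conversion: higher similarity means smaller distance.
    return 1.0 - cosine_similarity(u, v)


def dbscan(points, eps, min_samples):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    unvisited, noise = None, -1
    labels = [unvisited] * len(points)

    def neighbors(i):
        return [j for j in range(len(points))
                if cosine_distance(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not unvisited:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = noise  # may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == noise:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not unvisited:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_samples:  # core point: keep expanding
                queue.extend(reachable)
    return labels


# Hypothetical sentence vectors: two tight groups and one outlier.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [-1.0, -1.0]]
labels = dbscan(vectors, eps=0.1, min_samples=2)
```

No cluster count is passed in: the number of clusters falls out of the radius and the density threshold, which matches the stated aim of determining the number of categories automatically.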

The second embodiment.

The invention is described below with reference to Fig. 2. The medical text clustering system comprises the following functional modules.

First, medical training texts and medical professional vocabulary are collected by the data collection module. Then the training word vector module obtains word vectors containing semantic features. Finally, the medical texts to be clustered are classified using the trained word vectors.

The third embodiment.

The structure of the clustering device according to the present invention is described below with reference to Fig. 3. The device mainly includes a collector, a memory, a processor, a communication interface, and a communication bus, where the collector, the memory, the processor and the communication interface communicate with one another via the communication bus. The processor may invoke logic instructions in the memory to perform a medical text clustering method, the method comprising: acquiring medical labels and the text of the question-and-answer section of a medical website; updating a word segmentation lexicon with the label text, segmenting the medical text with the updated lexicon, and filtering stop words to construct a training corpus; training a model on the training corpus to obtain trained word vectors; acquiring the medical texts to be clustered, and performing word segmentation and stop-word filtering on them; and clustering the medical texts to be clustered with the clustering model to obtain a clustering result.

The medical text clustering device of the embodiment of the invention comprises a collector, a memory, a processor, a communication interface, a communication bus, and a computer program stored in the memory and executable on the processor, for example a clustering program for medical text. When executing the computer program, the processor implements the steps in the above medical text clustering embodiments. When executing the computer program, the processor also implements the functions of the modules in the system embodiment described above, for example the data collection module, the training word vector module and the clustering module.

The above description is only a preferred example of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A clustering method for medical text, comprising:

step S100: collecting medical labels and texts of a question and answer part of a medical website;

step S200: updating a word segmentation word bank through the label text, segmenting the medical text by using the updated word bank and filtering stop words to construct a training corpus;

step S300: training a model on the training corpus to obtain trained word vectors;

step S400: acquiring the medical texts to be clustered, and performing word segmentation and stop-word filtering on them;

2. The method for clustering medical texts according to claim 1, wherein the question tags of the question and answer pages in the medical professional website are obtained as medical field professional vocabulary and used as custom vocabulary in the process of segmenting sentences.

3. The medical text clustering method according to claim 1, wherein the training corpus is input into a word2vec model for training after being subjected to word segmentation and stop-word filtering, and word vectors suitable for the medical field are stored.

4. The method for clustering medical texts according to claim 1, wherein the medical texts to be clustered are preprocessed, all texts are segmented and their stop words are filtered, and sentence vectors are obtained as the average of the word vectors in the sentences.

5. The method for clustering medical texts according to claim 1, wherein the texts to be clustered are clustered by using a DBSCAN clustering method, and texts of the same type are grouped by calculating the similarity and cluster radius of the texts.

6. A clustering device for medical texts, comprising a collector, a processor, a memory and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the method according to any one of claims 1-5.

7. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of a method for text mining based on unstructured electronic medical records according to any of the claims 1 to 5.

8. A clustering system for medical text, characterized by: the system comprises:

the data collection module is used for acquiring medical text corpora and medical labels, using the medical labels for a user-defined word bank used in word segmentation, and using the medical text corpora for training word vectors in the medical field;

the training word vector module is used for acquiring word vectors in the medical field and converting words obtained after words are segmented in sentences in the data text into word vectors;