CN114064904A - Clustering method, system and device for medical texts - Google Patents

Clustering method, system and device for medical texts

Info

Publication number
CN114064904A
Authority
CN
China
Prior art keywords
medical
text
texts
clustering
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111426905.1A
Other languages
Chinese (zh)
Inventor
金迪
李征
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Henan University
Original Assignee
Henan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2021-11-28
Filing date
2021-11-28
Publication date
2022-02-18
Application filed by Henan University
Priority to CN202111426905.1A
Publication of CN114064904A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2216/00 Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F2216/03 Data mining

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the invention relate to a clustering method, device and system for medical texts in the technical field of text data mining. The method comprises the following steps: acquiring medical labels and the text of the question-and-answer section of a medical website; updating a word segmentation lexicon with the label text, segmenting the medical text with the updated lexicon, and filtering stop words to construct a training corpus; training the training model on the corpus to obtain trained word vectors; acquiring the medical texts to be clustered, and performing word segmentation and stop-word filtering on them; and clustering the medical texts to be clustered with the clustering model to obtain a clustering result. The invention can classify medical texts more accurately and specifically, and can determine the number of cluster categories automatically.

Description

Clustering method, system and device for medical texts
Technical Field
The invention relates to the technical field of text data mining, in particular to a clustering method, system and device for medical texts.
Background
In the era of explosive growth of internet data, text data in every industry keeps increasing. Most medical texts on the network are semi-structured or unstructured, and medical workers facing massive volumes of medical text must process and classify it manually, which is time-consuming and labor-intensive. Against this background, using clustering techniques to simplify and analyze the text data and to classify the texts makes it easier for medical workers to find useful information in massive network information, and can effectively improve their working efficiency.
In the medical field, medical texts can be classified into various categories, including symptoms, treatments, examinations, etiologies, care, prevention, and the like. The categories of the large number of articles on the network are mixed together and of uneven quality, so classifying a large number of texts is of great practical significance. Clearly classified text enables a doctor to quickly judge a patient's condition and prescribe treatment suited to the symptoms, which greatly improves the doctor's working efficiency.
Text clustering technology is widely applied in text mining, information retrieval and related areas, and has important application value in organizing and browsing large-scale text sets and in automatically generating hierarchical classifications of text sets. The goal of text clustering is to partition a data set into different classes or clusters according to a specific criterion (e.g., distance), so that the similarity of data objects in the same cluster is as large as possible, and the difference between data objects in different clusters is as large as possible. Common clustering techniques currently include the K-means algorithm, density-based algorithms (such as DBSCAN), hierarchy-based algorithms (such as BIRCH) and the like. The K-means algorithm requires the number of clusters to be set in advance. Because the types and number of texts are generally large, the number of clusters is difficult to determine accurately, which makes the clustering result inaccurate. In the BIRCH algorithm, the CF-Tree limits the number of CFs (clustering features) in each node, so the clustering result may differ from the real category distribution.
Disclosure of Invention
The invention aims to provide a clustering method, system and device for medical texts, so as to solve the problems set forth in the background art above. Taking medical text as its starting point, the invention aims ultimately to achieve a more specific classification of medical text data.
In order to achieve the above object, the present invention provides a clustering method for medical texts, which mainly comprises the following steps.
Step S100: acquiring medical labels and the text of the question-and-answer section of a medical website.
Step S200: updating the word segmentation lexicon with the label text, segmenting the medical text with the updated lexicon, and filtering stop words to construct a training corpus.
Step S300: training the training model on the corpus to obtain trained word vectors.
Step S400: acquiring the medical texts to be clustered, and performing word segmentation and stop-word filtering on them.
Step S500: clustering the medical texts to be clustered with the clustering model to obtain a clustering result.
Preferably, the question tags of the question-and-answer pages of the professional medical website obtained in step S100 are used as professional vocabulary of the medical field and serve as the custom vocabulary for segmenting sentences.
Preferably, a regular expression is applied to the medical text that has been read in, to extract the Chinese sentences in the medical text.
Preferably, the training corpus is segmented and filtered for stop words and then input into a word2vec model for training, and the resulting word vectors suited to the medical field are stored.
Preferably, the medical texts to be clustered are preprocessed: all texts are segmented and filtered for stop words, and a sentence vector is obtained as the average of the word vectors in the sentence.
Preferably, the texts to be clustered are clustered with the DBSCAN clustering method, and similar texts are grouped by computing the similarity between texts and applying the cluster radius.
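The regular-expression step above is not specified further in the text. The following is a minimal sketch of one possible reading, in which only runs of Chinese characters are kept. The function name `extract_chinese_segments` and the choice of the basic CJK Unified Ideographs range are assumptions made for illustration.

```python
import re

# Assumption: "Chinese" means the basic CJK Unified Ideographs block.
# Digits, Latin letters and punctuation (including full-width marks) are dropped.
CHINESE_RUN = re.compile(r"[\u4e00-\u9fff]+")


def extract_chinese_segments(text):
    """Return the runs of consecutive Chinese characters found in text."""
    return CHINESE_RUN.findall(text)


segments = extract_chinese_segments("患者咳嗽3天, fever 38.5℃，伴有头痛。")
print(segments)
```

Note that this reading also removes digits inside a sentence, which may or may not be desirable for medical text such as dosages or temperatures.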
Corresponding to the method, the invention also provides a clustering system for medical texts, which comprises the following modules.
A data collection module, used for acquiring the medical text corpus and the medical labels, adding the medical labels to the custom lexicon used in word segmentation, and using the medical text corpus to train word vectors for the medical field.
A word vector training module, used for obtaining word vectors for the medical field and converting the words obtained by segmenting the sentences of the data text into word vectors.
A clustering module, used for classifying the texts to be clustered according to the set text similarity threshold and cluster radius, grouping similar texts into one class.
Corresponding to the system, an embodiment of the invention provides a clustering device for medical texts, which comprises a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the clustering method for medical texts when executing the program.
Embodiments of the present invention also provide a computer-readable storage medium storing a computer program that, when executed by a processor, implements the clustering method for medical texts described in any one of the above.
Compared with the prior art, the invention has the following advantages and beneficial effects.
(1) The invention realizes a clustering method, system and device for medical texts. The lexicon is first expanded with specialized words of the medical field, and the preprocessed medical texts are then used to train word2vec. Because the resulting word vectors are trained on a medical text corpus, they can express the semantic information of medical words.
(2) The invention can realize the specific classification of the medical texts.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.
Fig. 1 is a schematic flow chart of the medical text clustering method provided by the present invention.
Fig. 2 is a functional block diagram of the medical text clustering system provided by the present invention.
Fig. 3 is a schematic structural diagram of the electronic device provided by the present invention.
Detailed Description
In order to clearly understand the above objects, features and advantages of the present invention, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention provides a clustering method, system and device for medical texts, which solve the problem that classifying massive medical texts costs medical workers time and labor.
The first embodiment.
The flow of the medical text clustering method of the present invention is described below with reference to fig. 1. The method includes the following steps.
Step S100: acquiring medical labels and the text of the question-and-answer section of a medical website.
Data is collected from the content of a professional healthcare website, including the text of the medical labels and the text of the medical question-and-answer section.
Step S200: updating the word segmentation lexicon with the label text, segmenting the medical text with the updated lexicon, and filtering stop words to construct a training corpus.
The custom lexicon for the medical field is expanded: the medical label texts are stored as specialized medical words and used when the texts are segmented. Using professional medical vocabulary as custom words avoids the error of splitting medical terms apart during segmentation, so that the semantic features of these words can be fully learned later.
The text is preprocessed. The preprocessing comprises deleting HTML tags, segmenting the text, and removing stop words. The preset stop words may be punctuation marks, function words, non-medical noise words and the like. Each word produced by segmentation is compared with the preset stop words, and the stop words in the sentence are filtered out.
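The tag deletion and stop-word comparison described above can be sketched as follows. This is an illustrative sketch only: the regular expression, the helper names `strip_html` and `remove_stop_words`, and the tiny stop-word set are assumptions, and the token list is written by hand in place of real segmenter output.

```python
import re

# Assumption: a simple pattern is enough for the tags found in scraped Q&A pages.
TAG = re.compile(r"<[^>]+>")

# Illustrative stop words: punctuation, a function word, and a non-medical filler.
STOP_WORDS = {"的", "了", "，", "。", "？", "请问"}


def strip_html(raw):
    """Delete HTML tags and keep the text between them."""
    return TAG.sub("", raw)


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Compare each segmented word with the preset stop words and drop matches."""
    return [t for t in tokens if t.strip() and t not in stop_words]


clean = strip_html("<p>请问<b>高血压</b>的症状？</p>")
print(clean)

# Hand-written tokens standing in for the output of a word segmenter.
tokens = ["请问", "高血压", "的", "症状", "？"]
print(remove_stop_words(tokens))
```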
Word segmentation uses the jieba word segmentation tool, and the medical text is segmented with the added custom lexicon so that professional medical vocabulary is not split. The present embodiment does not limit the type of word segmentation tool.
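To illustrate why adding medical terms to the lexicon keeps them intact, the sketch below uses a toy forward maximum matching segmenter. This is not jieba's algorithm; it is a deliberately simple stand-in so that the example runs without third-party packages. With jieba itself, custom words are typically registered through `jieba.load_userdict` or `jieba.add_word` before segmenting. The function name `segment` and the sample lexicons are assumptions.

```python
def segment(text, lexicon):
    """Toy forward maximum matching: take the longest lexicon word at each position."""
    max_len = max((len(w) for w in lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


base_lexicon = {"血压", "症状"}
# Medical label text added as a custom word, as in step S200.
medical_lexicon = base_lexicon | {"高血压"}

print(segment("高血压症状", base_lexicon))
print(segment("高血压症状", medical_lexicon))
```

Without the custom entry the term is split into two pieces; with it, the term survives as one token and can later receive its own word vector.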
Step S300: training the training model on the corpus to obtain trained word vectors.
In the training process described in this embodiment, each word in the text to be clustered is converted into a word vector by a word embedding method. The present embodiment does not limit the type of word embedding method, which may be an artificial neural network or the like; the word2vec model is used in this embodiment. Converting words into word vectors maps them to vectors over the real numbers, which can effectively improve text clustering performance.
Specifically, when the word2vec model is trained, the words appearing in the corpus are counted, and all counted words are used as the vocabulary. The word2vec model is trained and optimized according to its objective function until a preset termination condition is met. Words in the corpus can then be converted into vectors with the trained word2vec model. A word2vec model trained in this way can represent the semantic features of each word in the medical text more accurately.
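The vocabulary-counting step can be sketched as below, together with the generation of (center, context) training pairs. The text does not say whether the skip-gram or CBOW variant of word2vec is used, so skip-gram is an assumption here, as are the names `build_vocab` and `skipgram_pairs` and the `min_count` and `window` parameters. The sketch stops before the actual optimization, which in practice would usually be delegated to an existing word2vec implementation.

```python
from collections import Counter


def build_vocab(sentences, min_count=1):
    """Count the words in the segmented corpus and index those kept as vocabulary."""
    counts = Counter(tok for sent in sentences for tok in sent)
    words = sorted(w for w, c in counts.items() if c >= min_count)
    return {w: i for i, w in enumerate(words)}


def skipgram_pairs(sentences, vocab, window=1):
    """Build (center id, context id) pairs from words within the window."""
    pairs = []
    for sent in sentences:
        ids = [vocab[t] for t in sent if t in vocab]
        for pos, center in enumerate(ids):
            lo = max(0, pos - window)
            hi = min(len(ids), pos + window + 1)
            pairs.extend((center, ids[j]) for j in range(lo, hi) if j != pos)
    return pairs


# Segmented, stop-word-filtered sentences standing in for the training corpus.
corpus = [["高血压", "症状", "头痛"], ["高血压", "治疗"]]
vocab = build_vocab(corpus)
pairs = skipgram_pairs(corpus, vocab)
print(len(vocab), len(pairs))
```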
Step S400: acquiring the medical texts to be clustered, and performing word segmentation and stop-word filtering on them.
The medical texts to be clustered are preprocessed: sentences are segmented and stop words are filtered. Filtering stop words gives the texts to be clustered better category-distinguishing power.
Step S500: clustering the medical texts to be clustered with the clustering model to obtain a clustering result.
In this embodiment, the sentence vector is computed as the average of the word vectors in the sentence and serves as the feature vector of the sentence.
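The averaging step can be sketched as follows. The two-dimensional toy vectors and the function name `sentence_vector` are assumptions for illustration, and skipping words that have no trained vector is one possible policy that the text does not specify.

```python
def sentence_vector(tokens, word_vectors):
    """Average the vectors of the tokens that have a trained word vector."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return None  # no token of this sentence is in the vocabulary
    dim = len(known[0])
    return [sum(vec[k] for vec in known) / len(known) for k in range(dim)]


# Toy two-dimensional word vectors standing in for trained word2vec output.
toy_vectors = {"高血压": [1.0, 0.0], "症状": [0.0, 1.0]}

print(sentence_vector(["高血压", "症状", "未知词"], toy_vectors))
```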
In the clustering process described in this embodiment, whether two texts to be clustered belong to the same class is determined from their cosine similarity and the cluster radius. The smaller the cosine similarity of two texts, the larger their distance; the larger the similarity, the smaller the distance. The present embodiment does not limit the way similarity is calculated.
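The relationship between similarity, distance and the cluster radius can be sketched with a small pure-Python DBSCAN. The sketch assumes that distance is defined as one minus cosine similarity and that the cluster radius corresponds to the usual DBSCAN `eps` parameter; the `min_samples` parameter and the toy sentence vectors are also assumptions, since the text does not give concrete values. Library implementations such as scikit-learn's `DBSCAN` accept a cosine metric and would normally be preferred for real data.

```python
import math


def cosine_distance(a, b):
    """Distance defined as 1 - cosine similarity (an assumption of this sketch)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def dbscan(vectors, eps, min_samples):
    """Label each vector with a cluster id (0, 1, ...) or -1 for noise."""
    n = len(vectors)
    neighbors = [
        [j for j in range(n) if cosine_distance(vectors[i], vectors[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # j is a core point: keep expanding
    return labels


# Toy sentence vectors: two tight groups and one outlier.
sentence_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [-1.0, -1.0]]
labels = dbscan(sentence_vectors, eps=0.05, min_samples=2)
print(labels)
```

The number of clusters is not passed in; it follows from the radius and the density threshold, which is the property the invention relies on to determine the number of categories automatically.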
The second embodiment.
The medical text clustering system of the invention is described below with reference to fig. 2. It comprises the following functional modules.
A data collection module, used for acquiring the medical text corpus and the medical labels, adding the medical labels to the custom lexicon used in word segmentation, and using the medical text corpus to train word vectors for the medical field.
A word vector training module, used for obtaining word vectors for the medical field and converting the words obtained by segmenting the sentences of the data text into word vectors.
A clustering module, used for classifying the texts to be clustered according to the set text similarity threshold and cluster radius, grouping similar texts into one class.
First, medical training texts and professional medical vocabulary are collected by the data collection module. Next, the word vector training module produces word vectors containing semantic features. Finally, the medical texts to be clustered are classified using the trained word vectors.
The third embodiment.
The structure of the clustering device according to the present invention is described below with reference to fig. 3. The device mainly includes a collector, a memory, a processor, a communication interface, and a communication bus, wherein the collector, the memory, the processor and the communication interface communicate with one another via the communication bus. The processor may invoke logic instructions in the memory to perform the medical text clustering method, the method comprising: acquiring medical labels and the text of the question-and-answer section of a medical website; updating the word segmentation lexicon with the label text, segmenting the medical text with the updated lexicon, and filtering stop words to construct a training corpus; training the training model on the corpus to obtain trained word vectors; acquiring the medical texts to be clustered, and performing word segmentation and stop-word filtering on them; and clustering the medical texts to be clustered with the clustering model to obtain a clustering result.
The medical text clustering device of the embodiment of the invention comprises a collector, a memory, a processor, a communication interface, a communication bus, and a computer program stored in the memory and executable on the processor, such as a clustering program for medical text. When executing the computer program, the processor implements the steps of the medical text clustering method embodiment above. When executing the computer program, the processor also implements the functions of the modules in the system embodiment described above, for example the data collection module, the word vector training module and the clustering module.
The above description is only a preferred example of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (8)

1. A clustering method for medical text, comprising:
step S100: collecting medical labels and the text of the question-and-answer section of a medical website;
step S200: updating a word segmentation lexicon with the label text, segmenting the medical text with the updated lexicon, and filtering stop words to construct a training corpus;
step S300: training the training model on the corpus to obtain trained word vectors;
step S400: acquiring the medical texts to be clustered, and performing word segmentation and stop-word filtering on them;
step S500: clustering the medical texts to be clustered with the clustering model to obtain a clustering result.
2. The method for clustering medical texts according to claim 1, wherein the question tags of the question-and-answer pages of the professional medical website are obtained as professional vocabulary of the medical field and used as custom vocabulary in the process of segmenting sentences.
3. The method for clustering medical texts according to claim 1, wherein the training corpus is input into a word2vec model for training after word segmentation and stop-word filtering, and word vectors suited to the medical field are stored.
4. The method for clustering medical texts according to claim 1, wherein the medical texts to be clustered are preprocessed, all texts are segmented and filtered for stop words, and sentence vectors are obtained as the average of the word vectors in each sentence.
5. The method for clustering medical texts according to claim 1, wherein the texts to be clustered are clustered with the DBSCAN clustering method, and texts of the same type are grouped by computing the similarity between texts and applying the cluster radius.
6. Clustering device for medical texts, comprising an acquirer, a processor, a memory and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the method according to any one of claims 1-5.
7. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of a method for text mining based on unstructured electronic medical records according to any of the claims 1 to 5.
8. A clustering system for medical text, characterized by: the system comprises:
the data collection module is used for acquiring medical text corpora and medical labels, using the medical labels for a user-defined word bank used in word segmentation, and using the medical text corpora for training word vectors in the medical field;
the training word vector module is used for acquiring word vectors in the medical field and converting words obtained after words are segmented in sentences in the data text into word vectors;
and the clustering module is used for classifying the texts to be clustered, classifying the texts to be classified according to the set text similarity threshold and the cluster radius, and classifying the similar texts into one class.
CN202111426905.1A 2021-11-28 2021-11-28 Clustering method, system and device for medical texts Pending CN114064904A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111426905.1A CN114064904A (en) 2021-11-28 2021-11-28 Clustering method, system and device for medical texts

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111426905.1A CN114064904A (en) 2021-11-28 2021-11-28 Clustering method, system and device for medical texts

Publications (1)

Publication Number Publication Date
CN114064904A true CN114064904A (en) 2022-02-18

Family

ID=80276790

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111426905.1A Pending CN114064904A (en) 2021-11-28 2021-11-28 Clustering method, system and device for medical texts

Country Status (1)

Country Link
CN (1) CN114064904A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114662477A (en) * 2022-03-10 2022-06-24 平安科技(深圳)有限公司 Stop word list generating method and device based on traditional Chinese medicine conversation and storage medium
CN114662477B (en) * 2022-03-10 2024-02-02 平安科技(深圳)有限公司 Method, device and storage medium for generating deactivated word list based on Chinese medicine dialogue

Similar Documents

Publication Publication Date Title
CN111914558B (en) Course knowledge relation extraction method and system based on sentence bag attention remote supervision
CN111274806B (en) Method and device for recognizing word segmentation and part of speech and method and device for analyzing electronic medical record
CN112001177A (en) Electronic medical record named entity identification method and system integrating deep learning and rules
CN113011533A (en) Text classification method and device, computer equipment and storage medium
CN107330011A (en) The recognition methods of the name entity of many strategy fusions and device
CN109960800A (en) Weakly supervised text classification method and device based on active learning
CN105701084A (en) Characteristic extraction method of text classification on the basis of mutual information
CN112395395B (en) Text keyword extraction method, device, equipment and storage medium
CN108875809A (en) The biomedical entity relationship classification method of joint attention mechanism and neural network
CN109657058A (en) A kind of abstracting method of notice information
CN109492105B (en) Text emotion classification method based on multi-feature ensemble learning
CN115809345A (en) Knowledge graph-based multi-source data difference traceability retrieval method
CN110633365A (en) Word vector-based hierarchical multi-label text classification method and system
Rizvi et al. Optical character recognition system for Nastalique Urdu-like script languages using supervised learning
CN110188359B (en) Text entity extraction method
CN110675962A (en) Traditional Chinese medicine pharmacological action identification method and system based on machine learning and text rules
CN113962293A (en) LightGBM classification and representation learning-based name disambiguation method and system
Popchev et al. Text Mining in the Domain of Plant Genetic Resources
CN112328792A (en) Optimization method for recognizing credit events based on DBSCAN clustering algorithm
CN116775897A (en) Knowledge graph construction and query method and device, electronic equipment and storage medium
CN115146062A (en) Intelligent event analysis method and system fusing expert recommendation and text clustering
CN114840685A (en) Emergency plan knowledge graph construction method
CN114064904A (en) Clustering method, system and device for medical texts
CN111597330A (en) Intelligent expert recommendation-oriented user image drawing method based on support vector machine
CN116795980A (en) Short text classification method integrating fine-grained element knowledge

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination