CN111899890A

CN111899890A - Medical data similarity detection system and method based on bit string Hash

Info

Publication number: CN111899890A
Application number: CN202010810385.3A
Authority: CN
Inventors: 周铁华; 王玲; 李建; 刘文强
Original assignee: Northeast Dianli University
Current assignee: Northeast Electric Power University
Priority date: 2020-08-13
Filing date: 2020-08-13
Publication date: 2020-11-06
Anticipated expiration: 2040-08-13
Also published as: CN111899890B

Abstract

The invention relates to the field of text similarity detection, and in particular to a medical data similarity detection system and method based on bit string hashing. The system comprises a data storage module, a data preprocessing module, a text feature extraction module, a hash processing module, a text similarity calculation module and a similarity visualization module. Medical text data are stored in the data storage module. The data preprocessing module reduces the dimensionality of the text and removes private information. The feature extraction module forms a document-feature matrix from the documents, their features and the feature weights. The hash processing module hashes the text. The similarity calculation module maps each document to a digital fingerprint, calculates Hamming distances and divides the documents into similarity groups. The similarity visualization module displays the medical texts visually. By hashing the text, the invention can find, within massive medical text data, the set of texts most similar to a target text, which improves the efficiency of medical question retrieval.

Description

Medical data similarity detection system and method based on bit string Hash

Technical Field

The invention relates to the field of text similarity detection, in particular to a medical data similarity detection system and method based on bit string Hash.

Background

With the rapid development of online medical services, text data in the medical field accumulate day by day. The latent value of these data can effectively reduce the communication cost between doctors and patients, help medical communities operate in a fine-grained manner and provide more targeted services. Medical text is characterized by indistinct categories, a pronounced lack of structure, high discriminative weight of low-frequency words, and widespread missing and inconsistent information. How to accurately calculate the similarity between medical texts and retrieve relevant medical information quickly and accurately is a problem that urgently needs to be solved. To address it, a medical text similarity detection system and method based on hash processing are provided herein.

Text similarity calculation is mainly applied to medical text retrieval. Combined with medical domain knowledge, texts similar to a specified text are found among thousands of texts and their categories are determined; similar texts are then searched within the same category, so that highly similar questions can be predicted more accurately and the matching precision of retrieval across consultation texts for different diseases is improved.

Text similarity retrieval methods fall mainly into two types. The first is the traditional method based on keyword matching, which considers similarity in terms of the parts two texts share and uses the co-occurrence and degree of repetition of character strings as the similarity measure. Such methods can only compare texts at the literal level and ignore their semantic information, so their results are unsatisfactory. The second type converts text features into vectors using a spatial similarity model. It works well for text similarity calculation in general domains, but its precision is low in vertical, specialized domains. Existing similarity calculation methods for Chinese medical text generally suffer from missing semantic information, calculate the similarity of Chinese medical texts inaccurately, and cannot accurately reflect the similarity between medical texts. How to accurately calculate the similarity between medical texts is therefore a problem that researchers in this field urgently need to solve.

Disclosure of Invention

The invention aims to overcome the shortcomings of the above text similarity calculation methods by providing a medical data similarity detection system and method based on bit string hashing, which combines statistics-based and hash-based methods, improves text discrimination quality and the quality of text similarity detection under similar medical topics, supports the analysis of similar medical question retrieval, and helps discover associations among diseases.

To achieve this purpose, the invention adopts the following technical solution:

A medical data similarity detection system based on bit string hashing comprises a data storage module, a data preprocessing module, a feature extraction module, a hash processing module, a similarity calculation module and a similarity visualization module. The data storage module stores medical text data. The data preprocessing module reduces the dimensionality of the text based on a medical field lexicon and removes private information from the text data. The feature extraction module extracts features based on smoothing and forms a document-feature weight matrix from the documents, the document features and their weights. The hash processing module defines initial weights, updates parameters through continuous iteration, dynamically adjusts the threshold, and hashes the text according to the final threshold. The similarity calculation module maps the documents to digital fingerprints, calculates the Hamming distances between documents and divides the documents into similarity groups according to similarity thresholds. The similarity visualization module displays the subset of documents most similar to a target document and the arrangement of similar texts within each document group. The data storage module is communicatively connected to the data preprocessing module; the feature extraction module is communicatively connected to the data preprocessing module and the hash processing module; the similarity calculation module is communicatively connected to the hash processing module and the similarity visualization module; and the similarity visualization module is communicatively connected to the similarity calculation module.

According to the medical data similarity detection system based on the bit string hash, data stored by the data storage module comprise a user ID and a user self-describing medical text.

In the medical data similarity detection system based on bit string hashing, the data preprocessing module removes patient privacy data irrelevant to the retrieval content by means of a privacy protection device, performs word segmentation on the user's text passages using a lexicon for the vertical domain of disease symptoms to obtain a word segmentation set, and removes stop words; the stop words are words that carry no practical meaning and are useless for model training, including modal particles, conjunctions, prepositions and adverbs.

In the medical data similarity detection system based on bit string hashing, the feature extraction module constructs the text-feature weight matrix as follows: a feature calculation device computes statistics over the data set and the STF-IDF value of each feature word in it to obtain a word-frequency document for the current data set; the words are sorted in descending order of STF-IDF value, and the first n feature words are selected to form the feature word set of the feature text set.

In the medical data similarity detection system based on bit string hash, the document similarity calculation module divides the document similarity group according to the following method: searching text semantic information corresponding to the text data information from a preset database, forming characteristic data vector information through a Hash function, judging the similarity between the reference text data information and the comparison text data information through a similarity function according to the similar data information corresponding to the reference text data information and the similarity corresponding to the text data information, and further dividing a document similarity group.

The medical data similarity detection method based on the bit string Hash comprises the following steps:

step 1: for an online medical community, storing the acquired medical text data in a database with reasonably designed fields and attributes to obtain a text set T = {T₁, T₂, T₃, …, T_i}, where T_i represents a piece of medical text data;

step 2: preprocessing the data: desensitizing the data to remove unnecessary, low-value information from the corpus, including patient name, address and family circumstances, and performing word segmentation on the data set; during word segmentation, a stop word dictionary d₁ and a medical field dictionary d₂ must be set, so that useless interfering words are eliminated while proprietary medical terms are kept intact, yielding the set of all feature words W = {w₁, w₂, w₃, …, w_i}, where w_i represents a word;
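The preprocessing in step 2 can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the English toy vocabulary, the regular expressions used for desensitization, and the forward maximum matching segmenter standing in for a real Chinese word segmenter. Only the roles of the two dictionaries (d₁ for stop words, d₂ for medical terms) come from the text.

```python
import re

# Hypothetical stand-ins for the stop word dictionary d1 and medical dictionary d2.
STOP_WORDS = {"the", "a", "and", "of", "very", "have", "i"}
MEDICAL_TERMS = {"chest pain", "high blood pressure"}

# Hypothetical desensitization rules; a real system would need far more robust ones.
PRIVACY_PATTERNS = [
    re.compile(r"my name is \w+", re.IGNORECASE),
    re.compile(r"i live (?:at|in) [\w ]+?(?=[,.]|$)", re.IGNORECASE),
]


def desensitize(text):
    """Blank out spans that match any privacy pattern."""
    for pattern in PRIVACY_PATTERNS:
        text = pattern.sub(" ", text)
    return text


def segment(text, max_len=3):
    """Forward maximum matching: prefer the longest phrase found in MEDICAL_TERMS."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    words = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            # A single token is always accepted; longer phrases must be dictionary terms.
            if n == 1 or phrase in MEDICAL_TERMS:
                words.append(phrase)
                i += n
                break
    return [w for w in words if w not in STOP_WORDS]


def preprocess(text):
    """Desensitize, segment, and drop stop words."""
    return segment(desensitize(text))
```

Matching the longest dictionary phrase first is what keeps a multi-word medical term such as "high blood pressure" as a single feature word instead of three unrelated tokens.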

step 3: text feature extraction:

(1) calculating the weights of the features in the corpus using a weight calculation method based on smoothing: the STF-IDF value of each word is calculated with the formula STF-IDF = (1 + log(tf)) × ln(p/(p′ + 1) + 1), where tf is the feature word frequency, p is the total number of documents and p′ is the number of documents containing the feature; the minimum support is set to 2, giving an initial word weight set Z = {(w₁, v₁), (w₂, v₂), (w₃, v₃), …, (w_i, v_i)}, where (w_i, v_i) is a feature word and its corresponding STF-IDF value; the feature words are sorted in descending order of the initial weight v, and the first n feature words are selected to form the feature word set of the feature text set;

(2) constructing an initial document-feature matrix M, with the feature words arranged in the matrix in descending order of v;
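As a rough illustration of the STF-IDF weighting in step 3, the sketch below evaluates the stated formula and selects the top n feature words. Several points are assumptions not fixed by the text: `log` is taken as base 10, tf is counted over the whole corpus, and "minimum support 2" is read as requiring a word to occur at least twice. The function names are hypothetical.

```python
import math
from collections import Counter


def stf_idf(tf, p, p_prime):
    """STF-IDF = (1 + log(tf)) * ln(p / (p' + 1) + 1); log base 10 is an assumption."""
    return (1 + math.log10(tf)) * math.log(p / (p_prime + 1) + 1)


def top_features(docs, n, min_support=2):
    """Return the n highest-weighted (word, v) pairs from tokenized documents."""
    p = len(docs)
    tf = Counter(w for doc in docs for w in doc)        # corpus-wide term frequency (assumed)
    df = Counter(w for doc in docs for w in set(doc))   # number of documents containing w
    weights = {
        w: stf_idf(tf[w], p, df[w])
        for w in tf
        if tf[w] >= min_support                         # assumed reading of "minimum support"
    }
    # Descending by weight; ties are broken alphabetically for a stable order.
    return sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:n]
```

With the three toy documents `[["fever", "cough"], ["fever", "headache"], ["fever", "cough", "cough"]]`, "fever" appears in every document and is therefore down-weighted relative to "cough", while "headache" is dropped for falling below the minimum support.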

step 4: document matrix hashing:

(1) dynamically updating the weights in the document-feature matrix M according to the formula V′ = (V_i + V_j)/2 (where i and j are the coordinates of the start and end positions of each column of the matrix, respectively) until all (w_i, v_i) have been processed, obtaining a new document-feature matrix M₁;

(2) defining the mean of the updated weights in matrix M₁ as v_p;

(3) first using v_p as the threshold to hash-code the data in the matrix: values greater than or equal to v_p are assigned 1 and all others 0, i.e. V′_ij = 1 when V′_ij ≥ v_p and V′_ij = 0 when V′_ij < v_p, where V′_ij is the weight in row i and column j of the hash matrix;

(4) dynamically adjusting the threshold (the adjustment formula is not reproduced in this text), where v(w, T_i) represents the weight of feature w in document T_i and a further term represents the co-occurrence coefficient of features between documents; step (3) is repeated, and for each threshold k similar document pairs are selected, each consisting of a document T_i and the document T_j most similar to it;

(5) calculating the average cosine similarity between the similar document pairs under each threshold setting, selecting the threshold that maximizes the average cosine similarity as the final threshold, and updating the threshold to v′; the formula for the average document cosine similarity is not reproduced in this text, and in it k is the number of similar document pairs;

(6) constructing a similarity matrix;
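The thresholding and threshold selection in step 4 can be sketched as follows. This is only one possible reading: the weight update of sub-step (1) and the rule that generates new thresholds in sub-step (4) are not reproduced here, so candidate thresholds are simply passed in. Pairing each document with its nearest neighbour by Hamming distance and computing cosine similarity on the weight rows are likewise assumptions, and all function names are hypothetical.

```python
import math


def mean_weight(matrix):
    """Mean of all weights, used as the initial threshold v_p."""
    values = [v for row in matrix for v in row]
    return sum(values) / len(values)


def binarize(matrix, threshold):
    """Hash-code each row: 1 where the weight >= threshold, else 0."""
    return [[1 if v >= threshold else 0 for v in row] for row in matrix]


def hamming(a, b):
    """Number of positions at which two equal-length bit lists differ."""
    return sum(x != y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity of two weight vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest_pairs(codes):
    """Pair every document with its closest other document (needs >= 2 documents)."""
    pairs = []
    for i, code_i in enumerate(codes):
        others = [j for j in range(len(codes)) if j != i]
        nearest = min(others, key=lambda j: hamming(code_i, codes[j]))
        pairs.append((i, nearest))
    return pairs


def select_threshold(matrix, candidates):
    """Pick the candidate threshold whose similar pairs have the highest mean cosine."""
    best_threshold, best_score = None, -1.0
    for threshold in candidates:
        pairs = nearest_pairs(binarize(matrix, threshold))
        score = sum(cosine(matrix[i], matrix[j]) for i, j in pairs) / len(pairs)
        if score > best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score
```

A threshold that is too low turns every code into all ones, so the "most similar" pairs become arbitrary and their average cosine similarity drops; this is the effect the selection criterion penalizes.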

step 5: dividing similarity document groups:

(1) calculating the Hamming distance between the text hash codes, using the hash values updated with the final threshold, and setting similar-text division thresholds at equal step sizes of Hamming distance;

(2) dividing the documents into different similarity groups according to the similarity step size, where documents falling in the same similarity interval are placed in the same group: group = {group1, group2, group3, …}, where group_i is a divided similarity group;
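The grouping in step 5 can be sketched as bucketing documents by their Hamming distance to a target document. Measuring distance relative to a single target and using half-open intervals [0, step), [step, 2·step), … are assumptions; the default step of 3 follows the value given in the embodiment of this document. The function names are hypothetical.

```python
def hamming(a, b):
    """Number of differing positions between two equal-length bit lists."""
    return sum(x != y for x, y in zip(a, b))


def group_by_distance(target, codes, step=3):
    """Bucket document ids by Hamming distance to the target, in intervals of `step`."""
    groups = {}
    for doc_id, code in codes.items():
        index = hamming(target, code) // step   # 0 -> [0, step), 1 -> [step, 2*step), ...
        groups.setdefault(index, []).append(doc_id)
    return dict(sorted(groups.items()))
```

Group 0 then holds the documents closest to the target, and higher group indices hold progressively less similar documents.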

step 6: displaying, in the similarity visualization module, the similarity between documents and the category group to which each belongs.

In summary, the medical data similarity detection system and method based on bit string hash according to the present invention have the following beneficial effects:

(1) the method is particularly beneficial to analyzing similar medical problem retrieval and finding the association among diseases;

(2) the method comprehensively considers the applicability of disease characteristics, retains the similarity of original texts, and reduces the ambiguity of analysis results, thereby providing guarantee for later-stage similarity detection;

(3) the method integrates a statistical-based method and a Hash-based method, mutual promotion is realized, and the purposes of improving text distinguishing quality and improving text similarity detection quality under similar medical subjects are achieved;

(4) the method is scientific and reasonable and has strong applicability.

The foregoing is a summary of the present application and thus contains, by necessity, simplifications, generalizations and omissions of detail; those skilled in the art will appreciate that the summary is illustrative of the application and is not intended to be in any way limiting. Other aspects, features and advantages of the devices and/or methods and/or other subject matter described in this specification will become apparent as the description proceeds. The summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.

Drawings

The above-described and other features of the present application will become more fully apparent from the following description and appended claims, taken in conjunction with the accompanying drawings. It is to be understood that these drawings are solely for purposes of illustrating several embodiments of the present application and are not intended as a definition of the limits of the application, for which reference should be made to the appended drawings, wherein the disclosure is to be interpreted in a more complete and detailed manner.

Fig. 1 is a system module analysis diagram of the medical data similarity detection system and method based on bit string hashing according to the present invention.

Fig. 2 is a system flow diagram of the medical data similarity detection system and method based on bit string hashing according to the present invention.

Detailed Description

In the following detailed description, reference is made to the accompanying drawings, which form a part hereof. In the drawings, the same/similar reference numerals generally refer to the same/similar parts unless otherwise specified in the specification. The illustrative embodiments described in the detailed description, drawings, and claims should not be considered limiting of the application. Other embodiments of, and changes to, the present application may be made without departing from the spirit or scope of the subject matter presented in the present application. It should be readily understood that the aspects of the present application, as generally described in the specification and illustrated in the figures herein, could be arranged, substituted, combined, designed in a wide variety of different configurations, and that all such modifications are expressly contemplated and made part of this application.

The invention provides a medical data similarity detection system and method based on bit string hashing for accurately calculating the similarity between similar medical texts. It helps to quickly and accurately retrieve questions of common concern to doctors and patients, reduces the communication cost between doctors and patients, and helps discover relationships between diseases.

Referring to fig. 1, the present invention includes: the device comprises a data storage module, a data preprocessing module, a feature extraction module, a hashing processing module, a similarity calculation module and a similarity visualization module. The data storage module is used for storing medical text data of network medical communities and websites and constructing a database, the data stored by the data storage module comprises user IDs and user self-describing medical texts, and in the embodiment, the stored data comprises patient IDs, consulting corpora of the on-line medical communities of patients and the like; the data preprocessing module is used for performing dimensionality reduction processing on the text based on a word bank in the medical field, removing privacy information in the text data, performing word segmentation processing on the corpus and removing stop words; the feature extraction module extracts features based on smoothing processing to form a document-feature weight matrix by documents, document features and weights thereof, the hashing processing module is used for updating parameters by defining initial weights and continuously iterating and dynamically adjusting thresholds, and hashing texts according to final thresholds, the similarity calculation module is used for mapping the documents into digital fingerprints, calculating hamming distances among the documents and dividing document similarity groups according to the similarity thresholds, and the similarity visualization module is used for displaying document subsets with high similarity to target documents and similar text arrangement in the document groups. 
The data storage module is in communication connection with the data preprocessing module, the feature extraction module is in communication connection with the data preprocessing module and the Hash processing module respectively, the similarity calculation module is in communication connection with the Hash processing module and the similarity visualization module respectively, and the similarity visualization module is in communication connection with the similarity calculation module.

The data preprocessing module removes patient privacy data irrelevant to the retrieval content by means of a privacy protection device, performs word segmentation on the user's text passages using a lexicon for the vertical domain of disease symptoms to obtain a word segmentation set, and removes stop words. The stop words are words that carry no practical meaning and are useless for model training, including modal particles, conjunctions, prepositions, adverbs and the like.

The feature extraction module constructs the text-feature weight matrix as follows: a feature calculation device computes statistics over the data set and the STF-IDF value of each feature word in it to obtain a word-frequency document for the current data set; the words are sorted in descending order of STF-IDF value, and the first n feature words are selected to form the feature word set of the feature text set.

The method for dividing the document similarity group by the document similarity calculation module comprises the following steps: searching text semantic information corresponding to the text data information from a preset database, forming characteristic data vector information through a Hash function, judging the similarity between the reference text data information and the comparison text data information through a similarity function according to the similar data information corresponding to the reference text data information and the similarity corresponding to the text data information, and further dividing a document similarity group. By adopting the scheme, when the similarity of the two text data information needs to be judged, the text is processed and coded firstly, so that the text data and the text data are associated to form a matrix data, namely the text information is converted into the matrix data similar to the picture, the similarity of the two texts is accurately obtained according to the hamming distance of the calculated document matrix data, and the judgment accuracy is higher.
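To make the idea of a "digital fingerprint" concrete, the sketch below packs a document's bit string into a single integer and obtains the Hamming distance with XOR and a bit count. The normalized similarity score 1 − d/length is an illustrative assumption rather than something stated in the text, and the function names are hypothetical.

```python
def to_fingerprint(bits):
    """Pack a sequence of 0/1 values into one integer, first bit most significant."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value


def hamming_distance(fp_a, fp_b):
    """Count differing bits: XOR leaves a 1 wherever the fingerprints disagree."""
    return bin(fp_a ^ fp_b).count("1")


def similarity(fp_a, fp_b, length):
    """Assumed normalization: 1.0 for identical codes, 0.0 for fully opposite codes."""
    return 1 - hamming_distance(fp_a, fp_b) / length
```

Storing each document as one integer keeps comparison cheap, since a pairwise distance is a single XOR followed by a bit count regardless of how many feature words the code covers.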

Referring to fig. 2, a medical data similarity detection method based on bit string hash includes the following steps:

Step 1: for an online medical community, storing the acquired medical text data in a database with reasonably designed fields and attributes to obtain a text set T = {T₁, T₂, T₃, …, T_i}, where T_i represents a piece of medical text data.

Step 2: preprocessing the data: desensitizing the data to remove unnecessary, low-value information from the corpus, including patient name, address and family circumstances, and performing word segmentation on the data set; during word segmentation, a stop word dictionary d₁ and a medical field dictionary d₂ must be set, so that useless interfering words are eliminated while proprietary medical terms are kept intact, yielding the set of all feature words W = {w₁, w₂, w₃, …, w_i}, where w_i represents a word. Preferably, the dictionaries used in this step are the Harbin Institute of Technology (HIT) stop word list and the Sogou disease symptom lexicon.

Step 3: text feature extraction:

(1) calculating the weights of the features in the corpus using a weight calculation method based on smoothing: the STF-IDF value of each word is calculated with the formula STF-IDF = (1 + log(tf)) × ln(p/(p′ + 1) + 1), where tf is the feature word frequency, p is the total number of documents and p′ is the number of documents containing the feature; the minimum support is set to 2, giving an initial word weight set Z = {(w₁, v₁), (w₂, v₂), (w₃, v₃), …, (w_i, v_i)}, where (w_i, v_i) is a feature word and its corresponding STF-IDF value; the feature words are sorted in descending order of the initial weight v, and the first n feature words are selected to form the feature word set of the feature text set;

(2) constructing an initial document-feature matrix M, with the feature words arranged in the matrix in descending order of v.
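One way to picture the initial document-feature matrix M is the sketch below, which lays out one row per document and one column per selected feature word, with columns in descending order of v. How the per-document entry is defined is not spelled out in the text; filling a cell with the word's weight v when the word occurs in the document, and 0 otherwise, is an assumption made purely for illustration, as is the function name.

```python
def build_matrix(docs, features):
    """Build M from tokenized documents and (word, v) pairs.

    Returns (columns, matrix): columns are feature words in descending order of v,
    and matrix[r][c] is v for columns[c] if document r contains that word, else 0.0.
    """
    ordered = sorted(features, key=lambda kv: (-kv[1], kv[0]))
    columns = [word for word, _ in ordered]
    matrix = []
    for doc in docs:
        present = set(doc)
        matrix.append([v if word in present else 0.0 for word, v in ordered])
    return columns, matrix
```

For example, with features `[("fever", 0.8), ("cough", 1.0)]` the column order becomes `["cough", "fever"]`, so a document containing both words yields the row `[1.0, 0.8]`.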

Step 4: document matrix hashing:

(4) dynamically adjusting the threshold,

where k is the number of similar document pairs;

(6) constructing a similarity matrix.

Step 5: dividing similarity document groups:

(1) calculating the Hamming distance between the text hash codes, using the hash values updated with the final threshold, and setting similar-text division thresholds at equal step sizes of Hamming distance; in this embodiment, a Hamming distance of 3 is taken as the division step size;

(2) dividing the documents into different similarity groups according to the similarity step size, where documents falling in the same similarity interval are placed in the same group: group = {group1, group2, group3, …}, where group_i is a divided similarity group.

Step 6: displaying, in the similarity visualization module, the similarity between documents and the category group to which each belongs, including the similarity group of the target document, the texts within that group that are most similar to it, and the like.

In conclusion, the medical data similarity detection system and method based on the bit string hash are particularly beneficial to analysis of similar medical problem retrieval, discovery of correlation among diseases, scientific and reasonable and strong in applicability. The method comprehensively considers the applicability of disease characteristics, retains the similarity of original texts, and reduces the ambiguity of analysis results, thereby providing guarantee for later-stage similarity detection. In addition, the method integrates a statistical-based method and a Hash-based method, mutual promotion is realized, and the purposes of improving text distinguishing quality and improving text similarity detection quality under similar medical subjects are achieved.

The foregoing has been a detailed description of various embodiments of the apparatus and/or methods of the present application via block diagrams, flowcharts, and/or examples of implementations. When the block diagrams, flowcharts, and/or embodiments include one or more functions and/or operations, it will be understood by those within the art that each function and/or operation within the block diagrams, flowcharts, and/or embodiments can be implemented, individually and/or collectively, by a wide range of hardware, software, firmware, or virtually any combination thereof.

Those skilled in the art will recognize that it is common within the art to describe devices and/or methods in the manner described in this specification and then to perform engineering practices to integrate the described devices and/or methods into a data processing system. That is, at least a portion of the devices and/or methods described herein may be integrated into a data processing system through a reasonable amount of experimentation. With respect to substantially any plural and/or singular terms used in this specification, those skilled in the art may interpret the plural as singular and/or the singular as plural as appropriate from a context and/or application. Various singular/plural combinations may be explicitly stated in this specification for the sake of clarity.

Various aspects and embodiments of the present application are disclosed herein, and other aspects and embodiments of the present application will be apparent to those skilled in the art. The various aspects and embodiments disclosed in this application are presented by way of example only, and not by way of limitation, and the true scope and spirit of the application is to be determined by the following claims.

Claims

1. A medical data similarity detection system based on bit string hashing, characterized by comprising: a data storage module, a data preprocessing module, a feature extraction module, a hash processing module, a similarity calculation module and a similarity visualization module;

the data storage module is used for storing medical text data, the data preprocessing module is used for performing dimensionality reduction processing on a text based on a medical field word stock and removing privacy information in the text data, the feature extraction module is used for extracting features based on smoothing processing and forming a document-feature weight matrix by using documents, document features and weights of the documents and the document features, the hashing processing module is used for updating parameters by defining initial weights and continuously iterating and dynamically adjusting thresholds and hashing the text according to a final threshold, the similarity calculation module is used for mapping the documents into digital fingerprints, calculating hamming distances among the documents and dividing document similarity groups according to the similarity thresholds, and the similarity visualization module is used for displaying a document subset with high similarity to a target document and similar text arrangement in each document group;

the data storage module is in communication connection with the data preprocessing module, the feature extraction module is in communication connection with the data preprocessing module and the Hash processing module respectively, the similarity calculation module is in communication connection with the Hash processing module and the similarity visualization module respectively, and the similarity visualization module is in communication connection with the similarity calculation module.

2. The system for detecting similarity of medical data based on hash of bit strings as claimed in claim 1, wherein the data stored in the data storage module includes user ID and self-describing medical text of user.

3. The medical data similarity detection system based on bit string hash as claimed in claim 1, wherein the data preprocessing module removes the private data of the patient irrelevant to the retrieval content through a privacy protection device, performs word segmentation processing on the user speech by using a word bank based on the vertical domain of disease symptoms to obtain a word segmentation set, and removes stop words;

the stop words are words that carry no practical meaning and are useless for model training, including modal particles, conjunctions, prepositions and adverbs.

4. The medical data similarity detection system based on bit string hashing according to claim 1, wherein the feature extraction module constructs the text-feature weight matrix as follows: a feature calculation device computes statistics over the data set and the STF-IDF value of each feature word in it to obtain a word-frequency document for the current data set; the words are sorted in descending order of STF-IDF value, and the first n feature words are selected to form the feature word set of the feature text set.

5. The medical data similarity detection system based on bit string hashing according to claim 1, wherein the document similarity calculation module divides the document similarity groups as follows: searching a preset database for the text semantic information corresponding to the text data, forming feature data vectors through a hash function, judging the similarity between the reference text data and the comparison text data through a similarity function according to the similar data corresponding to the reference text data and the similarity corresponding to the text data, and then dividing the document similarity groups.

6. The medical data similarity detection method based on bit string Hash is characterized by comprising the following steps of:

step 3: text feature extraction:

(1) calculating the weights of the features in the corpus using a weight calculation method based on smoothing: the STF-IDF value of each word is calculated with the formula STF-IDF = (1 + log(tf)) × ln(p/(p′ + 1) + 1), where tf is the feature word frequency, p is the total number of documents and p′ is the number of documents containing the feature; the minimum support is set to 2, giving an initial word weight set Z = {(w₁, v₁), (w₂, v₂), (w₃, v₃), …, (w_i, v_i)}, where (w_i, v_i) is a feature word and its corresponding STF-IDF value; the feature words are sorted in descending order of the initial weight v, and the first n feature words are selected to form the feature word set of the feature text set;

step 4: document matrix hashing:

(4) dynamically adjusting the threshold,

where k is the number of similar document pairs;

(6) constructing a similarity matrix;

step 5: dividing similarity document groups: