CN111899890A - Medical data similarity detection system and method based on bit string Hash - Google Patents
- Publication number
- CN111899890A (CN202010810385.3A)
- Authority
- CN
- China
- Prior art keywords
- similarity
- data
- text
- document
- module
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/70—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/194—Calculation of difference between files
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Public Health (AREA)
- Medical Informatics (AREA)
- Computational Linguistics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Biomedical Technology (AREA)
- Databases & Information Systems (AREA)
- Evolutionary Computation (AREA)
- Pathology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Epidemiology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Primary Health Care (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to the field of text similarity detection, and in particular to a medical data similarity detection system and method based on bit string hash. The system comprises a data storage module, a data preprocessing module, a text feature extraction module, a hash processing module, a text similarity calculation module and a similarity visualization module. The data storage module stores the medical text data; the data preprocessing module performs dimensionality reduction on the text and removes private information; the feature extraction module forms a document-feature matrix from the document features and their weights; the hash processing module hashes the text; the similarity calculation module maps each document to a digital fingerprint, calculates Hamming distances and divides the documents into similarity groups; and the similarity visualization module displays the medical texts visually. By hashing the text, the invention can find, within massive medical text data, a set of texts highly similar to a target text, which improves the efficiency of medical question retrieval.
Description
Technical Field
The invention relates to the field of text similarity detection, in particular to a medical data similarity detection system and method based on bit string Hash.
Background
With the rapid development of online medical services, text data in the medical field accumulates day by day. The latent value of this data can effectively reduce the cost of communication between doctors and patients, help medical communities refine their operations, and support more targeted services. Medical text is characterized by indistinct categories, a strongly unstructured form, high discriminative weight for low-frequency words, pervasive missing information, and inconsistent information. How to calculate the similarity between medical texts accurately, and how to retrieve relevant medical information quickly and accurately, are problems that urgently need solving. To address them, a medical text similarity detection system and method based on hash processing are proposed herein.
Text similarity calculation is mainly applied to medical text retrieval. Combined with medical domain knowledge, texts similar to a specified text are found among thousands of texts and their categories are determined; similar texts are then searched within the same category, so that highly similar questions can be predicted more accurately and the retrieval matching precision among different disease consultation texts is improved.
There are two main types of text similarity retrieval method. The first is the traditional method based on keyword matching, which considers similarity in terms of the parts two texts share and takes the co-occurrence and degree of repetition of character strings as the similarity measure. Such methods compare texts only at the literal level and ignore their semantic information, so the results are not ideal. The second type uses a spatial similarity model to convert text features into vectors. This works well for text similarity calculation in general domains, but its precision is low in vertical, specialized domains. Existing Chinese medical text similarity calculation methods generally lack semantic information, calculate the similarity of Chinese medical texts inaccurately, and cannot accurately reflect the similarity between medical texts. How to calculate the similarity between medical texts accurately is therefore a problem that researchers in the field urgently need to solve.
Disclosure of Invention
The invention aims to overcome the above deficiencies of existing text similarity calculation methods by providing a medical data similarity detection system and method based on bit string hash. It combines statistics-based and hash-based methods to improve the quality of text discrimination and of text similarity detection under similar medical topics, to support the retrieval of similar medical questions, and to discover associations among diseases.
To achieve this purpose, the invention adopts the following technical scheme:
A medical data similarity detection system based on bit string hash comprises a data storage module, a data preprocessing module, a feature extraction module, a hash processing module, a similarity calculation module and a similarity visualization module. The data storage module stores medical text data. The data preprocessing module performs dimensionality reduction on the text based on a medical-domain lexicon and removes private information from the text data. The feature extraction module extracts features based on smoothing and forms a document-feature weight matrix from the documents, the document features and their weights. The hash processing module defines initial weights, updates the parameters by repeated iteration while dynamically adjusting the threshold, and hashes the text according to the final threshold. The similarity calculation module maps the documents to digital fingerprints, calculates the Hamming distances between documents and divides the documents into similarity groups according to similarity thresholds. The similarity visualization module displays the subset of documents highly similar to a target document and the ordering of similar texts within each document group. The data storage module is communicatively connected to the data preprocessing module; the feature extraction module is communicatively connected to the data preprocessing module and the hash processing module; the similarity calculation module is communicatively connected to the hash processing module and the similarity visualization module; and the similarity visualization module is communicatively connected to the similarity calculation module.
In the medical data similarity detection system based on bit string hash, the data stored by the data storage module comprises a user ID and the user's self-described medical text.
In the medical data similarity detection system based on bit string hash, the data preprocessing module removes, through a privacy protection device, patient privacy data irrelevant to the retrieval content, segments the user's passages into words using a lexicon for the vertical domain of disease symptoms to obtain a word segmentation set, and removes stop words. Stop words are words that carry no practical meaning and are useless for model training, including modal particles, conjunctions, prepositions and adverbs.
In the medical data similarity detection system based on bit string hash, the feature extraction module constructs the text-feature weight matrix as follows: a feature calculation device computes the STF-IDF value of each feature word in the data set to obtain a word-frequency document for the current data set, the words are sorted in descending order of STF-IDF value, and the first n feature words are selected to form the feature word set of the feature text set.
In the medical data similarity detection system based on bit string hash, the document similarity calculation module divides the document similarity groups as follows: the text semantic information corresponding to the text data is looked up in a preset database and converted into feature data vectors by a hash function; a similarity function then judges the similarity between the reference text data and the comparison text data, based on the similar data corresponding to the reference text data and the similarity corresponding to the text data; and the document similarity groups are divided accordingly.
The medical data similarity detection method based on the bit string Hash comprises the following steps:
Step 1: for an online medical community, storing the acquired medical text data in a database with suitably designed fields and attributes to obtain a text set $T = \{T_1, T_2, T_3, \ldots, T_i\}$, where $T_i$ represents a piece of medical text data;
Step 2: preprocessing the data: desensitizing the data to remove unnecessary, low-value information from the corpus, including patient name, address and family circumstances, and performing word segmentation on the data set, wherein a stop-word dictionary $d_1$ and a medical-domain dictionary $d_2$ must be set for the segmentation so that useless interfering words are eliminated while proprietary medical nouns are kept intact, yielding the set of all feature words $W = \{w_1, w_2, w_3, \ldots, w_i\}$, where $w_i$ represents a word;
Step 3: text feature extraction:
(1) calculating the weights of the features in the corpus with a weight calculation method based on the idea of smoothing: the STF-IDF value of each word is calculated with the formula $\text{STF-IDF} = (1 + \log(tf)) \cdot \ln\left(\frac{p}{p' + 1} + 1\right)$, where $tf$ is the feature word frequency, $p$ is the total number of documents and $p'$ is the number of documents containing the feature; the minimum support is set to 2, giving an initial word weight set $Z = \{(w_1, v_1), (w_2, v_2), (w_3, v_3), \ldots, (w_i, v_i)\}$, where $(w_i, v_i)$ is a feature word and its corresponding STF-IDF value; the feature words are sorted in descending order of initial weight $v$, and the first n feature words are selected to form the feature word set of the feature text set;
(2) constructing an initial document-feature matrix $M$, with the feature words arranged in the matrix in descending order of $v$;
Step 4: document matrix hashing:
(1) dynamically updating the weights in the document-feature matrix $M$ according to the formula $V' = (V_i + V_j)/2$, where $i$ and $j$ are the coordinates of the start and end positions of each column in the matrix, until all $(w_i, v_i)$ have been processed, giving a new document-feature matrix $M_1$;
(2) defining the mean of the updated weights in matrix $M_1$ as $v_p$;
(3) first hash-coding the data in the matrix with $v_p$ as the threshold: values greater than or equal to $v_p$ are assigned 1 and all others 0, that is, $V'_{ij} = 1$ when $V'_{ij} \ge v_p$ and $V'_{ij} = 0$ when $V'_{ij} < v_p$, where $v_{ij}$ is the weight in row $i$, column $j$ of the hash matrix;
(4) dynamically adjusting the threshold [formula missing], where $v(w, T_i)$ denotes the weight of feature $w$ in the document and [symbol missing] denotes the co-occurrence coefficient of features between documents; step (3) above is repeated, and under each threshold k similar-document pairs are selected, each consisting of a document $T_i$ and the document $T_j$ most similar to it;
(5) calculating the average cosine similarity between the similar document pairs under each threshold setting, selecting the threshold at which the average cosine similarity is greatest as the final threshold, and updating the threshold to $v'$; the formula for the average document cosine similarity is: similarity = [formula missing], where k is the number of similar document pairs;
(6) constructing a similarity matrix;
Step 5: dividing similarity document groups:
(1) calculating the Hamming distance between the text hash codes from the hash values updated with the final threshold, and setting equally spaced division thresholds for similar texts according to the Hamming distance;
(2) dividing the documents into different similarity groups according to the similarity step size, documents falling in the same similarity interval being placed in the same similarity group: $group = \{group_1, group_2, group_3, \ldots\}$, where $group_i$ is one of the resulting similarity groups;
Step 6: displaying, in the similarity visualization module, the similarity between the documents and the category groups to which they belong.
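Step 4 (4) and (5) select the final threshold by comparing average cosine similarity across candidate thresholds. The source text does not reproduce the threshold-adjustment formula, so the following is only a minimal sketch under stated assumptions: candidate thresholds are supplied by the caller, each document is paired with its nearest neighbour under the Hamming distance of the binarized rows, cosine similarity is taken on the original weight rows, and k equals the number of documents. All function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two weight vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def binarize(row, threshold):
    """Step 4 (3): 1 where the weight reaches the threshold, else 0."""
    return [1 if x >= threshold else 0 for x in row]


def hamming(a, b):
    """Number of positions where two equal-length bit lists differ."""
    return sum(x != y for x, y in zip(a, b))


def average_pair_similarity(weights, threshold):
    """Average cosine similarity over (document, nearest document) pairs.

    Assumption: "most similar" means smallest Hamming distance between codes.
    """
    codes = [binarize(row, threshold) for row in weights]
    total = 0.0
    for i, code in enumerate(codes):
        nearest = min(
            (j for j in range(len(codes)) if j != i),
            key=lambda j: hamming(code, codes[j]),
        )
        total += cosine(weights[i], weights[nearest])
    return total / len(codes)


def select_threshold(weights, candidates):
    """Step 4 (5): keep the candidate with the highest average similarity."""
    return max(candidates, key=lambda t: average_pair_similarity(weights, t))


weights = [
    [0.9, 0.1, 0.8],
    [0.8, 0.2, 0.9],
    [0.1, 0.9, 0.2],
    [0.2, 0.8, 0.1],
]
best = select_threshold(weights, [0.5, 2.0])
```

With the toy matrix above, a threshold of 2.0 maps every row to the same all-zero code and so pairs unrelated documents, while 0.5 separates the two visible clusters; the selection step prefers the latter.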
In summary, the medical data similarity detection system and method based on bit string hash according to the present invention have the following beneficial effects:
(1) the method is particularly useful for retrieving similar medical questions and discovering associations among diseases;
(2) the method takes full account of the applicability of disease features, preserves the similarity of the original texts and reduces ambiguity in the analysis results, thereby providing a sound basis for subsequent similarity detection;
(3) the method combines a statistics-based method with a hash-based method so that the two reinforce each other, improving the quality of text discrimination and of text similarity detection under similar medical topics;
(4) the method is scientifically sound and widely applicable.
The foregoing is a summary of the present application and thus contains, by necessity, simplifications, generalizations and omissions of detail; those skilled in the art will appreciate that the summary is illustrative of the application and is not intended to be in any way limiting. Other aspects, features and advantages of the devices and/or methods and/or other subject matter described in this specification will become apparent as the description proceeds. The summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
Drawings
The above-described and other features of the present application will become more fully apparent from the following description and appended claims, taken in conjunction with the accompanying drawings. It is to be understood that these drawings are solely for purposes of illustrating several embodiments of the present application and are not intended as a definition of the limits of the application, for which reference should be made to the appended drawings, wherein the disclosure is to be interpreted in a more complete and detailed manner.
Fig. 1 is a system module analysis diagram of the medical data similarity detection system and method based on bit string hashing according to the present invention.
Fig. 2 is a system flow diagram of the medical data similarity detection system and method based on bit string hashing according to the present invention.
Detailed Description
In the following detailed description, reference is made to the accompanying drawings, which form a part hereof. In the drawings, the same/similar reference numerals generally refer to the same/similar parts unless otherwise specified in the specification. The illustrative embodiments described in the detailed description, drawings, and claims should not be considered limiting of the application. Other embodiments of, and changes to, the present application may be made without departing from the spirit or scope of the subject matter presented in the present application. It should be readily understood that the aspects of the present application, as generally described in the specification and illustrated in the figures herein, could be arranged, substituted, combined, designed in a wide variety of different configurations, and that all such modifications are expressly contemplated and made part of this application.
The invention provides a medical data similarity detection system and method based on bit string hash for accurately calculating the similarity between similar medical texts. It helps to retrieve quickly and accurately the questions of common concern to doctors and patients, reduces the cost of doctor-patient communication, and reveals relations between diseases.
Referring to fig. 1, the invention comprises a data storage module, a data preprocessing module, a feature extraction module, a hash processing module, a similarity calculation module and a similarity visualization module. The data storage module stores medical text data from online medical communities and websites and builds a database; the stored data comprises user IDs and the users' self-described medical texts, and in this embodiment comprises patient IDs, the patients' consultation corpora from online medical communities, and the like. The data preprocessing module performs dimensionality reduction on the text based on a medical-domain lexicon, removes private information from the text data, segments the corpus into words and removes stop words. The feature extraction module extracts features based on smoothing and forms a document-feature weight matrix from the documents, the document features and their weights. The hash processing module defines initial weights, updates the parameters by repeated iteration while dynamically adjusting the threshold, and hashes the text according to the final threshold. The similarity calculation module maps the documents to digital fingerprints, calculates the Hamming distances between documents and divides the documents into similarity groups according to similarity thresholds. The similarity visualization module displays the subset of documents highly similar to the target document and the ordering of similar texts within each document group.
The data storage module is in communication connection with the data preprocessing module, the feature extraction module is in communication connection with the data preprocessing module and the Hash processing module respectively, the similarity calculation module is in communication connection with the Hash processing module and the similarity visualization module respectively, and the similarity visualization module is in communication connection with the similarity calculation module.
The data preprocessing module removes, through a privacy protection device, patient privacy data irrelevant to the retrieval content, segments the user's passages into words using a lexicon for the vertical domain of disease symptoms to obtain a word segmentation set, and removes stop words. Stop words are words that carry no practical meaning and are useless for model training, including modal particles, conjunctions, prepositions, adverbs and the like.
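The preprocessing described above (privacy removal, lexicon-aware segmentation, stop-word removal) can be sketched as follows. This is an illustration only: the privacy pattern, the stop-word list and the medical lexicon are hypothetical stand-ins, and whitespace splitting on English text stands in for the dictionary-based Chinese word segmentation the patent assumes.

```python
import re

# Hypothetical resources, for illustration only.
PRIVATE_PATTERNS = [re.compile(r"\b\d{11}\b")]   # e.g. an 11-digit phone number
STOP_WORDS = {"the", "and", "of", "has"}          # stands in for dictionary d1
MEDICAL_TERMS = {"high blood pressure"}           # stands in for dictionary d2


def desensitize(text):
    """Blank out every substring that matches a privacy pattern."""
    for pattern in PRIVATE_PATTERNS:
        text = pattern.sub(" ", text)
    return text


def segment(text, terms):
    """Split on whitespace, keeping known multi-word medical terms whole."""
    words = text.lower().split()
    max_len = max((len(t.split()) for t in terms), default=1)
    tokens, i = [], 0
    while i < len(words):
        # Try the longest candidate first; a single word always matches.
        for n in range(min(max_len, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if n == 1 or candidate in terms:
                tokens.append(candidate)
                i += n
                break
    return tokens


def preprocess(text):
    """Desensitize, segment, then drop stop words."""
    tokens = segment(desensitize(text), MEDICAL_TERMS)
    return [t for t in tokens if t not in STOP_WORDS]


tokens = preprocess(
    "Call 13800000000 the patient has high blood pressure and headache"
)
```

Matching the longest lexicon entry first is one simple way to keep proprietary medical nouns intact, as step 2 of the method requires; a production system would use a proper segmenter with the domain dictionary loaded.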
The feature extraction module constructs the text-feature weight matrix as follows: a feature calculation device computes the STF-IDF value of each feature word in the data set to obtain a word-frequency document for the current data set, the words are sorted in descending order of STF-IDF value, and the first n feature words are selected to form the feature word set of the feature text set.
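The ranking described above can be sketched with the STF-IDF formula given in step 3 of the method, STF-IDF = (1 + log(tf)) ln(p/(p' + 1) + 1). Several points are assumptions rather than statements of the patent: log is taken as base 10, tf is counted over the whole corpus, and "minimum support 2" is read as requiring a word to occur at least twice. The function names are hypothetical.

```python
import math
from collections import Counter


def stf_idf(tf, p, p_prime):
    """(1 + log(tf)) * ln(p / (p' + 1) + 1), with log assumed to be base 10."""
    return (1 + math.log10(tf)) * math.log(p / (p_prime + 1) + 1)


def top_features(docs, n, min_support=2):
    """Return the n highest-weighted (word, STF-IDF) pairs, in descending order."""
    p = len(docs)                                          # total number of documents
    tf = Counter(w for doc in docs for w in doc)           # corpus-wide frequency (assumption)
    df = Counter(w for doc in docs for w in set(doc))      # p': documents containing the word
    weights = {
        w: stf_idf(tf[w], p, df[w])
        for w in tf
        if tf[w] >= min_support                            # assumed reading of "minimum support"
    }
    # Sort by weight descending; ties broken alphabetically for determinism.
    return sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


docs = [
    ["fever", "cough", "fever"],
    ["cough", "headache"],
    ["fever", "rash"],
]
features = top_features(docs, n=2)
```

The `+ 1` terms inside the logarithm are the smoothing: they keep the weight finite and positive even when a word appears in every document.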
The document similarity calculation module divides the document similarity groups as follows: the text semantic information corresponding to the text data is looked up in a preset database and converted into feature data vectors by a hash function; a similarity function then judges the similarity between the reference text data and the comparison text data, based on the similar data corresponding to the reference text data and the similarity corresponding to the text data; and the document similarity groups are divided accordingly. With this scheme, when the similarity of two pieces of text data needs to be judged, the texts are first processed and encoded so that the text data are combined into matrix data, that is, the text information is converted into picture-like matrix data; the similarity of the two texts is then obtained from the Hamming distance calculated on the document matrix data, which makes the judgment more accurate.
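The fingerprint comparison mentioned above can be illustrated with a short sketch. It assumes each document's hash-matrix row is a list of 0/1 values and packs it into a single integer; the function names are hypothetical and the packing is one convenient representation, not necessarily the one used in the patent.

```python
def to_fingerprint(bits):
    """Pack a 0/1 row of the hash matrix into one integer fingerprint."""
    value = 0
    for bit in bits:
        value = (value << 1) | (bit & 1)
    return value


def hamming_distance(a, b):
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


doc_a = to_fingerprint([1, 0, 1, 1])
doc_b = to_fingerprint([1, 1, 0, 1])
distance = hamming_distance(doc_a, doc_b)
```

XOR leaves a 1 exactly where the two fingerprints disagree, so counting the set bits gives the Hamming distance; a smaller distance means more similar documents.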
Referring to fig. 2, a medical data similarity detection method based on bit string hash includes the following steps:
Step 1: for an online medical community, storing the acquired medical text data in a database with suitably designed fields and attributes to obtain a text set $T = \{T_1, T_2, T_3, \ldots, T_i\}$, where $T_i$ represents a piece of medical text data.
Step 2: preprocessing the data: desensitizing the data to remove unnecessary, low-value information from the corpus, including patient name, address and family circumstances, and performing word segmentation on the data set, wherein a stop-word dictionary $d_1$ and a medical-domain dictionary $d_2$ must be set for the segmentation so that useless interfering words are eliminated while proprietary medical nouns are kept intact, yielding the set of all feature words $W = \{w_1, w_2, w_3, \ldots, w_i\}$, where $w_i$ represents a word. Preferably, the dictionaries used in this step are the Harbin Institute of Technology (HIT) stop-word list and the Sogou disease-symptom lexicon.
Step 3: text feature extraction:
(1) calculating the weights of the features in the corpus with a weight calculation method based on the idea of smoothing: the STF-IDF value of each word is calculated with the formula $\text{STF-IDF} = (1 + \log(tf)) \cdot \ln\left(\frac{p}{p' + 1} + 1\right)$, where $tf$ is the feature word frequency, $p$ is the total number of documents and $p'$ is the number of documents containing the feature; the minimum support is set to 2, giving an initial word weight set $Z = \{(w_1, v_1), (w_2, v_2), (w_3, v_3), \ldots, (w_i, v_i)\}$, where $(w_i, v_i)$ is a feature word and its corresponding STF-IDF value; the feature words are sorted in descending order of initial weight $v$, and the first n feature words are selected to form the feature word set of the feature text set;
(2) constructing an initial document-feature matrix $M$, with the feature words arranged in the matrix in descending order of $v$.
Step 4: document matrix hashing:
(1) dynamically updating the weights in the document-feature matrix $M$ according to the formula $V' = (V_i + V_j)/2$, where $i$ and $j$ are the coordinates of the start and end positions of each column in the matrix, until all $(w_i, v_i)$ have been processed, giving a new document-feature matrix $M_1$;
(2) defining the mean of the updated weights in matrix $M_1$ as $v_p$;
(3) first hash-coding the data in the matrix with $v_p$ as the threshold: values greater than or equal to $v_p$ are assigned 1 and all others 0, that is, $V'_{ij} = 1$ when $V'_{ij} \ge v_p$ and $V'_{ij} = 0$ when $V'_{ij} < v_p$, where $v_{ij}$ is the weight in row $i$, column $j$ of the hash matrix;
(4) dynamically adjusting the threshold [formula missing], where $v(w, T_i)$ denotes the weight of feature $w$ in the document and [symbol missing] denotes the co-occurrence coefficient of features between documents; step (3) above is repeated, and under each threshold k similar-document pairs are selected, each consisting of a document $T_i$ and the document $T_j$ most similar to it;
(5) calculating the average cosine similarity between the similar document pairs under each threshold setting, selecting the threshold at which the average cosine similarity is greatest as the final threshold, and updating the threshold to $v'$; the formula for the average document cosine similarity is: similarity = [formula missing], where k is the number of similar document pairs;
(6) constructing a similarity matrix.
Step 5: dividing similarity document groups:
(1) calculating the Hamming distance between the text hash codes from the hash values updated with the final threshold, and setting equally spaced division thresholds for similar texts according to the Hamming distance; in this embodiment a Hamming distance of 3 is taken as the division step;
(2) dividing the documents into different similarity groups according to the similarity step size, documents falling in the same similarity interval being placed in the same similarity group: $group = \{group_1, group_2, group_3, \ldots\}$, where $group_i$ is one of the resulting similarity groups.
Step 6: displaying, in the similarity visualization module, the similarity between the documents and the category groups to which they belong, including the similarity group of the target document, the texts within that group that are most similar to it, and the like.
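Steps 4 (2)-(3) and step 5 can be sketched together as follows. This is a simplified illustration under stated assumptions: the weight update of step 4 (1) and the dynamic threshold adjustment of step 4 (4)-(5) are skipped, so the mean weight is used directly as the threshold, and grouping is done relative to a single target document with the step of 3 used in the embodiment. The function names are hypothetical.

```python
def mean_threshold(matrix):
    """Step 4 (2): mean of all weights in the document-feature matrix."""
    values = [x for row in matrix for x in row]
    return sum(values) / len(values)


def hash_rows(matrix, threshold):
    """Step 4 (3): 1 where a weight reaches the threshold, 0 otherwise."""
    return [[1 if x >= threshold else 0 for x in row] for row in matrix]


def hamming(a, b):
    """Number of positions where two equal-length bit lists differ."""
    return sum(x != y for x, y in zip(a, b))


def group_by_distance(codes, target_index, step=3):
    """Step 5: bucket documents by Hamming distance to the target.

    Bucket 0 holds distances 0..step-1, bucket 1 holds step..2*step-1, etc.
    Grouping relative to one target document is an assumption of this sketch.
    """
    groups = {}
    target = codes[target_index]
    for idx, code in enumerate(codes):
        if idx == target_index:
            continue
        bucket = hamming(target, code) // step
        groups.setdefault(bucket, []).append(idx)
    return groups


matrix = [
    [0.9, 0.8, 0.7, 0.1, 0.2, 0.1],   # target document
    [0.9, 0.8, 0.1, 0.1, 0.2, 0.1],   # close to the target
    [0.1, 0.2, 0.1, 0.9, 0.8, 0.7],   # far from the target
]
codes = hash_rows(matrix, mean_threshold(matrix))
groups = group_by_distance(codes, target_index=0)
```

In this toy example the second document differs from the target in one bit and lands in the nearest group, while the third differs in all six bits and lands two groups away.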
In conclusion, the medical data similarity detection system and method based on bit string hash are particularly useful for retrieving similar medical questions and discovering associations among diseases, and are scientifically sound and widely applicable. The method takes full account of the applicability of disease features, preserves the similarity of the original texts and reduces ambiguity in the analysis results, thereby providing a sound basis for subsequent similarity detection. In addition, the method combines a statistics-based method with a hash-based method so that the two reinforce each other, improving the quality of text discrimination and of text similarity detection under similar medical topics.
The foregoing has been a detailed description of various embodiments of the apparatus and/or methods of the present application via block diagrams, flowcharts, and/or examples of implementations. When the block diagrams, flowcharts, and/or embodiments include one or more functions and/or operations, it will be understood by those within the art that each function and/or operation within the block diagrams, flowcharts, and/or embodiments can be implemented, individually and/or collectively, by a wide range of hardware, software, firmware, or virtually any combination thereof.
Those skilled in the art will recognize that it is common within the art to describe devices and/or methods in the manner described in this specification and then to perform engineering practices to integrate the described devices and/or methods into a data processing system. That is, at least a portion of the devices and/or methods described herein may be integrated into a data processing system through a reasonable amount of experimentation. With respect to substantially any plural and/or singular terms used in this specification, those skilled in the art may interpret the plural as singular and/or the singular as plural as appropriate from a context and/or application. Various singular/plural combinations may be explicitly stated in this specification for the sake of clarity.
Various aspects and embodiments of the present application are disclosed herein, and other aspects and embodiments of the present application will be apparent to those skilled in the art. The various aspects and embodiments disclosed in this application are presented by way of example only, and not by way of limitation, and the true scope and spirit of the application is to be determined by the following claims.
Claims (6)
1. Medical data similarity detection system based on bit string hash, characterized by comprising: the system comprises a data storage module, a data preprocessing module, a feature extraction module, a Hash processing module, a similarity calculation module and a similarity visualization module;
the data storage module is used for storing medical text data, the data preprocessing module is used for performing dimensionality reduction processing on a text based on a medical field word stock and removing privacy information in the text data, the feature extraction module is used for extracting features based on smoothing processing and forming a document-feature weight matrix by using documents, document features and weights of the documents and the document features, the hashing processing module is used for updating parameters by defining initial weights and continuously iterating and dynamically adjusting thresholds and hashing the text according to a final threshold, the similarity calculation module is used for mapping the documents into digital fingerprints, calculating hamming distances among the documents and dividing document similarity groups according to the similarity thresholds, and the similarity visualization module is used for displaying a document subset with high similarity to a target document and similar text arrangement in each document group;
the data storage module is in communication connection with the data preprocessing module, the feature extraction module is in communication connection with the data preprocessing module and the Hash processing module respectively, the similarity calculation module is in communication connection with the Hash processing module and the similarity visualization module respectively, and the similarity visualization module is in communication connection with the similarity calculation module.
2. The system for detecting similarity of medical data based on hash of bit strings as claimed in claim 1, wherein the data stored in the data storage module includes user ID and self-describing medical text of user.
3. The medical data similarity detection system based on bit string hash as claimed in claim 1, wherein the data preprocessing module removes the private data of the patient irrelevant to the retrieval content through a privacy protection device, performs word segmentation processing on the user speech by using a word bank based on the vertical domain of disease symptoms to obtain a word segmentation set, and removes stop words;
the stop words refer to words which have no practical significance and are useless for model training, and comprise tone auxiliary words, conjunctions, prepositions and adverbs.
4. The medical data similarity detection system based on bit string hashing according to claim 1, wherein the feature extraction module constructs the text-feature weight matrix as follows: using a feature calculation device to compute the STF-IDF value of each feature word in the data set, obtaining a word-frequency document of the current data set, arranging the words in descending order of STF-IDF value, and selecting the first n feature words to form the feature word set of the feature text set.
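The ranking described here can be illustrated with a short Python sketch that applies the formula STF-IDF = (1 + log(tf)) · ln(p/(p' + 1) + 1) stated in the method claim. Several points are assumptions rather than part of the claim: tf is taken as the corpus-wide count of the word, log is taken as the natural logarithm, the minimum-support filter is omitted, and the token lists and function names are hypothetical.

```python
import math
from collections import Counter

def stf_idf_scores(docs):
    """Return {word: STF-IDF value} for a corpus given as lists of tokens."""
    p = len(docs)                                           # total number of documents
    tf = Counter(tok for doc in docs for tok in doc)        # assumed: corpus-wide frequency
    df = Counter(tok for doc in docs for tok in set(doc))   # p': documents containing the word
    return {
        word: (1 + math.log(tf[word])) * math.log(p / (df[word] + 1) + 1)
        for word in tf
    }

def top_n_features(docs, n):
    """Sort words by descending STF-IDF value (ties broken alphabetically), keep the first n."""
    scores = stf_idf_scores(docs)
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n]

# Hypothetical segmented documents.
docs = [
    ["cough", "fever", "cough"],
    ["fever", "headache"],
    ["cough", "rash"],
]
features = top_n_features(docs, 2)
```

Adding 1 to both the document count p' and the ratio keeps the logarithm positive and defined, which is the smoothing effect the formula is built around.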
5. The medical data similarity detection system based on bit string hashing according to claim 1, wherein the similarity calculation module divides the document similarity groups as follows: searching a preset database for the text semantic information corresponding to the text data, forming feature data vectors through a hash function, judging the similarity between the reference text data and the comparison text data through a similarity function according to the similar data corresponding to the reference text data and the similarity corresponding to the text data, and then dividing the document similarity groups.
6. The medical data similarity detection method based on bit string Hash is characterized by comprising the following steps of:
step 1: aiming at an online medical community, storing acquired medical text data according to a database with reasonable field and attribute design to obtain a text set T = { T = }1,T2,T3,…,TiWhere T isiRepresenting a piece of medical text data;
step 2: performing data preprocessing on the data: desensitizing the data to remove unnecessary information with low information value in the corpus, including patient name, address, and family condition, and performing word segmentation on the data set, wherein a stop word dictionary d is required to be set during the word segmentation process1And a medical field dictionary d2Further eliminating useless interference words and simultaneously keeping the integrity of the proprietary medical nouns to obtain a set W = { W } of all characteristic words1,w2,w3,…,wiIn which wiA representative word;
step 3: text feature extraction:
(1) calculating the weights of the features in the corpus using a weight calculation method based on smoothing, computing the STF-IDF value of each word with the formula STF-IDF = (1 + log(tf)) · ln(p/(p' + 1) + 1), where tf is the feature word frequency, p is the total number of documents, and p' is the number of documents containing the feature, and setting the minimum support to 2 to obtain an initial word weight set Z = {(w_1, v_1), (w_2, v_2), (w_3, v_3), …, (w_i, v_i)}, where (w_i, v_i) is a feature word and its corresponding STF-IDF value; arranging the feature words in descending order of the initial weight v and selecting the first n feature words to form the feature word set of the feature text set;
(2) constructing an initial document-feature matrix M, with the feature words arranged in the matrix in descending order of v;
step 4: document matrix hashing:
(1) dynamically updating the weights in the document-feature matrix M according to the formula V' = (V_i + V_j)/2 (where i and j are the coordinates of the start and end positions of each column in the matrix, respectively) until all (w_i, v_i) have been processed, obtaining a new document-feature matrix M_1;
(2) defining the mean of the updated weights in matrix M_1 as v_p;
(3) first hash-coding the data in the matrix with v_p as the threshold, assigning 1 to values greater than or equal to v_p and 0 otherwise, i.e. V'_ij = 1 when V'_ij ≥ v_p and V'_ij = 0 when V'_ij < v_p, where v_ij is the weight in row i, column j of the hash matrix;
(4) dynamically adjusting the threshold, where v(w, T_i) represents the weight of feature w in document T_i and a co-occurrence coefficient represents the co-occurrence of features among documents; repeating step (3) above, and for each threshold selecting k similar document pairs, each consisting of a document T_i and the document T_j most similar to it;
(5) calculating the average cosine similarity between the similar document pairs under each threshold setting, selecting the threshold that maximizes the average cosine similarity as the final threshold, and updating the threshold to v', wherein the average document cosine similarity is: similarity = (1/k) Σ cos(T_i, T_j), summed over the k similar document pairs (T_i, T_j), where k is the number of similar document pairs;
(6) constructing a similarity matrix;
step 5: dividing similarity document groups:
(1) calculating the Hamming distance between the text hash codes according to the hash values updated with the final threshold, and setting equally spaced division thresholds for similar texts over the Hamming distance;
(2) dividing the documents into different similarity groups according to the similarity step size, wherein documents falling within the same similarity interval are placed in the same similarity group, and group = {group_1, group_2, group_3, …}, where group_i is a divided similarity group;
step 6: displaying, in the similarity visualization module, the similarity between the documents and the category group to which each document belongs.
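The mean-threshold coding in step 4, sub-steps (2) and (3), can be illustrated with a small Python sketch. It assumes the document-feature matrix is a list of rows (one per document) and emits one bit string per document; the function names and the sample matrix are hypothetical, and the column update of sub-step (1) and the dynamic threshold adjustment of sub-step (4) are not modelled.

```python
def mean_threshold(matrix):
    """v_p: the mean of all weights in the document-feature matrix."""
    values = [v for row in matrix for v in row]
    return sum(values) / len(values)

def hash_documents(matrix, threshold=None):
    """Code each row as a bit string: '1' where weight >= threshold, else '0'."""
    if threshold is None:
        threshold = mean_threshold(matrix)  # default to v_p
    return [
        "".join("1" if v >= threshold else "0" for v in row)
        for row in matrix
    ]

# Hypothetical 2-document, 3-feature weight matrix.
matrix = [
    [4.0, 0.0, 2.0],
    [1.0, 3.0, 2.0],
]
bit_strings = hash_documents(matrix)
```

Passing an explicit `threshold` argument allows the same function to be reused once a final threshold has been chosen.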
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010810385.3A CN111899890B (en) | 2020-08-13 | 2020-08-13 | Medical data similarity detection system and method based on bit string hash |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010810385.3A CN111899890B (en) | 2020-08-13 | 2020-08-13 | Medical data similarity detection system and method based on bit string hash |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111899890A (en) | 2020-11-06 |
CN111899890B (en) | 2023-12-08 |
Family
ID=73229292
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010810385.3A Active CN111899890B (en) | 2020-08-13 | 2020-08-13 | Medical data similarity detection system and method based on bit string hash |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111899890B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106156272A (en) * | 2016-06-21 | 2016-11-23 | 北京工业大学 | A kind of information retrieval method based on multi-source semantic analysis |
CN108427717A (en) * | 2018-02-06 | 2018-08-21 | 北京航空航天大学 | It is a kind of based on the alphabetic class family of languages medical treatment text Relation extraction method gradually extended |
CN108461111A (en) * | 2018-03-16 | 2018-08-28 | 重庆医科大学 | Chinese medical treatment text duplicate checking method and device, electronic equipment, computer read/write memory medium |
CN108595517A (en) * | 2018-03-26 | 2018-09-28 | 南京邮电大学 | A kind of extensive document similarity detection method |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112632606A (en) * | 2020-12-23 | 2021-04-09 | 天津理工大学 | SNOMED-CT-based medical text document desensitization method and system |
CN112632606B (en) * | 2020-12-23 | 2022-12-09 | 天津理工大学 | SNOMED-CT-based medical text document desensitization method and system |
CN113536763A (en) * | 2021-07-20 | 2021-10-22 | 北京中科闻歌科技股份有限公司 | Information processing method, device, equipment and storage medium |
CN115909342A (en) * | 2023-01-03 | 2023-04-04 | 湖北瑞云智联科技有限公司 | Image mark recognition system and method based on contact point motion track |
CN115909342B (en) * | 2023-01-03 | 2023-05-23 | 湖北瑞云智联科技有限公司 | Image mark recognition system and method based on contact movement track |
CN116186231A (en) * | 2023-04-24 | 2023-05-30 | 之江实验室 | Method and device for generating reply text, storage medium and electronic equipment |
CN117313154A (en) * | 2023-10-10 | 2023-12-29 | 上海期货信息技术有限公司 | Data association relation evaluation method and device based on privacy protection |
CN117313154B (en) * | 2023-10-10 | 2024-05-31 | 上海期货信息技术有限公司 | Data association relation evaluation method and device based on privacy protection |
CN117612663A (en) * | 2023-12-19 | 2024-02-27 | 苏州临亿医药科技有限公司 | Visual processing system for clinical medical data |
CN117708308A (en) * | 2024-02-06 | 2024-03-15 | 四川蓉城蕾茗科技有限公司 | RAG natural language intelligent knowledge base management method and system |
CN117708308B (en) * | 2024-02-06 | 2024-05-14 | 四川蓉城蕾茗科技有限公司 | RAG natural language intelligent knowledge base management method and system |
Also Published As
Publication number | Publication date |
---|---|
CN111899890B (en) | 2023-12-08 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||