CN111899890A - Medical data similarity detection system and method based on bit string Hash - Google Patents

Info

Publication number
CN111899890A
Authority
CN
China
Prior art keywords
similarity
data
text
document
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010810385.3A
Other languages
Chinese (zh)
Other versions
CN111899890B (en)
Inventor
周铁华
王玲
李建
刘文强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northeast Electric Power University
Original Assignee
Northeast Dianli University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northeast Dianli University
Priority to CN202010810385.3A
Publication of CN111899890A
Application granted
Publication of CN111899890B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Medical Informatics (AREA)
  • Public Health (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Pathology (AREA)
  • Primary Health Care (AREA)
  • Epidemiology (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the field of text similarity detection, and in particular to a medical data similarity detection system and method based on bit string hash. The system comprises a data storage module, a data preprocessing module, a text feature extraction module, a hash processing module, a text similarity calculation module and a similarity visualization module. The data storage module stores the medical text data; the data preprocessing module performs dimensionality reduction on the text and removes private information; the feature extraction module forms a document-feature matrix from the documents, their features and the feature weights; the hash processing module hashes the text; the similarity calculation module maps each document to a digital fingerprint, calculates Hamming distances and divides the documents into similarity groups; and the similarity visualization module displays the medical text visually. By hashing the text, the invention can find, within massive medical text data, the set of texts most similar to a target text, which improves the efficiency of medical question retrieval.

Description

Medical data similarity detection system and method based on bit string Hash
Technical Field
The invention relates to the field of text similarity detection, in particular to a medical data similarity detection system and method based on bit string Hash.
Background
With the rapid development of online medical services, text data in the medical field accumulates day by day. The latent value in this text can effectively reduce the cost of communication between doctors and patients, help medical communities operate at a finer granularity, and support more targeted services. Medical text is characterized by indistinct categories, a markedly unstructured form, high discriminative weight for low-frequency words, pervasive missing information, and inconsistent information. How to accurately calculate the similarity between medical texts and retrieve relevant medical information quickly and accurately is a problem that urgently needs solving. To address it, a medical text similarity detection system and method based on hash processing are provided herein.
Text similarity calculation is applied mainly in medical text retrieval. Combined with medical domain knowledge, texts similar to a specified text are found among thousands of texts and their categories are determined; similar texts are then searched for within the same category. In this way questions with high similarity can be predicted more accurately, and the precision of retrieval matching across consultation texts for different diseases is improved.
There are two main types of text similarity retrieval methods. The first is the traditional approach based on keyword matching, which considers similarity from the parts two texts share and uses the co-occurrence and degree of repetition of character strings as the similarity measure. Such methods compare texts only at the literal level and ignore their semantic information, so the results are unsatisfactory. The second converts text features into vectors using a spatial similarity model. This works well for text similarity calculation in general domains, but its precision is low in vertical, specialized domains. Existing similarity calculation methods for Chinese medical text generally lack semantic information, calculate the similarity of Chinese medical text inaccurately, and cannot accurately reflect the similarity between medical texts. How to accurately calculate the similarity between medical texts is therefore a problem that researchers in the field urgently need to solve.
Disclosure of Invention
The invention aims to overcome the shortcomings of the above text similarity calculation methods by providing a medical data similarity detection system and method based on bit string hash. It combines a statistics-based method with a hash-based method to improve the quality of text discrimination and of text similarity detection under similar medical topics, to support the retrieval of similar medical questions, and to discover associations among diseases.
To achieve this purpose, the invention is realized by the following technical scheme:
A medical data similarity detection system based on bit string hash includes a data storage module, a data preprocessing module, a feature extraction module, a hash processing module, a similarity calculation module and a similarity visualization module. The data storage module stores medical text data. The data preprocessing module performs dimensionality reduction on the text based on a medical-domain lexicon and removes private information from the text data. The feature extraction module extracts features based on smoothing and forms a document-feature weight matrix from the documents, the document features and their weights. The hash processing module defines initial weights, iteratively updates the parameters while dynamically adjusting the threshold, and hashes the text according to the final threshold. The similarity calculation module maps the documents to digital fingerprints, calculates the Hamming distances between documents and divides the documents into similarity groups according to a similarity threshold. The similarity visualization module displays the subset of documents with high similarity to a target document and the arrangement of similar texts within each document group. The data storage module is communicatively connected to the data preprocessing module; the feature extraction module is communicatively connected to the data preprocessing module and the hash processing module; the similarity calculation module is communicatively connected to the hash processing module and the similarity visualization module; and the similarity visualization module is communicatively connected to the similarity calculation module.
In the medical data similarity detection system based on bit string hash, the data stored by the data storage module comprise a user ID and the user's self-described medical text.
In the medical data similarity detection system based on bit string hash, the data preprocessing module removes patient privacy data irrelevant to the retrieval content by means of a privacy protection device, segments user utterances into words using a lexicon for the disease-symptom vertical domain to obtain a set of segmented words, and removes stop words. Stop words are words that carry no practical meaning and are of no use for model training, including modal particles, conjunctions, prepositions and adverbs.
In the medical data similarity detection system based on bit string hash, the feature extraction module constructs the text-feature weight matrix as follows: a feature calculation device computes the STF-IDF value of each feature word in the data set to obtain a term-frequency document for the current data set; the words are sorted in descending order of STF-IDF value; and the first n feature words are selected to form the feature word set of the feature text set.
In the medical data similarity detection system based on bit string hash, the document similarity calculation module divides the document similarity groups as follows: text semantic information corresponding to the text data is looked up in a preset database; feature data vectors are formed through a hash function; the similarity between the reference text data and the comparison text data is judged through a similarity function, according to the similar data corresponding to the reference text data and the similarity corresponding to the text data; and the document similarity groups are divided accordingly.
The medical data similarity detection method based on the bit string Hash comprises the following steps:
Step 1: for an online medical community, storing the acquired medical text data in a database with reasonably designed fields and attributes to obtain a text set T = {T_1, T_2, T_3, …, T_i}, where T_i represents a piece of medical text data;
Step 2: preprocessing the data: desensitizing the data to remove unnecessary, low-value information from the corpus, including patient name, address and family circumstances, and segmenting the data set into words, wherein a stop-word dictionary d_1 and a medical-domain dictionary d_2 must be set during word segmentation so that useless interfering words are eliminated while proper medical nouns are kept intact, yielding the set of all feature words W = {w_1, w_2, w_3, …, w_i}, where w_i represents a word;
Step 3: text feature extraction:
(1) calculating the weights of the features in the corpus using a weight calculation method based on the idea of smoothing: the STF-IDF value of each word is calculated with the formula STF-IDF = (1 + log(tf)) · ln(p / (p' + 1) + 1), where tf is the frequency of the feature word, p is the total number of documents and p' is the number of documents containing the feature, and the minimum support is set to 2; this yields an initial word-weight set Z = {(w_1, v_1), (w_2, v_2), (w_3, v_3), …, (w_i, v_i)}, where (w_i, v_i) is a feature word and its corresponding STF-IDF value; the feature words are sorted in descending order of the initial weight v, and the first n feature words are selected to form the feature word set of the feature text set;
(2) constructing an initial document-feature matrix M, with the feature words arranged in the matrix in descending order of v;
Step 4: document matrix hashing:
(1) dynamically updating the weights in the document-feature matrix M according to the formula V' = (V_i + V_j) / 2, where i and j are the coordinates of the start position and the end position of each column in the matrix, until all (w_i, v_i) have been processed, yielding a new document-feature matrix M_1;
(2) defining v_p as the mean of the updated weights in matrix M_1;
(3) first hash-encoding the data in the matrix with v_p as the threshold: an entry greater than or equal to v_p is assigned 1, and otherwise 0; that is, the entry is set to 1 when V'_ij ≥ v_p and to 0 when V'_ij < v_p, where V'_ij is the weight in row i, column j of the hash matrix;
(4) dynamically adjusting the threshold according to [formula image 1, not reproduced], where v(w, T_i) denotes the weight of feature w in document T_i and [formula image 2, not reproduced] denotes the co-occurrence coefficient of features between documents; repeating step (3) and, for each threshold, selecting k similar-document pairs, each consisting of a document T_i and the document T_j most similar to it;
(5) calculating the average cosine similarity between the similar-document pairs under each threshold setting, selecting the threshold that gives the maximum average cosine similarity as the final threshold, and updating the threshold to v'; the average document cosine similarity is given by similarity = [formula image 3, not reproduced], where k is the number of similar-document pairs;
(6) constructing the similarity matrix;
Step 5: dividing similarity document groups:
(1) calculating the Hamming distance between the text hash codes according to the hash values updated with the final threshold, and setting equally spaced division thresholds for similar texts according to the Hamming distance;
(2) dividing the documents into different similarity groups according to the similarity step, wherein documents falling in the same similarity interval are placed in the same similarity group, group = {group_1, group_2, group_3, …}, where group_i is one of the resulting similarity groups;
Step 6: displaying, in the similarity visualization module, the similarity between documents and the group to which each document belongs.
In summary, the medical data similarity detection system and method based on bit string hash according to the present invention have the following beneficial effects:
(1) the method is particularly beneficial to analyzing similar medical problem retrieval and finding the association among diseases;
(2) the method comprehensively considers the applicability of disease characteristics, retains the similarity of original texts, and reduces the ambiguity of analysis results, thereby providing guarantee for later-stage similarity detection;
(3) the method integrates a statistical-based method and a Hash-based method, mutual promotion is realized, and the purposes of improving text distinguishing quality and improving text similarity detection quality under similar medical subjects are achieved;
(4) the method is scientific and reasonable and has strong applicability.
The foregoing is a summary of the present application and thus contains, by necessity, simplifications, generalizations and omissions of detail; those skilled in the art will appreciate that the summary is illustrative of the application and is not intended to be in any way limiting. Other aspects, features and advantages of the devices and/or methods and/or other subject matter described in this specification will become apparent as the description proceeds. The summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
Drawings
The above-described and other features of the present application will become more fully apparent from the following description and appended claims, taken in conjunction with the accompanying drawings. It is to be understood that these drawings are solely for purposes of illustrating several embodiments of the present application and are not intended as a definition of the limits of the application, for which reference should be made to the appended drawings, wherein the disclosure is to be interpreted in a more complete and detailed manner.
Fig. 1 is a system module analysis diagram of the medical data similarity detection system and method based on bit string hashing according to the present invention.
Fig. 2 is a system flow diagram of the medical data similarity detection system and method based on bit string hashing according to the present invention.
Detailed Description
In the following detailed description, reference is made to the accompanying drawings, which form a part hereof. In the drawings, the same/similar reference numerals generally refer to the same/similar parts unless otherwise specified in the specification. The illustrative embodiments described in the detailed description, drawings, and claims should not be considered limiting of the application. Other embodiments of, and changes to, the present application may be made without departing from the spirit or scope of the subject matter presented in the present application. It should be readily understood that the aspects of the present application, as generally described in the specification and illustrated in the figures herein, could be arranged, substituted, combined, designed in a wide variety of different configurations, and that all such modifications are expressly contemplated and made part of this application.
The invention provides a medical data similarity detection system and method based on bit string hash, which accurately calculate the similarity between similar medical texts, help to retrieve quickly and accurately the issues of common concern to doctors and patients, reduce the cost of doctor-patient communication, and reveal relations between diseases.
Referring to fig. 1, the invention includes a data storage module, a data preprocessing module, a feature extraction module, a hash processing module, a similarity calculation module and a similarity visualization module. The data storage module stores medical text data from online medical communities and websites and builds a database; the stored data comprise user IDs and the users' self-described medical texts, and in this embodiment comprise patient IDs, the patients' consultation corpora from online medical communities, and the like. The data preprocessing module performs dimensionality reduction on the text based on a medical-domain lexicon, removes private information from the text data, segments the corpus into words and removes stop words. The feature extraction module extracts features based on smoothing and forms a document-feature weight matrix from the documents, the document features and their weights. The hash processing module defines initial weights, iteratively updates the parameters while dynamically adjusting the threshold, and hashes the text according to the final threshold. The similarity calculation module maps the documents to digital fingerprints, calculates the Hamming distances between documents and divides the documents into similarity groups according to a similarity threshold. The similarity visualization module displays the subset of documents with high similarity to a target document and the arrangement of similar texts within each document group.
The data storage module is in communication connection with the data preprocessing module, the feature extraction module is in communication connection with the data preprocessing module and the Hash processing module respectively, the similarity calculation module is in communication connection with the Hash processing module and the similarity visualization module respectively, and the similarity visualization module is in communication connection with the similarity calculation module.
The data preprocessing module removes patient privacy data irrelevant to the retrieval content by means of a privacy protection device, segments user utterances into words using a lexicon for the disease-symptom vertical domain to obtain a set of segmented words, and removes stop words. Stop words are words that carry no practical meaning and are of no use for model training, including modal particles, conjunctions, prepositions, adverbs and the like.
The feature extraction module constructs the text-feature weight matrix as follows: a feature calculation device computes the STF-IDF value of each feature word in the data set to obtain a term-frequency document for the current data set; the words are sorted in descending order of STF-IDF value; and the first n feature words are selected to form the feature word set of the feature text set.
The document similarity calculation module divides the document similarity groups as follows: text semantic information corresponding to the text data is looked up in a preset database; feature data vectors are formed through a hash function; the similarity between the reference text data and the comparison text data is judged through a similarity function, according to the similar data corresponding to the reference text data and the similarity corresponding to the text data; and the document similarity groups are divided accordingly. With this scheme, when the similarity of two texts needs to be judged, the texts are first processed and encoded so that they are associated into matrix data; that is, the text information is converted into matrix data similar to an image. The similarity of the two texts is then obtained from the Hamming distance calculated over the document matrix data, which gives higher judgment accuracy.
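As an illustration of the fingerprint comparison described above, the following minimal sketch packs a hashed document row into an integer and counts differing bits. The function names `to_fingerprint` and `hamming_distance` and the sample rows are hypothetical and are not taken from the patent.

```python
def to_fingerprint(bits):
    """Pack a sequence of 0/1 values into one integer fingerprint."""
    value = 0
    for bit in bits:
        value = (value << 1) | (1 if bit else 0)
    return value


def hamming_distance(fp_a, fp_b):
    """Count the bit positions at which two fingerprints differ."""
    return bin(fp_a ^ fp_b).count("1")


# Hypothetical hashed rows for two documents over the same six features.
doc_a = [1, 0, 1, 1, 0, 0]
doc_b = [1, 1, 1, 0, 0, 0]
distance = hamming_distance(to_fingerprint(doc_a), to_fingerprint(doc_b))
print(distance)
```

A smaller distance means the two bit strings agree on more features, which is why the distance can serve as a cheap similarity measure once every document has been hashed to the same length.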
Referring to fig. 2, a medical data similarity detection method based on bit string hash includes the following steps:
Step 1: for an online medical community, storing the acquired medical text data in a database with reasonably designed fields and attributes to obtain a text set T = {T_1, T_2, T_3, …, T_i}, where T_i represents a piece of medical text data.
Step 2: preprocessing the data: desensitizing the data to remove unnecessary, low-value information from the corpus, including patient name, address and family circumstances, and segmenting the data set into words, wherein a stop-word dictionary d_1 and a medical-domain dictionary d_2 must be set during word segmentation so that useless interfering words are eliminated while proper medical nouns are kept intact, yielding the set of all feature words W = {w_1, w_2, w_3, …, w_i}, where w_i represents a word. Preferably, the dictionaries used in this step are the Harbin Institute of Technology stop-word list and the Sogou disease-symptom lexicon.
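The segmentation and stop-word removal in step 2 can be sketched as follows. This is only an illustration under stated assumptions: it uses forward maximum matching as a stand-in segmenter (the patent does not name a segmentation algorithm), the tiny dictionaries are invented for the example, and desensitization is not covered.

```python
def segment(text, vocabulary):
    """Forward maximum matching: at each position take the longest vocabulary entry,
    falling back to a single character when nothing matches."""
    max_len = max((len(word) for word in vocabulary), default=1)
    tokens, pos = [], 0
    while pos < len(text):
        for size in range(min(max_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + size]
            if size == 1 or candidate in vocabulary:
                tokens.append(candidate)
                pos += size
                break
    return tokens


def preprocess(text, stop_words, medical_terms):
    """Segment with d_1 and d_2 combined, then drop stop words and whitespace."""
    tokens = segment(text, stop_words | medical_terms)
    return [t for t in tokens if t not in stop_words and not t.isspace()]


# Hypothetical dictionaries: d1 holds stop words, d2 holds medical-domain terms.
d1 = {"我", "并且"}
d2 = {"最近", "头痛", "发烧"}
print(preprocess("我最近头痛并且发烧", d1, d2))
```

Including the medical dictionary in the matching vocabulary is what keeps multi-character terms such as 头痛 intact instead of splitting them into single characters.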
Step 3: text feature extraction:
(1) calculating the weights of the features in the corpus using a weight calculation method based on the idea of smoothing: the STF-IDF value of each word is calculated with the formula STF-IDF = (1 + log(tf)) · ln(p / (p' + 1) + 1), where tf is the frequency of the feature word, p is the total number of documents and p' is the number of documents containing the feature, and the minimum support is set to 2; this yields an initial word-weight set Z = {(w_1, v_1), (w_2, v_2), (w_3, v_3), …, (w_i, v_i)}, where (w_i, v_i) is a feature word and its corresponding STF-IDF value; the feature words are sorted in descending order of the initial weight v, and the first n feature words are selected to form the feature word set of the feature text set;
(2) constructing an initial document-feature matrix M, with the feature words arranged in the matrix in descending order of v.
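A minimal sketch of step 3 is given below. Several points are assumptions rather than statements from the patent: `log` is taken as the natural logarithm, the minimum support of 2 is read as a minimum corpus frequency, corpus-level frequencies are used for ranking, and per-document frequencies fill the matrix. The names `stf_idf` and `build_matrix` are hypothetical.

```python
import math
from collections import Counter


def stf_idf(tf, p, p_prime):
    """Smoothed weight from the text: (1 + log(tf)) * ln(p / (p' + 1) + 1)."""
    return (1 + math.log(tf)) * math.log(p / (p_prime + 1) + 1)


def build_matrix(docs, n, min_support=2):
    """Return (features, matrix); matrix[d][f] is the weight of feature f in doc d."""
    p = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    corpus_tf = Counter(word for doc in docs for word in doc)
    # Keep words seen at least min_support times in the corpus (assumed reading).
    weights = {
        w: stf_idf(corpus_tf[w], p, doc_freq[w])
        for w in corpus_tf if corpus_tf[w] >= min_support
    }
    # Descending weight; ties are broken alphabetically to keep the order stable.
    features = sorted(weights, key=lambda w: (-weights[w], w))[:n]
    matrix = []
    for doc in docs:
        tf = Counter(doc)
        matrix.append([stf_idf(tf[w], p, doc_freq[w]) if tf[w] else 0.0
                       for w in features])
    return features, matrix


docs = [["a", "b", "a"], ["b", "c"], ["a", "c", "c"]]
features, matrix = build_matrix(docs, n=2)
print(features)
print(matrix)
```

The `+ 1` terms inside the logarithm are what the text calls smoothing: they keep the inverse-document-frequency factor positive even for a word that appears in every document.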
Step 4: document matrix hashing:
(1) dynamically updating the weights in the document-feature matrix M according to the formula V' = (V_i + V_j) / 2, where i and j are the coordinates of the start position and the end position of each column in the matrix, until all (w_i, v_i) have been processed, yielding a new document-feature matrix M_1;
(2) defining v_p as the mean of the updated weights in matrix M_1;
(3) first hash-encoding the data in the matrix with v_p as the threshold: an entry greater than or equal to v_p is assigned 1, and otherwise 0; that is, the entry is set to 1 when V'_ij ≥ v_p and to 0 when V'_ij < v_p, where V'_ij is the weight in row i, column j of the hash matrix;
(4) dynamically adjusting the threshold according to [formula image 1, not reproduced], where v(w, T_i) denotes the weight of feature w in document T_i and [formula image 2, not reproduced] denotes the co-occurrence coefficient of features between documents; repeating step (3) and, for each threshold, selecting k similar-document pairs, each consisting of a document T_i and the document T_j most similar to it;
(5) calculating the average cosine similarity between the similar-document pairs under each threshold setting, selecting the threshold that gives the maximum average cosine similarity as the final threshold, and updating the threshold to v'; the average document cosine similarity is given by similarity = [formula image 3, not reproduced], where k is the number of similar-document pairs;
(6) constructing the similarity matrix.
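The thresholding and threshold selection in step 4 might be sketched as below. This is a loose illustration, not the patented procedure: the threshold adjustment formula is not available in this text, so candidate thresholds are simply supplied by the caller; the column update of sub-step (1) is omitted; and pairing each document with its nearest neighbour by Hamming distance is an assumed reading of "most similar". All function names are hypothetical.

```python
import math


def binarize(matrix, threshold):
    """Hash-encode a weight matrix: 1 where weight >= threshold, else 0."""
    return [[1 if v >= threshold else 0 for v in row] for row in matrix]


def mean_weight(matrix):
    """Mean of all weights, used as the initial threshold v_p."""
    values = [v for row in matrix for v in row]
    return sum(values) / len(values)


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def average_pair_similarity(matrix, threshold):
    """Pair each document with its nearest neighbour on the hashed rows, then
    average the cosine similarity of the original weight rows over those pairs."""
    codes = binarize(matrix, threshold)
    total = 0.0
    for i, code in enumerate(codes):
        j = min((k for k in range(len(codes)) if k != i),
                key=lambda k: hamming(code, codes[k]))
        total += cosine(matrix[i], matrix[j])
    return total / len(codes)


def select_threshold(matrix, candidates):
    """Return the candidate threshold with the highest average pair similarity."""
    return max(candidates, key=lambda t: average_pair_similarity(matrix, t))


weights = [[1.0, 0.0, 0.9], [0.9, 0.1, 1.0], [0.0, 1.0, 0.1]]
v_p = mean_weight(weights)
print(binarize(weights, v_p))
print(select_threshold(weights, [0.5 * v_p, v_p, 1.5 * v_p]))
```

Scoring each threshold by the cosine similarity of the original weight rows checks whether the documents that look close after hashing were also close before hashing, which is the property the final threshold is meant to preserve.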
Step 5: dividing similarity document groups:
(1) calculating the Hamming distance between the text hash codes according to the hash values updated with the final threshold, and setting equally spaced division thresholds for similar texts according to the Hamming distance; in this embodiment a Hamming distance of 3 is used as the division step;
(2) dividing the documents into different similarity groups according to the similarity step, wherein documents falling in the same similarity interval are placed in the same similarity group, group = {group_1, group_2, group_3, …}, where group_i is one of the resulting similarity groups.
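The grouping in step 5 can be sketched with equal-width distance intervals. The sketch assumes the distances are measured from one target document and that interval boundaries fall at multiples of the step; the function name `group_by_distance` and the sample distances are hypothetical.

```python
from collections import defaultdict


def group_by_distance(distances, step=3):
    """Bucket documents by Hamming distance to the target in intervals of `step`:
    group 0 holds distances 0..step-1, group 1 holds step..2*step-1, and so on."""
    groups = defaultdict(list)
    for doc_id, distance in distances.items():
        groups[distance // step].append(doc_id)
    return dict(sorted(groups.items()))


# Hypothetical Hamming distances from a target document to four other documents.
distances = {"T1": 0, "T2": 2, "T3": 4, "T4": 8}
print(group_by_distance(distances))
```

With the step of 3 used in the embodiment, documents at distances 0 to 2 fall in the closest group, and each further group covers the next three distance values.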
Step 6: displaying, in the similarity visualization module, the similarity between documents and the group to which each document belongs, including the similarity group of the target document, the texts within that group that are most similar to it, and the like.
In conclusion, the medical data similarity detection system and method based on bit string hash are particularly helpful for analyzing the retrieval of similar medical questions and discovering correlations among diseases; they are scientifically sound and widely applicable. The method takes the applicability of disease characteristics fully into account, preserves the similarity of the original texts and reduces ambiguity in the analysis results, thereby providing a sound basis for subsequent similarity detection. In addition, the method combines a statistics-based method with a hash-based method so that each reinforces the other, improving the quality of text discrimination and of text similarity detection under similar medical topics.
The foregoing has been a detailed description of various embodiments of the apparatus and/or methods of the present application via block diagrams, flowcharts, and/or examples of implementations. When the block diagrams, flowcharts, and/or embodiments include one or more functions and/or operations, it will be understood by those within the art that each function and/or operation within the block diagrams, flowcharts, and/or embodiments can be implemented, individually and/or collectively, by a wide range of hardware, software, firmware, or virtually any combination thereof.
Those skilled in the art will recognize that it is common within the art to describe devices and/or methods in the manner described in this specification and then to perform engineering practices to integrate the described devices and/or methods into a data processing system. That is, at least a portion of the devices and/or methods described herein may be integrated into a data processing system through a reasonable amount of experimentation. With respect to substantially any plural and/or singular terms used in this specification, those skilled in the art may interpret the plural as singular and/or the singular as plural as appropriate from a context and/or application. Various singular/plural combinations may be explicitly stated in this specification for the sake of clarity.
Various aspects and embodiments of the present application are disclosed herein, and other aspects and embodiments of the present application will be apparent to those skilled in the art. The various aspects and embodiments disclosed in this application are presented by way of example only, and not by way of limitation, and the true scope and spirit of the application is to be determined by the following claims.

Claims (6)

1. A medical data similarity detection system based on bit string hash, characterized by comprising: a data storage module, a data preprocessing module, a feature extraction module, a hash processing module, a similarity calculation module and a similarity visualization module;
the data storage module is used for storing medical text data; the data preprocessing module is used for performing dimensionality reduction on the text based on a medical-domain lexicon and removing private information from the text data; the feature extraction module is used for extracting features based on smoothing and forming a document-feature weight matrix from the documents, the document features, and the weights relating the documents to the features; the hash processing module is used for defining initial weights, iteratively updating parameters and dynamically adjusting the threshold, and hashing the text according to the final threshold; the similarity calculation module is used for mapping the documents into digital fingerprints, calculating the Hamming distances between the documents, and dividing the documents into similarity groups according to the similarity thresholds; and the similarity visualization module is used for displaying the subset of documents most similar to a target document and the arrangement of similar texts within each document group;
the data storage module is in communication connection with the data preprocessing module, the feature extraction module is in communication connection with the data preprocessing module and the Hash processing module respectively, the similarity calculation module is in communication connection with the Hash processing module and the similarity visualization module respectively, and the similarity visualization module is in communication connection with the similarity calculation module.
2. The medical data similarity detection system based on bit string hash according to claim 1, wherein the data stored in the data storage module includes a user ID and the user's self-described medical text.
3. The medical data similarity detection system based on bit string hash according to claim 1, wherein the data preprocessing module removes, through a privacy protection device, private patient data irrelevant to the retrieval content, performs word segmentation on the user's text using a lexicon for the vertical domain of disease symptoms to obtain a set of segmented words, and removes stop words;
the stop words are words that carry no practical meaning and are useless for model training, and comprise modal particles, conjunctions, prepositions and adverbs.
4. The medical data similarity detection system based on bit string hash according to claim 1, wherein the feature extraction module constructs the text-feature weight matrix as follows: computing, with a feature calculation device, the STF-IDF value of each feature word in the data set to obtain a word-frequency document for the current data set, sorting the words in descending order of STF-IDF value, and selecting the first n feature words to form the feature word set of the text set.
5. The medical data similarity detection system based on bit string hash according to claim 1, wherein the similarity calculation module divides the documents into similarity groups as follows: searching a preset database for the text semantic information corresponding to the text data, forming feature data vectors through a hash function, judging the similarity between the reference text data and the comparison text data through a similarity function, according to the similar data corresponding to the reference text data and the similarity corresponding to the text data, and dividing the documents into similarity groups accordingly.
6. A medical data similarity detection method based on bit string hash, characterized by comprising the following steps:
step 1: for an online medical community, storing the acquired medical text data in a database with reasonably designed fields and attributes to obtain a text set T = {T_1, T_2, T_3, …, T_i}, where T_i represents a piece of medical text data;
step 2: preprocessing the data: desensitizing the data to remove unnecessary, low-information-value content in the corpus, including patient name, address, and family circumstances; and performing word segmentation on the data set, wherein a stop-word dictionary d_1 and a medical-domain dictionary d_2 are set during segmentation so as to eliminate useless interfering words while keeping proprietary medical terms intact, to obtain the set of all feature words W = {w_1, w_2, w_3, …, w_i}, where w_i represents a word;
step 3: text feature extraction:
(1) calculating the weights of the features in the corpus using a weight calculation method based on the smoothing idea, computing the STF-IDF value of each word with the formula STF-IDF = (1 + log(tf)) × ln(p/(p′ + 1) + 1), where tf is the feature word frequency, p is the total number of documents, and p′ is the number of documents containing the feature, and setting the minimum support to 2 to obtain an initial word weight set Z = {(w_1, v_1), (w_2, v_2), (w_3, v_3), …, (w_i, v_i)}, where (w_i, v_i) is a feature word and its corresponding STF-IDF value; sorting the feature words in descending order of the initial weight v, and selecting the first n feature words to form the feature word set of the text set;
(2) constructing an initial document-feature matrix M, and arranging feature words in the matrix according to a v value descending order;
step 4: document matrix hashing:
(1) dynamically updating the weight values in the document-feature matrix M according to the formula V′ = (V_i + V_j)/2 (where i and j are the coordinates of the start and end positions of each column in the matrix, respectively) until all (w_i, v_i) have been processed, to obtain a new document-feature matrix M_1;
(2) defining the mean of the updated weight values in the matrix M_1 as v_p;
(3) first hash-coding the data in the matrix with v_p as the threshold, assigning 1 to values greater than or equal to v_p and 0 otherwise, that is, V′_ij = 1 when V′_ij ≥ v_p and V′_ij = 0 when V′_ij < v_p, where V′_ij denotes the weight in row i and column j of the hash matrix;
(4) dynamically adjusting the threshold according to [formula: Figure DEST_PATH_IMAGE001], where v(w, T_i) represents the weight of feature w in a document and [formula: Figure DEST_PATH_IMAGE002] represents the co-occurrence coefficient of features between documents; repeating the preceding step (3), and selecting, under each threshold, k similar document pairs, each consisting of a document T_i and the document T_j most similar to it;
(5) calculating the average cosine similarity between the similar document pairs under each threshold setting, selecting the threshold at which the average cosine similarity is largest as the final threshold, and updating the threshold to v′, wherein the average document cosine similarity is given by: Similarity = [formula: Figure DEST_PATH_IMAGE003], where k is the number of similar document pairs;
(6) constructing a similarity matrix;
step 5: dividing the documents into similarity groups:
(1) calculating the Hamming distance between the text hash codes from the hash values obtained with the final threshold, and setting equally spaced similar-text division thresholds on the Hamming distance;
(2) dividing the documents into different similarity groups according to the similarity step size, wherein documents falling within the same similarity interval are placed in the same similarity group, Group = {group_1, group_2, group_3, …}, where group_i is one of the divided similarity groups;
step 6: displaying the similarity between the documents and their category groups in the similarity visualization module.
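The weighting in step 3 and the mean-threshold hashing in step 4 (2) and (3) can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: the function names are hypothetical, `log` is taken as base 10 and `ln` as the natural logarithm (the claim does not state the base of `log`), a feature with tf = 0 is given weight 0, and the averaging update of step 4 (1) and the dynamic threshold adjustment of step 4 (4) are omitted because their formulas are not fully recoverable from the text.

```python
import math
from typing import List


def stf_idf(tf: int, p: int, p_prime: int) -> float:
    """Smoothed weight (1 + log(tf)) * ln(p / (p' + 1) + 1).

    tf: feature word frequency, p: total number of documents,
    p_prime: number of documents containing the feature.
    Base-10 log and zero weight for tf == 0 are assumptions.
    """
    if tf <= 0:
        return 0.0
    return (1.0 + math.log10(tf)) * math.log(p / (p_prime + 1) + 1.0)


def binarize(matrix: List[List[float]]) -> List[str]:
    """Hash each document row to a bit string, using the matrix mean v_p as threshold."""
    values = [v for row in matrix for v in row]
    v_p = sum(values) / len(values)
    # A weight greater than or equal to v_p maps to 1, otherwise to 0.
    return ["".join("1" if v >= v_p else "0" for v in row) for row in matrix]


# Hypothetical inputs: one feature seen 1 and 10 times, 10 documents, 4 containing it.
weights = [stf_idf(tf, p=10, p_prime=4) for tf in (1, 10)]
# Hypothetical 2-document, 3-feature weight matrix; its mean is 2.5.
codes = binarize([[4.0, 1.0, 3.0], [0.0, 2.0, 5.0]])
```

The resulting bit strings are the digital fingerprints whose Hamming distances are compared in step 5.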
CN202010810385.3A 2020-08-13 2020-08-13 Medical data similarity detection system and method based on bit string hash Active CN111899890B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010810385.3A CN111899890B (en) 2020-08-13 2020-08-13 Medical data similarity detection system and method based on bit string hash

Publications (2)

Publication Number Publication Date
CN111899890A true CN111899890A (en) 2020-11-06
CN111899890B CN111899890B (en) 2023-12-08

Family

ID=73229292

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010810385.3A Active CN111899890B (en) 2020-08-13 2020-08-13 Medical data similarity detection system and method based on bit string hash

Country Status (1)

Country Link
CN (1) CN111899890B (en)

Also Published As

Publication number Publication date
CN111899890B (en) 2023-12-08

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant