WO2021121187A1

WO2021121187A1 - Method for detecting electronic medical case duplicates based on word segmentation, device, and computer equipment

Info

Publication number: WO2021121187A1
Application number: PCT/CN2020/136146
Authority: WO
Inventors: 唐蕊
Original assignee: 平安科技（深圳）有限公司
Priority date: 2020-06-24
Filing date: 2020-12-14
Publication date: 2021-06-24
Also published as: CN111814447A; CN111814447B

Abstract

A method based on word segmentation for detecting duplicates of electronic medical cases, a device, computer equipment, and a storage medium, for use in the domain of smart medical treatment. The method comprises: after performing word segmentation on the text of a case to be checked, extracting text features and semantic features respectively, and calculating the respective ratios of these features in the word segmentation text; integrating the ratios to calculate a degree of text similarity and a degree of meaning similarity; then, according to a preset weighting value, fusing the two degrees of similarity to obtain a final degree of similarity; and then determining the cases similar to the case to be checked on the basis of a preset threshold. The present method also relates to blockchain technology, the case data being stored in a blockchain. The present method addresses the low accuracy of similar-case detection that arises when duplicate checking relies only on medical characteristics and the illnesses corresponding to identical symptoms differ greatly.

Description

Method, device and computer equipment for checking duplicates of electronic medical records based on word segmentation text

This application is based on the Chinese invention patent application filed on June 24, 2020 with the application number 202010592373.8, titled "Method, device, and computer equipment for checking duplicates of electronic medical records based on word segmentation text", and claims its priority.

Technical field

This application relates to the field of natural language processing in artificial intelligence, and in particular to a method, device, computer equipment and storage medium for checking duplicates of electronic medical records based on word segmentation text.

Background art

A case is a systematic record of the occurrence, development, diagnosis, and treatment of a disease. With the popularization of electronic medical record systems in hospitals, electronic medical records have gradually replaced handwritten medical records, making the collection and management of case information more convenient and faster.

On the other hand, the popularity of electronic medical records also makes it easier to copy and paste or plagiarize the text of existing medical records. Therefore, a method for checking duplicate cases is urgently needed. In the prior art, the input text is generally structured and analyzed to obtain the target medical features included in the input text and the corresponding target feature attributes, and the historical medical records that include these features and attributes are retrieved from the case retrieval system. The semantic similarity between the input text and each of these historical medical records is then calculated, followed by the feature similarity between the target feature attributes and the feature attributes in the historical medical records, and similar cases are finally determined based on the semantic similarity and the feature similarity. During implementation, the inventor realized that, because the diseases corresponding to the same symptoms can differ greatly, duplicate checking of cases based only on medical characteristics finds similar cases with low accuracy.

Summary of the invention

Based on this, this application provides a method, device, computer equipment, and storage medium for checking duplicates of electronic medical records based on word segmentation text, to solve the technical problem in the prior art that, because the diseases corresponding to the same symptoms differ greatly, duplicate checking of cases based only on medical characteristics finds similar cases with low accuracy.

A method for checking duplicates of electronic medical records based on word segmentation text, the method comprising:

Perform word segmentation processing on the case to be checked input by the user to obtain the word segmentation text;

Perform feature extraction on the word segmentation text according to the preset substring value to obtain the case text feature;

Obtain word type words and medical meaning words from the word segmentation text, and count the first ratio of the word type words in the word segmentation text and the second ratio of the medical meaning words in the word segmentation text;

Integrate the first ratio and the second ratio to obtain the meaning characteristics of the case;

Calculate the similarity between the case to be checked and the case text in the case database according to the case text feature and the meaning characteristics of the case, respectively, to obtain the text similarity and the meaning similarity;

Fuse the text similarity and the meaning similarity according to the preset weight value to obtain the final similarity between the case to be checked and the case text, and take the case text whose final similarity is greater than the preset value as the duplicate check result.

A device for checking duplicates of electronic medical records based on word segmentation text, said device comprising:

The word segmentation module is used to perform word segmentation processing on the case to be checked input by the user to obtain the word segmentation text;

The extraction module is used to perform feature extraction on the word segmentation text according to the preset substring value to obtain the case text feature;

The ratio module is used to obtain word type words and medical meaning words from the word segmentation text, and to calculate the first ratio of the word type words in the word segmentation text and the second ratio of the medical meaning words in the word segmentation text;

The integration module is used to integrate the first ratio and the second ratio to obtain the meaning characteristics of the case;

The similarity module is used to calculate the similarity between the case to be checked and the case text in the case database according to the case text feature and the meaning characteristics of the case, respectively, to obtain the text similarity and the meaning similarity;

The duplicate checking module is used to fuse the text similarity and the meaning similarity according to a preset weight value to obtain the final similarity between the case to be checked and the case text, and to take the case text whose final similarity is greater than the preset value as the duplicate check result.

A computer device, comprising a memory, a processor, and a computer program stored in the memory and capable of running on the processor. When executing the computer program, the processor implements the steps of the above-mentioned method for checking duplicates of electronic medical records based on word segmentation text.

A computer-readable storage medium storing a computer program. When the computer program is executed by a processor, the steps of the above-mentioned method for checking duplicates of electronic medical records based on word segmentation text are implemented.

The above-mentioned method, device, computer equipment and storage medium for checking duplicates of electronic medical records based on word segmentation text obtain the word type words and medical meaning words of the case to be checked from the word segmentation text, calculate their ratios in the word segmentation text, and integrate the ratios. The text similarity and meaning similarity with the case data in the case database are then calculated and fused into the final similarity, and the case data that meets the similarity requirement is used as the duplicate check result. In this way, similar cases can be found with high accuracy even when the diseases corresponding to the same symptoms differ greatly, a situation in which duplicate checking based only on medical characteristics performs poorly.

Description of the drawings

In order to explain the technical solutions of the embodiments of the present application more clearly, the following will briefly introduce the drawings that need to be used in the description of the embodiments of the present application. Obviously, the drawings in the following description are only some embodiments of the present application. For those of ordinary skill in the art, other drawings can be obtained based on these drawings without creative labor.

FIG. 1 is a schematic diagram of the application environment of the method for duplicate checking of electronic medical records based on word segmentation text in an embodiment of the application;

FIG. 2 is a schematic flowchart of a method for checking duplicates of electronic medical records based on word segmentation text in an embodiment of the application;

FIG. 3 is a schematic diagram of an electronic medical record duplicate checking device based on word segmentation text in an embodiment of the application;

FIG. 4 is a schematic diagram of a computer device in an embodiment of the application.

Detailed description

Unless otherwise defined, all technical and scientific terms used herein have the same meanings as commonly understood by those skilled in the technical field of the application. The terms used in the specification of the application are only for the purpose of describing specific embodiments and are not intended to limit the application. The terms "including" and "having", and any variations thereof, in the specification and claims of the application and in the above description of the drawings are intended to cover non-exclusive inclusions. The terms "first", "second", etc. in the specification and claims of this application or the above-mentioned drawings are used to distinguish different objects, rather than to describe a specific sequence.

The reference to "embodiments" herein means that a specific feature, structure, or characteristic described in conjunction with the embodiments may be included in at least one embodiment of the present application. The appearance of the phrase in various places in the specification does not necessarily refer to the same embodiment, nor to an independent or alternative embodiment that is mutually exclusive with other embodiments. Those skilled in the art understand, both explicitly and implicitly, that the embodiments described herein can be combined with other embodiments.

In order to make the objectives, technical solutions, and advantages of this application clearer, the following further describes the application in detail with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the present application, and are not used to limit the present application. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of this application.

The method for checking duplicates of electronic medical records based on word segmentation text provided in the embodiments of the present application can be applied to the application environment shown in FIG. 1. The application environment may include the terminal 102, the network, and the server 104. The network is the medium that provides a communication link between the terminal 102 and the server 104. The network may include various connection types, such as wired or wireless communication links, fiber optic cables, and so on.

The user can use the terminal 102 to interact with the server 104 through the network to receive or send messages and so on. Various communication client applications, such as web browser applications, shopping applications, search applications, instant messaging tools, email clients, social platform software, etc., may be installed on the terminal 102.

The terminal 102 may be any of various electronic devices that have a display screen and support web browsing, including but not limited to smart phones, tablet computers, e-book readers, MP3 players (Moving Picture Experts Group Audio Layer III), MP4 players (Moving Picture Experts Group Audio Layer IV), laptop computers, desktop computers, etc.

The server 104 may be a server that provides various services, for example, a background server that provides support for pages displayed on the terminal 102.

It should be noted that the method for checking duplicate electronic medical records based on word segmentation text provided by the embodiments of the present application is generally executed by the server/terminal. Accordingly, the device for checking duplicate electronic medical records based on word segmentation text is generally set in the server/terminal device.

This application can be used in many general or special computer system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multi-processor systems, microprocessor-based systems, set-top boxes, programmable consumer electronic devices, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, etc. This application may be described in the general context of computer-executable instructions executed by a computer, such as a program module. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform specific tasks or implement specific abstract data types. This application can also be practiced in distributed computing environments. In these distributed computing environments, tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules can be located in local and remote computer storage media including storage devices.

This application can be applied in the field of smart medical care, so as to promote the construction of smart cities.

It should be understood that the numbers of terminals, networks, and servers in FIG. 1 are merely illustrative. There can be any number of terminal devices, networks, and servers according to implementation needs.

The terminal 102 communicates with the server 104 through the network. The server 104 receives the case to be checked sent by the terminal 102, obtains the word type words and medical meaning words of the case to be checked from the word segmentation text, calculates their ratios in the word segmentation text, and integrates the ratios. The text similarity and meaning similarity with the case data in the case database are then calculated and fused to obtain the final similarity, and the case data that meets the similarity requirement is returned to the terminal 102 as the duplicate check result. The terminal 102 and the server 104 are connected through a network, which can be a wired network or a wireless network. The terminal 102 can be, but is not limited to, various personal computers, laptops, smart phones, tablets, and portable wearable devices. The server 104 can be implemented by an independent server or a cluster of multiple servers.

In one embodiment, as shown in FIG. 2, a method for checking duplicates of electronic medical records based on word segmentation text is provided. Taking the method applied to the server in FIG. 1 as an example for description, the method includes the following steps:

Step 202: Perform word segmentation processing on the case to be checked input by the user to obtain the word segmentation text.

The case to be checked may be electronic medical record data input by the user.

Electronic case data includes text data. The text data is composed of a series of case files, including admission records, first course records, surgical records, discharge summaries, and so on.

In this embodiment, the electronic case submitted by the user through the terminal is treated as the case to be checked; the text information of the electronic case is extracted and organized into a text document, and the duplicate check is then performed on this text document against the electronic medical record database.

Further, when the case to be checked input by the user is detected, word segmentation processing needs to be performed on it.

In this embodiment, existing word segmentation technology can be used to perform word segmentation processing on the case to be checked. For example, the word segmentation technology used is a hybrid word segmentation technology that combines rule-based word segmentation and statistical word segmentation.

Among them, the rule-based word segmentation technology mainly maintains a dictionary. When segmenting a sentence, each character string in the sentence is matched with a word in the dictionary. If it is found, the word segmentation is performed, otherwise no segmentation is performed.
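The dictionary matching described above can be sketched as follows. This is a minimal illustration that assumes greedy forward maximum matching, which is only one possible matching strategy; the source does not say which strategy is used. The function name `dictionary_segment`, the `max_len` parameter, and the toy dictionary are hypothetical.

```python
def dictionary_segment(sentence, dictionary, max_len=4):
    """Greedy forward maximum matching against a word dictionary.

    Assumption: unmatched characters are emitted one at a time, which is
    one way to read "otherwise no segmentation is performed".
    """
    tokens = []
    i = 0
    while i < len(sentence):
        # Try the longest candidate first, then shrink the window.
        for length in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + length]
            if length == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += length
                break
    return tokens


# Toy dictionary and unspaced input, for illustration only.
toy_dictionary = {"rare", "case"}
print(dictionary_segment("rarecase", toy_dictionary))
```

In a hybrid scheme such as the one described in this embodiment, a matcher like this would only assist a statistical segmenter rather than replace it.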

The word segmentation technology based on statistics must first establish a statistical language model, then divide the sentence into words, calculate the probability of the result of the division, and use the word segmentation result with the highest probability as the final word segmentation result.

The hybrid word segmentation technology is based on the statistical word segmentation technology and uses the rule-based word segmentation technology as an aid, thereby combining the two technologies, and finally obtains the word segmentation text of the case to be checked.

Step 204: Perform feature extraction on the segmented text according to the preset substring value to obtain the case text feature.

In order to identify cases that are literally similar to the case to be checked, a set of continuous word strings appearing in the document, that is, the case text feature, is first constructed from the word segmentation text obtained after word segmentation. The case text feature includes at least one continuous word string, that is, at least one substring element.

If each case to be checked is represented by this set of continuous word strings, then cases to be checked or other case texts with repeated literal content (for example, the same sentence or phrase) will share many set elements. This holds even if the sentence order in the two case texts is different.

Further, each word or character in the word segmentation text is regarded as a character, and a unique code is generated for each character string; the text document of an electronic medical record is then regarded as one large character string.

Then the n-gram algorithm is used to perform feature extraction on the segmented text according to the preset substring value to obtain the case text feature, where the case text feature includes at least one continuous word string arranged in the order of character encoding, and the characters in each continuous word string are arranged in order of their unique codes.

From this string, all substrings whose length equals the preset substring value k are selected as the case text feature, which captures both whether literal text elements appear and, to some extent, their order. Each case text is thus represented as the set of substring elements of length k appearing in the document, that is, the obtained case text feature.

Specifically, suppose a case text is represented after word segmentation as a string of length 6, that is, [word1 word2 word3 word4 word5 word6].

For example, "This is a rare case, which may be caused by a special tissue pathology or physiological dysfunction."

After word segmentation: this is, a kind, rare, case, possible, yes, a kind, special, tissue disease, or, physiological function, disorder, caused by

If the preset substring value is 6, the case text features of the case to be checked can be obtained as:

"this is a rare case may be", "a rare case may be a", "rare case may be a special", "case may be a special tissue disease", "may be a special tissue disease or", "be a special tissue disease or physiological function", "a special tissue disease or physiological function disorder", and "special tissue disease or physiological function disorder caused by".

The substring elements above form a set of substring elements; other case texts are processed in the same way.

Optionally, if the preset substring value k=3, a substring element set with 4 substring elements is obtained for the length-6 string [word1 word2 word3 word4 word5 word6], namely [word1 word2 word3, word2 word3 word4, word3 word4 word5, word4 word5 word6].
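The k=3 example can be reproduced with a short sketch. It assumes the segmented text is already available as a list of tokens; the function name `word_ngrams` is hypothetical, and each substring element is represented as a tuple of words rather than by the unique codes mentioned in the text.

```python
def word_ngrams(tokens, k=3):
    """Return the set of substrings made of k consecutive words."""
    # A text shorter than k yields an empty set.
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


tokens = ["word1", "word2", "word3", "word4", "word5", "word6"]
features = word_ngrams(tokens, k=3)
print(len(features))
for element in sorted(features):
    print(" ".join(element))
```

Using a set means that a substring repeated within one document is counted once, which matches the set-based representation described in this step.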

Generally, an electronic medical record text document is represented by a set of words after word segmentation, but such a set only reflects whether these words appear in the document and does not reflect the order relationship between them. Therefore, substrings whose length is the preset substring value k are constructed (that is, k consecutive words are concatenated to form a substring) to reflect the order of words to a certain degree, which makes the result of duplicate checking more accurate.

Generally, the value of k ranges from 2 to 6. The larger the k value (that is, the longer the word string), the more word order information the obtained word string reflects; the smaller the k value (that is, the shorter the word string), the less word order information the obtained word string reflects.

Under normal circumstances, the value of k is set to 3, because the choice of k affects the subsequent calculation of the literal similarity between two electronic medical record texts. If k is set to a larger value, the word strings in the word string set of an electronic medical record text are longer, the two electronic medical records share fewer identical word strings, and the similarity of the two texts is lower. If k is set to a smaller value, the word strings in the set are shorter, the two electronic medical record texts share more identical word strings, and the similarity of the two electronic medical records is higher.

Therefore, if it is necessary to reflect the order relationship of certain words, and to consider the similarity of the electronic medical record text in the subsequent calculation, the value of k needs to be weighed according to the actual text, and the value of k must not be set too large or too small. Therefore, in this embodiment, the value of k is preferably set to 3.

Step 206: Obtain word type words and medical meaning words from the word segmentation text, and count the first ratio of word type words in the word segmentation text and the second ratio of medical meaning words in the word segmentation text.

Based on the word segmentation of the text document of the electronic medical record, the type and the medical meaning of the words appearing in the text are considered, and features reflecting the connotative meaning of the text are extracted from them, that is, the word type words and the medical meaning words.

The specific features are as follows:

I. Word type words: The word types include content words and function words. Content words include nouns, verbs, adjectives, numerals, measure words, and pronouns. Function words include adverbs, prepositions, conjunctions, auxiliary words, interjections, and onomatopoeias. There are 12 categories in total. By calculating the first ratio of each word type to the total number of words in the word segmentation text, the features reflecting the word types are obtained.

II. Medical meaning words: On the basis of word segmentation, medical entity association is performed for words with medical meaning. The number of all medical entities that appear is counted, together with the numbers of medical entities that belong to symptoms, diseases, examinations, and drugs; from these five statistics, the second ratio relative to all medical entities is calculated to obtain the corresponding features that reflect the medical meaning of the words.

Specifically, the words whose word types are content words or function words are obtained from the word segmentation text, and the first ratio of the word type words in the word segmentation text is calculated.
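As a rough sketch of the first ratio, the snippet below assumes that a part-of-speech tagger (not specified in the source) has already labeled each token with one of the 12 word types. The tag names in `WORD_TYPES` and the function name `word_type_ratios` are hypothetical.

```python
from collections import Counter

# The 12 word types named in the text; the label spellings are assumptions.
WORD_TYPES = [
    "noun", "verb", "adjective", "numeral", "measure_word", "pronoun",
    "adverb", "preposition", "conjunction", "auxiliary",
    "interjection", "onomatopoeia",
]


def word_type_ratios(tagged_tokens):
    """Return, for each word type, its share of the total number of words."""
    total = len(tagged_tokens)
    counts = Counter(tag for _, tag in tagged_tokens)
    return [counts[t] / total if total else 0.0 for t in WORD_TYPES]


tagged = [("cough", "noun"), ("persists", "verb"),
          ("severe", "adjective"), ("fever", "noun")]
print(word_type_ratios(tagged))
```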

Obtain medical meaning words from the word segmentation text; perform medical entity association for medical meaning words according to the medical entity database; calculate the second ratio of medical meaning words after the medical entity association in the word segmentation text.

Among them, the entity association of medical meaning words according to the medical entity database is as follows:

The medical entity database contains a large number of medical entities. A medical name and its attribute constitute a medical entity. For example, the medical name of one medical entity is "cough" and its attribute is "symptom"; the medical name of another is "acute upper respiratory infection" and its attribute is "disease"; the medical name of another is "abdominal color Doppler ultrasound" and its attribute is "examination and test"; and the medical name of another is "metformin glipizide tablets" and its attribute is "drug".

The specific implementation of the medical entity association technology is to match each word obtained by word segmentation of an electronic medical record text against the medical entity names in the medical entity database, that is, to associate the word with a medical entity.

All words obtained by word segmentation of an electronic medical record text are associated with medical entities in this way, and the words associated with medical entities constitute the medical meaning words.

Specifically, the number of all medical entities that appear is counted, as well as the numbers of medical entities belonging to symptoms, diseases, examinations, and drugs; the second ratio of each of these in all medical entity data is the corresponding feature that reflects the medical meaning of the words.
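The association and counting described above might look like the following sketch. It assumes the medical entity database can be modeled as a mapping from medical name to attribute and that association is an exact-name lookup; the names `MEDICAL_ENTITIES`, `CATEGORIES`, and `medical_meaning_features` are hypothetical. The five returned values are the total entity count followed by four per-category ratios, mirroring the example values given later for x13 and x14.

```python
# Toy entity database built from the examples in the text.
MEDICAL_ENTITIES = {
    "cough": "symptom",
    "acute upper respiratory infection": "disease",
    "abdominal color Doppler ultrasound": "examination",
    "metformin glipizide tablets": "drug",
}
CATEGORIES = ["symptom", "disease", "examination", "drug"]


def medical_meaning_features(tokens, entity_db):
    """Return [entity count, symptom ratio, disease ratio, exam ratio, drug ratio]."""
    # Associate each token with an entity by exact name match.
    attributes = [entity_db[t] for t in tokens if t in entity_db]
    total = len(attributes)
    ratios = [attributes.count(c) / total if total else 0.0 for c in CATEGORIES]
    return [float(total)] + ratios


tokens = ["patient", "cough", "cough", "metformin glipizide tablets", "fever"]
print(medical_meaning_features(tokens, MEDICAL_ENTITIES))
```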

By obtaining case text information from multiple dimensions and levels, the subsequent similarity calculation is more accurate.

Step 208: Integrate the first ratio and the second ratio to obtain the meaning characteristics of the case.

The case meaning feature is a feature vector composed of n feature values, which correspond to the word type words and medical meaning words described above.

Specifically, the value of n in this embodiment is 17, which represents 17 textual meaning features (word type words, medical meaning words), including 12 word type words and 5 medical meaning words.

For example, the text meaning feature vector of an electronic medical record is expressed as $f_1 = (x_1, x_2, x_3, \ldots, x_{17})$, where each $x_i$ corresponds to a text meaning feature.

Then $(x_1, x_2, x_3, \ldots, x_{12})$ represents the 12 word type words. For example, $x_1$ represents the ratio of nouns to the total number of words in this electronic medical record text (for example, the specific value of this feature is 0.1), and $x_2$ represents the ratio of verbs to the total number of words in this electronic medical record text (for example, the specific value of this feature is 0.05); the remaining features are similar. These ratios all belong to the first ratio of word type words in the total number of words.

$(x_{13}, x_{14}, x_{15}, x_{16}, x_{17})$ represents the 5 medical meaning words. For example, $x_{13}$ represents the number of medical entities that appear (for example, the specific value of this feature is 50), and $x_{14}$ represents the ratio of the number of medical entities that are symptoms to the total number of medical entities (for example, the specific value of this feature is 0.3); the remaining features are similar. These values all belong to the second ratio of the medical meaning words.
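The integration step can be pictured as a simple concatenation. This sketch assumes the 12 word-type values and the 5 medical-meaning values have already been computed; the function name `case_meaning_vector` is hypothetical, and the sample values are the ones quoted in the text, padded with zeros.

```python
def case_meaning_vector(word_type_values, medical_values):
    """Concatenate 12 word-type features and 5 medical features into one vector."""
    if len(word_type_values) != 12 or len(medical_values) != 5:
        raise ValueError("expected 12 word-type features and 5 medical features")
    return list(word_type_values) + list(medical_values)


# x1 = 0.1 (nouns), x2 = 0.05 (verbs); the rest are placeholders.
word_type_values = [0.1, 0.05] + [0.0] * 10
# x13 = 50 (entity count), x14 = 0.3 (symptom ratio); the rest are placeholders.
medical_values = [50.0, 0.3, 0.0, 0.0, 0.0]

f1 = case_meaning_vector(word_type_values, medical_values)
print(len(f1))
```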

In this embodiment, ratios at two text levels are extracted across multiple dimensions, which improves the accuracy of the subsequent duplicate checking.

Step 210: Calculate the similarity between the case to be checked and the case text in the case database according to the case text feature and the meaning characteristics of the case, respectively, to obtain the text similarity and the meaning similarity.

Further, text literal feature extraction is performed on the case to be checked and on the case data in the case database to obtain their case text features, namely the duplicate check set and the data set; the number of identical continuous word strings in the duplicate check set and the data set is then calculated to obtain the text similarity.

Then the cosine similarity algorithm is used to calculate the similarity between the meaning characteristics of the case to be checked and those of the case data as the meaning similarity.

Specifically, the duplicate check set and the data set obtained by extracting text literal features from two electronic medical records are two sets of text literal elements. The Jaccard similarity of these two sets is calculated to obtain the similarity of the two sets of text literal elements, that is, the literal similarity of the texts. The Jaccard similarity, also known as the Jaccard similarity coefficient, is used to compare the similarity and difference between finite sample sets. The larger the Jaccard coefficient, the higher the sample similarity.

The process of calculating the literal similarity of two electronic medical record texts is as follows:

Text literal feature extraction is performed on the two electronic medical records to obtain two sets of text literal elements: the duplicate check set A and the data set B. If the intersection of A and B has m elements and the union of A and B has n elements, then the Jaccard similarity of A and B is:

$$\text{Jaccard}(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{m}{n}$$

This Jaccard similarity is used to indicate the literal similarity of the two electronic medical record texts. The value range of the Jaccard similarity is between 0 and 1. The closer the Jaccard similarity is to 1, the higher the similarity; the closer the Jaccard similarity is to 0, the lower the similarity.
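A minimal sketch of the Jaccard computation on two sets of substring elements follows. The function name `jaccard_similarity` is hypothetical, and returning 0.0 when both sets are empty is an assumption, since the source does not cover that case.

```python
def jaccard_similarity(set_a, set_b):
    """Return |A ∩ B| / |A ∪ B| for two sets of text literal elements."""
    union = set_a | set_b
    if not union:
        # Assumption: two empty sets are treated as having no similarity.
        return 0.0
    return len(set_a & set_b) / len(union)


duplicate_check_set = {("w1", "w2", "w3"), ("w2", "w3", "w4"), ("w3", "w4", "w5")}
data_set = {("w2", "w3", "w4"), ("w3", "w4", "w5"), ("w4", "w5", "w6")}
print(jaccard_similarity(duplicate_check_set, data_set))
```

Here the intersection has m = 2 elements and the union has n = 4, so the literal similarity is 0.5.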

For the text meaning feature vectors f1 and f2 obtained from two electronic medical record texts, the cosine similarity cosine(f1, f2) between the two feature vectors is calculated, and this cosine similarity is used to represent the meaning similarity of the two electronic medical record texts.

The textual meaning feature vectors of the two electronic medical records are respectively expressed as:

f1 = (x ₁ , x ₂ , x ₃ ,..., x _n ) and f1 = (y ₁ , y ₂ , y ₃ ,..., y _n ).

The cosine similarity between these two feature vectors can be calculated by formula (1):

Because the components of the feature vectors are ratios and therefore non-negative, the value of the cosine similarity lies between 0 and 1. The closer the cosine similarity is to 1, the higher the similarity; the closer the cosine similarity is to 0, the lower the similarity.
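The cosine calculation described above can be sketched as follows. This is a minimal illustration using the standard definition of cosine similarity; the function name `cosine_similarity` and the handling of all-zero vectors are assumptions of this sketch.

```python
import math
from typing import Sequence


def cosine_similarity(f1: Sequence[float], f2: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length feature vectors."""
    if len(f1) != len(f2):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(f1, f2))
    norm1 = math.sqrt(sum(x * x for x in f1))
    norm2 = math.sqrt(sum(y * y for y in f2))
    if norm1 == 0.0 or norm2 == 0.0:
        # Assumption: an all-zero vector is treated as dissimilar to everything.
        return 0.0
    return dot / (norm1 * norm2)


# Hypothetical ratio features; f2 is a scaled copy of f1, so the angle is zero.
meaning_similarity = cosine_similarity([0.2, 0.4], [0.1, 0.2])
```

Vectors that point in the same direction give a value close to 1 regardless of their length, which is why ratio features of records of different sizes remain comparable.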

In step 212, the text similarity and the meaning similarity are merged according to the preset weight value to obtain the final similarity between the case to be checked and the case text, and the case text corresponding to the final similarity greater than the preset value is used as the duplicate check result.

The text similarity and the meaning similarity are superimposed using the preset weight values w1:w2 to obtain the final similarity, where 0<=w2<=1 and w1+w2=1.

Specifically, for two electronic medical record texts, one of which is the case to be checked for duplicates, the text similarity sim1 (0<=sim1<=1) and the meaning similarity sim2 (0<=sim2<=1) are merged using a weight-based fusion method, where the preset weight values can be set according to the application scenario.

That is, sim=w1*sim1+w2*sim2

Among them, 0<=sim<=1,0<=w1<=1,0<=w2<=1, w1+w2=1, and sim is used to represent the final similarity of the two electronic medical record texts.

This embodiment mainly reflects that when calculating the similarity between electronic medical records, both the literal similarity of the electronic medical record text and the similarity of the meaning of the electronic medical record text are considered.

The comprehensive consideration of these two similarities is achieved by fusing the results of text literal similarity and text meaning similarity based on weight.

Generally, the weight w1 corresponding to the text similarity and the weight w2 corresponding to the meaning similarity are both set to 0.5, which means that the two similarity results are balanced.

If more emphasis is placed on the literal similarity of the text, the weight w1 of the text similarity is set larger (for example, w1 is set to 0.6), and the weight w2 of the meaning similarity is set correspondingly smaller (for example, w2 is set to 0.4).

Conversely, if more emphasis is placed on the similarity of text meaning, the weight w2 of the meaning similarity is set larger (for example, w2 is set to 0.6), and the weight w1 of the text similarity is set correspondingly smaller (for example, w1 is set to 0.4).
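The weighted fusion sim=w1*sim1+w2*sim2 and the weight settings discussed above can be sketched as follows. The function name `fuse_similarity` and the example similarity values are assumptions for illustration; w2 is derived from w1 so that w1+w2=1 always holds.

```python
def fuse_similarity(sim1: float, sim2: float, w1: float = 0.5) -> float:
    """Fuse text similarity sim1 and meaning similarity sim2 with weights w1:w2."""
    if not 0.0 <= w1 <= 1.0:
        raise ValueError("w1 must lie between 0 and 1")
    w2 = 1.0 - w1  # enforces w1 + w2 = 1
    return w1 * sim1 + w2 * sim2


# Hypothetical similarities for one pair of records.
sim1, sim2 = 0.8, 0.6

balanced = fuse_similarity(sim1, sim2)                   # w1 = w2 = 0.5
literal_weighted = fuse_similarity(sim1, sim2, w1=0.6)   # favors literal similarity
meaning_weighted = fuse_similarity(sim1, sim2, w1=0.4)   # favors meaning similarity
```

Since both inputs lie between 0 and 1 and the weights sum to 1, the fused value also lies between 0 and 1.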

The preset value can be set according to the specific application scenario and needs, and is not specifically limited in this application.

In the above method for checking duplicates of electronic medical records based on word segmentation text, the word type words and medical meaning words of the case to be checked are obtained from the word segmentation text, and their ratios in the word segmentation text are calculated. After the ratios are integrated, the text similarity and the meaning similarity with the case data in the case database are calculated and fused into the final similarity, and the case data that meets the similarity requirement is used as the duplicate check result. In this way, even when the diseases corresponding to the same symptoms differ greatly, duplicate checking based only on medical characteristics can still find similar cases with high accuracy.

In one embodiment, duplicate checking of electronic medical records can be applied in two scenarios, online and offline:

(1) Online electronic medical record duplicate check

Online means that after the doctor enters the electronic medical record text in the electronic medical record system, the server immediately checks the electronic medical record text for duplicates.

This function informs the doctor whether the entered electronic medical record duplicates a record in the electronic medical record database. If it does, the function returns the number of the duplicated electronic medical record and the corresponding similarity. The implementation is as follows:

I. For the electronic medical record text input by the doctor, first extract the text content from it to generate the corresponding text document, and then extract the text features (including text literal features and text meaning features) from the text document.

II. Calculate the final similarity (the result of fusing the literal text similarity and the text meaning similarity) between the text features of each electronic medical record in the electronic medical record database and the text features of the input electronic medical record.

III. Compare the calculated final similarity with the preset value. If any electronic medical record exceeds the preset value, the entered electronic medical record duplicates a record in the electronic medical record database; the server prompts the doctor that the entered electronic medical record is a duplicate and returns the number of the duplicated electronic medical record and its corresponding similarity. If no electronic medical record exceeds the preset value, the entered electronic medical record is not duplicated in the electronic medical record database, and the server returns a message informing the doctor that the entered electronic medical record is not a duplicate.

The preset value lies between 0 and 1, as does the similarity sim between electronic medical records.

The higher the preset value (the closer to 1), the stricter the duplicate check and the fewer similar medical records are returned.

The lower the preset value (the closer to 0), the more relaxed the duplicate check and the more similar medical records are returned. Different preset values can be set according to different needs.

Generally, the preset value can be set to 0.8.
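Step III of the online check and the effect of the preset value can be sketched as follows. The sketch assumes the final similarities have already been computed for every record in the database; the function name `find_duplicates`, the record numbers, and the similarity values are hypothetical.

```python
from typing import Dict, List, Tuple


def find_duplicates(
    similarities: Dict[str, float], preset_value: float = 0.8
) -> List[Tuple[str, float]]:
    """Return (record number, similarity) pairs whose similarity exceeds the preset value."""
    hits = [(rid, sim) for rid, sim in similarities.items() if sim > preset_value]
    # Most similar records first, so the doctor sees the likeliest duplicate on top.
    return sorted(hits, key=lambda item: item[1], reverse=True)


# Hypothetical final similarities between the entered record and stored records.
similarities = {"EMR-001": 0.91, "EMR-002": 0.42, "EMR-003": 0.85}

relaxed = find_duplicates(similarities, preset_value=0.8)  # two records returned
strict = find_duplicates(similarities, preset_value=0.9)   # one record returned

# An empty result means the entered record is reported as not duplicated.
is_duplicate = bool(relaxed)
```

Raising the preset value from 0.8 to 0.9 drops EMR-003 from the result, which matches the stricter behavior described above.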

(2) Offline electronic medical record duplicate check

In the electronic medical record database, each electronic medical record text is checked against the other electronic medical record texts for duplication. The steps for checking one electronic medical record against the other electronic medical records in the database are similar to the steps described above for checking an electronic medical record entered online.

This method will output the duplicate electronic medical records in the database. For these duplicate electronic medical records, the number of the duplicate electronic medical records and their corresponding similarity will be given correspondingly.
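The offline scenario amounts to a pairwise comparison over the database, which can be sketched as follows. The function `offline_duplicate_check`, the stand-in `toy_similarity` (a word-level Jaccard score used here in place of the fused final similarity), and the record contents are all assumptions made for illustration.

```python
from itertools import combinations
from typing import Callable, Dict, List, Tuple


def toy_similarity(text_a: str, text_b: str) -> float:
    """Stand-in for the fused final similarity: Jaccard over whitespace tokens."""
    a, b = set(text_a.split()), set(text_b.split())
    return len(a & b) / len(a | b) if a | b else 0.0


def offline_duplicate_check(
    records: Dict[str, str],
    similarity: Callable[[str, str], float],
    preset_value: float = 0.8,
) -> List[Tuple[str, str, float]]:
    """Compare every unordered pair of records once and report pairs above the preset value."""
    results = []
    for (id_a, text_a), (id_b, text_b) in combinations(records.items(), 2):
        sim = similarity(text_a, text_b)
        if sim > preset_value:
            results.append((id_a, id_b, sim))
    return results


# Hypothetical database contents.
records = {"r1": "fever cough", "r2": "fever cough", "r3": "fracture"}
duplicates = offline_duplicate_check(records, toy_similarity)
```

Comparing each unordered pair once avoids reporting the same duplicate twice; for N records this still requires N(N-1)/2 comparisons, so a large database may call for batching or indexing, which this sketch does not address.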

This embodiment provides two application scenarios to explain in detail the specific application of the above-mentioned method for duplicate checking of electronic medical records based on word segmentation text.

It should be understood that although the steps in the flowchart of FIG. 2 are displayed in sequence as indicated by the arrows, these steps are not necessarily performed in the order indicated by the arrows. Unless specifically stated herein, the execution of these steps is not strictly limited in order, and these steps can be executed in other orders. Moreover, at least some of the steps in FIG. 2 may include multiple sub-steps or multiple stages. These sub-steps or stages are not necessarily executed at the same time, but may be executed at different times, and they are not necessarily performed sequentially, but may be performed in turn or alternately with other steps or with at least a part of the sub-steps or stages of other steps.

In one embodiment, as shown in FIG. 3, a device for checking duplicates of electronic medical records based on word segmentation text is provided. The device corresponds one-to-one to the method for checking duplicates of electronic medical records based on word segmentation text in the above embodiment. The device includes:

The word segmentation module 302 is used to perform word segmentation processing on the case to be checked for duplicates input by the user to obtain the word segmentation text;

The extraction module 304 is configured to perform feature extraction on the segmented text according to the preset substring value to obtain the case text feature;

The ratio module 306 is used to obtain word type words and medical meaning words from the word segmentation text, and count the first ratio of word type words in the word segmentation text and the second ratio of medical meaning words in the word segmentation text;

The integration module 308 is used to integrate the first ratio and the second ratio to obtain the meaning characteristics of the case;

The similarity module 310 is used to calculate the similarity between the case to be checked and the case text in the case database according to the case text features and the case meaning features, respectively, to obtain the text similarity and the meaning similarity;

The duplicate checking module 312 is used to fuse the text similarity and the meaning similarity according to the preset weight value to obtain the final similarity between the case to be checked and the case text, and to use the case text whose final similarity is greater than the preset value as the duplicate check result.

It should be emphasized that, in order to further ensure the privacy and security of the above-mentioned case to be checked and case data, the case to be checked and the case data may also be stored in a node of a blockchain.

Further, the extraction module 304 includes:

Coding sub-module, used to generate a unique code for each character in the word segmentation text;

The extraction sub-module is used to perform feature extraction on the word segmentation text through the n-gram algorithm according to the preset substring value to obtain the case text features, wherein the case text features include at least one continuous word string arranged in character encoding order, and the characters in the continuous word string are arranged in order of the size of their unique codes.
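The n-gram extraction performed by the extraction sub-module can be sketched roughly as follows. This sketch treats the preset substring value as the n-gram length n and slides a window over the segmented tokens; the function name `extract_ngrams` and the example tokens are assumptions, and the unique-code assignment handled by the coding sub-module is omitted.

```python
from typing import Sequence, Set, Tuple


def extract_ngrams(tokens: Sequence[str], n: int = 2) -> Set[Tuple[str, ...]]:
    """Return the set of contiguous n-token strings found in a segmented text."""
    if not 2 <= n <= 6:
        # Assumption: enforce the 2-6 range that the claims give for the substring value.
        raise ValueError("preset substring value must be between 2 and 6")
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


# Hypothetical segmented text of a medical record.
tokens = ["patient", "has", "fever", "and", "cough"]
bigrams = extract_ngrams(tokens, n=2)
```

The resulting sets can serve as the duplicate check set and data set whose overlap is measured by the Jaccard similarity. A text shorter than n tokens yields an empty set.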

Further, the ratio module 306 includes:

The word sub-module is used to obtain the characters whose word type words are content words and function words from the word segmentation text, and calculate the first ratio of word type words in the word segmentation text;

The meaning sub-module is used to obtain medical meaning words from the word segmentation text;

The association sub-module is used to associate medical entities with medical meaning words according to the medical entity database;

The calculation sub-module is used to calculate the second ratio of medical meaning words in the segmentation text after the medical entity is associated.
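A ratio of this kind can be sketched as the share of tokens that fall into each category. The function name `category_ratios`, the tag names, and the tagged tokens below are hypothetical; this application does not specify the tagger or the medical entity database, so the tags are assumed to be already available.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def category_ratios(
    tagged_tokens: Sequence[Tuple[str, str]], categories: Sequence[str]
) -> List[float]:
    """For each category, return (tokens with that tag) / (all tokens)."""
    total = len(tagged_tokens)
    if total == 0:
        # Assumption: an empty text contributes a zero ratio for every category.
        return [0.0] * len(categories)
    counts = Counter(tag for _, tag in tagged_tokens)
    return [counts[category] / total for category in categories]


# Hypothetical (token, tag) pairs; "symptom" stands in for a medical entity type.
tagged = [("patient", "noun"), ("has", "verb"), ("fever", "symptom"), ("cough", "symptom")]

word_type_ratios = category_ratios(tagged, ["noun", "verb"])  # first ratio per word type
medical_ratios = category_ratios(tagged, ["symptom"])         # second ratio per entity type

# Concatenating both groups gives one possible form of the case meaning feature vector.
meaning_feature = word_type_ratios + medical_ratios
```

Concatenating the per-category ratios in a fixed order keeps the vectors of different records aligned component by component, which the cosine similarity calculation requires.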

Further, the similarity module 310 includes:

The collection sub-module is used to extract the text literal features of the case to be checked and the case data in the case database to obtain the duplicate check set and the data set;

The text similarity sub-module is used to calculate the number of identical continuous word strings in the duplicate check set and the data set to obtain the text similarity;

The meaning similarity sub-module is used to calculate the similarity of the meaning characteristics of the case to be checked and the case data through the cosine similarity algorithm, as the meaning similarity.

The above device for checking duplicates of electronic medical records based on word segmentation text obtains the word type words and medical meaning words of the case to be checked from the word segmentation text and calculates their ratios in the word segmentation text. After the ratios are integrated, the text similarity and the meaning similarity with the case data in the case database are calculated and fused into the final similarity, and the case data that meets the similarity requirement is used as the duplicate check result. In this way, even when the diseases corresponding to the same symptoms differ greatly, duplicate checking based only on medical characteristics can still find similar cases with high accuracy.

In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure diagram may be as shown in FIG. 4. The computer equipment includes a processor, a memory, a network interface, and a database connected through a system bus. Among them, the processor of the computer device is used to provide calculation and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage medium. The database of the computer equipment is used to store case data. The network interface of the computer device is used to communicate with an external terminal through a network connection. When the computer program is executed by the processor, a method for checking duplicates of electronic medical records based on word segmentation text is realized.

In this embodiment, the word type words and medical meaning words of the case to be checked are obtained from the word segmentation text, and their ratios in the word segmentation text are calculated. After the ratios are integrated, the text similarity and the meaning similarity with the case data in the case database are calculated and fused into the final similarity, and the case data that meets the similarity requirement is used as the duplicate check result. In this way, even when the diseases corresponding to the same symptoms differ greatly, duplicate checking based only on medical characteristics can still find similar cases with high accuracy.

Those skilled in the art can understand that the computer device here is a device that can automatically perform numerical calculation and/or information processing in accordance with preset or stored instructions. Its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), an embedded device, and the like.

In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored. When the computer program is executed by a processor, the steps of the method for checking duplicates of electronic medical records based on word segmentation text in the above embodiments are implemented, for example step 202 to step 212 shown in FIG. 2. Alternatively, when the processor executes the computer program, the functions of the modules/units of the device for checking duplicates of electronic medical records based on word segmentation text in the above embodiment are implemented, for example the functions of modules 302 to 312 shown in FIG. 3.

A person of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium, and when executed, it may include the processes of the above method embodiments. Any reference to memory, storage, database, or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks linked to one another by cryptographic methods. Each data block contains a batch of network transaction information, which is used to verify the validity of the information (anti-counterfeiting) and to generate the next block. The blockchain can include the underlying blockchain platform, the platform product service layer, and the application service layer.

Those skilled in the art can clearly understand that, for convenience and conciseness of description, the above division of functional units and modules is used only as an example. In practical applications, the above functions can be allocated to different functional units and modules as needed; that is, the internal structure of the device is divided into different functional units or modules to complete all or part of the functions described above.

The technical features of the above embodiments can be combined arbitrarily. For conciseness, not all possible combinations of the technical features in the above embodiments are described. However, as long as a combination of these technical features involves no contradiction, it should be considered to fall within the scope of this specification.

The above embodiments express only several implementations of the present application. Their description is relatively specific and detailed, but it should not be understood as limiting the scope of the invention patent. It should be pointed out that those of ordinary skill in the art can make several modifications, improvements, or equivalent substitutions of some technical features without departing from the concept of this application; such modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application, and they fall within the protection scope of this application. Therefore, the scope of protection of the patent of this application shall be subject to the appended claims.

Claims

A method for checking duplicates of electronic medical records based on word segmentation text, wherein the method includes:

Perform word segmentation processing on the case to be checked for duplicates input by the user to obtain the word segmentation text;

Perform feature extraction on the word segmentation text according to the preset substring value to obtain the case text feature;

Obtain word type words and medical meaning words from the word segmentation text, and count the first ratio of the word type words in the word segmentation text and the second ratio of the medical meaning words in the word segmentation text;

Integrating the first ratio and the second ratio to obtain the meaning characteristics of the case;

Calculate the similarity between the case to be checked and the case text in the case database according to the case text features and the case meaning features, respectively, to obtain text similarity and meaning similarity;

The text similarity and the meaning similarity are fused according to the preset weight value to obtain the final similarity between the case to be checked and the case text, and the case text whose final similarity is greater than the preset value is taken as the duplicate check result.
The method according to claim 1, wherein the word segmentation text includes a plurality of characters composed of the characters and/or words obtained after word segmentation, and the performing feature extraction on the word segmentation text according to a preset substring value to obtain the case text features includes:

Generate a unique code for each character in the word segmentation text;

Perform feature extraction on the word segmentation text according to the preset substring value through the n-gram algorithm to obtain the case text features, wherein the case text features include at least one continuous word string arranged in the character encoding order, and the characters in the continuous word string are arranged in order of the size of their unique codes.
The method according to claim 2, wherein the value range of the preset substring value is 2-6.
The method according to claim 1, wherein the obtaining word type words and medical meaning words from the word segmentation text, and counting the first ratio of the word type words in the word segmentation text and the second ratio of the medical meaning words in the word segmentation text, includes:

Obtain the characters whose word type words are content words and function words from the word segmentation text, and calculate the first ratio of the word type words in the word segmentation text;

Obtain medical meaning words from the word segmentation text;

Perform medical entity association on the medical meaning words according to the medical entity database;

Calculate the second ratio of the medical meaning words associated with the medical entity in the word segmentation text.
The method according to claim 1, wherein the integrating the first ratio and the second ratio to obtain the meaning characteristics of the case comprises:

According to the ratio of 12:5, the word type words and the medical meaning words are integrated to obtain the case meaning feature f1 = (x₁, x₂, x₃, ..., x₁₇), where (x₁, x₂, x₃, ..., x₁₂) represent 12 word type words and (x₁₃, x₁₄, x₁₅, x₁₆, x₁₇) represent 5 medical meaning words.
The method according to claim 1, wherein the calculating the similarity between the case to be checked and the case text in the case database according to the case text features and the case meaning features, respectively, to obtain text similarity and meaning similarity, includes:

Perform literal text feature extraction on the case text features of the case to be checked and of the case data in the case database to obtain a duplicate check set and a data set;

Calculating the number of identical continuous word strings in the duplicate check set and the data set to obtain the text similarity;

The similarity between the case meaning features of the case to be checked and those of the case data is calculated by the cosine similarity algorithm as the meaning similarity.
The method according to claim 1, wherein the fusing the text similarity and the meaning similarity according to a preset weight value to obtain the final similarity between the case to be checked and the case text comprises:

Using a preset weight value of w1:w2 to superimpose the text similarity and the meaning similarity to obtain the final similarity, where 0<=w2<=1, w1+w2=1.
A device for checking duplicates of electronic medical records based on word segmentation text, which includes:

The word segmentation module is used to perform word segmentation processing on the case to be checked for duplicates input by the user to obtain the word segmentation text;

The extraction module is used to perform feature extraction on the word segmentation text according to the preset substring value to obtain the case text features;

The ratio module is used to obtain word type words and medical meaning words from the word segmentation text, and to count the first ratio of the word type words in the word segmentation text and the second ratio of the medical meaning words in the word segmentation text;

The integration module is used to integrate the first ratio and the second ratio to obtain the case meaning features;

The similarity module is used to calculate the similarity between the case to be checked and the case text in the case database according to the case text features and the case meaning features, respectively, to obtain text similarity and meaning similarity;

The duplicate checking module is used to fuse the text similarity and the meaning similarity according to a preset weight value to obtain the final similarity between the case to be checked and the case text, and to use the case text whose final similarity is greater than the preset value as the duplicate check result.
A computer device includes a memory and a processor, the memory stores a computer program, wherein, when executing the computer program, the processor implements the following steps of a method for checking duplicates of electronic medical records based on word segmentation text:

Perform word segmentation processing on the case to be checked for duplicates input by the user to obtain the word segmentation text;

Perform feature extraction on the word segmentation text according to the preset substring value to obtain the case text feature;

Obtain word type words and medical meaning words from the word segmentation text, and count the first ratio of the word type words in the word segmentation text and the second ratio of the medical meaning words in the word segmentation text;

Integrating the first ratio and the second ratio to obtain the meaning characteristics of the case;

Calculate the similarity between the case to be checked and the case text in the case database according to the case text features and the case meaning features, respectively, to obtain text similarity and meaning similarity;

The text similarity and the meaning similarity are fused according to the preset weight value to obtain the final similarity between the case to be checked and the case text, and the case text whose final similarity is greater than the preset value is taken as the duplicate check result.
The computer device according to claim 9, wherein the word segmentation text includes a plurality of characters composed of the characters and/or words obtained after word segmentation, and the performing feature extraction on the word segmentation text according to a preset substring value to obtain the case text features includes:

Generate a unique code for each character in the word segmentation text;

Perform feature extraction on the word segmentation text according to the preset substring value through the n-gram algorithm to obtain the case text features, wherein the case text features include at least one continuous word string arranged in the character encoding order, and the characters in the continuous word string are arranged in order of the size of their unique codes.
The computer device according to claim 10, wherein the value range of the preset substring value is 2-6.
The computer device according to claim 9, wherein the obtaining word type words and medical meaning words from the word segmentation text, and counting the first ratio of the word type words in the word segmentation text and the second ratio of the medical meaning words in the word segmentation text, includes:

Obtain the characters whose word type words are content words and function words from the word segmentation text, and calculate the first ratio of the word type words in the word segmentation text;

Obtain medical meaning words from the word segmentation text;

Perform medical entity association on the medical meaning words according to the medical entity database;

Calculate the second ratio of the medical meaning words associated with the medical entity in the word segmentation text.
The computer device according to claim 9, wherein said integrating said first ratio and said second ratio to obtain the meaning characteristics of the case comprises:

According to the ratio of 12:5, the word type words and the medical meaning words are integrated to obtain the case meaning feature f1 = (x₁, x₂, x₃, ..., x₁₇), where (x₁, x₂, x₃, ..., x₁₂) represent 12 word type words and (x₁₃, x₁₄, x₁₅, x₁₆, x₁₇) represent 5 medical meaning words.
The computer device according to claim 9, wherein the calculating the similarity between the case to be checked and the case text in the case database according to the case text features and the case meaning features, respectively, to obtain text similarity and meaning similarity, includes:

Perform literal text feature extraction on the case text features of the case to be checked and of the case data in the case database to obtain a duplicate check set and a data set;

Calculating the number of identical continuous word strings in the duplicate check set and the data set to obtain the text similarity;

The similarity between the case meaning features of the case to be checked and those of the case data is calculated by the cosine similarity algorithm as the meaning similarity.
The computer device according to claim 9, wherein the fusing the text similarity and the meaning similarity according to a preset weight value to obtain the final similarity between the case to be checked and the case text includes:

Using a preset weight value of w1:w2 to superimpose the text similarity and the meaning similarity to obtain the final similarity, where 0<=w2<=1, w1+w2=1.
A computer-readable storage medium having a computer program stored thereon, wherein, when the computer program is executed by a processor, the following steps of a method for checking duplicates of electronic medical records based on word segmentation text are implemented:

Perform word segmentation processing on the case to be checked for duplicates input by the user to obtain the word segmentation text;

Perform feature extraction on the word segmentation text according to the preset substring value to obtain the case text feature;

Obtain word type words and medical meaning words from the word segmentation text, and count the first ratio of the word type words in the word segmentation text and the second ratio of the medical meaning words in the word segmentation text;

Integrating the first ratio and the second ratio to obtain the meaning characteristics of the case;

Calculate the similarity between the case to be checked and the case text in the case database according to the case text features and the case meaning features, respectively, to obtain text similarity and meaning similarity;

The text similarity and the meaning similarity are fused according to the preset weight value to obtain the final similarity between the case to be checked and the case text, and the case text whose final similarity is greater than the preset value is taken as the duplicate check result.
The readable storage medium according to claim 16, wherein the word segmentation text includes a plurality of characters composed of the characters and/or words obtained after word segmentation, and the performing feature extraction on the word segmentation text according to a preset substring value to obtain the case text features includes:

Generate a unique code for each character in the word segmentation text;

Perform feature extraction on the word segmentation text according to the preset substring value through the n-gram algorithm to obtain the case text features, wherein the case text features include at least one continuous word string arranged in the character encoding order, and the characters in the continuous word string are arranged in order of the size of their unique codes.
The readable storage medium according to claim 17, wherein the value range of the preset substring value is 2-6.
The readable storage medium according to claim 16, wherein the obtaining word type words and medical meaning words from the word segmentation text, and counting the first ratio of the word type words in the word segmentation text and the second ratio of the medical meaning words in the word segmentation text, includes:

Obtain the characters whose word type words are content words and function words from the word segmentation text, and calculate the first ratio of the word type words in the word segmentation text;

Obtain medical meaning words from the word segmentation text;

Perform medical entity association on the medical meaning words according to the medical entity database;

Calculate the second ratio of the medical meaning words associated with the medical entity in the word segmentation text.
The readable storage medium according to claim 16, wherein said integrating the first ratio and the second ratio to obtain the meaning characteristics of the case comprises:

According to the ratio of 12:5, the word type words and the medical meaning words are integrated to obtain the case meaning feature f1 = (x₁, x₂, x₃, ..., x₁₇), where (x₁, x₂, x₃, ..., x₁₂) represent 12 word type words and (x₁₃, x₁₄, x₁₅, x₁₆, x₁₇) represent 5 medical meaning words.