CN111814447A

CN111814447A - Electronic case duplicate checking method and device based on word segmentation text and computer equipment

Info

Publication number: CN111814447A
Application number: CN202010592373.8A
Authority: CN
Inventors: 唐蕊
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2020-06-24
Filing date: 2020-06-24
Publication date: 2020-10-23
Anticipated expiration: 2040-06-24
Also published as: WO2021121187A1; CN111814447B

Abstract

The application relates to the field of artificial intelligence, is applicable to the field of intelligent medical treatment, and provides an electronic case duplicate checking method and device based on word segmentation text, a computer device and a storage medium. The method comprises the following steps: after word segmentation of the case text to be checked, text features and meaning features are extracted respectively, where the meaning features are built from the ratios of word type words and medical meaning words in the segmented text; text similarity and meaning similarity are then calculated from these features; the two similarities are fused according to preset weight values to obtain a final similarity; and cases similar to the case to be checked are determined by taking a preset value as the boundary. The application also relates to blockchain technology, and the case data may be stored in a blockchain. The method addresses the problem that, when the diseases corresponding to the same symptoms differ too much, duplicate checking based only on medical features finds similar cases with low accuracy.

Description

Electronic case duplicate checking method and device based on word segmentation text and computer equipment

Technical Field

The application relates to the field of artificial intelligence, in particular to an electronic case duplicate checking method and device based on word segmentation text, computer equipment and a storage medium.

Background

A case record is a systematic record of the occurrence, progression, diagnosis and treatment of a disease. With the popularization of electronic medical record systems in hospitals, electronic medical records are gradually replacing handwritten ones, making the collection and management of medical record information more convenient and faster.

On the other hand, the spread of electronic cases also makes it easier to copy and paste or plagiarize existing case texts, so a method for duplicate checking of cases is needed. In the prior art, structured analysis is generally performed on an input text to obtain the target medical features it contains and the target feature attributes corresponding to those features; historical medical records containing the features and attributes are retrieved from a case retrieval system; the semantic similarity between the input text and each historical medical record is calculated, as is the feature similarity between the target feature attributes and the feature attributes in the historical medical records; and similar cases are determined according to the semantic similarity and the feature similarity.

Disclosure of Invention

Based on the above, the application provides an electronic case duplicate checking method, device, computer equipment and storage medium based on word segmentation text, so as to solve the technical problem in the prior art that, because the diseases corresponding to the same symptoms differ too much, duplicate checking based only on medical features finds similar cases with low accuracy.

An electronic case duplication checking method based on word segmentation text, the method comprising:

performing word segmentation processing on a case to be checked input by a user to obtain a word segmentation text;

performing feature extraction on the word segmentation text according to a preset sub-string value to obtain case text features;

acquiring word type words and medical meaning words from the word segmentation texts, and counting a first ratio of the word type words in the word segmentation texts and a second ratio of the medical meaning words in the word segmentation texts;

integrating the first ratio and the second ratio to obtain case meaning characteristics;

calculating the similarity between the case to be checked and the case text in a case database according to the case text features and the case meaning features respectively to obtain text similarity and meaning similarity;

and fusing the text similarity and the meaning similarity according to a preset weight value to obtain the final similarity between the case to be checked and the case text, and taking the case text corresponding to the final similarity which is greater than the preset value as a check result.

An electronic case duplication checking device based on word segmentation text, the device comprising:

the word segmentation module is used for carrying out word segmentation on the case to be checked input by the user to obtain a word segmentation text;

the extraction module is used for carrying out feature extraction on the word segmentation text according to preset sub-string values to obtain case text features;

the ratio module is used for acquiring word type words and medical meaning words from the word segmentation text, and counting a first ratio of the word type words in the word segmentation text and a second ratio of the medical meaning words in the word segmentation text;

the integration module is used for integrating the first ratio and the second ratio to obtain case meaning characteristics;

the similarity module is used for calculating the similarity between the case to be checked and the case text in the case database according to the case text characteristics and the case meaning characteristics respectively to obtain text similarity and meaning similarity;

and the duplication checking module is used for fusing the text similarity and the meaning similarity according to a preset weight value to obtain the final similarity between the case to be checked and the case text, and taking the case text corresponding to the final similarity larger than the preset value as a duplication checking result.

A computer device comprising a memory and a processor, and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the above-mentioned segmented text-based electronic case duplication checking method when executing the computer program.

A computer-readable storage medium, in which a computer program is stored, which, when being executed by a processor, carries out the steps of the above-mentioned electronic case duplication checking method based on segmented text.

According to the electronic case duplicate checking method, device, computer equipment and storage medium based on word segmentation text, the word type words and medical meaning words of the case to be checked are obtained from the word segmentation text by statistics, the ratios of the word type words and the medical meaning words in the word segmentation text are calculated and integrated, the text similarity and the meaning similarity with respect to the case data in the case database are then calculated and fused to obtain the final similarity, and the case data meeting the similarity requirement are taken as the duplicate checking result. This solves the technical problem that, when the diseases corresponding to the same symptoms differ too much, duplicate checking based only on medical features finds similar cases with low accuracy.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive labor.

FIG. 1 is a schematic diagram of an application environment of an electronic case duplication checking method based on word segmentation text in an embodiment of the present application;

FIG. 2 is a schematic flow chart of an electronic medical record duplication checking method based on word segmentation text in the embodiment of the present application;

FIG. 3 is a schematic diagram of an electronic medical record duplication checking device based on word segmentation text in the embodiment of the application;

FIG. 4 is a schematic diagram of a computer device in one embodiment of the present application.

Detailed Description

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "including" and "having," and any variations thereof, in the description and claims of this application and the description of the above figures are intended to cover non-exclusive inclusions. The terms "first," "second," and the like in the description and claims of this application or in the above-described drawings are used for distinguishing between different objects and not for describing a particular order.

Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The electronic case duplicate checking method based on the word segmentation text provided by the embodiment of the invention can be applied to the application environment shown in figure 1. The application environment may include a terminal 102, a network for providing a communication link medium between the terminal 102 and the server 104, and a server 104, wherein the network may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.

A user may use the terminal 102 to interact with the server 104 over a network to receive or send messages, etc. The terminal 102 may have installed thereon various communication client applications, such as a web browser application, a shopping application, a search application, an instant messaging tool, a mailbox client, social platform software, and the like.

The terminal 102 may be various electronic devices having a display screen and supporting web browsing, including but not limited to a smart phone, a tablet computer, an e-book reader, an MP3 player (Moving Picture Experts Group audio Layer III, mpeg compression standard audio Layer 3), an MP4 player (Moving Picture Experts Group audio Layer IV, mpeg compression standard audio Layer 4), a laptop portable computer, a desktop computer, and the like.

The server 104 may be a server that provides various services, such as a background server that provides support for pages displayed on the terminal 102.

It should be noted that the electronic case duplication checking method based on the word segmentation text provided by the embodiment of the present application is generally executed by a server/terminal, and accordingly, the electronic case duplication checking device based on the word segmentation text is generally disposed in the server/terminal device.

The application is operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.

The application can be applied to the field of intelligent medical treatment, and therefore the construction of a smart city is promoted.

It should be understood that the number of terminals, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.

The terminal 102 communicates with the server 104 through the network. The server 104 receives the case to be checked sent by the terminal 102, performs word segmentation on it, obtains the word type words and medical meaning words of the case to be checked from the word segmentation text by statistics, calculates the ratios of the word type words and the medical meaning words in the word segmentation text, integrates these ratios, calculates the text similarity and the meaning similarity with respect to the case data in the case database, fuses them to obtain the final similarity, and returns the case data meeting the similarity requirement to the terminal 102 as the duplicate checking result. The terminal 102 and the server 104 are connected through a network, which may be a wired network or a wireless network; the terminal 102 may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices; and the server 104 may be implemented by an independent server or a server cluster formed by a plurality of servers.

In one embodiment, as shown in fig. 2, an electronic case duplication checking method based on word segmentation text is provided, which is described by taking the method applied to the server side in fig. 1 as an example, and includes the following steps:

Step 202, performing word segmentation processing on the case to be checked input by the user to obtain a word segmentation text.

The case to be checked may be electronic medical record data input by a user.

The electronic case data includes text data composed of a series of case documents, including admission records, first course-of-disease records, operation records, discharge summaries and the like.

In this embodiment, an electronic case submitted by a user through a terminal is detected and taken as the case to be checked; text information is then extracted from the electronic case and organized into a text document, and duplicate checking is performed on the text document against an electronic medical record database.

Further, after the case to be checked input by the user is detected, word segmentation processing needs to be performed on it.

In this embodiment, word segmentation may be performed on the case to be checked using an existing word segmentation technology, for example a hybrid word segmentation technology that combines rule-based word segmentation and statistical word segmentation.

The rule-based word segmentation technology mainly maintains a dictionary; when a sentence is segmented, each character string in the sentence is matched against the words in the dictionary, and if a matching word is found the string is split off as a word; otherwise it is not split.
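A minimal sketch of rule-based segmentation follows. It assumes forward maximum matching as the matching strategy; the description only states that strings are matched against a dictionary and does not name a strategy. The function name, the toy dictionary and the `max_len` parameter are illustrative assumptions.

```python
def forward_max_match(sentence, dictionary, max_len=6):
    """Greedy dictionary segmentation (forward maximum matching, an assumed strategy)."""
    tokens = []
    i = 0
    while i < len(sentence):
        # Try the longest candidate first and shrink until a dictionary word is found.
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + size]
            # A single character is always accepted so the loop makes progress.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy dictionary: "acute", "upper respiratory tract", "infection",
# "upper respiratory tract infection", "cough".
toy_dictionary = {"急性", "上呼吸道", "感染", "上呼吸道感染", "咳嗽"}
segmented = forward_max_match("急性上呼吸道感染咳嗽", toy_dictionary)
print(segmented)
```

Because the longest match is preferred, the compound term for "upper respiratory tract infection" is kept as one word rather than split into two shorter dictionary words.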

The statistics-based word segmentation technology first establishes a statistical language model, then generates candidate segmentations of a sentence, calculates the probability of each candidate, and takes the segmentation with the maximum probability as the final word segmentation result.
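The idea of choosing the maximum-probability segmentation can be sketched as below. This assumes a unigram language model (each word has an independent probability) and a dynamic-programming search; the description does not specify the model, so the probabilities, the `unknown_prob` fallback and the function name are all hypothetical.

```python
import math


def best_segmentation(sentence, word_prob, max_len=6, unknown_prob=1e-8):
    """Return the segmentation with the highest unigram probability (assumed model)."""
    n = len(sentence)
    # best[i] = (best log-probability of sentence[:i], start index of its last word)
    best = [(0.0, 0)] + [(-math.inf, 0)] * n
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = sentence[start:end]
            prob = word_prob.get(word)
            if prob is None:
                if len(word) > 1:
                    continue  # unknown multi-character strings are not treated as words
                prob = unknown_prob  # unknown single characters get a small probability
            score = best[start][0] + math.log(prob)
            if score > best[end][0]:
                best[end] = (score, start)
    # Walk back through the stored start indices to recover the words.
    tokens = []
    end = n
    while end > 0:
        start = best[end][1]
        tokens.append(sentence[start:end])
        end = start
    return tokens[::-1]


# Hypothetical unigram probabilities for illustration only.
toy_probabilities = {"急性": 0.03, "上呼吸道": 0.02, "感染": 0.05, "上呼吸道感染": 0.01}
result = best_segmentation("急性上呼吸道感染", toy_probabilities)
print(result)
```

With these toy numbers the two-word segmentation scores 0.03 × 0.01, which is higher than the three-word segmentation at 0.03 × 0.02 × 0.05, so the compound term is kept whole.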

The hybrid word segmentation technology is based mainly on statistical word segmentation and is assisted by rule-based word segmentation, thereby combining the two technologies to finally obtain the word segmentation text of the case to be checked.

Step 204, extracting the features of the word segmentation text according to a preset substring value to obtain case text features.

In order to identify cases that are literally similar to the case to be checked, a set of consecutive word strings appearing in the document is first constructed from the word segmentation text obtained after word segmentation; this set is the case text feature. The case text feature comprises at least one consecutive word string, i.e. substring element.

When each case is represented by such a set of consecutive word strings, the case to be checked and other case texts with repeated literal content (e.g., identical sentences or phrases) will share many set elements, even if the order of the sentences in the two case texts differs.

Further, each word in the word segmentation text is regarded as a character, a unique code is generated for each such character, and the text document of an electronic case is then regarded as one long character string.

Feature extraction is then performed on the word segmentation text through an n-gram algorithm according to the preset substring value to obtain the case text features, wherein the case text features comprise at least one consecutive word string arranged according to the character coding sequence, and the characters in the consecutive word string are arranged according to the order of their unique codes.

From this character string, all substrings whose length equals the preset substring value k are selected as the case text features; these features describe both whether literal elements of the text appear and, to some extent, their sequential relation. Each case text is thus represented as the set of substring elements of length k appearing in the document, i.e. the case text features.

Specifically, suppose a case text is represented after word segmentation as a string of length 6, i.e., [word 1 word 2 word 3 word 4 word 5 word 6].

For example, "this is a rare case, possibly due to a particular tissue pathology or physiological dysfunction".

After word segmentation, the text becomes: the disease is, a rare, a case, a possibility, a special, a tissue lesion, or, a physiological function, disorder, the result of

If the preset substring value is 6, the case text characteristics of the case to be checked can be obtained as follows:

"this is a rare case may be", "a rare case may be one", "a rare case may be a special", "a case may be a special", "may be a special tissue disorder", "is a special tissue disorder or", "a special tissue disorder or physiological function", and "a result of a tissue disorder or physiological function".

These substring elements form the substring element set; other case texts are processed in the same way.

Optionally, if the preset substring value k is 3, a substring element set containing 4 substring elements is obtained from the six-word string above, i.e., [word 1 word 2 word 3, word 2 word 3 word 4, word 3 word 4 word 5, word 4 word 5 word 6].
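The k = 3 case for a six-word text can be sketched as follows. The function name `extract_text_features` and the use of tuples to represent word strings are illustrative assumptions; the description only requires that every run of k consecutive words become one substring element.

```python
def extract_text_features(tokens, k=3):
    """Return the set of all word strings made of k consecutive tokens."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    # A text shorter than k words yields an empty set.
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


tokens = ["word1", "word2", "word3", "word4", "word5", "word6"]
features = extract_text_features(tokens, k=3)
for word_string in sorted(features):
    print(" ".join(word_string))
```

Using a set means a word string that occurs several times in one case text is counted once, which matches the set-based comparison used later for text similarity.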

Generally, an electronic medical record text document is represented by a word set after word segmentation, but a word set only shows whether words appear in the document and does not show the sequential relation among them. Therefore, by constructing substrings whose length is the preset substring value k (namely, k consecutive words spliced together to form one substring), the sequential relation of words is reflected to a certain degree, and the result of case duplicate checking is more accurate.

Typically, k ranges from 2 to 6. The larger the k value is set (namely, the longer the word string is), the more the word sequence information is embodied by the obtained word string; the smaller the k value is set (i.e., the shorter the word string is), the less word order information is represented by the resulting word string.

Generally, the k value is set to 3, because the k value affects the literal similarity subsequently calculated between two electronic medical record texts. If k is set larger, the word strings in the resulting word string set are longer, fewer identical word strings are shared by the two texts, and the similarity of the two electronic medical record texts is lower; if k is set smaller, the word strings are shorter, more identical word strings are shared by the two texts, and the similarity value is higher.
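The effect of k on the resulting similarity can be illustrated with a small sketch. The two sample sentences are invented for illustration, and the set-overlap measure used here is the Jaccard coefficient that the method applies in step 210; the helper names are assumptions.

```python
def k_word_strings(tokens, k):
    """Set of all strings of k consecutive tokens."""
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Two hypothetical texts that differ in a single word.
text_a = "patient reports cough and fever for three days".split()
text_b = "patient reports cough and headache for three days".split()

similarity_by_k = {
    k: jaccard(k_word_strings(text_a, k), k_word_strings(text_b, k))
    for k in (2, 3, 4)
}
for k, value in similarity_by_k.items():
    print(k, round(value, 3))
```

A single differing word breaks every word string that spans it, so the shared fraction shrinks as k grows, which is the trade-off described above.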

Therefore, if a certain sequential relation of words is embodied and the similarity of the electronic medical record text is considered in the subsequent calculation, the value of k needs to be weighed according to the actual text, and cannot be set too large or too small. Therefore, in the present embodiment, it is preferable to set the value of k to 3.

Step 206, obtaining word type words and medical meaning words from the word segmentation text, and counting a first ratio of the word type words in the word segmentation text and a second ratio of the medical meaning words in the word segmentation text.

On the basis of word segmentation of a text document of an electronic medical record, the type and medical meaning of words appearing in the text are considered, and features reflecting the content meaning of the text, namely medical meaning words, are extracted from the text.

The specific features are as follows:

I. Word type words: the word types comprise content words and function words, wherein the content words comprise nouns, verbs, adjectives, numerals, quantifiers and pronouns, and the function words comprise adverbs, prepositions, conjunctions, particles, interjections and onomatopoeia, 12 types in total. For each word type, the ratio of the number of words of that type to the total number of words in the word segmentation text is calculated; these ratios constitute the first ratio and give the features reflecting word types.

II. Medical meaning words: on the basis of word segmentation, medical entity association is carried out on words with medical meaning. A second ratio is then calculated from 5 types of data, namely the number of all medical entities appearing and the numbers of medical entities belonging to symptoms, diseases, examinations and drugs respectively, to obtain the features reflecting the medical meaning of the words.

Specifically, the content words and function words constituting the word type words are obtained from the word segmentation text, and the first ratio of the word type words in the word segmentation text is calculated.
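A minimal sketch of the first ratio follows, assuming the word segmentation text has already been part-of-speech tagged by some external tool (the description does not say how word types are assigned). The English tag names, the `WORD_TYPES` ordering and the sample tokens are illustrative assumptions.

```python
from collections import Counter

# The 12 word types: six content-word types followed by six function-word types.
WORD_TYPES = [
    "noun", "verb", "adjective", "numeral", "quantifier", "pronoun",
    "adverb", "preposition", "conjunction", "particle", "interjection", "onomatopoeia",
]


def word_type_ratios(tagged_tokens):
    """Return, for each word type, its share of the total number of words."""
    total = len(tagged_tokens)
    counts = Counter(tag for _, tag in tagged_tokens)
    return [counts[word_type] / total if total else 0.0 for word_type in WORD_TYPES]


# Hypothetical tagged tokens: (word, word type).
tagged = [("patient", "noun"), ("reports", "verb"), ("severe", "adjective"), ("cough", "noun")]
ratios = word_type_ratios(tagged)
print(dict(zip(WORD_TYPES, ratios)))
```

The result is a fixed-length list of 12 values, so texts of different lengths produce directly comparable features.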

Acquiring medical meaning words from the word segmentation text; performing medical entity association on the medical meaning words according to the medical entity library; a second ratio of the medical meaning word after association with the medical entity in the segmented text is calculated.

Entity association of the medical meaning words according to the medical entity library is performed as follows:

The medical entity library contains a plurality of medical entities. A medical name and its attribute constitute a medical entity. For example, one medical entity has the medical name "cough" and the attribute "symptom"; another has the medical name "acute upper respiratory infection" and the attribute "disease"; another has the medical name "abdominal color Doppler ultrasound" and the attribute "examination"; and another has the medical name "metformin glipizide tablets" and the attribute "drug".

The medical entity association is implemented by matching each word obtained after word segmentation of the electronic medical record text against the medical entity names in the medical entity library, thereby associating words with medical entities.

Medical entity association is performed on all words obtained after word segmentation of the electronic medical record text, and the words associated with medical entities constitute the medical meaning words.

Specifically, the second ratio is calculated from 5 types of data, namely the number of all medical entities appearing and the numbers of medical entities belonging to symptoms, diseases, examinations and drugs respectively, to obtain the features reflecting the medical meaning of the words.
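The entity association and the five medical meaning values can be sketched as below. The library is a toy dictionary built from the four entities named in the description; exact string matching, the function name and the sample tokens are assumptions. The layout (entity count first, then four per-attribute shares of that count) follows the feature definitions x₁₃ to x₁₇ given in step 208.

```python
# Toy medical entity library: medical name -> attribute (entities taken from the description).
MEDICAL_ENTITY_LIBRARY = {
    "cough": "symptom",
    "acute upper respiratory infection": "disease",
    "abdominal color Doppler ultrasound": "examination",
    "metformin glipizide tablets": "drug",
}
ATTRIBUTES = ["symptom", "disease", "examination", "drug"]


def medical_meaning_features(tokens, library=MEDICAL_ENTITY_LIBRARY):
    """Return [entity count, symptom share, disease share, examination share, drug share]."""
    # Associate each token with an entity by exact name match (an assumed matching rule).
    attributes = [library[token] for token in tokens if token in library]
    total = len(attributes)
    shares = [attributes.count(a) / total if total else 0.0 for a in ATTRIBUTES]
    return [total] + shares


tokens = ["patient", "cough", "acute upper respiratory infection", "cough",
          "metformin glipizide tablets"]
medical_features = medical_meaning_features(tokens)
print(medical_features)
```

Tokens with no match in the library, such as "patient" here, are simply ignored when counting medical entities.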

By acquiring the information of the case text from multiple dimensions and layers, the subsequent similarity calculation is more accurate.

Step 208, integrating the first ratio and the second ratio to obtain the case meaning features.

The case meaning feature is a feature vector composed of n feature values, which are the plurality of word type words and medical meaning words described above.

Specifically, the value of n in this embodiment is 17, which indicates 17 text meaning features (word type words and medical meaning words), including 12 word type words and 5 medical meaning words.

For example, the text meaning feature vector of an electronic medical record is represented as f1 = (x₁, x₂, x₃, …, x₁₇), where each x corresponds to one text meaning feature.

Here, (x₁, x₂, x₃, …, x₁₂) represent the 12 word types; e.g., x₁ represents the ratio of the number of nouns to the total number of words in the electronic medical record text (for example, the specific value of this feature is 0.1), and x₂ represents the ratio of the number of verbs to the total number of words in the electronic medical record text (for example, the specific value of this feature is 0.05); the remaining x are similar. These ratios are collectively the first ratio of word type words in the total number of words.

(x₁₃, x₁₄, x₁₅, x₁₆, x₁₇) represent the 5 medical meanings; e.g., x₁₃ represents the number of medical entities appearing (e.g. the specific value of this feature is 50), and x₁₄ represents the ratio of the number of medical entities belonging to symptoms to the total number of medical entities (e.g. the specific value of this feature is 0.3); the remaining x are similar. These values are collectively the second ratio of the medical meaning words.

This embodiment extracts ratios at two text levels across multiple dimensions, which improves the accuracy of the subsequent case duplicate checking.

Step 210, calculating the similarity between the case to be checked and the case text in the case database according to the case text features and the case meaning features respectively to obtain the text similarity and the meaning similarity.

Further, text literal feature extraction is carried out on the case to be checked and on the case data in the case database respectively to obtain a duplicate checking set and a data set; and the number of identical consecutive word strings in the duplicate checking set and the data set is counted to obtain the text similarity.

The similarity between the case meaning features of the case to be checked and those of the case data is calculated by a cosine similarity algorithm and taken as the meaning similarity.

Specifically, the duplicate checking set and the data set, namely the two text literal element sets obtained by extracting text literal features from the two electronic medical records, are obtained, and the similarity of the two sets, i.e. the literal similarity of the texts, is obtained by calculating their Jaccard similarity. The Jaccard similarity, also called the Jaccard similarity coefficient, is used to compare the similarity and difference between finite sample sets. The larger the Jaccard coefficient value, the higher the sample similarity.

The process of calculating the literal similarity of the texts of the two electronic medical records comprises the following steps:

Text literal features of the two electronic medical records are extracted to obtain two text literal element sets, the duplicate checking set A and the data set B. If the intersection of A and B contains m identical elements and the union of A and B contains n elements, the Jaccard similarity of A and B is:

Jaccard(A, B) = m/n

The literal similarity of the two electronic medical records is expressed by the Jaccard similarity. The value range of the Jaccard similarity is between 0 and 1; the closer the Jaccard similarity is to 1, the higher the similarity; the closer the Jaccard similarity is to 0, the lower the similarity.
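The Jaccard calculation can be sketched as follows. The function name and the two sample sets are illustrative; returning 0.0 when both sets are empty is an assumption, since the description does not define that case.

```python
def jaccard_similarity(set_a, set_b):
    """Jaccard(A, B) = m / n, with m = |A ∩ B| and n = |A ∪ B|."""
    n = len(set_a | set_b)
    if n == 0:
        return 0.0  # assumed convention for two empty sets
    m = len(set_a & set_b)
    return m / n


# Hypothetical word-string sets (k = 3) for a case to be checked and a database case.
check_set = {("a", "b", "c"), ("b", "c", "d"), ("c", "d", "e")}
data_set = {("b", "c", "d"), ("c", "d", "e"), ("d", "e", "f")}
text_similarity = jaccard_similarity(check_set, data_set)
print(text_similarity)
```

Here the two sets share m = 2 word strings out of n = 4 distinct ones, giving a text similarity of 0.5.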

The cosine similarity cosine(f1, f2) between the text meaning feature vectors f1 and f2 obtained from the two electronic medical record texts is calculated, and this cosine similarity expresses the similarity of the two texts in meaning.

The text meaning feature vectors of the two electronic medical records are respectively expressed as follows:

f1 = (x₁, x₂, x₃, …, x_n) and f2 = (y₁, y₂, y₃, …, y_n).

The cosine similarity between these two feature vectors can be calculated by equation (1):

$$\mathrm{cosine}(f1, f2) = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^{2}} \, \sqrt{\sum_{i=1}^{n} y_i^{2}}} \qquad (1)$$

Because all feature values are non-negative, the value range of the cosine similarity is between 0 and 1; the closer the cosine similarity is to 1, the higher the similarity; the closer the cosine similarity is to 0, the lower the similarity.
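A minimal sketch of the cosine similarity between two feature vectors follows, using the standard definition (dot product divided by the product of the Euclidean norms). The function name, the short sample vectors and the choice to return 0.0 for an all-zero vector are assumptions made for illustration.

```python
import math


def cosine_similarity(f1, f2):
    """Dot product of f1 and f2 divided by the product of their Euclidean norms."""
    if len(f1) != len(f2):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(f1, f2))
    norm1 = math.sqrt(sum(x * x for x in f1))
    norm2 = math.sqrt(sum(y * y for y in f2))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # assumed convention when a vector has no non-zero feature
    return dot / (norm1 * norm2)


# Shortened hypothetical vectors; the embodiment uses 17 features per text.
f1 = [0.1, 0.05, 0.3]
f2 = [0.2, 0.10, 0.6]
meaning_similarity = cosine_similarity(f1, f2)
print(round(meaning_similarity, 6))
```

The two sample vectors point in the same direction, so their similarity is 1 up to floating-point rounding, even though their magnitudes differ.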

And 212, fusing the text similarity and the meaning similarity according to the preset weight value to obtain the final similarity between the case to be checked and the case text, and taking the case text corresponding to the final similarity larger than the preset value as a check result.

With the preset weight value being w1 : w2, the text similarity and the meaning similarity are superposed to obtain the final similarity, wherein 0 ≤ w2 ≤ 1 and w1 + w2 ≤ 1.

Specifically, for two electronic medical record texts, one of which is the case to be checked, the text similarity sim1 (0 ≤ sim1 ≤ 1) and the meaning similarity sim2 (0 ≤ sim2 ≤ 1) are fused by a weight-based fusion method, and the preset weight value can be set according to the application scenario.

Namely, sim = w1 × sim1 + w2 × sim2

Wherein 0 ≤ sim ≤ 1, 0 ≤ w1 ≤ 1, 0 ≤ w2 ≤ 1, and w1 + w2 ≤ 1, and sim represents the final similarity of the two electronic medical record texts.

The key point of this embodiment is that, when the similarity between electronic medical records is calculated, both the literal similarity and the meaning similarity of the electronic medical records are considered.

The comprehensive consideration of the two similarities is realized by performing similarity result fusion based on weight on the results of text literal similarity and text meaning similarity.

Generally, the weight w1 corresponding to the text similarity and the weight w2 corresponding to the meaning similarity are both set to 0.5, which indicates that the two similarity results are fused with equal weight.

If the literal similarity of the text is to be given more consideration, the weight w1 of the text similarity is set larger (e.g. w1 is set to 0.6), and the weight w2 of the meaning similarity is set correspondingly smaller (e.g. w2 is set to 0.4).

Conversely, if the similarity of the text meaning is to be given more consideration, the weight w2 of the meaning similarity is set larger (e.g. w2 is set to 0.6), and the weight w1 of the text similarity is set correspondingly smaller (e.g. w1 is set to 0.4).
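The weight-based fusion can be sketched as follows. This is a minimal example under stated assumptions: the function name `fuse_similarity` is hypothetical, and the checks assume that every input lies between 0 and 1 and that the weights sum to at most 1, following the bound given in claim 7.

```python
def fuse_similarity(sim1: float, sim2: float,
                    w1: float = 0.5, w2: float = 0.5) -> float:
    """Fuse text similarity sim1 and meaning similarity sim2 by weight."""
    for name, value in (("sim1", sim1), ("sim2", sim2), ("w1", w1), ("w2", w2)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie between 0 and 1")
    # Assumption: the weights sum to at most 1 (small tolerance for rounding).
    if w1 + w2 > 1.0 + 1e-9:
        raise ValueError("w1 + w2 must not exceed 1")
    return w1 * sim1 + w2 * sim2


# Equal weighting, then a weighting that favors the literal similarity.
balanced = fuse_similarity(0.5, 0.9)
literal_heavy = fuse_similarity(0.5, 0.9, w1=0.6, w2=0.4)
print(round(balanced, 2), round(literal_heavy, 2))  # 0.7 0.66
```

With these bounds the fused value also stays between 0 and 1, so it can be compared directly with a preset value in the same range.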

The preset value may be set according to a specific application scenario and a need, and is not specifically limited in this proposal.

In the above electronic case duplicate checking method based on word segmentation text, the word type words and medical meaning words of the case to be checked are obtained from the word segmentation text, and their ratios in the word segmentation text are calculated and integrated according to a set proportion. The text similarity and the meaning similarity with respect to the case data in the case database are then calculated and fused to obtain the final similarity, and the case data meeting the similarity requirement is taken as the duplicate checking result. In this way, even when the diseases corresponding to the same symptoms differ greatly, similar cases can be found with high accuracy by checking for duplicates according to medical characteristics alone.

In one embodiment, electronic medical record duplicate checking can also be applied in both online and offline scenarios:

(1) Online electronic medical record duplication checking

The online mode means that after a doctor inputs an electronic medical record text in the electronic medical record system, the server side immediately checks the electronic medical record text for duplication.

The function prompts the doctor whether the electronic medical records input by the doctor are repeated in the electronic medical record database, and if so, the repeated electronic medical record numbers and the corresponding similarity are returned. The realization method comprises the following steps:

I. For the electronic medical record text input by a doctor, text content is first extracted from the electronic medical record text to generate a corresponding text document, and text features (including text literal features and text meaning features) are then extracted from the text document.

II. The text similarity between the text features of every electronic medical record in the electronic medical record text library and the text features of the input electronic medical record is calculated (fusing the results of the text literal similarity and the text meaning similarity).

III. The final similarity obtained by calculation is compared with a preset value. If the final similarity exceeds the preset value, the input electronic medical record is repeated in the electronic medical record database; the server prompts that the electronic medical record input by the doctor is repeated, and returns the repeated electronic medical record number and the corresponding similarity. If the final similarity does not exceed the preset value, the input electronic medical record is not repeated in the electronic medical record database, and the server can return information prompting the doctor that the input electronic medical record is not repeated.

The value range of the preset value is also between 0 and 1 (the value range of the similarity sim between the electronic medical records is also between 0 and 1).

The higher the preset value is set (closer to 1), the stricter the similarity check of the electronic medical records is, and the fewer similar medical records are returned.

The lower the preset value is set (closer to 0), the looser the similarity check of the electronic medical records is, and the more similar medical records are returned. Different preset values can be set according to different requirements.

Generally, the preset value may be set to 0.8.
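The comparison against the preset value in the online scenario can be sketched as follows. The example assumes the final similarities have already been computed; the function name, the record numbers, and the scores are hypothetical.

```python
from typing import Dict, List, Tuple


def find_duplicates(final_similarities: Dict[str, float],
                    preset_value: float = 0.8) -> List[Tuple[str, float]]:
    """Return (record number, similarity) pairs that exceed the preset value."""
    hits = [(record_id, sim)
            for record_id, sim in final_similarities.items()
            if sim > preset_value]  # "exceeds": strictly greater than
    # Most similar records first.
    return sorted(hits, key=lambda item: item[1], reverse=True)


# Hypothetical final similarities between the input record and stored records.
scores = {"EMR-001": 0.42, "EMR-002": 0.91, "EMR-003": 0.80}
duplicates = find_duplicates(scores)
if duplicates:
    print("Repeated records:", duplicates)
else:
    print("No repeated record found")
```

Raising `preset_value` toward 1 returns fewer records, and lowering it toward 0 returns more, as described above.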

(2) Offline electronic medical record duplication checking

In the electronic medical record database, each electronic medical record text is checked against the other electronic medical record texts in the database to determine whether it is repeated. The steps of checking one electronic medical record against the other electronic medical records in the database are similar to the steps of checking an input electronic medical record online.

The method can output the repeated electronic medical records existing in the database, and the numbers of the repeated electronic medical records and the corresponding similarity of the repeated electronic medical records are correspondingly given.
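The offline scenario amounts to comparing every pair of records in the database. The sketch below is illustrative only: `token_jaccard` is a simplified stand-in for the fused final similarity, and the record numbers and texts are hypothetical.

```python
from itertools import combinations
from typing import Callable, Dict, List, Tuple


def find_duplicate_pairs(records: Dict[str, str],
                         similarity: Callable[[str, str], float],
                         preset_value: float = 0.8) -> List[Tuple[str, str, float]]:
    """Compare every pair of records once and keep pairs above the preset value."""
    pairs = []
    for (id_a, text_a), (id_b, text_b) in combinations(records.items(), 2):
        sim = similarity(text_a, text_b)
        if sim > preset_value:
            pairs.append((id_a, id_b, sim))
    return pairs


def token_jaccard(text_a: str, text_b: str) -> float:
    """Stand-in similarity: Jaccard coefficient over whitespace tokens."""
    set_a, set_b = set(text_a.split()), set(text_b.split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


records = {
    "EMR-001": "fever cough headache for three days",
    "EMR-002": "fever cough headache for three days",
    "EMR-003": "fracture of left wrist after fall",
}
print(find_duplicate_pairs(records, token_jaccard))
```

Pairwise comparison grows quadratically with the number of records, which is one reason this check suits an offline batch run rather than the online path.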

The embodiment provides two application scenarios to describe in detail the specific application of the above electronic case duplication checking method based on the word segmentation text.

It should be understood that, although the steps in the flowchart of fig. 2 are shown in the order indicated by the arrows, the steps are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in fig. 2 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

In one embodiment, as shown in fig. 3, an electronic case duplicate checking device based on a segmented text is provided, and the electronic case duplicate checking device based on the segmented text corresponds to the electronic case duplicate checking method based on the segmented text in the above embodiment one to one. The electronic case duplicate checking device based on the word segmentation text comprises:

the word segmentation module 302 is configured to perform word segmentation processing on a to-be-searched disease case input by a user to obtain a word segmentation text;

the extraction module 304 is used for performing feature extraction on the word segmentation text according to preset substrings to obtain case text features;

the ratio module 306 is used for acquiring word type words and medical meaning words from the word segmentation text, and counting a first ratio of the word type words in the word segmentation text and a second ratio of the medical meaning words in the word segmentation text;

an integration module 308 for integrating the first ratio and the second ratio to obtain the case meaning characteristics;

a similarity module 310, configured to calculate similarity between a case to be reviewed and a case text in a case database according to the case text feature and the case meaning feature, respectively, to obtain text similarity and meaning similarity;

and the duplication checking module 312 is configured to fuse the text similarity and the meaning similarity according to a preset weight value to obtain a final similarity between the case to be duplicated and the case text, and use the case text corresponding to the final similarity larger than the preset value as a duplication checking result.

It should be emphasized that, in order to further ensure the privacy and security of the above-mentioned cases and case data to be reviewed, the above-mentioned cases to be reviewed may also be stored in a node of a block chain, and the case data may be distributed and not belong to the block chain.

Further, the extraction module 304 includes:

the encoding submodule is used for generating a unique code for each character in the word segmentation text;

and the extraction submodule is used for performing feature extraction on the word segmentation text according to the preset sub-string values through an n-gram algorithm to obtain case text features, wherein the case text features comprise at least one continuous word string arranged according to a character coding sequence, and characters in the continuous word string are arranged according to the size sequence of the unique code.
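The behavior of the extraction module can be sketched as a sliding window over the segmented text. This is one possible reading, not the applicant's implementation: the sketch assumes the unique code of each segmented unit is simply its position in the text, and the function name and sample tokens are hypothetical. The claims give a range of 2 to 6 for the preset substring value.

```python
from typing import List, Set


def extract_ngrams(tokens: List[str], n: int = 2) -> Set[str]:
    """Collect continuous word strings of n segmented units."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Assumption: the unique code of each unit is its position in the text.
    coded = list(enumerate(tokens))
    grams = set()
    for start in range(len(coded) - n + 1):
        # Units inside a window are kept in ascending order of their code.
        window = sorted(coded[start:start + n])
        grams.add(" ".join(token for _, token in window))
    return grams


tokens = ["fever", "cough", "headache"]
print(sorted(extract_ngrams(tokens, n=2)))  # ['cough headache', 'fever cough']
```

The resulting set of word strings is the kind of input that a set-based comparison such as the Jaccard similarity operates on.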

Further, the ratio module 306, includes:

the word submodule is used for acquiring words of the content word and function word types from the word segmentation text and calculating a first ratio of the word type words in the word segmentation text;

the meaning submodule is used for acquiring medical meaning words from the word segmentation text;

the association submodule is used for performing medical entity association on the medical meaning words according to the medical entity library;

and the calculation submodule is used for calculating a second ratio of the medical meaning words after the medical entities are associated in the word segmentation text.
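The two ratios handled by the ratio module can be sketched as simple proportions of the segmented text. The sketch makes several assumptions: it computes the first ratio for a single word-type category, it models the medical entity library as a dictionary from surface forms to entity names, and all names and sample data are hypothetical.

```python
from typing import Dict, List, Set


def first_ratio(tokens: List[str], word_type_lexicon: Set[str]) -> float:
    """Share of tokens that belong to one word-type category."""
    if not tokens:
        return 0.0
    return sum(1 for tok in tokens if tok in word_type_lexicon) / len(tokens)


def second_ratio(tokens: List[str], entity_library: Dict[str, str]) -> float:
    """Share of tokens that can be associated with a medical entity."""
    if not tokens:
        return 0.0
    # Entity association: map each known surface form to its entity name.
    linked = [entity_library[tok] for tok in tokens if tok in entity_library]
    return len(linked) / len(tokens)


tokens = ["patient", "has", "fever", "and", "cough"]
function_words = {"has", "and"}                           # hypothetical lexicon
entity_library = {"fever": "pyrexia", "cough": "cough"}   # hypothetical library

print(first_ratio(tokens, function_words))   # 0.4
print(second_ratio(tokens, entity_library))  # 0.4
```

Repeating the calculation for each word-type category and each medical category would give the components that are later integrated into the case meaning feature vector.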

Further, the similarity module 310 includes:

the set submodule is used for respectively extracting text literal features of a case to be searched and case text features of case data in a case database to obtain a search duplication set and a data set;

the text similarity submodule is used for calculating the number of the same continuous word strings in the duplicate searching set and the data set to obtain the text similarity;

and the meaning similarity submodule is used for calculating the similarity of the meaning characteristics of the case to be checked and the case data through a cosine similarity algorithm and taking the similarity as the meaning similarity.

The above electronic case duplicate checking device based on word segmentation text obtains the word type words and medical meaning words of the case to be checked from the word segmentation text, and calculates and integrates their ratios in the word segmentation text according to a set proportion. The device then calculates and fuses the text similarity and the meaning similarity with respect to the case data in the case database to obtain the final similarity, and takes the case data meeting the similarity requirement as the duplicate checking result. In this way, even when the diseases corresponding to the same symptoms differ greatly, similar cases can be found with high accuracy by checking for duplicates according to medical characteristics alone.

In one embodiment, a computer device is provided, which may be a server, the internal structure of which may be as shown in fig. 4. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used to store case data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement an electronic case duplication checking method based on segmented text.

In this embodiment, the word type words and medical meaning words of the case to be checked are obtained from the word segmentation text, and their ratios in the word segmentation text are calculated and integrated according to a set proportion. The text similarity and the meaning similarity with respect to the case data in the case database are then calculated and fused to obtain the final similarity, and the case data meeting the similarity requirement is taken as the duplicate checking result. In this way, even when the diseases corresponding to the same symptoms differ greatly, similar cases can be found with high accuracy by checking for duplicates according to medical characteristics alone.

As will be understood by those skilled in the art, the computer device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.

In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored, and the computer program when executed by a processor implements the steps of the above-mentioned embodiment of the electronic case duplication checking method based on segmented text, such as the steps 202 to 212 shown in fig. 2, or the processor implements the functions of the modules/units of the above-mentioned embodiment of the electronic case duplication checking device based on segmented text, such as the functions of the modules 302 to 312 shown in fig. 3.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

The block chain referred by the application is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism, an encryption algorithm and the like. A block chain (Blockchain), which is essentially a decentralized database, is a series of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, so as to verify the validity (anti-counterfeiting) of the information and generate a next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.

It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions.

The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.

The above-mentioned embodiments only express several embodiments of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that those skilled in the art may make several changes, modifications and equivalent substitutions of some technical features without departing from the spirit and scope of the present invention, and these changes or substitutions do not cause the essence of the corresponding technical solution to depart from the spirit and scope of the technical solutions of the embodiments of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims

1. An electronic case duplication checking method based on word segmentation text is characterized by comprising the following steps:

2. The method according to claim 1, wherein the segmented text includes a plurality of characters formed by segmented characters and/or words, and the extracting the features of the segmented text according to the preset sub-string values to obtain case text features comprises:

generating a unique code for each character in the segmented text;

and performing feature extraction on the word segmentation text according to the preset sub-string value through an n-gram algorithm to obtain case text features, wherein the case text features comprise at least one continuous word string arranged according to the character coding sequence, and the characters in the continuous word string are arranged according to the size sequence of the unique code.

3. The method of claim 2, wherein the predetermined substring value is in a range of 2-6.

4. The method of claim 1, wherein the obtaining a word type word and a medical meaning word from the segmented word text, and counting a first ratio of the word type word in the segmented word text and a second ratio of the medical meaning word in the segmented word text comprises:

acquiring words of the content word and function word types from the word segmentation text, and calculating a first ratio of the word type words in the word segmentation text;

acquiring medical meaning words from the word segmentation texts;

performing medical entity association on the medical meaning words according to a medical entity library;

a second ratio of the medical meaning word after association with the medical entity in the segmented word text is calculated.

5. The method of claim 1, wherein said integrating the first ratio and the second ratio to obtain a case meaning profile comprises:

integrating the word type words and the medical meaning words according to the proportion of 12 : 5 to obtain a case meaning characteristic f1 = (x₁, x₂, x₃, ..., x₁₇), wherein (x₁, x₂, x₃, ..., x₁₂) represents 12 word type words, and (x₁₃, x₁₄, x₁₅, x₁₆, x₁₇) represents 5 medical meaning words.

6. The method of claim 1, wherein the calculating the similarity between the case to be reviewed and the case text in the case database according to the case text feature and the case meaning feature respectively to obtain text similarity and meaning similarity comprises:

respectively extracting text literal characteristics of the case to be searched and the case text characteristics of the case data in the case database to obtain a search duplication set and a data set;

calculating the number of the same continuous word strings in the duplication checking set and the data set to obtain the text similarity;

and calculating the similarity of the meaning characteristics of the case to be checked and the case data by a cosine similarity algorithm, wherein the similarity is used as the meaning similarity.

7. The method according to claim 1, wherein the fusing the text similarity and the meaning similarity according to a preset weight value to obtain a final similarity between the case to be reviewed and the case text comprises:

with the preset weight value being w1 : w2, superposing the text similarity and the meaning similarity to obtain the final similarity, wherein 0 < w2 < 1, and w1 + w2 ≦ 1.

8. An electronic case duplicate checking device based on word segmentation text is characterized by comprising:

9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.