WO2021121187A1 - Method for detecting electronic medical case duplicates based on word segmentation, device, and computer equipment - Google Patents

Method for detecting electronic medical case duplicates based on word segmentation, device, and computer equipment

Publication number
WO2021121187A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
case
similarity
word segmentation
meaning
Prior art date
Application number
PCT/CN2020/136146
Other languages
French (fr)
Chinese (zh)
Inventor
唐蕊
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021121187A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients

Definitions

  • This application relates to the field of natural language processing in artificial intelligence, and in particular to a method, device, computer equipment and storage medium for checking duplicates of electronic medical records based on word segmentation text.
  • A case record is a systematic record of the occurrence, development, diagnosis, and treatment of a disease.
  • Electronic medical records have gradually replaced handwritten medical records, making the collection and management of case information more convenient and faster.
  • In existing approaches, the input text is generally given structured analysis to obtain the target medical features it contains and their corresponding feature attributes. Historical medical records that include those features and attributes are then retrieved from the case retrieval system. The semantic similarity between the input text and each of these historical medical records is calculated, followed by the feature similarity between the target feature attributes and the feature attributes in the historical medical record, and similar cases are finally determined from the semantic similarity and the feature similarity.
  • The inventor realized that, with this approach, the diseases corresponding to the same symptoms can differ too much, so a duplicate check based only on medical features finds similar cases with low accuracy.
  • On this basis, this application provides a method, device, computer equipment, and storage medium for checking duplicates of electronic medical records based on word segmentation text, to solve the technical problem in the prior art that, because the diseases corresponding to the same symptoms differ too much, a duplicate check based only on medical features finds similar cases with low accuracy.
  • a method for checking duplicates of electronic medical records based on word segmentation text comprising:
  • the text similarity and the meaning similarity are fused according to the preset weight values to obtain the final similarity between the case to be checked and the case text, and the case text whose final similarity is greater than the preset value is taken as the duplicate check result.
  • a device for checking duplicates of electronic medical records based on word segmentation text comprising:
  • the word segmentation module is used to perform word segmentation processing on the case to be checked input by the user to obtain the word segmentation text;
  • the extraction module is used to perform feature extraction on the word segmentation text according to the preset substring value to obtain the case text feature;
  • the ratio module is used to obtain word type words and medical meaning words from the word segmentation text, and to calculate the first ratio of the word type words in the word segmentation text and the second ratio of the medical meaning words in the word segmentation text;
  • the integration module is used to integrate the first ratio and the second ratio to obtain the case meaning feature;
  • the similarity module is used to calculate the similarity between the case to be checked and the case text in the case database according to the case text feature and the case meaning feature, respectively, to obtain the text similarity and the meaning similarity;
  • the duplicate checking module is used to fuse the text similarity and the meaning similarity according to the preset weight values to obtain the final similarity between the case to be checked and the case text, and to take the case text whose final similarity is greater than the preset value as the duplicate check result.
  • a computer device comprising a memory and a processor, and a computer program stored in the memory and capable of running on the processor.
  • the processor implements the steps of the above-mentioned method for checking duplicates of electronic medical records based on word segmentation text when the computer program is executed:
  • the text similarity and the meaning similarity are fused according to the preset weight values to obtain the final similarity between the case to be checked and the case text, and the case text whose final similarity is greater than the preset value is taken as the duplicate check result.
  • a computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, it implements the steps of the above-mentioned method for checking duplicates of electronic medical records based on word segmentation text: performing word segmentation processing on the case to be checked to obtain the word segmentation text;
  • the text similarity and the meaning similarity are fused according to the preset weight values to obtain the final similarity between the case to be checked and the case text, and the case text whose final similarity is greater than the preset value is taken as the duplicate check result.
  • With the above-mentioned method, device, computer equipment, and storage medium for checking duplicates of electronic medical records based on word segmentation text, the word type words and medical meaning words of the case to be checked are obtained from the word segmentation text, their ratios in the word segmentation text are calculated and integrated, and the text similarity and meaning similarity with the case data in the case database are calculated and fused into the final similarity.
  • The case data that meets the similarity requirement is used as the duplicate check result.
  • FIG. 1 is a schematic diagram of the application environment of the method for duplicate checking of electronic medical records based on word segmentation text in an embodiment of the application;
  • FIG. 2 is a schematic flowchart of a method for checking duplicates of electronic medical records based on word segmentation text in an embodiment of the application;
  • FIG. 3 is a schematic diagram of an electronic medical record duplicate checking device based on word segmentation text in an embodiment of the application;
  • FIG. 4 is a schematic diagram of a computer device in an embodiment of the application.
  • the method for checking duplicates of electronic medical records based on word segmentation text can be applied to the application environment as shown in FIG. 1.
  • the application environment may include the terminal 102, the network, and the server 104.
  • the network is used to provide a communication link medium between the terminal 102 and the server 104.
  • the network may include various connection types, such as wired or wireless communication links, fiber optic cables, and so on.
  • the user can use the terminal 102 to interact with the server 104 through the network to receive or send messages and so on.
  • Various communication client applications such as web browser applications, shopping applications, search applications, instant messaging tools, email clients, social platform software, etc., may be installed on the terminal 102.
  • the terminal 102 may be various electronic devices with a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, desktop computers, and so on.
  • the server 104 may be a server that provides various services, for example, a background server that provides support for pages displayed on the terminal 102.
  • the method for checking duplicate electronic medical records based on word segmentation text provided by the embodiments of the present application is generally executed by the server/terminal. Accordingly, the device for checking duplicate electronic medical records based on word segmentation text is generally set in the server/terminal device.
  • This application can be used in many general or special computer system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multi-processor systems, microprocessor-based systems, set-top boxes, programmable consumer electronic devices, network PCs, minicomputers, mainframe computers, and distributed computing environments that include any of the above systems or devices.
  • This application may be described in the general context of computer-executable instructions executed by a computer, such as a program module.
  • program modules include routines, programs, objects, components, data structures, etc. that perform specific tasks or implement specific abstract data types.
  • This application can also be practiced in distributed computing environments. In these distributed computing environments, tasks are performed by remote processing devices connected through a communication network.
  • program modules can be located in local and remote computer storage media including storage devices.
  • This application can be applied in the field of smart medical care, so as to promote the construction of smart cities.
  • The numbers of terminals, networks, and servers in FIG. 1 are merely illustrative. There can be any number of terminal devices, networks, and servers according to implementation needs.
  • the terminal 102 communicates with the server 104 through the network.
  • the server 104 receives the case to be checked sent by the terminal 102, obtains the word type words and medical meaning words of the case to be checked from the word segmentation text, calculates their ratios in the word segmentation text, and integrates the ratios. It then calculates the text similarity and meaning similarity with the case data in the case database, fuses them into the final similarity, and returns the case data that meets the similarity requirement to the terminal 102 as the duplicate check result.
  • the terminal 102 and the server 104 are connected through a network.
  • the network can be a wired network or a wireless network.
  • the terminal 102 can be, but is not limited to, various personal computers, laptops, smart phones, tablets, and portable wearable devices.
  • the server 104 can be implemented by an independent server or a cluster of multiple servers.
  • a method for checking duplicates of electronic medical records based on word segmentation text is provided. Taking the method applied to the server in FIG. 1 as an example for description, the method includes the following steps:
  • Step 202: Perform word segmentation processing on the case to be checked input by the user to obtain the word segmentation text.
  • the case to be checked may be electronic medical record data input by the user.
  • Electronic case data includes text data.
  • the text data is composed of a series of case files, including admission records, first disease course records, surgical records, discharge summaries, and so on.
  • The electronic case submitted by the user through the terminal is taken as the case to be checked. The text information of each electronic case is extracted and organized into a text document, and the duplicate check is then performed on that text document against the electronic medical record database.
  • Word segmentation processing needs to be performed on the case to be checked.
  • Existing word segmentation technology can be used to perform word segmentation processing on the case to be checked.
  • the word segmentation technology used is a hybrid word segmentation technology that combines rule-based word segmentation and statistical word segmentation.
  • the rule-based word segmentation technology mainly maintains a dictionary. When segmenting a sentence, each character string in the sentence is matched against the words in the dictionary. If a match is found, the string is segmented as a word; otherwise it is not segmented.
  • the statistics-based word segmentation technology first establishes a statistical language model, then divides the sentence into words, calculates the probability of each division result, and uses the division with the highest probability as the final word segmentation result.
  • the hybrid word segmentation technology is based on the statistical word segmentation technology, with the rule-based word segmentation technology as an auxiliary, so that both technologies are taken into account, finally yielding the word segmentation text of the case to be checked.
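  • The hybrid idea described above can be illustrated with a minimal sketch. It is not the application's implementation: it assumes a unigram language model (`word_probs`) as the statistical part, a plain set (`dictionary`) as the rule-based auxiliary, and two arbitrary fallback scores for dictionary-only words and unknown single characters. All names and constants are hypothetical.

```python
import math

def segment(sentence, word_probs, dictionary, max_len=8):
    """Return the most probable split of `sentence` under a unigram model,
    using `dictionary` as an auxiliary source of candidate words."""
    dict_logp = math.log(1e-4)     # assumed score for dictionary words missing from the model
    unknown_logp = math.log(1e-8)  # assumed penalty for an unknown single character

    def score(word):
        if word in word_probs:          # statistical model has the word
            return math.log(word_probs[word])
        if word in dictionary:          # rule-based dictionary as auxiliary
            return dict_logp
        return unknown_logp if len(word) == 1 else None

    n = len(sentence)
    # best[i] = (log-probability of the best split of sentence[:i], start of its last word)
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            s = score(sentence[start:end])
            if s is None:
                continue
            candidate = best[start][0] + s
            if candidate > best[end][0]:
                best[end] = (candidate, start)

    # Walk the back-pointers to recover the words.
    words, pos = [], n
    while pos > 0:
        start = best[pos][1]
        words.append(sentence[start:pos])
        pos = start
    return words[::-1]

tokens = segment("patienthascough", {"has": 0.05, "cough": 0.03}, {"patient"})
```

    In this toy call the model knows "has" and "cough", while "patient" is recognized only through the dictionary. An unspaced English string stands in for Chinese text here.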
  • Step 204: Perform feature extraction on the word segmentation text according to the preset substring value to obtain the case text feature.
  • The case text feature includes at least one continuous word string, that is, a substring element.
  • If each case to be checked is represented by this set of continuous word strings, then cases to be checked or other case texts with repeated literal content (for example, the same sentence or phrase) will share many set elements, even if the sentence order in the two case texts is different.
  • Each word or character in the word segmentation text is treated as a character, and a unique code is generated for each character, so that the text document of an electronic medical record is treated as one large character string.
  • The n-gram algorithm is then used to perform feature extraction on the word segmentation text according to the preset substring value to obtain the case text feature, where the case text feature includes at least one continuous word string arranged in the order of character encoding, and the characters in the continuous word string are arranged in order of the size of the unique code.
  • Each case text is represented as the set of substring elements of length k (the preset substring value) that appear in the document, that is, the obtained case text feature.
  • For example, a segmented text may read: this is, a kind, rare, case, possible, yes, a kind, special, tissue disease, or, physiological function, disorder, caused by
  • For a text of six words with k = 3, a substring element set containing 4 substring elements is obtained, namely [word1 word2 word3, word2 word3 word4, word3 word4 word5, word4 word5 word6].
  • After word segmentation, an electronic medical record text document is represented by a set of words, but this set only reflects whether the words appear in the document and does not reflect the order relationship between them. Therefore, substrings of length k (the preset substring value) are constructed by concatenating k consecutive words, which reflects the word order to a certain degree and makes the duplicate check result more accurate.
  • The value of k ranges from 2 to 6. The larger k is set (that is, the longer the word string), the more word order information the obtained word string reflects; the smaller k is set (that is, the shorter the word string), the less word order information it reflects.
  • In this embodiment the value of k is set to 3. The choice of k matters because it affects the subsequent calculation of the literal similarity between two electronic medical record texts.
  • If k is set to a larger value, the word strings in the word string set obtained from the electronic medical record text are longer, the two electronic medical records share fewer identical word strings, and the similarity of the two texts is lower; if k is set to a smaller value, the word strings are shorter, the two electronic medical record texts share more identical word strings, and the similarity value is higher.
  • The value of k therefore needs to be weighed against the actual text and must not be set too large or too small. In this embodiment, the value of k is preferably set to 3.
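  • A minimal sketch of the substring extraction with k = 3 follows. The function name `shingles` and the handling of texts shorter than k (returning the whole text as a single element) are assumptions made for illustration, and the unique-code encoding of words is omitted.

```python
def shingles(words, k=3):
    """Return the set of all runs of k consecutive words (word-level n-grams)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if len(words) < k:
        # Assumption: a text shorter than k contributes itself as one element.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

words = ["word1", "word2", "word3", "word4", "word5", "word6"]
features = shingles(words, k=3)
```

    Six words with k = 3 give the four substring elements listed in the example above. Because the result is a set, a phrase repeated in the same document is counted once.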
  • Step 206: Obtain word type words and medical meaning words from the word segmentation text, and count the first ratio of the word type words in the word segmentation text and the second ratio of the medical meaning words in the word segmentation text.
  • The type and medical meaning of the words appearing in the text are considered, and the features that reflect the connotative meaning of the text are extracted from them.
  • Word type words: the word types include content words and function words. Content words include nouns, verbs, adjectives, numerals, measure words, and pronouns. Function words include adverbs, prepositions, conjunctions, auxiliary words, interjections, and onomatopoeias. There are 12 categories in total. Calculating the first ratio of each word type in the total number of words in the word segmentation text gives the features that reflect the word types.
  • Medical meaning words: on the basis of word segmentation, words with medical meaning are associated with medical entities. Five quantities are counted: the number of all medical entities that appear, the number of medical entities that are symptoms, the number that are diseases, the number that are examinations, and the number that are drugs. The second ratio of these categories among all medical entities gives the corresponding features that reflect the medical meaning of the words.
  • The words whose word types are content words and function words are obtained from the word segmentation text, and the first ratio of the word type words in the word segmentation text is calculated.
  • the medical entity database contains a large number of medical entities.
  • a medical name and its attributes constitute a medical entity.
  • For example, one medical entity has the medical name "cough" and the attribute "symptom"; another has the medical name "acute upper respiratory infection" and the attribute "disease";
  • another has the medical name "abdominal color Doppler ultrasound" and the attribute "examination and test"; and another has the medical name "metformin glipizide tablets" and the attribute "drug".
  • the specific implementation of the medical entity association technology is to match each word obtained by segmenting an electronic medical record text against the medical entity names in the medical entity database, that is, to associate the word with a medical entity.
  • The number of all medical entities that appear is counted, together with the number of medical entities that are symptoms, diseases, examinations, and drugs. The second ratio of these categories among all medical entities is the corresponding feature that reflects the medical meaning of the words.
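  • The medical entity association can be sketched as a lookup against a small in-memory table. This is only an illustration: the dictionary `entity_db`, the category labels, and the function `medical_features` are hypothetical stand-ins for the medical entity database, and exact string matching is an assumed matching rule.

```python
from collections import Counter

# Assumed category labels for the four entity attributes named in the text.
CATEGORIES = ("symptom", "disease", "examination", "drug")

def medical_features(words, entity_db):
    """Return [entity count, symptom ratio, disease ratio, examination ratio, drug ratio]."""
    matched = [entity_db[w] for w in words if w in entity_db]  # exact-name association
    total = len(matched)
    counts = Counter(matched)
    ratios = [counts[c] / total if total else 0.0 for c in CATEGORIES]
    return [float(total)] + ratios

entity_db = {
    "cough": "symptom",
    "fever": "symptom",
    "acute upper respiratory infection": "disease",
    "abdominal color Doppler ultrasound": "examination",
    "metformin glipizide tablets": "drug",
}
words = ["patient", "cough", "fever", "acute upper respiratory infection",
         "metformin glipizide tablets"]
feats = medical_features(words, entity_db)
```

    Here four of the five words match an entity, so the sketch yields a count of 4 followed by the share of each category among the matched entities.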
  • Step 208: Integrate the first ratio and the second ratio to obtain the case meaning feature.
  • the case meaning feature is a feature vector composed of n feature values, namely the values computed for the word type words and medical meaning words described above.
  • n in this embodiment is 17, representing 17 textual meaning features: 12 for word type words and 5 for medical meaning words.
  • (x_1, x_2, x_3, ..., x_12) represents the 12 word type features.
  • x_1 represents the ratio of nouns in the total number of words in this electronic medical record text (for example, the specific value of this feature is 0.1).
  • x_2 represents the ratio of verbs in the total number of words in this electronic medical record text (for example, the specific value of this feature is 0.05); these ratios all belong to the first ratio of word type words in the total number of words.
  • The remaining x are similar. (x_13, x_14, x_15, x_16, x_17) represents the 5 medical meaning features: for example, x_13 represents the number of medical entities that appear (for example, the specific value of this feature is 50), and x_14 represents the ratio of the number of medical entities that are symptoms to the total number of medical entities (for example, the specific value of this feature is 0.3). The remaining x are similar. These ratios all belong to the second ratio of medical meaning words.
  • In this way, ratios at two textual levels are extracted across multiple dimensions, which improves the accuracy of the subsequent duplicate check.
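  • The 17-dimensional vector can be assembled as in the following sketch. It assumes the input words already carry a part-of-speech tag from some tagger and that the five medical features have been computed separately; the tag names in `WORD_TYPES` and both function names are hypothetical.

```python
from collections import Counter

# Assumed tag names for the 12 word types: 6 content-word and 6 function-word types.
WORD_TYPES = ("noun", "verb", "adjective", "numeral", "measure", "pronoun",
              "adverb", "preposition", "conjunction", "auxiliary",
              "interjection", "onomatopoeia")

def word_type_ratios(tagged_words):
    """x_1..x_12: share of each word type among all (word, tag) pairs."""
    counts = Counter(tag for _, tag in tagged_words)
    total = len(tagged_words)
    return [counts[t] / total if total else 0.0 for t in WORD_TYPES]

def meaning_vector(tagged_words, medical_feats):
    """Concatenate x_1..x_12 with the five medical features x_13..x_17."""
    if len(medical_feats) != 5:
        raise ValueError("expected 5 medical meaning features")
    return word_type_ratios(tagged_words) + list(medical_feats)

tagged = [("patient", "noun"), ("has", "verb"), ("cough", "noun"), ("and", "conjunction")]
vec = meaning_vector(tagged, [2.0, 0.5, 0.5, 0.0, 0.0])
```

    Note that x_13 is a raw count while the other components are ratios, so in practice some scaling may be needed before comparing vectors; the source does not specify any.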
  • Step 210: Calculate the similarity between the case to be checked and the case text in the case database according to the case text feature and the case meaning feature, respectively, to obtain the text similarity and the meaning similarity.
  • the cosine similarity algorithm is used to calculate the similarity between the case meaning features of the case to be checked and of the case data, as the meaning similarity.
  • The duplicate check set and the data set are obtained by extracting the text literal features of the two electronic medical records. The Jaccard similarity of these two sets is then calculated to obtain the similarity of the two sets of literal elements, that is, the text similarity.
  • Jaccard similarity, also known as the Jaccard similarity coefficient, is used to compare the similarity and difference between finite sample sets. The larger the Jaccard coefficient, the higher the sample similarity.
  • the value range of Jaccard similarity is between 0 and 1. The closer the Jaccard similarity is to 1, the higher the similarity; the closer it is to 0, the lower the similarity.
  • For the non-negative feature vectors used here, the value range of cosine similarity is between 0 and 1. The closer the cosine similarity is to 1, the higher the similarity; the closer it is to 0, the lower the similarity.
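  • Both measures are standard and can be sketched in a few lines. Jaccard similarity is |A ∩ B| / |A ∪ B| over the two substring element sets, and cosine similarity is the dot product of the two meaning vectors divided by the product of their norms. Returning 0.0 for two empty sets or a zero vector is an assumption of this sketch, not something the source specifies.

```python
import math

def jaccard(a, b):
    """Jaccard similarity of two collections treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # assumption: two empty sets are treated as dissimilar
    return len(a & b) / len(a | b)

def cosine(u, v):
    """Cosine similarity of two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: a zero vector has no direction to compare
    return dot / (norm_u * norm_v)

text_sim = jaccard({("a", "b", "c"), ("b", "c", "d")}, {("b", "c", "d"), ("c", "d", "e")})
meaning_sim = cosine([0.5, 0.25, 0.25], [0.5, 0.25, 0.25])
```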
  • Step 212: The text similarity and the meaning similarity are fused according to the preset weight values to obtain the final similarity between the case to be checked and the case text, and the case text whose final similarity is greater than the preset value is used as the duplicate check result.
  • sim = w1*sim1 + w2*sim2
  • sim represents the final similarity of the two electronic medical record texts, sim1 is the text similarity, sim2 is the meaning similarity, and w1 and w2 are their respective weights.
  • This embodiment mainly reflects that, when calculating the similarity between electronic medical records, both the literal similarity of the electronic medical record text and the similarity of its meaning are considered.
  • When the weight w1 corresponding to the text similarity and the weight w2 corresponding to the meaning similarity are both set to 0.5, the two similarity results are balanced.
  • If the literal similarity is to count for more, the weight w1 of the text similarity is set larger (for example, w1 is set to 0.6) and the weight of the meaning similarity is set correspondingly smaller (for example, w2 is set to 0.4).
  • If the meaning similarity is to count for more, the weight w2 of the meaning similarity is set larger (for example, w2 is set to 0.6) and the weight of the text similarity is set correspondingly smaller (for example, w1 is set to 0.4).
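  • The weighted fusion is a one-line computation, sketched below. The check that w1 and w2 sum to 1 is an assumption inferred from the example weight pairs (0.5/0.5, 0.6/0.4); the source does not state it as a requirement.

```python
def fuse(text_sim, meaning_sim, w1=0.5, w2=0.5):
    """Final similarity sim = w1*sim1 + w2*sim2."""
    if abs(w1 + w2 - 1.0) > 1e-9:
        # Assumption: weights sum to 1 so that sim stays between 0 and 1.
        raise ValueError("w1 and w2 are expected to sum to 1")
    return w1 * text_sim + w2 * meaning_sim

balanced = fuse(0.8, 0.6)                     # equal weights
literal_heavy = fuse(0.8, 0.6, w1=0.6, w2=0.4)  # literal similarity counts for more
```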
  • the size of the preset value can be set according to specific application scenarios and needs, and there is no specific limitation in this proposal.
  • In the above method, the word type words and medical meaning words of the case to be checked are obtained from the word segmentation text, their ratios in the word segmentation text are calculated and integrated, the text similarity and meaning similarity with the case data in the case database are calculated and fused into the final similarity, and the case data that meets the similarity requirement is used as the duplicate check result. This addresses the technical problem that, when the diseases corresponding to the same symptoms differ too much, a duplicate check based only on medical features finds similar cases with low accuracy.
  • Duplicate checking of electronic medical records can also be implemented in two application scenarios, online and offline:
  • This function will prompt the doctor whether the electronic medical record entered by the doctor is repeated in the electronic medical record database. If it is repeated, it will return the repeated electronic medical record number and the corresponding similarity.
  • the implementation method is as follows:
  • The preset value ranges between 0 and 1, as does the similarity sim between electronic medical records.
  • the default value can be set to 0.8.
  • This method outputs the duplicate electronic medical records found in the database,
  • giving the number of each duplicate electronic medical record and its corresponding similarity.
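  • The lookup described above can be sketched as a scan over the database that keeps every record whose similarity exceeds the preset value. The function names, the record numbers, and the toy `word_overlap` similarity are hypothetical; a real system would pass in the fused similarity of the method instead, and sorting the hits is an added convenience.

```python
def word_overlap(a, b):
    """Toy similarity (assumption): Jaccard overlap of whitespace-separated words."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def find_duplicates(query, database, similarity, threshold=0.8):
    """Return (record number, similarity) pairs whose similarity exceeds the threshold."""
    scored = [(rid, similarity(query, text)) for rid, text in database.items()]
    hits = [(rid, sim) for rid, sim in scored if sim > threshold]
    return sorted(hits, key=lambda h: h[1], reverse=True)  # most similar first

database = {
    "EMR-001": "cough fever three days",
    "EMR-002": "abdominal pain after meals",
}
result = find_duplicates("cough fever three days", database, word_overlap)
```

    An empty result corresponds to the case where the entered record is not repeated in the database.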
  • This embodiment provides two application scenarios to explain in detail the specific application of the above-mentioned method for duplicate checking of electronic medical records based on word segmentation text.
  • Although the steps in the flowchart of FIG. 2 are displayed in sequence as indicated by the arrows, these steps are not necessarily performed in the order indicated by the arrows. Unless specifically stated in this document, the execution of these steps is not strictly limited in order, and they can be executed in other orders. Moreover, at least part of the steps in FIG. 2 may include multiple sub-steps or stages. These sub-steps or stages are not necessarily executed at the same time and can be executed at different times. Nor are they necessarily executed sequentially; they may be performed in turn or alternately with other steps, or with at least a part of the sub-steps or stages of other steps.
  • a device for checking duplicate electronic medical records based on word segmentation text is provided.
  • the device for checking duplicate electronic medical records based on word segmentation text corresponds one-to-one to the method for checking duplicate electronic medical records based on word segmentation text in the above embodiment.
  • the device for checking duplicates of electronic medical records based on word segmentation text includes:
  • the word segmentation module 302 is used to perform word segmentation processing on the case to be checked input by the user to obtain the word segmentation text;
  • the extraction module 304 is used to perform feature extraction on the word segmentation text according to the preset substring value to obtain the case text feature;
  • the ratio module 306 is used to obtain word type words and medical meaning words from the word segmentation text, and to count the first ratio of the word type words in the word segmentation text and the second ratio of the medical meaning words in the word segmentation text;
  • the integration module 308 is used to integrate the first ratio and the second ratio to obtain the case meaning feature;
  • the similarity module 310 is used to calculate the similarity between the case to be checked and the case text in the case database according to the case text feature and the case meaning feature, respectively, to obtain the text similarity and the meaning similarity;
  • the duplicate checking module 312 is used to fuse the text similarity and the meaning similarity according to the preset weight values to obtain the final similarity between the case to be checked and the case text, and to use the case text whose final similarity is greater than the preset value as the duplicate check result.
  • the above-mentioned case to be checked can also be stored in a node of a blockchain, and the case data can be distributed and need not belong to the blockchain.
  • the extraction module 304 includes:
  • Coding sub-module used to generate a unique code for each character in the word segmentation text
  • the extraction sub-module is used for feature extraction of the segmented text according to the preset substring value through the n-gram algorithm to obtain the case text feature, wherein the case text feature includes at least one continuous word string arranged in character encoding order, and the continuous word The characters in the string are arranged in the order of the size of the unique code.
  • the ratio module 306 includes:
  • the word sub-module, used to obtain from the word segmentation text the characters whose word type is content word or function word, and to calculate the first ratio of these word type words in the word segmentation text;
  • the meaning sub-module, used to obtain medical meaning words from the word segmentation text;
  • the association sub-module, used to associate medical entities with the medical meaning words according to the medical entity database;
  • the calculation sub-module, used to calculate the second ratio of the medical meaning words in the word segmentation text after the medical entities are associated.
  • the similarity module 310 includes:
  • the collection sub-module, used to extract the text literal features of the case to be checked and of the case data in the case database to obtain the duplicate check set and the data set;
  • the text similarity sub-module, used to count the identical continuous word strings in the duplicate check set and the data set to obtain the text similarity;
  • the meaning similarity sub-module, used to calculate the similarity between the meaning features of the case to be checked and those of the case data through the cosine similarity algorithm, as the meaning similarity.
  • the above-mentioned device for checking duplicates of electronic medical records based on word segmentation text obtains the word type words and medical meaning words of the case to be checked from the word segmentation text, calculates their ratios in the word segmentation text, and integrates the ratios in a set proportion; it then calculates the text similarity and meaning similarity with the case data in the case database, fuses them to obtain the final similarity, and uses the case data that meets the similarity requirement as the duplicate check result. In this way, when the diseases corresponding to the same symptoms differ greatly, duplicate checking based only on medical features can still find similar cases with high accuracy.
  • a computer device is provided.
  • the computer device may be a server, and its internal structure diagram may be as shown in FIG. 4.
  • the computer equipment includes a processor, a memory, a network interface, and a database connected through a system bus.
  • the processor of the computer device is used to provide calculation and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system, a computer program, and a database.
  • the internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage medium.
  • the database of the computer equipment is used to store case data.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • in this embodiment, the word type words and medical meaning words of the case to be checked are obtained from the word segmentation text, their ratios in the word segmentation text are calculated and integrated in a set proportion, the text similarity and meaning similarity with the case data in the case database are calculated and fused into the final similarity, and the case data that meets the similarity requirement is used as the duplicate check result. When the diseases corresponding to the same symptoms differ greatly, duplicate checking based only on medical features can still find similar cases with high accuracy.
  • the computer device here is a device that can automatically perform numerical calculation and/or information processing in accordance with pre-set or stored instructions.
  • Its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), embedded devices, etc.
  • a computer-readable storage medium is provided, on which a computer program is stored.
  • when the computer program is executed by a processor, the steps of the method for checking duplicates of electronic medical records based on word segmentation text in the above embodiments are implemented, as shown in FIG. 2.
  • alternatively, when the processor executes the computer program, the functions of the modules/units of the device for checking duplicates of electronic medical records based on word segmentation text in the above embodiment are implemented, for example the functions of modules 302 to 312 shown in FIG. 3.
  • with this storage medium, the word type words and medical meaning words of the case to be checked are obtained from the word segmentation text, their ratios in the word segmentation text are calculated and integrated in a set proportion, the text similarity and meaning similarity with the case data in the case database are calculated and fused into the final similarity, and the case data that meets the similarity requirement is used as the duplicate check result. When the diseases corresponding to the same symptoms differ greatly, duplicate checking based only on medical features can still find similar cases with high accuracy.
  • Non-volatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
  • the blockchain referred to in this application is a new application mode of computer technology such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • A blockchain is essentially a decentralized database: a series of data blocks linked to one another by cryptographic methods, where each data block contains a batch of network transaction information that is used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.

Abstract

A method based on word segmentation for detecting duplicates of electronic medical cases, a device, computer equipment, and a storage medium, for use in the domain of smart medical treatment. The method comprises: after performing word segmentation of a proportional text, extracting text features and semantic features respectively, and calculating the respective ratios of these features in the word segmentation text; integrating the ratios to calculate degrees of text similarity and meaning similarity, then according to a preset weighting value, integrating different degrees of similarity to obtain a final degree of similarity, then determining similar cases among cases to be checked for duplicates on the basis of a preset limiting value. The present method also relates to blockchain technology, the case data being stored in a blockchain. By means of the present method, when the difference between illnesses corresponding to identical symptoms is great, simply performing duplicate checking according to medical characteristics can allow for high-accuracy determination of similar cases.

Description

Method, device and computer equipment for checking duplicates of electronic medical records based on word segmentation text
This application is based on, and claims priority from, the Chinese invention patent application No. 202010592373.8, filed on June 24, 2020 and titled "Method, device, and computer equipment for checking duplicates of electronic medical records based on word segmentation text".
Technical field
This application relates to the field of natural language processing in artificial intelligence, and in particular to a method, device, computer equipment and storage medium for checking duplicates of electronic medical records based on word segmentation text.
Background
A case record is a systematic record of the occurrence, development, diagnosis, and treatment of a disease. With the popularization of electronic medical record systems in hospitals, electronic case records have gradually replaced handwritten ones, making the collection and management of case information more convenient and faster.
On the other hand, however, the popularity of electronic case records also makes it easier to copy and paste or plagiarize existing case text, so a method for checking cases for duplicates is urgently needed. In the prior art, the input text is generally parsed into a structured form to obtain the target medical features it contains and their corresponding target feature attributes; historical medical records containing those features and attributes are retrieved from a case retrieval system; the semantic similarity between the input text and each of these historical medical records is calculated; the feature similarity between the target feature attributes and the feature attributes in the historical medical records is then calculated; and finally similar cases are determined from the semantic similarity and the feature similarity. During implementation, the inventor realized that, because the diseases corresponding to the same symptoms can differ greatly, similar cases found by duplicate checking based only on medical features suffer from the technical problem of low accuracy.
Summary of the invention
Based on this, this application provides a method, device, computer equipment, and storage medium for checking duplicates of electronic medical records based on word segmentation text, to solve the technical problem in the prior art that, because the diseases corresponding to the same symptoms differ greatly, similar cases found by duplicate checking based only on medical features have low accuracy.
A method for checking duplicates of electronic medical records based on word segmentation text, the method comprising:
performing word segmentation on the case to be checked for duplicates input by the user to obtain the word segmentation text;
performing feature extraction on the word segmentation text according to the preset substring value to obtain the case text features;
obtaining word type words and medical meaning words from the word segmentation text, and counting the first ratio of the word type words in the word segmentation text and the second ratio of the medical meaning words in the word segmentation text;
integrating the first ratio and the second ratio to obtain the case meaning features;
calculating the similarity between the case to be checked and the case text in the case database according to the case text features and the case meaning features, respectively, to obtain the text similarity and the meaning similarity;
fusing the text similarity and the meaning similarity according to the preset weight value to obtain the final similarity between the case to be checked and the case text, and using the case text whose final similarity is greater than the preset value as the duplicate check result.
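The last step above can be sketched as a weighted sum followed by a threshold. This is only an illustration under stated assumptions: the text says the two similarities are fused "according to the preset weight value" without giving a formula, so the linear form, the name `fuse_and_filter`, and the default weight and threshold below are hypothetical.

```python
from typing import Dict, List, Tuple


def fuse_and_filter(
    text_sim: Dict[str, float],
    meaning_sim: Dict[str, float],
    weight: float = 0.5,     # assumed preset weight given to the text similarity
    threshold: float = 0.8,  # assumed preset value for the final similarity
) -> List[Tuple[str, float]]:
    """Return (case_id, final_similarity) pairs whose final similarity exceeds the threshold."""
    results = []
    for case_id, t in text_sim.items():
        # A case with no meaning similarity recorded is treated as 0.0 (assumption).
        m = meaning_sim.get(case_id, 0.0)
        final = weight * t + (1.0 - weight) * m
        if final > threshold:
            results.append((case_id, final))
    # Most similar case texts first.
    return sorted(results, key=lambda pair: pair[1], reverse=True)
```

A call such as `fuse_and_filter({"a": 0.9, "b": 0.2}, {"a": 0.9, "b": 0.9})` keeps only case `"a"`, because case `"b"` fuses to 0.55, which is below the assumed threshold.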
A device for checking duplicates of electronic medical records based on word segmentation text, the device comprising:
a word segmentation module, used to perform word segmentation on the case to be checked for duplicates input by the user to obtain the word segmentation text;
an extraction module, used to perform feature extraction on the word segmentation text according to the preset substring value to obtain the case text features;
a ratio module, used to obtain word type words and medical meaning words from the word segmentation text, and to count the first ratio of the word type words in the word segmentation text and the second ratio of the medical meaning words in the word segmentation text;
an integration module, used to integrate the first ratio and the second ratio to obtain the case meaning features;
a similarity module, used to calculate the similarity between the case to be checked and the case text in the case database according to the case text features and the case meaning features, respectively, to obtain the text similarity and the meaning similarity;
a duplicate checking module, used to fuse the text similarity and the meaning similarity according to the preset weight value to obtain the final similarity between the case to be checked and the case text, and to use the case text whose final similarity is greater than the preset value as the duplicate check result.
A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the above method for checking duplicates of electronic medical records based on word segmentation text:
performing word segmentation on the case to be checked for duplicates input by the user to obtain the word segmentation text;
performing feature extraction on the word segmentation text according to the preset substring value to obtain the case text features;
obtaining word type words and medical meaning words from the word segmentation text, and counting the first ratio of the word type words in the word segmentation text and the second ratio of the medical meaning words in the word segmentation text;
integrating the first ratio and the second ratio to obtain the case meaning features;
calculating the similarity between the case to be checked and the case text in the case database according to the case text features and the case meaning features, respectively, to obtain the text similarity and the meaning similarity;
fusing the text similarity and the meaning similarity according to the preset weight value to obtain the final similarity between the case to be checked and the case text, and using the case text whose final similarity is greater than the preset value as the duplicate check result.
A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the above method for checking duplicates of electronic medical records based on word segmentation text: performing word segmentation on the case to be checked for duplicates input by the user to obtain the word segmentation text;
performing feature extraction on the word segmentation text according to the preset substring value to obtain the case text features;
obtaining word type words and medical meaning words from the word segmentation text, and counting the first ratio of the word type words in the word segmentation text and the second ratio of the medical meaning words in the word segmentation text;
integrating the first ratio and the second ratio to obtain the case meaning features;
calculating the similarity between the case to be checked and the case text in the case database according to the case text features and the case meaning features, respectively, to obtain the text similarity and the meaning similarity;
fusing the text similarity and the meaning similarity according to the preset weight value to obtain the final similarity between the case to be checked and the case text, and using the case text whose final similarity is greater than the preset value as the duplicate check result.
The above method, device, computer equipment and storage medium for checking duplicates of electronic medical records based on word segmentation text obtain the word type words and medical meaning words of the case to be checked from the word segmentation text, calculate their ratios in the word segmentation text, and integrate the ratios in a set proportion; they then calculate the text similarity and meaning similarity with the case data in the case database, fuse them to obtain the final similarity, and use the case data that meets the similarity requirement as the duplicate check result. In this way, when the diseases corresponding to the same symptoms differ greatly, duplicate checking based only on medical features can still find similar cases with high accuracy.
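The ratio and meaning-similarity part of this summary can be sketched as follows. The sketch assumes that the word type words and medical meaning words are available as plain vocabularies and that the case meaning feature is simply the two-element vector of the first and second ratios; the text does not fix either detail, and the names `ratio`, `meaning_feature` and `cosine_similarity` are hypothetical.

```python
import math
from typing import List, Sequence, Set


def ratio(tokens: Sequence[str], vocabulary: Set[str]) -> float:
    """Share of tokens that belong to the given vocabulary (0.0 for empty input)."""
    if not tokens:
        return 0.0
    return sum(1 for tok in tokens if tok in vocabulary) / len(tokens)


def meaning_feature(
    tokens: Sequence[str],
    word_type_words: Set[str],  # assumed list of content/function words of interest
    medical_words: Set[str],    # assumed list of medical meaning words
) -> List[float]:
    """Integrate the first and second ratios into one feature vector (assumed layout)."""
    return [ratio(tokens, word_type_words), ratio(tokens, medical_words)]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Because cosine similarity compares the direction of the vectors rather than their length, two cases with the same proportions of word type words and medical meaning words score close to 1 regardless of how long the texts are.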
Description of the drawings
To explain the technical solutions of the embodiments of this application more clearly, the drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of this application; those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram of the application environment of the method for checking duplicates of electronic medical records based on word segmentation text in an embodiment of this application;
FIG. 2 is a schematic flowchart of the method for checking duplicates of electronic medical records based on word segmentation text in an embodiment of this application;
FIG. 3 is a schematic diagram of the device for checking duplicates of electronic medical records based on word segmentation text in an embodiment of this application;
FIG. 4 is a schematic diagram of a computer device in an embodiment of this application.
Detailed description
Unless otherwise defined, all technical and scientific terms used herein have the same meanings as commonly understood by those skilled in the technical field of this application. The terms used in the specification are only for the purpose of describing specific embodiments and are not intended to limit this application. The terms "including" and "having" and any variations thereof in the specification, the claims, and the above description of the drawings are intended to cover non-exclusive inclusion. The terms "first", "second", etc. in the specification, the claims, or the above drawings are used to distinguish different objects, not to describe a specific order.
Reference herein to an "embodiment" means that a specific feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of this application. The appearance of this phrase in various places in the specification does not necessarily refer to the same embodiment, nor to an independent or alternative embodiment that is mutually exclusive with other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein can be combined with other embodiments.
To make the objectives, technical solutions, and advantages of this application clearer, the application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain this application and not to limit it. All other embodiments obtained by those of ordinary skill in the art based on the embodiments in this application without creative effort fall within the protection scope of this application.
The method for checking duplicates of electronic medical records based on word segmentation text provided in the embodiments of this application can be applied in the application environment shown in FIG. 1. The application environment may include a terminal 102, a network, and a server 104. The network provides the medium for the communication link between the terminal 102 and the server 104, and may include various connection types, such as wired or wireless communication links or fiber optic cables.
A user can use the terminal 102 to interact with the server 104 through the network to receive or send messages. Various communication client applications may be installed on the terminal 102, such as web browsers, shopping applications, search applications, instant messaging tools, email clients, and social platform software.
The terminal 102 may be any of various electronic devices that have a display screen and support web browsing, including but not limited to smartphones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, and desktop computers.
The server 104 may be a server that provides various services, for example a background server that supports the pages displayed on the terminal 102.
It should be noted that the method for checking duplicates of electronic medical records based on word segmentation text provided by the embodiments of this application is generally executed by the server/terminal; accordingly, the device for checking duplicates of electronic medical records based on word segmentation text is generally provided in the server/terminal device.
This application can be used in many general-purpose or special-purpose computer system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multi-processor systems, microprocessor-based systems, set-top boxes, programmable consumer electronic devices, network PCs, minicomputers, mainframe computers, and distributed computing environments that include any of the above systems or devices. This application may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and so on that perform specific tasks or implement specific abstract data types. This application can also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules can be located in both local and remote computer storage media, including storage devices.
This application can be applied in the field of smart medical care, thereby promoting the construction of smart cities.
It should be understood that the numbers of terminals, networks, and servers in FIG. 1 are merely illustrative. There can be any number of terminal devices, networks, and servers according to implementation needs.
The terminal 102 communicates with the server 104 through the network. The server 104 receives the case to be checked for duplicates sent by the terminal 102, obtains the word type words and medical meaning words of the case to be checked from the word segmentation text, calculates their ratios in the word segmentation text, and integrates the ratios in a set proportion; it then calculates the text similarity and meaning similarity with the case data in the case database, fuses them to obtain the final similarity, and returns the case data that meets the similarity requirement to the terminal 102 as the duplicate check result. The terminal 102 and the server 104 are connected through a network, which can be wired or wireless. The terminal 102 can be, but is not limited to, various personal computers, laptops, smartphones, tablets, and portable wearable devices, and the server 104 can be implemented as an independent server or as a server cluster composed of multiple servers.
In one embodiment, as shown in FIG. 2, a method for checking duplicates of electronic medical records based on word segmentation text is provided. Taking the method as applied to the server in FIG. 1 as an example, it includes the following steps:
Step 202: perform word segmentation on the case to be checked for duplicates input by the user to obtain the word segmentation text.
The case to be checked may be electronic medical record data input by the user.
Electronic case data includes text data, which consists of a series of case documents, including admission records, first disease course records, surgical records, discharge summaries, and so on.
In this embodiment, an electronic case submitted by the user through the terminal is detected and taken as the case to be checked; the text information of each electronic case is extracted and organized into one text document; and duplicate checking is then performed for that text document in the electronic medical record database.
Further, once the case to be checked input by the user has been detected, word segmentation needs to be performed on it.
In this embodiment, existing word segmentation techniques can be used for this purpose; for example, the technique used may be a hybrid word segmentation technique that combines rule-based segmentation and statistical segmentation.
Rule-based word segmentation mainly relies on maintaining a dictionary: when a sentence is segmented, each character string in the sentence is matched against the words in the dictionary; if a match is found, the word is split off, otherwise no split is made.
Statistics-based word segmentation first builds a statistical language model, then divides the sentence into words, calculates the probability of each candidate division, and takes the division with the highest probability as the final word segmentation result.
Hybrid word segmentation builds on statistical word segmentation and uses rule-based word segmentation as an aid, thereby combining the two techniques, and finally yields the word segmentation text of the case to be checked.
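The rule-based half of this description can be illustrated with forward maximum matching, one common dictionary-based strategy. The text does not name a specific matching strategy, so the choice of forward maximum matching, the function name `forward_max_match`, the `max_len` parameter, and the single-character fallback for unmatched text are all assumptions made for this sketch.

```python
from typing import List, Set


def forward_max_match(sentence: str, dictionary: Set[str], max_len: int = 4) -> List[str]:
    """Greedy dictionary segmentation: at each position take the longest dictionary word."""
    tokens: List[str] = []
    i = 0
    while i < len(sentence):
        # Try the longest candidate first and shrink until a dictionary word matches.
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + size]
            # Fallback (assumption): an unmatched single character becomes its own token.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical three-word medical dictionary.
demo_dictionary = {"组织病变", "生理功能", "紊乱"}
demo_tokens = forward_max_match("组织病变或生理功能紊乱", demo_dictionary)
```

In a hybrid setup of the kind described, output like this would serve only as an aid to the statistical segmenter, for instance to force known medical terms to stay in one piece.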
Step 204: perform feature extraction on the word segmentation text according to the preset substring value to obtain the case text features.
To identify cases to be checked that are literally similar, a set of the continuous word strings that appear in the document is first built from the word segmentation text. This set is the case text feature; it contains continuous word strings, that is, substring elements, and contains at least one of them.
If every case to be checked is represented by such a set of continuous word strings, then cases to be checked, or other case texts, that share literal content (for example, the same sentences or phrases) will have many set elements in common, and this holds even if the sentences appear in a different order in the two case texts.
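The claim about reordered sentences can be checked with a small sketch. It assumes the continuous word strings are windows of `k` consecutive tokens over the whole document; the names `shingles` and `shared_shingle_count` and the English toy tokens are hypothetical.

```python
from typing import List, Set, Tuple


def shingles(tokens: List[str], k: int) -> Set[Tuple[str, ...]]:
    """Set of all runs of k consecutive tokens in the document."""
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def shared_shingle_count(tokens_a: List[str], tokens_b: List[str], k: int) -> int:
    """Number of continuous word strings that the two documents have in common."""
    return len(shingles(tokens_a, k) & shingles(tokens_b, k))


sentence_1 = ["patient", "fever", "three", "days"]
sentence_2 = ["with", "cough"]
doc_a = sentence_1 + sentence_2   # original sentence order
doc_b = sentence_2 + sentence_1   # same sentences, swapped
common = shared_shingle_count(doc_a, doc_b, k=2)
```

With `k=2`, each document has five windows and four of them are shared; only the window that straddles the sentence boundary differs, which is why the set representation tolerates reordering.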
进一步地,将分词文本中的每一个词或者字看做是一个字符,为每一个字符串生成一个唯一的编码,那么一个电子病例的文本文档被看做是一个大的字符串。Further, each word or character in the word segmentation text is regarded as a character, and a unique code is generated for each character string, then the text document of an electronic medical record is regarded as a large character string.
然后再通过n-gram算法根据预设子串值对分词文本进行特征提取,得到病例文本特征,其中,病例文本特征包括至少一个按字符编码顺序排列的连续词语串,且连续词语串中的字符按照唯一编码的大小顺序排列。Then use the n-gram algorithm to perform feature extraction on the segmented text according to the preset substring value to obtain the case text feature, where the case text feature includes at least one continuous word string arranged in the order of character encoding, and the characters in the continuous word string They are arranged in order of the size of the unique code.
在这个字符串中,选择所有预设子串值为k的子串作为病例文本特征,该特征描述了文本字面元素是否出现以及一定顺序关系的特性。每个病例文本被表示成文档中出现的长度为预设子串值k的子串元素集合,即得到的病例文本特征。In this string, select all substrings with a preset substring value of k as the case text feature, which describes whether the text literal elements appear or not and the characteristics of a certain order relationship. Each case text is represented as a set of substring elements with a length of the preset substring value k appearing in the document, that is, the obtained case text feature.
Specifically, suppose a case text is represented after word segmentation as a string of length 6, i.e. [word1 word2 word3 word4 word5 word6].
For example, take the sentence “这是一种罕见病例,可能是一种特殊的组织病变或生理功能紊乱所致” ("This is a rare case, possibly caused by a particular tissue lesion or a disorder of physiological function").
After word segmentation it becomes: 这是、一种、罕见、病例、可能、是、一种、特殊、的、组织病变、或、生理功能、紊乱、所致 (14 words).
If the preset substring value is 6, the case text feature of this case to be checked is:
“这是一种罕见病例可能是”, “一种罕见病例可能是一种”, “罕见病例可能是一种特殊”, “病例可能是一种特殊的”, “可能是一种特殊的组织病变”, “是一种特殊的组织病变或”, “一种特殊的组织病变或生理功能”, “特殊的组织病变或生理功能紊乱” and “的组织病变或生理功能紊乱所致”.
These nine substring elements together form the substring element set; other case texts are processed in the same way.
Alternatively, if the preset substring value is k=3, the six-word string above yields a set of 4 substring elements: [word1 word2 word3, word2 word3 word4, word3 word4 word5, word4 word5 word6].
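The substring extraction in the two examples above can be sketched in a few lines. The function name `extract_substrings` is hypothetical, and representing each substring element as a tuple of words (rather than as coded characters) is an assumption made for readability.

```python
def extract_substrings(tokens, k):
    """Return the set of all runs of k consecutive words (word-level n-grams)."""
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


# The six-word string with k=3.
six_words = ["word1", "word2", "word3", "word4", "word5", "word6"]
features_k3 = extract_substrings(six_words, 3)

# The segmented example sentence with k=6.
segmented = ["这是", "一种", "罕见", "病例", "可能", "是", "一种",
             "特殊", "的", "组织病变", "或", "生理功能", "紊乱", "所致"]
features_k6 = extract_substrings(segmented, 6)
```

A text with N words yields at most N - k + 1 elements; repeated runs collapse into one element because the result is a set.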
In general, the text document of an electronic medical record is represented by the set of its words after segmentation, but such a set only records whether each word occurs in the document and does not reflect the order of the words. Constructing substrings of length k (that is, concatenating k consecutive words into one substring) captures word order to some degree, which makes the results of medical record duplicate checking more accurate.
In general, k takes a value between 2 and 6. The larger k is (the longer the word strings), the more word-order information the word strings carry; the smaller k is (the shorter the word strings), the less word-order information they carry.
k is usually set to 3 because the choice of k affects the subsequent calculation of the literal similarity between two electronic medical record texts. In general, the larger k is, the longer the word strings in each record's word-string set, the fewer word strings the two records share, and the lower their similarity; the smaller k is, the shorter the word strings, the more word strings the two records share, and the higher their similarity.
Therefore, to capture a reasonable amount of word order while keeping the subsequent similarity calculation meaningful, k must be chosen according to the actual texts and should be neither too large nor too small. In this embodiment, k is preferably set to 3.
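The trade-off in choosing k can be seen on a toy pair of records that differ in a single word. The sketch below is illustrative only: the records and helper names are hypothetical, and the similarity used is the set-overlap ratio (intersection size divided by union size).

```python
def substrings(tokens, k):
    """All runs of k consecutive words, as a set of tuples."""
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def overlap_ratio(a, b):
    """Shared elements divided by all distinct elements of the two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Two hypothetical records that differ only in the fourth word.
record_a = ["w1", "w2", "w3", "w4", "w5", "w6"]
record_b = ["w1", "w2", "w3", "wX", "w5", "w6"]

similarity_by_k = {
    k: overlap_ratio(substrings(record_a, k), substrings(record_b, k))
    for k in (2, 3, 4)
}
```

On this pair the similarity falls from 3/7 at k=2 to 1/7 at k=3 and to 0 at k=4, because a single differing word invalidates every substring that spans it.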
Step 206: Obtain word-type words and medical-meaning words from the segmented text, and compute the first ratio of the word-type words in the segmented text and the second ratio of the medical-meaning words in the segmented text.
Building on the word segmentation of the electronic medical record text, the types and medical meanings of the words in the text are considered, and features that reflect the underlying meaning of the text, i.e. the medical-meaning words, are extracted from them.
The specific features are as follows:
I. Word-type words: words are either content words or function words. Content words comprise nouns, verbs, adjectives, numerals, measure words and pronouns; function words comprise adverbs, prepositions, conjunctions, particles, interjections and onomatopoeia, giving 12 categories in total. Computing, for each word type, its share of the total number of words in the segmented text (the first ratio) yields the features that reflect word type.
II. Medical-meaning words: on the basis of the word segmentation, words with medical meaning are linked to medical entities. Five quantities are computed: the number of all medical entities that occur, and the numbers of medical entities that are symptoms, diseases, examinations and tests, and drugs, expressed relative to all medical entities (the second ratio). These yield the features that reflect the medical meaning of the words.
Specifically, the words whose word type is a content word or a function word are obtained from the segmented text, and the first ratio of the word-type words in the segmented text is computed.
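The first ratio can be sketched as a per-type share of the total word count. The sketch assumes the words already carry part-of-speech tags; the tag names in `WORD_TYPES`, the function name, and the sample data are hypothetical, since the text does not name a tagger or a tag set.

```python
from collections import Counter

# The 12 word types listed above, under hypothetical tag names.
WORD_TYPES = [
    "noun", "verb", "adjective", "numeral", "measure_word", "pronoun",
    "adverb", "preposition", "conjunction", "particle", "interjection",
    "onomatopoeia",
]


def word_type_ratios(tagged_tokens):
    """Return 12 ratios: count of each word type divided by the total word count."""
    counts = Counter(tag for _, tag in tagged_tokens)
    total = len(tagged_tokens)
    return [counts[t] / total if total else 0.0 for t in WORD_TYPES]


# Hypothetical tagged words.
tagged = [("cough", "noun"), ("worsens", "verb"),
          ("severe", "adjective"), ("fever", "noun")]
ratios = word_type_ratios(tagged)
```

Words tagged with anything outside the 12 types still count toward the total in this sketch, so the ratios need not sum to 1 on real data.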
Medical-meaning words are obtained from the segmented text; the medical-meaning words are linked to medical entities according to a medical entity library; and the second ratio of the linked medical-meaning words in the segmented text is computed.
Linking medical-meaning words to entities according to the medical entity library works as follows:
The medical entity library contains a large number of medical entities. A medical entity consists of a medical name and its attribute. For example, one medical entity has the medical name "cough" and the attribute "symptom"; another has the medical name "acute upper respiratory tract infection" and the attribute "disease"; another has the medical name "abdominal color Doppler ultrasound" and the attribute "examination and test"; and another has the medical name "metformin and glipizide tablets" and the attribute "drug".
Medical entity linking is implemented by matching each word of a segmented electronic medical record text against the medical entity names in the medical entity library; when a word matches a name, the word is linked to that medical entity.
Every word of a segmented electronic medical record text is put through medical entity linking, and the medical-meaning words are built from the words that link to a medical entity.
Specifically, the number of all medical entities that occur is counted, together with the numbers of medical entities that are symptoms, diseases, examinations and tests, and drugs; from these five quantities the second ratio over all medical entities is obtained, giving the features that reflect the medical meaning of the words.
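Entity linking by name matching, followed by the five medical-meaning features, might look like the sketch below. It makes several assumptions that the text leaves open: the library is modelled as a plain dictionary from name to attribute, every occurrence of a matched word counts as one entity, and the first feature is kept as a raw count while the other four are shares of that count. All names are hypothetical.

```python
# Hypothetical miniature entity library built from the examples in the text.
ENTITY_LIBRARY = {
    "cough": "symptom",
    "acute upper respiratory tract infection": "disease",
    "abdominal color Doppler ultrasound": "examination",
    "metformin and glipizide tablets": "drug",
}

ATTRIBUTES = ["symptom", "disease", "examination", "drug"]


def medical_features(tokens, library):
    """Return [entity count, symptom share, disease share, examination share, drug share]."""
    linked = [library[t] for t in tokens if t in library]  # exact name match
    total = len(linked)
    shares = [linked.count(a) / total if total else 0.0 for a in ATTRIBUTES]
    return [total] + shares


tokens = ["patient", "reports", "cough", "cough",
          "metformin and glipizide tablets"]
features = medical_features(tokens, ENTITY_LIBRARY)
```

Exact matching misses synonyms and abbreviations; a production entity library would presumably need aliases, which the text does not describe.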
Obtaining information about the case text from multiple dimensions and levels in this way makes the subsequent similarity calculation more accurate.
Step 208: Integrate the first ratio and the second ratio to obtain the case meaning feature.
The case meaning feature is a feature vector of n feature values, namely the word-type and medical-meaning features described above.
Specifically, n is 17 in this embodiment, corresponding to 17 text-meaning features: 12 word-type features and 5 medical-meaning features.
For example, the text-meaning feature vector of an electronic medical record is written f1 = (x_1, x_2, x_3, …, x_17), where each x_i corresponds to one text-meaning feature.
Here (x_1, x_2, x_3, …, x_12) represent the 12 word-type features. For example, x_1 is the ratio of nouns to the total number of words in the record text (e.g. 0.1) and x_2 is the ratio of verbs to the total number of words in the record text (e.g. 0.05); together these ratios make up the first ratio of word-type words in the total number of words.
The remaining components are analogous. (x_13, x_14, x_15, x_16, x_17) represent the 5 medical-meaning features. For example, x_13 is the number of medical entities that occur (e.g. 50) and x_14 is the ratio of the symptom entities that occur to the total number of medical entities (e.g. 0.3); the rest are analogous. Together these values make up the second ratio, that of the medical-meaning words.
By extracting ratios at two textual levels across multiple dimensions, this embodiment improves the accuracy of subsequent medical record duplicate checking.
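Assembling the 17-dimensional case meaning feature is a simple concatenation, sketched below with the example values from the text. The function name and the placeholder zeros for the unspecified components are assumptions.

```python
def build_meaning_vector(word_type_features, medical_features):
    """Concatenate 12 word-type features and 5 medical-meaning features."""
    if len(word_type_features) != 12 or len(medical_features) != 5:
        raise ValueError("expected 12 word-type and 5 medical-meaning features")
    return list(word_type_features) + list(medical_features)


# x_1 = 0.1 (nouns), x_2 = 0.05 (verbs); other word-type values are placeholders.
word_type_features = [0.1, 0.05] + [0.0] * 10
# x_13 = 50 (entity count), x_14 = 0.3 (symptom share); the rest are placeholders.
medical = [50, 0.3, 0.0, 0.0, 0.0]

f1 = build_meaning_vector(word_type_features, medical)
```

In this sketch x_13 is a raw count and therefore much larger in magnitude than the ratio components; the text does not describe any rescaling, so none is applied here.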
Step 210: Compute the similarity between the case to be checked and the case texts in the case database according to the case text feature and the case meaning feature respectively, to obtain the text similarity and the meaning similarity.
Further, literal text features are extracted from the case text features of the case to be checked and of the case data in the case database, giving a duplicate-check set and a data set; the text similarity is obtained by counting the consecutive word strings that the duplicate-check set and the data set have in common.
The cosine similarity algorithm is then used to compute the similarity between the case meaning features of the case to be checked and of the case data, which serves as the meaning similarity.
Specifically, for the duplicate-check set and the data set obtained by literal text feature extraction from two electronic medical records, i.e. two sets of literal text elements, the Jaccard similarity of the two sets is computed and taken as the literal similarity of the texts. The Jaccard similarity, also called the Jaccard similarity coefficient, is used to compare the similarity and difference between finite sample sets; the larger the Jaccard coefficient, the more similar the samples.
The literal similarity of two electronic medical record texts is computed as follows:
Literal text feature extraction on the two records yields two sets of literal text elements, the duplicate-check set A and the data set B. If the intersection of A and B contains m elements and the union of A and B contains n elements, the Jaccard similarity of A and B is:
Jaccard(A,B)=m/n
This Jaccard similarity represents the literal similarity of the two electronic medical record texts. The Jaccard similarity lies between 0 and 1: the closer it is to 1, the higher the similarity; the closer it is to 0, the lower the similarity.
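The Jaccard calculation is short enough to sketch directly. The function name and the convention of returning 0.0 for two empty sets are assumptions; the text does not cover the empty case.

```python
def jaccard(set_a, set_b):
    """Jaccard(A, B) = |A ∩ B| / |A ∪ B|, i.e. m / n in the notation above."""
    set_a, set_b = set(set_a), set(set_b)
    union = set_a | set_b
    if not union:
        return 0.0  # assumed convention for two empty sets
    return len(set_a & set_b) / len(union)


# Hypothetical substring element sets for two records.
check_set = {("w1", "w2", "w3"), ("w2", "w3", "w4"), ("w3", "w4", "w5")}
data_set = {("w2", "w3", "w4"), ("w3", "w4", "w5"), ("w4", "w5", "w6")}
text_similarity = jaccard(check_set, data_set)
```

Here the intersection has m=2 elements and the union has n=4, so the text similarity is 0.5.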
For the text-meaning feature vectors f1 and f2 obtained from two electronic medical record texts, the cosine similarity cosine(f1, f2) between the two vectors is computed and used to represent the similarity in meaning of the two texts.
The text-meaning feature vectors of the two electronic medical records are written:
f1 = (x_1, x_2, x_3, …, x_n) and f2 = (y_1, y_2, y_3, …, y_n).
The cosine similarity between these two feature vectors can be calculated by formula (1):
$$\mathrm{cosine}(f1, f2) = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^{2}}\,\sqrt{\sum_{i=1}^{n} y_i^{2}}} \qquad (1)$$
Since all feature values are non-negative, the cosine similarity lies between 0 and 1: the closer it is to 1, the higher the similarity; the closer it is to 0, the lower the similarity.
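The cosine similarity of two feature vectors can be sketched with the standard library alone. The function name and the choice to return 0.0 when either vector is all zeros are assumptions.

```python
import math


def cosine(f1, f2):
    """Dot product of f1 and f2 divided by the product of their Euclidean norms."""
    if len(f1) != len(f2):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(f1, f2))
    norm1 = math.sqrt(sum(x * x for x in f1))
    norm2 = math.sqrt(sum(y * y for y in f2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # assumed convention for an all-zero vector
    return dot / (norm1 * norm2)


# Hypothetical three-component vectors, shortened from 17 for readability.
meaning_similarity = cosine([0.1, 0.05, 0.3], [0.2, 0.1, 0.6])
```

The second vector is exactly twice the first, so the two point in the same direction and the similarity is 1 up to floating-point rounding.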
Step 212: Fuse the text similarity and the meaning similarity according to preset weights to obtain the final similarity between the case to be checked and each case text, and take the case texts whose final similarity exceeds a preset value as the duplicate check result.
The text similarity and the meaning similarity are combined as a weighted sum with preset weights w1:w2 to obtain the final similarity, where 0<=w2<=1 and w1+w2=1.
Specifically, for two electronic medical record texts, one of which is the case to be checked, with text similarity sim1 (0<=sim1<=1) and meaning similarity sim2 (0<=sim2<=1), a weight-based fusion method combines the two similarity results; the preset weights can be set according to the application scenario.
That is, sim=w1*sim1+w2*sim2
where 0<=sim<=1, 0<=w1<=1, 0<=w2<=1 and w1+w2=1; sim denotes the final similarity of the two electronic medical record texts.
The key point of this embodiment is that, when computing the similarity between electronic medical records, both the literal similarity of the texts and the similarity of their meaning are taken into account.
The two are combined through weight-based fusion of the literal similarity and meaning similarity results.
Typically, the weight w1 of the text similarity and the weight w2 of the meaning similarity are both set to 0.5, so that the two similarity results contribute equally.
If the literal similarity of the texts should count for more, the weight w1 of the text similarity is set larger (for example 0.6) and the weight w2 of the meaning similarity correspondingly smaller (for example 0.4).
Conversely, if the similarity in meaning should count for more, the weight w2 of the meaning similarity is set larger (for example 0.6) and the weight w1 of the text similarity correspondingly smaller (for example 0.4).
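The weighted fusion can be sketched as a one-line formula with input checks. The function name is hypothetical; only w1 is passed in, and w2 is derived from the constraint w1+w2=1.

```python
def fuse(sim1, sim2, w1=0.5):
    """Final similarity sim = w1*sim1 + w2*sim2 with w2 = 1 - w1."""
    if not 0.0 <= w1 <= 1.0:
        raise ValueError("w1 must lie in [0, 1]")
    w2 = 1.0 - w1
    return w1 * sim1 + w2 * sim2


# Hypothetical text similarity 0.8 and meaning similarity 0.6.
balanced = fuse(0.8, 0.6)              # w1 = w2 = 0.5
literal_heavy = fuse(0.8, 0.6, w1=0.6)  # w1 = 0.6, w2 = 0.4
```

With these inputs the balanced result is 0.7 and the literal-heavy result is 0.72. Because sim is a convex combination of sim1 and sim2, it always stays between the two and therefore within 0 and 1.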
The preset value can be set according to the specific application scenario and requirements, and is not specifically limited in this proposal.
In the above method for detecting electronic medical case duplicates based on segmented text, the word-type words and medical-meaning words of the case to be checked are obtained from the segmented text, and their ratios in the segmented text are computed and integrated. The text similarity and meaning similarity with the case data in the case database are then computed and fused into a final similarity, and the case data meeting the similarity requirement is taken as the duplicate check result. In this way, even when the diseases associated with the same symptoms differ greatly, duplicate checking based only on medical features can still find similar cases with high accuracy.
In one embodiment, electronic medical record duplicate checking can also be applied in two scenarios, online and offline:
(1) Online electronic medical record duplicate checking
Online means that as soon as a doctor enters an electronic medical record text into the electronic medical record system, the server checks that text for duplicates.
This function tells the doctor whether the entered electronic medical record duplicates a record in the electronic medical record database and, if so, returns the numbers of the duplicated records and the corresponding similarities. It is implemented as follows:
I. For the electronic medical record text entered by the doctor, first extract the text content to generate the corresponding text document, then extract the text features (literal text features and text-meaning features) from the text document.
II. Compute the similarity between the text features of every electronic medical record in the electronic medical record text library and the text features of the entered record (the result of fusing the literal similarity and the meaning similarity).
III. Compare the computed final similarities with the preset value. If any electronic medical record exceeds the preset value, the entered record duplicates a record in the electronic medical record database; the server informs the doctor that the entered record is a duplicate and returns the numbers of the duplicated records and their similarities. If no electronic medical record exceeds the preset value, the entered record has no duplicate in the electronic medical record data, and the server informs the doctor accordingly.
The preset value lies between 0 and 1, as does the similarity sim between electronic medical records.
The higher the preset value (the closer to 1), the stricter the similarity criterion and the fewer similar records are returned.
The lower the preset value (the closer to 0), the looser the similarity criterion and the more similar records are returned. Different preset values can be set for different needs.
Typically, the preset value can be set to 0.8.
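Steps II and III of the online scenario reduce to scoring the new record against every stored record and keeping those above the preset value. The sketch below is illustrative: `check_online`, the record numbers, and the stand-in similarity `token_overlap` are hypothetical, and the stand-in replaces the fused final similarity only to keep the example self-contained.

```python
def token_overlap(a, b):
    """Stand-in similarity (NOT the fused similarity): overlap ratio of word sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def check_online(new_record, database, similarity, threshold=0.8):
    """Return (record number, similarity) for stored records above the threshold."""
    hits = []
    for record_id, record in database.items():
        score = similarity(new_record, record)
        if score > threshold:  # strictly greater than the preset value
            hits.append((record_id, score))
    hits.sort(key=lambda hit: hit[1], reverse=True)
    return hits


# Hypothetical database of segmented records keyed by record number.
database = {
    "EMR-001": ["cough", "fever", "three", "days"],
    "EMR-002": ["abdominal", "pain", "after", "meals"],
}
hits = check_online(["cough", "fever", "three", "days"], database, token_overlap)
message = "duplicate found" if hits else "no duplicate found"
```

Raising `threshold` toward 1 shrinks the list of hits, and lowering it toward 0 grows the list, matching the strict and loose settings described above.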
(2) Offline electronic medical record duplicate checking
Within the electronic medical record database, each electronic medical record text is checked against all other electronic medical record texts for duplication. The steps for checking one record against the other records in the database are similar to the online steps for checking one entered record.
This mode outputs the electronic medical records in the database that have duplicates and, for each of them, the numbers of the duplicated records and the corresponding similarities.
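The offline scenario can be sketched as a pairwise scan over the database. As before, this is only an illustration: `check_offline` and the record numbers are hypothetical, and `token_overlap` is a stand-in for the fused final similarity.

```python
from itertools import combinations


def token_overlap(a, b):
    """Stand-in similarity (NOT the fused similarity): overlap ratio of word sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def check_offline(database, similarity, threshold=0.8):
    """Return (number A, number B, similarity) for every pair above the threshold."""
    duplicates = []
    for (id_a, rec_a), (id_b, rec_b) in combinations(database.items(), 2):
        score = similarity(rec_a, rec_b)
        if score > threshold:
            duplicates.append((id_a, id_b, score))
    return duplicates


# Hypothetical database in which the first two records are identical.
database = {
    "EMR-001": ["cough", "fever", "three", "days"],
    "EMR-002": ["cough", "fever", "three", "days"],
    "EMR-003": ["abdominal", "pain", "after", "meals"],
}
pairs = check_offline(database, token_overlap)
```

Comparing every pair costs on the order of N squared similarity evaluations for N records; the text does not discuss how this is scaled to large databases.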
This embodiment presents two application scenarios to illustrate in detail how the above method for detecting electronic medical case duplicates based on segmented text is applied.
It should be understood that although the steps in the flowchart of FIG. 2 are shown in sequence as indicated by the arrows, they are not necessarily executed in that order. Unless explicitly stated herein, there is no strict restriction on the order of execution, and the steps may be executed in other orders. Moreover, at least some of the steps in FIG. 2 may comprise multiple sub-steps or stages, which are not necessarily completed at the same moment but may be executed at different times, and which need not be executed sequentially but may be executed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
In one embodiment, as shown in FIG. 3, a device for detecting electronic medical case duplicates based on segmented text is provided, which corresponds one-to-one to the method of the above embodiment. The device includes:
a word segmentation module 302, configured to perform word segmentation on the case to be checked entered by the user to obtain the segmented text;
an extraction module 304, configured to perform feature extraction on the segmented text according to the preset substring value to obtain the case text feature;
a ratio module 306, configured to obtain word-type words and medical-meaning words from the segmented text, and to compute the first ratio of the word-type words in the segmented text and the second ratio of the medical-meaning words in the segmented text;
an integration module 308, configured to integrate the first ratio and the second ratio to obtain the case meaning feature;
a similarity module 310, configured to compute the similarity between the case to be checked and the case texts in the case database according to the case text feature and the case meaning feature respectively, to obtain the text similarity and the meaning similarity;
a duplicate check module 312, configured to fuse the text similarity and the meaning similarity according to the preset weights to obtain the final similarity between the case to be checked and each case text, and to take the case texts whose final similarity exceeds the preset value as the duplicate check result.
It should be emphasized that, to further ensure the privacy and security of the case to be checked and the case data, the case to be checked may also be stored in a node of a blockchain, and the case data may be deployed in the blockchain in a distributed manner.
Further, the extraction module 304 includes:
an encoding submodule, configured to generate a unique code for each character in the segmented text;
an extraction submodule, configured to perform feature extraction on the segmented text with the n-gram algorithm according to the preset substring value to obtain the case text feature, where the case text feature includes at least one consecutive word string arranged in character-code order, the characters within each consecutive word string being ordered by their unique codes.
Further, the ratio module 306 includes:
a word submodule, configured to obtain from the segmented text the words whose word type is a content word or a function word, and to compute the first ratio of the word-type words in the segmented text;
a meaning submodule, configured to obtain the medical-meaning words from the segmented text;
a linking submodule, configured to link the medical-meaning words to medical entities according to the medical entity library;
a calculation submodule, configured to compute the second ratio of the linked medical-meaning words in the segmented text.
Further, the similarity module 310 includes:
a set submodule, configured to extract literal text features from the case text features of the case to be checked and of the case data in the case database, to obtain the duplicate-check set and the data set;
a text similarity submodule, configured to count the consecutive word strings that the duplicate-check set and the data set have in common, to obtain the text similarity;
a meaning similarity submodule, configured to compute, with the cosine similarity algorithm, the similarity between the case meaning features of the case to be checked and of the case data, as the meaning similarity.
上述基于分词文本的电子病例查重装置,通过从分词文本中统计得到待查重病例的词语类型词和医疗含义词,然后计算其在分词文本中的比率,并经过设定比例进行融合后,计算 与病例数据库中的病例数据的文本相似度和含义相似度,进行融合后得到最终的相似度,并将符合相似度要求的病例数据作为查重结果,在相同症状对应的疾病相差过大时,仅仅根据医学特征进行病理查重也可找到准确度高的相似病例。The above-mentioned electronic medical case duplicate checking device based on word segmentation text obtains the word type words and medical meaning words of the serious case to be checked from the word segmentation text, and then calculates the ratio in the word segmentation text, and after setting the ratio for fusion, Calculate the text similarity and meaning similarity with the case data in the case database, obtain the final similarity after fusion, and use the case data that meets the similarity requirements as the duplicate check result, when the disease corresponding to the same symptom is too different , Only carrying out pathological examination based on medical characteristics can also find similar cases with high accuracy.
在一个实施例中,提供了一种计算机设备,该计算机设备可以是服务器,其内部结构图可以如图4所示。该计算机设备包括通过系统总线连接的处理器、存储器、网络接口和数据库。其中,该计算机设备的处理器用于提供计算和控制能力。该计算机设备的存储器包括非易失性存储介质、内存储器。该非易失性存储介质存储有操作系统、计算机程序和数据库。该内存储器为非易失性存储介质中的操作系统和计算机程序的运行提供环境。该计算机设备的数据库用于存储病例数据。该计算机设备的网络接口用于与外部的终端通过网络连接通信。该计算机程序被处理器执行时以实现一种基于分词文本的电子病例查重方法。In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure diagram may be as shown in FIG. 4. The computer equipment includes a processor, a memory, a network interface, and a database connected through a system bus. Among them, the processor of the computer device is used to provide calculation and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage medium. The database of the computer equipment is used to store case data. The network interface of the computer device is used to communicate with an external terminal through a network connection. When the computer program is executed by the processor, a method for checking duplicates of electronic medical records based on word segmentation text is realized.
In this embodiment, the word-type words and medical-meaning words of the case to be checked are counted from the segmented text, their ratios in the segmented text are calculated, and the ratios are fused according to a set proportion. The text similarity and the meaning similarity with the case data in the case database are then calculated and fused to obtain the final similarity, and the case data that meets the similarity requirement is taken as the duplicate check result. Even when the diseases corresponding to the same symptoms differ greatly, similar cases can still be found with high accuracy by checking for duplicates based on medical features alone.
Those skilled in the art will understand that the computer device here is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions. Its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), an embedded device, and the like.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored. When executed by a processor, the computer program implements the steps of the electronic medical case duplicate checking method based on word-segmented text in the above embodiments, for example steps 202 to 212 shown in FIG. 2. Alternatively, when the processor executes the computer program, it implements the functions of the modules/units of the electronic medical case duplicate checking device based on word-segmented text in the above embodiments, for example the functions of modules 302 to 312 shown in FIG. 3.
In this embodiment, the word-type words and medical-meaning words of the case to be checked are counted from the segmented text, their ratios in the segmented text are calculated, and the ratios are fused according to a set proportion. The text similarity and the meaning similarity with the case data in the case database are then calculated and fused to obtain the final similarity, and the case data that meets the similarity requirement is taken as the duplicate check result. Even when the diseases corresponding to the same symptoms differ greatly, similar cases can still be found with high accuracy by checking for duplicates based on medical features alone.
A person of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be carried out by a computer program instructing the relevant hardware. The computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the method embodiments described above. Any reference to memory, storage, a database, or another medium in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, where each data block contains information on a batch of network transactions and is used to verify the validity of that information (anti-counterfeiting) and to generate the next block. A blockchain may include an underlying blockchain platform, a platform product service layer, and an application service layer.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the division into the functional units and modules above is used only as an example. In practical applications, the above functions may be assigned to different functional units or modules as needed; that is, the internal structure of the device may be divided into different functional units or modules to perform all or part of the functions described above.
The technical features of the above embodiments may be combined arbitrarily. To keep the description concise, not all possible combinations of these technical features are described. However, as long as a combination of these technical features involves no contradiction, it should be considered to fall within the scope of this specification.
The embodiments described above represent only several implementations of this application. Their description is relatively specific and detailed, but it should not be construed as limiting the scope of the invention patent. It should be noted that a person of ordinary skill in the art may, without departing from the concept of this application, make several variations and improvements or make equivalent substitutions for some technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application, and they all fall within the protection scope of this application. Therefore, the protection scope of this patent application shall be subject to the appended claims.

Claims (20)

  1. A method for checking duplicates of electronic medical cases based on word-segmented text, wherein the method comprises:
    performing word segmentation on a case to be checked that is input by a user, to obtain segmented text;
    performing feature extraction on the segmented text according to a preset substring value, to obtain a case text feature;
    obtaining word-type words and medical-meaning words from the segmented text, and counting a first ratio of the word-type words in the segmented text and a second ratio of the medical-meaning words in the segmented text;
    integrating the first ratio and the second ratio to obtain a case meaning feature;
    calculating, according to the case text feature and the case meaning feature respectively, the similarity between the case to be checked and the case texts in a case database, to obtain a text similarity and a meaning similarity;
    fusing the text similarity and the meaning similarity according to preset weight values to obtain a final similarity between the case to be checked and each case text, and taking the case texts whose final similarity is greater than a preset value as the duplicate check result.
  2. The method according to claim 1, wherein the segmented text comprises a plurality of characters composed of the characters and/or words obtained by word segmentation, and the performing feature extraction on the segmented text according to a preset substring value to obtain a case text feature comprises:
    generating a unique code for each character in the segmented text;
    performing feature extraction on the segmented text by an n-gram algorithm according to the preset substring value to obtain the case text feature, wherein the case text feature comprises at least one continuous word string arranged in the order of the character codes, and the characters in the continuous word string are arranged in order of the magnitude of their unique codes.
  3. The method according to claim 2, wherein the preset substring value ranges from 2 to 6.
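The n-gram extraction of claims 2 and 3 can be illustrated with a minimal sketch. The function names `encode_tokens` and `ngram_features` are hypothetical, and assigning codes by order of first appearance is an assumption; the claims only require that each character receive a unique code. The sketch keeps tokens in their text order inside each n-gram, which is one possible reading of the ordering described in claim 2.

```python
from typing import Dict, List, Tuple


def encode_tokens(tokens: List[str]) -> Dict[str, int]:
    """Assign each distinct token a unique integer code (assumed: order of first appearance)."""
    codes: Dict[str, int] = {}
    for tok in tokens:
        if tok not in codes:
            codes[tok] = len(codes)
    return codes


def ngram_features(tokens: List[str], n: int) -> List[Tuple[int, ...]]:
    """Return every run of n consecutive tokens, expressed as a tuple of token codes."""
    if not 2 <= n <= 6:
        # Claim 3 restricts the preset substring value to the range 2-6.
        raise ValueError("preset substring value must be between 2 and 6")
    codes = encode_tokens(tokens)
    ids = [codes[t] for t in tokens]
    return [tuple(ids[i:i + n]) for i in range(len(ids) - n + 1)]


# Example: a short segmented case text with one repeated token.
tokens = ["患者", "发热", "三", "天", "发热"]
bigrams = ngram_features(tokens, 2)
```

Representing each n-gram as a tuple of integer codes makes the later set comparison between two cases cheap, since tuples are hashable.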
  4. The method according to claim 1, wherein the obtaining word-type words and medical-meaning words from the segmented text, and counting a first ratio of the word-type words in the segmented text and a second ratio of the medical-meaning words in the segmented text, comprises:
    obtaining from the segmented text the characters whose word type is content word or function word, and calculating the first ratio of the word-type words in the segmented text;
    obtaining the medical-meaning words from the segmented text;
    performing medical entity association on the medical-meaning words according to a medical entity library;
    calculating the second ratio, in the segmented text, of the medical-meaning words after medical entity association.
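As a rough illustration of claim 4, the sketch below computes the two ratios from part-of-speech-tagged tokens. Everything concrete here is an assumption: the tag set `WORD_TYPE_TAGS`, the function name `compute_ratios`, and the representation of the medical entity library as a dictionary from surface form to canonical entity are all hypothetical, since the claim does not specify them.

```python
from typing import Dict, List, Tuple

# Hypothetical part-of-speech tags treated as content or function words.
WORD_TYPE_TAGS = {"n", "v", "a", "d", "p", "c", "u"}


def compute_ratios(
    tagged_tokens: List[Tuple[str, str]],
    entity_library: Dict[str, str],
) -> Tuple[float, float]:
    """Return (first_ratio, second_ratio) for a list of (token, pos_tag) pairs.

    first_ratio: share of tokens whose tag is a content or function word tag.
    second_ratio: share of tokens that can be linked to a medical entity.
    """
    total = len(tagged_tokens)
    if total == 0:
        return 0.0, 0.0
    word_type_count = sum(1 for _, tag in tagged_tokens if tag in WORD_TYPE_TAGS)
    # Entity association (assumed): a token counts if the library maps it to an entity.
    medical_count = sum(1 for tok, _ in tagged_tokens if tok in entity_library)
    return word_type_count / total, medical_count / total


tagged = [("患者", "n"), ("发烧", "v"), ("3", "m"), ("天", "q")]
library = {"发烧": "发热"}  # surface form -> canonical entity (illustrative)
first_ratio, second_ratio = compute_ratios(tagged, library)
```

Mapping surface forms to a canonical entity lets synonyms such as 发烧 and 发热 count as the same medical concept, which is presumably the purpose of the association step.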
  5. The method according to claim 1, wherein the integrating the first ratio and the second ratio to obtain a case meaning feature comprises:
    integrating the word-type words and the medical-meaning words in a ratio of 12:5 to obtain the case meaning feature f1 = $(x_1, x_2, x_3, \ldots, x_{17})$, where $(x_1, x_2, x_3, \ldots, x_{12})$ represent the 12 word-type words and $(x_{13}, x_{14}, x_{15}, x_{16}, x_{17})$ represent the 5 medical-meaning words.
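One way to picture the 12:5 integration of claim 5 is as the concatenation of 12 word-type components and 5 medical-meaning components into a 17-dimensional vector. The sketch below assumes that each component is the ratio of one category in the segmented text; the function name `case_meaning_feature` is hypothetical and the choice of categories is left open.

```python
from typing import List, Sequence


def case_meaning_feature(
    word_type_ratios: Sequence[float],
    medical_ratios: Sequence[float],
) -> List[float]:
    """Concatenate 12 word-type ratios and 5 medical-meaning ratios into f1."""
    if len(word_type_ratios) != 12:
        raise ValueError("expected 12 word-type components")
    if len(medical_ratios) != 5:
        raise ValueError("expected 5 medical-meaning components")
    # x_1..x_12 come first, followed by x_13..x_17.
    return list(word_type_ratios) + list(medical_ratios)


f1 = case_meaning_feature([0.1] * 12, [0.2] * 5)
```

Fixing the length and order of the vector means that the same index always refers to the same category, so the vectors of two cases can be compared component by component.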
  6. The method according to claim 1, wherein the calculating, according to the case text feature and the case meaning feature respectively, the similarity between the case to be checked and the case texts in a case database to obtain a text similarity and a meaning similarity comprises:
    performing literal text feature extraction on the case text features of the case to be checked and of the case data in the case database, respectively, to obtain a check set and a data set;
    counting the number of identical continuous word strings in the check set and the data set to obtain the text similarity;
    calculating, by a cosine similarity algorithm, the similarity between the case meaning features of the case to be checked and of the case data, as the meaning similarity.
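The two similarity measures of claim 6 can be sketched as follows. The function names are hypothetical. The text similarity is shown as the raw count of n-grams shared by the check set and the data set, which is what the claim states; whether and how that count is normalized before fusion is not specified, so no normalization is applied here.

```python
import math
from typing import Iterable, Sequence, Tuple


def shared_ngram_count(
    check_set: Iterable[Tuple[int, ...]],
    data_set: Iterable[Tuple[int, ...]],
) -> int:
    """Count the distinct continuous word strings present in both sets."""
    return len(set(check_set) & set(data_set))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


text_similarity = shared_ngram_count([(0, 1), (1, 2)], [(1, 2), (2, 3)])
meaning_similarity = cosine_similarity([1.0, 0.0], [1.0, 0.0])
```

Because the shared count is unbounded while cosine similarity on non-negative ratios lies in [0, 1], an implementation would likely need to bring the two onto a comparable scale before weighting them.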
  7. The method according to claim 1, wherein the fusing the text similarity and the meaning similarity according to preset weight values to obtain the final similarity between the case to be checked and the case text comprises:
    superimposing the text similarity and the meaning similarity with preset weight values w1:w2 to obtain the final similarity, where 0<=w2<=1 and w1+w2=1.
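The weighted fusion of claim 7, together with the threshold step at the end of claim 1, can be sketched as a weighted sum followed by a filter. The function names, the dictionary of per-case scores, and the assumption that both similarities are already on a comparable scale are illustrative and not taken from the source.

```python
from typing import Dict, List, Tuple


def fuse_similarity(text_sim: float, meaning_sim: float, w2: float) -> float:
    """Weighted sum w1 * text_sim + w2 * meaning_sim with w1 = 1 - w2."""
    if not 0.0 <= w2 <= 1.0:
        raise ValueError("w2 must lie in [0, 1]")
    w1 = 1.0 - w2
    return w1 * text_sim + w2 * meaning_sim


def find_duplicates(
    scores: Dict[str, Tuple[float, float]],
    w2: float,
    threshold: float,
) -> List[str]:
    """Return IDs of cases whose fused similarity exceeds the preset value."""
    return [
        case_id
        for case_id, (text_sim, meaning_sim) in scores.items()
        if fuse_similarity(text_sim, meaning_sim, w2) > threshold
    ]


# Hypothetical (text similarity, meaning similarity) pairs for two stored cases.
scores = {"case_a": (0.9, 0.9), "case_b": (0.2, 0.4)}
duplicates = find_duplicates(scores, w2=0.5, threshold=0.8)
```

Setting w2 to 0 reduces the check to literal n-gram overlap, and setting it to 1 relies on the meaning feature alone, so the single parameter controls the balance between the two views.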
  8. A device for checking duplicates of electronic medical cases based on word-segmented text, comprising:
    a word segmentation module, configured to perform word segmentation on a case to be checked that is input by a user, to obtain segmented text;
    an extraction module, configured to perform feature extraction on the segmented text according to a preset substring value, to obtain a case text feature;
    a ratio module, configured to obtain word-type words and medical-meaning words from the segmented text, and to count a first ratio of the word-type words in the segmented text and a second ratio of the medical-meaning words in the segmented text;
    an integration module, configured to integrate the first ratio and the second ratio to obtain a case meaning feature;
    a similarity module, configured to calculate, according to the case text feature and the case meaning feature respectively, the similarity between the case to be checked and the case texts in a case database, to obtain a text similarity and a meaning similarity;
    a duplicate checking module, configured to fuse the text similarity and the meaning similarity according to preset weight values to obtain a final similarity between the case to be checked and each case text, and to take the case texts whose final similarity is greater than a preset value as the duplicate check result.
  9. A computer device, comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements the following steps of a method for checking duplicates of electronic medical cases based on word-segmented text:
    performing word segmentation on a case to be checked that is input by a user, to obtain segmented text;
    performing feature extraction on the segmented text according to a preset substring value, to obtain a case text feature;
    obtaining word-type words and medical-meaning words from the segmented text, and counting a first ratio of the word-type words in the segmented text and a second ratio of the medical-meaning words in the segmented text;
    integrating the first ratio and the second ratio to obtain a case meaning feature;
    calculating, according to the case text feature and the case meaning feature respectively, the similarity between the case to be checked and the case texts in a case database, to obtain a text similarity and a meaning similarity;
    fusing the text similarity and the meaning similarity according to preset weight values to obtain a final similarity between the case to be checked and each case text, and taking the case texts whose final similarity is greater than a preset value as the duplicate check result.
  10. The computer device according to claim 9, wherein the segmented text comprises a plurality of characters composed of the characters and/or words obtained by word segmentation, and the performing feature extraction on the segmented text according to a preset substring value to obtain a case text feature comprises:
    generating a unique code for each character in the segmented text;
    performing feature extraction on the segmented text by an n-gram algorithm according to the preset substring value to obtain the case text feature, wherein the case text feature comprises at least one continuous word string arranged in the order of the character codes, and the characters in the continuous word string are arranged in order of the magnitude of their unique codes.
  11. The computer device according to claim 10, wherein the preset substring value ranges from 2 to 6.
  12. The computer device according to claim 9, wherein the obtaining word-type words and medical-meaning words from the segmented text, and counting a first ratio of the word-type words in the segmented text and a second ratio of the medical-meaning words in the segmented text, comprises:
    obtaining from the segmented text the characters whose word type is content word or function word, and calculating the first ratio of the word-type words in the segmented text;
    obtaining the medical-meaning words from the segmented text;
    performing medical entity association on the medical-meaning words according to a medical entity library;
    calculating the second ratio, in the segmented text, of the medical-meaning words after medical entity association.
  13. The computer device according to claim 9, wherein the integrating the first ratio and the second ratio to obtain a case meaning feature comprises:
    integrating the word-type words and the medical-meaning words in a ratio of 12:5 to obtain the case meaning feature f1 = $(x_1, x_2, x_3, \ldots, x_{17})$, where $(x_1, x_2, x_3, \ldots, x_{12})$ represent the 12 word-type words and $(x_{13}, x_{14}, x_{15}, x_{16}, x_{17})$ represent the 5 medical-meaning words.
  14. The computer device according to claim 9, wherein the calculating, according to the case text feature and the case meaning feature respectively, the similarity between the case to be checked and the case texts in a case database to obtain a text similarity and a meaning similarity comprises:
    performing literal text feature extraction on the case text features of the case to be checked and of the case data in the case database, respectively, to obtain a check set and a data set;
    counting the number of identical continuous word strings in the check set and the data set to obtain the text similarity;
    calculating, by a cosine similarity algorithm, the similarity between the case meaning features of the case to be checked and of the case data, as the meaning similarity.
  15. The computer device according to claim 9, wherein the fusing the text similarity and the meaning similarity according to preset weight values to obtain the final similarity between the case to be checked and the case text comprises:
    superimposing the text similarity and the meaning similarity with preset weight values w1:w2 to obtain the final similarity, where 0<=w2<=1 and w1+w2=1.
  16. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the following steps of a method for checking duplicates of electronic medical cases based on word-segmented text:
    performing word segmentation on a case to be checked that is input by a user, to obtain segmented text;
    performing feature extraction on the segmented text according to a preset substring value, to obtain a case text feature;
    obtaining word-type words and medical-meaning words from the segmented text, and counting a first ratio of the word-type words in the segmented text and a second ratio of the medical-meaning words in the segmented text;
    integrating the first ratio and the second ratio to obtain a case meaning feature;
    calculating, according to the case text feature and the case meaning feature respectively, the similarity between the case to be checked and the case texts in a case database, to obtain a text similarity and a meaning similarity;
    fusing the text similarity and the meaning similarity according to preset weight values to obtain a final similarity between the case to be checked and each case text, and taking the case texts whose final similarity is greater than a preset value as the duplicate check result.
  17. The readable storage medium according to claim 16, wherein the segmented text comprises a plurality of characters composed of the characters and/or words obtained by word segmentation, and the performing feature extraction on the segmented text according to a preset substring value to obtain a case text feature comprises:
    generating a unique code for each character in the segmented text;
    performing feature extraction on the segmented text by an n-gram algorithm according to the preset substring value to obtain the case text feature, wherein the case text feature comprises at least one continuous word string arranged in the order of the character codes, and the characters in the continuous word string are arranged in order of the magnitude of their unique codes.
  18. The readable storage medium according to claim 17, wherein the preset substring value ranges from 2 to 6.
  19. The readable storage medium according to claim 16, wherein the obtaining word-type words and medical-meaning words from the segmented text, and counting a first ratio of the word-type words in the segmented text and a second ratio of the medical-meaning words in the segmented text, comprises:
    obtaining from the segmented text the characters whose word type is content word or function word, and calculating the first ratio of the word-type words in the segmented text;
    obtaining the medical-meaning words from the segmented text;
    performing medical entity association on the medical-meaning words according to a medical entity library;
    calculating the second ratio, in the segmented text, of the medical-meaning words after medical entity association.
  20. The readable storage medium according to claim 16, wherein the integrating the first ratio and the second ratio to obtain a case meaning feature comprises:
    integrating the word-type words and the medical-meaning words in a ratio of 12:5 to obtain the case meaning feature f1 = $(x_1, x_2, x_3, \ldots, x_{17})$, where $(x_1, x_2, x_3, \ldots, x_{12})$ represent the 12 word-type words and $(x_{13}, x_{14}, x_{15}, x_{16}, x_{17})$ represent the 5 medical-meaning words.
PCT/CN2020/136146 2020-06-24 2020-12-14 Method for detecting electronic medical case duplicates based on word segmentation, device, and computer equipment WO2021121187A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010592373.8A CN111814447B (en) 2020-06-24 2020-06-24 Electronic case duplicate checking method and device based on word segmentation text and computer equipment
CN202010592373.8 2020-06-24

Publications (1)

Publication Number Publication Date
WO2021121187A1 true WO2021121187A1 (en) 2021-06-24

Family

ID=72856502

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/136146 WO2021121187A1 (en) 2020-06-24 2020-12-14 Method for detecting electronic medical case duplicates based on word segmentation, device, and computer equipment

Country Status (2)

Country Link
CN (1) CN111814447B (en)
WO (1) WO2021121187A1 (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113689924A (en) * 2021-08-24 2021-11-23 平安国际智慧城市科技股份有限公司 Similar medical record retrieval method and device, electronic equipment and readable storage medium
CN113722418A (en) * 2021-08-30 2021-11-30 平安科技(深圳)有限公司 Clinical case standardization method, device, equipment and medium
CN113780449A (en) * 2021-09-16 2021-12-10 平安科技(深圳)有限公司 Text similarity calculation method and device, storage medium and computer equipment
CN115880704A (en) * 2023-02-16 2023-03-31 中国人民解放军总医院第一医学中心 Automatic case cataloging method, system, equipment and storage medium
CN116153452A (en) * 2023-04-18 2023-05-23 济南科汛智能科技有限公司 Medical electronic medical record storage system based on artificial intelligence

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111814447B (en) * 2020-06-24 2022-05-27 平安科技(深圳)有限公司 Electronic case duplicate checking method and device based on word segmentation text and computer equipment
CN112948545A (en) * 2021-02-25 2021-06-11 平安国际智慧城市科技股份有限公司 Duplicate checking method, terminal equipment and computer readable storage medium
CN114490940A (en) * 2022-01-25 2022-05-13 中国人民解放军国防科技大学 Self-adaptive project duplicate checking method and system
CN114255839A (en) * 2022-01-26 2022-03-29 广州天鹏计算机科技有限公司 Medical big data management system and method

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106934038A (en) * 2017-03-15 2017-07-07 江苏华生基因数据科技股份有限公司 A kind of medical data duplicate checking and the method and system for associating
CN108461111A (en) * 2018-03-16 2018-08-28 重庆医科大学 Chinese medical treatment text duplicate checking method and device, electronic equipment, computer read/write memory medium
US20190035506A1 (en) * 2017-07-31 2019-01-31 Hefei University Of Technology Intelligent auxiliary diagnosis method, system and machine-readable medium thereof
CN110532352A (en) * 2019-08-20 2019-12-03 腾讯科技(深圳)有限公司 Text duplicate checking method and device, computer readable storage medium, electronic equipment
CN111325015A (en) * 2020-02-19 2020-06-23 南瑞集团有限公司 Document duplicate checking method and system based on semantic analysis
CN111814447A (en) * 2020-06-24 2020-10-23 平安科技(深圳)有限公司 Electronic case duplicate checking method and device based on word segmentation text and computer equipment

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9275636B2 (en) * 2012-05-03 2016-03-01 International Business Machines Corporation Automatic accuracy estimation for audio transcriptions
US9135571B2 (en) * 2013-03-12 2015-09-15 Nuance Communications, Inc. Methods and apparatus for entity detection
CN106407183B (en) * 2016-09-28 2019-06-28 医渡云(北京)技术有限公司 Medical treatment name entity recognition system generation method and device
US10593422B2 (en) * 2017-12-01 2020-03-17 International Business Machines Corporation Interaction network inference from vector representation of words
CN109190117B (en) * 2018-08-10 2023-06-23 中国船舶重工集团公司第七一九研究所 Short text semantic similarity calculation method based on word vector
CN109885657B (en) * 2019-02-18 2021-04-27 武汉瓯越网视有限公司 Text similarity calculation method and device and storage medium
CN110517785B (en) * 2019-08-28 2022-05-10 北京百度网讯科技有限公司 Similar case searching method, device and equipment
CN111274806B (en) * 2020-01-20 2020-11-06 医惠科技有限公司 Method and device for recognizing word segmentation and part of speech and method and device for analyzing electronic medical record

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113689924A (en) * 2021-08-24 2021-11-23 平安国际智慧城市科技股份有限公司 Similar medical record retrieval method and device, electronic equipment and readable storage medium
CN113689924B (en) * 2021-08-24 2024-04-05 深圳平安智慧医健科技有限公司 Similar medical record retrieval method and device, electronic equipment and readable storage medium
CN113722418A (en) * 2021-08-30 2021-11-30 平安科技(深圳)有限公司 Clinical case standardization method, device, equipment and medium
CN113780449A (en) * 2021-09-16 2021-12-10 平安科技(深圳)有限公司 Text similarity calculation method and device, storage medium and computer equipment
CN113780449B (en) * 2021-09-16 2023-08-25 平安科技(深圳)有限公司 Text similarity calculation method and device, storage medium and computer equipment
CN115880704A (en) * 2023-02-16 2023-03-31 中国人民解放军总医院第一医学中心 Automatic case cataloging method, system, equipment and storage medium
CN115880704B (en) * 2023-02-16 2023-06-16 中国人民解放军总医院第一医学中心 Automatic cataloging method, system, equipment and storage medium for cases
CN116153452A (en) * 2023-04-18 2023-05-23 济南科汛智能科技有限公司 Medical electronic medical record storage system based on artificial intelligence

Also Published As

Publication number Publication date
CN111814447A (en) 2020-10-23
CN111814447B (en) 2022-05-27

Similar Documents

Publication Publication Date Title
WO2021121187A1 (en) Method for detecting electronic medical case duplicates based on word segmentation, device, and computer equipment
US11227118B2 (en) Methods, devices, and systems for constructing intelligent knowledge base
US8898798B2 (en) Systems and methods for medical information analysis with deidentification and reidentification
Liu et al. Non-White scientists appear on fewer editorial boards, spend more time under review, and receive fewer citations
JP2020123318A (en) Method, apparatus, electronic device, computer-readable storage medium, and computer program for determining text relevance
CN108920453A (en) Data processing method, device, electronic equipment and computer-readable medium
US11276494B2 (en) Predicting interactions between drugs and diseases
CN104346418A (en) Anonymizing Sensitive Identifying Information Based on Relational Context Across a Group
US20150142821A1 (en) Database system for analysis of longitudinal data sets
CN111180086B (en) Data matching method, device, computer equipment and storage medium
WO2022174491A1 (en) Artificial intelligence-based method and apparatus for medical record quality control, computer device, and storage medium
CN110875093A (en) Treatment scheme processing method, device, equipment and storage medium
CN113826172A (en) Generation of customized personal health ontology
WO2022105172A1 (en) Pdf document cross-page table merging method and apparatus, electronic device and storage medium
CN112215008A (en) Entity recognition method and device based on semantic understanding, computer equipment and medium
CN111899821A (en) Method for processing medical institution data, method and device for constructing database
CN112885478B (en) Medical document retrieval method, medical document retrieval device, electronic device and storage medium
WO2014063118A1 (en) Systems and methods for medical information analysis with deidentification and reidentification
CN113707299A (en) Auxiliary diagnosis method and device based on inquiry session and computer equipment
WO2023178978A1 (en) Prescription review method and apparatus based on artificial intelligence, and device and medium
WO2021012913A1 (en) Data recognition method and system, electronic device and computer storage medium
WO2022227171A1 (en) Method and apparatus for extracting key information, electronic device, and medium
US20150339602A1 (en) System and method for modeling health care costs
CN116304186A (en) Post-structuring processing method and post-structuring processing system for medical document
CN113361248B (en) Text similarity calculation method, device, equipment and storage medium

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application (Ref document number: 20903362; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 20903362; Country of ref document: EP; Kind code of ref document: A1)