WO2021114632A1 - Disease name standardization method, apparatus, device, and storage medium - Google Patents

Disease name standardization method, apparatus, device, and storage medium

Info

Publication number
WO2021114632A1
Authority
WO
WIPO (PCT)
Prior art keywords
preset
target
disease
disease name
icd standard
Prior art date
Application number
PCT/CN2020/099487
Other languages
French (fr)
Chinese (zh)
Inventor
姚海申
蒋雪涵
徐卓扬
孙行智
胡岗
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021114632A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00 ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/60 ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Definitions

  • This application relates to the technical field of disease name standardization in artificial intelligence, and specifically to a disease name standardization method, device, equipment, and storage medium.
  • The embodiments of the present application provide a disease name standardization method, device, equipment, and storage medium, which help improve the efficiency of disease name standardization.
  • the first aspect of the embodiments of the present application provides a method for standardizing disease names, which is applied to electronic equipment, including:
  • acquiring a target dictionary, a current diagnosis text, and a preset ICD (International Classification of Diseases) standard disease name set, where the preset ICD standard disease name set includes multiple preset ICD standard disease names;
  • acquiring the target preset ICD standard disease name corresponding to the target first matching degree, and determining the target preset ICD standard disease name as the conversion result of the name of the disease to be standardized.
  • the second aspect of the embodiments of the present application provides a disease name standardization device, which is applied to electronic equipment, and the device includes: an acquisition unit, a word segmentation unit, a construction unit, a matching unit, and a determination unit, wherein:
  • the acquiring unit is configured to acquire a target dictionary, a current diagnosis text, and a preset ICD standard disease name set, where the preset ICD standard disease name set includes a plurality of preset ICD standard disease names;
  • the word segmentation unit is configured to perform a word segmentation operation on the current diagnosis text based on the target dictionary to obtain the name of the disease to be standardized contained in the current diagnosis text;
  • the construction unit is configured to construct a target dictionary tree based on the preset ICD standard disease name set;
  • the matching unit is configured to match, based on the target dictionary tree, the name of the disease to be standardized against the plurality of preset ICD standard disease names in the preset ICD standard disease name set to obtain a plurality of first matching degrees;
  • the determining unit is configured to, when a target first matching degree that satisfies a preset condition exists among the plurality of first matching degrees, obtain the target preset ICD standard disease name corresponding to the target first matching degree and determine the target preset ICD standard disease name as the conversion result of the disease name to be standardized.
  • a third aspect of the embodiments of the present application provides an electronic device, which includes a processor, a memory, a communication interface, and one or more programs, where the one or more programs are stored in the memory and configured to be executed by the processor, the memory is used to store a computer program, the computer program includes program instructions, and the processor is configured to invoke the program instructions to execute the method of the first aspect of the embodiments of the present application.
  • the fourth aspect of the embodiments of the present application provides a computer-readable storage medium, including a storage data area and a storage program area.
  • the storage data area stores data created according to the use of blockchain nodes
  • the storage program area stores computer programs.
  • the computer program includes program instructions which, when executed by a processor, carry out part or all of the steps described in the first aspect of the embodiments of the present application.
  • The above method includes: obtaining the target dictionary, the current diagnosis text, and the preset ICD standard disease name set, where the preset ICD standard disease name set includes multiple preset ICD standard disease names; performing a word segmentation operation on the current diagnosis text based on the target dictionary to obtain the name of the disease to be standardized contained in the current diagnosis text; building a target dictionary tree based on the preset ICD standard disease name set; matching, based on the target dictionary tree, the name of the disease to be standardized against the multiple preset ICD standard disease names in the set to obtain multiple first matching degrees; and, when a target first matching degree that meets the preset condition exists among the multiple first matching degrees, obtaining the corresponding target preset ICD standard disease name and determining it as the conversion result of the disease name to be standardized.
  • In this way, the current diagnosis text can be segmented using the target dictionary, which reduces the impact of colloquial wording, typos, omissions, abbreviations, and similar problems in the current diagnosis text.
  • In addition, the target dictionary tree constructed from the preset ICD standard disease name set is used to match the name to be standardized against the multiple preset ICD standard disease names to obtain the conversion result, which helps improve conversion efficiency and accuracy.
  • FIG. 1A provides a schematic structural diagram of a method for standardizing disease names according to an embodiment of this application
  • FIG. 1B provides a schematic flowchart of a method for standardizing disease names according to an embodiment of this application
  • FIG. 1C is a schematic structural diagram of a method for extracting names of diseases to be standardized according to an embodiment of this application;
  • FIG. 1D is a schematic structural diagram of a target dictionary tree provided in an embodiment of this application.
  • FIG. 2 is a schematic flowchart of a method for standardizing disease names according to an embodiment of this application;
  • FIG. 3 is a schematic flowchart of a method for standardizing disease names according to an embodiment of this application.
  • FIG. 4 provides a schematic structural diagram of an electronic device according to an embodiment of the application.
  • FIG. 5 is a schematic structural diagram of a disease name standardization device provided in an embodiment of this application.
  • Electronic devices can include various handheld devices with wireless communication functions, vehicle-mounted devices, wearable devices (such as smart watches, smart bracelets, pedometers, etc.), computing devices or other processing devices connected to wireless modems, and various forms of user equipment (UE), mobile stations (MS), terminal devices, and so on.
  • Figure 1A is a schematic structural diagram of a disease name standardization method provided by an embodiment of the present application. Based on this diagram, the target dictionary, the current diagnosis text, and the preset ICD standard disease name set can be obtained, where the preset ICD standard disease name set includes multiple preset ICD standard disease names.
  • Based on the target dictionary, the current diagnosis text can be segmented to obtain the name of the disease to be standardized in it. A target dictionary tree is built from the preset ICD standard disease name set and, based on that tree, the name of the disease to be standardized is matched against the multiple preset ICD standard disease names in the set to obtain multiple first matching degrees. Finally, when a target first matching degree that meets the preset condition exists among the multiple first matching degrees, the corresponding target preset ICD standard disease name is obtained and determined as the conversion result of the disease name to be standardized.
  • In this way, the current diagnosis text can be segmented using the target dictionary, which reduces the impact of colloquial wording, typos, omissions, abbreviations, and similar problems in the current diagnosis text.
  • In addition, the target dictionary tree constructed from the preset ICD standard disease name set is used to match the name to be standardized against the multiple preset ICD standard disease names to obtain the conversion result, which helps improve conversion efficiency and accuracy.
  • FIG. 1B is a schematic flowchart of a method for standardizing disease names provided by an embodiment of the present application, which is applied to an electronic device, and the above method includes the following steps:
  • the embodiments of the present application can be applied to electronic equipment, and the electronic equipment may include a disease name standardization system as shown in FIG. 1A.
  • The target dictionary can be obtained by data processing of multiple historical disease diagnosis cases of multiple patients stored in a historical disease case database, and can include multiple historical disease names. The preset ICD standard disease name set can be set by the user or by system default and can include multiple preset ICD standard disease names. The current diagnosis text may be the diagnosis text corresponding to any new disease case, or any diagnosis text whose disease names need to be standardized.
  • The current diagnosis text may include at least one of the following: prescription information, diagnosis information, disease description information, discharge summary information, hospital information, department information, patient information, etc., which are not limited here.
  • A3. Perform data processing on the historical disease name set and the preset ICD standard disease name set to obtain the target dictionary.
  • the aforementioned historical disease case database can store multiple historical disease diagnosis cases of multiple patients, and the historical diagnosis cases can include at least one of the following: admission diagnosis information and discharge diagnosis information, etc., which are not limited here.
  • Both the admission diagnosis information and the discharge diagnosis information can include at least one of the following: prescription information, diagnosis information, disease description information, discharge summary information, hospital information, department information, patient information, etc., which are not limited here. The prescription information may include at least one of the following: disease name, disease symptoms, drug name, drug dosage, etc., which are not limited here.
  • Multiple historical diagnosis cases can be extracted from the historical disease case database, and historical diagnosis text information can be extracted from them.
  • The historical diagnosis text information can then be cleaned according to preset rules to obtain the historical disease name set.
  • The preset rules can be set by the user or by system default, and are not limited here.
  • For example, unnecessary fields (such as non-disease-name fields) can be removed from the historical diagnosis text information, and, based on empirical knowledge, missing fields in the historical diagnosis text information can be supplemented or corrected.
  • In this way, a historical disease name set can be obtained, which includes the multiple disease names corresponding to the multiple historical disease diagnosis cases. Further, data processing can be performed on the historical disease name set and the preset ICD standard disease name set to obtain an expanded target dictionary, which can include multiple disease names.
  • Cleaning the data with preset rules helps alleviate the inaccuracy and incompleteness of rule-based extraction.
  • The extracted disease names do not need to be manually corrected, which helps save labor costs.
  • step A2 performing data cleaning on the historical diagnosis text information to obtain a set of historical disease names, may include the following steps:
  • A23 Determine at least one preset disease name corresponding to at least one second matching degree that exceeds a first preset threshold among the plurality of second matching degrees, and use the at least one preset disease name as the disease name set .
  • the above-mentioned first preset threshold can be set by the user or defaulted by the system, which is not limited here.
  • the electronic device can store multiple preset disease names in advance, and preset a regular expression for each preset disease name.
  • The preset regular expression can be composed of ordinary characters and metacharacters, and can reflect the logical relationship between the characters of the corresponding preset disease name. Because the historical diagnosis text information may contain many colloquial or repeated names, abbreviations, and typos, different preset regular expressions can be set in advance according to the word-formation characteristics of medical terms.
  • For example, a delimiter can be formulated according to the actual disease names, such as "(%s\s*\d+)
  • The historical diagnosis text information can be matched against each preset regular expression separately, which logically filters the historical diagnosis text information and yields multiple second matching degrees, one per preset regular expression. From these, the preset disease names whose second matching degree is greater than the first preset threshold can be selected and used as the disease name set. In this way, a complete and reliable disease name set can be obtained.
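  • The filtering step can be sketched as follows. This is a minimal illustration, not the patent's implementation: the patterns, the sample records, and the definition of the second matching degree (the share of records in which a pattern occurs) are all assumptions made for the example.

```python
import re

# Hypothetical preset disease names, each paired with a preset regular expression.
PRESET_PATTERNS = {
    "type 2 diabetes": re.compile(r"type\s*(2|ii)\s*diabetes(\s+mellitus)?", re.IGNORECASE),
    "hypertension": re.compile(r"hypertension|high\s+blood\s+pressure", re.IGNORECASE),
    "cholera": re.compile(r"cholera", re.IGNORECASE),
}


def second_matching_degree(pattern, records):
    """Assumed definition: fraction of records in which the pattern occurs."""
    if not records:
        return 0.0
    hits = sum(1 for text in records if pattern.search(text))
    return hits / len(records)


def clean_disease_names(records, first_threshold):
    """Keep the preset disease names whose matching degree exceeds the threshold."""
    return {
        name
        for name, pattern in PRESET_PATTERNS.items()
        if second_matching_degree(pattern, records) > first_threshold
    }


# Made-up historical diagnosis text, for illustration only.
records = [
    "Pt has Type II diabetes, on metformin",
    "high blood pressure noted at admission",
    "type2 diabetes mellitus follow-up",
    "routine check, no complaints",
]
historical_names = clean_disease_names(records, first_threshold=0.2)
```

  • With these sample records, the colloquial and abbreviated spellings are folded into two preset names, while "cholera" never occurs and is filtered out.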
  • step A3 performing data processing on the historical disease name set and the preset ICD standard disease name set to obtain a target dictionary, may include the following steps:
  • A31 Combine the disease name set with the preset ICD standard disease name set to obtain a first dictionary, and the first dictionary includes multiple first disease names;
  • A32 Deduplicate the names of the multiple first diseases to obtain the target dictionary.
  • The preset ICD standard disease name set can be set by the user or by system default.
  • It may include multiple preset ICD standard disease names, whose form of expression can be determined from certain characteristics of each disease. For example, diseases can be classified according to certain rules, and disease names can be represented by codes.
  • Data processing can be performed on the above disease name set together with the preset ICD standard disease name set to obtain an expanded target dictionary.
  • The target dictionary can still include the multiple preset ICD standard disease names, which also helps improve the accuracy of word segmentation for a new diagnosis text (the current diagnosis text).
  • Specifically, the disease name set and the preset ICD standard disease name set can be merged to obtain the first dictionary, and identical, repeated first disease names in the first dictionary can then be deduplicated to finally obtain the target dictionary.
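  • Steps A31 and A32 amount to a merge followed by deduplication. A minimal sketch, assuming both sets are plain lists of strings (the function name is hypothetical):

```python
def build_target_dictionary(historical_names, icd_names):
    """Merge two name collections (A31) and drop repeated names (A32)."""
    first_dictionary = list(historical_names) + list(icd_names)  # A31: merge
    # A32: dict.fromkeys keeps the first occurrence of each name, in order.
    return list(dict.fromkeys(first_dictionary))


target_dictionary = build_target_dictionary(
    ["cholera", "influenza"],
    ["cholera", "typical cholera"],
)
```

  • Preserving insertion order is a design choice for readability; a plain `set` would serve equally well if order does not matter.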
  • FIG. 1C is a schematic structural diagram of a method for extracting disease names to be standardized.
  • Historical diagnosis text information can be extracted from the historical disease case database and cleaned to obtain the historical disease name set.
  • Data processing is performed on the historical disease name set and the preset ICD standard disease name set to obtain the target dictionary; the current diagnosis text is then obtained and segmented based on the target dictionary to obtain the name of the disease to be standardized in the current diagnosis text.
  • Segmenting the current diagnosis text using the target dictionary reduces the impact of colloquial wording, typos, omissions, abbreviations, and similar problems in the current diagnosis text.
  • The extracted disease names do not require manual correction, which helps save labor costs.
  • The current diagnosis text can be segmented based on the target dictionary obtained by processing the historical diagnosis information in the historical disease case database, to obtain the name of the disease to be standardized in the current diagnosis text.
  • The current diagnosis text can be any new diagnosis text.
  • Based on the target dictionary, disease names can be extracted from the current diagnosis text faster, which effectively addresses the inaccuracy and incompleteness of rule-based extraction, so that the extracted disease names do not need manual correction; this helps improve efficiency.
  • Performing the word segmentation operation on the current diagnosis text to obtain the name of the disease to be standardized contained in it may include the following steps: based on the target dictionary, the words in the current diagnosis text are counted and the frequency of each word in the current diagnosis text is calculated. That is, for any sentence to be segmented in the current diagnosis text, all possible word segmentation results are enumerated, and the result with the highest probability is used as the name of the disease to be standardized.
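  • One common way to realize "pick the segmentation with the highest probability" is a unigram model solved by dynamic programming. The sketch below assumes that approach; the frequencies, the fallback score for unknown characters, and the function name are illustrative assumptions rather than details from the application.

```python
import math


def segment(text, word_freq):
    """Return the highest-probability split of text into dictionary words.

    Uses a unigram model: the score of a split is the sum of log-frequencies
    of its pieces. Characters not covered by the dictionary fall back to
    single-character tokens with a small assumed probability.
    """
    total = sum(word_freq.values())
    max_len = max(len(word) for word in word_freq)
    n = len(text)
    # best[i] = (best log-probability for text[:i], start index of last token)
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in word_freq:
                logp = math.log(word_freq[piece] / total)
            elif end - start == 1:
                logp = math.log(0.5 / total)  # assumed smoothing for unknown characters
            else:
                continue
            score = best[start][0] + logp
            if score > best[end][0]:
                best[end] = (score, start)
    # Walk back through the recorded start indices to recover the tokens.
    tokens, end = [], n
    while end > 0:
        start = best[end][1]
        tokens.append(text[start:end])
        end = start
    return tokens[::-1]


# Hypothetical target dictionary with made-up frequencies.
word_freq = {"acute amoebic dysentery": 3, "amoebic dysentery": 5, "dysentery": 8}
text = "history of acute amoebic dysentery"
tokens = segment(text, word_freq)
diseases = [token for token in tokens if token in word_freq]
```

  • In this example the longest dictionary entry wins because covering "acute " with six fallback characters costs far more than the lower frequency of the longer name.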
  • the aforementioned preset ICD standard disease name set may include multiple preset ICD standard disease names.
  • The electronic device can construct a target dictionary tree based on the preset ICD standard disease names in the preset ICD standard disease name set.
  • The target dictionary tree can be understood as a dictionary tree constructed from one or more strings; it stores the strings of the preset ICD standard disease name set.
  • step 103 building a target dictionary tree based on the preset ICD standard disease name set, may include the following steps:
  • the above second preset threshold value can be set by the user or the system default, which is not limited here.
  • The preset dictionary tree can be set by the user or by system default, and is not limited here.
  • The preset dictionary tree can be understood as the initial dictionary tree, that is, a dictionary tree that has not yet stored the multiple strings in the preset ICD standard disease name set.
  • The preset dictionary tree can include multiple nodes, and each node can correspond to a character.
  • The preset dictionary tree can be generated based on the International Classification of Diseases (ICD) codes.
  • The preset dictionary tree can have a two-layer structure: the first layer is the disease category, such as A00. (cholera), and the second layer is the names of the diseases included in that category, such as A00.0 (typical cholera), and so on.
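  • The two-layer structure can be pictured as a nested mapping from category code to the diseases under it. The following is only a sketch; the function name and the dictionary layout are assumptions, and the sample codes are the ones quoted above.

```python
def build_two_layer_index(categories, diseases):
    """First layer: category code -> category name; second layer: disease code -> name."""
    index = {code: {"name": name, "diseases": {}} for code, name in categories.items()}
    for code, name in diseases.items():
        category = code.split(".")[0]  # "A00.0" belongs to category "A00"
        entry = index.setdefault(category, {"name": None, "diseases": {}})
        entry["diseases"][code] = name
    return index


index = build_two_layer_index(
    categories={"A00": "cholera"},
    diseases={"A00.0": "typical cholera"},
)
```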
  • The first sequence corresponding to each of the multiple preset ICD standard disease names can be determined from the preset ICD standard disease name set to obtain multiple first sequences, where each first sequence includes at least one character, and characters can be of two types: text characters and special characters. Then, based on the preset dictionary tree, each of the multiple first sequences can be traversed and matched one by one against the multiple nodes of the preset dictionary tree. If the matching succeeds, the preset dictionary tree is not updated; if the matching fails, the preset dictionary tree is updated to obtain the target dictionary tree.
  • In this way, the preset dictionary tree can be expanded step by step to convert the preset ICD standard disease name set into the target dictionary tree, which helps improve the efficiency of subsequent disease name standardization. It should be noted that the target dictionary tree can also be constructed in advance from the preset ICD standard disease name set, so that, when the electronic device obtains the current diagnosis text, it can directly carry out the subsequent disease name standardization steps based on the target dictionary tree.
  • Multiple third matching degrees can be obtained and their average value calculated. If the average value is greater than the second preset threshold, the matching is considered successful and the preset dictionary tree is not updated. Conversely, if the average value is less than or equal to the second preset threshold, the matching is considered to have failed, and the preset dictionary tree can be updated to obtain the target dictionary tree.
  • A third preset threshold can also be preset in the electronic device.
  • The third preset threshold can be set by the user or by system default. If a third matching degree is greater than the third preset threshold, the corresponding first sequence can be considered to match the nodes in the preset dictionary tree successfully, and the preset dictionary tree is not updated. If a third matching degree is less than or equal to the third preset threshold, the corresponding first sequence can be considered to have failed to match the nodes in the preset dictionary tree, and the preset dictionary tree can be updated based on the corresponding first sequence. In this way, the first sequences are traversed step by step and the above method is applied cyclically, so that the preset dictionary tree is gradually updated to obtain the target dictionary tree.
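  • The two update rules described above (mean against the second preset threshold, and per-sequence comparison against the third preset threshold) can be sketched as below. How a third matching degree is computed is not specified here, so the functions simply take the degrees as inputs; the function names and sample values are hypothetical.

```python
from statistics import mean


def needs_update_by_mean(third_degrees, second_threshold):
    """Mean rule: the tree is left alone only if the mean degree exceeds the threshold."""
    return mean(third_degrees) <= second_threshold


def sequences_to_insert(degree_by_sequence, third_threshold):
    """Per-sequence rule: return the first sequences that failed to match."""
    return [
        sequence
        for sequence, degree in degree_by_sequence.items()
        if degree <= third_threshold
    ]


update_needed = needs_update_by_mean([0.9, 0.4], second_threshold=0.7)
failed = sequences_to_insert({"cholera": 1.0, "typical cholera": 0.5}, third_threshold=0.8)
```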
  • step 36 updating the preset dictionary tree to obtain the target dictionary tree, may include the following steps:
  • If the character i corresponding to the first sequence i does not exist in the initial mapping, add a new mapping i, save the character i in the mapping i, and update the initial mapping to include the mapping i, where the first sequence i is any one of the multiple first sequences and the character i is any character in the first sequence i;
  • The electronic device may determine, based on the preset dictionary tree, an initial mapping corresponding to it. The initial mapping may be an initial path corresponding to the preset dictionary tree; the initial path indicates the mapping relationship between every two nodes in the preset dictionary tree and also reflects the mapping relationship between each node and the information it stores. If any character i in the first sequence i does not exist in the initial mapping corresponding to the multiple nodes, a new mapping i is added on the basis of the initial mapping to save the character i. At this point, the preset dictionary tree is updated to the first dictionary tree.
  • The initial mapping can likewise be updated to the first mapping, which includes the mapping i.
  • The above steps can be repeated for the other characters in the first sequence i to gradually update the first mapping.
  • The above steps can then be performed for all the first sequences to update the first mapping step by step. After all the characters in all the first sequences have been traversed, the target dictionary tree is obtained.
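  • The character-by-character update corresponds to ordinary trie insertion, where each node holds a mapping from characters to child nodes. The sketch below is one straightforward reading of the steps above; the class and function names are hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # mapping: character -> child node
        self.is_end = False  # True when a disease name ends at this node


def insert(root, name):
    """Insert one first sequence; return True if the tree had to be updated."""
    node, updated = root, False
    for ch in name:
        if ch not in node.children:          # character missing from the current mapping
            node.children[ch] = TrieNode()   # add a new mapping that saves the character
            updated = True
        node = node.children[ch]
    if not node.is_end:
        node.is_end = True
        updated = True
    return updated


def build_target_trie(icd_names, root=None):
    """Expand a preset (possibly empty) tree with every name in the set."""
    if root is None:
        root = TrieNode()
    for name in icd_names:
        insert(root, name)
    return root


root = build_target_trie(
    ["Amoebiasis", "Amoebic dysentery", "Addison's disease", "Alzheimer's disease"]
)
```

  • Names that share a prefix share nodes, so all four sample names hang off a single "A" node, and re-inserting an existing name leaves the tree unchanged.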
  • FIG. 1D is a schematic structural diagram of a target dictionary tree.
  • The target dictionary tree can be constructed based on the preset ICD standard disease name set. As shown in the figure, the target dictionary tree can be constructed from preset ICD standard disease names such as "Amoeba enteritis", "Amoebiasis", "Amoebic dysentery", "Addison's disease", and "Alzheimer's disease", where the solid circle represents the end node of a disease name path.
  • The name of the disease to be standardized can be matched and searched based on the target dictionary tree to obtain the corresponding ICD standard disease name. That is, the name of the disease to be standardized can be matched against the multiple preset ICD standard disease names in the preset ICD standard disease name set to obtain multiple first matching degrees, and the conversion result of the disease name to be standardized is obtained based on the multiple first matching degrees.
  • step 104 the name of the disease to be standardized is matched with the multiple preset ICD standard disease names in the preset ICD standard disease name set to obtain multiple first matching degrees .
  • The electronic device may determine the target mapping table corresponding to the target dictionary tree; the target mapping table may include a preset path corresponding to each disease name in the target dictionary tree. Then, based on the target dictionary tree, starting with the first character j of the disease name to be standardized, the search proceeds downward in order from the head node of the target dictionary tree to obtain the path j to be matched and the preset path j corresponding to it in the mapping table. The path j to be matched is compared with the preset path j to obtain a first matching degree, and so on, until every character of the disease name to be standardized has been traversed against the paths of the target dictionary tree and multiple first matching degrees are obtained.
  • The preset condition can be set by the user or by system default, and is not limited here. For example, it can be specified that the matching is successful when the first matching degree is 100%.
  • When the matching is successful, the preset ICD standard disease name corresponding to the target first matching degree that meets the preset condition is determined as the conversion result of the disease name to be standardized; otherwise, the matching fails.
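  • The 100% rule can be illustrated with a small lookup. The first matching degree is assumed here to be the share of the query's characters found along a path from the head node; that definition, the `END` marker, and the function names are illustrative assumptions.

```python
END = "$end"  # hypothetical marker key meaning "a disease name ends here"


def build_trie(names):
    """Build a nested-dict dictionary tree from a list of names."""
    root = {}
    for name in names:
        node = root
        for ch in name:
            node = node.setdefault(ch, {})
        node[END] = name
    return root


def match(trie, query):
    """Walk down from the head node; return (matching degree, matched name or None)."""
    node, matched = trie, 0
    for ch in query:
        if ch not in node:
            break
        node = node[ch]
        matched += 1
    degree = matched / len(query) if query else 0.0
    name = node.get(END) if matched == len(query) else None
    return degree, name


def standardize(trie, query, required_degree=1.0):
    """Return the standard name when the preset condition is met, else None."""
    degree, name = match(trie, query)
    if name is not None and degree >= required_degree:
        return name
    return None  # matching failed


trie = build_trie(["cholera", "typical cholera"])
result = standardize(trie, "cholera")
partial_degree, _ = match(trie, "choleric")
```

  • A query such as "choleric" follows the tree for six of its eight characters and then falls off the path, so it has a matching degree below 100% and no conversion result.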
  • The electronic device may also match the name of the disease to be standardized against the preset ICD standard disease names based on path similarity and semantic similarity in the target dictionary tree, to obtain the conversion result of the name of the disease to be standardized. In this way, even when the characters of the name of the disease to be standardized are not completely matched in the target dictionary tree, the name can still be converted, which helps improve the accuracy of matching.
  • The target dictionary tree may include a root node. The second sequence corresponding to the name of the disease to be standardized can be determined; starting from the root node of the target dictionary tree, multiple second paths are obtained by matching the second sequence in the target dictionary tree; the semantic similarity of each second path is calculated to obtain multiple semantic similarities; the second path with the maximum semantic similarity is selected as the target path; and the characters corresponding to the target path in the target dictionary tree are determined as the conversion result of the name of the disease to be standardized.
  • The semantic similarity corresponding to each of the multiple second paths may be calculated with the following formula:
  • $X = a \times X_1 + b \times X_2$;
  • where n is the dimension of the word vector,
  • $X_1$ represents the path length corresponding to each second path, which can also be understood as the depth of each traversal,
  • and a and b are real numbers.
  • The values of a and b can be adjusted to change the weights of $X_1$ and $X_2$ in the semantic similarity calculation.
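  • The weighted score can be sketched as follows. The text here does not spell out how the second term is computed, so this example assumes it is the cosine similarity between n-dimensional word vectors; the vectors, weights, and function names are made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (assumed form of the second term)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def semantic_similarity(path_length, query_vec, path_vec, a=0.1, b=1.0):
    """X = a * X1 + b * X2, with X1 the path length and X2 the vector similarity."""
    return a * path_length + b * cosine(query_vec, path_vec)


def pick_target_path(query_vec, candidates, a=0.1, b=1.0):
    """candidates: list of (name, path_length, vector); return the best-scoring name."""
    best = max(
        candidates,
        key=lambda c: semantic_similarity(c[1], query_vec, c[2], a, b),
    )
    return best[0]


query_vec = [1.0, 0.0]
candidates = [
    ("cholera", 3, [1.0, 0.0]),
    ("typical cholera", 2, [0.0, 1.0]),
]
target = pick_target_path(query_vec, candidates)
```

  • Raising a favors deeper matches in the tree, while raising b favors candidates whose vectors are closer to the query.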
  • The following steps may be included: taking the root node as the starting point, select a path m (m is a positive integer) as the current subtree, where the path m is any path starting from the root node; in any layer of the current subtree, search for the first character of the name of the disease to be standardized. If the character is found, search for the second character of the name of the disease to be standardized in the layer below that layer, and then repeat the search for the third character of the name of the disease to be standardized.
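  • One possible reading of this layer-by-layer search is a depth-first walk that looks for the query's characters in order along each path, allowing characters of the stored name to be skipped. The sketch below follows that reading; since the description above is incomplete, the algorithm, the `END` marker, and the function names should be taken as assumptions.

```python
END = "$end"  # hypothetical marker key meaning "a disease name ends here"


def build_trie(names):
    """Build a nested-dict dictionary tree from a list of names."""
    root = {}
    for name in names:
        node = root
        for ch in name:
            node = node.setdefault(ch, {})
        node[END] = name
    return root


def find_paths(trie, query):
    """Return stored names whose root-to-end path contains the query's characters in order."""
    results = []

    def dfs(node, i):
        # i = number of query characters already found on the way down.
        if i == len(query) and END in node:
            results.append(node[END])
        for ch, child in node.items():
            if ch == END:
                continue
            found = i < len(query) and ch == query[i]
            dfs(child, i + 1 if found else i)

    dfs(trie, 0)
    return results


trie = build_trie(["Amoebiasis", "Amoebic dysentery", "Addison's disease"])
abbreviated = find_paths(trie, "Amdys")
prefix_hits = sorted(find_paths(trie, "Amoeb"))
```

  • This tolerates omissions and abbreviations in the query; the candidate paths it returns could then be ranked by a similarity score to choose the target path.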
  • the disease name standardization method described in the embodiments of this application is applied to electronic equipment, and this application can be applied to the field of smart medical care, thereby promoting the construction of smart cities.
  • the above method includes: obtaining a target dictionary, a current diagnosis text, and a preset ICD standard disease name set, where the preset ICD standard disease name set includes multiple preset ICD standard disease names; performing a word-cutting operation on the current diagnosis text based on the target dictionary to obtain the name of the disease to be standardized contained in the current diagnosis text; building a target dictionary tree based on the preset ICD standard disease name set; and, based on the target dictionary tree, matching the name of the disease to be standardized with the multiple preset ICD standard disease names to obtain multiple first matching degrees.
  • when a target first matching degree that meets the preset condition exists among the multiple first matching degrees, the target preset ICD standard disease name corresponding to the target first matching degree is obtained and determined as the conversion result of the disease name to be standardized.
  • in this way, the word-cutting operation performed on the current diagnosis text with the target dictionary reduces problems such as colloquialisms, typos, omissions, and abbreviations in the current diagnosis text.
  • in addition, the target dictionary tree constructed from the preset ICD standard disease name set is used to match the multiple preset ICD standard disease names with the name to be standardized to obtain the conversion result, which helps improve conversion efficiency and accuracy.
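The dictionary-based word-cutting step can be sketched as follows. The text does not say which segmentation algorithm is used, so forward maximum matching is assumed here purely for illustration; `cut_words` and `disease_names` are hypothetical names.

```python
def cut_words(text, dictionary, max_len=None):
    """Forward maximum matching: at each position take the longest dictionary word.

    Characters not covered by any dictionary word are emitted one at a time.
    """
    if max_len is None:
        max_len = max((len(w) for w in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest window first and shrink down to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def disease_names(text, dictionary):
    """Keep only the tokens that are dictionary entries (candidate disease names)."""
    return [t for t in cut_words(text, dictionary) if t in dictionary]
```

Given a dictionary containing 高血压 and 糖尿病, the text 患者高血压伴糖尿病 yields those two names as candidates for standardization.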
  • FIG. 2 is an exemplary flow chart of a disease name standardization method disclosed in an embodiment of the present application, which is applied to an electronic device.
  • the disease name standardization method may include the following steps:
  • the disease name standardization method described in the embodiment of this application obtains a preset ICD standard disease name set, which includes multiple preset ICD standard disease names; extracts historical diagnosis text information from the historical disease case database; performs data cleaning on the historical diagnosis text information to obtain a historical disease name set; performs data processing on the historical disease name set and the preset ICD standard disease name set to obtain the target dictionary; obtains the current diagnosis text; and, based on the target dictionary, performs a word-cutting operation on the current diagnosis text to obtain the name of the disease to be standardized contained in the current diagnosis text.
  • a target dictionary tree is then constructed based on the preset ICD standard disease name set, and, based on the target dictionary tree, the name of the disease to be standardized is matched with the multiple preset ICD standard disease names in the preset ICD standard disease name set to obtain multiple first matching degrees.
  • when a target first matching degree that meets the preset condition exists among the multiple first matching degrees, the corresponding target preset ICD standard disease name is obtained and determined as the conversion result of the disease name to be standardized.
  • in this way, the historical diagnosis text in the historical disease case database can be processed to obtain a target dictionary better suited to practical use, and the word-cutting operation performed on the current diagnosis text with this dictionary reduces problems such as colloquialisms, typos, omissions, and abbreviations in the current diagnosis text.
  • in addition, the target dictionary tree constructed from the preset ICD standard disease name set is used to match the multiple preset ICD standard disease names with the name to be standardized to obtain the conversion result, which helps improve conversion efficiency and accuracy.
  • FIG. 3 is an exemplary flowchart of a disease name standardization method disclosed in an embodiment of the present application, which is applied to an electronic device.
  • the disease name standardization method may include the following steps:
  • each first sequence includes at least one character.
  • the preset dictionary tree is not updated, and the preset dictionary tree is used as the target dictionary tree.
  • the electronic device can obtain the target dictionary, the current diagnosis text, and the preset ICD standard disease name set, where the preset ICD standard disease name set includes multiple preset ICD standard disease names; based on the target dictionary, perform a word-cutting operation on the current diagnosis text to obtain the name of the disease to be standardized contained in the current diagnosis text; and, based on the preset ICD standard disease name set, determine the first sequence corresponding to each of the multiple preset ICD standard disease names to obtain multiple first sequences, where each first sequence includes at least one character.
  • a preset dictionary tree that includes multiple nodes is obtained; the multiple first sequences are traversed, and each first sequence is matched with the multiple nodes of the preset dictionary tree to obtain multiple third matching degrees; the average value of the multiple third matching degrees is calculated. If the average value is greater than the second preset threshold, the preset dictionary tree is not updated and is used as the target dictionary tree.
  • if the average value is less than or equal to the second preset threshold, the preset dictionary tree is updated to obtain the target dictionary tree. Based on the target dictionary tree, the name of the disease to be standardized is matched with the multiple preset ICD standard disease names in the preset ICD standard disease name set to obtain multiple first matching degrees.
  • when a target first matching degree that meets the preset condition exists among the multiple first matching degrees, the target preset ICD standard disease name corresponding to the target first matching degree is obtained and determined as the conversion result of the disease name to be standardized.
  • in this way, the word-cutting operation performed on the current diagnosis text with the target dictionary reduces problems such as colloquialisms, typos, omissions, and abbreviations in the current diagnosis text.
  • the target dictionary tree is obtained by processing the preset ICD standard disease name set under the international standard.
  • the target dictionary tree is used to process the name of the disease to be standardized to obtain the conversion result of the disease name to be standardized, which helps improve conversion accuracy.
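The keep-or-update decision for the preset dictionary tree can be sketched in Python. The text does not define how a third matching degree is computed, so this sketch assumes it is the fraction of a sequence's characters that can be walked from the root; the nested-dict trie, the function names, and the threshold value are also assumptions.

```python
def prefix_match_degree(trie, seq):
    """Assumed third matching degree: share of seq's characters matched from the root."""
    node, matched = trie, 0
    for ch in seq:
        if ch not in node:
            break
        node = node[ch]
        matched += 1
    return matched / len(seq)  # each first sequence has at least one character


def insert(trie, seq):
    """Add any missing nodes for seq, one character per layer."""
    node = trie
    for ch in seq:
        node = node.setdefault(ch, {})


def build_target_trie(preset_trie, sequences, threshold=0.8):
    """Keep the preset trie if the mean degree exceeds the threshold, else update it.

    Note: the update mutates preset_trie in place and returns the same object.
    """
    degrees = [prefix_match_degree(preset_trie, s) for s in sequences]
    mean = sum(degrees) / len(degrees)
    if mean > threshold:
        return preset_trie, mean  # preset tree already covers the ICD names well
    for s in sequences:
        insert(preset_trie, s)
    return preset_trie, mean
```

Checking the mean first avoids rebuilding the tree when the preset tree already covers most standard names.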
  • FIG. 4 is a schematic structural diagram of an electronic device provided by an embodiment of the application. As shown in FIG. 4, it includes a processor, a communication interface, a memory, and one or more programs.
  • the processor, the communication interface, and the memory are connected to each other, where the memory is used to store a computer program, the computer program includes program instructions, the processor is configured to call the program instructions, and the one or more programs include Instructions to perform the following steps:
  • obtain a target dictionary, a current diagnosis text, and a preset ICD standard disease name set, where the preset ICD standard disease name set includes a plurality of preset ICD standard disease names; based on the target dictionary, perform a word-cutting operation on the current diagnosis text to obtain the name of the disease to be standardized contained in the current diagnosis text; based on the preset ICD standard disease name set, construct a target dictionary tree; based on the target dictionary tree, match the name of the disease to be standardized with the multiple preset ICD standard disease names in the preset ICD standard disease name set to obtain multiple first matching degrees; when there is a target first matching degree that meets the preset condition among the multiple first matching degrees, obtain the target preset ICD standard disease name corresponding to the target first matching degree, and determine the target preset ICD standard disease name as the conversion result of the disease name to be standardized.
  • the electronic device described in the embodiment of the application can obtain the target dictionary, the current diagnosis text, and the preset ICD standard disease name set, where the preset ICD standard disease name set includes multiple preset ICD standard disease names; based on the target dictionary, perform a word-cutting operation on the current diagnosis text to obtain the name of the disease to be standardized contained in the current diagnosis text; build a target dictionary tree based on the preset ICD standard disease name set; and, based on the target dictionary tree, match the name of the disease to be standardized with the multiple preset ICD standard disease names in the preset ICD standard disease name set to obtain multiple first matching degrees.
  • in this way, the word-cutting operation can be performed on the current diagnosis text with the target dictionary.
  • in addition, the target dictionary tree constructed from the preset ICD standard disease name set is used to match the multiple preset ICD standard disease names with the name to be standardized to obtain the conversion result, which helps improve conversion efficiency and accuracy.
  • the above-mentioned processor is specifically further configured to: extract historical diagnosis text information from the historical disease case database; perform data cleaning on the historical diagnosis text information to obtain a historical disease name set; perform data processing on the historical disease name set and the preset ICD standard disease name set to obtain the target dictionary.
  • the processor is specifically configured to: obtain multiple preset regular expressions for multiple preset disease names, where each preset disease name corresponds to one preset regular expression; match the historical diagnosis text information with each of the multiple preset regular expressions to obtain multiple second matching degrees, where each preset regular expression corresponds to one second matching degree; determine at least one preset disease name corresponding to at least one second matching degree that exceeds the first preset threshold among the multiple second matching degrees, and use the at least one preset disease name as the disease name set.
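A minimal Python sketch of the regular-expression cleaning step follows. The text does not define the second matching degree, so it is assumed here to be the fraction of historical records in which a pattern matches; the patterns, records, and threshold in the usage note are invented for illustration.

```python
import re


def select_disease_names(records, patterns, threshold=0.1):
    """Return preset disease names whose pattern's matching degree exceeds the threshold.

    records:  list of historical diagnosis text strings
    patterns: dict mapping each preset disease name to its regular expression
    The degree is an assumed definition: matching records / total records.
    """
    selected = []
    for name, pattern in patterns.items():
        regex = re.compile(pattern)
        hits = sum(1 for r in records if regex.search(r))
        degree = hits / len(records) if records else 0.0
        if degree > threshold:
            selected.append(name)
    return selected
```

For example, with four records of which two mention hypertension or "HTN" and one mentions "T2DM", a threshold of 0.3 keeps only the hypertension entry.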
  • the processor is specifically configured to: merge the disease name set with the preset ICD standard disease name set to obtain a first dictionary, where the first dictionary includes multiple first disease names; deduplicate the multiple first disease names to obtain the target dictionary.
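The merge-and-deduplicate step can be sketched as below. Trimming whitespace and keeping the first occurrence of each name are assumptions added for illustration; the function name is hypothetical.

```python
def build_target_dictionary(historical_names, icd_names):
    """Merge two name collections into a first dictionary, then deduplicate it.

    Order of first occurrence is preserved; surrounding whitespace is trimmed
    (an assumption, since the text does not describe normalization).
    """
    first_dictionary = list(historical_names) + list(icd_names)
    seen = set()
    target = []
    for name in first_dictionary:
        key = name.strip()
        if key and key not in seen:
            seen.add(key)
            target.append(key)
    return target
```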
  • the processor is specifically configured to: determine, based on the preset ICD standard disease name set, the first sequence corresponding to each of the multiple preset ICD standard disease names in the preset ICD standard disease name set to obtain multiple first sequences, where each first sequence includes at least one character; obtain a preset dictionary tree that includes multiple nodes; traverse the multiple first sequences, and match each first sequence with the multiple nodes of the preset dictionary tree to obtain multiple third matching degrees; calculate the average value of the multiple third matching degrees; if the average value is greater than a second preset threshold, leave the preset dictionary tree unchanged and use it as the target dictionary tree; if the average value is less than or equal to the second preset threshold, update the preset dictionary tree to obtain the target dictionary tree.
  • the above-mentioned processor is specifically configured to: based on the preset dictionary tree, determine the initial mapping corresponding to the multiple nodes in the preset dictionary tree.
  • the above-mentioned processor is specifically configured to: determine a target mapping table corresponding to the target dictionary tree, where the target mapping table includes a plurality of preset paths and each disease name in the target dictionary tree corresponds to one preset path; based on the target dictionary tree, starting from character j, search downward in turn from the head node of the target dictionary tree to obtain the path j to be matched and the preset path j corresponding to that path in the target mapping table, where character j is the first character in the name of the disease to be standardized; match the path j to be matched with the preset path j to obtain the first matching degree j of character j.
  • an electronic device includes hardware structures and/or software modules corresponding to each function.
  • this application can be implemented in the form of hardware or a combination of hardware and computer software. Whether a certain function is executed by hardware or computer software-driven hardware depends on the specific application and design constraint conditions of the technical solution. Professionals and technicians can use different methods for each specific application to implement the described functions, but such implementation should not be considered beyond the scope of this application.
  • FIG. 5 is a schematic structural diagram of a disease name standardization device disclosed in an embodiment of the present application, which is applied to electronic equipment.
  • the device includes: an acquisition unit 501, a word segmentation unit 502, a construction unit 503, a matching unit 504, and a determining unit 505, wherein:
  • the acquiring unit 501 is configured to acquire a target dictionary, a current diagnosis text, and a preset ICD standard disease name set, where the preset ICD standard disease name set includes multiple preset ICD standard disease names;
  • the word segmentation unit 502 is configured to perform a word segmentation operation on the current diagnosis text based on the target dictionary to obtain the name of the disease to be standardized contained in the current diagnosis text;
  • the construction unit 503 is configured to construct a target dictionary tree based on the preset ICD standard disease name set;
  • the matching unit 504 is configured to match the name of the disease to be standardized with the plurality of preset ICD standard disease names in the preset ICD standard disease name set based on the target dictionary tree to obtain a plurality of first matching degrees;
  • the determining unit 505 is configured to obtain the target preset ICD standard disease name corresponding to the target first matching degree when there is a target first matching degree that meets a preset condition among the plurality of first matching degrees, and The target preset ICD standard disease name is determined as the conversion result of the disease name to be standardized.
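The selection performed by the determining unit can be sketched as follows. The text leaves the "preset condition" open, so this sketch assumes it means "the highest first matching degree, provided it reaches a threshold"; the function name and default threshold are hypothetical.

```python
def pick_standard_name(degrees, threshold=0.8):
    """Return the ICD standard name with the best matching degree, or None.

    degrees: dict mapping each preset ICD standard disease name to its first
    matching degree. The threshold test is an assumed preset condition.
    """
    if not degrees:
        return None
    name, best = max(degrees.items(), key=lambda kv: kv[1])
    return name if best >= threshold else None
```

Returning None when no candidate qualifies leaves the caller free to fall back to another strategy, such as scoring partial paths.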
  • the construction unit 503 is specifically configured to: determine, based on the preset ICD standard disease name set, the first sequence corresponding to each of the multiple preset ICD standard disease names in the preset ICD standard disease name set to obtain multiple first sequences, where each first sequence includes at least one character; obtain a preset dictionary tree that includes multiple nodes; traverse the multiple first sequences, and match each first sequence with the multiple nodes of the preset dictionary tree to obtain multiple third matching degrees; calculate the average value of the multiple third matching degrees; if the average value is greater than a second preset threshold, leave the preset dictionary tree unchanged and use it as the target dictionary tree; if the average value is less than or equal to the second preset threshold, update the preset dictionary tree to obtain the target dictionary tree.
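  • the construction unit 503 is specifically further configured to: based on the preset dictionary tree, determine the initial mapping corresponding to the multiple nodes in the preset dictionary tree; if a character i corresponding to a first sequence i does not exist in the initial mapping, add a new mapping i and store the character i in the mapping i, where the first sequence i is any one of the multiple first sequences, the character i is any character in the first sequence, and i is a positive integer; based on the mapping i, update the preset dictionary tree to the target dictionary tree.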
  • the matching unit 504 is specifically configured to: determine a target mapping table corresponding to the target dictionary tree, where the target mapping table includes a plurality of preset paths and each disease name in the target dictionary tree corresponds to one preset path;
  • based on the target dictionary tree, starting from character j, search downward in turn from the head node of the target dictionary tree to obtain the path j to be matched and the preset path j corresponding to that path in the target mapping table,
  • where the character j is the first character in the name of the disease to be standardized;
  • match the path j to be matched with the preset path j to obtain the first matching degree j of the character j.
  • An embodiment of the present application further provides a computer-readable storage medium, wherein the computer storage medium stores a computer program for electronic data exchange, and when the program instructions are executed by a processor, the processor executes the following steps: obtain a target dictionary, a current diagnosis text, and a preset ICD standard disease name set, where the preset ICD standard disease name set includes a plurality of preset ICD standard disease names; based on the target dictionary, perform a word-cutting operation on the current diagnosis text to obtain the name of the disease to be standardized contained in the current diagnosis text; construct a target dictionary tree based on the preset ICD standard disease name set; based on the target dictionary tree, match the name of the disease to be standardized with the multiple preset ICD standard disease names in the preset ICD standard disease name set to obtain multiple first matching degrees; when there is a target first matching degree that meets the preset condition among the multiple first matching degrees, obtain the target preset ICD standard disease name corresponding to the target first matching degree, and determine the target preset ICD standard disease name as the conversion result of the disease name to be standardized.
  • the processor is also used to: extract historical diagnosis text information from a historical disease case database; perform data cleaning on the historical diagnosis text information to obtain a historical disease name set; perform data processing on the historical disease name set and the preset ICD standard disease name set to obtain the target dictionary.
  • the processor is also used to: obtain multiple preset regular expressions for multiple preset disease names, where each preset disease name corresponds to one preset regular expression; match the historical diagnosis text information with each of the multiple preset regular expressions to obtain multiple second matching degrees, where each preset regular expression corresponds to one second matching degree; determine at least one preset disease name corresponding to at least one second matching degree that exceeds a first preset threshold among the multiple second matching degrees, and use the at least one preset disease name as the disease name set.
  • the processor is further configured to: merge the disease name set and the preset ICD standard disease name set to obtain a first dictionary, where the first dictionary includes multiple first disease names; deduplicate the multiple first disease names to obtain the target dictionary.
  • the processor is further configured to: determine, based on the preset ICD standard disease name set, the first sequence corresponding to each preset ICD standard disease name in the preset ICD standard disease name set to obtain multiple first sequences, where each first sequence includes at least one character; obtain a preset dictionary tree that includes multiple nodes; traverse the multiple first sequences, and match each first sequence with the multiple nodes of the preset dictionary tree to obtain multiple third matching degrees; calculate the average value of the multiple third matching degrees; if the average value is greater than the second preset threshold, leave the preset dictionary tree unchanged and use it as the target dictionary tree; if the average value is less than or equal to the second preset threshold, update the preset dictionary tree to obtain the target dictionary tree.
  • the processor is further configured to: determine, based on the preset dictionary tree, the initial mapping corresponding to the multiple nodes in the preset dictionary tree; if a character i corresponding to a first sequence i does not exist in the initial mapping, add a new mapping i and store the character i in the mapping i, where the first sequence i is any one of the multiple first sequences, the character i is any character in the first sequence, and i is a positive integer; based on the mapping i, update the preset dictionary tree to the target dictionary tree.
  • the processor is further configured to: determine a target mapping table corresponding to the target dictionary tree, where the target mapping table includes a plurality of preset paths and each disease name in the target dictionary tree corresponds to one preset path; based on the target dictionary tree, starting from character j, search downward in turn from the head node of the target dictionary tree to obtain the path j to be matched and the preset path j corresponding to that path in the target mapping table, where the character j is the first character in the name of the disease to be standardized; match the path j to be matched with the preset path j to obtain the first matching degree j of the character j.
  • the computer-readable storage medium may mainly include a storage program area and a storage data area, where the storage program area may store an operating system, an application program required by at least one function, etc.; the storage data area may store data created according to the use of blockchain nodes, etc.
  • the computer-readable storage medium may be non-volatile or volatile.
  • the blockchain referred to in this application is a new application mode of computer technology such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • A blockchain, essentially a decentralized database, is a series of data blocks linked by cryptographic methods. Each data block contains a batch of network transaction information, which is used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Epidemiology (AREA)
  • Medical Informatics (AREA)
  • Primary Health Care (AREA)
  • Public Health (AREA)
  • Medical Treatment And Welfare Office Work (AREA)
  • Machine Translation (AREA)

Abstract

Provided are a disease name standardization method, apparatus, device, and storage medium, said method comprising: obtaining a target dictionary, current diagnosis text, and preset ICD standard disease name set, said preset ICD standard disease name set comprising a plurality of preset ICD standard disease names (101); on the basis of the target dictionary, performing word-cut operations on the current diagnosis text to obtain a name of the disease to be standardized contained in the current diagnosis text (102); on the basis of the preset ICD standard disease name set, building a target dictionary tree (103); on the basis of the target dictionary tree, matching the name of the disease to be standardized with a plurality of preset ICD standard disease names in the preset ICD standard disease name set to obtain a plurality of first degrees of matching (104); if there is a target first degree of matching which meets a preset condition among the plurality of first degrees of matching, then obtaining a target preset ICD standard disease name corresponding to the target first degree of matching, and determining the target preset ICD standard disease name to be the result of conversion of the disease name to be standardized (105). This helps improve conversion efficiency and accuracy. In addition, the present application also relates to blockchain technology, and data can be stored in a blockchain node.

Description

Disease name standardization method, device, equipment and storage medium
This application claims priority to the Chinese patent application No. 2020104013701, titled "Disease Name Standardization Method and Device", filed with the Chinese Patent Office on May 13, 2020, the entire content of which is incorporated into this application by reference.
Technical field
This application relates to the technical field of disease name standardization in artificial intelligence, and specifically relates to a disease name standardization method, device, equipment and storage medium.
Background
In recent years, with the vigorous development of smart medical care, smart medical technology based on big data has placed increasingly high requirements on data quality. As an important feature, the name of the diagnosed disease plays an important role in the field of medical research.
The inventor realized that different doctors in a hospital have different writing habits, so it is often difficult to achieve uniformity for the same disease name. Therefore, how to quickly and effectively extract the doctor's diagnosed disease name from the medical record has become a problem that needs to be solved.
Summary of the invention
The embodiments of the present application provide a disease name standardization method, device, equipment, and storage medium, which help improve the efficiency of disease name standardization.
The first aspect of the embodiments of the present application provides a disease name standardization method, which is applied to electronic equipment and includes:
acquiring a target dictionary, a current diagnosis text, and a preset ICD (International Classification of Diseases) standard disease name set, where the preset ICD standard disease name set includes multiple preset ICD standard disease names;
based on the target dictionary, performing a word-cutting operation on the current diagnosis text to obtain the name of the disease to be standardized contained in the current diagnosis text;
constructing a target dictionary tree based on the preset ICD standard disease name set;
based on the target dictionary tree, matching the name of the disease to be standardized with the multiple preset ICD standard disease names in the preset ICD standard disease name set to obtain multiple first matching degrees;
when there is a target first matching degree that meets a preset condition among the multiple first matching degrees, acquiring the target preset ICD standard disease name corresponding to the target first matching degree, and determining the target preset ICD standard disease name as the conversion result of the name of the disease to be standardized.
The second aspect of the embodiments of the present application provides a disease name standardization device, which is applied to electronic equipment, and the device includes: an acquisition unit, a word segmentation unit, a construction unit, a matching unit, and a determination unit, wherein:
the acquiring unit is configured to acquire a target dictionary, a current diagnosis text, and a preset ICD standard disease name set, where the preset ICD standard disease name set includes a plurality of preset ICD standard disease names;
the word segmentation unit is configured to perform a word-cutting operation on the current diagnosis text based on the target dictionary to obtain the name of the disease to be standardized contained in the current diagnosis text;
the construction unit is configured to construct a target dictionary tree based on the preset ICD standard disease name set;
the matching unit is configured to match the name of the disease to be standardized with the plurality of preset ICD standard disease names in the preset ICD standard disease name set based on the target dictionary tree to obtain a plurality of first matching degrees;
the determining unit is configured to obtain the target preset ICD standard disease name corresponding to the target first matching degree when there is a target first matching degree that satisfies a preset condition among the plurality of first matching degrees, and to determine the target preset ICD standard disease name as the conversion result of the disease name to be standardized.
A third aspect of the embodiments of the present application provides an electronic device, which includes a processor, a memory, a communication interface, and one or more programs, where the one or more programs are stored in the memory and configured to be executed by the processor, the memory is used to store a computer program, the computer program includes program instructions, and the processor is configured to invoke the program instructions to execute the method described in the first aspect of the embodiments of the present application.
A fourth aspect of the embodiments of the present application provides a computer-readable storage medium, including a storage data area and a storage program area. The storage data area stores data created according to the use of blockchain nodes, and the storage program area stores a computer program, where the computer program includes program instructions that, when executed by a processor, perform part or all of the steps described in the first aspect of the embodiments of the present application.
According to the embodiments of the present application, the method is applied to an electronic device and includes: obtaining a target dictionary, a current diagnosis text, and a preset ICD standard disease name set, the preset ICD standard disease name set including a plurality of preset ICD standard disease names; performing a word segmentation operation on the current diagnosis text based on the target dictionary to obtain the disease name to be standardized contained in the current diagnosis text; constructing a target dictionary tree based on the preset ICD standard disease name set; matching, based on the target dictionary tree, the disease name to be standardized against the plurality of preset ICD standard disease names in the preset ICD standard disease name set to obtain a plurality of first matching degrees; and, when a target first matching degree satisfying a preset condition exists among the plurality of first matching degrees, obtaining the target preset ICD standard disease name corresponding to the target first matching degree and determining it as the conversion result of the disease name to be standardized. In this way, segmenting the current diagnosis text with the target dictionary reduces the effect of colloquial wording, typos, omissions, and abbreviations in the current diagnosis text. In addition, matching the plurality of preset ICD standard disease names against the disease name to be standardized through a target dictionary tree constructed from the preset ICD standard disease name set, to obtain the conversion result, helps improve conversion efficiency and accuracy.
Description of the Drawings
FIG. 1A is a schematic structural diagram of a disease name standardization method provided by an embodiment of the present application;
FIG. 1B is a schematic flowchart of a disease name standardization method provided by an embodiment of the present application;
FIG. 1C is a schematic structural diagram of a method for extracting disease names to be standardized provided by an embodiment of the present application;
FIG. 1D is a schematic structural diagram of a target dictionary tree provided by an embodiment of the present application;
FIG. 2 is a schematic flowchart of a disease name standardization method provided by an embodiment of the present application;
FIG. 3 is a schematic flowchart of a disease name standardization method provided by an embodiment of the present application;
FIG. 4 is a schematic structural diagram of an electronic device provided by an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a disease name standardization apparatus provided by an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in this application, without creative effort, fall within the protection scope of this application.
The terms "first", "second", and the like in the specification, claims, and drawings of this application are used to distinguish different objects, not to describe a particular order. In addition, the terms "include" and "have", and any variations of them, are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that includes a series of steps or units is not limited to the listed steps or units, but optionally also includes steps or units that are not listed, or optionally also includes other steps or units inherent to the process, method, product, or device.
A reference to an "embodiment" in this application means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application. Occurrences of the phrase in various places in the specification do not necessarily all refer to the same embodiment, nor to separate or alternative embodiments that are mutually exclusive with other embodiments. A person skilled in the art understands, explicitly and implicitly, that the embodiments described in this application may be combined with other embodiments.
For a better understanding of the embodiments of the present application, the way in which the embodiments are applied is introduced below.
The electronic device may include various handheld devices with wireless communication capability, vehicle-mounted devices, wearable devices (for example, smart watches, smart bracelets, and pedometers), computing devices or other processing devices connected to a wireless modem, as well as various forms of user equipment (UE), mobile stations (MS), terminal devices, and so on. For ease of description, the devices mentioned above are collectively referred to as electronic devices.
Refer to FIG. 1A, a schematic structural diagram of a disease name standardization method provided by an embodiment of the present application. Based on this structure, a target dictionary, a current diagnosis text, and a preset ICD standard disease name set are obtained, the preset ICD standard disease name set including a plurality of preset ICD standard disease names. A word segmentation operation is then performed on the current diagnosis text based on the target dictionary to obtain the disease name to be standardized contained in the current diagnosis text, and a target dictionary tree is constructed based on the preset ICD standard disease name set. Based on the target dictionary tree, the disease name to be standardized is matched against the plurality of preset ICD standard disease names in the preset ICD standard disease name set to obtain a plurality of first matching degrees. Finally, when a target first matching degree satisfying a preset condition exists among the plurality of first matching degrees, the target preset ICD standard disease name corresponding to the target first matching degree is obtained and determined as the conversion result of the disease name to be standardized.
It can be seen that, with the disease name standardization method provided by the embodiments of this application, segmenting the current diagnosis text with the target dictionary reduces the effect of colloquial wording, typos, omissions, and abbreviations in the current diagnosis text. In addition, matching the plurality of preset ICD standard disease names against the disease name to be standardized through a target dictionary tree constructed from the preset ICD standard disease name set, to obtain the conversion result, helps improve conversion efficiency and accuracy.
Refer to FIG. 1B, a schematic flowchart of a disease name standardization method provided by an embodiment of the present application. The method is applied to an electronic device and includes the following steps:
101. Obtain a target dictionary, a current diagnosis text, and a preset ICD standard disease name set, the preset ICD standard disease name set including a plurality of preset ICD standard disease names.
The embodiments of the present application may be applied to an electronic device, which may include the disease name standardization system shown in FIG. 1A. The target dictionary may be obtained by data processing of a plurality of historical diagnosis cases of a plurality of patients stored in a historical disease case database, and may include a plurality of historical disease names. The preset ICD standard disease name set may be set by the user or by system default, and may include a plurality of preset ICD standard disease names. The current diagnosis text may be the diagnosis text corresponding to any new disease case, or any diagnosis text whose disease names need to be standardized. The current diagnosis text may include at least one of the following: prescription information, diagnosis information, symptom description information, discharge summary information, hospital information, department information, patient information, and so on, which is not limited here.
In a possible example, before step 101, that is, before the target dictionary is obtained, the method may further include the following steps:
A1. Extract historical diagnosis text information from the historical disease case database;
A2. Perform data cleaning on the historical diagnosis text information to obtain a historical disease name set;
A3. Perform data processing on the historical disease name set and the preset ICD standard disease name set to obtain the target dictionary.
The historical disease case database may store a plurality of historical diagnosis cases of a plurality of patients. A historical diagnosis case may include at least one of the following: admission diagnosis information, discharge diagnosis information, and so on, which is not limited here. Both the admission diagnosis information and the discharge diagnosis information may include at least one of the following: prescription information, diagnosis information, symptom description information, discharge summary information, hospital information, department information, patient information, and so on, which is not limited here. The prescription information may include at least one of the following: disease name, disease symptoms, drug name, drug dosage, and so on, which is not limited here.
In a specific implementation, a plurality of historical diagnosis cases may be extracted from the historical disease case database, and historical diagnosis text information may be extracted from them. The historical diagnosis text information may then be cleaned according to a preset rule to obtain the historical disease name set. The preset rule may be set by the user or by system default, which is not limited here. For example, unneeded fields (such as fields that are not disease names) may be removed from the historical diagnosis text information, and missing fields in the historical diagnosis text information may then be supplemented or corrected based on empirical knowledge, finally yielding the historical disease name set, which may include a plurality of disease names corresponding to the plurality of historical diagnosis cases. Further, data processing may be performed on the historical disease name set and the preset ICD standard disease name set to obtain an expanded target dictionary, which may include a plurality of disease names. Cleaning the data with the preset rule helps mitigate the inaccuracy and incompleteness of rule-based extraction. In addition, the extracted disease names need no further manual correction, which helps save labor costs.
In a possible example, step A2, performing data cleaning on the historical diagnosis text information to obtain the historical disease name set, may include the following steps:
A21. Obtain a plurality of preset regular expressions for a plurality of preset disease names, wherein each preset disease name corresponds to one preset regular expression;
A22. Match the historical diagnosis text information against each of the plurality of preset regular expressions to obtain a plurality of second matching degrees, each preset regular expression corresponding to one second matching degree;
A23. Determine at least one preset disease name corresponding to at least one second matching degree, among the plurality of second matching degrees, that exceeds a first preset threshold, and use the at least one preset disease name as the historical disease name set.
The first preset threshold may be set by the user or by system default, which is not limited here. The electronic device may store a plurality of preset disease names in advance and preset one regular expression for each preset disease name. A preset regular expression may consist of ordinary characters and metacharacters, and may express the logical relationship between the characters of its corresponding preset disease name. Because the historical diagnosis text information may contain many colloquial expressions, repeated names, abbreviations, and typos, different preset regular expressions may be set in advance according to the word-formation characteristics of medical terms. For example, they may be formulated according to the delimiters contained in the corresponding disease names in practice, such as "(%s\s*\d+)|(%s\s*(\s*\d+)". In this way, the diagnosis text information can be cleaned according to the preset regular expressions, removing meaningless characters and repeated names from the data to obtain a disease name set containing complete disease names.
In a specific implementation, the historical diagnosis text information may be matched against each preset regular expression, so as to logically filter the historical diagnosis text information and obtain a plurality of second matching degrees, each preset regular expression corresponding to one second matching degree. At least one preset disease name corresponding to at least one second matching degree greater than the first preset threshold may then be selected from the plurality of second matching degrees and used as the disease name set. In this way, a complete and reliable disease name set can be obtained.
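Steps A21 to A23 can be sketched as follows. This is only an illustrative sketch: the source does not define how a second matching degree is computed, so the code assumes it is the fraction of historical records in which a pattern is found. The function name `clean_history`, the sample records, and the patterns are all hypothetical.

```python
import re

def clean_history(records, patterns, first_threshold):
    """Return the preset disease names whose matching degree exceeds the threshold.

    Assumption: the matching degree of a pattern is the fraction of records
    in which the pattern is found; the source does not define it.
    """
    if not records:
        return []
    kept = []
    for name, pattern in patterns.items():
        regex = re.compile(pattern, re.IGNORECASE)
        hits = sum(1 for record in records if regex.search(record))
        degree = hits / len(records)      # second matching degree (A22)
        if degree > first_threshold:      # threshold test (A23)
            kept.append(name)
    return kept

# Hypothetical data for illustration only.
records = [
    "Type 2  diabetes, poorly controlled",
    "hypertension 3",
    "type2 diabetes; hypertension",
]
patterns = {
    "type 2 diabetes": r"type\s*2\s*diabetes",
    "hypertension": r"hypertension",
    "cholera": r"cholera",
}
historical_names = clean_history(records, patterns, first_threshold=0.5)
```

With these sample inputs, two of the three patterns are found in two of three records each and pass the 0.5 threshold, while the third is never found and is dropped.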
In a possible example, step A3, performing data processing on the historical disease name set and the preset ICD standard disease name set to obtain the target dictionary, may include the following steps:
A31. Merge the historical disease name set with the preset ICD standard disease name set to obtain a first dictionary, the first dictionary including a plurality of first disease names;
A32. Deduplicate the plurality of first disease names to obtain the target dictionary.
The preset ICD standard disease name set may be set by the user or by system default, and may include a plurality of preset ICD standard disease names. The representation of a preset ICD standard disease name may be determined from certain characteristics of the diseases; for example, diseases may be classified according to certain rules and the disease names represented by codes. To expand the dictionary of disease names so that it better fits real data, data processing may be performed on the disease name set and the preset ICD standard disease name set to obtain the expanded target dictionary. The target dictionary still includes the plurality of preset ICD standard disease names, which also helps improve the accuracy of word segmentation on a new diagnosis text (the current diagnosis text).
In a specific implementation, the disease name set may be merged with the preset ICD standard disease name set to obtain the first dictionary, and identical, repeated first disease names in the first dictionary may then be deduplicated to obtain the target dictionary.
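A minimal sketch of steps A31 and A32, assuming both name sets are available as plain lists of strings. The function name `build_target_dictionary` and the sample names are hypothetical; preserving first-seen order is a design choice of this sketch, not something the source requires.

```python
def build_target_dictionary(historical_names, icd_names):
    """Merge the two name sets (A31) and remove duplicates (A32)."""
    first_dictionary = list(historical_names) + list(icd_names)
    # dict.fromkeys keeps the first occurrence of each name, in order.
    return list(dict.fromkeys(first_dictionary))

# Hypothetical data for illustration only.
target_dictionary = build_target_dictionary(
    ["influenza", "cholera", "influenza"],
    ["cholera", "amoebiasis"],
)
```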
FIG. 1C is a schematic structural diagram of a method for extracting disease names to be standardized. As shown in the figure, historical diagnosis text information is extracted from the historical disease case database and cleaned to obtain the historical disease name set; data processing is performed on the historical disease name set and the preset ICD standard disease name set to obtain the target dictionary; the current diagnosis text is obtained; and a word segmentation operation is performed on the current diagnosis text based on the target dictionary to obtain the disease name to be standardized contained in the current diagnosis text. In this way, segmenting the current diagnosis text with the target dictionary reduces the effect of colloquial wording, typos, omissions, and abbreviations in the current diagnosis text, and the extracted disease names need no further manual correction, which helps save labor costs.
102. Based on the target dictionary, perform a word segmentation operation on the current diagnosis text to obtain the disease name to be standardized contained in the current diagnosis text.
Because the current diagnosis text may contain many colloquial expressions, repeated names, or abbreviated names, the word segmentation operation may be performed on the current diagnosis text based on the target dictionary obtained by processing the historical diagnosis information in the historical disease case database, so as to obtain the disease name to be standardized contained in the current diagnosis text. The current diagnosis text may be any new diagnosis text. In this way, disease names can be extracted from the current diagnosis text more quickly based on the target dictionary, which effectively addresses the inaccuracy and incompleteness of rule-based extraction, and the extracted disease names need no further manual correction, which helps improve efficiency.
In a specific implementation, performing the word segmentation operation on the current diagnosis text based on the target dictionary to obtain the disease name to be standardized may include the following: statistics are collected with the words in the target dictionary as units, counting the frequency with which each word occurs in the current diagnosis text. That is, for any sentence to be segmented in the current diagnosis text, all possible segmentation results are enumerated, and the segmentation result with the highest probability is used to obtain the disease name to be standardized.
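One common way to pick the highest-probability segmentation is a unigram model solved by dynamic programming, sketched below. This is an assumption about the technique, not the source's stated implementation: the word counts in `freq`, the fallback count for out-of-dictionary single characters (`unknown_count`), and the function names are all hypothetical.

```python
import math

def segment(text, freq, unknown_count=0.5):
    """Return the segmentation of `text` with the highest unigram probability.

    `freq` maps dictionary words to counts and must be non-empty. Characters
    not covered by any dictionary word are emitted as single-character words
    with a small assumed count.
    """
    total = sum(freq.values())
    max_len = max(map(len, freq))
    # best[i] = (best log-probability of text[:i], start index of its last word)
    best = [(0.0, 0)] + [(-math.inf, 0)] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in freq:
                count = freq[word]
            elif end - start == 1:
                count = unknown_count
            else:
                continue
            score = best[start][0] + math.log(count / total)
            if score > best[end][0]:
                best[end] = (score, start)
    # Walk the back-pointers from the end of the text to recover the words.
    words, end = [], len(text)
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]

def extract_disease_names(text, freq):
    """Keep only the segmented words that appear in the target dictionary."""
    return [word for word in segment(text, freq) if word in freq]

# Hypothetical counts for illustration only.
freq = {"阿米巴病": 5, "阿米巴性痢疾": 3, "痢疾": 4, "高血压": 10}
names = extract_disease_names("阿米巴性痢疾伴高血压", freq)
```

In this example the longer dictionary entry wins over the shorter one because splitting the text into the shorter entry would require several low-probability single-character words.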
103. Construct a target dictionary tree based on the preset ICD standard disease name set.
The preset ICD standard disease name set may include a plurality of preset ICD standard disease names. When a new diagnosis text arrives, to facilitate matching against the text information in the target dictionary, or looking up a preset ICD standard disease name in the preset ICD standard disease name set, the electronic device may construct the target dictionary tree based on the preset ICD standard disease name set. The target dictionary tree can be understood as a dictionary tree (trie) built from one or more character strings, and is used to store the character strings in the preset ICD standard disease name set.
In a possible example, step 103, constructing the target dictionary tree based on the preset ICD standard disease name set, may include the following steps:
31. Based on the preset ICD standard disease name set, determine a first sequence corresponding to each of the plurality of preset ICD standard disease names in the preset ICD standard disease name set, to obtain a plurality of first sequences, wherein each first sequence includes at least one character;
32. Obtain a preset dictionary tree, the preset dictionary tree including a plurality of nodes;
33. Traverse the plurality of first sequences and match each first sequence against the plurality of nodes of the preset dictionary tree to obtain a plurality of third matching degrees;
34. Calculate the mean of the plurality of third matching degrees;
35. If the mean is greater than a second preset threshold, do not update the preset dictionary tree, and use the preset dictionary tree as the target dictionary tree;
36. If the mean is less than or equal to the second preset threshold, update the preset dictionary tree to obtain the target dictionary tree.
The second preset threshold may be set by the user or by system default, which is not limited here. The preset dictionary tree may likewise be set by the user or by system default, which is not limited here. The preset dictionary tree can be understood as an initial dictionary tree, that is, a dictionary tree that does not yet store the character strings of the preset ICD standard disease name set. The preset dictionary tree may include a plurality of nodes, each of which may correspond to one character. The preset dictionary tree may be generated from the International Classification of Diseases (ICD) codes. For example, the preset dictionary tree may have a two-level structure: the first level is the disease category, such as A00 (cholera), and the second level is the disease names contained in the category, such as A00.0 (classical cholera).
In a specific implementation, the first sequence corresponding to each of the plurality of preset ICD standard disease names may be determined from the preset ICD standard disease name set, to obtain a plurality of first sequences, wherein each first sequence includes at least one character, and the characters may include both text characters and special characters. The plurality of first sequences may then be traversed based on the preset dictionary tree, and each first sequence matched one by one against the plurality of nodes of the preset dictionary tree. If the matching succeeds, the preset dictionary tree is not updated; if the matching fails, the preset dictionary tree is updated to obtain the target dictionary tree. In this way, the preset dictionary tree can be expanded step by step so as to convert the preset ICD standard disease name set into the target dictionary tree, which helps improve the efficiency of subsequent disease name standardization. It should be noted that the target dictionary tree may also be constructed in advance from the preset ICD standard disease name set, so that when the electronic device obtains the current diagnosis text, it can carry out the subsequent disease name standardization steps directly based on the target dictionary tree.
Further, traversing each of the plurality of first sequences yields a plurality of third matching degrees, and the mean of the plurality of third matching degrees may be calculated. If the mean is greater than the second preset threshold, the matching is considered successful and the preset dictionary tree is not updated. Conversely, if the mean is less than or equal to the second preset threshold, the matching is considered to have failed, and the preset dictionary tree may be updated to obtain the target dictionary tree.
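The mean-based decision of steps 33 to 36 can be sketched as follows. The source does not say how a third matching degree is computed, so this sketch assumes it is the fraction of a sequence's leading characters that can be followed along a path from the root. The nested-dictionary tree layout, the `END` marker, the function names, and the sample strings are hypothetical.

```python
END = ""  # hypothetical end-of-name key; never collides with a real character

def insert(trie, sequence):
    """Add `sequence` to the tree, creating child nodes where missing."""
    node = trie
    for ch in sequence:
        node = node.setdefault(ch, {})
    node[END] = True

def third_matching_degree(trie, sequence):
    """Assumed metric: share of leading characters found on a root path."""
    node, matched = trie, 0
    for ch in sequence:
        if ch not in node:
            break
        node = node[ch]
        matched += 1
    return matched / len(sequence)

def build_target_trie(preset_trie, sequences, second_threshold):
    """Keep the preset tree if the mean degree exceeds the threshold,
    otherwise update it with every sequence. Returns (tree, mean)."""
    degrees = [third_matching_degree(preset_trie, s) for s in sequences]
    mean = sum(degrees) / len(degrees)            # step 34
    if mean > second_threshold:                   # step 35: no update
        return preset_trie, mean
    for sequence in sequences:                    # step 36: update
        insert(preset_trie, sequence)
    return preset_trie, mean

# Hypothetical preset tree that already stores one name.
preset = {}
insert(preset, "cholera")
_, mean = build_target_trie(preset, ["cholera", "chorea"], second_threshold=0.8)
```

Here the mean of the two degrees falls below the threshold of 0.8, so the preset tree is updated and afterwards contains both names.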
Alternatively, the preset dictionary tree may be updated step by step during the traversal. A third preset threshold may be preset in the electronic device, either by the user or by system default. If a third matching degree is greater than the third preset threshold, the corresponding first sequence is considered to have matched the nodes of the preset dictionary tree successfully, and the preset dictionary tree is not updated. If a third matching degree is less than or equal to the third preset threshold, the corresponding first sequence is considered to have failed to match the nodes of the preset dictionary tree, and the preset dictionary tree may be updated based on that first sequence. By traversing the first sequences one by one and applying this procedure repeatedly, the preset dictionary tree is updated step by step to obtain the target dictionary tree.
In a possible example, step 36, updating the preset dictionary tree to obtain the target dictionary tree, may include the following steps:
361. Based on the preset dictionary tree, determine an initial mapping corresponding to the plurality of nodes in the preset dictionary tree;
362. If a character i in a first sequence i does not exist in the initial mapping, add a new mapping i, store the character i in the mapping i, and update the initial mapping to include the mapping i, wherein the first sequence i is any one of the plurality of first sequences and the character i is any character in the first sequence i;
363. Based on the mapping i, update the preset dictionary tree to the target dictionary tree.
Because the preceding steps decide whether the preset dictionary tree needs updating from the mean of the plurality of third matching degrees corresponding to the plurality of first sequences, they do not identify which node of the preset dictionary tree needs to be updated. During the update, the judgment may therefore be made per first sequence so as to update the preset dictionary tree step by step, which helps improve the accuracy of the constructed target dictionary tree.
In a specific implementation, the electronic device may determine, based on the preset dictionary tree, the initial mapping corresponding to that tree. The initial mapping may be the initial path of the preset dictionary tree; it expresses the mapping relationship between every pair of connected nodes in the preset dictionary tree, as well as the mapping relationship between each node and the information it stores. If any character i of first sequence i is not present in the initial mapping corresponding to the multiple nodes, a new mapping i is added on the basis of the initial mapping to store character i. The preset dictionary tree is thereby updated to a first dictionary tree and, at the same time, the initial mapping is updated to a first mapping that includes mapping i, where character i is any character in the first sequence and first sequence i is any one of the multiple first sequences. The same steps can be repeated for the other characters of the first sequence to update the first mapping step by step, and then performed for all first sequences. Once every character of every first sequence has been traversed, the target dictionary tree is obtained.
FIG. 1D below is a schematic structural diagram of a target dictionary tree. The target dictionary tree can be constructed from the preset ICD standard disease name set. As shown in the figure, the tree is built from preset ICD standard disease names such as "阿米巴性肠炎" (amoebic enteritis), "阿米巴病" (amoebiasis), "阿米巴性痢疾" (amoebic dysentery), "阿狄森氏病" (Addison's disease), and "阿尔茨海默病" (Alzheimer's disease). A solid circle marks the end node of a disease name path.
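The construction described above can be sketched as a character-level trie in which each node holds its own character-to-child mapping. This is a minimal illustration rather than the applicant's implementation; the names `TrieNode`, `insert`, and `build_trie` are hypothetical, and the `is_end` flag is assumed to play the role of the solid circle in FIG. 1D.

```python
class TrieNode:
    """One node of the dictionary tree (hypothetical structure)."""

    def __init__(self):
        self.children = {}   # character -> child TrieNode (the node's mapping)
        self.is_end = False  # True when a disease name ends at this node


def insert(root, name):
    """Add one disease name, creating a mapping entry only for missing characters."""
    node = root
    for ch in name:
        if ch not in node.children:
            node.children[ch] = TrieNode()
        node = node.children[ch]
    node.is_end = True


def build_trie(names):
    """Build a trie from an iterable of disease names."""
    root = TrieNode()
    for name in names:
        insert(root, name)
    return root


# The five example names from the description of FIG. 1D.
ICD_NAMES = ["阿米巴性肠炎", "阿米巴病", "阿米巴性痢疾", "阿狄森氏病", "阿尔茨海默病"]
trie = build_trie(ICD_NAMES)
```

Because all five names begin with "阿", the root ends up with a single child, and the names sharing "阿米巴" share those three nodes before branching.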
104. Based on the target dictionary tree, match the disease name to be standardized against the multiple preset ICD standard disease names in the preset ICD standard disease name set to obtain multiple first matching degrees.
Once the disease name to be standardized has been obtained, it can be looked up in the target dictionary tree to find the corresponding ICD standard disease name. In other words, the disease name to be standardized is matched against the multiple preset ICD standard disease names in the preset ICD standard disease name set to obtain multiple first matching degrees, and the conversion result of the disease name to be standardized is derived from those first matching degrees.
In a possible example, the above step 104 of matching the disease name to be standardized against the multiple preset ICD standard disease names in the preset ICD standard disease name set to obtain multiple first matching degrees may include the following steps:
41. Determine a target mapping table corresponding to the target dictionary tree, where the target mapping table includes multiple preset paths and each disease name in the target dictionary tree corresponds to one preset path;
42. Based on the target dictionary tree, starting from character j, search downward in sequence from the head node of the target dictionary tree to obtain a path j to be matched and the preset path j corresponding to that path in the target mapping table, where character j is the first character of the disease name to be standardized;
43. Match the path j to be matched against the preset path j to obtain a first matching degree j for character j.
The electronic device may determine the target mapping table corresponding to the target dictionary tree; the target mapping table may include the preset path corresponding to each disease name in the target dictionary tree. Then, based on the target dictionary tree and starting from the first character j of the disease name to be standardized, the device searches downward in sequence from the head node of the target dictionary tree to obtain the path j to be matched and the corresponding preset path j in the mapping table, and matches the two to obtain a first matching degree. This is repeated until every character of the disease name to be standardized has been processed against every path of the target dictionary tree, yielding multiple first matching degrees.
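The downward search in steps 41 to 43 can be illustrated with the following sketch. The description does not state how a first matching degree is computed, so the sketch makes an assumption: it reports how many leading characters of the query were matched and returns a standard name only when the whole query ends on a terminal node. The nested-dict trie, the `END` marker, and the function names are hypothetical.

```python
END = "$end"  # hypothetical marker key stored on a node where a name ends


def build_trie(names):
    """Build a nested-dict trie; each dict maps a character to its child dict."""
    root = {}
    for name in names:
        node = root
        for ch in name:
            node = node.setdefault(ch, {})
        node[END] = name  # remember the full standard name at the terminal node
    return root


def match(trie, query):
    """Walk down from the root one character at a time.

    Returns (matched_chars, standard_name); standard_name is None unless the
    entire query was consumed and the walk stopped on a terminal node.
    """
    node, matched = trie, 0
    for ch in query:
        if ch not in node:
            break
        node = node[ch]
        matched += 1
    name = node.get(END) if matched == len(query) else None
    return matched, name


trie = build_trie(["阿米巴性肠炎", "阿米巴病", "阿米巴性痢疾", "阿狄森氏病", "阿尔茨海默病"])
```

Under this assumption, `match(trie, "阿米巴病")` yields a complete match, while `match(trie, "阿米巴")` walks three characters but returns no name because no disease name ends at that node.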
105. When the multiple first matching degrees include a target first matching degree that meets a preset condition, obtain the target preset ICD standard disease name corresponding to the target first matching degree, and determine the target preset ICD standard disease name as the conversion result of the disease name to be standardized.
The preset condition may be set by the user or default to a system value and is not limited here. For example, a match may be deemed successful when the first matching degree is 100%. On a successful match, the preset ICD standard disease name corresponding to the target first matching degree that meets the preset condition is determined as the conversion result of the disease name to be standardized; otherwise, the match fails.
Optionally, when the disease name to be standardized is matched against the multiple preset ICD standard disease names based on the target dictionary tree and its characters cannot be fully matched in the target dictionary tree, that is, when the preset condition is not met, the electronic device may match the disease name to be standardized against the multiple preset ICD standard disease names using the path similarity and semantic similarity of the target dictionary tree to obtain the conversion result. In this way, the disease name to be standardized can still be converted even when its characters are not fully matched in the target dictionary tree, which helps improve matching accuracy.
In a specific implementation, the target dictionary tree may include a root node. A second sequence corresponding to the disease name to be standardized is determined; the tree is traversed starting from its root node to obtain the multiple second paths that the second sequence matches in the target dictionary tree; the semantic similarity of each second path is calculated to obtain multiple semantic similarities; the second path with the largest semantic similarity is selected as the target path; and the characters corresponding to the target path in the target dictionary tree are determined as the conversion result of the disease name to be standardized.
The semantic similarity of each of the multiple second paths may be calculated with the following formula:
$X = a \times X_1 + b \times X_2$;
[Formula image: Figure PCTCN2020099487-appb-000001]
Here, $w_i = (a_1, a_2, \ldots, a_n)$ and $w_j = (b_1, b_2, \ldots, b_n)$, where $n$ is the word-vector dimension. $X_1$ denotes the path length of each second path, which can also be understood as the depth of each traversal. $a$ and $b$ are real numbers; adjusting their values changes the weights of $X_1$ and $X_2$ in the semantic similarity calculation.
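The weighted combination can be sketched as follows. The formula for the second term survives only as an image in this text, so the sketch assumes, without confirmation from the source, that it is the cosine similarity between the two word vectors. The function names and the default weights are hypothetical.

```python
import math


def cosine_similarity(wi, wj):
    """Cosine similarity of two equal-length word vectors (assumed form of X_2)."""
    dot = sum(p * q for p, q in zip(wi, wj))
    norm_i = math.sqrt(sum(p * p for p in wi))
    norm_j = math.sqrt(sum(q * q for q in wj))
    if norm_i == 0 or norm_j == 0:
        return 0.0  # avoid dividing by zero for an all-zero vector
    return dot / (norm_i * norm_j)


def combined_score(path_length, wi, wj, a=0.5, b=0.5):
    """X = a * X_1 + b * X_2, with X_1 the path length and X_2 the assumed cosine term."""
    return a * path_length + b * cosine_similarity(wi, wj)
```

Since the path length is unbounded while a cosine value lies in [-1, 1], a practical choice of the weights would likely need to account for the different scales; the description leaves that choice open.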
In addition, traversing from the root node of the target dictionary tree to obtain the multiple second paths matched by the second sequence may include the following steps. Taking the root node as the starting point, select a path m (m being a positive integer) as the current subtree, where path m is any path starting at the root node. Search any layer of the current subtree for the first character of the disease name to be standardized. If that character is found, search the layers below it for the second character, then repeat for the third character, and so on. If a character is not found in the current subtree, select a path other than path m and repeat the above steps. Multiple second paths are obtained in this way; a second path may or may not contain all the characters of the disease name to be standardized. Combining path similarity with semantic similarity in this manner helps improve matching accuracy, so that the disease name to be standardized can be converted quickly into a standardized disease name.
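One simplified reading of this traversal is sketched below: enumerate every root-to-terminal path, and for each path count how many characters of the query can be found in order at increasing depth. This is an illustrative interpretation, not the procedure claimed in the application; the scoring by ordered hits, the `END` marker, and all function names are assumptions.

```python
END = "$end"  # hypothetical marker key stored on a node where a name ends


def build_trie(names):
    """Build a nested-dict trie; each dict maps a character to its child dict."""
    root = {}
    for name in names:
        node = root
        for ch in name:
            node = node.setdefault(ch, {})
        node[END] = name
    return root


def iter_paths(node, prefix=""):
    """Yield the character string of every root-to-terminal path."""
    for key, child in node.items():
        if key == END:
            yield prefix
        else:
            yield from iter_paths(child, prefix + key)


def ordered_hits(query, path):
    """Count query characters found in the path, each deeper than the previous hit."""
    pos = hits = 0
    for ch in query:
        idx = path.find(ch, pos)
        if idx != -1:        # a missing character is skipped, not fatal
            hits += 1
            pos = idx + 1
    return hits


def candidate_paths(trie, query):
    """Return (path, hits) pairs with at least one hit, best candidates first."""
    scored = [(p, ordered_hits(query, p)) for p in iter_paths(trie)]
    return sorted((s for s in scored if s[1] > 0), key=lambda s: (-s[1], s[0]))


trie = build_trie(["阿米巴性肠炎", "阿米巴病", "阿米巴性痢疾", "阿狄森氏病", "阿尔茨海默病"])
candidates = candidate_paths(trie, "阿米巴肠炎")
```

For the query "阿米巴肠炎", which omits "性", the path "阿米巴性肠炎" still matches all five query characters and ranks first; the candidates could then be re-ranked with a semantic similarity score.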
It can be seen that the disease name standardization method described in the embodiments of this application is applied to an electronic device and can be used in the field of smart healthcare, thereby promoting the construction of smart cities. The method includes: obtaining a target dictionary, a current diagnosis text, and a preset ICD standard disease name set that includes multiple preset ICD standard disease names; performing word segmentation on the current diagnosis text based on the target dictionary to obtain the disease name to be standardized contained in the current diagnosis text; constructing a target dictionary tree based on the preset ICD standard disease name set; matching the disease name to be standardized against the multiple preset ICD standard disease names based on the target dictionary tree to obtain multiple first matching degrees; and, when the multiple first matching degrees include a target first matching degree that meets a preset condition, obtaining the corresponding target preset ICD standard disease name and determining it as the conversion result of the disease name to be standardized. Segmenting the current diagnosis text with the target dictionary reduces the effect of colloquial wording, typos, omissions, abbreviations, and similar problems in the text. In addition, matching the disease name to be standardized against the multiple preset ICD standard disease names through the target dictionary tree built from the preset ICD standard disease name set yields the conversion result and helps improve conversion efficiency and accuracy.
Consistent with the above, refer to FIG. 2. FIG. 2 is an example flowchart of a disease name standardization method disclosed in an embodiment of this application, applied to an electronic device. The disease name standardization method may include the following steps:
201. Obtain a preset ICD standard disease name set, where the preset ICD standard disease name set includes multiple preset ICD standard disease names.
202. Extract historical diagnosis text information from a historical disease case database.
203. Perform data cleaning on the historical diagnosis text information to obtain a historical disease name set.
204. Perform data processing on the historical disease name set and the preset ICD standard disease name set to obtain the target dictionary.
205. Obtain the current diagnosis text.
206. Based on the target dictionary, perform word segmentation on the current diagnosis text to obtain the disease name to be standardized contained in the current diagnosis text.
207. Construct a target dictionary tree based on the preset ICD standard disease name set.
208. Based on the target dictionary tree, match the disease name to be standardized against the multiple preset ICD standard disease names in the preset ICD standard disease name set to obtain multiple first matching degrees.
209. When the multiple first matching degrees include a target first matching degree that meets a preset condition, obtain the target preset ICD standard disease name corresponding to the target first matching degree, and determine the target preset ICD standard disease name as the conversion result of the disease name to be standardized.
For the disease name standardization method described in steps 201 to 209 above, refer to the corresponding steps of the disease name standardization method described with reference to FIG. 1B.
It can be seen that the disease name standardization method described in this embodiment obtains a preset ICD standard disease name set that includes multiple preset ICD standard disease names; extracts historical diagnosis text information from a historical disease case database; cleans that information to obtain a historical disease name set; processes the historical disease name set together with the preset ICD standard disease name set to obtain the target dictionary; obtains the current diagnosis text; performs word segmentation on the current diagnosis text based on the target dictionary to obtain the disease name to be standardized; constructs a target dictionary tree based on the preset ICD standard disease name set; matches the disease name to be standardized against the multiple preset ICD standard disease names based on the target dictionary tree to obtain multiple first matching degrees; and, when the multiple first matching degrees include a target first matching degree that meets a preset condition, obtains the corresponding target preset ICD standard disease name and determines it as the conversion result of the disease name to be standardized. In this way, the historical diagnosis text in the historical disease case database is processed into an expanded target dictionary that better fits practical use, and segmenting the current diagnosis text with that dictionary reduces the effect of colloquial wording, typos, omissions, abbreviations, and similar problems in the text. In addition, matching the disease name to be standardized against the multiple preset ICD standard disease names through the target dictionary tree built from the preset ICD standard disease name set yields the conversion result and helps improve conversion efficiency and accuracy.
Consistent with the above, refer to FIG. 3. FIG. 3 is an example flowchart of a disease name standardization method disclosed in an embodiment of this application, applied to an electronic device. The disease name standardization method may include the following steps:
301. Obtain a target dictionary, a current diagnosis text, and a preset ICD standard disease name set, where the preset ICD standard disease name set includes multiple preset ICD standard disease names.
302. Based on the target dictionary, perform word segmentation on the current diagnosis text to obtain the disease name to be standardized contained in the current diagnosis text.
303. Based on the preset ICD standard disease name set, determine the first sequence corresponding to each of the multiple preset ICD standard disease names in the set to obtain multiple first sequences, where each first sequence includes at least one character.
304. Obtain a preset dictionary tree, where the preset dictionary tree includes multiple nodes.
305. Traverse the multiple first sequences and match each first sequence against the multiple nodes of the preset dictionary tree to obtain multiple third matching degrees.
306. Calculate the mean of the multiple third matching degrees.
307. If the mean is greater than a second preset threshold, do not update the preset dictionary tree, and use the preset dictionary tree as the target dictionary tree.
308. If the mean is less than or equal to the second preset threshold, update the preset dictionary tree to obtain the target dictionary tree.
309. Based on the target dictionary tree, match the disease name to be standardized against the multiple preset ICD standard disease names in the preset ICD standard disease name set to obtain multiple first matching degrees.
310. When the multiple first matching degrees include a target first matching degree that meets a preset condition, obtain the target preset ICD standard disease name corresponding to the target first matching degree, and determine the target preset ICD standard disease name as the conversion result of the disease name to be standardized.
For the disease name standardization method described in steps 301 to 310 above, refer to the corresponding steps of the disease name standardization method described with reference to FIG. 1B.
It can be seen that, in the disease name standardization method described in this embodiment, the electronic device obtains a target dictionary, a current diagnosis text, and a preset ICD standard disease name set that includes multiple preset ICD standard disease names; performs word segmentation on the current diagnosis text based on the target dictionary to obtain the disease name to be standardized; determines the first sequence corresponding to each preset ICD standard disease name in the set to obtain multiple first sequences, each including at least one character; obtains a preset dictionary tree that includes multiple nodes; traverses the multiple first sequences and matches each against the nodes of the preset dictionary tree to obtain multiple third matching degrees; and calculates the mean of the third matching degrees. If the mean is greater than the second preset threshold, the preset dictionary tree is not updated and is used as the target dictionary tree; if the mean is less than or equal to the second preset threshold, the preset dictionary tree is updated to obtain the target dictionary tree. Based on the target dictionary tree, the disease name to be standardized is matched against the multiple preset ICD standard disease names to obtain multiple first matching degrees, and when these include a target first matching degree that meets a preset condition, the corresponding target preset ICD standard disease name is obtained and determined as the conversion result of the disease name to be standardized. Segmenting the current diagnosis text with the target dictionary reduces the effect of colloquial wording, typos, omissions, abbreviations, and similar problems in the text. Processing the preset ICD standard disease name set defined under the international standard produces the target dictionary tree, and processing the disease name to be standardized with that tree yields its conversion result, which helps improve conversion accuracy.
Consistent with the above, refer to FIG. 4. FIG. 4 is a schematic structural diagram of an electronic device provided by an embodiment of this application. As shown in FIG. 4, the electronic device includes a processor, a communication interface, a memory, and one or more programs. The processor, the communication interface, and the memory are connected to one another. The memory is configured to store a computer program that includes program instructions, and the processor is configured to call the program instructions. The one or more programs include instructions for performing the following steps:
Obtain a target dictionary, a current diagnosis text, and a preset ICD standard disease name set, where the preset ICD standard disease name set includes multiple preset ICD standard disease names; based on the target dictionary, perform word segmentation on the current diagnosis text to obtain the disease name to be standardized contained in the current diagnosis text; construct a target dictionary tree based on the preset ICD standard disease name set; based on the target dictionary tree, match the disease name to be standardized against the multiple preset ICD standard disease names in the preset ICD standard disease name set to obtain multiple first matching degrees; and, when the multiple first matching degrees include a target first matching degree that meets a preset condition, obtain the target preset ICD standard disease name corresponding to the target first matching degree and determine it as the conversion result of the disease name to be standardized.
It can be seen that the electronic device described in this embodiment can obtain a target dictionary, a current diagnosis text, and a preset ICD standard disease name set that includes multiple preset ICD standard disease names; perform word segmentation on the current diagnosis text based on the target dictionary to obtain the disease name to be standardized; construct a target dictionary tree based on the preset ICD standard disease name set; match the disease name to be standardized against the multiple preset ICD standard disease names based on the target dictionary tree to obtain multiple first matching degrees; and, when the multiple first matching degrees include a target first matching degree that meets a preset condition, obtain the corresponding target preset ICD standard disease name and determine it as the conversion result of the disease name to be standardized. Segmenting the current diagnosis text with the target dictionary reduces the effect of colloquial wording, typos, omissions, abbreviations, and similar problems in the text. In addition, matching the disease name to be standardized against the multiple preset ICD standard disease names through the target dictionary tree built from the preset ICD standard disease name set yields the conversion result and helps improve conversion efficiency and accuracy.
In a possible example, before the target dictionary is obtained, the processor is further configured to: extract historical diagnosis text information from a historical disease case database; perform data cleaning on the historical diagnosis text information to obtain a historical disease name set; and perform data processing on the historical disease name set and the preset ICD standard disease name set to obtain the target dictionary.
In a possible example, in performing data cleaning on the historical diagnosis text information to obtain the historical disease name set, the processor is specifically configured to: obtain multiple preset regular expressions for multiple preset disease names, where each preset disease name corresponds to one preset regular expression; match the historical diagnosis text information against each of the multiple preset regular expressions to obtain multiple second matching degrees, where each preset regular expression corresponds to one second matching degree; and determine the at least one preset disease name corresponding to the at least one second matching degree that exceeds a first preset threshold, and use the at least one preset disease name as the historical disease name set.
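A minimal sketch of this cleaning step follows. The application does not define how a second matching degree is computed, so the sketch assumes it is the fraction of historical diagnosis records in which the regular expression finds a match. The function name, the sample records, and the sample patterns are all hypothetical.

```python
import re


def historical_disease_names(records, patterns, first_threshold):
    """Keep the preset disease names whose pattern matches often enough.

    records:  list of historical diagnosis text strings
    patterns: dict mapping a preset disease name to its regular expression
    The matching degree (assumed definition) is hits / number of records.
    """
    names = []
    for name, pattern in patterns.items():
        regex = re.compile(pattern)
        hits = sum(1 for text in records if regex.search(text))
        degree = hits / len(records) if records else 0.0
        if degree > first_threshold:
            names.append(name)
    return names


# Illustrative data only: three diagnosis notes and three preset patterns.
records = ["患者诊断为2型糖尿病", "II型糖尿病,伴高血压", "上呼吸道感染"]
patterns = {
    "2型糖尿病": r"(2|II|二)型糖尿病",  # type 2 diabetes, several spellings
    "高血压": r"高血压(病)?",           # hypertension
    "肺炎": r"肺炎",                    # pneumonia
}
kept = historical_disease_names(records, patterns, first_threshold=0.5)
```

With these sample inputs only the diabetes pattern matches more than half of the records, so only that name is kept; lowering the threshold admits the hypertension name as well.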
In a possible example, in terms of performing data processing on the historical disease name set and the preset ICD standard disease name set to obtain the target dictionary, the processor is specifically configured to: merge the disease name set with the preset ICD standard disease name set to obtain a first dictionary, where the first dictionary includes multiple first disease names; and deduplicate the multiple first disease names to obtain the target dictionary.
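A minimal sketch of the merge-and-deduplicate step follows. The function name `build_target_dictionary` is hypothetical, and stripping surrounding whitespace before comparing names is an assumption that the text does not state.

```python
def build_target_dictionary(historical_names, icd_names):
    """Merge two name collections and drop duplicates, keeping first-seen order."""
    # The merged list plays the role of the "first dictionary".
    merged = list(historical_names) + list(icd_names)
    # Assumption: names are compared after trimming whitespace; empty names are dropped.
    cleaned = (name.strip() for name in merged)
    # dict.fromkeys preserves insertion order while removing duplicates.
    return list(dict.fromkeys(name for name in cleaned if name))


# Hypothetical sample data, for illustration only.
target_dictionary = build_target_dictionary(
    ["hypertension", "T2DM"],
    ["hypertension", "type 2 diabetes mellitus"],
)
```

Keeping insertion order is a design choice for reproducibility; a plain `set` would also deduplicate but would not give a stable ordering.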
In a possible example, in terms of constructing the target dictionary tree based on the preset ICD standard disease name set, the processor is specifically configured to: determine, based on the preset ICD standard disease name set, a first sequence corresponding to each of the multiple preset ICD standard disease names in the set to obtain multiple first sequences, where each first sequence includes at least one character; acquire a preset dictionary tree that includes multiple nodes; traverse the multiple first sequences and match each first sequence against the multiple nodes of the preset dictionary tree to obtain multiple third matching degrees; calculate the mean of the multiple third matching degrees; if the mean is greater than a second preset threshold, leave the preset dictionary tree unchanged and use it as the target dictionary tree; and if the mean is less than or equal to the second preset threshold, update the preset dictionary tree to obtain the target dictionary tree.
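The threshold-based decision above can be illustrated with a small sketch. It assumes the dictionary tree is a trie stored as nested Python dicts, and that a third matching degree is the fraction of a name's leading characters that can be walked from the root of the preset trie. Neither assumption is stated in the text, and the names `END`, `insert`, `prefix_match_degree`, and `build_target_trie` are hypothetical.

```python
END = "$end"  # hypothetical end-of-name marker; never collides with a 1-char key


def insert(trie, seq):
    """Insert a character sequence into the nested-dict trie."""
    node = trie
    for ch in seq:
        node = node.setdefault(ch, {})
    node[END] = True


def prefix_match_degree(trie, seq):
    """Assumed third matching degree: walked prefix length / sequence length.

    seq must be non-empty.
    """
    node, matched = trie, 0
    for ch in seq:
        if ch not in node:
            break
        node = node[ch]
        matched += 1
    return matched / len(seq)


def build_target_trie(preset_trie, icd_names, threshold):
    """Keep the preset trie if mean coverage exceeds threshold, else update it in place."""
    sequences = [list(name) for name in icd_names if name]
    if not sequences:
        return preset_trie
    mean = sum(prefix_match_degree(preset_trie, s) for s in sequences) / len(sequences)
    if mean > threshold:
        return preset_trie  # the preset trie is used as the target trie unchanged
    for s in sequences:
        insert(preset_trie, s)  # update: add every ICD name that is missing
    return preset_trie


# Hypothetical sample data: the preset trie only knows "flu".
preset = {}
insert(preset, "flu")
target = build_target_trie(preset, ["flu", "gout"], threshold=0.8)
```

In this example the mean coverage is 0.5, which is not greater than 0.8, so the trie is updated and "gout" becomes fully walkable.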
In a possible example, in terms of updating the preset dictionary tree to obtain the target dictionary tree, the processor is specifically configured to: determine, based on the preset dictionary tree, the initial mappings corresponding to the multiple nodes in the preset dictionary tree; if a character i in a first sequence i does not exist in the initial mappings, add a new mapping i and store the character i in the mapping i, where the first sequence i is any one of the multiple first sequences, the character i is any character in the first sequence i, and i is a positive integer; and update the preset dictionary tree to the target dictionary tree based on the mapping i.
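One way to read the update step is that each trie node holds a mapping from characters to child nodes, and a new entry is created only when a character is missing. The sketch below follows that reading; the class `TrieNode` and the function `add_sequence` are hypothetical names, and the return value (the number of newly added mappings) is added purely for illustration.

```python
class TrieNode:
    """A trie node whose `children` dict plays the role of the node's mapping."""

    def __init__(self):
        self.children = {}   # character -> child TrieNode
        self.is_end = False  # True if a complete disease name ends here


def add_sequence(root, sequence):
    """Walk the sequence from the root, adding a mapping for each missing character.

    Returns how many new mappings were created.
    """
    node = root
    added = 0
    for ch in sequence:
        if ch not in node.children:
            # The character is absent from the existing mapping: add it.
            node.children[ch] = TrieNode()
            added += 1
        node = node.children[ch]
    node.is_end = True
    return added


# Hypothetical sample data: the two names share the prefix "asth".
root = TrieNode()
first = add_sequence(root, "asthma")
second = add_sequence(root, "asthenia")
```

Because the second name reuses the four existing prefix nodes, only its remaining four characters create new mappings.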
In a possible example, in terms of matching the disease name to be standardized against the multiple preset ICD standard disease names in the preset ICD standard disease name set to obtain multiple first matching degrees, the processor is specifically configured to: determine a target mapping table corresponding to the target dictionary tree, where the target mapping table includes multiple preset paths and each disease name in the target dictionary tree corresponds to one preset path; based on the target dictionary tree, starting from character j, search downward in sequence from the head node of the target dictionary tree to obtain a path j to be matched and the preset path j corresponding to the path to be matched in the target mapping table, where the character j is the first character of the disease name to be standardized; and match the path j to be matched against the preset path j to obtain a first matching degree j of the character j.
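The lookup described above might be sketched as below. This is an interpretation under stated assumptions, not the method as claimed: the preset path of each standard name is stored at its terminal trie node, the path to be matched is the longest prefix of the query that can be walked from the head node, and the first matching degree is the walked length divided by the longer of the two names. The names `build_trie`, `collect`, `match_degrees`, `standardize`, and the `min_degree` default are hypothetical.

```python
END = "$end"  # hypothetical marker key holding the full standard name


def build_trie(names):
    """Build a nested-dict trie; each terminal node stores its full name (preset path)."""
    root = {}
    for name in names:
        node = root
        for ch in name:
            node = node.setdefault(ch, {})
        node[END] = name
    return root


def collect(node):
    """Yield every standard name stored at or below the given node."""
    stack = [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key == END:
                yield child
            else:
                stack.append(child)


def match_degrees(trie, query):
    """Walk the query from the head node and score each reachable standard name."""
    node, walked = trie, 0
    for ch in query:
        if ch not in node:
            break
        node = node[ch]
        walked += 1
    if walked == 0:
        return {}
    # Assumed matching degree: shared prefix length over the longer name length.
    return {name: walked / max(len(name), len(query)) for name in collect(node)}


def standardize(trie, query, min_degree=0.8):
    """Return the best-scoring standard name, or None if no score reaches min_degree."""
    degrees = match_degrees(trie, query)
    if not degrees:
        return None
    best = max(degrees, key=degrees.get)
    return best if degrees[best] >= min_degree else None


# Hypothetical sample data, for illustration only.
trie = build_trie(["acute bronchitis", "acute pharyngitis", "asthma"])
result = standardize(trie, "acute bronchitis")
```

A prefix-based score is only one possible choice; an edit-distance or token-overlap score could be substituted in `match_degrees` without changing the surrounding flow.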
The foregoing describes the solutions of the embodiments of this application mainly from the perspective of the method-side execution process. It can be understood that, to implement the above functions, the electronic device includes hardware structures and/or software modules corresponding to each function. A person skilled in the art should readily appreciate that, in combination with the units and algorithm steps of the examples described in the embodiments provided herein, this application can be implemented in hardware or in a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends on the specific application and the design constraints of the technical solution. A skilled person may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of this application.
Consistent with the above, refer to FIG. 5, which is a schematic structural diagram of a disease name standardization apparatus disclosed in an embodiment of this application. The apparatus is applied to an electronic device and includes an acquiring unit 501, a word segmentation unit 502, a construction unit 503, a matching unit 504, and a determining unit 505, where:
the acquiring unit 501 is configured to acquire a target dictionary, a current diagnosis text, and a preset ICD standard disease name set, where the preset ICD standard disease name set includes multiple preset ICD standard disease names;
the word segmentation unit 502 is configured to perform, based on the target dictionary, a word segmentation operation on the current diagnosis text to obtain the disease name to be standardized contained in the current diagnosis text;
the construction unit 503 is configured to construct a target dictionary tree based on the preset ICD standard disease name set;
the matching unit 504 is configured to match, based on the target dictionary tree, the disease name to be standardized against the multiple preset ICD standard disease names in the preset ICD standard disease name set to obtain multiple first matching degrees; and
the determining unit 505 is configured to, when a target first matching degree that satisfies a preset condition exists among the multiple first matching degrees, acquire the target preset ICD standard disease name corresponding to the target first matching degree, and determine the target preset ICD standard disease name as the conversion result of the disease name to be standardized.
In a possible example, in terms of constructing the target dictionary tree based on the preset ICD standard disease name set, the construction unit 503 is specifically configured to: determine, based on the preset ICD standard disease name set, a first sequence corresponding to each of the multiple preset ICD standard disease names in the set to obtain multiple first sequences, where each first sequence includes at least one character; acquire a preset dictionary tree that includes multiple nodes; traverse the multiple first sequences and match each first sequence against the multiple nodes of the preset dictionary tree to obtain multiple third matching degrees; calculate the mean of the multiple third matching degrees; if the mean is greater than a second preset threshold, leave the preset dictionary tree unchanged and use it as the target dictionary tree; and if the mean is less than or equal to the second preset threshold, update the preset dictionary tree to obtain the target dictionary tree.
In a possible example, in terms of updating the preset dictionary tree to obtain the target dictionary tree, the construction unit 503 is further specifically configured to: determine, based on the preset dictionary tree, the initial mappings corresponding to the multiple nodes in the preset dictionary tree; if a character i in a first sequence i does not exist in the initial mappings, add a new mapping i and store the character i in the mapping i, where the first sequence i is any one of the multiple first sequences, the character i is any character in the first sequence i, and i is a positive integer; and update the preset dictionary tree to the target dictionary tree based on the mapping i.
In a possible example, in terms of matching the disease name to be standardized against the multiple preset ICD standard disease names in the preset ICD standard disease name set to obtain multiple first matching degrees, the matching unit 504 is specifically configured to: determine a target mapping table corresponding to the target dictionary tree, where the target mapping table includes multiple preset paths and each disease name in the target dictionary tree corresponds to one preset path; based on the target dictionary tree, starting from character j, search downward in sequence from the head node of the target dictionary tree to obtain a path j to be matched and the preset path j corresponding to the path to be matched in the target mapping table, where the character j is the first character of the disease name to be standardized; and match the path j to be matched against the preset path j to obtain a first matching degree j of the character j.
An embodiment of this application further provides a computer-readable storage medium that stores a computer program for electronic data exchange, where the program instructions, when executed by a processor, cause the processor to perform the following steps: acquiring a target dictionary, a current diagnosis text, and a preset ICD standard disease name set, where the preset ICD standard disease name set includes multiple preset ICD standard disease names; performing, based on the target dictionary, a word segmentation operation on the current diagnosis text to obtain the disease name to be standardized contained in the current diagnosis text; constructing a target dictionary tree based on the preset ICD standard disease name set; matching, based on the target dictionary tree, the disease name to be standardized against the multiple preset ICD standard disease names in the preset ICD standard disease name set to obtain multiple first matching degrees; and when a target first matching degree that satisfies a preset condition exists among the multiple first matching degrees, acquiring the target preset ICD standard disease name corresponding to the target first matching degree, and determining the target preset ICD standard disease name as the conversion result of the disease name to be standardized.
In a possible example, the processor is further configured to: extract historical diagnosis text information from a historical case database; perform data cleaning on the historical diagnosis text information to obtain a historical disease name set; and perform data processing on the historical disease name set and the preset ICD standard disease name set to obtain the target dictionary.
In a possible example, the processor is further configured to: acquire multiple preset regular expressions for multiple preset disease names, where each preset disease name corresponds to one preset regular expression; match the historical diagnosis text information against each of the multiple preset regular expressions to obtain multiple second matching degrees, where each preset regular expression corresponds to one second matching degree; and determine at least one preset disease name corresponding to at least one second matching degree that exceeds a first preset threshold among the multiple second matching degrees, and use the at least one preset disease name as the disease name set.
In a possible example, the processor is further configured to: merge the disease name set with the preset ICD standard disease name set to obtain a first dictionary, where the first dictionary includes multiple first disease names; and deduplicate the multiple first disease names to obtain the target dictionary.
In a possible example, the processor is further configured to: determine, based on the preset ICD standard disease name set, a first sequence corresponding to each of the multiple preset ICD standard disease names in the set to obtain multiple first sequences, where each first sequence includes at least one character; acquire a preset dictionary tree that includes multiple nodes; traverse the multiple first sequences and match each first sequence against the multiple nodes of the preset dictionary tree to obtain multiple third matching degrees; calculate the mean of the multiple third matching degrees; if the mean is greater than a second preset threshold, leave the preset dictionary tree unchanged and use it as the target dictionary tree; and if the mean is less than or equal to the second preset threshold, update the preset dictionary tree to obtain the target dictionary tree.
In a possible example, the processor is further configured to: determine, based on the preset dictionary tree, the initial mappings corresponding to the multiple nodes in the preset dictionary tree; if a character i in a first sequence i does not exist in the initial mappings, add a new mapping i and store the character i in the mapping i, where the first sequence i is any one of the multiple first sequences, the character i is any character in the first sequence i, and i is a positive integer; and update the preset dictionary tree to the target dictionary tree based on the mapping i.
In a possible example, the processor is further configured to: determine a target mapping table corresponding to the target dictionary tree, where the target mapping table includes multiple preset paths and each disease name in the target dictionary tree corresponds to one preset path; based on the target dictionary tree, starting from character j, search downward in sequence from the head node of the target dictionary tree to obtain a path j to be matched and the preset path j corresponding to the path to be matched in the target mapping table, where the character j is the first character of the disease name to be standardized; and match the path j to be matched against the preset path j to obtain a first matching degree j of the character j.
Further, the computer-readable storage medium may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, an application program required by at least one function, and the like, and the data storage area may store data created through the use of blockchain nodes, and the like.
In addition, the computer-readable storage medium may be non-volatile or volatile.
The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked using cryptographic methods, where each data block contains information about a batch of network transactions and is used to verify the validity of that information (anti-counterfeiting) and to generate the next block. A blockchain may include an underlying blockchain platform, a platform product service layer, an application service layer, and the like.
The embodiments of this application are described in detail above. Specific examples are used herein to explain the principles and implementations of this application, and the description of the above embodiments is intended only to help understand the method and core idea of this application. Meanwhile, a person of ordinary skill in the art may, based on the idea of this application, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as a limitation on this application.

Claims (20)

  1. A disease name standardization method, applied to an electronic device, comprising:
    acquiring a target dictionary, a current diagnosis text, and a preset ICD standard disease name set, wherein the preset ICD standard disease name set includes multiple preset ICD standard disease names;
    performing, based on the target dictionary, a word segmentation operation on the current diagnosis text to obtain a disease name to be standardized contained in the current diagnosis text;
    constructing a target dictionary tree based on the preset ICD standard disease name set;
    matching, based on the target dictionary tree, the disease name to be standardized against the multiple preset ICD standard disease names in the preset ICD standard disease name set to obtain multiple first matching degrees; and
    when a target first matching degree that satisfies a preset condition exists among the multiple first matching degrees, acquiring a target preset ICD standard disease name corresponding to the target first matching degree, and determining the target preset ICD standard disease name as the conversion result of the disease name to be standardized.
  2. The method according to claim 1, wherein, before the acquiring of the target dictionary, the method further comprises:
    extracting historical diagnosis text information from a historical case database;
    performing data cleaning on the historical diagnosis text information to obtain a historical disease name set; and
    performing data processing on the historical disease name set and the preset ICD standard disease name set to obtain the target dictionary.
  3. The method according to claim 2, wherein the performing data cleaning on the historical diagnosis text information to obtain a historical disease name set comprises:
    acquiring multiple preset regular expressions for multiple preset disease names, wherein each preset disease name corresponds to one preset regular expression;
    matching the historical diagnosis text information against each of the multiple preset regular expressions to obtain multiple second matching degrees, wherein each preset regular expression corresponds to one second matching degree; and
    determining at least one preset disease name corresponding to at least one second matching degree that exceeds a first preset threshold among the multiple second matching degrees, and using the at least one preset disease name as the disease name set.
  4. The method according to claim 2 or 3, wherein the performing data processing on the historical disease name set and the preset ICD standard disease name set to obtain a target dictionary comprises:
    merging the disease name set with the preset ICD standard disease name set to obtain a first dictionary, wherein the first dictionary includes multiple first disease names; and
    deduplicating the multiple first disease names to obtain the target dictionary.
  5. The method according to claim 1, wherein the constructing a target dictionary tree based on the preset ICD standard disease name set comprises:
    determining, based on the preset ICD standard disease name set, a first sequence corresponding to each of the multiple preset ICD standard disease names in the preset ICD standard disease name set to obtain multiple first sequences, wherein each first sequence includes at least one character;
    acquiring a preset dictionary tree, wherein the preset dictionary tree includes multiple nodes;
    traversing the multiple first sequences, and matching each first sequence against the multiple nodes of the preset dictionary tree to obtain multiple third matching degrees;
    calculating the mean of the multiple third matching degrees;
    if the mean is greater than a second preset threshold, leaving the preset dictionary tree unchanged and using the preset dictionary tree as the target dictionary tree; and
    if the mean is less than or equal to the second preset threshold, updating the preset dictionary tree to obtain the target dictionary tree.
  6. The method according to claim 5, wherein the updating the preset dictionary tree to obtain the target dictionary tree comprises:
    determining, based on the preset dictionary tree, the initial mappings corresponding to the multiple nodes in the preset dictionary tree;
    if a character i in a first sequence i does not exist in the initial mappings, adding a new mapping i and storing the character i in the mapping i, wherein the first sequence i is any one of the multiple first sequences, the character i is any character in the first sequence i, and i is a positive integer; and
    updating the preset dictionary tree to the target dictionary tree based on the mapping i.
  7. The method according to any one of claims 1 to 6, wherein the matching the disease name to be standardized against the multiple preset ICD standard disease names in the preset ICD standard disease name set to obtain multiple first matching degrees comprises:
    determining a target mapping table corresponding to the target dictionary tree, wherein the target mapping table includes multiple preset paths, and each disease name in the target dictionary tree corresponds to one preset path;
    based on the target dictionary tree, starting from character j, searching downward in sequence from the head node of the target dictionary tree to obtain a path j to be matched and the preset path j corresponding to the path to be matched in the target mapping table, wherein the character j is the first character of the disease name to be standardized; and
    matching the path j to be matched against the preset path j to obtain a first matching degree j of the character j.
  8. A disease name standardization apparatus, applied to an electronic device, the apparatus comprising an acquiring unit, a word segmentation unit, a construction unit, a matching unit, and a determining unit, wherein:
    the acquiring unit is configured to acquire a target dictionary, a current diagnosis text, and a preset ICD standard disease name set, wherein the preset ICD standard disease name set includes multiple preset ICD standard disease names;
    the word segmentation unit is configured to perform, based on the target dictionary, a word segmentation operation on the current diagnosis text to obtain a disease name to be standardized contained in the current diagnosis text;
    the construction unit is configured to construct a target dictionary tree based on the preset ICD standard disease name set;
    the matching unit is configured to match, based on the target dictionary tree, the disease name to be standardized against the multiple preset ICD standard disease names in the preset ICD standard disease name set to obtain multiple first matching degrees; and
    the determining unit is configured to, when a target first matching degree that satisfies a preset condition exists among the multiple first matching degrees, acquire a target preset ICD standard disease name corresponding to the target first matching degree, and determine the target preset ICD standard disease name as the conversion result of the disease name to be standardized.
  9. 一种电子设备,其中,包括处理器、存储器、通信接口,以及一个或多个程序,所述一个或多个程序被存储在所述存储器中,并且被配置由所述处理器执行,其中,所述存储器用于存储计算机程序,所述计算机程序包括程序指令,所述处理器被配置用于调用所述程序指令,以执行以下步骤的指令:An electronic device, including a processor, a memory, a communication interface, and one or more programs, the one or more programs are stored in the memory and configured to be executed by the processor, wherein, The memory is used to store a computer program, the computer program includes program instructions, and the processor is configured to call the program instructions to execute instructions of the following steps:
    获取目标词典、当前诊断文本和预设ICD标准疾病名称集,所述预设ICD标准疾病名称集包括多个预设ICD标准疾病名称;基于所述目标词典,对所述当前诊断文本进行切词操作,得到所述当前诊断文本中包含的待标准化疾病名称;基于所述预设ICD标准疾病名称集,构建目标字典树;基于所述目标字典树,将所述待标准化疾病名称与所述预设ICD标准疾病名称集中的所述多个预设ICD标准疾病名称进行匹配,得到多个第一匹配度;当所述多个第一匹配度中存在满足预设条件的目标第一匹配度时,获取所述目标第一匹配度对应的目标预设ICD标准疾病名称,将所述目标预设ICD标准疾病名称确定为所述待标准化疾病名称的转换结果。Acquire a target dictionary, a current diagnosis text, and a preset ICD standard disease name set, where the preset ICD standard disease name set includes a plurality of preset ICD standard disease names; based on the target dictionary, perform word segmentation on the current diagnosis text Operation, the name of the disease to be standardized contained in the current diagnosis text is obtained; based on the preset ICD standard disease name set, a target dictionary tree is constructed; based on the target dictionary tree, the name of the disease to be standardized and the predicted disease name It is assumed that the multiple preset ICD standard disease names in the ICD standard disease name set are matched to obtain multiple first matching degrees; when there is a target first matching degree that meets the preset condition among the multiple first matching degrees Obtain the target preset ICD standard disease name corresponding to the first degree of matching of the target, and determine the target preset ICD standard disease name as the conversion result of the disease name to be standardized.
  10. The electronic device according to claim 9, wherein, before acquiring the target dictionary, the processor is further specifically configured to:
    extract historical diagnosis text information from a historical case database; perform data cleaning on the historical diagnosis text information to obtain a historical disease name set; and perform data processing on the historical disease name set and the preset ICD standard disease name set to obtain the target dictionary.
  11. The electronic device according to claim 10, wherein, in performing data cleaning on the historical diagnosis text information to obtain the historical disease name set, the processor is specifically configured to:
    acquire a plurality of preset regular expressions for a plurality of preset disease names, wherein each preset disease name corresponds to one preset regular expression; match the historical diagnosis text information against each of the plurality of preset regular expressions to obtain a plurality of second matching degrees, each preset regular expression corresponding to one second matching degree; and determine at least one preset disease name corresponding to at least one second matching degree, among the plurality of second matching degrees, that exceeds a first preset threshold, and use the at least one preset disease name as the disease name set.
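The claim above does not define how a second matching degree is computed. As an illustrative sketch only, the code below assumes the degree is the fraction of historical diagnosis records in which a preset regular expression matches; the function name `clean_history`, the sample records, and the threshold value are all hypothetical.

```python
import re


def clean_history(records, patterns, threshold):
    """Return the preset disease names whose regex matching degree exceeds threshold.

    records:   list of historical diagnosis strings
    patterns:  dict mapping a preset disease name to its regular expression
    threshold: the first preset threshold (value chosen by the caller)
    """
    if not records:
        return set()
    kept = set()
    for name, pattern in patterns.items():
        regex = re.compile(pattern)
        hits = sum(1 for record in records if regex.search(record))
        degree = hits / len(records)  # assumed definition of the second matching degree
        if degree > threshold:
            kept.add(name)
    return kept


history = ["高血压2级", "原发性高血压", "2型糖尿病", "咳嗽"]
preset_patterns = {"高血压": r"高血压", "糖尿病": r"糖尿病", "肺炎": r"肺炎"}
historical_names = clean_history(history, preset_patterns, threshold=0.2)
```

Other definitions of the matching degree (for example, the share of a record's characters covered by the match) would fit the claim language equally well.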
  12. The electronic device according to claim 10 or 11, wherein, in performing data processing on the historical disease name set and the preset ICD standard disease name set to obtain the target dictionary, the processor is specifically configured to:
    merge the disease name set with the preset ICD standard disease name set to obtain a first dictionary, the first dictionary including a plurality of first disease names; and deduplicate the plurality of first disease names to obtain the target dictionary.
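The merge and deduplication steps above can be sketched in a few lines. This example assumes that exact string equality defines a duplicate and that the order of first appearance is kept; `build_target_dictionary` is a hypothetical name.

```python
def build_target_dictionary(historical_names, icd_names):
    """Merge both name collections, then drop duplicates.

    dict.fromkeys keeps the first occurrence of each name and preserves order.
    """
    first_dictionary = list(historical_names) + list(icd_names)
    return list(dict.fromkeys(first_dictionary))


target_dictionary = build_target_dictionary(
    ["高血压", "感冒", "高血压"],
    ["高血压", "糖尿病"],
)
```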
  13. The electronic device according to claim 9, wherein, in constructing the target dictionary tree based on the preset ICD standard disease name set, the processor is specifically configured to:
    determine, based on the preset ICD standard disease name set, a first sequence corresponding to each of the plurality of preset ICD standard disease names in the preset ICD standard disease name set, to obtain a plurality of first sequences, wherein each first sequence includes at least one character; acquire a preset dictionary tree, the preset dictionary tree including a plurality of nodes; traverse the plurality of first sequences and match each first sequence against the plurality of nodes of the preset dictionary tree to obtain a plurality of third matching degrees; calculate the mean of the plurality of third matching degrees; if the mean is greater than a second preset threshold, leave the preset dictionary tree unchanged and use the preset dictionary tree as the target dictionary tree; and if the mean is less than or equal to the second preset threshold, update the preset dictionary tree to obtain the target dictionary tree.
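The conditional update described above can be illustrated as follows. This is a sketch under stated assumptions: the dictionary tree is modeled as nested Python dicts, and the third matching degree is assumed to be the fraction of a sequence's characters that can be followed from the root. Neither choice is specified in the claim, and all names here are hypothetical.

```python
import copy

END = "$end"  # hypothetical end-of-name marker; never collides with a single character


def prefix_match_degree(trie, sequence):
    """Fraction of the sequence that exists as a path from the root (assumed metric)."""
    node, matched = trie, 0
    for ch in sequence:
        if ch not in node:
            break
        node = node[ch]
        matched += 1
    return matched / len(sequence)


def insert(trie, sequence):
    """Add one character sequence to the trie, creating nodes as needed."""
    node = trie
    for ch in sequence:
        node = node.setdefault(ch, {})
    node[END] = True


def build_target_trie(preset_trie, icd_names, threshold):
    """Reuse the preset trie when it already covers the names well, else update a copy."""
    sequences = [list(name) for name in icd_names if name]
    degrees = [prefix_match_degree(preset_trie, seq) for seq in sequences]
    mean = sum(degrees) / len(degrees) if degrees else 0.0
    if mean > threshold:
        return preset_trie               # no update needed
    updated = copy.deepcopy(preset_trie)  # leave the caller's trie untouched
    for seq in sequences:
        insert(updated, seq)
    return updated


preset = {}
insert(preset, "高血压")
target = build_target_trie(preset, ["高血压", "糖尿病"], threshold=0.9)
```

Here the mean coverage is 0.5, which does not exceed 0.9, so the sketch returns an updated copy that also contains "糖尿病".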
  14. The electronic device according to claim 13, wherein, in updating the preset dictionary tree to obtain the target dictionary tree, the processor is specifically configured to:
    determine, based on the preset dictionary tree, an initial mapping corresponding to the plurality of nodes in the preset dictionary tree; if a character i in a first sequence i does not exist in the initial mapping, add a new mapping i and store the character i in the mapping i, wherein the first sequence i is any one of the plurality of first sequences, the character i is any character in the first sequence, and i is a positive integer; and update the preset dictionary tree to the target dictionary tree based on the mapping i.
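One way to read the "mapping" in the claim above is as the per-node table from a character to a child node. The sketch below takes that reading as an assumption; `TrieNode` and `add_sequence` are hypothetical names, and the return value (the number of newly created mappings) is added only to make the behavior observable.

```python
class TrieNode:
    """A dictionary tree node; `children` plays the role of the mapping."""

    def __init__(self):
        self.children = {}   # character -> child TrieNode
        self.is_end = False  # True if a disease name ends at this node


def add_sequence(root, sequence):
    """Insert a character sequence and return how many new mappings were created."""
    node, created = root, 0
    for ch in sequence:
        if ch not in node.children:
            # The character is absent from the existing mapping: add a new entry.
            node.children[ch] = TrieNode()
            created += 1
        node = node.children[ch]
    node.is_end = True
    return created


root = TrieNode()
created_first = add_sequence(root, "高血压")   # all three characters are new
created_second = add_sequence(root, "高血脂")  # shares the prefix "高血"
created_third = add_sequence(root, "高血压")   # already present
```

Names that share a prefix reuse the existing nodes, so only the differing suffix adds entries.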
  15. The electronic device according to any one of claims 9-14, wherein, in matching the to-be-standardized disease name against the plurality of preset ICD standard disease names in the preset ICD standard disease name set to obtain the plurality of first matching degrees, the processor is specifically configured to:
    determine a target mapping table corresponding to the target dictionary tree, the target mapping table including a plurality of preset paths, each disease name in the target dictionary tree corresponding to one preset path; based on the target dictionary tree, starting from a character j, search downward in order from the head node of the target dictionary tree to obtain a to-be-matched path j and the preset path j corresponding to the to-be-matched path in the target mapping table, wherein the character j is the first character of the to-be-standardized disease name; and match the to-be-matched path j against the preset path j to obtain a first matching degree j of the character j.
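The path matching above leaves the scoring formula open. The following sketch assumes that the to-be-matched path is the longest prefix of the query found in the tree, that the candidate preset paths are the standard names stored at or below the node where the walk stops, and that the first matching degree is the matched depth divided by the longer of the two name lengths. These definitions, the default threshold, and all function names are assumptions made for illustration.

```python
END = "$end"  # hypothetical key holding the full standard name at a terminal node


def build_trie(names):
    root = {}
    for name in names:
        node = root
        for ch in name:
            node = node.setdefault(ch, {})
        node[END] = name
    return root


def walk(root, query):
    """Follow the query from the head node; return the last node reached and its depth."""
    node, depth = root, 0
    for ch in query:
        if ch not in node:
            break
        node = node[ch]
        depth += 1
    return node, depth


def names_below(node):
    """Collect every standard name stored at or below the given node."""
    stack, found = [node], []
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key == END:
                found.append(child)
            else:
                stack.append(child)
    return found


def match(root, query, threshold=0.5):
    """Return (standard_name, degree) for the best candidate, or None."""
    node, depth = walk(root, query)
    if depth == 0:
        return None
    best = None
    for name in names_below(node):
        degree = depth / max(len(query), len(name))  # assumed first matching degree
        if best is None or degree > best[1]:
            best = (name, degree)
    if best is not None and best[1] >= threshold:
        return best
    return None


icd_trie = build_trie(["高血压", "高血压病", "糖尿病"])
exact = match(icd_trie, "高血压")
partial = match(icd_trie, "高血压病史")
missing = match(icd_trie, "感冒")
```

A query whose first character is absent from the head node yields no candidate, which corresponds to the case where no first matching degree satisfies the preset condition.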
  16. A computer-readable storage medium, comprising a data storage area and a program storage area, the data storage area storing data created according to the use of blockchain nodes, and the program storage area storing a computer program, wherein the computer program includes program instructions which, when executed by a processor, cause the processor to perform the following steps:
    acquiring a target dictionary, a current diagnosis text, and a preset ICD standard disease name set, the preset ICD standard disease name set including a plurality of preset ICD standard disease names; performing word segmentation on the current diagnosis text based on the target dictionary to obtain a to-be-standardized disease name contained in the current diagnosis text; constructing a target dictionary tree based on the preset ICD standard disease name set; based on the target dictionary tree, matching the to-be-standardized disease name against the plurality of preset ICD standard disease names in the preset ICD standard disease name set to obtain a plurality of first matching degrees; and, when a target first matching degree satisfying a preset condition exists among the plurality of first matching degrees, acquiring the target preset ICD standard disease name corresponding to the target first matching degree, and determining the target preset ICD standard disease name as the conversion result of the to-be-standardized disease name.
  17. The medium according to claim 16, wherein the processor is further configured to: extract historical diagnosis text information from a historical case database; perform data cleaning on the historical diagnosis text information to obtain a historical disease name set; and perform data processing on the historical disease name set and the preset ICD standard disease name set to obtain the target dictionary.
  18. The medium according to claim 17, wherein the processor is further configured to: acquire a plurality of preset regular expressions for a plurality of preset disease names, wherein each preset disease name corresponds to one preset regular expression; match the historical diagnosis text information against each of the plurality of preset regular expressions to obtain a plurality of second matching degrees, each preset regular expression corresponding to one second matching degree; and determine at least one preset disease name corresponding to at least one second matching degree, among the plurality of second matching degrees, that exceeds a first preset threshold, and use the at least one preset disease name as the disease name set.
  19. The medium according to claim 17 or 18, wherein the processor is further specifically configured to: merge the disease name set with the preset ICD standard disease name set to obtain a first dictionary, the first dictionary including a plurality of first disease names; and deduplicate the plurality of first disease names to obtain the target dictionary.
  20. The medium according to claim 16, wherein the processor is further configured to: determine, based on the preset ICD standard disease name set, a first sequence corresponding to each of the plurality of preset ICD standard disease names in the preset ICD standard disease name set, to obtain a plurality of first sequences, wherein each first sequence includes at least one character; acquire a preset dictionary tree, the preset dictionary tree including a plurality of nodes; traverse the plurality of first sequences and match each first sequence against the plurality of nodes of the preset dictionary tree to obtain a plurality of third matching degrees; calculate the mean of the plurality of third matching degrees; if the mean is greater than a second preset threshold, leave the preset dictionary tree unchanged and use the preset dictionary tree as the target dictionary tree; and if the mean is less than or equal to the second preset threshold, update the preset dictionary tree to obtain the target dictionary tree.
PCT/CN2020/099487 2020-05-13 2020-06-30 Disease name standardization method, apparatus, device, and storage medium WO2021114632A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010401370.1A CN111696635A (en) 2020-05-13 2020-05-13 Disease name standardization method and device
CN202010401370.1 2020-05-13

Publications (1)

Publication Number Publication Date
WO2021114632A1 true WO2021114632A1 (en) 2021-06-17

Family

ID=72477704

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/099487 WO2021114632A1 (en) 2020-05-13 2020-06-30 Disease name standardization method, apparatus, device, and storage medium

Country Status (2)

Country Link
CN (1) CN111696635A (en)
WO (1) WO2021114632A1 (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112131339A (en) * 2020-09-28 2020-12-25 上海梅斯医药科技有限公司 Name standardization standard processing method, device, computer and storage medium
CN112527970B (en) * 2020-12-24 2022-11-15 上海浦东发展银行股份有限公司 Data dictionary standardization processing method, device, equipment and storage medium
CN112786206A (en) * 2021-01-28 2021-05-11 山东众阳健康科技集团有限公司 Data processing method and system for information standardization of medical institution
CN112836055A (en) * 2021-03-12 2021-05-25 云知声智能科技股份有限公司 Quantity prediction method and device for clinical term standardization
CN113987113B (en) * 2021-06-25 2023-09-22 四川大学 Multi-station naming service fusion method, device, storage medium and server
CN113823404A (en) * 2021-08-26 2021-12-21 山东健康医疗大数据有限公司 Medical big data-based method for standardizing medical terms for construction of specific diseases
CN114358001A (en) * 2021-11-16 2022-04-15 安徽科大讯飞医疗信息技术有限公司 Method for standardizing diagnosis result, and related device, equipment and storage medium thereof
CN114708603A (en) * 2022-05-25 2022-07-05 杭州咏柳科技有限公司 Method, system, device and medium for identifying key information in medical bill
CN116361517B (en) * 2023-05-29 2023-08-25 北京拓普丰联信息科技股份有限公司 Enterprise word size duplicate checking method, device, equipment and medium
CN116562271B (en) * 2023-07-10 2023-10-10 之江实验室 Quality control method and device for electronic medical record, storage medium and electronic equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106682411A (en) * 2016-12-22 2017-05-17 浙江大学 Method for converting physical examination diagnostic data into disease label
CN106951684A (en) * 2017-02-28 2017-07-14 北京大学 A kind of method of entity disambiguation in medical conditions idagnostic logout
US20170220549A1 (en) * 2016-01-28 2017-08-03 Fujitsu Limited Information processing apparatus and display method
CN110008473A (en) * 2019-04-01 2019-07-12 云知声(上海)智能科技有限公司 A kind of medical text name Entity recognition mask method based on alternative manner
CN111046882A (en) * 2019-12-05 2020-04-21 清华大学 Disease name standardization method and system based on profile hidden Markov model

Also Published As

Publication number Publication date
CN111696635A (en) 2020-09-22

Similar Documents

Publication Publication Date Title
WO2021114632A1 (en) Disease name standardization method, apparatus, device, and storage medium
US11782981B2 (en) Method, apparatus, server, and storage medium for incorporating structured entity
US11182682B2 (en) System for extracting semantic triples for building a knowledge base
CN112035511A (en) Target data searching method based on medical knowledge graph and related equipment
WO2021184729A1 (en) Drug classification method and apparatus, storage medium, and intelligent device
WO2021109787A1 (en) Synonym mining method, synonym dictionary application method, medical synonym mining method, medical synonym dictionary application method, synonym mining apparatus and storage medium
JP5603250B2 (en) Archive management method for approximate string matching
WO2022160454A1 (en) Medical literature retrieval method and apparatus, electronic device, and storage medium
CN113707297A (en) Medical data processing method, device, equipment and storage medium
WO2022222943A1 (en) Department recommendation method and apparatus, electronic device and storage medium
Dutta et al. Neighbor-aware search for approximate labeled graph matching using the chi-square statistics
WO2017113886A1 (en) Data cleaning method and device
CN108874956A (en) Mass file search method, device, computer equipment and storage medium
CN110008474A (en) A kind of key phrase determines method, apparatus, equipment and storage medium
WO2019136855A1 (en) Method and apparatus for implementing multidimensional analysis on insurance policy, terminal device, and storage medium
US20220358178A1 (en) Data query method, electronic device, and storage medium
US20220005546A1 (en) Non-redundant gene set clustering method and system, and electronic device
CN113672628A (en) Data blood margin analysis method, terminal device and medium
CN111460170A (en) Word recognition method and device, terminal equipment and storage medium
CN114358001A (en) Method for standardizing diagnosis result, and related device, equipment and storage medium thereof
WO2022227171A1 (en) Method and apparatus for extracting key information, electronic device, and medium
JP2009294967A (en) Computer system for performing intensive calculation for tree structure data, and its method and computer program
US7912703B2 (en) Unsupervised stemming schema learning and lexicon acquisition from corpora
CN116383412B (en) Functional point amplification method and system based on knowledge graph
CN111599487A (en) Traditional Chinese medicine compatibility assistant decision-making method based on correlation analysis

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20900249

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 15/03/2023)

122 Ep: pct application non-entry in european phase

Ref document number: 20900249

Country of ref document: EP

Kind code of ref document: A1