CN116127979B - Named entity name standardization method and device, electronic equipment and storage medium

Info

Publication number: CN116127979B
Application number: CN202310347505.4A
Authority: CN
Inventors: 赵周剑; 王永明; 刘荣兵
Original assignee: Zhejiang Taimei Medical Technology Co Ltd
Current assignee: Zhejiang Taimei Medical Technology Co Ltd
Priority date: 2023-04-04
Filing date: 2023-04-04
Publication date: 2023-09-19
Anticipated expiration: 2043-04-04
Also published as: CN116127979A

Abstract

The application discloses a method and a device for name standardization of named entities, an electronic device, and a storage medium. The method comprises: recalling initial named entity standard words from a standard word stock based on a named entity original word, wherein each initial named entity standard word has a first similarity with the named entity original word; predicting a predicted number of standard words corresponding to the named entity original word, wherein the predicted number is smaller than or equal to the number of recalled initial named entity standard words; and determining, based on the first similarity, the predicted number of named entity standard words from the recalled initial named entity standard words. The accuracy of name standardization of named entities is thereby improved.

Description

Named entity name standardization method and device, electronic equipment and storage medium

Technical Field

The application belongs to the technical field of computers, and particularly relates to a method and a device for name standardization of named entities, electronic equipment and a storage medium.

Background

With the continuous development of internet technology, natural language processing (such as text processing) has become an important direction in the fields of computer technology and artificial intelligence and has been widely used. Named entity recognition (Named Entity Recognition, NER) is a basic task underlying other natural language processing tasks; it refers to recognizing words with entity meaning from text (such as extracting person names, place names, organization names, and the like from sentences). The recognized named entities are the basis of various subsequent application scenarios, and for this reason, how to ensure the accuracy of name standardization of the recognized named entities has become a research hot spot.

The information disclosed in this background section is only for enhancement of understanding of the general background of the application and should not be taken as an acknowledgement or any form of suggestion that this information forms the prior art already known to a person of ordinary skill in the art.

Disclosure of Invention

The application aims to provide a method for standardizing names of named entities, which is used for solving the problem of low accuracy of the standardization of the names of the named entities in the prior art.

To achieve the above object, the present application provides a method for name standardization of named entities, the method comprising:

recalling initial named entity standard words from a standard word stock based on a named entity primitive word, wherein each initial named entity standard word has a first similarity with the named entity primitive word;

predicting a predicted number of standard words corresponding to the named entity primitive word, wherein the predicted number is smaller than or equal to the number of recalled initial named entity standard words;

and determining a predicted number of named entity standard words from the recalled initial named entity standard words based on the first similarity.

In one embodiment, the number of predictions is greater than or equal to 2, and the number of predictions is less than the number of recalled initial named entity standard words;

the method further comprises the steps of:

determining a plurality of standard phrase groups based on the recalled initial named entity standard words, wherein the standard phrase groups respectively comprise the initial named entity standard words with predicted quantity;

respectively calculating the second similarity of the standard phrase and the named entity primitive word;

determining a target standard phrase of the named entity primitive word from the standard phrases based on the first similarity and the second similarity;

and determining the initial named entity standard words in the target standard phrase as named entity standard words.

In an embodiment, the calculating the second similarity between the standard phrase and the named entity primitive word specifically includes:

acquiring a character intersection of the standard phrase and a named entity primitive word;

and calculating the second similarity based on the character intersection and the number of characters of the named entity primitive word.

In one embodiment, the calculating the second similarity based on the character intersection and the number of characters of the named entity primitive word specifically includes:

and determining the character quantity ratio of the character intersection and the named entity primitive word as the second similarity.

In an embodiment, determining, based on the first similarity and the second similarity, the target standard phrase of the named entity primitive word from the standard phrases specifically includes:

based on the first similarity and the second similarity, calculating the comprehensive similarity of the standard phrase and the named entity primitive word;

and determining the standard phrase with the highest comprehensive similarity as the target standard phrase.

In one embodiment, when the predicted number is equal to 1, the method specifically includes:

determining the initial named entity standard word with the highest first similarity as the named entity standard word;

and/or,

before recalling the initial named entity standard word from the standard word stock based on the named entity primitive word, the method further comprises:

and preprocessing the original words of the named entities, wherein the preprocessing comprises at least one of character normalization, special symbol cleaning and idiomatic abbreviations.

In one embodiment, the named entity name is a disease name and the standard word stock includes an ICD10 diagnostic code stock.

The application also provides a device for name standardization of named entities, which comprises:

the recall module is used for recalling initial named entity standard words from the standard word stock based on named entity original words, wherein the initial named entity standard words have first similarity with the named entity original words;

the prediction module is used for predicting the prediction quantity of standard words corresponding to the named entity primitive words, wherein the prediction quantity is smaller than or equal to the initial named entity standard word quantity recalled;

and the determining module is used for determining the predicted number of named entity standard words from the recalled initial named entity standard words based on the first similarity.

The present application also provides an electronic device including:

at least one processor; and

a memory storing instructions that, when executed by the at least one processor, cause the at least one processor to perform the method for name standardization of named entities as described above.

The present application also provides a machine-readable storage medium storing executable instructions that, when executed, cause the machine to perform the method for name standardization of named entities as described above.

Compared with the prior art, the method for name standardization of named entities predicts the predicted number of standard words corresponding to the named entity original word and determines the predicted number of named entity standard words based on the first similarity between the recalled initial named entity standard words and the named entity original word, thereby improving the accuracy of name standardization of named entities.

On the other hand, the recalled initial named entity standard words are combined into standard phrases based on the predicted number, the second similarity between each standard phrase and the named entity original word is calculated based on character coverage, and finally the first similarity and the second similarity are considered together to determine the standard phrase with the highest similarity to the named entity original word. The accuracy of name standardization is thus further improved for the situation in which one named entity original word may correspond to a plurality of named entity standard words.

Drawings

FIG. 1 is a schematic diagram of an application scenario of a method for naming entity name normalization according to an embodiment of the present application;

FIG. 2 is a flow chart of a method of naming entity name normalization according to one embodiment of the present application;

FIG. 3 is a flowchart of determining a named entity standard word based on a first similarity and a second similarity when a predicted number is 2 or more in a method for name normalization of named entities according to an embodiment of the present application;

FIG. 4 is a block diagram of an apparatus for naming entity name normalization according to one embodiment of the present application;

FIG. 5 is a hardware configuration diagram of an electronic device according to an embodiment of the present application.

Detailed Description

The present application will be described in detail below with reference to the embodiments shown in the drawings. The embodiments are not intended to limit the application, but structural, methodological, or functional modifications of the application from those skilled in the art are included within the scope of the application.

The terms "first," "second," "third," "fourth" and the like in the description and in the claims and in the above drawings, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented, for example, in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "includes" and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus.

Artificial intelligence technology is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies. Artificial intelligence infrastructure technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.

Natural language processing (Natural Language Processing, NLP) is an important direction in the fields of computer science and artificial intelligence. It studies the theories and methods that enable effective communication between a person and a computer in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics. Research in this field involves natural language, i.e. the language that people use daily, so it is closely related to research in linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graph techniques, and the like.

As artificial intelligence technology is researched and advanced, it is studied and applied in a variety of fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, autonomous driving, drones, robots, smart medical care, smart customer service, and the like. In addition, artificial intelligence technology can be applied in other fields. For example, in the embodiments of the present application, non-standardized named entity names are converted into names meeting the desired standard by combining means such as natural language processing and machine learning/deep learning, that is, name standardization of named entities is realized.

Taking clinical trials of drugs as an example, a clinical trial refers to a systematic study of a drug in humans to determine the efficacy and safety of the drug. Clinical trials of drugs are divided into phase I, phase II, phase III, and phase IV clinical trials. Phase I mainly involves preliminary clinical pharmacology and human safety evaluation trials. Phase II can be understood as a preliminary evaluation phase of the therapeutic effect; it mainly involves the preliminary evaluation of the therapeutic effect and safety of the drug in patients with the target indication, and also provides a basis for the design of phase III clinical trial studies and the determination of the dosing regimen. Phase III can be understood as a therapeutic effect confirmation phase; it is mainly used to further verify the therapeutic effect and safety of the drug in patients with the target indication, to evaluate the relationship between benefits and risks, and finally to provide a sufficient basis for the review of the drug registration application. Phase IV mainly refers to clinical trials after the drug is marketed, in which the efficacy and adverse reactions of the drug under conditions of wide use are continuously tracked in order to evaluate the relationship between benefits and risks in general or special populations and to improve dosing and the like.

Patient recruitment means that the sponsor of a clinical trial study (a pharmaceutical company) entrusts a hospital or a clinical trial service provider to recruit suitable patients to participate in the clinical trial study in various ways. Generally, patients who meet the requirements of a clinical trial research project can be screened by collecting disease information such as the patients' latest discharge records, medical record reports, CT/MRI images, and gene reports, and the patients are then recommended to the corresponding trial for hospital visits.

In the above business scenario, electronic medical record data written by different hospitals and doctors needs to be processed, and the same recognized disease entity (namely a named entity) may be written in many different ways. The problem to be solved by disease name standardization is to map the differently worded disease original words recognized from medical records to the specified standard words; in addition, one disease original word may often correspond to a plurality of standard words. Therefore, the application is expected to provide a method for name standardization of named entities that can find the one or more named entity standard words to which one named entity original word may correspond, thereby improving the accuracy of name standardization of named entities.

Referring to FIG. 1, in one example implementation scenario, a server is in network connection with a terminal, through which a user can upload a series of medical record data (e.g., admission diagnosis records, medical examination reports, discharge summaries, etc.) to the server. The server can recognize named entity original words (such as disease name original words) through various technologies such as optical character recognition (Optical Character Recognition, OCR), and run the method for name standardization of named entities provided by the embodiment of the application to determine the named entity standard words corresponding to the named entity original words.

Alternatively, in other implementation scenarios, the method for normalizing the name of the named entity provided in this embodiment may also be operated by the server and the terminal together. For example, after obtaining the medical record data, the terminal recalls the initial named entity standard words based on the locally deployed standard word library, and the server predicts the number of corresponding standard words based on the named entity original words, so as to determine named entity standard words in the recalled initial named entity standard words.

In the above implementation environment, the terminal and the server perform data communication through a communication network, alternatively, the communication network may be a wired network or a wireless network, and the communication network may be at least one of a local area network, a metropolitan area network, and a wide area network. The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), basic cloud computing services such as big data and artificial intelligent platforms, and the like. The server and the terminal device may be independent devices, or may be integrated in the same system, which is not limited herein.

Based on the above description of the implementation environment, it can be seen that the method for name standardization of named entities according to the embodiments of the present application may be performed by any suitable computer device (terminal or server); alternatively, the method may be performed jointly by the terminal and the server. For convenience of explanation, the method is described hereinafter as being performed by a computer device.

Referring to FIG. 2, an embodiment of a method for name normalization of named entities according to the present application is described. In this embodiment, the method includes:

S11, recalling the initial named entity standard word from the standard word library based on the named entity original word.

It should be noted that the embodiment of the application provides a model architecture for name standardization of named entities, which can be used as a basic module in a series of services, such as an intelligent recruitment system. Thus, the named entity primitive word may be obtained from some preceding services, and of course may also be obtained by recognizing a target document when the current service is executed.

The target document may be acquired in a variety of ways. For example, the computer device may obtain a download link of the target document and then download the target document according to the link. Alternatively, if one or more documents are stored in the storage space of the computer device itself, the computer device may select at least one document from the stored documents and use the selected document as the target document. As another example, the user may transfer the target document to the computer device via a removable storage medium; the application is not limited in this regard.

The selection of the standard word stock depends on the named entity primitive word; in general, an appropriate standard word stock can be selected for the named entity primitive word under different application scenarios. A standard word library is a collection of standard words in a corresponding field, and one standard word library generally corresponds to one type of standard word in one field. Examples of standard word libraries in the medical field include the ICD10 diagnosis code library, the ICD9-CM operation code library, and the drug ATC code library; a medical institution can also build a standard word library as required using technologies such as knowledge graphs.

Illustratively, one named entity primitive word may correspond to a plurality of named entity standard words. For example, the named entity primitive word "right lung nodule, metastasis highly probable" corresponds to the three named entity standard words "lung space-occupying lesion", "secondary malignant tumor of the lung", and "metastatic tumor".

In one embodiment, a series of preprocessing steps may be performed on the named entity primitive word with the goal of improving the accuracy and coverage of recall. The preprocessing that may be employed includes character normalization, special symbol cleaning, idiom abbreviation, and the like.

For example, character normalization may include converting full-width symbols in a named entity primitive word into half-width symbols, normalizing numerals (unifying Arabic numerals, Chinese numerals, Roman numerals, and the like into Arabic numerals), removing codes from the input, converting Greek letters into English letters, and so on. Special symbol cleaning may modify English text appearing in a named entity primitive word into a standard name, such as modifying "he" in a named entity primitive word into "human epididymal protein". Idiom abbreviation may shorten non-essential idiomatic wording in some named entity primitive words, such as the wording of the named entity primitive word "unspecified cholera". The specific manner and steps of the preprocessing are not limited in the application.
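The character normalization described above can be sketched as follows. This is a minimal illustration, not the application's implementation: the function name `preprocess`, the two small mapping tables, and the rule of stripping every non-word symbol are assumptions made for the example.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny mapping tables (assumptions for illustration).
ROMAN_TO_ARABIC = str.maketrans({"Ⅰ": "1", "Ⅱ": "2", "Ⅲ": "3", "Ⅳ": "4"})
GREEK_TO_LATIN = str.maketrans({"α": "a", "β": "b", "γ": "g"})


def preprocess(original_word: str) -> str:
    """Normalize a named entity original word before recall."""
    # Map Roman numeral code points first, because NFKC would expand "Ⅱ" to "II".
    text = original_word.translate(ROMAN_TO_ARABIC)
    # NFKC folds full-width letters, digits, and punctuation to half-width forms.
    text = unicodedata.normalize("NFKC", text)
    # Convert Greek letters to English letters.
    text = text.translate(GREEK_TO_LATIN)
    # Assumed cleaning rule: drop anything that is neither a word character nor whitespace.
    text = re.sub(r"[^\w\s]", "", text)
    # Collapse runs of whitespace left behind by the cleaning step.
    return re.sub(r"\s+", " ", text).strip()
```

A production table would also need to cover Chinese numerals and domain abbreviations, which this sketch leaves out.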

In a specific recall process, the similarity between each standard word in the standard word bank and the named entity original word can be calculated based on a similarity algorithm; for example, the standard words whose similarity values are larger than a preset similarity threshold, or a specified number of the most similar standard words, are recalled as the initial named entity standard words.

For example, feature extraction can be performed on the named entity primitive word and the standard words in the standard word stock, features can be constructed based on semantics and/or glyphs, and the semantic and/or glyph similarity between the named entity primitive word and the standard words in the standard word stock can then be calculated based on algorithms such as edit distance and cosine similarity. The edit distance refers to the minimum number of edits required to convert word A into word B, where the allowable editing operations may include "replace", "insert", "delete", and the like. For example, if word A and word B are two written forms of "adverse event" that differ by a single character, converting word A to word B requires one operation, namely deleting that character, so the edit distance between word A and word B is 1. In general, the smaller the edit distance, the more similar the two words are; conversely, the greater the edit distance, the less similar the two words are. The cosine similarity algorithm calculates the cosine of the angle between the text representations (word vectors) corresponding to two words; the closer the cosine value is to 1, the closer the angle is to 0 degrees, and the more similar the words corresponding to the two text representations are.
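The two similarity measures named above can be sketched in a few lines. The sketch below uses the standard dynamic-programming form of edit distance and a plain-list cosine similarity; the function names are chosen for the example, and the vectors are assumed to be already computed by some embedding model.

```python
import math


def edit_distance(a: str, b: str) -> int:
    """Minimum number of replace/insert/delete operations turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # replace (or keep) ca
        prev = curr
    return prev[-1]


def cosine_similarity(u, v) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for zero vectors, chosen for this sketch
    return dot / (norm_u * norm_v)
```

For instance, `edit_distance("adverse events", "adverse event")` is 1, since deleting one character suffices.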

In one embodiment, taking a disease original word as an example of the named entity primitive word, the standard words of a standard word library such as ICD-10 can be written directly into Elasticsearch with an index created, and the similarity between the disease original word and the standard words in ICD-10 is calculated through the BM25 algorithm for recall.
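To make the BM25-based recall concrete without depending on a running Elasticsearch cluster, the following is a self-contained sketch of BM25 scoring in plain Python. Character-level tokenization, the function name `bm25_recall`, and the parameter defaults `k1=1.2` and `b=0.75` are assumptions for illustration; an actual deployment would rely on the search engine's own analyzer and scoring.

```python
import math
from collections import Counter


def bm25_recall(query, standard_words, top_n=10, k1=1.2, b=0.75):
    """Return up to top_n (standard_word, score) pairs ranked by BM25."""
    if not standard_words:
        return []
    docs = [list(word) for word in standard_words]  # character-level tokens
    avg_len = sum(len(d) for d in docs) / len(docs)
    doc_freq = Counter(ch for d in docs for ch in set(d))
    n_docs = len(docs)

    scored = []
    for word, doc in zip(standard_words, docs):
        tf = Counter(doc)
        score = 0.0
        for ch in set(query):
            if ch not in tf:
                continue
            # Inverse document frequency with +1 inside the log to stay non-negative.
            idf = math.log(1 + (n_docs - doc_freq[ch] + 0.5) / (doc_freq[ch] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[ch] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[ch] * (k1 + 1) / norm
        if score > 0:
            scored.append((word, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]
```

The returned scores could serve directly as the first similarity, or the recalled words could be re-scored by a separate model as described next.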

The recalled initial named entity standard word has a first similarity with the named entity original word, the first similarity can be determined based on the similarity algorithm in the recall process, or can be obtained by scoring by using a constructed scoring model after recalling the initial named entity standard word, and the scoring result is taken as the first similarity, which is not limited by the application.

For example, the determination of the first similarity can be treated as a scoring task: the named entity primitive word $d_{source}$ and each initial named entity standard word $d_{target\_i}$ are paired to obtain $[d_{source}, d_{target\_i}]_n$, which is used for training with a language model such as Bidirectional Encoder Representations from Transformers (BERT). After the pre-trained BERT model is obtained, the model is fine-tuned (Fine-Tuning) into a model for similarity scoring, which is then used to score the first similarity of the initial named entity standard words.

S12, predicting the number of the predicted standard words corresponding to the named entity primitive words.

In one embodiment, the computer device may perform the number prediction by constructing a number prediction model. For example, the number prediction is treated as a multi-class classification task, and a language model such as Bidirectional Encoder Representations from Transformers (BERT) is used for training; after the pre-trained BERT model is obtained, the model is fine-tuned (Fine-Tuning) into a model for number classification, which is then used to generate the predicted number.

In the model training process, the predicted number can be set to be less than or equal to the number of the initial named entity standard words recalled. For example, in the model training process described above, the maximum predicted number may be set to 5, and the predicted number labels of the model include "1", "2", "3", "4", "5"; correspondingly, when standard words are recalled from the standard word library, the number of recalls can be set to be 10; for another example, where the maximum predicted number of models is set to 5, when recall standard words from a standard word stock, a suitable similarity threshold may be empirically determined such that the number of standard words recalled can be greater than 5.

It should be noted that in some embodiments, even if the predicted number is greater than the number of recalled initial named entity standard words, it does not mean that the computer device cannot continue to perform normalization of named entity names. In this case, the computer device may also use all of the recalled initial named entity standard words as named entity standard words corresponding to the named entity primitive words.

S13, determining the predicted number of named entity standard words from the recalled initial named entity standard words based on the first similarity.

(1) In one case, the predicted number of standard words corresponding to the named entity primitive word in step S12 is 1, that is, one named entity standard word needs to be determined from the recalled initial named entity standard words.

In this case, the computer device may determine the initial named entity standard word with the highest first similarity as the named entity standard word corresponding to the named entity primitive word. In a specific implementation, the computer device may sort the recalled initial named entity standard words according to the first similarity, for example from large to small, to obtain a sorted similarity score set $S_{sort} = [s_{bert\_sim\_1}, s_{bert\_sim\_2}, ..., s_{bert\_sim\_n}]$; by indexing $s_{bert\_sim\_1}$ in the set, the corresponding initial named entity standard word with the highest first similarity is obtained.

(2) In another case, the predicted number of standard words corresponding to the named entity primitive word in step S12 is greater than or equal to 2, that is, at least two named entity standard words need to be determined from the recalled initial named entity standard words.

Similarly, in some embodiments, the computer device may select, according to the predicted number, the corresponding initial named entity standard words in order of first similarity from high to low as the named entity standard words. For example, when the predicted number is 3, the predicted number of named entity standard words can be obtained by indexing the initial named entity standard words corresponding to $s_{bert\_sim\_1}$, $s_{bert\_sim\_2}$, and $s_{bert\_sim\_3}$ in the set $S_{sort}$.
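The ranking-based selection in cases (1) and (2) can be sketched as a single function. The name `select_top_k` and the input shape (a list of `(standard_word, first_similarity)` pairs) are assumptions for the example; the clamp to the number of recalled words reflects the fallback described earlier, where all recalled words are used if the predicted number exceeds the number recalled.

```python
def select_top_k(recalled, predicted_number):
    """Pick the predicted number of standard words by descending first similarity.

    recalled: list of (standard_word, first_similarity) pairs.
    """
    # Sort from large to small, mirroring the sorted score set S_sort.
    ranked = sorted(recalled, key=lambda pair: pair[1], reverse=True)
    # If more words are predicted than were recalled, fall back to all of them.
    k = min(predicted_number, len(ranked))
    return [word for word, _ in ranked[:k]]
```

With a predicted number of 1 this reduces to taking the single highest-scoring word.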

It can be seen that in the above embodiment, the first similarity of each initial named entity-standard word is considered separately, and the named entity-standard word is determined based on this. However, such consideration may be insufficient. Therefore, the embodiment of the application proposes that not only the first similarity of each initial named entity standard word is considered, but also the influence of the comprehensive similarity of a group of initial named entity standard words can be considered in combination to determine a group of named entity standard words corresponding to the named entity original words.

Referring to FIG. 3, the computer device first determines a plurality of standard phrases based on the recalled initial named entity standard words, wherein each standard phrase comprises the predicted number of initial named entity standard words. For example, suppose 5 initial named entity standard words are recalled, expressed as the set A = [A1, A2, A3, A4, A5], and the predicted number is 2; the combined standard phrases then number 10, namely [A1, A2], [A1, A3], [A1, A4], [A1, A5], [A2, A3], [A2, A4], [A2, A5], [A3, A4], [A3, A5], and [A4, A5].
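The enumeration of standard phrases is an ordinary combination problem, so a sketch needs only the standard library. The function name `build_standard_phrases` is an assumption for the example.

```python
from itertools import combinations


def build_standard_phrases(recalled_words, predicted_number):
    """Enumerate every group of predicted_number recalled standard words."""
    return [list(group) for group in combinations(recalled_words, predicted_number)]
```

With 5 recalled words and a predicted number of 2 this yields C(5, 2) = 10 phrases, matching the example above. The count grows quickly with the number of recalled words, which is one practical reason to cap the recall size.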

Then, the computer device may calculate the second similarity between each standard phrase and the named entity primitive word, where the second similarity may represent the character coverage of the named entity primitive word by the initial named entity standard words in the standard phrase. For example, the named entity primitive word is "right lung nodule, metastasis highly probable" (nine characters in the original text), and one standard phrase comprises the three standard words "lung space-occupying lesion", "secondary malignant tumor of the lung", and "metastatic tumor". Three characters of the named entity primitive word (the character for "lung" and the two characters forming "metastasis") can be matched in the standard phrase, so the character coverage of the named entity primitive word by the standard phrase is 3/9.

In one embodiment, the computer device may obtain a character intersection of the standard phrase and the named entity primitive word, and calculate the second similarity based on the character intersection and the number of characters of the named entity primitive word. For example, the character number ratio of the character intersection to the named entity primitive word is determined as the second similarity.

The second similarity can be calculated by the formula:

$$S_2 = \frac{n[U \cap Q]}{n[Q]}$$

wherein $U$ represents the union of the initial named entity standard words in the standard phrase, $Q$ represents the named entity primitive word, $U \cap Q$ represents the character intersection of the aforementioned union and the named entity primitive word, $n[\cdot]$ is the function giving the number of characters, and $n[Q]$ represents the number of characters of the named entity primitive word.
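A minimal sketch of the character-coverage calculation is given below. It assumes that characters are counted as distinct characters (both words are treated as character sets); the application does not state how repeated characters are handled, so this is one possible reading. The function name `second_similarity` and the toy strings are hypothetical.

```python
def second_similarity(standard_phrase, original_word):
    """Share of the original word's distinct characters found in the standard phrase."""
    original_chars = set(original_word)
    if not original_chars:
        return 0.0
    # Union of the characters of every standard word in the phrase.
    phrase_chars = set().union(*standard_phrase)
    covered = original_chars & phrase_chars
    return len(covered) / len(original_chars)


# Toy data: a 9-character original word of which 3 characters are covered,
# mirroring the 3/9 coverage of the example above.
print(second_similarity(["axx", "byy", "czz"], "abcdefghi"))
```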

The computer device may then determine a target standard phrase of the named entity primitive word from the standard phrases based on the first similarity and the second similarity, and determine an initial named entity standard word in the target standard phrase as a named entity standard word. Specifically, the comprehensive similarity between the standard phrase and the named entity primitive word can be calculated based on the first similarity and the second similarity, and then the standard phrase with the highest comprehensive similarity is determined as the target standard phrase.

For example, the three standard words "lung space-occupying lesion", "lung-activated malignancy" and "metastatic tumor" in the standard phrase have first similarities to the named entity primitive word of $S_{1,1}$, $S_{1,2}$ and $S_{1,3}$ respectively, and the second similarity between the standard phrase and the named entity primitive word is $S_2$.

In one embodiment, the average of the first similarities of the three standard words may be calculated and compared with the second similarity, and the larger of the two may be taken as the integrated similarity.

In one embodiment, the integrated similarity may be obtained by calculating the sum of the first similarities of the three standard words and the second similarity of the standard phrase: $S = S_{1,1} + S_{1,2} + S_{1,3} + S_2$.

In an embodiment, the first similarity and the second similarity may be weighted, and the weighted sum of the first similarities of the three standard words and the second similarity of the standard phrase is calculated to obtain the integrated similarity $S = K_1 (S_{1,1} + S_{1,2} + S_{1,3}) + K_2 S_2$, wherein $K_1$ and $K_2$ are the weights corresponding to the first similarity and the second similarity, respectively.
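The three ways of combining the similarities described above can be sketched as follows. The function names and all numeric values are hypothetical and chosen only for illustration; in particular, the application does not give values for the weights.

```python
from statistics import mean


def integrated_max(first_sims, second_sim):
    """Larger of the average first similarity and the second similarity."""
    return max(mean(first_sims), second_sim)


def integrated_sum(first_sims, second_sim):
    """Plain sum of the first similarities and the second similarity."""
    return sum(first_sims) + second_sim


def integrated_weighted(first_sims, second_sim, k1, k2):
    """Weighted sum, with k1 applied to the first similarities and k2 to the second."""
    return k1 * sum(first_sims) + k2 * second_sim


first_sims = [0.8, 0.7, 0.6]  # hypothetical first similarities of three standard words
second_sim = 0.5              # hypothetical second similarity of the phrase

print(integrated_max(first_sims, second_sim))
print(integrated_sum(first_sims, second_sim))
print(integrated_weighted(first_sims, second_sim, k1=0.5, k2=2.0))
```

Because the plain sum grows with the number of words in a phrase while the second similarity stays within [0, 1], the weights are one way to keep either term from dominating.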

In the following, a scenario is described in which the named entity standard words are determined using the integrated similarity, to illustrate how the result differs from that obtained using only the first similarity.

In this scenario, the named entity primitive word is "lung metastasis space-occupying tumor", the recalled initial named entity standard words comprise "lung space-occupying lesion", "lung-activated malignancy", "metastatic tumor" and "tuberculosis", the corresponding first similarities are 0.8, 0.7 and 0.9 respectively, and the predicted number is 3. In this case, if the standard words are simply ranked by first similarity, the named entity standard words would be [tuberculosis, lung space-occupying lesion, lung-activated malignancy].

If the second similarity is considered, the character coverage of the named entity primitive word "lung metastasis space-occupying tumor" by the standard phrase [lung space-occupying lesion, lung-activated malignancy, metastatic tumor] is 100%. By contrast, although the standard word "tuberculosis" has the highest first similarity (0.9), no combination of it with any two of the remaining standard words can cover 100% of the named entity primitive word "lung metastasis space-occupying tumor", and thus the second similarity of any standard phrase containing "tuberculosis" is lower.

It can be seen that, in this scenario, by setting a suitable comprehensive similarity calculation method, the computer device can screen out the standard phrases containing "tuberculosis", and thereby determine the standard phrase [lung space-occupying lesion, lung-activated malignancy, metastatic tumor], which has higher standardization accuracy.

Referring to fig. 4, an embodiment of the apparatus for name normalization of named entities of the present application is described. In this embodiment, the apparatus for name normalization of named entities includes a recall module 21, a prediction module 22, and a determination module 23.

The recall module 21 is configured to recall initial named entity standard words from a standard word stock based on a named entity primitive word, where the initial named entity standard words have a first similarity to the named entity primitive word; the prediction module 22 is configured to predict a predicted number of standard words corresponding to the named entity primitive word, where the predicted number is less than or equal to the number of recalled initial named entity standard words; the determining module 23 is configured to determine, based on the first similarity, the predicted number of named entity standard words from the recalled initial named entity standard words.

In one embodiment, the predicted number is greater than or equal to 2, and the predicted number is less than the number of recalled initial named entity standard words; the determining module 23 is further configured to: determine a plurality of standard phrases based on the recalled initial named entity standard words, where each standard phrase includes the predicted number of initial named entity standard words; calculate the second similarity between each standard phrase and the named entity primitive word; determine a target standard phrase of the named entity primitive word from the standard phrases based on the first similarity and the second similarity; and determine the initial named entity standard words in the target standard phrase as the named entity standard words.

In one embodiment, the determining module 23 is specifically configured to obtain a character intersection of the standard phrase and the named entity primitive word; and calculating the second similarity based on the character intersection and the number of characters of the named entity primitive word.

In one embodiment, the determining module 23 is specifically configured to determine the character quantity ratio of the character intersection to the named entity primitive word as the second similarity.

In one embodiment, the determining module 23 is specifically configured to calculate, based on the first similarity and the second similarity, a comprehensive similarity between the standard phrase and the named entity primitive word; and determining the standard phrase with the highest comprehensive similarity as the target standard phrase.

In an embodiment, when the predicted number is equal to 1, the determining module 23 is specifically configured to determine the initial named entity standard word with the highest first similarity as the named entity standard word.

In one embodiment, the recall module 21 is further configured to pre-process the named entity primitive word before recalling the initial named entity standard words from the standard word stock based on the named entity primitive word, wherein the pre-processing includes at least one of character normalization, special symbol cleaning, and processing of idiomatic abbreviations.
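The pre-processing named above might look like the following sketch. The application does not specify how each step is carried out, so every choice here is an assumption: Unicode NFKC is used for character normalization, a regular expression removes punctuation as special symbol cleaning, and abbreviations are expanded by substring replacement from a hypothetical lookup table `ABBREVIATIONS`.

```python
import re
import unicodedata

# Hypothetical abbreviation table; a real system would maintain its own.
ABBREVIATIONS = {"CA": "carcinoma"}


def preprocess(original_word, abbreviations=ABBREVIATIONS):
    """Normalize characters, strip special symbols, and expand abbreviations."""
    # Character normalization: e.g. full-width letters and spaces become half-width.
    text = unicodedata.normalize("NFKC", original_word)
    # Special symbol cleaning: drop everything that is not a word character or whitespace.
    text = re.sub(r"[^\w\s]", "", text)
    # Abbreviation handling: naive substring replacement (may also hit parts of longer words).
    for abbreviation, expansion in abbreviations.items():
        text = text.replace(abbreviation, expansion)
    return text.strip()


print(preprocess("ＣＡ　of lung！！"))  # carcinoma of lung
```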

The method for name normalization of named entities according to an embodiment of the present specification is described above with reference to fig. 1 to 3. The details mentioned in the description of the method embodiments above apply equally to the apparatus for name normalization of named entities of the embodiments of the present description. The above apparatus for name normalization of named entities may be implemented in hardware, in software, or in a combination of hardware and software.

Fig. 5 shows a hardware configuration diagram of an electronic device according to an embodiment of the present specification. As shown in fig. 5, the electronic device 30 may include at least one processor 31, a memory 32 (e.g., a non-volatile memory), a memory 33, and a communication interface 34, and the at least one processor 31, the memory 32, the memory 33, and the communication interface 34 are connected together via a bus 35. The at least one processor 31 executes at least one computer readable instruction stored or encoded in the memory 32.

It should be understood that the computer-executable instructions stored in the memory 32, when executed, cause the at least one processor 31 to perform the various operations and functions described above in connection with fig. 1-3 in various embodiments of the present description.

In embodiments of the present description, electronic device 30 may include, but is not limited to: personal computers, server computers, workstations, desktop computers, laptop computers, notebook computers, mobile electronic devices, smart phones, tablet computers, cellular phones, Personal Digital Assistants (PDAs), handsets, messaging devices, wearable electronic devices, consumer electronic devices, and the like.

According to one embodiment, a program product, such as a machine-readable medium, is provided. The machine-readable medium may have instructions (i.e., elements described above implemented in software) that, when executed by a machine, cause the machine to perform the various operations and functions described above in connection with fig. 1-3 in various embodiments of the specification. In particular, a system or apparatus provided with a readable storage medium having stored thereon software program code implementing the functions of any of the above embodiments may be provided, and a computer or processor of the system or apparatus may be caused to read out and execute instructions stored in the readable storage medium.

In this case, the program code itself read from the readable medium may implement the functions of any of the above embodiments, and thus the machine-readable code and the readable storage medium storing the machine-readable code form part of the present specification.

Examples of readable storage media include floppy disks, hard disks, magneto-optical disks, optical disks (e.g., CD-ROMs, CD-R, CD-RWs, DVD-ROMs, DVD-RAMs, DVD-RWs), magnetic tapes, nonvolatile memory cards, and ROMs. Alternatively, the program code may be downloaded from a server computer or the cloud over a communications network.

It will be appreciated by those skilled in the art that various changes and modifications can be made to the embodiments disclosed above without departing from the spirit of the application. Accordingly, the scope of protection of this specification should be limited by the attached claims.

It should be noted that not all the steps and units in the above flowcharts and the system configuration diagrams are necessary, and some steps or units may be omitted according to actual needs. The order of execution of the steps is not fixed and may be determined as desired. The apparatus structures described in the above embodiments may be physical structures or logical structures, that is, some units may be implemented by the same physical client, or some units may be implemented by multiple physical clients, or may be implemented jointly by some components in multiple independent devices.

In the above embodiments, the hardware units or modules may be implemented mechanically or electrically. For example, a hardware unit, module or processor may include permanently dedicated circuitry or logic (e.g., a dedicated processor, FPGA or ASIC) to perform the corresponding operations. The hardware unit or processor may also include programmable logic or circuitry (e.g., a general purpose processor or other programmable processor) that may be temporarily configured by software to perform the corresponding operations. The particular implementation (mechanical, or dedicated permanent, or temporarily set) may be determined based on cost and time considerations.

The detailed description set forth above in connection with the appended drawings describes exemplary embodiments, but does not represent all embodiments that may be implemented or fall within the scope of the claims. The term "exemplary" used throughout this specification means "serving as an example, instance, or illustration," and does not mean "preferred" or "advantageous over other embodiments." The detailed description includes specific details for the purpose of providing an understanding of the described technology. However, the techniques may be practiced without these specific details. In some instances, well-known structures and devices are shown in block diagram form in order to avoid obscuring the concepts of the described embodiments.

The previous description of the disclosure is provided to enable any person skilled in the art to make or use the disclosure. Various modifications to the disclosure will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other variations without departing from the scope of the disclosure. Thus, the disclosure is not limited to the examples and designs described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A method of name normalization of named entities, the method comprising:

predicting a predicted number of standard words corresponding to the named entity primitive word, wherein the predicted number is greater than or equal to 2, and the predicted number is less than the number of recalled initial named entity standard words;

respectively calculating second similarity of the standard phrase and the named entity original word, wherein the second similarity represents character coverage of the initial named entity standard word in the standard phrase to the named entity original word;

2. The method for normalizing named entity names according to claim 1, wherein the calculating the second similarity between the standard phrase and the named entity primitive word comprises:

3. The method according to claim 2, wherein calculating the second similarity based on the character intersection and the number of characters of the named entity primitive word, comprises:

4. The method for normalizing names of named entities according to claim 1, wherein determining the target standard phrase of the named entity primitive word from the standard phrases based on the first similarity and the second similarity specifically comprises:

5. The method according to claim 1, characterized in that, when said predicted number is equal to 1, it comprises in particular:

and/or,

6. The method of claim 1, wherein the named entity name is a disease name and the standard thesaurus comprises an ICD10 diagnostic code library.

7. An apparatus for name normalization of named entities, comprising:

the prediction module is used for predicting a predicted number of standard words corresponding to the named entity primitive word, wherein the predicted number is greater than or equal to 2, and the predicted number is less than the number of recalled initial named entity standard words;

the determining module is used for determining a plurality of standard phrase groups based on the recalled initial named entity standard words, wherein the standard phrase groups respectively comprise the initial named entity standard words with the predicted quantity; respectively calculating second similarity of the standard phrase and the named entity original word, wherein the second similarity represents character coverage of the initial named entity standard word in the standard phrase to the named entity original word; determining a target standard phrase of the named entity primitive word from the standard phrases based on the first similarity and the second similarity; and determining the initial named entity standard words in the target standard phrase as named entity standard words.

8. An electronic device, comprising:

at least one processor; and

a memory storing instructions that, when executed by the at least one processor, cause the at least one processor to perform the method of name normalization of named entities of any one of claims 1 to 6.

9. A machine-readable storage medium storing executable instructions that, when executed, cause the machine to perform the method of name normalization of named entities of any one of claims 1 to 6.